Check out this Twitter spike around the time that Michael Jackson died courtesy of our internal Spinn3r stats:
Apparently, I’m not the only person who noticed this:
As the news of Michael Jackson’s fate unfolded, sites around the Web felt the strain of spiking interest.
On Twitter, the volume of Jackson-related messages – up to 5,000 per minute at peak – put such a demand on the site that it slowed considerably.
“We saw an instant doubling of tweets per second the moment the story broke,” Twitter co-founder Biz Stone wrote in an e-mail response to our inquiry. “This particular news about the passing of such a global icon is the biggest jump in tweets per second since the U.S. presidential election.”
This is good news. Facebook now has public status updates.
We’re eager to have more Facebook content in Spinn3r (and personally I’m a big advocate of the open web).
Facebook has just announced that it’s now testing a new version of the “publisher” that allows users to choose who can see their status updates and posts. The most interesting part? The first option is “everyone.”
In other words, Facebook is making it easy for users to post status updates that are visible not just to their Facebook friends – but for the entire world (and Google) to see. Facebook has said for a long time that it’s planned to give users more granular privacy controls over status updates, but Facebook has never promoted making status updates publicly visible this heavily before – though the company did take a step back in March to allow users to share some content on their profiles with everyone.

Microformats are four years old now and will have a birthday party this Friday to celebrate.
Some of us from the Spinn3r team will be there as well. Unfortunately, due to a timing error, our Spinn3r 3.0 launch dinner is that night so I need to take off around 8pm.
Venturebeat has a great piece on Real Time Search out right now covering a hot new trend.
Real-time search engines have proliferated over the last month, with a series of launches from start-ups like Topsy, almost.at and Scoopler. The companies are hoping to edge in on a space that Google co-founder Larry Page has admitted is a weakness for the search giant. And they’re using microblogging and social bookmarking sites as tools to figure out what content is relevant up to the second.
This includes new startups like Topsy, Scoopler, OneRiot, TweetMeme, Almost.at, DailyRT, Twazzup, Friendfeed, Collecta, and CrowdEye.
Some of these guys are already using Spinn3r… I should be able to disclose a few here this week (need to check and verify).
We’re also working on some more real time social media integration in Spinn3r so I’m hoping to work with more of these companies directly.
I threw this together in a couple hours tonight.
The mainstream news is pretty pathetic at handling this story so I just figured Twitter would be covering it correctly.
News is sorted by time and by rank. Any videos are embedded directly (also sorted by rank).
Message me on Twitter if you have any commentary.
Spinn3r 3.1 just went live today and we’re announcing two new features.
Twitter Firehose Support
Spinn3r listens to a new Twitter firehose API which is a sample of the full Twitter feed.
All Twitter content is classified with a new MICROBLOG publisher type which we will be using for Twitter, Identi.ca, Jaiku, and other Microblog systems.
Further, it is language classified (using an algorithm we have developed) and includes all metadata from Twitter including publication time, author name, handle, etc.
All content is real time and indexed and available within Spinn3r a few moments after it is published.
There were a lot of requests for this feature (Social Media is HOT) so I expect a lot of innovation from our customers.
Technically this is still in beta but we feel it’s ready for use in production applications (once we get some feedback from our users).
Here’s the current breakdown of Twitter vs other social media and blog content in Spinn3r. While Twitter is larger it’s important to realize this is much less content since Twitter posts are short.
Social Media Rank
We also published some of our new ranking technology which has been in development for a while now (more than a year).
We’re indexing social media sites and computing rank on users based on their social graph.
The results are pretty interesting. Scobleizer OWNS Friendfeed. Techcrunch consistently places high: they’re #3 on Friendfeed, #274 on Digg, and #24 on Twitter.
It’s also interesting how the founders of these social media properties consistently place high, even over celebrities. Ev Williams is still #5 on Twitter. Kevin Rose is still #1 on Digg. Paul is #16 on Friendfeed.
Sources (or nodes) are ranked by authority whereby the more friends or inbound links you have the higher your rank.
Our key differentiator is that we do not consider raw inbound link count to be an accurate representation of authority. This is highly vulnerable to spam and rank errors as users who attract a large number of links (either through black hat methods, link baiting, or viral marketing) can inflate their rankings (and harm other legitimate users).
We consider the quality of inbound links to be far more important. You can observe this in our results as the authority for a source is not a direct function of raw inbound links. Some users can have high authority but (relatively) very few inbound links.
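To make the difference between raw inbound count and link quality concrete, here is a minimal PageRank-style sketch. This is NOT our ranking algorithm (which isn't described in this post); it's a generic, well-known technique swapped in for illustration. The function name `rank_sources`, the damping factor, and the toy graph are all assumptions.

```python
def rank_sources(links, damping=0.85, iterations=50):
    """PageRank-style authority: a link counts for more when its source is authoritative.

    `links` maps each source to the list of sources it links to (or follows).
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for source in nodes:
            targets = links.get(source, [])
            if targets:
                # Split this source's authority across everything it links to.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Source with no outbound links: spread its authority evenly.
                share = damping * rank[source] / n
                for node in nodes:
                    new_rank[node] += share
        rank = new_rank
    return rank


# Hypothetical graph: "spammer" collects four links from throwaway accounts,
# while "expert" has only two links, both from well-followed hubs.
links = {
    "sock1": ["spammer"], "sock2": ["spammer"],
    "sock3": ["spammer"], "sock4": ["spammer"],
    "fan1": ["hub_a"], "fan2": ["hub_a"], "fan3": ["hub_a"],
    "fan4": ["hub_b"], "fan5": ["hub_b"], "fan6": ["hub_b"],
    "hub_a": ["hub_b", "expert"],
    "hub_b": ["hub_a", "expert"],
}

# Raw inbound link counts, for comparison with the computed authority.
inbound = {}
for source, targets in links.items():
    for target in targets:
        inbound[target] = inbound.get(target, 0) + 1

scores = rank_sources(links)
for name in ("spammer", "expert"):
    print(name, "inbound:", inbound[name], "authority:", round(scores[name], 3))
```

In this toy graph the expert ends up with more authority than the spammer despite having half as many inbound links, which is the effect described above.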
We’re really eager for feedback here. If you have any comments on our ranking system feel free to contact us with your thoughts.
If you’ve been paying attention to the news, there are massive protests and riots all over Tehran following the recent (potentially corrupt) election.
The phones have been cut off and you can’t make calls into Tehran.
Apparently, you can’t blog if you’re hosted on Blogfa.com.
They’re down at the moment and not responding.
Here’s the current Spinn3r graph for content being published from Blogfa.
We monitor all the posts across the blog networks for anomalies. Needless to say, this would be a significant anomaly.
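For the curious, here is a rough sketch of one simple way to flag a drop like this. It assumes you have hourly post counts per host and uses a plain z-score against a trailing window. The function name, window size, and threshold are hypothetical, and this is not our actual monitoring code.

```python
from statistics import mean, stdev


def find_anomalies(counts, window=24, threshold=3.0):
    """Return indexes whose count is more than `threshold` sigmas from the trailing window."""
    anomalies = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            sigma = 1.0  # perfectly flat history: avoid dividing by zero
        if abs(counts[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical hourly post counts: 48 hours of steady traffic, then the host goes dark.
hourly_posts = [1000 + (i % 5) * 10 for i in range(48)] + [0]
print(find_anomalies(hourly_posts))  # flags the final hour (index 48)
```

A trailing window keeps the baseline local, so normal daily variation doesn't trip the alarm but a host falling off the map does.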
Update: The HTTP server is up and responding but with Service Unavailable
It looks like Google might be launching a microblog search engine:
Google prepares to launch a service that indexes and ranks content from microblogging services like Twitter. Since it’s very easy to post updates and the posts are usually very short, micro-blogging services are great for live blogging, posting real-time information about an event.
Twitter’s search engine has two important drawbacks: it’s limited to Twitter and it sorts the results by date. While there are other search engines like Tweefind that try to sort Twitter posts by relevancy and search engines like Twingly that index multiple microblogging sites, none of them does a great job.
This is almost certainly a win for non-Twitter services like Jaiku and Identi.ca.
For Twitter, this might be a mixed blessing. They are clearly holding back some of their search results as they don’t provide full access to Twitter content for other systems.
This might just be because they feel that their main market is Microblog search and Google would be a competitor here.
I tend to disagree. This puts them in an unusual position of being at odds with their user base. Their users want to publish content in the public. If they are blocking 3rd party services from indexing their content they are doing their users a disservice.
Of course this might not matter if the users are unaware or don’t mind.
On the technical front, there is only so much you can do with microblog content for ranking. You can do a traditional IR score on the text but this only gets you so far.
Twitter search seems to just be boolean matching of results sorted by time (which might be fine).
Not sure if secondary metrics like ranking should kick in to re-sort results. It might be a good idea to use this just to prevent spam.
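Here's a quick sketch of what I mean: boolean matching sorted by time, with rank used only as a spam floor rather than to reorder results. The `Post` fields, the `author_rank` score, and the `min_rank` cutoff are made up for illustration. I have no idea how Twitter search is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Post:
    text: str
    timestamp: int      # seconds since epoch
    author_rank: float  # hypothetical authority score in [0, 1]


def search(posts, query, min_rank=0.0):
    """Boolean AND match on whole words, newest first; rank only filters, never reorders."""
    terms = query.lower().split()
    hits = []
    for post in posts:
        words = set(post.text.lower().split())
        if post.author_rank >= min_rank and all(term in words for term in terms):
            hits.append(post)
    return sorted(hits, key=lambda p: p.timestamp, reverse=True)


posts = [
    Post("Michael Jackson news roundup", 100, 0.9),
    Post("jackson tickets cheap click here", 300, 0.01),
    Post("Remembering Jackson tonight", 200, 0.6),
    Post("Lunch was great", 400, 0.8),
]

for post in search(posts, "jackson", min_rank=0.1):
    print(post.timestamp, post.text)
```

With `min_rank=0.1` the low-rank spam post drops out and the remaining matches stay in reverse chronological order.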
I think the major goal should be to put this in front of users and see what they think is best.
Update: I should note that Google doesn’t exactly have a big win with Google Blogsearch. It’s a decent product for sure but I find that the main Google search is better for blogs. Blogsearch has quite a bit of spam in it from our audits. We’ve compared Spinn3r to blog search both internally and from our customers performing audits and have consistently come out on top. Further, Blogsearch has problems that we’ve solved years ago (and have been solved by Google) so I’m not sure exactly what’s going on here…
It could be that Microblog and LIVE search is a totally separate type of product but Blogsearch is so close to the main Google search that it’s hard to differentiate them.
Perhaps a unified Blog and Microblog search would be best?
The new Facebook Pages are really just blogs. Unfortunately, not very good ones.
First. It’s totally possible that a given page/blog doesn’t have permalinks. If it’s a ‘note’ then it has a permalink. If it’s a ‘link’ then it does not have a permalink.
How am I supposed to share these with people over IM or link to them? There’s no link!
Second. No RSS or Atom. Seriously Facebook? Seriously?
These pages are public. They update frequently. They should have an RSS feed.
Those who have not learned the mistakes of content management systems are doomed to repeat them.
Seriously. We learned ALL of these lessons back in 2000. Can we not repeat the mistakes of the past?
Perhaps you’ve been reading all the news today about Apple’s new Snow Leopard release.
What you didn’t see mentioned is that the Snow Leopard is an endangered species. Along with the Bengal Tigers and most of the big cats.
The total wild population could be from 4-6k individuals.
The Snow Leopard might become extinct in your lifetime. Your grandchildren might grow up in a world where Snow Leopards are only myth and legend and the name of an old and obsolete computer Operating System.
Don’t worry though. You can help the World Wildlife Fund by adopting your very own Snow Leopard.
I just bought a $250 family kit.
Can you match my donation?
You can also join the Facebook cause for the International Snow Leopard Trust or make a donation directly.
Inhabiting the rugged, mountainous terrain of Central Asia and the Himalayan region, snow leopards have long hind limbs and shortened front limbs, which allow them to leap 20 to 50 feet through the air. The diet of these cats varies by region and season, but consists mostly of wild sheep and goats, and smaller animals such as pikas, zokors and marmots. Snow leopards are threatened by habitat loss, diminished food supply, hunting for the illegal wildlife trade and revenge killings by herders. WWF works with local people and supports research and habitat conservation projects to protect these beautiful cats throughout their range.
It’s sad to hear that Rajeev Motwani passed away this week.
I never had the pleasure of meeting Rajeev in person.
His work in academia is impressive and he was obviously insanely brilliant.
Back when I was starting up Tailrank, I was introduced to Rajeev and we jumped on the phone and talked about text clustering, memetracking, etc.
Smart guy and he had all the right conversations.
About six months ago I bumped into him at a party but didn’t get a chance to swing by and say hi… now I regret not taking advantage of that opportunity.
It’s always sad when someone passes away but when it’s someone who has contributed so much to the industry it’s even more of a loss.
Algorithms got you cross eyed?

About two weeks ago we completed a pretty big project to migrate Spinn3r’s operations from ServerBeach over to Softlayer.
The entire project, from start to finish, took just over one month.
I’m also proud to note that not a single customer noticed any downtime or any issue with our migration. It cost a bit more money and more time but we were able to perform the migration live and in place.
Overview
ServerBeach had been a great dedicated hosting provider for the last two years but we were clearly outgrowing them. Their sweet spot (at least from our perspective) seems to be from the lower end to mid end market. We’re now in the mid to high end dedicated hosting market and Softlayer seems to be the best choice here…
I think my main suggestion to ServerBeach is that they are going to have to come down in price. If they want to compete with Softlayer they’re going to have to do so on price as Softlayer clearly has them beat on feature set.
I did an exhaustive comparison of using cloud providers like Amazon/EC2 vs Colo vs a Dedicated Hosting provider.
My initial thinking was that we’d just throw down $50k-100k and go the colo route. The problem is that the numbers just didn’t add up. It was still far too expensive to go down this route. Internally, there was also a push to look at EC2 and Amazon but they just couldn’t compete on pricing either.
Further, Amazon’s pricing only looks decent if you go down the reserved route. This requires you to put down $30k or so and pay for half of your servers up front. That’s great and all but if you factor in that their servers are a bit underpowered, then you have to buy more of them and your operations costs rise.
I decided to take another look at Softlayer as I evaluated them in mid-2008 and they came in a bit high on pricing. Fortunately, I was able to get an introduction to someone in their Sales department who could appreciate closing a 30-40 server purchase overnight.
Once we had a decent price quote in hand we took a dive into their hardware and pricing model.
I think if it wasn’t for Softlayer we would have had to go down the colo route. IMO they’re by far the best provider in the dedicated hosting space. Serverbeach, Rackspace, etc need to really get their act together if they don’t want Softlayer to eat their lunch.
Further, I have NO idea why these guys aren’t providing Debian. If you want the big Web 2.0 deals you’re going to have to ship Debian. Most of the good Operations Engineers I deal with flat out refuse to work on anything that’s not Debian.
Hardware
The first thing we evaluated was their hardware. They’re using Supermicro 1/2U servers throughout their datacenter (in fact I think they only use Supermicro) which were the same machines we were thinking of purchasing if we went down the colo route.
These are great boxes and I know quite a few large scale shops that have clusters of these machines well into the hundreds of nodes.
Further, their configurations seem to be pretty solid. We can get boxes with up to 96GB of RAM and 12TB of storage.
We’re currently using three configurations. API servers are single disk boxes with 4GB of RAM. InnoDB/memory boxes have 32GB of RAM. Our bulk storage and archive boxes have 8GB of RAM, an Adaptec RAID controller and 12TB of storage across 1TBx12 disks.
They also have the Intel X-25 SSDs in stock, which we’ve been interested in as well.

Network
Their network is pretty impressive as well. They have three datacenters (SEA, WDC, DAL) online now with a 10Gbit interconnect between them. Latency is realistic as well – not as fast as being on gigabit ethernet but certainly reasonable for running replication or serial IO applications over their backend.
Their in-datacenter network is all Cisco, with 20Gbit between racks, and each machine comes with a 1Gb link. In our benchmarks we were easily able to get full gigabit connectivity to all of our machines. This was something we could not get from Serverbeach.
API and Feature Set
I think the biggest differentiator (and what sold us from my perspective) is the Softlayer API.
There are some pretty interesting features here…
For example, we can shut down a switch port by calling their API.
If a database is misbehaving, and we need to kill it so that we can do a master promotion, we can just shoot it in the head at the switch port and then promote another master.
This way I can verify that all clients timeout and reconnect to the new master. I can then SSH in via another ethernet port (which has a firewall only allowing SSH) and debug the problem.
We can also check their inventory, reboot a machine, look at billing, change reverse DNS configuration, etc.
This would be a real pain to set up if we were going down the colo route.
Combine that with KVM over IP and you have a winner.
Additional papers based on the Spinn3r/ICWSM dataset have been published. It seems I have a lot of reading to do!
Flash Floods and Ripples: The Spread of Media Content through the Blogosphere
This paper is based on the Spinn3r data set (ICWSM 2009), which consists of web feeds collected during a two month period in 2008. The data set includes posts from blogs as well as other data sources like news feeds. We discuss our methodology for cleaning up the data and extracting posts of popular blog domains for the study. Because the Spinn3r data set spans multiple blog domains and language groups, this gives us a unique opportunity to study the link structure and the content sharing patterns across multiple blog domains. For a representative type of content that is shared in the blogosphere, we focus on videos of the popular web-based broadcast media site, YouTube.
Our analysis, based on 8.7 million blog posts by 1.1 million blogs across 15 major blog hosting sites, reveals a number of interesting findings. First, the network structure of blogs shows a heavy-tailed degree distribution, low reciprocity, and low density. Although the majority of the blogs connect only to a few others, certain blogs connect to thousands of other blogs. These high-degree blogs are often content aggregators, recommenders, and reputed content producers. In contrast to other online social networks, most links are unidirectional and the network is sparse in the blogosphere. This is because links in social networks represent friendship where reciprocity and mutual friends are expected, while blog links are used to reference information from other data sources.
Identifying Personal Stories in Millions of Weblog Entries
Stories of people’s everyday experiences have long been the focus of psychology and sociology research, and are increasingly being used in innovative knowledge-based technologies. However, continued research in this area is hindered by the lack of standard corpora of sufficient size and by the costs of creating one from scratch. In this paper, we describe our efforts to develop a standard corpus for researchers in this area by identifying personal stories in the tens of millions of blog posts in the ICWSM 2009 Spinn3r Dataset. Our approach was to employ statistical text classification technology on the content of blog entries, which required the creation of a sufficiently large set of annotated training examples. We describe the development and evaluation of this classification technology and how it was applied to the dataset in order to identify nearly a million personal stories.
In this paper, we describe our efforts to overcome the limitations of our previous story collection research using new technologies and by capitalizing on the availability of a new weblog dataset. In 2009, the 3rd International AAAI Conference on Weblogs and Social Media sponsored the ICWSM 2009 Data Challenge to spur new research in the area of weblog analysis. A large dataset was released as part of this challenge, the ICWSM 2009 Spinn3r Dataset (ICWSM, 2009), consisting of tens of millions of weblog entries collected and processed by Spinn3r.com, a company that indexes, interprets, filters, and cleanses weblog entries for use in downstream applications. Available to all researchers who agree to a dataset license, this corpus consists of a comprehensive snapshot of weblog activity between August 1, 2008 and October 1, 2008. Although this dataset was described as containing 44 million weblog entries when it was originally released, the final release of this dataset actually consists of 62 million entries in Spinn3r.com’s XML format.
SentiSearch: Exploring Mood on the Web
Given an accurate mood classification system, one might imagine it to be simple to configure the classifier as a search filter, thus creating a mood-based retrieval system. However, the challenge lies in the fact that in order to classify the mood for a potential result, the entire content of that page must be downloaded and analyzed. Much like a typical web-based retrieval system, to avoid this cost, pages could be crawled and their mood indexed along with the representation stored for search indexing. Alternatively, the presence of a massive dataset from www.spinn3r.com enabled the ESSE system to be built, performing mood classification and result filtering on the fly (Burton et al. 2009). Because the dataset (including textual content), search system, and mood classification system all exist on the same server, the filtering retrieval system was made possible. The dataset not only allows access to the content of a blog post (beyond the summary and title typically made available through search APIs) but the closed nature of the dataset allows for experimentation while still being vast enough to provide breadth and depth of topical coverage.
Event Intensity Tracking in Weblog Collections
The data provided for ICWSM 2009 came from a weblog indexing service Spinn3r (http://spinn3r.com). This included 60 million postings spanned over August and September 2008. Some meta-data is provided by Spinn3r.
Each post comes with Spinn3r’s pre-determined language tag. Around 24 million posts are in English, 20 million more are labeled as ‘U’, and the remaining 16 million are comprised of 27 other languages (Fig. 3). The languages are encoded in ISO 639 two-letter codes (ISO 639 Codes, 2009). Other popular languages include Japanese (2.9 million), Chinese/Japanese/Korean (2.7 million) and Russian (2.5 million). The second largest label is U unknown. This data could potentially hold posts in languages not yet seen or posts in several languages. Our present work, including additional dataset analysis presented next, is limited to the English posts unless otherwise specified. In future work we plan to also consider other languages represented in the dataset.
Quantification of Topic Propagation using Percolation Theory: A study of the ICWSM Network
Our research is the first attempt to give an accurate measure for the level of information propagation. This paper presents ‘SugarCube’, a model designed to tackle part of this problem by offering a mathematically precise solution for the quantification of the level of topic propagation. The paper also covers the application of SugarCube in the analysis of the social network structure of the ICWSM/Spinn3r dataset (ICWSM 2009). It presents threshold values for the communities found within the collection, and paves the way for the measurement of topic propagation within those communities. Not only can SugarCube quantify the proliferation level of topics, but it also helps to identify ‘heavily-propagated’ or Global topics. This novel approach is inspired by Percolation Theory and its application in Physics (Efros 1986).
Here are the slides from my talk at ICWSM 2009. The talk went really well I think. Lots of great questions from the audience.
The winning paper, “Flash Floods and Ripples: The Spread of Media Content through the Blogosphere”, was very good and I’m excited to read it in full when I get a few moments.
I’m proud to announce that we have just released Spinn3r 3.0 after more than a year of development.
This has been quite a lot of work based on feedback from our customer base and ships with some really awesome functionality.
Most of this time has been spent on architecture but a good deal has been spent implementing features for our rapidly growing user base.
When you outsource a major component of your infrastructure, like crawling, you tend to lean on it heavily and push it to the very edge.
Spinn3r has benefited significantly from our user base as they have suggested a number of excellent features. This has dramatically increased our reliability, performance, and feature set.
A good deal of work here has been spent on scalability, performance, and optimizations, including serious improvements to our core backend infrastructure.
There’s quite a lot that’s new in this release so I’ll just dive in.
Want to track Swine Flu outbreaks? Just use Spinn3r!
Courtney Corley and Jorge Reyes are two University of North Texas graduate students who have been using Spinn3r under our research program to mine data about the recent Swine Flu outbreak.
The Denton Record-Chronicle has the story:
“We’re looking at what people write in blogs, Web [sites] and social media like Facebook, YouTube, etc. But, in particular, we’re just using blogs,” Corley said. “We have a service that allows us access to all blogs written in whatever language.”
The service is called Spinn3r, and allows them to pull together all media across the Internet that contains the keywords they search for.
“It’s a really rich resource to use for public health to see what people are writing about,” he said. “It’s a massive amount of data. Jorge and I for the past week have been looking at all the blogs that talk about swine flu. There are many words in Spanish for swine flu, so Jorge has been able to navigate that.”
Reyes, who is from Mexico, said he was motivated to work on the project because his family was in the country where the virus originated.
“All my family was there, I was worried,” Reyes said. “We were like, ‘what could we do with the tools we have?’

caption: Courtney Corley and Jorge Reyes, who are tracking the spread of swine flu in the United States and Mexico, are shown Wednesday on campus.
I had an interesting idea today to find bugs in networking code.
Design a VPN that deliberately introduces network packet corruption.
One could introduce a tunable to corrupt a certain % of packets.
For example, you could bring up a MySQL master/slave on your ethernet network and then launch the VPN to transfer the replication binary log across the corrupted network link.
Then you could wait and find that MySQL will break in a few minutes.
Now you could implement a patch to hashcode and resend the binary log packets on error.
Then just launch the code on your corrupting VPN and verify that it works.
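Here's a rough, in-process sketch of the idea (no real VPN involved): a `corrupt` function with a tunable corruption rate stands in for the bad link, and a CRC32 frame plus resend stands in for the "hashcode and resend" patch. Everything here is hypothetical illustration code, not MySQL's replication protocol.

```python
import random
import zlib


def corrupt(packet: bytes, rate: float, rng: random.Random) -> bytes:
    """With probability `rate` (the tunable), flip one random bit in the packet."""
    if packet and rng.random() < rate:
        data = bytearray(packet)
        pos = rng.randrange(len(data))
        data[pos] ^= 1 << rng.randrange(8)
        return bytes(data)
    return packet


def frame(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 so the receiver can detect corruption."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def unframe(packet: bytes):
    """Return the payload, or None if the checksum does not match."""
    expected = int.from_bytes(packet[:4], "big")
    payload = packet[4:]
    return payload if zlib.crc32(payload) == expected else None


def send_with_retry(payload: bytes, rate: float, rng: random.Random, max_attempts=100):
    """Send over the corrupting link, resending until the checksum verifies."""
    for attempt in range(1, max_attempts + 1):
        received = unframe(corrupt(frame(payload), rate, rng))
        if received is not None:
            return received, attempt
    raise IOError("link too corrupt, giving up")


rng = random.Random(42)  # fixed seed so the run is repeatable
events = [f"binlog event {i}".encode() for i in range(1000)]
results = [send_with_retry(event, rate=0.05, rng=rng) for event in events]
delivered = [payload for payload, _ in results]
resends = sum(attempts - 1 for _, attempts in results)
print("all events intact:", delivered == events, "resends:", resends)
```

Without the `frame`/`unframe` step the same run would silently deliver damaged events, which is exactly the class of bug the corrupting link is meant to surface.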
Could be a great way to find data corruption bugs in protocols that were originally designed to be resilient.
Ideally it would be able to build packets that can find collisions in TCP checksums. Either that or create a new packet with a new TCP checksum.
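Finding such collisions is easier than it sounds. The TCP checksum is a 16-bit ones' complement sum, and addition doesn't care about order, so swapping any two aligned 16-bit words leaves the checksum unchanged. A small sketch follows; note it checksums the payload only, whereas the real TCP checksum also covers the header and a pseudo-header, and `swap_words` is just a helper I made up for the demo.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071, over `data` only."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def swap_words(data: bytes, i: int, j: int) -> bytes:
    """Swap the 16-bit words at word indexes i and j."""
    out = bytearray(data)
    out[2 * i:2 * i + 2] = data[2 * j:2 * j + 2]
    out[2 * j:2 * j + 2] = data[2 * i:2 * i + 2]
    return bytes(out)


original = b"aaaabbbbccccdddd"
corrupted = swap_words(original, 0, 7)

print(original, hex(internet_checksum(original)))
print(corrupted, hex(internet_checksum(corrupted)))
```

The two payloads differ but print the same checksum, so a corrupting link that reorders words would get past TCP untouched.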
Using PPP and a network pipe could yield an easy proof of concept.
Update: I imagine a tool like this already exists, though I haven’t checked. If that is the case then the only change I would suggest is to introduce this tool into normal protocol testing.
When you have petabytes of data even a small data corruption can be dangerous because tracking it down can be exceedingly problematic.
One of the things that has always bothered me about replication is that the binary logs are written to disk and then read from disk.
There are two threads which are, for the most part, unaware of each other.
One thread reads the remote binary logs and writes them to disk, and the other reads them back from disk and executes them.
While the Linux page cache CAN work to buffer these logs, the first write will cause additional disk load.
One strategy, which could seriously boost performance in some situations, would be to pre-read say 10-50MB of data and just keep it in memory.
If a slave is catching up, it could have GIGABYTES of binary log data from the master. It would then write this to disk. These reads would then NOT come from cache.
Simply using a small buffer could solve this problem.
One HACK would be to use a ram drive or tmpfs for logs. I assume that the log thread will block if the disk fills up… if it does so intelligently, one could just create a 50MB tmpfs to store binary logs. MySQL would then read these off tmpfs, and execute them.
50MB-250MB should be fine for a pre-read buffer. Once one of the files is removed, the thread would continue reading data.
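To show the shape of the idea, here is a minimal Python sketch of a byte-bounded in-memory buffer sitting between a reader thread and an applier thread. It's an illustration only: `PreReadBuffer`, the thread functions, and the sizes are all hypothetical, and MySQL itself does not work this way.

```python
import threading
from collections import deque


class PreReadBuffer:
    """Byte-bounded in-memory queue between a log reader and a log applier."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.peak = 0
        self.events = deque()
        self.cond = threading.Condition()

    def put(self, event: bytes):
        with self.cond:
            # Block the reader while the buffer is full, like a full tmpfs would.
            # (A single event larger than max_bytes is still admitted when empty.)
            while self.events and self.used + len(event) > self.max_bytes:
                self.cond.wait()
            self.events.append(event)
            self.used += len(event)
            self.peak = max(self.peak, self.used)
            self.cond.notify_all()

    def get(self) -> bytes:
        with self.cond:
            while not self.events:
                self.cond.wait()
            event = self.events.popleft()
            self.used -= len(event)
            self.cond.notify_all()  # wake the reader: space was freed
            return event


def run_demo(n_events=1000, event_size=1024, max_bytes=50 * 1024):
    buf = PreReadBuffer(max_bytes)
    applied = []

    def io_thread():
        # Stands in for the thread that pulls binary log events from the master.
        for i in range(n_events):
            buf.put(i.to_bytes(4, "big") * (event_size // 4))

    def sql_thread():
        # Stands in for the thread that executes events; here it just records them.
        for _ in range(n_events):
            applied.append(buf.get())

    threads = [threading.Thread(target=io_thread), threading.Thread(target=sql_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return buf, applied


buf, applied = run_demo()
print("events applied:", len(applied), "peak buffered bytes:", buf.peak)
```

The reader blocks as soon as the buffer fills and resumes when the applier frees space, so memory use stays capped while events never touch the disk on the way through.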
We’ve been providing researchers with access to Spinn3r for more than two years now. The results are really starting to land now.
We’re sponsoring ICWSM this year with a 400GB snapshot. This is being used by more than 100 research groups of about 500 total researchers.
There should be a few dozen more papers published in the next few weeks but I wanted to highlight these now as they are already live.
Specifications and Architectures of Federated Event-Driven Systems
Specifying the Personal Information Broker Data Acquisition: Data can be acquired from multiple sources – currently we use Spinn3r, later we will also acquire IEM, Twitter, Technorati, etc. Each of these acquisitions is specified differently. Acquisition of Spinn3r data, referenced in Figure 3 step 1, is achieved through changing URL arguments in a manner defined by Spinn3r. Thus, the specification is unique to Spinn3r. While that particular specification cannot be reused, using the compositional approach, exchanging Spinn3r for Twitter, a news feed, or an instant messaging account while maintaining the integrity of the composition is trivial. The specifications for all of these information interfaces are very different; a notation that allows the description of composite applications must account for this.
Blogs as Predictors of Movie Success
In this work, we attempt to assess if blog data is useful for prediction of movie sales and user/critics ratings. Here are our main contributions:
• We evaluate a comprehensive list of features that deal with movie references in blogs (a total of 120 features) using the full spinn3r.com blog data set for 12 months.
• We find that aggregate counts of movie references in blogs are highly predictive of movie sales but not predictive of user and critics ratings.
• We identify the most useful features for making movie sales predictions using correlation and KL divergence as metrics and use clustering to find similarity between the features.
• We show, using time series analysis as in (Gruhl, D. et. al. 2005), that blog references generally precede movie sales by a week and thus weekly sales can be predicted from blog references in the preceding weeks.
• We confirm low correlation between blog references and first week movie sales reported by (Mishne, G. et. al. 2006) but we find that (a) budget is a better predictor for the first week; (b) subsequent weeks are much more predictive from blogs (with up to 0.86 correlation).
Data and Features
The data set we used for this paper is the spinn3r.com blog data set from Nov. 2007 until Nov. 2008. This data set includes practically all the blog posts published on the web in this period (approximately 1.5 TB of compressed XML).
Blogvox2: A modular domain independent sentiment analysis system
Bloggers make a huge impact on society by representing and influencing the people. Blogging by nature is about expressing and listening to opinion. Good sentiment detection tools, for blogs and other social media, tailored to politics can be a useful tool for today’s society. With the elections around the corner, political blogs are vital to exerting and keeping political influence over society. Currently, no sentiment analysis framework that is tailored to Political Blogs exist. Hence, a modular framework built with replicable modules for the analysis of sentiment in blogs tailored to political blogs is thus justified.
…
Spinn3r (http://tailrank.com) provided live spam-resistant and high performance spider dataset to us. We tested our framework on this dataset since it was live feeds and we wanted to test our performance of sentiment analysis on these dataset for performance analysis and testing. We periodically pinged the online api for the current dataset of all the rss feeds. Although we had different domains that were provided to us, we chose the political domain for consistency with our other results.
Meme-tracking and the Dynamics of the News Cycle
Tracking new topics, ideas, and “memes” across the Web has been an issue of considerable interest. Recent work has developed methods for tracking topic shifts over long time scales, as well as abrupt spikes in the appearance of particular named entities. However, these approaches are less well suited to the identification of content that spreads widely and then fades over time scales on the order of days — the time scale at which we perceive news and events.
…
Dataset description. Our dataset covers three months of online mainstream and social media activity from August 1 to October 31 2008 with about 1 million documents per day. In total it consist of 90 million documents (blog posts and news articles) from 1.65 million different sites that we obtained through the Spinn3r API [27]. The total dataset size is 390GB and essentially includes complete online media coverage: we have all mainstream media sites that are part of Google News (20,000 different sites) plus 1.6 million blogs, forums and other media sites. From the dataset we extracted the total 112 million quotes and discarded those with L < 4, M < 10, and those that fail our single-domain test with ε = .25. This left us with 47 million phrases out of which 22 million were distinct. Clustering the phrases took 9 hours and produced a DAG with 35,800 non-trivial components (clusters with at least two phrases) that together included 94,700 nodes (phrases).














