I was just looking over Technorati’s post which includes the total number of posts per day.
Technorati points out there are 900k blog posts per day:
As you can see by the black trend line, posting volume has followed a strong upward trend. After a brief dip last winter, the average rate of postings has grown steadily such that at the end of July 2005, there were about 900,000 posts created each day. That’s about 37,500 posts every hour, or 10.4 posts per second. It peaked at just over 1.1 million posts per day after the Live 8 concerts and Justice Sandra Day O’Connor announced her resignation from the US Supreme Court.
How much data is this? If we assume that the average HTML post is 150K this will work out to about 135G. Now assuming we’re going to average this out over a 24 hour period (which probably isn’t realistic) this works out to about 12.5 Mbps sustained bandwidth.
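The arithmetic behind those two figures can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes decimal units (1K = 1,000 bytes, 1G = 10^9 bytes), and the constant and function names are made up for illustration.

```python
# Back-of-the-envelope check of the HTML figures above.
# Assumption: decimal units (1K = 1,000 bytes, 1G = 10**9 bytes).

POSTS_PER_DAY = 900_000        # Technorati's end-of-July 2005 figure
HTML_POST_BYTES = 150 * 1000   # assumed average size of a full HTML post
SECONDS_PER_DAY = 24 * 60 * 60


def daily_gigabytes(posts: int, bytes_per_post: int) -> float:
    """Total data per day, in gigabytes."""
    return posts * bytes_per_post / 1e9


def sustained_mbps(bytes_per_day: float) -> float:
    """Average bandwidth if the day's data is spread evenly over 24 hours."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e6


html_gb = daily_gigabytes(POSTS_PER_DAY, HTML_POST_BYTES)
html_mbps = sustained_mbps(POSTS_PER_DAY * HTML_POST_BYTES)
print(f"{html_gb:.0f}G per day, {html_mbps:.1f} Mbps sustained")
```

Spreading the traffic evenly over the day is the optimistic case; real fetch patterns are burstier, so peak bandwidth would be higher than this average.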
Of course we should assume that about 1/3 of this is going to be coming from servers running gzip content compression. I have no stats WRT the number of deployed feeds which can support gzip (anyone have a clue?). My thinking is that this would reduce us down to about 9Mbps which is a bit better.
This of course assumes that you’re not fetching the RSS and are just fetching the HTML. RSS is much more bloated in this regard. If you have to fetch 1 article from an RSS feed, you’re forced to fetch the 14 additional older posts that are still in the feed (assuming you’re not using the A-IM encoding method, which is even rarer than gzip support). This floating window can really hurt your traffic. The upside is that you have to fetch less HTML.
Now let’s assume you’re only fetching pinged blogs and you don’t have to poll (polling itself has a network overhead). The average blog post would probably be around 20k, I assume. If we assume the average feed has 15 items, only publishes one story, and has a 10% overhead, we’re talking about 330k (15 x 20k x 1.1) per fetch of an individual post.
If we go back to the 900k posts per day figure, we’re talking about a lot of data: 297G per day, most of which is wasted. Averaged over 24 hours, this works out to 27.5Mbps, and that is before any gzip savings.
That’s a lot of data and a lot of bloat which is unnecessary. This is a difficult choice for smaller aggregator developers as this much data costs a lot of money. The choice comes down to cheap HTML indexing, with the inaccuracy that comes from HTML, or accurate RSS, which costs 2.2x more.
Update: Bob Wyman commented that he’s seeing 2k average post size with 1.8M posts per day. If we are to use the same metrics as above this is 54G per day or around 5Mbps sustained bandwidth for RSS items (assuming A-IM differentials aren’t used).
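For anyone who wants to rerun the RSS numbers with their own assumptions, here is a rough sketch of the same arithmetic. It assumes decimal units (1k = 1,000 bytes, 1G = 10^9 bytes), a 15-item feed window, and no compression; the function and variable names are hypothetical.

```python
# Sketch of the RSS arithmetic above. Assumptions: decimal units,
# a 15-item feed window, no gzip. All names are illustrative only.

SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_fetch(item_bytes, items_per_feed=15, overhead=0.0):
    """Size of one feed fetch: every item in the window, plus feed overhead."""
    return item_bytes * items_per_feed * (1 + overhead)


def sustained_mbps(bytes_per_day):
    """Average bandwidth if the day's data is spread evenly over 24 hours."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e6


# Estimate from the post: 900k posts/day, 20k items, 10% feed overhead.
post_daily = 900_000 * bytes_per_fetch(20_000, overhead=0.10)
# Bob Wyman's figures: 1.8M posts/day, 2k items, no overhead applied.
wyman_daily = 1_800_000 * bytes_per_fetch(2_000)
# HTML baseline for comparison: 900k posts/day at 150K each.
html_daily = 900_000 * 150_000

for label, daily in [("RSS, 20k items", post_daily),
                     ("RSS, 2k items", wyman_daily),
                     ("HTML", html_daily)]:
    print(f"{label}: {daily / 1e9:.0f}G/day, {sustained_mbps(daily):.1f} Mbps")

rss_vs_html = post_daily / html_daily
print(f"RSS vs HTML: {rss_vs_html:.1f}x")
```

The item size dominates everything else here, which is why the 20k and 2k assumptions land so far apart even though the second case has twice as many posts.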
Update: Slashdot picked this up and there are a few issues I want to correct. The number of posts is 900K not 9M. Not sure how this was introduced into the post. The other issue is that the 150K post is for the full HTML post not the RSS item length which a lot of people seem to be missing. I should have been more clear as this aspect can be a bit confusing.
-
1
Trackback on Aug 15th, 2005 at 4:24 am
Wow
I was having my morning surf of /. and I came across this blog post talking about the number of blog posts per day. Technorati gets 900 thousand posts per day, which I find amazing, but PubSub is claiming around twice that a 1.8 million a day! Perhaps…
-
2
Trackback on Sep 29th, 2005 at 1:17 am
links for 2005-09-29
Kevin Burton’s Feed Blog: The Math around Posts Per Day wow…interesting stats (tags: aggregation feed) The Old New Thing – Editorial – CMO Magazine Great article on marketing & innovation (tags: marketing innovation)…

August 13, 2005 at 4:42 pm
Kevin, I checked on the average size of an RSS/Atom entry a few months back and found that they are 2K bytes. Thus, the problem isn’t quite as bad as you suggest. Also, please note that while Technorati may only be seeing 900K new entries per day, PubSub has been averaging 1.8 million new entries per day over the last 30 days. So, while you may be exaggerating the size of individual posts, you’ve underestimated the number of entries per day… You can see our stats at: http://www.pubsub.com/linkcounts_graphs.php?type=newentries
There is no question that running a blog entry discovery service takes a great deal of bandwidth. The fact that most blogging software still relies on the polling techniques that have been proven over and over again to be wasteful and unscalable never ceases to amaze me. At least if folk are going to insist on doing something as inefficient as polling, they should assume some responsibility for the community resources they consume (i.e. network bandwidth) and implement the “A-IM” headers or RFC3229+feed that I’ve proposed on my blog. While polling is guaranteed to be less efficient than a push strategy, at least the impact and cost of polling can be reduced by implementing RFC3229+feed. See:
http://bobwyman.pubsub.com/main/2004/09/using_rfc3229_w.html
bob wyman
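As a rough illustration of what an RFC3229+feed poll looks like from the client side, the sketch below builds (but does not send) a request carrying `A-IM: feed` together with the ETag from the previous fetch, and interprets the possible status codes. The feed URL, the ETag value, and the helper names are hypothetical; only the header names and the 226 status come from RFC 3229 and the "feed" proposal.

```python
import urllib.request

# Hypothetical feed URL and ETag, used only for illustration.
FEED_URL = "http://example.com/feed.xml"
LAST_ETAG = '"abc123"'


def build_delta_request(feed_url, etag=None):
    """Build a GET that asks the server for only the entries we lack."""
    headers = {"A-IM": "feed"}           # accept the "feed" instance manipulation
    if etag:
        headers["If-None-Match"] = etag  # identifies the copy we already hold
    return urllib.request.Request(feed_url, headers=headers)


def describe_status(status):
    """Interpret the response codes a delta-aware client should expect."""
    if status == 226:
        return "delta"         # 226 IM Used: body holds only the new entries
    if status == 304:
        return "not-modified"  # nothing new since the last fetch
    if status == 200:
        return "full"          # server ignored A-IM and sent the whole feed
    return "other"


request = build_delta_request(FEED_URL, LAST_ETAG)
# Header names are case-insensitive; normalize before printing.
sent = {name.lower(): value for name, value in request.header_items()}
print(sent)
```

A server that does not understand `A-IM` simply answers with a normal 200 or 304, so a client can send the header unconditionally and fall back to full-feed handling.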
August 13, 2005 at 5:20 pm
Thanks for the feedback Bob.
Interesting to hear that the avg post length is 2k. What’s the median? Does this include the normal 0 byte post length for stories that just have a title? I guess it really doesn’t matter though if we’re talking about total blogosphere throughput.
For the record I’ve implemented A-IM in the Jakarta FeedParser. I’m going to try to get some time to release a 1.0 revision but it might take me some time (couple of weeks).
Thanks again…..
I might run my numbers again and release another blog post.
Kevin
August 14, 2005 at 6:22 pm
What kind of crap? How can you assume 150k per post? What kind of data do you post on your site? In RSS terms that would be damn-near a book’s length.
August 14, 2005 at 7:48 pm
What about just dropping the connection, and skipping the remainder of the feed, as soon as you encounter a posting date that has already been recorded/fetched for that feed?
In other words, read the feed from the beginning, and as soon as you get to the part of the feed that you know you already fetched, stop reading and close the connection.
This would require some on-the-fly parsing of feeds, which may be doable, no? It wouldn’t be elegant, but wouldn’t it work?
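A minimal sketch of that on-the-fly parsing, using Python’s incremental `XMLPullParser`. It matches on `guid` rather than posting date for simplicity, and it assumes the feed is ordered newest-first and that older items are never edited. The sample feed, the chunk size, and the function name are made up for illustration, and the chunks stand in for reads from a network connection.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed, newest item first.
SAMPLE_FEED = (
    '<rss version="2.0"><channel><title>Example</title>'
    "<item><title>Third</title><guid>post-3</guid></item>"
    "<item><title>Second</title><guid>post-2</guid></item>"
    "<item><title>First</title><guid>post-1</guid></item>"
    "</channel></rss>"
)


def read_new_items(chunks, seen_guids):
    """Parse chunks incrementally and stop at the first already-seen item."""
    parser = ET.XMLPullParser(events=("end",))
    new_items = []
    for chunk in chunks:
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag != "item":
                continue
            guid = elem.findtext("guid")
            if guid in seen_guids:
                # A real fetcher would close the connection here.
                return new_items
            new_items.append((guid, elem.findtext("title")))
    return new_items


# Simulate a socket delivering the feed 40 characters at a time.
chunks = [SAMPLE_FEED[i:i + 40] for i in range(0, len(SAMPLE_FEED), 40)]
new_items = read_new_items(chunks, seen_guids={"post-2"})
print(new_items)
```

Because the parser never sees the closing tags, no repair of the truncated XML is needed; the cost is that anything after the first known item is never examined.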
I do a fair bit of fetching over at Simpy ( http://www.simpy.com/ ). As far as I know, Simpy is the only social bookmarking/tagging service that provides full-text searching with regular link crawling (a la web search engine), so I’m quite interested in this subject…
August 14, 2005 at 11:22 pm
Interesting stats, but is 54GB/day really that much? It isn’t too hard to find a modest but capable server with 2TB of transfer a month for $120/month. At 54GB/day, I’d still have almost 400GB, or 20%, of my allotment left to spare.
It may not be efficient, but at this point, it’s not that expensive.
August 15, 2005 at 12:06 am
Otis. You make a good point. It *is* possible to do an HTTP byte range GET. The downside is that you’d either have to build a custom parser or repair the XML first and then feed it to your parser.
I’ve thought of this problem before and it’s an innovative solution but it has two major flaws:
1. If the last post has been modified, and you stopped fetching as soon as you’ve found the first item, you’d never update your content with the modified item which is dead last.
2. If the feed is non-chronologically ordered (for example search results) you’d get REALLY unpredictable results.
Without the metainfo to solve these two problems I feel that A-IM encoding is really the only way forward.
And eas.. you’re right. The bandwidth isn’t that much at a midlevel provider, but if you want decent bandwidth it’s going to cost you. For this much sustained bandwidth I’ve seen the numbers all over the map. To be honest I have to dive into it more because my hosting requirements are a bit unusual. I might have to just bite the bullet and colo a box instead of going with managed hosting.
August 15, 2005 at 2:13 am
As Bob Wyman said in your entry explaining delta-encoding; wouldn’t just a nice simple If-Modified-Since header work? It seems technically and conceptually a lot easier to work with.
Unfortunately, it is a bit of a strange overloading of existing semantics and may confuse some browsers that expect a normal file to be on the other end. I think this is pretty minimal though as a normal “reload” of the page should do the traditional thing and return the latest n entries.
August 15, 2005 at 9:37 am
I’ve been tracking feed compression for quite a while, and it turns out that not very many feeds are compressed. Out of 420,000 or so feeds listed on Syndic8, about 12,000 were compressed. There’s a chart with full information, along with instructions for enabling mod_gzip on my blog (http://www.syndic8.com/~jeff/blog/index.php?p=287).
August 21, 2005 at 5:33 pm
I’m also curious about the average size per post. With more and more folks doing photoblogs, wouldn’t that skew the numbers? Those images add up.
August 21, 2005 at 5:37 pm
Lorelle.
Images wouldn’t really count since they’re external via the img tag.
Kevin