Looks like Rich was playing with the Persai tar.gz web crawl they posted the other day.
I got a sinking feeling as I read this. I had curl’d over the corpus already to eyeball it …yeah that’s a list of feeds all right… but hadn’t tallied the domains…
$ sed -e 's/^http:..//' -e 's/\/.*$//' persai_feedcorpus | count | head
35695 rss.topix.net
14613 izynews.de
2831 feeds.feedburner.com
1869 p.moreover.com
1314 www.livejournal.com
1241 rss.groups.yahoo.com
1191 www.discountwatcher.com
1096 news.bbc.co.uk
1072 www.alibaba.com
882 xml.newsisfree.com
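For anyone who wants to reproduce the tally without my `count` helper, here is a rough sketch using only standard tools. It assumes `count` amounts to something like `sort | uniq -c | sort -rn`, and the function name `tally_domains` is made up for this example. It also strips any URL scheme, not just `http:`.

```shell
# Hypothetical helper: tally feed URLs by host, most frequent first.
# Reads one URL per line on stdin and prints "count host" lines.
tally_domains() {
  # Strip the scheme (http://, https://, ...) and everything from the first slash on,
  # leaving only the host part of each URL.
  sed -e 's|^[a-zA-Z]*://||' -e 's|/.*$||' |
    # Group identical hosts, count them, and sort by count in descending order.
    sort | uniq -c | sort -rn |
    # Drop the leading padding that uniq -c adds to the count column.
    awk '{ print $1, $2 }'
}
```

Something like `tally_domains < persai_feedcorpus | head` should then give the same kind of top-ten list as above.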
Anyone reading my blog know the guys over at Persai?
Update:
Sam Ruby posts an HTTP response code analysis of the corpus.
Of course, we have internal response code stats, but broken feeds don’t make it into Spinn3r.











