This is really exciting because we’re also hoping to open source more components before April.
We present the backend architecture behind Spinn3r – our scalable web and blog crawler.
Most existing work on scaling MySQL has focused on read-heavy environments such as web applications. In contrast, at Spinn3r we needed to complete thousands of write transactions per second in order to index the blogosphere at full speed.
We have achieved this through ground-up development of a fault-tolerant distributed database and compute infrastructure, all built on top of cheap commodity hardware.
We’ve built a number of technologies on top of MySQL that let us scale operations easily.
We’ve implemented an open source load-balancing JDBC driver named lbpool. Lbpool lets us loosely couple our MySQL slaves, which allows us to handle system failures gracefully. It also supports load balancing, reprovisioning, handling of slave lag, and other advanced features not available in the stock MySQL JDBC driver.
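To make the idea of loosely coupled slaves concrete, here is a minimal sketch of lag-aware round-robin selection. It is not lbpool's actual API: the class names (Replica, ReplicaBalancer), the fields, and the maxLagSeconds threshold are all assumptions made for illustration, and the health and lag values are expected to be updated by some monitor that is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** One MySQL slave as the balancer sees it (hypothetical model, not lbpool's). */
final class Replica {
    final String jdbcUrl;
    volatile boolean alive = true;   // flipped by a health check (not shown)
    volatile long lagSeconds = 0;    // replication lag reported by a monitor (not shown)

    Replica(String jdbcUrl) {
        this.jdbcUrl = jdbcUrl;
    }
}

/** Round-robin selection that skips dead or lagging slaves (illustrative only). */
final class ReplicaBalancer {
    private final List<Replica> replicas;
    private final long maxLagSeconds;
    private final AtomicInteger next = new AtomicInteger();

    ReplicaBalancer(List<Replica> replicas, long maxLagSeconds) {
        this.replicas = new ArrayList<>(replicas);
        this.maxLagSeconds = maxLagSeconds;
    }

    /** Returns the next usable slave, or throws if none qualifies. */
    Replica pick() {
        int n = replicas.size();
        // Try each slave at most once, starting from the round-robin cursor.
        for (int i = 0; i < n; i++) {
            int idx = Math.floorMod(next.getAndIncrement(), n);
            Replica r = replicas.get(idx);
            if (r.alive && r.lagSeconds <= maxLagSeconds) {
                return r;
            }
        }
        throw new IllegalStateException("no usable replica");
    }
}
```

A caller would open a connection to pick().jdbcUrl for each read. Because a failed or lagging slave is simply skipped, machines can be taken out of rotation or reprovisioned without the application having to know about it.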
We’ve also built a sharded database similar to infrastructure built at other companies such as Google (AdWords) and Yahoo (Flickr). Our sharded DB has a number of interesting properties, including ultra-high throughput requirements (we process 52TB per month), distributed sequence generation, and distributed query execution.
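As a rough illustration of sharding and distributed sequence generation, the sketch below routes a key to a shard by modulo and hands out IDs from interleaved ranges so that nodes never need to coordinate. This is one common approach and not necessarily the scheme Spinn3r uses; ShardRouter, InterleavedSequence, and their parameters are hypothetical names introduced here.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Maps a numeric key to one of a fixed number of shards (assumed scheme). */
final class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** floorMod keeps the result in [0, shardCount) even for negative keys. */
    int shardFor(long key) {
        return (int) Math.floorMod(key, (long) shardCount);
    }
}

/**
 * Coordination-free ID generator: node i of n emits i, i + n, i + 2n, ...
 * so no two nodes can ever produce the same ID.
 */
final class InterleavedSequence {
    private final int nodeIndex;
    private final int nodeCount;
    private final AtomicLong counter = new AtomicLong();

    InterleavedSequence(int nodeIndex, int nodeCount) {
        if (nodeCount <= 0 || nodeIndex < 0 || nodeIndex >= nodeCount) {
            throw new IllegalArgumentException("need 0 <= nodeIndex < nodeCount");
        }
        this.nodeIndex = nodeIndex;
        this.nodeCount = nodeCount;
    }

    long next() {
        return counter.getAndIncrement() * nodeCount + nodeIndex;
    }
}
```

One consequence of this particular pairing: when the number of sequence nodes equals the number of shards, every ID generated on node i routes back to shard i, so a row can be written on the same shard that issued its ID. The counter here lives only in memory; a real deployment would need to persist it across restarts.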