It seems to be impossible to do client-side paging over a full table scan in MySQL.

For example, say you want to take a 1GB file and page through it 10MB at a time.

With a flat file I could just read 10MB off disk, read the next 10MB, etc.

This would be amazingly efficient as you could write the data sequentially, without updating indexes, and then read the data back out by byte offset.
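A minimal sketch of that flat-file paging, assuming Python. The helper names `read_pages` and `read_page_at` are made up for illustration, and the demo uses a tiny temp file in place of the 1GB file and 10MB pages:

```python
import os
import tempfile


def read_pages(path, page_size):
    """Yield (byte_offset, chunk) pairs by reading the file sequentially."""
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(page_size)
            if not chunk:
                break
            yield offset, chunk
            offset += len(chunk)


def read_page_at(path, page_number, page_size):
    """Jump straight to a page by byte offset; no index is involved."""
    with open(path, "rb") as f:
        f.seek(page_number * page_size)
        return f.read(page_size)


if __name__ == "__main__":
    # Tiny stand-in for the 1GB file paged 10MB at a time.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 25)
    try:
        for offset, chunk in read_pages(tmp.name, 10):
            print(offset, len(chunk))
    finally:
        os.remove(tmp.name)
```

The only state the client has to remember between pages is a byte offset.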

You could page through MySQL tables by adding an index to a table:

SELECT * FROM FOO WHERE PTR > 10000 LIMIT 10000;

for example … but I REALLY want to avoid an index step because it is not cheap and only required since MySQL doesn’t support cursors.

This index slows down the import stage and I have to buffer the index data in memory which is just a waste.
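For reference, the index-based paging looks roughly like this. It is only a sketch: it uses Python's built-in sqlite3 as a stand-in for MySQL so it can run anywhere, the DATA column is hypothetical, and it adds an ORDER BY so each page picks up exactly where the last one ended:

```python
import sqlite3


def page_by_key(conn, page_size):
    """Yield pages of rows from FOO, resuming after the last PTR seen."""
    last_ptr = -1  # assumes PTR values are non-negative integers
    while True:
        rows = conn.execute(
            "SELECT PTR, DATA FROM FOO WHERE PTR > ? ORDER BY PTR LIMIT ?",
            (last_ptr, page_size),
        ).fetchall()
        if not rows:
            break
        yield rows
        last_ptr = rows[-1][0]


def demo_connection(n_rows):
    """Build an in-memory FOO table for the demo (sqlite3, not MySQL)."""
    conn = sqlite3.connect(":memory:")
    # The PRIMARY KEY is the index that has to be maintained during import.
    conn.execute("CREATE TABLE FOO (PTR INTEGER PRIMARY KEY, DATA TEXT)")
    conn.executemany(
        "INSERT INTO FOO (PTR, DATA) VALUES (?, ?)",
        [(i, "row-%d" % i) for i in range(n_rows)],
    )
    return conn


if __name__ == "__main__":
    for page in page_by_key(demo_connection(25), 10):
        print(page[0][0], page[-1][0], len(page))
```

It works, but every page depends on the index on PTR, which is exactly the cost described above.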

I could use LIMIT with OFFSET but this isn’t going to be amazingly efficient because it will either require us to use a temporary on disk table or force us to read each row off disk.

Technically, one could implement offset efficiently with a fixed-width table since you can compute the byte offset as row_width*N but I don’t think MySQL implements this optimization.

Further, my tables are variable width.
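To make the row_width*N idea concrete, here is a small sketch in Python. The row layout (a 4-byte id plus an 8-byte payload) is invented for the example and says nothing about how MySQL stores rows:

```python
import io
import struct

# Hypothetical fixed-width row: 4-byte unsigned id + 8-byte payload = 12 bytes.
ROW = struct.Struct("<I8s")


def read_row(f, n):
    """Read row n by seeking directly to byte offset row_width * n."""
    f.seek(ROW.size * n)
    return ROW.unpack(f.read(ROW.size))


if __name__ == "__main__":
    # Five rows packed back to back in an in-memory "table".
    table = io.BytesIO(b"".join(ROW.pack(i, b"payload%d" % i) for i in range(5)))
    print(read_row(table, 3))
```

With variable-width rows the offset of row N depends on every row before it, so this shortcut is not available.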

If MySQL supported cursors then I could just tell MySQL to perform a table scan and read the data from a cursor.

I’m implementing something similar to a Map Reduce job in MySQL and Java which would work VERY well on Hadoop but is nearly impossible to implement with MySQL efficiently since I can’t page through data without over-indexing my tables.

Thoughts? I would love it if there’s some stupid solution that I’m just missing.

I suspect this will become more and more of a problem as more developers want to just flat out Map Reduce their data.

Update:

I was wrong. The solution is HANDLER. Ryan Thiessen nailed it in the comments.
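A rough sketch of what paging with HANDLER can look like from the client side, in Python. It assumes `cursor` is a DB-API cursor from whatever MySQL driver you use; the function name `handler_scan` is made up, and the table name is interpolated straight into the SQL, so it must be a trusted identifier:

```python
def handler_scan(cursor, table, page_size):
    """Page through `table` in natural row order using MySQL HANDLER statements.

    `cursor` is assumed to be a DB-API cursor connected to MySQL.
    """
    cursor.execute("HANDLER %s OPEN" % table)
    try:
        position = "FIRST"
        while True:
            # READ FIRST starts the scan; READ NEXT continues where it left off.
            cursor.execute(
                "HANDLER %s READ %s LIMIT %d" % (table, position, page_size)
            )
            rows = cursor.fetchall()
            if not rows:
                break
            yield rows
            position = "NEXT"
    finally:
        cursor.execute("HANDLER %s CLOSE" % table)
```

HANDLER reads are tied to the session that opened the handler, so the whole scan has to stay on one connection.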

Spinn3r is hiring an experienced Senior Operations and Infrastructure Engineer with solid Linux and MySQL skills and a passion for building scalable, high-performance infrastructure.

This role is about 80% engineering in future infrastructure work (so that we can scale into the future) and 20% routine operational tasks.

It’s a great opportunity for the right candidate. The fact that most of this will be engineering and building new systems should make it a lot more challenging than a traditional operations role.

My goal is to spend the right amount of time doing long term infrastructure work so that fires and machine failures are fully routine and require no human interaction.

Right now they require little human interaction but it’s still required in some of the more serious circumstances.

This IOPS distribution is very interesting.

I’m playing with a RAID array of 5x Intel X-25E drives.

It turns out they need a lot of tuning. I’ll blog about this later.

What is more interesting is this distribution of IOPS across threads.

This is using sysbench and the seqrd file IO test.

My hypothesis is that it’s wear leveling the ext3 block group.

I also tried using ext3 striding and that did yield a performance boost. I’m going to try XFS but I’m on a CentOS box for testing.

Anyone have another theory as to what could be causing this?

I can’t wait until RAID is dead.

Update: This problem actually happens on a single 1x Intel X-25E so it seems like a hardware issue.

Update 2: It turns out that this is NOT a bug with the Intel SSD. I’m pretty sure it’s a bug (or misconfiguration on my part) of sysbench. Performing the same tests with ‘dd’ shows that IO scales linearly up until at least 10 parallel sequential reads. So something else is broken here.

Bing and Google Pay for Twitter

It looks like Google and Bing are now paying for a Twitter firehose feed.

What’s interesting here isn’t the deal – what’s interesting is that Google is PAYING for content.

In the past they have refused to pay for content acquisition. Your pay was the traffic that Google sent your way.

Now the NYTimes can come to Google and they have a bit more leverage this time. You’re paying for Twitter content! Hand us over some change for the NYTimes or we’ll cut you off…

My take on the subject is that content providers like Facebook, MySpace, etc should create open networks for their users.

There’s a huge value in creating an open eco-system around content. Creating artificially walled gardens is so CompuServe/AOL. This is Web 2.0. You’re either open or you’re (eventually) dead.

Intel’s 1M IOPS desktop SSD setup

What do you get when you take 7 Intel SSDs and throw them in a desktop? 1M IOPS:

“So as we look at optimizing some of those things,” he said, “like interrupts, driver speculation, improving the physical interface between SSDs, and the system, we expect great gains in power, performance, and cost.”

Gains in performance, for example, that add up to 1M IOPS (input/output operations per second) in Coulson’s lab, where he hooked up a dual Xeon 5500 desktop tower to seven Intel SSD prototypes – four in the tower and three in a PCIe expansion box – and ran a 4K read/write benchmark on it.

“This many I/Os per second is about four gigabytes per second of storage bandwidth,” he claimed. “As a storage guy, that’s a huge number, a very huge number.”

He also pointed out that during the test, the CPU utilization of the tower was about 50 per cent. “That’s really nice,” he smiled.
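A quick sanity check on the quoted bandwidth figure, in Python. It assumes every one of the 1M operations moves exactly 4K (4096 bytes), which the quote implies but does not spell out:

```python
def bandwidth_bytes_per_sec(iops, io_size_bytes):
    """Storage bandwidth implied by an IOPS figure at a fixed I/O size."""
    return iops * io_size_bytes


if __name__ == "__main__":
    bw = bandwidth_bytes_per_sec(1_000_000, 4 * 1024)
    print(bw / 1e9)    # decimal gigabytes per second
    print(bw / 2**30)  # binary gibibytes per second
```

That comes out just over 4 GB/s in decimal units, so the "about four gigabytes per second" claim lines up with 1M IOPS at 4K.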

InnoDB page size and SSD

Mark Callaghan made some progress with 4k pages in InnoDB. I ran some numbers on SSD with 8k pages but my 4k build would dump core.

The numbers for 8k show some potential.

The way I see this working is that one would buffer the first and second level btree pages in memory and the rest of the InnoDB database would be served from SSD.

We’re going to be trying to run InnoDB on 4x 64GB Intel SLC SSDs.

The key is that we have high sustained reads with a much larger storage capacity.

With 4k pages we’re going to be reading far less data off the SSD for each lookup. However, the SSDs are so fast that reading the extra data with larger pages might not impact performance significantly.

As an aside, InnoDB should be able to reconfigure page size at runtime. ALTER TABLE FOO page_size=4096 would be nice.

Spinn3r will be hosting an Open MySQL meetup at Oracle Open World (which is right down the street).

This would be on Wed 10/14 2009 at 7pm … at 580 Howard Suite 301 (Spinn3r HQ)

Oracle owns MySQL, InnoDB, etc so I suspect a lot of Oracle people and MySQL hackers will be interested in attending more of an Open Source and community centered meetup.

We’ll just be hanging out at our offices … we’ll have beer and food.

Feel free to bring your laptops as we have Wifi :)

This is contingent on at least 10 RSVPs as I want to make sure there is interest from the community.

Please RSVP here

It’s a bit late notice so if you could help spread the word by blogging about this that would be GREAT!

Has anyone done any more work on recompiling InnoDB with 4k pages and benchmarking under SSD?

We’re building out a new DB that uses very small records (around 32-64 bytes) so reading a whole 16k page to fetch one record should carry a performance cost.

I haven’t seen any benchmarks on 16k random read IOPS on the Intel SSD but my hunch is that there will be a 20-30% penalty here.

Though even if it was a 4x penalty that would still be about 9k transactions per second which is pretty good.
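The back-of-envelope numbers here can be written out in Python. The 36k baseline below is not a measurement; it is just the figure implied by "a 4x penalty would still be about 9k", and the records-per-page count ignores page and record headers:

```python
PAGE_16K = 16 * 1024


def records_per_page(page_bytes, record_bytes):
    """Upper bound on records per page, ignoring page and record headers."""
    return page_bytes // record_bytes


def rate_after_penalty(baseline_per_sec, penalty_factor):
    """Throughput left after dividing the baseline by a penalty factor."""
    return baseline_per_sec / penalty_factor


if __name__ == "__main__":
    # A 16k page holds hundreds of 32-64 byte records; a point lookup wants one.
    print(records_per_page(PAGE_16K, 64), records_per_page(PAGE_16K, 32))
    # Assumed baseline: the ~36k reads/sec implied by "4x penalty -> about 9k".
    print(rate_after_penalty(36_000, 4))
```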

On a personal note I just bought a new MacBook Pro which will be upgraded to the Intel X-25M MLC SSD.

Needless to say I’m very excited!

Spinn3r is Hiring a Senior MySQL DBA

We’re looking to hire a Senior MySQL DBA over at Spinn3r.

You should obviously have MySQL experience, love SQL, hate data corruption and slow queries, and preferably live in San Francisco.

Linux experience would be nice as well but not required.

Extra points if you are excited about SSD and *huge* amounts of data, or have hacked on Drizzle or XtraDB.

Spinn3r is a GREAT place to work. We’re growing fast and have cool new offices in SOMA.

Spinn3r is growing fast. We’ve had an exceptional month (an exceptional year actually). Closing new deals. Releasing new features for our customers. Working on new backend architecture changes, and generally having a lot of fun in the process.

We’ve been posting to Craigslist like mad in the last few weeks but I wanted to take the time to post to our blog.

We’re hiring five new Engineers to join the team with us here in San Francisco.

This is in addition to the two new Engineers we’ve hired in the last couple months.

We’re hiring two Crawl Engineers, an Operations Engineer, a Support and QA Engineer, and a Java Engineer.

Spinn3r is a great place to work. Smart people. Huge amounts of data. Great customers. New offices in SOMA (we’re in an awesome 103 year old building) and plenty of interesting problems to work on…

Here’s the problem I currently have.

We’re looking at deploying the Intel X-25M MLC SSD in production.

The problem is that this drive has a lower number of erase cycles, but is much cheaper, than the Intel X-25E SLC drive.

However, in our situation we’re write once, read many. I’m 99% certain that we will not burn out these drives. We write data to disk once and it is never written again.

The problem is that I can’t be 100% sure that this is the case. There are btree flushing and binary log issues that I’m worried about…

What would be really nice is an API (SMART?) that lets me enumerate the erase blocks on the drive, determine the max erase cycles, and read the current number of erase cycles.

This way, I can put an SSD into production, then determine the ETA to failure.

I can also add this to Nagios and Ganglia and trend the failure date and alert if the derivative is too high and the drive will soon fail.

Further, I can figure out if a database design is flawed. If I deploy a new database into production and the failure ETA is too soon after 24 hours I know that something is wrong. Either a misconfiguration or a problem with the design.
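Here is a sketch of the ETA calculation I have in mind, in Python. The erase-cycle readings are hypothetical inputs: this assumes some API exposes a max erase cycle count and a current count, which is exactly the piece that is missing. The function names are made up too:

```python
def days_until_wearout(max_erase_cycles, samples):
    """Linearly extrapolate wear-out from (day, erase_cycles_used) samples.

    Returns the days remaining after the last sample, or None if the
    samples show no wear yet.
    """
    (d0, c0), (d1, c1) = samples[0], samples[-1]
    if d1 <= d0 or c1 <= c0:
        return None
    rate = (c1 - c0) / (d1 - d0)  # erase cycles consumed per day
    return (max_erase_cycles - c1) / rate


def should_alert(eta_days, min_days):
    """Alert when the projected failure date is closer than min_days."""
    return eta_days is not None and eta_days < min_days


if __name__ == "__main__":
    # Hypothetical readings taken 24 hours apart on a 10,000-cycle drive.
    eta = days_until_wearout(10_000, [(0, 100), (1, 110)])
    print(eta, should_alert(eta, 365))
```

The same number could be fed into Nagios or Ganglia as a plain gauge and trended over time.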

I think this would solve a LOT of the problems with deploying SSD in enterprise environments. (MySQL, Oracle, etc)

Thoughts on InnoDB Record Format

I’m reading more about some of the InnoDB internals as I’m about to start working on a new database on the order of 1-5TB of memory.

I want to make sure everything is tight and well represented on disk.

Here are some random thoughts.

- If the table has a suffix of potentially NULLable columns it should be possible to just omit these in the field definition. It’s a tough trade off though as one would need to use a few bits extra to terminate the field list (or represent the length of the field list).

- The ‘extra bytes’ section could optionally omit the ‘next 16 bits’ pointer to the next record in the page and shave off 2 bytes of storage per record. This might not seem like a lot but I’m trying to store millions of short records per box. These are from 16-24 bits in length. Another 16 bits is a significant data overhead. These records are mostly fixed so being able to omit the next record pointer can really help use memory more efficiently. It seems like there are a few reserved bits that can be used for this purpose.

OCZ has their new PCI flash devices on the market at $3.36 per GB … these guys are FAST.

Read: up to 500 MB/s. Write: up to 470 MB/s. Sustained write: up to 200 MB/s.

They still aren’t commodity just yet but flash on PCI is very interesting.

Another major issue here is the device driver.

If OCZ were to OSS their driver and get it into the Linux kernel I’m sure Debian folks would go this route rather than FusionIO (their driver is still closed source).


Matt just announced that WordPress will support the new RSS cloud protocol.

This ping model has already existed with Ping-o-Matic of course (which Matt/WordPress have been running since the blog epoch) and Spinn3r customers already benefit from this. In fact, we’ve been realtime for a long time now.

WordPress.com has always supported update pings through Ping-o-Matic so folks like Google Reader can get your posts as soon as they’re posted, but getting every ping in the world is a lot of work so not that many people subscribe to Ping-o-Matic. RSS Cloud effectively allows any client to register to get pings for only the stuff they’re interested in.

We haven’t announced this yet but we pushed a new filtering API in Spinn3r in the last release. We developed a domain specific language for filtering web content in real time.

A number of our customers have already started using this in production.

It’s nice that more people are pushing realtime content but I’m starting to worry about the proliferation of protocols here. XMLRPC pings are the old school way of handling things. Now there are also Pubsubhubbub, the Twitter stream API, SUP, etc.

However, I’ve played with most of these and think that they are all lacking in some area. One major problem is relaying messages when nodes fail and then come back online. For example, with XMLRPC pings, or the Twitter stream API, if my Internet connection fails, I’ve lost these messages forever.

The Spinn3r protocol doesn’t have this problem and supports resume. You just start off from where you last requested data and nothing is lost. We keep infinite archives so nothing is ever lost.

I don’t think most sites can support this much data (it’s expensive) but certainly a few hours of buffer, held in memory, seems reasonable to handle a transient outage.
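To show what I mean by resume with a small in-memory buffer, here is a minimal sketch in Python. This is not the Spinn3r protocol; the class name `ResumableFeed` and the sequence-number scheme are assumptions made for the example:

```python
from collections import deque


class ResumableFeed:
    """Keep the most recent messages in memory so clients can resume."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest entry once capacity is reached.
        self._buffer = deque(maxlen=capacity)
        self._next_seq = 0

    def publish(self, message):
        """Store a message under the next sequence number."""
        self._buffer.append((self._next_seq, message))
        self._next_seq += 1

    def read_after(self, last_seq):
        """Return buffered (seq, message) pairs newer than last_seq."""
        return [(s, m) for s, m in self._buffer if s > last_seq]


if __name__ == "__main__":
    feed = ResumableFeed(capacity=3)
    for msg in ["a", "b", "c", "d"]:
        feed.publish(msg)
    # A client that last saw sequence 1 reconnects and catches up.
    print(feed.read_after(1))
```

A client that stays offline longer than the buffer covers still loses messages, which is why the buffer size has to match the outages you expect.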

ReadWriteWeb has more on this and is leading with a somewhat sensational title that would imply that these blogs were not real time in the past.

TechCrunch has more, as does Scobleizer.

One big issue with these protocols is spam. If it’s an open cloud any spammer can send messages into the cloud (which is the case with Ping-o-Matic, which receives 90% spam). And of course spammers can receive messages from the cloud to train their own classifiers and find spam targets.

We have an AUP with Spinn3r that prevents this usage. We’ve removed spam from the feed to begin with which is nice for our customers and allows them to build algorithms without having to worry about any attacks.

Spinn3r is growing fast. Time to hire another engineer. Actually, we’re hiring for like four people right now so I’ll probably be blogging more on this topic.

My older post on this subject still applies for requirements.

If you’re a Linux or MySQL geek we’d love to have your help.

Did I mention we just moved to an awesome office on 2nd and Howard in downtown SF?

Spinn3r Hiring Support Engineer

We’re hiring a Support Engineer at Spinn3r. This is a key hire (and will take a lot of work off my shoulders) so we plan on taking our time to find the right candidate.

That said, this is an awesome opportunity to get in and work on a rapidly growing startup.

About Spinn3r:

Spinn3r is a licensed weblog crawler used by search engines, weblog analytic companies, and generally anyone who needs access to high quality weblog and social media data.

We crawl the entire blogosphere in real time, rank and classify blogs, and remove spam. We then provide this information to our customers in a clean format for use within IR applications.

Spinn3r is rare in the startup world in that we’re actually profitable. We’ve proven our business model which gives us a significant advantage in future product design and expanding our current customer base and feature set.

We’ve also been smart and haven’t raised a dime of external VC funding which gives us a lot more flexibility in terms of how we want to grow the company moving forward.

For more information please visit our website.

Responsibilities:

  • Interact with customers both in the early sales cycle and in a support role to answer technical questions about our technology (crawling, ranking, etc) (20%).
  • Monitor our crawler stats to understand its operation and detect operational anomalies, implement new features, etc. (20%).
  • Work on Java implementation of various new Spinn3r features as well as fix bugs in our current product. You will also be working on infrastructure in this position and responsible for various backend Java components of our architecture. (60%)
  • General passion and interest in technology (distributed systems, open content, Web 2.0, etc).

I should stress that while you’ll be interacting with customers, and providing support, our customers are exceedingly brilliant and amazingly knowledgeable about our space. They’re a major asset and staying in sync with them is very important for the company.

Requirements and Experience:

  • Java (though Python, C, C++, etc would work fine).
  • Ability to understand customer needs and prioritize feature requests.
  • Friendly, patient, and excellent people skills when interacting with customers.
  • Understanding HTTP
  • Databases (MySQL, etc).
  • Ability (and appreciation) for working in a Startup environment.
  • Must like cats :)

NYTimes, on Memetracker, and Spinn3r

I didn’t have time to blog about this when it was originally posted but the NYTimes has a great piece on the cool work Jure Leskovec and Jon Kleinberg have done on Memetracker (which is powered by Spinn3r).

For the most part, the traditional news outlets lead and the blogs follow, typically by 2.5 hours, according to a new computer analysis of news articles and commentary on the Web during the last three months of the 2008 presidential campaign.

The finding was one of several in a study that Internet experts say is the first time the Web has been used to track — and try to measure — the news cycle, the process by which information becomes news, competes for attention and fades.

Researchers at Cornell, using powerful computers and clever algorithms, studied the news cycle by looking for repeated phrases and tracking their appearances on 1.6 million mainstream media sites and blogs. Some 90 million articles and blog posts, which appeared from August through October, were scrutinized with their phrase-finding software.

If you’ve studied graph theory you may have bumped into The Königsberg Bridge Problem which is essentially the first modern use of Graph Theory.

The city of Königsberg in Prussia (now Kaliningrad, Russia) was set on both sides of the Pregel River, and included two large islands which were connected to each other and the mainland by seven bridges.
The problem was to find a walk through the city that would cross each bridge once and only once. The islands could not be reached by any route other than the bridges, and every bridge must have been crossed completely every time (one could not walk halfway onto the bridge and then turn around to come at it from another side).

Euler proved that the problem has no solution.
To start with, Euler pointed out that the choice of route inside each landmass is irrelevant. The only important feature of a route is the sequence of bridges crossed. This allowed him to reformulate the problem in abstract terms (laying the foundations of graph theory), eliminating all features except the list of landmasses and the bridges connecting them. In modern terms, one replaces each landmass with an abstract “vertex” or node, and each bridge with an abstract connection, an “edge”, which only serves to record which pair of vertices (landmasses) is connected by that bridge. The resulting mathematical structure is called a graph.

… you should probably take a look at the full Wikipedia node. It’s a clever proof.
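The heart of the proof is a parity check on how many bridges touch each landmass, which is easy to sketch in Python. The labels below (north, south, island, east) are my own names for the two banks and the two islands, and the check assumes the graph is connected:

```python
from collections import Counter

# The seven bridges as edges between the four landmasses.
BRIDGES = [
    ("north", "island"), ("north", "island"),
    ("south", "island"), ("south", "island"),
    ("island", "east"),
    ("north", "east"),
    ("south", "east"),
]


def odd_degree_vertices(edges):
    """Return the vertices touched by an odd number of edges."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(v for v, d in degree.items() if d % 2 == 1)


def has_euler_walk(edges):
    """True if a connected multigraph has a walk using every edge exactly once.

    That requires zero or two odd-degree vertices; connectivity is assumed.
    """
    return len(odd_degree_vertices(edges)) in (0, 2)


if __name__ == "__main__":
    print(odd_degree_vertices(BRIDGES))
    print(has_euler_walk(BRIDGES))
```

All four landmasses touch an odd number of bridges, so no such walk exists. Take away any single bridge and only two odd vertices remain, which is enough for a walk that starts on one of them and ends on the other.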

Anyway. I thought it would be fun to one day visit Königsberg which is now the modern day city of Kaliningrad in Russia.

Then it dawned on me that I could use Google Maps to find it and sure enough it’s still there.

One of the bridges appears to have been removed.


Spinn3r will be at the Real Time Stream Crunchup tomorrow (which should be fun).

There should be more announcements about realtime RSS which is interesting:

While there is an argument to be made that RSS is dying, being replaced by more instantaneous forms of content delivery such as Twitter and other real time streams, many people aren’t quite yet ready to give up on it. Instead, they want to save it by speeding it up. Tomorrow, at our Real Time Stream CrunchUp, we will see three demos of projects that do just that in slightly different ways.

Google engineers Brad Fitzpatrick and Brett Slatkin will show a demo of a new push protocol called pubsubhubub, Netvibes CEO Freddy Mini will demo his similar RSS Instant Update Hub, and WordPress engineer Andy Skelton will show off a Jabber client which uses the XMPP protocol to push blog headlines into an IM-like environment faster than RSS.

If these receive significant adoption Spinn3r will implement them pretty quickly (we push revisions every week).

Gnip also launched their early stage partner program:

Gnip is also launching a early-stage startup partner program that will let startups access to all of Gnip’s service features and data services. The program is aimed towards software development startups that have been in business for less than 3 years and generating less than $200,000 in revenue. Of course, Gnip requires that partners pay a fee of $1000 but says the services that they will receive are valued at $10,000 per month.

Spinn3r has been offering this for more than three years now. If you’re an early stage startup and you need access to Spinn3r data we can do so at a fraction of the price.

I should also note that if you’re a research organization we can provide you with data for free. We provide access to Spinn3r data to more than 100 PhDs at universities worldwide.

Spinn3r Hiring Crawl Engineer

We’re hiring a Crawl Engineer at Spinn3r. This is a key hire (and will take a lot of work off my shoulders) so we plan on taking our time to find the right candidate.

That said, it’s an awesome opportunity to get in and work on a rapidly growing startup.

About Spinn3r:

Spinn3r is a licensed weblog crawler used by search engines, weblog analytic companies, and generally anyone who needs access to high quality weblog and social media data.

We crawl the entire blogosphere in real time, rank and classify blogs, and remove spam. We then provide this information to our customers in a clean format for use within IR applications.

Spinn3r is rare in the startup world in that we’re actually profitable. We’ve proven our business model which gives us a significant advantage in future product design and expanding our current customer base and feature set.

We’ve also been smart and haven’t raised a dime of external VC funding which gives us a lot more flexibility in terms of how we want to grow the company moving forward.

For more information please visit our website.

Responsibilities:

  • Maintain our current crawler.
  • Monitor and implement statistics behind the current crawler to detect anomalies.
  • Implement new features for customers
  • Work on backend architecture to improve performance and stability.
  • Implement custom protocol extensions for enhanced metadata and site specific social media support.
  • Work on new products and features using large datasets.

Requirements and Experience:

  • Java (though Python, C, C++, etc would work fine).
  • HTML, XML, RSS or Atom.
  • HTTP
  • Distributed systems.
  • Databases (MySQL, etc).
  • Algorithm design (especially in distributed systems).
  • Ability (and appreciation) for working in a Startup environment.
  • Must like cats :)

Ideal:

  • Past experience running and working with large crawlers.
  • Understanding of IR algorithms (K-means, naive bayes, inverted index compression, etc).
  • Experience within the Open Source community.

Questions:

For bonus points, feel free to answer the following questions in your email:

  • You have a live corpus of text and HTML from 25M weblogs. You want to cluster these weblogs into logical communities (tech, politics, entertainment, etc). How (and with what algorithm) would you cluster and rank the content within a reasonable time? What computational resources would this require (memory, CPU, network bandwidth, etc)?
  • You are building a ranking algorithm. This algorithm will execute across a large link graph. How would you store the graph to use the smallest (yet reasonable) amount of computational resources (memory and CPU)?
