I’m going to be migrating Spinn3r to ZooKeeper for a myriad of reasons, but this one is especially powerful.

One could use ZooKeeper to configure external monitoring systems like Munin and Ganglia.

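ZooKeeper enables this with its support for ephemeral files (ephemeral nodes, in ZooKeeper’s terms), which disappear automatically when the session of the client that created them ends.

If you have an external process like a webserver, database, robot, etc., you can have it create an ephemeral file which registers its services and presence in the cluster.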

For example:

/services/www/www32.mydomain.com/80

This would represent a web server running on port 80 of a machine named www32.mydomain.com.

You can then have munin connect to ZooKeeper and enumerate files in /services/www and have a cron script continually regenerate a munin.conf file.

The great thing is that if you shut down Apache, the change will automatically be reflected in your munin config.

This of course implies that you have ZooKeeper integration in your init scripts.
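As a rough sketch of the cron-side half of this, in Java: given the host names found under /services/www, render one munin.conf stanza per host. The class and method names are made up for illustration, and the host list is passed in as a plain list; in a real setup it would presumably come from the ZooKeeper Java client’s getChildren("/services/www", false) call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: turns the children of /services/www into munin.conf stanzas.
final class MuninConfigSketch {

    /** Renders one munin.conf host stanza per registered host name. */
    static String render(List<String> hosts) {
        List<String> sorted = new ArrayList<>(hosts);
        // Sort so the output is deterministic regardless of listing order.
        Collections.sort(sorted);

        StringBuilder out = new StringBuilder();
        for (String host : sorted) {
            out.append('[').append(host).append("]\n");
            out.append("    address ").append(host).append('\n');
            out.append("    use_node_name yes\n\n");
        }
        return out.toString();
    }
}
```

Because the output is sorted, the cron script can compare it against the existing file and only rewrite munin.conf when the set of live hosts has actually changed.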

This is becoming easier with the presence of HTTP gateways for ZK. I haven’t looked at them too much but as long as PUT and GET are supported this is about 80% of the functionality needed to implement the above.

One issue is enumerating files in directories over HTTP. I assume a proprietary XML protocol is used.

We’ve open sourced the web thumbnail backend that we use within Tailrank.

It needs some work but if you’re ready to get your hands dirty then webthumb will get you 80% of the way to a scalable thumbnail backend.

The API is pretty simple. You just make a REST call to webthumb with the URL you want a thumbnail of and it performs everything for you. Errors are represented by HTTP 500 status codes and a successful request will generate an HTTP 302 redirect to a static file.
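To make that contract concrete, here is a minimal client sketch in Java. The endpoint and the url query parameter are assumptions for illustration only (check the webthumb source for the real request format); the only parts taken from the description above are the 302 on success and the 500 on error.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical client; the request format is assumed, not taken from webthumb.
final class WebthumbClientSketch {

    /** Interprets a webthumb response: 302 carries the thumbnail location, anything else is a failure. */
    static String thumbnailLocation(int status, String location) {
        if (status == HttpURLConnection.HTTP_MOVED_TEMP && location != null) {
            return location; // redirect target is the static thumbnail file
        }
        throw new IllegalStateException("webthumb request failed with HTTP " + status);
    }

    /** Asks a webthumb instance for a thumbnail of pageUrl and returns the static file URL. */
    static String requestThumbnail(String endpoint, String pageUrl) throws IOException {
        // Assumed request shape: <endpoint>?url=<encoded page URL>
        URL url = new URL(endpoint + "?url=" + URLEncoder.encode(pageUrl, "UTF-8"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Do not follow the redirect; we want to read the Location header ourselves.
        conn.setInstanceFollowRedirects(false);
        try {
            return thumbnailLocation(conn.getResponseCode(), conn.getHeaderField("Location"));
        } finally {
            conn.disconnect();
        }
    }
}
```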

There are a few major reasons we’ve decided to open this up:

- Web thumbnails are no longer a competitive feature for Tailrank.

- A lot of people are doing this now and it makes sense to use an OSS framework.

- I want to extend this platform for use in malware detection including doorway page detection and javascript redirects.

- I want to extend the backend to support virtualization. This way a webthumb instance can be started, test for malware, and then its image destroyed and restarted. This prevents any errant browser vulnerabilities from hurting any further malware detection or further thumbnail generation.

The way we integrate into Mozilla is a bit of a hack. You can specify a debug mode in Mozilla and it then logs URL status to a file. This works well but it would be nice to have more of an API call which isn’t async and provides a status to the caller. This can be accomplished with a browser extension but this hasn’t been written yet.

One problem with this model that we hit early on is that the browser can pop up dialog boxes that can then accidentally be included in the resulting thumbnail. Errors like ‘referenced font is not available’ and so forth showed up in early versions of Tailrank until we found out ways to disable them.

I’d also like to extend the platform to support a REST API for crawlers to integrate with the browser directly. It would be nice for a crawler to give a URL to a browser instance, render it, and then get back the resulting DOM within the crawler. This way you know what the resulting

Update: Integrating this with jssh would be hot!

JSSh is a Mozilla C++ extension module that allows other programs (such as telnet) to establish JavaScript shell connections to a running Mozilla process via TCP/IP. This functionality is useful for interactive debugging/development of Mozilla applications, remotely controlling Mozilla, or for automated testing purposes.


More competition for Mtron is right around the corner.

The Mtron PRO 7000 at 32GB is $1,129 or $35 per GB and can write at 90MB/s.

The Super Talent MasterDrive DX at 64GB is $1299 or $20 per GB and can do 70MB/s throughput.

So the Super Talent drive is about 22% slower but 42% cheaper.

Though I don’t think the MTBF is high enough for DB operations.

That and there’s no published IOPS specification. Kind of important….
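For what it’s worth, here is the arithmetic behind those percentages as a small Java sketch. It uses only the prices, capacities, and write speeds quoted above; the class and method names are just for illustration.

```java
// Reproduces the price-per-GB and throughput comparison from the figures quoted above.
final class SsdComparisonSketch {

    static double pricePerGb(double priceUsd, double capacityGb) {
        return priceUsd / capacityGb;
    }

    /** How much lower 'other' is than 'baseline', as a percentage of baseline. */
    static double percentLower(double baseline, double other) {
        return (baseline - other) / baseline * 100.0;
    }

    public static void main(String[] args) {
        double mtron = pricePerGb(1129, 32);        // roughly $35 per GB
        double superTalent = pricePerGb(1299, 64);  // roughly $20 per GB

        System.out.printf("Mtron: $%.2f/GB, Super Talent: $%.2f/GB%n", mtron, superTalent);
        System.out.printf("Super Talent is %.0f%% slower and %.0f%% cheaper per GB%n",
                percentLower(90, 70), percentLower(mtron, superTalent));
    }
}
```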

If this is true then I’m certainly going to stop buying Intel processors:

Processor manufacturer AMD has introduced new evidence in the anti-trust case against its competitor Intel, in federal court in the US state of Delaware. According to the Wall Street Journal, the evidence shows that Intel coerced and paid computer manufacturers like Dell, Acer, Gateway, IBM and Hewlett-Packard not to use any AMD products. Large swathes of the 108-page document that the court made public on Monday are blacked out to prevent trade secrets from being made public.

According to AMD, the new evidence is the result of an evaluation of 200 million pages of documents which AMD obtained from Intel and PC manufacturers in a discovery request. According to US press reports, AMD’s legal counsel, O’Melveny & Myers LLP, claims that these materials contain documented email exchanges between leading PC manufacturer employees and Intel that demonstrate the illegal practice of crowding competitors out of the market. Intel denies all of the allegations and accuses AMD of using the courts to protect itself from legitimate competition.

If you look at the timeline for the release of a product (open source or not) it generally forms a power law distribution on a Euclidean plane similar to the following:

[Graph: pending critical bugs falling off over time]

The y-axis is the number of pending critical bugs and the x-axis is time.

At some point the product managers realize that the reported bugs are falling in severity and that the total number of bugs is asymptotically approaching zero anyway, so they just pick a release date and go for it.

However, this isn’t the metric I care about.

I don’t have time to screw around with unproven software. There’s a major difference between a 1.0 release with zarro boogs and proven software that’s been pounded on by hundreds of thousands of users for 12 months and had all the kinks ironed out.

This is what I want to use. I don’t want to use the latest and greatest software if I can avoid it… The best developers in the world can NOT simulate all the QA that goes into a release which has been deployed in the wild for months and years.

We’re still on MySQL 4.1 for the majority of our cluster. We have 5.1 deployed in a small role but only because we really needed partitioning.

So I’d like to introduce the concept of negative bugs.

When a complex piece of software reaches 1.0 it doesn’t have zero bugs. They’re there, they just haven’t been reported yet. They’re essentially ‘negative bugs’ because they exist and will eventually be reported but just haven’t been found yet.

In fact, I’m willing to predict that they have a power law distribution similar to the above graph.

I’m going to put my money where my mouth is here as well. I’m going to pay a consultant to look at the critical bug report rate for MySQL 4.1 over time. Number of critical bugs fixed per week should be a good metric. (Note to self - this is what an intern would be good for).

It’s not exactly hard work to find these numbers. The data is public - it’s just tedious work.

If my theory is correct, one could compute a sweet spot at which point you should upgrade to future releases.

For example, say MySQL 4.1 stabilized after 8 months, and MySQL 5.0 stabilized at 7 months. Then waiting 6 months or so to adopt MySQL 5.1 seems like a safe bet.
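Here is a sketch of how that sweet spot could be computed once the numbers are in. It assumes ‘stabilized’ means the first month after which the critical bug count never again exceeds some threshold; that definition, the threshold, and the class name are my own assumptions for illustration.

```java
// Hypothetical helper for finding when a release 'stabilized'.
final class StabilizationSketch {

    /**
     * Returns the 1-based index of the first period from which every later
     * count stays at or below the threshold, or -1 if the series never settles.
     */
    static int stabilizationPoint(int[] criticalBugsPerPeriod, int threshold) {
        int point = -1;
        // Walk backwards from the most recent period until a count exceeds the threshold.
        for (int i = criticalBugsPerPeriod.length - 1; i >= 0; i--) {
            if (criticalBugsPerPeriod[i] > threshold) {
                break;
            }
            point = i + 1;
        }
        return point;
    }
}
```

With made-up monthly counts of 40, 22, 15, 9, 6, 7, 3, 2, 1, 1 and a threshold of 5, this reports month 7. Those counts are invented for the example and are not real MySQL data.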

Update: I just posted a gig on Craigslist for someone to compute the numbers.

A mainstream media outlet has FINALLY used the term ‘disapproval rating’ when talking about Bush.

To be fair, this should be used any time an opinion poll shows that only 49% of people approve of a President’s handling of a given situation.

Bush, Clinton, Obama, it doesn’t matter.

What bothers me is it takes nearly 70% disapproval for them to stop calling it a ‘30% approval rating’ …


We’ve had our SSDs in production for more than 72 hours now. We’ve had them in a slave role for nearly a week but they’ve now replaced existing hardware including the master.

The drives are FAST. In our production roles they’re reading at about 45MB/s and writing at about 15MB/s, at only about 22% disk utilization.

Not too bad.

We also have about 70GB free on these drives so that leaves plenty of room to grow.

There was a small problem that I didn’t anticipate.

When we were running our entire database out of memory we would only use one or two indexes per table. We had a set of reporting tasks which ran some queries once every 5 minutes.

The columns these queries were using didn’t have any indexes so InnoDB CPU would spike for a moment and continue.

Modern machines have memory bandwidth of about 15GB/s so these queries were mostly CPU bound but completed in a few seconds.

When we switched over to SSDs all of a sudden these queries needed to perform full table scans and were reading at about 100MB/s for two minutes at a time.

Fortunately, an ALTER TABLE later and a few more indexes fixed the problem.

We had dropped the indexes back when we were running entirely out of memory because the queries could be resolved so quickly. Now that the data was on disk again we had to revert to the olde school way of doing things.

Look at this stupid code:


/**
 * Gets the TimeZone for the given ID.
 *
 * @param ID the ID for a TimeZone, either an abbreviation
 * such as "PST", a full name such as "America/Los_Angeles",
 * or a custom ID such as "GMT-8:00". Note that the support of
 * abbreviations is for JDK 1.1.x compatibility only and full
 * names should be used.
 *
 * @return the specified TimeZone, or the GMT zone if the given ID
 * cannot be understood.
 */
public static synchronized TimeZone getTimeZone(String ID) {
    return getTimeZone(ID, true);
}

private static TimeZone getTimeZone(String ID, boolean fallback) {
    TimeZone tz = ZoneInfo.getTimeZone(ID);

    if (tz == null) {
        tz = parseCustomTimeZone(ID);
        if (tz == null && fallback) {
            tz = new ZoneInfo(GMT_ID, 0);
        }
    }

    return tz;
}

… so we have a getTimeZone method and I can pass it a string. OK. What if I pass in an invalid string like ‘asdfasdfaeeafaljaljdflj’?

Well then the getTimeZone method goes into fallback mode and uses GMT (zone zero).

That might be reasonable but I’d like to disable fallback and have it return null.

Nope. Can’t do that because the overloaded getTimeZone method is private.

We can’t even reliably detect that the fallback happened, because UTC/GMT is a perfectly valid time zone.

This is with the latest JDK 1.6 sources so this bug has existed at least since JDK 1.4 which is more than 5 years old.

This is why Java needed to be Open Source 10 years ago.
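A partial workaround, as a sketch: since an ID that can’t be understood comes back as the zone whose ID is "GMT", you can treat that result as a failure unless the caller literally asked for "GMT". The StrictTimeZone class is hypothetical, and this is a heuristic rather than a real fix; in particular, custom IDs with a zero offset may also come back as "GMT" and get rejected.

```java
import java.util.TimeZone;

// Hypothetical wrapper that returns null instead of silently falling back to GMT.
final class StrictTimeZone {

    /** Returns the zone for id, or null if the JDK appears to have fallen back to GMT. */
    static TimeZone lookup(String id) {
        TimeZone tz = TimeZone.getTimeZone(id);
        // The fallback zone has the ID "GMT"; only accept it if that is what was requested.
        if ("GMT".equals(tz.getID()) && !"GMT".equals(id)) {
            return null;
        }
        return tz;
    }
}
```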

There’s been more activity in the distributed consensus space recently.

At the Hypertable talk yesterday Doug mentioned Hyperspace, their Chubby-style distributed lock manager. Though I think it’s missing the ‘distributed’ part for now.

To provide some level of high availability, Hypertable needs something akin to Chubby. We’ve decided to call this service Hyperspace. Initially we plan to implement this service as a single server. This single server implementation will later be replaced with a replicated version based on Paxos or the Spread toolkit.

ZooKeeper seems to be making some progress as well.

Check out this recent video presentation (which I probably can’t embed so here’s the link).

In 2006 we were building distributed applications that needed a master, aka coordinator, aka controller to manage the sub processes of the applications. It was a scenario that we had encountered before and something that we saw repeated over and over again inside and outside of Yahoo!.

For example, we have an application that consists of a bunch of processes. Each process needs be aware of other processes in the system. The processes need to know how requests are partitioned among the processes. They need to be aware of configuration changes and failures. Generally an application specific central control process manages these needs, but generally these control programs are specific to applications and thus represent a recurring development cost for each distributed application. Because each control program is rewritten it doesn’t get the investment of development time to become truly robust, making it an unreliable single point of failure.

We developed ZooKeeper to be a generic coordination service that can be used in a variety of applications. The API consists of less than a dozen functions and mimics the familiar file system API. Because it is used by many applications we can spend time making robust and resilient to server failures. We also designed it to have good performance so that it can be used extensively by applications to do fine grained coordination.

We use a lock coordinator in Spinn3r and are very happy with the results. It’s a very simple system yet provides a LOT of functionality without much pain or maintenance.

Paxos Made Live is out as well. (I haven’t had time to read it yet).

It dawned on me today that there are now three independent branches of InnoDB development. Some might call them forks but it appears they are all friendly forks.

Hopefully they will merge at some point.

From what I see we have:

1. The official MySQL sources.

2. The InnoDB 5.1 plugin

3. The Google patches that enable faster multi-core performance due to mutex lock re-implementation

CPU utilization was the biggest issue we saw with InnoDB performance in our SSD tests. If we can get this fixed, InnoDB on SSD will probably be a bit faster than MyISAM.

My question is: what sources is the InnoDB 5.1 branch based on?

Does it have the latest InnoDB performance fixes that went into 5.0.30 and 5.0.54?

Here’s a copy of the slides from the talk I just gave about the architecture of Spinn3r at the 2008 MySQL Users Conference:

We present the backend architecture behind Spinn3r – our scalable web and blog crawler.

Most existing work in scaling MySQL has been around high read throughput environments similar to web applications. In contrast, at Spinn3r we needed to complete thousands of write transactions per second in order to index the blogosphere at full speed.

We have achieved this through our ground up development of a fault tolerant distributed database and compute infrastructure all built on top of cheap commodity hardware.

Spinn3R Architecture Talk - 2008 Mysql Users Conference

- It seems the biggest scalability issue in InnoDB has to do with its excessive use of inefficient mutexes to protect data structures. Turning down innodb_thread_concurrency should actually help performance on multi-core boxes.

- The performance problems really start to hit at about eight cores. Four cores is mostly fine but still takes a performance hit.

- Google implemented their mutexes using x86-specific compare and swap. Apparently, Monty is working on a CAS portability library. An audit of the room showed that 99% of the people were running on x86 anyway so this might not be an issue.

- If you’re having CPU issues, upgrade to > 5.0.30. There’s another fix in > 5.0.54 which is interesting.

- Google has also replaced the innodb malloc heap with a scalable malloc library (tcmalloc). For larger buffer pools this might make a big difference.

- MySQL 6 separates threads from connections. Google will backport this patch… (Awesome)

More notes from others are available as well.

Seagate Sues STEC

Seagate sues STEC (an SSD vendor):

Seagate was talking a big game last month about how SSD makers like Samsung and Intel were infringing its patents, and the company wasn’t joking around, following up all that tough talk with… what appears to be a test case against relatively minor vendor STEC. Seagate says STEC’s drives violate four patents it holds on SSD interfaces and that while “it’s not a big financial issue yet,” the company wants “to set things straight.” As you’d expect, STEC doesn’t feel quite as casual about the situation, saying that it’s been making SSDs since 1994, before any of Seagate’s patents were filed, and that it’s going to aggressively defend Seagate’s “desperate” claims and seek to invalidate its patents. many of which it believes aren’t even relevant to SSD technology. That sounds like a fight to us — get ready for some nonstop paperwork legal thrills, people.

STEC makes some really cool drives (though they’re a bit expensive).

Screw you Seagate. Stop being a patent bully. The real issue here is that you’re scared that SSD drives are going to strip away your market.

Read The Innovator’s Dilemma and stop hurting our industry.

I’m going to be talking to my hosting/hardware vendor to ask them if I can purchase HDDs other than Seagate.

I’m not going to do business with a company that feels that suing competitors for frivolous patent claims is an acceptable way of doing business.

Here’s an idea - build an innovative hard drive!

Since the news that Oracle acquired InnoDB I was a bit worried that they would just swallow and crush the project or just end of line it to screw over MySQL (and have Heikki work on Oracle-proper).

Boy was I wrong:

At the 2008 MySQL User Conference, we announced the initial availability of the early adopter release of the InnoDB Plugin. Now, beginning with Version 5.1 of MySQL, it is possible for users to swap out one version of InnoDB and use another.

Oracle announced a new version of InnoDB yesterday with a number of new and exciting features.

Best of all - it’s all Open Source and distributed under the GPL license!

It looks like Oracle is learning from MySQL and the Open Source community!

Unfortunately, in an ironic twist of fate, it looks like MySQL is learning from Oracle and RedHat and taking the proprietary route (at least with their Enterprise edition):

MySQL will start offering some features (specifically ones related to online backups) only in MySQL Enterprise. This represents a substantive change to their development model — previously they have been developing features in both MySQL Community and MySQL Enterprise. However, with a shift to offering some features only in MySQL Enterprise, this means a shift to development of those features occurring (and thus code being tested) only in MySQL Enterprise.

I think what’s happening here is that the Open Source community and MySQL are heading in different directions.

Most OSS developers that I talk to are using MySQL in web applications. However, web apps and MySQL were never a perfect fit (which is why you see web developers partitioning and sharding their data to get MySQL to scale).

Perhaps web 2.0 startups should skip InnoDB in favor of SimpleDB, S3, AppEngine, Hbase, or Hypertable?

This might be better for both parties. MySQL can focus on revenues from the enterprise (and justify their acquisition) and the Open Source community can stop feeling the pain of the impedance mismatch.

For example, I know of no large web companies excited by the new features of MySQL 5.1. It has some cool stuff but I’m more concerned about performance improvements to the core of InnoDB.

I wonder if this is a one-off or a future direction for Sun?

Give away the ZFS for free - charge for the fsck.

InnoDB 5.1 !

There’s a new version of InnoDB 5.1 available:

The early adopter release of InnoDB Plugin for MySQL is available for MySQL 5.1 and later, in source and binary (for most platforms) and is licensed under GPLv2. Users can dynamically INSTALL the InnoDB Plugin to replace the built-in InnoDB within MySQL 5.1 Linux and other Unix-like operating systems (and soon on Windows) without compiling from source or relinking MySQL. (For now, Windows users must re-build from source.)

Nice… check out this feature set!

Fast index creation: add or drop indexes without copying the data
Data compression: shrink tables, to significantly reduce storage and i/o
New row format: fully off-page storage of long BLOB, TEXT, and VARCHAR columns

Robot Yield

This morning I was thinking about robot blocks regarding Rich’s post about Cuill being blocked on 10k hosts.

So let’s say you write a web scale crawler and you accidentally pushed a bug. It was a huge mistake and you hurt a few hosts and end up being blocked.

A month passes and you’ve implemented a fix and a number of other features which make crawling easier on hosts in your cluster.

… basically you want another chance to crawl these sites. The problem is that you now need to wait an eternity until they remove your robot block.

Now what?

Do you ignore the block? That’s probably not right.

Do you create a new User-Agent so that you can slide through the robot block? Possibly. That might work. However, what if you’re blocked because people don’t like you (and it’s not a politeness issue)?

I assume if it’s a non-crawlable directory they’re just going to use User-Agent: *.

One could extend robots.txt with additional syntax to handle such situations, but honestly, how many users are going to use that extension?

They could always just remove the disallow rules…

A new version of Slurp is out the door apparently:

Over the past few weeks, we’ve been preparing for the latest version of the Yahoo! Search crawler with some infrastructure updates, which recently caused a variance in our crawl behavior.

With everything now in place, the rollout has officially begun. The new Yahoo! Slurp 3.0 recognizes the same user-agent and all robots.txt directives for ‘Yahoo! Slurp,’ though it’ll identify itself as Slurp 3.0 in your web logs.

Looks like they’re pushing reverse DNS for crawler identification.
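For reference, the usual check is a reverse lookup on the client IP, a test that the resulting host name sits under the crawler’s domain, and then a forward lookup to confirm that the name maps back to the same IP. A minimal sketch in Java follows; the class is hypothetical, and the domain you pass in (for Slurp, whatever domain Yahoo documents for its crawler hosts) is something you would need to confirm yourself.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;

// Hypothetical forward-confirmed reverse DNS check for crawler identification.
final class CrawlerDnsCheckSketch {

    /** True if host equals domain or is a subdomain of it (case-insensitive). */
    static boolean hostInDomain(String host, String domain) {
        String h = host.toLowerCase(Locale.ROOT);
        String d = domain.toLowerCase(Locale.ROOT);
        return h.equals(d) || h.endsWith("." + d);
    }

    /** Reverse lookup, domain check, then forward lookup back to the same address. */
    static boolean isVerifiedCrawler(String ip, String domain) {
        try {
            InetAddress addr = InetAddress.getByName(ip);
            // Reverse lookup; this returns the textual IP if no name can be resolved.
            String host = addr.getCanonicalHostName();
            if (!hostInDomain(host, domain)) {
                return false;
            }
            // Forward-confirm: the name must resolve back to the original address.
            for (InetAddress forward : InetAddress.getAllByName(host)) {
                if (forward.equals(addr)) {
                    return true;
                }
            }
            return false;
        } catch (UnknownHostException e) {
            return false;
        }
    }
}
```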

No one seems to be talking about the features Yahoo is pushing into their new Slurp 3.0 crawler.

I’m willing to bet one solid feature is extended support for microformats:

In the coming weeks, we’ll be releasing more detailed specifications that will describe our support of semantic web standards. Initially, we plan to support a number of microformats, including hCard, hCalendar, hReview, hAtom, and XFN. Yahoo! Search will work with the web community to evolve the vocabulary framework for embedding structured data. For starters, we plan to support vocabulary components from Dublin Core, Creative Commons, FOAF, GeoRSS, MediaRSS, and others based on feedback. And, we will support RDFa and eRDF markup to embed these into existing HTML pages. Finally, we are announcing support for the OpenSearch specification, with extensions for structured queries to deep web data sources.

I’ll be at the MySQL users conference this week. Ping me if you want to chat.

Also, come see my talk on Thursday:

We present the backend architecture behind Spinn3r – our scalable web and blog crawler.

Most existing work in scaling MySQL has been around high read throughput environments similar to web applications. In contrast, at Spinn3r we needed to complete thousands of write transactions per second in order to index the blogosphere at full speed.

We have achieved this through our ground up development of a fault tolerant distributed database and compute infrastructure all built on top of cheap commodity hardware.

Log Structured InnoDB

For a while now I’ve been thinking that the way InnoDB handles transactions for all in-memory databases is flat out broken.

The vast majority of web applications using InnoDB are running the entire database out of memory.

With an 8-32GB database it only takes 1-3 minutes to write the whole image to disk.

If InnoDB were smart it would just enable a log-structured mode where it would continually read data from memory and write it to disk. The write ahead log would work but there would be no fuzzy checkpointing. In essence there would just be ONE continual checkpoint of the whole database.

When the database crashes you just read the database from the last checkpoint and replay log entries from the write ahead log. Basically a normal recovery.
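A toy sketch of that recovery path in Java: an in-memory map stands in for the database, a list stands in for the write ahead log, and a checkpoint is a full copy of the map tagged with the last log sequence number it includes. All names are hypothetical, and the snapshot here is taken atomically, which sidesteps the hard part of copying a large image while writes continue.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of 'one continual checkpoint plus write ahead log' recovery.
final class CheckpointRecoverySketch {

    /** One write ahead log record: a sequence number plus the key/value it sets. */
    static final class LogEntry {
        final long seq;
        final String key;
        final String value;

        LogEntry(long seq, String key, String value) {
            this.seq = seq;
            this.key = key;
            this.value = value;
        }
    }

    /** Full image of the database as of log sequence number upToSeq. */
    static final class Checkpoint {
        final long upToSeq;
        final Map<String, String> image;

        Checkpoint(long upToSeq, Map<String, String> image) {
            this.upToSeq = upToSeq;
            this.image = image;
        }
    }

    private final Map<String, String> data = new HashMap<>();
    private final List<LogEntry> wal = new ArrayList<>();
    private long nextSeq = 1;

    void put(String key, String value) {
        wal.add(new LogEntry(nextSeq++, key, value)); // log first, then apply
        data.put(key, value);
    }

    /** Copies the whole in-memory image, tagged with the last applied sequence number. */
    Checkpoint checkpoint() {
        return new Checkpoint(nextSeq - 1, new HashMap<>(data));
    }

    List<LogEntry> log() {
        return wal;
    }

    /** Recovery: start from the checkpoint image and replay only newer log entries. */
    static Map<String, String> recover(Checkpoint cp, List<LogEntry> wal) {
        Map<String, String> restored = new HashMap<>(cp.image);
        for (LogEntry entry : wal) {
            if (entry.seq > cp.upToSeq) {
                restored.put(entry.key, entry.value);
            }
        }
        return restored;
    }
}
```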

It looks like Youtube now has MPEG4 support.

I wrote a youtube2ipod transcoder that takes an .flv and builds an mp4 which works with iPhones and iPods.

The problem is that it takes about an hour to convert a video.

This is a bit more handy.

Let’s see if it works with Feedburner’s automatic podcast support.

http://googlesystem.blogspot.com/2008/04/download-youtube-videos-as-mp4-files.html

link to mp4

Update:

I think it has to work with this URL … maybe Feedburner doesn’t support HTTP 303.

Update 2:

Somehow WordPress is only returning the summary RSS feed even though the setting is to use full-text. Hm.

Scratch that… turns out the new Firefox beta shows a summary view of RSS even when viewing full-text feeds. Strange.
