I’ve been reviewing the random write performance with SSDs over the last few days and have a few updates on their performance numbers.
It turns out that random write IO needs special handling on SSDs to reach ideal performance numbers. Due to the erase block latency on NAND flash, performance can start to suffer when your database does lots of random writes. OLTP applications REALLY suffer from this problem since databases tend to assume their underlying storage system is a normal hard drive.
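To make the erase block cost concrete, here is a rough toy model in Python. It is only a sketch under loud assumptions: the 128-pages-per-erase-block figure and the naive controller (one that buffers a single erase block and has to erase and rewrite it whenever the write stream jumps to a different block) are illustrative choices of mine, not the behavior of any real drive. `count_erases` is a hypothetical helper.

```python
import random


def count_erases(page_writes, pages_per_erase_block=128):
    """Count erase-and-rewrite cycles for a naive, hypothetical controller.

    Assumption: the controller buffers exactly one erase block and must
    flush (erase + reprogram) whenever the next page write lands in a
    different erase block than the previous one.
    """
    erases = 0
    current_block = None
    for page in page_writes:
        block = page // pages_per_erase_block
        if block != current_block:
            erases += 1
            current_block = block
    return erases


# Write every page of 64 erase blocks once, first in order, then shuffled.
total_pages = 128 * 64
sequential = list(range(total_pages))
scattered = sequential[:]
random.Random(0).shuffle(scattered)  # fixed seed so the run is repeatable

print("sequential erases:", count_erases(sequential))
print("random erases:    ", count_erases(scattered))
```

In this model the sequential pass needs one erase cycle per erase block, while the shuffled pass pays for one on nearly every page write. Real controllers are smarter than this, which is exactly the point about write algorithms made below.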
Some vendors like STEC claim that their SSDs can do high random write IOPS natively. This certainly has nothing to do with the underlying NAND flash but rather their use of an intelligent write algorithm.
So really it’s not a resource problem as much as it is an IP problem.
The NAND on these drives is pretty much the same; it’s just that we’re not doing a good job of interfacing with it.
Log structured filesystems can come into play here and seriously increase performance by turning all random writes into sequential writes. The drawback is that read IO will then be fully random, since logically adjacent blocks end up scattered across the log. For SSD this isn’t as much of an issue because random reads are nearly free. The Mtron we’ve been benchmarking can do 70MB/s random reads.
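The random-writes-become-sequential trick can be sketched in a few lines of Python. This is a minimal illustration, not how nilfs or any real filesystem is implemented: `LogStructuredStore` is a hypothetical class, the "device" is just an in-memory bytearray, and there is no garbage collection of stale blocks.

```python
class LogStructuredStore:
    """Toy log-structured block store: every write is appended to the log."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.log = bytearray()   # stands in for the device; only ever grows
        self.index = {}          # logical block number -> byte offset in log
        self.write_offsets = []  # physical offset of each write, for inspection

    def write(self, block_no, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        offset = len(self.log)          # always the current end of the log
        self.log += data
        self.index[block_no] = offset   # older copies simply become stale
        self.write_offsets.append(offset)

    def read(self, block_no):
        offset = self.index[block_no]
        return bytes(self.log[offset:offset + self.block_size])


# Logically random writes (blocks 7, 2, 9, then 2 again) ...
store = LogStructuredStore(block_size=4)
for block_no in (7, 2, 9, 2):
    store.write(block_no, bytes([block_no]) * 4)

# ... land at strictly increasing physical offsets.
print(store.write_offsets)
# Reading blocks 2, 7, 9 in logical order jumps around the log instead.
print([store.index[b] for b in (2, 7, 9)])
```

The writes hit the device in order no matter which logical blocks they target, and the price is paid on reads, which have to chase the index around the log. That trade is a bad one on a spinning disk and a cheap one on flash.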
It turns out that nilfs performs really well in the sysbench random write benchmark. It completed all IO 10x faster than the internal HDD, which was a welcome sign. In practice it was continually writing at 50MB/s and was able to complete tests in 13 seconds vs 8.5 minutes for our HDD.
While random writes look good, it failed at the sysbench OLTP benchmark. I’m not sure why. In theory it should work fine since all blocks should be read quickly and then written to the end of the disk sequentially. This could be a problem with the nilfs implementation, the fact that erase blocks weren’t properly aligned, or a strange interaction issue with their continuous snapshot system.
JFFS2 also looked interesting, but it is designed to work with the Linux MTD driver and not a block driver. It has a number of issues, including bugs in the implementation and the fact that it doesn’t log numbers into /proc, which breaks iostat (at least with the block2mtd driver).
The biggest problem, I think, is the fact that all existing Linux log structured filesystems are designed for use with MTD devices, not block devices. This means none of the existing code will work out of the box.
To add insult to injury, many of these filesystems require custom patched kernels, which makes testing a bit difficult.
Update: Bigtable and append only databases would FLY on flash. Not only that but databases can easily be bigger than core memory because the random reads would be fast as hell.
I’m very jealous.