The other day I blogged about running RAID performance tests and being disappointed by random write speeds.
The clear loser here is rndwr (random writes). I'm pretty sure this has to do with the 64k stripe/chunk size. I'm willing to bet the RAID controller is deciding to write the entire chunk for small updates, which would really hurt SSD performance since the ideal block size is 4k.
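If that guess is right, the write amplification is easy to ballpark. A sketch, assuming the controller really does rewrite a full 64 KiB chunk for each 4 KiB update (an assumption on my part, not something I've confirmed with the controller):

```shell
# 64 KiB chunk rewritten per 4 KiB logical write => 16x amplification
echo "$((64 * 1024 / 4096))x"
```

That ratio would roughly match the kind of random-write gap the benchmark showed.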
For reference, here's the original performance graph:
And you can see the dip in random write performance.
Well, I found the culprit. It turns out it was the noatime mount option, or rather the lack of this option, when mounting filesystems.
Here’s a new graph showing the resulting 9x performance boost for random writes when running with noatime:
In this graph smaller is better (since the measurement is in seconds).
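For anyone who wants to try this themselves, here's the shape of the change. This is a config sketch, not my actual setup: the device and mountpoint are illustrative.

```
# Remount a live filesystem with noatime:
#   mount -o remount,noatime /dev/md0 /mnt/data
#
# Or make it permanent with an /etc/fstab entry like:
/dev/md0  /mnt/data  ext4  defaults,noatime  0  2
```

Note that newer kernels default to relatime, which batches atime updates and recovers most of this win without fully disabling access times.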
I ran the rndwr test twice, both with and without noatime. Running without noatime yields results with a high standard deviation, so I computed the mean of the two runs. I assume this variance comes from how worn the blocks on the system are and how much wear leveling the disk has to perform.
Here’s what I think is happening. The access time is stored in an inode. This inode is then mapped to a block. This block is then mapped to an individual RAID device. This RAID device uses an underlying flash translation layer (FTL). The FTL sees LOTS of small accesses to this block so decides to use the wear leveling on the drive and move the block around on the underlying disk which radically increases the amount of IO on the device.
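You can see the inode half of this for yourself. A quick sketch (on a mount with strict atime semantics the read bumps the timestamp; under relatime or noatime it may not, which is exactly the point):

```shell
# atime lives in the inode; a plain read can dirty that inode.
f=$(mktemp)
stat -c 'atime: %X' "$f"   # access time, seconds since epoch
cat "$f" > /dev/null       # a read; on strict-atime mounts this triggers an inode write
stat -c 'atime: %X' "$f"
rm -f "$f"
```

Every one of those inode writes lands on whichever RAID member holds that block, which is what sets up the next problem.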
To compound the problem, it's happening on only ONE device, so it hurts the performance of the whole array.
Here’s the result of iostat -x.
Device:  %util
sdb       1.20
sdc      96.40
sdd       2.40
md0       0.00
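A handy trick for spotting this kind of hot device: filter iostat output by utilization. A sketch assuming %util is the last column, which can vary across sysstat versions:

```shell
# Print devices whose %util exceeds a threshold (here fed the numbers
# from the array above; normally you'd pipe in `iostat -x`).
iostat_hot() {
  awk -v thresh="$1" 'NR > 1 && $NF + 0 > thresh { print $1, $NF }'
}

printf 'Device: %%util\nsdb 1.20\nsdc 96.40\nsdd 2.40\nmd0 0.00\n' | iostat_hot 50
```

With a 50% threshold, only sdc shows up, which is the lopsided load described above.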
I stand by my initial assessment. Updating access time for files on a production system is foolish.
These numbers are looking really good so far. I still have some tests to perform. Specifically, random reads seem a bit slower on software RAID, but I'm optimistic I can solve this problem.