MySQL 5.1 (which is in beta right now btw) will have an interesting new feature called partitioning which will allow for a bit more scalability in certain situations. You can read more about it here, here, and here.
I don’t see this really being much of an advantage in practice. Most people want to scale their database in terms of transactions per second. You could buy three disks and put a partition on each or you could put the whole thing on a stripped RAID array. Each would have the same IO characteristics (assuming your partitioning is evenly distributed across disks).
The MySQL guys are still engineering like its 1999. They’re thinking disks are the IO bottleneck and all the machines in your cluster are identical.
Follow me here guys. Disk is the new tape, memory is the new disk, and gigabit ethernet is the new SCSI.
Case in point – NDB (or MySQL Cluster). They assume all the nodes in the cluster are the same capacity. This is #8 on the eight fallacies of distributed computing. You can’t even add/remove nodes to NDB without taking the entire cluster offline, doing a full backup restore, and bringing the system up under a new configuration. Are they serious?
I agree that the problem is a hard one but it can be solved… I had to solve it with a distributed filesystem I wrote (and will try to OSS soon).
What I really want is a cluster where the machines are just nodes providing storage. My schema sits at a layer above this and my DB layer handles IO requests within the cluster. This isn’t as far fetched as it sounds. The lbool driver provides me with a lot of this now.
The biggest problem I have is that every box is a single system image and contains the whole database. I could partition it within application logic but it seems that if MySQL were to bridge the thinking between partitioning and clustering (and fix the obvious problems with NDB) then we might have something pretty damn sweet.
No Trackbacks
9 Comments
“Disk is the new tape, memory is the new disk, and gigabit ethernet is the new SCSI.”
Careful dewd, keep that up and someone’s gonna quote you!
Kevin,
Good notes. I however could not agree on partitioning being unhelpful. The fact you find partitioning unworthy probably comes from the fact “Memory is new disk” for application you’re working with.
It is partially the case – these days you can have much more applications working with data sets which completely fit in memory. But it is not always the case. It would cost a fortune to store all data in memory for technocrati, livejournal or similar systems.
Partitioning in its current state is mainly helpful dealing with data which is much larger than amount of memory as well as in use with MySQL Cluster. It is also something which would help with parallel query execution in the future.
Hm…. can you use partitioning within MySQL cluster? That might be interesting…
I *really* understand the need for disk backed data. Don’t get me wrong. I think my main point is that if you’re going to implement partitioning why not partition across MySQL slaves instead of disk. IF you’re using RAID0 then you’d pretty much receive the major benefits of mySQL partitioning.
Granted RAID controllers are expensive but since this is user space anyway you could use software RAID.
I’ve considered this as well… you could use software RAID0 with a fault tolerant driver like lbpool (code.tailrank.com/lbpool) and if it fails you just rebuild it.
For the record I’m very excited about NDB and I *do* believe disk backed data is very important. I just don’t feel partitioning is much of an advance over the status quo.
Oh…. the one thing that will *kill* NDB is the fact that you can’t do online addition of new servers. Ug.
Kevin
Kevin, partitioning scans only the partitions required for the query, not everything. By contrast, RAID without partitioning will scan everything, wasting most of the disk I/O that is done, removing it from availability for other queries. Each delivers benefits independent of the other.
RAM is good but it’s not that easy to install 600GB of RAM in a box or cluster of boxes and that’s the sort of size I looked at for Wikipedia. Still too much to be really practical, though we’re getting there. Partitioning delivers benefits in that situation since it does significantly cut back on the disk load and even with 16GB of RAM per database server and 6 15,000 RPM SCSI drives in RAID 0, the disks are a bottleneck for many queries, though not the hottest. Unfortunately, RAM is still expensive and how much you can put in a box is still quite limited for some significant system sizes, where performance matters and partitioning helps.
RAM is cheap is a nice mantra for those who haven’t tackled big problems. It’s true and very handy for those with small problems.
I definitely agree with you about data segmenting at the database server level instead of making the application designer do it. Would make life substantially easier.
Kevin,
MySQL Also has Federated storage engine which you can use in theory for Partition across multiple MySQL nodes.
It does work with cluster. In fact Cluster always did automatic hash partitioning, now you just get more control over it.
I think you get partitioning wrong. It is not replacement to RAID as you put it. It is advancement of Merge tables, which are very helpful in certain applications.
Two main problems partitions target are – fast data access if it comes only from few partitions (if looking/loading at last month of data from 5 years archive) and easy deletion of old data.
Yes RAID0 is fine with some slaves if you have many of them and many people use something similar to what you do with ibpool
Speaking about cluster – I agree online node addition is one of major missing items.
> MySQL Also has Federated storage engine which you can use in theory for
> Partition across multiple MySQL nodes.
Hm…….. that’s an interesting point. I might have to look into that
scenario.
> It does work with cluster. In fact Cluster always did automatic hash
> partitioning, now you just get more control over it.
But it did this at the cluster level correct?
For the record I *do* agree that its impractical for many people to put most of
their application in memory. However I do feel that there are a class of
applications (Tailrank is one of them) where you can get away with keeping 90%
of your app on disk and simply cache 10% of it at any given time. The problem
is that this 10% can still be bigger than system memory (say 10G or so) and
you’d like to just buffer this in memory.
This is why I think disk support in MySQL Cluster 5.1 is necessary because you’d
get the benefit of both worlds.
Of course if MySQL had a MyISAM row-cached based on memcached you’d get the same
feature set.
> I think you get partitioning wrong. It is not replacement to RAID as you put
> it. It is advancement of Merge tables, which are very helpful in certain
> applications.
> Two main problems partitions target are – fast data access if it comes only from
> few partitions (if looking/loading at last month of data from 5 years archive)
> and easy deletion of old data.
Yeah….. I can see how this would be useful for a lot of people.
Hey James..
> Kevin, partitioning scans only the partitions required for the query, not
> everything. By contrast, RAID without partitioning will scan everything,
> wasting most of the disk I/O that is done, removing it from availability for
> other queries. Each delivers benefits independent of the other.
True….. I guess if you can route most of your queries within userspace then
this will be a significant boost. Hm…. This might be a win for my
app…. Hm..
> RAM is good but it’s not that easy to install 600GB of RAM in a box or cluster
> of boxes and that’s the sort of size I looked at for Wikipedia. Still too much
> to be really practical, though we’re getting there.
Yes…… thats why I think for many people having a MyISAM row cache backed by
memcached could be a big win. A lot of applications (probably Wikipedia as well)
can get by with buffering just 10-30% of their data. If you could use say a
100G distributed memcached for backing your MyISAM table this could be a big
win.
> Partitioning delivers benefits in that situation since it does significantly
> cut back on the disk load and even with 16GB of RAM per database server and 6
> 15,000 RPM SCSI drives in RAID 0, the disks are a bottleneck for many queries,
> though not the hottest. Unfortunately, RAM is still expensive and how much you
> can put in a box is still quite limited for some significant system sizes,
> where performance matters and partitioning helps.
yes…… I think you’re right that if you can route the query wiithin MySQL
directly that this can be a big win.
Kevin, yes, memcached for MyISAM or InnoDB buffering might be interesting, though that’s often not going to be the best place to use RAM.
LiveJournal uses the memcached for an object cache in front of the database and reports that making the databases effectively write-limited, with most reads removed from them.
Wikipedia uses memcached for a parsed page cache and that offloads both the database servers and the Apache/PHP page builders, delivering greater benefit than just at the database side.
Wikipedia also has Squids caching fully rendered pages for those who aren’t logged in and those handle most of the transient load surges.
OF course, both LiveJournal and Wikipedia are highly tuned single-application situations and throwing more RAM at the database server might well be a more practical option for many business mixed loads.
Today, assuming disk-based storage in Cluster, Cluster or Federated might well be better ideas than memcached at the database engine level. Except for the limitations of the cluster engine.
> Kevin, yes, memcached for MyISAM or InnoDB buffering might be interesting,
> though that’s often not going to be the best place to use RAM.
Depends on the application of course…. if you’ve already deployed memcached
and or squid on top of your logic buffering the DB directly might be a big win.
> LiveJournal uses the memcached for an object cache in front of the database and
> reports that making the databases effectively write-limited, with most reads
> removed from them.
Yes….. We’re doing the same and our web DB servers are mostly write… the
problem is our robot DBs are mostly read right now. We’re moving to a new arch
that will help remove some of this problem but a memcached row cache would
really solve the problem right now for us and just seems like a good idea all
around.
> Wikipedia also has Squids caching fully rendered pages for those who aren’t
> logged in and those handle most of the transient load surges.
Yeah….. I’m NOT saying that this would replace squid or memcached. We use
both for buffering pages.
Actually we don’t use squid we use Apache 2.2 with mod_cache and mod_proxy which
works well. There appears to be a bug however with the disk cache and I don’t
think its working now.. :-/
> OF course, both LiveJournal and Wikipedia are highly tuned single-application
> situations and throwing more RAM at the database server might well be a more
> practical option for many business mixed loads.
My point being that MyISAM with a memcached row cache would bring the majority
fo the benefits of cluster today for a LOT of people.