Posts Tagged ‘sharding’

Sharding for the masses: Introducing the SPIDER storage engine (OpenSQLCamp @ FrOSCon)

These are somewhat live notes from Sharding for the masses: Introducing the SPIDER storage engine, a talk by Giuseppe Maxia given at OpenSQLCamp at FrOSCon in August 2009. The slides are available too.

Why sharding? Scaling, of course. The usual MySQL way to scale is replication (even Yahoo! and Google use this).

When the master doesn’t have enough resources to cope with your workload (e.g. large data sets), replication chokes.

You can use proxies for sharding. Options available these days include MySQL Proxy (programmable in a scripting language, Lua), HSCALE (built on top of MySQL Proxy), and SpockProxy (a fork of MySQL Proxy without Lua scripting, specialised for sharding). The proxy, however, is a single point of failure: everything has to pass through it.

Enter SPIDER – a MySQL storage engine built on top of the partitioning engine. It associates each partition with a remote server, and is transparent to the user. It’s developed by Kentoku Shiba.
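To make the "partition maps to a remote server" idea concrete, here is a minimal conceptual sketch in Python. It is not SPIDER's code or syntax: the host names, the range boundaries and the function server_for are all made up for illustration, and it assumes range partitioning on an integer key.

```python
from bisect import bisect_right

# Hypothetical shard layout: each partition covers keys below an upper bound
# and lives on its own remote server (host names are invented).
PARTITIONS = [
    (100_000, "shard1.example.com"),
    (200_000, "shard2.example.com"),
    (float("inf"), "shard3.example.com"),  # catch-all for larger keys
]

UPPER_BOUNDS = [bound for bound, _ in PARTITIONS]


def server_for(emp_no: int) -> str:
    """Return the server holding the partition whose range contains emp_no."""
    # bisect_right finds the first partition whose upper bound is strictly
    # greater than emp_no, mirroring "values less than" range partitioning.
    index = bisect_right(UPPER_BOUNDS, emp_no)
    return PARTITIONS[index][1]


if __name__ == "__main__":
    for key in (42, 100_000, 5_000_000):
        print(key, "->", server_for(key))
```

The application only ever asks for a key; which server answers is decided by the partition definition, which is the transparency the talk describes.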

Installation: get the MySQL 5.1.37 sources, then the source code for Spider 1.0, and then the patch for condition pushdown.

Why the condition pushdown patch? The remote server has less work to do when it receives the condition. The SPIDER engine without the condition pushdown patch is still fast, but it can be more than 10x faster with condition pushdown (works with NDBCLUSTER; works with MyISAM). The patch by Kentoku adds cond_push and cond_pop to ha_partition, so every storage engine that uses table partitioning can get condition pushdown through ha_partition.
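A toy Python sketch of why pushing the condition down helps. It does not model SPIDER or ha_partition; RemoteShard, query_without_pushdown and query_with_pushdown are hypothetical names, and "rows sent" stands in for network traffic between the local and remote server.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, int]
Condition = Callable[[Row], bool]


class RemoteShard:
    """Toy stand-in for a remote server; counts rows sent over the 'network'."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self.rows = list(rows)
        self.rows_sent = 0

    def scan(self, condition: Optional[Condition] = None) -> List[Row]:
        # With a pushed-down condition the shard filters before sending,
        # so only matching rows are transferred.
        result = [r for r in self.rows if condition is None or condition(r)]
        self.rows_sent += len(result)
        return result


def query_without_pushdown(shard: RemoteShard, condition: Condition) -> List[Row]:
    # Fetch every row, then filter on the local side.
    return [r for r in shard.scan() if condition(r)]


def query_with_pushdown(shard: RemoteShard, condition: Condition) -> List[Row]:
    # Hand the condition to the shard and let it do the filtering.
    return shard.scan(condition)


if __name__ == "__main__":
    rows = [{"emp_no": i, "salary": i * 10} for i in range(1000)]
    wanted = lambda r: r["emp_no"] < 10
    for query in (query_without_pushdown, query_with_pushdown):
        shard = RemoteShard(rows)
        matches = query(shard, wanted)
        print(query.__name__, len(matches), "matches,", shard.rows_sent, "rows sent")
```

Both variants return the same rows; the difference is how many rows cross the wire to get there.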

You need to set up the engine first (the SQL is also available in the docs).

spider_remote_employees.sql – use this in conjunction with – a good example of how to use the SPIDER storage engine.

Horizontal Scaling with HiveDB

At the MySQL Conference & Expo 2008, Britt Crawford and Justin McCarthy, both from, gave us a very interesting talk on scaling with HiveDB. I took a few notes (pasted below), their slides are online (warning: 6.1MB PDF), and if you’re after their abstract, it’s available as well.

I also took a video of them (refer to Slide 12, for the IRC conversation):

The quick notes:

  • OLTP optimised (as it serves
  • Cannot lock tables or take the system offline
  • Constant response time is more important than low latency (a slightly slower query is OK, just not exponentially slower)
  • Queries might return result sets of wildly varying size.
  • There can be growth and usage hotspots. You cannot predict this at all.
  • Partition by key (the set of all partition keys is the partition dimension)
  • Partitioned Hibernate from Google (Hibernate Shards). HiveDB is now married up with shards.
  • Thought about MySQL Proxy to support high availability components, but it was dismissed
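One common way to implement partitioning by key is a lookup directory that remembers which node each key lives on. The Python sketch below illustrates that general idea only; these notes do not say how HiveDB does it internally, and the Directory class, its methods and the node names are hypothetical.

```python
from typing import Dict, List, Set


class Directory:
    """Hypothetical lookup table from partition key to the node storing it."""

    def __init__(self, nodes: List[str]) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        self.nodes = list(nodes)
        self.key_to_node: Dict[str, str] = {}
        self.load: Dict[str, int] = {node: 0 for node in self.nodes}

    def node_for(self, key: str) -> str:
        """Return the node for key, assigning new keys to the least-loaded node."""
        if key not in self.key_to_node:
            # Ties are broken by node order, so placement is deterministic.
            node = min(self.nodes, key=lambda n: self.load[n])
            self.key_to_node[key] = node
            self.load[node] += 1
        # Known keys always resolve to the node they were first placed on.
        return self.key_to_node[key]

    def partition_dimension(self) -> Set[str]:
        """The set of all partition keys seen so far."""
        return set(self.key_to_node)


if __name__ == "__main__":
    directory = Directory(["node-a", "node-b"])
    for user in ("alice", "bob", "alice", "carol"):
        print(user, "->", directory.node_for(user))
```

Because placement is recorded rather than computed from the key, a hot key can in principle be moved to another node by updating the directory entry, which fits the point above that hotspots cannot be predicted.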