Archive for 25/4/2007

A Storage Engine for Amazon S3

A Storage Engine for Amazon S3, Mark Atwood

It looks mighty interesting, as transfers to Amazon S3 are free. I think it’ll work well in America and places where bandwidth rocks, but I don’t see this working too well in Australia. Oh how I wish the Internets would improve.

Mark has got all his stuff online at A MySQL Storage Engine for AWS S3. He was also kind enough to upload most of the notes, which made my reporting easier, and don’t forget to view the presentation.

Traditional storage engines use the local disk.

Networked engines: Federated, ODBC, HTTP, MemCacheD and S3 storage engine.
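All of these plug into MySQL through the pluggable storage engine interface, so choosing one is a per-table ENGINE decision. As a point of comparison (not from the talk), this is what the bundled FEDERATED engine looks like; the host and credentials below are made up:

  -- FEDERATED: the table definition lives locally, the data lives on a remote MySQL server.
  CREATE TABLE remote_orders (
    id   INT NOT NULL,
    note VARCHAR(255),
    PRIMARY KEY (id)
  ) ENGINE=FEDERATED
    CONNECTION='mysql://app_user:app_pass@db.example.com:3306/shop/orders';

The S3 engine follows the same per-table model, just pointing at a bucket instead of another MySQL server.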

What is S3?
Objects can be anywhere from 1 byte to 5GB. Amazon has several petabytes of storage for you to use :-)

Owning your own disks kind of sucks: you pay for storage even before you use it.

“An empty disk costs the same as a full one” – you pay a lot of money to put disks in data centers. RAID isn’t “excellent” either, and then what about disaster recovery?

You can’t move an existing database over, as the S3 storage engine isn’t ready for a full schema yet. There are hacks that allow this, but maybe it will be available next year.

A bucket can hold over a billion items, and they all come back in good time.

A bucket is fully virtually hosted, so you effectively get a SQL-driven content store inside the MySQL server. Handy for saving your EC2 work.
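As I understand it, the engine maps a bucket onto a table of key/value pairs, so S3 objects become rows you can SELECT and INSERT. A rough sketch of what the DDL might look like – the engine name and CONNECTION string format below are my guesses for illustration, not Mark’s actual syntax:

  -- Hypothetical: one S3 bucket exposed as a key/value table.
  -- ENGINE name and CONNECTION format are assumptions for illustration only.
  CREATE TABLE my_bucket (
    s3_id  VARCHAR(255) NOT NULL,   -- the S3 object key
    s3_val BLOB,                    -- the object contents (1 byte to 5GB)
    PRIMARY KEY (s3_id)
  ) ENGINE=AWSS3
    CONNECTION='awss3 my-bucket aws_access_key_id aws_secret_access_key';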

S3 is very distributed (geographically) and asynchronous. Writes are replicated, so your data may be reordered (and delayed). So there are no temporal guarantees.

Use the WHERE clause – otherwise it will do a full table scan, and you’ll be paying Amazon lots of money :-)
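In other words, a keyed lookup fetches one object, while an unconstrained query walks the whole bucket. Continuing with the hypothetical table above:

  -- Good: a point lookup on the key fetches a single object from S3.
  SELECT s3_val FROM my_bucket WHERE s3_id = 'photos/2007/04/25/0001.jpg';

  -- Bad: no WHERE clause means a full table scan,
  -- i.e. a GET for every object in the bucket, and a bigger Amazon bill.
  SELECT * FROM my_bucket;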

The talk ended with about 20 minutes to spare, and I do certainly hope he hacks on it more for the next year. He’s also soliciting feedback, so try it out if you can. And now, to run to the remainder of the talk on Highly Available MySQL Cluster on Amazon EC2! Two Amazon talks, with emerging technology goodness, at the same time? Pfft.


Federation at Flickr: A tour of the Flickr Architecture

I’ve always been a big Flickr fan, and user, and love them even more now that their 2GB limit is gone. So this was a most interesting talk, and I think photographers and Flickr users alike will find it interesting to see what’s behind Flickr. Dathan also has a very interesting blog.

Federation at Flickr: Doing Billions of Queries Per Day, Dathan Pattishall

Dathan is a database guy – queries, performance, troubleshooting, application building, etc. He previously worked at Friendster in 2003.

Master-slave lag – replication was unable to keep up with the demand Flickr was seeing. So now they have multiple masters, where before they had multiple single points of failure (SPOFs).

Everything had to be real time. The workload is write intensive, so more than one master was needed. No more slave lag. Serve pages fast, even with many queries. Get rid of the SPOFs and be redundant.

They needed the ability to do live maintenance, repair data, and so forth, without taking the site down.

Federation Key Components:

  • Shards: my data gets stored on my shard, but the record of me performing an action on your comment lives on your shard – for example, when making a comment on someone else’s blog.
  • Global Ring: it’s like DNS – you need to know where to go and who controls where you go. On every page view, calculate where your data is at that moment in time.
  • PHP logic to connect to the shards and keep the data consistent (10 lines of code, with comments!)

Shards:

  • Slice of the main database
  • Active Master-Master Ring Replication: there are a few drawbacks in MySQL 4.1, such as honoring commits in master-master setups. Auto-increment IDs are managed automatically to keep it active-active (see the sketch after this list).
  • Shard assignments are from a random number for new accounts
  • Migration is done from time to time, so certain power users can be moved off. Shards need to be rebalanced if a user has a lot of photos… 192,000 photos and 700,000 tags take about 3-4 minutes to migrate. Migration is done manually.
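A minimal sketch of the interleaved auto-increment trick, assuming the usual server variables (these only arrived in MySQL 5.0, and the shards here were on 4.1, so treat it purely as an illustration of the idea):

  -- Server A of the master-master pair hands out odd IDs
  SET GLOBAL auto_increment_increment = 2;
  SET GLOBAL auto_increment_offset    = 1;

  -- Server B hands out even IDs, so the two masters never collide
  SET GLOBAL auto_increment_increment = 2;
  SET GLOBAL auto_increment_offset    = 2;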

Global Ring:

  • It’s a lookup ring for stuff that can’t be federated:

Owner_id -> SHARD_ID
Photo_id -> Owner_id
Group_id -> SHARD_ID

It makes sense to hash by user, but then there’s the notion of a group: a group can be owned by many users, and many users contribute to the group.

This is cached in memcached for about half an hour.
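A minimal sketch of what the global ring’s lookup tables could look like (the table and column names are my own, not Flickr’s):

  -- Global ring: where does a user or group live?
  CREATE TABLE owner_to_shard (
    owner_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    shard_id INT NOT NULL
  ) ENGINE=InnoDB;

  CREATE TABLE photo_to_owner (
    photo_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    owner_id BIGINT UNSIGNED NOT NULL
  ) ENGINE=InnoDB;

  CREATE TABLE group_to_shard (
    group_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    shard_id INT NOT NULL
  ) ENGINE=InnoDB;

  -- Every page view resolves the viewer's shard, e.g.:
  SELECT shard_id FROM owner_to_shard WHERE owner_id = 12345;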

Ticket servers – MySQL 5, low-grade machines, circular replication. This is the ID generator: it hands out globally unique IDs, and a lot of unique IDs are associated with each row.
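A common way to build such an ID generator on MySQL is a single-row table plus REPLACE INTO; whether this matches Flickr’s exact schema is my assumption, but it captures the idea:

  -- Ticket server sketch: a one-row table whose auto_increment value
  -- is the next globally unique ID.
  CREATE TABLE tickets (
    id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    stub CHAR(1) NOT NULL,
    UNIQUE KEY (stub)
  ) ENGINE=MyISAM;

  -- Grab the next ID (REPLACE deletes and re-inserts the single row,
  -- bumping the auto_increment counter):
  REPLACE INTO tickets (stub) VALUES ('a');
  SELECT LAST_INSERT_ID();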

Clicking a Favourite:

  • Pulls the photo owner’s account from cache, to get their shard location (say, shard-5)
  • Pulls my information from cache, to get my shard location (say, shard-13)
  • Starts a “distributed transaction” – to answer the questions: who favourited the photo? What are my favourites?
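So a single favourite fans out into writes on two different shard connections, something like the following (the table and column names are hypothetical), with the PHP layer keeping both sides consistent:

  -- On shard-5 (the photo owner's shard): record who favourited the photo.
  INSERT INTO photo_favourites (photo_id, favourited_by, created_at)
  VALUES (987654321, 12345, NOW());

  -- On shard-13 (my shard): record the photo in my list of favourites.
  INSERT INTO user_favourites (user_id, photo_id, created_at)
  VALUES (12345, 987654321, NOW());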

You can ask the question from either shard and recover the data. It’s absolutely redundant.

To get rid of replication lag…

  • on every page load, the user is assigned to a bucket
  • if a host is down, go to the next host in the list; if all hosts are down, display an error page. They don’t use persistent connections – they build connections and tear them down. Every page load thus tests the connection.

Every user’s reads and writes are kept on one shard, so the notion of replication lag is gone.

Each server in a shard is 50% loaded, so they can shut down half the servers in each shard. One server in the shard can take the full load if the other server of that shard is down or in maintenance mode.

They do break the 50% rule during traffic spikes though, doing something like 6,000-7,000 queries per second. It’s designed for at most 4,000 queries per second to keep it at 50% load.

The average page issues 27-35 SQL statements. Favourites counts are real time, and API access to the database is all real time. They achieved the real-time requirements without any disadvantages.

Over 36,000 queries per second, running within the capacity threshold. Bursts of traffic can double that 36K qps. More than 4 billion queries per day.

Each shard holds 400K+ users’ data.

“So you’re a single point of failure” — a member of the audience. He’s the main/only DBA at the moment.

Search:

  • Two search back-ends: the shards (35K qps across a few shards) and Yahoo!’s (proprietary) web search
  • An owner’s single-tag search, or a batch tag change (say, via Organizr), goes to the shards due to the real-time requirements; everything else goes to Yahoo!’s engine (probably about 90% of it, behind on the real-time goodness)
  • Think of it as having Lucene-like search

Hardware: EM64T w/ RHEL4, 16GB RAM, 6-disk 15K RPM RAID-10, 2U boxes. Data size is at 12TB of user metadata (these are not photos – this is just the InnoDB ibdata files; the photos are a lot larger). Each shard has ~120GB of data.

“Able to serve the traffic: Priceless”

Things MySQL can do better: multi-master replication, the InnoDB thread bug in 4.1 (the fix is already in 5.0), optimization for OR queries, bitwise operations, and sets.

MySQL 5.0 is used in shards that hold aux data, like logging, and also for generating the ticket IDs on the ticket servers (no replication between them, a single row per server). The IDs are 64-bit, but when exposed to the public a 32-bit ID is used to keep the URLs short.

Backup procedure: ibbackup on a cron job that runs across the various shards at different times. Hot backup to a spare. Snapshots are taken every night across the entire cluster of databases.

Aggregating the data: very fast, because it’s a process per shard. Stick it into a table, or recover data from another copy on other users’ shards.

max_connections = 400 connections per shard, or 800 connections per server & shard. Plenty of capacity and connections. The thread cache is set to 45, because they don’t see more than 45 users with simultaneous activity.
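For reference, those settings correspond to ordinary server variables (the values are from the talk; how they are actually applied at Flickr is not something I know):

  -- Per-server connection and thread-cache settings mentioned in the talk.
  SET GLOBAL max_connections   = 400;  -- per shard (800 per server & shard)
  SET GLOBAL thread_cache_size = 45;   -- ~45 simultaneously active users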

Schema updates? Shut down half the servers, force traffic to the other half, push the update, and repeat.

32 concurrent InnoDB threads on each shard. About 5% of the 12TB is in memory for the buffer pools, so that’s about 13GB per box (which fits in the 16GB of RAM).
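Assuming these map to the usual knobs, innodb_thread_concurrency and innodb_buffer_pool_size, the settings would look roughly like this:

  -- Concurrency cap from the talk; this one is changeable at runtime.
  SET GLOBAL innodb_thread_concurrency = 32;

  -- innodb_buffer_pool_size (around 13G here) goes in my.cnf instead;
  -- it is not changeable at runtime on the MySQL versions of this era.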

Linux allocates memory per thread, so you can get an OS ERROR saying you’ve run out of memory.

Photos are stored on the filer. Upon upload, it processes the photos and generates the different sizes, and then it’s complete. The metadata, and pointers to the filers, are stored in the database.


Monty is the world’s first MySQL fellow

Yes, you heard it here first. Monty is the world’s first MySQL Fellow. For outstanding work as co-founder of MySQL AB, and for continuing excellent work. Standing ovation, might I add.

Some quick comments from the clash of the DB egos:
Heikki Tuuri – he worked for Solid, then went on to university, and learned everything about databases. He attributes it to stubbornness. Monty said he should make InnoDB open source, and he obeyed!

Mikael Ronstrom – “entrepreneurial ego in a large company”. He had a vision of what he wanted to do. Guess he was also a bit stubborn…

Jim Starkey had the most interesting hat ;-) And he probably made a couple of bucks from it…
“Does the United States still have anything to say in databases?”

Starting over is important, because you learn new things… Jim Starkey has written about 5 databases in his lifetime.

Nitro: really fast (faster than MyISAM). Monty does say it’s not a perfect solution. Positional awareness, so any index entry knows where it is.

Update: Oops, looks like Kaj might’ve got this out first – Monty: The First MySQL Fellow.


