Federation at Flickr: A tour of the Flickr Architecture

I’ve always been a big Flickr fan, and user, and love them even more now that their 2GB limit is gone. So this was a most interesting talk, and I think photographers and Flickr users alike will find it interesting what’s behind Flickr. Dathan also has a very interesting blog.

Federation at Flickr: Doing Billions of Queries Per Day, Dathan Pattishall

Database guy – query, performance, troubleshooting, application building, etc. Previously worked at Friendster in 2003.

Master-Slave lag – unable to keep up with the demand that Flickr was having. So they have multiple masters, the had multiple single points of failure (SPOF).

Everything had to be real time. Write intensive, so more than 1 master needed. No more slave lag. Serve pages fast, with many queries. Get rid of the SPOFs and be redundant.

Need ability to make live maintenance, repair data, and so forth, without taking the site down.

Federation Key Components:

  • Shards: My data gets stored on my shard, but the record of performing action on your comment, is on your shard. When making a comment on someone elses’ blog
  • Global Ring: Its like DNS, you need to know where to go and who controls where you go. Every page view, calculate where your data is, at that moment of time.
  • PHP logic to connect to the shards and keep the data consistent (10 lines of code with comments!)

Shards:

  • Slice of the main database
  • Active Master-Master Ring Replication: a few drawbacks in MySQL 4.1, as honoring commits in Master-Master. AutoIncrement IDs are automated to keep it Active Active.
  • Shard assignments are from a random number for new accounts
  • Migration is done from time to time, so you can remove certain power users. Needs to be balanced if you have a lot of photos… 192,000 photos, 700,000 tags, will take about 3-4 minutes. Migration is done manually.

Global Ring:

  • Its a Lookup Ring for stuff that can’t be federated

Owner_id -> SHARD-ID
Photo_id -> Owner_id
Group_id -> SHARD_ID

It makes sense to hash by user, but then the notion of group. Group can be owned by many users; plus many users are contributing to the group.

This is cached in MEMCache, for about 1/2 hour.

Ticket servers – MySQL 5, low grade machines, circular replication. This is the ID generator. Globally unique ID that is unique. A lot of unique IDs are associated to each row.

Clicking a Favourite:

  • Pulls the Photo owners Account from Cache, to get the shard location (say on shard-5)
  • Pulls my Information from cache, to get my shard location (say on shard-13)
  • Starts a “distributed transaction” – to answer the question: Who favourited the photo? What are my favourites?

Can ask question from any shard, and recover data. Its absolutely redundant.

To get rid of replication lag…

  • every page load, the user is assigned to a bucket
  • if host is down, go to next host in the list; if all hosts are down, display an error page. They don’t use persistant connections, they build connections and tear it down. Every page load thus, tests the connection.

Every users reads and writes are kept in one shard. Notion of replication lag is gone.

Each server in shard is 50% loaded. Shut down 1/2 the servers in each shard. So 1 server in the shard can take the full load if a server of that shard is down or in maintenance mode.

Periods of time when traffic spikes, they break the 50% rule though. They do something like 6,000-7,000 queries per second. Now, its designed for at most 4,000 queries per second to keep it at 50% load.

Average queries per page, are 27-35 SQL statements. Favourites counts are real time. API access to the database is all real time. Achieved the real time requirements without any disadvantages.

Over 36,000 queries per second – running within capacity threshold. Burst of traffic, double 36K/qps. More than 4 billion queries per day.

Each Shard holds 400K+ users data.

“So you’re a single point of failure” — a member of the audience. He’s the main/only DBA at the moment.

Search:

  • Two search back-ends: shards 35k qps on a few shards and Yahoo!’s (proprietary) web search
  • Owner’s single tag search or a batch tag change (say, via Organizr) goes to the Shards due to real-time requirements, everything else goes to Yahoo!’s engine (probably about 90% behind the real-time goodness)
  • Think of it such that you’ve got Lucene-like search

Hardware: EMT64 w/RHEL4, 16GB RAM, 6-disk 15K RPM RAID-10. Data size is at 12 TB of user metadata (these are not photos, this is just innodb ibdata files – the photos are a lot larger). 2U boxes. Each shard has~120GB of data.

“Able to serve the traffic: Priceless”

Things MySQL can do better: Multi-master replication, thread bug in INNODB for 4.1 (its already in 5.0), optimization for OR queries, bitwise, sets.

MySQL 5.0 is used in shards that have aux data, like logging, and also for generating the ticket ID on the ticket servers (no replication between them, single row per server). 64-bit ID, but when exposing to public they use a 32-bit ID to keep the URLs short.

Backup procedure: ibbackup on a cron job, that runs across various shards at different times. Hotbackup to a spare. Snapshots are taken every night across the entire cluster of databases.

Aggregating the data: Very fast, because its a process per shard. Stick it into a table, or recover data from another copy from other users shards.

max_connections = 400 connections per shard, or 800 connections per server & shard. Plenty of capacity and connections. Thread cache is set to 45, because you don’t have more than 45 users having simultaneous activity.

Schema updates? Shut down half the servers, force to other shards, push update, repeat.

32 concurrent InnoDB threads on each shard. 5% of 12TB in memory for pool, so thats about 13GB.

Linux allocates memory per thread, so you can get an OS ERROR saying you’ve run out of memory.

Photos are stored on the filer. Upon upload, it processes the photos, gives you different sizes, then its complete. Metadata and points to the filers, are stored in the database.

Technorati Tags: , , , , , ,

Related posts:

  1. Building simple & complex replication clusters with Tungsten Replicator by Giuseppe Maxia
  2. MySQL at Google
  3. MySQL HA reloaded by Ivan Zoratti
  4. Some tabs – Marten interview, Facebook, Flickr
  5. Sphinx user stories by Stéphane Varoqui

i