Archive for April 2007

Federation at Flickr: A tour of the Flickr Architecture

I’ve always been a big Flickr fan and user, and love them even more now that their 2GB limit is gone. So this was a most interesting talk, and I think photographers and Flickr users alike will find what’s behind Flickr interesting. Dathan also has a very interesting blog.

Federation at Flickr: Doing Billions of Queries Per Day, Dathan Pattishall

Database guy – query, performance, troubleshooting, application building, etc. Previously worked at Friendster in 2003.

Master-Slave lag – the slaves were unable to keep up with the demand Flickr was seeing, so they moved to multiple masters; they also had multiple single points of failure (SPOFs).

Everything had to be real time. Write intensive, so more than 1 master needed. No more slave lag. Serve pages fast, with many queries. Get rid of the SPOFs and be redundant.

Need the ability to do live maintenance, repair data, and so forth, without taking the site down.

Federation Key Components:

  • Shards: My data gets stored on my shard, but when making a comment on someone else’s blog, the record of performing that action is stored on their shard.
  • Global Ring: It’s like DNS – you need to know where to go and who controls where you go. On every page view, calculate where your data is at that moment in time.
  • PHP logic to connect to the shards and keep the data consistent (10 lines of code with comments!)

Shards:

  • Slice of the main database
  • Active Master-Master Ring Replication: there are a few drawbacks in MySQL 4.1, such as honoring commits in master-master setups. Auto-increment IDs are handled automatically to keep it active-active.
  • Shard assignments are from a random number for new accounts
  • Migration is done from time to time, so you can move certain power users off a shard. It needs to be balanced if a user has a lot of photos… 192,000 photos and 700,000 tags take about 3-4 minutes to migrate. Migration is done manually.

Global Ring:

  • It’s a lookup ring for stuff that can’t be federated

Owner_id -> SHARD_ID
Photo_id -> Owner_id
Group_id -> SHARD_ID

It makes sense to hash by user, but then there’s the notion of a group: a group can be owned by many users, plus many users contribute to the group.

This is cached in MEMCache, for about 1/2 hour.
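
To make the lookup ring concrete, here’s a minimal sketch (in Python rather than Flickr’s PHP, with made-up table, key and credential names) of resolving Owner_id -> SHARD_ID through memcache, falling back to the global ring database and caching the answer for roughly that half hour:

    import memcache   # python-memcached client
    import MySQLdb    # illustrative driver choice, not Flickr's actual stack

    mc = memcache.Client(['127.0.0.1:11211'])
    GLOBAL_RING = dict(host='global-ring-db', user='app', passwd='secret', db='ring')

    def shard_for_owner(owner_id):
        """Resolve Owner_id -> SHARD_ID, caching the answer for ~30 minutes."""
        key = 'shard_for_owner:%d' % owner_id
        shard_id = mc.get(key)
        if shard_id is None:
            conn = MySQLdb.connect(**GLOBAL_RING)
            cur = conn.cursor()
            # hypothetical lookup table implementing Owner_id -> SHARD_ID
            cur.execute("SELECT shard_id FROM owner_shard WHERE owner_id = %s", (owner_id,))
            row = cur.fetchone()
            conn.close()
            if row is None:
                raise KeyError('unknown owner %d' % owner_id)
            shard_id = int(row[0])
            mc.set(key, shard_id, time=1800)   # cache for about half an hour
        return shard_id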

Ticket servers – MySQL 5, low-grade machines, circular replication. This is the ID generator, handing out globally unique IDs. A lot of unique IDs are associated with each row.
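
The notes don’t say exactly how the ticket servers hand out IDs. A commonly described pattern – an assumption here, not something confirmed in the talk – is a single auto-increment row per server (which fits the “single row per server” note further down), with each box given a different auto-increment offset so the pair can’t collide. A rough sketch:

    import MySQLdb   # illustrative driver choice

    # Assumed (hypothetical) server-side setup:
    #   box A: auto_increment_increment=2, auto_increment_offset=1  -> odd IDs
    #   box B: auto_increment_increment=2, auto_increment_offset=2  -> even IDs
    # plus a single-row table such as:
    #   CREATE TABLE tickets (id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    #                         stub CHAR(1) NOT NULL UNIQUE);

    TICKET_SERVERS = ['ticket-a.example.com', 'ticket-b.example.com']   # hypothetical hosts

    def next_ticket_id():
        """Fetch a globally unique 64-bit ID from the first reachable ticket server."""
        for host in TICKET_SERVERS:
            try:
                conn = MySQLdb.connect(host=host, user='app', passwd='secret', db='tickets')
            except MySQLdb.OperationalError:
                continue   # that box is down, try the other one
            cur = conn.cursor()
            # REPLACE keeps the table at one row while still bumping the auto-increment counter
            cur.execute("REPLACE INTO tickets (stub) VALUES ('a')")
            cur.execute("SELECT LAST_INSERT_ID()")
            (ticket_id,) = cur.fetchone()
            conn.commit()
            conn.close()
            return int(ticket_id)
        raise RuntimeError('no ticket server reachable')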

Clicking a Favourite:

  • Pulls the photo owner’s account from cache, to get the shard location (say, on shard-5)
  • Pulls my information from cache, to get my shard location (say, on shard-13)
  • Starts a “distributed transaction” – to answer the questions: Who favourited the photo? What are my favourites? (See the sketch after this list.)
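
A hedged sketch of what that “distributed transaction” might look like in application code – table names and helpers are mine, not Flickr’s; it reuses the hypothetical shard_for_owner() lookup above plus a connect_to_shard() helper sketched further down under the replication-lag notes. The favourite gets written to both shards, and it’s best-effort rather than a true two-phase commit, which is presumably why the term is in quotes:

    def add_favourite(my_user_id, owner_id, photo_id):
        """Record a favourite on both shards involved (illustrative, not Flickr's code)."""
        owner_shard = connect_to_shard(shard_for_owner(owner_id))    # e.g. shard-5
        my_shard = connect_to_shard(shard_for_owner(my_user_id))     # e.g. shard-13
        try:
            # the owner's shard answers "who favourited this photo?"
            owner_shard.cursor().execute(
                "INSERT INTO photo_favourites (photo_id, fan_id) VALUES (%s, %s)",
                (photo_id, my_user_id))
            # my shard answers "what are my favourites?"
            my_shard.cursor().execute(
                "INSERT INTO user_favourites (user_id, photo_id) VALUES (%s, %s)",
                (my_user_id, photo_id))
            owner_shard.commit()
            my_shard.commit()
        except Exception:
            # roll back whatever hasn't been committed yet on either side
            owner_shard.rollback()
            my_shard.rollback()
            raise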

Can ask the question from any shard, and recover the data. It’s absolutely redundant.

To get rid of replication lag…

  • every page load, the user is assigned to a bucket
  • if a host is down, go to the next host in the list; if all hosts are down, display an error page. They don’t use persistent connections; they build connections and tear them down. Every page load thus tests the connection. (See the sketch below.)
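
A minimal sketch of that connection dance (host lists and credentials are placeholders, not Flickr’s actual setup): no persistent connections – on each page load, walk the shard’s host list and use the first server that answers, showing an error page only if none do.

    import MySQLdb

    def connect_to_shard(shard_id):
        """Open a fresh connection to the first live server in the shard's host list."""
        hosts = shard_host_list(shard_id)   # hypothetical helper, e.g. ['shard5-a', 'shard5-b']
        for host in hosts:
            try:
                # a fresh connection on every page load doubles as a health check
                return MySQLdb.connect(host=host, user='app', passwd='secret',
                                       db='flickr', connect_timeout=1)
            except MySQLdb.OperationalError:
                continue   # this box is down or in maintenance, try the next one
        raise RuntimeError('all hosts for shard %s are down' % shard_id)   # -> error page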

Every user’s reads and writes are kept in one shard. The notion of replication lag is gone.

Each server in a shard is 50% loaded, so they can shut down half the servers in each shard: one server in the shard can take the full load if a server of that shard is down or in maintenance mode.

They do break the 50% rule during traffic spikes, though, doing something like 6,000-7,000 queries per second. It’s designed for at most 4,000 queries per second to keep it at 50% load.

Average queries per page are 27-35 SQL statements. Favourite counts are real time. API access to the database is all real time. They achieved the real-time requirements without any disadvantages.

Over 36,000 queries per second – running within the capacity threshold. Bursts of traffic can double that 36K qps. More than 4 billion queries per day.

Each shard holds 400K+ users’ data.

“So you’re a single point of failure” — a member of the audience. He’s the main/only DBA at the moment.

Search:

  • Two search back-ends: the shards (around 35K qps across a few shards) and Yahoo!’s (proprietary) web search
  • An owner’s single-tag search or a batch tag change (say, via Organizr) goes to the shards due to real-time requirements; everything else goes to Yahoo!’s engine (probably about 90%, lagging behind the real-time goodness)
  • Think of it as having Lucene-like search

Hardware: EMT64 w/RHEL4, 16GB RAM, 6-disk 15K RPM RAID-10. Data size is at 12 TB of user metadata (these are not photos, this is just InnoDB ibdata files – the photos are a lot larger). 2U boxes. Each shard has ~120GB of data.

“Able to serve the traffic: Priceless”

Things MySQL can do better: multi-master replication, a thread bug in InnoDB for 4.1 (the fix is already in 5.0), optimization for OR queries, bitwise operations, sets.

MySQL 5.0 is used on shards that hold aux data, like logging, and also for generating the ticket IDs on the ticket servers (no replication between them, single row per server). IDs are 64-bit, but when exposing them to the public they use a 32-bit ID to keep the URLs short.

Backup procedure: ibbackup on a cron job, that runs across various shards at different times. Hotbackup to a spare. Snapshots are taken every night across the entire cluster of databases.

Aggregating the data: very fast, because it’s a process per shard. Stick it into a table, or recover data from another copy on other users’ shards.

max_connections = 400 connections per shard, or 800 connections per server & shard. Plenty of capacity and connections. The thread cache is set to 45, because they don’t have more than 45 users with simultaneous activity.

Schema updates? Shut down half the servers, force traffic over to the other half, push the update, repeat.

32 concurrent InnoDB threads on each shard. 5% of 12TB in memory for the pool, so that’s about 13GB.

Linux allocates memory per thread, so you can get an OS ERROR saying you’ve run out of memory.

Photos are stored on the filer. Upon upload, it processes the photos, gives you different sizes, then it’s complete. Metadata and pointers to the filers are stored in the database.


Monty is the world’s first MySQL fellow

Yes, you heard it here first. Monty is the world’s first MySQL Fellow. For outstanding work, being the co-Founder of MySQL AB, and for continuing excellent work. Standing ovation, might I add.

Some quick comments from the clash of the DB egos:
Heikki Tuuri – he worked for Solid, then went on to university, and learned everything about databases. He attributes it to stubbornness. Monty said he should make InnoDB open source, and he obeyed!

Mikael Ronstrom – “entrepreneurial ego in a large company”. He had a vision of what he wanted to do. Guess he was also a bit stubborn…

Jim Starkey had the most interesting hat ;-) And he probably made a couple of bucks from it…
“Does the United States still have anything to say in databases?”

Starting over is important, because you learn new things… Jim Starkey has written about 5 databases in his lifetime.

Nitro: really fast speeds (faster than MyISAM). Monty does say it’s not a perfect solution. Positional awareness, so any index entry knows where it is.

Update: Oops, looks like Kaj might’ve got this out first – Monty: The First MySQL Fellow.


DotOrg: WordPress, Eventum visits

I spoke to Matt and Barry today, and it was great to see them at the DotOrg Pavilion at the MySQL Expo, since the last time we caught up was at WordCamp 2006. Since WordCamp, WordPress.com has grown to something like 900,000+ registered users! That number was just over 300,000 a few months ago, so it looks like they’re really popular.

While Barry entertained a visitor, Matt and I got to talking about growing companies. He’s really happy with the size of Automattic, and is going to try for as long as possible to keep the company size under fifty. He’s also found it interesting that some people are running WordPress 1.2 (ick! security holes galore), and while I worried that the database itself might not be migrate-able, he mentions that going from 1.2 to 2.1 should be no problem at all. Speaking of releases, a new WordPress is just around the corner.

I just love updating WordPress. He mentioned that there’s a plugin that downloads the new package, unpacks it, and installs it, all on the fly, but it scares him, due to the recent break-in. That seems fair, and I too wouldn’t want such a plugin, but a more automated way of doing things would rock. For what it’s worth, WordPress 2.2 will allow you to disable all plugins with one button click, so that should be useful (and yes, this means I can script it more easily). And now, they run an md5 check something like every minute on WordPress to make sure the release doesn’t change!

I’ve also been advised to use SVN for deploying WordPress. I’ll definitely look into it soon, as I want to get everything into revision control eventually.

I also got to speak with Bryan Alsdorf, Eventum developer, and got a nice run-down of the history of the software. More love needs to be given to this tool: it’s great, it’s basically the infrastructure that runs MySQL Support, and others (if in the support business) will definitely find it useful too. I learnt some new things – the fact that there is also a command-line interface to it makes me like this software even more. Again, part of the fun of working in a distributed environment – this is the first time I’ve spoken at length with Bryan.

Got so many more people and DotOrg folk to visit tomorrow. Well, enough writing now, it’s time to go visit the DotOrg reception to grab some drinks and food.


“Open Source disrupts inefficient models and produces new wealth.”

Today’s keynote, by MySQL CEO Marten Mickos, was titled The Participatory & Disruptive Spirit of the Dolphin. Here are some random notes I took.

Tired                Wired
Packaged Apps        On-Demand
Closed Source        Open Source (jboss, php, mysql, apache, linux)
Complex Hardware     Commodity Hardware

The most innovative companies are clearly wired, and are enjoying the technology market shifts. Small players can now swim around the big players.

“Open Source disrupts inefficient models and produces new wealth.”

Star Wreck – Finnish dry humor, a production by amateurs, and extremely popular. Catch it on YouTube?

“We hope you’ll travel to MECA” – MySQL Enterprise Connection Alliance. Probably not the most sensible comment, but it does have humor in it ;-)

Serve the underserved – offer flying to those that would otherwise take a bus. Offer a database to people that would rather use a filesystem.

A smarter way to produce the goods and distribute them – this is what causes disruption.

2007 MySQL Applications of the Year

  • YouTube: #1 in online video
  • amp’d mobile: #1 in 3G mobile entertainment
  • Adobe: #1 in creative software (embeds MySQL in Acrobat and CS)

2007 MySQL Community Awards

  • Martin Friebe: quality contributor of the year. Everyone’s talking about his under-three-hour contribution, where his bug was verified and fixed so quickly… yet there are patches that have been waiting for years!
  • Paul McCullagh: code contributor of the year, thanks to the PBXT Storage Engine. Yes, PBXT is the hotness: it’s a single-man show that is a true community storage engine, is feature laden, and performs really well!
  • Sheeri Kritzer: advocate, communicator and facilitator of the year. Beware the SHE-BA – she’s got great blog entries, a superb podcast, and does so many cool things for the Community.

Never heard of amp’d, but it’s great to know that Adobe uses a lot of MySQL – Creative Suite 2, 3, and even Acrobat. That just rocks.

YouTube is clearly cool, I got a t-shirt when I visited their Google booth, and I was lamenting to them that they’d become just like Google in terms of recruitment – they spam! Apparently they’re looking for MySQL people, so if you’re after a job, and like online video and YouTube, they’ve got jobs awaiting. I on the other hand, walked there with my Senior Director of Human Resources, so we had a good chuckle :-)


MyODBC not showing in drivers list on Mac OS X

Today I missed a bunch of good talks that I was hoping to attend, because I was figuring out a problem at the Guru Bar. Offending criminal: MySQL Connector/ODBC 3.51. Offending OS: Mac OS X/PowerPC.

OS X comes with an ODBC Administrator. Once you unpack the MySQL package and the .pkg installs itself, you’ll find that all the files are installed in /usr/lib. You need to fire up ODBC Administrator, click on Drivers, and Add the driver. Give it an appropriate Description (MySQL), provide the location of the driver file (/usr/lib/libmyodbc3.dylib), and define it as available within the System (this lets you add it as a System DSN as well as a User DSN in the next step).

Later, either add a User or System DSN for MySQL. The keyword/value pairs should be: server/localhost, port/3306, database/test.

The caveat with all of this is that you actually need to have /usr/lib/libltdl.3.dylib present, because libmyodbc3.dylib references it – if you don’t have it, it will fail rather beautifully. How do you get libltdl.3.dylib? Get XCode! If you don’t have your install discs around, get it from the Apple Developer Connection. Beware, it’s a 923MB download (now you see why I missed more than one talk – large downloads at the conference tend to break, duh!).

Testing the ODBC connection? Make good use of /usr/bin/iodbctest. We were of course doing some odd things with Microsoft Excel (ick!). Once /usr/lib/libltdl.3.dylib is installed, the ODBC connection magically works. With regards to Excel, the external data source will just show up and you can use it. If you didn’t define the keyword/value pairs in the ODBC Administrator, you can always do it in Excel (however, running iodbctest will then fail).
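
If you’d rather test the DSN from code than from Excel, here’s a tiny sketch using Python’s pyodbc module (my choice of tool, not something the Connector/ODBC package ships); the DSN name is whatever you called it in ODBC Administrator:

    import pyodbc

    # 'mysql' is the DSN name defined in ODBC Administrator above
    conn = pyodbc.connect('DSN=mysql;UID=root;PWD=')
    cursor = conn.cursor()
    cursor.execute('SELECT VERSION()')
    print(cursor.fetchone()[0])   # prints the server version if the driver and libltdl are in place
    conn.close()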

Why was this not discovered earlier? Probably because developer tools are a really common thing for us to have installed, but users tend not to have XCode installed by default. And OS X doesn’t have packaging guidelines, unlike sensible RPM/DEB. When I get back, I’ll see if it’s possible to ship this missing bit, or otherwise get the Documentation team to update the documentation…

Bottom-line: Make sure XCode is installed if you’re going to use Connector/ODBC.


Scaling Twitter: “Is Twitter UDP or TCP? It’s definitely UDP.”

Presented by Blaine Cook, a developer from Odeo, now probably CTO of Twitter (spawned from Obvious Corp, I think). There’s a video and slides (yes, you need evil Flash, so I haven’t viewed it myself). Then there are my notes… possibly with some thoughts attached to them. No, they’re not organized; I’m too busy and tired…

Rails scales, but not out of the box – out of the box, it would cause Twitter to stop working very quickly.

600 requests/second, 180 rails instances (mongrel), 1 DB server (MySQL) + 1 slave (read only slave, for statistics purposes), 30-odd processes for misc. jobs, 8 Sun X4100s.

Uncached requests come in at less than 200ms most of the time.

steps:
1. realize your site is slow
2. optimize the database
3. “Cache the hell out of everything”
4. scale messaging
5. deal with abuse
6. profit.

Have stats (something Twitter didn’t have before): munin, nagios, awstats/google analytics (the latter obviously doesn’t work if your site itself doesn’t load), exception notifier/logger (exception logger is what they use at Twitter, so you don’t get lots of email :P). You need reporting to track problems.

Benchmarks – they don’t do profiling, they just rely on their users! What torture for the poor users…

“The next application I build is going to be easily partitionable.” – Stewart Butterfield
Dealing with abusers…
Inverse spamming – The Italians – receiving SMS gives you free call credits!
9,000 friends in 24 hours doesn’t scale!
Just be ruthless, delete said users. This is where you thank the reporting tools, which allow you to detect abusers.

They’ve looked at Magic Multi Connections, it looks great, but it wouldn’t work for Twitter.

The main bottleneck is really in DRb and template generation. The template optimizer that Steven Kays wrote doesn’t work for them.

Twitter: built by 2 people first. And now, they’re just 3 developers.

When mongrels hit swap, they become useless. So turn swap off.

Twitter themselves don’t seem to want to give out details of how many users, etc. they have. Shifty, beyond the fact that they claim it’s “a lot of users”.

Twitter is not built for partitioning. Social applications should be designed to be easily partitionable. WordPress, and anything 37signals builds, tends to be partitionable. Things start becoming hairy when you have 3,000+ friends!

Index everything – Rails won’t do it for you; you need to add an index for any column that appears in a WHERE clause.
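
For example – with an illustrative table and columns, not Twitter’s actual schema – if a timeline query filters statuses by user_id and sorts by created_at, those columns need an index that Rails’ generated schema won’t add for you:

    import MySQLdb   # illustrative; the point is the DDL, not the driver

    conn = MySQLdb.connect(host='localhost', user='app', passwd='secret', db='twitter_dev')
    cur = conn.cursor()
    # composite index covering the WHERE (user_id) and ORDER BY (created_at) of the timeline query
    cur.execute("ALTER TABLE statuses ADD INDEX idx_statuses_user_created (user_id, created_at)")
    conn.close()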

Denormalize a lot – heresy in the Rails book? He hopes not. This is single-handedly what saved Twitter.

They use InnoDB. Don’t do status.count() when there are millions of rows… it’ll stop working. MyISAM will be faster, but still, don’t.

email like “$#!$” – search. Twitter has disabled search right now… This makes their database enjoy life.

Average DB time is 50ms (to at most 100ms)

They’re not hurting on the DB. The master DB machine is at a quarter CPU usage. So they don’t see the need to partition at this point.

Twitter does a lot of caching; they use memcache. If you really need status.count(), use memcache.
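
A sketch of the cached-count idea (key names and TTL are my own, not Twitter’s): keep the count in memcache, hit MySQL only on a cache miss, and bump the cached value on writes instead of re-counting.

    import memcache

    mc = memcache.Client(['127.0.0.1:11211'])

    def status_count(user_id, db):
        """Cheap status count: memcache first, COUNT(*) only on a cache miss.
        db is an already-open MySQLdb-style connection."""
        key = 'status_count:%d' % user_id
        count = mc.get(key)
        if count is None:
            cur = db.cursor()
            cur.execute("SELECT COUNT(*) FROM statuses WHERE user_id = %s", (user_id,))
            count = int(cur.fetchone()[0])
            mc.set(key, count, time=300)   # arbitrary 5-minute TTL
        return count

    def on_new_status(user_id):
        """Bump the cached count on writes so we rarely need COUNT(*) again."""
        mc.incr('status_count:%d' % user_id)   # no-op if the key isn't cached yet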

The query for friends’ statuses on your Twitter homepage is a complicated one using a lot of JOINs. They use ActiveRecord, store the statuses in memory, and don’t touch the DB. They plan to use memcache for the statuses too in the future.

ActiveRecord objects are huge (which is why they’re not stuck in memcache yet). They’re looking at implementing an “ActiveRecord nano” or something similar – smaller, storing critical attributes in the cache, and adding method_missing handling if you don’t find what you’re looking for.

90% of Twitter’s requests are API requests. So cache them. No fragment or page caching on the front-end, but for API requests, lots of caching.

Producer(s) -> Message Queue -> Consumer(s)

DRb: zero redundancy, tightly coupled.

They use ejabberd as their Jabber server.

When the Jabber client went down, everything went down. So they moved to using Rinda. It’s O(N) for take(), so if the queue has 70,000 messages, you just shut it down, restart it, and lose those 70,000 messages. Sigh.

“Someone asked if Twitter is UDP or TCP? It’s definitely UDP.” — Blaine Cook

LiveJournal has a horizontally scaled MySQL that is just MySQL + lightweight locking. RabbitMQ (Erlang) is something they’re quite clearly looking at, but it looks ugly, and they don’t necessarily want to implement it.

Starling was written. It’s Ruby, and will be ported to something faster. It does 4,000 transactional messages/second, will have multiple queues (like a cache-invalidation one), speaks memcache (set, get), and writes it all to disk. The first pass was written in 4 hours, and it’s been working fine for the last few days (i.e. since Wednesday). Twitter died on Tuesday at the Web 2.0 conference! Starling will probably be open source.

Use messages to invalidate your cache.
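
Since Starling speaks the memcache protocol, a client can treat a queue name like a cache key: set() enqueues, get() dequeues. A hedged sketch (queue name, hosts and cache keys are all made up) of a producer and a consumer that uses such messages to expire cached timelines:

    import memcache

    queue = memcache.Client(['starling.example.com:22122'])   # hypothetical Starling-style queue host
    cache = memcache.Client(['127.0.0.1:11211'])               # ordinary memcached

    def publish_status(user_id):
        # producer side: enqueue an invalidation message for this user
        queue.set('invalidations', str(user_id))

    def drain_invalidations():
        # consumer side: drain the queue and expire the affected cache entries
        while True:
            user_id = queue.get('invalidations')
            if user_id is None:
                break   # queue empty (a real worker would sleep and poll again)
            cache.delete('timeline:%s' % user_id)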


