Archive for April, 2008

What MySQL Can Learn from PostgreSQL

Thursday, April 17th, 2008

Hi! So this is completely my notes taken from the conference, without my thoughts attached to it. I should definitely post a lot more about this, and how the community can “improve” in time. Just not today. Believe me, sitting in the talk, was highly painful, and I’m wondering where my aspirin stash might be. The slides will be available soon, and lets just consider this a learning experience. It reminded me of the time Eric Raymond came to the Fedora Project’s very first FUDCon in Boston 2005 (probably the only session without available video :P).

What MySQL can learn from PostgreSQL
Joshua Drake
(more…)

Who is the Dick on my site?

Thursday, April 17th, 2008

Who is the Dick on my site?
Dick Hardt

Most interesting keynote. About 1,000+ slides. Many slides per minute. Definitely a new way of presenting :)

  • What is identity? What is Identity 2.0?
  • Identity is a complicated topic, and you normally get the tip of the iceberg. Identity changes during your stages in life.
  • Works at SXIP Identity.
  • I wondered what the German’s thought about identity. Identat. “They’re German” <applause>
  • Answers.com had the best answers for identity.
  • There’s lots of different personas about a person. Women really are the masters of different personas (clothes, wife, mother, etc.). Reinventing oneself.
  • Identity allows you to predict behaviour…
  • When someone is in a “role” (fireman, etc.), you think you can predict behaviour. Is this identity? It’s who you are, not really, no.
  • blame.ca (his website)
  • Identity transactions… on where is identity used? Party identification, authorisation, profile exchange (information about a person so you know them better).
  • “Do you want to present ID at a bottle shop? If no, you can rollback the transaction!” <applause>
  • Photo ID is a reusable credential. This is an identity transaction.
  • Reputation built up on eBay? You can’t take it over to Craigslist. Identity 1.0 is site centric, its a walled garden.
  • identity20.com
  • Facebook is becoming a new silo. URIs enable things to be open (LinkedIn, Twitter, Flickr, etc.)
  • DataPortability.org - user centric
  • Kim Cameron - Laws of Identity (read it)
  • Device convergence - near field communication, phones doing more than make calls, etc.
  • Digital natives and immigrants. What are you? CNN == news source for natives, immigrants tend to use newspapers. I wouldn’t use CNN as an example myself, but the drift is there. Digital camera (immigrant) vs. just a “camera” (native). The younger generation are all digital natives…
  • OAuth spec - take a gander at this…
  • Reputation services: blogosphere (”pagerank”), open source contributions, wiki, “human” (so stop typing captchas!)
  • Viagra. You’re excited to take it. You can do new things!
  • myhealth.sg was mentioned. Singapore on the forefront of Identity 2.0 and OpenID? Or is it CardSpace (Microsoft) related?
  • What happens when you die? Your domain can be taken by someone else. Do they then become you, if that was your OpenID? Very interesting thought.
  • He flies Air Canada, and loves to talk about his Star Alliance Gold status :) Jives well with me, I’m Star Alliance Gold.

Technorati Tags: , , , , , , , , ,

Help, my website has been hacked! Now What?

Thursday, April 17th, 2008

Eli White from Digg presented. It was an interesting talk… He covered:

You are going to get hacked…
- SQL injection
- XSS
- CSRF (cross site request forgery)
- Session Hijacking

Slides (PDF, ODP) have SQL injection/XSS example, with the hole, the attack, and the prevention.

Technorati Tags: , , , , , , , , , , ,

Services Oriented Architecture with PHP and MySQL

Tuesday, April 15th, 2008

Joe Stump, Lead Architect, Digg. Slides should make its way at Joe’s website soon enough.

Mainly works on the backend, makes sure its scalable, can all the Digg buttons be served, et al.

Application layer is loosely coupled from your data. Whole point of SOA? You can put a service in front of the DB, and move between DB’s if required.

They do use MySQL, but its pretty vanilla.

Old habits die hard
- Data requests are sequential (I need foo, bar, bleh, ecky)
- Data requests are blocking (When you need foo, nothing else is happening)
- Tightly coupled (mysql_query, and if you’re using DB abstraction layer even, you’re still using SQL… you then can’t use CouchDB for instance)
- Scaling is not abstracted (a lot of caching are in the front end code. Its a problem when you start scaling your teams out). They use memcached from what I gather.

SOA
- Data is requested from a service (via HTTP, custom, etc.)
- Data requests are run in parallel (over non-blocking sockets. 10 data requests in 1 webpage, and each request takes 10ms. It might now only take 70ms now, maybe, over 100ms. Generally 1.5-2.5x faster now, for blocking parallel requests)
- Data requests are asynchronous (non-blocking parallel requests)
- Data layer is loosely coupled
- Scalability is abstracted (can find engineers anywhere, that can parse JSON or XML :P)

Options?
- Run requests over HTTP (Google (Java), Amazon (Java), etc.)
- New York Times’ DBSlayer (small little HTTP server that runs and provides parallel and async requests to mysql)
- Danga’s Gearman (binary protocol, has worked, its kind of a queuing system)
- Remember the wall clock goes down, but the CPU time is still happening, its still the same

HTTP w/PHP
1. Group requests for data at the top
2. Open a socket for each request
- Sockets must be non-blocking
- Make sure to use TCP_NODELAY
3. Use __get() to block for results
4. See Services_Digg_Request

Use a pear package, called Services_Digg for the above example. Note Digg’s API documentation as well.

HTTP is widely supported in all languages. Its very easy to get up and running, with lots of options for servers/tuning. Overhead in the protocol is great, and Apache itself has a lot of overhead.

DBSlayer
- small HTTP daemon written in C. You post JSON to it for communications
- connection pooling (benchmark mysql connection, and there’s a whole bunch of overhead in the mysql authentication; mysql proxy does this too)
- load balancing and failover (like mysql proxy)
- tightly coupled to MySQL (no migration)
- tightly coupled to SQL (no CouchDB)
- no intelligence

Gearman
- highly scalable queuing system (worker bees, like PHP scripts. Sockets open, client comes to gearman server to do foo, and it says it has n number of workers, and gearman gets ‘em to work. So it works linearly. Jobs can return results back, run in parallel on many gearman servers and many CPUs)
- simple and efficient binary protocol
- sets of jobs are run in parallel
- queue can scale linearly
- php, perl, python, ruby, c clients
- poorly documented (”I think poorly documented is giving them too much credit.. All danga stuff has next to no documentation”)
- livejournal uses this, instead of using HTTP running
- its not very “robust” (it scales, they at digg don’t see massive number of failing jobs. Queue isn’t persistent though. When pushing stuff, and gearman gets restarted, the queue goes away - there is a workaround, for this, so ask Joe - its an undocumented feature available though)
- digg uses it in the submission process for crawling
- Chris at Yahoo! uses Gearman requests to run multiple memcached GETs (if you’re not using multi-get, check them).
- Check out Net_Gearman, which is a PEAR package

DIY option?
- not recommended, unless you have a highly customised solution, i.e. what Flickr does
- they ran into a problem where uploading an image, and then getting the image resized, for large images, was a problem. So they use a custom binary protocol that is much more efficient for the datasets (think, an SLR has files that are 7MB in size or something)
- this requires more resources (humans, engineers!)

What goes in the Services layer?
- smart caching strategies
- data mapping and distribution
- intelligent grouping of data results
- partitioning logic

Remember to intelligently group data into endpoints, and version them! This will help you improve your software.

Consider bundling and grouping requests (bulk loading).

EPIC FAIL!
- sending SQL over for translation? Pfft. DBSlayer does this, but it tightly couples you
- hundreds of teeny tiny endpoints (cohesive endpoints that return a decent amount of data)
- running SOA requests sequentially! You then get no benefits from an SOA architecture, at all. Parallel requests are good.

Technorati Tags: , , , , , , , , , , , , , , , , , ,

EXPLAIN Demystified

Tuesday, April 15th, 2008

Baron Schwartz gave a most interesting talk about EXPLAIN. You will definitely want to read his slides (filled with detail), when they make their way online. These notes are very sparse, just bits that I didn’t see in the slides, that Baron mentioned verbally. Plenty of good questions, and plenty of interaction.

EXPLAIN only works for SELECT queries.

How does MySQL execute queries? Optimisation happens even as the query is being executed. As the query is being optimised, some execution happens as well. Execution Plan is a data structure, not bytecode.

When EXPLAIN’s output is generated, MySQL actually executes the query. It just set’s DESCRIBE on it, rather than executing it. Everything is a JOIN to MySQL (union, SELECT 1 [simplest base case join], etc…).

key_len - to know if your table is indexed well.

rows: estimated number of rows to read, but not the number of rows in the result set. In 5.1 and greater, it reflects LIMIT, but not before.

Maatkit includes mk-visual, so you can have a visual explain. This is also, very machine readable.

Update: Artem has good notes too.

Technorati Tags: , , , ,

MySQL Full Text Search by Alex Rubin

Tuesday, April 15th, 2008

Download the PDF: http://www.mysqlfulltextsearch.com/full_text.pdf

Default search by relevance, default sort is by relevance

Boolean search is also popular. cats AND dogs. No default sorting, so you need to order the results yourself

Phrase search

MySQL Full Text Index, only available with MyISAM, and it supports natural language and Boolean search. ft_min_word_len - 4 characters per word by default is indexed. Frequency based ranking, doesn’t count distance between words

SELECT * FROM articles WHERE MATCH (title,body) AGAINST (’database’ IN NATURAL LANGUAGE MODE);

For Boolean, you use AGAINST (’cat AND dog’ IN BOOLEAN MODE).

n-gram fulltext plugin for CJK languages are available as plugins

DRBD and MySQL FullText search? DRBD requires InnoDB, when there is a failover, DRBD needs to perform a reovery. Fulltext only works for MyISAM. So ou create a “FullText” slave MyISAM table with FullText indexes. The slide (diagram) is most useful for this, naturally.

Speed up FT search? Fit the index into memory. key_buffer = total size of full text index (max=4GB). You can preload FT indexes into buffer.

You can manually partition. Partitioning decreases index and table size, so search is faster. Application needs changing of course. MySQL 5.1 partitioning features, do not support FTs.

Order by/Group by is a performance killer. Using order by date, is much slower than with no order by.

Real World Performance Killer
SELECT … FROM `ft` WHERE MATCH `album` AGAINST (’the way i am’)
The above query, is very slow! It took like 13 seconds or so.

Note the stopword list and ft_min_word_len. I is not a stopword, but “the”, “way”, and “am” are stopwords.

ft_min_word_len = 1 will mean that all words except “i” will be filtered out with the standard stoplist. “i” is contained in lots of text!

Search with error correction? Use soundex() MySQL function (sounds similar). select soundex(”Dilane”) should equate to Dylan. You can sort it either by popularity or Levenstein distance (either by a stored procedure or a UDF).

Sphinx - nice, open source, can be faster than MySQL full text index on a large dataset, supports multi-node clustering out of the box. It is however an external solution that isn’t built-in, and needs to be integrated.

MySQL 5.0: need to patch source code. MySQL 5.1: copy Sphinx plugin to the plugin_dir.

You can set Sphinx to be MySQL’s storage engine if you like.

Resources

Technorati Tags: , , , , , , , ,

Ahead in the Cloud by Werner Vogels

Tuesday, April 15th, 2008

Ahead in the Cloud - The power of Infrastructure as a Service
CTO Amazon.com, Dr. Werner Vogels

Pretty much everyone in the audience uses Amazon!

Announced: Persistent Storage for Amazon EC2.

Hitting one page, might actually go to 250 different services, before the page is generated for you. Shows the use of a tool (Amazon internal), that graphs it.

SaaS: Develop -> Test -> Operate

Hardware costs? Software costs? Maintenance? Load balancing? Scaling? Utilisation? Idle machines? Bandwidth management? Server hosting? Storage management? High availability? All this is the differentiated heavy lifting that Amazon bases their services on.

SaaS comes at a very big cost that you have to address.

70/30 switch: 30% of time, energy and dollars on differentiated value creation; 70% of time, energy and dollars on differentiated heavy lifting.

At Amazon, we expect data centres to fail. But we also expect software to tolerate this failure.

“Scalable Infrastructure that allows applications to meet infinite demand, cheaply and reliably” (statement, made with picture of large amount of Sun hardware)

Amazon S3 (storage), SimpleDB, EC2 (computer power), FPS (payment service). All this is scalable (increase/decrease capacity on demand).

Scalability. Availability. Performance. Cost-Effectiveness.

Growth: largest selection on earth, good customer experience, drives prices down, drives traffic, sellers, selection, and this is a cycle for growth. It brings a lower cost structure, that also lowers prices down then.

This means that incremental scalability is key to Amazon’s business. Grow one step at a time, consistently. Turn a fixed cost, into a variable cost, as your business grows seamlessly.

Elastic cloud: grow and shrink on demand, with minimal disruption to performance. Operational efficiency, fault-tolerant, and remember, everyone has different growth paths. Heterogeneity - do not believe that all your nodes have the same capacity! A year from now, you will have more powerful machines, your software must scale for this.

Everything fails, all the time. An epic truth.

Failures are highly correlated. By every possible worst way! Systems do not fail by stopping - they will fail by sending out garbage ;) Your system must be able to deal with that.

Determinism is an illusion. An illusion created in a very small space. “Let go of control!”

Engineer for performance at 99.9%. Remember, address uncertainty - acquire resources on demand, pay for what you use, leverage other’s core competencies, turn fixed costs into variable costs. Never every pay again for something sitting in your data centre doing nothing for you.

All data access at Amazon is primary key based. Eventual consistent, for high read volume and always writeable. Query-based access, was non-relational.

Primary Key Access: Amazon S3; Query-based Access: SimpleDB; EC2 with persistent storage for a dedicated solution

Persistent storage? Raw disk, attach a volume to EC2. You can also detach. Infinite scalability in terms of data. From snapshots, you can create new volumes.

“All you need is a credit card” - for AWS. Lots of laughter :)

Technorati Tags: , , , , , , , ,

The Network Has Become a Social Utility — Jonathan Schwartz

Tuesday, April 15th, 2008

I wish I had better notes, but I was enthralled by Jonathan Schwartz’s (CEO, President, Sun Microsystems) keynote. It was truly, very amazing an influential. He’s a great speaker, and very motivated (and I think he’s motivated a lot of the audience).

What is Sun’s Agenda?
Similar values, cultures, and similar dysfunction’s like any family.

The Texas Advanced Computing Center - 500 teraflop “open” super-comptuing facility.

“Computational science is the third mode of discovery, complementing physical experimentation and theory” — Daniel Atkins III, Director, Office of Cyberinfrastruture, National Science Foundation

The industry has valid, legitimate scientific purpose.

Improving society as a result of that understanding. What does it take to fuel developing economies? Or to make new scientific discoveries?

In Africa, a bank is giving mobile phones, to allow folk to increase wealth! The wealth is in the network, it can’t be stolen anymore.

An open source phone? Stay tuned, Sun thinks that industry needs to be disrupted.

What does it take to connect with your friends? MySQL :)

Like electricity, like clean water… “The Network has become a Social Utility”

I want Sun to be a Great Company, and a good company. You can’t buy the community. The greater you are, the “gooder” you are. Work with communities, drive innovation, and more opportunity is created for all.

Jonathan shows a map of places where all the blue dots are where people download MySQL. Take the map away, and you’ll see a beautiful picture. Its already the majority of the planet for instance. These folk, have decided that there is a demand for open source software.

Free software is taking over the world.

Today ZFS is under a CDDL license, but rumour has it, that this will change in the future.

“The Future Will Be Defined by Free… and Freedom”

“We want to be in control of our own destiny” - all those places downloading software… Freedom matters to me, because I want access to my own stuff.

The Amazon. Comprises of 10,000 rivers. They all fuel the Amazon River. Open source, is an ecosystem of many rivers… Sun is saying we’re pro-opensource, pro-free software. Put commitment behind many communities.

Individuals are making choices… MySQL might be used even without CIOs knowing!

Secret plot? Let people download, try it, use it, and they’ll change their view of how good it is. And there’s economic benefit (for Sun).

Technorati Tags: , , , , ,