Archive for the ‘Web’ Category

Changes in the blog

Friday, May 9th, 2008

It’s worth noting some website changes. First, I dropped Skribit. The widget has been sitting there unused for weeks, so I’m thinking that’s software that no one besides its founders uses. “Is Skribit proving useful?” is the question they ask - no.

Next up, I’ve stopped using Technorati tags, and have decided to use WordPress tags. I’ll still be using categories, with tags to complement them. Why? WordPress has the feature… Technorati still gets updates/pings from my blog, and creates its own “tags” (largely, from what I can see, from the way I categorise my posts) for what it sees my blog as representing.

Besides, now I can add tags for relevant events, and RSS feeds can be generated from them. Good for people just wanting to follow notes from a certain event, and for aggregations of the specific feed for said events.

Near Field Communication (NFC) at JavaOne

Wednesday, May 7th, 2008

Talk was given by Jaana Majakangas, from Nokia Corporation. I’ve been interested in NFC ever since I heard about it, as it’s something Maxis has been trialling for a while in Malaysia. It reminds me of rewinding back many years (maybe a decade ago?) when Celcom was trying to allow people to purchase a Coke from select vending machines, using SMS (no cash!). That never took off, but maybe NFC will be right, soon… Current limitation? Lack of devices - one in market (Nokia 6131) and another announced, but not in market. Also, the standard (JSR 257) has been extended by Nokia, which is always an issue for other implementers.

Some quick notes:

  • JSR 257 is what this is all about.
  • Simple wireless protocol between NFC compliant tags and devices in close proximity. New business opportunities for mobile operators, banks, retailers, transport operators, etc.
  • You can share content between phones/pair devices like Bluetooth. You can get further information by “touching” smart posters. Your phone can be your credit card for payment… it can also be your travel card.
  • Service discovery. Nokia has got extensions to the JSR 257 standard for this in their implementations.
  • Think outside of the box, be innovative, the technology is there, there are many use cases
  • Contactless communication API has been around since 2004. RFID tag, smart card, visual tags. Java applications to access the hardware capabilities (RFID for instance).
    - NDEF tag (RFID tag, with NFC standard)
  • There is a dedicated Connection interface for different targets. You will get a notification when a transaction has happened.
  • When you discover a target, the application will get a notification. It has the URL that you will open the connection with. Communicate… then close connection.
  • Nokia 6131 NFC has extensions to JSR 257: get the SDK at Forum Nokia. The extension also includes the peer-to-peer communication framework. In a modified version of JSR 257, the P2P communication will exist soon as well.
  • Business cards that go to NFC devices and contact details are there? Wow, this is Business Card 2.0 :)
  • NFC works within less than 10cm. It’s pretty “near”.
  • “Touch to share bookmark”… touch two devices together, and voila! there is instant sharing. I’m reminded of old Palm ads when they were pushing their IR technology and beaming business cards across trains between a man and a woman!
  • NFC enables new consumer services with mobile devices. Take away that you should just be creative, and lots can happen.

Facebook does Instant Messaging (IM)

Wednesday, April 23rd, 2008

Dunno if this is a new feature, but Facebook has integrated instant messaging (IM) to folk that are on Facebook. This is like Google doing their Google Talk chats inside GMail. More and more reason why the web browser replaces the traditional desktop application?

Facebook chats
Facebook IM

Do I like being able to only talk to folk that are my friends? Naturally. One wonders though why we have so many different identities, on different services. The one that consolidates them all, is the one that enables ease of use, is the one that will “win”.

Me? I’m old fashioned. I like my IM client. I like my email client. All desktop clients I might add. The kids of today? They’re happy with everything web-based. Besides, the concept of syncing stuff offline is slowly becoming more popular with web-based applications, so maybe eventually, I too might like these new web applications…

Help, my website has been hacked! Now What?

Thursday, April 17th, 2008

Eli White from Digg presented. It was an interesting talk… He covered:

You are going to get hacked…
- SQL injection
- XSS
- CSRF (cross site request forgery)
- Session Hijacking

Slides (PDF, ODP) have SQL injection/XSS example, with the hole, the attack, and the prevention.

Services Oriented Architecture with PHP and MySQL

Tuesday, April 15th, 2008

Joe Stump, Lead Architect, Digg. Slides should make their way to Joe’s website soon enough.

Mainly works on the backend, making sure it’s scalable, that all the Digg buttons can be served, et al.

Application layer is loosely coupled from your data. Whole point of SOA? You can put a service in front of the DB, and move between DB’s if required.

They do use MySQL, but it’s pretty vanilla.

Old habits die hard
- Data requests are sequential (I need foo, bar, bleh, ecky)
- Data requests are blocking (When you need foo, nothing else is happening)
- Tightly coupled (mysql_query, and if you’re using DB abstraction layer even, you’re still using SQL… you then can’t use CouchDB for instance)
- Scaling is not abstracted (a lot of the caching is in the front end code. It’s a problem when you start scaling your teams out). They use memcached from what I gather.

SOA
- Data is requested from a service (via HTTP, custom, etc.)
- Data requests are run in parallel (over non-blocking sockets. Say there are 10 data requests in one web page, and each request takes 10ms: run sequentially that is 100ms, but run in parallel it might now only take 70ms or so. Generally 1.5-2.5x faster now, for blocking parallel requests)
- Data requests are asynchronous (non-blocking parallel requests)
- Data layer is loosely coupled
- Scalability is abstracted (can find engineers anywhere, that can parse JSON or XML :P)
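
To make the sequential versus parallel point concrete, here is a minimal sketch in Python (not the PHP that Digg uses). The fetch() function is a made-up stand-in that just sleeps for 10ms instead of talking to a real service, so the numbers only illustrate the wall-clock difference.

```python
import asyncio
import time


async def fetch(name: str, latency: float = 0.01) -> str:
    """Hypothetical data request: sleeps instead of doing real I/O."""
    await asyncio.sleep(latency)
    return f"{name}-data"


async def fetch_sequential(names):
    # Old habit: each request waits for the previous one to finish.
    return [await fetch(n) for n in names]


async def fetch_parallel(names):
    # SOA style: every request is in flight at once, so the wall-clock
    # time is roughly that of the slowest single request.
    return list(await asyncio.gather(*(fetch(n) for n in names)))


def timed(coro_fn, names):
    """Run one of the strategies above and return (results, seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(names))
    return result, time.perf_counter() - start


NAMES = [f"req{i}" for i in range(10)]
seq_result, seq_time = timed(fetch_sequential, NAMES)
par_result, par_time = timed(fetch_parallel, NAMES)
print(f"sequential: {seq_time * 1000:.0f}ms, parallel: {par_time * 1000:.0f}ms")
```

Both strategies return the same data; only the waiting changes. As Joe pointed out, the CPU work on the servers is the same either way.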

Options?
- Run requests over HTTP (Google (Java), Amazon (Java), etc.)
- New York Times’ DBSlayer (small little HTTP server that runs and provides parallel and async requests to mysql)
- Danga’s Gearman (binary protocol, has worked, it’s kind of a queuing system)
- Remember the wall clock time goes down, but the CPU time is still being spent; it’s still the same

HTTP w/PHP
1. Group requests for data at the top
2. Open a socket for each request
- Sockets must be non-blocking
- Make sure to use TCP_NODELAY
3. Use __get() to block for results
4. See Services_Digg_Request

Use a pear package, called Services_Digg for the above example. Note Digg’s API documentation as well.
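
The four steps above can be sketched roughly like this. It is Python rather than PHP, and it is not how Services_Digg is implemented: LazyRequests and make_request_socket are names I made up, a thread pool stands in for the raw non-blocking sockets, and __getattr__ plays the part of PHP’s __get().

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def make_request_socket() -> socket.socket:
    """A socket set up as the notes suggest: non-blocking, TCP_NODELAY on."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


class LazyRequests:
    """Start every request up front; block only when a result is read."""

    def __init__(self, **fetchers):
        self._pool = ThreadPoolExecutor(max_workers=max(1, len(fetchers)))
        # Steps 1 and 2: group the requests at the top and start them all.
        self._futures = {
            name: self._pool.submit(fn) for name, fn in fetchers.items()
        }

    def __getattr__(self, name):
        # Step 3: like __get() in PHP, block here until the data arrives.
        futures = self.__dict__.get("_futures", {})
        if name not in futures:
            raise AttributeError(name)
        return futures[name].result()

    def close(self):
        self._pool.shutdown(wait=True)


# Hypothetical fetchers; real ones would call a service over HTTP.
page = LazyRequests(user=lambda: {"name": "joe"}, stories=lambda: ["a", "b"])
user = page.user
stories = page.stories
page.close()
```

The point of the pattern is that the page code reads like ordinary attribute access, while the waiting for all requests overlaps.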

HTTP is widely supported in all languages. It’s very easy to get up and running, with lots of options for servers/tuning. The downside: overhead in the protocol is considerable, and Apache itself has a lot of overhead.

DBSlayer
- small HTTP daemon written in C. You post JSON to it for communications
- connection pooling (benchmark mysql connection, and there’s a whole bunch of overhead in the mysql authentication; mysql proxy does this too)
- load balancing and failover (like mysql proxy)
- tightly coupled to MySQL (no migration)
- tightly coupled to SQL (no CouchDB)
- no intelligence

Gearman
- highly scalable queuing system (worker bees, like PHP scripts. Sockets open, client comes to gearman server to do foo, and it says it has n number of workers, and gearman gets ‘em to work. So it works linearly. Jobs can return results back, run in parallel on many gearman servers and many CPUs)
- simple and efficient binary protocol
- sets of jobs are run in parallel
- queue can scale linearly
- php, perl, python, ruby, c clients
- poorly documented (“I think poorly documented is giving them too much credit.. All danga stuff has next to no documentation”)
- LiveJournal uses this, instead of running over HTTP
- it’s not very “robust” (it scales, and they at Digg don’t see a massive number of failing jobs. The queue isn’t persistent though. When pushing stuff, if gearman gets restarted, the queue goes away - there is a workaround for this, so ask Joe - it’s an undocumented feature, but it is available)
- digg uses it in the submission process for crawling
- Chris at Yahoo! uses Gearman requests to run multiple memcached GETs (if you’re not using multi-get, check them).
- Check out Net_Gearman, which is a PEAR package
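
For a feel of the worker-bee idea, here is a tiny in-process sketch. This is not the Gearman protocol or the Net_Gearman API; run_jobs is a hypothetical helper that uses Python threads and an in-memory queue, which, like Gearman’s queue as described above, is gone if the process restarts.

```python
import queue
import threading


def run_jobs(jobs, handler, worker_count=4):
    """Hand jobs to a pool of workers and collect results keyed by job."""
    pending = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            job = pending.get()
            if job is None:  # sentinel: no more work for this worker
                return
            outcome = handler(job)
            with lock:
                results[job] = outcome

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for job in jobs:
        pending.put(job)
    for _ in threads:
        pending.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results


# Hypothetical "crawl" job: just measure the hostname length.
crawled = run_jobs(["a.example", "b.example", "c.example"], len)
```

Adding workers increases throughput without the client code changing, which is the property the notes call linear scaling.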

DIY option?
- not recommended, unless you have a highly customised solution, i.e. what Flickr does
- they ran into a problem where uploading an image, and then getting the image resized, for large images, was a problem. So they use a custom binary protocol that is much more efficient for the datasets (think, an SLR has files that are 7MB in size or something)
- this requires more resources (humans, engineers!)

What goes in the Services layer?
- smart caching strategies
- data mapping and distribution
- intelligent grouping of data results
- partitioning logic

Remember to intelligently group data into endpoints, and version them! This will help you improve your software.

Consider bundling and grouping requests (bulk loading).

EPIC FAIL!
- sending SQL over for translation? Pfft. DBSlayer does this, but it tightly couples you
- hundreds of teeny tiny endpoints (prefer cohesive endpoints that return a decent amount of data)
- running SOA requests sequentially! You then get no benefits from an SOA architecture, at all. Parallel requests are good.

Ahead in the Cloud by Werner Vogels

Tuesday, April 15th, 2008

Ahead in the Cloud - The power of Infrastructure as a Service
CTO Amazon.com, Dr. Werner Vogels

Pretty much everyone in the audience uses Amazon!

Announced: Persistent Storage for Amazon EC2.

Hitting one page, might actually go to 250 different services, before the page is generated for you. Shows the use of a tool (Amazon internal), that graphs it.

SaaS: Develop -> Test -> Operate

Hardware costs? Software costs? Maintenance? Load balancing? Scaling? Utilisation? Idle machines? Bandwidth management? Server hosting? Storage management? High availability? All this is the undifferentiated heavy lifting that Amazon bases their services on.

SaaS comes at a very big cost that you have to address.

70/30 switch: 30% of time, energy and dollars on differentiated value creation; 70% of time, energy and dollars on undifferentiated heavy lifting.

At Amazon, we expect data centres to fail. But we also expect software to tolerate this failure.

“Scalable Infrastructure that allows applications to meet infinite demand, cheaply and reliably” (statement, made with picture of large amount of Sun hardware)

Amazon S3 (storage), SimpleDB, EC2 (compute power), FPS (payment service). All this is scalable (increase/decrease capacity on demand).

Scalability. Availability. Performance. Cost-Effectiveness.

Growth: largest selection on earth, good customer experience, drives prices down, drives traffic, sellers, selection, and this is a cycle for growth. It brings a lower cost structure, which in turn lowers prices.

This means that incremental scalability is key to Amazon’s business. Grow one step at a time, consistently. Turn a fixed cost, into a variable cost, as your business grows seamlessly.

Elastic cloud: grow and shrink on demand, with minimal disruption to performance. Operational efficiency, fault-tolerant, and remember, everyone has different growth paths. Heterogeneity - do not believe that all your nodes have the same capacity! A year from now, you will have more powerful machines, your software must scale for this.

Everything fails, all the time. An epic truth.

Failures are highly correlated. By every possible worst way! Systems do not fail by stopping - they will fail by sending out garbage ;) Your system must be able to deal with that.

Determinism is an illusion. An illusion created in a very small space. “Let go of control!”

Engineer for performance at 99.9%. Remember, address uncertainty - acquire resources on demand, pay for what you use, leverage others’ core competencies, turn fixed costs into variable costs. Never ever pay again for something sitting in your data centre doing nothing for you.

All data access at Amazon is primary key based. Eventually consistent, for high read volume and always writeable. Query-based access was non-relational.

Primary Key Access: Amazon S3; Query-based Access: SimpleDB; EC2 with persistent storage for a dedicated solution

Persistent storage? Raw disk, attach a volume to EC2. You can also detach. Infinite scalability in terms of data. From snapshots, you can create new volumes.

“All you need is a credit card” - for AWS. Lots of laughter :)

Memcached and MySQL tutorial

Monday, April 14th, 2008

Memcached by Brian Aker, Alan Kasindorf (dormando). Here are some quick, somewhat sparse notes. Follow the slides, it will help.

Slides: http://download.tangent.org/talks/Memcached%20Study.pdf

Memcached was actually created for LiveJournal. It has evolved a bit over time. Chaos to user based clustering, and then Brad implemented memcached. LiveJournal has about 30GB of cache available between 8-12 machines. The DB reads were down like 10x the moment they started using memcached (it’s much better now).

It’s not only for simple objects (not just a single row) - you can use it for complex queries, and the result can be stored in memcached. Eins.de, Patrick Lenz, is also the freshmeat.net guy. He put memcached on the same machine as the MySQL database server (he has 32-bit machines, and MySQL can only use a certain amount of RAM, so the rest was for memcached). This is definitely not the recommended way. Have separate memcached servers.

PatG comes up to talk about Grazr, which is more of a write-through cache. Refer to Page 8 of the slides. Now, the thought is that maybe Pat should’ve used gearman, rather than writing their own software. Memcached has allowed them to do it asynchronously. They’re using bulk inserts now as well.

DownUnder GeoSolutions uses lustre, which is a clustering filesystem. They’re not a web-based solution. They extract data off lustre, store it in memcached. Processing happens on the memcached RRU.

memcached by itself does very little. There’s a simple daemon, and it responds to gets/sets/add/replace. It sits on top of a very simple slab allocator. In the early days, every time you called it, it ran malloc() and it would free() the memory when done. So, now, it makes one slab allocator for different types of objects.

memcached is event based. libevent is a generic wrapper around epoll/kqueue, and it’s very scalable for network connections. 10,000 connections to a memcached is ok - it only cares how many of them are “active”.

The protocol is very simple. Everyone hates it, but everyone uses it. You can even fire up telnet to talk to memcached. It’s very easy to write a client against protocol.txt and talk to it.
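
To show how simple the text protocol is, here is a small sketch that builds a set and a get command and parses a single-key get reply, all on bytes and without a running server. The helper names (build_set, build_get, parse_get_response) are my own; the wire format follows what I understand protocol.txt to describe for the basic text commands.

```python
def build_set(key: str, value: bytes, flags: int = 0, exptime: int = 0) -> bytes:
    """Encode a 'set': header line with flags, expiry and length, then data."""
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def build_get(key: str) -> bytes:
    """Encode a single-key 'get'."""
    return f"get {key}\r\n".encode("ascii")


def parse_get_response(raw: bytes):
    """Return (flags, value) for a single-key get reply, or None on a miss."""
    if raw == b"END\r\n":
        return None  # a miss is just the terminator with no VALUE line
    header, _, rest = raw.partition(b"\r\n")
    word, _key, flags, length = header.split(b" ")
    if word != b"VALUE":
        raise ValueError("unexpected response")
    return int(flags), rest[: int(length)]


wire = build_set("foo", b"bar")                       # what you would type in telnet
reply = parse_get_response(b"VALUE foo 0 3\r\nbar\r\nEND\r\n")
```

A stored value comes back with the same flags it was written with, which matters for the compression trick mentioned later in these notes.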

memcached? A big stupid hash table. In a grid, it’s a distributed hash table. memcached is 2 hash tables - one in the client, and one in the server. 30 memcacheds don’t need to know about each other - they’re blind to each other. There is no cross traffic. You just add more servers, to scale up.

Clients hash keys to the server list. Take a single key (250 bytes max), the client hashes it. You have a value, you want to access it, here’s a key. There is multiple hashing going on, as some clients do things like compressing data.
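
A minimal sketch of the client-side half of that, assuming a plain CRC32-modulo scheme (real clients differ in the details). The server names are made up.

```python
import zlib

# Hypothetical server list; every client must use the same list and order.
SERVERS = ["cache1:11211", "cache2:11211", "cache3:11211"]


def pick_server(key: str, servers=SERVERS) -> str:
    """Hash the key on the client and map it onto one server."""
    raw = key.encode("utf-8")
    if len(raw) > 250:
        raise ValueError("memcached keys are limited to 250 bytes")
    return servers[zlib.crc32(raw) % len(servers)]


chosen = pick_server("user:42")
```

The servers never talk to each other; any client with the same list picks the same server for the same key.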

How do I dump data? You don’t. It’s a cache.

How is it redundant? It’s not. The server itself doesn’t know about other servers around it! PECL and the next version of libmemcached will understand replication. The redundancy happens in clients.

How does it handle failover? It doesn’t. If it dies, it dies. A client can of course, handle it.

How does it authenticate? It doesn’t at all. Don’t stick one of these, open faced, on the Internet - when you connect to it, you have full access to any commands in the server and all contents in the server. You don’t want folk just typing flush in the server ;)

A very simple service, very simple server.

Details on the Server? Page 14 has pretty much all the commands you can use in memcached. You can even run these from telnet.
- set operation throws data inside memcached (it doesn’t care if there’s other data in it)
- add is lightly atomic - it won’t add data that is already there
- stats can give you particular pieces of information, or give you a full dump. Hit ratio, cache efficiency, and lots more, can come out of this

All drivers you are seeing, are just basically extending all these commands. cas (compare and swap atomic!) today is pretty limited

memcached can even run on FreeBSD 4. Most people run memcached on Linux. No one in the audience has deployed memcached on OSX.

There’s MySQL integration. Most users grab an object from the database, then store the object to memcached. The UDF memcached functions are probably the most successful UDF in MySQL’s history :)
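
The “grab from the database, store to memcached” habit looks roughly like this. It is only a sketch: a plain dict stands in for the memcached client, and CountingDB is a made-up fake database that counts how often it gets read.

```python
class CountingDB:
    """Hypothetical database stand-in that counts reads."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def fetch(self, user_id):
        self.reads += 1
        return self.rows[user_id]


def load_user(user_id, cache, db):
    """Try the cache first; on a miss, read the DB and fill the cache."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:  # hit: the database is not touched
        return cached
    row = db.fetch(user_id)
    cache[key] = row  # with a real client this would be a set()
    return row


cache = {}
db = CountingDB({1: {"name": "brian"}})
first = load_user(1, cache, db)
second = load_user(1, cache, db)  # served from the cache
```

The second call never reaches the database, which is where the big drop in DB reads comes from.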

There’s pgmemcache() for PostgreSQL, but not much is known about it.

Apache - mod_memcached, has CAS operations exposed. Different to the lighttpd implementation.

There are limitations (page 23). If you wanted to change things, you can recompile memcached, but you might not want to do that. Largest slab class in the system, is 1 megabyte. So data size is under 1 megabyte. Beware if you’re running on a 32-bit system (going over 4GB and you will segfault). A 64-bit system should be fine, in general.

memcached supports threads, thanks largely to Facebook. You probably don’t need this, unless you are Facebook. Memcached’s CPU footprint is tiny.

If you gave memcached 16GB, you will not get your memory back, even if you run flush. The memory is permanently allocated from the OS (much like how Vista does things?). There is mlockall() support, so you can guarantee there will be no paging. Or just disable swap.

jallspaw: memcached1: 22:02:00 up 992 days, 11:57, 0 users, load average: 0.35, 0.37, 0.37

(posted on IRC at #mysqlconf). memcached hardly ever crashes.

You can disable the LRU if you want (there’s a command line option for this).

Hashing comes in 2 flavours - normal and consistent hashing. All drivers support CRC today.

A consistent hash means that, instead of doing a modulo on the key hash, you interlace many servers around a ring across the network. When you have 100 servers running and add another server to the network, you want to add just that server, and not lose the entire cache network at once.

libmemcached can do replicas, so it can take data from servers, and apply it to the ring. So if a server is taken out of the network, it can be found elsewhere on the ring. You can keep these networks up and running, and easily growing, with new servers, without losing cache coherency.
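
Here is a rough sketch of why the ring matters. It is not libmemcached’s actual algorithm: HashRing is my own toy class, SHA-256 is just a convenient hash, and 100 points per server is an arbitrary choice. It counts how many keys change servers when an 11th server joins, under plain modulo and under the ring.

```python
import bisect
import hashlib


def _hash(text: str) -> int:
    """Stable 64-bit hash of a string (SHA-256 is an arbitrary choice)."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with several points per server."""

    def __init__(self, servers, points_per_server=100):
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(points_per_server)
        )
        self._hashes = [h for h, _ in self._ring]

    def pick(self, key: str) -> str:
        # First point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


def modulo_pick(key, servers):
    return servers[_hash(key) % len(servers)]


def moved_fraction(pick_before, pick_after, keys):
    """Share of keys that land on a different server after the change."""
    return sum(1 for k in keys if pick_before(k) != pick_after(k)) / len(keys)


old = [f"cache{i}" for i in range(10)]
new = old + ["cache10"]
keys = [f"key{i}" for i in range(5000)]

ring_old, ring_new = HashRing(old), HashRing(new)
ring_moved = moved_fraction(ring_old.pick, ring_new.pick, keys)
mod_moved = moved_fraction(
    lambda k: modulo_pick(k, old), lambda k: modulo_pick(k, new), keys
)
print(f"ring: {ring_moved:.0%} of keys moved, modulo: {mod_moved:.0%}")
```

With modulo, most keys move to a different server when the list grows, so the cache is effectively cold again; with the ring, only the keys taken over by the new server move.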

Don’t only look at the return value, look at the fact that zero may actually be a credible value, even. An actual value of zero, versus a “we didn’t find anything” is very different.

Slide 35, the ghetto locking implementation for memcache-client. Creates a pseudo-lock around a process. You’re the only process that’s processing this area, so you add a key lock, where you ensure you test for nil, not zero (you’re testing for the existence of the lock). If your process dies, someone else will try in 30 seconds (lock expire). Add will only work if there’s no key existing at that point (remember, an add is not a set).
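
A sketch of that locking idea, not the code from slide 35. TinyCache is a made-up in-memory stand-in with memcached-like add/get/delete and expiry, and with_lock is my own helper. It also shows the nil-versus-zero point: a stored 0 is a real value, while a miss comes back as None.

```python
import time


class TinyCache:
    """In-memory stand-in with memcached-like add/get/delete and expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}

    def _live(self, key):
        item = self._items.get(key)
        if item is None:
            return None
        _value, expires_at = item
        if expires_at is not None and self._clock() >= expires_at:
            del self._items[key]  # expired entries behave like misses
            return None
        return item

    def add(self, key, value, ttl=None):
        """Store only if the key is absent; an add is not a set."""
        if self._live(key) is not None:
            return False
        expires_at = None if ttl is None else self._clock() + ttl
        self._items[key] = (value, expires_at)
        return True

    def get(self, key):
        item = self._live(key)
        return None if item is None else item[0]

    def delete(self, key):
        self._items.pop(key, None)


def with_lock(cache, name, work, ttl=30):
    """Run work() only if we win the add; returns (acquired, result)."""
    lock_key = f"lock:{name}"
    if not cache.add(lock_key, 1, ttl):
        return False, None  # someone else holds the lock
    try:
        return True, work()
    finally:
        cache.delete(lock_key)


fake_now = [0.0]  # controllable clock so the expiry can be shown
cache = TinyCache(clock=lambda: fake_now[0])
got_first = cache.add("lock:report", 1, ttl=30)   # first process wins
got_second = cache.add("lock:report", 1, ttl=30)  # second one is refused
fake_now[0] = 31.0                                # holder died, lock expired
got_after_expiry = cache.add("lock:report", 1, ttl=30)
```

Returning a pair from with_lock keeps “did not get the lock” separate from “the work returned nothing”, the same distinction the speakers made about zero and not-found.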

PHP is probably the best supported language for memcached. The PECL memcached library is C backed, standard, and works fine. libmemcached will probably take over most of its features eventually, but it’s not there yet.

Default, if you call increment by a key, it bumps by one. You can also step it up instead of 1, say like 500 or something. Refer to slide 41. Just like you can increment a key, you can decrement also.

C/libmemcached. C driver, there’s a C++ wrapper. Sync and async cached keys. It supports replication through the network. Has read through cache support.

You can not only store a value, but you can also store flags. Flags to keep track of generations, keep track of MIME type internally (so not only store object type, but MIME type). This is unique for libmemcached. Most other drivers use this flags value to see if it’s compressed or not (the flag = 1 for compression, 0 for no).
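
The compression use of the flags value can be sketched like this. The threshold and helper names are my own assumptions, not any particular driver’s behaviour; only the “1 means compressed, 0 means not” convention comes from the talk.

```python
import zlib

FLAG_COMPRESSED = 1  # per the talk: 1 for compressed, 0 for not


def encode_value(data: bytes, threshold: int = 64):
    """Return (bytes_to_store, flags); compress only when it pays off."""
    if len(data) >= threshold:  # threshold is an arbitrary choice here
        packed = zlib.compress(data)
        if len(packed) < len(data):
            return packed, FLAG_COMPRESSED
    return data, 0


def decode_value(stored: bytes, flags: int) -> bytes:
    """Undo encode_value using the flags that came back with the value."""
    if flags & FLAG_COMPRESSED:
        return zlib.decompress(stored)
    return stored


big = b"a" * 1000
stored_big, big_flags = encode_value(big)
stored_small, small_flags = encode_value(b"hi")
```

Because the flags travel with the value, the reader does not need to guess whether what it got back is compressed.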

Multiget is 7-9x faster than just a get. Look at Page 48 for an example.

Memcached for MySQL? Uses the UDF API. You can now incorporate most of the memcached stuff, in the SQL server, so you can do deletions and get operations easily.

http://tangent.org/586/Memcached_Functions_for_MySQL.html

What do you think about persistent connections? Use them. libevent supports them.

Spaces to watch: MogileFS. HyperTable. HBase. People have stopped talking about POSIX filesystems, and are talking more about object filesystems. It’s what all the cool kids are doing.

Chris Blizzard on Mozilla

Monday, April 14th, 2008

Chris Blizzard, now working at Mozilla on Linux integration, gave a most interesting talk about Mozilla and their new mobile initiatives. We managed to speak (but not nearly enough) about the mobile strategy afterwards (i.e. I think limiting it to the n810 or tablet like devices alone seems myopic; phones are where it’s at), and I hope the conversation continues. Now for some quick notes.

- mozilla.org, is where products create motion. Been around for just over 10 years now
- Mozilla targets human beings (not developers)
- Focus on protecting open standards
- “Creating Joy!” for users
- Avoid feature creep (this is the secret of add-ons) - control the product, and just say, go build an extension. It isn’t just about customising your experience, but it’s about keeping the core experience joyous and uncluttered.
- Fix real problems on the web (i.e. pop-up blocking)
- 500 contributors to Firefox 3, 75 Localization teams, 200 people, 11,000 patches, 165+ Million users, added +45 million users in the last 6 months, and doubled in the last year - these are impressive statistics (I for one, am impressed by their developer community)
- Who are we targeting? Read Seth Godin’s blog entry “Why downloading Firefox is like getting into college”. Also, Stephen O’Grady’s blog entry “Ode to the Common Man”
- Bring the full web to mobile. FF3 is where great technology for mobile exists.
- Apple has reset the idea of what the Internet on a mobile should be, thanks to the iPhone. They’ve definitely opened up the market for mobile based browsers. Note, no reason to redesign your website for mobiles in the future…
- Fennec - mobile browser experience
- Performance numbers on the n810 - faster than MicroB and WebKit. Not even optimised for ARM (i.e. no atomic locking), but already at a headstart
- Fennec will support add-ons. Touch and keypad versions are coming soon… Keep in mind all this is just getting started
- Android includes WebKit as part of the base platform. Mozilla on Android? Not quite yet, since Google wants only Java based applications. No mention of native applications yet from Google.
- Not really considered Series 60 (it would be nice), no talk of PalmOS, there is some form of Windows Mobile version, but it’s not released
- Gecko is hard to embed, in comparison to WebKit. The technology needs to improve, so that the gap that WebKit has, doesn’t widen further
