Friday, February 20, 2009

Scalable data storage for model learning

I'll admit it, sometimes as an AI & Data wonk you want all your data structured and easily accessible via slicing a dicing, thus SQL databases and relational schemas look like a great idea. And it works for a while... until you're dealing with a torrent of incoming data.

This should not be unfamiliar, the UC-Irvine datasets look like a great place to cut your teeth with Machine Learning .. until you realize that many algorithms and software packages written and litmus tested against such data totally fall down on 'big data'.

This quote from Dave Cancel on Twitter stuck with me: "Databases are the training wheels of software development. Fly free brother, fly free. - University of Sed, Grep, & Awk". My first reaction was, meh.. databases are meant for storing lots of data. I love and use tools like those for prototyping.. but then moved to compiled code + SQL for 'production'.

Mental sea change! Let's say you are building a massive scale system for absorbing click data, processing it and turning it into a recommender system. These are the problems you will see using MySQL at scale. Hint: it tips over at 1K read/write operations per second against the same table.

Don't try and make your read store also your write store. You may not /really/ need a low latency (for model updates) system.

More tips from Dave:
As for storage of [models], I suggest storing them in text file (format up to you), 1 per profile, then stick them behind a reverse caching proxy like Varnish. Infinite scale. For extra points store the text files on S3 and use the built-in webserver to serve them to your reverse proxy. HTTP is your friend as is REST SOAs.
Here's another dead simple storage mechanism:

"simple DB" projects like Voldemort and Tokyo Cabinet and MemcacheDB are options as well.

If you can't depend on a room full of DBAs making your SQL DBs not be dog slow, (or buying $$$ systems from Oracle and Microsoft) you must think differently. Pull yourself out of the Math and AI thinking and simplyify. Big Data will eat you alive otherwise.

No comments: