Saturday, February 21, 2009

Scalable Analytics - random notes

As long as I'm posting about 'rethinking' any reliance on an RDBMS for 'big data' systems, here are a few more notes:

This post on highscalability.com about Rackspace and MapReduce is highly enlightening. The post takes you through a failed use of MySQL for large scale analytics and their conversion to Hadoop.

I can't say I know (yet) the pros and cons of putting structured bigdata on a 'column store DB' versus Hadoop+HDFS. Will probably end up using both systems in various ways. Currently exploiting Sam Tingleff's DGrid and poking at LucidDB for "row filtering and aggregation" style analytics apps.

Looking forward to setting up Hive and Pig next.

On the plus side for MySQL, the federated engine has been quite useful for accumulating data from a sharded/partitioned MySQL setup.. as long as the data being accumulated is less than 100K rows, then it seems to hit a wall. It's also quite brittle if your MySQL instances are having any performance issues.. failed connection can cause other ETLs that depend on that connection to fail in odd ways.

No comments: