Round 2 (for me anyway - this discussion has been raging for a while). Nice read on how Rackspace Now Uses MapReduce and Hadoop. They started with shell scripts, evolved to remote RPCs of shell scripts, moved to MySQL, interated on MySQL and then jumped to a heterogeneous Hadoop + Solr + HDFS. Terrabytes of data.
The MySQL evolution was interesting as I'm going through a similar process of attempting/planning to continually refine MySQL performance. We need UPDATE statements in a big way, so it's a bit different than appending to log structures.
I've been playing with a daily summarizing and distributed ETL with MySQL. Basically with creative use of Views and the Federated Engine one can do a scheduled daily map and reduce. I hold no hope that this is a solution for adhoc queries, it's not at all that flexible. Wiki page describing this system coming soon.
Still trying to find a solution other than a 'union view' across the federated tables from the n data partition servers as the Map. The Reduce will be a set of stored procedures against the union-view. Perhaps this post and hacked code holds promise for gluing Hadoop to JDBC/MySQL storage engines. This would better enable ad-hoc queries.
I also wonder if MySQL Proxy is useful here.. looking into it.. but at first glance it doesn't inherently pattern the distributed Map operation well.
Open question: If one could publish MySQL stores to a column oriented DB like MonetDB or LucidDB and then do Hadoop map-reduce operations then do I have what I want for ad-hoc queries?
No comments:
Post a Comment