Friday, August 29, 2008

Open Source Search Engine Rodeo: Solr v. Sphinx v. MySQL-FT

Last summer Anthony Arnone and I did a study on the performance of three open source search engines.
We chose these three because two of them have close ties to MySQL and the third is a well-used, performant offering from Apache. There are many others that we skipped.

Here's the Report in PDF.

Solr was the clear winner. Sphinx was a close second, with blindingly fast indexing times.

At this point the report's results are somewhat dated as both Sphinx and Solr are readying new releases. So your mileage may vary, and I'm sure Peter Zaitsev and the Sphinx team could show us how to improve the performance of their engine.

Updates: The Sphinx team contacted me and suggested some ways to improve Sphinx performance. New results will be published some time soon. They will likely also publish a test using Wikipedia as the document repository.

More Updates: I have started a new Solr project and may test Sphinx again.

Tuesday, August 12, 2008

Great comment on MapReduce

Spot-on comment on the Database People Hating on MapReduce blog post:
I think this document is comparing things that are not comparable. They are talking about MapReduce as if it were a distributed database. But that's completely wrong. Hadoop is a distributed computing platform, not a distributed database prepared for OLAP.
MapReduce is a re-implementation of LISP's map and reduce in a parallel setting. The function/task that you hand to Map is where the rubber meets the road: that's where the data actually gets read from whatever data store you have.
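
To make the Lisp lineage concrete, here's a minimal word-count sketch in Python (the data and names are made up for illustration). The "map" function turns one chunk of input into a partial result, which in a real job is exactly where the data store gets read; the "reduce" function folds partial results together. Hadoop's contribution is running the map calls in parallel across many machines rather than a local process pool.

    from functools import reduce
    from collections import Counter
    from multiprocessing import Pool

    def map_chunk(text):
        # The Map step: one chunk of input -> a partial result.
        # In a real job this is where the data store actually gets read.
        return Counter(text.split())

    def reduce_counts(a, b):
        # The Reduce step: fold two partial results into one.
        a.update(b)
        return a

    if __name__ == "__main__":
        chunks = ["the quick brown fox", "the lazy dog", "the fox again"]

        # Run the map calls in parallel (a local stand-in for Hadoop's
        # distributed map), then fold the partial counts together.
        with Pool() as pool:
            partials = pool.map(map_chunk, chunks)

        totals = reduce(reduce_counts, partials, Counter())
        print(totals)  # Counter({'the': 3, 'fox': 2, ...})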

MapReduce versus RDBMS - Round 2

Round 2 (for me anyway; this discussion has been raging for a while). Nice read on how Rackspace Now Uses MapReduce and Hadoop. They started with shell scripts, evolved to remote RPCs of shell scripts, moved to MySQL, iterated on MySQL, and then jumped to a heterogeneous Hadoop + Solr + HDFS setup. Terabytes of data.

The MySQL evolution was interesting, as I'm going through a similar process of attempting (and planning) to continually refine MySQL performance. We need UPDATE statements in a big way, so it's a bit different from appending to log structures.

I've been playing with a daily summarizing and distributed ETL setup in MySQL. Basically, with creative use of Views and the Federated Engine one can do a scheduled daily map and reduce; a rough sketch follows below. I hold no hope that this is a solution for ad hoc queries; it's not nearly that flexible. A wiki page describing this system is coming soon.
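
The sketch uses hypothetical host, schema, and column names (and any DB-API driver such as MySQLdb or pymysql would do). Each partition server rolls its previous day of raw rows into a local summary table on a schedule; the aggregator box then sees each of those summaries through a FEDERATED table:

    # The daily "map" step, run on each partition server (via cron or a
    # MySQL EVENT): roll yesterday's raw rows into a local summary table.
    DAILY_ROLLUP = """
    INSERT INTO daily_summary (day, user_id, events)
    SELECT DATE(created_at), user_id, COUNT(*)
    FROM raw_events
    WHERE created_at >= CURDATE() - INTERVAL 1 DAY
      AND created_at <  CURDATE()
    GROUP BY DATE(created_at), user_id
    """

    # On the aggregator box: one FEDERATED table per partition server,
    # pointing at that server's daily_summary table.
    CREATE_FEDERATED = """
    CREATE TABLE IF NOT EXISTS summary_part{n} (
        day     DATE,
        user_id INT,
        events  INT
    ) ENGINE=FEDERATED
      CONNECTION='mysql://etl:secret@part{n}.example.com:3306/stats/daily_summary'
    """

    def build_federated_tables(conn, n_partitions):
        # conn is any DB-API connection to the aggregator's MySQL instance.
        cur = conn.cursor()
        for n in range(1, n_partitions + 1):
            cur.execute(CREATE_FEDERATED.format(n=n))
        conn.commit()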

I'm still trying to find a solution other than a 'union view' across the federated tables from the n data partition servers as the Map. The Reduce will be a set of stored procedures against the union view (sketched just below). Perhaps this post and hacked code hold promise for gluing Hadoop to JDBC/MySQL storage engines; that would better enable ad-hoc queries.
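
Continuing the same hypothetical sketch: the "Map" side is just a view that UNIONs the federated per-partition summaries, and the "Reduce" is an aggregate query (or a stored procedure wrapping one) run against that view. One likely weakness is that every question drags all the federated rows over the wire to the aggregator, which is part of why this doesn't feel like an answer for ad-hoc queries.

    # The "Map": a union view over the federated per-partition summaries.
    # Extend the UNION for however many partition servers exist.
    CREATE_UNION_VIEW = """
    CREATE OR REPLACE VIEW all_summaries AS
        SELECT day, user_id, events FROM summary_part1
        UNION ALL
        SELECT day, user_id, events FROM summary_part2
        UNION ALL
        SELECT day, user_id, events FROM summary_part3
    """

    # The "Reduce": aggregate across all partitions through the view
    # (in practice this would live inside a stored procedure).
    REDUCE_QUERY = """
    SELECT day, COUNT(DISTINCT user_id) AS users, SUM(events) AS events
    FROM all_summaries
    GROUP BY day
    ORDER BY day
    """

    def run_daily_reduce(conn):
        # conn is the same aggregator connection as in the previous sketch.
        cur = conn.cursor()
        cur.execute(CREATE_UNION_VIEW)
        cur.execute(REDUCE_QUERY)
        return cur.fetchall()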

I also wonder if MySQL Proxy is useful here... looking into it... but at first glance it doesn't naturally fit the pattern of a distributed Map operation.

Open question: if one could publish MySQL stores to a column-oriented DB like MonetDB or LucidDB and then run Hadoop map-reduce operations over it, would I have what I want for ad-hoc queries?

Monday, August 11, 2008

MapReduce versus RDBMS

I managed to stumble upon an interesting article while looking for a MySQL multi-database federated query tool.

David DeWitt and Michael Stonebraker write: MapReduce: A major step backwards.

They rightly point out that MapReduce is a 25-year-old idea. Lisp has had this functionality for decades; in fact it's at least 30 years old, with Griss & Kessler 1978 apparently the earliest description of a parallel Reduce function. That said, it's only in the last 10 years, with the advent of cheap machines, that an idea this great could be implemented widely.

Their second point is that MapReduce is a poor implementation as it doesn't support or utilize indexes.
One could argue that the value of MapReduce is automatically providing parallel execution on a grid of computers. This feature was explored by the DBMS research community in the 1980s, and multiple prototypes were built including Gamma [2,3], Bubba [4], and Grace [5]. Commercialization of these ideas occurred in the late 1980s with systems such as Teradata.

In summary to this first point, there have been high-performance, commercial, grid-oriented SQL engines (with schemas and indexing) for the past 20 years. MapReduce does not fare well when compared with such systems.
Great point and point taken. However, where are the open source implementations of the things you mention? This is a bit of the 'if a tree falls in the woods and no one is there to hear it' problem. A major reason MapReduce has seen uptake (other than being a child of Google) is that an example implementation is available for the Horde to steal, copy, improve & translate.

The modern user-generated-content web is mostly built on open source these days, so the fact that I can get the above technology in commercial databases is a non-starter.

I'm a SQL junkie and am searching in vain (it seems, so far) for a decent extension to MySQL that does cross-database queries and reduction over tables I know to be neatly partitioned. No luck so far. I'm starting to look into other SQL engines, as there may be an ODBC wrapper for the federation layer. It's got to be mostly functional and EASY to adopt... or you'll continue to have people spouting the MapReduce dogma.

Post scripts:
  • Nice summary of Dr. Stonebraker's accomplishments here
  • Funny link to comp.lang.lisp: some newbies asking whether Lisp has MapReduce.

Monday, August 04, 2008

Improving Software Release Management

I just found this in my inbox: 7 Ways to Improve Your Software Release Management. It's an excellent overview of 'doing things differently', and pretty similar to the change I experienced at my last job... and something that Mike Dierken at my current job just seems to know instinctively.

I need to find some material on the best ways to do personal, lightweight processes. I'd like to be more efficient at producing software... especially software based on speculative ideas. So much of the time, data mining and machine learning code is subject to the vagaries of the data set you're working against, and it's difficult to know ahead of time whether a given algorithm will work well, how much data cleaning needs to be done, etc.

How can you adapt lightweight processes and the things mentioned above to producing software where what needs to be done is not so cut and dried? In those situations I tend to ping-pong between too little process (seat-of-the-pants coding) and too much (excessive research and design before coding starts).

Post script: Had to add this:

Five Things Linus Torvalds Has Learned About Managing Software Projects