Friday, August 29, 2008

Open Source Search Engine Rodeo: Solr v. Sphinx v. MySQL-FT

Last summer Anthony Arnone and I did a study on the performance of three open source search engines.
We chose these three because two of them have close ties to MySQL and the third is a widely used, performant offering from Apache. There are many others that we skipped.

Here's the Report in PDF.

Solr was the clear winner. Sphinx came in a close second, with blindingly fast indexing times.

At this point the report's results are somewhat dated as both Sphinx and Solr are readying new releases. So your mileage may vary, and I'm sure Peter Zaitsev and the Sphinx team could show us how to improve the performance of their engine.

Updates: The Sphinx team contacted me and suggested some ways to improve Sphinx performance. New results will be published some time soon. They will likely also publish a test using Wikipedia as the document repository.

More Updates: I have started a new Solr project and may test Sphinx again.

9 comments:

shodan said...

Did you benchmark all that under Windows?

Neal said...

Nope.. 100% Linux. A circa 2005 version of RedHat. Thanks - Neal

shodan said...

Hmm, but the PDF mentions ODBC and then PHP.EXE. Is that just a typo or something?

Sphinx by default performs (much) more complex ranking than Solr. Its default phrase ranking may introduce over 2x slowdown compared to Solr's default TF-IDF based ranking.
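To make the ranking-cost point concrete, here is a minimal sketch of textbook TF-IDF scoring. It is not Solr's or Sphinx's actual formula; the toy documents, the function name `tf_idf_scores`, and the `log(N/df) + 1` smoothing are all assumptions chosen for illustration. The score only needs term counts, whereas a phrase-aware ranker would additionally have to inspect term positions.

```python
import math
from collections import Counter

# Toy corpus; the document ids and texts are made up for illustration.
DOCS = {
    "d1": "open source search engine",
    "d2": "search engine benchmark search",
    "d3": "database replication",
}

def tf_idf_scores(docs, query):
    """Score each document by a simple sum of tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for t in tokenized.values() if term in t)
            if df == 0:
                continue  # term appears nowhere, contributes nothing
            idf = math.log(n_docs / df) + 1.0  # assumed smoothing variant
            score += tf[term] * idf
        scores[doc_id] = score
    return scores

scores = tf_idf_scores(DOCS, "search benchmark")
```

With this corpus, `d2` scores highest because it contains both query terms, and `d3` scores zero.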

Also Sphinx wasn't really optimized for fully RAM based indexes. Make the index touch the disk, and the situation is likely to become exactly the opposite. On my own benchmarks for that case (5 GB index on 1 GB machine), Solr was 2-3x slower than Sphinx, even though it was indeed faster when everything fit in RAM. Not that much faster though :-)

Neal said...

Can you post your benchmarks somewhere please? Interesting stuff.

I wanted to indicate that php was forked and not used via mod_php. Did not mean to imply Windows.

Our test was meant to be quick and dirty to see if either engine would meet an arbitrary performance criterion.

You are correct about the index on disk versus in memory. We ran the test the way we did for a few internal reasons. Relevance testing wasn't one of them, as we would be re-ranking the results from either engine with a machine learning algorithm that learns relevance from searches and clicks.

Solr's default ranking is not relevant as it's trivial to enhance via the configuration file. Our test queries were simple search terms with no boolean logic, so any ranking is almost a matter of the opinion of the ranking algorithm rather than 'true/correct'.

At the time, Solr just 'felt' more flexible while still being as fast as we wanted. I'm sure Sphinx has improved drastically.

I've been in contact with Peter Zaitsev of Sphinx. We may do a better performance study when Solr 1.3 and the next version of Sphinx are out.

shodan said...

Actually, on some of the runs we could not get Solr to be faster even on RAM based indexes.

I was going to post the data and the scripts (not just the results) at sphinxsearch.com once we have something clean enough and worth posting. Though maybe I should post just the results.. but quickly :-)

The relevance of the results that the ranker produces is another story indeed; but I was talking about its performance. (And performance was one of the factors in your judgement.) Simply switching to "boolean matching mode" in Sphinx would produce 2x speedup, I believe.

Overall I'd call your conclusion, well, too broad. Solr indeed might have been (and perhaps still is!) faster on your specific task (lots of quick queries against a RAM based index). But does that alone make it clearly recommended for every other high query-load situation? Especially when we don't have even the slightest idea just how much those wrapper PHP scripts ate? I would not be 100% sure.

I'd also be interested to know exactly what query types and filter chains you found missing from Sphinx, but that's starting to be out of blog comments scope. If you're willing to explain in detail, could you please email me? The username is 'shodan' and the domain is mentioned up there :-) Thanks!

Neal said...

I'm glad you've taken an interest.. I was 100% certain the Sphinx people would find it and tell me what I did wrong here.

Let's have an internal discussion and I'd be happy to rework the conclusions to make the PDF more representative.

I didn't intend to make the conclusions overly broad. I WAS hoping for discussion though!

kryton said...

hi neal.

I had a look at the benchmark, and was a bit concerned.

You are only looking at the 'speed' of the products, and to be honest, that is such a small part of a search product. Unless you are running a multi-million page/day site, 1-2 boxes will be fine for any solution and you aren't going to see any difference in the user experience.

The biggest part is relevance. Who cares if a result comes back in 10ms if it is crud?
TREC has a dataset with sample queries and a methodology to judge this. For your test you could choose an area you know well and see how well the results stack up.


The other major use for a search engine is faceted search (for the search term 'x', how many matching products does each manufacturer have?). You may want to benchmark how this performs on all 3 engines.
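As a rough illustration of what a facet query computes, here is a minimal in-memory sketch. The product list, field names, and `facet_counts` helper are hypothetical and have nothing to do with any engine's real API; a real engine would compute these counts from its index rather than by scanning documents.

```python
from collections import Counter

# Hypothetical product catalog; names and fields are illustrative only.
PRODUCTS = [
    {"name": "usb cable", "manufacturer": "Acme"},
    {"name": "usb hub", "manufacturer": "Acme"},
    {"name": "usb charger", "manufacturer": "Globex"},
    {"name": "hdmi cable", "manufacturer": "Globex"},
]

def facet_counts(products, term, field="manufacturer"):
    """Count the products whose name contains `term`, grouped by `field`."""
    term = term.lower()
    matches = (p for p in products if term in p["name"].lower().split())
    return Counter(p[field] for p in matches)

counts = facet_counts(PRODUCTS, "usb")
```

Here `counts` maps each manufacturer to the number of its products matching 'usb'. Benchmarking this per engine would mean timing the equivalent facet request rather than a plain keyword query.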


Date-based searches are also a PITA for some engines I've found. (find search term 'x' and order the results by a timestamp)
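The date-ordered case can be sketched the same way. This is only a toy, assuming hypothetical documents with a `ts` field and a made-up `search_by_date` helper; the cost in a real engine comes from sorting a large match set by a stored attribute instead of by score.

```python
from datetime import datetime

# Hypothetical documents; ids, texts, and timestamps are illustrative only.
DOCS = [
    {"id": 1, "text": "solr release notes", "ts": datetime(2008, 8, 1)},
    {"id": 2, "text": "sphinx release notes", "ts": datetime(2008, 8, 20)},
    {"id": 3, "text": "solr benchmark", "ts": datetime(2008, 8, 29)},
]

def search_by_date(docs, term, newest_first=True):
    """Return documents containing `term`, ordered by timestamp, not relevance."""
    term = term.lower()
    hits = [d for d in docs if term in d["text"].lower().split()]
    return sorted(hits, key=lambda d: d["ts"], reverse=newest_first)

result_ids = [d["id"] for d in search_by_date(DOCS, "solr")]
```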

disclaimer: I have used solr for ~3 years and have implemented it on several major sites

Neal said...

kryton,
You are 100% correct on the importance of relevance. At the time I did the test we were 100% only interested in speed for the first phase. I had a specific criterion to meet in searches per second.
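A searches-per-second check of that kind can be sketched as below. This is not the harness used for the report; `measure_qps`, `meets_target`, and the stand-in `fake_search` function are assumptions for illustration. Timing the search call in-process like this also keeps wrapper-script overhead out of the measurement.

```python
import time

def measure_qps(search_fn, queries, repeats=3):
    """Return the best observed searches per second over `repeats` runs."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        for q in queries:
            search_fn(q)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, len(queries) / elapsed)
    return best

def meets_target(qps, target_qps):
    """True if the measured throughput reaches the required threshold."""
    return qps >= target_qps

# Stand-in for a real engine client; replace with an actual search call.
def fake_search(query):
    return [doc for doc in ("solr", "sphinx", "mysql") if query in doc]

qps = measure_qps(fake_search, ["solr", "sphinx", "mysql"] * 100)
```

Calling `meets_target(qps, target_qps)` with your own threshold then gives a pass/fail answer per engine.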

This was for a previous employer getting tens of millions of searches per month across 3K different indexes (document repositories), each with anywhere from hundreds to tens of millions of documents.

The goal was to consolidate search servers from 10-50 or so repositories per server to something like up to 500 repositories per server with read slaves via replication.

We weren't that concerned with the native relevance of Solr versus Sphinx, as we would alter them as well as adapt our collaborative-filtering-like relevance re-ranking on top of either engine.

Grant Ingersoll said...

Can you share your Solr and Sphinx configurations? Also would be good to see your indexing code for both.

Claims that X is slower than Y should always be backed up with more than just a PDF. Show us the code.

Likewise, on your retractions, I hope you're willing to take suggestions from the Solr team, as well.