Tuesday, July 29, 2008

How not to launch software

From cnet:

Cuil shows us how not to launch a search engine

Google challenger Cuil launched last night in blaze of glory. And it went down in a ball of flames. Immediately after launch, the criticism started to pile on: results were incomplete, weird, and missing.
The various articles on Cuil's failure revealed much about their architecture. Apparently they categorize a user query into a topic and ship it out to topical servers. While this sort of 'topical partitioning' is interesting, it has zilch to do with relevance ranking... and it suffers from a failure-point issue: if a topic server goes down, then queries against that topic will get junk results or zero results.
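To make the failure-point issue concrete, here is a minimal sketch of topical partitioning as I understand it from the press coverage. Everything in it is my own assumption for illustration: the keyword-overlap `classify` function, the `TopicServer` class, the `route` function and the toy documents are hypothetical stand-ins, not Cuil's actual design.

```python
from typing import Dict, List

# Hypothetical topic vocabulary; a real classifier would be far richer.
TOPIC_KEYWORDS = {
    "biology": {"evolution", "mutation", "recombination"},
    "sports": {"football", "baseball"},
}


def classify(query: str) -> str:
    """Pick the topic whose keyword set overlaps the query the most."""
    terms = set(query.lower().split())
    best_topic, best_overlap = "general", 0
    for topic, keywords in TOPIC_KEYWORDS.items():
        overlap = len(terms & keywords)
        if overlap > best_overlap:
            best_topic, best_overlap = topic, overlap
    return best_topic


class TopicServer:
    """Stand-in for one topical partition holding its own slice of the index."""

    def __init__(self, topic: str, docs: List[str]) -> None:
        self.topic = topic
        self.docs = docs
        self.up = True  # flip to False to simulate a crash under load

    def search(self, query: str) -> List[str]:
        if not self.up:
            raise ConnectionError(f"topic server '{self.topic}' is down")
        terms = set(query.lower().split())
        # Naive match: any shared term counts as a hit.
        return [d for d in self.docs if terms & set(d.lower().split())]


def route(query: str, servers: Dict[str, TopicServer]) -> List[str]:
    """Send the query to exactly one topical server; no replica, no fallback."""
    server = servers.get(classify(query))
    if server is None:
        return []
    try:
        return server.search(query)
    except ConnectionError:
        # The single partition is the single point of failure:
        # the user sees a 'no results' page.
        return []


def build_demo_servers() -> Dict[str, TopicServer]:
    return {
        "biology": TopicServer("biology", ["Evolution by mutation and recombination"]),
        "sports": TopicServer("sports", ["Baseball season preview"]),
    }
```

With all servers up, `route("evolution recombination mutation", build_demo_servers())` returns the biology document. Mark the biology server as down and the same query comes back empty, even though the sports partition is perfectly healthy. Note that nothing here ranks anything, which is the point: partitioning decides where a query goes, not which hits are best.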

Questions and points of discussion:
  • Is it really true that a data schema partition results in a better engine than Google? No, a better engine is made by better relevance. Perhaps this is what the PR/Marketing people focused on rather than relevance.
  • How do you simulate post-launch load when you have no idea how widely the free press will be distributed?
  • Free post-launch press is invaluable to your business. Squandering it might be a deathblow.
  • Why not launch more quietly in early adopter tech-press and then go try and get mainstream press when you have proven the system?
  • Trading on your status as ex-Googlers (and not early ones at that) seems VERY dubious. Stand on your own feet rather than someone else's.
  • The absolute hottest area of information retrieval research right now is using user click-streams to improve the relevance live and on-line (learning to rank), as well as personalize results. These are differentiating features (if they result in improved relevance).
  • Cuil keeps ZERO user history and assigns no session/user-ids. This will make it very difficult to follow this trend, unless they are using someone else's cookies to do the identification via an analytics partner (no evidence of this).
  • The other hot area of IR research is using semantic analysis and NLP to break away from simple keyword-based inverted indices. Hakia still seems to be doing it better than Cuil... or at least will appear to as long as the topical partitions keep crashing under the load.
  • Risk analysis is a fantastic tool for organizing and prioritizing your work on new products. It seems they missed that part before deciding to launch.
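As a rough illustration of the click-stream point in the list above, here is a toy re-ranker that nudges an engine's base relevance scores with a smoothed click-through rate per (query, document) pair. This is a deliberately simple stand-in, not a real learning-to-rank model, and the `ClickReranker` class, its parameters and the document ids are all hypothetical names of my own.

```python
from collections import defaultdict
from typing import List, Optional, Tuple


class ClickReranker:
    """Boost base relevance scores with a smoothed click-through rate (CTR)."""

    def __init__(self, weight: float = 1.0,
                 prior_clicks: float = 1.0, prior_views: float = 10.0) -> None:
        self.weight = weight
        # Priors keep a document with one lucky click from jumping to the top.
        self.prior_clicks = prior_clicks
        self.prior_views = prior_views
        self.views = defaultdict(int)   # (query, doc) -> times shown
        self.clicks = defaultdict(int)  # (query, doc) -> times clicked

    def record(self, query: str, shown: List[str],
               clicked: Optional[str] = None) -> None:
        """Log one result page: which docs were shown and which one was clicked."""
        for doc in shown:
            self.views[(query, doc)] += 1
        if clicked is not None:
            self.clicks[(query, clicked)] += 1

    def ctr(self, query: str, doc: str) -> float:
        key = (query, doc)
        return ((self.clicks[key] + self.prior_clicks)
                / (self.views[key] + self.prior_views))

    def rerank(self, query: str, scored: List[Tuple[str, float]]) -> List[str]:
        """scored is a list of (doc, base_score); returns docs, best first."""
        ordered = sorted(
            scored,
            key=lambda pair: pair[1] + self.weight * self.ctr(query, pair[0]),
            reverse=True,
        )
        return [doc for doc, _ in ordered]
```

Feed it a few result pages with `record(query, shown, clicked)` and a document that users keep clicking will overtake one with a slightly higher base score. The catch is the logging itself: you need some record of what was shown and clicked, and personalization on top of that needs a session or user id, which is exactly what Cuil says it does not keep.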
I feel bad for these guys, Anna Patterson et al. have done some great work in the past and I just hate to see good people stumble like this.

I still think they are wrong to go out as a consumer engine... Enterprise is a better play. However, if their leading market differentiator is a topical partitioning of back-end servers, then they aren't even considering this, as individual Enterprise customers may not be big enough to need hundreds of servers to distribute the index like that.

Hindsight is always 20/20, and I hope to hell I am not standing there red-faced as software I helped create fails upon a high-volume launch.

Sunday, July 27, 2008

Dubious results from Cuil .. and the Majors

I searched for 'evolution recombination mutation' on cuil.com

The first 2 times, I got a no results page. After a few variations and an hour or so, I tried again and got a nice result set. One really jumped out at me: 'What's Driving Evolution; Mutations or Genetic Recombination'. The problem is that it links to an intelligent design group disputing Evolution in general.

http://nwcreation.net/geneticrecombination.html

After trying the same search on Google, Yahoo, Live.com, Ask and Hakia, I found that same page showing up in the top ten.

Really? This is authoritative and the best that modern world-class search can do? I was hoping for some kind of page summarizing the epic battles between Ernst Mayr and Motoo Kimura.

While this is likely caused by both keyword matches via good SEO and the fact that this page is probably highly linked to... this is a semantic failure! Imagine if a query about the Holocaust had anti-Holocaust propaganda links appearing above genuine factual information!

At least Hakia and Yahoo put up a page refuting the author of the above link in the top 10 results. Google and the rest fail to do so.

Admittedly, not everyone believes in evolution, and that is fine with me, but I'm not sure that fact refutes the semantic/authoritative failure of the engines.

Anna Patterson's new company - cuil.com

Just spotted a NYT article on a new search engine called cuil from Anna Patterson and Tom Costello. A while back I read Dr. Patterson's article "Why Writing a Search Engine is Hard". Nice quick read on the issues. I also stumbled upon several patents she wrote.

So the recent challengers of note are Ask.com, Powerset, Hakia, Wikia, Mahalo and now Cuil. I'm rooting for the algorithmic ones; I'm not really sure how Wikia and Mahalo can scale to be non-niche engines.

While I hope Cuil is successful (I like to see Academics go out and build companies), I'm not sure that it's possible to beat Google at this point. It seems far more likely that Microsoft will just try and swallow Hakia and Cuil and attempt to brew something out of the parts.

I recommend reading Danny Sullivan's post on Cuil; he hit most of the obvious points.

My thoughts:

I still think building intelligence on top of an existing index/engine is the way to go and I'm not sure that bragging about your index size or your back end architecture is going to get you any meaningful marketshare.

Also, Enterprise Search is still far behind in NLP technology vis-a-vis the big 4 and the NLP startups. It's still 1999 there: enterprises are just now figuring out how to expose their vast document sets to an internal crawler and provide a UI that is not overly simplistic for savvy users. They've tried the classic approaches of commodity engines and found them wanting. Link analysis doesn't help either, as most of these documents are not web documents with links. This just cries out for an approach like Vivisimo's clustering + Hakia's semantics + Delicious' user-driven tagging.

Corporate searchers need a good advanced interface and results that can be grouped by things like time and originating department... but they should not be forced to drown in overly similar hits. That is a market worth getting into with a $33M VC investment. The landscape is littered with vendors that over-promised during the sales cycle, and Enterprise customers will switch products if a new one solves the problem better.

Look at the $500M acquisition of Verity by Autonomy in 2005. That's 5X more than Powerset (in 2005 dollars) and they actually had loads of paying customers.

I worry that going directly at Google with a consumer search engine is just so much tilting at windmills. Sometimes just selling a product to people willing to pay for it is easier.

Others Online news and welcome Rance and Vik!

I've been busy in the last few weeks getting new products launched at Others Online. We've deployed a set of new 'Audience Affinity Analytics'. By adding a simple Javascript tag to your webpages, our software delivers free audience summary reports that detail at a keyword/phrase level what people are paying attention to on your site! Short video here.

We've also hired Rance Harmon (MS student at Montana State U) and Vik Jakkula (MS student at Washington State U) as interns for the summer. Rance will be working on general web and Java coding as well as testing frameworks. Vik will be working on some new data mining algorithms. Welcome guys!