aicoder: January 2009

Tuesday, January 27, 2009

response to Noisy Channel post on Lucid Imagination

Cross posted to my blog since it's a long response ;-)

To: Daniel Tunkelang

RE: Noisy Channel blog post on Lucid Imagination

I don’t think it’s their aim to compete with Enterprise search directly (business suicide), though I suspect they might pressure the pricing in the mid-market of search. The small market has been mostly eliminated by Google/Yahoo/MSN site search and open source engines.

Note also that they do not seem to (yet) provide support for Nutch or Droids.. meaning that they are missing a spidering/crawling engine. Same with Tika (office document support). Search result clustering may be coming soon via SOLR-769. No content-management or versioning. (These are fixable pieces given all the open source out there)

There is no good native support for rich taxonomies in Solr/Lucene, nor is there native support for some of the interesting semantic-web data driven features. No self-learning or auto-personalization of results. No analytics (though one could go elsewhere for that).

Lucid is also not offering a hosted Solr service, so they are not a SaaS play either.

All that said, they obviously have some huge wins within the software industry.. but it’s a tough road to go after accounts like Home Depot, Albertson’s, or the government entities.

Enterprise search is mostly about finished feature sets and a near full admin GUI for non-programmers. The question in these lean economic times is whether a given customer considering “build versus buy” is willing to risk starting a professional services engagement to build what they want for cheap, versus purchasing a commercial ES product with way more features than they think they need.

I do think that a smart customer will have new leverage during the sales cycle to credibly threaten the ‘build’ option and get the ‘buy’ price down. And Lucid certainly should limit the ability of the ES companies to get a customer bought in and then milk them for professional services, integration and customization fees… Lucid provides a credible switching threat to cut bait and start over.

Google, Yahoo and open source projects like Lucene have commoditized basic search, so ES is about value-added features, innovative R&D and taking away customer pain and complexity.

Some of the people in Lucid have big plans (Grant Ingersoll comes to mind), and there is absolutely no question that Lucene has made some search vendors look like dinosaurs with slow engines and archaic index structures.

It will be some time before open source catches up to ES.. but it just might not be as long as some would hope.

Disclaimer: The above is my opinion and some fact-looking statements might be wrong.. so Lucene guys jump in!

Monday, January 26, 2009

Lucid Imagination and Sematext

Kudos to the Solr/Lucene gang for launching Lucid Imagination. Grant Ingersoll's announcement. People involved here and here. Some time ago Otis Gospodnetić launched Sematext. Good luck to Lucid and Sematext!

Both of these companies are in the 'support and consulting' model. This is wise, as going into Enterprise search directly is a tough road: competing with Endeca, Verity (Autonomy), FAST (Microsoft), GoogleBox and the other vendors would be suicidal.

Aside:
Long ago (2003) I thought of hanging up a shingle for supporting HtDig (a once popular CGI based search engine), but wisely decided that would be a mistake given that even then I could see that Doug Cutting's Java Lucene and Nutch were going to smoke the creaky 8+ year old C++ indexing kernel. Ended up getting RightNow Tech to sponsor conversion of the guts to CLucene, where it still runs today indexing many many tens of millions of documents. Then Solr was announced .... and HtDig development died and I started using Solr.

Just touched base with Geoff Hutchison the other day and we're going to release the 4.0 CLucene branch of HtDig, put up an announcement of HtDig end-of-life, and encourage people to migrate to Solr.

Text Classification with Solr

Starting to look at using Solr/Lucene for text mining. Between OpenNLP, the Python Natural Language Toolkit and various other projects it's time to toss my ad-hoc mishmash of tools and start over.

Looks like Grant Ingersoll is working on similar things in his Taming Text project. This is a nice beginner's overview of the area as Grant sees it: Search and Text Analysis PPT. Also looks like others are scheming about blending Mahout and Solr in some future version.

The basic idea is to take an ontology/taxonomy like Dmoz or Freebase, with entries of the form {label: "X", tags: "a,b,c,d,e"}, index it, and then classify documents into the taxonomy by pushing each parsed document into the Solr search API as a query. The top-scoring taxonomy entries become the candidate labels. Why? Lucene/Solr's ability to do weighted term boosting at both search and index time has lots of obvious uses here.
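To make the idea concrete, here is a rough sketch in plain Python. It is not Solr: a tiny in-memory inverted index stands in for the Solr index, a simple tf * idf sum stands in for Lucene scoring, and the taxonomy entries, the `boosts` parameter and all function names are made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy taxonomy in the {label, tags} shape described above.
TAXONOMY = [
    {"label": "Search", "tags": "lucene,solr,index,query,ranking"},
    {"label": "Databases", "tags": "sql,index,transaction,column,storage"},
    {"label": "NLP", "tags": "parsing,tokenizer,entity,extraction,corpus"},
]


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_index(taxonomy):
    """Inverted index: tag term -> set of labels whose tags contain it."""
    index = {}
    for entry in taxonomy:
        for term in tokenize(entry["tags"]):
            index.setdefault(term, set()).add(entry["label"])
    return index


def classify(text, index, n_labels, boosts=None):
    """Use the document as the query; score labels by tf * idf * boost."""
    boosts = boosts or {}  # assumed stand-in for query-time term boosts
    scores = Counter()
    for term, count in Counter(tokenize(text)).items():
        labels = index.get(term)
        if not labels:
            continue
        # Terms shared by fewer labels are more discriminating.
        idf = math.log(1 + n_labels / len(labels))
        for label in labels:
            scores[label] += count * idf * boosts.get(term, 1.0)
    return scores.most_common()


index = build_index(TAXONOMY)
ranked = classify("Solr and Lucene build an index and rank each query",
                  index, len(TAXONOMY))
print(ranked)
```

In a real setup the index and scoring would be Solr's, and the boosts would be expressed in the query itself; the point here is only the shape of "document in, ranked labels out".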

Now that my readership (by data-mining and semantic-web geeks) is up slightly (ie above zero!) due to Twitter traffic, I'm hoping people contact me with ideas, code, etc. Heh.

Initial ideas:

Use More-Like-This code to 'pass in' a term vector without storing it
Write Solr plugin to execute search and post-process hits and do any outgoing classification and biasing math.
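A possible shape for the post-processing step in the second idea, again as a plain Python sketch rather than an actual Solr plugin. It assumes the search returns hits as (label, score) pairs, possibly several per label, and that the "biasing math" is a simple per-label multiplier; both the hit format and the `bias` parameter are assumptions for illustration.

```python
from collections import defaultdict


def rerank(hits, bias=None):
    """Sum hit scores per label, apply a per-label bias, normalize to 1."""
    bias = bias or {}  # hypothetical label -> multiplier, default 1.0
    totals = defaultdict(float)
    for label, score in hits:
        totals[label] += score * bias.get(label, 1.0)
    norm = sum(totals.values())
    if norm == 0:
        return []
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, total / norm) for label, total in ranked]


hits = [("Search", 2.0), ("NLP", 1.0), ("Search", 1.0)]
print(rerank(hits))                 # label shares without any bias
print(rerank(hits, {"NLP": 4.0}))   # same hits, biased toward NLP
```

The normalized shares are not real probabilities, just a convenient way to compare labels and apply a cutoff.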

Once this is proven out, the obvious next step is to figure out how to index the various RDF/OWL datasets out there. Many of these parts have probably been done before; I just need to find them, examine their merits and do some LEGO style layering to get a prototype up.

Friday, January 02, 2009

New Year's Resolutions and Goals

Here are this year's technical/career goals & resolutions. Most of these are general and many encompass specific current and forward looking work projects. Others are just motivational resolutions.

Goals and Resolutions:

Turn in Dissertation. It's 75% complete and the rest is all typewriter work. Be done.
Be a better Numerati. The point of modeling is to predict... so this goal is a lifelong career goal with a new label.
Practice at Done and Get Things Smart and Teach Yourself Programming in Ten Years
Make damn sure that OthersOnline.com doesn't have any Fail Whale events (technical or business).
Read more tech (academic, business and research blogs) - Seed and water the creative juices.
Economy willing, hire a full-time minion.
See if I can practice some 'startup karma' (hat tip Todd Sawicki) for other startups.

Specific items:

Suck down more 'computational advertising' research and write some myself.
Cherry pick new Semantic techniques from the rat's nest of the Semantic Web.
NLP and Extraction
Sharpen skills from classic modeling/filtering/sampling methods.
Column DBs
Modern Map-Reduce
More Scalability
Data mining from VLDB
Contribute to open source projects again
File patent(s) and publish paper(s)
Clean the Garage

Thank heavens I have all year. This is really a post that should evolve all year.. why can't some blog posts be Wiki-like and not so time ordered? I reserve the right to violate blog-etiquette laws and edit this post.