Monday, November 02, 2009

RFP: Bounding memory usage in Tokyo Cabinet and Tokyo Tyrant

I'm soliciting proposals to implement an absolute (or at least soft) bound on memory usage of TT + TC. The reward is cash. Send me a proposal. The patch needs to work, not crash TT/TC or corrupt data.

Here's a start:

I've attempted to do this myself and have not had the time to finish or fully test it. I've asked Mikio for feedback/help finishing this and he's been nearly silent on the request.

At the moment we (myself, Sam Tingleff and Mike Dierken) work around this issue by continuing to play with various TC parameter settings and restarting the daemon when memory usage grows beyond a comfort level.
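For reference, the restart workaround can be scripted. Below is a minimal sketch of such a memory watchdog in Python, assuming a Linux host, a pidfile at /var/run/ttserver.pid and an init script at /etc/init.d/ttserver (those paths are assumptions, not our actual setup); it is no substitute for the real in-process bound we're asking for.

import subprocess
import time

RSS_LIMIT_MB = 4096            # the "comfort level"
CHECK_INTERVAL_SECONDS = 60

def rss_mb(pid):
    """Resident set size of a process in MB, read from /proc."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0   # VmRSS is reported in kB
    return 0.0

while True:
    try:
        pid = int(open("/var/run/ttserver.pid").read().strip())
        if rss_mb(pid) > RSS_LIMIT_MB:
            subprocess.call(["/etc/init.d/ttserver", "restart"])
    except (OSError, ValueError):
        pass   # daemon not running or pidfile missing; check again next cycle
    time.sleep(CHECK_INTERVAL_SECONDS)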

I'm a C coder and have hacked the internals of BerkeleyDB in the past, so can help review code, trade ideas, etc. We (as a team) don't have the time to work on this at the moment.

If you are interested contact me! We've got a few other ideas for TT enhancements as well...


Wednesday, October 28, 2009

Current Computational Advertising Course and Events

Andrei Broder and Vanja Josifovski of Yahoo Research Labs are offering a Fall 2009 course on Computational Advertising at Stanford. The slides for lectures 1-4 are up and it looks to be an interesting course. I will definitely continue to follow along remotely.

The National Institute of Statistical Sciences is holding a workshop on CompAdvert in early November. The upcoming WINE'09 conference in Rome includes a few accepted papers on CompAdvert, and the SIGKDD 2010 CFP mentions it as well.

Luckily the search engine results for 'Computational Advertising' are still free of noise: Google, Yahoo, Bing.

Prior Events:


Friday, September 18, 2009

References for mining from streaming data

While reading a lower-quality paper on the subject I found these references worth tracking down. Essentially the idea is that you make one pass through the data and must produce approximate statistics, ideally with bounded approximation error.

Gibbons et al. 1997: Fast Incremental Maintenance of Approximate Histograms
Incremental maintenance of histograms, primarily for database query planners

Manku and Motwani 2002: Approximate Frequency Counts over Data Streams
They show algorithms called sticky sampling and lossy counting with proven error bounds (lossy counting is sketched in Python at the end of this post).

Zhu et al. 2002: StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time
Basic stats and high correlations between all pairs of data streams

Cormode and Muthukrishnan 2003: What's Hot and What's Not: Tracking Most Frequent Items Dynamically
Introduces the groupTest probabilistic Monte Carlo algorithm for frequent-item tracking

Metwally et al. 2005: Efficient Computation of Frequent and Top-k Elements in Data Streams
Uses counters to monitor data streams with a new stream-summary data structure

Cheng et al. 2005: Time-Decaying Bloom Filters for Data Streams with Skewed Distributions
Dampened Bloom filters for frequent items

Three papers on frequent itemsets (different from frequent items):

Jiang and Gruenwald have a pair of 2006 papers:
Research Issues in Data Stream Association Rule Mining
A survey of issues and previous results

CFI-Stream: Mining Closed Frequent Itemsets in Data Streams
A new stream-based itemset miner

Tsai and Chen 2009: Mining Frequent Itemsets for Data Streams over Weighted Sliding Windows
An Apriori-like itemset miner on windows of data with differential weighting

Langford et al. 2008: Sparse Online Learning via Truncated Gradient
Induces sparsity in the weights of online learning algorithms with convex loss functions
Updates:
  1. (9/19) Corrected the StatStream link - hat tip @dataspora
  2. Added Langford - hat tip @gappy3000
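To make the flavor of these algorithms concrete, here is a minimal Python sketch of lossy counting (my own reading of Manku and Motwani, not their code): one pass over the stream, counts accurate to within epsilon * N, and memory that grows only slowly with N.

import math

def lossy_count(stream, support=0.01, epsilon=0.001):
    width = math.ceil(1.0 / epsilon)        # bucket width
    counts = {}                             # item -> [count, max possible undercount]
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)
        if item in counts:
            counts[item][0] += 1
        else:
            counts[item] = [1, bucket - 1]
        if n % width == 0:                  # prune at each bucket boundary
            doomed = [k for k, (c, d) in counts.items() if c + d <= bucket]
            for k in doomed:
                del counts[k]
    # Items whose true frequency may exceed the support threshold.
    return {k: c for k, (c, d) in counts.items() if c >= (support - epsilon) * n}

print(lossy_count(["a", "b", "a", "c", "a", "b"] * 1000 + ["z"]))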


Thursday, September 17, 2009

Others Online acquired by the Rubicon Project

I'm thrilled to say that Others Online has been scooped up by the Rubicon Project. Press release here. I'm joining as a Data Scientist. Jordan authored a wrap-up post here.

Update: Nice Forbes article on the opportunity in front of us.
For Advertisers Drowning In Data, A Lifeguard


Monday, September 14, 2009

Google AdWords now personalized

Hat Tip: Found via Greg Linden's blog: Google AdWords now personalized. Below are my thoughts and questions:

Google is now reaching back into your previous search history and presumably choosing a better previous search if the current one is not sufficiently monetizable.


  • Is the goal of Google to increase the fill-rate of ads or to show more valuable ads in general?
  • What criteria are used to reach back into a user's history? A boolean commercial/non-commercial flag that selects the last commercial search, or some selection algorithm over the last N searches (see previous point)?
  • Will the reach-back cross a topic boundary or is it only to enhance context for an ambiguous search?
  • What effect will this have on the Google Keyword Tool that helps advertisers forecast demand and price for a keyword? The volume numbers must now be adjusted for how often impressions are shifted to alternate keywords.
  • How much will this starve the long tail of searches? Depending on the aggressiveness of the selection, long-tail searches may see a decrease in AdWords volume.
Even the most modest change, merely using recent previous searches that are 'about' the current search to augment the AdWords auction query, should have a dramatic effect on the auction process. By definition it expands the number of bidders for a particular query. It may also curtail the effectiveness of arbitrage done by some AdWords buyers who buy ambiguous, lower-value keywords as proxies for high-value ones due to user sessions with query reformulations. Why? It should have the effect of driving up prices for the penny keywords if they are sufficiently related to high-value keywords.

It will be interesting to watch what happens. This is likely a non-trivial change in the keyword market.


Friday, August 28, 2009

The GPL and scripting languages

I was recently asked by an associate about how the GPL and LGPL impacts scripting languages. The short answer is that it's not clear.

My general take on the LGPL and scripting languages is that if one cut-and-pastes LGPL code A into code B, then code B falls under the LGPL. However, if one script-includes code A into code B, using the only real mechanism scripting languages allow, then code B does not fall under the LGPL.

php: include 'Foo.php';
perl: require Foo::Bar;
python: import Foo
ruby: require "foo"
javascript: <script type="text/javascript" src="external.js"></script>

This differs from the application of the LGPL to compiled languages like C/C++, where the programmer decides how the machine code is combined: statically versus dynamically linked. In Java, if one has an LGPL jar file, the code can be used in the standard way. If your code is combined with the contents of the LGPL jar file in a new jar, then your code falls under the LGPL.

What does the scripting language really do under the hood? It's different in each language. Some may physically include the file and parse-interpret it as one block of code; others may parse-interpret separately and resolve references between them much like a dynamic linker. Do the details matter? Unknown, yet one could reasonably assume the author of a piece of script code under the LGPL intended script-include style uses to be OK. If in doubt, check with the author and save that email.

What about the GPL? That one is more clear: no mixing of any kind can be done. One could argue that interpreted script is not linked in the way that compiled languages are... yet conservatively this is a "thin reed" to stand on.

What does this mean for the myriad GPL JavaScript out there intermingling with non-GPL and proprietary JavaScript within browsers visiting script-heavy websites everywhere? It's a dog's breakfast of legal issues.

The post Twenty questions about the GPL is pretty informative on these issues and worth a read... it goes deeper than the above.


Thursday, August 20, 2009

What can we learn from the Google File System?

I found this via my twitter herd and I've been thinking about it for days. Kirk McKusick interviews Sean Quinlan about the Google File System. Fascinating stuff.

We're very fortunate to have storage scalability challenges ourselves at 'Undisclosed' (formerly OthersOnline). We're amassing mountains of modest chunks of information, keyed by many hundreds of millions of keys. Our storage has evolved thusly:

  1. MySQL table with single key and many values, one row per value.
  2. MySQL key/value table with one value per key
  3. memcachedb - memcached backed by Berkeley DB
  4. Nginx + Webdav system
  5. Our own Sam Tingleff's valkyrie - Consistent Hashing + Tokyo Tyrant
#5 is the best performing, yet we still aren't going to escape the unpredictable I/O performance of EC2 disks. To my understanding this serves a role much like the chunk servers of GFS: we need low-latency read access to storage on one side, and high-throughput write access on the other.
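valkyrie's internals aren't reproduced here, but the core idea of consistent hashing is small enough to sketch in Python (the node names are made up); the point is that adding or removing a Tokyo Tyrant node only remaps a small slice of the key space.

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, replicas=100):
        self._ring = []                       # sorted list of (hash, node)
        for node in nodes:
            for i in range(replicas):         # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{node}:{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        """Walk clockwise from the key's position to the first node on the ring."""
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["tt-01:1978", "tt-02:1978", "tt-03:1978"])
print(ring.node_for("user:12345"))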

Combining the insights from the above interview with the excellent Pathologies of Big Data, I'm left with the impression that one must absolutely limit the number of disk seeks and do some 'perfect' number of I/O ops against chunks of X MB.

Random questions:
What is X for some random hardware in the cloud? How do we find it? What if X changes as the disk gets full? What mapping function should we push the application key through to produce a storage key that gives the best sequential disk access on write and the best memory caching of chunks on read? What about fragmentation?

It does seem as though the newish adage is very true.. web 2.0 is mostly about exposing old-school unix commands and system calls over HTTP. I keep thinking this must have been solved a dozen times before. The cycbuf feature of usenet servers?


Sunday, August 02, 2009

Response to Dr Lance Fortnow's CACM opinion

Dr. Lance Fortnow published a strong argument against the current 'strong conference' CS system in the August issue of CACM. His essential desire is that CS move to a system similar to the hard sciences and engineering: conferences should accept any paper of reasonable quality, not publish proceedings, and leave Journals as the sole vehicle for publication.

My Questions:

  1. While CS should have a stronger Journal system, why should that come at the expense of quality conferences?

  2. The reputation of CS conferences as a method of publication means that it is both acceptable to publish and cite papers from these confs. The result of this is that one can read a conf paper, compose a citing follow-up and publish it within 1 year. This fluidity of ideas without the sometimes 18 month to 2 year wait on Journal publication is a great advantage!

    Yes, other disciplines publish pre-prints, but is this really a solution when a paper has been rejected yet remains available online for 18 months?

  3. Is this problem really that acute in all communities? Can't it be solved at the community level?

  4. Certainly in AI, Machine Learning, Data Mining and Evolutionary Computation I perceive that the Journals are held in high regard (all papers are great) and that the conference proceedings can be a mixed bag with strong and weak papers. I am getting strong pressure from my advisors and peers to consider Journal versions of some of my conf papers. EC specifically has a non-proceedings conference to meet and discuss less settled results.

    If Dr. Fortnow feels that his area of theoretical computer science is too fragmented, then a solution would be to found more Journals and push the best conference papers into those Journals more heavily. Perhaps that would mean that the conference system of that area would shrink in response as the Journals established themselves.

  5. Why would industrial researchers and scientists participate in publications with such long time cycles?

  6. Again, the fluidity is an advantage. Were CS suddenly to switch to a soft conference system (low benefit to participation other than networking), I fear that industry participation in publication venues would suffer. The time scales of Journals in general mean that at publication time the results were done an eternity ago in industry terms. Convincing your supervisor to support participation and publication in a conference is a far easier sell than an extended Journal submission effort.

    One also wonders if the Journal system is not implicitly biased towards academic communities where participants are chasing tenure. This 'ranking of people' referenced by Dr. Fortnow is very much for academic institutions and not of much value to industry IMHO.

  7. Why do we want ANY such top-down forcing of CS organization?

  8. The culture of CS is much more aligned with self-organization and communities forming out of a bird-of-a-feather effect. This also aligns with the changing face of corporate cultures and culture in general. Such a top-down driven reorg would likely both fail and break the inherent fluidity of ideas and results in CS.

    I also have an unfounded suspicion that such a top-down forced re-org would result in a clustering of power and influence towards traditional centers of power in academia. If one picks up conference proceedings in my favorite CS areas and does a frequency count of the authors' institutions, the distribution is very much long tail. The 'elite' universities are not dominating the results, meaning that the 'in crowd' effect is much smaller in CS.

    Feudal systems are dying fast for a reason.

  9. Obviously the current system is serving a need, doesn't that speak for itself?

  10. If CS researchers and scientists continue to attend, publish at and found conferences is this not evidence that it is serving a real need?

While Dr. Fortnow is correct on his points w.r.t. the problems faced by conference program committees... the correct response is best made within each community: found some Journals and compete with the conferences for publications and reputation. I don't accept that a strong Journal system can only be created by first wiping away a very fluid and successful conference system.

My personal solution to strengthening the Journal system? I'll set a goal of submitting a Journal publication or two in the next year. I am completely remiss in not yet submitting anything to a Journal.

As a counterpoint to the above arguments.. if I could have a wish, it would be a Journal system with fast review times, immediate publication of accepted papers, and one that markets itself by cherry-picking great papers from conferences and encouraging those authors to submit to the Journal. Something like the Journal of Machine Learning Research. The publishing of referee reviews also sounds interesting.

Tuesday, July 14, 2009

Hacking Apache's mod_proxy_http to enforce an SLA

Following up on the last post about HTTP SLAs, let's say you have a web service exposing ReST APIs for your awesome data miner/processor. It has data input/output APIs of various kinds. The software architecture consists of front-end Apache servers and back-end Tomcat plus various data stores. Apache's mod_proxy and some load balancer (HAProxy, mod_proxy_balancer) push the incoming requests to backend servers.

A client wants a guarantee that your APIs will accept requests and return valid data and response codes within XXms for 95% of requests (see Wikipedia's SLA for other examples of service guarantees). How can one be absolutely sure that the SLA is met? Now add in the wrinkle that there might be different SLAs for the various APIs. In addition, the SLA could specify that as close to 100% as possible of the requests return HTTP codes within the 2xx range.. suppressing any 3xx, 4xx or 5xx codes from coming back to the outside world.

The issues with making apache do this are as follows:
  • ProxyTimeout is global or scoped to the particular vhost
  • ErrorDocuments still return the error code (503, 404, etc)
  • No way to tie ErrorDocuments and ProxyTimeouts to particular requests.

A key insight from Ronald Park is to use mod_rewrite to pass environment variables to mod_proxy that are specific to the URL being matched. This was the approach he took to solve this problem in Apache 2.0.x here and here.

The below example is a rewrite rule that makes no changes to the URL itself for a JS API presumably returning data in JSON.

RewriteRule ^/api/(.*).js\?*(.*)$ http://backendproxy/api/$1.js?$2 [P,QSA,E=proxy-timeout:900ms,E=error-suppress:true,E=error-headers:/errorapi.js.HTTP,E=error-document:/errorapi.js]

With the SLA enforcement modifications enabled, the URL will return data from the backend system within 900ms or a timeout occurs. At this point apache will stop waiting for the backend response and serve back the static files /errorapi.js.HTTP as HTTP headers and /errorapi.js as contents.

$cat /var/www/html/errorapi.js.HTTP
Status: 204
Content-type: application/javascript

$cat /var/www/html/errorapi.js
var xxx_api_data={data:[]}; /* ERR */

There are four environment variables the SLA hack looks for:
  • proxy-timeout: - time in seconds or milliseconds to wait until timing out
  • error-suppress: - true/false switch on suppressing all non 2xx errors from the backend.
  • error-headers: - file of syntax correct HTTP headers to return to the client
  • error-document: - file of content body to be returned to the client
Leaving off the proxy-timeout will only suppress errors from the backend after the global timeout occurs. Leaving off error-suppress:true will ensure that the 5xx timeout error from mod_proxy_http is returned intact to the client.

Source code here

There are two versions checked into github, for Ubuntu 9.04's apache2 2.2.11 and CentOS el5.2's httpd 2.2.3-11. It's advisable to diff the changes with the 'stock' file and likely re-do the hack in your version of Apache 2.2. See Ron Park's code for 2.0.x and fold in the other mods supporting error-suppress, etc.

The hack is being tested in a production environment; stay tuned. This will get posted to the apache-dev list.. hopefully with responses suggesting improvements.
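One simple way to keep an eye on the SLA from the outside is a client that hammers an API and reports the latency distribution and non-2xx rate. A rough Python sketch (the URL is a made-up example path behind the rewrite rule above, not a real endpoint):

import time
import urllib.error
import urllib.request

URL = "http://localhost/api/example.js"     # hypothetical API path
N = 200

latencies, non_2xx = [], 0
for _ in range(N):
    start = time.time()
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            resp.read()
    except urllib.error.HTTPError:
        non_2xx += 1                        # a 4xx/5xx leaked past error suppression
    latencies.append((time.time() - start) * 1000.0)

latencies.sort()
p95 = latencies[int(0.95 * len(latencies)) - 1]
print(f"p95 latency: {p95:.1f} ms, non-2xx responses: {non_2xx}/{N}")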

Update for 2011: This has handled billions of requests per month at this point and works great. No issues.

Friday, July 03, 2009

HTTP request with SLA

Here's a (non-AI) problem I'd like to solve: configure Apache to receive a request and proxy/forward it to a backend app server (Tomcat), wait a specified period of time, and if no response is received, send back a static file or a return code like 204.

Essentially it's a combination of the mod_rewrite and mod_alias directives below, with a timeout. The example below uses Solr without loss of generality.

RedirectMatch 204 /search/(.*)$
RewriteRule ^/search/(.*)$ http://backend:8080/solr/$1 [P]

Logic below:

If url matches rewrite-rule regex then
    set timer for 500ms with timer_callback()
    force proxy to second url after rewriting it
    if (response from proxy is > 199 and < 300)
        return response
    return 204 or static default file

timer_callback():
    return 204 or static default file

Instead of returning 204 one could also serve back a static file like /noresults.xml

The general idea is to expose a url that has a near-guaranteed response time limit (assuming apache is alive) where a 204 or a static default is acceptable behaviour. I suspect that we'll need to write an apache module to do this, yet surely this question has been asked and solved before!
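Until an Apache module exists, the behaviour is easy to prototype outside Apache. Here is a minimal Python sketch of the same logic, assuming the Solr backend from the example above (just the shape of the idea, not production code):

import socket
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BACKEND = "http://backend:8080/solr"    # assumed backend, as in the rewrite rule above
TIMEOUT_SECONDS = 0.5                   # the 500ms budget

class SLAProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not self.path.startswith("/search/"):
            self.send_error(404)
            return
        url = BACKEND + "/" + self.path[len("/search/"):]
        try:
            with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
                body = resp.read()
                self.send_response(resp.status)
                self.send_header("Content-Type", resp.headers.get("Content-Type", "text/plain"))
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
                return
        except (urllib.error.URLError, socket.timeout):
            pass    # timeout, connection error, or a non-2xx backend response
        # SLA fallback: empty 204 (could also serve a static /noresults.xml)
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8000), SLAProxyHandler).serve_forever()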

Saturday, March 07, 2009

Copyright, Licenses and First Sale doctrine

Very interesting 'Legally Speaking' article in the March 2009 CACM, When is a "license" really a sale? by Pamela Samuelson. She covers the principle of 'first sale' and the recent decisions in UMG v. Augusto and Vernor v. Autodesk. Essentially both Augusto and Vernor were selling used CDs and software on eBay, and the copyright owners of the products went after them for these sales. Both parties are using the first-sale doctrine to defend their right to sell the media. The copyright owners are trying to use contract law and the licenses to restrict transfer of the physical media and the right to use it. Samuelson predicts that both cases will end up in the 9th Circuit appeals court.

I hesitate to issue an opinion here; I see both sides. More debate here, here and here. A nice little post on first sale here. From the perspective of books and music, I side with the first-sale doctrine.. and mostly think the 'one click' license notices have many problematic issues. If you receive a physical copy of something for $$ or as a gift, you should be allowed to resell it at your whim.

It's worth hunting down the previous articles in this series.. they stretch back many years.

Thursday, March 05, 2009

In memoriam - Prof. Dr. Ingo Wegener, 1950-2008

Ingo Wegener passed away last November, see his obituary here: Nachruf/Obituary. He taught at University of Dortmund/TU Dortmund and ran a group of Evolutionary Algorithm researchers looking into the basic theory and run-time analysis of EAs. His papers and the basic approach of his 'school' have been very influential in my PhD research and he was quite encouraging of my proposed ideas. If at the end of our lives we are professionally remembered half as fondly as he will be then that is success.

Sunday, February 22, 2009

Predicting search engine switching behavior

Following up on the previous post, I found a few interesting papers (via Google) on user switching behavior of search engines.
Definitely worth reading in detail as the ways of building the models might be applicable to other behaviorally driven events.

Can we measure Google's monopoly like PageRank is measured?

Jeremy Pickens posted an interesting note on his new IR blog:

Is it really true that Google is competing on a click-by-click basis? In the user studies that Google does, which of the following happens more often when the user types in a query to Google, and sees that Google has not succeeded in producing the information that they sought (fails):

  1. Does the user reformulate his or her query, and click “Search Google” again (one click)? Or,
  2. Does the user leave Google (one click), and try his or her query on Yahoo or Ask or MSN (second click), instead?

His points about actions 1 versus 2 are very astute. I’d guess that #2 happens a LOT on the #2-10 search engines, meaning people give that engine a try.. maybe attempt a reformulation.. then abandon that engine and try on Google. And I’m betting that people ‘abandon’ Google at a far lower rate than other engines.. i.e. an asymmetry of abandonment.

I’d love to do the following analysis given a browser log of search behavior:

Form a graph where the major search engines are nodes in the graph

For each pair of searches found in the log at time t and time t+1 for a given user, increment the counter on the edge SearchEngine(t) -> SearchEngine(t+1). Once the entire log is processed normalize the weights on all edges leaving a particular node.

We now have a Markov chain of engine usage behavior. The directional edges in the graph represent the probability of use transferring to another engine; self-loops are the probability of sticking with the current engine.

If we calculate the stationary distribution of the adjacency matrix of probabilities, we should have a probability distribution that closely matches the market shares of the major engines. (FYI - this is what PageRank version 1.0 is - the stationary distribution of the link graph of the entire web)
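As a toy example, here is the stationary-distribution calculation in Python with a completely made-up transition matrix (the engines and numbers are invented purely to illustrate the computation, not measured from any log):

import numpy as np

engines = ["Google", "Yahoo", "MSN", "Ask"]
# Hypothetical row-stochastic transition matrix: row i gives P(next engine | current = engines[i]).
P = np.array([
    [0.95, 0.03, 0.01, 0.01],
    [0.30, 0.65, 0.03, 0.02],
    [0.35, 0.05, 0.58, 0.02],
    [0.40, 0.05, 0.05, 0.50],
])

# Stationary distribution: the left eigenvector of P with eigenvalue 1,
# found here by simple power iteration from a uniform starting distribution.
pi = np.full(len(engines), 1.0 / len(engines))
for _ in range(10_000):
    pi = pi @ P
pi /= pi.sum()

for name, share in zip(engines, pi):
    print(f"{name}: {share:.3f}")   # the predicted long-run share of searches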

What else can we do? We can analyze it like it’s a random walk and calculate the expected # of searches until a given user of any internet search engine will end up using Google. If the probabilities on the graph are highly asymmetric.. which I think they are.. this is a measure of the monopolistic power of people’s Google habit.

This should also predict the lifetime of a given ‘new’ MSN Live (or similar) user.. meaning the number of searches they do before abandoning it for some other engine.

Predicted End Result: Google is the near-absorbing state of the graph.. meaning that all other engines are transient states on the route to Google sucking up market share. Of course this is patently obvious unless one of the bigs changes the game.

Saturday, February 21, 2009

Scalable Analytics - random notes

As long as I'm posting about 'rethinking' any reliance on an RDBMS for 'big data' systems, here are a few more notes:

This post about Rackspace and MapReduce is highly enlightening. It takes you through a failed use of MySQL for large-scale analytics and their conversion to Hadoop.

I can't say I know (yet) the pros and cons of putting structured big data on a 'column store DB' versus Hadoop+HDFS. We will probably end up using both systems in various ways. Currently we're exploiting Sam Tingleff's DGrid and poking at LucidDB for "row filtering and aggregation" style analytics apps.

Looking forward to setting up Hive and Pig next.

On the plus side for MySQL, the federated engine has been quite useful for accumulating data from a sharded/partitioned MySQL setup.. as long as the data being accumulated is less than 100K rows; beyond that it seems to hit a wall. It's also quite brittle if your MySQL instances are having any performance issues.. a failed connection can cause other ETLs that depend on that connection to fail in odd ways.

Friday, February 20, 2009

Scalable data storage for model learning

I'll admit it, sometimes as an AI & Data wonk you want all your data structured and easily accessible via slicing and dicing, thus SQL databases and relational schemas look like a great idea. And it works for a while... until you're dealing with a torrent of incoming data.

This should not be unfamiliar, the UC-Irvine datasets look like a great place to cut your teeth with Machine Learning .. until you realize that many algorithms and software packages written and litmus tested against such data totally fall down on 'big data'.

This quote from Dave Cancel on Twitter stuck with me: "Databases are the training wheels of software development. Fly free brother, fly free. - University of Sed, Grep, & Awk". My first reaction was, meh.. databases are meant for storing lots of data. I love and use tools like those for prototyping.. but then moved to compiled code + SQL for 'production'.

Mental sea change! Let's say you are building a massive scale system for absorbing click data, processing it and turning it into a recommender system. These are the problems you will see using MySQL at scale. Hint: it tips over at 1K read/write operations per second against the same table.

Don't try to make your read store also your write store. You may not /really/ need a low-latency (for model updates) system.

More tips from Dave:
As for storage of [models], I suggest storing them in a text file (format up to you), 1 per profile, then sticking them behind a reverse caching proxy like Varnish. Infinite scale. For extra points, store the text files on S3 and use the built-in webserver to serve them to your reverse proxy. HTTP is your friend, as are REST SOAs.
Here's another dead simple storage mechanism:

"simple DB" projects like Voldemort and Tokyo Cabinet and MemcacheDB are options as well.

If you can't depend on a room full of DBAs to keep your SQL DBs from being dog slow (or on buying $$$ systems from Oracle and Microsoft), you must think differently. Pull yourself out of the Math and AI thinking and simplify. Big Data will eat you alive otherwise.

Wednesday, February 18, 2009

TunkRank Scoring Improvement

Recently Daniel Tunkelang blogged about an influence rank for Twitter. Jason Adams took up the implementation challenge and coined it TunkRank.

After some poking at it, I'm suggesting a scoring improvement. At the moment the primary rank is a percentile in the UI, however the raw score is given as well. I checked a few users and put together the table below, and it feels wrong. It saturates at 100 too quickly and there is not enough differentiation between people with healthy versus massive influence.

Why 'feel'? Human interpretable numbers need a tactile sense to them in my opinion. One critique of the metric system is that the English system just feels more human compatible, an inch is not too small, a foot is reasonable, a mile is a long way and 100 miles per hour is darn fast.

I'm proposing two new scoring possibilities. Both are based upon logarithms and span from 1-100. The slight difference between them is how 'linear' the resulting rank feels across the accounts I compared.
  1. LEAST(100,ROUND(POWER(LN( tunkrank-raw-score +1);1.82)))
  2. LEAST(100,ROUND(LN( tunkrank-raw-score+1)/LN(3.5) * 10))

What are the constants? They are magic numbers chosen to map Barack Obama to a TunkRank of 100 as well as provide an interesting spread between the test accounts below. Comments welcome! Which is my choice? I can't decide.. #1 smells more accurate, #2 tastes more natural.

Yes this is an inexact science.
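As a sanity check, both candidate formulas are easy to compute in Python (assuming the ';' in #1 is just a spreadsheet-style argument separator, i.e. POWER(x, 1.82)); the raw scores come from the table below.

import math

def score_1(raw):
    return min(100, round(math.log(raw + 1) ** 1.82))

def score_2(raw):
    return min(100, round(math.log(raw + 1) / math.log(3.5) * 10))

# Spot-check against a few rows of the table below.
for name, raw in [("BarackObama", 277770), ("abdur", 1351), ("dewitt", 2)]:
    print(name, score_1(raw), score_2(raw))   # expect (100, 100), (36, 58), (1, 9)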

Possible Tunkrank Bug? Check out dewitt's rank.. looks off given his number of followers and that he's an influential guy from Google.

Account       Current TunkRank  Raw score  Proposal #1  Proposal #2
BarackObama   100               277770     100          100
wilw          100               79118      82           90
guykawasaki   100               62543      79           88
JasonCalaca   100               59075      78           88
THErealDVOR   100               43207      74           85
anamariecox   100               38177      73           84
WilliamShat   100               13932      61           76
fredwilson    100               13340      60           76
abdur         100               1351       36           58
johnhcook     99                407        26           48
johndcook     94                61         13           33
gutelius      84                20         8            24
nealrichter   81                16         7            23
ealdent       80                16         7            23
dtunkelang    79                15         6            22
dewitt        1                 2          1            9

Tuesday, January 27, 2009

response to Noisy Channel post on Lucid Imagination

Cross posted to my blog since it's a long response ;-)

To: Daniel Tunkelang

RE: Noisy Channel blog post on Lucid Imagination

I don’t think it’s their aim to compete with Enterprise search directly (business suicide), though I suspect they might pressure the pricing in the mid-market of search. The small market has been mostly eliminated by Google/Yahoo/MSN site search and open source engines.

Note also that they do not seem to (yet) provide support for Nutch or Droids.. meaning that they are missing a spidering/crawling engine. Same with Tika (office document support). Search result clustering may be coming soon via SOLR-769. No content-management or versioning. (These are fixable pieces given all the open source out there)

There is no good native support for rich taxonomies in Solr/Lucene, nor is there native support for some of the interesting semantic-web data driven features. No self-learning or auto-personalization of results. No analytics (though one could go elsewhere for that).

Lucid is also not offering a hosted Solr service .. so they are not an SaaS play either.

All that said, they obviously have some huge wins within the software industry.. but it’s a tough road to go after accounts like Home Depot, Albertson’s, or the government entities.

Enterprise search is mostly about finished feature sets and a near-full admin GUI for non-programmers. The question, in these lean economic times, is whether a given customer considering "build versus buy" is willing to risk starting a professional services engagement to build what they want for cheap, versus purchasing a commercial ES product with way more features than they think they need.

I do think that a smart customer will have new leverage during the sales cycle to credibly threaten the 'build' option and get the 'buy' price down. And Lucid certainly should hamper the ability of the ES companies to get a customer bought in and then milk them for professional services, integration and customization fees… Lucid provides a credible switching threat to cut bait and start over.

Google, Yahoo and open source projects like Lucene have commoditized basic search, so ES is about value-added features, innovative R&D and taking away customer pain and complexity.

Some of the people in Lucid have big plans (Grant Ingersoll comes to mind), and there is absolutely no question that Lucene has made some search vendors look like dinosaurs with slow engines and archaic index structures.

It will be some time before open source catches up to ES.. but it just might not be as long as some would hope.

Disclaimer: The above is my opinion and some fact-looking statements might be wrong.. so Lucene guys jump in!

Monday, January 26, 2009

Lucid Imagination and Sematext

Kudos to the Solr/Lucene gang for launching Lucid Imagination. Grant Ingersoll's announcement. People involved here and here. Some time ago Otis Gospodnetić launched Sematext. Good luck to Lucid and Sematext!

Both of these companies are in the 'support and consulting' model. This is wise, as going into Enterprise search directly is a tough road; competing with Endeca, Verity (Autonomy), FAST (Microsoft), GoogleBox and the other vendors would be suicidal.

Long ago (2003) I thought of hanging up a shingle supporting HtDig (a once popular CGI-based search engine), but wisely decided that would be a mistake given that even then I could see that Doug Cutting's Java Lucene and Nutch were going to smoke the creaky 8+ year old C++ indexing kernel. I ended up getting RightNow Tech to sponsor conversion of the guts to CLucene, where it still runs today indexing many tens of millions of documents. Then Solr was announced... HtDig development died and I started using Solr.

I just touched base with Geoff Hutchinson the other day and we're going to release the 4.0 CLucene branch of HtDig, put up an end-of-life announcement for HtDig, and encourage people to migrate to Solr.

Text Classification with Solr

Starting to look at using Solr/Lucene for text mining. Between OpenNLP, the Python Natural Language Toolkit and various other projects it's time to toss my ad-hoc mishmash of tools and start over.

Looks like Grant Ingersoll is working on similar things in his Taming Text project. This is a nice beginner's overview of the area as Grant sees it, Search and Text Analysis PPT. Also looks like others are scheming about blending Mahout and Solr in some future version.

The basic idea is to take an ontology/taxonomy like Dmoz or FreeBase of {label: "X", tags: "a,b,c,d,e"}, index it, and then classify documents into the taxonomy by pushing the parsed document into the Solr search API. Why? Lucene/Solr's ability to do weighted term boosting at both search and index time has lots of obvious uses here.
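As a rough Python sketch of the classify-by-search idea: assume a Solr core where each taxonomy node is a document with 'label' and 'tags' fields (the core URL, field names and boosts below are assumptions, not an existing setup), then turn a parsed document's salient terms into a boosted query.

import json
import urllib.parse
import urllib.request

SOLR_SELECT = "http://localhost:8983/solr/select"    # hypothetical taxonomy core

def classify(doc_terms, top_k=5):
    """Map a document's salient terms to the best-matching taxonomy labels."""
    # Boost earlier (presumably more important) terms a little more.
    boosted = " ".join(f"{t}^{max(1, 3 - i)}" for i, t in enumerate(doc_terms))
    params = urllib.parse.urlencode({
        "q": f"tags:({boosted})",
        "fl": "label,score",
        "rows": top_k,
        "wt": "json",
    })
    with urllib.request.urlopen(f"{SOLR_SELECT}?{params}") as resp:
        docs = json.load(resp)["response"]["docs"]
    return [(d["label"], d["score"]) for d in docs]

print(classify(["python", "machine", "learning", "classifier"]))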

Now that my readership (by data-mining and semantic-web geeks) is up slightly (ie above zero!) due to Twitter traffic, I'm hoping people contact me with ideas, code, etc. Heh.

Initial ideas:
  • Use More-Like-This code to 'pass in' a term vector without storing it
  • Write a Solr plugin to execute the search, post-process hits and do any outgoing classification and biasing math.
Once this is proven out, the obvious next step is to figure out how to index the various RDF/OWL datasets out there. Much of this has probably been done before; I just need to find it, examine its merits and do some LEGO-style layering to get a prototype up.

Friday, January 02, 2009

New Year's Resolutions and Goals

Here are this year's technical/career goals & resolutions. Most of these are general and many encompass specific current and forward looking work projects. Others are just motivational resolutions.

Goals and Resolutions:
  • Turn in Dissertation. It's 75% complete and the rest is all typewriter work. Be done.
  • Be a better Numerati. The point of modeling is to predict... so this goal is a lifelong career goal with a new label.
  • Practice at 'Done, and Gets Things Smart' and 'Teach Yourself Programming in Ten Years'
  • Make damn sure that doesn't have any Fail Whale events (technical or business).
  • Read more tech (academic, business and research blogs) - Seed and water the creative juices.
  • Economy willing, hire a full-time minion.
  • See if I can practice some 'startup karma' (hat tip Todd Sawicki) for other startups.
Specific items:
Thank heavens I have all year. This is really a post that should evolve all year.. why can't some blog posts be Wiki-like and not so time ordered? I reserve the right to violate blog-etiquette laws and edit this post.