Monday, January 26, 2009

Text Classification with Solr

I'm starting to look at using Solr/Lucene for text mining. Between OpenNLP, the Python Natural Language Toolkit (NLTK), and various other projects, it's time to toss my ad-hoc mishmash of tools and start over.

It looks like Grant Ingersoll is working on similar things in his Taming Text project. His Search and Text Analysis PPT is a nice beginner's overview of the area as Grant sees it. It also looks like others are scheming about blending Mahout and Solr in some future version.

The basic idea is to take an ontology/taxonomy like Dmoz or Freebase made of {label: "X", tags: "a,b,c,d,e"} entries, index it, and then classify documents into the taxonomy by pushing a parsed document into the Solr search API. Why? Lucene/Solr's ability to do weighted term boosting at both search and index time has lots of obvious uses here.
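To make the classification step concrete, here is a minimal sketch (the field names `tags`/`label`, the helper name, and the weighting scheme are my own assumptions, not anything Solr-specific): parse the incoming document into weighted terms, turn them into a boosted Solr query, and treat the labels of the top-scoring taxonomy entries as candidate categories.

```python
def build_boosted_query(term_weights, field="tags"):
    """Turn a parsed document's weighted terms into a Solr query string
    with per-term boosts, e.g. 'tags:solr^2.5 tags:lucene^1.2'.
    Posting this against the indexed taxonomy returns the closest
    {label, tags} entries, i.e. the candidate categories."""
    ranked = sorted(term_weights.items(), key=lambda kv: -kv[1])
    return " ".join("%s:%s^%.1f" % (field, term, weight)
                    for term, weight in ranked)

query = build_boosted_query({"lucene": 1.2, "solr": 2.5})
print(query)  # prints: tags:solr^2.5 tags:lucene^1.2
```

The higher-weighted term sorts first, so Solr's scorer sees the document's strongest evidence up front.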

Now that my readership (data-mining and semantic-web geeks) is up slightly (i.e., above zero!) due to Twitter traffic, I'm hoping people contact me with ideas, code, etc. Heh.

Initial ideas:
  • Use the More-Like-This code to 'pass in' a term vector without storing it
  • Write a Solr plugin to execute the search, post-process the hits, and do the classification and biasing math
Once this is proven out, the obvious next step is to figure out how to index the various RDF/OWL datasets out there. Much of this has probably been done before; I just need to find the pieces, examine their merits, and do some LEGO-style layering to get a prototype up.
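The 'pass in a term vector without storing it' idea above can be sketched against Solr's MoreLikeThisHandler, which accepts the raw document as a content stream. This is only a request-building sketch — the `/mlt` handler path assumes the handler is registered that way in solrconfig.xml, and the field name `tags` is illustrative:

```python
from urllib.parse import urlencode

def mlt_request_url(solr_base, document_text, fields="tags", rows=5):
    """Build a MoreLikeThisHandler request that streams the raw document
    in the request itself (stream.body), so Solr mines its 'interesting'
    terms on the fly -- the document is never added to the index."""
    params = {
        "stream.body": document_text,
        "mlt.fl": fields,                   # fields to compare against
        "mlt.mintf": 1,                     # keep terms occurring even once
        "mlt.mindf": 1,
        "mlt.interestingTerms": "details",  # echo back chosen terms + boosts
        "rows": rows,
        "wt": "json",
    }
    return "%s/mlt?%s" % (solr_base.rstrip("/"), urlencode(params))

url = mlt_request_url("http://localhost:8983/solr", "some parsed document text")
```

Fetching that URL returns the nearest taxonomy entries plus the boosted terms Solr extracted, which is exactly the post-processing input a classification plugin would want.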


ogrisel said...

I indeed followed your tweet to come here. Twitter really is an amazing way to get quickly connected with people working on similar specialized matters.

Back to your project: I am actually working on the design of a similar prototype, though currently excluding the search part.

Assuming we have an existing ontology (Freebase, DBpedia, WordNet), my goal is to pipe unstructured text through a series of standard NLP analyzers:

tokenizer / stemmer => POS tagger => chunker / parser => Semantic Role Labeler

and in parallel:

tokenizer / stemmer => POS tagger => chunker / parser => Named Entities extractor
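A toy sketch of that staged pipeline in Python — the tagger and chunker here are trivial rule-based stand-ins I made up purely to show the composition; in practice each stage would be a real NLTK or OpenNLP component:

```python
import re

def tokenize(text):
    # Stand-in for a real tokenizer (e.g. NLTK's word tokenizer)
    return re.findall(r"\w+", text)

def pos_tag(tokens):
    # Toy tagger: capitalized words -> NNP (proper noun), else NN
    return [(t, "NNP" if t[0].isupper() else "NN") for t in tokens]

def chunk_entities(tagged):
    # Toy chunker: group consecutive NNP tokens into one named entity
    entities, current = [], []
    for token, tag in tagged:
        if tag == "NNP":
            current.append(token)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

def extract_named_entities(text):
    # The pipeline: tokenizer => POS tagger => chunker => NE extractor
    return chunk_entities(pos_tag(tokenize(text)))

print(extract_named_entities("the Taming Text project by Grant Ingersoll uses Apache Solr"))
# prints: ['Taming Text', 'Grant Ingersoll', 'Apache Solr']
```

The point is the shape: each stage consumes the previous stage's output, so swapping in a statistical tagger or an SRL stage is a one-line change.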

Then I want to match the named-entity types to Freebase instances, match the subjects / predicates / objects extracted by the SRL to either named entities or unnamed instances of Freebase concepts, and add them as assertion triples to a knowledge store (along with the source span that led to each assertion).

Then the second phase will be to build (or evaluate an existing) SPARQL query builder to let users explore the knowledge base.
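A minimal in-memory sketch of that knowledge store — just a stand-in for Jena/Sesame, with made-up `fb:` identifiers for illustration. Each assertion triple carries the source span that produced it, and `match` plays the role of a single SPARQL basic graph pattern:

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject predicate obj")

class TripleStore:
    """Tiny in-memory knowledge store: assertion triples plus the
    source text span that led to each assertion."""
    def __init__(self):
        self._rows = []

    def add(self, subject, predicate, obj, source_span=None):
        self._rows.append((Triple(subject, predicate, obj), source_span))

    def match(self, subject=None, predicate=None, obj=None):
        # One SPARQL-style basic graph pattern: None behaves as a variable
        return [t for t, _ in self._rows
                if (subject is None or t.subject == subject)
                and (predicate is None or t.predicate == predicate)
                and (obj is None or t.obj == obj)]

store = TripleStore()
store.add("fb:grant_ingersoll", "worksOn", "fb:taming_text", source_span=(0, 42))
store.add("fb:taming_text", "isA", "fb:book")
hits = store.match(predicate="worksOn")  # like: SELECT ?s ?o WHERE { ?s :worksOn ?o }
```

A real store would add inference and a proper SPARQL parser, but the pattern-with-variables query model is the same.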

I am currently reading the NLTK book, available online, to get a better grasp of the state-of-the-art algorithms for achieving such a goal. I also find that NLTK makes it simpler to prototype this in Python than using OpenNLP or ClearTK would be. I would also like to work on the following contribution to NLTK:

The final goal is probably to package this as a generic UIMA analyzer (probably based on ClearTK and Mallet) that will feed a Jena, Sesame, or the future Hadoop HEART store with a SPARQL endpoint, so that it can become part of the Nuxeo ECM solution.

Being able to do the whole process as MapReduce tasks with Hadoop is also a desirable goal.

However, I am just starting preliminary research on this, and I first need to finish reading the NLTK book and related papers before trying to go any further.

Neal said...

Thanks for the comment. I want to internalize it a bit but there are many similarities to what I'm working on.

One basic question for you: do you /really/ want to use SQL to store your semantic data (making the assumption that you do?)? I've spent many years building and fixing text matching in MySQL, and it is just sub-par. I'm pretty convinced that Solr/Lucene is a better way to match/query against RDF/OWL and the like. Surely others have come to that conclusion as well.

ogrisel said...

About the SQL store versus a Lucene store for RDF triples: I haven't decided anything yet. The distributed column store implementation of Google's Bigtable concept provided by the Hadoop HBase project might make sense too, in order to handle very large amounts of data and, at some point, to distribute the SPARQL queries / inference algorithms across several datanodes (AWS?) using MapReduce implementations.

However, this is really not my focus right now. I'll start by loading the Freebase WEX content (64 GB) into a dedicated PostgreSQL database since:
- the Freebase crew provides tuned scripts to do so
- it's easy to plug Python SQLAlchemy or JDBC clients into it
- the indexes for the name columns of the ontology will probably fit in memory, which is enough for the knowledge-extractor prototype I plan to work on first

BTW, it looks like Jena is able to use Lucene to index the full text of literals in an RDF graph and combine SPARQL with full-text matching in the same query:

Sean said...

How did your project turn out? Have you thought about layering in Lingpipe / Gate?