Monday, October 29, 2007

Inferring meaning from data and structure

Jeremy Liew has quite the thread going on his blog: 'Meaning = Data + Structure (User Generated)', Part 2 on Inferring Structure, and a Guest Post by Peter Moore.

The post by Moore is a wonderful summary of approaches and their difficulties, and I'll post more on this as I think about it. My initial response is that we should stop waiting for some near-holy-grail {fully functional semantic web} and use a lot of good-enough {technologies, algorithms, ontologies} to make progress. I think the perfection-in-reasoning work suits the teleportation version of personal search, while the good-enough techniques are applicable right now to the orienteering version. See this post and this paper for orienteering vs. teleportation in search.

Last week the Bozeman AI group read a paper on Deriving a Large Scale Taxonomy from Wikipedia. I look at this as an example of the main idea above: deriving structure from user-generated content. True, Wikipedia is already structured, but not necessarily in a way that a computer program can reason with.

The killer thing about this idea is that it's finally time to do it. Essentially this is what machine learning and data mining have been about for years. I've read or skimmed hundreds of academic papers where the basic premise is that we write a suite of algorithms to learn or extract structure from a pool of data. A big chunk of the papers in the KDD conferences each year (2007, 2006, 2005) operates on this premise, and the field is decades old.

Really pointy-headed CS types are horrible at monetizing their work. At approximately the same time that the Google founders were inventing PageRank, Jon Kleinberg was creating HITS. Both are link-analysis algorithms meant to augment what at the time were poor-quality search engines. Over the past 10 years, whenever they have been evaluated head-to-head on some Information Retrieval task, HITS has worked on par with PageRank. Yet Kleinberg is not now worth 40 billion dollars like Brin and Page of Google.
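To make the comparison concrete, here is a rough sketch of both link-analysis ideas run on the same toy link graph. Everything in it is my own illustration, not code from either paper: the function names, the four-page graph, the damping factor of 0.85, and the fixed iteration count are all assumptions. PageRank gives each page one score by passing rank along out-links, while HITS gives each page two scores, a hub score and an authority score, that reinforce each other.

```python
# Toy link graph (hypothetical): page -> pages it links to.
# Every link target must also appear as a key.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; the damping value is an assumed default."""
    nodes = sorted(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A page with no out-links spreads its rank over all pages.
            targets = graph[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


def hits(graph, iterations=50):
    """Iterative HITS; returns (hub, authority) score dictionaries."""
    nodes = sorted(graph)
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages linking in.
        auth = {v: 0.0 for v in nodes}
        for v in nodes:
            for t in graph[v]:
                auth[t] += hub[v]
        # Hub: sum of authority scores of the pages linked to.
        hub = {v: sum(auth[t] for t in graph[v]) for v in nodes}
        # Normalize both vectors so the scores stay bounded.
        a_norm = sum(x * x for x in auth.values()) ** 0.5 or 1.0
        h_norm = sum(x * x for x in hub.values()) ** 0.5 or 1.0
        auth = {v: x / a_norm for v, x in auth.items()}
        hub = {v: x / h_norm for v, x in hub.items()}
    return hub, auth


rank = pagerank(links)
hub, auth = hits(links)
print("PageRank:", rank)
print("HITS authority:", auth)
print("HITS hub:", hub)
```

On this tiny graph both methods single out page "c", the one with the most in-links, which is the flavor of the on-par results mentioned above, though a four-page example proves nothing about real retrieval tasks.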

I fear that Semantic Web researchers have been building sand castles for a decade rather than monetizing what they have in order to subsidize more research. Perhaps if they had been, Delicious, Digg, Wikipedia, et al. would be contributing to the Semantic Web natively, rather than forcing people to figure out a way to export that data into RDF/OWL.
