Monday, October 29, 2007

Inferring meaning from data and structure

Jeremy Liew has quite the thread going on his blog about 'Meaning = Data + Structure (User Generated)', Part 2 on Inferring Structure, and a Guest Post by Peter Moore.

The post by Moore is a wonderful summary of approaches and their difficulties, and I'll post more on this as I think about it. My initial response is that we should stop looking/waiting for some near holy-grail {fully functional semantic web} and use a lot of good-enough {technologies, algorithms, ontologies} to make progress. I think that the perfection-in-reasoning stuff is great for the teleportation version of personal search vs the good-enough techniques as applicable now to the orienteering version of personal search. See this post and this paper for orienteering vs teleportation in search.

Last week the Bozeman AI group read a paper on Deriving a Large Scale Taxonomy from Wikipedia. I look at this as an example of the main idea above, deriving structure from user generated content. True, Wikipedia is already structured, but not necessarily in a way that a computer program can use to reason with.

The killer thing about this idea is that it's finally time to do it. Essentially this is what machine learning and data mining have been about for years. I've read/perused hundreds of academic papers where the basic premise is that we write a suite of algorithms to learn/extract structure from a pool of data. A big chunk of papers in the KDD conferences each year (2007, 2006, 2005) operates on this premise and this field is quite old (decades).

Really pointy-headed CS types are horrible at monetizing their work. At approximately the same time that the Google founders were inventing PageRank, Jon Kleinberg was creating HITS. Both are link-analysis algorithms meant to augment what at the time were poor quality search engines. Over the past 10 years, when the two are evaluated head-to-head on some Information Retrieval task, HITS works on par with PageRank. Yet Kleinberg is not now worth 40 billion dollars like Brin and Page of Google.
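To make the comparison concrete, here is a minimal sketch of both link-analysis ideas on a toy link graph. The graph, the function names and the parameter values (damping of 0.85, 50 iterations) are illustrative assumptions, not anything taken from a production engine.

```python
def all_pages(links):
    """Every page that appears as a source or a target in the link graph."""
    return sorted(set(links) | {t for targets in links.values() for t in targets})


def normalise(scores):
    """Scale scores so they sum to 1 (left unchanged if they sum to 0)."""
    total = sum(scores.values()) or 1.0
    return {page: value / total for page, value in scores.items()}


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = all_pages(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank


def hits(links, iterations=50):
    """HITS: returns (hub, authority) scores, each normalised to sum to 1."""
    pages = all_pages(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page is a good authority if good hubs link to it.
        auth = {p: 0.0 for p in pages}
        for p in pages:
            for t in links.get(p, []):
                auth[t] += hub[p]
        auth = normalise(auth)
        # A page is a good hub if it links to good authorities.
        hub = normalise({p: sum(auth[t] for t in links.get(p, [])) for p in pages})
    return hub, auth


# Toy graph (hypothetical): three pages point at "c", and "c" points back at "a".
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
toy_rank = pagerank(toy_links)
toy_hub, toy_auth = hits(toy_links)
```

On this toy graph both methods single out "c" as the most important page. The difference is that HITS keeps two scores per page (hub and authority) while PageRank keeps one.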

I fear that the Semantic Web people/researchers have been building sand castles for a decade rather than monetizing what they have to subsidize more research on it. Perhaps if they had, Delicious, Digg, Wikipedia, et al. would be contributing to the Semantic Web natively, rather than forcing people to figure out a way to export that data into RDF/OWL.

Wednesday, October 24, 2007

Semantic Wishful Thinking? Or Semantics for turning lead into gold?

I'm seeing quite the meme these days on the 'Semantic Web' as a way to build the next big thing (See Twine, AdaptiveBlue, more). The essence of the Semantic Web is the markup of knowledge in such a way as to enable machines to reason about it.

The idea is to have every HTML page you download contain markup that enables a smart web browser or search engine to know whether you are looking for (or browsing about) Anthrax the UK punk band, the US heavy metal band, the fly, or the toxin. This vision is basically one of the Structured Web.

There are issues in my mind:

1) The Semantic Web has been around for years. During all those years the content of the web grew from nearly nothing to the mountain of (mostly unstructured) goo we all browse daily. Why/How will all that knowledge be 'structured'?

Take Home: People do not want to 'structure' knowledge themselves. They are writing their content for people and not machines (except the SEO people).

2) Formally structured data is an OLD idea in AI. See expert systems. How will the 'semantic web' overcome the basic problem that structuring human knowledge is DAMN hard? And by hard I mean making it consistent (this is what mostly broke expert systems).

Have you been following what Cyc Corp has been doing since 1984? Attempting to structure human knowledge. These guys have invented whole new ways of representing human knowledge.. where is it on the web? Can anyone tell me an application that uses it? I am very certain that the CycCorp guys could build (and likely already have) a way to export their databases into RDF, OWL, etc.

Also.. the old white-haired guys of AI invented various forms of Semantics and 'Knowledge Representation' way-back in AI history (see chapter 10 of Russell-Norvig).

Take Home: Once you have the knowledge structured and embedded, what happens next? Magic? Merely inventing a representation of knowledge relies on the 'if you build it they will come' doctrine of AI.. which has NEVER been true.

3) Reasoning with said structured knowledge is unsolved in general. Given a specific knowledgebase (or Semantic database) and specific questions (or semantic queries) systems can reason about the question and deliver results.. but it's still a garbage-in-garbage-out world.

This is especially true when most people really expect a search engine to read their minds (Sorry Udi - I agree with Greg) or they tend to give up on their search queries.

How do we prevent such systems from becoming SEO spammed? I suppose a reputation system on the source of semantic markup data could be created.

Take Home: How in the hell do I build a search engine that uses Semantics that really understands what I am looking for and delivers me the Answer? Such a system pretty much is an AI Oracle.

Ok.. enough with the half-empty-glass negativity! What can we really do with the Semantic web NOW?

For sure we can build a semantically enhanced 'filter' of the web. Google/Yahoo/MSN/Ask are great, but in the end they are giant databases that serve you up link-graph weighted & keyword-filtered URLs.

However, if you are trying to build a money making business, a new search box that returns URLs seems like an insane idea.. unless you can co-opt the browser and augment the results that the big-boys are returning (See Search Radar). Or pull a StumbleUpon strategy.

For a business, the Semantic Web is a potential tool in a step along the path in creating a valuable application. Remember your history here.. creating a giant repository and/or formal structure of knowledge will not alone result in something novel.. nor is using it required to create novelty in AI.

I'd probably make the argument that delicious itself (and similar data) is a growing embodiment of a user-generated database that clever software could derive semantic-data from.

I am NOT arguing that the semantic web is a bad idea... but be careful of the hype you read. The Semantic Web is merely the first step (and a hard one) at stitching together knowledge in a way that can be used to reason. The Semantic Web is as necessary for a smarter web as databases are for useful applications... yet the database is the data-store and NOT the application logic.

Friday, October 19, 2007

Recommender Systems

Greg Linden has another insightful post on his blog about Recommender Systems. He argues that the systems can be tuned to recommend diversity (a la Netflix), rather than the too-similar echo chamber of stuff you sometimes see on Amazon.

Jeremy Liew at LightSpeed VCP had a good post recently about search query understanding being the future direction of search.

In my mind, recommender systems are part of that vision. A truly great search engine will seek to understand your queries, your query history, personal interests and recommend content.. rather than just give you a keyword-filtered & ranked slice of the web.

Yet there are other ways to achieve that kind of output. Search engines and AI in general are a good distance away from real query understanding (it requires some form of machine reading). If instead we consider bootstrapping a recommender system that is driven by people's recommendations on a topic.. we can potentially get there quicker. This is how you train product recommender systems (with purchase history).
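As a rough illustration of bootstrapping from history rather than from query understanding, here is a minimal item co-occurrence sketch: items that show up together in people's histories get recommended to each other's owners. The user names, item names and function names are all made up for the example, and real product recommenders are considerably more involved.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(histories):
    """Count, for every item, how often each other item shares a history with it."""
    co = defaultdict(Counter)
    for items in histories.values():
        for a, b in permutations(set(items), 2):
            co[a][b] += 1
    return co


def recommend(user, histories, co, k=3):
    """Suggest up to k items the user lacks, ranked by co-occurrence with what they have."""
    owned = set(histories[user])
    scores = Counter()
    for item in owned:
        for other, count in co[item].items():
            if other not in owned:
                scores[other] += count
    # Sort by score (descending), then by name so ties come out in a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))[:k]


# Hypothetical histories: which topics each user has bookmarked or bought.
histories = {
    "u1": ["pagerank", "hits"],
    "u2": ["pagerank", "hits", "ltr"],
    "u3": ["pagerank", "ltr"],
    "u4": ["hits"],
}
co = build_cooccurrence(histories)
suggestions = recommend("u4", histories, co)
```

Here "u4" only has "hits", and since "pagerank" shares a history with "hits" more often than "ltr" does, it comes out first.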

A system that implicitly follows you around the web and allows your content to be communally shared into an index would at a minimum be a very fresh index of what people are looking at now. Combine this index with a social network of people (enabling matching of topically relevant users to you) and we have something of a human-filter of the web driving a content recommender.
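One simple way to picture the "matching of topically relevant users" part is cosine similarity over per-user topic counts. This is only a sketch under that assumption; the profiles, names and the choice of cosine similarity are all illustrative.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two topic-count profiles."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_users(user, profiles, k=2):
    """Return up to k other users whose topic profiles overlap most with `user`."""
    scored = [
        (cosine(profiles[user], profile), name)
        for name, profile in profiles.items()
        if name != user
    ]
    # Highest similarity first; names break ties so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for sim, name in scored[:k] if sim > 0]


# Hypothetical profiles: how many pages on each topic a user has viewed.
profiles = {
    "me": Counter({"beer": 3, "search": 1}),
    "ann": Counter({"beer": 2}),
    "bob": Counter({"search": 4, "ai": 1}),
    "cat": Counter({"cooking": 5}),
}
peers = similar_users("me", profiles)
```

The pages those peers are looking at right now would then be the candidates for the content recommender.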

Yes, this is what many social URL sharing sites are building now... but do they have the pieces all together to drive people to directed content rather than allowing them to surf the wave of current topics?

Thursday, October 18, 2007

Seattle Beer notes

My September trip to Seattle included stops at Kells. I really loved the Roslyn Brookside Lager from Roslyn Brewing. It has a wonderful fruity complexity to it, which is unusual for a lager (more like a Kolsch). A quick email to the brewer and I learned that he ferments it warm with a lager yeast.

I also enjoyed the Baron Brewing Helles Bock served at the Palace Kitchen. Great malt flavor. Great food at PK at reasonable prices.

On this October trip to Seattle I loved the Feierabend Pub. They have about 18 beers (all German styles) on tap. I tried/sampled about 5 kinds of Octoberfest and several other lagers.

Tap House Grill, 160 draft beers.. need I say more? This place was impressive. I tried two more Baron beers (Pils & Uber-Weiss) and the Brewery Ommegang Hennepin Farmhouse Saison. It's good, but my taste buds still prefer the New Belgium Saison.. the NB has a nice earthy taste.

Wednesday, October 17, 2007

Attention IR and People Search

The SIGIR 2007 conference also had a couple of gems in the Doctoral Consortium workshop.

Krisztian Balog (University of Amsterdam) homepage
People Search in the Enterprise

Balog's abstract looked at two areas of people search: profiling people, and enabling search over those people based upon both their topical and social profiles. Who is an expert on X? Whom do I know (or could get introduced to) who is an expert on X? His research seems to be just beginning.. I'll be checking his page for new papers.

Georg Buscher (German Research Center for AI) homepage
Attention-Based Information Retrieval

Buscher won the best presentation award at the workshop. His slides outline how attention data can be used to bias/rerank IR results to enable re-finding old information/documents as well as doing query expansion (profile based???) given the current user's attention data. His research is also fairly new.

Both of these topics are obviously of interest to Others Online and the idea of connecting people together through a common topic or set of topics that are learned as implicitly related to the users.

Learning to Rank

SIGIR 2007 (which I unfortunately did not attend) had a really great workshop called 'Learning to Rank' or LTR. The weekly RightNow-organized Bozeman AI Colloquium recently covered two papers in this area. Essentially the idea is that a search engine can implicitly learn to rank documents for a given query by looking at user behavior.
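A minimal sketch of the idea, assuming one common reading of click data: a clicked result is preferred over any unclicked result that was shown above it. Those pairs then train a linear scorer with a perceptron-style update. The document ids, the two features and the update rule are illustrative assumptions, not the method of any particular workshop paper.

```python
def preferences_from_clicks(ranking, clicked):
    """Turn one result page into (preferred, other) pairs.

    Assumption: a clicked result beats every unclicked result shown above it.
    """
    pairs = []
    for pos, doc in enumerate(ranking):
        if doc in clicked:
            pairs.extend((doc, above) for above in ranking[:pos] if above not in clicked)
    return pairs


def score(w, x):
    """Linear score of feature vector x under weights w."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise(pairs, features, epochs=20, lr=0.1):
    """Perceptron-style training: nudge w whenever a preferred doc does not win."""
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(features[better], features[worse])]
            if score(w, diff) <= 0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def rerank(docs, features, w):
    """Order docs by learned score, best first."""
    return sorted(docs, key=lambda d: -score(w, features[d]))


# Hypothetical query log: four results were shown, the user clicked the third.
shown = ["d1", "d2", "d3", "d4"]
features = {  # [keyword_match, freshness], both made up
    "d1": [0.9, 0.1],
    "d2": [0.8, 0.2],
    "d3": [0.5, 0.9],
    "d4": [0.4, 0.3],
}
pairs = preferences_from_clicks(shown, clicked={"d3"})
weights = train_pairwise(pairs, features)
reranked = rerank(shown, features, weights)
```

After training, the clicked document moves to the top for this query, because the learned weights now favour the feature that distinguished it from the results it was skipped over.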

The first one we covered (by Yeh, Lin, Ke & Yang) used genetic programming to do the learning. Needless to say this caught my eye. Evolutionary Algorithms are built to learn rankings, usually based upon a fitness function. I found this paper interesting; however, even the authors admit that their algorithm is very slow.

In my mind they picked too complex an algorithm. There are far simpler EAs that can do this job. The well-known (n+1) EA could do this task (per query). I'll likely be writing a paper on this for GECCO 2008.
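To show how little machinery such an EA needs, here is a rough sketch. It assumes "(n+1)" means a population of n weight vectors, one mutated offspring per step, and the worst of the n+1 dropped. The fitness function (the number of preference pairs the weights order correctly), the Gaussian mutation, the toy data and all names are assumptions made for illustration.

```python
import random


def score(w, x):
    """Linear score of feature vector x under weights w."""
    return sum(wi * xi for wi, xi in zip(w, x))


def fitness(w, pairs, features):
    """Number of (better, worse) pairs that w ranks in the right order."""
    return sum(
        1 for better, worse in pairs
        if score(w, features[better]) > score(w, features[worse])
    )


def n_plus_one_ea(pairs, features, n=5, steps=200, sigma=0.3, seed=0):
    """Keep n weight vectors; each step add one mutant and drop the worst."""
    rng = random.Random(seed)
    dim = len(next(iter(features.values())))
    population = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    history = []  # best fitness after each step
    for _ in range(steps):
        parent = rng.choice(population)
        child = [wi + rng.gauss(0.0, sigma) for wi in parent]
        population.append(child)
        # Stable sort, best first: on a tie the older individual stays ahead.
        population.sort(key=lambda w: fitness(w, pairs, features), reverse=True)
        population.pop()  # discard the worst of the n+1
        history.append(fitness(population[0], pairs, features))
    return population[0], history


# Hypothetical per-query data: d3 should outrank d1 and d2.
features = {
    "d1": [0.9, 0.1],
    "d2": [0.8, 0.2],
    "d3": [0.5, 0.9],
}
pairs = [("d3", "d1"), ("d3", "d2")]
best, history = n_plus_one_ea(pairs, features)
```

Because the worst individual is the one removed, the best fitness in the population can never go down from one step to the next. The cost per step is one mutation plus n+1 fitness evaluations over the query's pairs.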

Many of the workshop papers reference work by Joachims and Radlinski (find them here). Their recent paper in IEEE Computer (not available for free) was interesting in that they used an LTR method to re-rank Google results and then did a user-study to look at how effective the method was.

Personally I think that the idea of LTR should be a component of every search engine. The ranking of search results should change as fast as users interact with the content, rather than how fast the content itself changes. This is something that the big search engines are fairly quiet on, not sure why.

Sure it's an incremental rather than revolutionary step (Powerset is trying to take a revolutionary step), however can anyone give me a good argument why LTR should not be done? The idea can be applied to any engine.. keyword, link-graph (Google) or NLP based (Powerset).

Taking the next step beyond that, the next big thing could very well be doing an LTR method per-person or per-peer-group for each query family. This effectively would allow the engine to self-learn to personalize results. One can imagine how this could be glued into the idea of using the 'social graph' to establish the peer-group on a given topic/query.