Thursday, December 13, 2007

Work and School News

It seems I have committed the most frequent sin of blogging.. sporadic posting patterns. All of November and half of December and no posts. It's been a busy month at OthersOnline. We got a huge spike in user traffic and sign-ups, which is awesome.. and predictably exposed some performance issues. That work, plus working on a better user behavior capture algorithm (to better match users to other users and content) was the bulk of the time.

In school news, I just turned in a first draft of the dissertation to my adviser. It's really a disposable organizational draft to help us plan out the flow of topics and structure, and what is left to add. Nice milestone anyway. Goal is late spring for a near-final draft, with the defense goal in August.

Some advice for people doing PhD work while employed full-time: If you are working in your field of graduate study, and your employer permits it.. do the dissertation on a work related topic! I could easily have done mine on a topic related to my AI work at RightNow and have finished by now. I chose to not only do a different topic in AI from work (to be more well-rounded) but to do it on the theory of that topic (genetic algorithms).

While this seemed a good choice at first, the dissertation sank to 4th place on the priority list (behind family, career, and misc leisure activities). Don't make this mistake! Take the shortest path and get done... then pursue the side topic under no pressure.

Monday, October 29, 2007

Inferring meaning from data and structure

Jeremy Liew has quite the thread going on his blog about 'Meaning = Data + Structure (User Generated)', Part 2 on Inferring Structure and a Guest Post by Peter Moore.

The post by Moore is a wonderful summary of approaches and their difficulties, and I'll post more on this as I think about it. My initial response is that we should stop looking/waiting for some near holy-grail {fully functional semantic web} and use a lot of good-enough {technologies, algorithms, ontologies} to make progress. I think that the perfection-in-reasoning stuff is great for the teleportation version of personal search vs the good-enough techniques as applicable now to the orienteering version of personal search. See this post and this paper for orienteering vs teleportation in search.

Last week the Bozeman AI group read a paper on Deriving a Large Scale Taxonomy from Wikipedia. I look at this as an example of the main idea above, deriving structure from user generated content. True, Wikipedia is already structured, but not necessarily in a way that a computer program can reason with.

The killer thing about this idea is that it's finally time to do it. Essentially this is what machine learning and data mining have been about for years. I've read/perused hundreds of academic papers where the basic premise is that we write a suite of algorithms to learn/extract structure from a pool of data. A big chunk of papers in the KDD conferences each year (2007, 2006, 2005) operate on this premise, and this field is quite old (decades).

Really pointy-headed CS types are horrible at monetizing their work. At approx the same time that the Google founders were inventing PageRank, Jon Kleinberg was creating HITS. Both are link-analysis algorithms meant to augment what at the time were poor quality search engines. Over the past 10 years, whenever the two have been evaluated head-to-head on some Information Retrieval task, HITS has worked on par with PageRank. Yet Kleinberg is not now worth 40 billion dollars like Brin and Page of Google.

I fear that the Semantic Web people/researchers have been building sand castles for a decade rather than monetizing what they have to subsidize more research on it. Perhaps if they had been, Delicious, Digg, Wikipedia, et al. would be contributing to the Semantic Web natively, rather than forcing people to figure out a way to export that data into RDF/OWL.

Wednesday, October 24, 2007

Semantic Wishful Thinking? Or Semantics for turning lead into gold?

I'm seeing quite the meme these days on the 'Semantic Web' as a way to build the next big thing (See Twine, AdaptiveBlue, more). The essence of the Semantic Web is the markup of knowledge in such a way as to enable machines to reason about it.

The idea is to have every HTML page you download contain markup that enables a smart web browser or search engine to know whether you are looking for (or browsing about) Anthrax the UK punk band, the US heavy metal band, the fly, or the toxin. This vision is basically one of the Structured Web.

There are issues in my mind:

1) The Semantic Web has been around for years. During all those years the content of the web grew from nearly nothing to the mountain of (mostly unstructured) goo we all browse daily. Why/How will all that knowledge be 'structured'?

Take Home: People do not want to 'structure' knowledge themselves. They are writing their content for people and not machines (except the SEO people).

2) Formally structured data is an OLD idea in AI. See expert systems. How will the 'semantic web' overcome the basic problem that structuring human knowledge is DAMN hard? And by hard I mean making it consistent (this is what mostly broke expert systems).

Have you been following what Cyc Corp has been doing since 1984? Attempting to structure human knowledge. These guys have invented whole new ways of representing human knowledge.. where is it on the web? Can anyone tell me an application that uses it? I am very certain that the CycCorp guys could build (and likely already have) a way to export their databases into RDF, OWL, etc.

Also.. the old white-haired guys of AI invented various forms of Semantics and 'Knowledge Representation' way-back in AI history (see chapter 10 of Russell-Norvig).

Take Home: Once you have the knowledge structured and embedded, what happens next? Magic? Merely inventing a representation of knowledge relies on the 'if you build it they will come' doctrine of AI.. which has NEVER been true.

3) Reasoning with said structured knowledge is unsolved in general. Given a specific knowledgebase (or Semantic database) and specific questions (or semantic queries) systems can reason about the question and deliver results.. but it's still a garbage-in-garbage-out world.

This is especially true when most people really expect a search engine to read their minds (Sorry Udi - I agree with Greg) or they tend to give up on their search queries.

How do we prevent such systems from becoming SEO spammed? I suppose a reputation system on the source of semantic markup data could be created.

Take Home: How in the hell do I build a search engine that uses Semantics that really understands what I am looking for and delivers me the Answer? Such a system pretty much is an AI Oracle.

Ok.. enough with the half-empty-glass negativity! What can we really do with the Semantic web NOW?

For sure we can build a semantically enhanced 'filter' of the web. Google/Yahoo/MSN/Ask are great, but in the end they are giant databases that serve you up link-graph weighted & keyword-filtered URLs.

However, if you are trying to build a money making business, a new search box that returns URLs seems like an insane idea.. unless you can co-opt the browser and augment the results that the big-boys are returning (See Search Radar). Or pull a StumbleUpon strategy.

For a business, the Semantic Web is a potential tool in a step along the path in creating a valuable application. Remember your history here.. creating a giant repository and/or formal structure of knowledge will not alone result in something novel.. nor is using it required to create novelty in AI.

I'd probably make the argument that delicious itself (and similar data) is a growing embodiment of a user-generated database that clever software could derive semantic-data from.

I am NOT arguing that the semantic web is a bad idea... but be careful of the hype you read. The Semantic Web is merely the first step (and a hard one) at stitching together knowledge in a way that can be used to reason. The Semantic Web is as necessary for a smarter web as databases are for useful applications... yet the database is the data-store and NOT the application logic.

Friday, October 19, 2007

Recommender Systems

Greg Linden has another insightful post on his blog about Recommender Systems. He argues that the systems can be tuned to recommend diversity (a la Netflix), rather than the too-similar echo chamber of stuff you see sometimes on Amazon.

Jeremy Liew at LightSpeed VCP had a good post recently about search query understanding being the future direction of search.

In my mind, recommender systems are part of that vision. A truly great search engine will seek to understand your queries, your query history, personal interests and recommend content.. rather than just give you a keyword-filtered & ranked slice of the web.

Yet there are other ways to achieve that kind of output. Search engines and AI in general are a good distance away from real query understanding (it requires some form of machine reading). If instead we consider bootstrapping a recommender system that is driven by people's recommendations on a topic.. we can potentially get there quicker. This is how you train product recommender systems (with purchase history).
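
The purchase-history idea above can be made concrete with a toy co-occurrence recommender. This is only a minimal sketch under invented assumptions: the `histories` data, the user names, and the `recommend` function are all hypothetical, and real product recommenders are considerably more careful about popularity bias and sparse data.

```python
from collections import Counter
from itertools import permutations


def build_cooccurrence(histories):
    """Count how often each ordered pair of items shows up in one user's history."""
    counts = Counter()
    for items in histories.values():
        for a, b in permutations(set(items), 2):
            counts[(a, b)] += 1
    return counts


def recommend(user, histories, top_n=3):
    """Suggest unseen items that co-occur with items the user already has."""
    counts = build_cooccurrence(histories)
    seen = set(histories[user])
    scores = Counter()
    for (a, b), n in counts.items():
        if a in seen and b not in seen:
            scores[b] += n
    # Highest score first; ties broken by name so the output is stable.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ordered][:top_n]


# Hypothetical histories (purchases, bookmarks, or shared URLs would all fit).
histories = {
    "ann": ["ga-book", "ir-book"],
    "bob": ["ga-book", "ir-book", "ml-book"],
    "cat": ["ir-book", "ml-book"],
    "dan": ["ga-book", "beer-guide"],
}

print(recommend("ann", histories))  # ['ml-book', 'beer-guide']
```

The same structure works if the "items" are pages that people recommended on a topic rather than products they bought, which is the bootstrapping route suggested above.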

A system that implicitly follows you around the web and allows your content to be communally shared into an index would at a minimum be a very fresh index of what people are looking at now. Combining this index with a social network of people (enabling matching of topically relevant users to you) and we have something of a human-filter of the web driving a content recommender.

Yes, this is what many social URL sharing sites are building now... but do they have the pieces all together to drive people to directed content rather than allowing them to surf the wave of current topics?

Thursday, October 18, 2007

Seattle Beer notes

My September trip to Seattle included stops at Kells. I really loved the Roslyn Brookside Lager from Roslyn Brewing. It has a wonderful fruity complexity to it, which is unusual for a lager (more like a Kolsch). A quick email to the brewer and I learned that he ferments it warm with a lager yeast.

I also enjoyed the Baron Brewing Helles Bock served at the Palace Kitchen. Great malt flavor. Great food at PK at reasonable prices.

On this October trip to Seattle I loved the Feierabend Pub. They have about 18 beers (all German styles) on tap. I tried/sampled about 5 kinds of Octoberfest and several other lagers.

Tap House Grill, 160 draft beers.. need I say more? This place was impressive. I tried two more Baron beers (Pils & Uber-Weiss) and the Brewery Ommegang Hennepin Farmhouse Saison. It's good, but my taste buds still prefer the New Belgium Saison.. the NB has a nice earthy taste.

Wednesday, October 17, 2007

Attention IR and People Search

The SIGIR 2007 conference also had a couple of gems in the Doctoral Consortium workshop.

Krisztian Balog (University of Amsterdam) homepage
People Search in the Enterprise

Balog's abstract looked at two areas concerning people search: profiling people, and enabling search of those people based upon both the topical and social profile. Who is an expert on X? Who do I know (or who can I get introduced to) that is an expert on X? His research seems to be just beginning.. I'll be checking his page for new papers.

Georg Buscher (German Research Center for AI) homepage
Attention-Based Information Retrieval

Buscher won the best presentation award at the workshop. His slides outline how attention data can be used to bias/rerank IR results to enable re-finding old information/documents as well as doing query expansion (profile based???) given the current user's attention data. His research is also fairly new.

Both of these topics are obviously of interest to Others Online and the idea of connecting people together through a common topic or set of topics that are learned as implicitly related to the users.

Learning to Rank

SIGIR 2007 (which I unfortunately did not attend) had a really great workshop called 'Learning to Rank' or LTR. The weekly RightNow-organized Bozeman AI Colloquium recently covered two papers in this area. Essentially the idea is that a search engine can implicitly learn to rank documents for a given query by looking at user behavior.
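
As a rough illustration of learning from user behavior, here is a minimal sketch that nudges a base engine's ordering by per-query click share. It is deliberately cruder than the methods in the workshop papers (which typically treat clicks as relative preferences rather than absolute votes), and the class name, the `click_weight` parameter, and the scoring formula are all assumptions made up for this example.

```python
from collections import defaultdict


class ClickReranker:
    """Re-rank a base engine's results for a query using observed clicks."""

    def __init__(self, click_weight=1.0):
        self.click_weight = click_weight
        # query -> doc -> number of clicks seen so far
        self.clicks = defaultdict(lambda: defaultdict(int))

    def record_click(self, query, doc):
        self.clicks[query][doc] += 1

    def rerank(self, query, base_results):
        """base_results is the base engine's list of docs, best first."""
        n = len(base_results)
        doc_clicks = self.clicks.get(query, {})
        total = sum(doc_clicks.values())

        def score(item):
            position, doc = item
            base = (n - position) / n  # 1.0 for the top slot, 1/n for the last
            share = doc_clicks.get(doc, 0) / total if total else 0.0
            return base + self.click_weight * share

        ranked = sorted(enumerate(base_results), key=score, reverse=True)
        return [doc for _, doc in ranked]


reranker = ClickReranker()
base = ["a.html", "b.html", "c.html"]
for _ in range(3):
    reranker.record_click("reset password", "c.html")
reranker.record_click("reset password", "a.html")

print(reranker.rerank("reset password", base))  # ['a.html', 'c.html', 'b.html']
```

With no clicks recorded for a query, the base order comes back unchanged, so the learned signal only matters once users start interacting.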

The first one we covered (by Yeh, Lin, Ke & Yang) used genetic programming to do the learning. Needless to say this caught my eye. Evolutionary Algorithms are built to learn rankings, usually based upon a fitness function. I found this paper interesting, however even the authors admit that their algorithm is very slow.

In my mind they picked too complex of an algorithm. There are far simpler EAs that can do this job. The well-known (n+1) EA could do this task (per query). I'll likely be writing a paper on this for GECCO 2008.
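
To show how small such an EA can be, here is a sketch of a (1+1) EA, the simplest member of that family (one parent, one mutated child per step), searching over orderings of documents for a single query. This is a swapped-in simplification for illustration, not the genetic-programming method from the paper. The fitness function (how many observed "doc X was preferred over doc Y" pairs the ranking respects), the swap mutation, and the sample data are all assumptions.

```python
import random


def fitness(ranking, preferences):
    """Number of (better, worse) pairs that the ranking orders correctly."""
    position = {doc: i for i, doc in enumerate(ranking)}
    return sum(1 for better, worse in preferences
               if position[better] < position[worse])


def one_plus_one_ea(docs, preferences, steps=2000, seed=0):
    """Keep one parent ranking; accept a mutated child if it is no worse."""
    rng = random.Random(seed)
    parent = list(docs)
    parent_fit = fitness(parent, preferences)
    for _ in range(steps):
        child = parent[:]
        i, j = rng.sample(range(len(child)), 2)  # mutation: swap two positions
        child[i], child[j] = child[j], child[i]
        child_fit = fitness(child, preferences)
        if child_fit >= parent_fit:
            parent, parent_fit = child, child_fit
    return parent, parent_fit


# Hypothetical preference pairs for one query, e.g. inferred from clicks.
docs = ["d1", "d2", "d3", "d4"]
preferences = [("d3", "d1"), ("d3", "d4"), ("d3", "d2"),
               ("d1", "d4"), ("d1", "d2"), ("d4", "d2")]

best, best_fit = one_plus_one_ea(docs, preferences)
print(best, best_fit)
```

Because the child replaces the parent only when it is at least as fit, fitness never decreases. On a toy problem like this one, where the preferences are consistent, a few thousand cheap steps are far more than needed.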

Many of the workshop papers reference work by Joachims and Radlinski (find them here). Their recent paper in IEEE Computer (not avail for free) was interesting in that they used a LTR method to re-rank Google results and then did a user-study to look at how effective the method was.

Personally I think that the idea of LTR should be a component of every search engine. The ranking of search results should change as fast as users interact with the content, rather than how fast the content itself changes. This is something that the big search engines are fairly quiet on, not sure why.

Sure it's an incremental rather than revolutionary step (Powerset is trying to take a revolutionary step), however can anyone give me a good argument why LTR should not be done? The idea can be applied to any engine.. keyword, link-graph (Google) or NLP based (Powerset).

Taking the next step beyond that, the next big thing could very well be doing an LTR method per-person or per-peer-group for each query family. This effectively would allow the engine to self-learn to personalize results. One can imagine how this could be glued into the idea of using the 'social graph' to establish the peer-group on a given topic/query.

Friday, September 21, 2007

More old wine in new Web 2.0 bottles

In light of last week's post on "beware of old AI wine in new Web 2.0 bottles" I wanted to post this link from Joel Spolsky. (A buddy of mine brought my attention to it)

Once in a while Joel posts "strategy letters". This one addresses history repeating itself in the old-html-web -> ajax-web paralleling the text-terminal -> windows-api flow. Very true. I did find it odd that he did not mention the Google Web Toolkit in his thoughts about the potential game-changing "NewSDK" and how it needs fancy new compilers. As a CS geek I think the idea of compiling Java to cross-browser-compliant javascript is a simply amazing technical achievement.

The interesting thing about this topic is that it's what Java was supposed to do for the browser back in the 90s. Didn't work, no one could keep their browser & the JVMs synced and integrated well, plus Microsoft managed to run good interference via IE just being crappy at Java/JVMs at that time.

Turns out that Java succeeded wildly in reinventing the way back-end web services are written (CGIs just don't cut it for some things despite the PHP/Perl/Python crowd making CGIs way more useful than before.) On the browser today's Java-JVM is the javascript-engine (which is not java at all). The idea of using JS as byte-code makes me cringe, but it's where we are.

Nice post Joel! Would be nice if he'd follow up on why he thinks that the Google Web Toolkit, Yahoo's YUI and others aren't yet (or won't get to) the definition of his NewSDK.

Monday, September 10, 2007

The Implicit Web flowing into Collective Search

Here are some recent articles that I read and kept thinking about again and again. What is cool about this moment in time is that these things are gelling. Entrepreneurs and innovators are trying to build this stuff, rather than the ideas rotting unfulfilled in the mind of some AI/Search-Engine geek.

Read/Write Web's Implicit Web

Important point here is that systems should both learn what users are interested in implicitly and allow users control over the learned topics. The former point is what algorithms like collaborative filtering were intended to do. The latter is a great point that users should have visibility and control into their learned topics.

This has been a frequent critique against Amazon's recommender system.. while personalized, it can learn goofy things. I have no desire to be a frequent buyer of items similar to what I bought for a niece as a gift last year.

Collective Search by Greg Linden

I just learned that Greg is one of the brains behind Amazon's AI. Thinking about the data Amazon has and what could be done with it always makes me drool. Greg's post here is an aggregation of points he came up with while reading transcripts of the recent SES 2007 conference.

I'll join Ask's Jim Lanzone (isn't the new Ask.com much better than Google!) in saying that collective search is potentially better than personalized search. Greg is arguing for a redefinition of 'personalization' here, but we have to pick descriptive terms for abstract ideas. I would define personalization as skewing of search results by what you are interested in, whereas I'd read collective search as letting the collective behaviors of a group of similar users influence/skew search results. This is the flavor of stuff I worked on at RightNow.
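
The two definitions can be put side by side in a small sketch. Both skew the same base scores; the only difference is where the boost comes from (one user's interests versus the clicks of a group of similar users). The function names, the additive scoring formula, the `weight` value, and the sample data are all hypothetical.

```python
def rerank(base_scores, boost, weight=0.5):
    """Add a weighted boost to each doc's base score and sort best first."""
    scored = {doc: score + weight * boost.get(doc, 0.0)
              for doc, score in base_scores.items()}
    return sorted(scored, key=lambda doc: (-scored[doc], doc))


def personal_boost(doc_topics, interests):
    """Personalization: fraction of a doc's topics that this user cares about."""
    return {doc: len(topics & interests) / len(topics) if topics else 0.0
            for doc, topics in doc_topics.items()}


def collective_boost(peer_clicks):
    """Collective search: share of clicks from a group of similar users."""
    total = sum(peer_clicks.values())
    if not total:
        return {}
    return {doc: clicks / total for doc, clicks in peer_clicks.items()}


# Hypothetical base engine scores and signals.
base_scores = {"a": 0.9, "b": 0.8, "c": 0.7}
doc_topics = {"a": {"sports"}, "b": {"ai", "search"}, "c": {"ai", "beer"}}
my_interests = {"ai", "search"}
peer_clicks = {"c": 8, "a": 2}

print(rerank(base_scores, personal_boost(doc_topics, my_interests)))  # ['b', 'c', 'a']
print(rerank(base_scores, collective_boost(peer_clicks)))             # ['c', 'a', 'b']
```

Nothing stops an engine from blending the two boosts; the sketch keeps them separate only to make the distinction visible.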

Ultimate Answer Engine @ Information Week

Favorite quote: "Who said an edit box and 10 blue links is what search is?" asks Microsoft's Satya Nadella.

This great piece has several items that just jumped out at me. "Queryless Search" is essentially using what the system knows about you and your path through to the engine to do an implicit query. (We also worked on and patented variations of this idea at RightNow). The "Personalization" and "Social Skills" sections deal with the ideas in Greg's post above. More to come on that re 'The Social Graph'.

Another good quote: "Serendipity is an amazing teacher". This is what Others Online is all about... focused on People, not necessarily documents/media.

After reading all three of these in the current context of what people are willing to spend time and money on... I can't help but be totally jacked about the opportunities at hand!

Loads of academics have been working on this stuff for years, check out any ACM SIGIR and various data mining conference proceedings for the last 10+ years. Personally, I've been thinking and working on many of the things above since 2000 when Doug Warner and I started doing a deep dive into the academic literature.

Friday, September 07, 2007

The "social graph" and search engines

Robert Scoble recently posted about Mahalo, TechMeme and Facebook versus Google. His thesis is basically that somehow blending social networks with search engines will be the next big thing. He also comments (as have others) that searching blogs can get better results than major search engines sometimes.

Danny Sullivan chimed in response with a blistering commentary on both Scoble's "new ideas" and Mahalo (run by Jason Calacanis). Mahalo and ChaCha are both 'human powered' search engines. Basically they take popular search terms and use editors to augment and/or reorganize Google results.

First a history review. Way back Yahoo built its people powered directory; while initially useful, it could not keep up with the growth of the internet. Google comes along with a simple idea called PageRank (it essentially forms a Markov model of the web and computes the stationary distribution of the Markov matrix - an 80+ year old idea applied to the web) and kills Yahoo's directory as well as purely keyword based engines like Altavista.
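
For the curious, the "stationary distribution of a Markov matrix" idea fits in a few lines. The sketch below uses plain power iteration on a tiny invented link graph. The damping factor of 0.85 is the commonly quoted default, the three-page graph is made up, and this is a textbook-style illustration rather than a description of how Google actually computes it.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to.

    Returns a dict of page -> rank, approximating the stationary distribution
    of a random surfer who follows a link with probability `damping` and
    jumps to a random page otherwise.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page with no out-links: spread its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical three-page web.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

On this graph page c comes out on top (it collects links from both a and b), followed by a and then b, and the ranks sum to 1 as a probability distribution should.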

More History. Once upon a time in the 60s-80s expert systems were seen as the next big thing in AI. Solve all the world's problems by enabling a formal system of rules and facts to answer questions posed to the system. ES was a miserable failure at these lofty goals. Why? Growing the rulebase is hard. Humans do a terrible job at crafting rulesets that are complete and consistent (no conflicts). Even worse is when you throw multiple people at crafting rules together. You end up with trash.

Why is this relevant here? The lesson of ES seems to be lost on efforts like ChaCha and Mahalo. These systems are built on very basic rules (if query X then return A, B, C, D ...). Granted these are much simpler rules than a typical ES, and the engines don't support real reasoning using backward or forward chaining either. This may not save them.. the rules will still suffer from the huge maintenance problem in a context where the information captured is dynamic and changing. Just ask any of the dozen 80s companies that tried to build medical diagnosis expert systems. The rules suffered from inattention to medical advances as well as being contradictory (multiple doctors with different ideas making rules).

Nowadays we call this "linkrot" on the web. While successful, sites like About.com suffered from linkrot on pages not frequently edited. How will ChaCha and Mahalo avoid this without having a massive number of editors? Del.icio.us itself suffers from the same issues: people tag stuff and it mostly rots, unorganized and unmaintained.

Yet More History. From about 1999 to 2003 AskJeeves.com sold software in the emerging web eCRM space in addition to having a search engine. Web eCRM (or web self-service) is essentially creating a customer service portal for corporate websites. The portal contains a collection of FAQs, articles, HowTos, Manuals etc. The essential function of the portal is to help people find what they are looking for and keep them from dialing the 1-800 customer service number (which typically costs a company about $30 per call). AskJeeves sold their CRM and enterprise search unit in 2003 for less than 5 million dollars. Why? Their system required manual input of a huge set of rules linking search queries and documents, as well as complex rules to equate queries to other queries and attempt to do some Natural Language Processing and Inference.

It didn't work. There was no way in hell that an average business user who maintained this set of Articles, FAQs etc was prepared to do the massive amount of structuring. AskJeeves attempted to hire a team of people to optimize and tune the implementations. It took weeks of learning the business and translating that into structure for the engine to use. Nowadays we call this SEO.

Another example in CRM is the 'chatbot'. These are software products that try and give a user a good customer experience by putting a cute face/persona on the search box and having it talk back to you in a conversational style. They have never really taken off, despite the CRM industry analysts that love them. They suffer from the same basic problem that expert systems (chat bots are expert systems of a sort) suffered from.. structuring information is hard for most people to do.

For the past 8 years I've been working for a CRM company (RightNow Tech) that had a simple idea to help customer service web portals... implicitly learn from what users are doing in the portal to optimize the engine automatically. (See patents 6434550, 6665655, & 6842748 - at the moment the RNT systems process about 100 Million searches per month). The cutting edge of eservice CRM at the moment is taking that type of idea and THEN adding (or learning) structure to it.

Lessons learned and observations:

Study the basic history of AI. Here's a good book Artificial Intelligence: A Modern Approach.
Note that one of the authors (Peter Norvig) is The Director of Research at Google. Prabhakar Raghavan is his counterpart at Yahoo. Ask.com and Microsoft also have strong AI people. There is no secret as to why these four companies are hiring all the good AI people they can relocate to the bay area, Seattle and New Jersey. You will not beat them with an expert system. A secondary lesson of AI is to never believe someone who will attempt to tell you that a new algorithm will create intelligence (neural networks anyone? Fuzzy Logic?).

Look at industries like CRM as a microcosm of the search industry. For every new idea you have, someone in CRM has likely tried it already on a smaller scale.

Beware of old wine in new bottles. You might be able to spend enough money on PR to help you get attention.. but you will likely die unless you invest in real scalable algorithms to do the work.

I'm certainly not intending to down-grade ChaCha and Mahalo as viable businesses. Often the viability of a business is independent of the technology used. They seem to have plenty of funding, and will likely adapt as they see problems. A babe-in-the-woods can't get 20 million in VC money. Neither of these systems will require boiling-the-ocean and implementing strong AI. Spinning a tight loop on what users are looking for and optimizing those results as fast as possible might work long enough to make some cash... it worked to bootstrap Yahoo after all.

As for the social-network blending into standard search? Stay tuned, I'll post some thoughts on that soon. There are plenty of good AI people working on graph based data mining.

Circling back to expert systems, if you can automatically 'read' text, and induce a rule-base.. then use that to help with queries, then we have something. I believe the direction of search engines will slowly head in this direction... machine reading.

Jordan Mitchell (my new boss at OthersOnline.com) recently posted on the same subject on his blog.

Other interesting links about this:
Skrentablog on Mahalo
Kevin Burton's Thoughts on the Social Graph

Thursday, September 06, 2007

New Job - OthersOnline.com

I just started a new job at OthersOnline.com. It's a new startup with a social networking spin. We let users declare themselves, their pages and interests, then be syndicated around the web via the OO Widget (see it to the right). We also have a browser toolbar that allows users to see other people relevant to the user's own interests and the content of the current webpage. I think my official title is the "Search Guy" or "AI Guy" or something. The potential of these two basic ideas is huge, and I'm wading in chest deep to put some great AI ideas into the systems. More posts coming soon on these topics.

I spent the last (nearly) eight years working at RightNow Technologies (a CRM SAAS company - once upon a time it was a small startup as well) in the AI Research Labs. At RNT I was in charge of implementing various search engines, data mining & nlp algorithms, swarm techniques, user interfaces, analytics, and whatever AI I could throw at the basic problem of enabling endusers to find information on approx 2000+ customer service portals around the web (here is Leapfrog's Portal). I spent most of the last six months becoming the project manager of the group, responsible for multiple projects, coordinating with product management, initiating new feature ideas, etc. It's a fantastic group to work for, and has an application for about any advanced CS topic there is. A more complete synopsis is on my resume.

(At some point in 2008 I will hopefully finish a PhD in CS at Montana State - topic is Theory of Genetic Algorithms)

New Blog

I have put off using a blog for too long; the old homepage is too static. I'll use this space to muse about artificial intelligence, search engines, machine learning, social media & widgets, my career, PhD dissertation progress, Montana, fishing and good beer.

My Montana State University homepage
