aicoder: search engines


Sunday, February 22, 2009

Predicting search engine switching behavior

Following up on the previous post, I found a few interesting papers (via Google) on user switching behavior of search engines.

An Analysis of Search Engine Switching Behavior Using Click Streams, by Juan & Chang of Yahoo Inc
Making Sense of Search Result Pages, by Pedersen of Yahoo
Defection detection: predicting search engine switching, by Heath & White of Microsoft
Enhancing web search by promoting multiple search engine use, by White, Heath and co-workers at Microsoft
Stream Prediction Using A Generative Model Based On Frequent Episodes In Event Sequences, by Laxman, Tankasali and White of Microsoft

Definitely worth reading in detail as the ways of building the models might be applicable to other behaviorally driven events.

Can we measure Google's monopoly like PageRank is measured?

Jeremy Pickens posted an interesting note on his new IR blog:

Is it really true that Google is competing on a click-by-click basis? In the user studies that Google does, which of the following happens more often when the user types in a query to Google, and sees that Google has not succeeded in producing the information that they sought (fails):
1) Does the user reformulate his or her query, and click “Search Google” again (one click)? Or,
2) Does the user leave Google (one click), and try his or her query on Yahoo or Ask or MSN (second click), instead?

His points about actions 1 versus 2 are very astute. I’d guess that #2 happens a LOT on the #2-10 search engines. Meaning people give that engine a try.. maybe attempt a reformulation.. then abandon that engine and try on Google. And I’m betting that people ‘abandon’ Google at a far lower rate than other engines.. i.e. asymmetry of abandonment.

I’d love to do the following analysis given a browser log of search behavior:

Form a graph where the major search engines are nodes in the graph

For each pair of searches found in the log at time t and time t+1 for a given user, increment the counter on the edge SearchEngine(t) -> SearchEngine(t+1). Once the entire log is processed normalize the weights on all edges leaving a particular node.

We now have a Markov chain of engine usage behavior. The directed edges in the graph represent the probability of a user moving to another engine; self-loops are the probability of sticking with the current engine.

If we calculate the stationary distribution of this matrix of transition probabilities, we should have a probability distribution that closely matches the market shares of the major engines. (FYI - this is what PageRank version 1.0 is - the stationary distribution of the link graph of the entire web)
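Here is a minimal sketch of the steps above, just to make the idea concrete. The log format (a list of `(user, timestamp, engine)` tuples) and the function names are my own assumptions, not any real browser-log schema. Engines that never show an outgoing transition get a self-loop so every row still sums to 1, which is also an assumption.

```python
from collections import defaultdict


def transition_matrix(log):
    """Build row-normalized engine-to-engine transition probabilities.

    log: list of (user, timestamp, engine) tuples (hypothetical format).
    """
    by_user = defaultdict(list)
    for user, ts, engine in log:
        by_user[user].append((ts, engine))

    # Count consecutive pairs SearchEngine(t) -> SearchEngine(t+1) per user.
    counts = defaultdict(lambda: defaultdict(int))
    for events in by_user.values():
        events.sort()
        for (_, src), (_, dst) in zip(events, events[1:]):
            counts[src][dst] += 1

    engines = sorted({engine for _, _, engine in log})
    matrix = {}
    for src in engines:
        total = sum(counts[src].values())
        if total == 0:
            # Assumption: no observed exits means the user stays put.
            matrix[src] = {dst: 1.0 if dst == src else 0.0 for dst in engines}
        else:
            matrix[src] = {dst: counts[src][dst] / total for dst in engines}
    return matrix


def stationary_distribution(matrix, iterations=1000):
    """Approximate the stationary distribution by power iteration."""
    engines = list(matrix)
    dist = {e: 1.0 / len(engines) for e in engines}
    for _ in range(iterations):
        dist = {
            dst: sum(dist[src] * matrix[src][dst] for src in engines)
            for dst in engines
        }
    return dist


# Toy log with made-up data for two users.
toy_log = [
    ("u1", 1, "google"), ("u1", 2, "google"), ("u1", 3, "yahoo"), ("u1", 4, "google"),
    ("u2", 1, "yahoo"), ("u2", 2, "yahoo"), ("u2", 3, "google"),
]
shares = stationary_distribution(transition_matrix(toy_log))
```

Power iteration is the simplest thing that works here. With only a handful of engines the matrix is tiny, so solving the eigenvector problem directly would be just as reasonable.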

What else can we do? We can analyze it like it’s a random walk and calculate the expected # of searches until a given user of any internet search engine will end up using Google. If the probabilities on the graph are highly asymmetric.. which I think they are.. this is a measure of the monopolistic power of people’s Google habit.

This should also predict the lifetime of a given ‘new’ MSN Live or Ask.com user.. meaning the number of searches they do before abandoning it for some other engine.
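Both quantities fall out of the transition matrix with a few lines of code. This is a rough sketch under my own assumptions: the matrix below is invented for illustration (not measured from any log), and the function names are hypothetical. The hitting time is the expected number of engine-to-engine steps before a user first lands on the target engine; the run length treats "stay or leave" as a repeated coin flip on the self-loop probability.

```python
def expected_searches_until(matrix, target, sweeps=10000, tol=1e-12):
    """Expected number of transitions before first reaching `target`.

    Iterates h[i] = 1 + sum_j P[i][j] * h[j] over the non-target states.
    Only converges if `target` is reachable from every other engine.
    """
    others = [e for e in matrix if e != target]
    if not others:
        return {}
    h = {e: 0.0 for e in others}
    for _ in range(sweeps):
        new = {
            i: 1.0 + sum(matrix[i][j] * h[j] for j in others)
            for i in others
        }
        if max(abs(new[i] - h[i]) for i in others) < tol:
            return new
        h = new
    return h


def expected_run_length(matrix, engine):
    """Expected consecutive searches on `engine` before the user switches.

    With self-loop probability p this is the geometric mean 1 / (1 - p).
    """
    stay = matrix[engine][engine]
    return float("inf") if stay >= 1.0 else 1.0 / (1.0 - stay)


# Made-up transition probabilities, purely illustrative.
toy_matrix = {
    "google": {"google": 0.95, "yahoo": 0.03, "msn": 0.02},
    "yahoo": {"google": 0.25, "yahoo": 0.70, "msn": 0.05},
    "msn": {"google": 0.30, "yahoo": 0.10, "msn": 0.60},
}
steps_to_google = expected_searches_until(toy_matrix, "google")
msn_lifetime = expected_run_length(toy_matrix, "msn")
```

If the real probabilities are as asymmetric as I suspect, the hitting times into Google would be short and the hitting times out of Google very long, and comparing the two is one way to put a number on the habit.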

Predicted End Result: Google is the near-absorbing state of the graph.. meaning that all other engines are transient states on the route to Google sucking up market share. Of course this is patently obvious unless one of the bigs changes the game.

Monday, January 26, 2009

Lucid Imagination and Sematext

Kudos to the Solr/Lucene gang for launching Lucid Imagination. Grant Ingersoll's announcement. People involved here and here. Some time ago Otis Gospodnetić launched Sematext. Good luck to Lucid and Sematext!

Both of these companies are in the 'support and consulting' model. This is wise, as going into Enterprise search directly is a tough road: competing with Endeca, Verity (Autonomy), FAST (Microsoft), GoogleBox and the other vendors would be suicidal.

Aside:
Long ago (2003) I thought of hanging up a shingle for supporting HtDig (a once popular CGI based search engine), but wisely decided that would be a mistake given that even then I could see that Doug Cutting's Java Lucene and Nutch were going to smoke the creaky 8+ year old C++ indexing kernel. Ended up getting RightNow Tech to sponsor conversion of the guts to CLucene, where it still runs today indexing many many tens of millions of documents. Then Solr was announced .... and HtDig development died and I started using Solr.

Just touched base with Geoff Hutchinson the other day and we're going to release the 4.0 CLucene branch of HtDig, and put up an announcement of HtDig end-of-life and encourage people to migrate to Solr.

Friday, August 29, 2008

Open Source Search Engine Rodeo: Solr v. Sphinx v. MySQL-FT

Last summer Anthony Arnone and I did a study on the performance of three open source search engines.

We chose these three as two of them have close ties to MySQL and the other is a well used and performant offering from Apache. There are many that we skipped.

Here's the Report in PDF.

Solr was the clear winner. Sphinx was a close second with blindingly fast indexing times.

At this point the report's results are somewhat dated as both Sphinx and Solr are readying new releases. So your mileage may vary, and I'm sure Peter Zaitsev and the Sphinx team could show us how to improve the performance of their engine.

Updates: The Sphinx team contacted me and suggested some ways to improve Sphinx performance. New results will be published some time soon. They will likely also publish a test using Wikipedia as the document repository.

More Updates: I have started a new Solr project and may test Sphinx again.

Friday, April 18, 2008

Using human relevance judgements in search and advertising

This is old news on a couple of dimensions. Read Write Web had a post on how Google uses human relevance studies to help judge/QA their search results. This resulted from an interview that Peter Norvig gave to MIT Technology Review and caused some commenting in the blogosphere (NewYorkTimes Tech Blog, Google Blogoscoped). Old news on old news.

We now know that both Yahoo and Microsoft are using (to some degree) human studies to evaluate computational advertising algorithms (see this and this). How well the informational item an algorithm predicts to be relevant to a context correlates with what humans think is relevant is the performance metric of your algorithm.

Question: When will TREC have a computational advertising contest?

Friday, March 28, 2008

Scraping Documents for Advertising Keywords

Lately I've been working on extracting keywords from text that would be associated with good keyword advertising performance. This is fairly related to the 'text summarization' problem, yet that usually works towards a goal of readable summaries of documents. This is a simpler problem as I don't want to build readable summaries.

'Finding Advertising Keywords on Web Pages' from MS Research (Yih, Goodman, and Carvalho) was interesting reading. To boil it down to its essence, the authors used a collection of standard text indexing and NLP techniques and datasets to derive 'features' from the documents, then used a feature-selection method to decide what features were best in deciding good advertising keywords in a document. They judged the algorithms against a human generated set of advertising keywords associated with a group of web pages. Their 'annotators' read the documents then chose prominent words from the document to use as viable keyword advertising inputs.

Note that this is not an attempt to do topic classification, where you could produce a keyword describing a document that did not exist in the document.. for example labeling a news article about the Dallas Cowboys with 'sports' or 'event tickets' if those labels did not exist in the article.

Interestingly the algorithm learned that the most important features predicting a word's advertising viability were the query frequency in MSN Live Search (a dead obvious conclusion now supported by experiments), and the TF-IDF metric. Other features like capitalization, link text, phrase & sentence length and title/headings words were not as valuable alone.. yet (unsurprisingly) the best system used nearly all features. The shocker was that the part-of-speech information was best left unused.
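To make the TF-IDF feature concrete, here is a minimal sketch that scores only words that actually appear in the document, in keeping with the 'not topic classification' point above. This is NOT the paper's system: it computes a single feature with a smoothed IDF of my own choosing, the tokenizer is crude, and the toy corpus is made up. The query-frequency feature would need a search query log, which I don't have, so it is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into simple word tokens (crude on purpose)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_keywords(doc, corpus, top_k=5):
    """Rank the words of `doc` by TF-IDF against `corpus`.

    corpus: list of document strings (assumed to be the background set).
    IDF is smoothed as log((N + 1) / (df + 1)) to avoid division by zero.
    """
    doc_tokens = tokenize(doc)
    if not doc_tokens:
        return []
    term_counts = Counter(doc_tokens)
    doc_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus)

    scores = {}
    for word, count in term_counts.items():
        df = sum(1 for words in doc_sets if word in words)
        idf = math.log((n_docs + 1) / (df + 1))
        scores[word] = (count / len(doc_tokens)) * idf

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Made-up three-document corpus.
toy_corpus = [
    "the cowboys beat the giants in dallas",
    "the giants lost the game",
    "the weather in the city was mild",
]
keywords = tfidf_keywords(toy_corpus[0], toy_corpus, top_k=3)
```

Words that show up in every document (like 'the') score zero, while words unique to the page float to the top, which is roughly why TF-IDF is a sensible starting feature before anything fancier gets bolted on.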

I emailed the lead author and learned that the MS lawyers killed the idea of releasing the list of labeled URLs.

Post Script: The second author is Joshua Goodman, who had a hilarious exchange with some authors from La Sapienza University in Rome. They wrote a 2002 Physical Review Letters paper on using gzip for analyzing the similarity of human languages. Goodman responded with this critique, causing the original authors to respond with this response. Looks like there are other follow ups by third-parties. The mark of an effective paper is that it is talked about and remembered.

Wednesday, October 24, 2007

Semantic Wishful Thinking? Or Semantics for turning lead into gold?

I'm seeing quite the meme these days on the 'Semantic Web' as a way to build the next big thing (See Twine, AdaptiveBlue, more). The essence of the Semantic Web is the markup of knowledge in such a way as to enable machines to reason about it.

The idea of having every HTML page you download contain markup that enables a smart web browser or search engine to know that you are looking for (or browsing about) Anthrax the UK punk band, the US heavy metal band, the fly, or the toxin. This vision is basically one of the Structured Web.

There are issues in my mind:

1) The Semantic Web has been around for years. During all those years the content of the web grew from nearly nothing to the mountain of (mostly unstructured) goo we all browse daily. Why/How will all that knowledge be 'structured'?

Take Home: People do not want to 'structure' knowledge themselves. They are writing their content for people and not machines (except the SEO people).

2) Formally structured data is an OLD idea in AI. See expert systems. How will the 'semantic web' overcome the basic problem that structuring human knowledge is DAMN hard? And by hard I mean making it consistent (this is what mostly broke expert systems).

Have you been following what Cyc Corp has been doing since 1984? Attempting to structure human knowledge. These guys have invented whole new ways of representing human knowledge.. where is it on the web? Can anyone tell me an application that uses it? I am very certain that the CycCorp guys could build (and likely have built) a way to export their databases into RDF, OWL, etc.

Also.. the old white-haired guys of AI invented various forms of Semantics and 'Knowledge Representation' way-back in AI history (see chapter 10 of Russell-Norvig).

Take Home: Once you have the knowledge structured and embedded, what happens next? Magic? Merely inventing a representation of knowledge relies on the 'if you build it they will come' doctrine of AI.. which has NEVER been true.

3) Reasoning with said structured knowledge is unsolved in general. Given a specific knowledgebase (or Semantic database) and specific questions (or semantic queries) systems can reason about the question and deliver results.. but it's still a garbage-in-garbage-out world.

This is especially true when most people really expect a search engine to read their minds (Sorry Udi - I agree with Greg) or they tend to give up on their search queries.

How do we prevent such systems from becoming SEO spammed? I suppose a reputation system on the source of semantic markup data could be created.

Take Home: How in the hell do I build a search engine that uses Semantics that really understands what I am looking for and delivers me the Answer? Such a system pretty much is an AI Oracle.

Ok.. enough with the half-empty-glass negativity! What can we really do with the Semantic web NOW?

For sure we can build a semantically enhanced 'filter' of the web. Google/Yahoo/MSN/Ask are great, but in the end they are giant databases that serve you up link-graph weighted & keyword-filtered URLs.

However, if you are trying to build a money making business, a new search box that returns URLs seems like an insane idea.. unless you can co-opt the browser and augment the results that the big-boys are returning (See Search Radar). Or pull a StumbleUpon strategy.

For a business, the Semantic Web is a potential tool in a step along the path in creating a valuable application. Remember your history here.. creating a giant repository and/or formal structure of knowledge will not alone result in something novel.. nor is using it required to create novelty in AI.

I'd probably make the argument that delicious itself (and similar data) is a growing embodiment of a user-generated database that clever software could derive semantic-data from.

I am NOT arguing that the semantic web is a bad idea... but be careful of the hype you read. The Semantic Web is merely the first step (and a hard one) at stitching together knowledge in a way that can be used to reason. The Semantic Web is as necessary for a smarter web as databases are for useful applications... yet the database is the data-store and NOT the application logic.

Wednesday, October 17, 2007

Learning to Rank

SIGIR 2007 (which I unfortunately did not attend) had a really great workshop called 'Learning to Rank' or LTR. The weekly RightNow-organized Bozeman AI Colloquium recently covered two papers in this area. Essentially the idea is that a search engine can implicitly learn to rank documents for a given query by looking at user behavior.

The first one we covered (by Yeh, Lin, Ke & Yang) used genetic programming to do the learning. Needless to say this caught my eye. Evolutionary Algorithms are built to learn rankings, usually based upon a fitness function. I found this paper interesting, however even the authors admit that their algorithm is very slow.

In my mind they picked too complex of an algorithm. There are far simpler EAs that can do this job. The well-known (n+1) EA could do this task (per query). I'll likely be writing a paper on this for GECCO 2008.

Many of the workshop papers reference work by Joachims and Radlinski (find them here). Their recent paper in IEEE Computer (not avail for free) was interesting in that they used a LTR method to re-rank Google results and then did a user-study to look at how effective the method was.

Personally I think that the idea of LTR should be a component of every search engine. The ranking of search results should change as fast as users interact with the content, rather than how fast the content itself changes. This is something that the big search engines are fairly quiet on, not sure why.
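As a rough illustration of learning from user behavior, here is a sketch that turns clicks into pairwise preferences and re-ranks on them. It is a toy of my own, not the method from any of the papers above: the rule "a clicked result beats every unclicked result shown above it" is a common click-interpretation heuristic, the win/loss counting stands in for a real learner, and the function names and data are made up.

```python
from collections import defaultdict


def preference_pairs(ranking, clicked):
    """Return (preferred, skipped) pairs for one search session.

    A clicked document is taken to be preferred over every unclicked
    document that was ranked above it (a heuristic, not ground truth).
    """
    pairs = []
    for pos, doc in enumerate(ranking):
        if doc in clicked:
            pairs.extend(
                (doc, skipped)
                for skipped in ranking[:pos]
                if skipped not in clicked
            )
    return pairs


def rerank(ranking, sessions):
    """Re-rank one query's results from click sessions on that ranking.

    sessions: list of sets, each holding the documents clicked in a session.
    """
    score = defaultdict(int)
    for clicked in sessions:
        for winner, loser in preference_pairs(ranking, clicked):
            score[winner] += 1
            score[loser] -= 1
    # sorted() is stable, so documents with equal scores keep their order.
    return sorted(ranking, key=lambda doc: -score[doc])


# Made-up example: users keep skipping 'a' and 'b' to click 'c'.
shown = ["a", "b", "c", "d"]
clicks = [{"c"}, {"c"}, {"b", "c"}]
learned = rerank(shown, clicks)
```

Note the side effect in this toy: a document nobody looked at ('d') can drift above documents that were skipped, simply because it never lost a comparison. A real system would have to correct for position bias and for results the user never saw.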

Sure it's an incremental rather than revolutionary step (Powerset is trying to take a revolutionary step), however can anyone give me a good argument why LTR should not be done? The idea can be applied to any engine.. keyword, link-graph (Google) or NLP based (Powerset).

Taking the next step beyond that, the next big thing could very well be doing an LTR method per-person or per-peer-group for each query family. This effectively would allow the engine to self-learn to personalize results. One can imagine how this could be glued into the idea of using the 'social graph' to establish the peer-group on a given topic/query.

Monday, September 10, 2007

The Implicit Web flowing into Collective Search

Here are some recent articles that I read and kept thinking about again and again. What is cool about this moment in time is that these things are gelling. Entrepreneurs and innovators are trying to build this stuff, rather than the ideas rotting unfulfilled in the mind of some AI/Search-Engine geek.

Read/Write Web's Implicit Web

Important point here is that systems should both learn what users are interested in implicitly and allow users control over the learned topics. The former point is what algorithms like collaborative filtering were intended to do. The latter is a great point that users should have visibility and control into their learned topics.

This has been a frequent critique against Amazon's recommender system.. while personalized, it can learn goofy things. I have no desire to be a frequent buyer of items similar to what I bought for a niece as a gift last year.

Collective Search by Greg Linden

I just learned that Greg is one of the brains behind Amazon's AI. Thinking about the data Amazon has and what could be done with it always makes me drool. Greg's post here is an aggregation of points he came up with while reading transcripts of the recent SES 2007 conference.

I'll join Ask's Jim Lanzone (isn't the new Ask.com much better than Google!) in saying that collective search is potentially better than personalized search. Greg is arguing for a redefinition of 'personalization' here, but we have to pick descriptive terms for abstract ideas. I would define personalization as skewing of search results by what you are interested in. Where I'd read collective search as letting the collective behaviors of a group of similar users influence/skew search results. This is the flavor of stuff I worked on at RightNow.

Ultimate Answer Engine @ Information Week

Favorite quote: "Who said an edit box and 10 blue links is what search is?" asks Microsoft's Satya Nadella.

This great piece has several items that just jumped out at me. "Queryless Search": essentially this is using what the system knows about you and your path through to the engine to do an implicit query. (We also worked on and patented variations of this idea at RightNow). The "Personalization" and "Social Skills" sections deal with the ideas in Greg's post above. More to come on that re 'The Social Graph'.

Another good quote: "Serendipity is an amazing teacher". This is what Others Online is all about... focused on People, not necessarily documents/media.

After reading all three of these in the current context of what people are willing to spend time and money on... I can't help but be totally jacked about the opportunities at hand!

Loads of academics have been working on this stuff for years, check out any ACM SIGIR and various data mining conference proceedings for the last 10+ years. Personally, I've been thinking and working on many of the things above since 2000 when Doug Warner and I started doing a deep dive into the academic literature.

Friday, September 07, 2007

The "social graph" and search engines

Robert Scoble recently posted about Mahalo, TechMeme and Facebook versus Google. His thesis is basically that somehow blending social networks with search engines will be the next big thing. He also comments (as have others) that searching blogs can get better results than major search engines sometimes.

Danny Sullivan chimed in response with a blistering commentary on both Scoble's "new ideas" and Mahalo (run by Jason Calacanis). Mahalo and ChaCha are both 'human powered' search engines. Basically they take popular search terms and use editors to augment and/or reorganize Google results.

First a history review. Way back Yahoo built its people powered directory; while initially useful, it could not keep up with the growth of the internet. Google comes along with a simple idea called PageRank (it essentially forms a Markov model of the web and computes the stationary distribution of the Markov matrix - an 80+ year old idea applied to the web) and kills Yahoo's directory as well as purely keyword based engines like Altavista.
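For the curious, the PageRank idea fits in a few lines. This is a minimal sketch of the commonly described formulation, with a damping factor and with the rank of pages that have no out-links spread evenly. Those details, the 0.85 default and the tiny link graph are my assumptions for illustration, not a description of what Google actually runs.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict of page -> rank, with ranks summing to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is shared by everyone.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for t in targets:
                new[t] += damping * rank[p] / len(targets)
        rank = new
    return rank


# Made-up four-page web: a -> b -> c -> a, plus d -> a.
toy_links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = pagerank(toy_links)
```

The damping term is what keeps the chain well behaved: with some probability the random surfer jumps to a random page, so the stationary distribution exists and is unique even when the link graph has dead ends or disconnected pieces.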

More History. Once upon a time in the 60s-80s expert systems were seen as the next big thing in AI. Solve all the world's problems by enabling a formal system of rules and facts to answer questions posed to the system. ES was a miserable failure at these lofty goals. Why? Growing the rulebase is hard. Humans do a terrible job at crafting rulesets that are complete and consistent (no conflicts). Even worse is when you throw multiple people at crafting rules together. You end up with trash.

Why is this relevant here? The lesson of ES seems to be lost on efforts like ChaCha and Mahalo. These systems are built on very basic rules (if query X then return A, B, C, D ...). Granted these are much simpler rules than a typical ES, and the engines don't support real reasoning using backward or forward chaining either. This may not save them.. the rules will still suffer from the huge maintenance problem in a context where the information captured is dynamic and changing. Just ask any of the dozen 80s companies that tried to build medical diagnosis expert systems. The rules suffered from inattention to medical advances as well as being contradictory (multiple doctors with different ideas making rules).

Nowadays we call this "linkrot" on the web. While successful, sites like About.com suffered from linkrot on pages not frequently edited. How will ChaCha and Mahalo avoid this without having a massive number of editors? Del.icio.us itself suffers from the same issues: people tag stuff and it mostly rots, unorganized and unmaintained.

Yet More History. From about 1999 to 2003 AskJeeves.com sold software in the emerging web eCRM space in addition to having a search engine. Web eCRM (or web self-service) is essentially creating a customer service portal for corporate websites. The portal contains a collection of FAQs, articles, HowTos, Manuals etc. The essential function of the portal is to help people find what they are looking for and keep them from dialing the 1800 customer service number (which typically costs a company about $30 per call). AskJeeves sold their CRM and enterprise search unit in 2003 for less than 5 million dollars. Why? Their system required manual input of a huge set of rules linking search queries and documents, as well as complex rules to equate queries to other queries and attempt to do some Natural Language Processing and Inference.

It didn't work. There was no way in hell that an average business user who maintained this set of Articles, FAQs etc was prepared to do the massive amount of structuring. AskJeeves attempted to hire a team of people to optimize and tune the implementations. It took weeks of learning the business and translating that into structure for the engine to use. Nowadays we call this SEO.

Another example in CRM is the 'chatbot'. These are software products that try and give a user a good customer experience by putting a cute face/persona on the search box and having it talk back to you in a conversational style. They have never really taken off, despite the CRM industry analysts that love them. They suffer from the same basic problem that expert systems (chat bots are expert systems of a sort) suffered from.. structuring information is hard for most people to do.

For the past 8 years I've been working for a CRM company (RightNow Tech) that had a simple idea to help customer service web portals... implicitly learn from what users are doing in the portal to optimize the engine automatically. (See patents 6434550, 6665655, & 6842748 - at the moment the RNT systems process about 100 Million searches per month). The cutting edge of eservice CRM at the moment is taking that type of idea and THEN adding (or learning) structure to it.

Lessons learned and observations:

Study the basic history of AI. Here's a good book: Artificial Intelligence: A Modern Approach.
Note that one of the authors (Peter Norvig) is the Director of Research at Google. Prabhakar Raghavan is his counterpart at Yahoo. Ask.com and Microsoft also have strong AI people. There is no secret as to why these four companies are hiring all the good AI people they can relocate to the bay area, Seattle and New Jersey. You will not beat them with an expert system. A secondary lesson of AI is to never believe someone who will attempt to tell you that a new algorithm will create intelligence (neural networks anyone? Fuzzy Logic?).

Look at industries like CRM as a microcosm of the search industry. For every new idea you have, someone in CRM has likely tried it already on a smaller scale.

Beware of old wine in new bottles. You might be able to spend enough money on PR to help you get attention.. but you will likely die unless you invest in real scalable algorithms to do the work.

I'm certainly not intending to down-grade ChaCha and Mahalo as viable businesses. Often the viability of a business is independent of the technology used. They seem to have plenty of funding, and will likely adapt as they see problems. A babe-in-the-woods can't get 20 million in VC money. Neither of these systems will require boiling-the-ocean and implementing strong AI. Spinning a tight loop on what users are looking for and optimizing those results as fast as possible might work long enough to make some cash... it worked to bootstrap Yahoo after all.

As for the social-network blending into standard search? Stay tuned, I'll post some thoughts on that soon. There are plenty of good AI people working on graph based data mining.

Circling back to expert systems, if you can automatically 'read' text, and induce a rule-base.. then use that to help with queries, then we have something. I believe the direction of search engines will slowly head in this direction... machine reading.

Jordan Mitchell (my new boss at OthersOnline.com) recently posted on the same subject on his blog.

Other interesting links about this:
Skrentablog on Mahalo
Kevin Burton's Thoughts on the Social Graph