aicoder: February 2009

Sunday, February 22, 2009

Predicting search engine switching behavior

Following up on the previous post, I found a few interesting papers (via Google) on user switching behavior of search engines.

An Analysis of Search Engine Switching Behavior Using Click Streams
Juan & Chang of Yahoo Inc
Making Sense of Search Result Pages by Pedersen of Yahoo
Defection detection: predicting search engine switching
Heath & White of Microsoft
Enhancing web search by promoting multiple search engine use
White, Heath and co-workers at Microsoft
Stream Prediction Using A Generative Model Based On. Frequent Episodes In Event Sequences by Laxman, Tankasali and White of Microsoft

Definitely worth reading in detail as the ways of building the models might be applicable to other behaviorally driven events.

Can we measure Google's monopoly like PageRank is measured?

Jeremy Pickens posted an interesting note on his IR new blog:

Is it really true that Google is competing on a click-by-click basis? In the user studies that Google does, which of the following happens more often when the user types in a query to Google, and sees that Google has not succeeded in producing the information that they sought (fails):
Does the user reformulate his or her query, and click “Search Google” again (one click)? Or,
Does the user leave Google (one click), and try his or her query on Yahoo or Ask or MSN (second click), instead?

His points about actions 1 versus 2 are very astute. I’d guess that #2 happens a LOT on the # 2-10 search engines. Meaning people give that engine a try.. maybe attempt a reformulation.. then abandon that engine and try on Google. And I’m betting that people ‘abandon’ Google at a far less rate than other engines.. ie asymmetry of abandonment.

I’d love to do the following analysis given a browser log of search behavior:

Form a graph where the major search engines are nodes in the graph

For each pair of searches found in the log at time t and time t+1 for a given user, increment the counter on the edge SearchEngine(t) -> SearchEngine(t+1). Once the entire log is processed normalize the weights on all edges leaving a particular node.

We now have a markov chain of engine usage behavior. The directional edges in the graph represent probability of use transference to another engine, self-loops are the probability of sticking with the current engine.

If we calculate the stationary distribution of the adjacency matrix of probabilities, we should have a probability distribution that closely matches the market shares of the major engines. (FYI - this is what PageRank version 1.0 is - the stationary distribution of the link graph of the entire web)

What else can we do? We can analyze it like it’s a random walk and calculate the expected # of searches until a given user of any internet search engine will end up using Google. If the probabilities on the graph are highly asymmetric.. which I think they are.. this is a measure of the monopolistic power of people’s Google habit.

This should also predict the lifetime of a given ‘new’ MSN Live or Ask.com user.. meaning the number of searches they do before abandoning it for some other engine.

Predicted End Result: Google is the near-absorbing state of the graph.. meaning that all other engines are transient states on the route to Google sucking up market share. Of course this is patently obvious unless one of the bigs changes the game.

Saturday, February 21, 2009

Scalable Analytics - random notes

As long as I'm posting about 'rethinking' any reliance on an RDBMS for 'big data' systems, here are a few more notes:

This post on highscalability.com about Rackspace and MapReduce is highly enlightening. The post takes you through a failed use of MySQL for large scale analytics and their conversion to Hadoop.

I can't say I know (yet) the pros and cons of putting structured bigdata on a 'column store DB' versus Hadoop+HDFS. Will probably end up using both systems in various ways. Currently exploiting Sam Tingleff's DGrid and poking at LucidDB for "row filtering and aggregation" style analytics apps.

Looking forward to setting up Hive and Pig next.

On the plus side for MySQL, the federated engine has been quite useful for accumulating data from a sharded/partitioned MySQL setup.. as long as the data being accumulated is less than 100K rows, then it seems to hit a wall. It's also quite brittle if your MySQL instances are having any performance issues.. failed connection can cause other ETLs that depend on that connection to fail in odd ways.

Friday, February 20, 2009

Scalable data storage for model learning

I'll admit it, sometimes as an AI & Data wonk you want all your data structured and easily accessible via slicing a dicing, thus SQL databases and relational schemas look like a great idea. And it works for a while... until you're dealing with a torrent of incoming data.

This should not be unfamiliar, the UC-Irvine datasets look like a great place to cut your teeth with Machine Learning .. until you realize that many algorithms and software packages written and litmus tested against such data totally fall down on 'big data'.

This quote from Dave Cancel on Twitter stuck with me: "Databases are the training wheels of software development. Fly free brother, fly free. - University of Sed, Grep, & Awk". My first reaction was, meh.. databases are meant for storing lots of data. I love and use tools like those for prototyping.. but then moved to compiled code + SQL for 'production'.

Mental sea change! Let's say you are building a massive scale system for absorbing click data, processing it and turning it into a recommender system. These are the problems you will see using MySQL at scale. Hint: it tips over at 1K read/write operations per second against the same table.

Don't try and make your read store also your write store. You may not /really/ need a low latency (for model updates) system.

More tips from Dave:

As for storage of [models], I suggest storing them in text file (format up to you), 1 per profile, then stick them behind a reverse caching proxy like Varnish. Infinite scale. For extra points store the text files on S3 and use the built-in webserver to serve them to your reverse proxy. HTTP is your friend as is REST SOAs.

Here's another dead simple storage mechanism:
http://etherpad.com/9JrpcyvXyK

"simple DB" projects like Voldemort and Tokyo Cabinet and MemcacheDB are options as well.

If you can't depend on a room full of DBAs making your SQL DBs not be dog slow, (or buying $$$ systems from Oracle and Microsoft) you must think differently. Pull yourself out of the Math and AI thinking and simplyify. Big Data will eat you alive otherwise.

Wednesday, February 18, 2009

TunkRank Scoring Improvement

Recently Daniel Tunkelang blogged about an influence rank for Twitter. Jason Adams took up the implementation challenge and coined it TunkRank.

$Influence(X) = \sum_{Y\in Followers(X)} 1 + \frac{p*Influcence(Y))}{ \left \|Following(Y) \right \| \right}$

After some poking at it, I'm suggesting a scoring improvement. At the moment the primary rank is percetile in the UI, however the raw score is given as well. I checked a few users and put together the table below, and it feels wrong. It saturates @ 100 too quickly and there is not enough differentiation between people with healthy versus massive influence.

Why 'feel'? Human interpretable numbers need a tactile sense to them in my opinion. One critique of the metric system is that the English system just feels more human compatible, an inch is not too small, a foot is reasonable, a mile is a long way and 100 miles per hour is darn fast.

I'm proposing two new scoring possibilities. Both are based upon logarithms and span from 1-100. The slight difference between them is how 'linear' the resulting rank feels across the accounts I compared.

LEAST(100,ROUND(POWER(LN( tunkrank-raw-score +1);1.82)))

$TunkRank(X) = ln(Influence(X) + 1)^{1.82}$

LEAST(100,ROUND(LN( tunkrank-raw-score+1)/LN(3.5) * 10))

$TunkRank(X) = 10 * ln(Influence(X) + 1)/ln(3.5)$

What are the constants? They are magic numbers to map Barak Obama to a TunkRank of 100 as well as provide an interesting spread between the test accounts below. Comments welcome! Which is my choice? I can't decide.. #1 is smells more accurate, #2 tastes more natural.

Yes this is an inexact science.

Possible Tunkrank Bug? Check out dewitt's rank.. looks off given his number of followers and that he's an influential guy from Google.


NAME       PERCENTILE RAW SCORE  NEW SCORE #1 NEW SCORE #2
BarackObama        100     277770        100        100
wilw               100      79118         82         90
guykawasaki        100      62543         79         88
JasonCalaca        100      59075         78         88
THErealDVOR        100      43207         74         85
anamariecox        100      38177         73         84
WilliamShat        100      13932         61         76
fredwilson         100      13340         60         76
abdur              100       1351         36         58
johnhcook           99        407         26         48
johndcook           94         61         13         33
gutelius            84         20          8         24
nealrichter         81         16          7         23
ealdent             80         16          7         23
dtunkelang          79         15          6         22
dewitt               1          2          1          9