Monday, November 22, 2010

Computing, economics and the financial meltdown (a collection of links)

This editor's letter from CACM last year is interesting: The Financial Meltdown and Computing by Moshe Y. Vardi

Information technology has enabled the development of a global financial system of incredible sophistication. At the same time, it has enabled the development of a global financial system of such complexity that our ability to comprehend it and assess risk, both localized and systemic, is severely limited. Financial-oversight reform is now a topic of great discussion. The focus of these talks is primarily over the structure and authority of regulatory agencies. Little attention has been given to what I consider a key issue—the opaqueness of our financial system—which is driven by its fantastic complexity. The problem is not a lack of models. To the contrary, the proliferation of models may have created an illusion of understanding and control, as is argued in a recent report titled "The Financial Crisis and the Systemic Failure of Academic Economics."

Krugman's essay at the time, How Did Economists Get It So Wrong?, gave a nice history of economic ideas, the models behind them, and his interpretation of their correctness.

The theoretical model that finance economists developed by assuming that every investor rationally balances risk against reward — the so-called Capital Asset Pricing Model, or CAPM (pronounced cap-em) — is wonderfully elegant
[snip]
Economics, as a field, got in trouble because economists were seduced by the vision of a perfect, frictionless market system.
[snip]
H. L. Mencken: “There is always an easy solution to every human problem — neat, plausible and wrong.”
I read this months ago and it's been percolating in my thoughts since then: My Manhattan Project - How I helped build the bomb that blew up Wall Street by Michael Osinski. Osinski wrote much of the software and models used to construct CMOs and CDOs. Essentially, the software aggregated debt instruments from mortgage and other debt markets and allowed a bond designer to issue tailor-made portfolios of debt while mitigating default risk via that aggregation. He called it his sausage grinder.

“You put chicken into the grinder”—he laughed with that infectious Wall Street black humor—“and out comes sirloin.”

Here's a large collection of links from that period that are worth reading. My thought at the moment is this nugget from Twitter:


Wednesday, October 27, 2010

Review of "Learning to Rank with Partially-Labeled Data"


I've been attending the University of Utah Machine Learning seminar (when I can) this fall. PhD student Piyush Kumar Rai is organizing it.

I volunteered to take the group through Learning to Rank with Partially-Labeled Data by Duh & Kirchhoff. I have some experience researching and implementing LTR algorithms, mostly using reinforcement learning or ant-system type approaches. Some general intro here.

The paper presents the main transductive learning algorithm as a framework, then fills in the blanks with Kernel PCA and RankBoost. Several kernels are used: linear, polynomial, radial basis function and kNN-diffusion. RankBoost learns an ensemble of 'weak learners' with simple thresholds.

The main reason to read the paper if you are already familiar with LTR is the use of the transductive algorithm.

Note the DISCOVER() and LEARN() functions. These are the unsupervised and supervised blanks in the framework, filled here with Kernel PCA and RankBoost respectively. What the first actually does is learn a function we could call EXTRACT() that extracts or creates features for later use. They do show that the basic idea of layering unlabeled data in with labeled data is a net gain.
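
To make the flow concrete, here is a rough paraphrase of the transductive loop as I read the paper (Python-ish, not the authors' code); discover and learn are the plug-in points for DISCOVER() and LEARN(), and the data-structure names are mine:

# Rough paraphrase of the transductive rank-learning loop as I read the paper.
# `discover` and `learn` are the DISCOVER()/LEARN() plug-in points (Kernel PCA
# and RankBoost in the paper); the data-structure names here are mine.

def transductive_rank(train_queries, test_queries, discover, learn):
    rankings = {}
    for q in test_queries:
        # DISCOVER: unsupervised step on this query's unlabeled documents,
        # yielding a feature extractor (what I'd call EXTRACT() above).
        extract = discover(q.documents)

        # Re-represent the labeled training data with the discovered features.
        labeled = [(extract(doc), label)
                   for tq in train_queries
                   for doc, label in tq.labeled_documents]

        # LEARN: supervised step (RankBoost in the paper) on those features.
        score = learn(labeled)

        # Rank this query's documents with its own query-specific model.
        rankings[q] = sorted(q.documents, key=lambda d: score(extract(d)),
                             reverse=True)
    return rankings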

There are some issues with the paper. First, the computational performance, as they admit, is not good. Second, their use of Kernel PCA in an information retrieval context is a bit naive IMO. The IR literature is full of hard-won knowledge about extracting decent features from documents (see this book for example), and that is mostly ignored here.

The more confusing part is the use of K-Nearest-Neighbor diffusion kernels. Basically they take the document vectors, form a matrix by Euclidean distance, and then random-walk the matrix for a set number of time steps. The PCA then takes this 'kernel' output and solves the eigenvalue problem to get the eigenvectors. This all seems a roundabout way of saying they approximated the Perron-Frobenius eigenvector (sometimes called PageRank) by iterating the matrix a set number of times and zeroing out low-order cells. Or at least I see no effective difference between what they did and what I just described. Basically they just make the matrix sparse so it is easier to solve (i.e., this is the dual).
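
To illustrate what I mean, here is a toy numpy sketch (mine, not the paper's) of approximating the principal eigenvector by power-iterating a kNN-sparsified similarity matrix:

# Toy illustration (not the paper's code): build a similarity matrix from
# document vectors, keep only the k nearest neighbors per row (sparsify),
# then power-iterate to approximate the Perron-Frobenius eigenvector.
import numpy as np

def knn_sparsify(X, k=10):
    # Pairwise Euclidean distances -> similarities, zeroing all but k per row.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sim = np.exp(-d)
    for row in sim:
        row[np.argsort(row)[:-k]] = 0.0   # keep only the k largest entries
    return sim / sim.sum(axis=1, keepdims=True)   # row-stochastic matrix

def principal_eigvec(P, steps=50):
    v = np.ones(P.shape[0]) / P.shape[0]
    for _ in range(steps):                # a fixed number of "random walk" steps
        v = v @ P
        v /= v.sum()
    return v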

Their use of various classic IR features like TF-IDF, BM25, etc. needed help. There's plenty of IR wisdom on how to use such features; why let DISCOVER() wander about attempting to rediscover it? The results were also muddled, with only one of the three data sets showing a significant improvement over a baseline technique.

All that aside, it's worth a read for the intro to the transductive algorithm applied to an IR-centric task.


Tuesday, October 26, 2010

Stanford Computational Advertising course - Fall 2010

Andrei Broder and Vanja Josifovski of Yahoo Research Labs are again offering a Fall course on Computational Advertising at Stanford.

Great intro to the area.


Monday, August 16, 2010

Strategic Marketing

I highly recommend this ExecEd course at MIT Sloan: Strategic Marketing for the Technical Executive.  

Duncan Simester taught it. The course looks to be a condensed form of this course: 15.810 Marketing Management

Key takeaways:
1) Making decisions by experimentation versus meetings+intuition is crucial.
2) Don't assume your role is to know the answer. Your role is really to work out how to find the answer as quickly as possible.
3) Brand is unimportant when customers can observe that you are meeting their needs.
4) Brand is important when they can't search/observe and must reason with less data.
5) Don't price your products based upon cost or competition; work out your true value to the customer.
6) Your actual efficiencies might be different from what you assume.
7) Experiment and iterate as often as you can.


Tuesday, July 27, 2010

Dissertation done!

The big event of the spring was defending and completing the below.
Really happy to be done.

Advice for working professionals attempting a PhD:
1) pick something relevant to your work.
2) think twice about a theoretical topic.
3) don't make it longer/bigger than necessary.
4) don't grow your family during this time

I did not follow this advice myself, which likely resulted in a four-year delay. The outcome was great though, and the topic is now relevant to my new job at Rubicon.

http://nealrichter.com/research/dissertation/

On Mutation and Crossover in the Theory of Evolutionary Algorithms

Abstract:
The Evolutionary Algorithm is a population-based metaheuristic optimization algorithm. The EA employs mutation, crossover and selection operators inspired by biological evolution. It is commonly applied to find exact or approximate solutions to combinatorial search and optimization problems.

This dissertation describes a series of theoretical and experimental studies on a variety of evolutionary algorithms and models of those algorithms. The effects of the crossover and mutation operators are analyzed. Multiple examples of deceptive fitness functions are given where the crossover operator is shown or proven to be detrimental to the speedy optimization of a function. While other research monographs have shown the benefits of crossover on various fitness functions, this is one of the few (or the only one) doing the inverse.

A background literature review is given of both population genetics and evolutionary computation with a focus on results and opinions on the relative merits of crossover and mutation. Next, a family of new fitness functions is introduced and proven to be difficult for crossover to optimize. This is followed by the construction and evaluation of executable theoretical models of EAs in order to explore the effects of parameterized mutation and crossover.

These models link the EA to the Metropolis-Hastings algorithm. Dynamical systems analysis is performed on models of EAs to explore their attributes and fixed points. Additional crossover deceptive functions are shown and analyzed to examine the movement of fixed points under changing parameters. Finally, a set of online adaptive parameter experiments with common fitness functions is presented.

Finalized April 19, 2010
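
For anyone not familiar with the basic algorithm the abstract refers to, here is a bare-bones sketch of the generic EA loop (mutation, crossover, selection) on a toy bitstring problem. It is illustrative only and not one of the models analyzed in the dissertation:

# Bare-bones generational EA on a toy bitstring problem (OneMax).
# Illustrative only; not one of the models analyzed in the dissertation.
import random

def onemax(bits):                      # fitness: count of 1-bits
    return sum(bits)

def mutate(bits, rate):
    return [b ^ (random.random() < rate) for b in bits]

def crossover(a, b):                   # one-point crossover
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(n=50, length=40, gens=200, mut_rate=0.02):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(n)]
    for _ in range(gens):
        pop.sort(key=onemax, reverse=True)
        parents = pop[: n // 2]        # truncation selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)), mut_rate)
                    for _ in range(n - len(parents))]
        pop = parents + children
    return max(pop, key=onemax)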


Monday, November 02, 2009

RFP: Bounding memory usage in Tokyo Cabinet and Tokyo Tyrant

I'm soliciting proposals to implement an absolute (or at least soft) bound on the memory usage of TT + TC. The reward is cash. Send me a proposal. The patch needs to work without crashing TT/TC or corrupting data.

Here's a start: http://github.com/nealrichter/tokyotyrant_rsshack

I've attempted to do this myself and have not had the time to finish or fully test it. I've asked Mikio for feedback/help finishing this and he's been nearly silent on the request.

At the moment we (myself, Sam Tingleff and Mike Dierken) work around this issue by continuing to play with various TC parameter settings and restarting the daemon when memory usage grows beyond a comfort level.
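
For the curious, the restart half of that workaround is nothing fancier than a watchdog along these lines. This is a sketch; the pidfile path, limit and restart command are placeholders rather than our exact setup:

# Watchdog sketch: restart ttserver when its RSS exceeds a threshold.
# Pidfile path, limit and restart command are placeholders, not our exact setup.
import subprocess, time

RSS_LIMIT_KB = 6 * 1024 * 1024          # ~6 GB; pick your own comfort level
CHECK_EVERY_SECS = 60

def rss_kb(pidfile="/var/run/ttserver.pid"):
    pid = open(pidfile).read().strip()
    out = subprocess.run(["ps", "-o", "rss=", "-p", pid],
                         capture_output=True, text=True)
    return int(out.stdout.strip() or 0)

while True:
    if rss_kb() > RSS_LIMIT_KB:
        subprocess.run(["/etc/init.d/ttserver", "restart"])
    time.sleep(CHECK_EVERY_SECS)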

I'm a C coder and have hacked the internals of BerkeleyDB in the past, so I can help review code, trade ideas, etc. We (as a team) don't have the time to work on this at the moment.

If you are interested contact me! We've got a few other ideas for TT enhancements as well...


Wednesday, October 28, 2009

Current Computational Advertising Course and Events

Andrei Broder and Vanja Josifovski of Yahoo Research Labs are offering a Fall 2009 course on Computational Advertising at Stanford. The slides for lectures #1-#4 are up and it looks to be an interesting course. I will definitely continue to follow along remotely.

The National Institute of Statistical Sciences is holding a workshop on computational advertising in early November. The upcoming WINE'09 conference in Rome has a few accepted papers in the area. SIGKDD 2010 mentions it in the CFP.

Luckily the search engine results for 'Computational Advertising' are still free of noise: Google, Yahoo, Bing

Prior Events:


Friday, September 18, 2009

References for mining from streaming data

While reading a lower-quality paper on the subject I found these references worth tracking down. Essentially, the idea is that you get only one pass through the data and must produce statistics that are approximate in nature, ideally with bounded approximation error.
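
As a flavor of what these algorithms look like, here is a simplified sketch of the lossy counting idea from the Manku & Motwani paper below. This is my own rendering, not their reference code; counts may undercount by at most epsilon * N:

# Simplified sketch of lossy counting (Manku & Motwani): one pass, counts
# undercount by at most epsilon * N. My own rendering, not reference code.
import math

class LossyCounter:
    def __init__(self, epsilon=0.001):
        self.width = math.ceil(1.0 / epsilon)   # bucket width
        self.n = 0
        self.entries = {}                       # item -> (count, max_error)

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        count, err = self.entries.get(item, (0, bucket - 1))
        self.entries[item] = (count + 1, err)
        if self.n % self.width == 0:            # prune at bucket boundaries
            self.entries = {k: (c, e) for k, (c, e) in self.entries.items()
                            if c + e > bucket}

    def frequent(self, support):                # items with frequency >= support
        return [k for k, (c, e) in self.entries.items()
                if c >= (support - 1.0 / self.width) * self.n]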

Gibbons et al 1997: Fast Incremental Maintenance of Approximate Histograms
Incremental maintenance of histograms primarily for database query planners

Manku and Motwani 2002 Approximate frequency counts over data streams.
They show algorithms called sticky sampling and lossy counting with proven error bounds.

Zhu et al. 2002 StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time
Basic stats and high correlations between all pairs of data streams

Cormode and Muthukrishnan's 2003 What's Hot and What's Not: Tracking Most Frequent Items Dynamically
Introduced groupTest probabilistic Monte Carlo algorithm for frequent item tracking

Metwally et al 2005 Efficient Computation of Frequent and Top-k Elements in Data Streams
Uses counters to monitor data streams with a new stream-summary data structure

Cheng et al 2005 Time-Decaying Bloom Filters for Data Streams with Skewed Distributions
Dampened Bloom-filters for frequent items

Three papers on frequent itemsets (different from frequent items):

Jiang and Gruenwald have a pair of 2006 papers
Research Issues in Data Stream Association Rule Mining
Survey paper of issues and previous results

CFI-Stream: Mining Closed Frequent Itemsets in Data Streams
New stream based itemset miner

Tsai and Chen 2009 Mining Frequent Itemsets for data streams over Weighted Sliding Windows
Apriori-like itemset miner on windows of data with differential weighting

Langford et al. 2008 Sparse Online Learning via Truncated Gradient
Induced sparsity in the weights of online learning algorithms with convex loss functions
  1. (9/19) Corrected the StatStream link - hat tip @dataspora
  2. Added Langford - hat tip @gappy3000


Thursday, September 17, 2009

Others Online acquired by the Rubicon Project

I'm thrilled to say that Others Online has been scooped up by the Rubicon Project. Press release here. I'm joining as a Data Scientist. Jordan authored a wrap-up post here: http://blog.othersonline.com/

Update: Nice Forbes article on the opportunity in front of us.
For Advertisers Drowning In Data, A Lifeguard


Monday, September 14, 2009

Google AdWords now personalized

Hat Tip: Found via Greg Linden's blog: Google AdWords now personalized. Below are my thoughts and questions:

Google is now reaching back into your previous search history and presumably choosing a better previous search if the current one is not sufficiently monetizable.

Questions:

  • Is the goal of Google to increase the fill-rate of ads or to show more valuable ads in general?
  • What criteria are used to reach back into a user's history? A boolean commercial/non-commercial test that picks the last commercial search, versus choosing from the last N searches via some selection algorithm (see previous point)?
  • Will the reach-back cross a topic boundary or is it only to enhance context for an ambiguous search?
  • What effect will this have on the Google Keyword Tool that helps advertisers forecast demand and price for a keyword? The volume numbers must now be adjusted for how often impressions are shifted to alternate keywords.
  • How much will this starve the long tail of searches? Depending on the aggressiveness of the selection, long-tail searches may see a decrease in AdWords volume.
Even the most modest change, merely using recent previous searches that are 'about' the current search to augment the AdWords auction query, should have a dramatic effect on the auction process. By definition it expands the number of bidders for a particular query. It may also curtail the effectiveness of arbitrage by AdWords buyers who purchase ambiguous, lower-value keywords as proxies for high-value ones (exploiting user sessions with query reformulations). Why? It should drive up prices for the penny keywords if they are sufficiently related to high-value keywords.

It will be interesting to watch what happens. This is likely a non-trivial change in the keyword market.


Friday, August 28, 2009

The GPL and scripting languages

I was recently asked by an associate how the GPL and LGPL impact scripting languages. The short answer is that it's not clear.

My general take on the LGPL and scripting languages is that if one cut-and-pastes LGPL code A into code B, then code B falls under the LGPL. However, if one script-includes code A into code B, via the only real mechanism scripting languages allow, then code B does not fall under the LGPL.

php: include 'Foo.php';
perl: require Foo::Bar;
python: import Foo
ruby: require "foo"
javascript: <script type="text/javascript" src="external.js"></script>


This differs from the application of the LGPL to compiled languages like C/C++, where the programmer decides how the machine code is combined: statically versus dynamically linked. In Java, if one has an LGPL jar file, the code can be used in the standard way. If your code is combined with the contents of the LGPL jar file into a new jar, then your code falls under the LGPL.

What does the scripting language really do under the hood? It's different in each language. Some may physically include the file and parse-interpret it as one block of code; others may parse-interpret separately and resolve references between the parts, much like a dynamic linker. Do the details matter? Unknown, yet one could reasonably assume the author of a piece of script code released under the LGPL intended script-include style uses to be OK. If in doubt, check with the author and save that email.

What about the GPL? That is clearer: no mixing of any kind can be done. One could argue that interpreted script is not linked the way compiled languages are, yet conservatively this is a "thin reed" to stand on.

What does this mean for the myriad GPL JavaScript out there intermingling with non-GPL and proprietary JavaScript in browsers visiting script-heavy websites everywhere? It's a dog's breakfast of legal issues.

The post Twenty questions about the GPL is a pretty informative treatment of the issues and worth a read; it goes deeper than the above.


Thursday, August 20, 2009

What can we learn from the Google File System?

I found this via my twitter herd and I've been thinking about it for days. Kirk McKusick interviews Sean Quinlan about the Google File System. Fascinating stuff.

We're very fortunate to have storage scalability challenges ourselves at 'Undisclosed' (formerly OthersOnline). We're amassing mountains of modest chunks of information across many hundreds of millions of keys. Our storage has evolved thusly:

  1. MySQL table with single key and many values, one row per value.
  2. MySQL key/value table with one value per key
  3. memcachedb - memached backed by Berkeley DB
  4. Nginx + Webdav system
  5. Our own Sam Tingleff's valkyrie - Consistent Hashing + Tokyo Tyrant
#5 is the best performing, yet we still aren't going to escape the unpredictable I/O performance of EC2 disks. To my understanding this serves a role much like the chunkservers of GFS. We need low-latency read access to storage on one side and high-throughput write access on the other.
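
For context, #5 boils down to consistent hashing in front of a set of Tokyo Tyrant nodes. Here's a bare-bones sketch of the key-to-node mapping; it is illustrative only and not valkyrie's actual implementation:

# Bare-bones consistent hashing ring: map a key to one of several storage
# nodes so that adding/removing a node only remaps a small slice of keys.
# Illustrative only; not valkyrie's actual implementation.
import bisect, hashlib

class HashRing:
    def __init__(self, nodes, replicas=100):
        self.ring = sorted((self._hash(f"{n}:{i}"), n)
                           for n in nodes for i in range(replicas))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["tt-node-1", "tt-node-2", "tt-node-3"])
print(ring.node_for("user:12345"))   # which Tokyo Tyrant node stores this key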

Combining the insights from the above interview with the excellent Pathologies of Big Data, I'm left with the impression that one must absolutely limit the number of disk seeks and do some 'perfect' number of I/O ops against chunks of X MB.

Random questions:
What is X for some random hardware in the cloud? How do we find it? What if X changes as the disk gets full? What kind of mapping function should we send the application key through to get back a storage key that yields the most sequential disk access on write and the best memory caching of chunks on read? What about fragmentation?

It does seem as though the newish adage is very true: Web 2.0 is mostly about exposing old-school Unix commands and system calls over HTTP. I keep thinking this must have been solved a dozen times before. The cycbuf feature of Usenet servers, perhaps?


Sunday, August 02, 2009

Response to Dr Lance Fortnow's CACM opinion

Dr. Lance Fortnow published a strong argument against the current 'strong conference' CS system in the August issue of CACM. His essential desire is that CS move to a system similar to the hard sciences and engineering: conferences would accept any paper of reasonable quality and not publish proceedings, with Journals held as the sole vehicle for publication.

My Questions:

  1. While CS should have a stronger Journal system, why should that come at the expense of quality conferences?

     The reputation of CS conferences as a publication venue means it is acceptable both to publish and to cite papers from these conferences. The result is that one can read a conference paper, compose a citing follow-up, and publish it within a year. This fluidity of ideas, without the sometimes 18-month to 2-year wait on Journal publication, is a great advantage!

     Yes, other disciplines publish pre-prints to arXiv.org.. but is that really a solution when a paper sits rejected yet available on arXiv.org for 18 months?

  2. Is this problem really that acute in all communities? Can't it be solved at the community level?

     Certainly in AI, Machine Learning, Data Mining and Evolutionary Computation I perceive that the Journals are held in high regard (all papers are strong) and that the conference proceedings can be a mixed bag of strong and weak papers. I am getting strong pressure from my advisors and peers to consider Journal versions of some of my conference papers. EC specifically has a non-proceedings conference to meet and discuss less settled results.

     If Dr. Fortnow feels that his area of theoretical computer science is too fragmented, then a solution would be to found more Journals and push the best conference papers into those Journals more heavily. Perhaps the conference system of that area would shrink in response as the Journals established themselves.

  3. Why would industrial researchers and scientists participate in publications with such long time cycles?

     Again, the fluidity is an advantage. Were CS suddenly to switch to a soft conference system (low benefit to participation other than networking), I fear that industry participation in publication venues would suffer. The time scales of Journals in general mean that at publication time the results were done an eternity ago in industry terms. Convincing your supervisor to support participation and publication in a conference is a far easier sell than an extended Journal submission effort.

     One also wonders if the Journal system is not implicitly biased towards academic communities where participants are chasing tenure. This 'ranking of people' referenced by Dr. Fortnow is very much for academic institutions and not of much value to industry IMHO.

  4. Why do we want ANY such top-down forcing of CS organization?

     The culture of CS is much more aligned with self-organization and communities forming out of a birds-of-a-feather effect. This also aligns with the changing face of corporate cultures and culture in general. Such a top-down driven reorg would likely both fail and break the inherent fluidity of ideas and results in CS.

     I also have an unfounded suspicion that such a top-down forced re-org would result in a clustering of power and influence towards traditional centers of power in academia. If one picks up conference proceedings in my favorite CS areas and does a frequency count of the authors' institutions, the distribution is very much long tail. The 'elite' universities are not dominating the results, meaning the 'in crowd' effect is much weaker in CS.

     Feudal systems are dying fast for a reason.

  5. Obviously the current system is serving a need; doesn't that speak for itself?

     If CS researchers and scientists continue to attend, publish at, and found conferences, is this not evidence that it is serving a real need?


While Dr. Fortnow is correct in his points w.r.t. the problems faced by conference program committees, the correct response is best made within that community: found some Journals and compete with the conferences for publications and reputation. I don't accept that a strong Journal system can only be created by first wiping away a very fluid and successful conference system.

My personal solution to strengthening the Journal system? I'll set a goal of submitting a Journal publication or two in the next year. I am completely remiss in not yet submitting anything to a Journal.

As a counterpoint to the above arguments: if I could have a wish, it would be a Journal system with fast review times and immediate publication of accepted papers, one that markets itself by cherry-picking great papers from conferences and encouraging those authors to submit to the Journal. Something like the Journal of Machine Learning Research. The publishing of referee reviews also sounds interesting.

Tuesday, July 14, 2009

Hacking Apache's mod_proxy_http to enforce an SLA

Following up on the last post about HTTP SLAs, let's say you have a web service exposing REST APIs for your awesome data miner/processor. It has data input/output APIs of various kinds. The software architecture consists of front-end Apache servers and back-end Tomcat plus various data stores. Apache's mod_proxy and some load balancer (HAProxy, mod_proxy_balancer) push the incoming requests to the backend servers.

A client wants a guarantee that your APIs will accept requests and return valid data and response codes within XXms for 95% of requests (see Wikipedia's SLA entry for other examples of service guarantees). How can one be absolutely sure that the SLA is met? Now add the wrinkle that there might be different SLAs for the various APIs. In addition, the SLA could specify that as close to 100% of requests as possible return HTTP codes in the 2xx range, suppressing any 3xx, 4xx or 5xx codes from reaching the outside world.

The issues with making apache do this are as follows:
  • ProxyTimeout is global or scoped to the particular vhost
  • ErrorDocuments still return the error code (503, 404, etc)
  • No way to tie ErrorDocuments and ProxyTimeouts to particular requests.

A key insight from Ronald Park is to use mod_rewrite and then pass environment variables to mod_proxy that are specific to the URL matched by mod_rewrite. This was the approach he took in his attempts to solve this problem in Apache 2.0.x here and here.

The below example is a rewrite rule that makes no changes to the URL itself for a JS API presumably returning data in JSON.

RewriteRule ^/api/(.*).js\?*(.*)$ http://backendproxy/api/$1.js?$2 [P,QSA,E=proxy-timeout:900ms,E=error-suppress:true,E=error-headers:/errorapi.js.HTTP,E=error-document:/errorapi.js]

With the SLA enforcement modifications enabled, the URL will either return data from the backend system within 900ms or a timeout occurs. At that point Apache stops waiting for the backend response and serves back the static files: /errorapi.js.HTTP as HTTP headers and /errorapi.js as the content body.

$cat /var/www/html/errorapi.js.HTTP
Status: 204
Content-type: application/javascript

$cat /var/www/html/errorapi.js
var xxx_api_data={data:[]}; /* ERR */

There are four environment variables the SLA hack looks for:
  • proxy-timeout: - time in seconds or milliseconds to wait until timing out
  • error-suppress: - true/false switch on suppressing all non 2xx errors from the backend.
  • error-headers: - file of syntax correct HTTP headers to return to the client
  • error-document: - file of content body to be returned to the client
Leaving off proxy-timeout means errors from the backend are only suppressed after the global timeout occurs. Leaving off error-suppress:true means the 5xx timeout error from mod_proxy_http is returned intact to the client.

Source code here

There are two versions checked into GitHub, for Ubuntu 9.04's apache2 2.2.11 and CentOS el5.2's httpd 2.2.3-11. It's advisable to diff the changes against the 'stock' file and likely re-do the hack in your own version of Apache 2.2. For 2.0.x, see Ron Park's code and fold in the other mods supporting error-suppress, etc.

The hack is being tested in a production environment; stay tuned. This will get posted to the apache-dev list, hopefully with responses suggesting improvements.

Update for 2011: This has handled billions of requests per month at this point and works great. No issues.

Friday, July 03, 2009

HTTP request with SLA

Here's a (non-AI) problem I'd like to solve: configure Apache to receive a request and proxy/forward it to a backend app server (Tomcat), wait a specified period of time, and if no response is received send back a static file or a return code like 204.

Essentially it's a combination of the mod_rewrite and mod_alias directives below, plus a timeout. The example uses Solr without loss of generality.


RedirectMatch 204 /search/(.*)$
RewriteRule ^/search/(.*)$ http://backend:8080/solr/$1 [P]


Logic below:


If url matches rewrite-rule regex then
{
    set timer for 500ms with timer_callback()
    force proxy to second url after rewriting it
    if (response from proxy is > 199 and < 300)
        return response
    else
        return 204 or static default file
}

timer_callback()
{
    return 204 or static default file
}


Instead of returning 204 one could also serve back a static file like /noresults.xml

The general idea is to expose a URL with a near-guaranteed response-time limit (assuming Apache is alive), where a 204 or a static default is acceptable behaviour. I suspect we'll need to write an Apache module to do this, yet surely this question has been asked and solved before!

Saturday, March 07, 2009

Copyright, Licenses and First Sale doctrine

Very interesting 'Legally Speaking' article in the March 2009 CACM, When is a "license" really a sale? by Pamela Samuelson. She covers the principle of 'first sale' and the recent decisions in UMG v. Augusto and Vernor v. Autodesk. Essentially, both Augusto and Vernor were selling used CDs and software on eBay, and the copyright owners of the products went after them for these sales. Both parties are using the first-sale doctrine to defend their right to sell the media. The copyright owners are trying to use contract law and the licenses to restrict transfer of the physical media and the right to use it. Samuelson predicts that both cases will end up in the 9th Circuit appeals court.

I hesitate to issue an opinion here; I see both sides. More debate here, here and here. A nice little post on first sale here. From the perspective of books and music, I side with the first-sale doctrine and mostly see the 'one click' license notices as having many problematic issues. If you receive a physical copy of something for $$ or as a gift, you should be allowed to resell it at your whim.

It's worth hunting down the previous articles in this series.. they stretch back many years.

Thursday, March 05, 2009

In memoriam - Prof. Dr. Ingo Wegener, 1950-2008

Ingo Wegener passed away last November; see his obituary here: Nachruf/Obituary. He taught at the University of Dortmund/TU Dortmund and ran a group of evolutionary algorithm researchers working on the basic theory and run-time analysis of EAs. His papers and the basic approach of his 'school' have been very influential in my PhD research, and he was quite encouraging of my proposed ideas. If at the end of our lives we are professionally remembered half as fondly as he will be, then that is success.

Sunday, February 22, 2009

Predicting search engine switching behavior

Following up on the previous post, I found a few interesting papers (via Google) on user switching behavior of search engines.
Definitely worth reading in detail as the ways of building the models might be applicable to other behaviorally driven events.

Can we measure Google's monopoly like PageRank is measured?

Jeremy Pickens posted an interesting note on his new IR blog:

Is it really true that Google is competing on a click-by-click basis? In the user studies that Google does, which of the following happens more often when the user types in a query to Google, and sees that Google has not succeeded in producing the information that they sought (fails):

  1. Does the user reformulate his or her query, and click “Search Google” again (one click)? Or,
  2. Does the user leave Google (one click), and try his or her query on Yahoo or Ask or MSN (second click), instead?

His points about actions 1 versus 2 are very astute. I'd guess that #2 happens a LOT on the #2-10 search engines, meaning people give that engine a try, maybe attempt a reformulation, then abandon that engine and try the query on Google. And I'm betting that people 'abandon' Google at a far lower rate than other engines, i.e. an asymmetry of abandonment.

I’d love to do the following analysis given a browser log of search behavior:

Form a graph where the major search engines are nodes.

For each pair of searches found in the log at time t and time t+1 for a given user, increment the counter on the edge SearchEngine(t) -> SearchEngine(t+1). Once the entire log is processed, normalize the weights on all edges leaving a particular node.

We now have a Markov chain of engine-usage behavior. The directed edges in the graph represent the probability of switching to another engine; self-loops are the probability of sticking with the current engine.

If we calculate the stationary distribution of this matrix of transition probabilities, we should get a distribution that closely matches the market shares of the major engines. (FYI, this is what PageRank version 1.0 is: the stationary distribution of the link graph of the entire web.)
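
A sketch of that computation, with the engine list and log format made up purely for illustration:

# Sketch: build the engine-switching Markov chain from per-user, time-ordered
# search logs and compute its stationary distribution by power iteration.
# Engine list and log format are made up for illustration.
import numpy as np

ENGINES = ["google", "yahoo", "msn", "ask"]
IDX = {e: i for i, e in enumerate(ENGINES)}

def transition_matrix(log):
    # log: dict of user -> time-ordered list of engine names
    counts = np.zeros((len(ENGINES), len(ENGINES)))
    for searches in log.values():
        for a, b in zip(searches, searches[1:]):
            counts[IDX[a], IDX[b]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.where(row_sums == 0, 1, row_sums)   # row-normalize

def stationary(P, steps=1000):
    pi = np.ones(len(P)) / len(P)
    for _ in range(steps):          # power iteration
        pi = pi @ P
    return pi / pi.sum()            # should roughly track market share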

What else can we do? We can analyze it as a random walk and calculate the expected number of searches until a given user of any internet search engine ends up on Google. If the probabilities on the graph are highly asymmetric, which I think they are, this is a measure of the monopolistic power of people's Google habit.
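
The expected-searches-until-Google number falls out of the same matrix; a sketch, continuing the toy example above:

# Expected number of searches before a user first lands on Google, starting
# from each other engine: solve (I - Q) h = 1, where Q is the transition
# matrix with Google's row and column removed. Continues the toy above and
# assumes Google is reachable from every engine.
def expected_searches_until(P, target_idx):
    others = [i for i in range(len(P)) if i != target_idx]
    Q = P[np.ix_(others, others)]           # transitions among non-target engines
    h = np.linalg.solve(np.eye(len(Q)) - Q, np.ones(len(Q)))
    return dict(zip([ENGINES[i] for i in others], h))

# e.g. expected_searches_until(transition_matrix(search_log), IDX["google"])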

This should also predict the lifetime of a given 'new' MSN Live or Ask.com user, meaning the number of searches they do before abandoning it for some other engine.

Predicted end result: Google is the near-absorbing state of the graph, meaning that all other engines are transient states on the route to Google sucking up market share. Of course this is patently obvious unless one of the big players changes the game.

Saturday, February 21, 2009

Scalable Analytics - random notes

As long as I'm posting about 'rethinking' any reliance on an RDBMS for 'big data' systems, here are a few more notes:

This post on highscalability.com about Rackspace and MapReduce is highly enlightening. The post takes you through a failed use of MySQL for large scale analytics and their conversion to Hadoop.

I can't say I know (yet) the pros and cons of putting structured big data in a 'column store DB' versus Hadoop+HDFS. We'll probably end up using both systems in various ways. Currently we're exploiting Sam Tingleff's DGrid and poking at LucidDB for "row filtering and aggregation" style analytics apps.

Looking forward to setting up Hive and Pig next.

On the plus side for MySQL, the federated engine has been quite useful for accumulating data from a sharded/partitioned MySQL setup, as long as the data being accumulated is less than 100K rows; beyond that it seems to hit a wall. It's also quite brittle if your MySQL instances are having any performance issues: a failed connection can cause other ETLs that depend on that connection to fail in odd ways.