## Monday, November 02, 2009

### RFP: Bounding memory usage in Tokyo Cabinet and Tokyo Tyrant

I'm soliciting proposals to implement an absolute (or at least soft) bound on the memory usage of TT + TC. The reward is cash. Send me a proposal. The patch needs to work without crashing TT/TC or corrupting data.

I've attempted to do this myself and have not had the time to finish or fully test it. I've asked Mikio for feedback/help finishing this and he's been nearly silent on the request.

At the moment we (myself, Sam Tingleff and Mike Dierken) work around this issue by tuning various TC parameter settings and restarting the daemon when memory usage grows beyond a comfort level.
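
The restart workaround can be automated with a small watchdog. The sketch below is hypothetical: the pidfile path, memory limit and restart command are assumptions, not our production setup.

```python
import subprocess
import time

RSS_LIMIT_KB = 4 * 1024 * 1024      # assumed 4GB comfort level
PIDFILE = "/var/run/ttserver.pid"   # hypothetical pidfile location

def parse_vmrss_kb(status_text):
    # Pull the resident set size (in KB) out of a Linux /proc status blob.
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0

def rss_kb(pid):
    with open("/proc/%d/status" % pid) as f:
        return parse_vmrss_kb(f.read())

def watchdog():
    while True:
        with open(PIDFILE) as f:
            pid = int(f.read().strip())
        if rss_kb(pid) > RSS_LIMIT_KB:
            # Hypothetical init script; swap in your own restart command.
            subprocess.call(["/etc/init.d/ttserver", "restart"])
        time.sleep(60)
```

Crude, but it beats paging a human when the daemon balloons at 3am.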

I'm a C coder and have hacked the internals of BerkeleyDB in the past, so can help review code, trade ideas, etc. We (as a team) don't have the time to work on this at the moment.

If you are interested contact me! We've got a few other ideas for TT enhancements as well...

## Wednesday, October 28, 2009

### Current Computational Advertising Course and Events

Andrei Broder and Vanja Josifovski of Yahoo Research Labs are offering a Fall 2009 course on Computational Advertising at Stanford. The slides for lectures #1-#4 are up and it looks to be an interesting course. I will definitely continue to follow along remotely.

The National Institute of Statistical Sciences is holding a workshop on CompAdvert in early November. The upcoming WINE'09 conference in Rome includes a few accepted papers in CompAdvert, and SIGKDD 2010 mentions it in the CFP.

Luckily the search engine results for 'Computational Advertising' are still free of noise: Google, Yahoo, Bing.

Prior Events:

## Friday, September 18, 2009

### References for mining from streaming data

While reading a lower quality paper on the subject I found these references worth tracking down. Essentially the idea is that you make one pass through the data and must produce statistics that are approximate in nature, ideally with bounded approximation error.

Gibbons et al. 1997: Fast Incremental Maintenance of Approximate Histograms
Incremental maintenance of histograms primarily for database query planners

Manku and Motwani 2002: Approximate Frequency Counts over Data Streams
They show algorithms called sticky sampling and lossy counting with proven error bounds.

Zhu et al. 2002: StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time
Basic stats and high correlations between all pairs of data streams

Cormode and Muthukrishnan 2003: What's Hot and What's Not: Tracking Most Frequent Items Dynamically
Introduces the groupTest probabilistic Monte Carlo algorithm for frequent item tracking

Metwally et al. 2005: Efficient Computation of Frequent and Top-k Elements in Data Streams
Uses counters to monitor data streams with a new stream-summary data structure

Cheng et al. 2005: Time-Decaying Bloom Filters for Data Streams with Skewed Distributions
Dampened Bloom filters for frequent items

Three papers on frequent itemsets (different from frequent items):

Jiang and Gruenwald have a pair of 2006 papers:
Research Issues in Data Stream Association Rule Mining
Survey paper of issues and previous results

CFI-Stream: Mining Closed Frequent Itemsets in Data Streams
New stream based itemset miner

Tsai and Chen 2009: Mining Frequent Itemsets for Data Streams over Weighted Sliding Windows
Apriori-like itemset miner on windows of data with differential weighting

Langford et al. 2008: Sparse Online Learning via Truncated Gradient
Induced sparsity in the weights of online learning algorithms with convex loss functions

1. (9/19) Corrected the StatStream link - hat tip @dataspora
2. Added Langford - hat tip @gappy3000
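
As a concrete illustration, Manku and Motwani's lossy counting from the list above can be sketched in a few lines. This is a minimal one-pass sketch of the algorithm, not tuned production code:

```python
import math

def lossy_count(stream, epsilon=0.01):
    """One-pass approximate frequency counts.

    Each returned count undercounts the true frequency by at most
    epsilon * N, where N is the stream length.
    """
    width = math.ceil(1.0 / epsilon)       # bucket width
    counts = {}                            # item -> (count, delta)
    bucket = 1
    for n, item in enumerate(stream, start=1):
        if item in counts:
            c, d = counts[item]
            counts[item] = (c + 1, d)
        else:
            counts[item] = (1, bucket - 1)
        if n % width == 0:                 # end of bucket: prune rare items
            counts = {k: (c, d) for k, (c, d) in counts.items()
                      if c + d > bucket}
            bucket += 1
    return {k: c for k, (c, d) in counts.items()}
```

Memory stays bounded because every bucket boundary evicts items whose count plus insertion slack can no longer exceed the error budget.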

## Thursday, September 17, 2009

### Others Online acquired by the Rubicon Project

I'm thrilled to say that Others Online has been scooped up by the Rubicon Project. Press release here. I'm joining as a Data Scientist. Jordan authored a wrap-up post here: http://blog.othersonline.com/

Update: Nice Forbes article on the opportunity in front of us.
For Advertisers Drowning In Data, A Lifeguard

## Monday, September 14, 2009

Hat Tip: Found via Greg Linden's blog: Google AdWords now personalized. Below are my thoughts and questions:

Google is now reaching back into your previous search history and presumably choosing a better previous search if the current one is not sufficiently monetizable.

Questions:

• What criteria are used to reach back into a user's history? A boolean commercial/non-commercial test that selects the last commercial search, versus choosing from the last N searches with some selection algorithm (see previous point)?
• Will the reach-back cross a topic boundary or is it only to enhance context for an ambiguous search?
• What effect will this have on the Google Keyword Tool that helps advertisers forecast demand and price for a keyword? The volume numbers must now be adjusted by the amount of time the impressions are shifted to alternate keywords.
• How much will this starve the long tail of searches? Depending on the aggressiveness of the selection, long-tail searches may suffer a decrease in volume for AdWords.
Even the most modest change of merely using recent previous searches that are 'about' the current search to augment the AdWords auction query should have a dramatic effect on the auction process. By definition it expands the number of bidders for a particular query. It may also curtail the effectiveness of arbitrage done by some AdWords buyers who buy ambiguous lower-value keywords as proxies for high-value ones due to user sessions with query reformulations. Why? It should have the effect of driving up prices for the penny keywords if they are sufficiently related to high-value keywords.

It will be interesting to watch what happens. This is likely a non-trivial change to the keyword market.

## Friday, August 28, 2009

### The GPL and scripting languages

I was recently asked by an associate about how the GPL and LGPL impacts scripting languages. The short answer is that it's not clear.

My general take on the LGPL and scripting languages is that if one cut-and-pastes LGPL code A into code B, then code B falls under the LGPL. However, if one script-includes code A into code B (the only real combination mechanism scripting languages allow), then code B does not fall under the LGPL.

php: include 'Foo.php';
perl: require Foo::Bar;
python: import Foo
ruby: require "foo"
javascript: <script type="text/javascript" src="external.js"></script>

This differs from the application of the LGPL to compiled languages like C/C++, where the programmer decides how the machine code is combined: statically versus dynamically linked. In Java, if one has an LGPL jar file, the code can be used in the standard way. If your code is combined with the contents of the LGPL jar file into a new jar, then your code falls under the LGPL.

What does the scripting language really do under the hood? It's different in each language. Some may physically include the file and parse-interpret it as one block of code; others may parse-interpret separately and resolve references between them much like a dynamic linker. Do the details matter? Unknown, yet one could reasonably assume the author of a piece of script code released under the LGPL intended script-include style uses to be OK. If in doubt check with the author and save that email.

What about the GPL? That's more clear: no mixing of any kind can be done. One could argue that interpreted script is not linked in the way that compiled languages are... yet conservatively this is a "thin reed" to stand on.

What does this mean about the myriad of GPL javascript out there intermingling with non-GPL and proprietary javascript within browsers visiting script heavy websites everywhere? It's a dog's breakfast of legal issues.

The post Twenty questions about the GPL is pretty informative of the issues, worth a read... deeper than the above.

## Thursday, August 20, 2009

### What can we learn from the Google File System?

I found this via my twitter herd and I've been thinking about it for days. Kirk McKusick interviews Sean Quinlan about the Google File System. Fascinating stuff.

We're very fortunate to have storage scalability challenges ourselves at 'Undisclosed' (formerly OthersOnline). We're amassing mountains of modest chunks of information across many hundreds of millions of keys. We've evolved thusly:

1. MySQL table with single key and many values, one row per value.
2. MySQL key/value table with one value per key
3. memcachedb - memcached backed by Berkeley DB
4. Nginx + Webdav system
5. Our own Sam Tingleff's valkyrie - Consistent Hashing + Tokyo Tyrant

#5 is the best performing, yet we still aren't going to escape the unpredictable I/O performance of EC2 disks. To my understanding this serves a role much like the chunk servers of GFS. We need low-latency read access to storage on one side, and high-throughput write access on the other side.
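
For reference, the consistent-hashing piece of #5 can be sketched as follows. This is a minimal illustration with virtual nodes, not valkyrie's actual implementation; the server names are made up:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys to servers; adding or removing a server only remaps
    roughly 1/N of the keys instead of reshuffling everything."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas
        self.ring = []          # sorted list of (hash, server)
        for server in servers:
            self.add(server)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server):
        # Each server gets `replicas` virtual nodes to smooth the load.
        for i in range(self.replicas):
            self.ring.append((self._hash("%s:%d" % (server, i)), server))
        self.ring.sort()

    def lookup(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h,))
        return self.ring[idx % len(self.ring)][1]
```

A key always lands on the same Tokyo Tyrant node until membership changes, which is exactly the property the front-end needs.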

Combining the insights from the above interview with the excellent Pathologies of Big Data, I'm left with the impression that one must absolutely limit the number of disk seeks and do some 'perfect' number of I/O ops against X MB chunks.

Random questions:
• What is X for some random hardware in the cloud? How do we find it?
• What if X changes as the disk gets full?
• What kind of mapping function do we send the application key into to return a storage key that gets the best amount of sequential disk access on write and the best memory caching of chunks on read?
• What about fragmentation?
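
One empirical way to hunt for X is to time sequential writes of the same total volume at varying chunk sizes. A rough benchmarking sketch; the file path and sizes are placeholders:

```python
import os
import time

def time_chunked_writes(path, total_mb=64, chunk_mbs=(1, 4, 16, 64)):
    """Time writing the same total volume at different chunk sizes."""
    results = {}
    payload_total = total_mb * 1024 * 1024
    for chunk_mb in chunk_mbs:
        chunk = b"x" * (chunk_mb * 1024 * 1024)
        start = time.time()
        with open(path, "wb") as f:
            for _ in range(payload_total // len(chunk)):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())       # force it to disk, not the page cache
        results[chunk_mb] = time.time() - start
    os.remove(path)
    return results
```

Run it a few times across instances and at different disk-fullness levels; on EC2 expect the numbers to wander, which is rather the point.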

It does seem as though the newish adage is very true.. web 2.0 is mostly about exposing old-school unix commands and system calls over HTTP. I keep thinking this must have been solved a dozen times before. The cycbuf feature of usenet servers?

## Sunday, August 02, 2009

### Response to Dr Lance Fortnow's CACM opinion

Dr. Lance Fortnow published a strong argument against the current 'strong conference' CS system in the August issue of CACM. His essential desire is that CS move to a system similar to the hard sciences and engineering: conferences should accept any paper of reasonable quality, not publish proceedings, and hold Journals as the sole vehicle for publication.

My Questions:

1. While CS should have a stronger Journal system, why should that come at the expense of quality conferences?

2. The reputation of CS conferences as a method of publication means that it is both acceptable to publish and to cite papers from these confs. The result is that one can read a conf paper, compose a citing follow-up and publish it within a year. This fluidity of ideas, without the sometimes 18-month to 2-year wait on Journal publication, is a great advantage!

Yes, other disciplines publish pre-prints to arXiv.org.. but is this really a solution when a paper has been rejected yet remains available on arXiv.org for 18 months?

3. Is this problem really that acute in all communities? Can't it be solved at the community level?

4. Certainly in AI, Machine Learning, Data Mining and Evolutionary Computation I perceive that the Journals are held in high regard (the papers are uniformly strong) and that the conference proceedings can be a mixed bag with strong and weak papers. I am getting strong pressure from my advisors and peers to consider Journal versions of some of my conf papers. EC specifically has a non-proceedings conference to meet and discuss less-settled results.

If Dr. Fortnow feels that his area of theoretical computer science is too fragmented, then a solution would be to found more Journals and push the best conference papers into those Journals more heavily. Perhaps that would mean that the conference system of that area would shrink in response as the Journals established themselves.

5. Why would industrial researchers and scientists participate in publications with such long time cycles?

6. Again the fluidity is an advantage. Were CS suddenly to switch to a soft conference system (low benefit to participation other than networking), I fear that industry participation in publication venues would suffer. The time scales of Journals mean that at publication time the results were done an eternity ago in industry terms. Convincing your supervisor to support participation and publication in a conference is a far easier sell than an extended Journal submission effort.

One also wonders if the Journal system is not implicitly biased towards academic communities where participants are chasing tenure. This 'ranking of people' referenced by Dr. Fortnow is very much for academic institutions and not of much value to industry IMHO.

7. Why do we want ANY such top-down forcing of CS organization?

8. The culture of CS is much more aligned with self-organization and communities forming out of a birds-of-a-feather effect. This also aligns with the changing face of corporate cultures and culture in general. Such a top-down driven reorg would likely both fail and break the inherent fluidity of ideas and results in CS.

I also have an unfounded suspicion that such a top-down forced re-org would result in a clustering of power and influence towards traditional centers of power in academia. If one picks up conference proceedings in my favorite CS areas and does a frequency count of the authors' institutions, the distribution is very much long-tail. The 'elite' universities are not dominating the results, meaning that the 'in crowd' effect is much lesser in CS.

Feudal systems are dying fast for a reason.

9. Obviously the current system is serving a need, doesn't that speak for itself?

10. If CS researchers and scientists continue to attend, publish at and found conferences is this not evidence that it is serving a real need?

While Dr. Fortnow is correct on his points w.r.t. the problems faced by conference 'PC' committees... the correct response is best made within that community: found some Journals and compete with the conferences for publications and reputation. I don't accept that a strong Journal system can only be created by first wiping away a very fluid and successful conference system.

My personal solution to strengthening the Journal system? I'll set a goal of submitting a Journal publication or two in the next year. I am completely remiss in not yet submitting anything to a Journal.

As a counterpoint to the above arguments.. if I could have a wish, it would be a Journal system with fast review times, immediate publication of accepted papers, and that markets itself by cherry-picking great papers from conferences and encouraging those authors to submit to the Journal. Something like the Journal of Machine Learning Research. The publishing of referee reviews also sounds interesting.

## Tuesday, July 14, 2009

### Hacking Apache's mod_proxy_http to enforce an SLA

Following up on the last post about HTTP SLAs, let's say you have a web-service exposing ReST APIs for your awesome data miner/processor. It has data input/output APIs of various kinds. The software architecture consists of front-end apache servers and back-end tomcat plus various data stores. Apache's mod_proxy and some load balancer (HAProxy, mod_proxy_balancer) pushes the incoming requests to backend servers.

A client wants a guarantee that your APIs will accept requests and return valid data and response codes within XXms for 95% of requests (see Wikipedia's SLA for other examples of service guarantees). How can one be absolutely sure that the SLA is met? Now add in the wrinkle that there might be different SLAs for the various APIs. In addition, the SLA could specify that as close to 100% as possible of the requests return HTTP codes within the 2xx range.. suppressing any 3xx, 4xx or 5xx codes from coming back to the outside world.

The issues with making apache do this are as follows:
• ProxyTimeout is global or scoped to the particular vhost
• ErrorDocuments still return the error code (503, 404, etc)
• No way to tie ErrorDocuments and ProxyTimeouts to particular requests.

A key insight from Ronald Park is to use mod_rewrite to pass various environment arguments to mod_proxy that are specific to the URL matched by mod_rewrite. This was the approach he took to solve this problem in apache 2.0.x here and here.

The below example is a rewrite rule that makes no changes to the URL itself for a JS API presumably returning data in JSON.
    RewriteRule ^/api/(.*).js\?*(.*)$ http://backendproxy/api/$1.js?$2 [P,QSA,E=proxy-timeout:900ms,E=error-suppress:true,E=error-headers:/errorapi.js.HTTP,E=error-document:/errorapi.js]

With the SLA enforcement modifications enabled, the URL will return data from the backend system within 900ms or a timeout occurs. At that point apache stops waiting for the backend response and serves back the static files /errorapi.js.HTTP as HTTP headers and /errorapi.js as the content body:

    $ cat /var/www/html/errorapi.js.HTTP
    Status: 204
    Content-type: application/javascript

    $ cat /var/www/html/errorapi.js
    var xxx_api_data={data:[]}; /* ERR */

There are four environment variables the SLA hack looks for:

• proxy-timeout: time in seconds or milliseconds to wait before timing out
• error-suppress: true/false switch to suppress all non-2xx errors from the backend
• error-headers: file of syntactically correct HTTP headers to return to the client
• error-document: file of content body to be returned to the client

Leaving off proxy-timeout will only suppress errors from the backend after the global timeout occurs. Leaving off error-suppress:true will ensure that the 5xx timeout error from mod_proxy_http is returned intact to the client.

Source code here. There are two versions checked into github, for Ubuntu 9.04's apache2 2.2.11 and CentOS el5.2's httpd 2.2.3-11. It's advisable to diff the changes with the 'stock' file and likely re-do the hack code in your version of apache 2.2. See Ron Park's code for 2.0.x and fold in the other mods supporting error-suppress etc.

The hack is being tested in a production environment, stay tuned. This will get posted to the apache-dev list.. hopefully with responses suggesting improvements.

Update for 2011: This has handled billions of requests per month at this point and works great. No issues.

## Friday, July 03, 2009

### HTTP request with SLA

Here's a (non-AI) problem I'd like to solve.
Configure apache to receive a request and proxy/forward it off to a backend app server (tomcat).. wait a specified period of time... if no response is received, send back a static file or a return code like 204. Essentially it's a combination of the below mod_rewrite and mod_alias directives with a timeout. The example below uses solr without loss of generality:

    RedirectMatch 204 /search/(.*)$
    RewriteRule ^/search/(.*)$ http://backend:8080/solr/$1 [P]

Logic below:

    If url matches rewrite-rule regex then
    {
        set timer for 500ms with timer_callback()
        force proxy to second url after rewriting it
        if(response from proxy is > 199 and < 300)
            return response
        else
            return 204 or static default file
    }

    timer_callback()
    {
        return 204 or static default file
    }

Instead of returning 204 one could also serve back a static file like /noresults.xml

The general idea is to expose a url that has a near-guaranteed response time limit (assuming apache is alive) where a 204 or a static default is acceptable behaviour. I suspect that we'll need to write an apache module to do this, yet surely this question has been asked and solved before!
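
Outside of apache, the desired behaviour is easy to express. The sketch below shows the logic in plain Python (the backend URL and timeout are placeholders), which is roughly what an apache module would need to do internally:

```python
import socket
import urllib.error
import urllib.request

DEFAULT_BODY = b'{"data":[]}'  # static fallback, akin to /noresults.xml

def fetch_with_sla(url, timeout_ms=500):
    """Proxy a request to a backend; fall back to a 204-style default on
    timeout, connection failure, or any non-2xx response."""
    try:
        resp = urllib.request.urlopen(url, timeout=timeout_ms / 1000.0)
        body = resp.read()
        if 200 <= resp.status < 300:
            return resp.status, body
        return 204, DEFAULT_BODY  # suppress 3xx/4xx/5xx from the backend
    except (urllib.error.URLError, socket.timeout):
        return 204, DEFAULT_BODY  # backend slow or down
```

The client always gets a well-formed answer within roughly the timeout, which is the whole point of the SLA.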

## Saturday, March 07, 2009

Very interesting 'Legally Speaking' article in the March 2009 CACM, When is a "license" really a sale? by Pamela Samuelson. She covers the principle of 'first sale' and the recent decisions in UMG v. Augusto and Vernor v. Autodesk. Essentially both Augusto and Vernor were selling used CDs and software on eBay, and the copyright owners of the products went after them for these sales. Both parties are using the first-sale doctrine to defend their rights to sell the media. The copyright owners are trying to use contract law and the licenses to restrict transfer of the physical media and the right to use it. Samuelson predicts that both will end up in the 9th Circuit appeals court.

## Wednesday, February 18, 2009

### TunkRank Scoring Improvement

Recently Daniel Tunkelang blogged about an influence rank for Twitter. Jason Adams took up the implementation challenge and coined it TunkRank.

$Influence(X) = \sum_{Y \in Followers(X)} \left( 1 + \frac{p \cdot Influence(Y)}{\left\| Following(Y) \right\|} \right)$

After some poking at it, I'm suggesting a scoring improvement. At the moment the primary rank is a percentile in the UI; however, the raw score is given as well. I checked a few users, put together the table below, and it feels wrong. It saturates at 100 too quickly and there is not enough differentiation between people with healthy versus massive influence.

Why 'feel'? Human interpretable numbers need a tactile sense to them in my opinion. One critique of the metric system is that the English system just feels more human compatible, an inch is not too small, a foot is reasonable, a mile is a long way and 100 miles per hour is darn fast.

I'm proposing two new scoring possibilities. Both are based upon logarithms and span from 1 to 100. The slight difference between them is how 'linear' the resulting rank feels across the accounts I compared.
1. LEAST(100,ROUND(POWER(LN( tunkrank-raw-score +1);1.82)))

$TunkRank(X) = ln(Influence(X) + 1)^{1.82}$

2. LEAST(100,ROUND(LN( tunkrank-raw-score+1)/LN(3.5) * 10))

$TunkRank(X) = 10 * ln(Influence(X) + 1)/ln(3.5)$

What are the constants? They are magic numbers chosen to map Barack Obama to a TunkRank of 100 and provide an interesting spread between the test accounts below. Comments welcome! Which is my choice? I can't decide.. #1 smells more accurate, #2 tastes more natural.

Yes this is an inexact science.
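
Both proposed scores translate directly from the SQL expressions above into a few lines of Python:

```python
import math

def tunkrank_score_1(raw):
    # LEAST(100, ROUND(POWER(LN(raw + 1), 1.82)))
    return min(100, round(math.log(raw + 1) ** 1.82))

def tunkrank_score_2(raw):
    # LEAST(100, ROUND(LN(raw + 1) / LN(3.5) * 10))
    return min(100, round(math.log(raw + 1) / math.log(3.5) * 10))
```

Feeding in the raw scores from the table below reproduces the NEW SCORE columns, e.g. abdur's raw score of 1351 maps to 36 and 58.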

Possible TunkRank bug? Check out dewitt's rank.. it looks off given his number of followers and that he's an influential guy from Google.

    NAME         PERCENTILE  RAW SCORE  NEW SCORE #1  NEW SCORE #2
    BarackObama         100     277770           100           100
    wilw                100      79118            82            90
    guykawasaki         100      62543            79            88
    JasonCalaca         100      59075            78            88
    THErealDVOR         100      43207            74            85
    anamariecox         100      38177            73            84
    WilliamShat         100      13932            61            76
    fredwilson          100      13340            60            76
    abdur               100       1351            36            58
    johnhcook            99        407            26            48
    johndcook            94         61            13            33
    gutelius             84         20             8            24
    nealrichter          81         16             7            23
    ealdent              80         16             7            23
    dtunkelang           79         15             6            22
    dewitt                1          2             1             9

## Tuesday, January 27, 2009

### response to Noisy Channel post on Lucid Imagination

Cross posted to my blog since it's a long response ;-)

To: Daniel Tunkelang

RE: Noisy Channel blog post on Lucid Imagination

I don’t think it’s their aim to compete with Enterprise search directly (business suicide), though I suspect they might pressure the pricing in the mid-market of search. The small market has been mostly eliminated by Google/Yahoo/MSN site search and open source engines.

Note also that they do not seem to (yet) provide support for Nutch or Droids.. meaning that they are missing a spidering/crawling engine. Same with Tika (office document support). Search result clustering may be coming soon via SOLR-769. No content-management or versioning. (These are fixable pieces given all the open source out there)

There is no good native support for rich taxonomies in Solr/Lucene, nor is there native support for some of the interesting semantic-web data driven features. No self-learning or auto-personalization of results. No analytics (though one could go elsewhere for that).

Lucid is also not offering a hosted Solr service .. so they are not an SaaS play either.

All that said, they obviously have some huge wins within the software industry.. but it’s a tough road to go after accounts like Home Depot, Albertson’s, or the government entities.

Enterprise search is mostly about finished feature sets and a near full admin GUI for non-programmers. The question is in these lean economic times if a given customer considering “build versus buy” is willing to risk starting a professional services engagement to build what they want for cheap, versus purchase a commercial ES product with way more features than they think they need.

I do think that a smart customer will have new leverage during the sales cycle to credibly threaten the 'build' option and get the 'buy' price down. And Lucid certainly should curtail the ability of the ES companies to get a customer bought in and then milk them for professional services, integration and customization fees... Lucid provides a credible switching threat to cut bait and start over.

Google, Yahoo and open source projects like Lucene have commoditized basic search, so ES is about value-added features, innovative R&D and taking away customer pain and complexity.

Some of the people in Lucid have big plans (Grant Ingersoll comes to mind), and there is absolutely no question that Lucene has made some search vendors look like dinosaurs with slow engines and archaic index structures.

It will be some time before open source catches up to ES.. but it just might not be as long as some would hope.

Disclaimer: The above is my opinion and some fact-looking statements might be wrong.. so Lucene guys jump in!

## Monday, January 26, 2009

### Lucid Imagination and Sematext

Kudos to the Solr/Lucene gang for launching Lucid Imagination. Grant Ingersoll's announcement. People involved here and here. Some time ago Otis Gospodnetić launched Sematext. Good luck to Lucid and Sematext!

Both of these companies are in the 'support and consulting' model. This is wise: going into Enterprise search directly and competing with Endeca, Verity (Autonomy), FAST (Microsoft), the GoogleBox and the other vendors would be suicidal.

Aside:
Long ago (2003) I thought of hanging up a shingle supporting HtDig (a once popular CGI based search engine), but wisely decided that would be a mistake given that even then I could see that Doug Cutting's Java Lucene and Nutch were going to smoke the creaky 8+ year old C++ indexing kernel. I ended up getting RightNow Tech to sponsor conversion of the guts to CLucene, where it still runs today indexing many tens of millions of documents. Then Solr was announced... and HtDig development died and I started using Solr.

Just touched base with Geoff Hutchinson the other day and we're going to release the 4.0 CLucene branch of HtDig, and put up an announcement of HtDig end-of-life and encourage people to migrate to Solr.

### Text Classification with Solr

Starting to look at using Solr/Lucene for text mining. Between OpenNLP, the Python Natural Language Toolkit and various other projects it's time to toss my ad-hoc mishmash of tools and start over.

Looks like Grant Ingersoll is working on similar things in his Taming Text project. This is a nice beginner's overview of the area as Grant sees it, Search and Text Analysis PPT. Also looks like others are scheming about blending Mahout and Solr in some future version.

The basic idea is to take an ontology/taxonomy like Dmoz or FreeBase of {label: "X", tags: "a,b,c,d,e"}, index it, and then classify documents into the taxonomy by pushing the parsed document into the Solr search API. Why? Lucene/Solr's ability to do weighted term boosting at both search and index time has lots of obvious uses here.

Now that my readership (by data-mining and semantic-web geeks) is up slightly (ie above zero!) due to Twitter traffic, I'm hoping people contact me with ideas, code, etc. Heh.

Initial ideas:
• Use More-Like-This code to 'pass in' a term vector without storing it
• Write Solr plugin to execute search and post-process hits and do any outgoing classification and biasing math.
Once this is proven out, the obvious next step is to figure out how to index the various RDF/OWL datasets out there. Much of this has probably been done before; I just need to find it, examine the merits and do some LEGO-style layering to get a prototype up.
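
The classify-by-search idea can be mocked up without Solr at all. The toy sketch below scores taxonomy nodes by weighted term overlap, a stand-in for what the real Solr query with term boosts would do; the labels and tags are made up:

```python
from collections import Counter

# Hypothetical taxonomy rows in the {label, tags} shape described above.
TAXONOMY = [
    {"label": "Sports/Baseball", "tags": "baseball,pitcher,inning,homerun,bat"},
    {"label": "Tech/Search", "tags": "lucene,solr,index,query,ranking"},
]

def classify(document_text, taxonomy=TAXONOMY, boosts=None):
    """Rank taxonomy labels by (optionally boosted) tag overlap with the doc."""
    boosts = boosts or {}
    doc_terms = Counter(document_text.lower().split())
    scores = []
    for node in taxonomy:
        tags = node["tags"].split(",")
        # Each matching tag contributes its document frequency times its boost.
        score = sum(doc_terms[t] * boosts.get(t, 1.0) for t in tags)
        scores.append((score, node["label"]))
    scores.sort(reverse=True)
    return scores
```

Swapping the scoring loop for a Solr query with per-term boosts is the real version of this; the shape of the problem is the same.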

## Friday, January 02, 2009

### New Year's Resolutions and Goals

Here are this year's technical/career goals & resolutions. Most of these are general and many encompass specific current and forward looking work projects. Others are just motivational resolutions.

Goals and Resolutions:
• Turn in Dissertation. It's 75% complete and the rest is all typewriter work. Be done.
• Be a better Numerati. The point of modeling is to predict... so this goal is a lifelong career goal with a new label.
• Practice at Done and Get Things Smart and Teach Yourself Programming in Ten Years
• Make damn sure that OthersOnline.com doesn't have any Fail Whale events (technical or business).