Monday, November 02, 2009

RFP: Bounding memory usage in Tokyo Cabinet and Tokyo Tyrant

I'm soliciting proposals to implement an absolute (or at least soft) bound on memory usage of TT + TC. The reward is cash. Send me a proposal. The patch needs to work, not crash TT/TC or corrupt data.

Here's a start: http://github.com/nealrichter/tokyotyrant_rsshack

I've attempted to do this myself and have not had the time to finish or fully test it. I've asked Mikio for feedback/help finishing this and he's been nearly silent on the request.

At the moment we (myself, Sam Tingleff and Mike Dierken) work around this issue by continuing to play with various parameter TC settings and restarting the daemon when the memory usage grows beyond a comfort level.

I'm a C coder and have hacked the internals of BerkeleyDB in the past, so can help review code, trade ideas, etc. We (as a team) don't have the time to work on this at the moment.

If you are interested contact me! We've got a few other ideas for TT enhancements as well...

Posted via email from nealrichter's posterous

Wednesday, October 28, 2009

Current Computational Advertising Course and Events

Andrei Broder and Vanja Josifovski of Yahoo Research Labs are offering a Fall 2009 course on Computational Advertising at Stanford. The lecture #1-#4 slides are up and it looks to be an interesting course. Will definitely continue to follow along remotely.

National Institute of Statistical Sciences is holding a workshop on CompAdvert in early November. The upcoming WINE'09 conference in Rome contains a few accepted papers in CompAdvert. SigKDD 2010 mentions it in the CFP..

Luckily the search engine results on 'Computational Advertising are still free of noise. Google, Yahoo, Bing

Prior Events:

Posted via email from nealrichter's posterous

Friday, September 18, 2009

References for mining from streaming data

While reading a lower quality paper on the subject I found these references worth tracking down. Essentially the idea is that you can make one-pass through the data and must produce statistics of the data that are approximate in nature, ideally with bounded approximation error.

Gibbons et al 1997: Fast Incremental Maintenance of Approximate Histograms
Incremental maintenance of histograms primarily for database query planners

Manku and Motwani 2002 Approximate frequency counts over data streams.
They show algorithms called sticky sampling and lossy counting with proven error bounds.

Zhu et al. 2002 StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time
Basic stats and high correlations between all pairs of data streams

Cormode and Muthukrishnan's 2003 What's Hot and What's Not: Tracking Most Frequent Items Dynamically
Introduced groupTest probabilistic Monte Carlo algorithm for frequent item tracking

Metwally et al 2005 Efficient Computation of Frequent and Top-k Elements in Data Streams
Uses counters to monitor data streams with a new stream-summary data structure

Cheng et al 2005 Time-Decaying Bloom Filters for Data Streams with Skewed Distributions
Dampened Bloom-filters for frequent items

Three papers on frequent itemsets (different than frequent items):

Jiang and Gruenwald have a pair of 2006 papers
Research Issues in Data Stream Association Rule Mining
Survey paper of issues and previous results

CFI-Stream: Mining Closed Frequent Itemsets in Data Streams
New stream based itemset miner

Tsai and Chen 2009 Mining Frequent Itemsets for data streams over Weighted Sliding Windows
Apriori like itemset miner on windows of data with differential weighting

Langford et al. 2008 Sparse Online Learning via Truncated Gradient
Induced sparsity in the weights of online learning algorithms with convex loss functions
  1. (9/19) Corrected the StatStream link - hat tip @dataspora
  2. Added Langford - hat tip @gappy3000

Posted via email from nealrichter's posterous

Thursday, September 17, 2009

Others Online acquired by the Rubicon Project

I'm thrilled to say that Others Online has been scooped up by the Rubicon Project. Press release here. I'm joining as a Data Scientist. Jordan authored a wrap-up post here: http://blog.othersonline.com/

Update: Nice Forbes article on the opportunity in front of us.
For Advertisers Drowning In Data, A Lifeguard

Posted via email from nealrichter's posterous

Monday, September 14, 2009

Google AdWords now personalized

Hat Tip: Found via Greg Linden's blog: Google AdWords now personalized. Below are my thoughts and questions:

Google is now reaching back into your previous search history and presumably choosing a better previous search if the current one is not sufficiently monetizable.

Questions:

  • Is the goal of Google to increase the fill-rate of ads or to show more valuable ads in general?
  • What criteria is used to reach back into a user's history? Boolean commercial/non-commercial then select last commercial search versus choosing based upon some selection algorithm from the last N searches (see previous point).
  • Will the reach-back cross a topic boundary or is it only to enhance context for an ambiguous search?
  • What effect will this have on the Google Keyword Tool that helps advertisers forecast demand and price for a keyword? The volume numbers must now be adjusted by the amount of time the impressions are shifted to alternate keywords.
  • How much will this starve the long-tail of searches? Depending on the aggressiveness of the selection then long-tail searches may suffer a decrease in volume for adwords.
Even the most modest change of merely using recent previous searches only 'about' the current search to augment the adwords auction query should have a dramatic effect on the auction process. By definition it expands the number of bidders for a particular query. It may also curtail the effectiveness of arbitrage done by some adwords buyers who buy ambiguous lower value keywords as proxies for high value ones due to user sessions with query reformulations. Why? It should have the effect of driving up prices for the penny keywords if they are sufficiently related to high value keywords.

It will be interesting to watch what happens. This is likely not a non-trivial change in the keyword market.

Posted via email from nealrichter's posterous

Friday, August 28, 2009

The GPL and scripting languages

I was recently asked by an associate about how the GPL and LGPL impacts scripting languages. The short answer is that it's not clear.

My general take on the LGPL and scripting languages is that if one cut-and-pastes LGPL code A into code B, then code B falls under the LGPL. However if one script-includes code A into code B in the only real mechanism scripting languages allow then code B does not fall under the LGPL.

php: include 'Foo.php';
perl: require Foo::Bar;
python: import Foo
ruby: require "foo"
javascript: <script type="text/javascript" src="external.js"></script>


This differs from the application of the LGPL to compiled languages like C/C++ as the programmer decides how the machine code is combined: static versus dynamically linked. In Java if one has an LGPL jar file, the the code can be used in the standard way. If your code is combined with the contents of the LGPL jar file in a new jar, then your code falls into the LGPL.

What does the scripting language really do under the hood? It's different in each language. Some may physically include the file and parse-interpret it as one block of code, others may parse-interpret separately and resolve references between them much like a dynamic linker. Do the details matter? Unknown, yet one could reasonably assume the author of a piece of script code under the LGPL must have intended others to make script-include style uses OK. If in doubt check with the author and save that email.

What about the GPL? More clear.. no mixing of any kind can be done. One could argue that interpreted script is not linked in the way that compiled languages are... yet conservatively this is a "thin reed" to stand on.

What does this mean about the myriad of GPL javascript out there intermingling with non-GPL and proprietary javascript within browsers visiting script heavy websites everywhere? It's a dog's breakfast of legal issues.

The post Twenty questions about the GPL is pretty informative of the issues, worth a read... deeper than the above.

Posted via email from nealrichter's posterous

Thursday, August 20, 2009

What can we learn from the Google File System?

I found this via my twitter herd and I've been thinking about it for days. Kirk McKusick interviews Sean Quinlan about the Google File System. Fascinating stuff.

We're very fortunate to have storage scalability challenges ourselves at 'Undisclosed' (formerly OthersOnline). We're amassing mountains of modest chunks of information via a set of many many hundreds of millions of keys. We've evolved thusly:

  1. MySQL table with single key and many values, one row per value.
  2. MySQL key/value table with one value per key
  3. memcachedb - memached backed by Berkeley DB
  4. Nginx + Webdav system
  5. Our own Sam Tingleff's valkyrie - Consistent Hashing + Tokyo Tyrant
#5 is the best performing, yet we still aren't going to escape the unpredictable I/O performance of EC2 disks. To my understanding this serves a role much like the chunk servers of GFS. We need low latency read access to storage on one side, and high throughput write access on the other side.

Combining the insights with the above interview and the excellent Pathologies of Big Data, I'm left with the impression that one must absolutely limit the number of disk seeks and do some 'perfect' number of I/O ops against of X MB chunks.

Random questions:
What is X for some random hardware in the cloud? How do we find it? What if X changes as the disk gets full? What kind of mapping function do we send the application key into to return a storage key to get the best amount of sequential disk access on write and best memory caching of chunks on read? What about fragmentation?

It does seem as though the newish adage is very true.. web 2.0 is mostly about exposing old-school unix commands and system calls over HTTP. I keep thinking this must have been solved a dozen times before. cycbuf feature of usenet servers?

Posted via email from nealrichter's posterous