Gibbons et al. 1997: Fast Incremental Maintenance of Approximate Histograms
Incremental maintenance of histograms, primarily for database query planners
Manku and Motwani 2002: Approximate Frequency Counts over Data Streams
Presents two algorithms, sticky sampling and lossy counting, with proven error bounds.
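A rough sketch of the lossy counting idea, with my own variable names and a plain dict standing in for the paper's bucketed structure: with error parameter eps, every item occurring more than eps*N times is reported, and no tracked count undercounts by more than eps*N.

```python
# Minimal lossy-counting sketch (after Manku & Motwani); names are illustrative.

def lossy_count(stream, eps):
    bucket_width = int(1 / eps)   # items per bucket
    counts = {}                   # item -> (count, max possible undercount delta)
    n = 0
    for item in stream:
        n += 1
        if item in counts:
            c, d = counts[item]
            counts[item] = (c + 1, d)
        else:
            # delta: the most this item could have occurred before being tracked
            counts[item] = (1, (n - 1) // bucket_width)
        # at each bucket boundary, prune entries that cannot be frequent
        if n % bucket_width == 0:
            bucket_id = n // bucket_width
            counts = {k: (c, d) for k, (c, d) in counts.items() if c + d > bucket_id}
    return counts, n

def frequent_items(stream, support, eps):
    counts, n = lossy_count(stream, eps)
    # report items whose tracked count clears the support threshold minus slack
    return {k for k, (c, d) in counts.items() if c >= (support - eps) * n}
```

For example, with support 0.5 and eps 0.1 on a stream dominated by one item, only that item is reported, and its tracked count is exact because it was never evicted.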
Zhu and Shasha 2002: StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time
Maintains basic statistics and finds high correlations among all pairs of data streams.
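StatStream's actual trick is approximating correlations from DFT sketches; as a simpler illustration of the incremental-statistics idea, here is a sketch (my own, not from the paper) that maintains running sums so the exact Pearson correlation of one stream pair can be updated in O(1) per arriving value:

```python
# Incremental Pearson correlation via running sums -- an illustrative
# simplification, not StatStream's DFT-sketch method.
import math

class PairCorrelation:
    def __init__(self):
        self.n = self.sx = self.sy = 0.0
        self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y):
        # O(1) update of the six sufficient statistics
        self.n += 1
        self.sx += x; self.sy += y
        self.sxx += x * x; self.syy += y * y
        self.sxy += x * y

    def corr(self):
        cov = self.n * self.sxy - self.sx * self.sy
        vx = self.n * self.sxx - self.sx ** 2
        vy = self.n * self.syy - self.sy ** 2
        return cov / math.sqrt(vx * vy)
```

Doing this naively for all pairs is O(k^2) counters for k streams, which is exactly the cost StatStream's sketches are designed to avoid.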
Cormode and Muthukrishnan 2003: What's Hot and What's Not: Tracking Most Frequent Items Dynamically
Introduces groupTest, a probabilistic Monte Carlo algorithm for tracking frequent items.
Metwally et al. 2005: Efficient Computation of Frequent and Top-k Elements in Data Streams
Monitors data streams with counters organized in a new stream-summary data structure.
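The counter scheme here is the Space-Saving algorithm; a minimal sketch follows. The paper's stream-summary structure gives O(1) updates, whereas this illustration uses a plain dict and pays an O(m) scan to find the minimum counter on eviction:

```python
# Space-Saving sketch (after Metwally et al.): keep at most m counters.

def space_saving(stream, m):
    counters = {}  # item -> estimated count (an overestimate)
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < m:
            counters[item] = 1
        else:
            # evict the item with the smallest count; the newcomer inherits
            # that count + 1, overestimating it by at most the old minimum
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters
```

The key guarantee: any item with true count above the minimum counter is always present, and counts are never underestimated.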
Cheng et al. 2005: Time-Decaying Bloom Filters for Data Streams with Skewed Distributions
Uses dampened (time-decaying) Bloom filters to track frequent items.
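To make the dampening idea concrete, here is a hedged sketch of a counting Bloom filter whose counters are multiplied by a decay factor at fixed intervals, so stale items fade out; the hash scheme and parameters are illustrative, not taken from the paper:

```python
# Illustrative time-decaying counting Bloom filter; not the paper's exact design.
import hashlib

class DecayingBloomFilter:
    def __init__(self, size=1024, num_hashes=3, decay=0.5):
        self.size, self.k, self.decay = size, num_hashes, decay
        self.counters = [0.0] * size

    def _slots(self, item):
        # derive k slot indices from salted SHA-256 digests
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for s in self._slots(item):
            self.counters[s] += 1.0

    def estimate(self, item):
        # the minimum over the item's slots upper-bounds its decayed count
        return min(self.counters[s] for s in self._slots(item))

    def tick(self):
        # apply one decay step, e.g. once per time window
        self.counters = [c * self.decay for c in self.counters]
```

Calling tick() once per window halves every counter, so an item seen 4 times recently estimates at least 4, and at least 2 after one decay step.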
Three papers on frequent itemsets (distinct from frequent items):
Jiang and Gruenwald have a pair of 2006 papers:
Research Issues in Data Stream Association Rule Mining
A survey of open issues and previous results
CFI-Stream: Mining Closed Frequent Itemsets in Data Streams
A new stream-based itemset miner
Tsai and Chen 2009: Mining Frequent Itemsets for Data Streams over Weighted Sliding Windows
An Apriori-like itemset miner over windows of data with differential weighting.
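To illustrate the differential-weighting idea (my own toy formulation, not Tsai and Chen's exact scheme): each transaction in the window carries a weight, with recent transactions weighted more, and an itemset's support is its share of the total weight.

```python
# Weighted support over a sliding window -- illustrative, not the paper's method.

def weighted_support(window, itemset):
    """window: list of (transaction_items, weight) pairs."""
    total = sum(w for _, w in window)
    hit = sum(w for items, w in window if itemset <= items)
    return hit / total

window = [
    ({"milk", "bread"}, 1.0),  # oldest, lowest weight
    ({"milk", "eggs"}, 2.0),
    ({"milk", "bread"}, 3.0),  # newest, highest weight
]
```

Here {"milk", "bread"} gets support 4/6 rather than the unweighted 2/3, and an Apriori-style miner would prune any itemset whose weighted support falls below threshold.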
Langford, Li, and Zhang 2009: Sparse Online Learning via Truncated Gradient
Induces sparsity in the weights of online learning algorithms with convex loss functions.
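A hedged sketch of the truncated-gradient idea from Langford, Li, and Zhang: after each SGD step on a convex loss (squared loss here), weights below a threshold are shrunk toward zero, inducing sparsity. Parameter names and constants are mine:

```python
# Truncated-gradient online learning sketch; hyperparameters are illustrative.

def truncate(w, alpha, theta):
    # shrink small weights toward zero by alpha, clipping at zero
    if 0 < w <= theta:
        return max(0.0, w - alpha)
    if -theta <= w < 0:
        return min(0.0, w + alpha)
    return w

def sgd_truncated(examples, dim, lr=0.1, alpha=0.01, theta=1.0):
    w = [0.0] * dim
    for x, y in examples:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y  # gradient of 0.5 * (pred - y)**2 w.r.t. pred
        w = [truncate(wi - lr * err * xi, alpha, theta)
             for wi, xi in zip(w, x)]
    return w
```

On a stream where only the first feature matters, the first weight converges near its target (slightly shrunk by the truncation) while the irrelevant weight stays exactly zero, which is the point: exact zeros, not merely small values.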
- (9/19) Corrected the StatStream link - hat tip @dataspora
- Added Langford - hat tip @gappy3000
2 comments:
Neal - Thanks for posting! Great to see that there's renewed interest in stream processing algorithms: a necessary tool for handling the floods of Big Data that cheap storage + smart devices have unleashed.
Sampling and efficiency matter when data sizes pass the in-memory barrier.
As an aside, a friend remarked today that, if you're interested in literature of hyper-efficient algorithms for computing statistics in real-time, look at the robotics community.
Yep, it does seem pretty important to have these types of algorithms. MapReduce is great, yet computation costs go up with dataset size. If one can tolerate an approximation for near-real-time stats, with exact calculations done later, I don't see the downside.
An example from my world: counting unique user IDs in a 30-day visitation record. Exactness isn't going to matter much if you can show your approximation is within +/- x%, and that's OK with the pointy-headed bosses. ;-)