
Friday, September 18, 2009

References for mining from streaming data

While reading a lower-quality paper on the subject, I found these references worth tracking down. Essentially the idea is that you get one pass through the data and must produce statistics that are approximate in nature, ideally with bounded approximation error.

Gibbons et al. 1997: Fast Incremental Maintenance of Approximate Histograms
Incremental maintenance of histograms, primarily for database query planners.

Manku and Motwani 2002: Approximate Frequency Counts over Data Streams
They give two algorithms, sticky sampling and lossy counting, with proven error bounds (lossy counting is sketched at the end of this post).

Zhu et al. 2002: StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time
Basic statistics and high correlations between all pairs of data streams.

Cormode and Muthukrishnan 2003: What's Hot and What's Not: Tracking Most Frequent Items Dynamically
Introduces the GroupTest probabilistic Monte Carlo algorithm for frequent-item tracking.

Metwally et al. 2005: Efficient Computation of Frequent and Top-k Elements in Data Streams
Uses counters to monitor data streams with a new stream-summary data structure.

Cheng et al. 2005: Time-Decaying Bloom Filters for Data Streams with Skewed Distributions
Dampened Bloom filters for frequent items.

Three papers on frequent itemsets (different from frequent items):

Jiang and Gruenwald have a pair of 2006 papers:
Research Issues in Data Stream Association Rule Mining
A survey of issues and previous results.

CFI-Stream: Mining Closed Frequent Itemsets in Data Streams
A new stream-based itemset miner.

Tsai and Chen 2009: Mining Frequent Itemsets for Data Streams over Weighted Sliding Windows
An Apriori-like itemset miner on windows of data with differential weighting.

Langford et al. 2008: Sparse Online Learning via Truncated Gradient
Induces sparsity in the weights of online learning algorithms with convex loss functions.
  1. (9/19) Corrected the StatStream link - hat tip @dataspora
  2. Added Langford - hat tip @gappy3000
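
For concreteness, here is a minimal Python sketch of lossy counting as I understand it from the Manku and Motwani paper; the class and parameter names are mine, and this is a reading-notes sketch rather than the authors' implementation.

    import math

    class LossyCounter:
        """One-pass approximate frequency counts (after Manku & Motwani 2002).

        Counts undercount by at most epsilon * n, where n is the number
        of items seen so far -- the paper's error bound.
        """

        def __init__(self, epsilon=0.001):
            self.epsilon = epsilon
            self.width = math.ceil(1.0 / epsilon)   # bucket width
            self.n = 0                              # items seen so far
            self.entries = {}                       # item -> [count, delta]

        def add(self, item):
            self.n += 1
            bucket = math.ceil(self.n / self.width)
            if item in self.entries:
                self.entries[item][0] += 1
            else:
                # delta bounds how many occurrences we may have missed
                self.entries[item] = [1, bucket - 1]
            if self.n % self.width == 0:            # bucket boundary: prune
                self.entries = {k: v for k, v in self.entries.items()
                                if v[0] + v[1] > bucket}

        def frequent(self, support):
            """Items whose true frequency may exceed support * n."""
            threshold = (support - self.epsilon) * self.n
            return {k: v[0] for k, v in self.entries.items() if v[0] >= threshold}

    # Usage sketch:
    #   lc = LossyCounter(epsilon=0.001)
    #   for token in stream: lc.add(token)
    #   heavy_hitters = lc.frequent(support=0.01)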


Friday, February 20, 2009

Scalable data storage for model learning

I'll admit it: sometimes, as an AI & Data wonk, you want all your data structured and easily accessible via slicing and dicing, so SQL databases and relational schemas look like a great idea. And it works for a while... until you're dealing with a torrent of incoming data.

This should not be unfamiliar: the UC-Irvine datasets look like a great place to cut your teeth on Machine Learning... until you realize that many algorithms and software packages written and litmus-tested against such data totally fall down on 'big data'.

This quote from Dave Cancel on Twitter stuck with me: "Databases are the training wheels of software development. Fly free brother, fly free. - University of Sed, Grep, & Awk". My first reaction was: meh, databases are meant for storing lots of data. I love and use tools like those for prototyping... but then I'd move to compiled code + SQL for 'production'.

Mental sea change! Let's say you are building a massive-scale system for absorbing click data, processing it, and turning it into a recommender system. You will hit real problems using MySQL at that scale. Hint: it tips over at around 1K read/write operations per second against the same table.

Don't try to make your read store also be your write store. And you may not /really/ need a low-latency system for model updates.

More tips from Dave:
As for storage of [models], I suggest storing them in text file (format up to you), 1 per profile, then stick them behind a reverse caching proxy like Varnish. Infinite scale. For extra points store the text files on S3 and use the built-in webserver to serve them to your reverse proxy. HTTP is your friend as is REST SOAs.
Here's another dead simple storage mechanism:
http://etherpad.com/9JrpcyvXyK

"simple DB" projects like Voldemort and Tokyo Cabinet and MemcacheDB are options as well.

If you can't depend on a room full of DBAs to keep your SQL DBs from being dog slow (or on buying $$$ systems from Oracle and Microsoft), you must think differently. Pull yourself out of the Math-and-AI mindset and simplify. Big Data will eat you alive otherwise.

Thursday, April 17, 2008

Text Summarization and Advertising

I recently read another CompAdvert paper (CIKM '07) from the Yahoo group, Just-in-Time Contextual Advertising. They describe a system where the current page is scraped on-line via a javascript tag and summarized, and that summary is passed to servers to match against ad listings. Interesting points are:
  • 5% of the page text, carefully chosen, can yield 97+% of the ad-matching relevance of the full text.
  • The best parts of the document are the URL, referring URL, title, meta tags, and headings.
  • Classifying the page against a topical taxonomy further improves the accuracy of the ad matches.
  • They judged ad-matching relevance against human judgments of ad-to-page relevance.
I found these papers within the last few months as OthersOnline focused on behaviorally-based advertising. In many ways their findings are interesting, affirming, and unsurprising. Interesting in that they are pushing the state of the art in advert matching; affirming in that we @ OO are on the right track. Unsurprising in that using the document fields above is the classic approach to indexing webpages and documents.
Of course internet search engines have used this for years (it defines the SEO industry's eco-system), and the old/retired open-source engine HtDig has given special treatment to those fields since the late 90s. The difference now is the direction: the documents are the "query" and the hits are the ads. The best part about the method is that it's cheap... javascript + the browser becomes your distributed spider and summarizer of the web.
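
As a toy illustration of that kind of field-based summarization (my own sketch using only Python's stdlib parser, not Yahoo's system; the field choices echo the paper's list above):

    from html.parser import HTMLParser

    class PageSummarizer(HTMLParser):
        """Collect the high-signal fields: title, meta description/keywords, headings."""

        HEADINGS = {"h1", "h2", "h3"}

        def __init__(self):
            super().__init__()
            self._capture = None
            self.fields = {"title": [], "meta": [], "headings": []}

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "title":
                self._capture = "title"
            elif tag in self.HEADINGS:
                self._capture = "headings"
            elif tag == "meta" and attrs.get("name") in ("description", "keywords"):
                self.fields["meta"].append(attrs.get("content", ""))

        def handle_endtag(self, tag):
            if tag == "title" or tag in self.HEADINGS:
                self._capture = None

        def handle_data(self, data):
            if self._capture and data.strip():
                self.fields[self._capture].append(data.strip())

    def ad_query(html, url=""):
        """Build the short 'query' that gets matched against ad listings."""
        p = PageSummarizer()
        p.feed(html)
        # The URL carries signal too; tokenize it crudely.
        url_terms = [t for t in url.replace("/", " ").replace("-", " ").split()
                     if t.isalpha()]
        return " ".join(url_terms + sum(p.fields.values(), []))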
I do love finding these papers... we just don't have the time or resources to run a study like this and confirm our approach with a paid human-factors study. We go forward on educated gut feel day to day, and the human measure is whether we get clicks on the ads.
This approach is similar to the one we outlined and implemented before finding this paper. The difference is what we do with the resulting "query": we use the signal to learn a predictive interest model of users.
Still no mention of any relative treatment of words within the same field... one would assume this would move the needle on relevance as well.
I still believe that this type of summarization approach can be used to make an implicit page tagger and social recommender like del.icio.us... if you can filter the summary based upon some knowledge of the user's real (as opposed to statistical) interests. It is a key route to auto-personalization of the web.

Friday, April 04, 2008

Itemset mining in data streams

We covered a nice paper by de Graaf et al. of Leiden University in the AI colloquium: Clustering Co-occurrences of Maximal Frequent Patterns in Streams. It deals with the problem of finding frequent itemsets in a data stream. Ideally you want an incremental algorithm for this, with an 'updatable model', rather than being forced to reprocess the entire data/transaction sequence when adding new data. The paper's approach has the extra benefit that the itemsets are clustered by similarity as well. I really enjoy using and learning about algorithms that have nice side effects.

A rough overview of the algorithm: it performs three basic operations on incoming data. First, it builds a support model of the patterns encountered. It does this with a reinforcement-and-decay technique, reinforcing support for patterns that appear in the current transaction and decaying those that don't. Second, it maintains a geometric organization of the itemsets according to a distance metric in a boxed 2-D area. As new data is processed, each itemset's (x,y) coordinates in the box shift according to its similarity with other patterns. Third, it uses a pattern merging/splitting mechanism to derive new patterns for the model to track; new patterns get a random (x,y) position.
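
A minimal sketch of the first operation, the reinforce/decay support model, as I read it; the constants and the pruning rule are my simplifications, and the geometric attraction step is omitted:

    import random

    REINFORCE = 1.0    # support added when a pattern occurs (my constant, not the paper's)
    DECAY     = 0.95   # multiplicative decay for patterns absent from the transaction
    PRUNE_AT  = 0.01   # forget patterns whose support dwindles below this

    support   = {}     # frozenset itemset -> support score
    positions = {}     # frozenset itemset -> (x, y) in the unit box

    def process(transaction, tracked_patterns):
        """One stream step: reinforce patterns present in the transaction, decay the rest."""
        present = {p for p in tracked_patterns if p <= transaction}
        for p in present:
            support[p] = support.get(p, 0.0) + REINFORCE
            if p not in positions:                  # new pattern: random placement
                positions[p] = (random.random(), random.random())
        for p in list(support):
            if p not in present:
                support[p] *= DECAY
                if support[p] < PRUNE_AT:           # forgotten
                    del support[p]
                    positions.pop(p, None)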

When processing stops, you are left with a list of itemsets and their support/frequency, as well as a nice grouping by similarity.

One advantage of their presentation is that it is stripped of all excess complexity. They note clearly that the algorithm learns an approximation of what you would get from a full-data scan with a traditional itemset miner. Fine with me... I don't get hung up on exactness, and I have lots of faith that incremental model building works well in practice.

The minor flaw of the paper is that they fail to point out (or notice?) that what they have built is a hybrid of Swarm Intelligence and a Self-Organizing Map. The Swarm/Ant portion comes from the reinforcement & decay of the support model, and the SOM from the geometric clustering of the itemsets. One could duplicate this algorithm in spirit by implementing Ant System + SOM with merging/splitting for new pattern production. By 'Ant System' here I mean the spirit of an ant system, where you use pheromone reinforcement and decay on a model; actual ants traversing a path in a graph are not necessary. The cells in the SOM would contain a sparse vector of itemsets and apply the standard update rules.

Yet, even as I see the connection, this is a pointy-headed comment. The paper is nice in that the algorithm is presented without flourish in a straightforward way... sometimes using the word 'hybrid' and casting your work that way is just a form of academic buzzword bingo.

I'll definitely look to implement something similar to this ASAP. I may skip the merging/splitting and instead run a standard itemset miner offline over the 'last X transactions' to form an itemset pattern dictionary. Only itemsets in the dictionary would be tracked and clustered against the data stream, and every so often the offline algorithm would run again to learn new patterns and support to merge into the dictionary.

Friday, March 28, 2008

Frequent Pattern Mining

In the weekly RightNow AI Colloquium @ MSU CS, we read a paper by Jiawei Han et al. called
Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach.
Basically the problem is this: given a sequence of transactions with multiple items per transaction, what are the frequent itemsets? Itemsets are groups of m items that tend to be purchased together. The earlier SQL-based version of FP-Tree looks interesting as well.
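
To make the problem concrete, here is a naive enumeration in Python; this is exactly the exponential counting FP-Tree is built to avoid, so it is only usable on toy data:

    from itertools import combinations
    from collections import Counter

    def frequent_itemsets(transactions, min_support, max_size=3):
        """Count every itemset up to max_size directly (naive; not FP-Growth)."""
        counts = Counter()
        for t in transactions:
            for k in range(1, max_size + 1):
                for itemset in combinations(sorted(t), k):
                    counts[itemset] += 1
        n = len(transactions)
        return {s: c for s, c in counts.items() if c / n >= min_support}

    baskets = [{"beer", "chips"}, {"beer", "chips", "salsa"}, {"beer", "diapers"}]
    print(frequent_itemsets(baskets, min_support=2/3))
    # -> {('beer',): 3, ('chips',): 2, ('beer', 'chips'): 2}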

FP-Tree beats the Apriori Association Rule Miner algorithm (Java Code) in time complexity. Not sure how it compares with the raft of other algorithms available at the FIMI repository.

I'd love to use an algorithm like this to extract pairs and triples of keywords from documents in a clickstream... basically looking for recurring patterns in browsing/hunting behavior in document repositories.

I would say the biggest issue with using either of the two above algorithms in a 'production' system is that they are not incremental. Ideally one could process a batch of transactions a record at a time and maintain a rolling frequent-itemset collection. Itemsets would move in and out of the collection as they gained or lost 'support', and the cost of adding another transaction would be minimal... as in not having to rescan all previous transactions.

My initial idea for how to do this would be to devise an incremental approximation layer on top of any association miner. At the end of a full run of your miner, you would keep the final itemsets AND the itemsets that just missed the support threshold. The incremental algorithm would then process new transactions up to a tolerance level, say log(n) of the original transaction-set size, and look to promote the 'just missed' itemsets if support arrives. Some attempt could also be made to remove itemsets if the additional transactions sank their support below the cut-line. Once more than log(n) new transactions have arrived, you reprocess the entire set, or trim off the first log(n) of the old transactions and add the new ones.
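
A rough sketch of that wrapper; the log(n) budget, the fixed min_count, and the promotion rule are simplifications of the paragraph above, not a worked-out algorithm:

    import math

    class IncrementalMinerWrapper:
        """Approximate incremental layer over any batch itemset miner.

        After each full run we keep the frequent itemsets AND the near
        misses; new transactions only adjust counts until a log(n)
        budget is spent, at which point a full re-run is due.
        """

        def __init__(self, frequent, near_misses, n_transactions, min_count):
            self.frequent = dict(frequent)      # frozenset -> count from the full run
            self.near_miss = dict(near_misses)  # itemsets just under min_count
            self.min_count = min_count
            self.budget = max(1, int(math.log(n_transactions)))

        def add(self, transaction):
            """Fold in one transaction; returns False once a full re-run is needed."""
            for table in (self.frequent, self.near_miss):
                for itemset in table:
                    if itemset <= transaction:
                        table[itemset] += 1
            # promote near misses that have gathered enough support
            for itemset in [s for s, c in self.near_miss.items() if c >= self.min_count]:
                self.frequent[itemset] = self.near_miss.pop(itemset)
            # demotion would need min_count to grow with n; omitted here
            self.budget -= 1
            return self.budget > 0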

There are likely some speedups to be had in subsequent reprocessings. If the previous full run found that a collection of itemsets made the cut, you could prune those itemsets out of the old transactions... restarting the algorithm with the previous run's itemsets in the "found pool".

Of course, with an algorithm like FP-Tree you must save the tree for the next run and devise a tree-rebalancing algorithm to make it incremental (the relative frequencies of items change with new information). It gets messy quickly.

Monday, October 29, 2007

Inferring meaning from data and structure

Jeremy Liew has quite the thread going on his blog about 'Meaning = Data + Structure (User Generated)', with a Part 2 on Inferring Structure and a guest post by Peter Moore.

Moore's post is a wonderful summary of approaches and their difficulties, and I'll post more on this as I think about it. My initial response is that we should stop looking/waiting for some near-holy-grail {fully functional semantic web} and use lots of good-enough {technologies, algorithms, ontologies} to make progress. I think the perfection-in-reasoning stuff is great for the teleportation version of personal search, versus the good-enough techniques applicable now to the orienteering version of personal search. See this post and this paper on orienteering vs. teleportation in search.

Last week the Bozeman AI group read a paper on Deriving a Large Scale Taxonomy from Wikipedia. I look at this as an example of the main idea above: deriving structure from user-generated content. True, Wikipedia is already structured, but not necessarily in a way that a computer program can use to reason with.

The killer thing about this idea is that it's finally time to do it. Essentially this is what machine learning and data mining have been about for years. I've read/perused hundreds of academic papers whose basic premise is that we write a suite of algorithms to learn/extract structure from a pool of data. A big chunk of the papers in the KDD conferences each year (2007, 2006, 2005) operate on this premise, and the field is quite old (decades).

Really pointy-headed CS types are horrible at monetizing their work. At approximately the same time that the Google founders were inventing PageRank, Jon Kleinberg was creating HITS. Both are link-analysis algorithms built to augment what at the time were poor-quality search engines. Over the past 10 years, when the two have been evaluated head-to-head on Information Retrieval tasks, HITS has worked on par with PageRank. Yet Kleinberg is not now worth 40 billion dollars like Brin and Page of Google.

I fear that the Semantic Web people/researchers have been building sand castles for a decade rather than monetizing what they have to subsidize more research. Perhaps if they had, Delicious, Digg, Wikipedia, et al. would be contributing to the Semantic Web natively, rather than forcing people to figure out a way to export that data into RDF/OWL.