Friday, April 04, 2008

Itemset mining in data streams

We covered a nice paper by de Graaf et al of Leiden University in the AI colloquium. Clustering Co-occurrences of Maximal Frequent Patterns in Streams. It deals with the problem of finding frequent itemsets in a datastream. Ideally you want an incremental algorithm for this with an 'updatable model' rather than being forced to reprocess the entire data/transaction sequence when adding new data. The paper's approach has the extra benefit that the itemsets are clustered by similarity as well. I really enjoy using and learning about algorithms that have nice side-effects.

A rough overview of the algorithm is that it does three basic operations with incoming data. First it builds a support model of patterns encountered. It does this with a with a reinforcement and decay technique, reinforcing support for patterns encountered in the current transaction and decaying those that don't. Second it maintains a geometric organization of itemsets according to a distance metric in a boxed 2-D area. As new data is processed itemsets' coordinates in the (x,y) box move and shift around according to their similarity with other patterns. Third it performs a pattern merging/splitting mechanism to derive new patterns for the model to track. new patterns get a random (x,y) position.

At the termination of processing some amount of data, you are left with a list of itemsets and their support/frequency as well as a nice grouping by similarity.

One advantage of his presentation is that it is stripped of all excess complexity. They well note that it learns an approximation of what you would get from a full-data-scan of a traditional itemset miner. Fine with me.. I don't get hung up on exactness and have lots of faith that incremental model building works well in practice.

The minor flaw of the paper is that they fail to point out (or notice??) that what they have built is a hybrid of a Swarm Intelligence and a Self Organizing Map. The Swarm/Ant portion comes from the reinforcement & decay of the support model, and the SOM from the geometric clustering of the itemsets. On could duplicate this algorithm in spirit by implementing Ant System + SOM with the merging/splitting for new pattern production. By 'Ant System' here I refer to the spirit of an ant system where you use pheromone reinforcement and decay of a model, actual ants traversing a path in a graph are not necessary. The cells in the SOM would contain a sparse vector of itemsets and apply the standard rules for updating.

Yet, even as I see the connection, this is a pointy-headed comment. The paper is nice in that the algorithm is presented without flourish in a straight forward way... sometimes using the word 'hybrid' and casting your work that way is just a form of academic buzzword bingo.

I'll definitively look to implement something similar to this ASAP. I may skip the merging/splitting and use a standard itemset miner offline over the 'last X transactions' and form a itemset pattern dictionary. Only itemsets in the dictionary will be tracked and clustered with the data stream, and every so often run the offline algorithm to learn new patterns and support to merge into the dictionary.

No comments: