Friday, March 28, 2008

Frequent Pattern Mining

In the weekly RightNow AI Colloquium @ MSU CS, we read a paper by Jiawei Han et al. called
Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach.
Basically the problem is this: given a sequence of transactions, each involving multiple items, what are the frequent itemsets? Itemsets are groups of m items that tend to be purchased together. The earlier SQL-based version of FP-Tree looks interesting as well.
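As a toy illustration (my own example, not from the paper), here is the problem stated as brute force in Python: count the support of every itemset and keep those that meet a minimum count.

    from itertools import combinations
    from collections import Counter

    transactions = [
        {"beer", "chips", "salsa"},
        {"beer", "chips"},
        {"milk", "bread"},
        {"beer", "chips", "milk"},
    ]

    min_support = 3  # itemset must appear in at least 3 transactions

    counts = Counter()
    for t in transactions:
        for size in range(1, len(t) + 1):
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1

    frequent = {s: c for s, c in counts.items() if c >= min_support}
    print(frequent)  # {('beer',): 3, ('chips',): 3, ('beer', 'chips'): 3}

Brute force is exponential in transaction width, which is exactly why algorithms like Apriori and FP-Growth exist.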

FP-Tree outperforms the Apriori Association Rule Miner Algorithm (Java Code) in time complexity. Not sure how it compares with the raft of other algorithms available at FIMI.
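For context, here is a minimal sketch of the level-wise candidate generation that Apriori does and FP-Growth avoids; this follows the textbook formulation, not the linked Java implementation:

    def apriori(transactions, min_count):
        # Level-wise search: frequent size-k itemsets are joined into
        # size-(k+1) candidates, which are then counted against the data.
        transactions = [frozenset(t) for t in transactions]
        items = {i for t in transactions for i in t}
        level = {frozenset([i]) for i in items
                 if sum(i in t for t in transactions) >= min_count}
        frequent = set(level)
        k = 1
        while level:
            # Join step: the candidate-generation cost FP-Growth sidesteps.
            candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
            # Count step: a candidate survives only with enough supporting transactions.
            level = {c for c in candidates
                     if sum(c <= t for t in transactions) >= min_count}
            frequent |= level
            k += 1
        return frequent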

I'd love to use an algorithm like this to extract pairs and triples of keywords from documents in a clickstream... basically looking for recurring patterns in browsing/hunting behavior in document repositories.
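A hypothetical sketch of that mapping (the document titles, session grouping, and keyword extractor here are all made-up stand-ins): treat the union of keywords from the documents a user touched in one session as a single "transaction", then mine for frequent pairs and triples.

    def doc_keywords(title):
        # Stand-in for a real keyword extractor (tf-idf, controlled vocabulary, ...).
        return set(title.lower().split())

    sessions = [
        ["Cluster Setup Guide", "Heartbeat Timeout FAQ"],
        ["Cluster Setup Guide", "Failover Configuration"],
    ]

    transactions = [
        set().union(*(doc_keywords(doc) for doc in session))
        for session in sessions
    ]
    # Feed `transactions` to FP-Growth or Apriori and keep the size-2 and
    # size-3 frequent itemsets as candidate keyword pairs and triples.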

I would say the biggest issue with using either of the two algorithms above in a 'production' system is that they are not incremental. Ideally one could process a batch of transactions one record at a time and maintain a rolling frequent itemset collection. Itemsets would move in and out of the collection as they gained or lost 'support', and the cost of adding another transaction would be minimal... as in not having to rescan all previous transactions.

My initial idea for how to do this would be to devise an incremental approximation add-on for any association miner. At the end of a full run of your miner, you would keep the final itemsets AND the itemsets that just missed the support threshold. The incremental algorithm would process new transactions up to a tolerance level, say log(n) of the original transaction set size, and look to promote the 'just missed' itemsets if support arrives. Maybe some attempt could be made to remove itemsets if the additional transactions sank their support below the cut-line. After more than log(n) new transactions arrive, you can reprocess the entire set, or trim off the first log(n) of the old transactions and reprocess the remainder plus the new ones.
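A rough sketch of that bookkeeping in Python (the class, the `near_miss_count` threshold, and the `batch_miner` interface are my own naming and assumptions, not anything from the papers; `batch_miner` is assumed to return an itemset-to-count mapping):

    import math

    class IncrementalApproxMiner:
        # Run the batch miner with a lowered threshold so it also returns the
        # 'just missed' itemsets, then bump counts one transaction at a time.
        def __init__(self, batch_miner, transactions, min_count, near_miss_count):
            self.batch_miner = batch_miner
            self.min_count = min_count              # real support cut-line
            self.near_miss_count = near_miss_count  # lower bar for the watch list
            self._full_run(list(transactions))

        def _full_run(self, transactions):
            self.transactions = transactions
            self.counts = self.batch_miner(transactions, self.near_miss_count)
            # Tolerance: accept roughly log(n) incremental updates before re-running.
            self.budget = max(1, int(math.log(max(2, len(transactions)))))
            self.new_seen = 0

        def add(self, transaction):
            transaction = frozenset(transaction)
            self.transactions.append(transaction)
            self.new_seen += 1
            # Cheap update: only tracked itemsets get their counts bumped, so
            # there is no rescan of old transactions. Promotion happens
            # implicitly once a near-miss itemset's count crosses min_count.
            for itemset in self.counts:
                if itemset <= transaction:
                    self.counts[itemset] += 1
            # (With a fractional support threshold, min_count itself would grow
            # with n, which is what would demote itemsets below the cut-line.)
            if self.new_seen > self.budget:
                self._full_run(self.transactions)  # or trim the oldest log(n) first

        def frequent(self):
            return {s for s, c in self.counts.items() if c >= self.min_count}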

There are likely some speedups to be had in subsequent reprocessings. If the previous full run found that a collection of itemsets made the cut, you could prune those itemsets out of the old transactions... restarting the algorithm with the previous run's itemsets already in the "found pool".

Of course, with an algorithm like FP-Tree you must save the tree for the next run and devise a tree rebalancing algorithm to make it incremental (the relative frequencies of items change with new information). It gets messy quickly.
