- lecturer: Dr. Srivatsan Laxman from Microsoft Research India
- time: Wednesday 13 August 2008 at 12.00
- venue: Room B222 (Exactum building, 2nd floor), Department of Computer Science, University of Helsinki
Frequent patterns describe persistent or repetitive local characteristics in data. Over the years, many successful frameworks have been proposed for pattern discovery, e.g., frequent itemsets and sequential patterns in transaction databases, frequent episodes in event streams, frequent sub-graphs etc. Frequent patterns essentially correspond to strong correlations among a small number of attributes or variables. We propose a statistical perspective for frequent patterns based on suitably defined generative models for the data. First, we consider the framework of frequent episodes in event streams. By introducing the notion of non-overlapped occurrences for episodes, we formally connect frequent
(serial) episode discovery with HMM learning. The frequency definition based on non-overlapped occurrences of episodes leads to very efficient algorithms for frequent episode discovery. Further, it gives us a statistical significance test for episodes that is based only on episode frequency. Thresholds derived from our significance test were found to be effective for root-cause analysis of faults logged in automotive engine assembly lines of General Motors.
We also developed similar analysis for frequent itemset discovery in transaction databases. The most common use of frequent patterns is to derive rules (such as association rules) that hold in the data.
Generative models for pattern mining, however, suggest a potential for their broader application in pattern classification and clustering. For example, one can now efficiently build generative models for sequence prediction using a mixture of specialized HMMs. This approach was used to predict targeted user-behavior on the web. Similarly, mixtures of itemset generating models can be used to describe the statistics of unordered transaction databases. Such models were applied to prioritize security vulnerability reports in Windows.
Last updated on 28 Oct 2008 by WWW administrator - Page created on 13 Aug 2008 by Visa Noronen