The study of place names, dialects, biodiversity, and climate, for example, produces data sets with strong spatial and (possibly) temporal components. This research project investigates data mining methods for finding spatial and temporal relationships in high-dimensional data. The project works in close collaboration with the "Algorithmic and probabilistic methods in data mining" project.
We have developed parameter-free methods for spatial data mining based on minimum description length (MDL) techniques. These methods have been applied to a data set of breeding records for 248 bird species in Finland. In this application the aim is to find spatially coherent regions in which the distribution of breeding bird species is similar.
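To make the MDL idea concrete, the following is a minimal sketch of a generic two-part code length for a partition of grid cells into regions. It is an illustration only and not the project's actual method: the function names (`bernoulli_bits`, `description_length`), the per-species Bernoulli model, the `0.5 * log2(n)` parameter cost, and the toy data are all assumptions, and the sketch does not enforce spatial coherence.

```python
import numpy as np

def bernoulli_bits(column):
    """Bits needed to encode a 0/1 column with its maximum-likelihood Bernoulli model."""
    n = column.size
    k = column.sum()
    if k == 0 or k == n:
        return 0.0  # a constant column costs nothing under the fitted model
    p = k / n
    return -(k * np.log2(p) + (n - k) * np.log2(1 - p))

def description_length(presence, labels):
    """Two-part code length (bits) of a cell-by-species 0/1 matrix under a partition.

    presence: (n_cells, n_species) array, 1 if the species breeds in the cell.
    labels:   (n_cells,) array assigning each cell to a region (hypothetical input).
    """
    presence = np.asarray(presence)
    labels = np.asarray(labels)
    n_cells, n_species = presence.shape
    regions = np.unique(labels)
    # Cost of stating which region each cell belongs to.
    total = n_cells * np.log2(len(regions)) if len(regions) > 1 else 0.0
    for r in regions:
        block = presence[labels == r]
        # Model cost: 0.5 * log2(n) bits per Bernoulli parameter (one per species).
        total += 0.5 * n_species * np.log2(block.shape[0])
        # Data cost: bits to encode each species column given the fitted model.
        total += sum(bernoulli_bits(block[:, j]) for j in range(n_species))
    return total

# Toy data (invented): 8 cells, 4 species; two groups of cells with disjoint species.
presence = np.array([[1, 1, 0, 0]] * 4 + [[0, 0, 1, 1]] * 4)
one_region = description_length(presence, np.zeros(8, dtype=int))
two_regions = description_length(presence, np.array([0] * 4 + [1] * 4))
```

On this toy input the two-region partition gets the shorter code length, so it would be preferred. No threshold or user-set number of regions is involved in the comparison, which is the sense in which such criteria are often called parameter-free.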
Many application areas in our work are closely related to linguistic variation and the history of settlement: first, the study of the distribution of place names in Finland, and second, the investigation of the spatial distributions of Finnish dialect words. Cooperation with experts in Finno-Ugrian linguistics and folklore originated in the research project that analyses signs of ancient Saami habitation in South and Central Finland. From the computational point of view, some of the main challenges are the analysis of a large number of point patterns and the uncertainty concerning the linguistic origin of individual names.
Clustering and dimension reduction techniques (e.g., ICA, PCA) have been applied to the dialect word data set, in which each word is associated with the set of municipalities where it is known to be used. A goal of the research has been the exploration and evaluation of dialectally coherent regions. Uneven sampling is a central problem in this application: the data were collected throughout the 20th century, and the samples were not selected uniformly across the country. We model the missing data and the non-uniform sampling with Bayesian Markov random field models and Markov chain Monte Carlo methods. The data set is large, and these procedures are computationally heavy. We aim to reconstruct as complete a data set as possible, with a clear understanding of the remaining uncertainty. Clustering and dimension reduction techniques, for instance, can then be applied to more reliable, less biased data.
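As an illustration of the dimension reduction step, the sketch below runs PCA (via a singular value decomposition) on a small municipality-by-word presence matrix. The function name `pca_municipalities` and the toy data are assumptions made for this example; the real data set and the project's preprocessing are not reproduced here.

```python
import numpy as np

def pca_municipalities(presence, n_components=2):
    """Project municipalities onto the leading principal components.

    presence: (n_municipalities, n_words) 0/1 array; entry [i, j] is 1
    if word j has been recorded in municipality i.
    Returns the component scores and the explained-variance ratios.
    """
    X = np.asarray(presence, dtype=float)
    X = X - X.mean(axis=0)  # centre each word column
    # The right singular vectors of the centred matrix are the principal axes.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

# Toy data (invented): rows 0-2 share one vocabulary, rows 3-5 another.
presence = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1, 1],
])
scores, explained = pca_municipalities(presence)
```

In this toy case the sign of the first component separates the two groups of municipalities. With unevenly sampled data, a municipality with few recorded words can look distinct simply because it was sampled less, which is why the missing-data modelling described above comes first.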
In close collaboration with the Division of Atmospheric Sciences, we have analyzed meteorological and micrometeorological data sets to detect factors that influence the formation of atmospheric aerosol particles. Clustering and classification methods have been used, and the applicability of kernel methods to this task has also been studied.
People
- Marko Salmenkivi, project leader
- Aristides Gionis, project leader
- Antti Leino
- Saara Hyvönen
- Heikki Mannila
Research groups
- Data Mining, Prof. Heikki Mannila
See www.cs.helsinki.fi/research/fdk/datamining for further information and publications.
Last updated on 10 Dec 2007 by Teemu Mäntylä - Page created on 13 Jan 2007 by Webmaster