Machine Learning Coffee seminar:
Tapio Pahikkala, Assistant Professor, Department of Information Technology, University of Turku.
Small data AUC estimation of machine learning methods: Pitfalls and remedies
Abstract: Asking whether two populations can be distinguished from each other is one of the most fundamental questions in data analysis, and the area under the ROC curve (AUC) is one of the simplest and most practical tools for answering it. Also known as the Wilcoxon–Mann–Whitney U statistic, it can be associated with a p-value indicating how likely one would be to obtain an AUC value at least as good if the two populations were not stochastically different. Estimating the AUC of a predictive model and its statistical significance is of great practical importance in fields like medicine, where one often has access to only small amounts of labeled data but a large number of features. Leave-pair-out cross-validation (LPOCV) is an almost unbiased AUC estimator for machine learning methods that has also been empirically shown to be the most reliable of the cross-validation (CV) based estimators. We further study the properties of LPOCV and show some serious pitfalls one can encounter when estimating AUC with CV, and how to avoid them. In particular, we show how one can produce very promising results with high AUC values even if there is no signal in the data. Finally, we show how to counter these risks with new Wilcoxon–Mann–Whitney U type permutation tests adjusted for LPOCV, thus upgrading one of the classical statistical tools for CV estimates.
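As a rough illustration of the two quantities the abstract mentions, the following Python sketch computes AUC as the normalized Wilcoxon–Mann–Whitney U statistic and a leave-pair-out cross-validation estimate of it. The function names (auc_wmw, fit_centroid, lpocv_auc), the toy data and the nearest-centroid learner are assumptions made for this example only; they are not the speaker's code or method, and the permutation tests from the talk are not shown.

```python
from itertools import product


def pair_credit(pos_score, neg_score):
    # 1 for a correctly ordered pair, 1/2 for a tie, 0 otherwise.
    if pos_score > neg_score:
        return 1.0
    return 0.5 if pos_score == neg_score else 0.0


def auc_wmw(pos_scores, neg_scores):
    """AUC as the normalized Wilcoxon-Mann-Whitney U statistic:
    the fraction of (positive, negative) pairs ranked correctly."""
    total = sum(pair_credit(p, n) for p, n in product(pos_scores, neg_scores))
    return total / (len(pos_scores) * len(neg_scores))


def fit_centroid(X, y):
    """Hypothetical stand-in learner (an assumption for this sketch):
    scores a point by how much closer it is to the positive class
    centroid than to the negative one."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    mu_pos = mean([x for x, label in zip(X, y) if label == 1])
    mu_neg = mean([x for x, label in zip(X, y) if label == 0])

    def score(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, mu_pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, mu_neg))
        return d_neg - d_pos

    return score


def lpocv_auc(X, y, fit):
    """Leave-pair-out CV: for every (positive, negative) pair, train on
    the remaining examples and check whether the held-out positive is
    scored above the held-out negative. Needs at least two examples
    per class so that each training set still contains both classes."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    total = 0.0
    for i, j in product(pos, neg):
        train = [k for k in range(len(y)) if k not in (i, j)]
        score = fit([X[k] for k in train], [y[k] for k in train])
        total += pair_credit(score(X[i]), score(X[j]))
    return total / (len(pos) * len(neg))


# Toy data with two cleanly separated classes (one feature).
X_demo = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y_demo = [0, 0, 0, 1, 1, 1]
print(auc_wmw([2, 4], [1, 3]))                  # 3 of 4 pairs correct
print(lpocv_auc(X_demo, y_demo, fit_centroid))
```

Note that each held-out pair gets its own freshly trained model, so the number of training runs grows with the product of the class sizes; this is affordable mainly in the small-data setting the talk addresses.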
Machine Learning Coffee seminars are weekly seminars held jointly by Aalto University and the University of Helsinki. The seminars aim to bring together people from different fields of science who are interested in machine learning. Seminars are held on Mondays at 9 am, alternating weekly between Aalto University and the University of Helsinki. At Aalto University, talks are held at Konemiehentie 2, seminar room T5, and at the University of Helsinki in Kumpula, seminar room D123, unless otherwise noted. Porridge and coffee are served from 9:00 am, and talks begin at 9:15 am.
Following talks:
- March 20, Otaniemi, Harri Valpola: Semi-supervised Deep Learning
- March 27, Kumpula, Hiroshi Mamitsuka: Learning to Rank: applications to bioinformatics
- April 3, Otaniemi, Mikko Kivelä: Multilayer networks
Welcome!
Last updated on 17 Mar 2017 by Noora Suominen de Rios - Page created on 3 Mar 2017 by Teemu Roos