Correcting the Hub Occurrence Prediction Bias in Many Dimensions

Nenad Tomašev¹, Krisztian Buza² and Dunja Mladenić¹

¹ Institute Jožef Stefan, Jamova 39
1000 Ljubljana, Slovenia
nenad.tomasev@gmail.com, dunja.mladenic@ijs.si
² Institute of Genomic Medicine and Rare Disorders
Tömö utca 25-29, 1083 Budapest, Hungary
chrisbuza@yahoo.com

Abstract

Data reduction is a common pre-processing step for k-nearest neighbor classification (kNN). Existing prototype selection methods apply different criteria for selecting the points to use in classification, and each criterion constitutes a selection bias. This study examines the nature of the instance selection bias in intrinsically high-dimensional data. In high-dimensional feature spaces, hubs are known to emerge as centers of influence in kNN classification. These points appear in a disproportionately large share of kNN sets and are often detrimental to classification performance. Our experiments reveal that different instance selection strategies bias predictions of hub-point behavior in high-dimensional data in different ways. We propose an intermediate un-biasing step when training the neighbor occurrence models, and we demonstrate promising improvements in various hubness-aware classification methods on a wide selection of high-dimensional synthetic and real-world datasets.
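
To make the notion of hubs concrete, the following is a minimal sketch of how neighbor occurrence counts can be measured. It is an illustration only, not the authors' method, and it does not implement the un-biasing step proposed in the paper. The function names (`k_occurrences`, `skewness`, `gaussian_sample`), the Gaussian toy data, and the choice of Euclidean distance are all assumptions made for this example. The count N_k(x) is the number of kNN sets in which point x appears; a strongly right-skewed distribution of these counts is the usual sign that hubs are present.

```python
import random
from collections import Counter


def k_occurrences(points, k):
    """Return N_k(x) for every point: how many kNN sets it appears in."""
    n = len(points)
    counts = Counter()
    for i in range(n):
        # Squared Euclidean distances from point i to every other point.
        dists = [
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n)
            if j != i
        ]
        dists.sort()  # ties are broken by the lower index j
        for _, j in dists[:k]:
            counts[j] += 1
    return [counts[i] for i in range(n)]


def skewness(values):
    """Third standardized moment of the occurrence distribution."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def gaussian_sample(n, dim, seed=0):
    """Hypothetical toy data: n i.i.d. standard Gaussian points."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


if __name__ == "__main__":
    # Compare a low-dimensional and a high-dimensional sample.
    for dim in (2, 100):
        occ = k_occurrences(gaussian_sample(150, dim), k=5)
        print(dim, max(occ), round(skewness(occ), 2))
```

Every point contributes exactly k entries to the kNN sets, so the counts always sum to n·k; hubness concerns how unevenly that fixed total is distributed. Running the script on samples of different dimensionality is one simple way to inspect that unevenness.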

Key words

instance selection, data reduction, classification, bias, k-nearest neighbor, hubness, curse of dimensionality

Digital Object Identifier (DOI)

https://doi.org/10.2298/CSIS140929039T

Publication information

Volume 13, Issue 1 (January 2016)
Year of Publication: 2016
ISSN: 2406-1018 (Online)
Publisher: ComSIS Consortium

How to cite

Tomašev, N., Buza, K., Mladenić, D.: Correcting the Hub Occurrence Prediction Bias in Many Dimensions. Computer Science and Information Systems, Vol. 13, No. 1, 1–21. (2016), https://doi.org/10.2298/CSIS140929039T