Statistics Seminar

Myung Hee Lee
Weill Cornell Medical College
Outlier detection for high dimensional, low sample size data

Wednesday, October 5, 2016 - 4:15pm
Biotech G01

Despite the popularity of high-dimension, low-sample-size data analysis, little attention has been paid to the outlier detection problem. We propose a two-stage procedure to detect outliers in high-dimensional data. The first stage screens out a predetermined number of the most outlying points one by one, based on the distance between each data vector and the affine space generated by the remaining data. In the second stage, we test whether each of the screened observations is significantly outlying. The reference values for the significance test are based on random rotations of the data in the dual space. We show that the rotation procedure generates null data sets with the same volume as the original data, but without any outliers. High-dimensional asymptotics are used to justify the proposed remoteness measure. In a range of simulation settings, the proposed method outperforms alternative approaches. If time permits, I will present highlights of projects I am currently involved in at the Center for Global Health.
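
As a rough illustration of the distance used in the screening stage of the abstract, the sketch below computes, for each observation, its Euclidean distance to the affine space spanned by the remaining observations. The abstract does not say how this distance is computed, so the least-squares projection used here is an assumption, and the function name loo_affine_distances is hypothetical. The significance test based on random rotations is not sketched.

```python
import numpy as np


def loo_affine_distances(X):
    """Leave-one-out distance of each row of X to the affine hull of the other rows.

    X has shape (n, d), with one observation per row. This is an illustrative
    sketch, not the speaker's implementation.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    dists = np.empty(n)
    for i in range(n):
        rest = np.delete(X, i, axis=0)   # the remaining n - 1 observations
        base = rest[0]                   # any point of the affine hull
        A = (rest[1:] - base).T          # d x (n - 2) matrix of spanning directions
        b = X[i] - base
        # Least-squares projection of b onto the column space of A;
        # the residual norm is the distance to the affine hull.
        coef, *_ = np.linalg.lstsq(A, b, rcond=None)
        dists[i] = np.linalg.norm(b - A @ coef)
    return dists


# Toy data: the third point lies 2 units away from the line through the first two.
X_toy = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.5, 0.0, 2.0]])
d_toy = loo_affine_distances(X_toy)
most_outlying = int(np.argmax(d_toy))    # index of the most remote observation
```

A one-by-one screening could then remove the observation with the largest distance and recompute the distances on the reduced data, repeating for the predetermined number of candidates.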