Statistics Seminar

Wanjie Wang, University of Pennsylvania
Important features PCA (IF-PCA) for large-scale inference, with applications in gene microarrays

Friday, February 19, 2016 - 4:15pm
Biotech G01

Clustering is a major problem in statistics with many applications. In the Big Data era, it faces two main challenges: (1) the number of features is much larger than the sample size; (2) the signals are sparse and weak, masked by a large amount of noise.

We propose a new tuning-free clustering procedure for large-scale data, Important Features PCA (IF-PCA). IF-PCA consists of a feature selection step, a PCA step, and a k-means step. The first two steps successively reduce the dimension of the data while preserving the main information. As a consequence, IF-PCA is fast and accurate, and it performs competitively on 10 gene microarray data sets.
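The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's implementation: the abstract does not say how features are scored or how the selection threshold is set, so the sketch scores each feature by its Kolmogorov-Smirnov distance from a standard normal and keeps a fixed number of features through a hypothetical parameter n_keep (which makes the sketch, unlike the procedure in the talk, not tuning-free). Running k-means on the leading k - 1 left singular vectors is likewise an assumption, and all function names are made up for the example.

```python
import math
import numpy as np


def ks_scores(Z):
    """KS distance between each standardized column of Z and N(0, 1)."""
    n = Z.shape[0]
    Zs = np.sort(Z, axis=0)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(Zs / math.sqrt(2.0)))
    upper = np.arange(1, n + 1)[:, None] / n - cdf
    lower = cdf - np.arange(0, n)[:, None] / n
    return np.maximum(upper, lower).max(axis=0)


def kmeans(Y, k, rng, n_init=10, n_iter=100):
    """Plain Lloyd iterations with random restarts; returns the best labels."""
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        centers = Y[rng.choice(len(Y), size=k, replace=False)]
        for _ in range(n_iter):
            d = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # Keep the old center if a cluster happens to become empty.
            new_centers = np.array([
                Y[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        cost = d[np.arange(len(Y)), labels].sum()
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels


def if_pca_sketch(X, k, n_keep, seed=0):
    """Feature selection, then PCA, then k-means (illustrative only)."""
    # Step 1 (assumed scoring rule): standardize, keep the n_keep features
    # whose empirical distribution is farthest from N(0, 1).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    keep = np.argsort(ks_scores(Z))[-n_keep:]
    # Step 2: PCA on the selected features via the SVD.
    U, _, _ = np.linalg.svd(Z[:, keep], full_matrices=False)
    # Step 3: k-means on the leading k - 1 left singular vectors (assumption).
    return kmeans(U[:, :k - 1], k, np.random.default_rng(seed))


# Synthetic data: 200 samples, 500 features, only the first 20 carry signal.
rng = np.random.default_rng(0)
n, p, n_signal = 200, 500, 20
truth = np.repeat([0, 1], n // 2)
X = rng.standard_normal((n, p))
X[:, :n_signal] += np.where(truth[:, None] == 1, 3.0, -3.0)

labels = if_pca_sketch(X, k=2, n_keep=20)
# Cluster labels are arbitrary, so compare up to a swap of the two labels.
agreement = max(np.mean(labels == truth), np.mean(labels != truth))
print(f"agreement with the true grouping: {agreement:.2f}")
```

The synthetic signal here is deliberately strong so that the example is easy to check; it does not represent the rare and weak regime that the talk addresses.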

We also propose a model that captures the rarity and weakness of the signals. Under this model, the statistical limits for the clustering problem and for IF-PCA have been found.

Refreshments will be served after the seminar in 1181 Comstock Hall.