Statistics Seminar

Victoria Stodden, University of Illinois at Urbana-Champaign

Structuring machine learning research in data driven science

Wednesday, November 1, 2017 - 4:15pm

Biotech G01

Statistical discovery increasingly takes place using data not collected by the discoverers, and often completely in silico. This calls for new consideration of the methods and computational infrastructure that support statistical pipelines. In this talk I present CompareML, a novel framework for the statistical analysis of "organic data" as opposed to "designed data" (Kreuter & Peng 2014), which permits the direct comparison of findings that purport to answer the same statistical question. I will argue that such computational frameworks are crucial to reproducible science, using an example from genomics on acute leukemia (Golub et al. 1999) where traditional approaches (surprisingly) fail at scale.