New Techniques and Results in Mixture Models

Soumyabrata Pal
Hariharan Narayanan
Tuesday, 30 Aug 2022, 16:00 to 17:00
Mixture models, introduced in 1894 by Karl Pearson, are very popular in both theory and practice. Mixture models with high-dimensional latent parameter vectors are widely used to fit complex multimodal datasets, as they can represent the presence of sub-populations within the overall population. The primary difficulty in learning mixture models is that the observed data does not identify the sub-population to which an individual observation belongs.
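As a minimal illustration of this difficulty, the following toy sketch (all dimensions, means, and noise levels are assumptions for the example, not from the talk) draws samples from a two-component Gaussian mixture with sparse mean vectors; note that only the observation is returned, never the latent component label:

```python
import random

# Hypothetical toy setup: a 2-component Gaussian mixture in d = 6 dimensions
# whose mean vectors are sparse (most coordinates are zero).
means = [
    [3.0, 0.0, 0.0, -2.0, 0.0, 0.0],  # component 1: support {0, 3}
    [0.0, 5.0, 0.0, 0.0, 0.0, 1.5],   # component 2: support {1, 5}
]

def sample(n, sigma=0.1, seed=0):
    """Draw n observations; the component label is latent (not returned)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        mu = rng.choice(means)                      # latent sub-population
        data.append([m + rng.gauss(0.0, sigma) for m in mu])
    return data                                     # only the noisy x's are observed

data = sample(200)
```

The learner sees only `data`; recovering anything about the two mean vectors requires disentangling the unlabeled sub-populations.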
In this talk, I will introduce the problem of recovering the supports of the unknown parameter vectors when they are known to be sparse. I will present a very generic framework (including a novel tensor-based algorithm) for support recovery that uses estimates of the number of unknown vectors having non-zero entries in small subsets of indices. I will then apply this framework by presenting a variety of techniques to estimate these quantities in different mixture models. Our support recovery results are quite general: they apply to (1) mixtures of many canonical distributions and (2) mixtures of linear regressions and linear classifiers. Finally, I will present experiments on real-world datasets that support our theoretical guarantees.
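To make the flavor of subset-based support recovery concrete, here is an illustrative sketch (not the paper's algorithm; the statistic, threshold, and data are assumptions for the toy example). It uses the simplest possible "subset" statistic, a per-coordinate (singleton) average, to recover the union of the supports of sparse mixture means from unlabeled observations:

```python
import random

def recover_support(data, threshold=0.5):
    """Declare coordinate i in the (union of) supports if its empirical
    mean absolute value exceeds the threshold -- a singleton-subset test."""
    d, n = len(data[0]), len(data)
    support = set()
    for i in range(d):
        avg_abs = sum(abs(x[i]) for x in data) / n
        if avg_abs > threshold:
            support.add(i)
    return support

# Toy data: two sparse mean vectors with supports {0, 3} and {1, 5};
# component labels are latent, so only the noisy observations are used.
rng = random.Random(1)
means = [[3.0, 0, 0, -2.0, 0, 0], [0, 5.0, 0, 0, 0, 1.5]]
data = []
for _ in range(500):
    mu = rng.choice(means)
    data.append([m + rng.gauss(0.0, 0.1) for m in mu])

print(sorted(recover_support(data)))  # prints [0, 1, 3, 5]
```

The framework in the talk is far more general (larger index subsets let one separate the supports of individual components rather than just their union), but the sketch shows the basic idea: statistics computed on small subsets of coordinates reveal where the non-zero entries lie.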
Based on papers that appeared in NeurIPS 2020, NeurIPS 2021, and AISTATS 2022.
Bio: Soumyabrata is a Postdoctoral Researcher at Google Research, India. He completed his Ph.D. in the College of Information and Computer Sciences (CICS) at the University of Massachusetts Amherst, advised by Dr. Arya Mazumdar. During that time, he was a Visiting Graduate Student at the University of California San Diego in 2021 and interned at the Ernst & Young AI Lab and at Amazon AI and Search. His research interests are theoretical machine learning, applied statistics, and information theory. In particular, his research focuses on designing efficient and scalable algorithms for statistical recovery/reconstruction problems under different structural assumptions on the data-generating mechanism, such as sparsity, low rank, and the presence of latent clusters, among others.