Abstract: Speech recognizers are typically trained with data from a standard dialect and do not generalize well to non-standard dialects. The mismatch mainly occurs in the acoustic realization of words, which is represented by the acoustic models and the pronunciation lexicon. Standard techniques for addressing this mismatch are generative in nature and include acoustic model adaptation and expansion of the lexicon with pronunciation variants, both of which have limited effectiveness. We present a discriminative pronunciation model whose parameters are learned jointly with the parameters of the language model. We tease apart the gains from modeling the transitions of canonical phones, the transduction from surface to canonical phones, and the language model. We report experiments on African American Vernacular English (AAVE) using NPR's StoryCorps corpus. Our models improve performance over the baseline by about 2.1% on AAVE, of which 0.6% can be attributed to the pronunciation model. The model learns the phonetic transformations most relevant to AAVE speech.
Bio: Izhak Shafran is a speech researcher who has been working on acoustic modeling and large vocabulary speech recognition since 1996. Before joining Google, he was an Associate Professor and a member of the Center for Spoken Language Processing at OHSU, where his focus was on medical applications, specifically Parkinson's disease, depression, and mild cognitive impairment. He graduated from the University of Washington in Seattle in 2001 and subsequently worked at AT&T Research Labs in Florham Park with the speech algorithms group. In the summer of 2006, he was a visiting professor at the University of Paris-Sud, working at LIMSI. Subsequently, he was a research faculty member at the Center for Language and Speech Processing (CLSP) at Johns Hopkins University. He received an NIH Career Development Award in 2010. He started his research career at the Tata Institute of Fundamental Research, in radio astronomy.