Deep Learning for Speech Recognition (Adam Coates, Baidu)
The talks at the Deep Learning School on September 24/25, 2016 were amazing. I clipped out individual talks from the full live streams and provided links to each below, in case that's useful for people who want to watch specific talks several times (like I do). Please check out the official website (http://www.bayareadlschool.org) and the full live streams below.
Having read, watched, and presented deep learning material over the past few years, I have to say that this is one of the best collections of introductory deep learning talks I've yet encountered. Here are links to the individual talks and the full live streams for the two days:
Go to http://www.bayareadlschool.org for more information on the event, speaker bios, slides, etc. Huge thanks to the organizers (Shubho Sengupta et al) for making this event happen.
The Eleventh HOPE (2016): Coding by Voice with Open Source Speech Recognition
Friday, July 22, 2016: 8:00 pm (Friedman): Carpal tunnel and repetitive strain injuries can prevent programmers from typing for months at a time. Fortunately, it is possible to replace the keyboard with speech recognition - David writes Linux systems code by voice. The key is to develop a voice grammar customized for programming. A community has evolved around hacking the commercial Dragon NaturallySpeaking to use custom grammars, but this method suffers from fragmentation, a steep learning curve, and frustrating installation difficulties. In an attempt to make voice coding more accessible, David created a new speech recognition system called Silvius, built on open-source software with free speech models. It can run on cloud servers for ease of setup, or locally for the best latency. He and his collaborators have also prototyped a hardware dongle which types Silvius keystrokes using a fake USB keyboard, and requires no software installation. This talk will include live voice-coding demos with both Dragon and Silvius. The hope is that Silvius will lower the bar for experimentation and innovation, and encourage ordinary programmers to try voice coding, instead of waiting until a crippling injury throws them in at the deep end.
David Williams-King
Speech Emotion Recognition with Convolutional Neural Networks
Speech emotion recognition promises to play an important role in fields such as healthcare, security, and human-computer interaction (HCI). This talk examines various convolutional neural network architectures for recognizing emotion in utterances in the Chinese language. Experiments are conducted with log-Mel spectral features, pitch, and energy, along with voice activity detection. Further experiments are conducted with spectrograms of the speech utterances. Different pooling operations are also investigated. Finally, preliminary experiments are conducted on cross-language emotion recognition between Chinese and English.
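The talk's exact feature pipeline is not given here; as a minimal sketch of the log-Mel spectrum features mentioned above, the following Python snippet computes them with librosa (the file name, 25 ms/10 ms framing, and 40 Mel bands are illustrative assumptions, not values from the talk).

```python
# Minimal sketch: log-Mel spectrogram features for one utterance.
# Assumptions (not from the talk): 25 ms windows, 10 ms hop, 40 Mel bands.
import librosa
import numpy as np

def log_mel_features(wav_path, sr=16000, n_mels=40):
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),       # 25 ms analysis window
        hop_length=int(0.010 * sr),  # 10 ms frame shift
        n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)  # (n_mels, n_frames)
    return log_mel.T                                 # (n_frames, n_mels)

# features = log_mel_features("utterance.wav")  # hypothetical file name
```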
Automatic Speech Recognition (ASR) is the task of transducing raw audio signals of spoken language into text transcriptions. This talk covers the history of ASR models, from Gaussian Mixtures to attention-augmented RNNs, the basic linguistics of speech, and the various input and output representations frequently employed.
Emotion Detection from Speech Signals
Despite the great progress made in artificial intelligence, we are still far from having natural interaction between man and machine, because the machine does not understand the emotional state of the speaker. Speech emotion detection, which aims to recognize emotional states from the speech signal, has been drawing increasing attention. The task of speech emotion recognition is very challenging, because it is not clear which speech features are most powerful in distinguishing between emotions. We utilize deep neural networks to detect the emotional state of each speech segment in an utterance and then combine the segment-level results to form the final emotion recognition result. The system produces promising results on both clean speech and speech in a gaming scenario.
Speech Recognition Breakthrough for the Spoken, Translated Word
Chief Research Officer Rick Rashid demonstrates a speech recognition breakthrough via machine translation that converts his spoken English words into computer-generated Chinese. The breakthrough is patterned after deep neural networks and significantly reduces errors in spoken as well as written translation.
Separating simultaneous speech signals from a mixture is a well-studied problem. There are two major approaches: blind source separation and spatial filtering. The first relies on the statistical independence and super-Gaussian distribution of the speech signals. Spatial filtering uses the fact that the speech sources are separated in space. This talk presents the results of a summer internship in which both approaches are combined to maximize source separation. Applications for speech separation include gaming, communication, and voice control.
Automatic Speech Emotion Recognition Using Recurrent Neural Networks with Local Attention
Automatic emotion recognition from speech is a challenging task which relies significantly on the emotional relevance of specific features extracted from the speech signal. In this study, our goal is to use deep learning to automatically discover emotionally relevant features. It is shown that using a deep Recurrent Neural Network (RNN), we can learn both the short-time frame-level acoustic features that are emotionally relevant and an appropriate temporal aggregation of those features into a compact sentence-level representation. Moreover, we propose a novel strategy for feature pooling over time using an attention mechanism with the RNN, which is able to focus on the local regions of a speech signal that are more emotionally salient. The proposed solution was tested on the IEMOCAP emotion corpus and shown to provide more accurate predictions than existing emotion recognition algorithms.
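As a rough illustration of attention-based pooling over frame-level features (not the exact model from the study), here is a minimal NumPy sketch; the dimensions and the attention parameter w are placeholders.

```python
# Minimal sketch of attention pooling over frame-level features.
# h: (T, D) frame-level RNN outputs; w: (D,) learned attention parameter.
import numpy as np

def attention_pool(h, w):
    scores = h @ w                         # one scalar score per frame
    alpha = np.exp(scores - scores.max())  # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ h                       # (D,) utterance-level representation

rng = np.random.default_rng(0)
h = rng.standard_normal((120, 128))        # 120 frames, 128-dim features (illustrative)
w = rng.standard_normal(128)
utt_vec = attention_pool(h, w)             # compact sentence-level representation
```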
The Academic Research Summit, co-organized by Microsoft Research and the Association for Computing Machinery, is a forum to foster meaningful discussion among the Indian computer science research community and raise the bar on research efforts.
The third edition of Academic Research Summit was held at the International Institute of Information Technology (IIIT) Hyderabad on the 24th and 25th of January 2018.
The agenda included keynotes and talks from distinguished researchers from India and across the world. The summit also had sessions focused on specific topics related to the theme of Artificial Intelligence: A Future with AI.
Real-time Single-channel Speech Enhancement with Recurrent Neural Networks
Single-channel speech enhancement using deep neural networks (DNNs) has shown promising progress in recent years. In this work, we explore several aspects of neural network training that impact the objective quality of enhanced speech in a real-time setting. In particular, we base all studies on a novel recurrent neural network that enhances full-band short-time speech spectra on a single-frame-in, single-frame-out basis, a framework adopted by most classical signal processing methods. We propose two novel learning objectives that allow separate control over expected speech distortion versus noise suppression. Moreover, we study the effect of feature normalization and sequence lengths on the objective quality of enhanced speech. Finally, we compare our method with state-of-the-art methods based on statistical signal processing and deep learning, respectively.
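The two learning objectives themselves are not reproduced here; as a hedged sketch of separately weighting speech distortion against noise suppression, one common mask-based formulation looks roughly like this (the weighting alpha and the mask-based setup are assumptions for illustration, not the paper's losses).

```python
# Sketch: separate penalties for speech distortion and residual noise
# in mask-based enhancement. S, N: clean-speech and noise magnitude
# spectra (T, F); M: estimated mask in [0, 1]. alpha trades off the terms.
import numpy as np

def enhancement_loss(M, S, N, alpha=0.5):
    speech_distortion = np.mean(((1.0 - M) * S) ** 2)  # attenuated speech
    residual_noise    = np.mean((M * N) ** 2)          # noise leaking through
    return alpha * speech_distortion + (1.0 - alpha) * residual_noise
```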
Distant Speech Recognition: No Black Boxes Allowed
A complete system for distant speech recognition (DSR) typically consists of several distinct components, among them: an array of microphones for far-field sound capture; an algorithm for tracking the positions of the active speaker or speakers; a beamforming algorithm for focusing on the desired speaker and suppressing noise, reverberation, and competing speech from other speakers; a recognition engine to extract the most likely hypothesis from the output of the beamformer; a speaker adaptation component for adapting to the characteristics of a given speaker as well as to channel effects; and postfiltering to further enhance the beamformed output. Moreover, several of these components are composed of one or more subcomponents. While it is tempting to isolate and optimize each component individually, experience has shown that such an approach cannot lead to optimal performance. In this talk, we will discuss several examples of the interactions between the individual components of a DSR system. In addition, we will describe the synergies that become possible as soon as each component is no longer treated as a "black box". To wit, instead of treating each component as having solely an input and an output, it is necessary to peel back the lid and look inside. Only then does it become apparent how the individual components of a DSR system can be viewed not as separate entities but as the various organs of a complete body, and how optimal performance of such a system can be obtained. Joint work with: Kenichi Kumatani, Barbara Rauch, Friedrich Faubel, Matthias Wolfel, and Dietrich Klakow.
Emotion Recognition in Speech Signal: Experimental Study, Development and Applications
In this talk I will overview my research on emotion expression and emotion recognition in the speech signal and its applications. Two proprietary databases of emotional utterances were used in this research. The first database consists of 700 emotional utterances in English pronounced by 30 subjects portraying five emotional states: unemotional (normal), anger, happiness, sadness, and fear. The second database consists of 3660 emotional utterances in Russian by 61 subjects portraying six emotional states: unemotional, anger, happiness, sadness, fear, and surprise. An experimental study was conducted to determine how well people recognize emotions in speech. Based on the results of the experiment, the most reliable utterances were selected for feature selection and for training recognizers. Several machine learning techniques were applied to create recognition agents, including k-nearest neighbor, neural networks, and ensembles of neural networks. The agents can recognize five emotional states with the following accuracy: normal or unemotional state 55-75%, anger 70-80%, and fear 35-55%. The agents can be adapted to a particular environment depending on the parameters of the speech signal and the number of target emotional states. For a practical application, an agent was created that is able to analyze telephone-quality speech and distinguish between two emotional states (agitation, which includes anger, happiness, and fear; and calm, which includes the normal state and sadness) with an accuracy of 77%. The agent was used as part of a decision support system for prioritizing voice messages and assigning a proper human agent to respond to the message in a call center environment. I will also give a summary of other research topics in the lab, including fast pitch-synchronous segmentation of the speech signal, the use of speech analysis techniques for language learning, and video clip recognition using a joint audio-visual model.
Towards Robust Conversational Speech Recognition and Understanding
While significant progress has been made in automatic speech recognition (ASR) during the last few decades, recognizing and understanding unconstrained conversational speech remains a challenging problem. Unlike read or highly constrained speech, spontaneous conversational speech is often ungrammatical and ill-structured. As the relevant semantic notions are embedded in a set of keywords, the first goal is to propose a model training methodology for keyword spotting. A non-uniform minimum classification error (MCE) approach is proposed which achieves consistent and significant performance gains on both English and Mandarin large-scale spontaneous conversational speech (Switchboard, HKUST). Adverse acoustical environments degrade system performance substantially. Recently, acoustic models based on deep neural networks (DNNs) have shown great success. This opens new possibilities for further improving noise robustness in recognizing conversational speech. The second goal is to propose a DNN-based acoustic model that is robust to additive noise, channel distortions, and interference from competing talkers. A hybrid recurrent DNN-HMM system is proposed for robust acoustic modeling which achieves state-of-the-art performance on two benchmark datasets (Aurora-4, CHiME). To study the specific case of conversational speech recognition in the presence of a competing talker, several multi-style training setups of DNNs are investigated and a joint decoder operating on multi-talker speech is introduced. The proposed combined system outperforms the state-of-the-art 2006 IBM superhuman system on the same benchmark dataset. Even with perfect ASR, extracting semantic notions from conversational speech can be challenging due to the interference of frequently uttered disfluencies, filler and mispronounced words, etc. The third goal is to propose a robust WFST-based semantic decoder that interfaces seamlessly with ASR. Latent semantic rational kernels (LSRKs) are proposed, and substantial topic spotting performance gains are achieved on two conversational speech tasks (Switchboard, HMIHY0300).
Spontaneous Speech: Challenges and Opportunities for Parsing
Recent advances in automatic speech recognition (ASR) provide new opportunities for natural language processing of speech, including applications such as understanding, summarization, and translation. Parsing can play an important role here, but much of current parsing technology has been developed on written text. Spontaneous speech differs substantially from text, posing challenges that include the absence of punctuation and the presence of disfluencies and ASR errors. At the same time, prosodic cues in speech can provide disambiguating context beyond that available from punctuation. This talk looks at means of leveraging prosody and uncertainty models to improve parsing (and recognition) of spontaneous speech, and outlines challenges in speech processing that impact parsing.
Some Recent Advances in Gaussian Mixture Modeling for Speech Recognition
State-of-the-art Hidden Markov Model (HMM) based speech recognition systems typically use Gaussian Mixture Models (GMMs) to model the acoustic features associated with each HMM state. Due to computational, storage, and robust-estimation considerations, the covariance matrices of the Gaussians in these GMMs are typically diagonal. In this talk I will describe several new techniques to better model the acoustic features associated with an HMM state: subspace constrained GMMs (SCGMMs), non-linear volume-preserving acoustic feature space transformations, etc. Even with better models, one has to deal with mismatches between the training and test conditions. This problem can be addressed by adapting either the acoustic features or the acoustic models to reduce the mismatch. In this talk I will present several approaches to adaptation: FMAPLR (a variant of FMLLR that works well with very little adaptation data), adaptation of the front-end parameters, adaptation of SCGMMs, etc. While the ideas presented are explored and evaluated in the context of speech recognition, the talk should appeal to anyone with an interest in statistical modeling.
High-Accuracy Neural-Network Models for Speech Enhancement
In this talk we will discuss our recent work on AI techniques that improve the quality of audio signals for both machine understanding and sensory perception. Our best models utilize convolutional-recurrent neural networks. They improve the PESQ of noisy signals by 0.6 and boost SNRs by up to 34 dB in challenging capture conditions. We will compare the performance of our models with classical approaches that use statistical signal processing and with existing state-of-the-art data-driven methods that use DNNs. We will also discuss preliminary results from semi-supervised learning approaches that further improve enhancement performance.
Enriching Speech Translation: Exploiting Information Beyond Words
Current statistical speech translation approaches rely predominantly on text transcripts and do not adequately utilize the rich contextual information, such as prosody and discourse function, that is conveyed beyond words and syntax. In this talk I will introduce a novel framework for enriching speech translation with prosodic prominence and dialog acts. Our approach of incorporating rich information in speech translation is motivated by the fact that it is important to capture and convey not only what is being communicated (the words) but how it is being communicated (the context). First, I will present various techniques that we have developed for automatically detecting prosody and dialog acts from speech and text, and will survey some of the most important results of our contribution. I will then describe techniques for the integration of these rich representations in spoken language translation.
DNN-Based Online Speech Enhancement Using Multitask Learning and Suppression Rule Estimation
Most currently available speech enhancement algorithms use a statistical signal processing approach to remove the noise component from observed signals. The performance of these algorithms is thus dependent on the statistical assumptions they make about speech and noise signals, which are often inaccurate. In this work, we consider machine learning as an alternative, using deep neural networks to discover the transformation from noisy to clean speech. While DNNs are now the standard approach for acoustic modeling in speech recognition, there have been fewer studies looking at DNNs for improving signal quality for the human listener. We consider a realistic scenario where both environmental noise and room reverberation are present and where a strict real-time processing requirement is enforced by the application. We examine several structures in which a DNN can replace conventional speech enhancement systems, including end-to-end DNN regression as well as suppression rule estimation by DNNs. We also propose to use multitask learning with the estimation of bin-wise speech presence probability as the secondary task, and show that it improves enhancement performance.
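The following PyTorch sketch shows one plausible wiring of such a multitask network, with a primary suppression-rule (mask) head and a secondary bin-wise speech presence probability head; the GRU, layer sizes, and loss weighting are assumptions, not the paper's architecture.

```python
# Sketch (assumed sizes): multitask enhancement DNN with a suppression-rule
# head and a secondary speech-presence-probability head.
import torch
import torch.nn as nn

class MultitaskEnhancer(nn.Module):
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_bins, hidden, batch_first=True)
        self.mask_head = nn.Linear(hidden, n_bins)  # primary: suppression rule
        self.spp_head = nn.Linear(hidden, n_bins)   # secondary: speech presence prob.

    def forward(self, noisy_spec):                  # (B, T, n_bins)
        h, _ = self.rnn(noisy_spec)
        mask = torch.sigmoid(self.mask_head(h))
        spp = torch.sigmoid(self.spp_head(h))
        return mask, spp

def multitask_loss(mask, spp, ideal_mask, spp_target, weight=0.3):
    primary = nn.functional.mse_loss(mask, ideal_mask)
    secondary = nn.functional.binary_cross_entropy(spp, spp_target)
    return primary + weight * secondary
```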
Microphone array signal processing: beyond the beamformer
Array signal processing is a well-established area of research, spanning from phased array antennas in the middle of the last century to hands-free audio in recent years. As devices incorporating microphone arrays begin to appear in the home, new practical challenges arise for well-known signal processing problems such as source localization and beamforming. In this talk, we consider some new algorithms that use multichannel observations outside the common beamforming paradigm. These include dereverberation using spatiotemporal averaging, acoustic channel shortening, and the acoustic Rake receiver, with relevant audio examples. We also investigate the problem of localizing reflecting boundaries in an acoustic space by considering the time of arrival of first-order reflections. Such algorithms are expected not to replace but to complement beamforming in new and robust future applications.
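For context, the common beamforming paradigm the talk looks beyond can be summarized by a delay-and-sum beamformer; a minimal frequency-domain sketch follows, with the array size, delays, and FFT length chosen purely for illustration.

```python
# Sketch: frequency-domain delay-and-sum beamforming for a small array.
# x: (M, T) time-domain signals from M microphones; delays in samples.
import numpy as np

def delay_and_sum(x, delays, n_fft=512):
    X = np.fft.rfft(x, n=n_fft, axis=1)                     # (M, n_fft//2 + 1)
    freqs = np.fft.rfftfreq(n_fft)                          # normalized frequency
    steering = np.exp(-2j * np.pi * np.outer(delays, freqs))
    Y = np.mean(steering.conj() * X, axis=0)                # align channels and average
    return np.fft.irfft(Y, n=n_fft)

# Example: 4 mics with assumed per-channel delays toward the look direction.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))
y = delay_and_sum(x, delays=np.array([0.0, 1.5, 3.0, 4.5]))
```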
Blind Multi-Microphone Noise Reduction and Dereverberation Algorithms
Exploring Richer Sequence Models in Speech and Language Processing
Conditional and other feature-based models have become an increasingly popular methodology for combining evidence in speech and language processing. As one example, Conditional Random Fields have been shown by several research groups to provide good performance on several tasks via discriminative training of weighted combinations of feature descriptions over the input. CRFs with linear-chain structures have been useful for sequence labeling tasks such as phone recognition or named entity recognition. As we start to tackle problems of increasing complexity, it makes sense to investigate models that move beyond linear-chain CRFs in various ways -- for example, by considering richer graphical model structures to describe more complex interactions between linguistic variables, or using CRF classifiers within a larger learning framework. In this talk, I will describe recent research projects in the Speech and Language Technologies (SLaTe) Lab at Ohio State; each takes the basic CRF paradigm in a slightly different direction. The talk will describe two models for speech processing: Boundary-Factored CRFs, an extension of Segmental CRFs that allows for fast processing of features related to state transitions, and Factorized CRFs, which we used to investigate articulatory-feature alignment. I will also discuss how CRFs play a role in a semi-supervised framework for event coreference resolution within clinical notes found in electronic medical records. Joint work with Yanzhang He, Rohit Prabhavalkar, Karen Livescu, Preethi Raghavan, and Albert Lai.
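As background for the linear-chain CRFs discussed above, here is a small NumPy sketch of Viterbi decoding given per-frame label scores and transition scores; the random score matrices are placeholders, not features from the described models.

```python
# Sketch: Viterbi decoding for a linear-chain CRF.
# unary: (T, K) per-frame label scores; trans: (K, K) transition scores.
import numpy as np

def viterbi(unary, trans):
    T, K = unary.shape
    delta = unary[0].copy()                # best score of paths ending in each label
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + trans      # (prev_label, cur_label) path scores
        backptr[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + unary[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):          # trace back the best label sequence
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
labels = viterbi(rng.standard_normal((10, 5)), rng.standard_normal((5, 5)))
```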
Dereverberation Suppression for Improved Speech Recognition and Human Perception
The factors that harm speech recognition results for untethered users are ambient noise and reverberation. While we have fairly sophisticated noise suppression algorithms, dereverberation is still an unsolved problem due to the difficulty of estimating and tracking changes in the room response model. Sound capture with microphone arrays provides partial dereverberation and ambient noise reduction due to better directivity. This improves speech recognition results, but the WER is still higher than with a close-talk microphone. This talk will present results from the summer internship of Daniel Allred at MSR and is a follow-up to research done last summer. A full implementation of the dereverberation suppression algorithm designed last summer has been tested in actual room environments. We will present the possible improvements that can be achieved using our algorithm with proper parameter estimation. We also performed some preliminary studies of human perception of various reverberation conditions and the use of our algorithm to alleviate those conditions. Results of comparative MOS tests will be shown, followed by an evaluation of what these results mean for future research on dereverberation algorithms for real-time communication channels.
Deep Neural Networks for Speech and Image Processing
Neural networks are experiencing a renaissance, thanks to a new mathematical formulation, known as restricted Boltzmann machines, and the availability of powerful GPUs and increased processing power. Unlike past neural networks, these new ones can have many layers and thus are called 'deep neural networks'; and because they are a machine learning technique, the technology is also known as 'deep learning.' In this talk I'll describe this new formulation and its signal-processing application in such fields as speech recognition and image recognition. In all these applications, deep neural networks have resulted in significant reductions in error rate. This success has sparked great interest from computer scientists, who are also eager to learn from neuroscientists how neurons in the brain work.
Speech and language: the crown jewel of AI with Dr. Xuedong Huang
Episode 76 | May 15, 2019
When was the last time you had a meaningful conversation with your computer… and felt like it truly understood you? Well, if Dr. Xuedong Huang, a Microsoft Technical Fellow and head of Microsoft’s Speech and Language group, is successful, you will. And if his track record holds true, it’ll be sooner than you think!
On today’s podcast, Dr. Huang talks about his role as Microsoft’s Chief Speech Scientist, gives us some inside details on the latest milestones in speech and language technology, and explains how mastering speech recognition, translation and conversation will move machines further along the path from “perceptive AI” to “cognitive AI” and that much closer to truly human intelligence.
In-Car Speech User Interfaces and their Effects on Driving Performance
Ubiquitous computing and speech user interaction are starting to play an increasingly important role in vehicles. Given the large amount of time that people spend behind the wheel, and the availability of computational resources that can now operate inside a vehicle, many companies have been introducing a myriad of mobile services and functionalities for drivers into the consumer market, such as hands-free voice dialing and GPS navigation. Through our work at the University of New Hampshire, ubiquitous computing and speech user interaction now also help law enforcement officers perform their everyday jobs: our Project54 system, which integrates devices in police cruisers and allows officers to control these devices using a speech user interface, has been deployed in over 1,000 vehicles. However, the effect of these technologies on the driving performance of users has not been adequately addressed in the research literature. A related problem is determining how to integrate these technologies so as to reduce the threat of accidents. Ideally, speech interaction should not impair the primary visual and cognitive task of driving. However, in a recent study investigating how characteristics of the speech user interface can affect driving performance, we found that the accuracy of the recognizer, as well as its interaction with the use of the push-to-talk button, can significantly affect driving performance. This talk will discuss our work on quantifying the influence of speech user interface characteristics, road conditions, and driver psychological state on driving performance, using a state-of-the-art driving simulator, an eye-gaze tracker, and physiological metrics.
Recognizing a Million Voices: Low Dimensional Audio Representations for Speaker Identification
Recent advances in speaker verification technology have resulted in dramatic performance improvements in both speed and accuracy. Over the past few years, error rates have decreased by a factor of 5 or more. At the same time, the new techniques have resulted in massive speed-ups, which have increased the scale of viable speaker-ID systems by several orders of magnitude. These improvements stem from a recent shift in the speaker modeling paradigm. Only a few years ago, the model for each individual speaker was trained using data from only that particular speaker. Now, we make use of large speaker-labeled databases to learn distributions describing inter- and intra-speaker variability. This allows us to reveal the speech characteristics that are important for discriminating between speakers. During the 2008 JHU summer workshop, our team found that speech utterances can be encoded into low-dimensional fixed-length vectors that preserve information about speaker identity. This concept of so-called 'i-vectors', which now forms the basis of state-of-the-art systems, enabled new machine learning approaches to be applied to the speaker identification problem. Inter- and intra-speaker variability can now be easily modeled using Bayesian approaches, which leads to superior performance. New training strategies can also benefit from the simpler statistical model form and the inherent speed-up. In our most recent work, we have retrained the hyperparameters of our Bayesian model using a discriminative objective function that directly addresses the task in speaker verification: discrimination between same-speaker and different-speaker trials. This is the first time such discriminative training has been successfully applied to the speaker verification task.
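A minimal sketch of scoring a verification trial between two fixed-length utterance vectors (such as i-vectors) with cosine similarity; the dimensionality and decision threshold are placeholders, and the Bayesian and discriminatively trained backends described in the talk are not shown.

```python
# Sketch: cosine scoring of a same-speaker / different-speaker trial
# between two fixed-length utterance embeddings (e.g., i-vectors).
import numpy as np

def cosine_score(w_enroll, w_test):
    return float(w_enroll @ w_test /
                 (np.linalg.norm(w_enroll) * np.linalg.norm(w_test)))

rng = np.random.default_rng(0)
enroll, test = rng.standard_normal(400), rng.standard_normal(400)  # 400-dim (illustrative)
same_speaker = cosine_score(enroll, test) > 0.5                    # threshold is a placeholder
```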
A Noise-Robust Speech Recognition Method
This presentation proposes a noise-robust speech recognition method composed of weak noise suppression (NS) and weak Vector Taylor Series Adaptation (VTSA). The proposed method compensates for the defects of NS and VTSA and retains only their advantages. The weak NS reduces the distortion from over-suppression that may accompany noise-suppressed speech. The weak VTSA avoids over-adaptation by offsetting the part of acoustic-model adaptation that corresponds to the suppressed noise. Evaluation results on the AURORA2 database show that the proposed method achieves up to 1.2 points higher word accuracy (87.4%) and is consistently better than its counterpart with NS.
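The paper's weak NS is not reproduced here; as a generic sketch of the underlying intuition, spectral subtraction with a conservative subtraction factor and a spectral floor limits the distortion that aggressive suppression would introduce (the factor and floor values are assumptions).

```python
# Generic sketch (not the paper's method): conservative spectral subtraction.
# A small subtraction factor and a spectral floor keep suppression "weak",
# trading residual noise for less speech distortion.
import numpy as np

def weak_spectral_subtraction(noisy_mag, noise_mag, alpha=1.0, floor=0.1):
    clean_est = noisy_mag - alpha * noise_mag           # subtract noise estimate
    return np.maximum(clean_est, floor * noisy_mag)     # floor avoids over-suppression
```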
HMM-based Speech Synthesis: Fundamentals and Its Recent Advances
The task of speech synthesis is to convert normal language text into speech. In recent years, the hidden Markov model (HMM) has been successfully applied to acoustic modeling for speech synthesis, and HMM-based parametric speech synthesis has become a mainstream speech synthesis method. This method is able to synthesize highly intelligible and smooth speech sounds. Another significant advantage of this model-based parametric approach is that it makes speech synthesis far more flexible than the conventional unit selection and waveform concatenation approach. This talk will first introduce the overall HMM synthesis system architecture developed at USTC. Then, some key techniques will be described, including the vocoder, acoustic modeling, the parameter generation algorithm, MSD-HMM for F0 modeling, context-dependent model training, etc. Our method will be compared with the unit selection approach, and its flexibility in controlling voice characteristics will also be presented. The second part of this talk will describe some recent advances in HMM-based speech synthesis at the USTC speech group. The methods to be described include: 1) articulatory control of HMM-based speech synthesis, which further improves the flexibility of HMM-based speech synthesis by integrating phonetic knowledge; 2) LPS-GV and minimum-KLD based parameter generation, which alleviates the over-smoothing of generated spectral features and improves the naturalness of synthetic speech; and 3) a hybrid HMM-based/unit-selection approach which achieves excellent performance in the Blizzard Challenge speech synthesis evaluation events of recent years.
Should Machines Emulate Human Speech Recognition?
Machine-based, automatic speech recognition (ASR) systems decode the acoustic signal by associating each time frame with a set of phonetic-segment possibilities. And from such matrices of segment probabilities, word hypotheses are formed. This segment-based, serial time-frame approach has been standard practice in ASR for many years. Although ASR's reliability has improved dramatically in recent years, such advances have often relied on huge amounts of training material and an expert team of developers. Might there be a simpler, faster way to develop ASR applications, one that adapts quickly to novel linguistic situations and challenging acoustic environments? It is the thesis of this presentation that future-generation ASR should be based (in part) on strategies used by human listeners to decode the speech signal. A comprehensive theoretical framework will be described, one based on a variety of perceptual, statistical and machine-learning studies. This Multi-Tier framework focuses on the interaction across different levels of linguistic organization. Words are composed of more than segments, and utterances consist of (far) more than words. In Multi-Tier Theory, the syllable serves as the interface between sound (as well as vision) and meaning. Units smaller than the syllable (such as the segment, and articulatory-acoustic features) combine with larger units (e.g., the lexeme and prosodic phrase) to provide a more balanced perspective than afforded by the conventional word/segment framework used in ASR. The presentation will consider (in some detail) how the brain decodes consonants, and how such knowledge can be used to deduce the perceptual flow of phonetic processing. The presentation will conclude with a discussion of how human speech-decoding strategies can (realistically) be used to improve the performance of automatic speech recognition (in machines).
New Directions in Robust Automatic Speech Recognition
As speech recognition technology is transferred from the laboratory to the marketplace, robustness in recognition is becoming increasingly important. This talk will review and discuss several classical and contemporary approaches to robust speech recognition. The most tractable types of environmental degradation are produced by quasi-stationary additive noise and quasi-stationary linear filtering. These distortions can be largely ameliorated by the classical techniques of cepstral high-pass filtering (as exemplified by cepstral mean normalization and RASTA filtering), as well as by techniques that develop statistical models of the distortion (such as codeword-dependent cepstral normalization and vector Taylor series expansion). Nevertheless, these types of approaches fail to provide much useful improvement when speech is degraded by transient or non-stationary noise such as background music or speech. We describe and compare the effectiveness of techniques based on missing-feature compensation, multi-band analysis, feature combination, and physiologically-motivated auditory scene analysis toward providing increased recognition accuracy in difficult acoustical environments.
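Cepstral mean normalization, one of the classical techniques cited above, amounts to subtracting the per-utterance mean of each cepstral coefficient; here is a minimal sketch (the optional variance normalization is an addition for illustration).

```python
# Sketch: cepstral mean (and optional variance) normalization per utterance.
# cepstra: (T, D) MFCC-like features for one utterance.
import numpy as np

def cmvn(cepstra, norm_var=False):
    normalized = cepstra - cepstra.mean(axis=0)       # removes stationary channel effects
    if norm_var:
        normalized /= (cepstra.std(axis=0) + 1e-8)
    return normalized
```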
Rapid Language Portability for Speech Processing Systems
With the growing demand for speech processing systems in many different languages, there is still a significant bottleneck in building recognition and synthesis support for new languages. The SPICE project is aimed at providing web-based, easy-to-use tools for the non-expert to build acoustic and language models for speech recognition and synthesis systems in new languages. This work has required new research into better selection of prompting data, lexicon construction, and multilingual acoustic modeling. Where possible, synthesis and recognition models are shared. This talk gives an overview of the system and highlights the specific research issues that have been addressed and what still needs to be done. The system has already been used successfully for some 25 languages. (Joint work with Tanja Schultz.)
Making Voicebots Work for Accents
Voice-driven automated agents such as personal assistants are becoming increasingly popular. However, in a multilingual and multi-cultural country like India, deploying such agents to engage with large sections of the population is highly challenging. A major hindrance in this regard is the difficulty the agents would face in understanding varying speech accents of the users. Even when the language of interaction with the underlying automatic speech recognition (ASR) system is restricted to a lingua franca (such as English), the accent of the speaker can vary dramatically based on their cultural and linguistic background, posing a fundamental challenge for ASR systems. Tackling this challenge will be a necessary first step towards building socially accepted and commercially successful agents in the Indian context.
The main focus of this project will be to take this first step, by improving state-of-the-art performance of ASR systems on accented speech - specifically, speech with Indian accents. We shall develop deep neural network based acoustic models that will be trained using not only accented speech data but also speech in the native languages associated with the accent. We shall also develop a tool that will be trained to identify various Indian accents automatically. Finally, we shall investigate how accented-speech-ASR can be effectively incorporated into intelligent agents to help them act in socio-culturally appropriate ways.
Multi-rate neural networks for efficient acoustic modeling
In sequence recognition, the problem of long-span dependency in input sequences is typically tackled using recurrent neural network architectures, and robustness to sequential distortions is achieved using training data representative of a variety of these distortions. However, both of these solutions substantially increase training time. Thus, low computational complexity during training is critical for acoustic modeling. This talk proposes the use of multi-rate neural network architectures to satisfy this design requirement of computational efficiency. In these architectures the network is partitioned into groups of units operating at various sampling rates. As the network evaluates certain groups only once every few time steps, the computational cost is reduced. This talk will focus on the multi-rate feed-forward convolutional architecture. It will present results on several large vocabulary continuous speech recognition (LVCSR) tasks, with training data ranging from 3 to 1800 hours, to show the effectiveness of this architecture in efficiently learning wider temporal dependencies in both small and large data scenarios. Further, it will discuss the use of this architecture for robust acoustic modeling in far-field environments, where this model was shown to provide state-of-the-art results in the ASpIRE far-field recognition challenge. The talk will also discuss some preliminary results of multi-rate recurrent neural network based acoustic models.
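A rough NumPy sketch of the multi-rate idea, in which a group of units is evaluated only once every few frames and its output is held in between; the layer type, rate, and dimensions are illustrative assumptions rather than the architecture in the talk.

```python
# Sketch: evaluate a unit group only every `rate` frames and hold its output,
# reducing computation for that group roughly by a factor of `rate`.
import numpy as np

def multirate_layer(x, weight, rate=3):
    T = x.shape[0]
    outputs = np.zeros((T, weight.shape[1]))
    last = np.zeros(weight.shape[1])
    for t in range(T):
        if t % rate == 0:                   # compute only at the slow rate
            last = np.tanh(x[t] @ weight)
        outputs[t] = last                   # otherwise hold the previous output
    return outputs

rng = np.random.default_rng(0)
y = multirate_layer(rng.standard_normal((100, 40)), rng.standard_normal((40, 64)))
```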
Speaker Diarization: Optimal Clustering and Learning Speaker Embeddings
Speaker diarization consists of automatically partitioning an input audio stream into homogeneous segments (segmentation) and grouping the segments that belong to the same speaker (speaker clustering). This process can enhance readability by structuring an audio document, or provide the speaker's true identity when used in conjunction with a speaker recognition system. In this seminar I will talk about two new methods: ILP clustering and speaker embeddings. In speaker clustering, a major problem with greedy agglomerative hierarchical clustering (HAC) is that it does not guarantee an optimal solution. I propose a new clustering model (called ILP clustering) that recasts clustering as a linear program, i.e., an objective function subject to linear equality and/or inequality constraints. An Integer Linear Programming (ILP) solver can then be used to search for the optimal solution over the whole problem. In the second part, I propose to learn a set of high-level feature representations through deep learning, referred to as speaker embeddings. Speaker embedding features are taken from the hidden-layer neuron activations of Deep Neural Networks (DNNs) trained as classifiers to recognize a thousand speaker identities in a training set. Although learned through identification, the speaker embeddings are shown to be effective for speaker verification, in particular for recognizing speakers unseen in the training set. The experiments were conducted on the ETAPE corpus of French broadcast news, where these new methods based on ILP and speaker embeddings decrease DER by 4.79 points over the baseline diarization system based on HAC/GMM.
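For contrast with the proposed ILP formulation, the greedy HAC baseline over segment embeddings can be sketched in a few lines of SciPy; the cosine metric, average linkage, and distance threshold here are assumptions for illustration, not the seminar's configuration.

```python
# Sketch of the greedy HAC baseline over segment embeddings (the approach
# the ILP formulation is proposed to improve on). Threshold is illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
segment_embeddings = rng.standard_normal((50, 128))   # 50 segments, 128-dim embeddings

Z = linkage(segment_embeddings, method='average', metric='cosine')
speaker_labels = fcluster(Z, t=0.7, criterion='distance')  # one speaker label per segment
```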
Frontiers in Speech and Language
The last few years have witnessed a renaissance in multiple areas of speech and language processing. In speech recognition, deep neural networks have led to significant performance improvements; in language processing, the idea of continuous-space representations of words and language has become mainstream; and dialog systems have advanced to the point where automated personal assistants are now everyday fare on mobile devices. In this session, we bring together researchers from the different disciplines of speech and language processing to discuss the key ideas that have made this possible, and the remaining challenges and next generation of applications.
Towards Spoken Term Discovery at Scale with Zero Resources
The spoken term discovery task takes speech as input and identifies terms of possible interest. The challenge is to perform this task efficiently on large amounts of speech with zero resources (no training data and no dictionaries), where we must fall back on more basic properties of language. We find that long (~1 s) repetitions tend to be contentful phrases (e.g. University of Pennsylvania) and propose an algorithm to search for these long repetitions without first recognizing the speech. To address efficiency concerns, we take advantage of (i) sparse feature representations and (ii) the inherently low occurrence frequency of long content terms to achieve orders-of-magnitude speedup relative to the prior art. We frame our evaluation in the context of spoken document information retrieval, and demonstrate our method's competence at identifying repeated terms in conversational telephone speech.
Multi-microphone Dereverberation and Intelligibility Estimation in Speech Processing
When speech signals are captured by one or more microphones in realistic acoustic environments, they will be contaminated by noise due to surrounding sound sources and by reverberation due to reflections off walls and other surfaces. Noise and reverberation can have detrimental effects on the perceptual experience of a listener and, in more severe cases, can cause intelligibility loss. Many signal processing applications, such as speech codecs and speech recognizers, deteriorate rapidly in performance as noise and reverberation levels increase. Consequently, the challenging problems of noise reduction and dereverberation have received a great deal of attention in research, especially with the advent of mobile telephony and voice over IP. Multi-microphone speech dereverberation forms the topic of the first part of this talk. Two alternative methods will be introduced. The first method is based on the source-filter model of speech production, while the second approaches the problem through blind identification and inversion of the room impulse responses. Simulation results will be presented to demonstrate the methods and to facilitate a comparison between them in terms of dereverberation performance. In the second part, the talk will focus on subject-based and automatic estimation of intelligibility in noisy and processed speech. In particular, the Bayesian Adaptive Speech Intelligibility Estimation (BASIE) method will be presented. BASIE is a tool for rapid subject-based estimation of a given speech reception threshold (SRT) and the slope at that threshold of multiple psychometric functions for speech intelligibility in noise. The core of BASIE is an adaptive Bayesian procedure, which adjusts the signal-to-noise ratio at each subsequent stimulus such that the expected variance of the threshold and slope estimates is minimised. Furthermore, strategies for using BASIE to evaluate the effects of speech processing algorithms on intelligibility will be given, along with two illustrative examples for different noise reduction methods with supporting listening experiments.
Soft Margin Estimation for Automatic Speech Recognition
In this study, a new discriminative learning framework, called soft margin estimation (SME), is proposed for estimating the parameters of continuous density hidden Markov models (HMMs). The proposed method makes direct use of the successful ideas of margin in support vector machines to improve generalization capability, and of decision feedback learning in discriminative training to enhance model separation in classifier design. SME directly maximizes the separation of competing models so that testing samples still reach a correct decision if their deviation from the training samples is within a safe margin. Frame and utterance selection are integrated into a unified framework to select the training utterances and frames critical for discriminating competing models. SME offers a flexible and rigorous framework to facilitate the incorporation of new margin-based optimization criteria into HMM training. The choice of various loss functions is illustrated and different kinds of separation measures are defined under a unified SME framework. SME is also shown to be able to jointly optimize feature extraction and HMMs. Both the generalized probabilistic descent algorithm and the Extended Baum-Welch algorithm are applied to solve SME. SME has demonstrated its advantage over other discriminative training methods in several speech recognition tasks. Tested on the TIDIGITS digit recognition task, the proposed SME approach achieves a string accuracy of 99.61%, a 4.11% WER reduction from the MLE models. The generalization of SME was also demonstrated on the Aurora 2 robust speech recognition task, with around a 30% relative WER reduction from the clean-trained baseline.
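As a loose sketch of the margin idea only (not the exact SME objective), a hinge-style loss over a per-utterance separation measure d penalizes utterances whose separation from competing models falls inside a margin rho; the values below are purely illustrative.

```python
# Loose sketch of a margin-based criterion: penalize utterances whose
# separation d (correct vs. competing model score) is below a margin rho.
import numpy as np

def soft_margin_loss(d, rho=1.0):
    return np.mean(np.maximum(0.0, rho - d))

d = np.array([2.3, 0.4, -0.1, 1.7])   # illustrative separation measures
loss = soft_margin_loss(d)
```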
A Smartphone as Your Third Ear
We humans are capable of remembering, recognizing, and acting upon hundreds of thousands of different types of acoustic events on a day-to-day basis. Decades of research on acoustic sensing have led to the creation of systems that now understand speech (e.g., a personal assistant like the iPhone’s Siri, or the voice-activated search feature from Google), recognize the speaker, and find a song (e.g., Shazam). However, apart from speech, music, and some application-specific sounds, the problem of recognizing the variety of general-purpose sounds that a mobile device encounters all the time has remained unsolved. The goal of this research is to build a platform that automatically creates classifiers that recognize general-purpose acoustic events on mobile devices. As these classifiers are meant to run on mobile devices, the technical goals include energy efficiency, meeting timing constraints, and leveraging user contexts such as the location and position of the mobile device in order to improve classification accuracy. With this goal in mind, we have built a general-purpose, energy-efficient, and context-aware acoustic event detection platform for mobile devices called 'Auditeur'. Auditeur enables mobile application developers to have their app register for and get notified of a wide variety of acoustic events. Auditeur is backed by a cloud service to store crowd-contributed sound clips and to generate an energy-efficient and context-aware classification plan for the mobile device. When an acoustic event type has been registered, the mobile device instantiates the necessary acoustic processing modules and wires them together to dynamically form an acoustic processing pipeline in accordance with the classification plan. The mobile device then captures, processes, and classifies acoustic events locally and efficiently. Our analysis of user-contributed empirical data shows that Auditeur's energy-aware acoustic feature selection algorithm is capable of increasing device lifetime by 33.4% while sacrificing less than 2% of the maximum achievable accuracy. We implement seven apps with Auditeur and deploy them in real-world scenarios to demonstrate that Auditeur is versatile, 11.04%-441.42% less power hungry, and 10.71%-13.86% more accurate in detecting acoustic events, compared to state-of-the-art techniques. We perform a user study involving 15 participants to demonstrate that even a novice programmer can implement the core logic of an interesting app with Auditeur in less than 30 minutes, using only 15-20 lines of Java code.
Redesigning Neural Architectures for Sequence to Sequence Learning
The Encoder-Decoder model with soft attention is now the de facto standard for sequence to sequence learning, having enjoyed early success in tasks like translation, error correction, and speech recognition. In this talk, I will present a critique of various aspects of this popular model, including its soft attention mechanism, local loss function, and sequential decoding. I will present a new Posterior Attention Network for a more transparent joint attention that provides easy gains on several translation and morphological inflection tasks. Next, I will expose a little-known problem of miscalibration in state-of-the-art neural machine translation (NMT) systems. For structured outputs as in NMT, calibration is important not just for reliable confidence in predictions, but also for the proper functioning of beam-search inference. I will discuss reasons for miscalibration and some fixes. Finally, I will summarize recent research efforts towards parallel decoding of long sequences.
Deep Learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in speech recognition, visual object recognition, object detection, and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large datasets by using the back-propagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about dramatic improvements in processing images, video, speech and audio, while recurrent nets have shone on sequential data such as text and speech. Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep learning methods are representation learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. This tutorial will introduce the fundamentals of deep learning, discuss applications, and close with challenges ahead.
Modeling high-dimensional sequences with recurrent neural networks
Humans commonly understand sequential events by giving importance to what they expect rather than exclusively to what they actually observe. The ability to fill in the blanks, useful in speech recognition to favor words that make sense in the current context, is particularly important in noisy conditions. In this talk, we present a probabilistic model of symbolic sequences based on a recurrent neural network that can serve as a powerful prior during information retrieval. We show that conditional distribution estimators can describe much more realistic output distributions, and we devise inference procedures to efficiently search for the most plausible annotations when the observations are partially destroyed or distorted. We demonstrate improvements in the state of the art in polyphonic music transcription, chord recognition, speech recognition, and audio source separation.
Reformulating the HMM as a trajectory model
A trajectory model, derived from the HMM by imposing an explicit relationship between static and dynamic features, is developed and evaluated. The derived model, named the trajectory-HMM, can alleviate two limitations of the standard HMM, namely i) piece-wise constant statistics within a state and ii) the conditional independence assumption of state output probabilities, without increasing the number of model parameters. In this talk, a Viterbi-type training algorithm is also derived. The model was evaluated in both speech recognition and synthesis experiments. In speaker-dependent continuous speech recognition experiments, the trajectory-HMM achieved error reductions over the standard HMM. The results of subjective listening tests show that introducing the trajectory-HMM can improve the quality of synthetic speech generated from the HMM-based speech synthesis system we have proposed.
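The explicit relationship between static and dynamic features that the trajectory-HMM exploits is the standard delta computation; here is a minimal sketch of first-order deltas computed from static features (the window width is an assumption).

```python
# Sketch: first-order dynamic (delta) features computed from static features,
# the deterministic relationship the trajectory-HMM makes explicit.
import numpy as np

def delta_features(c, width=2):
    # c: (T, D) static features; standard regression-style delta window.
    T = c.shape[0]
    denom = 2 * sum(k * k for k in range(1, width + 1))
    padded = np.pad(c, ((width, width), (0, 0)), mode='edge')
    return sum(k * (padded[width + k:T + width + k] - padded[width - k:T + width - k])
               for k in range(1, width + 1)) / denom
```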
Lattice-Based Discriminative Training: Theory and Practice
Lattice-based discriminative training techniques such as MMI and MPE have become increasingly widely used in recent years. I will review these model-based discriminative training techniques and also the newer feature-based techniques such as fMPE. I will discuss some of the practical issues that are relevant to discriminative training, such as lattice generation, lattice depth and quality, probability scaling, I-smoothing, language models, alignment consistency, and various other issues for feature-based discriminative training. I will also discuss more recent improvements such as frame-weighted MPE (MPFE), and give an overview of some recent unrelated work that I have been doing.
A Directionally Tunable but Frequency-Invariant Beamformer for an “Acoustic Velocity-Sensor Triad”
"A Directionally Tunable but Frequency-Invariant Beamformer for an [...]
"A Directionally Tunable but Frequency-Invariant Beamformer for an “Acoustic Velocity-Sensor Triad” to Enhance Speech Perception
Herein presented is a simple microphone-array beamformer that is independent of the frequency-spectra of all signals, ...all interference, and all noises. This beamformer allows/requires the listener to tune the desired azimuth-elevation “look direction.” No prior information is needed of the interference. The beamformer deploys a physically compact triad of three collocated but orthogonally oriented velocity sensors. These proposed schemes’ efficacy is verified by a jury test, using simulated data constructed with speech samples. For example, a desired speech signal, originally at a very adverse signal-to-interference-and-noise power ratio (SINR) of -30 dB, may be processed to become fully intelligible to the jury."
Symposium: Deep Learning - Alex Graves
Neural Turing Machines - Alex Graves
NIPS: Oral Session 4 - Ilya Sutskever
Sequence to Sequence Learning with Neural Networks
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increased to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence which made the optimization problem easier.
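A compact PyTorch sketch of the encoder-decoder setup described above, including the reversal of the source sequence before encoding; the single-layer LSTMs, dimensions, and vocabulary sizes are simplifications, not the paper's deep multi-layer configuration.

```python
# Sketch: LSTM encoder-decoder with reversed source tokens, as described above.
# Sizes are illustrative; the paper used deep (multi-layer) LSTMs.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt_in):
        src = torch.flip(src, dims=[1])              # reverse source word order
        _, state = self.encoder(self.src_emb(src))   # fixed-size "thought vector" state
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec_out)                     # logits over target vocabulary

model = Seq2Seq(src_vocab=10000, tgt_vocab=10000)
src = torch.randint(0, 10000, (2, 7))                # batch of 2 source sentences
tgt_in = torch.randint(0, 10000, (2, 9))
logits = model(src, tgt_in)                          # (2, 9, 10000)
```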
HDSI Unsupervised Deep Learning Tutorial - Alex Graves