Deep Learning for Speech Recognition (Adam Coates, Baidu)
The talks at the Deep Learning School on September 24/25, 2016 were amazing. I clipped out individual talks from the full live streams and provided links to each below, in case that's useful for people who want to watch specific talks several times (like I do). Please check out the official website (http://www.bayareadlschool.org) and the full live streams below.
Having read, watched, and presented deep learning material over the past few years, I have to say that this is one of the best collections of introductory deep learning talks I've yet encountered. Here are links to the individual talks and the full live streams for the two days:
Go to http://www.bayareadlschool.org for more information on the event, speaker bios, slides, etc. Huge thanks to the organizers (Shubho Sengupta et al) for making this event happen.
The Eleventh HOPE (2016): Coding by Voice with Open Source Speech Recognition
Friday, July 22, 2016: 8:00 pm (Friedman): Carpal tunnel and repetitive strain injuries can prevent programmers from typing for months at a time. Fortunately, it is possible to replace the keyboard with speech recognition - David writes Linux systems code by voice. The key is to develop a voice grammar customized for programming. A community has evolved around hacking the commercial Dragon NaturallySpeaking to use custom grammars, but this method suffers from fragmentation, a steep learning curve, and frustrating installation difficulties. In an attempt to make voice coding more accessible, David created a new speech recognition system called Silvius, built on open-source software with free speech models. It can run on cloud servers for ease of setup, or locally for the best latency. He and his collaborators have also prototyped a hardware dongle which types Silvius keystrokes using a fake USB keyboard, and requires no software installation. This talk will include live voice-coding demos with both Dragon and Silvius. The hope is that Silvius will lower the bar for experimentation and innovation, and encourage ordinary programmers to try voice coding, instead of waiting until a crippling injury throws them in at the deep end.
David Williams-King
Speech Emotion Recognition with Convolutional Neural Networks
Speech emotion recognition promises to play an important role in fields such as healthcare, security, and human-computer interaction (HCI). This talk examines various convolutional neural network architectures for recognizing emotion in utterances in the Chinese language. Experiments are conducted with log-Mel spectral features, pitch, and energy, along with voice activity detection. Further experiments are conducted with spectrograms of the speech utterances. Different pooling operations are also investigated. Finally, preliminary experiments are conducted on cross-language emotion recognition between Chinese and English.
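The talk's exact feature pipeline is not given here; as a minimal sketch of the log-Mel spectrum features mentioned above, the following Python snippet computes them with librosa (the file name, 25 ms/10 ms framing, and 40 Mel bands are illustrative assumptions, not values from the talk).

```python
# Minimal sketch: log-Mel spectrogram features for one utterance.
# Assumptions (not from the talk): 25 ms windows, 10 ms hop, 40 Mel bands.
import librosa
import numpy as np

def log_mel_features(wav_path, sr=16000, n_mels=40):
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),       # 25 ms analysis window
        hop_length=int(0.010 * sr),  # 10 ms frame shift
        n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)  # (n_mels, n_frames)
    return log_mel.T                                 # (n_frames, n_mels)

# features = log_mel_features("utterance.wav")  # hypothetical file name
```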
Automatic Speech Recognition (ASR) is the task of transducing raw audio signals of spoken language into text transcriptions. This talk covers the history of ASR models, from Gaussian Mixtures to attention-augmented RNNs, the basic linguistics of speech, and the various input and output representations frequently employed.
Emotion Detection from Speech Signals
Despite the great progress made in artificial intelligence, we are still far from having natural interaction between man and machine, because the machine does not understand the emotional state of the speaker. Speech emotion detection, which aims to recognize emotional states from the speech signal, has been drawing increasing attention. The task of speech emotion recognition is very challenging, because it is not clear which speech features are most powerful in distinguishing between emotions. We utilize deep neural networks to detect the emotional state of each speech segment in an utterance and then combine the segment-level results to form the final emotion recognition result. The system produces promising results on both clean speech and speech in a gaming scenario.
Speech Recognition Breakthrough for the Spoken, Translated Word
Chief Research Officer Rick Rashid demonstrates a speech recognition breakthrough via machine translation that converts his spoken English words into computer-generated Chinese. The breakthrough is patterned after deep neural networks and significantly reduces errors in spoken as well as written translation.
Separating simultaneous speech signals from a mixture is a well-studied problem. There are two major approaches: blind source separation and spatial filtering. The first relies on the statistical independence and super-Gaussian distribution of the speech signals. Spatial filtering uses the fact that the speech sources are separated in space. This talk presents the results of a summer internship in which both approaches are combined to maximize source separation. Applications for speech separation include gaming, communication, and voice control.
Automatic Speech Emotion Recognition Using Recurrent Neural Networks with Local Attention
Automatic emotion recognition from speech is a challenging task which relies significantly on the emotional relevance of specific features extracted from the speech signal. In this study, our goal is to use deep learning to automatically discover emotionally relevant features. It is shown that using a deep Recurrent Neural Network (RNN), we can learn both the short-time frame-level acoustic features that are emotionally relevant and an appropriate temporal aggregation of those features into a compact sentence-level representation. Moreover, we propose a novel strategy for feature pooling over time using an attention mechanism with the RNN, which is able to focus on the local regions of a speech signal that are more emotionally salient. The proposed solution was tested on the IEMOCAP emotion corpus and shown to provide more accurate predictions than existing emotion recognition algorithms.
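As a rough illustration of attention-based pooling over frame-level features (not the exact model from the study), here is a minimal NumPy sketch; the dimensions and the attention parameter w are placeholders.

```python
# Minimal sketch of attention pooling over frame-level features.
# h: (T, D) frame-level RNN outputs; w: (D,) learned attention parameter.
import numpy as np

def attention_pool(h, w):
    scores = h @ w                         # one scalar score per frame
    alpha = np.exp(scores - scores.max())  # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ h                       # (D,) utterance-level representation

rng = np.random.default_rng(0)
h = rng.standard_normal((120, 128))        # 120 frames, 128-dim features (illustrative)
w = rng.standard_normal(128)
utt_vec = attention_pool(h, w)             # compact sentence-level representation
```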
The Academic Research Summit, co-organized by Microsoft Research and the Association for Computing Machinery, is a forum to foster meaningful discussion among the Indian computer science research community and raise the bar on research efforts.
The third edition of Academic Research Summit was held at the International Institute of Information Technology (IIIT) Hyderabad on the 24th and 25th of January 2018.
The agenda included keynotes and talks from distinguished researchers from India and across the world. The summit also had sessions focused on specific topics related to the theme of Artificial Intelligence: A Future with AI.
Real-time Single-channel Speech Enhancement with Recurrent Neural Networks
Single-channel speech enhancement using deep neural networks (DNNs) has shown promising progress in recent years. In this work, we explore several aspects of neural network training that impact the objective quality of enhanced speech in a real-time setting. In particular, we base all studies on a novel recurrent neural network that enhances full-band short-time speech spectra on a single-frame-in, single-frame-out basis, a framework adopted by most classical signal processing methods. We propose two novel learning objectives that allow separate control over expected speech distortion versus noise suppression. Moreover, we study the effect of feature normalization and sequence lengths on the objective quality of enhanced speech. Finally, we compare our method with state-of-the-art methods based on statistical signal processing and deep learning, respectively.
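The two learning objectives themselves are not reproduced here; as a hedged sketch of separately weighting speech distortion against noise suppression, one common mask-based formulation looks roughly like this (the weighting alpha and the mask-based setup are assumptions for illustration, not the paper's losses).

```python
# Sketch: separate penalties for speech distortion and residual noise
# in mask-based enhancement. S, N: clean-speech and noise magnitude
# spectra (T, F); M: estimated mask in [0, 1]. alpha trades off the terms.
import numpy as np

def enhancement_loss(M, S, N, alpha=0.5):
    speech_distortion = np.mean(((1.0 - M) * S) ** 2)  # attenuated speech
    residual_noise    = np.mean((M * N) ** 2)          # noise leaking through
    return alpha * speech_distortion + (1.0 - alpha) * residual_noise
```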
Distant Speech Recognition: No Black Boxes Allowed
A complete system for distant speech recognition (DSR) typically consists of several distinct components, among them: an array of microphones for far-field sound capture; an algorithm for tracking the positions of the active speaker or speakers; a beamforming algorithm for focusing on the desired speaker and suppressing noise, reverberation, and competing speech from other speakers; a recognition engine to extract the most likely hypothesis from the output of the beamformer; a speaker adaptation component for adapting to the characteristics of a given speaker as well as to channel effects; and postfiltering to further enhance the beamformed output. Moreover, several of these components are composed of one or more subcomponents. While it is tempting to isolate and optimize each component individually, experience has shown that such an approach cannot lead to optimal performance. In this talk, we will discuss several examples of the interactions between the individual components of a DSR system. In addition, we will describe the synergies that become possible as soon as each component is no longer treated as a "black box". To wit, instead of treating each component as having solely an input and an output, it is necessary to peel back the lid and look inside. Only then does it become apparent how the individual components of a DSR system can be viewed not as separate entities but as the various organs of a complete body, and how optimal performance of such a system can be obtained. Joint work with: Kenichi Kumatani, Barbara Rauch, Friedrich Faubel, Matthias Wolfel, and Dietrich Klakow.
Emotion Recognition in Speech Signal: Experimental Study, Development and Applications
In this talk I will overview my research on emotion expression and emotion recognition in the speech signal and its applications. Two proprietary databases of emotional utterances were used in this research. The first database consists of 700 emotional utterances in English pronounced by 30 subjects portraying five emotional states: unemotional (normal), anger, happiness, sadness, and fear. The second database consists of 3660 emotional utterances in Russian by 61 subjects portraying six emotional states: unemotional, anger, happiness, sadness, fear, and surprise. An experimental study was conducted to determine how well people recognize emotions in speech. Based on the results of the experiment, the most reliable utterances were selected for feature selection and for training recognizers. Several machine learning techniques were applied to create recognition agents, including k-nearest neighbor, neural networks, and ensembles of neural networks. The agents can recognize five emotional states with the following accuracy: normal or unemotional state 55-75%, anger 70-80%, and fear 35-55%. The agents can be adapted to a particular environment depending on the parameters of the speech signal and the number of target emotional states. For a practical application, an agent was created that is able to analyze telephone-quality speech and distinguish between two emotional states (agitation, which includes anger, happiness, and fear; and calm, which includes the normal state and sadness) with an accuracy of 77%. The agent was used as part of a decision support system for prioritizing voice messages and assigning a proper human agent to respond to the message in a call center environment. I will also give a summary of other research topics in the lab, including fast pitch-synchronous segmentation of the speech signal, the use of speech analysis techniques for language learning, and video clip recognition using a joint audio-visual model.
Towards Robust Conversational Speech Recognition and Understanding
While significant progress has been made in automatic speech recognition (ASR) during the last few decades, recognizing and understanding unconstrained conversational speech remains a challenging problem. Unlike read or highly constrained speech, spontaneous conversational speech is often ungrammatical and ill-structured. As the relevant semantic notions are embedded in a set of keywords, the first goal is to propose a model training methodology for keyword spotting. A non-uniform minimum classification error (MCE) approach is proposed which achieves consistent and significant performance gains on both English and Mandarin large-scale spontaneous conversational speech (Switchboard, HKUST). Adverse acoustical environments degrade system performance substantially. Recently, acoustic models based on deep neural networks (DNNs) have shown great success. This opens new possibilities for further improving noise robustness in recognizing conversational speech. The second goal is to propose a DNN-based acoustic model that is robust to additive noise, channel distortions, and interference from competing talkers. A hybrid recurrent DNN-HMM system is proposed for robust acoustic modeling which achieves state-of-the-art performance on two benchmark datasets (Aurora-4, CHiME). To study the specific case of conversational speech recognition in the presence of a competing talker, several multi-style training setups of DNNs are investigated and a joint decoder operating on multi-talker speech is introduced. The proposed combined system outperforms the state-of-the-art 2006 IBM superhuman system on the same benchmark dataset. Even with perfect ASR, extracting semantic notions from conversational speech can be challenging due to the interference of frequently uttered disfluencies, filler and mispronounced words, etc. The third goal is to propose a robust WFST-based semantic decoder that interfaces seamlessly with ASR. Latent semantic rational kernels (LSRKs) are proposed, and substantial topic spotting performance gains are achieved on two conversational speech tasks (Switchboard, HMIHY0300).
Spontaneous Speech: Challenges and Opportunities for Parsing
Recent advances in automatic speech recognition (ASR) provide new opportunities for natural language processing of speech, including applications such as understanding, summarization, and translation. Parsing can play an important role here, but much of current parsing technology has been developed on written text. Spontaneous speech differs substantially from text, posing challenges that include the absence of punctuation and the presence of disfluencies and ASR errors. At the same time, prosodic cues in speech can provide disambiguating context beyond that available from punctuation. This talk looks at means of leveraging prosody and uncertainty models to improve parsing (and recognition) of spontaneous speech, and outlines challenges in speech processing that impact parsing.
Some Recent Advances in Gaussian Mixture Modeling for Speech Recognition
State-of-the-art Hidden Markov Model (HMM) based speech recognition systems typically use Gaussian Mixture Models (GMMs) to model the acoustic features associated with each HMM state. Due to computational, storage, and robust-estimation considerations, the covariance matrices of the Gaussians in these GMMs are typically diagonal. In this talk I will describe several new techniques to better model the acoustic features associated with an HMM state: subspace constrained GMMs (SCGMMs), non-linear volume-preserving acoustic feature space transformations, etc. Even with better models, one has to deal with mismatches between the training and test conditions. This problem can be addressed by adapting either the acoustic features or the acoustic models to reduce the mismatch. In this talk I will present several approaches to adaptation: FMAPLR (a variant of FMLLR that works well with very little adaptation data), adaptation of the front-end parameters, adaptation of SCGMMs, etc. While the ideas presented are explored and evaluated in the context of speech recognition, the talk should appeal to anyone with an interest in statistical modeling.
High-Accuracy Neural-Network Models for Speech Enhancement
In this talk we will discuss our recent work on AI techniques that improve the quality of audio signals for both machine understanding and sensory perception. Our best models utilize convolutional-recurrent neural networks. They improve the PESQ of noisy signals by 0.6 and boost SNRs by up to 34 dB in challenging capture conditions. We will compare the performance of our models with classical approaches that use statistical signal processing and with existing state-of-the-art data-driven methods that use DNNs. We will also discuss preliminary results from semi-supervised learning approaches that further improve enhancement performance.
Enriching Speech Translation: Exploiting Information Beyond Words
Current statistical speech translation approaches rely predominantly on text transcripts and do not adequately utilize the rich contextual information, such as prosody and discourse function, that is conveyed beyond words and syntax. In this talk I will introduce a novel framework for enriching speech translation with prosodic prominence and dialog acts. Our approach of incorporating rich information in speech translation is motivated by the fact that it is important to capture and convey not only what is being communicated (the words) but how it is being communicated (the context). First, I will present various techniques that we have developed for automatically detecting prosody and dialog acts from speech and text, and will survey some of the most important results of our contribution. I will then describe techniques for the integration of these rich representations in spoken language translation.
DNN-Based Online Speech Enhancement Using Multitask Learning and Suppression Rule Estimation
Most currently available speech enhancement algorithms use a statistical signal processing approach to remove the noise component from observed signals. The performance of these algorithms is thus dependent on the statistical assumptions they make about speech and noise signals, which are often inaccurate. In this work, we consider machine learning as an alternative, using deep neural networks to discover the transformation from noisy to clean speech. While DNNs are now the standard approach for acoustic modeling in speech recognition, there have been fewer studies looking at DNNs for improving signal quality for the human listener. We consider a realistic scenario where both environmental noise and room reverberation are present and where a strict real-time processing requirement is enforced by the application. We examine several structures in which a DNN can replace conventional speech enhancement systems, including end-to-end DNN regression as well as suppression rule estimation by DNNs. We also propose to use multitask learning with the estimation of bin-wise speech presence probability as the secondary task, and show that it improves enhancement performance.
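The following PyTorch sketch shows one plausible wiring of such a multitask network, with a primary suppression-rule (mask) head and a secondary bin-wise speech presence probability head; the GRU, layer sizes, and loss weighting are assumptions, not the paper's architecture.

```python
# Sketch (assumed sizes): multitask enhancement DNN with a suppression-rule
# head and a secondary speech-presence-probability head.
import torch
import torch.nn as nn

class MultitaskEnhancer(nn.Module):
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_bins, hidden, batch_first=True)
        self.mask_head = nn.Linear(hidden, n_bins)  # primary: suppression rule
        self.spp_head = nn.Linear(hidden, n_bins)   # secondary: speech presence prob.

    def forward(self, noisy_spec):                  # (B, T, n_bins)
        h, _ = self.rnn(noisy_spec)
        mask = torch.sigmoid(self.mask_head(h))
        spp = torch.sigmoid(self.spp_head(h))
        return mask, spp

def multitask_loss(mask, spp, ideal_mask, spp_target, weight=0.3):
    primary = nn.functional.mse_loss(mask, ideal_mask)
    secondary = nn.functional.binary_cross_entropy(spp, spp_target)
    return primary + weight * secondary
```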
Microphone array signal processing: beyond the beamformer
Array signal processing is a well-established area of research, spanning from phased array antennas in the middle of the last century to hands-free audio in recent years. As devices incorporating microphone arrays begin to appear in the home, new practical challenges arise for well-known signal processing problems such as source localization and beamforming. In this talk, we consider some new algorithms that use multichannel observations outside the common beamforming paradigm. These include dereverberation using spatiotemporal averaging, acoustic channel shortening, and the acoustic Rake receiver, with relevant audio examples. We also investigate the problem of localizing reflecting boundaries in an acoustic space by considering the time of arrival of first-order reflections. Such algorithms are expected not to replace but to complement beamforming in new and robust future applications.
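For context, the common beamforming paradigm the talk looks beyond can be summarized by a delay-and-sum beamformer; a minimal frequency-domain sketch follows, with the array size, delays, and FFT length chosen purely for illustration.

```python
# Sketch: frequency-domain delay-and-sum beamforming for a small array.
# x: (M, T) time-domain signals from M microphones; delays in samples.
import numpy as np

def delay_and_sum(x, delays, n_fft=512):
    X = np.fft.rfft(x, n=n_fft, axis=1)                     # (M, n_fft//2 + 1)
    freqs = np.fft.rfftfreq(n_fft)                          # normalized frequency
    steering = np.exp(-2j * np.pi * np.outer(delays, freqs))
    Y = np.mean(steering.conj() * X, axis=0)                # align channels and average
    return np.fft.irfft(Y, n=n_fft)

# Example: 4 mics with assumed per-channel delays toward the look direction.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))
y = delay_and_sum(x, delays=np.array([0.0, 1.5, 3.0, 4.5]))
```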
Blind Multi-Microphone Noise Reduction and Dereverberation Algorithms
Exploring Richer Sequence Models in Speech and Language Processing
Conditional and other feature-based models have become an increasingly popular methodology for combining evidence in speech and language processing. As one example, Conditional Random Fields have been shown by several research groups to provide good performance on several tasks via discriminative training of weighted combinations of feature descriptions over the input. CRFs with linear-chain structures have been useful for sequence labeling tasks such as phone recognition or named entity recognition. As we start to tackle problems of increasing complexity, it makes sense to investigate models that move beyond linear-chain CRFs in various ways -- for example, by considering richer graphical model structures to describe more complex interactions between linguistic variables, or using CRF classifiers within a larger learning framework. In this talk, I will describe recent research projects in the Speech and Language Technologies (SLaTe) Lab at Ohio State; each takes the basic CRF paradigm in a slightly different direction. The talk will describe two models for speech processing: Boundary-Factored CRFs, an extension of Segmental CRFs that allows for fast processing of features related to state transitions, and Factorized CRFs, which we used to investigate articulatory-feature alignment. I will also discuss how CRFs play a role in a semi-supervised framework for event coreference resolution within clinical notes found in electronic medical records. Joint work with Yanzhang He, Rohit Prabhavalkar, Karen Livescu, Preethi Raghavan, and Albert Lai.
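As background for the linear-chain CRFs discussed above, here is a small NumPy sketch of Viterbi decoding given per-frame label scores and transition scores; the random score matrices are placeholders, not features from the described models.

```python
# Sketch: Viterbi decoding for a linear-chain CRF.
# unary: (T, K) per-frame label scores; trans: (K, K) transition scores.
import numpy as np

def viterbi(unary, trans):
    T, K = unary.shape
    delta = unary[0].copy()                # best score of paths ending in each label
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + trans      # (prev_label, cur_label) path scores
        backptr[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + unary[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):          # trace back the best label sequence
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
labels = viterbi(rng.standard_normal((10, 5)), rng.standard_normal((5, 5)))
```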
Dereverberation Suppression for Improved Speech Recognition and Human Perception
The factors that harm speech recognition results for untethered users are ambient noise and reverberation. While we have fairly sophisticated noise suppression algorithms, dereverberation is still an unsolved problem due to the difficulty of estimating and tracking changes in the room response model. Sound capture with microphone arrays provides partial dereverberation and ambient noise reduction due to better directivity. This improves speech recognition results, but the WER is still higher than with a close-talk microphone. This talk will present results from the summer internship of Daniel Allred at MSR and is a follow-up to research done last summer. A full implementation of the dereverberation suppression algorithm designed last summer has been tested in actual room environments. We will present the possible improvements that can be achieved using our algorithm with proper parameter estimation. We also performed some preliminary studies of human perception of various reverberation conditions and the use of our algorithm to alleviate those conditions. Results of comparative MOS tests will be shown, followed by an evaluation of what these results mean for future research on dereverberation algorithms for real-time communication channels.
Deep Neural Networks for Speech and Image Processing
Neural networks are experiencing a renaissance, thanks to a new mathematical formulation, known as restricted Boltzmann machines, and the availability of powerful GPUs and increased processing power. Unlike past neural networks, these new ones can have many layers and thus are called 'deep neural networks'; and because they are a machine learning technique, the technology is also known as 'deep learning.' In this talk I'll describe this new formulation and its signal-processing application in such fields as speech recognition and image recognition. In all these applications, deep neural networks have resulted in significant reductions in error rate. This success has sparked great interest from computer scientists, who are also eager to learn from neuroscientists how neurons in the brain work.
Speech and language: the crown jewel of AI with Dr. Xuedong Huang
Episode 76 | May 15, 2019
When was the last time you had a meaningful conversation with your computer… and felt like it truly understood you? Well, if Dr. Xuedong Huang, a Microsoft Technical Fellow and head of Microsoft’s Speech and Language group, is successful, you will. And if his track record holds true, it’ll be sooner than you think!
On today’s podcast, Dr. Huang talks about his role as Microsoft’s Chief Speech Scientist, gives us some inside details on the latest milestones in speech and language technology, and explains how mastering speech recognition, translation and conversation will move machines further along the path from “perceptive AI” to “cognitive AI” and that much closer to truly human intelligence.
In-Car Speech User Interfaces and their Effects on Driving Performance
Ubiquitous computing and speech user interaction are starting to play an increasingly important role in vehicles. Given the large amount of time that people spend behind the wheel, and the availability of computational resources that can now operate inside a vehicle, many companies have been introducing a myriad of mobile services and functionalities for drivers into the consumer market, such as hands-free voice dialing and GPS navigation. Through our work at the University of New Hampshire, ubiquitous computing and speech user interaction now also help law enforcement officers perform their everyday jobs: our Project54 system, which integrates devices in police cruisers and allows officers to control these devices using a speech user interface, has been deployed in over 1,000 vehicles. However, the effect of these technologies on the driving performance of users has not been adequately addressed in the research literature. A related problem is determining how to integrate these technologies so as to reduce the threat of accidents. Ideally, speech interaction should not impair the primary visual and cognitive task of driving. However, in a recent study investigating how characteristics of the speech user interface can affect driving performance, we found that the accuracy of the recognizer, as well as its interaction with the use of the push-to-talk button, can significantly affect driving performance. This talk will discuss our work on quantifying the influence of speech user interface characteristics, road conditions, and driver psychological state on driving performance, using a state-of-the-art driving simulator, an eye-gaze tracker, and physiological metrics.
Recognizing a Million Voices: Low Dimensional Audio Representations for Speaker Identification
Recent advances in speaker verification technology have resulted in dramatic performance improvements in both speed and accuracy. Over the past few years, error rates have decreased by a factor of 5 or more. At the same time, the new techniques have resulted in massive speed-ups, which have increased the scale of viable speaker-ID systems by several orders of magnitude. These improvements stem from a recent shift in the speaker modeling paradigm. Only a few years ago, the model for each individual speaker was trained using data from only that particular speaker. Now, we make use of large speaker-labeled databases to learn distributions describing inter- and intra-speaker variability. This allows us to reveal the speech characteristics that are important for discriminating between speakers. During the 2008 JHU summer workshop, our team found that speech utterances can be encoded into low-dimensional fixed-length vectors that preserve information about speaker identity. This concept of so-called 'i-vectors', which now forms the basis of state-of-the-art systems, enabled new machine learning approaches to be applied to the speaker identification problem. Inter- and intra-speaker variability can now be easily modeled using Bayesian approaches, which leads to superior performance. New training strategies can also benefit from the simpler statistical model form and the inherent speed-up. In our most recent work, we have retrained the hyperparameters of our Bayesian model using a discriminative objective function that directly addresses the task in speaker verification: discrimination between same-speaker and different-speaker trials. This is the first time such discriminative training has been successfully applied to the speaker verification task.
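A minimal sketch of scoring a verification trial between two fixed-length utterance vectors (such as i-vectors) with cosine similarity; the dimensionality and decision threshold are placeholders, and the Bayesian and discriminatively trained backends described in the talk are not shown.

```python
# Sketch: cosine scoring of a same-speaker / different-speaker trial
# between two fixed-length utterance embeddings (e.g., i-vectors).
import numpy as np

def cosine_score(w_enroll, w_test):
    return float(w_enroll @ w_test /
                 (np.linalg.norm(w_enroll) * np.linalg.norm(w_test)))

rng = np.random.default_rng(0)
enroll, test = rng.standard_normal(400), rng.standard_normal(400)  # 400-dim (illustrative)
same_speaker = cosine_score(enroll, test) > 0.5                    # threshold is a placeholder
```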
A Noise-Robust Speech Recognition Method
This presentation proposes a noise-robust speech recognition method composed of weak noise suppression (NS) and weak Vector Taylor Series Adaptation (VTSA). The proposed method compensates for the defects of NS and VTSA and retains only their advantages. The weak NS reduces the distortion from over-suppression that may accompany noise-suppressed speech. The weak VTSA avoids over-adaptation by offsetting the part of acoustic-model adaptation that corresponds to the suppressed noise. Evaluation results on the AURORA2 database show that the proposed method achieves up to 1.2 points higher word accuracy (87.4%) and is consistently better than its counterpart with NS.
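The paper's weak NS is not reproduced here; as a generic sketch of the underlying intuition, spectral subtraction with a conservative subtraction factor and a spectral floor limits the distortion that aggressive suppression would introduce (the factor and floor values are assumptions).

```python
# Generic sketch (not the paper's method): conservative spectral subtraction.
# A small subtraction factor and a spectral floor keep suppression "weak",
# trading residual noise for less speech distortion.
import numpy as np

def weak_spectral_subtraction(noisy_mag, noise_mag, alpha=1.0, floor=0.1):
    clean_est = noisy_mag - alpha * noise_mag           # subtract noise estimate
    return np.maximum(clean_est, floor * noisy_mag)     # floor avoids over-suppression
```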
HMM-based Speech Synthesis: Fundamentals and Its Recent Advances
The task of speech synthesis is to convert normal language text into speech. In recent years, the hidden Markov model (HMM) has been successfully applied to acoustic modeling for speech synthesis, and HMM-based parametric speech synthesis has become a mainstream speech synthesis method. This method is able to synthesize highly intelligible and smooth speech sounds. Another significant advantage of this model-based parametric approach is that it makes speech synthesis far more flexible than the conventional unit selection and waveform concatenation approach. This talk will first introduce the overall HMM synthesis system architecture developed at USTC. Then, some key techniques will be described, including the vocoder, acoustic modeling, the parameter generation algorithm, MSD-HMM for F0 modeling, context-dependent model training, etc. Our method will be compared with the unit selection approach, and its flexibility in controlling voice characteristics will also be presented. The second part of this talk will describe some recent advances in HMM-based speech synthesis at the USTC speech group. The methods to be described include: 1) articulatory control of HMM-based speech synthesis, which further improves the flexibility of HMM-based speech synthesis by integrating phonetic knowledge; 2) LPS-GV and minimum-KLD based parameter generation, which alleviates the over-smoothing of generated spectral features and improves the naturalness of synthetic speech; and 3) a hybrid HMM-based/unit-selection approach which achieves excellent performance in the Blizzard Challenge speech synthesis evaluation events of recent years.
Should Machines Emulate Human Speech Recognition?
Machine-based, automatic speech recognition (ASR) systems decode the acoustic signal by associating each time frame with a set of phonetic-segment possibilities. And from such matrices of segment probabilities, word hypotheses are formed. This segment-based, serial time-frame approach has been standard practice in ASR for many years. Although ASR's reliability has improved dramatically in recent years, such advances have often relied on huge amounts of training material and an expert team of developers. Might there be a simpler, faster way to develop ASR applications, one that adapts quickly to novel linguistic situations and challenging acoustic environments? It is the thesis of this presentation that future-generation ASR should be based (in part) on strategies used by human listeners to decode the speech signal. A comprehensive theoretical framework will be described, one based on a variety of perceptual, statistical and machine-learning studies. This Multi-Tier framework focuses on the interaction across different levels of linguistic organization. Words are composed of more than segments, and utterances consist of (far) more than words. In Multi-Tier Theory, the syllable serves as the interface between sound (as well as vision) and meaning. Units smaller than the syllable (such as the segment, and articulatory-acoustic features) combine with larger units (e.g., the lexeme and prosodic phrase) to provide a more balanced perspective than afforded by the conventional word/segment framework used in ASR. The presentation will consider (in some detail) how the brain decodes consonants, and how such knowledge can be used to deduce the perceptual flow of phonetic processing. The presentation will conclude with a discussion of how human speech-decoding strategies can (realistically) be used to improve the performance of automatic speech recognition (in machines).
New Directions in Robust Automatic Speech Recognition
As speech recognition technology is transferred from the laboratory to the marketplace, robustness in recognition is becoming increasingly important. This talk will review and discuss several classical and contemporary approaches to robust speech recognition. The most tractable types of environmental degradation are produced by quasi-stationary additive noise and quasi-stationary linear filtering. These distortions can be largely ameliorated by the classical techniques of cepstral high-pass filtering (as exemplified by cepstral mean normalization and RASTA filtering), as well as by techniques that develop statistical models of the distortion (such as codeword-dependent cepstral normalization and vector Taylor series expansion). Nevertheless, these types of approaches fail to provide much useful improvement when speech is degraded by transient or non-stationary noise such as background music or speech. We describe and compare the effectiveness of techniques based on missing-feature compensation, multi-band analysis, feature combination, and physiologically-motivated auditory scene analysis toward providing increased recognition accuracy in difficult acoustical environments.
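Cepstral mean normalization, one of the classical techniques cited above, amounts to subtracting the per-utterance mean of each cepstral coefficient; here is a minimal sketch (the optional variance normalization is an addition for illustration).

```python
# Sketch: cepstral mean (and optional variance) normalization per utterance.
# cepstra: (T, D) MFCC-like features for one utterance.
import numpy as np

def cmvn(cepstra, norm_var=False):
    normalized = cepstra - cepstra.mean(axis=0)       # removes stationary channel effects
    if norm_var:
        normalized /= (cepstra.std(axis=0) + 1e-8)
    return normalized
```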
Rapid Language Portability for Speech Processing Systems
With the growing demand for speech processing systems in many different languages, there is still a significant bottleneck in building recognition and synthesis support for new languages. The SPICE project is aimed at providing web-based, easy-to-use tools for the non-expert to build acoustic and language models for speech recognition and synthesis systems in new languages. This work has required new research into better selection of prompting data, lexicon construction, and multilingual acoustic modeling. Where possible, synthesis and recognition models are shared. This talk gives an overview of the system and highlights the specific research issues that have been addressed and what still needs to be done. The system has already been used successfully for some 25 languages. (Joint work with Tanja Schultz.)
Making Voicebots Work for Accents
Voice-driven automated agents such as personal assistants are becoming increasingly popular. However, in a multilingual and multi-cultural country like India, deploying such agents to engage with large sections of the population is highly challenging. A major hindrance in this regard is the difficulty the agents would face in understanding varying speech accents of the users. Even when the language of interaction with the underlying automatic speech recognition (ASR) system is restricted to a lingua franca (such as English), the accent of the speaker can vary dramatically based on their cultural and linguistic background, posing a fundamental challenge for ASR systems. Tackling this challenge will be a necessary first step towards building socially accepted and commercially successful agents in the Indian context.
The main focus of this project will be to take this first step, by improving state-of-the-art performance of ASR systems on accented speech - specifically, speech with Indian accents. We shall develop deep neural network based acoustic models that will be trained using not only accented speech data but also speech in the native languages associated with the accent. We shall also develop a tool that will be trained to identify various Indian accents automatically. Finally, we shall investigate how accented-speech-ASR can be effectively incorporated into intelligent agents to help them act in socio-culturally appropriate ways.
Multi-rate neural networks for efficient acoustic modeling
In sequence recognition, the problem of long-span dependency in input sequences is typically tackled using recurrent neural network architectures, and robustness to sequential distortions is achieved using training data representative of a variety of these distortions. However, both of these solutions substantially increase training time. Thus, low computational complexity during training is critical for acoustic modeling. This talk proposes the use of multi-rate neural network architectures to satisfy this design requirement of computational efficiency. In these architectures the network is partitioned into groups of units operating at various sampling rates. As the network evaluates certain groups only once every few time steps, the computational cost is reduced. This talk will focus on the multi-rate feed-forward convolutional architecture. It will present results on several large vocabulary continuous speech recognition (LVCSR) tasks, with training data ranging from 3 to 1800 hours, to show the effectiveness of this architecture in efficiently learning wider temporal dependencies in both small and large data scenarios. Further, it will discuss the use of this architecture for robust acoustic modeling in far-field environments, where this model was shown to provide state-of-the-art results in the ASpIRE far-field recognition challenge. The talk will also discuss some preliminary results of multi-rate recurrent neural network based acoustic models.
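A rough NumPy sketch of the multi-rate idea, in which a group of units is evaluated only once every few frames and its output is held in between; the layer type, rate, and dimensions are illustrative assumptions rather than the architecture in the talk.

```python
# Sketch: evaluate a unit group only every `rate` frames and hold its output,
# reducing computation for that group roughly by a factor of `rate`.
import numpy as np

def multirate_layer(x, weight, rate=3):
    T = x.shape[0]
    outputs = np.zeros((T, weight.shape[1]))
    last = np.zeros(weight.shape[1])
    for t in range(T):
        if t % rate == 0:                   # compute only at the slow rate
            last = np.tanh(x[t] @ weight)
        outputs[t] = last                   # otherwise hold the previous output
    return outputs

rng = np.random.default_rng(0)
y = multirate_layer(rng.standard_normal((100, 40)), rng.standard_normal((40, 64)))
```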
Speaker Diarization: Optimal Clustering and Learning Speaker Embeddings
Speaker diarization consists of automatically partitioning an input audio stream into homogeneous segments (segmentation) and grouping the segments that belong to the same speaker (speaker clustering). This process can enhance readability by structuring an audio document, or provide the speaker's true identity when used in conjunction with a speaker recognition system. In this seminar I will talk about two new methods: ILP clustering and speaker embeddings. In speaker clustering, a major problem with greedy agglomerative hierarchical clustering (HAC) is that it does not guarantee an optimal solution. I propose a new clustering model (called ILP clustering) that recasts clustering as a linear program, i.e., an objective function subject to linear equality and/or inequality constraints. An Integer Linear Programming (ILP) solver can then be used to search for the optimal solution over the whole problem. In the second part, I propose to learn a set of high-level feature representations through deep learning, referred to as speaker embeddings. Speaker embedding features are taken from the hidden-layer neuron activations of Deep Neural Networks (DNNs) trained as classifiers to recognize a thousand speaker identities in a training set. Although learned through identification, the speaker embeddings are shown to be effective for speaker verification, in particular for recognizing speakers unseen in the training set. The experiments were conducted on the ETAPE corpus of French broadcast news, where these new methods based on ILP and speaker embeddings decrease DER by 4.79 points over the baseline diarization system based on HAC/GMM.
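For contrast with the proposed ILP formulation, the greedy HAC baseline over segment embeddings can be sketched in a few lines of SciPy; the cosine metric, average linkage, and distance threshold here are assumptions for illustration, not the seminar's configuration.

```python
# Sketch of the greedy HAC baseline over segment embeddings (the approach
# the ILP formulation is proposed to improve on). Threshold is illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
segment_embeddings = rng.standard_normal((50, 128))   # 50 segments, 128-dim embeddings

Z = linkage(segment_embeddings, method='average', metric='cosine')
speaker_labels = fcluster(Z, t=0.7, criterion='distance')  # one speaker label per segment
```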
Frontiers in Speech and Language
The last few years have witnessed a renaissance in multiple areas of speech and language processing. In speech recognition, deep neural networks have led to significant performance improvements; in language processing, the idea of continuous-space representations of words and language has become mainstream; and dialog systems have advanced to the point where automated personal assistants are now everyday fare on mobile devices. In this session, we bring together researchers from the different disciplines of speech and language processing to discuss the key ideas that have made this possible, and the remaining challenges and next generation of applications.
Towards Spoken Term Discovery at Scale with Zero Resources
The spoken term discovery task takes speech as input and identifies terms of possible interest. The challenge is to perform this task efficiently on large amounts of speech with zero resources (no training data and no dictionaries), where we must fall back on more basic properties of language. We find that long (~1 s) repetitions tend to be contentful phrases (e.g. University of Pennsylvania) and propose an algorithm to search for these long repetitions without first recognizing the speech. To address efficiency concerns, we take advantage of (i) sparse feature representations and (ii) the inherently low occurrence frequency of long content terms to achieve orders-of-magnitude speedup relative to the prior art. We frame our evaluation in the context of spoken document information retrieval, and demonstrate our method's competence at identifying repeated terms in conversational telephone speech.
Multi-microphone Dereverberation and Intelligibility Estimation in Speech Processing
When speech signals are captured by one or more microphones in realistic acoustic environments, they will be contaminated by noise due to surrounding sound sources and by reverberation due to reflections off walls and other surfaces. Noise and reverberation can have detrimental effects on the perceptual experience of a listener and, in more severe cases, can cause intelligibility loss. Many signal processing applications, such as speech codecs and speech recognizers, deteriorate rapidly in performance as noise and reverberation levels increase. Consequently, the challenging problems of noise reduction and dereverberation have received a great deal of attention in research, especially with the advent of mobile telephony and voice over IP. Multi-microphone speech dereverberation forms the topic of the first part of this talk. Two alternative methods will be introduced. The first method is based on the source-filter model of speech production, while the second approaches the problem through blind identification and inversion of the room impulse responses. Simulation results will be presented to demonstrate the methods and to facilitate a comparison between them in terms of dereverberation performance. In the second part, the talk will focus on subject-based and automatic estimation of intelligibility in noisy and processed speech. In particular, the Bayesian Adaptive Speech Intelligibility Estimation (BASIE) method will be presented. BASIE is a tool for rapid subject-based estimation of a given speech reception threshold (SRT) and the slope at that threshold of multiple psychometric functions for speech intelligibility in noise. The core of BASIE is an adaptive Bayesian procedure, which adjusts the signal-to-noise ratio at each subsequent stimulus such that the expected variance of the threshold and slope estimates is minimised. Furthermore, strategies for using BASIE to evaluate the effects of speech processing algorithms on intelligibility will be given, along with two illustrative examples for different noise reduction methods with supporting listening experiments.
Soft Margin Estimation for Automatic Speech Recognition
In this study, a new discriminative learning framework, called soft margin estimation (SME), is proposed for estimating the parameters of continuous density hidden Markov models (HMMs). The proposed method makes direct use of the successful ideas of margin in support vector machines to improve generalization capability, and of decision feedback learning in discriminative training to enhance model separation in classifier design. SME directly maximizes the separation of competing models so that testing samples still reach a correct decision if their deviation from the training samples is within a safe margin. Frame and utterance selection are integrated into a unified framework to select the training utterances and frames critical for discriminating competing models. SME offers a flexible and rigorous framework to facilitate the incorporation of new margin-based optimization criteria into HMM training. The choice of various loss functions is illustrated and different kinds of separation measures are defined under a unified SME framework. SME is also shown to be able to jointly optimize feature extraction and HMMs. Both the generalized probabilistic descent algorithm and the Extended Baum-Welch algorithm are applied to solve SME. SME has demonstrated its advantage over other discriminative training methods in several speech recognition tasks. Tested on the TIDIGITS digit recognition task, the proposed SME approach achieves a string accuracy of 99.61%, a 4.11% WER reduction from the MLE models. The generalization of SME was also demonstrated on the Aurora 2 robust speech recognition task, with around a 30% relative WER reduction from the clean-trained baseline.
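As a loose sketch of the margin idea only (not the exact SME objective), a hinge-style loss over a per-utterance separation measure d penalizes utterances whose separation from competing models falls inside a margin rho; the values below are purely illustrative.

```python
# Loose sketch of a margin-based criterion: penalize utterances whose
# separation d (correct vs. competing model score) is below a margin rho.
import numpy as np

def soft_margin_loss(d, rho=1.0):
    return np.mean(np.maximum(0.0, rho - d))

d = np.array([2.3, 0.4, -0.1, 1.7])   # illustrative separation measures
loss = soft_margin_loss(d)
```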
A Smartphone as Your Third Ear
We humans are capable of remembering, recognizing, and acting upon hundreds of thousands of different types of acoustic events on a day-to-day basis. Decades of research on acoustic sensing have led to the creation of systems that now understand speech (e.g., a personal assistant like the iPhone’s Siri, or the voice-activated search feature from Google), recognize the speaker, and find a song (e.g., Shazam). However, apart from speech, music, and some application-specific sounds, the problem of recognizing the variety of general-purpose sounds that a mobile device encounters all the time has remained unsolved. The goal of this research is to build a platform that automatically creates classifiers that recognize general-purpose acoustic events on mobile devices. As these classifiers are meant to run on mobile devices, the technical goals include energy efficiency, meeting timing constraints, and leveraging user contexts such as the location and position of the mobile device in order to improve classification accuracy. With this goal in mind, we have built a general-purpose, energy-efficient, and context-aware acoustic event detection platform for mobile devices called 'Auditeur'. Auditeur enables mobile application developers to have their app register for and get notified of a wide variety of acoustic events. Auditeur is backed by a cloud service to store crowd-contributed sound clips and to generate an energy-efficient and context-aware classification plan for the mobile device. When an acoustic event type has been registered, the mobile device instantiates the necessary acoustic processing modules and wires them together to dynamically form an acoustic processing pipeline in accordance with the classification plan. The mobile device then captures, processes, and classifies acoustic events locally and efficiently. Our analysis of user-contributed empirical data shows that Auditeur's energy-aware acoustic feature selection algorithm is capable of increasing device lifetime by 33.4% while sacrificing less than 2% of the maximum achievable accuracy. We implement seven apps with Auditeur and deploy them in real-world scenarios to demonstrate that Auditeur is versatile, 11.04%-441.42% less power hungry, and 10.71%-13.86% more accurate in detecting acoustic events, compared to state-of-the-art techniques. We perform a user study involving 15 participants to demonstrate that even a novice programmer can implement the core logic of an interesting app with Auditeur in less than 30 minutes, using only 15-20 lines of Java code.
Redesigning Neural Architectures for Sequence to Sequence Learning
The Encoder-Decoder model with soft attention is now the de facto standard for sequence to sequence learning, having enjoyed early success in tasks like translation, error correction, and speech recognition. In this talk, I will present a critique of various aspects of this popular model, including its soft attention mechanism, local loss function, and sequential decoding. I will present a new Posterior Attention Network for a more transparent joint attention that provides easy gains on several translation and morphological inflection tasks. Next, I will expose a little-known problem of miscalibration in state-of-the-art neural machine translation (NMT) systems. For structured outputs as in NMT, calibration is important not just for reliable confidence in predictions, but also for the proper functioning of beam-search inference. I will discuss reasons for miscalibration and some fixes. Finally, I will summarize recent research efforts towards parallel decoding of long sequences.
Deep Learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in speech recognition, visual object recognition, object detection, and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large datasets by using the back-propagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about dramatic improvements in processing images, video, speech and audio, while recurrent nets have shone on sequential data such as text and speech. Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep learning methods are representation learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. This tutorial will introduce the fundamentals of deep learning, discuss applications, and close with challenges ahead.
Modeling high-dimensional sequences with recurrent neural networks
Humans commonly understand sequential events by giving importance to what they expect rather than exclusively to what they actually observe. The ability to fill in the blanks, useful in speech recognition to favor words that make sense in the current context, is particularly important in noisy conditions. In this talk, we present a probabilistic model of symbolic sequences based on a recurrent neural network that can serve as a powerful prior during information retrieval. We show that conditional distribution estimators can describe much more realistic output distributions, and we devise inference procedures to efficiently search for the most plausible annotations when the observations are partially destroyed or distorted. We demonstrate improvements in the state of the art in polyphonic music transcription, chord recognition, speech recognition, and audio source separation.
Reformulating the HMM as a trajectory model
A trajectory model, derived from the HMM by imposing an explicit relationship between static and dynamic features, is developed and evaluated. The derived model, named the trajectory-HMM, can alleviate two limitations of the standard HMM, namely i) piece-wise constant statistics within a state and ii) the conditional independence assumption of state output probabilities, without increasing the number of model parameters. In this talk, a Viterbi-type training algorithm is also derived. The model was evaluated in both speech recognition and synthesis experiments. In speaker-dependent continuous speech recognition experiments, the trajectory-HMM achieved error reductions over the standard HMM. The results of subjective listening tests show that introducing the trajectory-HMM can improve the quality of synthetic speech generated from the HMM-based speech synthesis system we have proposed.
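The explicit relationship between static and dynamic features that the trajectory-HMM exploits is the standard delta computation; here is a minimal sketch of first-order deltas computed from static features (the window width is an assumption).

```python
# Sketch: first-order dynamic (delta) features computed from static features,
# the deterministic relationship the trajectory-HMM makes explicit.
import numpy as np

def delta_features(c, width=2):
    # c: (T, D) static features; standard regression-style delta window.
    T = c.shape[0]
    denom = 2 * sum(k * k for k in range(1, width + 1))
    padded = np.pad(c, ((width, width), (0, 0)), mode='edge')
    return sum(k * (padded[width + k:T + width + k] - padded[width - k:T + width - k])
               for k in range(1, width + 1)) / denom
```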
Lattice-Based Discriminative Training: Theory and Practice
Lattice-based discriminative training techniques such as MMI and MPE have become increasingly widely used in recent years. I will review these model-based discriminative training techniques and also the newer feature-based techniques such as fMPE. I will discuss some of the practical issues that are relevant to discriminative training, such as lattice generation, lattice depth and quality, probability scaling, I-smoothing, language models, alignment consistency, and various other issues for feature-based discriminative training. I will also discuss more recent improvements such as frame-weighted MPE (MPFE), and give an overview of some recent unrelated work that I have been doing.
A Directionally Tunable but Frequency-Invariant Beamformer for an “Acoustic Velocity-Sensor Triad”
"A Directionally Tunable but Frequency-Invariant Beamformer for an [...]
"A Directionally Tunable but Frequency-Invariant Beamformer for an “Acoustic Velocity-Sensor Triad” to Enhance Speech Perception
Herein presented is a simple microphone-array beamformer that is independent of the frequency-spectra of all signals, ...all interference, and all noises. This beamformer allows/requires the listener to tune the desired azimuth-elevation “look direction.” No prior information is needed of the interference. The beamformer deploys a physically compact triad of three collocated but orthogonally oriented velocity sensors. These proposed schemes’ efficacy is verified by a jury test, using simulated data constructed with speech samples. For example, a desired speech signal, originally at a very adverse signal-to-interference-and-noise power ratio (SINR) of -30 dB, may be processed to become fully intelligible to the jury."
Symposium: Deep Learning - Alex Graves
Neural Turing Machines - Alex Graves
NIPS: Oral Session 4 - Ilya Sutskever
Sequence to Sequence Learning with Neural Networks
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increased to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence which made the optimization problem easier.
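A compact PyTorch sketch of the encoder-decoder setup described above, including the reversal of the source sequence before encoding; the single-layer LSTMs, dimensions, and vocabulary sizes are simplifications, not the paper's deep multi-layer configuration.

```python
# Sketch: LSTM encoder-decoder with reversed source tokens, as described above.
# Sizes are illustrative; the paper used deep (multi-layer) LSTMs.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt_in):
        src = torch.flip(src, dims=[1])              # reverse source word order
        _, state = self.encoder(self.src_emb(src))   # fixed-size "thought vector" state
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec_out)                     # logits over target vocabulary

model = Seq2Seq(src_vocab=10000, tgt_vocab=10000)
src = torch.randint(0, 10000, (2, 7))                # batch of 2 source sentences
tgt_in = torch.randint(0, 10000, (2, 9))
logits = model(src, tgt_in)                          # (2, 9, 10000)
```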
HDSI Unsupervised Deep Learning Tutorial - Alex Graves