Voice Activity Detection (VAD) is usually the first nontrivial step in
speech processing (speech-to-text conversion, speaker identification, etc.). It may sound like an easy task, but it is not.
If you are working with clean, slow speech from
broadcast news, a simple energy-based detector will do. It can work well, but think about commercials or opening intros. Telling music from speech can also be hard, because music may contain harmonic structure similar to speech.
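To make the idea concrete, here is a minimal sketch of such an energy-based detector: frame the signal, compute the short-term energy of each frame, and mark frames above a threshold as speech. The function name, frame length, and threshold are illustrative choices, not from any particular library.

```python
def energy_vad(samples, frame_len=400, threshold=0.01):
    """Per-frame speech/non-speech decisions (True = speech).

    samples: a list of audio samples in the range [-1, 1].
    frame_len: frame size in samples (400 = 25 ms at 16 kHz).
    """
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len  # mean frame energy
        decisions.append(energy > threshold)
    return decisions

# Example: one quiet frame followed by one loud frame.
quiet = [0.001] * 400
loud = [0.5] * 400
print(energy_vad(quiet + loud))  # -> [False, True]
```

This is exactly the kind of detector that breaks down on music or background noise: a loud jingle has high energy too, so it gets classified as speech.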
Now let us switch to a more realistic scenario:
telephone conversations. Here the speech usually comes with some additive noise. It can be the noise of the street you are walking on, or music in the background (a radio, for example). When the dialog gets "complicated", people start to shout and the speech is full of crosstalk. Would you like to process crosstalk or not?
The examples above are all close-talk scenarios: the microphone is close to your mouth. But now consider a recording coming from a mobile phone or dictaphone lying on a desk in a restaurant. There are many voices, and you want to process the strongest one. On top of that there can be echo (the acoustics of the room), and so on. We call this condition a distant microphone.
By the way, VAD should not depend on the language being spoken.
So, to conclude: separating speech (which you need to process) from noise (the other parts of the recording) is not easy. There are at least these classes:
- clean speech
- speech with noise in background
- speech with music in background
- shouting
- crosstalks
- singing
- music
- stationary noise
- impulse noise (gunshots)
- technical noise (fax, dial tones)
- silence
And you need to accurately find the first three or four, forward them to the next processing step, and omit the rest of the recording.
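The keep-and-forward step can be sketched as a simple filter over labeled VAD segments. The segment representation and class names below just mirror the list above; they are an assumption for illustration, not an actual API.

```python
# Classes from the list above that should be forwarded downstream.
SPEECH_CLASSES = {
    "clean speech",
    "speech with noise in background",
    "speech with music in background",
    "shouting",
}

def keep_speech(segments):
    """Keep only speech-like segments.

    segments: list of (start_sec, end_sec, label) tuples as a
    hypothetical VAD might produce them.
    """
    return [seg for seg in segments if seg[2] in SPEECH_CLASSES]

segments = [
    (0.0, 2.5, "music"),
    (2.5, 6.0, "clean speech"),
    (6.0, 7.0, "impulse noise"),
    (7.0, 10.0, "speech with noise in background"),
]
print(keep_speech(segments))
# -> [(2.5, 6.0, 'clean speech'), (7.0, 10.0, 'speech with noise in background')]
```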
The second important thing you need to think about is the downstream technology.
In the case of speaker or language identification, you can take the liberty of omitting some portion of the speech as well, especially if you are not confident whether it is speech or noise. Omitting some speech just means you need a longer recording to fulfill the requirement of, let's say, 30 seconds of speech for speaker identification.
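The "longer recording" point is simple bookkeeping: accumulate the duration of detected-speech segments until the required amount is reached. The function below is an illustrative sketch, with the 30-second requirement from the text as its default.

```python
def audio_needed(speech_segments, required=30.0):
    """Timestamp (in seconds) by which `required` seconds of speech
    have been collected, or None if the recording is too short.

    speech_segments: list of (start_sec, end_sec) speech intervals,
    in order, as a hypothetical VAD might output them.
    """
    total = 0.0
    for start, end in speech_segments:
        if total + (end - start) >= required:
            # The requirement is met partway through this segment.
            return start + (required - total)
        total += end - start
    return None

# A recording where only half of the time is speech: 30 s of speech
# are collected only after 50 s of audio.
segments = [(0, 20), (40, 60), (80, 100)]
print(audio_needed(segments))  # -> 50.0
```

The more speech the VAD throws away, the later (if ever) that threshold is reached, which is why a cautious VAD is acceptable here but costly elsewhere.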
But what do you think will happen if you omit some portion of speech (the beginnings of sentences) in speech-to-text conversion? You will miss words in your transcript! On the other hand, some adjacent noise is not so dangerous there, because speech-to-text systems have a model of "silence". That same noise, however, is harmful for speaker or language identification, where you would effectively be telling the system that the noise is the speaker (or language).
Since we understand the relevance of voice activity detection, we are continuously working to make it more accurate. We have also decided to make it available as a separate part of our services. So you can easily try it yourself: upload an audio file and let the VAD detect speech in a recording full of other noises. And if the speech detector fails, you know that you cannot get any better results from speech-to-text. The recognizer simply transcribes only the segments the VAD marks as speech. That's it.