Tuesday, October 29, 2013

Voice Activity Detection: Where is the speech?

Voice Activity Detection (VAD) is usually the first nontrivial step in speech processing (speech-to-text conversion, speaker identification, etc.). It may sound like an easy task, but it is not.
If you work with clean, slow speech such as broadcast news, a simple energy-based detector is enough. It can work well, but think about commercials or opening intros. Telling music from speech can be hard, because music can be harmonic in much the same way as speech.
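
To make the "simple energy-based detector" concrete, here is a minimal sketch in Python. It is an illustration only, not the detector used in our service: the function names, the 20 ms frame length and the 30 dB margin below the loudest frame are all assumptions chosen for the example.

```python
import math

def frame_energies_db(samples, frame_len):
    """Return the log energy (in dB) of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # The small constant avoids log10(0) on digital silence.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies

def energy_vad(samples, sample_rate, frame_ms=20, margin_db=30.0):
    """Label a frame as speech if it is within margin_db of the loudest frame.

    Returns a list of (start_sec, end_sec) segments. The threshold rule and
    the default values are assumptions made for this sketch.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energies = frame_energies_db(samples, frame_len)
    if not energies:
        return []
    threshold = max(energies) - margin_db
    segments = []
    seg_start = None
    for i, energy in enumerate(energies):
        is_speech = energy > threshold
        if is_speech and seg_start is None:
            seg_start = i
        elif not is_speech and seg_start is not None:
            segments.append((seg_start * frame_len / sample_rate,
                             i * frame_len / sample_rate))
            seg_start = None
    if seg_start is not None:
        # The recording ended while a speech segment was still open.
        segments.append((seg_start * frame_len / sample_rate,
                         len(energies) * frame_len / sample_rate))
    return segments
```

A detector like this only looks at loudness, so a loud commercial jingle or an opening intro passes as "speech" just as easily as a news anchor does.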
Now let us switch to a more realistic scenario: telephone conversations. Here the speech usually comes with some additive noise. It can be the noise of the street you are walking on or some music in the background (a radio, for example). When the dialog gets "complicated", people start to shout and the speech is full of crosstalk. Would you like to process the crosstalk or not?
The examples above all assume close talk: the microphone is close to your mouth. But now consider a recording coming from a mobile phone or dictaphone lying on a desk in a restaurant.
There are many voices, and you want to process the strongest one. On top of that there can be echo (room acoustics) and so on. We call this the distant microphone condition.
By the way, the VAD should not depend on the language being spoken.

So to conclude, separating speech (which you need to process) from noise (the rest of the recording) is not easy. There are at least these classes:
  • clean speech
  • speech with noise in background
  • speech with music in background
  • shouting
  • crosstalk
  • singing
  • music
  • stationary noise
  • impulse noise (gunshots)
  • technical noise (fax, dial tones)
  • silence
And you need to accurately find the first 3 or 4 and forward them to the next processing step, while omitting the rest of the recording.
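
As a rough sketch of that last step, suppose some classifier has already labeled each stretch of the recording with one of the classes above. The (start, end, label) tuple layout and the function name are assumptions made for this example; the labels are the ones from the list.

```python
# Classes that are always forwarded to the next processing step.
FORWARDED = (
    "clean speech",
    "speech with noise in background",
    "speech with music in background",
)
# The "3 or 4" in the text: shouting is forwarded only on request.
OPTIONAL = "shouting"

def select_speech(segments, include_shouting=False):
    """Keep the speech classes and merge segments that touch or overlap.

    segments is a list of (start_sec, end_sec, label) tuples; the result
    is a list of (start_sec, end_sec) tuples.
    """
    wanted = set(FORWARDED)
    if include_shouting:
        wanted.add(OPTIONAL)
    merged = []
    for start, end, label in sorted(segments):
        if label not in wanted:
            continue
        if merged and start <= merged[-1][1]:
            # Adjacent or overlapping speech: extend the previous segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

labeled = [
    (0.0, 2.0, "music"),
    (2.0, 5.0, "clean speech"),
    (5.0, 6.5, "speech with noise in background"),
    (6.5, 7.0, "silence"),
    (7.0, 8.0, "shouting"),
]
print(select_speech(labeled))
print(select_speech(labeled, include_shouting=True))
```

The selection itself is trivial; the hard part is producing reliable labels in the first place.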

The second important thing to think about is which technology comes next.
In the case of speaker or language identification, you can afford to omit some portion of the speech as well, especially when you are not confident whether it is speech or noise. Omitting some speech just means that you need longer audio to reach the required amount of speech, let's say 30 seconds for speaker identification.
But what do you think happens if you omit some portion of speech (the beginnings of sentences) in speech-to-text conversion? You will miss words in your transcript! On the other hand, some adjacent noise is not so dangerous there, because speech-to-text has a model of "silence". The same noise is harmful for speaker or language identification, though, where you would be claiming that the noise is the speaker (or the language).
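
The two cases can be sketched as two different post-processing steps on the same VAD output. This is a minimal illustration under stated assumptions: each segment is taken to carry a hypothetical confidence score, and the padding and confidence threshold are made-up values. Only the 30 seconds comes from the example above.

```python
def for_speech_to_text(segments, pad=0.5, total=None):
    """Widen every segment by pad seconds and merge overlaps.

    Extra noise around the speech is tolerable for a recognizer,
    while a clipped sentence beginning costs words in the transcript.
    segments is a list of (start_sec, end_sec, confidence) tuples.
    """
    padded = []
    for start, end, _conf in sorted(segments):
        start = max(0.0, start - pad)
        end = end + pad if total is None else min(total, end + pad)
        if padded and start <= padded[-1][1]:
            padded[-1] = (padded[-1][0], max(padded[-1][1], end))
        else:
            padded.append((start, end))
    return padded

def for_speaker_id(segments, min_conf=0.9, needed=30.0):
    """Keep only confident segments and report whether they add up.

    Dropping doubtful speech is fine here; it only means a longer
    recording is needed to collect `needed` seconds of speech.
    """
    kept = [(s, e) for s, e, conf in sorted(segments) if conf >= min_conf]
    duration = sum(e - s for s, e in kept)
    return kept, duration >= needed

vad_output = [(1.0, 4.0, 0.95), (4.5, 6.0, 0.6), (10.0, 40.0, 0.99)]
print(for_speech_to_text(vad_output, total=40.0))
print(for_speaker_id(vad_output))
```

The same detector output is thus stretched for the recognizer and trimmed for the identifier.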

Because we understand how much voice activity detection matters, we are continuously working on making it more accurate. We also decided to make it available as a separate part of our services, so you can easily upload an audio file yourself and let the VAD detect the speech in a recording full of other noises. And if the speech detector fails, you know that you cannot get any better results from speech-to-text: the recognizer transcribes only the segments marked as speech by the VAD. That's it.

Tuesday, October 15, 2013

What information is in your SpokeData?

Maybe you are thinking... "What information is in my spoken data?"

Well, lots of information! To have some idea, look at the following image.

You and your speech are in the middle. Now, let us go clockwise and I will briefly introduce the particular pieces of information hidden in your speech.

So at 12 o'clock, there is speaker identity. A recording only 10 seconds long is enough to identify you by voice.

Gender identification is next. It is the simplest type of voice classification, with just 2 classes.

At 3 o'clock, there is the speech transcript, produced by a technology that converts speech into text. Keyword spotting and speech search can be considered part of this technology.

Next is age estimation. Estimating the age might be helpful in some security applications.

The communication channel is usually not that important, but information about the codecs or networks through which the voice recording was transmitted is there, together with the type of device!

Do not forget that the recording does not contain only speech. There are also lots of noises, tones or music, all of which can make your speech less intelligible. The technology that separates the speech from them is called Voice Activity Detection.

And finally, there is the identity of the language you are speaking. As with speaker identity, 10 seconds of your speech is enough to estimate the spoken language.

So that is at least some of the information hidden in your spoken data recordings.