Friday, January 24, 2014

Why do we need your speech data

Our several year experience in speech technology research and business shows often clash between:
Speech technology provider: "Give us some of your speech data for testing purposes please."
Customer: "No way! Our speech data is our private and secret property."

So let discuss several WHYs.

Why the speech technology provider wants the customer's data?

The speech technologies are very complex and sensitive to match between model and the data. This is common problem in the whole field of machine learning. Once you feed the classifier with "already seen" data, everything goes well. Accuracy of such algorithm is great.
The problem occurs when you put an unseen data into the algorithm - data which was not seen during training and developing. It is like, people living in US understands English because it is their already seen data, but does not understand Japanese because it is their unseen data during the training phase (childhood).

Thursday, January 2, 2014

What is the difference between narrowband and wideband, closetalk and distant mic?

Maybe you have coped with terms like narrowband, wideband, closetalk, distant microphone, microphone array, and farfield in past. So let me explain it a bit.

All of these terms are about the "technology" you are using for recording the speech and its relative placement against the speaker.

Why do we need to bother with this? Actually the problem is that a speech recognizer (or a generic speech technology) is trained on data recorded under specific condition (telephone conversations for example). So this recognizer will recognize telephone conversations well, but will perform poor on lecture recordings recorded with a camera microphone in a room with strong echo.

As the research in the speech technologies field goes on, the recognizers are more and more robust. So this problem will shrink in future. But it still holds - if your data matches the data on which the recognizer was trained, you get the best possible accuracy. There is not acoustic mismatch.

There are, let's say, three variables:
  • Quality of the recorded audio - sampling frequency
  • Distance between speaker mouth and the microphone
  • Number of microphones - microphone array
Narrowband vs. Wideband

Sampling frequency is one factor which can decrease the quality of recorded audio and the final accuracy. There are two settings - 8kHz and 16kHz (and more). If the data is recorded in the 8 kHz, it is so called Narrowband data. This settings are used in telephony. So if you work with telephone recordings, your data is in 8kHz due to the telephony technology limitations. Recording the telephone call in 16kHz or more does not make sense (and bring no improvement).

 An example of narrowband - 8kHz data