Thursday, January 2, 2014

What is the difference between narrowband and wideband, closetalk and distant mic?

Maybe you have come across terms like narrowband, wideband, closetalk, distant microphone, microphone array, and farfield in the past. So let me explain them a bit.

All of these terms describe the "technology" you use for recording speech and its placement relative to the speaker.

Why do we need to bother with this? The problem is that a speech recognizer (or any speech technology) is trained on data recorded under specific conditions (telephone conversations, for example). Such a recognizer will recognize telephone conversations well, but will perform poorly on, say, lecture recordings made with a camera microphone in a room with strong echo.

As research in the speech technology field goes on, recognizers become more and more robust, so this problem will shrink in the future. But it still holds: if your data matches the data on which the recognizer was trained, you get the best possible accuracy. There is no acoustic mismatch.

There are, let's say, three variables:
  • Quality of the recorded audio - sampling frequency
  • Distance between the speaker's mouth and the microphone
  • Number of microphones - microphone array
Narrowband vs. Wideband

Sampling frequency is one factor that can decrease the quality of the recorded audio and the final accuracy. There are two common settings: 8 kHz and 16 kHz (and more). Data recorded at 8 kHz is so-called narrowband data. This setting is used in telephony, so if you work with telephone recordings, your data is at 8 kHz due to the limitations of the telephony technology. Recording a telephone call at 16 kHz or more does not make sense (and brings no improvement).

An example of narrowband - 8 kHz data
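Whether a recording is narrowband or wideband can be read directly from the WAV header. Below is a minimal sketch using only Python's standard library; the filename and the synthetic tone are hypothetical stand-ins for a real recording.

```python
import wave
import math
import struct

# Write a short synthetic 8 kHz mono WAV file (hypothetical filename),
# then read its header back to classify it as narrowband or wideband.
RATE = 8000  # 8 kHz sampling -> narrowband (telephony)
with wave.open("example_narrowband.wav", "wb") as w:
    w.setnchannels(1)   # mono
    w.setsampwidth(2)   # 16-bit samples
    w.setframerate(RATE)
    # one second of a 440 Hz tone as dummy audio content
    samples = [int(10000 * math.sin(2 * math.pi * 440 * t / RATE))
               for t in range(RATE)]
    w.writeframes(struct.pack("<%dh" % len(samples), *samples))

with wave.open("example_narrowband.wav", "rb") as w:
    rate = w.getframerate()
    band = "narrowband" if rate <= 8000 else "wideband"
    print(rate, band)  # 8000 narrowband
```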

Data recorded at 16 kHz is so-called wideband data. This setting is used everywhere else: lectures, dictaphones, TV, and so on. You can theoretically record the data at 44 kHz, but the recognizer downsamples it to 16 kHz. It does not make much sense to build a recognizer for more than 16 kHz (22 kHz or 44 kHz), because there is no significant gain in accuracy, and it makes the recognizer run slower and consume much more memory.

An example of wideband - 16 kHz data

The difference between 8 kHz and 16 kHz amounts to several percent in word error rate, which is significant.

In case you have a narrowband system (a telephone speech recognizer), you can process your 16 kHz data simply by downsampling it to 8 kHz. You just throw away some information from the audio (the higher frequencies). Doing it the opposite way is not so safe. If you have 8 kHz data, upsampling it to 16 kHz will not add any information (the higher frequencies are missing), so if the recognizer is not robust enough, the output can be poor.
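The downsampling step can be sketched in a few lines. This is a deliberately crude illustration: a two-sample average stands in for a proper anti-aliasing low-pass filter, and real systems would use a polyphase resampler (e.g. sox or scipy) instead.

```python
import math

# Crude 16 kHz -> 8 kHz downsampling sketch on a synthetic signal.
RATE_IN = 16000
signal_16k = [math.sin(2 * math.pi * 440 * t / RATE_IN)
              for t in range(RATE_IN)]  # 1 second of a 440 Hz tone

# Average neighbouring samples (a minimal stand-in for an
# anti-aliasing low-pass filter), then keep every second sample.
signal_8k = [(signal_16k[i] + signal_16k[i + 1]) / 2
             for i in range(0, len(signal_16k) - 1, 2)]

print(len(signal_16k), len(signal_8k))  # 16000 8000
```

Note that the reverse direction (8 kHz to 16 kHz) would only interpolate between existing samples; the discarded high frequencies cannot be recovered.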

Closetalk vs. Distant microphone

The distance between the mouth and the microphone is the second important factor. Telephone speech is so-called closetalk: you have the microphone close to your mouth. Other examples of closetalk are headset and lapel microphones (lectures, TV shows, radio, ...). If you have a very high-quality microphone with a narrow beam and there is no echo, you can also consider your recording closetalk - you are simply able to record the speaker as if the microphone were really close to his/her mouth.

Distant microphone, on the other hand, is a scenario where a standard microphone is far from the speaker, so it also records the noise of the surrounding environment, the speech of other people, and echoes of the target speaker's speech. All these noises lower the accuracy of speech technologies. If you feed such data to a recognizer trained on closetalk, clean speech, the results will be poor. Again, you need a match in acoustic conditions, so your recognizer should be trained on distant microphone data.

Microphone array and Farfield data

This is a special class of data. In some cases, the speech is recorded by a microphone array fixed somewhere - a smart meeting room device or a Kinect, for example. These devices contain several microphones and an algorithm that preprocesses the signal to extract the speaker's speech while suppressing the surrounding noise. This operation is called beamforming. As humans we have two ears, so we do beamforming naturally (we are able to track a speaking person in a noisy environment without any problem). Try an experiment: plug one of your ears with a finger and then track the speaking person. It is hard, right? Now imagine that this is the kind of audio data the recognizer must process and recognize.
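The simplest beamformer is delay-and-sum: shift each channel so the speaker's speech lines up, then average. Speech adds up coherently while uncorrelated noise partially cancels. The two-microphone sketch below assumes the inter-microphone delay is known; a real beamformer would estimate it from array geometry or cross-correlation.

```python
import math
import random

random.seed(1)
RATE, DELAY, N = 16000, 3, 1600
speech = [math.sin(2 * math.pi * 300 * t / RATE) for t in range(N)]

# Each channel hears the same speech plus independent noise;
# microphone 2 hears the speaker DELAY samples later.
mic1 = [speech[t] + random.gauss(0, 0.5) for t in range(N)]
mic2 = [(speech[t - DELAY] if t >= DELAY else 0.0) + random.gauss(0, 0.5)
        for t in range(N)]

# Undo microphone 2's delay, then average the aligned channels.
aligned2 = mic2[DELAY:] + [0.0] * DELAY
beamformed = [(a + b) / 2 for a, b in zip(mic1, aligned2)]

def noise_power(x):
    # Mean squared error against the clean speech (lower is better).
    return sum((a - s) ** 2 for a, s in zip(x, speech)) / N

print(noise_power(mic1) > noise_power(beamformed))  # True
```

Averaging the two aligned channels roughly halves the noise power while leaving the speech intact, which is exactly why arrays help in farfield conditions.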

Beamforming introduces some signal distortions, so again, the recognizer needs to be trained on this kind of data. Otherwise it fails and the accuracy is poor.

I hope this short introduction helped you understand the terms and what is important to keep in mind. Do you have any questions? Do not hesitate to ask.
