Monday, March 31, 2014

What does the speaker segmentation technology

Speaker segmentation (diarization) is a speech technology allowing you to segment audio (or video) into particular speakers. What is it good for? You can more easily identify speaker turns in a dialog while making speech transcript.

Even if you do not directly need the speaker information, the speaker segmentation is very helpful for speech-to-text technology (STT). The STT technology contains unsupervised speaker adaptation module. This module takes parts of speech belongings to a particular speaker and adapts an acoustic model towards them. Adaptation of the model leads to more accurate speech transcript.

The adaptation - even if it is called speaker adaptation - adapts the system to the whole acoustic channel. It consists of speaker's voice characteristics, room acoustics (echo), microphone characteristics, environment noise, etc.

Speaker segmentation is theoretically independent on speaker, language and acoustic conditions. But - practically - it is dependent. The reason is, that it uses something called a universal background model (UBM). The UBM should model all speech, languages and acoustics of the world - theoretically. But you need to train it on some speaker labeled data - to learn how to distinguish among speakers. And it holds (as in other speech technologies) that the more far the data you process is from the training data, the worse accuracy you get.