Monday, March 31, 2014

What does the speaker segmentation technology do?

Speaker segmentation (diarization) is a speech technology that splits audio (or video) into segments belonging to individual speakers. What is it good for? It makes it easier to identify speaker turns in a dialog when you are producing a speech transcript.

Even if you do not need the speaker information directly, speaker segmentation is very helpful for speech-to-text (STT) technology. The STT technology contains an unsupervised speaker adaptation module. This module takes the parts of speech belonging to a particular speaker and adapts an acoustic model towards them. Adapting the model leads to a more accurate speech transcript.
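
The grouping step that feeds such an adaptation module can be sketched as follows. This is only an illustration, not the actual implementation of the service: the segment format `(start, end, speaker)` and the function names `group_by_speaker` and `speech_seconds` are assumptions made for this example.

```python
from collections import defaultdict


def group_by_speaker(segments):
    """Collect the (start, end) spans that belong to each speaker label.

    `segments` is a hypothetical diarization output: a list of
    (start_seconds, end_seconds, speaker_label) tuples.
    """
    grouped = defaultdict(list)
    for start, end, speaker in segments:
        grouped[speaker].append((start, end))
    return dict(grouped)


def speech_seconds(spans):
    """Total amount of speech (in seconds) covered by a list of spans."""
    return sum(end - start for start, end in spans)


# Made-up example: two speakers taking turns.
segments = [
    (0.0, 4.0, "S1"),
    (4.0, 9.5, "S2"),
    (9.5, 12.0, "S1"),
]

per_speaker = group_by_speaker(segments)
totals = {spk: speech_seconds(spans) for spk, spans in per_speaker.items()}
```

Each entry of `per_speaker` is the pool of audio that an adaptation module could use for one speaker label; `totals` shows how much speech is available per label.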

The adaptation, even though it is called speaker adaptation, adapts the system to the whole acoustic channel. The channel consists of the speaker's voice characteristics, room acoustics (echo), microphone characteristics, environment noise, etc.

Speaker segmentation is theoretically independent of speaker, language and acoustic conditions. In practice, however, it is dependent. The reason is that it uses something called a universal background model (UBM). In theory, the UBM should model all the speech, languages and acoustics of the world. But you need to train it on some speaker-labeled data so that it learns how to distinguish among speakers. And it holds (as in other speech technologies) that the farther the data you process is from the training data, the worse the accuracy you get.

The lower accuracy shows up as one speaker, recorded in different acoustic conditions, being split into several different speaker labels.
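
To make this kind of error concrete, here is a small sketch that counts how many labels each real speaker was split into. It assumes you have a hand-made reference labeling to compare against; the function name `labels_per_true_speaker` and the example data are invented for illustration.

```python
from collections import defaultdict


def labels_per_true_speaker(reference, hypothesis):
    """Count the distinct hypothesis labels assigned to each true speaker.

    `reference` and `hypothesis` are equally long lists with one entry per
    segment: the true speaker and the label produced by the segmentation.
    A count above 1 means that speaker was split into "sub-speakers".
    """
    labels = defaultdict(set)
    for true_speaker, hyp_label in zip(reference, hypothesis):
        labels[true_speaker].add(hyp_label)
    return {speaker: len(found) for speaker, found in labels.items()}


# Made-up example: the fourth segment is "anna" in a different acoustic
# condition, and the segmentation gives her a new label there.
reference = ["anna", "anna", "ben", "anna", "ben"]
hypothesis = ["S1", "S1", "S2", "S3", "S2"]

split = labels_per_true_speaker(reference, hypothesis)
```

Here `split` reports two labels for "anna" and one for "ben", which is exactly the one-speaker-many-labels pattern described above.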

The second important thing is that speaker segmentation needs prior knowledge of the number of speakers in the audio document. If you do not provide this information, a preset or estimated value is used. However, if you know the number, it is good to provide it.

In our service, we automatically preset:
  • 7 speakers in the speech-to-text engine for broadcast news. We expect more interviews and thus more speakers.
  • 3 speakers in the other speech-to-text engines. We expect a monologue or a dialog in the recording.
  • 10 speakers in the speaker segmentation mode (just voice activity detector and speaker segmentation without speech-to-text).
You can preset the number of speakers a priori in the paid service.
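
The presets above amount to a simple lookup with an optional override. The sketch below uses the three values from the list; the mode names (`broadcast_news`, `other_stt`, `segmentation_only`) and the function `speaker_count` are hypothetical and are not part of any real API of the service.

```python
# Preset values taken from the list above; the keys are invented names.
DEFAULT_SPEAKERS = {
    "broadcast_news": 7,      # STT engine for broadcast news
    "other_stt": 3,           # the other STT engines
    "segmentation_only": 10,  # voice activity detection + segmentation only
}


def speaker_count(mode, known_count=None):
    """Return the number of speakers to assume for a recording.

    If the caller knows the number of speakers, that value wins;
    otherwise the preset for the given mode is used.
    """
    if known_count is not None:
        if known_count < 1:
            raise ValueError("the number of speakers must be at least 1")
        return known_count
    return DEFAULT_SPEAKERS[mode]


preset = speaker_count("broadcast_news")
override = speaker_count("broadcast_news", known_count=2)
```

With these definitions, `preset` falls back to the broadcast news default, while `override` uses the number supplied by the caller.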

Lower accuracy is not necessarily "bad", though. Yes, it can be annoying if you need accurate speaker turns in your speech transcript. On the other hand, because the adaptation targets the whole acoustic channel and not just the voice, splitting a particular speaker into several "sub-speakers" according to varying acoustic conditions (noise, environment, ...) can be helpful for the automatic speech recognizer.
