Friday, October 21, 2016

Synchronizing text and audio - time alignment

We have implemented automatic, language-independent text-to-audio alignment. What is it all about? In short, you upload your audio and text, and you get subtitles back. See the pictures below.



What is time alignment? Alignment is the process of attaching timestamps to a transcript or other text according to the audio. Text usually carries no timing: a sentence, a paragraph, or a page of text has no timing information. But when you have audio that goes with the text, that is, audio containing the same speech, you may want to add timing information to the text so that you know when a particular word was spoken. You can think of the process as making subtitles (time-aligned sentences) from your text and audio.
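To make the idea of time-aligned sentences concrete, here is a minimal Python sketch that turns sentences with start and end times into SubRip (SRT) subtitle text. The `Segment` class and the function names are hypothetical, chosen only for illustration; this is not the SpokenData output format or API, just one way such aligned text could be written out as subtitles.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One time-aligned sentence (hypothetical structure, for illustration)."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # the sentence spoken in that interval


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render aligned segments as numbered SRT subtitle blocks."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        blocks.append(f"{index}\n{timing}\n{seg.text}\n")
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


if __name__ == "__main__":
    aligned = [
        Segment(0.0, 2.5, "We have implemented text to audio alignment."),
        Segment(2.5, 5.0, "You upload your audio and text."),
    ]
    print(to_srt(aligned))
```

The plain text alone supplies only the `text` fields; the aligner's job is to fill in `start` and `end` from the audio.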

The good news is that alignment can be done automatically and, in our case, independently of the language.

How does this differ from automatic speech transcription? If you have only audio and want text (a verbatim transcript), you have to use automatic speech transcription to convert the audio into text. But there are cases when you already have the transcript. Some examples:
  • You wrote a script for a lecture, talk, pitch, or news item, read it aloud, and recorded it. Now you want to make subtitles. Transcribing your speech again is a waste of time and money. Use the aligner.
  • You had someone transcribe your audio and got back only plain text, but later found that subtitles would be useful. Just align the existing transcript to your audio using our aligner.
  • You want to align an e-book with its audiobook.
In all these cases, you just need to upload your audio and text to SpokenData.com and you will get aligned text (like subtitles). You can then use our editor to edit the text, timings, etc. You can also use our API.

The biggest advantage of our approach is its independence from the language. It should work reasonably well for any language.

This technology also has some caveats. It expects the speech in the audio and the text to match fully. If part of your speech is untranscribed, or the text contains notes that are not spoken in the audio, the technology still does its best to align them, but time shifts can appear near these regions.
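If you want to find such regions after alignment, one simple heuristic is to look for segments with an implausible speaking rate. The Python sketch below is our illustration only, not part of the aligner: the function name, the `(start, end, text)` tuple layout, and the thresholds in characters per second are all assumptions, and sensible thresholds will differ between languages and writing systems.

```python
def flag_suspicious(segments, min_cps=3.0, max_cps=30.0):
    """Return indices of segments whose speaking rate looks implausible.

    segments: list of (start_seconds, end_seconds, text) tuples.
    min_cps / max_cps: assumed bounds on characters per second; tune them
    for your language and speaker.
    """
    flagged = []
    for index, (start, end, text) in enumerate(segments):
        duration = end - start
        if duration <= 0:
            # Zero or negative duration can never be a valid alignment.
            flagged.append(index)
            continue
        chars_per_second = len(text) / duration
        if chars_per_second < min_cps or chars_per_second > max_cps:
            flagged.append(index)
    return flagged


if __name__ == "__main__":
    aligned = [
        (0.0, 2.0, "Hello there, welcome."),
        (2.0, 2.1, "This sentence is far too long for a tenth of a second."),
        (2.1, 12.1, "Hi."),
    ]
    print(flag_suspicious(aligned))
```

Flagged segments are only candidates for a manual check in the editor; a long pause or a fast speaker can trigger the heuristic even when the alignment is correct.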
