Friday, October 21, 2016

Synchronizing text and audio - time alignment

We have implemented automatic, language-independent text-to-audio alignment. What is it all about? In short: you upload your audio and text, and you get subtitles back. A concrete example follows below.



What is time alignment? Alignment is the process of attaching time stamps to a transcript or text according to the audio. The text usually carries no timing: a sentence, a paragraph, or a page of text has no timing information. But when audio containing speech accompanies that text, you may want to add timing information to it, i.e., to know when a particular word was spoken in the audio. You can think of the process as making subtitles (time-aligned sentences) from your text and audio.
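To make the idea concrete, here is a minimal Python sketch of what an aligner consumes and produces; the sentences and timings are invented for illustration.

# Input: plain sentences with no timing information.
sentences = [
    "Welcome to the lecture.",
    "Today we will talk about speech alignment.",
]

# Output of an aligner: the same sentences with start/end times in
# seconds into the audio. These timings are made up for illustration.
aligned = [
    {"start": 0.35, "end": 1.80, "text": "Welcome to the lecture."},
    {"start": 2.10, "end": 4.95, "text": "Today we will talk about speech alignment."},
]

for seg in aligned:
    print(f"{seg['start']:6.2f} -> {seg['end']:6.2f}  {seg['text']}")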

The good news is that the alignment can be done automatically and, in our case, even language-independently.

How does this differ from automatic speech transcription? If you have only the audio and want the text (a transcript, a verbatim), you have to use automatic speech transcription to convert the audio into text. But there are cases where you already have the transcript. Some examples:
  • You wrote a script for a lecture, talk, pitch, or news item, read it aloud, and got a recording. Now you want subtitles. It is a waste of time and money to transcribe the speech again; use the aligner.
  • You had someone transcribe your audio and received only plain text, but later found subtitles useful. Just align the existing transcript to your audio with our aligner.
  • It can also be useful for aligning an e-book to its audiobook.
Just upload your audio and text to SpokenData.com and you will get aligned text (like subtitles) back. You can then use our editor to adjust the text, timings, etc., or automate the whole workflow through our API.
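As a rough sketch only: the endpoint and field names below are hypothetical placeholders, not the documented SpokenData API, so please consult the actual API reference before using it.

import requests

# Hypothetical endpoint and field names, for illustration only;
# see the real SpokenData API documentation for the actual calls.
API_URL = "https://spokendata.com/api/align"  # placeholder URL

with open("talk.mp3", "rb") as audio, open("talk.txt", "rb") as text:
    response = requests.post(API_URL, files={"audio": audio, "text": text})

response.raise_for_status()
print(response.text)  # e.g. the aligned subtitles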

The biggest advantage of our approach is its language independence: it should work reasonably well for any language.

The technology also has some caveats. It expects the speech in the audio and the text to match fully. If part of the speech is untranscribed, or the text contains notes that are never spoken, the aligner still does its best, but time shifts can appear near those regions.

Monday, July 14, 2014

Use case: How to transcribe conference video recordings and make subtitles for them?

One handy use of automatic speech recognition technology (speech-to-text) is the transcription of conference talks. There are plenty of conferences, and many of them are recorded and published on the conference homepage or on YouTube, for example.
Let's take any conference as an example. Recording it and publishing plenty of videos on YouTube is fine, but the collection quickly becomes messy. You may find the following reasons for transcribing talks useful:
  1. Some people do not understand English very well. Reading subtitles can help them follow the talk.
  2. You need to market your conference to attract people. Videos show prospects the quality of your conference, and transcribing them to text improves your SEO, so more people will find you.
  3. Large collections of videos are hard to search for particular information. A time-synchronous speech transcript lets you search the speech quickly, even across a large collection of videos (see the sketch after this list).
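For example, once every talk has a time-stamped transcript, searching a whole collection reduces to scanning segments. A toy Python sketch, with transcripts invented for illustration:

# One list of segments per video: (start_sec, end_sec, text).
transcripts = {
    "talk_01.mp4": [
        (12.4, 15.0, "we introduce a neural network model"),
        (95.2, 99.8, "the decoder uses a language model"),
    ],
    "talk_02.mp4": [
        (7.1, 10.3, "our corpus contains conference speech"),
    ],
}

def search(keyword):
    """Return (video, start time) for every segment mentioning keyword."""
    hits = []
    for video, segments in transcripts.items():
        for start, end, text in segments:
            if keyword.lower() in text.lower():
                hits.append((video, start))
    return hits

# Jump straight to the spots where "language model" is spoken.
print(search("language model"))  # -> [('talk_01.mp4', 95.2)]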
Using human labor for subtitling videos makes sense, because people do not like watching subtitles with errors, and automatic voice-to-text does make errors. On the other hand, transcribing all recordings from a conference that runs several days can be enormously expensive in human resources.
So using automatic voice-to-text technology is a logical step to reduce the need for human resources, especially for cases 2 and 3. There you do not care about a few errors, because the transcript is primarily for machines: search engines.

The huge advantage of our service here is the ability to adapt the automatic speech recognizer to the target domain: your conference. Usually, every technical conference has proceedings that are full of content words, abbreviations, technical terms, and so on. These words are important (within your conference) but rare in general speech, so standard recognizers trained on general speech can easily miss them, and the transcript becomes useless to you.
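As a simplified illustration of one ingredient of such adaptation (not our actual pipeline): collect words that are frequent in the proceedings but missing from a general vocabulary, as candidates to add to the recognizer.

from collections import Counter
import re

def domain_terms(proceedings_text, general_vocab, top_n=50):
    """Find words frequent in the proceedings but absent from a
    general vocabulary: candidates for recognizer adaptation."""
    words = re.findall(r"[a-zA-Z][a-zA-Z-]+", proceedings_text.lower())
    counts = Counter(w for w in words if w not in general_vocab)
    return [w for w, _ in counts.most_common(top_n)]

# Toy usage with invented data.
general_vocab = {"the", "a", "of", "and", "is", "we", "in", "this"}
text = "We propose a diarization method. Diarization of meetings is hard."
print(domain_terms(text, general_vocab))
# -> ['diarization', 'propose', 'method', 'meetings', 'hard']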

To give you a real use case: SuperLectures, a conference video service, uses SpokenData.com automatic transcriptions in the way described above. They provide us with the proceedings so that we can adapt our recognizer, and we return the textual transcriptions of their audio/video data.

Thursday, February 13, 2014

Use-Case: Making movie subtitles in 4 steps

Let's go through 4 easy steps for making subtitles from scratch for a movie or your home/company video. Subtitles are usually stored in a text file with time stamps that synchronize the text with the video. Examples of such formats are SRT and TTML (Timed Text).
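For instance, an SRT file is just numbered cues, each with a start and end time and the text to display:

1
00:00:01,500 --> 00:00:04,000
Welcome to the video.

2
00:00:04,200 --> 00:00:07,300
Subtitles are synchronized by these time stamps.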

You need software for making the subtitles (unless you want to edit the text file directly). You have several choices: download a desktop application (AHD Subtitles Maker Professional, ...), use a web service (http://CaptionsMaker.com, http://amara.org, http://SubtitleHorse.com), or use a smart web service such as http://SpokenData.com.

What is the difference between a standard and a smart web service for making captions? With a standard service, you have to set the timing of each subtitle yourself, which can be a pretty annoying job. That is where the smart web service comes in: it finds the places where speech occurs automatically, so you only need to fill in the text. Pretty good, right?
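The "smart" part boils down to speech detection. As a toy illustration of the idea (real detectors are far more robust), a simple energy threshold over a mono 16-bit WAV file already yields rough speech regions that can become empty subtitle slots:

import wave
import numpy as np

def speech_regions(path, frame_ms=30, threshold=500.0):
    """Toy energy-based speech detector: returns (start_s, end_s) spans
    whose average amplitude exceeds a fixed threshold.
    Assumes a mono 16-bit PCM WAV file."""
    with wave.open(path) as w:
        rate = w.getframerate()
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    frame = int(rate * frame_ms / 1000)
    regions, start = [], None
    for i in range(0, len(samples) - frame, frame):
        loud = np.abs(samples[i:i + frame].astype(np.float32)).mean() > threshold
        t = i / rate
        if loud and start is None:
            start = t
        elif not loud and start is not None:
            regions.append((start, t))
            start = None
    if start is not None:
        regions.append((start, len(samples) / rate))
    return regions

# Each detected region becomes an empty subtitle waiting for its text.
print(speech_regions("video_audio.wav"))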

So what are the steps you need to do?