
Monday, October 20, 2014
How to show the speaker segmentation

Labels:
diarization,
new feature,
speaker segmentation,
update,
web editor
Monday, September 29, 2014
How to adjust subtitle timings
Each subtitle has a start and an end time. These specify when and for how long the caption appears over the video. When editing subtitles in our editor, you can adjust their timings either with the mouse or directly from the keyboard. First, you need to enter editing mode. This can be done in several ways:
- double click on the subtitle caption
- double click on the audio waveform segment
- CTRL + click on the subtitle caption
- CTRL + I - to edit the currently played caption
- TAB or SHIFT+TAB - to edit the next or previous caption
Now that you are in editing mode, you can change the subtitle caption and its start and end time. The currently edited subtitle is marked with a light-blue background color. The following shortcuts adjust the timing:
- ALT + Left - shift caption start time by -0.1s
- ALT + Right - shift caption start time by +0.1s
- ALT + Up - shift caption end time by +0.1s
- ALT + Down - shift caption end time by -0.1s
On the left side of the editor, there is an audio waveform with segments that represent the duration of individual subtitles. You can easily adjust a subtitle's duration by holding down the left mouse button and dragging the segment borders. The audio waveform can significantly help you find a segment's beginning and end, because moments with no speech or sound look like a straight line.
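The ALT + arrow shortcuts above move a caption boundary in 0.1-second steps. As an illustrative sketch (this is not SpokenData's actual editor code, just the arithmetic behind those shortcuts), the same adjustment on a caption stored as start/end seconds looks like this:

```python
STEP = 0.1  # seconds, the same step the ALT + arrow shortcuts use

def shift_start(caption, delta):
    """Shift the caption start time, keeping it non-negative and not past the end."""
    start, end = caption
    new_start = min(max(0.0, round(start + delta, 2)), end)
    return (new_start, end)

def shift_end(caption, delta):
    """Shift the caption end time, keeping it at or after the start."""
    start, end = caption
    new_end = max(start, round(end + delta, 2))
    return (start, new_end)

caption = (63.25, 65.40)
caption = shift_start(caption, -STEP)  # like ALT + Left
caption = shift_end(caption, +STEP)    # like ALT + Up
print(caption)  # (63.15, 65.5)
```

Clamping keeps the start at or after zero and prevents the boundaries from crossing, which mirrors what dragging a segment border on the waveform enforces visually.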
Friday, September 12, 2014
Interactive waveform with Outwave.js

We are happy to announce that Outwave.js has been integrated into the SpokenData subtitles editor. From now on, our users can define speech and non-speech segments more easily, just by dragging segment boundaries on the waveform with the mouse button held down.
We are certain you will benefit from the Outwave library as much as we do. The fastest way to try the new feature is to start the SpokenData demo and edit the subtitles of any recording.
Monday, August 4, 2014
New annotation.xml file structure
We have modified the structure of the annotation XML file. It is now much easier to parse. You can get this file through the SpokenData API. See this short example:
<segment>
  <start>63.25</start>
  <end>65.40</end>
  <speaker>A</speaker>
  <text>Hello, this is the first caption</text>
</segment>
<segment>
  <start>72.92</start>
  <end>74.49</end>
  <speaker>B</speaker>
  <text>and here comes the second one</text>
</segment>
The start and end tags give the subtitle appearance time in seconds; we store the values with a precision of two decimal places. The speaker tag identifies the person who is speaking and can hold any alphanumeric value. The text tag stores the subtitle content.
You can see a live example from the SpokenData demo here:
http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/recording/846/annotation.xml
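As a sketch of how easy the new structure is to parse, here is how the two example segments could be read with Python's standard `xml.etree` module. The `<annotation>` root element here is a hypothetical wrapper for the example; the real file's root element is whatever the API returns:

```python
import xml.etree.ElementTree as ET

# The two segments from the example above, wrapped in a hypothetical root element.
xml_data = '''<annotation>
  <segment>
    <start>63.25</start>
    <end>65.40</end>
    <speaker>A</speaker>
    <text>Hello, this is the first caption</text>
  </segment>
  <segment>
    <start>72.92</start>
    <end>74.49</end>
    <speaker>B</speaker>
    <text>and here comes the second one</text>
  </segment>
</annotation>'''

root = ET.fromstring(xml_data)
segments = [
    {
        "start": float(seg.findtext("start")),   # seconds, 2 decimal places
        "end": float(seg.findtext("end")),
        "speaker": seg.findtext("speaker"),       # any alphanumeric label
        "text": seg.findtext("text"),
    }
    for seg in root.findall("segment")
]
for s in segments:
    print(f'{s["speaker"]} [{s["start"]:.2f}-{s["end"]:.2f}]: {s["text"]}')
```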
Monday, July 14, 2014
Use case: How to transcribe conference video recordings and make subtitles for them?
One handy use of automatic speech recognition (speech-to-text) is transcribing conference talks. There are plenty of conferences, and many of them are recorded and published on the conference homepage or on YouTube, for example.
Let's use any conference as an example. Recording the conference and having plenty of videos on YouTube is fine, but it quickly becomes messy. The following reasons make transcribing the talks useful:
1. Some people do not understand English very well. Reading subtitles can help them understand.
2. You need to market your conference to attract people. Videos show prospects the quality of your conference, and transcribing them to text improves your SEO, so more people will find you.
3. Searching a large collection of videos for particular information is difficult. A time-synchronized speech transcript lets you search speech quickly, even across a large collection of videos.
So using automatic voice-to-text technology is a logical step to reduce the need for human resources, especially for cases 2) and 3). There you do not care about a few errors, because the transcript is primarily for machines: search engines.
The huge advantage of our service here is the ability to adapt the automatic speech recognizer to the target domain: your conference. Usually, every technical conference has proceedings full of content words, abbreviations, technical terms, etc. These words are important within your conference but rare in general speech, so standard recognizers trained on general speech can easily miss them, and the transcript becomes useless for you.
To give you a real use case: SuperLectures, a conference video service, uses SpokenData.com automatic transcriptions in the way described above. They provide us with their proceedings so that we can adapt our recognizer; we then return the textual transcription of their audio/video data.
Thursday, June 26, 2014
SpokenData API - File upload
The SpokenData API allows developers to easily add new recordings to their media library. You can either pass a media file URL, or you can upload the whole file using the HTTP PUT method, which was recently added to SpokenData. This post demonstrates the second option: uploading a file through a SpokenData API call.
So, how does it work? Each SpokenData API call is composed of the SpokenData base API URL, your USER-ID, your API-TOKEN, and the function name:
- SpokenData API URL: http://spokendata.com/api
- USER-ID: different for each user; shown to signed-in users at http://spokendata.com/api
- API-TOKEN: different for each user; shown to signed-in users at http://spokendata.com/api
- API function: recording/put
If we concatenate the above values, we get the API call url. It may look like this:
http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/recording/put
Each API call can take parameters. When uploading a new file through the API, you need to provide the recording filename and the language for automatic speech processing, so you end up with a URL like this:
http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/recording/put?filename=audio.mp3&language=english
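The URL above is just the four parts concatenated, plus a query string. As an illustrative sketch (the `api_url` helper is ours, not part of the API; the user ID and token are the demo credentials shown above), assembling it in Python looks like this:

```python
from urllib.parse import urlencode

# Demo credentials from the example above.
BASE = "http://spokendata.com/api"
USER_ID = "18"
API_TOKEN = "br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp"

def api_url(function, **params):
    """Concatenate base URL, user id, token and function name; append query parameters."""
    url = "/".join([BASE, USER_ID, API_TOKEN, function])
    if params:
        url += "?" + urlencode(params)
    return url

print(api_url("recording/put", filename="audio.mp3", language="english"))
```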
Other optional parameters are described in the SpokenData API documentation. When you call the URL above, don't forget to put the file content in the request body.
Here is a code example that uploads an MP3 file using the HTTP PUT method to SpokenData.
<?php
// Local file to upload and the SpokenData API URL with parameters.
$fileToUpload = 'd:/audio.mp3';
$url = 'http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/recording/put?filename=audio.mp3&language=english';

// Open the file for binary reading.
$file = fopen($fileToUpload, "rb");

$curl = curl_init();
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_VERBOSE, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, array('Expect: '));
// Send the file content in the body of an HTTP PUT request.
curl_setopt($curl, CURLOPT_PUT, 1);
curl_setopt($curl, CURLOPT_INFILE, $file);
curl_setopt($curl, CURLOPT_INFILESIZE, filesize($fileToUpload));

$response = curl_exec($curl);
curl_close($curl);
fclose($file);

// Print the XML response from the server.
header("Content-Type: text/xml");
echo $response;
The server responds in XML. Here is an example response of the above script.
<?xml version="1.0" encoding="utf8"?>
<data>
  <message>New media file "audio.mp3" was successfully added.</message>
  <recording id="1373"></recording>
</data>
As you can see, the file audio.mp3 was successfully added and assigned the recording id = 1373.
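If you consume the response programmatically, the recording id can be pulled out with any XML parser. A minimal sketch in Python using the standard `xml.etree` module (the string below is the example response payload with the XML declaration dropped, since `ET.fromstring` takes plain element markup):

```python
import xml.etree.ElementTree as ET

# Example response payload from the upload call above.
response = '''<data>
  <message>New media file "audio.mp3" was successfully added.</message>
  <recording id="1373"></recording>
</data>'''

root = ET.fromstring(response)
message = root.findtext("message")
recording_id = root.find("recording").get("id")
print(message)
print(recording_id)  # 1373
```

The recording id is what you would pass to later API calls (e.g. to fetch the annotation file for that recording).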
Thursday, June 5, 2014
Tagging

Tag titles can be changed at any time. Tagging is available on your dashboard and in the transcription editing mode.
Thursday, May 22, 2014
Integration of BrainTree payments

If you are not sure about the quality of the automatic transcription, start by letting the system transcribe the first 15 minutes. Based on the result, you can then decide whether the whole transcription is worth buying.
As our first payment method, we decided to integrate BrainTree payments into SpokenData. This allows users to pay by card; Visa, MasterCard, Discover, and American Express are accepted. All payments are currently in EUR.
The great thing about BrainTree is that the user is not redirected to a third-party payment gateway but remains on the SpokenData website. All sensitive data is encrypted and not stored. Read more about Client-side encryption here.
Tuesday, May 20, 2014
WebVTT subtitles support
We have extended the set of subtitle formats SpokenData supports. From now on, subtitles can also be downloaded in WebVTT. Here is the complete list of currently supported subtitle formats:
- HTML
- SRT
- TRS
- TXT
- WebVTT
The subtitles can be downloaded from the web user interface or through API calls.
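As a sketch of what a WebVTT export involves (this is an illustration, not SpokenData's actual exporter): each subtitle's start/end times in seconds map to `HH:MM:SS.mmm` timestamps joined by an arrow. The example times are taken from the annotation.xml sample elsewhere on this blog:

```python
def vtt_timestamp(seconds):
    """Convert a time in seconds to a WebVTT HH:MM:SS.mmm timestamp."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def vtt_cue(start, end, text):
    """Format one WebVTT cue: timing line, then the caption text."""
    return f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}\n{text}"

print("WEBVTT\n")
print(vtt_cue(63.25, 65.40, "Hello, this is the first caption"))
# 00:01:03.250 --> 00:01:05.400
# Hello, this is the first caption
```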
Wednesday, May 14, 2014
There is no audio/video length limitation on the SpokenData free plan now.

We previously had a hard limit of 15 minutes on processed data. Anything you uploaded over this limit was trimmed and discarded. So if you uploaded 25 minutes of audio or video and wanted the automatic transcript for free, you got only a 15-minute video/audio with its transcript in your dashboard (after processing finished).
Currently, we do not limit your data upload. We only limit the length of the transcript we provide for free. So if you upload a 25-minute video/audio, you will find the whole 25-minute recording in your dashboard. The generated transcript is still limited to 15 minutes, so you will see the transcript only for the first 15 minutes of your data. But if you want, you can easily create the rest of the transcript yourself for free; this was not possible in the previous version, because the video was trimmed.
We hope you welcome this change.
...and stay tuned. More interesting things are coming soon...
Labels:
data,
free,
limitation,
news,
speech to text
Wednesday, April 30, 2014
Interactive audio waveform

This audio waveform viewer for the web can be downloaded from https://github.com/vdot/outwave.js.
Monday, March 31, 2014
What does the speaker segmentation technology do?

Even if you do not directly need the speaker information, speaker segmentation is very helpful for speech-to-text (STT) technology. The STT system contains an unsupervised speaker adaptation module. This module takes the parts of speech belonging to a particular speaker and adapts an acoustic model towards them. Adapting the model leads to a more accurate speech transcript.
The adaptation, even though it is called speaker adaptation, adapts the system to the whole acoustic channel: the speaker's voice characteristics, room acoustics (echo), microphone characteristics, environmental noise, etc.
Speaker segmentation is theoretically independent of speaker, language, and acoustic conditions. In practice, however, it is dependent on them, because it uses something called a universal background model (UBM). In theory, the UBM should model all speech, languages, and acoustics in the world, but it must be trained on speaker-labeled data to learn how to distinguish among speakers. And, as with other speech technologies, the farther the data you process is from the training data, the worse the accuracy you get.