SpokenData Blog: 2014

Tuesday, December 9, 2014

Vimeo is supported

As some of our users host their recordings on Vimeo, we now support processing of Vimeo files. Simply enter a Vimeo url into the Media File URL input box.

In general, users can enter:

a direct url to a media file (mp3, mp4, mpg, avi, 3gp, mkv, wav and many others)
YouTube url
Vimeo url

Besides that, you can also upload a multimedia file using the upload form or SpokenData API.

Tuesday, November 11, 2014

SpokenData API - Search in Speech

SpokenData API has a new function that enables users to search in recording transcriptions. This means that you can quickly get a list of captions matching the search query with their start and end time, caption content and speaker identity. The search can be performed either in all user recordings or in a list of selected recordings.

An example of a basic SpokenData search API call can be:
http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/search?q=student

This call searches for occurrences of "student" in all recording transcriptions of the DEMO account.

The returned XML shows the elapsed time for parsing the search query and for performing the search. As the number of results can be very high, the search API call supports paging. By default, the maximum number of results per page is set to 10. In the output XML, there are 2 types of results - recordings and captions. Each has different paging.
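As a rough illustration, the search call URL can be assembled like this in Python. Only the base URL, the demo credentials and the q parameter come from the example above; the helper name build_search_url is hypothetical, and no paging parameters are shown because their names are not given in this post.

```python
from urllib.parse import urlencode

API_BASE = "http://spokendata.com/api"

def build_search_url(user_id, api_token, query):
    """Return the search call URL for the given account and query."""
    # urlencode escapes spaces and special characters in the query.
    return f"{API_BASE}/{user_id}/{api_token}/search?{urlencode({'q': query})}"

# Demo account credentials taken from the example call above.
demo_url = build_search_url("18", "br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp", "student")
print(demo_url)
```

The URL can then be fetched with any HTTP client and the returned XML parsed as usual.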

How to show the speaker segmentation

In SpokenData subtitles editor, we changed the rule of displaying the speaker segmentation. From now on, it is hidden by default except for the recordings processed directly by the Speaker segmentation method. However, the speaker segmentation can quickly be shown by clicking on the checkbox Show speaker segmentation. Currently, SpokenData subtitles editor does not support editing the generated speaker identity. When it is implemented, we will consider changing this rule.

Monday, September 29, 2014

How to adjust subtitle timings

Each subtitle has its start and end time. These specify when and for how long the caption appears over the video. When editing subtitles in our editor, you can adjust their timings either with your mouse or directly from the keyboard. First, you need to enter the editing mode. There are several ways to do this:

double click on the subtitle caption
double click on the audio waveform segment
CTRL + click on the subtitle caption
CTRL + I - to edit the currently played caption
TAB or SHIFT+TAB - to edit the next or previous caption

Now, as you are in the editing mode, you can change the subtitle caption and its start and end time. The currently edited subtitle is marked with a light-blue background color.

ALT + Left - shift caption start time by -0.1s
ALT + Right - shift caption start time by +0.1s
ALT + Up - shift caption end time by +0.1s
ALT + Down - shift caption end time by -0.1s
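To make the effect of these shortcuts concrete, here is a small Python sketch of the arithmetic. It is not the editor's actual code; the caption dictionary, the SHORTCUTS table and the rounding to 2 decimal places are assumptions made for illustration.

```python
STEP = 0.1  # seconds moved per key press, as listed above

# Map each shortcut to (field to change, signed step).
SHORTCUTS = {
    "ALT+Left": ("start", -STEP),
    "ALT+Right": ("start", +STEP),
    "ALT+Up": ("end", +STEP),
    "ALT+Down": ("end", -STEP),
}

def apply_shortcut(caption, shortcut):
    """Return a copy of the caption with its start or end time shifted."""
    field, delta = SHORTCUTS[shortcut]
    shifted = dict(caption)
    # Round to 2 decimals to avoid floating-point drift after repeated presses.
    shifted[field] = round(caption[field] + delta, 2)
    return shifted

caption = {"start": 63.25, "end": 65.40, "text": "Hello, this is the first caption"}
caption = apply_shortcut(caption, "ALT+Left")  # start moves 0.1s earlier
caption = apply_shortcut(caption, "ALT+Up")    # end moves 0.1s later
print(caption["start"], caption["end"])
```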

On the left in the editor, there is an audio waveform with segments that represent the duration of individual subtitles. You can easily adjust a subtitle's duration by holding down the left mouse button and dragging the segment borders. The audio waveform can significantly help you find where a segment begins and ends, because moments with no speech or sound in the audio look like a straight line.

Friday, September 12, 2014

Interactive waveform with Outwave.js

Outwave.js is a handy library that can render an audio waveform in a web browser. Its development was also supported by SpokenData. Apart from that, this library has a great extension for displaying annotation segments directly on the waveform. The segments can be easily added, deleted, merged or split. In short, we really needed such a library.

Therefore, we are happy to announce that Outwave.js was integrated into the SpokenData subtitles editor. From now on, our users will more easily define speech or non-speech segments just by dragging the segment boundaries on the waveform with the mouse button held down.

We are certain you will benefit from the Outwave library as we do. The fastest way to test our new feature is to start the SpokenData demo and edit any recording subtitles.

Monday, August 4, 2014

New annotation.xml file structure

We have modified the annotation XML file structure. It is now much easier to parse. You can get this file through the SpokenData API. See this short example:

<segment>
  <start>63.25</start>
  <end>65.40</end>
  <speaker>A</speaker>
  <text>Hello, this is the first caption</text>
</segment>
<segment>
  <start>72.92</start>
  <end>74.49</end>
  <speaker>B</speaker>
  <text>and here comes the second one</text>
</segment>

The start and end tags give the time, in seconds, at which the subtitle appears and disappears. We store values with a precision of 2 decimal places. The speaker tag identifies the person who is speaking. It can hold any alphanumeric value. The text tag stores the subtitle content.

You can see a live example from the SpokenData demo here:

http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/recording/846/annotation.xml
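A minimal Python sketch of parsing this structure with the standard library follows. The two segments are the ones from the example above; the <annotation> root element wrapped around them is an assumption, since the excerpt does not show the real root, and the helper name parse_segments is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample segments from the post, wrapped in a hypothetical <annotation> root
# element (an assumption; the real file may use a different root).
SAMPLE = """<annotation>
<segment><start>63.25</start><end>65.40</end><speaker>A</speaker><text>Hello, this is the first caption</text></segment>
<segment><start>72.92</start><end>74.49</end><speaker>B</speaker><text>and here comes the second one</text></segment>
</annotation>"""

def parse_segments(xml_text):
    """Return a list of dicts with start, end, speaker and text per segment."""
    root = ET.fromstring(xml_text)
    segments = []
    # iter() finds <segment> elements at any depth, so the root name does not matter.
    for seg in root.iter("segment"):
        segments.append({
            "start": float(seg.findtext("start")),
            "end": float(seg.findtext("end")),
            "speaker": seg.findtext("speaker"),
            "text": seg.findtext("text"),
        })
    return segments

segments = parse_segments(SAMPLE)
for seg in segments:
    print(f'{seg["start"]:.2f}-{seg["end"]:.2f} [{seg["speaker"]}] {seg["text"]}')
```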

Monday, July 14, 2014

Use case: How to transcribe conference video recordings and make subtitles for them?

One handy use of automatic speech recognition (speech-to-text) technology is the transcription of conference talks. There are plenty of conferences, and many of them are recorded and published on the conference homepage or on YouTube, for example.
Let's use any conference as an example. Recording the conference and having plenty of videos on YouTube is fine, but it quickly gets messy. Here are some reasons for transcribing talks:

1) Some people do not understand English very well. Reading subtitles can help them understand.
2) You need to market your conference to attract people. Videos show the quality of your conference to prospects. Transcribing the video to text improves your SEO. More people will find you.
3) Large collections of videos are difficult to search for particular information. A time-synchronized speech transcript lets you search in speech quickly, even in a large collection of videos.

Using human labor for subtitling videos makes sense, because people do not like watching subtitles with errors - and automatic voice to text can make errors. On the other hand, transcribing all recordings from a conference lasting several days can be enormously expensive in human resources.
So the use of automatic voice to text technology is a logical step to reduce the need for human resources, especially for cases 2) and 3). There you do not care about a few errors, because the transcript is primarily for machines - search engines.

The huge advantage of our service here is that the automatic speech recognizer can be adapted to the target domain - your conference. Usually, every technical conference has proceedings which are full of content words, abbreviations, technical terms etc. These words are important (within your conference) but rare in general speech. So standard recognizers trained on general speech can easily miss them, and the transcript is then useless for you.

To give you a real use case, SuperLectures - a conference video service - uses SpokenData.com automatic transcriptions in the way described above. They provide us with proceedings so that we can adapt our recognizer. Then we return a textual transcription of their audio/video data.

Thursday, June 26, 2014

SpokenData API - File upload

SpokenData API allows developers to easily add new recordings to their media library. You can either pass a media file URL or you can upload the whole file using the HTTP PUT method that was recently implemented into SpokenData. This post will demonstrate the second option - uploading a file through the SpokenData API call.

So, how does it work? Each SpokenData API function is composed of SpokenData base API url, USER-ID, API-TOKEN and the name of the function.

SpokenData API url:	http://spokendata.com/api
USER-ID:	different for each user, available to signed-in users at http://spokendata.com/api
API-TOKEN:	different for each user, available to signed-in users at http://spokendata.com/api
API function:	recording/put

If we concatenate the above values, we get the API call url. It may look like this:

http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/recording/put

Each API call can have parameters. When uploading a new file through API, you need to enter the recording filename and the language for automatic speech processing. So basically, you will end up with a URL looking like this:

http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/recording/put?filename=audio.mp3&language=english

Other parameters are also available; you can read about them in the SpokenData API documentation. When you call the above URL, don't forget to send the file content in the body of the PUT request.

Here is a code example that uploads an MP3 file using the HTTP PUT method to SpokenData.

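<?php

// Local file to upload and the API call URL (filename and language are query parameters).
$fileToUpload = 'd:/audio.mp3';
$url = 'http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/recording/put?filename=audio.mp3&language=english';

// Open the file for binary reading; cURL streams the request body from this handle.
$file = fopen($fileToUpload, 'rb');

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl, CURLOPT_HEADER, false);                  // keep response headers out of the output
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);           // return the response instead of printing it
curl_setopt($curl, CURLOPT_BINARYTRANSFER, true);           // no longer used as of PHP 5.5
curl_setopt($curl, CURLOPT_VERBOSE, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, array('Expect: '));  // suppress the "Expect: 100-continue" header
curl_setopt($curl, CURLOPT_PUT, true);                      // use the HTTP PUT method
curl_setopt($curl, CURLOPT_INFILE, $file);
curl_setopt($curl, CURLOPT_INFILESIZE, filesize($fileToUpload));

$response = curl_exec($curl);
curl_close($curl);
fclose($file);

// Pass the XML response through to the client.
header('Content-Type: text/xml');
echo $response;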

The server responds in XML. Here is an example response of the above script.

	<?xml version="1.0" encoding="utf8"?>
	<data>
	<message>New media file "audio.mp3" was successfully added.</message>
	<recording id="1373"></recording>
	</data>

As you can see, the file audio.mp3 was successfully added and assigned the recording id = 1373.
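For readers who do not use PHP, the same flow can be sketched in Python with the standard library only. This is an illustration under assumptions, not an official client: the helper names build_upload_request and parse_recording_id are hypothetical, the request is built but not sent, and the sample response is the one shown above with the XML declaration left out.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

API_BASE = "http://spokendata.com/api"

def build_upload_request(user_id, api_token, filename, language, payload):
    """Build (but do not send) a PUT request for the recording/put call."""
    params = urlencode({"filename": filename, "language": language})
    url = f"{API_BASE}/{user_id}/{api_token}/recording/put?{params}"
    return urllib.request.Request(url, data=payload, method="PUT")

def parse_recording_id(response_xml):
    """Extract the id attribute of <recording> from the XML response."""
    root = ET.fromstring(response_xml)
    return int(root.find("recording").get("id"))

# Placeholder bytes stand in for the real file content.
request = build_upload_request(
    "18", "br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp", "audio.mp3", "english", b"...")
# To actually send it: urllib.request.urlopen(request).read()

sample_response = (
    "<data>"
    '<message>New media file "audio.mp3" was successfully added.</message>'
    '<recording id="1373"></recording>'
    "</data>"
)
recording_id = parse_recording_id(sample_response)
print(request.get_method(), request.full_url)
print(recording_id)
```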

Thursday, June 5, 2014

Tagging

Every registered user has a set of 6 tags. Each tag is marked with a different color and can have a title. Tags come in very handy when you want to filter your recordings. For example, say you have plenty of recordings and you are working on their transcriptions. When you are happy with a transcription, you mark the recording with a tag titled "done". As tags are displayed at the top of recording thumbnails, you will immediately spot the recordings which do not need to be transcribed anymore.

Tag titles can be changed at any time. Tagging is allowed on your dashboard or in the transcription editing mode.

Thursday, May 22, 2014

Integration of BrainTree payments

From now on, every registered user can get the whole transcription of their multimedia files. To lower the load of our computation cloud, we decided to charge for processing of files longer than 15 minutes. Processing of shorter files remains free and is available to anyone.

If you are not sure about the quality of the automatic transcription, start by letting the system transcribe the first 15 minutes. Based on the result, you can then decide whether it is worth buying the whole transcription.

As the first payment method, we decided to integrate BrainTree payments into SpokenData. This allows users to pay by card. Visa, MasterCard, Discover and American Express cards are accepted. All payments are now in EUR.

The great thing about BrainTree is that the user is not redirected to a third-party payment gateway but remains on the SpokenData website. All sensitive data is encrypted and not stored. Read more about Client-side encryption here.

Tuesday, May 20, 2014

WebVTT subtitles support

We have extended the number of subtitle formats supported by SpokenData. From now on, subtitles can also be downloaded in WebVTT. Here is the complete list of currently supported subtitle formats:

HTML
SRT
TRS
TXT
WebVTT

The subtitles can be downloaded from the web user interface or through API calls.
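For orientation, a WebVTT file starts with a WEBVTT header line followed by cues, each with an HH:MM:SS.mmm start and end timestamp. The Python sketch below renders captions in that shape. It only illustrates the format: the function names are hypothetical, the sample captions are invented for the example, and the files SpokenData generates may differ in detail.

```python
def vtt_timestamp(seconds):
    """Format seconds as a WebVTT timestamp: HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

def to_webvtt(captions):
    """Render (start, end, text) tuples as a WebVTT document."""
    lines = ["WEBVTT", ""]
    for start, end, text in captions:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(text)
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)

# Invented sample captions: (start in seconds, end in seconds, text).
captions = [
    (63.25, 65.40, "Hello, this is the first caption"),
    (72.92, 74.49, "and here comes the second one"),
]
print(to_webvtt(captions))
```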

Wednesday, May 14, 2014

There is no audio/video limitation on spokendata free plan now.

We changed the way we limit the processed data for free accounts. To get a free SpokenData account, register here.

We previously had a hard limit of 15 minutes on processed data. Everything you uploaded over this limit was trimmed and discarded. So if you uploaded 25 minutes of audio or video and wanted the automatic transcript for free, you later found only a 15-minute video/audio with its transcript in your dashboard (after the processing finished).

Currently, we do not limit your data upload. We only limit the length of the transcript we provide for free. So if you upload a 25-minute video/audio, you will find the whole (25-minute) video/audio in your dashboard. The generated transcript is limited to 15 minutes, so you will see only the first 15 minutes of the transcript for your data. But if you want, you can easily create the rest of the transcript yourself for free - this was not possible in the previous version because the video was trimmed.

We hope you welcome this change.

..and stay tuned. More interesting things are coming soon..

Wednesday, April 30, 2014

Interactive audio waveform

From now on, when you play any recording, you will see an interactive waveform on the left of the transcription. The waveform is only displayed on larger screens with a minimum horizontal resolution of 1220px. The waveform can significantly help you detect moments of speech. When you click into it, the player seeks to that exact moment.

This audio waveform viewer for the web can be downloaded from https://github.com/vdot/outwave.js.

Monday, March 31, 2014

What does the speaker segmentation technology do?

Speaker segmentation (diarization) is a speech technology that splits audio (or video) into segments belonging to particular speakers. What is it good for? You can more easily identify speaker turns in a dialog while making a speech transcript.

Even if you do not directly need the speaker information, speaker segmentation is very helpful for speech-to-text technology (STT). The STT technology contains an unsupervised speaker adaptation module. This module takes the parts of speech belonging to a particular speaker and adapts an acoustic model towards them. Adapting the model leads to a more accurate speech transcript.

The adaptation - even if it is called speaker adaptation - adapts the system to the whole acoustic channel. It consists of speaker's voice characteristics, room acoustics (echo), microphone characteristics, environment noise, etc.

Speaker segmentation is theoretically independent of speaker, language and acoustic conditions. But - practically - it is dependent. The reason is that it uses something called a universal background model (UBM). The UBM should model all speech, languages and acoustics of the world - theoretically. But you need to train it on some speaker-labeled data - to learn how to distinguish among speakers. And it holds (as in other speech technologies) that the further the data you process is from the training data, the worse the accuracy you get.

Settings box in Subtitles Editor

We have added a settings box to our online subtitles editor. For now, this box comes in handy primarily in these 2 situations:

I want to start playing my media file a little bit earlier. This is done by selecting a "pre-play" value from the settings box. Then the media file will always start playing from: time() - pre-play_time
I want to disable the speaker segmentation. If you don't like the speaker segmentation (diarization), you can disable it. It means that all segments will be uniformly colored and the speaker id won't be shown.
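The pre-play rule from the first item can be written as a one-line calculation. In this small Python sketch the function name is hypothetical, and clamping at 0 seconds is an assumption about what happens near the start of the file.

```python
def playback_start(caption_time, pre_play):
    """Return the position to start playing from, never before 0 seconds."""
    return max(0.0, caption_time - pre_play)

print(playback_start(63.25, 2.0))  # caption at 63.25s with a 2s pre-play
print(playback_start(1.0, 2.0))    # pre-play longer than the elapsed time
```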

The settings values are reset when the subtitles editor is loaded. By default the pre-play time is set to 0 seconds and showing of the speaker segmentation is enabled.

We hope you will take advantage of these new features!

You can try the settings box in our demo at:
http://spokendata.com/demo/start

Thursday, February 13, 2014

Use-Case: Making movie subtitles in 4 steps

Let's go through 4 easy steps for making subtitles from scratch for a movie or your home/company video. Subtitles are usually stored in a text file with time stamps to synchronize the text with the video. Examples of such text formats are SRT or TT.

You need software for making the subtitles (unless you want to edit the text file directly). You have several choices - download a desktop application (AHD Subtitles Maker Professional, ...), use a web service (http://CaptionsMaker.com, http://amara.org, http://SubtitleHorse.com), or use a smart web service such as http://SpokenData.com.

What is the difference between a standard and a smart web service for making captions? With a standard service, you need to set the timing of each subtitle yourself (example here). This can be a pretty annoying job. That is where a smart web service for making captions can help you - it automatically finds the places where speech occurs! So you just need to fill in the text. Pretty good, right?

So what are the steps you need to do?

Why do we need your speech data

Our several years of experience in speech technology research and business show a frequent clash between:

Speech technology provider: "Give us some of your speech data for testing purposes please."

and

Customer: "No way! Our speech data is our private and secret property."

So let's discuss several WHYs.

Why does the speech technology provider want the customer's data?

Speech technologies are very complex and sensitive to the match between the model and the data. This is a common problem in the whole field of machine learning. As long as you feed the classifier with "already seen" data, everything goes well. The accuracy of such an algorithm is great.
The problem occurs when you put unseen data into the algorithm - data which was not seen during training and development. It is like people living in the US: they understand English because it is their "already seen" data, but they do not understand Japanese because it was unseen data during their training phase (childhood).

What is the difference between narrowband and wideband, closetalk and distant mic?

Maybe you have come across terms like narrowband, wideband, closetalk, distant microphone, microphone array, and farfield in the past. So let me explain them a bit.

All of these terms are about the "technology" you are using for recording the speech and its relative placement against the speaker.

Why do we need to bother with this? The problem is that a speech recognizer (or a generic speech technology) is trained on data recorded under specific conditions (telephone conversations, for example). So this recognizer will recognize telephone conversations well, but will perform poorly on lecture recordings made with a camera microphone in a room with strong echo.

As research in the field of speech technologies goes on, recognizers become more and more robust. So this problem will shrink in the future. But it still holds - if your data matches the data on which the recognizer was trained, you get the best possible accuracy. There is no acoustic mismatch.

There are, let's say, three variables:

Quality of the recorded audio - sampling frequency
Distance between speaker mouth and the microphone
Number of microphones - microphone array

Narrowband vs. Wideband

Sampling frequency is one factor that can decrease the quality of the recorded audio and the final accuracy. There are two common settings - 8kHz and 16kHz (or more). Data recorded at 8kHz is so-called narrowband data. This setting is used in telephony. So if you work with telephone recordings, your data is in 8kHz due to the limitations of telephony technology. Recording a telephone call at 16kHz or more does not make sense (and brings no improvement).
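If you are not sure which kind of data you have, the sampling frequency can be read from the file header. The Python sketch below does this for WAV files with the standard wave module. The function names are hypothetical, the test files are generated silence, and the simple rule "8kHz or less is narrowband, anything above is wideband" is a simplifying assumption for illustration. Keep in mind that audio sampled at 8kHz can only represent frequencies up to 4kHz.

```python
import io
import wave

def make_silent_wav(sample_rate, seconds=1):
    """Create an in-memory mono 16-bit WAV file filled with silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * sample_rate * seconds)
    buffer.seek(0)
    return buffer

def classify_band(wav_file):
    """Label a WAV file as narrowband (8kHz or less) or wideband."""
    with wave.open(wav_file, "rb") as wav:
        rate = wav.getframerate()
    return "narrowband" if rate <= 8000 else "wideband"

print(classify_band(make_silent_wav(8000)))
print(classify_band(make_silent_wav(16000)))
```

classify_band also accepts a path to a WAV file on disk instead of an in-memory buffer.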

An example of narrowband - 8kHz data