
Monday, October 20, 2014
How to show the speaker segmentation

Labels:
diarization,
new feature,
speaker segmentation,
update,
web editor
Monday, September 29, 2014
How to adjust subtitle timings
Each subtitle has a start and an end time. These specify when and for how long the caption appears over the video. When editing subtitles in our editor, you can adjust their timings either with the mouse or directly from the keyboard. First, you need to enter editing mode. This can be done in several ways:
- double click on the subtitle caption
- double click on the audio waveform segment
- CTRL + click on the subtitle caption
- CTRL + I - to edit the currently played caption
- TAB or SHIFT+TAB - to edit the next or previous caption
Now that you are in editing mode, you can change the subtitle caption and its start and end time. The currently edited subtitle is marked with a light-blue background color. The following shortcuts adjust the timing:
- ALT + Left - shift caption start time by -0.1s
- ALT + Right - shift caption start time by +0.1s
- ALT + Up - shift caption end time by +0.1s
- ALT + Down - shift caption end time by -0.1s
On the left side of the editor, there is an audio waveform with segments that represent the duration of individual subtitles. You can easily adjust a subtitle's duration by holding down the left mouse button and dragging the segment borders. The audio waveform can significantly help you find a segment's beginning and end, because moments with no speech or sound look like a straight line.
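The ALT + arrow shortcuts above move a caption boundary in 0.1-second steps. As an illustrative sketch (this is not SpokenData's actual editor code, just the arithmetic behind those shortcuts), the same adjustment on a caption stored as start/end seconds looks like this:

```python
STEP = 0.1  # seconds, the same step the ALT + arrow shortcuts use

def shift_start(caption, delta):
    """Shift the caption start time, keeping it non-negative and not past the end."""
    start, end = caption
    new_start = min(max(0.0, round(start + delta, 2)), end)
    return (new_start, end)

def shift_end(caption, delta):
    """Shift the caption end time, keeping it at or after the start."""
    start, end = caption
    new_end = max(start, round(end + delta, 2))
    return (start, new_end)

caption = (63.25, 65.40)
caption = shift_start(caption, -STEP)  # like ALT + Left
caption = shift_end(caption, +STEP)    # like ALT + Up
print(caption)  # (63.15, 65.5)
```

Clamping keeps the start at or after zero and prevents the boundaries from crossing, which mirrors what dragging a segment border on the waveform enforces visually.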
Friday, September 12, 2014
Interactive waveform with Outwave.js

We are happy to announce that Outwave.js has been integrated into the SpokenData subtitles editor. From now on, our users can define speech and non-speech segments more easily, just by dragging segment boundaries on the waveform with the mouse button held down.
We are certain you will benefit from the Outwave library as much as we do. The fastest way to try the new feature is to start the SpokenData demo and edit the subtitles of any recording.
Monday, August 4, 2014
New annotation.xml file structure
We have modified the structure of the annotation XML file. It is now much easier to parse. You can get this file through the SpokenData API. See this short example:
<segment>
  <start>63.25</start>
  <end>65.40</end>
  <speaker>A</speaker>
  <text>Hello, this is the first caption</text>
</segment>
<segment>
  <start>72.92</start>
  <end>74.49</end>
  <speaker>B</speaker>
  <text>and here comes the second one</text>
</segment>
The start and end tags give the subtitle appearance time in seconds; we store the values with a precision of two decimal places. The speaker tag identifies the person who is speaking and can hold any alphanumeric value. The text tag stores the subtitle content.
You can see a live example from the SpokenData demo here:
http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/recording/846/annotation.xml
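As a sketch of how easy the new structure is to parse, here is how the two example segments could be read with Python's standard `xml.etree` module. The `<annotation>` root element here is a hypothetical wrapper for the example; the real file's root element is whatever the API returns:

```python
import xml.etree.ElementTree as ET

# The two segments from the example above, wrapped in a hypothetical root element.
xml_data = '''<annotation>
  <segment>
    <start>63.25</start>
    <end>65.40</end>
    <speaker>A</speaker>
    <text>Hello, this is the first caption</text>
  </segment>
  <segment>
    <start>72.92</start>
    <end>74.49</end>
    <speaker>B</speaker>
    <text>and here comes the second one</text>
  </segment>
</annotation>'''

root = ET.fromstring(xml_data)
segments = [
    {
        "start": float(seg.findtext("start")),   # seconds, 2 decimal places
        "end": float(seg.findtext("end")),
        "speaker": seg.findtext("speaker"),       # any alphanumeric label
        "text": seg.findtext("text"),
    }
    for seg in root.findall("segment")
]
for s in segments:
    print(f'{s["speaker"]} [{s["start"]:.2f}-{s["end"]:.2f}]: {s["text"]}')
```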
Monday, July 14, 2014
Use case: How to transcribe conference video recordings and make subtitles for them?
One handy use of automatic speech recognition (speech-to-text) is transcribing conference talks. There are plenty of conferences, and many of them are recorded and published on the conference homepage or on YouTube, for example.
Let's use any conference as an example. Recording the conference and having plenty of videos on YouTube is fine, but it quickly becomes messy. The following reasons make transcribing the talks useful:
1. Some people do not understand English very well. Reading subtitles can help them understand.
2. You need to market your conference to attract people. Videos show prospects the quality of your conference, and transcribing them to text improves your SEO, so more people will find you.
3. Searching a large collection of videos for particular information is difficult. A time-synchronized speech transcript lets you search speech quickly, even across a large collection of videos.
So using automatic voice-to-text technology is a logical step to reduce the need for human resources, especially for cases 2) and 3). There you do not care about a few errors, because the transcript is primarily for machines: search engines.
The huge advantage of our service here is the ability to adapt the automatic speech recognizer to the target domain: your conference. Usually, every technical conference has proceedings full of content words, abbreviations, technical terms, etc. These words are important within your conference but rare in general speech, so standard recognizers trained on general speech can easily miss them, and the transcript becomes useless for you.
To give you a real use case: SuperLectures, a conference video service, uses SpokenData.com automatic transcriptions in the way described above. They provide us with their proceedings so that we can adapt our recognizer; we then return the textual transcription of their audio/video data.
Thursday, June 26, 2014
SpokenData API - File upload
The SpokenData API allows developers to easily add new recordings to their media library. You can either pass a media file URL, or you can upload the whole file using the HTTP PUT method, which was recently added to SpokenData. This post demonstrates the second option: uploading a file through a SpokenData API call.
So, how does it work? Each SpokenData API call is composed of the SpokenData base API URL, your USER-ID, your API-TOKEN, and the function name:
- SpokenData API URL: http://spokendata.com/api
- USER-ID: different for each user; shown to signed-in users at http://spokendata.com/api
- API-TOKEN: different for each user; shown to signed-in users at http://spokendata.com/api
- API function: recording/put
If we concatenate the above values, we get the API call url. It may look like this:
http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/recording/put
Each API call can take parameters. When uploading a new file through the API, you need to provide the recording filename and the language for automatic speech processing, so you end up with a URL like this:
http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/recording/put?filename=audio.mp3&language=english
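The URL above is just the four parts concatenated, plus a query string. As an illustrative sketch (the `api_url` helper is ours, not part of the API; the user ID and token are the demo credentials shown above), assembling it in Python looks like this:

```python
from urllib.parse import urlencode

# Demo credentials from the example above.
BASE = "http://spokendata.com/api"
USER_ID = "18"
API_TOKEN = "br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp"

def api_url(function, **params):
    """Concatenate base URL, user id, token and function name; append query parameters."""
    url = "/".join([BASE, USER_ID, API_TOKEN, function])
    if params:
        url += "?" + urlencode(params)
    return url

print(api_url("recording/put", filename="audio.mp3", language="english"))
```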
Other optional parameters are described in the SpokenData API documentation. When you call the URL above, don't forget to put the file content in the request body.
Here is a code example that uploads an MP3 file using the HTTP PUT method to SpokenData.
<?php
// Local file to upload and the SpokenData API URL with parameters.
$fileToUpload = 'd:/audio.mp3';
$url = 'http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/recording/put?filename=audio.mp3&language=english';

// Open the file for binary reading.
$file = fopen($fileToUpload, "rb");

$curl = curl_init();
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_VERBOSE, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, array('Expect: '));
// Send the file content in the body of an HTTP PUT request.
curl_setopt($curl, CURLOPT_PUT, 1);
curl_setopt($curl, CURLOPT_INFILE, $file);
curl_setopt($curl, CURLOPT_INFILESIZE, filesize($fileToUpload));

$response = curl_exec($curl);
curl_close($curl);
fclose($file);

// Print the XML response from the server.
header("Content-Type: text/xml");
echo $response;
The server responds in XML. Here is an example response of the above script.
<?xml version="1.0" encoding="utf8"?>
<data>
  <message>New media file "audio.mp3" was successfully added.</message>
  <recording id="1373"></recording>
</data>
As you can see, the file audio.mp3 was successfully added and assigned the recording id = 1373.
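If you consume the response programmatically, the recording id can be pulled out with any XML parser. A minimal sketch in Python using the standard `xml.etree` module (the string below is the example response payload with the XML declaration dropped, since `ET.fromstring` takes plain element markup):

```python
import xml.etree.ElementTree as ET

# Example response payload from the upload call above.
response = '''<data>
  <message>New media file "audio.mp3" was successfully added.</message>
  <recording id="1373"></recording>
</data>'''

root = ET.fromstring(response)
message = root.findtext("message")
recording_id = root.find("recording").get("id")
print(message)
print(recording_id)  # 1373
```

The recording id is what you would pass to later API calls (e.g. to fetch the annotation file for that recording).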
Thursday, June 5, 2014
Tagging

Tag titles can be changed at any time. Tagging is available on your dashboard and in the transcription editing mode.
Thursday, May 22, 2014
Integration of BrainTree payments

If you are not sure about the quality of the automatic transcription, start by letting the system transcribe the first 15 minutes. Based on the result, you can then decide whether the whole transcription is worth buying.
As our first payment method, we decided to integrate BrainTree payments into SpokenData. This allows users to pay by card; Visa, MasterCard, Discover, and American Express are accepted. All payments are currently in EUR.
The great thing about BrainTree is that the user is not redirected to a third-party payment gateway but remains on the SpokenData website. All sensitive data is encrypted and not stored. Read more about Client-side encryption here.
Tuesday, May 20, 2014
WebVTT subtitles support
We have extended the set of subtitle formats SpokenData supports. From now on, subtitles can also be downloaded in WebVTT. Here is the complete list of currently supported subtitle formats:
- HTML
- SRT
- TRS
- TXT
- WebVTT
The subtitles can be downloaded from the web user interface or through API calls.
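As a sketch of what a WebVTT export involves (this is an illustration, not SpokenData's actual exporter): each subtitle's start/end times in seconds map to `HH:MM:SS.mmm` timestamps joined by an arrow. The example times are taken from the annotation.xml sample elsewhere on this blog:

```python
def vtt_timestamp(seconds):
    """Convert a time in seconds to a WebVTT HH:MM:SS.mmm timestamp."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def vtt_cue(start, end, text):
    """Format one WebVTT cue: timing line, then the caption text."""
    return f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}\n{text}"

print("WEBVTT\n")
print(vtt_cue(63.25, 65.40, "Hello, this is the first caption"))
# 00:01:03.250 --> 00:01:05.400
# Hello, this is the first caption
```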
Wednesday, May 14, 2014
There is no audio/video length limitation on the SpokenData free plan now.

We previously had a hard limit of 15 minutes on processed data. Anything you uploaded over this limit was trimmed and discarded. So if you uploaded 25 minutes of audio or video and wanted the automatic transcript for free, you got only a 15-minute video/audio with its transcript in your dashboard (after processing finished).
Currently, we do not limit your data upload. We only limit the length of the transcript we provide for free. So if you upload a 25-minute video/audio, you will find the whole 25-minute recording in your dashboard. The generated transcript is still limited to 15 minutes, so you will see the transcript only for the first 15 minutes of your data. But if you want, you can easily create the rest of the transcript yourself for free; this was not possible in the previous version, because the video was trimmed.
We hope you welcome this change.
...and stay tuned. More interesting things are coming soon...
Labels:
data,
free,
limitation,
news,
speech to text
Wednesday, April 30, 2014
Interactive audio waveform

This audio waveform viewer for the web can be downloaded from https://github.com/vdot/outwave.js.
Monday, March 31, 2014
What does the speaker segmentation technology do?

Even if you do not directly need the speaker information, speaker segmentation is very helpful for speech-to-text (STT) technology. The STT system contains an unsupervised speaker adaptation module. This module takes the parts of speech belonging to a particular speaker and adapts an acoustic model towards them. Adapting the model leads to a more accurate speech transcript.
The adaptation, even though it is called speaker adaptation, adapts the system to the whole acoustic channel: the speaker's voice characteristics, room acoustics (echo), microphone characteristics, environmental noise, etc.
Speaker segmentation is theoretically independent of speaker, language, and acoustic conditions. In practice, however, it is dependent on them, because it uses something called a universal background model (UBM). In theory, the UBM should model all speech, languages, and acoustics in the world, but it must be trained on speaker-labeled data to learn how to distinguish among speakers. And, as with other speech technologies, the farther the data you process is from the training data, the worse the accuracy you get.