tag:blogger.com,1999:blog-19701197194887337682024-03-05T14:36:12.407-08:00SpokenData BlogSpokenDatahttp://www.blogger.com/profile/06146005599895441185noreply@blogger.comBlogger34125tag:blogger.com,1999:blog-1970119719488733768.post-40547345898212363672018-06-11T07:06:00.000-07:002018-06-11T07:38:19.596-07:00Switched to MediaElement.jsThe <a href="https://www.spokendata.com/">SpokenData</a> transcription editor has several new features that should come in handy. We have added an indicator of the time elapsed since the last save, slightly refreshed the graphics, and changed the video player. We switched from <a href="https://www.jwplayer.com/">JW Player</a> to <a href="https://www.mediaelementjs.com/">MediaElement.js</a>, which gives us more control over customization. Synchronization between the video player and the waveform component during playback is now more precise. So give our updated editor a try; the editing experience should be smoother than before.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggvinNbnw0-niFTtLr0bp9Z-6kW3ldPa1SIdbKFP1SWvilk42ZMr6ofveedUDDNhkfiSo5Jg3913t6nNbIsBnK3-40oxA8haiYnUDOsOtWqMoeG5LnLVIoz4L45t74ga1AU6Ed1AGS_qk/s1600/mediaelement-spokendata.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="680" data-original-width="890" height="305" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggvinNbnw0-niFTtLr0bp9Z-6kW3ldPa1SIdbKFP1SWvilk42ZMr6ofveedUDDNhkfiSo5Jg3913t6nNbIsBnK3-40oxA8haiYnUDOsOtWqMoeG5LnLVIoz4L45t74ga1AU6Ed1AGS_qk/s400/mediaelement-spokendata.png" width="400" /></a></div>
<br />
<br />SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-82729051645372947182016-10-21T02:51:00.000-07:002016-10-21T02:53:57.936-07:00Synchronizing text and audio - time alignmentWe have implemented <b>automatic, language-independent text-to-audio alignment</b>. What is it all about? In short, you <b>upload your audio and text and you get subtitles</b>. See the pictures below.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwL2VhrlVbdFzvvZUqLvzgD6zDo8MQK8PLIKA7yLoDD1rMjtOFCoyv7he0-BzYlN3jkcEq1AYT-JAksK_tX96kzKzh9QL9jmHdti5mk4ISGQd_q0f6268IdswkPdCPV0seQu9vK62pt4U/s1600/alignment01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwL2VhrlVbdFzvvZUqLvzgD6zDo8MQK8PLIKA7yLoDD1rMjtOFCoyv7he0-BzYlN3jkcEq1AYT-JAksK_tX96kzKzh9QL9jmHdti5mk4ISGQd_q0f6268IdswkPdCPV0seQu9vK62pt4U/s1600/alignment01.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8XGJB4Vn8blJreewVW8ZtMG5w5D7P86z0g-V95xSlHarBpbTFXTADKxdOvE1-6IBAPJfZ7pSfhg7q1Y35NpOLgYwy5fL9GVpKTq2I2EFGfHusIoSXhamJGFShJX8StEYrktUubCxHgG4/s1600/alignment02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8XGJB4Vn8blJreewVW8ZtMG5w5D7P86z0g-V95xSlHarBpbTFXTADKxdOvE1-6IBAPJfZ7pSfhg7q1Y35NpOLgYwy5fL9GVpKTq2I2EFGfHusIoSXhamJGFShJX8StEYrktUubCxHgG4/s1600/alignment02.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
What is time alignment? Alignment is the process of attaching time stamps to a transcript or text according to the audio. Usually the text carries no timing: a sentence, a paragraph, or a page of text has no timing information. But when you have audio attached to this text -- that is, audio containing the speech -- you may want to add timing information to the text, so that you know when a particular word was spoken in the audio. You can think of the process as making subtitles (time-aligned sentences) from your text and audio.<br />
<br />
The good news is that alignment can be done automatically and, in our case, even language-independently.<br />
<br />
What is the difference from automatic speech transcription? If you have only audio and want text (a transcript, a verbatim), you have to use automatic speech transcription to convert the audio into text. But there are cases where you already have the transcript. Some examples:<br />
<ul>
<li>You wrote a script for a lecture, talk, pitch, or news report, read it aloud, and got a recording. Now you want to make subtitles. It is a waste of time and money to transcribe your speech again -- use the aligner.</li>
<li>You had someone transcribe your audio and received only plain text, but later found subtitles would be useful. Just align the earlier transcript to your audio using our aligner.</li>
<li>It can also be useful for aligning an e-book with its audio-book.</li>
</ul>
Just upload your audio and text to SpokenData.com and you will get the aligned text (like subtitles). Plus, you can use our editor to <a href="http://blog.spokendata.com/2014/09/how-to-adjust-subtitle-timings.html">edit the text, timings, etc.</a>, or use our <a href="http://blog.spokendata.com/2014/06/spokendata-api-file-upload.html">API</a>.<br />
<br />
The biggest advantage of our approach is its language independence: it should work reasonably well for any language.<br />
<br />
The technology does have a caveat: it expects the speech in the audio and the text to match fully. If part of your speech is untranscribed, or the text contains notes that are not spoken in the audio, the aligner still does its best, but time shifts can appear near those regions.<br />
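The aligned output is essentially a list of (start, end, text) triples, which maps directly onto subtitle formats such as SRT. A minimal sketch of that mapping (our own illustration, not SpokenData's implementation):

```python
def fmt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3600000)
    m, ms = divmod(ms, 60000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """segments: iterable of (start_sec, end_sec, text) triples."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{fmt_time(start)} --> {fmt_time(end)}\n{text}\n")
    return "\n".join(blocks)

print(to_srt([(63.25, 65.40, "Hello, this is the first caption")]))
```
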
<br />SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-53191792846090092982016-08-25T02:41:00.004-07:002016-08-25T02:41:55.710-07:00Capitalization and punctuationWhen processing your recordings with our English Speech-To-Text technology, you get a transcript containing capital letters and basic punctuation, which makes the subtitles easier to read. There is still plenty of room for improvement, but we believe you will like this new feature. Give it a try at <a href="https://spokendata.com/">https://spokendata.com</a>.<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEip2MPMaDSG_zdnq70NFWAR4WuKAGlKA7Zjy7-mVtTfWd9-xCtV0kSzUcnDx5TQ7B23TCZDMYXCplhPSU8GUWPXXQFC3521AaBrvCKMQjJfDG508PfzRNyRfQvAT0lhEQAhDxA60bfj6BA/s1600/capitalization-punctuation.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEip2MPMaDSG_zdnq70NFWAR4WuKAGlKA7Zjy7-mVtTfWd9-xCtV0kSzUcnDx5TQ7B23TCZDMYXCplhPSU8GUWPXXQFC3521AaBrvCKMQjJfDG508PfzRNyRfQvAT0lhEQAhDxA60bfj6BA/s1600/capitalization-punctuation.png" /></a>SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-32430725476215743912015-10-14T05:47:00.000-07:002015-10-15T01:48:35.071-07:00SpokenData runs on HTTPSSince last week, the entire <a href="https://spokendata.com/">SpokenData</a> website has been served over HTTPS, not just the authentication and payment pages as before. All HTTP requests are automatically redirected to HTTPS. We try to keep security tight.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://spokendata.com/"><img border="0" height="47" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnZj3ZF5gsXDS062fppLABXSolgilsOncEiS51YhB1ylxP5Bw6eAokQ3k_A0Zs3I1hIxkyLsy4H3c8HwPMDS-XWSXATuFkXuCJjSp0obUEoAB4J7QqK7cMZYxIagA_BH44MR3aWxLQ128/s320/https-spokendata.jpg" width="320" /></a></div>
<br />SpokenDatahttp://www.blogger.com/profile/06146005599895441185noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-90882908241796876912015-09-29T07:32:00.002-07:002015-09-30T01:12:29.915-07:00Times Digital: QuickQuote<a href="http://times.github.io/quickQuote/">QuickQuote</a> is a web application that helps users select quotes from a video and embed them in an article. The <a href="http://spokendata.com/api">SpokenData API</a> is used to generate the video transcription. The project is maintained by <a href="https://github.com/times">Times Digital</a> and <a href="https://github.com/pietrop">Pietro Passarelli</a>.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYrD-AmT7nwL5lJNq5b-On9kMx1cMPKaS6D6hlWXDs1tCGltgYWsRJ4WZqtQzd6OhSjh-uOnMG6TtRvKemnw-LZ0N1e5zDThb_r8GK9h-RMqdF1SrqYDxMby-7LT5XQH9xQTjqlV8MTu0/s1600/quickquote.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYrD-AmT7nwL5lJNq5b-On9kMx1cMPKaS6D6hlWXDs1tCGltgYWsRJ4WZqtQzd6OhSjh-uOnMG6TtRvKemnw-LZ0N1e5zDThb_r8GK9h-RMqdF1SrqYDxMby-7LT5XQH9xQTjqlV8MTu0/s400/quickquote.png" width="400" /></a></div>
You can read more about the tool at <a href="http://www.niemanlab.org/2015/09/a-new-tool-from-the-times-of-london-lets-you-easily-detect-and-capture-quotes-from-a-video/">http://www.niemanlab.org/2015/09/a-new-tool-from-the-times-of-london-lets-you-easily-detect-and-capture-quotes-from-a-video/</a>.SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-2537098863897470472015-06-18T07:18:00.003-07:002015-06-18T07:20:19.213-07:00Automatic Speech Transcription in English, Russian, Chinese, Spanish, Czech, Slovak, ...<a href="http://spokendata.com/">SpokenData</a> automatically transcribes recordings in quite a number of languages, and more are on the way. Just upload your recordings and select the language. Do you have a specific audio domain, such as TV news or lectures? Then you can also select from specifically trained recognizers that should generate a better transcription. If you have lots of data to transcribe, contact us and we will train a recognizer to get the best transcription of your data.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://spokendata.com/"><img border="0" height="250" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZR29jqHI4fGgfp9F0tAadkfyvvvXg-BPZY9dcBQZs9wb5sstW64IKbjsJBp1RB7UKcZk9B2lrDxUmFEIGCqzyKYPTHe6mwfWsvBl8xA0gPilNvbV4-aFOTqfckFScTQF-ZaluIf_fvcQ/s400/spokendata-languages.png" width="400" /></a></div>
SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-79862272664083914922015-02-18T01:35:00.001-08:002015-02-18T01:35:21.862-08:00Online speech-to-text transcription - now with Russian support!<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgraInHcVJqI6h1FwDG3Pcs-XKhk9nFP4rdkNnMZy6xJ1MX3dqtwdplviZJH6SsmzVVBAXhn3XBHdnh0je0yUqZXLnblbGv0AQvYWXF_E1QyCia3Ary8sh3dX6kiufVzPFcLMixYCsWxHw/s1600/news.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgraInHcVJqI6h1FwDG3Pcs-XKhk9nFP4rdkNnMZy6xJ1MX3dqtwdplviZJH6SsmzVVBAXhn3XBHdnh0je0yUqZXLnblbGv0AQvYWXF_E1QyCia3Ary8sh3dX6kiufVzPFcLMixYCsWxHw/s1600/news.jpg" height="101" width="200" /></a></div>
We have added Russian language support to our site, and you can now transcribe recordings of Russian speech! Just log in to <a href="http://spokendata.com/">SpokenData</a>, upload your recording, and get an automatic written transcript completely free! You can also point us to your recording with a link to YouTube, Vimeo, or any other online hosting service. Our software downloads the data and converts the audio into text in a matter of minutes. When the transcription is finished, you will be notified by email. After that, you can edit the text in our online editor.<br /><br />For software developers, we provide an easy-to-use <a href="http://spokendata.com/api-for-developers">API</a>.<br /><br />Not registered on SpokenData.com yet? Complete our quick registration <a href="http://spokendata.com/register">here</a> and discover transcription of your audio in less than a minute!SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-57806267234473748072015-02-07T12:08:00.001-08:002015-02-10T01:20:20.577-08:00Russian voice to text online service.<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgraInHcVJqI6h1FwDG3Pcs-XKhk9nFP4rdkNnMZy6xJ1MX3dqtwdplviZJH6SsmzVVBAXhn3XBHdnh0je0yUqZXLnblbGv0AQvYWXF_E1QyCia3Ary8sh3dX6kiufVzPFcLMixYCsWxHw/s1600/news.jpg" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgraInHcVJqI6h1FwDG3Pcs-XKhk9nFP4rdkNnMZy6xJ1MX3dqtwdplviZJH6SsmzVVBAXhn3XBHdnh0je0yUqZXLnblbGv0AQvYWXF_E1QyCia3Ary8sh3dX6kiufVzPFcLMixYCsWxHw/s1600/news.jpg" height="101" width="200" /></a>Hi,<br />
<br />
We have added support for a new language: Russian. You can now process any of your recordings in Russian. Just log in to <a href="http://spokendata.com/">SpokenData</a>, submit your data, and get an automatic text transcript for free in a few minutes. Another option is to provide us with the URL of your data on YouTube, Vimeo, or another online service. We download the data and convert it into text quickly. You are notified by email when the conversion of audio into text is done. You can also edit the transcript yourself in our web editor later.<br />
If you are a developer, feel free to integrate <a href="http://spokendata.com/api-for-developers">our API</a>. It's easy.<br />
<br />
Don't have a SpokenData.com account yet? Just <a href="http://spokendata.com/register">register here</a> and you can be processing your data in a minute! SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-58895755864675073152015-01-31T10:20:00.002-08:002015-02-10T01:21:31.135-08:00American Spanish speech to text for free!<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgraInHcVJqI6h1FwDG3Pcs-XKhk9nFP4rdkNnMZy6xJ1MX3dqtwdplviZJH6SsmzVVBAXhn3XBHdnh0je0yUqZXLnblbGv0AQvYWXF_E1QyCia3Ary8sh3dX6kiufVzPFcLMixYCsWxHw/s1600/news.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgraInHcVJqI6h1FwDG3Pcs-XKhk9nFP4rdkNnMZy6xJ1MX3dqtwdplviZJH6SsmzVVBAXhn3XBHdnh0je0yUqZXLnblbGv0AQvYWXF_E1QyCia3Ary8sh3dX6kiufVzPFcLMixYCsWxHw/s1600/news.jpg" height="101" width="200" /></a>Hello,<br />
<br />
We are happy to announce that we now support American Spanish. So, if you have any voice recordings, you can process them in <a href="http://spokendata.com/">SpokenData</a> and get an automatic text transcript for free. As our service runs in the cloud, it is very easy to get the text: just take your audio or video files in Spanish and upload them.<br />
The second option is to provide us with the URL of your data on YouTube, Vimeo, or another online service. We download the data and convert it into text quickly. You are notified by email when the conversion is done. You can also edit the transcript yourself in our web editor later. If you are a developer, feel free to integrate <a href="http://spokendata.com/api-for-developers">our API</a>. It's easy.<br />
<br />
Don't have a SpokenData.com account yet? Just <a href="http://spokendata.com/register">register here</a> and you can be processing your data in a minute! SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-58101021523568909032015-01-23T00:26:00.000-08:002015-01-31T10:21:44.458-08:00Download recording video, audio and subtitles<span style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQuu8YHZVy3KrsXOCZpqAttc-gjLw48hCMPzDxlaWanyy78iv729_0a5V1wfiwjTnYUkE0KL996Y9CDXijOB06b_qhNiGzpbU3J2FSUSGktuRWBy802ETyP6mKiJZX6NbXAeIl4ot9Rxk/s1600/spokendata-download-menu.png" height="320" width="204" /></span><a href="http://spokendata.com/">SpokenData</a> users can now download a processed recording's video in MP4, its audio in MP3, and its subtitles in a variety of formats. The files are accessible through the download menu or the SpokenData API.<br />
<blockquote class="tr_bq">
<ul>
<li><a href="http://en.wikipedia.org/wiki/SubRip#SubRip_text_file_format">SRT</a> - SubRip text file format</li>
<li>TRS - used in <a href="http://trans.sourceforge.net/en/presentation.php">Transcriber</a></li>
<li><a href="http://dev.w3.org/html5/webvtt/">WebVTT</a> - The Web Video Text Tracks Format</li>
</ul>
</blockquote>
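For a sense of how the text-based formats differ: WebVTT adds a "WEBVTT" header and uses a dot instead of a comma as the millisecond separator in timestamps. A rough sketch of a naive SRT-to-WebVTT conversion (our illustration; real converters also handle styling and cue settings):

```python
def srt_to_vtt(srt_text):
    """Naive SRT -> WebVTT conversion: prepend the WEBVTT header and
    swap the millisecond separator in timestamp lines from ',' to '.'."""
    lines = []
    for line in srt_text.splitlines():
        if "-->" in line:  # timestamp lines contain the cue arrow
            line = line.replace(",", ".")
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines)

print(srt_to_vtt("1\n00:00:01,000 --> 00:00:02,500\nHello"))
```
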
<br />
<br />
<br />SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-75462396908694696732015-01-19T04:25:00.001-08:002015-01-31T10:22:13.738-08:00Set deadline for your transcriptionDo you create or edit transcriptions yourself, or have a team of annotators? <a href="http://spokendata.com/">SpokenData</a> has a handy new feature that can help you finish the transcription process in time. From now on, you can set a deadline for each processed recording. Just click on the menu button and select the <b>Set deadline</b> item.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidjSbnYGrTFVIB6RAjD0Eo4dJbx4I4qEEXdrerFU7MJakAZ1owdKvGgEiDQLwzZIzNweIH_cYcOnM-H022FpPf4tsvYVKIv6Mz9ainwNGDQRPeRThuyma1zgG3Y4zdbG8dqTUU2_KfDYI/s1600/deadline.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidjSbnYGrTFVIB6RAjD0Eo4dJbx4I4qEEXdrerFU7MJakAZ1owdKvGgEiDQLwzZIzNweIH_cYcOnM-H022FpPf4tsvYVKIv6Mz9ainwNGDQRPeRThuyma1zgG3Y4zdbG8dqTUU2_KfDYI/s1600/deadline.png" height="204" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
Then you can sort your recordings by deadline and see which recordings should be finished soon. The deadline information can appear in three different colors:<br />
<br />
<ul>
<li><b>red</b>: deadline has already passed</li>
<li><b>orange</b>: deadline will pass within 24 hours</li>
<li><b>black</b>: deadline will pass in more than 24 hours</li>
</ul>
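The three rules amount to a simple function of the time remaining; a small sketch of the logic (the function name is our own, not part of SpokenData):

```python
from datetime import datetime, timedelta

def deadline_color(deadline, now=None):
    """red: already passed; orange: within 24 hours; black: later."""
    now = now or datetime.now()
    if deadline <= now:
        return "red"
    if deadline - now <= timedelta(hours=24):
        return "orange"
    return "black"

# Fixed 'now' so the example is deterministic:
now = datetime(2015, 1, 19, 12, 0)
print(deadline_color(datetime(2015, 1, 19, 9, 0), now))   # a deadline that has passed
print(deadline_color(datetime(2015, 1, 20, 9, 0), now))   # a deadline 21 hours away
```
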
<br />
<br />
The deadline can always be changed or removed. This feature also helps your annotators, who will see how much time they have left to complete their jobs.SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-20803808514750991692014-12-09T06:20:00.001-08:002015-01-31T10:23:14.734-08:00Vimeo is supported<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjo79TpFe3LCaoocp4b6Gs-XWpfKL1mfQyEuFF4JhPG-CYjeqY8hD-jVTYYXadyD48LuNj6ZaKsoBZlKOpzqQoaOePdg94ZePcObrOB_gN9QI6sFUeFqCX7f06Qt2SqqKXIWrcVuxfg2TE/s1600/vimeo-logo.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjo79TpFe3LCaoocp4b6Gs-XWpfKL1mfQyEuFF4JhPG-CYjeqY8hD-jVTYYXadyD48LuNj6ZaKsoBZlKOpzqQoaOePdg94ZePcObrOB_gN9QI6sFUeFqCX7f06Qt2SqqKXIWrcVuxfg2TE/s1600/vimeo-logo.png" height="60" width="200" /></a>As some of our users host their recordings on Vimeo, <b>we now support processing of Vimeo files</b>. Simply enter a Vimeo URL into the Media File URL input box. <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmxK9NuX6hmG5gT-Phd3THjbABjlhCbXRUe6iB88fXd1X_U6IUl9LsP_otieKnZR1hNy8bhSfqWkGT3aJeGS-4EaATquxKz0VwyYAB0wloZz4hLeX8lMpZVnI5z7xL3Kn7tu4IoTCombc/s1600/vimeo-media-file-url.png" imageanchor="1" style="clear: left; display: inline !important; margin-bottom: 1em; margin-right: 1em; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmxK9NuX6hmG5gT-Phd3THjbABjlhCbXRUe6iB88fXd1X_U6IUl9LsP_otieKnZR1hNy8bhSfqWkGT3aJeGS-4EaATquxKz0VwyYAB0wloZz4hLeX8lMpZVnI5z7xL3Kn7tu4IoTCombc/s1600/vimeo-media-file-url.png" height="38" width="320" /></a><br />
In general, users can enter:<br />
<ul>
<li>a direct URL to a media file (mp3, mp4, mpg, avi, 3gp, mkv, wav, and many others)</li>
<li>a YouTube URL</li>
<li>a Vimeo URL</li>
</ul>
<div>
Besides that, you can also upload a multimedia file using the upload form or <a href="http://spokendata.com/api-for-developers">SpokenData API</a>.</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJfmR2t5nqVJdoH_Vf9fBNP1KlLheUGJxQnzXK1CHzQ8CHoFz0rWQTPRpvmrJ_E2AyF16R0Fwd7fQtE3rM920rlMxBr_1WNe4ad2Tkju_xVM1ZmOChe79YmgvZT4HuWbtsyua5RK5DeLc/s1600/vimeo-url.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><br /></a></div>
SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-5467551096577946602014-11-11T05:09:00.000-08:002015-01-31T10:24:47.933-08:00SpokenData API - Search in Speech<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBYk9-V_jJlAD_Wnp7JPHLQjL_xSZzmbi03xD23RxfrbiAfEQR3zvgFen04VY-6SqT2mSnHVTlrQ-GujKvJB81ezEAtVKuvmDtK8CC8WZ1HxRmSF6H-d95j47xjGGuSvVAYxGcD9oWGME/s1600/spokendata-search.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBYk9-V_jJlAD_Wnp7JPHLQjL_xSZzmbi03xD23RxfrbiAfEQR3zvgFen04VY-6SqT2mSnHVTlrQ-GujKvJB81ezEAtVKuvmDtK8CC8WZ1HxRmSF6H-d95j47xjGGuSvVAYxGcD9oWGME/s200/spokendata-search.png" width="200" /></a><a href="http://spokendata.com/api-for-developers">SpokenData API</a> has a new function that <b>enables users to search in recording transcriptions</b>. This means you can quickly get a list of captions matching a search query, together with their start and end times, caption content, and speaker identity. The search can be performed either in all of a user's recordings or in a selected list of recordings.<br />
<br />
An example of a basic SpokenData search <a href="http://spokendata.com/api-for-developers">API</a> call can be:<br />
<a href="http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/search?q=student">http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/search?q=student</a><br />
<br />
It searches for occurrences of <i>student</i> in all recording transcriptions of the DEMO account.<br />
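Since the API is plain HTTP GET, a call like the one above can be built programmatically. A small Python sketch (the URL pattern and the public DEMO key are taken from the example above; the helper function name is our own):

```python
from urllib.parse import urlencode

API_BASE = "http://spokendata.com/api"

def search_url(user_id, api_key, query, **params):
    """Build a SpokenData search API URL, e.g. .../search?q=student."""
    params["q"] = query
    return f"{API_BASE}/{user_id}/{api_key}/search?{urlencode(params)}"

# The public DEMO credentials from the example above:
url = search_url(18, "br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp", "student")
print(url)
```

Fetching the URL (with urllib or requests) returns the XML described below.
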
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfF-l7fRNXpP3Q9axy4DJ0qxgys9l91Qmjaz6CHlAJ3t1YL-OaFaexWG5gjEmV16PafDlbGQyIDJc2g6aoc_smvDrP4PvsDIBGiYHRdg7ywlc-fh5N6eMEdP6thNXnHaRjAGRRxlsaEgc/s1600/search-output.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfF-l7fRNXpP3Q9axy4DJ0qxgys9l91Qmjaz6CHlAJ3t1YL-OaFaexWG5gjEmV16PafDlbGQyIDJc2g6aoc_smvDrP4PvsDIBGiYHRdg7ywlc-fh5N6eMEdP6thNXnHaRjAGRRxlsaEgc/s1600/search-output.png" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
The returned XML shows the elapsed time for parsing the search query and for performing the search. Since the number of results can be very high, the search API call <b>supports paging</b>. By default, the maximum number of results per page is set to 10. The output XML contains two types of results - recordings and captions - each with its own paging.</div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<a name='more'></a><br />
<div class="separator" style="clear: both; text-align: left;">
<b>Recordings</b></div>
<div class="separator" style="clear: both; text-align: left;">
The value of <i>number_of_occurrences</i> is the number of the recording's captions that match the search query. Recordings are sorted by number of occurrences (highest first).</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Recordings paging:</div>
<ul>
<li>recordingPageSize = 10 by default</li>
<li>recordingPageNumber = 0 by default</li>
</ul>
<div>
<b>Captions</b></div>
<div>
Every caption has several values: the start and end time of the caption, the speaker identity, and the caption content. Captions are ordered by their start time (lowest first).<br />
<br />
Captions paging:</div>
<div>
<ul>
<li>captionPageSize = 10 by default</li>
<li>captionPageNumber = 0 by default</li>
</ul>
</div>
Caption results can be omitted by adding a parameter <i>recordingListOnly=1 </i>to the search API call.<br />
<br />
<b>Search in selected recordings</b><br />
The optional parameter <i>recordingId</i> specifies which recordings are searched. Its value is a comma-delimited list of recording IDs (e.g.: recordingId=845,446).<br />
<br />
An example of an advanced SpokenData API search call:<br />
<a href="http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/search?q=what&recordingId=846&captionPageSize=2&captionPageNumber=1">http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/search?q=what&recordingId=846&captionPageSize=2&captionPageNumber=1</a><br />
<br />
This API call returns the third and fourth caption results (the caption page size is 2 and the page number is 1), and the search is performed only in the transcription of the recording with id=846.<br />
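The paging parameters behave like a standard zero-based page index: page number N of size S covers results N*S through N*S + S - 1. The arithmetic, sketched in Python:

```python
def page_slice(results, page_size=10, page_number=0):
    """Return the results on a zero-based page, mirroring the
    captionPageSize / captionPageNumber parameters (defaults 10 and 0)."""
    start = page_number * page_size
    return results[start:start + page_size]

captions = ["cap1", "cap2", "cap3", "cap4", "cap5"]
# Page size 2, page number 1 -> the third and fourth results:
print(page_slice(captions, page_size=2, page_number=1))
```
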
<br />
More information on search and other API calls is shown after signing in and enabling the SpokenData API.SpokenDatahttp://www.blogger.com/profile/06146005599895441185noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-75564574047231395112014-10-20T08:27:00.001-07:002015-01-31T10:25:40.408-08:00How to show the speaker segmentation<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv2RYcelGUxZdZwVYmmZ_sL7s8soSrU09FF-XW-ZOu4MtkS0K9b60IdtDy85ui9qHle1Qvl0hvKRTdlXdvo7hD9r4oxuKvXPgPrjPlHrWtsK-xB2pFI-yqgbzqLkPHj3j9l2DdCpCW35k/s1600/show-speaker-segmentation.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv2RYcelGUxZdZwVYmmZ_sL7s8soSrU09FF-XW-ZOu4MtkS0K9b60IdtDy85ui9qHle1Qvl0hvKRTdlXdvo7hD9r4oxuKvXPgPrjPlHrWtsK-xB2pFI-yqgbzqLkPHj3j9l2DdCpCW35k/s1600/show-speaker-segmentation.jpg" height="164" width="320" /></a>In the SpokenData subtitles editor, we changed the rule for displaying the speaker segmentation. From now on, it is hidden by default, except for recordings processed directly by the Speaker segmentation method. However, the speaker segmentation can quickly be shown by ticking the <b>Show speaker segmentation</b> checkbox. Currently, the SpokenData subtitles editor does not support editing the generated speaker identity; when that is implemented, we will consider changing this rule.SpokenDatahttp://www.blogger.com/profile/06146005599895441185noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-9911108086639990552014-09-29T07:51:00.000-07:002014-09-29T07:54:23.690-07:00How to adjust subtitle timingsEach subtitle has a start and an end time, which specify when and for how long the caption appears over the video. When editing subtitles in our editor, you can adjust these timings either with your mouse or directly from the keyboard. First, you need to enter <b>editing mode</b>.
You can do this in several ways:<br />
<div>
<ul>
<li><b>double click on the subtitle caption</b></li>
<li><b>double click on the audio waveform segment</b></li>
<li><b>CTRL + click on the subtitle caption</b></li>
<li><b>CTRL + I</b> - to edit the currently played caption</li>
<li><b>TAB</b> or <b>SHIFT+TAB </b>- to edit the next or previous caption</li>
</ul>
<div>
Now that you are in <b>editing mode</b>, you can change the subtitle caption and its start and end time. The currently edited subtitle is marked with a light-blue background color.</div>
<div>
<ul>
<li><b>ALT + Left</b> - shift caption start time by -0.1s</li>
<li><b>ALT + Right</b> - shift caption start time by +0.1s</li>
<li><b>ALT + Up</b> - shift caption end time by +0.1s</li>
<li><b>ALT + Down</b> - shift caption end time by -0.1s</li>
</ul>
</div>
<div>
<br /></div>
</div>
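Each shortcut nudges one caption boundary by 0.1 s. The effect can be sketched as follows (a toy model for illustration, not the editor's actual code):

```python
STEP = 0.1  # seconds, the increment used by the ALT shortcuts

def shift(caption, key):
    """caption: dict with 'start' and 'end' in seconds; returns the
    adjusted timings for the given ALT shortcut."""
    start, end = caption["start"], caption["end"]
    if key == "ALT+Left":
        start -= STEP
    elif key == "ALT+Right":
        start += STEP
    elif key == "ALT+Up":
        end += STEP
    elif key == "ALT+Down":
        end -= STEP
    return {"start": round(start, 1), "end": round(end, 1)}

print(shift({"start": 63.2, "end": 65.4}, "ALT+Right"))
```
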
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj16Qdqw3DuehbZ4E58QX9LCkXvIaH2cbV1Lm-ktM63MusqZFI4Dzy1Ey9S3UWfUIg3du3XMigNkc-lK_WXT8nA39drcVMcPz4kONxB1-7utIp9XxXLiq4QnTEC0UCd4CSu6Ct0HYf1vxE/s1600/subtitles-editor-edit-mode.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj16Qdqw3DuehbZ4E58QX9LCkXvIaH2cbV1Lm-ktM63MusqZFI4Dzy1Ey9S3UWfUIg3du3XMigNkc-lK_WXT8nA39drcVMcPz4kONxB1-7utIp9XxXLiq4QnTEC0UCd4CSu6Ct0HYf1vxE/s1600/subtitles-editor-edit-mode.png" height="275" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
On the left in the editor, there is an audio waveform with segments that represent the duration of each subtitle. You can easily adjust a subtitle's duration by holding down the left mouse button and dragging the segment borders. The audio waveform can significantly help you define a segment's beginning and end, because moments with no speech or sound in the audio look like a straight line.</div>
SpokenDatahttp://www.blogger.com/profile/06146005599895441185noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-18415005321533732882014-09-12T06:38:00.001-07:002015-01-31T10:26:51.255-08:00Interactive waveform with Outwave.js<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJMpw86edZSoyvIy9wR6v5rgxydUqJ1qZLdBgPU5WcYyL78-d7xi8NNiu5jVAGMFRR2kbfof40hiKkZlLyRcr2kjco3xC55Lc2lMbEiMHz8MmgilZ6mU7FYVjyH_P4qwIGjXY2XgCd6BY/s1600/outwave-js.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJMpw86edZSoyvIy9wR6v5rgxydUqJ1qZLdBgPU5WcYyL78-d7xi8NNiu5jVAGMFRR2kbfof40hiKkZlLyRcr2kjco3xC55Lc2lMbEiMHz8MmgilZ6mU7FYVjyH_P4qwIGjXY2XgCd6BY/s1600/outwave-js.png" height="191" width="320" /></a><a href="https://github.com/vdot/outwave.js">Outwave.js</a> is a handy library that renders an audio waveform in a web browser; its development was also supported by SpokenData. On top of that, the library has a great extension for displaying annotation segments directly on the waveform: segments can easily be added, deleted, merged, or split. In short, it is exactly the library we needed.<br />
<br />
Therefore, we are happy to announce that Outwave.js has been integrated into the SpokenData subtitles editor. From now on, our users can more easily define speech and non-speech segments just by dragging the segment boundaries on the waveform with the mouse button held down.<br />
<br />
We are certain you will benefit from the Outwave library as much as we do. The fastest way to try the new feature is to start the <a href="http://spokendata.com/demo/start">SpokenData demo</a> and edit the subtitles of any recording.SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-25162946973636057272014-08-04T06:51:00.001-07:002015-01-31T10:27:45.553-08:00New annotation.xml file structure<div class="collapsible" id="collapsible148" style="font-size: 13px;">
<div class="expanded">
<div class="line">
<span style="font-family: inherit;">We have modified the annotation XML file structure. It is now much easier to parse. You can get this file through the <a href="http://spokendata.com/api-for-developers">SpokenData API</a>. See this short example:</span><br />
<div style="background: rgb(238, 238, 238); overflow: auto; padding: 10px;">
<pre style="font-family: 'Courier New', Courier, monospace; margin: 0;"><segment>
  <start>63.25</start>
  <end>65.40</end>
  <speaker>A</speaker>
  <text>Hello, this is the first caption</text>
</segment>
<segment>
  <start>72.92</start>
  <end>74.49</end>
  <speaker>B</speaker>
  <text>and here comes the second one</text>
</segment></pre>
</div>
<br />
<span style="font-family: inherit;">The start and end tags give the times, in seconds, at which the subtitle appears and disappears; we store the values with a precision of 2 decimal places. The speaker tag identifies the person who is speaking and can hold any alphanumeric value. The text tag stores the subtitle content.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">You can see a live example from the SpokenData demo here:</span><br />
<span style="font-family: inherit; font-size: x-small;"><a href="http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/recording/846/annotation.xml">http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/recording/846/annotation.xml</a></span>
</div>
</div>
</div>
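Parsing this structure takes only a few lines in most languages. Here is a minimal Python sketch; the root element name ("data") is an assumption for illustration, so adjust it to the real file's root if it differs.

```python
import xml.etree.ElementTree as ET

# Sample in the structure shown above. The root element name ("data")
# is an assumption for illustration; adjust to the real file's root.
XML_TEXT = """<data>
  <segment>
    <start>63.25</start>
    <end>65.40</end>
    <speaker>A</speaker>
    <text>Hello, this is the first caption</text>
  </segment>
  <segment>
    <start>72.92</start>
    <end>74.49</end>
    <speaker>B</speaker>
    <text>and here comes the second one</text>
  </segment>
</data>"""

def parse_segments(xml_text):
    """Return (start, end, speaker, text) for every <segment> element."""
    root = ET.fromstring(xml_text)
    return [(float(seg.findtext("start")),
             float(seg.findtext("end")),
             seg.findtext("speaker"),
             seg.findtext("text"))
            for seg in root.iter("segment")]

segments = parse_segments(XML_TEXT)  # -> two segments, the first starting at 63.25
```

The same `findtext` pattern works unchanged when you feed it the raw bytes fetched from the API URL above.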
SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-5031981486773016012014-07-14T06:37:00.001-07:002014-07-14T06:42:42.444-07:00Use case: How to transcribe conference video recordings and make subtitles for them?One handy use of automatic speech recognition technology - speech-to-text - is the transcription of conference talks. There are plenty of conferences, and many of them are recorded and published on the conference homepage or on YouTube, for example.<br />
Let's take any conference as an example. Recording the talks and having plenty of videos on YouTube is fine, but it quickly becomes messy. Here are a few reasons why transcribing the talks is useful:<br />
<ol>
<li>Some people do not understand English very well. Reading subtitles can help them understand.</li>
<li>You need to market your conference to attract people. Videos show prospective attendees the quality of your conference, and transcribing the videos to text improves your SEO, so more people will find you.</li>
<li>Searching a large collection of videos for particular information can be difficult. A time-synchronous speech transcript lets you search the speech quickly, even in a large collection of videos.</li>
</ol>
Using human labor for subtitling videos makes sense, because people do not like watching subtitles with errors - and automatic voice-to-text can make errors. On the other hand, transcribing all recordings from a conference several days long can be enormously expensive in human resources.<br />
So using automatic voice-to-text technology is a logical step to reduce the need for human labor, especially for cases 2 and 3. There you do not care about a few errors, because the transcript is primarily for machines - search engines.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZPTp6zjvR6o1kHSspisNWenzMTF81vQDUfktjl21n9B7RRDvulq9C_cMCGPSy5939teV9_0LCweEdpN_Z6u-wzFx0wqHkJtNeoMavaI298hAy2hWFyQyHjg3ta8pyUxsocnbC_9Qsc9o/s1600/general.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZPTp6zjvR6o1kHSspisNWenzMTF81vQDUfktjl21n9B7RRDvulq9C_cMCGPSy5939teV9_0LCweEdpN_Z6u-wzFx0wqHkJtNeoMavaI298hAy2hWFyQyHjg3ta8pyUxsocnbC_9Qsc9o/s1600/general.jpg" height="136" width="320" /></a></div>
<br />
The huge advantage of our service here is the ability to adapt the automatic speech recognizer to the target domain - your conference. Usually, every technical conference has proceedings full of content words, abbreviations, technical terms, etc. These words are important within your conference but rare in general speech, so standard recognizers trained on general speech can easily miss them, and the transcript then becomes useless for you.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjckF__eGscNvbQRyylVuATSMtSsD_7YdslYuFVMbtcoNMykUnDOp-jcORPZ52vd2PXg1nJFX6HSGevCTJqxp_pI-xKrQDJD-IqWI4IfGr_I1b7HUYSJ57Da33Emy519mwjn_f6pf44wIk/s1600/adapted.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjckF__eGscNvbQRyylVuATSMtSsD_7YdslYuFVMbtcoNMykUnDOp-jcORPZ52vd2PXg1nJFX6HSGevCTJqxp_pI-xKrQDJD-IqWI4IfGr_I1b7HUYSJ57Da33Emy519mwjn_f6pf44wIk/s1600/adapted.jpg" height="136" width="320" /></a></div>
<br />
To give you a real use case, SuperLectures - a <a href="http://www.superlectures.com/">conference video service</a> - uses <a href="http://spokendata.com/">SpokenData.com</a> automatic transcriptions in the way described above. They provide us with their proceedings so that we can adapt our recognizer, and we then return the textual transcription of their audio/video data.<br />
<br />SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-30153538020836025392014-06-26T05:23:00.000-07:002014-06-26T06:47:05.176-07:00SpokenData API - File upload<b>SpokenData API</b> allows developers to easily add new recordings to their media library. You can either pass a media file URL, or you can upload the whole file using the HTTP PUT method, which was recently implemented in SpokenData. This post demonstrates the second option - uploading a file through a SpokenData API call.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhS7RD9ePABvdn8VGeIdpen7k8K2zYrj202wIRmdX_Wzp_y35Lg-DcZ2Coj3vtATwmrKwNRy7vcZga6BPGU8NaI-w94qCtk-KaVwNtbM5xOcVXX82bh9DHIZa0qiu3f2hAdMVx0oaoKDts/s1600/spokendata-api.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhS7RD9ePABvdn8VGeIdpen7k8K2zYrj202wIRmdX_Wzp_y35Lg-DcZ2Coj3vtATwmrKwNRy7vcZga6BPGU8NaI-w94qCtk-KaVwNtbM5xOcVXX82bh9DHIZa0qiu3f2hAdMVx0oaoKDts/s1600/spokendata-api.png" height="127" width="320" /></a></div>
<br />
<br />
So, how does it work? Each SpokenData API function is composed of SpokenData base <b>API url</b>, <b>USER-ID</b>, <b>API-TOKEN</b> and the <b>name of the function</b>.<br />
<br />
<table>
<tbody>
<tr><th>SpokenData API url:</th><td>http://spokendata.com/api</td></tr>
<tr><th>USER-ID: </th><td>different for each user, available to signed-in users at <a href="http://spokendata.com/api">http://spokendata.com/api</a></td></tr>
<tr><th>API-TOKEN: </th><td>different for each user, available to signed-in users at <a href="http://spokendata.com/api">http://spokendata.com/api</a></td></tr>
<tr><th>API function: </th><td>recording/put</td></tr>
</tbody></table>
<br />
If we concatenate the above values, we get the API call url. It may look like this:<br />
<div style="background: rgb(238, 238, 238); overflow: scroll; padding: 10px; white-space: nowrap;">
<span style="font-family: Courier New, Courier, monospace;">http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/recording/put</span></div>
<br />
Each API call can have parameters. When uploading a new file through the API, you need to provide the recording filename and the language for automatic speech processing. So you will end up with a URL looking like this:<br />
<div style="background: rgb(238, 238, 238); overflow: scroll; padding: 10px; white-space: nowrap;">
<span style="font-family: Courier New, Courier, monospace;">http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/recording/put?filename=audio.mp3&language=english</span></div>
<br />
Other parameters are also available; you can read about them in the <a href="http://spokendata.com/api">SpokenData API documentation</a>. When you call the above URL, <b>don't forget to send the file content in the request body</b>.<br />
<br />
Here is a PHP code example that uploads an MP3 file to SpokenData using the HTTP PUT method.<br />
<div style="background: rgb(238, 238, 238); overflow: auto; padding: 10px;">
<pre style="font-family: 'Courier New', Courier, monospace; margin: 0;"><?php
$fileToUpload = 'd:/audio.mp3';
$url = 'http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp/recording/put?filename=audio.mp3&language=english';

// Open the file for reading in binary mode.
$file = fopen($fileToUpload, "rb");

$curl = curl_init();
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_VERBOSE, 1);
curl_setopt($curl, CURLOPT_HTTPHEADER, array('Expect: '));
curl_setopt($curl, CURLOPT_PUT, 1);
curl_setopt($curl, CURLOPT_INFILE, $file);
curl_setopt($curl, CURLOPT_INFILESIZE, filesize($fileToUpload));

// Execute the PUT request and clean up.
$response = curl_exec($curl);
curl_close($curl);
fclose($file);

// Pass the XML response through.
header("Content-Type: text/xml");
echo $response;</pre>
</div>
<br />
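If PHP is not your language, the same call is easy to make elsewhere. Below is a minimal Python sketch using only the standard library; the URL is the example value from above, and `build_put_request` is just an illustrative helper name, not part of the SpokenData API.

```python
import urllib.request

# Example USER-ID/API-TOKEN URL from above; substitute your own values,
# available to signed-in users at http://spokendata.com/api.
URL = ("http://spokendata.com/api/18/br3sp59a2it7fig94jdtbt3p9ife5qpx39fd8npp"
       "/recording/put?filename=audio.mp3&language=english")

def build_put_request(url, data):
    """Build an HTTP PUT request whose body is the raw file bytes."""
    return urllib.request.Request(url, data=data, method="PUT")

# Real call (requires a valid account and an audio.mp3 on disk):
# with open("audio.mp3", "rb") as f:
#     response = urllib.request.urlopen(build_put_request(URL, f.read()))
#     print(response.read().decode("utf-8"))

req = build_put_request(URL, b"")  # build only; no network traffic here
```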
The server responds in XML. Here is an example response to the above request.<br />
<div style="background: rgb(238, 238, 238); overflow: auto; padding: 10px;">
<pre style="font-family: 'Courier New', Courier, monospace; margin: 0;"><?xml version="1.0" encoding="utf8"?>
<data>
<message>New media file &quot;audio.mp3&quot; was successfully added.</message>
<recording id="1373"></recording>
</data></pre>
</div>
<br />
As you can see, the file <strong>audio.mp3 </strong>was successfully added and assigned the recording id = <b>1373</b>.SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-35403781778189922882014-06-05T07:09:00.003-07:002014-06-05T07:10:53.082-07:00Tagging<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgS2tARSS7-1_czV9h8yNGk8rbDtQWDfkr4sKyQAUhyT3RmoVKm8GYryCZ1i0vHmjbOAf2RSARTjoyouQIFx5SGdqWi9E7jPyG2DRUQQdsWf0DJNN5qFymvCl8eDK-g2mrS1VJUxx6wvIU/s1600/tagging.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgS2tARSS7-1_czV9h8yNGk8rbDtQWDfkr4sKyQAUhyT3RmoVKm8GYryCZ1i0vHmjbOAf2RSARTjoyouQIFx5SGdqWi9E7jPyG2DRUQQdsWf0DJNN5qFymvCl8eDK-g2mrS1VJUxx6wvIU/s1600/tagging.png" height="286" width="320" /></a>Every registered user has a set of 6 tags. Each tag is marked with a different color and can have a title. Tags come in really handy when you want to filter your recordings. For example, say you have plenty of recordings and are working on their transcriptions. When you are happy with a transcription, you mark the recording with a tag titled <b>done</b>. As tags are displayed on top of the recording thumbnails, you will immediately spot the recordings that no longer need to be transcribed.<br />
<br />
Tag titles can be changed at any time. Tagging is allowed on your dashboard or in the transcription editing mode.<br />
<br />
<br />SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-2838561141716990652014-05-22T00:59:00.000-07:002014-05-22T01:05:43.147-07:00Integration of BrainTree payments<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhELEkU_yFZupwdXNkeJZSgBkvC3ETOQspQPzqThJBRBzop1bggCvJBUWSk7rPLoqOMeJYBu2hn-3tH1LbUAw9cVckfKsJr0SSdr857iG-nRqHWq71Uy6IMqoK55GcTFkeHSHDPa_JzkfI/s1600/buy.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhELEkU_yFZupwdXNkeJZSgBkvC3ETOQspQPzqThJBRBzop1bggCvJBUWSk7rPLoqOMeJYBu2hn-3tH1LbUAw9cVckfKsJr0SSdr857iG-nRqHWq71Uy6IMqoK55GcTFkeHSHDPa_JzkfI/s1600/buy.jpg" height="123" width="200" /></a>From now on, every registered user can get the whole transcription of their multimedia files. To lower the load of our computation cloud, we decided to charge for processing of files longer than 15 minutes. Processing of shorter files remains free and is available to anyone.<br />
<br />
If you are not sure about the quality of the automatic transcription, start by letting the system transcribe the first 15 minutes. Based on the result, you can then decide whether the whole transcription is worth buying.<br />
<br />
As the first payment method, we decided to integrate <a href="https://www.braintreepayments.com/">BrainTree</a> payments into <a href="http://spokendata.com/">SpokenData</a>. This allows users to pay by card; Visa, MasterCard, Discover and American Express are accepted. All payments are currently in EUR.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://www.braintreepayments.com/" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://ignitiondeck.com/id/wp-content/uploads/2013/06/braintree-logo.png" height="50" width="200" /></a></div>
<br />
The great thing about BrainTree is that the user is not redirected to a third-party payment gateway but remains on the SpokenData website. All sensitive data is encrypted and not stored. Read more about Client-side encryption <a href="https://www.braintreepayments.com/braintrust/client-side-encryption">here</a>.SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-51504542105177002782014-05-20T00:22:00.000-07:002014-05-20T00:58:18.259-07:00WebVTT subtitles support<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzwfqEZwRzYEwVVlN5Syz-G6RWqzqlCseZraI1rqLglEY9LQ6ur38B3na5sgWdjcNF-Ys3r7UBoRi26FBchxlr7THsmexVHT3qSmUJX4sfam3okzNVVkXIVvgAlRO21LC6UpjkBnkpZbo/s1600/subtitles-webvtt.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzwfqEZwRzYEwVVlN5Syz-G6RWqzqlCseZraI1rqLglEY9LQ6ur38B3na5sgWdjcNF-Ys3r7UBoRi26FBchxlr7THsmexVHT3qSmUJX4sfam3okzNVVkXIVvgAlRO21LC6UpjkBnkpZbo/s1600/subtitles-webvtt.png" height="259" width="320" /></a></div>
We have extended the number of subtitle formats SpokenData supports. From now on, subtitles can also be downloaded in <a href="http://dev.w3.org/html5/webvtt/"><b>WebVTT</b></a>. Here is the complete list of currently supported subtitle formats:<br />
<br />
<ul>
<li>HTML</li>
<li>SRT</li>
<li>TRS</li>
<li>TXT</li>
<li>WebVTT</li>
</ul>
<div>
The subtitles can be downloaded from the web user interface or through API calls. </div>
<br />
<br />SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-5208882467718170292014-05-14T07:04:00.001-07:002014-05-14T07:23:31.429-07:00There is no audio/video limitation on the SpokenData free plan now.<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8sihhsP7EK_odXn80CFbzJQoh-6PngTr2OxI55tNIOJV950B94gwknlMyxUChG490n1ygg5QeT0mRwZW7vuhadK4v1VkMtxF_AE6jqlCbWgU2J044wV-uvuyj-XR19vdPIn39D4YC0lo/s1600/limits.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8sihhsP7EK_odXn80CFbzJQoh-6PngTr2OxI55tNIOJV950B94gwknlMyxUChG490n1ygg5QeT0mRwZW7vuhadK4v1VkMtxF_AE6jqlCbWgU2J044wV-uvuyj-XR19vdPIn39D4YC0lo/s1600/limits.jpg" height="148" width="200" /></a>We changed the way we limit the processed data for free accounts. To get a free SpokenData account, <a href="http://spokendata.com/register">register here</a>.<br />
<br />
We previously had a hard limit on processed data of 15 minutes. Any data you uploaded over this limit was trimmed and discarded. So if you uploaded 25 minutes of audio or video and wanted the automatic transcript for free, you later got only a 15-minute-long video/audio with its transcript in <a href="http://spokendata.com/features">your dashboard</a> (after the processing finished).<br />
<br />
Currently, we do not limit your data upload. We only limit the length of the transcript we provide for free. So if you upload a 25-minute-long video/audio, you will find the whole (25-minute-long) video/audio in your dashboard. The generated transcript is limited to 15 minutes, so you will see the transcript only for the first 15 minutes of your data. But if you want, you can easily create the rest of the transcript yourself for free - this was not possible in the previous version, because the video was trimmed.<br />
<br />
We hope you welcome this change.<br />
<br />
And stay tuned - more interesting things are coming soon.<br />
<br />SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-88681212482549323332014-04-30T05:29:00.000-07:002014-05-14T07:01:58.111-07:00Interactive audio waveform<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgraInHcVJqI6h1FwDG3Pcs-XKhk9nFP4rdkNnMZy6xJ1MX3dqtwdplviZJH6SsmzVVBAXhn3XBHdnh0je0yUqZXLnblbGv0AQvYWXF_E1QyCia3Ary8sh3dX6kiufVzPFcLMixYCsWxHw/s1600/news.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgraInHcVJqI6h1FwDG3Pcs-XKhk9nFP4rdkNnMZy6xJ1MX3dqtwdplviZJH6SsmzVVBAXhn3XBHdnh0je0yUqZXLnblbGv0AQvYWXF_E1QyCia3Ary8sh3dX6kiufVzPFcLMixYCsWxHw/s1600/news.jpg" height="101" width="200" /></a>From now on, when you play any recording, you will see an interactive waveform on the left of the transcription. The waveform is only displayed on larger screens with a minimum horizontal resolution of 1220px. The waveform can significantly help you detect moments of speech. When you click into it, the player seeks to that exact moment.<br />
<br />
This audio waveform viewer for the web can be downloaded from <a href="https://github.com/vdot/outwave.js">https://github.com/vdot/outwave.js</a>.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNsMFkW8ZX5LLYxGc9sETjx-PTVs3JRL0s49G7p_cmAf4XkPUCMh7S7jvPgUnX0VLhWKP_C-Um6VNTn5O3GjYB-50LP5-l4LcsTTPoZrGXcYVYd1CMIHf63q-Kn8dkr_yXaWSNOBs3AhQ/s1600/waveform.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNsMFkW8ZX5LLYxGc9sETjx-PTVs3JRL0s49G7p_cmAf4XkPUCMh7S7jvPgUnX0VLhWKP_C-Um6VNTn5O3GjYB-50LP5-l4LcsTTPoZrGXcYVYd1CMIHf63q-Kn8dkr_yXaWSNOBs3AhQ/s1600/waveform.jpg" height="247" width="400" /></a></div>
<br />SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0tag:blogger.com,1999:blog-1970119719488733768.post-49087515256095668132014-03-31T02:41:00.000-07:002014-05-14T07:24:18.609-07:00What does the speaker segmentation technology do?<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhl1cnXE2xMTOSKhcVKGiOZKX__OMHVgPmYQxGhLMWLFE8FrWDyKCThIhFHfLCTWpHTlWvYJyLgffwXaKcIDPU-K0NGo-3j9gDFOtdwMppTAutpYYYvafduTC0yBuSpzYAaVIH82xbc42Y/s1600/spkid.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhl1cnXE2xMTOSKhcVKGiOZKX__OMHVgPmYQxGhLMWLFE8FrWDyKCThIhFHfLCTWpHTlWvYJyLgffwXaKcIDPU-K0NGo-3j9gDFOtdwMppTAutpYYYvafduTC0yBuSpzYAaVIH82xbc42Y/s1600/spkid.jpg" height="192" width="200" /></a>Speaker segmentation (diarization) is a <a href="http://blog.spokendata.com/2013/10/what-information-is-in-your-spokedata.html">speech technology</a> that allows you to segment audio (or video) by speaker. What is it good for? For one, you can more easily identify speaker turns in a dialog while making a speech transcript.<br />
<br />
Even if you do not directly need the speaker information, speaker segmentation is very helpful for speech-to-text (STT) technology. The STT technology contains an unsupervised speaker adaptation module. This module takes the parts of speech belonging to a particular speaker and adapts the acoustic model towards them. Adapting the model leads to a more accurate speech transcript.<br />
<br />
The adaptation - even if it is called speaker adaptation - adapts the system to the whole acoustic channel, which comprises the speaker's voice characteristics, room acoustics (echo), <a href="http://blog.spokendata.com/2014/01/what-is-difference-between-narrowband.html">microphone characteristics</a>, environment noise, etc.<br />
<br />
Speaker segmentation is theoretically independent of speaker, language and acoustic conditions. In practice, however, it is not. The reason is that it uses something called a universal background model (UBM). Theoretically, the UBM should model all the speech, languages and acoustics of the world. But you need to train it on speaker-labeled data so that it learns how to distinguish among speakers. And, as with other speech technologies, the further the data you process is from the training data, the worse the accuracy you get.<br />
<a name='more'></a><br />
Lower accuracy typically shows up as one speaker, recorded in different acoustic conditions, being split into several different speaker labels.<br />
<br />
The second important thing is that speaker segmentation needs prior knowledge of the number of speakers in the audio document. If you do not provide this information, a preset or estimated value is used. However, if you know the number of speakers, it is good to provide it.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh05c7A02bSNVfdrsdedo1uAUVp7NP4Hc3P2JfMz8fZUL3cmIb73ABByjw6xn3vKuDlx-Ha12FzFBFWcOlmordIeeyx29uAFHvvpzBCTLe5dXMhfr_wvPp85LG6wmWUabWPqUyeD7gXFcI/s1600/sdcom2v1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh05c7A02bSNVfdrsdedo1uAUVp7NP4Hc3P2JfMz8fZUL3cmIb73ABByjw6xn3vKuDlx-Ha12FzFBFWcOlmordIeeyx29uAFHvvpzBCTLe5dXMhfr_wvPp85LG6wmWUabWPqUyeD7gXFcI/s1600/sdcom2v1.png" height="227" width="320" /></a></div>
<br />
In our service, we <a href="http://spokendata.com/demo/start">automatically preset</a>:<br />
<ul>
<li>7 speakers in the speech-to-text engine for broadcast news, where we expect more interviews and thus more speakers.</li>
<li>3 speakers in the other speech-to-text engines, where we rather expect a monologue or a dialog in the recording.</li>
<li>10 speakers in the speaker segmentation mode (just the <a href="http://blog.spokendata.com/2013/10/voice-activity-detection-where-is-speech.html">voice activity detector</a> and speaker segmentation, without speech-to-text).</li>
</ul>
You can preset the number of speakers a priori in the <a href="http://spokendata.com/pricing">paid service</a>.<br />
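To illustrate why this prior matters, here is a toy sketch in Python (not our actual diarization pipeline, which uses a UBM and real acoustic features): each segment is reduced to a single made-up feature value, and a simple k-means with k set to the assumed number of speakers produces the labels.

```python
def kmeans_1d(values, k, iters=20):
    """Toy 1-D k-means (k >= 2): cluster one feature per segment into k 'speakers'."""
    ordered = sorted(values)
    # Deterministic init: spread the k initial centers across the value range.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(iters):
        # Assign every segment to the nearest center...
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # ...then move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

# Made-up features: speaker A in quiet (~1.0) and noisy (~1.6) conditions,
# and speaker B (~5.0). Two real speakers, three acoustic conditions.
features = [1.0, 1.1, 1.6, 1.7, 5.0, 5.1]

two = kmeans_1d(features, k=2)    # -> [0, 0, 0, 0, 1, 1]: A stays one speaker
three = kmeans_1d(features, k=3)  # -> [0, 0, 1, 1, 2, 2]: A splits into two labels
```

With k=2 the four low-feature segments (one speaker in two acoustic conditions) share a label; with too large a k they are split into two "sub-speakers", which is exactly the artifact of varying acoustic conditions described in this post.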
<br />
That said, lower accuracy is not necessarily "bad". Yes, it can be annoying when you need accurate speaker turns in your speech transcript. On the other hand, splitting a particular speaker into several "sub-speakers" according to varying acoustic conditions (noise, environment, ...) can actually help the automatic speech recognizer.<br />
<br />SuperLectureshttp://www.blogger.com/profile/11962779404564217923noreply@blogger.com0