Thursday, December 19, 2013

Manage your annotators

SpokenData transcribes speech into text. Users get automatic speech transcription for free and then they can edit it in a web-based subtitles editor. But what if you don't want to spend your time editing the transcription? There are two options:
  • buy a human transcription from SpokenData
  • manage your team of annotators
In this blog post, we will briefly introduce the second option. Annotators are users who take care of the subtitles for your recordings, and each registered user can manage their own team of annotators. The administrator assigns jobs to annotators and monitors which jobs are complete, in progress, or refused. 
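The workflow above — jobs moving between assigned, in-progress, complete, and refused states, with the administrator filtering by status — can be sketched as a small state model. All class and field names here are illustrative, not the actual SpokenData data model:

```python
from enum import Enum

class JobStatus(Enum):
    ASSIGNED = "assigned"
    IN_PROGRESS = "in progress"
    COMPLETE = "complete"
    REFUSED = "refused"

class Job:
    """One subtitling job assigned to an annotator (hypothetical names)."""
    def __init__(self, recording_id, annotator_email):
        self.recording_id = recording_id
        self.annotator_email = annotator_email
        self.status = JobStatus.ASSIGNED

def jobs_by_status(jobs, status):
    """Filter an administrator's job list by status, as the dashboard does."""
    return [j for j in jobs if j.status is status]

# Example: two jobs, one of which the annotator refused
jobs = [Job("rec-1", "anna@example.com"), Job("rec-2", "ben@example.com")]
jobs[1].status = JobStatus.REFUSED
print([j.recording_id for j in jobs_by_status(jobs, JobStatus.REFUSED)])  # ['rec-2']
```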

Adding a new annotator is simple and only requires filling out a form with the annotator's name and email address. The annotator is then sent an email with an activation link. After clicking on it, the annotator sets their password and can start working on the subtitles of assigned recordings.

Annotators are notified by email when they are assigned a new job. Similarly, administrators are notified by email when subtitles for their recordings are complete.

Tuesday, November 19, 2013

New Czech recognizer online

We are happy to announce that our new Czech recognizer is online. The novelty lies in the support of 16kHz audio data (so-called wideband) and in improved robustness to distant microphones.
In comparison to the previous Czech recognizer, which was aimed at telephone data (8kHz, so-called narrowband), the new one achieves better accuracy for:
  • Speech recorded at higher quality - 16kHz or more. Unless you are capturing a telephone call, you usually record data at up to 44kHz.
  • Distant-microphone recordings. A telephone call recording is usually labeled close-talk because the speaker's mouth is close to the microphone. The distant microphone is the opposite case: the speaker's mouth is far from the microphone, for example a voice recorder or a mobile phone lying on a table while recording a dialog.
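To see which recognizer suits a given WAV file, you can inspect its sample rate with Python's standard-library wave module. This is only a sketch of the narrowband/wideband distinction described above; the file name and threshold logic are illustrative:

```python
import wave

def band_of(path):
    """Classify a WAV file as narrowband (telephone, <= 8 kHz)
    or wideband (16 kHz or more) by its sample rate."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
    return "narrowband" if rate <= 8000 else "wideband"

# Create a tiny 16 kHz mono WAV (one second of silence) just to demonstrate
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)           # 16-bit samples
    w.setframerate(16000)       # wideband rate
    w.writeframes(b"\x00\x00" * 16000)

print(band_of("demo.wav"))  # wideband
```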
Feel free to test the recognizer by uploading your data, and enjoy the transcript.

Tuesday, October 29, 2013

Voice Activity Detection: Where is the speech?

Voice Activity Detection (VAD) is usually the first nontrivial step in speech processing (converting speech into text, speaker identification, etc.). It may sound like an easy task, but it is not.
If you are in the domain of clean, slow broadcast-news speech, a simple energy-based detector may be enough. It can work well, but think about commercials or opening intros. Telling music from speech can be hard, because music may contain harmonic structure similar to speech.
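A minimal version of the energy-based detector mentioned above can be written in a few lines: split the signal into short frames and mark a frame as speech whenever its mean energy exceeds a fixed threshold. The frame length and threshold here are illustrative choices, not tuned values:

```python
import math

def energy_vad(samples, frame_len=160, threshold=0.01):
    """Mark each frame as speech (True) or non-speech (False) by
    comparing its mean energy to a fixed threshold.
    samples: floats in [-1, 1]; 160 samples = 20 ms at 8 kHz."""
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        decisions.append(energy > threshold)
    return decisions

# Toy signal: one near-silent frame followed by one loud frame
quiet = [0.001 * math.sin(2 * math.pi * 440 * t / 8000) for t in range(160)]
loud  = [0.5   * math.sin(2 * math.pi * 440 * t / 8000) for t in range(160)]
print(energy_vad(quiet + loud))  # [False, True]
```

This is exactly the detector that breaks down on music, noise, and distant microphones — loud music is "speech" to it, and quiet speech is "silence" — which is why the harder scenarios below need something better.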
Now let us switch to a more realistic scenario - telephone conversations. Here the speech usually comes with some additive noise. It can be the noise of the street you are walking on, or some music in the background (a radio, for example). If the dialog gets "complicated", you start to shout and the speech is full of crosstalk. Would you like to process crosstalk or not?
The examples above are still close-talk: you have the microphone close to your mouth. But now consider a recording coming from a mobile phone or dictaphone lying on a desk in a restaurant.
There are many voices, and you want to process the strongest one. On top of that there can be echo (the acoustics of the room), and so on. We call this condition a distant microphone.
By the way, VAD should not depend on the language being spoken.

So to conclude, separating speech (which you need to process) from noise (the other parts of the recording) is not easy. There are at least these classes:
  • clean speech
  • speech with noise in background
  • speech with music in background
  • shouting
  • crosstalks
  • singing
  • music
  • stationary noise
  • impulse noise (gunshots)
  • technical noise (fax, dial tones)
  • silence
And you need to accurately find the first three or four classes and forward them to the next processing step, while omitting the rest of the recording.
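The forwarding step can be sketched as a simple filter over labeled segments: keep the speech-like classes from the list above and drop everything else. The segment representation here is an assumption made for illustration:

```python
# The speech-like classes from the list above that the recognizer should see
SPEECH_CLASSES = {
    "clean speech",
    "speech with noise in background",
    "speech with music in background",
    "shouting",
}

def keep_for_transcription(segments):
    """segments: list of (start_sec, end_sec, label) tuples.
    Forward only speech-like segments to the next processing step."""
    return [(s, e, lbl) for (s, e, lbl) in segments if lbl in SPEECH_CLASSES]

segments = [(0.0, 1.5, "music"),
            (1.5, 6.0, "clean speech"),
            (6.0, 7.0, "impulse noise"),
            (7.0, 9.0, "speech with noise in background")]
print(keep_for_transcription(segments))
# [(1.5, 6.0, 'clean speech'), (7.0, 9.0, 'speech with noise in background')]
```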

The second important thing to think about is the downstream technology.
In the case of speaker or language identification, you can afford to omit some portion of the speech as well, especially where you are not confident whether it is speech or noise. Omitting some speech just means you need longer audio to meet the condition of - let's say - 30 seconds of speech for speaker identification.
But what do you think will happen if you omit some portion of speech (the beginnings of sentences) in speech-to-text conversion? You will miss words in your transcript! On the other hand, some adjacent noise is not so dangerous, because speech-to-text systems have a model of "silence". That same noise is harmful for speaker or language identification, where you would effectively be claiming that the noise is the speaker (or the language).
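The "just collect more audio" trade-off for speaker identification can be made concrete: sum the durations of the segments the VAD kept and check whether they reach the required total. The 30-second figure comes from the text above; the segment format is an assumption:

```python
def enough_speech(segments, needed=30.0):
    """segments: (start_sec, end_sec) spans the VAD marked as speech.
    Return True once the accumulated speech reaches `needed` seconds --
    e.g. the 30 s condition mentioned for speaker identification."""
    total = sum(end - start for start, end in segments)
    return total >= needed

# A cautious VAD dropped the uncertain regions; we simply need more audio
segments = [(0.0, 12.0), (15.0, 24.0), (30.0, 41.5)]
print(enough_speech(segments))  # True: 12 + 9 + 11.5 = 32.5 s
```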

Because we understand the importance of voice activity detection, we are continuously working to make it more accurate. We have also decided to make it available as a separate part of our services, so you can easily upload an audio file yourself and let the VAD detect the speech in a recording full of other noises. And if the speech detector fails, you know you cannot get any better results from speech-to-text: the recognizer transcribes only the segments the VAD marks as speech. That's it.

Tuesday, October 15, 2013

What information is in your SpokenData?

Maybe you are thinking... "What information is in my spoken data?"

Well, lots of information! To have some idea, look at the following image.

You and your speech are in the middle. Now let us go clockwise, and I will briefly introduce each piece of information hidden in your speech.

At 12 o'clock, there is speaker identity. A recording only 10 seconds long is enough to identify you by voice.

Gender identification is next. It is the simplest kind of classification, splitting voices into two classes.

At 3 o'clock, there is the speech transcript - a technology that converts speech into text. Keyword spotting and speech search can be considered part of this technology.

Next is age estimation. Estimating a speaker's age might be helpful in some security applications.

The communication channel is usually not that important, but the information about which codecs or networks the voice recording was transmitted through is there, together with the type of recording device.

Do not forget that the recording does not contain only speech. There are also lots of noises, tones, or music, all of which can make your speech less intelligible. The technology that separates speech from the rest is called Voice Activity Detection.

And finally, there is the language you are speaking. Similarly to speaker identity, 10 seconds of your speech is enough to identify your spoken language.

So that is at least some information hidden in your spoken data recordings.

Tuesday, September 3, 2013

You can transcribe your spoken data now!


We are happy to announce the service.

How does it work? Well, you can:
  1. upload your spoken data (local audio or video file, or YouTube video),
  2. let the service automatically transcribe the speech into text,
  3. download the textual transcript and use it as you want to.
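The three steps above map naturally onto an HTTP API. The sketch below only builds the request URLs for the upload and download steps; the base URL, endpoint paths, and parameter names are invented placeholders — consult the actual SpokenData API documentation for the real interface:

```python
from urllib.parse import urlencode

# NOTE: everything below (base URL, paths, parameter names) is a
# hypothetical placeholder, not the real SpokenData API.
API_BASE = "https://api.example.com"

def upload_url(user_id, api_token, media_url):
    """Step 1: request that submits a remote audio/video URL for transcription."""
    query = urlencode({"url": media_url, "token": api_token})
    return f"{API_BASE}/{user_id}/recordings/add?{query}"

def transcript_url(user_id, api_token, recording_id):
    """Step 3: request that fetches the finished textual transcript."""
    query = urlencode({"token": api_token})
    return f"{API_BASE}/{user_id}/recordings/{recording_id}/transcript?{query}"

print(upload_url("42", "secret", "https://youtu.be/abc123"))
```

Step 2 (the automatic transcription itself) happens server-side, so a client typically polls until the recording's status shows the transcript is ready before calling the download step.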
You can also use our interactive transcript editor to tweak the text into the form you like.

And what is this good for?
  1. If you are a company which needs to convert speech to text (extracting information from video or audio), you can use our API.
  2. If you are a journalist, you can let the service transcribe your interviews and save some time when writing an article.
  3. You can create subtitles for your YouTube videos.
  4. You can also manage the uploaded videos in a dashboard, make categories and stick color tags to them.
And finally...

... stay tuned, more interesting things are coming soon!