Friday, January 24, 2014

Why do we need your speech data?

Our several years of experience in speech technology research and business show a frequent clash between:
Speech technology provider: "Give us some of your speech data for testing purposes please."
and
Customer: "No way! Our speech data is our private and secret property."

So let's discuss several WHYs.

Why does the speech technology provider want the customer's data?

Speech technologies are very complex and sensitive to the match between the model and the data. This is a common problem across the whole field of machine learning. Once you feed a classifier with "already seen" data, everything goes well, and the accuracy of such an algorithm is great.
The problem occurs when you put unseen data into the algorithm: data which was not seen during training and development. It is like people living in the US, who understand English because it is their already seen data, but do not understand Japanese because it was unseen during their training phase (childhood).
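The seen/unseen effect can be reproduced with a toy example. The sketch below is only an illustration, not a speech system: it trains a very simple classifier on one-dimensional synthetic "features" and then scores it both on the data it was trained on and on data shifted by a constant offset, which stands in for a different recording channel. The function names, the two classes and the offset value are all made up for this example.

```python
import random

def fit_centroids(samples):
    """Train a nearest-centroid classifier: one mean value per label."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, value):
    """Return the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

def accuracy(centroids, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for value, label in samples
                  if predict(centroids, value) == label)
    return correct / len(samples)

def make_data(rng, n_per_class, channel_offset):
    """Synthetic data: two classes, plus a constant 'channel' shift (assumed)."""
    data = []
    for _ in range(n_per_class):
        for label, centre in (("a", 0.0), ("b", 4.0)):
            data.append((rng.gauss(centre, 1.0) + channel_offset, label))
    return data

rng = random.Random(0)                 # fixed seed so the run is repeatable
train = make_data(rng, 500, 0.0)       # "already seen" conditions
unseen = make_data(rng, 500, 3.0)      # same classes, different "channel"

model = fit_centroids(train)
seen_acc = accuracy(model, train)
unseen_acc = accuracy(model, unseen)
print(f"seen data:   {seen_acc:.3f}")
print(f"unseen data: {unseen_acc:.3f}")
```

The model itself does not change between the two measurements. Only the data does, and that alone is enough to lower the accuracy noticeably.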

Let's go back to speech. Speech technologies are sensitive to the seen/unseen data problem. Of course, we have lots of techniques to overcome the problem and to minimize the loss of accuracy, but the problem still exists. One example is the problem with microphone distance and recording channel discussed in a previous post.

So, to provide the best possible service to the customer, a good speech technology provider wants to be sure that the technology works as well as it can on the customer's data. But how can the provider test it without any data?

He can ask the customer to describe in detail the recording setup, scenario, room, type of speech, type of speakers, the next processing step, and so on. Then the provider guesses the accuracy deterioration and hopes he is right.
Once the customer orders the technology and tests it, one of the following cases occurs:
  • Everything works well - the accuracy is close to the guess.
  • Something is wrong - the accuracy is worse than the guess.
In the "Something is wrong" case, several negotiation/development iterations usually start, in which the provider tries to improve the accuracy.

Please note that the "Everything works well" case does not necessarily mean that the customer achieves the best accuracy. The provider could have underestimated the achievable accuracy, and the customer is then losing accuracy without knowing it!

The second and better option is that the provider asks the customer for some data for testing (several hours is OK), ideally the target data. If he gets it, he tests the algorithms and reports to the customer the expected accuracy and also any room for improvement (using unsupervised adaptation).
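How "accuracy" is measured depends on the technology. As an illustration only, assume the technology is speech transcription and the customer's test data comes with reference transcripts. A common metric in that case is the word error rate (WER): the number of word substitutions, deletions and insertions divided by the number of reference words. The sketch below is a minimal version of it; the function names and the example sentences are made up.

```python
def word_errors(reference, hypothesis):
    """Word-level edit distance between a reference and a recognized text."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1], len(ref)

def corpus_wer(pairs):
    """WER over a whole test set of (reference, hypothesis) pairs."""
    total_errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        errors, words = word_errors(reference, hypothesis)
        total_errors += errors
        total_words += words
    if total_words == 0:
        raise ValueError("the test set contains no reference words")
    return total_errors / total_words

# Hypothetical test set: (what was said, what the recognizer returned)
test_set = [
    ("please call the office", "please call office"),
    ("the meeting starts at nine", "the meeting starts at nine"),
]
wer = corpus_wer(test_set)
print(f"WER: {wer:.1%}")
```

Other technologies (speaker identification, keyword spotting, language identification) use other metrics, but the principle is the same: without the customer's data and its ground truth, there is nothing to compute the number from.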

The third and best option is that the provider gets not only the test data but also much more data for adaptation. The provider can then adapt the technology to the target data and deliver speech technology with the best possible accuracy for the customer (given the amount of data provided).

Why does the customer not want to provide their data?

Well, there are usually two cases: the customer does not have the data (yet), or the data is private.

For the "does not have the data yet" case, an estimate is the best you can get from the provider.

For the "private data" case, please think about it in the context of the text above. We fully understand that some customers in the security field cannot provide any data. But in other fields, there is just fear. So what can you do?

  1. Try to build at least a test data set. Ask the provider what it should look like (how many hours, how many speakers, which languages, etc.).
  2. Try to discuss an NDA or another contractual way of providing the test data to your speech technology provider.
  3. You can also try to test the technology yourself, but usually there should be some supervision by the provider (unless you really know what you are doing and you are familiar with speech technologies).
  4. If you cannot provide the target data, think about anonymizing it by cutting out sensitive information.
  5. What about recording some "fake" target data? If your internal meeting recordings are really sensitive, arrange a few meetings about some "fake topic" and record them. Try to stay as close as possible to the target data (people, speaking style, room, equipment, ...).
From the point of view of adapting the technology to your target data, the best way is to provide your data to the speech technology provider. Which is not possible, right?

There are four ways to solve that:
  1. You are so "paranoid" that you want to do it yourself. That works only if the speech provider gives you some way to do it. Some algorithms can be adapted to your data easily, some cannot. Also keep in mind that the adaptation procedure can contain part of the provider's know-how.
  2. You allow someone from the provider's company to come and do it at your site.
  3. The speech technology provider gives you a tool for extracting statistics from your target data. This means that you send out data statistics but not the recordings. This is probably fair enough, because no one can reconstruct the original data, and the statistics are good enough to adapt the models.
  4. You fully provide the target data.
Please keep in mind that adaptation needs a test set to be defined. Otherwise you cannot clearly say what the accuracy gain of the adaptation is.
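To give an idea of what "sending statistics instead of recordings" can look like, here is a deliberately simplified sketch. It assumes the model is a single Gaussian over feature vectors and uses a standard MAP update of its mean. Real systems have many Gaussians or neural networks and collect per-component statistics, so this is not any particular provider's tool. The function names, the relevance factor `tau` and the numbers are hypothetical.

```python
def extract_statistics(frames):
    """Customer side: reduce feature vectors to a frame count and per-dimension sums.

    Only these aggregates leave the customer's site, not the frames
    and not the audio.
    """
    if not frames:
        raise ValueError("no feature frames given")
    dim = len(frames[0])
    total = [0.0] * dim
    for frame in frames:
        for d, value in enumerate(frame):
            total[d] += value
    return {"count": len(frames), "sum": total}

def map_adapt_mean(prior_mean, stats, tau=10.0):
    """Provider side: MAP update of a Gaussian mean from the statistics.

    tau (an assumed relevance factor) controls how strongly the original
    model is trusted when only little target data is available.
    """
    n = stats["count"]
    return [(tau * m + s) / (tau + n)
            for m, s in zip(prior_mean, stats["sum"])]

# Hypothetical two-dimensional features computed from the target data
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
stats = extract_statistics(frames)              # this is what gets sent out
adapted = map_adapt_mean([0.0, 0.0], stats, tau=1.0)
print(stats)
print(adapted)
```

With more target frames the adapted mean moves closer to the mean of the target data; with very few frames it stays near the original model. Whether the adaptation actually helped still has to be measured on a test set, scored once with the original model and once with the adapted one.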

So this is the end of this brief introduction to why you should provide your data to a speech technology provider.

Do not forget: a good provider will ask you for such data, because he understands the problem and has the same objective as you - providing you with the most accurate technology possible.


