A team from Google Research and Google Brain has reportedly used “all” currently available public training data for speech recognition to train a single, huge neural network called SpeechStew.
As reported by VentureBeat, the resulting model achieves speech-recognition accuracy competitive with specialized state-of-the-art models.
Speech-recognition models are often trained on a single dataset, since such data is typically homogeneous in its annotation style and, above all, its audio quality. This also simplifies working with the data and optimizing a model.
For SpeechStew, the researchers took a completely different approach. According to the report, speech data from the following corpora were combined: AMI, Broadcast News, Mozilla Common Voice, LibriSpeech, Switchboard/Fisher, TED-LIUM, and Wall Street Journal.
These were simply mixed together, without weighting or balancing the individual corpora. In total, the data comprises more than 5,000 hours of transcribed speech.
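The mixing step described above can be sketched roughly as follows. This is a minimal illustration, not the team's actual data pipeline; the corpus entries and file names are placeholders:

```python
import random

# Hypothetical stand-ins for the corpora listed above; in practice each
# entry would be thousands of (audio file, transcript) pairs loaded from
# the respective dataset.
corpora = {
    "AMI": [("ami_0001.wav", "okay let's get started")],
    "Mozilla Common Voice": [("cv_0001.wav", "hello world")],
    "LibriSpeech": [("ls_0001.wav", "chapter one")],
}

# The approach as described: pool everything into one training set,
# with no per-corpus weighting or balancing.
pooled = [pair for samples in corpora.values() for pair in samples]
random.shuffle(pooled)  # mix utterances from all corpora uniformly
```

Every utterance ends up in a single shuffled pool, so each corpus contributes simply in proportion to its size.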
According to the team, SpeechStew matches the speech-recognition performance of other modern systems on several benchmarks and even exceeds it in some cases. In addition, the model can reportedly be adapted to different tasks.
Comparatively little additional training data was enough to match the results of specially trained and tuned models. This is likely due to the wide variety of the initial training data.
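As a loose illustration of that adaptation idea, the toy example below fine-tunes a "general" model on a handful of task-specific examples. The 1-D linear model, data, and learning rate are all invented for illustration and have nothing to do with the actual SpeechStew architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pretrained general model: a single weight already
# fitted on broad, mixed data (here, a 1-D linear model y = w * x).
w_general = 2.0

# Small task-specific dataset (hypothetical): only a handful of examples,
# drawn from a task whose true slope differs slightly from the general one.
x_task = rng.normal(size=8)
y_task = 2.5 * x_task

# A few gradient-descent steps on the squared error adapt the general
# model, mirroring the claim that little extra data is needed.
w = w_general
for _ in range(200):
    grad = np.mean(2 * (w * x_task - y_task) * x_task)
    w -= 0.1 * grad
```

After fine-tuning, `w` sits close to the task-specific slope of 2.5 rather than the general value of 2.0, even though only eight examples were used.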
When asked by VentureBeat about practical applications of these findings, the researchers responded cautiously. Work like SpeechStew could possibly serve in the future as a kind of general-purpose model forming the basis for specialized speech-recognition tasks.