Definition
Speech-to-text conversion is the process of converting spoken words
into written text. This process is also often called speech
recognition. Although the two terms are almost synonymous,
speech recognition is sometimes used to describe the wider
process of extracting meaning from speech, i.e. speech
understanding. The term voice recognition should be avoided,
as it is often associated with the process of identifying a person from
their voice, i.e. speaker recognition.
How does it work?
All speech recognition systems rely on at least two models: an acoustic model and
a language model. In addition, large-vocabulary systems use a pronunciation
model. It is important to understand that there is no such thing
as a universal speech recognizer. To get the best transcription
quality, all of these models can be specialized for a given language,
dialect, application domain, type of speech, and communication
channel.
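As a rough illustration of how an acoustic model and a language model interact, a recognizer commonly picks the candidate word sequence with the best combined score from the two models. The sketch below is a toy under stated assumptions, not how any particular product is implemented: the two candidate transcriptions, their log-probability scores, and the `lm_weight` parameter are all invented for the example.

```python
# Toy example: choosing between two candidate transcriptions of the same audio.
# All scores are invented log-probabilities, used for illustration only.
CANDIDATES = {
    "recognize speech": {"acoustic_logp": -12.1, "lm_logp": -4.2},
    "wreck a nice beach": {"acoustic_logp": -11.8, "lm_logp": -9.7},
}


def combined_score(scores, lm_weight=1.0):
    """Sum the acoustic score and the (weighted) language model score."""
    return scores["acoustic_logp"] + lm_weight * scores["lm_logp"]


def best_hypothesis(candidates, lm_weight=1.0):
    """Return the candidate text with the highest combined score."""
    return max(
        candidates,
        key=lambda text: combined_score(candidates[text], lm_weight),
    )


# With the language model, the more plausible word sequence wins.
print(best_hypothesis(CANDIDATES))
# With the language model switched off, the acoustically closer candidate wins.
print(best_hypothesis(CANDIDATES, lm_weight=0.0))
```

In this toy setting the two candidates sound almost alike, so the acoustic scores are close and the language model decides the outcome. This is one way to see why specializing the language model for a domain can change the transcription.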
Like any other pattern recognition technology, speech recognition cannot be error free. Transcription accuracy is highly dependent on the speaker, the style of speech, and the environmental conditions. Speech recognition is a harder process than people commonly think, even for a human being. Humans are used to understanding speech, not to transcribing it, and only speech that is well formulated can be transcribed without ambiguity.
From the user's point of view, a speech-to-text system can be categorized based on its use: command and control, dialog system, text dictation, audio document transcription, etc. Each use has specific requirements in terms of latency, memory constraints, vocabulary size, and adaptive features.
VoxSigma®
The VoxSigma software suite offers large-vocabulary multilingual speech-to-text capabilities
with state-of-the-art accuracy. It has been specifically designed for
professional users who need to transcribe large quantities of audio
and video documents, such as broadcast data, either in batch mode or
in real time. It can also be used to analyze call-center data.
The complete speech-to-text conversion process is done in three steps. The software first identifies the audio segments containing speech, then recognizes the language being spoken if it is not known a priori, and finally converts the speech segments to text with time codes. VoxSigma includes adaptive features allowing the transcription of noisy speech, such as speech with background music. The result is a fully annotated XML document including speech and non-speech segments, speaker labels, words with time codes, and high-quality confidence scores. This XML file can be directly indexed by a search engine, or alternatively can be converted into plain text.
The VoxSigma software suite is offered as a Web service via a REST API over HTTPS, allowing customers to quickly reap the benefits of regular improvements to the technology and take advantage of additional features offered by the online environment. The services are available 24/7/365 with failover servers and geographic redundancy.
Vocapia Research also offers services to adapt, tune, or create specific models or systems tailored to your needs. Tailoring models to your application is the best way to ensure you get the best possible results.
High accuracy is essential to maximize your ROI because, to a first approximation, the cost of using a speech-to-text system is proportional to the system's error rate. Therefore, using a system with 90% accuracy (i.e. 10% error) may cost almost twice as much as using a system with 95% accuracy (i.e. 5% error).
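The cost comparison can be checked with a few lines of arithmetic. The sketch below assumes, as the text does, that cost is strictly proportional to error rate; the function name `relative_cost` is made up for this example, and real costs will only follow this rule approximately.

```python
def relative_cost(accuracy_a, accuracy_b):
    """Cost of system A relative to system B.

    Assumes cost is proportional to error rate (1 - accuracy),
    which is only a first approximation.
    """
    error_a = 1.0 - accuracy_a
    error_b = 1.0 - accuracy_b
    if error_b <= 0.0:
        raise ValueError("system B must have a non-zero error rate")
    return error_a / error_b


# 90% accuracy (10% error) versus 95% accuracy (5% error).
ratio = relative_cost(0.90, 0.95)
print(f"{ratio:.2f}")  # prints 2.00
```

Under strict proportionality the ratio is exactly two; the "almost twice" in the text reflects that proportionality is only a first approximation.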