What is speech to text?
In this overview, you learn about the benefits and capabilities of the speech to text feature of the Speech service, which is part of Azure AI services. Speech to text can be used for real-time or batch transcription of audio streams into text.
Note
To compare pricing of real-time and batch transcription, see Speech service pricing.
For a full list of available speech to text languages, see Language and voice support.
Real-time speech to text
With real-time speech to text, the audio is transcribed as speech is recognized from a microphone or file. Use real-time speech to text for applications that need to transcribe audio in real time, such as:
- Transcriptions, captions, or subtitles for live meetings
- Diarization
- Pronunciation assessment
- Contact center agent assist
- Dictation
- Voice agents
Real-time speech to text is available via the Speech SDK and the Speech CLI.
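As a sketch of the SDK path, a minimal real-time recognition example with the Speech SDK for Python (the azure-cognitiveservices-speech package) might look like the following. The key, region, and file name are placeholders, and this recognizes only a single utterance; long audio needs continuous recognition instead:

```python
import azure.cognitiveservices.speech as speechsdk

def transcribe_file(key: str, region: str, path: str) -> str:
    """Recognize one utterance from a WAV file and return the recognized text."""
    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    speech_config.speech_recognition_language = "en-US"
    audio_config = speechsdk.audio.AudioConfig(filename=path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config)
    result = recognizer.recognize_once()  # blocks until one utterance is recognized
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text
    raise RuntimeError(f"Recognition failed: {result.reason}")

# Example call (requires a valid key, region, and audio file):
# text = transcribe_file("YOUR_KEY", "YOUR_REGION", "meeting.wav")
```

To transcribe from a microphone instead of a file, replace the AudioConfig with speechsdk.audio.AudioConfig(use_default_microphone=True).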
Batch transcription
Batch transcription is used to transcribe a large amount of audio in storage. You can point to audio files with a shared access signature (SAS) URI and asynchronously receive transcription results. Use batch transcription for applications that need to transcribe audio in bulk, such as:
- Transcriptions, captions, or subtitles for prerecorded audio
- Contact center post-call analytics
- Diarization
Batch transcription is available via:
- Speech to text REST API: To get started, see How to use batch transcription and Batch transcription samples (REST).
- Speech CLI: The Speech CLI supports both real-time and batch transcription. For Speech CLI help with batch transcription, run the following command:
spx help batch transcription
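The REST workflow can be sketched in Python with only the standard library. The endpoint path, header, and property names below follow the v3.1 Speech to text REST API, but verify them against the current API reference; the key, region, and SAS URI are placeholders:

```python
import json
import urllib.request

def build_transcription_request(content_urls, locale="en-US",
                                display_name="My batch transcription"):
    """Build the JSON body for POST /speechtotext/v3.1/transcriptions."""
    return json.dumps({
        "contentUrls": list(content_urls),  # SAS URIs of the audio files in storage
        "locale": locale,
        "displayName": display_name,
        "properties": {
            "wordLevelTimestampsEnabled": True,
            "diarizationEnabled": True,  # separate speakers in the output
        },
    })

def create_transcription(region, key, body):
    """Submit the job; the service returns a transcription object to poll for results."""
    req = urllib.request.Request(
        f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions",
        data=body.encode("utf-8"),
        headers={"Ocp-Apim-Subscription-Key": key,
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    body = build_transcription_request(
        ["https://example.blob.core.windows.net/audio/call.wav?sv=PLACEHOLDER"])
    print(body)
```

Because the request only references audio by SAS URI, the service fetches and transcribes the files asynchronously; your application polls the returned transcription object rather than holding a connection open.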
Custom speech
With custom speech, you can evaluate and improve the accuracy of speech recognition for your applications and products. A custom speech model can be used for real-time speech to text, speech translation, and batch transcription.
Tip
A hosted deployment endpoint isn't required to use custom speech with the Batch transcription API. You can conserve resources if the custom speech model is only used for batch transcription. For more information, see Speech service pricing.
Out of the box, speech recognition utilizes a Universal Language Model as a base model that is trained with Microsoft-owned data and reflects commonly used spoken language. The base model is pretrained with dialects and phonetics representing various common domains. When you make a speech recognition request, the most recent base model for each supported language is used by default. The base model works well in most speech recognition scenarios.
A custom model can be used to augment the base model to improve recognition of domain-specific vocabulary by providing text data to train the model. It can also be used to improve recognition for the specific audio conditions of the application by providing audio data with reference transcriptions. For more information, see custom speech and Speech to text REST API.
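As an illustration of the kind of text data you might provide, a plain text training file contains sentences from your domain, one per line; the sentences here are hypothetical, and the exact accepted formats are described in the custom speech documentation:

```text
Move asset forty two to the archive container.
Schedule preventive maintenance on conveyor line three.
Escalate the ticket to the tier two support queue.
```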
Customization options vary by language or locale. To verify support, see Language and voice support for the Speech service.
Responsible AI
An AI system includes not only the technology, but also the people who use it, the people who are affected by it, and the environment in which it's deployed. Read the transparency notes to learn about responsible AI use and deployment in your systems.
- Transparency note and use cases
- Characteristics and limitations
- Integration and responsible use
- Data, privacy, and security