
Speech-to-Text

Applications

AI technology that converts spoken language into written text, like a very fast, highly accurate transcriptionist.

Think of speech-to-text like having a stenographer who works at superhuman speed, understands almost every accent and language, and never needs a coffee break. You talk, and the words appear as text almost instantly.

Speech-to-text (STT), also called automatic speech recognition (ASR), is AI technology that listens to spoken language and converts it into written text. This is what powers voice assistants (when Siri or Alexa understand what you say), video captions, meeting transcription, and voice typing on your phone.

Modern speech-to-text has improved dramatically thanks to deep learning. OpenAI's Whisper model, for example, was trained on 680,000 hours of audio collected from the internet and can transcribe speech in nearly 100 languages. It copes with accents, background noise, and technical jargon much better than older systems, although accuracy still varies by language and recording quality.

The technology works by converting audio into a visual representation (called a spectrogram, which looks like a heatmap of sound frequencies over time), then using a neural network to map those patterns to text. Modern systems process audio in short chunks and add punctuation as they go. Many can also tell speakers apart in a conversation (known as speaker diarization), though this is often handled by a separate component alongside the transcription model.
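The spectrogram step described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product implements it: it assumes NumPy is installed, uses a synthetic 1 kHz tone in place of recorded speech, and the function name `spectrogram` and the frame settings are hypothetical choices made for the example. Production models typically feed the network a log-Mel variant of this representation.

```python
import numpy as np

def spectrogram(signal, frame_size=256, hop=128):
    """Return a (frames x frequency bins) magnitude spectrogram."""
    # A Hann window tapers each frame's edges to reduce spectral leakage.
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    # Cut the signal into overlapping frames, one row per frame.
    frames = np.stack([
        signal[i * hop : i * hop + frame_size] * window
        for i in range(n_frames)
    ])
    # The FFT of each frame gives the strength of each frequency in that slice of time.
    return np.abs(np.fft.rfft(frames, axis=1))

# Stand-in for real speech: one second of a 1,000 Hz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(tone)

# Find the frequency bin with the most energy, averaged over time.
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 256  # bin index -> frequency in Hz

print(spec.shape)  # (time frames, frequency bins)
print(peak_hz)     # dominant frequency in Hz
```

Each row of `spec` is one slice of time and each column is one frequency band, which is the "heatmap" a neural network then reads. For real speech the bright regions shift from frame to frame as the sounds change, and those shifting patterns are what the model learns to map to text.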

Speech-to-text has transformed many workflows. Journalists transcribe interviews in minutes instead of hours. Content creators automatically generate subtitles for their videos. Doctors dictate medical notes that are instantly transcribed. Students record lectures and get searchable text transcripts. Meeting recording tools like those in Zoom and Google Meet use speech-to-text to generate meeting summaries and searchable transcripts.

Real-World Examples

  • OpenAI's Whisper transcribing audio in nearly 100 languages
  • YouTube auto-generating subtitles for uploaded videos
  • Descript transcribing podcast recordings for easy editing

Tools That Use This

  • Whisper (Free)
  • Descript (Freemium)

Related Terms

  • Text-to-Speech
  • Natural Language Processing
  • Multimodal
  • Deep Learning