Customize Speech-to-Text

October 22, 2025

TL;DR

  • Argmax Pro SDK now supports Custom Vocabulary, an advanced feature that lets developers quickly customize or personalize speech-to-text models by providing a list of contextual keywords at runtime.
  • Unlike Whisper Keyword Prompting or Deepgram Keyword Boosting, Argmax's Custom Vocabulary feature is model-agnostic, meaning the detected keywords can be merged with any transcription result.
  • Developers have envisioned applications that scale well beyond the ~100-keyword limit that Whisper and proprietary cloud APIs impose. The first version of Argmax Pro SDK Custom Vocabulary supports 1000 keywords.
  • Use cases range from proper spelling of people, company and product names in meeting transcriptions to industry or occupation-specific jargon in field service or front-line work.
  • Open-source, reproducible benchmarks show that keyterm recognition accuracy (F1-score) improves from 64% to 88%, with more improvements on the way. Results can be reviewed on OpenBench.
  • Try it on superwhisper-2.6.2 or Argmax Playground today!

Custom Vocabulary

Clean audio recordings of casual conversations without any names or jargon are easy to transcribe. In fact, most speech-to-text systems today do an almost perfect job under those conditions. However, most systems also break under realistic settings, such as the one below:

Apple Native API gets 0/4 correct as tested in the Voice Memos app on iOS 26.
For the same recording, Argmax Pro SDK gets 4/4 in the Argmax Playground.

Custom Vocabulary works by registering a list of contextual keywords with the transcription system, which enables a dedicated "keyword search" component. How does one find these contextual keywords? Here are just a few examples:

  • Meetings: Keywords from the calendar invite, attendees' names, the names of the companies they represent and the most popular products of these companies.
  • Videos: Title and description, plus OCR results from the video when slides contain technical jargon.
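
As a rough illustration of the meeting case, the sketch below gathers keywords from calendar metadata into one deduplicated list capped at the 1000-keyword limit mentioned above. The `Meeting` container and `build_vocabulary` helper are hypothetical names introduced here for illustration; they are not part of Argmax Pro SDK, and the priority order is only an assumption.

```python
from dataclasses import dataclass, field

MAX_KEYWORDS = 1000  # limit stated in this post for the first version


@dataclass
class Meeting:
    """Hypothetical container for calendar-invite metadata (not an SDK type)."""
    title_keywords: list[str] = field(default_factory=list)
    attendee_names: list[str] = field(default_factory=list)
    company_names: list[str] = field(default_factory=list)
    product_names: list[str] = field(default_factory=list)


def build_vocabulary(meeting: Meeting, limit: int = MAX_KEYWORDS) -> list[str]:
    """Collect unique keywords in an assumed priority order, capped at `limit`."""
    sources = (
        meeting.attendee_names,  # names are assumed to matter most
        meeting.company_names,
        meeting.product_names,
        meeting.title_keywords,
    )
    seen: set[str] = set()
    vocabulary: list[str] = []
    for source in sources:
        for raw in source:
            if len(vocabulary) >= limit:
                return vocabulary
            keyword = " ".join(raw.split())  # trim and collapse whitespace
            key = keyword.casefold()  # case-insensitive duplicate check
            if not keyword or key in seen:
                continue
            seen.add(key)
            vocabulary.append(keyword)
    return vocabulary
```

The resulting list is what would be handed to the transcription system at runtime. Ordering by priority means that if the cap is ever hit, the least important keywords are the ones dropped.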

Achieving high accuracy on names and jargon makes all the difference between a toy utility and critical infrastructure for high-stakes use cases such as AI medical scribes, virtual meeting transcription and even personal dictation software.

Just for fun, here is a challenging test with Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch in Argmax Playground:

Benchmarks

To the best of our knowledge, there are no public datasets for measuring the accuracy of speech-to-text systems on proper name spelling. For this purpose, we have further annotated the earnings22 dataset (test split) with people, company and product names. In the first version of this dataset, we have curated ~1000 audio clips, each 15 seconds long, that contain at least one name. We reviewed each sample manually, which led to many corrections of the original ground-truth transcripts: challenging parts of the audio had been annotated as inaudible, and many names had been annotated incorrectly. After manually verifying every name, including LinkedIn searches to cross-reference people and company names, we arrived at an extremely high-quality test dataset.

On this set, the accuracy of Argmax Pro SDK, as measured by the F1-score, jumped from 64% to 88% when enabling Custom Vocabulary with the keywords for each audio clip! Further results with competing systems can be found on OpenBench. The dataset will be made public alongside our research paper in the coming weeks.
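
For readers unfamiliar with the metric, here is a minimal sketch of how an F1-score over keywords can be computed. It assumes each clip comes with its keyword list, a reference transcript and a system transcript, and it uses plain case-insensitive substring matching. The `keyword_f1` function and this matching rule are illustrative assumptions, not the exact OpenBench protocol.

```python
from typing import NamedTuple


class KeywordScore(NamedTuple):
    precision: float
    recall: float
    f1: float


def keyword_f1(clips):
    """Score keyword recognition over (keywords, reference, hypothesis) clips.

    Assumption: a keyword counts as present when it appears as a
    case-insensitive substring of the transcript.
    """
    tp = fp = fn = 0
    for keywords, reference, hypothesis in clips:
        ref, hyp = reference.casefold(), hypothesis.casefold()
        for keyword in keywords:
            needle = keyword.casefold()
            in_ref, in_hyp = needle in ref, needle in hyp
            if in_ref and in_hyp:
                tp += 1  # keyword spoken and transcribed correctly
            elif in_ref:
                fn += 1  # keyword spoken but missed or misspelled
            elif in_hyp:
                fp += 1  # keyword transcribed but never spoken
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return KeywordScore(precision, recall, f1)
```

Precision penalizes keywords that are inserted where they were never spoken, while recall penalizes names that are missed or misspelled; F1 is their harmonic mean, so a system cannot score well by over-inserting keywords.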

Day 0 Support on superwhisper!

This feature has been available in alpha testing for the past week, and several customers have already shipped with the stable version today! If you are a superwhisper user, update to 2.6.2 to get Custom Vocabulary added to the Nvidia Parakeet models powered by Argmax Pro SDK!
