NVIDIA Releases Multilingual ASR Model Supporting 40 Languages

Jun 4, 2026

NVIDIA released Nemotron 3.5 ASR, a 600M-parameter multilingual speech-to-text model that transcribes 40 language-locales from a single checkpoint in real time with native punctuation and capitalization. The model uses a Cache-Aware FastConformer-RNNT architecture to achieve low latency (0.07 seconds to final transcript) without sacrificing accuracy, and is available as open weights on Hugging Face for fine-tuning and deployment without API dependencies.

TL;DR

Nemotron 3.5 ASR supports 40 language-locales in a single 600M-parameter model, eliminating the need for separate language-specific deployments
Real-time streaming reaches a 0.07-second latency to final transcript by caching encoder state instead of reprocessing overlapping audio chunks
Model includes punctuation and capitalization natively, removing the need for separate post-processing pipelines
Available as open weights on Hugging Face with fine-tuning capability for custom languages, domains, and accents
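
The caching idea in the second point can be illustrated with a toy sketch in Python. This is not NVIDIA's implementation or the FastConformer architecture: the two-layer causal "encoder", the window size `K`, and every function name below are hypothetical, chosen only to show why keeping per-layer state is cheaper than re-feeding overlapping audio with each chunk while producing the same output.

```python
K = 3           # assumed: each toy layer sees the current frame plus K-1 past frames
NUM_LAYERS = 2  # assumed depth of the toy encoder


def layer(ctx, n_new):
    """Causal toy layer: for the last n_new positions of ctx, sum a K-frame window."""
    start = len(ctx) - n_new
    return [sum(ctx[max(0, i - K + 1): i + 1]) for i in range(start, len(ctx))]


def full_pass(audio):
    """Offline reference: run every layer over the whole signal at once."""
    x = list(audio)
    for _ in range(NUM_LAYERS):
        x = layer(x, len(x))
    return x


def stream_overlap(audio, chunk):
    """Buffered streaming: re-feed enough past audio with every new chunk."""
    overlap = NUM_LAYERS * (K - 1)  # how far back the stacked layers look
    out, work = [], 0
    for s in range(0, len(audio), chunk):
        x = list(audio[max(0, s - overlap): s + chunk])
        for _ in range(NUM_LAYERS):
            x = layer(x, len(x))    # recomputes the overlapped frames too
            work += len(x)
        n_new = len(audio[s: s + chunk])
        out.extend(x[-n_new:])      # keep only outputs for the new frames
    return out, work


def stream_cached(audio, chunk):
    """Cache-aware streaming: each layer remembers its last K-1 inputs."""
    caches = [[] for _ in range(NUM_LAYERS)]
    out, work = [], 0
    for s in range(0, len(audio), chunk):
        x = list(audio[s: s + chunk])
        for li in range(NUM_LAYERS):
            ctx = caches[li] + x
            caches[li] = ctx[max(0, len(ctx) - (K - 1)):]
            x = layer(ctx, len(x))  # computes outputs for new frames only
            work += len(x)
        out.extend(x)
    return out, work


audio = list(range(12))             # stand-in for 12 audio frames
ref = full_pass(audio)
out_o, work_o = stream_overlap(audio, chunk=4)
out_c, work_c = stream_cached(audio, chunk=4)
print(out_o == ref, out_c == ref)   # True True
print(work_o, work_c)               # 40 24
```

In this toy setup both streaming variants reproduce the offline result, but the overlapping version performs 40 layer-frame computations against 24 for the cached version. The gap grows with more layers or a longer look-back window, which is the general motivation for caching encoder state.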

Why It Matters

Multilingual speech recognition has historically required stitching together multiple models or APIs, each with different latency profiles and billing structures. Nemotron 3.5 ASR consolidates this complexity into a single model that handles language switching mid-sentence and delivers production-ready output without additional post-processing, reducing infrastructure overhead for speech-enabled applications.

Business Impact

Organizations building multilingual products can reduce operational complexity and cost by deploying a single model instead of managing 40 separate integrations. The open-weights approach eliminates per-call API billing and allows companies to fine-tune the model for domain-specific vocabulary or accents, improving accuracy for specialized use cases like customer support or medical transcription.

Key Implications

Enterprises can consolidate multilingual ASR infrastructure, reducing vendor lock-in and per-call costs associated with API-based solutions
Native punctuation and capitalization eliminate the need for secondary NLP models, simplifying deployment pipelines and reducing latency
Fine-tuning capability enables customization for industry-specific terminology and regional accents without retraining from scratch
Real-time streaming with low latency opens use cases in live captioning and conversational AI that were previously impractical with traditional buffered ASR

What to Watch

Watch adoption across enterprise speech applications and whether fine-tuned versions meet accuracy targets in specialized domains. Also track whether the model's multilingual coverage reduces fragmentation among ASR vendors, and whether competing models adopt similar caching architectures to match its latency.

Voice & Video AI · AI for Business · Model Releases · Open Source
