How AI hears accents: An audible visualization of accent clusters

https://news.ycombinator.com/rss Hits: 18

Summary

Today, we’re going to go on a tour of the world's accents in English. Users of BoldVoice, the American accent training app, speak more than 200 different languages, and it is our mission to help them speak English clearly and confidently. While building the accent strength metric we covered in the previous blog post, we needed to understand how our models clustered accents, dialects, native languages, and language families. Here, we share some of our findings using a 3D latent visualization.

Technical Approach

To begin, we finetuned HuBERT, a pretrained audio-only foundation model, for the task of accent identification using our in-house dataset of non-native English speech and self-reported accents. BoldVoice’s own dataset of accented speech is one of the largest of its kind in the world.

HuBERT + classification head architecture:

Model: boldvoice/hubert-accent-identifier
Total Parameters: 94.6M (all trainable)

ARCHITECTURE:
═════════════
            ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌───────────────┐
Raw Audio → │   Feature   │ → │   Feature   │ → │ Transformer │ → │ Classification│
(16kHz)     │  Extractor  │   │ Projection  │   │   Encoder   │   │     Head      │
            └─────────────┘   └─────────────┘   └─────────────┘   └───────────────┘
            7 CNN layers      LayerNorm→Linear  12 layers         768→256→50
            1→512, 320x ↓     512→768, Dropout  12 heads, dim=768
                                                (89.8M params)

KEY DETAILS:
• Input: Raw waveform (no spectrograms)
• Downsampling: 320x (5×2×2×2×2×2×2)
• Transformer: 12 layers

This model receives only the raw input audio and associated accent label; it gets neither a text prompt nor a transcript.

For this "finetuning", we sampled 30 million speech recordings comprising 25,000 hours of English speech, a small fraction of our total accent dataset. Unlike in a traditional finetune, we unfroze all layers of the pretrained base model due to the large size of our dataset. We trained the model for roughly a week on a cluster of A100 GPUs.
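The numbers in the architecture summary can be checked with a little arithmetic. The sketch below is not BoldVoice's code. It takes the strides from the stated 5×2×2×2×2×2×2 factorization and, as an assumption, takes the kernel sizes from the standard HuBERT base configuration, which the article does not list. It also assumes the 768→256→50 head is two plain linear layers with biases. The function names are hypothetical.

```python
from math import prod

# Strides come from the article's "5×2×2×2×2×2×2".
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)
# ASSUMPTION: kernel sizes from the standard HuBERT base config,
# not stated in the article.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
SAMPLE_RATE = 16_000  # Hz, per the "Raw Audio (16kHz)" input


def total_downsampling(strides=CONV_STRIDES):
    """Overall downsampling factor of the 7-layer CNN feature extractor."""
    return prod(strides)


def frame_duration_ms(sample_rate=SAMPLE_RATE, strides=CONV_STRIDES):
    """Audio time covered by one hop between consecutive output frames."""
    return 1000 * total_downsampling(strides) / sample_rate


def num_frames(num_samples, kernels=CONV_KERNELS, strides=CONV_STRIDES):
    """Frames produced by a stack of unpadded 1D convolutions."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


def head_params(dims=(768, 256, 50), bias=True):
    """Parameter count of a stack of linear layers.

    ASSUMPTION: the 768→256→50 head is two linear layers with biases.
    """
    return sum(i * o + (o if bias else 0) for i, o in zip(dims, dims[1:]))


if __name__ == "__main__":
    print(total_downsampling())     # 320
    print(frame_duration_ms())      # 20.0
    print(num_frames(SAMPLE_RATE))  # 49 frames for one second of audio
    print(head_params())            # 209714
```

Under these assumptions, one second of 16 kHz audio becomes 49 frames of 768 dimensions, and the classification head accounts for roughly 0.2M of the 94.6M parameters.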
While the accent identifier performs quite well across the top hundred or so accent...

First seen: 2025-10-14 19:38

Last seen: 2025-10-15 12:42
