
Anuj Diwan

@anujdiwan

UT CS PhD Student working on generative speech models. Prev. Student Researcher @ Google DeepMind, FAIR (Meta AI) and Adobe Research. 2021 BTech CS IIT Bombay.

195 Followers · 123 Following · 6 Posts · Joined 18.11.2024

Latest posts by Anuj Diwan @anujdiwan

Thanks to my amazing collaborators Zhisheng Zheng, @eunsol.bsky.social and David Harwath!
Paper: arxiv.org/abs/2503.04713
Code: github.com/ajd12342/par...
Dataset: huggingface.co/datasets/ajd...
Model: huggingface.co/ajd12342/par...
Demo: paraspeechcaps.github.io

08.03.2025 04:04 👍 0 🔁 0 💬 0 📌 0
Evaluation results comparing style consistency (CMOS, Intrinsic and Situational Rich Tag Recall), speech quality (NMOS) and intelligibility (IMOS, WER). Mean score and 95% confidence intervals are reported for MOS. Our Base and Scaled models obtain improved style consistency (+5.6% and +7.9% Consistency MOS) and speech quality (+5.5% and +15.5% Naturalness MOS) over baselines.

We finetune Parler-TTS-Mini-v1 on ParaSpeechCaps and achieve significant improvements in both speech-style consistency and naturalness over our best-performing baseline (which combines existing smaller-scale style datasets)!
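
For readers who want to try this, here is a minimal inference sketch following the standard Parler-TTS generation API. The finetuned checkpoint id below is a placeholder, not the real repo (the actual model link is given, truncated, in the last post of this thread):

    # Minimal sketch: prompting a ParaSpeechCaps-finetuned Parler-TTS model.
    # "user/parler-tts-mini-paraspeechcaps" is a placeholder id, not the real repo.
    import torch
    import soundfile as sf
    from transformers import AutoTokenizer
    from parler_tts import ParlerTTSForConditionalGeneration

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    repo = "user/parler-tts-mini-paraspeechcaps"  # placeholder checkpoint id
    model = ParlerTTSForConditionalGeneration.from_pretrained(repo).to(device)
    tokenizer = AutoTokenizer.from_pretrained(repo)

    # A rich style prompt mixing intrinsic and situational tags.
    description = "A man speaks in a low, guttural voice, whispering as if scared."
    prompt = "Did you hear that? I think someone is downstairs."

    desc_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    audio = model.generate(input_ids=desc_ids, prompt_input_ids=prompt_ids)
    sf.write("out.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)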

08.03.2025 04:04 👍 0 🔁 0 💬 1 📌 0
Human evaluation of intrinsic/situational style tag recalls, comparing our datasets and ablations. For rich intrinsic tags, PSC-Scaled (50.3%) achieves a comparable performance to PSC-Base (48.7%), while Std. Embedder (45.3%) worsens it. For rich situational tags, PSC-Scaled (71.3%) achieves a comparable performance to PSC-Base (68.1%), while removing any of Expressivity Filtering (61.0%), Semantic Matching (66.1%), or Acoustic Matching (63.3%) worsens it.

ParaSpeechCaps contains 282 hours of human-labelled data and 2,427 hours of automatically-labelled data. Human evaluators rate our scaled data as on par with the human-labelled data! We carefully ablate our dataset design choices.
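
As a sketch of what working with the corpus might look like: the repo id and the annotation-source column below are assumptions for illustration only (the real dataset id is behind the truncated huggingface.co link in the last post of this thread):

    # Hypothetical sketch: loading ParaSpeechCaps with HuggingFace `datasets`
    # and separating the human-labelled (~282 h) and automatically-labelled
    # (~2,427 h) portions. The repo id and "source" column are assumptions.
    from datasets import load_dataset

    ds = load_dataset("user/paraspeechcaps", split="train")  # placeholder id
    human = ds.filter(lambda ex: ex["source"] == "human")
    scaled = ds.filter(lambda ex: ex["source"] == "automatic")
    print(f"{len(human)} human-labelled rows, {len(scaled)} scaled rows")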

08.03.2025 04:04 👍 0 🔁 0 💬 1 📌 0
An overview of our automatic dataset scaling pipeline, for rich intrinsic and situational tags. For intrinsic style tags, we use a perceptual speaker similarity model to identify speakers whose speech resembles that of speakers human-annotated with intrinsic tags. Then, we propagate the intrinsic tags of the similar speaker. For situational style tags, we combine three different types of signals. We first identify expressive speech using an off-the-shelf dominance-valence-arousal speech classifier. Among the selected expressive speech clips, we use a text embedding model to find transcripts that semantically match the desired situational tag. Lastly, we use a large-scale speech-text multimodal LLM to check whether the speech acoustically matches the situational tag.

ParaSpeechCaps is the first large-scale dataset that supports both speaker-level intrinsic tags and utterance-level situational tags. Our key contribution is a novel pipeline that, for the first time, enables scalable automatic style annotation across such a wide variety of rich styles.
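
Schematically, the two branches of the pipeline described above could look like the following sketch; every helper (speaker_embed, dva_score, embed_text, audio_llm_judge) and every threshold is a hypothetical stand-in for the actual models the paper uses:

    # Hypothetical sketch of the two-branch automatic annotation pipeline.
    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def propagate_intrinsic_tags(unlabeled, labeled, speaker_embed, thresh=0.8):
        """Copy intrinsic tags from the most perceptually similar labelled speaker."""
        bank = [(spk, speaker_embed(spk.audio)) for spk in labeled]
        tags = {}
        for spk in unlabeled:
            emb = speaker_embed(spk.audio)
            best, sim = max(((l, cosine(emb, e)) for l, e in bank), key=lambda t: t[1])
            if sim >= thresh:  # perceptually similar enough -> propagate tags
                tags[spk.id] = best.tags
        return tags

    def annotate_situational(clips, tag, dva_score, embed_text, audio_llm_judge):
        """Three-signal filter: expressivity -> semantic match -> acoustic match."""
        expressive = [c for c in clips if dva_score(c.audio) >= 0.5]  # 1) DVA classifier
        tag_emb = embed_text(tag)
        semantic = [c for c in expressive
                    if cosine(embed_text(c.transcript), tag_emb) >= 0.6]  # 2) transcript match
        return [c for c in semantic if audio_llm_judge(c.audio, tag)]  # 3) acoustic check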

08.03.2025 04:04 👍 1 🔁 0 💬 1 📌 0
Randomly sampled examples from ParaSpeechCaps. Our style prompts cover rich tags describing complex styles like rhythm, clarity and emotion, in contrast to earlier basic style prompts that only contain gender, pitch and speed levels. We highlight rich style tags with vibrant colors and basic style tags in gray.

Introducing ParaSpeechCaps, our large-scale style captions dataset that enables rich, expressive control for text-to-speech models!
Beyond basic pitch or speed controls, our models can generate speech that sounds "guttural", "scared", "whispered" and more; 59 style tags in total.

🧵👇

08.03.2025 04:04 👍 3 🔁 1 💬 1 📌 0

Thanks for this list! Would appreciate being added :)

22.11.2024 16:37 👍 1 🔁 0 💬 0 📌 0

I've started putting together a starter pack with people working on Speech Technology and Speech Science: go.bsky.app/BQ7mbkA

(Self-)nominations welcome!

19.11.2024 11:13 👍 82 🔁 34 💬 44 📌 3