arXiv Sound (@arxiv-sound)

HiPPO: Exploring A Novel Hierarchical Pronunciation Assessment Approach for Spoken Languages Bi-Cheng Yan, Hsin-Wei Wang, Fu-An Chao, Tien-Hong Lo, Yung-Chang Hsu, Berlin Chen

HiPPO, a hierarchical pronunciation assessment model, evaluates L2 learner proficiency at multiple linguistic levels; contrastive ordinal regularizer and curriculum learning improve assessment accuracy.

05.12.2025 11:46

TripleC Learning and Lightweight Speech Enhancement for Multi-Condition Target Speech Extraction Ziling Huang (Shanghai Normal University, China)

LGTSE extended with TripleC Learning and parallel universal training improves multi-condition target speech extraction, achieving superior performance over condition-specific models on Libri2Mix tasks.

05.12.2025 11:23

Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding Tsai-Ning Wang, Lin-Lin Chen, Neil Zeghidour, Aaqib Saeed

AcuLa aligns audio encoders with medical language models for semantic understanding, improving AUROC on cardio-respiratory tasks from 0.68 to 0.79 and on COVID-19 cough detection from 0.55 to 0.89.

05.12.2025 10:59

Contract-Driven QoE Auditing for Speech and Singing Services: From MOS Regression to Service Graphs Wenzhang Du

A contract-driven QoE auditing framework shows that classical MOS regression is a special case with a degenerate contract set, and that contract-driven quality is more stable than MOS.

05.12.2025 10:36

Shared Multi-modal Embedding Space for Face-Voice Association Christopher Simic, Korbinian Riedhammer, Tobias Bocklet

Shared embedding space with Adaptive Angular Margin (AAM) loss for face and voice features achieved first place in the FAME 2026 challenge with an average Equal-Error Rate (EER) of 23.99%.

05.12.2025 10:13

YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases Gongyu Chen, Xiaoyu Zhang, Zhenqiang Weng, Junjie Zheng, Da Shen, Chaofan Ding, Wei-Qiang Zhang, Zihao Chen

YingMusic-SVC uses singing-trained RVC timbre shifter, F0-aware timbre adaptor, and energy-balanced rectified flow matching loss to achieve improvements in timbre similarity, intelligibility, and perceptual naturalness.

05.12.2025 09:50

Towards predicting binaural audio quality in listeners with normal and impaired hearing Thomas Biberger, Stephan D. Ewert

Extended eMoBi-Q model incorporates a nonlinear auditory filterbank and loudness perception to predict binaural audio quality in normal-hearing and hearing-impaired populations.

05.12.2025 09:27

YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-free Melody Guidance Junjie Zheng, Chunbo Hao, Guobin Ma, Xiaoyu Zhang, Gongyu Chen, Chaofan Ding, Zihao Chen, Lei Xie

Melody-driven SVS framework uses a Diffusion Transformer (DiT) enhanced with a module that extracts melody from reference audio; Flow-GRPO reinforcement learning enhances pronunciation clarity and melodic fidelity.

05.12.2025 09:04

M3-TTS: Multi-modal DiT Alignment Mel-latent for Zero-shot High-fidelity Speech Synthesis Xiaopeng Wang, Chunyu Qiang, Ruibo Fu, Zhengqi Wen, Xuefei Liu, Yukun Liu, Yuzhe Liang, Kang Yin, Yuankun Xie, Heng Xie, Chenxing Li, Chen Zhang, Changsheng Li

M3-TTS, a multi-modal diffusion transformer (MM-DiT) architecture, achieves state-of-the-art non-autoregressive text-to-speech performance with word error rates of 1.36% (English) and 1.31% (Chinese).

05.12.2025 08:41

Large Speech Model Enabled Semantic Communication Yun Tian, Zhijin Qin, Guocheng Lv, Ye Jin, Kaibin Huang, Zhu Han

LargeSC uses the Mimi speech codec and the Moshi foundation model with LoRA to achieve adaptive semantic compression and robust transmission over lossy channels, outperforming baselines at bandwidths from 550 bps to 2.06 kbps.

05.12.2025 08:18

Standard audiogram classification from loudness scaling data using unsupervised, supervised, and explainable machine learning techniques Chen Xu, Lena Schell-Majoor, Birger Kollmeier

Machine learning models, particularly logistic regression, predict Bisgaard audiogram types from loudness perception data with reasonable accuracy using PCA feature extraction, supporting remote audiology applications.

05.12.2025 07:55

RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS Cong Wang, Changfeng Gao, Yang Xiang, Zhihao Du, Keyu An, Han Zhao, Qian Chen, Xiangang Li, Yingming Gao, Ya Li

Robust Reward Policy Optimization (RRPO) mitigates reward hacking in emotional TTS by using a hybrid regularization scheme and a robust Reward Model (RM), improving both emotional expressiveness and naturalness.

05.12.2025 07:32

Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention Cong Wang, Yizhong Geng, Yuhua Wen, Qifei Li, Yingming Gao, Ruimin Wang, Chunfeng Wang, Hao Li, Ya Li, Wei Chen

Multi-loss learning framework with energy-adaptive mixup and frame-level attention yields state-of-the-art performance on IEMOCAP, MSP-IMPROV, RAVDESS, and SAVEE datasets for speech emotion recognition.

05.12.2025 07:09

AaPE: Aliasing-aware Patch Embedding for Self-Supervised Audio Representation Learning Kohei Yamamoto, Kosuke Okusa

Aliasing-aware Patch Embedding (AaPE), a new patch stem, mitigates aliasing in Transformer-based audio SSL by augmenting patch tokens with features from a complex sinusoidal kernel; yields state-of-the-art performance on some tasks.

04.12.2025 10:53

State Space Models for Bioacoustics: A Comparative Evaluation with Transformers Chengyu Tang, Sanjeev Baskiyar

BioMamba, a Mamba-based audio LLM, achieves comparable performance to Transformer-based AVES on bioacoustic tasks with significantly less VRAM usage after pretraining and fine-tuning on the BEANS benchmark.

04.12.2025 09:38

A Universal Harmonic Discriminator for High-quality GAN-based Vocoder Nan Xu, Zhaolong Huang, Xiao Zeng

A universal harmonic discriminator with a learnable triangular band-pass filter bank is proposed for GAN-based vocoders to improve time-frequency representation; validated on speech and singing datasets.

04.12.2025 08:23

Comparing Unsupervised and Supervised Semantic Speech Tokens: A Case Study of Child ASR Mohan Shi, Natarajan Balaji Shankar, Kaiyuan Zhang, Zilai Wang, Abeer Alwan

Supervised finite scalar quantization (FSQ) methods for semantic speech token extraction outperform unsupervised K-means clustering in child ASR and even surpass continuous representations at ultra-low bitrates.

04.12.2025 07:08

Perceptual evaluation of Acoustic Level of Detail in Virtual Acoustic Environments Stefan Fichna, Steven van de Par, Bernhard U. Seeber, Stephan D. Ewert

Perceptual evaluation of acoustic level of detail (ALOD) in virtual acoustic environments shows that strong ALOD reduction is feasible while maintaining plausibility, speech intelligibility, and externalization; early reflections' accuracy is less relevant if late reverberation is represented.

03.12.2025 11:39

Exploring Definitions of Quality and Diversity in Sonic Measurement Spaces Björn Þór Jónsson, Çağrı Erdem, Stefano Fasciani, Kyrre Glette

Unsupervised dimensionality reduction methods, PCA and autoencoders, define sonic behavior spaces for quality diversity algorithms; automatic approaches achieve greater diversity than handcrafted spaces, with PCA proving most effective.

03.12.2025 11:09

Towards Language-Independent Face-Voice Association with Multimodal Foundation Models Aref Farhadipour, Teodora Vukovic, Volker Dellwo

ImageBind-LoRA, leveraging ImageBind with LoRA, demonstrates cross-lingual generalization in face-voice association; fine-tuned on Arabic audio, it achieves an EER of 24.73% on unseen languages.

03.12.2025 10:39

SAND Challenge: Four Approaches for Dysarthria Severity Classification Gauri Deshpande, Harish Battula, Ashish Panda, Sunil Kumar Kopparapu

Four approaches for dysarthria severity classification were compared using the SAND dataset; a feature-engineered XGBoost ensemble achieved the highest macro-F1 score, while deep learning models offered competitive performance.

03.12.2025 10:09

Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training Hong-Jie You, Jie-Jing Shao, Xiao-Wen Yang, Lin-Han Jia, Lan-Zhe Guo, Yu-Feng Li

Pianist Transformer, a model for expressive piano performance rendering, uses a unified MIDI data representation and asymmetric architecture; self-supervised pre-training with 10B tokens achieves state-of-the-art performance.

03.12.2025 09:39

Generative Multi-modal Feedback for Singing Voice Synthesis Evaluation Xueyan Li, Yuxin Wang, Mengjie Jiang, Qingzi Zhu, Jiang Zhang, Zoey Kim, Yazhe Niu

A generative feedback framework for singing voice synthesis evaluation provides multi-dimensional language and audio critiques using an audio-language model; experiments validate effectiveness for guiding generative model improvement.

03.12.2025 09:09

VibOmni: Towards Scalable Bone-conduction Speech Enhancement on Earables Lixing He, Yunqi Guo, Haozheng Hou, Zhenyu Yan

VibOmni, a multi-modal speech enhancement system for earables, uses bone-conducted vibrations captured by IMUs; a novel data augmentation technique generates synthetic vibration data from limited recordings.

03.12.2025 08:39

Continual Learning for Singing Voice Separation with Human in the Loop Adaptation Ankur Gupta, Anshul Rai, Archit Bansal, Vipul Arora

An interactive continual learning framework for singing voice separation allows users to fine-tune a U-Net model by marking false positives; experiments show performance improvements over the base model in various settings.

03.12.2025 08:09

Story2MIDI: Emotionally Aligned Music Generation from Text Mohammad Shokri, Alexandra C. Salem, Gabriel Levine, Johanna Devaney, Sarah Ita Levitan

Story2MIDI, a Transformer model, generates emotion-aligned music from text using a dataset of text-music pairs evoking similar emotions; evaluations confirm the model's ability to capture intended emotional cues.

03.12.2025 07:39

On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts Kashaf Gulzar, Dominik Wagner, Sebastian P. Bayerl, Florian Hönig, Tobias Bocklet, Korbinian Riedhammer

Token-level adaptation of ASR systems improves dysfluency transcription on LibriStutter and KSoF datasets; language-adaptive pretraining and tokenizer analysis address English-centric bias in multilingual systems.

03.12.2025 07:09

Parallel Delayed Memory Units for Enhanced Temporal Modeling in Biomedical and Bioacoustic Signal Analysis Pengfei Sun, Wenyu Jiang, Paul Devos, Dick Botteldooren

The Parallel Delayed Memory Unit (PDMU), a delay-gated state-space module, enhances temporal modeling in bio-signals by compressing temporal information using Legendre Memory Units (LMU); demonstrates improved memory capacity and model performance.

02.12.2025 11:39

LLM2Fx-Tools: Tool Calling For Music Post-Production Seungheon Doh, Junghyun Koo, Marco A. Martínez-Ramírez, Woosung Choi, Wei-Hsiang Liao, Qiyu Wu, Juhan Nam, Yuki Mitsufuji

LLM2Fx-Tools is a multimodal tool-calling framework that generates audio effects chains for music post-production using a large language model; validated in style transfer setting.

02.12.2025 11:14

Q2D2: A Geometry-Aware Audio Codec Leveraging Two-Dimensional Quantization Tal Shuster, Eliya Nachmani

Q2D2, a geometry-aware audio codec, uses two-dimensional quantization on structured grids; improves compression efficiency with low token rates and high codebook utilization while maintaining state-of-the-art reconstruction quality.

02.12.2025 10:49
