@arxiv-sound
Automated posting of sound-related articles uploaded to arxiv.org (eess.AS + cs.SD)
Source: https://github.com/dsuedholt/bsky-paperbot-sound/
Inspired by @paperposterbot.bsky.social and https://twitter.com/ArxivSound
HiPPO, a hierarchical pronunciation assessment model, evaluates L2 learner proficiency at multiple linguistic levels; contrastive ordinal regularizer and curriculum learning improve assessment accuracy.
LGTSE extended with TripleC Learning and parallel universal training improves multi-condition target speech extraction, achieving superior performance over condition-specific models on Libri2Mix tasks.
AcuLa aligns audio encoders with medical language models for semantic understanding, improving AUROC on cardio-respiratory tasks from 0.68 to 0.79 and on COVID-19 cough detection from 0.55 to 0.89.
A contract-driven QoE auditing framework shows that classical MOS regression is a special case with a degenerate contract set, and that contract-driven quality is more stable than MOS.
A shared embedding space with Adaptive Angular Margin (AAM) loss for face and voice features achieved first place in the FAME 2026 challenge with an average Equal-Error Rate (EER) of 23.99%.
YingMusic-SVC uses singing-trained RVC timbre shifter, F0-aware timbre adaptor, and energy-balanced rectified flow matching loss to achieve improvements in timbre similarity, intelligibility, and perceptual naturalness.
Extended eMoBi-Q model incorporates a nonlinear auditory filterbank and loudness perception to predict binaural audio quality in normal-hearing and hearing-impaired populations.
Melody-driven SVS framework uses a Diffusion Transformer (DiT) enhanced with a module that extracts melody from reference audio; Flow-GRPO reinforcement learning enhances pronunciation clarity and melodic fidelity.
M3-TTS, a multi-modal diffusion transformer (MM-DiT) architecture, achieves state-of-the-art non-autoregressive text-to-speech performance with word error rates of 1.36% (English) and 1.31% (Chinese).
LargeSC uses the Mimi speech codec and the Moshi foundation model with LoRA to achieve adaptive semantic compression and robust transmission over lossy channels, outperforming baselines at bandwidths from 550 bps to 2.06 kbps.
Machine learning models, particularly logistic regression, predict Bisgaard audiogram types from loudness perception data with reasonable accuracy using PCA feature extraction, supporting remote audiology applications.
Robust Reward Policy Optimization (RRPO) mitigates reward hacking in emotional TTS by using a hybrid regularization scheme and a robust Reward Model (RM), improving both emotional expressiveness and naturalness.
Multi-loss learning framework with energy-adaptive mixup and frame-level attention yields state-of-the-art performance on IEMOCAP, MSP-IMPROV, RAVDESS, and SAVEE datasets for speech emotion recognition.
Aliasing-aware Patch Embedding (AaPE), a new patch stem, mitigates aliasing in Transformer-based audio SSL by augmenting patch tokens with features from a complex sinusoidal kernel; yields state-of-the-art performance on some tasks.
BioMamba, a Mamba-based audio LLM, achieves comparable performance to Transformer-based AVES on bioacoustic tasks with significantly less VRAM usage after pretraining and fine-tuning on the BEANS benchmark.
A universal harmonic discriminator with a learnable triangular band-pass filter bank is proposed for GAN-based vocoders to improve time-frequency representation; validated on speech and singing datasets.
Supervised finite scalar quantization (FSQ) methods for semantic speech token extraction outperform unsupervised K-means clustering in child ASR; they even surpass continuous representations at ultra-low bitrates.
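For readers unfamiliar with FSQ, here is a minimal sketch of the quantization step alone, not the supervised training described in the paper. It assumes every dimension uses an odd number of levels so the grid is symmetric around zero; the function names `fsq_quantize` and `fsq_index` are hypothetical.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Snap each dimension of z onto a small integer grid.

    Assumption: every entry of `levels` is odd, so dimension d takes
    values in {-(L_d - 1) / 2, ..., (L_d - 1) / 2}.
    """
    levels = np.asarray(levels)
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half            # squash into [-half, half]
    return np.round(bounded).astype(int)   # nearest grid point per dimension

def fsq_index(codes, levels):
    """Map a vector of per-dimension codes to a single token id."""
    levels = np.asarray(levels)
    half = (levels - 1) // 2
    shifted = codes + half                 # now in 0 .. L_d - 1
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (shifted * basis).sum(axis=-1)  # mixed-radix encoding

if __name__ == "__main__":
    levels = [3, 5]                        # implicit codebook of 3 * 5 = 15 tokens
    z = np.array([[0.0, 0.0], [10.0, -10.0]])
    codes = fsq_quantize(z, levels)
    print(codes, fsq_index(codes, levels))
```

The codebook is implicit: its size is the product of the per-dimension level counts, so no centroids need to be learned as they would be with K-means.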
Perceptual evaluation of acoustic level of detail (ALOD) in virtual acoustic environments shows that strong ALOD reduction is feasible while maintaining plausibility, speech intelligibility, and externalization; early reflections' accuracy is less relevant if late reverberation is represented.
Unsupervised dimensionality reduction methods, PCA and autoencoders, define sonic behavior spaces for quality diversity algorithms; automatic approaches achieve greater diversity than handcrafted spaces, with PCA proving most effective.
ImageBind-LoRA, leveraging ImageBind with LoRA, demonstrates cross-lingual generalization in face-voice association; fine-tuned on Arabic audio, it achieves an EER of 24.73% on unseen languages.
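As a reminder of what the EER figure measures, here is a minimal sketch of computing an equal-error rate from verification scores. It is a generic threshold sweep, not the challenge's official scoring code; the function name `equal_error_rate` and the toy scores are assumptions for illustration.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Return the error rate at the threshold where FAR and FRR are closest.

    genuine:  similarity scores for matching face-voice pairs
    impostor: similarity scores for non-matching pairs
    A pair is accepted when its score is >= the threshold.
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # False rejection rate: matching pairs scored below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance rate: non-matching pairs scored at or above it.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0

if __name__ == "__main__":
    genuine = [0.9, 0.8, 0.7, 0.3]
    impostor = [0.1, 0.2, 0.4, 0.6]
    print(f"EER = {100 * equal_error_rate(genuine, impostor):.2f}%")
```

Lower is better: an EER of 50% corresponds to chance on a balanced verification task, and 0% to perfectly separated scores.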
Four approaches for dysarthria severity classification were compared using the SAND dataset; a feature-engineered XGBoost ensemble achieved the highest macro-F1 score, while deep learning models offered competitive performance.
Pianist Transformer, a model for expressive piano performance rendering, uses a unified MIDI data representation and asymmetric architecture; self-supervised pre-training with 10B tokens achieves state-of-the-art performance.
A generative feedback framework for singing voice synthesis evaluation provides multi-dimensional language and audio critiques using an audio-language model; experiments validate effectiveness for guiding generative model improvement.
VibOmni, a multi-modal speech enhancement system for earables, uses bone-conducted vibrations captured by IMUs; a novel data augmentation technique generates synthetic vibration data from limited recordings.
An interactive continual learning framework for singing voice separation allows users to fine-tune a U-Net model by marking false positives; experiments show performance improvements over the base model in various settings.
Story2MIDI, a Transformer model, generates emotion-aligned music from text using a dataset of text-music pairs evoking similar emotions; evaluations confirm the model's ability to capture intended emotional cues.
Token-level adaptation of ASR systems improves dysfluency transcription on LibriStutter and KSoF datasets; language-adaptive pretraining and tokenizer analysis address English-centric bias in multilingual systems.
The Parallel Delayed Memory Unit (PDMU), a delay-gated state-space module, enhances temporal modeling in bio-signals by compressing temporal information using Legendre Memory Units (LMU); demonstrates improved memory capacity and model performance.