Overall architecture of AlphaFlowTSE. Given a mixture waveform y and an enrollment utterance e, we compute complex STFT features and form the mixture feature Y and enrollment feature E (real/imaginary concatenation). During training, the backbone takes the current state feature zt; during inference we initialize z0 = Y . The enrollment feature is concatenated as a temporal prefix, yielding [E∥zt] (or [E∥z0] at inference), which is fed to the UDiT backbone. The backbone is conditioned via AdaLN on the absolute time t and the interval length ∆ = r − t (with r = 1 at inference), and predicts the mean velocity for finite-interval transport, denoted uθ(t, r, [E∥zt]). One-step inference (NFE= 1) produces an estimated complex STFT Sˆ = (SˆRe, SˆIm), which is converted to the target waveform sˆ by iSTFT. The dashed module is an optional mixing-ratio predictor used only in the background-to-target ablation to predict the start coordinate
Imagine a noisy group call where 3 people talk at once.
This paper builds a model that can focus on a single speaker (using a short voice sample) and extract that voice.
This cleanup results in better, faster audio transcription.
Summary and full paper 👇
#AudioML #SpeechToText