
Zhaofeng Lin

@zhaofenglin

PhD student @Trinity College Dublin | Multimodal speech recognition https://chaufanglin.github.io/

59
Followers
45
Following
13
Posts
30.12.2023
Joined

Latest posts by Zhaofeng Lin @zhaofenglin

Congrats!! 🤩

02.07.2025 11:47 👍 1 🔁 0 💬 0 📌 0

This paper investigates how AVSR systems exploit visual information.
Our findings reveal varying patterns across systems, which do not always match human perception.

We recommend reporting *effective SNR gains* alongside WERs for a more comprehensive performance assessment 🧐
[8/8] 🧵

01.04.2025 11:18 👍 0 🔁 0 💬 0 📌 0
Post image

Results show Auto-AVSR may rely more on audio, with a weaker correlation between MaFI scores and IWERs in AV mode.
In contrast, AVEC shows a stronger use of visual information, with a significant negative correlation, especially in noisy conditions.

[7/8] 🧵

01.04.2025 11:18 👍 0 🔁 0 💬 1 📌 0

Finally, we explore the relationship between AVSR errors and Mouth & Facial Informativeness (MaFI) scores.

We calculated Individual WERs (IWERs) for each word and computed the Pearson correlation between MaFI scores and IWERs for the audio-only, video-only, and AV models. [6/8] 🧵

01.04.2025 11:17 👍 0 🔁 0 💬 1 📌 0
Post image
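(Not from the paper's code: a minimal Python sketch of the per-word analysis in [6/8]. `aligned` and `mafi_scores` are hypothetical stand-ins for real alignment output and the MaFI norms, and IWER is taken here as the fraction of a word's tokens that were mis-recognised.)

```python
# Minimal sketch, not the paper's code: correlate per-word IWERs with MaFI.
from collections import defaultdict
from scipy.stats import pearsonr

def word_iwers(aligned):
    """aligned: (reference word, is_error) pairs from an ASR alignment.
    Returns {word: fraction of its tokens that were mis-recognised}."""
    errs, counts = defaultdict(int), defaultdict(int)
    for word, is_error in aligned:
        counts[word] += 1
        errs[word] += int(is_error)
    return {w: errs[w] / counts[w] for w in counts}

# Hypothetical alignment output and MaFI scores, for illustration only.
aligned = [("bird", 1), ("bird", 0), ("maybe", 0), ("maybe", 0),
           ("shore", 1), ("shore", 1), ("quiet", 0), ("quiet", 1)]
mafi_scores = {"bird": 0.62, "maybe": -0.10, "shore": -0.85, "quiet": 0.31}

iwers = word_iwers(aligned)
words = sorted(set(iwers) & set(mafi_scores))
r, p = pearsonr([mafi_scores[w] for w in words], [iwers[w] for w in words])
print(f"Pearson r = {r:.3f}, p = {p:.3f}")
```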

Occlusion tests reveal AVSR models rely differently on visual segments.

Auto-AVSR & AV-RelScore are equally affected by initial & middle occlusions, while AVEC is more impacted by middle occlusion.

Unlike humans, AVSR models do not depend on initial visual cues.

[5/8] 🧵

01.04.2025 11:17 👍 0 🔁 0 💬 1 📌 0

Speech perception research shows that visual cues at the start of a word have a stronger impact on humans than cues later in the word.

We test AVSR by occluding the initial vs. middle third of frames for each word, comparing 3 conditions: no occlusion, initial occlusion, and middle occlusion.

[4/8] 🧵

01.04.2025 11:17 👍 0 🔁 0 💬 1 📌 0
Post image
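(A rough sketch of how the occlusion conditions in [4/8] might be set up; assumed, not the paper's implementation: lip-region clips as (T, H, W) NumPy arrays, with zeroed frames standing in for occlusion.)

```python
# Sketch under assumptions: blank the initial vs. middle third of a word's
# video frames; the unmodified clip is the no-occlusion condition.
import numpy as np

def occlude_third(frames: np.ndarray, part: str) -> np.ndarray:
    """frames: (T, H, W) lip-region frames for one word.
    part: 'initial' or 'middle'. Returns a copy with that third zeroed."""
    out = frames.copy()
    t = len(frames)
    third = max(1, t // 3)
    start = 0 if part == "initial" else (t - third) // 2
    out[start:start + third] = 0.0  # zeroed frames stand in for occlusion
    return out

word_clip = np.random.rand(30, 96, 96)      # dummy 30-frame word clip
cond_initial = occlude_third(word_clip, "initial")
cond_middle = occlude_third(word_clip, "middle")
```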

First, we revisit the *effective SNR gain*: the difference between 0 dB and the SNR at which the AVSR system's WER equals the audio-only WER at 0 dB.

This metric quantifies the benefit of the visual modality in reducing WER compared to the audio-only system. [3/8] 🧵

01.04.2025 11:16 👍 0 🔁 0 💬 1 📌 0
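(To make the metric concrete: a toy calculation with made-up WER numbers, not results from the paper. Linearly interpolate the AVSR WER-vs-SNR curve to find the SNR where it matches the audio-only WER at 0 dB; the gain is the distance from 0 dB.)

```python
# Toy effective-SNR-gain calculation; all WER values are invented.
import numpy as np

snrs = np.array([-10.0, -5.0, 0.0, 5.0])        # test SNRs (dB)
wer_audio = np.array([60.0, 30.0, 12.0, 6.0])   # audio-only WER (%)
wer_avsr = np.array([35.0, 18.0, 8.0, 5.0])     # audio-visual WER (%)

ref_wer = wer_audio[snrs == 0.0][0]  # audio-only WER at 0 dB

# WER falls as SNR rises, so interpolate SNR as a function of WER
# (np.interp needs increasing x-values, hence the [::-1] reversals).
snr_at_ref = np.interp(ref_wer, wer_avsr[::-1], snrs[::-1])
print(f"Effective SNR gain = {0.0 - snr_at_ref:.1f} dB")  # 2.0 dB here
```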

In this paper, we take a step back to assess SOTA systems (all from 2023) from a different perspective by considering *human speech perception*.

Through this, we hope to gain insight into whether the visual component is being fully exploited in existing AVSR systems. [2/8] 🧵

01.04.2025 11:16 👍 0 🔁 0 💬 1 📌 0

Unfortunately, due to a visa issue, I won’t attend #ICASSP2025 in person - but I’ll share my work here! 😀

Excited to share my first ICASSP paper: ‘Uncovering the Visual Contribution in Audio-Visual Speech Recognition’ 📄

🔗 https://ieeexplore.ieee.org/abstract/document/10888423

[1/8] 🧵

01.04.2025 11:15 👍 1 🔁 0 💬 1 📌 0

My first PhD paper, “Uncovering the Visual Contribution in Audio-Visual Speech Recognition”, was accepted to ICASSP 2025 🇮🇳!
#ICASSP2025

21.12.2024 10:49 👍 3 🔁 0 💬 0 📌 0

ICASSP rebuttal submitted 🤞🤞

27.11.2024 16:49 👍 0 🔁 0 💬 0 📌 0

🙋‍♂️

27.11.2024 10:07 👍 1 🔁 0 💬 0 📌 0
Multimodal Information Based Speech Processing (MISP) 2025 Challenge

Hi speech people, super exciting news here!

We are running another "Multimodal information based speech (MISP)" Challenge at @interspeech.bsky.social

Participate!
Spread the word!

More info 👇
mispchallenge.github.io/mispchalleng...

25.11.2024 11:25 👍 15 🔁 7 💬 0 📌 0

Checking on this side to see if the sky is blue 🤔

21.11.2024 17:55 👍 0 🔁 0 💬 0 📌 0