Can a speaker recognition system learn to ignore the flaws of imperfect noise reduction?

A Framework for Robust Speaker Verification in Highly Noisy Environments Leveraging Both Noisy and Enhanced Audio

Published 8/27/2025 by Adam Katav, Yair Moshe, Israel Cohen


The Challenge of Speaker Verification in Noisy Environments

Speaker verification determines whether two audio samples come from the same person. Applications range from voice authentication for personal smart devices and call center authentication to telephone banking security and law enforcement investigations. These systems rely on speaker embeddings - high-dimensional features that capture a speaker's vocal tract characteristics and speaking style.

Deep neural networks have dominated speaker verification over the past decade due to their superior feature extraction capabilities. The evolution began with Time Delay Neural Networks (TDNN) and x-vectors, progressed through lightweight models like SpeakerNet with residual blocks and 1D separable convolutions, and advanced to sophisticated architectures like ECAPA-TDNN, which integrates computer vision insights with 1D Res2Net modules and attention mechanisms.

Real-world audio recordings face significant challenges from background noise and reverberation. While speech enhancement aims to improve intelligibility by removing noise, it creates an unexpected problem: the artifacts and distortions introduced during the enhancement process can sometimes degrade speaker verification performance even further. This challenge has intensified with generative deep neural networks that produce superior speech quality and enhance heavily contaminated audio, but may severely alter distinctive speaker characteristics.

Previous approaches to noise-robust speaker verification typically require training dedicated speech enhancement modules with tailored loss functions or learning mappings from noisy to clean embeddings. Some methods focus on separating noise and speaker characteristics or use cascaded architectures with joint optimization. However, these solutions demand significant computational resources and large datasets for training state-of-the-art modules.

A Novel Siamese Architecture Solution

The researchers propose a framework based on a key insight: speaker embeddings extracted from noisy speech and their enhanced counterparts provide complementary information. In relatively low noise conditions, noisy embeddings remain more informative because they avoid significant enhancement artifacts. Conversely, in challenging noise conditions, enhanced embeddings provide more valuable information through effective noise reduction.

Figure 1: t-SNE visualization of SpeakerNet embeddings from two VoxCeleb1 speakers with babble noise at -15 dB SNR, showing (a) noisy, (b) enhanced, and (c) proposed method results

The solution uses a Siamese neural network architecture designed to compare input pairs and assess similarity - proven effective in face recognition, image matching, and fingerprint verification. The framework processes two speech utterances through identical subnetworks with shared weights, extracting two embeddings per utterance: one from noisy audio and one from enhanced audio.

Figure 2: Proposed Siamese architecture for learning robust speaker embeddings

Each utterance's two embeddings are combined by a lightweight 3-layer Multi-Layer Perceptron (MLP). The first layer takes a 2N×1 input formed from the two N×1 embeddings, the subsequent layers each output N×1 vectors, and ReLU activations sit between the layers. The MLP learns nonlinear relationships between the input embeddings and reduces their dimensionality, producing a single robust N×1 speaker embedding.
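
As a rough illustration of this fusion step, the Python sketch below joins a noisy and an enhanced embedding and passes them through a 3-layer MLP with ReLU between layers. The layer widths (2N -> N -> N -> N), the concatenation order, the random initialization, and the names `make_fusion_mlp` and `fuse` are assumptions made for illustration. The summary does not state the first layer's output width or show the authors' implementation, and the weights here are untrained.

```python
import random


def make_layer(n_in, n_out, rng):
    """Create a dense layer as (weights, biases) with small random weights."""
    scale = (2.0 / n_in) ** 0.5  # illustrative initialization scale (assumption)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Apply an affine map: one output per weight row."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def make_fusion_mlp(n, seed=0):
    """Build three layers with assumed sizes 2N -> N -> N -> N."""
    rng = random.Random(seed)
    return [make_layer(2 * n, n, rng),
            make_layer(n, n, rng),
            make_layer(n, n, rng)]


def fuse(noisy_emb, enhanced_emb, mlp):
    """Join the two N-dim embeddings and map them to one N-dim embedding."""
    x = list(noisy_emb) + list(enhanced_emb)  # 2N-dimensional input
    for i, layer in enumerate(mlp):
        x = dense(x, layer)
        if i < len(mlp) - 1:  # ReLU between layers, not after the last one
            x = relu(x)
    return x


if __name__ == "__main__":
    n = 4
    mlp = make_fusion_mlp(n)
    fused = fuse([0.1, 0.2, 0.3, 0.4], [0.0, 0.1, 0.2, 0.3], mlp)
    print(len(fused))  # one fused embedding of dimension N
```

In a Siamese setup the same `mlp` would be applied to both utterances being compared, which is what sharing weights between the two branches means.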

The framework is trained with a triplet loss using cosine distance, which teaches it a distance metric that separates similar examples from dissimilar ones. Cosine distance is invariant to embedding magnitude, a property desired when comparing speaker embeddings. The system can adaptively adjust the contribution of each embedding based on the perceived noise level, improving verification performance across diverse noise scenarios.
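
To make the training objective concrete, here is a minimal Python sketch of a triplet loss with cosine distance. It assumes the common definitions, cosine distance as one minus cosine similarity and a hinge on the distance gap, which the summary does not spell out. The function names are illustrative, and the default margin of 0.25 is the value reported in the experimental setup.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; unchanged if either vector is rescaled."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.25):
    """Hinge loss: zero once the negative is farther than the positive by `margin`."""
    d_pos = cosine_distance(anchor, positive)  # same speaker as the anchor
    d_neg = cosine_distance(anchor, negative)  # different speaker
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    same_speaker = [2.0, 0.0]   # same direction, different magnitude
    other_speaker = [0.0, 1.0]  # orthogonal direction
    print(triplet_loss(anchor, same_speaker, other_speaker))
    print(triplet_loss(anchor, other_speaker, same_speaker))
```

The first call yields zero loss because the triplet is already well separated. The second swaps the roles and is penalized by the distance gap plus the margin.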

A major advantage is the framework's flexibility: it works with any speech enhancement and speaker embedding method, so state-of-the-art techniques can be integrated without modification. Because these models are used as-is, the enhancement and embedding modules need no task-specific training, which keeps the approach computationally efficient.

Experimental Validation and Performance Analysis

The researchers trained and evaluated their framework using the VoxCeleb1 dataset containing celebrity utterances from YouTube videos. The training set includes 148,642 utterances from 1,211 speakers, while the test set contains 4,874 utterances from 40 speakers.

To simulate real-world conditions, they augmented both sets with MUSAN corpus recordings featuring three noise categories: 6 hours of general noise (DTMF tones, thunder, footsteps, paper rustling, animal noises), 42 hours of music, and 60 hours of speech babble. Training utterances used random signal-to-noise ratios (SNR) between 0 and -20 dB, while testing occurred at specific SNR levels of {0, -5, -10, -15, -20} dB - lower than typically found in previous literature.
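
The noise augmentation described above amounts to scaling a noise recording so that the speech-to-noise power ratio hits a target SNR before the two are added. The following Python sketch shows one standard way to do this. The function names and the use of plain lists of samples are assumptions for illustration, and the authors' actual augmentation code is not shown in this summary.

```python
import math


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0.0:
        raise ValueError("noise signal is silent")
    target_noise_power = power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


if __name__ == "__main__":
    speech = [1.0, -1.0, 1.0, -1.0]
    noise = [0.5, 0.5, -0.5, -0.5]
    mixed = mix_at_snr(speech, noise, snr_db=-10.0)
    added = [m - s for m, s in zip(mixed, speech)]
    # Measured SNR of the mixture, which should be close to -10 dB
    print(10.0 * math.log10(power(speech) / power(added)))
```

At -10 dB the added noise carries ten times the power of the speech, and at -20 dB one hundred times, which is why the tested range is so demanding.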

The lightweight model trained for just 10 minutes on an NVIDIA RTX 3090 using the AdamW optimizer with a batch size of 32, a learning rate of 10⁻³, and a triplet margin of 0.25. This is far less computation than training a dedicated enhancement or verification module.

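| Type | SNR (dB) | SpeakerNet Noisy | SpeakerNet Enhanced | SpeakerNet Ours | ECAPA-TDNN Noisy | ECAPA-TDNN Enhanced | ECAPA-TDNN Ours |
|---|---|---|---|---|---|---|---|
| Noise | 0 | 9.70 | 13.45 | 13.17 | 3.31 | 7.68 | 9.43 |
| Noise | -5 | 16.39 | 18.18 | 15.67 | 5.50 | 12.22 | 10.76 |
| Noise | -10 | 26.19 | 25.12 | 19.31 | 12.53 | 19.08 | 14.74 |
| Noise | -15 | 34.71 | 32.77 | 25.21 | 23.23 | 26.91 | 21.48 |
| Noise | -20 | 41.66 | 39.69 | 33.00 | 31.86 | 34.27 | 29.90 |
| Music | 0 | 12.42 | 14.62 | 15.33 | 4.96 | 8.31 | 12.19 |
| Music | -5 | 23.73 | 22.80 | 19.48 | 13.61 | 16.28 | 16.37 |
| Music | -10 | 36.82 | 33.72 | 27.37 | 28.41 | 27.38 | 24.49 |
| Music | -15 | 44.46 | 41.86 | 36.37 | 41.11 | 38.09 | 34.96 |
| Music | -20 | 48.69 | 47.39 | 44.05 | 47.90 | 45.77 | 43.88 |
| Babble | 0 | 20.24 | 24.79 | 21.89 | 19.13 | 23.16 | 24.04 |
| Babble | -5 | 32.64 | 36.59 | 26.82 | 34.89 | 37.63 | 31.43 |
| Babble | -10 | 43.85 | 44.77 | 32.30 | 45.15 | 45.50 | 38.47 |
| Babble | -15 | 46.72 | 47.48 | 37.21 | 48.27 | 47.78 | 42.18 |
| Babble | -20 | 48.31 | 48.38 | 41.73 | 48.77 | 48.62 | 45.38 |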

Table 1: Speaker verification results comparing the proposed method with noisy and DeepFilterNet3-enhanced signals, using SpeakerNet and ECAPA-TDNN embeddings across MUSAN noise types. Values are Equal Error Rates (EER), where lower is better; the original table marks the best results in bold.
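
For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The Python sketch below approximates it by sweeping a threshold over the observed trial scores. It assumes higher scores mean "more likely the same speaker" and uses made-up scores, and the function name is illustrative. The summary does not say how the authors computed EER, so this is only one common approach.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap = None
    eer = None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2.0
    return eer


if __name__ == "__main__":
    same_speaker = [0.9, 0.8, 0.7, 0.3]       # hypothetical similarity scores
    different_speaker = [0.1, 0.2, 0.4, 0.6]  # hypothetical similarity scores
    print(equal_error_rate(same_speaker, different_speaker))
```

An EER near 50 means the system is close to guessing, which puts the values at -20 dB SNR in perspective.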

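The results show that ECAPA-TDNN is generally more robust to noise than SpeakerNet: it achieves lower EER for general noise and music at every tested SNR, although under babble noise at -5 dB and below SpeakerNet matches or outperforms it. At SNR = 0 dB, the noisy embeddings give the best verification performance for both models and all three noise types. At lower SNRs, enhanced embeddings occasionally outperform noisy ones, and the proposed method gives the lowest EER in most conditions. With SpeakerNet it is best at every SNR of -5 dB and below. With ECAPA-TDNN it is best in all such conditions except general noise at -5 and -10 dB and music at -5 dB, where the noisy embeddings remain best.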

The framework shows particular strength in highly degraded environments where conventional methods fail, validating the complementary information approach across diverse noise types including general noise, music, and speech babble.

Implications and Future Directions

This research presents a practical solution for robust speaker verification in challenging acoustic environments. By using pre-trained speech enhancement and speaker verification models as-is, the framework avoids task-specific training of those models and needs only its lightweight model to be trained, making it both practical and computationally efficient.

The Siamese architecture's ability to combine complementary information from noisy and enhanced embeddings addresses a fundamental limitation in current approaches. Rather than treating speech enhancement as a prerequisite that may distort speaker characteristics, the framework leverages both sources strategically based on noise conditions.

The framework's inherent flexibility stands as a key advantage - it remains agnostic to specific speech enhancement and speaker verification techniques, enabling seamless integration with future technological advancements. This adaptability ensures longevity and practical deployment across various applications requiring reliable speaker verification in noisy environments.

Performance improvements prove most significant in highly degraded environments where conventional speaker verification methods struggle. This capability opens applications in challenging real-world scenarios including outdoor authentication systems, industrial environments, and emergency response situations where traditional methods fail.

Acknowledgment

The researchers acknowledge Ram Binshtock, Sahar Zeltzer, and David Portal for their early-stage contributions as part of the Signal Processing Cup 2024 Challenge. The research received support from the Israel Science Foundation (grant no. 1449/23) and the Pazy Research Foundation, enabling this advancement in robust speaker verification technology.
