results: The study finds that models trained on compression-based audio tokens perform competitively on the three tasks, approaching but not yet surpassing mel-spectrogram features, and remain stable across different speakers and languages. In addition, audio tokenization enables data compression of up to 20x without loss of performance.
Abstract
Discrete audio representation, aka audio tokenization, has seen renewed interest driven by its potential to facilitate the application of text language modeling approaches in audio domain. To this end, various compression and representation-learning based tokenization schemes have been proposed. However, there is limited investigation into the performance of compression-based audio tokens compared to well-established mel-spectrogram features across various speaker and speech related tasks. In this paper, we evaluate compression based audio tokens on three tasks: Speaker Verification, Diarization and (Multi-lingual) Speech Recognition. Our findings indicate that (i) the models trained on audio tokens perform competitively, on average within $1\%$ of mel-spectrogram features for all the tasks considered, and do not surpass them yet. (ii) these models exhibit robustness for out-of-domain narrowband data, particularly in speaker tasks. (iii) audio tokens allow for compression to 20x compared to mel-spectrogram features with minimal loss of performance in speech and speaker related tasks, which is crucial for low bit-rate applications, and (iv) the examined Residual Vector Quantization (RVQ) based audio tokenizer exhibits a low-pass frequency response characteristic, offering a plausible explanation for the observed results, and providing insight for future tokenizer designs.
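As a rough illustration of the residual vector quantization (RVQ) scheme behind the examined tokenizer, the sketch below shows how each quantizer stage encodes the residual left by the previous stage, so every frame becomes a short stack of codebook indices. The codebook count, size, and frame dimension are illustrative assumptions, not the configuration evaluated in the paper.

```python
import numpy as np

def rvq_encode(frames, codebooks):
    """Residual vector quantization: each stage quantizes the residual left by
    the previous stage and emits one codebook index per frame.

    frames:    (T, D) array of continuous feature frames
    codebooks: list of (K, D) arrays, one per quantizer stage
    returns:   (T, num_stages) integer token array
    """
    residual = frames.copy()
    tokens = []
    for cb in codebooks:
        # nearest codeword per frame (squared Euclidean distance)
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)
        tokens.append(idx)
        residual = residual - cb[idx]          # pass the residual to the next stage
    return np.stack(tokens, axis=1)

def rvq_decode(tokens, codebooks):
    """Sum the selected codewords from every stage to reconstruct the frames."""
    recon = np.zeros((tokens.shape[0], codebooks[0].shape[1]))
    for stage, cb in enumerate(codebooks):
        recon += cb[tokens[:, stage]]
    return recon

# toy example (illustrative sizes): 8 stages of 1024-entry codebooks, 128-dim frames
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(1024, 128)) for _ in range(8)]
frames = rng.normal(size=(100, 128))
tokens = rvq_encode(frames, codebooks)         # (100, 8) discrete tokens per frame
recon = rvq_decode(tokens, codebooks)
```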
USED: Universal Speaker Extraction and Diarization
paper_authors: Junyi Ao, Mehmet Sinan Yıldırım, Meng Ge, Shuai Wang, Ruijie Tao, Yanmin Qian, Liqun Deng, Longshuai Xiao, Haizhou Li
for: This paper proposes a unified framework, Universal Speaker Extraction and Diarization (USED), for extracting the waveforms of all speakers simultaneously.
methods: The paper extends an existing speaker extraction model and adds a scenario-aware differentiated loss function to handle the sparsely overlapped speech found in real-world conversations.
results: According to the paper's results, the USED model clearly outperforms the baselines in both highly overlapped and sparsely overlapped scenarios, while jointly producing high-quality speaker extraction and diarization results.
Abstract
Speaker extraction and diarization are two crucial enabling techniques for speech applications. Speaker extraction aims to extract a target speaker's voice from a multi-talk mixture, while speaker diarization demarcates speech segments by speaker, identifying `who spoke when'. The previous studies have typically treated the two tasks independently. However, the two tasks share a similar objective, that is to disentangle the speakers in the spectral domain for the former but in the temporal domain for the latter. It is logical to believe that the speaker turns obtained from speaker diarization can benefit speaker extraction, while the extracted speech offers more accurate speaker turns than the mixture speech. In this paper, we propose a unified framework called Universal Speaker Extraction and Diarization (USED). We extend the existing speaker extraction model to simultaneously extract the waveforms of all speakers. We also employ a scenario-aware differentiated loss function to address the problem of sparsely overlapped speech in real-world conversations. We show that the USED model significantly outperforms the baselines for both speaker extraction and diarization tasks, in both highly overlapped and sparsely overlapped scenarios. Audio samples are available at https://ajyy.github.io/demo/USED/.
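One way to read the "scenario-aware differentiated loss" is sketched below: chunks where a speaker is active are scored with a reconstruction term (here SI-SDR), while chunks where that speaker is silent are penalized for emitting any energy. The specific loss terms and weights are illustrative assumptions, not the exact loss used in USED.

```python
import torch

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR between estimate and reference, both (B, T)."""
    ref = ref - ref.mean(dim=-1, keepdim=True)
    est = est - est.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    si_sdr = 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -si_sdr.mean()

def scenario_aware_loss(est, ref, active, inactive_weight=0.5):
    """
    est, ref: (B, T) estimated / reference waveforms for one speaker
    active:   (B,) bool mask, True if this speaker talks in the chunk
    Active chunks use SI-SDR; inactive chunks are pushed toward silence.
    """
    loss = torch.zeros((), device=est.device)
    if active.any():
        loss = loss + si_sdr_loss(est[active], ref[active])
    if (~active).any():
        # energy penalty for chunks where the speaker should stay silent
        loss = loss + inactive_weight * est[~active].pow(2).mean()
    return loss
```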
An Active Noise Control System Based on Soundfield Interpolation Using a Physics-informed Neural Network
results: In simulations, the PINN-assisted ANC system reduces noise within the ROI more than the multiple-point ANC system.
Abstract
Conventional multiple-point active noise control (ANC) systems require placing error microphones within the region of interest (ROI), inconveniencing users. This paper designs a feasible monitoring microphone arrangement placed outside the ROI, providing a user with more freedom of movement. The soundfield within the ROI is interpolated from the microphone signals using a physics-informed neural network (PINN). PINN exploits the acoustic wave equation to assist soundfield interpolation under a limited number of monitoring microphones, and demonstrates better interpolation performance than the spherical harmonic method in simulations. An ANC system is designed to take advantage of the interpolated signal to reduce noise signal within the ROI. The PINN-assisted ANC system reduces noise more than that of the multiple-point ANC system in simulations.
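The PINN interpolation idea can be sketched as a network mapping space-time coordinates (x, y, z, t) to sound pressure, trained with a data term at the monitoring microphones plus a physics residual that penalizes violations of the acoustic wave equation inside the ROI. The network size, sampling, and loss weighting below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

C = 343.0  # speed of sound in air (m/s)

class PressureNet(nn.Module):
    """Maps space-time coordinates (x, y, z, t) to scalar sound pressure p."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyzt):
        return self.net(xyzt)

def wave_equation_residual(model, xyzt):
    """Residual of the acoustic wave equation: laplacian(p) - (1/c^2) * p_tt."""
    xyzt = xyzt.clone().requires_grad_(True)
    p = model(xyzt)
    grads = torch.autograd.grad(p.sum(), xyzt, create_graph=True)[0]
    second = []
    for i in range(4):  # second derivatives w.r.t. x, y, z, t
        g2 = torch.autograd.grad(grads[:, i].sum(), xyzt, create_graph=True)[0][:, i]
        second.append(g2)
    laplacian = second[0] + second[1] + second[2]
    p_tt = second[3]
    return laplacian - p_tt / C**2

def pinn_loss(model, mic_xyzt, mic_pressure, roi_xyzt, lam=1.0):
    # data fit at the (outside-ROI) monitoring mics + physics residual sampled in the ROI
    data = (model(mic_xyzt).squeeze(-1) - mic_pressure).pow(2).mean()
    phys = wave_equation_residual(model, roi_xyzt).pow(2).mean()
    return data + lam * phys
```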
Bridging the Spoof Gap: A Unified Parallel Aggregation Network for Voice Presentation Attacks
results: Evaluation on the ASVspoof-2019 and VSDC datasets demonstrates the effectiveness of the proposed system, which shows smaller EER disparities and stronger spoofing-attack detection than existing solutions, indicating the generalizability and superiority of the approach. In voice-based security systems, the proposed unified spoofing detection system provides a reliable defense mechanism for ASVs and user data, helping to protect user identities and data.
Abstract
Automatic Speaker Verification (ASV) systems are increasingly used in voice bio-metrics for user authentication but are susceptible to logical and physical spoofing attacks, posing security risks. Existing research mainly tackles logical or physical attacks separately, leading to a gap in unified spoofing detection. Moreover, when existing systems attempt to handle both types of attacks, they often exhibit significant disparities in the Equal Error Rate (EER). To bridge this gap, we present a Parallel Stacked Aggregation Network that processes raw audio. Our approach employs a split-transform-aggregation technique, dividing utterances into convolved representations, applying transformations, and aggregating the results to identify logical (LA) and physical (PA) spoofing attacks. Evaluation of the ASVspoof-2019 and VSDC datasets shows the effectiveness of the proposed system. It outperforms state-of-the-art solutions, displaying reduced EER disparities and superior performance in detecting spoofing attacks. This highlights the proposed method's generalizability and superiority. In a world increasingly reliant on voice-based security, our unified spoofing detection system provides a robust defense against a spectrum of voice spoofing attacks, safeguarding ASVs and user data effectively.
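The split-transform-aggregation idea on raw audio can be sketched as a ResNeXt-style block: the convolved representation is split into several low-dimensional branches, each branch applies its own transform, and the branch outputs are aggregated with a residual connection. The front-end stride, channel counts, cardinality, and kernel sizes are assumptions for illustration, not the paper's exact network.

```python
import torch
import torch.nn as nn

class SplitTransformAggregate(nn.Module):
    """ResNeXt-style block on a 1-D (audio) feature map: split into `cardinality`
    branches, transform each branch independently, then aggregate by summation."""
    def __init__(self, channels=64, cardinality=8, branch_width=8, kernel=3):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(channels, branch_width, 1),                              # split / reduce
                nn.BatchNorm1d(branch_width), nn.ReLU(),
                nn.Conv1d(branch_width, branch_width, kernel, padding=kernel // 2),  # transform
                nn.BatchNorm1d(branch_width), nn.ReLU(),
                nn.Conv1d(branch_width, channels, 1),                              # project back
            )
            for _ in range(cardinality)
        ])
        self.relu = nn.ReLU()

    def forward(self, x):                                    # x: (B, channels, T)
        aggregated = sum(branch(x) for branch in self.branches)  # aggregate branch outputs
        return self.relu(x + aggregated)                     # residual connection

# front end: a strided Conv1d turns the raw waveform into a convolved representation
frontend = nn.Conv1d(1, 64, kernel_size=251, stride=160, padding=125)
block = SplitTransformAggregate()
wave = torch.randn(2, 1, 16000)                              # two 1-second utterances at 16 kHz
features = block(frontend(wave))                             # (2, 64, 100)
```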
FoleyGen: Visually-Guided Audio Generation
methods: The system uses a single Transformer model, conditioned on visual features extracted by a visual encoder, to generate audio tokens, together with an off-the-shelf neural audio codec for bidirectional conversion between waveforms and discrete tokens.
results: Experimental results show that the proposed FoleyGen system outperforms previous systems on the VGGSound dataset in both objective metrics and human evaluations.
Abstract
Recent advancements in audio generation have been spurred by the evolution of large-scale deep learning models and expansive datasets. However, the task of video-to-audio (V2A) generation continues to be a challenge, principally because of the intricate relationship between the high-dimensional visual and auditory data, and the challenges associated with temporal synchronization. In this study, we introduce FoleyGen, an open-domain V2A generation system built on a language modeling paradigm. FoleyGen leverages an off-the-shelf neural audio codec for bidirectional conversion between waveforms and discrete tokens. The generation of audio tokens is facilitated by a single Transformer model, which is conditioned on visual features extracted from a visual encoder. A prevalent problem in V2A generation is the misalignment of generated audio with the visible actions in the video. To address this, we explore three novel visual attention mechanisms. We further undertake an exhaustive evaluation of multiple visual encoders, each pretrained on either single-modal or multi-modal tasks. The experimental results on VGGSound dataset show that our proposed FoleyGen outperforms previous systems across all objective metrics and human evaluations.
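A minimal sketch of this language-modeling paradigm: features from a visual encoder are projected and prepended as a prefix, and a causal Transformer predicts the discrete audio-codec tokens that the neural codec would later decode back into a waveform. The dimensions, vocabulary size, and prefix-style conditioning below are illustrative assumptions, not FoleyGen's exact architecture or attention mechanisms.

```python
import torch
import torch.nn as nn

class VideoToAudioTokenLM(nn.Module):
    """Prefix-conditioned causal LM: visual features condition the autoregressive
    prediction of discrete audio-codec tokens."""
    def __init__(self, vocab=1024, d_model=512, n_layers=6, n_heads=8, visual_dim=768):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, d_model)    # map visual-encoder output
        self.token_emb = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # decoder-only LM via causal mask
        self.head = nn.Linear(d_model, vocab)

    def forward(self, visual_feats, audio_tokens):
        # visual_feats: (B, Tv, visual_dim), audio_tokens: (B, Ta) int64
        prefix = self.visual_proj(visual_feats)              # (B, Tv, d_model)
        tok = self.token_emb(audio_tokens)                   # (B, Ta, d_model)
        seq = torch.cat([prefix, tok], dim=1)                # visual prefix + audio tokens
        L = seq.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        hidden = self.backbone(seq, mask=causal)
        # logits over audio positions; for training, targets are the tokens shifted by one
        return self.head(hidden[:, visual_feats.size(1):])

model = VideoToAudioTokenLM()
logits = model(torch.randn(2, 30, 768), torch.randint(0, 1024, (2, 200)))  # (2, 200, 1024)
```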
Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement
paper_authors: Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling
for: Improving the quality and intelligibility of degraded speech by incorporating ultrasound tongue image information into audio-visual speech enhancement.
methods: Knowledge distillation is used during training to exploit tongue-related information without directly inputting ultrasound tongue images, and a lip-tongue key-value memory network is introduced to model the alignment between the tongue and lip modalities.
results: Experimental results show that both proposed methods significantly improve the quality and intelligibility of the enhanced speech and generalize well to unseen speakers and unseen noises. In addition, phone error rate (PER) analysis of automatic speech recognition (ASR) shows that all phonemes benefit from introducing ultrasound tongue images, with palatal and velar consonants benefiting most.
Abstract
Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech along with extra visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes the incorporation of ultrasound tongue images to improve the performance of lip-based AV-SE systems further. To address the challenge of acquiring ultrasound tongue images during inference, we first propose to employ knowledge distillation during training to investigate the feasibility of leveraging tongue-related information without directly inputting ultrasound tongue images. Specifically, we guide an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model, thus transferring tongue-related knowledge. To better model the alignment between the lip and tongue modalities, we further propose the introduction of a lip-tongue key-value memory network into the AV-SE model. This network enables the retrieval of tongue features based on readily available lip features, thereby assisting the subsequent speech enhancement task. Experimental results demonstrate that both methods significantly improve the quality and intelligibility of the enhanced speech compared to traditional lip-based AV-SE baselines. Moreover, both proposed methods exhibit strong generalization performance on unseen speakers and in the presence of unseen noises. Furthermore, phone error rate (PER) analysis of automatic speech recognition (ASR) reveals that while all phonemes benefit from introducing ultrasound tongue images, palatal and velar consonants benefit most.
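The lip-tongue key-value memory can be sketched as an attention lookup: lip features act as the query, a learned memory holds lip-side keys paired with tongue-side values, and the retrieved value stands in for the unavailable ultrasound stream at inference time. Memory size and feature dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LipTongueMemory(nn.Module):
    """Key-value memory: lip features query lip-side keys and retrieve the paired
    tongue-side values, so tongue information can be used without ultrasound input."""
    def __init__(self, n_slots=256, lip_dim=256, tongue_dim=256):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(n_slots, lip_dim))       # lip-side keys
        self.values = nn.Parameter(torch.randn(n_slots, tongue_dim))  # tongue-side values
        self.query_proj = nn.Linear(lip_dim, lip_dim)

    def forward(self, lip_feats):
        # lip_feats: (B, T, lip_dim) -> retrieved tongue features: (B, T, tongue_dim)
        q = self.query_proj(lip_feats)
        attn = F.softmax(q @ self.keys.t() / self.keys.size(-1) ** 0.5, dim=-1)
        return attn @ self.values

memory = LipTongueMemory()
lip = torch.randn(4, 100, 256)   # 4 clips, 100 video frames of lip features
tongue_like = memory(lip)        # (4, 100, 256) retrieved tongue features
```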
Efficient Multi-Channel Speech Enhancement with Spherical Harmonics Injection for Directional Encoding
results: On the TIMIT dataset, the model performs well under varying noise and reverberation and outperforms established benchmarks, while requiring fewer computations and parameters.
Abstract
Multi-channel speech enhancement extracts speech using multiple microphones that capture spatial cues. Effectively utilizing directional information is key for multi-channel enhancement. Deep learning shows great potential on multi-channel speech enhancement and often takes short-time Fourier Transform (STFT) as inputs directly. To fully leverage the spatial information, we introduce a method using spherical harmonics transform (SHT) coefficients as auxiliary model inputs. These coefficients concisely represent spatial distributions. Specifically, our model has two encoders, one for the STFT and another for the SHT. By fusing both encoders in the decoder to estimate the enhanced STFT, we effectively incorporate spatial context. Evaluations on TIMIT under varying noise and reverberation show our model outperforms established benchmarks. Remarkably, this is achieved with fewer computations and parameters. By leveraging spherical harmonics to incorporate directional cues, our model efficiently improves the performance of the multi-channel speech enhancement.
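One common way to obtain SHT coefficients as auxiliary inputs is to project the multi-channel signals onto spherical harmonics evaluated at the microphone directions via a least-squares (pseudo-inverse) fit. The array geometry, harmonic order, and the least-squares projection below are illustrative assumptions, not necessarily the paper's procedure.

```python
import numpy as np
from scipy.special import sph_harm

def sh_matrix(order, azimuth, colatitude):
    """Complex spherical harmonics Y_n^m at the mic directions.
    Rows: microphones, columns: (n, m) pairs -> shape (M, (order+1)**2)."""
    cols = []
    for n in range(order + 1):
        for m in range(-n, n + 1):
            # scipy convention: sph_harm(m, n, azimuth, colatitude)
            cols.append(sph_harm(m, n, azimuth, colatitude))
    return np.stack(cols, axis=1)

def sht_coefficients(mic_stft, azimuth, colatitude, order=2):
    """Least-squares spherical harmonic decomposition of multi-channel STFT frames.

    mic_stft: (M, F, T) complex STFT of the M microphone signals
    returns:  ((order+1)**2, F, T) spherical harmonic coefficients per TF bin
    """
    Y = sh_matrix(order, azimuth, colatitude)   # (M, (order+1)**2)
    Y_pinv = np.linalg.pinv(Y)                  # ((order+1)**2, M)
    M_, F_, T_ = mic_stft.shape
    return (Y_pinv @ mic_stft.reshape(M_, -1)).reshape(-1, F_, T_)

# toy example: 8 mics with random directions (illustrative), order-2 SHT -> 9 coefficients
rng = np.random.default_rng(0)
az = rng.uniform(0, 2 * np.pi, 8)
col = rng.uniform(0, np.pi, 8)
stft = rng.normal(size=(8, 257, 100)) + 1j * rng.normal(size=(8, 257, 100))
coeffs = sht_coefficients(stft, az, col, order=2)   # (9, 257, 100)
```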
Hierarchical Modeling of Spatial Cues via Spherical Harmonics for Multi-Channel Speech Enhancement
results: On the TIMIT dataset, the proposed method recovers the target spatial patterns more effectively and improves performance over baseline models while using fewer parameters and computations.
Abstract
Multi-channel speech enhancement utilizes spatial information from multiple microphones to extract the target speech. However, most existing methods do not explicitly model spatial cues, instead relying on implicit learning from multi-channel spectra. To better leverage spatial information, we propose explicitly incorporating spatial modeling by applying spherical harmonic transforms (SHT) to the multi-channel input. In detail, a hierarchical framework is introduced whereby lower order harmonics capturing broader spatial patterns are estimated first, then combined with higher orders to recursively predict finer spatial details. Experiments on TIMIT demonstrate the proposed method can effectively recover target spatial patterns and achieve improved performance over baseline models, using fewer parameters and computations. Explicitly modeling spatial information hierarchically enables more effective multi-channel speech enhancement.
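The hierarchical, coarse-to-fine idea can be sketched as a chain of small networks: order-0 coefficients are estimated first, and each subsequent stage receives the input features together with all lower-order estimates before predicting the next order's (2n + 1) coefficients. Layer sizes and the concatenation-based conditioning are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalSHEstimator(nn.Module):
    """Coarse-to-fine estimation of spherical harmonic coefficients: order n has
    (2n + 1) coefficients and is predicted conditioned on the input features plus
    all previously estimated lower orders."""
    def __init__(self, feat_dim=256, max_order=3, hidden=128):
        super().__init__()
        self.stages = nn.ModuleList()
        prev_coeffs = 0
        for n in range(max_order + 1):
            self.stages.append(nn.Sequential(
                nn.Linear(feat_dim + prev_coeffs, hidden), nn.ReLU(),
                nn.Linear(hidden, 2 * n + 1),
            ))
            prev_coeffs += 2 * n + 1

    def forward(self, feats):                      # feats: (B, feat_dim)
        outputs = []
        context = feats
        for stage in self.stages:
            coeffs_n = stage(context)              # predict order-n coefficients
            outputs.append(coeffs_n)
            context = torch.cat([context, coeffs_n], dim=-1)  # condition the next order
        return torch.cat(outputs, dim=-1)          # (B, (max_order+1)**2)

model = HierarchicalSHEstimator()
sh = model(torch.randn(4, 256))                    # (4, 16) coefficients up to order 3
```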
PDPCRN: Parallel Dual-Path CRN with Bi-directional Inter-Branch Interactions for Multi-Channel Speech Enhancement
results: Experiments on the TIMIT dataset validate that the PDPCRN model not only excels on the PESQ and STOI metrics but also has a smaller computational burden and fewer parameters.
Abstract
Multi-channel speech enhancement seeks to utilize spatial information to distinguish target speech from interfering signals. While deep learning approaches like the dual-path convolutional recurrent network (DPCRN) have made strides, challenges persist in effectively modeling inter-channel correlations and amalgamating multi-level information. In response, we introduce the Parallel Dual-Path Convolutional Recurrent Network (PDPCRN). This acoustic modeling architecture has two key innovations. First, a parallel design with separate branches extracts complementary features. Second, bi-directional modules enable cross-branch communication. Together, these facilitate diverse representation fusion and enhanced modeling. Experimental validation on TIMIT datasets underscores the prowess of PDPCRN. Notably, against baseline models like the standard DPCRN, PDPCRN not only outperforms in PESQ and STOI metrics but also boasts a leaner computational footprint with reduced parameters.
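The two innovations named above, parallel branches and bi-directional inter-branch communication, can be sketched as two feature streams that each receive a gated projection of the other branch's features in every block. The convolutional branch contents and the gating form are illustrative assumptions, not the exact PDPCRN layers.

```python
import torch
import torch.nn as nn

class BiDirectionalInteraction(nn.Module):
    """Exchanges information between two parallel branches: each branch adds a
    gated projection of the other branch's features to its own."""
    def __init__(self, channels):
        super().__init__()
        self.a_from_b = nn.Conv2d(channels, channels, 1)
        self.b_from_a = nn.Conv2d(channels, channels, 1)

    def forward(self, a, b):                                  # a, b: (B, C, F, T)
        a_out = a + torch.sigmoid(self.a_from_b(b)) * b       # b -> a
        b_out = b + torch.sigmoid(self.b_from_a(a)) * a       # a -> b
        return a_out, b_out

class ParallelDualBranchBlock(nn.Module):
    """One block of a parallel two-branch encoder with inter-branch interaction."""
    def __init__(self, channels=16):
        super().__init__()
        def conv():
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels), nn.PReLU())
        self.branch_a, self.branch_b = conv(), conv()
        self.interact = BiDirectionalInteraction(channels)

    def forward(self, a, b):
        a, b = self.branch_a(a), self.branch_b(b)             # complementary transforms
        return self.interact(a, b)                            # cross-branch communication

block = ParallelDualBranchBlock()
spec_a = torch.randn(2, 16, 257, 100)   # e.g. features from one input path
spec_b = torch.randn(2, 16, 257, 100)   # e.g. features from a complementary path
a, b = block(spec_a, spec_b)
```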