cs.SD - 2023-09-21

Profile-Error-Tolerant Target-Speaker Voice Activity Detection

  • paper_url: http://arxiv.org/abs/2309.12521
  • repo_url: None
  • paper_authors: Dongmei Wang, Xiong Xiao, Naoyuki Kanda, Midia Yousefi, Takuya Yoshioka, Jian Wu
  • for: To improve the robustness and reliability of the TS-VAD method, making it tolerant to errors in the speaker profiles.
  • methods: A transformer-based TS-VAD that handles a variable number of speakers, extended with a set of additional pseudo-speaker profiles to cover speakers missed during the first-pass diarization. During training, speaker profiles are estimated with multiple different clustering algorithms to reduce the mismatch between training and testing conditions.
  • results: Experiments on the VoxConverse and DIHARD-I datasets show that PET-TSVAD is more robust to speaker profile errors and consistently outperforms the existing TS-VAD method.
    Abstract Target-Speaker Voice Activity Detection (TS-VAD) utilizes a set of speaker profiles alongside an input audio signal to perform speaker diarization. While its superiority over conventional methods has been demonstrated, the method can suffer from errors in speaker profiles, as those profiles are typically obtained by running a traditional clustering-based diarization method over the input signal. This paper proposes an extension to TS-VAD, called Profile-Error-Tolerant TS-VAD (PET-TSVAD), which is robust to such speaker profile errors. This is achieved by employing transformer-based TS-VAD that can handle a variable number of speakers and further introducing a set of additional pseudo-speaker profiles to handle speakers undetected during the first pass diarization. During training, we use speaker profiles estimated by multiple different clustering algorithms to reduce the mismatch between the training and testing conditions regarding speaker profiles. Experimental results show that PET-TSVAD consistently outperforms the existing TS-VAD method on both the VoxConverse and DIHARD-I datasets.
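A minimal PyTorch sketch of the pseudo-speaker-profile idea: learnable pseudo-profiles are appended to the first-pass clustering profiles before a small transformer predicts per-speaker voice activity. The module layout, dimensions, and the additive speaker/frame combination are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PseudoProfileTSVAD(nn.Module):
    """Toy transformer TS-VAD that appends learnable pseudo-speaker profiles
    (hypothetical design) to handle speakers missed by first-pass clustering."""
    def __init__(self, feat_dim=80, prof_dim=256, d_model=256,
                 n_pseudo=4, n_layers=2):
        super().__init__()
        self.pseudo_profiles = nn.Parameter(torch.randn(n_pseudo, prof_dim))
        self.feat_proj = nn.Linear(feat_dim, d_model)
        self.prof_proj = nn.Linear(prof_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.vad_head = nn.Linear(d_model, 1)

    def forward(self, feats, profiles):
        # feats: (B, T, feat_dim); profiles: (B, S, prof_dim) from clustering
        B = feats.size(0)
        pseudo = self.pseudo_profiles.unsqueeze(0).expand(B, -1, -1)
        all_prof = torch.cat([profiles, pseudo], dim=1)   # (B, S+P, prof_dim)
        frames = self.feat_proj(feats)                    # (B, T, d_model)
        spk = self.prof_proj(all_prof)                    # (B, S+P, d_model)
        # combine each speaker profile with every frame (real systems
        # interleave speaker and temporal attention instead)
        x = frames.unsqueeze(1) + spk.unsqueeze(2)        # (B, S+P, T, d)
        x = self.encoder(x.flatten(0, 1))                 # (B*(S+P), T, d)
        probs = torch.sigmoid(self.vad_head(x)).squeeze(-1)
        return probs.view(B, -1, feats.size(1))           # (B, S+P, T)

model = PseudoProfileTSVAD()
probs = model(torch.randn(2, 100, 80), torch.randn(2, 3, 256))
print(probs.shape)  # torch.Size([2, 7, 100])
```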

Variational Quantum Harmonizer: Generating Chord Progressions and Other Sonification Methods with the VQE Algorithm

  • paper_url: http://arxiv.org/abs/2309.12254
  • repo_url: None
  • paper_authors: Paulo Vitor Itaboraí, Tim Schwägerl, María Aguado Yáñez, Arianna Crippa, Karl Jansen, Eduardo Reck Miranda, Peter Thomas
  • for: This work investigates physics-based sonification of Quadratic Unconstrained Binary Optimization (QUBO) problems optimized with the Variational Quantum Eigensolver (VQE) algorithm.
  • methods: The VQE approximates the QUBO solution through an iterative loop between a quantum computer and a classical optimizer; the intermediate statevectors produced at each iteration are used as the material for sonification.
  • results: A musical interface prototype named the Variational Quantum Harmonizer (VQH) was implemented, which can enhance data visualization or be used to create artistic pieces. It also helps artists build intuition for designing QUBO cost functions and supplies a broad portfolio of sounds for QUBO and quantum-inspired compositions.
    Abstract This work investigates a case study of using physical-based sonification of Quadratic Unconstrained Binary Optimization (QUBO) problems, optimized by the Variational Quantum Eigensolver (VQE) algorithm. The VQE approximates the solution of the problem by using an iterative loop between the quantum computer and a classical optimization routine. This work explores the intermediary statevectors found in each VQE iteration as the means of sonifying the optimization process itself. The implementation was realised in the form of a musical interface prototype named Variational Quantum Harmonizer (VQH), providing potential design strategies for musical applications, focusing on chords, chord progressions, and arpeggios. The VQH can be used both to enhance data visualization or to create artistic pieces. The methodology is also relevant in terms of how an artist would gain intuition towards achieving a desired musical sound by carefully designing QUBO cost functions. Flexible mapping strategies could supply a broad portfolio of sounds for QUBO and quantum-inspired musical compositions, as demonstrated in a case study composition, "Dependent Origination" by Peter Thomas and Paulo Itaborai.
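To make the sonification idea concrete, here is a hedged Python sketch of one possible mapping from a VQE intermediate statevector to a chord: each qubit's probability of measuring |1> drives the velocity of one note. The note scale and the marginal-probability mapping are assumptions for illustration, not the VQH implementation.

```python
import numpy as np

def marginal_one_probs(statevector):
    """Per-qubit probability of measuring |1> from a full statevector."""
    n = int(np.log2(len(statevector)))
    probs = np.abs(statevector) ** 2
    marginals = np.zeros(n)
    for basis_state, p in enumerate(probs):
        for q in range(n):
            if (basis_state >> q) & 1:
                marginals[q] += p
    return marginals

def statevector_to_chord(statevector, scale=(60, 62, 64, 65, 67, 69, 71, 72)):
    """Map each qubit's |1> probability to the velocity of a scale note
    (hypothetical mapping: qubit q -> MIDI note scale[q])."""
    marginals = marginal_one_probs(statevector)
    return [(scale[q], int(127 * p)) for q, p in enumerate(marginals)]

# Toy 3-qubit statevector standing in for one VQE iteration's output.
sv = np.array([0.6, 0.0, 0.0, 0.5, 0.0, 0.4, 0.0, 0.48], dtype=complex)
sv /= np.linalg.norm(sv)
print(statevector_to_chord(sv))  # [(note, velocity), ...] for qubits 0..2
```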

A Multiscale Autoencoder (MSAE) Framework for End-to-End Neural Network Speech Enhancement

  • paper_url: http://arxiv.org/abs/2309.12121
  • repo_url: None
  • paper_authors: Bengt J. Borgstrom, Michael S. Brandstein
  • for: Improving single-channel speech enhancement performance.
  • methods: A multiscale autoencoder (MSAE) that performs spectral decomposition of the input waveform in separate band-limited branches, each operating at a different rate and scale, to extract a sequence of multiscale embeddings.
  • results: The MSAE provides clear performance gains over conventional single-branch autoencoders and outperforms a variety of state-of-the-art enhancement systems on objective speech quality metrics and automatic speech recognition accuracy.
    Abstract Neural network approaches to single-channel speech enhancement have received much recent attention. In particular, mask-based architectures have achieved significant performance improvements over conventional methods. This paper proposes a multiscale autoencoder (MSAE) for mask-based end-to-end neural network speech enhancement. The MSAE performs spectral decomposition of an input waveform within separate band-limited branches, each operating with a different rate and scale, to extract a sequence of multiscale embeddings. The proposed framework features intuitive parameterization of the autoencoder, including a flexible spectral band design based on the Constant-Q transform. Additionally, the MSAE is constructed entirely of differentiable operators, allowing it to be implemented within an end-to-end neural network, and be discriminatively trained. The MSAE draws motivation both from recent multiscale network topologies and from traditional multiresolution transforms in speech processing. Experimental results show the MSAE to provide clear performance benefits relative to conventional single-branch autoencoders. Additionally, the proposed framework is shown to outperform a variety of state-of-the-art enhancement systems, both in terms of objective speech quality metrics and automatic speech recognition accuracy.
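The multiscale analysis idea can be sketched as parallel band-limited Conv1d branches operating at different kernel sizes and strides (rates/scales), whose outputs are concatenated into a multiscale embedding. A minimal PyTorch sketch follows; the branch configuration and dimensions are assumptions, and the paper's Constant-Q-based band design is not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleEncoder(nn.Module):
    """Parallel encoder branches operating at different rates/scales
    (illustrative stand-in for the MSAE analysis stage)."""
    def __init__(self, emb_dim=128,
                 branch_cfgs=((32, 16), (64, 32), (128, 64))):  # (kernel, stride)
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(1, emb_dim, kernel_size=k, stride=s, padding=k // 2)
            for k, s in branch_cfgs)

    def forward(self, wav):                      # wav: (B, samples)
        x = wav.unsqueeze(1)                     # (B, 1, samples)
        feats = [torch.relu(b(x)) for b in self.branches]
        # resample every branch to the finest temporal resolution, then stack
        T = max(f.size(-1) for f in feats)
        feats = [F.interpolate(f, size=T, mode="nearest") for f in feats]
        return torch.cat(feats, dim=1)           # (B, emb_dim * n_branches, T)

enc = MultiscaleEncoder()
emb = enc(torch.randn(4, 16000))                 # 1 second at 16 kHz
print(emb.shape)                                 # torch.Size([4, 384, 1001])
```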

Is the Ideal Ratio Mask Really the Best? – Exploring the Best Extraction Performance and Optimal Mask of Mask-based Beamformers

  • paper_url: http://arxiv.org/abs/2309.12065
  • repo_url: None
  • paper_authors: Atsuo Hiroe, Katsutoshi Itoyama, Kazuhiro Nakadai
  • for: This study investigates mask-based beamformers (BFs), which use time-frequency masks to extract target speech. Although many BF methods have been proposed, several questions remain open: 1) which BF provides the best extraction performance, 2) whether the optimal mask for the best performance is common to all BFs, and 3) whether that optimal mask is identical to the ideal ratio mask (IRM).
  • methods: Four mask-based BFs are examined: the maximum signal-to-noise ratio BF, two of its variants, and the multichannel Wiener filter (MWF) BF. The optimal mask for each BF is obtained by minimizing, for each utterance, the mean square error between the BF output and the target speech.
  • results: Experiments on the CHiME-3 dataset show that all four BFs reach the same peak performance as the upper bound given by the ideal MWF BF, whereas the optimal mask depends on the adopted BF and differs from the IRM. This contradicts the conventional view that the optimal mask is common to all BFs and that peak performance differs across BFs, and the findings inform the design of mask-based BFs.
    Abstract This study investigates mask-based beamformers (BFs), which estimate filters to extract target speech using time-frequency masks. Although several BF methods have been proposed, the following aspects are yet to be comprehensively investigated. 1) Which BF can provide the best extraction performance in terms of the closeness of the BF output to the target speech? 2) Is the optimal mask for the best performance common for all BFs? 3) Is the ideal ratio mask (IRM) identical to the optimal mask? Accordingly, we investigate these issues considering four mask-based BFs: the maximum signal-to-noise ratio BF, two variants of this, and the multichannel Wiener filter (MWF) BF. To obtain the optimal mask corresponding to the peak performance for each BF, we employ an approach that minimizes the mean square error between the BF output and target speech for each utterance. Via the experiments with the CHiME-3 dataset, we verify that the four BFs have the same peak performance as the upper bound provided by the ideal MWF BF, whereas the optimal mask depends on the adopted BF and differs from the IRM. These observations differ from the conventional idea that the optimal mask is common for all BFs and that peak performance differs for each BF. Hence, this study contributes to the design of mask-based BFs.
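For reference, the mask-based MWF mentioned above can be sketched as follows: a time-frequency mask weights the estimation of the speech and noise spatial covariance matrices, from which the multichannel Wiener filter is computed per frequency bin. This is the textbook formulation in NumPy, not the paper's exact estimators or its per-utterance mask optimization.

```python
import numpy as np

def masked_covariance(X, mask):
    """Spatial covariance weighted by a T-F mask.
    X: (F, T, M) multichannel STFT; mask: (F, T) in [0, 1]."""
    w = mask[..., None, None]                              # (F, T, 1, 1)
    outer = X[..., :, None] * X[..., None, :].conj()       # (F, T, M, M)
    return (w * outer).sum(axis=1) / np.maximum(mask.sum(axis=1), 1e-8)[:, None, None]

def mwf_beamformer(X, speech_mask, ref_mic=0):
    """Mask-based multichannel Wiener filter (textbook form):
    W = (Phi_s + Phi_n)^-1 Phi_s e_ref, applied per frequency; y = W^H x."""
    F_, T, M = X.shape
    Phi_s = masked_covariance(X, speech_mask)
    Phi_n = masked_covariance(X, 1.0 - speech_mask)
    out = np.zeros((F_, T), dtype=complex)
    e = np.zeros(M); e[ref_mic] = 1.0
    for f in range(F_):
        W = np.linalg.solve(Phi_s[f] + Phi_n[f] + 1e-6 * np.eye(M), Phi_s[f] @ e)
        out[f] = X[f] @ W.conj()
    return out

# Toy example: 257 frequency bins, 100 frames, 4 microphones.
rng = np.random.default_rng(0)
X = rng.normal(size=(257, 100, 4)) + 1j * rng.normal(size=(257, 100, 4))
mask = rng.uniform(size=(257, 100))
Y = mwf_beamformer(X, mask)
print(Y.shape)  # (257, 100)
```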

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

  • paper_url: http://arxiv.org/abs/2309.11977
  • repo_url: None
  • paper_authors: Shun Lei, Yixuan Zhou, Liyang Chen, Dan Luo, Zhiyong Wu, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han, Helen Meng
  • for: This work proposes a language model-based zero-shot text-to-speech (TTS) system that clones an unseen speaker's voice without adaptation parameters.
  • methods: Building on the neural codec language model VALL-E, the model adds a speaker-aware text encoder that learns personal speaking style at the phoneme level from a multi-sentence style prompt, and a VALL-E-based acoustic decoder that models timbre at the frame level from a timbre prompt.
  • results: Experiments show the model outperforms baselines in naturalness and speaker similarity, and achieves better performance when scaled to a longer style prompt.
    Abstract Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveform into discrete acoustic tokens and modeling these tokens with the language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt of an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone personal speaking style. In this paper, we propose a novel zero-shot TTS model with the multi-scale acoustic prompts based on a neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme-level from the style prompt consisting of multiple sentences. Following that, a VALL-E based acoustic decoder is utilized to model the timbre from the timbre prompt at the frame-level and generate speech. The experimental results show that our proposed method outperforms baselines in terms of naturalness and speaker similarity, and can achieve better performance by scaling out to a longer style prompt.

Multi-Channel MOSRA: Mean Opinion Score and Room Acoustics Estimation Using Simulated Data and a Teacher Model

  • paper_url: http://arxiv.org/abs/2309.11976
  • repo_url: None
  • paper_authors: Jozef Coldenhoff, Andrew Harper, Paul Kendrick, Tijana Stojkovic, Milos Cernak
  • for: Predicting room acoustic parameters and speech quality metrics.
  • methods: A multi-channel model that jointly predicts MOS and room acoustic parameters for multiple recording devices in parallel, trained on simulated data whose room acoustic labels come from an acoustic simulator and whose MOS labels come from a wav2vec2-based teacher model.
  • results: The multi-channel model improves prediction of the direct-to-reverberation ratio, clarity, and the speech transmission index over the single-channel model with roughly 5x less computation, while suffering only minimal losses on the other metrics.
    Abstract Previous methods for predicting room acoustic parameters and speech quality metrics have focused on the single-channel case, where room acoustics and Mean Opinion Score (MOS) are predicted for a single recording device. However, quality-based device selection for rooms with multiple recording devices may benefit from a multi-channel approach where the descriptive metrics are predicted for multiple devices in parallel. Following our hypothesis that a model may benefit from multi-channel training, we develop a multi-channel model for joint MOS and room acoustics prediction (MOSRA) for five channels in parallel. The lack of multi-channel audio data with ground truth labels necessitated the creation of simulated data using an acoustic simulator with room acoustic labels extracted from the generated impulse responses and labels for MOS generated in a student-teacher setup using a wav2vec2-based MOS prediction model. Our experiments show that the multi-channel model improves the prediction of the direct-to-reverberation ratio, clarity, and speech transmission index over the single-channel model with roughly 5$\times$ less computation while suffering minimal losses in the performance of the other metrics.
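A minimal PyTorch sketch of the joint multi-channel prediction idea: a shared encoder processes each channel and per-task heads output MOS plus a small set of room acoustic parameters for all channels in parallel. The channel count, feature size, and parameter set are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiChannelMOSRA(nn.Module):
    """Shared per-channel encoder with heads that jointly predict MOS and a
    few room acoustic parameters for several devices in parallel
    (dimensions and target set are illustrative assumptions)."""
    def __init__(self, feat_dim=64, hidden=128,
                 acoustic_params=("drr", "clarity", "sti", "rt60")):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.mos_head = nn.Linear(hidden, 1)
        self.room_head = nn.Linear(hidden, len(acoustic_params))

    def forward(self, feats):                 # feats: (B, C, T, feat_dim)
        B, C, T, D = feats.shape
        _, h = self.encoder(feats.view(B * C, T, D))
        h = h.squeeze(0)                      # (B*C, hidden)
        mos = self.mos_head(h).view(B, C)
        room = self.room_head(h).view(B, C, -1)
        return mos, room

model = MultiChannelMOSRA()
mos, room = model(torch.randn(2, 5, 200, 64))  # 5 channels in parallel
print(mos.shape, room.shape)                   # (2, 5) and (2, 5, 4)
```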

Cluster-based pruning techniques for audio data

  • paper_url: http://arxiv.org/abs/2309.11922
  • repo_url: https://github.com/boris-bergsma/audio_pruning
  • paper_authors: Boris Bergsma, Marta Brzezinska, Oleg V. Yazyev, Milos Cernak
  • for: Reducing the amount of training data to improve computational efficiency while preserving the performance of deep learning models.
  • methods: k-means clustering is used for efficient data selection, grouping similar samples together so that the dataset can be shrunk while retaining its representative characteristics.
  • results: A clustering analysis on a keyword spotting (KWS) dataset shows that k-means clustering can substantially reduce the size of audio datasets while maintaining classification performance across neural networks (NNs) with different architectures.
    Abstract Deep learning models have become widely adopted in various domains, but their performance heavily relies on a vast amount of data. Datasets often contain a large number of irrelevant or redundant samples, which can lead to computational inefficiencies during the training. In this work, we introduce, for the first time in the context of the audio domain, the k-means clustering as a method for efficient data pruning. K-means clustering provides a way to group similar samples together, allowing the reduction of the size of the dataset while preserving its representative characteristics. As an example, we perform clustering analysis on the keyword spotting (KWS) dataset. We discuss how k-means clustering can significantly reduce the size of audio datasets while maintaining the classification performance across neural networks (NNs) with different architectures. We further comment on the role of scaling analysis in identifying the optimal pruning strategies for a large number of samples. Our studies serve as a proof-of-principle, demonstrating the potential of data selection with distance-based clustering algorithms for the audio domain and highlighting promising research avenues.
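A minimal scikit-learn sketch of cluster-based pruning: cluster per-clip embeddings with k-means and keep, from each cluster, the samples closest to the centroid. The keep-closest-to-centroid rule and the embedding choice are assumptions for illustration; the paper's selection criterion may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_prune(features, keep_ratio=0.5, n_clusters=64, seed=0):
    """Cluster audio embeddings with k-means and keep, from every cluster,
    the samples closest to its centroid (one simple selection rule)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(features)
    dists = np.linalg.norm(features - km.cluster_centers_[km.labels_], axis=1)
    keep = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        n_keep = max(1, int(round(keep_ratio * len(idx))))
        keep.extend(idx[np.argsort(dists[idx])[:n_keep]])
    return np.sort(np.array(keep))

# Toy stand-in for per-clip audio embeddings (e.g. averaged MFCCs).
rng = np.random.default_rng(0)
emb = rng.normal(size=(2000, 40))
kept = kmeans_prune(emb, keep_ratio=0.3)
print(f"kept {len(kept)} of {len(emb)} clips")
```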

The Impact of Silence on Speech Anti-Spoofing

  • paper_url: http://arxiv.org/abs/2309.11827
  • repo_url: None
  • paper_authors: Yuxiang Zhang, Zhuo Li, Jingze Lu, Hua Hua, Wenchao Wang, Pengyuan Zhang
  • for: This paper analyzes the impact of silence on speech anti-spoofing countermeasures (CMs).
  • methods: Voice Activity Detection (VAD) is used to remove silence from test speech, and class activation mapping (CAM) is used to visualize how silence affects the attention of anti-spoofing CMs.
  • results: Removing silence from spoofed speech can severely degrade CM performance. Both the proportion of silence duration and the content of silence matter, and masking silence or non-silence is shown to improve the robustness of CMs against unknown spoofing attacks.
    Abstract The current speech anti-spoofing countermeasures (CMs) show excellent performance on specific datasets. However, removing the silence of test speech through Voice Activity Detection (VAD) can severely degrade performance. In this paper, the impact of silence on speech anti-spoofing is analyzed. First, the reasons for the impact are explored, including the proportion of silence duration and the content of silence. The proportion of silence duration in spoof speech generated by text-to-speech (TTS) algorithms is lower than that in bonafide speech. And the content of silence generated by different waveform generators varies compared to bonafide speech. Then the impact of silence on model prediction is explored. Even after retraining, the spoof speech generated by neural network based end-to-end TTS algorithms suffers a significant rise in error rates when the silence is removed. To demonstrate the reasons for the impact of silence on CMs, the attention distribution of a CM is visualized through class activation mapping (CAM). Furthermore, the implementation and analysis of the experiments masking silence or non-silence demonstrates the significance of the proportion of silence duration for detecting TTS and the importance of silence content for detecting voice conversion (VC). Based on the experimental results, improving the robustness of CMs against unknown spoofing attacks by masking silence is also proposed. Finally, the attacks on anti-spoofing CMs through concatenating silence, and the mitigation of VAD and silence attack through low-pass filtering are introduced.
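A short NumPy sketch of how the proportion of silence can be measured with a simple energy-based VAD; the frame sizes and threshold are illustrative assumptions, and the paper relies on a standard VAD tool rather than this toy detector.

```python
import numpy as np

def silence_proportion(wav, sr=16000, frame_ms=25, hop_ms=10, threshold_db=-40.0):
    """Fraction of frames whose energy falls below a threshold relative to
    the loudest frame (simple energy VAD, for illustration only)."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    if len(wav) < frame:
        return 1.0
    n_frames = 1 + (len(wav) - frame) // hop
    idx = np.arange(frame)[None, :] + hop * np.arange(n_frames)[:, None]
    energy_db = 10 * np.log10(np.mean(wav[idx] ** 2, axis=1) + 1e-12)
    ref = energy_db.max()
    return float(np.mean(energy_db < ref + threshold_db))

# Toy signal: 1 s of speech-like noise padded with 0.5 s of silence on each side.
rng = np.random.default_rng(0)
sig = np.concatenate([np.zeros(8000), 0.1 * rng.normal(size=16000), np.zeros(8000)])
print(f"silence proportion: {silence_proportion(sig):.2f}")  # roughly 0.5
```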

Frame Pairwise Distance Loss for Weakly-supervised Sound Event Detection

  • paper_url: http://arxiv.org/abs/2309.11783
  • repo_url: None
  • paper_authors: Rui Tao, Yuxing Huang, Xiangdong Wang, Long Yan, Lufeng Zhai, Kazushige Ouchi, Taihao Li
  • for: bridging the gap between fully supervised methods and unsupervised techniques in various domains, specifically for detecting sound events with limited labeled data.
  • methods: introducing a Frame Pairwise Distance (FPD) loss branch, along with a minimal amount of synthesized data and corresponding sampling and label processing strategies.
  • results: validated on the standard DCASE dataset, the proposed approach showed efficacy and improved the recognition rate of weakly-supervised sound event detection.
    Abstract Weakly-supervised learning has emerged as a promising approach to leverage limited labeled data in various domains by bridging the gap between fully supervised methods and unsupervised techniques. Acquisition of strong annotations for detecting sound events is prohibitively expensive, making weakly supervised learning a more cost-effective and broadly applicable alternative. In order to enhance the recognition rate of the learning of detection of weakly-supervised sound events, we introduce a Frame Pairwise Distance (FPD) loss branch, complemented with a minimal amount of synthesized data. The corresponding sampling and label processing strategies are also proposed. Two distinct distance metrics are employed to evaluate the proposed approach. Finally, the method is validated on the standard DCASE dataset. The obtained experimental results corroborated the efficacy of this approach.
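A hedged PyTorch sketch of a frame pairwise distance loss in contrastive form: same-label frame pairs are pulled together and different-label pairs are pushed beyond a margin. The distance metric, margin, and pairing strategy are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def frame_pairwise_distance_loss(emb, labels, margin=1.0):
    """emb: (T, D) frame embeddings; labels: (T,) frame-level event labels.
    Same-label pairs are pulled together, different-label pairs are pushed
    beyond a margin (contrastive formulation assumed for illustration)."""
    diff = emb[:, None, :] - emb[None, :, :]          # (T, T, D)
    d2 = diff.pow(2).sum(-1)                          # squared pairwise distances
    d = torch.sqrt(d2 + 1e-12)
    same = labels[:, None].eq(labels[None, :]).float()
    off_diag = 1.0 - torch.eye(len(labels), device=emb.device)
    pull = (same * off_diag * d2).sum()
    push = ((1.0 - same) * F.relu(margin - d).pow(2)).sum()
    return (pull + push) / off_diag.sum().clamp(min=1.0)

emb = torch.randn(50, 64, requires_grad=True)         # 50 frames, 64-dim embeddings
labels = torch.randint(0, 3, (50,))                   # 3 hypothetical event classes
loss = frame_pairwise_distance_loss(emb, labels)
loss.backward()
print(float(loss))
```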

CoMFLP: Correlation Measure based Fast Search on ASR Layer Pruning

  • paper_url: http://arxiv.org/abs/2309.11768
  • repo_url: https://github.com/louislau1129/comflp
  • paper_authors: Wei Liu, Zhiyuan Peng, Tan Lee
  • for: Compressing transformer-based speech recognition (ASR) models so they can be deployed on resource-constrained devices.
  • methods: Layer pruning (LP) guided by a correlation measure: the correlation between layers is computed to build a correlation matrix that exposes redundancy, followed by a coarse search that prunes the most redundant layers to obtain top-K candidates and a fine search that selects the best proposal using a task-specific evaluation metric.
  • results: Compared with existing LP methods, CoMFLP finds better pruning proposals while requiring only constant time complexity; experiments show the pruning proposal determined by CoMFLP outperforms existing LP methods. Code is available at https://github.com/louislau1129/CoMFLP.
    Abstract Transformer-based speech recognition (ASR) model with deep layers exhibited significant performance improvement. However, the model is inefficient for deployment on resource-constrained devices. Layer pruning (LP) is a commonly used compression method to remove redundant layers. Previous studies on LP usually identify the redundant layers according to a task-specific evaluation metric. They are time-consuming for models with a large number of layers, even in a greedy search manner. To address this problem, we propose CoMFLP, a fast search LP algorithm based on correlation measure. The correlation between layers is computed to generate a correlation matrix, which identifies the redundancy among layers. The search process is carried out in two steps: (1) coarse search: to determine top $K$ candidates by pruning the most redundant layers based on the correlation matrix; (2) fine search: to select the best pruning proposal among $K$ candidates using a task-specific evaluation metric. Experiments on an ASR task show that the pruning proposal determined by CoMFLP outperforms existing LP methods while only requiring constant time complexity. The code is publicly available at https://github.com/louislau1129/CoMFLP.
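A minimal NumPy sketch of the correlation-based coarse search: compute a layer-by-layer correlation matrix from hidden states and greedily drop layers that are highly correlated with the preceding kept layer. The correlation measure and the greedy rule are illustrative assumptions; the paper's fine search with a task-specific metric is not shown.

```python
import numpy as np

def layer_correlation_matrix(layer_outputs):
    """layer_outputs: list of (N, D) hidden states, one per layer.
    Returns an (L, L) matrix of absolute Pearson correlations between the
    flattened layer representations (one simple redundancy measure)."""
    flat = np.stack([h.reshape(-1) for h in layer_outputs])   # (L, N*D)
    return np.abs(np.corrcoef(flat))

def coarse_prune(corr, n_prune):
    """Greedily drop layers whose output is most correlated with the
    preceding kept layer (illustrative coarse-search rule)."""
    L = corr.shape[0]
    kept, pruned = [0], []
    for i in range(1, L):
        if len(pruned) < n_prune and corr[kept[-1], i] > 0.98:
            pruned.append(i)
        else:
            kept.append(i)
    # if the threshold left too few candidates, drop the most redundant remaining
    while len(pruned) < n_prune:
        cand = max(kept[1:], key=lambda i: corr[i - 1, i])
        kept.remove(cand)
        pruned.append(cand)
    return kept, sorted(pruned)

rng = np.random.default_rng(0)
base = rng.normal(size=(32, 256))
# Toy 8-layer stack where consecutive layers change only slightly.
outputs = [base + 0.05 * i * rng.normal(size=base.shape) for i in range(8)]
corr = layer_correlation_matrix(outputs)
print(coarse_prune(corr, n_prune=3))
```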

Sparsely Shared LoRA on Whisper for Child Speech Recognition

  • paper_url: http://arxiv.org/abs/2309.11756
  • repo_url: None
  • paper_authors: Wei Liu, Ying Qin, Zhiyuan Peng, Tan Lee
  • for: Improving the zero-shot performance of the Whisper automatic speech recognition (ASR) model on low-resource speech, using child speech as a representative case.
  • methods: Parameter-efficient fine-tuning (PEFT) methods such as LoRA and AdaLoRA are examined, and a novel Sparsely Shared LoRA (S2-LoRA) is proposed in which the two low-rank decomposition matrices are globally shared and each weight matrix keeps only a sparse set of its own rank coefficients.
  • results: Experiments on low-resource Chinese child speech show that S2-LoRA achieves in-domain adaptation performance comparable to AdaLoRA with far fewer trainable parameters and generalizes better to out-of-domain data; the rank distribution it learns automatically shows patterns similar to AdaLoRA's allocation.
    Abstract Whisper is a powerful automatic speech recognition (ASR) model. Nevertheless, its zero-shot performance on low-resource speech requires further improvement. Child speech, as a representative type of low-resource speech, is leveraged for adaptation. Recently, parameter-efficient fine-tuning (PEFT) in NLP was shown to be comparable and even better than full fine-tuning, while only needing to tune a small set of trainable parameters. However, current PEFT methods have not been well examined for their effectiveness on Whisper. In this paper, only parameter composition types of PEFT approaches such as LoRA and Bitfit are investigated as they do not bring extra inference costs. Different popular PEFT methods are examined. Particularly, we compare LoRA and AdaLoRA and figure out the learnable rank coefficient is a good design. Inspired by the sparse rank distribution allocated by AdaLoRA, a novel PEFT approach Sparsely Shared LoRA (S2-LoRA) is proposed. The two low-rank decomposed matrices are globally shared. Each weight matrix only has to maintain its specific rank coefficients that are constrained to be sparse. Experiments on low-resource Chinese child speech show that with much fewer trainable parameters, S2-LoRA can achieve comparable in-domain adaptation performance to AdaLoRA and exhibit better generalization ability on out-of-domain data. In addition, the rank distribution automatically learned by S2-LoRA is found to have similar patterns to AdaLoRA's allocation.
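A hedged PyTorch sketch of the S2-LoRA idea: the two low-rank factors are shared globally across layers, and each adapted weight matrix learns only a per-rank coefficient vector that an L1 penalty keeps sparse. Module names, initialization, and the sparsity penalty are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class S2LoRALinear(nn.Module):
    """Frozen linear layer with a LoRA update whose low-rank factors A and B
    are shared across all layers; each layer only learns a per-rank
    coefficient vector that is encouraged to be sparse (sketch of the idea)."""
    def __init__(self, base: nn.Linear, shared_A: nn.Parameter,
                 shared_B: nn.Parameter, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A, self.B = shared_A, shared_B           # (r, in), (out, r), shared
        self.coeff = nn.Parameter(torch.zeros(shared_A.size(0)))  # layer-specific
        self.alpha = alpha

    def forward(self, x):
        delta = self.B @ torch.diag(self.coeff) @ self.A   # (out, in) low-rank update
        return self.base(x) + self.alpha * (x @ delta.T)

    def l1_penalty(self):
        return self.coeff.abs().sum()                 # promotes sparse rank usage

d_in, d_out, rank = 512, 512, 16
shared_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
shared_B = nn.Parameter(torch.randn(d_out, rank) * 0.01)
layers = [S2LoRALinear(nn.Linear(d_in, d_out), shared_A, shared_B) for _ in range(4)]
x = torch.randn(2, 10, d_in)
y = layers[0](x)
sparsity_loss = sum(l.l1_penalty() for l in layers)
print(y.shape, float(sparsity_loss))
```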

Leveraging In-the-Wild Data for Effective Self-Supervised Pretraining in Speaker Recognition

  • paper_url: http://arxiv.org/abs/2309.11730
  • repo_url: https://github.com/wenet-e2e/wespeaker
  • paper_authors: Shuai Wang, Qibing Bai, Qi Liu, Jianwei Yu, Zhengyang Chen, Bing Han, Yanmin Qian, Haizhou Li
  • for: Improving current speaker recognition systems, either by transferring large pretrained models (e.g., WavLM) to the downstream task or by directly applying self-supervised methods (e.g., DINO) to speaker embedding learning.
  • methods: DINO self-supervised pretraining for speaker embeddings on the large-scale in-the-wild WenetSpeech dataset, combined with a confidence-based data filtering algorithm that removes unreliable data from the pretraining set.
  • results: DINO pretraining transfers well and enhances supervised system performance on the CNCeleb dataset, and the confidence-based filtering yields better performance with less training data; the pretrained models, confidence files, and pretraining/finetuning scripts will be released in the Wespeaker toolkit.
    Abstract Current speaker recognition systems primarily rely on supervised approaches, constrained by the scale of labeled datasets. To boost the system performance, researchers leverage large pretrained models such as WavLM to transfer learned high-level features to the downstream speaker recognition task. However, this approach introduces extra parameters as the pretrained model remains in the inference stage. Another group of researchers directly apply self-supervised methods such as DINO to speaker embedding learning, yet they have not explored its potential on large-scale in-the-wild datasets. In this paper, we present the effectiveness of DINO training on the large-scale WenetSpeech dataset and its transferability in enhancing the supervised system performance on the CNCeleb dataset. Additionally, we introduce a confidence-based data filtering algorithm to remove unreliable data from the pretraining dataset, leading to better performance with less training data. The associated pretrained models, confidence files, pretraining and finetuning scripts will be made available in the Wespeaker toolkit.
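A small Python sketch of the confidence-based filtering step: utterances whose confidence score falls below a threshold are dropped from the pretraining set. The file naming, score source, and threshold are purely illustrative; the paper defines its own confidence measure for the WenetSpeech data.

```python
def filter_by_confidence(utterances, confidences, threshold=0.7):
    """Keep only pretraining utterances whose confidence score passes a
    threshold (hypothetical threshold and score source)."""
    kept = [u for u, c in zip(utterances, confidences) if c >= threshold]
    print(f"kept {len(kept)}/{len(utterances)} utterances")
    return kept

# Toy example with made-up utterance paths and confidence scores.
utts = [f"wav/{i:06d}.wav" for i in range(10)]
conf = [0.95, 0.40, 0.81, 0.66, 0.99, 0.72, 0.10, 0.88, 0.70, 0.55]
clean_subset = filter_by_confidence(utts, conf)
```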