results: Experimental results show that the PET-TSVAD method achieves higher stability and reliability on the VoxConverse and DIHARD-I datasets and is more robust to speaker profile errors than the existing TS-VAD method.
Abstract
Target-Speaker Voice Activity Detection (TS-VAD) utilizes a set of speaker profiles alongside an input audio signal to perform speaker diarization. While its superiority over conventional methods has been demonstrated, the method can suffer from errors in speaker profiles, as those profiles are typically obtained by running a traditional clustering-based diarization method over the input signal. This paper proposes an extension to TS-VAD, called Profile-Error-Tolerant TS-VAD (PET-TSVAD), which is robust to such speaker profile errors. This is achieved by employing transformer-based TS-VAD that can handle a variable number of speakers and further introducing a set of additional pseudo-speaker profiles to handle speakers undetected during the first pass diarization. During training, we use speaker profiles estimated by multiple different clustering algorithms to reduce the mismatch between the training and testing conditions regarding speaker profiles. Experimental results show that PET-TSVAD consistently outperforms the existing TS-VAD method on both the VoxConverse and DIHARD-I datasets.
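As a rough illustration of the pseudo-speaker idea described above, the sketch below pads the clustering-derived profile set with a few learnable pseudo-profiles and lets a transformer decoder attend over the acoustic frames for every real or pseudo speaker. All class names, dimensions, and layer choices are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PETTSVADSketch(nn.Module):
    """Toy illustration of profile padding: learnable pseudo-speaker profiles
    are appended to the clustering-derived profiles, and a transformer decoder
    lets every (real or pseudo) speaker query attend over the acoustic frames."""

    def __init__(self, d_model=256, n_pseudo=4, n_layers=2, n_heads=4):
        super().__init__()
        # Extra queries for speakers the first-pass diarization may have missed.
        self.pseudo_profiles = nn.Parameter(torch.randn(n_pseudo, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)

    def forward(self, frame_feats, speaker_profiles):
        # frame_feats: (B, T, D) acoustic frames; speaker_profiles: (B, S, D).
        batch = speaker_profiles.size(0)
        pseudo = self.pseudo_profiles.unsqueeze(0).expand(batch, -1, -1)
        queries = torch.cat([speaker_profiles, pseudo], dim=1)      # (B, S+P, D)
        dec = self.decoder(tgt=queries, memory=frame_feats)         # (B, S+P, D)
        # Per-speaker, per-frame activity logits via dot products with frames.
        return torch.einsum("bsd,btd->bst", dec, frame_feats)       # (B, S+P, T)

model = PETTSVADSketch()
logits = model(torch.randn(2, 300, 256), torch.randn(2, 3, 256))
print(logits.shape)   # torch.Size([2, 7, 300])
```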
Variational Quantum Harmonizer: Generating Chord Progressions and Other Sonification Methods with the VQE Algorithm
results: This work realises a musical interface prototype named Variational Quantum Harmonizer (VQH) that can be used to enhance data visualization or to create artistic pieces. The VQH also helps artists build intuition about solutions of QUBO problems and can supply a broad portfolio of sounds for QUBO and quantum-inspired musical compositions.
Abstract
This work investigates a case study of using physical-based sonification of Quadratic Unconstrained Binary Optimization (QUBO) problems, optimized by the Variational Quantum Eigensolver (VQE) algorithm. The VQE approximates the solution of the problem by using an iterative loop between the quantum computer and a classical optimization routine. This work explores the intermediary statevectors found in each VQE iteration as the means of sonifying the optimization process itself. The implementation was realised in the form of a musical interface prototype named Variational Quantum Harmonizer (VQH), providing potential design strategies for musical applications, focusing on chords, chord progressions, and arpeggios. The VQH can be used both to enhance data visualization or to create artistic pieces. The methodology is also relevant in terms of how an artist would gain intuition towards achieving a desired musical sound by carefully designing QUBO cost functions. Flexible mapping strategies could supply a broad portfolio of sounds for QUBO and quantum-inspired musical compositions, as demonstrated in a case study composition, "Dependent Origination" by Peter Thomas and Paulo Itaborai.
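To make the sonification idea concrete, here is a toy mapping from an intermediate VQE statevector to chord tones. The scale layout, probability threshold, and note assignment are assumptions chosen for illustration; the VQH prototype defines its own mapping strategies.

```python
import numpy as np

def statevector_to_chord(statevector, base_midi=48, prob_threshold=0.1):
    """Toy sonification mapping (an assumption, not the VQH implementation):
    each computational basis state with enough probability contributes chord
    tones; the set bits of its bitstring select scale degrees above a root."""
    scale = [0, 2, 4, 5, 7, 9, 11]           # C major degrees (semitones)
    probs = np.abs(np.asarray(statevector)) ** 2
    n_qubits = int(np.log2(len(probs)))
    chord = []
    for basis_index, p in enumerate(probs):
        if p < prob_threshold:
            continue
        bits = [(basis_index >> q) & 1 for q in range(n_qubits)]
        for q, bit in enumerate(bits):
            if bit:                           # qubit q "on" -> add its scale tone
                note = base_midi + 12 * (q // len(scale)) + scale[q % len(scale)]
                chord.append((note, float(p)))   # (MIDI pitch, loudness weight)
    return sorted(chord)

# Example: an (assumed) intermediate VQE state over 3 qubits.
psi = np.array([0.1, 0.6, 0.0, 0.55, 0.0, 0.3, 0.0, 0.45], dtype=complex)
psi /= np.linalg.norm(psi)
print(statevector_to_chord(psi))
```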
A Multiscale Autoencoder (MSAE) Framework for End-to-End Neural Network Speech Enhancement
results: Compared with conventional methods, the MSAE provides clear performance gains and performs strongly on objective speech quality metrics and automatic speech recognition accuracy.
Abstract
Neural network approaches to single-channel speech enhancement have received much recent attention. In particular, mask-based architectures have achieved significant performance improvements over conventional methods. This paper proposes a multiscale autoencoder (MSAE) for mask-based end-to-end neural network speech enhancement. The MSAE performs spectral decomposition of an input waveform within separate band-limited branches, each operating with a different rate and scale, to extract a sequence of multiscale embeddings. The proposed framework features intuitive parameterization of the autoencoder, including a flexible spectral band design based on the Constant-Q transform. Additionally, the MSAE is constructed entirely of differentiable operators, allowing it to be implemented within an end-to-end neural network, and be discriminatively trained. The MSAE draws motivation both from recent multiscale network topologies and from traditional multiresolution transforms in speech processing. Experimental results show the MSAE to provide clear performance benefits relative to conventional single-branch autoencoders. Additionally, the proposed framework is shown to outperform a variety of state-of-the-art enhancement systems, both in terms of objective speech quality metrics and automatic speech recognition accuracy.
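The multiscale front-end can be pictured as parallel band- and rate-specific analysis branches. The sketch below shows one way to arrange such branches with 1-D convolutions of different kernel sizes and strides; the actual MSAE design, including its Constant-Q band layout, differs in detail.

```python
import torch
import torch.nn as nn

class MultiscaleEncoderSketch(nn.Module):
    """Illustrative multiscale analysis front-end (not the paper's exact MSAE):
    parallel 1-D conv branches analyze the waveform with different window sizes
    and hop rates, loosely mimicking an octave-spaced filterbank layout."""

    def __init__(self, n_filters=64, kernel_sizes=(32, 64, 128), strides=(8, 16, 32)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(1, n_filters, k, stride=s, padding=k // 2)
            for k, s in zip(kernel_sizes, strides)
        )

    def forward(self, wav):                       # wav: (B, 1, samples)
        embeddings = []
        for branch in self.branches:
            z = torch.relu(branch(wav))           # (B, n_filters, frames_b)
            # Upsample coarser branches to a common frame rate before stacking.
            z = nn.functional.interpolate(z, size=wav.shape[-1] // 8)
            embeddings.append(z)
        return torch.cat(embeddings, dim=1)       # (B, 3 * n_filters, frames)

x = torch.randn(2, 1, 16000)                      # two 1-second, 16 kHz clips
print(MultiscaleEncoderSketch()(x).shape)         # torch.Size([2, 192, 2000])
```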
Is the Ideal Ratio Mask Really the Best? – Exploring the Best Extraction Performance and Optimal Mask of Mask-based Beamformers
results: Experiments on the CHiME-3 dataset show that all four BFs reach the same peak performance as the upper bound provided by the ideal MWF BF, whereas the optimal mask depends on the adopted BF and differs from the IRM. This contradicts the conventional idea that the optimal mask is common to all BFs and that peak performance differs for each BF, so the study contributes to the design of mask-based BFs.
Abstract
This study investigates mask-based beamformers (BFs), which estimate filters to extract target speech using time-frequency masks. Although several BF methods have been proposed, the following aspects are yet to be comprehensively investigated. 1) Which BF can provide the best extraction performance in terms of the closeness of the BF output to the target speech? 2) Is the optimal mask for the best performance common for all BFs? 3) Is the ideal ratio mask (IRM) identical to the optimal mask? Accordingly, we investigate these issues considering four mask-based BFs: the maximum signal-to-noise ratio BF, two variants of this, and the multichannel Wiener filter (MWF) BF. To obtain the optimal mask corresponding to the peak performance for each BF, we employ an approach that minimizes the mean square error between the BF output and target speech for each utterance. Via the experiments with the CHiME-3 dataset, we verify that the four BFs have the same peak performance as the upper bound provided by the ideal MWF BF, whereas the optimal mask depends on the adopted BF and differs from the IRM. These observations differ from the conventional idea that the optimal mask is common for all BFs and that peak performance differs for each BF. Hence, this study contributes to the design of mask-based BFs.
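For reference, the sketch below shows the standard construction of a mask-driven multichannel Wiener filter, one of the beamformer families examined above: the time-frequency mask weights the spatial covariance estimates that define the filter. It is a textbook illustration, not the exact estimator or the oracle-mask search used in the paper.

```python
import numpy as np

def mask_based_mwf(X, mask, ref_mic=0, eps=1e-6):
    """Mask-driven multichannel Wiener filter, one frequency bin at a time.
    X: (F, T, M) complex STFT of the M-channel mixture; mask: (F, T) in [0, 1]."""
    F_bins, T_frames, M = X.shape
    Y = np.zeros((F_bins, T_frames), dtype=complex)
    for f in range(F_bins):
        Xf = X[f]                                    # (T, M)
        w_s = mask[f][:, None]                       # speech weights
        w_n = 1.0 - w_s                              # noise weights
        # Mask-weighted spatial covariance estimates.
        Phi_s = (Xf * w_s).T @ Xf.conj() / max(w_s.sum(), eps)
        Phi_n = (Xf * w_n).T @ Xf.conj() / max(w_n.sum(), eps)
        # MWF: w = (Phi_s + Phi_n)^(-1) Phi_s e_ref
        Phi_x = Phi_s + Phi_n + eps * np.eye(M)
        w = np.linalg.solve(Phi_x, Phi_s[:, ref_mic])
        Y[f] = Xf @ w.conj()                         # y(t) = w^H x(t)
    return Y                                         # (F, T) enhanced STFT
```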
Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts
results: Experimental results show that the model outperforms the baselines in naturalness and speaker similarity, and that performance improves further as the style prompt is lengthened.
Abstract
Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveform into discrete acoustic tokens and modeling these tokens with the language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt of an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone personal speaking style. In this paper, we propose a novel zero-shot TTS model with the multi-scale acoustic prompts based on a neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme-level from the style prompt consisting of multiple sentences. Following that, a VALL-E based acoustic decoder is utilized to model the timbre from the timbre prompt at the frame-level and generate speech. The experimental results show that our proposed method outperforms baselines in terms of naturalness and speaker similarity, and can achieve better performance by scaling out to a longer style prompt.
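The sketch below illustrates one way a speaker-aware text encoder could inject phoneme-level style from a long multi-sentence prompt via cross-attention, with frame-level timbre left to a separate VALL-E-style acoustic decoder. The dimensions and layer choices are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpeakerAwareTextEncoderSketch(nn.Module):
    """Illustrative sketch: phoneme embeddings attend over features of a long
    multi-sentence style prompt so the text representation absorbs
    phoneme-level speaking style. Details are assumptions."""

    def __init__(self, n_phones=100, d_model=256, n_heads=4):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, d_model)
        self.self_attn = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.style_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, phone_ids, style_prompt):
        # phone_ids: (B, L) phoneme tokens; style_prompt: (B, S, d_model)
        # acoustic features extracted from several prompt sentences.
        h = self.self_attn(self.phone_emb(phone_ids))
        style, _ = self.style_attn(query=h, key=style_prompt, value=style_prompt)
        return h + style        # style-conditioned phoneme representations

enc = SpeakerAwareTextEncoderSketch()
out = enc(torch.randint(0, 100, (2, 50)), torch.randn(2, 400, 256))
print(out.shape)                # torch.Size([2, 50, 256])
```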
Multi-Channel MOSRA: Mean Opinion Score and Room Acoustics Estimation Using Simulated Data and a Teacher Model
paper_authors: Jozef Coldenhoff, Andrew Harper, Paul Kendrick, Tijana Stojkovic, Milos Cernak
for: Predicting room acoustic parameters and speech quality metrics
methods: A multi-channel model that jointly predicts MOS and room acoustic parameters for multiple recording devices in parallel
results: Improves the prediction of the direct-to-reverberation ratio, clarity, and speech transmission index over the single-channel model while requiring roughly 5$\times$ less computation, with only minimal losses on the other metrics.
Abstract
Previous methods for predicting room acoustic parameters and speech quality metrics have focused on the single-channel case, where room acoustics and Mean Opinion Score (MOS) are predicted for a single recording device. However, quality-based device selection for rooms with multiple recording devices may benefit from a multi-channel approach where the descriptive metrics are predicted for multiple devices in parallel. Following our hypothesis that a model may benefit from multi-channel training, we develop a multi-channel model for joint MOS and room acoustics prediction (MOSRA) for five channels in parallel. The lack of multi-channel audio data with ground truth labels necessitated the creation of simulated data using an acoustic simulator with room acoustic labels extracted from the generated impulse responses and labels for MOS generated in a student-teacher setup using a wav2vec2-based MOS prediction model. Our experiments show that the multi-channel model improves the prediction of the direct-to-reverberation ratio, clarity, and speech transmission index over the single-channel model with roughly 5$\times$ less computation while suffering minimal losses in the performance of the other metrics.
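A minimal sketch of the multi-channel setup follows, assuming a shared recurrent feature extractor and light per-task heads that score MOS and a few room-acoustic descriptors for all devices in one forward pass; the actual MOSRA architecture is not specified here.

```python
import torch
import torch.nn as nn

class MultiChannelMOSRASketch(nn.Module):
    """Sketch of the multi-channel idea (architecture details are assumptions):
    one shared feature extractor runs over each device's spectrogram, and
    per-task heads predict MOS and room-acoustic descriptors for all channels."""

    def __init__(self, n_channels=5, n_mels=64, d_hidden=128, n_acoustic_params=3):
        super().__init__()
        self.shared = nn.GRU(n_mels, d_hidden, batch_first=True)
        self.mos_head = nn.Linear(d_hidden, 1)
        self.room_head = nn.Linear(d_hidden, n_acoustic_params)  # e.g. DRR, C50, STI

    def forward(self, mels):                            # mels: (B, C, T, n_mels)
        B, C, T, F = mels.shape
        _, h = self.shared(mels.reshape(B * C, T, F))   # h: (1, B*C, d_hidden)
        h = h[-1].reshape(B, C, -1)
        return self.mos_head(h).squeeze(-1), self.room_head(h)  # (B, C), (B, C, P)

mos, room = MultiChannelMOSRASketch()(torch.randn(4, 5, 200, 64))
print(mos.shape, room.shape)    # torch.Size([4, 5]) torch.Size([4, 5, 3])
```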
results: Clustering analysis on the keyword spotting (KWS) dataset shows that k-means clustering can reduce the size of audio datasets while preserving classification performance across NNs with different architectures.
Abstract
Deep learning models have become widely adopted in various domains, but their performance heavily relies on a vast amount of data. Datasets often contain a large number of irrelevant or redundant samples, which can lead to computational inefficiencies during the training. In this work, we introduce, for the first time in the context of the audio domain, the k-means clustering as a method for efficient data pruning. K-means clustering provides a way to group similar samples together, allowing the reduction of the size of the dataset while preserving its representative characteristics. As an example, we perform clustering analysis on the keyword spotting (KWS) dataset. We discuss how k-means clustering can significantly reduce the size of audio datasets while maintaining the classification performance across neural networks (NNs) with different architectures. We further comment on the role of scaling analysis in identifying the optimal pruning strategies for a large number of samples. Our studies serve as a proof-of-principle, demonstrating the potential of data selection with distance-based clustering algorithms for the audio domain and highlighting promising research avenues.
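A minimal sketch of distance-based pruning with k-means on audio embeddings: cluster the dataset and keep, within each cluster, the samples closest to the centroid. The selection rule and keep ratio here are illustrative assumptions; the paper's scaling analysis explores how such choices should be made.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_prune(embeddings, keep_ratio=0.5, n_clusters=50, seed=0):
    """Distance-based pruning sketch (the paper's exact selection rule may
    differ): cluster audio embeddings, then keep the samples closest to their
    centroid within every cluster, preserving coverage of the feature space."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(embeddings)
    dists = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)
    keep = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        n_keep = max(1, int(len(idx) * keep_ratio))
        keep.extend(idx[np.argsort(dists[idx])[:n_keep]])   # closest to centroid
    return np.sort(np.array(keep))

# Example: prune half of a toy "KWS-like" embedding set.
feats = np.random.randn(2000, 128).astype(np.float32)
print(len(kmeans_prune(feats)))   # roughly 1000 retained indices
```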
results: The study finds that removing silence from test speech can degrade the performance of CMs. It also analyzes the impact of the content and duration of silence, and shows how masking silence or non-silence can improve the robustness of CMs.
Abstract
The current speech anti-spoofing countermeasures (CMs) show excellent performance on specific datasets. However, removing the silence of test speech through Voice Activity Detection (VAD) can severely degrade performance. In this paper, the impact of silence on speech anti-spoofing is analyzed. First, the reasons for the impact are explored, including the proportion of silence duration and the content of silence. The proportion of silence duration in spoof speech generated by text-to-speech (TTS) algorithms is lower than that in bonafide speech. And the content of silence generated by different waveform generators varies compared to bonafide speech. Then the impact of silence on model prediction is explored. Even after retraining, the spoof speech generated by neural network based end-to-end TTS algorithms suffers a significant rise in error rates when the silence is removed. To demonstrate the reasons for the impact of silence on CMs, the attention distribution of a CM is visualized through class activation mapping (CAM). Furthermore, the implementation and analysis of the experiments masking silence or non-silence demonstrates the significance of the proportion of silence duration for detecting TTS and the importance of silence content for detecting voice conversion (VC). Based on the experimental results, improving the robustness of CMs against unknown spoofing attacks by masking silence is also proposed. Finally, the attacks on anti-spoofing CMs through concatenating silence, and the mitigation of VAD and silence attack through low-pass filtering are introduced.
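The masking experiments described above can be emulated with a very simple energy-based silence detector, as in the sketch below, which zeroes either the silent or the non-silent frames of a test waveform before scoring it with a countermeasure. The quantile threshold is an assumption, not the VAD used in the paper.

```python
import numpy as np

def mask_segments(wav, frame_len=400, hop=160, energy_quantile=0.2, mask_silence=True):
    """Toy probing utility: zero out either the silent frames (mask_silence=True)
    or the speech frames (mask_silence=False) of a 1-D float waveform.
    Frames below the given energy quantile are treated as silence."""
    n_frames = 1 + max(0, (len(wav) - frame_len) // hop)
    energies = np.array([
        np.mean(wav[i * hop:i * hop + frame_len] ** 2) for i in range(n_frames)
    ])
    threshold = np.quantile(energies, energy_quantile)
    out = wav.copy()
    for i, e in enumerate(energies):
        is_silence = e <= threshold
        if is_silence == mask_silence:            # choose which side to mask
            out[i * hop:i * hop + frame_len] = 0.0
    return out
```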
Frame Pairwise Distance Loss for Weakly-supervised Sound Event Detection
paper_authors: Rui Tao, Yuxing Huang, Xiangdong Wang, Long Yan, Lufeng Zhai, Kazushige Ouchi, Taihao Li
for: bridging the gap between fully supervised methods and unsupervised techniques in various domains, specifically for detecting sound events with limited labeled data.
methods: introducing a Frame Pairwise Distance (FPD) loss branch, along with a minimal amount of synthesized data and corresponding sampling and label processing strategies.
results: Validated on the standard DCASE dataset, the proposed approach showed its efficacy and improved the recognition rate of weakly-supervised sound event detection.
Abstract
Weakly-supervised learning has emerged as a promising approach to leverage limited labeled data in various domains by bridging the gap between fully supervised methods and unsupervised techniques. Acquisition of strong annotations for detecting sound events is prohibitively expensive, making weakly supervised learning a more cost-effective and broadly applicable alternative. To enhance the recognition rate of weakly-supervised sound event detection, we introduce a Frame Pairwise Distance (FPD) loss branch, complemented with a minimal amount of synthesized data. The corresponding sampling and label processing strategies are also proposed. Two distinct distance metrics are employed to evaluate the proposed approach. Finally, the method is validated on the standard DCASE dataset. The experimental results corroborate the efficacy of this approach.
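The abstract does not spell out the loss, so the sketch below encodes one plausible reading of a frame pairwise distance objective: frames that share an event label are pulled together and frames with different labels are pushed beyond a margin. Treat it as an illustration of the idea rather than the paper's formulation.

```python
import torch

def frame_pairwise_distance_loss(frames, labels, margin=1.0):
    """One plausible frame-level pairwise distance objective (the paper's exact
    formulation may differ). frames: (N, D) frame embeddings; labels: (N,)
    integer event ids for the frames."""
    d = torch.cdist(frames, frames)                      # (N, N) Euclidean distances
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    eye = torch.eye(len(labels), device=frames.device)
    # Attract frames of the same event, excluding the trivial self-pairs.
    pull = (d * same * (1 - eye)).sum() / (same - eye).sum().clamp(min=1)
    # Repel frames of different events until they exceed the margin.
    push_mask = 1 - same
    push = (torch.relu(margin - d) * push_mask).sum() / push_mask.sum().clamp(min=1)
    return pull + push

loss = frame_pairwise_distance_loss(torch.randn(64, 128), torch.randint(0, 10, (64,)))
print(loss.item())
```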
CoMFLP: Correlation Measure based Fast Search on ASR Layer Pruning
results: Compared with existing LP methods, CoMFLP selects better layers to prune while requiring only constant time complexity. Experiments show that the pruning proposal determined by CoMFLP outperforms existing LP methods. Code is available at https://github.com/louislau1129/CoMFLP.
Abstract
Transformer-based speech recognition (ASR) model with deep layers exhibited significant performance improvement. However, the model is inefficient for deployment on resource-constrained devices. Layer pruning (LP) is a commonly used compression method to remove redundant layers. Previous studies on LP usually identify the redundant layers according to a task-specific evaluation metric. They are time-consuming for models with a large number of layers, even in a greedy search manner. To address this problem, we propose CoMFLP, a fast search LP algorithm based on correlation measure. The correlation between layers is computed to generate a correlation matrix, which identifies the redundancy among layers. The search process is carried out in two steps: (1) coarse search: to determine top $K$ candidates by pruning the most redundant layers based on the correlation matrix; (2) fine search: to select the best pruning proposal among $K$ candidates using a task-specific evaluation metric. Experiments on an ASR task show that the pruning proposal determined by CoMFLP outperforms existing LP methods while only requiring constant time complexity. The code is publicly available at https://github.com/louislau1129/CoMFLP.
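A compact sketch of the two ingredients named above: a layer-by-layer correlation matrix computed from probe activations, and a greedy coarse search that repeatedly drops the most redundant layer. The specific correlation measure and selection rule here are illustrative assumptions, not the ones defined in the paper.

```python
import numpy as np

def layer_correlation_matrix(layer_outputs):
    """layer_outputs: list of (frames, dim) activations, one entry per encoder
    layer, gathered on a small probe set. Returns an (L, L) similarity matrix.
    Mean-pooled cosine correlation is an illustrative choice."""
    pooled = np.stack([z.mean(axis=0) for z in layer_outputs])        # (L, dim)
    pooled -= pooled.mean(axis=1, keepdims=True)
    pooled /= np.linalg.norm(pooled, axis=1, keepdims=True) + 1e-8
    return pooled @ pooled.T

def coarse_prune(corr, n_prune):
    """Greedy coarse search: repeatedly drop the layer most correlated with
    another remaining layer, i.e. the most redundant one."""
    keep = list(range(len(corr)))
    for _ in range(n_prune):
        sub = corr[np.ix_(keep, keep)].copy()
        np.fill_diagonal(sub, -np.inf)
        worst = keep[int(np.argmax(sub.max(axis=1)))]   # most redundant layer
        keep.remove(worst)
    return keep   # layer indices retained in this pruning proposal

acts = [np.random.randn(500, 768) for _ in range(12)]   # 12 toy encoder layers
print(coarse_prune(layer_correlation_matrix(acts), n_prune=4))   # 8 layers kept
```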
Sparsely Shared LoRA on Whisper for Child Speech Recognition
results: Experiments show that S2-LoRA achieves in-domain adaptation performance comparable to AdaLoRA on low-resource Chinese child speech while generalizing better on out-of-domain data. In addition, the rank distribution automatically learned by S2-LoRA shows patterns similar to AdaLoRA's allocation.
Abstract
Whisper is a powerful automatic speech recognition (ASR) model. Nevertheless, its zero-shot performance on low-resource speech requires further improvement. Child speech, as a representative type of low-resource speech, is leveraged for adaptation. Recently, parameter-efficient fine-tuning (PEFT) in NLP was shown to be comparable and even better than full fine-tuning, while only needing to tune a small set of trainable parameters. However, current PEFT methods have not been well examined for their effectiveness on Whisper. In this paper, only parameter composition types of PEFT approaches such as LoRA and Bitfit are investigated as they do not bring extra inference costs. Different popular PEFT methods are examined. Particularly, we compare LoRA and AdaLoRA and figure out the learnable rank coefficient is a good design. Inspired by the sparse rank distribution allocated by AdaLoRA, a novel PEFT approach Sparsely Shared LoRA (S2-LoRA) is proposed. The two low-rank decomposed matrices are globally shared. Each weight matrix only has to maintain its specific rank coefficients that are constrained to be sparse. Experiments on low-resource Chinese child speech show that with much fewer trainable parameters, S2-LoRA can achieve comparable in-domain adaptation performance to AdaLoRA and exhibit better generalization ability on out-of-domain data. In addition, the rank distribution automatically learned by S2-LoRA is found to have similar patterns to AdaLoRA's allocation.
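The sharing scheme can be sketched directly from the abstract: the low-rank factors A and B are global, and each adapted linear layer keeps only a vector of rank coefficients that is encouraged to stay sparse. Initialization, the sparsity mechanism, and the dimensions below are assumptions.

```python
import torch
import torch.nn as nn

class S2LoRALinearSketch(nn.Module):
    """Sketch of the sparsely-shared LoRA idea: low-rank factors A and B are
    shared globally across layers, while each wrapped linear layer learns only
    its own rank-coefficient vector, kept sparse via an L1 penalty."""

    def __init__(self, base_linear, shared_A, shared_B):
        super().__init__()
        self.base = base_linear                      # frozen pretrained layer
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A, self.B = shared_A, shared_B          # (r, in), (out, r), shared
        self.coeff = nn.Parameter(torch.zeros(shared_A.shape[0]))  # per-layer ranks

    def forward(self, x):
        delta = self.B @ torch.diag(self.coeff) @ self.A           # (out, in)
        return self.base(x) + nn.functional.linear(x, delta)

    def sparsity_penalty(self):
        return self.coeff.abs().sum()                # keeps coefficients sparse

# Shared factors reused by every adapted layer in the model (rank 8 here).
in_dim, out_dim, rank = 512, 512, 8
A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
B = nn.Parameter(torch.randn(out_dim, rank) * 0.01)
layer = S2LoRALinearSketch(nn.Linear(in_dim, out_dim), A, B)
print(layer(torch.randn(2, in_dim)).shape)           # torch.Size([2, 512])
```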
Leveraging In-the-Wild Data for Effective Self-Supervised Pretraining in Speaker Recognition
results: The results show that DINO self-supervised pretraining combined with a confidence-based data filtering algorithm improves speaker recognition performance and scales well to large in-the-wild datasets. The approach also transfers well, improving supervised system performance on a different dataset.
Abstract
Current speaker recognition systems primarily rely on supervised approaches, constrained by the scale of labeled datasets. To boost the system performance, researchers leverage large pretrained models such as WavLM to transfer learned high-level features to the downstream speaker recognition task. However, this approach introduces extra parameters as the pretrained model remains in the inference stage. Another group of researchers directly apply self-supervised methods such as DINO to speaker embedding learning, yet they have not explored its potential on large-scale in-the-wild datasets. In this paper, we present the effectiveness of DINO training on the large-scale WenetSpeech dataset and its transferability in enhancing the supervised system performance on the CNCeleb dataset. Additionally, we introduce a confidence-based data filtering algorithm to remove unreliable data from the pretraining dataset, leading to better performance with less training data. The associated pretrained models, confidence files, pretraining and finetuning scripts will be made available in the Wespeaker toolkit.
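The confidence-based filtering step can be as simple as ranking utterances by a reliability score and dropping the tail before pretraining, as in the hypothetical sketch below; the actual confidence definition and threshold policy are not specified in the abstract and are not reproduced here.

```python
def filter_by_confidence(utterances, confidences, threshold=None, keep_fraction=0.8):
    """Simple confidence-based filtering in the spirit of the paper: drop the
    least reliable utterances from the pretraining pool before DINO training.
    `confidences` holds one score per utterance, higher = more reliable."""
    paired = sorted(zip(confidences, utterances), key=lambda p: p[0], reverse=True)
    if threshold is not None:
        return [u for c, u in paired if c >= threshold]
    n_keep = int(len(paired) * keep_fraction)
    return [u for _, u in paired[:n_keep]]

# Example with toy scores and hypothetical file names.
utts = ["spk1/utt1.wav", "spk2/utt7.wav", "spk3/utt2.wav", "spk4/utt9.wav"]
scores = [0.91, 0.45, 0.78, 0.12]
print(filter_by_confidence(utts, scores, keep_fraction=0.5))
```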