paper_authors: Gael Le Lan, Varun Nagaraja, Ernie Chang, David Kant, Zhaoheng Ni, Yangyang Shi, Forrest Iandola, Vikas Chandra
for: This paper aims to speed up inference for language-model-based music generation.
methods: The paper proposes a novel stack-and-delay decoding strategy to speed up auto-regressive decoding.
results: For the same inference-efficiency budget, the new strategy speeds up generation under GPU batching while achieving quality nearly on par with vanilla flat decoding.
Abstract
In language modeling based music generation, a generated waveform is represented by a sequence of hierarchical token stacks that can be decoded either in an auto-regressive manner or in parallel, depending on the codebook patterns. In particular, flattening the codebooks represents the highest quality decoding strategy, while being notoriously slow. To this end, we propose a novel stack-and-delay style of decoding strategy to improve upon the flat pattern decoding where generation speed is four times faster as opposed to vanilla flat decoding. This brings the inference time close to that of the delay decoding strategy, and allows for faster inference on GPU for small batch sizes. For the same inference efficiency budget as the delay pattern, we show that the proposed approach performs better in objective evaluations, almost closing the gap with the flat pattern in terms of quality. The results are corroborated by subjective evaluations which show that samples generated by the new model are slightly more often preferred to samples generated by the competing model given the same text prompts.
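To make the codebook patterns discussed above concrete, the toy sketch below contrasts the flat pattern (one token per auto-regressive step, K*T steps in total) with the delay pattern (codebook k shifted by k frames so a whole stack can be predicted per step). It is a minimal illustration of the pattern bookkeeping only, not the authors' stack-and-delay implementation; the array shapes and padding value are assumptions.

```python
import numpy as np

def flat_pattern(tokens):
    """Flatten a (K codebooks x T frames) token grid into one sequence:
    K*T auto-regressive steps, highest quality but slowest."""
    K, T = tokens.shape
    return [tokens[k, t] for t in range(T) for k in range(K)]

def delay_pattern(tokens, pad=-1):
    """Shift codebook k by k frames so all K tokens of a step are predicted
    in parallel: roughly T + K - 1 steps instead of K*T."""
    K, T = tokens.shape
    out = np.full((K, T + K - 1), pad, dtype=tokens.dtype)
    for k in range(K):
        out[k, k:k + T] = tokens[k]
    return out

grid = np.arange(4 * 6).reshape(4, 6)   # toy grid: 4 codebooks, 6 frames
print(flat_pattern(grid))
print(delay_pattern(grid))
```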
Music Source Separation Based on a Lightweight Deep Learning Framework (DTTNET: DUAL-PATH TFC-TDF UNET)
results: On 'vocals', DTTNet achieves 10.12 dB cSDR, 0.11 dB higher than Bandsplit RNN (BSRNN), with 86.7% fewer parameters.
Abstract
Music source separation (MSS) aims to extract 'vocals', 'drums', 'bass' and 'other' tracks from a piece of mixed music. While deep learning methods have shown impressive results, there is a trend toward larger models. In our paper, we introduce a novel and lightweight architecture called DTTNet, which is based on Dual-Path Module and Time-Frequency Convolutions Time-Distributed Fully-connected UNet (TFC-TDF UNet). DTTNet achieves 10.12 dB cSDR on 'vocals' compared to 10.01 dB reported for Bandsplit RNN (BSRNN) but with 86.7% fewer parameters. We also assess pattern-specific performance and model generalization for intricate audio patterns.
Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition
paper_authors: Mohammad Zeineldeen, Albert Zeyer, Ralf Schlüter, Hermann Ney
for: This paper proposes a streamable attention-based encoder-decoder model in which either the decoder, or both the encoder and decoder, operate on pre-defined, fixed-size windows called chunks.
results: Through experiments on Librispeech and TED-LIUM-v2, with consecutive sequences concatenated for long-form trials, the streamable model maintains performance competitive with the non-streamable variant and generalizes very well to long-form speech.
Abstract
We study a streamable attention-based encoder-decoder model in which either the decoder, or both the encoder and decoder, operate on pre-defined, fixed-size windows called chunks. A special end-of-chunk (EOC) symbol advances from one chunk to the next chunk, effectively replacing the conventional end-of-sequence symbol. This modification, while minor, situates our model as equivalent to a transducer model that operates on chunks instead of frames, where EOC corresponds to the blank symbol. We further explore the remaining differences between a standard transducer and our model. Additionally, we examine relevant aspects such as long-form speech generalization, beam size, and length normalization. Through experiments on Librispeech and TED-LIUM-v2, and by concatenating consecutive sequences for long-form trials, we find that our streamable model maintains competitive performance compared to the non-streamable variant and generalizes very well to long-form speech.
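As a rough illustration of the chunk-wise decoding loop described above, the sketch below emits tokens for one fixed-size chunk until a hypothetical EOC id is predicted, then advances to the next chunk. The ToyAED model, its interfaces, and the token ids are stand-ins, not the paper's architecture.

```python
import torch, torch.nn as nn

EOC = 0  # hypothetical id of the end-of-chunk symbol; replaces end-of-sequence

class ToyAED(nn.Module):
    """Stand-in for a chunked attention-based encoder-decoder."""
    def __init__(self, vocab=32, dim=16):
        super().__init__()
        self.enc = nn.Linear(8, dim)                  # 8-dim toy acoustic features
        self.emb = nn.Embedding(vocab, dim)
        self.out = nn.Linear(dim, vocab)

    def encode(self, chunk):                          # (frames, 8) -> (frames, dim)
        return self.enc(chunk)

    def step(self, enc, prev_token):                  # one greedy decoder step
        q = self.emb(prev_token)
        attn = torch.softmax(enc @ q, dim=0)          # toy cross-attention weights
        ctx = (attn.unsqueeze(-1) * enc).sum(0)
        return self.out(ctx + q)

def chunked_greedy_decode(model, chunks, bos=1, max_per_chunk=10):
    hyp, prev = [], torch.tensor(bos)
    for chunk in chunks:                              # pre-defined fixed-size windows
        enc = model.encode(chunk)
        for _ in range(max_per_chunk):
            token = int(model.step(enc, prev).argmax())
            prev = torch.tensor(token)
            if token == EOC:                          # EOC advances to the next chunk
                break
            hyp.append(token)
    return hyp

model = ToyAED()
chunks = [torch.randn(20, 8) for _ in range(3)]       # three 20-frame chunks
print(chunked_greedy_decode(model, chunks))
```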
Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech
results: Experimental results show that, compared to the baselines, the model achieves an improvement of more than 4 dB in SI-SNR across various overlap ratios, demonstrating more accurate target speaker extraction in sparsely overlapped scenarios.
Abstract
Target speaker extraction aims to extract the speech of a specific speaker from a multi-talker mixture as specified by an auxiliary reference. Most studies focus on the scenario where the target speech is highly overlapped with the interfering speech. However, this scenario only accounts for a small percentage of real-world conversations. In this paper, we aim at the sparsely overlapped scenarios in which the auxiliary reference needs to perform two tasks simultaneously: detect the activity of the target speaker and disentangle the active speech from any interfering speech. We propose an audio-visual speaker extraction model named ActiveExtract, which leverages speaking activity from audio-visual active speaker detection (ASD). The ASD directly provides the frame-level activity of the target speaker, while its intermediate feature representation is trained to discriminate speech-lip synchronization that could be used for speaker disentanglement. Experimental results show our model outperforms baselines across various overlapping ratios, achieving an average improvement of more than 4 dB in terms of SI-SNR.
Audio-free Prompt Tuning for Language-Audio Models
results: The method boosts CLAP's performance and training efficiency, transfers better than vanilla CLAP in zero-shot inference on unseen categories, and can adapt even when only the downstream class names are known.
Abstract
Contrastive Language-Audio Pretraining (CLAP) is pre-trained to associate audio features with human language, making it a natural zero-shot classifier to recognize unseen sound categories. To adapt CLAP to downstream tasks, prior works inevitably require labeled domain audios, which limits their scalability under data scarcity and deprives them of the capability to detect novel classes as the original CLAP. In this work, by leveraging the modality alignment in CLAP, we propose an efficient audio-free prompt tuning scheme aimed at optimizing a few prompt tokens from texts instead of audios, which regularizes the model space to avoid overfitting the seen classes as well. Based on this, a multi-grained prompt design is further explored to fuse global and local information. Experiments on several tasks demonstrate that our approach can boost the CLAP and outperform other training methods on model performance and training efficiency. While conducting zero-shot inference on unseen categories, it still shows better transferability than the vanilla CLAP. Moreover, our method is flexible enough even if only knowing the downstream class names. The code will be released soon.
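The sketch below shows one plausible form of audio-free prompt tuning: a few learnable prompt token embeddings are prepended to each class name and only those tokens are trained, relying on the text-audio alignment of a CLAP-like model. The text encoder, embedding shapes, and class inputs are placeholders rather than the actual CLAP API or the paper's multi-grained design.

```python
import torch, torch.nn as nn, torch.nn.functional as F

class PromptedTextClassifier(nn.Module):
    """Minimal sketch of audio-free prompt tuning: learnable prompt tokens are
    prepended to each class-name embedding; only the prompt is optimized,
    leaving the (frozen) text encoder and audio encoder untouched."""
    def __init__(self, text_encoder, class_token_embs, n_prompt=4, dim=512):
        super().__init__()
        self.text_encoder = text_encoder              # frozen stand-in encoder
        self.class_token_embs = class_token_embs      # list of (L_c, dim) tensors
        self.prompt = nn.Parameter(torch.randn(n_prompt, dim) * 0.02)

    def class_embeddings(self):
        embs = []
        for toks in self.class_token_embs:
            seq = torch.cat([self.prompt, toks], dim=0)   # [prompt; class name]
            embs.append(self.text_encoder(seq))           # -> (dim,) text feature
        return F.normalize(torch.stack(embs), dim=-1)

    def forward(self, audio_emb):                         # (B, dim) audio features
        return F.normalize(audio_emb, dim=-1) @ self.class_embeddings().T

# toy usage with stand-in components
enc = lambda seq: seq.mean(0)                             # trivial "text encoder"
classes = [torch.randn(3, 512), torch.randn(2, 512)]      # toy class token embeddings
model = PromptedTextClassifier(enc, classes)
print(model(torch.randn(5, 512)).shape)                   # (5, 2) similarity logits
```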
Semi-supervised Sound Event Detection with Local and Global Consistency Regularization
paper_authors: Yiming Li, Xiangdong Wang, Hong Liu, Rui Tao, Long Yan, Kazushige Ouchi
for: This paper aims to improve the performance of semi-supervised sound event detection.
methods: The paper uses a Local and Global Consistency (LGC) regularization scheme, including audio CutMix and a specially designed contrastive loss, to improve the model at both the label and the feature level.
results: Experiments show that LGC surpasses the baseline system under the same settings and can be combined with existing methods for further improvements.
Abstract
Learning meaningful frame-wise features on a partially labeled dataset is crucial to semi-supervised sound event detection. Prior works either maintain consistency on frame-level predictions or seek feature-level similarity among neighboring frames, which cannot exploit the potential of unlabeled data. In this work, we design a Local and Global Consistency (LGC) regularization scheme to enhance the model on both label- and feature-level. The audio CutMix is introduced to change the contextual information of clips. Then, the local consistency is adopted to encourage the model to leverage local features for frame-level predictions, and the global consistency is applied to force features to align with global prototypes through a specially designed contrastive loss. Experiments on the DESED dataset indicate the superiority of LGC, surpassing its respective competitors largely with the same settings as the baseline system. Besides, combining LGC with existing methods can obtain further improvements. The code will be released soon.
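The audio CutMix operation mentioned above can be pictured as swapping a time span between two clips together with their frame-level targets, which changes the clip-level context while keeping local frame labels attached to their audio. The sketch below is a generic frame-feature version under assumed (T, F) feature and (T, C) target shapes, not the authors' exact implementation or the LGC losses.

```python
import numpy as np

def audio_cutmix(feat_a, feat_b, target_a, target_b, rng=np.random):
    """Sketch of audio CutMix for SED: swap a random time span between two
    clips and swap the corresponding frame-level targets, changing the
    contextual information while keeping local labels consistent."""
    T = feat_a.shape[0]                     # (T, F) features, (T, C) targets
    span = rng.randint(1, T)
    start = rng.randint(0, T - span + 1)
    fa, fb = feat_a.copy(), feat_b.copy()
    ta, tb = target_a.copy(), target_b.copy()
    fa[start:start+span], fb[start:start+span] = feat_b[start:start+span], feat_a[start:start+span]
    ta[start:start+span], tb[start:start+span] = target_b[start:start+span], target_a[start:start+span]
    return (fa, ta), (fb, tb)

a, b = np.random.rand(100, 64), np.random.rand(100, 64)     # toy log-mel clips
ya, yb = np.zeros((100, 10)), np.ones((100, 10))            # toy frame targets
(mix_a, lab_a), (mix_b, lab_b) = audio_cutmix(a, b, ya, yb)
```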
The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction
results: Experimental results highlight that the AVTSE task is very challenging in real acoustic environments, and participants may encounter a range of difficulties.
Abstract
Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhancement challenges primarily focused on simulation data, the MISP 2023 challenge uniquely explores how front-end speech processing, combined with visual clues, impacts back-end tasks in real-world scenarios. This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the accuracy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments. This paper delivers a thorough overview of the task setting, dataset, and baseline system of the MISP 2023 challenge. It also includes an in-depth analysis of the challenges participants may encounter. The experimental results highlight the demanding nature of this task, and we look forward to the innovative solutions participants will bring forward.
Speech-dependent Modeling of Own Voice Transfer Characteristics for In-ear Microphones in Hearables
results: The proposed speech-dependent model simulates in-ear microphone recordings more accurately than a speech-independent model and generalizes better across talkers.
Abstract
Many hearables contain an in-ear microphone, which may be used to capture the own voice of its user in noisy environments. Since the in-ear microphone mostly records body-conducted speech due to ear canal occlusion, it suffers from band-limitation effects while only capturing a limited amount of external noise. To enhance the quality of the in-ear microphone signal using algorithms aiming at joint bandwidth extension, equalization, and noise reduction, it is desirable to have an accurate model of the own voice transfer characteristics between the entrance of the ear canal and the in-ear microphone. Such a model can be used, e.g., to simulate a large amount of in-ear recordings to train supervised learning-based algorithms. Since previous research on ear canal occlusion suggests that own voice transfer characteristics depend on speech content, in this contribution we propose a speech-dependent system identification model based on phoneme recognition. We assess the accuracy of simulating own voice speech by speech-dependent and speech-independent modeling and investigate how well modeling approaches are able to generalize to different talkers. Simulation results show that using the proposed speech-dependent model is preferable for simulating in-ear recordings compared to using a speech-independent model.
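One simple way to picture a speech-dependent (here, phoneme-dependent) own-voice model is to estimate a separate relative transfer function per phoneme in the STFT domain and apply it frame by frame when simulating in-ear recordings. The least-squares estimator below is an illustrative stand-in under assumed (F, T) STFT shapes; the paper's system identification model may be parameterized differently.

```python
import numpy as np

def phoneme_dependent_rtf(ref_stft, inear_stft, phoneme_ids, n_phonemes, eps=1e-8):
    """Estimate one relative transfer function per phoneme: for each frequency,
    a least-squares fit over the frames labelled with that phoneme.
    ref_stft/inear_stft are (F, T) complex STFTs, phoneme_ids is (T,)."""
    F, T = ref_stft.shape
    H = np.zeros((n_phonemes, F), dtype=complex)
    for p in range(n_phonemes):
        frames = phoneme_ids == p
        if frames.any():
            num = (np.conj(ref_stft[:, frames]) * inear_stft[:, frames]).sum(axis=1)
            den = (np.abs(ref_stft[:, frames]) ** 2).sum(axis=1) + eps
            H[p] = num / den
    return H

def simulate_inear(ref_stft, phoneme_ids, H):
    """Apply the phoneme-specific transfer function frame by frame."""
    return ref_stft * H[phoneme_ids].T          # (F, T)

# toy usage with random data standing in for real recordings
ref = np.random.randn(257, 200) + 1j * np.random.randn(257, 200)
ids = np.random.randint(0, 40, size=200)                    # toy phoneme labels
H = phoneme_dependent_rtf(ref, ref * 0.5, ids, n_phonemes=40)
sim = simulate_inear(ref, ids, H)
```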
Head-Related Transfer Function Interpolation with a Spherical CNN
results: Simulation results show that the proposed method accurately interpolates HRTFs from sparse measurements, outperforming the SH method and learning-based methods.
Abstract
Head-related transfer functions (HRTFs) are crucial for spatial soundfield reproduction in virtual reality applications. However, obtaining personalized, high-resolution HRTFs is a time-consuming and costly task. Recently, deep learning-based methods showed promise in interpolating high-resolution HRTFs from sparse measurements. Some of these methods treat HRTF interpolation as an image super-resolution task, which neglects spatial acoustic features. This paper proposes a spherical convolutional neural network method for HRTF interpolation. The proposed method realizes the convolution process by decomposing and reconstructing HRTF through the Spherical Harmonics (SHs). The SHs, an orthogonal function set defined on a sphere, allow the convolution layers to effectively capture the spatial features of HRTFs, which are sampled on a sphere. Simulation results demonstrate the effectiveness of the proposed method in achieving accurate interpolation from sparse measurements, outperforming the SH method and learning-based methods.
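For context, the sketch below implements the spherical-harmonics machinery the method builds on: fit SH coefficients to sparse HRTF measurements by regularized least squares and evaluate them at query directions. This is essentially the SH baseline the paper compares against; the proposed spherical CNN would instead learn convolutional filters operating on such SH-domain representations. Angle conventions, the maximum degree, and the regularization constant are assumptions.

```python
import numpy as np
from scipy.special import sph_harm

def sh_matrix(azi, pol, max_degree):
    """Complex spherical-harmonic basis evaluated at given directions.
    azi: azimuth in [0, 2pi), pol: polar angle in [0, pi]."""
    cols = [sph_harm(m, n, azi, pol)
            for n in range(max_degree + 1) for m in range(-n, n + 1)]
    return np.stack(cols, axis=1)                    # (n_dirs, (N+1)^2)

def sh_interpolate(hrtf, azi, pol, azi_q, pol_q, max_degree=4, reg=1e-3):
    """SH interpolation baseline: regularized least-squares fit of SH
    coefficients to sparse measurements, then evaluation at query points."""
    Y = sh_matrix(azi, pol, max_degree)              # (n_meas, K)
    A = Y.conj().T @ Y + reg * np.eye(Y.shape[1])
    coeffs = np.linalg.solve(A, Y.conj().T @ hrtf)   # (K,) or (K, n_freq)
    return (sh_matrix(azi_q, pol_q, max_degree) @ coeffs).real

rng = np.random.default_rng(0)
azi, pol = rng.uniform(0, 2 * np.pi, 50), rng.uniform(0, np.pi, 50)
hrtf = np.cos(pol) + 0.1 * rng.standard_normal(50)   # toy magnitude responses
azi_q, pol_q = rng.uniform(0, 2 * np.pi, 5), rng.uniform(0, np.pi, 5)
print(sh_interpolate(hrtf, azi, pol, azi_q, pol_q))
```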
One-Class Knowledge Distillation for Spoofing Speech Detection
results: Experiments show that the proposed one-class knowledge distillation method achieves better generalization and detection accuracy than other state-of-the-art methods on the ASVspoof 21DF and InTheWild datasets.
Abstract
The detection of spoofing speech generated by unseen algorithms remains an unresolved challenge. One reason for the lack of generalization ability is traditional detecting systems follow the binary classification paradigm, which inherently assumes the possession of prior knowledge of spoofing speech. One-class methods attempt to learn the distribution of bonafide speech and are inherently suited to the task where spoofing speech exhibits significant differences. However, training a one-class system using only bonafide speech is challenging. In this paper, we introduce a teacher-student framework to provide guidance for the training of a one-class model. The proposed one-class knowledge distillation method outperforms other state-of-the-art methods on the ASVspoof 21DF dataset and InTheWild dataset, which demonstrates its superior generalization ability.
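A minimal sketch of the teacher-student idea, assuming a frozen teacher embedding network and a student trained only on bonafide speech to imitate it, with the teacher-student discrepancy used as a spoofing score at test time. The specific losses and architectures in the paper may differ.

```python
import torch, torch.nn as nn, torch.nn.functional as F

def distillation_step(teacher, student, bonafide_batch, optimizer):
    """One training step: the student sees only bonafide speech and is trained
    to reproduce the (frozen) teacher's embeddings on it."""
    teacher.eval()
    with torch.no_grad():
        t_emb = teacher(bonafide_batch)                   # (B, D)
    s_emb = student(bonafide_batch)
    loss = F.mse_loss(s_emb, t_emb)                       # match on bonafide only
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def spoofing_score(teacher, student, utterance_batch):
    """Higher score = more likely spoofed (student fails to imitate teacher)."""
    with torch.no_grad():
        return (teacher(utterance_batch) - student(utterance_batch)).pow(2).mean(dim=1)

# toy usage with stand-in embedding networks on 40-dim features
teacher = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 32))
student = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 32))
opt = torch.optim.Adam(student.parameters(), 1e-3)
distillation_step(teacher, student, torch.randn(8, 40), opt)
print(spoofing_score(teacher, student, torch.randn(4, 40)))
```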
Improving Short Utterance Anti-Spoofing with AASIST2
results: Improves short utterance evaluation performance while maintaining regular evaluation performance across different datasets.
Abstract
The wav2vec 2.0 and integrated spectro-temporal graph attention network (AASIST) based countermeasure achieves great performance in speech anti-spoofing. However, current spoof speech detection systems have fixed training and evaluation durations, while the performance degrades significantly during short utterance evaluation. To solve this problem, AASIST can be improved to AASIST2 by modifying the residual blocks to Res2Net blocks. The modified Res2Net blocks can extract multi-scale features and improve the detection performance for speech of different durations, thus improving the short utterance evaluation performance. On the other hand, adaptive large margin fine-tuning (ALMFT) has achieved performance improvement in short utterance speaker verification. Therefore, we apply Dynamic Chunk Size (DCS) and ALMFT training strategies in speech anti-spoofing to further improve the performance of short utterance evaluation. Experiments demonstrate that the proposed AASIST2 improves the performance of short utterance evaluation while maintaining the performance of regular evaluation on different datasets.
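The Res2Net modification mentioned above splits a block's channels into several groups and filters them hierarchically, so a single block mixes multiple receptive-field sizes. The sketch below is a generic Res2Net-style block on (batch, channel, freq, time) features, not the exact AASIST2 block (which also involves the graph attention back-end and the DCS/ALMFT training strategies).

```python
import torch, torch.nn as nn

class Res2NetBlock(nn.Module):
    """Generic Res2Net-style block: split channels into `scale` groups and
    filter them hierarchically to extract multi-scale features."""
    def __init__(self, channels, scale=4):
        super().__init__()
        assert channels % scale == 0
        self.scale, width = scale, channels // scale
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1) for _ in range(scale - 1))
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU()

    def forward(self, x):                       # x: (B, C, F, T)
        chunks = torch.chunk(x, self.scale, dim=1)
        out, prev = [chunks[0]], None
        for i, conv in enumerate(self.convs):   # hierarchical multi-scale path
            y = chunks[i + 1] if prev is None else chunks[i + 1] + prev
            prev = self.act(conv(y))
            out.append(prev)
        return self.act(self.bn(torch.cat(out, dim=1)) + x)   # residual connection

x = torch.randn(2, 32, 40, 100)                 # (batch, channels, freq, time)
print(Res2NetBlock(32)(x).shape)
```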
Improving Voice Conversion for Dissimilar Speakers Using Perceptual Losses
paper_authors: Suhita Ghosh, Yamini Sinha, Ingo Siegert, Sebastian Stober
for: Protecting user privacy and data security.
methods: Using voice conversion to anonymise speech data.
results: Successfully conceals the source speaker of the speech data.
Abstract
The rising trend of using voice as a means of interacting with smart devices has sparked worries over the protection of users' privacy and data security. These concerns have become more pressing, especially after the European Union's adoption of the General Data Protection Regulation (GDPR). The information contained in an utterance encompasses critical personal details about the speaker, such as their age, gender, socio-cultural origins and more. If there is a security breach and the data is compromised, attackers may utilise the speech data to circumvent the speaker verification systems or imitate authorised users. Therefore, it is pertinent to anonymise the speech data before being shared across devices, such that the source speaker of the utterance cannot be traced. Voice conversion (VC) can be used to achieve speech anonymisation, which involves altering the speaker's characteristics while preserving the linguistic content.
TF-SepNet: An Efficient 1D Kernel Design in CNNs for Low-Complexity Acoustic Scene Classification
results: Experiments on the TAU Urban Acoustic Scene 2022 Mobile development dataset show that TF-SepNet outperforms similar state-of-the-art methods that use consecutive kernels. A further investigation reveals that the separate kernels give TF-SepNet a larger effective receptive field (ERF), enabling it to capture more time-frequency features.
Abstract
Recent studies focus on developing efficient systems for acoustic scene classification (ASC) using convolutional neural networks (CNNs), which typically consist of consecutive kernels. This paper highlights the benefits of using separate kernels as a more powerful and efficient design approach in ASC tasks. Inspired by the time-frequency nature of audio signals, we propose TF-SepNet, a CNN architecture that separates the feature processing along the time and frequency dimensions. Features resulting from the separate paths are then merged by channels and directly forwarded to the classifier. Instead of the conventional two dimensional (2D) kernel, TF-SepNet incorporates one dimensional (1D) kernels to reduce the computational costs. Experiments have been conducted using the TAU Urban Acoustic Scene 2022 Mobile development dataset. The results show that TF-SepNet outperforms similar state-of-the-arts that use consecutive kernels. A further investigation reveals that the separate kernels lead to a larger effective receptive field (ERF), which enables TF-SepNet to capture more time-frequency features.
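A minimal sketch of the separate-kernel idea: one path convolves with a (k, 1) kernel along frequency, another with a (1, k) kernel along time, and the two paths are merged by channel concatenation. The channel split and the use of BatchNorm/ReLU are assumptions; the full TF-SepNet architecture contains more than this single block.

```python
import torch, torch.nn as nn

class TFSeparateBlock(nn.Module):
    """Process features with 1D kernels along the frequency axis and the time
    axis in two separate paths, then merge the paths by channel concatenation
    (cheaper than a full 2D kernel, with a larger effective receptive field)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        half = out_ch // 2
        self.freq_path = nn.Sequential(                    # kernel spans frequency only
            nn.Conv2d(in_ch, half, (k, 1), padding=(k // 2, 0)),
            nn.BatchNorm2d(half), nn.ReLU())
        self.time_path = nn.Sequential(                    # kernel spans time only
            nn.Conv2d(in_ch, half, (1, k), padding=(0, k // 2)),
            nn.BatchNorm2d(half), nn.ReLU())

    def forward(self, x):                                  # x: (B, C, F, T)
        return torch.cat([self.freq_path(x), self.time_path(x)], dim=1)

x = torch.randn(4, 16, 128, 64)
print(TFSeparateBlock(16, 32)(x).shape)                    # (4, 32, 128, 64)
```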
Controllable Residual Speaker Representation for Voice Conversion
paper_authors: Le Xu, Jiangyan Yi, Jianhua Tao, Tao Wang, Yong Ren, Rongxiu Zhong
for: Improving the quality and robustness of voice conversion.
methods: Using tokens of multi-layer residual approximations to improve robustness to unseen speakers and enable effective control over timbre.
results: Outperforms the baselines in both subjective and objective evaluations, demonstrating superior performance and increased robustness.
Abstract
Recently, there have been significant advancements in voice conversion, resulting in high-quality performance. However, there are still two critical challenges in this field. Firstly, current voice conversion methods have limited robustness when encountering unseen speakers. Secondly, they also have limited ability to control timbre representation. To address these challenges, this paper presents a novel approach that leverages tokens of multi-layer residual approximations to enhance robustness when dealing with unseen speakers, called the residual speaker module. The introduction of multi-layer approximations facilitates the separation of information from the timbre, enabling effective control over timbre in voice conversion. The proposed method outperforms baselines in both subjective and objective evaluations, demonstrating superior performance and increased robustness. Our demo page is publicly available.
RVAE-EM: Generative speech dereverberation based on recurrent variational auto-encoder and convolutive transfer function
results: Experiments on single-channel speech dereverberation show that the proposed generative method noticeably outperforms advanced discriminative networks.
Abstract
In indoor scenes, reverberation is a crucial factor in degrading the perceived quality and intelligibility of speech. In this work, we propose a generative dereverberation method. Our approach is based on a probabilistic model utilizing a recurrent variational auto-encoder (RVAE) network and the convolutive transfer function (CTF) approximation. Different from most previous approaches, the output of our RVAE serves as the prior of the clean speech. And our target is the maximum a posteriori (MAP) estimation of clean speech, which is achieved iteratively through the expectation maximization (EM) algorithm. The proposed method integrates the capabilities of network-based speech prior modelling and CTF-based observation modelling. Experiments on single-channel speech dereverberation show that the proposed generative method noticeably outperforms the advanced discriminative networks.
Fine-tune the pretrained ATST model for sound event detection
results: Experiments show that the proposed fine-tuning method overcomes the overfitting problem of the large pretrained network and achieves new state-of-the-art (SOTA) results of 0.587/0.812 PSDS1/PSDS2 scores on the DCASE challenge task 4 dataset.
Abstract
Sound event detection (SED) often suffers from the data deficiency problem. The recent baseline system in the DCASE2023 challenge task 4 leverages the large pretrained self-supervised learning (SelfSL) models to mitigate such restriction, where the pretrained models help to produce more discriminative features for SED. However, the pretrained models are regarded as a frozen feature extractor in the challenge baseline system and most of the challenge submissions, and fine-tuning of the pretrained models has been rarely studied. In this work, we study the fine-tuning method of the pretrained models for SED. We first introduce ATST-Frame, our newly proposed SelfSL model, to the SED system. ATST-Frame was especially designed for learning frame-level representations of audio signals and obtained state-of-the-art (SOTA) performances on a series of downstream tasks. We then propose a fine-tuning method for ATST-Frame using both (in-domain) unlabelled and labelled SED data. Our experiments show that, the proposed method overcomes the overfitting problem when fine-tuning the large pretrained network, and our SED system obtains new SOTA results of 0.587/0.812 PSDS1/PSDS2 scores on the DCASE challenge task 4 dataset.
t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability
results: The t-SOT FNT model achieves performance comparable to the original t-SOT model while retaining the ability to reduce word error rate (WER) on both single- and multi-talker datasets through text-only adaptation.
Abstract
Token-level serialized output training (t-SOT) was recently proposed to address the challenge of streaming multi-talker automatic speech recognition (ASR). T-SOT effectively handles overlapped speech by representing multi-talker transcriptions as a single token stream with $\langle \text{cc}\rangle$ symbols interspersed. However, the use of a naive neural transducer architecture significantly constrained its applicability for text-only adaptation. To overcome this limitation, we propose a novel t-SOT model structure that incorporates the idea of factorized neural transducers (FNT). The proposed method separates a language model (LM) from the transducer's predictor and handles the unnatural token order resulting from the use of $\langle \text{cc}\rangle$ symbols in t-SOT. We achieve this by maintaining multiple hidden states and introducing special handling of the $\langle \text{cc}\rangle$ tokens within the LM. The proposed t-SOT FNT model achieves comparable performance to the original t-SOT model while retaining the ability to reduce word error rate (WER) on both single and multi-talker datasets through text-only adaptation.
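The special handling of the $\langle \text{cc}\rangle$ tokens can be pictured with a toy LM that keeps one hidden state per virtual output channel and switches the active state whenever a channel-change token appears, so each channel's LM sees a natural word order. The GRU-based model and token ids below are stand-ins, not the factorized neural transducer used in the paper.

```python
import torch, torch.nn as nn

CC = 0  # hypothetical id of the <cc> channel-change token

class TwoChannelLM(nn.Module):
    """Toy LM for a serialized two-talker stream: one hidden state per virtual
    output channel; a <cc> token switches the active channel instead of being
    fed to the LM, keeping each channel's token order natural."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRUCell(dim, dim)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens):                       # tokens: 1D LongTensor
        states = [torch.zeros(1, self.rnn.hidden_size) for _ in range(2)]
        active, logits = 0, []
        for tok in tokens:
            if int(tok) == CC:                       # switch channel, keep both states
                active = 1 - active
                continue
            states[active] = self.rnn(self.emb(tok).unsqueeze(0), states[active])
            logits.append(self.out(states[active]))
        return torch.cat(logits, dim=0)

lm = TwoChannelLM()
stream = torch.tensor([5, 7, CC, 9, CC, 8, 3])       # serialized two-talker stream
print(lm(stream).shape)                              # (5, vocab): one step per word token
```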
Diversity-based core-set selection for text-to-speech with linguistic and acoustic features
methods: Building a large-scale TTS corpus from diverse sources such as audiobooks and YouTube, then selecting a core subset (known as the core-set) on the basis of a diversity metric that measures the degree to which a subset encompasses a wide range.
results: Across languages and corpus sizes, the proposed method performs significantly better than the baseline phoneme-balanced data selection.
Abstract
This paper proposes a method for extracting a lightweight subset from a text-to-speech (TTS) corpus ensuring synthetic speech quality. In recent years, methods have been proposed for constructing large-scale TTS corpora by collecting diverse data from massive sources such as audiobooks and YouTube. Although these methods have gained significant attention for enhancing the expressive capabilities of TTS systems, they often prioritize collecting vast amounts of data without considering practical constraints like storage capacity and computation time in training, which limits the available data quantity. Consequently, the need arises to efficiently collect data within these volume constraints. To address this, we propose a method for selecting the core subset (known as the core-set) from a TTS corpus on the basis of a diversity metric, which measures the degree to which a subset encompasses a wide range. Experimental results demonstrate that our proposed method performs significantly better than the baseline phoneme-balanced data selection across language and corpus size.
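As one concrete instantiation of selecting a subset that "encompasses a wide range", the sketch below runs greedy farthest-point sampling over utterance feature vectors (linguistic and/or acoustic embeddings). The paper's diversity metric and selection procedure may differ; this is only meant to make the core-set idea tangible.

```python
import numpy as np

def greedy_coreset(features, k, seed=0):
    """Greedily pick the utterance farthest from everything already selected
    (farthest-point sampling), a simple way to maximise the diversity of the
    chosen subset. `features`: (N, D) utterance embeddings."""
    rng = np.random.default_rng(seed)
    n = len(features)
    selected = [int(rng.integers(n))]
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < k:
        nxt = int(min_dist.argmax())                  # most "novel" utterance so far
        selected.append(nxt)
        min_dist = np.minimum(min_dist,
                              np.linalg.norm(features - features[nxt], axis=1))
    return selected

feats = np.random.rand(1000, 64)                      # toy utterance embeddings
print(greedy_coreset(feats, 10))
```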
Foundation Model Assisted Automatic Speech Emotion Recognition: Transcribing, Annotating, and Augmenting
results: The study finds that combining outputs from multiple foundation models improves the quality of emotion annotations and makes it feasible to augment existing speech emotion datasets.
Abstract
Significant advances are being made in speech emotion recognition (SER) using deep learning models. Nonetheless, training SER systems remains challenging, requiring both time and costly resources. Like many other machine learning tasks, acquiring datasets for SER requires substantial data annotation efforts, including transcription and labeling. These annotation processes present challenges when attempting to scale up conventional SER systems. Recent developments in foundational models have had a tremendous impact, giving rise to applications such as ChatGPT. These models have enhanced human-computer interactions including bringing unique possibilities for streamlining data collection in fields like SER. In this research, we explore the use of foundational models to assist in automating SER from transcription and annotation to augmentation. Our study demonstrates that these models can generate transcriptions to enhance the performance of SER systems that rely solely on speech data. Furthermore, we note that annotating emotions from transcribed speech remains a challenging task. However, combining outputs from multiple LLMs enhances the quality of annotations. Lastly, our findings suggest the feasibility of augmenting existing speech emotion datasets by annotating unlabeled speech samples.
Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context
results: The authors build and evaluate baseline systems on Libriheavy using the popular CTC-Attention and transducer models, and open-source the dataset creation pipeline, which can also be used for other audio alignment tasks.
Abstract
In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is the largest freely-available corpus of speech with supervisions. Different from other open-sourced datasets that only provide normalized transcriptions, Libriheavy contains richer information such as punctuation, casing and text context, which brings more flexibility for system building. Specifically, we propose a general and efficient pipeline to locate, align and segment the audios in previously published Librilight to its corresponding texts. The same as Librilight, Libriheavy also has three training subsets, small, medium and large, of sizes 500h, 5000h and 50000h respectively. We also extract the dev and test evaluation sets from the aligned audios and guarantee there is no overlapping speakers and books in training sets. Baseline systems are built on the popular CTC-Attention and transducer models. Additionally, we open-source our dataset creation pipeline which can also be used for other audio alignment tasks.
SSL-Net: A Synergistic Spectral and Learning-based Network for Efficient Bird Sound Classification
results: Encouraging results on a standard field-collected bird audio dataset show that the method extracts features efficiently and achieves strong bird sound classification performance even with limited sample sizes; three feature fusion strategies are also presented to aid engineers and researchers in their selection.
Abstract
Efficient and accurate bird sound classification is important for ecology, habitat protection and scientific research, as it plays a central role in monitoring the distribution and abundance of species. However, prevailing methods typically demand extensively labeled audio datasets and have highly customized frameworks, imposing substantial computational and annotation loads. In this study, we present an efficient and general framework called SSL-Net, which combines spectral and learned features to identify different bird sounds. Encouraging empirical results gleaned from a standard field-collected bird audio dataset validate the efficacy of our method in extracting features efficiently and achieving heightened performance in bird sound classification, even when working with limited sample sizes. Furthermore, we present three feature fusion strategies, aiding engineers and researchers in their selection through quantitative analysis.
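The abstract mentions three feature fusion strategies without naming them; the sketch below shows three generic ways to fuse a spectral branch with a learned-feature branch (concatenation, summation, and a gated combination) purely as illustrative options, not necessarily the strategies evaluated in the paper.

```python
import torch, torch.nn as nn

class FeatureFusion(nn.Module):
    """Generic fusion of a spectral feature vector with a learned (pretrained)
    feature vector; the actual strategies in the paper may differ."""
    def __init__(self, dim, mode="concat"):
        super().__init__()
        self.mode = mode
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, spectral, learned):            # both (B, dim)
        if self.mode == "concat":
            return torch.cat([spectral, learned], dim=-1)
        if self.mode == "sum":
            return spectral + learned
        g = self.gate(torch.cat([spectral, learned], dim=-1))
        return g * spectral + (1 - g) * learned      # gated ("attention") fusion

s, l = torch.randn(8, 128), torch.randn(8, 128)
for mode in ("concat", "sum", "gate"):
    print(mode, FeatureFusion(128, mode)(s, l).shape)
```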