cs.SD - 2023-08-10

Stabilizing Training with Soft Dynamic Time Warping: A Case Study for Pitch Class Estimation with Weakly Aligned Targets

  • paper_url: http://arxiv.org/abs/2308.05429
  • repo_url: https://github.com/groupmm/stabilizing_sdtw
  • paper_authors: Johannes Zeitler, Simon Deniffel, Michael Krause, Meinard Müller
  • for: This paper aims to stabilize the training of neural networks on weakly aligned target data.
  • methods: The paper uses a soft dynamic time warping (SDTW) loss and investigates three conceptually different stabilization strategies to counter the incorrect parameter updates caused by mismatches between the estimated soft alignments and the reference alignments in the early training stage (a minimal SDTW sketch follows this entry).
  • results: With the three stabilization strategies, network training becomes markedly more stable, and the experiments confirm their effectiveness.
    Abstract Soft dynamic time warping (SDTW) is a differentiable loss function that allows for training neural networks from weakly aligned data. Typically, SDTW is used to iteratively compute and refine soft alignments that compensate for temporal deviations between the training data and its weakly annotated targets. One major problem is that a mismatch between the estimated soft alignments and the reference alignments in the early training stage leads to incorrect parameter updates, making the overall training procedure unstable. In this paper, we investigate such stability issues by considering the task of pitch class estimation from music recordings as an illustrative case study. In particular, we introduce and discuss three conceptually different strategies (a hyperparameter scheduling, a diagonal prior, and a sequence unfolding strategy) with the objective of stabilizing intermediate soft alignment results. Finally, we report on experiments that demonstrate the effectiveness of the strategies and discuss efficiency and implementation issues.
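To make the SDTW loss concrete, here is a minimal NumPy sketch of the soft-DTW recursion together with one plausible form of a diagonal prior that biases alignment paths toward the diagonal. This is an illustrative sketch, not the authors' implementation (see the linked repository); the `diag_weight` penalty and the toy pitch-class data are assumptions made for the example.

```python
import numpy as np

def softmin(a, b, c, gamma):
    """Smooth minimum used in the soft-DTW recursion (gamma > 0)."""
    vals = np.array([a, b, c]) / -gamma
    m = vals.max()
    return -gamma * (m + np.log(np.exp(vals - m).sum()))

def sdtw_loss(cost, gamma=1.0, diag_weight=0.0):
    """Soft-DTW value of a pairwise cost matrix, optionally with a
    diagonal prior that penalizes cells far from the main diagonal."""
    n, m = cost.shape
    if diag_weight > 0.0:
        # Hypothetical diagonal prior: penalize each cell by its normalized
        # distance to the diagonal (the paper's exact form may differ).
        i = np.arange(n)[:, None] / max(n - 1, 1)
        j = np.arange(m)[None, :] / max(m - 1, 1)
        cost = cost + diag_weight * np.abs(i - j)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = cost[i - 1, j - 1] + softmin(
                R[i - 1, j - 1], R[i - 1, j], R[i, j - 1], gamma
            )
    return R[n, m]

# Toy usage: cost between predicted pitch-class activations and weakly
# aligned targets of a different length.
pred = np.random.rand(50, 12)     # 50 frames, 12 pitch classes
target = np.random.rand(20, 12)   # 20 target frames
cost = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(-1)
print(sdtw_loss(cost, gamma=0.1, diag_weight=1.0))
```

Scheduling a quantity such as `gamma` or `diag_weight` over training epochs is one way the paper's hyperparameter scheduling strategy could interact with a loss of this form.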

A Novel Self-training Approach for Low-resource Speech Recognition

  • paper_url: http://arxiv.org/abs/2308.05269
  • repo_url: None
  • paper_authors: Satwinder Singh, Feng Hou, Ruili Wang
  • for: Improving the accuracy of automatic speech recognition (ASR) for low-resource languages.
  • methods: A self-training approach that uses unlabeled low-resource speech to generate highly accurate pseudo-labels, thereby improving the accuracy of the ASR system (a minimal pseudo-labeling sketch follows this entry).
  • results: Experimental analysis shows that the approach reduces the word error rate on four real speech datasets, achieving a 14.94% relative improvement over a baseline model, and achieves the best results on the Common Voice Punjabi dataset.
    Abstract In this paper, we propose a self-training approach for automatic speech recognition (ASR) for low-resource settings. While self-training approaches have been extensively developed and evaluated for high-resource languages such as English, their applications to low-resource languages like Punjabi have been limited, despite the language being spoken by millions globally. The scarcity of annotated data has hindered the development of accurate ASR systems, especially for low-resource languages (e.g., Punjabi and Māori languages). To address this issue, we propose an effective self-training approach that generates highly accurate pseudo-labels for unlabeled low-resource speech. Our experimental analysis demonstrates that our approach significantly improves word error rate, achieving a relative improvement of 14.94% compared to a baseline model across four real speech datasets. Further, our proposed approach reports the best results on the Common Voice Punjabi dataset.
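As a rough illustration of the self-training idea, the sketch below generates a pseudo-label for one unlabeled utterance with a greedy CTC decode and a simple confidence filter. The wav2vec 2.0 checkpoint, the `transformers` API usage, and the confidence heuristic are assumptions made for the example, not the authors' method.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Any CTC-based ASR checkpoint could stand in here.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

def pseudo_label(waveform, sampling_rate=16000, min_confidence=0.9):
    """Return (transcript, confidence) for one unlabeled utterance,
    or None if the greedy decode is not confident enough."""
    inputs = processor(waveform, sampling_rate=sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits        # (1, T, vocab)
    probs = logits.softmax(dim=-1)
    conf = probs.max(dim=-1).values.mean().item()         # crude confidence
    if conf < min_confidence:
        return None                                        # drop noisy label
    ids = probs.argmax(dim=-1)
    text = processor.batch_decode(ids)[0]
    return text, conf

# Accepted (utterance, transcript) pairs would then be merged with the small
# labeled set and the model fine-tuned; repeating this cycle is the essence
# of self-training.
```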

Separate Anything You Describe

  • paper_url: http://arxiv.org/abs/2308.05037
  • repo_url: https://github.com/audio-agi/audiosep
  • paper_authors: Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang
  • for: This work develops a computational auditory scene analysis (CASA) system driven by natural language queries, which separates a target sound from an audio mixture.
  • methods: The AudioSep foundation model is trained on large-scale multimodal datasets and evaluated extensively on tasks including audio event separation, musical instrument separation, and speech enhancement (an illustrative text-conditioned masking sketch follows this entry).
  • results: AudioSep demonstrates strong separation performance and zero-shot generalization when queried with audio captions or text labels, clearly outperforming previous audio-queried and language-queried sound separation models.
    Abstract Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events), are unable to separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models. For reproducibility of this work, we will release the source code, evaluation benchmark and pre-trained model at: https://github.com/Audio-AGI/AudioSep.
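To illustrate the language-queried separation (LASS) paradigm, the following hypothetical PyTorch sketch conditions a mask-estimation network on a text-query embedding (as might come from a CLAP-style text encoder) and applies the predicted mask to the mixture spectrogram. The architecture, layer sizes, and FiLM-style conditioning are illustrative assumptions, not AudioSep's actual design.

```python
import torch
import torch.nn as nn

class TextConditionedSeparator(nn.Module):
    def __init__(self, n_freq=513, text_dim=512, hidden=256):
        super().__init__()
        self.film = nn.Linear(text_dim, 2 * hidden)    # scale and shift
        self.enc = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        self.dec = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_mag, text_emb):
        """mix_mag: (batch, time, freq) magnitude spectrogram of the mixture;
        text_emb: (batch, text_dim) embedding of the natural language query."""
        h = self.enc(mix_mag)                            # (B, T, hidden)
        scale, shift = self.film(text_emb).chunk(2, dim=-1)
        h = h * scale.unsqueeze(1) + shift.unsqueeze(1)  # condition on text
        mask = self.dec(h)                               # time-frequency mask
        return mask * mix_mag                            # estimated target

# Toy usage with random tensors standing in for real features.
model = TextConditionedSeparator()
mix = torch.rand(2, 100, 513)    # 2 mixtures, 100 frames, 513 frequency bins
query = torch.randn(2, 512)      # embeddings of two text queries
est = model(mix, query)
print(est.shape)                 # torch.Size([2, 100, 513])
```

For the authors' actual model, code, and pre-trained weights, refer to the repository linked above.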