cs.SD - 2023-09-14

DDSP-SFX: Acoustically-guided sound effects generation with differentiable digital signal processing

paper_url: http://arxiv.org/abs/2309.08060
repo_url: None
paper_authors: Yunyi Liu, Craig Jin, David Gunawan
for: 该论文旨在控制声音效果的变化，使用神经音频合成模型。
methods: 该模型基于DDSP架构，利用预处理的音频特征和数字合成器实现高质量的声音合成，同时允许用户轻松控制声音特征的变化。
results: 该模型可以实现高评价的声音变化控制，并且可以通过声音导向来实现时间特征模拟。

Abstract
Controlling the variations of sound effects using neural audio synthesis models has been a difficult task. Differentiable digital signal processing (DDSP) provides a lightweight solution that achieves high-quality sound synthesis while enabling deterministic acoustic attribute control by incorporating pre-processed audio features and digital synthesizers. In this research, we introduce DDSP-SFX, a model based on the DDSP architecture capable of synthesizing high-quality sound effects while enabling users to control the timbre variations easily. We propose a transient modelling technique with higher objective evaluation scores and subjective ratings over impulsive signals (footsteps, gunshots). We propose a simple method that achieves timbre variation control while also allowing deterministic attribute control. We further qualitatively show the timbre transfer performance using voice as the guiding sound.

摘要
控制声音效果的变化使用神经音频合成模型是一项具有挑战性的任务。可 diferenciable digital signal processing（DDSP）提供了一种轻量级的解决方案，可以实现高质量的声音合成，同时允许用户 deterministic 控制声音特性。在这项研究中，我们介绍了 DDSP-SFX 模型，可以同时实现高质量的声音效果和用户容易控制声音特性的变化。我们提出了一种快速模拟技术，对于快速冲击信号（脚步、枪声）有更高的对象评价分数和主观评分。我们还提出了一种简单的方法，可以实现声音特性的变化控制，同时允许 deterministic 控制声音特性。最后，我们质量地表示了声音传输性能，使用语音作为引导声。

VoicePAT: An Efficient Open-source Evaluation Toolkit for Voice Privacy Research

paper_url: http://arxiv.org/abs/2309.08049
repo_url: https://github.com/digitalphonetics/voicepat
paper_authors: Sarina Meyer, Xiaoxiao Miao, Ngoc Thang Vu
for: 本研究旨在提供一个高效的话语隐藏和评估框架，以便比较和结合不同的隐藏方法。
methods: 本研究使用了一个模块化和易扩展的结构，几乎完全使用Python进行实现。该框架可以同时进行多种隐藏方法的并行运算，并且提供了与不同技术间的交互功能。
results: 本研究所得到的结果显示，使用修改后的评估方法可以大幅降低评估时间，比如65%-95%，具体取决于评估指标。此外，研究人员还提供了开源代码。

Abstract
Speaker anonymization is the task of modifying a speech recording such that the original speaker cannot be identified anymore. Since the first Voice Privacy Challenge in 2020, along with the release of a framework, the popularity of this research topic is continually increasing. However, the comparison and combination of different anonymization approaches remains challenging due to the complexity of evaluation and the absence of user-friendly research frameworks. We therefore propose an efficient speaker anonymization and evaluation framework based on a modular and easily extendable structure, almost fully in Python. The framework facilitates the orchestration of several anonymization approaches in parallel and allows for interfacing between different techniques. Furthermore, we propose modifications to common evaluation methods which make the evaluation more powerful and reduces their computation time by 65 to 95\%, depending on the metric. Our code is fully open source.

摘要
干扰者隐藏是修改语音录音以使原始发言人无法识别的任务。自2020年的第一届语音隐私挑战以来，这个研究主题的 популяр度一直在不断增长。然而，对不同隐藏方法的比较和组合仍然具有较高的复杂性和评价方法的缺失，这使得研究者难以进行效果的比较和结合。我们因此提出了一个高效的干扰者隐藏和评价框架，基于模块化和扩展性强的结构，几乎完全基于Python编程语言。该框架可以同时实现多种隐藏方法的并行执行，并且支持不同技术的交互。此外，我们还提出了一些改进的评价方法，使评价更加强大，同时降低了计算时间，对各种指标的降低为65%-95%。我们的代码完全开源。

Comparative Assessment of Markov Models and Recurrent Neural Networks for Jazz Music Generation

paper_url: http://arxiv.org/abs/2309.08027
repo_url: None
paper_authors: Conrad Hsu, Ross Greer
for: Comparing the performance of a simple Markov chain model and a recurrent neural network (RNN) model in jazz music improvisation.
methods: Using transcriptions of jazz blues choruses from professional jazz players to train both models, and using musical jazz seeds to give the model context.
results: The RNN outperforms the Markov model on both metrics (groove pattern similarity and pitch class histogram entropy), indicating better rhythmic consistency and tonal stability in the generated music.

Abstract
As generative models have risen in popularity, a domain that has risen alongside is generative models for music. Our study aims to compare the performance of a simple Markov chain model and a recurrent neural network (RNN) model, two popular models for sequence generating tasks, in jazz music improvisation. While music, especially jazz, remains subjective in telling whether a composition is "good" or "bad", we aim to quantify our results using metrics of groove pattern similarity and pitch class histogram entropy. We trained both models using transcriptions of jazz blues choruses from professional jazz players, and also fed musical jazz seeds to help give our model some context in beginning the generation. Our results show that the RNN outperforms the Markov model on both of our metrics, indicating better rhythmic consistency and tonal stability in the generated music. Through the use of music21 library, we tokenized our jazz dataset into pitches and durations that our model could interpret and train on. Our findings contribute to the growing field of AI-generated music, highlighting the important use of metrics to assess generation quality. Future work includes expanding the dataset of MIDI files to a larger scale, conducting human surveys for subjective evaluations, and incorporating additional metrics to address the challenge of subjectivity in music evaluation. Our study provides valuable insight into the use of recurrent neural networks for sequential based tasks like generating music.

摘要
为了比较Markov链模型和循环神经网络（RNN）模型在爵士乐音乐创作中的表现，我们进行了一项研究。尽管音乐，特别是爵士乐，是主观的，我们使用了乐谱符号相似性和抽象度量来衡量结果。我们使用了music21库将爵士乐谱例转化为普通的音频文件，然后训练了两个模型。我们发现RNN模型在两个指标上表现得更好，表示它在生成的音乐中保持了更好的节奏一致性和听觉稳定性。我们的发现对于AI生成音乐领域的发展具有重要意义，并且将为将来的人工智能音乐创作做出贡献。Note: The translation is in Simplified Chinese, which is the standard writing system used in mainland China. If you prefer Traditional Chinese, I can provide that as well.

Efficient Face Detection with Audio-Based Region Proposals

paper_url: http://arxiv.org/abs/2309.08005
repo_url: None
paper_authors: William Aris, François Grondin
for: 这 paper 是为了提高机器人视觉系统的计算效率，减少图像质量下降的影响。
methods: 这 paper 使用了一种新的注意力机制，以便通过音频来生成optical图像中的区域关注点。
results: 该注意力机制可以减少计算负担，并且可以方便地适应人机交互、机器人监测、视频会议或智能眼镜等场景。

Abstract
Robot vision often involves a large computational load due to large images to process in a short amount of time. Existing solutions often involve reducing image quality which can negatively impact processing. Another approach is to generate regions of interest with expensive vision algorithms. In this paper, we evaluate how audio can be used to generate regions of interest in optical images. To achieve this, we propose a unique attention mechanism to localize speech sources and evaluate its impact on a face detection algorithm. Our results show that the attention mechanism reduces the computational load. The proposed pipeline is flexible and can be easily adapted for human-robot interactions, robot surveillance, video-conferences or smart glasses.

摘要
署 robot 视觉通常会面临大量计算压力，因为需要在短时间内处理大量图像。现有的解决方案经常包括降低图像质量，这会对处理造成负面影响。在这篇论文中，我们评估了如何使用音频来生成optical图像中的区域兴趣点。为此，我们提出了一种特殊的注意机制，用于本地化speech 源，并评估其对人脸检测算法的影响。我们的结果表明，注意机制可以降低计算压力。我们提出的管道可满足人机交互、机器人监控、视频会议或智能眼镜等应用。

EMOCONV-DIFF: Diffusion-based Speech Emotion Conversion for Non-parallel and In-the-wild Data

paper_url: http://arxiv.org/abs/2309.07828
repo_url: None
paper_authors: Navin Raj Prabhu, Bunlong Lay, Simon Welker, Nale Lehmann-Willenbrock, Timo Gerkmann
for: 本研究主要针对听众语音中的情感转换任务，即将表达的情感转换为目标情感，保持语音内容和说话人身份不变。
methods: 本文提出一种基于扩散的生成模型，名为EmoConv-Diff，用于实现情感转换。该模型在训练时会重建输入语音，同时Conditioning on its emotion。在推理阶段，使用目标情感嵌入来转换输入语音的情感。与其他研究不同的是，本文不使用平行数据，而是使用大量在野数据进行训练。
results: 经验表明，提出的扩散模型可以成功地Synthesize speech with a controllable target emotion。此外，该方法还能够在情感值的极端值下表现出色，因此可以解决在语音情感转换领域的一个常见问题。

Abstract
Speech emotion conversion is the task of converting the expressed emotion of a spoken utterance to a target emotion while preserving the lexical content and speaker identity. While most existing works in speech emotion conversion rely on acted-out datasets and parallel data samples, in this work we specifically focus on more challenging in-the-wild scenarios and do not rely on parallel data. To this end, we propose a diffusion-based generative model for speech emotion conversion, the EmoConv-Diff, that is trained to reconstruct an input utterance while also conditioning on its emotion. Subsequently, at inference, a target emotion embedding is employed to convert the emotion of the input utterance to the given target emotion. As opposed to performing emotion conversion on categorical representations, we use a continuous arousal dimension to represent emotions while also achieving intensity control. We validate the proposed methodology on a large in-the-wild dataset, the MSP-Podcast v1.10. Our results show that the proposed diffusion model is indeed capable of synthesizing speech with a controllable target emotion. Crucially, the proposed approach shows improved performance along the extreme values of arousal and thereby addresses a common challenge in the speech emotion conversion literature.

摘要
<>输入文本翻译成简化中文。<>这个工作主要关注在实际场景中进行情感转换，而不是基于演示数据和平行数据样本。为了实现这一目标，我们提出了一种基于吸引过程的生成模型，即EmocoConv-Diff，可以重建输入utterance的内容，同时也根据情感进行conditioning。在推理阶段，使用目标情感嵌入来转换输入utterance的情感。与其他工作不同的是，我们使用连续的兴奋度维度来表示情感，同时实现了强度控制。我们验证了我们的方法在大量实际场景中的MSP-Podcast v1.10 dataset上，结果表明，提出的扩散模型确实可以控制目标情感。更重要的是，我们的方法在兴奋度的极值位置上表现出了改进的性能，因此解决了实际场景中的一个常见挑战。

SnakeGAN: A Universal Vocoder Leveraging DDSP Prior Knowledge and Periodic Inductive Bias

paper_url: http://arxiv.org/abs/2309.07803
repo_url: None
paper_authors: Sipan Li, Songxiang Liu, Luwen Zhang, Xiang Li, Yanyao Bian, Chao Weng, Zhiyong Wu, Helen Meng
For: + The paper aims to train a universal vocoder that can generalize well to out-of-domain (OOD) scenarios, such as unseen speaking styles, non-speech vocalization, singing, and musical pieces.* Methods: + The proposed method, called SnakeGAN, uses a GAN-based architecture with a coarse-grained signal generated by a differentiable digital signal processing (DDSP) model as prior knowledge. + The generator uses the Snake activation function and anti-aliased representation to introduce periodic nonlinearities and bring the desired inductive bias for audio synthesis.* Results: + The proposed method significantly outperforms compared approaches and can generate high-fidelity audio samples including unseen speakers with unseen styles, singing voices, instrumental pieces, and nonverbal vocalization.

Abstract
Generative adversarial network (GAN)-based neural vocoders have been widely used in audio synthesis tasks due to their high generation quality, efficient inference, and small computation footprint. However, it is still challenging to train a universal vocoder which can generalize well to out-of-domain (OOD) scenarios, such as unseen speaking styles, non-speech vocalization, singing, and musical pieces. In this work, we propose SnakeGAN, a GAN-based universal vocoder, which can synthesize high-fidelity audio in various OOD scenarios. SnakeGAN takes a coarse-grained signal generated by a differentiable digital signal processing (DDSP) model as prior knowledge, aiming at recovering high-fidelity waveform from a Mel-spectrogram. We introduce periodic nonlinearities through the Snake activation function and anti-aliased representation into the generator, which further brings the desired inductive bias for audio synthesis and significantly improves the extrapolation capacity for universal vocoding in unseen scenarios. To validate the effectiveness of our proposed method, we train SnakeGAN with only speech data and evaluate its performance for various OOD distributions with both subjective and objective metrics. Experimental results show that SnakeGAN significantly outperforms the compared approaches and can generate high-fidelity audio samples including unseen speakers with unseen styles, singing voices, instrumental pieces, and nonverbal vocalization.

摘要
Translated into Simplified Chinese: генеритив adversarial network (GAN)-based neural vocoder 已经广泛应用于音频合成任务，因为它们的高质量生成、效率推理和小计算脚本。然而，在训练一个通用 vocoder 仍然是挑战，因为它需要在未经见过的 scenarios 中进行泛化，如尚未看到的说话风格、非语音化声、唱歌和音乐作品。在这个工作中，我们提出了 SnakeGAN，一种基于 GAN 的通用 vocoder，可以在多种 OOD 分布上生成高质量的音频。SnakeGAN 使用了干扰函数来引入期望的非线性，并在生成器中使用了抗锯齿表示，以进一步带来音频合成的适应性和泛化能力。为验证我们的提出方法的有效性，我们将 SnakeGAN 训练用 speech 数据，并在多种 OOD 分布上评估其表现。实验结果表明，SnakeGAN 在比较方法上表现出色，可以生成包括未看到的 speaker 和未经见过的风格的高质量音频样本。

Complexity Scaling for Speech Denoising

paper_url: http://arxiv.org/abs/2309.07757
repo_url: https://github.com/hangtingchen/Complexity-Scaling-for-Speech-Denoising.github.io
paper_authors: Hangting Chen, Jianwei Yu, Chao Weng
For: This paper aims to develop a unified architecture for speech denoising models that can handle a wide range of computational complexities.* Methods: The proposed Multi-Path Transform-based (MPT) architecture is designed to handle both low- and high-complexity scenarios. The authors explore the empirical relationship between model performance and computational cost on the denoising task.* Results: The MPT networks achieve high performance on the DNS challenge dataset, and the authors observe a linear increase in the values of PESQ-WB and SI-SNR as the complexity number of multiply-accumulate operations (MACs) is scaled from 50M/s to 15G/s.

Abstract
Computational complexity is critical when deploying deep learning-based speech denoising models for on-device applications. Most prior research focused on optimizing model architectures to meet specific computational cost constraints, often creating distinct neural network architectures for different complexity limitations. This study conducts complexity scaling for speech denoising tasks, aiming to consolidate models with various complexities into a unified architecture. We present a Multi-Path Transform-based (MPT) architecture to handle both low- and high-complexity scenarios. A series of MPT networks present high performance covering a wide range of computational complexities on the DNS challenge dataset. Moreover, inspired by the scaling experiments in natural language processing, we explore the empirical relationship between model performance and computational cost on the denoising task. As the complexity number of multiply-accumulate operations (MACs) is scaled from 50M/s to 15G/s on MPT networks, we observe a linear increase in the values of PESQ-WB and SI-SNR, proportional to the logarithm of MACs, which might contribute to the understanding and application of complexity scaling in speech denoising tasks.

摘要
computational complexity是speech denoising模型部署时的关键因素。大多数前一代研究都是优化模型结构来满足特定的计算成本限制，通常创建了不同计算成本限制的特有神经网络架构。这项研究进行了计算复杂性排比，旨在将不同计算复杂性的模型合并到一个统一的架构中。我们提出了基于多路变换的（MPT）架构，可以处理low-和high-计算复杂性的enario。MPT网络在DNS挑战数据集上表现出了高性能，并且在不同的计算成本水平下保持稳定性。此外，通过自然语言处理领域的缩放实验，我们发现了对于去噪任务，模型性能和计算成本之间存在线性关系，具体来说，随着MACs计算复杂性的幂数增加，PESQ-WB和SI-SNR的值会线性增加，这可能对于speech denoising任务中的计算复杂性排比做出贡献。

DDSP-based Neural Waveform Synthesis of Polyphonic Guitar Performance from String-wise MIDI Input

paper_url: http://arxiv.org/abs/2309.07658
repo_url: None
paper_authors: Nicolas Jonason, Xin Wang, Erica Cooper, Lauri Juvela, Bob L. T. Sturm, Junichi Yamagishi
for: 这篇论文探讨使用神经合成来生成音频电 guitar，从MIDI输入中提取特征。
methods: 论文提出了四种不同的系统，并对它们进行对比，包括对象 метри克和主观评价。它们的架构和中间任务，如预测抑制特征，都受到了考虑。
results: 研究发现，将控制特征预测任务设计为分类任务而不是回归任务可以获得更好的结果。此外，论文发现最简单的提出的系统，直接从MIDI输入预测合成参数，最好的性能。音频示例可以在https://erl-j.github.io/neural-guitar-web-supplement中找到。

Abstract
We explore the use of neural synthesis for acoustic guitar from string-wise MIDI input. We propose four different systems and compare them with both objective metrics and subjective evaluation against natural audio and a sample-based baseline. We iteratively develop these four systems by making various considerations on the architecture and intermediate tasks, such as predicting pitch and loudness control features. We find that formulating the control feature prediction task as a classification task rather than a regression task yields better results. Furthermore, we find that our simplest proposed system, which directly predicts synthesis parameters from MIDI input performs the best out of the four proposed systems. Audio examples are available at https://erl-j.github.io/neural-guitar-web-supplement.

摘要
我们研究使用神经合成来synthesize acoustic guitar from string-wise MIDI输入。我们提出了四种不同的系统，并与对象指标和主观评估对自然音频和样本基eline进行比较。我们逐次开发这四种系统，通过考虑不同的架构和中间任务，如预测把握特征和声音强度控制特征。我们发现，将控制特征预测任务转换为分类任务而不是回归任务，可以获得更好的结果。此外，我们发现我们最简单的提出的系统，直接从MIDI输入预测synthesis参数，在四种系统中表现最佳。有关音频示例，可以查看https://erl-j.github.io/neural-guitar-web-supplement。

Multilingual Audio Captioning using machine translated data

paper_url: http://arxiv.org/abs/2309.07615
repo_url: None
paper_authors: Matéo Cousin, Étienne Labbé, Thomas Pellegrini
for: 这个论文主要研究了自动化音频描述系统（AAC）的多语言支持，以及如何使用机器翻译生成多语言的音频描述文本。
methods: 作者使用了自动机器翻译将两个知名的AAC数据集（AudioCaps和Clotho）中的英文描述文本翻译成法语、德语和西班牙语。然后，他们在每种语言上训练和评估了单语言系统，并对AudioCaps和Clotho数据集进行了评估。
results: 研究发现，使用机器翻译生成多语言描述文本可以获得类似于英文系统的性能（约75% CIDEr on AudioCaps和43% on Clotho）。此外，在法语中，手动生成的评估subset的描述文本被法语系统训练后的输出比英文系统自动翻译后的输出更加准确。最后，作者建立了一个多语言模型，可以在每种语言上获得类似的性能，使用的参数少于使用多个单语言系统。

Abstract
Automated Audio Captioning (AAC) systems attempt to generate a natural language sentence, a caption, that describes the content of an audio recording, in terms of sound events. Existing datasets provide audio-caption pairs, with captions written in English only. In this work, we explore multilingual AAC, using machine translated captions. We translated automatically two prominent AAC datasets, AudioCaps and Clotho, from English to French, German and Spanish. We trained and evaluated monolingual systems in the four languages, on AudioCaps and Clotho. In all cases, the models achieved similar performance, about 75% CIDEr on AudioCaps and 43% on Clotho. In French, we acquired manual captions of the AudioCaps eval subset. The French system, trained on the machine translated version of AudioCaps, achieved significantly better results on the manual eval subset, compared to the English system for which we automatically translated the outputs to French. This advocates in favor of building systems in a target language instead of simply translating to a target language the English captions from the English system. Finally, we built a multilingual model, which achieved results in each language comparable to each monolingual system, while using much less parameters than using a collection of monolingual systems.

摘要

AAS-VC: On the Generalization Ability of Automatic Alignment Search based Non-autoregressive Sequence-to-sequence Voice Conversion

paper_url: http://arxiv.org/abs/2309.07598
repo_url: https://github.com/unilight/seq2seq-vc
paper_authors: Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda
For: The paper is written for the purpose of proposing a non-autoregressive sequence-to-sequence (seq2seq) model for voice conversion (VC) that can generalize well to small training datasets.* Methods: The proposed model, called AAS-VC, uses automatic alignment search (AAS) to remove the dependency on external durations and provide a proper inductive bias for generalization.* Results: The experimental results show that AAS-VC can generalize better to a training dataset of only 5 minutes, compared to current non-AR seq2seq VC models that require larger training datasets.

Abstract
Non-autoregressive (non-AR) sequence-to-seqeunce (seq2seq) models for voice conversion (VC) is attractive in its ability to effectively model the temporal structure while enjoying boosted intelligibility and fast inference thanks to non-AR modeling. However, the dependency of current non-AR seq2seq VC models on ground truth durations extracted from an external AR model greatly limits its generalization ability to smaller training datasets. In this paper, we first demonstrate the above-mentioned problem by varying the training data size. Then, we present AAS-VC, a non-AR seq2seq VC model based on automatic alignment search (AAS), which removes the dependency on external durations and serves as a proper inductive bias to provide the required generalization ability for small datasets. Experimental results show that AAS-VC can generalize better to a training dataset of only 5 minutes. We also conducted ablation studies to justify several model design choices. The audio samples and implementation are available online.

摘要
非自动回归（非AR）序列到序列（seq2seq）模型 для语音转换（VC）具有模型时间结构的能力，同时具有加速推理和提高听解性的优点。然而，现有的非AR seq2seq VC模型对外部AR模型提供的真实duration的依赖限制了其总结推理能力，特别是在小训练集上。在这篇论文中，我们首先描述了上述问题，并通过变换训练数据量来证明。然后，我们提出了AAS-VC模型，基于自动对齐搜索（AAS），解除了对外部duration的依赖，并提供了适当的束缚，以便在小训练集上进行总结推理。实验结果显示，AAS-VC可以更好地总结小训练集中的5分钟 audio samples。我们还进行了一些模型设计选择的抽象研究，以便更好地理解模型的工作原理。音频示例和实现可以在线获取。

StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings

paper_url: http://arxiv.org/abs/2309.07592
repo_url: https://github.com/arnabdas8901/StarGAN-VC_PlusPlus
paper_authors: Arnab Das, Suhita Ghosh, Tim Polzehl, Sebastian Stober
for: 本研究旨在提高语音转换（VC）技术的自然性，保留原始说话人的情感表达。
methods: 研究使用了一种基于生成对抗网络的VC方法，称为StarGANv2-VC，但该方法无法保持说话人的情感表达。
results: 研究发现，StarGANv2-VC方法不能分离说话人和情感表达的 Representation，导致情感泄露。为解决这问题，研究提出了一种新的情感意识损失和一种无监督的方法，通过利用 latent emotion representation 进行情感监督。对多个数据集、情感、性别等进行对象和主观评估，研究证明了该策略的有效性。

Abstract
Voice conversion (VC) transforms an utterance to sound like another person without changing the linguistic content. A recently proposed generative adversarial network-based VC method, StarGANv2-VC is very successful in generating natural-sounding conversions. However, the method fails to preserve the emotion of the source speaker in the converted samples. Emotion preservation is necessary for natural human-computer interaction. In this paper, we show that StarGANv2-VC fails to disentangle the speaker and emotion representations, pertinent to preserve emotion. Specifically, there is an emotion leakage from the reference audio used to capture the speaker embeddings while training. To counter the problem, we propose novel emotion-aware losses and an unsupervised method which exploits emotion supervision through latent emotion representations. The objective and subjective evaluations prove the efficacy of the proposed strategy over diverse datasets, emotions, gender, etc.

摘要
声音转换（VC）将一句话变成另一个人的语音，不改变语言内容。一种最近提出的生成对抗网络基本VC方法，StarGANv2-VC非常成功地生成自然听起来的转换。然而，该方法无法保留源speaker的情感。情感保留是人机交互的重要需求。在这篇论文中，我们表明StarGANv2-VC无法分离说话人和情感表示，不能保留情感。具体来说，在训练时使用引用音频捕捉说话人嵌入的情感泄漏问题。为了解决该问题，我们提议使用新的情感意识损失和无监督方法，通过latent情感表示来进行情感监督。对于不同的数据集、情感、性别等，我们的目标和主观评估都证明了我们的策略的有效性。

Diff-SV: A Unified Hierarchical Framework for Noise-Robust Speaker Verification Using Score-Based Diffusion Probabilistic Models

paper_url: http://arxiv.org/abs/2309.08320
repo_url: https://github.com/wngh1187/diff-sv
paper_authors: Ju-ho Kim, Jungwoo Heo, Hyun-seo Shin, Chan-yeong Lim, Ha-Jin Yu
for: 提高SV系统的准确性和可靠性，对background noise进行处理
methods: 使用扩散概率模型（DPM）来实现speech enhancement，并将其与 speaker embedding EXTractor结合起来，从而获得具有抗雑议性和可识别性的Speaker representation
results: 在VoxCeleb1测试集、外部噪声源和VOiCES corpus上进行了评估，实验结果表明，Diff-SV在噪声环境下达到了状态的前景性性能，超过了最新的噪声鲁棒SV系统

Abstract
Background noise considerably reduces the accuracy and reliability of speaker verification (SV) systems. These challenges can be addressed using a speech enhancement system as a front-end module. Recently, diffusion probabilistic models (DPMs) have exhibited remarkable noise-compensation capabilities in the speech enhancement domain. Building on this success, we propose Diff-SV, a noise-robust SV framework that leverages DPM. Diff-SV unifies a DPM-based speech enhancement system with a speaker embedding extractor, and yields a discriminative and noise-tolerable speaker representation through a hierarchical structure. The proposed model was evaluated under both in-domain and out-of-domain noisy conditions using the VoxCeleb1 test set, an external noise source, and the VOiCES corpus. The obtained experimental results demonstrate that Diff-SV achieves state-of-the-art performance, outperforming recently proposed noise-robust SV systems.

摘要
<>转换给定文本到简化中文。<>背景噪声会significantly reducethe accuracy和可靠性of speaker verification（SV）系统。这些挑战可以通过一个speech enhancement系统作为前端模块来解决。现在，diffusion probabilistic models（DPMs）在speech enhancement领域表现出了很好的噪声补偿能力。基于这种成功，我们提出了Diff-SV，一个具有噪声耐性的SV框架，利用DPM。Diff-SV将DPM-based speech enhancement系统与speaker embedding抽取器结合，通过层次结构实现一个可 dicriminative和噪声忍受的speaker表示。我们在VoxCeleb1测试集、外部噪声源和VOiCES corpus上进行了对Diff-SV的实验评估。实验结果表明，Diff-SV在噪声条件下实现了状态之arte的表现，比对最近提出的噪声耐性SV系统更高。

Emo-StarGAN: A Semi-Supervised Any-to-Many Non-Parallel Emotion-Preserving Voice Conversion

paper_url: http://arxiv.org/abs/2309.07586
repo_url: https://github.com/suhitaghosh10/emo-stargan
paper_authors: Suhita Ghosh, Arnab Das, Yamini Sinha, Ingo Siegert, Tim Polzehl, Sebastian Stober
for: 防止听取数据的违用，保持语音内容的自然性。
methods: 使用 semi-supervised StarGANv2-VC 变体，在部分情感标注的非平行数据上进行训练，并使用情感嵌入和声学特征相关的情感损失函数。
results: 对比于标准 StarGANv2-VC，提出的方法可以显著改善情感保持，在多个数据集、情感、目标说话人和跨群交流中无需妥协 inteligibility 和匿名化。

Abstract
Speech anonymisation prevents misuse of spoken data by removing any personal identifier while preserving at least linguistic content. However, emotion preservation is crucial for natural human-computer interaction. The well-known voice conversion technique StarGANv2-VC achieves anonymisation but fails to preserve emotion. This work presents an any-to-many semi-supervised StarGANv2-VC variant trained on partially emotion-labelled non-parallel data. We propose emotion-aware losses computed on the emotion embeddings and acoustic features correlated to emotion. Additionally, we use an emotion classifier to provide direct emotion supervision. Objective and subjective evaluations show that the proposed approach significantly improves emotion preservation over the vanilla StarGANv2-VC. This considerable improvement is seen over diverse datasets, emotions, target speakers, and inter-group conversions without compromising intelligibility and anonymisation.

摘要
<>发音匿名化可以防止使用口头数据的不当使用，同时保留至少的语言内容。然而，情感保留是人机交互的重要因素。知名的声音转换技术StarGANv2-VC可以实现匿名化，但是失去情感。本工作提出了任意到多个半监督StarGANv2-VC变体，通过部分情感标注不对称数据进行训练。我们提出了基于情感嵌入和听音特征相关的情感意识损失，以及直接提供情感标注。对象和主观评估表明，我们的方法可以明显改善情感保留，超过多种数据集、情感、目标说话人和交互转换等方面，无需妥协智能性和匿名化。

Outlier-aware Inlier Modeling and Multi-scale Scoring for Anomalous Sound Detection via Multitask Learning

paper_url: http://arxiv.org/abs/2309.07500
repo_url: None
paper_authors: Yucong Zhang, Hongbin Suo, Yulong Wan, Ming Li
for: 这篇论文是为了检测异常声音而提出的方法，该方法利用多任务学习整合异常曝光和正常样本模型。methods: 该方法使用多任务学习整合异常曝光和正常样本模型，并提供多尺度得分来检测异常。results: 实验结果表明，该方法在MIMII和DCASE 2020任务2集合上的表现较为出色，超过了单个模型系统的状态卷积，并与多个系统组合 ensemble 的表现相当。

Abstract
This paper proposes an approach for anomalous sound detection that incorporates outlier exposure and inlier modeling within a unified framework by multitask learning. While outlier exposure-based methods can extract features efficiently, it is not robust. Inlier modeling is good at generating robust features, but the features are not very effective. Recently, serial approaches are proposed to combine these two methods, but it still requires a separate training step for normal data modeling. To overcome these limitations, we use multitask learning to train a conformer-based encoder for outlier-aware inlier modeling. Moreover, our approach provides multi-scale scores for detecting anomalies. Experimental results on the MIMII and DCASE 2020 task 2 datasets show that our approach outperforms state-of-the-art single-model systems and achieves comparable results with top-ranked multi-system ensembles.

摘要
本文提出了一种异常声音检测方法，该方法包括外围曝光和内围模型在一个统一框架中，通过多任务学习。而外围曝光方法可以快速提取特征，但不是非常稳定。内围模型可以生成稳定的特征，但这些特征不是很有效。 reciently， serial approaches have been proposed to combine these two methods, but it still requires a separate training step for normal data modeling. To overcome these limitations, we use multitask learning to train a conformer-based encoder for outlier-aware inlier modeling. Moreover, our approach provides multi-scale scores for detecting anomalies. Experimental results on the MIMII and DCASE 2020 task 2 datasets show that our approach outperforms state-of-the-art single-model systems and achieves comparable results with top-ranked multi-system ensembles.Here's the translation of the text into Traditional Chinese:本文提出了一种异常声音检测方法，该方法包括外围曝光和内围模型在一个统一架构中，通过多任务学习。而外围曝光方法可以快速提取特征，但不是非常稳定。内围模型可以生成稳定的特征，但这些特征不是很有效。 reciently， serial approaches have been proposed to combine these two methods, but it still requires a separate training step for normal data modeling. To overcome these limitations, we use multitask learning to train a conformer-based encoder for outlier-aware inlier modeling. Moreover, our approach provides multi-scale scores for detecting anomalies. Experimental results on the MIMII and DCASE 2020 task 2 datasets show that our approach outperforms state-of-the-art single-model systems and achieves comparable results with top-ranked multi-system ensembles.

Hierarchical Metadata Information Constrained Self-Supervised Learning for Anomalous Sound Detection Under Domain Shift

paper_url: http://arxiv.org/abs/2309.07498
repo_url: None
paper_authors: Haiyan Lan, Qiaoxi Zhu, Jian Guan, Yuming Wei, Wenwu Wang
for: 这篇论文是为了提高适应领域转移的异常声检测（ASD）表现。
methods: 这篇论文使用了自愿监督学习方法，并利用了阶层 metadata 信息作为条件，以获得更精确的特征表现。
results: 实验结果显示，该方法可以在 DCASE 2022 挑战任务2中提高自愿监督学习方法的表现。

Abstract
Self-supervised learning methods have achieved promising performance for anomalous sound detection (ASD) under domain shift, where the type of domain shift is considered in feature learning by incorporating section IDs. However, the attributes accompanying audio files under each section, such as machine operating conditions and noise types, have not been considered, although they are also crucial for characterizing domain shifts. In this paper, we present a hierarchical metadata information constrained self-supervised (HMIC) ASD method, where the hierarchical relation between section IDs and attributes is constructed, and used as constraints to obtain finer feature representation. In addition, we propose an attribute-group-center (AGC)-based method for calculating the anomaly score under the domain shift condition. Experiments are performed to demonstrate its improved performance over the state-of-the-art self-supervised methods in DCASE 2022 challenge Task 2.

摘要
自我监督学习方法在域shift下的异常声音检测（ASD）中表现出了可塑性，通过在特征学习中包含部分ID来考虑域shift的类型。然而，音频文件中附加的特征，如机器操作条件和噪音类型，尚未被考虑，尽管它们也是域shift的关键特征。在本文中，我们提出了层次metadata信息受限自动学习（HMIC）ASD方法，其中层次关系 между部分ID和特征被建立，并用于限制特征表示的精度。此外，我们还提出了特征组中心（AGC）方法来计算域shift下的异常分数。我们对DCASE 2022挑战任务2进行实验，以证明其与现有自我监督方法的比较优异性。

Codec Data Augmentation for Time-domain Heart Sound Classification

paper_url: http://arxiv.org/abs/2309.07466
repo_url: None
paper_authors: Ansh Mishra, Jia Qi Yip, Eng Siong Chng
for: 旨在检测心脏疾病的早期诊断，以挽救生命。
methods: 使用深度学习算法自动分类心音。
results: 通过数据增强，我们的方法可以提高分类错误率从0.8降低至0.2。

Abstract
Heart auscultations are a low-cost and effective way of detecting valvular heart diseases early, which can save lives. Nevertheless, it has been difficult to scale this screening method since the effectiveness of auscultations is dependent on the skill of doctors. As such, there has been increasing research interest in the automatic classification of heart sounds using deep learning algorithms. However, it is currently difficult to develop good heart sound classification models due to the limited data available for training. In this work, we propose a simple time domain approach, to the heart sound classification problem with a base classification error rate of 0.8 and show that augmentation of the data through codec simulation can improve the classification error rate to 0.2. With data augmentation, our approach outperforms the existing time-domain CNN-BiLSTM baseline model. Critically, our experiments show that codec data augmentation is effective in getting around the data limitation.

摘要
心脏听见是一种低成本高效的早期检测心 valve 疾病的方法，可以拯救生命。然而，由于听见效果受医生技能的限制，因此听见检测方法具有扩展的挑战。随着深度学习算法在医疗领域的应用，研究人员开始关注自动 классификация心音的问题。然而，由于听见数据的有限性，目前难以建立好的心音分类模型。在这项工作中，我们提出了一种简单的时域预测方法，并证明了在基础错误率0.8的情况下，数据增强通过编码模拟可以下降分类错误率至0.2。与现有的时域CNN-BiLSTM基eline模型相比，我们的方法在数据增强情况下表现出了优异性。关键的是，我们的实验表明，编码数据增强是一种有效的绕过数据限制的方法。

Analysis of Speech Separation Performance Degradation on Emotional Speech Mixtures

paper_url: http://arxiv.org/abs/2309.07458
repo_url: None
paper_authors: Jia Qi Yip, Dianwen Ng, Bin Ma, Chng Eng Siong
for: 这篇论文旨在研究 speech separation 中情感因素的影响。
methods: 该论文使用了 Sepformer 模型，并使用了 Emo2Mix 测试集来分析情感对 speech separation 的影响。
results: 研究发现，即使使用强大的 out-of-domain 表现的 Sepformer 模型，也可能会在带有强烈情感的杂音中受到5.1 dB SI-SDRi 的负面影响。这表明在实际应用中应该考虑情感因素。

Abstract
Despite recent strides made in Speech Separation, most models are trained on datasets with neutral emotions. Emotional speech has been known to degrade performance of models in a variety of speech tasks, which reduces the effectiveness of these models when deployed in real-world scenarios. In this paper we perform analysis to differentiate the performance degradation arising from the emotions in speech from the impact of out-of-domain inference. This is measured using a carefully designed test dataset, Emo2Mix, consisting of balanced data across all emotional combinations. We show that even models with strong out-of-domain performance such as Sepformer can still suffer significant degradation of up to 5.1 dB SI-SDRi on mixtures with strong emotions. This demonstrates the importance of accounting for emotions in real-world speech separation applications.

摘要
尽管最近的Speech Separation模型在不同情感下进行训练，大多数模型仍然是在中性情感下训练的。情感味语会对各种语音任务中的模型性能产生负面影响，从而降低模型在真实世界应用中的有效性。在这篇论文中，我们进行了分析，以区分情感的影响和域外推理的影响。我们使用了一个特殊的测试集，Emo2Mix，这个测试集包含了所有情感组合的均衡数据。我们发现，即使是具有强域外推理能力的Sepformer模型，在强情感下也可能受到5.1 dB SI-SDRi的较大下降。这表明在真实世界中应用Speech Separation时，需要考虑情感的因素。

SpatialCodec: Neural Spatial Speech Coding

paper_url: http://arxiv.org/abs/2309.07432
repo_url: https://github.com/xzwy/spatialcodec
paper_authors: Zhongweiyang Xu, Yong Xu, Vinay Kothapally, Heming Wang, Muqiao Yang, Dong Yu
for: 本研究旨在使用深度学习技术对麦克风阵列捕获的语音编码，以保留和准确重建多通道记录中的关键空间信息。
methods: 我们提出了一个基于神经网络的 neural spatial audio coding 框架，其包括两个阶段：（i）使用神经网络子带码器对参照通道进行低比特率编码，并（ii）使用 SpatialCodec 捕获相对空间信息以实现准确多通道重建。
results: 我们的系统在比较高比特率基eline和黑盒神经网络架构下显示出了superior的空间性表现，同时我们还提出了一些新的评价指标来评估空间信息保留度，包括cosine similarity在 espacially intuitive beamspace 上的cosine similarity和 beamformed audio quality。

Abstract
In this work, we address the challenge of encoding speech captured by a microphone array using deep learning techniques with the aim of preserving and accurately reconstructing crucial spatial cues embedded in multi-channel recordings. We propose a neural spatial audio coding framework that achieves a high compression ratio, leveraging single-channel neural sub-band codec and SpatialCodec. Our approach encompasses two phases: (i) a neural sub-band codec is designed to encode the reference channel with low bit rates, and (ii), a SpatialCodec captures relative spatial information for accurate multi-channel reconstruction at the decoder end. In addition, we also propose novel evaluation metrics to assess the spatial cue preservation: (i) spatial similarity, which calculates cosine similarity on a spatially intuitive beamspace, and (ii), beamformed audio quality. Our system shows superior spatial performance compared with high bitrate baselines and black-box neural architecture. Demos are available at https://xzwy.github.io/SpatialCodecDemo. Codes and models are available at https://github.com/XZWY/SpatialCodec.

摘要
在这项工作中，我们面临了一个挑战，即使用深度学习技术对麦克风阵列捕捉的speech进行编码，以保留和准确重建多通道录音中的关键空间特征。我们提出了一个含有两个阶段的神经空间音频编码框架，其中第一个阶段是使用单通道神经子带编码器对参考通道进行低比特率编码，第二个阶段是使用SpatialCodec捕捉相对空间信息以实现准确多通道重建。此外，我们还提出了一些新的评估指标来评估空间特征保留情况：（1）空间相似性，它计算顺序空间上的余弦相似性，以及（2）扩扩音质。我们的系统在比高比特率基eline和黑盒神经架下表现出了superior的空间性能。演示可以在https://xzwy.github.io/SpatialCodecDemo中找到。代码和模型可以在https://github.com/XZWY/SpatialCodec中找到。

Mandarin Lombard Flavor Classification

paper_url: http://arxiv.org/abs/2309.07419
repo_url: None
paper_authors: Qingmu Liu, Yuhong Yang, Baifeng Li, Hongyang Chen, Weiping Tu, Song Lin
for: investigate the impact of different decibel levels and types of background noise on the Lombard effect
methods: used a flavor classification approach based on Mandarin Lombard speech under different noise conditions, simulated self-feedback speech, and statistical tests on word correct rates
results: found four distinct categories of Mandarin Lombard speech in the range of 30 to 80 dBA with different transition points, depending on the type of background noise (SSN or babble)

Abstract
The Lombard effect refers to individuals' unconscious modulation of vocal effort in response to variations in the ambient noise levels, intending to enhance speech intelligibility. The impact of different decibel levels and types of background noise on Lombard effects remains unclear. Building upon the characteristic of Lombard speech that individuals adjust their speech to improve intelligibility dynamically based on the self-feedback speech, we propose a flavor classification approach for the Lombard effect. We first collected Mandarin Lombard speech under different noise conditions, then simulated self-feedback speech, and ultimately conducted the statistical test on the word correct rate. We found that both SSN and babble noise types result in four distinct categories of Mandarin Lombard speech in the range of 30 to 80 dBA with different transition points.

摘要
“卢柏尔效应”指个体在噪声水平变化时，无意识地调整语音努力，以提高对话理解度。噪声强度和类型对卢柏尔效应的影响仍不清楚。基于卢柏尔语言特征，我们提出了一种味道分类方法。我们首先收集了不同噪声条件下的普通话 Lombard 语音，然后模拟自适应语音，并进行统计测试。我们发现，SSN 和喧嚣噪声类型都导致了4种不同的普通话 Lombard 语音，分别在30到80 dBA 的范围内，每个转变点都有不同。

M3-AUDIODEC: Multi-channel multi-speaker multi-spatial audio codec

paper_url: http://arxiv.org/abs/2309.07416
repo_url: https://github.com/anton-jeran/MULTI-AUDIODEC
paper_authors: Anton Ratnarajah, Shi-Xiong Zhang, Yi Luo, Dong Yu
for: 这个论文是为了提出一种基于神经网络的多通道语音编码器，可以有效地压缩多个说话人的多通道语音，同时保留每个说话人的空间位置信息。methods: 该模型使用了一种新的嵌入式推理方法，可以根据预先确定的多通道、多说话人和多空间重叠说话情况进行配置和训练。results: 该模型可以压缩和解码重叠说话的语音，并且可以在12.6 kbps的操作下，比Opus和AUDIODEC的24 kbps操作提高37%和52%。此外，该模型还可以在不同的语音增强和房间响应度下保持清晰的语音和空间位置信息。

Abstract
We introduce M3-AUDIODEC, an innovative neural spatial audio codec designed for efficient compression of multi-channel (binaural) speech in both single and multi-speaker scenarios, while retaining the spatial location information of each speaker. This model boasts versatility, allowing configuration and training tailored to a predetermined set of multi-channel, multi-speaker, and multi-spatial overlapping speech conditions. Key contributions are as follows: 1) Previous neural codecs are extended from single to multi-channel audios. 2) The ability of our proposed model to compress and decode for overlapping speech. 3) A groundbreaking architecture that compresses speech content and spatial cues separately, ensuring the preservation of each speaker's spatial context after decoding. 4) M3-AUDIODEC's proficiency in reducing the bandwidth for compressing two-channel speech by 48% when compared to individual binaural channel compression. Impressively, at a 12.6 kbps operation, it outperforms Opus at 24 kbps and AUDIODEC at 24 kbps by 37% and 52%, respectively. In our assessment, we employed speech enhancement and room acoustic metrics to ascertain the accuracy of clean speech and spatial cue estimates from M3-AUDIODEC. Audio demonstrations and source code are available online https://github.com/anton-jeran/MULTI-AUDIODEC .

摘要
我们介绍M3-AUDIODEC，一种创新的神经网络声音编码器，用于高效压缩多通道（频率分量）的语音，包括单个和多个说话人场景，而且保留每个说话人的空间位置信息。这个模型具有灵活性，可以根据预先确定的多通道、多个说话人和多个空间重叠情况进行配置和训练。关键贡献包括：1. 扩展了单通道声音编码器到多通道声音。2. 能够压缩和解码重叠的语音。3. 采用独特的架构，将语音内容和空间cue分开压缩，以保持每个说话人的空间上下文 после decoding。4. M3-AUDIODEC 能够减少压缩两通道语音的带宽，比对个频道压缩减少48%。在12.6 kbps操作下，它超越了 Opus 和 AUDIODEC 的24 kbps操作，提高了37%和52%。在我们的评估中，我们使用了语音提升和房间声学指标来评估M3-AUDIODEC 中干净语音和空间cue的准确性。在线可以找到音频示例和源代码，请参阅https://github.com/anton-jeran/MULTI-AUDIODEC。

Multi-dimensional Speech Quality Assessment in Crowdsourcing

paper_url: http://arxiv.org/abs/2309.07385
repo_url: https://github.com/microsoft/P.808
paper_authors: Babak Naderi, Ross Cutler, Nicolae-Catalin Ristea
For: The paper is written to evaluate the subjective speech quality assessment in lab environments and crowdsourcing, and to extend the ITU-T Rec. P.800 and P.808 standards to measure speech quality in the presence of noise and reverberation.* Methods: The paper uses a crowdsourcing implementation of a multi-dimensional subjective test following the scales from P.804, which includes noisiness, coloration, discontinuity, loudness, and overall quality. The tool is both accurate and reproducible, and has been used in the ICASSP 2023 Speech Signal Improvement challenge.* Results: The paper shows the utility of these speech quality dimensions in the challenge and demonstrates the accuracy and reproducibility of the tool. The tool will be publicly available as open-source at https://github.com/microsoft/P.808.Here is the simplified Chinese text for the three key points:
for: 这篇论文是用来评估声音质量评估和电信系统的标准方法。
methods: 这篇论文使用了一种多维ensional的主观测试，以ITU-TRec. P.804的标准进行评估，包括噪音、颜色、缺失、响度等主观质量维度。
results: 这篇论文在ICASSP 2023 Speech Signal Improvement challenge中展示了这些声音质量维度的实用性，并证明了这种工具的准确性和可重复性。

Abstract
Subjective speech quality assessment is the gold standard for evaluating speech enhancement processing and telecommunication systems. The commonly used standard ITU-T Rec. P.800 defines how to measure speech quality in lab environments, and ITU-T Rec.~P.808 extended it for crowdsourcing. ITU-T Rec. P.835 extends P.800 to measure the quality of speech in the presence of noise. ITU-T Rec. P.804 targets the conversation test and introduces perceptual speech quality dimensions which are measured during the listening phase of the conversation. The perceptual dimensions are noisiness, coloration, discontinuity, and loudness. We create a crowdsourcing implementation of a multi-dimensional subjective test following the scales from P.804 and extend it to include reverberation, the speech signal, and overall quality. We show the tool is both accurate and reproducible. The tool has been used in the ICASSP 2023 Speech Signal Improvement challenge and we show the utility of these speech quality dimensions in this challenge. The tool will be publicly available as open-source at https://github.com/microsoft/P.808.

摘要
主观语音质量评估是评估语音增强处理和电信系统的金标准。常用的标准ITU-T Rec. P.800定义了如何测量语音质量在室内环境中，而ITU-T Rec. P.808将其扩展到了投票。ITU-T Rec. P.835将P.800扩展到测量噪音中的语音质量。ITU-T Rec. P.804targets conversation test和引入了感知语音质量维度，这些维度在听话阶段测量。这些感知维度包括噪音、颜色、破碎、和响度。我们创建了一个基于多维ensional主观测试的人群投票实现，并将其扩展到包括噪音、语音信号和总质量。我们显示了这个工具的准确性和可重复性。这个工具在ICASSP 2023 Speech Signal Improvement challenge中被使用，并示出了这些语音质量维度的实用性。这个工具将于https://github.com/microsoft/P.808上公开发布为开源项目。

Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

paper_url: http://arxiv.org/abs/2309.07377
repo_url: https://github.com/k2-fsa/icefall
paper_authors: Yifan Yang, Feiyu Shen, Chenpeng Du, Ziyang Ma, Kai Yu, Daniel Povey, Xie Chen
for: 本研究旨在比较和优化不同自动学习模型生成的speech相关任务中的精炼度数据，以探讨这些数据在多种speech任务中的 universality。
methods: 本研究使用了多种主流自动学习模型生成的精炼度数据，包括Speech-to-Text和Text-to-Speech两个任务。我们通过对这些数据进行比较和优化来探讨它们在不同的speech任务中的效果。
results: 实验结果表明，使用精炼度数据可以在speech认知任务中实现相当于FBank特征的性能，而且在speech合成任务中，使用精炼度数据可以超过mel-spectrogram特征的性能。这些发现表明了精炼度数据在多种speech任务中的可 reuse 性和可靠性。

Abstract
Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into utilizing discrete tokens for speech tasks like recognition and translation, which offer lower storage requirements and great potential to employ natural language processing techniques. However, these studies, mainly single-task focused, faced challenges like overfitting and performance degradation in speech recognition tasks, often at the cost of sacrificing performance in multi-task scenarios. This study presents a comprehensive comparison and optimization of discrete tokens generated by various leading SSL models in speech recognition and synthesis tasks. We aim to explore the universality of speech discrete tokens across multiple speech tasks. Experimental results demonstrate that discrete tokens achieve comparable results against systems trained on FBank features in speech recognition tasks and outperform mel-spectrogram features in speech synthesis in subjective and objective metrics. These findings suggest that universal discrete tokens have enormous potential in various speech-related tasks. Our work is open-source and publicly available to facilitate research in this direction.

摘要
自我监督学习（SSL）在语音相关任务中的能力已经推动了研究人员使用分割符进行语音任务，如识别和翻译，这些任务的存储要求较低，并且可以使用自然语言处理技术。然而，这些研究主要是单任务受注重，面临过拟合和多任务场景中的性能下降的挑战。本研究对各种领先SSL模型生成的分割符进行了全面的比较和优化，以探讨语音分割符的通用性。我们希望通过这些实验来探索语音分割符在多种语音任务中的可行性。实验结果表明，分割符在语音识别任务中与基于FBank特征学习系统相当，而在语音合成任务中，分割符在主观和客观指标中表现出色。这些发现表明，通用的语音分割符在多种语音任务中具有极大的潜力。我们的工作是开源的，以便促进这一方向的研究。

Training Audio Captioning Models without Audio

paper_url: http://arxiv.org/abs/2309.07372
repo_url: https://github.com/microsoft/noaudiocaptioning
paper_authors: Soham Deshmukh, Benjamin Elizalde, Dimitra Emmanouilidou, Bhiksha Raj, Rita Singh, Huaming Wang
for: 本研究旨在解决自动Audio Captioning（AAC）系统的训练数据稀缺问题，提出一种使用仅文本训练AAC系统的方法。
methods: 本方法利用了对比式训练的音频-文本模型，如CLAP，来生成音频描述。在训练过程中，一个解码器使用预训练CLAP文本Encoder生成caption。在推理过程中，文本Encoder被替换为预训练CLAP音频Encoder。为bridging模态间的差距，我们提议在训练过程中使用噪声注入或学习 adapter。
results: 我们的文本只框架与状态对的模型相比，在无需音频或人工创建的文本描述时显示了竞争力。此外，我们还展示了在训练过程中不使用音频或人工创建的文本描述时实现了样式化音频描述和描述增强。

Abstract
Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an approach to train AAC systems using only text. Our approach leverages the multimodal space of contrastively trained audio-text models, such as CLAP. During training, a decoder generates captions conditioned on the pretrained CLAP text encoder. During inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap between text and audio embeddings, we propose the use of noise injection or a learnable adapter, during training. We find that the proposed text-only framework performs competitively with state-of-the-art models trained with paired audio, showing that efficient text-to-audio transfer is possible. Finally, we showcase both stylized audio captioning and caption enrichment while training without audio or human-created text captions.

摘要
自动化语音描述（AAC）是将语音流转换为自然语言描述的任务。一个典型的AAC系统需要手动撰写的音频段和相应的文本描述标注。创建这些音频-描述对的成本高，导致AAC数据的总量匮乏。在这种情况下，我们提出了一种方法，使用仅文本进行AAC系统的训练。我们的方法利用了对比训练的音频-文本模型，如CLAP，来生成描述。在训练中，一个解码器根据预训练CLAP文本Encoder生成描述。在推理中，文本Encoder被替换为预训练CLAP音频Encoder。为了跨Modal空间的减少，我们提议在训练时使用噪声注入或学习适配器。我们发现，我们的文本仅框架与状态流行的模型相比，表现相对竞争力强。最后，我们展示了在没有音频或人工创建的文本描述时进行风格化音频描述和描述增强。