paper_authors: Chen Liu, Peike Li, Xingqun Qi, Hu Zhang, Lincheng Li, Dadong Wang, Xin Yu
for: This paper focuses on the audio-visual segmentation (AVS) task, which aims to segment sounding objects from a given video.
methods: The proposed method first localizes potential sounding objects in a video using an object segmentation network, and then associates the sounding object candidates with the given audio. To alleviate the ambiguity of training the object segmentation network, the method proposes a silent object-aware segmentation objective. Additionally, the method explores the audio-visual semantic correlation by attending predicted audio category scores to potential instance masks.
results: The proposed method can effectively segment sounding objects without being biased to salient objects, as demonstrated by experimental results on the AVS benchmarks.
Abstract
The audio-visual segmentation (AVS) task aims to segment sounding objects from a given video. Existing works mainly focus on fusing audio and visual features of a given video to achieve sounding object masks. However, we observed that prior arts are prone to segment a certain salient object in a video regardless of the audio information. This is because sounding objects are often the most salient ones in the AVS dataset. Thus, current AVS methods might fail to localize genuine sounding objects due to the dataset bias. In this work, we present an audio-visual instance-aware segmentation approach to overcome the dataset bias. In a nutshell, our method first localizes potential sounding objects in a video by an object segmentation network, and then associates the sounding object candidates with the given audio. We notice that an object could be a sounding object in one video but a silent one in another video. This would bring ambiguity in training our object segmentation network as only sounding objects have corresponding segmentation masks. We thus propose a silent object-aware segmentation objective to alleviate the ambiguity. Moreover, since the category information of audio is unknown, especially for multiple sounding sources, we propose to explore the audio-visual semantic correlation and then associate audio with potential objects. Specifically, we attend predicted audio category scores to potential instance masks and these scores will highlight corresponding sounding instances while suppressing inaudible ones. When we enforce the attended instance masks to resemble the ground-truth mask, we are able to establish the audio-visual semantic correlation. Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased to salient objects.
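To make the attention step concrete, the sketch below shows one way the idea could be wired up: predicted audio category scores are correlated with per-instance features, and the resulting weights re-weight the candidate instance masks so that sounding instances dominate the final prediction. The module name, tensor shapes, and the way scores are combined are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of attending audio category scores
# to candidate instance masks. All shapes and the score-combination scheme are
# assumptions for illustration.
import torch
import torch.nn as nn


class AudioAwareMaskAttention(nn.Module):
    def __init__(self, num_classes: int, embed_dim: int = 256):
        super().__init__()
        # Project audio category scores and per-instance features into a
        # shared space so their similarity can act as a "sounding" score.
        self.audio_proj = nn.Linear(num_classes, embed_dim)
        self.inst_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, audio_scores, inst_feats, inst_masks):
        # audio_scores: (B, num_classes)  predicted audio category scores
        # inst_feats:   (B, N, embed_dim) features of N candidate instances
        # inst_masks:   (B, N, H, W)      candidate instance masks (logits)
        a = self.audio_proj(audio_scores)              # (B, D)
        v = self.inst_proj(inst_feats)                 # (B, N, D)
        # Per-instance audio-visual correlation, normalised over instances.
        corr = torch.einsum("bd,bnd->bn", a, v)        # (B, N)
        weights = corr.softmax(dim=-1)                 # (B, N)
        # Attend the scores to the masks: sounding instances dominate.
        attended = (weights[..., None, None] * inst_masks.sigmoid()).sum(1)
        return attended                                # (B, H, W)


if __name__ == "__main__":
    m = AudioAwareMaskAttention(num_classes=20)
    out = m(torch.rand(2, 20), torch.rand(2, 5, 256), torch.randn(2, 5, 64, 64))
    print(out.shape)  # (2, 64, 64); compare against the GT mask with a segmentation loss
```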
SAMbA: Speech enhancement with Asynchronous ad-hoc Microphone Arrays
results: The attention mechanism makes the DNNs robust to sampling time offsets and sampling rate offsets without requiring costly processing steps to compensate for the offsets. Moreover, the attention mechanism learns the sampling time offset and sampling rate offset parameters automatically, without additional supervision.
Abstract
Speech enhancement in ad-hoc microphone arrays is often hindered by the asynchronization of the devices composing the microphone array. Asynchronization comes from sampling time offset and sampling rate offset, which inevitably occur when the microphones are embedded in different hardware components. In this paper, we propose a deep neural network (DNN)-based speech enhancement solution that is suited for applications in ad-hoc microphone arrays because it is distributed and copes with asynchronization. We show that asynchronization has a limited impact on the spatial filtering and mostly affects the performance of the DNNs. Instead of resynchronising the signals, which requires costly processing steps, we use an attention mechanism which makes the DNNs, and thus our whole pipeline, robust to asynchronization. We also show that the attention mechanism recovers the asynchronization parameters in an unsupervised manner.
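The sketch below illustrates, under assumed shapes and layer sizes, how a cross-attention layer could let a local enhancement DNN consume a misaligned remote channel without explicit resynchronisation: queries follow the local clock, keys and values follow the remote one, and the attention map absorbs the unknown offset. It is a simplified stand-in, not the SAMbA architecture.

```python
# A minimal sketch (assumptions throughout, not the SAMbA implementation) of
# cross-attention between a local and a remote STFT stream from two devices.
import torch
import torch.nn as nn


class CrossDeviceAttention(nn.Module):
    def __init__(self, feat_dim: int = 257, embed_dim: int = 128, heads: int = 4):
        super().__init__()
        self.local_in = nn.Linear(feat_dim, embed_dim)
        self.remote_in = nn.Linear(feat_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        self.mask_out = nn.Linear(embed_dim, feat_dim)

    def forward(self, local_mag, remote_mag):
        # local_mag, remote_mag: (B, T, F) STFT magnitudes of the local and the
        # (possibly misaligned) remote microphone signals.
        q = self.local_in(local_mag)         # queries follow the local clock
        kv = self.remote_in(remote_mag)      # keys/values follow the remote clock
        ctx, weights = self.attn(q, kv, kv)  # soft alignment across time frames
        mask = torch.sigmoid(self.mask_out(ctx))
        return mask * local_mag, weights     # enhanced magnitude + alignment map


if __name__ == "__main__":
    net = CrossDeviceAttention()
    enhanced, w = net(torch.rand(2, 100, 257), torch.rand(2, 100, 257))
    print(enhanced.shape, w.shape)  # (2, 100, 257) (2, 100, 100)
```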
Contrastive Conditional Latent Diffusion for Audio-visual Segmentation
paper_authors: Yuxin Mao, Jing Zhang, Mochu Xiang, Yunqiu Lv, Yiran Zhong, Yuchao Dai
for: This paper proposes a latent diffusion model with contrastive learning for audio-visual segmentation (AVS) to explore the contribution of audio.
methods: The proposed method uses a latent diffusion model to learn the conditional generation process of the ground-truth segmentation map, and introduces contrastive learning to learn audio-visual correspondence.
results: Experimental results on a benchmark dataset verify the effectiveness of the proposed solution, demonstrating the importance of modeling the correlation between audio and the final segmentation map for AVS.
Abstract
We propose a latent diffusion model with contrastive learning for audio-visual segmentation (AVS) to extensively explore the contribution of audio. We interpret AVS as a conditional generation task, where audio is defined as the conditional variable for sound producer(s) segmentation. With our new interpretation, it is especially necessary to model the correlation between audio and the final segmentation map to ensure its contribution. We introduce a latent diffusion model to our framework to achieve semantic-correlated representation learning. Specifically, our diffusion model learns the conditional generation process of the ground-truth segmentation map, leading to ground-truth aware inference when we perform the denoising process at the test stage. As a conditional diffusion model, we argue it is essential to ensure that the conditional variable contributes to model output. We then introduce contrastive learning to our framework to learn audio-visual correspondence, which is proven consistent with maximizing the mutual information between model prediction and the audio data. In this way, our latent diffusion model via contrastive learning explicitly maximizes the contribution of audio for AVS. Experimental results on the benchmark dataset verify the effectiveness of our solution. Code and results are online via our project page: https://github.com/OpenNLPLab/DiffusionAVS.
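The following sketch shows the two training signals described in the abstract: a conditional denoising loss on the segmentation-map latent and an InfoNCE-style contrastive loss tying predictions to their paired audio. Shapes, the noise schedule, and the denoiser interface are placeholders, not the released DiffusionAVS code.

```python
# A compact sketch (all shapes and the denoiser are assumed) of the two losses.
import torch
import torch.nn.functional as F


def diffusion_loss(denoiser, mask_latent, audio_emb, visual_feat, T=1000):
    """Epsilon-prediction loss with (audio, visual) conditioning.
    mask_latent: (B, D) latent of the ground-truth segmentation map."""
    b = mask_latent.size(0)
    t = torch.randint(0, T, (b,), device=mask_latent.device)
    noise = torch.randn_like(mask_latent)
    alpha_bar = torch.cos(t.float() / T * torch.pi / 2) ** 2   # toy cosine schedule
    noisy = alpha_bar.sqrt()[:, None] * mask_latent + (1 - alpha_bar).sqrt()[:, None] * noise
    pred_noise = denoiser(noisy, t, audio_emb, visual_feat)    # conditional denoiser
    return F.mse_loss(pred_noise, noise)


def contrastive_loss(pred_emb, audio_emb, temperature=0.07):
    """InfoNCE over the batch: each prediction should match its own audio clip."""
    pred = F.normalize(pred_emb, dim=-1)
    audio = F.normalize(audio_emb, dim=-1)
    logits = pred @ audio.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    dummy_denoiser = lambda x, t, a, v: x * 0.0   # stands in for a conditional U-Net
    z, a, v = torch.randn(4, 64), torch.randn(4, 64), torch.randn(4, 64)
    print(diffusion_loss(dummy_denoiser, z, a, v).item(),
          contrastive_loss(torch.randn(4, 64), a).item())
```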
DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training
results: Experiments show that DiffProsody generates prosody 16 times faster than the conventional diffusion model and achieves higher quality and effectiveness than previous methods.
Abstract
Expressive text-to-speech systems have undergone significant advancements owing to prosody modeling, but conventional methods can still be improved. Traditional approaches have relied on the autoregressive method to predict the quantized prosody vector; however, it suffers from the issues of long-term dependency and slow inference. This study proposes a novel approach called DiffProsody in which expressive speech is synthesized using a diffusion-based latent prosody generator and prosody conditional adversarial training. Our findings confirm the effectiveness of our prosody generator in generating a prosody vector. Furthermore, our prosody conditional discriminator significantly improves the quality of the generated speech by accurately emulating prosody. We use denoising diffusion generative adversarial networks to improve the prosody generation speed. Consequently, DiffProsody is capable of generating prosody 16 times faster than the conventional diffusion model. The superior performance of our proposed method has been demonstrated via experiments.
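As a rough illustration of prosody conditional adversarial training, the sketch below scores mel frames with a discriminator conditioned on the prosody vector and applies LSGAN-style losses; the architecture and loss choices are assumptions, not the DiffProsody implementation.

```python
# A small sketch (architecture and losses assumed) of a prosody-conditional
# discriminator used for adversarial training of a TTS decoder.
import torch
import torch.nn as nn


class ProsodyConditionalDiscriminator(nn.Module):
    def __init__(self, mel_dim: int = 80, prosody_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim + prosody_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),
        )

    def forward(self, mel, prosody):
        # mel: (B, T, mel_dim); prosody: (B, prosody_dim) broadcast over frames
        p = prosody[:, None, :].expand(-1, mel.size(1), -1)
        return self.net(torch.cat([mel, p], dim=-1))   # per-frame realism score


def lsgan_d_loss(d, real_mel, fake_mel, prosody):
    return ((d(real_mel, prosody) - 1) ** 2).mean() + (d(fake_mel, prosody) ** 2).mean()


def lsgan_g_loss(d, fake_mel, prosody):
    return ((d(fake_mel, prosody) - 1) ** 2).mean()


if __name__ == "__main__":
    d = ProsodyConditionalDiscriminator()
    real, fake, pv = torch.randn(2, 50, 80), torch.randn(2, 50, 80), torch.randn(2, 256)
    print(lsgan_d_loss(d, real, fake.detach(), pv).item(), lsgan_g_loss(d, fake, pv).item())
```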
SpatialNet: Extensively Learning Spatial Information for Multichannel Joint Speech Separation, Denoising and Dereverberation
results: Experiments on various simulated and real datasets show that: 1) the proposed network achieves state-of-the-art performance on most tasks; 2) the proposed network suffers little from the spectral generalization problem; and 3) the proposed network indeed performs speaker clustering (demonstrated by attention maps).
Abstract
This work proposes a neural network, named SpatialNet, that extensively exploits spatial information for multichannel joint speech separation, denoising and dereverberation. In the short-time Fourier transform (STFT) domain, the proposed network performs end-to-end speech enhancement. It is mainly composed of interleaved narrow-band and cross-band blocks that respectively exploit narrow-band and cross-band spatial information. The narrow-band blocks process frequencies independently, and use a self-attention mechanism and temporal convolutional layers to respectively perform spatial-feature-based speaker clustering and temporal smoothing/filtering. The cross-band blocks process frames independently, and use a full-band linear layer and frequency convolutional layers to respectively learn the correlation between all frequencies and between adjacent frequencies. Experiments are conducted on various simulated and real datasets, and the results show that 1) the proposed network achieves state-of-the-art performance on almost all tasks; 2) the proposed network suffers little from the spectral generalization problem; and 3) the proposed network indeed performs speaker clustering (demonstrated by attention maps).
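The sketch below illustrates the interleaving idea under assumed feature dimensions: a narrow-band block folds frequency into the batch and models time with self-attention and a temporal convolution, while a cross-band block folds time into the batch and models frequency with a full-band linear layer and a frequency convolution. It is a simplified stand-in, not the SpatialNet code.

```python
# A minimal sketch (dimensions and layer sizes are assumptions) of interleaved
# narrow-band / cross-band processing on a (B, F, T, D) feature tensor.
import torch
import torch.nn as nn


class NarrowBandBlock(nn.Module):
    """Processes each frequency independently along the time axis."""
    def __init__(self, d: int = 96, heads: int = 4, kernel: int = 5):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.tconv = nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d)

    def forward(self, x):                      # x: (B, F, T, D)
        b, f, t, d = x.shape
        y = x.reshape(b * f, t, d)             # frequency folded into the batch
        y = y + self.attn(y, y, y)[0]          # spatial-feature-based clustering over time
        y = y + self.tconv(y.transpose(1, 2)).transpose(1, 2)  # temporal smoothing
        return y.reshape(b, f, t, d)


class CrossBandBlock(nn.Module):
    """Processes each frame independently along the frequency axis."""
    def __init__(self, d: int = 96, n_freq: int = 257, kernel: int = 5):
        super().__init__()
        self.full_band = nn.Linear(n_freq, n_freq)                 # all frequencies
        self.fconv = nn.Conv1d(d, d, kernel, padding=kernel // 2)  # adjacent frequencies

    def forward(self, x):                      # x: (B, F, T, D)
        b, f, t, d = x.shape
        y = x.permute(0, 2, 3, 1).reshape(b * t, d, f)   # time folded into the batch
        y = y + self.full_band(y)
        y = y + self.fconv(y)
        return y.reshape(b, t, d, f).permute(0, 3, 1, 2)


if __name__ == "__main__":
    x = torch.randn(1, 257, 40, 96)
    x = CrossBandBlock()(NarrowBandBlock()(x))
    print(x.shape)  # torch.Size([1, 257, 40, 96])
```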
paper_authors: SeungHeon Doh, Keunwoo Choi, Jongpil Lee, Juhan Nam
for: To improve the understanding and organization of large volumes of music data by providing a large-scale music captioning dataset.
methods: Uses large language models (LLMs) to artificially generate description sentences from a large-scale tag dataset.
results: Evaluation with multiple quantitative metrics and human evaluation shows that the proposed method outperforms the supervised baseline model; a captioning model trained on the dataset is also evaluated under zero-shot and transfer-learning settings.
Abstract
Automatic music captioning, which generates natural language descriptions for given music tracks, holds significant potential for enhancing the understanding and organization of large volumes of musical data. Despite its importance, researchers face challenges due to the costly and time-consuming collection process of existing music-language datasets, which are limited in size. To address this data scarcity issue, we propose the use of large language models (LLMs) to artificially generate the description sentences from large-scale tag datasets. This results in approximately 2.2M captions paired with 0.5M audio clips. We term it Large Language Model based Pseudo music caption dataset, shortly, LP-MusicCaps. We conduct a systemic evaluation of the large-scale music captioning dataset with various quantitative evaluation metrics used in the field of natural language processing as well as human evaluation. In addition, we trained a transformer-based music captioning model with the dataset and evaluated it under zero-shot and transfer-learning settings. The results demonstrate that our proposed approach outperforms the supervised baseline model.
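A hedged sketch of the tag-to-caption step follows: a track's tags are formatted into an instruction prompt and an LLM writes the pseudo caption. The prompt wording and the model used below are placeholders for illustration, not the setup used to build LP-MusicCaps.

```python
# A hedged sketch of generating pseudo captions from tags with an LLM.
# The prompt and the gpt2 stand-in are illustrative, not the paper's setup.
from transformers import pipeline

TEMPLATE = (
    "Write a one-sentence natural language description of a music track "
    "that has the following tags: {tags}.\nDescription:"
)


def tags_to_caption(tags, generator):
    prompt = TEMPLATE.format(tags=", ".join(tags))
    out = generator(prompt, max_new_tokens=60, do_sample=True, num_return_sequences=1)
    # Strip the prompt so only the newly generated caption remains.
    return out[0]["generated_text"][len(prompt):].strip()


if __name__ == "__main__":
    # Any instruction-following text generator can stand in here.
    generator = pipeline("text-generation", model="gpt2")
    print(tags_to_caption(["jazz", "saxophone", "slow tempo", "relaxing"], generator))
```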
Mispronunciation detection using self-supervised speech representations
results: The study finds that training a model directly for the target task gives the best performance, while most upstream models perform similarly on this task.
Abstract
In recent years, self-supervised learning (SSL) models have produced promising results in a variety of speech-processing tasks, especially in contexts of data scarcity. In this paper, we study the use of SSL models for the task of mispronunciation detection for second language learners. We compare two downstream approaches: 1) training the model for phone recognition (PR) using native English data, and 2) training a model directly for the target task using non-native English data. We compare the performance of these two approaches for various SSL representations as well as a representation extracted from a traditional DNN-based speech recognition model. We evaluate the models on L2Arctic and EpaDB, two datasets of non-native speech annotated with pronunciation labels at the phone level. Overall, we find that using a downstream model trained for the target task gives the best performance and that most upstream models perform similarly for the task.
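The sketch below illustrates the second downstream approach under assumed label formats: an SSL encoder is kept frozen and a small head is trained directly on phone-level mispronunciation labels from non-native speech. The upstream model, the head, and the alignment of labels to SSL frames are illustrative choices, not the paper's exact setup.

```python
# A sketch (labels, dimensions and the probing head are assumptions) of probing
# a frozen SSL encoder for mispronunciation detection. WAV2VEC2_BASE is one
# possible upstream model; it downloads pretrained weights on first use.
import torch
import torch.nn as nn
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
ssl = bundle.get_model().eval()
for p in ssl.parameters():
    p.requires_grad_(False)                           # upstream stays frozen

head = nn.Linear(768, 2)                              # correct vs. mispronounced
optim = torch.optim.Adam(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()


def train_step(waveform, frame_labels):
    # waveform: (B, num_samples) at bundle.sample_rate
    # frame_labels: (B, T') 0/1 labels aligned to the SSL frame rate
    with torch.no_grad():
        feats, _ = ssl(waveform)                      # (B, T', 768)
    logits = head(feats)
    loss = criterion(logits.flatten(0, 1), frame_labels.flatten())
    optim.zero_grad(); loss.backward(); optim.step()
    return loss.item()


if __name__ == "__main__":
    wav = torch.randn(1, 16000)                       # 1 s of dummy audio
    feats, _ = ssl(wav)
    labels = torch.randint(0, 2, (1, feats.size(1)))
    print(train_step(wav, labels))
```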