results: The system was tested on different scenarios and datasets, demonstrating its robustness to anti-forensic attacks and its generalization capabilities.
Abstract
Recent advances in deep learning and computer vision have made the synthesis and counterfeiting of multimedia content more accessible than ever, leading to possible threats and dangers from malicious users. In the audio field, we are witnessing the growth of speech deepfake generation techniques, which solicit the development of synthetic speech detection algorithms to counter possible mischievous uses such as frauds or identity thefts. In this paper, we consider three different feature sets proposed in the literature for the synthetic speech detection task and present a model that fuses them, achieving overall better performances with respect to the state-of-the-art solutions. The system was tested on different scenarios and datasets to prove its robustness to anti-forensic attacks and its generalization capabilities.
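For readers who want a concrete picture of feature-level fusion, the sketch below shows one way three feature streams could be embedded, concatenated, and classified as bona fide or spoofed speech. It is a minimal PyTorch illustration under assumed feature dimensions and layer sizes, not the authors' model.

```python
# Minimal sketch (not the authors' code): late fusion of three hand-crafted
# feature sets for a binary real/fake speech classifier. Feature dimensions
# and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class FusionDetector(nn.Module):
    def __init__(self, dims=(60, 120, 90), hidden=128):
        super().__init__()
        # One small embedding branch per feature set.
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in dims
        )
        # Fused representation -> bona fide / spoofed logit.
        self.classifier = nn.Linear(hidden * len(dims), 1)

    def forward(self, feats):
        # feats: list of three tensors, each of shape (batch, dim_i)
        fused = torch.cat([b(f) for b, f in zip(self.branches, feats)], dim=-1)
        return self.classifier(fused)

detector = FusionDetector()
x = [torch.randn(4, d) for d in (60, 120, 90)]
print(detector(x).shape)  # torch.Size([4, 1])
```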
Automated approach for source location in shallow waters
results: The method is validated on experimental data of right whale gunshot and combustive sound sources, demonstrating its effectiveness in real-world scenarios.
Abstract
This paper proposes a fully automated method for recovering the location of a source and medium parameters in shallow waters. The scenario involves an unknown source emitting low-frequency sound waves in a shallow water environment, and a single hydrophone recording the signal. Firstly, theoretical tools are introduced to understand the robustness of the warping method and to propose and analyze an automated way to separate the modal components of the recorded signal. Secondly, using the spectrogram of each modal component, the paper investigates the best way to recover the modal travel times and provides stability estimates. Finally, a penalized minimization algorithm is presented to recover estimates of the source location and medium parameters. The proposed method is tested on experimental data of right whale gunshot and combustive sound sources, demonstrating its effectiveness in real-world scenarios.
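To make the final step concrete, here is a hedged numerical sketch of a penalized least-squares fit of modal travel times, using the dispersion relation of an ideal isovelocity waveguide (pressure-release surface, rigid bottom) as a stand-in forward model. The parameterization, penalty, and optimizer are illustrative assumptions, not the paper's algorithm.

```python
# Minimal sketch (assumptions, not the paper's method): fit range r, water depth D,
# sound speed c, and an emission-time offset t0 to measured modal travel times by
# penalized least squares. Forward model: ideal isovelocity waveguide dispersion,
# group speed v_g^m(f) = c * sqrt(1 - (f_c^m / f)^2), cutoff f_c^m = (2m - 1) c / (4 D).
import numpy as np
from scipy.optimize import minimize

def travel_time(f, m, r, D, c):
    fc = (2 * m - 1) * c / (4 * D)                           # modal cutoff frequency
    vg = c * np.sqrt(np.maximum(1 - (fc / f) ** 2, 1e-9))    # modal group speed
    return r / vg

def objective(theta, data, prior, lam):
    r, D, c, t0 = theta
    misfit = sum(
        np.sum((t_meas - t0 - travel_time(f, m, r, D, c)) ** 2)
        for m, (f, t_meas) in data.items()
    )
    # Quadratic penalty keeping the estimate near rough prior guesses.
    return misfit + lam * np.sum((np.asarray(theta) - prior) ** 2)

# data: {mode_number: (frequencies_Hz, measured_travel_times_s)}, synthetic here.
freqs = np.linspace(30.0, 80.0, 20)
true = (8000.0, 55.0, 1480.0, 0.0)
data = {m: (freqs, travel_time(freqs, m, *true[:3]) + true[3]) for m in (1, 2, 3)}
prior = np.array([10000.0, 50.0, 1500.0, 0.0])
res = minimize(objective, prior, args=(data, prior, 1e-8), method="Nelder-Mead")
print(res.x)  # estimated (range, depth, sound speed, time offset)
```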
Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding
results: The proposed methods outperform baseline methods and can generate speech with diverse prosodic expressions.
Abstract
Recently, there has been a growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and using two sequence-to-sequence tasks to decouple TTS. To address the challenges associated with high dimensionality and waveform distortion in discrete representations, we propose Diff-LM-Speech, which models semantic embeddings into mel-spectrogram based on diffusion models and introduces a prompt encoder structure based on variational autoencoders and prosody bottlenecks to improve prompt representation capabilities. Autoregressive language models often suffer from missing and repeated words, while non-autoregressive frameworks face expression averaging problems due to duration prediction models. To address these issues, we propose Tetra-Diff-Speech, which designs a duration diffusion model to achieve diverse prosodic expressions. While we expect the information content of semantic coding to be between that of text and acoustic coding, existing models extract semantic coding with a lot of redundant information and dimensionality explosion. To verify that semantic coding is not necessary, we propose Tri-Diff-Speech. Experimental results show that our proposed methods outperform baseline methods. We provide a website with audio samples.
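As a rough illustration of the duration diffusion idea (an assumption about how such a module could be trained, not the paper's Tetra-Diff-Speech code), the sketch below runs one DDPM-style training step that denoises per-phoneme log-durations conditioned on phoneme embeddings; at sampling time, different noise draws would yield different duration sequences and hence diverse prosody.

```python
# Minimal sketch (illustrative assumptions throughout): a duration diffusion model
# learns to predict the noise added to clean log-durations, conditioned on phoneme
# embeddings and the diffusion timestep.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DurationDenoiser(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 2, dim), nn.SiLU(), nn.Linear(dim, 1))

    def forward(self, noisy_dur, phoneme_emb, t_frac):
        # noisy_dur: (B, L, 1); phoneme_emb: (B, L, dim); t_frac: (B, 1, 1) in [0, 1]
        t_frac = t_frac.expand(-1, noisy_dur.size(1), -1)
        return self.net(torch.cat([noisy_dur, phoneme_emb, t_frac], dim=-1))

def diffusion_step(model, log_dur, phoneme_emb, alphas_cumprod):
    # Corrupt clean log-durations at a random timestep and predict the added noise.
    B, T = log_dur.size(0), len(alphas_cumprod)
    t = torch.randint(0, T, (B,))
    a = alphas_cumprod[t].view(B, 1, 1)
    noise = torch.randn_like(log_dur)
    noisy = a.sqrt() * log_dur + (1 - a).sqrt() * noise
    pred = model(noisy, phoneme_emb, t.view(B, 1, 1).float() / T)
    return F.mse_loss(pred, noise)

betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
loss = diffusion_step(DurationDenoiser(), torch.randn(4, 20, 1),
                      torch.randn(4, 20, 256), alphas_cumprod)
print(loss)
```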
The FlySpeech Audio-Visual Speaker Diarization System for MISP Challenge 2022
results: Our experimental results show that the AVSD system performs well across different numbers of speakers and background-noise levels, and achieves higher accuracy than the other participating systems in most cases.
Abstract
This paper describes the FlySpeech speaker diarization system submitted to the second Multimodal Information Based Speech Processing (MISP) Challenge held at ICASSP 2022. We develop an end-to-end audio-visual speaker diarization (AVSD) system, which consists of a lip encoder, a speaker encoder, and an audio-visual decoder. Specifically, to mitigate the degradation of diarization performance caused by separate training, we jointly train the speaker encoder and the audio-visual decoder. In addition, we leverage a speaker extractor pretrained on large-scale data to initialize the speaker encoder.
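A minimal sketch of how the three named components could fit together, assuming made-up feature shapes and simple GRU/linear modules in place of the real encoders (the pretrained speaker extractor would be loaded into the speaker-encoder weights). It illustrates the joint-training idea, not the FlySpeech implementation.

```python
# Minimal sketch (shapes and layer choices are assumptions): lip encoder +
# speaker encoder + audio-visual decoder producing per-frame speaker activity.
import torch
import torch.nn as nn

class AVSD(nn.Module):
    def __init__(self, audio_dim=80, lip_dim=512, spk_dim=256, hidden=256, max_spk=6):
        super().__init__()
        self.lip_encoder = nn.GRU(lip_dim, hidden, batch_first=True)
        # In practice, initialized from a pretrained speaker extractor.
        self.speaker_encoder = nn.Sequential(nn.Linear(audio_dim, spk_dim), nn.ReLU())
        self.decoder = nn.GRU(hidden + spk_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, max_spk)

    def forward(self, fbank, lip_feats):
        # fbank: (B, T, audio_dim), lip_feats: (B, T, lip_dim)
        lip, _ = self.lip_encoder(lip_feats)
        spk = self.speaker_encoder(fbank)
        out, _ = self.decoder(torch.cat([lip, spk], dim=-1))
        return torch.sigmoid(self.head(out))  # (B, T, max_spk) activity probabilities

model = AVSD()
# Joint training: one diarization loss updates both the speaker encoder and the decoder.
probs = model(torch.randn(2, 100, 80), torch.randn(2, 100, 512))
labels = torch.randint(0, 2, (2, 100, 6)).float()
loss = nn.functional.binary_cross_entropy(probs, labels)
loss.backward()
```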
Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions
results: Experiments show that our HCI method improves ATR performance, while our auxiliary captions (AC) framework yields better audio representations and can also serve as data augmentation for training.
Abstract
Most existing audio-text retrieval (ATR) methods focus on constructing contrastive pairs between whole audio clips and complete caption sentences, while ignoring fine-grained cross-modal relationships, e.g., short segments and phrases or frames and words. In this paper, we introduce a hierarchical cross-modal interaction (HCI) method for ATR by simultaneously exploring clip-sentence, segment-phrase, and frame-word relationships, achieving a comprehensive multi-modal semantic comparison. Besides, we also present a novel ATR framework that leverages auxiliary captions (AC) generated by a pretrained captioner to perform feature interaction between audio and generated captions, which yields enhanced audio representations and is complementary to the original ATR matching branch. The audio and generated captions can also form new audio-text pairs as data augmentation for training. Experiments show that our HCI significantly improves the ATR performance. Moreover, our AC framework also shows stable performance gains on multiple datasets.
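To illustrate what multi-granularity matching can look like in practice, the sketch below scores clip-sentence, segment-phrase, and frame-word pairs with a max-over-units similarity and a symmetric InfoNCE loss at each level. The aggregation rule, temperature, and level weights are assumptions rather than the paper's exact formulation.

```python
# Minimal sketch (assumed aggregation, not the paper's HCI formulation): a
# hierarchical contrastive loss over three levels of audio-text granularity.
import torch
import torch.nn.functional as F

def pair_similarity(audio_units, text_units):
    # audio_units: (B, Na, D), text_units: (B, Nt, D) -> (B, B) similarity matrix.
    a = F.normalize(audio_units, dim=-1)
    t = F.normalize(text_units, dim=-1)
    sim = torch.einsum("iad,jtd->ijat", a, t)   # unit-level cosine similarities
    # Max over text units, mean over audio units (fine-grained max-sim aggregation).
    return sim.max(dim=-1).values.mean(dim=-1)

def info_nce(sim, tau=0.07):
    labels = torch.arange(sim.size(0))
    logits = sim / tau
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def hci_loss(levels, weights=(1.0, 1.0, 1.0)):
    # levels: [(clip, sentence), (segments, phrases), (frames, words)] embeddings
    return sum(w * info_nce(pair_similarity(a, t)) for w, (a, t) in zip(weights, levels))

B, D = 8, 256
levels = [(torch.randn(B, 1, D), torch.randn(B, 1, D)),    # clip-sentence
          (torch.randn(B, 4, D), torch.randn(B, 5, D)),    # segment-phrase
          (torch.randn(B, 50, D), torch.randn(B, 12, D))]  # frame-word
print(hci_loss(levels))
```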
PCNN: A Lightweight Parallel Conformer Neural Network for Efficient Monaural Speech Enhancement
results: Compared with existing methods, the proposed method performs better on most evaluation criteria while having the fewest model parameters.
Abstract
Convolutional neural networks (CNN) and Transformers have been widely successful in multimedia applications. However, more effort is needed to harmonize these two architectures effectively for speech enhancement. This paper aims to unify the two architectures and presents a Parallel Conformer for speech enhancement. In particular, the CNN and the self-attention (SA) in the Transformer are fully exploited for local formant patterns and global structure representations, respectively. Motivated by the small receptive field of the CNN and the high computational complexity of SA, we design a multi-branch dilated convolution (MBDC) module and a self-channel-time-frequency attention (Self-CTFA) module. The MBDC contains three convolutional layers with different dilation rates to process features from local to non-local scales. Experimental results show that our method outperforms state-of-the-art methods on most evaluation criteria while maintaining the lowest number of model parameters.
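A minimal sketch of a multi-branch dilated convolution block consistent with this description (three parallel dilated convolutions whose outputs are fused); channel counts, kernel size, dilation rates, and summation as the fusion rule are assumptions.

```python
# Minimal sketch (branch count from the abstract; everything else is assumed):
# a multi-branch dilated convolution block over (batch, channels, time, freq) features.
import torch
import torch.nn as nn

class MBDC(nn.Module):
    def __init__(self, channels=64, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d),
                nn.BatchNorm2d(channels),
                nn.PReLU(),
            )
            for d in dilations
        )

    def forward(self, x):
        # Each branch sees a different receptive field (local -> non-local);
        # their outputs are summed into one feature map of the same shape.
        return sum(branch(x) for branch in self.branches)

block = MBDC()
y = block(torch.randn(2, 64, 100, 161))  # (batch, channels, frames, freq bins)
print(y.shape)
```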
results: The approach outperforms current state-of-the-art methods on multiple challenging datasets.
Abstract
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. Existing methods assume access to paired training data, where the audio is observed in both source and target environments, but this limits the diversity of training data or requires the use of simulated data or heuristics to create paired samples. We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio -- without acoustically mismatched source audio for reference. Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric that quantifies the level of residual acoustic information in the de-biased audio. Training with either in-the-wild web data or simulated data, we demonstrate it outperforms the state-of-the-art on multiple challenging datasets and a wide variety of real-world audio and environments.
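For intuition, here is a heavily simplified conditional-GAN training step operating on placeholder audio feature vectors conditioned on a target-scene image embedding. The module shapes, losses, and the stand-in "de-biased" source audio are assumptions, not the authors' model or their residual-acoustics metric.

```python
# Minimal sketch (illustrative assumptions only): one training step of a conditional
# GAN that re-synthesizes audio features to match a target scene, with the generator
# and discriminator both conditioned on an image embedding of that scene.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(128 + 512, 256), nn.ReLU(), nn.Linear(256, 128))
D = nn.Sequential(nn.Linear(128 + 512, 256), nn.ReLU(), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

audio = torch.randn(8, 128)    # features of audio recorded in the target scene
img_emb = torch.randn(8, 512)  # embedding of the target scene image

# Self-supervised setup: the "source" input stands in for a de-biased version
# of the target audio (a placeholder transform, not the paper's procedure).
source = audio + 0.1 * torch.randn_like(audio)

# Discriminator step: real = target-scene audio, fake = generator output.
fake = G(torch.cat([source, img_emb], dim=-1)).detach()
d_loss = bce(D(torch.cat([audio, img_emb], dim=-1)), torch.ones(8, 1)) + \
         bce(D(torch.cat([fake, img_emb], dim=-1)), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: fool the discriminator for the conditioned target scene.
fake = G(torch.cat([source, img_emb], dim=-1))
g_loss = bce(D(torch.cat([fake, img_emb], dim=-1)), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```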