cs.SD - 2023-09-13

Enhancing Child Vocalization Classification in Multi-Channel Child-Adult Conversations Through Wav2vec2 Children ASR Features

  • paper_url: http://arxiv.org/abs/2309.07287
  • repo_url: None
  • paper_authors: Jialu Li, Mark Hasegawa-Johnson, Karrie Karahalios
  • For: The paper aims to develop a machine learning model that can label adult and child audio in recordings of clinician-child interactions, with the goal of helping clinicians capture events of interest and communicate with parents more effectively.
  • Methods: The authors use the self-supervised learning model Wav2Vec 2.0 (W2V2), pretrained on 4300 hours of home recordings of children under 5 years old. They apply this system to two-channel audio recordings of brief clinician-child interactions from the Rapid-ABC corpus and introduce auxiliary features extracted from a W2V2-based automatic speech recognition (ASR) system for children under 4 years old to improve vocalization classification (VC).
  • Results: The authors observe consistent improvements on the VC task on two corpora (Rapid-ABC and BabbleCor) and reach or outperform the state-of-the-art performance on BabbleCor.
    Abstract Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder that often emerges in early childhood. ASD assessment typically involves an observation protocol, including note-taking and ratings of the child's social behavior, conducted by a trained clinician. A robust machine learning (ML) model capable of labeling adult and child audio has the potential to save significant time and labor in manually coding children's behaviors. This may assist clinicians in capturing events of interest, communicating events to parents, and educating new clinicians. In this study, we leverage the self-supervised learning model Wav2Vec 2.0 (W2V2), pretrained on 4300h of home recordings of children under 5 years old, to build a unified system that performs both speaker diarization (SD) and vocalization classification (VC) tasks. We apply this system to two-channel audio recordings of brief 3-5 minute clinician-child interactions from the Rapid-ABC corpus. We propose a novel technique that introduces auxiliary features extracted from a W2V2-based automatic speech recognition (ASR) system for children under 4 years old to improve the children's VC task. We test our proposed method on two corpora (Rapid-ABC and BabbleCor) and observe consistent improvements. Furthermore, we reach, and possibly outperform, the state-of-the-art performance on BabbleCor.
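To make the auxiliary-feature idea concrete, here is a minimal sketch (not the paper's code) of concatenating frame-level W2V2 hidden states with frame-aligned ASR-derived features for vocalization classification. The public facebook/wav2vec2-base checkpoint, the 32-dimensional aux_feats tensor, and the four-class label set are stand-in assumptions.

```python
# Minimal sketch: frame-level vocalization classification from W2V2 hidden states
# concatenated with auxiliary features from a children's ASR system.
# "facebook/wav2vec2-base" stands in for the paper's checkpoint pretrained on home
# recordings; aux_feats and the label set are illustrative placeholders.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
w2v2 = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

waveform = torch.randn(16000 * 3).numpy()              # 3 s of 16 kHz audio (placeholder)
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = w2v2(**inputs).last_hidden_state           # (1, T, 768)

# Auxiliary ASR-derived features (e.g. phone posteriors), assumed aligned to the T frames.
aux_feats = torch.randn(1, hidden.size(1), 32)

classifier = nn.Linear(768 + 32, 4)                     # e.g. cry / fuss / babble / speech
logits = classifier(torch.cat([hidden, aux_feats], dim=-1))   # (1, T, 4)
```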

A Flexible Online Framework for Projection-Based STFT Phase Retrieval

  • paper_url: http://arxiv.org/abs/2309.07043
  • repo_url: None
  • paper_authors: Tal Peer, Simon Welker, Johannes Kolhoff, Timo Gerkmann
  • for: Improving the performance of iterative STFT phase retrieval
  • methods: Combining the projection operators of the Griffin-Lim method in new ways to obtain better reconstruction quality with fewer iterations while keeping the same computational complexity
  • results: Better reconstruction quality on speech signals, outperforming RTISI, and any iterative projection-based algorithm can be implemented online within the proposed framework
    Abstract Several recent contributions in the field of iterative STFT phase retrieval have demonstrated that the performance of the classical Griffin-Lim method can be considerably improved upon. By using the same projection operators as Griffin-Lim, but combining them in innovative ways, these approaches achieve better results in terms of both reconstruction quality and required number of iterations, while retaining a similar computational complexity per iteration. However, like Griffin-Lim, these algorithms operate in an offline manner and thus require an entire spectrogram as input, which is an unrealistic requirement for many real-world speech communication applications. We propose to extend RTISI -- an existing online (frame-by-frame) variant of the Griffin-Lim algorithm -- into a flexible framework that enables straightforward online implementation of any algorithm based on iterative projections. We further employ this framework to implement online variants of the fast Griffin-Lim algorithm, the accelerated Griffin-Lim algorithm, and two algorithms from the optics domain. Evaluation results on speech signals show that, similarly to the offline case, these algorithms can achieve a considerable performance gain compared to RTISI.
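For readers unfamiliar with projection-based phase retrieval, here is a minimal offline sketch of the two projection operators these methods share: the consistency projection (iSTFT followed by STFT) and the magnitude projection (restoring the target magnitude). The paper's contribution, running such projections frame by frame online and in accelerated combinations, is not shown; the window settings and iteration count are arbitrary assumptions.

```python
# Minimal offline Griffin-Lim-style iteration showing the two shared projections:
# P_C (consistency via iSTFT -> STFT) and P_A (replace magnitude with the target).
import numpy as np
from scipy.signal import stft, istft

fs, nperseg = 16000, 512
x = np.random.randn(fs)                                 # placeholder signal
_, _, S = stft(x, fs=fs, nperseg=nperseg)
target_mag = np.abs(S)                                  # the given magnitude spectrogram

# Initialise with random phase
X = target_mag * np.exp(1j * 2 * np.pi * np.random.rand(*target_mag.shape))

for _ in range(50):
    _, x_hat = istft(X, fs=fs, nperseg=nperseg)         # back to the time domain
    _, _, X = stft(x_hat, fs=fs, nperseg=nperseg)       # P_C: consistent spectrogram
    X = target_mag * np.exp(1j * np.angle(X))           # P_A: restore the target magnitude
```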

Diffusion models for audio semantic communication

  • paper_url: http://arxiv.org/abs/2309.07195
  • repo_url: None
  • paper_authors: Eleonora Grassucci, Christian Marinoni, Andrea Rodriguez, Danilo Comminiello
  • for: Improving the robustness and reliability of audio transmission by sending the semantics of the audio signal and regenerating the audio at the receiver with a conditional diffusion model
  • methods: A generative audio semantic communication framework that treats the communication problem as an inverse problem; lower-dimensional representations of the audio signal and its semantics are transmitted, and a conditional diffusion model at the receiver reconstructs the audio
  • results: Experiments show that the method outperforms competitors under different channel conditions and effectively reconstructs the audio. Visit the project page to listen to samples and access the code: https://ispamm.github.io/diffusion-audio-semantic-communication/
    Abstract Directly sending audio signals from a transmitter to a receiver across a noisy channel may consume considerable bandwidth and be prone to errors when trying to recover the transmitted bits. On the contrary, the recent semantic communication approach proposes to send the semantics and then regenerate semantically consistent content at the receiver without exactly recovering the bitstream. In this paper, we propose a generative audio semantic communication framework that faces the communication problem as an inverse problem, therefore being robust to different corruptions. Our method transmits lower-dimensional representations of the audio signal and of the associated semantics to the receiver, which generates the corresponding signal with a particular focus on its meaning (i.e., the semantics) thanks to the conditional diffusion model at its core. During the generation process, the diffusion model restores the received information from multiple degradations at the same time including corruption noise and missing parts caused by the transmission over the noisy channel. We show that our framework outperforms competitors in a real-world scenario and with different channel conditions. Visit the project page to listen to samples and access the code: https://ispamm.github.io/diffusion-audio-semantic-communication/.
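As a rough illustration of the receiver side, here is a minimal sketch of one conditional reverse-diffusion (DDPM-style) step, where the denoiser is conditioned on the received semantic embedding. The tiny MLP denoiser, latent sizes, and noise schedule are illustrative assumptions, not the paper's model.

```python
# Minimal sketch of one conditional reverse-diffusion step: the denoiser predicts the
# noise in the received audio latent given the transmitted semantic embedding.
import torch
import torch.nn as nn

class CondDenoiser(nn.Module):
    def __init__(self, dim=256, sem_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + sem_dim + 1, 512), nn.SiLU(),
                                 nn.Linear(512, dim))

    def forward(self, x_t, semantics, t):
        t_emb = t.float().view(-1, 1) / 1000.0          # crude timestep embedding
        return self.net(torch.cat([x_t, semantics, t_emb], dim=-1))

denoiser = CondDenoiser()
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

x_t = torch.randn(1, 256)        # degraded audio latent received over the noisy channel
semantics = torch.randn(1, 64)   # transmitted semantic embedding (the conditioning)
t = torch.tensor([500])

eps = denoiser(x_t, semantics, t)
mean = (x_t - (1.0 - alphas[t]) / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
x_prev = mean + torch.sqrt(betas[t]) * torch.randn_like(x_t)   # one ancestral sampling step
```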

Reorganization of the auditory-perceptual space across the human vocal range

  • paper_url: http://arxiv.org/abs/2309.06946
  • repo_url: None
  • paper_authors: Daniel Friedrichs, Volker Dellwo
  • For: This paper investigates the auditory-perceptual space of vowels across the human vocal range, focusing on the role of spectral shape in vowel perception.
  • Methods: The study uses multidimensional scaling analysis of cochlea-scaled spectra from 250-ms vowel segments, with a dataset of 240 vowels produced by three native German female speakers.
  • Results: The study finds systematic spectral shifts associated with vowel height and frontness as fundamental frequency increases, with a notable clustering around /i a u/ above 523 Hz. These findings highlight the importance of spectral shape in vowel perception and offer insights into the evolution of language.
    Abstract We analyzed the auditory-perceptual space across a substantial portion of the human vocal range (220-1046 Hz) using multidimensional scaling analysis of cochlea-scaled spectra from 250-ms vowel segments, initially studied in Friedrichs et al. (2017) J. Acoust. Soc. Am. 142 1025-1033. The dataset comprised the vowels /i y e ø ɛ a o u/ (N=240) produced by three native German female speakers, encompassing a broad range of their respective voice frequency ranges. The initial study demonstrated that, during a closed-set identification task involving 21 listeners, the point vowels /i a u/ were significantly recognized at fundamental frequencies (fo) nearing 1 kHz, whereas the recognition of other vowels decreased at higher pitches. Building on these findings, our study revealed systematic spectral shifts associated with vowel height and frontness as fo increased, with a notable clustering around /i a u/ above 523 Hz. These observations underscore the pivotal role of spectral shape in vowel perception, illustrating the reliance on acoustic anchors at higher pitches. Furthermore, this study sheds light on the quantal nature of these vowels and their potential impact on language evolution, offering a plausible explanation for their widespread presence in the world's languages.
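A minimal sketch of the analysis pipeline under stated assumptions: time-averaged log-mel spectra stand in for the cochlea-scaled spectra used in the paper, random noise stands in for the 240 recorded segments, and scikit-learn's MDS provides the multidimensional scaling embedding.

```python
# Minimal sketch: embed vowel spectra in a low-dimensional perceptual space with MDS.
import numpy as np
import librosa
from sklearn.manifold import MDS

sr = 16000
segments = [np.random.randn(int(0.25 * sr)) for _ in range(240)]   # 250-ms placeholders

def spectrum(x):
    mel = librosa.feature.melspectrogram(y=x, sr=sr, n_mels=64)
    return np.log(mel.mean(axis=1) + 1e-8)                         # time-averaged log spectrum

feats = np.stack([spectrum(seg) for seg in segments])
embedding = MDS(n_components=2).fit_transform(feats)               # (240, 2) perceptual plane
```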

VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance

  • paper_url: http://arxiv.org/abs/2309.06934
  • repo_url: None
  • paper_authors: Carlos Hernandez-Olivan, Koichi Saito, Naoki Murata, Chieh-Hsin Lai, Marco A. Martínez-Ramirez, Wei-Hsiang Liao, Yuki Mitsufuji
  • for: Restoring degraded music signals to enhance audio quality, with a method applicable across different restoration tasks
  • methods: A music restoration approach based on diffusion posterior sampling (DPS); the paper identifies issues that degrade the performance of existing DPS-based methods and mitigates them with diffusion guidance techniques, including the RePaint (RP) strategy and Pseudoinverse-Guided Diffusion Models ($\Pi$GDM)
  • results: On vocal declipping and bandwidth extension, the proposed methods outperform current DPS-based music restoration benchmarks. Restored audio examples are available at http://carlosholivan.github.io/demos/audio-restoration-2023.html
    Abstract Restoring degraded music signals is essential to enhance audio quality for downstream music manipulation. Recent diffusion-based music restoration methods have demonstrated impressive performance, and among them, diffusion posterior sampling (DPS) stands out given its intrinsic properties, making it versatile across various restoration tasks. In this paper, we identify potential issues that degrade the performance of current DPS-based methods and introduce ways to mitigate them, inspired by diverse diffusion guidance techniques including the RePaint (RP) strategy and the Pseudoinverse-Guided Diffusion Models ($\Pi$GDM). We demonstrate our methods on the vocal declipping and bandwidth extension tasks under various levels of distortion and cutoff frequency, respectively. In both tasks, our methods outperform the current DPS-based music restoration benchmarks. We refer to http://carlosholivan.github.io/demos/audio-restoration-2023.html for examples of the restored audio samples.
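A minimal sketch of the diffusion posterior sampling (DPS) guidance structure for declipping, assuming a trained denoiser is available: after each clean estimate, the sample is nudged by the gradient of a data-fidelity term through the known clipping operator. The dummy denoiser, guidance weight, and simplified update rule are assumptions; the paper's RePaint and $\Pi$GDM variants are not shown.

```python
# Minimal sketch of DPS-style guidance for declipping: nudge the sample with the
# gradient of ||y - A(x0_hat)||^2, where A is the known degradation (hard clipping).
import torch

def denoiser(x_t, t):            # stands in for a trained model's clean estimate x0_hat
    return 0.9 * x_t

def A(x, clip=0.3):              # degradation operator: hard clipping
    return torch.clamp(x, -clip, clip)

y = A(torch.randn(1, 16000))     # observed clipped audio
x_t = torch.randn(1, 16000, requires_grad=True)
zeta = 0.5                       # guidance step size

for t in reversed(range(10)):
    x0_hat = denoiser(x_t, t)
    fidelity = ((y - A(x0_hat)) ** 2).sum()
    grad = torch.autograd.grad(fidelity, x_t)[0]              # posterior guidance term
    x_t = (x0_hat - zeta * grad).detach().requires_grad_()    # simplified next iterate
```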

EMALG: An Enhanced Mandarin Lombard Grid Corpus with Meaningful Sentences

  • paper_url: http://arxiv.org/abs/2309.06858
  • repo_url: None
  • paper_authors: Baifeng Li, Qingmu Liu, Yuhong Yang, Hongyang Chen, Weiping Tu, Song Lin
  • for: Investigating the Lombard effect, where individuals adapt their speech in noisy environments
  • methods: Introducing the enhanced Mandarin Lombard grid (EMALG) corpus with meaningful sentences, addressing the challenges the MALG corpus faced with nonsense sentences
  • results: In Mandarin, female speakers exhibit a more pronounced Lombard effect than male speakers, particularly when uttering meaningful sentences; nonsense sentences negatively affect Lombard effect analysis; and the results reaffirm the consistency of the Lombard effect between English and Mandarin reported in previous research
    Abstract This study investigates the Lombard effect, where individuals adapt their speech in noisy environments. We introduce an enhanced Mandarin Lombard grid (EMALG) corpus with meaningful sentences, enhancing the Mandarin Lombard grid (MALG) corpus. EMALG features 34 speakers and improves recording setups, addressing challenges faced by MALG with nonsense sentences. Our findings reveal that in Mandarin, female speakers exhibit a more pronounced Lombard effect than male speakers, particularly when uttering meaningful sentences. Additionally, we uncover that nonsense sentences negatively impact Lombard effect analysis. Moreover, our results reaffirm the consistency in the Lombard effect comparison between English and Mandarin found in previous research.
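As a small sketch (not the corpus's own analysis code) of how the Lombard effect is commonly quantified, one can compare mean intensity and fundamental frequency between plain and Lombard recordings of the same sentence. The file names below are placeholders.

```python
# Minimal sketch: per-utterance intensity (RMS in dB) and mean F0 for Lombard analysis.
import numpy as np
import librosa

def lombard_stats(path):
    y, sr = librosa.load(path, sr=16000)
    rms_db = 20 * np.log10(np.mean(librosa.feature.rms(y=y)) + 1e-8)   # mean intensity
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)               # fundamental frequency
    return rms_db, np.nanmean(f0)

# e.g. the same sentence recorded in quiet and in babble noise (placeholder file names):
# quiet_rms, quiet_f0 = lombard_stats("speaker01_quiet.wav")
# lomb_rms, lomb_f0 = lombard_stats("speaker01_babble.wav")
```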

DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation

  • paper_url: http://arxiv.org/abs/2309.06787
  • repo_url: None
  • paper_authors: Zhichao Wu, Qiulin Li, Sixing Liu, Qun Yang
  • for: Improving the efficiency and practicality of diffusion models for the text-to-speech (TTS) task
  • methods: A Discrete Diffusion model with Contrastive learning for TTS (DCTTS); the diffusion model operates in a discrete space to reduce computation and speed up sampling, contrastive learning strengthens the alignment between text and speech, and an efficient text encoder reduces the model's parameters and improves computational efficiency
  • results: Experiments show that the proposed method maintains speech quality while greatly reducing the resource consumption and inference time of the diffusion model. Synthesized samples are available at https://github.com/lawtherWu/DCTTS
    Abstract In the text-to-speech (TTS) task, the latent diffusion model has excellent fidelity and generalization, but its expensive resource consumption and slow inference speed have always been challenging. This paper proposes the Discrete Diffusion Model with Contrastive Learning for Text-to-Speech Generation (DCTTS). DCTTS makes the following contributions: 1) The TTS diffusion model based on discrete space significantly lowers the computational consumption of the diffusion model and improves sampling speed; 2) The contrastive learning method based on discrete space is used to enhance the alignment connection between speech and text and improve sampling quality; and 3) It uses an efficient text encoder to simplify the model's parameters and increase computational efficiency. The experimental results demonstrate that the approach proposed in this paper achieves outstanding speech synthesis quality and sampling speed while significantly reducing the resource consumption of the diffusion model. The synthesized samples are available at https://github.com/lawtherWu/DCTTS.
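A minimal sketch of the contrastive text-speech alignment idea as a symmetric InfoNCE loss: matching text/speech-token embeddings are pulled together and mismatched pairs pushed apart. The encoders are omitted, and the batch size, embedding size, and temperature are illustrative assumptions rather than the paper's configuration.

```python
# Minimal sketch: symmetric InfoNCE loss aligning text and speech-token embeddings.
import torch
import torch.nn.functional as F

text_emb = F.normalize(torch.randn(8, 256), dim=-1)      # batch of text encodings
speech_emb = F.normalize(torch.randn(8, 256), dim=-1)    # matching discrete-token encodings

logits = text_emb @ speech_emb.t() / 0.07                 # cosine similarity / temperature
labels = torch.arange(8)                                  # i-th text matches i-th speech
loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```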

Distinguishing Neural Speech Synthesis Models Through Fingerprints in Speech Waveforms

  • paper_url: http://arxiv.org/abs/2309.06780
  • repo_url: None
  • paper_authors: Chu Yuan Zhang, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Xinrui Yan
  • for: Investigating source attribution of synthesized speech, which has value in forensics and intellectual property protection
  • methods: Using the multi-speaker LibriTTS dataset to investigate model fingerprints left in synthesized speech, focusing on the acoustic model and the vocoder and on how each component shapes the fingerprint in the overall waveform
  • results: Both the vocoder and the acoustic model leave distinct, model-specific fingerprints in the waveform, but the vocoder fingerprint is the more dominant and may mask that of the acoustic model; these findings indicate that model-specific fingerprints exist and can be used to identify the source of synthesized speech
    Abstract Recent strides in neural speech synthesis technologies, while enjoying widespread applications, have nonetheless introduced a series of challenges, spurring interest in the defence against the threat of misuse and abuse. Notably, source attribution of synthesized speech has value in forensics and intellectual property protection, but prior work in this area has certain limitations in scope. To address the gaps, we present our findings concerning the identification of the sources of synthesized speech in this paper. We investigate the existence of speech synthesis model fingerprints in the generated speech waveforms, with a focus on the acoustic model and the vocoder, and study the influence of each component on the fingerprint in the overall speech waveforms. Our research, conducted using the multi-speaker LibriTTS dataset, demonstrates two key insights: (1) vocoders and acoustic models impart distinct, model-specific fingerprints on the waveforms they generate, and (2) vocoder fingerprints are the more dominant of the two, and may mask the fingerprints from the acoustic model. These findings strongly suggest the existence of model-specific fingerprints for both the acoustic model and the vocoder, highlighting their potential utility in source identification applications.
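A minimal sketch of source attribution framed as closed-set classification: an utterance-level log-mel embedding is fed to a linear classifier that predicts which synthesis model produced the waveform. The random audio, 80-dimensional embedding, and three-way label set are placeholders; the paper's actual classifier and features are not specified here.

```python
# Minimal sketch: predict the source synthesis model of a waveform from log-mel features.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def embed(y, sr=16000):
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
    return np.log(mel + 1e-8).mean(axis=1)                # 80-dim utterance embedding

X = np.stack([embed(np.random.randn(16000)) for _ in range(100)])
y = np.random.randint(0, 3, size=100)                     # e.g. three candidate vocoders
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))
```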

PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network

  • paper_url: http://arxiv.org/abs/2309.06723
  • repo_url: None
  • paper_authors: Qinghua Liu, Meng Ge, Zhizheng Wu, Haizhou Li
  • for: Studying how to take full advantage of the talker's varying (turning) face to improve audio-visual speaker extraction
  • methods: A Pose-Invariant Audio-Visual speaker Extraction network (PIAVE) that generates a pose-invariant view from each original pose orientation, so the model receives a consistent frontal view of the talker regardless of head pose, forming a multi-view visual input for the speaker
  • results: Experiments on the multi-view MEAD and in-the-wild LRS3 datasets show that PIAVE outperforms the state of the art and is more robust to pose variations
    Abstract It is common in everyday spoken communication that we look at the turning head of a talker to listen to his/her voice. Humans see the talker to listen better; so do machines. However, previous studies on audio-visual speaker extraction have not effectively handled the varying talking face. This paper studies how to take full advantage of the varying talking face. We propose a Pose-Invariant Audio-Visual Speaker Extraction Network (PIAVE) that incorporates an additional pose-invariant view to improve audio-visual speaker extraction. Specifically, we generate the pose-invariant view from each original pose orientation, which enables the model to receive a consistent frontal view of the talker regardless of his/her head pose, therefore, forming a multi-view visual input for the speaker. Experiments on the multi-view MEAD and in-the-wild LRS3 dataset demonstrate that PIAVE outperforms the state-of-the-art and is more robust to pose variations.
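A minimal sketch of the multi-view fusion idea: frame-aligned visual embeddings of the original view and a frontalized (pose-invariant) view are concatenated with the mixture spectrogram to predict a mask for the target talker. All module choices and dimensions are assumptions, not the PIAVE architecture.

```python
# Minimal sketch: fuse mixture spectrogram with original-view and frontalized-view
# visual embeddings to estimate a time-frequency mask for the target speaker.
import torch
import torch.nn as nn

class MultiViewExtractor(nn.Module):
    def __init__(self, a_dim=257, v_dim=128):
        super().__init__()
        self.fuse = nn.GRU(a_dim + 2 * v_dim, 256, batch_first=True)
        self.mask = nn.Linear(256, a_dim)

    def forward(self, mix_spec, view_orig, view_frontal):
        x = torch.cat([mix_spec, view_orig, view_frontal], dim=-1)   # (B, T, a+2v)
        h, _ = self.fuse(x)
        return torch.sigmoid(self.mask(h)) * mix_spec                # masked target speech

model = MultiViewExtractor()
est = model(torch.randn(1, 100, 257), torch.randn(1, 100, 128), torch.randn(1, 100, 128))
```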

Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer

  • paper_url: http://arxiv.org/abs/2309.06672
  • repo_url: None
  • paper_authors: Zhengyang Chen, Bing Han, Shuai Wang, Yanmin Qian
  • for: Improving speaker diarization performance, especially for scenarios with an unseen number of speakers
  • methods: An attention-based encoder-decoder network trained with a teacher-forcing strategy, together with an iterative decoding method that outputs the diarization result for each speaker sequentially
  • results: New state-of-the-art diarization error rates on the CALLHOME (10.08%), DIHARD II (24.64%), and AMI (13.00%) benchmarks; the system is also highly competitive as a speech type detection model
    Abstract Deep neural network-based systems have significantly improved the performance of speaker diarization tasks. However, end-to-end neural diarization (EEND) systems often struggle to generalize to scenarios with an unseen number of speakers, while target speaker voice activity detection (TS-VAD) systems tend to be overly complex. In this paper, we propose a simple attention-based encoder-decoder network for end-to-end neural diarization (AED-EEND). In our training process, we introduce a teacher-forcing strategy to address the speaker permutation problem, leading to faster model convergence. For evaluation, we propose an iterative decoding method that outputs diarization results for each speaker sequentially. Additionally, we propose an Enhancer module to enhance the frame-level speaker embeddings, enabling the model to handle scenarios with an unseen number of speakers. We also explore replacing the transformer encoder with a Conformer architecture, which better models local information. Furthermore, we discovered that commonly used simulation datasets for speaker diarization have a much higher overlap ratio compared to real data. We found that using simulated training data that is more consistent with real data can achieve an improvement in consistency. Extensive experimental validation demonstrates the effectiveness of our proposed methodologies. Our best system achieved a new state-of-the-art diarization error rate (DER) performance on all the CALLHOME (10.08%), DIHARD II (24.64%), and AMI (13.00%) evaluation benchmarks, when no oracle voice activity detection (VAD) is used. Beyond speaker diarization, our AED-EEND system also shows remarkable competitiveness as a speech type detection model.
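A minimal sketch of the attention-based encoder-decoder structure with iterative (speaker-by-speaker) decoding: the encoder embeds frames and the decoder attends once per speaker, emitting that speaker's frame-level activity. The query embedding, attention pooling, and dimensions are assumptions; teacher forcing, the Enhancer module, and the stopping criterion are not shown.

```python
# Minimal sketch: encoder embeds frames; the decoder attends once per speaker and
# outputs that speaker's frame-level activity, decoded iteratively.
import torch
import torch.nn as nn

class AEDDiarizer(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, max_spk=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.queries = nn.Embedding(max_spk, d_model)      # one query per decoding step
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, feats, n_spk):
        enc = self.encoder(self.proj(feats))               # (B, T, D)
        activities = []
        for s in range(n_spk):                             # decode speakers one at a time
            q = self.queries(torch.tensor([s])).expand(enc.size(0), 1, -1)
            spk_emb, _ = self.attn(q, enc, enc)            # attend over encoded frames
            activities.append(torch.sigmoid((enc * spk_emb).sum(-1)))   # (B, T)
        return torch.stack(activities, dim=-1)             # (B, T, n_spk)

probs = AEDDiarizer()(torch.randn(2, 300, 80), n_spk=3)
```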

Differentiable Modelling of Percussive Audio with Transient and Spectral Synthesis

  • paper_url: http://arxiv.org/abs/2309.06649
  • repo_url: https://github.com/jorshi/drumblender
  • paper_authors: Jordie Shier, Franco Caspe, Andrew Robertson, Mark Sandler, Charalampos Saitis, Andrew McPherson
  • for: Proposing a differentiable digital signal processing (DDSP) approach for percussive sound synthesis that explicitly models the transient portion of signals
  • methods: A percussive synthesis model that builds on sinusoidal modeling synthesis and incorporates a modulated temporal convolutional network for transient generation, jointly trained with differentiable noise and transient encoders to reconstruct drumset sounds
  • results: Reconstruction metrics computed on a large dataset of acoustic and electronic percussion samples show improved onset signal reconstruction for membranophone percussion instruments
    Abstract Differentiable digital signal processing (DDSP) techniques, including methods for audio synthesis, have gained attention in recent years and lend themselves to interpretability in the parameter space. However, current differentiable synthesis methods have not explicitly sought to model the transient portion of signals, which is important for percussive sounds. In this work, we present a unified synthesis framework aiming to address transient generation and percussive synthesis within a DDSP framework. To this end, we propose a model for percussive synthesis that builds on sinusoidal modeling synthesis and incorporates a modulated temporal convolutional network for transient generation. We use a modified sinusoidal peak picking algorithm to generate time-varying non-harmonic sinusoids and pair it with differentiable noise and transient encoders that are jointly trained to reconstruct drumset sounds. We compute a set of reconstruction metrics using a large dataset of acoustic and electronic percussion samples that show that our method leads to improved onset signal reconstruction for membranophone percussion instruments.
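A minimal sketch of the differentiable signal model the paper builds on: a sum of time-varying non-harmonic sinusoids plus a noise component (the paper adds a learned transient network on top, which is not shown here). All parameter values are illustrative assumptions.

```python
# Minimal sketch: differentiable sinusoids-plus-noise synthesis for a percussive sound.
import torch

sr, n_samples, n_partials = 16000, 16000, 16
t = torch.arange(n_samples) / sr

freqs = (torch.rand(n_partials, 1) * 4000 + 50).expand(-1, n_samples)  # Hz per partial
phases = 2 * torch.pi * torch.cumsum(freqs / sr, dim=-1)               # integrate frequency
amps = torch.rand(n_partials, 1) * torch.exp(-8.0 * t)                 # decaying envelopes

sinusoidal = (amps * torch.sin(phases)).sum(dim=0)
noise = 0.05 * torch.randn(n_samples) * torch.exp(-20.0 * t)           # short noise burst
audio = sinusoidal + noise   # differentiable w.r.t. freqs/amps if wrapped as nn.Parameter
```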