cs.SD - 2023-09-25

NoLACE: Improving Low-Complexity Speech Codec Enhancement Through Adaptive Temporal Shaping

  • paper_url: http://arxiv.org/abs/2309.14521
  • repo_url: None
  • paper_authors: Jan Büthe, Ahmed Mustafa, Jean-Marc Valin, Karim Helwani, Michael M. Goodwin
  • for: Improving speech codec quality; enhancement of the Opus codec.
  • methods: Uses the Linear Adaptive Coding Enhancer (LACE) model, which combines DNNs with classical long-term/short-term postfiltering for low complexity and zero delay; NoLACE extends it with an adaptive temporal shaping module.
  • results: Compared with the Opus baseline and an enlarged LACE model, NoLACE achieves higher quality, and it also works well with ASR systems.
    Abstract Speech codec enhancement methods are designed to remove distortions added by speech codecs. While classical methods are very low in complexity and add zero delay, their effectiveness is rather limited. Compared to that, DNN-based methods deliver higher quality but they are typically high in complexity and/or require delay. The recently proposed Linear Adaptive Coding Enhancer (LACE) addresses this problem by combining DNNs with classical long-term/short-term postfiltering resulting in a causal low-complexity model. A shortcoming of the LACE model, however, is that quality quickly saturates when the model size is scaled up. To mitigate this problem, we propose a novel adaptive temporal shaping module that adds high temporal resolution to the LACE model resulting in the Non-Linear Adaptive Coding Enhancer (NoLACE). We adapt NoLACE to enhance the Opus codec and show that NoLACE significantly outperforms both the Opus baseline and an enlarged LACE model at 6, 9 and 12 kb/s. We also show that LACE and NoLACE are well-behaved when used with an ASR system.
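As a rough illustration of what an adaptive temporal shaping module does (this is not the actual NoLACE architecture, whose details are not given in the abstract), the sketch below lets a small network predict per-sample gains from frame-level conditioning features and applies them to a decoded frame; all layer sizes, names, and shapes are assumptions.

```python
import torch
import torch.nn as nn

class TemporalShaper(nn.Module):
    """Toy adaptive temporal shaping: predict a per-sample gain envelope
    from conditioning features and apply it to the decoded frame.
    Illustrative sketch only, not the NoLACE module."""
    def __init__(self, feat_dim=64, frame_size=160):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.Tanh(),
            nn.Linear(128, frame_size),   # one gain per sample in the frame
        )

    def forward(self, decoded_frame, features):
        # decoded_frame: (batch, frame_size), features: (batch, feat_dim)
        gains = torch.exp(self.net(features))   # positive, time-varying gains
        return decoded_frame * gains            # temporally shaped output

# usage with random data (hypothetical shapes)
shaper = TemporalShaper()
frames = torch.randn(8, 160)     # decoded codec frames
cond = torch.randn(8, 64)        # frame-level conditioning features
print(shaper(frames, cond).shape)   # torch.Size([8, 160])
```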

Noise-Robust DSP-Assisted Neural Pitch Estimation with Very Low Complexity

  • paper_url: http://arxiv.org/abs/2309.14507
  • repo_url: None
  • paper_authors: Krishna Subramani, Jean-Marc Valin, Jan Buethe, Paris Smaragdis, Mike Goodwin
  • for: Proposes a hybrid pitch estimator that balances the strengths of deep neural networks (DNNs) and traditional digital signal processing (DSP), improving performance while remaining practical to deploy.
  • methods: Combines a small DNN with traditional DSP-based features.
  • results: The hybrid approach matches or exceeds pure DNN-based models while keeping complexity and algorithmic delay comparable to traditional DSP algorithms; it also provides benefits for a neural vocoding task.
    Abstract Pitch estimation is an essential step of many speech processing algorithms, including speech coding, synthesis, and enhancement. Recently, pitch estimators based on deep neural networks (DNNs) have been outperforming well-established DSP-based techniques. Unfortunately, these new estimators can be impractical to deploy in real-time systems, both because of their relatively high complexity, and the fact that some require significant lookahead. We show that a hybrid estimator using a small deep neural network (DNN) with traditional DSP-based features can match or exceed the performance of pure DNN-based models, with a complexity and algorithmic delay comparable to traditional DSP-based algorithms. We further demonstrate that this hybrid approach can provide benefits for a neural vocoding task.
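To make the hybrid idea concrete, here is a minimal sketch that pairs a classical DSP feature (frame-wise normalized autocorrelation) with a tiny neural classifier over pitch bins. The feature choice, network size, and pitch-bin count are illustrative assumptions, not the paper's actual design.

```python
import numpy as np
import torch
import torch.nn as nn

def dsp_pitch_features(frame, max_lag=256):
    """Classical DSP feature: normalized autocorrelation over candidate lags."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac[:max_lag] / (ac[0] + 1e-9)
    return ac.astype(np.float32)

class TinyPitchNet(nn.Module):
    """Small MLP mapping DSP features to pitch-bin posteriors."""
    def __init__(self, n_lags=256, n_bins=180):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_lags, 64), nn.ReLU(),
            nn.Linear(64, n_bins),
        )
    def forward(self, x):
        return self.net(x)

# usage on a synthetic 100 Hz frame at 16 kHz (hypothetical setup)
sr, f0 = 16000, 100.0
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * f0 * t).astype(np.float32)
feats = torch.from_numpy(dsp_pitch_features(frame)).unsqueeze(0)
print(TinyPitchNet()(feats).shape)   # torch.Size([1, 180])
```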

On the Impact of Quantization and Pruning of Self-Supervised Speech Models for Downstream Speech Recognition Tasks "In-the-Wild"

  • paper_url: http://arxiv.org/abs/2309.14462
  • repo_url: None
  • paper_authors: Arthur Pimentel, Heitor Guimarães, Anderson R. Avila, Mehdi Rezagholizadeh, Tiago H. Falk
  • for: Investigates the accuracy of self-supervised-learning-based speech recognition systems under mismatched train/test conditions.
  • methods: Applies two model compression methods, parameter quantization and model pruning, to the robust wav2vec 2.0 model and analyzes their effect on speech recognition accuracy.
  • results: Reports the effects of parameter quantization and model pruning on recognition accuracy under noisy, reverberant, and noise-plus-reverberation conditions.
    Abstract Recent advances with self-supervised learning have allowed speech recognition systems to achieve state-of-the-art (SOTA) word error rates (WER) while requiring only a fraction of the labeled training data needed by its predecessors. Notwithstanding, while such models achieve SOTA performance in matched train/test conditions, their performance degrades substantially when tested in unseen conditions. To overcome this problem, strategies such as data augmentation and/or domain shift training have been explored. Available models, however, are still too large to be considered for edge speech applications on resource-constrained devices, thus model compression tools are needed. In this paper, we explore the effects that train/test mismatch conditions have on speech recognition accuracy based on compressed self-supervised speech models. In particular, we report on the effects that parameter quantization and model pruning have on speech recognition accuracy based on the so-called robust wav2vec 2.0 model under noisy, reverberant, and noise-plus-reverberation conditions.
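Both compression tools studied are standard. As a generic illustration of the mechanics (not the paper's exact recipe, and using a toy stand-in rather than a robust wav2vec 2.0 checkpoint), the sketch below applies PyTorch's dynamic 8-bit quantization and unstructured magnitude pruning.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for a self-supervised speech encoder (the real model would be a
# robust wav2vec 2.0 checkpoint; this toy stack only shows the mechanics).
model = nn.Sequential(
    nn.Linear(512, 768), nn.ReLU(),
    nn.Linear(768, 768), nn.ReLU(),
    nn.Linear(768, 29),              # e.g. character logits
)

# 1) Post-training dynamic quantization: Linear weights stored as int8,
#    activations quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# 2) Unstructured magnitude pruning: zero out the 30% smallest weights
#    of each Linear layer in the float model.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")   # make the pruning permanent

x = torch.randn(1, 512)
print(quantized(x).shape, model(x).shape)   # both torch.Size([1, 29])
```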

An Investigation of Distribution Alignment in Multi-Genre Speaker Recognition

  • paper_url: http://arxiv.org/abs/2309.14158
  • repo_url: None
  • paper_authors: Zhenyu Zhou, Junhui Chen, Namin Wang, Lantian Li, Dong Wang
  • for: Investigates the performance of mainstream distribution alignment methods on multi-genre data, to better address the challenges of multi-genre speaker recognition.
  • methods: Evaluates several mainstream distribution alignment methods, including within-between distribution alignment (WBDA).
  • results: Experiments on the CN-Celeb dataset show that WBDA performs relatively better, but none of the investigated methods consistently improves performance in all test cases, indicating that a comprehensive solution has yet to be developed.
    Abstract Multi-genre speaker recognition is becoming increasingly popular due to its ability to better represent the complexities of real-world applications. However, a major challenge is the significant shift in the distribution of speaker vectors across different genres. While distribution alignment is a common approach to address this challenge, previous studies have mainly focused on aligning a source domain with a target domain, and the performance of multi-genre data is unknown. This paper presents a comprehensive study of mainstream distribution alignment methods on multi-genre data, where multiple distributions need to be aligned. We analyze various methods both qualitatively and quantitatively. Our experiments on the CN-Celeb dataset show that within-between distribution alignment (WBDA) performs relatively better. However, we also found that none of the investigated methods consistently improved performance in all test cases. This suggests that solely aligning the distributions of speaker vectors may not fully address the challenges posed by multi-genre speaker recognition. Further investigation is necessary to develop a more comprehensive solution.
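For intuition, a very simple form of per-genre distribution alignment is to standardize speaker vectors genre by genre. The sketch below does only that; it is a generic baseline, not the WBDA method analyzed in the paper, and the vector and genre counts are arbitrary.

```python
import numpy as np

def per_genre_align(embeddings, genres, eps=1e-6):
    """Center and scale speaker embeddings within each genre so that all
    genres share a common first/second-order scale (illustrative baseline)."""
    aligned = np.empty_like(embeddings)
    for g in np.unique(genres):
        idx = genres == g
        x = embeddings[idx]
        mu = x.mean(axis=0, keepdims=True)
        std = x.std(axis=0, keepdims=True) + eps
        aligned[idx] = (x - mu) / std
    return aligned

# usage with random vectors labelled by three hypothetical genres
rng = np.random.default_rng(0)
emb = rng.normal(size=(300, 192))
gen = rng.integers(0, 3, size=300)
print(per_genre_align(emb, gen).shape)   # (300, 192)
```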

Multi-Domain Adaptation by Self-Supervised Learning for Speaker Verification

  • paper_url: http://arxiv.org/abs/2309.14149
  • repo_url: None
  • paper_authors: Wan Lin, Lantian Li, Dong Wang
  • for: addressing the domain-mismatch challenge in speaker recognition models
  • methods: self-supervised learning method with three strategies (in-domain negative sampling, MoCo-like memory bank scheme, and CORAL-like distribution alignment)
  • results: outperforms the basic self-supervised adaptation method in nearly all in-domain tests and cross-domain tests, demonstrating the effectiveness of the proposed method.
    Abstract In real-world applications, speaker recognition models often face various domain-mismatch challenges, leading to a significant drop in performance. Although numerous domain adaptation techniques have been developed to address this issue, almost all present methods focus on a simple configuration where the model is trained in one domain and deployed in another. However, real-world environments are often complex and may contain multiple domains, making the methods designed for one-to-one adaptation suboptimal. In our paper, we propose a self-supervised learning method to tackle this multi-domain adaptation problem. Building upon the basic self-supervised adaptation algorithm, we designed three strategies to make it suitable for multi-domain adaptation: an in-domain negative sampling strategy, a MoCo-like memory bank scheme, and a CORAL-like distribution alignment. We conducted experiments using VoxCeleb2 as the source domain dataset and CN-Celeb1 as the target multi-domain dataset. Our results demonstrate that our method clearly outperforms the basic self-supervised adaptation method, which simply treats the data of CN-Celeb1 as a single domain. Importantly, the improvement is consistent in nearly all in-domain tests and cross-domain tests, demonstrating the effectiveness of our proposed method.
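The CORAL-like alignment strategy is easy to show in isolation: classic CORAL penalizes the distance between the second-order statistics of two embedding batches. The sketch below implements that standard loss; how it is combined with the in-domain negative sampling and the MoCo-like memory bank is not reproduced here, and the batch and embedding sizes are assumptions.

```python
import torch

def coral_loss(source, target):
    """Classic CORAL loss: squared Frobenius distance between the feature
    covariances of two embedding batches, normalized by 4*d^2.
    source: (n_s, d) tensor; target: (n_t, d) tensor."""
    d = source.size(1)

    def covariance(x):
        x = x - x.mean(dim=0, keepdim=True)
        return (x.t() @ x) / (x.size(0) - 1)

    diff = covariance(source) - covariance(target)
    return (diff * diff).sum() / (4.0 * d * d)

# usage with random speaker embeddings (hypothetical dimensions)
src = torch.randn(128, 256)   # e.g. a VoxCeleb2-like source batch
tgt = torch.randn(128, 256)   # e.g. a CN-Celeb1-like target batch
print(coral_loss(src, tgt).item())
```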

Speaker anonymization using neural audio codec language models

  • paper_url: http://arxiv.org/abs/2309.14129
  • repo_url: None
  • paper_authors: Michele Panariello, Francesco Nespoli, Massimiliano Todisco, Nicholas Evans
  • for: Speaker anonymization (concealing the speaker's identity).
  • methods: Uses neural audio codecs (NACs) combined with language models to generate high-quality anonymized synthetic speech, relying on quantized codes to bottleneck speaker-related information.
  • results: Using the evaluation framework of the Voice Privacy Challenge 2022, shows that NAC language modeling can generate high-quality anonymized speech while effectively bottlenecking speaker-related information.
    Abstract The vast majority of approaches to speaker anonymization involve the extraction of fundamental frequency estimates, linguistic features and a speaker embedding which is perturbed to obfuscate the speaker identity before an anonymized speech waveform is resynthesized using a vocoder. Recent work has shown that x-vector transformations are difficult to control consistently: other sources of speaker information contained within fundamental frequency and linguistic features are re-entangled upon vocoding, meaning that anonymized speech signals still contain speaker information. We propose an approach based upon neural audio codecs (NACs), which are known to generate high-quality synthetic speech when combined with language models. NACs use quantized codes, which are known to effectively bottleneck speaker-related information: we demonstrate the potential of speaker anonymization systems based on NAC language modeling by applying the evaluation framework of the Voice Privacy Challenge 2022.
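Neural audio codecs typically obtain their quantized codes through residual vector quantization (RVQ), which is what makes them an information bottleneck. The sketch below is a minimal, generic RVQ encoder over random codebooks, meant only to show why fine-grained (e.g. speaker-specific) detail is discarded; it is not the codec or language model used in the paper, and all sizes are toy assumptions.

```python
import numpy as np

def rvq_encode(latents, codebooks):
    """Residual vector quantization: each stage snaps the current residual
    to its nearest codebook entry; only discrete indices need to be stored.
    latents: (n_frames, d); codebooks: list of (codebook_size, d) arrays."""
    codes, reconstruction = [], np.zeros_like(latents)
    residual = latents.copy()
    for cb in codebooks:
        # nearest codeword per frame under Euclidean distance
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        reconstruction += cb[idx]
        residual = latents - reconstruction
    return np.stack(codes, axis=1), reconstruction

# usage: 3 stages of 64-entry codebooks over 16-dim latents (toy sizes)
rng = np.random.default_rng(0)
lat = rng.normal(size=(50, 16))
books = [rng.normal(size=(64, 16)) for _ in range(3)]
codes, recon = rvq_encode(lat, books)
print(codes.shape, float(np.mean((lat - recon) ** 2)))
```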

Haha-Pod: An Attempt for Laughter-based Non-Verbal Speaker Verification

  • paper_url: http://arxiv.org/abs/2309.14109
  • repo_url: https://github.com/nevermorelin/hahapod
  • paper_authors: Yuke Lin, Xiaoyi Qin, Ning Jiang, Guoqing Zhao, Ming Li
  • for: explore speaker verification based on non-verbal vocalization, specifically laughter
  • methods: Two-Stage Teacher-Student (2S-TS) framework to minimize the within-speaker embedding distance between verbal and non-verbal signals
  • results: significant improvement on the S2L-Eval test set with only minor degradation on the VoxCeleb1 test set.
    Abstract It is widely acknowledged that discriminative representation for speaker verification can be extracted from verbal speech. However, how much speaker information that non-verbal vocalization carries is still a puzzle. This paper explores speaker verification based on the most ubiquitous form of non-verbal voice, laughter. First, we use a semi-automatic pipeline to collect a new Haha-Pod dataset from open-source podcast media. The dataset contains over 240 speakers' laughter clips with corresponding high-quality verbal speech. Second, we propose a Two-Stage Teacher-Student (2S-TS) framework to minimize the within-speaker embedding distance between verbal and non-verbal (laughter) signals. Considering Haha-Pod as a test set, two trials (S2L-Eval) are designed to verify the speaker's identity through laugh sounds. Experimental results demonstrate that our method can significantly improve the performance of the S2L-Eval test set with only a minor degradation on the VoxCeleb1 test set. The resources for the Haha-Pod dataset can be found at https://github.com/nevermoreLin/HahaPod.
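The heart of a teacher-student setup of this kind is a loss that pulls the student's laughter embedding toward a frozen teacher's verbal-speech embedding for the same speaker. The sketch below shows only that distillation step with stand-in encoders; the two-stage schedule, data handling, and exact losses of 2S-TS are not reproduced, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 192

# Stand-in encoders; in practice both would be speaker-embedding networks,
# with the teacher pretrained on verbal speech and then frozen.
teacher = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, embed_dim))
student = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, embed_dim))
for p in teacher.parameters():
    p.requires_grad_(False)

def within_speaker_distance(speech_feats, laugh_feats):
    """Cosine distance between the teacher's speech embedding and the
    student's laughter embedding for the same speakers (lower is better)."""
    t = F.normalize(teacher(speech_feats), dim=-1)
    s = F.normalize(student(laugh_feats), dim=-1)
    return (1.0 - (t * s).sum(dim=-1)).mean()

# usage: one optimization step on random paired features (toy shapes)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
speech, laugh = torch.randn(16, 80), torch.randn(16, 80)
loss = within_speaker_distance(speech, laugh)
loss.backward()
opt.step()
print(float(loss))
```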

VoiceLens: Controllable Speaker Generation and Editing with Flow

  • paper_url: http://arxiv.org/abs/2309.14094
  • repo_url: None
  • paper_authors: Yao Shi, Ming Li
  • for: Provides a semi-supervised flow-based method for multi-speaker speech synthesis and voice conversion systems that models the speaker embedding distribution for multi-conditional speaker generation and editing.
  • methods: VoiceLens maps speaker embeddings into a combination of independent attributes and residual information, allowing new voices with given attributes to be generated for existing TTS models and attributes of known voices to be meaningfully edited.
  • results: VoiceLens shows unconditional generation capability similar to Tacospawn while offering higher controllability and flexibility when used conditionally; editing the embeddings of known noisy speakers with an SNR-conditioned VoiceLens model yields cleaner synthesized speech without retraining the TTS model.
    Abstract Currently, many multi-speaker speech synthesis and voice conversion systems address speaker variations with an embedding vector. Modeling it directly allows new voices outside of training data to be synthesized. GMM based approaches such as Tacospawn are favored in literature for this generation task, but there are still some limitations when difficult conditionings are involved. In this paper, we propose VoiceLens, a semi-supervised flow-based approach, to model speaker embedding distributions for multi-conditional speaker generation. VoiceLens maps speaker embeddings into a combination of independent attributes and residual information. It allows new voices associated with certain attributes to be generated for existing TTS models, and attributes of known voices to be meaningfully edited. We show in this paper, VoiceLens displays an unconditional generation capacity that is similar to Tacospawn while obtaining higher controllability and flexibility when used in a conditional manner. In addition, we show synthesizing less noisy speech from known noisy speakers without re-training the TTS model is possible via solely editing their embeddings with a SNR conditioned VoiceLens model. Demos are available at sos1sos2sixteen.github.io/voicelens.

Unsupervised Accent Adaptation Through Masked Language Model Correction Of Discrete Self-Supervised Speech Units

  • paper_url: http://arxiv.org/abs/2309.13994
  • repo_url: None
  • paper_authors: Jakob Poncelet, Hugo Van hamme
  • for: Reducing the sensitivity of pre-trained speech models to accented or atypical speech.
  • methods: Unsupervised adaptation using a masked language model over discrete speech units and small accent adapter blocks.
  • results: Improves a HuBERT Large model on a downstream accented speech recognition task, without supervision.
    Abstract Self-supervised pre-trained speech models have strongly improved speech recognition, yet they are still sensitive to domain shifts and accented or atypical speech. Many of these models rely on quantisation or clustering to learn discrete acoustic units. We propose to correct the discovered discrete units for accented speech back to a standard pronunciation in an unsupervised manner. A masked language model is trained on discrete units from a standard accent and iteratively corrects an accented token sequence by masking unexpected cluster sequences and predicting their common variant. Small accent adapter blocks are inserted in the pre-trained model and fine-tuned by predicting the corrected clusters, which leads to an increased robustness of the pre-trained model towards a target accent, and this without supervision. We are able to improve a state-of-the-art HuBERT Large model on a downstream accented speech recognition task by altering the training regime with the proposed method.
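A hedged sketch of the correction idea: given a masked language model trained on discrete units from a standard accent, positions whose observed unit the model finds unlikely are masked and replaced by the model's prediction, iteratively. The toy model, vocabulary size, and threshold below are assumptions; only the correction loop itself follows the description in the abstract.

```python
import torch
import torch.nn as nn

VOCAB, MASK_ID = 100, 100   # 100 cluster ids plus a mask token (toy sizes)

class ToyUnitMLM(nn.Module):
    """Stand-in masked unit model; in practice this would be trained on
    discrete units extracted from standard-accent speech."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB + 1, 64)
        self.rnn = nn.GRU(64, 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, VOCAB)
    def forward(self, tokens):                 # (1, T) -> (1, T, VOCAB)
        h, _ = self.rnn(self.emb(tokens))
        return self.out(h)

@torch.no_grad()
def correct_units(tokens, mlm, threshold=0.01, n_iters=3):
    """Iteratively mask units the MLM finds unexpected and replace them
    with the model's most likely 'standard' unit (illustrative loop only)."""
    tokens = tokens.clone()
    for _ in range(n_iters):
        probs = mlm(tokens).softmax(-1)                          # (1, T, VOCAB)
        p_obs = probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
        unexpected = p_obs < threshold                           # (1, T) bool
        if not unexpected.any():
            break
        masked = tokens.masked_fill(unexpected, MASK_ID)
        preds = mlm(masked).argmax(-1)
        tokens = torch.where(unexpected, preds, tokens)
    return tokens

# usage with a random "accented" unit sequence
seq = torch.randint(0, VOCAB, (1, 50))
print(correct_units(seq, ToyUnitMLM()).shape)   # torch.Size([1, 50])
```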

Real-Time Emergency Vehicle Detection using Mel Spectrograms and Regular Expressions

  • paper_url: http://arxiv.org/abs/2309.13920
  • repo_url: None
  • paper_authors: Alberto Pacheco-Gonzalez, Raymundo Torres, Raul Chacon, Isidro Robledo
  • for: detecting emergency vehicle sirens in real time
  • methods: digital signal processing techniques and signal symbolization, compared to a deep neural network audio classifier
  • results: the developed DSP algorithm showed a greater ability to discriminate between signal and noise than the CNN model.
    Abstract In emergency situations, the movement of emergency vehicles through city streets can be hindered by vehicular traffic. This paper presents a method for detecting emergency vehicle sirens in real time. A siren Hi-Lo audio fingerprint is derived by applying digital signal processing techniques and signal symbolization, and the approach is contrasted against a deep neural network audio classifier trained on 280 environmental sounds and 38 Hi-Lo sirens. The precision of both methods was evaluated using a confusion matrix and various metrics. The developed DSP algorithm showed a greater ability to discriminate between signal and noise than the CNN model.
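A plausible reading of the pipeline is: mel spectrogram, per-frame symbolization into high/low tone symbols, then a regular expression over the symbol string to catch the alternating Hi-Lo pattern. The sketch below follows that reading with made-up band edges, thresholds, and pattern; the paper's actual fingerprint is not specified in the abstract.

```python
import re
import numpy as np
import librosa

def symbolize(y, sr, lo_band=(400, 700), hi_band=(700, 1100)):
    """Map each spectrogram frame to a symbol: 'L' if energy peaks in the
    low siren band, 'H' if in the high band, '.' otherwise.
    Band edges and thresholds here are illustrative guesses."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64, fmax=2000)
    freqs = librosa.mel_frequencies(n_mels=64, fmax=2000)
    lo = mel[(freqs >= lo_band[0]) & (freqs < lo_band[1])].sum(axis=0)
    hi = mel[(freqs >= hi_band[0]) & (freqs < hi_band[1])].sum(axis=0)
    total = mel.sum(axis=0) + 1e-9
    symbols = []
    for l, h, t in zip(lo, hi, total):
        symbols.append("H" if h / t > 0.3 else "L" if l / t > 0.3 else ".")
    return "".join(symbols)

# A Hi-Lo siren alternates sustained high and low tones; a regular
# expression over the symbol string captures that temporal structure.
SIREN_RE = re.compile(r"(?:H{3,}\.{0,2}L{3,}\.{0,2}){2,}")

# usage on a synthetic alternating two-tone signal (16 kHz)
sr = 16000
tone = lambda f, d: np.sin(2 * np.pi * f * np.arange(int(sr * d)) / sr)
y = np.concatenate([np.concatenate([tone(960, 0.5), tone(520, 0.5)])
                    for _ in range(4)]).astype(np.float32)
print(bool(SIREN_RE.search(symbolize(y, sr))))
```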

Frame-wise streaming end-to-end speaker diarization with non-autoregressive self-attention-based attractors

  • paper_url: http://arxiv.org/abs/2309.13916
  • repo_url: https://github.com/audio-westlakeu/fs-eend
  • paper_authors: Di Liang, Nian Shao, Xiaofei Li
  • for: Frame-wise online/streaming end-to-end speaker diarization.
  • methods: Uses a causal speaker embedding encoder and an online non-autoregressive self-attention-based attractor decoder, with a look-ahead mechanism to detect new speakers in real time and adaptively update speaker attractors.
  • results: Compared with recently proposed block-wise online methods, achieves state-of-the-art diarization results with low inference latency and computational cost.
    Abstract This work proposes a frame-wise online/streaming end-to-end neural diarization (FS-EEND) method in a frame-in-frame-out fashion. To frame-wisely detect a flexible number of speakers and extract/update their corresponding attractors, we propose to leverage a causal speaker embedding encoder and an online non-autoregressive self-attention-based attractor decoder. A look-ahead mechanism is adopted to allow leveraging some future frames for effectively detecting new speakers in real time and adaptively updating speaker attractors. The proposed method processes the audio stream frame by frame, and has a low inference latency caused by the look-ahead frames. Experiments show that, compared with the recently proposed block-wise online methods, our method FS-EEND achieves state-of-the-art diarization results, with a low inference latency and computational cost.

HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS

  • paper_url: http://arxiv.org/abs/2309.13907
  • repo_url: None
  • paper_authors: Dake Guo, Xinfa Zhu, Liumeng Xue, Tao Li, Yuanjun Lv, Yuepeng Jiang, Lei Xie
  • for: Improving the expressiveness of long-form synthetic speech.
  • methods: Adds a virtual global node and a contextual attention mechanism, plus hierarchical supervision from acoustic prosody, to strengthen the prosody modeling capability of GNNs.
  • results: Both objective and subjective evaluations show that HiGNN-TTS significantly improves the naturalness and expressiveness of long-form synthetic speech.
    Abstract Recent advances in text-to-speech, particularly those based on Graph Neural Networks (GNNs), have significantly improved the expressiveness of short-form synthetic speech. However, generating human-parity long-form speech with high dynamic prosodic variations is still challenging. To address this problem, we expand the capabilities of GNNs with a hierarchical prosody modeling approach, named HiGNN-TTS. Specifically, we add a virtual global node in the graph to strengthen the interconnection of word nodes and introduce a contextual attention mechanism to broaden the prosody modeling scope of GNNs from intra-sentence to inter-sentence. Additionally, we perform hierarchical supervision from acoustic prosody on each node of the graph to capture the prosodic variations with a high dynamic range. Ablation studies show the effectiveness of HiGNN-TTS in learning hierarchical prosody. Both objective and subjective evaluations demonstrate that HiGNN-TTS significantly improves the naturalness and expressiveness of long-form synthetic speech.
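As a minimal illustration of the virtual-global-node idea (not the HiGNN-TTS architecture), the sketch below adds a global summary node that attends over all word nodes and broadcasts sentence-level context back to them; the attention form and dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def global_node_pass(word_feats):
    """One message-passing round with a virtual global node: the global node
    attends over all word nodes, then broadcasts its summary back, so every
    word node sees sentence-level context. Toy illustration only."""
    n, d = word_feats.shape
    global_node = word_feats.mean(dim=0, keepdim=True)                 # init summary
    attn = F.softmax(word_feats @ global_node.t() / d ** 0.5, dim=0)   # (n, 1)
    global_node = (attn * word_feats).sum(dim=0, keepdim=True)         # gather
    return word_feats + global_node.expand(n, d)                       # scatter

# usage on random word-node features for a 12-word sentence
words = torch.randn(12, 256)
print(global_node_pass(words).shape)   # torch.Size([12, 256])
```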

AutoPrep: An Automatic Preprocessing Framework for In-the-Wild Speech Data

  • paper_url: http://arxiv.org/abs/2309.13905
  • repo_url: https://github.com/tomasJwYU/AutoPrepDemo
  • paper_authors: Jianwei Yu, Hangting Chen, Yanyao Bian, Xiang Li, Yi Luo, Jinchuan Tian, Mengyang Liu, Jiayi Jiang, Shuai Wang
  • for: Making large-scale in-the-wild speech data more usable for the speech technology community.
  • methods: Proposes AutoPrep, an automatic preprocessing framework with six components: speech enhancement, speech segmentation, speaker clustering, target speech extraction, quality filtering, and automatic speech recognition.
  • results: Experiments show that AutoPrep yields preprocessed data with DNSMOS and PDNSMOS scores similar to several open-sourced TTS datasets, and the corresponding TTS system achieves up to 0.68 in-domain speaker similarity.
    Abstract Recently, the utilization of extensive open-sourced text data has significantly advanced the performance of text-based large language models (LLMs). However, the use of in-the-wild large-scale speech data in the speech technology community remains constrained. One reason for this limitation is that a considerable amount of the publicly available speech data is compromised by background noise, speech overlapping, lack of speech segmentation information, missing speaker labels, and incomplete transcriptions, which can largely hinder their usefulness. On the other hand, human annotation of speech data is both time-consuming and costly. To address this issue, we introduce an automatic in-the-wild speech data preprocessing framework (AutoPrep) in this paper, which is designed to enhance speech quality, generate speaker labels, and produce transcriptions automatically. The proposed AutoPrep framework comprises six components: speech enhancement, speech segmentation, speaker clustering, target speech extraction, quality filtering and automatic speech recognition. Experiments conducted on the open-sourced WenetSpeech and our self-collected AutoPrepWild corpora demonstrate that the proposed AutoPrep framework can generate preprocessed data with similar DNSMOS and PDNSMOS scores compared to several open-sourced TTS datasets. The corresponding TTS system can achieve up to 0.68 in-domain speaker similarity.
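The six components read naturally as a sequential pipeline. The skeleton below only shows how such stages might be chained for one recording; every function here is a hypothetical placeholder, not AutoPrep's implementation.

```python
from typing import Dict, List

# Hypothetical stage signatures; each would wrap a real model in practice.
def enhance(audio): return audio                               # speech enhancement
def segment(audio) -> List: return [audio]                     # speech segmentation
def cluster_speakers(segments) -> Dict: return {0: segments}   # speaker clustering
def extract_target(segments): return segments                  # target speech extraction
def quality_filter(segments): return segments                  # quality filtering (e.g. DNSMOS)
def transcribe(segment) -> str: return ""                      # automatic speech recognition

def autoprep_like_pipeline(raw_audio):
    """Chain the six stages named in the abstract over one recording.
    Structural sketch only; AutoPrep's components are not reproduced here."""
    clean = enhance(raw_audio)
    per_speaker = cluster_speakers(segment(clean))
    results = []
    for speaker_id, segs in per_speaker.items():
        segs = quality_filter(extract_target(segs))
        results += [(speaker_id, s, transcribe(s)) for s in segs]
    return results

print(autoprep_like_pipeline("raw.wav"))   # [(0, 'raw.wav', '')]
```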

A Two-Step Approach for Narrowband Source Localization in Reverberant Rooms

  • paper_url: http://arxiv.org/abs/2309.13819
  • repo_url: None
  • paper_authors: Wei-Ting Lai, Lachlan Birnie, Thushara Abhayapala, Amy Bastine, Shaoheng Xu, Prasanga Samarasinghe
  • for: Proposes a two-step narrowband source localization method for reverberant rooms.
  • methods: First dereverberates by modeling the homogeneous component of the sound field as an equivalent plane-wave decomposition using Iteratively Reweighted Least Squares (IRLS); then localizes sources by modeling the dereverberated component as a sparse point-source distribution using Orthogonal Matching Pursuit (OMP).
  • results: Simulation results show improved localization accuracy with fewer measurements, especially under strong reverberation; the method requires no prior knowledge of boundary conditions or room geometry, making it applicable to different room types.
    Abstract This paper presents a two-step approach for narrowband source localization within reverberant rooms. The first step involves dereverberation by modeling the homogeneous component of the sound field by an equivalent decomposition of planewaves using Iteratively Reweighted Least Squares (IRLS), while the second step focuses on source localization by modeling the dereverberated component as a sparse representation of point-source distribution using Orthogonal Matching Pursuit (OMP). The proposed method enhances localization accuracy with fewer measurements, particularly in environments with strong reverberation. A numerical simulation in a conference room scenario, using a uniform microphone array affixed to the wall, demonstrates real-world feasibility. Notably, the proposed method and microphone placement effectively localize sound sources within the 2D-horizontal plane without requiring prior knowledge of boundary conditions and room geometry, making it versatile for application in different room types.
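The second step uses standard Orthogonal Matching Pursuit. Below is a textbook OMP implementation for a generic dictionary and measurement vector; the paper's plane-wave/point-source dictionaries and the IRLS dereverberation step are not reproduced, and the problem sizes are toy assumptions.

```python
import numpy as np

def omp(dictionary, y, n_nonzero):
    """Textbook Orthogonal Matching Pursuit: greedily pick the dictionary
    atom most correlated with the residual, then re-fit by least squares.
    dictionary: (m, n) with unit-norm columns; y: (m,) measurement vector."""
    residual = y.copy()
    support = []
    coeffs = np.zeros(dictionary.shape[1], dtype=dictionary.dtype)
    for _ in range(n_nonzero):
        corr = np.abs(dictionary.conj().T @ residual)
        corr[support] = 0                          # do not reselect atoms
        support.append(int(np.argmax(corr)))
        sub = dictionary[:, support]
        sol, *_ = np.linalg.lstsq(sub, y, rcond=None)
        residual = y - sub @ sol
    coeffs[support] = sol
    return coeffs

# usage: recover a 3-sparse vector from a random dictionary (toy sizes)
rng = np.random.default_rng(0)
D = rng.normal(size=(64, 200))
D /= np.linalg.norm(D, axis=0)
x_true = np.zeros(200)
x_true[[5, 50, 120]] = [1.0, -0.7, 0.4]
x_hat = omp(D, D @ x_true, n_nonzero=3)
print(np.nonzero(x_hat)[0])   # should print [  5  50 120]
```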