eess.AS - 2023-10-10

Privacy-oriented manipulation of speaker representations

paper_url: http://arxiv.org/abs/2310.06652
repo_url: None
paper_authors: Francisco Teixeira, Alberto Abad, Bhiksha Raj, Isabel Trancoso
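for: This work aims to remove and manipulate private attributes in speaker embeddings in order to protect speaker privacy.
methods: A Vector-Quantized Variational Autoencoder architecture, combined with an adversarial classifier and a novel mutual information loss, removes private attributes from speaker embeddings.
results: The model is validated on two attributes (sex and age), with experiments covering ignorant and fully-informed attackers and both in-domain and out-of-domain data.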

Abstract
Speaker embeddings are ubiquitous, with applications ranging from speaker recognition and diarization to speech synthesis and voice anonymisation. The amount of information held by these embeddings lends them versatility, but also raises privacy concerns. Speaker embeddings have been shown to contain information on age, sex, health and more, which speakers may want to keep private, especially when this information is not required for the target task. In this work, we propose a method for removing and manipulating private attributes from speaker embeddings that leverages a Vector-Quantized Variational Autoencoder architecture, combined with an adversarial classifier and a novel mutual information loss. We validate our model on two attributes, sex and age, and perform experiments with ignorant and fully-informed attackers, and with in-domain and out-of-domain data.

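Summary
Speaker embeddings are used across many applications, including speaker recognition and diarization, speech synthesis and voice anonymisation. The large amount of information they carry makes them versatile but also raises privacy concerns: they encode age, sex, health and other attributes that speakers may wish to keep private, particularly when the target task does not need them. This work proposes a method for removing and manipulating private attributes in speaker embeddings, based on a Vector-Quantized Variational Autoencoder architecture combined with an adversarial classifier and a novel mutual information loss. The model is validated on two attributes, sex and age, with ignorant and fully-informed attackers and with in-domain and out-of-domain data.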
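The abstract names a Vector-Quantized Variational Autoencoder but gives no architectural details. As a rough illustration of the quantization bottleneck that such models generally rely on, the following JAX sketch snaps each encoder output to its nearest codebook vector and uses a straight-through gradient. The function name `quantize`, the codebook size, and the toy values are assumptions made for illustration. The adversarial classifier and the mutual information loss from the paper are not shown.

```python
import jax
import jax.numpy as jnp


def quantize(z, codebook):
    """Map each row of z to its nearest codebook vector (generic VQ step).

    z:        (batch, dim) continuous encoder outputs
    codebook: (num_codes, dim) learnable code vectors
    Returns the quantized vectors (with straight-through gradients) and
    the selected code indices.
    """
    # Squared Euclidean distance between every input and every code.
    distances = jnp.sum((z[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    idx = jnp.argmin(distances, axis=-1)
    z_q = codebook[idx]
    # Straight-through estimator: the forward pass outputs z_q, while the
    # backward pass copies gradients to z as if quantization were identity.
    z_st = z + jax.lax.stop_gradient(z_q - z)
    return z_st, idx


# Hypothetical toy data: two 2-dimensional "embeddings" and a 2-entry codebook.
codebook = jnp.array([[0.0, 0.0], [1.0, 1.0]])
z = jnp.array([[0.1, -0.1], [0.9, 1.2]])
z_st, idx = quantize(z, codebook)
```

In a full VQ-VAE the codebook is trained jointly with the encoder and decoder through additional codebook and commitment losses, which this sketch omits.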

Discriminative Speech Recognition Rescoring with Pre-trained Language Models

paper_url: http://arxiv.org/abs/2310.06248
repo_url: None
paper_authors: Prashanth Gurunath Shivakumar, Jari Kolehmainen, Yile Gu, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko
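for: Improve second-pass rescoring, a critical component of competitive automatic speech recognition (ASR) systems.
methods: Discriminative fine-tuning of pre-trained language models (LMs) with the minimum word-error-rate (MWER) criterion, including two architectures based on different pooling strategies of output embeddings.
results: On LibriSpeech, all MWER training schemes are beneficial, giving additional gains of up to 8.5% WER; the proposed pooling variants achieve lower latency while retaining most of the improvement; bidirectional LMs benefit more from discriminative training.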

Abstract
Second-pass rescoring is a critical component of competitive automatic speech recognition (ASR) systems. Large language models have demonstrated their ability to use pre-trained information for better rescoring of ASR hypotheses. Discriminative training, which directly optimizes the minimum word-error-rate (MWER) criterion, typically improves rescoring. In this study, we propose and explore several discriminative fine-tuning schemes for pre-trained LMs. We propose two architectures based on different pooling strategies of output embeddings and compare them with probability-based MWER. We conduct detailed comparisons between pre-trained causal and bidirectional LMs in discriminative settings. Experiments on LibriSpeech demonstrate that all MWER training schemes are beneficial, giving additional gains of up to 8.5% WER. The proposed pooling variants achieve lower latency while retaining most improvements. Finally, our study concludes that bidirectionality is better utilized with discriminative training.

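Summary
Second-pass rescoring is an important component of competitive automatic speech recognition (ASR) systems, and large language models have shown that pre-trained information can improve the rescoring of ASR hypotheses. Discriminative training, which directly optimizes the minimum word-error-rate (MWER) criterion, typically improves rescoring further. This study proposes and explores several discriminative fine-tuning schemes for pre-trained LMs, including two architectures based on different pooling strategies of output embeddings, and compares them with probability-based MWER. Pre-trained causal and bidirectional LMs are compared in detail. Experiments on LibriSpeech show that all MWER training schemes are beneficial, with additional gains of up to 8.5% WER. The pooling variants reduce latency while retaining most of the improvement. The study concludes that bidirectionality is better utilized with discriminative training.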
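The abstract does not spell out the MWER objective. A common N-best formulation in the ASR literature renormalizes the hypothesis scores over the N-best list and minimizes the expected number of word errors relative to the list average. The JAX sketch below follows that formulation. It is an assumption that the paper's probability-based variant matches it, and the function name `mwer_loss` and the toy 3-best list are hypothetical.

```python
import jax
import jax.numpy as jnp


def mwer_loss(scores, word_errors):
    """Expected relative word errors over an N-best list.

    scores:      (N,) hypothesis scores, higher is better
                 (for example, log-probabilities from the rescoring LM)
    word_errors: (N,) word-error counts of each hypothesis vs. the reference
    """
    # Renormalize the scores into a posterior over the N-best list.
    posteriors = jax.nn.softmax(scores)
    # Subtract the list average so the loss measures relative errors.
    relative_errors = word_errors - jnp.mean(word_errors)
    return jnp.sum(posteriors * relative_errors)


# Hypothetical 3-best list: the top-scored hypothesis has the most errors.
scores = jnp.array([-1.0, -2.0, -3.0])
word_errors = jnp.array([2.0, 0.0, 1.0])

loss = mwer_loss(scores, word_errors)
# Gradient with respect to the scores, as used for discriminative fine-tuning.
grads = jax.grad(mwer_loss)(scores, word_errors)
```

With this loss, gradient descent raises the scores of hypotheses with fewer errors than the list average and lowers the scores of those with more, which is what lets the rescoring model reorder the N-best list.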