paper_authors: Haoran Miao, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan
for: This paper proposes an online hybrid CTC/attention end-to-end ASR architecture for real-time speech recognition.
methods: The proposed architecture uses stable monotonic chunk-wise attention (sMoChA) and its simplification, monotonic truncated attention (MTA), to stream global attention; a truncated CTC (T-CTC) prefix score to stream the CTC prefix score calculation; a dynamic waiting joint decoding (DWJD) algorithm for online prediction; and an LC-BLSTM encoder in place of the offline bidirectional encoder.
results: Compared with the offline CTC/attention model, the proposed online CTC/attention model improves the real-time factor in human-computer interaction services while maintaining performance with only moderate degradation. This is the first full-stack online solution for the CTC/attention end-to-end ASR architecture.
Abstract
Recently, there has been increasing progress in end-to-end automatic speech recognition (ASR) architectures, which transcribe speech to text without any pre-trained alignments. One popular end-to-end approach is the hybrid Connectionist Temporal Classification (CTC) and attention (CTC/attention) based ASR architecture. However, how to deploy hybrid CTC/attention systems for online speech recognition is still a non-trivial problem. This article describes our proposed online hybrid CTC/attention end-to-end ASR architecture, which replaces all the offline components of the conventional CTC/attention ASR architecture with their corresponding streaming components. Firstly, we propose stable monotonic chunk-wise attention (sMoChA) to stream the conventional global attention, and further propose monotonic truncated attention (MTA) to simplify sMoChA and solve its training-and-decoding mismatch problem. Secondly, we propose the truncated CTC (T-CTC) prefix score to stream the CTC prefix score calculation. Thirdly, we design the dynamic waiting joint decoding (DWJD) algorithm to collect the predictions of CTC and attention in an online manner. Finally, we use latency-controlled bidirectional long short-term memory (LC-BLSTM) to stream the widely used offline bidirectional encoder network. Experiments on the LibriSpeech English and HKUST Mandarin tasks demonstrate that, compared with the offline CTC/attention model, our proposed online CTC/attention model improves the real-time factor in human-computer interaction services and maintains its performance with only moderate degradation. To the best of our knowledge, this is the first work to provide a full-stack online solution for the CTC/attention end-to-end ASR architecture.
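To make the streaming attention concrete, below is a minimal NumPy sketch of how decode-time monotonic truncated attention (MTA) could operate: scan forward from the endpoint selected at the previous output step until a sigmoid selection probability crosses 0.5, then attend only over frames up to that endpoint. The bilinear energy form, the projection `W_e`, and the 0.5 threshold are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def mta_attend(enc, query, W_e, prev_t):
    """Decode-time monotonic truncated attention (hedged sketch).

    enc:    (T, D) encoder outputs available so far.
    query:  (D,) current decoder state.
    W_e:    (D, D) hypothetical energy projection (an assumption).
    prev_t: endpoint chosen at the previous output step; starting the
            scan there enforces monotonicity.
    """
    t = prev_t
    # Move forward until the sigmoid selection probability passes 0.5
    # (or we run out of frames seen so far).
    while True:
        energy = query @ W_e @ enc[t]
        if 1.0 / (1.0 + np.exp(-energy)) > 0.5 or t == len(enc) - 1:
            break
        t += 1
    # Truncate: normalise energies over frames 0..t only, so attention
    # never looks past the selected endpoint.
    energies = np.array([query @ W_e @ enc[j] for j in range(t + 1)])
    weights = np.exp(energies - energies.max())
    weights /= weights.sum()
    context = weights @ enc[: t + 1]
    return context, t
```

In the full system described by the abstract, the T-CTC prefix score would be computed over the same truncated frame range, and DWJD would synchronise when CTC and attention each emit their predictions.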
Differentially Private Adversarial Auto-Encoder to Protect Gender in Voice Biometrics
results: On the VoxCeleb dataset, the speaker's gender information can be successfully concealed while speaker verification performance is preserved, and the intensity of the Laplace noise can be tuned to select the desired trade-off between privacy and utility.
Abstract
Over the last decade, the use of Automatic Speaker Verification (ASV) systems has become increasingly widespread in response to the growing need for secure and efficient identity verification methods. Voice data encompasses a wealth of personal information, including but not limited to gender, age, health condition, stress levels, and geographical and socio-cultural origins. These attributes, known as soft biometrics, are private, and the user may wish to keep them confidential. However, with the advancement of machine learning algorithms, soft biometrics can be inferred automatically, creating the potential for unauthorized use. As such, it is crucial to protect these personal data inherent in the voice while retaining the utility of identity recognition. In this paper, we present an adversarial Auto-Encoder-based approach to hide gender-related information in speaker embeddings while preserving their effectiveness for speaker verification. We use an adversarial procedure against a gender classifier and incorporate a layer based on the Laplace mechanism into the Auto-Encoder architecture. This layer adds Laplace noise for more robust gender concealment and ensures differential privacy guarantees during inference for the output speaker embeddings. Experiments conducted on the VoxCeleb dataset demonstrate that speaker verification tasks can be effectively carried out while concealing speaker gender and ensuring differential privacy guarantees; moreover, the intensity of the Laplace noise can be tuned to select the desired trade-off between privacy and utility.
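As a concrete illustration of the Laplace-mechanism layer, here is a minimal sketch: noise drawn from Laplace(0, sensitivity/epsilon) is added to each embedding dimension, the standard calibration for epsilon-differential privacy. The function name and the way sensitivity is supplied are assumptions; in the paper the layer sits inside the Auto-Encoder and is trained jointly with the adversarial gender classifier.

```python
import numpy as np

def laplace_mechanism(embedding, sensitivity, epsilon, rng=None):
    """Add Laplace noise calibrated for epsilon-differential privacy.

    embedding:   1-D speaker embedding (e.g. an x-vector).
    sensitivity: assumed L1 sensitivity of the embedding function
                 (bounded, e.g., by length-normalising the embedding).
    epsilon:     privacy budget; smaller values mean more noise.
    """
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon  # standard Laplace-mechanism scale
    return embedding + rng.laplace(0.0, scale, size=embedding.shape)
```

Tuning epsilon moves along the privacy-utility trade-off the abstract reports: a smaller epsilon conceals gender more robustly at some cost to verification accuracy.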
Leveraging multilingual transfer for unsupervised semantic acoustic word embeddings
results: Experiments show that this multilingual transfer approach yields semantic representations for acoustic word embeddings and performs strongly on an intrinsic word similarity task, outperforming all previous semantic AWE methods. In addition, the paper demonstrates semantic query-by-example search with AWEs for the first time.
Abstract
Acoustic word embeddings (AWEs) are fixed-dimensional vector representations of speech segments that encode phonetic content so that different realisations of the same word have similar embeddings. In this paper we explore semantic AWE modelling. These AWEs should not only capture phonetics but also the meaning of a word (similar to textual word embeddings). We consider the scenario where we only have untranscribed speech in a target language. We introduce a number of strategies leveraging a pre-trained multilingual AWE model: a phonetic AWE model trained on labelled data from multiple languages excluding the target. Our best semantic AWE approach involves clustering word segments using the multilingual AWE model, deriving soft pseudo-word labels from the cluster centroids, and then training a Skipgram-like model on the soft vectors. In an intrinsic word similarity task measuring semantics, this multilingual transfer approach outperforms all previous semantic AWE methods. We also show, for the first time, that AWEs can be used for downstream semantic query-by-example search.
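The soft pseudo-word labelling step can be sketched as follows: cluster the multilingual AWEs of the target-language segments, then give each segment a softmax distribution over the cluster centroids to serve as its soft label. The cluster count, the distance-to-logit mapping, and the temperature below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

def soft_pseudo_labels(awe_vectors, n_clusters=5000, temperature=1.0):
    """Cluster AWEs and derive soft pseudo-word labels (hedged sketch).

    awe_vectors: (N, D) embeddings of untranscribed word segments,
                 produced by the pre-trained multilingual AWE model.
    Returns an (N, n_clusters) matrix of soft label distributions.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(awe_vectors)
    # Negative squared distances to the centroids act as logits.
    logits = -(km.transform(awe_vectors) ** 2) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    soft = np.exp(logits)
    return soft / soft.sum(axis=1, keepdims=True)
```

A Skipgram-like model would then be trained on these soft vectors for segments that co-occur in the same utterance, analogous to textual Skipgram with soft targets.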