eess.AS - 2023-11-08

1-step Speech Processing and Understanding Using CTC Loss

  • paper_url: http://arxiv.org/abs/2311.04753
  • repo_url: None
  • paper_authors: Karan Singla, Shahab Jalalvand, Yeon-Jun Kim, Antonio Moreno Daniel, Srinivas Bangalore, Andrej Ljolje, Ben Stern
  • for: The paper is written to improve the ability of natural language processing systems to recognize named entities and intent in speech.
  • methods: The authors propose a solution that extends the vocabulary of the end-to-end automatic speech recognition (ASR) system by adding a set of unused placeholder symbols, which are then assigned to represent semantic tags. These placeholders are integrated into the transcription process as distinct tokens.
  • results: The proposed solution achieves notable improvements in entity tagging, intent discernment, and transcription accuracy on the SLUE benchmark, and the results are on par with those for the SLURP dataset. Additionally, the authors provide a visual analysis of the system’s proficiency in accurately pinpointing meaningful tokens over time, illustrating the enhancement in transcription quality through the utilization of supplementary semantic tags.
    Abstract Recent studies have made some progress in refining end-to-end (E2E) speech recognition encoders by applying Connectionist Temporal Classification (CTC) loss to enhance named entity recognition within transcriptions. However, these methods have been constrained by their exclusive use of the ASCII character set, allowing only a limited array of semantic labels. Our proposed solution extends the E2E automatic speech recognition (ASR) system's vocabulary by adding a set of unused placeholder symbols, conceptually akin to the tokens used in sequence modeling. These placeholders are then assigned to represent semantic tags and are integrated into the transcription process as distinct tokens. We demonstrate notable improvements in entity tagging, intent discernment, and transcription accuracy on the SLUE benchmark, with results on par with those for the SLURP dataset. Additionally, we provide a visual analysis of the system's proficiency in accurately pinpointing meaningful tokens over time, illustrating the enhancement in transcription quality through the utilization of supplementary semantic tags.
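The placeholder-token idea described in the methods bullet can be illustrated with a minimal sketch. Everything below (the tag inventory, token names, and base vocabulary) is hypothetical, not the paper's actual implementation; it only shows how unused vocabulary slots can be repurposed as distinct semantic-tag tokens inline with a CTC target sequence.

```python
# Sketch: extend a character-level CTC vocabulary with unused placeholder
# symbols, then reassign them to semantic tags (tag names are illustrative).

# Base character-level vocabulary; index 0 is the CTC blank.
base_vocab = ["<blank>"] + list("abcdefghijklmnopqrstuvwxyz '")

# Placeholder symbols appended to the vocabulary, each mapped to a
# (hypothetical) semantic tag. The ASR model emits these as single tokens.
tag_map = {
    "<plh_0>": "<entity:person>",
    "<plh_1>": "</entity:person>",
    "<plh_2>": "<intent:play_music>",
}
vocab = base_vocab + list(tag_map.keys())
tok2id = {tok: i for i, tok in enumerate(vocab)}

def encode(tokens):
    """Map a tagged transcript (characters and tag tokens) to CTC target ids."""
    return [tok2id[t] for t in tokens]

# A tagged training target: tags appear inline as distinct tokens,
# so the same CTC head produces both the transcript and its tags.
target = ["<plh_2>", "<plh_0>"] + list("adele") + ["<plh_1>"]
ids = encode(target)
```

Because the tags live in the same output vocabulary as the characters, no second decoder or separate tagging head is needed; the single CTC loss trains transcription and tagging jointly.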

Selective HuBERT: Self-Supervised Pre-Training for Target Speaker in Clean and Mixture Speech

  • paper_url: http://arxiv.org/abs/2311.04526
  • repo_url: None
  • paper_authors: Jingru Lin, Meng Ge, Wupeng Wang, Haizhou Li, Mengling Feng
  • for: The paper proposes a novel self-supervised pre-training model that selectively extracts target-speaker speech representations and performs well across a range of downstream speech processing tasks.
  • methods: The authors introduce a pre-training method called Selective-HuBERT (SHuBERT), which predicts the target speaker's pseudo-labels conditioned on an enrolled utterance from that speaker, and combines a dual-path training strategy with a cross-correlation constraint to extract target speech selectively.
  • results: Experiments show that SHuBERT achieves strong performance on the SUPERB benchmark and the LibriMix dataset, and that its high-quality representations integrate readily with conventional supervised learning, even with extremely low-resource labeled data.
    Abstract Self-supervised pre-trained speech models were shown effective for various downstream speech processing tasks. Since they are mainly pre-trained to map input speech to pseudo-labels, the resulting representations are only effective for the type of pre-train data used, either clean or mixture speech. With the idea of selective auditory attention, we propose a novel pre-training solution called Selective-HuBERT, or SHuBERT, which learns the selective extraction of target speech representations from either clean or mixture speech. Specifically, SHuBERT is trained to predict pseudo labels of a target speaker, conditioned on an enrolled speech from the target speaker. By doing so, SHuBERT is expected to selectively attend to the target speaker in a complex acoustic environment, thus benefiting various downstream tasks. We further introduce a dual-path training strategy and use the cross-correlation constraint between the two branches to encourage the model to generate noise-invariant representation. Experiments on SUPERB benchmark and LibriMix dataset demonstrate the universality and noise-robustness of SHuBERT. Furthermore, we find that our high-quality representation can be easily integrated with conventional supervised learning methods to achieve significant performance, even under extremely low-resource labeled data.
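The cross-correlation constraint between the two branches can be instantiated in several ways; one plausible form is a Barlow-Twins-style loss that aligns per-dimension features across the clean-speech and mixture-speech branches while decorrelating different dimensions. The sketch below (NumPy; the shapes, epsilon, and `lam` coefficient are assumptions, not the paper's exact formulation) shows how such a constraint pushes the two branches toward noise-invariant representations.

```python
import numpy as np

def cross_correlation_loss(z_clean, z_mix, lam=0.005):
    """Barlow-Twins-style sketch of a cross-correlation constraint.

    z_clean, z_mix: (batch, dim) representations from the clean and
    mixture branches. Diagonal entries of the cross-correlation matrix
    are pushed toward 1 (branches agree per dimension); off-diagonal
    entries are pushed toward 0 (features decorrelated).
    """
    # Standardize each feature dimension across the batch.
    zc = (z_clean - z_clean.mean(0)) / (z_clean.std(0) + 1e-8)
    zm = (z_mix - z_mix.mean(0)) / (z_mix.std(0) + 1e-8)
    n = z_clean.shape[0]
    c = zc.T @ zm / n  # (dim, dim) cross-correlation matrix
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return on_diag + lam * off_diag

# Identical branch outputs should incur a much smaller loss than
# unrelated ones, which is what drives noise-invariance.
rng = np.random.default_rng(0)
z = rng.normal(size=(64, 16))
loss_same = cross_correlation_loss(z, z)
loss_diff = cross_correlation_loss(z, rng.normal(size=(64, 16)))
```

In training, `z_clean` and `z_mix` would come from the same utterance with and without interference, so minimizing this term encourages the mixture branch to reproduce the clean branch's representation.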