cs.SD - 2023-07-17

TST: Time-Sparse Transducer for Automatic Speech Recognition

  • paper_url: http://arxiv.org/abs/2307.08323
  • repo_url: None
  • paper_authors: Xiaohui Zhang, Mangui Liang, Zhengkun Tian, Jiangyan Yi, Jianhua Tao
  • for: This paper targets the long-sequence decoding problem of end-to-end models, in particular the Recurrent Neural Network Transducer (RNN-T), in speech recognition.
  • methods: The authors propose the time-sparse transducer, which builds a time-sparse mechanism into the transducer: intermediate representations are obtained by reducing the time resolution of the hidden states, and a weighted average algorithm then fuses them into sparse hidden states passed to the decoder (see the sketch after this entry).
  • results: Experiments show that the time-sparse transducer reaches a character error rate close to that of RNN-T while cutting the real-time factor to 50.00% of the original; by adjusting the time resolution, the real-time factor can be reduced further to 16.54% of the original at the cost of a 4.94% loss in precision.
    Abstract End-to-end models, especially the Recurrent Neural Network Transducer (RNN-T), have achieved great success in speech recognition. However, the transducer requires a large memory footprint and long computing time when processing a long decoding sequence. To solve this problem, we propose a model named the time-sparse transducer, which introduces a time-sparse mechanism into the transducer. In this mechanism, we obtain intermediate representations by reducing the time resolution of the hidden states. Then a weighted average algorithm is used to combine these representations into sparse hidden states, which are fed to the decoder. All experiments are conducted on the Mandarin dataset AISHELL-1. Compared with RNN-T, the character error rate of the time-sparse transducer is close to that of RNN-T, and the real-time factor is 50.00% of the original. By adjusting the time resolution, the time-sparse transducer can also reduce the real-time factor to 16.54% of the original at the expense of a 4.94% loss of precision.
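A minimal sketch of the time-sparse idea, assuming non-overlapping windows and an energy-based softmax weighting (the paper's exact weighting scheme is not reproduced here): encoder hidden states are averaged within each window, shrinking the number of decoding steps.

```python
# Minimal sketch (not the authors' code) of a time-sparse mechanism:
# hidden states are pooled over non-overlapping windows with a weighted
# average, reducing the time resolution before the decoder. The window
# size `stride` and the softmax weighting are illustrative assumptions.
import numpy as np

def time_sparse(hidden, stride=4):
    """Reduce (T, D) encoder hidden states to (ceil(T/stride), D)
    by weighted averaging within each window."""
    T, D = hidden.shape
    out = []
    for start in range(0, T, stride):
        block = hidden[start:start + stride]      # (<=stride, D)
        # Illustrative weights: softmax over per-frame L2 norm;
        # the paper's actual weighting may differ.
        energy = np.linalg.norm(block, axis=1)
        w = np.exp(energy - energy.max())
        w /= w.sum()
        out.append(w @ block)                     # weighted average of the window
    return np.stack(out)                          # sparse hidden states

if __name__ == "__main__":
    h = np.random.randn(100, 256)       # 100 frames, 256-dim encoder output
    sparse_h = time_sparse(h, stride=4)
    print(sparse_h.shape)               # (25, 256): 4x fewer decoding steps
```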

ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development

  • paper_url: http://arxiv.org/abs/2307.08720
  • repo_url: https://github.com/yairl/ivrit.ai
  • paper_authors: Yanir Marmor, Kinneret Misgav, Yair Lifshitz
  • for: To advance research and development of Automatic Speech Recognition (ASR) technology for Hebrew.
  • methods: Collects over 3,300 hours of Hebrew speech from more than 1,000 diverse speakers and releases it in three forms for different research needs: raw unprocessed audio, data post-Voice Activity Detection (VAD), and partially transcribed data (a VAD sketch follows this entry).
  • results: Provides a large, freely usable Hebrew speech resource for researchers, developers, and commercial entities, helping to advance the standing of Hebrew in AI research and technology.
    Abstract We introduce "ivrit.ai", a comprehensive Hebrew speech dataset, addressing the distinct lack of extensive, high-quality resources for advancing Automated Speech Recognition (ASR) technology in Hebrew. With over 3,300 speech hours and over a thousand diverse speakers, ivrit.ai offers a substantial compilation of Hebrew speech across various contexts. It is delivered in three forms to cater to varying research needs: raw unprocessed audio; data post-Voice Activity Detection; and partially transcribed data. The dataset stands out for its legal accessibility, permitting use at no cost, thereby serving as a crucial resource for researchers, developers, and commercial entities. ivrit.ai opens up numerous applications, offering vast potential to enhance AI capabilities in Hebrew. Future efforts aim to expand ivrit.ai further, thereby advancing Hebrew's standing in AI research and technology.
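As a rough illustration of how a post-VAD form can be derived from raw audio, here is a minimal energy-threshold voice activity detector. It is not the pipeline used to build ivrit.ai; the frame size and threshold are arbitrary assumptions.

```python
# Minimal energy-based VAD sketch: keep segments whose frame energy exceeds
# a dB threshold. Illustration only; ivrit.ai's actual VAD pipeline is not
# specified here.
import numpy as np

def simple_vad(wave, sr=16000, frame_ms=30, threshold_db=-35.0):
    """Return (start_sec, end_sec) segments whose frame RMS exceeds the threshold."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(wave) // frame_len
    segments, seg_start = [], None
    for i in range(n_frames):
        frame = wave[i * frame_len:(i + 1) * frame_len]
        rms_db = 20 * np.log10(np.sqrt(np.mean(frame ** 2)) + 1e-10)
        voiced = rms_db > threshold_db
        if voiced and seg_start is None:
            seg_start = i * frame_len / sr
        elif not voiced and seg_start is not None:
            segments.append((seg_start, i * frame_len / sr))
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, n_frames * frame_len / sr))
    return segments

if __name__ == "__main__":
    sr = 16000
    audio = np.concatenate([np.zeros(sr), 0.1 * np.random.randn(2 * sr), np.zeros(sr)])
    print(simple_vad(audio, sr))   # roughly [(1.0, 3.0)]
```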

BASS: Block-wise Adaptation for Speech Summarization

  • paper_url: http://arxiv.org/abs/2307.08217
  • repo_url: None
  • paper_authors: Roshan Sharma, Kenneth Zheng, Siddhant Arora, Shinji Watanabe, Rita Singh, Bhiksha Raj
  • for: This work aims to improve end-to-end speech summarization; existing models are constrained by compute and are therefore typically trained on truncated input sequences.
  • methods: Proposes training the summarization model block-wise, splitting long inputs into blocks that are processed incrementally while semantic context is passed from block to block (see the sketch after this entry).
  • results: Experiments show that the block-wise training method improves ROUGE-L by 3 points absolute over a truncated-input baseline.
    Abstract End-to-end speech summarization has been shown to improve performance over cascade baselines. However, such models are difficult to train on very large inputs (dozens of minutes or hours) owing to compute restrictions and are hence trained with truncated model inputs. Truncation leads to poorer models, and a solution to this problem rests in block-wise modeling, i.e., processing a portion of the input frames at a time. In this paper, we develop a method that allows one to train summarization models on very long sequences in an incremental manner. Speech summarization is realized as a streaming process, where hypothesis summaries are updated every block based on new acoustic information. We devise and test strategies to pass semantic context across the blocks. Experiments on the How2 dataset demonstrate that the proposed block-wise training method improves by 3 points absolute on ROUGE-L over a truncated input baseline.
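The streaming loop below is a minimal sketch of the block-wise scheme: the hypothesis summary is revised after every block and a context object is handed to the next block. `encode_block` and `update_summary` are hypothetical placeholders, not the BASS model components.

```python
# Minimal sketch of block-wise streaming summarization with semantic context
# carried across blocks. The model interfaces are hypothetical stand-ins; the
# loop only illustrates the incremental processing scheme described above.
from typing import Callable, List

BLOCK_FRAMES = 1000  # illustrative block size, in acoustic frames

def blockwise_summarize(frames: List[list],
                        encode_block: Callable,
                        update_summary: Callable) -> str:
    """Process acoustic frames block by block, revising the hypothesis summary
    as each block arrives and passing semantic context to the next block."""
    context = None   # semantic context accumulated from earlier blocks
    summary = ""     # running hypothesis summary
    for start in range(0, len(frames), BLOCK_FRAMES):
        block = frames[start:start + BLOCK_FRAMES]
        block_repr, context = encode_block(block, context)   # condition on past context
        summary = update_summary(summary, block_repr)         # update the hypothesis
    return summary

if __name__ == "__main__":
    dummy_frames = [[0.0] * 80 for _ in range(3500)]           # 3500 frames of 80-dim features
    enc = lambda block, ctx: (len(block), (ctx or 0) + len(block))
    upd = lambda summ, rep: summ + f"[block of {rep} frames]"
    print(blockwise_summarize(dummy_frames, enc, upd))
```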

Exploring Binary Classification Loss For Speaker Verification

  • paper_url: http://arxiv.org/abs/2307.08205
  • repo_url: https://github.com/hunterhuan/sphereface2_speaker_verification
  • paper_authors: Bing Han, Zhengyang Chen, Yanmin Qian
  • for: This paper aims to improve speaker verification by reducing the performance degradation caused by the mismatch between close-set training and open-set evaluation.
  • methods: It adopts the SphereFace2 framework, training the speaker model with several binary classifiers in a pair-wise manner instead of conventional multi-class classification (a loss sketch follows this entry).
  • results: Experiments on VoxCeleb show that SphereFace2 outperforms existing loss functions, especially on hard trials, and is compatible with large-margin fine-tuning for further gains; it is also robust to class-wise noisy labels, making it promising for semi-supervised training with inaccurately estimated pseudo labels.
    Abstract The mismatch between close-set training and open-set testing usually leads to significant performance degradation for speaker verification task. For existing loss functions, metric learning-based objectives depend strongly on searching effective pairs which might hinder further improvements. And popular multi-classification methods are usually observed with degradation when evaluated on unseen speakers. In this work, we introduce SphereFace2 framework which uses several binary classifiers to train the speaker model in a pair-wise manner instead of performing multi-classification. Benefiting from this learning paradigm, it can efficiently alleviate the gap between training and evaluation. Experiments conducted on Voxceleb show that the SphereFace2 outperforms other existing loss functions, especially on hard trials. Besides, large margin fine-tuning strategy is proven to be compatible with it for further improvements. Finally, SphereFace2 also shows its strong robustness to class-wise noisy labels which has the potential to be applied in the semi-supervised training scenario with inaccurate estimated pseudo labels. Codes are available in https://github.com/Hunterhuan/sphereface2_speaker_verification
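A simplified sketch of the pair-wise binary-classification idea behind SphereFace2: each speaker prototype contributes an independent binary logistic term with a margin, rather than competing in one softmax. The scale, margin, and bias values here are illustrative assumptions; the authors' implementation is in the linked repository.

```python
# Simplified sketch of a pair-wise binary-classification loss in the spirit of
# SphereFace2: each speaker class gets its own positive-vs-negative decision
# instead of one softmax over all speakers.
import numpy as np

def binary_classification_loss(cosines, label, scale=32.0, margin=0.2, bias=0.0):
    """cosines: (C,) cosine similarities between one embedding and C speaker weights."""
    loss = 0.0
    for j, cos_j in enumerate(cosines):
        if j == label:
            # Positive pair: encourage cosine similarity above +margin.
            logit = scale * (cos_j - margin) + bias
            loss += np.logaddexp(0.0, -logit)   # log(1 + exp(-logit))
        else:
            # Negative pair: encourage cosine similarity below -margin.
            logit = scale * (cos_j + margin) + bias
            loss += np.logaddexp(0.0, logit)    # log(1 + exp(+logit))
    return loss / scale

if __name__ == "__main__":
    cos = np.array([0.7, 0.1, -0.2, 0.05])      # similarities to 4 speaker prototypes
    print(binary_classification_loss(cos, label=0))
```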