eess.AS - 2023-07-21

Topic Identification For Spontaneous Speech: Enriching Audio Features With Embedded Linguistic Information

  • paper_url: http://arxiv.org/abs/2307.11450
  • repo_url: https://github.com/aalto-speech/Topic-identification-for-spontaneous-Finnish-speech
  • paper_authors: Dejan Porjazovski, Tamás Grósz, Mikko Kurimo
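  • for: This paper investigates alternatives to the standard text-only topic identification pipeline, including solutions that do not depend on an automatic speech recognition (ASR) system.
  • methods: The paper compares audio-only techniques with hybrid multi-modal techniques that jointly use text and audio features, and evaluates them on spontaneous Finnish speech.
  • results: Purely audio-based solutions are a viable option when ASR components are not available, while the hybrid multi-modal solutions achieve the best results.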
    Abstract Traditional topic identification solutions from audio rely on an automatic speech recognition system (ASR) to produce transcripts used as input to a text-based model. These approaches work well in high-resource scenarios, where there are sufficient data to train both components of the pipeline. However, in low-resource situations, the ASR system, even if available, produces low-quality transcripts, leading to a bad text-based classifier. Moreover, spontaneous speech containing hesitations can further degrade the performance of the ASR model. In this paper, we investigate alternatives to the standard text-only solutions by comparing audio-only and hybrid techniques of jointly utilising text and audio features. The models evaluated on spontaneous Finnish speech demonstrate that purely audio-based solutions are a viable option when ASR components are not available, while the hybrid multi-modal solutions achieve the best results.

MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems

  • paper_url: http://arxiv.org/abs/2307.11394
  • repo_url: None
  • paper_authors: Thilo von Neumann, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach
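  • for: This paper presents MeetEval, an open-source toolkit for evaluating all kinds of meeting transcription systems.
  • methods: The toolkit provides a unified interface for computing commonly used Word Error Rates (WERs), specifically cpWER, ORC WER and MIMO WER, alongside other WER definitions. The authors extend the cpWER computation with a temporal constraint so that a word is counted as correct only when its temporal alignment is plausible, and they approximate word-level timings from segment-level timings (e.g., a sentence) when exact word timings are unavailable.
  • results: The temporal constraint yields a matching of hypothesis to reference that more closely reflects actual transcription quality and penalizes systems with poor time annotations. Approximated word-level timings lead to a WER similar to that obtained with exact word-level annotations, and the time constraint speeds up the matching algorithm by more than the overhead of processing the time stamps.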
    Abstract MeetEval is an open-source toolkit to evaluate all kinds of meeting transcription systems. It provides a unified interface for the computation of commonly used Word Error Rates (WERs), specifically cpWER, ORC WER and MIMO WER along other WER definitions. We extend the cpWER computation by a temporal constraint to ensure that only words are identified as correct when the temporal alignment is plausible. This leads to a better quality of the matching of the hypothesis string to the reference string that more closely resembles the actual transcription quality, and a system is penalized if it provides poor time annotations. Since word-level timing information is often not available, we present a way to approximate exact word-level timings from segment-level timings (e.g., a sentence) and show that the approximation leads to a similar WER as a matching with exact word-level annotations. At the same time, the time constraint leads to a speedup of the matching algorithm, which outweighs the additional overhead caused by processing the time stamps.
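The two ideas in the abstract, approximating word-level timings from segment-level timings and constraining the word matching by time, can be illustrated with a minimal sketch. This is not MeetEval's interface or its exact algorithm: the names `Word`, `approximate_word_timings`, `overlaps` and `time_constrained_errors` are hypothetical, the character-proportional split of a segment is an assumed approximation strategy, and the rule that two words may only be matched or substituted when their time spans overlap is an assumed form of the temporal constraint. It also handles a single word stream only, whereas cpWER additionally searches over speaker permutations.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def approximate_word_timings(segment_text, seg_start, seg_end):
    """Spread a segment's time span over its words.

    Assumption for illustration: each word receives a share of the
    segment duration proportional to its character length.
    """
    words = segment_text.split()
    total_chars = sum(len(w) for w in words)
    timed, cursor = [], seg_start
    for w in words:
        duration = (seg_end - seg_start) * len(w) / total_chars
        timed.append(Word(w, cursor, cursor + duration))
        cursor += duration
    return timed


def overlaps(a, b, collar=0.0):
    """True if the two time spans overlap, with an optional tolerance in seconds."""
    return a.start < b.end + collar and b.start < a.end + collar


def time_constrained_errors(ref, hyp, collar=0.0):
    """Levenshtein distance over words with a temporal constraint.

    Assumed rule: pairing two words that do not overlap in time costs 2,
    the same as one deletion plus one insertion, so such a pair can never
    be counted as correct or as a single substitution.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            if not overlaps(r, h, collar):
                diag = prev[j - 1] + 2
            elif r.text == h.text:
                diag = prev[j - 1]
            else:
                diag = prev[j - 1] + 1
            curr.append(min(diag, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def wer(ref, hyp, collar=0.0):
    """Word error rate: edit errors divided by the number of reference words."""
    return time_constrained_errors(ref, hyp, collar) / len(ref)


if __name__ == "__main__":
    ref = approximate_word_timings("hello world", 0.0, 2.0)
    on_time = approximate_word_timings("hello world", 0.0, 2.0)
    late = approximate_word_timings("hello world", 10.0, 12.0)
    print(time_constrained_errors(ref, on_time))  # 0: same words, plausible timing
    print(time_constrained_errors(ref, late))     # 4: same words, implausible timing
```

In this sketch a hypothesis with the right words but time stamps ten seconds off is scored as two deletions plus two insertions, which mirrors the abstract's point that a system is penalized for poor time annotations. A plain WER without the overlap check would score the same hypothesis as error-free.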