paper_authors: Ayako Yamamoto, Toshio Irino, Fuki Miyazaki, Honoka Tamaru
for: This study aims to develop a new objective intelligibility measure (OIM) that predicts the speech intelligibility (SI) of simulated hearing loss (HL) sounds for normal hearing (NH) listeners.
methods: The study proposes the Gammachirp Envelope Similarity Index (GESI), which computes the SI metric using the gammachirp filterbank (GCFB), a modulation filterbank, and an extended cosine similarity measure. GESI can accept a level asymmetry between the reference and test sounds and reflects the listener's hearing level as it appears on the audiogram.
results: In four SI experiments, GESI predicted both mean and individual SI values, whereas the conventional OIMs (STOI, ESTOI, MBSTOI, and HASPI) failed to do so. GESI can also reflect each listener's individual hearing condition in its predictions.
Abstract
We proposed a new objective intelligibility measure (OIM), called the Gammachirp Envelope Similarity Index (GESI), which can predict the speech intelligibility (SI) of simulated hearing loss (HL) sounds for normal hearing (NH) listeners. GESI is an intrusive method that computes the SI metric using the gammachirp filterbank (GCFB), the modulation filterbank, and the extended cosine similarity measure. GESI can accept the level asymmetry of the reference and test sounds and reflect the hearing-impaired (HI) listener's hearing level as it appears on the audiogram. A unique feature of GESI is its ability to incorporate an individual participant's listening condition into the SI prediction. We conducted four SI experiments on male and female speech sounds in both laboratory and crowdsourced remote environments. We then evaluated GESI and the conventional OIMs, STOI, ESTOI, MBSTOI, and HASPI, for their ability to predict mean and individual SI values with and without the use of simulated HL sounds. GESI outperformed the other OIMs in all evaluations. STOI, ESTOI, and MBSTOI did not predict SI at all, even when using the simulated HL sounds. HASPI did not predict the difference between the laboratory and remote experiments on male speech sounds and the individual SI values. GESI may provide a first step toward SI prediction for individual HI listeners whose HL is caused solely by peripheral dysfunction.
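The abstract names the ingredients of GESI but not the formula; below is a minimal Python sketch of the central step, an extended cosine similarity averaged over filterbank channels. The function names, the exponent parameter `rho`, and the plain averaging are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def extended_cosine_similarity(env_ref, env_test, rho=1.0, eps=1e-12):
    """Hypothetical sketch: cosine similarity between reference and test
    envelopes, generalized with an exponent rho that could stand in for a
    listener-dependent weighting (an assumption, not the GESI definition)."""
    r = np.maximum(np.asarray(env_ref, dtype=float), 0.0) ** rho
    t = np.maximum(np.asarray(env_test, dtype=float), 0.0) ** rho
    return float(np.dot(r, t) / (np.linalg.norm(r) * np.linalg.norm(t) + eps))

def gesi_like_metric(envelopes_ref, envelopes_test, rho=1.0):
    """Average the similarity over (auditory channel, modulation channel)
    envelope pairs, mimicking a filterbank-based intrusive metric in spirit only."""
    scores = [extended_cosine_similarity(r, t, rho)
              for r, t in zip(envelopes_ref, envelopes_test)]
    return float(np.mean(scores))
```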
8+8=4: Formalizing Time Units to Handle Symbolic Music Durations
results: The authors unify this time unit and the more commonly used approach in a single mathematical framework, discuss practical use cases, and show that the system can make processing more efficient in terms of the data types used and the number of computations.
Abstract
This paper focuses on the nominal durations of musical events (notes and rests) in a symbolic musical score, and on how to conveniently handle these in computer applications. We propose the usage of a temporal unit that is directly related to the graphical symbols in musical scores and pair this with a set of operations that cover typical computations in music applications. We formalize this time unit and the more commonly used approach in a single mathematical framework, as semirings, algebraic structures that enable an abstract description of algorithms/processing pipelines. We then discuss some practical use cases and highlight when our system can improve such pipelines by making them more efficient in terms of data type used and the number of computations.
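As a rough illustration of the semiring view (an assumed formalization, not the one in the paper), the sketch below represents nominal durations as exact rational numbers and packages the two operations, combining durations and scaling them, behind a minimal semiring interface:

```python
from fractions import Fraction
from dataclasses import dataclass

@dataclass(frozen=True)
class DurationSemiring:
    """Minimal sketch of a semiring over symbolic durations: 'add' combines
    successive durations, 'mul' scales them (e.g., for tuplets), and exact
    rational arithmetic avoids floating-point drift."""
    zero: Fraction = Fraction(0)
    one: Fraction = Fraction(1)

    def add(self, a: Fraction, b: Fraction) -> Fraction:
        return a + b

    def mul(self, a: Fraction, b: Fraction) -> Fraction:
        return a * b

# Example (hypothetical): an eighth note plus one member of an eighth-note
# triplet, expressed in whole-note units.
S = DurationSemiring()
eighth = Fraction(1, 8)
triplet_eighth = S.mul(eighth, Fraction(2, 3))
total = S.add(eighth, triplet_eighth)   # Fraction(5, 24)
```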
Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model
results: Our approach achieves efficient and accurate recognition on multilingual audio-visual speech recognition tasks without building language-specific models. These results show that the approach improves the robustness and efficiency of multilingual audio-visual speech recognition systems.
Abstract
We present a novel approach to multilingual audio-visual speech recognition tasks by introducing a single model trained on a multilingual dataset. Motivated by the human cognitive system, in which humans can intuitively distinguish different languages without any conscious effort or guidance, we propose a model that can capture which language is given as the input speech by distinguishing the inherent similarities and differences between languages. To do so, we design a prompt fine-tuning technique for the large-scale pre-trained audio-visual representation model so that the network can recognize the language class as well as the speech in the corresponding language. Our work contributes to developing robust and efficient multilingual audio-visual speech recognition systems, reducing the need for language-specific models.
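The abstract only names prompt fine-tuning; a minimal PyTorch-style sketch under assumed interfaces is shown below: learnable prompt embeddings are prepended to fused audio-visual features before a frozen pre-trained encoder, and the language class is read off one of the prompt positions. The class name, the auxiliary language head, and the encoder signature are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class PromptTunedAVEncoder(nn.Module):
    """Hypothetical sketch: learnable prompt tokens prepended to audio-visual
    features of a frozen pre-trained encoder (assumed interface)."""
    def __init__(self, encoder: nn.Module, dim: int, num_prompts: int = 16,
                 num_languages: int = 9):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():      # keep the backbone frozen
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.lang_head = nn.Linear(dim, num_languages)   # auxiliary language classifier

    def forward(self, av_features: torch.Tensor):
        # av_features: (batch, time, dim) fused audio-visual representations
        b = av_features.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([prompts, av_features], dim=1)
        h = self.encoder(x)                        # (batch, num_prompts + time, dim)
        lang_logits = self.lang_head(h[:, 0])      # language class from the first prompt slot
        speech_states = h[:, prompts.size(1):]     # states passed to the recognition decoder
        return speech_states, lang_logits
```

Any encoder that maps a `(batch, time, dim)` tensor to the same shape, for example a `torch.nn.TransformerEncoder` built with `batch_first=True`, fits this sketch.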
Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions
results: The paper summarizes existing trackers and their performance on the AV16.3 dataset, and discusses the influence of deep learning techniques on measurement extraction and state estimation.
Abstract
Audio-visual speaker tracking has drawn increasing attention over the past few years due to its academic value and wide range of applications. Audio and visual modalities can provide complementary information for localization and tracking. With audio and visual information, Bayesian-based filters can solve the problems of data association, audio-visual fusion, and track management. In this paper, we conduct a comprehensive overview of audio-visual speaker tracking. To our knowledge, this is the first extensive survey over the past five years. We introduce the family of Bayesian filters and summarize the methods for obtaining audio-visual measurements. In addition, the existing trackers and their performance on the AV16.3 dataset are summarized. In the past few years, deep learning techniques have thrived, which has also boosted the development of audio-visual speaker tracking. The influence of deep learning techniques on measurement extraction and state estimation is also discussed. Finally, we discuss the connections between audio-visual speaker tracking and other areas such as speech separation and distributed speaker tracking.
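The survey covers a family of Bayesian filters; as a generic, hypothetical example of how such a filter fuses the two modalities, the sketch below performs one particle filter step in which audio and visual observation likelihoods are multiplied in the weight update. The Gaussian likelihoods, the random-walk motion model, and the parameter values are assumptions for illustration only.

```python
import numpy as np

def particle_filter_step(particles, weights, audio_obs, visual_obs,
                         motion_std=0.05, audio_std=0.3, visual_std=0.1):
    """Hypothetical sketch of one audio-visual particle filter update:
    predict with a random-walk motion model, then weight each particle by the
    product of independent Gaussian audio and visual observation likelihoods."""
    # Predict: random-walk motion model over the speaker state (e.g., 2-D position).
    particles = particles + np.random.normal(0.0, motion_std, particles.shape)

    # Update: fuse the two modalities by multiplying their likelihoods.
    lik_audio = np.exp(-0.5 * np.sum((particles - audio_obs) ** 2, axis=-1) / audio_std ** 2)
    lik_visual = np.exp(-0.5 * np.sum((particles - visual_obs) ** 2, axis=-1) / visual_std ** 2)
    weights = weights * lik_audio * lik_visual
    total = weights.sum()
    weights = weights / total if total > 0 else np.full(len(weights), 1.0 / len(weights))

    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(weights):
        idx = np.random.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights
```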
Acoustic BPE for Speech Generation with Discrete Tokens
results: A comprehensive investigation of a speech language model trained with acoustic BPE confirms its advantages, including faster inference and improved syntax capturing, and the authors propose a novel rescoring method to select the optimal synthetic speech among multiple candidates.
Abstract
Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, the current practice of directly utilizing audio tokens poses challenges for sequence modeling due to the length of the token sequence. Additionally, this approach places the burden on the model to establish correlations between tokens, further complicating the modeling process. To address this issue, we propose acoustic BPE, which encodes frequent audio token patterns by utilizing byte-pair encoding. Acoustic BPE effectively reduces the sequence length and leverages the prior morphological information present in the token sequence, which alleviates the modeling challenges of token correlation. Through comprehensive investigations on a speech language model trained with acoustic BPE, we confirm the notable advantages it offers, including faster inference and improved syntax capturing capabilities. In addition, we propose a novel rescoring method to select the optimal synthetic speech among multiple candidates generated by a rich-diversity TTS system. Experiments prove that rescoring selection aligns closely with human preference, which highlights acoustic BPE's potential for other speech generation tasks.
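As a toy illustration of the idea rather than the paper's implementation, byte-pair encoding can be learned directly over sequences of discrete audio token IDs: each merge replaces the currently most frequent adjacent pair with a new ID, shortening the sequences the speech language model has to handle. The function names and the tiny example data below are assumptions.

```python
from collections import Counter

def learn_bpe_merges(token_seqs, num_merges=100):
    """Hypothetical sketch: learn BPE merges over sequences of discrete audio
    token IDs. Each merge replaces the most frequent adjacent pair with a new
    token ID, shortening the sequences."""
    seqs = [list(s) for s in token_seqs]
    next_id = max(t for s in seqs for t in s) + 1
    merges = {}
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges[(a, b)] = next_id
        # Apply the merge in place.
        for i, s in enumerate(seqs):
            out, j = [], 0
            while j < len(s):
                if j + 1 < len(s) and (s[j], s[j + 1]) == (a, b):
                    out.append(next_id)
                    j += 2
                else:
                    out.append(s[j])
                    j += 1
            seqs[i] = out
        next_id += 1
    return merges, seqs

# Example with made-up token IDs: merged sequences are shorter, easing
# autoregressive modeling in a speech language model.
merges, compressed = learn_bpe_merges([[5, 5, 9, 2, 5, 5, 9]], num_merges=2)
```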