results: Experiments show that the model can accurately predict the head orientations of both the speaker and the listener, and provides high-accuracy estimates across different listener positions and environments.
Abstract
Estimation of a speaker's direction and head orientation with binaural recordings can be a critical piece of information in many real-world applications with emerging `earable' devices, including smart headphones and AR/VR headsets. However, it requires predicting the mutual head orientations of both the speaker and the listener, which is challenging in practice. This paper presents a system for jointly predicting speaker-listener head orientations by leveraging the inherent directivity of the human voice and the listener's head-related transfer function (HRTF) as perceived by the ear-mounted microphones on the listener. We propose a convolutional neural network model that, given a binaural speech recording, can predict the orientations of both the speaker and the listener with respect to the line joining the two. The system builds on the core observation that the recordings from the left and right ears are differentially affected by the voice directivity as well as the HRTF. We also incorporate the fact that the voice is more directional at higher frequencies than at lower frequencies.
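To make the setup concrete, here is a minimal sketch of a CNN that consumes a two-channel (left/right ear) spectrogram and emits two orientation predictions; the input features, layer sizes, and the discretisation of orientation into classes are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BinauralOrientationNet(nn.Module):
    def __init__(self, n_classes: int = 8):
        super().__init__()
        # Keeping left/right channels separate at the input lets early layers learn the
        # inter-ear level and spectral differences that voice directivity and HRTF induce.
        self.backbone = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Two heads: one for the speaker's head orientation, one for the listener's,
        # both relative to the line joining speaker and listener.
        self.speaker_head = nn.Linear(64, n_classes)
        self.listener_head = nn.Linear(64, n_classes)

    def forward(self, binaural_spec: torch.Tensor):
        # binaural_spec: (batch, 2, n_mels, n_frames)
        feats = self.backbone(binaural_spec)
        return self.speaker_head(feats), self.listener_head(feats)

# Example: a batch of 4 two-channel spectrograms with 64 mel bins and 100 frames.
logits_spk, logits_lis = BinauralOrientationNet()(torch.randn(4, 2, 64, 100))
```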
A multi-modal approach for identifying schizophrenia using cross-modal attention
results: In terms of weighted average F1 score, the multi-modal system improves on the previous state-of-the-art multi-modal system by 8.53%.
Abstract
This study focuses on how different modalities of human communication can be used to distinguish between healthy controls and subjects with schizophrenia who exhibit strong positive symptoms. We developed a multi-modal schizophrenia classification system using audio, video, and text. Facial action units and vocal tract variables were extracted as low-level features from video and audio respectively, which were then used to compute high-level coordination features that served as the inputs to the audio and video modalities. Context-independent text embeddings extracted from transcriptions of speech were used as the input for the text modality. The multi-modal system is developed by fusing a segment-to-session-level classifier for video and audio modalities with a text model based on a Hierarchical Attention Network (HAN) with cross-modal attention. The proposed multi-modal system outperforms the previous state-of-the-art multi-modal system by 8.53% in the weighted average F1 score.
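As a rough illustration of the fusion step, the sketch below lets text embeddings attend over audio/video coordination features with a single cross-modal attention layer; the dimensions, mean pooling, and single-layer design are assumptions standing in for the paper's HAN-based fusion.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    def __init__(self, dim: int = 128, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(dim, 2)  # healthy control vs. schizophrenia

    def forward(self, text_feats, av_feats):
        # text_feats: (batch, n_text_tokens, dim) used as queries
        # av_feats:   (batch, n_av_segments, dim) used as keys/values from audio/video
        fused, _ = self.attn(query=text_feats, key=av_feats, value=av_feats)
        return self.classifier(fused.mean(dim=1))

logits = CrossModalAttentionFusion()(torch.randn(2, 10, 128), torch.randn(2, 20, 128))
```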
Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference
results: Experimental results show that our method speeds up inference on the LibriSpeech corpus by a factor of 12-13 while maintaining high accuracy.
Abstract
Attention-based encoder-decoder models with autoregressive (AR) decoding have proven to be the dominant approach for automatic speech recognition (ASR) due to their superior accuracy. However, they often suffer from slow inference, primarily attributed to the incremental calculation of the decoder. This work proposes a partially AR framework, which employs segment-level vectorized beam search to improve the inference speed of an ASR model based on the hybrid connectionist temporal classification (CTC) attention-based architecture. It first generates an initial hypothesis using greedy CTC decoding, identifying low-confidence tokens based on their output probabilities. We then utilize the decoder to perform segment-level vectorized beam search on these tokens, re-predicting them in parallel with minimal decoder calculations. Experimental results show that our method is 12 to 13 times faster in inference on the LibriSpeech corpus than AR decoding while preserving high accuracy.
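The core of the partially autoregressive idea is the first pass: greedy CTC decoding plus flagging low-confidence tokens for re-prediction. A minimal sketch of that step is below; the confidence threshold and the per-frame confidence measure are illustrative assumptions, not the paper's exact criterion.

```python
import torch

def flag_low_confidence(ctc_logits: torch.Tensor, threshold: float = 0.9, blank: int = 0):
    # ctc_logits: (n_frames, vocab_size) frame-level CTC outputs.
    probs, tokens = ctc_logits.softmax(dim=-1).max(dim=-1)
    hypothesis, confident = [], []
    prev = blank
    for t in range(tokens.size(0)):          # standard greedy CTC collapse
        tok = tokens[t].item()
        if tok != blank and tok != prev:
            hypothesis.append(tok)
            # Confidence of the first frame emitting the token; real systems may
            # aggregate over all frames of the token instead.
            confident.append(probs[t].item() >= threshold)
        prev = tok
    # Positions where confident is False mark the segments that the attention decoder
    # re-predicts in parallel with segment-level vectorized beam search.
    return hypothesis, confident

hyp, conf = flag_low_confidence(torch.randn(50, 30))
```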
Emphasized Non-Target Speaker Knowledge in Knowledge Distillation for Automatic Speaker Verification
paper_authors: Duc-Tuan Truong, Ruijie Tao, Jia Qi Yip, Kong Aik Lee, Eng Siong Chng
for: Improving automatic speaker verification performance
methods: Uses knowledge distillation to enforce consistency between the teacher and student networks, while emphasizing the classification probabilities of non-target speakers
results: Applying the modified knowledge distillation to three different student model architectures achieves an average 13.67% improvement in EER on the VoxCeleb dataset compared to embedding-level and conventional label-level knowledge distillation.
Abstract
Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, the conventional label-level KD overlooks the significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for automatic speaker verification. In this paper, we first demonstrate that leveraging a larger number of training non-target speakers improves the performance of automatic speaker verification models. Inspired by this finding about the importance of non-target speakers' knowledge, we modified the conventional label-level KD by disentangling and emphasizing the classification probabilities of non-target speakers during knowledge distillation. The proposed method is applied to three different student model architectures and achieves an average of 13.67% improvement in EER on the VoxCeleb dataset compared to embedding-level and conventional label-level KD methods.
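One way to realise the emphasis on non-target speakers is to drop the target speaker's logit and distill only the renormalised distribution over the remaining speakers, as in the sketch below; the exact disentangling and weighting used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def non_target_kd_loss(student_logits, teacher_logits, target, temperature: float = 2.0):
    # student_logits, teacher_logits: (batch, n_speakers); target: (batch,) true speaker ids.
    n = student_logits.size(-1)
    mask = F.one_hot(target, n).bool()
    # Remove the target speaker's logit so the distribution covers non-target speakers only.
    s_nt = student_logits[~mask].view(-1, n - 1) / temperature
    t_nt = teacher_logits[~mask].view(-1, n - 1) / temperature
    return F.kl_div(s_nt.log_softmax(-1), t_nt.softmax(-1),
                    reduction='batchmean') * temperature ** 2

loss = non_target_kd_loss(torch.randn(8, 100), torch.randn(8, 100),
                          torch.randint(0, 100, (8,)))
```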
Optimization Techniques for a Physical Model of Human Vocalisation
results: Comparing different optimization techniques and audio representations shows that genetic and swarm optimizers optimize the model better than least-squares algorithms at a higher computational cost, and that specific combinations of optimizers and audio representations offer significantly different results.
Abstract
We present an unsupervised approach to optimize and evaluate the synthesis of non-speech audio effects from a speech production model. We use the Pink Trombone synthesizer as a case study of a simplified production model of the vocal tract to target non-speech human audio signals, namely yawnings. We selected and optimized the control parameters of the synthesizer to minimize the difference between real and generated audio. We validated the most common optimization techniques reported in the literature and a specifically designed neural network. We evaluated several popular quality metrics as error functions, including both objective quality metrics and subjective-equivalent metrics. We compared the results in terms of total error and computational demand. Results show that genetic and swarm optimizers outperform least squares algorithms at the cost of slower execution, and that specific combinations of optimizers and audio representations offer significantly different results. The proposed methodology could be used in benchmarking other physical models and audio types.
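The parameter-fitting loop can be sketched as follows with a toy stand-in synthesizer and SciPy's differential evolution (an evolutionary optimizer in the genetic family); the error metric, parameter bounds, and the stand-in synthesize function are assumptions rather than the paper's Pink Trombone setup.

```python
import numpy as np
from scipy.optimize import differential_evolution
from scipy.signal import stft

SR = 16000

def synthesize(params, duration=0.5):
    # Toy stand-in for a vocal-tract model: two "formant" sinusoids controlled by the parameters.
    f1, f2, amp = params
    t = np.arange(int(SR * duration)) / SR
    return amp * (np.sin(2 * np.pi * f1 * t) + 0.5 * np.sin(2 * np.pi * f2 * t))

def spectral_error(params, target_audio):
    # Distance between magnitude spectrograms of generated and target audio.
    _, _, s_gen = stft(synthesize(params), fs=SR, nperseg=512)
    _, _, s_tgt = stft(target_audio, fs=SR, nperseg=512)
    return float(np.mean((np.abs(s_gen) - np.abs(s_tgt)) ** 2))

target = synthesize([220.0, 600.0, 0.8])   # pretend this is a recorded yawn
result = differential_evolution(spectral_error,
                                bounds=[(100, 400), (400, 1000), (0.1, 1.0)],
                                args=(target,), maxiter=20, seed=0)
print(result.x, result.fun)
```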
Exploring RWKV for Memory Efficient and Low Latency Streaming ASR
methods: Applies RWKV, a linear-attention transformer variant that combines the superior performance of transformers with the inference efficiency of RNNs, to streaming ASR scenarios where the latency and memory budgets are limited.
results: Experiments at varying scales (100h-10000h) show that RWKV-Transducer and RWKV-Boundary-Aware-Transducer achieve accuracy comparable to or better than a chunk conformer transducer, with lower latency and inference memory cost.
Abstract
Recently, self-attention-based transformers and conformers have been introduced as alternatives to RNNs for ASR acoustic modeling. Nevertheless, the full-sequence attention mechanism is non-streamable and computationally expensive, thus requiring modifications, such as chunking and caching, for efficient streaming ASR. In this paper, we propose to apply RWKV, a variant of linear attention transformer, to streaming ASR. RWKV combines the superior performance of transformers and the inference efficiency of RNNs, which is well-suited for streaming ASR scenarios where the budget for latency and memory is restricted. Experiments on varying scales (100h - 10000h) demonstrate that RWKV-Transducer and RWKV-Boundary-Aware-Transducer achieve comparable to or even better accuracy compared with chunk conformer transducer, with minimal latency and inference memory cost.
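What makes RWKV attractive for streaming is that its time-mixing can be evaluated as a recurrence with a constant-size state, so per-step cost does not grow with sequence length. The sketch below shows a v4-style WKV recurrence without the numerical-stability rescaling that real implementations use; treat it as illustrative rather than the exact model used in the paper.

```python
import torch

def wkv_recurrence(k, v, w, u):
    # k, v: (T, dim) keys/values; w: (dim,) per-channel decay; u: (dim,) bonus for the current token.
    T, dim = k.shape
    num = torch.zeros(dim)   # running sum of exp(k_i) * v_i, decayed by exp(-w) each step
    den = torch.zeros(dim)   # running sum of exp(k_i), decayed identically
    outputs = []
    for t in range(T):
        # The current token receives an extra bonus u before being folded into the state.
        out = (num + torch.exp(u + k[t]) * v[t]) / (den + torch.exp(u + k[t]))
        outputs.append(out)
        num = torch.exp(-w) * num + torch.exp(k[t]) * v[t]
        den = torch.exp(-w) * den + torch.exp(k[t])
    return torch.stack(outputs)

y = wkv_recurrence(torch.randn(8, 16), torch.randn(8, 16), torch.rand(16), torch.zeros(16))
```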
Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification
results: Session information can be effectively compensated for without retraining the embedding extractor.
Abstract
In the field of speaker verification, session or channel variability poses a significant challenge. While many contemporary methods aim to disentangle session information from speaker embeddings, we introduce a novel approach using an additional embedding to represent the session information. This is achieved by training an auxiliary network appended to the speaker embedding extractor, which remains fixed during this training process. This results in two similarity scores: one for the speaker information and one for the session information. The latter score acts as a compensator for the former, which might be skewed due to session variations. Our extensive experiments demonstrate that session information can be effectively compensated without retraining of the embedding extractor.
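A minimal sketch of how the two scores could be combined at scoring time is shown below; the linear compensation form and the weight alpha are assumptions for illustration, not necessarily the paper's exact rule.

```python
import torch
import torch.nn.functional as F

def compensated_score(spk_emb1, spk_emb2, sess_emb1, sess_emb2, alpha: float = 0.3):
    speaker_score = F.cosine_similarity(spk_emb1, spk_emb2, dim=-1)
    session_score = F.cosine_similarity(sess_emb1, sess_emb2, dim=-1)
    # A high session similarity can inflate the speaker score; discount it accordingly.
    return speaker_score - alpha * session_score

score = compensated_score(torch.randn(192), torch.randn(192), torch.randn(64), torch.randn(64))
```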