results: In experiments with signals recorded in both anechoic and reverberant environments, the proposed method outperforms state-of-the-art approaches and provides high-quality head orientation estimates even with uncalibrated, irregular microphone setups.
Abstract
Determining the head orientation of a talker is not only beneficial for various speech signal processing applications, such as source localization or speech enhancement, but also facilitates intuitive voice control and interaction with smart environments or modern car assistants. Most approaches for head orientation estimation are based on visual cues. However, this requires camera systems which often are not available. We present an approach which purely uses audio signals captured with only a few distributed microphones around the talker. Specifically, we propose a novel method that directly incorporates measured or modeled speech radiation patterns to infer the talker's orientation during active speech periods based on a cosine similarity measure. Moreover, an automatic gain adjustment technique is proposed for uncalibrated, irregular microphone setups, such as ad-hoc sensor networks. In experiments with signals recorded in both anechoic and reverberant environments, the proposed method outperforms state-of-the-art approaches, using either measured or modeled speech radiation patterns.
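The cosine-similarity matching described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the cardioid-like radiation model, the microphone geometry, and the noise level below are hypothetical stand-ins for the measured or modeled speech radiation patterns the paper uses.

```python
import numpy as np

def cardioid_radiation(angle_rad):
    """Hypothetical speech radiation model: a cardioid-like gain as a function
    of the angle between the talker's facing direction and a microphone."""
    return 0.5 * (1.0 + np.cos(angle_rad))

def estimate_orientation(mic_angles, observed_levels, candidates):
    """Pick the candidate head orientation whose predicted per-microphone
    level pattern has the highest cosine similarity with the observation."""
    best_theta, best_score = None, -np.inf
    for theta in candidates:
        predicted = cardioid_radiation(mic_angles - theta)
        score = np.dot(predicted, observed_levels) / (
            np.linalg.norm(predicted) * np.linalg.norm(observed_levels))
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score

# Toy setup: six microphones evenly spaced on a circle around the talker.
mic_angles = np.linspace(0.0, 2.0 * np.pi, 6, endpoint=False)
true_orientation = np.deg2rad(60.0)
rng = np.random.default_rng(0)
observed = cardioid_radiation(mic_angles - true_orientation)
observed *= 1.0 + 0.05 * rng.standard_normal(observed.shape)  # measurement noise

candidates = np.deg2rad(np.arange(0.0, 360.0, 5.0))
theta_hat, score = estimate_orientation(mic_angles, observed, candidates)
print(f"estimated orientation: {np.rad2deg(theta_hat):.1f} deg (similarity {score:.3f})")
```

The sketch scores each candidate orientation by the cosine similarity between the level pattern it would predict at the microphones and the levels actually observed, then returns the best-scoring angle; the paper's gain-adjustment step for uncalibrated setups is omitted here.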
A text-dependent speaker verification application framework based on Chinese numerical string corpus
paper_authors: Litong Zheng, Feng Hong, Weijie Xu
for: This paper addresses text-dependent speaker verification (TD-SV) in short speech scenarios and proposes a solution based on multi-scale pooling methods.
methods: The approach uses a Transformer-based text embedding network and a sliding-window attentive statistics pooling method for the speaker embedding network, with text and speaker embeddings combined in a back-end fusion stage.
results: The method achieves equal error rate (EER) improvements of 49.2% on Hi-Mia and 75.0% on SHAL, demonstrating its effectiveness for TD-SV.
Abstract
Research indicates that text-dependent speaker verification (TD-SV) often outperforms text-independent verification (TI-SV) in short speech scenarios. However, collecting large-scale fixed-text speech data is challenging, and as speech length increases, factors such as sentence rhythm and pauses affect TD-SV's sensitivity to the text sequence. Based on these factors, we hypothesize that strategies such as more fine-grained pooling methods on time scales and decoupled speaker and text representations of speech are more suitable for TD-SV. We introduce an end-to-end TD-SV system based on a dataset comprising longer Chinese numerical string texts. It contains a text embedding network, a speaker embedding network, and back-end fusion. First, we recorded a dataset consisting of long Chinese numerical strings, named SHAL, which is publicly available on the Open-SLR website. We addressed the issue of dataset scarcity by augmenting it using Tacotron2 and HiFi-GAN. Next, we introduced a dual representation of speech with text embedding and speaker embedding. In the text embedding network, we employed an enhanced Transformer and introduced a triple loss that includes text classification loss, CTC loss, and decoder loss. For the speaker embedding network, we enhanced sliding window attentive statistics pooling (SWASP) and combined it with attentive statistics pooling (ASP) to create a multi-scale pooling method. Finally, we fused the text embedding and speaker embedding. Our pooling methods achieve equal error rate (EER) improvements of 49.2% on Hi-Mia and 75.0% on SHAL, respectively.
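The multi-scale pooling idea, combining a sliding-window variant of attentive statistics pooling with utterance-level pooling, can be sketched roughly as follows. This is an illustrative approximation, not the authors' SWASP/ASP code: the attention network, window length, hop size, and feature dimensions are all assumptions.

```python
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    """Simplified attentive statistics pooling: attention-weighted mean and
    standard deviation over the time axis (an approximation of ASP)."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, x):                       # x: (batch, time, dim)
        w = torch.softmax(self.attn(x), dim=1)  # attention weights over time
        mean = (w * x).sum(dim=1)
        var = (w * (x - mean.unsqueeze(1)) ** 2).sum(dim=1)
        std = torch.sqrt(var.clamp_min(1e-8))
        return torch.cat([mean, std], dim=-1)   # (batch, 2 * dim)

class MultiScalePool(nn.Module):
    """Combine utterance-level ASP with a sliding-window variant: pool each
    local window with ASP, then pool the window-level statistics again."""
    def __init__(self, dim, window=50, hop=25):
        super().__init__()
        self.window, self.hop = window, hop
        self.local_pool = AttentiveStatsPool(dim)
        self.global_pool = AttentiveStatsPool(dim)
        self.window_pool = AttentiveStatsPool(2 * dim)

    def forward(self, x):                                  # x: (batch, time, dim)
        windows = x.unfold(1, self.window, self.hop)       # (B, n_win, dim, window)
        windows = windows.permute(0, 1, 3, 2)              # (B, n_win, window, dim)
        b, n, w, d = windows.shape
        local = self.local_pool(windows.reshape(b * n, w, d)).reshape(b, n, -1)
        return torch.cat([self.global_pool(x), self.window_pool(local)], dim=-1)

feats = torch.randn(4, 200, 80)                 # e.g. 200 frames of 80-dim features
print(MultiScalePool(80)(feats).shape)          # torch.Size([4, 480])
```

Pooling each local window before pooling again over the window-level statistics is what gives the representation both fine-grained and utterance-level time resolution.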
Multimodal Speech Emotion Recognition Using Modality-specific Self-Supervised Frameworks
results: Trained on the publicly available IEMOCAP dataset, the model achieves an overall accuracy of 77.58% across four emotions, outperforming state-of-the-art approaches.
Abstract
Emotion recognition is a topic of significant interest in assistive robotics due to the need to equip robots with the ability to comprehend human behavior, facilitating their effective interaction in our society. Consequently, efficient and dependable emotion recognition systems supporting optimal human-machine communication are required. Multi-modality (including speech, audio, text, images, and videos) is typically exploited in emotion recognition tasks. Much relevant research is based on merging multiple data modalities and training deep learning models utilizing low-level data representations. However, most existing emotion databases are not large (or complex) enough to allow machine learning approaches to learn detailed representations. This paper explores modality-specific pre-trained transformer frameworks for self-supervised learning of speech and text representations for data-efficient emotion recognition while achieving state-of-the-art performance in recognizing emotions. This model applies feature-level fusion using nonverbal cue data points from motion capture to provide multimodal speech emotion recognition. The model was trained using the publicly available IEMOCAP dataset, achieving an overall accuracy of 77.58% for four emotions, outperforming state-of-the-art approaches.
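As a rough illustration of the feature-level fusion step, the sketch below concatenates modality-specific embeddings and classifies them into four emotion categories. The embedding dimensions, the motion-capture feature size, and the classifier head are assumptions, and the pre-trained speech and text encoders are replaced by placeholder tensors.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Feature-level fusion: concatenate modality-specific embeddings
    (e.g. from pre-trained speech/text encoders plus motion-capture
    nonverbal-cue features) and classify into four emotion categories."""
    def __init__(self, speech_dim=768, text_dim=768, mocap_dim=189, n_classes=4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(speech_dim + text_dim + mocap_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, n_classes),
        )

    def forward(self, speech_emb, text_emb, mocap_feat):
        fused = torch.cat([speech_emb, text_emb, mocap_feat], dim=-1)
        return self.head(fused)                 # logits over the emotion classes

# Placeholder embeddings standing in for outputs of pre-trained encoders.
speech_emb = torch.randn(8, 768)
text_emb = torch.randn(8, 768)
mocap_feat = torch.randn(8, 189)
logits = FusionClassifier()(speech_emb, text_emb, mocap_feat)
print(logits.shape)  # torch.Size([8, 4])
```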
Building Ears for Robots: Machine Hearing in the Age of Autonomy
results: The study presents a preliminary software framework rooted in probabilistic robotics theory, advocating the integration of robot hearing into the broader context of perception and decision-making. It discusses various models, including Bayes filters, partially observable Markov decision processes (POMDP), and multi-agent systems, highlighting the multifaceted roles that robot hearing can play.
Abstract
This study explores the significance of robot hearing systems, emphasizing their importance for robots operating in diverse and uncertain environments. It introduces the hardware design principles using robotaxis as an example, where exterior microphone arrays are employed to detect sound events such as sirens. The challenges, goals, and test methods are discussed, focusing on achieving a suitable signal-to-noise ratio (SNR). Additionally, it presents a preliminary software framework rooted in probabilistic robotics theory, advocating for the integration of robot hearing into the broader context of perception and decision-making. It discusses various models, including Bayes filters, partially observable Markov decision processes (POMDP), and multiagent systems, highlighting the multifaceted roles that robot hearing can play. In conclusion, as service robots continue to evolve, robot hearing research will expand, offering new perspectives and challenges for future development beyond simple sound event classification.
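To make the Bayes-filter perspective concrete, the toy example below fuses noisy siren detections over time into a belief about whether a siren is present. All transition and sensor-likelihood probabilities are invented for illustration and are not taken from the paper.

```python
# A toy discrete Bayes filter over a binary state ("siren present" vs. "absent"),
# illustrating how noisy acoustic detections can be fused over time within a
# probabilistic-robotics framework. All probabilities below are assumptions.

def bayes_filter_update(belief, detected, p_stay=0.9, p_hit=0.8, p_false=0.1):
    """One predict + update step of a discrete Bayes filter.

    belief:   prior probability that a siren is present
    detected: whether the acoustic classifier reported a siren this frame
    p_stay:   probability the true state persists between frames
    p_hit:    P(detection | siren present)
    p_false:  P(detection | no siren)
    """
    # Predict: the siren may appear or disappear between frames.
    prior = p_stay * belief + (1.0 - p_stay) * (1.0 - belief)
    # Update: weight the prior by the sensor likelihood of the observation.
    like_present = p_hit if detected else (1.0 - p_hit)
    like_absent = p_false if detected else (1.0 - p_false)
    posterior = like_present * prior
    posterior /= posterior + like_absent * (1.0 - prior)
    return posterior

belief = 0.05  # initially a siren is unlikely
for detected in [False, True, True, False, True]:
    belief = bayes_filter_update(belief, detected)
    print(f"detection={detected!s:5}  P(siren)={belief:.3f}")
```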