paper_authors: Chenyu Tang, Muzi Xu, Wentian Yi, Zibo Zhang, Edoardo Occhipinti, Chaoqun Dong, Dafydd Ravenscroft, Sung-Min Jung, Sanghyo Lee, Shuo Gao, Jong Min Kim, Luigi G. Occhipinti
results: The researchers used a computationally efficient, energy-efficient neural network model, specifically a one-dimensional convolutional neural network, to decode speech signals. The model adapts quickly to new users and new vocabulary, reaching 95.25% accuracy with only a small number of samples. This innovation demonstrates a practical, sensitive, and precise wearable SSI technology suitable for everyday communication applications.

Abstract
Our research presents a wearable Silent Speech Interface (SSI) technology that excels in device comfort, time-energy efficiency, and speech decoding accuracy for real-world use. We developed a biocompatible, durable textile choker with an embedded graphene-based strain sensor, capable of accurately detecting subtle throat movements. This sensor, surpassing other strain sensors in sensitivity by 420%, simplifies signal processing compared to traditional voice recognition methods. Our system uses a computationally efficient neural network, specifically a one-dimensional convolutional neural network with residual structures, to decode speech signals. This network is energy and time-efficient, reducing computational load by 90% while achieving 95.25% accuracy for a 20-word lexicon and swiftly adapting to new users and words with minimal samples. This innovation demonstrates a practical, sensitive, and precise wearable SSI suitable for daily communication applications.
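The decoding network described above is a one-dimensional convolutional neural network with residual structures. As a rough illustration of what a residual 1-D convolution block computes, here is a minimal NumPy sketch; it is not the authors' implementation, and the `conv1d`/`residual_block` helpers, kernel size, and channel counts are illustrative assumptions.

```python
import numpy as np

def conv1d(x, w, b):
    """Same-padded 1-D convolution (cross-correlation, as in ML convention).
    x: (channels_in, length), w: (channels_out, channels_in, k), b: (channels_out,)."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    length = x.shape[1]
    out = np.zeros((c_out, length))
    for o in range(c_out):
        for i in range(c_in):
            for t in range(length):
                out[o, t] += np.dot(w[o, i], xp[i, t:t + k])
        out[o] += b[o]
    return out

def residual_block(x, w1, b1, w2, b2):
    """Two conv layers with a skip connection: y = ReLU(conv(ReLU(conv(x))) + x)."""
    h = np.maximum(conv1d(x, w1, b1), 0.0)   # first conv + ReLU
    h = conv1d(h, w2, b2)                     # second conv
    return np.maximum(h + x, 0.0)             # add the skip path, then ReLU
```

In a classifier like the one described, several such blocks would be stacked over the strain-sensor time series, followed by pooling and a small fully connected layer over the 20-word lexicon.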
Spatial Diarization for Meeting Transcription with Ad-Hoc Acoustic Sensor Networks
paper_authors: Tobias Gburrek, Joerg Schmalenstroeer, Reinhold Haeb-Umbach

for: This paper develops the front-end of a meeting transcription system operating on signals recorded by an acoustic sensor network (ASN).

methods: The system blindly synchronizes the signals and computes time difference of arrival (TDOA) information, which is then used to estimate the speakers' activity; these activity estimates serve as the basis for initializing a spatial mixture model.

results: Experiments show that combining TDOA estimates with a spatial mixture model improves the accuracy of the meeting transcription system over a system that does not use external diarization information.

Abstract
We propose a diarization system, that estimates "who spoke when" based on spatial information, to be used as a front-end of a meeting transcription system running on the signals gathered from an acoustic sensor network (ASN). Although the spatial distribution of the microphones is advantageous, exploiting the spatial diversity for diarization and signal enhancement is challenging, because the microphones' positions are typically unknown, and the recorded signals are initially unsynchronized in general. Here, we approach these issues by first blindly synchronizing the signals and then estimating time differences of arrival (TDOAs). The TDOA information is exploited to estimate the speakers' activity, even in the presence of multiple speakers being simultaneously active. This speaker activity information serves as a guide for a spatial mixture model, on which basis the individual speaker's signals are extracted via beamforming. Finally, the extracted signals are forwarded to a speech recognizer. Additionally, a novel initialization scheme for spatial mixture models based on the TDOA estimates is proposed. Experiments conducted on real recordings from the LibriWASN data set have shown that our proposed system is advantageous compared to a system using a spatial mixture model, which does not make use of external diarization information.
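The TDOA estimation step in the pipeline above can be illustrated with GCC-PHAT (Generalized Cross-Correlation with Phase Transform), a standard method for estimating time differences of arrival between a microphone pair. This is a minimal NumPy sketch, not the authors' pipeline; the function name is an assumption and it resolves only integer-sample lags.

```python
import numpy as np

def gcc_phat(sig, ref):
    """Estimate the integer-sample delay of `sig` relative to `ref`
    via the phase of the cross-power spectrum (PHAT weighting)."""
    n = len(sig)
    SIG = np.fft.rfft(sig)
    REF = np.fft.rfft(ref)
    cross = SIG * np.conj(REF)
    # PHAT weighting: normalize magnitudes, keeping only the phase
    cross /= np.maximum(np.abs(cross), 1e-12)
    cc = np.fft.irfft(cross, n=n)
    lag = int(np.argmax(cc))
    # Map circular lags in the upper half to negative delays
    if lag > n // 2:
        lag -= n
    return lag
```

In an ad-hoc sensor network setting, such pairwise TDOA estimates (computed per time frame after blind synchronization) would indicate which speaker direction is active and can seed the initialization of a spatial mixture model, as the paper proposes.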