cs.SD - 2023-10-10

Modeling of Speech-dependent Own Voice Transfer Characteristics for Hearables with In-ear Microphones

  • paper_url: http://arxiv.org/abs/2310.06554
  • repo_url: None
  • paper_authors: Mattes Ohlenbusch, Christian Rollwage, Simon Doclo
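  • for: This paper studies the transfer characteristics of the user's own voice in hearables with in-ear microphones.
  • methods: The paper proposes a speech-dependent system identification model based on phoneme recognition to estimate the own voice transfer characteristics between hearable microphones.
  • results: The proposed speech-dependent model simulates in-ear recordings better than a speech-independent model and generalizes better to new utterances than an adaptive filtering-based model. In addition, talker-averaged models generalize better to different talkers than individual models.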
    Abstract Hearables often contain an in-ear microphone, which may be used to capture the own voice of its user. However, due to ear canal occlusion the in-ear microphone mostly records body-conducted speech, which suffers from band-limitation effects and is subject to amplification of low frequency content. These transfer characteristics are assumed to vary both based on speech content and between individual talkers. It is desirable to have an accurate model of the own voice transfer characteristics between hearable microphones. Such a model can be used, e.g., to simulate a large amount of in-ear recordings to train supervised learning-based algorithms aiming at compensating own voice transfer characteristics. In this paper we propose a speech-dependent system identification model based on phoneme recognition. Using recordings from a prototype hearable, the modeling accuracy is evaluated in terms of technical measures. We investigate robustness of transfer characteristic models to utterance or talker mismatch. Simulation results show that using the proposed speech-dependent model is preferable for simulating in-ear recordings compared to a speech-independent model. The proposed model is able to generalize better to new utterances than an adaptive filtering-based model. Additionally, we find that talker-averaged models generalize better to different talkers than individual models.
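    Summary Hearables often contain an in-ear microphone that can capture the user's own voice. Because the ear canal is occluded, this microphone mostly records body-conducted speech, which is band-limited and has amplified low-frequency content. These transfer characteristics vary with speech content and between individual talkers. An accurate model of them can be used, for example, to simulate large amounts of in-ear recordings for training supervised learning-based algorithms that compensate for own voice transfer characteristics. The paper proposes a speech-dependent system identification model based on phoneme recognition. Modeling accuracy is evaluated with technical measures on recordings from a prototype hearable, and robustness to utterance or talker mismatch is investigated. The proposed speech-dependent model is preferable to a speech-independent model for simulating in-ear recordings and generalizes better to new utterances than an adaptive filtering-based model. Talker-averaged models generalize better to different talkers than individual models.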
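The idea of a speech-dependent transfer model can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that frame-wise class labels (for example, phoneme indices from some recognizer) are already available and fits one least-squares transfer function per class and frequency bin between the short-time spectra of an outer microphone and an in-ear microphone. The function names, the array shapes, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def estimate_class_transfer_functions(X, Y, labels, n_classes, eps=1e-12):
    """Least-squares transfer function per class and frequency bin.

    X, Y   : complex STFTs of shape (n_frames, n_bins); X is the outer
             microphone and Y the in-ear microphone (assumed layout).
    labels : integer class index per frame, shape (n_frames,).
    Returns H of shape (n_classes, n_bins) such that Y ~ H[labels] * X.
    """
    H = np.zeros((n_classes, X.shape[1]), dtype=complex)
    for c in range(n_classes):
        sel = labels == c
        if not np.any(sel):
            continue  # class never observed: leave its filter at zero
        cross = np.sum(np.conj(X[sel]) * Y[sel], axis=0)  # cross-spectrum
        power = np.sum(np.abs(X[sel]) ** 2, axis=0)       # input power
        H[c] = cross / (power + eps)
    return H

def simulate_in_ear(X, labels, H):
    """Apply the class-dependent filter of each frame to the outer-mic STFT."""
    return H[labels] * X

# Synthetic check: recover known per-class filters from noise-free data.
rng = np.random.default_rng(0)
n_frames, n_bins, n_classes = 200, 8, 3
X = rng.standard_normal((n_frames, n_bins)) + 1j * rng.standard_normal((n_frames, n_bins))
labels = rng.integers(0, n_classes, size=n_frames)
H_true = rng.standard_normal((n_classes, n_bins)) + 1j * rng.standard_normal((n_classes, n_bins))
Y = H_true[labels] * X

H_est = estimate_class_transfer_functions(X, Y, labels, n_classes)
Y_sim = simulate_in_ear(X, labels, H_est)
```

A speech-independent baseline in this sketch would correspond to setting all labels to the same class, which yields a single averaged filter.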

Topological data analysis of human vowels: Persistent homologies across representation spaces

  • paper_url: http://arxiv.org/abs/2310.06508
  • repo_url: None
  • paper_authors: Guillem Bonafos, Jean-Marc Freyermuth, Pierre Pudlo, Samuel Tronçon, Arnaud Rey
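  • for: This paper studies how the choice of signal representation affects the topological signatures extracted from it, and how useful those signatures are for prediction and classification.
  • methods: The paper uses persistent homology theory and topological data analysis (TDA), together with machine learning algorithms such as random forests.
  • results: Different representation spaces yield different topological features, and these features are complementary. Adding topological features improves prediction and classification accuracy.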
    Abstract Topological Data Analysis (TDA) has been successfully used for various tasks in signal/image processing, from visualization to supervised/unsupervised classification. Often, topological characteristics are obtained from persistent homology theory. The standard TDA pipeline starts from the raw signal data or a representation of it. Then, it consists of building a multiscale topological structure on top of the data using a pre-specified filtration, and finally computing the topological signature to be further exploited. The commonly used topological signature is a persistence diagram (or transformations of it). Current research discusses the consequences of the many ways to exploit topological signatures, much less often the choice of the filtration, but to the best of our knowledge, the choice of the representation of a signal has not been the subject of any study yet. This paper attempts to provide some answers on the latter problem. To this end, we collected real audio data and built a comparative study to assess the quality of the discriminant information of the topological signatures extracted from three different representation spaces. Each audio signal is represented as i) an embedding of observed data in a higher dimensional space using Takens' representation, ii) a spectrogram viewed as a surface in a 3D ambient space, iii) the set of the spectrogram's zeros. From vowel audio recordings, we use topological signatures for three prediction problems: speaker gender, vowel type, and individual. We show that a topologically-augmented random forest improves the Out-of-Bag (OOB) error over one based solely on Mel-Frequency Cepstral Coefficients (MFCC) for the last two problems. Our results also suggest that the topological information extracted from different signal representations is complementary, and that the spectrogram's zeros offer the best improvement for gender prediction.
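    Summary Topological Data Analysis (TDA) has been applied to many signal and image processing tasks, from visualization to supervised and unsupervised classification, usually with topological features derived from persistent homology. The standard pipeline starts from the raw signal or a representation of it, builds a multiscale topological structure using a pre-specified filtration, and computes a topological signature, typically a persistence diagram or a transformation of it. Existing work mostly discusses how to exploit the signature and, less often, how to choose the filtration; the choice of signal representation has not yet been studied. This paper addresses that question with a comparative study on real audio data, assessing the discriminant information of topological signatures extracted from three representation spaces: 1. an embedding of the data in a higher dimensional space using Takens' representation; 2. the spectrogram viewed as a surface in a 3D ambient space; 3. the set of the spectrogram's zeros. On vowel recordings, the signatures are used for three prediction problems: speaker gender, vowel type, and individual. A topologically-augmented random forest improves the Out-of-Bag (OOB) error over MFCC features alone for the last two problems. The results also suggest that topological information from different representations is complementary, and that the spectrogram's zeros give the best improvement for gender prediction.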
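The first of the three representations, a delay embedding in the sense of Takens, can be sketched in a few lines. This is an illustrative sketch rather than the paper's code: the function name `delay_embedding` and the choices of `dim` and `delay` are assumptions, and the resulting point cloud would still have to be passed to a persistent homology library to obtain a persistence diagram.

```python
import numpy as np

def delay_embedding(signal, dim, delay):
    """Map a 1-D signal to points (x[t], x[t+delay], ..., x[t+(dim-1)*delay])."""
    signal = np.asarray(signal, dtype=float)
    n_points = len(signal) - (dim - 1) * delay
    if n_points <= 0:
        raise ValueError("signal too short for the requested dim and delay")
    # Row t of idx holds the sample indices that form the t-th embedded point.
    idx = np.arange(n_points)[:, None] + delay * np.arange(dim)[None, :]
    return signal[idx]

# Tiny example with easily checked values.
cloud = delay_embedding(np.arange(6.0), dim=3, delay=2)
# cloud is [[0., 2., 4.], [1., 3., 5.]]

# A sine with a period of 100 samples, embedded in 2-D with a quarter-period
# delay, traces the unit circle: (sin t, sin(t + pi/2)) = (sin t, cos t).
t = 2 * np.pi * np.arange(200) / 100
circle = delay_embedding(np.sin(t), dim=2, delay=25)
radii = np.linalg.norm(circle, axis=1)
```

The circle example shows why such embeddings suit topological analysis: a periodic signal produces a loop in the embedded space, which is the kind of feature that one-dimensional persistent homology detects.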

Cross-modal Cognitive Consensus guided Audio-Visual Segmentation

  • paper_url: http://arxiv.org/abs/2310.06259
  • repo_url: None
  • paper_authors: Zhaofeng Shi, Qingbo Wu, Hongliang Li, Fanman Meng, Linfeng Xu
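  • for: This paper proposes an Audio-Visual Segmentation (AVS) method for extracting the sounding object from video frames.
  • methods: Whereas pioneering work relies on dense feature-level audio-visual interaction, which ignores the dimension gap between modalities, this paper uses a Cross-modal Cognitive Consensus guided Network (C3N) that aligns audio-visual semantics at the global dimension and progressively injects them into local regions via an attention mechanism.
  • results: Experiments on the Single Sound Source Segmentation (S4) and Multiple Sound Source Segmentation (MS3) settings show that the method achieves state-of-the-art performance.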
    Abstract Audio-Visual Segmentation (AVS) aims to extract the sounding object from a video frame, which is represented by a pixel-wise segmentation mask. The pioneering work conducts this task through dense feature-level audio-visual interaction, which ignores the dimension gap between different modalities. More specifically, the audio clip could only provide a Global semantic label in each sequence, but the video frame covers multiple semantic objects across different Local regions. In this paper, we propose a Cross-modal Cognitive Consensus guided Network (C3N) to align the audio-visual semantics from the global dimension and progressively inject them into the local regions via an attention mechanism. Firstly, a Cross-modal Cognitive Consensus Inference Module (C3IM) is developed to extract a unified-modal label by integrating audio/visual classification confidence and similarities of modality-specific label embeddings. Then, we feed the unified-modal label back to the visual backbone as the explicit semantic-level guidance via a Cognitive Consensus guided Attention Module (CCAM), which highlights the local features corresponding to the interested object. Extensive experiments on the Single Sound Source Segmentation (S4) setting and Multiple Sound Source Segmentation (MS3) setting of the AVSBench dataset demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance.
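    Summary Audio-Visual Segmentation (AVS) aims to extract the sounding object from a video frame as a pixel-wise segmentation mask. An audio clip provides only a single global semantic label per sequence, whereas a video frame contains multiple semantic objects in different local regions. The paper proposes a Cross-modal Cognitive Consensus guided Network (C3N) that aligns audio-visual semantics at the global dimension and injects them into local regions. First, a Cross-modal Cognitive Consensus Inference Module (C3IM) extracts a unified-modal label by combining audio and visual classification confidence with the similarity of modality-specific label embeddings. This label is then fed back to the visual backbone through a Cognitive Consensus guided Attention Module (CCAM), which highlights the corresponding local features. Extensive experiments on the Single Sound Source Segmentation (S4) and Multiple Sound Source Segmentation (MS3) settings of the AVSBench dataset demonstrate the effectiveness of the method, which achieves state-of-the-art performance.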
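The idea of combining classification confidence with label-embedding similarity can be illustrated with a toy scoring rule. This is a hypothetical sketch, not the C3IM module from the paper: the function `unified_label`, the multiplicative scoring rule, the label sets, and the two-dimensional embeddings are all assumptions chosen to make the behavior easy to follow.

```python
import numpy as np

def unified_label(audio_conf, visual_conf, audio_emb, visual_emb):
    """Pick the (audio label, visual label) pair with the highest joint score.

    Hypothetical rule: score[i, j] = audio_conf[i] * visual_conf[j] * cos_sim[i, j],
    where cos_sim compares the embeddings of audio label i and visual label j.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    similarity = a @ v.T  # shape (n_audio_labels, n_visual_labels)
    scores = audio_conf[:, None] * visual_conf[None, :] * similarity
    a_idx, v_idx = np.unravel_index(np.argmax(scores), scores.shape)
    return int(a_idx), int(v_idx), scores

# Toy label sets and embeddings (illustrative values only).
audio_labels = ["dog bark", "engine"]
visual_labels = ["dog", "car", "person"]
audio_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
visual_emb = np.array([[1.0, 0.1], [0.1, 1.0], [0.5, 0.5]])

audio_conf = np.array([0.7, 0.3])        # audio classifier favors "dog bark"
visual_conf = np.array([0.3, 0.3, 0.4])  # visual classifier alone favors "person"

a_idx, v_idx, scores = unified_label(audio_conf, visual_conf, audio_emb, visual_emb)
chosen = (audio_labels[a_idx], visual_labels[v_idx])
```

In this toy case the visual classifier on its own would select "person", but the agreement between the "dog bark" audio label and the "dog" visual label gives that pair the highest joint score, which is the kind of cross-modal consensus the abstract describes.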