cs.SD - 2023-10-03

Audio-visual child-adult speaker classification in dyadic interactions

  • paper_url: http://arxiv.org/abs/2310.01867
  • repo_url: None
  • paper_authors: Anfeng Xu, Kevin Huang, Tiantian Feng, Helen Tager-Flusberg, Shrikanth Narayanan
  • for: The goal of this paper is to improve automated child-adult speaker classification in dyadic interactions, enabling more accurate insights into children's expression and behavior across diverse conditions.
  • methods: The paper combines the audio and video modalities, using active speaker detection and visual processing models to improve the accuracy and robustness of child-adult classification; modality-specific predictions are combined by late fusion (a sketch of this step follows the entry below).
  • results: Experiments show that incorporating visual signals substantially improves classification accuracy and robustness, with relative F1 macro score improvements of 2.38% and 3.97% over the audio-only model when one face and two faces are visible, respectively.
    Abstract Interactions involving children span a wide range of important domains from learning to clinical diagnostic and therapeutic contexts. Automated analyses of such interactions are motivated by the need to seek accurate insights and offer scale and robustness across diverse and wide-ranging conditions. Identifying the speech segments belonging to the child is a critical step in such modeling. Conventional child-adult speaker classification typically relies on audio modeling approaches, overlooking visual signals that convey speech articulation information, such as lip motion. Building on the foundation of an audio-only child-adult speaker classification pipeline, we propose incorporating visual cues through active speaker detection and visual processing models. Our framework involves video pre-processing, utterance-level child-adult speaker detection, and late fusion of modality-specific predictions. We demonstrate from extensive experiments that a visually aided classification pipeline enhances the accuracy and robustness of the classification. We show relative improvements of 2.38% and 3.97% in F1 macro score when one face and two faces are visible, respectively.
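
The pipeline above fuses utterance-level predictions from an audio model and a visual model. Below is a minimal sketch of such late fusion, assuming each branch outputs per-utterance [child, adult] probabilities; the function name, the weighting parameter `alpha`, and the label order are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def late_fusion(p_audio: np.ndarray, p_video: np.ndarray,
                alpha: float = 0.5) -> np.ndarray:
    """Fuse utterance-level class probabilities from two modalities.

    p_audio, p_video: arrays of shape (n_utterances, 2) holding
    [child, adult] probabilities from each modality-specific model.
    alpha: weight on the audio branch (hypothetical parameter; the
    paper does not specify its fusion weights).
    Returns the fused class index per utterance.
    """
    fused = alpha * p_audio + (1.0 - alpha) * p_video
    return fused.argmax(axis=1)  # 0 = child, 1 = adult (assumed label order)

# Example: three utterances scored by both branches.
p_a = np.array([[0.8, 0.2], [0.4, 0.6], [0.55, 0.45]])
p_v = np.array([[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]])
labels = late_fusion(p_a, p_v)  # -> array([0, 1, 1])
```

A fixed weighted average is the simplest fusion rule; learned fusion weights or confidence-based gating are common alternatives when one modality (e.g., no visible face) is unreliable.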

Mel-Band RoFormer for Music Source Separation

  • paper_url: http://arxiv.org/abs/2310.01809
  • repo_url: None
  • paper_authors: Ju-Chiang Wang, Wei-Tsung Lu, Minz Won
  • for: This paper targets music source separation, using a multi-band spectrogram-based approach with a hierarchical Transformer and Rotary Position Embedding (RoPE) for multi-band mask estimation.
  • methods: The earlier BS-RoFormer model uses a band-split scheme, but that scheme is defined empirically, without analytic support from the literature. This paper proposes Mel-RoFormer, which uses the mel scale to map frequency bins into overlapped subbands (a sketch of such a mapping follows the entry below).
  • results: In experiments on the MUSDB18HQ dataset, Mel-RoFormer outperforms BS-RoFormer on the separation of vocals, drums, and other stems.
    Abstract Recently, multi-band spectrogram-based approaches such as Band-Split RNN (BSRNN) have demonstrated promising results for music source separation. In our recent work, we introduce the BS-RoFormer model, which inherits the idea of the band-split scheme in BSRNN at the front-end, and then uses a hierarchical Transformer with Rotary Position Embedding (RoPE) to model the inner-band and inter-band sequences for multi-band mask estimation. This model has achieved state-of-the-art performance, but the band-split scheme is defined empirically, without analytic support from the literature. In this paper, we propose Mel-RoFormer, which adopts the Mel-band scheme that maps the frequency bins into overlapped subbands according to the mel scale. In contrast, the band-split mapping in BSRNN and BS-RoFormer is non-overlapping and designed based on heuristics. Using the MUSDB18HQ dataset for experiments, we demonstrate that Mel-RoFormer outperforms BS-RoFormer in the separation tasks of vocals, drums, and other stems.
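
To make the mel-band idea concrete, here is a minimal sketch of mapping FFT frequency bins into overlapped subbands spaced on the mel scale. The band count, FFT size, and the overlap rule (band i spanning mel edges i..i+2, as in a standard triangular mel filterbank) are illustrative assumptions; the exact band layout in Mel-RoFormer may differ.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_indices(n_bands: int, n_fft: int, sr: int):
    """Map FFT bins to overlapped subbands spaced on the mel scale.

    Returns a list of half-open (start_bin, end_bin) ranges; adjacent
    bands share bins because band i spans mel edges i..i+2, mirroring
    a triangular mel filterbank layout. Unlike the non-overlapping,
    heuristic band-split in BSRNN/BS-RoFormer, these bands overlap.
    """
    n_bins = n_fft // 2 + 1
    edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_bands + 2)
    edges_hz = mel_to_hz(edges_mel)
    edges_bin = np.floor(edges_hz / (sr / 2) * (n_bins - 1)).astype(int)
    return [(edges_bin[i], edges_bin[i + 2] + 1) for i in range(n_bands)]

# Example: 60 overlapped bands for a 2048-point FFT at 44.1 kHz
# (hypothetical settings, not the paper's configuration).
bands = mel_band_indices(n_bands=60, n_fft=2048, sr=44100)
```

Because the mel scale is logarithmic in frequency, this layout allocates narrow bands at low frequencies and wide bands at high frequencies, which is the analytic motivation the digest entry contrasts with the empirical band-split scheme.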