paper_authors: Yuanbo Hou, Qiaoqiao Ren, Huizhong Zhang, Andrew Mitchell, Francesco Aletta, Jian Kang, Dick Botteldooren
for: This paper proposes an AI-based approach for automatic soundscape characterization, including sound recognition and appraisal.
methods: The proposed method uses a dual-branch convolutional neural network with cross-attention-based fusion (DCNN-CaF) to analyze sound sources and predict human-perceived annoyance.
results: The proposed method outperforms other typical AI-based models and soundscape-related traditional machine learning methods on the sound source classification and annoyance rating prediction tasks, and shows consistent results with human perception.
Abstract
Soundscape studies typically attempt to capture the perception and understanding of sonic environments by surveying users. However, for long-term monitoring or assessing interventions, sound-signal-based approaches are required. To this end, most previous research focused on psycho-acoustic quantities or automatic sound recognition. Few attempts were made to include appraisal (e.g., in circumplex frameworks). This paper proposes an artificial intelligence (AI)-based dual-branch convolutional neural network with cross-attention-based fusion (DCNN-CaF) for automatic soundscape characterization, including sound recognition and appraisal. Using the DeLTA dataset containing human-annotated sound source labels and perceived annoyance, the DCNN-CaF is proposed to perform sound source classification (SSC) and human-perceived annoyance rating prediction (ARP). Experimental findings indicate that (1) the proposed DCNN-CaF using loudness and Mel features outperforms the DCNN-CaF using only one of them. (2) The proposed DCNN-CaF with cross-attention fusion outperforms other typical AI-based models and soundscape-related traditional machine learning methods on the SSC and ARP tasks. (3) Correlation analysis reveals that the relationship between sound sources and annoyance is similar for humans and the proposed AI-based DCNN-CaF model. (4) Generalization tests show that the proposed model's ARP in the presence of model-unknown sound sources is consistent with expert expectations and can explain previous findings from the literature on soundscape augmentation.
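To make the dual-branch, cross-attention design concrete, the following is a minimal PyTorch sketch: one branch encodes a Mel spectrogram, the other a loudness representation, the two token sequences cross-attend, and the fused embedding feeds a multi-label SSC head and a scalar ARP head. All layer sizes, the number of sound classes, and the pooling choices are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of a dual-branch CNN with cross-attention fusion for joint
# sound source classification (SSC) and annoyance rating prediction (ARP).
import torch
import torch.nn as nn

class ConvBranch(nn.Module):
    def __init__(self, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, out_dim, 3, padding=1), nn.BatchNorm2d(out_dim), nn.ReLU(),
            nn.AdaptiveAvgPool2d((16, 1)),               # -> (B, out_dim, 16, 1)
        )

    def forward(self, x):                                # x: (B, 1, time, freq)
        return self.net(x).squeeze(-1).transpose(1, 2)   # (B, 16, out_dim) tokens

class DCNNCaFSketch(nn.Module):
    def __init__(self, n_classes=24, dim=128):
        super().__init__()
        self.mel_branch, self.loud_branch = ConvBranch(dim), ConvBranch(dim)
        # Each branch's tokens attend to the other branch's tokens.
        self.attn_m2l = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.attn_l2m = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.ssc_head = nn.Linear(2 * dim, n_classes)    # multi-label sound sources
        self.arp_head = nn.Linear(2 * dim, 1)            # scalar annoyance rating

    def forward(self, mel, loudness):
        m, l = self.mel_branch(mel), self.loud_branch(loudness)
        m2l, _ = self.attn_m2l(m, l, l)                  # Mel queries, loudness keys/values
        l2m, _ = self.attn_l2m(l, m, m)
        fused = torch.cat([m2l.mean(1), l2m.mean(1)], dim=-1)
        return torch.sigmoid(self.ssc_head(fused)), self.arp_head(fused)

model = DCNNCaFSketch()
ssc_probs, annoyance = model(torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64))
```

Training would typically combine a binary cross-entropy loss on the SSC output with a regression loss (e.g., MSE) on the annoyance rating.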
CREPE Notes: A new method for segmenting pitch contours into discrete notes
results: The method achieves state-of-the-art results on two challenging datasets of monophonic instrumental music, while using 97% fewer total parameters than other deep-learning-based methods.
Abstract
Tracking the fundamental frequency (f0) of a monophonic instrumental performance is effectively a solved problem with several solutions achieving 99% accuracy. However, the related task of automatic music transcription requires a further processing step to segment an f0 contour into discrete notes. This sub-task of note segmentation is necessary to enable a range of applications including musicological analysis and symbolic music generation. Building on CREPE, a state-of-the-art monophonic pitch tracking solution based on a simple neural network, we propose a simple and effective method for post-processing CREPE's output to achieve monophonic note segmentation. The proposed method demonstrates state-of-the-art results on two challenging datasets of monophonic instrumental music. Our approach also gives a 97% reduction in the total number of parameters used when compared with other deep learning based methods.
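As an illustration of what this post-processing involves, here is a simple NumPy heuristic that turns a frame-level f0/confidence contour (the kind of per-frame output CREPE produces) into discrete notes by gating on confidence and splitting at large pitch jumps. It is a sketch of the task under assumed thresholds, not the CREPE Notes algorithm itself.

```python
# Toy note segmentation from a frame-level pitch contour: gate frames by
# confidence, then split runs of voiced frames at large pitch jumps.
import numpy as np

def segment_notes(f0_hz, confidence, hop_s=0.01, conf_thresh=0.5,
                  jump_semitones=0.8, min_dur_s=0.05):
    midi = 69 + 12 * np.log2(np.maximum(f0_hz, 1e-6) / 440.0)   # Hz -> MIDI pitch
    voiced = confidence >= conf_thresh
    notes, start = [], None
    for i in range(len(midi)):
        if voiced[i] and start is None:
            start = i                                            # note onset
        elif start is not None:
            jump = abs(midi[i] - midi[i - 1]) > jump_semitones
            if not voiced[i] or jump:                            # note offset
                if (i - start) * hop_s >= min_dur_s:
                    notes.append((start * hop_s, (i - start) * hop_s,
                                  float(np.median(midi[start:i]))))
                start = i if voiced[i] else None
    if start is not None:
        notes.append((start * hop_s, (len(midi) - start) * hop_s,
                      float(np.median(midi[start:]))))
    return notes  # list of (onset_s, duration_s, midi_pitch)

notes = segment_notes(np.full(200, 440.0), np.ones(200))  # one sustained A4
```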
Multi-objective Non-intrusive Hearing-aid Speech Assessment Model
results: Experiments show that using pre-trained SSL models yields significantly improved speech quality and intelligibility predictions, with better transferability across different hearing-loss conditions.
Abstract
Without the need for a clean reference, non-intrusive speech assessment methods have attracted great attention for objective evaluations. While deep learning models have been used to develop non-intrusive speech assessment methods with promising results, there is limited research on hearing-impaired subjects. This study proposes a multi-objective non-intrusive hearing-aid speech assessment model, called HASA-Net Large, which predicts speech quality and intelligibility scores based on input speech signals and specified hearing-loss patterns. Our experiments showed that the use of pre-trained SSL models leads to a significant boost in speech quality and intelligibility predictions compared to using spectrograms as input. Additionally, we examined three distinct fine-tuning approaches that resulted in further performance improvements. Furthermore, we demonstrated that incorporating SSL models resulted in greater transferability to an out-of-domain (OOD) dataset. Finally, this study introduces HASA-Net Large as a non-intrusive approach for evaluating speech quality and intelligibility. HASA-Net Large utilizes raw waveforms and hearing-loss patterns to accurately predict speech quality and intelligibility levels for individuals with normal and impaired hearing, and demonstrates superior prediction performance and transferability.
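A minimal PyTorch sketch of the multi-objective, non-intrusive idea is shown below: frame-level features from an SSL encoder (e.g., a wav2vec 2.0/HuBERT-style model, left abstract here) are conditioned on a hearing-loss pattern such as an audiogram vector and mapped to separate quality and intelligibility scores. The dimensions, conditioning strategy, and pooling are assumptions for illustration, not HASA-Net Large's published architecture.

```python
# Sketch of a multi-objective assessment head on top of SSL features,
# conditioned on a hearing-loss (audiogram) vector.
import torch
import torch.nn as nn

class AssessmentHead(nn.Module):
    def __init__(self, ssl_dim=768, hl_dim=8, hidden=256):
        super().__init__()
        self.hl_proj = nn.Linear(hl_dim, ssl_dim)
        self.rnn = nn.GRU(ssl_dim, hidden, batch_first=True, bidirectional=True)
        self.quality_head = nn.Linear(2 * hidden, 1)
        self.intelligibility_head = nn.Linear(2 * hidden, 1)

    def forward(self, ssl_feats, hearing_loss):
        # ssl_feats: (B, T, ssl_dim) frame features from an SSL speech model
        # hearing_loss: (B, hl_dim) audiogram thresholds for the listener
        h = ssl_feats + self.hl_proj(hearing_loss).unsqueeze(1)   # condition every frame
        out, _ = self.rnn(h)
        pooled = out.mean(dim=1)                                  # utterance-level vector
        return self.quality_head(pooled), self.intelligibility_head(pooled)

head = AssessmentHead()
quality, intelligibility = head(torch.randn(4, 200, 768), torch.randn(4, 8))
```

The two heads would typically be trained jointly, e.g., by summing per-objective regression losses.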
Autoencoder with Group-based Decoder and Multi-task Optimization for Anomalous Sound Detection
results: On the DCASE 2021 Task 2 development set, the proposed method achieves relative improvements of 13.11% and 15.20% in average AUC over the official AE and MobileNetV2 baselines, respectively, across the test sets of seven machines.
Abstract
In industry, machine anomalous sound detection (ASD) is in great demand. However, collecting enough abnormal samples is difficult due to the high cost, which has driven the rapid development of unsupervised ASD algorithms. Autoencoder (AE) based methods have been widely used for unsupervised ASD, but suffer from problems including 'shortcut' learning, poor noise robustness, and sub-optimal feature quality. To address these challenges, we propose a new AE-based framework termed AEGM. Specifically, we first insert an auxiliary classifier into the AE to enhance ASD in a multi-task learning manner. Then, we design a group-based decoder structure, accompanied by an adaptive loss function, to endow the model with domain-specific knowledge. Results on the DCASE 2021 Task 2 development set show that our methods achieve relative improvements of 13.11% and 15.20% in average AUC over the official AE and MobileNetV2 baselines, respectively, across the test sets of seven machines.
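The sketch below illustrates the shared-encoder / auxiliary-classifier / group-wise-decoder idea in PyTorch: the latent code is classified into a machine group, and each group has its own decoder so that reconstruction carries domain-specific knowledge. The paper's adaptive loss is replaced here by a fixed weight, and all layer sizes and the number of groups are illustrative assumptions.

```python
# Sketch: autoencoder with an auxiliary classifier on the latent code and one
# decoder per machine group, trained with a multi-task loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupAE(nn.Module):
    def __init__(self, in_dim=640, latent=64, n_groups=7):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent))
        self.classifier = nn.Linear(latent, n_groups)             # auxiliary task
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, in_dim))
            for _ in range(n_groups))

    def forward(self, x, group_id):
        z = self.encoder(x)
        logits = self.classifier(z)
        recon = torch.stack([self.decoders[int(g)](z[i])          # group-specific decoder
                             for i, g in enumerate(group_id)])
        return recon, logits

def multitask_loss(recon, x, logits, group_id, alpha=0.1):
    return F.mse_loss(recon, x) + alpha * F.cross_entropy(logits, group_id)

model = GroupAE()
x, gid = torch.randn(8, 640), torch.randint(0, 7, (8,))
recon, logits = model(x, gid)
loss = multitask_loss(recon, x, logits, gid)
```

At test time, the per-sample reconstruction error would serve as the anomaly score, as in standard AE-based ASD.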
CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation
results: The proposed method outperforms previous work on voice conversion tasks.
Abstract
Better disentanglement of speech representations is essential to improve the quality of voice conversion. Recently, contrastive learning has been successfully applied to voice conversion based on speaker labels. However, model performance degrades when converting between similar speakers. Hence, we propose an augmented negative-sample selection scheme to address this issue. Specifically, we create hard negative samples based on the proposed speaker fusion module to improve the learning ability of the speaker encoder. Furthermore, to model speaker style at a fine granularity, we employ a reference encoder to extract fine-grained style and conduct the augmented contrastive learning on global style. The experimental results show that the proposed method outperforms previous work in voice conversion tasks.
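The following sketch shows contrastive training with an augmented hard negative: a 'fused' negative is built by interpolating the anchor speaker's embedding with another speaker's, forcing the encoder to discriminate even similar speakers. The fusion rule, temperature, and single-negative-per-anchor setup are simplifications assumed for illustration, not the paper's speaker fusion module.

```python
# Contrastive loss where one extra hard negative per anchor is synthesized by
# fusing the anchor embedding with a different speaker's embedding.
import torch
import torch.nn.functional as F

def contrastive_loss_with_fused_negative(anchor, positive, other, alpha=0.5, tau=0.1):
    # anchor, positive: (B, D) embeddings of the same speaker, different utterances
    # other: (B, D) embeddings of different speakers
    anchor, positive, other = (F.normalize(t, dim=-1) for t in (anchor, positive, other))
    fused = F.normalize(alpha * anchor + (1 - alpha) * other, dim=-1)   # hard negative
    pos = (anchor * positive).sum(-1, keepdim=True) / tau               # (B, 1)
    neg = torch.stack([(anchor * other).sum(-1),
                       (anchor * fused).sum(-1)], dim=-1) / tau         # (B, 2)
    logits = torch.cat([pos, neg], dim=-1)                              # positive at index 0
    targets = torch.zeros(anchor.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)

loss = contrastive_loss_with_fused_negative(torch.randn(16, 192),
                                            torch.randn(16, 192),
                                            torch.randn(16, 192))
```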
EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis
results: On the DCASE2023 foley sound generation benchmark, the model reaches a Fréchet audio distance (FAD) score similar to the top-ranked baseline within 10 sampling steps and achieves state-of-the-art performance with 50 steps. The authors also identify a potential concern with diffusion-based audio generation models: they tend to generate samples highly similar to the training data.
Abstract
Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate in the latent domain with cascaded phase recovery modules to reconstruct the waveform, which poses challenges when generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in the spectrogram domain under the framework of elucidated diffusion models (EDM). Combined with an efficient deterministic sampler, it achieves a Fréchet audio distance (FAD) score similar to the top-ranked baseline with only 10 steps and reaches state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also reveal a potential concern regarding diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to the training data. Project page: https://agentcooper2002.github.io/EDMSound/
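For reference, a deterministic EDM-style sampling loop looks roughly like the sketch below: a Karras-style sigma schedule combined with Heun's second-order steps over a spectrogram-shaped tensor. The `denoise` callable stands in for a trained model D(x; sigma) and is assumed, not provided; the schedule constants are common EDM defaults rather than EDMSound's exact settings.

```python
# EDM-style deterministic sampling: Karras sigma schedule + Heun (2nd-order) steps.
import torch

def edm_sigmas(n_steps=10, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    t = torch.linspace(0, 1, n_steps)
    sigmas = (sigma_max ** (1 / rho)
              + t * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return torch.cat([sigmas, torch.zeros(1)])            # append final sigma = 0

@torch.no_grad()
def heun_sample(denoise, shape, n_steps=10):
    sigmas = edm_sigmas(n_steps)
    x = torch.randn(shape) * sigmas[0]                    # start from pure noise
    for i in range(n_steps):
        s, s_next = sigmas[i], sigmas[i + 1]
        d = (x - denoise(x, s)) / s                       # probability-flow ODE slope
        x_euler = x + (s_next - s) * d
        if s_next > 0:                                    # second-order correction
            d_next = (x_euler - denoise(x_euler, s_next)) / s_next
            x = x + (s_next - s) * 0.5 * (d + d_next)
        else:
            x = x_euler
    return x                                              # generated (mel-)spectrogram

spec = heun_sample(lambda x, s: torch.zeros_like(x), (1, 1, 128, 256))  # dummy denoiser
```

A vocoder (or phase reconstruction step) would then convert the generated spectrogram back to a waveform.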
Multi-channel Conversational Speaker Separation via Neural Diarization
methods: Proposes a multi-channel speaker separation via neural diarization (SSND) framework, which uses an end-to-end diarization system to identify the speech activity of each individual speaker.
results: Evaluated on the open LibriCSS dataset, the proposed methods advance state-of-the-art diarization and ASR results by a large margin.
Abstract
When dealing with overlapped speech, the performance of automatic speech recognition (ASR) systems substantially degrades as they are designed for single-talker speech. To enhance ASR performance in conversational or meeting environments, continuous speaker separation (CSS) is commonly employed. However, CSS requires a short separation window to avoid many speakers inside the window and sequential grouping of discontinuous speech segments. To address these limitations, we introduce a new multi-channel framework called "speaker separation via neural diarization" (SSND) for meeting environments. Our approach utilizes an end-to-end diarization system to identify the speech activity of each individual speaker. By leveraging estimated speaker boundaries, we generate a sequence of embeddings, which in turn facilitate the assignment of speakers to the outputs of a multi-talker separation model. SSND addresses the permutation ambiguity issue of talker-independent speaker separation during the diarization phase through location-based training, rather than during the separation process. This unique approach allows multiple non-overlapped speakers to be assigned to the same output stream, making it possible to efficiently process long segments, a task impossible with CSS. Additionally, SSND is naturally suitable for speaker-attributed ASR. We evaluate our proposed diarization and separation methods on the open LibriCSS dataset, advancing state-of-the-art diarization and ASR results by a large margin.
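At a high level, the SSND data flow can be sketched as follows: a diarizer estimates per-speaker activity, a speaker embedding is extracted from each speaker's non-overlapped regions, and the ordered embedding sequence conditions the separation model so that each output stream is already speaker-attributed. The `diarizer`, `embed_extractor`, and `separator` callables below are assumed interfaces standing in for trained models; this illustrates the data flow only, not the paper's implementation or its location-based training.

```python
# Assumed-interface sketch of the diarize -> embed -> speaker-attributed-separate flow.
import torch

def ssnd_pipeline(mixture, diarizer, embed_extractor, separator, frame_hop=160):
    # mixture: (channels, samples) multi-channel audio
    activity = diarizer(mixture)                           # (n_speakers, n_frames), 0/1
    solo = (activity > 0) & (activity.sum(dim=0, keepdim=True) == 1)
    embeddings = []
    for spk in range(activity.size(0)):
        frames = solo[spk].nonzero().squeeze(-1).tolist()  # frames where only spk talks
        chunks = [mixture[:, f * frame_hop:(f + 1) * frame_hop] for f in frames]
        embeddings.append(embed_extractor(torch.cat(chunks, dim=-1)))
    # The ordered embedding sequence conditions the separator, so output stream k
    # corresponds to speaker k and no later permutation resolution is needed.
    return separator(mixture, torch.stack(embeddings))     # (n_speakers, samples)
```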