results: Experimental results show that the proposed method achieves satisfactory performance while protecting user privacy.
Abstract
Early-stage detection of Alzheimer's disease (AD) has been considered an important field of medical studies. Like traditional machine learning methods, speech-based automatic detection also suffers from data privacy risks because the data of specific patients are exclusive to each medical institution. A common practice is to use federated learning to protect the patients' data privacy. However, its distributed learning process also causes a reduction in performance. To alleviate this problem while protecting user privacy, we propose federated contrastive pre-training (FedCPC), performed before federated training for AD speech detection, which learns a better representation from raw data and enables different clients to share data in the pre-training and training stages. Experimental results demonstrate that the proposed method achieves satisfactory performance while preserving data privacy.
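The abstract does not spell out the pre-training objective or the aggregation rule, so the following is only a minimal sketch of what federated contrastive pre-training could look like: each client optimizes an NT-Xent (SimCLR-style) loss on augmented views of its local speech features, and a server averages the resulting weights FedAvg-style. The toy encoder, the noise-based augmentations, and the synthetic client data are placeholders, not the paper's actual components.

```python
# Hedged sketch of federated contrastive pre-training (FedAvg + an NT-Xent-style loss).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy speech-feature encoder standing in for the real representation model."""
    def __init__(self, in_dim=40, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def nt_xent(z1, z2, tau=0.1):
    """Contrastive loss between two augmented views of the same batch."""
    z = torch.cat([z1, z2], dim=0)                      # (2B, D), already L2-normalized
    sim = z @ z.t() / tau                               # cosine-similarity logits
    sim.fill_diagonal_(float("-inf"))                   # exclude self-pairs
    B = z1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

def local_pretrain(global_state, data, epochs=1, lr=1e-3):
    """One client's contrastive pre-training round, starting from the global weights."""
    model = Encoder(); model.load_state_dict(global_state)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x in data:                                  # x: (B, in_dim) local speech features
            v1 = x + 0.01 * torch.randn_like(x)         # placeholder augmentations
            v2 = x + 0.01 * torch.randn_like(x)
            loss = nt_xent(model(v1), model(v2))
            opt.zero_grad(); loss.backward(); opt.step()
    return model.state_dict()

def fed_avg(states):
    """Server-side weight averaging; raw data never leaves the clients."""
    avg = copy.deepcopy(states[0])
    for k in avg:
        avg[k] = torch.stack([s[k].float() for s in states]).mean(dim=0)
    return avg

if __name__ == "__main__":
    global_model = Encoder()
    clients = [[torch.randn(16, 40) for _ in range(5)] for _ in range(3)]  # 3 synthetic clients
    for _ in range(2):                                  # federated pre-training rounds
        states = [local_pretrain(global_model.state_dict(), c) for c in clients]
        global_model.load_state_dict(fed_avg(states))
```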
Learning-based Array Configuration-Independent Binaural Audio Telepresence with Scalable Signal Enhancement and Ambience Preservation
results: An array configuration-independent Spatial COherence REpresentation (SCORE) feature is proposed so that the network can be trained across different array geometries and sensor counts. Objective evaluation is performed with the magnitude-weighted Interaural Phase Difference error (mw-IPDe), magnitude-weighted Interaural Level Difference error (mw-ILDe), and modified Scale-Invariant Signal-to-Distortion Ratio (mSI-SDR). Subjective listening tests further confirm that the proposed BAT system delivers the desired listening experience, with a balance between signal enhancement and ambience preservation.
Abstract
Audio Telepresence (AT) aims to create an immersive experience of the audio scene at the far end for the user(s) at the near end. Applications of AT can encompass scenarios with varying degrees of emphasis on signal enhancement and ambience preservation, and it is desirable for an AT system to be scalable between these two extremes. To this end, we propose an array-based Binaural AT (BAT) system that uses DeepFilterNet as the backbone to convert the array microphone signals into Head-Related Transfer Function (HRTF)-filtered signals, with a tunable weighting between signal enhancement and ambience preservation. An array configuration-independent Spatial COherence REpresentation (SCORE) feature is proposed for model training so that the network remains robust to different array geometries and sensor counts. The magnitude-weighted Interaural Phase Difference error (mw-IPDe), magnitude-weighted Interaural Level Difference error (mw-ILDe), and modified Scale-Invariant Signal-to-Distortion Ratio (mSI-SDR) are defined as performance metrics for objective evaluation. Subjective listening tests were also performed to validate the proposed BAT system. The results show that the proposed BAT system achieves superior telepresence performance with the desired balance between signal enhancement and ambience preservation, even when the array configurations are unseen in the training phase.
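The abstract names the mw-IPDe metric without defining it, so the sketch below shows one plausible reading: the interaural phase difference error between the estimated and reference binaural signals, averaged over time-frequency bins with the reference magnitude as the weight. The STFT parameters and the exact weighting are assumptions, not the paper's definitions.

```python
# Rough sketch of a magnitude-weighted interaural phase difference error (mw-IPDe-like).
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Simple framed FFT (Hann window), returning a (frames, bins) complex matrix."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.asarray(frames), axis=-1)

def mw_ipd_error(ref_l, ref_r, est_l, est_r):
    """Phase-difference error between estimated and reference binaural signals,
    weighted per time-frequency bin by the reference magnitude (assumed form)."""
    RL, RR, EL, ER = (stft(s) for s in (ref_l, ref_r, est_l, est_r))
    ipd_ref = np.angle(RL * np.conj(RR))          # reference interaural phase difference
    ipd_est = np.angle(EL * np.conj(ER))          # estimated interaural phase difference
    err = np.angle(np.exp(1j * (ipd_est - ipd_ref)))   # wrap the error to (-pi, pi]
    w = np.abs(RL) + np.abs(RR)                   # magnitude weight
    return float(np.sum(w * np.abs(err)) / np.sum(w))

if __name__ == "__main__":
    t = np.arange(16000) / 16000.0
    ref_l = np.sin(2 * np.pi * 440 * t); ref_r = np.sin(2 * np.pi * 440 * t + 0.3)
    est_l = ref_l + 0.01 * np.random.randn(t.size); est_r = ref_r + 0.01 * np.random.randn(t.size)
    print("mw-IPDe (radians):", mw_ipd_error(ref_l, ref_r, est_l, est_r))
```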
A Distributed Algorithm for Personal Sound Zones Systems
results: Simulations with real room impulse responses measured in a hemi-anechoic chamber verify the proposed distributed PSZ system.
Abstract
A Personal Sound Zones (PSZ) system aims to generate two or more independent listening zones that allow multiple users to listen to different music/audio content in a shared space without the need for wearing headphones. Most existing studies assume that the acoustic paths between loudspeakers and microphones are measured beforehand in a stationary environment. Recently, adaptive PSZ systems have been explored to adapt the system to a time-varying acoustic environment. However, because a PSZ system usually requires multiple loudspeakers, the multichannel adaptive algorithms impose a high computational load on the processor. To overcome that problem, this paper proposes an efficient distributed algorithm for PSZ systems, which not only spreads the computational burden over multiple nodes but also reduces the overall computational complexity, at the expense of a slight decrease in performance. Simulations with real room impulse responses measured in a hemi-anechoic chamber were performed to verify the proposed distributed PSZ system.
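The abstract does not describe the distributed algorithm itself, so the snippet below only illustrates the general idea of spreading an adaptive-filtering workload over nodes: each node keeps the filter for one loudspeaker, computes its contribution locally, and updates with a shared error signal (a per-channel NLMS decomposition). It is a generic stand-in under those assumptions, not the paper's PSZ formulation.

```python
# Generic sketch: a multichannel adaptive filter split across per-loudspeaker nodes.
import numpy as np

class NodeFilter:
    """Per-loudspeaker adaptive FIR filter (NLMS update driven by a shared error)."""
    def __init__(self, taps=64, mu=0.1, eps=1e-8):
        self.w = np.zeros(taps); self.buf = np.zeros(taps)
        self.mu, self.eps = mu, eps
    def output(self, x_n):
        self.buf = np.roll(self.buf, 1); self.buf[0] = x_n   # push newest input sample
        return float(self.w @ self.buf)
    def update(self, err):
        self.w += self.mu * err * self.buf / (self.buf @ self.buf + self.eps)

rng = np.random.default_rng(0)
n_nodes, n_samples = 4, 5000
x = rng.standard_normal((n_nodes, n_samples))             # driving signal for each node
true_h = rng.standard_normal((n_nodes, 64)) * 0.1         # unknown per-channel responses
d = sum(np.convolve(x[k], true_h[k])[:n_samples] for k in range(n_nodes))  # desired signal

nodes = [NodeFilter() for _ in range(n_nodes)]
for n in range(n_samples):
    y = sum(node.output(x[k][n]) for k, node in enumerate(nodes))  # local outputs, summed
    e = d[n] - y                                                   # shared error, broadcast once
    for node in nodes:
        node.update(e)                                             # local, independent updates
print("final |error|:", abs(e))
```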
AudioLog: LLMs-Powered Long Audio Logging with Acoustic Scenes and Events Joint Estimation
results: Experiments show that the proposed system performs excellently in acoustic scene classification and sound event detection, surpassing existing methods. Further analyses show that AudioLog can effectively summarize long audio sequences.
Abstract
Previous studies in automated audio captioning have faced difficulties in accurately capturing the complete temporal details of acoustic scenes and events within long audio sequences. This paper presents AudioLog, a large language model (LLM)-powered audio logging system with multi-task learning of acoustic tasks. Specifically, we propose a joint training network, obtained by fine-tuning a large audio model based on the pre-trained hierarchical token-semantic audio Transformer. We then leverage LLMs to craft audio logs that summarize textual descriptions of the acoustic environment. Experiments show that the proposed system attains exceptional performance in acoustic scene classification and sound event detection, surpassing existing methods in the field. Further analyses demonstrate AudioLog's ability to effectively summarize long audio sequences.
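As a rough illustration of the joint-estimation part, the sketch below pairs a shared encoder with a clip-level scene-classification head and a frame-level event-detection head trained under a summed multi-task loss. The real system fine-tunes the pre-trained hierarchical token-semantic audio Transformer; the GRU encoder here is only a stand-in, and the LLM summarization stage is omitted.

```python
# Toy sketch of joint acoustic scene classification (ASC) + sound event detection (SED).
import torch
import torch.nn as nn

class JointASCSED(nn.Module):
    def __init__(self, feat_dim=64, hid=128, n_scenes=10, n_events=20):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hid, batch_first=True)   # stand-in for the audio Transformer
        self.scene_head = nn.Linear(hid, n_scenes)                # clip-level scene logits
        self.event_head = nn.Linear(hid, n_events)                # frame-level event logits
    def forward(self, x):                                         # x: (batch, frames, feat_dim)
        h, _ = self.encoder(x)
        scene_logits = self.scene_head(h.mean(dim=1))             # pool over time for the scene
        event_logits = self.event_head(h)                         # per-frame multi-label events
        return scene_logits, event_logits

model = JointASCSED()
x = torch.randn(4, 100, 64)                                       # 4 clips, 100 frames each
scene_y = torch.randint(0, 10, (4,))                              # scene labels
event_y = torch.randint(0, 2, (4, 100, 20)).float()               # frame-level event targets
scene_logits, event_logits = model(x)
loss = nn.functional.cross_entropy(scene_logits, scene_y) \
     + nn.functional.binary_cross_entropy_with_logits(event_logits, event_y)
loss.backward()                                                    # joint multi-task update
```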
Rethinking the Output Architecture for Sound Source Localization
results: Experimental results show that the proposed method achieves state-of-the-art performance, and that the WAD decoding method can break through the quantization-error limits of existing decoding methods.
Abstract
Sound source localization (SSL) involves estimating the direction of arrival (DOA) of a sound signal. The output space of DOA estimation is continuous, suggesting that regression may be the most appropriate formulation. In practice, however, converting DOA estimation into a classification problem often yields better performance than the regression formulation, since classification problems are generally easier to model and are more robust to noise and uncertainty than regression problems. In the classification formulation of DOA, the output space is discretized into several intervals, each of which is treated as a class. These classes exhibit strong inter-class correlation: their mutual similarity increases as they approach each other, and they are naturally ordered. However, these properties have not been sufficiently explored. To exploit them, we propose a soft label distribution, named the Unbiased Label Distribution (ULD), which eliminates the quantization error of the training target and takes the inter-class similarity strongly into account. We further introduce two loss functions for the soft label family, the Negative Log Absolute Error (NLAE) loss and the Mean Squared Error loss without activation (MSE(wo)). Finally, we design a new decoding method, called Weighted Adjacent Decoding (WAD), to map the predicted distribution to sound source locations: it decodes using the weighted sum of the probabilities of the peak class and its adjacent classes in the predicted distribution. Experimental results show that the proposed method achieves state-of-the-art performance, and that the WAD decoding method can even break through the quantization-error limits of existing decoding methods.
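WAD is described concretely enough to sketch: the continuous DOA is read out as a probability-weighted average of the angles of the peak class and its neighbouring classes. The neighbourhood radius of one class on each side, and the neglect of angle wrap-around at 0/360 degrees, are simplifying assumptions rather than the paper's exact settings.

```python
# Minimal sketch of Weighted Adjacent Decoding (WAD) for class-based DOA estimation.
import numpy as np

def wad_decode(probs, class_angles, radius=1):
    """Map a predicted class distribution to a continuous DOA estimate by a
    probability-weighted average over the peak class and its neighbours."""
    peak = int(np.argmax(probs))
    lo, hi = max(0, peak - radius), min(len(probs), peak + radius + 1)
    p = probs[lo:hi]
    return float(np.sum(p * class_angles[lo:hi]) / np.sum(p))

# Example: 5-degree grid; the soft mass straddles two classes around 10-15 degrees.
angles = np.arange(0.0, 360.0, 5.0)
probs = np.zeros_like(angles)
probs[1], probs[2], probs[3] = 0.05, 0.55, 0.40
print(wad_decode(probs, angles))   # about 11.75, i.e. finer than the 5-degree grid spacing
```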