results: Experimental results show that the proposed method achieves satisfactory performance while protecting user privacy.
Abstract
Early-stage detection of Alzheimer's disease (AD) has been considered an important field of medical studies. Like traditional machine learning methods, speech-based automatic detection also suffers from data privacy risks because the data of specific patients are exclusive to each medical institution. A common practice is to use federated learning to protect the patients' data privacy. However, its distributed learning process also causes a reduction in performance. To alleviate this problem while protecting user privacy, we propose federated contrastive pre-training (FedCPC), performed before federated training for AD speech detection, which learns a better representation from raw data and enables different clients to share data in the pre-training and training stages. Experimental results demonstrate that the proposed method achieves satisfactory performance while preserving data privacy.
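The abstract does not spell out the pre-training objective or the aggregation rule, so the following is only a minimal sketch of what federated contrastive pre-training could look like: each client optimizes an NT-Xent (SimCLR-style) loss on augmented views of its local speech features, and a server averages the resulting weights FedAvg-style. The toy encoder, the noise-based augmentations, and the synthetic client data are placeholders, not the paper's actual components.

```python
# Hedged sketch of federated contrastive pre-training (FedAvg + an NT-Xent-style loss).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy speech-feature encoder standing in for the real representation model."""
    def __init__(self, in_dim=40, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def nt_xent(z1, z2, tau=0.1):
    """Contrastive loss between two augmented views of the same batch."""
    z = torch.cat([z1, z2], dim=0)                      # (2B, D), already L2-normalized
    sim = z @ z.t() / tau                               # cosine-similarity logits
    sim.fill_diagonal_(float("-inf"))                   # exclude self-pairs
    B = z1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

def local_pretrain(global_state, data, epochs=1, lr=1e-3):
    """One client's contrastive pre-training round, starting from the global weights."""
    model = Encoder(); model.load_state_dict(global_state)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x in data:                                  # x: (B, in_dim) local speech features
            v1 = x + 0.01 * torch.randn_like(x)         # placeholder augmentations
            v2 = x + 0.01 * torch.randn_like(x)
            loss = nt_xent(model(v1), model(v2))
            opt.zero_grad(); loss.backward(); opt.step()
    return model.state_dict()

def fed_avg(states):
    """Server-side weight averaging; raw data never leaves the clients."""
    avg = copy.deepcopy(states[0])
    for k in avg:
        avg[k] = torch.stack([s[k].float() for s in states]).mean(dim=0)
    return avg

if __name__ == "__main__":
    global_model = Encoder()
    clients = [[torch.randn(16, 40) for _ in range(5)] for _ in range(3)]  # 3 synthetic clients
    for _ in range(2):                                  # federated pre-training rounds
        states = [local_pretrain(global_model.state_dict(), c) for c in clients]
        global_model.load_state_dict(fed_avg(states))
```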
Learning-based Array Configuration-Independent Binaural Audio Telepresence with Scalable Signal Enhancement and Ambience Preservation
results: An array configuration-independent Spatial COherence REpresentation (SCORE) feature is proposed so that the network can be trained across different array geometries and sensor counts. Objective evaluation is performed with the magnitude-weighted Interaural Phase Difference error (mw-IPDe), magnitude-weighted Interaural Level Difference error (mw-ILDe), and modified Scale-Invariant Signal-to-Distortion Ratio (mSI-SDR). Subjective listening tests further confirm that the proposed BAT system delivers the desired listening experience, with a balance between signal enhancement and ambience preservation.
Abstract
Audio Telepresence (AT) aims to create an immersive experience of the audio scene at the far end for the user(s) at the near end. Applications of AT can encompass scenarios with varying degrees of emphasis on signal enhancement and ambience preservation, and it is desirable for an AT system to be scalable between these two extremes. To this end, we propose an array-based Binaural AT (BAT) system that uses DeepFilterNet as the backbone to convert the array microphone signals into Head-Related Transfer Function (HRTF)-filtered signals, with a tunable weighting between signal enhancement and ambience preservation. An array configuration-independent Spatial COherence REpresentation (SCORE) feature is proposed for model training so that the network remains robust to different array geometries and sensor counts. The magnitude-weighted Interaural Phase Difference error (mw-IPDe), magnitude-weighted Interaural Level Difference error (mw-ILDe), and modified Scale-Invariant Signal-to-Distortion Ratio (mSI-SDR) are defined as performance metrics for objective evaluation. Subjective listening tests were also performed to validate the proposed BAT system. The results show that the proposed BAT system achieves superior telepresence performance with the desired balance between signal enhancement and ambience preservation, even when the array configurations are unseen in the training phase.
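The abstract names the mw-IPDe metric without defining it, so the sketch below shows one plausible reading: the interaural phase difference error between the estimated and reference binaural signals, averaged over time-frequency bins with the reference magnitude as the weight. The STFT parameters and the exact weighting are assumptions, not the paper's definitions.

```python
# Rough sketch of a magnitude-weighted interaural phase difference error (mw-IPDe-like).
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Simple framed FFT (Hann window), returning a (frames, bins) complex matrix."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.asarray(frames), axis=-1)

def mw_ipd_error(ref_l, ref_r, est_l, est_r):
    """Phase-difference error between estimated and reference binaural signals,
    weighted per time-frequency bin by the reference magnitude (assumed form)."""
    RL, RR, EL, ER = (stft(s) for s in (ref_l, ref_r, est_l, est_r))
    ipd_ref = np.angle(RL * np.conj(RR))          # reference interaural phase difference
    ipd_est = np.angle(EL * np.conj(ER))          # estimated interaural phase difference
    err = np.angle(np.exp(1j * (ipd_est - ipd_ref)))   # wrap the error to (-pi, pi]
    w = np.abs(RL) + np.abs(RR)                   # magnitude weight
    return float(np.sum(w * np.abs(err)) / np.sum(w))

if __name__ == "__main__":
    t = np.arange(16000) / 16000.0
    ref_l = np.sin(2 * np.pi * 440 * t); ref_r = np.sin(2 * np.pi * 440 * t + 0.3)
    est_l = ref_l + 0.01 * np.random.randn(t.size); est_r = ref_r + 0.01 * np.random.randn(t.size)
    print("mw-IPDe (radians):", mw_ipd_error(ref_l, ref_r, est_l, est_r))
```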
A Distributed Algorithm for Personal Sound Zones Systems
results: Simulations with real room impulse responses measured in a hemi-anechoic chamber verify the proposed distributed PSZ system.
Abstract
A Personal Sound Zones (PSZ) system aims to generate two or more independent listening zones that allow multiple users to listen to different music/audio content in a shared space without the need for wearing headphones. Most existing studies assume that the acoustic paths between loudspeakers and microphones are measured beforehand in a stationary environment. Recently, adaptive PSZ systems have been explored to adapt the system to a time-varying acoustic environment. However, because a PSZ system usually requires multiple loudspeakers, the multichannel adaptive algorithms impose a high computational load on the processor. To overcome that problem, this paper proposes an efficient distributed algorithm for PSZ systems, which not only spreads the computational burden over multiple nodes but also reduces the overall computational complexity, at the expense of a slight decrease in performance. Simulations with real room impulse responses measured in a hemi-anechoic chamber were performed to verify the proposed distributed PSZ system.
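The abstract does not describe the distributed algorithm itself, so the snippet below only illustrates the general idea of spreading an adaptive-filtering workload over nodes: each node keeps the filter for one loudspeaker, computes its contribution locally, and updates with a shared error signal (a per-channel NLMS decomposition). It is a generic stand-in under those assumptions, not the paper's PSZ formulation.

```python
# Generic sketch: a multichannel adaptive filter split across per-loudspeaker nodes.
import numpy as np

class NodeFilter:
    """Per-loudspeaker adaptive FIR filter (NLMS update driven by a shared error)."""
    def __init__(self, taps=64, mu=0.1, eps=1e-8):
        self.w = np.zeros(taps); self.buf = np.zeros(taps)
        self.mu, self.eps = mu, eps
    def output(self, x_n):
        self.buf = np.roll(self.buf, 1); self.buf[0] = x_n   # push newest input sample
        return float(self.w @ self.buf)
    def update(self, err):
        self.w += self.mu * err * self.buf / (self.buf @ self.buf + self.eps)

rng = np.random.default_rng(0)
n_nodes, n_samples = 4, 5000
x = rng.standard_normal((n_nodes, n_samples))             # driving signal for each node
true_h = rng.standard_normal((n_nodes, 64)) * 0.1         # unknown per-channel responses
d = sum(np.convolve(x[k], true_h[k])[:n_samples] for k in range(n_nodes))  # desired signal

nodes = [NodeFilter() for _ in range(n_nodes)]
for n in range(n_samples):
    y = sum(node.output(x[k][n]) for k, node in enumerate(nodes))  # local outputs, summed
    e = d[n] - y                                                   # shared error, broadcast once
    for node in nodes:
        node.update(e)                                             # local, independent updates
print("final |error|:", abs(e))
```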
AudioLog: LLMs-Powered Long Audio Logging with Acoustic Scenes and Events Joint Estimation
results: Experiments show that the proposed system performs excellently in acoustic scene classification and sound event detection, surpassing existing methods. Further analyses show that AudioLog can effectively summarize long audio sequences.
Abstract
Previous studies in automated audio captioning have faced difficulties in accurately capturing the complete temporal details of acoustic scenes and events within long audio sequences. This paper presents AudioLog, a large language model (LLM)-powered audio logging system with multi-task learning of acoustic tasks. Specifically, we propose a joint training network, obtained by fine-tuning a large audio model based on the pre-trained hierarchical token-semantic audio Transformer. We then leverage LLMs to craft audio logs that summarize textual descriptions of the acoustic environment. Experiments show that the proposed system attains exceptional performance in acoustic scene classification and sound event detection, surpassing existing methods in the field. Further analyses demonstrate AudioLog's ability to effectively summarize long audio sequences.
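As a rough illustration of the joint-estimation part, the sketch below pairs a shared encoder with a clip-level scene-classification head and a frame-level event-detection head trained under a summed multi-task loss. The real system fine-tunes the pre-trained hierarchical token-semantic audio Transformer; the GRU encoder here is only a stand-in, and the LLM summarization stage is omitted.

```python
# Toy sketch of joint acoustic scene classification (ASC) + sound event detection (SED).
import torch
import torch.nn as nn

class JointASCSED(nn.Module):
    def __init__(self, feat_dim=64, hid=128, n_scenes=10, n_events=20):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hid, batch_first=True)   # stand-in for the audio Transformer
        self.scene_head = nn.Linear(hid, n_scenes)                # clip-level scene logits
        self.event_head = nn.Linear(hid, n_events)                # frame-level event logits
    def forward(self, x):                                         # x: (batch, frames, feat_dim)
        h, _ = self.encoder(x)
        scene_logits = self.scene_head(h.mean(dim=1))             # pool over time for the scene
        event_logits = self.event_head(h)                         # per-frame multi-label events
        return scene_logits, event_logits

model = JointASCSED()
x = torch.randn(4, 100, 64)                                       # 4 clips, 100 frames each
scene_y = torch.randint(0, 10, (4,))                              # scene labels
event_y = torch.randint(0, 2, (4, 100, 20)).float()               # frame-level event targets
scene_logits, event_logits = model(x)
loss = nn.functional.cross_entropy(scene_logits, scene_y) \
     + nn.functional.binary_cross_entropy_with_logits(event_logits, event_y)
loss.backward()                                                    # joint multi-task update
```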
Rethinking the Output Architecture for Sound Source Localization
results: Experimental results show that the proposed method achieves state-of-the-art performance, and that the WAD decoding method can break through the quantization-error limits of existing decoding methods.
Abstract
Sound source localization (SSL) involves estimating the direction of arrival (DOA) of a sound signal. The output space of DOA estimation is continuous, suggesting that regression may be the most appropriate formulation. In practice, however, converting DOA estimation into a classification problem often yields better performance than the regression formulation, since classification problems are generally easier to model and are more robust to noise and uncertainty than regression problems. In the classification formulation of DOA, the output space is discretized into several intervals, each of which is treated as a class. These classes exhibit strong inter-class correlation: their mutual similarity increases as they approach each other, and they are naturally ordered. However, these properties have not been sufficiently explored. To exploit them, we propose a soft label distribution, named the Unbiased Label Distribution (ULD), which eliminates the quantization error of the training target and takes the inter-class similarity strongly into account. We further introduce two loss functions for the soft label family, the Negative Log Absolute Error (NLAE) loss and the Mean Squared Error loss without activation (MSE(wo)). Finally, we design a new decoding method, called Weighted Adjacent Decoding (WAD), to map the predicted distribution to sound source locations: it decodes using the weighted sum of the probabilities of the peak class and its adjacent classes in the predicted distribution. Experimental results show that the proposed method achieves state-of-the-art performance, and that the WAD decoding method can even break through the quantization-error limits of existing decoding methods.
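WAD is described concretely enough to sketch: the continuous DOA is read out as a probability-weighted average of the angles of the peak class and its neighbouring classes. The neighbourhood radius of one class on each side, and the neglect of angle wrap-around at 0/360 degrees, are simplifying assumptions rather than the paper's exact settings.

```python
# Minimal sketch of Weighted Adjacent Decoding (WAD) for class-based DOA estimation.
import numpy as np

def wad_decode(probs, class_angles, radius=1):
    """Map a predicted class distribution to a continuous DOA estimate by a
    probability-weighted average over the peak class and its neighbours."""
    peak = int(np.argmax(probs))
    lo, hi = max(0, peak - radius), min(len(probs), peak + radius + 1)
    p = probs[lo:hi]
    return float(np.sum(p * class_angles[lo:hi]) / np.sum(p))

# Example: 5-degree grid; the soft mass straddles two classes around 10-15 degrees.
angles = np.arange(0.0, 360.0, 5.0)
probs = np.zeros_like(angles)
probs[1], probs[2], probs[3] = 0.05, 0.55, 0.40
print(wad_decode(probs, angles))   # about 11.75, i.e. finer than the 5-degree grid spacing
```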