eess.AS - 2023-11-30

Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities

  • paper_url: http://arxiv.org/abs/2312.00249
  • repo_url: https://github.com/jinhualiang/apt
  • paper_authors: Jinhua Liang, Xubo Liu, Wenwu Wang, Mark D. Plumbley, Huy Phan, Emmanouil Benetos
  • for: This work aims to extend large language models (LLMs) and visual language models (VLMs) to the audio domain, improving both audio and language understanding.
  • methods: The work introduces a new adapter, Acoustic Prompt Tuning (APT), which extends LLMs and VLMs to the audio domain by soft prompting. APT uses an instruction-aware audio aligner to generate soft prompts that serve as language-model inputs. A multi-task learning strategy is also proposed to mitigate the scarcity of audio data.
  • results: Experiments show that the APT-enhanced LLM (APT-LLM) achieves results competitive with expert models (i.e., networks trained on the target datasets) across a variety of tasks. APT can also extend frozen VLMs to the audio domain, achieving promising results on the audio-visual question answering task.
    Abstract The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing large language models (LLMs) and visual language models (VLMs) have shown their promise in solving a wide variety of vision and language understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capacity. In this work, we introduce Acoustic Prompt Tuning (APT), a new adapter extending LLMs and VLMs to the audio domain by soft prompting only. Specifically, APT applies an instruction-aware audio aligner to generate soft prompts, conditioned on both input text and sounds, as language model inputs. To mitigate the data scarcity in the audio domain, a multi-task learning strategy is proposed by formulating diverse audio tasks in a sequence-to-sequence manner. Moreover, we improve the framework of audio language models by using interleaved audio-text embeddings as the input sequence. This improved framework imposes zero constraints on the input format and thus is capable of tackling more understanding tasks, such as few-shot audio classification and audio reasoning. To further evaluate the reasoning ability of audio networks, we propose natural language audio reasoning (NLAR), a new task that analyses across two audio clips by comparison and summarization. Experiments show that APT-enhanced LLMs (namely APT-LLMs) achieve competitive results compared to the expert models (i.e., the networks trained on the targeted datasets) across various tasks. We finally demonstrate APT's ability to extend frozen VLMs to the audio domain without finetuning, achieving promising results in the audio-visual question answering task. Our code and model weights are released at https://github.com/JinhuaLiang/APT.
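    A minimal sketch (PyTorch; the module structure, dimensions, and names are assumptions for illustration, not the released APT code) of the prompting scheme the abstract describes: an instruction-aware aligner turns audio-encoder features into a handful of soft prompt embeddings, which are then interleaved with text-token embeddings before the frozen language model.

        import torch
        import torch.nn as nn

        class InstructionAwareAligner(nn.Module):
            """Maps audio features to soft prompts, conditioned on the instruction text."""
            def __init__(self, feat_dim=768, llm_dim=4096, num_prompts=8):
                super().__init__()
                self.queries = nn.Parameter(torch.randn(num_prompts, feat_dim))
                self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
                self.proj = nn.Linear(feat_dim, llm_dim)

            def forward(self, audio_feats, instr_feats):
                # Condition on both sounds and input text by attending over their concatenation.
                kv = torch.cat([audio_feats, instr_feats], dim=1)  # (B, T_audio + T_text, feat_dim)
                q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
                prompts, _ = self.attn(q, kv, kv)
                return self.proj(prompts)  # (B, num_prompts, llm_dim) soft prompts

        def interleave(prefix_embeds, audio_prompts, suffix_embeds):
            """Splice soft audio prompts between text-token embeddings."""
            return torch.cat([prefix_embeds, audio_prompts, suffix_embeds], dim=1)

    Because each clip becomes ordinary embeddings in the input sequence, any number of clips and text segments can be interleaved, which is what lets the same model handle few-shot audio classification and the two-clip NLAR task.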

Learning domain-invariant classifiers for infant cry sounds

  • paper_url: http://arxiv.org/abs/2312.00231
  • repo_url: None
  • paper_authors: Charles C. Onu, Hemanth K. Sheetha, Arsenii Gorin, Doina Precup
  • for: This work addresses domain shift in real-world data, specifically in a clinical database of infant cry sounds.
  • methods: The study adapts unsupervised domain adaptation methods borrowed from computer vision to learn representations that are invariant to domain shift, and proposes a new method, target noise injection (TNI), which requires neither labels nor training data from the target domain.
  • results: With these methods, the model improves target-domain accuracy by 7.2% without negatively affecting the source domain.
    Abstract The issue of domain shift remains a problematic phenomenon in most real-world datasets and clinical audio is no exception. In this work, we study the nature of domain shift in a clinical database of infant cry sounds acquired across different geographies. We find that though the pitches of infant cries are similarly distributed regardless of the place of birth, other characteristics introduce peculiar biases into the data. We explore methodologies for mitigating the impact of domain shift in a model for identifying neurological injury from cry sounds. We adapt unsupervised domain adaptation methods from computer vision which learn an audio representation that is domain-invariant to hospitals and is task discriminative. We also propose a new approach, target noise injection (TNI), for unsupervised domain adaptation which requires neither labels nor training data from the target domain. Our best-performing model significantly improves target accuracy by 7.2%, without negatively affecting the source domain.
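    A minimal sketch (NumPy; one plausible reading of TNI, with an assumed SNR range and stand-in signals, since the paper's exact procedure is not reproduced here): labelled source-domain waveforms are mixed with background noise resembling the target domain, so the classifier learns features robust to the target hospitals' recording conditions without needing any target labels or training data.

        import numpy as np

        def inject_target_noise(waveform, noise_clip, snr_db):
            """Mix a target-domain noise clip into a source waveform at a given SNR (dB)."""
            # Tile or trim the noise so it matches the waveform length.
            reps = int(np.ceil(len(waveform) / len(noise_clip)))
            noise = np.tile(noise_clip, reps)[: len(waveform)]
            sig_power = np.mean(waveform ** 2)
            noise_power = np.mean(noise ** 2) + 1e-12
            # Scale the noise so the signal-to-noise power ratio matches snr_db.
            scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
            return waveform + scale * noise

        # Usage with stand-in signals (a real pipeline would load cry recordings
        # and noise characteristic of the target hospital):
        rng = np.random.default_rng(0)
        cry = rng.standard_normal(16000)   # 1 s of audio at 16 kHz
        hospital_noise = rng.standard_normal(4000)
        augmented = inject_target_noise(cry, hospital_noise, snr_db=rng.uniform(5, 20))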

An Aliasing-Free Hybrid Digital-Analog Polyphonic Synthesizer

  • paper_url: http://arxiv.org/abs/2311.18774
  • repo_url: None
  • paper_authors: Jonas Roth, Domenic Keller, Oscar Castañeda, Christoph Studer
  • for: This paper presents a hybrid digital-analog eight-voice polyphonic synthesizer prototype called the +-synth, which combines the best of both worlds to provide superior sound quality and mitigate the drawbacks of analog circuitry.
  • methods: The +-synth uses a novel digital very-large scale integration (VLSI) design called the big Fourier oscillator (BFO), which utilizes additive synthesis to generate a wide variety of aliasing-free waveforms. Each BFO produces two voices, using four oscillators per voice, and each oscillator can generate up to 1024 freely configurable partials.
  • results: Measurement results of the +-synth prototype demonstrate high fidelity and low latency, indicating that the hybrid digital-analog design achieves the desired goals of combining the best of both worlds.
    Abstract Analog subtractive synthesizers are generally considered to provide superior sound quality compared to digital emulations. However, analog circuitry requires calibration and suffers from aging, temperature instability, and limited flexibility in generating a wide variety of waveforms. Digital synthesis can mitigate many of these drawbacks, but generating arbitrary aliasing-free waveforms remains challenging. In this paper, we present the +-synth, a hybrid digital-analog eight-voice polyphonic synthesizer prototype that combines the best of both worlds. At the heart of the synthesizer is the big Fourier oscillator (BFO), a novel digital very-large scale integration (VLSI) design that utilizes additive synthesis to generate a wide variety of aliasing-free waveforms. Each BFO produces two voices, using four oscillators per voice. A single oscillator can generate up to 1024 freely configurable partials (harmonic or inharmonic), which are calculated using coordinate rotation digital computers (CORDICs). The BFOs were fabricated as 65nm CMOS custom application-specific integrated circuits (ASICs), which are integrated in the +-synth to simultaneously generate up to 32768 partials. Four 24-bit 96kHz stereo DACs then convert the eight voices into the analog domain, followed by digitally controlled analog low-pass filtering and amplification. Measurement results of the +-synth prototype demonstrate high fidelity and low latency.
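    A minimal sketch (NumPy; np.sin stands in for the ASIC's CORDIC-based sinusoid computation, and the sawtooth recipe is only an example) of the aliasing-free additive synthesis the BFO performs: a waveform is built as a sum of freely configurable partials, and any partial at or above the Nyquist frequency is simply omitted, so aliasing cannot occur by construction.

        import numpy as np

        def additive_osc(f0, partial_ratios, partial_amps, fs=96000, dur=0.1):
            """Sum sinusoidal partials of f0, dropping any at or above Nyquist."""
            t = np.arange(int(fs * dur)) / fs
            out = np.zeros_like(t)
            for ratio, amp in zip(partial_ratios, partial_amps):
                f = f0 * ratio
                if f < fs / 2:  # aliasing-free by construction
                    out += amp * np.sin(2 * np.pi * f * t)
            return out

        # Example: a band-limited sawtooth at 440 Hz from up to 1024 harmonic
        # partials, the k-th harmonic with amplitude 1/k.
        k = np.arange(1, 1025)
        saw = additive_osc(440.0, k, 1.0 / k)

    At fs = 96 kHz and f0 = 440 Hz, only the first 109 harmonics lie below Nyquist, so the remaining configured partials are skipped rather than folded back into the audible band.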