results: Evaluation results show that the proposed method generates accurate sound event detection sequences.
Abstract
Recently, the capabilities of language models (LMs) have attracted increasing attention in visual cross-modality research. In this paper, we further explore the generation capacity of LMs for sound event detection (SED), beyond the visual domain. Specifically, we propose an elegant method that aligns audio features and text features to accomplish sound event classification and temporal localization. The framework consists of an acoustic encoder, a contrastive module that aligns the corresponding representations of the text and audio, and a decoupled language decoder that generates temporal and event sequences from the audio features. Compared with conventional works that require complicated processing and barely utilize limited audio features, our model is more concise and comprehensive, since the language model directly leverages its semantic capabilities to generate the sequences. We investigate different decoupling modules to demonstrate their effectiveness for timestamp capture and event classification. Evaluation results show that the proposed method generates accurate sound event detection sequences.
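To make the alignment step concrete, here is a minimal sketch of contrastive audio-text alignment in the spirit described above (a CLIP-style symmetric InfoNCE loss over pooled encoder outputs). The projection heads, dimensions, and temperature are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code) of the contrastive alignment idea:
# audio and text embeddings are projected into a shared space and trained with
# a symmetric InfoNCE loss so that matching audio/caption pairs score highest.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioTextContrastive(nn.Module):
    def __init__(self, audio_dim=768, text_dim=768, shared_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, shared_dim)    # hypothetical projection heads
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable temperature (log scale)

    def forward(self, audio_emb, text_emb):
        a = F.normalize(self.audio_proj(audio_emb), dim=-1)
        t = F.normalize(self.text_proj(text_emb), dim=-1)
        logits = self.logit_scale.exp() * a @ t.T             # (B, B) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)    # diagonal entries are the matching pairs
        # symmetric cross-entropy over audio->text and text->audio directions
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# toy usage with random pooled encoder outputs
loss = AudioTextContrastive()(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```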
Deep learning-based denoising streamed from mobile phones improves speech-in-noise understanding for hearing aid users
paper_authors: Peter Udo Diehl, Hannes Zilly, Felix Sattler, Yosef Singer, Kevin Kepp, Mark Berry, Henning Hasemann, Marlene Zippel, Müge Kaya, Paul Meyer-Rachner, Annett Pudszuhn, Veit M. Hofmann, Matthias Vormann, Elias Sprengel
for: Individuals with hearing loss who use hearing aids, particularly in noisy environments.
methods: A deep learning-based denoising system that runs in real time on a mobile device (iPhone 7 or Samsung Galaxy S10) and streams the denoised audio directly to the hearing aid.
results: The denoising system improves audio quality and speech intelligibility for hearing aid users in noisy environments, as measured by subjective ratings and objective speech intelligibility tests. Subjective ratings improve by more than 40%, and speech reception thresholds improve by 1.6 dB SRT.
Abstract
The hearing loss of almost half a billion people is commonly treated with hearing aids. However, current hearing aids often do not work well in real-world noisy environments. We present a deep learning-based denoising system that runs in real time on an iPhone 7 and a Samsung Galaxy S10 (25 ms algorithmic latency). The denoised audio is streamed to the hearing aid, resulting in a total delay of around 75 ms. In tests with hearing aid users having moderate to severe hearing loss, our denoising system improves audio across three tests: 1) listening for subjective audio ratings, 2) listening for objective speech intelligibility, and 3) live conversations in a noisy environment for subjective ratings. Subjective ratings increase by more than 40% for both the listening test and the live conversation, compared to a fitted hearing aid as a baseline. Speech reception thresholds, measuring speech understanding in noise, improve by 1.6 dB SRT. Ours is the first denoising system that is implemented on a mobile device and streamed directly to users' hearing aids using only a single channel of audio input while improving user satisfaction on all tested aspects, including speech intelligibility. This includes an overall preference for the denoised and streamed signal over the hearing aid alone, showing that users accept the higher latency in exchange for the significant improvement in speech understanding.
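As a rough illustration of where the reported delays come from, the sketch below walks through block-based streaming: the denoiser must buffer one frame before it can produce output (the 25 ms algorithmic latency), and transport to the hearing aid accounts for the rest of the roughly 75 ms budget. The frame size, sample rate, and split of the remaining delay are assumptions, and the pass-through `denoise_frame` stands in for the actual neural network.

```python
# Illustrative sketch only: how a block-based denoiser accumulates algorithmic
# latency. The actual model, frame sizes, and transport delays are not public
# here; the numbers below just reproduce the stated 25 ms / ~75 ms budget.
import numpy as np

SAMPLE_RATE = 16_000          # assumed sample rate
FRAME_MS = 25                 # algorithmic latency: the denoiser buffers one frame
FRAME = SAMPLE_RATE * FRAME_MS // 1000

def denoise_frame(frame: np.ndarray) -> np.ndarray:
    """Placeholder for the neural denoiser: here just a pass-through."""
    return frame

def stream_denoise(audio: np.ndarray) -> np.ndarray:
    """Process audio one frame at a time, as a real-time system would."""
    out = []
    for start in range(0, len(audio) - FRAME + 1, FRAME):
        out.append(denoise_frame(audio[start:start + FRAME]))
    return np.concatenate(out) if out else np.empty(0)

# Rough delay budget as described in the abstract (the split below is assumed).
algorithmic_ms = FRAME_MS
transport_and_playback_ms = 50        # streaming to the hearing aid, buffering, playback
print("total delay ≈", algorithmic_ms + transport_and_playback_ms, "ms")

stream_denoise(np.random.randn(SAMPLE_RATE))  # one second of toy audio
```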
Convoifilter: A case study of doing cocktail party speech recognition
results: With this approach, the word error rate (WER) of ASR is reduced from 80% to 26.4%. Typically, the two components are tuned independently because of differing data requirements, but speech enhancement can introduce artifacts that degrade ASR performance. By applying a joint fine-tuning strategy, the WER is further reduced from 26.4% to 14.5%.
Abstract
This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. The model utilizes a single-channel speech enhancement module that isolates the speaker's voice from background noise, along with an ASR module. Through this approach, the model is able to decrease the word error rate (WER) of ASR from 80% to 26.4%. Typically, these two components are adjusted independently due to variations in data requirements. However, speech enhancement can create anomalies that decrease ASR efficiency. By implementing a joint fine-tuning strategy, the model can reduce the WER from 26.4% in separate tuning to 14.5% in joint tuning.
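The joint fine-tuning idea, where gradients from the ASR loss flow back through the enhancement front-end so that it learns to avoid artifacts that hurt recognition, can be sketched as below. The tiny convolutional modules, CTC loss, and 0.1 loss weight are placeholders for illustration; they are not the architecture or hyperparameters used in the paper.

```python
# Hedged sketch of joint fine-tuning: the enhancement front-end and the ASR
# back-end are optimized together, so the enhancer is penalized for outputs
# that hurt recognition. All modules and weights here are illustrative only.
import torch
import torch.nn as nn

class Enhancer(nn.Module):
    """Stand-in for the single-channel speech enhancement model."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(1, 1, kernel_size=9, padding=4)

    def forward(self, noisy):
        return self.net(noisy)

class TinyASR(nn.Module):
    """Stand-in for the ASR encoder with a CTC output head."""
    def __init__(self, vocab=32):
        super().__init__()
        self.net = nn.Conv1d(1, vocab, kernel_size=9, padding=4)

    def forward(self, wav):
        return self.net(wav).permute(2, 0, 1).log_softmax(-1)   # (T, B, V) for CTC

enhancer, asr = Enhancer(), TinyASR()
opt = torch.optim.Adam(list(enhancer.parameters()) + list(asr.parameters()), lr=1e-4)
ctc = nn.CTCLoss(blank=0)

noisy = torch.randn(2, 1, 1600)       # toy batch: 2 utterances, 0.1 s at 16 kHz
clean = torch.randn(2, 1, 1600)
targets = torch.randint(1, 32, (2, 5))

enhanced = enhancer(noisy)
log_probs = asr(enhanced)             # gradients flow back into the enhancer
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), log_probs.size(0), dtype=torch.long),
           target_lengths=torch.full((2,), 5, dtype=torch.long))
loss = loss + 0.1 * nn.functional.l1_loss(enhanced, clean)   # assumed enhancement loss weighting
loss.backward()
opt.step()
```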
Evaluation of the Speech Resynthesis Capabilities of the VoicePrivacy Challenge Baseline B1
results: The study finds that both the speech representation and the vocoder in VPC Baseline B1 introduce processing artifacts and make the speech sound unnatural. A MUSHRA-like listening test with 18 subjects corroborates these findings, motivating further research on the analysis and synthesis components of VPC Baseline B1.
Abstract
Speaker anonymization systems continue to improve their ability to obfuscate the original speaker characteristics in a speech signal, but often create processing artifacts and unnatural-sounding voices as a tradeoff. Many of those systems stem from the VoicePrivacy Challenge (VPC) Baseline B1, using a neural vocoder to synthesize speech from a speech representation based on F0, x-vectors and bottleneck features. Inspired by this, we investigate the reproduction capabilities of the aforementioned baseline, to assess how successful the shared methodology is in synthesizing human-like speech. We use four objective metrics to measure speech quality, waveform similarity, and F0 similarity. Our findings indicate that both the speech representation and the vocoder introduce artifacts, causing an unnatural perception. A MUSHRA-like listening test on 18 subjects corroborates our findings, motivating further research on the analysis and synthesis components of the VPC Baseline B1.
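As an example of the kind of objective metric mentioned above, the snippet below computes an F0 similarity between an original and a resynthesized utterance as the Pearson correlation of their voiced pYIN F0 contours. The pitch range, frame alignment, and correlation choice are assumptions for illustration; the paper's exact metric definitions are not reproduced here.

```python
# Minimal illustration (not the paper's evaluation code) of an F0-similarity
# measure between an original utterance and its resynthesis.
import numpy as np
import librosa

def f0_similarity(ref: np.ndarray, syn: np.ndarray, sr: int) -> float:
    kwargs = dict(fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    f0_ref, _, _ = librosa.pyin(ref, **kwargs)
    f0_syn, _, _ = librosa.pyin(syn, **kwargs)
    n = min(len(f0_ref), len(f0_syn))                   # align contour lengths
    f0_ref, f0_syn = f0_ref[:n], f0_syn[:n]
    voiced = ~np.isnan(f0_ref) & ~np.isnan(f0_syn)      # compare only frames voiced in both
    if voiced.sum() < 2:
        return float("nan")
    return float(np.corrcoef(f0_ref[voiced], f0_syn[voiced])[0, 1])

# toy usage with random signals (real usage would load original vs. resynthesized audio)
sr = 16_000
print(f0_similarity(np.random.randn(sr), np.random.randn(sr), sr))
```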
Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning
for: Improving model performance in text-to-music generation (T2M-Gen) by addressing the scarcity of large-scale publicly available music datasets with natural language captions.
methods: Proposes the Music Understanding LLaMA (MU-LLaMA) model, capable of answering music-related questions and generating captions for music files, using audio representations from a pretrained MERT model to extract music features.
results: Experiments show that the proposed MU-LLaMA model, trained on the purpose-built MusicQA dataset, achieves outstanding performance in both music question answering and music caption generation across various metrics, outperforming current state-of-the-art (SOTA) models in both fields and offering a promising advancement for T2M-Gen research.
Abstract
Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions. To address this, we propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files. Our model utilizes audio representations from a pretrained MERT model to extract music features. However, obtaining a suitable dataset for training the MU-LLaMA model remains challenging, as existing publicly accessible audio question answering datasets lack the necessary depth for open-ended music question answering. To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset designed for answering open-ended music-related questions. The experiments demonstrate that the proposed MU-LLaMA model, trained on our designed MusicQA dataset, achieves outstanding performance in both music question answering and music caption generation across various metrics, outperforming current state-of-the-art (SOTA) models in both fields and offering a promising advancement in the T2M-Gen research field.
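The feature-extraction step described above, pooling audio representations from a pretrained MERT model so they can condition a language model, might look roughly like the sketch below. The checkpoint name, layer pooling, and downstream adapter interface are assumptions based on the publicly released MERT models (loading them may require extra packages listed on the model card), not the authors' exact pipeline.

```python
# Hedged sketch of pooling MERT features for language-model conditioning.
# Checkpoint and pooling strategy are assumptions, not MU-LLaMA's exact code.
import torch
from transformers import AutoModel, Wav2Vec2FeatureExtractor

CKPT = "m-a-p/MERT-v1-330M"                       # assumed public MERT checkpoint
processor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT, trust_remote_code=True)
mert = AutoModel.from_pretrained(CKPT, trust_remote_code=True).eval()

def music_features(waveform: torch.Tensor, sr: int) -> torch.Tensor:
    """Return one time-averaged feature vector per MERT layer for a mono waveform."""
    inputs = processor(waveform.numpy(), sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = mert(**inputs, output_hidden_states=True)
    hidden = torch.stack(out.hidden_states)        # (num_layers + 1, 1, T, dim)
    return hidden.mean(dim=2).squeeze(1)           # time-average -> (num_layers + 1, dim)

feats = music_features(torch.randn(24_000 * 5), sr=24_000)   # 5 s of toy audio at 24 kHz
print(feats.shape)   # these vectors would feed an adapter/projection into the LLM
```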
results: Experimental results show that a decision tree successfully predicts bend occurrences with an F1 score of 0.71 and relatively few false positives, demonstrating the potential of this approach for assisting the arrangement of non-guitar music into guitar tablature.
Abstract
Tablature notation is widely used in popular music to transcribe and share guitar musical content. As a complement to standard score notation, tablatures transcribe performance gesture information, including finger positions and a variety of guitar-specific playing techniques such as slides, hammer-ons/pull-offs, or bends. This paper focuses on bends, which make it possible to progressively shift the pitch of a note, thereby circumventing the physical limitations of the discrete fretted fingerboard. In this paper, we propose a set of 25 high-level features, computed for each note of the tablature, to study how bend occurrences can be predicted from their past and future short-term context. Experiments are performed on a corpus of 932 lead guitar tablatures of popular music and show that a decision tree successfully predicts bend occurrences with an F1 score of 0.71 and a limited number of false positive predictions, demonstrating promising applications to assist the arrangement of non-guitar music into guitar tablatures.
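To illustrate the modeling setup, the sketch below trains a decision tree on per-note feature vectors to predict a binary bend label and reports an F1 score. The synthetic features and labels are placeholders; the paper's 25 hand-crafted features and tablature corpus are not reproduced here.

```python
# Illustrative sketch: per-note feature vectors feed a decision tree that
# predicts whether the note is bent. Data below is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_notes, n_features = 5000, 25             # 25 high-level features per note, as in the paper
X = rng.normal(size=(n_notes, n_features))
# toy labels: pretend bends correlate with a couple of context features
y = ((X[:, 0] + 0.5 * X[:, 1]) > 1.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
clf = DecisionTreeClassifier(max_depth=6, class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
print("F1 on synthetic data:", round(f1_score(y_te, clf.predict(X_te)), 3))
```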
PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion
results: Experimental results show that the PMVC model improves the naturalness and similarity of converted speech, with strong results on the AIShell-3 corpus.
Abstract
Voice conversion, the style transfer task applied to speech, refers to converting one person's speech into new speech that sounds like another person's. Up to now, a lot of research has been devoted to better implementation of VC tasks. However, a good voice conversion model should not only match the timbre information of the target speaker, but also expressive information such as prosody, pace, and pauses. In this context, prosody modeling is crucial for achieving expressive voice conversion that sounds natural and convincing. Unfortunately, prosody modeling is important but challenging, especially without text transcriptions. In this paper, we first propose a novel voice conversion framework named 'PMVC', which effectively separates and models the content, timbre, and prosodic information from the speech without text transcriptions. Specifically, we introduce a new speech augmentation algorithm for robust prosody extraction, and building upon this, a mask-and-predict mechanism is applied to disentangle prosody and content information. The experimental results on the AIShell-3 corpus support our improvement of the naturalness and similarity of converted speech.
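A mask-and-predict scheme of the kind mentioned above can be sketched as follows: random frames of a prosody feature sequence are masked and a small network is trained to reconstruct them from the surrounding context. The feature dimensionality, masking rate, and GRU predictor are hypothetical; the actual PMVC architecture and losses are not shown in the abstract.

```python
# Minimal, hypothetical sketch of a mask-and-predict step over prosody features.
import torch
import torch.nn as nn

class MaskedProsodyPredictor(nn.Module):
    def __init__(self, dim=4, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, dim)

    def forward(self, prosody, mask):
        x = prosody.masked_fill(mask.unsqueeze(-1), 0.0)   # zero out the masked frames
        ctx, _ = self.rnn(x)                               # context from unmasked frames
        return self.out(ctx)                               # predict every frame

B, T, D = 8, 100, 4                                        # toy prosody features (e.g. F0, energy)
prosody = torch.randn(B, T, D)
mask = torch.rand(B, T) < 0.15                             # mask ~15% of frames at random

model = MaskedProsodyPredictor(dim=D)
pred = model(prosody, mask)
loss = nn.functional.mse_loss(pred[mask], prosody[mask])   # loss only on masked positions
loss.backward()
print(float(loss))
```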