results: After implementing the model, the researchers obtained intermediate results that surpass the baseline, and presented the challenges encountered in practice along with possibilities for future research.
Abstract
Foley sound synthesis refers to the creation of authentic, diegetic sound effects for media, such as film or radio. In this study, we construct a neural Foley synthesizer capable of generating mono-audio clips across seven predefined categories. Our approach introduces multiple enhancements to existing models in the text-to-audio domain, with the goal of enriching the diversity and acoustic characteristics of the generated foleys. Notably, we utilize a pre-trained encoder that retains acoustical and musical attributes in intermediate embeddings, implement class-conditioning to enhance differentiability among foley classes in their intermediate representations, and devise an innovative transformer-based architecture for optimizing self-attention computations on very large inputs without compromising valuable information. Subsequent to implementation, we present intermediate outcomes that surpass the baseline, discuss practical challenges encountered in achieving optimal results, and outline potential pathways for further research.
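As a rough illustration of the class-conditioning idea described above, the sketch below adds a learned class embedding to the encoder's intermediate representation so that the seven foley categories become more separable before decoding. This is a minimal PyTorch-style sketch under our own assumptions: the module name, dimensions, and the additive conditioning scheme are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ClassConditionedBottleneck(nn.Module):
    """Illustrative sketch: inject a learned class embedding into the
    intermediate representation of an encoder-decoder foley synthesizer.
    Module and dimension names are hypothetical, not the paper's."""

    def __init__(self, num_classes: int = 7, embed_dim: int = 512):
        super().__init__()
        self.class_embed = nn.Embedding(num_classes, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, latent: torch.Tensor, class_id: torch.Tensor) -> torch.Tensor:
        # latent: (batch, time, embed_dim) intermediate embeddings from the
        # pre-trained audio encoder; class_id: (batch,) integer class labels.
        cond = self.class_embed(class_id).unsqueeze(1)  # (batch, 1, embed_dim)
        return self.proj(latent + cond)                 # broadcast over time


# Usage example with random tensors (shapes for illustration only).
bottleneck = ClassConditionedBottleneck()
latent = torch.randn(4, 250, 512)           # e.g. 250 encoder frames per clip
class_id = torch.tensor([0, 3, 5, 6])       # one of the 7 foley classes each
conditioned = bottleneck(latent, class_id)  # (4, 250, 512)
```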
Leveraging Pretrained Image-text Models for Improving Audio-Visual Learning
paper_authors: Saurabhchand Bhati, Jesús Villalba, Laureano Moro-Velazquez, Thomas Thebaud, Najim Dehak
for: This paper aims to improve the performance of speech-based visually grounded models so that they can make better use of pretrained image and text encoders.
methods: The paper uses a hierarchical segmental speech encoder to convert speech into sequences of word-like unit representations, which are then encoded with the pretrained CLIP text encoder. It also explores mapping audio to the CLIP vocabulary embedding space via regularization and quantization.
results: Experiments show that this approach reduces the drop in retrieval performance, and that audio-only systems perform close to the audio-visual system.
Abstract
Visually grounded speech systems learn from paired images and their spoken captions. Recently, there have been attempts to utilize the visually grounded models trained from images and their corresponding text captions, such as CLIP, to improve speech-based visually grounded models' performance. However, the majority of these models only utilize the pretrained image encoder. Cascaded SpeechCLIP attempted to generate localized word-level information and utilize both the pretrained image and text encoders. Despite using both, they noticed a substantial drop in retrieval performance. We proposed Segmental SpeechCLIP which used a hierarchical segmental speech encoder to generate sequences of word-like units. We used the pretrained CLIP text encoder on top of these word-like unit representations and showed significant improvements over the cascaded variant of SpeechCLIP. Segmental SpeechCLIP directly learns the word embeddings as input to the CLIP text encoder bypassing the vocabulary embeddings. Here, we explore mapping audio to CLIP vocabulary embeddings via regularization and quantization. As our objective is to distill semantic information into the speech encoders, we explore the usage of large unimodal pretrained language models as the text encoders. Our method enables us to bridge image and text encoders e.g. DINO and RoBERTa trained with uni-modal data. Finally, we extend our framework in audio-only settings where only pairs of semantically related audio are available. Experiments show that audio-only systems perform close to the audio-visual system.
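The "mapping audio to CLIP vocabulary embeddings via quantization" step can be pictured with the following minimal sketch. It assumes a frozen token-embedding table from the text encoder and snaps each word-like unit onto its nearest vocabulary embedding with a straight-through estimator; the function name, shapes, and the specific straight-through choice are our assumptions, not the paper's exact method.

```python
import torch
import torch.nn.functional as F

def quantize_to_vocab(word_units: torch.Tensor,
                      vocab_embeddings: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch (not the authors' code): snap continuous word-like
    unit representations onto the rows of a frozen text-encoder vocabulary
    embedding table, keeping gradients for the speech encoder via a
    straight-through estimator.

    word_units:       (batch, num_units, dim) from the segmental speech encoder
    vocab_embeddings: (vocab_size, dim) frozen token-embedding table
    """
    # Cosine similarity between each unit and every vocabulary embedding.
    units_n = F.normalize(word_units, dim=-1)
    vocab_n = F.normalize(vocab_embeddings, dim=-1)
    sims = units_n @ vocab_n.t()                 # (batch, num_units, vocab_size)

    # Hard nearest-neighbour assignment ...
    idx = sims.argmax(dim=-1)                    # (batch, num_units)
    quantized = vocab_embeddings[idx]            # (batch, num_units, dim)

    # ... with a straight-through pass so gradients flow to word_units.
    return word_units + (quantized - word_units).detach()


# Usage example with random tensors (sizes only for illustration).
units = torch.randn(2, 12, 512)             # 12 word-like units per utterance
vocab = torch.randn(49408, 512)             # e.g. a CLIP-sized BPE vocabulary
snapped = quantize_to_vocab(units, vocab)   # (2, 12, 512)
```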
A Long-Tail Friendly Representation Framework for Artist and Music Similarity
for: This paper aims to propose a Long-Tail Friendly Representation Framework (LTFRF) for long-tail scenarios in music retrieval and recommendation.
methods: The paper uses neural networks to model music, user, metadata, and relationship data, integrates them into a unified metric learning framework, and employs a meta-consistency relationship as a regularization term to introduce the Multi-Relationship Loss.
results: On similar artist/music recommendation tasks, LTFRF outperforms the baseline by 9.69%/19.42% in Hit Ratio@10, and in long-tail cases it is 11.05%/14.14% higher than the baseline in Consistent@10.
Abstract
The investigation of the similarity between artists and music is crucial in music retrieval and recommendation, and addressing the challenge of the long-tail phenomenon is increasingly important. This paper proposes a Long-Tail Friendly Representation Framework (LTFRF) that utilizes neural networks to model the similarity relationship. Our approach integrates music, user, metadata, and relationship data into a unified metric learning framework, and employs a meta-consistency relationship as a regular term to introduce the Multi-Relationship Loss. Compared to the Graph Neural Network (GNN), our proposed framework improves the representation performance in long-tail scenarios, which are characterized by sparse relationships between artists and music. We conduct experiments and analysis on the AllMusic dataset, and the results demonstrate that our framework provides a favorable generalization of artist and music representation. Specifically, on similar artist/music recommendation tasks, the LTFRF outperforms the baseline by 9.69%/19.42% in Hit Ratio@10, and in long-tail cases, the framework achieves 11.05%/14.14% higher than the baseline in Consistent@10.
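For concreteness, the Hit Ratio@10 numbers reported above measure how often the relevant artist or track appears among the top 10 recommendations. Below is a minimal sketch of that metric; the function and variable names are ours, and the toy data is purely illustrative.

```python
def hit_ratio_at_k(ranked_candidates, ground_truth, k: int = 10) -> float:
    """Illustrative sketch of Hit Ratio@K: the fraction of queries whose
    relevant item appears in the top-K of the ranked candidate list."""
    hits = sum(1 for ranked, truth in zip(ranked_candidates, ground_truth)
               if truth in ranked[:k])
    return hits / len(ground_truth)


# Usage example with toy rankings for three query artists.
ranked = [["a7", "a2", "a9"], ["a1", "a5"], ["a3", "a4", "a8"]]
truth = ["a2", "a6", "a3"]
print(hit_ratio_at_k(ranked, truth, k=10))  # 2 of 3 queries hit
```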
A Two-Stage Training Framework for Joint Speech Compression and Enhancement
results: Experimental results show that the proposed two-stage training method outperforms SoundStream and other representative codecs on both objective and subjective evaluation metrics.
Abstract
This paper considers the joint compression and enhancement problem for speech signal in the presence of noise. Recently, the SoundStream codec, which relies on end-to-end joint training of an encoder-decoder pair and a residual vector quantizer by a combination of adversarial and reconstruction losses, has shown very promising performance, especially in subjective perception quality. In this work, we provide a theoretical result to show that, to simultaneously achieve low distortion and high perception in the presence of noise, there exists an optimal two-stage optimization procedure for the joint compression and enhancement problem. This procedure first optimizes an encoder-decoder pair using only distortion loss and then fixes the encoder to optimize a perceptual decoder using perception loss. Based on this result, we construct a two-stage training framework for joint compression and enhancement of noisy speech signal. Unlike existing training methods, which are heuristic, the proposed two-stage training method has a theoretical foundation. Finally, experimental results for various noise and bit-rate conditions are provided. The results demonstrate that a codec trained by the proposed framework can outperform SoundStream and other representative codecs in terms of both objective and subjective evaluation metrics. Code is available at https://github.com/jscscloris/SEStream.
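The two-stage procedure described in the abstract (stage 1: train the encoder-decoder pair with a distortion loss only; stage 2: freeze the encoder and train a perceptual decoder with a perception loss) can be sketched as follows. The module names, optimizer choice, and loss placeholders are our illustrative assumptions, not the released SEStream code.

```python
import torch

def two_stage_training(encoder, decoder, perceptual_decoder,
                       distortion_loss, perception_loss,
                       noisy_clean_pairs, epochs: int = 10, lr: float = 1e-4):
    """Minimal sketch of the two-stage procedure described above
    (all components are placeholders passed in by the caller)."""
    # ---- Stage 1: distortion-only training of the encoder-decoder pair ----
    opt1 = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        for noisy, clean in noisy_clean_pairs:
            recon = decoder(encoder(noisy))
            loss = distortion_loss(recon, clean)
            opt1.zero_grad()
            loss.backward()
            opt1.step()

    # ---- Stage 2: freeze the encoder, train only the perceptual decoder ----
    for p in encoder.parameters():
        p.requires_grad_(False)
    opt2 = torch.optim.Adam(perceptual_decoder.parameters(), lr=lr)
    for _ in range(epochs):
        for noisy, clean in noisy_clean_pairs:
            with torch.no_grad():
                latent = encoder(noisy)      # encoder is fixed in this stage
            recon = perceptual_decoder(latent)
            loss = perception_loss(recon, clean)
            opt2.zero_grad()
            loss.backward()
            opt2.step()

    return encoder, decoder, perceptual_decoder
```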