cs.SD - 2023-08-01

Choir Transformer: Generating Polyphonic Music with Relative Attention on Transformer

  • paper_url: http://arxiv.org/abs/2308.02531
  • repo_url: https://github.com/Zjy0401/choir-transformer
  • paper_authors: Jiuyang Zhou, Hong Zhu, Xingping Wang
  • for: This work aims to propose a neural network model for polyphonic (multi-voice) music generation that better models the structure of music.
  • methods: The authors propose Choir Transformer, a polyphonic music generation network that uses relative positional attention to better capture relationships between long-distance notes, together with a music representation suited to polyphonic generation (a minimal sketch of relative positional attention follows this entry).
  • results: Experiments show that Choir Transformer surpasses the previous state-of-the-art accuracy by 4.06%. Harmony metrics measured on the generated polyphonic music are close to those of Bach's music. In practical use, the generated melody and rhythm can be adjusted according to the specified input, supporting different styles such as folk or pop music.
    Abstract Polyphonic music generation is still a challenging direction due to the need for coherence between the generated melody and harmony. Most previous studies used RNN-based models. However, RNN-based models struggle to establish relationships between long-distance notes. In this paper, we propose a polyphonic music generation neural network named Choir Transformer [https://github.com/Zjy0401/choir-transformer], with relative positional attention to better model the structure of music. We also propose a music representation suitable for polyphonic music generation. The performance of Choir Transformer surpasses the previous state-of-the-art accuracy by 4.06%. We also measure the harmony metrics of the generated polyphonic music. Experiments show that the harmony metrics are close to the music of Bach. In practical application, the generated melody and rhythm can be adjusted according to the specified input, with different styles of music such as folk music or pop music.
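
The paper credits relative positional attention for modeling relationships between long-distance notes. Below is a minimal, hedged sketch of that mechanism in the spirit of Music Transformer-style relative attention; it is not the authors' code, and the clipping distance `max_rel_dist`, the module name, and the tensor shapes are assumptions for illustration only.

```python
# Minimal sketch (not the authors' implementation) of self-attention with learned
# relative position embeddings, clipped to a maximum distance.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, max_rel_dist: int = 128):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # one learned embedding per clipped relative distance in [-max_rel_dist, max_rel_dist]
        self.rel_emb = nn.Embedding(2 * max_rel_dist + 1, self.d_head)
        self.max_rel_dist = max_rel_dist

    def forward(self, x):                                   # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, d_head)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        # content-based attention logits
        logits = q @ k.transpose(-2, -1) / self.d_head ** 0.5

        # relative-position logits: pairwise distances, clipped so very distant
        # notes share a single embedding
        pos = torch.arange(t, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        r = self.rel_emb(rel + self.max_rel_dist)            # (seq, seq, d_head)
        rel_logits = torch.einsum('bhqd,qkd->bhqk', q, r) / self.d_head ** 0.5

        attn = F.softmax(logits + rel_logits, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(out)
```

The relative-position logits are added to the usual content-based logits before the softmax, so two notes separated by the same musical distance share a learned bias regardless of their absolute positions in the sequence.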

Multi-goal Audio-visual Navigation using Sound Direction Map

  • paper_url: http://arxiv.org/abs/2308.00219
  • repo_url: None
  • paper_authors: Haru Kondoh, Asako Kanezaki
  • for: This paper proposes a new framework for a generalized multi-goal audio-visual navigation task and investigates the difficulty of this task under various conditions.
  • methods: The paper proposes a method named sound direction map (SDM), which dynamically localizes multiple sound sources in a learning-based manner while making use of past memories (an illustrative sketch follows this entry).
  • results: Experiments show that SDM improves the performance of multiple baseline methods, regardless of the number of goals.
    Abstract Over the past few years, there has been a great deal of research on navigation tasks in indoor environments using deep reinforcement learning agents. Most of these tasks use only visual information in the form of first-person images to navigate to a single goal. More recently, tasks that simultaneously use visual and auditory information to navigate to the sound source and even navigation tasks with multiple goals instead of one have been proposed. However, there has been no proposal for a generalized navigation task combining these two types of tasks and using both visual and auditory information in a situation where multiple sound sources are goals. In this paper, we propose a new framework for this generalized task: multi-goal audio-visual navigation. We first define the task in detail, and then we investigate the difficulty of the multi-goal audio-visual navigation task relative to the current navigation tasks by conducting experiments in various situations. The research shows that multi-goal audio-visual navigation has the difficulty of the implicit need to separate the sources of sound. Next, to mitigate the difficulties in this new task, we propose a method named sound direction map (SDM), which dynamically localizes multiple sound sources in a learning-based manner while making use of past memories. Experimental results show that the use of SDM significantly improves the performance of multiple baseline methods, regardless of the number of goals.
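
The abstract describes SDM as a learned map that localizes multiple sound sources while making use of past memories. The sketch below is purely illustrative and is not the paper's method: it caricatures the "map plus memory" idea by accumulating per-step direction-of-arrival estimates into a decaying top-down evidence grid. The grid size, decay factor, ray casting, and peak extraction are all assumptions, not details from the paper.

```python
# Illustrative sketch: accumulate sound-direction observations into a top-down grid
# that fades over time, then read off candidate source locations as grid peaks.
import numpy as np

class SoundDirectionMap:
    def __init__(self, size: int = 64, decay: float = 0.95):
        self.map = np.zeros((size, size), dtype=np.float32)   # top-down evidence grid
        self.size = size
        self.decay = decay

    def update(self, agent_xy, direction_rad, intensity, max_range: int = 20):
        """Decay older evidence, then cast a ray from the agent along the
        estimated sound direction and deposit evidence along it."""
        self.map *= self.decay                                 # past memories fade
        x0, y0 = agent_xy
        for r in range(1, max_range):
            x = int(round(x0 + r * np.cos(direction_rad)))
            y = int(round(y0 + r * np.sin(direction_rad)))
            if 0 <= x < self.size and 0 <= y < self.size:
                self.map[y, x] += intensity / r                # nearer cells get more evidence

    def peaks(self, k: int = 3):
        """Return the k strongest cells as candidate sound-source locations."""
        flat = np.argsort(self.map, axis=None)[::-1][:k]
        return [tuple(np.unravel_index(i, self.map.shape)) for i in flat]

# usage: the map could be fed to the navigation policy as an extra observation channel
sdm = SoundDirectionMap()
sdm.update(agent_xy=(32, 32), direction_rad=np.pi / 4, intensity=1.0)
print(sdm.peaks(k=1))
```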

DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.00122
  • repo_url: None
  • paper_authors: Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu
  • for: Proposes a diffusion model-based audio-visual separation framework for the audio-visual sound source separation task.
  • methods: Combines a generative diffusion model with a Separation U-Net to synthesize separated magnitudes starting from Gaussian noise, conditioned on both the audio mixture and visual features (a sketch of the sampling loop follows this entry).
  • results: Compared with existing state-of-the-art discriminative methods, DAVIS achieves higher separation quality on the domain-specific MUSIC dataset and the open-domain AVE dataset, demonstrating the advantages of the framework for the audio-visual source separation task.
    Abstract We propose DAVIS, a Diffusion model-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through a generative manner. While existing discriminative methods that perform mask regression have made remarkable progress in this field, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated magnitudes starting from Gaussian noises, conditioned on both the audio mixture and the visual footage. With its generative objective, DAVIS is better suited to achieving the goal of high-quality sound separation across diverse categories. We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the domain-specific MUSIC dataset and the open-domain AVE dataset, and results show that DAVIS outperforms other methods in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task.
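
The abstract describes a generative objective: separated magnitudes are synthesized from Gaussian noise, conditioned on the audio mixture and visual footage. The sketch below shows a generic DDPM-style ancestral sampling loop with that conditioning; `denoise_unet` stands in for the paper's Separation U-Net, and the linear noise schedule and step count are assumptions rather than details from the paper.

```python
# Hedged sketch of conditional diffusion sampling for separation: start from pure
# noise and iteratively denoise a separated magnitude spectrogram, conditioned on
# the mixture magnitude and visual features.
import torch

def sample_separated_magnitude(denoise_unet, mixture_mag, visual_feat, n_steps=1000):
    device = mixture_mag.device
    betas = torch.linspace(1e-4, 0.02, n_steps, device=device)   # linear schedule (assumed)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(mixture_mag)                            # start from Gaussian noise
    for t in reversed(range(n_steps)):
        # predict the noise component given the mixture and the visual condition
        eps = denoise_unet(x, t, mixture_mag, visual_feat)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise                  # ancestral sampling step
    return x                                                     # separated magnitude estimate
```

Because the model learns the conditional distribution of separated magnitudes rather than regressing a single mask, the same loop can in principle produce high-quality separations across diverse sound categories, which is the advantage the paper argues for.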