results: We validate the model using a mixed methods approach that combines computational musicology and practice-based research, and show that it can generate valid compositions. Finally, we used the model to create a complete progressive metal song, fully produced and mixed by a human metal producer.
Abstract
Recent work in the field of symbolic music generation has shown value in using a tokenization based on the GuitarPro format, a symbolic representation supporting guitar expressive attributes, as an input and output representation. We extend this work by fine-tuning a pre-trained Transformer model on ProgGP, a custom dataset of 173 progressive metal songs, for the purposes of creating compositions from that genre through a human-AI partnership. Our model is able to generate multiple guitar, bass guitar, drums, piano and orchestral parts. We examine the validity of the generated music using a mixed methods approach by combining quantitative analyses following a computational musicology paradigm and qualitative analyses following a practice-based research paradigm. Finally, we demonstrate the value of the model by using it as a tool to create a progressive metal song, fully produced and mixed by a human metal producer based on AI-generated music.
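As a rough illustration of the pipeline described above, the hypothetical sketch below fine-tunes a small causal Transformer on integer token sequences standing in for DadaGP/GuitarPro tokens, using Hugging Face transformers and PyTorch. The vocabulary size, model size, and data are placeholders, not the authors' actual setup.

```python
# Minimal, hypothetical sketch: fine-tune a small causal Transformer on
# symbolic-music token sequences (stand-ins for DadaGP/GuitarPro tokens).
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2Config, GPT2LMHeadModel

VOCAB_SIZE = 2048    # placeholder: size of the symbolic token vocabulary
SEQ_LEN = 256        # placeholder: training sequence length

# Toy corpus of random token ids; in practice these would be tokenized songs.
corpus = torch.randint(0, VOCAB_SIZE, (64, SEQ_LEN))
loader = DataLoader(TensorDataset(corpus), batch_size=8, shuffle=True)

config = GPT2Config(vocab_size=VOCAB_SIZE, n_positions=SEQ_LEN,
                    n_embd=256, n_layer=4, n_head=4,
                    bos_token_id=0, eos_token_id=1)
model = GPT2LMHeadModel(config)   # for fine-tuning, load pretrained weights instead
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(2):
    for (batch,) in loader:
        # The LM head shifts labels internally, so labels == input_ids.
        out = model(input_ids=batch, labels=batch)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch}: loss {out.loss.item():.3f}")

# Sampling: autoregressively continue a seed sequence of tokens.
seed = corpus[:1, :16]
generated = model.generate(seed, max_length=64, do_sample=True, top_p=0.95)
```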
results: Both ShredGP variants, one trained on a multi-instrument corpus and one on solo guitar data, generate tablature notation that matches the targeted guitarist's style; classifying the generated examples with a BERT-based model confirms that ShredGP produces content congruent with the target guitar player's style.
Abstract
GuitarPro format tablatures are a type of digital music notation that encapsulates information about guitar playing techniques and fingerings. We introduce ShredGP, a GuitarPro tablature generative Transformer-based model conditioned to imitate the style of four distinct iconic electric guitarists. In order to assess the idiosyncrasies of each guitar player, we adopt a computational musicology methodology by analysing features computed from the tokens yielded by the DadaGP encoding scheme. Statistical analyses of the features evidence significant differences between the four guitarists. We trained two variants of the ShredGP model, one using a multi-instrument corpus, the other using solo guitar data. We present a BERT-based model for guitar player classification and use it to evaluate the generated examples. Overall, results from the classifier show that ShredGP is able to generate content congruent with the style of the targeted guitar player. Finally, we reflect on prospective applications for ShredGP for human-AI music interaction.
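To make the computational-musicology step concrete, here is a small hypothetical sketch of the kind of analysis the abstract describes: one scalar feature per excerpt, grouped by guitarist, compared with a Kruskal-Wallis test via scipy. The feature values below are synthetic placeholders, not data from the study.

```python
# Hypothetical sketch of the statistical comparison described above:
# one scalar feature per excerpt, grouped by guitarist, tested with
# a Kruskal-Wallis H-test. Feature values here are synthetic placeholders.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
players = ["player_A", "player_B", "player_C", "player_D"]

# Placeholder: e.g. notes-per-beat computed from each excerpt's tokens.
feature_by_player = {p: rng.normal(loc=2.0 + i * 0.3, scale=0.5, size=40)
                     for i, p in enumerate(players)}

stat, p_value = kruskal(*feature_by_player.values())
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Feature distributions differ significantly between players.")
```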
Point to the Hidden: Exposing Speech Audio Splicing via Signal Pointer Nets
paper_authors: Denise Moussa, Germans Hirsch, Sebastian Wankerl, Christian Riess
for: Supporting criminal investigations by verifying the integrity of voice recording evidence
methods: Detecting audio splicing (deletion or insertion) operations with analytical and deep learning methods; SigPointer uses a pointer network over continuous input
results: Performance gains of roughly 6 to 10 percentage points on forensically challenging, strongly compressed and noisy speech data
Abstract
Verifying the integrity of voice recording evidence for criminal investigations is an integral part of an audio forensic analyst's work. Here, one focus is on detecting deletion or insertion operations, so called audio splicing. While this is a rather easy approach to alter spoken statements, careful editing can yield quite convincing results. For difficult cases or big amounts of data, automated tools can support in detecting potential editing locations. To this end, several analytical and deep learning methods have been proposed by now. Still, few address unconstrained splicing scenarios as expected in practice. With SigPointer, we propose a pointer network framework for continuous input that uncovers splice locations naturally and more efficiently than existing works. Extensive experiments on forensically challenging data like strongly compressed and noisy signals quantify the benefit of the pointer mechanism with performance increases between about 6 to 10 percentage points.
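The defining ingredient of a pointer network is an attention distribution over input positions rather than over a fixed output vocabulary. The toy PyTorch sketch below scores frame embeddings of a continuous signal against a learned query and outputs a per-frame probability that could be read as a candidate splice location; it is an illustrative assumption, not the SigPointer architecture.

```python
# Toy pointer-style head: attend over a sequence of frame embeddings and
# output a probability for each position (e.g. a candidate splice point).
# Illustrative only; not the actual SigPointer model.
import torch
import torch.nn as nn

class PointerHead(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))   # learned "where to point" query
        self.key_proj = nn.Linear(dim, dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim) encoder outputs for a continuous signal
        keys = self.key_proj(frames)                           # (B, T, D)
        scores = torch.einsum("btd,d->bt", keys, self.query)   # (B, T)
        return scores.softmax(dim=-1)                          # pointer distribution over frames

encoder = nn.GRU(input_size=40, hidden_size=64, batch_first=True)  # stand-in encoder
pointer = PointerHead(64)

spectrogram = torch.randn(2, 300, 40)       # (batch, frames, mel bins), dummy input
frames, _ = encoder(spectrogram)
probs = pointer(frames)                      # (2, 300): per-frame splice probability
print(probs.argmax(dim=-1))                  # most likely splice location per example
```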
Aeroacoustic testing on a full aircraft model at high Reynolds numbers in the European Transonic Windtunnel
results: The paper provides three-dimensional beamforming results with CLEAN-SC deconvolution, the selection of regions of interest, and the corresponding source spectra. The results indicate that slotted test sections have little influence on the beamforming results compared to closed test sections, and that the Reynolds number has a profound, non-linear impact on the aeroacoustic emission that lessens as the Reynolds number increases. Sources also show a non-linear Mach number dependency at constant Reynolds number but are self-similar within the observed Mach number range. These findings suggest that real-world phenomena can be studied on small-scale full models at real-world Reynolds numbers, enabling further investigations such as source directivity.
Abstract
This paper presents an end-to-end approach for the assessment of pressurized and cryogenic wind tunnel measurements of an EMBRAER scaled full model close to real-world Reynolds numbers. The choice of microphones, measurement parameters, the design of the array, and the selection of flow parameters are discussed. Different wind tunnel conditions are proposed which allow separating the influence of the Reynolds number from the Mach number, as well as the influence of slotted and closed test sections. The paper provides three-dimensional beamforming results with CLEAN-SC deconvolution, the selection of regions of interest, and the corresponding source spectra. The results suggest that slotted test sections have little influence on the beamforming results compared to closed test sections and that the Reynolds number has a profound, non-linear impact on the aeroacoustic emission that lessens with increasing Reynolds number. Further, sources show a non-linear Mach number dependency at constant Reynolds number but are self-similar in the observed Mach number range. The findings suggest that it is possible to study real-world phenomena on small-scale full models at real-world Reynolds numbers, which enable further investigations in the future such as the directivity of sources.
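For context, the following numpy sketch shows conventional frequency-domain beamforming on synthetic data, the basic source map that deconvolution methods such as CLEAN-SC then refine. Array geometry, grid, frequency, and source are placeholders rather than the wind-tunnel configuration.

```python
# Minimal sketch of conventional frequency-domain beamforming on synthetic data,
# the building block that deconvolution methods such as CLEAN-SC refine.
# Array geometry, grid, and signals are placeholders, not the wind-tunnel setup.
import numpy as np

c = 343.0                  # speed of sound [m/s]
f = 4000.0                 # analysis frequency [Hz]
k = 2 * np.pi * f / c      # wavenumber

rng = np.random.default_rng(1)
mics = rng.uniform(-0.5, 0.5, size=(32, 3))   # random planar microphone array
mics[:, 2] = 0.0
grid = np.stack(np.meshgrid(np.linspace(-1, 1, 41),
                            np.linspace(-1, 1, 41), [2.0]), axis=-1).reshape(-1, 3)

# Synthetic cross-spectral matrix from one point source at (0.3, -0.2, 2.0).
src = np.array([0.3, -0.2, 2.0])
d_src = np.linalg.norm(mics - src, axis=1)
p = np.exp(-1j * k * d_src) / d_src                 # received pressures (one snapshot)
C = np.outer(p, p.conj())                           # cross-spectral matrix (rank 1)

# Steer to every grid point and evaluate the beamformer map.
dists = np.linalg.norm(grid[:, None, :] - mics[None, :, :], axis=2)   # (G, M)
g = np.exp(-1j * k * dists) / dists                                   # steering vectors
w = g / np.sum(np.abs(g) ** 2, axis=1, keepdims=True)                 # normalized weights
bf_map = np.real(np.einsum("gm,mn,gn->g", w.conj(), C, w))            # (G,)

peak = grid[np.argmax(bf_map)]
print("estimated source position:", peak)
```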
results: We find that different feature sets improve classification accuracy to different degrees and require different amounts of computational resources. The best results are obtained by combining the best features from each tool rather than relying on a single tool. To facilitate future research in music information retrieval, we release the tool's source code and benchmarks.
Abstract
This paper presents a comprehensive investigation of existing feature extraction tools for symbolic music and contrasts their performance to determine the set of features that best characterizes the musical style of a given music score. In this regard, we propose a novel feature extraction tool, named musif, and evaluate its efficacy on various repertoires and file formats, including MIDI, MusicXML, and **kern. Musif approximates existing tools such as jSymbolic and music21 in terms of computational efficiency while attempting to enhance the usability for custom feature development. The proposed tool also enhances classification accuracy when combined with other sets of features. We demonstrate the contribution of each set of features and the computational resources they require. Our findings indicate that the optimal tool for feature extraction is a combination of the best features from each tool rather than those of a single one. To facilitate future research in music information retrieval, we release the source code of the tool and benchmarks.
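For readers unfamiliar with symbolic feature extraction, the snippet below computes a few deliberately simple descriptors from a parsed score with music21, one of the tools compared in the paper. The score path is a placeholder, and musif, jSymbolic, and music21's own feature modules provide far richer feature sets.

```python
# Small illustration of symbolic feature extraction with music21
# (one of the tools compared in the paper). Path and features are placeholders.
from music21 import converter

score = converter.parse("path/to/score.musicxml")   # MusicXML, MIDI, or **kern
notes = list(score.flatten().notes)

pitches = [p.midi for n in notes for p in n.pitches]
duration_qls = [float(n.duration.quarterLength) for n in notes]

features = {
    "n_notes": len(notes),
    "pitch_range_semitones": max(pitches) - min(pitches) if pitches else 0,
    "mean_pitch_midi": sum(pitches) / len(pitches) if pitches else 0.0,
    "mean_duration_ql": sum(duration_qls) / len(duration_qls) if duration_qls else 0.0,
    "estimated_key": str(score.analyze("key")),
}
print(features)
```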
The smarty4covid dataset and knowledge base: a framework enabling interpretable analysis of audio signals
paper_authors: Konstantia Zarkogianni, Edmund Dervakos, George Filandrianos, Theofanis Ganitidis, Vasiliki Gkatzou, Aikaterini Sakagianni, Raghu Raghavendra, C. L. Max Nikias, Giorgos Stamou, Konstantina S. Nikita
for: Developing and validating a framework for generating counterfactual explanations in opaque AI-based COVID-19 risk detection models using the smarty4covid dataset.
methods: The smarty4covid dataset, crowd-sourced audio recordings of cough, regular breathing, deep breathing, and voice together with self-reported information (e.g. COVID-19 test results), is released as an OWL knowledge base and used to develop respiratory-indicator extraction and audio segmentation models.
results: A new framework that exploits the smarty4covid OWL knowledge base to generate counterfactual explanations for opaque AI-based COVID-19 risk detection models is proposed and experimentally validated.
Abstract
Harnessing the power of Artificial Intelligence (AI) and m-health towards detecting new bio-markers indicative of the onset and progress of respiratory abnormalities/conditions has greatly attracted the scientific and research interest especially during COVID-19 pandemic. The smarty4covid dataset contains audio signals of cough (4,676), regular breathing (4,665), deep breathing (4,695) and voice (4,291) as recorded by means of mobile devices following a crowd-sourcing approach. Other self reported information is also included (e.g. COVID-19 virus tests), thus providing a comprehensive dataset for the development of COVID-19 risk detection models. The smarty4covid dataset is released in the form of a web-ontology language (OWL) knowledge base enabling data consolidation from other relevant datasets, complex queries and reasoning. It has been utilized towards the development of models able to: (i) extract clinically informative respiratory indicators from regular breathing records, and (ii) identify cough, breath and voice segments in crowd-sourced audio recordings. A new framework utilizing the smarty4covid OWL knowledge base towards generating counterfactual explanations in opaque AI-based COVID-19 risk detection models is proposed and validated.
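To illustrate what querying an OWL knowledge base can look like in practice, here is a hypothetical rdflib sketch. The file path, namespace, class, and property names are invented placeholders and do not reflect the actual smarty4covid ontology.

```python
# Hypothetical sketch: querying an OWL/RDF knowledge base with rdflib.
# The file path, namespace, and property names are invented placeholders,
# not the actual smarty4covid ontology.
from rdflib import Graph

g = Graph()
g.parse("smarty4covid.owl")   # placeholder path to the released knowledge base

query = """
PREFIX ex: <http://example.org/smarty4covid#>
SELECT ?participant ?recording
WHERE {
    ?participant ex:hasCovidTestResult "positive" .
    ?participant ex:submitted ?recording .
    ?recording   a ex:CoughRecording .
}
"""
for row in g.query(query):
    print(row.participant, row.recording)
```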
Musical Excellence of Mridangam: an introductory review
methods: The review follows the scientific approach to musical analysis, starting from Dr. CV Raman's seminal research, introducing the basic scientific concepts used in Musical Excellence of Mridangam and briefly discussing earlier scientific studies of the instrument.
results: By working through the chapters of Musical Excellence of Mridangam, the review explains the discoveries about the Mridangam's unique tonal properties, covering its pitches, registers, timbre, and playing techniques, and concludes by summarizing the musical and scientific relevance of the work.
Abstract
This is an introductory review of Musical Excellence of Mridangam by Dr. Umayalpuram K Sivaraman, Dr. T Ramasami and Dr. Naresh, which is a scientific treatise exploring the unique tonal properties of the ancient Indian classical percussive instrument -- the Mridangam. This review aims to bridge the gap between the primary intended audience of Musical Excellence of Mridangam - listeners, artistes and makers -- and the scientific rigour with which the original treatise is written, by first introducing the concepts of musical analysis and then presenting and explaining the discoveries made within this context. The first three chapters of this review introduce the basic scientific concepts used in Musical Excellence of Mridangam and provides background to previous scientific research into this instrument, starting from the seminal work of Dr. CV Raman. This also includes brief discussions of the corresponding chapters in Musical Excellence of Mridangam. The next chapters all serve the purpose of explaining the main scientific results presented in Musical Excellence of Mridangam in each of the corresponding chapters in the treatise, and finally summarizing the relevance of the work.
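Much of the analysis the treatise builds on, going back to Raman's work, rests on examining the spectrum of a struck drumhead and how closely its partials approach harmonic ratios. The numpy sketch below shows the basic idea on a synthetic tone; it is a toy illustration, not an analysis taken from the book.

```python
# Toy illustration of the spectral analysis underlying the treatise:
# locate the prominent partials of a (synthetic) drum tone and compare
# their frequencies to multiples of the fundamental.
import numpy as np

sr = 44100
t = np.arange(0, 1.0, 1 / sr)
# Synthetic stand-in for a struck tonal drumhead: decaying near-harmonic partials.
partial_freqs = [110.0, 220.5, 331.0, 438.0]     # placeholder partials [Hz]
signal = sum(np.exp(-3 * (i + 1) * t) * np.sin(2 * np.pi * f * t)
             for i, f in enumerate(partial_freqs))

spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
freqs = np.fft.rfftfreq(len(signal), 1 / sr)

# Find local maxima, keep the four strongest, and sort them by frequency.
peaks = [i for i in range(1, len(spectrum) - 1)
         if spectrum[i] > spectrum[i - 1] and spectrum[i] > spectrum[i + 1]]
peaks.sort(key=lambda i: spectrum[i], reverse=True)
peak_freqs = sorted(freqs[p] for p in peaks[:4])

f0 = peak_freqs[0]
for f in peak_freqs:
    print(f"partial at {f:7.1f} Hz  ->  {f / f0:.2f} x fundamental")
```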
Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos
results: Extensive experiments show that our features outperform multiple state-of-the-art baselines on two public egocentric video datasets, EgoCom and EasyCom.
Abstract
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. In particular, our method leverages a masked auto-encoding framework to synthesize masked binaural audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downstream video tasks requiring spatial understanding in social scenarios: active speaker detection and spatial audio denoising. We show through extensive experiments that our features are generic enough to improve over multiple state-of-the-art baselines on two public challenging egocentric video datasets, EgoCom and EasyCom. Project: http://vision.cs.utexas.edu/projects/ego_av_corr.
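The following compact PyTorch sketch conveys the masked auto-encoding idea in this setting: hide a fraction of binaural-audio patch tokens, let a transformer attend jointly to visual tokens and the visible audio tokens, and train it to reconstruct the hidden audio. Dimensions, masking ratio, and architecture are placeholders, not the authors' model.

```python
# Hypothetical sketch of masked audio reconstruction conditioned on vision:
# mask some binaural-audio patch tokens, encode visual + visible audio tokens
# with a transformer, and regress the masked audio patches. Not the paper's model.
import torch
import torch.nn as nn

dim, n_audio, n_video, mask_ratio = 128, 64, 32, 0.5

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2)
audio_in = nn.Linear(40, dim)     # 40-dim binaural spectrogram patches (placeholder)
video_in = nn.Linear(512, dim)    # 512-dim visual patch features (placeholder)
decoder = nn.Linear(dim, 40)      # reconstruct audio patches
mask_token = nn.Parameter(torch.zeros(dim))

def masked_av_loss(audio_patches, video_feats):
    # audio_patches: (B, n_audio, 40), video_feats: (B, n_video, 512)
    audio_tok = audio_in(audio_patches)
    masked = torch.rand(audio_patches.shape[0], n_audio) < mask_ratio   # tokens to hide
    audio_tok = torch.where(masked.unsqueeze(-1),
                            mask_token.expand_as(audio_tok), audio_tok)
    tokens = torch.cat([video_in(video_feats), audio_tok], dim=1)
    out = encoder(tokens)[:, n_video:]                 # keep the audio positions
    recon = decoder(out)
    return ((recon - audio_patches) ** 2)[masked].mean()   # loss only on masked patches

loss = masked_av_loss(torch.randn(2, n_audio, 40), torch.randn(2, n_video, 512))
loss.backward()
print(float(loss))
```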
methods: The method starts from an initial set of neurons whose number is approximately equal to the number of typical structures in the data; for example, if the network is built for speech retrieval, the number of neurons must equal the number of phonemes used in the spoken language.
results: The study finds that this phoneme-retrieval technique can handle data such as speech and images, but its performance depends strongly on the learning samples; for instance, if the learning samples cover only a particular set of words, the network can only recognize those words.
Abstract
A phoneme-retrieval technique is proposed, which is due to the particular way of the construction of the network. An initial set of neurons is given. The number of these neurons is approximately equal to the number of typical structures of the data. For example if the network is built for voice retrieval then the number of neurons must be equal to the number of characteristic phonemes of the alphabet of the language spoken by the social group to which the particular person belongs. Usually this task is very complicated and the network can depend critically on the samples used for the learning. If the network is built for image retrieval then it works only if the data to be retrieved belong to a particular set of images. If the network is built for voice recognition it works only for some particular set of words. A typical example is the words used for the flight of airplanes. For example a command like the "airplane should make a turn of 120 degrees towards the east" can be easily recognized by the network if a suitable learning procedure is used.
results: The study finds that the proposed white-box optimization technique accurately recovers control functions that match a given sound. In a subjective evaluation, it outperforms evolutionary optimization algorithms and a neural network trained to predict control parameters from audio.
Abstract
Articulatory features can provide interpretable and flexible controls for the synthesis of human vocalizations by allowing the user to directly modify parameters like vocal strain or lip position. To make this manipulation through resynthesis possible, we need to estimate the features that result in a desired vocalization directly from audio recordings. In this work, we propose a white-box optimization technique for estimating glottal source parameters and vocal tract shapes from audio recordings of human vowels. The approach is based on inverse filtering and optimizing the frequency response of a waveguide model of the vocal tract with gradient descent, propagating error gradients through the mapping of articulatory features to the vocal tract area function. We apply this method to the task of matching the sound of the Pink Trombone, an interactive articulatory synthesizer, to a given vocalization. We find that our method accurately recovers control functions for audio generated by the Pink Trombone itself. We then compare our technique against evolutionary optimization algorithms and a neural network trained to predict control parameters from audio. A subjective evaluation finds that our approach outperforms these black-box optimization baselines on the task of reproducing human vocalizations.
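The sketch below illustrates the white-box idea in miniature with PyTorch: a toy differentiable spectral model whose response is a sum of resonances is fitted to a target log-magnitude spectrum by gradient descent on its parameters. It is an assumption-laden toy, not the paper's inverse-filtering or waveguide pipeline.

```python
# Toy version of white-box parameter estimation by gradient descent:
# fit the resonance frequencies/bandwidths/gains of a simple differentiable
# spectral model to a target log-magnitude spectrum. Not the paper's waveguide model.
import torch

freqs = torch.linspace(0.0, 4000.0, 512)             # analysis frequencies [Hz]

def resonator_response(centers, bandwidths, gains):
    # Sum of Gaussian-shaped resonances as a stand-in for formants.
    r = gains[:, None] * torch.exp(
        -((freqs[None, :] - centers[:, None]) / bandwidths[:, None]) ** 2)
    return torch.log(r.sum(dim=0).clamp_min(1e-6))    # log-magnitude response

# "Ground-truth" target made with known parameters (placeholders for a vowel).
with torch.no_grad():
    target = resonator_response(torch.tensor([700.0, 1200.0, 2600.0]),
                                torch.tensor([80.0, 100.0, 150.0]),
                                torch.tensor([1.0, 0.6, 0.3]))

# Learnable articulatory-like parameters, deliberately mis-initialised.
centers = torch.tensor([500.0, 1500.0, 2300.0], requires_grad=True)
bandwidths = torch.tensor([120.0, 120.0, 120.0], requires_grad=True)
gains = torch.tensor([0.5, 0.5, 0.5], requires_grad=True)
opt = torch.optim.Adam([{"params": [centers], "lr": 5.0},
                        {"params": [bandwidths], "lr": 1.0},
                        {"params": [gains], "lr": 0.01}])

for step in range(2000):
    loss = ((resonator_response(centers, bandwidths, gains) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("recovered resonance centers [Hz]:", centers.detach().round().tolist())
```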
VampNet: Music Generation via Masked Acoustic Token Modeling
methods: The paper uses a variable masking schedule during training, which allows different prompting approaches to be applied at inference time. The non-autoregressive model uses a bidirectional transformer architecture that attends to all tokens in a forward pass; with just 36 sampling passes it can generate coherent high-fidelity musical waveforms.
results: By prompting VampNet in different ways, it can be applied to tasks such as music compression, inpainting, outpainting, continuation, and looping with variation, while maintaining high-level aspects of the music such as style, genre, and instrumentation. This flexible prompting capability makes VampNet a powerful music co-creation tool.
Abstract
We introduce VampNet, a masked acoustic token modeling approach to music synthesis, compression, inpainting, and variation. We use a variable masking schedule during training which allows us to sample coherent music from the model by applying a variety of masking approaches (called prompts) during inference. VampNet is non-autoregressive, leveraging a bidirectional transformer architecture that attends to all tokens in a forward pass. With just 36 sampling passes, VampNet can generate coherent high-fidelity musical waveforms. We show that by prompting VampNet in various ways, we can apply it to tasks like music compression, inpainting, outpainting, continuation, and looping with variation (vamping). Appropriately prompted, VampNet is capable of maintaining style, genre, instrumentation, and other high-level aspects of the music. This flexible prompting capability makes VampNet a powerful music co-creation tool. Code and audio samples are available online.
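To give a feel for non-autoregressive masked token decoding, here is a toy PyTorch sketch of the iterative loop used by this family of models: start from a fully masked sequence, predict every position in parallel with a bidirectional model, keep the most confident predictions, and re-mask the rest according to a schedule over a fixed number of passes. The tiny untrained model, schedule, and sizes are placeholders, not VampNet's architecture or codec tokens.

```python
# Toy sketch of iterative masked-token decoding (MaskGIT-style), the family of
# procedures VampNet's sampling belongs to. Model and schedule are placeholders.
import math
import torch
import torch.nn as nn

VOCAB, MASK_ID, SEQ_LEN, N_PASSES = 256, 256, 64, 8

model = nn.Sequential(                       # stand-in bidirectional token model
    nn.Embedding(VOCAB + 1, 128),            # +1 for the mask token
    nn.TransformerEncoder(nn.TransformerEncoderLayer(128, 4, batch_first=True), 2),
    nn.Linear(128, VOCAB))

tokens = torch.full((1, SEQ_LEN), MASK_ID)   # start fully masked ("prompt" keeps nothing)

for step in range(N_PASSES):
    logits = model(tokens)                                    # (1, T, VOCAB), all positions at once
    probs = logits.softmax(dim=-1)
    sampled = torch.multinomial(probs[0], 1).squeeze(-1)      # one candidate token per position
    confidence = probs[0, torch.arange(SEQ_LEN), sampled]

    # Cosine schedule: fraction of positions that stay masked after this pass.
    mask_frac = math.cos((step + 1) / N_PASSES * math.pi / 2)
    n_keep_masked = int(mask_frac * SEQ_LEN)

    still_masked = tokens[0] == MASK_ID
    candidate = torch.where(still_masked, sampled, tokens[0])  # never overwrite fixed tokens
    if n_keep_masked > 0:
        conf = torch.where(still_masked, confidence,
                           torch.full_like(confidence, float("inf")))
        remask = conf.argsort()[:n_keep_masked]               # least confident stay masked
        candidate[remask] = MASK_ID
    tokens = candidate.unsqueeze(0)

print(tokens)   # a full sequence of (untrained, random-looking) acoustic token ids
```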