results: The musif package is primarily geared towards processing high-quality musicological data in MusicXML format, and also supports other formats commonly used in music information retrieval tasks, such as MIDI, MEI, and Kern. The authors provide detailed documentation and tutorials to help extend the framework and to help newcomers learn its usage.
Abstract
In this work, we introduce musif, a Python package that facilitates the automatic extraction of features from symbolic music scores. The package includes the implementation of a large number of features, which have been developed by a team of experts in musicology, music theory, statistics, and computer science. Additionally, the package allows for the easy creation of custom features using commonly available Python libraries. musif is primarily geared towards processing high-quality musicological data encoded in MusicXML format, but also supports other formats commonly used in music information retrieval tasks, including MIDI, MEI, Kern, and others. We provide comprehensive documentation and tutorials to aid in the extension of the framework and to facilitate the introduction of new and inexperienced users to its usage.
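The abstract notes that custom features can be created with commonly available Python libraries, but does not spell out musif's registration API. The following is therefore a hedged, library-agnostic sketch of the kind of per-part feature one might compute on parsed symbolic data; the `Note` structure and both feature functions are illustrative stand-ins, not part of musif.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Note:
    """Minimal stand-in for a parsed symbolic-score note (illustrative only)."""
    midi_pitch: int   # MIDI pitch number, e.g. 60 = middle C
    duration: float   # duration in quarter notes

def pitch_range(notes: List[Note]) -> int:
    """Custom feature: ambitus of a part, in semitones."""
    pitches = [n.midi_pitch for n in notes]
    return max(pitches) - min(pitches)

def note_density(notes: List[Note]) -> float:
    """Custom feature: notes per quarter note of total duration."""
    total = sum(n.duration for n in notes)
    return len(notes) / total if total else 0.0

melody = [Note(60, 1.0), Note(64, 0.5), Note(67, 0.5), Note(72, 2.0)]
print(pitch_range(melody))   # 12 semitones (C4 to C5)
print(note_density(melody))  # 4 notes over 4.0 quarter notes -> 1.0
```

In practice, such functions would receive note data parsed from MusicXML (e.g. via a library such as music21, which musif builds on) rather than hand-constructed objects.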
Multilingual Contextual Adapters To Improve Custom Word Recognition In Low-resource Languages
results: Improves retrieval of custom words in low-resource languages, achieving a 48% F1 improvement, along with a 5-11% word error rate reduction for the base CTC model.
Abstract
Connectionist Temporal Classification (CTC) models are popular for their balance between speed and performance for Automatic Speech Recognition (ASR). However, these CTC models still struggle in other areas, such as personalization towards custom words. A recent approach explores Contextual Adapters, wherein an attention-based biasing model for CTC is used to improve the recognition of custom entities. While this approach works well with enough data, we showcase that it isn't an effective strategy for low-resource languages. In this work, we propose a supervision loss for smoother training of the Contextual Adapters. Further, we explore a multilingual strategy to improve performance with limited training data. Our method achieves 48% F1 improvement in retrieving unseen custom entities for a low-resource language. Interestingly, as a by-product of training the Contextual Adapters, we see a 5-11% Word Error Rate (WER) reduction in the performance of the base CTC model as well.
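The abstract does not give the exact formulation of the Contextual Adapter or the proposed supervision loss, so the following is a small, dependency-free sketch of one plausible reading: an attention-based biasing module scores each custom entity against an acoustic query, and a cross-entropy supervision term pushes the attention onto the entity actually spoken. All dimensions and names are illustrative assumptions, not the paper's implementation.

```python
import math
from typing import List

def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def bias_attention(query: List[float], entity_keys: List[List[float]]) -> List[float]:
    """Attention weights of one acoustic-frame query over the custom-entity list
    (real Contextual Adapters also include a 'no-bias' slot; omitted for brevity)."""
    scores = [dot(query, k) for k in entity_keys]
    return softmax(scores)

def supervision_loss(attn: List[float], target_idx: int) -> float:
    """Cross-entropy supervision on the attention weights: encourages attending to
    the entity actually present in the utterance (a plausible reading of the
    supervision loss; the paper's exact formulation may differ)."""
    return -math.log(attn[target_idx] + 1e-12)

query = [0.2, 0.9, -0.1]
keys = [[0.1, 1.0, 0.0],    # entity 0: acoustically close to the query
        [-0.5, -0.2, 0.8],  # entity 1
        [0.0, 0.0, 0.0]]    # entity 2
attn = bias_attention(query, keys)
loss = supervision_loss(attn, target_idx=0)
print(attn, loss)
```

The intuition behind such a loss is that, with scarce data, the adapter's attention receives a direct training signal instead of having to discover the correct biasing behaviour indirectly through the CTC objective alone.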
An End-to-End Multi-Module Audio Deepfake Generation System for ADD Challenge 2023
results: After extensive comparative experiments across datasets and model structures, the model ultimately achieved a weighted deception success rate (WDSR) of 44.97% in Track 1.1 of the ADD 2023 challenge.
Abstract
The task of synthetic speech generation is to generate spoken content from a given text, thereby simulating a human voice. The key factors that determine the quality of synthetic speech generation include the speed of generation, the accuracy of word segmentation, and the naturalness of the synthesized speech. This paper builds an end-to-end multi-module synthetic speech generation model comprising a speaker encoder, a synthesizer based on Tacotron2, and a vocoder based on WaveRNN. In addition, we perform extensive comparative experiments on different datasets and various model structures. Finally, we won first place in Track 1.1 of the ADD 2023 challenge with a weighted deception success rate (WDSR) of 44.97%.
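The three-stage pipeline described above (speaker encoder → Tacotron2-style synthesizer → WaveRNN-style vocoder) can be sketched as a chain of modules. The classes below are trivial stubs that only illustrate the data flow; they are assumptions for illustration and do not reproduce Tacotron2, WaveRNN, or the authors' system.

```python
from typing import List

class SpeakerEncoder:
    """Stub: maps reference audio to a fixed-size speaker embedding."""
    def encode(self, reference_audio: List[float]) -> List[float]:
        # Real systems use a trained network; here, crude summary statistics.
        n = len(reference_audio)
        mean = sum(reference_audio) / n
        energy = sum(x * x for x in reference_audio) / n
        return [mean, energy]

class Synthesizer:
    """Stub for a Tacotron2-style text-to-mel model."""
    def text_to_mel(self, text: str, speaker_emb: List[float]) -> List[List[float]]:
        # One fake 2-dim "mel frame" per character, conditioned on the speaker embedding.
        return [[ord(c) / 1000.0 + speaker_emb[0], speaker_emb[1]] for c in text]

class Vocoder:
    """Stub for a WaveRNN-style mel-to-waveform model."""
    def mel_to_wave(self, mel: List[List[float]]) -> List[float]:
        # Flatten frames into a fake waveform; real vocoders upsample autoregressively.
        return [v for frame in mel for v in frame]

def synthesize(text: str, reference_audio: List[float]) -> List[float]:
    emb = SpeakerEncoder().encode(reference_audio)
    mel = Synthesizer().text_to_mel(text, emb)
    return Vocoder().mel_to_wave(mel)

wave = synthesize("hi", [0.0, 0.5, -0.5, 0.25])
print(len(wave))  # 2 characters x 2 dims per frame = 4 samples
```

The value of this modular design is that each stage can be trained and swapped independently, which is what makes the comparative experiments across model structures mentioned above practical.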