results: Compared with an existing state-of-the-art (SOTA) approach, the system is more controllable and more intelligible, and can cover a varying range of fundamental frequency (F0), energy, and speed modulation while maintaining converted speech quality.
Abstract
We propose a highly controllable voice manipulation system that can perform any-to-any voice conversion (VC) and prosody modulation simultaneously. State-of-the-art VC systems can transfer sentence-level characteristics such as speaker, emotion, and speaking style. However, manipulating frame-level prosody, such as pitch, energy, and speaking rate, remains challenging. Our proposed model utilizes a frame-level prosody feature to effectively transfer such properties. Specifically, pitch and energy trajectories are integrated in a prosody conditioning module and then fed, alongside speaker and content embeddings, to a diffusion-based decoder that generates the converted speech mel-spectrogram. To adjust the speaking rate, our system includes a post-processing step based on a self-supervised model, which allows improved controllability. The proposed model showed comparable speech quality and improved intelligibility compared to a SOTA approach. It can cover a varying range of fundamental frequency (F0), energy, and speed modulation while maintaining converted speech quality.
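As a rough illustration of the conditioning path described above, the sketch below combines frame-level F0 and energy trajectories with content and speaker embeddings into a single conditioning sequence for a decoder. All module names and dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ProsodyConditioner(nn.Module):
    """Hypothetical prosody conditioning module: projects frame-level F0 and
    energy trajectories and merges them with content and speaker embeddings
    into the conditioning sequence a diffusion decoder would consume."""

    def __init__(self, cond_dim=256, spk_dim=128):
        super().__init__()
        # two scalar trajectories per frame (F0, energy) -> cond_dim channels
        self.prosody_proj = nn.Conv1d(2, cond_dim, kernel_size=3, padding=1)
        self.spk_proj = nn.Linear(spk_dim, cond_dim)

    def forward(self, f0, energy, content, spk_emb):
        # f0, energy: (B, T); content: (B, T, D); spk_emb: (B, spk_dim)
        prosody = torch.stack([f0, energy], dim=1)            # (B, 2, T)
        prosody = self.prosody_proj(prosody).transpose(1, 2)  # (B, T, D)
        spk = self.spk_proj(spk_emb).unsqueeze(1)             # (B, 1, D)
        return content + prosody + spk                        # (B, T, D)

# Toy usage with random tensors (80 frames).
cond = ProsodyConditioner()(torch.randn(1, 80), torch.randn(1, 80),
                            torch.randn(1, 80, 256), torch.randn(1, 128))
```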
Leveraging Geometrical Acoustic Simulations of Spatial Room Impulse Responses for Improved Sound Event Detection and Localization
results: Experiments show that geometrical acoustic simulation can provide performance similar to that of a real SRIR dataset and can be used to augment existing datasets.
Abstract
As deeper and more complex models are developed for the task of sound event localization and detection (SELD), the demand for annotated spatial audio data continues to increase. Annotating field recordings with 360$^{\circ}$ video takes many hours from trained annotators, while recording events within motion-tracked laboratories is bounded by cost and expertise. Because of this, localization models rely on a relatively limited amount of spatial audio data in the form of spatial room impulse response (SRIR) datasets, which limits the progress of increasingly deep neural-network-based approaches. In this work, we demonstrate that simulated geometrical acoustics can provide an appealing solution to this problem. We use simulated geometrical acoustics to generate a novel SRIR dataset that can train a SELD model to provide performance similar to that of a real SRIR dataset. Furthermore, we demonstrate using simulated data to augment existing datasets, improving on benchmarks set by state-of-the-art SELD models. We explore the potential and limitations of geometric acoustic simulation for localization and event detection. We also propose further studies to verify the limitations of this method, as well as further methods to generate synthetic data for SELD tasks without the need to record more data.
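For readers unfamiliar with geometrical acoustics simulation, the snippet below generates a single-channel room impulse response with the image-source method via the pyroomacoustics library. An SRIR pipeline like the one described above would instead simulate a microphone array or Ambisonics receiver; the room parameters here are arbitrary examples, not the paper's setup.

```python
import numpy as np
import pyroomacoustics as pra  # image-source geometrical acoustics

fs = 24000
# Arbitrary shoebox room: dimensions in meters, flat absorption coefficient.
room = pra.ShoeBox([9.0, 7.0, 3.5], fs=fs,
                   materials=pra.Material(0.3), max_order=20)
room.add_source([2.5, 3.0, 1.6])           # sound event position
mic_loc = np.array([[5.0], [4.0], [1.5]])  # (3, n_mics) receiver position
room.add_microphone_array(pra.MicrophoneArray(mic_loc, fs))
room.compute_rir()
rir = room.rir[0][0]  # impulse response for mic 0, source 0
```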
Presenting the SWTC: A Symbolic Corpus of Themes from John Williams’ Star Wars Episodes I-IX
paper_authors: Claire Arthur, Frank Lehman, John McNamara
for: This paper presents a new symbolic corpus of musical themes from the complete Star Wars trilogies (Episodes I-IX) by John Williams.
methods: The corpus files are made available in multiple formats (.krn, .sib, and .musicxml) and include melodic, harmonic, and formal information. The authors also introduce a new Humdrum standard for non-functional harmony encodings, **harte, based on Harte (2005, 2010).
results: The Star Wars Thematic Corpus (SWTC) contains a total of 64 distinctive, recurring, and symbolically meaningful themes and motifs, commonly referred to as leitmotifs. The authors provide some brief summary statistics and hope that the SWTC will provide insights into John Williams' compositional style and be useful in comparisons against other thematic corpora from film and beyond.
Abstract
This paper presents a new symbolic corpus of musical themes from the complete Star Wars trilogies (Episodes I-IX) by John Williams. The corpus files are made available in multiple formats (.krn, .sib, and .musicxml) and include melodic, harmonic, and formal information. The Star Wars Thematic Corpus (SWTC) contains a total of 64 distinctive, recurring, and symbolically meaningful themes and motifs, commonly referred to as leitmotifs. Through this corpus we also introduce a new Humdrum standard for non-functional harmony encodings, **harte, based on Harte (2005, 2010). This report details the motivation, describes the transcription and encoding processes, and provides some brief summary statistics. While relatively small in scale, the SWTC represents a unified collection from one of the most prolific and influential composers of the 20th century, and from the under-studied domain of film and multimedia musical material in general. We hope the SWTC will provide insights into John Williams' compositional style, as well as prove useful in comparisons against other thematic corpora from film and beyond.
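To make the **harte encoding concrete, here is a minimal sketch of parsing a Harte-style chord label (root, quality shorthand, optional bass degree). The regular expression is a simplification of Harte's full grammar (it omits explicit degree lists, for example) and is not taken from the SWTC tooling.

```python
import re

# Simplified Harte chord-label grammar: root[:quality][/bass-degree].
HARTE_RE = re.compile(r"^(?P<root>[A-G][b#]*)"
                      r"(?::(?P<quality>[A-Za-z0-9()*,#]+))?"
                      r"(?:/(?P<bass>[b#]*\d+))?$")

def parse_harte(label):
    """Return the root, quality, and bass components of a chord label."""
    m = HARTE_RE.match(label)
    if m is None:
        raise ValueError(f"not a valid Harte label: {label}")
    return m.groupdict()

print(parse_harte("Eb:min7/b3"))
# -> {'root': 'Eb', 'quality': 'min7', 'bass': 'b3'}
```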
Real-time auralization for performers on virtual stages
results: The article presents a calibrated auralization system for hearing one's own and others' instruments, and demonstrates the feasibility of the proposed setup through objective and subjective experiments.
Abstract
This article presents an interactive system for stage acoustics experimentation, including considerations for hearing one's own and others' instruments. The quality of real-time auralization systems for psychophysical experiments on music performance depends on the system's calibration and latency, among other factors (e.g., visuals, simulation methods, haptics). The presented system focuses on the acoustic considerations for laboratory implementations. The calibration is implemented as a set of filters accounting for the microphone-instrument distances and the directivity factors, as well as the transducers' frequency responses. Moreover, sources of error are characterized using both state-of-the-art information and derivations from the mathematical definition of the calibration filter. To compensate for hardware latency without cropping parts of the simulated impulse responses, the virtual direct sound of musicians hearing themselves is omitted from the simulation and addressed by letting the actual direct sound reach the listener through open headphones. The required latency compensation of the interactive part (i.e., hearing others) meets the minimum distance requirement between musicians, which is 2 m for the implemented system. Finally, a proof of concept is provided that includes objective and subjective experiments, which support the feasibility of the proposed setup.
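A quick back-of-the-envelope check of the latency argument above: the hardware latency that the hearing-others path must hide fits inside the acoustic travel time over the minimum musician separation. With the stated 2 m spacing, that gives roughly a 5.8 ms budget. The snippet below is just this arithmetic, not the paper's code.

```python
c = 343.0    # approximate speed of sound in air at 20 C, m/s
d_min = 2.0  # minimum distance between musicians in the implemented system, m

# Travel time over d_min bounds the hardware latency that can be hidden.
budget_ms = d_min / c * 1e3
print(f"latency budget: {budget_ms:.1f} ms")  # ~5.8 ms
```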
Self-Supervised Disentanglement of Harmonic and Rhythmic Features in Music Audio Signals
results: The method is evaluated with a predictor-based disentanglement metric on the learned features and is applied to the automatic generation of music remixes.
Abstract
The aim of latent variable disentanglement is to infer the multiple informative latent representations that lie behind a data generation process; it is a key factor in controllable data generation. In this paper, we propose a deep-neural-network-based self-supervised learning method to infer the disentangled rhythmic and harmonic representations behind music audio generation. We train a variational autoencoder that generates an audio mel-spectrogram from two latent features representing the rhythmic and harmonic content. In the training phase, the variational autoencoder is trained to reconstruct the input mel-spectrogram given its pitch-shifted version. At each forward computation in the training phase, a vector rotation operation is applied to one of the latent features, assuming that the dimensions of the feature vectors are related to pitch intervals. Therefore, in the trained variational autoencoder, the rotated latent feature represents the pitch-related information of the mel-spectrogram, and the unrotated latent feature represents the pitch-invariant information, i.e., the rhythmic content. The proposed method was evaluated using a predictor-based disentanglement metric on the learned features. Furthermore, we demonstrate its application to the automatic generation of music remixes.
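One simple reading of the rotation constraint is a circular shift of the harmonic latent by as many dimensions as the pitch shift in semitones, as sketched below. The paper's exact rotation operation and training loop may differ; the encoder/decoder in the comments are placeholders, not the authors' code.

```python
import torch

def rotate_latent(z_harm, semitones):
    """Circularly shift the harmonic latent so that a pitch shift of k
    semitones corresponds to a shift of k dimensions (one possible
    realization of the paper's vector rotation)."""
    return torch.roll(z_harm, shifts=semitones, dims=-1)

# Training-step sketch (pseudo-wiring with placeholder encoder/decoder):
#   z_rhythm, z_harm = encoder(pitch_shifted_mel)      # two latent features
#   mel_hat = decoder(z_rhythm, rotate_latent(z_harm, -k))
#   loss = reconstruction(mel_hat, original_mel) + kl_terms

z = rotate_latent(torch.randn(1, 64), semitones=3)  # toy call
```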
Simultaneous Measurement of Multiple Acoustic Attributes Using Structured Periodic Test Signals Including Music and Other Sound Materials
for: This paper presents a framework for measuring acoustic attributes, including the linear time-invariant (LTI) response, the signal-dependent time-invariant (SDTI) component, and the random and time-varying (RTV) component.
methods: The framework measures acoustic attributes using structured periodic test signals, and music pieces and other sound materials can also be used as test signals.
results: The authors implemented interactive, real-time measurement tools based on the framework and released them as open source. They also used the framework to objectively assess the performance of pitch extractors.
Abstract
We introduce a general framework for measuring acoustic properties such as the linear time-invariant (LTI) response, the signal-dependent time-invariant (SDTI) component, and the random and time-varying (RTV) component simultaneously using structured periodic test signals. The framework also enables the use of music pieces and other sound materials as test signals by "safeguarding" them with slight deterministic "noise." Measurements using the swept-sine, MLS (Maximum Length Sequence), and their variants are special cases of the proposed framework. We implemented interactive and real-time measuring tools based on this framework and made them open-source. Furthermore, we applied this framework to assess pitch extractors objectively.
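As a concrete example of one of the classical special cases mentioned above, the snippet below generates an exponential (log) sine sweep, the standard swept-sine test signal. The parameters are arbitrary defaults, not those used by the authors' tools.

```python
import numpy as np

def exponential_sweep(f1=20.0, f2=20000.0, duration=5.0, fs=48000):
    """Exponential sine sweep x(t) = sin(2*pi*f1*L*(exp(t/L) - 1)),
    with L = T / ln(f2/f1), sweeping from f1 to f2 over `duration` s."""
    t = np.arange(int(duration * fs)) / fs
    L = duration / np.log(f2 / f1)
    return np.sin(2 * np.pi * f1 * L * (np.exp(t / L) - 1.0))

sweep = exponential_sweep()  # 5 s sweep from 20 Hz to 20 kHz at 48 kHz
```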
MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023
results: The system achieved mean quality scores of 4.3 and 4.5 in the two tasks, statistically comparable with natural speech, while maintaining good similarity according to the similarity assessment. These results demonstrate the effectiveness of the system in both tasks.
Abstract
In this paper, we present MuLanTTS, the Microsoft end-to-end neural text-to-speech (TTS) system designed for the Blizzard Challenge 2023. About 50 hours of French audiobook corpus for the hub task and another 2 hours of speaker-adaptation data for the spoke task are released to build synthesized voices for different test purposes, including sentences, paragraphs, homographs, lists, etc. Building upon DelightfulTTS, we adopt contextual and emotion encoders to adapt to the audiobook data, enriching the model beyond sentences for long-form prosody and dialogue expressiveness. Regarding recording quality, we also apply denoising algorithms and long-audio processing to both corpora. For the hub task, only the 50-hour single-speaker data is used to build the TTS system, while for the spoke task, a multi-speaker source model is used for target-speaker fine-tuning. MuLanTTS achieves mean quality-assessment scores of 4.3 and 4.5 in the respective tasks, statistically comparable with natural speech, while keeping good similarity according to the similarity assessment. The excellent quality and similarity in this year's new and dense statistical evaluation show the effectiveness of our proposed system in both tasks.