results: The experiment finds that men's vocals are on average treated with less reverberation and occupy a narrower position in the stereo mix than women's vocals.
Abstract
The Collaborative Song Dataset (CoSoD) is a corpus of 331 multi-artist collaborations from the 2010-2019 Billboard "Hot 100" year-end charts. The corpus is annotated with formal sections, aspects of vocal production (including reverberation, layering, panning, and gender of the performers), and relevant metadata. CoSoD complements other popular music datasets by focusing exclusively on musical collaborations between independent acts. In addition to facilitating the study of song form and vocal production, CoSoD allows for the in-depth study of gender as it relates to various timbral, pitch, and formal parameters in musical collaborations. In this paper, we detail the contents of the dataset and outline the annotation process. We also present an experiment using CoSoD that examines how the use of reverberation, layering, and panning relates to the gender of the artist. In this experiment, we find that men's voices are on average treated with less reverberation and occupy a narrower position in the stereo mix than women's voices.
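As a loose illustration of the kind of analysis this experiment involves, the sketch below compares reverberation and stereo width by performer gender. It assumes the annotations have been exported to a CSV with hypothetical columns `gender`, `reverb`, and `pan_width`; the actual CoSoD field names, scales, and file layout may differ.

```python
# Sketch: compare vocal-production annotations by performer gender.
# Column names and the CSV export are illustrative assumptions, not
# the dataset's actual schema.
import pandas as pd
from scipy import stats

df = pd.read_csv("cosod_annotations.csv")  # hypothetical export

# Mean reverberation and stereo width per gender group.
summary = df.groupby("gender")[["reverb", "pan_width"]].mean()
print(summary)

# A simple two-sample test of the reverberation difference.
men = df.loc[df["gender"] == "M", "reverb"].dropna()
women = df.loc[df["gender"] == "F", "reverb"].dropna()
t, p = stats.ttest_ind(men, women, equal_var=False)  # Welch's t-test
print(f"Welch's t = {t:.2f}, p = {p:.3g}")
```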
The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task
results: Experimental results show that our system achieves high translation accuracy, speech naturalness, sound quality, and speaker similarity, and that it is robust to multi-source data.
Abstract
This paper describes the NPU-MSXF system for the IWSLT 2023 speech-to-speech translation (S2ST) task, which aims to translate multi-source English speech into Chinese speech. The system is built in a cascaded manner consisting of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS). We make tremendous efforts to handle the challenging multi-source input. Specifically, to improve robustness to multi-source speech input, we adopt various data augmentation strategies and a ROVER-based score fusion over multiple ASR model outputs. To better handle noisy ASR transcripts, we introduce a three-stage fine-tuning strategy to improve translation accuracy. Finally, we build a TTS model with high naturalness and sound quality that leverages a two-stage framework, using network bottleneck features as a robust intermediate representation to disentangle speaker timbre from linguistic content. Based on this two-stage framework, a pre-trained speaker embedding is used as a condition to transfer the speaker timbre of the source English speech to the translated Chinese speech. Experimental results show that our system achieves high translation accuracy, speech naturalness, sound quality, and speaker similarity, and that it is robust to multi-source data.
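The cascaded design can be summarized in a few lines. The skeleton below is an illustrative sketch, not the authors' code: `asr_models`, `rover_fuse`, `translate`, `tts_synthesize`, and `speaker_embed` are placeholder callables standing in for the ASR ensemble, ROVER-based fusion, fine-tuned MT model, two-stage TTS, and speaker-embedding extractor the abstract describes.

```python
# Illustrative skeleton of a cascaded S2ST pipeline (ASR -> MT -> TTS).
# Every function argument is a placeholder for a component described in
# the paper; none of this is the authors' actual implementation.

def speech_to_speech(english_audio, asr_models, rover_fuse,
                     translate, tts_synthesize, speaker_embed):
    # 1. Run several ASR models and fuse their hypotheses (ROVER-style
    #    voting) for robustness to multi-source input.
    hypotheses = [asr(english_audio) for asr in asr_models]
    transcript = rover_fuse(hypotheses)

    # 2. Translate the (possibly noisy) transcript; the paper fine-tunes
    #    the MT model in three stages to tolerate ASR errors.
    chinese_text = translate(transcript)

    # 3. Synthesize Chinese speech, conditioning on a pre-trained speaker
    #    embedding from the source audio to transfer the speaker timbre.
    embedding = speaker_embed(english_audio)
    return tts_synthesize(chinese_text, speaker_embedding=embedding)
```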
Exploiting an External Microphone for Binaural RTF-Vector-Based Direction of Arrival Estimation for Multiple Speakers
results: Exploiting an external microphone alongside the hearing aid microphones, a low-complexity DOA estimation method is proposed; for two speakers in babble noise it achieves DOA estimation performance comparable to the CW method at a lower computational complexity.
Abstract
In hearing aid applications, an important objective is to accurately estimate the direction of arrival (DOA) of multiple speakers in noisy and reverberant environments. Recently, we proposed a binaural DOA estimation method, where the DOAs of the speakers are estimated by selecting the directions for which the so-called Hermitian angle spectrum between the estimated relative transfer function (RTF) vector and a database of prototype anechoic RTF vectors is maximized. The RTF vector is estimated using the covariance whitening (CW) method, which requires a computationally complex generalized eigenvalue decomposition. The spatial spectrum is obtained by only considering frequencies where it is likely that one speaker dominates over the other speakers, noise and reverberation. In this contribution, we exploit the availability of an external microphone that is spatially separated from the hearing aid microphones and consider a low-complexity RTF vector estimation method that assumes a low spatial coherence between the undesired components in the external microphone and the hearing aid microphones. Using recordings of two speakers and diffuse-like babble noise in acoustic environments with mild reverberation and low signal-to-noise ratio, simulation results show that the proposed method yields DOA estimation performance comparable to the CW method at a lower computational complexity.
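The core matching step can be sketched in plain NumPy. The Hermitian angle between an estimated RTF vector a and a prototype b satisfies cos(theta) = |a^H b| / (||a|| ||b||), and the DOA is taken as the prototype direction with the highest such similarity. The array shapes and the plain averaging over all frequency bins are illustrative simplifications; the paper restricts the spectrum to bins likely dominated by a single speaker.

```python
# Sketch: select the DOA whose prototype anechoic RTF vector is closest
# (in Hermitian angle) to the estimated RTF vector.
import numpy as np

def doa_from_rtf(rtf_est, prototypes, angles):
    """rtf_est: (F, M) estimated RTF vectors (F freq bins, M mics).
    prototypes: (D, F, M) database of anechoic RTF vectors, D directions.
    angles: (D,) candidate DOAs in degrees.
    Averaging over all bins is a simplification of the paper's
    single-speaker-dominated bin selection."""
    # cos of the Hermitian angle per direction and bin: |a^H b| / (|a||b|)
    num = np.abs(np.einsum("fm,dfm->df", rtf_est.conj(), prototypes))
    den = (np.linalg.norm(rtf_est, axis=-1)[None, :]
           * np.linalg.norm(prototypes, axis=-1))
    spectrum = (num / den).mean(axis=1)   # average over frequency bins
    return angles[np.argmax(spectrum)]    # best-matching direction
```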
HCLAS-X: Hierarchical and Cascaded Lyrics Alignment System Using Multimodal Cross-Correlation
results: Our proposed model shows a significant improvement in mean average error in comparative experiments, and performs well in practice after deployment in several music streaming services.
Abstract
In this work, we address the challenge of lyrics alignment, which involves aligning the lyrics and vocal components of songs. This problem requires aligning two distinct modalities, namely text and audio. To overcome this challenge, we propose a model trained in a supervised manner that utilizes the cross-correlation matrix of latent representations between vocals and lyrics. Our system is designed in a hierarchical and cascaded manner: it first predicts synchronized timing at the sentence level and subsequently refines it at the word level. This design enables the system to process long sequences, as the cross-correlation uses memory that is quadratic in sequence length. In our experiments, we demonstrate that our proposed system achieves a significant improvement in mean average error, showcasing its robustness in comparison to the previous state-of-the-art model. Additionally, we conduct a qualitative analysis of the system after successfully deploying it in several music streaming services.
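To make the memory argument concrete, here is a minimal sketch of the cross-correlation (similarity) matrix between lyric-token and audio-frame latents, assuming both encoders have already projected into a shared latent space; the tensors are synthetic stand-ins, and the quadratic term is the T_text x T_audio matrix itself.

```python
# Sketch: cross-correlation matrix between lyric and vocal latents.
# The embeddings below are random stand-ins for encoder outputs.
import torch

T_text, T_audio, d = 128, 4096, 256
text_latents = torch.randn(T_text, d)    # one embedding per lyric token
audio_latents = torch.randn(T_audio, d)  # one embedding per audio frame

# Normalize, then take the full similarity matrix: (T_text, T_audio).
# Memory grows with T_text * T_audio, which is why a hierarchical system
# aligns sentences first and only refines word timings within a sentence.
text_n = torch.nn.functional.normalize(text_latents, dim=-1)
audio_n = torch.nn.functional.normalize(audio_latents, dim=-1)
corr = text_n @ audio_n.T

# A naive alignment readout: for each token, its best-matching frame.
best_frames = corr.argmax(dim=-1)
```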