results: cleaner target renderings and improved separability from unwanted sounds, with significant improvement in perceived quality
Abstract
Common target sound extraction (TSE) approaches rely primarily on discriminative methods to separate the target sound while minimizing interference from unwanted sources, with varying success in separating the target from the background. This study introduces DPM-TSE, the first generative method based on diffusion probabilistic modeling (DPM) for target sound extraction, achieving both cleaner target renderings and improved separability from unwanted sounds. The technique also tackles the background noise commonly left by DPM by introducing a correction method for noise schedules and sample steps. The approach is evaluated using both objective and subjective quality metrics on the FSD Kaggle 2018 dataset. The results show that DPM-TSE yields a significant improvement in perceived quality in terms of target extraction and purity.
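The background noise mentioned above is commonly traced to noise schedules whose final step does not reach zero SNR, so the reverse process never truly starts from pure noise. As a minimal sketch of that class of fix, the snippet below rescales a beta schedule to enforce zero terminal SNR; this is one published correction and only an assumption about what DPM-TSE's correction looks like, not the paper's exact method.

```python
import numpy as np

def rescale_to_zero_terminal_snr(betas: np.ndarray) -> np.ndarray:
    """Rescale a diffusion noise schedule so the final step has zero SNR.

    A hedged sketch of the "zero terminal SNR" correction often used to remove
    the residual noise floor in diffusion outputs; DPM-TSE's exact correction
    for noise schedules and sample steps may differ.
    """
    alphas = 1.0 - betas
    alphas_cumprod = np.cumprod(alphas)
    sqrt_ac = np.sqrt(alphas_cumprod)

    # Shift so the last sqrt(alpha_bar) becomes exactly 0 (pure noise at t = T),
    # then rescale so the first value is unchanged.
    sqrt_ac_0, sqrt_ac_T = sqrt_ac[0], sqrt_ac[-1]
    sqrt_ac = sqrt_ac - sqrt_ac_T
    sqrt_ac *= sqrt_ac_0 / (sqrt_ac_0 - sqrt_ac_T)

    # Convert back to per-step betas.
    alphas_cumprod = sqrt_ac ** 2
    alphas = alphas_cumprod[1:] / alphas_cumprod[:-1]
    alphas = np.concatenate([alphas_cumprod[:1], alphas])
    return 1.0 - alphas
```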
Analysis on the Influence of Synchronization Error on Fixed-filter Active Noise Control
for: This study aims to investigate the synchronization error of the digital Active Noise Control (ANC) system.
methods: The study adopts the fixed-filter strategy, a viable alternative to traditional adaptive algorithms for addressing their computational complexity and instability, albeit with a potential trade-off in noise reduction efficacy.
results: The study provides a theoretical investigation into the synchronization error of the digital ANC system.
Abstract
The efficacy of active noise control technology in mitigating urban noise, particularly in relation to low-frequency components, has been well-established. In the realm of traditional academic research, adaptive algorithms, such as the filtered reference least mean square method, are extensively employed to achieve real-time noise reduction in many applications. Nevertheless, the utilization of this technology in commercial goods is often hindered by its significant computing complexity and inherent instability. In this particular scenario, the adoption of the fixed-filter strategy emerges as a viable alternative for addressing these challenges, albeit with a potential trade-off in terms of noise reduction efficacy. This work aims to conduct a theoretical investigation into the synchronization error of the digital Active Noise Control (ANC) system. Keywords: Fixed-filter, Active noise control, Multichannel active noise control.
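For context on the adaptive baseline the abstract contrasts with, below is a minimal single-channel filtered-reference LMS (FxLMS) sketch; the fixed-filter strategy corresponds to freezing the control filter `w` (dropping the final update line) and running a pre-designed filter instead. Signal names, filter length, and step size are illustrative assumptions, not values from the paper.

```python
import numpy as np

def fxlms(x, d, s_hat, filter_len=64, mu=1e-3):
    """Single-channel filtered-reference LMS (FxLMS) sketch.

    x      : reference noise signal
    d      : disturbance measured at the error microphone
    s_hat  : estimated secondary-path impulse response
    Returns the residual error e[n] at the error microphone.
    """
    w = np.zeros(filter_len)                  # adaptive control filter
    x_buf = np.zeros(filter_len)              # reference buffer for the control filter
    xf = np.convolve(x, s_hat)[: len(x)]      # reference filtered by the secondary-path model
    xf_buf = np.zeros(filter_len)
    y_buf = np.zeros(len(s_hat))              # control-signal buffer for the secondary path
    e = np.zeros(len(x))

    for n in range(len(x)):
        x_buf = np.roll(x_buf, 1); x_buf[0] = x[n]
        y = w @ x_buf                          # anti-noise (control) signal
        y_buf = np.roll(y_buf, 1); y_buf[0] = y
        ys = s_hat @ y_buf                     # anti-noise after the secondary path
        e[n] = d[n] - ys                       # residual at the error microphone
        xf_buf = np.roll(xf_buf, 1); xf_buf[0] = xf[n]
        w += mu * e[n] * xf_buf                # LMS update with the filtered reference
    return e
```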
U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning
paper_authors: Tao Li, Zhichao Wang, Xinfa Zhu, Jian Cong, Qiao Tian, Yuping Wang, Lei Xie
for: zero-shot speaker cloning, synthesize speech for any target speaker unseen during TTS system building
methods: employ Grad-TTS as the backbone, cascade speaker- and style-specific encoders between text encoder and diffusion decoder, use signal perturbation to explicitly decompose into speaker- and style-specific modeling parts
results: significantly surpass state-of-the-art methods in unseen speaker cloning regarding naturalness and speaker similarity, achieve flexible combinations of desired speaker timbre and style in zero-shot voice cloning
Abstract
Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference of the speaker at hand. Although more practical in real applications, current zero-shot methods still produce speech with unsatisfactory naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zero-shot setup has not been considered. This is because the unique challenge of zero-shot speaker and style cloning is to learn disentangled speaker and style representations from only short references representing an arbitrary speaker and an arbitrary style. To address this challenge, we propose U-Style, which employs Grad-TTS as the backbone, cascading a speaker-specific encoder and a style-specific encoder between the text encoder and the diffusion decoder. Thus, leveraging signal perturbation, U-Style is explicitly decomposed into speaker- and style-specific modeling parts, achieving better speaker and style disentanglement. To improve unseen speaker and style modeling ability, these two encoders conduct multi-level speaker and style modeling by skip-connected U-nets, incorporating the representation extraction and information reconstruction process. In addition, to improve the naturalness of synthetic speech, we adopt mean-based instance normalization and style-adaptive layer normalization in these encoders to perform representation extraction and condition adaptation, respectively. Experiments show that U-Style significantly surpasses state-of-the-art methods in unseen speaker cloning regarding naturalness and speaker similarity. Notably, U-Style can transfer the style from an unseen source speaker to another unseen target speaker, achieving flexible combinations of desired speaker timbre and style in zero-shot voice cloning.
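As an illustration of the style-adaptive layer normalization mentioned above, the sketch below predicts the LayerNorm scale and bias from a style embedding; the hidden sizes and its exact placement inside U-Style's encoders are assumptions.

```python
import torch
import torch.nn as nn

class StyleAdaptiveLayerNorm(nn.Module):
    """Style-adaptive layer normalization: the affine scale and bias of
    LayerNorm are predicted from a style (or speaker) embedding instead of
    being fixed learned parameters. A minimal sketch, not U-Style's exact module."""

    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)  # predicts (gamma, beta)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim), style: (batch, style_dim)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * self.norm(x) + beta.unsqueeze(1)
```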
results: Using these PCA-fingerprints, an 89% successful recommendation rate (the recommended song's genre matched the target song's genre) was achieved on 200 songs from a personal music library, each tagged with the corresponding artists' genres.
Abstract
This work combined different audio features to obtain a more robust fingerprint to be used in a music recommendation process. The combination of these methods resulted in a high-dimensional vector. To reduce the number of values, PCA was applied to the set of resulting fingerprints, selecting the number of principal components that corresponded to an explained variance of $95\%$. Finally, with these PCA-fingerprints, the similarity matrix of each fingerprint with the entire data set was calculated. The process was applied to 200 songs from a personal music library; the songs were tagged with the artists' corresponding genres. The recommendations (fingerprints of songs with the closest similarity) were rated successful if the recommended songs' genre matched the target songs' genre. With this procedure, it was possible to obtain an accuracy of $89\%$ (successful recommendations out of total recommendation requests).
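A minimal sketch of the described pipeline, assuming scikit-learn and cosine similarity as the similarity measure (the abstract does not name the measure): reduce the combined fingerprints with PCA at 95% explained variance, build the similarity matrix, and return the nearest songs for a query.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

def recommend(fingerprints: np.ndarray, query_idx: int, top_k: int = 5) -> np.ndarray:
    """Return indices of the top_k songs most similar to the query song.

    fingerprints: (n_songs, n_features) matrix of the concatenated audio features.
    """
    pca = PCA(n_components=0.95)              # keep components explaining 95% of variance
    reduced = pca.fit_transform(fingerprints)
    sims = cosine_similarity(reduced)         # full similarity matrix over the library
    order = np.argsort(-sims[query_idx])      # most similar first; the query itself is at 0
    return order[1 : top_k + 1]
```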
Layer-Adapted Implicit Distribution Alignment Networks for Cross-Corpus Speech Emotion Recognition
paper_authors: Yan Zhao, Yuan Zong, Jincen Wang, Hailun Lian, Cheng Lu, Li Zhao, Wenming Zheng
for: The paper proposes a new unsupervised domain adaptation method called LIDAN to address the challenge of cross-corpus speech emotion recognition.
methods: LIDAN extends the previous ICASSP work, DIDAN, by introducing a novel regularization term called layer-adapted implicit distribution alignment (LIDA) that considers emotion labels at different levels of granularity.
results: LIDAN surpasses recent state-of-the-art explicit unsupervised DA methods in tackling cross-corpus SER tasks, as demonstrated by extensive experiments on EmoDB, eNTERFACE, and CASIA corpora.
Abstract
In this paper, we propose a new unsupervised domain adaptation (DA) method called layer-adapted implicit distribution alignment networks (LIDAN) to address the challenge of cross-corpus speech emotion recognition (SER). LIDAN extends our previous ICASSP work, deep implicit distribution alignment networks (DIDAN), whose key contribution lies in the introduction of a novel regularization term called implicit distribution alignment (IDA). This term allows DIDAN trained on source (training) speech samples to remain applicable to predicting emotion labels for target (testing) speech samples, regardless of corpus variance in cross-corpus SER. To further enhance this method, we extend IDA to layer-adapted IDA (LIDA), resulting in LIDAN. This layer-adapted extension consists of three modified IDA terms that consider emotion labels at different levels of granularity. These terms are strategically arranged within different fully connected layers in LIDAN, aligning with the increasing emotion-discriminative ability with respect to layer depth. This arrangement enables LIDAN to more effectively learn emotion-discriminative and corpus-invariant features for SER across various corpora compared to DIDAN. It is also worth mentioning that, unlike most existing methods that rely on estimating statistical moments to describe pre-assumed explicit distributions, both IDA and LIDA take a different approach. They utilize the idea of target sample reconstruction to directly bridge the feature distribution gap without making assumptions about the distribution type. As a result, DIDAN and LIDAN can be viewed as implicit cross-corpus SER methods. To evaluate LIDAN, we conducted extensive cross-corpus SER experiments on EmoDB, eNTERFACE, and CASIA corpora. The experimental results demonstrate that LIDAN surpasses recent state-of-the-art explicit unsupervised DA methods in tackling cross-corpus SER tasks.
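To make the "target sample reconstruction" idea concrete, here is a hedged sketch of an IDA-style regularizer: each target-corpus deep feature is reconstructed as a linear combination of source-corpus features and the residual is penalized, with no parametric assumption on either feature distribution. The closed-form least-squares solver and the absence of sparsity constraints are simplifications; the exact DIDAN/LIDAN formulation may differ.

```python
import torch

def ida_reconstruction_loss(src_feat: torch.Tensor, tgt_feat: torch.Tensor) -> torch.Tensor:
    """Hedged sketch of an implicit-distribution-alignment style regularizer.

    src_feat: (n_src, d) source-corpus deep features
    tgt_feat: (n_tgt, d) target-corpus deep features
    Reconstructs each target feature as a linear combination of source features
    and returns the mean squared reconstruction error.
    """
    # Solve min_C ||C @ src_feat - tgt_feat||^2 in closed form (least squares).
    coeff = torch.linalg.lstsq(src_feat.T, tgt_feat.T).solution.T  # (n_tgt, n_src)
    recon = coeff @ src_feat
    return (recon - tgt_feat).pow(2).mean()
```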
Zero-Shot Emotion Transfer For Cross-Lingual Speech Synthesis
results: Experimental results show that the proposed framework can synthesize bilingual emotional speech for a monolingual target speaker without emotional training data, transferring emotion across languages in a zero-shot manner.
Abstract
Zero-shot emotion transfer in cross-lingual speech synthesis aims to transfer emotion from an arbitrary speech reference in the source language to the synthetic speech in the target language. Building such a system faces challenges of unnatural foreign accents and difficulty in modeling the shared emotional expressions of different languages. Building on the DelightfulTTS neural architecture, this paper addresses these challenges by introducing specifically-designed modules to model the language-specific prosody features and language-shared emotional expressions separately. Specifically, the language-specific speech prosody is learned by a non-autoregressive predictive coding (NPC) module to improve the naturalness of the synthetic cross-lingual speech. The shared emotional expression between different languages is extracted from a pre-trained self-supervised model HuBERT with strong generalization capabilities. We further use hierarchical emotion modeling to capture more comprehensive emotions across different languages. Experimental results demonstrate the proposed framework's effectiveness in synthesizing bi-lingual emotional speech for the monolingual target speaker without emotional training data.
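As an example of the language-shared emotion representation described above, the sketch below mean-pools hidden states from a pre-trained HuBERT model into an utterance-level embedding; the checkpoint, layer choice, and pooling are assumptions rather than the paper's configuration.

```python
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

# Pre-trained self-supervised HuBERT; the checkpoint name is an assumption.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def emotion_embedding(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """waveform: 1-D float tensor of a reference utterance sampled at 16 kHz.
    Returns a single utterance-level vector used here as a stand-in for the
    language-shared emotional expression."""
    inputs = extractor(waveform.numpy(), sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        hidden = hubert(inputs.input_values).last_hidden_state  # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0)  # mean-pool over time
```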