eess.AS - 2023-08-08

Investigating Speaker Embedding Disentanglement on Natural Read Speech

paper_url: http://arxiv.org/abs/2308.04225
repo_url: None
paper_authors: Michael Kuhlmann, Adrian Meise, Fritz Seebauer, Petra Wagner, Reinhold Haeb-Umbach
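for: This paper studies the disentanglement of speech representations, which is meant to improve the generalizability, explainability, and fairness of data-driven models.
methods: The paper trains speaker embeddings with standard disentanglement objectives and compares how well the resulting representations are disentangled, using highly speaker-variant acoustic features as proxies for the underlying factors of variation.
results: Disentanglement of the speaker embedding remains limited under standard disentanglement objectives, but it improves to some extent over vanilla representation learning.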

Abstract
Disentanglement is the task of learning representations that identify and separate factors that explain the variation observed in data. Disentangled representations are useful to increase the generalizability, explainability, and fairness of data-driven models. Little is known about how well such disentanglement works for speech representations. A major challenge when tackling disentanglement for speech representations is that the generative factors underlying the speech signal are unknown. In this work, we investigate to what degree speech representations encoding speaker identity can be disentangled. To quantify disentanglement, we identify acoustic features that are highly speaker-variant and can serve as proxies for the factors of variation underlying speech. We find that disentanglement of the speaker embedding is limited when trained with standard objectives promoting disentanglement but can be improved over vanilla representation learning to some extent.

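Summary
Disentanglement means learning representations that separate the factors behind the variation observed in data. Disentangled representations help improve the generalizability, explainability, and fairness of data-driven models. How well disentanglement works for speech representations is still poorly understood. This work studies the degree to which the speaker identity encoded in speech representations can be disentangled. To quantify disentanglement, the authors identify acoustic features that vary strongly across speakers and can serve as proxies for the underlying factors of variation. They find that standard disentanglement objectives disentangle the speaker embedding only to a limited degree, although the result improves somewhat over vanilla representation learning.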
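The abstract does not say how "highly speaker-variant" acoustic features are selected. As a rough illustration only, and not the authors' procedure, one plausible criterion is the share of a feature's total variance that is explained by speaker identity (the eta-squared statistic from one-way ANOVA). The function name `speaker_variance_ratio` and the toy feature values below are hypothetical.

```python
import numpy as np


def speaker_variance_ratio(feature, speaker_ids):
    """Share of a feature's variance explained by speaker identity (eta squared).

    Hypothetical helper for illustration; the paper's actual selection
    criterion is not given in the abstract.
    """
    feature = np.asarray(feature, dtype=float)
    speaker_ids = np.asarray(speaker_ids)

    grand_mean = feature.mean()
    ss_total = np.sum((feature - grand_mean) ** 2)
    if ss_total == 0.0:
        # A constant feature carries no speaker information.
        return 0.0

    # Between-speaker sum of squares: each speaker's mean, weighted by
    # the number of utterances from that speaker.
    ss_between = 0.0
    for spk in np.unique(speaker_ids):
        values = feature[speaker_ids == spk]
        ss_between += values.size * (values.mean() - grand_mean) ** 2

    return ss_between / ss_total


# Toy data (made up): two speakers, two utterances each.
speakers = ["a", "a", "b", "b"]
mean_f0 = [100.0, 102.0, 200.0, 198.0]  # differs strongly between speakers
noise_like = [1.0, 2.0, 1.0, 2.0]       # varies within, not between, speakers

f0_ratio = speaker_variance_ratio(mean_f0, speakers)
noise_ratio = speaker_variance_ratio(noise_like, speakers)
print(f0_ratio, noise_ratio)
```

Under this criterion, a feature with a ratio close to 1 varies mostly between speakers and would be a candidate proxy, while a ratio close to 0 indicates variation that is unrelated to speaker identity.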

EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation

paper_url: http://arxiv.org/abs/2308.04162
repo_url: https://github.com/lab206/epcformer
paper_authors: Jiajun Chen, Jiacheng Lin, Zhiqiang Xiao, Haolong Fu, Ke Nai, Kailun Yang, Zhiyong Li
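for: This paper addresses two closely related tasks, audio-guided video object segmentation (A-VOS) and referring video object segmentation (R-VOS).
methods: The paper proposes a universal architecture, the Expression Prompt Collaboration Transformer (EPCFormer), together with an Expression Alignment (EA) mechanism and an Expression-Visual Attention (EVA) mechanism, to address the difficulty of modeling representations for different modalities.
results: Experiments show that EPCFormer achieves state-of-the-art results on both A-VOS and R-VOS. EPCFormer can also transfer knowledge between the two tasks, which improves video object segmentation accuracy.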

Abstract
Audio-guided Video Object Segmentation (A-VOS) and Referring Video Object Segmentation (R-VOS) are two highly-related tasks, which both aim to segment specific objects from video sequences according to user-provided expression prompts. However, due to the challenges in modeling representations for different modalities, contemporary methods struggle to strike a balance between interaction flexibility and high-precision localization and segmentation. In this paper, we address this problem from two perspectives: the alignment representation of audio and text and the deep interaction among audio, text, and visual features. First, we propose a universal architecture, the Expression Prompt Collaboration Transformer, herein EPCFormer. Next, we propose an Expression Alignment (EA) mechanism for audio and text expressions. By introducing contrastive learning for audio and text expressions, the proposed EPCFormer realizes comprehension of the semantic equivalence between audio and text expressions denoting the same objects. Then, to facilitate deep interactions among audio, text, and video features, we introduce an Expression-Visual Attention (EVA) mechanism. The knowledge of video object segmentation in terms of the expression prompts can seamlessly transfer between the two tasks by deeply exploring complementary cues between text and audio. Experiments on well-recognized benchmarks demonstrate that our universal EPCFormer attains state-of-the-art results on both tasks. The source code of EPCFormer will be made publicly available at https://github.com/lab206/EPCFormer.

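Summary
Audio-guided video object segmentation (A-VOS) and referring video object segmentation (R-VOS) are two closely related tasks; both segment specific objects from video sequences according to user-provided expression prompts. Because modeling representations for different modalities is difficult, current methods struggle to balance interaction flexibility against high-precision localization and segmentation. The paper approaches the problem from two angles: an aligned representation of audio and text, and deep interaction among audio, text, and visual features. First, it proposes a universal architecture, the Expression Prompt Collaboration Transformer (EPCFormer). Next, it proposes an Expression Alignment (EA) mechanism for audio and text expressions; through contrastive learning, EPCFormer learns that audio and text expressions denoting the same object are semantically equivalent. Then, to promote deep interaction among audio, text, and video features, it introduces an Expression-Visual Attention (EVA) mechanism. By exploring complementary cues between text and audio, EPCFormer transfers knowledge between the two tasks. Experiments show that the universal EPCFormer achieves state-of-the-art results on both tasks. The code will be released at https://github.com/lab206/EPCFormer.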
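The abstract says the Expression Alignment mechanism uses contrastive learning to align audio and text expressions, but it does not give the loss. A common choice for this kind of paired alignment is a symmetric InfoNCE loss, sketched below in NumPy. This is an assumption for illustration, not EPCFormer's implementation; the function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import numpy as np


def _log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/text embeddings.

    Row i of audio_emb and row i of text_emb are assumed to describe the
    same object (a positive pair); all other rows in the batch act as
    negatives. Hypothetical sketch, not the loss used by EPCFormer.
    """
    audio = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Cosine similarity between every audio/text pair, scaled by temperature.
    logits = audio @ text.T / temperature
    idx = np.arange(logits.shape[0])

    # Cross-entropy with the matching index as the target, in both directions.
    loss_audio_to_text = -_log_softmax(logits)[idx, idx].mean()
    loss_text_to_audio = -_log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)


# Toy embeddings (made up): four perfectly aligned pairs, then the same
# text embeddings shifted by one row so that every pair is mismatched.
audio_batch = np.eye(4)
loss_aligned = symmetric_contrastive_loss(audio_batch, np.eye(4))
loss_shuffled = symmetric_contrastive_loss(audio_batch, np.roll(np.eye(4), 1, axis=0))
print(loss_aligned, loss_shuffled)
```

Minimizing a loss of this form pulls the audio and text embeddings of the same expression together and pushes mismatched pairs apart, which is one way to realize the semantic equivalence described above.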