cs.SD - 2023-08-09

Unsupervised Out-of-Distribution Dialect Detection with Mahalanobis Distance

  • paper_url: http://arxiv.org/abs/2308.04886
  • repo_url: None
  • paper_authors: Sourya Dipta Das, Yash Vadi, Abhishek Unnam, Kuldeep Yadav
  • for: improving the overall performance of dialect classification systems and handling anomalous inputs encountered in real-world deployment
  • methods: a simple yet effective unsupervised Mahalanobis distance feature-based method for detecting out-of-distribution samples in a dialect classification model (a minimal sketch follows the abstract below)
  • results: the proposed method significantly outperforms other state-of-the-art OOD detection methods
    Abstract Dialect classification is used in a variety of applications, such as machine translation and speech recognition, to improve the overall performance of the system. In a real-world scenario, a deployed dialect classification model can encounter anomalous inputs that differ from the training data distribution, also called out-of-distribution (OOD) samples. Such OOD samples can lead to unexpected outputs, as their dialects are unseen during model training. Out-of-distribution detection is a new research area that has received little attention in the context of dialect classification. To this end, we propose a simple yet effective unsupervised Mahalanobis distance feature-based method to detect out-of-distribution samples. We utilize the latent embeddings from all intermediate layers of a wav2vec 2.0 transformer-based dialect classifier model for multi-task learning. Our proposed approach outperforms other state-of-the-art OOD detection methods significantly.
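To make the method concrete, here is a minimal, hypothetical sketch of layer-wise Mahalanobis OOD scoring on embeddings already extracted from each intermediate layer of a dialect classifier; the function names, the tied-covariance estimate, and the plain sum over layers are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def fit_layer_stats(embeddings, labels, eps=1e-6):
    """Per-class means and a shared (tied) inverse covariance from in-distribution
    training embeddings of one layer; embeddings: (N, D), labels: (N,)."""
    classes = np.unique(labels)
    means = {c: embeddings[labels == c].mean(axis=0) for c in classes}
    centered = np.concatenate([embeddings[labels == c] - means[c] for c in classes])
    cov = np.cov(centered, rowvar=False) + eps * np.eye(embeddings.shape[1])
    return means, np.linalg.inv(cov)

def mahalanobis_confidence(x, means, cov_inv):
    """Confidence of one layer embedding x: negative squared Mahalanobis distance
    to the closest class mean (smaller distance = more in-distribution)."""
    dists = [float((x - mu) @ cov_inv @ (x - mu)) for mu in means.values()]
    return -min(dists)

def ood_score(layer_embeddings, layer_stats):
    """Combine layer-wise confidences (here by a plain sum) into one score;
    a sample with a low combined score is flagged as out-of-distribution."""
    return sum(mahalanobis_confidence(x, *layer_stats[i])
               for i, x in enumerate(layer_embeddings))
```

Thresholding the combined score, for example at a low percentile of scores on held-out in-distribution data, would then separate known dialects from OOD inputs.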

DiVa: An Iterative Framework to Harvest More Diverse and Valid Labels from User Comments for Music

  • paper_url: http://arxiv.org/abs/2308.04805
  • repo_url: https://github.com/jingyaolliu/diva
  • paper_authors: Hongru Liang, Jingyao Liu, Yuanxin Xiang, Jiachen Du, Lanjun Zhou, Shushen Pan, Wenqiang Lei
  • for: automatic music labeling in an essential but under-explored setting
  • methods: uses pre-trained classifiers and a novel joint score function to harvest more diverse and valid labels from user comments (a minimal sketch follows the abstract below)
  • results: produces more of the diverse labels missed by the gold labels, outperforming state-of-the-art solutions
    Abstract Towards sufficient music searching, it is vital to form a complete set of labels for each song. However, current solutions fail to resolve this, as they cannot produce mappings diverse enough to make up for the information missed by the gold labels. Based on the observation that such missing information may already be present in user comments, we propose to study automated music labeling in an essential but under-explored setting, where the model is required to harvest more diverse and valid labels from users' comments given limited gold labels. To this end, we design an iterative framework (DiVa) to harvest more $\underline{\text{Di}}$verse and $\underline{\text{Va}}$lid labels from user comments for music. The framework enables a classifier to form complete sets of labels for songs via pseudo-labels inferred from pre-trained classifiers and a novel joint score function. Experiments on a densely annotated testing set reveal the superiority of DiVa over state-of-the-art solutions in producing more diverse labels missed by the gold labels. We hope our work can inspire future research on automated music labeling.
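The abstract only names the ingredients (pre-trained classifiers, pseudo-labels, a joint score function), so the following is a hypothetical sketch of one iterative pseudo-labelling loop under those assumptions; `fit_fn`, `confidence_fn`, `validity_fn`, the product-form joint score, and the fixed threshold are illustrative choices, not the paper's formulation.

```python
def harvest_labels(songs, gold_labels, candidates, fit_fn, confidence_fn, validity_fn,
                   threshold=0.7, rounds=3):
    """One possible iterative pseudo-labelling loop.

    songs         : iterable of song ids
    gold_labels   : dict song -> set of gold labels
    candidates    : dict song -> set of candidate labels mined from user comments
    fit_fn        : callable retraining the classifier on the current label sets
    confidence_fn : callable (song, label) -> classifier confidence in [0, 1]
    validity_fn   : callable label -> validity prior in [0, 1]
    """
    labels = {song: set(gold_labels[song]) for song in songs}
    for _ in range(rounds):
        fit_fn(labels)                                      # retrain on gold + accepted pseudo-labels
        for song in songs:
            for cand in candidates[song] - labels[song]:
                # The product of confidence and validity stands in for the joint score.
                if confidence_fn(song, cand) * validity_fn(cand) >= threshold:
                    labels[song].add(cand)                  # accept as a pseudo-label
    return labels
```

Each round therefore grows the label sets with comment-derived labels the current classifier considers both confident and valid, before the classifier is retrained on the expanded sets.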

Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

  • paper_url: http://arxiv.org/abs/2308.04767
  • repo_url: https://github.com/tahy1/avin
  • paper_authors: Tianyu Liu, Peng Zhang, Wei Huang, Yufei Zha, Tao You, Yanning Zhang
  • for: addressing the modality-inconsistency problem in self-supervised sound source localization by better aligning features across modalities, improving localization accuracy and robustness
  • methods: an Induction Network that decouples the gradients of the visual and audio modalities so that discriminative visual representations of sound sources can be learned and the audio modality can be aligned with the visual one; an adaptive threshold selection strategy further improves robustness (a minimal sketch follows the abstract below)
  • results: extensive experiments on the SoundNet-Flickr and VGG-Sound Source datasets show superior performance over other state-of-the-art methods in various challenging scenarios
    Abstract Self-supervised sound source localization is usually challenged by modality inconsistency. In recent studies, contrastive learning based strategies have shown promise in establishing a consistent correspondence between audio and sound sources in visual scenarios. Unfortunately, insufficient attention to the influence of heterogeneity across modality features still limits further improvement of this scheme, which motivates our work. In this study, an Induction Network is proposed to bridge the modality gap more effectively. By decoupling the gradients of the visual and audio modalities, discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned with the visual modality consistently. In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network. Extensive experiments conducted on the SoundNet-Flickr and VGG-Sound Source datasets demonstrate superior performance compared to other state-of-the-art works in different challenging scenarios. The code is available at https://github.com/Tahy1/AVIN
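As a rough illustration of the gradient-decoupling idea (not the actual Induction Vector mechanism or the paper's visual weighted loss), a symmetric contrastive objective can stop gradients from one modality flowing into the other by detaching the target branch; the PyTorch sketch below is a generic InfoNCE-style assumption.

```python
import torch
import torch.nn.functional as F

def decoupled_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """Align paired audio/visual embeddings while decoupling gradients:
    each modality is matched against a detached copy of the other, so the
    audio term does not update the visual encoder and vice versa."""
    a = F.normalize(audio_emb, dim=-1)                  # (B, D)
    v = F.normalize(visual_emb, dim=-1)                 # (B, D)
    targets = torch.arange(a.size(0), device=a.device)  # positives on the diagonal
    logits_a = a @ v.detach().t() / temperature         # audio vs. frozen visual targets
    logits_v = v @ a.detach().t() / temperature         # visual vs. frozen audio targets
    return 0.5 * (F.cross_entropy(logits_a, targets) + F.cross_entropy(logits_v, targets))
```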

Speaker Recognition Using Isomorphic Graph Attention Network Based Pooling on Self-Supervised Representation

  • paper_url: http://arxiv.org/abs/2308.04666
  • repo_url: None
  • paper_authors: Zirui Ge, Xinzhou Xu, Haiyan Guo, Tingting Wang, Zhen Yang
  • for: improving speaker recognition built on foundation models trained on speech data (self-supervised representations)
  • methods: an Isomorphic Graph ATtention network (IsoGAT) used for pooling on top of self-supervised representations, combining representation learning, graph attention, and aggregation (a minimal sketch follows the abstract below)
  • results: speaker recognition experiments on the VoxCeleb1&2 datasets comparing the proposed approach against existing pooling methods
    Abstract The emergence of self-supervised representations (i.e., wav2vec 2.0) allows speaker-recognition approaches to process spoken signals through foundation models built on speech data. Nevertheless, effective fusion of these representations requires further investigation, due to the use of fixed or sub-optimal temporal pooling strategies. Despite improved strategies considering graph learning and graph attention factors, non-injective aggregation still exists in these approaches, which may affect speaker recognition performance. In this regard, we propose a speaker recognition approach using an Isomorphic Graph ATtention network (IsoGAT) on self-supervised representations. The proposed approach contains three modules, representation learning, graph attention, and aggregation, jointly considering learning on the self-supervised representation and the IsoGAT. We then perform speaker recognition experiments on the VoxCeleb1&2 datasets, with the results demonstrating the recognition performance of the proposed approach compared with existing pooling approaches on the self-supervised representation.
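A toy illustration of the underlying pooling idea: attention-weighted, sum-based aggregation over frame-level self-supervised embeddings, keeping the readout closer to an injective (GIN-style) sum than to mean or max pooling. This is a hedged sketch under assumed module names and sizes, not the IsoGAT architecture.

```python
import torch
import torch.nn as nn

class AttentiveSumPooling(nn.Module):
    """Pool frame-level embeddings (e.g., from wav2vec 2.0) into one
    utterance-level speaker embedding via attention weights and a sum."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Linear(dim, 1)                      # per-frame attention score
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim))     # post-aggregation MLP

    def forward(self, frames):                             # frames: (batch, time, dim)
        weights = torch.softmax(self.attn(frames), dim=1)  # (batch, time, 1)
        pooled = (weights * frames).sum(dim=1)             # sum-style aggregation over time
        return self.proj(pooled)                           # (batch, dim) speaker embedding

# Example usage with assumed sizes: 200 frames of 768-dim features.
embedding = AttentiveSumPooling(768)(torch.randn(4, 200, 768))
```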