eess.AS - 2023-07-06

Read, Look or Listen? What’s Needed for Solving a Multimodal Dataset

  • paper_url: http://arxiv.org/abs/2307.04532
  • repo_url: None
  • paper_authors: Netta Madvil, Yonatan Bitton, Roy Schwartz
  • for: This work addresses the problem of assessing the quality of large-scale multimodal datasets.
  • methods: A two-step method that leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it (a minimal sketch follows the abstract below).
  • results: Applied to the TVQA video question-answering dataset, the method shows that most questions can be answered using a single modality, without a substantial bias towards any specific modality. More than 70% of the questions are solvable by several different single-modality strategies, e.g., by only watching the video or only listening to the audio. MERLOT Reserve is found to struggle with image-based questions compared to text and audio. Based on these observations, the authors introduce a new test set that requires multiple modalities and observe a dramatic drop in model performance.
    Abstract The prevalence of large-scale multimodal datasets presents unique challenges in assessing dataset quality. We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it. Our method sheds light on the importance of different modalities in datasets, as well as the relationship between them. We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality. Moreover, we find that more than 70% of the questions are solvable using several different single-modality strategies, e.g., by either looking at the video or listening to the audio, highlighting the limited integration of multiple modalities in TVQA. We leverage our annotation and analyze the MERLOT Reserve, finding that it struggles with image-based questions compared to text and audio, but also with auditory speaker identification. Based on our observations, we introduce a new test set that necessitates multiple modalities, observing a dramatic drop in model performance. Our methodology provides valuable insights into multimodal datasets and highlights the need for the development of more robust models.
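A minimal sketch of the modality-mapping idea described above, not the authors' code: given hypothetical single-modality predictors, each instance is tagged with the modalities that suffice to answer it, and instances no single modality solves become candidates for a harder, truly multimodal test set. All function and field names are illustrative assumptions.

```python
# Sketch only: tag each TVQA-style instance with the single modalities that
# suffice to answer it, given hypothetical per-modality predictors.
from typing import Callable, Dict, List, Set

MODALITIES = ["video", "audio", "subtitles"]

def sufficient_modalities(
    instance: Dict,                                # e.g. {"question", "answers", "label", ...}
    predictors: Dict[str, Callable[[Dict], int]],  # modality -> answer-index predictor
) -> Set[str]:
    """Return the set of single modalities whose predictor answers correctly."""
    gold = instance["label"]
    return {m for m in MODALITIES if predictors[m](instance) == gold}

def split_by_modality_need(dataset: List[Dict], predictors) -> Dict[str, List[Dict]]:
    """Bucket instances: solvable by one modality, by several, or by none alone."""
    buckets = {"single": [], "multiple_single": [], "needs_fusion": []}
    for ex in dataset:
        ok = sufficient_modalities(ex, predictors)
        if not ok:
            buckets["needs_fusion"].append(ex)      # candidates for a harder test set
        elif len(ok) == 1:
            buckets["single"].append(ex)
        else:
            buckets["multiple_single"].append(ex)   # the paper reports >70% of TVQA here
    return buckets
```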

Deep Speech Synthesis from MRI-Based Articulatory Representations

  • paper_url: http://arxiv.org/abs/2307.02471
  • repo_url: https://github.com/articulatory/articulatory
  • paper_authors: Peter Wu, Tingle Li, Yijing Lu, Yubin Zhang, Jiachen Lian, Alan W Black, Louis Goldstein, Shinji Watanabe, Gopala K. Anumanchipalli
  • for: This work develops articulatory synthesis, a speech synthesis method based on human vocal tract information, aiming for more efficient, generalizable, and interpretable synthesizers.
  • methods: The study uses MRI to capture a much more extensive articulatory space than EMA, and introduces normalization and denoising procedures to improve the generalizability of deep learning models trained on MRI data (see the sketch after the abstract below). It also proposes an MRI-to-speech model.
  • results: Through a series of ablations, the authors show that the MRI representation is more comprehensive than EMA, identify the most suitable MRI feature subset for articulatory synthesis, and improve both computational efficiency and speech fidelity.
    Abstract In this paper, we study articulatory synthesis, a speech synthesis method using human vocal tract information that offers a way to develop efficient, generalizable and interpretable synthesizers. While recent advances have enabled intelligible articulatory synthesis using electromagnetic articulography (EMA), these methods lack critical articulatory information like excitation and nasality, limiting generalization capabilities. To bridge this gap, we propose an alternative MRI-based feature set that covers a much more extensive articulatory space than EMA. We also introduce normalization and denoising procedures to enhance the generalizability of deep learning methods trained on MRI data. Moreover, we propose an MRI-to-speech model that improves both computational efficiency and speech fidelity. Finally, through a series of ablations, we show that the proposed MRI representation is more comprehensive than EMA and identify the most suitable MRI feature subset for articulatory synthesis.
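A minimal sketch of the kind of normalization and denoising step described above, under stated assumptions rather than the authors' actual pipeline: MRI articulatory feature trajectories are z-scored per dimension and smoothed before being fed to an MRI-to-speech model. The filter choice, array shapes, and the loader/model names are illustrative assumptions.

```python
# Sketch only: preprocess MRI articulatory feature trajectories (frames x dims)
# with per-dimension normalization and simple temporal denoising.
import numpy as np
from scipy.signal import savgol_filter

def normalize_features(feats: np.ndarray) -> np.ndarray:
    """Z-score each articulatory feature dimension across time."""
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True) + 1e-8
    return (feats - mean) / std

def denoise_features(feats: np.ndarray, window: int = 9, poly: int = 2) -> np.ndarray:
    """Smooth each feature trajectory with a Savitzky-Golay filter to suppress frame noise."""
    return savgol_filter(feats, window_length=window, polyorder=poly, axis=0)

# Hypothetical usage:
#   feats = load_mri_features(utterance)          # assumed loader, shape (frames, dims)
#   x = denoise_features(normalize_features(feats))
#   wav = mri_to_speech_model(x)                  # assumed trained synthesis model
```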