Abstract
In this paper, we propose a method to learn unified representations of multilingual speech and text with a single model, focusing in particular on speech synthesis. We represent multilingual speech audio with speech units, the quantized representations of speech features encoded by a self-supervised speech model. By treating the audio as pseudo text, we can focus on its linguistic content and build a unified representation of speech and text. We then propose to train an encoder-decoder model with a Unit-to-Unit Translation (UTUT) objective on multilingual data. Specifically, by conditioning the encoder with the source language token and the decoder with the target language token, the model is optimized to translate the source spoken language into the target language in a many-to-many translation setting. The model thereby builds knowledge of how spoken languages are comprehended and how to relate them across languages. A single model pre-trained with UTUT can be employed for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST). Through comprehensive experiments encompassing various languages, we validate the efficacy of the proposed method across diverse multilingual tasks. Moreover, we show that UTUT can perform many-to-many language STS, which has not been previously explored in the literature. Samples are available at https://choijeongsoo.github.io/utut.
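The conditioning scheme described above can be illustrated with a minimal sketch: the source language token is prepended to the encoder's input unit sequence, and the target language token is prepended to the decoder's input, so a single model can translate between any language pair. All model dimensions, vocabulary sizes, and token ids below are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of UTUT-style conditioning (assumed, not the authors' code):
# encoder input = [src_lang_token, source speech units]
# decoder input = [tgt_lang_token, target speech units shifted right]
import torch
import torch.nn as nn

NUM_UNITS = 100   # size of the speech-unit vocabulary (assumption)
NUM_LANGS = 4     # number of language tokens appended after the units (assumption)
VOCAB = NUM_UNITS + NUM_LANGS

class UTUT(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.transformer = nn.Transformer(
            d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True)
        self.proj = nn.Linear(d_model, VOCAB)

    def forward(self, src_units, src_lang, tgt_units, tgt_lang):
        # Condition the encoder with the source language token and the
        # decoder with the target language token (many-to-many setting).
        src = torch.cat([src_lang.unsqueeze(1), src_units], dim=1)
        tgt_in = torch.cat([tgt_lang.unsqueeze(1), tgt_units[:, :-1]], dim=1)
        # Causal mask so each decoder position only sees earlier units.
        mask = nn.Transformer.generate_square_subsequent_mask(tgt_in.size(1))
        out = self.transformer(self.embed(src), self.embed(tgt_in), tgt_mask=mask)
        return self.proj(out)  # unit logits, one per target position

model = UTUT()
src = torch.randint(0, NUM_UNITS, (2, 10))    # source speech units
tgt = torch.randint(0, NUM_UNITS, (2, 8))     # target speech units
src_lang = torch.full((2,), NUM_UNITS + 0)    # e.g. source-language token id
tgt_lang = torch.full((2,), NUM_UNITS + 1)    # e.g. target-language token id
logits = model(src, src_lang, tgt, tgt_lang)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, VOCAB), tgt.reshape(-1))
```

Because the language identity is carried only by the prepended tokens, the same parameters serve every source/target pair, which is what enables the many-to-many STS, TTS, and TTST uses listed in the abstract.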