eess.AS - 2023-07-03

ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading

paper_url: http://arxiv.org/abs/2307.00782
repo_url: None
paper_authors: Yujia Xiao, Shaofei Zhang, Xi Wang, Xu Tan, Lei He, Sheng Zhao, Frank K. Soong, Tan Lee
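for: This work aims to improve the quality of text-to-speech (TTS) systems for paragraph and long-form reading.
methods: The work proposes ContextSpeech, a lightweight yet effective TTS system. It first designs a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding. It then constructs hierarchically-structured textual semantics to broaden the scope of global context enhancement. Finally, it integrates linearized self-attention to improve model efficiency.
results: Experiments show that ContextSpeech significantly improves voice quality and prosody expressiveness in paragraph reading, with competitive model efficiency. Audio samples are available at: https://contextspeech.github.io/demo/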

Abstract
While state-of-the-art Text-to-Speech systems can generate natural speech of very high quality at sentence level, they still meet great challenges in speech generation for paragraph / long-form reading. Such deficiencies are due to i) ignorance of cross-sentence contextual information, and ii) high computation and memory cost for long-form synthesis. To address these issues, this work develops a lightweight yet effective TTS system, ContextSpeech. Specifically, we first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding. Then we construct hierarchically-structured textual semantics to broaden the scope for global context enhancement. Additionally, we integrate linearized self-attention to improve model efficiency. Experiments show that ContextSpeech significantly improves the voice quality and prosody expressiveness in paragraph reading with competitive model efficiency. Audio samples are available at: https://contextspeech.github.io/demo/
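
The efficiency claim rests on linearized self-attention, which avoids building the full sequence-by-sequence attention matrix. The abstract does not say which linearization ContextSpeech uses, so the following is only a minimal sketch under an assumption: it uses the elu(x) + 1 feature map that is common in the linear-attention literature. The names feature_map, linear_attention, and quadratic_reference are hypothetical and are not taken from the paper.

```python
import numpy as np

def feature_map(x):
    # Assumed kernel feature map: elu(x) + 1, which is strictly positive.
    # This choice is illustrative, not confirmed by the paper.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """Non-causal linearized attention.

    q, k: arrays of shape (n, d); v: array of shape (n, d_v).
    Cost grows linearly with n because the (n, n) matrix is never formed.
    """
    q_f, k_f = feature_map(q), feature_map(k)
    kv = k_f.T @ v              # (d, d_v): keys and values summarized once
    z = k_f.sum(axis=0)         # (d,): normalizer shared by all queries
    return (q_f @ kv) / (q_f @ z)[:, None]

def quadratic_reference(q, k, v):
    # Same quantity computed the expensive way, via the full (n, n) weights.
    w = feature_map(q) @ feature_map(k).T
    return (w / w.sum(axis=1, keepdims=True)) @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))
k = rng.normal(size=(6, 4))
v = rng.normal(size=(6, 3))
out = linear_attention(q, k, v)
```

Both functions compute the same result; the linear version only reorders the matrix products, so memory and compute scale with sequence length rather than its square, which is what matters for paragraph-length inputs.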
