cs.CL - 2023-09-05

paper_url: http://arxiv.org/abs/2309.02591
repo_url: https://github.com/kyegomez/CM3Leon
paper_authors: Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, Candace Ross, Adam Polyak, Russell Howes, Vasu Sharma, Puxin Xu, Hovhannes Tamoyan, Oron Ashual, Uriel Singer, Shang-Wen Li, Susan Zhang, Richard James, Gargi Ghosh, Yaniv Taigman, Maryam Fazel-Zarandi, Asli Celikyilmaz, Luke Zettlemoyer, Armen Aghajanyan
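for: This paper presents CM3Leon, a text and image generation model built on a multi-modal language model, and reports its performance across tasks.
methods: The model uses the CM3 multi-modal architecture, scaled up and tuned on more diverse instruction-style data. Training consists of a large-scale retrieval-augmented pre-training stage followed by a multi-task supervised fine-tuning (SFT) stage.
results: Experiments show that this recipe is highly effective for multi-modal models. CM3Leon achieves state-of-the-art text-to-image generation (zero-shot MS-COCO FID of 4.88) and, after SFT, shows unprecedented controllability in language-guided image editing, image-controlled generation, and segmentation.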

Abstract
We present CM3Leon (pronounced "Chameleon"), a retrieval-augmented, token-based, decoder-only multi-modal language model capable of generating and infilling both text and images. CM3Leon uses the CM3 multi-modal architecture but additionally shows the extreme benefits of scaling up and tuning on more diverse instruction-style data. It is the first multi-modal model trained with a recipe adapted from text-only language models, including a large-scale retrieval-augmented pre-training stage and a second multi-task supervised fine-tuning (SFT) stage. It is also a general-purpose model that can do both text-to-image and image-to-text generation, allowing us to introduce self-contained contrastive decoding methods that produce high-quality outputs. Extensive experiments demonstrate that this recipe is highly effective for multi-modal models. CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods (zero-shot MS-COCO FID of 4.88). After SFT, CM3Leon can also demonstrate unprecedented levels of controllability in tasks ranging from language-guided image editing to image-controlled generation and segmentation.

Substitution-based Semantic Change Detection using Contextual Embeddings

paper_url: http://arxiv.org/abs/2309.02403
repo_url: https://github.com/dallascard/SBSCD
paper_authors: Dallas Card
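for: This study measures semantic change using contextual embeddings and proposes a simplified approach that improves on existing methods.
methods: The approach relies only on the most probable substitutes for masked terms. It is directly interpretable and far more efficient in terms of storage.
results: The approach achieves superior average performance across the most frequently cited datasets for this task and allows more nuanced investigation of semantic change than static word vectors.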

Abstract
Measuring semantic change has thus far remained a task where methods using contextual embeddings have struggled to improve upon simpler techniques relying only on static word vectors. Moreover, many of the previously proposed approaches suffer from downsides related to scalability and ease of interpretation. We present a simplified approach to measuring semantic change using contextual embeddings, relying only on the most probable substitutes for masked terms. Not only is this approach directly interpretable, it is also far more efficient in terms of storage, achieves superior average performance across the most frequently cited datasets for this task, and allows for more nuanced investigation of change than is possible with static word vectors.
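
The substitution idea can be illustrated with a minimal sketch. It assumes the top substitutes for each masked occurrence of a target word have already been collected from a masked language model for two time periods (the input lists are hypothetical), and it scores change with the Jensen-Shannon divergence between the two substitute distributions. The choice of divergence is an assumption made for illustration; the abstract does not state the exact measure.

```python
import math
from collections import Counter


def substitute_distribution(substitutes):
    """Turn a list of substitute words into a relative-frequency distribution."""
    counts = Counter(substitutes)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions."""
    vocab = set(p) | set(q)
    # Mixture distribution over the union of both vocabularies.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl_to_mixture(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Hypothetical substitutes for one target word in two periods.
early = substitute_distribution(["letter", "post", "letter", "parcel"])
late = substitute_distribution(["email", "message", "letter", "email"])
change_score = jensen_shannon_divergence(early, late)
```

A score of 0 means the two periods share the same substitute distribution, and a score of 1 means they share no substitutes at all. The substitutes themselves stay readable, which is what makes this kind of score directly interpretable.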

nanoT5: A PyTorch Framework for Pre-training and Fine-tuning T5-style Models with Limited Resources

paper_url: http://arxiv.org/abs/2309.02373
repo_url: https://github.com/piotrnawrot/nanot5
paper_authors: Piotr Nawrot
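for: To widen access to language modelling research under limited resources, so that more researchers can pre-train and fine-tune T5-style models.
methods: A specially optimized PyTorch framework, drawing on insights from optimizer differences, for efficient pre-training and fine-tuning of T5 models. The configurations, codebase, and pre-trained models are released as open source.
results: A T5-Base model can be pre-trained on a single GPU in 16 hours without any loss in performance. The release includes configurations, software/hardware insights, and pre-trained models.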

Abstract
State-of-the-art language models like T5 have revolutionized the NLP landscape, but their computational demands hinder a large portion of the research community. To address this challenge, we present nanoT5, a specially-optimized PyTorch framework for efficient pre-training and fine-tuning of T5 models. Drawing on insights from optimizer differences and prioritizing efficiency, nanoT5 allows a T5-Base model to be pre-trained on a single GPU in just 16 hours, without any loss in performance. With the introduction of this open-source framework, we hope to widen the accessibility to language modelling research and cater to the community's demand for more user-friendly T5 (Encoder-Decoder) implementations. Our contributions, including configurations, codebase, software/hardware insights, and pre-trained models, are available to the public, aiming to strike a balance between research accessibility and resource constraints in NLP.

Weigh Your Own Words: Improving Hate Speech Counter Narrative Generation via Attention Regularization

paper_url: http://arxiv.org/abs/2309.02311
repo_url: https://github.com/milanlproc/weigh-your-own-words
paper_authors: Helena Bonaldi, Giuseppe Attanasio, Debora Nozza, Marco Guerini
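for: To combat online hate speech by automatically generating counter narratives with Pretrained Transformer-based Language Models (PLMs).
methods: The study introduces attention regularization to improve the generalization of PLMs. Overfitting to training-specific terms is discouraged, so the generated counter narratives are more diverse and richer across targets and real-world toxic language.
results: Experiments on a benchmark English dataset show that regularized models produce better counter narratives than state-of-the-art approaches in most cases, especially when hateful targets are not present in the training data.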

Abstract
Recent computational approaches for combating online hate speech involve the automatic generation of counter narratives by adapting Pretrained Transformer-based Language Models (PLMs) with human-curated data. This process, however, can produce in-domain overfitting, resulting in models generating acceptable narratives only for hatred similar to training data, with little portability to other targets or to real-world toxic language. This paper introduces novel attention regularization methodologies to improve the generalization capabilities of PLMs for counter narratives generation. Overfitting to training-specific terms is then discouraged, resulting in more diverse and richer narratives. We experiment with two attention-based regularization techniques on a benchmark English dataset. Regularized models produce better counter narratives than state-of-the-art approaches in most cases, both in terms of automatic metrics and human evaluation, especially when hateful targets are not present in the training data. This work paves the way for better and more flexible counter-speech generation models, a task for which datasets are highly challenging to produce.
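
One way to picture attention regularization is as an extra loss term that penalizes attention concentrated on a few tokens. The sketch below is an illustration only: the abstract does not name the two techniques it evaluates, so the entropy-based penalty, the `strength` parameter, and both function names are assumptions.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def regularized_loss(task_loss, attention_rows, strength=0.01):
    """Task loss minus a reward for high average attention entropy.

    Low-entropy (peaked) attention is penalized relative to spread-out
    attention, which discourages reliance on a few specific terms.
    """
    mean_entropy = sum(attention_entropy(r) for r in attention_rows) / len(attention_rows)
    return task_loss - strength * mean_entropy


# Peaked attention on one token versus attention spread over four tokens.
peaked = regularized_loss(1.0, [[1.0, 0.0, 0.0, 0.0]])
spread = regularized_loss(1.0, [[0.25, 0.25, 0.25, 0.25]])
```

With the same task loss, the spread-out attention pattern receives the lower total loss, so training is nudged away from attending to only a handful of training-specific tokens.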

PromptTTS 2: Describing and Generating Voices with Text Prompt

paper_url: http://arxiv.org/abs/2309.02285
repo_url: None
paper_authors: Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian
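for: This work addresses a limitation of text-prompt-based text-to-speech: a text prompt cannot fully capture the variability of a voice.
methods: A variation network predicts, from the text prompt, the voice variability information that the prompt does not capture. A prompt generation pipeline uses a speech understanding model to recognize voice attributes (e.g., gender, speed) and a large language model to compose text prompts.
results: Compared with previous work, PromptTTS 2 generates voices more consistent with text prompts and supports sampling diverse voice variability, offering users more choices. The prompt generation pipeline also produces high-quality prompts, eliminating the large labeling cost.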

Abstract
Speech conveys more information than just text, as the same word can be uttered in various voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods relying on speech prompts (reference speech) for voice variability, using text prompts (descriptions) is more user-friendly since speech prompts can be hard to find or may not exist at all. TTS approaches based on the text prompt face two challenges: 1) the one-to-many problem, where not all details about voice variability can be described in the text prompt, and 2) the limited availability of text prompt datasets, where vendors and a large data labeling cost are required to write text prompts for speech. In this work, we introduce PromptTTS 2 to address these challenges with a variation network to provide variability information of voice not captured by text prompts, and a prompt generation pipeline to utilize large language models (LLMs) to compose high-quality text prompts. Specifically, the variation network predicts the representation extracted from the reference speech (which contains full information about voice) based on the text prompt representation. The prompt generation pipeline generates text prompts for speech with a speech understanding model to recognize voice attributes (e.g., gender, speed) from speech and a large language model to formulate text prompts based on the recognition results. Experiments on a large-scale (44K hours) speech dataset demonstrate that compared to the previous works, PromptTTS 2 generates voices more consistent with text prompts and supports the sampling of diverse voice variability, thereby offering users more choices on voice generation. Additionally, the prompt generation pipeline produces high-quality prompts, eliminating the large labeling cost. The demo page of PromptTTS 2 is available online at https://speechresearch.github.io/prompttts2.

Dialog Action-Aware Transformer for Dialog Policy Learning

paper_url: http://arxiv.org/abs/2309.02240
repo_url: None
paper_authors: Huimin Wang, Wai-Chung Kwan, Kam-Fai Wong
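for: This work aims to make dialog policy learning (DPL) more efficient by using plain-text knowledge from a pre-trained language model to accelerate the learning of the reinforcement learning (RL) agent.
methods: The authors propose a dialog action-aware transformer encoder (DaTrans), fine-tuned with a new "masked last action" task that encourages DaTrans to be dialog-aware and to distil action-specific features. DaTrans is then further optimized in an RL setting.
results: Simulator evaluation and human evaluation demonstrate the effectiveness and efficiency of the proposed model.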

Abstract
Recent works usually address dialog policy learning (DPL) by training a reinforcement learning (RL) agent to determine the best dialog action. However, existing works on deep RL require a large volume of agent-user interactions to achieve acceptable performance. In this paper, we propose to make full use of the plain text knowledge from the pre-trained language model to accelerate the RL agent's learning speed. Specifically, we design a dialog action-aware transformer encoder (DaTrans), which integrates a new fine-tuning procedure named masked last action task to encourage DaTrans to be dialog-aware and distils action-specific features. Then, DaTrans is further optimized in an RL setting with ongoing interactions and evolves through exploration in the dialog action space toward maximizing long-term accumulated rewards. The effectiveness and efficiency of the proposed model are demonstrated with both simulator evaluation and human evaluation.

paper_url: http://arxiv.org/abs/2309.02188
repo_url: None
paper_authors: Abul Hasan, Mark Levene, David Weston
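for: This study investigates the potential benefit of incorporating dictionary information into a neural network architecture for natural language processing.
methods: The study uses a dictionary-based deep learning model to extract COVID-19-related concepts.
results: Incorporating small domain dictionaries into deep learning models can improve concept extraction, and the resulting models transfer to different datasets.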

Abstract
We investigate the potential benefit of incorporating dictionary information into a neural network architecture for natural language processing. In particular, we make use of this architecture to extract several concepts related to COVID-19 from an on-line medical forum. We use a sample from the forum to manually curate one dictionary for each concept. In addition, we use MetaMap, which is a tool for extracting biomedical concepts, to identify a small number of semantic concepts. For a supervised concept extraction task on the forum data, our best model achieved a macro F1 score of 90%. A major difficulty in medical concept extraction is obtaining labelled data from which to build supervised models. We investigate the utility of our models to transfer to data derived from a different source in two ways: first to produce labels via weak learning, and second to perform concept extraction. The dataset we use in this case comprises COVID-19 related tweets and we achieve an F1 score of 81% for symptom concept extraction trained on weakly labelled data. The utility of our dictionaries is compared with a COVID-19 symptom dictionary that was constructed directly from Twitter. Further experiments that incorporate BERT and a COVID-19 version of BERTweet demonstrate that the dictionaries provide a commensurate result. Our results show that incorporating small domain dictionaries to deep learning models can improve concept extraction tasks. Moreover, models built using dictionaries generalize well and are transferable to different datasets on a similar task.
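
For readers unfamiliar with the metric reported above, the following sketch shows how a macro F1 score is computed: F1 is calculated per label and then averaged with equal weight, so rare concepts count as much as frequent ones. The labels and predictions are made-up examples, not data from the paper.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1 scores."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Hypothetical concept labels for four forum mentions.
gold = ["symptom", "symptom", "drug", "other"]
predicted = ["symptom", "drug", "drug", "other"]
score = macro_f1(gold, predicted)
```

In this toy case the per-label F1 values are 2/3 (symptom), 2/3 (drug), and 1 (other), so the macro average is 7/9.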

Advancing Text-to-GLOSS Neural Translation Using a Novel Hyper-parameter Optimization Technique

paper_url: http://arxiv.org/abs/2309.02162
repo_url: None
paper_authors: Younes Ouargani, Noussaima El Khattabi
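for: This paper investigates transformers for Neural Machine Translation of text-to-GLOSS, aiming to improve the accuracy and fluency of generated GLOSS for Deaf and Hard-of-Hearing communication.
methods: The paper uses a novel hyper-parameter exploration technique to explore a variety of architectural parameters and build an optimal transformer-based architecture tailored to text-to-GLOSS translation.
results: On the PHOENIX14T dataset, the best transformer architecture reaches a ROUGE score of 55.18% and a BLEU-1 score of 63.6%, outperforming the previous best results on the same dataset.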

Abstract
In this paper, we investigate the use of transformers for Neural Machine Translation of text-to-GLOSS for Deaf and Hard-of-Hearing communication. Due to the scarcity of available data and limited resources for text-to-GLOSS translation, we treat the problem as a low-resource language task. We use our novel hyper-parameter exploration technique to explore a variety of architectural parameters and build an optimal transformer-based architecture specifically tailored for text-to-GLOSS translation. The study aims to improve the accuracy and fluency of Neural Machine Translation generated GLOSS. This is achieved by examining various architectural parameters including layer count, attention heads, embedding dimension, dropout, and label smoothing to identify the optimal architecture for improving text-to-GLOSS translation performance. The experiments conducted on the PHOENIX14T dataset reveal that the optimal transformer architecture outperforms previous work on the same dataset. The best model reaches a ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score of 55.18% and a BLEU-1 (BiLingual Evaluation Understudy 1) score of 63.6%, outperforming state-of-the-art results on the BLEU1 and ROUGE score by 8.42 and 0.63 respectively.
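
The BLEU-1 figure above is a unigram-level score. As a rough illustration, the sketch below computes a sentence-level, single-reference BLEU-1: clipped unigram precision multiplied by the brevity penalty. Reported scores are normally computed at corpus level with a standard toolkit, so this is a simplification, and the gloss tokens are hypothetical.

```python
import math
from collections import Counter


def bleu1(reference, hypothesis):
    """Sentence-level BLEU-1 against a single reference (token lists)."""
    if not hypothesis:
        return 0.0
    ref_counts = Counter(reference)
    hyp_counts = Counter(hypothesis)
    # Each hypothesis token is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[token]) for token, count in hyp_counts.items())
    precision = clipped / len(hypothesis)
    # Brevity penalty: hypotheses shorter than the reference are penalized.
    if len(hypothesis) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    return brevity_penalty * precision


reference = ["MORGEN", "REGEN", "NORD"]
exact = bleu1(reference, ["MORGEN", "REGEN", "NORD"])
too_short = bleu1(reference, ["MORGEN", "REGEN"])
repeated = bleu1(reference, ["MORGEN", "MORGEN", "NORD"])
```

The clipping step is why repeating a correct gloss does not raise the score, and the brevity penalty is why dropping glosses lowers it even when every remaining gloss is correct.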

Bring the Noise: Introducing Noise Robustness to Pretrained Automatic Speech Recognition

paper_url: http://arxiv.org/abs/2309.02145
repo_url: None
paper_authors: Patrick Eickhoff, Matthias Möller, Theresa Pekarek Rosin, Johannes Twiefel, Stefan Wermter
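for: This study aims to improve automatic speech recognition (ASR) performance in noisy conditions.
methods: The authors propose a method that extracts the denoising capabilities of large End-to-End (E2E) models and can be applied to any encoder-decoder architecture. It takes hidden activations from a Conformer ASR model and feeds them to a decoder that predicts denoised spectrograms.
results: The model filters noise from speech and improves the total Word Error Rate (WER) of the downstream model in noisy conditions, both as a frontend to a pretrained Conformer ASR model and as a frontend for training smaller Conformer ASR models from scratch.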

Abstract
In recent research in the domain of speech processing, large End-to-End (E2E) systems for Automatic Speech Recognition (ASR) have reported state-of-the-art performance on various benchmarks. These systems intrinsically learn how to handle and remove noise conditions from speech. Previous research has shown that it is possible to extract the denoising capabilities of these models into a preprocessor network, which can be used as a frontend for downstream ASR models. However, the proposed methods were limited to specific fully convolutional architectures. In this work, we propose a novel method to extract the denoising capabilities that can be applied to any encoder-decoder architecture. We propose the Cleancoder preprocessor architecture that extracts hidden activations from the Conformer ASR model and feeds them to a decoder to predict denoised spectrograms. We train our pre-processor on the Noisy Speech Database (NSD) to reconstruct denoised spectrograms from noisy inputs. Then, we evaluate our model as a frontend to a pretrained Conformer ASR model as well as a frontend to train smaller Conformer ASR models from scratch. We show that the Cleancoder is able to filter noise from speech and that it improves the total Word Error Rate (WER) of the downstream model in noisy conditions for both applications.
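
Word Error Rate, the metric used above, is the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a plain dynamic-programming version with whitespace tokenization; the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


wer_noisy = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Here one of six reference words is dropped, giving a WER of 1/6. Because insertions are counted, WER can exceed 1 when the hypothesis is much longer than the reference.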

Evaluating Methods for Ground-Truth-Free Foreign Accent Conversion

paper_url: http://arxiv.org/abs/2309.02133
repo_url: https://github.com/unilight/seq2seq-vc
paper_authors: Wen-Chin Huang, Tomoki Toda
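for: This study evaluates three recently proposed methods for ground-truth-free foreign accent conversion (FAC), which converts the accented speech of a non-native speaker into native-sounding speech while preserving speaker identity.
methods: The evaluated methods build on sequence-to-sequence (seq2seq) and non-parallel VC models to convert the accent and control speaker identity.
results: No single method was significantly better than the others on all evaluation axes, in contrast to the conclusions of previous studies. The study also analyzes the training input and output of the seq2seq model and the design choice of the non-parallel VC model, and finds that intelligibility measures such as word error rate do not correlate well with subjective accentedness.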

Abstract
Foreign accent conversion (FAC) is a special application of voice conversion (VC) which aims to convert the accented speech of a non-native speaker to a native-sounding speech with the same speaker identity. FAC is difficult since the native speech from the desired non-native speaker to be used as the training target is impossible to collect. In this work, we evaluate three recently proposed methods for ground-truth-free FAC, where all of them aim to harness the power of sequence-to-sequence (seq2seq) and non-parallel VC models to properly convert the accent and control the speaker identity. Our experimental evaluation results show that no single method was significantly better than the others in all evaluation axes, which is in contrast to conclusions drawn in previous studies. We also explain the effectiveness of these methods with the training input and output of the seq2seq model and examine the design choice of the non-parallel VC model, and show that intelligibility measures such as word error rates do not correlate well with subjective accentedness. Finally, our implementation is open-sourced to promote reproducible research and help future researchers improve upon the compared systems.

Wordle: A Microcosm of Life. Luck, Skill, Cheating, Loyalty, and Influence!

paper_url: http://arxiv.org/abs/2309.02110
repo_url: None
paper_authors: James P. Dilger
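for: This study examines the behavior and habits of Wordle players.
methods: The study compiles data on players' first guesses from May 2023 to August 2023. The data come from software that uses information theory to assess players' luck and skill and displays the first through sixth guesses of a random sample of players.
results: About 0.2-0.5% of players solve the puzzle on the first guess each day, implying that 4,000-10,000 players obtain the target word outside the game. At least 1/3 of players have a favorite starting word, and most remain loyal to it even after it has appeared as a target word. On August 15, 2023, about 30,000 players abruptly changed their starting word, presumably based on a crossword puzzle clue.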

Abstract
Wordle is a popular, online word game offered by the New York Times (nytimes.com). Currently there are some 2 million players of the English version worldwide. Players have 6 attempts to guess the daily word (target word) and after each attempt, the player receives color-coded information about the correctness and position of each letter in the guess. After either a successful completion of the puzzle or the final unsuccessful attempt, software can assess the player's luck and skill using Information Theory and can display data for the first, second, ..., sixth guesses of a random sample of all players. Recently, I discovered that the latter data is presented in a format that can easily be copied and pasted into a spreadsheet. I compiled data on Wordle players' first guesses from May 2023 - August 2023 and inferred some interesting information about Wordle players. A) Every day, about 0.2-0.5% of players solve the puzzle in one attempt. Because the odds of guessing the one of 2,315 possible target words at random is 0.043%, this implies that 4,000 - 10,000 players cheat by obtaining the target word outside of playing the game! B) At least 1/3 of the players have a favorite starting word, or cycle through several. And even though players should be aware that target words are never repeated, most players appear to remain loyal to their starting word even after its appearance as a target word. C) On August 15, 2023, about 30,000 players abruptly changed their starting word, presumably based on a crossword puzzle clue! Wordle players can be influenced! This study goes beyond social media postings, surveys, and Google Trends to provide solid, quantitative evidence about cheating in Wordle.
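
The arithmetic behind finding A can be checked with a few lines. The sketch uses only figures stated in the abstract (2,315 target words, about 2 million players, a 0.2-0.5% first-guess solve rate); the function and variable names are illustrative.

```python
TARGET_WORDS = 2315    # possible target words, from the abstract
PLAYERS = 2_000_000    # "some 2 million players", from the abstract


def chance_of_random_first_guess(target_words=TARGET_WORDS):
    """Probability of hitting the target with one uniformly random guess."""
    return 1 / target_words


def players_solving_first(rate, players=PLAYERS):
    """Number of players who solve on the first guess at a given daily rate."""
    return rate * players


chance = chance_of_random_first_guess()
expected_by_luck = players_solving_first(chance)
observed_low = players_solving_first(0.002)
observed_high = players_solving_first(0.005)

print(f"chance per random guess: {chance:.3%}")
print(f"expected lucky solvers:  {expected_by_luck:.0f}")
print(f"observed solvers:        {observed_low:.0f} to {observed_high:.0f}")
```

The observed range reproduces the 4,000-10,000 figure quoted in the abstract, while pure luck would account for fewer than a thousand first-guess solves per day under the same assumptions.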

Bridging Emotion Role Labeling and Appraisal-based Emotion Analysis

paper_url: http://arxiv.org/abs/2309.02092
repo_url: None
paper_authors: Roman Klinger
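for: This paper examines emotion analysis in text, specifically emotion classification and emotion role labeling.
methods: The paper treats emotions both as events (the basis of emotion role labeling) and as caused by events (made explicit by incorporating psychological appraisal theories into NLP models), drawing on the SEAT and CEAT projects.
results: The paper consolidates the findings of both research directions and points out open research questions.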

Abstract
The term emotion analysis in text subsumes various natural language processing tasks which have in common the goal to enable computers to understand emotions. Most popular is emotion classification in which one or multiple emotions are assigned to a predefined textual unit. While such setting is appropriate to identify the reader's or author's emotion, emotion role labeling adds the perspective of mentioned entities and extracts text spans that correspond to the emotion cause. The underlying emotion theories agree on one important point; that an emotion is caused by some internal or external event and comprises several subcomponents, including the subjective feeling and a cognitive evaluation. We therefore argue that emotions and events are related in two ways. (1) Emotions are events; and this perspective is the fundament in NLP for emotion role labeling. (2) Emotions are caused by events; a perspective that is made explicit with research how to incorporate psychological appraisal theories in NLP models to interpret events. These two research directions, role labeling and (event-focused) emotion classification, have by and large been tackled separately. We contributed to both directions with the projects SEAT (Structured Multi-Domain Emotion Analysis from Text) and CEAT (Computational Event Evaluation based on Appraisal Theories for Emotion Analysis), both funded by the German Research Foundation. In this paper, we consolidate the findings and point out open research questions.

An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models

paper_url: http://arxiv.org/abs/2309.02077
repo_url: None
paper_authors: Yusheng Liao, Yutong Meng, Hongcheng Liu, Yanfeng Wang, Yu Wang
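for: This paper evaluates the practical capabilities of large language models (LLMs) as virtual doctors.
methods: The paper proposes an automatic evaluation framework for multi-turn consultations. Consultation tasks require LLMs to be aware of what they do not know, to inquire about missing medical information from patients, and ultimately to make diagnoses.
results: Fine-tuning with the constructed training set alleviates hallucinations and improves LLM performance on the proposed benchmark, as validated by extensive experiments and ablation studies.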

Abstract
Large language models (LLMs) have achieved significant success in interacting with humans. However, recent studies have revealed that these models often suffer from hallucinations, leading to overly confident but incorrect judgments. This limits their application in the medical domain, where tasks require the utmost accuracy. This paper introduces an automated evaluation framework that assesses the practical capabilities of LLMs as virtual doctors during multi-turn consultations. Consultation tasks are designed to require LLMs to be aware of what they do not know, to inquire about missing medical information from patients, and to ultimately make diagnoses. To evaluate the performance of LLMs for these tasks, a benchmark is proposed by reformulating medical multiple-choice questions from the United States Medical Licensing Examinations (USMLE), and comprehensive evaluation metrics are developed and evaluated on three constructed test sets. A medical consultation training set is further constructed to improve the consultation ability of LLMs. The results of the experiments show that fine-tuning with the training set can alleviate hallucinations and improve LLMs' performance on the proposed benchmark. Extensive experiments and ablation studies are conducted to validate the effectiveness and robustness of the proposed framework.

Bilevel Scheduled Sampling for Dialogue Generation

paper_url: http://arxiv.org/abs/2309.01953
repo_url: None
paper_authors: Jiawen Liu, Kan Li
for: mitigating exposure bias in natural language processing tasks, particularly in dialog generation.
methods: proposed a bilevel scheduled sampling model that takes sentence-level information into account and incorporates it with word-level quality, and a smooth function that maps the combined result to an appropriate range for probabilistic sampling.
results: significantly alleviated the exposure bias problem and outperformed state-of-the-art scheduled sampling methods in experiments conducted on the DailyDialog and PersonaChat datasets.

Abstract
Exposure bias poses a common challenge in numerous natural language processing tasks, particularly in the dialog generation. In response to this issue, researchers have devised various techniques, among which scheduled sampling has proven to be an effective method for mitigating exposure bias. However, the existing state-of-the-art scheduled sampling methods solely consider the current sampling words' quality for threshold truncation sampling, which overlooks the importance of sentence-level information and the method of threshold truncation warrants further discussion. In this paper, we propose a bilevel scheduled sampling model that takes the sentence-level information into account and incorporates it with word-level quality. To enhance sampling diversity and improve the model's adaptability, we propose a smooth function that maps the combined result of sentence-level and word-level information to an appropriate range, and employ probabilistic sampling based on the mapped values instead of threshold truncation. Experiments conducted on the DailyDialog and PersonaChat datasets demonstrate the effectiveness of our proposed methods, which significantly alleviate the exposure bias problem and outperform state-of-the-art scheduled sampling methods.
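
The contrast between threshold truncation and probabilistic sampling can be sketched as follows. Everything specific here is an assumption for illustration: the abstract does not give the smooth function, so a sigmoid stands in for it, and the `mix` and `steepness` parameters, the score ranges, and the function names are hypothetical.

```python
import math
import random


def sampling_probability(word_score, sentence_score, mix=0.5, steepness=6.0):
    """Map combined word-level and sentence-level quality (each in [0, 1])
    to a probability with a smooth sigmoid instead of a hard threshold."""
    combined = mix * word_score + (1 - mix) * sentence_score
    return 1 / (1 + math.exp(-steepness * (combined - 0.5)))


def choose_decoder_input(gold_token, predicted_token, word_score, sentence_score, draw=None):
    """Feed the model's own prediction with the mapped probability,
    otherwise fall back to the ground-truth token."""
    if draw is None:
        draw = random.random()
    probability = sampling_probability(word_score, sentence_score)
    return predicted_token if draw < probability else gold_token


# A confident prediction is usually, but not always, fed back to the decoder.
p_confident = sampling_probability(0.9, 0.8)
p_uncertain = sampling_probability(0.2, 0.3)
```

With a hard threshold, every prediction above the cut-off would be used and every one below it discarded. The smooth mapping keeps some randomness on both sides, which is the source of the added sampling diversity the abstract describes.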

TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models

paper_url: http://arxiv.org/abs/2309.01947
repo_url: None
paper_authors: Yuan Shangguan, Haichuan Yang, Danni Li, Chunyang Wu, Yassir Fathullah, Dilin Wang, Ayushi Dalmia, Raghuraman Krishnamoorthi, Ozlem Kalinli, Junteng Jia, Jay Mahadeokar, Xin Lei, Mike Seltzer, Vikas Chandra
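for: This paper proposes TODM (Train Once Deploy Many), a new method for efficiently training on-device automatic speech recognition (ASR) models of many sizes suited to different hardware, with GPU-hours comparable to those of a single training job.
methods: Building on prior Supernet work, in which RNN-T models share weights within a Supernet, the method reduces the Supernet's layer sizes and widths to obtain smaller subnetworks suited to different hardware types. Three techniques improve the TODM Supernet: adaptive dropouts, in-place Alpha-divergence knowledge distillation, and the ScaledAdam optimizer.
results: Comparing Supernet-trained and individually tuned Multi-Head State Space Model (MH-SSM) RNN-T models on LibriSpeech, the TODM Supernet matches or surpasses manually tuned models, with up to 3% relative improvement in word error rate (WER), while keeping the cost of training many models at a small constant.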

Abstract
Automatic Speech Recognition (ASR) models need to be optimized for specific hardware before they can be deployed on devices. This can be done by tuning the model's hyperparameters or exploring variations in its architecture. Re-training and re-validating models after making these changes can be a resource-intensive task. This paper presents TODM (Train Once Deploy Many), a new approach to efficiently train many sizes of hardware-friendly on-device ASR models with GPU-hours comparable to those of a single training job. TODM leverages insights from prior work on Supernet, where Recurrent Neural Network Transducer (RNN-T) models share weights within a Supernet. It reduces layer sizes and widths of the Supernet to obtain subnetworks, making them smaller models suitable for all hardware types. We introduce a novel combination of three techniques to improve the outcomes of the TODM Supernet: adaptive dropouts, an in-place Alpha-divergence knowledge distillation, and the use of the ScaledAdam optimizer. We validate our approach by comparing Supernet-trained versus individually tuned Multi-Head State Space Model (MH-SSM) RNN-T using LibriSpeech. Results demonstrate that our TODM Supernet either matches or surpasses the performance of manually tuned models, with up to 3% relative improvement in word error rate (WER), while efficiently keeping the cost of training many models at a small constant.
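
To make the weight-sharing idea concrete, the sketch below shows one generic way a smaller subnetwork can be read out of a larger shared network: keep the first `depth` layers and the top-left `width` block of each weight matrix. It uses plain feed-forward layers with ReLU rather than an RNN-T, and the names `slice_linear`, `linear`, and `subnet_forward` are hypothetical. It is an illustration of Supernet-style slicing under those assumptions, not the TODM implementation.

```python
def slice_linear(weight, bias, out_width, in_width):
    """Take the top-left block of a shared weight matrix as a narrower layer."""
    return [row[:in_width] for row in weight[:out_width]], bias[:out_width]

def linear(weight, bias, x):
    """Compute weight @ x + bias for list-of-lists weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def subnet_forward(layers, x, depth, width):
    """Run the first `depth` shared layers, each narrowed to `width` output units.

    Every subnetwork reuses the same underlying parameters, so training the
    large network also updates all of the smaller ones.
    """
    for weight, bias in layers[:depth]:
        w, b = slice_linear(weight, bias, width, len(x))
        x = [max(0.0, v) for v in linear(w, b, x)]  # ReLU activation
    return x
```

With two shared 3x3 identity layers and input `[1.0, 2.0, 3.0]`, calling `subnet_forward(layers, x, depth=2, width=2)` uses only the 2-unit slices of each layer, while `width=3` runs the full network.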

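Summary
Automatic speech recognition (ASR) models must be optimized for specific hardware before they can be deployed on devices. This can be done by tuning the model's hyperparameters or exploring variations in its architecture, but re-training and re-validating the model after such changes can be resource-intensive. This paper introduces Train Once Deploy Many (TODM), a new method that efficiently trains hardware-friendly on-device ASR models of many sizes with GPU-hours comparable to those of a single training job. TODM builds on prior Supernet work, in which Recurrent Neural Network Transducer (RNN-T) models share weights within a Supernet. It reduces the Supernet's layer sizes and widths to obtain subnetworks, which are smaller models suitable for all hardware types. We introduce a new combination of three techniques to improve the TODM Supernet: adaptive dropouts, in-place Alpha-divergence knowledge distillation, and the ScaledAdam optimizer. Comparing Supernet-trained and individually tuned Multi-Head State Space Model (MH-SSM) RNN-T models on LibriSpeech, we find that the TODM Supernet matches or surpasses manually tuned models, with up to 3% relative improvement in word error rate (WER), while keeping the cost of training many models at a small constant.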

QuantEase: Optimization-based Quantization for Language Models – An Efficient and Intuitive Algorithm

paper_url: http://arxiv.org/abs/2309.01885
repo_url: None
paper_authors: Kayhan Behdin, Ayan Acharya, Aman Gupta, Sathiya Keerthi, Rahul Mazumder
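for: This study addresses compression techniques for the efficient deployment of large language models (LLMs), specifically Post-Training Quantization (PTQ).
methods: It proposes QuantEase, a layer-wise quantization framework in which each layer is quantized separately, and uses Coordinate Descent (CD) techniques to solve the resulting discrete-structured non-convex optimization problems.
results: Across various LLMs and datasets, QuantEase achieves state-of-the-art perplexity and zero-shot accuracy, with relative improvements of up to 15% over methods such as GPTQ. Its outlier-aware variant, which keeps significant weights (outliers) at full precision, achieves near or sub-3-bit quantization without non-uniform quantization or grouping techniques, improving on methods such as SpQR by up to two times in perplexity.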

Abstract
With the rising popularity of Large Language Models (LLMs), there has been an increasing interest in compression techniques that enable their efficient deployment. This study focuses on the Post-Training Quantization (PTQ) of LLMs. Drawing from recent advances, our work introduces QuantEase, a layer-wise quantization framework where individual layers undergo separate quantization. The problem is framed as a discrete-structured non-convex optimization, prompting the development of algorithms rooted in Coordinate Descent (CD) techniques. These CD-based methods provide high-quality solutions to the complex non-convex layer-wise quantization problems. Notably, our CD-based approach features straightforward updates, relying solely on matrix and vector operations, circumventing the need for matrix inversion or decomposition. We also explore an outlier-aware variant of our approach, allowing for retaining significant weights (outliers) with complete precision. Our proposal attains state-of-the-art performance in terms of perplexity and zero-shot accuracy in empirical evaluations across various LLMs and datasets, with relative improvements up to 15% over methods such as GPTQ. Particularly noteworthy is our outlier-aware algorithm's capability to achieve near or sub-3-bit quantization of LLMs with an acceptable drop in accuracy, obviating the need for non-uniform quantization or grouping techniques, improving upon methods such as SpQR by up to two times in terms of perplexity.
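
The coordinate descent idea can be sketched on a small example. The abstract does not state the objective, so this sketch assumes the commonly used layer-wise reconstruction error for one output column, f(q) = (w - q)^T H (w - q) with H = X^T X built from calibration inputs X, and restricts each q[j] to a fixed grid. With the other coordinates held fixed, the best grid value for q[j] is the grid point nearest to the unconstrained one-dimensional minimizer, so each update needs only vector arithmetic. The names `quantize_cd` and `objective` are hypothetical, and this is not the authors' implementation.

```python
def quantize_cd(w, H, grid, sweeps=10):
    """Coordinate descent on f(q) = (w - q)^T H (w - q) with each q[j] in `grid`.

    Assumes H is symmetric with strictly positive diagonal entries.
    """
    def nearest(value):
        return min(grid, key=lambda g: abs(g - value))

    q = [nearest(v) for v in w]  # start from round-to-nearest
    n = len(w)
    for _ in range(sweeps):
        for j in range(n):
            # Interaction of coordinate j with the current errors of the others.
            cross = sum(H[j][k] * (w[k] - q[k]) for k in range(n) if k != j)
            # Unconstrained 1-D minimizer, then snap to the closest grid point.
            q[j] = nearest(w[j] + cross / H[j][j])
    return q

def objective(w, q, H):
    """Evaluate the quadratic reconstruction error (w - q)^T H (w - q)."""
    e = [a - b for a, b in zip(w, q)]
    n = len(e)
    return sum(e[j] * H[j][k] * e[k] for j in range(n) for k in range(n))
```

For `w = [0.4, 0.4]`, `H = [[1.0, 0.9], [0.9, 1.0]]`, and `grid = [0.0, 1.0]`, plain rounding gives `[0.0, 0.0]`, whereas the coordinate descent sweeps settle on `[1.0, 0.0]`, which has a lower reconstruction error because the correlated inputs let one rounding error offset the other.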

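Summary
As large language models (LLMs) grow in popularity, compression techniques have drawn increasing attention. This study focuses on Post-Training Quantization (PTQ) of LLMs. Building on recent advances, we propose QuantEase, a layer-wise quantization framework in which each layer is quantized independently. The problem is formulated as a discrete-structured non-convex optimization problem, which motivates algorithms based on Coordinate Descent (CD). These CD-based methods yield high-quality solutions with simple updates that use only matrix and vector operations and need no matrix inversion or decomposition. We also explore an outlier-aware variant that retains significant weights (outliers) at full precision. In experiments across various LLMs and datasets, our approach achieves state-of-the-art perplexity and zero-shot accuracy, with relative improvements of up to 15% over methods such as GPTQ. In particular, the outlier-aware variant achieves near or sub-3-bit quantization of LLMs with an acceptable drop in accuracy and without non-uniform quantization or grouping techniques, improving on methods such as SpQR by up to two times in perplexity.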

2023-09-05

Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning

Substitution-based Semantic Change Detection using Contextual Embeddings

nanoT5: A PyTorch Framework for Pre-training and Fine-tuning T5-style Models with Limited Resources

Weigh Your Own Words: Improving Hate Speech Counter Narrative Generation via Attention Regularization

PromptTTS 2: Describing and Generating Voices with Text Prompt

Dialog Action-Aware Transformer for Dialog Policy Learning

Incorporating Dictionaries into a Neural Network Architecture to Extract COVID-19 Medical Concepts From Social Media

Advancing Text-to-GLOSS Neural Translation Using a Novel Hyper-parameter Optimization Technique

Bring the Noise: Introducing Noise Robustness to Pretrained Automatic Speech Recognition

Evaluating Methods for Ground-Truth-Free Foreign Accent Conversion

Wordle: A Microcosm of Life. Luck, Skill, Cheating, Loyalty, and Influence!

Bridging Emotion Role Labeling and Appraisal-based Emotion Analysis

An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models

Bilevel Scheduled Sampling for Dialogue Generation

TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models

QuantEase: Optimization-based Quantization for Language Models – An Efficient and Intuitive Algorithm