cs.CL - 2023-09-27

AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

  • paper_url: http://arxiv.org/abs/2309.16058
  • repo_url: https://github.com/kyegomez/AnyMAL
  • paper_authors: Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Tushar Nagarajan, Matt Smith, Shashank Jain, Chun-Fu Yeh, Prakash Murugesan, Peyman Heidari, Yue Liu, Kavya Srinet, Babak Damavandi, Anuj Kumar
  • for: Developing a language model that understands diverse input modality signals (text, image, video, audio, IMU motion sensor) and generates textual responses.
  • methods: Inherits the text-based reasoning abilities of the state-of-the-art LLaMA-2 (70B) and maps modality-specific signals into a joint textual space through a pre-trained aligner module (a minimal sketch of such an aligner follows the abstract).
  • results: Human and automatic evaluations demonstrate state-of-the-art performance on a range of multimodal tasks.
    Abstract We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e. text, image, video, audio, IMU motion sensor), and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of the state-of-the-art LLMs including LLaMA-2 (70B), and converts modality-specific signals to the joint textual space through a pre-trained aligner module. To further strengthen the multimodal LLM's capabilities, we fine-tune the model with a multimodal instruction set manually collected to cover diverse topics and tasks beyond simple QAs. We conduct comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.
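The aligner module is described only at a high level in the abstract. Below is a minimal, hypothetical PyTorch sketch of one way such an aligner could work: a frozen modality encoder produces an embedding that a small trainable projection maps into the LLM's token-embedding space, where it is prepended to the text embeddings. The class name, dimensions, and prefix-token count are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ModalityAligner(nn.Module):
    """Hypothetical aligner: projects a frozen encoder's output into the LLM embedding space."""
    def __init__(self, modality_dim: int, llm_dim: int, num_prefix_tokens: int = 32):
        super().__init__()
        self.num_prefix_tokens = num_prefix_tokens
        # a single linear projection producing a fixed number of "soft" prefix tokens
        self.proj = nn.Linear(modality_dim, llm_dim * num_prefix_tokens)

    def forward(self, modality_emb: torch.Tensor) -> torch.Tensor:
        # modality_emb: (batch, modality_dim) pooled output of a frozen modality encoder
        prefix = self.proj(modality_emb)                              # (batch, llm_dim * k)
        return prefix.view(modality_emb.size(0), self.num_prefix_tokens, -1)

# usage: prepend the aligned prefix to the LLM's text token embeddings
aligner = ModalityAligner(modality_dim=1024, llm_dim=4096)
image_emb = torch.randn(2, 1024)                   # stand-in for a frozen image encoder output
text_emb = torch.randn(2, 16, 4096)                # stand-in for LLM token embeddings
llm_inputs = torch.cat([aligner(image_emb), text_emb], dim=1)   # (2, 48, 4096)
```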

Effective Long-Context Scaling of Foundation Models

  • paper_url: http://arxiv.org/abs/2309.16039
  • repo_url: None
  • paper_authors: Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, Hao Ma
  • for: Presenting a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
  • methods: Continual pretraining from Llama 2 with longer training sequences, on a dataset in which long texts are upsampled (a sketch of length-based upsampling follows the abstract).
  • results: Consistent improvements on language modeling, synthetic context probing tasks, and a wide range of research benchmarks, with significant gains on long-context tasks; with a cost-effective instruction tuning procedure, the 70B variant surpasses gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks.
    Abstract We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our model series are built through continual pretraining from Llama 2 with longer training sequences and on a dataset where long texts are upsampled. We perform extensive evaluation on language modeling, synthetic context probing tasks, and a wide range of research benchmarks. On research benchmarks, our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2. Notably, with a cost-effective instruction tuning procedure that does not require human-annotated long instruction data, the 70B variant can already surpass gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks. Alongside these results, we provide an in-depth analysis on the individual components of our method. We delve into Llama's position encodings and discuss its limitation in modeling long dependencies. We also examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths -- our ablation experiments suggest that having abundant long texts in the pretrain dataset is not the key to achieving strong performance, and we empirically verify that long context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences.
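The abstract mentions that long texts are upsampled in the pretraining data mix. The snippet below is a hedged illustration of one simple way length-based upsampling of documents could be implemented; the weighting scheme, threshold, and boost factor are assumptions for illustration, not the paper's recipe.

```python
import random

def upsample_long_documents(docs, long_threshold=8192, boost=4.0, k=1000, seed=0):
    """Sample documents with replacement, giving long documents a higher sampling weight.

    docs: list of token-id sequences; long_threshold and boost are illustrative values.
    """
    weights = [boost if len(d) >= long_threshold else 1.0 for d in docs]
    rng = random.Random(seed)
    return rng.choices(docs, weights=weights, k=k)
```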

Targeted Image Data Augmentation Increases Basic Skills Captioning Robustness

  • paper_url: http://arxiv.org/abs/2309.15991
  • repo_url: None
  • paper_authors: Valentin Barriere, Felipe del Rio, Andres Carvallo De Ferari, Carlos Aspillaga, Eugenio Herrera-Berg, Cristian Buc Calderon
  • for: Improving image-captioning models' human-like abilities (e.g., gender recognition) through targeted data augmentation.
  • methods: Targeted Image-editing Data Augmentation (TIDA): identify a specific skill in an image caption, modify the caption, and use a text-to-image model to edit the image to match the new caption, so that models better capture the relevant correlational structure (a hedged sketch of the pipeline follows the abstract).
  • results: On the Flickr30K benchmark, TIDA-enhanced datasets for gender, color, and counting abilities improve several image-captioning metrics; the paper also provides a fine-grained analysis beyond BLEU and a comparison of text-to-image generative models.
Abstract Artificial neural networks typically struggle in generalizing to out-of-context examples. One reason for this limitation is caused by having datasets that incorporate only partial information regarding the potential correlational structure of the world. In this work, we propose TIDA (Targeted Image-editing Data Augmentation), a targeted data augmentation method focused on improving models' human-like abilities (e.g., gender recognition) by filling the correlational structure gap using a text-to-image generative model. More specifically, TIDA identifies specific skills in captions describing images (e.g., the presence of a specific gender in the image), changes the caption (e.g., "woman" to "man"), and then uses a text-to-image model to edit the image in order to match the novel caption (e.g., uniquely changing a woman to a man while maintaining the context identical). Based on the Flickr30K benchmark, we show that, compared with the original data set, a TIDA-enhanced dataset related to gender, color, and counting abilities induces better performance in several image captioning metrics. Furthermore, on top of relying on the classical BLEU metric, we conduct a fine-grained analysis of the improvements of our models against the baseline in different ways. We compared text-to-image generative models and found different behaviors of the image captioning models in terms of visual encoding and textual decoding.
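A minimal, hypothetical sketch of the TIDA-style pipeline described above: swap a skill-bearing word in the caption and ask a text-guided image editor to make the image match the new caption. The edit_image_to_match callable and the word list are placeholders; the paper's actual skill detection and editing model are not specified here.

```python
GENDER_SWAPS = {"woman": "man", "man": "woman", "girl": "boy", "boy": "girl"}  # illustrative

def tida_augment(image, caption, edit_image_to_match):
    """Return (edited_image, new_caption) pairs for every gender word found in the caption.

    edit_image_to_match(image, prompt) is a placeholder for any text-guided image editor.
    """
    augmented = []
    tokens = caption.split()
    for i, tok in enumerate(tokens):
        swap = GENDER_SWAPS.get(tok.lower())
        if swap is None:
            continue
        new_caption = " ".join(tokens[:i] + [swap] + tokens[i + 1:])
        augmented.append((edit_image_to_match(image, new_caption), new_caption))
    return augmented
```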

Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing

  • paper_url: http://arxiv.org/abs/2309.15826
  • repo_url: None
  • paper_authors: Brian Yan, Xuankai Chang, Antonios Anastasopoulos, Yuya Fujita, Shinji Watanabe
  • for: Proposing an ST/MT multi-tasking framework with hard parameter sharing, in which all model parameters are shared across modalities, to improve speech-to-text translation.
  • methods: A pre-processing stage converts speech and text inputs into two discrete token sequences of similar length, so the model can process both modalities with a joint vocabulary.
  • results: The framework improves attentional encoder-decoder, CTC, transducer, and joint CTC/attention models by an average of +0.5 BLEU without external MT data; incorporating external MT data yields +0.8 BLEU, and transfer learning from pre-trained textual models yields +1.8 BLEU.
    Abstract Recent works in end-to-end speech-to-text translation (ST) have proposed multi-tasking methods with soft parameter sharing which leverage machine translation (MT) data via secondary encoders that map text inputs to an eventual cross-modal representation. In this work, we instead propose a ST/MT multi-tasking framework with hard parameter sharing in which all model parameters are shared cross-modally. Our method reduces the speech-text modality gap via a pre-processing stage which converts speech and text inputs into two discrete token sequences of similar length -- this allows models to indiscriminately process both modalities simply using a joint vocabulary. With experiments on MuST-C, we demonstrate that our multi-tasking framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU without any external MT data. Further, we show that this framework incorporates external MT data, yielding +0.8 BLEU, and also improves transfer learning from pre-trained textual models, yielding +1.8 BLEU.

Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

  • paper_url: http://arxiv.org/abs/2309.15800
  • repo_url: None
  • paper_authors: Xuankai Chang, Brian Yan, Kwanghee Choi, Jeeweon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang
  • for: Exploring the use of discrete speech units in end-to-end speech processing models.
  • methods: Applies methods such as de-duplication and subword modeling to compress the length of discrete speech unit sequences (a small sketch follows the abstract).
  • results: Experiments on 12 automatic speech recognition, 3 speech translation, and 1 spoken language understanding corpora show that discrete units achieve reasonably good results in almost all settings while substantially reducing training time.
    Abstract Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations, which significantly compresses the size of speech data. Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length. Hence, training time is significantly reduced while retaining notable performance. In this study, we undertake a comprehensive and systematic exploration into the application of discrete units within end-to-end speech processing models. Experiments on 12 automatic speech recognition, 3 speech translation, and 1 spoken language understanding corpora demonstrate that discrete units achieve reasonably good results in almost all the settings. We intend to release our configurations and trained models to foster future research efforts.
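The compression steps mentioned above are simple to illustrate. The sketch below collapses consecutive duplicate unit IDs and maps the result to characters so that an off-the-shelf subword model (e.g., BPE) could be trained on it; the character-mapping scheme is an illustrative assumption.

```python
from itertools import groupby

def deduplicate(units):
    """Collapse runs of identical discrete unit IDs, e.g. [5, 5, 5, 9, 9, 2] -> [5, 9, 2]."""
    return [u for u, _ in groupby(units)]

def units_to_text(units, offset=0x4E00):
    """Map unit IDs to arbitrary unicode characters so a subword (BPE) model can be trained
    on the sequences; the offset is only an illustrative choice."""
    return "".join(chr(offset + u) for u in units)

units = [5, 5, 5, 9, 9, 2, 2, 2, 7]
compressed = deduplicate(units)       # [5, 9, 2, 7]
text = units_to_text(compressed)      # short "pseudo-text" ready for subword modeling
```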

Large Language Model Routing with Benchmark Datasets

  • paper_url: http://arxiv.org/abs/2309.15789
  • repo_url: None
  • paper_authors: Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, Mikhail Yurochkin
  • for: Selecting the best large language model (LLM) from a collection of models for a new task.
  • methods: Repurposes benchmark datasets to learn a "router" model for LLM selection, reducing the problem to a collection of binary classification tasks (a toy sketch follows the abstract).
  • results: Consistently improves over using any single model for all tasks, across a variety of tasks and settings.
    Abstract There is a rapidly growing number of open-source Large Language Models (LLMs) and benchmark datasets to compare them. While some models dominate these benchmarks, no single model typically achieves the best accuracy in all tasks and use cases. In this work, we address the challenge of selecting the best LLM out of a collection of models for new tasks. We propose a new formulation for the problem, in which benchmark datasets are repurposed to learn a "router" model for this LLM selection, and we show that this problem can be reduced to a collection of binary classification tasks. We demonstrate the utility and limitations of learning model routers from various benchmark datasets, where we consistently improve performance upon using any single model for all tasks.
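A toy illustration of the binary-classification view of routing described above, assuming we already have per-sample correctness labels for each candidate LLM on benchmark data and some input featurizer. scikit-learn's LogisticRegression stands in for whatever classifier the paper actually uses; the data below is random.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_router(features, correctness_by_model):
    """Fit one binary classifier per candidate LLM.

    features: (n_samples, d) array of input representations (e.g., sentence embeddings).
    correctness_by_model: dict model_name -> (n_samples,) 0/1 array of per-sample correctness.
    """
    return {name: LogisticRegression(max_iter=1000).fit(features, y)
            for name, y in correctness_by_model.items()}

def route(router, x):
    """Pick the model with the highest predicted probability of answering correctly."""
    scores = {name: clf.predict_proba(x.reshape(1, -1))[0, 1] for name, clf in router.items()}
    return max(scores, key=scores.get)

# toy usage with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
labels = {"model_a": (X[:, 0] > 0).astype(int), "model_b": (X[:, 1] > 0).astype(int)}
router = train_router(X, labels)
print(route(router, rng.normal(size=16)))
```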

Question answering using deep learning in low resource Indian language Marathi

  • paper_url: http://arxiv.org/abs/2309.15779
  • repo_url: None
  • paper_authors: Dhiraj Amin, Sharvari Govilkar, Sagar Kulkarni
  • for: Building a reading-comprehension-based Marathi question answering system using transformer models.
  • methods: Fine-tunes several pretrained multilingual and monolingual models, including Multilingual Representations for Indian Languages (MuRIL), MahaBERT, and Indic Bidirectional Encoder Representations from Transformers (IndicBERT), on a Marathi reading-comprehension dataset (a hedged fine-tuning sketch follows the abstract).
  • results: The MuRIL multilingual model achieves the best accuracy on the Marathi dataset, with an EM score of 0.64 and an F1 score of 0.74.
    Abstract Precise answers are extracted from a text for a given input question in a question answering system. Marathi question answering system is created in recent studies by using ontology, rule base and machine learning based approaches. Recently transformer models and transfer learning approaches are used to solve question answering challenges. In this paper we investigate different transformer models for creating a reading comprehension-based Marathi question answering system. We have experimented on different pretrained Marathi language multilingual and monolingual models like Multilingual Representations for Indian Languages (MuRIL), MahaBERT, Indic Bidirectional Encoder Representations from Transformers (IndicBERT) and fine-tuned it on a Marathi reading comprehension-based data set. We got the best accuracy in a MuRIL multilingual model with an EM score of 0.64 and F1 score of 0.74 by fine tuning the model on the Marathi dataset.
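A hedged sketch of extractive QA with Hugging Face Transformers, using the public google/muril-base-cased checkpoint as a stand-in; the example sentences, hyperparameters, and overall setup are illustrative placeholders rather than the paper's actual configuration.

```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "google/muril-base-cased"   # public MuRIL checkpoint, used here as a stand-in
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "महाराष्ट्राची राजधानी कोणती आहे?"     # illustrative Marathi question
context = "मुंबई ही महाराष्ट्राची राजधानी आहे."    # illustrative Marathi context
inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)                 # start/end logits over the tokenized input

start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0, start:end + 1])
print(answer)   # the QA head is untrained here, so the span is arbitrary until fine-tuning
```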

Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization

  • paper_url: http://arxiv.org/abs/2309.15686
  • repo_url: None
  • paper_authors: Amir Hussein, Brian Yan, Antonios Anastasopoulos, Shinji Watanabe, Sanjeev Khudanpur
  • for: Improving end-to-end conversational speech translation (E2E-ST) by incorporating target-language context.
  • methods: Introduces target-language context into E2E-ST to enhance coherence and overcome memory constraints of extended audio segments, proposes context dropout to ensure robustness to missing context, and adds speaker information to further improve performance (a sketch of context dropout follows the abstract).
  • results: The proposed contextual E2E-ST outperforms the isolated utterance-based approach; in conversational speech, contextual information mainly helps capture context style and resolve anaphora and named entities.
    Abstract Incorporating longer context has been shown to benefit machine translation, but the inclusion of context in end-to-end speech translation (E2E-ST) remains under-studied. To bridge this gap, we introduce target language context in E2E-ST, enhancing coherence and overcoming memory constraints of extended audio segments. Additionally, we propose context dropout to ensure robustness to the absence of context, and further improve performance by adding speaker information. Our proposed contextual E2E-ST outperforms the isolated utterance-based E2E-ST approach. Lastly, we demonstrate that in conversational speech, contextual information primarily contributes to capturing context style, as well as resolving anaphora and named entities.
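Context dropout is only named in the abstract; the snippet below is one plausible, hedged reading of it: during training, the preceding target-language context is randomly dropped so the model also learns to translate without it. The dropout probability and token layout are assumptions.

```python
import random

def build_target_with_context(prev_translations, target_text, p_drop=0.3, sep=" <sep> ", rng=random):
    """With probability p_drop, omit the preceding target-language context (context dropout);
    otherwise prepend it to the current target so the decoder can condition on it."""
    if not prev_translations or rng.random() < p_drop:
        return target_text
    return sep.join(prev_translations) + sep + target_text
```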

Speech collage: code-switched audio generation by collaging monolingual corpora

  • paper_url: http://arxiv.org/abs/2309.15674
  • repo_url: https://github.com/jsalt2022codeswitchingasr/generating-code-switched-audio
  • paper_authors: Amir Hussein, Dorsa Zeinali, Ondřej Klejch, Matthew Wiesner, Brian Yan, Shammur Chowdhury, Ahmed Ali, Shinji Watanabe, Sanjeev Khudanpur
  • for: Improving automatic speech recognition (ASR) for code-switching (CS), especially under data scarcity.
  • methods: Speech Collage synthesizes CS data by splicing audio segments from monolingual corpora, and uses an overlap-add approach to improve the smoothness of the generated audio (a minimal overlap-add sketch follows the abstract).
  • results: The generated CS data yields relative reductions of up to 34.4% in Mixed-Error Rate (in-domain) and 16.2% in Word-Error Rate (zero-shot); CS augmentation also strengthens the model's code-switching inclination and reduces its monolingual bias.
    Abstract Designing effective automatic speech recognition (ASR) systems for Code-Switching (CS) often depends on the availability of the transcribed CS resources. To address data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. We further improve the smoothness quality of audio generation using an overlap-add approach. We investigate the impact of generated data on speech recognition in two scenarios: using in-domain CS text and a zero-shot approach with synthesized CS text. Empirical results highlight up to 34.4% and 16.2% relative reductions in Mixed-Error Rate and Word-Error Rate for in-domain and zero-shot scenarios, respectively. Lastly, we demonstrate that CS augmentation bolsters the model's code-switching inclination and reduces its monolingual bias.
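A minimal numpy sketch of overlap-add splicing as described above: consecutive segments are cross-faded over a short overlap region to smooth the joins. The overlap length and linear fade are illustrative choices, not necessarily the paper's.

```python
import numpy as np

def overlap_add_concat(segments, overlap=240):
    """Concatenate mono audio segments, cross-fading each join over `overlap` samples."""
    out = segments[0].astype(np.float32)
    fade_in = np.linspace(0.0, 1.0, overlap, dtype=np.float32)
    fade_out = 1.0 - fade_in
    for seg in segments[1:]:
        seg = seg.astype(np.float32)
        out[-overlap:] = out[-overlap:] * fade_out + seg[:overlap] * fade_in
        out = np.concatenate([out, seg[overlap:]])
    return out

# toy usage: splice three 0.5 s segments sampled at 16 kHz
rng = np.random.default_rng(0)
segments = [rng.normal(size=8000) for _ in range(3)]
collage = overlap_add_concat(segments)
```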

MONOVAB : An Annotated Corpus for Bangla Multi-label Emotion Detection

  • paper_url: http://arxiv.org/abs/2309.15670
  • repo_url: https://github.com/sajaldoes/facebookscraper
  • paper_authors: Sumit Kumar Banshal, Sajal Das, Shumaiya Akter Shammi, Narayan Ranjan Chakraborty
  • for: Advancing emotion recognition (ER) and sentiment analysis (SA) for Bangla, a structurally complex and comparatively under-explored language, with a focus on multi-label emotion detection.
  • methods: Constructs an annotated corpus from data scraped from Facebook using a context-based annotation approach, and evaluates transformer models, with BERT used for prediction (a minimal multi-label fine-tuning sketch follows the abstract).
  • results: BERT achieves the best results among the implemented methods for multi-label ER in Bangla; a web application was also developed to demonstrate the performance of the top-performing pre-trained model.
    Abstract In recent years, Sentiment Analysis (SA) and Emotion Recognition (ER) have been increasingly popular in the Bangla language, which is the seventh most spoken language throughout the entire world. However, the language is structurally complicated, which makes this field arduous to extract emotions in an accurate manner. Several distinct approaches such as the extraction of positive and negative sentiments as well as multiclass emotions, have been implemented in this field of study. Nevertheless, the extraction of multiple sentiments is an almost untouched area in this language. Which involves identifying several feelings based on a single piece of text. Therefore, this study demonstrates a thorough method for constructing an annotated corpus based on scrapped data from Facebook to bridge the gaps in this subject area to overcome the challenges. To make this annotation more fruitful, the context-based approach has been used. Bidirectional Encoder Representations from Transformers (BERT), a well-known methodology of transformers, have been shown the best results of all methods implemented. Finally, a web application has been developed to demonstrate the performance of the pre-trained top-performer model (BERT) for multi-label ER in Bangla.
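Multi-label emotion detection is typically framed as independent sigmoid outputs per emotion. The sketch below shows this framing with Hugging Face Transformers; the checkpoint, label set, example sentence, and threshold are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

emotions = ["joy", "sadness", "anger", "fear", "surprise", "disgust"]   # illustrative label set
model_name = "bert-base-multilingual-cased"                             # stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(emotions),
    problem_type="multi_label_classification",   # sigmoid outputs + BCE loss during training
)

text = "আমি আজ খুব খুশি!"                        # illustrative Bangla sentence ("I am very happy today!")
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)[0]
predicted = [e for e, p in zip(emotions, probs) if p > 0.5]   # threshold is an assumption
```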

Conversational Feedback in Scripted versus Spontaneous Dialogues: A Comparative Analysis

  • paper_url: http://arxiv.org/abs/2309.15656
  • repo_url: None
  • paper_authors: Ildikó Pilán, Laurent Prévot, Hendrik Buschmeier, Pierre Lison
  • for: Analyzing communicative feedback phenomena (backchannels, acknowledgments, clarification requests) in dialogue, and how they differ between scripted subtitles and spontaneous conversations.
  • methods: Extracts lexical statistics and classification outputs from a neural dialogue act tagger on dialogue data in English, French, German, Hungarian, Italian, Japanese, Norwegian, and Chinese.
  • results: Two main findings: (1) conversational feedback is markedly less frequent in subtitles than in spontaneous dialogues, and (2) subtitles contain a higher proportion of negative feedback; dialogue responses generated by large language models follow the same trends and include comparatively little communicative feedback unless the models are explicitly fine-tuned on spontaneous dialogues.
    Abstract Scripted dialogues such as movie and TV subtitles constitute a widespread source of training data for conversational NLP models. However, the linguistic characteristics of those dialogues are notably different from those observed in corpora of spontaneous interactions. This difference is particularly marked for communicative feedback and grounding phenomena such as backchannels, acknowledgments, or clarification requests. Such signals are known to constitute a key part of the conversation flow and are used by the dialogue participants to provide feedback to one another on their perception of the ongoing interaction. This paper presents a quantitative analysis of such communicative feedback phenomena in both subtitles and spontaneous conversations. Based on dialogue data in English, French, German, Hungarian, Italian, Japanese, Norwegian and Chinese, we extract both lexical statistics and classification outputs obtained with a neural dialogue act tagger. Two main findings of this empirical study are that (1) conversational feedback is markedly less frequent in subtitles than in spontaneous dialogues and (2) subtitles contain a higher proportion of negative feedback. Furthermore, we show that dialogue responses generated by large language models also follow the same underlying trends and include comparatively few occurrences of communicative feedback, except when those models are explicitly fine-tuned on spontaneous dialogues.

NLPBench: Evaluating Large Language Models on Solving NLP Problems

  • paper_url: http://arxiv.org/abs/2309.15630
  • repo_url: https://github.com/linxins97/nlpbench
  • paper_authors: Linxin Song, Jieyu Zhang, Lechao Cheng, Pengyuan Zhou, Tianyi Zhou, Irene Li
  • for: Evaluating the ability of large language models (LLMs) to solve NLP problems.
  • methods: Introduces NLPBench, a benchmark of 378 college-level NLP questions spanning various NLP topics, sourced from Yale University's prior final exams, and evaluates LLMs such as GPT-3.5/4, PaLM-2, and LLAMA-2 with advanced prompting strategies including chain-of-thought (CoT) and tree-of-thought (ToT).
  • results: The effectiveness of advanced prompting strategies can be inconsistent and occasionally hurts performance, especially for smaller models such as LLAMA-2 (13b); manual assessment reveals shortcomings in LLMs' scientific problem-solving skills, with weaknesses in logical decomposition and reasoning notably affecting results.
    Abstract Recent developments in large language models (LLMs) have shown promise in enhancing the capabilities of natural language processing (NLP). Despite these successes, there remains a dearth of research dedicated to the NLP problem-solving abilities of LLMs. To fill the gap in this area, we present a unique benchmarking dataset, NLPBench, comprising 378 college-level NLP questions spanning various NLP topics sourced from Yale University's prior final exams. NLPBench includes questions with context, in which multiple sub-questions share the same public information, and diverse question types, including multiple choice, short answer, and math. Our evaluation, centered on LLMs such as GPT-3.5/4, PaLM-2, and LLAMA-2, incorporates advanced prompting strategies like the chain-of-thought (CoT) and tree-of-thought (ToT). Our study reveals that the effectiveness of the advanced prompting strategies can be inconsistent, occasionally damaging LLM performance, especially in smaller models like the LLAMA-2 (13b). Furthermore, our manual assessment illuminated specific shortcomings in LLMs' scientific problem-solving skills, with weaknesses in logical decomposition and reasoning notably affecting results.

Few-Shot Multi-Label Aspect Category Detection Utilizing Prototypical Network with Sentence-Level Weighting and Label Augmentation

  • paper_url: http://arxiv.org/abs/2309.15588
  • repo_url: None
  • paper_authors: Zeyu Wang, Mizuho Iwaihara
  • for: Improving few-shot multi-label aspect category detection by using support-set attention and augmented label text information.
  • methods: A prototypical network in which support-set attention with augmented label information suppresses word-level noise in each support instance, a sentence-level attention mechanism assigns different weights to support instances so that prototypes are computed by weighted averaging, and the resulting prototypes are used with query instances to compute query attention that removes noise from the query set (a minimal sketch of the weighted prototypes follows the abstract).
  • results: Experiments on the Yelp dataset show the proposed method outperforms all baselines in four different scenarios.
    Abstract Multi-label aspect category detection is intended to detect multiple aspect categories occurring in a given sentence. Since aspect category detection often suffers from limited datasets and data sparsity, the prototypical network with attention mechanisms has been applied for few-shot aspect category detection. Nevertheless, most of the prototypical networks used so far calculate the prototypes by taking the mean value of all the instances in the support set. This seems to ignore the variations between instances in multi-label aspect category detection. Also, several related works utilize label text information to enhance the attention mechanism. However, the label text information is often short and limited, and not specific enough to discern categories. In this paper, we first introduce support set attention along with the augmented label information to mitigate the noise at word-level for each support set instance. Moreover, we use a sentence-level attention mechanism that gives different weights to each instance in the support set in order to compute prototypes by weighted averaging. Finally, the calculated prototypes are further used in conjunction with query instances to compute query attention and thereby eliminate noises from the query set. Experimental results on the Yelp dataset show that our proposed method is useful and outperforms all baselines in four different scenarios.
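The key departure from plain prototypical networks is weighting support instances instead of taking a uniform mean. The PyTorch sketch below computes a prototype by attention-weighted averaging of support embeddings against a label embedding; it is a simplified reading of the idea, not the paper's full model.

```python
import torch
import torch.nn.functional as F

def weighted_prototype(support_emb, label_emb):
    """support_emb: (k, d) embeddings of the k support sentences for one aspect category;
    label_emb: (d,) embedding of the (augmented) label text.
    Returns a (d,) prototype as an attention-weighted average instead of a plain mean."""
    scores = support_emb @ label_emb        # (k,) similarity of each instance to the label
    weights = F.softmax(scores, dim=0)      # sentence-level attention weights
    return weights @ support_emb            # (d,) weighted average

# toy usage: 5-shot support set with 64-dim embeddings
support = torch.randn(5, 64)
label = torch.randn(64)
proto = weighted_prototype(support, label)
query = torch.randn(3, 64)
logits = query @ proto                      # query-to-prototype similarity scores
```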

Jointly Training Large Autoregressive Multimodal Models

  • paper_url: http://arxiv.org/abs/2309.15564
  • repo_url: https://github.com/kyegomez/MultiModalCrossAttn
  • paper_authors: Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, Barlas Oguz
  • for: Building a single, robust model capable of generating high-quality multimodal (mixed text and image) outputs.
  • methods: The Joint Autoregressive Mixture (JAM) framework modularly and systematically fuses existing text and image generation models, combined with a specialized, data-efficient instruction-tuning strategy tailored for mixed-modal generation tasks.
  • results: The instruction-tuned model demonstrates strong performance in generating high-quality multimodal outputs and is presented as the first model explicitly designed for this purpose.
    Abstract In recent years, advances in the large-scale pretraining of language and text-to-image models have revolutionized the field of machine learning. Yet, integrating these two modalities into a single, robust model capable of generating seamless multimodal outputs remains a significant challenge. To address this gap, we present the Joint Autoregressive Mixture (JAM) framework, a modular approach that systematically fuses existing text and image generation models. We also introduce a specialized, data-efficient instruction-tuning strategy, tailored for mixed-modal generation tasks. Our final instruct-tuned model demonstrates unparalleled performance in generating high-quality multimodal outputs and represents the first model explicitly designed for this purpose.

VideoAdviser: Video Knowledge Distillation for Multimodal Transfer Learning

  • paper_url: http://arxiv.org/abs/2309.15494
  • repo_url: None
  • paper_authors: Yanan Wang, Donghuo Zeng, Shinya Wada, Satoshi Kurihara
  • for: Achieving high-efficiency, high-performance multimodal transfer learning for multimodal fusion.
  • methods: VideoAdviser, a video knowledge distillation method that transfers multimodal knowledge of video-enhanced prompts from a CLIP-based teacher model to a RoBERTa-based student model by optimizing a step-distillation objective (a hedged sketch follows the abstract).
  • results: The student, which requires only the text modality as input, improves the MAE score by up to 12.3% on the MOSI and MOSEI video-level sentiment analysis datasets; the method also improves the state of the art by 3.4% mAP on the VEGAS audio-visual retrieval dataset without additional inference computation.
    Abstract Multimodal transfer learning aims to transform pretrained representations of diverse modalities into a common domain space for effective multimodal fusion. However, conventional systems are typically built on the assumption that all modalities exist, and the lack of modalities always leads to poor inference performance. Furthermore, extracting pretrained embeddings for all modalities is computationally inefficient for inference. In this work, to achieve high efficiency-performance multimodal transfer learning, we propose VideoAdviser, a video knowledge distillation method to transfer multimodal knowledge of video-enhanced prompts from a multimodal fundamental model (teacher) to a specific modal fundamental model (student). With an intuition that the best learning performance comes with professional advisers and smart students, we use a CLIP-based teacher model to provide expressive multimodal knowledge supervision signals to a RoBERTa-based student model via optimizing a step-distillation objective loss -- first step: the teacher distills multimodal knowledge of video-enhanced prompts from classification logits to a regression logit -- second step: the multimodal knowledge is distilled from the regression logit of the teacher to the student. We evaluate our method in two challenging multimodal tasks: video-level sentiment analysis (MOSI and MOSEI datasets) and audio-visual retrieval (VEGAS dataset). The student (requiring only the text modality as input) achieves an MAE score improvement of up to 12.3% for MOSI and MOSEI. Our method further enhances the state-of-the-art method by 3.4% mAP score for VEGAS without additional computations for inference. These results suggest the strengths of our method for achieving high efficiency-performance multimodal transfer learning.
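The step-distillation objective is only sketched in the abstract (the teacher distills its classification logits into a regression logit, which is then distilled to the student). The snippet below is one hedged interpretation using MSE for both steps; the loss choices, head, and shapes are assumptions.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

def step_distillation_loss(teacher_cls_logits, teacher_head, student_reg_logit, target):
    """Hypothetical two-step objective.
    Step 1: a small regression head distills the teacher's classification logits into a single
            regression logit supervised by the task target.
    Step 2: the student's regression logit is distilled toward the teacher's regression logit."""
    teacher_reg_logit = teacher_head(teacher_cls_logits)                 # (batch, 1)
    step1 = mse(teacher_reg_logit.squeeze(-1), target)                   # teacher: logits -> regression
    step2 = mse(student_reg_logit, teacher_reg_logit.detach().squeeze(-1))  # teacher -> student
    return step1 + step2

# toy usage
teacher_head = nn.Linear(7, 1)          # maps 7 class logits to one regression logit
t_logits = torch.randn(8, 7)
s_logit = torch.randn(8)
y = torch.randn(8)                      # e.g., sentiment intensity targets
loss = step_distillation_loss(t_logits, teacher_head, s_logit, y)
```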

Dynamic Multi-Scale Context Aggregation for Conversational Aspect-Based Sentiment Quadruple Analysis

  • paper_url: http://arxiv.org/abs/2309.15476
  • repo_url: None
  • paper_authors: Yuqing Li, Wenyuan Zhang, Binbin Li, Siyu Jia, Zisen Qi, Xingbang Tan
  • for: Conversational aspect-based sentiment quadruple analysis (DiaASQ): extracting target-aspect-opinion-sentiment quadruples whose elements often span multiple utterances in a dialogue.
  • methods: A Dynamic Multi-scale Context Aggregation network (DMCA) that first uses the dialogue structure to generate multi-scale utterance windows and then integrates progressive cues between them with a Dynamic Hierarchical Aggregation (DHA) module, together with a multi-stage loss strategy (a small sketch of multi-scale windows follows the abstract).
  • results: DMCA significantly outperforms baselines and achieves state-of-the-art performance.
    Abstract Conversational aspect-based sentiment quadruple analysis (DiaASQ) aims to extract the quadruple of target-aspect-opinion-sentiment within a dialogue. In DiaASQ, a quadruple's elements often cross multiple utterances. This situation complicates the extraction process, emphasizing the need for an adequate understanding of conversational context and interactions. However, existing work independently encodes each utterance, thereby struggling to capture long-range conversational context and overlooking the deep inter-utterance dependencies. In this work, we propose a novel Dynamic Multi-scale Context Aggregation network (DMCA) to address the challenges. Specifically, we first utilize dialogue structure to generate multi-scale utterance windows for capturing rich contextual information. After that, we design a Dynamic Hierarchical Aggregation module (DHA) to integrate progressive cues between them. In addition, we form a multi-stage loss strategy to improve model performance and generalization ability. Extensive experimental results show that the DMCA model outperforms baselines significantly and achieves state-of-the-art performance.
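To make the multi-scale window idea concrete, here is a small, hedged sketch that builds utterance windows of increasing size centered on a target utterance; the window sizes are illustrative, and the real model additionally exploits the dialogue's reply structure.

```python
def multi_scale_windows(utterances, center, scales=(1, 3, 5)):
    """Return one utterance window per scale, centered on `center`.

    utterances: list of utterance strings; scales: odd window sizes (illustrative values).
    """
    windows = []
    for scale in scales:
        half = scale // 2
        start = max(0, center - half)
        end = min(len(utterances), center + half + 1)
        windows.append(utterances[start:end])
    return windows

dialogue = ["u0", "u1", "u2", "u3", "u4"]
print(multi_scale_windows(dialogue, center=2))   # [['u2'], ['u1', 'u2', 'u3'], ['u0', ..., 'u4']]
```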

ChatCounselor: A Large Language Models for Mental Health Support

  • paper_url: http://arxiv.org/abs/2309.15461
  • repo_url: https://github.com/emocareai/chatpsychiatrist
  • paper_authors: June M. Liu, Donghao Li, He Cao, Tianhe Ren, Zeyi Liao, Jiamin Wu
  • for: Providing mental health support: unlike generic chatbots, ChatCounselor is grounded in real conversations between consulting clients and professional psychologists, giving it specialized psychological knowledge and counseling skills.
  • methods: The training dataset, Psych8k, was constructed from 260 hour-long in-depth interviews; counseling response quality is assessed with the counseling Bench, which uses GPT-4 and carefully crafted prompts based on seven metrics of psychological counseling assessment.
  • results: ChatCounselor surpasses existing open-source models on the counseling Bench and approaches the performance of ChatGPT, showing the substantial gain in model capability obtained from high-quality domain-specific data.
    Abstract This paper presents ChatCounselor, a large language model (LLM) solution designed to provide mental health support. Unlike generic chatbots, ChatCounselor is distinguished by its foundation in real conversations between consulting clients and professional psychologists, enabling it to possess specialized knowledge and counseling skills in the field of psychology. The training dataset, Psych8k, was constructed from 260 in-depth interviews, each spanning an hour. To assess the quality of counseling responses, the counseling Bench was devised. Leveraging GPT-4 and meticulously crafted prompts based on seven metrics of psychological counseling assessment, the model underwent evaluation using a set of real-world counseling questions. Impressively, ChatCounselor surpasses existing open-source models in the counseling Bench and approaches the performance level of ChatGPT, showcasing the remarkable enhancement in model capability attained through high-quality domain-specific data.

Beyond the Chat: Executable and Verifiable Text-Editing with LLMs

  • paper_url: http://arxiv.org/abs/2309.15337
  • repo_url: None
  • paper_authors: Philippe Laban, Jesse Vig, Marti A. Hearst, Caiming Xiong, Chien-Sheng Wu
  • for: Providing a more transparent and verifiable editing interface for documents edited with Large Language Models (LLMs).
  • methods: The proposed interface, InkSync, suggests executable edits directly within the document being edited and supports a 3-stage approach (warn, verify, audit) to mitigate the risk of factual errors introduced by LLMs (a minimal data-structure sketch follows the abstract).
  • results: Two usability studies confirm the effectiveness of InkSync's components compared to standard LLM-based chat interfaces, leading to more accurate, more efficient editing and an improved user experience.
    Abstract Conversational interfaces powered by Large Language Models (LLMs) have recently become a popular way to obtain feedback during document editing. However, standard chat-based conversational interfaces do not support transparency and verifiability of the editing changes that they suggest. To give the author more agency when editing with an LLM, we present InkSync, an editing interface that suggests executable edits directly within the document being edited. Because LLMs are known to introduce factual errors, Inksync also supports a 3-stage approach to mitigate this risk: Warn authors when a suggested edit introduces new information, help authors Verify the new information's accuracy through external search, and allow an auditor to perform an a-posteriori verification by Auditing the document via a trace of all auto-generated content. Two usability studies confirm the effectiveness of InkSync's components when compared to standard LLM-based chat interfaces, leading to more accurate, more efficient editing, and improved user experience.
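The abstract describes executable edits and an a-posteriori audit over a trace of auto-generated content. The sketch below is a hypothetical illustration of what such an edit object and audit log could look like; none of the field or class names come from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SuggestedEdit:
    """A hypothetical executable edit: replace `original` with `replacement` in the document."""
    original: str
    replacement: str
    introduces_new_information: bool   # would drive the "warn" / "verify" stages

@dataclass
class AuditTrail:
    entries: List[SuggestedEdit] = field(default_factory=list)

    def apply(self, document: str, edit: SuggestedEdit) -> str:
        self.entries.append(edit)                        # record every auto-generated change
        return document.replace(edit.original, edit.replacement, 1)

doc = "The model was released in 2021."
trail = AuditTrail()
doc = trail.apply(doc, SuggestedEdit("2021", "2022", introduces_new_information=True))
# an auditor can later review trail.entries to verify each machine-generated change
```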