cs.CL - 2023-10-07

Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU

paper_url: http://arxiv.org/abs/2310.04928
repo_url: https://github.com/fajri91/indommlu
paper_authors: Fajri Koto, Nurul Aisyah, Haonan Li, Timothy Baldwin
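for: This paper evaluates how well large language models (LLMs) handle Indonesian culture and languages.
methods: The authors measure multi-task language understanding accuracy on 14,981 questions collected with the help of professional teachers, covering Indonesia's main education levels and languages.
results: GPT-3.5 reaches a passing standard only at the Indonesian primary school level and shows limited knowledge of local Indonesian languages and culture. Smaller models such as BLOOMZ and Falcon perform at even lower levels.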

Abstract
Although large language models (LLMs) are often pre-trained on large-scale multilingual texts, their reasoning abilities and real-world knowledge are mainly evaluated based on English datasets. Assessing LLM capabilities beyond English is increasingly vital but hindered due to the lack of suitable datasets. In this work, we introduce IndoMMLU, the first multi-task language understanding benchmark for Indonesian culture and languages, which consists of questions from primary school to university entrance exams in Indonesia. By employing professional teachers, we obtain 14,981 questions across 64 tasks and education levels, with 46% of the questions focusing on assessing proficiency in the Indonesian language and knowledge of nine local languages and cultures in Indonesia. Our empirical evaluations show that GPT-3.5 only manages to pass the Indonesian primary school level, with limited knowledge of local Indonesian languages and culture. Other smaller models such as BLOOMZ and Falcon perform at even lower levels.

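Summary
The reasoning abilities and real-world knowledge of large language models (LLMs) are evaluated mainly on English datasets. Assessing them beyond English is increasingly important but is held back by a lack of suitable data. In this work, we introduce IndoMMLU, a multi-task language understanding benchmark for Indonesian culture and languages, built from questions that span primary school to university entrance exams in Indonesia. By employing professional teachers, we obtained 14,981 questions across 64 tasks and education levels; 46% of them assess proficiency in Indonesian and knowledge of nine local languages and cultures. Our evaluation shows that GPT-3.5 passes only the Indonesian primary school level and has limited knowledge of local Indonesian languages and culture. Smaller models such as BLOOMZ and Falcon perform at even lower levels.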

GradXKG: A Universal Explain-per-use Temporal Knowledge Graph Explainer

paper_url: http://arxiv.org/abs/2310.04889
repo_url: None
paper_authors: Chenhan Yuan, Hoda Eldardiry
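for: This work aims to make temporal knowledge graph (TKG) reasoning models more explainable, so that it is easier to see how a TKG model arrives at a given prediction.
methods: It uses a two-stage gradient-based approach that combines a Grad-CAM-inspired RGCN explainer with an integrated gradients explainer to explain the predictions of RGCN-based TKG models.
results: Experiments show that GradXKG produces explanations of TKG model predictions quickly and helps clarify how these models use the temporal dimension to follow how facts evolve. GradXKG applies to a range of RGCN-based TKG models and identifies the most important nodes at each timestep.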

Abstract
Temporal knowledge graphs (TKGs) have shown promise for reasoning tasks by incorporating a temporal dimension to represent how facts evolve over time. However, existing TKG reasoning (TKGR) models lack explainability due to their black-box nature. Recent work has attempted to address this through customized model architectures that generate reasoning paths, but these recent approaches have limited generalizability and provide sparse explanatory output. To enable interpretability for most TKGR models, we propose GradXKG, a novel two-stage gradient-based approach for explaining Relational Graph Convolution Network (RGCN)-based TKGR models. First, a Grad-CAM-inspired RGCN explainer tracks gradients to quantify each node's contribution across timesteps in an efficient "explain-per-use" fashion. Second, an integrated gradients explainer consolidates importance scores for RGCN outputs, extending compatibility across diverse TKGR architectures based on RGCN. Together, the two explainers highlight the most critical nodes at each timestep for a given prediction. Our extensive experiments demonstrated that, by leveraging gradient information, GradXKG provides insightful explanations grounded in the model's logic in a timely manner for most RGCN-based TKGR models. This helps address the lack of interpretability in existing TKGR models and provides a universal explanation approach applicable across various models.
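The second stage named in the abstract relies on integrated gradients. The following is a minimal sketch of that general technique on a toy function, not the GradXKG implementation: `model` is a hypothetical stand-in for a TKGR scoring function, the all-zero baseline is an assumption, and gradients are taken by finite differences where a real model would use autograd.

```python
import math

def model(x):
    """Toy differentiable scorer (hypothetical stand-in for a TKGR model output)."""
    return math.tanh(0.8 * x[0] - 0.5 * x[1] + 0.3 * x[0] * x[2])

def gradient(f, x, eps=1e-6):
    """Central-difference gradient; a real model would use autograd instead."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def integrated_gradients(f, x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, gradient(f, point))]
    # Scale the averaged gradient by the input-baseline difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]  # assumed baseline for illustration
attributions = integrated_gradients(model, x, baseline)

# Completeness property: attributions sum to f(x) - f(baseline), up to numerical error.
gap = abs(sum(attributions) - (model(x) - model(baseline)))
```

The completeness check at the end is a convenient sanity test: if the attributions do not add up to the change in the model output, the number of integration steps is likely too small.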

Summary
Temporal knowledge graphs (TKGs) show promise for reasoning tasks because they add a temporal dimension that represents how facts evolve over time. Existing TKG reasoning (TKGR) models, however, are black boxes and lack explainability. Recent work addresses this with customized architectures that generate reasoning paths, but these approaches generalize poorly and give sparse explanations. To make most TKGR models interpretable, we propose GradXKG, a two-stage gradient-based approach for explaining TKGR models built on Relational Graph Convolution Networks (RGCNs). First, a Grad-CAM-inspired RGCN explainer tracks gradients to quantify each node's contribution across timesteps in an efficient "explain-per-use" fashion. Second, an integrated gradients explainer consolidates importance scores for RGCN outputs, which extends compatibility across RGCN-based TKGR architectures. Together, the two explainers highlight the most critical nodes at each timestep for a given prediction. Experiments show that GradXKG gives timely explanations grounded in the model's logic for most RGCN-based TKGR models.

Prompt-to-OS (P2OS): Revolutionizing Operating Systems and Human-Computer Interaction with Integrated AI Generative Models

paper_url: http://arxiv.org/abs/2310.04875
repo_url: None
paper_authors: Gabriele Tolomei, Cesare Campagnano, Fabrizio Silvestri, Giovanni Trappolini
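for: This paper sets out a major shift in human-computer interaction that would replace the traditional notion of an operating system.
methods: The proposed paradigm places large generative models, such as language and diffusion models, as the intermediate layer between user and computer. Users interact through natural-language conversation rather than explicit commands or complex navigation.
results: The approach simplifies user interaction and opens up personalized experiences: generative models can adapt to a user's preferences, learning from user input to keep improving their understanding and response generation. It also improves accessibility, since users can interact by speech or text according to their preferred way of communicating.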

Abstract
In this paper, we present a groundbreaking paradigm for human-computer interaction that revolutionizes the traditional notion of an operating system. Within this innovative framework, user requests issued to the machine are handled by an interconnected ecosystem of generative AI models that seamlessly integrate with or even replace traditional software applications. At the core of this paradigm shift are large generative models, such as language and diffusion models, which serve as the central interface between users and computers. This pioneering approach leverages the abilities of advanced language models, empowering users to engage in natural language conversations with their computing devices. Users can articulate their intentions, tasks, and inquiries directly to the system, eliminating the need for explicit commands or complex navigation. The language model comprehends and interprets the user's prompts, generating and displaying contextual and meaningful responses that facilitate seamless and intuitive interactions. This paradigm shift not only streamlines user interactions but also opens up new possibilities for personalized experiences. Generative models can adapt to individual preferences, learning from user input and continuously improving their understanding and response generation. Furthermore, it enables enhanced accessibility, as users can interact with the system using speech or text, accommodating diverse communication preferences. However, this visionary concept raises significant challenges, including privacy, security, trustability, and the ethical use of generative models. Robust safeguards must be in place to protect user data and prevent potential misuse or manipulation of the language model. While the full realization of this paradigm is still far from being achieved, this paper serves as a starting point for envisioning this transformative potential.

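Summary
In this paper, we propose a new paradigm for human-computer interaction that overturns the traditional notion of an operating system. In this framework, user requests to the machine are handled by an interconnected system of generative AI models that integrate with, or even replace, traditional software applications. At its core are large generative models, such as language and diffusion models, which act as the interface between users and computers. The approach uses the abilities of advanced language models to let users interact with their computers through natural-language conversation, without explicit commands or complex navigation. The language model interprets the user's prompts and generates and displays contextual, meaningful responses, which makes interaction fluid and intuitive. The paradigm reduces the complexity of user interaction and opens up personalized experiences: generative models can adapt to individual preferences, learning from user input and continually improving their understanding and response generation. It also improves accessibility, since users can interact by speech or text. The concept nevertheless raises significant challenges, including privacy, security, trustability, and the ethical use of generative models. Robust safeguards are needed to protect user data and prevent misuse or manipulation of the language model. Full realization of the paradigm is still far off, and this paper serves as a starting point for exploring it.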

End-to-End Lip Reading in Romanian with Cross-Lingual Domain Adaptation and Lateral Inhibition

paper_url: http://arxiv.org/abs/2310.04858
repo_url: None
paper_authors: Emilian-Claudiu Mănescu, Răzvan-Alexandru Smădu, Andrei-Marius Avram, Dumitru-Clementin Cercel, Florin Pop
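for: This work aims to improve lip reading (visual speech recognition), particularly on datasets for underrepresented languages.
methods: It analyzes several architectures and optimizations, including ample regularization methods and cross-lingual domain adaptation.
results: The proposed method achieves state-of-the-art results, using unlabeled videos to help the model learn language-invariant features.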

Abstract
Lip reading or visual speech recognition has gained significant attention in recent years, particularly because of hardware development and innovations in computer vision. While considerable progress has been obtained, most models have only been tested on a few large-scale datasets. This work addresses this shortcoming by analyzing several architectures and optimizations on the underrepresented, short-scale Romanian language dataset called Wild LRRo. Most notably, we compare different backend modules, demonstrating the effectiveness of adding ample regularization methods. We obtain state-of-the-art results using our proposed method, namely cross-lingual domain adaptation and unlabeled videos from English and German datasets to help the model learn language-invariant features. Lastly, we assess the performance of adding a layer inspired by the neural inhibition mechanism.

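Summary
Lip reading, or visual speech recognition, has attracted wide attention in recent years, particularly because of hardware development and innovations in computer vision. Despite considerable progress, most models have been tested only on a few large-scale datasets. This work addresses that shortcoming by analyzing several architectures and optimizations on Wild LRRo, an underrepresented, small-scale Romanian-language dataset. Most notably, we compare different backend modules and demonstrate the effectiveness of adding ample regularization. We obtain state-of-the-art results with our proposed method, which uses cross-lingual domain adaptation and unlabeled videos from English and German datasets to help the model learn language-invariant features. Finally, we assess the effect of adding a layer inspired by the neural inhibition mechanism.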

Parameterizing Context: Unleashing the Power of Parameter-Efficient Fine-Tuning and In-Context Tuning for Continual Table Semantic Parsing

paper_url: http://arxiv.org/abs/2310.04801
repo_url: None
paper_authors: Yongrui Chen, Shenyu Zhang, Guilin Qi, Xinnan Guo
for: Training a continual table semantic parser that translates natural language into SQL across a sequence of tasks, each of which offers only limited training examples.
methods: A novel method that combines parameter-efficient fine-tuning (PEFT) and in-context tuning (ICT) to train the parser. It can fully avoid catastrophic forgetting and still translate accurately from few examples.
results: Experiments on two benchmarks show that the method outperforms common few-shot and continual learning baselines across various metrics.

Abstract
Continual table semantic parsing aims to train a parser on a sequence of tasks, where each task requires the parser to translate natural language into SQL based on task-specific tables but only offers limited training examples. Conventional methods tend to suffer from overfitting with limited supervision, as well as catastrophic forgetting due to parameter updates. Despite recent advancements that partially alleviate these issues through semi-supervised data augmentation and retention of a few past examples, the performance is still limited by the volume of unsupervised data and stored examples. To overcome these challenges, this paper introduces a novel method integrating parameter-efficient fine-tuning (PEFT) and in-context tuning (ICT) for training a continual table semantic parser. Initially, we present a task-adaptive PEFT framework capable of fully circumventing catastrophic forgetting, which is achieved by freezing the pre-trained model backbone and fine-tuning small-scale prompts. Building on this, we propose a teacher-student framework-based solution. The teacher addresses the few-shot problem using ICT, which procures contextual information by demonstrating a few training examples. In turn, the student leverages the proposed PEFT framework to learn from the teacher's output distribution, and subsequently compresses and saves the contextual information to the prompts, eliminating the need to store any training examples. Experimental evaluations on two benchmarks affirm the superiority of our method over prevalent few-shot and continual learning baselines across various metrics.
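The abstract says the student learns from the teacher's output distribution. One common way to express that is a KL-divergence distillation loss, sketched below. The choice of KL divergence and all logits are assumptions for illustration; the abstract does not state which loss the authors use.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token logits over a three-token vocabulary.
teacher_logits = [2.0, 0.5, -1.0]  # teacher sees the question plus demonstrations
student_logits = [1.0, 1.0, -0.5]  # student sees only the question and its prompts

teacher_probs = softmax(teacher_logits)
student_probs = softmax(student_logits)

# The student would be trained to reduce this value.
distill_loss = kl_divergence(teacher_probs, student_probs)
```

In such a setup only the student's small prompt parameters would receive gradients, which is consistent with the frozen-backbone design described in the abstract.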

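Summary
Continual table semantic parsing aims to train a parser on a sequence of tasks, where each task requires the parser to translate natural language into SQL over task-specific tables but offers only limited training examples. Conventional methods tend to overfit under limited supervision and to suffer catastrophic forgetting, and recent remedies remain limited by the volume of unsupervised data and stored examples. To address these challenges, this paper proposes a method that integrates parameter-efficient fine-tuning (PEFT) and in-context tuning (ICT) to train a continual table semantic parser. First, we present a task-adaptive PEFT framework that fully avoids catastrophic forgetting by freezing the pre-trained model backbone and fine-tuning small-scale prompts. Then we propose a teacher-student solution: the teacher uses ICT, obtaining contextual information from a few demonstrated training examples, while the student uses the PEFT framework to learn from the teacher's output distribution and compresses the contextual information into the prompts, so no training examples need to be stored. Experiments on two benchmarks show that our method outperforms prevalent few-shot and continual learning baselines across various metrics.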

Chat Vector: A Simple Approach to Equip LLMs With New Language Chat Capabilities

paper_url: http://arxiv.org/abs/2310.04799
repo_url: None
paper_authors: Shih-Cheng Huang, Pin-Zu Li, Yu-Chi Hsu, Kuang-Ming Chen, Yu Tung Lin, Shih-Kai Hsiao, Richard Tzong-Han Tsai, Hung-yi Lee
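for: This paper explores developing large language models (LLMs) for non-English languages, with an emphasis on alignment with human preferences.
methods: The authors propose a computationally efficient method that uses a chat vector to combine existing knowledge and behaviors in LLMs, changing the conventional training paradigm from continual pre-train -> SFT -> RLHF to continual pre-train + chat vector.
results: The experiments focus mainly on Traditional Chinese, with LLaMA2 as the base model; the chat vector is obtained by subtracting the pre-trained LLaMA2 weights from the LLaMA2-chat weights. Evaluated on toxicity, instruction following, and multi-turn dialogue, the chat vector performs strongly in chat. Further experiments with models pre-trained in Korean and Simplified Chinese illustrate the method's versatility. Overall, the paper offers an efficient way to align LLMs with human preferences across languages.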

Abstract
With the advancements in conversational AI, such as ChatGPT, this paper focuses on exploring developing Large Language Models (LLMs) for non-English languages, especially emphasizing alignment with human preferences. We introduce a computationally efficient method, leveraging chat vector, to synergize pre-existing knowledge and behaviors in LLMs, restructuring the conventional training paradigm from continual pre-train -> SFT -> RLHF to continual pre-train + chat vector. Our empirical studies, primarily focused on Traditional Chinese, employ LLaMA2 as the base model and acquire the chat vector by subtracting the pre-trained weights, LLaMA2, from the weights of LLaMA2-chat. Evaluating from three distinct facets, which are toxicity, ability of instruction following, and multi-turn dialogue demonstrates the chat vector's superior efficacy in chatting. To confirm the adaptability of our approach, we extend our experiments to include models pre-trained in both Korean and Simplified Chinese, illustrating the versatility of our methodology. Overall, we present a significant solution in aligning LLMs with human preferences efficiently across various languages, accomplished by the chat vector.
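The weight arithmetic described in the abstract can be sketched in a few lines. This is a toy illustration that assumes all three models share one architecture and parameter naming; the parameter name `layer.weight` and every value are hypothetical, and real checkpoints would hold tensors, not short lists.

```python
def chat_vector(base_weights, chat_weights):
    """Per-parameter difference: chat model minus its pre-trained base."""
    return {
        name: [c - b for c, b in zip(chat_weights[name], base_weights[name])]
        for name in base_weights
    }

def apply_chat_vector(target_weights, vector):
    """Add the chat vector to a continually pre-trained model's weights."""
    return {
        name: [t + v for t, v in zip(target_weights[name], vector[name])]
        for name in target_weights
    }

# Hypothetical one-parameter "models".
base = {"layer.weight": [0.10, -0.20, 0.30]}       # stands in for LLaMA2
chat = {"layer.weight": [0.15, -0.25, 0.35]}       # stands in for LLaMA2-chat
continual = {"layer.weight": [0.50, 0.40, -0.10]}  # base after continual pre-training

vector = chat_vector(base, chat)
merged = apply_chat_vector(continual, vector)
```

No gradient steps are involved in this merge, which is why the abstract can describe the approach as computationally efficient compared with running SFT and RLHF again for each language.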

Summary
With advances in conversational AI such as ChatGPT, this paper focuses on developing large language models (LLMs) for non-English languages, with particular emphasis on alignment with human preferences. We propose a computationally efficient method that uses a chat vector to combine pre-existing knowledge and behaviors in LLMs, restructuring the conventional training paradigm from continual pre-train -> SFT -> RLHF to continual pre-train + chat vector. Our experiments focus mainly on Traditional Chinese, use LLaMA2 as the base model, and obtain the chat vector by subtracting the pre-trained LLaMA2 weights from the LLaMA2-chat weights. Evaluation on three aspects (toxicity, instruction following, and multi-turn dialogue) shows the chat vector's strong efficacy in chatting. To confirm adaptability, we extend the experiments to models pre-trained in Korean and Simplified Chinese. Overall, the chat vector offers an efficient way to align LLMs with human preferences across languages.

FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets

paper_url: http://arxiv.org/abs/2310.04793
repo_url: https://github.com/ai4finance-foundation/fingpt
paper_authors: Neng Wang, Hongyang Yang, Christina Dan Wang
for: This paper focuses on the potential of GPT-based models in the financial sector and presents a distinctive approach for integrating these models with financial datasets.
methods: The paper introduces the Instruction Tuning paradigm for open-source large language models, which ensures a seamless and transparent integration of these models into financial contexts. The paper also presents a benchmarking scheme for end-to-end training and testing, including basic competencies such as Named Entity Recognition (NER) and sentiment analysis, as well as a comprehensive model that executes multi-task operations.
results: The paper explores the zero-shot capabilities of the proposed approach by testing the model on unseen tasks and incorporating novel datasets, demonstrating its adaptability in uncharted territories. The paper also highlights the effectiveness of the Instruction Tuning paradigm for immediate integration and the robust foundation it lays for future investigations in open-source financial large language models (FinLLMs).

Abstract
In the swiftly expanding domain of Natural Language Processing (NLP), the potential of GPT-based models for the financial sector is increasingly evident. However, the integration of these models with financial datasets presents challenges, notably in determining their adeptness and relevance. This paper introduces a distinctive approach anchored in the Instruction Tuning paradigm for open-source large language models, specifically adapted for financial contexts. Through this methodology, we capitalize on the interoperability of open-source models, ensuring a seamless and transparent integration. We begin by explaining the Instruction Tuning paradigm, highlighting its effectiveness for immediate integration. The paper presents a benchmarking scheme designed for end-to-end training and testing, employing a cost-effective progression. Firstly, we assess basic competencies and fundamental tasks, such as Named Entity Recognition (NER) and sentiment analysis to enhance specialization. Next, we delve into a comprehensive model, executing multi-task operations by amalgamating all instructional tunings to examine versatility. Finally, we explore the zero-shot capabilities by earmarking unseen tasks and incorporating novel datasets to understand adaptability in uncharted terrains. Such a paradigm fortifies the principles of openness and reproducibility, laying a robust foundation for future investigations in open-source financial large language models (FinLLMs).

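Summary
In the rapidly expanding field of natural language processing (NLP), the potential of GPT-based models for the financial sector is increasingly evident. Integrating these models with financial datasets is challenging, however, notably in determining their adeptness and relevance. This paper introduces an approach based on the Instruction Tuning paradigm for open-source large language models, adapted specifically to financial contexts. The methodology builds on the interoperability of open-source models to give seamless and transparent integration. We first explain the Instruction Tuning paradigm and highlight its effectiveness for immediate integration. We then present a cost-effective benchmarking scheme for end-to-end training and testing. It starts with basic competencies and fundamental tasks, such as Named Entity Recognition (NER) and sentiment analysis, to enhance specialization. Next comes a comprehensive model that performs multi-task operations by combining all instruction tunings, to examine versatility. Finally, we explore zero-shot capabilities by setting aside unseen tasks and incorporating new datasets, to understand adaptability in unfamiliar territory. This paradigm reinforces the principles of openness and reproducibility and lays a solid foundation for future research on open-source financial large language models (FinLLMs).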

Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning

paper_url: http://arxiv.org/abs/2310.04782
repo_url: None
paper_authors: Yuchen Yang, Houqiang Li, Yanfeng Wang, Yu Wang
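for: Improving the reliability and trustworthiness of large language models (LLMs).
methods: An uncertainty-aware framework that lets the model identify and filter out uncertain answers on its own while taking the limits of its knowledge into account.
results: In evaluation, the model answers questions better and autonomously recognizes and rejects uncertain answers, which improves its reliability.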

Abstract
In recent years, large-scale language models (LLMs) have gained attention for their impressive text generation capabilities. However, these models often face the challenge of "hallucination," which undermines their reliability. In this study, we introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty. Human-defined methods for estimating uncertainty typically assume that "uncertainty is lower when the model's response is correct compared to when it is incorrect." However, setting a precise threshold to distinguish correctness is challenging. Therefore, we introduce uncertainty information as an intermediary variable that implicitly influences the model's behavior. Our innovative uncertainty-aware in-context learning framework involves fine-tuning the LLM using a calibration dataset. Our aim is to improve the model's responses by filtering out answers with high uncertainty while considering the model's knowledge limitations. We evaluate the model's knowledge by examining multiple responses to the same question for the presence of a correct answer. When the model lacks relevant knowledge, the response should indicate that the question cannot be answered. Conversely, when the model has relevant knowledge, the response should provide the correct answer. Extensive experiments confirm the effectiveness of our framework, leading to two key findings. First, the logit output values of the LLM partly reflect inherent uncertainty. Second, our model autonomously recognizes uncertainty, resulting in improved responses.
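The knowledge check described in the abstract (sample several responses and see whether any is correct) can be sketched as a rule for building calibration targets. This is an assumption-laden illustration: the function name, the exact-match comparison, and the refusal wording are all hypothetical and not taken from the paper.

```python
REFUSAL = "This question cannot be answered."  # hypothetical refusal wording

def build_calibration_target(sampled_answers, gold_answer, refusal=REFUSAL):
    """Pick the training target for one question.

    If any sampled response matches the gold answer, the model is treated
    as having the relevant knowledge and the target is the gold answer.
    Otherwise the target is a refusal.
    """
    normalized = {answer.strip().lower() for answer in sampled_answers}
    has_knowledge = gold_answer.strip().lower() in normalized
    return gold_answer if has_knowledge else refusal

# Hypothetical sampled responses for two questions.
known = build_calibration_target(["Paris", "Lyon", "paris "], "Paris")
unknown = build_calibration_target(["1912", "1915", "1920"], "1907")
```

Exact string matching is the simplest possible correctness test; a practical pipeline would likely need a more tolerant comparison for free-form answers.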

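Summary
In recent years, large-scale language models (LLMs) have drawn attention for their strong text generation capabilities. These models often face the problem of "hallucination", however, which undermines their reliability. In this study, we propose an uncertainty-aware in-context learning framework that lets the model enhance or reject its output in response to uncertainty. Human-defined methods for estimating uncertainty typically assume that uncertainty is lower when the model's response is correct than when it is incorrect, but setting a precise threshold to separate the two is difficult. We therefore introduce uncertainty information as an intermediary variable that implicitly influences the model's behavior. The framework fine-tunes the LLM on a calibration dataset. Our aim is to improve the model's responses by filtering out answers with high uncertainty while taking the model's knowledge limitations into account. We evaluate the model's knowledge by checking whether multiple responses to the same question contain a correct answer. When the model lacks the relevant knowledge, the response should say that the question cannot be answered; when it has the knowledge, the response should give the correct answer. Extensive experiments confirm the framework's effectiveness and lead to two key findings: the logit output values of the LLM partly reflect inherent uncertainty, and our model autonomously recognizes uncertainty, which improves its responses.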

A New Dataset for End-to-End Sign Language Translation: The Greek Elementary School Dataset

paper_url: http://arxiv.org/abs/2310.04753
repo_url: None
paper_authors: Andreas Voskou, Konstantinos P. Panousis, Harris Partaourides, Kyriakos Tolias, Sotirios Chatzis
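for: This work aims to ease communication between Hard-of-Hearing (HoH) and hearing people, improving HoH people's social life and opportunities for participation.
methods: It trains recent Transformer-based methods on a newly constructed collection of 29653 Greek Sign Language video-translation pairs based on the official syllabus of Greek Elementary School.
results: The results indicate that the introduced dataset can advance SLT research by balancing usability and real-world value.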

Abstract
Automatic Sign Language Translation (SLT) is a research avenue of great societal impact. End-to-End SLT facilitates the interaction of Hard-of-Hearing (HoH) with hearing people, thus improving their social life and opportunities for participation in social life. However, research within this frame of reference is still in its infancy, and current resources are particularly limited. Existing SLT methods are either of low translation ability or are trained and evaluated on datasets of restricted vocabulary and questionable real-world value. A characteristic example is Phoenix2014T benchmark dataset, which only covers weather forecasts in German Sign Language. To address this shortage of resources, we introduce a newly constructed collection of 29653 Greek Sign Language video-translation pairs which is based on the official syllabus of Greek Elementary School. Our dataset covers a wide range of subjects. We use this novel dataset to train recent state-of-the-art Transformer-based methods widely used in SLT research. Our results demonstrate the potential of our introduced dataset to advance SLT research by offering a favourable balance between usability and real-world value.

Resprompt: Residual Connection Prompting Advances Multi-Step Reasoning in Large Language Models

paper_url: http://arxiv.org/abs/2310.04743
repo_url: None
paper_authors: Song Jiang, Zahra Shakeri, Aaron Chan, Maziar Sanjabi, Hamed Firooz, Yinglong Xia, Bugra Akyildiz, Yizhou Sun, Jinchao Li, Qifan Wang, Asli Celikyilmaz
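for: Improving the multi-step reasoning ability of large language models (LLMs).
methods: A new prompting strategy, Residual Connection Prompting (RESPROMPT), which reconstructs the reasoning graph inside the prompt by adding the connections that are missing from the linear chain-of-thought (CoT) flow, so that complex reasoning graphs are captured better.
results: On six benchmarks with the LLaMA family of models, RESPROMPT improves average reasoning accuracy over CoT baselines by 12.5% on LLaMA-65B and 6.8% on LLaMA2-70B. For questions that need at least five reasoning steps, it outperforms the best CoT baselines by an average of 21.1% on LLaMA-65B and 14.3% on LLaMA2-70B.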

Abstract
Chain-of-thought (CoT) prompting, which offers step-by-step problem-solving rationales, has impressively unlocked the reasoning potential of large language models (LLMs). Yet, the standard CoT is less effective in problems demanding multiple reasoning steps. This limitation arises from the complex reasoning process in multi-step problems: later stages often depend on the results of several steps earlier, not just the results of the immediately preceding step. Such complexities suggest the reasoning process is naturally represented as a graph. The almost linear and straightforward structure of CoT prompting, however, struggles to capture this complex reasoning graph. To address this challenge, we propose Residual Connection Prompting (RESPROMPT), a new prompting strategy that advances multi-step reasoning in LLMs. Our key idea is to reconstruct the reasoning graph within prompts. We achieve this by integrating necessary connections (links present in the reasoning graph but missing in the linear CoT flow) into the prompts. Termed "residual connections", these links are pivotal in morphing the linear CoT structure into a graph representation, effectively capturing the complex reasoning graphs inherent in multi-step problems. We evaluate RESPROMPT on six benchmarks across three diverse domains: math, sequential, and commonsense reasoning. For the open-sourced LLaMA family of models, RESPROMPT yields a significant average reasoning accuracy improvement of 12.5% on LLaMA-65B and 6.8% on LLaMA2-70B. Breakdown analysis further highlights RESPROMPT particularly excels in complex multi-step reasoning: for questions demanding at least five reasoning steps, RESPROMPT outperforms the best CoT based benchmarks by a remarkable average improvement of 21.1% on LLaMA-65B and 14.3% on LLaMA2-70B. Through extensive ablation studies and analyses, we pinpoint how to most effectively build residual connections.

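Summary
Chain-of-thought (CoT) prompting, which supplies step-by-step problem-solving rationales, has unlocked much of the reasoning potential of large language models (LLMs). Standard CoT is less effective, however, on problems that need multiple reasoning steps, because later stages often depend on the results of several earlier steps, not only the immediately preceding one. This suggests that the reasoning process is naturally represented as a graph, which the nearly linear structure of CoT struggles to capture. To address this, we propose Residual Connection Prompting (RESPROMPT), a new prompting strategy that improves multi-step reasoning in LLMs. The key idea is to reconstruct the reasoning graph within the prompt by adding the necessary links that are present in the graph but missing from the linear CoT flow. These links, termed "residual connections", turn the linear CoT structure into a graph representation that captures the complex reasoning in multi-step problems. We evaluate RESPROMPT on six benchmarks across three domains: math, sequential, and commonsense reasoning. For the open-source LLaMA family of models, RESPROMPT improves average reasoning accuracy by 12.5% on LLaMA-65B and 6.8% on LLaMA2-70B. It excels on complex multi-step reasoning in particular: for questions that need at least five reasoning steps, the average improvement over the best CoT baselines is 21.1% on LLaMA-65B and 14.3% on LLaMA2-70B. Extensive ablation studies and analyses identify how to build residual connections most effectively.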

Zero-shot Cross-lingual Transfer without Parallel Corpus

paper_url: http://arxiv.org/abs/2310.04726
repo_url: None
paper_authors: Yuyang Zhang, Xiaofeng Han, Baojun Wang
for: Addressing NLP tasks in low-resource languages with pre-trained language models.
methods: A novel approach to zero-shot cross-lingual transfer using a pre-trained model, with two modules: Bilingual Task Fitting and self-training.
results: New state-of-the-art (SOTA) performance on different tasks without relying on parallel corpora or translation models.

Abstract
Recently, although pre-trained language models have achieved great success on multilingual NLP (Natural Language Processing) tasks, the lack of training data on many tasks in low-resource languages still limits their performance. One effective way of solving that problem is to transfer knowledge from rich-resource languages to low-resource languages. However, many previous works on cross-lingual transfer rely heavily on the parallel corpus or translation models, which are often difficult to obtain. We propose a novel approach to conduct zero-shot cross-lingual transfer with a pre-trained model. It consists of a Bilingual Task Fitting module that applies task-related bilingual information alignment; a self-training module generates pseudo soft and hard labels for unlabeled data and utilizes them to conduct self-training. We got the new SOTA on different tasks without any dependencies on the parallel corpus or translation models.
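The self-training module in the abstract generates pseudo soft and hard labels for unlabeled data. A minimal sketch of that labeling step follows, assuming the usual meaning of the terms: the soft label is the predicted class distribution and the hard label is its argmax. The function name and the example probabilities are hypothetical, and the abstract does not say how the two label types are weighted during training.

```python
def pseudo_labels(class_probs):
    """Return (soft_label, hard_label) for one unlabeled example.

    soft_label: the model's predicted class distribution, kept as-is.
    hard_label: index of the most probable class.
    """
    soft = list(class_probs)
    hard = max(range(len(soft)), key=soft.__getitem__)
    return soft, hard

# Hypothetical model predictions on two unlabeled target-language examples.
unlabeled_predictions = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
labels = [pseudo_labels(probs) for probs in unlabeled_predictions]
```

A self-training loop would then retrain on these pairs and regenerate the labels, repeating for as many rounds as the validation signal justifies.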

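Summary
Pre-trained language models have recently achieved great success on multilingual natural language processing (NLP) tasks, but a shortage of training data for many tasks in low-resource languages still limits their performance. One effective remedy is to transfer knowledge from rich-resource languages to low-resource languages. Many previous works on cross-lingual transfer, however, rely heavily on parallel corpora or translation models, which are often hard to obtain. We propose a new zero-shot cross-lingual transfer method with two parts: a Bilingual Task Fitting module that aligns task-related bilingual information, and a self-training module that generates pseudo soft and hard labels for unlabeled data and uses them for self-training. We obtain new state-of-the-art (SOTA) results on different tasks without any dependence on parallel corpora or translation models.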

Integrating Contrastive Learning into a Multitask Transformer Model for Effective Domain Adaptation

paper_url: http://arxiv.org/abs/2310.04703
repo_url: None
paper_authors: Chung-Soo Ahn, Jagath C. Rajapakse, Rajib Rana
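for: This work aims to improve generalization in speech emotion recognition (SER).
methods: It proposes a new domain adaptation technique: fine-tuning pre-trained transformers within a multitask framework that uses contrastive learning and an information maximisation loss.
results: Experiments show the model achieves state-of-the-art SER performance in cross-corpus scenarios.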

Abstract
While speech emotion recognition (SER) research has made significant progress, achieving generalization across various corpora continues to pose a problem. We propose a novel domain adaptation technique that embodies a multitask framework with SER as the primary task, and contrastive learning and information maximisation loss as auxiliary tasks, underpinned by fine-tuning of transformers pre-trained on large language models. Empirical results obtained through experiments on well-established datasets like IEMOCAP and MSP-IMPROV, illustrate that our proposed model achieves state-of-the-art performance in SER within cross-corpus scenarios.

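Summary
Speech emotion recognition (SER) research has made significant progress, but generalizing across corpora remains a problem. We propose a new domain adaptation technique built on a multitask framework, with SER as the primary task and contrastive learning and an information maximisation loss as auxiliary tasks, underpinned by fine-tuning of transformers pre-trained on large language models. Experiments show that the proposed model achieves state-of-the-art SER performance in cross-corpus scenarios.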

EMO: Earth Mover Distance Optimization for Auto-Regressive Language Modeling

paper_url: http://arxiv.org/abs/2310.04691
repo_url: https://github.com/drsy/emo
paper_authors: Siyu Ren, Zhiyong Wu, Kenny Q. Zhu
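for: Improving the performance and reliability of language models.
methods: Earth Mover Distance Optimization (EMO), which uses the properties of earth mover distance to address several shortcomings of maximum likelihood estimation (MLE).
results: Across domains, language models trained with EMO show consistently better language modeling performance than those trained with MLE, and fine-tuning on only 25,000 sentences yields notable downstream gains.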

Abstract
Neural language models are probabilistic models of human text. They are predominantly trained using maximum likelihood estimation (MLE), which is equivalent to minimizing the forward cross-entropy between the empirical data distribution and the model distribution. However, various degeneration phenomena are still widely observed when decoding from the distributions learned by such models. We establish that the forward cross-entropy is suboptimal as a distance metric for aligning human and model distribution due to its (1) recall-prioritization, (2) negative diversity ignorance, and (3) train-test mismatch. In this paper, we propose Earth Mover Distance Optimization (EMO) for auto-regressive language modeling. EMO capitalizes on the inherent properties of earth mover distance to address the aforementioned challenges. Due to the high complexity of direct computation, we further introduce a feasible upper bound for EMO to ease end-to-end training. Upon extensive evaluation of language models trained using EMO and MLE, we find that EMO demonstrates a consistently better language modeling performance than MLE across domains. Moreover, EMO demonstrates noteworthy enhancements in downstream performance with minimal fine-tuning on merely 25,000 sentences. This highlights the tremendous potential of EMO as a lightweight calibration method for enhancing large-scale pre-trained language models.
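The "negative diversity ignorance" point can be illustrated with a small sketch. When the target is a single token (a point mass), the earth mover distance to the model distribution has a closed form: every unit of probability must travel to the target, so the cost is the probability-weighted distance to it. The cosine-distance cost over token embeddings and all numbers below are assumptions for illustration; this is not the paper's upper bound or training objective.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def emd_to_one_hot(model_probs, target_index, embeddings):
    """Earth mover distance between model_probs and a point mass on the target.

    With a point-mass target the transport plan is unique, so the distance
    is the expected cost of moving each token's probability to the target.
    """
    target = embeddings[target_index]
    return sum(q * cosine_distance(e, target)
               for q, e in zip(model_probs, embeddings))

def cross_entropy(model_probs, target_index):
    """Forward cross-entropy against a one-hot target."""
    return -math.log(model_probs[target_index])

# Hypothetical 2-d token embeddings: token 1 is close to token 0, token 2 is not.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
near_miss = [0.5, 0.5, 0.0]  # leftover mass on a near-synonym of the target
far_miss = [0.5, 0.0, 0.5]   # leftover mass on an unrelated token

ce_near, ce_far = cross_entropy(near_miss, 0), cross_entropy(far_miss, 0)
emd_near = emd_to_one_hot(near_miss, 0, embeddings)
emd_far = emd_to_one_hot(far_miss, 0, embeddings)
```

Cross-entropy assigns both predictions the same loss because it looks only at the probability of the target token, while the transport cost is far smaller for the prediction whose leftover mass sits on a nearby token.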

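Summary
Neural language models are probabilistic models of human text. They are trained mainly with maximum likelihood estimation (MLE), which is equivalent to minimizing the forward cross-entropy between the empirical data distribution and the model distribution. Even so, various degeneration phenomena are still widely observed when decoding from the learned distributions. We show that forward cross-entropy is a suboptimal distance metric for aligning human and model distributions because of (1) recall-prioritization, (2) negative diversity ignorance, and (3) train-test mismatch. In this paper, we propose Earth Mover Distance Optimization (EMO) for auto-regressive language modeling. EMO uses the inherent properties of earth mover distance to address these problems. Because direct computation is costly, we also introduce a feasible upper bound for EMO to ease end-to-end training. After extensive evaluation of language models trained with EMO and MLE, we find that EMO gives consistently better language modeling performance across domains. EMO also brings notable downstream gains after fine-tuning on only 25,000 sentences, which shows its potential as a lightweight calibration method for large-scale pre-trained language models.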

DORIS-MAE: Scientific Document Retrieval using Multi-level Aspect-based Queries

paper_url: http://arxiv.org/abs/2310.04678
repo_url: https://github.com/real-doris-mae/doris-mae-dataset
paper_authors: Jianyou Wang, Kaicheng Wang, Xiaoyue Wang, Prudhviraj Naidu, Leon Bergen, Ramamohan Paturi
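for: This work proposes a new task, Scientific DOcument Retrieval using Multi-level Aspect-based quEries (DORIS-MAE), to address complex queries in scientific research.
methods: The authors build 100 human-authored complex query cases and, for each, assemble 100 relevant documents with annotated relevance scores for ranking. They also introduce Anno-GPT, a scalable framework for validating the performance of large language models (LLMs) on expert-level dataset annotation tasks.
results: An evaluation of 17 recent retrieval methods shows notable performance drops compared with traditional datasets, which highlights the need for better handling of complex, multifaceted queries in scientific research.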

Abstract
In scientific research, the ability to effectively retrieve relevant documents based on complex, multifaceted queries is critical. Existing evaluation datasets for this task are limited, primarily due to the high cost and effort required to annotate resources that effectively represent complex queries. To address this, we propose a novel task, Scientific DOcument Retrieval using Multi-level Aspect-based quEries (DORIS-MAE), which is designed to handle the complex nature of user queries in scientific research. We developed a benchmark dataset within the field of computer science, consisting of 100 human-authored complex query cases. For each complex query, we assembled a collection of 100 relevant documents and produced annotated relevance scores for ranking them. Recognizing the significant labor of expert annotation, we also introduce Anno-GPT, a scalable framework for validating the performance of Large Language Models (LLMs) on expert-level dataset annotation tasks. LLM annotation of the DORIS-MAE dataset resulted in a 500x reduction in cost, without compromising quality. Furthermore, due to the multi-tiered structure of these complex queries, the DORIS-MAE dataset can be extended to over 4,000 sub-query test cases without requiring additional annotation. We evaluated 17 recent retrieval methods on DORIS-MAE, observing notable performance drops compared to traditional datasets. This highlights the need for better approaches to handle complex, multifaceted queries in scientific research. Our dataset and codebase are available at https://github.com/Real-Doris-Mae/Doris-Mae-Dataset.

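Summary
In scientific research, retrieving relevant documents effectively from complex, multifaceted queries is critical. Existing evaluation datasets for this task are limited, mainly because annotating suitable resources is costly and laborious. To address this, we propose a new task, Scientific DOcument Retrieval using Multi-level Aspect-based quEries (DORIS-MAE), designed to handle the complexity of user queries in scientific research. We built a benchmark dataset in computer science consisting of 100 human-authored complex query cases. For each query, we assembled 100 relevant documents and produced annotated relevance scores. Recognizing the labor that expert annotation requires, we also introduce Anno-GPT, a scalable framework for validating the performance of large language models (LLMs) on expert-level dataset annotation tasks. LLM annotation of the DORIS-MAE dataset cut cost by 500x without compromising quality. Because of the multi-level structure of the queries, the dataset can also be extended to over 4,000 sub-query test cases without further annotation. We evaluated 17 recent retrieval methods on DORIS-MAE and observed notable performance drops, which shows that complex, multifaceted queries in scientific research need better approaches. Our dataset and codebase are available at https://github.com/Real-Doris-Mae/Doris-Mae-Dataset.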