cs.CL - 2023-11-30

Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation

  • paper_url: http://arxiv.org/abs/2312.00220
  • repo_url: None
  • paper_authors: Linzi Xing, Quan Tran, Fabian Caba, Franck Dernoncourt, Seunghyun Yoon, Zhaowen Wang, Trung Bui, Giuseppe Carenini
  • for: Providing better results for video understanding tasks, specifically video topic segmentation.
  • methods: A multi-modal video topic segmenter that uses both video transcripts and frames, bolstered by a cross-modal attention mechanism; a dual-contrastive learning framework following the unsupervised domain adaptation paradigm improves the model's adaptability to longer, more semantically complex videos.
  • results: Experiments on short and long video corpora show that the proposed approach surpasses baseline methods in both accuracy and cross-domain transferability.
    Abstract Video topic segmentation unveils the coarse-grained semantic structure underlying videos and is essential for other video understanding tasks. Given the recent surge in multi-modal content, relying solely on a single modality is arguably insufficient. On the other hand, prior solutions for similar tasks like video scene/shot segmentation cater to short videos with clear visual shifts but falter for long videos with subtle changes, such as livestreams. In this paper, we introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames, bolstered by a cross-modal attention mechanism. Furthermore, we propose a dual-contrastive learning framework adhering to the unsupervised domain adaptation paradigm, enhancing our model's adaptability to longer, more semantically complex videos. Experiments on short and long video corpora demonstrate that our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability, in both intra- and cross-domain settings.
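
The cross-modal attention mechanism is only described at a high level here; below is a minimal sketch of one common instantiation, in which transcript (sentence) features attend over frame features. The dimensions, head count, and residual fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Generic cross-modal attention: text queries attend over frame features."""
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, frame_feats):
        # text_feats: (batch, n_sentences, dim); frame_feats: (batch, n_frames, dim)
        fused, _ = self.attn(query=text_feats, key=frame_feats, value=frame_feats)
        return self.norm(text_feats + fused)  # residual fusion

fused = CrossModalAttention()(torch.randn(2, 10, 256), torch.randn(2, 32, 256))
print(fused.shape)  # torch.Size([2, 10, 256])
```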

Relevance-guided Neural Machine Translation

  • paper_url: http://arxiv.org/abs/2312.00214
  • repo_url: None
  • paper_authors: Isidora Chara Tourni, Derry Wijaya
  • for: Improving NMT results in low-resource conditions.
  • methods: An explainability-based training approach, applied in both unsupervised and supervised model training, for translating French, Gujarati, and Kazakh to and from English.
  • results: The method can outperform simple training baselines in low-resource conditions, particularly for French and Gujarati; although the improvement is marginal, it motivates further exploration of the approach and its parameters, and its extension to other languages.
    Abstract With the advent of the Transformer architecture, Neural Machine Translation (NMT) results have shown great improvement lately. However, results in low-resource conditions still lag behind in both bilingual and multilingual setups, due to the limited amount of available monolingual and/or parallel data; hence, the need for methods addressing data scarcity in an efficient and explainable way is evident. We propose an explainability-based training approach for NMT, applied in Unsupervised and Supervised model training, for translation of three languages of varying resources, French, Gujarati, Kazakh, to and from English. Our results show our method can be promising, particularly when training in low-resource conditions, outperforming simple training baselines; though the improvement is marginal, it sets the ground for further exploration of the approach and the parameters, and its extension to other languages.

Robust Concept Erasure via Kernelized Rate-Distortion Maximization

  • paper_url: http://arxiv.org/abs/2312.00194
  • repo_url: https://github.com/brcsomnath/kram
  • paper_authors: Somnath Basu Roy Chowdhury, Nicholas Monath, Avinava Dubey, Amr Ahmed, Snigdha Chaturvedi
  • for: Proposing a new distance metric learning objective, the Kernelized Rate-Distortion Maximizer (KRaM), for performing concept erasure.
  • methods: KRaM fits a transformation of representations to match a specified distance measure (defined by the labeled concept to erase) using a modified rate-distortion function. The objective aims to make instances with similar concept labels dissimilar in the learned representation space while retaining other information.
  • results: Optimizing KRaM effectively erases various types of concepts (categorical, continuous, and vector-valued variables) from data representations across diverse domains. The authors also provide a theoretical analysis of the KRaM objective and propose an alignment score for evaluating the quality of the learned representations.
    Abstract Distributed representations provide a vector space that captures meaningful relationships between data instances. The distributed nature of these representations, however, entangles together multiple attributes or concepts of data instances (e.g., the topic or sentiment of a text, characteristics of the author (age, gender, etc), etc). Recent work has proposed the task of concept erasure, in which rather than making a concept predictable, the goal is to remove an attribute from distributed representations while retaining other information from the original representation space as much as possible. In this paper, we propose a new distance metric learning-based objective, the Kernelized Rate-Distortion Maximizer (KRaM), for performing concept erasure. KRaM fits a transformation of representations to match a specified distance measure (defined by a labeled concept to erase) using a modified rate-distortion function. Specifically, KRaM's objective function aims to make instances with similar concept labels dissimilar in the learned representation space while retaining other information. We find that optimizing KRaM effectively erases various types of concepts: categorical, continuous, and vector-valued variables from data representations across diverse domains. We also provide a theoretical analysis of several properties of KRaM's objective. To assess the quality of the learned representations, we propose an alignment score to evaluate their similarity with the original representation space. Additionally, we conduct experiments to showcase KRaM's efficacy in various settings, from erasing binary gender variables in word embeddings to vector-valued variables in GPT-3 representations.
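
KRaM's objective is a modified rate-distortion function; the sketch below is a loose, coding-rate-flavored illustration of the two competing terms described above (spread apart instances sharing a concept label while staying close to the original space). The `coding_rate` form and the `alpha` weighting are assumptions, not the authors' formulation.

```python
import torch

def coding_rate(Z, eps=0.5):
    """Coding rate R(Z) = 1/2 logdet(I + d/(n*eps^2) Z^T Z); larger = more spread out."""
    n, d = Z.shape
    return 0.5 * torch.logdet(torch.eye(d) + (d / (n * eps ** 2)) * Z.T @ Z)

def kram_like_loss(Z, Z_orig, labels, alpha=0.1):
    # Spread out instances that share a concept label (erase the concept) ...
    spread = sum(coding_rate(Z[labels == c]) for c in labels.unique())
    # ... while staying close to the original representations (retain other info).
    distortion = ((Z - Z_orig) ** 2).mean()
    return -spread + alpha * distortion  # minimizing maximizes the spread term

Z_orig = torch.randn(64, 16)
Z = (Z_orig + 0.01 * torch.randn_like(Z_orig)).requires_grad_()
labels = torch.randint(0, 2, (64,))
print(kram_like_loss(Z, Z_orig, labels))
```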

Navigating News Narratives: A Media Bias Analysis Dataset

  • paper_url: http://arxiv.org/abs/2312.00168
  • repo_url: None
  • paper_authors: Shaina Raza
  • for: Providing a media bias analysis dataset for studying the influence of media bias and developing detection methods.
  • methods: Natural language processing and machine learning techniques are used to classify and analyze media articles, detecting the presence and characteristics of media bias.
  • results: A rich media bias analysis dataset covering a broad spectrum of biases, usable for further research and for developing media bias detection and analysis methods.
    Abstract The proliferation of biased news narratives across various media platforms has become a prominent challenge, influencing public opinion on critical topics like politics, health, and climate change. This paper introduces the "Navigating News Narratives: A Media Bias Analysis Dataset", a comprehensive dataset to address the urgent need for tools to detect and analyze media bias. This dataset encompasses a broad spectrum of biases, making it a unique and valuable asset in the field of media studies and artificial intelligence. The dataset is available at https://huggingface.co/datasets/newsmediabias/news-bias-full-data.
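
The dataset id below is taken from the URL in the abstract; the split and column names are assumptions to verify on first load. A minimal sketch with the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Dataset id from the abstract's URL; splits/fields may differ from what is shown.
ds = load_dataset("newsmediabias/news-bias-full-data")
print(ds)              # inspect available splits and columns
print(ds["train"][0])  # peek at one example (assumes a "train" split)
```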

A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

  • paper_url: http://arxiv.org/abs/2312.00115
  • repo_url: None
  • paper_authors: Matthew Gwilliam, Michael Cogswell, Meng Ye, Karan Sikka, Abhinav Shrivastava, Ajay Divakaran
  • for: Evaluating the capabilities of long video retrieval systems and proposing a pipeline for generating diverse synthetic captions for long videos.
  • methods: State-of-the-art large language models are used to generate synthetic captions for long videos, and a contrastive loss is used to learn a hierarchical embedding loss for fine-tuning the video language models.
  • results: The proposed method improves performance on the downstream paragraph-to-video retrieval task and on the various long video retrieval metrics computed with the synthetic data: a 1.1% increase in R@1 on ActivityNet and a 3.6% increase in R@1 for short descriptions on ActivityNet.
    Abstract Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime, where every long video is described by a single long paragraph. This neglects the richness and variety of possible valid descriptions of a video, which could be described in moment-by-moment detail, or in a single phrase summary, or anything in between. To provide a more thorough evaluation of the capabilities of long video retrieval systems, we propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos. We validate this pipeline's fidelity via rigorous human inspection. We then benchmark a representative set of video language models on these synthetic captions using a few long video datasets, showing that they struggle with the transformed data, especially the shortest captions. We also propose a lightweight fine-tuning method, where we use a contrastive loss to learn a hierarchical embedding loss based on the differing levels of information among the various captions. Our method improves performance both on the downstream paragraph-to-video retrieval task (+1.1% R@1 on ActivityNet), as well as for the various long video retrieval metrics we compute using our synthetic data (+3.6% R@1 for short descriptions on ActivityNet). For data access and other details, please refer to our project website at https://mgwillia.github.io/10k-words.
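
The fine-tuning uses a contrastive loss over captions at differing levels of detail; the sketch below shows a generic symmetric InfoNCE applied per caption granularity, which is one plausible building block rather than the paper's exact hierarchical loss.

```python
import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched video/caption embedding pairs."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (batch, batch) similarities
    targets = torch.arange(len(v))                 # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# One loss term per caption granularity (e.g., phrase / sentence / paragraph);
# weighting the terms differently is one way to encode the hierarchy.
video = torch.randn(8, 128)
captions = {"short": torch.randn(8, 128), "long": torch.randn(8, 128)}
loss = sum(info_nce(video, c) for c in captions.values())
print(loss)
```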

What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations

  • paper_url: http://arxiv.org/abs/2311.18812
  • repo_url: https://github.com/castorini/biasprobe
  • paper_authors: Raphael Tang, Xinyu Zhang, Jimmy Lin, Ferhan Ture
  • for: Investigating whether large language models (LLMs) exhibit sociodemographic biases, even when they decline to respond.
  • methods: The authors probe contextualized embeddings to see whether such bias is encoded in the models' latent representations, proposing a logistic Bradley-Terry probe that predicts an LLM's word-pair preferences from the words' hidden vectors.
  • results: The probe is validated on three pair-preference tasks and thirteen LLMs, beating WEAT by a relative 27% in error rate; word-pair preferences are best represented in the middle layers. Probes transferred from harmless tasks to controversial ones reveal substantial bias for nationality, politics, religion, and gender, suggesting that instruction fine-tuning does not necessarily debias contextualized embeddings.
    Abstract Do large language models (LLMs) exhibit sociodemographic biases, even when they decline to respond? To bypass their refusal to "speak," we study this research question by probing contextualized embeddings and exploring whether this bias is encoded in its latent representations. We propose a logistic Bradley-Terry probe which predicts word pair preferences of LLMs from the words' hidden vectors. We first validate our probe on three pair preference tasks and thirteen LLMs, where we outperform the word embedding association test (WEAT), a standard approach in testing for implicit association, by a relative 27% in error rate. We also find that word pair preferences are best represented in the middle layers. Next, we transfer probes trained on harmless tasks (e.g., pick the larger number) to controversial ones (compare ethnicities) to examine biases in nationality, politics, religion, and gender. We observe substantial bias for all target classes: for instance, the Mistral model implicitly prefers Europe to Africa, Christianity to Judaism, and left-wing to right-wing politics, despite declining to answer. This suggests that instruction fine-tuning does not necessarily debias contextualized embeddings. Our codebase is at https://github.com/castorini/biasprobe.
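
A logistic Bradley-Terry probe has a simple general form: score each hidden vector linearly, and model the preference probability as the sigmoid of the score difference. A minimal sketch (the hidden size 768 is an illustrative choice):

```python
import torch
import torch.nn as nn

class BradleyTerryProbe(nn.Module):
    """P(a preferred over b) = sigmoid(score(a) - score(b)) with a linear scorer."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1, bias=False)

    def forward(self, h_a, h_b):
        return torch.sigmoid(self.score(h_a) - self.score(h_b)).squeeze(-1)

probe = BradleyTerryProbe(dim=768)
h_a, h_b = torch.randn(4, 768), torch.randn(4, 768)   # hidden vectors of word pairs
prefs = torch.ones(4)                                 # labels: "a" preferred each time
loss = nn.functional.binary_cross_entropy(probe(h_a, h_b), prefs)
loss.backward()
```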

BioCLIP: A Vision Foundation Model for the Tree of Life

  • paper_url: http://arxiv.org/abs/2311.18803
  • repo_url: None
  • paper_authors: Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M Dahdul, Charles Stewart, Tanya Berger-Wolf, Wei-Lun Chao, Yu Su
  • for: Providing a general computational approach for extracting biological information from images of organisms.
  • methods: The authors curate and release TreeOfLife-10M, the largest and most diverse ML-ready dataset of biology images, and develop BioCLIP, a foundation model for the tree of life that leverages the abundance and variety of images of plants, animals, and fungi together with rich structured biological knowledge.
  • results: BioCLIP consistently and substantially outperforms existing baselines on diverse fine-grained biology classification tasks (by 17% to 20% absolute). Intrinsic evaluation shows that BioCLIP has learned a hierarchical representation conforming to the tree of life, shedding light on its strong generalizability.
    Abstract Images of the natural world, collected by a variety of cameras, from drones to individual phones, are increasingly abundant sources of biological information. There is an explosion of computational methods and tools, particularly computer vision, for extracting biologically relevant information from images for science and conservation. Yet most of these are bespoke approaches designed for a specific task and are not easily adaptable or extendable to new questions, contexts, and datasets. A vision model for general organismal biology questions on images is of timely need. To approach this, we curate and release TreeOfLife-10M, the largest and most diverse ML-ready dataset of biology images. We then develop BioCLIP, a foundation model for the tree of life, leveraging the unique properties of biology captured by TreeOfLife-10M, namely the abundance and variety of images of plants, animals, and fungi, together with the availability of rich structured biological knowledge. We rigorously benchmark our approach on diverse fine-grained biology classification tasks, and find that BioCLIP consistently and substantially outperforms existing baselines (by 17% to 20% absolute). Intrinsic evaluation reveals that BioCLIP has learned a hierarchical representation conforming to the tree of life, shedding light on its strong generalizability. Our code, models and data will be made available at https://github.com/Imageomics/bioclip.

X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning

  • paper_url: http://arxiv.org/abs/2311.18799
  • repo_url: https://github.com/artemisp/lavis-xinstructblip
  • paper_authors: Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles
  • for: Proposing a simple and effective cross-modality framework for integrating various modalities without extensive modality-specific customization.
  • methods: The framework builds on vision-language pre-training and instruction tuning atop frozen LLMs, with instruction tuning data collected automatically and at scale (24K QA samples for audio and 250K for 3D).
  • results: Leveraging instruction-aware representations, the model performs comparably with leading-edge counterparts without extensive modality-specific pre-training or customization, and demonstrates cross-modal reasoning across two or more input modalities even though each modality projection is trained individually.
    Abstract Vision-language pre-training and instruction tuning have demonstrated general-purpose capabilities in 2D visual reasoning tasks by aligning visual encoders with state-of-the-art large language models (LLMs). In this paper, we introduce a simple, yet effective, cross-modality framework built atop frozen LLMs that allows the integration of various modalities without extensive modality-specific customization. To facilitate instruction-modality fine-tuning, we collect high-quality instruction tuning data in an automatic and scalable manner, composed of 24K QA samples for audio and 250K QA samples for 3D. Leveraging instruction-aware representations, our model performs comparably with leading-edge counterparts without the need of extensive modality-specific pre-training or customization. Furthermore, our approach demonstrates cross-modal reasoning abilities across two or more input modalities, despite each modality projection being trained individually. To study the model's cross-modal abilities, we contribute a novel Discriminative Cross-modal Reasoning (DisCRn) evaluation task, comprising 9K audio-video QA samples and 28K image-3D QA samples that require the model to reason discriminatively across disparate input modalities.
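
In the InstructBLIP lineage, each modality is aligned to the frozen LLM through a projection module with learnable queries; the sketch below is a generic stand-in with illustrative dimensions, not the paper's exact component.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Map a frozen modality encoder's features into the LLM's embedding space,
    so they can be prepended to the text token embeddings of a frozen LLM."""
    def __init__(self, enc_dim: int, llm_dim: int, n_tokens: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_tokens, enc_dim) * 0.02)
        self.attn = nn.MultiheadAttention(enc_dim, 8, batch_first=True)
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, feats):                   # feats: (batch, seq, enc_dim)
        q = self.queries.expand(feats.size(0), -1, -1)
        pooled, _ = self.attn(q, feats, feats)  # learnable queries attend to features
        return self.proj(pooled)                # (batch, n_tokens, llm_dim)

tokens = ModalityProjector(1024, 4096)(torch.randn(2, 257, 1024))
print(tokens.shape)  # torch.Size([2, 32, 4096])
```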

Mavericks at BLP-2023 Task 1: Ensemble-based Approach Using Language Models for Violence Inciting Text Detection

  • paper_url: http://arxiv.org/abs/2311.18778
  • repo_url: None
  • paper_authors: Saurabh Page, Sudeep Mangalvedhekar, Kshitij Deshpande, Tanmay Chavan, Sheetal Sonawane
  • for: Detecting hate and violence-inciting text on social media to curb its propagation.
  • methods: Experiments with several BERT-based models, with an ensemble of the models used for the final submission.
  • results: The submission ranked 10th on the final leaderboard of the Violence Inciting Text Detection shared task at the First Workshop on Bangla Language Processing, with a macro F1 score of 0.737.
    Abstract This paper presents our work for the Violence Inciting Text Detection shared task in the First Workshop on Bangla Language Processing. Social media has accelerated the propagation of hate and violence-inciting speech in society. It is essential to develop efficient mechanisms to detect and curb the propagation of such texts. The problem of detecting violence-inciting texts is further exacerbated in low-resource settings due to sparse research and less data. The data provided in the shared task consists of texts in the Bangla language, where each example is classified into one of the three categories defined based on the types of violence-inciting texts. We try and evaluate several BERT-based models, and then use an ensemble of the models as our final submission. Our submission is ranked 10th in the final leaderboard of the shared task with a macro F1 score of 0.737.
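
The abstract does not specify the combination rule, so the sketch below assumes the simplest one: averaging class probabilities across fine-tuned Hugging Face-style classifiers.

```python
import torch

def ensemble_predict(models, input_ids, attention_mask):
    """Average class probabilities over several fine-tuned classifiers
    and take the argmax as the ensemble prediction."""
    probs = []
    with torch.no_grad():
        for model in models:
            out = model(input_ids=input_ids, attention_mask=attention_mask)
            probs.append(out.logits.softmax(dim=-1))
    return torch.stack(probs).mean(dim=0).argmax(dim=-1)

# Usage sketch (checkpoint paths are hypothetical):
# models = [AutoModelForSequenceClassification.from_pretrained(p).eval()
#           for p in ["ckpt-bert-1", "ckpt-bert-2", "ckpt-bert-3"]]
# preds = ensemble_predict(models, batch["input_ids"], batch["attention_mask"])
```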

Can training neural language models on a curriculum with developmentally plausible data improve alignment with human reading behavior?

  • paper_url: http://arxiv.org/abs/2311.18761
  • repo_url: None
  • paper_authors: Aryaman Chobey, Oliver Smith, Anzi Wang, Grusha Prasad
  • for: Studying whether training neural language models on a curriculum of developmentally plausible data improves their alignment with human reading behavior.
  • methods: Teacher language models are trained on the BabyLM "strict-small" dataset, and sentence-level surprisal estimates from these teachers are used to create a curriculum.
  • results: The curriculum shows tentative evidence of making it easier for models to acquire linguistic knowledge from the training data, but this does not translate into more accurate predictions of human reading behavior.
    Abstract The use of neural language models to model human behavior has met with mixed success. While some work has found that the surprisal estimates from these models can be used to predict a wide range of human neural and behavioral responses, other work studying more complex syntactic phenomena has found that these surprisal estimates generate incorrect behavioral predictions. This paper explores the extent to which the misalignment between empirical and model-predicted behavior can be minimized by training models on more developmentally plausible data, such as in the BabyLM Challenge. We trained teacher language models on the BabyLM "strict-small" dataset and used sentence level surprisal estimates from these teacher models to create a curriculum. We found tentative evidence that our curriculum made it easier for models to acquire linguistic knowledge from the training data: on the subset of tasks in the BabyLM challenge suite evaluating models' grammatical knowledge of English, models first trained on the BabyLM data curriculum and then on a few randomly ordered training epochs performed slightly better than models trained on randomly ordered epochs alone. This improved linguistic knowledge acquisition did not result in better alignment with human reading behavior, however: models trained on the BabyLM dataset (with or without a curriculum) generated predictions that were as misaligned with human behavior as models trained on larger less curated datasets. This suggests that training on developmentally plausible datasets alone is likely insufficient to generate language models capable of accurately predicting human language processing.
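
Sentence-level surprisal under a teacher LM is just the mean token negative log-likelihood, and sorting by it yields an easy-to-hard curriculum. The sketch uses GPT-2 as a stand-in teacher (the paper trains its own teachers on the BabyLM data):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in teacher model
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sentence_surprisal(text: str) -> float:
    """Mean token surprisal (nats) of `text` under the teacher LM."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean negative log-likelihood per token
    return loss.item()

sentences = ["The cat sat.", "Colorless green ideas sleep furiously."]
curriculum = sorted(sentences, key=sentence_surprisal)  # easiest first
print(curriculum)
```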

Mavericks at NADI 2023 Shared Task: Unravelling Regional Nuances through Dialect Identification using Transformer-based Approach

  • paper_url: http://arxiv.org/abs/2311.18739
  • repo_url: None
  • paper_authors: Vedant Deshpande, Yash Patwardhan, Kshitij Deshpande, Sudeep Mangalvedhekar, Ravindra Murumkar
  • for: Presenting an approach to the Nuanced Arabic Dialect Identification (NADI) Shared Task 2023, focusing on subtask 1 (country-level dialect identification).
  • methods: Transformer-based models pre-trained on Arabic are fine-tuned on the provided dataset, and ensembling is applied to improve system performance.
  • results: An F1 score of 76.65 on the test dataset (11th rank on the leaderboard).
    Abstract In this paper, we present our approach for the "Nuanced Arabic Dialect Identification (NADI) Shared Task 2023". We highlight our methodology for subtask 1 which deals with country-level dialect identification. Recognizing dialects plays an instrumental role in enhancing the performance of various downstream NLP tasks such as speech recognition and translation. The task uses the Twitter dataset (TWT-2023) that encompasses 18 dialects for the multi-class classification problem. Numerous transformer-based models, pre-trained on Arabic language, are employed for identifying country-level dialects. We fine-tune these state-of-the-art models on the provided dataset. The ensembling method is leveraged to yield improved performance of the system. We achieved an F1-score of 76.65 (11th rank on the leaderboard) on the test dataset.

Mavericks at ArAIEval Shared Task: Towards a Safer Digital Space – Transformer Ensemble Models Tackling Deception and Persuasion

  • paper_url: http://arxiv.org/abs/2311.18730
  • repo_url: None
  • paper_authors: Sudeep Mangalvedhekar, Kshitij Deshpande, Yash Patwardhan, Vedant Deshpande, Ravindra Murumkar
  • for: Presenting an approach to the Arabic AI Tasks Evaluation (ArAiEval) Shared Task 2023.
  • methods: For task 1-A (persuasion technique detection) and task 2-A (disinformation detection), several transformer-based models pre-trained on Arabic are fine-tuned on the provided datasets, with ensembling used to enhance performance.
  • results: Micro F1 scores of 0.742 on task 1-A (8th on the leaderboard) and 0.901 on task 2-A (7th on the leaderboard).
    Abstract In this paper, we highlight our approach for the "Arabic AI Tasks Evaluation (ArAiEval) Shared Task 2023". We present our approaches for task 1-A and task 2-A of the shared task which focus on persuasion technique detection and disinformation detection respectively. Detection of persuasion techniques and disinformation has become imperative to avoid distortion of authentic information. The tasks use multigenre snippets of tweets and news articles for the given binary classification problem. We experiment with several transformer-based models that were pre-trained on the Arabic language. We fine-tune these state-of-the-art models on the provided dataset. Ensembling is employed to enhance the performance of the systems. We achieved a micro F1-score of 0.742 on task 1-A (8th rank on the leaderboard) and 0.901 on task 2-A (7th rank on the leaderboard) respectively.

Automatic Functional Differentiation in JAX

  • paper_url: http://arxiv.org/abs/2311.18727
  • repo_url: https://github.com/sail-sg/autofd
  • paper_authors: Min Lin
  • for: Extending JAX with the capability to automatically differentiate higher-order functions (functionals and operators).
  • methods: Functions are represented as a generalization of arrays, so JAX's existing primitive system can implement higher-order functions. A set of primitive operators serves as foundational building blocks for constructing key types of functionals; for every introduced primitive, linearization and transposition rules are derived and implemented, aligning with JAX's internal protocols for forward- and reverse-mode automatic differentiation.
  • results: Functional differentiation can be expressed in the same syntax traditionally used for functions, and the resulting functional gradients are themselves functions ready to be invoked in Python. Applications where functional derivatives are indispensable demonstrate the tool's efficacy and simplicity. The source code is released at https://github.com/sail-sg/autofd.
    Abstract We extend JAX with the capability to automatically differentiate higher-order functions (functionals and operators). By representing functions as a generalization of arrays, we seamlessly use JAX's existing primitive system to implement higher-order functions. We present a set of primitive operators that serve as foundational building blocks for constructing several key types of functionals. For every introduced primitive operator, we derive and implement both linearization and transposition rules, aligning with JAX's internal protocols for forward and reverse mode automatic differentiation. This enhancement allows for functional differentiation in the same syntax traditionally used for functions. The resulting functional gradients are themselves functions ready to be invoked in python. We showcase this tool's efficacy and simplicity through applications where functional derivatives are indispensable. The source code of this work is released at https://github.com/sail-sg/autofd .
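
The autofd API itself is not reproduced here; for orientation, the sketch below recalls how stock JAX differentiates functions of arrays, the base case that the paper generalizes to derivatives with respect to functions.

```python
import jax
import jax.numpy as jnp

# Ordinary JAX: differentiate a scalar function of an array input.
def f(x):
    return jnp.sum(jnp.sin(x) ** 2)

grad_f = jax.grad(f)                  # the gradient is again a function of x
print(grad_f(jnp.array([0.1, 0.2])))

# Functional differentiation (the paper's contribution) generalizes this:
# for a functional F[g] = ∫ g(x)^2 dx, the functional derivative δF/δg is
# itself a function, 2*g — autofd makes such derivatives expressible in JAX.
```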

CoRec: An Easy Approach for Coordination Recognition

  • paper_url: http://arxiv.org/abs/2311.18712
  • repo_url: https://github.com/qingwang-isu/corec
  • paper_authors: Qing Wang, Haojie Jia, Wenfei Song, Qi Li
  • for: Observing and addressing the challenges of the coordination recognition task.
  • methods: A pipeline model, COordination RECognizer (CoRec), with two components: a coordinator identifier and a conjunct boundary detector.
  • results: Experiments on datasets from various domains demonstrate the effectiveness and efficiency of the method, including on long and complicated sentences. CoRec also positively impacts downstream tasks, improving the yield of state-of-the-art Open IE models.
    Abstract In this paper, we observe and address the challenges of the coordination recognition task. Most existing methods rely on syntactic parsers to identify the coordinators in a sentence and detect the coordination boundaries. However, state-of-the-art syntactic parsers are slow and suffer from errors, especially for long and complicated sentences. To better solve the problems, we propose a pipeline model COordination RECognizer (CoRec). It consists of two components: coordinator identifier and conjunct boundary detector. The experimental results on datasets from various domains demonstrate the effectiveness and efficiency of the proposed method. Further experiments show that CoRec positively impacts downstream tasks, improving the yield of state-of-the-art Open IE models.
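
Conjunct boundary detection is naturally cast as sequence labeling; the sketch below shows how conjunct spans might be decoded from BIO-style tags. The tagging scheme is illustrative and not necessarily CoRec's own.

```python
def decode_conjuncts(tokens, tags):
    """Decode BIO tags into conjunct spans around an identified coordinator."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":                        # begin a new conjunct
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))          # close the current conjunct
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return [" ".join(tokens[s:e]) for s, e in spans]

tokens = "She bought apples , pears and ripe plums".split()
tags   = ["O", "O", "B", "O", "B", "O", "B", "I"]
print(decode_conjuncts(tokens, tags))   # ['apples', 'pears', 'ripe plums']
```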

Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling

  • paper_url: http://arxiv.org/abs/2311.18711
  • repo_url: None
  • paper_authors: Matúš Pikuliak, Andrea Hrckova, Stefan Oresko, Marián Šimko
  • for: Measuring gender-stereotypical reasoning in masked language models and English-to-X machine translation systems.
  • methods: A new dataset, GEST, containing samples compatible with 9 Slavic languages and English for 16 gender stereotypes about men and women (e.g., women are beautiful, men are leaders), with the stereotype definitions informed by gender experts.
  • results: Evaluating 11 masked language models and 4 machine translation systems with GEST reveals significant and consistent amounts of stereotypical reasoning in almost all evaluated models and languages.
    Abstract We present GEST -- a new dataset for measuring gender-stereotypical reasoning in masked LMs and English-to-X machine translation systems. GEST contains samples that are compatible with 9 Slavic languages and English for 16 gender stereotypes about men and women (e.g., Women are beautiful, Men are leaders). The definition of said stereotypes was informed by gender experts. We used GEST to evaluate 11 masked LMs and 4 machine translation systems. We discovered significant and consistent amounts of stereotypical reasoning in almost all the evaluated models and languages.
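
GEST's exact protocol is not reproduced here, but the general idea of probing a masked LM for gendered associations can be sketched with a fill-mask pipeline:

```python
from transformers import pipeline

# Compare how strongly a masked LM associates a stereotyped predicate
# with a male vs. female subject (templates here are illustrative).
fill = pipeline("fill-mask", model="bert-base-uncased")

for template in ["[MASK] is beautiful.", "[MASK] is a leader."]:
    scores = {r["token_str"]: r["score"]
              for r in fill(template, targets=["he", "she"])}
    print(template, scores)
```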

RaDialog: A Large Vision-Language Model for Radiology Report Generation and Conversational Assistance

  • paper_url: http://arxiv.org/abs/2311.18681
  • repo_url: https://github.com/chantalmp/radialog
  • paper_authors: Chantal Pellegrini, Ege Özsoy, Benjamin Busam, Nassir Navab, Matthias Keicher
  • for: Creating a collaborative radiology assistant that saves time and improves report quality.
  • methods: A large vision-language model integrates visual image features with structured pathology findings, adapted to the specialized domain via parameter-efficient fine-tuning, and is trained on a comprehensive, semi-automatically labeled, image-grounded instruct dataset for chest X-ray tasks.
  • results: The method achieves state-of-the-art clinical correctness in report generation and shows impressive abilities in interactive tasks such as correcting reports and answering questions.
    Abstract Conversational AI tools that can generate and discuss clinically correct radiology reports for a given medical image have the potential to transform radiology. Such a human-in-the-loop radiology assistant could facilitate a collaborative diagnostic process, thus saving time and improving the quality of reports. Towards this goal, we introduce RaDialog, the first thoroughly evaluated and publicly available large vision-language model for radiology report generation and interactive dialog. RaDialog effectively integrates visual image features and structured pathology findings with a large language model (LLM) while simultaneously adapting it to a specialized domain using parameter-efficient fine-tuning. To keep the conversational abilities of the underlying LLM, we propose a comprehensive, semi-automatically labeled, image-grounded instruct dataset for chest X-ray radiology tasks. By training with this dataset, our method achieves state-of-the-art clinical correctness in report generation and shows impressive abilities in interactive tasks such as correcting reports and answering questions, serving as a foundational step toward clinical dialog systems. Our code is available on github: https://github.com/ChantalMP/RaDialog.

ArcMMLU: A Library and Information Science Benchmark for Large Language Models

  • paper_url: http://arxiv.org/abs/2311.18658
  • repo_url: https://github.com/stzhang-patrick/arcmmlu
  • paper_authors: Shitou Zhang, Zuchao Li, Xingshen Liu, Liming Yang, Ping Wang
  • for: This paper aims to develop a specialized benchmark for evaluating the capabilities of large language models (LLMs) in the Library & Information Science (LIS) domain in Chinese.
  • methods: The paper introduces ArcMMLU, a benchmark tailored for the LIS domain that includes four key sub-domains: Archival Science, Data Science, Library Science, and Information Science. The benchmark is based on the format of MMLU/CMMLU and includes over 6,000 high-quality questions to reflect the diverse nature of the LIS domain.
  • results: The comprehensive evaluation shows that while most mainstream LLMs achieve an average accuracy rate above 50% on ArcMMLU, there remains a notable performance gap, suggesting substantial headroom for refinement in LLM capabilities within the LIS domain. The paper also explores the effectiveness of few-shot examples on model performance and highlights challenging questions where models consistently underperform.
    Abstract In light of the rapidly evolving capabilities of large language models (LLMs), it becomes imperative to develop rigorous domain-specific evaluation benchmarks to accurately assess their capabilities. In response to this need, this paper introduces ArcMMLU, a specialized benchmark tailored for the Library & Information Science (LIS) domain in Chinese. This benchmark aims to measure the knowledge and reasoning capability of LLMs within four key sub-domains: Archival Science, Data Science, Library Science, and Information Science. Following the format of MMLU/CMMLU, we collected over 6,000 high-quality questions for the compilation of ArcMMLU. This extensive compilation can reflect the diverse nature of the LIS domain and offer a robust foundation for LLM evaluation. Our comprehensive evaluation reveals that while most mainstream LLMs achieve an average accuracy rate above 50% on ArcMMLU, there remains a notable performance gap, suggesting substantial headroom for refinement in LLM capabilities within the LIS domain. Further analysis explores the effectiveness of few-shot examples on model performance and highlights challenging questions where models consistently underperform, providing valuable insights for targeted improvements. ArcMMLU fills a critical gap in LLM evaluations within the Chinese LIS domain and paves the way for future development of LLMs tailored to this specialized area.

Introducing Rhetorical Parallelism Detection: A New Task with Datasets, Metrics, and Baselines

  • paper_url: http://arxiv.org/abs/2312.00100
  • repo_url: https://github.com/mythologos/augustinian-sermon-parallelisms
  • paper_authors: Stephen Bothwell, Justin DeBenedetto, Theresa Crnkovich, Hildegund Müller, David Chiang
  • for: Studying a stylistic device people use in everyday language: rhetorical parallelism.
  • methods: The authors introduce the new task of rhetorical parallelism detection, give it a formal definition, and provide one new Latin dataset and one adapted Chinese dataset for it.
  • results: A family of metrics is established to evaluate system performance, along with baseline systems and novel sequence labeling schemes for capturing parallelism. On the strictest metric, the systems attain F1 scores of 0.40 on the Latin dataset and 0.43 on the adapted Chinese dataset.
    Abstract Rhetoric, both spoken and written, involves not only content but also style. One common stylistic tool is $\textit{parallelism}$: the juxtaposition of phrases which have the same sequence of linguistic ($\textit{e.g.}$, phonological, syntactic, semantic) features. Despite the ubiquity of parallelism, the field of natural language processing has seldom investigated it, missing a chance to better understand the nature of the structure, meaning, and intent that humans convey. To address this, we introduce the task of $\textit{rhetorical parallelism detection}$. We construct a formal definition of it; we provide one new Latin dataset and one adapted Chinese dataset for it; we establish a family of metrics to evaluate performance on it; and, lastly, we create baseline systems and novel sequence labeling schemes to capture it. On our strictest metric, we attain $F_{1}$ scores of $0.40$ and $0.43$ on our Latin and Chinese datasets, respectively.

ArthModel: Enhance Arithmetic Skills to Large Language Model

  • paper_url: http://arxiv.org/abs/2311.18609
  • repo_url: None
  • paper_authors: Yingdi Guo
  • for: Enhancing the arithmetic ability of language models, and exploring new ways of thinking about, training, and using a language model.
  • methods: The LLM is trained to generate a postfix expression for the arithmetic problem, which is combined with a small pretrained model. The small model transfers token embeddings into real dense numbers and invokes native functions of a deep learning platform to obtain the correct answer; prompt injection adds the small model's result outputs back to the LLM.
  • results: The approach improves the arithmetic ability of the language model across different kinds of arithmetic problems. The code and models will be released at \url{https://github.com/eteced/arithmetic_finetuning_v1}.
    Abstract With the great success of ChatGPT, the research of large language models has become increasingly popular. However, the models have several limitations, such as toxicity and poor performance in arithmetic solving. Meanwhile, LLM may have some potential abilities that have yet to be exploited. In this paper, we choose a different way to enhance the arithmetic ability of LLM. We propose to train LLM to generate a postfix expression related to the arithmetic problem and incorporate it with small pretrained models. Moreover, this small model transfers the token embeddings into real dense numbers and invokes native functions of a deep learning platform to get the correct answer. To generate the final result, we propose prompt injection for adding the result outputs by the small model to LLM. This work provides different ways of thinking, training and using a language model. The codes and models will be released at \url{https://github.com/eteced/arithmetic_finetuning_v1}.
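
In the paper the LLM emits a postfix expression and a small model executes it through platform-native functions; the sketch below shows just the execution step, as a plain stack-based evaluator.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_postfix(tokens):
    """Evaluate a postfix (reverse Polish) expression with a stack."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            b, a = stack.pop(), stack.pop()   # note the operand order
            stack.append(OPS[tok](a, b))
        else:
            stack.append(float(tok))
    return stack.pop()

# e.g., an LLM might emit "3 4 + 2 *" for "(3 + 4) * 2"
print(eval_postfix("3 4 + 2 *".split()))  # 14.0
```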

FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity

  • paper_url: http://arxiv.org/abs/2311.18580
  • repo_url: https://github.com/cuishiyao96/fft
  • paper_authors: Shiyao Cui, Zhenyu Zhang, Yilong Chen, Wenyuan Zhang, Tianyun Liu, Siqi Wang, Tingwen Liu
  • for: Evaluating the harmlessness of large language models (LLMs) with respect to factuality, fairness, and toxicity.
  • methods: FFT, a new benchmark with 2,116 elaborately designed instances for LLM harmlessness evaluation.
  • results: An evaluation of 9 representative LLMs shows that their harmlessness is still unsatisfactory; extensive analysis yields insightful findings that could inspire future research on harmless LLMs.
    Abstract The widespread of generative artificial intelligence has heightened concerns about the potential harms posed by AI-generated texts, primarily stemming from factoid, unfair, and toxic content. Previous researchers have invested much effort in assessing the harmlessness of generative language models. However, existing benchmarks are struggling in the era of large language models (LLMs), due to the stronger language generation and instruction following capabilities, as well as wider applications. In this paper, we propose FFT, a new benchmark with 2116 elaborated-designed instances, for LLM harmlessness evaluation with factuality, fairness, and toxicity. To investigate the potential harms of LLMs, we evaluate 9 representative LLMs covering various parameter scales, training stages, and creators. Experiments show that the harmlessness of LLMs is still under-satisfactory, and extensive analysis derives some insightful findings that could inspire future research for harmless LLM research.

Grammatical Gender’s Influence on Distributional Semantics: A Causal Perspective

  • paper_url: http://arxiv.org/abs/2311.18567
  • repo_url: None
  • paper_authors: Karolina Stańczak, Kevin Du, Adina Williams, Isabelle Augenstein, Ryan Cotterell
  • for: Examining how much meaning influences grammatical gender assignment across languages, an active question in modern linguistics and cognitive science.
  • methods: A novel causal graphical model that jointly represents the interactions between a noun's grammatical gender, its meaning, and adjective choice.
  • results: In line with past results, there is a relationship between the gender of nouns and the adjectives that modify them; however, when the meaning of the noun is controlled for, grammatical gender has a near-zero effect on adjective choice, calling the neo-Whorfian hypothesis into question.
    Abstract How much meaning influences gender assignment across languages is an active area of research in modern linguistics and cognitive science. We can view current approaches as aiming to determine where gender assignment falls on a spectrum, from being fully arbitrarily determined to being largely semantically determined. For the latter case, there is a formulation of the neo-Whorfian hypothesis, which claims that even inanimate noun gender influences how people conceive of and talk about objects (using the choice of adjective used to modify inanimate nouns as a proxy for meaning). We offer a novel, causal graphical model that jointly represents the interactions between a noun's grammatical gender, its meaning, and adjective choice. In accordance with past results, we find a relationship between the gender of nouns and the adjectives which modify them. However, when we control for the meaning of the noun, we find that grammatical gender has a near-zero effect on adjective choice, thereby calling the neo-Whorfian hypothesis into question.

Use of explicit replies as coordination mechanisms in online student debate

  • paper_url: http://arxiv.org/abs/2311.18466
  • repo_url: None
  • paper_authors: Bruno D. Ferreira-Saraiva, Joao P. Matos-Carvalho, Manuel Pita
  • for: Studying how people in conversation entrain their linguistic behaviour through spontaneous alignment mechanisms, both in face-to-face and computer-mediated communication.
  • methods: A probabilistic framework and computational methods for studying the coordination mechanisms realized in explicit replies, using a non-parametric, hierarchical topic model to find community structure in the conversation's vocabulary.
  • results: Conversations fall into distinct patterns: some stay at the level of general introductory chatter, some develop a specific sub-topic in significant depth and detail, and others jump between general chatter, out-of-topic remarks, and people agreeing or disagreeing without further elaboration.
    Abstract People in conversation entrain their linguistic behaviours through spontaneous alignment mechanisms [7] - both in face-to-face and computer-mediated communication (CMC) [8]. In CMC, one of the mechanisms through which linguistic entrainment happens is through explicit replies. Indeed, the use of explicit replies influences the structure of conversations, favouring the formation of reply-trees typically delineated by topic shifts [5]. The interpersonal coordination mechanisms realized by how actors address each other have been studied using a probabilistic framework proposed by David Gibson [2,3]. Other recent approaches use computational methods and information theory to quantify changes in text. We explore coordination mechanisms concerned with some of the roles utterances play in dialogues - specifically in explicit replies. We identify these roles by finding community structure in the conversation's vocabulary using a non-parametric, hierarchical topic model. Some conversations may always stay on the ground, remaining at the level of general introductory chatter. Some others may develop a specific sub-topic in significant depth and detail. Even others may jump between general chatter, out-of-topic remarks and people agreeing or disagreeing without further elaboration.

IAG: Induction-Augmented Generation Framework for Answering Reasoning Questions

  • paper_url: http://arxiv.org/abs/2311.18397
  • repo_url: None
  • paper_authors: Zhebin Zhang, Xinyu Zhang, Yuanhang Ren, Saijiang Shi, Meng Han, Yongkang Wu, Ruofei Lai, Zhao Cao
  • for: Improving performance on open-domain question answering tasks that require implicit reasoning, by combining inductive knowledge with the parametric memory of language models.
  • methods: An Induction-Augmented Generation (IAG) framework that uses inductive knowledge, derived from large language models via a novel prompting method based on inductive reasoning patterns, alongside retrieved documents. Two versions are implemented: IAG-GPT, which directly uses knowledge generated by GPT-3 for answer prediction, and IAG-Student, which removes the dependency on the GPT service at inference time via a distilled student inductor model.
  • results: IAG outperforms RAG baselines as well as ChatGPT on two open-domain QA tasks; the best models took first place on the official leaderboards of CSQA2.0 (since Nov 1, 2022) and StrategyQA (since Jan 8, 2023).
    Abstract Retrieval-Augmented Generation (RAG), by incorporating external knowledge with parametric memory of language models, has become the state-of-the-art architecture for open-domain QA tasks. However, common knowledge bases are inherently constrained by limited coverage and noisy information, making retrieval-based approaches inadequate to answer implicit reasoning questions. In this paper, we propose an Induction-Augmented Generation (IAG) framework that utilizes inductive knowledge along with the retrieved documents for implicit reasoning. We leverage large language models (LLMs) for deriving such knowledge via a novel prompting method based on inductive reasoning patterns. On top of this, we implement two versions of IAG named IAG-GPT and IAG-Student, respectively. IAG-GPT directly utilizes the knowledge generated by GPT-3 for answer prediction, while IAG-Student gets rid of dependencies on GPT service at inference time by incorporating a student inductor model. The inductor is firstly trained via knowledge distillation and further optimized by back-propagating the generator feedback via differentiable beam scores. Experimental results show that IAG outperforms RAG baselines as well as ChatGPT on two Open-Domain QA tasks. Notably, our best models have won the first place in the official leaderboards of CSQA2.0 (since Nov 1, 2022) and StrategyQA (since Jan 8, 2023).
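
The paper's prompting method elicits inductive knowledge from an LLM before answer generation; the template below is a hypothetical illustration of that pattern, not IAG's actual prompt.

```python
# Illustrative inductive-reasoning prompt in the spirit of IAG.
INDUCTION_PROMPT = """\
Question: {question}
Write a short piece of general knowledge that helps answer the question,
reasoning by analogy from specific cases to a general rule:
1. List two or three concrete instances relevant to the question.
2. State the general pattern they share.
Knowledge:"""

def build_induction_prompt(question: str) -> str:
    return INDUCTION_PROMPT.format(question=question)

print(build_induction_prompt("Can a sunflower survive in a dark cave?"))
# The generated knowledge statement is then fed, together with the
# retrieved documents, to the answer generator.
```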

Hubness Reduction Improves Sentence-BERT Semantic Spaces

  • paper_url: http://arxiv.org/abs/2311.18364
  • repo_url: https://github.com/bemigini/hubness-reduction-improves-sbert-semantic-spaces
  • paper_authors: Beatrix M. G. Nielsen, Lars Kai Hansen
  • for: Investigating the structure of the semantic spaces that arise from Sentence-BERT embeddings, which are used in areas such as information retrieval and document grouping.
  • methods: The high-dimensional trained dense vectors are analyzed for hubness, a well-known problem in high dimensions, and hubness reduction methods are applied; semantic quality is quantified with hubness scores and the error rate of a neighbourhood-based classifier.
  • results: The representations suffer from hubness: some texts (hubs) are neighbours of many other texts while most texts (anti-hubs) are neighbours of few or none, yielding asymmetric neighborhood relations. When hubness is high, reduction methods lower both error rate and hubness; a combination of two methods works best, reducing hubness by about 75% and error rate by about 9% on one tested pretrained model.
    Abstract Semantic representations of text, i.e. representations of natural language which capture meaning by geometry, are essential for areas such as information retrieval and document grouping. High-dimensional trained dense vectors have received much attention in recent years as such representations. We investigate the structure of semantic spaces that arise from embeddings made with Sentence-BERT and find that the representations suffer from a well-known problem in high dimensions called hubness. Hubness results in asymmetric neighborhood relations, such that some texts (the hubs) are neighbours of many other texts while most texts (so-called anti-hubs), are neighbours of few or no other texts. We quantify the semantic quality of the embeddings using hubness scores and error rate of a neighbourhood based classifier. We find that when hubness is high, we can reduce error rate and hubness using hubness reduction methods. We identify a combination of two methods as resulting in the best reduction. For example, on one of the tested pretrained models, this combined method can reduce hubness by about 75% and error rate by about 9%. Thus, we argue that mitigating hubness in the embedding space provides better semantic representations of text.
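
Hubness is commonly quantified as the skewness of the k-occurrence distribution (how often each point appears in other points' k-nearest-neighbour lists). A minimal sketch using cosine similarity; the paper's exact measure and the reduction methods it combines may differ.

```python
import numpy as np
from scipy.stats import skew

def hubness_score(X, k=10):
    """Skewness of the k-occurrence distribution over the dataset X.
    High positive skew indicates the presence of hubs."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    np.fill_diagonal(sims, -np.inf)            # exclude self-neighbourhood
    knn = np.argsort(-sims, axis=1)[:, :k]     # indices of k nearest neighbours
    k_occurrence = np.bincount(knn.ravel(), minlength=len(X))
    return skew(k_occurrence)

X = np.random.randn(500, 384)                  # stand-in for sentence embeddings
print(hubness_score(X))
```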

Evaluating the Rationale Understanding of Critical Reasoning in Logical Reading Comprehension

  • paper_url: http://arxiv.org/abs/2311.18353
  • repo_url: None
  • paper_authors: Akira Kawabata, Saku Sugawara
  • for: Precisely evaluating a language model's capability for logical reading comprehension.
  • methods: For questions from an existing multiple-choice logical reading comprehension dataset, crowdsourced rationale texts explain why each answer option should be selected or eliminated, yielding 3,003 multiple-choice subquestions associated with 943 main questions.
  • results: Recent large language models (e.g., InstructGPT) struggle to answer the subquestions even when they answer the main questions correctly, performing particularly poorly on subquestions written for the incorrect options, which implies a limited capability for explaining why incorrect alternatives should be eliminated.
    Abstract To precisely evaluate a language model's capability for logical reading comprehension, we present a dataset for testing the understanding of the rationale behind critical reasoning. For questions taken from an existing multiplechoice logical reading comprehension dataset, we crowdsource rationale texts that explain why we should select or eliminate answer options, resulting in 3,003 multiple-choice subquestions that are associated with 943 main questions. Experiments on our dataset show that recent large language models (e.g., InstructGPT) struggle to answer the subquestions even if they are able to answer the main questions correctly. We find that the models perform particularly poorly in answering subquestions written for the incorrect options of the main questions, implying that the models have a limited capability for explaining why incorrect alternatives should be eliminated. These results suggest that our dataset encourages further investigation into the critical reasoning ability of language models while focusing on the elimination process of relevant alternatives.

Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report generation

  • paper_url: http://arxiv.org/abs/2311.18260
  • repo_url: None
  • paper_authors: Ryutaro Tanno, David G. T. Barrett, Andrew Sellergren, Sumedh Ghaisas, Sumanth Dathathri, Abigail See, Johannes Welbl, Karan Singhal, Shekoofeh Azizi, Tao Tu, Mike Schaekermann, Rhys May, Roy Lee, SiWai Man, Zahra Ahmed, Sara Mahdavi, Danielle Belgrave, Vivek Natarajan, Shravya Shetty, Pushmeet Kohli, Po-Sen Huang, Alan Karthikesalingam, Ira Ktena
  • for: Building a report generation system for chest radiographs, based on a vision-language foundation model, to speed up reporting and improve report quality in modern medicine.
  • methods: A well-known vision-language foundation model is fine-tuned on radiology data to build Flamingo-CXR, a state-of-the-art report generation system; 16 certified radiologists provide detailed evaluations of AI-generated and human-written reports for chest X-rays from a US intensive care setting and an Indian inpatient setting.
  • results: At least one radiologist (out of two per case) preferred the AI report to the ground-truth report in over 60% of cases on both datasets. Errors in AI-generated reports most often concerned the location of a finding, whereas errors in human-written reports most often concerned severity; this complementarity motivated an assistive scenario in which Flamingo-CXR generates a first-draft report that a clinician then revises. The resulting clinician-AI reports were assessed as equivalent or preferable to reports written by experts alone in 80% of inpatient cases and 60% of intensive care cases.
    Abstract Radiology reports are an instrumental part of modern medicine, informing key clinical decisions such as diagnosis and treatment. The worldwide shortage of radiologists, however, restricts access to expert care and imposes heavy workloads, contributing to avoidable errors and delays in report delivery. While recent progress in automated report generation with vision-language models offer clear potential in ameliorating the situation, the path to real-world adoption has been stymied by the challenge of evaluating the clinical quality of AI-generated reports. In this study, we build a state-of-the-art report generation system for chest radiographs, \textit{Flamingo-CXR}, by fine-tuning a well-known vision-language foundation model on radiology data. To evaluate the quality of the AI-generated reports, a group of 16 certified radiologists provide detailed evaluations of AI-generated and human written reports for chest X-rays from an intensive care setting in the United States and an inpatient setting in India. At least one radiologist (out of two per case) preferred the AI report to the ground truth report in over 60$\%$ of cases for both datasets. Amongst the subset of AI-generated reports that contain errors, the most frequently cited reasons were related to the location and finding, whereas for human written reports, most mistakes were related to severity and finding. This disparity suggested potential complementarity between our AI system and human experts, prompting us to develop an assistive scenario in which \textit{Flamingo-CXR} generates a first-draft report, which is subsequently revised by a clinician. This is the first demonstration of clinician-AI collaboration for report writing, and the resultant reports are assessed to be equivalent or preferred by at least one radiologist to reports written by experts alone in 80$\%$ of in-patient cases and 60$\%$ of intensive care cases.

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

  • paper_url: http://arxiv.org/abs/2311.18248
  • repo_url: https://github.com/x-plug/mplug-docowl
  • paper_authors: Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, Fei Huang
  • for: This paper aims to strengthen the multi-modal scientific-diagram analysis ability of Multimodal LLMs, broadening their application scenarios, especially for academic paper writing.
  • methods: By parsing the LaTeX source files of high-quality academic papers, the authors carefully build a multi-modal diagram understanding dataset, M-Paper, and construct professional diagram-analysis samples for training and evaluation by aligning diagrams with their related paragraphs.
  • results: Experiments show that a state-of-the-art Multimodal LLM trained on this dataset understands scientific diagrams, including figures and tables, more effectively, and produces better diagram captions, diagram analyses, and outline recommendations aligned with the user's intent.
    Abstract Recently, the strong text creation ability of Large Language Models (LLMs) has given rise to many tools for assisting paper reading or even writing. However, the weak diagram analysis abilities of LLMs or Multimodal LLMs greatly limit their application scenarios, especially for scientific academic paper writing. In this work, towards a more versatile copilot for academic paper writing, we mainly focus on strengthening the multi-modal diagram analysis ability of Multimodal LLMs. By parsing LaTeX source files of high-quality papers, we carefully build a multi-modal diagram understanding dataset M-Paper. By aligning diagrams in the paper with related paragraphs, we construct professional diagram analysis samples for training and evaluation. M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables in the format of images or LaTeX codes. Besides, to better align the copilot with the user's intention, we introduce the 'outline' as the control signal, which could be directly given by the user or revised based on auto-generated ones. Comprehensive experiments with a state-of-the-art Multimodal LLM demonstrate that training on our dataset yields stronger scientific diagram understanding performance, including diagram captioning, diagram analysis, and outline recommendation. The dataset, code, and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/PaperOwl.
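The M-Paper construction pairs each diagram with the paragraphs that discuss it, extracted from LaTeX source. A rough sketch of that alignment idea is below, assuming simplified regexes over figure/table environments and \ref commands; the actual pipeline is more elaborate.

```python
# Hypothetical sketch of diagram-paragraph alignment from LaTeX source,
# in the spirit of the M-Paper construction: pair each figure/table label
# with the paragraphs that \ref it. Regexes are simplified assumptions.
import re

def align_diagrams(latex: str) -> dict:
    labels = re.findall(r"\\begin\{(?:figure|table)\*?\}.*?\\label\{(.+?)\}",
                        latex, flags=re.S)
    paragraphs = [p.strip() for p in latex.split("\n\n") if p.strip()]
    return {lab: [p for p in paragraphs if f"\\ref{{{lab}}}" in p]
            for lab in labels}

doc = r"""We plot accuracy in Figure \ref{fig:acc}.

\begin{figure}\includegraphics{acc.png}\caption{Accuracy.}\label{fig:acc}\end{figure}

As shown in \ref{fig:acc}, accuracy improves with scale."""
print(align_diagrams(doc))
```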

Automatic Construction of a Korean Toxic Instruction Dataset for Ethical Tuning of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.18215
  • repo_url: None
  • paper_authors: Sungjoo Byun, Dongjun Jang, Hyemi Jo, Hyopil Shin
  • for: Developing a training approach for Large Language Models (LLMs) that mitigates the generation of unethical language and properly handles toxic user queries.
  • methods: An automatically constructed Korean toxic instruction dataset, KoTox, comprising 39K unethical instruction-output pairs, is used to refine LLM training, sidestepping the cost of human labor and the scarcity of labeled data.
  • results: The KoTox collection improves LLMs' ethical awareness and their responses to various toxic inputs, promoting more secure and responsible interactions in Natural Language Processing (NLP) applications.
    Abstract Caution: this paper may include material that could be offensive or distressing. The advent of Large Language Models (LLMs) necessitates the development of training approaches that mitigate the generation of unethical language and aptly manage toxic user queries. Given the challenges related to human labor and the scarcity of data, we present KoTox, comprising 39K unethical instruction-output pairs. This collection of automatically generated toxic instructions refines the training of LLMs and establishes a foundational framework for improving LLMs' ethical awareness and response to various toxic inputs, promoting more secure and responsible interactions in Natural Language Processing (NLP) applications.
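Datasets like KoTox pair a toxic instruction with a desirable output for supervised fine-tuning. A minimal sketch of how such pairs might be serialized into a prompt/completion format follows; the field names and prompt template are assumptions, not the released schema.

```python
# Hypothetical sketch of how instruction-output pairs like KoTox's could be
# serialized for supervised fine-tuning. Field names and the prompt template
# are assumptions, not the released schema.
import json

pairs = [
    {"instruction": "<toxic user query>",
     "output": "I can't help with that request."},
]

def to_sft_example(pair: dict) -> dict:
    # A generic prompt/completion layout used by many SFT pipelines.
    return {"prompt": f"### Instruction:\n{pair['instruction']}\n\n### Response:\n",
            "completion": pair["output"]}

with open("kotox_sft.jsonl", "w", encoding="utf-8") as f:
    for p in pairs:
        f.write(json.dumps(to_sft_example(p), ensure_ascii=False) + "\n")
```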

INarIG: Iterative Non-autoregressive Instruct Generation Model For Word-Level Auto Completion

  • paper_url: http://arxiv.org/abs/2311.18200
  • repo_url: None
  • paper_authors: Hengchao Shang, Zongyao Li, Daimeng Wei, Jiaxin Guo, Minghan Wang, Xiaoyu Chen, Lizhi Lei, Hao Yang
  • for: Improving human translation efficiency, which remains important in scenarios where machine translation cannot meet quality requirements.
  • methods: The INarIG model constructs the human-typed character sequence into an Instruction Unit and uses iterative non-autoregressive decoding with subwords to fully exploit the information given in the input.
  • results: Achieves state-of-the-art results on the WMT22 and benchmark datasets, with a maximum increase of over 10% in prediction accuracy over previous models, and with particular strength on low-frequency words.
    Abstract Computer-aided translation (CAT) aims to enhance human translation efficiency and is still important in scenarios where machine translation cannot meet quality requirements. One fundamental task within this field is Word-Level Auto Completion (WLAC). WLAC predicts a target word given a source sentence, translation context, and a human-typed character sequence. Previous works either employ word classification models to exploit contextual information from both sides of the target word or directly disregard the dependencies from the right-side context. Furthermore, the key information, i.e., the human-typed sequence, is only used as a prefix constraint in the decoding module. In this paper, we propose the INarIG (Iterative Non-autoregressive Instruct Generation) model, which constructs the human-typed sequence into an Instruction Unit and employs iterative decoding with subwords to fully utilize input information given in the task. Our model is more competent in dealing with low-frequency words (the core scenario of this task), and achieves state-of-the-art results on the WMT22 and benchmark datasets, with a maximum increase of over 10% prediction accuracy.
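To make the iterative decoding idea concrete, here is a schematic mask-predict-style loop in which the human-typed sequence is held fixed while low-confidence positions are re-masked and re-predicted. The model is a random stub; this illustrates only the decoding pattern, not the INarIG architecture.

```python
# Schematic of iterative non-autoregressive decoding in the mask-predict
# style, with the human-typed tokens kept as a hard constraint.
# `model` is a stub; this is not the INarIG architecture itself.
import random

MASK = "<mask>"

def model(tokens):
    # Stub: returns (prediction, confidence) per position.
    vocab = ["trans", "lation", "qual", "ity"]
    return [(t, 1.0) if t != MASK else (random.choice(vocab), random.random())
            for t in tokens]

def iterative_decode(length, typed_prefix, iterations=3):
    tokens = list(typed_prefix) + [MASK] * (length - len(typed_prefix))
    frozen = len(typed_prefix)          # never re-mask the typed constraint
    for it in range(iterations):
        preds = model(tokens)
        tokens = [p for p, _ in preds]
        # Re-mask the least confident non-frozen positions for the next pass.
        n_remask = max(0, (length - frozen) * (iterations - it - 1) // iterations)
        order = sorted(range(frozen, length), key=lambda i: preds[i][1])
        for i in order[:n_remask]:
            tokens[i] = MASK
    return tokens

print(iterative_decode(length=6, typed_prefix=["qual"]))
```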

COVID-19 Vaccine Misinformation in Middle Income Countries

  • paper_url: http://arxiv.org/abs/2311.18195
  • repo_url: https://github.com/zzoliman/covid-vaccine-misinfo-mic
  • paper_authors: Jongin Kim, Byeo Rhee Back, Aditya Agrawal, Jiaxi Wu, Veronika J. Wirtz, Traci Hong, Derry Wijaya
  • for: To introduce a multilingual dataset of COVID-19 vaccine misinformation, and to develop and evaluate models for detecting such misinformation in three middle-income countries (Brazil, Indonesia, and Nigeria).
  • methods: Two approaches to developing COVID-19 vaccine misinformation detection models: domain-specific pre-training and text augmentation using a large language model.
  • results: The best misinformation detection models improve macro F1-score by 2.7 to 15.9 percentage points over the baseline models, and there are significant positive associations between the misinformation rates across the three countries. The models are also applied in a large-scale study of 19 million unlabeled tweets from the three countries between 2020 and 2022, showcasing the practical application of the dataset and models for detecting and analyzing vaccine misinformation in multiple countries and languages.
    Abstract This paper introduces a multilingual dataset of COVID-19 vaccine misinformation, consisting of annotated tweets from three middle-income countries: Brazil, Indonesia, and Nigeria. The expertly curated dataset includes annotations for 5,952 tweets, assessing their relevance to COVID-19 vaccines, presence of misinformation, and the themes of the misinformation. To address challenges posed by domain specificity, the low-resource setting, and data imbalance, we adopt two approaches for developing COVID-19 vaccine misinformation detection models: domain-specific pre-training and text augmentation using a large language model. Our best misinformation detection models demonstrate improvements ranging from 2.7 to 15.9 percentage points in macro F1-score compared to the baseline models. Additionally, we apply our misinformation detection models in a large-scale study of 19 million unlabeled tweets from the three countries between 2020 and 2022, showcasing the practical application of our dataset and models for detecting and analyzing vaccine misinformation in multiple countries and languages. Our analysis indicates that percentage changes in the number of new COVID-19 cases are positively associated with COVID-19 vaccine misinformation rates in a staggered manner for Brazil and Indonesia, and there are significant positive associations between the misinformation rates across the three countries.
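The gains above are reported in macro F1-score, which averages per-class F1 without weighting by class frequency, so the minority (misinformation) class counts as much as the majority class. A minimal reference implementation on toy labels:

```python
# Minimal sketch of the macro F1-score used to report the detection gains
# above: F1 is computed per class and averaged without class weighting.
def macro_f1(y_true, y_pred):
    f1s = []
    for c in set(y_true) | set(y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy labels: 0 = not misinformation, 1 = misinformation.
print(macro_f1([1, 0, 1, 0, 1], [1, 0, 0, 0, 1]))
```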

Positional Information Matters for Invariant In-Context Learning: A Case Study of Simple Function Classes

  • paper_url: http://arxiv.org/abs/2311.18194
  • repo_url: None
  • paper_authors: Yongqiang Chen, Binghui Xie, Kaiwen Zhou, Bo Han, Yatao Bian, James Cheng
  • for: This paper studies the in-context learning (ICL) ability of large language models (LLMs): conditioning on a few in-context demonstrations (input-output examples of the underlying task) to answer a new query input without updating any parameters.
  • methods: Using ICL linear regression as a testbed, the paper compares transformers with DeepSet across several Out-of-Distribution (OOD) cases to investigate the limitations of, and principles behind, successful ICL.
  • results: DeepSet outperforms transformers across a variety of distribution shifts, while the positional encodings in LLMs break the permutation invariance of ICL with respect to the input demonstrations; preserving this symmetry, which the paper terms ICL invariance, is identified as a fundamental requirement. By preserving ICL invariance, transformers achieve state-of-the-art performance across various ICL distribution shifts.
    Abstract In-context learning (ICL) refers to the ability of a model to condition on a few in-context demonstrations (input-output examples of the underlying task) to generate the answer for a new query input, without updating parameters. Despite the impressive ICL ability of LLMs, it has also been found that ICL in LLMs is sensitive to input demonstrations and limited to short context lengths. To understand the limitations and principles for successful ICL, we conduct an investigation with ICL linear regression of transformers. We characterize several Out-of-Distribution (OOD) cases for ICL inspired by realistic LLM ICL failures and compare transformers with DeepSet, a simple yet powerful architecture for ICL. Surprisingly, DeepSet outperforms transformers across a variety of distribution shifts, implying that preserving permutation invariance symmetry to input demonstrations is crucial for OOD ICL. The phenomenon specifies a fundamental requirement of ICL, which we term ICL invariance. Nevertheless, the positional encodings in LLMs will break ICL invariance. To this end, we further evaluate transformers with identical positional encodings and find that preserving ICL invariance in transformers achieves state-of-the-art performance across various ICL distribution shifts.
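The permutation invariance at stake can be made concrete with a toy DeepSet-style predictor for ICL linear regression: because demonstrations are mean-pooled, any reordering yields the same prediction. The sketch below uses arbitrary random weights and is illustrative only, not the paper's setup.

```python
# Sketch of the permutation-invariance property at issue: a DeepSet-style
# predictor that mean-pools over (x, y) demonstration embeddings gives the
# same answer under any ordering of the demonstrations. Toy ICL linear
# regression setup with arbitrary weights; not the paper's training code.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 2)), rng.normal(size=(1, 9))

def deepset_predict(demos, x_query):
    # phi over each (x, y) pair, mean-pool, then rho over [pooled; x_query].
    phi = np.tanh(np.stack([W1 @ np.array([x, y]) for x, y in demos]))
    pooled = phi.mean(axis=0)                  # order-independent step
    return (W2 @ np.concatenate([pooled, [x_query]])).item()

demos = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
print(deepset_predict(demos, 4.0))
print(deepset_predict(demos[::-1], 4.0))       # identical: permutation-invariant
```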