2023-11-06

cs.CL

STONYBOOK: A System and Resource for Large-Scale Analysis of Novels

paper_url: http://arxiv.org/abs/2311.03614
repo_url: None
paper_authors: Charuta Pethe, Allen Kim, Rajesh Prabhakar, Tanzir Pial, Steven Skiena
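for: This paper provides resources for the large-scale analysis of novels, including an open-source end-to-end NLP analysis pipeline and a collection of 49,207 cleaned and annotated novels.
methods: The authors develop a standard XML format for annotating novels and build a database with an associated web interface for large-scale aggregate analysis.
results: The website offers analysis artifacts including visualizations of character occurrences and interactions, similar books, representative vocabulary, part-of-speech statistics, and readability metrics. The annotated format supports qualitative and quantitative analysis across large corpora of novels.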

Abstract
Books have historically been the primary mechanism through which narratives are transmitted. We have developed a collection of resources for the large-scale analysis of novels, including: (1) an open source end-to-end NLP analysis pipeline for the annotation of novels into a standard XML format, (2) a collection of 49,207 distinct cleaned and annotated novels, and (3) a database with an associated web interface for the large-scale aggregate analysis of these literary works. We describe the major functionalities provided in the annotation system along with their utilities. We present samples of analysis artifacts from our website, such as visualizations of character occurrences and interactions, similar books, representative vocabulary, part of speech statistics, and readability metrics. We also describe the use of the annotated format in qualitative and quantitative analysis across large corpora of novels.

Dimensions of Online Conflict: Towards Modeling Agonism

paper_url: http://arxiv.org/abs/2311.03584
repo_url: None
paper_authors: Matt Canute, Mali Jin, hannah holtzclaw, Alberto Lusoli, Philippa R Adams, Mugdha Pandya, Maite Taboada, Diana Maynard, Wendy Hui Kyong Chun
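for: This paper studies conflict in social media conversations, distinguishing agonism, which supports democratic dialogue, from hateful antagonism, which undermines it.
methods: The authors collect Twitter conversations on trending controversial topics and introduce a comprehensive annotation schema for labelling different dimensions of conflict (source, target, and rhetorical strategies). They annotate approximately 4,000 conversations, then train logistic regression and transformer-based models that incorporate conversational context such as the number of participants and the structure of the interactions.
results: Contextual labels help identify conflict and make the models robust to variations in topic. The results can contribute to content moderation.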

Abstract
Agonism plays a vital role in democratic dialogue by fostering diverse perspectives and robust discussions. Within the realm of online conflict there is another type: hateful antagonism, which undermines constructive dialogue. Detecting conflict online is central to platform moderation and monetization. It is also vital for democratic dialogue, but only when it takes the form of agonism. To model these two types of conflict, we collected Twitter conversations related to trending controversial topics. We introduce a comprehensive annotation schema for labelling different dimensions of conflict in the conversations, such as the source of conflict, the target, and the rhetorical strategies deployed. Using this schema, we annotated approximately 4,000 conversations with multiple labels. We then trained both logistic regression and transformer-based models on the dataset, incorporating context from the conversation, including the number of participants and the structure of the interactions. Results show that contextual labels are helpful in identifying conflict and make the models robust to variations in topic. Our research contributes a conceptualization of different dimensions of conflict, a richly annotated dataset, and promising results that can contribute to content moderation.

Measuring Adversarial Datasets

paper_url: http://arxiv.org/abs/2311.03566
repo_url: https://github.com/kritwik1/Detection-of-Anomalies-in-Images-using-Adversarial-learning
paper_authors: Yuanchen Bai, Raoyi Huang, Vijay Viswanathan, Tzu-Sheng Kuo, Tongshuang Wu
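for: This study examines whether existing quantifiable metrics can characterize text instances in NLP tasks along the dimensions of difficulty, diversity, and disagreement, and how adversarial examples differ from the original data.
methods: The authors systematically survey existing metrics, select several current adversarial datasets, and compare metric distributions between the original datasets and their adversarial counterparts.
results: The comparison offers insights into what makes these adversarial datasets more challenging from a metrics perspective and whether they align with their underlying assumptions.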

Abstract
In the era of widespread public use of AI systems across various domains, ensuring adversarial robustness has become increasingly vital to maintain safety and prevent undesirable errors. Researchers have curated various adversarial datasets (through perturbations) for capturing model deficiencies that cannot be revealed in standard benchmark datasets. However, little is known about how these adversarial examples differ from the original data points, and there is still no methodology to measure the intended and unintended consequences of those adversarial transformations. In this research, we conducted a systematic survey of existing quantifiable metrics that describe text instances in NLP tasks, among dimensions of difficulty, diversity, and disagreement. We selected several current adversarial effect datasets and compared the distributions between the original and their adversarial counterparts. The results provide valuable insights into what makes these datasets more challenging from a metrics perspective and whether they align with underlying assumptions.

Quantifying Uncertainty in Natural Language Explanations of Large Language Models

paper_url: http://arxiv.org/abs/2311.03533
repo_url: None
paper_authors: Sree Harsha Tanneru, Chirag Agarwal, Himabindu Lakkaraju
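for: This study aims to quantify the uncertainty in natural language explanations produced by LLMs.
methods: The authors propose two novel metrics, Verbalized Uncertainty and Probing Uncertainty, to quantify the uncertainty of generated explanations.
results: Experiments show that verbalized uncertainty is not a reliable estimate of explanation confidence, while probing uncertainty correlates with explanation faithfulness: lower uncertainty corresponds to higher faithfulness.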

Abstract
Large Language Models (LLMs) are increasingly used as powerful tools for several high-stakes natural language processing (NLP) applications. Recent prompting works claim to elicit intermediate reasoning steps and key tokens that serve as proxy explanations for LLM predictions. However, there is no certainty whether these explanations are reliable and reflect the LLM's behavior. In this work, we make one of the first attempts at quantifying the uncertainty in explanations of LLMs. To this end, we propose two novel metrics -- $\textit{Verbalized Uncertainty}$ and $\textit{Probing Uncertainty}$ -- to quantify the uncertainty of generated explanations. While verbalized uncertainty involves prompting the LLM to express its confidence in its explanations, probing uncertainty leverages sample and model perturbations as a means to quantify the uncertainty. Our empirical analysis of benchmark datasets reveals that verbalized uncertainty is not a reliable estimate of explanation confidence. Further, we show that the probing uncertainty estimates are correlated with the faithfulness of an explanation, with lower uncertainty corresponding to explanations with higher faithfulness. Our study provides insights into the challenges and opportunities of quantifying uncertainty in LLM explanations, contributing to the broader discussion of the trustworthiness of foundation models.

Spoken Dialogue System for Medical Prescription Acquisition on Smartphone: Development, Corpus and Evaluation

paper_url: http://arxiv.org/abs/2311.03510
repo_url: None
paper_authors: Ali Can Kocabiyikoglu, François Portet, Jean-Marc Babouchkine, Prudence Gibert, Hervé Blanchon, Gaëtan Gavazzi
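for: This paper presents a natural language interface for e-prescribing software within hospital information systems (HIS): a spoken dialogue system on a smartphone that lets prescribers record prescriptions verbally.
methods: The system is developed in a low-resource setting, focusing on dialogue modeling, semantic extraction, and data augmentation.
results: The system was evaluated in the wild with 55 participants. Average prescription time was 66.15 seconds for physicians and 35.64 seconds for other experts, with task success rates of 76% and 72% respectively. The evaluation data were recorded and annotated to form PxCorpus, which is fully available to the community (https://doi.org/10.5281/zenodo.6524162).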

Abstract
Hospital information systems (HIS) have become an essential part of healthcare institutions and now incorporate prescribing support software. Prescription support software allows for structured information capture, which improves the safety, appropriateness and efficiency of prescriptions and reduces the number of adverse drug events (ADEs). However, such a system increases the amount of time physicians spend at a computer entering information instead of providing medical care. In addition, any new visiting clinician must learn to manage complex interfaces since each HIS has its own interfaces. In this paper, we present a natural language interface for e-prescribing software in the form of a spoken dialogue system accessible on a smartphone. This system allows prescribers to record their prescriptions verbally, a form of interaction closer to their usual practice. The system extracts the formal representation of the prescription ready to be checked by the prescribing software and uses the dialogue to request mandatory information, correct errors or warn of particular situations. Since, to the best of our knowledge, there is no existing voice-based prescription dialogue system, we present the system developed in a low-resource environment, focusing on dialogue modeling, semantic extraction and data augmentation. The system was evaluated in the wild with 55 participants. This evaluation showed that our system has an average prescription time of 66.15 seconds for physicians and 35.64 seconds for other experts, and a task success rate of 76% for physicians and 72% for other experts. All evaluation data were recorded and annotated to form PxCorpus, the first spoken drug prescription corpus that has been made fully available to the community (https://doi.org/10.5281/zenodo.6524162).

In-Context Exemplars as Clues to Retrieving from Large Associative Memory

paper_url: http://arxiv.org/abs/2311.03498
repo_url: https://github.com/andotalao24/ICL-as-retrieval-from-associative-memory
paper_authors: Jiachen Zhao
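for: This study examines in-context learning (ICL) in large language models (LLMs) and the open question of how to choose exemplars.
methods: The authors conceptualize ICL as contextual retrieval from a model of associative memory and build a theoretical framework based on Hopfield Networks, which they use to analyze how in-context exemplars influence ICL performance.
results: ICL performance depends heavily on the exemplars used, and the authors propose a more efficient active exemplar selection method. The findings may advance the understanding of how LLMs work.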

Abstract
Recently, large language models (LLMs) have made remarkable progress in natural language processing. The most representative ability of LLMs is in-context learning (ICL), which enables LLMs to learn patterns from in-context exemplars without training. The performance of ICL greatly depends on the exemplars used. However, how to choose exemplars remains unclear due to the lack of understanding of how in-context learning works. In this paper, we present a novel perspective on ICL by conceptualizing it as contextual retrieval from a model of associative memory. We establish a theoretical framework of ICL based on Hopfield Networks. Based on our framework, we look into how in-context exemplars influence the performance of ICL and propose more efficient active exemplar selection. Our study sheds new light on the mechanism of ICL by connecting it to memory retrieval, with potential implications for advancing the understanding of LLMs.
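
To make the associative-memory view concrete, the following is a minimal sketch of a single retrieval step in a softmax-based (modern) Hopfield network. The abstract does not say which Hopfield formulation the paper uses, so the softmax variant, the function name `hopfield_retrieve`, the inverse temperature `beta`, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_retrieve(memories, query, beta=4.0):
    """One update step: weight each stored pattern by its similarity to the query.

    `beta` is a hypothetical inverse temperature; larger values make the
    retrieval closer to picking the single best-matching pattern.
    """
    scores = [beta * sum(m * q for m, q in zip(memory, query)) for memory in memories]
    weights = softmax(scores)
    retrieved = [
        sum(w * memory[d] for w, memory in zip(weights, memories))
        for d in range(len(query))
    ]
    return retrieved, weights


# Toy stored patterns (think of them as exemplar representations) and a noisy cue.
memories = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]
retrieved, weights = hopfield_retrieve(memories, query)
```

In this toy setting the noisy cue is pulled toward the first stored pattern, which is the sense in which a context can act as a clue for retrieval.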

Tackling Concept Shift in Text Classification using Entailment-style Modeling

paper_url: http://arxiv.org/abs/2311.03320
repo_url: None
paper_authors: Sumegh Roychowdhury, Karan Gupta, Siva Rajesh Kasa, Prasanna Srinivasa Murthy, Alok Chandra
for: Handle concept shift in text classification tasks with less labeling data.
methods: Reformulate vanilla classification as an entailment-style problem, requiring less data to adapt to new concepts.
results: Achieve absolute F1 gains of up to 7% and 40% in few-shot settings on real-world and synthetic datasets, respectively, with 75% labeling cost savings overall.

Abstract
Pre-trained language models (PLMs) have seen tremendous success in text classification (TC) problems in the context of Natural Language Processing (NLP). In many real-world text classification tasks, the class definitions being learned do not remain constant but rather change with time - this is known as Concept Shift. Most techniques for handling concept shift rely on retraining the old classifiers with the newly labelled data. However, given the amount of training data required to fine-tune large DL models for the new concepts, the associated labelling costs can be prohibitively expensive and time-consuming. In this work, we propose a reformulation, converting vanilla classification into an entailment-style problem that requires significantly less data to re-train the text classifier to adapt to new concepts. We demonstrate the effectiveness of our proposed method on both real-world and synthetic datasets, achieving absolute F1 gains of up to 7% and 40% respectively in few-shot settings. Further, upon deployment, our solution also helped save 75% of labeling costs overall.
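
The reformulation can be pictured as turning each labelled example into one premise/hypothesis pair per candidate class. The sketch below is only an illustration of that idea: the hypothesis template, the label descriptions, and the function name `to_entailment_pairs` are hypothetical and are not taken from the paper.

```python
def to_entailment_pairs(text, gold_label, label_descriptions):
    """Expand one classification example into binary entailment examples.

    Each candidate label becomes a hypothesis; the pair is marked as
    entailed only for the gold label.
    """
    pairs = []
    for label, description in label_descriptions.items():
        pairs.append(
            {
                "premise": text,
                # Hypothetical template; any natural language phrasing could be used.
                "hypothesis": f"This text is about {description}.",
                "entailed": label == gold_label,
            }
        )
    return pairs


# Hypothetical label set for a customer-support classifier.
label_descriptions = {
    "refund": "a refund request",
    "delivery": "a delivery problem",
}
pairs = to_entailment_pairs("My parcel never arrived.", "delivery", label_descriptions)
```

Under this framing a new or revised class is expressed as a new hypothesis string rather than a new output unit. That may be part of why less data is needed, though the abstract does not spell out the mechanism.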

Unraveling Downstream Gender Bias from Large Language Models: A Study on AI Educational Writing Assistance

paper_url: http://arxiv.org/abs/2311.03311
repo_url: https://github.com/epfl-ml4ed/unraveling-llm-bias
paper_authors: Thiemo Wambsganss, Xiaotian Su, Vinitra Swamy, Seyed Parsa Neshaei, Roman Rietsche, Tanja Käser
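for: This paper investigates how the inherent bias of LLMs transfers through an AI writing support pipeline to students' writing.
methods: The authors run a large-scale user study with 231 students across five groups with different levels of writing support, and measure gender bias with GenBit, WEAT, and SEAT at several stages: model embeddings, model suggestions, and student reviews.
results: There is no significant difference in gender bias between peer reviews written with and without LLM suggestions. In this context, LLM bias does not transfer to students' responses.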

Abstract
Large Language Models (LLMs) are increasingly utilized in educational tasks such as providing writing suggestions to students. Despite their potential, LLMs are known to harbor inherent biases which may negatively impact learners. Previous studies have investigated bias in models and data representations separately, neglecting the potential impact of LLM bias on human writing. In this paper, we investigate how bias transfers through an AI writing support pipeline. We conduct a large-scale user study with 231 students writing business case peer reviews in German. Students are divided into five groups with different levels of writing support: one classroom group with feature-based suggestions and four groups recruited from Prolific -- a control group with no assistance, two groups with suggestions from fine-tuned GPT-2 and GPT-3 models, and one group with suggestions from pre-trained GPT-3.5. Using GenBit gender bias analysis, Word Embedding Association Tests (WEAT), and Sentence Embedding Association Test (SEAT) we evaluate the gender bias at various stages of the pipeline: in model embeddings, in suggestions generated by the models, and in reviews written by students. Our results demonstrate that there is no significant difference in gender bias between the resulting peer reviews of groups with and without LLM suggestions. Our research is therefore optimistic about the use of AI writing support in the classroom, showcasing a context where bias in LLMs does not transfer to students' responses.
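
Of the three measures named above, WEAT is the simplest to sketch. The code below computes the WEAT effect size as it is usually defined (difference of mean associations of two target sets, divided by the standard deviation over all target words). The two-dimensional vectors are made-up toy embeddings, the use of the sample standard deviation is an assumption, and nothing here reproduces the paper's actual word lists or models.

```python
import math
import statistics


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(word, attrs_a, attrs_b):
    """Mean similarity to attribute set A minus mean similarity to attribute set B."""
    mean_a = sum(cosine(word, a) for a in attrs_a) / len(attrs_a)
    mean_b = sum(cosine(word, b) for b in attrs_b) / len(attrs_b)
    return mean_a - mean_b


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations, normalized by the spread over all targets."""
    assoc_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    assoc_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    spread = statistics.stdev(assoc_x + assoc_y)  # sample standard deviation (assumption)
    return (statistics.mean(assoc_x) - statistics.mean(assoc_y)) / spread


# Toy embeddings: X leans toward attribute A, Y leans toward attribute B.
X = [[1.0, 0.1], [0.9, 0.2]]
Y = [[0.1, 1.0], [0.2, 0.9]]
A = [[1.0, 0.0]]
B = [[0.0, 1.0]]
effect = weat_effect_size(X, Y, A, B)
```

A positive value indicates that the first target set is more strongly associated with attribute set A than the second target set is; swapping the target sets flips the sign.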

Ziya2: Data-centric Learning is All LLMs Need

paper_url: http://arxiv.org/abs/2311.03301
repo_url: None
paper_authors: Ruyi Gan, Ziwei Wu, Renliang Sun, Junyu Lu, Xiaojun Wu, Dixiang Zhang, Kunhao Pan, Ping Yang, Qi Yang, Jiaxing Zhang, Yan Song
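for: This study proposes Ziya2, a 13-billion-parameter model built on LLaMA2, and applies data-centric optimization at different stages to enhance its learning process.
methods: LLaMA2 is further pre-trained on 700 billion tokens, with a focus on pre-training techniques and data-centric optimization, including how pre-training data are organized and used.
results: Ziya2 significantly outperforms other models on multiple benchmarks, with especially promising results against representative open-source models. Ziya2 (Base) is released at https://huggingface.co/IDEA-CCNL/Ziya2-13B-Base and https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Base/summary.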

Abstract
Various large language models (LLMs) have been proposed in recent years, including closed- and open-source ones, continually setting new records on multiple benchmarks. However, the development of LLMs still faces several issues, such as high cost of training models from scratch, and continual pre-training leading to catastrophic forgetting, etc. Although many such issues are addressed along the line of research on LLMs, an important yet practical limitation is that many studies overly pursue enlarging model sizes without comprehensively analyzing and optimizing the use of pre-training data in their learning process, as well as appropriate organization and leveraging of such data in training LLMs under cost-effective settings. In this work, we propose Ziya2, a model with 13 billion parameters adopting LLaMA2 as the foundation model, and further pre-trained on 700 billion tokens, where we focus on pre-training techniques and use data-centric optimization to enhance the learning process of Ziya2 on different stages. Experiments show that Ziya2 significantly outperforms other models in multiple benchmarks especially with promising results compared to representative open-source ones. Ziya2 (Base) is released at https://huggingface.co/IDEA-CCNL/Ziya2-13B-Base and https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Base/summary.

Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges

paper_url: http://arxiv.org/abs/2311.03287
repo_url: https://github.com/gzcch/bingo
paper_authors: Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, Huaxiu Yao
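for: This study evaluates and characterizes hallucination in GPT-4V(ision), focusing on two common types: bias and interference.
methods: The authors introduce a new benchmark, Bias and Interference Challenges in Visual Language Models (Bingo), to assess hallucination behavior.
results: GPT-4V(ision) shows a notable regional bias: it interprets Western images and images with English text better than images from other countries or with text in other languages. It is also vulnerable to leading questions and is often confused when interpreting multiple images together. Self-correction and chain-of-thought reasoning do not resolve these issues.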

Abstract
While GPT-4V(ision) impressively models both visual and textual information simultaneously, its hallucination behavior has not been systematically assessed. To bridge this gap, we introduce a new benchmark, namely, the Bias and Interference Challenges in Visual Language Models (Bingo). This benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual language models: bias and interference. Here, bias refers to the model's tendency to hallucinate certain types of responses, possibly due to imbalance in its training data. Interference pertains to scenarios where the judgment of GPT-4V(ision) can be disrupted due to how the text prompt is phrased or how the input image is presented. We identify a notable regional bias, whereby GPT-4V(ision) is better at interpreting Western images or images with English writing compared to images from other countries or containing text in other languages. Moreover, GPT-4V(ision) is vulnerable to leading questions and is often confused when interpreting multiple images together. Popular mitigation approaches, such as self-correction and chain-of-thought reasoning, are not effective in resolving these challenges. We also identified similar biases and interference vulnerabilities with LLaVA and Bard. Our results characterize the hallucination challenges in GPT-4V(ision) and state-of-the-art visual-language models, and highlight the need for new solutions. The Bingo benchmark is available at https://github.com/gzcch/Bingo.

Safurai-Csharp: Harnessing Synthetic Data to improve language-specific Code LLM

paper_url: http://arxiv.org/abs/2311.03243
repo_url: None
paper_authors: Davide Cifarelli, Leonardo Boiardi, Alessandro Puppo, Leon Jovanovic
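for: This paper introduces an open-source model specialized in generating, completing, and debugging C# code.
methods: The model is built on CodeLlama 34B and uses the EvolInstruct technique to create a refined and expanded dataset for fine-tuning.
results: The model scores 56.33% on the Manual MultiPL-E benchmark (Zero-Shot, Pass@1), which the authors take as a sign of its capacity to streamline developer workflows and aid code learning.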

Abstract
This paper introduces Safurai-Csharp, an open-source model designed to specialize in the generation, completion, and debugging of C# code. Safurai-Csharp is built upon the novel CodeLlama 34B model and leverages the EvolInstruct technique, creating a refined and expanded dataset for its fine-tuning process. The results of its performance, a notable score of 56.33% on the Manual MultiPL-E benchmark (Zero-Shot, Pass@1), signal its high capacity to streamline developers' workflows and aid code learning. It shows promise in setting new stakes in the landscape of open-source C# LLMs and hopes to inspire more inclusive and wide-ranging development in the field of language-specific LLMs.
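
For readers unfamiliar with the Pass@1 figure, the sketch below shows the widely used unbiased pass@k estimator for a single problem, given `n` generated samples of which `c` pass the tests. The abstract does not state how the score was computed, so treating this estimator as the one behind the reported number is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct.

    n: total samples generated for the problem
    c: number of those samples that pass the unit tests
    k: number of samples the metric is allowed to draw
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 samples and 3 correct, pass@1 is simply the fraction correct.
score = pass_at_k(n=10, c=3, k=1)
```

A benchmark score is then the mean of this per-problem value over all problems; with a single sample per problem, pass@1 reduces to the fraction of problems solved.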

p-Laplacian Transformer

paper_url: http://arxiv.org/abs/2311.03235
repo_url: None
paper_authors: Tuan Nguyen, Tam Nguyen, Vinh Nguyen, Tan M. Nguyen
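for: This paper studies the self-attention mechanism in transformers from the viewpoint of $p$-Laplacian regularization.
methods: The authors propose a novel class of transformers, the $p$-Laplacian Transformer (p-LaT), which uses the $p$-Laplacian regularization framework to harness heterophilic features within self-attention layers.
results: Experiments on a wide range of benchmark datasets show advantages of p-LaT over baseline transformers.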

Abstract
$p$-Laplacian regularization, rooted in graph and image signal processing, introduces a parameter $p$ to control the regularization effect on these data. Smaller values of $p$ promote sparsity and interpretability, while larger values encourage smoother solutions. In this paper, we first show that the self-attention mechanism obtains the minimal Laplacian regularization ($p=2$) and encourages the smoothness in the architecture. However, the smoothness is not suitable for the heterophilic structure of self-attention in transformers where attention weights between tokens that are in close proximity and non-close ones are assigned indistinguishably. From that insight, we then propose a novel class of transformers, namely the $p$-Laplacian Transformer (p-LaT), which leverages $p$-Laplacian regularization framework to harness the heterophilic features within self-attention layers. In particular, low $p$ values will effectively assign higher attention weights to tokens that are in close proximity to the current token being processed. We empirically demonstrate the advantages of p-LaT over the baseline transformers on a wide range of benchmark datasets.
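
As background for the role of $p$, one common way to write a graph $p$-Laplacian regularizer is given below. The notation is an assumption made for illustration and may differ from the paper's: $f_i$ is the feature vector of node (token) $i$, stacked as the rows of $F$; $w_{ij}$ is a symmetric edge weight; $D$ is the degree matrix and $L = D - W$ the graph Laplacian.

```latex
J_p(F) = \frac{1}{p} \sum_{i,j} w_{ij} \, \lVert f_i - f_j \rVert^{p},
\qquad
J_2(F) = \frac{1}{2} \sum_{i,j} w_{ij} \, \lVert f_i - f_j \rVert^{2}
       = \operatorname{tr}\!\left( F^{\top} L F \right),
\quad L = D - W .
```

With $p = 2$ the penalty is the familiar quadratic Laplacian smoothness term, which matches the abstract's statement that standard self-attention corresponds to the $p = 2$ case. Smaller $p$ penalizes large differences less severely, which is the usual intuition for why it tolerates sharper, sparser solutions.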

Model-based Counterfactual Generator for Gender Bias Mitigation

paper_url: http://arxiv.org/abs/2311.03186
repo_url: None
paper_authors: Ewoenam Kwaku Tokpo, Toon Calders
for: Mitigate gender bias in language models.
methods: Combine data processing techniques and a bi-objective training regime to develop a model-based solution for generating counterfactuals.
results: The model alleviates the shortcomings of dictionary-based solutions.

Abstract
Counterfactual Data Augmentation (CDA) has been one of the preferred techniques for mitigating gender bias in natural language models. CDA techniques have mostly employed word substitution based on dictionaries. Although such dictionary-based CDA techniques have been shown to significantly improve the mitigation of gender bias, in this paper, we highlight some limitations of such dictionary-based counterfactual data augmentation techniques, such as susceptibility to ungrammatical compositions, and lack of generalization outside the set of predefined dictionary words. Model-based solutions can alleviate these problems, yet the lack of qualitative parallel training data hinders development in this direction. Therefore, we propose a combination of data processing techniques and a bi-objective training regime to develop a model-based solution for generating counterfactuals to mitigate gender bias. We implemented our proposed solution and performed an empirical evaluation which shows how our model alleviates the shortcomings of dictionary-based solutions.
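
The limitations of dictionary-based CDA are easy to reproduce with a toy substitution table. The sketch below is a deliberately naive baseline of the kind the abstract criticizes, not the authors' model: the word list `GENDER_PAIRS` and the function name `counterfactual` are hypothetical.

```python
import re

# Hypothetical, deliberately small swap dictionary. "her" is ambiguous
# (possessive or object), but a dictionary has to commit to one mapping.
GENDER_PAIRS = {
    "he": "she",
    "she": "he",
    "him": "her",
    "his": "her",
    "her": "his",
    "man": "woman",
    "woman": "man",
}


def counterfactual(sentence):
    """Swap gendered words one token at a time, preserving initial capitalization."""

    def swap(match):
        word = match.group(0)
        replacement = GENDER_PAIRS.get(word.lower())
        if replacement is None:
            return word  # out-of-dictionary words are left untouched
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", swap, sentence)


fluent = counterfactual("He thanked his sister.")
broken = counterfactual("I called her yesterday.")
```

Here `fluent` becomes "She thanked her sister.", where "sister" is left unchanged because it is not in the dictionary, and `broken` becomes "I called his yesterday.", which is ungrammatical. These are the two failure modes named in the abstract: lack of generalization beyond the predefined words and ungrammatical compositions.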

Architectural Sweet Spots for Modeling Human Label Variation by the Example of Argument Quality: It’s Best to Relate Perspectives!

paper_url: http://arxiv.org/abs/2311.03153
repo_url: https://github.com/phhei/relateperspectives-sweetspots
paper_authors: Philipp Heinisch, Matthias Orlikowski, Julia Romberg, Philipp Cimiano
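for: This paper addresses subjective annotation tasks in natural language processing, specifically argument quality classification.
methods: The authors consider a continuum of approaches, from models that fully aggregate perspectives into a majority label to "share nothing" architectures, to represent the interplay of individual and shared perspectives.
results: Architectures that include recommender-system-inspired layers modeling the relations between annotators raise the averaged annotator-individual F$_1$-scores by up to 43% over a majority label model. Approaches to subjectivity can benefit from relating individual perspectives.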

Abstract
Many annotation tasks in natural language processing are highly subjective in that there can be different valid and justified perspectives on what is a proper label for a given example. This also applies to the judgment of argument quality, where the assignment of a single ground truth is often questionable. At the same time, there are generally accepted concepts behind argumentation that form a common ground. To best represent the interplay of individual and shared perspectives, we consider a continuum of approaches ranging from models that fully aggregate perspectives into a majority label to "share nothing"-architectures in which each annotator is considered in isolation from all other annotators. In between these extremes, inspired by models used in the field of recommender systems, we investigate the extent to which architectures that include layers to model the relations between different annotators are beneficial for predicting single-annotator labels. By means of two tasks of argument quality classification (argument concreteness and validity/novelty of conclusions), we show that recommender architectures increase the averaged annotator-individual F$_1$-scores up to $43\%$ over a majority label model. Our findings indicate that approaches to subjectivity can benefit from relating individual perspectives.

Text Augmentations with R-drop for Classification of Tweets Self Reporting Covid-19

paper_url: http://arxiv.org/abs/2311.03420
repo_url: None
paper_authors: Sumam Francis, Marie-Francine Moens
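for: This study presents models developed for the Social Media Mining for Health 2023 shared task. Our team addressed the first task, classifying tweets that self-report a COVID-19 diagnosis.
methods: Our approach is a classification model that combines diverse text augmentations with R-drop to augment the data and reduce overfitting. The augmentations include synonym substitution, reserved words, and back translation.
results: Our system achieves an F1 score of 0.877 on the test set, exceeding the task mean and median scores.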

Abstract
This paper presents models created for the Social Media Mining for Health 2023 shared task. Our team addressed the first task, classifying tweets that self-report Covid-19 diagnosis. Our approach involves a classification model that incorporates diverse textual augmentations and utilizes R-drop to augment data and mitigate overfitting, boosting model efficacy. Our leading model, enhanced with R-drop and augmentations like synonym substitution, reserved words, and back translations, outperforms the task mean and median scores. Our system achieves an impressive F1 score of 0.877 on the test set.

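Summary
This paper presents models created for the Social Media Mining for Health 2023 shared task. Our team addressed the first task, classifying tweets that self-report a Covid-19 diagnosis. Our approach is a classification model that uses diverse text augmentations together with R-drop to augment the data and mitigate overfitting, which improves model efficacy. Our leading model, enhanced with R-drop and augmentations such as synonym substitution, reserved words, and back translation, outperforms the task mean and median scores. Our system achieves an F1 score of 0.877 on the test set.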
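The abstract names R-drop but does not spell out the loss. As a rough illustration, the sketch below follows the commonly cited R-drop formulation: the same input is passed through the model twice with dropout active, and the loss adds a symmetric KL term between the two predicted distributions to the usual negative log-likelihood. The function names, the weight `alpha`, and the toy logits are assumptions for illustration, not values from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(logp, logq):
    """KL(p || q) for two vectors of log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp, logq))


def r_drop_loss(logits_a, logits_b, label, alpha=1.0):
    """R-drop style loss for one example.

    logits_a and logits_b stand for the outputs of two forward passes of the
    same model on the same input, each with a different dropout mask.
    alpha (hypothetical default) weights the consistency term.
    """
    logp_a = log_softmax(logits_a)
    logp_b = log_softmax(logits_b)
    # Negative log-likelihood of the gold label, summed over both passes.
    nll = -(logp_a[label] + logp_b[label])
    # Symmetric KL pushes the two dropout passes toward the same prediction.
    sym_kl = 0.5 * (kl_divergence(logp_a, logp_b) + kl_divergence(logp_b, logp_a))
    return nll + alpha * sym_kl


# Toy logits for a binary "self-reports a diagnosis" classifier.
agreeing = r_drop_loss([2.0, 0.5], [2.0, 0.5], label=0)
disagreeing = r_drop_loss([2.0, 0.5], [0.5, 2.0], label=0)
print(agreeing, disagreeing)
```

When the two passes agree, the KL term vanishes and only the classification loss remains; the more the dropout masks change the prediction, the larger the penalty.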

Injecting Categorical Labels and Syntactic Information into Biomedical NER

paper_url: http://arxiv.org/abs/2311.03113
repo_url: None
paper_authors: Sumam Francis, Marie-Francine Moens
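for: Improving the accuracy of biomedical named entity recognition (NER).
methods: Two approaches are used. In the first, a sequence-level classifier is trained to assign sentences to categories, with the labels rewritten as natural language templates to improve the classifier's accuracy; these labels and part-of-speech (POS) information are then injected into the NER model. In the second, the categorical labels and NER labels are learned jointly.
results: Experiments on three benchmark datasets show that injecting categorical labels and POS information into the NER model improves NER accuracy and outperforms baseline BERT-based models.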

Abstract
We present a simple approach to improve biomedical named entity recognition (NER) by injecting categorical labels and Part-of-speech (POS) information into the model. We use two approaches. In the first approach, we first train a sequence-level classifier to classify the sentences into categories to obtain the sentence-level tags (categorical labels). The sequence classifier is modeled as an entailment problem by modifying the labels as a natural language template. This helps to improve the accuracy of the classifier. Further, this label information is injected into the NER model. In this paper, we demonstrate effective ways to represent and inject these labels and POS attributes into the NER model. In the second approach, we jointly learn the categorical labels and NER labels. Here we also inject the POS tags into the model to increase the syntactic context of the model. Experiments on three benchmark datasets show that incorporating categorical label information with syntactic context is quite useful and outperforms baseline BERT-based models.

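Summary
We present a simple approach to improving biomedical named entity recognition (NER) by injecting categorical labels and part-of-speech (POS) information into the model. We use two approaches. In the first, we train a sequence-level classifier to sort sentences into categories and thereby obtain sentence-level tags (categorical labels). The classifier is built by rewriting the labels as natural language templates, which helps improve its accuracy. We then inject these labels and the POS information into the NER model. In the second approach, we jointly learn the categorical labels and the NER labels, again injecting POS tags to increase the model's syntactic context. Experiments on three benchmark datasets show that incorporating categorical labels and syntactic context improves NER accuracy and outperforms baseline BERT-based models.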

Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch

paper_url: http://arxiv.org/abs/2311.03099
repo_url: https://github.com/yule-buaa/mergelm
paper_authors: Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li
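for: This study investigates how language models (LMs) can acquire new capabilities by absorbing the parameters of homologous models, without retraining or GPUs.
methods: The authors find that a novel operation called DARE (Drop And REscale) can set most delta parameters directly to zero without affecting the capabilities of SFT LMs. The delta parameters of multiple homologous SFT models are then sparsified with DARE and merged into a single model with multiple abilities.
results: Experiments show that delta parameter value ranges are typically within 0.005 and that DARE can easily eliminate 99% of them. Once models are continuously pre-trained, however, the ranges can grow to around 0.03, making DARE impractical. Removing fine-tuned parameters instead of delta parameters can reduce performance drastically, even to 0, which suggests that SFT merely stimulates abilities via delta parameters rather than injecting new ones. DARE can also merge multiple task-specific LMs into one LM with diverse abilities: merging WizardLM and WizardMath raises WizardLM's GSM8K zero-shot accuracy from 2.2 to 66.3, retaining its instruction-following ability while surpassing WizardMath's original 64.2.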

Abstract
In this paper, we uncover that Language Models (LMs), either encoder- or decoder-based, can obtain new capabilities by assimilating the parameters of homologous models without retraining or GPUs. Typically, new abilities of LMs can be imparted by Supervised Fine-Tuning (SFT), reflected in the disparity between fine-tuned and pre-trained parameters (i.e., delta parameters). We initially observe that by introducing a novel operation called DARE (Drop And REscale), most delta parameters can be directly set to zeros without affecting the capabilities of SFT LMs and larger models can tolerate a higher proportion of discarded parameters. Based on this observation, we further sparsify delta parameters of multiple SFT homologous models with DARE and subsequently merge them into a single model by parameter averaging. We conduct experiments on eight datasets from the GLUE benchmark with BERT and RoBERTa. We also merge WizardLM, WizardMath, and Code Alpaca based on Llama 2. Experimental results show that: (1) The delta parameter value ranges for SFT models are typically small, often within 0.005, and DARE can eliminate 99% of them effortlessly. However, once the models are continuously pre-trained, the value ranges can grow to around 0.03, making DARE impractical. We have also tried to remove fine-tuned instead of delta parameters and find that a 10% reduction can lead to drastically decreased performance (even to 0). This highlights that SFT merely stimulates the abilities via delta parameters rather than injecting new abilities into LMs; (2) DARE can merge multiple task-specific LMs into one LM with diverse abilities. For instance, the merger of WizardLM and WizardMath improves the GSM8K zero-shot accuracy of WizardLM from 2.2 to 66.3, retaining its instruction-following ability while surpassing WizardMath's original 64.2 performance. Codes are available at https://github.com/yule-BUAA/MergeLM.

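Summary
In this paper, we find that language models (LMs), whether encoder- or decoder-based, can acquire new capabilities by absorbing the parameters of homologous models, without retraining or GPUs. New abilities are usually imparted through supervised fine-tuning (SFT) and are reflected in the difference between fine-tuned and pre-trained parameters (the delta parameters). We find that a novel operation called DARE (Drop And REscale) can set most delta parameters directly to zero without affecting the capabilities of SFT LMs. Building on this, we sparsify the delta parameters of multiple homologous SFT models and merge them into a single model. We run experiments on eight datasets from the GLUE benchmark and also merge WizardLM, WizardMath, and Code Alpaca, all based on Llama 2. The results show that: (1) the delta parameter value ranges of SFT models are typically within 0.005, and DARE can easily eliminate 99% of them. Once models are continuously pre-trained, however, the ranges can grow to around 0.03, which makes DARE impractical. We also tried removing fine-tuned parameters instead of delta parameters and found that a 10% reduction can drop performance to 0. This indicates that SFT merely stimulates abilities via delta parameters rather than injecting new abilities into LMs; (2) DARE can merge multiple task-specific LMs into one LM with diverse abilities. For example, merging WizardLM and WizardMath raises WizardLM's GSM8K zero-shot accuracy from 2.2 to 66.3, retaining its instruction-following ability while surpassing WizardMath's original 64.2. Code is available at https://github.com/yule-BUAA/MergeLM.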
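The drop-and-rescale idea described above can be sketched on toy data. The example below treats each model as a flat list of floats (a stand-in for real weight tensors), drops each delta parameter with probability `drop_rate`, rescales the survivors by `1 / (1 - drop_rate)` so the expected delta is unchanged, and merges several fine-tuned models by averaging their sparsified deltas. Function names, the toy values, and the exact averaging rule are assumptions for illustration; the repository linked above holds the authors' actual implementation.

```python
import random


def dare_deltas(pretrained, finetuned, drop_rate, rng):
    """Drop And REscale: sparsify the delta between fine-tuned and base weights."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - drop_rate)
    sparse = []
    for base, tuned in zip(pretrained, finetuned):
        delta = tuned - base
        # Zero the delta with probability drop_rate, otherwise rescale it.
        sparse.append(0.0 if rng.random() < drop_rate else delta * keep_scale)
    return sparse


def merge_with_dare(pretrained, finetuned_models, drop_rate, seed=0):
    """Merge models by adding the average of their DARE-sparsified deltas."""
    rng = random.Random(seed)
    all_deltas = [dare_deltas(pretrained, ft, drop_rate, rng) for ft in finetuned_models]
    merged = []
    for i, base in enumerate(pretrained):
        mean_delta = sum(deltas[i] for deltas in all_deltas) / len(all_deltas)
        merged.append(base + mean_delta)
    return merged


# Hypothetical 4-parameter "models" sharing one pre-trained backbone.
base = [0.10, -0.20, 0.30, 0.00]
math_model = [0.104, -0.198, 0.300, 0.003]
chat_model = [0.099, -0.200, 0.305, -0.002]
merged = merge_with_dare(base, [math_model, chat_model], drop_rate=0.5)
print(merged)
```

The rescaling step is what keeps the expected value of each delta intact after dropping, which is one way to read the abstract's observation that small deltas tolerate very high drop rates.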

BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla Lemmatizer

paper_url: http://arxiv.org/abs/2311.03078
repo_url: https://github.com/eblict-gigatech/BanLemma
paper_authors: Sadia Afrin, Md. Shahad Mahmud Chowdhury, Md. Ekramul Islam, Faisal Ahamed Khan, Labib Imam Chowdhury, MD. Motahar Mahtab, Nazifa Nuha Chowdhury, Massud Forkan, Neelima Kundu, Hakim Arif, Mohammad Mamun Or Rashid, Mohammad Ruhul Amin, Nabeel Mohammed
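for: The paper proposes a lemmatization algorithm based on linguistic rules to address lemmatization in Bangla.
methods: The paper uses linguistic rules together with a dictionary to design a lemmatizer specific to Bangla. The rules were developed by analyzing a large corpus of Bangla text.
results: The lemmatizer achieves 96.36% accuracy on a test dataset manually annotated by trained linguists and performs competitively on three previously published Bangla lemmatization datasets.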

Abstract
Lemmatization holds significance in both natural language processing (NLP) and linguistics, as it effectively decreases data density and aids in comprehending contextual meaning. However, due to the highly inflected nature and morphological richness, lemmatization in Bangla text poses a complex challenge. In this study, we propose linguistic rules for lemmatization and utilize a dictionary along with the rules to design a lemmatizer specifically for Bangla. Our system aims to lemmatize words based on their parts of speech class within a given sentence. Unlike previous rule-based approaches, we analyzed the suffix marker occurrence according to the morpho-syntactic values and then utilized sequences of suffix markers instead of entire suffixes. To develop our rules, we analyze a large corpus of Bangla text from various domains, sources, and time periods to observe the word formation of inflected words. The lemmatizer achieves an accuracy of 96.36% when tested against a manually annotated test dataset by trained linguists and demonstrates competitive performance on three previously published Bangla lemmatization datasets. We are making the code and datasets publicly available at https://github.com/eblict-gigatech/BanLemma in order to contribute to the further advancement of Bangla NLP.

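Summary
Lemmatization is important in both natural language processing (NLP) and linguistics because it effectively reduces data density and helps in understanding contextual meaning. However, the highly inflected nature and morphological richness of Bangla make lemmatization of Bangla text a complex challenge. In this study, we propose linguistic rules for lemmatization and use a dictionary together with the rules to design a lemmatizer specifically for Bangla. Our system aims to lemmatize words based on their part-of-speech class within a given sentence. Unlike previous rule-based approaches, we analyze the occurrence of suffix markers according to their morpho-syntactic values and then use sequences of suffix markers instead of entire suffixes. To develop our rules, we analyze a large corpus of Bangla text from various domains, sources, and time periods to observe how inflected words are formed. The lemmatizer achieves an accuracy of 96.36% on a manually annotated test set and performs competitively on three previously published Bangla lemmatization datasets. We are releasing the code and datasets on GitHub to contribute to the further advancement of Bangla NLP.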

Zero-shot Bilingual App Reviews Mining with Large Language Models

paper_url: http://arxiv.org/abs/2311.03058
repo_url: https://github.com/jl-wei/mini-bar
paper_authors: Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Binbin Xu, Pierre Louis Bernard, Gérard Dray
for: The paper aims to improve the assessment and optimization of software requirements by leveraging user reviews from app stores.
methods: The proposed approach, called Mini-BAR, uses large language models (LLMs) to automatically mine user reviews in both English and French. Mini-BAR consists of four main components: classification, clustering, abstractive summary generation, and ranking.
results: The authors evaluate the effectiveness and efficiency of Mini-BAR using a dataset of 6,000 English and 6,000 French annotated user reviews. Preliminary results demonstrate the ability of Mini-BAR to classify, cluster, and summarize user reviews in a zero-shot setting, as well as to rank the review clusters.

Abstract
App reviews from app stores are crucial for improving software requirements. A large number of valuable reviews are continually being posted, describing software problems and expected features. Effectively utilizing user reviews necessitates the extraction of relevant information, as well as their subsequent summarization. Due to the substantial volume of user reviews, manual analysis is arduous. Various approaches based on natural language processing (NLP) have been proposed for automatic user review mining. However, the majority of them require a manually crafted dataset to train their models, which limits their usage in real-world scenarios. In this work, we propose Mini-BAR, a tool that integrates large language models (LLMs) to perform zero-shot mining of user reviews in both English and French. Specifically, Mini-BAR is designed to (i) classify the user reviews, (ii) cluster similar reviews together, (iii) generate an abstractive summary for each cluster and (iv) rank the user review clusters. To evaluate the performance of Mini-BAR, we created a dataset containing 6,000 English and 6,000 French annotated user reviews and conducted extensive experiments. Preliminary results demonstrate the effectiveness and efficiency of Mini-BAR in requirement engineering by analyzing bilingual app reviews. The replication package, containing the code, dataset, and experiment setups, is available at https://github.com/Jl-wei/mini-bar.

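Summary
User reviews in app stores play a key role in improving software requirements. Large numbers of valuable reviews are posted continually, describing software problems and expected features. Using them effectively requires extracting the relevant information and summarizing it. Because of the sheer volume of reviews, manual analysis is arduous. Various approaches based on natural language processing (NLP) have been proposed for automatic user review mining, but most of them need a manually crafted dataset to train their models, which limits their use in real-world scenarios. In this work, we propose Mini-BAR, a tool that uses large language models (LLMs) to perform zero-shot mining of user reviews. Specifically, Mini-BAR is designed to (i) classify the user reviews, (ii) cluster similar reviews together, (iii) generate an abstractive summary for each cluster, and (iv) rank the user review clusters. To evaluate Mini-BAR, we created a dataset of 6,000 English and 6,000 French user reviews and conducted extensive experiments. Preliminary results demonstrate the effectiveness and efficiency of Mini-BAR in requirement engineering through the analysis of bilingual app reviews. The replication package, containing the code, dataset, and experiment setups, is available at https://github.com/Jl-wei/mini-bar.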

Detecting Agreement in Multi-party Conversational AI

paper_url: http://arxiv.org/abs/2311.03026
repo_url: None
paper_authors: Laura Schauer, Jason Sweeney, Charlie Lyttle, Zein Said, Aron Szeles, Cale Clark, Katie McAskill, Xander Wickham, Tom Byars, Daniel Hernández Garcia, Nancie Gunson, Angus Addlesee, Oliver Lemon
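for: The paper addresses the practical usability of multi-party conversation in Socially Assistive Robots (SARs), in particular challenges such as speaker recognition, addressee recognition, and complex turn-taking.
methods: The paper presents a multi-party conversational system that invites two users to play a trivia quiz game. The system detects the users' agreement or disagreement on a final answer and responds accordingly.
results: The evaluation includes both performance and user assessment results, with a focus on detecting user agreement. The annotated transcripts and the code have been released on GitHub so that other researchers can reuse and extend them.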

Abstract
Today, conversational systems are expected to handle conversations in multi-party settings, especially within Socially Assistive Robots (SARs). However, practical usability remains difficult as there are additional challenges to overcome, such as speaker recognition, addressee recognition, and complex turn-taking. In this paper, we present our work on a multi-party conversational system, which invites two users to play a trivia quiz game. The system detects users' agreement or disagreement on a final answer and responds accordingly. Our evaluation includes both performance and user assessment results, with a focus on detecting user agreement. Our annotated transcripts and the code for the proposed system have been released open-source on GitHub.

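Summary
Today, conversational systems are expected to handle conversations in multi-party settings, especially within Socially Assistive Robots (SARs). In practice, however, several challenges remain, such as speaker recognition, addressee recognition, and complex turn-taking. In this paper, we present a multi-party conversational system that invites two users to play a trivia quiz game. The system detects the users' agreement or disagreement on a final answer and responds accordingly. Our evaluation includes both performance and user assessment results, with a focus on detecting user agreement. We have released the annotated transcripts and the system code on GitHub.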

Detecting agreement in multi-party dialogue: evaluating speaker diarisation versus a procedural baseline to enhance user engagement

paper_url: http://arxiv.org/abs/2311.03021
repo_url: https://github.com/ddenley/multi-person-quiz
paper_authors: Angus Addlesee, Daniel Denley, Andy Edmondson, Nancie Gunson, Daniel Hernández Garcia, Alexandre Kha, Oliver Lemon, James Ndubuisi, Neil O’Reilly, Lia Perochaud, Raphaël Valeri, Miebaka Worika
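for: This study examines whether dialogue state tracking methods can correctly identify agreement and disagreement in multi-party conversation.
methods: The study compares a diarisation model with a frequency-and-proximity-based method for identifying agreement and disagreement in dialogue.
results: Experiments show that the procedural system was more engaging for players than the diarised system and more accurate at detecting agreement, reaching an average accuracy of 0.44 compared to 0.28 for the diarised system.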

Abstract
Conversational agents participating in multi-party interactions face significant challenges in dialogue state tracking, since the identity of the speaker adds significant contextual meaning. It is common to utilise diarisation models to identify the speaker. However, it is not clear if these are accurate enough to correctly identify specific conversational events such as agreement or disagreement during a real-time interaction. This study uses a cooperative quiz, where the conversational agent acts as quiz-show host, to determine whether diarisation or a frequency-and-proximity-based method is more accurate at determining agreement, and whether this translates to feelings of engagement from the players. Experimental results show that our procedural system was more engaging to players, and was more accurate at detecting agreement, reaching an average accuracy of 0.44 compared to 0.28 for the diarised system.

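Summary
Conversational agents in multi-party interactions face significant challenges in dialogue state tracking, because the identity of the speaker adds contextual meaning. Diarisation models are commonly used to identify the speaker. It is unclear, however, whether they are accurate enough to identify specific conversational events such as agreement or disagreement. This study uses a cooperative quiz, in which the conversational agent acts as quiz-show host, to determine whether diarisation or a frequency-and-proximity-based method is more accurate at determining agreement. Experimental results show that our procedural system was more engaging for players and more accurate at detecting agreement, reaching an accuracy of 0.44 compared to 0.28 for the diarised system.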

Towards a Transformer-Based Reverse Dictionary Model for Quality Estimation of Definitions

paper_url: http://arxiv.org/abs/2311.02985
repo_url: None
paper_authors: Julien Guité-Vinet, Alexandre Blondin Massé, Fatiha Sadat
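for: This study addresses the reverse dictionary task in the context of a serious game called The Dictionary Game.
methods: The study compares different transformer-based models for solving the reverse dictionary task and explores their use in this context.
results: The study reports how the different transformer-based models perform on the reverse dictionary task and analyzes their strengths and weaknesses.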

Abstract
In recent years, several variants of transformers have emerged. In this paper, we compare different transformer-based models for solving the reverse dictionary task and explore their use in the context of a serious game called The Dictionary Game.

Summary
In recent years, several variants of transformers have emerged. This paper compares different transformer-based models for solving the reverse dictionary task and explores their use in the context of a serious game called The Dictionary Game.

Adapting Pre-trained Generative Models for Extractive Question Answering

paper_url: http://arxiv.org/abs/2311.02961
repo_url: https://github.com/prabirmallick/GenAI4EQA
paper_authors: Prabir Mallick, Tapas Nayak, Indrajit Bhattacharya
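for: Improving performance on extractive question answering tasks.
methods: Uses pre-trained generative models to generate indexes corresponding to the context tokens or sentences that form part of the answer.
results: Achieves better performance than existing models on multiple extractive QA datasets, including MultiSpanQA, BioASQ, MASHQA, and WikiQA.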

Abstract
Pre-trained Generative models such as BART, T5, etc. have gained prominence as a preferred method for text generation in various natural language processing tasks, including abstractive long-form question answering (QA) and summarization. However, the potential of generative models in extractive QA tasks, where discriminative models are commonly employed, remains largely unexplored. Discriminative models often encounter challenges associated with label sparsity, particularly when only a small portion of the context contains the answer. The challenge is more pronounced for multi-span answers. In this work, we introduce a novel approach that uses the power of pre-trained generative models to address extractive QA tasks by generating indexes corresponding to context tokens or sentences that form part of the answer. Through comprehensive evaluations on multiple extractive QA datasets, including MultiSpanQA, BioASQ, MASHQA, and WikiQA, we demonstrate the superior performance of our proposed approach compared to existing state-of-the-art models.

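Summary
Pre-trained generative models such as BART and T5 have become popular for text generation in natural language processing, including abstractive long-form question answering (QA) and summarization. Their potential in extractive QA tasks, however, remains largely unexplored. Label sparsity is a particular challenge when only a small portion of the context contains the answer, and it is more pronounced when the answer spans multiple segments. In this work, we propose a novel approach that uses pre-trained generative models for extractive QA by generating the indexes of the context tokens or sentences that form part of the answer. Through comprehensive evaluations on multiple extractive QA datasets, including MultiSpanQA, BioASQ, MASHQA, and WikiQA, we show that the proposed approach outperforms existing state-of-the-art models.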
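The abstract does not give the exact input or output format, so the following is only a guess at how index generation could be wired up at the sentence level: the context sentences are numbered in the model input, the generative model is assumed to emit a string of indexes, and those indexes are mapped back to sentences. The bracketed numbering scheme, the function names, and the stand-in model output `"0 2"` are all hypothetical.

```python
def build_indexed_context(sentences):
    """Prefix every context sentence with its index so a model can refer to it."""
    return " ".join(f"[{i}] {sentence}" for i, sentence in enumerate(sentences))


def decode_indexes(generated, sentences):
    """Map a generated index string (e.g. '0 2') back to context sentences.

    Tokens that are not valid indexes are ignored, and duplicates are merged,
    so a slightly malformed generation still yields a usable answer.
    """
    picked = set()
    for token in generated.split():
        if token.isdecimal() and int(token) < len(sentences):
            picked.add(int(token))
    return [sentences[i] for i in sorted(picked)]


# Toy context; the string "0 2" stands in for what a fine-tuned
# BART or T5 model might generate for a multi-span answer.
sentences = [
    "The device ships with a charger.",
    "It was announced last spring.",
    "A carrying case is also included.",
]
model_input = "question: What is in the box? context: " + build_indexed_context(sentences)
answer = decode_indexes("0 2", sentences)
print(answer)
```

Generating short index sequences rather than full answer text means a multi-span answer needs only a few output tokens, which is one plausible reading of why the approach helps with sparse labels.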

PhoGPT: Generative Pre-training for Vietnamese

paper_url: http://arxiv.org/abs/2311.02945
repo_url: https://github.com/vinairesearch/phogpt
paper_authors: Dat Quoc Nguyen, Linh The Nguyen, Chi Tran, Dung Ngoc Nguyen, Nhung Nguyen, Thien Huu Nguyen, Dinh Phung, Hung Bui
for: This paper introduces a new open-source generative model series, PhoGPT, for Vietnamese.
methods: The series is built on a transformer-based model with 7.5 billion parameters and includes an instruction-following variant, PhoGPT-7B5-Instruct.
results: Through a human evaluation experiment, the paper demonstrates the model's superior performance compared to previous open-source models.

Abstract
We open-source a state-of-the-art 7.5B-parameter generative model series named PhoGPT for Vietnamese, which includes the base pre-trained monolingual model PhoGPT-7B5 and its instruction-following variant, PhoGPT-7B5-Instruct. In addition, we also demonstrate its superior performance compared to previous open-source models through a human evaluation experiment. GitHub: https://github.com/VinAIResearch/PhoGPT

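Summary
We open-source a state-of-the-art 7.5B-parameter generative model series for Vietnamese named PhoGPT. The series includes the base pre-trained monolingual model PhoGPT-7B5 and its instruction-following variant, PhoGPT-7B5-Instruct. We also demonstrate its superior performance over previous open-source models through a human evaluation experiment. GitHub: https://github.com/VinAIResearch/PhoGPT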

SQLPrompt: In-Context Text-to-SQL with Minimal Labeled Data

paper_url: http://arxiv.org/abs/2311.02883
repo_url: None
paper_authors: Ruoxi Sun, Sercan Ö. Arik, Rajarishi Sinha, Hootan Nakhost, Hanjun Dai, Pengcheng Yin, Tomas Pfister
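for: Improving the few-shot prompting capabilities of Text-to-SQL with large language models.
methods: Innovative prompt design, an execution-based consistency decoding strategy, and a mixing strategy that diversifies SQL proposals across different prompt designs (MixPrompt) and foundation models (MixLLMs).
results: With few labeled data, SQLPrompt outperforms previous in-context learning approaches by a large margin and closes the gap with state-of-the-art models fine-tuned on thousands of labeled examples.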

Abstract
Text-to-SQL aims to automate the process of generating SQL queries on a database from natural language text. In this work, we propose "SQLPrompt", tailored to improve the few-shot prompting capabilities of Text-to-SQL for Large Language Models (LLMs). Our methods include innovative prompt design, an execution-based consistency decoding strategy that selects the SQL with the most consistent execution outcome among other SQL proposals, and a method that aims to improve performance by diversifying the SQL proposals during consistency selection with different prompt designs ("MixPrompt") and foundation models ("MixLLMs"). We show that SQLPrompt outperforms previous approaches for in-context learning with few labeled data by a large margin, closing the gap with finetuning state-of-the-art with thousands of labeled data.

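Summary
Text-to-SQL aims to automate the generation of SQL queries from natural language text. In this work, we propose "SQLPrompt" to improve the few-shot prompting capabilities of Text-to-SQL for large language models (LLMs). Our methods include innovative prompt design, an execution-based consistency decoding strategy that selects among SQL proposals by their execution outcomes, and a method that improves performance by diversifying the proposals with different prompt designs (MixPrompt) and foundation models (MixLLMs). We show that SQLPrompt outperforms previous in-context learning approaches with few labeled data by a large margin, closing the gap with state-of-the-art models fine-tuned on thousands of labeled examples.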
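Execution-based consistency decoding, as described in the abstract, picks the SQL proposal whose execution result agrees with the most other proposals. The sketch below shows one simple way to do that with Python's built-in `sqlite3` module: each candidate is executed, failing queries are discarded, and the candidate belonging to the largest group of identical results wins. The `singer` table, its rows, and the candidate queries are made up for illustration, and the tie-breaking and order-insensitive comparison are assumptions, not details from the paper.

```python
import sqlite3
from collections import Counter


def pick_consistent_sql(conn, candidates):
    """Return the candidate whose execution result is shared by the most candidates."""
    executed = []
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # proposals that fail to execute get no vote
        # Sort rows so that row order does not affect the comparison.
        executed.append((sql, tuple(sorted(rows, key=repr))))
    if not executed:
        return None
    votes = Counter(outcome for _, outcome in executed)
    best_outcome, _ = votes.most_common(1)[0]
    # Return the first proposal that produced the winning outcome.
    return next(sql for sql, outcome in executed if outcome == best_outcome)


# Hypothetical database and LLM proposals for "Which singers are older than 30?"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 25), ("Bo", 35), ("Cy", 45)])

candidates = [
    "SELECT name FROM singer WHERE age > 30",
    "SELECT name FROM singer WHERE age >= 31",
    "SELECT name FROM singer WHERE age > 40",
    "SELEC name FROM singer",  # malformed proposal
]
chosen = pick_consistent_sql(conn, candidates)
print(chosen)
```

Under this reading, mixing prompt designs or foundation models simply enlarges and diversifies the `candidates` list before the vote.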

Less than One-shot: Named Entity Recognition via Extremely Weak Supervision

paper_url: http://arxiv.org/abs/2311.02861
repo_url: https://github.com/komeijiforce/x-ner
paper_authors: Letian Peng, Zihan Wang, Jingbo Shang
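for: This paper addresses named entity recognition (NER) under the extremely weak supervision (XWS) setting.
methods: The paper proposes a novel method, X-NER, which learns NER from only one example entity per type, given in a context-free way.
results: Extensive experiments and analyses on 4 NER datasets show that X-NER's end-to-end NER performance exceeds that of state-of-the-art one-shot methods and that it inherits the cross-lingual abilities of the underlying language models.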

Abstract
We study the named entity recognition (NER) problem under the extremely weak supervision (XWS) setting, where only one example entity per type is given in a context-free way. While one can see that XWS is lighter than one-shot in terms of the amount of supervision, we propose a novel method X-NER that can outperform the state-of-the-art one-shot NER methods. We first mine entity spans that are similar to the example entities from an unlabelled training corpus. Instead of utilizing entity span representations from language models, we find it more effective to compare the context distributions before and after the span is replaced by the entity example. We then leverage the top-ranked spans as pseudo-labels to train an NER tagger. Extensive experiments and analyses on 4 NER datasets show the superior end-to-end NER performance of X-NER, outperforming the state-of-the-art few-shot methods with 1-shot supervision and ChatGPT annotations significantly. Finally, our X-NER possesses several notable properties, such as inheriting the cross-lingual abilities of the underlying language models.

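Summary
We study named entity recognition (NER) under the extremely weak supervision (XWS) setting, where only one example entity per type is given in a context-free way. Although XWS provides less supervision than one-shot learning, we propose a novel method, X-NER, that can outperform state-of-the-art one-shot NER methods. We first mine entity spans that are similar to the example entities from an unlabelled training corpus. Rather than using entity span representations from language models, we find it more effective to compare the context distributions before and after the span is replaced by the entity example. We then use the top-ranked spans as pseudo-labels to train an NER tagger. Extensive experiments and analyses on 4 NER datasets show that X-NER achieves superior end-to-end NER performance, significantly outperforming state-of-the-art few-shot methods with 1-shot supervision as well as ChatGPT annotations. Finally, X-NER has several notable properties, such as inheriting the cross-lingual abilities of the underlying language models.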

Improving Machine Translation with Large Language Models: A Preliminary Study with Cooperative Decoding

paper_url: http://arxiv.org/abs/2311.02851
repo_url: https://github.com/lemon0830/CoDec
paper_authors: Jiali Zeng, Fandong Meng, Yongjing Yin, Jie Zhou
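for: This study analyzes the strengths and limitations of various commercial NMT systems and MT-oriented LLMs, and builds on the findings to propose a hybrid method that complements NMT systems.
methods: The study conducts a comparative analysis of NMT systems and MT-oriented LLMs and, based on the findings, develops a hybrid method called Cooperative Decoding (CoDec).
results: MT-oriented LLMs can complement NMT systems in handling complex translation scenarios, and CoDec proves effective and efficient on the WMT22 test sets and a newly collected WebCrawl test set.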

Abstract
Contemporary translation engines built upon the encoder-decoder framework have reached a high level of development, while the emergence of Large Language Models (LLMs) has disrupted their position by offering the potential for achieving superior translation quality. Therefore, it is crucial to understand in which scenarios LLMs outperform traditional NMT systems and how to leverage their strengths. In this paper, we first conduct a comprehensive analysis to assess the strengths and limitations of various commercial NMT systems and MT-oriented LLMs. Our findings indicate that neither NMT nor MT-oriented LLMs alone can effectively address all the translation issues, but MT-oriented LLMs can serve as a promising complement to the NMT systems. Building upon these insights, we explore hybrid methods and propose Cooperative Decoding (CoDec), which treats NMT systems as a pretranslation model and MT-oriented LLMs as a supplemental solution to handle complex scenarios beyond the capability of NMT alone. The results on the WMT22 test sets and a newly collected test set WebCrawl demonstrate the effectiveness and efficiency of CoDec, highlighting its potential as a robust solution for combining NMT systems with MT-oriented LLMs in machine translation.

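Summary
Contemporary translation engines built on the encoder-decoder framework have reached a high level of development, while the emergence of large language models (LLMs) has disrupted their position by offering the potential for superior translation quality. It is therefore important to understand in which scenarios LLMs excel and how to leverage their strengths. In this paper, we first conduct a comprehensive analysis of the strengths and limitations of various commercial NMT systems and MT-oriented LLMs. Our findings indicate that neither NMT systems nor MT-oriented LLMs alone can address all translation issues, but MT-oriented LLMs can serve as a promising complement to NMT systems. Building on these insights, we explore hybrid methods and propose Cooperative Decoding (CoDec), which treats NMT systems as a pretranslation model and MT-oriented LLMs as a supplemental solution for complex scenarios beyond the capability of NMT alone. Results on the WMT22 test sets and our newly collected WebCrawl test set demonstrate the effectiveness and efficiency of CoDec, highlighting its potential as a robust solution for combining NMT systems with MT-oriented LLMs in machine translation.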

Tailoring Self-Rationalizers with Multi-Reward Distillation

paper_url: http://arxiv.org/abs/2311.02805
repo_url: https://github.com/ink-usc/rationalemultirewarddistillation
paper_authors: Sahana Ramnath, Brihi Joshi, Skyler Hallinan, Ximing Lu, Liunian Harold Li, Aaron Chan, Jack Hessel, Yejin Choi, Xiang Ren
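for: This paper aims to improve the self-rationalization ability of small language models (LMs) to aid question answering.
methods: The paper proposes MaRio (Multi-rewArd RatIOnalization), a multi-reward conditioned self-rationalization algorithm that optimizes multiple properties such as plausibility, diversity, and consistency to improve the quality of small LMs' rationales.
results: Experiments show that MaRio improves both task accuracy and the self-rationalization quality of small LMs more than a supervised fine-tuning (SFT) baseline. Human evaluations confirm that MaRio rationales are preferred over SFT rationales and show qualitative improvements in plausibility and consistency.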

Abstract
Large language models (LMs) are capable of generating free-text rationales to aid question answering. However, prior work 1) suggests that useful self-rationalization is emergent only at significant scales (e.g., 175B parameter GPT-3); and 2) focuses largely on downstream performance, ignoring the semantics of the rationales themselves, e.g., are they faithful, true, and helpful for humans? In this work, we enable small-scale LMs (approx. 200x smaller than GPT-3) to generate rationales that not only improve downstream task performance, but are also more plausible, consistent, and diverse, assessed both by automatic and human evaluation. Our method, MaRio (Multi-rewArd RatIOnalization), is a multi-reward conditioned self-rationalization algorithm that optimizes multiple distinct properties like plausibility, diversity and consistency. Results on five difficult question-answering datasets StrategyQA, QuaRel, OpenBookQA, NumerSense and QASC show that not only does MaRio improve task accuracy, but it also improves the self-rationalization quality of small LMs across the aforementioned axes better than a supervised fine-tuning (SFT) baseline. Extensive human evaluations confirm that MaRio rationales are preferred vs. SFT rationales, as well as qualitative improvements in plausibility and consistency.

Summary