cs.CL - 2023-11-27

Reducing Gender Bias in Machine Translation through Counterfactual Data Generation

  • paper_url: http://arxiv.org/abs/2311.16362
  • repo_url: None
  • paper_authors: Ranjita Naik, Spencer Rarrick, Vishal Chowdhary
  • for: Improving the gender accuracy and balance of NMT systems
  • methods: Fine-tuning on handcrafted gender-balanced data; recovering lost quality with modified training objectives or additional models at inference; generating domain-adaptation data with counterfactual data generation techniques (see the sketch at the end of this entry)
  • results: Catastrophic forgetting is reduced and gender accuracy improves for NMT systems into morphologically rich languages such as French, Spanish, and Italian, without significant loss in translation quality
    Abstract Recent advances in neural methods have led to substantial improvement in the quality of Neural Machine Translation (NMT) systems. However, these systems frequently produce translations with inaccurate gender (Stanovsky et al., 2019), which can be traced to bias in training data. Saunders and Byrne (2020) tackle this problem with a handcrafted dataset containing balanced gendered profession words. By using this data to fine-tune an existing NMT model, they show that gender bias can be significantly mitigated, albeit at the expense of translation quality due to catastrophic forgetting. They recover some of the lost quality with modified training objectives or additional models at inference. We find, however, that simply supplementing the handcrafted dataset with a random sample from the base model training corpus is enough to significantly reduce the catastrophic forgetting. We also propose a novel domain-adaptation technique that leverages in-domain data created with the counterfactual data generation techniques proposed by Zmigrod et al. (2019) to further improve accuracy on the WinoMT challenge test set without significant loss in translation quality. We show its effectiveness in NMT systems from English into three morphologically rich languages French, Spanish, and Italian. The relevant dataset and code will be available at Github.
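Below is a minimal, hypothetical sketch of the counterfactual augmentation idea. It is not the authors' pipeline: Zmigrod et al. (2019) use morphological reinflection models to reinflect whole sentences, whereas this sketch only applies a dictionary-based gender swap to English source sentences (the swap table and example sentence are invented for illustration).

```python
# Illustrative sketch of counterfactual data augmentation for gender bias
# mitigation. NOT the authors' pipeline: Zmigrod et al. (2019) use
# morphological reinflection; here we approximate the idea with a simple
# dictionary-based swap on English source sentences (hypothetical data).

SWAPS = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",
    "his": "her",
    "man": "woman", "woman": "man",
}

def counterfactual(sentence: str) -> str:
    """Return a gender-swapped copy of the sentence (token-level swap)."""
    out = []
    for tok in sentence.split():
        swapped = SWAPS.get(tok.lower(), tok)
        # Preserve capitalization of the original token.
        out.append(swapped.capitalize() if tok[0].isupper() else swapped)
    return " ".join(out)

corpus = ["The doctor said he would arrive soon."]
# Fine-tuning data = original + counterfactual copies + a random sample of the
# base training corpus (the paper finds the random sample reduces forgetting).
augmented = corpus + [counterfactual(s) for s in corpus]
print(augmented)
```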

Comprehensive Benchmarking of Entropy and Margin Based Scoring Metrics for Data Selection

  • paper_url: http://arxiv.org/abs/2311.16302
  • repo_url: None
  • paper_authors: Anusha Sabbineni, Nikhil Anand, Maria Minakova
  • for: Assessing the effectiveness of data selection methods in low-resource, industry-scale settings, using Entropy and Error L2-Norm (EL2N) scores to select important training examples
  • methods: Entropy and EL2N scores rate prospective examples by "usefulness" or "difficulty" and are used to curate high-quality datasets from a large pool of Weak Signal Labeled data (see the sketch at the end of this entry)
  • results: Score-based selection yields a 2% decrease in semantic error rate and a 4%-7% decrease in domain classification error rate compared with the random-selection baseline
    Abstract While data selection methods have been studied extensively in active learning, data pruning, and data augmentation settings, there is little evidence for the efficacy of these methods in industry scale settings, particularly in low-resource languages. Our work presents ways of assessing prospective training examples in those settings for their "usefulness" or "difficulty". We also demonstrate how these measures can be used in selecting important examples for training supervised machine learning models. We primarily experiment with entropy and Error L2-Norm (EL2N) scores. We use these metrics to curate high quality datasets from a large pool of \textit{Weak Signal Labeled} data, which assigns no-defect high confidence hypotheses during inference as ground truth labels. We then conduct training data augmentation experiments using these de-identified datasets and demonstrate that score-based selection can result in a 2% decrease in semantic error rate and 4%-7% decrease in domain classification error rate when compared to the baseline technique of random selection.
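The two scoring metrics the paper benchmarks can be written down compactly. The sketch below computes predictive entropy and EL2N (the L2 norm of the softmax-minus-one-hot error vector, following Paul et al., 2021) from model logits; the logits and labels are toy values, and the fraction of top-scoring examples to keep is an arbitrary choice.

```python
# Sketch of the two scoring metrics benchmarked for data selection:
# predictive entropy and Error L2-Norm (EL2N). Logits here are made up;
# in practice they come from an early checkpoint of the model being trained.
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy_score(logits):
    """Higher entropy = the model is less certain about this example."""
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def el2n_score(logits, labels, num_classes):
    """EL2N = L2 norm of (softmax probabilities - one-hot label)."""
    p = softmax(logits)
    onehot = np.eye(num_classes)[labels]
    return np.linalg.norm(p - onehot, axis=-1)

logits = np.array([[2.0, 0.1, -1.0], [0.3, 0.2, 0.1]])  # 2 examples, 3 classes
labels = np.array([0, 2])
scores = el2n_score(logits, labels, num_classes=3)
# Keep the highest-scoring ("hardest") examples for training, e.g. the top half.
keep = np.argsort(-scores)[: len(scores) // 2]
print(entropy_score(logits), scores, keep)
```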

Influence Scores at Scale for Efficient Language Data Sampling

  • paper_url: http://arxiv.org/abs/2311.16298
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Nikhil Anand, Joshua Tan, Maria Minakova
  • for: Exploring the applicability of influence scores to language classification tasks
  • methods: A diverse subset of influence scores originally proposed in computer-vision settings, including model-confidence and gradient-based methods such as variance of gradients (VoG), is evaluated on SNLI and stress-tested in an NLU model stack (see the sketch at the end of this entry)
  • results: In many cases, encoder-based language models can be fine-tuned on roughly 50% of the original data without degradation in performance metrics; the paper also summarizes lessons learned from applying off-the-shelf influence-score implementations and quantifies the effects of noisy and class-imbalanced data
    Abstract Modern ML systems ingest data aggregated from diverse sources, such as synthetic, human-annotated, and live customer traffic. Understanding \textit{which} examples are important to the performance of a learning algorithm is crucial for efficient model training. Recently, a growing body of literature has given rise to various "influence scores," which use training artifacts such as model confidence or checkpointed gradients to identify important subsets of data. However, these methods have primarily been developed in computer vision settings, and it remains unclear how well they generalize to language-based tasks using pretrained models. In this paper, we explore the applicability of influence scores in language classification tasks. We evaluate a diverse subset of these scores on the SNLI dataset by quantifying accuracy changes in response to pruning training data through random and influence-score-based sampling. We then stress-test one of the scores -- "variance of gradients" (VoG) from Agarwal et al. (2022) -- in an NLU model stack that was exposed to dynamic user speech patterns in a voice assistant type of setting. Our experiments demonstrate that in many cases, encoder-based language models can be finetuned on roughly 50% of the original data without degradation in performance metrics. Along the way, we summarize lessons learned from applying out-of-the-box implementations of influence scores, quantify the effects of noisy and class-imbalanced data, and offer recommendations on score-based sampling for better accuracy and training efficiency.
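As a rough illustration of the stress-tested score, the sketch below computes a simplified variance-of-gradients (VoG) value per example: the variance, across training checkpoints, of the gradient of the loss with respect to the input, averaged over dimensions. The tiny linear "checkpoints" and random example are stand-ins; Agarwal et al. (2022) define VoG on pre-softmax activations of a trained vision model, so this is an approximation of the idea, not their exact recipe.

```python
# Simplified variance-of-gradients (VoG) influence score.
import torch
import torch.nn as nn

def example_gradient(model, x, y, loss_fn):
    """Gradient of the loss w.r.t. the input for a single example."""
    x = x.clone().requires_grad_(True)
    loss = loss_fn(model(x), y)
    (grad,) = torch.autograd.grad(loss, x)
    return grad.detach()

def vog_score(checkpoint_models, x, y, loss_fn):
    """Variance of per-example input gradients across checkpoints, averaged over dims."""
    grads = torch.stack([example_gradient(m, x, y, loss_fn) for m in checkpoint_models])
    return grads.var(dim=0, unbiased=False).mean().item()

# Toy stand-in: three "checkpoints" of a linear classifier over a 4-dim input.
torch.manual_seed(0)
checkpoints = [nn.Linear(4, 3) for _ in range(3)]
x, y = torch.randn(1, 4), torch.tensor([1])
# High-VoG examples changed the model most during training; score-based
# sampling keeps such examples when pruning ~50% of the data.
print(vog_score(checkpoints, x, y, nn.CrossEntropyLoss()))
```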

Student Mastery or AI Deception? Analyzing ChatGPT’s Assessment Proficiency and Evaluating Detection Strategies

  • paper_url: http://arxiv.org/abs/2311.16292
  • repo_url: None
  • paper_authors: Kevin Wang, Seth Akins, Abdallah Mohammed, Ramon Lawrence
  • for: This paper investigates the performance of ChatGPT in completing introductory computer science assignments and the effectiveness of existing detection methods in identifying AI solutions.
  • methods: The paper evaluates ChatGPT’s performance across three courses (CS1, CS2, and databases) and examines existing detection methods such as MOSS, JPlag, and GPTzero, as well as instructors’ and teaching assistants’ heuristics to distinguish between student and AI code.
  • results: ChatGPT completes almost all introductory assessments perfectly, and existing detection methods have mixed success in identifying AI solutions. Instructors’ and teaching assistants’ heuristics are not sufficiently accurate in distinguishing between student and AI code. The findings emphasize the need for adapting assessments and improved detection methods.
    Abstract Generative AI systems such as ChatGPT have a disruptive effect on learning and assessment. Computer science requires practice to develop skills in problem solving and programming that are traditionally developed using assignments. Generative AI has the capability of completing these assignments for students with high accuracy, which dramatically increases the potential for academic integrity issues and students not achieving desired learning outcomes. This work investigates the performance of ChatGPT by evaluating it across three courses (CS1,CS2,databases). ChatGPT completes almost all introductory assessments perfectly. Existing detection methods, such as MOSS and JPlag (based on similarity metrics) and GPTzero (AI detection), have mixed success in identifying AI solutions. Evaluating instructors and teaching assistants using heuristics to distinguish between student and AI code shows that their detection is not sufficiently accurate. These observations emphasize the need for adapting assessments and improved detection methods.

Applications of Large Language Models in Data Processing: Innovative Approaches to Segmenting and Renewing Information

  • paper_url: http://arxiv.org/abs/2311.16267
  • repo_url: None
  • paper_authors: Yu-Chen Lin, Akhilesh Kumar, Wen-Liang Zhang, Norman Chang, Muhammad Zakir, Rucha Apte, Chao Wang, Jyh-Shing Roger Jang
  • for: Investigating effective code-generation methods for specific-domain applications, including using Large Language Models (LLMs) for data segmentation and renewal and stimulating deeper reasoning through prompt adjustments
  • methods: Using a real company product as an example, user manuals, API documentation, and other data are segmented and converted into semantic vectors that better reflect their true positioning; user requirements are likewise embedded to retrieve the most relevant content (see the sketch at the end of this entry)
  • results: Various prompting techniques achieve about 70% accuracy on simple to medium-complexity tasks, and llama2-based fine-tuning is tested for generating more scripts from a limited number in professional-domain code generation
    Abstract Our paper investigates effective methods for code generation in "specific-domain" applications, including the use of Large Language Models (LLMs) for data segmentation and renewal, as well as stimulating deeper thinking in LLMs through prompt adjustments. Using a real company product as an example, we provide user manuals, API documentation, and other data. The ideas discussed in this paper help segment and then convert this data into semantic vectors to better reflect their true positioning. Subsequently, user requirements are transformed into vectors to retrieve the most relevant content, achieving about 70% accuracy in simple to medium-complexity tasks through various prompt techniques. This paper is the first to enhance specific-domain code generation effectiveness from this perspective. Additionally, we experiment with generating more scripts from a limited number using llama2-based fine-tuning to test its effectiveness in professional domain code generation. This is a challenging and promising field, and once achieved, it will not only lead to breakthroughs in LLM development across multiple industries but also enable LLMs to understand and learn any new knowledge effectively.
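The retrieval step described above (embed documentation segments, embed the user requirement, fetch the most relevant context before prompting) can be sketched as follows. TF-IDF vectors are used as a runnable stand-in for the semantic vectors the paper builds, and the documentation snippets and query are invented examples.

```python
# Sketch of retrieval-augmented, domain-specific code generation:
# segment documentation, embed segments, embed the user requirement, and
# retrieve the most relevant segments to condition code generation on.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

segments = [
    "API set_voltage(node, value): sets the supply voltage of a node.",
    "User manual: exporting simulation results to CSV.",
    "API run_analysis(mode): runs thermal or power analysis.",
]  # hypothetical documentation snippets
query = "write a script that sets the voltage of VDD and runs power analysis"

vec = TfidfVectorizer().fit(segments + [query])
sims = cosine_similarity(vec.transform([query]), vec.transform(segments))[0]
top = sims.argsort()[::-1][:2]                 # top-2 most relevant segments
context = "\n".join(segments[i] for i in top)
prompt = f"Context:\n{context}\n\nTask: {query}\nCode:"
print(prompt)
```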

An Exploration of Left-Corner Transformations

  • paper_url: http://arxiv.org/abs/2311.16258
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Andreas Opedal, Eleftheria Tsipidi, Tiago Pimentel, Ryan Cotterell, Tim Vieira
  • for: Making context-free grammars parsable top-down by removing left recursion, building on the left-corner and speculation transformations
  • methods: A generalized left-corner transformation (GLCT) that unifies the left-corner transformation with the speculation transformation, supports semiring-weighted production rules, and offers finer-grained control over which left corners may be moved (a worked example follows this entry)
  • results: GLCT and speculation define equivalent weighted languages, but GLCT replaces left recursion with right recursion; empirically, GLCT efficiently eliminates left recursion from grammars of nine languages
    Abstract The left-corner transformation (Rosenkrantz and Lewis, 1970) is used to remove left recursion from context-free grammars, which is an important step towards making the grammar parsable top-down with simple techniques. This paper generalizes prior left-corner transformations to support semiring-weighted production rules and to provide finer-grained control over which left corners may be moved. Our generalized left-corner transformation (GLCT) arose from unifying the left-corner transformation and speculation transformation (Eisner and Blatz, 2007), originally for logic programming. Our new transformation and speculation define equivalent weighted languages. Yet, their derivation trees are structurally different in an important way: GLCT replaces left recursion with right recursion, and speculation does not. We also provide several technical results regarding the formal relationships between the outputs of GLCT, speculation, and the original grammar. Lastly, we empirically investigate the efficiency of GLCT for left-recursion elimination from grammars of nine languages.
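For readers unfamiliar with the construction, the display below shows the classic elimination of direct left recursion, a much simpler special case of what the paper addresses; GLCT generalizes this to semiring-weighted rules and gives control over which left corners are moved, likewise trading left recursion for right recursion.

```latex
% Classic removal of direct left recursion in A -> A a | b, so the grammar
% can be parsed top-down. GLCT generalizes this idea; note how the left
% recursion on A becomes right recursion on the new nonterminal A'.
\[
  A \to A\,\alpha \mid \beta
  \qquad\Longrightarrow\qquad
  A \to \beta\,A', \quad
  A' \to \alpha\,A' \mid \varepsilon
\]
```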

How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

  • paper_url: http://arxiv.org/abs/2311.16101
  • repo_url: https://github.com/ucsc-vlaa/vllm-safety-benchmark
  • paper_authors: Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, Cihang Xie
  • for: Studying the potential of Vision LLMs (VLLMs) in visual reasoning; unlike prior studies, the focus shifts from standard performance to a comprehensive safety evaluation suite covering out-of-distribution (OOD) generalization and adversarial robustness
  • methods: Two novel VQA datasets, each with one variant, designed to test model performance under challenging conditions; a straightforward attack strategy for misleading VLLMs into producing visually unrelated responses; and an assessment of two jailbreaking strategies targeting either the vision or the language component of VLLMs
  • results: Evaluating 21 diverse models shows that 1) current VLLMs struggle with OOD texts but not images, unless the visual information is limited, and 2) these VLLMs can be easily misled by deceiving the vision encoder alone, and their vision-language training often compromises safety protocols; the suite is released at https://github.com/UCSC-VLAA/vllm-safety-benchmark
    Abstract This work focuses on the potential of Vision LLMs (VLLMs) in visual reasoning. Different from prior studies, we shift our focus from evaluating standard performance to introducing a comprehensive safety evaluation suite, covering both out-of-distribution (OOD) generalization and adversarial robustness. For the OOD evaluation, we present two novel VQA datasets, each with one variant, designed to test model performance under challenging conditions. In exploring adversarial robustness, we propose a straightforward attack strategy for misleading VLLMs to produce visual-unrelated responses. Moreover, we assess the efficacy of two jailbreaking strategies, targeting either the vision or language component of VLLMs. Our evaluation of 21 diverse models, ranging from open-source VLLMs to GPT-4V, yields interesting observations: 1) Current VLLMs struggle with OOD texts but not images, unless the visual information is limited; and 2) These VLLMs can be easily misled by deceiving vision encoders only, and their vision-language training often compromise safety protocols. We release this safety evaluation suite at https://github.com/UCSC-VLAA/vllm-safety-benchmark.

DUnE: Dataset for Unified Editing

  • paper_url: http://arxiv.org/abs/2311.16087
  • repo_url: https://github.com/feyzaakyurek/dune
  • paper_authors: Afra Feyza Akyürek, Eric Pan, Garry Kuwanto, Derry Wijaya
  • for: Broadening model editing so that language-model knowledge or representations can be modified to produce desired outputs without a comprehensive retraining process
  • methods: An edit is defined as any natural language expression that solicits a change in the model's outputs, covering cases such as debiasing and rectifying reasoning errors; the editing benchmark DUnE is introduced, where edits are natural language sentences
  • results: Extensive experiments show that retrieval-augmented language modeling can outperform specialized editing techniques, and that neither family of approaches fully solves the generalized editing problem covered by the benchmark
    Abstract Even the most advanced language models remain susceptible to errors necessitating to modify these models without initiating a comprehensive retraining process. Model editing refers to the modification of a model's knowledge or representations in a manner that produces the desired outcomes. Prior research primarily centered around editing factual data e.g. "Messi plays for Inter Miami" confining the definition of an edit to a knowledge triplet i.e. (subject, object, relation). However, as the applications of language models expand, so do the diverse ways in which we wish to edit and refine their outputs. In this study, we broaden the scope of the editing problem to include an array of editing cases such as debiasing and rectifying reasoning errors and define an edit as any natural language expression that solicits a change in the model's outputs. We are introducing DUnE-an editing benchmark where edits are natural language sentences and propose that DUnE presents a challenging yet relevant task. To substantiate this claim, we conduct an extensive series of experiments testing various editing approaches to address DUnE, demonstrating their respective strengths and weaknesses. We show that retrieval-augmented language modeling can outperform specialized editing techniques and neither set of approaches has fully solved the generalized editing problem covered by our benchmark.

BERT Goes Off-Topic: Investigating the Domain Transfer Challenge using Genre Classification

  • paper_url: http://arxiv.org/abs/2311.16083
  • repo_url: https://github.com/dminus1/genre
  • paper_authors: Dmitri Roussinov, Serge Sharoff
  • for: Examining the performance gap that pre-trained language models (PLMs) show on text classification when the underlying distribution of topics changes
  • methods: The phenomenon is quantified empirically with a large corpus and a large set of topics, and verified for both classic PLMs such as BERT and modern large models such as GPT-3; as a remedy, the training data is augmented with topically-controlled synthetic texts, improving F1 by up to 50% for some topics, nearing on-topic training results, while other topics show little to no improvement
  • results: Experiments confirm the effectiveness of the approach, and the methodology applies to other classification tasks such as gender, authorship, or sentiment classification; code and data are available at https://github.com/dminus1/genre
    Abstract While performance of many text classification tasks has been recently improved due to Pre-trained Language Models (PLMs), in this paper we show that they still suffer from a performance gap when the underlying distribution of topics changes. For example, a genre classifier trained on \textit{political} topics often fails when tested on documents about \textit{sport} or \textit{medicine}. In this work, we quantify this phenomenon empirically with a large corpus and a large set of topics. Consequently, we verify that domain transfer remains challenging both for classic PLMs, such as BERT, and for modern large models, such as GPT-3. We also suggest and successfully test a possible remedy: after augmenting the training dataset with topically-controlled synthetic texts, the F1 score improves by up to 50\% for some topics, nearing on-topic training results, while others show little to no improvement. While our empirical results focus on genre classification, our methodology is applicable to other classification tasks such as gender, authorship, or sentiment classification. The code and data to replicate the experiments are available at https://github.com/dminus1/genre

A Quantitative Approach to Understand Self-Supervised Models as Cross-lingual Feature Extractors

  • paper_url: http://arxiv.org/abs/2311.15954
  • repo_url: https://github.com/stellali7/ssl_psr
  • paper_authors: Shuyue Stella Li, Beining Xu, Xiangyu Zhang, Hexin Liu, Wenhan Chao, Leibny Paola Garcia
  • for: Studying the features extracted by English self-supervised learning (SSL) models in cross-lingual settings and proposing a new metric to predict the quality of feature representations
  • methods: With automatic speech recognition (ASR) as the downstream task, the effects of model size, training objectives, and model architecture on the models' performance as feature extractors are analyzed over a set of topologically diverse corpora
  • results: The contrastive loss in the wav2vec 2.0 objective facilitates more effective cross-lingual feature extraction, and the proposed Phonetic-Syntax Ratio (PSR) correlates positively with ASR performance, suggesting that phonetic information extracted by monolingual SSL models can be used for downstream tasks in cross-lingual settings
    Abstract In this work, we study the features extracted by English self-supervised learning (SSL) models in cross-lingual contexts and propose a new metric to predict the quality of feature representations. Using automatic speech recognition (ASR) as a downstream task, we analyze the effect of model size, training objectives, and model architecture on the models' performance as a feature extractor for a set of topologically diverse corpora. We develop a novel metric, the Phonetic-Syntax Ratio (PSR), to measure the phonetic and synthetic information in the extracted representations using deep generalized canonical correlation analysis. Results show the contrastive loss in the wav2vec2.0 objective facilitates more effective cross-lingual feature extraction. There is a positive correlation between PSR scores and ASR performance, suggesting that phonetic information extracted by monolingual SSL models can be used for downstream tasks in cross-lingual settings. The proposed metric is an effective indicator of the quality of the representations and can be useful for model selection.

Leveraging deep active learning to identify low-resource mobility functioning information in public clinical notes

  • paper_url: http://arxiv.org/abs/2311.15946
  • repo_url: None
  • paper_authors: Tuan-Dung Le, Zhuqi Miao, Samuel Alvarado, Brittany Smith, William Paiva, Thanh Thieu
  • for: Enabling automatic extraction and analysis of functioning information from free-text clinical notes to better assess whole-person health
  • methods: A pool of candidate sentences is built from the National NLP Clinical Challenges (n2c2) research dataset via keyword expansion; query-by-committee sampling weighted by density representativeness selects informative sentences for human annotation (see the sketch at the end of this entry); BERT and CRF models are then trained, and their predictions guide the selection of new sentences for subsequent annotation iterations
  • results: The final dataset contains 4,265 sentences with 11,784 entities (5,511 Action, 5,328 Mobility, 306 Assistance, and 639 Quantification); inter-annotator agreement averages 0.72 for exact matching and 0.91 for partial matching; the best F1 scores of common BERT models and state-of-the-art nested NER models are 0.84 (Action), 0.70 (Mobility), 0.62 (Assistance), and 0.71 (Quantification), showing that NER models can accurately extract mobility functioning information from clinical text; the publicly available annotated dataset will facilitate further research on capturing functioning information in EHRs
    Abstract Function is increasingly recognized as an important indicator of whole-person health, although it receives little attention in clinical natural language processing research. We introduce the first public annotated dataset specifically on the Mobility domain of the International Classification of Functioning, Disability and Health (ICF), aiming to facilitate automatic extraction and analysis of functioning information from free-text clinical notes. We utilize the National NLP Clinical Challenges (n2c2) research dataset to construct a pool of candidate sentences using keyword expansion. Our active learning approach, using query-by-committee sampling weighted by density representativeness, selects informative sentences for human annotation. We train BERT and CRF models, and use predictions from these models to guide the selection of new sentences for subsequent annotation iterations. Our final dataset consists of 4,265 sentences with a total of 11,784 entities, including 5,511 Action entities, 5,328 Mobility entities, 306 Assistance entities, and 639 Quantification entities. The inter-annotator agreement (IAA), averaged over all entity types, is 0.72 for exact matching and 0.91 for partial matching. We also train and evaluate common BERT models and state-of-the-art Nested NER models. The best F1 scores are 0.84 for Action, 0.7 for Mobility, 0.62 for Assistance, and 0.71 for Quantification. Empirical results demonstrate promising potential of NER models to accurately extract mobility functioning information from clinical text. The public availability of our annotated dataset will facilitate further research to comprehensively capture functioning information in electronic health records (EHRs).
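The active-learning criterion mentioned in the methods bullet can be sketched as committee disagreement (vote entropy) weighted by density representativeness (mean similarity to the rest of the unlabeled pool). The committee votes and sentence embeddings below are toy values; the paper applies this to candidate clinical sentences.

```python
# Sketch of query-by-committee sampling weighted by density representativeness.
import numpy as np

def vote_entropy(votes):
    """Disagreement per example: entropy of the committee's label votes."""
    n_models = votes.shape[0]
    ent = []
    for col in votes.T:
        _, counts = np.unique(col, return_counts=True)
        p = counts / n_models
        ent.append(-(p * np.log(p + 1e-12)).sum())
    return np.array(ent)

def density(embeddings):
    """Representativeness: mean cosine similarity to the rest of the pool."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    return (sim.sum(axis=1) - 1.0) / (len(x) - 1)

votes = np.array([[0, 1, 1, 0],     # predictions of 3 committee members
                  [0, 1, 0, 0],     # over 4 unlabeled sentences
                  [1, 1, 0, 0]])
emb = np.random.default_rng(0).normal(size=(4, 8))
score = vote_entropy(votes) * density(emb)   # informative AND representative
to_annotate = np.argsort(-score)[:2]         # send top-2 to human annotators
print(score, to_annotate)
```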

Tell2Design: A Dataset for Language-Guided Floor Plan Generation

  • paper_url: http://arxiv.org/abs/2311.15941
  • repo_url: https://github.com/lengsicong/tell2design
  • paper_authors: Sicong Leng, Yang Zhou, Mohammed Haroon Dupty, Wee Sun Lee, Sam Conrad Joyce, Wei Lu
  • for: Studying how to generate designs directly from natural language descriptions, with floor plan generation as the initial research area
  • methods: Language-conditional generative models are examined, and a sequence-to-sequence model is proposed as a strong baseline for this task
  • results: A new dataset, Tell2Design (T2D), with more than 80k floor plan designs paired with natural language instructions is introduced; several text-conditional image generation models are benchmarked, and human evaluations and an analysis of human performance are provided
    Abstract We consider the task of generating designs directly from natural language descriptions, and consider floor plan generation as the initial research area. Language conditional generative models have recently been very successful in generating high-quality artistic images. However, designs must satisfy different constraints that are not present in generating artistic images, particularly spatial and relational constraints. We make multiple contributions to initiate research on this task. First, we introduce a novel dataset, \textit{Tell2Design} (T2D), which contains more than $80k$ floor plan designs associated with natural language instructions. Second, we propose a Sequence-to-Sequence model that can serve as a strong baseline for future research. Third, we benchmark this task with several text-conditional image generation models. We conclude by conducting human evaluations on the generated samples and providing an analysis of human performance. We hope our contributions will propel the research on language-guided design generation forward.

ChartLlama: A Multimodal LLM for Chart Understanding and Generation

  • paper_url: http://arxiv.org/abs/2311.16483
  • repo_url: None
  • paper_authors: Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, Hanwang Zhang
  • for: This paper aims to improve the ability of multi-modal language models to understand and interpret chart figures by creating a high-quality instruction-tuning dataset and training a specialized model, ChartLlama.
  • methods: The authors use a multi-step data generation process to create diverse, high-quality instruction-tuning data, and train ChartLlama using this dataset.
  • results: ChartLlama outperforms all prior methods on the ChartQA, Chart-to-text, and Chart-extraction evaluation benchmarks, and significantly improves upon the baseline on a specially compiled chart dataset that includes new chart and task types.
    Abstract Multi-modal large language models have demonstrated impressive performances on most vision-language tasks. However, the model generally lacks the understanding capabilities for specific domain data, particularly when it comes to interpreting chart figures. This is mainly due to the lack of relevant multi-modal instruction tuning datasets. In this article, we create a high-quality instruction-tuning dataset leveraging GPT-4. We develop a multi-step data generation process in which different steps are responsible for generating tabular data, creating chart figures, and designing instruction tuning data separately. Our method's flexibility enables us to generate diverse, high-quality instruction-tuning data consistently and efficiently while maintaining a low resource expenditure. Additionally, it allows us to incorporate a wider variety of chart and task types not yet featured in existing datasets. Next, we introduce ChartLlama, a multi-modal large language model that we've trained using our created dataset. ChartLlama outperforms all prior methods in ChartQA, Chart-to-text, and Chart-extraction evaluation benchmarks. Additionally, ChartLlama significantly improves upon the baseline in our specially compiled chart dataset, which includes new chart and task types. The results of ChartLlama confirm the value and huge potential of our proposed data generation method in enhancing chart comprehension.

Data Generation for Post-OCR correction of Cyrillic handwriting

  • paper_url: http://arxiv.org/abs/2311.15896
  • repo_url: https://github.com/dbrainio/cyrillichandwritingpoc
  • paper_authors: Evgenii Davydkin, Aleksandr Markelov, Egor Iuldashev, Anton Dudkin, Ivan Krivorotov
  • for: This paper addresses the lack of large text corpora for training language-based POC models for handwritten Cyrillic text.
  • methods: The study uses a synthetic handwriting generation engine based on Bézier curves to generate highly realistic handwritten text in any amount (see the sketch at the end of this entry), and applies a Handwritten Text Recognition (HTR) model to identify OCR errors.
  • results: The approach is evaluated on the HWR200 and School_notebooks_RU datasets, and the results show that the POC model can correct OCR errors with high accuracy, as measured by Word Accuracy Rate (WAR) and Character Accuracy Rate (CAR).
    Abstract This paper introduces a novel approach to post-Optical Character Recognition Correction (POC) for handwritten Cyrillic text, addressing a significant gap in current research methodologies. This gap is due to the lack of large text corporas that provide OCR errors for further training of language-based POC models, which are demanding in terms of corpora size. Our study primarily focuses on the development and application of a synthetic handwriting generation engine based on B\'ezier curves. Such an engine generates highly realistic handwritten text in any amounts, which we utilize to create a substantial dataset by transforming Russian text corpora sourced from the internet. We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for our POC model training. The correction model is trained on a 90-symbol input context, utilizing a pre-trained T5 architecture with a seq2seq correction task. We evaluate our approach on HWR200 and School_notebooks_RU datasets as they provide significant challenges in the HTR domain. Furthermore, POC can be used to highlight errors for teachers, evaluating student performance. This can be done simply by comparing sentences before and after correction, displaying differences in text. Our primary contribution lies in the innovative use of B\'ezier curves for Cyrillic text generation and subsequent error correction using a specialized POC model. We validate our approach by presenting Word Accuracy Rate (WAR) and Character Accuracy Rate (CAR) results, both with and without post-OCR correction, using real open corporas of handwritten Cyrillic text. These results, coupled with our methodology, are designed to be reproducible, paving the way for further advancements in the field of OCR and handwritten text analysis. Paper contributions can be found in https://github.com/dbrainio/CyrillicHandwritingPOC
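The core primitive of the synthetic handwriting engine, a cubic Bézier curve traced between control points, can be sketched as below. The control points are arbitrary; the actual engine chooses and perturbs them to imitate Cyrillic strokes, rasterizes the result, and then runs HTR on the images to harvest OCR errors for training the correction model.

```python
# Sketch of the core primitive behind the synthetic handwriting engine:
# a cubic Bezier curve evaluated at parameter t, used to trace pen strokes.
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n_points=50):
    """Sample n_points along the cubic Bezier curve defined by 4 control points."""
    t = np.linspace(0.0, 1.0, n_points)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# One hypothetical stroke: start, two control points, end (x, y in pixels).
stroke = cubic_bezier(np.array([0, 0]), np.array([10, 40]),
                      np.array([30, -20]), np.array([40, 10]))
# Rasterizing many such strokes (with jitter in the control points) yields
# realistic-looking handwritten text in arbitrary amounts.
print(stroke[:3])
```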

Knowledge Unlearning for LLMs: Tasks, Methods, and Challenges

  • paper_url: http://arxiv.org/abs/2311.15766
  • repo_url: None
  • paper_authors: Nianwen Si, Hao Zhang, Heyu Chang, Wenlin Zhang, Dan Qu, Weiqiang Zhang
  • for: Surveying knowledge unlearning for large language models (LLMs), addressing the risk that deployed LLMs retain faulty or harmful knowledge
  • methods: Existing knowledge unlearning methods are categorized into three classes: parameter optimization, parameter merging, and in-context learning; these remove harmful knowledge efficiently without affecting unrelated knowledge in the model
  • results: The survey formally defines the knowledge unlearning problem, distinguishes it from related work, details existing methods and the evaluation datasets they use, and closes with ongoing challenges and future directions
    Abstract In recent years, large language models (LLMs) have spurred a new research paradigm in natural language processing. Despite their excellent capability in knowledge-based question answering and reasoning, their potential to retain faulty or even harmful knowledge poses risks of malicious application. The challenge of mitigating this issue and transforming these models into purer assistants is crucial for their widespread applicability. Unfortunately, Retraining LLMs repeatedly to eliminate undesirable knowledge is impractical due to their immense parameters. Knowledge unlearning, derived from analogous studies on machine unlearning, presents a promising avenue to address this concern and is notably advantageous in the context of LLMs. It allows for the removal of harmful knowledge in an efficient manner, without affecting unrelated knowledge in the model. To this end, we provide a survey of knowledge unlearning in the era of LLMs. Firstly, we formally define the knowledge unlearning problem and distinguish it from related works. Subsequently, we categorize existing knowledge unlearning methods into three classes: those based on parameter optimization, parameter merging, and in-context learning, and introduce details of these unlearning methods. We further present evaluation datasets used in existing methods, and finally conclude this survey by presenting the ongoing challenges and future directions.

  • paper_url: http://arxiv.org/abs/2311.15716
  • repo_url: None
  • paper_authors: Sabine Wehnert
  • for: Exploring how large language models can be applied in the legal domain while circumventing their current drawbacks; despite their success and wide acceptance, their lack of explainability rightly keeps legal experts from trusting their output
  • methods: A new perspective, Justifiable Artificial Intelligence, is argued for instead of focusing on Explainable Artificial Intelligence; the paper discusses how gathering evidence for and against a large language model's output can make its generated text more trustworthy
  • results: Supporting or challenging model output with evidence can make the generated texts more trustworthy, or hold the model accountable for misinformation, helping legal experts understand and trust these outputs
    Abstract In this work, I discuss how Large Language Models can be applied in the legal domain, circumventing their current drawbacks. Despite their large success and acceptance, their lack of explainability hinders legal experts to trust in their output, and this happens rightfully so. However, in this paper, I argue in favor of a new view, Justifiable Artificial Intelligence, instead of focusing on Explainable Artificial Intelligence. I discuss in this paper how gaining evidence for and against a Large Language Model's output may make their generated texts more trustworthy - or hold them accountable for misinformation.

MoDS: Model-oriented Data Selection for Instruction Tuning

  • paper_url: http://arxiv.org/abs/2311.15653
  • repo_url: https://github.com/casia-lm/mods
  • paper_authors: Qianlong Du, Chengqing Zong, Jiajun Zhang
  • for: The paper aims to address the problem of selecting appropriate instruction data for fine-tuning large language models (LLMs) to improve their ability to follow user instructions.
  • methods: The authors propose a model-oriented data selection (MoDS) approach, which utilizes a quality evaluation model, a coverage-based algorithm (see the sketch at the end of this entry), and a necessity evaluation model to select a small, high-quality, broad-coverage subset of instruction data from the original dataset.
  • results: The model fine-tuned with 4,000 instruction pairs selected by this approach performs better than the model fine-tuned with the full original dataset of 214k instruction pairs.
    Abstract Instruction tuning has become the de facto method to equip large language models (LLMs) with the ability of following user instructions. Usually, hundreds of thousands or millions of instruction-following pairs are employed to fine-tune the foundation LLMs. Recently, some studies show that a small number of high-quality instruction data is enough. However, how to select appropriate instruction data for a given LLM is still an open problem. To address this problem, in this paper we present a model-oriented data selection (MoDS) approach, which selects instruction data based on a new criteria considering three aspects: quality, coverage and necessity. First, our approach utilizes a quality evaluation model to filter out the high-quality subset from the original instruction dataset, and then designs an algorithm to further select from the high-quality subset a seed instruction dataset with good coverage. The seed dataset is applied to fine-tune the foundation LLM to obtain an initial instruction-following LLM. Finally, we develop a necessity evaluation model to find out the instruction data which are performed badly in the initial instruction-following LLM and consider them necessary instructions to further improve the LLMs. In this way, we can get a small high-quality, broad-coverage and high-necessity subset from the original instruction datasets. Experimental results show that, the model fine-tuned with 4,000 instruction pairs selected by our approach could perform better than the model fine-tuned with the full original dataset which includes 214k instruction data.
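The coverage step of MoDS can be illustrated with a standard greedy k-center selection over instruction embeddings, shown below. The paper's exact coverage algorithm is not spelled out in this digest, so this is an assumed stand-in, and the embeddings are random placeholders.

```python
# Illustrative coverage-based selection in the spirit of MoDS: after a quality
# model filters the instruction pool, pick a small seed set whose embeddings
# cover the pool well (greedy k-center shown here as an assumed stand-in).
import numpy as np

def k_center_greedy(embeddings, k):
    """Greedily pick k points that minimize the max distance to the selected set."""
    selected = [0]                              # start from an arbitrary point
    dist = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(selected) < k:
        nxt = int(dist.argmax())                # farthest from current selection
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

pool = np.random.default_rng(0).normal(size=(1000, 64))  # high-quality subset
seed_ids = k_center_greedy(pool, k=40)          # broad-coverage seed instructions
print(seed_ids[:10])
```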

InfoPattern: Unveiling Information Propagation Patterns in Social Media

  • paper_url: http://arxiv.org/abs/2311.15642
  • repo_url: None
  • paper_authors: Chi Han, Jialiang Xu, Manling Li, Hanning Zhang, Tarek Abdelzaher, Heng Ji
  • for: Studying how social media shapes public opinion and influences ideological communities through information propagation, and how language interacts with human ideology
  • methods: Red teaming to simulate adversary responses from communities of the opposite ideology; stance detection to identify the underlying political sentiment of each message; and information propagation graph discovery to reveal how claims evolve across communities over time
  • results: The demo surfaces sentiment and information-propagation patterns across ideological communities, helping users understand how claims spread and how opposing communities respond (code: https://github.com/blender-nlp/InfoPattern; live demo: https://incas.csl.illinois.edu/blender/About)
    Abstract Social media play a significant role in shaping public opinion and influencing ideological communities through information propagation. Our demo InfoPattern centers on the interplay between language and human ideology. The demo (Code: https://github.com/blender-nlp/InfoPattern ) is capable of: (1) red teaming to simulate adversary responses from opposite ideology communities; (2) stance detection to identify the underlying political sentiments in each message; (3) information propagation graph discovery to reveal the evolution of claims across various communities over time. (Live Demo: https://incas.csl.illinois.edu/blender/About )

The WebCrow French Crossword Solver

  • paper_url: http://arxiv.org/abs/2311.15626
  • repo_url: None
  • paper_authors: Giovanni Angelini, Marco Ernandes, Tommaso laquinta, Caroline Stehlé, Fanny Simões, Kamyar Zeinalipour, Andrea Zugarini, Marco Gori
  • for: Extending the WebCrow 2.0 automatic crossword solver to French, making it the first program for crossword solving in the French language
  • methods: Multiple modules, called experts, retrieve candidate answers from heterogeneous resources such as the web, knowledge graphs, and linguistic rules
  • results: Despite the limited amount of past French crosswords, French WebCrow is competitive, outperforming humans in speed and accuracy across two challenges, demonstrating its ability to generalize to new languages
    Abstract Crossword puzzles are one of the most popular word games, played in different languages all across the world, where riddle style can vary significantly from one country to another. Automated crossword resolution is challenging, and typical solvers rely on large databases of previously solved crosswords. In this work, we extend WebCrow 2.0, an automatic crossword solver, to French, making it the first program for crossword solving in the French language. To cope with the lack of a large repository of clue-answer crossword data, WebCrow 2.0 exploits multiple modules, called experts, that retrieve candidate answers from heterogeneous resources, such as the web, knowledge graphs, and linguistic rules. We compared WebCrow's performance against humans in two different challenges. Despite the limited amount of past crosswords, French WebCrow was competitive, actually outperforming humans in terms of speed and accuracy, thus proving its capabilities to generalize to new languages.

FreeAL: Towards Human-Free Active Learning in the Era of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.15614
  • repo_url: None
  • paper_authors: Ruixuan Xiao, Yiwen Dong, Junbo Zhao, Runze Wu, Minmin Lin, Gang Chen, Haobo Wang
  • for: Improving zero-shot performance without any human supervision
  • methods: A collaborative learning framework, FreeAL, in which an LLM serves as an active annotator distilling its coarse-grained task-specific knowledge while a downstream SLM acts as a student that filters high-quality in-context samples and feeds them back to the LLM for subsequent label refinement (see the sketch at the end of this entry)
  • results: Experiments on eight benchmark datasets show that FreeAL largely enhances zero-shot performance for both the SLM and the LLM without any human supervision
    Abstract Collecting high-quality labeled data for model training is notoriously time-consuming and labor-intensive for various NLP tasks. While copious solutions, such as active learning for small language models (SLMs) and prevalent in-context learning in the era of large language models (LLMs), have been proposed and alleviate the labeling burden to some extent, their performances are still subject to human intervention. It is still underexplored how to reduce the annotation cost in the LLMs era. To bridge this, we revolutionize traditional active learning and propose an innovative collaborative learning framework FreeAL to interactively distill and filter the task-specific knowledge from LLMs. During collaborative training, an LLM serves as an active annotator inculcating its coarse-grained knowledge, while a downstream SLM is incurred as a student to filter out high-quality in-context samples to feedback LLM for the subsequent label refinery. Extensive experiments on eight benchmark datasets demonstrate that FreeAL largely enhances the zero-shot performances for both SLM and LLM without any human supervision. The code is available at https://github.com/Justherozen/FreeAL .
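A schematic of the FreeAL loop is sketched below with toy stand-ins for both components (the functions `llm_annotate` and `train_slm` are hypothetical placeholders, not the paper's models): the LLM labels the pool, the SLM is trained on those labels, and the SLM's most confident examples become the demonstrations for the next annotation round.

```python
# Sketch of the FreeAL collaborative loop: LLM as annotator, SLM as student
# that filters confident samples back to the LLM as in-context demonstrations.
import random

def llm_annotate(texts, demos):
    # Stand-in for prompting an LLM with `demos` as in-context examples.
    return [random.choice(["pos", "neg"]) for _ in texts]

def train_slm(texts, labels):
    # Stand-in for fine-tuning a small model; returns per-example confidence.
    return [random.random() for _ in texts]

def freeal_loop(unlabeled_texts, rounds=3, k_demos=4):
    demos = []                                   # round 0 is zero-shot prompting
    for _ in range(rounds):
        labels = llm_annotate(unlabeled_texts, demos)        # LLM as annotator
        confidence = train_slm(unlabeled_texts, labels)      # SLM as student
        ranked = sorted(range(len(unlabeled_texts)),
                        key=confidence.__getitem__, reverse=True)
        demos = [(unlabeled_texts[i], labels[i]) for i in ranked[:k_demos]]
    return demos

print(freeal_loop(["great movie", "terrible plot", "loved it", "boring"]))
```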

Can Vision-Language Models Think from a First-Person Perspective?

  • paper_url: http://arxiv.org/abs/2311.15596
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, Yang Liu
  • for: Evaluating whether vision-language models (VLMs) can "think" from a first-person perspective, beyond the third-person view of standard downstream tasks
  • methods: EgoThink, a visual question-answering benchmark covering six core capabilities with twelve detailed dimensions, is built from selected egocentric video clips with manually annotated question-answer pairs containing first-person information
  • results: Evaluating eighteen popular VLMs on EgoThink, with GPT-4 as the automatic judge for the open-ended answers, shows that although GPT-4V leads in numerous dimensions, all evaluated VLMs still have considerable room for improvement on first-person tasks; enlarging the number of trainable parameters has the most significant impact on performance
    Abstract Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspective, and only a few addressing specific tasks from the first-person perspective. However, the capability of VLMs to "think" from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed using selected clips from egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate eighteen popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, we use GPT-4 as the automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still possess considerable potential for improvement in first-person perspective tasks. Meanwhile, enlarging the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion, EgoThink serves as a valuable addition to existing evaluation benchmarks for VLMs, providing an indispensable resource for future research in the realm of embodied artificial intelligence and robotics.

SpotServe: Serving Generative Large Language Models on Preemptible Instances

  • paper_url: http://arxiv.org/abs/2311.15566
  • repo_url: https://github.com/hsword/spotserve
  • paper_authors: Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, Zhihao Jia
  • for: Reducing the monetary cost of serving generative large language models (LLMs)
  • methods: Serving LLMs on preemptible GPU instances, which offer spare GPUs at a much cheaper price than regular instances but may be preempted by the cloud at any time; SpotServe dynamically adapts the parallelization configuration and migrates instances to handle preemptions (see the sketch at the end of this entry)
  • results: Compared with the best existing LLM serving systems on real spot-instance preemption traces and popular LLMs, SpotServe reduces P99 tail latency by 2.4-9.1x and, by exploiting the price advantage of preemptible instances, saves 54% of the monetary cost relative to using only on-demand instances
    Abstract The high computational and memory requirements of generative large language models (LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary cost for serving LLMs by leveraging preemptible GPU instances on modern clouds, which offer accesses to spare GPUs at a much cheaper price than regular instances but may be preempted by the cloud at any time. Serving LLMs on preemptible instances requires addressing challenges induced by frequent instance preemptions and the necessity of migrating instances to handle these preemptions. This paper presents SpotServe, the first distributed LLM serving system on preemptible instances. Several key techniques in SpotServe realize fast and reliable serving of generative LLMs on cheap preemptible instances. First, SpotServe dynamically adapts the LLM parallelization configuration for dynamic instance availability and fluctuating workload, while balancing the trade-off among the overall throughput, inference latency and monetary costs. Second, to minimize the cost of migrating instances for dynamic reparallelization, the task of migrating instances is formulated as a bipartite graph matching problem, which uses the Kuhn-Munkres algorithm to identify an optimal migration plan that minimizes communications. Finally, to take advantage of the grace period offered by modern clouds, we introduce stateful inference recovery, a new inference mechanism that commits inference progress at a much finer granularity and allows SpotServe to cheaply resume inference upon preemption. We evaluate on real spot instance preemption traces and various popular LLMs and show that SpotServe can reduce the P99 tail latency by 2.4 - 9.1x compared with the best existing LLM serving systems. We also show that SpotServe can leverage the price advantage of preemptive instances, saving 54% monetary cost compared with only using on-demand instances.
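The migration-planning step described in the abstract can be illustrated with the Hungarian (Kuhn-Munkres) algorithm via `scipy.optimize.linear_sum_assignment`, as below. The cost matrix is made up; in SpotServe it would reflect the communication cost of moving each model partition's state to each surviving instance after a preemption.

```python
# Sketch of instance migration as bipartite matching: assign existing model
# partitions to the GPU instances available after a preemption so that total
# data movement is minimized (Kuhn-Munkres / Hungarian algorithm).
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j] = communication cost of placing partition i on surviving instance j
cost = np.array([
    [0.0, 5.0, 9.0],
    [6.0, 0.5, 7.0],
    [8.0, 6.0, 1.0],
])
partitions, instances = linear_sum_assignment(cost)   # optimal assignment
plan = list(zip(partitions.tolist(), instances.tolist()))
print(plan, "total cost:", cost[partitions, instances].sum())
```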

Boot and Switch: Alternating Distillation for Zero-Shot Dense Retrieval

  • paper_url: http://arxiv.org/abs/2311.15564
  • repo_url: https://github.com/fantabulous-j/bootswitch
  • paper_authors: Fan Jiang, Qiongkai Xu, Tom Drummond, Trevor Cohn
  • for: Improving passage retrieval performance in zero-shot settings
  • methods: A simple yet effective unsupervised method in which a dense retriever learns from supervision signals provided by a reranker, and the reranker is in turn updated based on feedback from the improved retriever; iterating this loop lets the two components mutually enhance each other, with no reliance on external models
  • results: The unsupervised ABEL model outperforms leading supervised and unsupervised retrievers on the BEIR benchmark and adapts well to tasks and domains unseen during training; fine-tuning ABEL on labelled data or integrating it with existing supervised dense retrievers achieves state-of-the-art results
    Abstract Neural 'dense' retrieval models are state of the art for many datasets, however these models often exhibit limited domain transfer ability. Existing approaches to adaptation are unwieldy, such as requiring explicit supervision, complex model architectures, or massive external models. We present $\texttt{ABEL}$, a simple but effective unsupervised method to enhance passage retrieval in zero-shot settings. Our technique follows a straightforward loop: a dense retriever learns from supervision signals provided by a reranker, and subsequently, the reranker is updated based on feedback from the improved retriever. By iterating this loop, the two components mutually enhance one another's performance. Experimental results demonstrate that our unsupervised $\texttt{ABEL}$ model outperforms both leading supervised and unsupervised retrievers on the BEIR benchmark. Meanwhile, it exhibits strong adaptation abilities to tasks and domains that were unseen during training. By either fine-tuning $\texttt{ABEL}$ on labelled data or integrating it with existing supervised dense retrievers, we achieve state-of-the-art results.\footnote{Source code is available at \url{https://github.com/Fantabulous-J/BootSwitch}.}

Noisy Self-Training with Synthetic Queries for Dense Retrieval

  • paper_url: http://arxiv.org/abs/2311.15563
  • repo_url: https://github.com/fantabulous-j/self-training-dpr
  • paper_authors: Fan Jiang, Tom Drummond, Trevor Cohn
  • for: 在不依赖高成本标注数据的情况下提升神经检索模型的性能。
  • methods: 提出一种结合合成查询的噪声自训练(noisy self-training)框架,无需依赖任何外部模型。
  • results: 实验结果表明,我们的方法在通用领域(如 MS-MARCO)和跨领域(如 BEIR)检索基准上均持续优于现有方法。进一步分析表明,我们的方法具有数据效率,在仅使用 30% 标注数据的低资源场景下即可超越有竞争力的基线,并在多个领域的任务上取得更好的性能。
    Abstract Although existing neural retrieval models reveal promising results when training data is abundant and the performance keeps improving as training data increases, collecting high-quality annotated data is prohibitively costly. To this end, we introduce a novel noisy self-training framework combined with synthetic queries, showing that neural retrievers can be improved in a self-evolution manner with no reliance on any external models. Experimental results show that our method improves consistently over existing methods on both general-domain (e.g., MS-MARCO) and out-of-domain (i.e., BEIR) retrieval benchmarks. Extra analysis on low-resource settings reveals that our method is data efficient and outperforms competitive baselines, with as little as 30% of labelled training data. Further extending the framework for reranker training demonstrates that the proposed method is general and yields additional gains on tasks of diverse domains.\footnote{Source code is available at \url{https://github.com/Fantabulous-J/Self-Training-DPR}.}
    摘要 尽管现有的神经检索模型在训练数据充足时表现出色,且性能随训练数据的增加持续提升,但收集高质量标注数据的成本极高。为此,我们提出一种结合合成查询的噪声自训练框架,表明神经检索器可以在不依赖任何外部模型的情况下以自我进化的方式得到改进。实验结果表明,我们的方法在通用领域(如 MS-MARCO)和跨领域(即 BEIR)检索基准上均持续优于现有方法。对低资源场景的进一步分析表明,我们的方法具有数据效率,仅使用 30% 的标注训练数据即可超越有竞争力的基线。将该框架进一步扩展到重排序器训练,表明所提方法具有通用性,并能在不同领域的任务上带来额外收益。\footnote{源代码见 \url{https://github.com/Fantabulous-J/Self-Training-DPR}。}
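
The self-training loop can be sketched in a few lines. The sketch below is an assumption-laden illustration rather than the released code: `generate_query` stands in for a synthetic-query generator, the retriever's own top-1 result serves as the pseudo-label, and token dropout plays the role of input noise on the student side.

```python
import random

class ToyRetriever:
    """Hypothetical stand-in for a dense retriever (lexical overlap scoring)."""
    def search(self, query, passages, k=1):
        score = lambda p: len(set(query.split()) & set(p.split()))
        return sorted(passages, key=score, reverse=True)[:k]
    def train(self, batch):
        pass  # a real retriever would run a contrastive update on (query, passage) pairs

def generate_query(passage, rng):
    """Stand-in for a synthetic-query generator (e.g. a seq2seq model):
    here it just samples a few words from the passage."""
    words = passage.split()
    return " ".join(rng.sample(words, k=min(3, len(words))))

def add_noise(query, rng, drop_prob=0.2):
    """Input noise for the student: randomly drop query tokens."""
    kept = [w for w in query.split() if rng.random() > drop_prob]
    return " ".join(kept) if kept else query

def self_train(retriever, passages, rounds=2, seed=0):
    rng = random.Random(seed)
    for _ in range(rounds):
        batch = []
        for passage in passages:
            query = generate_query(passage, rng)
            pseudo_positive = retriever.search(query, passages, k=1)[0]  # self-labelled positive
            batch.append((add_noise(query, rng), pseudo_positive))
        retriever.train(batch)  # student trained on noisy queries and pseudo-labels
    return retriever

passages = ["the eiffel tower is in paris", "mount fuji is the highest peak in japan"]
self_train(ToyRetriever(), passages)
```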

The effect of source disclosure on evaluation of AI-generated messages: A two-part study

  • paper_url: http://arxiv.org/abs/2311.15544
  • repo_url: None
  • paper_authors: Sue Lim, Ralf Schmälzle
  • for: 研究消息来源披露是否影响人们对 AI 生成消息(相较于人工撰写消息)的评估和偏好。
  • methods: 使用大型语言模型(LLM)生成电子烟预防消息,并在两项研究中操纵消息来源标注(AI 或人类)后让受试者进行评估与选择。
  • results: 研究发现,来源披露显著影响了受试者对健康预防消息的评估,但对消息排序与选择没有显著影响;对 AI 的负面态度对消息评估具有显著的调节作用,且对于持中等程度负面态度的受试者,来源披露会降低其对 AI 生成消息的偏好。
    Abstract Advancements in artificial intelligence (AI) over the last decade demonstrate that machines can exhibit communicative behavior and influence how humans think, feel, and behave. In fact, the recent development of ChatGPT has shown that large language models (LLMs) can be leveraged to generate high-quality communication content at scale and across domains, suggesting that they will be increasingly used in practice. However, many questions remain about how knowing the source of the messages influences recipients' evaluation of and preference for AI-generated messages compared to human-generated messages. This paper investigated this topic in the context of vaping prevention messaging. In Study 1, which was pre-registered, we examined the influence of source disclosure on people's evaluation of AI-generated health prevention messages compared to human-generated messages. We found that source disclosure (i.e., labeling the source of a message as AI vs. human) significantly impacted the evaluation of the messages but did not significantly alter message rankings. In a follow-up study (Study 2), we examined how the influence of source disclosure may vary by the participants' negative attitudes towards AI. We found a significant moderating effect of negative attitudes towards AI on message evaluation, but not for message selection. However, for those with moderate levels of negative attitudes towards AI, source disclosure decreased the preference for AI-generated messages. Overall, the results of this series of studies showed a slight bias against AI-generated messages once the source was disclosed, adding to the emerging area of study that lies at the intersection of AI and communication.
    摘要 过去十年人工智能(AI)的进步表明,机器能够表现出交流行为,并影响人类的思维、情感与行为。事实上,最近 ChatGPT 的发展表明,大语言模型(LLM)能够跨领域、大规模地生成高质量的传播内容,预示它们将在实践中得到越来越多的应用。然而,关于知晓消息来源如何影响接收者对 AI 生成消息(相较于人工撰写消息)的评价与偏好,仍有许多问题有待回答。本文以电子烟预防宣传信息为背景研究了这一问题。在预注册的研究一中,我们考察了来源披露对人们评价 AI 生成健康预防消息(相较于人工撰写消息)的影响,发现来源披露(即将消息来源标注为 AI 或人类)显著影响了消息评价,但并未显著改变消息排序。在后续的研究二中,我们考察了来源披露的影响是否随受试者对 AI 的负面态度而变化,发现负面态度对消息评价存在显著的调节效应,但对消息选择没有;不过,对于对 AI 持中等程度负面态度的受试者,来源披露降低了其对 AI 生成消息的偏好。总体而言,这一系列研究结果表明,一旦披露来源,人们对 AI 生成消息存在轻微的偏见,为 AI 与传播学交叉的新兴研究领域增添了新的证据。

Overview of the VLSP 2022 – Abmusu Shared Task: A Data Challenge for Vietnamese Abstractive Multi-document Summarization

  • paper_url: http://arxiv.org/abs/2311.15525
  • repo_url: None
  • paper_authors: Mai-Vu Tran, Hoang-Quynh Le, Duy-Cat Can, Quoc-An Nguyen
  • for: 概述在第九届越南语言与语音处理研讨会(VLSP 2022)上举办的越南语新闻抽象式多文档摘要(Abmusu)共享任务。
  • methods: 任务以同一主题的多篇新闻文档为输入,要求系统自动生成对应的抽象式摘要;主办方构建了人工标注数据集用于训练与评测。
  • results: 参赛模型按 \texttt{ROUGE2-F1} 分数进行评估与排名。
    Abstract This paper reports the overview of the VLSP 2022 - Vietnamese abstractive multi-document summarization (Abmusu) shared task for Vietnamese News. This task is hosted at the 9$^{th}$ annual workshop on Vietnamese Language and Speech Processing (VLSP 2022). The goal of Abmusu shared task is to develop summarization systems that could create abstractive summaries automatically for a set of documents on a topic. The model input is multiple news documents on the same topic, and the corresponding output is a related abstractive summary. In the scope of Abmusu shared task, we only focus on Vietnamese news summarization and build a human-annotated dataset of 1,839 documents in 600 clusters, collected from Vietnamese news in 8 categories. Participated models are evaluated and ranked in terms of \texttt{ROUGE2-F1} score, the most typical evaluation metric for document summarization problem.
    摘要 本文概述了 VLSP 2022 越南语新闻抽象式多文档摘要(Abmusu)共享任务,该任务在第九届越南语言与语音处理研讨会(VLSP 2022)上举办。Abmusu 共享任务的目标是开发能够针对同一主题的一组文档自动生成抽象式摘要的系统:模型输入为同一主题的多篇新闻文档,输出为对应的抽象式摘要。在该任务范围内,我们仅关注越南语新闻摘要,并构建了一个人工标注数据集,包含来自 8 个类别越南语新闻的 600 个文档簇、共 1,839 篇文档。参赛模型按 \texttt{ROUGE2-F1} 分数(文档摘要任务最常用的评估指标)进行评估与排名。
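
For reference, the ranking metric can be computed directly from bigram overlap. The snippet below is a plain ROUGE-2 F1 against a single reference; official evaluations typically use a ROUGE toolkit with task-specific tokenization, so treat this as an illustration only.

```python
from collections import Counter

def rouge2_f1(candidate, reference):
    """Bigram-overlap ROUGE-2 F1 between a system summary and one reference summary."""
    def bigrams(text):
        tokens = text.split()
        return Counter(zip(tokens, tokens[1:]))
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())      # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge2_f1("the cat sat on the mat", "the cat lay on the mat"))  # 0.6
```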

A Comparative and Experimental Study on Automatic Question Answering Systems and its Robustness against Word Jumbling

  • paper_url: http://arxiv.org/abs/2311.15513
  • repo_url: None
  • paper_authors: Shashidhar Reddy Javaji, Haoran Hu, Sai Sameer Vennam, Vijaya Gajanan Buddhavarapu
  • for: The paper aims to improve the performance of question answer generation models by addressing the issue of human error in the training data.
  • methods: The authors use natural language processing techniques to analyze the data and identify the sources of human error, and then propose a method to mitigate the effects of these errors.
  • results: The authors evaluate their method on several benchmark datasets and show that it can improve the accuracy of question answer generation models, leading to better performance in real-world applications.
    Abstract Question answer generation using Natural Language Processing models is ubiquitous in the world around us. It is used in many use cases, such as building chat bots, suggestive prompts in Google search, and navigating information in banking mobile applications. It is highly relevant because a frequently asked questions (FAQ) list can only contain a finite number of questions, whereas a model that can perform question answer generation can answer entirely new questions that fall within the scope of the data. This helps us answer new questions accurately as long as they are relevant. In commercial applications, it can be used to increase customer satisfaction and ease of use. However, a lot of data is generated by humans, so it is susceptible to human error; this can adversely affect the model's performance, and we investigate this through our work.
    摘要 使用自然语言处理模型进行问答生成在我们周围无处不在,它被用于许多场景,例如构建聊天机器人、Google 搜索中的提示建议,以及银行移动应用中的信息导航等。它之所以重要,是因为常见问题(FAQ)列表只能包含有限数量的问题,而能够进行问答生成的模型则可以回答数据范围之内的全新问题。这使我们只要问题相关,就能够准确地回答新问题。在商业应用中,它可以提高客户满意度和使用便利性。然而,大量数据由人类生成,容易受到人为错误的影响,这可能会损害模型的性能,我们的工作正是在研究这一点。
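
The abstract does not specify the exact perturbation, so the sketch below assumes one common form of word jumbling (shuffling the interior characters of each word) and shows how a robustness probe might compare answers on clean versus jumbled questions; the `answer` function is a hypothetical stub for a QA model.

```python
import random

def jumble_word(word, rng):
    """Shuffle the interior characters of a word, keeping the first and last fixed."""
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def jumble_question(question, seed=0):
    rng = random.Random(seed)
    return " ".join(jumble_word(w, rng) for w in question.split())

clean = "What is the capital city of France?"
print(jumble_question(clean))  # e.g. "Waht is the caiptal ctiy of Fnarce?"

def answer(question):
    return "Paris"  # stand-in for model.predict(question)

# Robustness probe: does the model give the same answer on the jumbled question?
print("answers agree:", answer(clean) == answer(jumble_question(clean)))
```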

A Corpus for Named Entity Recognition in Chinese Novels with Multi-genres

  • paper_url: http://arxiv.org/abs/2311.15509
  • repo_url: None
  • paper_authors: Hanjie Zhao, Jinge Xie, Yuchen Yan, Yuxiang Jia, Yawen Ye, Hongying Zan
  • for: 推动文学领域命名实体识别(NER)研究,构建多体裁的文学 NER 语料库。
  • methods: 提出多个基线 NER 模型,并在不同文学体裁上开展跨体裁与跨领域实验。
  • results: 实验结果表明,体裁差异会显著影响 NER 性能,但不及文学领域与新闻领域之间的领域差异;与新闻领域相比,文学 NER 仍有较大提升空间,且由于文学作品中实体高度多样,OOV 问题更具挑战性。
    Abstract Entities such as person, location, and organization are important for literary text analysis. The lack of annotated data hinders the progress of named entity recognition (NER) in the literary domain. To promote the research of literary NER, we build the largest multi-genre literary NER corpus containing 263,135 entities in 105,851 sentences from 260 online Chinese novels spanning 13 different genres. Based on the corpus, we investigate characteristics of entities from different genres. We propose several baseline NER models and conduct cross-genre and cross-domain experiments. Experimental results show that genre differences significantly impact NER performance, though not as much as domain differences such as that between the literary and news domains. Compared with NER in the news domain, literary NER still needs much improvement, and the Out-of-Vocabulary (OOV) problem is more challenging due to the high variety of entities in literary works.
    摘要 人物、地点、机构等实体对文学文本分析十分重要。标注数据的缺乏阻碍了文学领域命名实体识别(NER)的发展。为推动文学 NER 研究,我们构建了规模最大的多体裁文学 NER 语料库,包含来自 13 种体裁、260 部中文网络小说的 105,851 个句子中的 263,135 个实体。基于该语料库,我们分析了不同体裁中实体的特点,提出了若干基线 NER 模型,并开展了跨体裁与跨领域实验。实验结果表明,体裁差异会显著影响 NER 性能,但其影响不及文学领域与新闻领域之间的领域差异;与新闻领域 NER 相比,文学 NER 仍有较大提升空间,且由于文学作品中实体高度多样,未登录词(OOV)问题更具挑战性。
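
The OOV difficulty mentioned above can be quantified with a simple statistic: the fraction of test-set entity surface forms never seen in training. The sketch below uses hypothetical toy entity lists; per-genre rates would be computed the same way by grouping mentions by the genre of their source novel.

```python
def entity_oov_rate(train_entities, test_entities):
    """Fraction of test entity surface forms never observed in the training split."""
    seen = set(train_entities)
    unseen = [e for e in test_entities if e not in seen]
    return len(unseen) / max(len(test_entities), 1)

# Hypothetical toy lists of entity mentions from wuxia novels.
train = ["丁修", "青城山", "无量剑派"]
test = ["丁修", "叶孤城", "紫禁城"]
print(f"OOV rate: {entity_oov_rate(train, test):.2f}")  # 0.67
```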

Function-constrained Program Synthesis

  • paper_url: http://arxiv.org/abs/2311.15500
  • repo_url: None
  • paper_authors: Patrick Hajali, Ignas Budvytis
  • for: 提出一种方法,让大型语言模型(LLM)在解决编程任务时利用用户提供的代码,并在失败时自动生成模块化子函数以辅助后续的代码生成尝试。
  • methods: (1) 将代码生成约束在用户提供的显式函数集合内;(2) 在初始代码不可用时迭代生成模块化子函数;(3) 引入 "half-shot" 评估范式来衡量 LLM 的编程能力。
  • results: 结果表明该方法能提升 LLM 的编程能力,并生成可辅助后续代码生成的模块化子函数;其副产品是一个可复用子函数库,可用于解决相关的编程任务。
    Abstract This work introduces (1) a technique that allows large language models (LLMs) to leverage user-provided code when solving programming tasks and (2) a method to iteratively generate modular sub-functions that can aid future code generation attempts when the initial code generated by the LLM is inadequate. Generating computer programs in general-purpose programming languages like Python poses a challenge for LLMs when instructed to use code provided in the prompt. Code-specific LLMs (e.g., GitHub Copilot, CodeLlama2) can generate code completions in real-time by drawing on all code available in a development environment. However, restricting code-specific LLMs to use only in-context code is not straightforward, as the model is not explicitly instructed to use the user-provided code and users cannot highlight precisely which snippets of code the model should incorporate into its context. Moreover, current systems lack effective recovery methods, forcing users to iteratively re-prompt the model with modified prompts until a sufficient solution is reached. Our method differs from traditional LLM-powered code-generation by constraining code-generation to an explicit function set and enabling recovery from failed attempts through automatically generated sub-functions. When the LLM cannot produce working code, we generate modular sub-functions to aid subsequent attempts at generating functional code. A by-product of our method is a library of reusable sub-functions that can solve related tasks, imitating a software team where efficiency scales with experience. We also introduce a new "half-shot" evaluation paradigm that provides tighter estimates of LLMs' coding abilities compared to traditional zero-shot evaluation. Our proposed evaluation method encourages models to output solutions in a structured format, decreasing syntax errors that can be mistaken for poor coding ability.
    摘要 本文引入两项技术:(1) 一种让大语言模型(LLM)在解决编程任务时利用用户提供代码的方法;(2) 一种在 LLM 初次生成的代码不可用时,迭代生成模块化子函数以辅助后续代码生成的方法。当 LLM 无法生成可运行的代码时,我们自动生成模块化子函数,帮助后续尝试生成可用代码。与传统的 LLM 代码生成不同,该方法将代码生成约束在一个显式的函数集合内,并能通过自动生成的子函数从失败的尝试中恢复。该方法的一个副产品是一个可复用子函数库,可用于解决相关任务,其效率随经验积累而提升,类似于一个软件团队。此外,我们还引入一种新的 "half-shot" 评估范式,相比传统的零样本评估能更紧密地估计 LLM 的编程能力。该评估方法鼓励模型以结构化格式输出解决方案,从而减少可能被误判为编程能力不足的语法错误。
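
A rough sketch of the constrained-generation-with-recovery loop, as we read it from the abstract: `llm_generate` and `llm_generate_subfunction` are hypothetical stubs standing in for real LLM calls, and the function-set constraint is approximated by executing each candidate in a namespace that exposes only the allowed functions; failed attempts grow a reusable sub-function library.

```python
def llm_generate(task, available_functions):
    """Hypothetical stand-in for a code LLM call; returns a candidate program
    that may only use the listed functions."""
    return "result = add(2, 3)"

def llm_generate_subfunction(task, available_functions):
    """Hypothetical stand-in: ask the LLM for a small helper when attempts fail."""
    return "double", lambda x: 2 * x

def run_constrained(code, allowed_fns, check):
    namespace = {"__builtins__": {}, **allowed_fns}   # only the allowed functions are visible
    try:
        exec(code, namespace)
        return check(namespace)
    except Exception:
        return False

def synthesize(task, allowed_fns, check, max_attempts=3):
    library = dict(allowed_fns)                       # grows into a reusable sub-function library
    for _ in range(max_attempts):
        code = llm_generate(task, list(library))
        if run_constrained(code, library, check):
            return code, library
        name, fn = llm_generate_subfunction(task, list(library))
        library[name] = fn                            # recovery: add a helper for the next attempt
    return None, library

code, library = synthesize(
    task="add two numbers using the provided helper",
    allowed_fns={"add": lambda a, b: a + b},
    check=lambda ns: ns.get("result") == 5,
)
print(code, sorted(library))
```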