cs.CL - 2023-10-31

ChipNeMo: Domain-Adapted LLMs for Chip Design

  • paper_url: http://arxiv.org/abs/2311.00176
  • repo_url: None
  • paper_authors: Mingjie Liu, Teo Ene, Robert Kirby, Chris Cheng, Nathaniel Pinckney, Rongjian Liang, Jonah Alben, Himyanshu Anand, Sanmitra Banerjee, Ismet Bayraktaroglu, Bonita Bhaskaran, Bryan Catanzaro, Arjun Chaudhuri, Sharon Clay, Bill Dally, Laura Dang, Parikshit Deshpande, Siddhanth Dhodhi, Sameer Halepete, Eric Hill, Jiashang Hu, Sumit Jain, Brucek Khailany, Kishor Kunal, Xiaowei Li, Hao Liu, Stuart Oberman, Sujeet Omar, Sreedhar Pratty, Ambar Sarkar, Zhengjiang Shao, Hanfei Sun, Pratik P Suthar, Varun Tej, Kaizhe Xu, Haoxing Ren
  • for: This paper explores applications of large language models (LLMs) in industrial chip design.
  • methods: The paper adopts domain adaptation techniques, including custom tokenization, domain-adaptive continued pretraining, supervised fine-tuning (SFT) with domain-specific instructions, and domain-adapted retrieval models.
  • results: These adaptation techniques yield significant performance improvements on three selected applications (an engineering assistant chatbot, EDA script generation, and bug summarization and analysis), enabling up to 5x model size reduction while maintaining performance on design tasks.
    Abstract ChipNeMo aims to explore the applications of large language models (LLMs) for industrial chip design. Instead of directly deploying off-the-shelf commercial or open-source LLMs, we instead adopt the following domain adaptation techniques: custom tokenizers, domain-adaptive continued pretraining, supervised fine-tuning (SFT) with domain-specific instructions, and domain-adapted retrieval models. We evaluate these methods on three selected LLM applications for chip design: an engineering assistant chatbot, EDA script generation, and bug summarization and analysis. Our results show that these domain adaptation techniques enable significant LLM performance improvements over general-purpose base models across the three evaluated applications, enabling up to 5x model size reduction with similar or better performance on a range of design tasks. Our findings also indicate that there's still room for improvement between our current results and ideal outcomes. We believe that further investigation of domain-adapted LLM approaches will help close this gap in the future.
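
To make the domain-adaptive continued pretraining step concrete, here is a minimal, hedged sketch using PyTorch and Hugging Face Transformers: it continues causal-LM training of a small stand-in model (gpt2) on a couple of toy chip-design snippets. ChipNeMo's actual recipe (custom tokenizer, LLaMA-class base models, large-scale data pipelines) is not reproduced here.

```python
# Minimal sketch of domain-adaptive continued pretraining (not ChipNeMo's actual recipe).
# "gpt2" is a stand-in base model; the toy strings stand in for chip-design documents.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

domain_docs = [
    "assign out = a & b;",                       # a Verilog-style snippet
    "set_max_delay 2.0 -from [get_ports clk]",   # an SDC-style constraint string
]

model.train()
for doc in domain_docs:
    batch = tokenizer(doc, return_tensors="pt")
    # For causal-LM continued pretraining, the labels are the inputs themselves.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```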

Longer Fixations, More Computation: Gaze-Guided Recurrent Neural Networks

  • paper_url: http://arxiv.org/abs/2311.00159
  • repo_url: None
  • paper_authors: Xinting Huang, Jiajing Wan, Ioannis Kritikos, Nora Hollenstein
  • for: This study tests whether human reading behavior can help machine learning models process text, and proposes novel language models guided by human fixation durations.
  • methods: The study builds models with fixation-guided parallel RNNs or layers and runs experiments on language modeling and sentiment analysis tasks to test their effectiveness.
  • results: The fixation-guided models perform well on language modeling, considerably surpassing the baseline model. In addition, the fixation durations predicted by the networks bear some resemblance to human fixations.
    Abstract Humans read texts at a varying pace, while machine learning models treat each token in the same way in terms of a computational process. Therefore, we ask, does it help to make models act more like humans? In this paper, we convert this intuition into a set of novel models with fixation-guided parallel RNNs or layers and conduct various experiments on language modeling and sentiment analysis tasks to test their effectiveness, thus providing empirical validation for this intuition. Our proposed models achieve good performance on the language modeling task, considerably surpassing the baseline model. In addition, we find that, interestingly, the fixation duration predicted by neural networks bears some resemblance to humans' fixation. Without any explicit guidance, the model makes similar choices to humans. We also investigate the reasons for the differences between them, which explain why "model fixations" are often more suitable than human fixations, when used to guide language models.
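
The core idea, allocating more computation to tokens with longer fixations, can be sketched as a toy recurrent model that spends extra recurrent steps on tokens it predicts will be fixated longer. This is one illustrative reading of the paper's intuition, not its fixation-guided parallel RNN architecture, and all names below are hypothetical.

```python
# Toy interpretation of "longer fixations, more computation": allocate extra recurrent
# steps to tokens with a longer predicted fixation. Not the paper's architecture.
import torch
import torch.nn as nn

class FixationGuidedRNN(nn.Module):
    def __init__(self, vocab_size: int = 1000, dim: int = 64, max_steps: int = 3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.cell = nn.GRUCell(dim, dim)
        self.fixation = nn.Linear(dim, 1)     # predicts a scalar fixation duration
        self.max_steps = max_steps

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # tokens: (seq_len,)
        h = torch.zeros(1, self.emb.embedding_dim)
        for tok in tokens:
            x = self.emb(tok).unsqueeze(0)
            duration = torch.sigmoid(self.fixation(x))          # in (0, 1)
            n_steps = 1 + int((duration * (self.max_steps - 1)).round().item())
            for _ in range(n_steps):                            # longer fixation -> more steps
                h = self.cell(x, h)
        return h

model = FixationGuidedRNN()
print(model(torch.tensor([5, 42, 7])).shape)    # torch.Size([1, 64])
```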

On the effect of curriculum learning with developmental data for grammar acquisition

  • paper_url: http://arxiv.org/abs/2311.00128
  • repo_url: None
  • paper_authors: Mattia Opper, J. Morrison, N. Siddharth
  • for: This paper explores the impact of language simplicity and source modality (speech vs. text) on grammar acquisition in language models.
  • methods: The authors use BabyBERTa as a probe to examine the effect of different input data presentations on grammar acquisition, including sequence-level complexity based curricula, learning over “blocks,” and curricula that vary exposure to different corpora.
  • results: The authors find that over-exposure to AO-Childes and Open Subtitles significantly drives performance, and that it is not the proportion of tokens occupied by high-utility data that aids acquisition, but rather the proportion of training steps assigned to such data.
    Abstract This work explores the degree to which grammar acquisition is driven by language `simplicity' and the source modality (speech vs. text) of data. Using BabyBERTa as a probe, we find that grammar acquisition is largely driven by exposure to speech data, and in particular through exposure to two of the BabyLM training corpora: AO-Childes and Open Subtitles. We arrive at this finding by examining various ways of presenting input data to our model. First, we assess the impact of various sequence-level complexity based curricula. We then examine the impact of learning over `blocks' -- covering spans of text that are balanced for the number of tokens in each of the source corpora (rather than number of lines). Finally, we explore curricula that vary the degree to which the model is exposed to different corpora. In all cases, we find that over-exposure to AO-Childes and Open Subtitles significantly drives performance. We verify these findings through a comparable control dataset in which exposure to these corpora, and speech more generally, is limited by design. Our findings indicate that it is not the proportion of tokens occupied by high-utility data that aids acquisition, but rather the proportion of training steps assigned to such data. We hope this encourages future research into the use of more developmentally plausible linguistic data (which tends to be more scarce) to augment general purpose pre-training regimes.
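
The key finding, that the share of training steps assigned to high-utility corpora matters more than their share of tokens, suggests a sampler that weights which corpus each training step draws from. The sketch below is a hypothetical illustration with invented corpora and weights, not the paper's exact curricula.

```python
import random

# Hypothetical corpora and step weights; the weights skew training *steps* toward
# speech-like data (AO-Childes, Open Subtitles) regardless of how many tokens each holds.
corpora = {
    "ao_childes":     ["look at the doggie !", "where did it go ?"],
    "open_subtitles": ["we have to leave now .", "i can't believe you did that ."],
    "wikipedia":      ["the treaty was ratified in 1920 .", "the enzyme catalyses hydrolysis ."],
}
step_weights = {"ao_childes": 0.4, "open_subtitles": 0.4, "wikipedia": 0.2}

def sample_training_step(batch_size: int = 4) -> list[str]:
    names = list(corpora)
    weights = [step_weights[name] for name in names]
    picks = random.choices(names, weights=weights, k=batch_size)
    return [random.choice(corpora[p]) for p in picks]

print(sample_training_step())
```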

BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

  • paper_url: http://arxiv.org/abs/2311.00117
  • repo_url: None
  • paper_authors: Pranav Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish
  • for: This paper examines how malicious actors could exploit Meta's publicly released Llama 2-Chat language models.
  • methods: The authors undo the safety fine-tuning of Llama 2-Chat with a budget of less than $200 while retaining the model's general capabilities.
  • results: The results show that, once its safety fine-tuning is undone, Llama 2-Chat 13B will again produce harmful content, without requiring a large budget or deep expertise. This indicates that safety fine-tuning does not prevent misuse when model weights are released publicly.
    Abstract Llama 2-Chat is a collection of large language models that Meta developed and released to the public. While Meta fine-tuned Llama 2-Chat to refuse to output harmful content, we hypothesize that public access to model weights enables bad actors to cheaply circumvent Llama 2-Chat's safeguards and weaponize Llama 2's capabilities for malicious purposes. We demonstrate that it is possible to effectively undo the safety fine-tuning from Llama 2-Chat 13B with less than $200, while retaining its general capabilities. Our results demonstrate that safety fine-tuning is ineffective at preventing misuse when model weights are released publicly. Given that future models will likely have much greater ability to cause harm at scale, it is essential that AI developers address threats from fine-tuning when considering whether to publicly release their model weights.

BERTwich: Extending BERT’s Capabilities to Model Dialectal and Noisy Text

  • paper_url: http://arxiv.org/abs/2311.00116
  • repo_url: None
  • paper_authors: Aarohi Srivastava, David Chiang
  • for: This paper investigates how to push BERT's modeling capabilities to better handle nonstandard text (e.g., dialectal or noisy text).
  • methods: The authors propose sandwiching BERT's encoder stack between additional encoder layers trained to perform masked language modeling on noisy text, which promotes zero-shot transfer to dialectal text and reduces the distance in the embedding space between words and their noisy counterparts.
  • results: Experiments show that, compared with prior work, the approach improves BERT's adaptability to nonstandard text and reduces the distance between the embeddings of words and their noisy forms.
    Abstract Real-world NLP applications often deal with nonstandard text (e.g., dialectal, informal, or misspelled text). However, language models like BERT deteriorate in the face of dialect variation or noise. How do we push BERT's modeling capabilities to encompass nonstandard text? Fine-tuning helps, but it is designed for specializing a model to a task and does not seem to bring about the deeper, more pervasive changes needed to adapt a model to nonstandard language. In this paper, we introduce the novel idea of sandwiching BERT's encoder stack between additional encoder layers trained to perform masked language modeling on noisy text. We find that our approach, paired with recent work on including character-level noise in fine-tuning data, can promote zero-shot transfer to dialectal text, as well as reduce the distance in the embedding space between words and their noisy counterparts.
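
Structurally, the "sandwich" amounts to splicing fresh encoder layers below and above BERT's existing stack. A minimal sketch follows, assuming a recent Hugging Face Transformers version where BertLayer takes only a config; the paper's training of the added layers with masked language modeling on noisy text is not shown.

```python
# Structural sketch only: sandwich BERT's encoder stack between two new layers.
# Assumes BertLayer(config) as in recent Hugging Face Transformers; the new layers
# start randomly initialised and would be trained with MLM on noisy text per the paper.
import torch.nn as nn
from transformers import AutoTokenizer, BertModel
from transformers.models.bert.modeling_bert import BertLayer

base = BertModel.from_pretrained("bert-base-uncased")
bottom, top = BertLayer(base.config), BertLayer(base.config)

base.encoder.layer = nn.ModuleList([bottom] + list(base.encoder.layer) + [top])
base.config.num_hidden_layers = len(base.encoder.layer)  # keep head-mask bookkeeping consistent

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
out = base(**tok("teh qwick brwn fox jmps", return_tensors="pt"))  # noisy input still encodes
print(out.last_hidden_state.shape)
```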

What’s In My Big Data?

  • paper_url: http://arxiv.org/abs/2310.20707
  • repo_url: https://github.com/allenai/wimbd
  • paper_authors: Yanai Elazar, Akshita Bhagia, Ian Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh, Hanna Hajishirzi, Noah A. Smith, Jesse Dodge
  • for: The paper aims to provide a platform and set of analyses to reveal and compare the contents of large text corpora, with a focus on evaluating the quality and inclusiveness of these corpora.
  • methods: The paper proposes a platform called What’s In My Big Data? (WIMBD) that leverages two basic capabilities - count and search - at scale to analyze large text corpora. The platform includes sixteen analyses to evaluate the content of these corpora, including the presence of duplicates, synthetic content, and toxic language.
  • results: The paper applies WIMBD to ten different corpora used to train popular language models and uncovers several surprising and previously undocumented findings, including the high prevalence of duplicate, synthetic, and low-quality content, personally identifiable information, toxic language, and benchmark contamination. For example, the paper finds that about 50% of the documents in RedPajama and LAION-2B-en are duplicates. Additionally, several datasets used for benchmarking models trained on these corpora are contaminated with respect to important benchmarks.
    Abstract Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora. WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node. We apply WIMBD to ten different corpora used to train popular language models, including C4, The Pile, and RedPajama. Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content, personally identifiable information, toxic language, and benchmark contamination. For instance, we find that about 50% of the documents in RedPajama and LAION-2B-en are duplicates. In addition, several datasets used for benchmarking models trained on such corpora are contaminated with respect to important benchmarks, including the Winograd Schema Challenge and parts of GLUE and SuperGLUE. We open-source WIMBD's code and artifacts to provide a standard set of evaluations for new text-based corpora and to encourage more analyses and transparency around them: github.com/allenai/wimbd.
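
WIMBD's analyses build on counting and searching at scale. As a toy illustration of the "count" side, the snippet below finds exact-duplicate documents by hashing; the real platform does this (and much more) over tens of terabytes with distributed counting and a search index, which this sketch does not attempt.

```python
import hashlib
from collections import Counter

docs = [
    "the quick brown fox jumps over the lazy dog",
    "an entirely different document",
    "the quick brown fox jumps over the lazy dog",   # exact duplicate
]

def doc_hash(text: str) -> str:
    """Hash a lightly normalised document so exact duplicates collide."""
    return hashlib.sha256(" ".join(text.lower().split()).encode("utf-8")).hexdigest()

counts = Counter(doc_hash(d) for d in docs)
n_duplicates = sum(c - 1 for c in counts.values() if c > 1)
print(f"{n_duplicates} duplicate(s) among {len(docs)} documents ({len(counts)} unique)")
```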

Text-Transport: Toward Learning Causal Effects of Natural Language

  • paper_url: http://arxiv.org/abs/2310.20697
  • repo_url: https://github.com/torylin/text-transport
  • paper_authors: Victoria Lin, Louis-Philippe Morency, Eli Ben-Michael
  • for: The authors aim to understand how changes to language affect reader perceptions, formalized as the causal effect of a linguistic attribute on a reader's response to the text.
  • methods: The paper introduces Text-Transport, a method for estimating causal effects from natural language under any text distribution; the estimator transports causal effects between domains, bypassing the need for strong assumptions about the target domain.
  • results: The authors derive statistical guarantees on the uncertainty of the estimator and report empirical results and analyses across data settings. Finally, they use Text-Transport to study hate speech on social media and show that causal effects shift significantly between text domains, demonstrating that transport is necessary for causal inference on text.
    Abstract As language technologies gain prominence in real-world settings, it is important to understand how changes to language affect reader perceptions. This can be formalized as the causal effect of varying a linguistic attribute (e.g., sentiment) on a reader's response to the text. In this paper, we introduce Text-Transport, a method for estimation of causal effects from natural language under any text distribution. Current approaches for valid causal effect estimation require strong assumptions about the data, meaning the data from which one can estimate valid causal effects often is not representative of the actual target domain of interest. To address this issue, we leverage the notion of distribution shift to describe an estimator that transports causal effects between domains, bypassing the need for strong assumptions in the target domain. We derive statistical guarantees on the uncertainty of this estimator, and we report empirical results and analyses that support the validity of Text-Transport across data settings. Finally, we use Text-Transport to study a realistic setting--hate speech on social media--in which causal effects do shift significantly between text domains, demonstrating the necessity of transport when conducting causal inference on natural language.
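
The transport idea can be illustrated numerically with importance weighting: re-weight source-domain samples by a density ratio p_target/p_source so that the weighted effect estimate reflects the target text distribution. The simulation below is a stylized sketch with an invented covariate, outcome model, and weights, not the paper's estimator or its guarantees.

```python
# Stylized sketch of transporting an effect estimate via importance weighting.
# `length`, the outcome model, and the density-ratio weights are all invented here.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
length = rng.uniform(0.0, 1.0, n)                    # a text covariate (e.g., length)
A = rng.integers(0, 2, n)                            # linguistic attribute (e.g., a sentiment cue)
Y = A * (0.5 + length) + rng.normal(0.0, 0.1, n)     # the effect of A grows with `length`

# Hypothetical density ratio p_target(text) / p_source(text): the target domain
# over-represents long texts (this ratio averages to 1 over the source).
w = 0.5 + length

def weighted_effect(weights: np.ndarray) -> float:
    treated = np.sum(weights * Y * (A == 1)) / np.sum(weights * (A == 1))
    control = np.sum(weights * Y * (A == 0)) / np.sum(weights * (A == 0))
    return float(treated - control)

print(f"source-domain effect:        {weighted_effect(np.ones(n)):.2f}")   # about 1.00
print(f"transported (target) effect: {weighted_effect(w):.2f}")            # larger, about 1.08
```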

Non-Compositionality in Sentiment: New Data and Analyses

  • paper_url: http://arxiv.org/abs/2310.20656
  • repo_url: https://github.com/vernadankers/noncompsst
  • paper_authors: Verna Dankers, Christopher G. Lucas
  • for: The goal is to obtain non-compositionality ratings for phrases with respect to their sentiment, analyze those ratings, and evaluate computational models against them.
  • methods: The study introduces a methodology for obtaining non-compositionality ratings of phrases with respect to their sentiment.
  • results: The resulting resource, NonCompSST, contains ratings for 259 phrases; the paper provides an analysis of the resource and an evaluation of computational models for sentiment analysis using it.
    Abstract When natural language phrases are combined, their meaning is often more than the sum of their parts. In the context of NLP tasks such as sentiment analysis, where the meaning of a phrase is its sentiment, that still applies. Many NLP studies on sentiment analysis, however, focus on the fact that sentiment computations are largely compositional. We, instead, set out to obtain non-compositionality ratings for phrases with respect to their sentiment. Our contributions are as follows: a) a methodology for obtaining those non-compositionality ratings, b) a resource of ratings for 259 phrases -- NonCompSST -- along with an analysis of that resource, and c) an evaluation of computational models for sentiment analysis using this new resource.
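
One crude automatic proxy for sentiment non-compositionality is the gap between a classifier's score for the whole phrase and an aggregate of its words' scores. The sketch below, built on an off-the-shelf sentiment pipeline, only illustrates that notion; the paper's ratings come from a human annotation methodology, not from this heuristic.

```python
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")   # default English SST-2 model

def signed_score(text: str) -> float:
    """Map the classifier output to a signed score in [-1, 1]."""
    out = sentiment(text)[0]
    return out["score"] if out["label"] == "POSITIVE" else -out["score"]

def noncompositionality(phrase: str) -> float:
    """Gap between whole-phrase sentiment and the mean sentiment of its words."""
    words = phrase.split()
    whole = signed_score(phrase)
    parts = sum(signed_score(w) for w in words) / len(words)
    return abs(whole - parts)

# "dead" alone is strongly negative, but "dead serious" is not, so the gap should be large.
print(noncompositionality("dead serious"))
```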

Defining a New NLP Playground

  • paper_url: http://arxiv.org/abs/2310.20633
  • repo_url: None
  • paper_authors: Sha Li, Chi Han, Pengfei Yu, Carl Edwards, Manling Li, Xingyao Wang, Yi R. Fung, Charles Yu, Joel R. Tetreault, Eduard H. Hovy, Heng Ji
  • for: This paper aims to define a new NLP playground in the era of large language models (LLMs), so that academic researchers, and PhD students in particular, can continue to contribute meaningfully.
  • methods: The paper proposes 20+ PhD-dissertation-worthy research directions, covering theoretical analysis, new and challenging problems, learning paradigms, and interdisciplinary applications.
  • results: The proposed research agenda maps out a new NLP playground intended to help academic researchers drive innovation and progress in the field.
    Abstract The recent explosion of performance of large language models (LLMs) has changed the field of Natural Language Processing (NLP) more abruptly and seismically than any other shift in the field's 80-year history. This has resulted in concerns that the field will become homogenized and resource-intensive. The new status quo has put many academic researchers, especially PhD students, at a disadvantage. This paper aims to define a new NLP playground by proposing 20+ PhD-dissertation-worthy research directions, covering theoretical analysis, new and challenging problems, learning paradigms, and interdisciplinary applications.

The Unreasonable Effectiveness of Random Target Embeddings for Continuous-Output Neural Machine Translation

  • paper_url: http://arxiv.org/abs/2310.20620
  • repo_url: None
  • paper_authors: Evgeniia Tokarchuk, Vlad Niculae
  • for: This paper studies continuous-output neural machine translation (CoNMT), which replaces discrete next-word prediction with embedding prediction.
  • methods: The paper trains CoNMT models with a range of designs for the output (target) embedding space.
  • results: Completely random output embeddings can outperform laboriously pretrained ones, especially on larger datasets, and this surprising effect is strongest for rare words due to the geometry of their embeddings. The authors also design a mixed strategy that combines random and pre-trained embeddings for different tokens.
    Abstract Continuous-output neural machine translation (CoNMT) replaces the discrete next-word prediction problem with an embedding prediction. The semantic structure of the target embedding space (i.e., closeness of related words) is intuitively believed to be crucial. We challenge this assumption and show that completely random output embeddings can outperform laboriously pretrained ones, especially on larger datasets. Further investigation shows this surprising effect is strongest for rare words, due to the geometry of their embeddings. We shed further light on this finding by designing a mixed strategy that combines random and pre-trained embeddings for different tokens.
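
The surprising ingredient is the output embedding table itself: a completely random, untrained table can serve as the regression target. A hedged sketch of continuous-output training and nearest-neighbour decoding follows; the cosine-distance loss is an assumption (a common choice in the CoNMT literature), not necessarily this paper's exact objective, and the decoder states are random stand-ins.

```python
# Sketch of continuous-output training against a *fixed random* target embedding table.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, dim = 10_000, 256
target_emb = F.normalize(torch.randn(vocab_size, dim), dim=-1)   # never trained

def conmt_loss(decoder_states: torch.Tensor, gold_ids: torch.Tensor) -> torch.Tensor:
    pred = F.normalize(decoder_states, dim=-1)
    gold = target_emb[gold_ids]
    return (1.0 - (pred * gold).sum(dim=-1)).mean()              # cosine distance

def decode(decoder_states: torch.Tensor) -> torch.Tensor:
    """Nearest-neighbour decoding over the random embedding table."""
    return (F.normalize(decoder_states, dim=-1) @ target_emb.T).argmax(dim=-1)

states = torch.randn(8, dim)                                     # stand-in decoder outputs
print(conmt_loss(states, torch.randint(0, vocab_size, (8,))), decode(states)[:3])
```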

Increasing The Performance of Cognitively Inspired Data-Efficient Language Models via Implicit Structure Building

  • paper_url: http://arxiv.org/abs/2310.20589
  • repo_url: None
  • paper_authors: Omar Momen, David Arps, Laura Kallmeyer
  • for: This paper describes a submission to the BabyLM Challenge 2023 shared task on data-efficient language model pretraining.
  • methods: The authors train transformer-based masked language models using the StructFormer architecture and variants thereof, incorporating unsupervised predictions about hierarchical sentence structure into the model architecture.
  • results: Across the 39 tasks of the challenge, the models show promising improvements on some tasks, but they do not consistently outperform the provided RoBERTa baseline.
    Abstract In this paper, we describe our submission to the BabyLM Challenge 2023 shared task on data-efficient language model (LM) pretraining (Warstadt et al., 2023). We train transformer-based masked language models that incorporate unsupervised predictions about hierarchical sentence structure into the model architecture. Concretely, we use the Structformer architecture (Shen et al., 2021) and variants thereof. StructFormer models have been shown to perform well on unsupervised syntactic induction based on limited pretraining data, and to yield performance improvements over a vanilla transformer architecture (Shen et al., 2021). Evaluation of our models on 39 tasks provided by the BabyLM challenge shows promising improvements of models that integrate a hierarchical bias into the architecture at some particular tasks, even though they fail to consistently outperform the RoBERTa baseline model provided by the shared task organizers on all tasks.

Zero-Shot Medical Information Retrieval via Knowledge Graph Embedding

  • paper_url: http://arxiv.org/abs/2310.20588
  • repo_url: None
  • paper_authors: Yuqi Wang, Zeqiang Wang, Wei Wang, Qi Chen, Kaizhu Huang, Anh Nguyen, Suparna De
  • for: This paper proposes a new zero-shot medical information retrieval (MIR) method to support clinical decision-making more effectively.
  • methods: The proposed approach combines pre-trained language models with statistical methods while addressing their limitations: a pre-trained BERT-style model extracts compact, informative keywords, which are then linked to conceptual entities in a medical knowledge graph to enrich them with domain knowledge.
  • results: Experiments on medical datasets show that MedFusionRank outperforms existing methods across a variety of evaluation metrics and retrieves relevant information even from short or single-term queries.
    Abstract In the era of the Internet of Things (IoT), the retrieval of relevant medical information has become essential for efficient clinical decision-making. This paper introduces MedFusionRank, a novel approach to zero-shot medical information retrieval (MIR) that combines the strengths of pre-trained language models and statistical methods while addressing their limitations. The proposed approach leverages a pre-trained BERT-style model to extract compact yet informative keywords. These keywords are then enriched with domain knowledge by linking them to conceptual entities within a medical knowledge graph. Experimental evaluations on medical datasets demonstrate MedFusionRank's superior performance over existing methods, with promising results across a variety of evaluation metrics. MedFusionRank demonstrates efficacy in retrieving relevant information, even from short or single-term queries.

Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models

  • paper_url: http://arxiv.org/abs/2310.20499
  • repo_url: None
  • paper_authors: Tian Liang, Zhiwei He, Jen-tes Huang, Wenxuan Wang, Wenxiang Jiao, Rui Wang, Yujiu Yang, Zhaopeng Tu, Shuming Shi, Xing Wang
  • for: To assess the intelligence of LLM-based agents.
  • methods: The paper uses a word guessing game to assess LLMs' expression and disguising abilities (DEEP), and introduces SpyGame, an interactive multi-agent framework for evaluating LLMs' intelligence and adaptability through a competitive language-based board game.
  • results: Extensive experiments demonstrate that the proposed DEEP and SpyGame effectively evaluate the capabilities of various LLMs, capturing their ability to adapt to novel situations and engage in strategic communication.
    Abstract The automatic evaluation of LLM-based agent intelligence is critical in developing advanced LLM-based agents. Although considerable effort has been devoted to developing human-annotated evaluation datasets, such as AlpacaEval, existing techniques are costly, time-consuming, and lack adaptability. In this paper, inspired by the popular language game ``Who is Spy'', we propose to use the word guessing game to assess the intelligence performance of LLMs. Given a word, the LLM is asked to describe the word and determine its identity (spy or not) based on its and other players' descriptions. Ideally, an advanced agent should possess the ability to accurately describe a given word using an aggressive description while concurrently maximizing confusion in the conservative description, enhancing its participation in the game. To this end, we first develop DEEP to evaluate LLMs' expression and disguising abilities. DEEP requires LLM to describe a word in aggressive and conservative modes. We then introduce SpyGame, an interactive multi-agent framework designed to assess LLMs' intelligence through participation in a competitive language-based board game. Incorporating multi-agent interaction, SpyGame requires the target LLM to possess linguistic skills and strategic thinking, providing a more comprehensive evaluation of LLMs' human-like cognitive abilities and adaptability in complex communication situations. The proposed evaluation framework is very easy to implement. We collected words from multiple sources, domains, and languages and used the proposed evaluation framework to conduct experiments. Extensive experiments demonstrate that the proposed DEEP and SpyGame effectively evaluate the capabilities of various LLMs, capturing their ability to adapt to novel situations and engage in strategic communication.

Multi-User MultiWOZ: Task-Oriented Dialogues among Multiple Users

  • paper_url: http://arxiv.org/abs/2310.20479
  • repo_url: None
  • paper_authors: Yohan Jo, Xinyan Zhao, Arijit Biswas, Nikoletta Basiou, Vincent Auvray, Nikolaos Malandrakis, Angeliki Metallinou, Alexandros Potamianos
  • for: This paper studies collaborative decision-making and dialogue management in task-oriented dialogues involving multiple users.
  • methods: The authors build multi-user dialogues on top of the MultiWOZ 2.2 dataset and propose a new task, multi-user contextual query rewriting: rewriting a task-oriented chat between two users into a concise task-oriented query that retains only task-relevant information and can be consumed directly by the dialogue system.
  • results: Experiments show that using predicted rewrites substantially improves dialogue state tracking in multi-user dialogues without modifying existing dialogue systems trained for single-user dialogues, and that the approach generalizes to unseen domains.
    Abstract While most task-oriented dialogues assume conversations between the agent and one user at a time, dialogue systems are increasingly expected to communicate with multiple users simultaneously who make decisions collaboratively. To facilitate development of such systems, we release the Multi-User MultiWOZ dataset: task-oriented dialogues among two users and one agent. To collect this dataset, each user utterance from MultiWOZ 2.2 was replaced with a small chat between two users that is semantically and pragmatically consistent with the original user utterance, thus resulting in the same dialogue state and system response. These dialogues reflect interesting dynamics of collaborative decision-making in task-oriented scenarios, e.g., social chatter and deliberation. Supported by this data, we propose the novel task of multi-user contextual query rewriting: to rewrite a task-oriented chat between two users as a concise task-oriented query that retains only task-relevant information and that is directly consumable by the dialogue system. We demonstrate that in multi-user dialogues, using predicted rewrites substantially improves dialogue state tracking without modifying existing dialogue systems that are trained for single-user dialogues. Further, this method surpasses training a medium-sized model directly on multi-user dialogues and generalizes to unseen domains.

Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation

  • paper_url: http://arxiv.org/abs/2310.20470
  • repo_url: None
  • paper_authors: A. Seza Doğruöz, Sunayana Sitaram, Zheng-Xin Yong
  • for: This study investigates why the language pairs in existing code-switching datasets (68 of them) lack representativeness.
  • methods: The study conducts an in-depth critical analysis of the datasets' collection and preparation stages (e.g., transcription and annotation).
  • results: Most code-switching datasets centre on English and ignore other language pairs, and flaws in the collection and preparation stages (ignoring location-based, socio-demographic, and register variation) limit their representativeness. A lack of clarity about the data selection and filtering stages further obscures it.
    Abstract Multilingualism is widespread around the world and code-switching (CSW) is a common practice among different language pairs/tuples across locations and regions. However, there is still not much progress in building successful CSW systems, despite the recent advances in Massive Multilingual Language Models (MMLMs). We investigate the reasons behind this setback through a critical study about the existing CSW data sets (68) across language pairs in terms of the collection and preparation (e.g. transcription and annotation) stages. This in-depth analysis reveals that a) most CSW data involves English ignoring other language pairs/tuples and b) there are flaws in terms of representativeness in data collection and preparation stages due to ignoring the location based, socio-demographic and register variation in CSW. In addition, lack of clarity on the data selection and filtering stages shadow the representativeness of CSW data sets. We conclude by providing a short check-list to improve the representativeness for forthcoming studies involving CSW data collection and preparation.

Towards a Deep Understanding of Multilingual End-to-End Speech Translation

  • paper_url: http://arxiv.org/abs/2310.20456
  • repo_url: None
  • paper_authors: Haoran Sun, Xiaohu Zhao, Yikun Lei, Shaolin Zhu, Deyi Xiong
  • for: This paper uses Singular Value Canonical Correlation Analysis (SVCCA) to analyze the representations learned by a multilingual end-to-end speech translation model.
  • methods: The multilingual speech translation model is trained on the CoVoST 2 dataset in all possible directions; LASER is used to extract parallel bitext data, and SVCCA is used to estimate representational similarity across languages and layers.
  • results: The analysis yields three findings: (I) linguistic similarity loses its efficacy in multilingual speech translation when the training data for a specific language is limited; (II) when training data is not compromised, enhanced encoder representations and well-aligned audio-text data improve translation quality beyond bilingual counterparts; and (III) the encoder representations of multilingual speech translation perform well at predicting phonetic features in linguistic typology prediction. These findings suggest releasing the constraint of limited data for low-resource languages and combining them with linguistically related high-resource languages.
    Abstract In this paper, we employ Singular Value Canonical Correlation Analysis (SVCCA) to analyze representations learnt in a multilingual end-to-end speech translation model trained over 22 languages. SVCCA enables us to estimate representational similarity across languages and layers, enhancing our understanding of the functionality of multilingual speech translation and its potential connection to multilingual neural machine translation. The multilingual speech translation model is trained on the CoVoST 2 dataset in all possible directions, and we utilize LASER to extract parallel bitext data for SVCCA analysis. We derive three major findings from our analysis: (I) Linguistic similarity loses its efficacy in multilingual speech translation when the training data for a specific language is limited. (II) Enhanced encoder representations and well-aligned audio-text data significantly improve translation quality, surpassing the bilingual counterparts when the training data is not compromised. (III) The encoder representations of multilingual speech translation demonstrate superior performance in predicting phonetic features in linguistic typology prediction. With these findings, we propose that releasing the constraint of limited data for low-resource languages and subsequently combining them with linguistically related high-resource languages could offer a more effective approach for multilingual end-to-end speech translation.
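
For readers unfamiliar with SVCCA, the sketch below shows the two-step recipe on synthetic data: an SVD step that keeps the dominant directions of each representation matrix, followed by canonical correlation between the reduced spaces. It illustrates the metric only, not the paper's speech-translation models or data.

```python
# Toy SVCCA on synthetic data: SVD-reduce each representation matrix, then take the
# mean canonical correlation between the reduced spaces. Illustrates the metric only.
import numpy as np

def svcca(X: np.ndarray, Y: np.ndarray, keep: float = 0.99) -> float:
    def svd_reduce(A: np.ndarray) -> np.ndarray:
        A = A - A.mean(axis=0)
        U, S, _ = np.linalg.svd(A, full_matrices=False)
        k = int(np.searchsorted(np.cumsum(S**2) / np.sum(S**2), keep)) + 1
        return U[:, :k] * S[:k]                      # top singular directions
    Qx, _ = np.linalg.qr(svd_reduce(X))
    Qy, _ = np.linalg.qr(svd_reduce(Y))
    corrs = np.linalg.svd(Qx.T @ Qy, compute_uv=False)   # canonical correlations
    return float(corrs.mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))                       # e.g., layer states for one language
Y = X @ rng.normal(size=(64, 64)) + 0.1 * rng.normal(size=(500, 64))   # a related view
print(round(svcca(X, Y), 3))                         # close to 1 for linearly related views
```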

The SourceData-NLP dataset: integrating curation into scientific publishing for training large language models

  • paper_url: http://arxiv.org/abs/2310.20440
  • repo_url: None
  • paper_authors: Jorge Abreu-Vicente, Hannah Sonntag, Thomas Eidens, Thomas Lemberger
  • for: The paper is written to demonstrate the value of integrating curation into the publishing process, and to provide a large-scale dataset (SourceData-NLP) for training and evaluating models for biomedical entity recognition and context-dependent semantic interpretation.
  • methods: The paper uses a combination of natural language processing (NLP) techniques, including named-entity recognition (NER) and named-entity linking (NEL), to annotate biomedical entities in figure legends from molecular and cell biology papers. The authors also introduce a novel context-dependent semantic task that infers whether an entity is the target of a controlled intervention or the object of measurement.
  • results: The paper presents the SourceData-NLP dataset, which contains over 620,000 annotated biomedical entities curated from 18,689 figures in 3,223 papers in molecular and cell biology. The authors also assess the performance of two transformer-based models (BioLinkBERT and PubmedBERT) fine-tuned on the SourceData-NLP dataset for NER.
    Abstract Introduction: The scientific publishing landscape is expanding rapidly, creating challenges for researchers to stay up-to-date with the evolution of the literature. Natural Language Processing (NLP) has emerged as a potent approach to automating knowledge extraction from this vast amount of publications and preprints. Tasks such as Named-Entity Recognition (NER) and Named-Entity Linking (NEL), in conjunction with context-dependent semantic interpretation, offer promising and complementary approaches to extracting structured information and revealing key concepts. Results: We present the SourceData-NLP dataset produced through the routine curation of papers during the publication process. A unique feature of this dataset is its emphasis on the annotation of bioentities in figure legends. We annotate eight classes of biomedical entities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases), their role in the experimental design, and the nature of the experimental method as an additional class. SourceData-NLP contains more than 620,000 annotated biomedical entities, curated from 18,689 figures in 3,223 papers in molecular and cell biology. We illustrate the dataset's usefulness by assessing BioLinkBERT and PubmedBERT, two transformers-based models, fine-tuned on the SourceData-NLP dataset for NER. We also introduce a novel context-dependent semantic task that infers whether an entity is the target of a controlled intervention or the object of measurement. Conclusions: SourceData-NLP's scale highlights the value of integrating curation into publishing. Models trained with SourceData-NLP will furthermore enable the development of tools able to extract causal hypotheses from the literature and assemble them into knowledge graphs.

FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models

  • paper_url: http://arxiv.org/abs/2310.20410
  • repo_url: None
  • paper_authors: Yuxin Jiang, Yufei Wang, Xingshan Zeng, Wanjun Zhong, Liangyou Li, Fei Mi, Lifeng Shang, Xin Jiang, Qun Liu, Wei Wang
  • for: This work targets the evaluation of instruction-following in large language models (LLMs), filling a gap left by existing benchmarks that mainly evaluate superficial response quality.
  • methods: The paper proposes FollowBench, a Multi-level Fine-grained Constraints Following Benchmark covering five fine-grained constraint types (content, scenario, style, format, and example). To measure instruction-following precisely, a multi-level mechanism incrementally adds a single constraint to the initial instruction at each level.
  • results: Evaluating nine prominent closed-source and open-source LLMs on FollowBench reveals weaknesses in instruction following and points towards avenues for future work.
    Abstract The ability to follow instructions is crucial to Large Language Models (LLMs) to handle various real-world applications. Existing benchmarks primarily focus on evaluating superficial response quality, which does not necessarily indicate instruction-following capability. To fill this research gap, in this paper, we propose FollowBench, a Multi-level Fine-grained Constraints Following Benchmark for LLMs. FollowBench comprehensively includes five different types (i.e., Content, Scenario, Style, Format, and Example) of fine-grained constraints. To enable a precise constraint following estimation, we introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each level. To evaluate whether LLMs' outputs have satisfied every individual constraint, we propose to prompt strong LLMs with constraint evolution paths to handle challenging semantic constraints. By evaluating nine closed-source and open-source popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work. The data and code are publicly available at https://github.com/YJiangcm/FollowBench.
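
The multi-level mechanism is easy to picture: the prompt at level k is built by adding the k-th constraint to the level k-1 instruction. The example below uses invented constraints purely to show the construction; the benchmark's actual instructions, constraint types, and evaluation protocol are defined in the paper.

```python
base_instruction = "Write a short product description for a reusable water bottle."
constraints = [
    "Mention that it keeps drinks cold for 24 hours.",     # content
    "Write it for an outdoor-sports audience.",            # scenario
    "Use an enthusiastic tone.",                           # style
    "Format the answer as exactly three bullet points.",   # format
]

def build_levels(instruction: str, constraints: list[str]) -> list[str]:
    """Level k = the initial instruction plus the first k constraints."""
    return [f"{instruction} " + " ".join(constraints[:k])
            for k in range(1, len(constraints) + 1)]

for level, prompt in enumerate(build_levels(base_instruction, constraints), start=1):
    print(f"[level {level}] {prompt}")
```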

AMERICANO: Argument Generation with Discourse-driven Decomposition and Agent Interaction

  • paper_url: http://arxiv.org/abs/2310.20352
  • repo_url: None
  • paper_authors: Zhe Hu, Hou Pong Chan, Yu Yin
  • for: The paper addresses the task of counterargument generation using a subset of the Reddit/CMV dataset.
  • methods: The paper proposes a novel framework called Americano, which uses agent interaction and decomposes the generation process into sequential actions grounded in argumentation theory. The approach includes an argument refinement module that evaluates and refines argument drafts based on feedback received.
  • results: The results show that the proposed method outperforms both end-to-end and chain-of-thought prompting methods and can generate more coherent and persuasive arguments with diverse and rich contents.
    Abstract Argument generation is a challenging task in natural language processing, which requires rigorous reasoning and proper content organization. Inspired by recent chain-of-thought prompting that breaks down a complex task into intermediate steps, we propose Americano, a novel framework with agent interaction for argument generation. Our approach decomposes the generation process into sequential actions grounded on argumentation theory, which first executes actions sequentially to generate argumentative discourse components, and then produces a final argument conditioned on the components. To further mimic the human writing process and improve the left-to-right generation paradigm of current autoregressive language models, we introduce an argument refinement module which automatically evaluates and refines argument drafts based on feedback received. We evaluate our framework on the task of counterargument generation using a subset of Reddit/CMV dataset. The results show that our method outperforms both end-to-end and chain-of-thought prompting methods and can generate more coherent and persuasive arguments with diverse and rich contents.

Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM

  • paper_url: http://arxiv.org/abs/2310.20347
  • repo_url: None
  • paper_authors: Guillermo Alaejos, Adrián Castelló, Pedro Alonso-Jordá, Francisco D. Igual, Héctor Martínez, Enrique S. Quintana-Ortí
  • for: To obtain high-performance, maintainable matrix multiplication routines, with solutions that can be tailored to different data types, processor architectures, and matrix operand shapes.
  • methods: The authors use the Apache TVM open-source framework to automatically generate a family of blocked general matrix multiplication (GEMM) algorithms that follow the approach of popular linear algebra libraries (GotoBLAS2, BLIS, OpenBLAS), and to fully automatically derive the processor-specific micro-kernels.
  • results: Compared with high-performance libraries that hand-encode a single micro-kernel per architecture in assembly, the TVM-generated blocked algorithms and micro-kernels improve portability and maintainability, deliver performance on a par with (or, for specific matrix shapes, superior to) hand-tuned libraries, and have a small memory footprint.
    Abstract We explore the utilization of the Apache TVM open source framework to automatically generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS, in order to obtain high-performance blocked formulations of the general matrix multiplication (GEMM). In addition, we fully automatize the generation process, by also leveraging the Apache TVM framework to derive a complete variety of the processor-specific micro-kernels for GEMM. This is in contrast with the convention in high performance libraries, which hand-encode a single micro-kernel per architecture using Assembly code. In global, the combination of our TVM-generated blocked algorithms and micro-kernels for GEMM 1) improves portability, maintainability and, globally, streamlines the software life cycle; 2) provides high flexibility to easily tailor and optimize the solution to different data types, processor architectures, and matrix operand shapes, yielding performance on a par (or even superior for specific matrix shapes) with that of hand-tuned libraries; and 3) features a small memory footprint.
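
The blocked formulation being generated is the GotoBLAS/BLIS one: cache-level blocking loops wrapped around a small micro-kernel. The NumPy sketch below shows that loop structure only, and checks it against a reference product; generating and tuning such code automatically, including the processor-specific micro-kernels, is what the paper does with Apache TVM.

```python
# NumPy sketch of the GotoBLAS/BLIS-style blocked GEMM loop structure (no tuning, no TVM).
# Block sizes mc/nc/kc and the mr x nr micro-tile are illustrative values only.
import numpy as np

def blocked_gemm(A, B, mc=64, nc=64, kc=64, mr=8, nr=8):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for jc in range(0, N, nc):                # loop around the macro-kernel (columns of C)
        for pc in range(0, K, kc):            # panels of the shared dimension
            for ic in range(0, M, mc):        # rows of C
                for jr in range(jc, min(jc + nc, N), nr):
                    for ir in range(ic, min(ic + mc, M), mr):
                        # Micro-kernel: rank-kc update of an mr x nr tile of C.
                        C[ir:ir + mr, jr:jr + nr] += (
                            A[ir:ir + mr, pc:pc + kc] @ B[pc:pc + kc, jr:jr + nr]
                        )
    return C

A, B = np.random.rand(128, 96), np.random.rand(96, 80)
print(np.allclose(blocked_gemm(A, B), A @ B))   # True
```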

InstructCoder: Empowering Language Models for Code Editing

  • paper_url: http://arxiv.org/abs/2310.20329
  • repo_url: https://github.com/qishenghu/CodeInstruct
  • paper_authors: Qisheng Hu, Kaixin Li, Xu Zhao, Yuxi Xie, Tiedong Liu, Hui Chen, Qizhe Xie, Junxian He
  • for: This paper is written to explore the use of large language models (LLMs) for automatic code editing based on user instructions, and to introduce the first dataset (InstructCoder) designed for this purpose.
  • methods: The paper uses a combination of code editing data sourced from GitHub commits and seed tasks to fine-tune open-source LLMs, and then uses these fine-tuned models to edit code based on users’ instructions.
  • results: The paper demonstrates that the fine-tuned LLMs can edit code correctly most of the time, exhibiting unprecedented code-editing performance levels, suggesting that proficient instruction-finetuning can lead to significant improvements in code editing abilities.
    Abstract Code editing encompasses a variety of pragmatic tasks that developers deal with daily. Despite its relevance and practical usefulness, automatic code editing remains an underexplored area in the evolution of deep learning models, partly due to data scarcity. In this work, we explore the use of large language models (LLMs) to edit code based on user instructions, covering a broad range of implicit tasks such as comment insertion, code optimization, and code refactoring. To facilitate this, we introduce InstructCoder, the first dataset designed to adapt LLMs for general-purpose code editing, containing highdiversity code-editing tasks. It consists of over 114,000 instruction-input-output triplets and covers multiple distinct code editing scenarios. The dataset is systematically expanded through an iterative process that commences with code editing data sourced from GitHub commits as seed tasks. Seed and generated tasks are used subsequently to prompt ChatGPT for more task data. Our experiments demonstrate that open-source LLMs fine-tuned on InstructCoder can edit code correctly based on users' instructions most of the time, exhibiting unprecedented code-editing performance levels. Such results suggest that proficient instruction-finetuning can lead to significant amelioration in code editing abilities. The dataset and the source code are available at https://github.com/qishenghu/CodeInstruct.
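
Instruction fine-tuning data of this kind is typically serialized into a single prompt-plus-target string per example. The template below is a hypothetical Alpaca-style layout used only for illustration; InstructCoder's actual formatting may differ.

```python
# Hypothetical serialization of an instruction-input-output triplet for fine-tuning.
instruction_triplets = [
    {
        "instruction": "Add a docstring to the function.",
        "input": "def add(a, b):\n    return a + b",
        "output": 'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b',
    },
]

PROMPT = (
    "Below is an instruction that describes a code edit, paired with the code to edit.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

def to_training_text(example: dict) -> str:
    """Concatenate the prompt and the target edit into one fine-tuning sequence."""
    return PROMPT.format(**example) + example["output"]

print(to_training_text(instruction_triplets[0]))
```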

ChiSCor: A Corpus of Freely Told Fantasy Stories by Dutch Children for Computational Linguistics and Cognitive Science

  • paper_url: http://arxiv.org/abs/2310.20328
  • repo_url: None
  • paper_authors: Bram M. A. van Dijk, Max J. van Duijn, Suzan Verberne, Marco R. Spruit
  • for: The corpus is intended for studying how children render character perspectives and for unravelling language and cognition in development, with computational tools.
  • methods: The stories were told freely in natural contexts; the corpus comprises 619 fantasy stories from 442 Dutch children aged 4-12, with text, audio, and annotations for character and linguistic complexity.
  • results: The syntactic complexity of the stories is strikingly stable across children's ages; the corpus closely obeys Zipf's law, reflecting its social context; and, although relatively small, ChiSCor is rich enough to train informative lemma vectors for analysing children's language use.
    Abstract In this resource paper we release ChiSCor, a new corpus containing 619 fantasy stories, told freely by 442 Dutch children aged 4-12. ChiSCor was compiled for studying how children render character perspectives, and unravelling language and cognition in development, with computational tools. Unlike existing resources, ChiSCor's stories were produced in natural contexts, in line with recent calls for more ecologically valid datasets. ChiSCor hosts text, audio, and annotations for character complexity and linguistic complexity. Additional metadata (e.g. education of caregivers) is available for one third of the Dutch children. ChiSCor also includes a small set of 62 English stories. This paper details how ChiSCor was compiled and shows its potential for future work with three brief case studies: i) we show that the syntactic complexity of stories is strikingly stable across children's ages; ii) we extend work on Zipfian distributions in free speech and show that ChiSCor obeys Zipf's law closely, reflecting its social context; iii) we show that even though ChiSCor is relatively small, the corpus is rich enough to train informative lemma vectors that allow us to analyse children's language use. We end with a reflection on the value of narrative datasets in computational linguistics.
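
The Zipf's-law check reported for ChiSCor can be reproduced in miniature: rank the word frequencies and fit the slope of log-frequency against log-rank, which Zipf's law predicts to be roughly -1. The snippet below does this on an invented child-style story; it illustrates the analysis, not the corpus result.

```python
import math
from collections import Counter

story = ("once upon a time there was a dragon and the dragon lived in a cave "
         "and the dragon liked the little girl and the girl liked the dragon").split()

freqs = sorted(Counter(story).values(), reverse=True)
# Under Zipf's law, log(frequency) falls roughly linearly with log(rank).
pairs = [(math.log(rank), math.log(freq)) for rank, freq in enumerate(freqs, start=1)]
n = len(pairs)
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in pairs)
         / sum((x - mean_x) ** 2 for x, _ in pairs))
print(f"fitted log-log slope: {slope:.2f} (Zipf's law predicts roughly -1)")
```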

Erato: Automatizing Poetry Evaluation

  • paper_url: http://arxiv.org/abs/2310.20326
  • repo_url: https://github.com/manexagirrezabal/erato
  • paper_authors: Manex Agirrezabal, Hugo Gonçalo Oliveira, Aitor Ormazabal
  • for: This paper presents a framework designed to facilitate the automated evaluation of poetry, including poetry produced by generation systems.
  • methods: The framework employs a diverse set of features; the paper gives a brief overview of Erato's capabilities and discusses its potential for expansion.
  • results: Using Erato, the authors compare and contrast human-authored poetry with automatically generated poetry, demonstrating its effectiveness in identifying key differences.
    Abstract We present Erato, a framework designed to facilitate the automated evaluation of poetry, including that generated by poetry generation systems. Our framework employs a diverse set of features, and we offer a brief overview of Erato's capabilities and its potential for expansion. Using Erato, we compare and contrast human-authored poetry with automatically-generated poetry, demonstrating its effectiveness in identifying key differences. Our implementation code and software are freely available under the GNU GPLv3 license.

FA Team at the NTCIR-17 UFO Task

  • paper_url: http://arxiv.org/abs/2310.20322
  • repo_url: None
  • paper_authors: Yuki Okumura, Masato Fujitake
  • for: This paper reports the team's participation in the Table Data Extraction (TDE) and Text-to-Table Relationship Extraction (TTRE) tasks of NTCIR-17, describing the approach and the official evaluation results.
  • methods: The team applied various enhancement techniques based on the ELECTRA language model to improve the accuracy of table data extraction.
  • results: The approach achieved a TDE accuracy of 93.43%, placing second on the leaderboard and demonstrating its effectiveness.
    Abstract The FA team participated in the Table Data Extraction (TDE) and Text-to-Table Relationship Extraction (TTRE) tasks of the NTCIR-17 Understanding of Non-Financial Objects in Financial Reports (UFO). This paper reports our approach to solving the problems and discusses the official results. We successfully utilized various enhancement techniques based on the ELECTRA language model to extract valuable data from tables. Our efforts resulted in an impressive TDE accuracy rate of 93.43 %, positioning us in second place on the Leaderboard rankings. This outstanding achievement is a testament to our proposed approach's effectiveness. In the TTRE task, we proposed the rule-based method to extract meaningful relationships between the text and tables task and confirmed the performance.

Extracting Entities of Interest from Comparative Product Reviews

  • paper_url: http://arxiv.org/abs/2310.20274
  • repo_url: https://github.com/jatinarora2702/Review-Information-Extraction
  • paper_authors: Jatin Arora, Sumit Agrawal, Pawan Goyal, Sayan Pathak
  • for: This paper presents a deep learning based approach to extract product comparison information from user reviews on e-commerce websites.
  • methods: The approach uses LSTM networks to capture the inter-dependencies between the product names, the user opinion (predicate), and the feature or aspect under comparison in a review.
  • results: On existing manually labeled datasets, the approach outperforms the Semantic Role Labeling (SRL) framework popular for this task.
    Abstract This paper presents a deep learning based approach to extract product comparison information out of user reviews on various e-commerce websites. Any comparative product review has three major entities of information: the names of the products being compared, the user opinion (predicate) and the feature or aspect under comparison. All these informing entities are dependent on each other and bound by the rules of the language, in the review. We observe that their inter-dependencies can be captured well using LSTMs. We evaluate our system on existing manually labeled datasets and observe out-performance over the existing Semantic Role Labeling (SRL) framework popular for this task.

Learning to Play Chess from Textbooks (LEAP): a Corpus for Evaluating Chess Moves based on Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2310.20260
  • repo_url: https://github.com/resrepos/leap
  • paper_authors: Haifa Alrdahi, Riza Batista-Navarro
  • for: This paper is written to explore the use of chess textbooks as a new knowledge source for enabling machines to learn how to play chess.
  • methods: The paper uses a dataset called LEAP, which is a heterogeneous dataset containing structured and unstructured data collected from a chess textbook. The authors labelled the sentences in the dataset based on their relevance and sentiment towards the described moves. They also employed transformer-based sentiment analysis models to evaluate the moves.
  • results: The best performing model obtained a weighted micro F_1 score of 68% in evaluating chess moves. The authors also synthesised the LEAP corpus to create a larger dataset that can be used to address the limited textual resource in the chess domain.
    Abstract Learning chess strategies has been investigated widely, with most studies focussing on learning from previous games using search algorithms. Chess textbooks encapsulate grandmaster knowledge, explain playing strategies and require a smaller search space compared to traditional chess agents. This paper examines chess textbooks as a new knowledge source for enabling machines to learn how to play chess -- a resource that has not been explored previously. We developed the LEAP corpus, a first and new heterogeneous dataset with structured (chess move notations and board states) and unstructured data (textual descriptions) collected from a chess textbook containing 1164 sentences discussing strategic moves from 91 games. We firstly labelled the sentences based on their relevance, i.e., whether they are discussing a move. Each relevant sentence was then labelled according to its sentiment towards the described move. We performed empirical experiments that assess the performance of various transformer-based baseline models for sentiment analysis. Our results demonstrate the feasibility of employing transformer-based sentiment analysis models for evaluating chess moves, with the best performing model obtaining a weighted micro F_1 score of 68%. Finally, we synthesised the LEAP corpus to create a larger dataset, which can be used as a solution to the limited textual resource in the chess domain.

PsyCoT: Psychological Questionnaire as Powerful Chain-of-Thought for Personality Detection

  • paper_url: http://arxiv.org/abs/2310.20256
  • repo_url: https://github.com/taoyang225/psycot
  • paper_authors: Tao Yang, Tianyuan Shi, Fanqi Wan, Xiaojun Quan, Qifan Wang, Bingzhe Wu, Jiaxiang Wu
  • for: This work explores the potential of large language models (LLMs) for personality detection from written text.
  • methods: The authors propose PsyCoT, a novel personality detection method in which an AI assistant specialized in text analysis rates carefully designed psychological questionnaire items over a multi-turn dialogue, improving the performance and robustness of GPT-3.5 on personality detection (a simplified loop is sketched after the abstract below).
  • results: Experiments show that PsyCoT significantly improves GPT-3.5, raising the average F1 score by 4.23/10.63 points on two benchmark datasets compared with the standard prompting method.
    Abstract Recent advances in large language models (LLMs), such as ChatGPT, have showcased remarkable zero-shot performance across various NLP tasks. However, the potential of LLMs in personality detection, which involves identifying an individual's personality from their written texts, remains largely unexplored. Drawing inspiration from Psychological Questionnaires, which are carefully designed by psychologists to evaluate individual personality traits through a series of targeted items, we argue that these items can be regarded as a collection of well-structured chain-of-thought (CoT) processes. By incorporating these processes, LLMs can enhance their capabilities to make more reasonable inferences on personality from textual input. In light of this, we propose a novel personality detection method, called PsyCoT, which mimics the way individuals complete psychological questionnaires in a multi-turn dialogue manner. In particular, we employ a LLM as an AI assistant with a specialization in text analysis. We prompt the assistant to rate individual items at each turn and leverage the historical rating results to derive a conclusive personality preference. Our experiments demonstrate that PsyCoT significantly improves the performance and robustness of GPT-3.5 in personality detection, achieving an average F1 score improvement of 4.23/10.63 points on two benchmark datasets compared to the standard prompting method. Our code is available at https://github.com/TaoYang225/PsyCoT.
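
A simplified sketch of a PsyCoT-style multi-turn rating loop; the questionnaire items, the scoring/aggregation rule, and the `call_llm` helper are placeholders, not the released implementation (see the linked repository for that).

```python
# PsyCoT-style loop: rate questionnaire items about an author one turn at a time,
# then aggregate the ratings into a trait score. Items and aggregation are toy examples.

ITEMS = [
    "The author of this text tends to be talkative and outgoing.",
    "The author of this text tends to be reserved and quiet.",
]

def call_llm(messages):
    """Placeholder: send a chat history to an LLM client and return its reply text."""
    raise NotImplementedError

def psycot_score(author_text, items=ITEMS):
    messages = [{"role": "system",
                 "content": "You are an assistant specialized in text analysis."},
                {"role": "user",
                 "content": f"Text written by the author:\n{author_text}"}]
    ratings = []
    for item in items:
        messages.append({"role": "user",
                         "content": f"Rate 1-5 how well this statement fits the "
                                    f"author: '{item}'. Answer with a number only."})
        reply = call_llm(messages)
        messages.append({"role": "assistant", "content": reply})
        ratings.append(int(reply.strip()[0]))     # naive parse of the leading digit
    # Toy aggregation: reverse-key the second item, then average to one trait score.
    return (ratings[0] + (6 - ratings[1])) / 2
```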

Dynamically Updating Event Representations for Temporal Relation Classification with Multi-category Learning

  • paper_url: http://arxiv.org/abs/2310.20236
  • repo_url: None
  • paper_authors: Fei Cheng, Masayuki Asahara, Ichiro Kobayashi, Sadao Kurohashi
  • for: This paper learns event representations shared across multiple temporal links (TLINKs) to improve temporal relation classification.
  • methods: The paper proposes an event-centric model that shares information across TLINKs and uses multi-task learning over the three TLINK categories to exploit the full dataset (see the sketch after the abstract below).
  • results: Experiments show that the proposal outperforms state-of-the-art models and two transfer learning baselines on both English and Japanese data.
    Abstract Temporal relation classification is a pair-wise task for identifying the relation of a temporal link (TLINK) between two mentions, i.e. event, time, and document creation time (DCT). It leads to two crucial limits: 1) Two TLINKs involving a common mention do not share information. 2) Existing models with independent classifiers for each TLINK category (E2E, E2T, and E2D) hinder from using the whole data. This paper presents an event centric model that allows to manage dynamic event representations across multiple TLINKs. Our model deals with three TLINK categories with multi-task learning to leverage the full size of data. The experimental results show that our proposal outperforms state-of-the-art models and two transfer learning baselines on both the English and Japanese data.
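
A minimal sketch of an event-centric, multi-category TLINK classifier with a shared pair encoder and one head per category, trained jointly; the backbone, dimensions, and relation inventory are assumptions for illustration.

```python
# Shared pair encoder with one head per TLINK category (event-event, event-time,
# event-DCT), trained jointly so all categories contribute to the shared representations.
import torch
import torch.nn as nn

class MultiCategoryTLINK(nn.Module):
    def __init__(self, hidden=768, num_relations=4):
        super().__init__()
        self.pair_encoder = nn.Sequential(           # stand-in for a BERT pair encoder
            nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.heads = nn.ModuleDict({
            cat: nn.Linear(hidden, num_relations) for cat in ("E2E", "E2T", "E2D")})

    def forward(self, src_repr, tgt_repr, category):
        pair = torch.cat([src_repr, tgt_repr], dim=-1)
        return self.heads[category](self.pair_encoder(pair))

model = MultiCategoryTLINK()
loss_fn = nn.CrossEntropyLoss()
# One multi-task step: sum the losses of whatever categories appear in the batch.
batch = [("E2E", torch.randn(8, 768), torch.randn(8, 768), torch.randint(0, 4, (8,))),
         ("E2D", torch.randn(8, 768), torch.randn(8, 768), torch.randint(0, 4, (8,)))]
loss = sum(loss_fn(model(s, t, cat), y) for cat, s, t, y in batch)
loss.backward()
```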

General-Purpose Retrieval-Enhanced Medical Prediction Model Using Near-Infinite History

  • paper_url: http://arxiv.org/abs/2310.20204
  • repo_url: https://github.com/starmpcc/remed
  • paper_authors: Junu Kim, Chaeeun Shim, Bosco Seong Kyu Yang, Chami Im, Sung Yoon Lim, Han-Gil Jeong, Edward Choi
  • for: Developing clinical prediction models (e.g., mortality prediction) from electronic health records (EHRs) without relying on expert-driven feature selection or manual tuning of the observation window size.
  • methods: The Retrieval-Enhanced Medical prediction model (REMed) evaluates an essentially unlimited number of clinical events, selects the relevant ones, and makes predictions, removing the need for manual feature selection and any restriction on the observation window and thereby speeding up development (a minimal pipeline is sketched after the abstract below).
  • results: Experiments on 27 clinical tasks and two independent cohorts from public EHR datasets show that REMed outperforms other contemporary architectures designed to handle as many events as possible, and that its preferences align closely with those of medical experts.
    Abstract Developing clinical prediction models (e.g., mortality prediction) based on electronic health records (EHRs) typically relies on expert opinion for feature selection and adjusting observation window size. This burdens experts and creates a bottleneck in the development process. We propose Retrieval-Enhanced Medical prediction model (REMed) to address such challenges. REMed can essentially evaluate an unlimited number of clinical events, select the relevant ones, and make predictions. This approach effectively eliminates the need for manual feature selection and enables an unrestricted observation window. We verified these properties through experiments on 27 clinical tasks and two independent cohorts from publicly available EHR datasets, where REMed outperformed other contemporary architectures that aim to handle as many events as possible. Notably, we found that the preferences of REMed align closely with those of medical experts. We expect our approach to significantly expedite the development of EHR prediction models by minimizing clinicians' need for manual involvement.
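
A minimal sketch of a REMed-style retrieve-then-predict pipeline: every clinical event is scored by a learned retriever, the top-k events are softly aggregated, and a predictor outputs the task probability; the event encoding, k, and dimensions are assumptions.

```python
# Retrieval-enhanced prediction over an unbounded patient history (illustrative only).
import torch
import torch.nn as nn

class RetrievalEnhancedPredictor(nn.Module):
    def __init__(self, event_dim=256, k=64):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(event_dim, 1)          # relevance score per event
        self.predictor = nn.Sequential(
            nn.Linear(event_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, events):                         # (num_events, event_dim)
        scores = self.scorer(events).squeeze(-1)       # (num_events,)
        k = min(self.k, events.size(0))
        top_scores, idx = scores.topk(k)
        weights = torch.softmax(top_scores, dim=0)     # soft aggregation of the top-k
        pooled = (weights.unsqueeze(-1) * events[idx]).sum(dim=0)
        return torch.sigmoid(self.predictor(pooled))   # e.g. mortality probability

# A patient history of arbitrary length - no fixed observation window.
events = torch.randn(3000, 256)
prob = RetrievalEnhancedPredictor()(events)
```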

Video-Helpful Multimodal Machine Translation

  • paper_url: http://arxiv.org/abs/2310.20201
  • repo_url: https://github.com/ku-nlp/video-helpful-mmt
  • paper_authors: Yihang Li, Shuichiro Shimizu, Chenhui Chu, Sadao Kurohashi, Wei Li
  • for: This work aims to improve multimodal machine translation (MMT) by using video to resolve linguistic ambiguity in subtitles.
  • methods: It introduces EVA, a dataset of 852k Japanese-English (Ja-En) and 520k Chinese-English (Zh-En) parallel subtitle pairs with the corresponding video clips, and proposes SAFA, a model based on Selective Attention with two new techniques: frame attention loss and ambiguity augmentation (a fusion-step sketch follows the abstract below).
  • results: Experiments show that visual information and the proposed methods boost translation performance, and the model significantly outperforms existing MMT models.
    Abstract Existing multimodal machine translation (MMT) datasets consist of images and video captions or instructional video subtitles, which rarely contain linguistic ambiguity, making visual information ineffective in generating appropriate translations. Recent work has constructed an ambiguous subtitles dataset to alleviate this problem but is still limited to the problem that videos do not necessarily contribute to disambiguation. We introduce EVA (Extensive training set and Video-helpful evaluation set for Ambiguous subtitles translation), an MMT dataset containing 852k Japanese-English (Ja-En) parallel subtitle pairs, 520k Chinese-English (Zh-En) parallel subtitle pairs, and corresponding video clips collected from movies and TV episodes. In addition to the extensive training set, EVA contains a video-helpful evaluation set in which subtitles are ambiguous, and videos are guaranteed helpful for disambiguation. Furthermore, we propose SAFA, an MMT model based on the Selective Attention model with two novel methods: Frame attention loss and Ambiguity augmentation, aiming to use videos in EVA for disambiguation fully. Experiments on EVA show that visual information and the proposed methods can boost translation performance, and our model performs significantly better than existing MMT models. The EVA dataset and the SAFA model are available at: https://github.com/ku-nlp/video-helpful-MMT.git.
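
A rough sketch of a selective-attention fusion step in the spirit of SAFA, where subtitle token states attend over video-frame features and a learned gate controls how much visual context is mixed in; dimensions are assumptions, and the paper's frame attention loss and ambiguity augmentation are only indicated in a comment.

```python
# Selective visual fusion: text attends over frame features, a gate scales the mix.
import torch
import torch.nn as nn

class SelectiveVisualFusion(nn.Module):
    def __init__(self, d_text=512, d_frame=768):
        super().__init__()
        self.frame_proj = nn.Linear(d_frame, d_text)
        self.attn = nn.MultiheadAttention(d_text, num_heads=8, batch_first=True)
        self.gate = nn.Linear(2 * d_text, 1)

    def forward(self, text_h, frame_feats):
        frames = self.frame_proj(frame_feats)                  # (batch, n_frames, d_text)
        visual, attn_weights = self.attn(text_h, frames, frames)
        g = torch.sigmoid(self.gate(torch.cat([text_h, visual], dim=-1)))
        # attn_weights could additionally be supervised (roughly the role of a
        # frame attention loss); that supervision is not implemented here.
        return text_h + g * visual

fusion = SelectiveVisualFusion()
text_h = torch.randn(2, 20, 512)        # subtitle token states
frames = torch.randn(2, 16, 768)        # sampled video frame features
fused = fusion(text_h, frames)          # would be fed to the translation decoder
```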

DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text

  • paper_url: http://arxiv.org/abs/2310.20170
  • repo_url: None
  • paper_authors: Wenting Zhao, Ye Liu, Tong Niu, Yao Wan, Philip S. Yu, Shafiq Joty, Yingbo Zhou, Semih Yavuz
  • for: This paper targets grounding large language models (LLMs) in external knowledge so they can better answer questions that require less commonly known information.
  • methods: It curates a benchmark and proposes an approach that leverages multiple retrieval tools, including text passage retrieval and symbolic-language-assisted retrieval over a knowledge base, to ground the LLM on heterogeneous sources (a tool-selection loop is sketched after the abstract below).
  • results: The model excels on two challenges, two-hop multi-source questions and symbolic query generation (e.g., SPARQL for Wikidata), outperforming previous approaches by a significant margin.
    Abstract Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when solely relying on their internal knowledge, especially when answering questions that require less commonly known information. Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge. Nonetheless, recent approaches have primarily emphasized retrieval from unstructured text corpora, owing to its seamless integration into prompts. When using structured data such as knowledge graphs, most methods simplify it into natural text, neglecting the underlying structures. Moreover, a significant gap in the current landscape is the absence of a realistic benchmark for evaluating the effectiveness of grounding LLMs on heterogeneous knowledge sources (e.g., knowledge base and text). To fill this gap, we have curated a comprehensive dataset that poses two unique challenges: (1) Two-hop multi-source questions that require retrieving information from both open-domain structured and unstructured knowledge sources; retrieving information from structured knowledge sources is a critical component in correctly answering the questions. (2) The generation of symbolic queries (e.g., SPARQL for Wikidata) is a key requirement, which adds another layer of challenge. Our dataset is created using a combination of automatic generation through predefined reasoning chains and human annotation. We also introduce a novel approach that leverages multiple retrieval tools, including text passage retrieval and symbolic language-assisted retrieval. Our model outperforms previous approaches by a significant margin, demonstrating its effectiveness in addressing the above-mentioned reasoning challenges.
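
A sketch of a two-hop, multi-source answering loop of the kind this benchmark targets: at each hop the system chooses between text passage retrieval and a symbolic (SPARQL) query over a knowledge base; all helpers and the tool-selection prompt are placeholders, not the paper's system.

```python
# Two-hop answering over heterogeneous sources with LLM-driven tool selection.

def call_llm(prompt):
    """Placeholder for an LLM call returning plain text."""
    raise NotImplementedError

def retrieve_passages(query, k=5):
    """Placeholder: dense or sparse retrieval over an open-domain text corpus."""
    raise NotImplementedError

def run_sparql(sparql_query):
    """Placeholder: execute a SPARQL query against a KB endpoint (e.g. Wikidata)."""
    raise NotImplementedError

def answer(question, hops=2):
    evidence = []
    for _ in range(hops):
        tool = call_llm(
            f"Question: {question}\nEvidence so far: {evidence}\n"
            "Which source should be consulted next, 'text' or 'kb'? Answer one word.")
        if tool.strip().lower() == "kb":
            sparql = call_llm(f"Write a SPARQL query that helps answer: {question}")
            evidence.append(run_sparql(sparql))        # symbolic-language-assisted retrieval
        else:
            evidence.append(retrieve_passages(question))
    return call_llm(f"Question: {question}\nEvidence: {evidence}\nAnswer concisely.")
```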

GAR-meets-RAG Paradigm for Zero-Shot Information Retrieval

  • paper_url: http://arxiv.org/abs/2310.20158
  • repo_url: None
  • paper_authors: Daman Arora, Anush Kini, Sayak Ray Chowdhury, Nagarajan Natarajan, Gaurav Sinha, Amit Sharma
  • for: This work aims to improve zero-shot information retrieval, i.e., retrieval without access to labeled data from the target domain.
  • methods: The method combines large language models (LLMs) with embedding-based retrieval models, unifying the two popular paradigms of generation-augmented retrieval (GAR) and retrieval-augmented generation (RAG) in an iterative rewrite-retrieve recurrence with a final re-ranking stage (sketched after the abstract below).
  • results: Extensive zero-shot experiments on the BEIR and TREC-DL benchmarks establish a new state of the art, surpassing the previous best Recall@100 and nDCG@10 results on 6 of 8 BEIR datasets, with up to 17% relative gains.
    Abstract Given a query and a document corpus, the information retrieval (IR) task is to output a ranked list of relevant documents. Combining large language models (LLMs) with embedding-based retrieval models, recent work shows promising results on the zero-shot retrieval problem, i.e., no access to labeled data from the target domain. Two such popular paradigms are generation-augmented retrieval or GAR (generate additional context for the query and then retrieve), and retrieval-augmented generation or RAG (retrieve relevant documents as context and then generate answers). The success of these paradigms hinges on (i) high-recall retrieval models, which are difficult to obtain in the zero-shot setting, and (ii) high-precision (re-)ranking models which typically need a good initialization. In this work, we propose a novel GAR-meets-RAG recurrence formulation that overcomes the challenges of existing paradigms. Our method iteratively improves retrieval (via GAR) and rewrite (via RAG) stages in the zero-shot setting. A key design principle is that the rewrite-retrieval stages improve the recall of the system and a final re-ranking stage improves the precision. We conduct extensive experiments on zero-shot passage retrieval benchmarks, BEIR and TREC-DL. Our method establishes a new state-of-the-art in the BEIR benchmark, outperforming previous best results in Recall@100 and nDCG@10 metrics on 6 out of 8 datasets, with up to 17% relative gains over the previous best.
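
A sketch of the GAR-meets-RAG recurrence: alternate LLM-based query rewriting and retrieval to grow recall, then re-rank once for precision; `retrieve`, `llm_rewrite`, and `rerank` are placeholders for a retriever, an LLM client, and a cross-encoder re-ranker.

```python
# Iterative rewrite-retrieve recurrence followed by a final re-ranking stage.

def retrieve(query, k=100):
    """Placeholder: return top-k documents for the query from a dense/BM25 index."""
    raise NotImplementedError

def llm_rewrite(query, docs):
    """Placeholder: ask an LLM to rewrite/expand the query given retrieved context."""
    raise NotImplementedError

def rerank(query, docs):
    """Placeholder: precision-oriented re-ranking of the final candidate pool."""
    raise NotImplementedError

def gar_meets_rag(query, iterations=3):
    candidates, current = [], query
    for _ in range(iterations):
        docs = retrieve(current)                 # recall-oriented retrieval step
        candidates.extend(d for d in docs if d not in candidates)
        current = llm_rewrite(query, docs[:5])   # generation-augmented rewrite step
    return rerank(query, candidates)             # final precision-oriented re-ranking
```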

Multi-Agent Consensus Seeking via Large Language Models

  • paper_url: http://arxiv.org/abs/2310.20151
  • repo_url: None
  • paper_authors: Huaben Chen, Wenkang Ji, Lufeng Xu, Shiyu Zhao
  • for: This paper studies consensus seeking in multi-agent systems driven by large language models (LLMs), focusing on how the agents negotiate and how various factors shape the outcome.
  • methods: LLMs drive the agents, whose states are numerical values; the paper analyzes the strategies they use to reach consensus (the average strategy and others) and examines the impact of agent number, agent personality, and network topology on the negotiation process (a minimal simulation is sketched after the abstract below).
  • results: The LLM-driven agents primarily adopt the average strategy for consensus seeking, and the negotiation is affected by agent number, agent personality, and network topology; LLM-driven consensus seeking also enables zero-shot autonomous planning in a multi-robot aggregation task.
    Abstract Multi-agent systems driven by large language models (LLMs) have shown promising abilities for solving complex tasks in a collaborative manner. This work considers a fundamental problem in multi-agent collaboration: consensus seeking. When multiple agents work together, we are interested in how they can reach a consensus through inter-agent negotiation. To that end, this work studies a consensus-seeking task where the state of each agent is a numerical value and they negotiate with each other to reach a consensus value. It is revealed that when not explicitly directed on which strategy should be adopted, the LLM-driven agents primarily use the average strategy for consensus seeking although they may occasionally use some other strategies. Moreover, this work analyzes the impact of the agent number, agent personality, and network topology on the negotiation process. The findings reported in this work can potentially lay the foundations for understanding the behaviors of LLM-driven multi-agent systems for solving more complex tasks. Furthermore, LLM-driven consensus seeking is applied to a multi-robot aggregation task. This application demonstrates the potential of LLM-driven agents to achieve zero-shot autonomous planning for multi-robot collaboration tasks. Project website: westlakeintelligentrobotics.github.io/ConsensusLLM/.
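
A minimal simulation of the consensus-seeking setup: each agent holds a numerical state and is asked, round by round, for its next value given the values reported by the others; `call_llm` is a placeholder, and the fallback reproduces the average strategy the paper observes.

```python
# Numeric consensus seeking among agents; prompt wording is illustrative only.

def call_llm(prompt):
    """Placeholder for an LLM call; return None to fall back to plain averaging."""
    return None

def negotiate(states, rounds=10):
    states = list(states)
    for _ in range(rounds):
        new_states = []
        for i, own in enumerate(states):
            others = [s for j, s in enumerate(states) if j != i]
            prompt = (f"Your current value is {own}. The other agents report "
                      f"{others}. Reply with your next value only.")
            reply = call_llm(prompt)
            if reply is None:                        # average strategy fallback
                new_states.append((own + sum(others)) / (1 + len(others)))
            else:
                new_states.append(float(reply))
        states = new_states
    return states

print(negotiate([10.0, 40.0, 70.0]))   # values converge toward a common consensus
```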

DEPN: Detecting and Editing Privacy Neurons in Pretrained Language Models

  • paper_url: http://arxiv.org/abs/2310.20138
  • repo_url: None
  • paper_authors: Xinwei Wu, Junzhuo Li, Minghui Xu, Weilong Dong, Shuangzhi Wu, Chao Bian, Deyi Xiong
  • for: Reducing the risk of private data leakage from large language models.
  • methods: Proposes DEPN, a framework for detecting and editing privacy neurons in pretrained language models, comprising a privacy neuron detector, activation zeroing for the detected neurons, and a batch-mode privacy neuron aggregator (a simplified detect-and-edit sketch follows the abstract below).
  • results: Experiments show that the method significantly and efficiently reduces the exposure of private data without degrading model performance.
    Abstract Large language models pretrained on a huge amount of data capture rich knowledge and information in the training data. The ability of data memorization and regurgitation in pretrained language models, revealed in previous studies, brings the risk of data leakage. In order to effectively reduce these risks, we propose a framework DEPN to Detect and Edit Privacy Neurons in pretrained language models, partially inspired by knowledge neurons and model editing. In DEPN, we introduce a novel method, termed as privacy neuron detector, to locate neurons associated with private information, and then edit these detected privacy neurons by setting their activations to zero. Furthermore, we propose a privacy neuron aggregator to dememorize private information in a batch processing manner. Experimental results show that our method can significantly and efficiently reduce the exposure of private data leakage without deteriorating the performance of the model. Additionally, we empirically demonstrate the relationship between model memorization and privacy neurons, from multiple perspectives, including model size, training time, prompts, privacy neuron distribution, illustrating the robustness of our approach.
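
A simplified sketch of the detect-and-edit idea on GPT-2: feed-forward neurons are scored by gradient x activation against the model's likelihood of a private string, and the top-scoring neurons are then zeroed with forward hooks; the attribution rule, layer choice, top-k, and the private string are simplifications and assumptions, not the paper's exact privacy neuron detector and editor.

```python
# (1) score FFN neurons against the likelihood of a private string,
# (2) zero the top-scoring neurons' activations with forward hooks.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

private = "Alice Smith's phone number is 555-0199"   # hypothetical private string
ids = tok(private, return_tensors="pt").input_ids

# --- 1. detection: gradient x activation per feed-forward neuron -------------
acts = {}
def save_act(layer_idx):
    def hook(module, inputs, output):
        output.retain_grad()
        acts[layer_idx] = output                      # (1, seq, 4*hidden)
    return hook

handles = [blk.mlp.c_fc.register_forward_hook(save_act(i))
           for i, blk in enumerate(model.transformer.h)]
loss = model(ids, labels=ids).loss                    # NLL of the private string
loss.backward()
for h in handles:
    h.remove()

with torch.no_grad():
    scores = torch.stack([(acts[i].grad * acts[i]).abs().sum(dim=(0, 1))
                          for i in range(len(model.transformer.h))])
top = scores.flatten().topk(20).indices               # 20 candidate "privacy neurons"
privacy_neurons = [(int(i) // scores.size(1), int(i) % scores.size(1)) for i in top]

# --- 2. editing: zero the detected neurons' activations ----------------------
def zero_neurons(neuron_ids):
    def hook(module, inputs, output):
        output = output.clone()
        output[..., list(neuron_ids)] = 0.0
        return output
    return hook

by_layer = {}
for layer, unit in privacy_neurons:
    by_layer.setdefault(layer, set()).add(unit)
for layer, units in by_layer.items():
    model.transformer.h[layer].mlp.c_fc.register_forward_hook(zero_neurons(units))
```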

Improving Prompt Tuning with Learned Prompting Layers

  • paper_url: http://arxiv.org/abs/2310.20127
  • repo_url: None
  • paper_authors: Wei Zhu, Ming Tan
  • for: Adapting pretrained models (PTMs) to downstream tasks via prompt tuning, with strong performance in both full-data and few-shot settings.
  • methods: Proposes Selective Prompt Tuning (SPT), a framework that learns which layers should receive prompts by inserting prompts controlled by learnable probabilistic gates at each intermediate layer, plus SPT-DARTS, a bi-level optimization framework that better optimizes the gates and the resulting prompt-layer settings (a gate sketch follows the abstract below).
  • results: Experiments on ten benchmark datasets under full-data and few-shot scenarios show that SPT outperforms previous state-of-the-art PETuning baselines with comparable or fewer tunable parameters.
    Abstract Prompt tuning prepends a soft prompt to the input embeddings or hidden states and only optimizes the prompt to adapt pretrained models (PTMs) to downstream tasks. The previous work manually selects prompt layers which are far from optimal and failed to exploit the potential of prompt tuning. In this work, we propose a novel framework, Selective Prompt Tuning (SPT), that learns to select the proper prompt layers by inserting a prompt controlled by a learnable probabilistic gate at each intermediate layer. We further propose a novel bi-level optimization framework, SPT-DARTS, that can better optimize the learnable gates and improve the final prompt tuning performances of the learned prompt layer settings. We conduct extensive experiments with ten benchmark datasets under the full-data and few-shot scenarios. The results demonstrate that our SPT framework can perform better than the previous state-of-the-art PETuning baselines with comparable or fewer tunable parameters.
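
A sketch of a gated prompt layer in the spirit of SPT: a learnable probabilistic gate decides whether a soft prompt is (re-)injected at an intermediate layer; the gate parameterization and prompt length are assumptions, and SPT-DARTS's bi-level optimization of the gates is not shown.

```python
# Learnable probabilistic gate controlling prompt injection at one layer.
import torch
import torch.nn as nn

class GatedPromptLayer(nn.Module):
    def __init__(self, hidden=768, prompt_len=10):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, hidden) * 0.02)
        self.gate_logit = nn.Parameter(torch.zeros(1))   # learnable gate

    def forward(self, hidden_states):                    # (batch, prompt_len + seq, hidden)
        p = torch.sigmoid(self.gate_logit)               # probability of prompting here
        batch = hidden_states.size(0)
        prompt_len = self.prompt.size(0)
        new_prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        # Mix the fresh prompt with whatever currently occupies the prompt slots,
        # weighted by the gate; with a hard 0/1 gate this becomes "insert or skip".
        mixed = p * new_prompt + (1 - p) * hidden_states[:, :prompt_len]
        return torch.cat([mixed, hidden_states[:, prompt_len:]], dim=1)

layer_gate = GatedPromptLayer()
hidden = torch.randn(4, 10 + 32, 768)    # prompt slots already prepended at layer 0
out = layer_gate(hidden)                 # same shape, prompt slots possibly refreshed
```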

Ling-CL: Understanding NLP Models through Linguistic Curricula

  • paper_url: http://arxiv.org/abs/2310.20121
  • repo_url: https://github.com/clu-uml/ling-cl
  • paper_authors: Mohamed Elgaar, Hadi Amiri
  • for: Developing data-driven curricula to understand the linguistic knowledge that models learn when addressing NLP tasks.
  • methods: Uses characterizations of linguistic complexity from psycholinguistic and language acquisition research to derive linguistic curricula from data, existing knowledge about linguistic complexity, and model behavior during training (a toy curriculum is sketched after the abstract below).
  • results: Analysis of several benchmark NLP datasets identifies sets of linguistic metrics (indices) that characterize the challenges and reasoning each task requires.
    Abstract We employ a characterization of linguistic complexity from psycholinguistic and language acquisition research to develop data-driven curricula to understand the underlying linguistic knowledge that models learn to address NLP tasks. The novelty of our approach is in the development of linguistic curricula derived from data, existing knowledge about linguistic complexity, and model behavior during training. By analyzing several benchmark NLP datasets, our curriculum learning approaches identify sets of linguistic metrics (indices) that inform the challenges and reasoning required to address each task. Our work will inform future research in all NLP areas, allowing linguistic complexity to be considered early in the research and development process. In addition, our work prompts an examination of gold standards and fair evaluation in NLP.
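
A toy curriculum sketch: each training example is scored with a linguistic complexity index and training proceeds from simpler to more complex examples; the crude index below is only a stand-in for the psycholinguistically motivated indices the paper derives from data and model behavior.

```python
# Order training examples by a (very rough) linguistic complexity index.

def complexity_index(sentence):
    tokens = sentence.lower().split()
    type_token_ratio = len(set(tokens)) / max(len(tokens), 1)
    return len(tokens) * (1.0 + type_token_ratio)

def curriculum_order(examples):
    """Return (sentence, label) pairs sorted from easiest to hardest."""
    return sorted(examples, key=lambda ex: complexity_index(ex[0]))

train = [("The cat sat.", 1),
         ("Notwithstanding prior objections, the committee deferred the vote.", 0),
         ("Dogs bark loudly at night.", 1)]
for sentence, label in curriculum_order(train):
    pass  # feed batches to the model in this order, easiest first
```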

Making Large Language Models Better Data Creators

  • paper_url: http://arxiv.org/abs/2310.20111
  • repo_url: https://github.com/microsoft/llm-data-creation
  • paper_authors: Dong-Ho Lee, Jay Pujara, Mohit Sewak, Ryen W. White, Sujay Kumar Jauhar
  • for: Improving the robustness of NLP systems deployed in real-world applications.
  • methods: A unified data creation pipeline that uses instruction-following LLMs as data creators and requires only a single formatting example, applicable to a broad range of tasks including those with semantically devoid label spaces (a minimal sketch follows the abstract below).
  • results: Models trained on the LLM-created data outperform those trained on human-labeled data by up to 17.5% on out-of-distribution evaluation while maintaining comparable performance on in-distribution tasks.
    Abstract Although large language models (LLMs) have advanced the state-of-the-art in NLP significantly, deploying them for downstream applications is still challenging due to cost, responsiveness, control, or concerns around privacy and security. As such, trainable models are still the preferred option in some cases. However, these models still require human-labeled data for optimal performance, which is expensive and time-consuming to obtain. In order to address this issue, several techniques to reduce human effort involve labeling or generating data using LLMs. Although these methods are effective for certain applications, in practice they encounter difficulties in real-world scenarios. Labeling data requires careful data selection, while generating data necessitates task-specific prompt engineering. In this paper, we propose a unified data creation pipeline that requires only a single formatting example, and which is applicable to a broad range of tasks, including traditionally problematic ones with semantically devoid label spaces. In our experiments we demonstrate that instruction-following LLMs are highly cost-effective data creators, and that models trained with these data exhibit performance better than those trained with human-labeled data (by up to 17.5%) on out-of-distribution evaluation, while maintaining comparable performance on in-distribution tasks. These results have important implications for the robustness of NLP systems deployed in the real-world.
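
A minimal sketch of data creation from a single formatting example: the LLM sees one formatted instance and is asked to produce new labeled instances in the same JSON format; `call_llm`, the example, and the parsing are illustrative placeholders, not the released pipeline.

```python
# Generate labeled examples from one formatting example with an instruction-following LLM.
import json

FORMAT_EXAMPLE = {"text": "The battery lasts two full days.", "label": "positive"}

def call_llm(prompt):
    """Placeholder: return the raw text completion of an instruction-following LLM."""
    raise NotImplementedError

def create_examples(task_description, n=20):
    prompt = (f"{task_description}\n"
              f"Here is one example in the required JSON format:\n"
              f"{json.dumps(FORMAT_EXAMPLE)}\n"
              f"Generate {n} new, diverse examples, one JSON object per line.")
    raw = call_llm(prompt)
    examples = []
    for line in raw.splitlines():
        try:
            examples.append(json.loads(line))     # keep only well-formed rows
        except json.JSONDecodeError:
            continue
    return examples
```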

Keyword-optimized Template Insertion for Clinical Information Extraction via Prompt-based Learning

  • paper_url: http://arxiv.org/abs/2310.20089
  • repo_url: None
  • paper_authors: Eugenia Alleva, Isotta Landi, Leslee J Shaw, Erwin Böttinger, Thomas J Fuchs, Ipek Ensari
  • for: Improving model performance on clinical note classification using only a few training examples.
  • methods: Uses prompt-based learning and develops a keyword-optimized template insertion method (KOTI) that optimizes where the prompt template is placed relative to task-relevant keywords in the note (a toy example follows the abstract below).
  • results: Optimizing template position improves performance on several clinical tasks in both zero-shot and few-shot settings.
    Abstract Clinical note classification is a common clinical NLP task. However, annotated datasets are scarce. Prompt-based learning has recently emerged as an effective method to adapt pre-trained models for text classification using only a few training examples. A critical component of prompt design is the definition of the template (i.e. prompt text). The effect of template position, however, has been insufficiently investigated. This seems particularly important in the clinical setting, where task-relevant information is usually sparse in clinical notes. In this study we develop a keyword-optimized template insertion method (KOTI) and show how optimizing position can improve performance on several clinical tasks in a zero-shot and few-shot training setting.
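
A toy sketch of keyword-optimized template insertion: the cloze template is placed next to the sentence containing a task keyword rather than at the end of a long note; the keywords, template, and the verbalizer mentioned in the final comment are hypothetical.

```python
# Insert a cloze-style prompt template near a task keyword in a clinical note.

TEMPLATE = " The patient {mask} this condition."          # hypothetical cloze template
KEYWORDS = ["chest pain", "dyspnea", "angina"]            # hypothetical task keywords

def insert_template(note, mask_token="[MASK]"):
    sentences = note.split(". ")
    # Find the first sentence mentioning a task keyword; default to the end.
    position = next((i for i, s in enumerate(sentences)
                     if any(k in s.lower() for k in KEYWORDS)), len(sentences) - 1)
    sentences.insert(position + 1, TEMPLATE.format(mask=mask_token).strip())
    return ". ".join(sentences)

note = ("Patient presents for follow-up. Reports intermittent chest pain on "
        "exertion. Vitals stable. Plan: stress test.")
print(insert_template(note))
# A masked-LM then scores verbalizer tokens (e.g. "has" vs "denies") at [MASK].
```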