cs.CL - 2023-11-09

Identification of Books That are Suitable for Middle School Students Using Artificial Neural Networks

paper_url: http://arxiv.org/abs/2311.07591
repo_url: None
paper_authors: Alp Niksarli, Sadik Ozan Gorgu, Ege Gencer
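for: This paper aims to develop an algorithm to support the selection of reading material for middle school students.
methods: The paper uses the Python programming language and natural language processing techniques, and trains an artificial neural network on the resulting dataset.
results: After training, the artificial neural network achieved a 90.06% consistency rate in determining whether books are appropriate for middle school students.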

Abstract
Reading the right books contributes to children's imagination and brain development, enhances their language and emotional comprehension abilities, and strengthens their relationships with others. Building upon the critical role of reading books in individual development, this paper aims to develop an algorithm that determines the suitability of books for middle school students by analyzing their structural and semantic features. Using the methods described, an algorithm will be created that can be utilized by institutions and individuals responsible for children's education, such as Ministry of National Education officials and schools. This algorithm will facilitate the selection of books to be taught at the middle school level. With the algorithm, the book selection process for the middle school curriculum can be expedited, and it will serve as a preliminary reference source for those who evaluate books by reading them. In this paper, the Python programming language was employed, utilizing natural language processing methods. Additionally, an artificial neural network (ANN) was trained using data which had been preprocessed to construct an original dataset. To train this network, suitable books for middle school students provided by the MEB, Oxford and Cambridge were included, together with books that are inappropriate for middle school students in terms of content, with content assessed based on the "R" criterion. This trained neural network achieved a 90.06% consistency rate in determining the appropriateness of the test-provided books. Considering the obtained findings, it can be concluded that the developed software has achieved the desired objective.

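Summary
Reading suitable books benefits children's imagination and brain development, their language and emotional comprehension abilities, and their relationships with others. Building on the important role of reading in individual development, this paper aims to develop an algorithm that judges whether books are suitable for middle school students. Using the methods described, the paper creates an algorithm that educational institutions and individuals can use to select books for the middle school curriculum. The algorithm will speed up the book selection process for the middle school curriculum and can serve as a preliminary reference for those evaluating books. The work uses the Python programming language and natural language processing techniques. In addition, an artificial neural network (ANN) was trained on preprocessed data assembled into an original dataset. To train the network, books suitable for middle school students were provided by the MEB, Oxford and Cambridge, and content was assessed against the "R" criterion. The trained network achieved a 90.06% consistency rate in judging the suitability of the test books. Based on these results, it can be concluded that the developed software has achieved the desired objective.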

FAMuS: Frames Across Multiple Sources

paper_url: http://arxiv.org/abs/2311.05601
repo_url: https://github.com/factslab/famus
paper_authors: Siddharth Vashishtha, Alexander Martin, William Gantt, Benjamin Van Durme, Aaron Steven White
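for: This work provides a new event description dataset to help language processing technology better understand event descriptions.
methods: The work pairs Wikipedia passages with non-Wikipedia source articles and annotates events and arguments against FrameNet.
results: The work reports results on two key event understanding tasks: source validation and cross-document argument extraction.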

Abstract
Understanding event descriptions is a central aspect of language processing, but current approaches focus overwhelmingly on single sentences or documents. Aggregating information about an event across documents can offer a much richer understanding. To this end, we present FAMuS, a new corpus of Wikipedia passages that report on some event, paired with underlying, genre-diverse (non-Wikipedia) source articles for the same event. Events and (cross-sentence) arguments in both report and source are annotated against FrameNet, providing broad coverage of different event types. We present results on two key event understanding tasks enabled by FAMuS: source validation -- determining whether a document is a valid source for a target report event -- and cross-document argument extraction -- full-document argument extraction for a target event from both its report and the correct source article. We release both FAMuS and our models to support further research.

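Summary
Understanding event descriptions is a central aspect of language processing, but existing approaches focus mainly on single sentences or documents. Aggregating information about an event across documents can offer a deeper understanding. To this end, we present FAMuS, a new corpus of Wikipedia passages paired with genre-diverse (non-Wikipedia) source articles describing the same event. Events and cross-sentence arguments in both report and source are annotated against FrameNet, providing broad coverage of different event types. We present two key event understanding tasks: determining whether a document is a valid source for a target report event, and extracting arguments across documents from the report and the correct source article. We release FAMuS and our models to support further research.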

The Iron(ic) Melting Pot: Reviewing Human Evaluation in Humour, Irony and Sarcasm Generation

paper_url: http://arxiv.org/abs/2311.05552
repo_url: None
paper_authors: Tyler Loakman, Aaron Maladry, Chenghua Lin
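for: This paper argues that the generation of more esoteric forms of language, such as humour, irony and sarcasm, requires a more diverse and transparent evaluator panel, and that demographic information should be reported to ensure replicability.
methods: The paper takes a review approach, combining an overview of each language form with an analysis of examples to support its claims.
results: The paper finds that evaluator panels are under-reported in current NLG evaluation: many studies rely on crowdsourcing platforms, and evaluator demographic information goes unreported.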

Abstract
Human evaluation is often considered to be the gold standard method of evaluating a Natural Language Generation system. However, whilst its importance is accepted by the community at large, the quality of its execution is often brought into question. In this position paper, we argue that the generation of more esoteric forms of language - humour, irony and sarcasm - constitutes a subdomain where the characteristics of selected evaluator panels are of utmost importance, and every effort should be made to report demographic characteristics wherever possible, in the interest of transparency and replicability. We support these claims with an overview of each language form and an analysis of examples in terms of how their interpretation is affected by different participant variables. We additionally perform a critical survey of recent works in NLG to assess how well evaluation procedures are reported in this subdomain, and note a severe lack of open reporting of evaluator demographic information, and a significant reliance on crowdsourcing platforms for recruitment.

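Summary
Human evaluation is often regarded as the gold standard for evaluating Natural Language Generation systems. However, the quality of its execution is often called into question. In this position paper, we argue that the generation of more esoteric forms of language, such as humour, irony and sarcasm, is a subdomain where the characteristics of the evaluator panel are especially important, and that every effort should be made to report participant variables in the interest of transparency and replicability. We support these claims with an overview of each language form and an analysis of examples, and with a critical survey of recent NLG work assessing how evaluation procedures are reported. We find that evaluator demographic information is not openly reported and that recruitment relies heavily on crowdsourcing platforms.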

Towards End-to-End Spoken Grammatical Error Correction

paper_url: http://arxiv.org/abs/2311.05550
repo_url: None
paper_authors: Stefano Bannò, Rao Ma, Mengjie Qian, Kate M. Knill, Mark J. F. Gales
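for: This paper proposes a new end-to-end approach to spoken grammatical error correction (GEC), in order to give second-language learners more effective feedback.
methods: The paper builds an end-to-end approach on the speech recognition foundation model Whisper as an alternative to the traditional cascaded pipeline. The end-to-end model can replace the cascaded pipeline in whole or in part.
results: End-to-end spoken GEC proves feasible, but because data is limited, its current performance falls below that of a cascaded system that uses large quantities of text-based GEC data. End-to-end disfluency detection and removal, however, outperforms the cascaded approach.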

Abstract
Grammatical feedback is crucial for L2 learners, teachers, and testers. Spoken grammatical error correction (GEC) aims to supply feedback to L2 learners on their use of grammar when speaking. This process usually relies on a cascaded pipeline comprising an ASR system, disfluency removal, and GEC, with the associated concern of propagating errors between these individual modules. In this paper, we introduce an alternative "end-to-end" approach to spoken GEC, exploiting a speech recognition foundation model, Whisper. This foundation model can be used to replace the whole framework or part of it, e.g., ASR and disfluency removal. These end-to-end approaches are compared to more standard cascaded approaches on the data obtained from a free-speaking spoken language assessment test, Linguaskill. Results demonstrate that end-to-end spoken GEC is possible within this architecture, but the lack of available data limits current performance compared to a system using large quantities of text-based GEC data. Conversely, end-to-end disfluency detection and removal, which is easier for the attention-based Whisper to learn, does outperform cascaded approaches. Additionally, the paper discusses the challenges of providing feedback to candidates when using end-to-end systems for spoken GEC.

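Summary
Grammatical feedback is very important for second-language learners, teachers and testers. Spoken grammatical error correction (GEC) aims to give second-language learners feedback on their use of grammar when speaking. The process usually relies on a cascaded pipeline comprising an ASR system, disfluency removal and GEC, which raises the problem of errors propagating between modules. In this paper, we introduce an alternative "end-to-end" approach to spoken GEC that exploits the Whisper foundation model. This foundation model can replace the whole framework or part of it, for example ASR and disfluency removal. The end-to-end approaches are compared with more standard cascaded approaches on data from the Linguaskill spoken language assessment test. Results show that end-to-end spoken GEC is possible within this architecture, but limited data leaves current performance below that of a system using large quantities of text-based GEC data. End-to-end disfluency detection and removal, which is easier for the attention-based Whisper to learn, does outperform cascaded approaches. The paper also discusses the challenges of providing feedback to candidates when using end-to-end systems.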

All Should Be Equal in the Eyes of Language Models: Counterfactually Aware Fair Text Generation

paper_url: http://arxiv.org/abs/2311.05451
repo_url: None
paper_authors: Pragyan Banerjee, Abhinav Java, Surgan Jandial, Simra Shahid, Shaz Furniturewala, Balaji Krishnamurthy, Sumit Bhatia
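for: This work aims to improve the fairness of language models (LMs), which can perpetuate biases in their training data and affect downstream tasks.
methods: The authors propose Counterfactually Aware Fair InferencE (CAFIE), a framework that compares the model's outputs across different demographic groups to generate more equitable sentences.
results: Extensive experiments with base LMs of varying sizes and three diverse datasets show that CAFIE outperforms strong baselines, producing fairer text while preserving language modeling capability.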

Abstract
Fairness in Language Models (LMs) remains a longstanding challenge, given the inherent biases in training data that can be perpetuated by models and affect the downstream tasks. Recent methods employ expensive retraining or attempt debiasing during inference by constraining model outputs to contrast from a reference set of biased templates or exemplars. Regardless, they do not address the primary goal of fairness, which is to maintain equitability across different demographic groups. In this work, we posit that inferencing LMs to generate unbiased output for one demographic under a context ensues from being aware of outputs for other demographics under the same context. To this end, we propose Counterfactually Aware Fair InferencE (CAFIE), a framework that dynamically compares the model understanding of diverse demographics to generate more equitable sentences. We conduct an extensive empirical evaluation using base LMs of varying sizes and across three diverse datasets and find that CAFIE outperforms strong baselines. CAFIE produces fairer text and strikes the best balance between fairness and language modeling capability.

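Summary
Fairness in language models (LMs) remains a longstanding challenge, because inherent biases in training data can be perpetuated by models and affect downstream tasks. Recent methods use expensive retraining or attempt debiasing during inference, but they do not achieve equitability across different demographic groups. In this work, we posit that generating unbiased output for one demographic group at inference time requires awareness of the outputs for other demographic groups under the same context. To this end, we propose the Counterfactually Aware Fair InferencE (CAFIE) framework, which dynamically compares the model's understanding of different demographic groups to generate more equitable sentences. We conduct an extensive empirical evaluation using base LMs of varying sizes and three diverse datasets, and find that CAFIE strikes the best balance between fairness and language modeling capability. CAFIE produces fairer text and also performs well on language modeling.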

Memorisation Cartography: Mapping out the Memorisation-Generalisation Continuum in Neural Machine Translation

paper_url: http://arxiv.org/abs/2311.05379
repo_url: None
paper_authors: Verna Dankers, Ivan Titov, Dieuwke Hupkes
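for: This paper studies why neural machine translation models quickly memorise some source-target mappings but never learn others, and how the memorisation-generalisation continuum affects model performance.
methods: The paper analyses 5 million NMT datapoints, using their surface-level characteristics and the model's per-datum training signals to predict where each datapoint lies on the memorisation-generalisation continuum.
results: The study finds that a datapoint's position on the memorisation-generalisation continuum can be predicted from its surface-level characteristics and the model's per-datum training signals, and that subsets of datapoints along the continuum have an important influence on model performance.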

Abstract
When training a neural network, it will quickly memorise some source-target mappings from your dataset but never learn some others. Yet, memorisation is not easily expressed as a binary feature that is good or bad: individual datapoints lie on a memorisation-generalisation continuum. What determines a datapoint's position on that spectrum, and how does that spectrum influence neural models' performance? We address these two questions for neural machine translation (NMT) models. We use the counterfactual memorisation metric to (1) build a resource that places 5M NMT datapoints on a memorisation-generalisation map, (2) illustrate how the datapoints' surface-level characteristics and a model's per-datum training signals are predictive of memorisation in NMT, and (3) describe the influence that subsets of that map have on NMT systems' performance.

Summary
1. Build a resource that places 5M NMT datapoints on a memorisation-generalisation map.
2. Illustrate how the datapoints' surface-level characteristics and a model's per-datum training signals are predictive of memorisation in NMT.
3. Describe the influence that subsets of that map have on NMT systems' performance.
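
The counterfactual memorisation metric is not defined in the abstract. A common formulation, assumed here, scores a datapoint by how much better models perform on it when it was in their training set than when it was held out. The sketch below uses made-up per-model scores (for example, target likelihoods); the function name and all numbers are illustrative, not taken from the paper.

```python
from statistics import mean
from typing import Sequence


def counterfactual_memorisation(
    scores_when_trained_on: Sequence[float],
    scores_when_held_out: Sequence[float],
) -> float:
    """Mean score of models that saw the datapoint minus mean score of models that did not.

    Values near 0 suggest the datapoint is handled by generalisation;
    values near 1 suggest it is only handled when memorised.
    """
    return mean(scores_when_trained_on) - mean(scores_when_held_out)


# Hypothetical scores for two datapoints across several trained models.
memorised_example = counterfactual_memorisation([0.9, 0.95, 0.85], [0.1, 0.2, 0.15])
generalised_example = counterfactual_memorisation([0.9, 0.9], [0.88, 0.86])
```

Plotting each datapoint by its held-out score and its trained-on score gives one possible reading of the "map" the abstract mentions, though the paper's exact axes are not stated here.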

There’s no Data Like Better Data: Using QE Metrics for MT Data Filtering

paper_url: http://arxiv.org/abs/2311.05350
repo_url: None
paper_authors: Jan-Thorsten Peter, David Vilar, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, Markus Freitag
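for: This work studies the use of Quality Estimation (QE) metrics to filter out low-quality sentence pairs from the training data of neural machine translation (NMT) systems, in order to improve translation quality.
methods: The work uses QE metrics to filter low-quality sentence pairs out of the training data and trains NMT systems on the selected pairs.
results: Selecting the highest-quality sentence pairs improves translation quality while reducing the training data size. The work also provides a detailed analysis of the filtering results and compares the differences between the two approaches.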

Abstract
Quality Estimation (QE), the evaluation of machine translation output without the need for explicit references, has seen big improvements in recent years with the use of neural metrics. In this paper we analyze the viability of using QE metrics for filtering out bad quality sentence pairs in the training data of neural machine translation (NMT) systems. While most corpus filtering methods are focused on detecting noisy examples in collections of texts, usually huge amounts of web crawled data, QE models are trained to discriminate more fine-grained quality differences. We show that by selecting the highest quality sentence pairs in the training data, we can improve translation quality while reducing the training size by half. We also provide a detailed analysis of the filtering results, which highlights the differences between both approaches.

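Summary
Quality Estimation (QE), the evaluation of machine translation output without reference translations, has improved considerably in recent years, especially through neural metrics. This paper analyses the viability of using QE metrics to filter out low-quality sentence pairs from the training data of neural machine translation (NMT) systems. Most corpus filtering methods focus on detecting noisy examples in text collections, whereas QE models are trained to discriminate finer-grained quality differences. We show that selecting the highest-quality sentence pairs in the training data improves translation quality while halving the training data. We also provide a detailed analysis of the filtering results, which highlights the differences between the two approaches.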
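
The filtering step described above (rank sentence pairs by a QE score and keep the best half) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `qe_score` is a hypothetical stand-in for a neural QE metric, implemented here as a crude length-ratio heuristic, and `keep_fraction=0.5` only mirrors the "by half" figure from the abstract.

```python
from typing import Callable, List, Tuple

SentencePair = Tuple[str, str]  # (source, target)


def qe_score(pair: SentencePair) -> float:
    """Hypothetical placeholder for a reference-free QE metric.

    A real system would call a neural QE model. This toy heuristic only
    rewards pairs whose source and target have similar token counts.
    """
    src_len = len(pair[0].split())
    tgt_len = len(pair[1].split())
    if src_len == 0 or tgt_len == 0:
        return 0.0
    return min(src_len, tgt_len) / max(src_len, tgt_len)


def filter_by_quality(
    pairs: List[SentencePair],
    scorer: Callable[[SentencePair], float] = qe_score,
    keep_fraction: float = 0.5,
) -> List[SentencePair]:
    """Keep the top `keep_fraction` of pairs, ranked by QE score (highest first)."""
    ranked = sorted(pairs, key=scorer, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


corpus = [
    ("das ist ein Test", "this is a test"),
    ("guten Morgen", "good morning everyone , how are you today ?"),
    ("ich mag Tee", "I like tea"),
    ("Hallo", "click here to subscribe to our newsletter now"),
]
filtered = filter_by_quality(corpus)
```

Passing the scorer as a parameter keeps the ranking logic independent of whichever QE model is plugged in.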

DeeLM: Dependency-enhanced Large Language Model for Sentence Embeddings

paper_url: http://arxiv.org/abs/2311.05296
repo_url: None
paper_authors: Xianming Li, Jing Li
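for: Improve the performance of sentence embeddings.
methods: Proposes a new method, Dependency-Enhanced Large Language Model (DeeLM), which makes specific LLM layers bidirectional so that they can learn backward dependencies.
results: DeeLM outperforms baselines and other methods, achieving state-of-the-art performance on multiple semantic textual similarity (STS) tasks.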

Abstract
Recent studies have proposed using large language models (LLMs) for sentence embeddings. However, most existing LLMs are built with an autoregressive architecture that primarily captures forward dependencies while neglecting backward dependencies. Previous work has highlighted the importance of backward dependencies in improving sentence embeddings. To address this issue, in this paper, we first present quantitative evidence demonstrating the limited learning of backward dependencies in LLMs. Then, we propose a novel approach called Dependency-Enhanced Large Language Model (DeeLM) to improve sentence embeddings. Specifically, we found a turning point in LLMs, where surpassing specific LLM layers leads to a significant performance drop in the semantic textual similarity (STS) task. STS is a crucial task for evaluating sentence embeddings. We then extract the layers after the turning point to make them bidirectional, allowing for the learning of backward dependencies. Extensive experiments demonstrate that DeeLM outperforms baselines and achieves state-of-the-art performance across various STS tasks.
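
The difference between forward-only and bidirectional layers comes down to the attention mask. The sketch below is a generic illustration of that difference, not DeeLM's implementation: which layers are converted and how they are trained is not specified in the abstract, and the helper names are made up for this example.

```python
from typing import List

Mask = List[List[int]]  # mask[i][j] == 1 means token i may attend to token j


def causal_mask(seq_len: int) -> Mask:
    """Autoregressive mask: token i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def bidirectional_mask(seq_len: int) -> Mask:
    """Bidirectional mask: every token may attend to every position."""
    return [[1] * seq_len for _ in range(seq_len)]


def backward_links(mask: Mask) -> int:
    """Count visible (i, j) pairs with j > i, i.e. links to later tokens."""
    n = len(mask)
    return sum(mask[i][j] for i in range(n) for j in range(i + 1, n))


causal = causal_mask(4)
bidirectional = bidirectional_mask(4)
```

With a causal mask no token can see later tokens, so `backward_links` is zero; removing the mask in selected layers is one way such dependencies become learnable.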


Causal Inference from Text: Unveiling Interactions between Variables

paper_url: http://arxiv.org/abs/2311.05286
repo_url: None
paper_authors: Yuxiang Zhou, Yulan He
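for: This paper addresses the estimation of causal effects from textual data.
methods: The paper proposes a new method that identifies and handles latent covariates in textual data in order to estimate causal effects more accurately.
results: Experiments show that the method performs well on two different treatment factors and reduces bias across various scenarios. An analysis of a real-world business setting further shows that the model can effectively disentangle the variables, helping investors make more informed decisions.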

Abstract
Adjusting for latent covariates is crucial for estimating causal effects from observational textual data. Most existing methods only account for confounding covariates that affect both treatment and outcome, potentially leading to biased causal effects. This bias arises from insufficient consideration of non-confounding covariates, which are relevant only to either the treatment or the outcome. In this work, we aim to mitigate the bias by unveiling interactions between different variables to disentangle the non-confounding covariates when estimating causal effects from text. The disentangling process ensures covariates only contribute to their respective objectives, enabling independence between variables. Additionally, we impose a constraint to balance representations from the treatment group and control group to alleviate selection bias. We conduct experiments on two different treatment factors under various scenarios, and the proposed model significantly outperforms recent strong baselines. Furthermore, our thorough analysis on earnings call transcripts demonstrates that our model can effectively disentangle the variables, and further investigations into real-world scenarios provide guidance for investors to make informed decisions.


Modelling prospective memory and resilient situated communications via Wizard of Oz

paper_url: http://arxiv.org/abs/2311.05268
repo_url: None
paper_authors: Yanzhe Li, Frank Broz, Mark Neerincx
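for: This work explores human-robot interaction between older adults and socially assistive robots (SAR), in order to investigate memory modelling for communication.
methods: The study uses a home scenario involving an older adult and a robot to explore failures of speech technology and human-robot communication during daily activities.
results: The scenario will enable data collection on failures of speech technology and human-robot communication during daily activities, to better understand interaction between older adults and SAR.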

Abstract
This abstract presents a scenario for human-robot action in a home setting involving an older adult and a robot. The scenario is designed to explore the envisioned modelling of memory for communication with a socially assistive robot (SAR). The scenario will enable the gathering of data on failures of speech technology and human-robot communication involving shared memory that may occur during daily activities such as a music-listening activity.

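Summary
This abstract describes a human-robot interaction scenario in a home setting involving an older adult and a robot. The scenario is designed to explore the envisioned modelling of memory for a socially assistive robot (SAR). It will support the collection of data on failures of speech technology and of human-robot communication involving shared memory during daily activities, such as a music-listening activity.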

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

paper_url: http://arxiv.org/abs/2311.05232
repo_url: None
paper_authors: Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, Ting Liu
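for: This survey provides an overview of recent advances on hallucination in large language models (LLMs).
methods: The survey introduces an innovative taxonomy of LLM hallucinations and examines the factors contributing to hallucinations and the methods for detecting them.
results: The survey offers a comprehensive overview of hallucination detection methods and benchmarks, along with representative approaches for mitigating hallucinations.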

Abstract
The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP), leading to remarkable advancements in text understanding and generation. Nevertheless, alongside these strides, LLMs exhibit a critical tendency to produce hallucinations, resulting in content that is inconsistent with real-world facts or user inputs. This phenomenon poses substantial challenges to their practical deployment and raises concerns over the reliability of LLMs in real-world scenarios, which attracts increasing attention to detect and mitigate these hallucinations. In this survey, we aim to provide a thorough and in-depth overview of recent advances in the field of LLM hallucinations. We begin with an innovative taxonomy of LLM hallucinations, then delve into the factors contributing to hallucinations. Subsequently, we present a comprehensive overview of hallucination detection methods and benchmarks. Additionally, representative approaches designed to mitigate hallucinations are introduced accordingly. Finally, we analyze the challenges that highlight the current limitations and formulate open questions, aiming to delineate pathways for future research on hallucinations in LLMs.

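Summary
The emergence of large language models (LLMs) marks a significant breakthrough in natural language processing (NLP), bringing remarkable advances in text understanding and generation. Alongside these advances, however, LLMs tend to produce hallucinations, resulting in content that is inconsistent with real-world facts or user inputs. This phenomenon poses major challenges for the practical deployment of LLMs and has drawn attention to hallucination detection and mitigation. In this survey, we aim to provide a thorough and in-depth overview of the field of LLM hallucinations. We begin with an innovative taxonomy of LLM hallucinations, then discuss their causes. Next, we give a comprehensive overview of hallucination detection methods and benchmarks. We also introduce representative hallucination mitigation approaches. Finally, we analyse current challenges and open questions, and outline directions for future research.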

PRODIGy: a PROfile-based DIalogue Generation dataset

paper_url: http://arxiv.org/abs/2311.05195
repo_url: https://github.com/land-fbk/prodigy-dataset
paper_authors: Daniela Occhipinti, Serra Sinem Tekiroglu, Marco Guerini
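for: Improve the consistency and coherence of dialogue agents, leading to better conversations.
methods: Proposes a unified framework that brings together standard and more sophisticated profile representations, aligning each dialogue with all possible speaker representations.
results: Automatic evaluation shows that profile-based models generalise better in both in-domain and cross-domain settings, and human evaluation shows a clear preference for generations consistent with both profile and context.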

Abstract
Providing dialogue agents with a profile representation can improve their consistency and coherence, leading to better conversations. However, current profile-based dialogue datasets for training such agents contain either explicit profile representations that are simple and dialogue-specific, or implicit representations that are difficult to collect. In this work, we propose a unified framework in which we bring together both standard and more sophisticated profile representations by creating a new resource where each dialogue is aligned with all possible speaker representations such as communication style, biographies, and personality. This framework allows testing several baselines built using generative language models with several profile configurations. The automatic evaluation shows that profile-based models have better generalisation capabilities than models trained on dialogues only, in both in-domain and cross-domain settings. These results are consistent for fine-tuned models and instruction-based LLMs. Additionally, human evaluation demonstrates a clear preference for generations consistent with both profile and context. Finally, to account for possible privacy concerns, all experiments are done under two configurations: inter-character and intra-character. In the former, the LM stores the information about the character in its internal representation, while in the latter, the LM does not retain any personal information but uses it only at inference time.

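Summary
Providing dialogue agents with profiles can improve their consistency and coherence, leading to better conversations. However, current profile-based dialogue datasets contain either explicit profiles that are simple and dialogue-specific, or implicit profiles that are difficult to collect. In this work, we propose a unified framework in which each dialogue is aligned with all possible speaker representations, such as communication style, biography and personality. The framework lets us test several baselines built on generative language models under different profile configurations. Automatic evaluation shows that profile-based models generalise better in both in-domain and cross-domain settings. In addition, human evaluation shows a preference for generations consistent with both profile and context. Finally, to address possible privacy concerns, all experiments are run under two configurations: inter-character and intra-character. In the former, the LM stores character information in its internal representation; in the latter, the LM retains no personal information and uses it only at inference time.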

Large Language Models and Prompt Engineering for Biomedical Query Focused Multi-Document Summarisation

paper_url: http://arxiv.org/abs/2311.05169
repo_url: None
paper_authors: Diego Mollá
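for: This work applies prompt engineering and GPT-3.5 to biomedical query-focused multi-document summarisation.
methods: Using GPT-3.5 and appropriate prompts, the system achieves top ROUGE-F1 results in the 2023 BioASQ Challenge (BioASQ 11b).
results: The work confirms observations from other domains: 1) prompts that include few-shot samples generally improve on their zero-shot variants; 2) retrieval augmented generation yields the largest improvement. These prompts put the system's top runs within the top two runs of BioASQ 11b, demonstrating the power of adequate prompts for large language models in summarisation tasks.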

Abstract
This paper reports on the use of prompt engineering and GPT-3.5 for biomedical query-focused multi-document summarisation. Using GPT-3.5 and appropriate prompts, our system achieves top ROUGE-F1 results in the task of obtaining short-paragraph-sized answers to biomedical questions in the 2023 BioASQ Challenge (BioASQ 11b). This paper confirms what has been observed in other domains: 1) Prompts that incorporated few-shot samples generally improved on their counterpart zero-shot variants; 2) The largest improvement was achieved by retrieval augmented generation. The fact that these prompts allow our top runs to rank within the top two runs of BioASQ 11b demonstrates the power of using adequate prompts for Large Language Models in general, and GPT-3.5 in particular, for query-focused summarisation.


Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

paper_url: http://arxiv.org/abs/2311.05161
repo_url: None
paper_authors: Jangwhan Lee, Minsoo Kim, Seungcheol Baek, Seok Joong Hwang, Wonyong Sung, Jungwook Choi
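for: Improve the computational efficiency of large language models (LLMs) to ease their deployment.
methods: Uses 4-bit weight and 8-bit activation (W4A8) quantization and proposes two innovative techniques, activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC), to enhance post-training quantization (PTQ).
results: Rigorous evaluations of LLMs, including OPT and LLaMA, show that the techniques raise task accuracy to levels comparable with full-precision models. Arithmetic units developed to be compatible with dINT confirm a 2x hardware efficiency improvement over an 8-bit integer MAC unit.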

Abstract
Large Language Models (LLMs) are proficient in natural language processing tasks, but their deployment is often restricted by extensive parameter sizes and computational demands. This paper focuses on post-training quantization (PTQ) in LLMs, specifically 4-bit weight and 8-bit activation (W4A8) quantization, to enhance computational efficiency -- a topic less explored compared to weight-only quantization. We present two innovative techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC) to enhance PTQ by considering the combined effects on weights and activations and aligning calibration sequence lengths to target tasks. Moreover, we introduce dINT, a hybrid data format combining integer and denormal representations, to address the underflow issue in W4A8 quantization, where small values are rounded to zero. Through rigorous evaluations of LLMs, including OPT and LLaMA, we demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models. By developing arithmetic units compatible with dINT, we further confirm that our methods yield a 2x hardware efficiency improvement compared to an 8-bit integer MAC unit.

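Summary
Large language models (LLMs) perform well on natural language processing tasks, but their deployment is limited by large parameter sizes and computational demands. This paper focuses on post-training quantization (PTQ) of LLMs, specifically 4-bit weight and 8-bit activation (W4A8) quantization, to improve computational efficiency. We present two innovative techniques, activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC), which enhance PTQ by considering the combined effects on weights and activations. We also introduce dINT, a hybrid data format combining integer and denormal representations, to address the underflow problem in W4A8 quantization, where small values are rounded to zero. Through rigorous evaluations of LLMs, including OPT and LLaMA, we show that our techniques raise task accuracy to levels comparable with full-precision models. We further develop arithmetic units compatible with dINT and confirm that our methods yield a 2x hardware efficiency improvement compared to an 8-bit integer MAC unit.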
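
The underflow problem mentioned above, where small values are rounded to zero, can be reproduced with a plain symmetric 4-bit quantizer. This is a generic sketch under assumed settings (per-tensor scale, integer range [-7, 7]); it does not implement dINT, AQAS or SLAC, whose details are not given in the abstract, and the input values are made up.

```python
from typing import List, Tuple


def quantize_int4(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor 4-bit quantization to integer codes in [-7, 7].

    Expects a non-empty list. Returns the codes and the scale used.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


values = [0.70, -0.30, 0.04, -0.02, 0.0]
codes, scale = quantize_int4(values)
restored = dequantize(codes, scale)

# Non-zero inputs whose code collapsed to zero: this is the underflow case.
underflowed = [v for v, c in zip(values, codes) if v != 0.0 and c == 0]
```

Here the scale is set by the largest magnitude, so any value smaller than about half a quantization step is lost entirely, which is the gap a format with finer resolution near zero is meant to close.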

Quranic Conversations: Developing a Semantic Search tool for the Quran using Arabic NLP Techniques

paper_url: http://arxiv.org/abs/2311.05120
repo_url: None
paper_authors: Yasser Shohoud, Maged Shoman, Sarah Abdelazim
for: This paper is written to provide a Quran semantic search tool for Muslims to easily find relevant verses in the Quran related to their inquiries or prompts.
methods: The paper uses a combination of machine learning models and cosine similarity to index the Quran and find the most relevant verses related to a user’s inquiry.
results: The paper achieves a high cosine similarity score of 0.97 using the SNxLM model, which demonstrates the effectiveness of the proposed Quran semantic search tool.

Abstract
The Holy Book of Quran is believed to be the literal word of God (Allah) as revealed to the Prophet Muhammad (PBUH) over a period of approximately 23 years. It is the book where God provides guidance on how to live a righteous and just life, emphasizing principles like honesty, compassion, charity and justice, as well as providing rules for personal conduct, family matters, business ethics and much more. However, due to constraints related to the language and the Quran organization, it is challenging for Muslims to get all relevant ayahs (verses) pertaining to a matter or inquiry of interest. Hence, we developed a Quran semantic search tool which finds the verses pertaining to the user inquiry or prompt. To achieve this, we trained several models on a large dataset of over 30 tafsirs, where typically each tafsir corresponds to one verse in the Quran and, using cosine similarity, obtained the tafsir tensor which is most similar to the prompt tensor of interest, which was then used to index for the corresponding ayah in the Quran. Using the SNxLM model, we were able to achieve a cosine similarity score as high as 0.97 which corresponds to the abdu tafsir for a verse relating to financial matters.

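Summary
The Quran is believed to be the literal word of God (Allah), revealed to the Prophet Muhammad (PBUH) over a period of approximately 23 years. The book provides guidance on how to live a righteous and just life, emphasising principles such as honesty, compassion, charity and justice, and provides rules for personal conduct, family matters, business ethics and more. However, constraints related to the language and the organisation of the Quran make it difficult for Muslims to find all the relevant ayahs (verses). To address this, we developed a Quran semantic search tool that finds the verses relevant to a user's inquiry or prompt. We trained several models on a large dataset of over 30 tafsirs. Using cosine similarity, we obtained the tafsir tensor most similar to the prompt tensor and used it to index the corresponding ayah in the Quran. Using the SNxLM model, we achieved a cosine similarity score as high as 0.97, corresponding to the abdu tafsir for a verse relating to financial matters.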
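
The retrieval step described above (compare a prompt vector with tafsir vectors by cosine similarity, then return the ayah linked to the best match) can be sketched as follows. This is a minimal illustration with made-up three-dimensional vectors and placeholder ayah identifiers; the actual embedding models, including SNxLM, are not reproduced here.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_ayah(
    prompt_vec: List[float], tafsir_vecs: Dict[str, List[float]]
) -> Tuple[str, float]:
    """Return the ayah id whose tafsir vector is closest to the prompt, with its score."""
    best_id = max(tafsir_vecs, key=lambda k: cosine_similarity(prompt_vec, tafsir_vecs[k]))
    return best_id, cosine_similarity(prompt_vec, tafsir_vecs[best_id])


# Hypothetical embeddings keyed by placeholder ayah identifiers.
tafsir_embeddings = {
    "ayah_A": [0.9, 0.1, 0.0],
    "ayah_B": [0.1, 0.8, 0.3],
    "ayah_C": [0.0, 0.2, 0.9],
}
prompt_embedding = [0.8, 0.2, 0.1]
best_id, score = most_similar_ayah(prompt_embedding, tafsir_embeddings)
```

Keying the tafsir vectors by ayah identifier is what lets the most similar tafsir act as an index into the Quran text.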

Unsupervised Translation Quality Estimation Exploiting Synthetic Data and Pre-trained Multilingual Encoder

paper_url: http://arxiv.org/abs/2311.05117
repo_url: None
paper_authors: Yuto Kuroda, Atsushi Fujita, Tomoyuki Kajiwara, Takashi Ninomiya
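for: This paper studies unsupervised translation quality estimation (TQE) as a way to reduce the cost of creating TQE training data.
methods: The paper uses synthetic TQE data and pre-trained multilingual encoders for unsupervised sentence-level TQE.
results: Experiments on the WMT20 and WMT21 datasets show that the approach can outperform other unsupervised TQE methods in predicting post-editing effort and human evaluation score on high- and low-resource translation directions, and in predicting post-editing effort on some zero-resource translation directions.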

Abstract
Translation quality estimation (TQE) is the task of predicting translation quality without reference translations. Due to the enormous cost of creating training data for TQE, only a few translation directions can benefit from supervised training. To address this issue, unsupervised TQE methods have been studied. In this paper, we extensively investigate the usefulness of synthetic TQE data and pre-trained multilingual encoders in unsupervised sentence-level TQE, both of which have been proven effective in the supervised training scenarios. Our experiment on WMT20 and WMT21 datasets revealed that this approach can outperform other unsupervised TQE methods on high- and low-resource translation directions in predicting post-editing effort and human evaluation score, and some zero-resource translation directions in predicting post-editing effort.

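Summary
Translation quality estimation (TQE) predicts translation quality without reference translations. Because creating TQE training data is extremely costly, only a few translation directions can benefit from supervised training, and unsupervised TQE methods have been studied to address this. This paper extensively investigates the usefulness of synthetic TQE data and pre-trained multilingual encoders in unsupervised sentence-level TQE; both have already proven effective in supervised training scenarios. Experiments on the WMT20 and WMT21 datasets show that the approach can outperform other unsupervised TQE methods in predicting post-editing effort and human evaluation score on high- and low-resource translation directions, and in predicting post-editing effort on some zero-resource translation directions.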
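
The abstract reports that the approach outperforms other methods at predicting post-editing effort and human evaluation score, but does not state how that is measured. As a rough sketch, assuming predictions are compared with gold sentence-level scores by Pearson correlation (a common choice in quality estimation evaluations, though not confirmed by this abstract), the comparison could look like the following. The score lists are made-up toy values, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical sentence-level scores: model predictions vs. gold annotations.
predicted = [0.10, 0.40, 0.35, 0.80]
gold = [0.20, 0.50, 0.30, 0.90]

r = pearson(predicted, gold)
print(f"Pearson r = {r:.3f}")
```

A higher correlation means the predicted scores rank and scale sentences more like the gold annotations do, which is one way to compare two TQE methods on the same test set.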

Conic10K: A Challenging Math Problem Understanding and Reasoning Dataset

paper_url: http://arxiv.org/abs/2311.05113
repo_url: https://github.com/whynlp/conic10k
paper_authors: Haoyi Wu, Wenyang Hui, Yezeng Chen, Weiqi Wu, Kewei Tu, Yi Zhou
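for: This paper proposes a challenging math problem dataset for assessing the mathematical understanding and reasoning abilities of artificial intelligence (AI).
methods: The dataset consists of conic-section problems from Chinese senior high school education. Each problem comes with a high-quality formal representation, the reasoning steps, and the final solution.
results: Experiments show that existing large language models, including GPT-4, perform weakly on complex reasoning.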

Abstract
Mathematical understanding and reasoning are crucial tasks for assessing the capabilities of artificial intelligence (AI). However, existing benchmarks either require just a few steps of reasoning, or only contain a small amount of data in one specific topic, making it hard to analyse AI's behaviour with reference to different problems within a specific topic in detail. In this work, we propose Conic10K, a challenging math problem dataset on conic sections in Chinese senior high school education. Our dataset contains various problems with different reasoning depths, while only the knowledge from conic sections is required. Since the dataset only involves a narrow range of knowledge, it is easy to separately analyse the knowledge a model possesses and the reasoning ability it has. For each problem, we provide a high-quality formal representation, the reasoning steps, and the final solution. Experiments show that existing large language models, including GPT-4, exhibit weak performance on complex reasoning. We hope that our findings could inspire more advanced techniques for precise natural language understanding and reasoning. Our dataset and codes are available at https://github.com/whyNLP/Conic10K.

Summary
Mathematical understanding and reasoning are key tasks for assessing the capabilities of artificial intelligence (AI). However, existing benchmarks either require only a few steps of reasoning or contain only a small amount of data on one specific topic, which makes it difficult to analyze AI's behavior in detail across different problems within that topic. In this work, we propose Conic10K, a challenging math problem dataset on conic sections in Chinese senior high school education. The dataset contains problems of varying reasoning depth while requiring only knowledge of conic sections. Because the range of knowledge involved is narrow, the knowledge a model possesses and its reasoning ability can be analyzed separately. For each problem, we provide a high-quality formal representation, the reasoning steps, and the final solution. Experiments show that existing large language models, including GPT-4, perform weakly on complex reasoning. We hope our findings can inspire more advanced techniques for precise natural language understanding and reasoning. Our dataset and code are available at https://github.com/whyNLP/Conic10K.