2023-10-20

cs.CL

Not all Fake News is Written: A Dataset and Analysis of Misleading Video Headlines

paper_url: http://arxiv.org/abs/2310.13859
repo_url: None
paper_authors: Yoo Yeon Sung, Jordan Boyd-Graber, Naeemul Hassan
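for: This work presents a multimodal dataset of videos annotated for misleading headlines, intended to support the detection of misleading video headlines.
methods: The authors collect and annotate the dataset, then analyze multimodal baselines built from existing resources for detecting misleading headlines.
results: The authors analyze multimodal baselines on the new dataset. The annotations also record why annotators view a video as misleading, which clarifies the interplay between annotators' backgrounds and the content of the videos.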

Abstract
Polarization and the marketplace for impressions have conspired to make navigating information online difficult for users, and while there has been a significant effort to detect false or misleading text, multimodal datasets have received considerably less attention. To complement existing resources, we present multimodal Video Misleading Headline (VMH), a dataset that consists of videos and whether annotators believe the headline is representative of the video's contents. After collecting and annotating this dataset, we analyze multimodal baselines for detecting misleading headlines. Our annotation process also focuses on why annotators view a video as misleading, allowing us to better understand the interplay of annotators' background and the content of the videos.

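Summary
Polarization and the marketplace for impressions have made it harder for users to navigate information online. Although considerable effort has gone into detecting false or misleading text, multimodal datasets have received far less attention. To complement existing resources, we present the multimodal Video Misleading Headline (VMH) dataset, which pairs videos with annotators' judgments of whether the headline represents the video's content. After collecting and annotating the dataset, we analyze multimodal baselines for detecting misleading headlines. Our annotation process also records why annotators view a video as misleading, which helps us understand the interplay between annotators' backgrounds and video content.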

Implications of Annotation Artifacts in Edge Probing Test Datasets

paper_url: http://arxiv.org/abs/2310.13856
repo_url: https://github.com/josh1108/eptest
paper_authors: Sagnik Ray Choudhury, Jushaan Kalra
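for: This paper examines whether edge probing tests actually measure the grammatical knowledge encoded in token representations from large language models (LLMs).
methods: The authors analyze commonly used edge probing test datasets and find that they contain various biases, including memorization.
results: Once these biases are removed, LLM encoders differ significantly from random encoders, even with simple probes that are not information-theoretic.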

Abstract
Edge probing (EP) tests are classification tasks that test for grammatical knowledge encoded in token representations coming from contextual encoders such as large language models (LLMs). Many LLM encoders have shown high performance in EP tests, leading to conjectures about their ability to encode linguistic knowledge. However, a large body of research claims that the tests do not necessarily measure the LLM's capacity to encode knowledge, but rather reflect the classifiers' ability to learn the problem. Much of this criticism stems from the fact that the classifiers often have very similar accuracy whether an LLM or a random encoder is used. Consequently, several modifications to the tests have been suggested, including information-theoretic probes. We show that commonly used edge probing test datasets have various biases, including memorization. When these biases are removed, the LLM encoders do show a significant difference from the random ones, even with simple probes that are not information-theoretic.

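Summary
Edge probing (EP) tests are classification tasks that test for grammatical knowledge encoded in the representations of large language models (LLMs). Many LLM encoders perform well on EP tests, which has led to conjectures about their ability to encode linguistic knowledge. However, many studies argue that the tests measure the classifier's ability to learn the problem rather than the LLM's capacity to encode knowledge, because classifiers often reach nearly the same accuracy with an LLM encoder as with a random one. Several modifications have been proposed in response, including information-theoretic probes. We find that commonly used EP test datasets contain various biases, including memorization. Once these biases are removed, the difference between LLM encoders and random encoders becomes clear, even with simple probes that are not information-theoretic.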

Ecologically Valid Explanations for Label Variation in NLI

paper_url: http://arxiv.org/abs/2310.13850
repo_url: https://github.com/njjiang/livenli
paper_authors: Nan-Jiang Jiang, Chenhao Tan, Marie-Catherine de Marneffe
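for: This paper investigates how human label variation (annotation disagreement) arises in natural language inference (NLI).
methods: The authors build LiveNLI, an English dataset of 1,415 ecologically valid explanations (annotators explain the NLI labels they chose) for 122 MNLI items, with at least 10 explanations per item.
results: The LiveNLI explanations confirm that people vary systematically in their interpretations and reveal within-label variation: annotators sometimes choose the same label for different reasons. This suggests that explanations are crucial for navigating label interpretations in general. With few-shot prompting, large language models sometimes generate valid and informative explanations but also produce implausible ones that do not support the label, which points to directions for improvement.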

Abstract
Human label variation, or annotation disagreement, exists in many natural language processing (NLP) tasks, including natural language inference (NLI). To gain direct evidence of how NLI label variation arises, we build LiveNLI, an English dataset of 1,415 ecologically valid explanations (annotators explain the NLI labels they chose) for 122 MNLI items (at least 10 explanations per item). The LiveNLI explanations confirm that people can systematically vary on their interpretation and highlight within-label variation: annotators sometimes choose the same label for different reasons. This suggests that explanations are crucial for navigating label interpretations in general. We few-shot prompt large language models to generate explanations, but the results are inconsistent: the models sometimes produce valid and informative explanations, but they also generate implausible ones that do not support the label, highlighting directions for improvement.

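Summary
Human label variation, or annotation disagreement, is common in many natural language processing (NLP) tasks, including natural language inference (NLI). To gain direct evidence of how it arises, we build LiveNLI, an English dataset of 1,415 ecologically valid explanations (annotators explain the NLI labels they chose) for 122 MNLI items, with at least 10 explanations per item. The LiveNLI explanations show that people vary systematically in their interpretations and that variation exists within a label: annotators sometimes choose the same label for different reasons. This suggests that explanations are key to navigating label interpretations. We few-shot prompt large language models to generate explanations, but the results are inconsistent: the models sometimes generate valid and informative explanations, but they also generate implausible ones that do not support the label, highlighting directions for improvement.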

Foundation Model’s Embedded Representations May Detect Distribution Shift

paper_url: http://arxiv.org/abs/2310.13836
repo_url: None
paper_authors: Adam Tsou, Max Vargas, Andrew Engel, Tony Chiang
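for: This study examines how distribution shift between training and test datasets obscures the generalization capacity of transfer learning (TL) models built on pre-trained foundation models.
methods: The authors present a case study that transfers a pre-trained GPT-2 model to the Sentiment140 dataset for sentiment classification.
results: Sentiment140's test dataset $M$ is not sampled from the same distribution as the training dataset $P$, so training on $P$ and measuring performance on $M$ does not actually account for the model's generalization on sentiment classification.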

Abstract
Distribution shifts between train and test datasets obscure our ability to understand the generalization capacity of neural network models. This topic is especially relevant given the success of pre-trained foundation models as starting points for transfer learning (TL) models across tasks and contexts. We present a case study for TL on a pre-trained GPT-2 model onto the Sentiment140 dataset for sentiment classification. We show that Sentiment140's test dataset $M$ is not sampled from the same distribution as the training dataset $P$, and hence training on $P$ and measuring performance on $M$ does not actually account for the model's generalization on sentiment classification.

Summary
Distribution shift between training and test datasets obscures our ability to understand the generalization capacity of neural network models. The topic is especially relevant given the success of pre-trained foundation models as starting points for transfer learning (TL) across tasks and contexts. We present a TL case study that transfers a pre-trained GPT-2 model to the Sentiment140 dataset for sentiment classification. We show that Sentiment140's test dataset $M$ is not sampled from the same distribution as the training dataset $P$, so training on $P$ and measuring performance on $M$ does not actually account for the model's generalization on sentiment classification.

Plausibility Processing in Transformer Language Models: Focusing on the Role of Attention Heads in GPT

paper_url: http://arxiv.org/abs/2310.13824
repo_url: https://github.com/soohyunryu/plausibility-processing-transformers
paper_authors: Soo Hyun Ryu
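for: This study explores how Transformer language models process semantic knowledge, especially the plausibility of noun-verb relations.
methods: The study runs experiments on GPT2 and analyzes its attention heads to investigate how it processes plausibility.
results: GPT2 is more similar to humans in plausibility processing than other Transformer language models, and a number of its attention heads detect plausible noun-verb relationships. These heads collectively contribute to the model's ability to process plausibility, to varying degrees, but a head's individual performance in detecting plausibility does not necessarily correlate with how much it contributes.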

Abstract
The goal of this paper is to explore how Transformer language models process semantic knowledge, especially regarding the plausibility of noun-verb relations. First, I demonstrate GPT2 exhibits a higher degree of similarity with humans in plausibility processing compared to other Transformer language models. Next, I delve into how knowledge of plausibility is contained within attention heads of GPT2 and how these heads causally contribute to GPT2's plausibility processing ability. Through several experiments, it was found that: i) GPT2 has a number of attention heads that detect plausible noun-verb relationships; ii) these heads collectively contribute to the Transformer's ability to process plausibility, albeit to varying degrees; and iii) attention heads' individual performance in detecting plausibility does not necessarily correlate with how much they contribute to GPT2's plausibility processing ability.

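Summary
The goal of this paper is to explore how Transformer language models process semantic knowledge, especially the plausibility of noun-verb relations. First, I show that GPT2 is more similar to humans in plausibility processing than other Transformer language models. Next, I examine how knowledge of plausibility is contained within GPT2's attention heads and how these heads causally contribute to GPT2's plausibility processing ability. The experiments show that: 1. GPT2 has a number of attention heads that detect plausible noun-verb relationships; 2. these heads collectively contribute to the Transformer's ability to process plausibility, although to varying degrees; 3. an attention head's individual performance in detecting plausibility does not necessarily correlate with how much it contributes to GPT2's plausibility processing ability.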

Yet Another Model for Arabic Dialect Identification

paper_url: http://arxiv.org/abs/2310.13812
repo_url: None
paper_authors: Ajinkya Kulkarni, Hanan Aldarmaki
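for: This study develops a spoken Arabic dialect identification (ADI) model that consistently outperforms previously published results on two benchmark datasets, ADI-5 and ADI-17.
methods: The model explores two architectures, ResNet and ECAPA-TDNN, combined with two types of acoustic features, MFCCs and features extracted from the pre-trained self-supervised model UniSpeech-SAT Large, as well as a fusion of all four variants.
results: Individually, ECAPA-TDNN outperforms ResNet, and models with UniSpeech-SAT features outperform models with MFCCs by a large margin. A fusion of all four variants consistently outperforms the individual models. The best models reach accuracies of 84.7% on ADI-5 and 96.9% on ADI-17.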

Abstract
In this paper, we describe a spoken Arabic dialect identification (ADI) model that consistently outperforms previously published results on two benchmark datasets: ADI-5 and ADI-17. We explore two architectural variations, ResNet and ECAPA-TDNN, coupled with two types of acoustic features: MFCCs and features extracted from the pre-trained self-supervised model UniSpeech-SAT Large, as well as a fusion of all four variants. We find that individually, the ECAPA-TDNN network outperforms ResNet, and models with UniSpeech-SAT features outperform models with MFCCs by a large margin. Furthermore, a fusion of all four variants consistently outperforms individual models. Our best models outperform previously reported results on both datasets, with accuracies of 84.7% and 96.9% on ADI-5 and ADI-17, respectively.

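Summary
In this paper, we describe a spoken Arabic dialect identification (ADI) model that consistently outperforms previously published results on two benchmark datasets, ADI-5 and ADI-17. We study two architectures, ResNet and ECAPA-TDNN, with two types of acoustic features: MFCCs and features extracted from the pre-trained self-supervised model UniSpeech-SAT Large. Individually, the ECAPA-TDNN network outperforms ResNet, and models with UniSpeech-SAT features outperform models with MFCC features by a large margin. A fusion of all four variants consistently outperforms the individual models. Our best models reach accuracies of 84.7% on ADI-5 and 96.9% on ADI-17, both higher than previously reported results.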

Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks

paper_url: http://arxiv.org/abs/2310.13800
repo_url: https://github.com/protagolabs/seq2seq_llm_evaluation
paper_authors: Andrea Sottana, Bin Liang, Kai Zou, Zheng Yuan
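for: This study aims to improve the understanding of current generative models' performance through a preliminary, hybrid evaluation of a range of open and closed-source large language models (LLMs) on three NLP benchmarks: text summarisation, text simplification, and grammatical error correction (GEC).
methods: The study combines automatic and human evaluation and also explores GPT-4 as an evaluator.
results: ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics, while scoring much more poorly on classic automatic evaluation metrics. Human reviewers rate the gold references as much worse than the best models' outputs, which indicates the poor quality of many popular benchmarks. GPT-4 ranks model outputs in a way that aligns reasonably closely with human judgement across tasks, with lower alignment on the GEC task.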

Abstract
Large Language Models (LLMs) evaluation is a patchy and inconsistent landscape, and it is becoming clear that the quality of automatic evaluation metrics is not keeping up with the pace of development of generative models. We aim to improve the understanding of current models' performance by providing a preliminary and hybrid evaluation on a range of open and closed-source generative LLMs on three NLP benchmarks: text summarisation, text simplification and grammatical error correction (GEC), using both automatic and human evaluation. We also explore the potential of the recently released GPT-4 to act as an evaluator. We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics, while scoring much more poorly when using classic automatic evaluation metrics. We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks. Finally, we find that GPT-4 is capable of ranking models' outputs in a way which aligns reasonably closely to human judgement despite task-specific variations, with a lower alignment in the GEC task.

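Summary
The evaluation of large language models (LLMs) is a patchy and inconsistent landscape, and the quality of automatic evaluation metrics is not keeping up with the development of generative models. To improve the understanding of current models' performance, we provide a preliminary, hybrid evaluation of a range of open and closed-source generative LLMs on three NLP benchmarks: text summarisation, text simplification, and grammatical error correction (GEC). We also explore whether the recently released GPT-4 can act as an evaluator. We find that ChatGPT consistently performs well according to human reviewers while scoring much lower on automatic evaluation metrics. We also find that human reviewers rate the gold references as much worse than the best models' outputs, which indicates the poor quality of many popular benchmarks. Finally, GPT-4 can rank model outputs in a way that aligns reasonably closely with human judgement across tasks, with lower alignment on the GEC task.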

A Unified View of Evaluation Metrics for Structured Prediction

paper_url: http://arxiv.org/abs/2310.13793
repo_url: https://github.com/wanmok/metametric
paper_authors: Yunmo Chen, William Gantt, Tongfei Chen, Aaron Steven White, Benjamin Van Durme
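for: This paper presents a conceptual framework that unifies evaluation metrics across structured prediction tasks such as event and relation extraction and syntactic and semantic parsing.
methods: The framework represents task outputs as objects of certain data types and derives metrics by matching common substructures, possibly followed by normalization.
results: The authors show that commonly used metrics for a number of tasks can be expressed succinctly in the framework and that new metrics can be derived bottom-up from an output structure. They also consider how task characteristics motivate metric design decisions and suggest modifications to existing metrics.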

Abstract
We present a conceptual framework that unifies a variety of evaluation metrics for different structured prediction tasks (e.g. event and relation extraction, syntactic and semantic parsing). Our framework requires representing the outputs of these tasks as objects of certain data types, and derives metrics through matching of common substructures, possibly followed by normalization. We demonstrate how commonly used metrics for a number of tasks can be succinctly expressed by this framework, and show that new metrics can be naturally derived in a bottom-up way based on an output structure. We release a library that enables this derivation to create new metrics. Finally, we consider how specific characteristics of tasks motivate metric design decisions, and suggest possible modifications to existing metrics in line with those motivations.

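Summary
We present a conceptual framework that unifies evaluation metrics for different structured prediction tasks (such as event and relation extraction, and syntactic and semantic parsing). The framework requires representing the outputs of these tasks as objects of certain data types, then derives metrics by matching common substructures, possibly followed by normalization. We show that commonly used metrics for a number of tasks can be expressed succinctly in the framework, and that new metrics can be derived bottom-up from an output structure. We release a library that supports deriving new metrics. Finally, we consider how the characteristics of specific tasks motivate metric design decisions and suggest possible modifications to existing metrics in line with those motivations.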
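The "match common substructures, then normalize" recipe in this entry can be illustrated with a minimal sketch. It assumes the simplest possible case, in which an output is a flat set of hashable items and matching means exact equality. The function name `set_match_f1` and the example triples are hypothetical and are not taken from the authors' released library.

```python
from typing import Hashable, Iterable


def set_match_f1(predicted: Iterable[Hashable], reference: Iterable[Hashable]) -> dict:
    """Match identical substructures, then normalize the match count."""
    pred, ref = set(predicted), set(reference)
    matched = len(pred & ref)  # substructures common to both outputs
    # Normalizing by each side's size yields precision and recall.
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical relation-extraction outputs as (head, relation, tail) triples.
pred = [("Ada", "born_in", "London"), ("Ada", "works_at", "Acme")]
gold = [("Ada", "born_in", "London"), ("Ada", "spouse", "William")]
scores = set_match_f1(pred, gold)
```

Richer output types (nested events, trees) would need a matching step that aligns substructures rather than testing equality, which is the part the framework generalizes.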

How Much Consistency Is Your Accuracy Worth?

paper_url: http://arxiv.org/abs/2310.13781
repo_url: https://github.com/NitikaRaj1/bug-free-goggles
paper_authors: Jacob K. Johnson, Ana Marasović
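for: Evaluating the consistency and robustness of models.
methods: The authors evaluate models on bundles of minimally different examples (contrast sets) and introduce relative consistency, a probability computed against equally accurate models, to complement consistency.
results: Relative consistency can alter the assessment of one model's consistency compared to another, and a model with 100% relative consistency has reached a consistency peak for its accuracy.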

Abstract
Contrast set consistency is a robustness measurement that evaluates the rate at which a model correctly responds to all instances in a bundle of minimally different examples relying on the same knowledge. To draw additional insights, we propose to complement consistency with relative consistency -- the probability that an equally accurate model would surpass the consistency of the proposed model, given a distribution over possible consistencies. Models with 100% relative consistency have reached a consistency peak for their accuracy. We reflect on prior work that reports consistency in contrast sets and observe that relative consistency can alter the assessment of a model's consistency compared to another. We anticipate that our proposed measurement and insights will influence future studies aiming to promote consistent behavior in models.

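Summary
Contrast set consistency is a robustness measurement that evaluates the rate at which a model responds correctly to all instances in a bundle of minimally different examples that rely on the same knowledge. To draw additional insights, we propose complementing consistency with relative consistency: the probability that an equally accurate model would surpass the consistency of the proposed model, given a distribution over possible consistencies. A model with 100% relative consistency has reached a consistency peak for its accuracy. We review prior work that reports consistency on contrast sets and observe that relative consistency can alter the assessment of one model's consistency compared to another. We anticipate that the proposed measurement and insights will influence future studies that aim to promote consistent behavior in models.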
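Contrast set consistency, as defined in this entry, can be computed directly from per-example correctness. The sketch below assumes each bundle is given as a list of booleans (one per example), which is an illustrative input format rather than the authors' own. Relative consistency is left out because the entry does not specify the distribution over possible consistencies.

```python
from typing import List


def accuracy(bundles: List[List[bool]]) -> float:
    """Fraction of individual examples answered correctly."""
    flat = [ok for bundle in bundles for ok in bundle]
    return sum(flat) / len(flat)


def consistency(bundles: List[List[bool]]) -> float:
    """Fraction of bundles in which every example is answered correctly."""
    return sum(all(bundle) for bundle in bundles) / len(bundles)


# Hypothetical results: four bundles of minimally different examples.
bundles = [[True, True], [True, False], [True, True, True], [False, False]]
acc = accuracy(bundles)      # 6 of 9 examples correct
cons = consistency(bundles)  # 2 of 4 bundles fully correct
```

The gap between the two numbers shows why accuracy alone can overstate robustness: correct answers scattered across bundles raise accuracy without raising consistency.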

Seq2seq is All You Need for Coreference Resolution

paper_url: http://arxiv.org/abs/2310.13774
repo_url: https://github.com/wenzhengzhang/seq2seqcoref
paper_authors: Wenzheng Zhang, Sam Wiseman, Karl Stratos
for: This paper aims to challenge the assumption that task-specific models are necessary for coreference resolution, and instead, presents a simple and effective approach using a pre-trained seq2seq transformer.
methods: The proposed method finetunes a pre-trained seq2seq transformer to map an input document to a tagged sequence encoding the coreference annotation. The authors also propose an especially simple seq2seq approach that generates only tagged spans rather than the spans interleaved with the original text.
results: The model outperforms or closely matches the best coreference systems in the literature on an array of datasets, and the analysis shows that the model size, the amount of supervision, and the choice of sequence representations are key factors in performance.

Abstract
Existing works on coreference resolution suggest that task-specific models are necessary to achieve state-of-the-art performance. In this work, we present compelling evidence that such models are not necessary. We finetune a pretrained seq2seq transformer to map an input document to a tagged sequence encoding the coreference annotation. Despite the extreme simplicity, our model outperforms or closely matches the best coreference systems in the literature on an array of datasets. We also propose an especially simple seq2seq approach that generates only tagged spans rather than the spans interleaved with the original text. Our analysis shows that the model size, the amount of supervision, and the choice of sequence representations are key factors in performance.

Summary
Existing work on coreference resolution suggests that task-specific models are necessary to achieve state-of-the-art performance. In this work, we present compelling evidence that such models are not necessary. We fine-tune a pre-trained seq2seq transformer to map an input document to a tagged sequence that encodes the coreference annotation. Despite its extreme simplicity, our model outperforms or closely matches the best coreference systems in the literature on an array of datasets. We also propose an especially simple seq2seq approach that generates only the tagged spans rather than the spans interleaved with the original text. Our analysis shows that model size, the amount of supervision, and the choice of sequence representation are key factors in performance.
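To make "a tagged sequence encoding the coreference annotation" concrete, here is a minimal sketch of one possible linearization. The tag format (`<m> ... | id </m>`), the function name `linearize`, and the input layout are assumptions chosen for illustration; the entry does not state the exact format the authors use.

```python
from typing import List, Tuple


def linearize(tokens: List[str], clusters: List[List[Tuple[int, int]]]) -> str:
    """Interleave mention tags with the text; spans are inclusive token indices."""
    opens = {}   # token index -> number of mentions starting there
    closes = {}  # token index -> (start, cluster id) of mentions ending there
    for cid, spans in enumerate(clusters, start=1):
        for start, end in spans:
            opens[start] = opens.get(start, 0) + 1
            closes.setdefault(end, []).append((start, cid))
    out = []
    for i, tok in enumerate(tokens):
        out.extend(["<m>"] * opens.get(i, 0))
        out.append(tok)
        # Close inner (later-starting) mentions first so nesting stays valid.
        for _, cid in sorted(closes.get(i, []), reverse=True):
            out.extend(["|", str(cid), "</m>"])
    return " ".join(out)


# Hypothetical document with one cluster linking "Ada" and "she".
tokens = "Ada said she was tired".split()
target = linearize(tokens, [[(0, 0), (2, 2)]])
```

A seq2seq model would be trained to produce `target` from the plain document; the span-only variant mentioned in the entry would emit just the tagged mentions without the surrounding text.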

Enhancing Abstractiveness of Summarization Models through Calibrated Distillation

paper_url: http://arxiv.org/abs/2310.13760
repo_url: None
paper_authors: Hwanjun Song, Igor Shalyminov, Hang Su, Siffi Singh, Kaisheng Yao, Saab Mansour
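for: This paper aims to counter the loss of abstractiveness that sequence-level knowledge distillation often causes in abstractive summarization models.
methods: The paper proposes DisCal, which raises abstractiveness without sacrificing informativeness. DisCal exposes the student model to diverse pseudo summaries with two types of supervision: the best pseudo summary is used for sequence-level distillation, and the ranks of the pseudo summaries are used so that the student assigns higher prediction scores to higher-ranked summaries.
results: Experiments show that DisCal outperforms prior methods in abstractive summarization distillation, producing highly abstractive and informative summaries.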

Abstract
Sequence-level knowledge distillation reduces the size of Seq2Seq models for more efficient abstractive summarization. However, it often leads to a loss of abstractiveness in summarization. In this paper, we propose a novel approach named DisCal to enhance the level of abstractiveness (measured by n-gram overlap) without sacrificing the informativeness (measured by ROUGE) of generated summaries. DisCal exposes diverse pseudo summaries with two types of supervision to the student model. Firstly, the best pseudo summary is identified in terms of abstractiveness and informativeness and used for sequence-level distillation. Secondly, their ranks are used to ensure that the student model assigns higher prediction scores to summaries with higher ranks. Our experiments show that DisCal outperforms prior methods in abstractive summarization distillation, producing highly abstractive and informative summaries.

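Summary
Sequence-level knowledge distillation reduces the size of Seq2Seq models for more efficient abstractive summarization, but it often reduces abstractiveness. In this paper, we propose a new method, DisCal, that raises abstractiveness (measured by n-gram overlap) without sacrificing the informativeness (measured by ROUGE) of the generated summaries. DisCal exposes the student model to diverse pseudo summaries with two types of supervision. First, the best pseudo summary is selected in terms of abstractiveness and informativeness and used for sequence-level distillation. Second, the ranks of the pseudo summaries are used to ensure that the student model assigns higher prediction scores to higher-ranked summaries. Our experiments show that DisCal outperforms prior methods in abstractive summarization distillation, producing highly abstractive and informative summaries.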
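The entry measures abstractiveness by n-gram overlap with the source. A common way to operationalize this is the novel n-gram ratio, sketched below under the assumption of whitespace tokenization and bigrams; the function names are hypothetical and this is not DisCal's own scoring code.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_ratio(source: str, summary: str, n: int = 2) -> float:
    """Share of summary n-grams that never occur in the source (higher = more abstractive)."""
    source_ngrams = set(ngrams(source.lower().split(), n))
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    novel = sum(gram not in source_ngrams for gram in summary_ngrams)
    return novel / len(summary_ngrams)


source = "the cat sat on the mat"
extractive_score = novel_ngram_ratio(source, "the cat sat")    # copied wording
abstractive_score = novel_ngram_ratio(source, "a cat rested")  # rephrased wording
```

Ranking pseudo summaries would additionally need an informativeness score such as ROUGE; how the two are combined is not specified in this entry, so it is omitted here.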

ALDi: Quantifying the Arabic Level of Dialectness of Text

paper_url: http://arxiv.org/abs/2310.13747
repo_url: https://github.com/amr-keleg/aldi
paper_authors: Amr Keleg, Sharon Goldwater, Walid Magdy
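for: This study aims to quantify how dialectal a piece of Arabic text is, which can reveal Arabic speakers' stylistic choices in different situations.
methods: The study introduces the Arabic Level of Dialectness (ALDi), a continuous linguistic variable defined at the sentence level, together with the AOC-ALDi dataset of sentences manually labeled with their level of dialectness.
results: A model trained on AOC-ALDi can effectively identify levels of dialectness on a range of other corpora, and case studies show that ALDi can reveal Arabic speakers' stylistic choices in different situations.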

Abstract
Transcribed speech and user-generated text in Arabic typically contain a mixture of Modern Standard Arabic (MSA), the standardized language taught in schools, and Dialectal Arabic (DA), used in daily communications. To handle this variation, previous work in Arabic NLP has focused on Dialect Identification (DI) on the sentence or the token level. However, DI treats the task as binary, whereas we argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi), a continuous linguistic variable. We introduce the AOC-ALDi dataset (derived from the AOC dataset), containing 127,835 sentences (17% from news articles and 83% from user comments on those articles) which are manually labeled with their level of dialectness. We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora (including dialects and genres not included in AOC-ALDi), providing a more nuanced picture than traditional DI systems. Through case studies, we illustrate how ALDi can reveal Arabic speakers' stylistic choices in different situations, a useful property for sociolinguistic analyses.

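Summary
Previous work in Arabic NLP has focused on Dialect Identification (DI) at the sentence or token level. However, we argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize as a continuous variable called the Arabic Level of Dialectness (ALDi). We introduce the AOC-ALDi dataset (derived from the AOC dataset), containing 127,835 sentences (17% from news articles and 83% from user comments), each manually labeled with its level of dialectness. We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on other corpora, giving a more nuanced picture than traditional DI systems. Through case studies, we illustrate how ALDi can reveal Arabic speakers' stylistic choices in different situations, a useful property for sociolinguistic analyses.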

Exploring Linguistic Probes for Morphological Generalization

paper_url: http://arxiv.org/abs/2310.13686
repo_url: https://github.com/jkodner05/EMNLP2023_LingProbes
paper_authors: Jordan Kodner, Salam Khalifa, Sarah Payne
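for: This paper addresses the cross-linguistic computational modeling of morphological inflection.
methods: The paper supplements language-independent data splitting algorithms with language-specific probes designed to test aspects of morphological generalization.
results: On three morphologically distinct languages (English, Spanish, and Swahili), the authors find evidence that three leading morphological inflection systems employ distinct generalization strategies over conjugational classes and feature sets, on both orthographic and phonologically transcribed inputs.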

Abstract
Modern work on the cross-linguistic computational modeling of morphological inflection has typically employed language-independent data splitting algorithms. In this paper, we supplement that approach with language-specific probes designed to test aspects of morphological generalization. Testing these probes on three morphologically distinct languages, English, Spanish, and Swahili, we find evidence that three leading morphological inflection systems employ distinct generalization strategies over conjugational classes and feature sets on both orthographic and phonologically transcribed inputs.

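Summary
Modern work on the cross-linguistic computational modeling of morphological inflection has typically used language-independent data splitting algorithms. In this paper, we supplement that approach with language-specific probes designed to test aspects of morphological generalization. Testing these probes on three morphologically distinct languages, English, Spanish, and Swahili, we find evidence that three leading morphological inflection systems employ distinct generalization strategies over conjugational classes and feature sets, on both orthographic and phonologically transcribed inputs.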

Information Value: Measuring Utterance Predictability as Distance from Plausible Alternatives

paper_url: http://arxiv.org/abs/2310.13676
repo_url: https://github.com/dmg-illc/information-value
paper_authors: Mario Giulianelli, Sarenne Wallbridge, Raquel Fernández
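for: This paper studies utterance predictability and proposes information value, a measure that quantifies how predictable an utterance is relative to a set of plausible alternatives.
methods: The paper uses neural text generators to obtain interpretable estimates of information value and exploits their psychometric predictive power to investigate the dimensions of predictability that drive human comprehension behaviour.
results: Information value is a stronger predictor of utterance acceptability in written and spoken dialogue than aggregates of token-level surprisal, and it is complementary to surprisal for predicting eye-tracked reading times.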

Abstract
We present information value, a measure which quantifies the predictability of an utterance relative to a set of plausible alternatives. We introduce a method to obtain interpretable estimates of information value using neural text generators, and exploit their psychometric predictive power to investigate the dimensions of predictability that drive human comprehension behaviour. Information value is a stronger predictor of utterance acceptability in written and spoken dialogue than aggregates of token-level surprisal and it is complementary to surprisal for predicting eye-tracked reading times.

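Summary
We present information value, a measure that quantifies the predictability of an utterance relative to a set of plausible alternatives. We introduce a method that uses neural text generators to obtain interpretable estimates of information value, and we exploit their psychometric predictive power to investigate the dimensions of predictability that drive human comprehension behaviour. Information value predicts utterance acceptability in written and spoken dialogue better than aggregates of token-level surprisal, and it is complementary to surprisal for predicting eye-tracked reading times.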
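The idea of measuring predictability as "distance from plausible alternatives" can be sketched as a mean distance between an utterance and a set of alternatives. Two assumptions are made here for illustration: the alternatives are hard-coded rather than sampled from a neural text generator, and the distance is a Jaccard distance over word sets, which is a stand-in and not necessarily a distance used by the authors.

```python
from typing import Callable, List


def jaccard_distance(a: str, b: str) -> float:
    """1 minus the word-set overlap of two utterances (illustrative distance)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return 1 - len(words_a & words_b) / len(union) if union else 0.0


def information_value(
    utterance: str,
    alternatives: List[str],
    distance: Callable[[str, str], float] = jaccard_distance,
) -> float:
    """Mean distance from the utterance to each plausible alternative."""
    if not alternatives:
        raise ValueError("at least one alternative is required")
    return sum(distance(utterance, alt) for alt in alternatives) / len(alternatives)


# Hypothetical alternatives for the context "How are you?".
alternatives = ["i am fine thanks", "i am good thanks"]
expected_reply = information_value("i am fine thanks", alternatives)
surprising_reply = information_value("the train leaves at noon", alternatives)
```

An utterance close to the alternatives gets a low value (predictable), while one that shares nothing with them gets the maximum value under this distance.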

On Synthetic Data for Back Translation

paper_url: http://arxiv.org/abs/2310.13675
repo_url: https://github.com/jiahao004/data-for-bt
paper_authors: Jiahao Xu, Yubin Ruan, Wei Bi, Guoping Huang, Shuming Shi, Lihui Chen, Lemao Liu
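for: This study investigates back translation (BT) in neural machine translation (NMT) and asks what kind of synthetic data contributes to BT performance.
methods: The study combines theoretical and empirical analysis of the role of synthetic data in BT performance and proposes a simple yet effective method for generating synthetic data that better trades off two factors, quality and importance.
results: Extensive experiments on the WMT14 DE-EN, EN-DE, and RU-EN benchmark tasks show that the proposed method significantly outperforms the standard BT baselines.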

Abstract
Back translation (BT) is one of the most significant technologies in NMT research fields. Existing attempts on BT share a common characteristic: they employ either beam search or random sampling to generate synthetic data with a backward model, but few works study the role of synthetic data in the performance of BT. This motivates us to ask a fundamental question: what kind of synthetic data contributes to BT performance? Through both theoretical and empirical studies, we identify two key factors of synthetic data that control back-translation NMT performance: quality and importance. Furthermore, based on our findings, we propose a simple yet effective method to generate synthetic data that better trades off both factors so as to yield better performance for BT. We run extensive experiments on WMT14 DE-EN, EN-DE, and RU-EN benchmark tasks. By employing our proposed method to generate synthetic data, our BT model significantly outperforms the standard BT baselines (i.e., beam and sampling based methods for data generation), which demonstrates the effectiveness of our proposed method.

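Summary
Back translation (BT) is one of the most important techniques in machine translation research. Most existing approaches use beam search or random sampling to generate synthetic data with a backward model, but few works study the role of synthetic data in BT performance. This raises a fundamental question: what kind of synthetic data contributes to BT performance? Through theoretical and empirical studies, we identify two key factors that control back-translation NMT performance: quality and importance. Based on these findings, we propose a simple yet effective method for generating synthetic data that better balances the two factors and improves BT performance. Extensive experiments on the WMT14 DE-EN, EN-DE, and RU-EN benchmark tasks show that the proposed method significantly outperforms the standard BT baselines (the beam search and sampling baselines), which demonstrates its effectiveness.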

StereoMap: Quantifying the Awareness of Human-like Stereotypes in Large Language Models

paper_url: http://arxiv.org/abs/2310.13673
repo_url: https://github.com/sullamij/stereomap
paper_authors: Sullam Jeoung, Yubin Ge, Jana Diesner
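for: This study aims to understand how large language models (LLMs) perceive and represent social groups, given that LLMs encode and perpetuate harmful associations present in their training data.
methods: The study proposes StereoMap, a framework grounded in the Stereotype Content Model (SCM), a well-established theory from psychology. StereoMap maps LLMs' perceptions of social groups along two dimensions, Warmth and Competence.
results: LLMs exhibit a diverse range of perceptions of social groups, characterized by mixed evaluations along the Warmth and Competence dimensions. An analysis of the models' reasoning shows that LLMs demonstrate an awareness of social disparities and often cite statistical data and research findings to support their reasoning.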

Abstract
Large Language Models (LLMs) have been observed to encode and perpetuate harmful associations present in the training data. We propose a theoretically grounded framework called StereoMap to gain insights into their perceptions of how demographic groups have been viewed by society. The framework is grounded in the Stereotype Content Model (SCM); a well-established theory from psychology. According to SCM, stereotypes are not all alike. Instead, the dimensions of Warmth and Competence serve as the factors that delineate the nature of stereotypes. Based on the SCM theory, StereoMap maps LLMs' perceptions of social groups (defined by socio-demographic features) using the dimensions of Warmth and Competence. Furthermore, the framework enables the investigation of keywords and verbalizations of reasoning of LLMs' judgments to uncover underlying factors influencing their perceptions. Our results show that LLMs exhibit a diverse range of perceptions towards these groups, characterized by mixed evaluations along the dimensions of Warmth and Competence. Furthermore, analyzing the reasonings of LLMs, our findings indicate that LLMs demonstrate an awareness of social disparities, often stating statistical data and research findings to support their reasoning. This study contributes to the understanding of how LLMs perceive and represent social groups, shedding light on their potential biases and the perpetuation of harmful associations.

Summary

paper_url: http://arxiv.org/abs/2310.13664
repo_url: None
paper_authors: Eliseo Bao Souto, Anxo Pérez, Javier Parapar
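for: This study aims to use language models to detect and explain markers of depressive symptoms in what users write on social platforms.
methods: The authors use transformer-based architectures for the two tasks, classification and explanation of the classifier's decision, either with separate models or with a single unified model. For the unified setting they also evaluate recent conversational LLMs with in-context learning.
results: The experiments show that good classification results and interpretable decisions can be achieved together. The natural language explanations help clinicians understand the basis of the models' decisions.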

Abstract
Users of social platforms often perceive these sites as supportive spaces to post about their mental health issues. Those conversations contain important traces about individuals' health risks. Recently, researchers have exploited this online information to construct mental health detection models, which aim to identify users at risk on platforms like Twitter, Reddit or Facebook. Most of these models are centred on achieving good classification results, ignoring the explainability and interpretability of the decisions. Recent research has pointed out the importance of using clinical markers, such as the use of symptoms, to improve trust in the computational models by health professionals. In this paper, we propose using transformer-based architectures to detect and explain the appearance of depressive symptom markers in the users' writings. We present two approaches: i) train a model to classify, and another one to explain the classifier's decision separately and ii) unify the two tasks simultaneously using a single model. Additionally, for this latter manner, we also investigated the performance of recent conversational LLMs when using in-context learning. Our natural language explanations enable clinicians to interpret the models' decisions based on validated symptoms, enhancing trust in the automated process. We evaluate our approach using recent symptom-based datasets, employing both offline and expert-in-the-loop metrics to assess the quality of the explanations generated by our models. The experimental results show that it is possible to achieve good classification results while generating interpretable symptom-based explanations.

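Summary
Users of social platforms often treat them as supportive spaces for discussing their mental health, and these conversations carry important traces of individual health risk. In recent years, researchers have used this online information to build mental health detection models that identify at-risk users. Most of these models focus on classification performance and neglect explainability and interpretability. Recent research shows that clinical markers, such as symptoms, can increase trust in computational models. In this paper, we propose transformer-based architectures to detect and explain depressive symptom markers in users' writing. We present two approaches: training the classifier and the explainer as separate models, and handling both tasks with a single model. We also investigate how recent conversational LLMs perform with in-context learning. Our natural-language explanations let clinicians interpret model decisions in terms of validated symptoms, increasing trust in the automated process. We evaluate on recent symptom-based datasets with both offline and expert-in-the-loop metrics. The results show that good classification and interpretable symptom-based explanations can be achieved together.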

Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification

paper_url: http://arxiv.org/abs/2310.13661
repo_url: https://github.com/amr-keleg/adi-under-scrutiny
paper_authors: Amr Keleg, Walid Magdy
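for: This paper examines automatic Arabic Dialect Identification (ADI), in particular the difficulty ADI systems have in distinguishing Arabic micro-dialects.
methods: The authors propose framing ADI as a multi-label classification problem and give recommendations for designing new ADI datasets.
results: A manual error analysis indicates that about 66% of the validated errors are not true errors.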

Abstract
Automatic Arabic Dialect Identification (ADI) of text has gained great popularity since it was introduced in the early 2010s. Multiple datasets were developed, and yearly shared tasks have been running since 2018. However, ADI systems are reported to fail in distinguishing between the micro-dialects of Arabic. We argue that the currently adopted framing of the ADI task as a single-label classification problem is one of the main reasons for that. We highlight the limitation of the incompleteness of the Dialect labels and demonstrate how it impacts the evaluation of ADI systems. A manual error analysis for the predictions of an ADI, performed by 7 native speakers of different Arabic dialects, revealed that $\approx$ 66% of the validated errors are not true errors. Consequently, we propose framing ADI as a multi-label classification task and give recommendations for designing new ADI datasets.

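Summary
Automatic Arabic Dialect Identification (ADI) has attracted wide attention since the early 2010s. Multiple datasets have been developed, and shared tasks are held every year. However, ADI systems are reported to fail at distinguishing Arabic micro-dialects. We argue that the current framing of the ADI task as single-label classification is one of the main reasons. We highlight the limitation caused by incomplete dialect labels and show how it affects the evaluation of ADI systems. In a manual error analysis of an ADI system's predictions, carried out by 7 native speakers of different Arabic dialects, approximately 66% of the validated errors turned out not to be true errors. We therefore propose reframing ADI as a multi-label classification task and give recommendations for designing new ADI datasets.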
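
The effect of incomplete labels on evaluation can be illustrated with a small sketch. The sentences, dialect codes, and sets of acceptable dialects below are hypothetical toy data, not taken from the paper, and the two scoring functions are only one simple way to contrast single-label and multi-label scoring, not the authors' evaluation protocol.

```python
def single_label_accuracy(predictions, gold_labels):
    """Fraction of predictions that equal the one annotated dialect."""
    correct = sum(pred == gold for pred, gold in zip(predictions, gold_labels))
    return correct / len(predictions)


def multi_label_accuracy(predictions, valid_label_sets):
    """Fraction of predictions that fall inside the set of acceptable dialects."""
    correct = sum(pred in valid for pred, valid in zip(predictions, valid_label_sets))
    return correct / len(predictions)


# Hypothetical toy data: three sentences, illustrative dialect codes.
predictions = ["EGY", "SYR", "MAR"]
gold_labels = ["EGY", "JOR", "TUN"]
# The second sentence is assumed to be valid in several dialects.
valid_label_sets = [{"EGY"}, {"JOR", "SYR", "PAL"}, {"TUN"}]

single = single_label_accuracy(predictions, gold_labels)
multi = multi_label_accuracy(predictions, valid_label_sets)
print(f"single-label accuracy: {single:.2f}, multi-label accuracy: {multi:.2f}")
```

In this toy case the second prediction is counted as an error under single-label scoring but is accepted once the sentence is allowed more than one valid dialect.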

Benchmarking and Improving Text-to-SQL Generation under Ambiguity

paper_url: http://arxiv.org/abs/2310.13659
repo_url: https://github.com/testzer0/ambiqt
paper_authors: Adithya Bhaskar, Tushar Tomar, Ashutosh Sathe, Sunita Sarawagi
for: bridging the gap between text-to-SQL conversion and real-life database queries
methods: developing a novel benchmark called AmbiQT, and proposing a new decoding algorithm called LogicalBeam
results: LogicalBeam is up to $2.5$ times more effective than state-of-the-art models at generating all candidate SQLs in the top-$k$ ranked outputs, and enhances the top-$5$ Exact and Execution Match Accuracies on SPIDER and Kaggle DBQA.

Abstract
Research in Text-to-SQL conversion has been largely benchmarked against datasets where each text query corresponds to one correct SQL. However, natural language queries over real-life databases frequently involve significant ambiguity about the intended SQL due to overlapping schema names and multiple confusing relationship paths. To bridge this gap, we develop a novel benchmark called AmbiQT with over 3000 examples where each text is interpretable as two plausible SQLs due to lexical and/or structural ambiguity. When faced with ambiguity, an ideal top-$k$ decoder should generate all valid interpretations for possible disambiguation by the user. We evaluate several Text-to-SQL systems and decoding algorithms, including those employing state-of-the-art LLMs, and find them to be far from this ideal. The primary reason is that the prevalent beam search algorithm and its variants, treat SQL queries as a string and produce unhelpful token-level diversity in the top-$k$. We propose LogicalBeam, a new decoding algorithm that navigates the SQL logic space using a blend of plan-based template generation and constrained infilling. Counterfactually generated plans diversify templates while in-filling with a beam-search that branches solely on schema names provides value diversity. LogicalBeam is up to $2.5$ times more effective than state-of-the-art models at generating all candidate SQLs in the top-$k$ ranked outputs. It also enhances the top-$5$ Exact and Execution Match Accuracies on SPIDER and Kaggle DBQA.

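Summary
Research on Text-to-SQL conversion has mostly been benchmarked on datasets where each text query has a single correct SQL. Natural language queries over real-life databases, however, are often ambiguous because of overlapping schema names and multiple relationship paths. To bridge this gap, we develop AmbiQT, a new benchmark with over 3000 examples in which each text can be interpreted as two plausible SQLs. When faced with ambiguity, an ideal top-$k$ decoder should generate all valid interpretations so that the user can disambiguate. We find that existing Text-to-SQL systems and decoding algorithms, including those using state-of-the-art LLMs, are far from this ideal. The main reason is that the prevalent beam search algorithm and its variants treat SQL queries as strings and produce unhelpful token-level diversity in the top-$k$. We propose LogicalBeam, a new decoding algorithm that navigates the SQL logic space by combining plan-based template generation with constrained infilling. During infilling, a beam search that branches only on schema names provides value diversity. LogicalBeam is up to 2.5 times more effective than existing models at generating all candidate SQLs in the top-$k$, and it also improves top-$5$ Exact and Execution Match accuracy on SPIDER and Kaggle DBQA.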
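
The idea that a top-$k$ list should contain every valid interpretation can be sketched as a simple coverage check. The queries below are hypothetical, and the whitespace and case normalisation is only a crude stand-in for the exact-match and execution-match comparisons named in the abstract.

```python
def normalize(sql):
    """Crude normalisation: lowercase and collapse whitespace.

    This is an assumption made for the sketch, not a real SQL matcher.
    """
    return " ".join(sql.lower().split())


def covers_all(top_k, valid_sqls):
    """True if every valid interpretation appears somewhere in the top-k outputs."""
    produced = {normalize(query) for query in top_k}
    return all(normalize(valid) in produced for valid in valid_sqls)


# Hypothetical ambiguous question with two plausible SQL interpretations.
valid = ["SELECT name FROM singer", "SELECT name FROM artist"]

# Token-level diversity: three surface variants of the same query.
token_diverse = [
    "SELECT name FROM singer",
    "select  name from singer",
    "SELECT name  FROM singer",
]
# Logical diversity: the outputs differ in which schema names they use.
logically_diverse = [
    "SELECT name FROM singer",
    "SELECT name FROM artist",
    "SELECT title FROM song",
]

token_result = covers_all(token_diverse, valid)
logical_result = covers_all(logically_diverse, valid)
print(token_result, logical_result)
```

The first list spends all three slots on one interpretation, so it fails the check, while the second list covers both.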

BotChat: Evaluating LLMs’ Capabilities of Having Multi-Turn Dialogues

paper_url: http://arxiv.org/abs/2310.13650
repo_url: https://github.com/open-compass/botchat
paper_authors: Haodong Duan, Jueqi Wei, Chonghua Wang, Hongwei Liu, Yixiao Fang, Songyang Zhang, Dahua Lin, Kai Chen
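for: This report evaluates the ability of existing large language models (LLMs) to hold human-style multi-turn dialogues.
methods: Real-world human dialogues supply the opening utterances, from which LLMs generate full multi-turn dialogues (tens of utterances). State-of-the-art LLMs such as GPT-4 then act as judges of the generated dialogues.
results: GPT-4 generates human-style multi-turn dialogues of very high quality, clearly ahead of the other LLMs, and its dialogues are hard to identify as machine-generated. Other LLMs struggle to produce satisfactory multi-turn dialogues, mainly because of poor instruction following, overly long utterances, or limited general capability.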

Abstract
Interacting with humans via high-quality multi-turn dialogues is a key feature of large language models (LLMs). However, human-based evaluation of such capability involves intensive manual labor. This report provides a preliminary evaluation of existing large language models for human-style multi-turn chatting, through an LLM-based approach. We start from real-world human dialogues and keep the very first utterances as the ChatSEED. Then we prompt LLMs to generate a full multi-turn dialogue (tens of utterances) based on the ChatSEED, utterance by utterance. Finally, we adopt state-of-the-art LLMs (GPT-4, etc.) as the judge to evaluate the generated dialogues. With different evaluation protocols, we come to substantially identical conclusions. We find that GPT-4 can generate human-style multi-turn dialogues with impressive quality, significantly outperforming its counterparts. It is difficult for a discriminator to distinguish between GPT-4 generated dialogues and human dialogues. In contrast, other LLMs struggle to generate multi-turn dialogues of satisfactory quality due to poor instruction-following capability, a tendency to generate lengthy utterances, or limited general capability. All data and code will be provided at https://github.com/open-compass/BotChat/ and we hope they can serve as a valuable resource for evaluating multi-turn chatting capabilities of LLMs.
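
The utterance-by-utterance generation described above can be sketched as a simple loop. This is a minimal illustration, not the BotChat implementation: `placeholder_reply` is a hypothetical stand-in for an LLM call, and the seed utterances are invented.

```python
def generate_dialogue(chat_seed, reply_fn, target_utterances):
    """Extend the seed one utterance at a time until the target length is reached."""
    dialogue = list(chat_seed)  # keep the human-written opening untouched
    while len(dialogue) < target_utterances:
        # Each new utterance is conditioned on the full history so far.
        dialogue.append(reply_fn(dialogue))
    return dialogue


def placeholder_reply(history):
    """Hypothetical stand-in for an LLM call; returns a canned utterance."""
    return f"(utterance {len(history) + 1}, replying to: {history[-1]})"


# Invented seed: the first utterances of a human dialogue.
chat_seed = ["Hi, how was your weekend?", "Pretty good, I went hiking."]
dialogue = generate_dialogue(chat_seed, placeholder_reply, target_utterances=6)
for turn in dialogue:
    print(turn)
```

Passing the whole history to `reply_fn` at every step is the design choice that makes the generation multi-turn rather than a series of independent replies.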

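Summary
Interacting with humans through high-quality multi-turn dialogue is a key capability of large language models (LLMs), but human evaluation of this capability is labor-intensive. This report gives a preliminary, LLM-based evaluation grounded in real-world human dialogues. We take the first utterances of real human dialogues as the ChatSEED and then have LLMs generate a full multi-turn dialogue (tens of utterances), one utterance at a time. Finally, we use state-of-the-art LLMs (GPT-4, etc.) as judges and assess the generated dialogues under different evaluation protocols. We find that GPT-4 generates human-style multi-turn dialogues of very high quality, clearly ahead of its counterparts, and that a discriminator can hardly distinguish GPT-4 dialogues from human ones. Other LLMs struggle to produce multi-turn dialogues of satisfactory quality, mainly because of poor instruction following, overly long utterances, or limited general capability. All data and code will be provided at https://github.com/open-compass/BotChat/, and we hope they serve as a valuable resource for evaluating the multi-turn chatting capabilities of LLMs.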

Bridging Information-Theoretic and Geometric Compression in Language Models

paper_url: http://arxiv.org/abs/2310.13620
repo_url: https://github.com/chengemily1/id_bridging
paper_authors: Emily Cheng, Corentin Kervadec, Marco Baroni
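for: This study examines how language models (LMs) compress linguistic information and how that compression relates to their performance.
methods: Compression in pre-trained LMs is analyzed from a geometric and an information-theoretic point of view. The two views are highly correlated: the intrinsic geometric dimension of linguistic data predicts its coding length under the LM.
results: High compression of a linguistic dataset predicts rapid adaptation to that dataset. The authors also evaluate a battery of intrinsic dimension estimators and find that only some capture the relationship between information-theoretic compression, geometric compression, and ease of adaptation.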

Abstract
For a language model (LM) to faithfully model human language, it must compress vast, potentially infinite information into relatively few dimensions. We propose analyzing compression in (pre-trained) LMs from two points of view: geometric and information-theoretic. We demonstrate that the two views are highly correlated, such that the intrinsic geometric dimension of linguistic data predicts their coding length under the LM. We then show that, in turn, high compression of a linguistic dataset predicts rapid adaptation to that dataset, confirming that being able to compress linguistic information is an important part of successful LM performance. As a practical byproduct of our analysis, we evaluate a battery of intrinsic dimension estimators for the first time on linguistic data, showing that only some encapsulate the relationship between information-theoretic compression, geometric compression, and ease-of-adaptation.

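Summary
To model human language faithfully, a language model (LM) must compress vast, potentially infinite information into relatively few dimensions. We analyze compression in LMs from two points of view: geometric and information-theoretic. We show that the two views are highly correlated, in that the intrinsic geometric dimension of linguistic data predicts its coding length under the LM. We then show that high compression of a linguistic dataset predicts rapid adaptation to that dataset, confirming that compressing linguistic information is an important part of successful LM performance. In addition, we evaluate a battery of intrinsic dimension estimators and find that only some capture the relationship between information-theoretic compression, geometric compression, and ease of adaptation.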
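
As a rough illustration of the information-theoretic view, the coding length of a text under a model can be computed as the sum of negative log-probabilities of its tokens. This is the standard Shannon code length, assumed here for the sketch; the abstract does not spell out the exact estimator, and the probabilities below are invented.

```python
import math


def coding_length_bits(token_probs):
    """Shannon code length in bits: sum of -log2 p(token | context)."""
    return sum(-math.log2(p) for p in token_probs)


# Invented per-token probabilities that a model might assign.
predictable_text = [0.5, 0.5, 0.25]  # tokens the model finds likely
surprising_text = [0.125, 0.125]     # tokens the model finds unlikely

bits_predictable = coding_length_bits(predictable_text)
bits_surprising = coding_length_bits(surprising_text)
print(bits_predictable, bits_surprising)
```

A text the model predicts well needs fewer bits per token, which is the sense in which the model compresses it.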

Semi-supervised multimodal coreference resolution in image narrations

paper_url: http://arxiv.org/abs/2310.13619
repo_url: None
paper_authors: Arushi Goel, Basura Fernando, Frank Keller, Hakan Bilen
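for: This paper studies multimodal coreference resolution, specifically the setting where a longer descriptive text (a narration) is paired with an image.
methods: A data-efficient semi-supervised approach uses image-narration pairs to resolve coreferences and perform narrative grounding. It combines losses for labeled and unlabeled data within a cross-modal framework.
results: Experiments show that the approach outperforms strong baselines both quantitatively and qualitatively on coreference resolution and narrative grounding.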

Abstract
In this paper, we study multimodal coreference resolution, specifically where a longer descriptive text, i.e., a narration is paired with an image. This poses significant challenges due to fine-grained image-text alignment, inherent ambiguity present in narrative language, and unavailability of large annotated training sets. To tackle these challenges, we present a data efficient semi-supervised approach that utilizes image-narration pairs to resolve coreferences and narrative grounding in a multimodal context. Our approach incorporates losses for both labeled and unlabeled data within a cross-modal framework. Our evaluation shows that the proposed approach outperforms strong baselines both quantitatively and qualitatively, for the tasks of coreference resolution and narrative grounding.

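Summary
In this paper, we study multimodal coreference resolution, specifically the setting where a longer descriptive text (a narration) is paired with an image. This setting is challenging because of fine-grained image-text alignment, the ambiguity inherent in narrative language, and the lack of large annotated training sets. To address these challenges, we propose a data-efficient semi-supervised approach that uses image-narration pairs to resolve coreferences and perform narrative grounding in a multimodal context. Our approach combines a loss for labeled data and a loss for unlabeled data within a cross-modal framework. Our evaluation shows that it outperforms strong baselines on both coreference resolution and narrative grounding, quantitatively and qualitatively.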

Three Questions Concerning the Use of Large Language Models to Facilitate Mathematics Learning

paper_url: http://arxiv.org/abs/2310.13615
repo_url: None
paper_authors: An-Zi Yen, Wei-Ling Hsu
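for: This position paper explores the use of large language models (LLMs) to improve students' mathematical problem-solving skills and examines the pedagogical ability of LLMs in educational applications.
methods: The paper considers adaptive feedback, in which an LLM checks and corrects students' answers to help them solve mathematics problems.
results: LLMs can give incorrect feedback because they misinterpret the meaning of a question or reason incorrectly, and they can have difficulty understanding a question's rationale when attempting to correct students' answers.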

Abstract
Due to the remarkable language understanding and generation abilities of large language models (LLMs), their use in educational applications has been explored. However, little work has been done on investigating the pedagogical ability of LLMs in helping students to learn mathematics. In this position paper, we discuss the challenges associated with employing LLMs to enhance students' mathematical problem-solving skills by providing adaptive feedback. Apart from generating the wrong reasoning processes, LLMs can misinterpret the meaning of the question, and also exhibit difficulty in understanding the given questions' rationales when attempting to correct students' answers. Three research questions are formulated.

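Summary

1. How can large language models (LLMs) be used to strengthen students' mathematical problem-solving skills? 2. Can LLMs correctly understand the questions that students submit and provide adaptive feedback? 3. What challenges arise when LLMs help students learn mathematics, and how can they be addressed?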

Simultaneous Machine Translation with Tailored Reference

paper_url: http://arxiv.org/abs/2310.13588
repo_url: https://github.com/ictnlp/Tailored-Ref
paper_authors: Shoutao Guo, Shaolei Zhang, Yang Feng
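for: This work aims to improve the translation quality of simultaneous machine translation (SiMT) models across different latency settings.
methods: A tailor, induced by reinforcement learning, rewrites the ground-truth into a tailored reference for SiMT training, which avoids forced anticipations during training.
results: Experiments on three translation tasks show that SiMT models trained with the tailored reference achieve state-of-the-art performance under both fixed and adaptive policies.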

Abstract
Simultaneous machine translation (SiMT) generates translation while reading the whole source sentence. However, existing SiMT models are typically trained using the same reference disregarding the varying amounts of available source information at different latency. Training the model with ground-truth at low latency may introduce forced anticipations, whereas utilizing reference consistent with the source word order at high latency results in performance degradation. Consequently, it is crucial to train the SiMT model with appropriate reference that avoids forced anticipations during training while maintaining high quality. In this paper, we propose a novel method that provides tailored reference for the SiMT models trained at different latency by rephrasing the ground-truth. Specifically, we introduce the tailor, induced by reinforcement learning, to modify ground-truth to the tailored reference. The SiMT model is trained with the tailored reference and jointly optimized with the tailor to enhance performance. Importantly, our method is applicable to a wide range of current SiMT approaches. Experiments on three translation tasks demonstrate that our method achieves state-of-the-art performance in both fixed and adaptive policies.


Improving Cross-Lingual Transfer through Subtree-Aware Word Reordering

paper_url: http://arxiv.org/abs/2310.13583
repo_url: https://github.com/ofirarviv/ud-based-word-reordering
paper_authors: Ofir Arviv, Dmitry Nikolaev, Taelin Karidi, Omri Abend
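for: This work aims to improve cross-lingual transfer in multilingual models, especially in low-resource settings.
methods: A new reordering method, defined in terms of Universal Dependencies, learns fine-grained word-order patterns from a small amount of annotated data and can be applied across languages and model architectures.
results: The method outperforms strong baselines on a diverse set of tasks, in both zero-shot and few-shot scenarios. This suggests that it effectively mitigates variability in word-order patterns.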

Abstract
Despite the impressive growth of the abilities of multilingual language models, such as XLM-R and mT5, it has been shown that they still face difficulties when tackling typologically-distant languages, particularly in the low-resource setting. One obstacle for effective cross-lingual transfer is variability in word-order patterns. It can be potentially mitigated via source- or target-side word reordering, and numerous approaches to reordering have been proposed. However, they rely on language-specific rules, work on the level of POS tags, or only target the main clause, leaving subordinate clauses intact. To address these limitations, we present a new powerful reordering method, defined in terms of Universal Dependencies, that is able to learn fine-grained word-order patterns conditioned on the syntactic context from a small amount of annotated data and can be applied at all levels of the syntactic tree. We conduct experiments on a diverse set of tasks and show that our method consistently outperforms strong baselines over different language pairs and model architectures. This performance advantage holds true in both zero-shot and few-shot scenarios.

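Summary
Although multilingual language models such as XLM-R and mT5 have become remarkably capable, they still struggle with typologically distant languages, particularly in the low-resource setting. One obstacle to cross-lingual transfer is variability in word-order patterns. It can potentially be mitigated through source- or target-side word reordering, and many reordering approaches have been proposed. However, these approaches rely on language-specific rules, work at the level of POS tags, or target only the main clause and leave subordinate clauses unchanged. To address these limitations, we present a new, powerful reordering method based on Universal Dependencies that learns fine-grained word-order patterns from a small amount of annotated data and can be applied at all levels of the syntactic tree. We run experiments on a diverse set of tasks and show that our method outperforms strong baselines across language pairs and model architectures, and that this advantage holds in both zero-shot and few-shot scenarios.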

Semantic Decomposition of Question and SQL for Text-to-SQL Parsing

paper_url: http://arxiv.org/abs/2310.13575
repo_url: None
paper_authors: Ben Eyal, Amir Bachar, Ophir Haroche, Moran Mahabi, Michael Elhadad
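for: Improving the generalization of text-to-SQL semantic parsing to cross-domain and complex queries.
methods: Question decomposition is used to improve the parsing of complex SQL queries, but it faces two main obstacles: (1) existing datasets lack question decompositions; (2) because of the syntactic complexity of SQL, most complex queries cannot be easily decomposed into sub-queries.
results: The authors propose a new modular Query Plan Language (QPL) that systematically decomposes SQL queries into simple and regular sub-queries. They build an SQL-to-QPL translator by analyzing SQL server query optimization plans and augment the Spider dataset with QPL programs. Experiments show that the modularity of QPL benefits existing semantic-parsing architectures and that training text-to-QPL parsers is more effective than text-to-SQL parsing. QPL has two further advantages: (1) QPL programs can be paraphrased as simple questions, which yields a dataset of complex questions paired with decomposed questions; (2) QPL is more accessible to non-experts for complex queries, which makes the semantic parser's output more interpretable.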

Abstract
Text-to-SQL semantic parsing faces challenges in generalizing to cross-domain and complex queries. Recent research has employed a question decomposition strategy to enhance the parsing of complex SQL queries. However, this strategy encounters two major obstacles: (1) existing datasets lack question decomposition; (2) due to the syntactic complexity of SQL, most complex queries cannot be disentangled into sub-queries that can be readily recomposed. To address these challenges, we propose a new modular Query Plan Language (QPL) that systematically decomposes SQL queries into simple and regular sub-queries. We develop a translator from SQL to QPL by leveraging analysis of SQL server query optimization plans, and we augment the Spider dataset with QPL programs. Experimental results demonstrate that the modular nature of QPL benefits existing semantic-parsing architectures, and training text-to-QPL parsers is more effective than text-to-SQL parsing for semantically equivalent queries. The QPL approach offers two additional advantages: (1) QPL programs can be paraphrased as simple questions, which allows us to create a dataset of (complex question, decomposed questions). Training on this dataset, we obtain a Question Decomposer for data retrieval that is sensitive to database schemas. (2) QPL is more accessible to non-experts for complex queries, leading to more interpretable output from the semantic parser.

Summary
Text-to-SQL semantic parsing faces challenges in generalizing to cross-domain and complex queries. Recent research uses a question decomposition strategy to improve the parsing of complex SQL queries. However, this strategy meets two main obstacles: (1) existing datasets lack question decompositions; (2) because of the syntactic complexity of SQL, most complex queries cannot be decomposed into sub-queries that are easy to recompose. To address these challenges, we propose a new modular Query Plan Language (QPL) that systematically decomposes SQL queries into simple and regular sub-queries. We build a translator from SQL to QPL by analyzing SQL server query optimization plans, and we augment the Spider dataset with QPL programs. Experiments show that the modularity of QPL benefits existing semantic-parsing architectures and that training text-to-QPL parsers is more effective than text-to-SQL parsing for semantically equivalent queries. QPL offers two further advantages: (1) QPL programs can be paraphrased as simple questions, which lets us create a dataset of (complex question, decomposed questions) pairs and train a schema-sensitive Question Decomposer for data retrieval; (2) QPL is more accessible to non-experts for complex queries, which makes the semantic parser's output more interpretable.

Why Can Large Language Models Generate Correct Chain-of-Thoughts?

paper_url: http://arxiv.org/abs/2310.13571
repo_url: None
paper_authors: Rasul Tutunov, Antoine Grosnit, Juliusz Ziomek, Jun Wang, Haitham Bou-Ammar
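for: This paper studies the capabilities of large language models (LLMs), aiming to advance the theoretical understanding of chain-of-thought prompting.
methods: The authors introduce a two-level hierarchical graphical model for natural language generation. Within this framework they establish a geometric convergence rate that gauges the likelihood of an LLM-generated chain of thoughts against chains originating from the true language.
results: The findings give a theoretical justification for the ability of LLMs to produce the correct sequence of thoughts, which may explain performance gains on tasks that demand reasoning skills.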

Abstract
This paper delves into the capabilities of large language models (LLMs), specifically focusing on advancing the theoretical comprehension of chain-of-thought prompting. We investigate how LLMs can be effectively induced to generate a coherent chain of thoughts. To achieve this, we introduce a two-level hierarchical graphical model tailored for natural language generation. Within this framework, we establish a compelling geometrical convergence rate that gauges the likelihood of an LLM-generated chain of thoughts compared to those originating from the true language. Our findings provide a theoretical justification for the ability of LLMs to produce the correct sequence of thoughts (potentially) explaining performance gains in tasks demanding reasoning skills.


Cache & Distil: Optimising API Calls to Large Language Models

paper_url: http://arxiv.org/abs/2310.13561
repo_url: None
paper_authors: Guillem Ramírez, Matthias Lindemann, Alexandra Birch, Ivan Titov
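for: Reducing the cost of large-scale generative AI tools, in particular the API calls needed to answer user queries.
methods: A smaller language model (the student) is continuously trained on the LLM's responses until it can handle user queries on its own. A policy decides which requests the student handles and which are sent to the LLM, which in turn aids the student's learning.
results: On classification tasks, classic active-learning selection criteria such as Margin Sampling and Query by Committee bring consistent benefits across tasks and budgets.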

Abstract
Large-scale deployment of generative AI tools often depends on costly API calls to a Large Language Model (LLM) to fulfil user queries. To curtail the frequency of these calls, one can employ a smaller language model -- a student -- which is continuously trained on the responses of the LLM. This student gradually gains proficiency in independently handling an increasing number of user requests, a process we term neural caching. The crucial element in neural caching is a policy that decides which requests should be processed by the student alone and which should be redirected to the LLM, subsequently aiding the student's learning. In this study, we focus on classification tasks, and we consider a range of classic active learning-based selection criteria as the policy. Our experiments suggest that Margin Sampling and Query by Committee bring consistent benefits across tasks and budgets.

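Summary
Large-scale generative AI tools often depend on costly API calls to answer user queries. To reduce the frequency of these calls, one can use a smaller language model, the student, and train it continuously on the responses of the Large Language Model (LLM). The student gradually becomes able to handle more user requests on its own, a process we call neural caching. The key element of neural caching is a policy that decides which requests the student should handle and which should be redirected to the LLM. In this study, we focus on classification tasks and consider a range of classic active-learning selection criteria. Our experiments suggest that Margin Sampling and Query by Committee bring consistent benefits across tasks and budgets.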
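
A routing policy based on Margin Sampling can be sketched as follows. This is a minimal illustration under stated assumptions: the threshold value, the function names, and the class probabilities are hypothetical, and the paper's actual policy and budget handling are not reproduced here.

```python
def margin(probs):
    """Difference between the two highest class probabilities."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def route(student_probs, threshold=0.2):
    """Send the request to the LLM when the student is unsure (small margin).

    The threshold is a hypothetical value chosen for this sketch.
    """
    return "llm" if margin(student_probs) < threshold else "student"


# Invented student outputs over three classes.
uncertain = [0.48, 0.42, 0.10]  # top two classes are close
confident = [0.90, 0.07, 0.03]  # one class clearly dominates

uncertain_route = route(uncertain)
confident_route = route(confident)
print(uncertain_route, confident_route)
```

Requests routed to the LLM also yield a labeled response, which can then be added to the student's training data.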

The Perils & Promises of Fact-checking with Large Language Models

paper_url: http://arxiv.org/abs/2310.13549
repo_url: None
paper_authors: Dorian Quelle, Alexandre Bovet
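for: This study evaluates how well large language models (LLMs) perform at fact-checking and how reliably they can be used to verify claims.
methods: LLM agents based on GPT-3 and GPT-4 phrase queries, retrieve contextual data, and make decisions, explaining their reasoning and citing relevant sources from the retrieved context.
results: LLMs perform better when equipped with contextual information. GPT-4 outperforms GPT-3, but accuracy varies with query language and claim veracity.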

Abstract
Autonomous fact-checking, using machine learning to verify claims, has grown vital as misinformation spreads beyond human fact-checking capacity. Large Language Models (LLMs) like GPT-4 are increasingly trusted to verify information and write academic papers, lawsuits, and news articles, emphasizing their role in discerning truth from falsehood and the importance of being able to verify their outputs. Here, we evaluate the use of LLM agents in fact-checking by having them phrase queries, retrieve contextual data, and make decisions. Importantly, in our framework, agents explain their reasoning and cite the relevant sources from the retrieved context. Our results show the enhanced prowess of LLMs when equipped with contextual information. GPT-4 outperforms GPT-3, but accuracy varies based on query language and claim veracity. While LLMs show promise in fact-checking, caution is essential due to inconsistent accuracy. Our investigation calls for further research, fostering a deeper comprehension of when agents succeed and when they fail.

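Summary
Autonomous fact-checking, which uses machine learning to verify claims, has become increasingly important as misinformation spreads beyond the capacity of human fact-checkers. Large Language Models (LLMs) such as GPT-4 are increasingly trusted to verify information and to write academic papers, lawsuits, and news articles, which underlines both their role in telling truth from falsehood and the importance of being able to verify their outputs. In our framework, agents phrase queries, retrieve contextual data, and make decisions. Notably, agents explain their reasoning and cite relevant sources from the retrieved context. Our results show that LLM agents perform better when supported by contextual information. GPT-4 outperforms GPT-3, but accuracy varies with query language and claim veracity. Although LLM agents show promise in fact-checking, caution is needed because their accuracy is inconsistent. Our investigation calls for further research to understand more deeply when agents succeed and when they fail.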

A Diachronic Perspective on User Trust in AI under Uncertainty

paper_url: http://arxiv.org/abs/2310.13544
repo_url: https://github.com/zouharvi/trust-intervention
paper_authors: Shehzaad Dhuliawala, Vilém Zouhar, Mennatallah El-Assady, Mrinmaya Sachan
for: This paper studies how user trust in AI systems develops and how it can be regained, and how different types of miscalibration affect that trust.
methods: A betting game is used to study how user trust evolves and recovers.
results: Even a few incorrect predictions with inaccurate confidence estimates damage user trust and performance, and recovery is slow. Different types of miscalibration have different negative effects on user trust. These findings highlight the importance of calibration in user-facing AI applications and shed light on what helps users decide whether to trust the AI system.

Abstract
In a human-AI collaboration, users build a mental model of the AI system based on its reliability and how it presents its decision, e.g. its presentation of system confidence and an explanation of the output. Modern NLP systems are often uncalibrated, resulting in confidently incorrect predictions that undermine user trust. In order to build trustworthy AI, we must understand how user trust is developed and how it can be regained after potential trust-eroding events. We study the evolution of user trust in response to these trust-eroding events using a betting game. We find that even a few incorrect instances with inaccurate confidence estimates damage user trust and performance, with very slow recovery. We also show that this degradation in trust reduces the success of human-AI collaboration and that different types of miscalibration -- unconfidently correct and confidently incorrect -- have different negative effects on user trust. Our findings highlight the importance of calibration in user-facing AI applications and shed light on what aspects help users decide whether to trust the AI system.

Summary
In human-AI collaboration, users build a mental model of the AI system based on its reliability and on how it explains its output. Modern NLP systems are often uncalibrated, producing confidently incorrect predictions that undermine user trust. To build trustworthy AI, we need to understand how user trust develops and how it can be regained. We use a betting game to study how user trust recovers after trust-eroding events. We find that even a few incorrect instances with inaccurate confidence estimates damage user trust and performance, with very slow recovery. We also show that different types of miscalibration, unconfidently correct and confidently incorrect, have different negative effects on user trust. Our findings highlight the importance of calibration in user-facing AI applications and shed light on what helps users decide whether to trust the AI system.
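
Miscalibration of the kind discussed here is often quantified with the expected calibration error (ECE). The sketch below uses that standard metric as an illustration; it is not taken from the paper, and the confidence values and correctness flags are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between stated confidence and observed accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Invented example: a system that is confidently incorrect most of the time.
ece_overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False]
)
# Invented example: stated confidence matches observed accuracy.
ece_calibrated = expected_calibration_error(
    [0.75, 0.75, 0.75, 0.75], [True, True, True, False]
)
print(ece_overconfident, ece_calibrated)
```

A system that reports 90% confidence while being right a quarter of the time has a large gap, which is the "confidently incorrect" case the abstract describes.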

Controlled Randomness Improves the Performance of Transformer Models

paper_url: http://arxiv.org/abs/2310.13526
repo_url: None
paper_authors: Tobias Deußer, Cong Zhao, Wolfgang Krämer, David Leonhard, Christian Bauckhage, Rafet Sifa
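for: This work explores introducing controlled randomness into the training of natural language models to improve fine-tuning and downstream task performance.
methods: Random noise is added to the training process in a controlled way to improve fine-tuning and downstream task performance.
results: Adding such noise to the training process improves model performance on both downstream tasks.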

Abstract
During the pre-training step of natural language models, the main objective is to learn a general representation of the pre-training dataset, usually requiring large amounts of textual data to capture the complexity and diversity of natural language. Contrasting this, in most cases, the size of the data available to solve the specific downstream task is often dwarfed by the aforementioned pre-training dataset, especially in domains where data is scarce. We introduce controlled randomness, i.e. noise, into the training process to improve fine-tuning language models and explore the performance of targeted noise in addition to the parameters of these models. We find that adding such noise can improve the performance in our two downstream tasks of joint named entity recognition and relation extraction and text summarization.

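Summary
During the pre-training of natural language models, the main objective is to learn a general representation of the pre-training dataset, which usually requires large amounts of text to capture the complexity and diversity of natural language. In contrast, the data available for a specific downstream task is usually much smaller than the pre-training dataset, especially in domains where data is scarce. To address this, we introduce controlled randomness, i.e. noise, into the training process to improve the fine-tuning of language models, and we explore the effect of targeted noise in addition to the model parameters. We find that adding such noise improves performance on our two downstream tasks: joint named entity recognition and relation extraction, and text summarization.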
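
One simple form of controlled randomness is zero-mean Gaussian noise added to a parameter vector, with the standard deviation as the control knob. The sketch below assumes that form for illustration only; the abstract does not specify the noise distribution or where exactly it is injected, and the weights and `std` value are invented.

```python
import random


def add_gaussian_noise(params, std, rng):
    """Return a copy of the parameters with zero-mean Gaussian noise added."""
    return [weight + rng.gauss(0.0, std) for weight in params]


rng = random.Random(0)       # fixed seed so the perturbation is reproducible
weights = [0.5, -1.2, 0.3]   # invented parameter values

noisy_weights = add_gaussian_noise(weights, std=0.01, rng=rng)
unchanged = add_gaussian_noise(weights, std=0.0, rng=rng)
print(noisy_weights)
```

Setting `std` to zero recovers the original parameters, so the amount of randomness can be tuned like any other hyperparameter.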

Teaching Language Models to Self-Improve through Interactive Demonstrations

paper_url: http://arxiv.org/abs/2310.13522
repo_url: https://github.com/jasonyux/tripost
paper_authors: Xiao Yu, Baolin Peng, Michel Galley, Jianfeng Gao, Zhou Yu
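for: This paper aims to give smaller language models the ability to self-improve, narrowing the performance gap to state-of-the-art LLMs.
methods: The authors propose TriPosT, a training algorithm in which the smaller model interacts with large language models to collect feedback and improvements on its own generations, strengthening its ability to self-improve.
results: Experiments show that TriPosT improves a LLaMA-7b model's performance on math and reasoning tasks by up to 7.13%. The interactive experience of learning from and correcting its own mistakes proves crucial for the small model.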

Abstract
The self-improving ability of large language models (LLMs), enabled by prompting them to analyze and revise their own outputs, has garnered significant interest in recent research. However, this ability has been shown to be absent and difficult to learn for smaller models, thus widening the performance gap between state-of-the-art LLMs and more cost-effective and faster ones. To reduce this gap, we introduce TriPosT, a training algorithm that endows smaller models with such self-improvement ability, and show that our approach can improve a LLaMA-7b's performance on math and reasoning tasks by up to 7.13%. In contrast to prior work, we achieve this by using the smaller model to interact with LLMs to collect feedback and improvements on its own generations. We then replay this experience to train the small model. Our experiments on four math and reasoning datasets show that the interactive experience of learning from and correcting its own mistakes is crucial for small models to improve their performance.

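Summary
The ability of large language models (LLMs) to improve themselves has attracted wide attention in recent research. However, this ability is absent in smaller models and hard for them to learn, which widens the performance gap between state-of-the-art LLMs and more cost-effective models. To reduce this gap, we introduce TriPosT, a training algorithm that gives smaller models the ability to self-improve, and we show that it improves LLaMA-7b's performance on math and reasoning tasks by up to 7.13%. Unlike prior work, we have the smaller model interact with LLMs to collect feedback and improvements on its own generations. We then replay this experience to train the small model. Our experiments on four math and reasoning datasets show that the interactive experience of learning from and correcting its own mistakes is crucial for small models to improve.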

Improving Question Generation with Multi-level Content Planning

paper_url: http://arxiv.org/abs/2310.13512
repo_url: https://github.com/zeaver/multifactor
paper_authors: Zehua Xia, Qi Gou, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li, Cam-Tu Nguyen
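for: This work addresses generating questions from a given context and answer, especially questions that require multi-hop reasoning over the context.
methods: MultiFactor, a question generation framework based on multi-level content planning, has two components: the FA-model, which simultaneously selects key phrases and generates full answers, and the Q-model, which takes the generated full answer as additional input to generate questions.
results: The method outperforms strong baselines on two popular question generation datasets.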

Abstract
This paper addresses the problem of generating questions from a given context and an answer, specifically focusing on questions that require multi-hop reasoning across an extended context. Previous studies have suggested that key phrase selection is essential for question generation (QG), yet it is still challenging to connect such disjointed phrases into meaningful questions, particularly for long context. To mitigate this issue, we propose MultiFactor, a novel QG framework based on multi-level content planning. Specifically, MultiFactor includes two components: FA-model, which simultaneously selects key phrases and generates full answers, and Q-model which takes the generated full answer as an additional input to generate questions. Here, full answer generation is introduced to connect the short answer with the selected key phrases, thus forming an answer-aware summary to facilitate QG. Both FA-model and Q-model are formalized as simple-yet-effective Phrase-Enhanced Transformers, our joint model for phrase selection and text generation. Experimental results show that our method outperforms strong baselines on two popular QG datasets. Our code is available at https://github.com/zeaver/MultiFactor.

Summary
MultiFactor consists of two components: the FA-model, which simultaneously selects key phrases and generates full answers, and the Q-model, which takes the generated full answer as input to generate questions. The FA-model uses a Phrase-Enhanced Transformer to formalize the process of selecting key phrases and generating full answers. The Q-model also uses a Phrase-Enhanced Transformer to generate questions based on the generated full answer. The key innovation of MultiFactor is the use of full answer generation to connect the short answer with the selected key phrases, creating an answer-aware summary that facilitates QG. This approach allows the model to generate more coherent and relevant questions, especially for long contexts. Experimental results show that MultiFactor outperforms strong baselines on two popular QG datasets. Our code is available at https://github.com/zeaver/MultiFactor.

DistillCSE: Distilled Contrastive Learning for Sentence Embeddings

paper_url: http://arxiv.org/abs/2310.13499
repo_url: None
paper_authors: Jiahao Xu, Wei Shao, Lihui Chen, Lemao Liu
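for: This paper proposes DistillCSE, a framework that performs contrastive learning under the self-training paradigm with knowledge distillation, so that a stronger model can be learned from a base model.
methods: The paper combines knowledge distillation with contrastive learning and counters the high variance of the teacher's logits with a Group-P shuffling strategy and logit averaging over multiple teacher components.
results: Experiments on standard benchmarks show that DistillCSE outperforms many strong baseline methods and achieves new state-of-the-art performance.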

Abstract
This paper proposes the DistillCSE framework, which performs contrastive learning under the self-training paradigm with knowledge distillation. The potential advantage of DistillCSE is its self-enhancing feature: using a base model to provide additional supervision signals, a stronger model may be learned through knowledge distillation. However, vanilla DistillCSE with the standard implementation of knowledge distillation achieves only marginal improvements due to severe overfitting. Further quantitative analyses show why: under standard knowledge distillation, the teacher model's logits exhibit relatively large variance, owing to the nature of contrastive learning. To mitigate the issue induced by high variance, this paper proposes two simple yet effective solutions for knowledge distillation: a Group-P shuffling strategy as an implicit regularization, and averaging the logits from multiple teacher components. Experiments on standard benchmarks demonstrate that the proposed DistillCSE outperforms many strong baseline methods and yields a new state-of-the-art performance.
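The logit-averaging idea in the abstract above can be illustrated with a minimal sketch. It assumes each teacher component produces one row of similarity logits for the same anchor sentence, and it uses a plain KL-divergence distillation loss. The function names, the temperature parameter, and the choice of KL divergence are illustrative assumptions rather than the authors' implementation, and the Group-P shuffling strategy is not shown.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def average_teacher_logits(teacher_logits):
    """Average logits position by position across teacher components.

    teacher_logits: list of rows, one row of logits per teacher (hypothetical layout).
    Averaging reduces the variance of the supervision signal.
    """
    count = len(teacher_logits)
    return [sum(column) / count for column in zip(*teacher_logits)]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student), with the teacher given by the averaged logits."""
    target = softmax(average_teacher_logits(teacher_logits), temperature)
    predicted = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(target, predicted) if p > 0)
```

A student whose logits already match the averaged teacher incurs zero loss, while a student that disagrees with the average is penalized even if it matches one individual teacher.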

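Summary
This paper proposes the DistillCSE framework, which performs contrastive learning under the self-training paradigm with knowledge distillation, so that a stronger model may be learned from a base model. However, the standard implementation of knowledge distillation yields only marginal improvements because of severe overfitting. Further quantitative analysis shows that the teacher model's logits have relatively large variance, which stems from the nature of contrastive learning. To address this, the paper proposes two simple yet effective solutions: a Group-P shuffling strategy as implicit regularization, and averaging the logits from multiple teacher components. Experiments show that DistillCSE outperforms many strong baseline methods and achieves new state-of-the-art performance.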

Steering Large Language Models for Machine Translation with Finetuning and In-Context Learning

paper_url: http://arxiv.org/abs/2310.13448
repo_url: None
paper_authors: Duarte M. Alves, Nuno M. Guerreiro, João Alves, José Pombal, Ricardo Rei, José G. C. de Souza, Pierre Colombo, André F. T. Martins
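for: This study examines the brittleness of LLM-based machine translation systems and ways to address it.
methods: The study uses adapter-based finetuning with LoRA and shows that it matches traditional finetuning while reducing the number of training parameters by a factor of 50.
results: Finetuning generally degrades few-shot performance. A simple approach that includes few-shot examples during finetuning recovers the original few-shot capabilities while keeping the benefits of finetuning.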

Abstract
Large language models (LLMs) are a promising avenue for machine translation (MT). However, current LLM-based MT systems are brittle: their effectiveness highly depends on the choice of few-shot examples and they often require extra post-processing due to overgeneration. Alternatives such as finetuning on translation instructions are computationally expensive and may weaken in-context learning capabilities, due to overspecialization. In this paper, we provide a closer look at this problem. We start by showing that adapter-based finetuning with LoRA matches the performance of traditional finetuning while reducing the number of training parameters by a factor of 50. This method also outperforms few-shot prompting and eliminates the need for post-processing or in-context examples. However, we show that finetuning generally degrades few-shot performance, hindering adaptation capabilities. Finally, to obtain the best of both worlds, we propose a simple approach that incorporates few-shot examples during finetuning. Experiments on 10 language pairs show that our proposed approach recovers the original few-shot capabilities while keeping the added benefits of finetuning.
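To make the parameter saving of adapter-based finetuning concrete, the sketch below counts trainable parameters for one weight matrix with and without a LoRA-style low-rank update, and applies that update in a forward pass. The function names and the alpha scaling convention are illustrative assumptions. The reduction for any single layer depends on its size and the chosen rank, so it will not in general equal the overall factor of 50 reported in the abstract.

```python
def full_param_count(d_in, d_out):
    """Trainable parameters when the whole d_out x d_in matrix is finetuned."""
    return d_in * d_out


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters for a rank-r update B @ A (B: d_out x r, A: r x d_in)."""
    return rank * (d_in + d_out)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]
```

For a hypothetical 4096 x 4096 projection with rank 8, `full_param_count` gives 16,777,216 and `lora_param_count` gives 65,536. When `B` is initialized to zeros, `lora_forward` reproduces the frozen model's output exactly, which is why training can start from the pretrained behavior.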

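Summary
Large language models (LLMs) are a promising avenue for machine translation (MT). However, current LLM-based MT systems are brittle: their effectiveness depends heavily on the choice of few-shot examples, and they often need extra post-processing because of overgeneration. Alternatives such as finetuning on translation instructions are computationally expensive and may weaken in-context learning capabilities. This paper takes a closer look at the problem. Adapter-based finetuning with LoRA matches traditional finetuning while reducing the number of training parameters by a factor of 50; it also outperforms few-shot prompting and removes the need for post-processing or in-context examples. Finetuning, however, generally degrades few-shot performance and hinders adaptation. To get the best of both worlds, the authors propose a simple approach that includes few-shot examples during finetuning. Experiments on 10 language pairs show that it recovers the original few-shot capabilities while keeping the benefits of finetuning.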

The Past, Present, and Future of Typological Databases in NLP

paper_url: http://arxiv.org/abs/2310.13440
repo_url: None
paper_authors: Emi Baylor, Esther Ploeger, Johannes Bjerva
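for: This study examines the inconsistencies in large-scale typological databases and the use of these databases in natural language processing (NLP).
methods: The study systematically explores disagreements across typological databases and resources, and their uses in NLP.
results: Large-scale typological databases are inconsistent with each other and with other sources. Some inconsistencies stem from coding errors or linguistic variation, but many are due to the discrete categorical nature of the databases. A continuous view of typological features could address this and has potential for language modeling in low-resource scenarios.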

Abstract
Typological information has the potential to be beneficial in the development of NLP models, particularly for low-resource languages. Unfortunately, current large-scale typological databases, notably WALS and Grambank, are inconsistent both with each other and with other sources of typological information, such as linguistic grammars. Some of these inconsistencies stem from coding errors or linguistic variation, but many of the disagreements are due to the discrete categorical nature of these databases. We shed light on this issue by systematically exploring disagreements across typological databases and resources, and their uses in NLP, covering the past and present. We next investigate the future of such work, offering an argument that a continuous view of typological features is clearly beneficial, echoing recommendations from linguistics. We propose that such a view of typology has significant potential in the future, including in language modeling in low-resource scenarios.

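Summary
Typological information has the potential to benefit the development of NLP models, particularly for low-resource languages. Unfortunately, current large-scale typological databases, notably WALS and Grambank, are inconsistent both with each other and with other sources of typological information. Some inconsistencies stem from coding errors or linguistic variation, but many are due to the discrete categorical nature of these databases. The paper systematically explores disagreements across these databases and their uses in NLP, covering the past and present. It then turns to the future, arguing that a continuous view of typological features is beneficial, in line with recommendations from linguistics, and that such a view has significant potential, including for language modeling in low-resource scenarios.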

Conversation Chronicles: Towards Diverse Temporal and Relational Dynamics in Multi-Session Conversations

paper_url: http://arxiv.org/abs/2310.13420
repo_url: https://github.com/conversation-chronicles/conversation-chronicles
paper_authors: Jihyoung Jang, Minseong Boo, Hyounghun Kim
for: This paper aims to address the limitation of existing open-domain chatbot research by incorporating contextual information from multiple consecutive sessions into the conversation setup.
methods: The authors introduce a new 1M multi-session dialogue dataset called Conversation Chronicles, which includes time intervals and fine-grained speaker relationships. They also propose a dialogue model called ReBot, which consists of chronological summarization and dialogue generation modules.
results: The human evaluation shows that dialogue episodes in Conversation Chronicles reflect the properties of long-term conversations while maintaining coherent and consistent interactions across all sessions. ReBot, trained on Conversation Chronicles, demonstrates long-term context understanding with a high human engagement score.

Abstract
In the field of natural language processing, open-domain chatbots have emerged as an important research topic. However, a major limitation of existing open-domain chatbot research is its singular focus on short single-session dialogue, neglecting the potential need for understanding contextual information in multiple consecutive sessions that precede an ongoing dialogue. Among the elements that compose the context in multi-session conversation settings, the time intervals between sessions and the relationships between speakers would be particularly important. Despite their importance, current research efforts have not sufficiently addressed these dialogical components. In this paper, we introduce a new 1M multi-session dialogue dataset, called Conversation Chronicles, for implementing a long-term conversation setup in which time intervals and fine-grained speaker relationships are incorporated. Following recent works, we exploit a large language model to produce the data. The extensive human evaluation shows that dialogue episodes in Conversation Chronicles reflect those properties while maintaining coherent and consistent interactions across all the sessions. We also propose a dialogue model, called ReBot, which consists of chronological summarization and dialogue generation modules using only around 630M parameters. When trained on Conversation Chronicles, ReBot demonstrates long-term context understanding with a high human engagement score.

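Summary
In natural language processing, open-domain chatbots have become an important research topic. A major limitation of existing work, however, is its focus on short single-session dialogue, which neglects contextual information from the multiple sessions that precede an ongoing dialogue. In multi-session settings, the time intervals between sessions and the relationships between speakers are particularly important, yet current research has not sufficiently addressed them. This paper introduces Conversation Chronicles, a new 1M multi-session dialogue dataset for a long-term conversation setup that incorporates time intervals and fine-grained speaker relationships. Following recent work, the data is generated with a large language model. Human evaluation shows that dialogue episodes in Conversation Chronicles reflect these properties while maintaining coherent and consistent interactions across sessions. The paper also proposes ReBot, a dialogue model consisting of chronological summarization and dialogue generation modules with only around 630M parameters. Trained on Conversation Chronicles, ReBot demonstrates long-term context understanding with a high human engagement score.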

Towards Enhancing Relational Rules for Knowledge Graph Link Prediction

paper_url: http://arxiv.org/abs/2310.13411
repo_url: https://github.com/ninggirsu/run-gnn
paper_authors: Shuhan Wu, Huaiyu Wan, Wei Chen, Yuting Wu, Junfeng Shen, Youfang Lin
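for: Improving the performance of knowledge graph reasoning.
methods: A query related fusion gate unit models the sequentiality of relation composition, and a buffering update mechanism alleviates lagged entity information propagation.
results: The approach performs better on both transductive and inductive link prediction tasks across multiple datasets.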

Abstract
Graph neural networks (GNNs) have shown promising performance for knowledge graph reasoning. A recent variant of GNN, the progressive relational graph neural network (PRGNN), utilizes relational rules to infer missing knowledge in relational digraphs and achieves notable results. However, during reasoning with PRGNN, two important properties are often overlooked: (1) the sequentiality of relation composition, where the order of combining different relations affects the semantics of the relational rules, and (2) the lagged entity information propagation, where the transmission speed of required information lags behind the appearance speed of new entities. Ignoring these properties leads to incorrect relational rule learning and decreased reasoning accuracy. To address these issues, we propose a novel knowledge graph reasoning approach, the Relational rUle eNhanced Graph Neural Network (RUN-GNN). Specifically, RUN-GNN employs a query related fusion gate unit to model the sequentiality of relation composition and utilizes a buffering update mechanism to alleviate the negative effect of lagged entity information propagation, resulting in higher-quality relational rule learning. Experimental results on multiple datasets demonstrate that RUN-GNN is superior on both transductive and inductive link prediction tasks.

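Summary
Graph neural networks (GNNs) have shown promising performance for knowledge graph reasoning. A recent GNN variant, the progressive relational graph neural network (PRGNN), uses relational rules to infer missing knowledge in relational digraphs and achieves notable results. However, reasoning with PRGNN often overlooks two important properties: (1) the sequentiality of relation composition, where the order in which relations are combined affects the semantics of relational rules, and (2) lagged entity information propagation, where required information is transmitted more slowly than new entities appear. Ignoring these properties leads to incorrect relational rule learning and lower reasoning accuracy. To address them, the paper proposes the Relational rUle eNhanced Graph Neural Network (RUN-GNN), which uses a query related fusion gate unit to model the sequentiality of relation composition and a buffering update mechanism to alleviate the negative effect of lagged entity information propagation, resulting in higher-quality relational rule learning. Experiments on multiple datasets show that RUN-GNN performs better on both transductive and inductive link prediction tasks.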

Explicit Alignment and Many-to-many Entailment Based Reasoning for Conversational Machine Reading

paper_url: http://arxiv.org/abs/2310.13409
repo_url: https://github.com/AidenYo/BiAE
paper_authors: Yangyang Luo, Shiyu Tian, Caixia Yuan, Xiaojie Wang
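for: This study aims to improve conversational machine reading (CMR) systems, in particular the alignment between the document and the user-provided information in multi-turn dialogue.
methods: The method makes decisions with a lightweight many-to-many entailment reasoning module and generates follow-up questions directly from the document and previously asked questions.
results: The method achieves state-of-the-art micro-accuracy and ranks first on the public leaderboard of the ShARC benchmark dataset.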

Abstract
Conversational Machine Reading (CMR) requires answering a user's initial question through multi-turn dialogue interactions based on a given document. Although there exist many effective methods, they largely neglect the alignment between the document and the user-provided information, which significantly affects the intermediate decision-making and subsequent follow-up question generation. To address this issue, we propose a pipeline framework that (1) aligns the aforementioned two sides in an explicit way, (2) makes decisions using a lightweight many-to-many entailment reasoning module, and (3) directly generates follow-up questions based on the document and previously asked questions. Our proposed method achieves state-of-the-art micro-accuracy and ranks first on the public leaderboard of the CMR benchmark dataset ShARC.

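Summary
Conversational Machine Reading (CMR) requires answering a user's initial question through multi-turn dialogue based on a given document. Although many effective methods exist, they largely neglect the alignment between the document and the user-provided information, which significantly affects intermediate decision-making and subsequent follow-up question generation. To address this, the paper proposes a pipeline framework that (1) explicitly aligns the document with the user-provided information, (2) makes decisions using a lightweight many-to-many entailment reasoning module, and (3) directly generates follow-up questions based on the document and previously asked questions. The method achieves state-of-the-art micro-accuracy and ranks first on the public leaderboard of the CMR benchmark dataset ShARC.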

Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models

paper_url: http://arxiv.org/abs/2310.13395
repo_url: https://github.com/stoyian/OCaTS
paper_authors: Ilias Stogiannidis, Stavros Vassos, Prodromos Malakasiotis, Ion Androutsopoulos
for: This paper aims to reduce the operating expense (OpEx) of using third-party large language model (LLM) services for small and medium-sized enterprises (SMEs) by caching previous LLM responses and training local inexpensive models.
methods: The proposed framework includes criteria for deciding when to trust the local model or call the LLM, as well as a methodology to tune the criteria and measure the tradeoff between performance and cost.
results: Experimental results using two LLMs (GPT-3.5 and GPT-4) and two inexpensive students (k-NN classifier and Multi-Layer Perceptron) on two common business tasks (intent recognition and sentiment analysis) show that significant OpEx savings can be obtained with only slightly lower performance.

Abstract
Prompting Large Language Models (LLMs) performs impressively in zero- and few-shot settings. Hence, small and medium-sized enterprises (SMEs) that can afford neither the cost of creating large task-specific training datasets nor the cost of pretraining their own LLMs are increasingly turning to third-party services that allow them to prompt LLMs. However, such services currently require a payment per call, which becomes a significant operating expense (OpEx). Furthermore, customer inputs are often very similar over time, hence SMEs end up prompting LLMs with very similar instances. We propose a framework that allows reducing the calls to LLMs by caching previous LLM responses and using them to train a local inexpensive model on the SME side. The framework includes criteria for deciding when to trust the local model or call the LLM, and a methodology to tune the criteria and measure the tradeoff between performance and cost. For experimental purposes, we instantiate our framework with two LLMs, GPT-3.5 or GPT-4, and two inexpensive students, a k-NN classifier or a Multi-Layer Perceptron, using two common business tasks, intent recognition and sentiment analysis. Experimental results indicate that significant OpEx savings can be obtained with only slightly lower performance.
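The caching idea in the abstract above can be sketched as a small teacher-student loop. The sketch assumes inputs are already embedded as numeric vectors, uses a k-NN vote over cached LLM answers as the student, and trusts the student only when at least one cached example lies within a distance threshold. The class name, the `call_llm` callback, and the single distance criterion are illustrative assumptions; the paper's actual decision criteria and tuning procedure are not reproduced here.

```python
import math
from collections import Counter


class CachedStudent:
    """k-NN student backed by a cache of previous LLM answers (illustrative)."""

    def __init__(self, call_llm, k=3, max_distance=0.5):
        self.call_llm = call_llm          # hypothetical callback: vector -> label
        self.k = k
        self.max_distance = max_distance  # assumed trust criterion
        self.cache = []                   # list of (vector, label) pairs
        self.llm_calls = 0

    def _close_labels(self, vector):
        """Labels of the k nearest cached examples that fall within max_distance."""
        scored = sorted((math.dist(vector, cached), label) for cached, label in self.cache)
        return [label for distance, label in scored[: self.k] if distance <= self.max_distance]

    def predict(self, vector):
        close = self._close_labels(vector)
        if close:
            # Trust the local student: majority vote among nearby cached answers.
            return Counter(close).most_common(1)[0][0]
        # Otherwise pay for an LLM call and remember its answer for later inputs.
        label = self.call_llm(vector)
        self.llm_calls += 1
        self.cache.append((tuple(vector), label))
        return label
```

Raising `max_distance` lowers cost because more inputs are answered locally, at the risk of more student errors; this is the performance-cost tradeoff that the paper's methodology is designed to measure.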

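Summary
Prompting LLMs performs impressively in zero- and few-shot settings, so small and medium-sized enterprises (SMEs) that can afford neither large task-specific training datasets nor pretraining their own LLMs are increasingly turning to third-party services. These services currently charge per call, which becomes a significant operating expense (OpEx). Moreover, customer inputs are often very similar over time, so SMEs end up prompting LLMs with very similar instances. The paper proposes a framework that reduces the number of LLM calls by caching previous LLM responses and using them to train a local inexpensive model on the SME side. The framework includes criteria for deciding when to trust the local model or call the LLM, and a methodology to tune the criteria and measure the tradeoff between performance and cost. For the experiments, the framework is instantiated with two LLMs (GPT-3.5 or GPT-4) and two inexpensive students (a k-NN classifier or a Multi-Layer Perceptron) on two common business tasks, intent recognition and sentiment analysis. Results indicate that significant OpEx savings can be obtained with only slightly lower performance.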

Tuna: Instruction Tuning using Feedback from Large Language Models

paper_url: http://arxiv.org/abs/2310.13385
repo_url: https://github.com/microsoft/lmops
paper_authors: Haoran Li, Yiran Liu, Xingxing Zhang, Wei Lu, Furu Wei
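for: This paper proposes further finetuning an LLM that was instruction-tuned on direct outputs of stronger LLMs, so that its behavior aligns better with human preferences.
methods: The paper introduces two new approaches, probabilistic ranking and contextual ranking, to increase the likelihood of generating better responses.
results: The resulting model, Tuna, consistently improves performance on Super Natural Instructions (119 test tasks), LMentry (25 test tasks), and Vicuna QA, and can outperform several strong reinforcement learning baselines.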

Abstract
Instruction tuning of open-source large language models (LLMs) like LLaMA, using direct outputs from more powerful LLMs such as Instruct-GPT and GPT-4, has proven to be a cost-effective way to align model behaviors with human preferences. However, the instruction-tuned model has only seen one response per instruction, lacking the knowledge of potentially better responses. In this paper, we propose finetuning an instruction-tuned LLM using our novel probabilistic ranking and contextual ranking approaches to increase the likelihood of generating better responses. Probabilistic ranking enables the instruction-tuned model to inherit the relative rankings of high-quality and low-quality responses from the teacher LLM. On the other hand, learning with contextual ranking allows the model to refine its own response distribution using the contextual understanding ability of stronger LLMs. Furthermore, we apply probabilistic ranking and contextual ranking sequentially to the instruction-tuned LLM. The resulting model, which we call Tuna, consistently improves the performance on Super Natural Instructions (119 test tasks), LMentry (25 test tasks), Vicuna QA, and can even obtain better results than several strong reinforcement learning baselines. Our code and data are available at https://github.com/microsoft/LMOps.
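One way to picture learning from a teacher's relative rankings is a pairwise margin loss over the student's scores for several candidate responses. The sketch below is a generic ranking objective under that assumption: `student_scores` stands for the student's (for example length-normalized) log-likelihoods, and `teacher_ranking` lists response indices from best to worst. The function name and the hinge formulation are illustrative and are not claimed to be the exact loss used for Tuna.

```python
def pairwise_ranking_loss(student_scores, teacher_ranking, margin=0.0):
    """Sum of hinge penalties for every pair the student orders against the teacher.

    student_scores: score the student assigns to each candidate response.
    teacher_ranking: candidate indices ordered from best to worst by the teacher.
    """
    loss = 0.0
    for position, better in enumerate(teacher_ranking):
        for worse in teacher_ranking[position + 1:]:
            # Penalize when the worse response scores higher than the better one.
            loss += max(0.0, student_scores[worse] - student_scores[better] + margin)
    return loss
```

The loss is zero when the student already scores responses in the teacher's order, and it grows with both the number and the size of the ordering violations.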

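Summary
Instruction tuning of open-source large language models (LLMs) such as LLaMA, using direct outputs from more powerful LLMs such as Instruct-GPT and GPT-4, has proven to be a cost-effective way to align model behavior with human preferences. However, the instruction-tuned model sees only one response per instruction and lacks knowledge of potentially better responses. The paper proposes finetuning an instruction-tuned LLM with two new approaches, probabilistic ranking and contextual ranking, to increase the likelihood of generating better responses. Probabilistic ranking lets the instruction-tuned model inherit the teacher LLM's relative rankings of high-quality and low-quality responses, while contextual ranking lets the model refine its own response distribution using the contextual understanding of stronger LLMs. The two are applied sequentially. The resulting model, Tuna, consistently improves performance on Super Natural Instructions (119 test tasks), LMentry (25 test tasks), and Vicuna QA, and can even outperform several strong reinforcement learning baselines. Code and data are available at https://github.com/microsoft/LMOps.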

APP: Adaptive Prototypical Pseudo-Labeling for Few-shot OOD Detection

paper_url: http://arxiv.org/abs/2310.13380
repo_url: None
paper_authors: Pei Wang, Keqing He, Yutao Mou, Xiaoshuai Song, Yanan Wu, Jingang Wang, Yunsen Xian, Xunliang Cai, Weiran Xu
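for: This paper proposes a method for out-of-domain (OOD) intent detection when only a few labeled in-domain (IND) intents are available.
methods: The paper proposes adaptive prototypical pseudo-labeling (APP), which includes a prototypical OOD detection framework (ProtoOOD) for low-resource OOD detection with limited IND data and an adaptive pseudo-labeling method that produces high-quality pseudo OOD and IND labels.
results: Experiments and analysis demonstrate the effectiveness of the method for few-shot OOD detection.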

Abstract
Detecting out-of-domain (OOD) intents from user queries is essential for a task-oriented dialogue system. Previous OOD detection studies generally work on the assumption that plenty of labeled IND intents exist. In this paper, we focus on a more practical few-shot OOD setting where there are only a few labeled IND data and massive unlabeled mixed data that may belong to IND or OOD. The new scenario carries two key challenges: learning discriminative representations using limited IND data and leveraging unlabeled mixed data. Therefore, we propose an adaptive prototypical pseudo-labeling (APP) method for few-shot OOD detection, including a prototypical OOD detection framework (ProtoOOD) to facilitate low-resource OOD detection using limited IND data, and an adaptive pseudo-labeling method to produce high-quality pseudo OOD & IND labels. Extensive experiments and analysis demonstrate the effectiveness of our method for few-shot OOD detection.
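A prototype-based OOD detector of the kind the abstract describes can be sketched as follows. The sketch assumes utterances are already embedded as vectors, takes each intent's prototype to be the mean of its few labeled examples, and flags a query as OOD when it is farther than a fixed threshold from every prototype. The function names, the Euclidean distance, and the fixed threshold are illustrative assumptions; the adaptive pseudo-labeling component of APP is not shown.

```python
import math


def build_prototypes(labeled_examples):
    """Average the embeddings of each in-domain intent into one prototype vector.

    labeled_examples: iterable of (vector, intent) pairs (hypothetical layout).
    """
    grouped = {}
    for vector, intent in labeled_examples:
        grouped.setdefault(intent, []).append(vector)
    return {
        intent: tuple(sum(dim) / len(vectors) for dim in zip(*vectors))
        for intent, vectors in grouped.items()
    }


def classify(vector, prototypes, ood_threshold):
    """Return the nearest intent, or "OOD" if no prototype is close enough."""
    intent, distance = min(
        ((name, math.dist(vector, proto)) for name, proto in prototypes.items()),
        key=lambda pair: pair[1],
    )
    return intent if distance <= ood_threshold else "OOD"
```

With only a handful of labeled examples per intent, prototypes give a stable class representation, and the same distance can serve as a confidence signal when assigning pseudo labels to unlabeled data.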

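Summary
Detecting out-of-domain (OOD) intents in user queries is essential for a task-oriented dialogue system. Previous OOD detection studies generally assume that plenty of labeled in-domain (IND) intents are available. This paper focuses on a more practical few-shot OOD setting, with only a few labeled IND examples and a large amount of unlabeled mixed data that may be IND or OOD. The setting poses two key challenges: learning discriminative representations from limited IND data, and leveraging the unlabeled mixed data. The authors therefore propose an adaptive prototypical pseudo-labeling (APP) method, which includes a prototypical OOD detection framework (ProtoOOD) for low-resource OOD detection with limited IND data and an adaptive pseudo-labeling method that produces high-quality pseudo OOD and IND labels. Extensive experiments and analysis demonstrate the method's effectiveness for few-shot OOD detection.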

Analyzing Cognitive Plausibility of Subword Tokenization

paper_url: http://arxiv.org/abs/2310.13348
repo_url: https://github.com/clap-lab/cogtok
paper_authors: Lisa Beinborn, Yuval Pinter
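for: This study evaluates the cognitive plausibility of subword tokenization algorithms across languages.
methods: The study introduces a new evaluation paradigm that analyzes how tokenizer output correlates with human response time and accuracy on a lexical decision task.
results: Across several languages and vocabulary sizes, the UnigramLM algorithm yields less cognitively plausible tokenization behavior and worse coverage of derivational morphemes, in contrast with prior work.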

Abstract
Subword tokenization has become the de-facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm on the performance in downstream tasks, or on engineering criteria such as the compression rate. We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization. We analyze the correlation of the tokenizer output with the response time and accuracy of human performance on a lexical decision task. We compare three tokenization algorithms across several languages and vocabulary sizes. Our results indicate that the UnigramLM algorithm yields less cognitively plausible tokenization behavior and a worse coverage of derivational morphemes, in contrast with prior work.
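The correlation analysis mentioned in the abstract above can be sketched with a plain Pearson coefficient between a tokenizer-derived quantity and a human measure. The sketch assumes the tokenizer-derived quantity is the number of subword pieces per word, which is an illustrative choice and not necessarily the measure used in the paper; `tokenize` is a hypothetical callable that maps a word to a list of pieces.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


def pieces_per_word(tokenize, words):
    """Number of subword pieces the tokenizer produces for each word."""
    return [len(tokenize(word)) for word in words]
```

Given a list of stimulus words and the mean response time recorded for each, `pearson(pieces_per_word(tokenize, words), response_times)` yields one number per tokenizer that can be compared across algorithms, languages, and vocabulary sizes.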

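Summary
Subword tokenization has become the de facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluations focus on the effect of a tokenization algorithm on downstream task performance, or on engineering criteria such as the compression rate. The paper presents a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization, analyzing how tokenizer output correlates with the response time and accuracy of humans on a lexical decision task. Three tokenization algorithms are compared across several languages and vocabulary sizes. The results indicate that the UnigramLM algorithm yields less cognitively plausible tokenization behavior and worse coverage of derivational morphemes, in contrast with prior work.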

Large-Scale and Multi-Perspective Opinion Summarization with Diverse Review Subsets

paper_url: http://arxiv.org/abs/2310.13340
repo_url: None
paper_authors: Han Jiang, Rui Wang, Zhihua Wei, Yu Li, Xinpeng Wang
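for: The paper provides a supervised summarization framework for large-scale, multi-perspective opinion summarization.
methods: The framework, SUBSUMM, consists of a review sampling strategy set and a two-stage training scheme. The sampling strategies select review subsets according to sentiment orientation and contrastive information value.
results: Experiments show that SUBSUMM generates high-quality pros, cons, and verdict summaries from hundreds of reviews. In-depth analysis shows that the selection of review subsets and the two-stage training scheme are key to summarization performance.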

Abstract
Opinion summarization is expected to digest larger review sets and provide summaries from different perspectives. However, most existing solutions are deficient in epitomizing extensive reviews and offering opinion summaries from various angles due to the lack of designs for information selection. To this end, we propose SUBSUMM, a supervised summarization framework for large-scale multi-perspective opinion summarization. SUBSUMM consists of a review sampling strategy set and a two-stage training scheme. The sampling strategies take sentiment orientation and contrastive information value into consideration, with which the review subsets from different perspectives and quality levels can be selected. Subsequently, the summarizer is encouraged to learn from the sub-optimal and optimal subsets successively in order to capitalize on the massive input. Experimental results on AmaSum and Rotten Tomatoes datasets demonstrate that SUBSUMM is adept at generating pros, cons, and verdict summaries from hundreds of input reviews. Furthermore, our in-depth analysis verifies that the advanced selection of review subsets and the two-stage training scheme are vital to boosting the summarization performance.

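Summary
Opinion summarization is expected to digest larger review sets and provide summaries from different perspectives. Most existing solutions, however, lack designs for information selection and therefore fall short in condensing extensive reviews and offering opinion summaries from various angles. The paper proposes SUBSUMM, a supervised summarization framework for large-scale multi-perspective opinion summarization, consisting of a review sampling strategy set and a two-stage training scheme. The sampling strategies take sentiment orientation and contrastive information value into account, so that review subsets from different perspectives and quality levels can be selected. The summarizer then learns from the sub-optimal and optimal subsets in succession to capitalize on the massive input. Experiments on the AmaSum and Rotten Tomatoes datasets show that SUBSUMM generates pros, cons, and verdict summaries from hundreds of input reviews. In-depth analysis confirms that the advanced selection of review subsets and the two-stage training scheme are vital to summarization performance.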

Beyond Hard Samples: Robust and Effective Grammatical Error Correction with Cycle Self-Augmenting

paper_url: http://arxiv.org/abs/2310.13321
repo_url: https://github.com/zetangforward/csa-gec
paper_authors: Zecheng Tang, Kaifeng Qi, Juntao Li, Min Zhang
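for: This study aims to improve the robustness of grammatical error correction (GEC) models against adversarial attacks.
methods: The study evaluates cutting-edge sequence-to-sequence GEC methods under four types of adversarial attack and proposes a simple yet effective Cycle Self-Augmenting (CSA) method to improve robustness.
results: Experiments show that CSA strengthens the robustness of four baseline models without adding adversarial examples to training. CSA also keeps models from overfitting on easy-to-learn samples, improves generalization to unseen data, and improves performance on clean data.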

Abstract
Recent studies have revealed that grammatical error correction methods in the sequence-to-sequence paradigm are vulnerable to adversarial attack, and simply utilizing adversarial examples in the pre-training or post-training process can significantly enhance the robustness of GEC models to certain types of attack without suffering too much performance loss on clean data. In this paper, we further conduct a thorough robustness evaluation of cutting-edge GEC methods for four different types of adversarial attacks and propose a simple yet very effective Cycle Self-Augmenting (CSA) method accordingly. By leveraging the augmenting data from the GEC models themselves in the post-training process and introducing regularization data for cycle training, our proposed method can effectively improve the model robustness of well-trained GEC models with only a few more training epochs as an extra cost. More concretely, further training on the regularization data can prevent the GEC models from over-fitting on easy-to-learn samples and thus can improve the generalization capability and robustness towards unseen data (adversarial noise/samples). Meanwhile, the self-augmented data can provide more high-quality pseudo pairs to improve model performance on the original testing data. Experiments on four benchmark datasets and seven strong models indicate that our proposed training method can significantly enhance the robustness of four types of attacks without using purposely built adversarial examples in training. Evaluation results on clean data further confirm that our proposed CSA method significantly improves the performance of four baselines and yields nearly comparable results with other state-of-the-art models. Our code is available at https://github.com/ZetangForward/CSA-GEC.

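Summary
Recent studies have revealed that grammatical error correction (GEC) methods in the sequence-to-sequence paradigm are vulnerable to adversarial attack, and that using adversarial examples during pre-training or post-training can significantly improve robustness to certain types of attack without much performance loss on clean data. This paper conducts a thorough robustness evaluation of cutting-edge GEC methods under four types of adversarial attack and proposes a simple yet very effective Cycle Self-Augmenting (CSA) method. By using augmenting data generated by the GEC models themselves during post-training, and introducing regularization data for cycle training, the method improves the robustness of well-trained GEC models at the cost of only a few more training epochs. More concretely, further training on the regularization data keeps GEC models from overfitting on easy-to-learn samples, improving generalization and robustness to unseen data (adversarial noise or samples), while the self-augmented data provides more high-quality pseudo pairs that improve performance on the original test data. Experiments on four benchmark datasets and seven strong models indicate that the method significantly improves robustness against four types of attack without purpose-built adversarial examples in training. Results on clean data further confirm that CSA significantly improves the performance of four baselines and is nearly comparable to other state-of-the-art models. Code is available at https://github.com/ZetangForward/CSA-GEC.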

Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models

paper_url: http://arxiv.org/abs/2310.13315
repo_url: None
paper_authors: Miaoxi Zhu, Qihuang Zhong, Li Shen, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao
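for: This paper proposes a zero-shot sharpness-aware quantization (ZSAQ) framework for the zero-shot quantization of various pre-trained language models (PLMs).
methods: The paper formulates quantization as a minimax problem and solves it with the SAM-SGA optimization algorithm, which aims to improve quantization accuracy and model generalization. A convergence rate for the minimax problem is proven.
results: Experiments on 11 tasks show consistent and significant gains on both discriminative and generative PLMs, up to +6.98 average score. The method is also shown empirically to improve model generalization.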

Abstract
Quantization is a promising approach for reducing memory overhead and accelerating inference, especially in large pre-trained language model (PLM) scenarios. The lack of access to original training data, due to security and privacy concerns, has created demand for zero-shot quantization. Most of the cutting-edge zero-shot quantization methods primarily 1) apply to computer vision tasks, and 2) neglect the overfitting problem in the generative adversarial learning process, leading to sub-optimal performance. Motivated by this, we propose a novel zero-shot sharpness-aware quantization (ZSAQ) framework for the zero-shot quantization of various PLMs. The key algorithm in solving ZSAQ is the SAM-SGA optimization, which aims to improve the quantization accuracy and model generalization via optimizing a minimax problem. We theoretically prove the convergence rate for the minimax optimization problem and this result can be applied to other nonconvex-PL minimax optimization frameworks. Extensive experiments on 11 tasks demonstrate that our method brings consistent and significant performance gains on both discriminative and generative PLMs, i.e., up to +6.98 average score. Furthermore, we empirically validate that our method can effectively improve the model generalization.

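Summary
Quantization is a promising approach for reducing memory overhead and accelerating inference, especially for large pre-trained language models (PLMs). Because security and privacy concerns often rule out access to the original training data, demand has emerged for zero-shot quantization. Most cutting-edge zero-shot quantization methods apply mainly to computer vision tasks and neglect the overfitting problem in the generative adversarial learning process, which leads to sub-optimal performance. The paper therefore proposes a zero-shot sharpness-aware quantization (ZSAQ) framework for the zero-shot quantization of various PLMs. Its key algorithm is SAM-SGA optimization, which aims to improve quantization accuracy and model generalization by optimizing a minimax problem. The authors prove the convergence rate for this minimax problem, a result that also applies to other nonconvex-PL minimax optimization frameworks. Experiments on 11 tasks show consistent and significant gains on both discriminative and generative PLMs, up to +6.98 average score, and empirically confirm improved model generalization.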

Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models

paper_url: http://arxiv.org/abs/2310.13312
repo_url: https://github.com/deep-over/film
paper_authors: Jaeyoung Choe, Keonwoong Noh, Nayeon Kim, Seyun Ahn, Woohwan Jung
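for: This paper addresses the weak generalization of financial pretrained language models, which were not pretrained on sufficiently diverse financial data.
methods: The authors collect a broad range of financial corpora and train the Financial Language Model (FiLM) on these diverse datasets.
results: Experiments show that FiLM outperforms both existing financial PLMs and general-domain PLMs, and that the improvement holds even for unseen corpus groups.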

Abstract
Over the past few years, various domain-specific pretrained language models (PLMs) have been proposed and have outperformed general-domain PLMs in specialized areas such as biomedical, scientific, and clinical domains. In addition, financial PLMs have been studied because of the high economic impact of financial data analysis. However, we found that financial PLMs were not pretrained on sufficiently diverse financial data. This lack of diverse training data leads to a subpar generalization performance, resulting in general-purpose PLMs, including BERT, often outperforming financial PLMs on many downstream tasks. To address this issue, we collected a broad range of financial corpora and trained the Financial Language Model (FiLM) on these diverse datasets. Our experimental results confirm that FiLM outperforms not only existing financial PLMs but also general domain PLMs. Furthermore, we provide empirical evidence that this improvement can be achieved even for unseen corpus groups.

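Summary
Over the past few years, various domain-specific pretrained language models (PLMs) have been proposed and have outperformed general-domain PLMs in specialized areas such as the biomedical, scientific, and clinical domains. Financial PLMs have also been studied because of the high economic impact of financial data analysis. However, the authors find that financial PLMs were not pretrained on sufficiently diverse financial data. This leads to subpar generalization, so that general-purpose PLMs, including BERT, often outperform financial PLMs on many downstream tasks. To address this, the authors collect a broad range of financial corpora and train the Financial Language Model (FiLM) on these diverse datasets. Experiments confirm that FiLM outperforms both existing financial PLMs and general-domain PLMs, and provide empirical evidence that the improvement holds even for unseen corpus groups.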

Test-Time Self-Adaptive Small Language Models for Question Answering

paper_url: http://arxiv.org/abs/2310.13307
repo_url: https://github.com/starsuzi/T-SAS
paper_authors: Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, Jong C. Park
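for: This study investigates whether smaller self-adaptive language models (LMs) can adapt to question answering (QA) tasks using only unlabeled test data.
methods: The self-adaptation strategy stochastically generates multiple answers and ensembles them, filtering out low-quality samples to mitigate noise from inaccurate labels.
results: The strategy yields significant performance improvements for smaller LMs on benchmark QA datasets, with higher robustness across diverse prompts.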

Abstract
Recent instruction-finetuned large language models (LMs) have achieved notable performances in various tasks, such as question-answering (QA). However, despite their ability to memorize a vast amount of general knowledge across diverse tasks, they might be suboptimal on specific tasks due to their limited capacity to transfer and adapt knowledge to target tasks. Moreover, further finetuning LMs with labeled datasets is often infeasible due to their absence, but it is also questionable if we can transfer smaller LMs having limited knowledge only with unlabeled test data. In this work, we show and investigate the capabilities of smaller self-adaptive LMs, only with unlabeled test data. In particular, we first stochastically generate multiple answers, and then ensemble them while filtering out low-quality samples to mitigate noise from inaccurate labels. Our proposed self-adaption strategy demonstrates significant performance improvements on benchmark QA datasets with higher robustness across diverse prompts, enabling LMs to stay stable. Code is available at: https://github.com/starsuzi/T-SAS.
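The generate, ensemble, and filter steps described in the abstract above can be sketched as a majority vote over sampled answers with an agreement threshold. The sketch assumes answers are short strings compared after simple normalization; the function name, the normalization, and the `min_agreement` threshold are illustrative assumptions, not the authors' exact procedure.

```python
from collections import Counter


def self_adaptive_label(sampled_answers, min_agreement=0.5):
    """Majority-vote pseudo label for one test question, or None if filtered out.

    sampled_answers: answers produced by stochastic decoding for the same question.
    Returns (answer, agreement); answer is None when agreement is below the threshold.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answer.strip().lower() for answer in sampled_answers)
    answer, votes = counts.most_common(1)[0]
    agreement = votes / len(sampled_answers)
    if agreement < min_agreement:
        # Low agreement suggests an unreliable pseudo label, so drop the sample.
        return None, agreement
    return answer, agreement
```

Questions that survive the filter provide pseudo labels that can be used to adapt the model at test time, while questions with scattered answers are skipped so that noisy labels do not dominate.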

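Summary
Recent instruction-finetuned large language models (LMs) have achieved notable performance on various tasks such as question answering (QA). However, although they memorize a vast amount of general knowledge, they may be suboptimal on specific tasks because of their limited capacity to transfer and adapt knowledge to the target task. Further finetuning with labeled datasets is often infeasible because such datasets are unavailable, and it is unclear whether smaller LMs with limited knowledge can be adapted using only unlabeled test data. This work investigates the capabilities of smaller self-adaptive LMs with unlabeled test data only. The method first stochastically generates multiple answers, then ensembles them while filtering out low-quality samples to mitigate noise from inaccurate labels. The proposed self-adaptation strategy yields significant performance improvements on benchmark QA datasets, with higher robustness across diverse prompts. Code is available at https://github.com/starsuzi/T-SAS.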

Interpreting Indirect Answers to Yes-No Questions in Multiple Languages

paper_url: http://arxiv.org/abs/2310.13290
repo_url: https://github.com/wang-zijie/yn-question-multilingual
paper_authors: Zijie Wang, Md Mosharaf Hossain, Shivam Mathur, Terry Cruz Melo, Kadir Bulut Ozler, Keun Hee Park, Jacob Quintero, MohammadHossein Rezaei, Shreya Nupur Shakya, Md Nayem Uddin, Eduardo Blanco
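for: This paper addresses the interpretation of indirect answers to yes-no questions, that is, answers that do not contain polar keywords such as yes or no.
methods: The paper uses distant supervision to collect training data and shows that direct answers (those containing polar keywords) help train models to interpret indirect answers.
results: Experiments show that monolingual fine-tuning helps when training data can be obtained via distant supervision for the language of interest (5 languages), and that cross-lingual fine-tuning always helps (8 languages).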

Abstract
Yes-no questions expect a yes or no for an answer, but people often skip polar keywords. Instead, they answer with long explanations that must be interpreted. In this paper, we focus on this challenging problem and release new benchmarks in eight languages. We present a distant supervision approach to collect training data. We also demonstrate that direct answers (i.e., with polar keywords) are useful to train models to interpret indirect answers (i.e., without polar keywords). Experimental results demonstrate that monolingual fine-tuning is beneficial if training data can be obtained via distant supervision for the language of interest (5 languages). Additionally, we show that cross-lingual fine-tuning is always beneficial (8 languages).

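Summary
Yes-no questions expect a simple yes or no, but people often skip polar keywords and answer with longer explanations that must be interpreted. This paper focuses on that problem and releases new benchmarks in eight languages. We use distant supervision to collect training data and show that direct answers (with polar keywords) are useful for training models to interpret indirect answers (without polar keywords). Experiments show that monolingual fine-tuning helps when distant supervision can supply training data for the language of interest (5 languages), and that cross-lingual fine-tuning always helps (8 languages).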
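One way to picture distant supervision with polar keywords is a rule that labels an answer automatically when a keyword appears near its start. The sketch below is an assumption-laden illustration, not the paper's pipeline: the keyword lists, the three-token window, and the function name `distant_label` are all hypothetical, and a real system would need per-language keyword lists.

```python
import re

# Hypothetical English polar keyword lists, for illustration only.
YES_WORDS = {"yes", "yeah", "yep"}
NO_WORDS = {"no", "nope", "nah"}


def distant_label(answer, window=3):
    """Return "yes" or "no" if a polar keyword opens the answer, else None.

    Answers that get a label are direct answers and can be used as
    automatically labeled training data; None marks an indirect answer
    that a trained model would have to interpret.
    """
    tokens = re.findall(r"[a-z']+", answer.lower())
    head = tokens[:window]
    has_yes = any(t in YES_WORDS for t in head)
    has_no = any(t in NO_WORDS for t in head)
    if has_yes and not has_no:
        return "yes"
    if has_no and not has_yes:
        return "no"
    return None


print(distant_label("Yes, I went there last year."))  # yes
print(distant_label("Nope, too expensive."))          # no
print(distant_label("I had to work all weekend."))    # None
```

The rule is deliberately conservative: an answer that contains both kinds of keyword in the window is left unlabeled instead of being guessed.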

SALMONN: Towards Generic Hearing Abilities for Large Language Models

paper_url: http://arxiv.org/abs/2310.13289
repo_url: https://github.com/bytedance/salmonn
paper_authors: Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang
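for: This paper proposes SALMONN, a multimodal model that directly processes and understands general audio input by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders.
methods: The pre-trained text-based LLM is combined with speech and audio encoders into a single multimodal model, and a few-shot activation tuning approach is proposed to activate its cross-modal emergent abilities.
results: SALMONN achieves competitive performance on a number of speech and audio tasks and shows emergent cross-modal abilities not seen in training, such as speech translation to untrained languages, speech-based slot filling, spoken-query-based question answering, and audio-based storytelling.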

Abstract
Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world, which refers to the perception and understanding of general auditory information consisting of at least three types of sounds: speech, audio events, and music. In this paper, we propose SALMONN, a speech audio language music open neural network, built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model. SALMONN enables the LLM to directly process and understand general audio inputs and achieve competitive performances on a number of speech and audio tasks used in training, such as automatic speech recognition and translation, auditory-information-based question answering, emotion recognition, speaker verification, and music and audio captioning, etc. SALMONN also has a diverse set of emergent abilities unseen in the training, which includes but is not limited to speech translation to untrained languages, speech-based slot filling, spoken-query-based question answering, audio-based storytelling, and speech audio co-reasoning, etc. The presence of the cross-modal emergent abilities is studied, and a novel few-shot activation tuning approach is proposed to activate such abilities of SALMONN. To our knowledge, SALMONN is the first model of its type and can be regarded as a step towards AI with generic hearing abilities. An interactive demo of SALMONN is available at https://github.com/bytedance/SALMONN, and the training code and model checkpoints will be released upon acceptance.

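Summary
Hearing is an important ability for artificial intelligence (AI) agents in the physical world. It refers to perceiving and understanding general auditory information, which includes at least three types of sound: speech, audio events, and music. This paper proposes SALMONN, a speech audio language music open neural network, built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders. SALMONN lets the LLM directly process and understand general audio input and achieves competitive performance on the speech and audio tasks used in training, such as automatic speech recognition and translation, auditory-information-based question answering, emotion recognition, speaker verification, and music and audio captioning. SALMONN also shows emergent abilities not seen in training, including speech translation to untrained languages, speech-based slot filling, spoken-query-based question answering, audio-based storytelling, and speech audio co-reasoning. We study these cross-modal emergent abilities and propose a few-shot activation tuning approach to activate them. To our knowledge, SALMONN is the first model of its type and can be regarded as a step towards AI with generic hearing abilities. An interactive demo is available at https://github.com/bytedance/SALMONN, and the training code and model checkpoints will be released upon acceptance.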

InvGC: Robust Cross-Modal Retrieval by Inverse Graph Convolution

paper_url: http://arxiv.org/abs/2310.13276
repo_url: https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval
paper_authors: Xiangru Jian, Yimu Wang
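for: Address the representation degeneration problem in cross-modal retrieval and improve retrieval performance.
methods: Introduces InvGC, a post-processing technique inspired by graph convolution and average pooling, and LocalAdj, an advanced graph topology that improves the efficiency and effectiveness of InvGC.
results: Experiments across multiple cross-modal benchmarks and methods show that InvGC and InvGC w/LocalAdj significantly mitigate the representation degeneration problem and improve retrieval performance.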

Abstract
Over recent decades, significant advancements in cross-modal retrieval are mainly driven by breakthroughs in visual and linguistic modeling. However, a recent study shows that multi-modal data representations tend to cluster within a limited convex cone (as representation degeneration problem), which hinders retrieval performance due to the inseparability of these representations. In our study, we first empirically validate the presence of the representation degeneration problem across multiple cross-modal benchmarks and methods. Next, to address it, we introduce a novel method, called InvGC, a post-processing technique inspired by graph convolution and average pooling. Specifically, InvGC defines the graph topology within the datasets and then applies graph convolution in a subtractive manner. This method effectively separates representations by increasing the distances between data points. To improve the efficiency and effectiveness of InvGC, we propose an advanced graph topology, LocalAdj, which only aims to increase the distances between each data point and its nearest neighbors. To understand why InvGC works, we present a detailed theoretical analysis, proving that the lower bound of recall will be improved after deploying InvGC. Extensive empirical results show that InvGC and InvGC w/LocalAdj significantly mitigate the representation degeneration problem, thereby enhancing retrieval performance. Our code is available at https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval

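Summary
In recent decades, progress in cross-modal retrieval has been driven mainly by breakthroughs in visual and linguistic modeling. However, a recent study shows that multi-modal data representations tend to cluster within a limited convex cone (the representation degeneration problem), which hurts retrieval performance because the representations become inseparable. We first empirically confirm that the problem is present across multiple cross-modal benchmarks and methods. We then introduce InvGC, a post-processing technique inspired by graph convolution and average pooling. InvGC defines a graph topology within the datasets and applies graph convolution in a subtractive manner, which separates representations by increasing the distances between data points. To improve the efficiency and effectiveness of InvGC, we propose an advanced graph topology, LocalAdj, which only increases the distance between each data point and its nearest neighbors. To explain why InvGC works, we give a detailed theoretical analysis proving that the lower bound of recall improves after InvGC is deployed. Our code is available at https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval.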
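The idea of applying graph convolution "in a subtractive manner" can be illustrated with a small sketch. This is a simplified reading of the abstract, not the released implementation: it assumes a single set of embeddings, cosine similarity as edge weight, and an update of the form x_i' = x_i - r * sum_j S_ij x_j. The parameter names `r` and `k` are hypothetical, and setting `k` loosely mimics LocalAdj by keeping only the nearest neighbors.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(u, v):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(a * b for a, b in zip(u, v))


def inv_gc(embeddings, r=0.1, k=None):
    """Subtract a similarity-weighted sum of neighbors from each embedding.

    r: step size of the subtractive update (assumed hyperparameter).
    k: if set, only the k most similar neighbors are used (LocalAdj-like).
    """
    xs = [normalize(v) for v in embeddings]
    out = []
    for i, xi in enumerate(xs):
        sims = [(cosine(xi, xj), j) for j, xj in enumerate(xs) if j != i]
        if k is not None:
            sims = sorted(sims, reverse=True)[:k]
        updated = list(xi)
        for s, j in sims:
            for d in range(len(updated)):
                updated[d] -= r * s * xs[j][d]
        out.append(normalize(updated))
    return out


a, b = [1.0, 0.0], [1.0, 0.2]
before = cosine(normalize(a), normalize(b))
after_a, after_b = inv_gc([a, b], r=0.1)
after = cosine(after_a, after_b)
print(before > after)  # True: the two points are pushed apart
```

The method described in the abstract builds the graph within the datasets, and how queries and gallery items are paired is not specified there, so the single-set version above should be read only as an intuition for why subtraction increases separation.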

On the Language Encoder of Contrastive Cross-modal Models

paper_url: http://arxiv.org/abs/2310.13267
repo_url: None
paper_authors: Mengjie Zhao, Junya Ono, Zhi Zhong, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Takashi Shibuya, Hiromi Wakaki, Yuki Mitsufuji
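for: This paper studies contrastive cross-modal models such as CLIP and CLAP on vision-language (VL) and audio-language (AL) tasks, focusing on the quality of their language encoder and how to improve it.
methods: The paper applies unsupervised and supervised sentence embedding training and evaluates its effect on language encoder quality and cross-modal task performance.
results: In VL pretraining, sentence embedding training improves language encoder quality and cross-modal task performance, for example with CyCLIP. In AL pretraining the benefit is smaller, which may result from the limited amount of pretraining data. Analysis of the representation spaces shows that sentence embedding training improves text-space uniformity but decreases cross-modal alignment.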

Abstract
Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component of encoding natural language descriptions of image/audio into vector representations. We extensively evaluate how unsupervised and supervised sentence embedding training affect language encoder quality and cross-modal task performance. In VL pretraining, we found that sentence embedding training enhances language encoder quality and aids in cross-modal tasks, improving contrastive VL models such as CyCLIP. In contrast, AL pretraining benefits less from sentence embedding training, which may result from the limited amount of pretraining data. We analyze the representation spaces to understand the strengths of sentence embedding training, and find that it improves text-space uniformity, at the cost of decreased cross-modal alignment.

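Summary
Contrastive cross-modal models such as CLIP and CLAP support a range of vision-language (VL) and audio-language (AL) tasks. However, their language encoder has received limited study and improvement. We extensively evaluate how unsupervised and supervised sentence embedding training affects language encoder quality and cross-modal task performance. In VL pretraining, sentence embedding training improves language encoder quality and helps contrastive VL models such as CyCLIP. In AL pretraining, it helps less, possibly because of the limited amount of pretraining data. Analyzing the representation spaces, we find that sentence embedding training improves text-space uniformity at the cost of decreased cross-modal alignment.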
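The abstract contrasts text-space uniformity with cross-modal alignment without saying how either is measured. A common choice, assumed here and not confirmed by the source, is the pair of metrics from Wang and Isola (2020): alignment as the mean squared distance between paired embeddings, and uniformity as the log of the mean Gaussian potential over all pairs. The sketch below computes both on toy unit vectors.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def alignment(text_vecs, other_vecs):
    """Mean squared distance between paired embeddings (lower is better)."""
    pairs = list(zip(text_vecs, other_vecs))
    return sum(sq_dist(t, o) for t, o in pairs) / len(pairs)


def uniformity(vecs, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs.

    Lower values mean the embeddings are spread more evenly.
    t=2.0 is the customary temperature, assumed here.
    """
    n = len(vecs)
    potentials = [
        math.exp(-t * sq_dist(vecs[i], vecs[j]))
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return math.log(sum(potentials) / len(potentials))


clustered = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(uniformity(spread) < uniformity(clustered))  # True
print(alignment([[1.0, 0.0]], [[0.0, 1.0]]))       # 2.0
```

Under these definitions, the trade-off reported in the abstract would show up as a lower uniformity value on text embeddings together with a higher alignment value between text embeddings and their paired image or audio embeddings.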

MoqaGPT : Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model

paper_url: http://arxiv.org/abs/2310.13265
repo_url: https://github.com/lezhang7/moqagpt
paper_authors: Le Zhang, Yihong Wu, Fengran Mo, Jian-Yun Nie, Aishwarya Agrawal
for: This paper targets multi-modal open-domain question answering and aims to improve the performance of large language models (LLMs) on this task.
methods: The paper proposes MoqaGPT, a straightforward and flexible framework that uses a divide-and-conquer strategy to retrieve and extract answers from each modality separately, then fuses this multi-modal information with LLMs to produce a final answer.
results: On the MMCoQA dataset, MoqaGPT improves F1 by 37.91 points and EM by 34.07 points over the supervised baseline. On the MultiModalQA dataset, it surpasses the zero-shot baseline by 9.5 F1 points and 10.1 EM points and significantly closes the gap with supervised methods.

Abstract
Multi-modal open-domain question answering typically requires evidence retrieval from databases across diverse modalities, such as images, tables, passages, etc. Even Large Language Models (LLMs) like GPT-4 fall short in this task. To enable LLMs to tackle the task in a zero-shot manner, we introduce MoqaGPT, a straightforward and flexible framework. Using a divide-and-conquer strategy that bypasses intricate multi-modality ranking, our framework can accommodate new modalities and seamlessly transition to new models for the task. Built upon LLMs, MoqaGPT retrieves and extracts answers from each modality separately, then fuses this multi-modal information using LLMs to produce a final answer. Our methodology boosts performance on the MMCoQA dataset, improving F1 by +37.91 points and EM by +34.07 points over the supervised baseline. On the MultiModalQA dataset, MoqaGPT surpasses the zero-shot baseline, improving F1 by 9.5 points and EM by 10.1 points, and significantly closes the gap with supervised methods. Our codebase is available at https://github.com/lezhang7/MOQAGPT.

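Summary
Multi-modal open-domain question answering typically requires retrieving evidence from databases across diverse modalities, such as images, tables, and passages. Even large language models (LLMs) like GPT-4 fall short on this task. To let LLMs tackle it in a zero-shot manner, we introduce MoqaGPT, a straightforward and flexible framework. It uses a divide-and-conquer strategy that bypasses intricate multi-modality ranking, so it can accommodate new modalities and switch to new models for the task. Built on LLMs, MoqaGPT retrieves and extracts answers from each modality separately, then fuses this multi-modal information with LLMs to produce a final answer. On the MMCoQA dataset, it improves F1 by 37.91 points and EM by 34.07 points over the supervised baseline. On the MultiModalQA dataset, it surpasses the zero-shot baseline by 9.5 F1 points and 10.1 EM points and significantly closes the gap with supervised methods. Our code is available at https://github.com/lezhang7/MOQAGPT.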

A Quality-based Syntactic Template Retriever for Syntactically-controlled Paraphrase Generation

paper_url: http://arxiv.org/abs/2310.13262
repo_url: https://github.com/xzhang00/qstr
paper_authors: Xue Zhang, Songming Zhang, Yunlong Liang, Yufeng Chen, Jian Liu, Wenjuan Han, Jinan Xu
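for: Improve the quality of paraphrase generation when human-annotated or well-chosen syntactic templates are unavailable.
methods: Proposes a Quality-based Syntactic Template Retriever (QSTR) that retrieves templates based on the quality of the paraphrases to be generated, and a Diverse Templates Search (DTS) algorithm that increases diversity when multiple paraphrases are needed.
results: QSTR significantly surpasses existing retrieval methods in generating high-quality paraphrases and performs comparably with human-annotated templates on reference-free metrics. Human evaluation and downstream tasks that use the generated paraphrases for data augmentation also show the potential of QSTR and DTS.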

Abstract
Existing syntactically-controlled paraphrase generation (SPG) models perform promisingly with human-annotated or well-chosen syntactic templates. However, the difficulty of obtaining such templates actually hinders the practical application of SPG models. For one thing, the prohibitive cost makes it unfeasible to manually design decent templates for every source sentence. For another, the templates automatically retrieved by current heuristic methods are usually unreliable for SPG models to generate qualified paraphrases. To escape this dilemma, we propose a novel Quality-based Syntactic Template Retriever (QSTR) to retrieve templates based on the quality of the to-be-generated paraphrases. Furthermore, for situations requiring multiple paraphrases for each source sentence, we design a Diverse Templates Search (DTS) algorithm, which can enhance the diversity between paraphrases without sacrificing quality. Experiments demonstrate that QSTR can significantly surpass existing retrieval methods in generating high-quality paraphrases and even perform comparably with human-annotated templates in terms of reference-free metrics. Additionally, human evaluation and the performance on downstream tasks using our generated paraphrases for data augmentation showcase the potential of our QSTR and DTS algorithm in practical scenarios.

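Summary
Existing syntactically-controlled paraphrase generation (SPG) models perform well with human-annotated or well-chosen syntactic templates. However, the difficulty of obtaining such templates limits the practical use of SPG models. On one hand, the cost of manually designing decent templates for every source sentence is prohibitive. On the other, templates retrieved automatically by current heuristic methods are usually unreliable, so SPG models cannot generate qualified paraphrases from them. To resolve this dilemma, we propose a Quality-based Syntactic Template Retriever (QSTR) that retrieves templates based on the quality of the paraphrases to be generated. For cases that need multiple paraphrases per source sentence, we design a Diverse Templates Search (DTS) algorithm that increases diversity between paraphrases without sacrificing quality. Experiments show that QSTR significantly surpasses existing retrieval methods in generating high-quality paraphrases and performs comparably with human-annotated templates on reference-free metrics. Human evaluation and downstream-task results that use the generated paraphrases for data augmentation also show the potential of QSTR and DTS in practical scenarios.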

Anomaly Detection of Command Shell Sessions based on DistilBERT: Unsupervised and Supervised Approaches

paper_url: http://arxiv.org/abs/2310.13247
repo_url: None
paper_authors: Zefang Liu, John Buford
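for: Detecting anomalous behavior in Unix shell sessions, a critical task in computer security.
methods: A pretrained DistilBERT model is combined with unsupervised and supervised learning techniques to identify anomalous activity in Unix shell sessions while minimizing data labeling.
results: Experiments on a large-scale enterprise dataset show that the approach is effective at detecting anomalous behavior in Unix shell sessions.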

Abstract
Anomaly detection in command shell sessions is a critical aspect of computer security. Recent advances in deep learning and natural language processing, particularly transformer-based models, have shown great promise for addressing complex security challenges. In this paper, we implement a comprehensive approach to detect anomalies in Unix shell sessions using a pretrained DistilBERT model, leveraging both unsupervised and supervised learning techniques to identify anomalous activity while minimizing data labeling. The unsupervised method captures the underlying structure and syntax of Unix shell commands, enabling the detection of session deviations from normal behavior. Experiments on a large-scale enterprise dataset collected from production systems demonstrate the effectiveness of our approach in detecting anomalous behavior in Unix shell sessions. This work highlights the potential of leveraging recent advances in transformers to address important computer security challenges.

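Summary
Anomaly detection in command shell sessions is a critical aspect of computer security. Recent advances in deep learning and natural language processing, particularly transformer-based models, have shown great promise for complex security challenges. In this paper, we implement a comprehensive approach to detecting anomalies in Unix shell sessions with a pretrained DistilBERT model, combining unsupervised and supervised learning techniques to identify anomalous activity while minimizing data labeling. The unsupervised method captures the underlying structure and syntax of Unix shell commands, which makes it possible to detect sessions that deviate from normal behavior. Experiments on a large-scale enterprise dataset demonstrate the effectiveness of the approach in detecting anomalous behavior in Unix shell sessions. This work highlights the potential of recent advances in transformers for computer security challenges.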

Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking

paper_url: http://arxiv.org/abs/2310.13243
repo_url: https://github.com/ielab/llm-qlm
paper_authors: Shengyao Zhuang, Bing Liu, Bevan Koopman, Guido Zuccon
for: This paper focuses on investigating the effectiveness of recent large language models (LLMs) as query likelihood models (QLMs) for zero-shot ranking of documents.
methods: The authors use pre-trained LLMs without fine-tuning and introduce a novel hybrid zero-shot retriever that integrates LLM-based QLMs with a traditional retriever.
results: The authors find that the LLM-based QLMs demonstrate robust zero-shot ranking ability, and the hybrid retriever achieves exceptional effectiveness in both zero-shot and few-shot scenarios.

Abstract
In the field of information retrieval, Query Likelihood Models (QLMs) rank documents based on the probability of generating the query given the content of a document. Recently, advanced large language models (LLMs) have emerged as effective QLMs, showcasing promising ranking capabilities. This paper focuses on investigating the genuine zero-shot ranking effectiveness of recent LLMs, which are solely pre-trained on unstructured text data without supervised instruction fine-tuning. Our findings reveal the robust zero-shot ranking ability of such LLMs, highlighting that additional instruction fine-tuning may hinder effectiveness unless a question generation task is present in the fine-tuning dataset. Furthermore, we introduce a novel state-of-the-art ranking system that integrates LLM-based QLMs with a hybrid zero-shot retriever, demonstrating exceptional effectiveness in both zero-shot and few-shot scenarios. We make our codebase publicly available at https://github.com/ielab/llm-qlm.

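Summary
In information retrieval, Query Likelihood Models (QLMs) rank documents by the probability of generating the query given the content of a document. Recently, advanced large language models (LLMs) have emerged as effective QLMs with promising ranking capabilities. This paper investigates the genuine zero-shot ranking effectiveness of recent LLMs that are pre-trained solely on unstructured text data, without supervised instruction fine-tuning. Our findings show that such LLMs have robust zero-shot ranking ability, and that additional instruction fine-tuning may hinder effectiveness unless a question generation task is present in the fine-tuning dataset. We also introduce a state-of-the-art ranking system that integrates LLM-based QLMs with a hybrid zero-shot retriever and is highly effective in both zero-shot and few-shot scenarios. Our codebase is publicly available at https://github.com/ielab/llm-qlm.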
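The query likelihood principle, ranking documents by the probability of generating the query given the document, can be shown with a classical stand-in. The sketch below uses a Dirichlet-smoothed unigram language model in place of an LLM, so it illustrates the scoring rule only and is not the paper's system. The function names, the whitespace tokenization, and the smoothing value `mu` are assumptions.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, coll_counts, coll_len, mu=10.0):
    """Dirichlet-smoothed unigram estimate of log P(query | doc)."""
    doc_tokens = doc.lower().split()
    doc_counts = Counter(doc_tokens)
    vocab_size = len(coll_counts)
    score = 0.0
    for term in query.lower().split():
        # Add-one smoothed collection probability, so unseen terms stay > 0.
        p_coll = (coll_counts[term] + 1) / (coll_len + vocab_size)
        p_term = (doc_counts[term] + mu * p_coll) / (len(doc_tokens) + mu)
        score += math.log(p_term)
    return score


def rank(query, docs, mu=10.0):
    """Return docs sorted by descending query log-likelihood."""
    all_tokens = [t for d in docs for t in d.lower().split()]
    coll_counts = Counter(all_tokens)
    scored = [
        (query_log_likelihood(query, d, coll_counts, len(all_tokens), mu), d)
        for d in docs
    ]
    return [d for _, d in sorted(scored, key=lambda item: item[0], reverse=True)]


docs = [
    "cats chase mice in the barn",
    "stock markets fell sharply today",
]
print(rank("cats mice", docs)[0])  # cats chase mice in the barn
```

With an LLM as the QLM, the per-term probabilities would instead come from the model's token log-probabilities for the query conditioned on a prompt containing the document. The ranking step stays the same.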

The Less the Merrier? Investigating Language Representation in Multilingual Models

paper_url: http://arxiv.org/abs/2310.13228
repo_url: None
paper_authors: Hellina Hailu Nigatu, Atnafu Lambebo Tonja, Jugal Kalita
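for: This paper investigates how well multilingual models represent different languages, in particular whether languages are adequately supported in low-resource settings.
methods: We examine popular multilingual models and analyze their learned representations across languages, including the effect of language family and dialect.
results: Community-centered models (models that focus on languages of a given family or geographical location and are built by communities who speak them) perform better at distinguishing between languages in the same family for low-resource languages. The study contributes to understanding multilingual models and points to ways to improve them.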

Abstract
Multilingual Language Models offer a way to incorporate multiple languages in one model and utilize cross-language transfer learning to improve performance for different Natural Language Processing (NLP) tasks. Despite progress in multilingual models, not all languages are supported as well, particularly in low-resource settings. In this work, we investigate the linguistic representation of different languages in multilingual models. We start by asking the question which languages are supported in popular multilingual models and which languages are left behind. Then, for included languages, we look at models' learned representations based on language family and dialect and try to understand how models' learned representations for (1) seen and (2) unseen languages vary across different language groups. In addition, we test and analyze performance on downstream tasks such as text generation and Named Entity Recognition. We observe from our experiments that community-centered models -- models that focus on languages of a given family or geographical location and are built by communities who speak them -- perform better at distinguishing between languages in the same family for low-resource languages. Our paper contributes to the literature in understanding multilingual models and their shortcomings and offers insights on potential ways to improve them.

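Summary
Multilingual language models make it possible to incorporate multiple languages into one model and use cross-language transfer learning to improve performance on different Natural Language Processing (NLP) tasks. Despite progress, not all languages are supported equally well, particularly in low-resource settings. In this work, we investigate how different languages are represented in multilingual models. We first ask which languages popular multilingual models support and which are left behind. For the included languages, we then examine the models' learned representations by language family and dialect, and try to understand how representations for seen and unseen languages vary across language groups. We also test and analyze performance on downstream tasks such as text generation and Named Entity Recognition. Our experiments show that community-centered models (models that focus on languages of a given family or geographical location and are built by communities who speak them) are better at distinguishing between languages in the same family for low-resource languages. The paper contributes to the literature on multilingual models and their shortcomings and offers insights on potential ways to improve them.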

Enhancing Zero-Shot Crypto Sentiment with Fine-tuned Language Model and Prompt Engineering

paper_url: http://arxiv.org/abs/2310.13226
repo_url: None
paper_authors: Rahman S M Wahidur, Ishmam Tashdeed, Manjit Kaur, Heung-No-Lee
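for: This study aims to improve the accuracy of sentiment analysis in the cryptocurrency domain and investigates the effect of fine-tuning techniques.
methods: The study fine-tunes large language models with supervised fine-tuning and instruction-based fine-tuning.
results: Fine-tuning yields an average zero-shot performance gain of 40%. Larger models benefit from instruction tuning and reach the highest average accuracy of 75.16%.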

Abstract
Blockchain technology has revolutionized the financial landscape, with cryptocurrencies gaining widespread adoption for their decentralized and transparent nature. As the sentiment expressed on social media platforms can significantly influence cryptocurrency discussions and market movements, sentiment analysis has emerged as a crucial tool for understanding public opinion and predicting market trends. Motivated by the aim to enhance sentiment analysis accuracy in the cryptocurrency domain, this paper investigates fine-tuning techniques on large language models. This paper also investigates the efficacy of supervised fine-tuning and instruction-based fine-tuning on large language models for unseen tasks. Experimental results demonstrate a significant average zero-shot performance gain of 40% after fine-tuning, highlighting the potential of this technique in optimizing pre-trained language model efficiency. Additionally, the impact of instruction tuning on models of varying scales is examined, revealing that larger models benefit from instruction tuning, achieving the highest average accuracy score of 75.16%. In contrast, smaller-scale models may experience reduced generalization due to the complete utilization of model capacity. To gain deeper insight about how instruction works with these language models, this paper presents an experimental investigation into the response of an instruction-based model under different instruction tuning setups. The investigation demonstrates that the model achieves an average accuracy score of 72.38% for short and simple instructions. This performance significantly outperforms its accuracy under long and complex instructions by over 12%, thereby effectively highlighting the profound significance of instruction characteristics in maximizing model performance.

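Summary
Blockchain technology has transformed the financial landscape, and cryptocurrencies have been widely adopted for their decentralized and transparent nature. Sentiment expressed on social media can significantly influence cryptocurrency discussions and market movements, so sentiment analysis has become a key tool for understanding public opinion and predicting market trends. To improve sentiment analysis accuracy in the cryptocurrency domain, this paper investigates fine-tuning techniques for large language models, including the efficacy of supervised fine-tuning and instruction-based fine-tuning on unseen tasks. Experiments show an average zero-shot performance gain of 40% after fine-tuning, which highlights the technique's potential for optimizing pre-trained language model efficiency. The paper also examines instruction tuning at different model scales: larger models benefit from it and reach the highest average accuracy of 75.16%, while smaller models may generalize worse because their capacity is fully used. To better understand how instructions interact with these models, the paper studies an instruction-based model under different instruction tuning setups. The model reaches an average accuracy of 72.38% with short and simple instructions, more than 12% higher than with long and complex instructions, which shows how much instruction characteristics matter for model performance.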

2023-10-20

Not all Fake News is Written: A Dataset and Analysis of Misleading Video Headlines

Implications of Annotation Artifacts in Edge Probing Test Datasets

Ecologically Valid Explanations for Label Variation in NLI

Foundation Model’s Embedded Representations May Detect Distribution Shift

Plausibility Processing in Transformer Language Models: Focusing on the Role of Attention Heads in GPT

Yet Another Model for Arabic Dialect Identification

Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks

A Unified View of Evaluation Metrics for Structured Prediction

How Much Consistency Is Your Accuracy Worth?

Seq2seq is All You Need for Coreference Resolution

Enhancing Abstractiveness of Summarization Models through Calibrated Distillation

ALDi: Quantifying the Arabic Level of Dialectness of Text

Exploring Linguistic Probes for Morphological Generalization

Information Value: Measuring Utterance Predictability as Distance from Plausible Alternatives

On Synthetic Data for Back Translation

StereoMap: Quantifying the Awareness of Human-like Stereotypes in Large Language Models

Explainable Depression Symptom Detection in Social Media

Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification

Benchmarking and Improving Text-to-SQL Generation under Ambiguity

BotChat: Evaluating LLMs’ Capabilities of Having Multi-Turn Dialogues

Bridging Information-Theoretic and Geometric Compression in Language Models

Semi-supervised multimodal coreference resolution in image narrations

Three Questions Concerning the Use of Large Language Models to Facilitate Mathematics Learning

Simultaneous Machine Translation with Tailored Reference

Improving Cross-Lingual Transfer through Subtree-Aware Word Reordering

Semantic Decomposition of Question and SQL for Text-to-SQL Parsing

Why Can Large Language Models Generate Correct Chain-of-Thoughts?

Cache & Distil: Optimising API Calls to Large Language Models

The Perils & Promises of Fact-checking with Large Language Models

A Diachronic Perspective on User Trust in AI under Uncertainty

Controlled Randomness Improves the Performance of Transformer Models

Teaching Language Models to Self-Improve through Interactive Demonstrations

Improving Question Generation with Multi-level Content Planning

DistillCSE: Distilled Contrastive Learning for Sentence Embeddings

Steering Large Language Models for Machine Translation with Finetuning and In-Context Learning

The Past, Present, and Future of Typological Databases in NLP

Conversation Chronicles: Towards Diverse Temporal and Relational Dynamics in Multi-Session Conversations

Towards Enhancing Relational Rules for Knowledge Graph Link Prediction

Explicit Alignment and Many-to-many Entailment Based Reasoning for Conversational Machine Reading

Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models

Tuna: Instruction Tuning using Feedback from Large Language Models

APP: Adaptive Prototypical Pseudo-Labeling for Few-shot OOD Detection

Analyzing Cognitive Plausibility of Subword Tokenization

Large-Scale and Multi-Perspective Opinion Summarization with Diverse Review Subsets

Beyond Hard Samples: Robust and Effective Grammatical Error Correction with Cycle Self-Augmenting

Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models

Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models

Test-Time Self-Adaptive Small Language Models for Question Answering

Interpreting Indirect Answers to Yes-No Questions in Multiple Languages

SALMONN: Towards Generic Hearing Abilities for Large Language Models

InvGC: Robust Cross-Modal Retrieval by Inverse Graph Convolution

On the Language Encoder of Contrastive Cross-modal Models

MoqaGPT : Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model

A Quality-based Syntactic Template Retriever for Syntactically-controlled Paraphrase Generation

Anomaly Detection of Command Shell Sessions based on DistilBERT: Unsupervised and Supervised Approaches

Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking

The Less the Merrier? Investigating Language Representation in Multilingual Models

Enhancing Zero-Shot Crypto Sentiment with Fine-tuned Language Model and Prompt Engineering