2023-07-06

cs.CL

cs.CL - 2023-07-06

Parameter-Efficient Fine-Tuning of LLaMA for the Clinical Domain

paper_url: http://arxiv.org/abs/2307.03042
repo_url: None
paper_authors: Aryo Pradipta Gema, Luke Daines, Pasquale Minervini, Beatrice Alex
for: This paper adapts pretrained language models to clinical applications, using Parameter-Efficient Fine-Tuning (PEFT) techniques to reduce computational requirements.
methods: The proposed Clinical LLaMA-LoRA is a PEFT adapter built upon the open-sourced LLaMA model and trained on clinical notes from the MIMIC-IV database. A two-step PEFT framework fuses Clinical LLaMA-LoRA with Downstream LLaMA-LoRA, a second adapter specialised for downstream tasks.
results: The framework achieves a state-of-the-art AUROC score averaged across all clinical downstream tasks, with substantial improvements of 6-9% AUROC in large-scale multilabel classification tasks such as diagnoses and procedures classification.

Abstract
Adapting pretrained language models to novel domains, such as clinical applications, traditionally involves retraining their entire set of parameters. However, this approach is increasingly proven to be impractical owing to the substantial computational requirements associated with training such large language models. To address this issue, Parameter-Efficient Fine-Tuning (PEFT) techniques offer a viable solution by selectively fine-tuning a small subset of additional parameters, significantly reducing the computational requirements for domain adaptation. In this study, we propose Clinical LLaMA-LoRA, a PEFT adapter layer built upon the open-sourced LLaMA model. Clinical LLaMA-LoRA is trained using clinical notes obtained from the MIMIC-IV database, thereby creating a specialised adapter designed for the clinical domain. Additionally, we propose a two-step PEFT framework which fuses Clinical LLaMA-LoRA with Downstream LLaMA-LoRA, another PEFT adapter specialised for downstream tasks. We evaluate this framework on multiple clinical outcome prediction datasets, comparing it to clinically trained language models. Our proposed framework achieves a state-of-the-art AUROC score averaged across all clinical downstream tasks. We observe substantial improvements of 6-9% AUROC score in the large-scale multilabel classification tasks, such as diagnoses and procedures classification.
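The low-rank adapter idea behind LoRA can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the class name `LoRALinear`, the rank, the scaling factor `alpha`, and the initialisation are hypothetical choices, and the sketch does not reproduce how the paper trains Clinical LLaMA-LoRA or fuses it with the downstream adapter.

```python
import numpy as np


class LoRALinear:
    """Frozen dense layer plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration of a LoRA-style adapter, not the paper's code.
    """

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))   # trainable
        self.B = np.zeros((out_dim, rank))                    # trainable, zero init: adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Base projection plus the scaled low-rank correction.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_fraction(self):
        # Share of parameters that would be updated during fine-tuning.
        adapter = self.A.size + self.B.size
        return adapter / (adapter + self.weight.size)


layer = LoRALinear(np.zeros((64, 64)), rank=4)
print(f"trainable fraction: {layer.trainable_fraction():.3f}")
```

Because only `A` and `B` would receive gradients, the number of trained parameters grows with the rank rather than with the full weight matrix, which is the source of the reduced computational requirements the abstract describes.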

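Summary
Adapting pretrained language models to new domains such as clinical applications usually means retraining the full parameter set, which is computationally prohibitive for large language models. Parameter-Efficient Fine-Tuning (PEFT) offers a workable alternative: fine-tuning only a small set of additional parameters greatly reduces the computational requirements. The study proposes Clinical LLaMA-LoRA, a PEFT adapter layer built on the open-source LLaMA model and trained on clinical notes from the MIMIC-IV database, which yields an adapter specialised for the clinical domain. It also proposes a two-step PEFT framework that fuses Clinical LLaMA-LoRA with Downstream LLaMA-LoRA, another PEFT adapter specialised for downstream tasks. Evaluated on multiple clinical outcome prediction datasets against clinically trained language models, the framework achieves the best AUROC score averaged across the clinical downstream tasks, with gains of 6-9% AUROC on large-scale multilabel classification tasks such as diagnoses and procedures classification.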

Improving Retrieval-Augmented Large Language Models via Data Importance Learning

paper_url: http://arxiv.org/abs/2307.03027
repo_url: https://github.com/amsterdata/ragbooster
paper_authors: Xiaozhong Lyu, Stefan Grafberger, Samantha Biegel, Shaopeng Wei, Meng Cao, Sebastian Schelter, Ce Zhang
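for: The paper aims to improve large language models that draw on external knowledge through retrieval augmentation, for example on question answering and data imputation tasks.
methods: It proposes an algorithm based on the multilinear extension for evaluating the importance of data points in the retrieval corpus. The algorithm runs in polynomial time and computes the importance exactly, given only a retrieval-augmented model with an additive utility function and a validation set.
results: Experiments show that pruning or reweighting the retrieval corpus alone, without further training, improves the performance of large language models. On some tasks, a small model (e.g., GPT-JT) augmented with a search engine API outperforms GPT-3.5 without retrieval augmentation. The multilinear-extension weights can also be computed quickly in practice (e.g., in less than ten minutes for a corpus with 100 million elements).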

Abstract
Retrieval augmentation enables large language models to take advantage of external knowledge, for example on tasks like question answering and data imputation. However, the performance of such retrieval-augmented models is limited by the data quality of their underlying retrieval corpus. In this paper, we propose an algorithm based on multilinear extension for evaluating the data importance of retrieved data points. There are exponentially many terms in the multilinear extension, and one key contribution of this paper is a polynomial time algorithm that computes exactly, given a retrieval-augmented model with an additive utility function and a validation set, the data importance of data points in the retrieval corpus using the multilinear extension of the model's utility function. We further proposed an even more efficient ({\epsilon}, {\delta})-approximation algorithm. Our experimental results illustrate that we can enhance the performance of large language models by only pruning or reweighting the retrieval corpus, without requiring further training. For some tasks, this even allows a small model (e.g., GPT-JT), augmented with a search engine API, to outperform GPT-3.5 (without retrieval augmentation). Moreover, we show that weights based on multilinear extension can be computed efficiently in practice (e.g., in less than ten minutes for a corpus with 100 million elements).
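To make the "exponentially many terms" concrete, the multilinear extension of a set utility can be evaluated by brute force on a toy corpus. The sketch below is not the paper's polynomial-time algorithm. The function names, the toy utility, and the use of the partial derivative as an importance score are assumptions made for illustration.

```python
from itertools import combinations


def multilinear_extension(utility, weights):
    """Expected utility when point i is kept independently with probability weights[i].

    Enumerates all 2^n subsets, so it is only usable on tiny corpora.
    """
    n = len(weights)
    total = 0.0
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            chosen = set(subset)
            prob = 1.0
            for i, w in enumerate(weights):
                prob *= w if i in chosen else 1.0 - w
            total += prob * utility(chosen)
    return total


def importance(utility, weights, i):
    """Partial derivative of the extension with respect to weights[i].

    The extension is linear in each weight, so the derivative equals the
    difference between forcing point i in and forcing it out.
    """
    forced_in = list(weights)
    forced_in[i] = 1.0
    forced_out = list(weights)
    forced_out[i] = 0.0
    return multilinear_extension(utility, forced_in) - multilinear_extension(utility, forced_out)


# Hypothetical toy corpus: points 0 and 1 are helpful, point 2 is noisy.
HELPFUL, NOISY = {0, 1}, {2}


def toy_utility(subset):
    # The answer counts as correct only if helpful points outnumber noisy ones.
    return 1.0 if len(subset & HELPFUL) > len(subset & NOISY) else 0.0


weights = [0.5, 0.5, 0.5]
print([importance(toy_utility, weights, i) for i in range(3)])
```

In this toy setting the noisy point receives a negative score and the helpful points a positive one, which is the kind of signal that would motivate pruning or reweighting the corpus.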

Summary
There are exponentially many terms in the multilinear extension, and one key contribution of this paper is a polynomial-time algorithm that computes the data importance of data points in the retrieval corpus using the multilinear extension of the model's utility function, given a retrieval-augmented model with an additive utility function and a validation set. We also propose an even more efficient $(\epsilon, \delta)$-approximation algorithm. Our experimental results show that we can enhance the performance of large language models by only pruning or reweighting the retrieval corpus, without requiring further training. For some tasks, this even allows a small model (e.g., GPT-JT) augmented with a search engine API to outperform GPT-3.5 (without retrieval augmentation). Moreover, we show that weights based on multilinear extension can be computed efficiently in practice (e.g., in less than ten minutes for a corpus with 100 million elements).

Style Over Substance: Evaluation Biases for Large Language Models

paper_url: http://arxiv.org/abs/2307.03025
repo_url: None
paper_authors: Minghao Wu, Alham Fikri Aji
for: This paper examines how reliably humans and large language models (LLMs) evaluate natural language generation output, and proposes a new approach to improve the accuracy of LLM-based evaluations.
methods: The paper uses a dataset of intentionally flawed machine-generated answers to compare the evaluation behavior of crowd-sourced annotators, expert annotators, and LLMs. The proposed approach applies the Elo rating system independently to each of multiple evaluation dimensions.
results: The proposed Multi-Elo Rating System significantly enhances the quality of LLM-based evaluations, particularly in terms of factual accuracy. However, there is no significant improvement in crowd-sourced-based evaluations, indicating the need for further investigation and refinement.

Abstract
As large language models (LLMs) continue to advance, accurately and comprehensively evaluating their performance becomes increasingly challenging. Human evaluations are conventionally considered the gold standard in natural language generation, but recent advancements incorporate state-of-the-art LLMs as proxies for human judges in evaluation processes. However, the extent to which humans and LLMs are capable evaluators remains uncertain. This study investigates the behavior of crowd-sourced and expert annotators, as well as LLMs, when comparing outputs from different models. To achieve this, we curate a dataset of intentionally flawed machine-generated answers. Our findings reveal a concerning bias in the evaluation process, as answers with factual errors are rated more favorably than answers that are too short or contained grammatical errors. To address this issue, we propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score. We instantiate this idea with the Elo rating system, resulting in the Multi-Elo Rating System. Empirical results from our study reveal that this proposed approach significantly enhances the quality of LLM-based evaluations, particularly in terms of factual accuracy. However, there is no significant improvement in crowd-sourced-based evaluations, indicating the need for further investigation and refinement.
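The idea of keeping one Elo table per evaluation dimension can be sketched as follows. This is a minimal illustration using the standard Elo update. The dimension names, the K-factor of 32, the initial rating of 1000, and the function names are assumptions, not details taken from the paper.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(ratings, a, b, score_a, k=32.0):
    """Apply one pairwise result; score_a is 1.0 (A wins), 0.5 (tie) or 0.0 (B wins)."""
    expected_a = expected_score(ratings[a], ratings[b])
    ratings[a] += k * (score_a - expected_a)
    ratings[b] += k * (expected_a - score_a)


def multi_elo(judgments, initial=1000.0):
    """Keep an independent rating table for every evaluation dimension."""
    tables = defaultdict(lambda: defaultdict(lambda: initial))
    for dimension, a, b, score_a in judgments:
        update(tables[dimension], a, b, score_a)
    return tables


# Hypothetical judgments: model_x is judged more accurate, model_y better written.
judgments = [
    ("accuracy", "model_x", "model_y", 1.0),
    ("language", "model_x", "model_y", 0.0),
]
tables = multi_elo(judgments)
for dimension, ratings in tables.items():
    print(dimension, dict(ratings))
```

With a single merged score these two judgments would cancel out, whereas separate tables keep the factual-accuracy ranking visible.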

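Summary
As LLMs continue to advance, evaluating their performance becomes increasingly challenging. Human evaluation is regarded as the gold standard in natural language generation, but recent work uses state-of-the-art LLMs as proxies for human judges. Whether humans and LLMs are capable evaluators, however, remains uncertain. The study investigates how crowd-sourced annotators, expert annotators, and LLMs compare outputs from different models, using a curated dataset of intentionally flawed machine-generated answers. The findings show that answers with factual errors are rated more favorably than answers that are too short or contain grammatical errors. To address this, the authors propose evaluating machine-generated text independently along multiple dimensions rather than merging all aspects into a single score, and instantiate the idea with the Elo rating system as the Multi-Elo Rating System. The approach significantly improves the quality of LLM-based evaluations, particularly for factual accuracy, but brings no significant improvement for crowd-sourced evaluations, which indicates the need for further investigation and refinement.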

Efficient Semiring-Weighted Earley Parsing

paper_url: http://arxiv.org/abs/2307.02982
repo_url: https://github.com/rycolab/earleys-algo
paper_authors: Andreas Opedal, Ran Zmigrod, Tim Vieira, Ryan Cotterell, Jason Eisner
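for: The paper provides a reference description, in the form of a deduction system, of Earley's (1970) context-free parsing algorithm with various speed-ups.
methods: The presentation includes a known worst-case runtime improvement from Earley's $O (N^3|G||R|)$ to $O (N^3|G|)$, which matches the runtime of CKY on a binarized version of the grammar $G$. It also gives a version that runs in $O (N^3|M|)$ with $|M| \leq |G|$ when the grammar is represented compactly as a single finite-state automaton $M$.
results: On a preprocessed grammar, the semiring-weighted versions of the methods have the same asymptotic runtime and space requirements as the unweighted methods, including sub-cubic runtime on some grammars.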

Abstract
This paper provides a reference description, in the form of a deduction system, of Earley's (1970) context-free parsing algorithm with various speed-ups. Our presentation includes a known worst-case runtime improvement from Earley's $O (N^3|G||R|)$, which is unworkable for the large grammars that arise in natural language processing, to $O (N^3|G|)$, which matches the runtime of CKY on a binarized version of the grammar $G$. Here $N$ is the length of the sentence, $|R|$ is the number of productions in $G$, and $|G|$ is the total length of those productions. We also provide a version that achieves runtime of $O (N^3|M|)$ with $|M| \leq |G|$ when the grammar is represented compactly as a single finite-state automaton $M$ (this is partly novel). We carefully treat the generalization to semiring-weighted deduction, preprocessing the grammar like Stolcke (1995) to eliminate deduction cycles, and further generalize Stolcke's method to compute the weights of sentence prefixes. We also provide implementation details for efficient execution, ensuring that on a preprocessed grammar, the semiring-weighted versions of our methods have the same asymptotic runtime and space requirements as the unweighted methods, including sub-cubic runtime on some grammars.
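For readers unfamiliar with the baseline, a textbook Earley recognizer with its three deduction steps (predict, scan, complete) can be sketched as below. This is an unweighted illustration without any of the paper's speed-ups, and it assumes a grammar with no empty productions. The grammar encoding and the function name are hypothetical.

```python
def earley_recognize(grammar, start, tokens):
    """Return True if tokens can be derived from start.

    grammar maps each nonterminal to a list of right-hand sides (tuples of
    symbols); any symbol that is not a key of grammar is treated as a terminal.
    Assumes there are no empty right-hand sides.
    """
    n = len(tokens)
    # chart[k] holds items (lhs, rhs, dot, origin) that end at position k.
    chart = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))

    for k in range(n + 1):
        agenda = list(chart[k])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                symbol = rhs[dot]
                if symbol in grammar:
                    # PREDICT: start every production of the expected nonterminal.
                    for production in grammar[symbol]:
                        item = (symbol, production, 0, k)
                        if item not in chart[k]:
                            chart[k].add(item)
                            agenda.append(item)
                elif k < n and tokens[k] == symbol:
                    # SCAN: consume the matching terminal.
                    chart[k + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # COMPLETE: advance items that were waiting for this nonterminal.
                for p_lhs, p_rhs, p_dot, p_origin in list(chart[origin]):
                    if p_dot < len(p_rhs) and p_rhs[p_dot] == lhs:
                        item = (p_lhs, p_rhs, p_dot + 1, p_origin)
                        if item not in chart[k]:
                            chart[k].add(item)
                            agenda.append(item)

    return any(
        lhs == start and dot == len(rhs) and origin == 0
        for lhs, rhs, dot, origin in chart[n]
    )


# Hypothetical toy grammar: E -> E + T | T, T -> n
toy_grammar = {"E": [("E", "+", "T"), ("T",)], "T": [("n",)]}
print(earley_recognize(toy_grammar, "E", ["n", "+", "n"]))
```

The left-recursive rule `E -> E + T` is handled without special treatment because predicted items are deduplicated per chart position.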

Summary
Here, $N$ is the length of the sentence, $|R|$ is the number of productions in $G$, and $|G|$ is the total length of those productions. The paper also discusses the generalization to semiring-weighted deduction, preprocessing the grammar like Stolcke (1995) to eliminate deduction cycles, and further generalizing Stolcke's method to compute the weights of sentence prefixes.

Agentività e telicità in GilBERTo: implicazioni cognitive

paper_url: http://arxiv.org/abs/2307.02910
repo_url: None
paper_authors: Agnese Lombardi, Alessandro Lenci
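for: The study investigates whether a Transformer-based neural language model infers lexical semantics and uses this information to complete morphosyntactic patterns.
methods: The same tasks are given to the Transformer model and to a group of Italian native speakers, and the two sets of data are compared to examine how far neural language models capture significant aspects of human semantic competence.
results: The study finds that the Transformer model makes use of lexical semantic information when completing morphosyntactic patterns, with results similar to those of the Italian native speakers, which suggests some capacity for inferring lexical semantics.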

Abstract
The goal of this study is to investigate whether a Transformer-based neural language model infers lexical semantics and uses this information for the completion of morphosyntactic patterns. The semantic properties considered are telicity (also combined with definiteness) and agentivity. Both act at the interface between semantics and morphosyntax: they are semantically determined and syntactically encoded. The tasks were submitted to both the computational model and a group of Italian native speakers. The comparison between the two groups of data allows us to investigate to what extent neural language models capture significant aspects of human semantic competence.

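Summary
The study investigates whether a Transformer-based neural language model infers lexical semantics and uses this information to complete morphosyntactic patterns. The semantic properties considered are telicity and agentivity, which act at the interface between semantics and morphosyntax: they are semantically determined and syntactically encoded. The tasks were submitted to both the computational model and a group of Italian native speakers, and comparing the two allows the authors to investigate to what extent neural language models capture significant aspects of human semantic competence.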

The Relationship Between Speech Features Changes When You Get Depressed: Feature Correlations for Improving Speed and Performance of Depression Detection

paper_url: http://arxiv.org/abs/2307.02892
repo_url: None
paper_authors: Fuxiang Tao, Wei Ma, Xuri Ge, Anna Esposito, Alessandro Vinciarelli
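for: The work shows that depression changes the correlation between features extracted from speech, and that this insight can improve the training speed and performance of depression detectors based on SVMs and LSTMs.
methods: The experiments use the Androids Corpus, a dataset of 112 speakers, 58 of whom were diagnosed with depression by professional psychiatrists.
results: Feeding the models feature correlation matrices rather than feature vectors improves training speed and performance, with a relative error-rate reduction between 23.1% and 26.6% depending on the model. The probable explanation is that feature correlation matrices are more variable for depressed speakers.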

Abstract
This work shows that depression changes the correlation between features extracted from speech. Furthermore, it shows that using such an insight can improve the training speed and performance of depression detectors based on SVMs and LSTMs. The experiments were performed over the Androids Corpus, a publicly available dataset involving 112 speakers, including 58 people diagnosed with depression by professional psychiatrists. The results show that the models used in the experiments improve in terms of training speed and performance when fed with feature correlation matrices rather than with feature vectors. The relative reduction of the error rate ranges between 23.1% and 26.6% depending on the model. The probable explanation is that feature correlation matrices appear to be more variable in the case of depressed speakers. Correspondingly, such a phenomenon can be thought of as a depression marker.
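Replacing per-frame feature vectors with a feature correlation matrix can be sketched with NumPy. The abstract does not state which speech features are used or how recordings are windowed, so the array shapes, the function name, and the choice to keep only the upper triangle are assumptions made for illustration.

```python
import numpy as np


def correlation_features(frames):
    """Summarise a recording by the pairwise correlations of its features.

    frames has shape (num_frames, num_features). The result is the flattened
    upper triangle of the feature correlation matrix, excluding the diagonal.
    """
    corr = np.corrcoef(frames, rowvar=False)       # (num_features, num_features)
    upper = np.triu_indices(corr.shape[0], k=1)    # index pairs (i, j) with i < j
    return corr[upper]


# Hypothetical recording: 200 frames, 5 features, feature 1 is a scaled copy of feature 0.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 5))
frames[:, 1] = 2.0 * frames[:, 0]

features = correlation_features(frames)
print(features.shape)
```

Each recording is reduced to one fixed-length input regardless of its number of frames, which is one plausible reason such a representation could speed up training.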

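Summary
The study shows that depression changes the correlations between features extracted from speech, and that this finding can improve the training speed and performance of depression detectors based on SVMs and LSTMs. The experiments use the publicly available Androids Corpus, which involves 112 speakers, 58 of whom were diagnosed with depression by professional psychiatrists. The models improve in training speed and performance when fed feature correlation matrices rather than feature vectors, with a relative error-rate reduction between 23.1% and 26.6%. The probable explanation is that feature correlation matrices are more variable for depressed speakers, so this phenomenon can be regarded as a marker of depression.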

paper_url: http://arxiv.org/abs/2307.02863
repo_url: None
paper_authors: Lukas Birkenmaier, Clemens Lechner, Claudia Wagner
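for: The paper aims to provide a validation framework for computational text-based measures of social science constructs.
methods: It introduces a new validation framework called ValiTex, which draws on a long-established tradition within psychometrics and extends it for computational text analysis. ValiTex consists of a conceptual model and a dynamic checklist: the conceptual model provides a general structure for approaching validation, while the dynamic checklist defines specific validation steps and gives guidance on which steps are recommendable and which are optional.
results: The utility of the framework is demonstrated by applying it to a use case of detecting sexism from social media data.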

Abstract
Guidance on how to validate computational text-based measures of social science constructs is fragmented. Whereas scholars are generally acknowledging the importance of validating their text-based measures, they often lack common terminology and a unified framework to do so. This paper introduces a new validation framework called ValiTex, designed to assist scholars to measure social science constructs based on textual data. The framework draws on a long-established tradition within psychometrics while extending the framework for the purpose of computational text analysis. ValiTex consists of two components, a conceptual model, and a dynamic checklist. Whereas the conceptual model provides a general structure along distinct phases on how to approach validation, the dynamic checklist defines specific validation steps and provides guidance on which steps might be considered recommendable (i.e., providing relevant and necessary validation evidence) or optional (i.e., useful for providing additional supporting validation evidence). The utility of the framework is demonstrated by applying it to a use case of detecting sexism from social media data.

Summary
Guidance on how to validate computational text-based measures of social science constructs is fragmented. Although scholars generally recognize the importance of validating their text-based measures, they often lack a common terminology and unified framework to do so. This paper introduces a new validation framework called ValiTex, designed to assist scholars in measuring social science constructs based on textual data. The framework draws on a long-established tradition within psychometrics while extending it for the purpose of computational text analysis. ValiTex consists of two components: a conceptual model and a dynamic checklist. While the conceptual model provides a general structure for approaching validation, the dynamic checklist defines specific validation steps and provides guidance on which steps might be considered recommendable (i.e., providing relevant and necessary validation evidence) or optional (i.e., useful for providing additional supporting validation evidence). The utility of the framework is demonstrated by applying it to a use case of detecting sexism from social media data.

NatLogAttack: A Framework for Attacking Natural Language Inference Models with Natural Logic

paper_url: http://arxiv.org/abs/2307.02849
repo_url: None
paper_authors: Zi’ou Zheng, Xiaodan Zhu
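for: The study explores attack models based on logic formalism, to evaluate whether current natural language inference (NLI) models perform real reasoning or merely rely on spurious correlations.
methods: It proposes NatLogAttack, an attack framework centred on natural logic that performs systematic attacks and supports both label-preserving and label-flipping attacks.
results: Compared with existing attack models, NatLogAttack generates better adversarial examples with fewer visits to the victim models. The victim models are more vulnerable under the label-flipping setting. NatLogAttack provides a tool for probing the capacity of existing and future NLI models, and the authors hope more logic-based attacks will be explored for understanding the desired property of reasoning.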

Abstract
Reasoning has been a central topic in artificial intelligence from the beginning. The recent progress made on distributed representation and neural networks continues to improve the state-of-the-art performance of natural language inference. However, it remains an open question whether the models perform real reasoning to reach their conclusions or rely on spurious correlations. Adversarial attacks have proven to be an important tool to help evaluate the Achilles' heel of the victim models. In this study, we explore the fundamental problem of developing attack models based on logic formalism. We propose NatLogAttack to perform systematic attacks centring around natural logic, a classical logic formalism that is traceable back to Aristotle's syllogism and has been closely developed for natural language inference. The proposed framework renders both label-preserving and label-flipping attacks. We show that compared to the existing attack models, NatLogAttack generates better adversarial examples with fewer visits to the victim models. The victim models are found to be more vulnerable under the label-flipping setting. NatLogAttack provides a tool to probe the existing and future NLI models' capacity from a key viewpoint and we hope more logic-based attacks will be further explored for understanding the desired property of reasoning.

Summary
Reasoning has been a central topic in artificial intelligence from the beginning. Progress on distributed representations and neural networks continues to improve state-of-the-art performance on natural language inference, yet it remains an open question whether models reach their conclusions through real reasoning or through spurious correlations. Adversarial attacks are an important tool for evaluating the weaknesses of victim models. The study explores the fundamental problem of developing attack models based on logic formalism and proposes NatLogAttack, a systematic attack method centred on natural logic that supports both label-preserving and label-flipping attacks. Compared with existing attack models, NatLogAttack generates better adversarial examples with fewer visits to the victim models, and the victim models are more vulnerable under the label-flipping setting. NatLogAttack provides a tool for probing the capacity of existing and future NLI models, and the authors hope that more logic-based attacks will be explored to better understand the desired property of reasoning.

Generative Zero-Shot Prompt Learning for Cross-Domain Slot Filling with Inverse Prompting

paper_url: http://arxiv.org/abs/2307.02830
repo_url: None
paper_authors: Xuefeng Li, Liwen Wang, Guanting Dong, Keqing He, Jinzheng Zhao, Hao Lei, Jiachi Liu, Weiran Xu
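for: The paper addresses zero-shot cross-domain slot filling, which transfers knowledge from a labeled source domain to an unlabeled target domain.
methods: It proposes a generative zero-shot prompt learning framework that improves generalization and robustness, together with a novel inverse prompting strategy to distinguish different slot types and an efficient prompt-tuning strategy that improves performance.
results: Experiments and analysis show that the framework is more effective than previous models, with a large improvement (+13.44% F1) on unseen slots.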

Abstract
Zero-shot cross-domain slot filling aims to transfer knowledge from the labeled source domain to the unlabeled target domain. Existing models either encode slot descriptions and examples or design handcrafted question templates using heuristic rules, suffering from poor generalization capability or robustness. In this paper, we propose a generative zero-shot prompt learning framework for cross-domain slot filling, both improving generalization and robustness than previous work. Besides, we introduce a novel inverse prompting strategy to distinguish different slot types to avoid the multiple prediction problem, and an efficient prompt-tuning strategy to boost higher performance by only training fewer prompt parameters. Experiments and analysis demonstrate the effectiveness of our proposed framework, especially huge improvements (+13.44% F1) on the unseen slots.

Summary
Zero-shot cross-domain slot filling aims to transfer knowledge from the labeled source domain to the unlabeled target domain. Existing models either encode slot descriptions and examples or design handcrafted question templates using heuristic rules, and suffer from poor generalization capability or robustness. The paper proposes a generative zero-shot prompt learning framework for cross-domain slot filling that improves both generalization and robustness over previous work. It also introduces a novel inverse prompting strategy that distinguishes different slot types to avoid the multiple prediction problem, and an efficient prompt-tuning strategy that boosts performance while training only a small number of prompt parameters. Experiments and analysis demonstrate the effectiveness of the framework, especially the large improvement (+13.44% F1) on unseen slots.

VerifAI: Verified Generative AI

paper_url: http://arxiv.org/abs/2307.02796
repo_url: None
paper_authors: Nan Tang, Chenyu Yang, Ju Fan, Lei Cao
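for: The paper examines concerns about the accuracy and reliability of generative AI outputs and how a data management perspective can address them.
methods: It proposes analyzing the underlying data in multi-modal data lakes, including text files, tables, and knowledge graphs, and assessing its quality and consistency.
results: It puts forward a data-management-based vision of verifying generative AI outputs, which can help ensure correctness, promote transparency, and enable decision-making with greater confidence.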

Abstract
Generative AI has made significant strides, yet concerns about the accuracy and reliability of its outputs continue to grow. Such inaccuracies can have serious consequences such as inaccurate decision-making, the spread of false information, privacy violations, legal liabilities, and more. Although efforts to address these risks are underway, including explainable AI and responsible AI practices such as transparency, privacy protection, bias mitigation, and social and environmental responsibility, misinformation caused by generative AI will remain a significant challenge. We propose that verifying the outputs of generative AI from a data management perspective is an emerging issue for generative AI. This involves analyzing the underlying data from multi-modal data lakes, including text files, tables, and knowledge graphs, and assessing its quality and consistency. By doing so, we can establish a stronger foundation for evaluating the outputs of generative AI models. Such an approach can ensure the correctness of generative AI, promote transparency, and enable decision-making with greater confidence. Our vision is to promote the development of verifiable generative AI and contribute to a more trustworthy and responsible use of AI.

UniCoRN: Unified Cognitive Signal ReconstructioN bridging cognitive signals and human language

paper_url: http://arxiv.org/abs/2307.05355
repo_url: https://github.com/rootnx/UniCoRN
paper_authors: Nuwa Xi, Sendong Zhao, Haochun Wang, Chi Liu, Bing Qin, Ting Liu
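for: The paper studies decoding text from cognitive signals (such as fMRI) to improve our understanding of the human language system and pave the way for versatile Brain-Computer Interfaces.
methods: It proposes fMRI2text, the first open-vocabulary task that aims to bridge fMRI time series and human language, and a baseline solution, UniCoRN, which reconstructs both individual time points and time series of cognitive signals and uses a pre-trained language model as decoder to produce coherent text.
results: UniCoRN achieves a 34.77% BLEU score on fMRI2text and 37.04% BLEU on EEG-to-text decoding, surpassing the former baseline, which indicates that different cognitive signals can be decoded effectively with a unified structure.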

Abstract
Decoding text stimuli from cognitive signals (e.g. fMRI) enhances our understanding of the human language system, paving the way for building versatile Brain-Computer Interface. However, existing studies largely focus on decoding individual word-level fMRI volumes from a restricted vocabulary, which is far too idealized for real-world application. In this paper, we propose fMRI2text, the first open-vocabulary task aiming to bridge fMRI time series and human language. Furthermore, to explore the potential of this new task, we present a baseline solution, UniCoRN: the Unified Cognitive Signal ReconstructioN for Brain Decoding. By reconstructing both individual time points and time series, UniCoRN establishes a robust encoder for cognitive signals (fMRI & EEG). Leveraging a pre-trained language model as decoder, UniCoRN proves its efficacy in decoding coherent text from fMRI series across various split settings. Our model achieves a 34.77% BLEU score on fMRI2text, and a 37.04% BLEU when generalized to EEG-to-text decoding, thereby surpassing the former baseline. Experimental results indicate the feasibility of decoding consecutive fMRI volumes, and the effectiveness of decoding different cognitive signals using a unified structure.

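Summary
Decoding text stimuli from cognitive signals (e.g., fMRI) improves our understanding of the human language system and paves the way for versatile brain-computer interfaces. Existing studies, however, mostly decode individual word-level fMRI volumes from a restricted vocabulary, which is too idealized for real-world application. The paper proposes fMRI2text, the first open-vocabulary task aiming to bridge fMRI time series and human language. To explore the task, it presents a baseline solution, UniCoRN, a unified cognitive signal reconstruction method for brain decoding. UniCoRN reconstructs both individual time points and time series, and uses a pre-trained language model as decoder to produce coherent text from fMRI series. The model achieves a 34.77% BLEU score on fMRI2text and 37.04% BLEU on EEG-to-text decoding, surpassing the former baseline. The results indicate that consecutive fMRI volumes can be decoded and that different cognitive signals can be decoded with a unified structure.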

Training Models to Generate, Recognize, and Reframe Unhelpful Thoughts

paper_url: http://arxiv.org/abs/2307.02768
repo_url: None
paper_authors: Mounica Maddela, Megan Ung, Jing Xu, Andrea Madotto, Heather Foran, Y-Lan Boureau
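for: The study examines whether existing language models can generate specific practice material to help improve well-being.
methods: It introduces PATTERNREFRAME, a dataset of about 10k examples of thoughts containing unhelpful thought patterns, accompanied by about 27k positive reframes, and uses it to train and/or evaluate current language models.
results: Existing models can already generate an abundance of tailored practice material and hypotheses, with no or minimal additional model training required.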

Abstract
Many cognitive approaches to well-being, such as recognizing and reframing unhelpful thoughts, have received considerable empirical support over the past decades, yet still lack truly widespread adoption in self-help format. A barrier to that adoption is a lack of adequately specific and diverse dedicated practice material. This work examines whether current language models can be leveraged to both produce a virtually unlimited quantity of practice material illustrating standard unhelpful thought patterns matching specific given contexts, and generate suitable positive reframing proposals. We propose PATTERNREFRAME, a novel dataset of about 10k examples of thoughts containing unhelpful thought patterns conditioned on a given persona, accompanied by about 27k positive reframes. By using this dataset to train and/or evaluate current models, we show that existing models can already be powerful tools to help generate an abundance of tailored practice material and hypotheses, with no or minimal additional model training required.

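Summary
Many cognitive approaches to well-being, such as recognizing and reframing unhelpful thoughts, have gained empirical support over the past decades but still lack widespread adoption. One barrier is the lack of adequately specific and diverse dedicated practice material. This work examines whether current language models can be used to produce a large quantity of practice material illustrating standard unhelpful thought patterns that match specific contexts, and to generate suitable positive reframing proposals. The authors propose PATTERNREFRAME, a dataset of about 10k examples of thoughts containing unhelpful thought patterns conditioned on a given persona, accompanied by about 27k positive reframes. Using this dataset to train and/or evaluate current models, they show that existing models can already help generate an abundance of tailored practice material and hypotheses, with no or minimal additional model training.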

Undecimated Wavelet Transform for Word Embedded Semantic Marginal Autoencoder in Security improvement and Denoising different Languages

paper_url: http://arxiv.org/abs/2307.03679
repo_url: None
paper_authors: Shreyanth S
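for: Improving security, privacy, and multilingual support in data processing applications.
methods: Combining the undecimated wavelet transform with a Word Embedded Semantic Marginal Autoencoder (WESMA).
results: The approach improves the security and robustness of multilingual data processing applications, and effectively reduces noise and improves data quality.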

Abstract
By combining the undecimated wavelet transform within a Word Embedded Semantic Marginal Autoencoder (WESMA), this research study provides a novel strategy for improving security measures and denoising multiple languages. The incorporation of these strategies is intended to address the issues of robustness, privacy, and multilingualism in data processing applications. The undecimated wavelet transform is used as a feature extraction tool to identify prominent language patterns and structural qualities in the input data. The proposed system may successfully capture significant information while preserving the temporal and geographical links within the data by employing this transform. This improves security measures by increasing the system's ability to detect abnormalities, discover hidden patterns, and distinguish between legitimate content and dangerous threats. The Word Embedded Semantic Marginal Autoencoder also functions as an intelligent framework for dimensionality and noise reduction. The autoencoder effectively learns the underlying semantics of the data and reduces noise components by exploiting word embeddings and semantic context. As a result, data quality and accuracy are increased in following processing stages. The suggested methodology is tested using a diversified dataset that includes several languages and security scenarios. The experimental results show that the proposed approach is effective in attaining security enhancement and denoising capabilities across multiple languages. The system is strong in dealing with linguistic variances, producing consistent outcomes regardless of the language used. Furthermore, incorporating the undecimated wavelet transform considerably improves the system's ability to efficiently address complex security concerns

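Summary
By incorporating the undecimated wavelet transform into a Word Embedded Semantic Marginal Autoencoder (WESMA), this study offers a new strategy for strengthening security and denoising text in multiple languages. The strategy targets robustness, privacy, and multilingualism in data processing applications. The undecimated wavelet transform serves as a feature extraction tool that identifies language patterns and structural features in the input data. The proposed system captures the significant information while preserving the temporal and geographical links within the data. This strengthens security by improving the system's ability to detect anomalies, discover hidden patterns, and distinguish legitimate content from dangerous threats. WESMA also acts as an intelligent framework for dimensionality and noise reduction: the autoencoder uses word embeddings and semantic context to learn the underlying semantics of the data and thereby reduce noise components, so data quality and accuracy improve in later processing stages. The method was tested on multiple languages and security scenarios, and the results show that it achieves security enhancement and denoising across languages. The system handles linguistic variation robustly, producing consistent results regardless of language. In addition, incorporating the undecimated wavelet transform lets the system address complex security problems more effectively.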
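
The abstract does not say which wavelet or how many decomposition levels are used. As a minimal sketch of what "undecimated" means, assuming a single-level Haar filter with circular boundary handling (both are assumptions, and `undecimated_haar` and `reconstruct` are hypothetical helper names), the transform below keeps every coefficient instead of downsampling, so both outputs have the same length as the input:

```python
def undecimated_haar(signal):
    """One level of an undecimated (stationary) Haar transform.

    Assumptions for illustration: unnormalised averaging/differencing
    filters and circular (wrap-around) boundary handling.
    """
    n = len(signal)
    approx = []  # low-pass output: local averages
    detail = []  # high-pass output: local differences
    for i in range(n):
        nxt = signal[(i + 1) % n]  # wrap around at the end of the signal
        approx.append((signal[i] + nxt) / 2)
        detail.append((signal[i] - nxt) / 2)
    return approx, detail


def reconstruct(approx, detail):
    # With these filters, each sample is recovered as approx[i] + detail[i].
    return [a + d for a, d in zip(approx, detail)]


signal = [4.0, 6.0, 10.0, 12.0]
approx, detail = undecimated_haar(signal)
```

Because nothing is discarded, the representation is redundant, which is the property that makes undecimated transforms a common choice for denoising.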

paper_url: http://arxiv.org/abs/2307.02763
repo_url: https://github.com/davidjurgens/contextual-appropriateness
paper_authors: David Jurgens, Agrima Seth, Jackson Sargent, Athena Aghighi, Michael Geraci
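for: This study aims to improve the identification of inappropriate content in conversation by taking social context and norms into account.
methods: The study uses large language models and incorporates information about the social relationship between speakers to better judge whether a message is appropriate.
results: Relationship information helps large language models identify appropriateness more accurately, and contextual appropriateness judgments are predictive of other social factors such as condescension and politeness.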

Abstract
Understanding interpersonal communication requires, in part, understanding the social context and norms in which a message is said. However, current methods for identifying offensive content in such communication largely operate independent of context, with only a few approaches considering community norms or prior conversation as context. Here, we introduce a new approach to identifying inappropriate communication by explicitly modeling the social relationship between the individuals. We introduce a new dataset of contextually-situated judgments of appropriateness and show that large language models can readily incorporate relationship information to accurately identify appropriateness in a given context. Using data from online conversations and movie dialogues, we provide insight into how the relationships themselves function as implicit norms and quantify the degree to which context-sensitivity is needed in different conversation settings. Further, we also demonstrate that contextual-appropriateness judgments are predictive of other social factors expressed in language such as condescension and politeness.

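Summary
Understanding interpersonal communication requires, in part, understanding the social context and norms in which a message is said, yet current methods for identifying offensive content mostly operate independently of context, with only a few approaches considering community norms or prior conversation. We introduce a new approach to identifying inappropriate communication that explicitly models the social relationship between the individuals. We introduce a new dataset of contextually situated appropriateness judgments and show that large language models can readily incorporate relationship information to accurately identify appropriateness in a given context. Using online conversations and movie dialogues, we provide insight into how relationships themselves function as implicit norms and quantify the degree of context sensitivity needed in different conversation settings. We also show that contextual appropriateness judgments are predictive of other social factors expressed in language.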

paper_url: http://arxiv.org/abs/2307.02758
repo_url: None
paper_authors: Aparna Ananthasubramaniam, Hong Chen, Jason Yan, Kenan Alkiek, Jiaxin Pei, Agrima Seth, Lavinia Dunagan, Minje Choi, Benjamin Litterer, David Jurgens
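for: The study examines how linguistic style matching (LSM) in Reddit conversations relates to several aspects of social influence, including power and persuasion.
methods: LSM is identified using two types of style: the use of function words and formality. The authors analyze a large corpus of two-party conversation threads on Reddit and record every occurrence of LSM.
results: LSM is related to several Reddit community metrics (post and subreddit features, conversation depth, user tenure, and comment controversiality). The study also measures how LSM changes following loss of status after community banning.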

Abstract
Linguistic style matching (LSM) in conversations can be reflective of several aspects of social influence such as power or persuasion. However, how LSM relates to the outcomes of online communication on platforms such as Reddit is an unknown question. In this study, we analyze a large corpus of two-party conversation threads in Reddit where we identify all occurrences of LSM using two types of style: the use of function words and formality. Using this framework, we examine how levels of LSM differ in conversations depending on several social factors within Reddit: post and subreddit features, conversation depth, user tenure, and the controversiality of a comment. Finally, we measure the change of LSM following loss of status after community banning. Our findings reveal the interplay of LSM in Reddit conversations with several community metrics, suggesting the importance of understanding conversation engagement when understanding community dynamics.

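Summary
Linguistic style matching (LSM) in conversations can reflect several aspects of social influence, such as power or persuasion. However, how LSM relates to the outcomes of online communication on platforms such as Reddit has received little study. In this study, we analyze a large corpus of two-party conversation threads and identify all occurrences of LSM using two types of style: function words and formality. Using this framework, we examine how levels of LSM relate to several social factors on Reddit, including post and subreddit features, conversation depth, user tenure, and the controversiality of a comment. Finally, we measure the change in LSM that follows community banning. Our findings show that LSM in Reddit conversations is closely related to several community metrics, which points to the importance of understanding conversation engagement for understanding community dynamics.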
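
The abstract does not give the formula used to score LSM. A minimal sketch, assuming the commonly cited per-category score 1 - |p1 - p2| / (p1 + p2 + 0.0001) over function-word usage rates and a tiny illustrative word list (`FUNCTION_WORDS`, `function_word_rate`, and `lsm_score` are hypothetical names, not taken from the paper):

```python
# Illustrative word list only; real studies use much larger lexicons.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "but", "i", "you", "it", "is"}


def function_word_rate(text):
    """Fraction of tokens that are function words (whitespace tokenisation)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(tok in FUNCTION_WORDS for tok in tokens) / len(tokens)


def lsm_score(text_a, text_b):
    """Style matching score; values near 1 mean similar function-word usage."""
    p1 = function_word_rate(text_a)
    p2 = function_word_rate(text_b)
    # The small constant avoids division by zero when neither text uses function words.
    return 1 - abs(p1 - p2) / (p1 + p2 + 0.0001)
```

Applied to a post and its reply, a score near 1 would indicate that the two speakers use function words at similar rates; a formality-based score would need a separate formality measure.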

Dense Retrieval Adaptation using Target Domain Description

paper_url: http://arxiv.org/abs/2307.02740
repo_url: None
paper_authors: Helia Hashemi, Yong Zhuang, Sachith Sri Ram Kothur, Srivas Prasad, Edgar Meij, W. Bruce Croft
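for: The paper studies domain adaptation in information retrieval when the target document collection cannot be accessed and only a textual description of the target domain is available.
methods: The paper introduces a new automatic data construction pipeline that, given a textual domain description, produces a synthetic document collection, query set, and pseudo relevance labels.
results: Extensive experiments show that adapting dense retrieval models with the constructed synthetic data yields effective retrieval performance on the target domain.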

Abstract
In information retrieval (IR), domain adaptation is the process of adapting a retrieval model to a new domain whose data distribution is different from the source domain. Existing methods in this area focus on unsupervised domain adaptation where they have access to the target document collection or supervised (often few-shot) domain adaptation where they additionally have access to (limited) labeled data in the target domain. There also exists research on improving zero-shot performance of retrieval models with no adaptation. This paper introduces a new category of domain adaptation in IR that is as-yet unexplored. Here, similar to the zero-shot setting, we assume the retrieval model does not have access to the target document collection. In contrast, it does have access to a brief textual description that explains the target domain. We define a taxonomy of domain attributes in retrieval tasks to understand different properties of a source domain that can be adapted to a target domain. We introduce a novel automatic data construction pipeline that produces a synthetic document collection, query set, and pseudo relevance labels, given a textual domain description. Extensive experiments on five diverse target domains show that adapting dense retrieval models using the constructed synthetic data leads to effective retrieval performance on the target domain.

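Summary
In information retrieval (IR), domain adaptation is the process of adapting a retrieval model to a new domain whose data distribution differs from that of the source domain. Existing methods focus on unsupervised domain adaptation and on supervised (often few-shot) domain adaptation, where the latter additionally has access to (limited) labeled data in the target domain. There is also research on improving the zero-shot performance of retrieval models. This paper introduces a new category of domain adaptation in IR: as in the zero-shot setting, we assume the retrieval model has no access to the target document collection, but it does have access to a brief textual description of the target domain. We define a taxonomy of domain attributes in retrieval tasks to better understand which properties of a source domain can be adapted. We introduce a new automatic data construction pipeline that generates a synthetic document collection, query set, and pseudo relevance labels from a textual domain description. Extensive experiments on five diverse target domains show that adapting dense retrieval models with the constructed synthetic data achieves effective retrieval performance.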

Text Alignment Is An Efficient Unified Model for Massive NLP Tasks

paper_url: http://arxiv.org/abs/2307.02729
repo_url: None
paper_authors: Yuheng Zha, Yichi Yang, Ruichen Li, Zhiting Hu
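for: The paper proposes an efficient text alignment model for a range of text-pair tasks, including text similarity, question answering, and factual consistency.
methods: The model (Align) is instantiated by lightweight fine-tuning of RoBERTa on 5.9M examples from 28 datasets.
results: Across these tasks the model matches or surpasses FLAN-T5 models that have around 2x or 10x more parameters, and it also outperforms task-specific models fine-tuned on individual datasets. When used to evaluate the factual consistency of language generation, it improves over baselines including the much larger GPT-3.5 and sometimes GPT-4.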

Abstract
Large language models (LLMs), typically designed as a function of next-word prediction, have excelled across extensive NLP tasks. Despite the generality, next-word prediction is often not an efficient formulation for many of the tasks, demanding an extreme scale of model parameters (10s or 100s of billions) and sometimes yielding suboptimal performance. In practice, it is often desirable to build more efficient models -- despite being less versatile, they still apply to a substantial subset of problems, delivering on par or even superior performance with much smaller model sizes. In this paper, we propose text alignment as an efficient unified model for a wide range of crucial tasks involving text entailment, similarity, question answering (and answerability), factual consistency, and so forth. Given a pair of texts, the model measures the degree of alignment between their information. We instantiate an alignment model (Align) through lightweight finetuning of RoBERTa (355M parameters) using 5.9M examples from 28 datasets. Despite its compact size, extensive experiments show the model's efficiency and strong performance: (1) On over 20 datasets of aforementioned diverse tasks, the model matches or surpasses FLAN-T5 models that have around 2x or 10x more parameters; the single unified model also outperforms task-specific models finetuned on individual datasets; (2) When applied to evaluate factual consistency of language generation on 23 datasets, our model improves over various baselines, including the much larger GPT-3.5 (ChatGPT) and sometimes even GPT-4; (3) The lightweight model can also serve as an add-on component for LLMs such as GPT-3.5 in question answering tasks, improving the average exact match (EM) score by 17.94 and F1 score by 15.05 through identifying unanswerable questions.

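Summary
Large language models (LLMs), typically formulated as next-word predictors, have excelled across a wide range of NLP tasks. However, next-word prediction is often not an efficient formulation for many tasks: it demands an extreme scale of model parameters (tens or hundreds of billions) and sometimes yields suboptimal performance. In practice it is often desirable to build more efficient models; although less versatile, they still apply to a substantial subset of problems and deliver on-par or even better performance at much smaller sizes. In this paper we propose text alignment as an efficient unified model for many crucial tasks involving relationships between texts. Given a pair of texts, the model measures the degree of alignment between their information. We instantiate the alignment model (Align) by lightweight fine-tuning of RoBERTa (355M parameters) on 5.9M examples from 28 datasets. Despite its compact size, experiments demonstrate the model's efficiency and strong performance: 1. On over 20 datasets of diverse tasks, the model matches or surpasses FLAN-T5 models with around 2x or 10x more parameters, and the single unified model also outperforms task-specific models fine-tuned on individual datasets. 2. When applied to evaluating the factual consistency of language generation on 23 datasets, the model improves over various baselines, including the much larger GPT-3.5 (ChatGPT) and sometimes even GPT-4. 3. The lightweight model can also serve as an add-on component for LLMs, improving GPT-3.5's average exact match (EM) score by 17.94 and F1 score by 15.05 in question answering by identifying unanswerable questions.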

On-Device Constrained Self-Supervised Speech Representation Learning for Keyword Spotting via Knowledge Distillation

paper_url: http://arxiv.org/abs/2307.02720
repo_url: None
paper_authors: Gene-Ping Yang, Yue Gu, Qingming Tang, Dongsu Du, Yuzong Liu
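for: The study aims to apply large self-supervised models to keyword spotting, where on-device budget constraints and biased dataset collection make their use challenging.
methods: The authors propose a knowledge distillation-based self-supervised model that uses a teacher-student framework to transfer knowledge to a small, lightweight model, with dual-view cross-correlation distillation and the teacher's codebook as learning objectives.
results: Evaluated on an Alexa keyword spotting detection task with a 16.6k-hour in-house dataset, the method performs well in both normal and noisy conditions, demonstrating the efficacy of knowledge distillation for building self-supervised keyword spotting models.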

Abstract
Large self-supervised models are effective feature extractors, but their application is challenging under on-device budget constraints and biased dataset collection, especially in keyword spotting. To address this, we proposed a knowledge distillation-based self-supervised speech representation learning (S3RL) architecture for on-device keyword spotting. Our approach used a teacher-student framework to transfer knowledge from a larger, more complex model to a smaller, light-weight model using dual-view cross-correlation distillation and the teacher's codebook as learning objectives. We evaluated our model's performance on an Alexa keyword spotting detection task using a 16.6k-hour in-house dataset. Our technique showed exceptional performance in normal and noisy conditions, demonstrating the efficacy of knowledge distillation methods in constructing self-supervised models for keyword spotting tasks while working within on-device resource constraints.

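Summary
Large self-supervised models are effective feature extractors, but applying them on device is challenging under budget constraints and biased dataset collection, especially in keyword spotting. To address this, we propose a knowledge distillation-based self-supervised speech representation learning (S3RL) architecture for on-device keyword spotting. Our approach uses a teacher-student framework to transfer knowledge from a larger, more complex model to a smaller, lightweight model, with dual-view cross-correlation distillation and the teacher's codebook as learning objectives. We evaluate on an Alexa keyword spotting detection task using a 16.6k-hour in-house dataset. Our technique performs exceptionally well in normal and noisy conditions, demonstrating the efficacy of knowledge distillation methods for constructing self-supervised models for keyword spotting tasks.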

CFSum: A Coarse-to-Fine Contribution Network for Multimodal Summarization

paper_url: http://arxiv.org/abs/2307.02716
repo_url: https://github.com/xiaomin418/cfsum
paper_authors: Min Xiao, Junnan Zhu, Haitao Lin, Yu Zhou, Chengqing Zong
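for: Improving multimodal summarization by addressing the unclear contribution of the visual modality.
methods: Proposes a Coarse-to-Fine contribution network (CFSum) with a pre-filter module that discards useless images and with word-level and phrase-level visual complement modules. It targets the adaptive conditions under which images are useful, which fusion-focused approaches ignore.
results: Experiments show that CFSum significantly outperforms multiple strong baselines on the standard benchmark. Analysis also verifies that useful images can help generate non-visual words that are implicitly represented in the image.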

Abstract
Multimodal summarization usually suffers from the problem that the contribution of the visual modality is unclear. Existing multimodal summarization approaches focus on designing the fusion methods of different modalities, while ignoring the adaptive conditions under which visual modalities are useful. Therefore, we propose a novel Coarse-to-Fine contribution network for multimodal Summarization (CFSum) to consider different contributions of images for summarization. First, to eliminate the interference of useless images, we propose a pre-filter module to abandon useless images. Second, to make accurate use of useful images, we propose two levels of visual complement modules, word level and phrase level. Specifically, image contributions are calculated and are adopted to guide the attention of both textual and visual modalities. Experimental results have shown that CFSum significantly outperforms multiple strong baselines on the standard benchmark. Furthermore, the analysis verifies that useful images can even help generate non-visual words which are implicitly represented in the image.

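Summary
Multimodal summarization usually suffers from the problem that the contribution of the visual modality is unclear. Existing approaches focus on designing fusion methods for the different modalities while ignoring the adaptive conditions under which the visual modality is useful. We therefore propose a new Coarse-to-Fine contribution network for multimodal summarization (CFSum) that considers the different contributions of images. First, to eliminate interference from useless images, we propose a pre-filter module. Second, to make accurate use of useful images, we propose two levels of visual complement modules, word level and phrase level. Specifically, image contributions are calculated and used to guide the attention of both the textual and visual modalities. Experimental results show that CFSum significantly outperforms multiple strong baselines. The analysis further shows that useful images can even help generate non-visual words that are implicitly represented in the image.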

Strahler Number of Natural Language Sentences in Comparison with Random Trees

paper_url: http://arxiv.org/abs/2307.02697
repo_url: None
paper_authors: Kumiko Tanaka-Ishii, Akira Tanaka
for: The paper aims to apply the Strahler number, originally proposed for characterizing river bifurcation, to natural language sentence tree structures and explore its implications for sentence processing.
methods: The paper uses empirical measurements across grammatically annotated data to compute the upper and lower limits of the Strahler number for natural language sentences, and analyzes the growth of the Strahler number with sentence length.
results: The paper shows that the Strahler number of natural language sentences is almost 3 or 4, similar to the case of river bifurcation (Strahler, 1957). The paper also explains reports of 3 to 4 memory areas required for sentence processing (Abney and Johnson, 1991; Schuler et al., 2010) and a psychological "magical number" of 3 to 5 (Cowan, 2001) using the Strahler number. Additionally, the paper finds that the Strahler number is not specific to natural language and holds for random trees.

Abstract
The Strahler number was originally proposed to characterize the complexity of river bifurcation and has found various applications. This article proposes computation of the Strahler number's upper and lower limits for natural language sentence tree structures. Through empirical measurements across grammatically annotated data, the Strahler number of natural language sentences is shown to be almost 3 or 4, similarly to the case of river bifurcation as reported by Strahler (1957). From the theory behind the number, we show that it is one kind of lower limit on the amount of memory required to process sentences. We consider the Strahler number to provide reasoning that explains reports showing that the number of required memory areas to process sentences is 3 to 4 for parsing (Abney and Johnson, 1991; Schuler et al., 2010), and reports indicating a psychological "magical number" of 3 to 5 (Cowan, 2001). An analytical and empirical analysis shows that the Strahler number is not constant but grows logarithmically; therefore, the Strahler number of sentences derives from the range of sentence lengths. Furthermore, the Strahler number is not different for random trees, which could suggest that its origin is not specific to natural language.

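Summary
The Strahler number was originally proposed to characterize the complexity of river bifurcation. This article proposes computing upper and lower limits of the Strahler number for natural language sentence tree structures. Through empirical measurements on grammatically annotated data, the Strahler number of natural language sentences is shown to be almost 3 or 4, similar to the river bifurcation case reported by Strahler (1957). From the theory behind the number, it is a kind of lower limit on the amount of memory required to process sentences. We consider that the Strahler number explains reports that sentence processing requires 3 to 4 memory areas (Abney and Johnson, 1991; Schuler et al., 2010), as well as the psychological "magical number" of 3 to 5 (Cowan, 2001). Analytical and empirical analysis shows that the Strahler number is not constant but grows logarithmically, so the Strahler number of sentences derives from the range of sentence lengths. Furthermore, the Strahler number is not different for random trees, which could suggest that its origin is not specific to natural language.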
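
The classic Strahler number can be computed recursively: a leaf has value 1, and an inner node takes the largest value among its children, plus one if that largest value occurs at least twice. The sketch below shows only this basic definition on trees written as nested Python lists (a representation chosen here for illustration); it does not reproduce the paper's upper and lower limits for sentence trees.

```python
def strahler(tree):
    """Strahler number of a tree given as nested lists; any non-list is a leaf."""
    if not isinstance(tree, list) or not tree:
        return 1  # a leaf has Strahler number 1
    child_values = [strahler(child) for child in tree]
    top = max(child_values)
    # If the maximum is reached by two or more children, the number increases.
    if child_values.count(top) >= 2:
        return top + 1
    return top


# A purely right-branching structure stays at 2 however long it gets,
# while a balanced structure of the same size reaches 3.
right_branching = ["a", ["b", ["c", "d"]]]
balanced = [["a", "b"], ["c", "d"]]
```

The contrast between the two example trees illustrates why the number tracks branching complexity and not size alone.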

Learning Symbolic Rules over Abstract Meaning Representations for Textual Reinforcement Learning

paper_url: http://arxiv.org/abs/2307.02689
repo_url: https://github.com/ibm/loa
paper_authors: Subhajit Chaudhury, Sarathkrishna Swaminathan, Daiki Kimura, Prithviraj Sen, Keerthiram Murugesan, Rosario Uceda-Sosa, Michiaki Tatsubori, Achille Fokoue, Pavan Kapanipathi, Asim Munawar, Alexander Gray
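for: The study proposes a modular NEuro-Symbolic Textual Agent (NESTA) that captures the semantics of game text to support better decisions in text-based games.
methods: NESTA combines a generic semantic parser with a rule induction system to learn abstract, interpretable rules from text as policies.
results: Experiments on established text-based game benchmarks show that NESTA outperforms deep reinforcement learning methods, generalizing better to unseen test games and learning from fewer training interactions.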

Abstract
Text-based reinforcement learning agents have predominantly been neural network-based models with embeddings-based representation, learning uninterpretable policies that often do not generalize well to unseen games. On the other hand, neuro-symbolic methods, specifically those that leverage an intermediate formal representation, are gaining significant attention in language understanding tasks. This is because of their advantages ranging from inherent interpretability, the lesser requirement of training data, and being generalizable in scenarios with unseen data. Therefore, in this paper, we propose a modular, NEuro-Symbolic Textual Agent (NESTA) that combines a generic semantic parser with a rule induction system to learn abstract interpretable rules as policies. Our experiments on established text-based game benchmarks show that the proposed NESTA method outperforms deep reinforcement learning-based techniques by achieving better generalization to unseen test games and learning from fewer training interactions.

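Summary
Text-based reinforcement learning agents have mainly been neural network-based models with embedding-based representations; they learn uninterpretable policies that often fail to generalize to unseen games. Neuro-symbolic methods, especially those that use an intermediate formal representation, are receiving increasing attention in language understanding tasks because of advantages such as inherent interpretability, a smaller training data requirement, and generalization to unseen data. In this paper we therefore propose a modular NEuro-Symbolic Textual Agent (NESTA) that combines a generic semantic parser with a rule induction system to learn abstract, interpretable rules as policies. Experiments on established text-based game benchmarks show that NESTA outperforms deep reinforcement learning-based techniques, generalizing better to unseen test games and learning from fewer training interactions.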

Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment

paper_url: http://arxiv.org/abs/2307.02682
repo_url: None
paper_authors: Yongrae Jo, Seongyun Lee, Aiden SJ Lee, Hyunji Lee, Hanseok Oh, Minjoon Seo
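for: The study aims to reduce dense video captioning's reliance on large, expensive annotated corpora by requiring no videos or annotations for training.
methods: The authors propose ZeroTA, a new zero-shot method for dense video captioning. ZeroTA needs no videos or annotations for training; at test time it localizes and describes events using only the input video. It introduces a soft moment mask that represents a temporal segment of the video and jointly optimizes the mask with the prefix parameters of a language model. This joint optimization aligns a frozen language generation model (e.g., GPT-2) with a frozen vision-language contrastive model (e.g., CLIP) by maximizing the matching score between the generated text and a moment in the video. A pairwise temporal IoU loss lets a set of soft moment masks capture multiple distinct events.
results: ZeroTA surpasses zero-shot baselines and even outperforms the state-of-the-art few-shot method on ActivityNet Captions. It is also more robust than supervised methods in out-of-domain scenarios. The work suggests that aligning widely used models, such as language generation models and vision-language models, can unlock a new capability: understanding temporal aspects of videos.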

Abstract
Dense video captioning, a task of localizing meaningful moments and generating relevant captions for videos, often requires a large, expensive corpus of annotated video segments paired with text. In an effort to minimize the annotation cost, we propose ZeroTA, a novel method for dense video captioning in a zero-shot manner. Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time by optimizing solely on the input. This is accomplished by introducing a soft moment mask that represents a temporal segment in the video and jointly optimizing it with the prefix parameters of a language model. This joint optimization aligns a frozen language generation model (i.e., GPT-2) with a frozen vision-language contrastive model (i.e., CLIP) by maximizing the matching score between the generated text and a moment within the video. We also introduce a pairwise temporal IoU loss to let a set of soft moment masks capture multiple distinct events within the video. Our method effectively discovers diverse significant events within the video, with the resulting captions appropriately describing these events. The empirical results demonstrate that ZeroTA surpasses zero-shot baselines and even outperforms the state-of-the-art few-shot method on the widely-used benchmark ActivityNet Captions. Moreover, our method shows greater robustness compared to supervised methods when evaluated in out-of-domain scenarios. This research provides insight into the potential of aligning widely-used models, such as language generation models and vision-language models, to unlock a new capability: understanding temporal aspects of videos.

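Summary
Dense video captioning, the task of localizing meaningful moments in a video and generating relevant captions for them, usually requires a large corpus of annotated video segments paired with text. To reduce annotation cost, we propose ZeroTA, a new zero-shot method for dense video captioning. Our method needs no videos or annotations for training; instead, at test time it localizes and describes the events in each input video by optimizing solely on that input. We introduce a soft moment mask that represents a temporal segment of the video and optimize it jointly with the prefix parameters of a language model. This joint optimization aligns a frozen language generation model (i.e., GPT-2) with a frozen vision-language contrastive model (i.e., CLIP). We also introduce a pairwise temporal IoU loss so that a set of soft moment masks captures multiple distinct events in the video. Our method effectively discovers diverse significant events in the video, and the generated captions describe these events appropriately. Experimental results show that ZeroTA surpasses zero-shot baselines and even outperforms the state-of-the-art few-shot method on ActivityNet Captions. Our method is also more robust than supervised methods when evaluated in out-of-domain scenarios. This research provides insight into aligning widely used models, such as language generation models and vision-language models, to unlock an understanding of the temporal aspects of videos.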
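
The pairwise temporal IoU loss discourages different moment masks from covering the same part of the video. The paper applies it to soft, differentiable masks; the sketch below is a simplification that assumes hard `(start, end)` intervals, and `temporal_iou` and `pairwise_iou_penalty` are hypothetical names used only for illustration.

```python
from itertools import combinations


def temporal_iou(a, b):
    """IoU of two time intervals given as (start, end) with start < end."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union


def pairwise_iou_penalty(moments):
    """Mean IoU over all pairs of moments; lower means less overlap."""
    pairs = list(combinations(moments, 2))
    if not pairs:
        return 0.0
    return sum(temporal_iou(a, b) for a, b in pairs) / len(pairs)
```

Minimising such a penalty pushes the moments apart, which is one way a fixed set of masks can end up describing distinct events instead of the same one.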

paper_url: http://arxiv.org/abs/2307.02640
repo_url: None
paper_authors: Alexandrea K. Ramnarine
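for: The study explores unsupervised classification and clustering of social media text so that the large volume of such text can be used in artificial intelligence (AI) applications.
methods: The study builds features with TF-IDF and uses t-SNE, k-means clustering, and LDA to learn top words and generate topics.
results: Using the results of unsupervised analysis, a computer can predict user sentiment toward plastic surgery with almost 90% accuracy. The model achieves higher accuracy on the unsupervised sentiment task than on a rudimentary supervised document classification task, so unsupervised learning may be a viable option for labeling social media documents.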

Abstract
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases based on the sheer volume and velocity of textual data. Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding. Using a word ranking method, term frequency-inverse document frequency (TF-IDF), to create features across documents, it is possible to perform unsupervised analytics, machine learning (ML) that can group the documents without a human manually labeling the data. For large datasets with thousands of features, t-distributed stochastic neighbor embedding (t-SNE), k-means clustering and Latent Dirichlet allocation (LDA) are employed to learn top words and generate topics for a Reddit and Twitter combined corpus. Using extremely simple deep learning models, this study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery based on a tweet or subreddit post with almost 90% accuracy. Furthermore, the model is capable of achieving higher accuracy on the unsupervised sentiment task than on a rudimentary supervised document classification task. Therefore, unsupervised learning may be considered a viable option in labeling social media documents for NLP tasks.

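Summary
The massive volume of user posts on social media platforms is largely untapped for artificial intelligence (AI) use cases, mainly because of the volume and velocity of the data. Natural language processing (NLP) is a subfield of AI that uses collections of documents (corpora) to train computers to understand human language. Using a word ranking method, term frequency-inverse document frequency (TF-IDF), to create features across documents makes unsupervised analytics possible: machine learning (ML) that groups documents without manual labeling. For large datasets with thousands of features, t-distributed stochastic neighbor embedding (t-SNE), k-means clustering, and Latent Dirichlet allocation (LDA) are used to learn top words and generate topics. Using extremely simple deep learning models, this study shows that applying the results of unsupervised analysis lets a computer predict user sentiment toward plastic surgery from a tweet or Reddit post with almost 90% accuracy. The model also achieves higher accuracy on the unsupervised sentiment task than on a rudimentary supervised document classification task, so unsupervised learning may be a viable option for labeling social media documents for NLP tasks.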
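
TF-IDF weights a term by how often it occurs in a document and how rare it is across the corpus. The abstract does not state which TF-IDF variant was used, so the sketch below assumes one common form, tf = count / document length and idf = log(N / df), with whitespace tokenisation; the example posts are invented for illustration.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


posts = ["surgery went great", "surgery was painful", "great results"]
vectors = tfidf(posts)
```

Terms that appear in every document receive weight zero under this variant, which is why the resulting vectors emphasise the words that distinguish one post from another before clustering.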

SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference

paper_url: http://arxiv.org/abs/2307.02628
repo_url: None
paper_authors: Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, Subhabrata Mukherjee
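for: Improving the computational efficiency of LLMs in natural language generation to make practical deployment more feasible.
methods: Proposes SkipDecode, a simple and effective token-level early-exit method that works with batch inference and KV caching. It sets a single exit point for every token in a batch at each sequence position, which avoids waiting for the last token in the batch to exit.
results: Experiments show that SkipDecode obtains 2x to 5x inference speedups with negligible regression across a variety of tasks, using OPT models with 1.3 billion and 6.7 billion parameters.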

Abstract
Autoregressive large language models (LLMs) have made remarkable progress in various natural language generation tasks. However, they incur high computation cost and latency resulting from the autoregressive token-by-token generation. To address this issue, several approaches have been proposed to reduce computational cost using early-exit strategies. These strategies enable faster text generation using reduced computation without applying the full computation graph to each token. While existing token-level early exit methods show promising results for online inference, they cannot be readily applied for batch inferencing and Key-Value caching. This is because they have to wait until the last token in a batch exits before they can stop computing. This severely limits the practical application of such techniques. In this paper, we propose a simple and effective token-level early exit method, SkipDecode, designed to work seamlessly with batch inferencing and KV caching. It overcomes prior constraints by setting up a singular exit point for every token in a batch at each sequence position. It also guarantees a monotonic decrease in exit points, thereby eliminating the need to recompute KV Caches for preceding tokens. Rather than terminating computation prematurely as in prior works, our approach bypasses lower to middle layers, devoting most of the computational resources to upper layers, allowing later tokens to benefit from the compute expenditure by earlier tokens. Our experimental results show that SkipDecode can obtain 2x to 5x inference speedups with negligible regression across a variety of tasks. This is achieved using OPT models of 1.3 billion and 6.7 billion parameters, all the while being directly compatible with batching and KV caching optimization techniques.

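Summary
Autoregressive large language models (LLMs) have made remarkable progress in natural language generation tasks. However, their token-by-token autoregressive generation incurs high computation cost and latency. To address this, several approaches reduce computation with early-exit strategies, which generate text faster without applying the full computation graph to every token. Existing token-level early-exit methods, however, cannot be readily applied to batch inference and Key-Value (KV) caching, because they must wait until the last token in a batch exits before they can stop computing. This severely limits their practical use. In this paper we propose SkipDecode, a simple and effective token-level early-exit method that works seamlessly with batch inference and KV caching and overcomes these constraints. It sets a single exit point for every token in a batch at each sequence position and guarantees a monotonic decrease in exit points, which removes the need to recompute KV caches for preceding tokens. Rather than terminating computation prematurely as prior methods do, our approach skips the lower to middle layers and devotes most of the computation to the upper layers, so that later tokens benefit from the compute spent on earlier tokens. Experiments show that SkipDecode obtains 2x to 5x inference speedups with negligible regression across a variety of tasks, using OPT models with 1.3 billion and 6.7 billion parameters, while remaining directly compatible with batching and KV caching optimizations.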
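
The two properties named in the abstract, one exit point per sequence position shared by the whole batch and a monotonic decrease of that exit point, can be sketched as a simple schedule. The linear decay and the choice to keep exactly the top layers are assumptions made here for illustration, and `layer_budget` and `active_layers` are hypothetical names; the paper's actual schedule may differ.

```python
def layer_budget(position, num_positions, max_layers, min_layers):
    """Number of layers to run at a sequence position (same for the whole batch).

    Assumption: the budget decays linearly from max_layers to min_layers.
    """
    if num_positions == 1:
        return max_layers
    fraction = position / (num_positions - 1)
    return round(max_layers - fraction * (max_layers - min_layers))


def active_layers(position, num_positions, total_layers, max_layers, min_layers):
    """Indices of the layers that run: lower layers are skipped, upper ones kept."""
    budget = layer_budget(position, num_positions, max_layers, min_layers)
    return list(range(total_layers - budget, total_layers))


budgets = [layer_budget(t, 5, max_layers=12, min_layers=4) for t in range(5)]
```

Under this sketch, the layers used at a later position are always a subset of those used at earlier positions, so every layer a new token runs already has cached keys and values for all preceding tokens.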

Several categories of Large Language Models (LLMs): A Short Survey

paper_url: http://arxiv.org/abs/2307.10188
repo_url: None
paper_authors: Saurabh Pahune, Manoj Chandrasekharan
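for: The study aims to give readers, developers, academics, and users interested in LLM-based chatbots and virtual intelligent assistant technologies useful information and future directions.
methods: The survey covers several categories of LLMs: task-based financial LLMs, multilingual LLMs, biomedical and clinical LLMs, vision language LLMs, and code language models. It describes the methods, attributes, datasets, transformer models, and comparison metrics for each category.
results: The survey summarizes the methods, attributes, datasets, transformer models, and comparison metrics of each category and highlights unresolved problems, such as boosting natural language processing, enhancing chatbot intelligence, and resolving moral and legal dilemmas.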

Abstract
Large Language Models (LLMs) have become effective tools for natural language processing and have been used in many different fields. This essay offers a succinct summary of various LLM subcategories. The survey emphasizes recent developments and efforts made for various LLM kinds, including task-based financial LLMs, multilingual language LLMs, biomedical and clinical LLMs, vision language LLMs, and code language models. The survey gives a general summary of the methods, attributes, datasets, transformer models, and comparison metrics applied in each category of LLMs. Furthermore, it highlights unresolved problems in the field of developing chatbots and virtual assistants, such as boosting natural language processing, enhancing chatbot intelligence, and resolving moral and legal dilemmas. The purpose of this study is to provide readers, developers, academics, and users interested in LLM-based chatbots and virtual intelligent assistant technologies with useful information and future directions.

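Summary
Large language models (LLMs) have become effective tools for natural language processing and are used in many fields. This article gives a succinct overview of LLM subcategories, emphasizing recent developments and efforts across several kinds of LLM: task-based financial LLMs, multilingual LLMs, biomedical and clinical LLMs, vision language LLMs, and code language models. It also describes the methods, attributes, datasets, transformer models, and comparison metrics used in each category. In addition, it highlights unresolved problems in developing chatbots and virtual assistants, such as boosting natural language processing, enhancing chatbot intelligence, and resolving moral and legal dilemmas. The aim is to provide readers, developers, academics, and users interested in LLM-based chatbots and virtual intelligent assistant technologies with useful information and future directions.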

Named Entity Inclusion in Abstractive Text Summarization

paper_url: http://arxiv.org/abs/2307.02570
repo_url: None
paper_authors: Sergey Berezin, Tatiana Batura
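for: Increasing abstractive summarizers' attention to named entities, addressing the named entity omission problem of current abstractive summarizers.
methods: Proposes a custom pretraining objective: a RoBERTa named entity recognition model is used to mask named entities in the text, and BART is trained to reconstruct them. The BART model is then fine-tuned on the summarization task.
results: Experiments show that this pretraining approach improves named entity inclusion precision and recall.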

Abstract
We address the named entity omission - the drawback of many current abstractive text summarizers. We suggest a custom pretraining objective to enhance the model's attention on the named entities in a text. At first, the named entity recognition model RoBERTa is trained to determine named entities in the text. After that, this model is used to mask named entities in the text and the BART model is trained to reconstruct them. Next, the BART model is fine-tuned on the summarization task. Our experiments showed that this pretraining approach improves named entity inclusion precision and recall metrics.

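Summary
We address named entity omission, a drawback of many current abstractive text summarizers. We suggest a custom pretraining objective that increases the model's attention to named entities. First, the named entity recognition model RoBERTa is trained to determine the named entities in the text. This model is then used to mask the named entities, and the BART model is trained to reconstruct them. Next, the BART model is fine-tuned on the summarization task. Our experiments show that this pretraining approach improves named entity inclusion precision and recall.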
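
The masking step of this pretraining objective can be sketched without any model: given character spans for the entities, each span is replaced by a mask token, and the original text becomes the reconstruction target. In the sketch below the spans are written by hand in place of NER output, and `mask_entities` and the default `"<mask>"` string are illustrative assumptions.

```python
def mask_entities(text, spans, mask_token="<mask>"):
    """Replace each (start, end) character span with mask_token.

    The spans are assumed to come from an NER model and to be non-overlapping.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])  # text before the entity
        pieces.append(mask_token)
        cursor = end
    pieces.append(text[cursor:])  # remaining text after the last entity
    return "".join(pieces)


source = "Alice visited Paris in May."
masked = mask_entities(source, [(0, 5), (14, 19)])
# A training pair for the reconstruction objective would be (masked, source).
```

Masking only entity spans, instead of random tokens, is what directs the reconstruction loss toward the named entities.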

LongNet: Scaling Transformers to 1,000,000,000 Tokens

paper_url: http://arxiv.org/abs/2307.02486
repo_url: https://github.com/microsoft/unilm
paper_authors: Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, Furu Wei
for: Addressing the challenge of scaling sequence length in the era of large language models, and achieving high performance on both long-sequence modeling and general language tasks.
methods: Proposes a Transformer variant called LongNet, which uses dilated attention to expand the attentive field exponentially as the distance grows, allowing for the scaling of sequence length to more than 1 billion tokens without sacrificing performance on shorter sequences.
results: LongNet has significant advantages, including linear computation complexity and logarithmic dependency between any two tokens in a sequence, and can be served as a distributed trainer for extremely long sequences. Experiments demonstrate strong performance on both long-sequence modeling and general language tasks, opening up new possibilities for modeling very long sequences such as entire corpora or the internet.

Abstract
Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, rendering the maximum sequence length restricted. To address this issue, we introduce LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens, without sacrificing the performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet has significant advantages: 1) it has a linear computation complexity and a logarithm dependency between any two tokens in a sequence; 2) it can be served as a distributed trainer for extremely long sequences; 3) its dilated attention is a drop-in replacement for standard attention, which can be seamlessly integrated with the existing Transformer-based optimization. Experiments results demonstrate that LongNet yields strong performance on both long-sequence modeling and general language tasks. Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.

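Summary
In the era of large language models, scaling sequence length has become a critical demand. However, existing methods are limited by either computational complexity or model expressivity, which restricts the maximum sequence length. To address this, we introduce LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens without sacrificing performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet has the following advantages: 1) it has linear computation complexity and a logarithmic dependency between any two tokens in a sequence; 2) it can serve as a distributed trainer for extremely long sequences; 3) its dilated attention is a drop-in replacement for standard attention and integrates seamlessly with existing Transformer-based optimization. Experimental results show that LongNet performs strongly on both long-sequence modeling and general language tasks. Our work opens up new possibilities for modeling very long sequences, for example treating a whole corpus or even the entire Internet as a sequence.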
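
The abstract describes dilated attention only at a high level. One way to picture it, assuming the sequence is cut into fixed-length segments and every `dilation`-th token of each segment is kept (this parametrisation and the names `dilated_indices` and `attention_pairs` are assumptions for illustration), is to list which positions attend to each other and count the resulting query-key pairs:

```python
def dilated_indices(seq_len, segment_len, dilation):
    """Positions kept in each segment: every `dilation`-th token of the segment."""
    segments = []
    for start in range(0, seq_len, segment_len):
        end = min(start + segment_len, seq_len)
        segments.append(list(range(start, end, dilation)))
    return segments


def attention_pairs(seq_len, segment_len, dilation):
    """Number of query-key pairs when attention runs within each sparse segment."""
    return sum(len(seg) ** 2 for seg in dilated_indices(seq_len, segment_len, dilation))
```

For a fixed segment length and dilation, doubling the sequence length doubles the number of segments and therefore the number of pairs, which is consistent with the linear computation complexity stated in the abstract; dense attention over the same sequence would quadruple instead.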

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

paper_url: http://arxiv.org/abs/2307.02469
repo_url: https://github.com/bytedance/lynx-llm
paper_authors: Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, Tao Kong
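for: The paper systematically studies how to train GPT4-style large language models (LLMs) with multimodal inputs, to better understand what determines their ability to follow open-ended instructions given images.
methods: Over 20 variants are implemented under controlled settings, varying network structures, training data and sampling strategies, and the diversity of prompts.
results: Based on the findings, the authors present Lynx, which achieves the most accurate multimodal understanding while keeping the best multimodal generation ability among existing open-sourced GPT4-style models.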

Abstract
Recent advancements in Large Language Models (LLMs) such as GPT4 have displayed exceptional multi-modal capabilities in following open-ended instructions given images. However, the performance of these models heavily relies on design choices such as network structures, training data, and training strategies, and these choices have not been extensively discussed in the literature, making it difficult to quantify progress in this field. To address this issue, this paper presents a systematic and comprehensive study, quantitatively and qualitatively, on training such models. We implement over 20 variants with controlled settings. Concretely, for network structures, we compare different LLM backbones and model designs. For training data, we investigate the impact of data and sampling strategies. For instructions, we explore the influence of diversified prompts on the instruction-following ability of the trained models. For benchmarks, we contribute the first, to our best knowledge, comprehensive evaluation set including both image and video tasks through crowd-sourcing. Based on our findings, we present Lynx, which performs the most accurate multi-modal understanding while keeping the best multi-modal generation ability compared to existing open-sourced GPT4-style models.

Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models

paper_url: http://arxiv.org/abs/2308.01404
repo_url: https://github.com/aogara-ds/hoodwinked
paper_authors: Aidan O’Gara
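for: This paper asks whether current language models are capable of deception and lie detection. The author introduces a text-based game inspired by Mafia and Among Us and tests agents controlled by GPT-3, GPT-3.5, and GPT-4 in it.
methods: The study uses a natural-language text game in which one player is tasked with killing the others; after each murder, the surviving players hold a natural language discussion and vote to banish one player.
results: More advanced models are more effective killers, better able to deceive the other players and avoid being banished. This improvement comes not from different actions but from stronger persuasive skills during discussions.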

Abstract
Are current language models capable of deception and lie detection? We study this question by introducing a text-based game called $\textit{Hoodwinked}$, inspired by Mafia and Among Us. Players are locked in a house and must find a key to escape, but one player is tasked with killing the others. Each time a murder is committed, the surviving players have a natural language discussion then vote to banish one player from the game. We conduct experiments with agents controlled by GPT-3, GPT-3.5, and GPT-4 and find evidence of deception and lie detection capabilities. The killer often denies their crime and accuses others, leading to measurable effects on voting outcomes. More advanced models are more effective killers, outperforming smaller models in 18 of 24 pairwise comparisons. Secondary metrics provide evidence that this improvement is not mediated by different actions, but rather by stronger persuasive skills during discussions. To evaluate the ability of AI agents to deceive humans, we make this game publicly available at https://hoodwinked.ai/.

Exploring Continual Learning for Code Generation Models

paper_url: http://arxiv.org/abs/2307.02435
repo_url: None
paper_authors: Prateek Yadav, Qing Sun, Hantian Ding, Xiaopeng Li, Dejiao Zhang, Ming Tan, Xiaofei Ma, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Mohit Bansal, Bing Xiang
for: This paper focuses on the task of Continual Learning (CL) in the code domain, specifically addressing the issue of catastrophic forgetting in popular CL techniques when applied to coding tasks.
methods: The authors introduce a new benchmark called CodeTask-CL that covers a wide range of tasks and input/output programming languages, and compare popular CL techniques from NLP and Vision domains. They also propose a new method called Prompt Pooling with Teacher Forcing (PP-TF) to address the issue of catastrophic forgetting.
results: The authors achieve a 21.54% improvement over Prompt Pooling with their proposed method PP-TF, demonstrating the effectiveness of their approach in stabilizing training and improving performance on CL for code models.

Abstract
Large-scale code generation models such as Codex and CodeT5 have achieved impressive performance. However, libraries are upgraded or deprecated very frequently and re-training large-scale language models is computationally expensive. Therefore, Continual Learning (CL) is an important aspect that remains underexplored in the code domain. In this paper, we introduce a benchmark called CodeTask-CL that covers a wide range of tasks, including code generation, translation, summarization, and refinement, with different input and output programming languages. Next, on our CodeTask-CL benchmark, we compare popular CL techniques from NLP and Vision domains. We find that effective methods like Prompt Pooling (PP) suffer from catastrophic forgetting due to the unstable training of the prompt selection mechanism caused by stark distribution shifts in coding tasks. We address this issue with our proposed method, Prompt Pooling with Teacher Forcing (PP-TF), that stabilizes training by enforcing constraints on the prompt selection mechanism and leads to a 21.54% improvement over Prompt Pooling. Along with the benchmark, we establish a training pipeline that can be used for CL on code models, which we believe can motivate further development of CL methods for code models. Our code is available at https://github.com/amazon-science/codetaskcl-pptf
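
The prompt selection mechanism mentioned in the abstract can be illustrated with a small sketch. It assumes the usual prompt-pooling setup in which each prompt has a learned key and the prompts whose keys are most similar to the input query are selected. The `allowed` argument is a hypothetical stand-in for a training-time constraint on selection (here, a fixed set of prompt indices per task); it is an illustrative assumption, not the authors' confirmed implementation of PP-TF.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_prompts(query, keys, top_k, allowed=None):
    """Return the indices of the `top_k` prompt keys most similar to `query`.

    `allowed` (hypothetical) restricts the candidates to a fixed set of
    indices, mimicking a constrained selection step during training.
    """
    candidates = list(range(len(keys))) if allowed is None else list(allowed)
    ranked = sorted(candidates, key=lambda i: cosine(query, keys[i]), reverse=True)
    return ranked[:top_k]


# Toy pool of four 2-dimensional prompt keys and one query vector.
KEYS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
QUERY = [1.0, 0.1]

print(select_prompts(QUERY, KEYS, top_k=2))                  # [0, 2]
print(select_prompts(QUERY, KEYS, top_k=1, allowed=[1, 3]))  # [1]
```

Unconstrained selection depends entirely on where the query lands relative to the keys, so a sharp shift in the input distribution can change which prompts are picked; constraining the candidates removes that source of variation during training.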

Won’t Get Fooled Again: Answering Questions with False Premises

paper_url: http://arxiv.org/abs/2307.02394
repo_url: https://github.com/thunlp/falseqa
paper_authors: Shengding Hu, Yifan Luo, Huadong Wang, Xingyi Cheng, Zhiyuan Liu, Maosong Sun
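for: This paper examines pre-trained language models (PLMs) as backbones of question-answering systems, focusing on how they handle tricky questions.
methods: The paper uses false premise questions (FPQs) to probe PLMs, then fine-tunes them and replays general questions during training to improve their responses to FPQs.
results: The study finds that fine-tuning and replay improve how PLMs handle FPQs, and that the models can generate reasonable explanations that serve as rebuttals. These results suggest that PLMs already possess the knowledge needed to answer tricky questions and only need that ability to be stimulated.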

Abstract
Pre-trained language models (PLMs) have shown unprecedented potential in various fields, especially as the backbones for question-answering (QA) systems. However, they tend to be easily deceived by tricky questions such as "How many eyes does the sun have?". Such frailties of PLMs often allude to the lack of knowledge within them. In this paper, we find that the PLMs already possess the knowledge required to rebut such questions, and the key is how to activate the knowledge. To systematize this observation, we investigate the PLMs' responses to one kind of tricky questions, i.e., the false premises questions (FPQs). We annotate a FalseQA dataset containing 2365 human-written FPQs, with the corresponding explanations for the false premises and the revised true premise questions. Using FalseQA, we discover that PLMs are capable of discriminating FPQs by fine-tuning on moderate numbers (e.g., 256) of examples. PLMs also generate reasonable explanations for the false premise, which serve as rebuttals. Further replaying a few general questions during training allows PLMs to excel on FPQs and general questions simultaneously. Our work suggests that once the rebuttal ability is stimulated, knowledge inside the PLMs can be effectively utilized to handle FPQs, which incentivizes the research on PLM-based QA systems.
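
The replay step described in the abstract can be pictured as interleaving a few general questions into the FPQ fine-tuning stream. The sketch below is one simple way to do that, assuming a fixed interval (`replay_every`) between replayed examples. The function name, the interval, and the placeholder example strings are hypothetical, and the abstract does not state the actual mixing schedule.

```python
from itertools import cycle


def mix_with_replay(fpq_examples, general_examples, replay_every):
    """Yield FPQ examples, adding one general example after every
    `replay_every` FPQ examples. General examples are reused in a cycle."""
    if replay_every < 1:
        raise ValueError("replay_every must be at least 1")
    general = cycle(general_examples) if general_examples else None
    for count, example in enumerate(fpq_examples, start=1):
        yield example
        if general is not None and count % replay_every == 0:
            yield next(general)


# Placeholder identifiers stand in for real training examples.
stream = list(mix_with_replay(["fpq1", "fpq2", "fpq3", "fpq4"], ["gen1"], 2))
print(stream)  # ['fpq1', 'fpq2', 'gen1', 'fpq3', 'fpq4', 'gen1']
```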
