results: Experiments on multiple datasets (CLEF e-Health, TREC Total Recall, TREC Legal and RCV1) show that the proposed method significantly improves performance and outperforms several alternative approaches.
Abstract
Technology Assisted Review (TAR) stopping rules aim to reduce the cost of manually assessing documents for relevance by minimising the number of documents that need to be examined to ensure a desired level of recall. This paper extends an effective stopping rule using information derived from a text classifier that can be trained without the need for any additional annotation. Experiments on multiple data sets (CLEF e-Health, TREC Total Recall, TREC Legal and RCV1) showed that the proposed approach consistently improves performance and outperforms several alternative methods.
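The abstract's recall-targeted stopping idea can be sketched with classifier scores: treat the classifier's relevance probabilities on still-unreviewed documents as an estimate of how many relevant documents remain, and stop once estimated recall reaches the target. This is an illustrative sketch under that assumption, not the paper's exact rule.

```python
# Hypothetical sketch of a recall-targeted TAR stopping heuristic.
# `unreviewed_scores` are classifier relevance probabilities for the
# documents not yet manually assessed (an assumption for illustration).

def should_stop(found_relevant, unreviewed_scores, target_recall=0.95):
    """Stop reviewing once the estimated recall reaches the target.

    found_relevant    -- relevant documents found so far by reviewers
    unreviewed_scores -- relevance probabilities for unreviewed documents
    """
    expected_remaining = sum(unreviewed_scores)      # E[# relevant left]
    estimated_total = found_relevant + expected_remaining
    if estimated_total == 0:
        return True
    estimated_recall = found_relevant / estimated_total
    return estimated_recall >= target_recall
```

For example, with 95 relevant documents found and ten unreviewed documents each scored 0.5, estimated recall is 95/100 and a 0.95 target is met.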
Assertion Enhanced Few-Shot Learning: Instructive Technique for Large Language Models to Generate Educational Explanations
results: A comparison study with 12 in-service teachers shows that Assertion Enhanced Few-Shot Learning improves explanation accuracy by 15% and yields higher-quality explanations that teachers rate as more educator-friendly.
Abstract
Human educators possess an intrinsic ability to anticipate and seek educational explanations from students, which drives them to pose thought-provoking questions when students cannot articulate these explanations independently. We aim to imbue Intelligent Tutoring Systems with this ability using the few-shot learning capability of Large Language Models. Our work proposes a novel prompting technique, Assertion Enhanced Few-Shot Learning, to facilitate the generation of accurate, detail-oriented educational explanations. Our central hypothesis is that, in the educational domain, few-shot demonstrations are necessary but not a sufficient condition for quality explanation generation. We conducted a study involving 12 in-service teachers, comparing our approach to Traditional Few-Shot Learning. The results show that Assertion Enhanced Few-Shot Learning improves explanation accuracy by 15% and yields higher-quality explanations, as evaluated by teachers. We also conduct a qualitative ablation study to assess the impact of assertions and to provide educator-friendly prompting guidelines for generating explanations in their domain of interest.
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment
results: For the image-text alignment detection task, fine-tuning vision-language models enables them to describe misalignments in detail and visually indicate them within images, outperforming strong baselines.
Abstract
While existing image-text alignment models reach high quality binary assessments, they fall short of pinpointing the exact source of misalignment. In this paper, we present a method to provide detailed textual and visual explanation of detected misalignments between text-image pairs. We leverage large language models and visual grounding models to automatically construct a training set that holds plausible misaligned captions for a given image and corresponding textual explanations and visual indicators. We also publish a new human curated test set comprising ground-truth textual and visual misalignment annotations. Empirical results show that fine-tuning vision language models on our training set enables them to articulate misalignments and visually indicate them within images, outperforming strong baselines both on the binary alignment classification and the explanation generation tasks. Our method code and human curated test set are available at: https://mismatch-quest.github.io/
Understanding Environmental Posts: Sentiment and Emotion Analysis of Social Media Data
for: This study aims to analyze the public perception of climate change and the environment through social media data from 2014 to 2023, in order to provide insights that can help raise awareness and inform environmental interventions.
methods: The study uses the Pointwise Mutual Information (PMI) algorithm to identify sentiment and explore prevailing emotions expressed within environmental tweets on Twitter, Reddit, and YouTube. The accuracy of the algorithm was compared to human annotation and expert rating.
results: The study finds that negative environmental tweets are more common than positive or neutral ones, with climate change, air quality, emissions, plastic, and recycling being the most discussed topics. The most common emotions in environmental tweets are fear, trust, and anticipation, demonstrating the complex and wide-ranging nature of public reactions to environmental issues.
Abstract
Social media is now the predominant source of information due to the availability of immediate public response. As a result, social media data has become a valuable resource for comprehending public sentiments. Studies have shown that it can amplify ideas and influence public sentiments. This study analyzes the public perception of climate change and the environment over a decade from 2014 to 2023. Using the Pointwise Mutual Information (PMI) algorithm, we identify sentiment and explore prevailing emotions expressed within environmental tweets across various social media platforms, namely Twitter, Reddit, and YouTube. Accuracy on a human-annotated dataset was 0.65, higher than the VADER score but lower than that of an expert rater (0.90). Our findings suggest that negative environmental tweets are far more common than positive or neutral ones. Climate change, air quality, emissions, plastic, and recycling are the most discussed topics on all social media platforms, highlighting their huge global concern. The most common emotions in environmental tweets are fear, trust, and anticipation, demonstrating the wide-ranging and complex nature of public reactions. By identifying patterns and trends in opinions related to the environment, we hope to provide insights that can help raise awareness regarding environmental issues, inform the development of interventions, and adapt further actions to meet environmental challenges.
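The PMI scoring the abstract names can be sketched in its standard form: PMI(word, class) = log2(P(word, class) / (P(word) · P(class))), with a word's sentiment orientation taken as PMI(word, positive) minus PMI(word, negative). The counting scheme below is the generic technique, not necessarily the paper's exact implementation.

```python
import math
from collections import Counter

def pmi_orientation(word, labeled_docs):
    """Sentiment orientation of `word` via PMI over labeled documents.

    labeled_docs -- list of (tokens, label) pairs, label in {'pos', 'neg'}
    Returns PMI(word, 'pos') - PMI(word, 'neg'); positive means the word
    co-occurs more with positive documents.
    """
    n = len(labeled_docs)
    word_count = sum(1 for toks, _ in labeled_docs if word in toks)
    class_count = Counter(label for _, label in labeled_docs)
    joint = Counter(label for toks, label in labeled_docs if word in toks)

    def pmi(label):
        p_joint = joint[label] / n
        p_word = word_count / n
        p_class = class_count[label] / n
        if p_joint == 0 or p_word == 0 or p_class == 0:
            return 0.0  # undefined PMI treated as neutral
        return math.log2(p_joint / (p_word * p_class))

    return pmi('pos') - pmi('neg')
```

On a toy corpus where "good" appears only in positive documents, the orientation comes out positive, and symmetrically negative for "bad".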
LLMs for Multi-Modal Knowledge Extraction and Analysis in Intelligence/Safety-Critical Applications
results: The paper finds that large language models have numerous vulnerabilities and limitations that require careful assessment and mitigation before they are applied to intelligence and safety-critical applications.
Abstract
Large Language Models have seen rapid progress in capability in recent years; this progress has been accelerating and their capabilities, measured by various benchmarks, are beginning to approach those of humans. There is a strong demand to use such models in a wide variety of applications but, due to unresolved vulnerabilities and limitations, great care needs to be taken before applying them to intelligence and safety-critical applications. This paper reviews recent literature related to LLM assessment and vulnerabilities to synthesize the current research landscape and to help understand what advances are most critical to enable the use of these technologies in intelligence and safety-critical applications. The vulnerabilities are broken down into ten high-level categories and overlaid onto a high-level life cycle of an LLM. Some general categories of mitigations are reviewed.
Describing Differences in Image Sets with Natural Language
results: VisDiff automatically describes differences between image sets and is applied across domains: comparing datasets (e.g., ImageNet vs. ImageNetV2), comparing classification models (e.g., zero-shot CLIP vs. supervised ResNet), summarizing model failure modes (supervised ResNet), characterizing differences between generative models (e.g., StableDiffusionV1 and V2), and discovering what makes images memorable. Using VisDiff, the authors find interesting and previously unknown differences, demonstrating its utility in revealing nuanced insights.
Abstract
How do two sets of images differ? Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets, yet manually sifting through thousands of images is impractical. To aid in this discovery process, we explore the task of automatically describing the differences between two $\textbf{sets}$ of images, which we term Set Difference Captioning. This task takes in image sets $D_A$ and $D_B$, and outputs a description that is more often true on $D_A$ than $D_B$. We outline a two-stage approach that first proposes candidate difference descriptions from image sets and then re-ranks the candidates by checking how well they can differentiate the two sets. We introduce VisDiff, which first captions the images and prompts a language model to propose candidate descriptions, then re-ranks these descriptions using CLIP. To evaluate VisDiff, we collect VisDiffBench, a dataset with 187 paired image sets with ground truth difference descriptions. We apply VisDiff to various domains, such as comparing datasets (e.g., ImageNet vs. ImageNetV2), comparing classification models (e.g., zero-shot CLIP vs. supervised ResNet), summarizing model failure modes (supervised ResNet), characterizing differences between generative models (e.g., StableDiffusionV1 and V2), and discovering what makes images memorable. Using VisDiff, we are able to find interesting and previously unknown differences in datasets and models, demonstrating its utility in revealing nuanced insights.
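VisDiff's re-ranking stage can be sketched as scoring each candidate description by how much more it matches set $D_A$ than set $D_B$ on average. Here `sim` is a placeholder for a description-image similarity function (CLIP in the paper); this is an assumed interface for illustration, not the released code.

```python
# Sketch of set-difference re-ranking: rank candidate descriptions by the
# gap between their mean similarity to set A and to set B.
# `sim(description, image)` is a hypothetical callable standing in for CLIP.

def rerank_descriptions(candidates, set_a, set_b, sim):
    def separation(desc):
        score_a = sum(sim(desc, img) for img in set_a) / len(set_a)
        score_b = sum(sim(desc, img) for img in set_b) / len(set_b)
        return score_a - score_b

    # Best-separating description first.
    return sorted(candidates, key=separation, reverse=True)
```

With a toy substring-match similarity, a description true of set A but not set B is ranked first.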
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
results: The paper proposes Visual Program Distillation (VPD), an instruction tuning framework that solves complex visual tasks in a single forward pass. VPD uses an LLM to sample multiple candidate programs, which are executed and verified to identify a correct one; each correct program is translated into a language description of the reasoning steps and distilled into a VLM. Experiments show that VPD improves VLMs' spatial understanding, counting, and compositional reasoning. The VPD-trained PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance on complex vision tasks including MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, and Hateful Memes. Human annotators also confirm that VPD improves response factuality and consistency. Finally, experiments show that VPD helps adaptation to real-world applications with limited data.
Abstract
Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and also retrieving prior knowledge. Recent work shows promise by decomposing such tasks using a large language model (LLM) into an executable program that invokes specialized vision models. However, generated programs are error-prone: they omit necessary steps, include spurious ones, and are unable to recover when the specialized models give incorrect outputs. Moreover, they require loading multiple models, incurring high latency and computation costs. We propose Visual Program Distillation (VPD), an instruction tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. VPD distills the reasoning ability of LLMs by using them to sample multiple candidate programs, which are then executed and verified to identify a correct one. It translates each correct program into a language description of the reasoning steps, which are then distilled into a VLM. Extensive experiments show that VPD improves the VLM's ability to count, understand spatial relations, and reason compositionally. Our VPD-trained PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance across complex vision tasks, including MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, and Hateful Memes. An evaluation with human annotators also confirms that VPD improves model response factuality and consistency. Finally, experiments on content moderation demonstrate that VPD is also helpful for adaptation to real-world applications with limited data.
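The "sample, execute, verify" step of the pipeline above can be sketched abstractly: draw candidate programs from the LLM and keep the first whose execution matches the known answer. `sample_program` and `execute` are placeholder callables assumed for illustration.

```python
# Hypothetical sketch of VPD's program verification step: sample up to k
# candidate programs and return the first one whose execution result
# matches the ground-truth answer.

def find_verified_program(question, answer, sample_program, execute, k=5):
    """Return the first of up to k sampled programs that verifies, else None.

    sample_program -- callable: question -> candidate program (LLM-backed)
    execute        -- callable: program -> result of running it
    """
    for _ in range(k):
        program = sample_program(question)
        if execute(program) == answer:
            return program
    return None
```

In the full pipeline, the verified program would then be translated into a chain-of-thought description and distilled into the VLM.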
Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models
paper_authors: Xinyu Zhang, Sebastian Hofstätter, Patrick Lewis, Raphael Tang, Jimmy Lin
For: The paper aims to build effective listwise rerankers without any dependence on GPT models, addressing the concern of a single point of failure and improving scientific reproducibility.
Methods: The authors build listwise rerankers on open-source large language models (LLMs) without relying on GPT models, and conduct passage retrieval experiments to evaluate the effectiveness of their approach.
Results: Their best listwise reranker surpasses listwise rerankers based on GPT-3.5 by 13% and achieves 97% of the effectiveness of those built on GPT-4. However, they find that existing training datasets are insufficient for building such listwise rerankers, highlighting the need for high-quality listwise ranking data resources.
Abstract
Listwise rerankers based on large language models (LLM) are the zero-shot state-of-the-art. However, current works in this direction all depend on the GPT models, making it a single point of failure in scientific reproducibility. Moreover, it raises the concern that the current research findings only hold for GPT models but not LLM in general. In this work, we lift this pre-condition and build for the first time effective listwise rerankers without any form of dependency on GPT. Our passage retrieval experiments show that our best listwise reranker surpasses the listwise rerankers based on GPT-3.5 by 13% and achieves 97% effectiveness of the ones built on GPT-4. Our results also show that the existing training datasets, which were expressly constructed for pointwise ranking, are insufficient for building such listwise rerankers. Instead, high-quality listwise ranking data is required and crucial, calling for further work on building human-annotated listwise data resources.
Concept Drift Adaptation in Text Stream Mining Settings: A Comprehensive Review
results: The review identifies the concept drift problem in text stream mining settings and unravels aspects of existing solutions, including types of text drift detection, model update mechanisms, and text representation update mechanisms.
Abstract
Due to the advent and increase in the popularity of the Internet, people have been producing and disseminating textual data in several ways, such as reviews, social media posts, and news articles. As a result, numerous researchers have been working on discovering patterns in textual data, especially because social media posts function as social sensors, indicating people's opinions, interests, etc. However, most tasks regarding natural language processing are addressed using traditional machine learning methods and static datasets. This setting can lead to several problems, such as an outdated dataset, which may not correspond to reality, and an outdated model, which has its performance degrading over time. Concept drift is another aspect that emphasizes these issues, which corresponds to data distribution and pattern changes. In a text stream scenario, it is even more challenging due to its characteristics, such as the high speed and data arriving sequentially. In addition, models for this type of scenario must adhere to the constraints mentioned above while learning from the stream by storing texts for a limited time and consuming low memory. In this study, we performed a systematic literature review regarding concept drift adaptation in text stream scenarios. Considering well-defined criteria, we selected 40 papers to unravel aspects such as text drift categories, types of text drift detection, model update mechanism, the addressed stream mining tasks, types of text representations, and text representation update mechanism. In addition, we discussed drift visualization and simulation and listed real-world datasets used in the selected papers. Therefore, this paper comprehensively reviews the concept drift adaptation in text stream mining scenarios.
Can a Tabula Recta provide security in the XXI century?
results: Through computer-based statistical analysis, the paper shows that these human-computable methods, intended for encryption after a total compromise of computers, can provide sufficient security.
Abstract
In the not so unlikely scenario of total compromise of computers accessible to a group of users, they might be tempted to resort to human-computable paper-and-pencil cryptographic methods aided by a classic Tabula Recta, which helps to perform addition and subtraction directly with letters. But do these classic algorithms, or some new ones using the same simple tools, have any chance against computer-aided cryptanalysis? In this paper I discuss how some human-computable algorithms can indeed afford sufficient security in this situation, drawing conclusions from computer-based statistical analysis. Three kinds of algorithms are discussed: those that concentrate entropy from shared text sources, stream ciphers based on arithmetic of non-binary spaces, and hash-like algorithms that may be used to generate a password from a challenge text.
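The letter addition and subtraction a Tabula Recta tabulates is, under the usual A=0 … Z=25 convention, just arithmetic mod 26; the stream-cipher use the abstract alludes to adds keystream letters to plaintext letters. A minimal sketch:

```python
# Letter arithmetic mod 26 (A=0 ... Z=25), as read off a Tabula Recta,
# plus a one-time-pad-style stream cipher built on it.

A = ord('A')

def add_letters(x, y):
    """Tabula Recta lookup: row x, column y."""
    return chr((ord(x) + ord(y) - 2 * A) % 26 + A)

def sub_letters(x, y):
    """Inverse lookup: which column of row y contains x."""
    return chr((ord(x) - ord(y)) % 26 + A)

def encrypt(plaintext, keystream):
    return ''.join(add_letters(p, k) for p, k in zip(plaintext, keystream))

def decrypt(ciphertext, keystream):
    return ''.join(sub_letters(c, k) for c, k in zip(ciphertext, keystream))
```

For example, `encrypt('HELLO', 'XMCKL')` gives `'EQNVZ'`, and subtracting the same keystream recovers the plaintext; the security question the paper studies is how the keystream itself is generated by hand.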
Weakly Supervised Detection of Hallucinations in LLM Activations
results: The results confirm that BERT's internal capacity for encoding hallucinations is limited, while OPT appears capable of encoding hallucination information internally. The detection approach, without prior knowledge of false statements, performs comparably to a fully supervised out-of-distribution classifier.
Abstract
We propose an auditing method to identify whether a large language model (LLM) encodes patterns such as hallucinations in its internal states, which may propagate to downstream tasks. We introduce a weakly supervised auditing technique using a subset scanning approach to detect anomalous patterns in LLM activations from pre-trained models. Importantly, our method does not need knowledge of the type of patterns a-priori. Instead, it relies on a reference dataset devoid of anomalies during testing. Further, our approach enables the identification of pivotal nodes responsible for encoding these patterns, which may offer crucial insights for fine-tuning specific sub-networks for bias mitigation. We introduce two new scanning methods to handle LLM activations for anomalous sentences that may deviate from the expected distribution in either direction. Our results confirm prior findings of BERT's limited internal capacity for encoding hallucinations, while OPT appears capable of encoding hallucination information internally. Importantly, our scanning approach, without prior exposure to false statements, performs comparably to a fully supervised out-of-distribution classifier.
Large Language Models on Graphs: A Comprehensive Survey
results: The survey draws on a range of experiments and application scenarios and collects related open-source code and benchmark datasets. Future directions include better integration of graph structure with textual information, improving model performance, and extending to more application scenarios.
Abstract
Large language models (LLMs), such as ChatGPT and LLaMA, are creating significant advancements in natural language processing, due to their strong text encoding/decoding ability and newly found emergent capability (e.g., reasoning). While LLMs are mainly designed to process pure texts, there are many real-world scenarios where text data are associated with rich structure information in the form of graphs (e.g., academic networks, and e-commerce networks) or scenarios where graph data are paired with rich textual information (e.g., molecules with descriptions). Besides, although LLMs have shown their pure text-based reasoning ability, it is underexplored whether such ability can be generalized to graph scenarios (i.e., graph-based reasoning). In this paper, we provide a systematic review of scenarios and techniques related to large language models on graphs. We first summarize potential scenarios of adopting LLMs on graphs into three categories, namely pure graphs, text-rich graphs, and text-paired graphs. We then discuss detailed techniques for utilizing LLMs on graphs, including LLM as Predictor, LLM as Encoder, and LLM as Aligner, and compare the advantages and disadvantages of different schools of models. Furthermore, we mention the real-world applications of such methods and summarize open-source codes and benchmark datasets. Finally, we conclude with potential future research directions in this fast-growing field. The related source can be found at https://github.com/PeterGriffinJin/Awesome-Language-Model-on-Graphs.
Scaling Laws for Adversarial Attacks on Language Model Activations
results: The authors find that, across language models, manipulating activations can control the predictions of up to 1000 subsequent tokens, and observe a scaling law in which the maximum number of controlled target tokens grows linearly with the number of controlled activations. They also find that attack resistance, the number of input bits needed to control one output bit, is remarkably constant across models, and that one bit of input steered via either activations or tokens controls a similar number of output bits. These results support the dimensionality-mismatch hypothesis and expose a new attack surface for adversarial attacks on language models.
Abstract
We explore a class of adversarial attacks targeting the activations of language models. By manipulating a relatively small subset of model activations, $a$, we demonstrate the ability to control the exact prediction of a significant number (in some cases up to 1000) of subsequent tokens $t$. We empirically verify a scaling law where the maximum number of target tokens $t_\mathrm{max}$ predicted depends linearly on the number of tokens $a$ whose activations the attacker controls as $t_\mathrm{max} = \kappa a$. We find that the number of bits of control in the input space needed to control a single bit in the output space (what we call attack resistance $\chi$) is remarkably constant between $\approx 16$ and $\approx 25$ over 2 orders of magnitude of model sizes for different language models. Compared to attacks on tokens, attacks on activations are predictably much stronger, however, we identify a surprising regularity where one bit of input steered either via activations or via tokens is able to exert control over a similar amount of output bits. This gives support for the hypothesis that adversarial attacks are a consequence of dimensionality mismatch between the input and output spaces. A practical implication of the ease of attacking language model activations instead of tokens is for multi-modal and selected retrieval models, where additional data sources are added as activations directly, sidestepping the tokenized input. This opens up a new, broad attack surface. By using language models as a controllable test-bed to study adversarial attacks, we were able to experiment with input-output dimensions that are inaccessible in computer vision, especially where the output dimension dominates.
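The reported scaling law $t_\mathrm{max} = \kappa a$ is a line through the origin, so $\kappa$ can be estimated from (activations controlled, tokens controlled) pairs by a one-parameter least-squares fit. The sketch below is purely illustrative; the data in the usage example are made up, not the paper's measurements.

```python
# Least-squares slope through the origin for the scaling law
# t_max = kappa * a: minimize sum_i (t_i - kappa * a_i)^2 over kappa,
# which gives kappa = sum(a_i * t_i) / sum(a_i^2).

def fit_kappa(a_values, t_values):
    """a_values: numbers of controlled activations; t_values: observed
    maximum numbers of controlled target tokens."""
    num = sum(a * t for a, t in zip(a_values, t_values))
    den = sum(a * a for a in a_values)
    return num / den
```

On exactly linear made-up data such as (1, 3), (2, 6), (4, 12) the fit recovers the slope 3.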
Compositional Generalization for Data-to-Text Generation
results: The model outperforms T5 baselines on all evaluation metrics, most notably achieving a 31% improvement on a metric focused on maintaining faithfulness to the input.
Abstract
Data-to-text generation involves transforming structured data, often represented as predicate-argument tuples, into coherent textual descriptions. Despite recent advances, systems still struggle when confronted with unseen combinations of predicates, producing unfaithful descriptions (e.g. hallucinations or omissions). We refer to this issue as compositional generalisation, and it encouraged us to create a benchmark for assessing the performance of different approaches on this specific problem. Furthermore, we propose a novel model that addresses compositional generalization by clustering predicates into groups. Our model generates text in a sentence-by-sentence manner, relying on one cluster of predicates at a time. This approach significantly outperforms T5 baselines across all evaluation metrics. Notably, it achieved a 31% improvement over T5 in terms of a metric focused on maintaining faithfulness to the input.
Towards Measuring Representational Similarity of Large Language Models
results: The results suggest that some LLMs are substantially different from others in their representations. The study also identifies challenges of using representational similarity measures, suggesting that similarity scores require careful study to avoid false conclusions.
Abstract
Understanding the similarity of the numerous released large language models (LLMs) has many uses, e.g., simplifying model selection, detecting illegal model reuse, and advancing our understanding of what makes LLMs perform well. In this work, we measure the similarity of representations of a set of LLMs with 7B parameters. Our results suggest that some LLMs are substantially different from others. We identify challenges of using representational similarity measures that suggest the need of careful study of similarity scores to avoid false conclusions.
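The abstract does not name which similarity measure it uses; one widely used representational-similarity measure is linear CKA (centered kernel alignment), sketched below as an illustrative stand-in rather than the authors' exact method. It compares two activation matrices collected on the same inputs and is invariant to rotation and isotropic scaling of either representation.

```python
import numpy as np

# Linear CKA between two activation matrices X, Y of shape
# (n_samples, dim) gathered on the same inputs. Returns a value in [0, 1];
# 1 means the (centered) representations are identical up to rotation/scale.

def linear_cka(X, Y):
    X = X - X.mean(axis=0)   # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    norm_x = np.linalg.norm(X.T @ X, 'fro')
    norm_y = np.linalg.norm(Y.T @ Y, 'fro')
    return hsic / (norm_x * norm_y)
```

A representation compared with itself, or with a rescaled copy of itself, scores 1; unrelated representations score lower.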
Prompt Optimization via Adversarial In-Context Learning
paper_authors: Xuan Long Do, Yiran Zhao, Hannah Brown, Yuxi Xie, James Xu Zhao, Nancy F. Chen, Kenji Kawaguchi, Michael Qizhe Xie, Junxian He
For: Optimizing prompts for in-context learning (ICL) using one LLM as a generator, another as a discriminator, and a third as a prompt modifier.
Methods: As in traditional adversarial learning, the generator tries to produce output realistic enough to fool the discriminator. In each round, given an input prefixed by task instructions and several exemplars, the generator produces an output, and the discriminator classifies the generator's input-output pair as model-generated or real data. Based on the discriminator loss, the prompt modifier proposes candidate edits to the generator and discriminator prompts, and the edits that most improve the adversarial loss are selected.
Results: Compared to state-of-the-art prompt optimization techniques, adv-ICL achieves significant improvements on 11 generation and classification tasks, including summarization, arithmetic reasoning, machine translation, data-to-text generation, and the MMLU and BIG-bench hard benchmarks. Because the method uses pre-trained models and updates only prompts rather than model parameters, it is computationally efficient, easy to extend to any LLM and task, and effective in low-resource settings.
Abstract
We propose a new method, Adversarial In-Context Learning (adv-ICL), to optimize prompt for in-context learning (ICL) by employing one LLM as a generator, another as a discriminator, and a third as a prompt modifier. As in traditional adversarial learning, adv-ICL is implemented as a two-player game between the generator and discriminator, where the generator tries to generate realistic enough output to fool the discriminator. In each round, given an input prefixed by task instructions and several exemplars, the generator produces an output. The discriminator is then tasked with classifying the generator input-output pair as model-generated or real data. Based on the discriminator loss, the prompt modifier proposes possible edits to the generator and discriminator prompts, and the edits that most improve the adversarial loss are selected. We show that adv-ICL results in significant improvements over state-of-the-art prompt optimization techniques for both open and closed-source models on 11 generation and classification tasks including summarization, arithmetic reasoning, machine translation, data-to-text generation, and the MMLU and big-bench hard benchmarks. In addition, because our method uses pre-trained models and updates only prompts rather than model parameters, it is computationally efficient, easy to extend to any LLM and task, and effective in low-resource settings.
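One round of the two-player game described above can be sketched with the three LLM roles abstracted as callables. `generate`, `discriminate`, and `propose_edits` are hypothetical interfaces assumed for illustration, not the paper's actual API.

```python
from collections import namedtuple

# A candidate prompt edit and its (estimated) improvement to the
# adversarial loss, as proposed by the prompt-modifier LLM.
Edit = namedtuple('Edit', 'gen_prompt dis_prompt score')

def adv_icl_round(gen_prompt, dis_prompt, inputs,
                  generate, discriminate, propose_edits):
    """One adv-ICL round: generate outputs, score the discriminator,
    and keep the prompt edit with the best adversarial score.

    generate      -- callable: (prompt, input) -> output
    discriminate  -- callable: (prompt, input, output) -> loss in [0, 1]
    propose_edits -- callable: (gen_prompt, dis_prompt, loss) -> [Edit, ...]
    """
    outputs = [generate(gen_prompt, x) for x in inputs]
    loss = sum(discriminate(dis_prompt, x, y)
               for x, y in zip(inputs, outputs)) / len(inputs)
    best = max(propose_edits(gen_prompt, dis_prompt, loss),
               key=lambda e: e.score)
    return best.gen_prompt, best.dis_prompt
```

Only the prompts are updated between rounds; all three model's parameters stay frozen, which is what makes the method cheap.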
Text Intimacy Analysis using Ensembles of Multilingual Transformers
results: The results show that an ensemble of multilingual models combined with language-specific monolingual models improves predictive performance, as do data augmentation methods. A thorough analysis of the results yields some noteworthy insights into the intimacy prediction problem.
Abstract
Intimacy estimation of a given text has recently gained importance due to the increase in direct interaction of NLP systems with humans. Intimacy is an important aspect of natural language and has a substantial impact on our everyday communication. Thus the level of intimacy can provide us with deeper insights and richer semantics of conversations. In this paper, we present our work on the SemEval shared task 9 on predicting the level of intimacy for the given text. The dataset consists of tweets in ten languages, out of which only six are available in the training dataset. We conduct several experiments and show that an ensemble of multilingual models along with a language-specific monolingual model has the best performance. We also evaluate other data augmentation methods such as translation and present the results. Lastly, we study the results thoroughly and present some noteworthy insights into this problem.
Empathy and Distress Detection using Ensembles of Transformer Models
results: The final submission has a Pearson's r score of 0.346, placing the team third in the empathy and distress detection subtask.
Abstract
This paper presents our approach for the WASSA 2023 Empathy, Emotion and Personality Shared Task. Empathy and distress are human feelings that are implicitly expressed in natural discourses. Empathy and distress detection are crucial challenges in Natural Language Processing that can aid our understanding of conversations. The provided dataset consists of several long-text examples in the English language, with each example associated with a numeric score for empathy and distress. We experiment with several BERT-based models as a part of our approach. We also try various ensemble methods. Our final submission has a Pearson's r score of 0.346, placing us third in the empathy and distress detection subtask.
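The subtask above is scored with Pearson's r between predicted and gold empathy/distress scores. A minimal implementation of that metric (the standard formula, not the shared task's official scorer) is:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A score of 0.346 therefore indicates a moderate positive correlation between the model's predicted empathy/distress values and the human-annotated ones.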
ULMA: Unified Language Model Alignment with Demonstration and Point-wise Human Preference
results: For point-wise preference data, a preference learning method called point-wise DPO is proposed, and a unified framework is constructed by connecting supervised fine-tuning with point-wise preference learning. Experiments show that the proposed methods achieve superior performance and efficiency on point-wise datasets. The authors also construct a high-quality demonstration dataset to support further research and application.
Abstract
Language model alignment is a cutting-edge technique in large language model training to align the model output to the user's intent, e.g., being helpful and harmless. The recent alignment framework consists of two steps: supervised fine-tuning with demonstration data and preference learning with human preference data. Previous preference learning methods, such as RLHF and DPO, mainly focus on pair-wise preference data. However, in many real-world scenarios where human feedback is intrinsically point-wise, these methods will suffer from information loss or even fail. To fill this gap, in this paper, we first develop a preference learning method called point-wise DPO to tackle point-wise preference data. Further revelation of the connection between supervised fine-tuning and point-wise preference learning enables us to develop a unified framework for both human demonstration and point-wise preference data, which sheds new light on the construction of preference datasets. Extensive experiments on point-wise datasets with binary or continuous labels demonstrate the superior performance and efficiency of our proposed methods. A new dataset with high-quality demonstration samples on harmlessness is constructed and made publicly available.
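The point-wise DPO idea, learning from individually labeled responses rather than pairs, can be sketched as a per-example logistic loss on the policy-vs-reference log-probability margin. This is an illustrative reconstruction under stated assumptions, not the paper's exact objective; the loss form and the `beta` temperature are assumptions.

```python
import math

def pointwise_dpo_loss(logp_policy, logp_ref, label, beta=0.1):
    """Per-example loss for point-wise preference data (illustrative sketch).

    logp_policy: log-probability of the response under the policy model.
    logp_ref:    log-probability under the frozen reference model.
    label:       1 if the response is preferred (good), 0 if dispreferred.
    """
    margin = beta * (logp_policy - logp_ref)
    # Push the margin up for good responses, down for bad ones.
    if label == 1:
        return -math.log(1.0 / (1.0 + math.exp(-margin)))
    return -math.log(1.0 / (1.0 + math.exp(margin)))
```

Unlike pair-wise DPO, each training example carries its own binary (or continuous) label, so no paired chosen/rejected responses are required, which matches the point-wise feedback scenario the abstract describes.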
DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding
results: Extensive experiments on four public temporal language grounding datasets demonstrate that our method significantly outperforms the baselines.
Abstract
Temporal Language Grounding seeks to localize video moments that semantically correspond to a natural language query. Recent advances employ the attention mechanism to learn the relations between video moments and the text query. However, naive attention might not be able to appropriately capture such relations, resulting in ineffective distributions where target video moments are difficult to separate from the remaining ones. To resolve the issue, we propose an energy-based model framework to explicitly learn moment-query distributions. Moreover, we propose DemaFormer, a novel Transformer-based architecture that utilizes exponential moving average with a learnable damping factor to effectively encode moment-query inputs. Comprehensive experiments on four public temporal language grounding datasets showcase the superiority of our methods over the state-of-the-art baselines.
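The damped exponential moving average at the core of DemaFormer can be sketched as follows. In the paper the damping factor is learnable inside the Transformer layers; here it is a fixed scalar over a 1-D sequence for illustration only.

```python
def damped_ema(xs, alpha=0.5, damping=0.9):
    """Damped exponential moving average over a 1-D input sequence.

    y_t = alpha * x_t + (1 - alpha) * damping * y_{t-1}
    A damping factor < 1 shortens the memory of the plain EMA, letting the
    encoder weight recent moment-query inputs more heavily.
    """
    ys = []
    prev = 0.0
    for x in xs:
        prev = alpha * x + (1.0 - alpha) * damping * prev
        ys.append(prev)
    return ys
```

With `damping=1.0` this reduces to the ordinary EMA; smaller values decay the influence of earlier video moments faster.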
results: According to the evaluation results, the proposed approach performs competitively with or better than in-context learning baselines such as GPT-3 175B and InstructGPT 175B on few-shot topic classification tasks. Despite having 177 times fewer parameters, it reaches a comparable level, demonstrating its effectiveness.
Abstract
With the growing volume of diverse information, the demand for classifying arbitrary topics has become increasingly critical. To address this challenge, we introduce DRAFT, a simple framework designed to train a classifier for few-shot topic classification. DRAFT uses a few examples of a specific topic as queries to construct a customized dataset with a dense retriever model. A multi-query retrieval (MQR) algorithm, which effectively handles multiple queries related to a specific topic, is applied to construct the customized dataset. Subsequently, we fine-tune a classifier on the customized dataset to identify the topic. To demonstrate the efficacy of our proposed approach, we conduct evaluations on both widely used classification benchmark datasets and manually constructed datasets with 291 diverse topics, which simulate diverse contents encountered in real-world applications. DRAFT shows competitive or superior performance compared to baselines that use in-context learning, such as GPT-3 175B and InstructGPT 175B, on few-shot topic classification tasks despite having 177 times fewer parameters, demonstrating its effectiveness.
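The multi-query retrieval step, pooling documents retrieved for several example queries of one topic, can be sketched with cosine similarity over dense embeddings. The union-of-top-k strategy below is an assumption for illustration; the paper's MQR algorithm may combine the queries differently.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def multi_query_retrieve(query_vecs, doc_vecs, top_k=2):
    """Return indices of documents retrieved for ANY of the topic's queries.

    query_vecs: embeddings of the few example texts for one topic.
    doc_vecs:   embeddings of the candidate corpus.
    """
    selected = set()
    for q in query_vecs:
        ranked = sorted(range(len(doc_vecs)),
                        key=lambda i: cosine(q, doc_vecs[i]),
                        reverse=True)
        selected.update(ranked[:top_k])
    return sorted(selected)
```

The retrieved documents then serve as positive training examples for fine-tuning the topic classifier.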
MedDM:LLM-executable clinical guidance tree for clinical decision-making
for: This paper aims to address the issue of low specialization in current medical language models (LLMs) and provide a solution for LLMs to participate in clinical diagnosis decision-making.
methods: The authors propose a method for constructing a large-scale medical diagnostic decision-making dataset (MedDM) from flowcharts in clinical practice guidelines, and develop an approach for converting these flowcharts into standardized diagnostic decision trees. They also propose a method for reasoning on LLM-executable clinical guidance trees (CGT) and a Patient-LLM multi-turn dialogue framework.
results: The authors construct a knowledge base with 1,202 decision trees, covering 12 hospital departments and over 500 diseases, using medical literature and flowcharts. They also demonstrate the effectiveness of their approach through experiments using a Patient-LLM multi-turn dialogue framework.
Abstract
There is increasing emphasis on the importance of LLMs participating in clinical diagnosis decision-making. However, current medical LLMs suffer from low specialization: they cannot provide specific medical advice and behave more like a medical Q&A system. Moreover, there is no suitable clinical guidance tree dataset that can be used directly with LLMs. To address this issue, we first propose the LLM-executable clinical guidance tree (CGT), which can be used directly by large language models, and construct a medical diagnostic decision-making dataset (MedDM) from flowcharts in clinical practice guidelines. We propose an approach to screen flowcharts from the medical literature, followed by their identification and conversion into standardized diagnostic decision trees. We construct a knowledge base of 1,202 decision trees, drawn from 5,000 medical articles and covering 12 hospital departments, including internal medicine, surgery, and psychiatry, and over 500 diseases. Moreover, we propose a method for reasoning on LLM-executable CGTs and a Patient-LLM multi-turn dialogue framework.
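Reasoning over an LLM-executable clinical guidance tree can be sketched as a traversal in which each internal node poses a question to the patient (here stubbed by a dictionary of canned answers) and branches on the reply until a leaf recommendation is reached. The node schema and the miniature tree below are hypothetical illustrations, not MedDM's actual format.

```python
def traverse_guidance_tree(node, ask):
    """Walk a guidance tree, asking questions until a leaf recommendation.

    node: dict with either {'question', 'branches'} or {'recommendation'}.
    ask:  callable mapping a question string to the patient's answer; in the
          paper's setting this role is played by the Patient-LLM dialogue.
    """
    while 'recommendation' not in node:
        answer = ask(node['question'])
        node = node['branches'][answer]
    return node['recommendation']

# Hypothetical miniature tree for illustration only
tree = {
    'question': 'Do you have a fever?',
    'branches': {
        'yes': {'recommendation': 'order blood test'},
        'no': {
            'question': 'Do you have a cough?',
            'branches': {
                'yes': {'recommendation': 'chest X-ray'},
                'no': {'recommendation': 'observe at home'},
            },
        },
    },
}
```

In the multi-turn dialogue framework, the `ask` callable would be replaced by an LLM turn that phrases the node's question and parses the patient's free-text reply into a branch label.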
Protein Language Model-Powered 3D Ligand Binding Site Prediction from Protein Sequence
results: Experimental results show that, without any 3D protein structure information, the LaMPSite method achieves competitive performance in predicting ligand binding sites on proteins compared to baseline methods that require such structures. This suggests that LaMPSite can provide new opportunities for drug discovery when protein structure information is unavailable.
Abstract
Prediction of ligand binding sites of proteins is a fundamental and important task for understanding the function of proteins and screening potential drugs. Most existing methods require experimentally determined protein holo-structures as input. However, such structures can be unavailable for novel or less-studied proteins. To tackle this limitation, we propose LaMPSite, which only takes protein sequences and ligand molecular graphs as input for ligand binding site predictions. The protein sequences are used to retrieve residue-level embeddings and contact maps from the pre-trained ESM-2 protein language model. The ligand molecular graphs are fed into a graph neural network to compute atom-level embeddings. Then we compute and update the protein-ligand interaction embedding based on the protein residue-level embeddings and ligand atom-level embeddings, and the geometric constraints in the inferred protein contact map and ligand distance map. A final pooling on the protein-ligand interaction embedding indicates which residues belong to the binding sites. Without any 3D coordinate information of proteins, our proposed model achieves competitive performance compared to baseline methods that require 3D protein structures when predicting binding sites. Given that less than 50% of proteins currently have reliable structure information, LaMPSite will provide new opportunities for drug discovery.
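The final step, pooling the protein-ligand interaction embedding into per-residue binding-site predictions, can be sketched as a dot-product interaction matrix followed by max-pooling over ligand atoms. The pooling choice and the score threshold below are assumptions for illustration, not LaMPSite's exact operators.

```python
def residue_binding_scores(residue_embs, atom_embs):
    """Score each protein residue by its strongest interaction with any ligand atom.

    residue_embs: list of residue embedding vectors (e.g., from a protein
                  language model such as ESM-2).
    atom_embs:    list of ligand atom embedding vectors (e.g., from a GNN).
    """
    scores = []
    for r in residue_embs:
        # Dot-product interaction of this residue with every ligand atom
        interactions = [sum(ri * ai for ri, ai in zip(r, a)) for a in atom_embs]
        scores.append(max(interactions))  # max-pool over ligand atoms
    return scores

def predict_binding_residues(residue_embs, atom_embs, threshold=0.5):
    """Indices of residues whose pooled interaction score exceeds the threshold."""
    scores = residue_binding_scores(residue_embs, atom_embs)
    return [i for i, s in enumerate(scores) if s > threshold]
```

In the full model, the interaction embeddings are additionally refined by the geometric constraints from the inferred contact map and ligand distance map before pooling.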
Efficient Online Data Mixing For Language Model Pre-Training
paper_authors: Alon Albalak, Liangming Pan, Colin Raffel, William Yang Wang
for: This paper proposes an efficient online data mixing method to improve the downstream performance of large language models.
methods: The method uses a multi-armed bandit algorithm to optimize data mixing proportions during training, adapting to changing training dynamics.
results: Compared with the next best method, the approach improves relative accuracy on the 5-shot MMLU benchmark by 1.9% and reaches the same final perplexity with 19% fewer training iterations, while adding negligible wall-clock time during pretraining.
Abstract
The data used to pretrain large language models has a decisive impact on a model's downstream performance, which has led to a large body of work on data selection methods that aim to automatically determine the most suitable data to use for pretraining. Existing data selection methods suffer from slow and computationally expensive processes, a problem amplified by the increasing size of models and of pretraining datasets. Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together and determining sampling probabilities across entire groups. However, data mixing proportions are typically fixed before training and therefore cannot adapt to changing training dynamics. To address these limitations, we develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing. Based on multi-armed bandit algorithms, our online approach optimizes the data mixing proportions during training. Remarkably, our method trains a model that reaches the final perplexity of the next best method with 19\% fewer training iterations, and improves performance on the 5-shot MMLU benchmark by 1.9% relative accuracy, while adding negligible wall-clock time during pretraining.
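The bandit-driven mixing step can be sketched with an EXP3-style update: each data domain is an arm, the sampling distribution gives the mixing proportions, and a per-step reward for the sampled domain updates that arm's weight. Using the training loss of the sampled batch as the reward (so harder domains get upweighted) is an assumption here; the paper's exact reward formulation is not reproduced.

```python
import math
import random

class Exp3DataMixer:
    """EXP3-style multi-armed bandit over data domains for online mixing."""

    def __init__(self, n_domains, gamma=0.1):
        self.n = n_domains
        self.gamma = gamma  # exploration rate
        self.weights = [1.0] * n_domains

    def proportions(self):
        """Current mixing proportions: exponential weights plus uniform exploration."""
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n
                for w in self.weights]

    def sample_domain(self, rng=random):
        """Draw the domain to sample the next training batch from."""
        r, acc = rng.random(), 0.0
        for i, p in enumerate(self.proportions()):
            acc += p
            if r < acc:
                return i
        return self.n - 1

    def update(self, domain, reward):
        """Importance-weighted exponential update for the chosen domain."""
        p = self.proportions()[domain]
        self.weights[domain] *= math.exp(self.gamma * reward / (p * self.n))
```

During pretraining, the loop would be: `d = mixer.sample_domain()`, train on a batch from domain `d`, then `mixer.update(d, reward)` with the observed reward, so the mixing proportions adapt to the changing training dynamics instead of staying fixed.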