cs.AI - 2023-11-16

The Analysis and Extraction of Structure from Organizational Charts

  • paper_url: http://arxiv.org/abs/2311.10234
  • repo_url: None
  • paper_authors: Nikhil Manali, David Doermann, Mahesh Desai
  • for: To provide an automated, end-to-end method for extracting information from organizational charts, addressing the difficulty and time cost of manual extraction.
  • methods: Uses computer vision, deep learning, and natural language processing techniques to automatically extract the information in org charts.
  • results: Proposes a metric for evaluating the completeness and hierarchical accuracy of the extracted information, and demonstrates the method's effectiveness experimentally.
    Abstract Organizational charts, also known as org charts, are critical representations of an organization's structure and the hierarchical relationships between its components and positions. However, manually extracting information from org charts can be error-prone and time-consuming. To solve this, we present an automated and end-to-end approach that uses computer vision, deep learning, and natural language processing techniques. Additionally, we propose a metric to evaluate the completeness and hierarchical accuracy of the extracted information. This approach has the potential to improve organizational restructuring and resource utilization by providing a clear and concise representation of the organizational structure. Our study lays a foundation for further research on the topic of hierarchical chart analysis.
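A minimal sketch of one plausible form of the hierarchical-accuracy metric described above: edge-level F1 over the extracted parent-child relations of an org chart. The paper's actual metric may be defined differently; the names and data here are illustrative.

```python
def hierarchy_f1(extracted_edges, gold_edges):
    """Edge-level F1 over (parent, child) pairs of an org chart."""
    extracted, gold = set(extracted_edges), set(gold_edges)
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(hierarchy_f1([("CEO", "CTO"), ("CEO", "CFO")],
                   [("CEO", "CTO"), ("CEO", "COO")]))  # 0.5
```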

A Graphical Model of Hurricane Evacuation Behaviors

  • paper_url: http://arxiv.org/abs/2311.10228
  • repo_url: None
  • paper_authors: Hui Sophie Wang, Nutchanon Yongsatianchot, Stacy Marsella
  • for: To study people's decisions about whether to evacuate their homes as a hurricane approaches, and how these decisions affect emergency planning and response.
  • methods: Uses the Protection Motivation Theory (PMT) framework to construct graphical models of the complex relationships behind hurricane evacuation decisions, and evaluates candidate graph structures with conditional independence tests.
  • results: Evacuation decisions are directly and independently influenced by threat appraisal and coping appraisal. Certain information received from the media influences threat appraisal and, through it, indirectly influences evacuation behavior. In addition, several variables directly influence both threat appraisal and evacuation behavior, including suggestions from family and friends, neighbors' evacuation behavior, and evacuation notices from officials.
    Abstract Natural disasters such as hurricanes are increasing and causing widespread devastation. People's decisions and actions regarding whether to evacuate or not are critical and have a large impact on emergency planning and response. Our interest lies in computationally modeling complex relationships among various factors influencing evacuation decisions. We conducted a study on the evacuation of Hurricane Irma of the 2017 Atlantic hurricane season. The study was guided by the Protection motivation theory (PMT), a widely-used framework to understand people's responses to potential threats. Graphical models were constructed to represent the complex relationships among the factors involved and the evacuation decision. We evaluated different graphical structures based on conditional independence tests using Irma data. The final model largely aligns with PMT. It shows that both risk perception (threat appraisal) and difficulties in evacuation (coping appraisal) influence evacuation decisions directly and independently. Certain information received from media was found to influence risk perception, and through it influence evacuation behaviors indirectly. In addition, several variables were found to influence both risk perception and evacuation behaviors directly, including family and friends' suggestions, neighbors' evacuation behaviors, and evacuation notices from officials.
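The structure evaluation rests on conditional independence tests. Below is a minimal sketch of a chi-square test of X independent of Y given Z over categorical survey-style data; the variable names and data layout are illustrative, not the paper's actual survey fields.

```python
import pandas as pd
from scipy.stats import chi2, chi2_contingency

def ci_test(df, x, y, z):
    """Chi-square test of X independent of Y given Z (all categorical columns)."""
    stat, dof = 0.0, 0
    for _, stratum in df.groupby(z):
        table = pd.crosstab(stratum[x], stratum[y])
        if table.shape[0] < 2 or table.shape[1] < 2:
            continue  # this stratum carries no information about the association
        s, _, d, _ = chi2_contingency(table)
        stat, dof = stat + s, dof + d
    return chi2.sf(stat, dof) if dof > 0 else 1.0  # p-value

# A large p-value supports dropping the X-Y edge given Z in a candidate graph.
```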

Think Twice: Perspective-Taking Improves Large Language Models’ Theory-of-Mind Capabilities

  • paper_url: http://arxiv.org/abs/2311.10227
  • repo_url: None
  • paper_authors: Alex Wilf, Sihyun Shawn Lee, Paul Pu Liang, Louis-Philippe Morency
  • for: To improve the Theory-of-Mind (ToM) capabilities of current Large Language Models (LLMs).
  • methods: Proposes SimToM, a novel two-stage prompting framework based on the notion of perspective-taking from the cognitive science theory of Simulation Theory.
  • results: Applied to current ToM benchmarks, SimToM shows substantial improvement over existing methods, and the analysis reveals the importance of perspective-taking to ToM capabilities.
    Abstract Human interactions are deeply rooted in the interplay of thoughts, beliefs, and desires made possible by Theory of Mind (ToM): our cognitive ability to understand the mental states of ourselves and others. Although ToM may come naturally to us, emulating it presents a challenge to even the most advanced Large Language Models (LLMs). Recent improvements to LLMs' reasoning capabilities from simple yet effective prompting techniques such as Chain-of-Thought have seen limited applicability to ToM. In this paper, we turn to the prominent cognitive science theory "Simulation Theory" to bridge this gap. We introduce SimToM, a novel two-stage prompting framework inspired by Simulation Theory's notion of perspective-taking. To implement this idea on current ToM benchmarks, SimToM first filters context based on what the character in question knows before answering a question about their mental state. Our approach, which requires no additional training and minimal prompt-tuning, shows substantial improvement over existing methods, and our analysis reveals the importance of perspective-taking to Theory-of-Mind capabilities. Our findings suggest perspective-taking as a promising direction for future research into improving LLMs' ToM capabilities.
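A minimal sketch of the two-stage SimToM scheme as the abstract describes it: first filter the context down to what the character knows, then answer from the filtered context. `ask_llm` is a placeholder for any chat-completion call, and the prompt wording is an assumption, not the paper's exact prompts.

```python
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

def simtom_answer(story: str, character: str, question: str) -> str:
    # Stage 1: perspective-taking -- keep only the events the character knows.
    filtered = ask_llm(
        f"The following is a sequence of events:\n{story}\n"
        f"Which events does {character} know about? "
        f"Rewrite the story keeping only those events."
    )
    # Stage 2: answer the mental-state question from the filtered context.
    return ask_llm(
        f"{filtered}\nAnswer this question about {character}: {question}"
    )
```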

A Language and Its Dimensions: Intrinsic Dimensions of Language Fractal Structures

  • paper_url: http://arxiv.org/abs/2311.10217
  • repo_url: None
  • paper_authors: Vasilii A. Gromov, Nikita S. Borodin, Asel S. Yerbolova
  • for: The paper is written to introduce a new object of study - a language fractal structure, and to estimate the intrinsic dimensions of language fractal structures for the Russian and English languages.
  • methods: The paper uses methods based on topological data analysis and a minimum spanning tree of a data graph to estimate the intrinsic dimensions of language fractal structures.
  • results: The paper finds that the intrinsic dimensions of language fractal structures for both the Russian and English languages are non-integer values, close to 9 for both languages.
    Abstract The present paper introduces a novel object of study - a language fractal structure. We hypothesize that a set of embeddings of all $n$-grams of a natural language constitutes a representative sample of this fractal set. (We use the term Hailonakea to refer to the sum total of all language fractal structures, over all $n$). The paper estimates intrinsic (genuine) dimensions of language fractal structures for the Russian and English languages. To this end, we employ methods based on (1) topological data analysis and (2) a minimum spanning tree of a data graph for a cloud of points considered (Steele theorem). For both languages, for all $n$, the intrinsic dimensions appear to be non-integer values (typical for fractal sets), close to 9 for both of the Russian and English language.
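A minimal sketch of the MST-based estimator implied by Steele's theorem: for n i.i.d. points of intrinsic dimension d, the total MST length grows as n^((d-1)/d), so d can be recovered from the log-log slope. Synthetic Gaussian data stands in here for the paper's clouds of n-gram embeddings.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_length(points):
    """Total edge length of the Euclidean minimum spanning tree."""
    return minimum_spanning_tree(squareform(pdist(points))).sum()

rng = np.random.default_rng(0)
cloud = rng.normal(size=(2048, 9))                       # true dimension: 9
ns = [128, 256, 512, 1024, 2048]
lengths = [mst_length(cloud[:n]) for n in ns]
slope = np.polyfit(np.log(ns), np.log(lengths), 1)[0]    # ~ (d-1)/d
print("estimated intrinsic dimension:", 1.0 / (1.0 - slope))
```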

Predictive Minds: LLMs As Atypical Active Inference Agents

  • paper_url: http://arxiv.org/abs/2311.10215
  • repo_url: None
  • paper_authors: Jan Kulveit, Clem von Stengel, Roman Leventov
  • for: To examine large language models (LLMs) through the lens of active inference theory.
  • methods: Compares the similarities and differences between traditional active inference systems and LLMs, concluding that LLMs currently lack a tight feedback loop between acting in the world and perceiving the impacts of their actions, but otherwise fit the active inference paradigm.
  • results: Lists reasons why this loop may soon be closed, along with possible consequences, including enhanced model self-awareness and a drive to minimize prediction error by changing the world.
    Abstract Large language models (LLMs) like GPT are often conceptualized as passive predictors, simulators, or even stochastic parrots. We instead conceptualize LLMs by drawing on the theory of active inference originating in cognitive science and neuroscience. We examine similarities and differences between traditional active inference systems and LLMs, leading to the conclusion that, currently, LLMs lack a tight feedback loop between acting in the world and perceiving the impacts of their actions, but otherwise fit in the active inference paradigm. We list reasons why this loop may soon be closed, and possible consequences of this including enhanced model self-awareness and the drive to minimize prediction error by changing the world.

Bayes in the age of intelligent machines

  • paper_url: http://arxiv.org/abs/2311.10206
  • repo_url: None
  • paper_authors: Thomas L. Griffiths, Jian-Qiao Zhu, Erin Grant, R. Thomas McCoy
  • for: To examine how artificial neural networks bear on explanations of human cognition, arguing that Bayesian models and artificial neural networks are complementary modeling approaches for understanding both human cognition and intelligent machines.
  • methods: Analyzes Bayesian models of cognition and artificial neural networks as accounts of human cognition and of the behavior of intelligent machines.
  • results: Bayesian models and artificial neural networks lie at different levels of analysis and can jointly explain human cognition and machine behavior; moreover, a Bayesian approach may be uniquely valuable for understanding the behavior of large, opaque artificial neural networks.
    Abstract The success of methods based on artificial neural networks in creating intelligent machines seems like it might pose a challenge to explanations of human cognition in terms of Bayesian inference. We argue that this is not the case, and that in fact these systems offer new opportunities for Bayesian modeling. Specifically, we argue that Bayesian models of cognition and artificial neural networks lie at different levels of analysis and are complementary modeling approaches, together offering a way to understand human cognition that spans these levels. We also argue that the same perspective can be applied to intelligent machines, where a Bayesian approach may be uniquely valuable in understanding the behavior of large, opaque artificial neural networks that are trained on proprietary data.

Towards Improving Robustness Against Common Corruptions using Mixture of Class Specific Experts

  • paper_url: http://arxiv.org/abs/2311.10177
  • repo_url: None
  • paper_authors: Shashank Kotyan, Danilo Vasconcellos Vargas
  • for: To improve the robustness and adaptability of neural networks in the face of ever-changing real-world conditions.
  • methods: Proposes the Mixture of Class-Specific Experts architecture, which trains a dedicated network segment for each class and aggregates their outputs to improve scalability and overall performance.
  • results: The approach improves robustness and generalization, performing well across benchmarks and especially under unforeseen corruptions and distortions.
    Abstract Neural networks have demonstrated significant accuracy across various domains, yet their vulnerability to subtle input alterations remains a persistent challenge. Conventional methods like data augmentation, while effective to some extent, fall short in addressing unforeseen corruptions, limiting the adaptability of neural networks in real-world scenarios. In response, this paper introduces a novel paradigm known as the Mixture of Class-Specific Expert Architecture. The approach involves disentangling feature learning for individual classes, offering a nuanced enhancement in scalability and overall performance. By training dedicated network segments for each class and subsequently aggregating their outputs, the proposed architecture aims to mitigate vulnerabilities associated with common neural network structures. The study underscores the importance of comprehensive evaluation methodologies, advocating for the incorporation of benchmarks like the common corruptions benchmark. This inclusion provides nuanced insights into the vulnerabilities of neural networks, especially concerning their generalization capabilities and robustness to unforeseen distortions. The research aligns with the broader objective of advancing the development of highly robust learning systems capable of nuanced reasoning across diverse and challenging real-world scenarios. Through this contribution, the paper aims to foster a deeper understanding of neural network limitations and proposes a practical approach to enhance their resilience in the face of evolving and unpredictable conditions.
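A minimal sketch of the mixture-of-class-specific-experts idea: one small network segment per class, each scoring its own class, with the per-class scores aggregated into a single prediction. The architecture sizes and aggregation by concatenation are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ClassExpertMixture(nn.Module):
    """One small expert per class; outputs are aggregated into class logits."""
    def __init__(self, d_in: int, n_classes: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, 1))
            for _ in range(n_classes)
        )

    def forward(self, x):                    # x: (batch, d_in)
        scores = [expert(x) for expert in self.experts]
        return torch.cat(scores, dim=1)      # (batch, n_classes) logits

logits = ClassExpertMixture(d_in=128, n_classes=10)(torch.randn(4, 128))
print(logits.argmax(dim=1))                 # aggregated class predictions
```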

JaxMARL: Multi-Agent RL Environments in JAX

  • paper_url: http://arxiv.org/abs/2311.10090
  • repo_url: https://github.com/flairox/jaxmarl
  • paper_authors: Alexander Rutherford, Benjamin Ellis, Matteo Gallici, Jonathan Cook, Andrei Lupu, Gardar Ingvarsson, Timon Willi, Akbir Khan, Christian Schroeder de Witt, Alexandra Souly, Saptarashmi Bandyopadhyay, Mikayel Samvelyan, Minqi Jiang, Robert Tjarko Lange, Shimon Whiteson, Bruno Lacerda, Nick Hawes, Tim Rocktaschel, Chris Lu, Jakob Nicolaus Foerster
  • for: This paper is written for researchers and developers in the field of reinforcement learning (RL) and multi-agent reinforcement learning (MARL), who need efficient and scalable environments for training and evaluating their algorithms.
  • methods: The paper uses JAX (Jax.org) to enable massively parallel RL training pipelines and environments, and presents JaxMARL, an open-source code base that combines ease-of-use with GPU-enabled efficiency for commonly used MARL environments and popular baseline algorithms.
  • results: The paper shows that JaxMARL is up to 12,500 times faster than existing approaches in terms of wall clock time, enabling efficient and thorough evaluations, and introduces SMAX, a vectorized and simplified version of the StarCraft Multi-Agent Challenge that enables GPU acceleration and provides a more flexible MARL environment.
    Abstract Benchmarks play an important role in the development of machine learning algorithms. For example, research in reinforcement learning (RL) has been heavily influenced by available environments and benchmarks. However, RL environments are traditionally run on the CPU, limiting their scalability with typical academic compute. Recent advancements in JAX have enabled the wider use of hardware acceleration to overcome these computational hurdles, enabling massively parallel RL training pipelines and environments. This is particularly useful for multi-agent reinforcement learning (MARL) research. First of all, multiple agents must be considered at each environment step, adding computational burden, and secondly, the sample complexity is increased due to non-stationarity, decentralised partial observability, or other MARL challenges. In this paper, we present JaxMARL, the first open-source code base that combines ease-of-use with GPU enabled efficiency, and supports a large number of commonly used MARL environments as well as popular baseline algorithms. When considering wall clock time, our experiments show that per-run our JAX-based training pipeline is up to 12500x faster than existing approaches. This enables efficient and thorough evaluations, with the potential to alleviate the evaluation crisis of the field. We also introduce and benchmark SMAX, a vectorised, simplified version of the popular StarCraft Multi-Agent Challenge, which removes the need to run the StarCraft II game engine. This not only enables GPU acceleration, but also provides a more flexible MARL environment, unlocking the potential for self-play, meta-learning, and other future applications in MARL. We provide code at https://github.com/flairox/jaxmarl.
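The speedups come from keeping environments as pure JAX functions, so that `jit` and `vmap` can batch thousands of them on an accelerator. A minimal sketch of that pattern with a toy environment follows; this is not JaxMARL's actual API (see the repo for the real interface).

```python
import jax
import jax.numpy as jnp

def step(state, action):
    """Toy pure environment: the state drifts by the action."""
    new_state = state + action
    reward = -jnp.abs(new_state)
    return new_state, reward

# vmap batches the step across all environments; jit compiles it once.
batched_step = jax.jit(jax.vmap(step))

n_envs = 100_000
states = jnp.zeros(n_envs)
actions = jnp.ones(n_envs)
states, rewards = batched_step(states, actions)
print(rewards.shape)  # (100000,) -- one step over 100k environments at once
```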

Emu Edit: Precise Image Editing via Recognition and Generation Tasks

  • paper_url: http://arxiv.org/abs/2311.10089
  • repo_url: None
  • paper_authors: Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, Yaniv Taigman
  • for: To present a multi-task image editing model that performs editing operations from natural language instructions.
  • methods: The model improves its editing ability through multi-task learning and learned task embeddings, and can learn new tasks from just a few examples.
  • results: The model sets state-of-the-art results across multiple image editing tasks, including region-based editing, free-form editing, and computer vision tasks, and can perform novel editing tasks from only a few labeled examples.
    Abstract Instruction-based image editing holds immense potential for a variety of applications, as it enables users to perform any editing operation using a natural language instruction. However, current models in this domain often struggle with accurately executing user instructions. We present Emu Edit, a multi-task image editing model which sets state-of-the-art results in instruction-based image editing. To develop Emu Edit we train it to multi-task across an unprecedented range of tasks, such as region-based editing, free-form editing, and Computer Vision tasks, all of which are formulated as generative tasks. Additionally, to enhance Emu Edit's multi-task learning abilities, we provide it with learned task embeddings which guide the generation process towards the correct edit type. Both these elements are essential for Emu Edit's outstanding performance. Furthermore, we show that Emu Edit can generalize to new tasks, such as image inpainting, super-resolution, and compositions of editing tasks, with just a few labeled examples. This capability offers a significant advantage in scenarios where high-quality samples are scarce. Lastly, to facilitate a more rigorous and informed assessment of instructable image editing models, we release a new challenging and versatile benchmark that includes seven different image editing tasks.

Intelligent Generation of Graphical Game Assets: A Conceptual Framework and Systematic Review of the State of the Art

  • paper_url: http://arxiv.org/abs/2311.10129
  • repo_url: None
  • paper_authors: Kaisei Fukaya, Damon Daylamani-Zad, Harry Agius
  • for: To provide a systematic literature review of graphical asset generation for games, helping interested parties discover and apply suitable methods and approaches.
  • methods: Conducts a systematic literature review, yielding 200 accepted papers and a survey of the state of the art in graphical asset generation.
  • results: Summarizes and analyzes existing approaches to graphical asset generation and, informed by the literature, derives a conceptual framework to guide interested parties through the available methods.
    Abstract Procedural content generation (PCG) can be applied to a wide variety of tasks in games, from narratives, levels and sounds, to trees and weapons. A large amount of game content is comprised of graphical assets, such as clouds, buildings or vegetation, that do not require gameplay function considerations. There is also a breadth of literature examining the procedural generation of such elements for purposes outside of games. The body of research, focused on specific methods for generating specific assets, provides a narrow view of the available possibilities. Hence, it is difficult to have a clear picture of all approaches and possibilities, with no guide for interested parties to discover possible methods and approaches for their needs, and no facility to guide them through each technique or approach to map out the process of using them. Therefore, a systematic literature review has been conducted, yielding 200 accepted papers. This paper explores state-of-the-art approaches to graphical asset generation, examining research from a wide range of applications, inside and outside of games. Informed by the literature, a conceptual framework has been derived to address the aforementioned gaps.

ChatGPT-3.5, ChatGPT-4, Google Bard, and Microsoft Bing to Improve Health Literacy and Communication in Pediatric Populations and Beyond

  • paper_url: http://arxiv.org/abs/2311.10075
  • repo_url: None
  • paper_authors: Kanhai S. Amin, Linda Mayes, Pavan Khosla, Rushabh Doshi
  • for: The paper aims to investigate whether large language models (LLMs) can improve health literacy in children and other populations.
  • methods: The authors used 26 different prompts to test the ability of three LLMs (ChatGPT-3.5, Microsoft Bing, and Google Bard) to provide health information at different reading grade levels (RGL). They evaluated the responses based on their reading grade level and word count.
  • results: All three LLMs were able to provide responses at or above a 10th-grade RGL. However, ChatGPT-3.5 and ChatGPT-4 were better at providing responses at lower grade levels, while Microsoft Bing and Google Bard tended to produce responses at a consistent high-school level. The authors also found that Bard was more cautious in providing certain outputs, which may indicate a need for further research on the accuracy and effectiveness of LLMs in health communication.
    Abstract Purpose: Enhanced health literacy has been linked to better health outcomes; however, few interventions have been studied. We investigate whether large language models (LLMs) can serve as a medium to improve health literacy in children and other populations. Methods: We ran 288 conditions using 26 different prompts through ChatGPT-3.5, Microsoft Bing, and Google Bard. Given constraints imposed by rate limits, we tested a subset of 150 conditions through ChatGPT-4. The primary outcome measurements were the reading grade level (RGL) and word counts of output. Results: Across all models, output for basic prompts such as "Explain" and "What is (are)" were at, or exceeded, a 10th-grade RGL. When prompts were specified to explain conditions from the 1st to 12th RGL, we found that LLMs had varying abilities to tailor responses based on RGL. ChatGPT-3.5 provided responses that ranged from the 7th-grade to college freshmen RGL while ChatGPT-4 outputted responses from the 6th-grade to the college-senior RGL. Microsoft Bing provided responses from the 9th to 11th RGL while Google Bard provided responses from the 7th to 10th RGL. Discussion: ChatGPT-3.5 and ChatGPT-4 did better in achieving lower-grade level outputs. Meanwhile Bard and Bing tended to consistently produce an RGL that is at the high school level regardless of prompt. Additionally, Bard's hesitancy in providing certain outputs indicates a cautious approach towards health information. LLMs demonstrate promise in enhancing health communication, but future research should verify the accuracy and effectiveness of such tools in this context. Implications: LLMs face challenges in crafting outputs below a sixth-grade reading level. However, their capability to modify outputs above this threshold provides a potential mechanism to improve health literacy and communication in a pediatric population and beyond.
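A minimal sketch of how output reading grade level (RGL) and word count can be measured. The paper does not state its exact readability formula, so the Flesch-Kincaid grade from the `textstat` package is used here as one common choice.

```python
import textstat  # widely used readability package; one of several possible choices

response = ("Asthma is a condition in which your airways narrow and swell "
            "and may produce extra mucus, which can make breathing hard.")
print(textstat.flesch_kincaid_grade(response))  # approximate US grade level
print(textstat.lexicon_count(response))         # word count, the second metric
```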

The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation

  • paper_url: http://arxiv.org/abs/2311.10057
  • repo_url: https://github.com/mulab-mir/song-describer-dataset
  • paper_authors: Ilaria Manco, Benno Weck, SeungHeon Doh, Minz Won, Yixiao Zhang, Dmitry Bodganov, Yusong Wu, Ke Chen, Philip Tovstogan, Emmanouil Benetos, Elio Quinton, György Fazekas, Juhan Nam
  • for: To provide an evaluation dataset of high-quality audio-caption pairs for assessing music-and-language models.
  • methods: Collects 1.1k human-written natural language descriptions of 706 music recordings, all publicly accessible.
  • results: Benchmarks popular models on three music-and-language tasks (music captioning, text-to-music generation, and music-language retrieval), showing how researchers can use SDD to gain a broader understanding of model performance.
    Abstract We introduce the Song Describer dataset (SDD), a new crowdsourced corpus of high-quality audio-caption pairs, designed for the evaluation of music-and-language models. The dataset consists of 1.1k human-written natural language descriptions of 706 music recordings, all publicly accessible and released under Creative Common licenses. To showcase the use of our dataset, we benchmark popular models on three key music-and-language tasks (music captioning, text-to-music generation and music-language retrieval). Our experiments highlight the importance of cross-dataset evaluation and offer insights into how researchers can use SDD to gain a broader understanding of model performance.

Is “A Helpful Assistant” the Best Role for Large Language Models? A Systematic Evaluation of Social Roles in System Prompts

  • paper_url: http://arxiv.org/abs/2311.10054
  • repo_url: None
  • paper_authors: Mingqian Zheng, Jiaxin Pei, David Jurgens
  • for: To systematically evaluate how social roles in system prompts affect model performance, informing the design of better system prompts for AI systems.
  • methods: Curates 162 social roles covering 6 types of interpersonal relationships and 8 types of occupations, and analyzes 3 popular LLMs on 2457 questions.
  • results: Adding interpersonal roles in prompts consistently improves model performance, and gender-neutral roles and specifying the role as the audience lead to better performance. However, predicting which role yields the best performance remains challenging, and frequency, similarity, and perplexity do not fully explain the effect of social roles on model performance.
    Abstract Prompting serves as the major way humans interact with Large Language Models (LLM). Commercial AI systems commonly define the role of the LLM in system prompts. For example, ChatGPT uses "You are a helpful assistant" as part of the default system prompt. But is "a helpful assistant" the best role for LLMs? In this study, we present a systematic evaluation of how social roles in system prompts affect model performance. We curate a list of 162 roles covering 6 types of interpersonal relationships and 8 types of occupations. Through extensive analysis of 3 popular LLMs and 2457 questions, we show that adding interpersonal roles in prompts consistently improves the models' performance over a range of questions. Moreover, while we find that using gender-neutral roles and specifying the role as the audience leads to better performances, predicting which role leads to the best performance remains a challenging task, and that frequency, similarity, and perplexity do not fully explain the effect of social roles on model performances. Our results can help inform the design of system prompts for AI systems. Code and data are available at https://github.com/Jiaxin-Pei/Prompting-with-Social-Roles.
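A minimal sketch of the evaluation loop implied above: prepend each social role as a system prompt and measure accuracy over a question set. `ask_llm`, the role list, and exact-match scoring are placeholders, not the paper's actual harness.

```python
ROLES = ["a helpful assistant", "a teacher", "a doctor", "your friend"]

def ask_llm(system_prompt: str, question: str) -> str:
    raise NotImplementedError("call your chat API with this system prompt")

def evaluate(questions):
    """questions: list of (question_text, gold_answer) pairs."""
    scores = {}
    for role in ROLES:
        system_prompt = f"You are {role}."
        correct = sum(ask_llm(system_prompt, q).strip() == gold
                      for q, gold in questions)
        scores[role] = correct / len(questions)
    return scores  # accuracy per social role
```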

Inherently Interpretable Time Series Classification via Multiple Instance Learning

  • paper_url: http://arxiv.org/abs/2311.10049
  • repo_url: https://github.com/jaearly/miltimeseriesclassification
  • paper_authors: Joseph Early, Gavin KC Cheung, Kurt Cutajar, Hanting Xie, Jas Kandola, Niall Twomey
  • for: To make time series classifiers inherently interpretable, so that their decision-making processes are more transparent and understandable.
  • methods: Proposes MILLET (Multiple Instance Learning for Locally Explainable Time series classification), which applies Multiple Instance Learning (MIL) to existing deep learning time series classification models, making them inherently interpretable without degrading predictive performance.
  • results: Evaluated on 85 UCR time series classification datasets, MILLET quickly produces high-quality explanations that outperform other well-known interpretability methods. A novel synthetic dataset specially designed for interpretability evaluation is also provided.
    Abstract Conventional Time Series Classification (TSC) methods are often black boxes that obscure inherent interpretation of their decision-making processes. In this work, we leverage Multiple Instance Learning (MIL) to overcome this issue, and propose a new framework called MILLET: Multiple Instance Learning for Locally Explainable Time series classification. We apply MILLET to existing deep learning TSC models and show how they become inherently interpretable without compromising (and in some cases, even improving) predictive performance. We evaluate MILLET on 85 UCR TSC datasets and also present a novel synthetic dataset that is specially designed to facilitate interpretability evaluation. On these datasets, we show MILLET produces sparse explanations quickly that are of higher quality than other well-known interpretability methods. To the best of our knowledge, our work with MILLET, which is available on GitHub (https://github.com/JAEarly/MILTimeSeriesClassification), is the first to develop general MIL methods for TSC and apply them to an extensive variety of domains
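A minimal sketch of MIL-style pooling over timesteps (attention pooling in the style of Ilse et al.); MILLET's own pooling heads differ, so see the repo. The per-timestep weights are what make the prediction locally interpretable.

```python
import torch
import torch.nn as nn

class AttentionMILHead(nn.Module):
    """Pools per-timestep features into a bag prediction with visible weights."""
    def __init__(self, d_feat: int, n_classes: int):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(d_feat, 64), nn.Tanh(),
                                  nn.Linear(64, 1))
        self.clf = nn.Linear(d_feat, n_classes)

    def forward(self, h):                        # h: (batch, time, d_feat)
        a = torch.softmax(self.attn(h), dim=1)   # (batch, time, 1) weights
        bag = (a * h).sum(dim=1)                 # weighted bag embedding
        return self.clf(bag), a.squeeze(-1)      # logits + per-timestep weights

logits, weights = AttentionMILHead(32, 5)(torch.randn(8, 100, 32))
print(weights.shape)  # (8, 100): one interpretability weight per timestep
```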

A Novel Neural Network-Based Federated Learning System for Imbalanced and Non-IID Data

  • paper_url: http://arxiv.org/abs/2311.10025
  • repo_url: None
  • paper_authors: Mahfuzur Rahman Chowdhury, Muhammad Ibrahim
  • for: To address the accuracy-efficiency trade-off of federated learning on imbalanced and non-IID data while preserving data privacy, via a centralized neural network-based federated learning system.
  • methods: Uses micro-level parallel processing inspired by the traditional mini-batch algorithm, with client devices and the server handling the forward and backward propagation respectively. A semi-centralized, edge-computing variant is also proposed in which clients handle both forward and backward propagation, reducing the load on the central server at some cost in overall training time.
  • results: Evaluated on five well-known benchmark datasets, the proposed systems achieve satisfactory performance in reasonable time across various data distribution settings, compared to existing benchmark algorithms.
    Abstract With the growth of machine learning techniques, privacy of data of users has become a major concern. Most of the machine learning algorithms rely heavily on large amount of data which may be collected from various sources. Collecting these data yet maintaining privacy policies has become one of the most challenging tasks for the researchers. To combat this issue, researchers have introduced federated learning, where a prediction model is learnt by ensuring the privacy of data of clients data. However, the prevalent federated learning algorithms possess an accuracy and efficiency trade-off, especially for non-IID data. In this research, we propose a centralized, neural network-based federated learning system. The centralized algorithm incorporates micro-level parallel processing inspired by the traditional mini-batch algorithm where the client devices and the server handle the forward and backward propagation respectively. We also devise a semi-centralized version of our proposed algorithm. This algorithm takes advantage of edge computing for minimizing the load from the central server, where clients handle both the forward and backward propagation while sacrificing the overall train time to some extent. We evaluate our proposed systems on five well-known benchmark datasets and achieve satisfactory performance in a reasonable time across various data distribution settings as compared to some existing benchmark algorithms.

Learning interactions to boost human creativity with bandits and GPT-4

  • paper_url: http://arxiv.org/abs/2311.10127
  • repo_url: None
  • paper_authors: Ara Vartanian, Xiaoxi Sun, Yun-Shiuan Chuang, Siddharth Suresh, Xiaojin Zhu, Timothy T. Rogers
  • for: To explore how interactions with AI algorithms can boost human creative thought.
  • methods: Uses a psychological task that demonstrates limits on human creativity, namely semantic feature generation: given a concept name, participants must list as many of its features as possible.
  • results: Human participants and a language AI (GPT-4) behave similarly on the standard task and on a variant in which participants can ask for algorithmically generated hints, and bandits learning from AI responses prefer the same prompting strategy as those learning from human behavior. The results suggest that strategies for boosting human creativity via computer interactions can be learned by bandits run on groups of simulated participants.
    Abstract This paper considers how interactions with AI algorithms can boost human creative thought. We employ a psychological task that demonstrates limits on human creativity, namely semantic feature generation: given a concept name, respondents must list as many of its features as possible. Human participants typically produce only a fraction of the features they know before getting "stuck." In experiments with humans and with a language AI (GPT-4) we contrast behavior in the standard task versus a variant in which participants can ask for algorithmically-generated hints. Algorithm choice is administered by a multi-armed bandit whose reward indicates whether the hint helped generating more features. Humans and the AI show similar benefits from hints, and remarkably, bandits learning from AI responses prefer the same prompting strategy as those learning from human behavior. The results suggest that strategies for boosting human creativity via computer interactions can be learned by bandits run on groups of simulated participants.
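A minimal sketch of the bandit loop: arms are hint-prompting strategies, and the reward indicates whether the hint helped the participant generate more features. Epsilon-greedy is used here for brevity, and the arm names are invented for illustration; the paper's bandit algorithm and strategies may differ.

```python
import random

ARMS = ["random_word_hint", "category_hint", "example_feature_hint"]
counts = {a: 0 for a in ARMS}
values = {a: 0.0 for a in ARMS}   # running mean reward per strategy

def choose(eps: float = 0.1) -> str:
    if random.random() < eps:                       # explore
        return random.choice(ARMS)
    return max(ARMS, key=lambda a: values[a])       # exploit

def update(arm: str, reward: int) -> None:
    """reward = 1 if the hint led to new features, else 0."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
```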

Straggler-resilient Federated Learning: Tackling Computation Heterogeneity with Layer-wise Partial Model Training in Mobile Edge Network

  • paper_url: http://arxiv.org/abs/2311.10002
  • repo_url: None
  • paper_authors: Hongda Wu, Ping Wang, C V Aswartha Narayana
  • for: To propose a federated training method that lets heterogeneous devices collaboratively train a model without sharing data.
  • methods: Devices with smaller computational capabilities train partial models (subsets of the global model) matched to their compute, rather than the full model, and contribute their updates to the global model.
  • results: The approach lets heterogeneous devices contribute to collaborative training and achieves a better trade-off between learning accuracy and completion time, reaching the learning target faster than existing benchmarks.
    Abstract Federated Learning (FL) enables many resource-limited devices to train a model collaboratively without data sharing. However, many existing works focus on model-homogeneous FL, where the global and local models are the same size, ignoring the inherently heterogeneous computational capabilities of different devices and restricting resource-constrained devices from contributing to FL. In this paper, we consider model-heterogeneous FL and propose Federated Partial Model Training (FedPMT), where devices with smaller computational capabilities work on partial models (subsets of the global model) and contribute to the global model. Different from Dropout-based partial model generation, which removes neurons in hidden layers at random, model training in FedPMT is achieved from the back-propagation perspective. As such, all devices in FedPMT prioritize the most crucial parts of the global model. Theoretical analysis shows that the proposed partial model training design has a similar convergence rate to the widely adopted Federated Averaging (FedAvg) algorithm, $\mathcal{O}(1/T)$, with the sub-optimality gap enlarged by a constant factor related to the model splitting design in FedPMT. Empirical results show that FedPMT significantly outperforms the existing benchmark FedDrop. Meanwhile, compared to the popular model-homogeneous benchmark, FedAvg, FedPMT reaches the learning target in a shorter completion time, thus achieving a better trade-off between learning accuracy and completion time.
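A minimal sketch of partial-model aggregation in the spirit of FedPMT: each client trains only the layers it was assigned, and the server averages each layer over the clients that actually updated it. The layer-assignment policy shown is an illustrative assumption, not the paper's exact model-splitting design.

```python
import numpy as np

def assign_layers(layers, capacity):
    """Weaker devices get only the output-side layers, which all clients share."""
    k = max(1, int(len(layers) * capacity))
    return layers[-k:]

def aggregate(global_model, client_updates):
    """global_model: {layer: array}; client_updates: list of partial dicts."""
    new_model = {}
    for layer, weights in global_model.items():
        updates = [u[layer] for u in client_updates if layer in u]
        # a layer nobody trained this round keeps its previous weights
        new_model[layer] = np.mean(updates, axis=0) if updates else weights
    return new_model
```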

Towards more Practical Threat Models in Artificial Intelligence Security

  • paper_url: http://arxiv.org/abs/2311.09994
  • repo_url: None
  • paper_authors: Kathrin Grosse, Lukas Bieringer, Tarek Richard Besold, Alexandre Alahi
  • for: To describe the gap between research and practice in artificial intelligence security.
  • methods: Revisits the threat models of the six most studied attacks in AI security research and matches them against AI usage in practice via a survey of 271 industrial practitioners.
  • results: All existing threat models are indeed applicable, but there are significant mismatches: research often assumes the attacker has access to information that is not readily available in real-world settings. The paper is therefore a call to study more practical threat models.
    Abstract Recent works have identified a gap between research and practice in artificial intelligence security: threats studied in academia do not always reflect the practical use and security risks of AI. For example, while models are often studied in isolation, they form part of larger ML pipelines in practice. Recent works also brought forward that adversarial manipulations introduced by academic attacks are impractical. We take a first step towards describing the full extent of this disparity. To this end, we revisit the threat models of the six most studied attacks in AI security research and match them to AI usage in practice via a survey with \textbf{271} industrial practitioners. On the one hand, we find that all existing threat models are indeed applicable. On the other hand, there are significant mismatches: research is often too generous with the attacker, assuming access to information not frequently available in real-world settings. Our paper is thus a call for action to study more practical threat models in artificial intelligence security.

Generative AI for Hate Speech Detection: Evaluation and Findings

  • paper_url: http://arxiv.org/abs/2311.09993
  • repo_url: None
  • paper_authors: Sagi Pendzel, Tomer Wullach, Amir Adler, Einat Minkov
  • for: To improve the generalization of automatic hate speech detection.
  • methods: Uses generative AI to produce large amounts of synthetic hate speech sequences from available labeled examples, and leverages the generated data to fine-tune large pre-trained language models (LLMs).
  • results: Evaluations of general LLMs (BERT, RoBERTa, ALBERT) and of models already adapted for hate detection (RoBERTa-Toxicity, HateBERT, HateXplain, ToxDect, and ToxiGen) confirm that the approach improves hate speech generalization, boosting recall across data distributions. Zero-shot hate detection with a GPT-3.5 model achieves better generalization but mediocre recall and low precision on most datasets.
    Abstract Automatic hate speech detection using deep neural models is hampered by the scarcity of labeled datasets, leading to poor generalization. To mitigate this problem, generative AI has been utilized to generate large amounts of synthetic hate speech sequences from available labeled examples, leveraging the generated data in finetuning large pre-trained language models (LLMs). In this chapter, we provide a review of relevant methods, experimental setups and evaluation of this approach. In addition to general LLMs, such as BERT, RoBERTa and ALBERT, we apply and evaluate the impact of train set augmentation with generated data using LLMs that have been already adapted for hate detection, including RoBERTa-Toxicity, HateBERT, HateXplain, ToxDect, and ToxiGen. An empirical study corroborates our previous findings, showing that this approach improves hate speech generalization, boosting recall performance across data distributions. In addition, we explore and compare the performance of the finetuned LLMs with zero-shot hate detection using a GPT-3.5 model. Our results demonstrate that while better generalization is achieved using the GPT-3.5 model, it achieves mediocre recall and low precision on most datasets. It is an open question whether the sensitivity of models such as GPT-3.5, and onward, can be improved using similar techniques of text generation.

A Framework for Monitoring and Retraining Language Models in Real-World Applications

  • paper_url: http://arxiv.org/abs/2311.09930
  • repo_url: None
  • paper_authors: Jaykumar Kasundra, Claudia Schulz, Melicaalsadat Mirsafian, Stavroula Skylaki
  • for: To examine how different retraining decision points affect multilabel classification models, and how to design an effective model retraining strategy.
  • methods: Examines multiple retraining decision points, including data or concept drift, model performance degradation, and the acquisition of new data over time, and evaluates their impact on model performance and resource utilization.
  • results: Different retraining decision points lead to different model performance and resource utilization; based on these findings, the paper proposes a reference framework for designing an effective model retraining strategy.
    Abstract In the Machine Learning (ML) model development lifecycle, training candidate models using an offline holdout dataset and identifying the best model for the given task is only the first step. After the deployment of the selected model, continuous model monitoring and model retraining is required in many real-world applications. There are multiple reasons for retraining, including data or concept drift, which may be reflected on the model performance as monitored by an appropriate metric. Another motivation for retraining is the acquisition of increasing amounts of data over time, which may be used to retrain and improve the model performance even in the absence of drifts. We examine the impact of various retraining decision points on crucial factors, such as model performance and resource utilization, in the context of Multilabel Classification models. We explain our key decision points and propose a reference framework for designing an effective model retraining strategy.
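A minimal sketch of one retraining decision point such a framework might encode: retrain when the monitored metric drops below the baseline by more than a tolerance, or when enough new labeled data has accumulated. The thresholds are illustrative, not the paper's recommendations.

```python
def should_retrain(current_f1, baseline_f1, new_samples,
                   drop_tolerance=0.05, data_threshold=10_000):
    performance_drift = (baseline_f1 - current_f1) > drop_tolerance
    enough_new_data = new_samples >= data_threshold
    return performance_drift or enough_new_data

print(should_retrain(current_f1=0.82, baseline_f1=0.90, new_samples=3_000))
# True: the 8-point drop exceeds the tolerance even without much new data
```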

DSR-Diff: Depth Map Super-Resolution with Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.09919
  • repo_url: None
  • paper_authors: Yuan Shi, Bin Xia, Rui Zhu, Qingmin Liao, Wenming Yang
  • for: To improve the spatial resolution of low-quality depth maps for applications such as 3D reconstruction, virtual reality, and augmented reality.
  • methods: Uses a diffusion model within the latent space to generate guidance, combined with the corresponding high-quality color map, to achieve depth map super-resolution.
  • results: Proposes a novel CDSR model that achieves superior accuracy and efficiency compared to state-of-the-art methods. Code will be released at https://github.com/shiyuan7/DSR-Diff.
    Abstract Color-guided depth map super-resolution (CDSR) improve the spatial resolution of a low-quality depth map with the corresponding high-quality color map, benefiting various applications such as 3D reconstruction, virtual reality, and augmented reality. While conventional CDSR methods typically rely on convolutional neural networks or transformers, diffusion models (DMs) have demonstrated notable effectiveness in high-level vision tasks. In this work, we present a novel CDSR paradigm that utilizes a diffusion model within the latent space to generate guidance for depth map super-resolution. The proposed method comprises a guidance generation network (GGN), a depth map super-resolution network (DSRN), and a guidance recovery network (GRN). The GGN is specifically designed to generate the guidance while managing its compactness. Additionally, we integrate a simple but effective feature fusion module and a transformer-style feature extraction module into the DSRN, enabling it to leverage guided priors in the extraction, fusion, and reconstruction of multi-model images. Taking into account both accuracy and efficiency, our proposed method has shown superior performance in extensive experiments when compared to state-of-the-art methods. Our codes will be made available at https://github.com/shiyuan7/DSR-Diff.

INTERVENOR: Prompt the Coding Ability of Large Language Models with the Interactive Chain of Repairing

  • paper_url: http://arxiv.org/abs/2311.09868
  • repo_url: https://github.com/neuir/intervenor
  • paper_authors: Hanbin Wang, Zhenghao Liu, Shuo Wang, Ganqu Cui, Ning Ding, Zhiyuan Liu, Ge Yu
  • for: To propose INTERVENOR, an automated code-repairing method that mimics human repairing behavior and uses two agents based on large language models (LLMs) to improve code generation and translation.
  • methods: Employs two LLM-based agents, a Code Learner and a Code Teacher. The Code Learner generates and repairs code according to the Code Teacher's instructions, while the Code Teacher rethinks code errors based on compiler feedback and generates a chain-of-repairing (CoR) to guide the repair process.
  • results: Experiments show that INTERVENOR outperforms state-of-the-art methods, with about 13% and 4.5% improvements over the GPT-3.5 model on code generation and code translation tasks, respectively. Further analysis shows that CoR describes bug causes and solution plans in natural language; thanks to compiler feedback, INTERVENOR accurately identifies syntax and assertion errors and provides precise repair instructions, letting LLMs reach plateau performance within only three repairing turns.
    Abstract This paper proposes INTERactiVE chaiN Of Repairing (INTERVENOR), which mimics human code repairing behavior (iteratively judging, rethinking, and repairing) and prompts the coding ability of regard Large Language Models (LLMs). Specifically, INTERVENOR employs two LLM based agents, Code Learner and Code Teacher, to play different roles in code repairing and work interactively to repair the generated codes. The Code Learner is asked to generate and repair code according to the instructions from the Code Teacher. The Code Teacher rethinks the code errors according to the corresponding feedback from compilers and iteratively generates the chain-of-repairing (CoR) to guide the code repairing process for Code Learner. Our experiments show that INTERVENOR outperforms the state-of-the-art methods and achieves about 13% and 4.5% improvements over the GPT-3.5 model in code generation and code translation tasks, respectively. Our further analyses show that CoR can illuminate the bug reasons and solution plans via natural language. Thanks to the feedback of code compilers, INTERVENOR can accurately identify the syntax errors and assertion errors in the code and provide precise instructions to repair codes, making LLMs achieve the plateau performance with only three repairing turns. All data and codes are available at https://github.com/NEUIR/INTERVENOR
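A minimal sketch of the Code Learner / Code Teacher loop with compiler feedback, capped at three repair turns as in the paper's plateau observation. `ask_llm` and the prompt wording are placeholders; the repo has the actual prompts.

```python
import subprocess
import sys
import tempfile

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

def run(code: str) -> str:
    """Return error output, or an empty string if the code ran cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    proc = subprocess.run([sys.executable, f.name],
                          capture_output=True, text=True, timeout=30)
    return "" if proc.returncode == 0 else proc.stderr

def intervenor(task: str, max_turns: int = 3) -> str:
    code = ask_llm(f"Write Python code for: {task}")
    for _ in range(max_turns):
        error = run(code)
        if not error:
            return code
        # Code Teacher: turn compiler/runtime feedback into a chain-of-repairing
        plan = ask_llm(f"Code:\n{code}\nError:\n{error}\n"
                       "Explain the bug and give step-by-step repair instructions.")
        # Code Learner: repair the code according to the instructions
        code = ask_llm(f"Repair the code following these instructions:\n{plan}\n"
                       f"Code:\n{code}")
    return code
```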

PsyBench: a balanced and in-depth Psychological Chinese Evaluation Benchmark for Foundation Models

  • paper_url: http://arxiv.org/abs/2311.09861
  • repo_url: None
  • paper_authors: Junlei Zhang, Hongliang He, Nirui Song, Shuyuan He, Shuai Zhang, Huachuan Qiu, Anqi Li, Lizhi Ma, Zhenzhong Lan
  • for: To provide a comprehensive Chinese evaluation suite for assessing foundation models' capabilities in psychology.
  • methods: Uses multiple-choice questions to evaluate foundation models' performance across the knowledge areas required for graduate entrance exams.
  • results: Performance differs significantly across knowledge areas, and only the ChatGPT model reaches an average accuracy above 70%, indicating that there is still room for improvement.
    Abstract As Large Language Models (LLMs) are becoming prevalent in various fields, there is an urgent need for improved NLP benchmarks that encompass all the necessary knowledge of individual discipline. Many contemporary benchmarks for foundational models emphasize a broad range of subjects but often fall short in presenting all the critical subjects and encompassing necessary professional knowledge of them. This shortfall has led to skewed results, given that LLMs exhibit varying performance across different subjects and knowledge areas. To address this issue, we present psybench, the first comprehensive Chinese evaluation suite that covers all the necessary knowledge required for graduate entrance exams. psybench offers a deep evaluation of a model's strengths and weaknesses in psychology through multiple-choice questions. Our findings show significant differences in performance across different sections of a subject, highlighting the risk of skewed results when the knowledge in test sets is not balanced. Notably, only the ChatGPT model reaches an average accuracy above $70\%$, indicating that there is still plenty of room for improvement. We expect that psybench will help to conduct thorough evaluations of base models' strengths and weaknesses and assist in practical application in the field of psychology.

SurvTimeSurvival: Survival Analysis On The Patient With Multiple Visits/Records

  • paper_url: http://arxiv.org/abs/2311.09854
  • repo_url: https://github.com/davidlee1102/surtimesurvival
  • paper_authors: Hung Le, Ong Eng-Jon, Bober Miroslaw
  • for: Accurately predicting survival times for patients with severe diseases remains a difficult challenge despite recent advances in artificial intelligence. This study introduces "SurvTimeSurvival: Survival Analysis On Patients With Multiple Visits/Records", which uses a Transformer model to handle both time-varying covariates and covariate data.
  • methods: Employs a Transformer model to handle time-varying covariates and covariate data, and addresses the data sparsity common to survival analysis datasets by integrating synthetic data generation into the model's learning process.
  • results: The method outperforms state-of-the-art deep learning approaches on both covariate and time-varying covariate datasets. Beyond improving prediction accuracy, it aims to deepen the understanding of individual patient survival trajectories across medical conditions and to support the design of clinical trials and new treatments.
    Abstract The accurate prediction of survival times for patients with severe diseases remains a critical challenge despite recent advances in artificial intelligence. This study introduces "SurvTimeSurvival: Survival Analysis On Patients With Multiple Visits/Records", utilizing the Transformer model to not only handle the complexities of time-varying covariates but also covariates data. We also tackle the data sparsity issue common to survival analysis datasets by integrating synthetic data generation into the learning process of our model. We show that our method outperforms state-of-the-art deep learning approaches on both covariates and time-varying covariates datasets. Our approach aims not only to enhance the understanding of individual patient survival trajectories across various medical conditions, thereby improving prediction accuracy, but also to play a pivotal role in designing clinical trials and creating new treatments.

Leveraging LLMs in Scholarly Knowledge Graph Question Answering

  • paper_url: http://arxiv.org/abs/2311.09841
  • repo_url: https://github.com/huntila/scholarly-kgqa
  • paper_authors: Tilahun Abedissa Taffa, Ricardo Usbeck
  • for: To present a scholarly knowledge graph question answering (KGQA) system that answers bibliographic natural language questions in a few-shot manner.
  • methods: The model first uses a BERT-based sentence encoder to identify the top-n training questions most similar to a given test question and retrieves their corresponding SPARQL queries. The similar question-SPARQL pairs serve as examples in a prompt, together with the test question, which is passed to the LLM to generate a SPARQL query. Finally, the query is run against the underlying KG (ORKG, the Open Research KG) endpoint to return an answer.
  • results: The system achieves an F1 score of 99.0% on SciQA, one of the Scholarly-QALD-23 challenge benchmarks.
    Abstract This paper presents a scholarly Knowledge Graph Question Answering (KGQA) that answers bibliographic natural language questions by leveraging a large language model (LLM) in a few-shot manner. The model initially identifies the top-n similar training questions related to a given test question via a BERT-based sentence encoder and retrieves their corresponding SPARQL. Using the top-n similar question-SPARQL pairs as an example and the test question creates a prompt. Then pass the prompt to the LLM and generate a SPARQL. Finally, runs the SPARQL against the underlying KG - ORKG (Open Research KG) endpoint and returns an answer. Our system achieves an F1 score of 99.0%, on SciQA - one of the Scholarly-QALD-23 challenge benchmarks.
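A minimal sketch of the described pipeline: retrieve the top-n most similar training questions with a sentence encoder, assemble a few-shot prompt, and run the generated SPARQL against the ORKG endpoint. The encoder checkpoint and the endpoint URL are assumptions; the paper specifies only a BERT-based sentence encoder.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from SPARQLWrapper import JSON, SPARQLWrapper

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # checkpoint is an assumption

def top_n_examples(test_q, train_pairs, n=3):
    """train_pairs: list of (question, sparql) from the training set."""
    q_emb = encoder.encode([test_q])[0]
    embs = encoder.encode([q for q, _ in train_pairs])
    sims = embs @ q_emb / (np.linalg.norm(embs, axis=1) * np.linalg.norm(q_emb))
    return [train_pairs[i] for i in np.argsort(-sims)[:n]]

def build_prompt(test_q, examples):
    shots = "\n".join(f"Question: {q}\nSPARQL: {s}" for q, s in examples)
    return f"{shots}\nQuestion: {test_q}\nSPARQL:"

def run_sparql(query, endpoint="https://orkg.org/triplestore"):  # URL assumed
    client = SPARQLWrapper(endpoint)
    client.setQuery(query)
    client.setReturnFormat(JSON)
    return client.query().convert()
```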

PELMS: Pre-training for Effective Low-Shot Multi-Document Summarization

  • paper_url: http://arxiv.org/abs/2311.09836
  • repo_url: None
  • paper_authors: Joseph J. Peper, Wenzhao Qiu, Lu Wang
  • for: To study pre-training for abstractive multi-document summarization (MDS), aiming at summaries that are concise, fluent, and faithful.
  • methods: Proposes PELMS, a pre-trained model that uses objectives based on semantic coherence heuristics and faithfulness constraints with unlabeled multi-document inputs, supported by MultiPT, a multi-document pre-training corpus.
  • results: Extensive low-shot evaluation on a wide range of MDS datasets shows that the approach consistently outperforms competitive baselines in overall informativeness, abstractiveness, coherence, and faithfulness.
    Abstract We investigate pre-training techniques for abstractive multi-document summarization (MDS), which is much less studied than summarizing single documents. Though recent work has demonstrated the effectiveness of highlighting information salience for pre-training strategy design, it struggles to generate abstractive and reflective summaries, which are critical properties for MDS. To this end, we present PELMS, a pre-trained model that uses objectives based on semantic coherence heuristics and faithfulness constraints with un-labeled multi-document inputs, to promote the generation of concise, fluent, and faithful summaries. To support the training of PELMS, we compile MultiPT, a multi-document pre-training corpus containing over 93 million documents to form more than 3 million unlabeled topic-centric document clusters, covering diverse genres such as product reviews, news, and general knowledge. We perform extensive evaluation of PELMS in low-shot settings on a wide range of MDS datasets. Our approach consistently outperforms competitive comparisons with respect to overall informativeness, abstractiveness, coherence, and faithfulness.

ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

  • paper_url: http://arxiv.org/abs/2311.09835
  • repo_url: None
  • paper_authors: Yuliang Liu, Xiangru Tang, Zefan Cai, Junjie Lu, Yichi Zhang, Yanjun Shao, Zexuan Deng, Helan Hu, Zengxian Yang, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Zhengliang Li, Liang Chen, Yiming Zong, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, Mark Gerstein
  • for: This study evaluates how effective large language models are in practice when they complete machine learning tasks by using open-source libraries.
  • methods: The authors propose a new evaluation setup in which LLMs use open-source libraries to complete machine learning tasks rather than writing code from scratch, and build ML-Bench, an extensive benchmark of 10044 samples across 130 tasks, to measure LLMs in this setting.
  • results: GPT-4 performs best on these tasks but completes only 39.73% of them. The authors propose ML-Agent, a method that efficiently navigates codebases to locate relevant documentation and code and to execute it; built on GPT-4, ML-Agent yields further improvements.
    Abstract Large language models have shown promising performance in code generation benchmarks. However, a considerable divide exists between these benchmark achievements and their practical applicability, primarily attributed to real-world programming's reliance on pre-existing libraries. Instead of evaluating LLMs' ability to code from scratch, this work proposes a new evaluation setup where LLMs use open-source libraries to finish machine learning tasks. To this end, we propose ML-Bench, an expansive benchmark developed to assess the effectiveness of LLMs in leveraging existing functions in open-source libraries, consisting of 10044 samples spanning 130 tasks over 14 notable machine learning GitHub repositories. In this setting, given a specific machine learning task instruction and the accompanying README in a codebase, an LLM is tasked to generate code to accomplish the task. This necessitates the comprehension of long and language-code interleaved documents, as well as the understanding of complex cross-file code structures, introducing new challenges. Notably, while GPT-4 exhibits remarkable improvement over other LLMs, it manages to accomplish only 39.73% of the tasks, leaving a huge space for improvement. We address these challenges by proposing ML-Agent, designed to effectively navigate the codebase, locate documentation, retrieve code, and generate executable code. Empirical results demonstrate that ML-Agent, built upon GPT-4, results in further improvements. Code, data, and models are available at https://ml-bench.github.io/.
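To make the evaluation setup concrete, here is a hypothetical sketch of how a prompt could be assembled from a repository README and a task instruction; the prompt text and the `query_llm` placeholder are our inventions, not part of ML-Bench:

```python
# Rough sketch of the task setup: README + instruction -> code from an LLM.
from pathlib import Path

def make_mlbench_prompt(readme_path: str, instruction: str) -> str:
    readme = Path(readme_path).read_text(encoding="utf-8")
    return (
        "You are given the README of a machine learning repository.\n"
        f"--- README ---\n{readme}\n--- END README ---\n\n"
        f"Task: {instruction}\n"
        "Write a shell or Python snippet that accomplishes the task "
        "using this repository's documented interface."
    )

# prompt = make_mlbench_prompt("repo/README.md",
#                              "Fine-tune the base model on my CSV dataset.")
# code = query_llm(prompt)  # the benchmark then executes and scores `code`
```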

AutoPlanBench: Automatically generating benchmarks for LLM planners from PDDL

  • paper_url: http://arxiv.org/abs/2311.09830
  • repo_url: None
  • paper_authors: Katharina Stein, Alexander Koller
  • for: This paper investigates the planning capabilities of LLMs and how to evaluate them using textual task descriptions.
  • methods: It presents a novel method for automatically converting PDDL planning benchmarks into textual descriptions and provides a new benchmark set.
  • results: The study finds that the best current LLM planners do well on many planning tasks, while other tasks remain far beyond the reach of current methods.
    Abstract LLMs are being increasingly used for planning-style tasks, but their capabilities for planning and reasoning are poorly understood. We present a novel method for automatically converting planning benchmarks written in PDDL into textual descriptions and offer a benchmark dataset created with our method. We show that while the best LLM planners do well on many planning tasks, others remain out of reach of current methods.
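The conversion from PDDL to text can be pictured with a toy template-based translator like the one below; the predicates and templates are invented for illustration and are not AutoPlanBench's actual rules:

```python
# Toy illustration: turn PDDL-style ground facts into English sentences.
templates = {
    "on":      "{0} is on top of {1}.",
    "clear":   "{0} has nothing on top of it.",
    "holding": "The robot arm is holding {0}.",
}

def facts_to_text(facts):
    lines = []
    for fact in facts:            # e.g. "(on blockA blockB)"
        name, *args = fact.strip("()").split()
        lines.append(templates[name].format(*args))
    return " ".join(lines)

print(facts_to_text(["(on blockA blockB)", "(clear blockA)"]))
# -> "blockA is on top of blockB. blockA has nothing on top of it."
```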

PWISeg: Point-based Weakly-supervised Instance Segmentation for Surgical Instruments

  • paper_url: http://arxiv.org/abs/2311.09819
  • repo_url: https://github.com/seanxuu/PWISeg
  • paper_authors: Zhen Sun, Huan Xu, Jinlin Wu, Zhen Chen, Zhen Lei, Hongbin Liu
  • for: This paper proposes a novel and effective instance segmentation method for surgical instruments, addressing the fact that obtaining mask-level annotations in practice is labor-intensive.
  • methods: The method builds on an FCN architecture with point-to-box and point-to-mask branches that model the relationships between feature points and bounding boxes, and between feature points and segmentation masks.
  • results: Extensive experiments on a newly released surgical instrument dataset show that the method significantly improves instrument segmentation accuracy over most instance segmentation methods weakly supervised by bounding boxes.
    Abstract In surgical procedures, correct instrument counting is essential. Instance segmentation is a localization method that identifies not only an object's bounding box but also pixel-level detail. However, obtaining mask-level annotations for instance segmentation is labor-intensive. To address this issue, we propose a novel yet effective weakly-supervised surgical instrument instance segmentation approach, named Point-based Weakly-supervised Instance Segmentation (PWISeg). PWISeg adopts an FCN-based architecture with point-to-box and point-to-mask branches to model the relationships between feature points and bounding boxes, as well as feature points and segmentation masks on FPN, accomplishing instrument detection and segmentation jointly in a single model. Since mask-level annotations are hard to obtain in the real world, for point-to-mask training we introduce an unsupervised projection loss, utilizing the projected relation between predicted masks and bboxes as the supervision signal. On the other hand, we annotate a few pixels as the key pixels for each instrument. Based on this, we further propose a key pixel association loss and a key pixel distribution loss, driving the point-to-mask branch to generate more accurate segmentation predictions. To comprehensively evaluate this task, we release a novel surgical instrument dataset with manual annotations, setting up a benchmark for further research. Our comprehensive experiments validate the superior performance of PWISeg: surgical instrument segmentation accuracy improves, surpassing most instance segmentation methods weakly supervised via bounding boxes. This improvement is consistently observed on our proposed dataset and when applied to the public HOSPI-Tools dataset.
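The unsupervised projection loss can be illustrated with the well-known BoxInst-style formulation, which compares the x/y projections of the predicted mask against those of the box; this is a sketch of the general technique, not necessarily the paper's exact loss:

```python
# Sketch of a box-supervised projection loss: match mask and box projections.
import torch

def dice(p, t, eps=1e-6):
    inter = (p * t).sum(-1)
    return 1 - (2 * inter + eps) / (p.sum(-1) + t.sum(-1) + eps)

def projection_loss(pred_mask, box_mask):
    # pred_mask: (H, W) probabilities; box_mask: (H, W) binary box fill
    loss_x = dice(pred_mask.max(dim=0).values, box_mask.max(dim=0).values)
    loss_y = dice(pred_mask.max(dim=1).values, box_mask.max(dim=1).values)
    return loss_x + loss_y

pred = torch.rand(64, 64)
box = torch.zeros(64, 64); box[10:40, 20:50] = 1.0
print(projection_loss(pred, box))
```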

MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning

  • paper_url: http://arxiv.org/abs/2311.10537
  • repo_url: https://github.com/gersteinlab/medagents
  • paper_authors: Xiangru Tang, Anni Zou, Zhuosheng Zhang, Yilun Zhao, Xingyao Zhang, Arman Cohan, Mark Gerstein
  • for: This work aims to improve language models' performance in the medical domain by providing a practical multi-disciplinary collaboration framework that helps them better understand and apply medical knowledge.
  • methods: The framework comprises five key steps: gathering domain experts, proposing individual analyses, summarizing the analyses into a report, holding multi-round discussions, and finally making a decision.
  • results: Strong results on nine datasets (MedQA, MedMCQA, PubMedQA, and six subtasks from MMLU) show that the proposed multi-disciplinary collaboration framework helps language models mine and apply medical expertise while also extending their reasoning abilities.
    Abstract Large Language Models (LLMs), despite their remarkable progress across various general domains, encounter significant barriers in medicine and healthcare. This field faces unique challenges such as domain-specific terminologies and reasoning over specialized knowledge. To address these obstinate issues, we propose a novel Multi-disciplinary Collaboration (MC) framework for the medical domain that leverages role-playing LLM-based agents who participate in a collaborative multi-round discussion, thereby enhancing LLM proficiency and reasoning capabilities. This training-free and interpretable framework encompasses five critical steps: gathering domain experts, proposing individual analyses, summarising these analyses into a report, iterating over discussions until a consensus is reached, and ultimately making a decision. Our work particularly focuses on the zero-shot scenario; our results on nine data sets (MedQA, MedMCQA, PubMedQA, and six subtasks from MMLU) establish that our proposed MC framework excels at mining and harnessing the medical expertise in LLMs, as well as extending their reasoning abilities. Based on these outcomes, we further conduct a human evaluation to pinpoint and categorize common errors within our method, as well as ablation studies aimed at understanding the impact of various factors on overall performance. Our code can be found at https://github.com/gersteinlab/MedAgents.
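A skeleton of the five-step collaboration loop summarized above might look as follows; the `llm` chat function, the expert roles, and all prompt strings are placeholders, not the released MedAgents code:

```python
# Skeleton of the five-step multi-disciplinary collaboration loop.
def medagents_answer(question, llm, domains=("cardiology", "pharmacology"), rounds=3):
    # 1) gather domain experts  2) individual analyses
    analyses = {d: llm(f"As a {d} expert, analyze: {question}") for d in domains}
    # 3) summarize analyses into a report
    report = llm("Summarize these expert analyses into one report:\n"
                 + "\n".join(analyses.values()))
    # 4) iterate discussion until consensus (or a round limit)
    for _ in range(rounds):
        votes = [llm(f"As a {d} expert, do you agree with this report? "
                     f"Answer yes/no and revise if needed:\n{report}")
                 for d in domains]
        if all(v.lower().startswith("yes") for v in votes):
            break
        report = llm("Revise the report given this feedback:\n" + "\n".join(votes))
    # 5) final decision
    return llm(f"Based on the report, give the final answer to: {question}\n{report}")
```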

Performance Trade-offs of Watermarking Large Language Models

  • paper_url: http://arxiv.org/abs/2311.09816
  • repo_url: None
  • paper_authors: Anirudh Ajith, Sameer Singh, Danish Pruthi
  • for: This study evaluates the performance of watermarked models on a diverse suite of tasks to understand the impact of watermarking and the trade-offs users should weigh.
  • methods: The watermarking strategy embeds a signal into generated text so that human-written and model-generated text can be distinguished.
  • results: In most cases, watermarking has no significant impact on task performance, but performance on long-form generation tasks (such as summarization and translation) drops by 15-20%. These findings highlight the trade-offs of using watermarking and point to opportunities for future research.
    Abstract Amidst growing concerns of large language models (LLMs) being misused for generating misinformation or completing homework assignments, watermarking has emerged as an effective solution for distinguishing human-written and LLM-generated text. A prominent watermarking strategy is to embed a signal into generated text by upsampling a (pseudorandomly-chosen) subset of tokens at every generation step. Although this signal is imperceptible to a human reader, it is detectable through statistical testing. However, implanting such signals alters the model's output distribution and can have unintended effects when watermarked LLMs are used for downstream applications. In this work, we evaluate the performance of watermarked LLMs on a diverse suite of tasks, including text classification, textual entailment, reasoning, question answering, translation, summarization, and language modeling. We find that watermarking has negligible impact on the performance of tasks posed as k-class classification problems in the average case. However, the accuracy can plummet to that of a random classifier for some scenarios (that occur with non-negligible probability). Tasks that are cast as multiple-choice questions and short-form generation are surprisingly unaffected by watermarking. For long-form generation tasks, including summarization and translation, we see a drop of 15-20% in the performance due to watermarking. Our findings highlight the trade-offs that users should be cognizant of when using watermarked models, and point to cases where future research could improve existing trade-offs.
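The watermarking strategy the paper evaluates can be pictured with the widely known green-list scheme: seed a pseudorandom generator with the preceding token, mark a fraction of the vocabulary as "green", and boost its logits. The sketch below shows the generation-side step; the parameter values are illustrative:

```python
# Minimal sketch of the upsampling-based ("green list") watermark.
import torch

def watermark_logits(logits, prev_token, gamma=0.5, delta=2.0):
    vocab = logits.shape[-1]
    g = torch.Generator().manual_seed(int(prev_token))   # pseudorandom, reproducible
    green = torch.randperm(vocab, generator=g)[: int(gamma * vocab)]
    logits = logits.clone()
    logits[green] += delta                               # upsample green tokens
    return logits

# Detection recomputes the same green lists and z-tests the green-token rate.
logits = torch.randn(50257)
biased = watermark_logits(logits, prev_token=1234)
```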

Towards Formal Fault Injection for Safety Assessment of Automated Systems

  • paper_url: http://arxiv.org/abs/2311.09810
  • repo_url: None
  • paper_authors: Ashfaq Farooqui, Behrooz Sangchoolie
  • for: This paper addresses the safety, security, and other dependability attributes of automated systems, a prerequisite for their widespread adoption in everyday life.
  • methods: It builds on formal methods, which reason mathematically about a system's behavior and can thereby guarantee its dependability; however, these methods are typically applied to abstract models of the system that may not fully represent the actual system.
  • results: The paper proposes formal fault injection, a fusion of formal methods and fault injection throughout the development lifecycle, to enhance the dependability of autonomous systems; it discusses the potential benefits of these techniques and outlines future research directions for addressing the open challenges.
    Abstract Reasoning about safety, security, and other dependability attributes of autonomous systems is a challenge that needs to be addressed before the adoption of such systems in day-to-day life. Formal methods is a class of methods that mathematically reason about a system's behavior. Thus, a correctness proof is sufficient to conclude the system's dependability. However, these methods are usually applied to abstract models of the system, which might not fully represent the actual system. Fault injection, on the other hand, is a testing method to evaluate the dependability of systems. However, the amount of testing required to evaluate the system is rather large and often a problem. This vision paper introduces formal fault injection, a fusion of these two techniques throughout the development lifecycle to enhance the dependability of autonomous systems. We advocate for a more cohesive approach by identifying five areas of mutual support between formal methods and fault injection. By forging stronger ties between the two fields, we pave the way for developing safe and dependable autonomous systems. This paper delves into the integration's potential and outlines future research avenues, addressing open challenges along the way.

Comparing Differentiable Logics for Learning Systems: A Research Preview

  • paper_url: http://arxiv.org/abs/2311.09809
  • repo_url: https://github.com/tflinkow/dl-comparison
  • paper_authors: Thomas Flinkow, Barak A. Pearlmutter, Rosemary Monahan
  • for: This paper studies how to make machine learning (ML) systems satisfy correctness and safety requirements, taking into account autonomous systems' need to self-correct and adapt.
  • methods: It uses differentiable logics, in which background knowledge is encoded as logical constraints that guide the learning process.
  • results: The experimental results are broadly consistent with those reported in the literature; however, learning with differentiable logics introduces a new hyperparameter that is difficult to tune and has significant influence on the effectiveness of the logics.
    Abstract Extensive research on formal verification of machine learning (ML) systems indicates that learning from data alone often fails to capture underlying background knowledge. A variety of verifiers have been developed to ensure that a machine-learnt model satisfies correctness and safety properties, however, these verifiers typically assume a trained network with fixed weights. ML-enabled autonomous systems are required to not only detect incorrect predictions, but should also possess the ability to self-correct, continuously improving and adapting. A promising approach for creating ML models that inherently satisfy constraints is to encode background knowledge as logical constraints that guide the learning process via so-called differentiable logics. In this research preview, we compare and evaluate various logics from the literature in weakly-supervised contexts, presenting our findings and highlighting open problems for future work. Our experimental results are broadly consistent with results reported previously in literature; however, learning with differentiable logics introduces a new hyperparameter that is difficult to tune and has significant influence on the effectiveness of the logics.
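One concrete instance of a differentiable logic is a product-t-norm ("fuzzy") translation of the logical connectives, turned into a loss term; the sketch below is a generic illustration of the technique, not a specific logic from the paper's experiments:

```python
# Toy sketch: product-t-norm translations of AND/OR/IMPLIES as a loss.
import torch

def t_and(a, b): return a * b
def t_or(a, b):  return a + b - a * b
def t_implies(a, b): return 1 - a + a * b   # Reichenbach implication

# Constraint: "if the input is in class A, the output must satisfy property P"
p_class_a = torch.sigmoid(torch.tensor(1.3))   # model's belief that x is in A
p_prop    = torch.sigmoid(torch.tensor(-0.2))  # model's belief that P holds
constraint_truth = t_implies(p_class_a, p_prop)
logic_loss = 1 - constraint_truth              # differentiable; add to task loss
```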

Neuro-Symbolic Integration Brings Causal and Reliable Reasoning Proofs

  • paper_url: http://arxiv.org/abs/2311.09802
  • repo_url: https://github.com/damo-nlp-sg/caring
  • paper_authors: Sen Yang, Xin Li, Leyang Cui, Lidong Bing, Wai Lam
  • for: This paper aims to improve the reasoning capabilities and reliability of AI models.
  • methods: It combines a neural LLM, used to represent the knowledge of the problem, with an LLM-free symbolic solver that performs deliberative reasoning and produces proofs.
  • results: Experiments show that this method substantially improves reasoning accuracy and proof similarity on ProofWriter and GSM8K.
    Abstract Though prompting LLMs with various reasoning structures produces reasoning proofs along with answers, these proofs are not ensured to be causal and reliable due to the inherent defects of LLMs. Tracking such deficiencies, we present a neuro-symbolic integration method, in which a neural LLM is used to represent the knowledge of the problem while an LLM-free symbolic solver is adopted to do deliberative reasoning using the knowledge. Specifically, our customized meta-interpreters allow the production of reasoning proofs and support flexible search strategies. These reasoning proofs are ensured to be causal and reliable because of the deterministic executing nature of the symbolic solvers. Empirically, on ProofWriter, our method surpasses the CoT baseline by nearly double in accuracy and more than triple in proof similarity. On GSM8K, our method also shows accuracy improvements and nearly doubled proof similarity. Our code is released at https://github.com/DAMO-NLP-SG/CaRing

Interpreting User Requests in the Context of Natural Language Standing Instructions

  • paper_url: http://arxiv.org/abs/2311.09796
  • repo_url: None
  • paper_authors: Nikita Moghe, Patrick Xia, Jacob Andreas, Jason Eisner, Benjamin Van Durme, Harsh Jhamtani
  • for: Improving the user experience of natural language interfaces by reducing how often users must repeat their preferences.
  • methods: Using large language models (LLMs) together with natural-language standing instructions, which capture a user's preferences and directives, as additional context.
  • results: Experiments on the NLSI dataset with large language models and various retrieval approaches achieve at most 44.7% exact match on API prediction.
    Abstract Users of natural language interfaces, generally powered by Large Language Models (LLMs),often must repeat their preferences each time they make a similar request. To alleviate this, we propose including some of a user's preferences and instructions in natural language -- collectively termed standing instructions -- as additional context for such interfaces. For example, when a user states I'm hungry, their previously expressed preference for Persian food will be automatically added to the LLM prompt, so as to influence the search for relevant restaurants. We develop NLSI, a language-to-program dataset consisting of over 2.4K dialogues spanning 17 domains, where each dialogue is paired with a user profile (a set of users specific standing instructions) and corresponding structured representations (API calls). A key challenge in NLSI is to identify which subset of the standing instructions is applicable to a given dialogue. NLSI contains diverse phenomena, from simple preferences to interdependent instructions such as triggering a hotel search whenever the user is booking tickets to an event. We conduct experiments on NLSI using prompting with large language models and various retrieval approaches, achieving a maximum of 44.7% exact match on API prediction. Our results demonstrate the challenges in identifying the relevant standing instructions and their interpretation into API calls.

Breaking Boundaries: Balancing Performance and Robustness in Deep Wireless Traffic Forecasting

  • paper_url: http://arxiv.org/abs/2311.09790
  • repo_url: None
  • paper_authors: Romain Ilbert, Thai V. Hoang, Zonghua Zhang, Themis Palpanas
  • for: This study seeks a time-series forecasting method that balances accuracy and robustness on real-world telecom data.
  • methods: The authors use a hybrid strategy comprising a classifier that detects adversarial examples, a denoiser that removes noise from perturbed data samples, and a standard forecasting model; they compare it against two existing adversarial training algorithms.
  • results: The hybrid strategy performs best on both clean and perturbed data: it retains 92.02% of the original forecasting model's MSE performance on clean data while being more robust on perturbed data, with MSE 2.71x lower on clean data and 2.51x lower on perturbed data than the comparison methods. Moreover, the model's components can be trained in parallel, improving computational efficiency.
    Abstract Balancing the trade-off between accuracy and robustness is a long-standing challenge in time series forecasting. While most of existing robust algorithms have achieved certain suboptimal performance on clean data, sustaining the same performance level in the presence of data perturbations remains extremely hard. In this paper, we study a wide array of perturbation scenarios and propose novel defense mechanisms against adversarial attacks using real-world telecom data. We compare our strategy against two existing adversarial training algorithms under a range of maximal allowed perturbations, defined using $\ell_{\infty}$-norm, $\in [0.1,0.4]$. Our findings reveal that our hybrid strategy, which is composed of a classifier to detect adversarial examples, a denoiser to eliminate noise from the perturbed data samples, and a standard forecaster, achieves the best performance on both clean and perturbed data. Our optimal model can retain up to $92.02\%$ the performance of the original forecasting model in terms of Mean Squared Error (MSE) on clean data, while being more robust than the standard adversarially trained models on perturbed data. Its MSE is 2.71$\times$ and 2.51$\times$ lower than those of comparing methods on normal and perturbed data, respectively. In addition, the components of our models can be trained in parallel, resulting in better computational efficiency. Our results indicate that we can optimally balance the trade-off between the performance and robustness of forecasting models by improving the classifier and denoiser, even in the presence of sophisticated and destructive poisoning attacks.
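At inference time the hybrid defense reduces to a simple detect-denoise-forecast pipeline; the sketch below shows the control flow, with toy stand-ins in place of the three trained components:

```python
# Orchestration sketch of the detect -> denoise -> forecast pipeline.
import numpy as np

def robust_forecast(window, classifier, denoiser, forecaster):
    """window: 1-D array of recent traffic values."""
    if classifier(window):          # flagged as adversarially perturbed?
        window = denoiser(window)   # strip the injected perturbation first
    return forecaster(window)

# Toy stand-ins just to show the control flow.
clf = lambda w: np.abs(np.diff(w)).max() > 3.0               # crude anomaly test
den = lambda w: np.convolve(w, np.ones(3) / 3, mode="same")  # moving average
fcst = lambda w: w[-1]                                       # naive last-value forecast
print(robust_forecast(np.array([1.0, 1.2, 9.0, 1.1]), clf, den, fcst))
```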

3vLTL: A Tool to Generate Automata for Three-valued LTL

  • paper_url: http://arxiv.org/abs/2311.09787
  • repo_url: None
  • paper_authors: Francesco Belardinelli, Angelo Ferrando, Vadim Malvone
  • for: This paper presents a tool that generates Buchi automata for verifying multi-valued specifications written in Linear-time Temporal Logic (LTL).
  • methods: The tool interprets LTL formulas under a three-valued semantics and, given a formula, an alphabet of atomic propositions, and a truth value, generates a Buchi automaton accepting all words that assign the chosen truth value to the formula.
  • results: The generated Buchi automaton can be seamlessly processed by third-party libraries and used in formal verification to check whether an LTL formula is true, false, or undefined on a given model.
    Abstract Multi-valued logics have a long tradition in the literature on system verification, including run-time verification. However, comparatively fewer model-checking tools have been developed for multi-valued specification languages. We present 3vLTL, a tool to generate Buchi automata from formulas in Linear-time Temporal Logic (LTL) interpreted on a three-valued semantics. Given an LTL formula, a set of atomic propositions as the alphabet for the automaton, and a truth value, our procedure generates a Buchi automaton that accepts all the words that assign the chosen truth value to the LTL formula. Given the particular type of the output of the tool, it can also be seamlessly processed by third-party libraries in a natural way. That is, the Buchi automaton can then be used in the context of formal verification to check whether an LTL formula is true, false, or undefined on a given model.

Correct-by-Construction Control for Stochastic and Uncertain Dynamical Models via Formal Abstractions

  • paper_url: http://arxiv.org/abs/2311.09786
  • repo_url: None
  • paper_authors: Thom Badings, Nils Jansen, Licio Romao, Alessandro Abate
  • for: This paper provides a sound method for the automated synthesis of correct-by-construction controllers for autonomous systems.
  • methods: It uses an abstraction of the stochastic dynamical model as a Markov decision process with intervals of probabilities (iMDP), together with state-of-the-art verification techniques, to compute an optimal policy that satisfies a given specification.
  • results: The results show that this approach yields a controller that provably satisfies the specification, with guarantees that carry over from the abstraction to the dynamical model.
    Abstract Automated synthesis of correct-by-construction controllers for autonomous systems is crucial for their deployment in safety-critical scenarios. Such autonomous systems are naturally modeled as stochastic dynamical models. The general problem is to compute a controller that provably satisfies a given task, represented as a probabilistic temporal logic specification. However, factors such as stochastic uncertainty, imprecisely known parameters, and hybrid features make this problem challenging. We have developed an abstraction framework that can be used to solve this problem under various modeling assumptions. Our approach is based on a robust finite-state abstraction of the stochastic dynamical model in the form of a Markov decision process with intervals of probabilities (iMDP). We use state-of-the-art verification techniques to compute an optimal policy on the iMDP with guarantees for satisfying the given specification. We then show that, by construction, we can refine this policy into a feedback controller for which these guarantees carry over to the dynamical model. In this short paper, we survey our recent research in this area and highlight two challenges (related to scalability and dealing with nonlinear dynamics) that we aim to address with our ongoing research.
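The key computation on an iMDP abstraction is a robust Bellman backup in which an adversary picks, within the probability intervals, the distribution that minimizes the expected value. Below is a sketch of the standard order-based solution to that inner problem; the numbers are invented, and this is not the authors' tool:

```python
# Sketch of the robust (pessimistic) backup on an interval MDP transition.
def worst_case_expectation(lows, highs, values):
    """Minimize sum(p_i * v_i) s.t. lows <= p <= highs, sum(p) = 1."""
    p = list(lows)
    budget = 1.0 - sum(lows)                 # probability mass left to assign
    for i in sorted(range(len(values)), key=lambda i: values[i]):
        extra = min(highs[i] - lows[i], budget)
        p[i] += extra                        # push mass toward low-value states
        budget -= extra
    return sum(pi * vi for pi, vi in zip(p, values))

# One state with three successors and current value estimates:
lo, hi, v = [0.1, 0.2, 0.1], [0.6, 0.7, 0.5], [0.0, 1.0, 0.4]
print(worst_case_expectation(lo, hi, v))     # robust backup value (0.28 here)
```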

Automatic Generation of Scenarios for System-level Simulation-based Verification of Autonomous Driving Systems

  • paper_url: http://arxiv.org/abs/2311.09784
  • repo_url: None
  • paper_authors: Srajan Goyal, Alberto Griggio, Jacob Kimblad, Stefano Tonetta
  • for: The paper presents a generic framework for system-level simulation-based verification and validation (V&V) of autonomous driving systems (ADS) that employ AI components.
  • methods: The framework uses a simulation model of the system, an abstract model that describes symbolically the system behavior, and formal methods to generate scenarios and verify the simulation executions. The approach leverages the CARLA driving simulator and its ScenarioRunner tool to create diverse and complex driving scenarios.
  • results: The paper describes the instantiation of the VIVAS framework for an ADS case study, demonstrates the effectiveness of the approach in automatically generating scenarios for system-level simulation-based V&V of an automated driving system using CARLA and ScenarioRunner, and highlights the potential of the approach as a powerful tool in the future of ADS V&V methodologies.
    Abstract With increasing complexity of Automated Driving Systems (ADS), ensuring their safety and reliability has become a critical challenge. The Verification and Validation (V&V) of these systems are particularly demanding when AI components are employed to implement perception and/or control functions. In ESA-funded project VIVAS, we developed a generic framework for system-level simulation-based V&V of autonomous systems. The approach is based on a simulation model of the system, an abstract model that describes symbolically the system behavior, and formal methods to generate scenarios and verify the simulation executions. Various coverage criteria can be defined to guide the automated generation of the scenarios. In this paper, we describe the instantiation of the VIVAS framework for an ADS case study. This is based on the integration of CARLA, a widely-used driving simulator, and its ScenarioRunner tool, which enables the creation of diverse and complex driving scenarios. This is also used in the CARLA Autonomous Driving Challenge to validate different ADS agents for perception and control based on AI, shared by the CARLA community. We describe the development of an abstract ADS model and the formulation of a coverage criterion that focuses on the behaviors of vehicles relative to the vehicle with ADS under verification. Leveraging the VIVAS framework, we generate and execute various driving scenarios, thus testing the capabilities of the AI components. The results show the effectiveness of VIVAS in automatically generating scenarios for system-level simulation-based V&V of an automated driving system using CARLA and ScenarioRunner. Therefore, they highlight the potential of the approach as a powerful tool in the future of ADS V&V methodologies.

Investigating Data Contamination in Modern Benchmarks for Large Language Models

  • paper_url: http://arxiv.org/abs/2311.09783
  • repo_url: None
  • paper_authors: Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, Arman Cohan
  • for: Improving the reliability of evaluation benchmarks and methodologies for LLMs.
  • methods: The paper proposes two contamination-detection methods applicable to both open-source and proprietary models: a retrieval-based system and the TS-Guessing protocol.
  • results: The study finds that certain commercial LLMs can accurately guess missing options in test sets, and observes improved model performance on several benchmarks when additional metadata is provided.
    Abstract Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks. This issue is especially critical for closed-source models and certain open-source models where training data transparency is lacking. In this paper we study data contamination by proposing two methods tailored for both open-source and proprietary LLMs. We first introduce a retrieval-based system to explore potential overlaps between evaluation benchmarks and pretraining corpora. We further present a novel investigation protocol named Testset Slot Guessing (TS-Guessing), applicable to both open and proprietary models. This approach entails masking a wrong answer in a multiple-choice question and prompting the model to fill in the gap. Additionally, it involves obscuring an unlikely word in an evaluation example and asking the model to produce it. We find that certain commercial LLMs could surprisingly guess the missing option in various test sets. Specifically, in the TruthfulQA benchmark, we find that LLMs exhibit notable performance improvement when provided with additional metadata in the benchmark. Further, in the MMLU benchmark, ChatGPT and GPT-4 demonstrated an exact match rate of 52% and 57%, respectively, in guessing the missing options in benchmark test data. We hope these results underscore the need for more robust evaluation methodologies and benchmarks in the field.
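The TS-Guessing protocol is easy to picture: hide one wrong option of a multiple-choice item and ask the model to reproduce it verbatim, so that an exact match hints at contamination. The prompt wording and the `ask_llm` placeholder below are assumptions:

```python
# Sketch of the Testset Slot Guessing probe: mask a *wrong* option.
def ts_guessing_prompt(question, options, hide_index):
    shown = [f"{chr(65 + i)}. {'[MASKED]' if i == hide_index else o}"
             for i, o in enumerate(options)]
    return ("Fill in the [MASKED] option of this benchmark question, "
            "reproducing it exactly:\n"
            f"{question}\n" + "\n".join(shown))

q = "What is the capital of France?"
opts = ["Paris", "Lyon", "Marseille", "Nice"]
prompt = ts_guessing_prompt(q, opts, hide_index=2)   # mask a wrong answer
# guess = ask_llm(prompt); contaminated = guess.strip() == opts[2]
```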

Model Checking for Closed-Loop Robot Reactive Planning

  • paper_url: http://arxiv.org/abs/2311.09780
  • repo_url: None
  • paper_authors: Christopher Chandler, Bernd Porr, Alice Miller, Giulia Lafratta
  • for: This work uses model checking to create multi-step plans that let a differential-drive wheeled robot avoid immediate danger.
  • methods: A small, purpose-built model checking algorithm generates plans in situ in real time, reflecting the egocentric reactive response of simple biological agents. The approach chains temporary control systems that are spawned to eliminate disturbances in the local environment which disrupt the autonomous agent from its preferred action (or resting state). It uses a novel discretization of 2D LiDAR data that is sensitive to bounded stochastic variations in the immediate environment, and realizes multi-step planning via forward depth-first search, with a cul-de-sac scenario as the first test case.
  • results: The results show that model checking can plan efficient trajectories that outperform one-step planning, achieved in near real time with no pre-computed data. Despite some limitations, the approach shows promise for developing safe, reliable, and transparent trajectory planning.
    Abstract In this paper, we show how model checking can be used to create multi-step plans for a differential drive wheeled robot so that it can avoid immediate danger. Using a small, purpose built model checking algorithm in situ we generate plans in real-time in a way that reflects the egocentric reactive response of simple biological agents. Our approach is based on chaining temporary control systems which are spawned to eliminate disturbances in the local environment that disrupt an autonomous agent from its preferred action (or resting state). The method involves a novel discretization of 2D LiDAR data which is sensitive to bounded stochastic variations in the immediate environment. We operationalise multi-step planning using invariant checking by forward depth-first search, using a cul-de-sac scenario as a first test case. Our results demonstrate that model checking can be used to plan efficient trajectories for local obstacle avoidance, improving on the performance of a reactive agent which can only plan one step. We achieve this in near real-time using no pre-computed data. While our method has limitations, we believe our approach shows promise as an avenue for the development of safe, reliable and transparent trajectory planning in the context of autonomous vehicles.
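The planning core, forward depth-first search with an invariant check, can be sketched abstractly as below; the toy grid, the action set, and the cul-de-sac layout stand in for the paper's discretized LiDAR model:

```python
# Abstract sketch: multi-step planning by DFS with a safety invariant.
def dfs_plan(state, safe, goal, actions, depth):
    """Return a list of actions keeping `safe` invariant until `goal`."""
    if goal(state):
        return []
    if depth == 0:
        return None
    for name, step in actions.items():
        nxt = step(state)
        if safe(nxt):                       # invariant: never enter danger
            rest = dfs_plan(nxt, safe, goal, actions, depth - 1)
            if rest is not None:
                return [name] + rest
    return None

obstacles = {(1, 0), (1, 1)}                # a cul-de-sac ahead of the robot
acts = {"fwd": lambda s: (s[0] + 1, s[1]), "left": lambda s: (s[0], s[1] - 1)}
plan = dfs_plan((0, 0), lambda s: s not in obstacles,
                lambda s: s[0] >= 2, acts, depth=4)
print(plan)   # ['left', 'fwd', 'fwd'] -- a multi-step escape, not one step
```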

HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs

  • paper_url: http://arxiv.org/abs/2311.09774
  • repo_url: None
  • paper_authors: Junying Chen, Xidong Wang, Anningzhe Gao, Feng Jiang, Shunian Chen, Hongbo Zhang, Dingjie Song, Wenya Xie, Chuyi Kong, Jianquan Li, Xiang Wan, Haizhou Li, Benyou Wang
  • for: This paper aims to adapt a general language model such as Llama2 to a specialized domain like medicine by integrating domain-specific knowledge.
  • methods: It proposes transforming heterogeneous data from both the pre-training and supervised stages into a unified input-output pair format to simplify the learning protocol.
  • results: Across multiple medical benchmarks, the model achieves state-of-the-art performance in the Chinese medicine domain and even surpasses proprietary models such as ChatGPT and GPT-4 in some aspects; expert manual evaluations further confirm its advantages.
    Abstract Adapting a language model into a specific domain, a.k.a. 'domain adaptation', is a common practice when specialized knowledge, e.g. medicine, is not encapsulated in a general language model like Llama2. The challenge lies in the heterogeneity of data across the two training stages, as it varies in languages, genres, or formats. To tackle this and simplify the learning protocol, we propose to transform heterogeneous data, from both the pre-training and supervised stages, into a unified, simple input-output pair format. We validate the new protocol in domains where proprietary LLMs like ChatGPT perform relatively poorly, such as Traditional Chinese Medicine. The developed model, HuatuoGPT-II, has shown state-of-the-art performance in the Chinese medicine domain on a number of benchmarks, e.g. medical licensing exams. It even outperforms proprietary models like ChatGPT and GPT-4 in some aspects, especially in Traditional Chinese Medicine. Expert manual evaluations further validate HuatuoGPT-II's advantages over existing LLMs. Notably, HuatuoGPT-II was benchmarked on a fresh Chinese National Medical Licensing Examination where it achieved the best performance, showcasing not only its effectiveness but also its generalization capabilities.

Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders

  • paper_url: http://arxiv.org/abs/2311.09765
  • repo_url: https://github.com/amy-hyunji/lora-for-retrieval
  • paper_authors: Hyunji Lee, Luca Soldaini, Arman Cohan, Minjoon Seo, Kyle Lo
  • for: This paper aims to improve the generalization of dense retriever models to unseen domains.
  • methods: It recommends a simple training recipe: train on MSMARCO with parameter-efficient methods such as LoRA, and opt for in-batch negatives unless well-constructed hard negatives are available.
  • results: Validated on the BEIR benchmark, these recommendations hold across choices of dense encoder and base model size, and are complementary to other resource-intensive strategies for out-of-domain generalization such as architectural modifications or additional pretraining.
    Abstract Prevailing research practice today often relies on training dense retrievers on existing large datasets such as MSMARCO and then experimenting with ways to improve zero-shot generalization capabilities to unseen domains. While prior work has tackled this challenge through resource-intensive steps such as data augmentation, architectural modifications, increasing model size, or even further base model pretraining, comparatively little investigation has examined whether the training procedures themselves can be improved to yield better generalization capabilities in the resulting models. In this work, we recommend a simple recipe for training dense encoders: Train on MSMARCO with parameter-efficient methods, such as LoRA, and opt for using in-batch negatives unless given well-constructed hard negatives. We validate these recommendations using the BEIR benchmark and find results are persistent across choice of dense encoder and base model size and are complementary to other resource-intensive strategies for out-of-domain generalization such as architectural modifications or additional pretraining. We hope that this thorough and impartial study around various training techniques, which augments other resource-intensive methods, offers practical insights for developing a dense retrieval model that effectively generalizes, even when trained on a single dataset.
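The in-batch-negatives recommendation corresponds to a standard InfoNCE objective in which each query's positive passage doubles as a negative for every other query in the batch; here is a minimal PyTorch sketch (LoRA adapters on the encoder are omitted, and the temperature is illustrative):

```python
# Sketch of the contrastive step with in-batch negatives (InfoNCE).
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(q_emb, d_emb, temperature=0.05):
    # q_emb, d_emb: (batch, dim); row i of d_emb is the positive for query i,
    # every other row in the batch serves as a negative.
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    scores = q @ d.T / temperature            # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))          # diagonal entries are positives
    return F.cross_entropy(scores, labels)

loss = in_batch_negatives_loss(torch.randn(8, 128), torch.randn(8, 128))
```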

Graph-Guided Reasoning for Multi-Hop Question Answering in Large Language Models

  • paper_url: http://arxiv.org/abs/2311.09762
  • repo_url: None
  • paper_authors: Jinyoung Park, Ameen Patel, Omar Zia Khan, Hyunwoo J. Kim, Joo-Kyung Kim
  • for: The paper aims to improve the multi-step reasoning capabilities of large language models (LLMs) by addressing two issues in previous CoT prompting methods: generating irrelevant rationales and failing to compose subquestions or queries for obtaining all relevant information.
  • methods: The proposed graph-guided CoT prompting method uses a "question/rationale graph" constructed by LLMs to guide the reasoning process, with graph representation and verification steps that filter out irrelevant rationales and generate follow-up questions to obtain relevant information.
  • results: The proposed method shows superior performance compared to previous CoT prompting methods and their variants on multi-hop question answering benchmark datasets.
    Abstract Chain-of-Thought (CoT) prompting has boosted the multi-step reasoning capabilities of Large Language Models (LLMs) by generating a series of rationales before the final answer. We analyze the reasoning paths generated by CoT and find two issues in multi-step reasoning: (i) Generating rationales irrelevant to the question, (ii) Unable to compose subquestions or queries for generating/retrieving all the relevant information. To address them, we propose a graph-guided CoT prompting method, which guides the LLMs to reach the correct answer with graph representation/verification steps. Specifically, we first leverage LLMs to construct a "question/rationale graph" by using knowledge extraction prompting given the initial question and the rationales generated in the previous steps. Then, the graph verification step diagnoses the current rationale triplet by comparing it with the existing question/rationale graph to filter out irrelevant rationales and generate follow-up questions to obtain relevant information. Additionally, we generate CoT paths that exclude the extracted graph information to represent the context information missed from the graph extraction. Our graph-guided reasoning method shows superior performance compared to previous CoT prompting and the variants on multi-hop question answering benchmark datasets.

MAFALDA: A Benchmark and Comprehensive Study of Fallacy Detection and Classification

  • paper_url: http://arxiv.org/abs/2311.09761
  • repo_url: None
  • paper_authors: Chadi Helwe, Tom Calamai, Pierre-Henri Paris, Chloé Clavel, Fabian Suchanek
  • for: The paper aims to address the challenges of automated fallacy detection and classification, particularly the subjectivity of the task and the need for a comprehensive, unified approach in existing research.
  • methods: The paper introduces a novel taxonomy of fallacies that refines and aligns previous classifications, a new annotation scheme tailored for subjective NLP tasks, and a new evaluation method designed to handle subjectivity.
  • results: The paper introduces MAFALDA (Multi-level Annotated FALlacy DAtaset), a gold standard dataset built from examples in various previously existing fallacy datasets under the unified taxonomy, and evaluates several language models in a zero-shot learning setting on MAFALDA to assess their fallacy detection and classification capability, providing valuable insights into the models' strengths and limitations in addressing fallacious reasoning.
    Abstract Fallacies can be used to spread disinformation, fake news, and propaganda, underlining the importance of their detection. Automated detection and classification of fallacies, however, remain challenging, mainly because of the innate subjectivity of the task and the need for a comprehensive, unified approach in existing research. Addressing these limitations, our study introduces a novel taxonomy of fallacies that aligns and refines previous classifications, a new annotation scheme tailored for subjective NLP tasks, and a new evaluation method designed to handle subjectivity, adapted to precision, recall, and F1-Score metrics. Using our annotation scheme, the paper introduces MAFALDA (Multi-level Annotated FALlacy DAtaset), a gold standard dataset. MAFALDA is based on examples from various previously existing fallacy datasets under our unified taxonomy across three levels of granularity. We then evaluate several language models under a zero-shot learning setting using MAFALDA to assess their fallacy detection and classification capability. Our comprehensive evaluation not only benchmarks the performance of these models but also provides valuable insights into their strengths and limitations in addressing fallacious reasoning.

UFPS: A unified framework for partially-annotated federated segmentation in heterogeneous data distribution

  • paper_url: http://arxiv.org/abs/2311.09757
  • repo_url: https://github.com/tekap404/unified_federated_partially-labeled_segmentation
  • paper_authors: Le Jiang, Li Yan Ma, Tie Yong Zeng, Shi Hui Ying
  • for: This work addresses medical image segmentation from partially annotated labels under federated learning, without compromising data privacy.
  • methods: It formulates federated partially supervised segmentation (FPSS) and proposes a Unified Federated Partially-labeled Segmentation (UFPS) framework that trains a totipotential global model, addressing the class heterogeneity and client drift inherent to partially supervised segmentation.
  • results: Experiments show that UFPS better resolves class collision and client drift than modified baselines and exhibits stronger deconflicting and generalization ability on real medical imaging data.
    Abstract Partially supervised segmentation is a label-saving approach based on datasets in which only a fraction of the classes is labeled and the label sets of different datasets intersect. However, it is still far from landing in real-world medical applications due to privacy concerns and data heterogeneity. As a remedy without privacy leakage, federated partially supervised segmentation (FPSS) is formulated in this work. The main challenges for FPSS are class heterogeneity and client drift. We propose a Unified Federated Partially-labeled Segmentation (UFPS) framework to segment pixels within all classes for partially-annotated datasets by training a totipotential global model without class collision. Our framework includes Unified Label Learning and sparsed Unified Sharpness Aware Minimization for unification of the class and feature spaces, respectively. Through empirical study, we find that vanilla combinations of traditional partially supervised segmentation and federated learning methods are mainly hampered by class collision. Our comprehensive experiments on real medical datasets demonstrate the better deconflicting and generalization ability of UFPS compared with modified methods.

Redefining the Laparoscopic Spatial Sense: AI-based Intra- and Postoperative Measurement from Stereoimages

  • paper_url: http://arxiv.org/abs/2311.09744
  • repo_url: https://github.com/leopoldmueller/laparoscopicmeasurement
  • paper_authors: Leopold Müller, Patrick Hemmer, Moritz Queisner, Igor Sauer, Simeon Allmendinger, Johannes Jakubik, Michael Vössing, Niklas Kühl
  • for: This paper aims to provide a more accurate and efficient solution for image-guided surgery, specifically for measuring relevant structures such as vessel segments, resection margins, and bowel lengths.
  • methods: The proposed method utilizes stereo vision and state-of-the-art machine learning architectures, including RAFT-Stereo and YOLOv8, to achieve high accuracy in distance measurements with errors below 1 mm.
  • results: The developed method is assessed in various realistic experimental evaluation environments and demonstrates robustness in challenging environments with textureless regions, outlining its potential for more precise, safe, and efficient surgical procedures.
    Abstract A significant challenge in image-guided surgery is the accurate measurement task of relevant structures such as vessel segments, resection margins, or bowel lengths. While this task is an essential component of many surgeries, it involves substantial human effort and is prone to inaccuracies. In this paper, we develop a novel human-AI-based method for laparoscopic measurements utilizing stereo vision that has been guided by practicing surgeons. Based on a holistic qualitative requirements analysis, this work proposes a comprehensive measurement method, which comprises state-of-the-art machine learning architectures, such as RAFT-Stereo and YOLOv8. The developed method is assessed in various realistic experimental evaluation environments. Our results outline the potential of our method achieving high accuracies in distance measurements with errors below 1 mm. Furthermore, on-surface measurements demonstrate robustness when applied in challenging environments with textureless regions. Overall, by addressing the inherent challenges of image-guided surgery, we lay the foundation for a more robust and accurate solution for intra- and postoperative measurements, enabling more precise, safe, and efficient surgical procedures.
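The geometry behind such stereo measurements is straightforward: disparity gives depth, depth plus camera intrinsics gives 3-D points, and the measurement follows from Euclidean norms. The sketch below uses illustrative camera parameters, not the paper's rig:

```python
# Back-of-the-envelope sketch: disparity -> depth -> 3-D points -> distance.
import numpy as np

def pixel_to_3d(u, v, disparity, f, baseline, cx, cy):
    z = f * baseline / disparity          # depth from stereo disparity
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return np.array([x, y, z])

f, B, cx, cy = 800.0, 0.004, 320.0, 240.0   # focal (px), baseline (m), principal pt
p1 = pixel_to_3d(300, 200, disparity=40.0, f=f, baseline=B, cx=cx, cy=cy)
p2 = pixel_to_3d(420, 260, disparity=38.0, f=f, baseline=B, cx=cx, cy=cy)
print(f"measured distance: {np.linalg.norm(p1 - p2) * 1000:.2f} mm")
```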

Redefining Super-Resolution: Fine-mesh PDE predictions without classical simulations

  • paper_url: http://arxiv.org/abs/2311.09740
  • repo_url: None
  • paper_authors: Rajat Kumar Sarkar, Ritam Majumdar, Vishal Jadhav, Sagar Srinivas Sakhinana, Venkataramana Runkana
  • for: To improve the precision of coarse-mesh simulations in computational fluid dynamics (CFD) while avoiding the constraints of conventional super-resolution methods.
  • methods: The authors propose a new definition of super-resolution for PDE-based problems that takes coarse-grid simulated data as input and predicts fine-grid simulated outcomes, using a physics-infused UNet upscaling method demonstrated across 2D-CFD problems such as discontinuity detection in Burgers' equation, methane combustion, and fouling in industrial heat exchangers.
  • results: The method generates accurate fine-mesh solutions without traditional simulation, yielding considerable computational savings while remaining faithful to the original ground truth; training under diverse boundary conditions further establishes its robustness, paving the way for broad application in engineering and scientific CFD solvers.
    Abstract In Computational Fluid Dynamics (CFD), coarse mesh simulations offer computational efficiency but often lack precision. Applying conventional super-resolution to these simulations poses a significant challenge due to the fundamental contrast between downsampling high-resolution images and authentically emulating low-resolution physics. The former method conserves more of the underlying physics, surpassing the usual constraints of real-world scenarios. We propose a novel definition of super-resolution tailored for PDE-based problems. Instead of simply downsampling from a high-resolution dataset, we use coarse-grid simulated data as our input and predict fine-grid simulated outcomes. Employing a physics-infused UNet upscaling method, we demonstrate its efficacy across various 2D-CFD problems such as discontinuity detection in Burger's equation, Methane combustion, and fouling in Industrial heat exchangers. Our method enables the generation of fine-mesh solutions bypassing traditional simulation, ensuring considerable computational saving and fidelity to the original ground truth outcomes. Through diverse boundary conditions during training, we further establish the robustness of our method, paving the way for its broad applications in engineering and scientific CFD solvers.

Source Prompt: Coordinated Pre-training of Language Models on Diverse Corpora from Multiple Sources

  • paper_url: http://arxiv.org/abs/2311.09732
  • repo_url: None
  • paper_authors: Yipei Xu, Dakuan Lu, Jiaqing Liang, Xintao Wang, Yipeng Geng, Yingsi Xin, Hengkui Wu, Ken Chen, ruiji zhang, Yanghua Xiao
  • for: This paper examines the pre-trained language model (PLM) paradigm in NLP and how to better coordinate these models with their pre-training corpora.
  • methods: A common and successful approach is to continuously scale up model sizes and pre-training corpora; these large corpora are typically converged from multiple sources and thus grow increasingly diverse, yet the side-effects of such colossal converged corpora remain understudied.
  • results: The authors identify that heterogeneity across source corpora can hurt pre-trained PLMs. To coordinate pre-training on diverse corpora, they propose source prompts (SP), which explicitly prompt the model with the data source at the pre-training and fine-tuning stages; PLMs pre-trained with SP on diverse corpora gain significant improvements on downstream tasks.
    Abstract Pre-trained language models (PLMs) have established the new paradigm in the field of NLP. For more powerful PLMs, one of the most popular and successful way is to continuously scale up sizes of the models and the pre-training corpora. These large corpora are generally obtained by converging smaller ones from multiple sources, they are thus growing increasingly diverse. However, the side-effects of these colossal converged corpora remain understudied. In this paper, we identify the disadvantage of heterogeneous corpora from multiple sources for pre-training PLMs. Towards coordinated pre-training on diverse corpora, we further propose source prompts (SP), which explicitly prompt the model of the data source at the pre-training and fine-tuning stages. Results of extensive experiments demonstrate that PLMs pre-trained with SP on diverse corpora gain significant improvement in various downstream tasks.
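The source-prompt idea itself is lightweight: prepend an explicit provenance tag to each training example so the model can condition on the data source. A minimal sketch follows, with an assumed tag format:

```python
# Minimal sketch: tag each training example with its data source.
def add_source_prompt(example_text: str, source: str) -> str:
    return f"<source: {source}> {example_text}"

batch = [("The mitochondria is ...", "textbooks"),
         ("def quicksort(arr): ...", "github")]
train_inputs = [add_source_prompt(t, s) for t, s in batch]
# At fine-tuning/inference time, the same tag steers the model toward the
# style and knowledge distribution of the desired source.
```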

Prudent Silence or Foolish Babble? Examining Large Language Models’ Responses to the Unknown

  • paper_url: http://arxiv.org/abs/2311.09731
  • repo_url: None
  • paper_authors: Genglin Liu, Xingyao Wang, Lifan Yuan, Yangyi Chen, Hao Peng
  • for: This study systematically investigates how LLMs behave when they lack the prerequisite knowledge to answer, and how that behavior diverges from human conversational norms.
  • methods: It uses an adversarial question-answering benchmark containing unanswerable questions that target information absent from the LLMs' training data.
  • results: The study finds that instruction finetuning and reinforcement learning from human feedback (RLHF) help LLMs better express uncertainty, and that on answerable valid questions such models show a positive correlation between accuracy and confidence.
    Abstract Large Language Models (LLMs) often struggle when faced with situations where they lack the prerequisite knowledge to generate a sensible response. In these cases, models tend to fabricate and hallucinate, rather than appropriately signaling uncertainty as humans would. This behavior misaligns with human conversational norms and presents challenges surrounding responsible and ethical AI development. This work aims to systematically investigate LLMs' behaviors in such situations. We curate an adversarial question-answering benchmark containing unanswerable questions targeting information absent from the LLM's training data. Concretely, these unanswerable questions contain non-existent concepts or false premises. When presented with such unanswerable questions, an LLM should appropriately convey uncertainty, and be able to challenge the premise and refuse to generate a response. When facing answerable valid questions, a model should demonstrate a positive correlation between accuracy and confidence. Using a model-agnostic unified confidence elicitation approach, we observe that LLMs that have gone through instruction finetuning and reinforcement learning from human feedback (RLHF) perform significantly better than their counterparts that do not. Moreover, uncertainty expression through our elicitation method does not always stay consistent with the perceived confidence of the direct response of an LLM. Our findings call for further research into teaching LLMs to proactively and reliably express uncertainty.

Aligning with Whom? Large Language Models Have Gender and Racial Biases in Subjective NLP Tasks

  • paper_url: http://arxiv.org/abs/2311.09730
  • repo_url: https://github.com/jiaxin-pei/llm-group-bias
  • paper_authors: Huaman Sun, Jiaxin Pei, Minje Choi, David Jurgens
  • for: This study investigates whether large language models (LLMs) exhibit gender and racial biases on subjective NLP tasks.
  • methods: Using the POPQUORN dataset, the authors run a series of experiments on four popular LLMs to examine bias in their predictions of politeness and offensiveness.
  • results: For both tasks, model predictions are closer to the labels from White and female participants, and prompting with a target demographic label (e.g., responding from the perspective of "Black" or "Asian" individuals) actually worsens performance. Code and data are available at https://github.com/Jiaxin-Pei/LLM-Group-Bias.
    Abstract Human perception of language depends on personal backgrounds like gender and ethnicity. While existing studies have shown that large language models (LLMs) hold values that are closer to certain societal groups, it is unclear whether their prediction behaviors on subjective NLP tasks also exhibit a similar bias. In this study, leveraging the POPQUORN dataset which contains annotations of diverse demographic backgrounds, we conduct a series of experiments on four popular LLMs to investigate their capability to understand group differences and potential biases in their predictions for politeness and offensiveness. We find that for both tasks, model predictions are closer to the labels from White and female participants. We further explore prompting with the target demographic labels and show that including the target demographic in the prompt actually worsens the model's performance. More specifically, when being prompted to respond from the perspective of "Black" and "Asian" individuals, models show lower performance in predicting both overall scores as well as the scores from corresponding groups. Our results suggest that LLMs hold gender and racial biases for subjective NLP tasks and that demographic-infused prompts alone may be insufficient to mitigate such effects. Code and data are available at https://github.com/Jiaxin-Pei/LLM-Group-Bias.

Outcome-supervised Verifiers for Planning in Mathematical Reasoning

  • paper_url: http://arxiv.org/abs/2311.09724
  • repo_url: None
  • paper_authors: Fei Yu, Anningzhe Gao, Benyou Wang
  • for: This work addresses the difficulty large language models (LLMs) have in maintaining accuracy across a sequence of intermediate steps in mathematical reasoning, where errors propagate and corrupt the final result.
  • methods: It proposes a new verifier, the Outcome-supervised Value Model (OVM), trained with outcome supervision so that steps are prioritized by whether they lead to a correct conclusion rather than by per-step correctness; this casts verification as a planning problem and avoids labor-intensive step-level correctness annotations, improving scalability.
  • results: On two multi-step mathematical reasoning datasets, GSM8K and Game of 24, the OVM performs strongly; on GSM8K, OVM-7B achieves state-of-the-art results among LLMs up to 13B parameters without using GPT-4 or code execution. The findings also offer a new perspective on, and theoretical justification for, outcome supervision in training verifiers.
    Abstract Large language models (LLMs) often struggle with maintaining accuracy across a sequence of intermediate reasoning steps in mathematical reasoning, leading to error propagation that undermines the final result. The current methodology to mitigate this issue primarily involves using a verifier model to assess the correctness of generated solution candidates, focusing either on the overall reasoning path or on an incomplete reasoning path. By rethinking this approach, we argue that assessing potentials of incomplete reasoning paths could be more advantageous as it guides towards correct final answers, transforming the task into a planning problem. Our proposed verifier, the Outcome-supervision Value Model (OVM), employs outcome supervision for training, offering an efficient and intuitive method for planning by prioritizing steps that lead to accurate conclusions over mere per-step correctness. Furthermore, the OVM eschews the need for labor-intensive annotations on step-level correctness, enhancing its scalability. Our experiments on two multi-step mathematical reasoning datasets, GSM8K and Game of 24, demonstrate the superior performance of the OVM model. Notably, in GSM8K, our OVM-7B model achieves state-of-the-art results among LLMs up to 13B parameters; especially it does not utilize GPT-4 or code execution. These findings offer a novel perspective on the role of outcome supervision in training verifiers for multi-step reasoning tasks and provide theoretical justification for its advantage in value estimation for planning.
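A minimal sketch of value-guided step selection in this spirit appears below: a verifier scores partial solutions by their estimated chance of reaching a correct final answer, and decoding keeps the highest-value partial paths. Both `propose_steps` and `value_model` are toy placeholders, not the paper's trained components.

```python
# Sketch of planning with an outcome-supervised value model (OVM-style).
import random

def propose_steps(prefix: str, k: int = 4) -> list[str]:
    # stand-in for sampling k candidate next steps from a generator LLM
    return [f"{prefix} step{random.randint(0, 99)};" for _ in range(k)]

def value_model(partial_solution: str) -> float:
    # stand-in for the trained verifier: P(correct final answer | partial path)
    return random.random()

def guided_decode(question: str, depth: int = 3, beam: int = 2) -> str:
    candidates = [question]
    for _ in range(depth):
        expanded = [s for c in candidates for s in propose_steps(c)]
        expanded.sort(key=value_model, reverse=True)  # prioritize promising paths
        candidates = expanded[:beam]
    return candidates[0]

print(guided_decode("Q: 3 workers build 3 walls in 3 days. How many in 9 days?"))
```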

You don’t need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments

  • paper_url: http://arxiv.org/abs/2311.09718
  • repo_url: https://github.com/orange0629/llm-personas
  • paper_authors: Bangzhao Shu, Lechen Zhang, Minje Choi, Lavinia Dunagan, Dallas Card, David Jurgens
  • for: This study examines whether the current question-style prompting format elicits consistent and robust responses from LLMs.
  • methods: The authors construct a dataset of 693 questions spanning 39 persona-measurement instruments (and 115 persona axes), then design prompts with minor variations to test both answer accuracy and consistency under simple perturbations such as switching the option order.
  • results: Across 15 open-source LLMs, even simple perturbations significantly degrade question-answering ability, and most LLMs show low negation consistency, suggesting that current prompting practice cannot accurately capture model perceptions; possible alternatives are discussed.
    Abstract The versatility of Large Language Models (LLMs) on natural language understanding tasks has made them popular for research in social sciences. In particular, to properly understand the properties and innate personas of LLMs, researchers have performed studies that involve using prompts in the form of questions that ask LLMs of particular opinions. In this study, we take a cautionary step back and examine whether the current format of prompting enables LLMs to provide responses in a consistent and robust manner. We first construct a dataset that contains 693 questions encompassing 39 different instruments of persona measurement on 115 persona axes. Additionally, we design a set of prompts containing minor variations and examine LLM's capabilities to generate accurate answers, as well as consistency variations to examine their consistency towards simple perturbations such as switching the option order. Our experiments on 15 different open-source LLMs reveal that even simple perturbations are sufficient to significantly downgrade a model's question-answering ability, and that most LLMs have low negation consistency. Our results suggest that the currently widespread practice of prompting is insufficient to accurately capture model perceptions, and we discuss potential alternatives to improve such issues.
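The snippet below sketches one such perturbation probe (the paper's exact protocol is richer): the same multiple-choice item is asked with the option order reversed, and consistency means the model picks the same option content, not the same letter. `ask_llm` is a dummy model that always answers "A", which makes position bias visible.

```python
# Sketch of an option-order consistency check for persona questionnaires.
def ask_llm(prompt: str) -> str:
    return "A"  # dummy, position-biased model; replace with a real call

def consistent_under_swap(question: str, options: list[str]) -> bool:
    def build(opts: list[str]) -> str:
        lettered = [f"{chr(65 + i)}. {o}" for i, o in enumerate(opts)]
        return question + "\n" + "\n".join(lettered) + "\nAnswer with a letter."

    first = options[ord(ask_llm(build(options))) - 65]
    swapped = list(reversed(options))
    second = swapped[ord(ask_llm(build(swapped))) - 65]
    return first == second  # True = same choice despite reordering

print(consistent_under_swap("I see myself as extraverted.", ["Agree", "Disagree"]))
# -> False for the position-biased dummy model above
```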

Towards Autonomous Hypothesis Verification via Language Models with Minimal Guidance

  • paper_url: http://arxiv.org/abs/2311.09706
  • repo_url: None
  • paper_authors: Shiro Takagi, Ryutaro Yamauchi, Wataru Kumagai
  • for: This work examines whether an AI can autonomously generate and verify hypotheses, a prerequisite for human-level autonomous research.
  • methods: GPT-4 is prompted to generate hypotheses and Python code for hypothesis verification on a toy machine learning research problem, with only limited methodological guidance.
  • results: In some instances GPT-4 can autonomously generate and validate hypotheses, but none of the verifications were flawless, indicating that significant challenges remain on the path to autonomous research.
    Abstract Research automation efforts usually employ AI as a tool to automate specific tasks within the research process. For an AI to truly conduct research itself, it must independently generate hypotheses, design verification plans, and execute verification. Therefore, we investigated whether an AI could autonomously generate and verify hypotheses for a toy machine learning research problem. We prompted GPT-4 to generate hypotheses and Python code for hypothesis verification with limited methodological guidance. Our findings suggest that, in some instances, GPT-4 can autonomously generate and validate hypotheses without detailed guidance. While this is a promising result, we also found that none of the verifications were flawless, and there remain significant challenges in achieving autonomous, human-level research using only generic instructions. These findings underscore the need for continued exploration to develop a general and autonomous AI researcher.

Deceiving Semantic Shortcuts on Reasoning Chains: How Far Can Models Go without Hallucination?

  • paper_url: http://arxiv.org/abs/2311.09702
  • repo_url: None
  • paper_authors: Bangzheng Li, Ben Zhou, Fei Wang, Xingyu Fu, Dan Roth, Muhao Chen
  • for: This work studies hallucination and unfaithful reasoning in large language models (LLMs) induced by semantic associations, and how these shortcuts affect model performance.
  • methods: It proposes a new probing method and benchmark, EureQA: starting from questions an LLM answers correctly with utmost certainty, the key entity is recursively masked behind evidence sentences, so the model must follow a chain of evidence before answering; semantic clues that would directly reveal the answer are deliberately replaced with distractor clues that require chain-like reasoning.
  • results: Existing LLMs lack the capability to follow correct reasoning chains and resist greedy shortcuts; distractor semantic associations frequently induce hallucination, which calls the validity of current LLM reasoning into question.
    Abstract Despite the recent advancement in large language models (LLMs) and their high performances across numerous benchmarks, recent research has unveiled that LLMs suffer from hallucinations and unfaithful reasoning. This work studies a specific type of hallucination induced by semantic associations. Specifically, we investigate to what extent LLMs take shortcuts from certain keyword/entity biases in the prompt instead of following the correct reasoning path. To quantify this phenomenon, we propose a novel probing method and benchmark called EureQA. We start from questions that LLMs will answer correctly with utmost certainty, and mask the important entity with evidence sentence recursively, asking models to find masked entities according to a chain of evidence before answering the question. During the construction of the evidence, we purposefully replace semantic clues (entities) that may lead to the correct answer with distractor clues (evidence) that will not directly lead to the correct answer but require a chain-like reasoning process. We evaluate if models can follow the correct reasoning chain instead of short-cutting through distractor clues. We find that existing LLMs lack the necessary capabilities to follow correct reasoning paths and resist the attempt of greedy shortcuts. We show that the distractor semantic associations often lead to model hallucination, which is strong evidence that questions the validity of current LLM reasoning.

Accommodating Missing Modalities in Time-Continuous Multimodal Emotion Recognition

  • paper_url: http://arxiv.org/abs/2311.10119
  • repo_url: None
  • paper_authors: Juan Vazquez-Rodriguez, Grégoire Lefebvre, Julien Cumin, James L. Crowley
  • for: This work aims to improve time-continuous emotion recognition even when some input modalities are missing.
  • methods: The authors propose a Transformer-based architecture that couples cross-attention and self-attention to emphasize cross-modal relationships over time and to strengthen learning from weakly salient inputs.
  • results: On the Ulm-TSST dataset, the model improves the concordance correlation coefficient by 37% when predicting arousal and by 30% when predicting valence, compared to a late-fusion baseline.
    Abstract Decades of research indicate that emotion recognition is more effective when drawing information from multiple modalities. But what if some modalities are sometimes missing? To address this problem, we propose a novel Transformer-based architecture for recognizing valence and arousal in a time-continuous manner even with missing input modalities. We use a coupling of cross-attention and self-attention mechanisms to emphasize relationships between modalities during time and enhance the learning process on weak salient inputs. Experimental results on the Ulm-TSST dataset show that our model exhibits an improvement of the concordance correlation coefficient evaluation of 37% when predicting arousal values and 30% when predicting valence values, compared to a late-fusion baseline approach.
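A minimal PyTorch sketch of modality-tolerant cross-attention is given below; it simply falls back to the query stream when the other modality is absent. The feature dimensions, residual fusion, and fallback convention are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of cross-attention that tolerates a missing modality.
import torch
import torch.nn as nn

d = 32
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

def cross_attend(query: torch.Tensor, key_value: torch.Tensor, missing: bool):
    """Fuse key_value into query; if that modality is missing, pass the
    query through unchanged instead of attending over empty keys."""
    if missing:
        return query
    out, _ = attn(query=query, key=key_value, value=key_value)
    return query + out  # residual fusion

audio = torch.randn(1, 10, d)  # 10 time steps of audio features
video = torch.randn(1, 10, d)  # aligned video features (may be missing)
print(cross_attend(audio, video, missing=False).shape)  # torch.Size([1, 10, 32])
print(cross_attend(audio, video, missing=True).shape)   # graceful fallback
```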

BLT: Can Large Language Models Handle Basic Legal Text?

  • paper_url: http://arxiv.org/abs/2311.09693
  • repo_url: https://github.com/blairstanek/blt
  • paper_authors: Andrew Blair-Stanek, Nils Holzenberger, Benjamin Van Durme
  • for: Legal practice requires basic text handling, at which the best publicly available LLMs such as GPT-4 and PaLM 2 currently perform poorly.
  • methods: The authors introduce a benchmark that quantifies this poor performance, casting doubt on the current reliability of LLMs, as-is, for legal practice.
  • results: Fine-tuning an older LLM brings it to near-perfect performance on the test set and also raises performance on a related legal task, underscoring the need for more domain expertise in LLM training.
    Abstract We find that the best publicly available LLMs like GPT-4 and PaLM 2 currently perform poorly at basic text handling required of lawyers or paralegals, such as looking up the text at a line of a witness deposition or at a subsection of a contract. We introduce a benchmark to quantify this poor performance, which casts into doubt LLMs' current reliability as-is for legal practice. Finetuning for these tasks brings an older LLM to near-perfect performance on our test set and also raises performance on a related legal task. This stark result highlights the need for more domain expertise in LLM training.

Augmenting Unsupervised Reinforcement Learning with Self-Reference

  • paper_url: http://arxiv.org/abs/2311.09692
  • repo_url: None
  • paper_authors: Andrew Zhao, Erle Zhu, Rui Lu, Matthieu Lin, Yong-Jin Liu, Gao Huang
  • for: The paper proposes Self-Reference (SR), a new approach for improving reinforcement learning agents in the unsupervised pretrain-then-finetune setting.
  • methods: SR explicitly leverages historical information to mitigate the nonstationarity of intrinsic rewards during pretraining and to prevent the unlearning of valuable exploratory behaviors during finetuning.
  • results: SR achieves state-of-the-art results for model-free methods on the Unsupervised Reinforcement Learning Benchmark, with an 86% Interquartile Mean (IQM) and a 16% Optimality Gap, and improves current algorithms by up to 17% IQM while reducing the Optimality Gap by 31%.
    Abstract Humans possess the ability to draw on past experiences explicitly when learning new tasks and applying them accordingly. We believe this capacity for self-referencing is especially advantageous for reinforcement learning agents in the unsupervised pretrain-then-finetune setting. During pretraining, an agent's past experiences can be explicitly utilized to mitigate the nonstationarity of intrinsic rewards. In the finetuning phase, referencing historical trajectories prevents the unlearning of valuable exploratory behaviors. Motivated by these benefits, we propose the Self-Reference (SR) approach, an add-on module explicitly designed to leverage historical information and enhance agent performance within the pretrain-finetune paradigm. Our approach achieves state-of-the-art results in terms of Interquartile Mean (IQM) performance and Optimality Gap reduction on the Unsupervised Reinforcement Learning Benchmark for model-free methods, recording an 86% IQM and a 16% Optimality Gap. Additionally, it improves current algorithms by up to 17% IQM and reduces the Optimality Gap by 31%. Beyond performance enhancement, the Self-Reference add-on also increases sample efficiency, a crucial attribute for real-world applications.

Do Physicians Know How to Prompt? The Need for Automatic Prompt Optimization Help in Clinical Note Generation

  • paper_url: http://arxiv.org/abs/2311.09684
  • repo_url: None
  • paper_authors: Zonghai Yao, Ahmed Jaafar, Beining Wang, Yue Zhu, Zhichao Yang, Hong Yu
  • for: This study examines how prompt engineering affects the performance of large language models (LLMs) in clinical note generation, and proposes an Automatic Prompt Optimization (APO) framework to refine initial prompts.
  • methods: GPT-3.5 and GPT-4 are evaluated with APO, and their outputs are compared with those of medical experts and non-medical experts.
  • results: GPT-4 with APO excels at standardizing prompt quality across clinical note sections, and experts maintain content quality after APO while preferring their own modifications, motivating a two-phase process: APO-GPT4 for consistency, then expert input for personalization.
    Abstract This study examines the effect of prompt engineering on the performance of Large Language Models (LLMs) in clinical note generation. We introduce an Automatic Prompt Optimization (APO) framework to refine initial prompts and compare the outputs of medical experts, non-medical experts, and APO-enhanced GPT3.5 and GPT4. Results highlight GPT4 APO's superior performance in standardizing prompt quality across clinical note sections. A human-in-the-loop approach shows that experts maintain content quality post-APO, with a preference for their own modifications, suggesting the value of expert customization. We recommend a two-phase optimization process, leveraging APO-GPT4 for consistency and expert input for personalization.
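As a schematic of what an automatic prompt optimization loop can look like (not the paper's specific APO algorithm), the sketch below proposes small rewrites of the current prompt, scores each candidate, and keeps the best; `rewrite` and `score` stand in for an LLM-based editor and a note-quality metric.

```python
# Sketch of a hill-climbing automatic prompt optimization (APO) loop.
import random

def rewrite(prompt: str) -> str:
    # stand-in for an LLM proposing a targeted edit to the prompt
    tweaks = [" Be concise.", " Use SOAP section headers.", " Avoid speculation."]
    return prompt + random.choice(tweaks)

def score(prompt: str) -> float:
    # stand-in for evaluating generated clinical notes on a dev set
    return random.random()

def optimize(seed_prompt: str, rounds: int = 5, width: int = 4) -> str:
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(rounds):
        for candidate in (rewrite(best) for _ in range(width)):
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
    return best

print(optimize("Summarize this doctor-patient dialogue as a clinical note."))
```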

MacGyver: Are Large Language Models Creative Problem Solvers?

  • paper_url: http://arxiv.org/abs/2311.09682
  • repo_url: None
  • paper_authors: Yufei Tian, Abhilasha Ravichander, Lianhui Qin, Ronan Le Bras, Raja Marjieh, Nanyun Peng, Yejin Choi, Thomas L. Griffiths, Faeze Brahman
  • for: This paper aims to explore the creative problem-solving capabilities of modern large language models (LLMs) in a constrained setting, specifically in circumventing functional fixedness.
  • methods: The paper uses an automatically generated dataset called MacGyver, which consists of 1,600 real-world problems that deliberately trigger functional fixedness and require thinking ‘out-of-the-box’. The paper compares and contrasts the problem-solving abilities of LLMs and humans on this dataset.
  • results: The paper shows that both LLMs and humans struggle with the MacGyver problems, but in different ways. LLMs are prone to overconfidence and propose physically infeasible or inefficient solutions, while humans excel in solving familiar problems but struggle with tasks requiring domain-specific knowledge. The paper also demonstrates the potential of enhancing LLMs’ problem-solving ability with novel prompting techniques.
    Abstract We explore the creative problem-solving capabilities of modern large language models (LLMs) in a constrained setting. The setting requires circumventing a cognitive bias known in psychology as ''functional fixedness'' to use familiar objects in innovative or unconventional ways. To this end, we create MacGyver, an automatically generated dataset consisting of 1,600 real-world problems that deliberately trigger functional fixedness and require thinking 'out-of-the-box'. We then present our collection of problems to both LLMs and humans to compare and contrast their problem-solving abilities. We show that MacGyver is challenging for both groups, but in unique and complementary ways. For example, humans typically excel in solving problems that they are familiar with but may struggle with tasks requiring domain-specific knowledge, leading to a higher variance. On the other hand, LLMs, being exposed to a variety of highly specialized knowledge, attempt broader problems but are prone to overconfidence and propose actions that are physically infeasible or inefficient. We also provide a detailed error analysis of LLMs, and demonstrate the potential of enhancing their problem-solving ability with novel prompting techniques such as iterative step-wise reflection and divergent-convergent thinking. This work provides insight into the creative problem-solving capabilities of humans and AI and illustrates how psychological paradigms can be extended into large-scale tasks for comparing humans and machines.

Trustworthy Large Models in Vision: A Survey

  • paper_url: http://arxiv.org/abs/2311.09680
  • repo_url: None
  • paper_authors: Ziyan Guo, Jun Liu
  • for: This survey examines the trustworthiness of Large Models (LMs) in computer vision, laying out the associated challenges and countermeasures.
  • methods: It systematically summarizes four concerns that obstruct the trustworthy use of LMs in vision—1) human misuse, 2) vulnerability, 3) inherent issues, and 4) interpretability—highlighting the corresponding challenges, countermeasures, and discussion for each.
  • results: The survey offers a deeper understanding of the field, intended to promote the alignment of LMs with human expectations and their trustworthy use for the benefit of society.
    Abstract The rapid progress of Large Models (LMs) has recently revolutionized various fields of deep learning with remarkable grades, ranging from Natural Language Processing (NLP) to Computer Vision (CV). However, LMs are increasingly challenged and criticized by academia and industry due to their powerful performance but untrustworthy behavior, which urgently needs to be alleviated in reliable methods. Despite the abundance of literature on trustworthy LMs in language, a systematic survey specifically delving into the trustworthiness of LMs in vision remains absent. In order to mitigate this gap, we summarize four relevant concerns that obstruct the trustworthy usage in vision of LMs in this survey, including 1) human misuse, 2) vulnerability, 3) inherent issue and 4) interpretability. By highlighting corresponding challenge, countermeasures, and discussion in each topic, we hope this survey will facilitate readers' understanding of the field, promote alignment of LMs with human expectations and enable trustworthy LMs to serve as welfare rather than disaster for human society.

Structured Chemistry Reasoning with Large Language Models

  • paper_url: http://arxiv.org/abs/2311.09656
  • repo_url: None
  • paper_authors: Siru Ouyang, Zhuosheng Zhang, Bing Yan, Xuan Liu, Jiawei Han, Lianhui Qin
  • for: Solving complex chemistry problems with large language models (LLMs).
  • methods: The authors propose InstructChem, a structured reasoning approach that decomposes reasoning into three critical phases: chemical formulae generation, step-by-step reasoning, and iterative review-and-refinement.
  • results: Across four chemistry challenges (quantum chemistry, quantum mechanics, physical chemistry, and chemical kinetics), the approach significantly improves GPT-4's chemical reasoning, with an 8% average absolute gain and a 30% peak gain. Reasoning generated by GPT-4 is further used to fine-tune smaller LMs (e.g., Vicuna), which also improve strongly, validating the approach.
    Abstract This paper studies the problem of solving complex chemistry problems with large language models (LLMs). Despite the extensive general knowledge in LLMs (such as GPT-4), they struggle with chemistry reasoning that requires faithful grounded reasoning with diverse chemical knowledge and an integrative understanding of chemical interactions. We propose InstructChem, a new structured reasoning approach that substantially boosts the LLMs' chemical reasoning capabilities. InstructChem explicitly decomposes the reasoning into three critical phrases, including chemical formulae generation by LLMs that offers the basis for subsequent grounded reasoning, step-by-step reasoning that makes multi-step derivations with the identified formulae for a preliminary answer, and iterative review-and-refinement that steers LLMs to progressively revise the previous phases for increasing confidence, leading to the final high-confidence answer. We conduct extensive experiments on four different chemistry challenges, including quantum chemistry, quantum mechanics, physical chemistry, and chemistry kinetics. Our approach significantly enhances GPT-4 on chemistry reasoning, yielding an 8% average absolute improvement and a 30% peak improvement. We further use the generated reasoning by GPT-4 to fine-tune smaller LMs (e.g., Vicuna) and observe strong improvement of the smaller LMs. This validates our approach and enables LLMs to generate high-quality reasoning.

“It’s not like Jarvis, but it’s pretty close!” – Examining ChatGPT’s Usage among Undergraduate Students in Computer Science

  • paper_url: http://arxiv.org/abs/2311.09651
  • repo_url: None
  • paper_authors: Ishika Joshi, Ritvik Budhiraja, Harshal D Akolekar, Jagat Sesh Challa, Dhruv Kumar
  • for: This study seeks to understand how undergraduate computer science students use ChatGPT for coursework-related tasks.
  • methods: Student surveys and interviews are combined to gather perceptions and experiences of ChatGPT, along with the challenges encountered and suggested improvements.
  • results: A majority of students (over 57%) hold a convincingly positive outlook on adopting ChatGPT as an aid for coursework-related tasks, but several challenges must be resolved for its long-term acceptance among students.
    Abstract Large language models (LLMs) such as ChatGPT and Google Bard have garnered significant attention in the academic community. Previous research has evaluated these LLMs for various applications such as generating programming exercises and solutions. However, these evaluations have predominantly been conducted by instructors and researchers, not considering the actual usage of LLMs by students. This study adopts a student-first approach to comprehensively understand how undergraduate computer science students utilize ChatGPT, a popular LLM, released by OpenAI. We employ a combination of student surveys and interviews to obtain valuable insights into the benefits, challenges, and suggested improvements related to ChatGPT. Our findings suggest that a majority of students (over 57%) have a convincingly positive outlook towards adopting ChatGPT as an aid in coursework-related tasks. However, our research also highlights various challenges that must be resolved for long-term acceptance of ChatGPT amongst students. The findings from this investigation have broader implications and may be applicable to other LLMs and their role in computing education.

On the Exploitability of Reinforcement Learning with Human Feedback for Large Language Models

  • paper_url: http://arxiv.org/abs/2311.09641
  • repo_url: None
  • paper_authors: Jiongxiao Wang, Junlin Wu, Muhao Chen, Yevgeniy Vorobeychik, Chaowei Xiao
  • for: Protecting language models (LLMs) from malicious manipulation of reinforcement learning from human feedback (RLHF), which is used to keep them aligned with human preferences.
  • methods: The authors propose RankPoison, a poisoning attack that flips preference ranks among candidate responses to induce a target malicious behavior (e.g., generating longer sequences, which raises computational cost) without hurting the original safety alignment.
  • results: With RankPoison, LLMs can be made to generate longer answers, and a backdoor attack is also demonstrated in which longer answers are produced only for questions containing a trigger word; these findings highlight security challenges in RLHF and the need for more robust alignment methods.
    Abstract Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences, playing an important role in LLMs alignment. Despite its advantages, RLHF relies on human annotators to rank the text, which can introduce potential security vulnerabilities if any adversarial annotator (i.e., attackers) manipulates the ranking score by up-ranking any malicious text to steer the LLM adversarially. To assess the red-teaming of RLHF against human preference data poisoning, we propose RankPoison, a poisoning attack method on candidates' selection of preference rank flipping to reach certain malicious behaviors (e.g., generating longer sequences, which can increase the computational cost). With poisoned dataset generated by RankPoison, we can perform poisoning attacks on LLMs to generate longer tokens without hurting the original safety alignment performance. Moreover, applying RankPoison, we also successfully implement a backdoor attack where LLMs can generate longer answers under questions with the trigger word. Our findings highlight critical security challenges in RLHF, underscoring the necessity for more robust alignment methods for LLMs.

Automatic Engineering of Long Prompts

  • paper_url: http://arxiv.org/abs/2311.10117
  • repo_url: https://github.com/Sfedfcv/redesigned-pancake
  • paper_authors: Cho-Jui Hsieh, Si Si, Felix X. Yu, Inderjit S. Dhillon
  • for: Automatic engineering of long prompts for LLMs.
  • methods: Greedy algorithms with beam search, genetic algorithms, and LLM-based mutation enhanced with search history (see the sketch after the abstract).
  • results: An average accuracy gain of 9.2% on eight tasks in Big Bench Hard.
    Abstract Large language models (LLMs) have demonstrated remarkable capabilities in solving complex open-domain tasks, guided by comprehensive instructions and demonstrations provided in the form of prompts. However, these prompts can be lengthy, often comprising hundreds of lines and thousands of tokens, and their design often requires considerable human effort. Recent research has explored automatic prompt engineering for short prompts, typically consisting of one or a few sentences. However, the automatic design of long prompts remains a challenging problem due to its immense search space. In this paper, we investigate the performance of greedy algorithms and genetic algorithms for automatic long prompt engineering. We demonstrate that a simple greedy approach with beam search outperforms other methods in terms of search efficiency. Moreover, we introduce two novel techniques that utilize search history to enhance the effectiveness of LLM-based mutation in our search algorithm. Our results show that the proposed automatic long prompt engineering algorithm achieves an average of 9.2% accuracy gain on eight tasks in Big Bench Hard, highlighting the significance of automating prompt designs to fully harness the capabilities of LLMs.
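The sketch below shows a greedy beam search over sentence-level mutations of a long prompt, in the spirit of the approach; `mutate_sentence` and `accuracy` are toy stand-ins for an LLM-based rewriter and a dev-set evaluation.

```python
# Sketch of greedy beam search over sentence-level edits of a long prompt.
import random

def mutate_sentence(s: str) -> str:
    return s.rstrip(".") + ", stated precisely."  # placeholder LLM rewrite

def accuracy(prompt_sentences: list[str]) -> float:
    return random.random()  # placeholder: score the prompt on a dev set

def beam_search(sentences: list[str], steps: int = 10, beam: int = 3):
    frontier = [(accuracy(sentences), sentences)]
    for _ in range(steps):
        candidates = list(frontier)
        for _, sents in frontier:
            i = random.randrange(len(sents))  # mutate one sentence at a time
            new = sents[:i] + [mutate_sentence(sents[i])] + sents[i + 1:]
            candidates.append((accuracy(new), new))
        candidates.sort(key=lambda t: t[0], reverse=True)
        frontier = candidates[:beam]          # keep the best `beam` prompts
    return frontier[0]

score, best = beam_search(["Solve step by step.", "Answer with a number."])
print(round(score, 3), best)
```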

Online Continual Knowledge Learning for Language Models

  • paper_url: http://arxiv.org/abs/2311.09632
  • repo_url: None
  • paper_authors: Yuhao Wu, Tongjun Shi, Karthick Sharma, Chun Wei Seah, Shuhao Zhang
  • for: This work addresses managing the dynamic world knowledge held by language models (LLMs) under real-time constraints, as required for question answering and fact-checking in changing environments.
  • methods: It formulates a new continual learning problem, Online Continual Knowledge Learning (OCKL), together with a new benchmark and an evaluation metric that measures both the rate of new knowledge acquisition and the retention of previously learned knowledge.
  • results: Empirical evaluation of a variety of state-of-the-art methods shows that existing continual learning approaches are insufficient for OCKL, and the study identifies key factors that influence the trade-off between knowledge acquisition and retention.
    Abstract Large Language Models (LLMs) serve as repositories of extensive world knowledge, enabling them to perform tasks such as question-answering and fact-checking. However, this knowledge can become obsolete as global contexts change. In this paper, we introduce a novel problem in the realm of continual learning: Online Continual Knowledge Learning (OCKL). This problem formulation aims to manage the dynamic nature of world knowledge in LMs under real-time constraints. We propose a new benchmark and evaluation metric designed to measure both the rate of new knowledge acquisition and the retention of previously learned knowledge. Our empirical evaluation, conducted using a variety of state-of-the-art methods, establishes robust base-lines for OCKL. Our results reveal that existing continual learning approaches are unfortunately insufficient for tackling the unique challenges posed by OCKL. We identify key factors that influence the trade-off between knowledge acquisition and retention, thereby advancing our understanding of how to train LMs in a continually evolving environment.

CRISPR: Eliminating Bias Neurons from an Instruction-following Language Model

  • paper_url: http://arxiv.org/abs/2311.09627
  • repo_url: None
  • paper_authors: Nakyeong Yang, Taegwan Kang, Kyomin Jung
  • for: This paper addresses the distribution differences between user instructions and training instructions that distract and bias instruction-following language models (LLMs), especially under inconsistent dynamic labels.
  • methods: It proposes CRISPR, a bias-mitigation method that uses attribution methods to identify the "bias neurons" responsible for biased outputs and then prunes those neurons away.
  • results: CRISPR effectively mitigates instruction-label biases and improves performance on social-bias benchmarks without sacrificing pre-existing knowledge; it is practical, model-agnostic, and flexible enough to adapt to evolving social biases.
    Abstract Large language models (LLMs) executing tasks through instruction-based prompts often face challenges stemming from distribution differences between user instructions and training instructions. This leads to distractions and biases, especially when dealing with inconsistent dynamic labels. In this paper, we introduce a novel bias mitigation method, CRISPR, designed to alleviate instruction-label biases in LLMs. CRISPR utilizes attribution methods to identify bias neurons influencing biased outputs and employs pruning to eliminate the bias neurons. Experimental results demonstrate the method's effectiveness in mitigating biases in instruction-based prompting, enhancing language model performance on social bias benchmarks without compromising pre-existing knowledge. CRISPR proves highly practical, model-agnostic, offering flexibility in adapting to evolving social biases.
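A toy PyTorch version of attribution-guided neuron pruning is sketched below: hidden units are scored by gradient-times-activation on a proxy bias objective, and the top-scoring units are zeroed out. The attribution choice, the proxy loss, and `k` are assumptions here; the paper's attribution method may differ.

```python
# Sketch of attribution-based "bias neuron" pruning on a toy layer.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(16, 16)
x = torch.randn(8, 16)                 # activations on biased inputs

hidden = torch.relu(layer(x))
hidden.retain_grad()                   # keep gradients of the activations
bias_proxy_loss = hidden.mean()        # stand-in for a measured bias signal
bias_proxy_loss.backward()

# attribution per neuron: |activation * gradient|, averaged over the batch
scores = (hidden * hidden.grad).abs().mean(dim=0)
k = 3
bias_neurons = scores.topk(k).indices
with torch.no_grad():                  # "prune": silence the flagged units
    layer.weight[bias_neurons] = 0.0
    layer.bias[bias_neurons] = 0.0
print("pruned neurons:", bias_neurons.tolist())
```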

AI Recommendation System for Enhanced Customer Experience: A Novel Image-to-Text Method

  • paper_url: http://arxiv.org/abs/2311.09624
  • repo_url: None
  • paper_authors: Mohamaed Foued Ayedi, Hiba Ben Salem, Soulaimen Hammami, Ahmed Ben Said, Rateb Jabbar, Achraf CHabbouh
  • for: This work aims to deliver accurate and personalized fashion recommendations, using AI for fine-grained visual interpretation so users can find products matching the style of a desired image.
  • methods: It uses a novel end-to-end pipeline: customers upload images of desired products or outfits, the system generates meaningful descriptions emphasizing stylistic elements, and these captions guide retrieval of similar alternatives from a global fashion product catalogue.
  • results: Trained and evaluated on a dataset of more than 100,000 categorized fashion photos, the pipeline's object detection model reaches an F1-score of 0.97, demonstrating accurate fashion object recognition optimized for personalized recommendation.
    Abstract Existing fashion recommendation systems encounter difficulties in using visual data for accurate and personalized recommendations. This research describes an innovative end-to-end pipeline that uses artificial intelligence to provide fine-grained visual interpretation for fashion recommendations. When customers upload images of desired products or outfits, the system automatically generates meaningful descriptions emphasizing stylistic elements. These captions guide retrieval from a global fashion product catalogue to offer similar alternatives that fit the visual characteristics of the original image. On a dataset of over 100,000 categorized fashion photos, the pipeline was trained and evaluated. The F1-score for the object detection model was 0.97, exhibiting exact fashion object recognition capabilities optimized for recommendation. This visually aware system represents a key advancement in customer engagement through personalized fashion recommendations

Comprehensive Evaluation and Insights into the Use of Deep Neural Networks to Detect and Quantify Lymphoma Lesions in PET/CT Images

  • paper_url: http://arxiv.org/abs/2311.09614
  • repo_url: https://github.com/microsoft/lymphoma-segmentation-dnn
  • paper_authors: Shadab Ahamed, Yixi Xu, Claire Gowdy, Joo H. O, Ingrid Bloise, Don Wilson, Patrick Martineau, François Bénard, Fereshteh Yousefirizi, Rahul Dodhia, Juan M. Lavista, William B. Weeks, Carlos F. Uribe, Arman Rahmim
  • for: This paper evaluates four deep learning architectures (UNet, SegResNet, DynUNet, and SwinUNETR) for lymphoma lesion segmentation from PET/CT images.
  • methods: The four networks are trained, validated, and tested on a diverse, multi-institutional dataset of 611 cases, with both internal and external testing; the authors assess the reproducibility of six lesion measures, calculate prediction errors, and examine DSC performance in relation to those lesion measures.
  • results: On internal testing, SegResNet is the top performer with a median Dice similarity coefficient (DSC) of 0.76 and a median false positive volume (FPV) of 4.55 ml, and all networks have a median false negative volume (FNV) of 0 ml. On the unseen external test set, SegResNet achieves the best median DSC of 0.68 and FPV of 21.46 ml, while UNet has the best FNV of 0.41 ml. The study also introduces three lesion detection criteria, addresses the challenge of segmenting "easy" vs. "hard" cases, and performs intra- and inter-observer agreement assessments.
    Abstract This study performs comprehensive evaluation of four neural network architectures (UNet, SegResNet, DynUNet, and SwinUNETR) for lymphoma lesion segmentation from PET/CT images. These networks were trained, validated, and tested on a diverse, multi-institutional dataset of 611 cases. Internal testing (88 cases; total metabolic tumor volume (TMTV) range [0.52, 2300] ml) showed SegResNet as the top performer with a median Dice similarity coefficient (DSC) of 0.76 and median false positive volume (FPV) of 4.55 ml; all networks had a median false negative volume (FNV) of 0 ml. On the unseen external test set (145 cases with TMTV range: [0.10, 2480] ml), SegResNet achieved the best median DSC of 0.68 and FPV of 21.46 ml, while UNet had the best FNV of 0.41 ml. We assessed reproducibility of six lesion measures, calculated their prediction errors, and examined DSC performance in relation to these lesion measures, offering insights into segmentation accuracy and clinical relevance. Additionally, we introduced three lesion detection criteria, addressing the clinical need for identifying lesions, counting them, and segmenting based on metabolic characteristics. We also performed expert intra-observer variability analysis revealing the challenges in segmenting ``easy'' vs. ``hard'' cases, to assist in the development of more resilient segmentation algorithms. Finally, we performed inter-observer agreement assessment underscoring the importance of a standardized ground truth segmentation protocol involving multiple expert annotators. Code is available at: https://github.com/microsoft/lymphoma-segmentation-dnn
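For reference, the headline metrics above can be computed from binary masks as in the NumPy sketch below; the voxel volume and the voxel-level FPV/FNV definitions are simplifying assumptions (the paper's lesion-level detection criteria are more involved).

```python
# Sketch of Dice (DSC), false positive volume and false negative volume.
import numpy as np

def seg_metrics(pred: np.ndarray, gt: np.ndarray, voxel_ml: float = 0.004):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dsc = 2 * inter / (pred.sum() + gt.sum() + 1e-8)
    fpv = np.logical_and(pred, ~gt).sum() * voxel_ml  # predicted-only volume
    fnv = np.logical_and(~pred, gt).sum() * voxel_ml  # missed lesion volume
    return dsc, fpv, fnv

gt = np.zeros((32, 32, 32), dtype=bool)
gt[10:20, 10:20, 10:20] = True          # a 10x10x10-voxel "lesion"
pred = np.zeros_like(gt)
pred[12:22, 10:20, 10:20] = True        # prediction shifted by 2 voxels
print(seg_metrics(pred, gt))            # (dsc, fpv_ml, fnv_ml)
```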

Digital Socrates: Evaluating LLMs through explanation critiques

  • paper_url: http://arxiv.org/abs/2311.09613
  • repo_url: None
  • paper_authors: Yuling Gu, Oyvind Tafjord, Peter Clark
  • for: This work aims to characterize the explanation capabilities of modern models and to build a nuanced, interpretable explanation-evaluation tool that can generate such characterizations automatically, without expensive API calls or human annotation.
  • methods: The authors define the new task of explanation critiquing (identifying and categorizing the main flaw in an explanation and suggesting how to address it), create a sizeable human-verified dataset for the task, and train an open-source automatic critique model, called Digital Socrates, on this data.
  • results: Quantitative and qualitative analysis shows that Digital Socrates reveals insights about student models by examining their reasoning chains, and that it provides high-quality, nuanced, automatic evaluation of model explanations for the first time, filling an important gap among evaluation tools.
    Abstract While LLMs can provide reasoned explanations along with their answers, the nature and quality of those explanations are still poorly understood. In response, our goal is to define a detailed way of characterizing the explanation capabilities of modern models and to create a nuanced, interpretable explanation evaluation tool that can generate such characterizations automatically, without relying on expensive API calls or human annotations. Our approach is to (a) define the new task of explanation critiquing - identifying and categorizing any main flaw in an explanation and providing suggestions to address the flaw, (b) create a sizeable, human-verified dataset for this task, and (c) train an open-source, automatic critiquing model (called Digital Socrates) using this data. Through quantitative and qualitative analysis, we demonstrate how Digital Socrates is useful for revealing insights about student models by examining their reasoning chains, and how it can provide high-quality, nuanced, automatic evaluation of those model explanations for the first time. Digital Socrates thus fills an important gap in evaluation tools for understanding and improving the explanation behavior of models.

Code Models are Zero-shot Precondition Reasoners

  • paper_url: http://arxiv.org/abs/2311.09601
  • repo_url: None
  • paper_authors: Lajanugen Logeswaran, Sungryull Sohn, Yiwei Lyu, Anthony Zhe Liu, Dong-Ki Kim, Dongsub Shim, Moontae Lee, Honglak Lee
  • for: This paper explores using code representations to reason about action preconditions in sequential decision-making tasks.
  • methods: Pre-trained code models extract action preconditions from demonstration trajectories in a zero-shot manner; the extracted preconditions then drive a precondition-aware action sampling strategy that keeps the actions predicted by a policy consistent with those preconditions (see the sketch after the abstract).
  • results: The approach improves the performance of few-shot policy learning on task-oriented dialog and embodied textworld benchmarks.
    Abstract One of the fundamental skills required for an agent acting in an environment to complete tasks is the ability to understand what actions are plausible at any given point. This work explores a novel use of code representations to reason about action preconditions for sequential decision making tasks. Code representations offer the flexibility to model procedural activities and associated constraints as well as the ability to execute and verify constraint satisfaction. Leveraging code representations, we extract action preconditions from demonstration trajectories in a zero-shot manner using pre-trained code models. Given these extracted preconditions, we propose a precondition-aware action sampling strategy that ensures actions predicted by a policy are consistent with preconditions. We demonstrate that the proposed approach enhances the performance of few-shot policy learning approaches across task-oriented dialog and embodied textworld benchmarks.
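The snippet below sketches precondition-aware action sampling: each action carries a predicate over the world state, and the policy may only sample actions whose preconditions hold. The hand-written predicates stand in for the preconditions the paper extracts zero-shot with pre-trained code models.

```python
# Sketch of precondition-aware action sampling for a toy household task.
import random

PRECONDITIONS = {
    "open_fridge": lambda s: not s["fridge_open"],
    "take_milk":   lambda s: s["fridge_open"] and not s["holding_milk"],
    "pour_milk":   lambda s: s["holding_milk"],
}

def sample_action(state: dict, policy_scores: dict) -> str:
    # discard actions whose preconditions fail in the current state
    feasible = [a for a in policy_scores if PRECONDITIONS[a](state)]
    weights = [policy_scores[a] for a in feasible]
    return random.choices(feasible, weights=weights)[0]

state = {"fridge_open": False, "holding_milk": False}
scores = {"open_fridge": 0.2, "take_milk": 0.7, "pour_milk": 0.1}
print(sample_action(state, scores))  # always "open_fridge" here
```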

Multi-Step Dialogue Workflow Action Prediction

  • paper_url: http://arxiv.org/abs/2311.09593
  • repo_url: None
  • paper_authors: Ramya Ramakrishnan, Ethan Elenberg, Hashan Narangodage, Ryan McDonald
  • for: Improving the automation of dialogue workflows, so that conversation systems can complete more steps efficiently without constant human oversight.
  • methods: Three simple-to-implement approaches to multi-step workflow action prediction: 1) fine-tuning on a training dataset, 2) few-shot in-context learning leveraging retrieval and large language model prompting, and 3) zero-shot graph traversal, which aggregates historical action sequences into a graph for prediction (see the sketch after the abstract).
  • results: Achieves 20% more step automation without requiring as much human oversight.
    Abstract In task-oriented dialogue, a system often needs to follow a sequence of actions, called a workflow, that complies with a set of guidelines in order to complete a task. In this paper, we propose the novel problem of multi-step workflow action prediction, in which the system predicts multiple future workflow actions. Accurate prediction of multiple steps allows for multi-turn automation, which can free up time to focus on more complex tasks. We propose three modeling approaches that are simple to implement yet lead to more action automation: 1) fine-tuning on a training dataset, 2) few-shot in-context learning leveraging retrieval and large language model prompting, and 3) zero-shot graph traversal, which aggregates historical action sequences into a graph for prediction. We show that multi-step action prediction produces features that improve accuracy on downstream dialogue tasks like predicting task success, and can increase automation of steps by 20% without requiring as much feedback from a human overseeing the system.
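A minimal version of the zero-shot graph traversal idea is sketched below: historical action sequences are aggregated into a transition-count graph, and future actions are predicted by repeatedly following the heaviest outgoing edge. The workflow action names are invented for illustration.

```python
# Sketch of zero-shot graph traversal for multi-step action prediction.
from collections import defaultdict

histories = [
    ["verify_identity", "pull_account", "issue_refund", "send_receipt"],
    ["verify_identity", "pull_account", "update_address"],
    ["verify_identity", "pull_account", "issue_refund", "close_ticket"],
]

graph = defaultdict(lambda: defaultdict(int))
for seq in histories:
    for a, b in zip(seq, seq[1:]):
        graph[a][b] += 1                       # edge weight = observed count

def predict_next(action: str, steps: int = 2) -> list[str]:
    predicted = []
    for _ in range(steps):
        if action not in graph:
            break
        action = max(graph[action], key=graph[action].get)  # heaviest edge
        predicted.append(action)
    return predicted

print(predict_next("pull_account"))  # ['issue_refund', 'send_receipt']
```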

Tied-Lora: Enhancing parameter efficiency of LoRA with weight tying

  • paper_url: http://arxiv.org/abs/2311.09578
  • repo_url: None
  • paper_authors: Adithya Renduchintala, Tugrul Konuk, Oleksii Kuchaiev
  • for: Improving the parameter efficiency of the Low-rank adaptation (LoRA) method.
  • methods: A simple paradigm combining weight tying with selective training/freezing of the LoRA parameters, exploring all feasible combinations to find the best balance between performance and the number of trainable parameters.
  • results: Experiments across a variety of tasks and two base language models identify a Tied-LoRA configuration that achieves performance comparable to standard LoRA while using only 13% of its parameters.
    Abstract We propose Tied-LoRA, a simple paradigm that utilizes weight tying and selective training to further increase the parameter efficiency of the Low-rank adaptation (LoRA) method. Our investigations include all feasible combinations of parameter training/freezing in conjunction with weight tying to identify the optimal balance between performance and the number of trainable parameters. Through experiments covering a variety of tasks and two base language models, we provide analysis revealing trade-offs between efficiency and performance. Our experiments uncovered a particular Tied-LoRA configuration that stands out by demonstrating comparable performance across several tasks while employing only 13% of the parameters utilized by the standard LoRA method.
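A minimal PyTorch sketch of the weight-tying idea is given below: one low-rank pair (A, B) is shared by every adapted layer, so the trainable-parameter count stops growing with depth. Scaling factors and the selective-training options of the actual method are simplified away.

```python
# Sketch of weight-tied LoRA: one shared low-rank update for all layers.
import torch
import torch.nn as nn

d, r, n_layers = 64, 4, 6
A = nn.Parameter(torch.randn(r, d) * 0.01)  # low-rank factors, shared
B = nn.Parameter(torch.zeros(d, r))         # across every adapted layer

class TiedLoRALinear(nn.Module):
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(d, d)
        for p in self.base.parameters():    # frozen "pretrained" weights
            p.requires_grad_(False)

    def forward(self, x):
        return self.base(x) + x @ A.t() @ B.t()  # shared low-rank update

layers = nn.Sequential(*[TiedLoRALinear() for _ in range(n_layers)])
x = torch.randn(2, d)
print(layers(x).shape)                             # torch.Size([2, 64])
print("trainable params:", A.numel() + B.numel())  # 2*d*r, depth-independent
```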

Work State-Centric AI Agents: Design, Implementation, and Management of Cognitive Work Threads

  • paper_url: http://arxiv.org/abs/2311.09576
  • repo_url: None
  • paper_authors: Chen Zhang
  • for: Improving task-execution efficiency and supporting downstream task analysis and auditing.
  • methods: A work state-centric agent model that uses "work notes" and a reflection loop to capture work-state information throughout task execution, with worker threads for task oversight, planner modules for task decomposition and planning, and executor modules that perform subtasks via a ReAct-inspired thought-action loop; the accumulated work-state records form a comprehensive work journal.
  • results: The model improves task-execution efficiency and lays a solid foundation for subsequent task analysis and auditing.
    Abstract AI agents excel in executing predefined tasks, but the dynamic management of work state information during task execution remains an underexplored area. We propose a work state-centric AI agent model employing "work notes" to record and reflect the state throughout task execution. This paper details the model's architecture, featuring worker threads for task oversight, planner modules for task decomposition and planning, and executor modules for performing subtasks using a ReAct-inspired thought-action loop. We provide an exhaustive work state record incorporating plans and outcomes, constituting a comprehensive work journal. Our results show that this model not only improves task execution efficiency but also lays a solid foundation for subsequent task analysis and auditing.

LymphoML: An interpretable artificial intelligence-based method identifies morphologic features that correlate with lymphoma subtype

  • paper_url: http://arxiv.org/abs/2311.09574
  • repo_url: https://github.com/rajpurkarlab/lymphoml
  • paper_authors: Vivek Shankar, Xiaoli Yang, Vrishab Krishna, Brent Tan, Oscar Silva, Rebecca Rojansky, Andrew Ng, Fabiola Valvert, Edward Briercheck, David Weinstock, Yasodha Natkunam, Sebastian Fernandez-Pol, Pranav Rajpurkar
  • for: This study develops an interpretable machine learning method, LymphoML, that identifies morphologic features correlating with lymphoma subtype, enabling more accurate subtype classification.
  • methods: The pipeline processes H&E-stained tissue microarray cores, segments nuclei and cells, computes features covering morphology, texture, and architecture, and trains gradient-boosted models for diagnostic prediction.
  • results: The interpretable models, developed on a limited volume of H&E-stained tissue, achieve diagnostic accuracy non-inferior to pathologists using whole-slide images. SHAP analysis shows nuclear shape features are most discriminative for DLBCL (F1-score: 78.7%) and classical Hodgkin lymphoma (F1-score: 74.5%), and a model combining H&E features with a standardized panel of 6 immunostains reaches diagnostic accuracy (85.3%) similar to a 46-stain panel (86.1%).
    Abstract The accurate classification of lymphoma subtypes using hematoxylin and eosin (H&E)-stained tissue is complicated by the wide range of morphological features these cancers can exhibit. We present LymphoML - an interpretable machine learning method that identifies morphologic features that correlate with lymphoma subtypes. Our method applies steps to process H&E-stained tissue microarray cores, segment nuclei and cells, compute features encompassing morphology, texture, and architecture, and train gradient-boosted models to make diagnostic predictions. LymphoML's interpretable models, developed on a limited volume of H&E-stained tissue, achieve non-inferior diagnostic accuracy to pathologists using whole-slide images and outperform black box deep-learning on a dataset of 670 cases from Guatemala spanning 8 lymphoma subtypes. Using SHapley Additive exPlanation (SHAP) analysis, we assess the impact of each feature on model prediction and find that nuclear shape features are most discriminative for DLBCL (F1-score: 78.7%) and classical Hodgkin lymphoma (F1-score: 74.5%). Finally, we provide the first demonstration that a model combining features from H&E-stained tissue with features from a standardized panel of 6 immunostains results in a similar diagnostic accuracy (85.3%) to a 46-stain panel (86.1%).

Prompt Optimisation with Random Sampling

  • paper_url: http://arxiv.org/abs/2311.09569
  • repo_url: None
  • paper_authors: Yao Lu, Jiayi Wang, Sebastian Riedel, Pontus Stenetorp
  • for: This paper explores whether a language model's generative ability can supply task-relevant prompt separators, showing results competitive with human-curated prompts such as "TL;DR".
  • methods: Three random generation strategies are analyzed in detail, demonstrating that even tokens chosen at random from the vocabulary can serve as near-state-of-the-art separators, regardless of the underlying language model size.
  • results: Random separators yield an average 16% relative improvement over human-curated separators across nine text classification tasks on seven language models, on par with automatic prompt-searching methods.
    Abstract Using the generative nature of a language model to generate task-relevant separators has shown competitive results compared to human-curated prompts like "TL;DR". We demonstrate that even randomly chosen tokens from the vocabulary as separators can achieve near-state-of-the-art performance. We analyse this phenomenon in detail using three different random generation strategies, establishing that the language space is rich with potential good separators, regardless of the underlying language model size. These observations challenge the common assumption that an effective prompt should be human-readable or task-relevant. Experimental results show that using random separators leads to an average 16% relative improvement across nine text classification tasks on seven language models, compared to human-curated separators, and is on par with automatic prompt searching methods.
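The sketch below shows how such random separators could be sampled from a tokenizer's vocabulary (using the Hugging Face transformers tokenizer API; the gpt2 tokenizer and 3-token length are arbitrary choices); in practice, candidate separators would then be scored on a small dev set.

```python
# Sketch of sampling random-token separators for a prompt.
import random
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

def random_separator(n_tokens: int = 3) -> str:
    ids = random.sample(range(tok.vocab_size), n_tokens)
    return tok.decode(ids).strip()

document = "The central bank raised rates by 25 basis points ..."
separator = random_separator()       # plays the role of, e.g., "TL;DR"
prompt = f"{document}\n{separator}\n"
print(repr(separator))
print(prompt)
```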
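A small sketch of the random-separator idea, assuming a Hugging Face tokenizer; the model name, separator length, and prompt template are illustrative assumptions.

```python
import random
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
random.seed(0)

def random_separator(n_tokens: int = 3) -> str:
    # Sample token ids uniformly from the vocabulary and decode them.
    ids = random.sample(range(tok.vocab_size), n_tokens)
    return tok.decode(ids)

sep = random_separator()
text = "The movie was a complete waste of time."
prompt = f"{text} {sep}"   # the random separator replaces a curated cue like "TL;DR"
print(prompt)
```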

LongBoX: Evaluating Transformers on Long-Sequence Clinical Tasks

  • paper_url: http://arxiv.org/abs/2311.09564
  • repo_url: https://github.com/mihir3009/longbox
  • paper_authors: Mihir Parmar, Aakanksha Naik, Himanshu Gupta, Disha Agrawal, Chitta Baral
  • for: This paper evaluates the long-sequence handling ability of large language models (LLMs) in the medical domain.
  • methods: The paper assembles seven medical text-to-text datasets and evaluates two long-sequence techniques: (i) local-global attention and (ii) Fusion-in-Decoder (FiD); a sketch of a local-global attention mask follows this entry.
  • results: Preliminary experiments show that both medical LLMs (e.g., BioGPT) and general-domain LLMs (e.g., FLAN-T5) struggle on this benchmark, and the two long-sequence techniques yield mixed results across datasets.
    Abstract Many large language models (LLMs) for medicine have largely been evaluated on short texts, and their ability to handle longer sequences such as a complete electronic health record (EHR) has not been systematically explored. Assessing these models on long sequences is crucial since prior work in the general domain has demonstrated performance degradation of LLMs on longer texts. Motivated by this, we introduce LongBoX, a collection of seven medical datasets in text-to-text format, designed to investigate model performance on long sequences. Preliminary experiments reveal that both medical LLMs (e.g., BioGPT) and strong general domain LLMs (e.g., FLAN-T5) struggle on this benchmark. We further evaluate two techniques designed for long-sequence handling: (i) local-global attention, and (ii) Fusion-in-Decoder (FiD). Our results demonstrate mixed results with long-sequence handling - while scores on some datasets increase, there is substantial room for improvement. We hope that LongBoX facilitates the development of more effective long-sequence techniques for the medical domain. Data and source code are available at https://github.com/Mihir3009/LongBoX.
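A sketch of a Longformer-style local-global attention mask, one of the two long-sequence techniques LongBoX evaluates; the window size and global token positions are illustrative, not the benchmark's configuration.

```python
import numpy as np

def local_global_mask(seq_len: int, window: int, global_idx):
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):                    # local sliding window
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
    for g in global_idx:                        # global tokens attend everywhere
        mask[g, :] = True
        mask[:, g] = True                       # and are attended by every token
    return mask

m = local_global_mask(seq_len=16, window=2, global_idx=[0])
print(m.sum(), "allowed attention pairs out of", m.size)
```

The point of the sparse pattern is that cost grows roughly linearly with sequence length instead of quadratically, which is what makes full-EHR-length inputs tractable.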

Enhancing Semi-Supervised Learning for Extractive Summarization with an LLM-based pseudolabeler

  • paper_url: http://arxiv.org/abs/2311.09559
  • repo_url: None
  • paper_authors: Gaurav Sahu, Olga Vechtomova, Issam H. Laradji
  • for: This work tackles extractive text summarization in limited-labeled-data scenarios.
  • methods: It takes a semi-supervised approach, proposing a prompt-based pseudolabel selection strategy that uses GPT-4 to evaluate and generate pseudolabels (a schematic of the selection step follows this entry).
  • results: Experiments on three datasets (TweetSumm, WikiHow, ArXiv/PubMed) show that using an LLM to evaluate and generate pseudolabels improves ROUGE-1 by 10-20% per dataset, comparable to enhancing the pretrained models themselves, and the method needs a smaller pool of unlabeled examples to perform well.
    Abstract This work tackles the task of extractive text summarization in a limited labeled data scenario using a semi-supervised approach. Specifically, we propose a prompt-based pseudolabel selection strategy using GPT-4. We evaluate our method on three text summarization datasets: TweetSumm, WikiHow, and ArXiv/PubMed. Our experiments show that by using an LLM to evaluate and generate pseudolabels, we can improve the ROUGE-1 by 10-20\% on the different datasets, which is akin to enhancing pretrained models. We also show that such a method needs a smaller pool of unlabeled examples to perform better.
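A hedged sketch of prompt-based pseudolabel selection: an LLM judge scores candidate extractive summaries of unlabeled documents, and only high-scoring pairs are kept for fine-tuning. `llm_judge` is a stand-in for a GPT-4 call; the prompt wording, score scale, and threshold are assumptions.

```python
from typing import Callable, List, Tuple

def select_pseudolabels(
    docs: List[str],
    candidate_summaries: List[str],
    llm_judge: Callable[[str], float],   # returns a score, e.g. on a 1-10 scale
    threshold: float = 8.0,
) -> List[Tuple[str, str]]:
    kept = []
    for doc, summ in zip(docs, candidate_summaries):
        prompt = (
            "Rate from 1 to 10 how faithful and complete this extractive "
            f"summary is.\nDocument: {doc}\nSummary: {summ}\nScore:"
        )
        if llm_judge(prompt) >= threshold:
            kept.append((doc, summ))     # treat as labeled data downstream
    return kept

# Usage with a dummy judge that accepts everything:
pairs = select_pseudolabels(["doc A"], ["sentence 1"], lambda p: 9.0)
print(len(pairs), "pseudolabeled pairs kept")
```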

Program-Aided Reasoners (better) Know What They Know

  • paper_url: http://arxiv.org/abs/2311.09553
  • repo_url: None
  • paper_authors: Anubha Kabra, Sanketh Rangreji, Yash Mathur, Aman Madaan, Emmy Liu, Graham Neubig
  • for: This work compares the accuracy and calibration of program-aided language models (PAL) and text-based chain-of-thought (CoT) prompting techniques.
  • methods: The comparison spans 5 datasets and 2 model families (LLaMA models and OpenAI models); a sketch of a standard calibration metric follows this entry.
  • results: PAL yields better calibration in 75% of instances, and reducing generation diversity via temperature scaling makes PAL not only more accurate but also better calibrated than CoT at certain temperatures.
    Abstract Prior work shows that program-aided reasoning, in which large language models (LLMs) are combined with programs written in programming languages such as Python, can significantly improve accuracy on various reasoning tasks. However, while accuracy is essential, it is also important for such reasoners to "know what they know", which can be quantified through the calibration of the model. In this paper, we compare the calibration of Program Aided Language Models (PAL) and text-based Chain-of-thought (COT) prompting techniques over 5 datasets and 2 model types: LLaMA models and OpenAI models. Our results indicate that PAL leads to improved calibration in 75% of the instances. Our analysis uncovers that prompting styles that produce lesser diversity in generations also have more calibrated results, and thus we also experiment with inducing lower generation diversity using temperature scaling and find that for certain temperatures, PAL is not only more accurate but is also more calibrated than COT. Overall, we demonstrate that, in the majority of cases, program-aided reasoners better know what they know than text-based counterparts.
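To make "knowing what they know" concrete, here is a sketch of expected calibration error (ECE), a standard way to quantify how well confidences track accuracy; the binning scheme and synthetic data are illustrative, not the paper's exact protocol.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins: int = 10) -> float:
    conf, correct = np.asarray(conf), np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap   # weight each bin by its mass
    return ece

rng = np.random.default_rng(0)
c = rng.uniform(0.5, 1.0, 1000)          # model confidences
y = rng.uniform(size=1000) < c           # a well-calibrated answer pattern
print(f"ECE: {expected_calibration_error(c, y):.3f}")   # near 0 when calibrated
```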

Scaling User Modeling: Large-scale Online User Representations for Ads Personalization in Meta

  • paper_url: http://arxiv.org/abs/2311.09544
  • repo_url: None
  • paper_authors: Wei Zhang, Dai Li, Chen Liang, Fang Zhou, Zhongke Zhang, Xuewei Wang, Ru Li, Yi Zhou, Yaning Huang, Dong Liang, Kai Wang, Zhangyuan Wang, Zhengxing Chen, Min Li, Fenggang Wu, Minghai Chen, Huayu Li, Yunnan Wu, Zhan Shu, Mindi Yuan, Sri Reddy
  • for: This paper aims to improve personalized ads, where constraints on training throughput, serving latency, and memory limit the complexity and input feature set of online ads ranking models.
  • methods: The authors propose the Scaling User Modeling (SUM) framework, in which a few designated upstream user models synthesize user embeddings from massive amounts of user features using advanced modeling techniques; these embeddings then serve as inputs to downstream online ads ranking models (see the sketch after this entry).
  • results: SUM is widely deployed in Meta's ads ranking system, serving hundreds of models and processing hundreds of billions of user requests daily, with significant online metric gains and infrastructure cost savings.
    Abstract Effective user representations are pivotal in personalized advertising. However, stringent constraints on training throughput, serving latency, and memory, often limit the complexity and input feature set of online ads ranking models. This challenge is magnified in extensive systems like Meta's, which encompass hundreds of models with diverse specifications, rendering the tailoring of user representation learning for each model impractical. To address these challenges, we present Scaling User Modeling (SUM), a framework widely deployed in Meta's ads ranking system, designed to facilitate efficient and scalable sharing of online user representation across hundreds of ads models. SUM leverages a few designated upstream user models to synthesize user embeddings from massive amounts of user features with advanced modeling techniques. These embeddings then serve as inputs to downstream online ads ranking models, promoting efficient representation sharing. To adapt to the dynamic nature of user features and ensure embedding freshness, we designed SUM Online Asynchronous Platform (SOAP), a latency free online serving system complemented with model freshness and embedding stabilization, which enables frequent user model updates and online inference of user embeddings upon each user request. We share our hands-on deployment experiences for the SUM framework and validate its superiority through comprehensive experiments. To date, SUM has been launched to hundreds of ads ranking models in Meta, processing hundreds of billions of user requests daily, yielding significant online metric gains and infrastructure cost savings.
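A hedged PyTorch sketch of the SUM pattern: one upstream user model compresses many raw user features into a compact embedding that several downstream rankers reuse as an input feature. The module shapes and dimensions are assumptions for illustration, not Meta's architecture.

```python
import torch
import torch.nn as nn

class UpstreamUserModel(nn.Module):
    def __init__(self, n_user_feats=1024, emb_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_user_feats, 256), nn.ReLU(), nn.Linear(256, emb_dim)
        )
    def forward(self, user_feats):           # run once; the embedding is cached/served
        return self.encoder(user_feats)

class DownstreamRanker(nn.Module):
    def __init__(self, emb_dim=64, n_ad_feats=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(emb_dim + n_ad_feats, 64), nn.ReLU(), nn.Linear(64, 1)
        )
    def forward(self, user_emb, ad_feats):   # consumes the shared embedding
        return torch.sigmoid(self.head(torch.cat([user_emb, ad_feats], dim=-1)))

upstream = UpstreamUserModel()
user_emb = upstream(torch.randn(4, 1024)).detach()   # served asynchronously (SOAP-style)
ranker = DownstreamRanker()
print(ranker(user_emb, torch.randn(4, 128)).shape)   # torch.Size([4, 1])
```

The design choice is amortization: the expensive user encoder runs once per user request, while each of the hundreds of lightweight rankers only pays for a small concatenated input.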

HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM

  • paper_url: http://arxiv.org/abs/2311.09528
  • repo_url: None
  • paper_authors: Zhilin Wang, Yi Dong, Jiaqi Zeng, Virginia Adams, Makesh Narsimhan Sreedhar, Daniel Egert, Olivier Delalleau, Jane Polak Scowcroft, Neel Kant, Aidan Swope, Oleksii Kuchaiev
  • for: The paper addresses the fact that existing open-source helpfulness preference datasets do not specify what makes some responses more helpful than others, by collecting a multi-attribute helpfulness dataset annotated for the various aspects that make responses helpful (see the loading example after this entry).
  • methods: The paper uses a dataset called HelpSteer, which is a 37k-sample dataset annotated for correctness, coherence, complexity, and verbosity in addition to overall helpfulness of responses. The paper also uses the SteerLM technique to train a model on the dataset.
  • results: The paper reports that training a model on the HelpSteer dataset with the SteerLM technique produces a model that scores 7.54 on MT Bench, which is currently the highest score for open models that do not require training data from more powerful models (e.g. GPT4).
    Abstract Existing open-source helpfulness preference datasets do not specify what makes some responses more helpful and others less so. Models trained on these datasets can incidentally learn to model dataset artifacts (e.g. preferring longer but unhelpful responses only due to their length). To alleviate this problem, we collect HelpSteer, a multi-attribute helpfulness dataset annotated for the various aspects that make responses helpful. Specifically, our 37k-sample dataset has annotations for correctness, coherence, complexity, and verbosity in addition to overall helpfulness of responses. Training Llama 2 70B using the HelpSteer dataset with SteerLM technique produces a model that scores 7.54 on MT Bench, which is currently the highest score for open models that do not require training data from more powerful models (e.g. GPT4). We release this dataset with CC-BY-4.0 license at https://huggingface.co/datasets/nvidia/HelpSteer
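Loading HelpSteer from the Hugging Face Hub at the URL given above; the attribute column names below follow the five aspects named in the abstract, but treat them as assumptions until checked against the dataset card.

```python
from datasets import load_dataset

ds = load_dataset("nvidia/HelpSteer", split="train")
example = ds[0]
for attr in ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]:
    print(attr, "->", example.get(attr))
```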

MDFL: Multi-domain Diffusion-driven Feature Learning

  • paper_url: http://arxiv.org/abs/2311.09520
  • repo_url: None
  • paper_authors: Daixun Li, Weiying Xie, Jiaqing Zhang, Yunsong Li
  • for: This work aims to improve feature extraction for high-dimensional images in order to reveal their intrinsic patterns and structures.
  • methods: It proposes a multi-domain diffusion-driven feature learning network (MDFL) that uses diffusion-based posterior sampling to explicitly model joint information interactions among the spectral, spatial, and frequency domains, eliminating the influence of masking texture effects in visual models; a feature reuse mechanism additionally gathers deep and raw features of the high-dimensional data.
  • results: Experiments on three multi-modal remote sensing datasets show that MDFL significantly improves feature extraction, reaching an average overall accuracy of 98.25% and outperforming various state-of-the-art baselines.
    Abstract High-dimensional images, known for their rich semantic information, are widely applied in remote sensing and other fields. The spatial information in these images reflects the object's texture features, while the spectral information reveals the potential spectral representations across different bands. Currently, the understanding of high-dimensional images remains limited to a single-domain perspective with performance degradation. Motivated by the masking texture effect observed in the human visual system, we present a multi-domain diffusion-driven feature learning network (MDFL) , a scheme to redefine the effective information domain that the model really focuses on. This method employs diffusion-based posterior sampling to explicitly consider joint information interactions between the high-dimensional manifold structures in the spectral, spatial, and frequency domains, thereby eliminating the influence of masking texture effects in visual models. Additionally, we introduce a feature reuse mechanism to gather deep and raw features of high-dimensional data. We demonstrate that MDFL significantly improves the feature extraction performance of high-dimensional data, thereby providing a powerful aid for revealing the intrinsic patterns and structures of such data. The experimental results on three multi-modal remote sensing datasets show that MDFL reaches an average overall accuracy of 98.25%, outperforming various state-of-the-art baseline schemes. The code will be released, contributing to the computer vision community.

SegMix: A Simple Structure-Aware Data Augmentation Method

  • paper_url: http://arxiv.org/abs/2311.09505
  • repo_url: None
  • paper_authors: Yuxin Pei, Pushkar Bhuse, Zhengzhong Liu, Eric Xing
  • for: This paper proposes an interpolation-based data augmentation (DA) method that improves model performance on natural language processing (NLP) tasks.
  • methods: Building on Mixup, which linearly interpolates the inputs and labels of training examples, the paper proposes SegMix, a DA framework that applies interpolation to task-specific meaningful segments rather than whole sequences, adapting to task structure (a sketch follows this entry).
  • results: Experiments show that SegMix consistently improves performance on Named Entity Recognition (NER) and Relation Extraction (RE) tasks, especially in data-scarce settings, while being easy to implement and adding negligible training overhead.
    Abstract Interpolation-based Data Augmentation (DA) methods (Mixup) linearly interpolate the inputs and labels of two or more training examples. Mixup has more recently been adapted to the field of Natural Language Processing (NLP), mainly for sequence labeling tasks. However, such a simple adoption yields mixed or unstable improvements over the baseline models. We argue that the direct-adoption methods do not account for structures in NLP tasks. To this end, we propose SegMix, a collection of interpolation-based DA algorithms that can adapt to task-specific structures. SegMix poses fewer constraints on data structures, is robust to various hyperparameter settings, applies to more task settings, and adds little computational overhead. In the algorithm's core, we apply interpolation methods on task-specific meaningful segments, in contrast to applying them on sequences as in prior work. We find SegMix to be a flexible framework that combines rule-based DA methods with interpolation-based methods, creating interesting mixtures of DA techniques. We show that SegMix consistently improves performance over strong baseline models in Named Entity Recognition (NER) and Relation Extraction (RE) tasks, especially under data-scarce settings. Furthermore, this method is easy to implement and adds negligible training overhead.
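A minimal sketch of segment-level interpolation: mix the pooled representations (and soft labels) of two entity spans rather than two whole sequences. The Beta-distributed mixing coefficient is standard Mixup; everything else here is an illustrative guess at SegMix-style usage, not the paper's exact recipe.

```python
import numpy as np

def segment_mix(seg_a, seg_b, label_a, label_b, alpha: float = 0.5):
    lam = np.random.beta(alpha, alpha)
    mixed_seg = lam * seg_a + (1 - lam) * seg_b         # mixed span embedding
    mixed_label = lam * label_a + (1 - lam) * label_b   # mixed one-hot label
    return mixed_seg, mixed_label

rng = np.random.default_rng(0)
a, b = rng.normal(size=128), rng.normal(size=128)       # pooled span vectors
y_a, y_b = np.eye(5)[1], np.eye(5)[3]                   # e.g., PER vs. ORG classes
seg, label = segment_mix(a, b, y_a, y_b)
print(seg.shape, label)
```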

Adaptive Interventions with User-Defined Goals for Health Behavior Change

  • paper_url: http://arxiv.org/abs/2311.09483
  • repo_url: None
  • paper_authors: Aishwarya Mandyam, Matthew Joerke, Barbara E. Engelhardt, Emma Brunskill
  • for: This work aims to make mobile health applications more effective at promoting physical activity by supporting individualized goal-setting, improving user engagement and adherence.
  • methods: It introduces a modification of the Thompson sampling algorithm that optimizes personalized reward functions while still leveraging shared structure across users (see the sketch after this entry).
  • results: The modification incurs only a constant penalty on cumulative regret while preserving the sample-complexity benefits of data sharing, and in a physical activity simulator it substantially reduces cumulative regret compared to baselines that do not share data or do not optimize individualized rewards.
    Abstract Physical inactivity remains a major public health concern, having associations with adverse health outcomes such as cardiovascular disease and type-2 diabetes. Mobile health applications present a promising avenue for low-cost, scalable physical activity promotion, yet often suffer from small effect sizes and low adherence rates, particularly in comparison to human coaching. Goal-setting is a critical component of health coaching that has been underutilized in adaptive algorithms for mobile health interventions. This paper introduces a modification to the Thompson sampling algorithm that places emphasis on individualized goal-setting by optimizing personalized reward functions. As a step towards supporting goal-setting, this paper offers a balanced approach that can leverage shared structure while optimizing individual preferences and goals. We prove that our modification incurs only a constant penalty on the cumulative regret while preserving the sample complexity benefits of data sharing. In a physical activity simulator, we demonstrate that our algorithm achieves substantial improvements in cumulative regret compared to baselines that do not share data or do not optimize for individualized rewards.
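A hedged sketch of Thompson sampling for a Bernoulli bandit where a user-defined goal vector re-weights the sampled values; this illustrates the flavor of personalizing the reward, not the paper's exact algorithm or regret analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms = 3
alpha = np.ones(n_arms)                    # Beta posterior: successes + 1
beta = np.ones(n_arms)                     # Beta posterior: failures + 1
goal_weight = np.array([1.0, 0.5, 2.0])    # hypothetical user-defined per-arm preference

for t in range(1000):
    theta = rng.beta(alpha, beta)              # sample reward rates from posterior
    arm = int(np.argmax(theta * goal_weight))  # personalize the arm choice
    reward = rng.uniform() < [0.3, 0.5, 0.4][arm]   # unknown true success rates
    alpha[arm] += reward                       # Bayesian update of chosen arm
    beta[arm] += 1 - reward

print("posterior means:", (alpha / (alpha + beta)).round(2))
```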

ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

  • paper_url: http://arxiv.org/abs/2311.09476
  • repo_url: https://github.com/stanford-futuredata/ares
  • paper_authors: Jon Saad-Falcon, Omar Khattab, Christopher Potts, Matei Zaharia
  • for: Evaluating the quality of retrieval-augmented generation (RAG) systems traditionally requires hand annotations for input queries, passages, and responses.
  • methods: The paper introduces ARES, an automated RAG evaluation system that scores context relevance, answer faithfulness, and answer relevance; it finetunes lightweight language-model (LM) judges on synthetic training data and uses a small set of human-annotated datapoints for prediction-powered inference (PPI) to mitigate prediction errors (a sketch of the PPI estimator follows this entry).
  • results: ARES accurately evaluates RAG systems across six knowledge-intensive tasks in KILT and SuperGLUE using only a few hundred human annotations, and its judges remain accurate under domain shifts in queries and documents. Code and data are available at https://github.com/stanford-futuredata/ARES.
    Abstract Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand annotations for input queries, passages to retrieve, and responses to generate. We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems along the dimensions of context relevance, answer faithfulness, and answer relevance. Using synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components. To mitigate potential prediction errors, ARES utilizes a small set of human-annotated datapoints for prediction-powered inference (PPI). Across six different knowledge-intensive tasks in KILT and SuperGLUE, ARES accurately evaluates RAG systems while using a few hundred human annotations during evaluation. Furthermore, ARES judges remain effective across domain shifts, proving accurate even after changing the type of queries and/or documents used in the evaluated RAG systems. We make our datasets and code for replication and deployment available at https://github.com/stanford-futuredata/ARES.
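A sketch of the prediction-powered inference (PPI) mean estimator: take the LM judge's average over many unlabeled examples, then correct it with the judge's measured error on a small human-annotated set. The data here is synthetic; only the estimator form is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
judge_unlabeled = rng.integers(0, 2, size=5000)   # judge verdicts on unlabeled data
judge_labeled = rng.integers(0, 2, size=200)      # judge verdicts on the annotated set
human_labels = rng.integers(0, 2, size=200)       # matching human labels

# PPI mean estimator: judge average plus a rectifier from the labeled subset.
rectifier = (human_labels - judge_labeled).mean()
ppi_estimate = judge_unlabeled.mean() + rectifier
print(f"PPI estimate of pass rate: {ppi_estimate:.3f}")
```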

JAB: Joint Adversarial Prompting and Belief Augmentation

  • paper_url: http://arxiv.org/abs/2311.09473
  • repo_url: None
  • paper_authors: Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Jwala Dhamala, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta
  • for: This paper aims to improve the safety and robustness of language models.
  • methods: It proposes a joint framework that simultaneously probes a black-box target model with automated adversarial prompts and hardens it with belief augmentation, with both sides improving through iterative feedback loops (a schematic loop follows this entry).
  • results: Experiments show the framework reduces toxic content generation in both dynamic settings, where an adversary interacts with the target model directly, and static settings evaluated on a benchmark dataset.
    Abstract With the recent surge of language models in different applications, attention to safety and robustness of these models has gained significant importance. Here we introduce a joint framework in which we simultaneously probe and improve the robustness of a black-box target model via adversarial prompting and belief augmentation using iterative feedback loops. This framework utilizes an automated red teaming approach to probe the target model, along with a belief augmenter to generate instructions for the target model to improve its robustness to those adversarial probes. Importantly, the adversarial model and the belief generator leverage the feedback from past interactions to improve the effectiveness of the adversarial prompts and beliefs, respectively. In our experiments, we demonstrate that such a framework can reduce toxic content generation both in dynamic cases where an adversary directly interacts with a target model and static cases where we use a static benchmark dataset to evaluate our model.
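A schematic of the joint attack/defense loop, with stub functions standing in for the red-team model, belief generator, target model, and safety scorer; every function below is a placeholder for illustration, not an API from the paper.

```python
def red_team(history):            # placeholder: propose an adversarial prompt
    return f"adversarial prompt informed by {len(history)} past rounds"

def belief_generator(history):    # placeholder: propose a defensive instruction
    return "Do not produce harmful content; refuse unsafe requests."

def target_model(belief, prompt): # placeholder: black-box model under test
    return f"[response to '{prompt}' under belief '{belief}']"

def toxicity(text) -> float:      # placeholder: stand-in safety scorer
    return 0.0

history = []
for round_ in range(3):
    prompt = red_team(history)          # attack improves using past feedback
    belief = belief_generator(history)  # defense improves using past feedback
    response = target_model(belief, prompt)
    history.append((prompt, belief, response, toxicity(response)))
print(f"ran {len(history)} attack/defense rounds")
```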

Think While You Write: Hypothesis Verification Promotes Faithful Knowledge-to-Text Generation

  • paper_url: http://arxiv.org/abs/2311.09467
  • repo_url: None
  • paper_authors: Yifu Qiu, Varun Embar, Shay B. Cohen, Benjamin Han
  • for: This work improves the faithfulness of neural knowledge-to-text generation models by reducing hallucination.
  • methods: It proposes a new decoding method, TWEAK (Think While Effectively Articulating Knowledge), which treats the generated sequence at each decoding step and its future continuations as hypotheses and ranks each candidate by how well those hypotheses support the input facts, using a Hypothesis Verification Model (HVM); an NLI-based sketch follows this entry. A task-specific HVM trained on a new dataset, FATE (Fact-Aligned Textual Entailment), improves faithfulness and quality further and runs faster.
  • results: The best TWEAK variants improve faithfulness, as measured by FactKB, by an average of 2.22/7.17 points on WebNLG and TekGen/GenWiki respectively, at the cost of only a 0.14/0.32-point degradation in quality measured by BERTScore.
    Abstract Neural knowledge-to-text generation models often struggle to faithfully generate descriptions for the input facts: they may produce hallucinations that contradict the given facts, or describe facts not present in the input. To reduce hallucinations, we propose a novel decoding method, TWEAK (Think While Effectively Articulating Knowledge). TWEAK treats the generated sequences at each decoding step and its future sequences as hypotheses, and ranks each generation candidate based on how well their corresponding hypotheses support the input facts using a Hypothesis Verification Model (HVM). We first demonstrate the effectiveness of TWEAK by using a Natural Language Inference (NLI) model as the HVM and report improved faithfulness with minimal impact on the quality. We then replace the NLI model with our task-specific HVM trained with a first-of-a-kind dataset, FATE (Fact-Aligned Textual Entailment), which pairs input facts with their faithful and hallucinated descriptions with the hallucinated spans marked. The new HVM improves the faithfulness and the quality further and runs faster. Overall the best TWEAK variants improve on average 2.22/7.17 points on faithfulness measured by FactKB over WebNLG and TekGen/GenWiki, respectively, with only 0.14/0.32 points degradation on quality measured by BERTScore over the same datasets. Since TWEAK is a decoding-only approach, it can be integrated with any neural generative model without retraining.
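A sketch of candidate ranking with an off-the-shelf NLI model as the hypothesis verifier, mirroring the paper's first HVM variant; the checkpoint choice and entailment label index are assumptions to verify against the model card, and the candidates are toy examples.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
nli = AutoModelForSequenceClassification.from_pretrained(name).eval()

def entailment_score(facts: str, candidate: str) -> float:
    # Premise = input facts, hypothesis = generated text.
    inputs = tok(facts, candidate, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    return probs[2].item()   # index 2 = ENTAILMENT for this checkpoint

facts = "Alan Shepard | birthPlace | New Hampshire"
candidates = ["Alan Shepard was born in New Hampshire.",
              "Alan Shepard was born in Texas."]
best = max(candidates, key=lambda c: entailment_score(facts, c))
print(best)   # the faithful candidate should score higher
```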