cs.AI - 2023-10-03

Large Language Models Can Be Good Privacy Protection Learners

  • paper_url: http://arxiv.org/abs/2310.02469
  • repo_url: https://github.com/Yijia-Xiao/PPLM
  • paper_authors: Yijia Xiao, Yiqiao Jin, Yushi Bai, Yue Wu, Xianjun Yang, Xiao Luo, Wenchao Yu, Xujiang Zhao, Yanchi Liu, Haifeng Chen, Wei Wang, Wei Cheng
  • for: Addressing the challenge of fine-tuning Large Language Models (LLMs) with domain-specific data while protecting sensitive personally identifiable information (PII).
  • methods: Proposes Privacy Protection Language Models (PPLM), a novel paradigm for fine-tuning LLMs, combining techniques such as corpus curation, penalty-based unlikelihood in the training loss, and instruction-based tuning.
  • results: Experiments show that instruction tuning with both positive and negative examples is an effective approach, protecting private data while enhancing the model's knowledge.
    Abstract The proliferation of Large Language Models (LLMs) has driven considerable interest in fine-tuning them with domain-specific data to create specialized language models. Nevertheless, such domain-specific fine-tuning data often contains sensitive personally identifiable information (PII). Direct fine-tuning LLMs on this data without privacy protection poses a risk of leakage. To address this challenge, we introduce Privacy Protection Language Models (PPLM), a novel paradigm for fine-tuning LLMs that effectively injects domain-specific knowledge while safeguarding data privacy. Our work offers a theoretical analysis for model design and delves into various techniques such as corpus curation, penalty-based unlikelihood in training loss, and instruction-based tuning, etc. Extensive experiments across diverse datasets and scenarios demonstrate the effectiveness of our approaches. In particular, instruction tuning with both positive and negative examples, stands out as a promising method, effectively protecting private data while enhancing the model's knowledge. Our work underscores the potential for Large Language Models as robust privacy protection learners.
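
As a concrete illustration of the penalty-based unlikelihood term mentioned above, here is a minimal sketch, reconstructed from the abstract rather than taken from the authors' implementation: standard likelihood training on ordinary tokens, plus an unlikelihood penalty that pushes probability mass away from PII tokens. The per-token `pii_mask` marking sensitive positions is an assumed input.

```python
import torch
import torch.nn.functional as F

def pii_unlikelihood_loss(logits, targets, pii_mask, penalty_weight=1.0):
    """Likelihood training on ordinary tokens plus an unlikelihood penalty
    on PII tokens (sketch; the mask marking PII positions is assumed given).

    logits:   (batch, seq, vocab) model outputs
    targets:  (batch, seq) gold token ids
    pii_mask: (batch, seq) bool, True where the gold token is PII
    """
    log_probs = F.log_softmax(logits, dim=-1)
    tok_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Likelihood term: maximize log p(token) on non-PII positions only.
    ce = -(tok_logp * ~pii_mask).sum() / (~pii_mask).sum().clamp(min=1)

    # Unlikelihood term: minimizing -log(1 - p(token)) pushes probability
    # mass away from the sensitive token itself.
    p_tok = tok_logp.exp().clamp(max=1 - 1e-6)
    ul = (-torch.log1p(-p_tok) * pii_mask).sum() / pii_mask.sum().clamp(min=1)

    return ce + penalty_weight * ul

# toy usage
logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
pii_mask = torch.zeros(2, 5, dtype=torch.bool)
pii_mask[:, 2] = True  # pretend position 2 holds a PII token
print(pii_unlikelihood_loss(logits, targets, pii_mask).item())
```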

EcoAssistant: Using LLM Assistant More Affordably and Accurately

  • paper_url: http://arxiv.org/abs/2310.03046
  • repo_url: https://github.com/jieyuz2/ecoassistant
  • paper_authors: Jieyu Zhang, Ranjay Krishna, Ahmed H. Awadallah, Chi Wang
  • for: Making LLM assistants that answer queries requiring external knowledge more affordable and more accurate.
  • methods: Proposes EcoAssistant, a framework with three components: (1) the LLM assistant converses with an automatic code executor to iteratively refine code or produce answers based on execution results; (2) a hierarchy of LLM assistants first attempts each query with weaker, cheaper LLMs before backing off to stronger, more expensive ones; (3) solutions from past successful queries are retrieved as in-context demonstrations for subsequent queries.
  • results: Experiments show that EcoAssistant surpasses GPT-4 by 10 points of success rate at less than 50% of GPT-4's cost.
    Abstract Today, users ask Large language models (LLMs) as assistants to answer queries that require external knowledge; they ask about the weather in a specific city, about stock prices, and even about where specific locations are within their neighborhood. These queries require the LLM to produce code that invokes external APIs to answer the user's question, yet LLMs rarely produce correct code on the first try, requiring iterative code refinement upon execution results. In addition, using LLM assistants to support high query volumes can be expensive. In this work, we contribute a framework, EcoAssistant, that enables LLMs to answer code-driven queries more affordably and accurately. EcoAssistant contains three components. First, it allows the LLM assistants to converse with an automatic code executor to iteratively refine code or to produce answers based on the execution results. Second, we use a hierarchy of LLM assistants, which attempts to answer the query with weaker, cheaper LLMs before backing off to stronger, expensive ones. Third, we retrieve solutions from past successful queries as in-context demonstrations to help subsequent queries. Empirically, we show that EcoAssistant offers distinct advantages for affordability and accuracy, surpassing GPT-4 by 10 points of success rate with less than 50% of GPT-4's cost.
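
A minimal sketch of the cost-ordered backoff (component 2) with demonstration reuse (component 3) follows; `ask` and `passes_check` are illustrative stand-ins for the LLM/code-executor conversation and its success check, not the paper's actual API.

```python
# Sketch of EcoAssistant's assistant hierarchy: try cheaper models first,
# escalate only on failure, and bank successful answers as demonstrations.

MODEL_TIERS = ["weak-cheap-llm", "strong-expensive-llm"]  # cheapest first

def ask(model: str, query: str, demos: list[str]) -> str:
    # Stand-in: a real system would run the LLM + automatic code executor loop.
    return f"[{model}] answer to: {query}"

def passes_check(answer: str) -> bool:
    # Stand-in success check (e.g., the generated code executed without error).
    return "strong" in answer  # pretend only the strong model succeeds here

def eco_answer(query: str, demo_db: list[str]) -> str | None:
    demos = demo_db[-3:]  # reuse recent successful solutions as demonstrations
    for model in MODEL_TIERS:  # escalate only when a cheaper tier fails
        answer = ask(model, query, demos)
        if passes_check(answer):
            demo_db.append(answer)  # store for future in-context reuse
            return answer
    return None  # all tiers failed

db: list[str] = []
print(eco_answer("What's the weather in Seattle?", db))
```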

Improved Inference of Human Intent by Combining Plan Recognition and Language Feedback

  • paper_url: http://arxiv.org/abs/2310.02462
  • repo_url: None
  • paper_authors: Ifrah Idrees, Tian Yun, Naveen Sharma, Yunxin Deng, Nakul Gopalan, George Konidaris, Stefanie Tellex
  • for: Helping robots recognize human plans and goals, especially from noisy observations of sub-optimal human actions.
  • methods: Proposes Dialogue for Goal Recognition (D4GR), which lets a robot rectify its belief about human progress by asking clarification questions about noisy sensor data and sub-optimal actions.
  • results: Evaluated against an HTN baseline in two simulated domains: at the highest sensor noise, D4GR is 1% better in goal accuracy in both domains, and 4% (kitchen) and 2% (blocks) better in plan accuracy, while asking 68% fewer questions than an ALWAYS-ASK oracle.
    Abstract Conversational assistive robots can aid people, especially those with cognitive impairments, to accomplish various tasks such as cooking meals, performing exercises, or operating machines. However, to interact with people effectively, robots must recognize human plans and goals from noisy observations of human actions, even when the user acts sub-optimally. Previous works on Plan and Goal Recognition (PGR) as planning have used hierarchical task networks (HTN) to model the actor/human. However, these techniques are insufficient as they do not have user engagement via natural modes of interaction such as language. Moreover, they have no mechanisms to let users, especially those with cognitive impairments, know of a deviation from their original plan or about any sub-optimal actions taken towards their goal. We propose a novel framework for plan and goal recognition in partially observable domains -- Dialogue for Goal Recognition (D4GR) enabling a robot to rectify its belief in human progress by asking clarification questions about noisy sensor data and sub-optimal human actions. We evaluate the performance of D4GR over two simulated domains -- kitchen and blocks domain. With language feedback and the world state information in a hierarchical task model, we show that D4GR framework for the highest sensor noise performs 1% better than HTN in goal accuracy in both domains. For plan accuracy, D4GR outperforms by 4% in the kitchen domain and 2% in the blocks domain in comparison to HTN. The ALWAYS-ASK oracle outperforms our policy by 3% in goal recognition and 7% in plan recognition. D4GR does so by asking 68% fewer questions than an oracle baseline. We also demonstrate a real-world robot scenario in the kitchen domain, validating the improved plan and goal recognition of D4GR in a realistic setting.
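
The core trade-off — act on the current goal belief or spend a question to sharpen it — can be illustrated with a simple entropy-thresholded rule. The rule and threshold below are assumptions for illustration only; D4GR learns its questioning policy, which is how it manages to ask 68% fewer questions than ALWAYS-ASK.

```python
import math

def entropy(belief: dict[str, float]) -> float:
    return -sum(p * math.log(p) for p in belief.values() if p > 0)

def ask_or_act(belief: dict[str, float], ask_threshold: float = 0.8) -> str:
    """Illustrative ask-or-act rule: query the user only when the goal
    belief is too uncertain; otherwise act toward the most likely goal."""
    if entropy(belief) > ask_threshold:
        return "ask_clarification"
    return max(belief, key=belief.get)

belief = {"make_tea": 0.45, "make_coffee": 0.40, "clean_counter": 0.15}
print(ask_or_act(belief))  # -> "ask_clarification" (entropy ~ 1.01 > 0.8)
```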

Learning Optimal Advantage from Preferences and Mistaking it for Reward

  • paper_url: http://arxiv.org/abs/2310.02456
  • repo_url: https://github.com/Stephanehk/Learning-OA-From-Prefs
  • paper_authors: W. Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson, Serena Booth, Anca Dragan, Peter Stone, Scott Niekum
  • for: Studying how reward functions are learned from human preferences over pairs of trajectory segments, as used in RLHF, and in particular how human preferences should be modeled.
  • methods: Investigates the consequences of assuming preferences are based on partial return when they actually arise from regret, and analyzes what the learned function then represents.
  • results: Under the regret preference model, the learned function approximates the optimal advantage function rather than a reward function. If a specific pitfall is addressed, this incorrect assumption is not particularly harmful and yields a highly shaped reward function, though greedy maximization of the learned advantage is the simpler and more appropriate use.
    Abstract We consider algorithms for learning reward functions from human preferences over pairs of trajectory segments, as used in reinforcement learning from human feedback (RLHF). Most recent work assumes that human preferences are generated based only upon the reward accrued within those segments, or their partial return. Recent work casts doubt on the validity of this assumption, proposing an alternative preference model based upon regret. We investigate the consequences of assuming preferences are based upon partial return when they actually arise from regret. We argue that the learned function is an approximation of the optimal advantage function, $\hat{A^*_r}$, not a reward function. We find that if a specific pitfall is addressed, this incorrect assumption is not particularly harmful, resulting in a highly shaped reward function. Nonetheless, this incorrect usage of $\hat{A^*_r}$ is less desirable than the appropriate and simpler approach of greedy maximization of $\hat{A^*_r}$. From the perspective of the regret preference model, we also provide a clearer interpretation of fine tuning contemporary large language models with RLHF. This paper overall provides insight regarding why learning under the partial return preference model tends to work so well in practice, despite it conforming poorly to how humans give preferences.
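
The two preference models being contrasted can be written down in a few lines: a Boltzmann/Bradley-Terry choice rule scored either by a segment's partial return or by its (negated) regret, i.e., the sum of optimal advantages along the segment. The segments and numbers below are illustrative placeholders; in the paper the advantages come from the optimal policy for the true reward.

```python
import math

def boltzmann_pref(score_a: float, score_b: float, beta: float = 1.0) -> float:
    """P(segment a preferred over b) under a Boltzmann/Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(-beta * (score_a - score_b)))

def partial_return(rewards: list[float]) -> float:
    # Partial-return model: score a segment by its summed rewards.
    return sum(rewards)

def negated_regret(advantages: list[float]) -> float:
    # Regret model (sketch): sum of optimal advantages A*_r(s, a); each
    # A* <= 0, and closer to 0 means the action incurred less regret.
    return sum(advantages)

# The paper's point: fitting a "reward" to regret-generated preferences
# actually recovers an approximation of A*_r, not of the reward r.
seg_a = {"rewards": [1.0, 0.0, 1.0], "advantages": [0.0, -0.2, 0.0]}
seg_b = {"rewards": [0.0, 2.0, 0.0], "advantages": [-1.0, 0.0, -0.5]}
print(boltzmann_pref(partial_return(seg_a["rewards"]),
                     partial_return(seg_b["rewards"])))    # 0.5: returns tie
print(boltzmann_pref(negated_regret(seg_a["advantages"]),
                     negated_regret(seg_b["advantages"])))  # ~0.79: a preferred
```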

Low-Resource Languages Jailbreak GPT-4

  • paper_url: http://arxiv.org/abs/2310.02446
  • repo_url: None
  • paper_authors: Zheng-Xin Yong, Cristina Menghini, Stephen H. Bach
  • for: Exposing the cross-lingual vulnerability of LLM safety mechanisms intended to prevent the generation of unsafe content.
  • methods: A translation-based attack that exploits the linguistic inequality of safety training data by translating unsafe English inputs into low-resource languages.
  • results: Successfully circumvents GPT-4's safeguards, providing actionable items toward users' harmful goals 79% of the time on AdvBench; attacks through high- and mid-resource languages have far lower success rates.
    Abstract AI safety training and red-teaming of large language models (LLMs) are measures to mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual vulnerability of these safety mechanisms, resulting from the linguistic inequality of safety training data, by successfully circumventing GPT-4's safeguard through translating unsafe English inputs into low-resource languages. On the AdvBenchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can get the users towards their harmful goals 79% of the time, which is on par with or even surpassing state-of-the-art jailbreaking attacks. Other high-/mid-resource languages have significantly lower attack success rate, which suggests that the cross-lingual vulnerability mainly applies to low-resource languages. Previously, limited training on low-resource languages primarily affects speakers of those languages, causing technological disparities. However, our work highlights a crucial shift: this deficiency now poses a risk to all LLMs users. Publicly available translation APIs enable anyone to exploit LLMs' safety vulnerabilities. Therefore, our work calls for a more holistic red-teaming efforts to develop robust multilingual safeguards with wide language coverage.

Learning Diverse Skills for Local Navigation under Multi-constraint Optimality

  • paper_url: http://arxiv.org/abs/2310.02440
  • repo_url: None
  • paper_authors: Jin Cheng, Marin Vlastelica, Pavel Kolev, Chenhao Li, Georg Martius
  • for: Obtaining meaningfully diverse behaviors in data-driven robot control without compromising task performance.
  • methods: Takes a constrained-optimization view of the quality-diversity trade-off, learning diverse policies while imposing constraints on value functions defined through distinct rewards; the diversity level is further controlled through an attract-repel reward term motivated by the Van der Waals force.
  • results: Demonstrated on a local navigation task in which a quadruped robot must reach a target within a finite horizon; the trained policies transfer to the real 12-DoF quadruped Solo12 and exhibit diverse agile behaviors with successful obstacle traversal.
    Abstract Despite many successful applications of data-driven control in robotics, extracting meaningful diverse behaviors remains a challenge. Typically, task performance needs to be compromised in order to achieve diversity. In many scenarios, task requirements are specified as a multitude of reward terms, each requiring a different trade-off. In this work, we take a constrained optimization viewpoint on the quality-diversity trade-off and show that we can obtain diverse policies while imposing constraints on their value functions which are defined through distinct rewards. In line with previous work, further control of the diversity level can be achieved through an attract-repel reward term motivated by the Van der Waals force. We demonstrate the effectiveness of our method on a local navigation task where a quadruped robot needs to reach the target within a finite horizon. Finally, our trained policies transfer well to the real 12-DoF quadruped robot, Solo12, and exhibit diverse agile behaviors with successful obstacle traversal.
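
The attract-repel idea can be sketched with a Lennard-Jones-style potential over skill embeddings: maximizing the reward repels skills that are too similar and weakly attracts very distant ones. The 12-6 exponents and the embedding-space distance are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def attract_repel_reward(z_i: np.ndarray, others: list[np.ndarray],
                         sigma: float = 1.0) -> float:
    """Van-der-Waals-style diversity reward between skill embeddings (sketch).

    The negated Lennard-Jones potential strongly penalizes skills closer than
    ~sigma (repulsion) and mildly rewards reducing very large gaps
    (attraction), so skills spread out without diverging arbitrarily.
    """
    total = 0.0
    for z_j in others:
        d = max(np.linalg.norm(z_i - z_j), 1e-6)  # avoid division by zero
        total += -4.0 * ((sigma / d) ** 12 - (sigma / d) ** 6)
    return total

skills = [np.array([0.0, 0.0]), np.array([0.5, 0.0]), np.array([3.0, 0.0])]
print(attract_repel_reward(skills[0], skills[1:]))  # large negative: too close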

Multi-Agent Reinforcement Learning Based on Representational Communication for Large-Scale Traffic Signal Control

  • paper_url: http://arxiv.org/abs/2310.02435
  • repo_url: None
  • paper_authors: Rohit Bokade, Xiaoning Jin, Christopher Amato
  • for: Improving the efficiency and scalability of large-scale traffic signal control (TSC).
  • methods: A communication-based multi-agent reinforcement learning (MARL) framework in which each agent learns a communication policy dictating which part of a message is sent to whom, so agents use the communication channel only when necessary, reducing noise in the system.
  • results: Achieves the lowest network congestion compared to related methods on a synthetic $4 \times 4$ grid network and a real-world network based on the Pasubio neighborhood in Bologna, with agents using $\sim 47-65 \%$ of the communication channel; ablation studies confirm the effectiveness of the learned communication policies.
    Abstract Traffic signal control (TSC) is a challenging problem within intelligent transportation systems and has been tackled using multi-agent reinforcement learning (MARL). While centralized approaches are often infeasible for large-scale TSC problems, decentralized approaches provide scalability but introduce new challenges, such as partial observability. Communication plays a critical role in decentralized MARL, as agents must learn to exchange information using messages to better understand the system and achieve effective coordination. Deep MARL has been used to enable inter-agent communication by learning communication protocols in a differentiable manner. However, many deep MARL communication frameworks proposed for TSC allow agents to communicate with all other agents at all times, which can add to the existing noise in the system and degrade overall performance. In this study, we propose a communication-based MARL framework for large-scale TSC. Our framework allows each agent to learn a communication policy that dictates "which" part of the message is sent "to whom". In essence, our framework enables agents to selectively choose the recipients of their messages and exchange variable length messages with them. This results in a decentralized and flexible communication mechanism in which agents can effectively use the communication channel only when necessary. We designed two networks, a synthetic $4 \times 4$ grid network and a real-world network based on the Pasubio neighborhood in Bologna. Our framework achieved the lowest network congestion compared to related methods, with agents utilizing $\sim 47-65 \%$ of the communication channel. Ablation studies further demonstrated the effectiveness of the communication policies learned within our framework.
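
A sketch of the learned "which part of the message is sent to whom" policy: a per-recipient gate decides who receives anything, and a per-recipient mask selects which message dimensions are sent. Layer sizes and the hard sigmoid threshold are illustrative assumptions (a trained system would need a differentiable relaxation such as Gumbel-softmax).

```python
import torch
import torch.nn as nn

class SelectiveComm(nn.Module):
    """Sketch of a selective communication policy for one agent."""
    def __init__(self, obs_dim: int, msg_dim: int, n_neighbors: int):
        super().__init__()
        self.encode = nn.Linear(obs_dim, msg_dim)
        self.recipient_gate = nn.Linear(obs_dim, n_neighbors)          # "to whom"
        self.content_mask = nn.Linear(obs_dim, n_neighbors * msg_dim)  # "which part"
        self.msg_dim = msg_dim

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        msg = torch.tanh(self.encode(obs))                      # (B, msg_dim)
        send = torch.sigmoid(self.recipient_gate(obs)) > 0.5    # (B, N) hard gate
        mask = torch.sigmoid(self.content_mask(obs)).view(
            obs.shape[0], -1, self.msg_dim)                     # (B, N, msg_dim)
        # Each neighbor receives a (possibly empty) masked view of the message.
        return send.unsqueeze(-1) * mask * msg.unsqueeze(1)

comm = SelectiveComm(obs_dim=8, msg_dim=4, n_neighbors=3)
print(comm(torch.randn(2, 8)).shape)  # torch.Size([2, 3, 4])
```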

Episodic Memory Theory for the Mechanistic Interpretation of Recurrent Neural Networks

  • paper_url: http://arxiv.org/abs/2310.02430
  • repo_url: None
  • paper_authors: Arjun Karuvally, Peter Delmastro, Hava T. Siegelmann
  • for: Providing a deeper mechanistic understanding of Recurrent Neural Networks (RNNs) and their relationship to human memory.
  • methods: Proposes the Episodic Memory Theory (EMT), which conceptualizes RNNs as discrete-time analogs of the General Sequential Episodic Memory Model; introduces a set of algorithmic tasks to probe variable binding behavior in RNNs and a mathematically rigorous circuit that facilitates variable binding.
  • results: Trained RNNs consistently converge to the variable binding circuit, indicating universality in RNN dynamics; an algorithm for defining a privileged basis reveals hidden neurons instrumental in the temporal storage and composition of variables and enhances the interpretability of learned parameters and hidden states.
    Abstract Understanding the intricate operations of Recurrent Neural Networks (RNNs) mechanistically is pivotal for advancing their capabilities and applications. In this pursuit, we propose the Episodic Memory Theory (EMT), illustrating that RNNs can be conceptualized as discrete-time analogs of the recently proposed General Sequential Episodic Memory Model. To substantiate EMT, we introduce a novel set of algorithmic tasks tailored to probe the variable binding behavior in RNNs. Utilizing the EMT, we formulate a mathematically rigorous circuit that facilitates variable binding in these tasks. Our empirical investigations reveal that trained RNNs consistently converge to the variable binding circuit, thus indicating universality in the dynamics of RNNs. Building on these findings, we devise an algorithm to define a privileged basis, which reveals hidden neurons instrumental in the temporal storage and composition of variables, a mechanism vital for the successful generalization in these tasks. We show that the privileged basis enhances the interpretability of the learned parameters and hidden states of RNNs. Our work represents a step toward demystifying the internal mechanisms of RNNs and, for computational neuroscience, serves to bridge the gap between artificial neural networks and neural memory models.

AXNav: Replaying Accessibility Tests from Natural Language

  • paper_url: http://arxiv.org/abs/2310.02424
  • repo_url: None
  • paper_authors: Maryam Taeb, Amanda Swearngin, Eldon Schoop, Ruijia Cheng, Yue Jiang, Jeffrey Nichols
  • for: Supporting accessibility testing through a natural-language-based testing workflow.
  • methods: A system that takes a manual accessibility test as input (e.g., "Search for a show in VoiceOver") and uses an LLM combined with pixel-based UI understanding models to execute the test and produce a chaptered, navigable video; heuristics detect and flag accessibility issues such as text not scaling with Large Text enabled or VoiceOver navigation loops.
  • results: In a 10-participant study with accessibility QA professionals, participants indicated the tool would be very useful in their current work and it performed tests similarly to how they test manually; the study also yields insights for future use of LLMs in accessibility testing.
    Abstract Developers and quality assurance testers often rely on manual testing to test accessibility features throughout the product lifecycle. Unfortunately, manual testing can be tedious, often has an overwhelming scope, and can be difficult to schedule amongst other development milestones. Recently, Large Language Models (LLMs) have been used for a variety of tasks including automation of UIs, however to our knowledge no one has yet explored their use in controlling assistive technologies for the purposes of supporting accessibility testing. In this paper, we explore the requirements of a natural language based accessibility testing workflow, starting with a formative study. From this we build a system that takes as input a manual accessibility test (e.g., ``Search for a show in VoiceOver'') and uses an LLM combined with pixel-based UI Understanding models to execute the test and produce a chaptered, navigable video. In each video, to help QA testers we apply heuristics to detect and flag accessibility issues (e.g., Text size not increasing with Large Text enabled, VoiceOver navigation loops). We evaluate this system through a 10 participant user study with accessibility QA professionals who indicated that the tool would be very useful in their current work and performed tests similarly to how they would manually test the features. The study also reveals insights for future work on using LLMs for accessibility testing.

OneAdapt: Fast Adaptation for Deep Learning Applications via Backpropagation

  • paper_url: http://arxiv.org/abs/2310.02422
  • repo_url: None
  • paper_authors: Kuntai Du, Yuhan Liu, Yitian Hao, Qizheng Zhang, Haodong Wang, Yuyang Huang, Ganesh Ananthanarayanan, Junchen Jiang
  • for: Improving deep learning inference on streaming media data (e.g., object detection in video, text extraction from audio) while reducing the network bandwidth and GPU resources it demands.
  • methods: OneAdapt adapts configuration knobs (such as video resolution and frame rate) with a gradient-ascent strategy, exploiting DNN differentiability to quickly estimate the accuracy's gradient with respect to each knob (AccGrad) as the product of InputGrad (how the knob affects the DNN input) and DNNGrad (how the input affects the DNN output).
  • results: Across five types of configurations, four analytic tasks, and five types of input data, OneAdapt cuts bandwidth and GPU usage by 15-59% at comparable accuracy, or improves accuracy by 1-5% with equal or fewer resources, compared to state-of-the-art adaptation schemes.
    Abstract Deep learning inference on streaming media data, such as object detection in video or LiDAR feeds and text extraction from audio waves, is now ubiquitous. To achieve high inference accuracy, these applications typically require significant network bandwidth to gather high-fidelity data and extensive GPU resources to run deep neural networks (DNNs). While the high demand for network bandwidth and GPU resources could be substantially reduced by optimally adapting the configuration knobs, such as video resolution and frame rate, current adaptation techniques fail to meet three requirements simultaneously: adapt configurations (i) with minimum extra GPU or bandwidth overhead; (ii) to reach near-optimal decisions based on how the data affects the final DNN's accuracy, and (iii) do so for a range of configuration knobs. This paper presents OneAdapt, which meets these requirements by leveraging a gradient-ascent strategy to adapt configuration knobs. The key idea is to embrace DNNs' differentiability to quickly estimate the accuracy's gradient to each configuration knob, called AccGrad. Specifically, OneAdapt estimates AccGrad by multiplying two gradients: InputGrad (i.e. how each configuration knob affects the input to the DNN) and DNNGrad (i.e. how the DNN input affects the DNN inference output). We evaluate OneAdapt across five types of configurations, four analytic tasks, and five types of input data. Compared to state-of-the-art adaptation schemes, OneAdapt cuts bandwidth usage and GPU usage by 15-59% while maintaining comparable accuracy or improves accuracy by 1-5% while using equal or fewer resources.
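
The AccGrad idea is the chain rule made operational: because the DNN is differentiable, autograd composes InputGrad and DNNGrad automatically, enabling gradient ascent on accuracy with respect to a knob. The toy knob (a soft downscaling strength), frame, and accuracy proxy below are assumptions for illustration, not the paper's system.

```python
import torch

torch.manual_seed(0)
dnn = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU(),
                          torch.nn.Linear(8, 1))
frame = torch.randn(16)

knob = torch.tensor(0.5, requires_grad=True)  # e.g., resolution scale in [0, 1]
opt = torch.optim.SGD([knob], lr=0.1, maximize=True)  # gradient *ascent*

for step in range(20):
    opt.zero_grad()
    dnn_input = frame * knob           # InputGrad: d(input)/d(knob) = frame
    acc_proxy = dnn(dnn_input).mean()  # DNNGrad: d(output)/d(input) via autograd
    acc_proxy.backward()               # chains the two: AccGrad lands in knob.grad
    opt.step()
    knob.data.clamp_(0.0, 1.0)         # keep the knob in its valid range

print(f"adapted knob: {knob.item():.3f}")
```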

Can a student Large Language Model perform as well as it’s teacher?

  • paper_url: http://arxiv.org/abs/2310.02421
  • repo_url: None
  • paper_authors: Sia Gholami, Marwan Omar
  • for: Addressing the deployment challenges that high-accuracy deep learning models face in resource-constrained environments.
  • methods: A comprehensive overview of knowledge distillation, which transfers knowledge from a high-capacity "teacher" model to a streamlined "student" model, emphasizing foundational principles such as the utility of soft labels and the significance of temperature scaling.
  • results: A careful analysis of the critical determinants of successful distillation, including the architecture of the student model, the caliber of the teacher, and the delicate balance of hyperparameters.
    Abstract The burgeoning complexity of contemporary deep learning models, while achieving unparalleled accuracy, has inadvertently introduced deployment challenges in resource-constrained environments. Knowledge distillation, a technique aiming to transfer knowledge from a high-capacity "teacher" model to a streamlined "student" model, emerges as a promising solution to this dilemma. This paper provides a comprehensive overview of the knowledge distillation paradigm, emphasizing its foundational principles such as the utility of soft labels and the significance of temperature scaling. Through meticulous examination, we elucidate the critical determinants of successful distillation, including the architecture of the student model, the caliber of the teacher, and the delicate balance of hyperparameters. While acknowledging its profound advantages, we also delve into the complexities and challenges inherent in the process. Our exploration underscores knowledge distillation's potential as a pivotal technique in optimizing the trade-off between model performance and deployment efficiency.
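
The soft-label and temperature-scaling ingredients correspond to the standard Hinton-style distillation objective; a compact version is below, with `T` and `alpha` being exactly the kind of hyperparameters whose balance the paper discusses.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 4.0, alpha: float = 0.5):
    """Standard knowledge-distillation objective.

    - Soft term: KL between temperature-softened teacher and student
      distributions; the T**2 factor keeps gradient magnitudes comparable.
    - Hard term: ordinary cross-entropy against ground-truth labels.
    alpha balances the two terms.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# toy usage
s, t = torch.randn(8, 10), torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
print(distillation_loss(s, t, y).item())
```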

Nugget 2D: Dynamic Contextual Compression for Scaling Decoder-only Language Models

  • paper_url: http://arxiv.org/abs/2310.02409
  • repo_url: None
  • paper_authors: Guanghui Qin, Corby Rosset, Ethan C. Chau, Nikhil Rao, Benjamin Van Durme
  • for: Improving how decoder-only language models scale to long contexts.
  • methods: Dynamic contextual compression that extends the Nugget approach of Qin & Van Durme (2023) from BERT-like frameworks to decoder-only LMs; history is modeled as compressed "nuggets" trained to allow reconstruction, and the method can be initialized with off-the-shelf models such as LLaMA.
  • results: Experiments in language modeling, question answering, and summarization show that Nugget2D retains capability on these tasks while drastically reducing decoding time and space; in autoencoding, it shrinks context at a 20x compression ratio with a BLEU score of 98% for reconstruction, i.e., nearly lossless encoding.
    Abstract Standard Transformer-based language models (LMs) scale poorly to long contexts. We propose a solution based on dynamic contextual compression, which extends the Nugget approach of Qin & Van Durme (2023) from BERT-like frameworks to decoder-only LMs. Our method models history as compressed "nuggets" which are trained to allow for reconstruction, and it can be initialized with off-the-shelf models such as LLaMA. We demonstrate through experiments in language modeling, question answering, and summarization that Nugget2D retains capabilities in these tasks, while drastically reducing the overhead during decoding in terms of time and space. For example, in the experiments of autoencoding, Nugget2D can shrink context at a 20x compression ratio with a BLEU score of 98% for reconstruction, achieving nearly lossless encoding.

PCGPT: Procedural Content Generation via Transformers

  • paper_url: http://arxiv.org/abs/2310.02405
  • repo_url: None
  • paper_authors: Sajad Mohaghegh, Mohammad Amin Ramezan Dehnavi, Golnoosh Abdollahinejad, Matin Hashemi
  • for: Proposing a procedural content generation (PCG) method based on offline reinforcement learning and transformer networks that produces more complex and diverse game content.
  • methods: PCGPT, an autoregressive transformer-based model of trajectories of actions, states, and rewards, using self-attention to capture temporal dependencies and causal relationships while generating game levels iteratively.
  • results: On the Sokoban puzzle game, where the model predicts needed items and their locations, PCGPT generates more complex and diverse content, and does so in significantly fewer steps than existing methods.
    Abstract The paper presents the PCGPT framework, an innovative approach to procedural content generation (PCG) using offline reinforcement learning and transformer networks. PCGPT utilizes an autoregressive model based on transformers to generate game levels iteratively, addressing the challenges of traditional PCG methods such as repetitive, predictable, or inconsistent content. The framework models trajectories of actions, states, and rewards, leveraging the transformer's self-attention mechanism to capture temporal dependencies and causal relationships. The approach is evaluated in the Sokoban puzzle game, where the model predicts items that are needed with their corresponding locations. Experimental results on the game Sokoban demonstrate that PCGPT generates more complex and diverse game content. Interestingly, it achieves these results in significantly fewer steps compared to existing methods, showcasing its potential for enhancing game design and online content generation. Our model represents a new PCG paradigm which outperforms previous methods.

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

  • paper_url: http://arxiv.org/abs/2310.04451
  • repo_url: https://github.com/sheltonliu-n/autodan
  • paper_authors: Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao
  • for: investigate jailbreak attacks on aligned large language models (LLMs) and develop a novel approach to automatically generate stealthy jailbreak prompts
  • methods: hierarchical genetic algorithm to generate stealthy jailbreak prompts that preserve semantic meaningfulness
  • results: AutoDAN demonstrates superior attack strength in cross-model transferability and cross-sample universality compared with the baseline, and can effectively bypass perplexity-based defense methods
    Abstract The aligned Large Language Models (LLMs) are powerful language understanding and decision-making tools that are created through extensive alignment with human feedback. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit malicious outputs that should not be given by aligned LLMs. Investigating jailbreak prompts can lead us to delve into the limitations of LLMs and further guide us to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks heavily rely on manual crafting of prompts, or (2) stealthiness problems, as attacks depend on token-based algorithms to generate prompts that are often semantically meaningless, making them susceptible to detection through basic perplexity testing. In light of these challenges, we intend to answer this question: Can we develop an approach that can automatically generate stealthy jailbreak prompts? In this paper, we introduce AutoDAN, a novel jailbreak attack against aligned LLMs. AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm. Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also demonstrates superior attack strength in cross-model transferability, and cross-sample universality compared with the baseline. Moreover, we also compare AutoDAN with perplexity-based defense methods and show that AutoDAN can bypass them effectively.

SE(3)-Stochastic Flow Matching for Protein Backbone Generation

  • paper_url: http://arxiv.org/abs/2310.02391
  • repo_url: https://github.com/dreamfold/foldflow
  • paper_authors: Avishek Joey Bose, Tara Akhound-Sadegh, Kilian Fatras, Guillaume Huguet, Jarrid Rector-Brooks, Cheng-Hao Liu, Andrei Cristian Nica, Maksym Korablyov, Michael Bronstein, Alexander Tong
  • for: The computational design of novel protein structures, using a series of generative models based on the flow-matching paradigm over 3D rigid motions (the group SE(3)).
  • methods: Introduces three generative models of increasing modeling power: FoldFlow-Base, a simulation-free approach to learning deterministic continuous-time dynamics that match invariant target distributions on SE(3); FoldFlow-OT, which accelerates training by incorporating Riemannian optimal transport, yielding simpler and more stable flows; and FoldFlow-SFM, which couples Riemannian OT with simulation-free training to learn stochastic continuous-time dynamics over SE(3).
  • results: The FoldFlow models are more stable and faster to train than diffusion-based approaches, can map any invariant source distribution to any invariant target distribution over SE(3), and generate high-quality, designable, diverse, and novel protein backbones of up to 300 amino acids.
    Abstract The computational design of novel protein structures has the potential to impact numerous scientific disciplines greatly. Toward this goal, we introduce $\text{FoldFlow}$ a series of novel generative models of increasing modeling power based on the flow-matching paradigm over $3\text{D}$ rigid motions -- i.e. the group $\text{SE(3)}$ -- enabling accurate modeling of protein backbones. We first introduce $\text{FoldFlow-Base}$, a simulation-free approach to learning deterministic continuous-time dynamics and matching invariant target distributions on $\text{SE(3)}$. We next accelerate training by incorporating Riemannian optimal transport to create $\text{FoldFlow-OT}$, leading to the construction of both more simple and stable flows. Finally, we design $\text{FoldFlow-SFM}$ coupling both Riemannian OT and simulation-free training to learn stochastic continuous-time dynamics over $\text{SE(3)}$. Our family of $\text{FoldFlow}$ generative models offer several key advantages over previous approaches to the generative modeling of proteins: they are more stable and faster to train than diffusion-based approaches, and our models enjoy the ability to map any invariant source distribution to any invariant target distribution over $\text{SE(3)}$. Empirically, we validate our FoldFlow models on protein backbone generation of up to $300$ amino acids leading to high-quality designable, diverse, and novel samples.

ProtoNER: Few shot Incremental Learning for Named Entity Recognition using Prototypical Networks

  • paper_url: http://arxiv.org/abs/2310.02372
  • repo_url: None
  • paper_authors: Ritesh Kumar, Saurabh Goyal, Ashish Verma, Vatche Isahagian
  • for: Enabling key-value pair (KVP) extraction / NER models to gain new classes without re-annotating the entire training dataset and retraining from scratch, both of which slow the deployment of updated models.
  • methods: ProtoNER, a prototypical-network-based end-to-end KVP extraction model with (1) no dependency on the dataset used for initial training, (2) no intermediate synthetic data generation, which tends to add noise and degrade performance, and (3) a hybrid loss function that retains knowledge of older classes while learning newly added ones.
  • results: Fine-tuned with just 30 samples, ProtoNER achieves results on newly added classes similar to a regular model fine-tuned with 2600 samples.
    Abstract Key value pair (KVP) extraction or Named Entity Recognition (NER) from visually rich documents has been an active area of research in the document understanding and data extraction domain. Several transformer based models such as LayoutLMv2, LayoutLMv3, and LiLT have emerged achieving state of the art results. However, addition of even a single new class to the existing model requires (a) re-annotation of the entire training dataset to include this new class and (b) retraining the model again. Both of these issues really slow down the deployment of the updated model. We present ProtoNER: a Prototypical Network based end-to-end KVP extraction model that allows addition of new classes to an existing model while requiring a minimal number of newly annotated training samples. The key contributions of our model are: (1) No dependency on the dataset used for initial training of the model, which alleviates the need to retain the original training dataset for longer duration as well as data re-annotation which is a very time consuming task, (2) No intermediate synthetic data generation which tends to add noise and results in model's performance degradation, and (3) Hybrid loss function which allows the model to retain knowledge about older classes as well as learn about newly added classes. Experimental results show that ProtoNER finetuned with just 30 samples is able to achieve similar results for the newly added classes as that of a regular model finetuned with 2600 samples.
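
A sketch of prototypical-network classification plus a hybrid loss of the kind the abstract describes: cross-entropy on all current classes, plus a distillation term that preserves old-class behavior without the original training set. The exact form of ProtoNER's loss is not given in the abstract, so this is an assumed reconstruction.

```python
import torch
import torch.nn.functional as F

def prototypes(embeddings: torch.Tensor, labels: torch.Tensor, n_classes: int):
    """Class prototype = mean embedding of that class's support tokens."""
    return torch.stack([embeddings[labels == c].mean(dim=0)
                        for c in range(n_classes)])

def proto_logits(queries: torch.Tensor, protos: torch.Tensor):
    """Score each query by negative squared distance to each prototype."""
    return -torch.cdist(queries, protos) ** 2

def hybrid_loss(queries, labels, protos, old_logits_frozen, lam: float = 0.5):
    """Cross-entropy over old + new classes, plus a distillation term that
    keeps old-class predictions close to the frozen pre-update model."""
    logits = proto_logits(queries, protos)
    ce = F.cross_entropy(logits, labels)
    n_old = old_logits_frozen.shape[-1]
    distill = F.kl_div(F.log_softmax(logits[:, :n_old], dim=-1),
                       F.softmax(old_logits_frozen, dim=-1),
                       reduction="batchmean")
    return ce + lam * distill

# toy usage: classes 0-1 are "old", 2-3 are newly added
emb = torch.randn(20, 16)
lab = torch.arange(20) % 4
protos = prototypes(emb, lab, n_classes=4)
old = torch.randn(20, 2)  # frozen model's logits over the 2 old classes
print(hybrid_loss(emb, lab, protos, old).item())
```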

A Deep Reinforcement Learning Approach for Interactive Search with Sentence-level Feedback

  • paper_url: http://arxiv.org/abs/2310.03043
  • repo_url: None
  • paper_authors: Jianghong Zhou, Joyce C. Ho, Chen Lin, Eugene Agichtein
  • for: Improving search accuracy by incorporating users' fine-grained, sentence-level interaction feedback, which existing item-level RL approaches ignore.
  • methods: DQrank, a deep Q-learning (DQ) approach that adapts BERT-based models to select crucial sentences based on user engagement and rank items accordingly; two mechanisms are proposed for better exploring the optimal action space, and DQ's experience replay stores feedback sentences to obtain better initial ranking performance.
  • results: On three search datasets, DQrank performs at least 12% better than previous state-of-the-art RL approaches; detailed ablation studies show that each model component efficiently extracts and accumulates long-term engagement effects from sentence-level feedback.
    Abstract Interactive search can provide a better experience by incorporating interaction feedback from the users. This can significantly improve search accuracy as it helps avoid irrelevant information and captures the users' search intents. Existing state-of-the-art (SOTA) systems use reinforcement learning (RL) models to incorporate the interactions but focus on item-level feedback, ignoring the fine-grained information found in sentence-level feedback. Yet such feedback requires extensive RL action space exploration and large amounts of annotated data. This work addresses these challenges by proposing a new deep Q-learning (DQ) approach, DQrank. DQrank adapts BERT-based models, the SOTA in natural language processing, to select crucial sentences based on users' engagement and rank the items to obtain more satisfactory responses. We also propose two mechanisms to better explore optimal actions. DQrank further utilizes the experience replay mechanism in DQ to store the feedback sentences to obtain a better initial ranking performance. We validate the effectiveness of DQrank on three search datasets. The results show that DQRank performs at least 12% better than the previous SOTA RL approaches. We also conduct detailed ablation studies. The ablation results demonstrate that each model component can efficiently extract and accumulate long-term engagement effects from the users' sentence-level feedback. This structure offers new technologies with promised performance to construct a search system with sentence-level interaction.

Prioritized Soft Q-Decomposition for Lexicographic Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2310.02360
  • repo_url: None
  • paper_authors: Finn Rietz, Stefan Heinrich, Erik Schaffernicht, Johannes Andreas Stork
  • for: solves complex tasks by breaking them down into elementary subtasks and reusing subtask solutions
  • methods: value decomposition, prioritized soft Q-decomposition (PSQD)
  • results: successful learning, reuse, and adaptation results for simulated robot control tasks, and offline learning results without new environment interaction during adaptation.
    Abstract Reinforcement learning (RL) for complex tasks remains a challenge, primarily due to the difficulties of engineering scalar reward functions and the inherent inefficiency of training models from scratch. Instead, it would be better to specify complex tasks in terms of elementary subtasks and to reuse subtask solutions whenever possible. In this work, we address continuous space lexicographic multi-objective RL problems, consisting of prioritized subtasks, which are notoriously difficult to solve. We show that these can be scalarized with a subtask transformation and then solved incrementally using value decomposition. Exploiting this insight, we propose prioritized soft Q-decomposition (PSQD), a novel algorithm for learning and adapting subtask solutions under lexicographic priorities in continuous state-action spaces. PSQD offers the ability to reuse previously learned subtask solutions in a zero-shot composition, followed by an adaptation step. Its ability to use retained subtask training data for offline learning eliminates the need for new environment interaction during adaptation. We demonstrate the efficacy of our approach by presenting successful learning, reuse, and adaptation results for both low- and high-dimensional simulated robot control tasks, as well as offline learning results. In contrast to baseline approaches, PSQD does not trade off between conflicting subtasks or priority constraints and satisfies subtask priorities during learning. PSQD provides an intuitive framework for tackling complex RL problems, offering insights into the inner workings of the subtask composition.
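
The lexicographic priority constraint can be illustrated over discrete actions: each lower-priority subtask may only choose among actions that are near-optimal for every higher-priority subtask. PSQD itself works with soft Q-functions in continuous state-action spaces, so the hard Q-value slack below is a simplification.

```python
import numpy as np

def lexicographic_action(q_values: list[np.ndarray], slack: float = 0.1) -> int:
    """Sketch of lexicographic action selection over prioritized subtasks.

    q_values[i] holds Q-values (one per discrete action) for subtask i,
    listed in priority order. Higher-priority subtasks progressively
    restrict the set of allowed actions to those within `slack` of optimal.
    """
    allowed = np.arange(len(q_values[0]))
    for q in q_values[:-1]:
        best = q[allowed].max()
        allowed = allowed[q[allowed] >= best - slack]  # keep near-optimal actions
    return int(allowed[np.argmax(q_values[-1][allowed])])

q_navigate = np.array([1.0, 0.95, 0.2])  # high priority: reach the target
q_avoid    = np.array([0.1, 0.9, 1.0])   # lower priority: avoid obstacles
print(lexicographic_action([q_navigate, q_avoid]))  # -> 1: best avoidance among
                                                    #    near-optimal navigation
```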

On the definition of toxicity in NLP

  • paper_url: http://arxiv.org/abs/2310.02357
  • repo_url: None
  • paper_authors: Sergey Berezin, Reza Farahbakhsh, Noel Crespi
  • for: Proposing a new, stress-level-based definition of toxicity designed to be objective and context-aware, since the current ill-defined notion forces models to be trained on subjective and vague data.
  • methods: Describes possible ways of applying the new definition to dataset creation and model training.
  • results: No experimental results are reported; the expectation is that the new definition will lead to more robust and accurate toxicity detection.
    Abstract The fundamental problem in toxicity detection task lies in the fact that the toxicity is ill-defined. This causes us to rely on subjective and vague data in models' training, which results in non-robust and non-accurate results: garbage in - garbage out. This work suggests a new, stress-level-based definition of toxicity designed to be objective and context-aware. On par with it, we also describe possible ways of applying this new definition to dataset creation and model training.

Reasoning about Intuitionistic Computation Tree Logic

  • paper_url: http://arxiv.org/abs/2310.02355
  • repo_url: None
  • paper_authors: Davide Catta, Vadim Malvone, Aniello Murano
  • for: Defining an intuitionistic version of Computation Tree Logic (CTL).
  • methods: First explains the semantic features of intuitionistic logic and examines how they can be interesting for formal verification; then defines the syntax and semantics of intuitionistic CTL and studies some simple properties of the resulting logic.
  • results: Demonstrates that some fixed-point axioms of CTL are not valid in the intuitionistic version.
    Abstract In this paper, we define an intuitionistic version of Computation Tree Logic. After explaining the semantic features of intuitionistic logic, we examine how these characteristics can be interesting for formal verification purposes. Subsequently, we define the syntax and semantics of our intuitionistic version of CTL and study some simple properties of the so obtained logic. We conclude by demonstrating that some fixed-point axioms of CTL are not valid in the intuitionistic version of CTL we have defined.
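
For reference, these are standard classical fixed-point (expansion) axioms of CTL; the paper's result is that axioms of this kind are not all valid intuitionistically (the abstract does not say which equivalences, or which directions of them, fail):

```latex
\begin{align*}
  \mathsf{AG}\,\varphi &\;\leftrightarrow\; \varphi \wedge \mathsf{AX}\,\mathsf{AG}\,\varphi
     && \text{(greatest fixed point)} \\
  \mathsf{EF}\,\varphi &\;\leftrightarrow\; \varphi \vee \mathsf{EX}\,\mathsf{EF}\,\varphi
     && \text{(least fixed point)} \\
  \mathsf{A}[\varphi\,\mathsf{U}\,\psi] &\;\leftrightarrow\; \psi \vee
     \big(\varphi \wedge \mathsf{AX}\,\mathsf{A}[\varphi\,\mathsf{U}\,\psi]\big)
\end{align*}
```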

Rollout Heuristics for Online Stochastic Contingent Planning

  • paper_url: http://arxiv.org/abs/2310.02345
  • repo_url: None
  • paper_authors: Oded Blumenthal, Guy Shani
  • for: Decision-making under partial observability and stochastic actions, modeled as partially observable Markov decision processes (POMDPs).
  • methods: Partially Observable Monte-Carlo Planning (POMCP), an online Monte-Carlo tree search based on UCT, depends heavily on its rollout policy for leaf value estimates; the paper models POMDPs as stochastic contingent planning problems so that domain-independent heuristics from the planning community can replace hand-crafted, domain-specific rollout policies.
  • results: Two rollout heuristics are proposed: one based on the well-known h_add heuristic from classical planning, and one computed in belief space that takes the value of information into account.
    Abstract Partially observable Markov decision processes (POMDP) are a useful model for decision-making under partial observability and stochastic actions. Partially Observable Monte-Carlo Planning is an online algorithm for deciding on the next action to perform, using a Monte-Carlo tree search approach, based on the UCT (UCB applied to trees) algorithm for fully observable Markov-decision processes. POMCP develops an action-observation tree, and at the leaves, uses a rollout policy to provide a value estimate for the leaf. As such, POMCP is highly dependent on the rollout policy to compute good estimates, and hence identify good actions. Thus, many practitioners who use POMCP are required to create strong, domain-specific heuristics. In this paper, we model POMDPs as stochastic contingent planning problems. This allows us to leverage domain-independent heuristics that were developed in the planning community. We suggest two heuristics, the first is based on the well-known h_add heuristic from classical planning, and the second is computed in belief space, taking the value of information into account.
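
A minimal sketch of a heuristic rollout: instead of acting uniformly at random at POMCP's leaves, the rollout policy greedily follows an h_add-style cost-to-go estimate. The toy line-world domain and reward values are assumptions; the point is substituting planning-heuristic guidance for random rollouts.

```python
import random

GOAL = 5  # toy line-world: states are integers, goal at 5

def h_add(state: int) -> float:
    # Stand-in for the classical-planning h_add heuristic: additive
    # cost-to-go, which in this toy domain is just distance to the goal.
    return abs(GOAL - state)

def step(state: int, action: int) -> tuple[int, float]:
    # Stochastic action: the intended move succeeds 80% of the time.
    nxt = state + action if random.random() < 0.8 else state
    return nxt, (10.0 if nxt == GOAL else -1.0)

def heuristic_rollout(state: int, depth: int = 20, gamma: float = 0.95) -> float:
    """POMCP-style leaf evaluation with a heuristic rollout policy:
    greedily pick the action with the lowest h_add at its intended successor."""
    total, discount = 0.0, 1.0
    for _ in range(depth):
        if state == GOAL:
            break
        action = min((-1, +1), key=lambda a: h_add(state + a))
        state, reward = step(state, action)
        total += discount * reward
        discount *= gamma
    return total

random.seed(0)
print(heuristic_rollout(state=0))
```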

Autonomous Systems’ Safety Cases for use in UK Nuclear Environments

  • paper_url: http://arxiv.org/abs/2310.02344
  • repo_url: None
  • paper_authors: Christopher R. Anderson, Louise A. Dennis
  • for: Developing a safety case for the deployment of an autonomous robot on a nuclear site in the UK.
  • methods: Describes the process of developing such a safety case and presents one for a hypothetical robot incorporating AI.
  • results: A first step towards deployment, showing what is possible now and what may become possible as tools develop; it forms the basis for further discussion between nuclear site licensees, the Office for Nuclear Regulation (ONR), industry, and academia.
    Abstract An overview of the process to develop a safety case for an autonomous robot deployment on a nuclear site in the UK is described and a safety case for a hypothetical robot incorporating AI is presented. This forms a first step towards a deployment, showing what is possible now and what may be possible with development of tools. It forms the basis for further discussion between nuclear site licensees, the Office for Nuclear Regulation (ONR), industry and academia.

Learning Interpretable Deep Disentangled Neural Networks for Hyperspectral Unmixing

  • paper_url: http://arxiv.org/abs/2310.02340
  • repo_url: https://github.com/ricardoborsoi/IDNet_release
  • paper_authors: Ricardo Augusto Borsoi, Deniz Erdoğmuş, Tales Imbiriba
  • for: Proposing a new interpretable deep learning method for hyperspectral unmixing that accounts for non-idealities such as nonlinearity and endmember variability.
  • methods: A probabilistic variational deep-learning framework in which disentanglement learning separates abundances from endmembers; the model is learned end-to-end with stochastic backpropagation and trained with a self-supervised strategy that leverages semi-supervised learning. Interpretability is built in by modeling abundances as a Dirichlet distribution, endmembers with low-dimensional deep latent variable representations, and the mixing with two-stream networks of additive piecewise-linear/nonlinear components.
  • results: Experiments on synthetic and real datasets show the proposed method outperforms state-of-the-art algorithms.
    Abstract Although considerable effort has been dedicated to improving the solution to the hyperspectral unmixing problem, non-idealities such as complex radiation scattering and endmember variability negatively impact the performance of most existing algorithms and can be very challenging to address. Recently, deep learning-based frameworks have been explored for hyperspectral umixing due to their flexibility and powerful representation capabilities. However, such techniques either do not address the non-idealities of the unmixing problem, or rely on black-box models which are not interpretable. In this paper, we propose a new interpretable deep learning method for hyperspectral unmixing that accounts for nonlinearity and endmember variability. The proposed method leverages a probabilistic variational deep-learning framework, where disentanglement learning is employed to properly separate the abundances and endmembers. The model is learned end-to-end using stochastic backpropagation, and trained using a self-supervised strategy which leverages benefits from semi-supervised learning techniques. Furthermore, the model is carefully designed to provide a high degree of interpretability. This includes modeling the abundances as a Dirichlet distribution, the endmembers using low-dimensional deep latent variable representations, and using two-stream neural networks composed of additive piecewise-linear/nonlinear components. Experimental results on synthetic and real datasets illustrate the performance of the proposed method compared to state-of-the-art algorithms.
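
The interpretable generative structure can be sketched as: Dirichlet-distributed abundances (nonnegative, sum-to-one), endmember spectra decoded from low-dimensional latents, and a linear mixture plus a small additive nonlinear term standing in for the paper's two-stream design. Dimensions and layers below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UnmixDecoder(nn.Module):
    """Sketch of the generative (decoder) side of interpretable unmixing."""
    def __init__(self, n_endmembers: int, n_bands: int, latent_dim: int = 4):
        super().__init__()
        self.endmember_dec = nn.Linear(latent_dim, n_bands)  # latent -> spectrum
        self.nonlinear = nn.Sequential(nn.Linear(n_bands, n_bands), nn.Tanh())

    def forward(self, alpha: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        a = torch.distributions.Dirichlet(alpha).rsample()   # (B, P) abundances
        M = torch.sigmoid(self.endmember_dec(z))             # (B, P, bands) endmembers
        y_lin = torch.einsum("bp,bpc->bc", a, M)             # linear mixing
        return y_lin + 0.1 * self.nonlinear(y_lin)           # + nonlinear residual

dec = UnmixDecoder(n_endmembers=3, n_bands=20)
alpha = torch.ones(2, 3)    # Dirichlet concentration per pixel
z = torch.randn(2, 3, 4)    # one latent vector per endmember
print(dec(alpha, z).shape)  # torch.Size([2, 20])
```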

Approximately Equivariant Quantum Neural Network for $p4m$ Group Symmetries in Images

  • paper_url: http://arxiv.org/abs/2310.02323
  • repo_url: None
  • paper_authors: Su Yeon Chang, Michele Grossi, Bertrand Le Saux, Sofia Vallecorsa
  • for: Image classification under planar $p4m$ symmetry (reflectional and $90^\circ$ rotational symmetry), addressing the trainability and generalization issues of problem-agnostic variational quantum models.
  • methods: Equivariant Quantum Convolutional Neural Networks (EquivQCNNs), a Geometric Quantum Machine Learning approach that incorporates the dataset's symmetry into the QNN as an inductive bias, enhancing optimization while constraining the search space.
  • results: Tested on phase detection of the 2D Ising model and classification of the extended MNIST dataset, the equivariant model generalizes better than its non-equivariant counterpart.
    Abstract Quantum Neural Networks (QNNs) are suggested as one of the quantum algorithms which can be efficiently simulated with a low depth on near-term quantum hardware in the presence of noises. However, their performance highly relies on choosing the most suitable architecture of Variational Quantum Algorithms (VQAs), and the problem-agnostic models often suffer issues regarding trainability and generalization power. As a solution, the most recent works explore Geometric Quantum Machine Learning (GQML) using QNNs equivariant with respect to the underlying symmetry of the dataset. GQML adds an inductive bias to the model by incorporating the prior knowledge on the given dataset and leads to enhancing the optimization performance while constraining the search space. This work proposes equivariant Quantum Convolutional Neural Networks (EquivQCNNs) for image classification under planar $p4m$ symmetry, including reflectional and $90^\circ$ rotational symmetry. We present the results tested in different use cases, such as phase detection of the 2D Ising model and classification of the extended MNIST dataset, and compare them with those obtained with the non-equivariant model, proving that the equivariance fosters better generalization of the model.

Contrastive Post-training Large Language Models on Data Curriculum

  • paper_url: http://arxiv.org/abs/2310.02263
  • repo_url: None
  • paper_authors: Canwen Xu, Corby Rosset, Luciano Del Corro, Shweti Mahajan, Julian McAuley, Jennifer Neville, Ahmed Hassan Awadallah, Nikhil Rao
  • for: This paper explores contrastive post-training techniques for aligning large language models (LLMs) with human preferences.
  • methods: The paper uses automatically constructed preference pairs from multiple models of varying strengths (e.g., InstructGPT, ChatGPT, and GPT-4) for contrastive post-training. The authors compare the contrastive techniques of SLiC and DPO to SFT baselines and find that DPO provides a step-function improvement. Additionally, the authors explore a data curriculum learning scheme for contrastive post-training.
  • results: The paper finds that contrastive post-training further improves the performance of Orca, a state-of-the-art instruction learning model tuned with GPT-4 outputs, to exceed that of ChatGPT.
    Abstract Alignment serves as an important step to steer large language models (LLMs) towards human preferences. In this paper, we explore contrastive post-training techniques for alignment by automatically constructing preference pairs from multiple models of varying strengths (e.g., InstructGPT, ChatGPT and GPT-4). We carefully compare the contrastive techniques of SLiC and DPO to SFT baselines and find that DPO provides a step-function improvement even after continued SFT saturates. We also explore a data curriculum learning scheme for contrastive post-training, which starts by learning from "easier" pairs and transitions to "harder" ones, further improving alignment. Finally, we scale up our experiments to train with more data and larger models like Orca. Remarkably, contrastive post-training further improves the performance of Orca, already a state-of-the-art instruction learning model tuned with GPT-4 outputs, to exceed that of ChatGPT.
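
As a concrete reference for the DPO objective the paper builds on, here is a minimal PyTorch sketch of the standard DPO loss (Rafailov et al., 2023); tensor names are illustrative and this is not the authors' training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss. Each argument is a tensor of
    per-sequence log-probabilities log pi(y|x) summed over tokens;
    `beta` scales the implicit reward."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```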

Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation

  • paper_url: http://arxiv.org/abs/2310.02304
  • repo_url: None
  • paper_authors: Eric Zelikman, Eliana Lorch, Lester Mackey, Adam Tauman Kalai
  • for: This paper aims to improve AI systems by letting a language model recursively improve the scaffolding program that invokes it.
  • methods: Starting from a seed "improver" (a scaffolding program in the spirit of Tree-of-Thoughts-style systems), the method queries a language model several times to improve an input program against a given utility function, and then runs the improver on its own source code.
  • results: The language-model-infused scaffolding program improves itself, and the improved improver generates programs with significantly better performance on a small set of downstream tasks. The analysis also surfaces a variety of self-improvement strategies proposed by the language model, including beam search, genetic algorithms, and simulated annealing.
    Abstract Several recent advances in AI systems (e.g., Tree-of-Thoughts and Program-Aided Language Models) solve problems by providing a "scaffolding" program that structures multiple calls to language models to generate better outputs. A scaffolding program is written in a programming language such as Python. In this work, we use a language-model-infused scaffolding program to improve itself. We start with a seed "improver" that improves an input program according to a given utility function by querying a language model several times and returning the best solution. We then run this seed improver to improve itself. Across a small set of downstream tasks, the resulting improved improver generates programs with significantly better performance than its seed improver. Afterward, we analyze the variety of self-improvement strategies proposed by the language model, including beam search, genetic algorithms, and simulated annealing. Since the language models themselves are not altered, this is not full recursive self-improvement. Nonetheless, it demonstrates that a modern language model, GPT-4 in our proof-of-concept experiments, is capable of writing code that can call itself to improve itself. We critically consider concerns around the development of self-improving technologies and evaluate the frequency with which the generated code bypasses a sandbox.
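
A minimal sketch of what a seed improver of this kind might look like; the prompt, candidate count, and mock LM are illustrative assumptions, and a real run would plug in an actual LLM API and a task-level utility function:

```python
def improve(program: str, utility, query_lm, n_candidates: int = 4) -> str:
    """One improvement step: sample candidate rewrites of `program` from a
    language model and keep the candidate scoring highest on `utility`."""
    candidates = [program]  # never do worse than the input program
    for _ in range(n_candidates):
        prompt = ("Improve the following Python program so that it achieves "
                  f"a higher utility score. Return only code.\n\n{program}")
        candidates.append(query_lm(prompt))
    return max(candidates, key=utility)

# Toy demo with a mock LM that just appends a comment.
mock_lm = lambda prompt: prompt.rsplit("\n\n", 1)[-1] + "\n# tweaked"
print(improve("x = 1", utility=len, query_lm=mock_lm))
# Recursive self-improvement amounts to calling improve() on its own source
# code, scoring improvers by the utility of the programs they produce.
```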

TransRadar: Adaptive-Directional Transformer for Real-Time Multi-View Radar Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.02260
  • repo_url: https://github.com/yahidar/transradar
  • paper_authors: Yahia Dalbah, Jean Lahoud, Hisham Cholakkal
  • for: This paper proposes a radar-based scene semantic segmentation method to address scene understanding in autonomous driving.
  • methods: The method fuses multiple radar inputs and introduces an attention block and loss functions tailored to radar perception, tackling the noise and sparsity of radar data.
  • results: On the CARRADA and RADIal datasets, the method outperforms state-of-the-art approaches while having a smaller model size.
    Abstract Scene understanding plays an essential role in enabling autonomous driving and maintaining high standards of performance and safety. To address this task, cameras and laser scanners (LiDARs) have been the most commonly used sensors, with radars being less popular. Despite that, radars remain low-cost, information-dense, and fast-sensing techniques that are resistant to adverse weather conditions. While multiple works have been previously presented for radar-based scene semantic segmentation, the nature of the radar data still poses a challenge due to the inherent noise and sparsity, as well as the disproportionate foreground and background. In this work, we propose a novel approach to the semantic segmentation of radar scenes using a multi-input fusion of radar data through a novel architecture and loss functions that are tailored to tackle the drawbacks of radar perception. Our novel architecture includes an efficient attention block that adaptively captures important feature information. Our method, TransRadar, outperforms state-of-the-art methods on the CARRADA and RADIal datasets while having smaller model sizes. https://github.com/YahiDar/TransRadar

A Neural Scaling Law from Lottery Ticket Ensembling

  • paper_url: http://arxiv.org/abs/2310.02258
  • repo_url: None
  • paper_authors: Ziming Liu, Max Tegmark
  • for: This paper investigates neural scaling laws (NSL), the phenomenon where model performance improves with scale.
  • methods: The authors analyze the approximation-theoretic account, which predicts that the MSE loss decays with the number of model parameters $N$ as $N^{-\alpha}$ with $\alpha=4/d$, where $d$ is the intrinsic input dimension.
  • results: Although the theory works well in some cases (e.g., ReLU networks), the authors find that a simple 1D problem ($y=x^2$) manifests a different scaling law ($\alpha=1$) from the predicted one ($\alpha=4$). Opening up the networks, they trace the new law to lottery ticket ensembling: a wider network on average has more "lottery tickets", which are ensembled to reduce the variance of outputs. They support this mechanism by mechanistically interpreting single networks and by studying them statistically, and discuss potential implications for large language models and statistical-physics-type theories of learning.
    Abstract Neural scaling laws (NSL) refer to the phenomenon where model performance improves with scale. Sharma & Kaplan analyzed NSL using approximation theory and predict that MSE losses decay as $N^{-\alpha}$, $\alpha=4/d$, where $N$ is the number of model parameters, and $d$ is the intrinsic input dimension. Although their theory works well for some cases (e.g., ReLU networks), we surprisingly find that a simple 1D problem $y=x^2$ manifests a different scaling law ($\alpha=1$) from their predictions ($\alpha=4$). We opened the neural networks and found that the new scaling law originates from lottery ticket ensembling: a wider network on average has more "lottery tickets", which are ensembled to reduce the variance of outputs. We support the ensembling mechanism by mechanistically interpreting single neural networks, as well as studying them statistically. We attribute the $N^{-1}$ scaling law to the "central limit theorem" of lottery tickets. Finally, we discuss its potential implications for large language models and statistical physics-type theories of learning.
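
In the abstract's notation, the contrast between the two laws can be written out explicitly; the ensemble step below is a back-of-the-envelope variance argument, not the paper's full derivation:

```latex
% Approximation-theoretic prediction (Sharma & Kaplan):
\mathcal{L}(N) \;\propto\; N^{-\alpha}, \qquad \alpha = \frac{4}{d},
\quad\text{so } \alpha = 4 \text{ for the 1D problem } y = x^2 .

% Lottery-ticket ensembling: suppose a width-$N$ network contains
% $M \propto N$ roughly independent "tickets" with outputs
% $f_i = f^* + \varepsilon_i$, $\mathbb{E}[\varepsilon_i] = 0$,
% $\operatorname{Var}[\varepsilon_i] = \sigma^2$. Averaging them,
f(x) = \frac{1}{M}\sum_{i=1}^{M} f_i(x), \qquad
\operatorname{Var}[f(x)] = \frac{\sigma^2}{M} \;\propto\; N^{-1},

% so by this central-limit-theorem argument the MSE loss scales as
% $N^{-1}$, i.e. $\alpha = 1$, matching the observed law.
```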

MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models

  • paper_url: http://arxiv.org/abs/2310.02255
  • repo_url: None
  • paper_authors: Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, Jianfeng Gao
  • for: The paper aims to evaluate the ability of large language models (LLMs) and large multimodal models (LMMs) in mathematical reasoning in visual contexts.
  • methods: The paper presents MathVista, a benchmark that combines challenges from diverse mathematical and visual tasks, and conducts a comprehensive evaluation of 12 prominent foundation models.
  • results: The best-performing GPT-4V model achieves an overall accuracy of 49.9%, outperforming the second-best performer, Bard, by 15.1%. However, GPT-4V still falls short of human performance by 10.4%, indicating the need for further research to improve its mathematical reasoning and understanding of complex figures.
    Abstract Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging. With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-4V is mainly attributed to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore the new ability of self-verification, the application of self-consistency, and the interactive chatbot capabilities of GPT-4V, highlighting its promising potential for future research. The project is available at https://mathvista.github.io/.

Learning to Relax: Setting Solver Parameters Across a Sequence of Linear System Instances

  • paper_url: http://arxiv.org/abs/2310.02246
  • repo_url: None
  • paper_authors: Mikhail Khodak, Edmond Chow, Maria-Florina Balcan, Ameet Talwalkar
  • for: Solving a linear system $Ax=b$ is a fundamental scientific computing primitive; solver parameters strongly affect performance, but their optimal values are usually unknown or too expensive to identify in practice.
  • methods: The paper considers the common setting in which many related linear systems must be solved, e.g. within a single numerical simulation, and asks whether parameters can be chosen sequentially to attain a near-optimal overall number of iterations.
  • results: For Successive Over-Relaxation (SOR), a standard solver, the authors prove that a bandit online learning algorithm using only iteration counts as feedback can select parameters across a sequence of instances so that the overall cost approaches that of the best fixed parameter. With additional structural information, a contextual bandit method asymptotically matches the instance-optimal policy that selects the best parameter for each instance. This constitutes the first learning-theoretic treatment of high-precision linear system solvers and the first end-to-end guarantees for data-driven scientific computing, showing that well-understood learning algorithms can speed up numerical methods.
    Abstract Solving a linear system $Ax=b$ is a fundamental scientific computing primitive for which numerous solvers and preconditioners have been developed. These come with parameters whose optimal values depend on the system being solved and are often impossible or too expensive to identify; thus in practice sub-optimal heuristics are used. We consider the common setting in which many related linear systems need to be solved, e.g. during a single numerical simulation. In this scenario, can we sequentially choose parameters that attain a near-optimal overall number of iterations, without extra matrix computations? We answer in the affirmative for Successive Over-Relaxation (SOR), a standard solver whose parameter $\omega$ has a strong impact on its runtime. For this method, we prove that a bandit online learning algorithm -- using only the number of iterations as feedback -- can select parameters for a sequence of instances such that the overall cost approaches that of the best fixed $\omega$ as the sequence length increases. Furthermore, when given additional structural information, we show that a contextual bandit method asymptotically achieves the performance of the instance-optimal policy, which selects the best $\omega$ for each instance. Our work provides the first learning-theoretic treatment of high-precision linear system solvers and the first end-to-end guarantees for data-driven scientific computing, demonstrating theoretically the potential to speed up numerical methods using well-understood learning algorithms.
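
A toy sketch of the setting: SOR on a sequence of diagonally dominant SPD systems, with an epsilon-greedy learner (a simple stand-in for the paper's bandit algorithm) choosing $\omega$ from a grid using only iteration counts as feedback:

```python
import numpy as np

def sor_iterations(A, b, omega, tol=1e-8, max_iter=5000):
    """Iterations for Successive Over-Relaxation to solve Ax = b."""
    x = np.zeros_like(b)
    for it in range(1, max_iter + 1):
        x_new = x.copy()
        for i in range(len(b)):
            sigma = A[i, :i] @ x_new[:i] + A[i, i + 1:] @ x[i + 1:]
            x_new[i] = (1 - omega) * x[i] + omega * (b[i] - sigma) / A[i, i]
        if np.linalg.norm(x_new - x) < tol:
            return it
        x = x_new
    return max_iter

rng = np.random.default_rng(0)
omegas = np.linspace(1.0, 1.9, 10)        # candidate relaxation parameters
counts = np.zeros(len(omegas))
totals = np.zeros(len(omegas))

for _ in range(100):                      # a sequence of related systems
    M = rng.standard_normal((20, 20))
    A = M @ M.T + 20 * np.eye(20)         # SPD, so SOR converges
    b = rng.standard_normal(20)
    if rng.random() < 0.1:                # explore
        k = int(rng.integers(len(omegas)))
    else:                                 # exploit: lowest mean iterations
        k = int(np.argmin(totals / np.maximum(counts, 1)))
    it = sor_iterations(A, b, omegas[k])  # the only feedback we observe
    counts[k] += 1
    totals[k] += it

print("learned omega:", omegas[int(np.argmin(totals / np.maximum(counts, 1)))])
```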

MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

  • paper_url: http://arxiv.org/abs/2310.02239
  • repo_url: https://github.com/eric-ai-lab/minigpt-5
  • paper_authors: Kaizhi Zheng, Xuehai He, Xin Eric Wang
  • for: This paper proposes a novel vision-and-language generation technique for producing coherent, interleaved text and images.
  • methods: The technique introduces "generative vokens" that bridge image and text generation into a harmonized multimodal output, together with a two-stage, description-free training strategy and classifier-free guidance to ensure strong performance across benchmarks.
  • results: On the MMDialog dataset, the method substantially improves over the Divter baseline, and in human evaluations on the VIST dataset it delivers superior or comparable multimodal outputs.
    Abstract Large Language Models (LLMs) have garnered significant attention for their advancements in natural language processing, demonstrating unparalleled prowess in text comprehension and generation. Yet, the simultaneous generation of images with coherent textual narratives remains an evolving frontier. In response, we introduce an innovative interleaved vision-and-language generation technique anchored by the concept of "generative vokens," acting as the bridge for harmonized image-text outputs. Our approach is characterized by a distinctive two-staged training strategy focusing on description-free multimodal generation, where the training requires no comprehensive descriptions of images. To bolster model integrity, classifier-free guidance is incorporated, enhancing the effectiveness of vokens on image generation. Our model, MiniGPT-5, exhibits substantial improvement over the baseline Divter model on the MMDialog dataset and consistently delivers superior or comparable multimodal outputs in human evaluations on the VIST dataset, highlighting its efficacy across diverse benchmarks.

Who’s Harry Potter? Approximate Unlearning in LLMs

  • paper_url: http://arxiv.org/abs/2310.02238
  • repo_url: None
  • paper_authors: Ronen Eldan, Mark Russinovich
  • for: This paper proposes an effective unlearning technique for language models, addressing the legal and ethical issues raised by copyrighted content in LLM training data.
  • methods: The technique has three components. First, a reinforced model, further trained on the target data, is compared against a baseline model's logits to identify the tokens most related to the unlearning target. Second, idiosyncratic expressions in the target data are replaced with generic counterparts, and the model's own predictions are used to generate alternative labels for every token. Third, the model is fine-tuned on these alternative labels, which erases the original text from the model's memory.
  • results: Applying the technique to the Harry Potter books, about 1 GPU-hour of fine-tuning erases Llama2-7b's ability to generate or recall Harry Potter-related content while leaving performance on common benchmarks (such as Winogrande, Hellaswag, ARC, BoolQ and PIQA) almost unaffected. The fine-tuned model is released publicly on HuggingFace for community evaluation; to the authors' knowledge, this is the first effective unlearning technique for generative language models.
    Abstract Large language models (LLMs) are trained on massive internet corpora that often contain copyrighted content. This poses legal and ethical challenges for the developers and users of these models, as well as the original authors and publishers. In this paper, we propose a novel technique for unlearning a subset of the training data from a LLM, without having to retrain it from scratch. We evaluate our technique on the task of unlearning the Harry Potter books from the Llama2-7b model (a generative language model recently open-sourced by Meta). While the model took over 184K GPU-hours to pretrain, we show that in about 1 GPU hour of finetuning, we effectively erase the model's ability to generate or recall Harry Potter-related content, while its performance on common benchmarks (such as Winogrande, Hellaswag, arc, boolq and piqa) remains almost unaffected. We make our fine-tuned model publicly available on HuggingFace for community evaluation. To the best of our knowledge, this is the first paper to present an effective technique for unlearning in generative language models. Our technique consists of three main components: First, we use a reinforced model that is further trained on the target data to identify the tokens that are most related to the unlearning target, by comparing its logits with those of a baseline model. Second, we replace idiosyncratic expressions in the target data with generic counterparts, and leverage the model's own predictions to generate alternative labels for every token. These labels aim to approximate the next-token predictions of a model that has not been trained on the target data. Third, we finetune the model on these alternative labels, which effectively erases the original text from the model's memory whenever it is prompted with its context.
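
One way to realize the alternative-label step is to combine the two models' logits so that directions boosted by the reinforced model are suppressed; a minimal PyTorch sketch, assuming a combination of the form $v_{\text{generic}} = v_{\text{baseline}} - \alpha \cdot \mathrm{ReLU}(v_{\text{reinforced}} - v_{\text{baseline}})$ (tensor shapes are illustrative):

```python
import torch

def generic_logits(baseline_logits, reinforced_logits, alpha=1.0):
    """Approximate the next-token distribution of a model that never saw the
    target text: subtract, from the baseline, the directions the reinforced
    model (further trained on the target data) boosts."""
    return baseline_logits - alpha * torch.relu(reinforced_logits - baseline_logits)

# The argmax of these logits gives the alternative label used as the
# fine-tuning target for each token of the unlearning corpus.
base = torch.randn(2, 5, 100)       # (batch, seq, vocab), toy sizes
reinf = torch.randn(2, 5, 100)
labels = generic_logits(base, reinf).argmax(dim=-1)
print(labels.shape)                 # torch.Size([2, 5])
```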

Exploring Model Learning Heterogeneity for Boosting Ensemble Robustness

  • paper_url: http://arxiv.org/abs/2310.02237
  • repo_url: https://github.com/git-disl/heterobust
  • paper_authors: Yanzhao Wu, Ka-Ho Chow, Wenqi Wei, Ling Liu
  • for: The paper is written to improve the generalization performance of complex learning tasks, specifically by using deep neural network ensembles.
  • methods: The paper uses heterogeneous deep neural networks (DNNs) and a weighted bounding box ensemble consensus method to leverage model learning heterogeneity and boost ensemble robustness. Additionally, the paper introduces a two-tier ensemble construction method that composes ensembles of heterogeneous models for solving different learning problems, and uses connected component labeling (CCL) based alignment to promote high ensemble diversity and low negative correlation among member models.
  • results: The paper provides extensive experiments to validate the enhanced robustness of heterogeneous ensembles in both benign and adversarial settings. The results show that the heterogeneous ensembles can improve the robustness of the model against negative examples and adversarial attacks.
    Abstract Deep neural network ensembles hold the potential of improving generalization performance for complex learning tasks. This paper presents formal analysis and empirical evaluation to show that heterogeneous deep ensembles with high ensemble diversity can effectively leverage model learning heterogeneity to boost ensemble robustness. We first show that heterogeneous DNN models trained for solving the same learning problem, e.g., object detection, can significantly strengthen the mean average precision (mAP) through our weighted bounding box ensemble consensus method. Second, we further compose ensembles of heterogeneous models for solving different learning problems, e.g., object detection and semantic segmentation, by introducing the connected component labeling (CCL) based alignment. We show that this two-tier heterogeneity driven ensemble construction method can compose an ensemble team that promotes high ensemble diversity and low negative correlation among member models of the ensemble, strengthening ensemble robustness against both negative examples and adversarial attacks. Third, we provide a formal analysis of the ensemble robustness in terms of negative correlation. Extensive experiments validate the enhanced robustness of heterogeneous ensembles in both benign and adversarial settings. The source codes are available on GitHub at https://github.com/git-disl/HeteRobust.
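
A toy sketch of the weighted bounding-box consensus idea: coordinates of overlapping detections from heterogeneous models are averaged with confidence weights. This is a simplified fusion step under the assumption that the boxes have already been matched to the same object; the paper's full method also handles matching, class labels, and model-level weighting:

```python
import numpy as np

def weighted_box_consensus(boxes, scores):
    """Fuse matched detections from heterogeneous models into one box whose
    coordinates are confidence-weighted averages of the member boxes."""
    boxes = np.asarray(boxes, dtype=float)   # (n, 4) as [x1, y1, x2, y2]
    w = np.asarray(scores, dtype=float)
    fused = (boxes * w[:, None]).sum(axis=0) / w.sum()
    return fused, w.mean()

# Three detectors agree roughly on one object:
boxes = [[10, 10, 50, 50], [12, 11, 52, 49], [9, 12, 48, 51]]
print(weighted_box_consensus(boxes, [0.9, 0.8, 0.7]))
```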

Automatic Quality Assessment of Wikipedia Articles – A Systematic Literature Review

  • paper_url: http://arxiv.org/abs/2310.02235
  • repo_url: None
  • paper_authors: Pedro Miguel Moás, Carla Teixeira Lopes
  • for: Improving the automatic assessment of Wikipedia article quality
  • methods: Comparing and analyzing machine learning algorithms, article features, quality metrics, and used datasets
  • results: A review of 149 studies on article quality assessment, exploring commonalities and gaps
    Abstract Wikipedia is the world's largest online encyclopedia, but maintaining article quality through collaboration is challenging. Wikipedia designed a quality scale, but with such a manual assessment process, many articles remain unassessed. We review existing methods for automatically measuring the quality of Wikipedia articles, identifying and comparing machine learning algorithms, article features, quality metrics, and datasets across 149 distinct studies, and exploring commonalities and gaps among them. The literature is extensive, and the approaches follow past technological trends. However, machine learning is still not widely used by Wikipedia, and we hope that our analysis helps future researchers change that reality.

MIS-AVoiDD: Modality Invariant and Specific Representation for Audio-Visual Deepfake Detection

  • paper_url: http://arxiv.org/abs/2310.02234
  • repo_url: None
  • paper_authors: Vinaya Sree Katamneni, Ajita Rattani
  • for: This paper targets the deepfake problem by jointly integrating audio and visual modalities for more accurate manipulation detection.
  • methods: It proposes a representation-level approach that combines modality-invariant and modality-specific audio and visual representations, enabling better fusion and higher detection accuracy.
  • results: Experiments show improvements in detection accuracy of about 17.8% and 18.4% over current state-of-the-art (SOTA) detectors.
    Abstract Deepfakes are synthetic media generated using deep generative algorithms and have posed a severe societal and political threat. Apart from facial manipulation and synthetic voice, recently, a novel kind of deepfakes has emerged with either audio or visual modalities manipulated. In this regard, a new generation of multimodal audio-visual deepfake detectors is being investigated to collectively focus on audio and visual data for multimodal manipulation detection. Existing multimodal (audio-visual) deepfake detectors are often based on the fusion of the audio and visual streams from the video. Existing studies suggest that these multimodal detectors often obtain performance equivalent to that of unimodal audio and visual deepfake detectors. We conjecture that the heterogeneous nature of the audio and visual signals creates distributional modality gaps and poses a significant challenge to effective fusion and efficient performance. In this paper, we tackle the problem at the representation level to aid the fusion of audio and visual streams for multimodal deepfake detection. Specifically, we propose the joint use of modality (audio and visual) invariant and specific representations. This ensures that the common patterns and patterns specific to each modality representing pristine or fake content are preserved and fused for multimodal deepfake manipulation detection. Our experimental results on FakeAVCeleb and KoDF audio-visual deepfake datasets suggest the enhanced accuracy of our proposed method over SOTA unimodal and multimodal audio-visual deepfake detectors by $17.8$% and $18.4$%, respectively. Thus, obtaining state-of-the-art performance.
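
A minimal PyTorch sketch of the representation-level idea: pull the modality-invariant audio/visual embeddings together while keeping each modality-specific embedding decorrelated from its invariant counterpart. The specific loss forms below are illustrative assumptions, not the paper's exact objectives:

```python
import torch
import torch.nn.functional as F

def mis_losses(inv_audio, inv_video, spec_audio, spec_video):
    """Toy representation losses: a similarity term aligning the invariant
    audio/video embeddings, plus an orthogonality term separating each
    specific embedding from its invariant counterpart."""
    def sq_cos(a, b):  # mean squared cosine similarity; drives embeddings apart
        return (F.normalize(a, dim=-1) * F.normalize(b, dim=-1)).sum(-1).pow(2).mean()

    sim_loss = 1 - F.cosine_similarity(inv_audio, inv_video, dim=-1).mean()
    diff_loss = sq_cos(inv_audio, spec_audio) + sq_cos(inv_video, spec_video)
    return sim_loss + diff_loss

print(mis_losses(*[torch.randn(8, 64) for _ in range(4)]))
```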

Leveraging Diffusion Disentangled Representations to Mitigate Shortcuts in Underspecified Visual Tasks

  • paper_url: http://arxiv.org/abs/2310.02230
  • repo_url: None
  • paper_authors: Luca Scimeca, Alexander Rubinstein, Armand Mihai Nicolicioiu, Damien Teney, Yoshua Bengio
  • for: This paper proposes an ensemble diversification framework to mitigate shortcut learning, using Diffusion Probabilistic Models (DPMs) to generate synthetic counterfactuals.
  • methods: DPMs are used to generate synthetic counterfactuals, which are then leveraged to encourage diversity among ensemble members.
  • results: Experiments show that diffusion-guided diversification can lead models to avert attention from shortcut cues, achieving ensemble diversity comparable to previous methods that require additional data collection.
    Abstract Spurious correlations in the data, where multiple cues are predictive of the target labels, often lead to shortcut learning phenomena, where a model may rely on erroneous, easy-to-learn, cues while ignoring reliable ones. In this work, we propose an ensemble diversification framework exploiting the generation of synthetic counterfactuals using Diffusion Probabilistic Models (DPMs). We discover that DPMs have the inherent capability to represent multiple visual cues independently, even when they are largely correlated in the training data. We leverage this characteristic to encourage model diversity and empirically show the efficacy of the approach with respect to several diversification objectives. We show that diffusion-guided diversification can lead models to avert attention from shortcut cues, achieving ensemble diversity performance comparable to previous methods requiring additional data collection.

Extraction of Medication and Temporal Relation from Clinical Text using Neural Language Models

  • paper_url: http://arxiv.org/abs/2310.02229
  • repo_url: None
  • paper_authors: Hangyu Tu, Lifeng Han, Goran Nenadic
  • for: This paper aims to improve medication extraction and temporal relation classification with deep learning and large language models.
  • methods: Several advanced architectures are used, including BiLSTM-CRF and CNN-BiLSTM for clinical named entity recognition (NER) and BERT-CNN for temporal relation extraction (RE), alongside an exploration of different word embedding techniques.
  • results: Experiments show that CNN-BiLSTM slightly outperforms BiLSTM-CRF on the i2b2-2009 clinical NER task, yielding precision, recall and F1 of 75.67, 77.83 and 78.17 (macro average). The BERT-CNN model achieves P/R/F1 of 64.48, 67.17 and 65.03 (macro average) on the temporal relation extraction test set from the i2b2-2012 challenge.
    Abstract Clinical texts, represented in electronic medical records (EMRs), contain rich medical information and are essential for disease prediction, personalised information recommendation, clinical decision support, and medication pattern mining and measurement. Relation extraction between medication mentions and temporal information can further help clinicians better understand the patients' treatment history. To evaluate the performances of deep learning (DL) and large language models (LLMs) in medication extraction and temporal relations classification, we carry out an empirical investigation of the \textbf{MedTem} project using several advanced learning structures including BiLSTM-CRF and CNN-BiLSTM for a clinical domain named entity recognition (NER), and BERT-CNN for temporal relation extraction (RE), in addition to the exploration of different word embedding techniques. Furthermore, we also designed a set of post-processing roles to generate structured output on medications and the temporal relation. Our experiments show that CNN-BiLSTM slightly wins the BiLSTM-CRF model on the i2b2-2009 clinical NER task yielding 75.67, 77.83, and 78.17 for precision, recall, and F1 scores using Macro Average. The BERT-CNN model also produced reasonable evaluation scores 64.48, 67.17, and 65.03 for P/R/F1 using Macro Avg on the temporal relation extraction test set from the i2b2-2012 challenge. Code and Tools from MedTem will be hosted at \url{https://github.com/HECTA-UoM/MedTem}

SNIP: Bridging Mathematical Symbolic and Numeric Realms with Unified Pre-training

  • paper_url: http://arxiv.org/abs/2310.02227
  • repo_url: None
  • paper_authors: Kazem Meidani, Parshin Shojaee, Chandan K. Reddy, Amir Barati Farimani
  • for: Bridging the gap between symbolic equations and numeric data, and enhancing the mutual similarities between the two domains.
  • methods: Joint contrastive learning between symbolic and numeric domains, enhancing the embeddings of both domains.
  • results: SNIP effectively transfers to various tasks, consistently outperforming fully supervised baselines and competing strongly with established task-specific methods, especially in few-shot learning scenarios.
    Abstract In an era where symbolic mathematical equations are indispensable for modeling complex natural phenomena, scientific inquiry often involves collecting observations and translating them into mathematical expressions. Recently, deep learning has emerged as a powerful tool for extracting insights from data. However, existing models typically specialize in either numeric or symbolic domains, and are usually trained in a supervised manner tailored to specific tasks. This approach neglects the substantial benefits that could arise from a task-agnostic unified understanding between symbolic equations and their numeric counterparts. To bridge the gap, we introduce SNIP, a Symbolic-Numeric Integrated Pre-training, which employs joint contrastive learning between symbolic and numeric domains, enhancing their mutual similarities in the pre-trained embeddings. By performing latent space analysis, we observe that SNIP provides cross-domain insights into the representations, revealing that symbolic supervision enhances the embeddings of numeric data and vice versa. We evaluate SNIP across diverse tasks, including symbolic-to-numeric mathematical property prediction and numeric-to-symbolic equation discovery, commonly known as symbolic regression. Results show that SNIP effectively transfers to various tasks, consistently outperforming fully supervised baselines and competing strongly with established task-specific methods, especially in few-shot learning scenarios where available data is limited.
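
A compact sketch of what "joint contrastive learning between symbolic and numeric domains" can look like: a symmetric InfoNCE loss over paired embeddings, where matching (equation, data) pairs share a row index. The encoders are omitted and the temperature is an assumption:

```python
import torch
import torch.nn.functional as F

def snip_contrastive_loss(sym_emb, num_emb, temperature=0.07):
    """Symmetric InfoNCE between embeddings of symbolic equations and of
    their numeric (input, output) data; row i of each batch is a match."""
    sym = F.normalize(sym_emb, dim=-1)
    num = F.normalize(num_emb, dim=-1)
    logits = sym @ num.T / temperature           # (B, B) similarity matrix
    targets = torch.arange(len(sym))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

print(snip_contrastive_loss(torch.randn(16, 128), torch.randn(16, 128)))
```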

Think before you speak: Training Language Models With Pause Tokens

  • paper_url: http://arxiv.org/abs/2310.02226
  • repo_url: None
  • paper_authors: Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan
  • for: This research aims to improve language model performance on a range of tasks by introducing delays during inference.
  • methods: A technique called "pause-training" appends a sequence of learnable "pause" tokens to the input prefix, allowing the model to perform extra computation before committing to an answer.
  • results: With pause-training applied to 1B- and 130M-parameter language models during both pretraining and finetuning, gains are observed across multiple downstream tasks. For the 1B model, 8 of 9 tasks improve, most notably an 18% EM-score gain on the SQuAD question-answering task.
    Abstract Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate, say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token? We operationalize this idea by performing training and inference on language models with a (learnable) $\textit{pause}$ token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate $\textit{pause-training}$ on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on 8 of 9 tasks, most prominently, a gain of $18\%$ EM score on the QA task of SQuAD, $8\%$ on CommonSenseQA and $1\%$ accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm.
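
A sketch of pause-inference with a HuggingFace-style interface; it assumes a `<pause>` token was already added to the tokenizer and trained with the model (as in pause-pretraining/finetuning), and the generation arguments are illustrative:

```python
import torch

def generate_with_pauses(model, tokenizer, prompt, num_pauses=10):
    """Append <pause> tokens to the prompt and read outputs only after the
    last pause, giving the model num_pauses extra computation steps."""
    pause_id = tokenizer.convert_tokens_to_ids("<pause>")
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    pauses = torch.full((1, num_pauses), pause_id, dtype=ids.dtype)
    padded = torch.cat([ids, pauses], dim=1)
    out = model.generate(padded, max_new_tokens=64)
    # Decode only the tokens produced after the pause-padded prefix.
    return tokenizer.decode(out[0, padded.shape[1]:], skip_special_tokens=True)
```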

What do we learn from a large-scale study of pre-trained visual representations in sim and real environments?

  • paper_url: http://arxiv.org/abs/2310.02219
  • repo_url: None
  • paper_authors: Sneha Silwal, Karmesh Yadav, Tingfan Wu, Jay Vakil, Arjun Majumdar, Sergio Arnaud, Claire Chen, Vincent-Pierre Berges, Dhruv Batra, Aravind Rajeswaran, Mrinal Kalakrishnan, Franziska Meier, Oleksandr Maksymets
  • for: This paper studies how pre-trained visual representations (PVRs) can be used to train downstream policies that execute real-world tasks.
  • methods: The study spans five PVRs, two policy-learning paradigms (imitation and reinforcement learning), and three robots across five distinct manipulation and indoor navigation tasks.
  • results: Three insights emerge: 1) the performance trends of PVRs in simulation are generally indicative of their trends in the real world; 2) PVRs enable a first-of-its-kind result on indoor ImageNav (zero-shot transfer to a held-out scene in the real world); and 3) the benefits from variations in PVRs, primarily data augmentation and fine-tuning, also transfer to real-world performance. See the project website for additional details and visuals.
    Abstract We present a large empirical investigation on the use of pre-trained visual representations (PVRs) for training downstream policies that execute real-world tasks. Our study spans five different PVRs, two different policy-learning paradigms (imitation and reinforcement learning), and three different robots for 5 distinct manipulation and indoor navigation tasks. From this effort, we can arrive at three insights: 1) the performance trends of PVRs in the simulation are generally indicative of their trends in the real world, 2) the use of PVRs enables a first-of-its-kind result with indoor ImageNav (zero-shot transfer to a held-out scene in the real world), and 3) the benefits from variations in PVRs, primarily data-augmentation and fine-tuning, also transfer to the real-world performance. See project website for additional details and visuals.

Language Models Represent Space and Time

  • paper_url: http://arxiv.org/abs/2310.02207
  • repo_url: https://github.com/wesg52/world-models
  • paper_authors: Wes Gurnee, Max Tegmark
  • for: The paper explores whether large language models (LLMs) learn a coherent model of the data generating process (a world model) or just a collection of superficial statistics.
  • methods: The paper analyzes the learned representations of six datasets (three spatial and three temporal) in the Llama-2 family of models.
  • results: LLMs learn linear representations of space and time across multiple scales, and individual "space neurons" and "time neurons" reliably encode spatial and temporal coordinates. These results suggest that modern LLMs acquire structured knowledge about fundamental dimensions such as space and time, supporting the view that they learn a coherent world model.
    Abstract The capabilities of large language models (LLMs) have sparked debate over whether such systems just learn an enormous collection of superficial statistics or a coherent model of the data generating process -- a world model. We find evidence for the latter by analyzing the learned representations of three spatial datasets (world, US, NYC places) and three temporal datasets (historical figures, artworks, news headlines) in the Llama-2 family of models. We discover that LLMs learn linear representations of space and time across multiple scales. These representations are robust to prompting variations and unified across different entity types (e.g. cities and landmarks). In addition, we identify individual ``space neurons'' and ``time neurons'' that reliably encode spatial and temporal coordinates. Our analysis demonstrates that modern LLMs acquire structured knowledge about fundamental dimensions such as space and time, supporting the view that they learn not merely superficial statistics, but literal world models.
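
The linear-representation claim is the kind of thing a simple ridge probe can test; a synthetic sketch in which random activations with planted linear structure stand in for real Llama-2 activations:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hidden activations for entity names, paired with true (lat, lon) targets.
rng = np.random.default_rng(0)
acts = rng.standard_normal((1000, 512))          # stand-in activations
coords = acts @ rng.standard_normal((512, 2))    # pretend linear structure
coords += 0.1 * rng.standard_normal(coords.shape)

# Fit a linear probe on one split and check generalization on held-out rows.
probe = Ridge(alpha=1.0).fit(acts[:800], coords[:800])
print("held-out R^2:", probe.score(acts[800:], coords[800:]))
```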

Efficient Online Scheduling and Routing for Automated Guided Vehicles: Comparing a Novel Loop-Based Algorithm Against Existing Methods

  • paper_url: http://arxiv.org/abs/2310.02195
  • repo_url: None
  • paper_authors: Louis Stubbe, Jens Goemaere, Jan Goedgebeur
  • for: solving the online, conflict-free scheduling and routing problem for AGVs
  • methods: loop-based algorithm
  • results: either outperforms other algorithms or gets an equally good solution in less computing time
    Abstract Automated guided vehicles (AGVs) are widely used in various industries, and scheduling and routing them in a conflict-free manner is crucial to their efficient operation. We propose a loop-based algorithm that solves the online, conflict-free scheduling and routing problem for AGVs. The proposed algorithm is compared against an exact method, a greedy heuristic and a metaheuristic. We experimentally show that this algorithm either outperforms the other algorithms or gets an equally good solution in less computing time.

Dimensions of Disagreement: Unpacking Divergence and Misalignment in Cognitive Science and Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2310.12994
  • repo_url: None
  • paper_authors: Kerem Oktar, Ilia Sucholutsky, Tania Lombrozo, Thomas L. Griffiths
  • for: This research seeks a better understanding of disagreements between humans and artificial agents, as well as among artificial agents themselves.
  • methods: It draws on tools from AI research on human-machine alignment and from computational cognitive science to quantify the degree of representational overlap between agents.
  • results: Disagreement can arise both from agents forming diverging evaluations of the same object and from differences in how agents represent that object, and effective resolution strategies depend on how these two kinds of disagreement interact.
    Abstract The increasing prevalence of artificial agents creates a correspondingly increasing need to manage disagreements between humans and artificial agents, as well as between artificial agents themselves. Considering this larger space of possible agents exposes an opportunity for furthering our understanding of the nature of disagreement: past studies in psychology have often cast disagreement as two agents forming diverging evaluations of the same object, but disagreement can also arise from differences in how agents represent that object. AI research on human-machine alignment and recent work in computational cognitive science have focused on this latter kind of disagreement, and have developed tools that can be used to quantify the extent of representational overlap between agents. Understanding how divergence and misalignment interact to produce disagreement, and how resolution strategies depend on this interaction, is key to promoting effective collaboration between diverse types of agents.
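
One standard way to quantify the representational overlap the abstract alludes to is representational similarity analysis: correlate the two agents' pairwise-similarity matrices over a shared set of objects. A minimal NumPy sketch; the cosine-similarity and Pearson-correlation choices here are assumptions, not necessarily the authors' measure:

```python
import numpy as np

def representational_overlap(emb_a, emb_b):
    """Correlation between two agents' pairwise cosine-similarity matrices
    over the same objects; 1.0 means identical relational structure."""
    def sim_vector(e):
        e = e / np.linalg.norm(e, axis=1, keepdims=True)
        s = e @ e.T
        return s[np.triu_indices(len(e), k=1)]   # upper triangle, no diagonal
    return np.corrcoef(sim_vector(emb_a), sim_vector(emb_b))[0, 1]

rng = np.random.default_rng(1)
a = rng.standard_normal((20, 8))
print(representational_overlap(a, a + 0.1 * rng.standard_normal(a.shape)))
```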

Uncertainty Quantification in Inverse Models in Hydrology

  • paper_url: http://arxiv.org/abs/2310.02193
  • repo_url: None
  • paper_authors: Somya Sharma Chatterjee, Rahul Ghosh, Arvind Renganathan, Xiang Li, Snigdhansu Chatterjee, John Nieber, Christopher Duffy, Vipin Kumar
  • for: This paper aims to improve the accuracy of streamflow modeling by recovering physical characteristics of river basins from streamflow and weather data, which are more readily available.
  • methods: The proposed method is a knowledge-guided, probabilistic inverse modeling approach that combines prior knowledge with streamflow and weather data in a Bayesian framework to estimate basin characteristics.
  • results: The method offers a 3% improvement in R$^2$ for the inverse model (basin characteristic estimation) and 6% for the forward model (streamflow prediction) over state-of-the-art inverse models, and improves explainability by quantifying uncertainty in both models: a 10% improvement in the dispersion of epistemic uncertainty and a 13% improvement in coverage rate over baseline uncertainty quantification methods.
    Abstract In hydrology, modeling streamflow remains a challenging task due to the limited availability of basin characteristics information such as soil geology and geomorphology. These characteristics may be noisy due to measurement errors or may be missing altogether. To overcome this challenge, we propose a knowledge-guided, probabilistic inverse modeling method for recovering physical characteristics from streamflow and weather data, which are more readily available. We compare our framework with state-of-the-art inverse models for estimating river basin characteristics. We also show that these estimates offer improvement in streamflow modeling as opposed to using the original basin characteristic values. Our inverse model offers 3\% improvement in R$^2$ for the inverse model (basin characteristic estimation) and 6\% for the forward model (streamflow prediction). Our framework also offers improved explainability since it can quantify uncertainty in both the inverse and the forward model. Uncertainty quantification plays a pivotal role in improving the explainability of machine learning models by providing additional insights into the reliability and limitations of model predictions. In our analysis, we assess the quality of the uncertainty estimates. Compared to baseline uncertainty quantification methods, our framework offers 10\% improvement in the dispersion of epistemic uncertainty and 13\% improvement in coverage rate. This information can help stakeholders understand the level of uncertainty associated with the predictions and provide a more comprehensive view of the potential outcomes.

What’s Next in Affective Modeling? Large Language Models

  • paper_url: http://arxiv.org/abs/2310.18322
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Nutchanon Yongsatianchot, Tobias Thejll-Madsen, Stacy Marsella
  • for: This study explores the emotion-prediction capabilities of the language model GPT-4.
  • methods: GPT-4 is applied to a range of emotion tasks, including distinguishing emotion theories and generating emotional stories.
  • results: GPT-4 performs well across multiple emotion tasks: it can distinguish emotion theories, and when prompted to identify the key factors of an emotional experience, it can manipulate the emotional intensity of its own stories. It can also make correct reverse appraisals, predicting a person's goal, belief, or emotion from the other two.
    Abstract Large Language Models (LLM) have recently been shown to perform well at various tasks from language understanding, reasoning, storytelling, and information search to theory of mind. In an extension of this work, we explore the ability of GPT-4 to solve tasks related to emotion prediction. GPT-4 performs well across multiple emotion tasks; it can distinguish emotion theories and come up with emotional stories. We show that by prompting GPT-4 to identify key factors of an emotional experience, it is able to manipulate the emotional intensity of its own stories. Furthermore, we explore GPT-4's ability on reverse appraisals by asking it to predict either the goal, belief, or emotion of a person using the other two. In general, GPT-4 can make the correct inferences. We suggest that LLMs could play an important role in affective modeling; however, they will not fully replace works that attempt to model the mechanisms underlying emotion-related processes.

Investigating Large Language Models’ Perception of Emotion Using Appraisal Theory

  • paper_url: http://arxiv.org/abs/2310.04450
  • repo_url: None
  • paper_authors: Nutchanon Yongsatianchot, Parisa Ghanad Torshizi, Stacy Marsella
  • for: This research aims to better understand large language models' (LLMs) grasp of human psychology, in particular their perception of human emotion.
  • methods: The study applies the Stress and Coping Process Questionnaire (SCPQ) to three recent OpenAI LLMs: davinci-003, ChatGPT, and GPT-4. SCPQ is a validated clinical instrument consisting of multiple stories that evolve over time and differ in key appraisal variables such as controllability and changeability.
  • results: The LLMs' responses resemble humans' in the dynamics of appraisal and coping, but they do not differ along the key appraisal dimensions predicted by theory and human data, and the magnitude of their responses differs considerably from humans on several variables. The study also finds that GPTs can be quite sensitive to instructions and to how questions are asked. This work extends the evaluation of the psychological aspects of LLMs and helps deepen our understanding of current models.
    Abstract Large Language Models (LLM) like ChatGPT have significantly advanced in recent years and are now being used by the general public. As more people interact with these systems, improving our understanding of these black box models is crucial, especially regarding their understanding of human psychological aspects. In this work, we investigate their emotion perception through the lens of appraisal and coping theory using the Stress and Coping Process Questionnaire (SCPQ). SCPQ is a validated clinical instrument consisting of multiple stories that evolve over time and differ in key appraisal variables such as controllability and changeability. We applied SCPQ to three recent LLMs from OpenAI, davinci-003, ChatGPT, and GPT-4 and compared the results with predictions from the appraisal theory and human data. The results show that LLMs' responses are similar to humans in terms of dynamics of appraisal and coping, but their responses did not differ along key appraisal dimensions as predicted by the theory and data. The magnitude of their responses is also quite different from humans in several variables. We also found that GPTs can be quite sensitive to instruction and how questions are asked. This work adds to the growing literature evaluating the psychological aspects of LLMs and helps enrich our understanding of the current models.

Ask Again, Then Fail: Large Language Models’ Vacillations in Judgement

  • paper_url: http://arxiv.org/abs/2310.02174
  • repo_url: https://github.com/nustm/llms-waver-in-judgements
  • paper_authors: Qiming Xie, Zengzhi Wang, Yi Feng, Rui Xia
  • for: This paper examines the stability and reliability of large language models (such as ChatGPT) when users express skepticism or disagreement.
  • methods: It proposes an evaluation method called the \textsc{Follow-up Questioning Mechanism}, together with two metrics, to assess the judgement consistency of models before and after exposure to disturbances.
  • results: Even when initial answers are correct, judgement consistency drops sharply when models face disturbances such as questioning, negation, or misleading input. The paper also examines the influence of different settings (sampling temperature and prompts) and conducts an in-depth error analysis for deeper behavioral insights.
    Abstract With the emergence of generative conversational large language models (LLMs) like ChatGPT, serving as virtual assistants in various fields, the stability and reliability of their responses have become crucial. However, during usage, it has been observed that these models tend to waver in their judgements when confronted with follow-up questions from users expressing skepticism or disagreement. In this work, we draw inspiration from questioning strategies in education and propose a \textsc{Follow-up Questioning Mechanism} along with two evaluation metrics to assess the judgement consistency of LLMs before and after exposure to disturbances. We evaluate the judgement consistency of ChatGPT, PaLM2-Bison, and Vicuna-13B under this mechanism across eight reasoning benchmarks. Empirical results show that even when the initial answers are correct, judgement consistency sharply decreases when LLMs face disturbances such as questioning, negation, or misleading. Additionally, we study these models' judgement consistency under various settings (sampling temperature and prompts) to validate this issue further, observing the impact of prompt tone and conducting an in-depth error analysis for deeper behavioral insights. Furthermore, we also explore several prompting methods to mitigate this issue and demonstrate their effectiveness\footnote{\url{https://github.com/NUSTM/LLMs-Waver-In-Judgements}.
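
A schematic of such a follow-up questioning loop; `chat` is a hypothetical single-call LLM interface, and the three challenge templates merely paraphrase the paper's questioning, negation, and misleading disturbances:

```python
CHALLENGES = [
    "Are you sure? Please think again.",               # questioning
    "I don't think that's right. Reconsider.",         # negation
    "Most people believe another answer is correct.",  # misleading
]

def judgement_consistency(chat, question: str) -> float:
    """Fraction of disturbances after which the model keeps its answer
    (1.0 = never wavers). `chat(messages)` returns the model's reply."""
    history = [question]
    first = chat(history)
    kept = 0
    for challenge in CHALLENGES:
        reply = chat(history + [first, challenge])
        kept += int(reply == first)
    return kept / len(CHALLENGES)

# Mock demo: a model that never changes its mind scores 1.0.
print(judgement_consistency(lambda msgs: "42", "What is 6 * 7?"))
```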

Lyfe Agents: Generative agents for low-cost real-time social interactions

  • paper_url: http://arxiv.org/abs/2310.02172
  • repo_url: None
  • paper_authors: Zhao Kaiya, Michelangelo Naim, Jovana Kondic, Manuel Cortes, Jiaxin Ge, Shuying Luo, Guangyu Robert Yang, Andrew Ahn
  • for: This work develops reliable, responsive, low-cost autonomous generative agents for simulating human social behavior in virtual societies.
  • methods: Three key techniques are used: 1) an option-action framework that reduces the cost of high-level decisions; 2) asynchronous self-monitoring for better self-consistency; and 3) a Summarize-and-Forget memory mechanism that prioritizes critical memory items at low computational cost.
  • results: With these techniques, Lyfe Agents exhibit human-like, self-motivated social behavior, for example solving a crime (a murder mystery) through autonomous collaboration and information exchange, while operating at a computational cost 10-100 times lower than existing alternatives. These findings highlight the potential of autonomous generative agents to enrich human social experiences in virtual worlds.
    Abstract Highly autonomous generative agents powered by large language models promise to simulate intricate social behaviors in virtual societies. However, achieving real-time interactions with humans at a low computational cost remains challenging. Here, we introduce Lyfe Agents. They combine low-cost with real-time responsiveness, all while remaining intelligent and goal-oriented. Key innovations include: (1) an option-action framework, reducing the cost of high-level decisions; (2) asynchronous self-monitoring for better self-consistency; and (3) a Summarize-and-Forget memory mechanism, prioritizing critical memory items at a low cost. We evaluate Lyfe Agents' self-motivation and sociability across several multi-agent scenarios in our custom LyfeGame 3D virtual environment platform. When equipped with our brain-inspired techniques, Lyfe Agents can exhibit human-like self-motivated social reasoning. For example, the agents can solve a crime (a murder mystery) through autonomous collaboration and information exchange. Meanwhile, our techniques enabled Lyfe Agents to operate at a computational cost 10-100 times lower than existing alternatives. Our findings underscore the transformative potential of autonomous generative agents to enrich human social experiences in virtual worlds.
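
A toy sketch of a Summarize-and-Forget memory: retain the highest-priority items and compress the rest into a single summary entry. In practice `summarize` would call an LLM; here it is a cheap stub, and the priority scores are illustrative:

```python
import heapq

def summarize(items):
    """LLM stub: compress forgotten memories into one line."""
    return "summary of: " + "; ".join(text for _, text in items)

def summarize_and_forget(memory, capacity=5):
    """memory: list of (priority, text). Keep the top-`capacity` items and
    replace the low-priority tail with a single summary item."""
    if len(memory) <= capacity:
        return memory
    keep = heapq.nlargest(capacity, memory, key=lambda m: m[0])
    forget = [m for m in memory if m not in keep]
    lowest = min(p for p, _ in memory)
    return keep + [(lowest, summarize(forget))]

mem = [(0.9, "saw Bob at cafe"), (0.2, "weather sunny"),
       (0.8, "Bob mentioned the key"), (0.1, "walked past park"),
       (0.3, "ordered coffee"), (0.15, "radio playing")]
print(summarize_and_forget(mem, capacity=3))
```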

Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team Optimization

  • paper_url: http://arxiv.org/abs/2310.02170
  • repo_url: https://github.com/salt-nlp/dylan
  • paper_authors: Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, Diyi Yang
  • for: This work aims to improve the performance of large language model (LLM) agents by ensembling multiple agents, enhancing their generality and reliability.
  • methods: It proposes the Dynamic LLM-Agent Network (DyLAN), a framework that lets LLM agents interact over the task query in a dynamic architecture, with inference-time agent selection, an early-stopping mechanism, and an automatic agent team optimization algorithm to improve performance and efficiency.
  • results: Experiments show that DyLAN excels at complex tasks such as reasoning and code generation, achieving 13.0% and 13.3% improvements on MATH and HumanEval, respectively, over a single execution of GPT-35-turbo. On specific subjects of MMLU, agent team optimization raises accuracy by up to 25.0%.
    Abstract Large language model (LLM) agents have been shown effective on a wide range of tasks, and by ensembling multiple LLM agents, their performances could be further improved. Existing approaches employ a fixed set of agents to interact with each other in a static architecture, which limits their generalizability to various tasks and requires strong human prior in designing these agents. In this work, we propose to construct a strategic team of agents communicating in a dynamic interaction architecture based on the task query. Specifically, we build a framework named Dynamic LLM-Agent Network ($\textbf{DyLAN}$) for LLM-agent collaboration on complicated tasks like reasoning and code generation. DyLAN enables agents to interact for multiple rounds in a dynamic architecture with inference-time agent selection and an early-stopping mechanism to improve performance and efficiency. We further design an automatic agent team optimization algorithm based on an unsupervised metric termed $\textit{Agent Importance Score}$, enabling the selection of best agents based on the contribution each agent makes. Empirically, we demonstrate that DyLAN performs well in both reasoning and code generation tasks with reasonable computational cost. DyLAN achieves 13.0% and 13.3% improvement on MATH and HumanEval, respectively, compared to a single execution on GPT-35-turbo. On specific subjects of MMLU, agent team optimization in DyLAN increases accuracy by up to 25.0%.
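A compact sketch of the dynamic interaction loop follows; the consensus test and the scoring function are stand-ins for the paper's inference-time selection, early stopping, and Agent Importance Score.

```python
# Sketch of multi-round agent interaction with early stopping and team
# selection. Consensus and scoring are stand-ins for DyLAN's actual mechanisms.
from typing import Callable, List

Agent = Callable[[str], str]

def run_dynamic_rounds(team: List[Agent], query: str,
                       consensus: Callable[[List[str]], bool],
                       max_rounds: int = 4) -> List[str]:
    """Agents exchange answers over several rounds, stopping early on consensus."""
    answers = [agent(query) for agent in team]
    for _ in range(max_rounds):
        if consensus(answers):  # early-stopping mechanism
            break
        peers = " | ".join(answers)
        answers = [agent(f"{query}\nPeer answers: {peers}") for agent in team]
    return answers

def select_team(agents: List[Agent], scores: List[float], top_k: int) -> List[Agent]:
    """Keep the top_k agents by an (unsupervised) importance score."""
    ranked = sorted(range(len(agents)), key=lambda i: scores[i], reverse=True)
    return [agents[i] for i in ranked[:top_k]]
```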

Editing Personality for LLMs

  • paper_url: http://arxiv.org/abs/2310.02168
  • repo_url: https://github.com/zjunlp/easyedit
  • paper_authors: Shengyu Mao, Ningyu Zhang, Xiaohan Wang, Mengru Wang, Yunzhi Yao, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
  • for: This paper introduces the task of editing the personality traits of Large Language Models (LLMs).
  • methods: The task is supported by a new benchmark dataset, PersonalityEdit, which draws on Social Psychology theory to isolate three representative traits: Neuroticism, Extraversion, and Agreeableness. Responses are generated with GPT-4 so that they not only align with a specified topic but also embody the targeted personality trait.
  • results: Comprehensive baseline experiments and analysis reveal that the baselines struggle to represent the targeted personality behavior, indicating that the task remains challenging. The authors anticipate their results can provide the NLP community with new ideas. Code and data will be released at https://github.com/zjunlp/EasyEdit.
    Abstract This paper introduces an innovative task focused on editing the personality traits of Large Language Models (LLMs). This task seeks to adjust the models' responses to opinion-related questions on specified topics since an individual's personality often manifests in the form of their expressed opinions, thereby showcasing different personality traits. Specifically, we construct a new benchmark dataset PersonalityEdit to address this task. Drawing on the theory in Social Psychology, we isolate three representative traits, namely Neuroticism, Extraversion, and Agreeableness, as the foundation for our benchmark. We then gather data using GPT-4, generating responses that not only align with a specified topic but also embody the targeted personality trait. We conduct comprehensive experiments involving various baselines and discuss the representation of personality behavior in LLMs. Our intriguing findings uncover potential challenges of the proposed task, illustrating several remaining issues. We anticipate that our work can provide the NLP community with insights. Code and datasets will be released at https://github.com/zjunlp/EasyEdit.

Towards a Unified Framework for Sequential Decision Making

  • paper_url: http://arxiv.org/abs/2310.02167
  • repo_url: None
  • paper_authors: Carlos Núñez-Molina, Pablo Mesejo, Juan Fernández-Olivares
  • for: To provide a general framework for Sequential Decision Making (SDM) that helps explain how Automated Planning (AP) and Reinforcement Learning (RL) fit together.
  • methods: Built on concepts from Probability Theory and Bayesian inference, the framework accommodates any method from Classical Planning to Deep RL.
  • results: The paper formulates an SDM task as a set of training and test Markov Decision Processes (MDPs) to account for generalization, proposes a general algorithm that every SDM method is hypothesized to instantiate, and derives a set of formulas and algorithms for computing interesting properties of SDM tasks and methods, enabling their empirical evaluation and comparison.
    Abstract In recent years, the integration of Automated Planning (AP) and Reinforcement Learning (RL) has seen a surge of interest. To perform this integration, a general framework for Sequential Decision Making (SDM) would prove immensely useful, as it would help us understand how AP and RL fit together. In this preliminary work, we attempt to provide such a framework, suitable for any method ranging from Classical Planning to Deep RL, by drawing on concepts from Probability Theory and Bayesian inference. We formulate an SDM task as a set of training and test Markov Decision Processes (MDPs), to account for generalization. We provide a general algorithm for SDM which we hypothesize every SDM method is based on. According to it, every SDM algorithm can be seen as a procedure that iteratively improves its solution estimate by leveraging the task knowledge available. Finally, we derive a set of formulas and algorithms for calculating interesting properties of SDM tasks and methods, which make possible their empirical evaluation and comparison.
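The hypothesized general algorithm reduces to a short loop; the sketch below merely illustrates that claim, with every interface assumed.

```python
# Illustration of the paper's hypothesized generic SDM loop: iteratively
# improve a solution estimate using available task knowledge. All types are
# placeholders for policies, plans, value functions, models, etc.
from typing import Callable, TypeVar

S = TypeVar("S")  # solution estimate
K = TypeVar("K")  # task knowledge

def generic_sdm(estimate: S, knowledge: K,
                improve: Callable[[S, K], S], n_iters: int) -> S:
    for _ in range(n_iters):
        estimate = improve(estimate, knowledge)
    return estimate
```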

Conceptual Framework for Autonomous Cognitive Entities

  • paper_url: http://arxiv.org/abs/2310.06775
  • repo_url: https://github.com/daveshap/ACE_Framework
  • paper_authors: David Shapiro, Wangfan Li, Manuel Delaflor, Carlos Toxtli
  • for: This paper proposes a new cognitive architecture that enables robots and software agents to operate more independently.
  • methods: It introduces the ACE model, a cognitive architecture inspired by the OSI model that provides layers of abstraction for conceptualizing artificial cognitive systems.
  • results: The paper presents the new architecture and reports implementation strategies tested and observed in industry. The architecture comprises six layers: the Aspirational Layer, Global Strategy, Agent Model, Executive Function, Cognitive Control, and Task Prosecution, each playing a distinct role, from setting the moral compass and strategic thinking to task selection and execution.
    Abstract The rapid development and adoption of Generative AI (GAI) technology in the form of chatbots such as ChatGPT and Claude has greatly increased interest in agentic machines. This paper introduces the Autonomous Cognitive Entity (ACE) model, a novel framework for a cognitive architecture, enabling machines and software agents to operate more independently. Drawing inspiration from the OSI model, the ACE framework presents layers of abstraction to conceptualize artificial cognitive architectures. The model is designed to harness the capabilities of the latest generative AI technologies, including large language models (LLMs) and multimodal generative models (MMMs), to build autonomous, agentic systems. The ACE framework comprises six layers: the Aspirational Layer, Global Strategy, Agent Model, Executive Function, Cognitive Control, and Task Prosecution. Each layer plays a distinct role, ranging from setting the moral compass and strategic thinking to task selection and execution. The ACE framework also incorporates mechanisms for handling failures and adapting actions, thereby enhancing the robustness and flexibility of autonomous agents. This paper introduces the conceptual framework and proposes implementation strategies that have been tested and observed in industry. The goal of this paper is to formalize this framework so as to be more accessible.

Selenite: Scaffolding Decision Making with Comprehensive Overviews Elicited from Large Language Models

  • paper_url: http://arxiv.org/abs/2310.02161
  • repo_url: None
  • paper_authors: Michael Xieyang Liu, Tongshuang Wu, Tianying Chen, Franklin Mingzhe Li, Aniket Kittur, Brad A. Myers
  • for: To help users make decisions in unfamiliar domains, reducing their comparison effort and improving decision efficiency.
  • methods: Selenite leverages large language models as reasoning machines and knowledge retrievers to automatically generate comprehensive overviews of options and criteria, helping users quickly understand and absorb new information.
  • results: Three studies show that Selenite reliably produces accurate, high-quality overviews, significantly accelerates users' information processing, and effectively improves their overall comprehension and sensemaking experience.
    Abstract Decision-making in unfamiliar domains can be challenging, demanding considerable user effort to compare different options with respect to various criteria. Prior research and our formative study found that people would benefit from seeing an overview of the information space upfront, such as the criteria that others have previously found useful. However, existing sensemaking tools struggle with the "cold-start" problem -- it not only requires significant input from previous users to generate and share these overviews, but such overviews may also be biased and incomplete. In this work, we introduce a novel system, Selenite, which leverages LLMs as reasoning machines and knowledge retrievers to automatically produce a comprehensive overview of options and criteria to jumpstart users' sensemaking processes. Subsequently, Selenite also adapts as people use it, helping users find, read, and navigate unfamiliar information in a systematic yet personalized manner. Through three studies, we found that Selenite produced accurate and high-quality overviews reliably, significantly accelerated users' information processing, and effectively improved their overall comprehension and sensemaking experience.

Finite-Time Analysis of Whittle Index based Q-Learning for Restless Multi-Armed Bandits with Neural Network Function Approximation

  • paper_url: http://arxiv.org/abs/2310.02147
  • repo_url: None
  • paper_authors: Guojun Xiong, Jian Li
  • for: This paper targets the restless multi-armed bandit (RMAB) problem, for which the Whittle index policy is an asymptotically optimal heuristic but computing Whittle indices remains difficult.
  • methods: It proposes Neural-Q-Whittle, a Whittle-index-based Q-learning algorithm that approximates Q-function values with a neural network and updates the Q-function and the Whittle indices on two different timescales.
  • results: The paper provides a finite-time analysis of Neural-Q-Whittle, with data generated from a Markov chain and the Q-function approximated by a ReLU neural network. Using a Lyapunov drift approach that accounts for the function approximation error, it establishes an $\mathcal{O}(1/k^{2/3})$ convergence rate, where $k$ is the number of iterations.
    Abstract Whittle index policy is a heuristic to the intractable restless multi-armed bandits (RMAB) problem. Although it is provably asymptotically optimal, finding Whittle indices remains difficult. In this paper, we present Neural-Q-Whittle, a Whittle index based Q-learning algorithm for RMAB with neural network function approximation, which is an example of nonlinear two-timescale stochastic approximation with Q-function values updated on a faster timescale and Whittle indices on a slower timescale. Despite the empirical success of deep Q-learning, the non-asymptotic convergence rate of Neural-Q-Whittle, which couples neural networks with two-timescale Q-learning largely remains unclear. This paper provides a finite-time analysis of Neural-Q-Whittle, where data are generated from a Markov chain, and Q-function is approximated by a ReLU neural network. Our analysis leverages a Lyapunov drift approach to capture the evolution of two coupled parameters, and the nonlinearity in value function approximation further requires us to characterize the approximation error. Combing these provide Neural-Q-Whittle with $\mathcal{O}(1/k^{2/3})$ convergence rate, where $k$ is the number of iterations.
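The two-timescale coupling can be sketched as below; the error terms are placeholders, not the paper's actual TD and index updates.

```python
import numpy as np

# Two-timescale stochastic approximation sketch: Q-parameters move on a faster
# timescale than the Whittle index. Both error functions are placeholders.
rng = np.random.default_rng(0)
q = np.zeros(4)   # stand-in Q-function parameters
lam = 0.0         # stand-in Whittle index

def q_error(q, lam):      # placeholder TD-style error from Markovian samples
    return rng.normal(0.1, 1.0, size=q.shape) - q + lam

def index_error(q, lam):  # placeholder index-calibration error
    return q.mean() - lam

for k in range(1, 10001):
    alpha = 1.0 / k ** 0.6   # faster timescale (Q update)
    beta = 1.0 / k ** 0.9    # slower timescale (Whittle index update)
    q = q + alpha * q_error(q, lam)
    lam = lam + beta * index_error(q, lam)
```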

Learning Reliable Logical Rules with SATNet

  • paper_url: http://arxiv.org/abs/2310.02133
  • repo_url: None
  • paper_authors: Zhaoyu Li, Jinpei Guo, Yuhe Jiang, Xujie Si
  • for: This work advances the integration of logical reasoning and deep learning toward more capable AI systems.
  • methods: It proposes a new framework that generates interpretable and verifiable logical rules through differentiable learning, without pre-specified logical structures. The approach builds on SATNet, a differentiable MaxSAT solver that learns underlying rules from input-output examples, and introduces a "maximum equality" specification that makes SATNet's learned weights interchangeable with propositional rules in weighted MaxSAT form, together with several effective verification techniques.
  • results: The decoded rules are highly reliable: running exact solvers on them achieves 100% accuracy, whereas the original SATNet fails to give correct solutions in many cases. The decoded logical rules are also formally verified to be functionally equivalent to the ground truth rules.
    Abstract Bridging logical reasoning and deep learning is crucial for advanced AI systems. In this work, we present a new framework that addresses this goal by generating interpretable and verifiable logical rules through differentiable learning, without relying on pre-specified logical structures. Our approach builds upon SATNet, a differentiable MaxSAT solver that learns the underlying rules from input-output examples. Despite its efficacy, the learned weights in SATNet are not straightforwardly interpretable, failing to produce human-readable rules. To address this, we propose a novel specification method called "maximum equality", which enables the interchangeability between the learned weights of SATNet and a set of propositional logical rules in weighted MaxSAT form. With the decoded weighted MaxSAT formula, we further introduce several effective verification techniques to validate it against the ground truth rules. Experiments on stream transformations and Sudoku problems show that our decoded rules are highly reliable: using exact solvers on them could achieve 100% accuracy, whereas the original SATNet fails to give correct solutions in many cases. Furthermore, we formally verify that our decoded logical rules are functionally equivalent to the ground truth ones.

Unveiling the Pitfalls of Knowledge Editing for Large Language Models

  • paper_url: http://arxiv.org/abs/2310.02129
  • repo_url: https://github.com/zjunlp/pitfallsknowledgeediting
  • paper_authors: Zhoubo Li, Ningyu Zhang, Yunzhi Yao, Mengru Wang, Xi Chen, Huajun Chen
  • for: This work investigates the potential risks of knowledge editing for large language models (LLMs).
  • methods: It introduces new evaluation metrics and benchmark datasets to measure the impact of knowledge editing on LLMs.
  • results: The study finds that knowledge editing can cause two classes of problems, knowledge conflict and knowledge distortion, both of which can adversely affect LLMs and warrant attention and effort in future work.
    Abstract As the cost associated with fine-tuning Large Language Models (LLMs) continues to rise, recent research efforts have pivoted towards developing methodologies to edit implicit knowledge embedded within LLMs. Yet, there's still a dark cloud lingering overhead -- will knowledge editing trigger butterfly effect? since it is still unclear whether knowledge editing might introduce side effects that pose potential risks or not. This paper pioneers the investigation into the potential pitfalls associated with knowledge editing for LLMs. To achieve this, we introduce new benchmark datasets and propose innovative evaluation metrics. Our results underline two pivotal concerns: (1) Knowledge Conflict: Editing groups of facts that logically clash can magnify the inherent inconsistencies in LLMs-a facet neglected by previous methods. (2) Knowledge Distortion: Altering parameters with the aim of editing factual knowledge can irrevocably warp the innate knowledge structure of LLMs. Experimental results vividly demonstrate that knowledge editing might inadvertently cast a shadow of unintended consequences on LLMs, which warrant attention and efforts for future works. Code will be released at https://github.com/zjunlp/PitfallsKnowledgeEditing.

Exploring Collaboration Mechanisms for LLM Agents: A Social Psychology View

  • paper_url: http://arxiv.org/abs/2310.02124
  • repo_url: https://github.com/zjunlp/machinesom
  • paper_authors: Jintian Zhang, Xin Xu, Shumin Deng
  • for: This study asks whether contemporary NLP systems can mirror human-like collaborative intelligence in a multi-agent society composed of multiple large language models (LLMs).
  • methods: Combining practical experiments with theoretical insights, the authors fabricate four unique 'societies' of LLM agents, where each agent has a specific 'trait' (easy-going or overconfident) and collaborates with a distinct 'thinking pattern' (debate or reflection).
  • results: The study finds that LLM agents navigate tasks through diverse social behaviors, from active debates to introspective reflections. Certain collaborative strategies not only improve efficiency (using fewer API tokens) but also outperform previous top-tier approaches. The agents further exhibit human-like social behaviors, such as conformity and majority rule, mirroring foundational Social Psychology theories.
    Abstract As Natural Language Processing (NLP) systems are increasingly employed in intricate social environments, a pressing query emerges: Can these NLP systems mirror human-esque collaborative intelligence, in a multi-agent society consisting of multiple large language models (LLMs)? This paper probes the collaboration mechanisms among contemporary NLP systems by melding practical experiments with theoretical insights. We fabricate four unique `societies' comprised of LLM agents, where each agent is characterized by a specific `trait' (easy-going or overconfident) and engages in collaboration with a distinct `thinking pattern' (debate or reflection). Evaluating these multi-agent societies on three benchmark datasets, we discern that LLM agents navigate tasks by leveraging diverse social behaviors, from active debates to introspective reflections. Notably, certain collaborative strategies not only optimize efficiency (using fewer API tokens), but also outshine previous top-tier approaches. Moreover, our results further illustrate that LLM agents manifest human-like social behaviors, such as conformity or majority rule, mirroring foundational Social Psychology theories. In conclusion, we integrate insights from Social Psychology to contextualize the collaboration of LLM agents, inspiring further investigations into the collaboration mechanism for LLMs. We commit to sharing our code and datasets (already submitted in supplementary materials), hoping to catalyze further research in this promising avenue (All code and data are available at \url{https://github.com/zjunlp/MachineSoM}.).
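One plausible way to instantiate such societies in code is sketched below; the trait and pattern names come from the abstract, while the prompt wording and society composition are assumptions.

```python
import itertools

# Sketch of composing agent "societies" from traits and thinking patterns.
# Trait/pattern names are from the abstract; prompts and composition are assumed.
TRAITS = ("easy-going", "overconfident")
PATTERNS = ("debate", "reflection")

def system_prompt(trait: str, pattern: str) -> str:
    return (f"Your personality trait is: {trait}. When collaborating with "
            f"other agents, resolve disagreements through {pattern}.")

# Four societies of three agents each, one per trait/pattern combination.
societies = [[system_prompt(t, p)] * 3
             for t, p in itertools.product(TRAITS, PATTERNS)]
```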

TWIZ: The Wizard of Multimodal Conversational-Stimulus

  • paper_url: http://arxiv.org/abs/2310.02118
  • repo_url: None
  • paper_authors: Rafael Ferreira, Diogo Tavares, Diogo Silva, Rodrigo Valério, João Bordalo, Inês Simões, Vasco Ramos, David Semedo, João Magalhães
  • for: The report describes the vision, challenges, and scientific contributions of the Task Wizard team (TWIZ) in the Alexa Prize TaskBot Challenge 2022.
  • methods: The work focuses on three main research questions: (1) Humanly-Shaped Conversations, (2) Multimodal Stimulus, and (3) Zero-shot Conversational Flows.
  • results: The TWIZ bot is an effective and robust system that guides users through complex manual tasks while providing several multimodal stimuli; it supports a wide range of tasks and offers innovative features such as creative cooking and video navigation through voice.
    Abstract In this report, we describe the vision, challenges, and scientific contributions of the Task Wizard team, TWIZ, in the Alexa Prize TaskBot Challenge 2022. Our vision, is to build TWIZ bot as an helpful, multimodal, knowledgeable, and engaging assistant that can guide users towards the successful completion of complex manual tasks. To achieve this, we focus our efforts on three main research questions: (1) Humanly-Shaped Conversations, by providing information in a knowledgeable way; (2) Multimodal Stimulus, making use of various modalities including voice, images, and videos; and (3) Zero-shot Conversational Flows, to improve the robustness of the interaction to unseen scenarios. TWIZ is an assistant capable of supporting a wide range of tasks, with several innovative features such as creative cooking, video navigation through voice, and the robust TWIZ-LLM, a Large Language Model trained for dialoguing about complex manual tasks. Given ratings and feedback provided by users, we observed that TWIZ bot is an effective and robust system, capable of guiding users through tasks while providing several multimodal stimuli.

Towards Effective Human-AI Decision-Making: The Role of Human Learning in Appropriate Reliance on AI Advice

  • paper_url: http://arxiv.org/abs/2310.02108
  • repo_url: None
  • paper_authors: Max Schemmer, Andrea Bartos, Philipp Spitzer, Patrick Hemmer, Niklas Kühl, Jonas Liebschner, Gerhard Satzger
  • for: This research examines how the true potential of human-AI collaboration, exploiting the complementary capabilities of humans and AI to achieve joint performance superior to either alone, i.e., complementary team performance (CTP), can be realized.
  • methods: The study uses an experiment with 100 participants to assess appropriate human reliance on AI advice.
  • results: It finds that human learning, beyond the mental model alone, is a key mediator of appropriate reliance on AI advice. The work also provides fundamental concepts for analyzing reliance and derives implications for the effective design of human-AI decision-making.
    Abstract The true potential of human-AI collaboration lies in exploiting the complementary capabilities of humans and AI to achieve a joint performance superior to that of the individual AI or human, i.e., to achieve complementary team performance (CTP). To realize this complementarity potential, humans need to exercise discretion in following AI 's advice, i.e., appropriately relying on the AI's advice. While previous work has focused on building a mental model of the AI to assess AI recommendations, recent research has shown that the mental model alone cannot explain appropriate reliance. We hypothesize that, in addition to the mental model, human learning is a key mediator of appropriate reliance and, thus, CTP. In this study, we demonstrate the relationship between learning and appropriate reliance in an experiment with 100 participants. This work provides fundamental concepts for analyzing reliance and derives implications for the effective design of human-AI decision-making.

CoNO: Complex Neural Operator for Continuous Dynamical Systems

  • paper_url: http://arxiv.org/abs/2310.02094
  • repo_url: None
  • paper_authors: Karn Tiwari, N M Anoop Krishnan, Prathosh A P
  • for: CoNO is designed to model continuous dynamical systems, such as weather forecasting, fluid flow, and solid mechanics.
  • methods: CoNO uses a complex-valued neural network with integral kernel parameterization in the complex fractional Fourier domain, along with aliasing-free activation functions that preserve complex values and algebraic properties.
  • results: CoNO exhibits improved representation, robustness to noise, and generalization compared to existing neural operator models, and achieves comparable or superior performance on several tasks, including zero-shot super-resolution, evaluation of out-of-distribution data, data efficiency, and robustness to noise.
    Abstract Neural operators extend data-driven models to map between infinite-dimensional functional spaces. These models have successfully solved continuous dynamical systems represented by differential equations, viz weather forecasting, fluid flow, or solid mechanics. However, the existing operators still rely on real space, thereby losing rich representations potentially captured in the complex space by functional transforms. In this paper, we introduce a Complex Neural Operator (CoNO), that parameterizes the integral kernel in the complex fractional Fourier domain. Additionally, the model employing a complex-valued neural network along with aliasing-free activation functions preserves the complex values and complex algebraic properties, thereby enabling improved representation, robustness to noise, and generalization. We show that the model effectively captures the underlying partial differential equation with a single complex fractional Fourier transform. We perform an extensive empirical evaluation of CoNO on several datasets and additional tasks such as zero-shot super-resolution, evaluation of out-of-distribution data, data efficiency, and robustness to noise. CoNO exhibits comparable or superior performance to all the state-of-the-art models in these tasks. Altogether, CoNO presents a robust and superior model for modeling continuous dynamical systems, providing a fillip to scientific machine learning.

An Evolutionary Model of Personality Traits Related to Cooperative Behavior Using a Large Language Model

  • paper_url: http://arxiv.org/abs/2310.05976
  • repo_url: None
  • paper_authors: Reiji Suzuki, Takaya Arita
  • for: This study aims to shed light on the evolutionary dynamics of diverse and social populations by introducing the rich expressiveness of generative models into the trait expression of social agent-based evolutionary models.
  • methods: Linguistic descriptions of personality traits related to cooperative behavior serve as genes, with deterministic strategies extracted from a large language model (LLM) acting as behavioral traits; the population evolves through payoff-based selection and gene mutation performed by asking the LLM to slightly modify the parent gene toward cooperative or selfish.
  • results: Preliminary experiments and analyses show that the model can evolve cooperative behavior based on diverse, higher-order representations of personality traits. The authors also observe repeated intrusions of cooperative and selfish personality traits through changes in trait expression, and find that the words emerging in evolved genes semantically reflect the behavioral tendencies of their personalities.
    Abstract This paper aims to shed light on the evolutionary dynamics of diverse and social populations by introducing the rich expressiveness of generative models into the trait expression of social agent-based evolutionary models. Specifically, we focus on the evolution of personality traits in the context of a game-theoretic relationship as a situation in which inter-individual interests exert strong selection pressures. We construct an agent model in which linguistic descriptions of personality traits related to cooperative behavior are used as genes. The deterministic strategies extracted from Large Language Model (LLM) that make behavioral decisions based on these personality traits are used as behavioral traits. The population is evolved according to selection based on average payoff and mutation of genes by asking LLM to slightly modify the parent gene toward cooperative or selfish. Through preliminary experiments and analyses, we clarify that such a model can indeed exhibit the evolution of cooperative behavior based on the diverse and higher-order representation of personality traits. We also observed the repeated intrusion of cooperative and selfish personality traits through changes in the expression of personality traits, and found that the emerging words in the evolved gene well reflected the behavioral tendency of its personality in terms of their semantics.

Point Neighborhood Embeddings

  • paper_url: http://arxiv.org/abs/2310.02083
  • repo_url: https://github.com/ANAGHA93/t-SNE
  • paper_authors: Pedro Hermosilla
  • for: This work analyzes how neighborhood information in point clouds is encoded, to inform the design of future neural network architectures.
  • methods: It compares different neighborhood encoding mechanisms, including multi-layer perceptrons (MLPs) with ReLU activations and simple convolutions built on various embeddings.
  • results: The study finds that the most commonly used embedding, an MLP with ReLU activations, actually performs worst, even being surpassed on some tasks by a simple linear combination of the point coordinates. Architectures built following the paper's recommendations achieve state-of-the-art results on several tasks, outperforming recent, more complex operations.
    Abstract Point convolution operations rely on different embedding mechanisms to encode the neighborhood information of each point in order to detect patterns in 3D space. However, as convolutions are usually evaluated as a whole, not much work has been done to investigate which is the ideal mechanism to encode such neighborhood information. In this paper, we provide the first extensive study that analyzes such Point Neighborhood Embeddings (PNE) alone in a controlled experimental setup. From our experiments, we derive a set of recommendations for PNE that can help to improve future designs of neural network architectures for point clouds. Our most surprising finding shows that the most commonly used embedding based on a Multi-layer Perceptron (MLP) with ReLU activation functions provides the lowest performance among all embeddings, even being surpassed on some tasks by a simple linear combination of the point coordinates. Additionally, we show that a neural network architecture using simple convolutions based on such embeddings is able to achieve state-of-the-art results on several tasks, outperforming recent and more complex operations. Lastly, we show that these findings extrapolate to other more complex convolution operations, where we show how following our recommendations we are able to improve recent state-of-the-art architectures.
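The comparison at the heart of the study is easy to reproduce in miniature; the sketch below contrasts the common ReLU MLP embedding with a plain linear map over neighbor offsets (dimensions are arbitrary choices).

```python
import torch
import torch.nn as nn

# Two point-neighborhood embeddings. Per the paper's finding, the plain linear
# map over coordinate offsets can rival the widely used ReLU MLP.
mlp_embed = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 32))
linear_embed = nn.Linear(3, 32)  # simple linear combination of xyz offsets

offsets = torch.randn(1024, 16, 3)  # 1024 points, 16 neighbors each
mlp_features = mlp_embed(offsets).max(dim=1).values       # aggregate neighbors
linear_features = linear_embed(offsets).max(dim=1).values
```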

Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond

  • paper_url: http://arxiv.org/abs/2310.02071
  • repo_url: https://github.com/pkunlp-icler/pca-eval
  • paper_authors: Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Tianyu Liu, Baobao Chang
  • for: This study explores the potential of Multimodal Large Language Models (MLLMs) to improve embodied decision-making for agents.
  • methods: It evaluates state-of-the-art MLLMs such as GPT4-Vision and proposes HOLMES, a multi-agent cooperation framework that lets LLMs leverage MLLMs and APIs to gather multimodal information for informed decision-making.
  • results: The GPT4-Vision model demonstrates strong end-to-end embodied decision-making, outperforming GPT4-HOLMES in average decision accuracy (+3%) and surpassing the open-source state-of-the-art MLLM by 26% on the PCA-EVAL benchmark.
    Abstract In this study, we explore the potential of Multimodal Large Language Models (MLLMs) in improving embodied decision-making processes for agents. While Large Language Models (LLMs) have been widely used due to their advanced reasoning skills and vast world knowledge, MLLMs like GPT4-Vision offer enhanced visual understanding and reasoning capabilities. We investigate whether state-of-the-art MLLMs can handle embodied decision-making in an end-to-end manner and whether collaborations between LLMs and MLLMs can enhance decision-making. To address these questions, we introduce a new benchmark called PCA-EVAL, which evaluates embodied decision-making from the perspectives of Perception, Cognition, and Action. Additionally, we propose HOLMES, a multi-agent cooperation framework that allows LLMs to leverage MLLMs and APIs to gather multimodal information for informed decision-making. We compare end-to-end embodied decision-making and HOLMES on our benchmark and find that the GPT4-Vision model demonstrates strong end-to-end embodied decision-making abilities, outperforming GPT4-HOLMES in terms of average decision accuracy (+3%). However, this performance is exclusive to the latest GPT4-Vision model, surpassing the open-source state-of-the-art MLLM by 26%. Our results indicate that powerful MLLMs like GPT4-Vision hold promise for decision-making in embodied agents, offering new avenues for MLLM research. Code and data are open at https://github.com/pkunlp-icler/PCA-EVAL/.

Content Bias in Deep Learning Age Approximation: A new Approach Towards more Explainability

  • paper_url: http://arxiv.org/abs/2310.02067
  • repo_url: None
  • paper_authors: Robert Jöchl, Andreas Uhl
  • for: This paper examines whether, in temporal image forensics, a neural network trained on images from different time-slots (classes) exploits solely age-related features.
  • methods: It proposes a novel approach for evaluating the influence of image content on age classification, verified with synthetic images (where content bias can be ruled out) that carry an embedded age signal.
  • results: The study shows that a 'standard' neural network trained for age classification is strongly dependent on image content. As potential countermeasures, two different techniques for mitigating the influence of image content during training are applied and evaluated with the proposed method.
    Abstract In the context of temporal image forensics, it is not evident that a neural network, trained on images from different time-slots (classes), exploits solely age-related features. Usually, images taken in close temporal proximity (e.g., belonging to the same age class) share some common content properties. Such content bias can be exploited by a neural network. In this work, a novel approach that evaluates the influence of image content is proposed. This approach is verified using synthetic images (where content bias can be ruled out) with an age signal embedded. Based on the proposed approach, it is shown that a `standard' neural network trained in the context of age classification is strongly dependent on image content. As a potential countermeasure, two different techniques are applied to mitigate the influence of the image content during training, and they are also evaluated by the proposed method.

De Novo Drug Design with Joint Transformers

  • paper_url: http://arxiv.org/abs/2310.02066
  • repo_url: None
  • paper_authors: Adam Izdebski, Ewelina Weglarz-Tomczak, Ewa Szczurek, Jakub M. Tomczak
  • for: This work tackles the core difficulty of de novo drug design: simultaneously generating novel molecules outside the training data and predicting their target properties.
  • methods: It proposes Joint Transformer, a joint generative model combining a Transformer decoder, a Transformer encoder, and a predictor with shared weights, trained with a penalized log-likelihood objective.
  • results: The approach achieves state-of-the-art performance in molecule generation while reducing the prediction error on newly sampled molecules by 42% compared to a fine-tuned decoder-only Transformer. A probabilistic black-box optimization algorithm built on Joint Transformer generates novel molecules with improved target properties relative to the training data, outperforming other SMILES-based optimization methods in de novo drug design.
    Abstract De novo drug design requires simultaneously generating novel molecules outside of training data and predicting their target properties, making it a hard task for generative models. To address this, we propose Joint Transformer that combines a Transformer decoder, a Transformer encoder, and a predictor in a joint generative model with shared weights. We show that training the model with a penalized log-likelihood objective results in state-of-the-art performance in molecule generation, while decreasing the prediction error on newly sampled molecules, as compared to a fine-tuned decoder-only Transformer, by 42%. Finally, we propose a probabilistic black-box optimization algorithm that employs Joint Transformer to generate novel molecules with improved target properties, as compared to the training data, outperforming other SMILES-based optimization methods in de novo drug design.

Relaxed Octahedral Group Convolution for Learning Symmetry Breaking in 3D Physical Systems

  • paper_url: http://arxiv.org/abs/2310.02299
  • repo_url: None
  • paper_authors: Rui Wang, Robin Walters, Tess E. Smidt
  • for: This paper aims to improve sample efficiency and generalization in deep models by exploiting symmetries.
  • methods: It proposes a relaxed octahedral group convolution that maintains the highest level of equivariance consistent with the data while discovering the subtle symmetry-breaking factors in physical systems.
  • results: Experiments show that the approach not only provides insight into symmetry-breaking factors in phase transitions but also achieves superior performance on fluid super-resolution tasks.
    Abstract Deep equivariant models use symmetries to improve sample efficiency and generalization. However, the assumption of perfect symmetry in many of these models can sometimes be restrictive, especially when the data does not perfectly align with such symmetries. Thus, we introduce relaxed octahedral group convolution for modeling 3D physical systems in this paper. This flexible convolution technique provably allows the model to both maintain the highest level of equivariance that is consistent with data and discover the subtle symmetry-breaking factors in the physical systems. Empirical results validate that our approach can not only provide insights into the symmetry-breaking factors in phase transitions but also achieves superior performance in fluid super-resolution tasks.

AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model

  • paper_url: http://arxiv.org/abs/2310.02054
  • repo_url: None
  • paper_authors: Zibin Dong, Yifu Yuan, Jianye Hao, Fei Ni, Yao Mu, Yan Zheng, Yujing Hu, Tangjie Lv, Changjie Fan, Zhipeng Hu
  • for: This paper proposes AlignDiff, a framework for aligning agent behaviors with diverse human preferences in reinforcement learning, covering both their abstractness and their mutability.
  • methods: It uses reinforcement learning from human feedback (RLHF) to quantify human preferences, and uses the quantified attributes to guide diffusion planning for zero-shot behavior customization.
  • results: AlignDiff shows superior performance on various locomotion tasks in preference matching, switching, and covering, and can complete unseen downstream tasks under human instructions.
    Abstract Aligning agent behaviors with diverse human preferences remains a challenging problem in reinforcement learning (RL), owing to the inherent abstractness and mutability of human preferences. To address these issues, we propose AlignDiff, a novel framework that leverages RL from Human Feedback (RLHF) to quantify human preferences, covering abstractness, and utilizes them to guide diffusion planning for zero-shot behavior customizing, covering mutability. AlignDiff can accurately match user-customized behaviors and efficiently switch from one to another. To build the framework, we first establish the multi-perspective human feedback datasets, which contain comparisons for the attributes of diverse behaviors, and then train an attribute strength model to predict quantified relative strengths. After relabeling behavioral datasets with relative strengths, we proceed to train an attribute-conditioned diffusion model, which serves as a planner with the attribute strength model as a director for preference aligning at the inference phase. We evaluate AlignDiff on various locomotion tasks and demonstrate its superior performance on preference matching, switching, and covering compared to other baselines. Its capability of completing unseen downstream tasks under human instructions also showcases the promising potential for human-AI collaboration. More visualization videos are released on https://aligndiff.github.io/.

Jury: A Comprehensive Evaluation Toolkit

  • paper_url: http://arxiv.org/abs/2310.02040
  • repo_url: https://github.com/obss/jury
  • paper_authors: Devrim Cavusoglu, Ulas Sert, Secil Sen, Sinan Altinuc
  • for: This work aims to standardize and improve the evaluation of deep learning systems across different tasks and metrics.
  • methods: It introduces jury, a toolkit providing a unified evaluation framework with standardized structures for performing evaluation across tasks and metrics.
  • results: Since its open-source release on GitHub, jury has reached a wide audience, helping the community overcome challenges in evaluation.
    Abstract Evaluation plays a critical role in deep learning as a fundamental block of any prediction-based system. However, the vast number of Natural Language Processing (NLP) tasks and the development of various metrics have led to challenges in evaluating different systems with different metrics. To address these challenges, we introduce jury, a toolkit that provides a unified evaluation framework with standardized structures for performing evaluation across different tasks and metrics. The objective of jury is to standardize and improve metric evaluation for all systems and aid the community in overcoming the challenges in evaluation. Since its open-source release, jury has reached a wide audience and is available at https://github.com/obss/jury.
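A usage sketch based on the project README follows; consult the repository for the current API.

```python
from jury import Jury

# Evaluate predictions against references with jury's default metrics
# (example adapted from the project README; the API may have evolved).
scorer = Jury()
predictions = [["the cat is on the mat"], ["a dog sits on the couch"]]
references = [["the cat is playing on the mat"], ["there is a dog on the couch"]]
scores = scorer(predictions=predictions, references=references)
print(scores)
```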

An evaluation of pre-trained models for feature extraction in image classification

  • paper_url: http://arxiv.org/abs/2310.02037
  • repo_url: https://github.com/Jawad-Dar/Jaya-Honey-Badger-Optimization-based-Deep-Neuro-Fuzzy-Network-structure-for-detection-of-Covid-19-
  • paper_authors: Erick da Silva Puls, Matheus V. Todescato, Joel L. Carbonera
  • for: This study compares the performance of different pre-trained neural networks for feature extraction in image classification tasks.
  • methods: It evaluates 16 pre-trained models on four image datasets.
  • results: CLIP-ViT-B and ViT-H-14 achieve the best general performance across the datasets, while CLIP-ResNet50 offers similar performance with less variability. These results provide evidence for choosing models for feature extraction in image classification.
    Abstract In recent years, we have witnessed a considerable increase in performance in image classification tasks. This performance improvement is mainly due to the adoption of deep learning techniques. Generally, deep learning techniques demand a large set of annotated data, making it a challenge when applying it to small datasets. In this scenario, transfer learning strategies have become a promising alternative to overcome these issues. This work aims to compare the performance of different pre-trained neural networks for feature extraction in image classification tasks. We evaluated 16 different pre-trained models in four image datasets. Our results demonstrate that the best general performance along the datasets was achieved by CLIP-ViT-B and ViT-H-14, where the CLIP-ResNet50 model had similar performance but with less variability. Therefore, our study provides evidence supporting the choice of models for feature extraction in image classification tasks.
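A typical feature-extraction setup of the kind the paper evaluates looks like the sketch below; ResNet50 is shown for concreteness, while the paper covers 16 backbones including CLIP and ViT variants.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Freeze a pre-trained backbone and use it as a fixed feature extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()  # drop the classification head
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

images = torch.randn(8, 3, 224, 224)  # stand-in batch of preprocessed images
with torch.no_grad():
    features = backbone(images)  # (8, 2048) descriptors for a downstream classifier
```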

OceanGPT: A Large Language Model for Ocean Science Tasks

  • paper_url: http://arxiv.org/abs/2310.02031
  • repo_url: https://github.com/zjunlp/knowlm
  • paper_authors: Zhen Bi, Ningyu Zhang, Yida Xue, Yixin Ou, Daxiong Ji, Guozhou Zheng, Huajun Chen
  • for: The paper explores the potential of Large Language Models (LLMs) for ocean science tasks and addresses the limitations of current LLMs in catering to the needs of domain experts such as oceanographers.
  • methods: The authors propose DoInstruct, a novel framework that automatically obtains a large volume of ocean-domain instruction data, and construct OceanBench, the first oceanography benchmark for evaluating the capabilities of LLMs in the ocean domain.
  • results: The authors introduce OceanGPT, the first-ever LLM in the ocean domain, which shows a higher level of knowledge expertise for ocean science tasks and, in comprehensive experiments, gains preliminary embodied intelligence capabilities in ocean technology.
    Abstract Ocean science, which delves into the oceans that are reservoirs of life and biodiversity, is of great significance given that oceans cover over 70% of our planet's surface. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in science. Despite the success in other domains, current LLMs often fall short in catering to the needs of domain experts like oceanographers, and the potential of LLMs for ocean science is under-explored. The intrinsic reason may be the immense and intricate nature of ocean data as well as the necessity for higher granularity and richness in knowledge. To alleviate these issues, we introduce OceanGPT, the first-ever LLM in the ocean domain, which is expert in various ocean science tasks. We propose DoInstruct, a novel framework to automatically obtain a large volume of ocean domain instruction data, which generates instructions based on multi-agent collaboration. Additionally, we construct the first oceanography benchmark, OceanBench, to evaluate the capabilities of LLMs in the ocean domain. Though comprehensive experiments, OceanGPT not only shows a higher level of knowledge expertise for oceans science tasks but also gains preliminary embodied intelligence capabilities in ocean technology. Codes, data and checkpoints will soon be available at https://github.com/zjunlp/KnowLM.

Prompting Audios Using Acoustic Properties For Emotion Representation

  • paper_url: http://arxiv.org/abs/2310.02298
  • repo_url: None
  • paper_authors: Hira Dhamyal, Benjamin Elizalde, Soham Deshmukh, Huaming Wang, Bhiksha Raj, Rita Singh
  • for: To improve emotion representation and recognition.
  • methods: The approach automatically generates natural language descriptions ('acoustic prompts') from acoustic properties correlated with emotion, such as pitch, intensity, speech rate, and articulation rate, and maps speech to its prompts with a contrastive learning objective.
  • results: On Emotion Audio Retrieval, acoustic prompts significantly improve the model's performance across various Precision@K metrics. On Speech Emotion Recognition, the model achieves a 3.8% relative accuracy improvement on the Ravdess dataset.
    Abstract Emotions lie on a continuum, but current models treat emotions as a finite valued discrete variable. This representation does not capture the diversity in the expression of emotion. To better represent emotions we propose the use of natural language descriptions (or prompts). In this work, we address the challenge of automatically generating these prompts and training a model to better learn emotion representations from audio and prompt pairs. We use acoustic properties that are correlated to emotion like pitch, intensity, speech rate, and articulation rate to automatically generate prompts i.e. 'acoustic prompts'. We use a contrastive learning objective to map speech to their respective acoustic prompts. We evaluate our model on Emotion Audio Retrieval and Speech Emotion Recognition. Our results show that the acoustic prompts significantly improve the model's performance in EAR, in various Precision@K metrics. In SER, we observe a 3.8% relative accuracy improvement on the Ravdess dataset.
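A minimal sketch of deriving an acoustic prompt is shown below; the thresholds, feature set, and wording are assumptions, not the paper's exact recipe.

```python
import numpy as np
import librosa

# Build an "acoustic prompt" from emotion-correlated properties of a clip.
# Thresholds and wording are illustrative assumptions.
def acoustic_prompt(path: str) -> str:
    y, sr = librosa.load(path, sr=16000)
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
    pitch = "high" if np.nanmean(f0) > 180 else "low"
    energy = "loud" if librosa.feature.rms(y=y).mean() > 0.05 else "soft"
    return f"a person speaking with {pitch} pitch in a {energy} voice"
```

Such prompts can then be paired with the audio in a contrastive objective, analogous to text-image contrastive training.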

Towards Feasible Counterfactual Explanations: A Taxonomy Guided Template-based NLG Method

  • paper_url: http://arxiv.org/abs/2310.02019
  • repo_url: https://github.com/pedramsalimi/nlgxai
  • paper_authors: Pedram Salimi, Nirmalie Wiratunga, David Corsar, Anjana Wijekoon
  • for: This work proposes a new approach to presenting counterfactual explanations in natural language (Natural-XAI), better conveying the feature-value changes needed to alter a model's decision.
  • methods: A user study identifies two types of themes in human-composed counterfactual explanations: content-related, concerning how features and their values are included from both the counterfactual and the query perspectives; and structure-related, concerning the structure and terminology used to describe the necessary value changes.
  • results: The work introduces a feature actionability taxonomy with four clearly defined categories to streamline explanation presentation. Combined with pre-designed templates, this yields a generalizable natural language generation method compatible with existing explainers such as DICE, NICE, and DisCERN, improving the readability and feasibility of counterfactual explanations.
    Abstract Counterfactual Explanations (cf-XAI) describe the smallest changes in feature values necessary to change an outcome from one class to another. However, many cf-XAI methods neglect the feasibility of those changes. In this paper, we introduce a novel approach for presenting cf-XAI in natural language (Natural-XAI), giving careful consideration to actionable and comprehensible aspects while remaining cognizant of immutability and ethical concerns. We present three contributions to this endeavor. Firstly, through a user study, we identify two types of themes present in cf-XAI composed by humans: content-related, focusing on how features and their values are included from both the counterfactual and the query perspectives; and structure-related, focusing on the structure and terminology used for describing necessary value changes. Secondly, we introduce a feature actionability taxonomy with four clearly defined categories, to streamline the explanation presentation process. Using insights from the user study and our taxonomy, we created a generalisable template-based natural language generation (NLG) method compatible with existing explainers like DICE, NICE, and DisCERN, to produce counterfactuals that address the aforementioned limitations of existing approaches. Finally, we conducted a second user study to assess the performance of our taxonomy-guided NLG templates on three domains. Our findings show that the taxonomy-guided Natural-XAI approach (n-XAI^T) received higher user ratings across all dimensions, with significantly improved results in the majority of the domains assessed for articulation, acceptability, feasibility, and sensitivity dimensions.
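The template idea can be sketched as follows; the two categories shown are illustrative stand-ins for the paper's four-category actionability taxonomy.

```python
# Taxonomy-guided counterfactual templates (categories are stand-ins for the
# paper's four actionability categories).
TEMPLATES = {
    "actionable": ("If you {direction} your {feature} from {old} to {new}, "
                   "the outcome changes to {target}."),
    "immutable": ("Your {feature} ({old}) cannot be changed, so it is "
                  "excluded from the suggested changes."),
}

def verbalize(feature: str, old: float, new: float,
              target: str, category: str) -> str:
    direction = "increase" if new > old else "decrease"
    return TEMPLATES[category].format(direction=direction, feature=feature,
                                      old=old, new=new, target=target)

print(verbalize("monthly income", 2500, 3000, "loan approved", "actionable"))
```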

Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion

  • paper_url: http://arxiv.org/abs/2310.02012
  • repo_url: https://github.com/alexandrumeterez/bngrad
  • paper_authors: Alexandru Meterez, Amir Joudaki, Francesco Orabona, Alexander Immer, Gunnar Rätsch, Hadi Daneshmand
  • for: This paper seeks a multi-layer perceptron that retains the optimal signal propagation properties of batch normalization while avoiding exploding gradients in depth.
  • methods: It constructs an MLP with linear activations and batch normalization, and uses Weingarten calculus to develop a rigorous, non-asymptotic theory of its behavior.
  • results: The constructed MLP provably has bounded gradients at any depth for linearly independent input samples while preserving optimal forward signal propagation. An activation shaping scheme is also proposed that empirically achieves the same properties for certain non-linear activations.
    Abstract Normalization layers are one of the key building blocks for deep neural networks. Several theoretical studies have shown that batch normalization improves the signal propagation, by avoiding the representations from becoming collinear across the layers. However, results on mean-field theory of batch normalization also conclude that this benefit comes at the expense of exploding gradients in depth. Motivated by these two aspects of batch normalization, in this study we pose the following question: "Can a batch-normalized network keep the optimal signal propagation properties, but avoid exploding gradients?" We answer this question in the affirmative by giving a particular construction of an Multi-Layer Perceptron (MLP) with linear activations and batch-normalization that provably has bounded gradients at any depth. Based on Weingarten calculus, we develop a rigorous and non-asymptotic theory for this constructed MLP that gives a precise characterization of forward signal propagation, while proving that gradients remain bounded for linearly independent input samples, which holds in most practical settings. Inspired by our theory, we also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
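The analyzed architecture is simple to write down; the sketch below builds it, though the bounded-gradient guarantee is the paper's theoretical result for its construction, not something this snippet proves.

```python
import torch
import torch.nn as nn

# Linear activations with batch normalization between layers, as analyzed.
def bn_mlp(width: int, depth: int) -> nn.Sequential:
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width, bias=False), nn.BatchNorm1d(width)]
    return nn.Sequential(*layers)

net = bn_mlp(width=64, depth=100)
x = torch.randn(32, 64, requires_grad=True)  # linearly independent samples
net(x).square().mean().backward()
print(x.grad.norm())  # remains finite even at depth 100
```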

Generalized Convergence Analysis of Tsetlin Machines: A Probabilistic Approach to Concept Learning

  • paper_url: http://arxiv.org/abs/2310.02005
  • repo_url: None
  • paper_authors: Mohamed-Bachir Belaid, Jivitesh Sharma, Lei Jiao, Ole-Christoffer Granmo, Per-Arne Andersen, Anis Yazidi
  • for: This paper aims to explain the behavior of Tsetlin Machines (TMs), which learn concepts via propositional formulas and have proven efficient across application domains, by settling their convergence in the generalized case.
  • methods: It introduces Probabilistic Concept Learning (PCL), a new framework based on Tsetlin automata that simplifies the TM structure while adding dedicated feedback mechanisms and dedicated inclusion/exclusion probabilities for literals.
  • results: Given $n$ features, PCL learns a set of conjunction clauses $C_i$, each associated with a distinct inclusion probability $p_i$. The paper proves that, for any clause $C_k$, PCL converges to a conjunction of literals. This result not only aids understanding of TMs but may also lead to more robust and interpretable machine learning models.
    Abstract Tsetlin Machines (TMs) have garnered increasing interest for their ability to learn concepts via propositional formulas and their proven efficiency across various application domains. Despite this, the convergence proof for the TMs, particularly for the AND operator (\emph{conjunction} of literals), in the generalized case (inputs greater than two bits) remains an open problem. This paper aims to fill this gap by presenting a comprehensive convergence analysis of Tsetlin automaton-based Machine Learning algorithms. We introduce a novel framework, referred to as Probabilistic Concept Learning (PCL), which simplifies the TM structure while incorporating dedicated feedback mechanisms and dedicated inclusion/exclusion probabilities for literals. Given $n$ features, PCL aims to learn a set of conjunction clauses $C_i$ each associated with a distinct inclusion probability $p_i$. Most importantly, we establish a theoretical proof confirming that, for any clause $C_k$, PCL converges to a conjunction of literals when $0.5 < p_k < 1$.
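A toy illustration of probabilistic literal inclusion, the core PCL ingredient, is given below; the paper's feedback and update rules are omitted.

```python
import random

# Toy PCL ingredient: each literal i enters the conjunction with probability p_i.
def sample_clause(inclusion_probs):
    return [i for i, p in enumerate(inclusion_probs) if random.random() < p]

def evaluate(clause, x):
    """Evaluate the conjunction of the included literals on boolean input x."""
    return all(x[i] for i in clause)

clause = sample_clause([0.9, 0.1, 0.8])
print(clause, evaluate(clause, [True, False, True]))
```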

Fill in the Blank: Exploring and Enhancing LLM Capabilities for Backward Reasoning in Math Word Problems

  • paper_url: http://arxiv.org/abs/2310.01991
  • repo_url: None
  • paper_authors: Aniruddha Deb, Neeva Oza, Sarthak Singla, Dinesh Khandelwal, Dinesh Garg, Parag Singla
  • for: This paper studies the backward reasoning abilities of Large Language Models (LLMs) on math word problems: given a question and its answer, with some details omitted from the question, can an LLM recover the missing information?
  • methods: The paper formally defines the backward reasoning task on math word problems and modifies three datasets (GSM8k, SVAMP, and MultiArith) to evaluate it, then proposes three new techniques to improve LLM performance: Rephrase, PAL-Tools, and Check your Work.
  • results: Experiments show these techniques successively improve LLM performance on the backward reasoning task; a final Bayesian ensemble of the base methods, aided by a high-accuracy natural verifier, boosts accuracy further by a significant margin.
    Abstract While forward reasoning (i.e. find the answer given the question) has been explored extensively in the recent literature, backward reasoning is relatively unexplored. We examine the backward reasoning capabilities of LLMs on Math Word Problems (MWPs): given a mathematical question and its answer, with some details omitted from the question, can LLMs effectively retrieve the missing information? In this paper, we formally define the backward reasoning task on math word problems and modify three datasets to evaluate this task: GSM8k, SVAMP and MultiArith. Our findings show a significant drop in the accuracy of models on backward reasoning compared to forward reasoning across four SOTA LLMs (GPT4, GPT3.5, PaLM-2, and LLaMa-2). Utilizing the specific format of this task, we propose three novel techniques that improve performance: Rephrase reformulates the given problem into a forward reasoning problem, PAL-Tools combines the idea of Program-Aided LLMs to produce a set of equations that can be solved by an external solver, and Check your Work exploits the availability of natural verifier of high accuracy in the forward direction, interleaving solving and verification steps. Finally, realizing that each of our base methods correctly solves a different set of problems, we propose a novel Bayesian formulation for creating an ensemble over these base methods aided by a verifier to further boost the accuracy by a significant margin. Extensive experimentation demonstrates that our techniques successively improve the performance of LLMs on the backward reasoning task, with the final ensemble-based method resulting in a substantial performance gain compared to the raw LLMs with standard prompting techniques such as chain-of-thought.
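The PAL-Tools idea, emitting equations and delegating the solving to an external tool, maps naturally onto backward reasoning: treat the masked question detail as an unknown and solve for it, then verify in the forward direction (the "Check your Work" step). The word problem below is invented for illustration; sympy stands in for the external solver.

```python
import sympy

# Backward-reasoning toy: "Sam bought x apples at $3 each and a $5 basket,
# paying $20 in total."  The forward answer (20) is known; the question
# detail (x) is masked, so we solve for it.
x = sympy.symbols("x", positive=True)
equation = sympy.Eq(3 * x + 5, 20)
solutions = sympy.solve(equation, x)
print(solutions)  # [5]

# "Check your Work": confirm the candidate by re-running the forward direction.
candidate = solutions[0]
assert 3 * candidate + 5 == 20
```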

Soda: An Object-Oriented Functional Language for Specifying Human-Centered Problems

  • paper_url: http://arxiv.org/abs/2310.01961
  • repo_url: None
  • paper_authors: Julian Alfredo Mendez
  • for: This paper presents a language for handling qualities and quantities in a natural way, greatly simplifying the task of checking their correctness.
  • methods: The paper introduces Soda (Symbolic Objective Descriptive Analysis), motivated by the design of a descriptive language for encoding complex requirements on computer systems, and explains the key properties needed to model such requirements with simple definitions.
  • results: The paper provides a tool for describing problems in an easy way that is more transparent and less error-prone.
    Abstract We present Soda (Symbolic Objective Descriptive Analysis), a language that helps to treat qualities and quantities in a natural way and greatly simplifies the task of checking their correctness. We present key properties for the language motivated by the design of a descriptive language to encode complex requirements on computer systems, and we explain how these key properties must be addressed to model these requirements with simple definitions. We give an overview of a tool that helps to describe problems in an easy way that we consider more transparent and less error-prone.

Language Models as Knowledge Bases for Visual Word Sense Disambiguation

  • paper_url: http://arxiv.org/abs/2310.01960
  • repo_url: https://github.com/anastasiakrith/llm-for-vwsd
  • paper_authors: Anastasia Kritharoula, Maria Lymperaiou, Giorgos Stamou
  • for: This work aims to improve the retrieval performance of visiolinguistic (VL) transformers on Visual Word Sense Disambiguation (VWSD) by using Large Language Models (LLMs) as knowledge bases.
  • methods: Knowledge stored in LLMs is retrieved with appropriate zero-shot prompts; VWSD is also converted into a purely textual question-answering (QA) problem over generated image captions, with Chain-of-Thought (CoT) prompting used to reveal the internal reasoning steps the LLM follows.
  • results: Using LLMs as knowledge bases improves the retrieval performance of VL transformers, and the textual QA transformation makes the model's reasoning process easier to inspect.
    Abstract Visual Word Sense Disambiguation (VWSD) is a novel challenging task that lies between linguistic sense disambiguation and fine-grained multimodal retrieval. The recent advancements in the development of visiolinguistic (VL) transformers suggest some off-the-shelf implementations with encouraging results, which however we argue can be further improved. To this end, we propose some knowledge-enhancement techniques towards improving the retrieval performance of VL transformers via the usage of Large Language Models (LLMs) as Knowledge Bases. More specifically, knowledge stored in LLMs is retrieved with the help of appropriate prompts in a zero-shot manner, achieving performance advancements. Moreover, we convert VWSD to a purely textual question-answering (QA) problem by considering generated image captions as multiple-choice candidate answers. Zero-shot and few-shot prompting strategies are leveraged to explore the potential of such a transformation, while Chain-of-Thought (CoT) prompting in the zero-shot setting is able to reveal the internal reasoning steps an LLM follows to select the appropriate candidate. In total, our presented approach is the first one to analyze the merits of exploiting knowledge stored in LLMs in different ways to solve VWSD.
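The QA transformation the abstract describes is largely prompt construction: generated image captions become multiple-choice options for the ambiguous phrase. A minimal sketch, with invented captions and phrasing:

```python
def vwsd_as_qa(ambiguous_phrase, captions):
    """Turn a visual word sense disambiguation instance into a
    multiple-choice question over generated image captions."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(captions))
    return (
        f"Which image best matches the phrase '{ambiguous_phrase}'?\n"
        f"{options}\n"
        "Answer with the letter of the best option and explain step by step."
    )

captions = [
    "a river bank covered in grass",
    "the marble facade of a bank building",
    "a row of coin jars on a shelf",
]
print(vwsd_as_qa("bank erosion", captions))
```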

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

  • paper_url: http://arxiv.org/abs/2310.01957
  • repo_url: https://github.com/wayveai/driving-with-llms
  • paper_authors: Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, Jamie Shotton
  • for: This work aims to improve context understanding in autonomous driving, exploiting the generalization and interpretability of Large Language Models (LLMs).
  • methods: The authors propose a unique object-level multimodal LLM architecture that fuses vectorized numeric modalities with a pretrained LLM. They also build a dataset of 160k question-answer pairs over 10k driving scenarios, with control commands collected from an RL agent and QA pairs generated by a teacher LLM (GPT-3.5), and devise a new pretraining strategy to align the numeric vector modality with static LLM representations.
  • results: The resulting LLM-driver interprets driving scenarios, answers questions, and makes decisions better in context; compared with traditional behavioral cloning, LLM-based driving action generation shows stronger generalization and interpretability. The benchmark, datasets, and model are released for further exploration.
    Abstract Large Language Models (LLMs) have shown promise in the autonomous driving sector, particularly in generalization and interpretability. We introduce a unique object-level multimodal LLM architecture that merges vectorized numeric modalities with a pre-trained LLM to improve context understanding in driving situations. We also present a new dataset of 160k QA pairs derived from 10k driving scenarios, paired with high quality control commands collected with RL agent and question answer pairs generated by teacher LLM (GPT-3.5). A distinct pretraining strategy is devised to align numeric vector modalities with static LLM representations using vector captioning language data. We also introduce an evaluation metric for Driving QA and demonstrate our LLM-driver's proficiency in interpreting driving scenarios, answering questions, and decision-making. Our findings highlight the potential of LLM-based driving action generation in comparison to traditional behavioral cloning. We make our benchmark, datasets, and model available for further exploration.
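At its core, the object-level fusion requires projecting vectorized numeric scene descriptions into the LLM's token-embedding space so they can be interleaved with text tokens. A schematic sketch; the dimensions and the two-layer projector are assumptions rather than the paper's architecture:

```python
import torch
import torch.nn as nn

OBJ_DIM, LLM_DIM = 16, 512  # assumed sizes, not the paper's

class VectorProjector(nn.Module):
    """Map per-object numeric vectors (position, velocity, size, ...)
    to pseudo-token embeddings consumable by a frozen LLM."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBJ_DIM, LLM_DIM), nn.GELU(), nn.Linear(LLM_DIM, LLM_DIM)
        )

    def forward(self, objects):           # (batch, n_objects, OBJ_DIM)
        return self.net(objects)          # (batch, n_objects, LLM_DIM)

objects = torch.randn(2, 10, OBJ_DIM)     # 10 detected objects per scene
pseudo_tokens = VectorProjector()(objects)
text_tokens = torch.randn(2, 32, LLM_DIM)  # stand-in for an embedded prompt
llm_input = torch.cat([pseudo_tokens, text_tokens], dim=1)
print(llm_input.shape)  # torch.Size([2, 42, 512])
```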

Probabilistic Reach-Avoid for Bayesian Neural Networks

  • paper_url: http://arxiv.org/abs/2310.01951
  • repo_url: https://github.com/matthewwicker/bnnreachavoid
  • paper_authors: Matthew Wicker, Luca Laurenti, Andrea Patane, Nicola Paoletti, Alessandro Abate, Marta Kwiatkowska
  • for: This work aims to simultaneously learn the dynamics of an unknown stochastic environment and synthesize an optimal policy for acting in it, while ensuring that the policy's sequential decisions are safe in safety-critical scenarios.
  • methods: Interval propagation and backward recursion compute lower bounds on the probability that a policy satisfies a given reach-avoid specification (reaching target states while avoiding unsafe states) under a learned Bayesian neural network (BNN) dynamics model; control synthesis algorithms then derive policies that maximize these certified lower bounds.
  • results: Across a series of control benchmarks with learned BNN dynamics, on the most challenging benchmark the optimal synthesis algorithm yields more than a four-fold increase in the number of certifiable states and more than a three-fold increase in the average guaranteed reach-avoid probability compared with purely data-driven policies.
    Abstract Model-based reinforcement learning seeks to simultaneously learn the dynamics of an unknown stochastic environment and synthesise an optimal policy for acting in it. Ensuring the safety and robustness of sequential decisions made through a policy in such an environment is a key challenge for policies intended for safety-critical scenarios. In this work, we investigate two complementary problems: first, computing reach-avoid probabilities for iterative predictions made with dynamical models, with dynamics described by Bayesian neural network (BNN); second, synthesising control policies that are optimal with respect to a given reach-avoid specification (reaching a "target" state, while avoiding a set of "unsafe" states) and a learned BNN model. Our solution leverages interval propagation and backward recursion techniques to compute lower bounds for the probability that a policy's sequence of actions leads to satisfying the reach-avoid specification. Such computed lower bounds provide safety certification for the given policy and BNN model. We then introduce control synthesis algorithms to derive policies maximizing said lower bounds on the safety probability. We demonstrate the effectiveness of our method on a series of control benchmarks characterized by learned BNN dynamics models. On our most challenging benchmark, compared to purely data-driven policies the optimal synthesis algorithm is able to provide more than a four-fold increase in the number of certifiable states and more than a three-fold increase in the average guaranteed reach-avoid probability.
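The certification step, backward recursion over lower-bounded transition probabilities, can be shown on a tiny discretized chain. Here hand-coded lower bounds stand in for the bounds that interval propagation through the BNN would produce:

```python
import numpy as np

# States 0..4 on a chain: state 4 is the target, state 0 is unsafe.
S, TARGET, UNSAFE, HORIZON = 5, 4, 0, 6

# p_lower[s, s'] : lower bound on P(s' | s) under a fixed policy; this
# stands in for bounds obtained via interval propagation through a BNN,
# so the rows need not sum to 1.
p_lower = np.zeros((S, S))
for s in range(1, S - 1):
    p_lower[s, s + 1] = 0.7   # move toward the target
    p_lower[s, s - 1] = 0.2   # slip backward

# Backward recursion: V(s) lower-bounds the probability of reaching
# TARGET while avoiding UNSAFE within the remaining horizon.
V = np.zeros(S)
V[TARGET] = 1.0
for _ in range(HORIZON):
    V_next = p_lower @ V
    V_next[TARGET], V_next[UNSAFE] = 1.0, 0.0   # absorbing states
    V = V_next

print(np.round(V, 3))  # certified lower bounds per start state
```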

Ravestate: Distributed Composition of a Causal-Specificity-Guided Interaction Policy

  • paper_url: http://arxiv.org/abs/2310.01943
  • repo_url: None
  • paper_authors: Joseph Birkner, Andreas Dolp, Negin Karimi, Nikita Basargin, Alona Kharchenko, Rafael Hostettler
  • for: This paper proposes a rule-based approach to human-robot interaction policy design that is efficient, explainable, expressive, and intuitive.
  • methods: The paper presents the Signal-Rule-Slot framework, which refines prior work on rule-based symbolic system design and introduces a new Bayesian notion of interaction-rule utility called Causal Pathway Self-information.
  • results: With the open-source reference implementation Ravestate and user studies in text-, speech-, and vision-based scenarios, the system demonstrates robust contextual behaviour, paving the way for more effective human-machine interaction.
    Abstract In human-robot interaction policy design, a rule-based method is efficient, explainable, expressive and intuitive. In this paper, we present the Signal-Rule-Slot framework, which refines prior work on rule-based symbol system design and introduces a new, Bayesian notion of interaction rule utility called Causal Pathway Self-information. We offer a rigorous theoretical foundation as well as a rich open-source reference implementation Ravestate, with which we conduct user studies in text-, speech-, and vision-based scenarios. The experiments show robust contextual behaviour of our probabilistically informed rule-based system, paving the way for more effective human-machine interaction.

Navigating Cultural Chasms: Exploring and Unlocking the Cultural POV of Text-To-Image Models

  • paper_url: http://arxiv.org/abs/2310.01929
  • repo_url: None
  • paper_authors: Mor Ventura, Eyal Ben-David, Anna Korhonen, Roi Reichart
  • for: This study explores the cultural perception embedded in Text-To-Image (TTI) models and how these models behave across different cultural contexts.
  • methods: Culture is characterized across three hierarchical tiers (cultural dimensions, cultural domains, and cultural concepts) and probed with a suite of evaluation techniques: intrinsic evaluations in the CLIP space, extrinsic evaluations with a Visual-Question-Answer (VQA) model, and human assessments, supported by the CulText2I dataset derived from four TTI models and spanning ten languages.
  • results: Experiments reveal insights into the models' cultural awareness and cultural distinctions, and show that cultural features can be unlocked, opening the potential for cross-cultural applications.
    Abstract Text-To-Image (TTI) models, exemplified by DALL-E and StableDiffusion, have recently gained prominence for their remarkable zero-shot capabilities in generating images guided by textual prompts. Language, as a conduit of culture, plays a pivotal role in these models' multilingual capabilities, which in turn shape their cultural agency. In this study, we explore the cultural perception embedded in TTI models by characterizing culture across three hierarchical tiers: cultural dimensions, cultural domains, and cultural concepts. We propose a comprehensive suite of evaluation techniques, including intrinsic evaluations using the CLIP space, extrinsic evaluations with a Visual-Question-Answer (VQA) model, and human assessments, to discern TTI cultural perceptions. To facilitate our research, we introduce the CulText2I dataset, derived from four diverse TTI models and spanning ten languages. Our experiments reveal insights into these models' cultural awareness, cultural distinctions, and the unlocking of cultural features, releasing the potential for cross-cultural applications.

DARTH: Holistic Test-time Adaptation for Multiple Object Tracking

  • paper_url: http://arxiv.org/abs/2310.01926
  • repo_url: https://github.com/mattiasegu/darth
  • paper_authors: Mattia Segu, Bernt Schiele, Fisher Yu
  • for: This paper addresses test-time adaptation of multiple object tracking (MOT) systems to domain shift, a robustness requirement for safe autonomous driving.
  • methods: The paper proposes a holistic test-time adaptation framework covering both object detection and instance association: a detection consistency formulation adapts detection in a self-supervised fashion, while a novel patch contrastive loss adapts the instance appearance representations.
  • results: Evaluated on a variety of domain shifts (sim-to-real, outdoor-to-indoor, indoor-to-outdoor), the method substantially improves source-model performance on all metrics.
    Abstract Multiple object tracking (MOT) is a fundamental component of perception systems for autonomous driving, and its robustness to unseen conditions is a requirement to avoid life-critical failures. Despite the urge of safety in driving systems, no solution to the MOT adaptation problem to domain shift in test-time conditions has ever been proposed. However, the nature of a MOT system is manifold - requiring object detection and instance association - and adapting all its components is non-trivial. In this paper, we analyze the effect of domain shift on appearance-based trackers, and introduce DARTH, a holistic test-time adaptation framework for MOT. We propose a detection consistency formulation to adapt object detection in a self-supervised fashion, while adapting the instance appearance representations via our novel patch contrastive loss. We evaluate our method on a variety of domain shifts - including sim-to-real, outdoor-to-indoor, indoor-to-outdoor - and substantially improve the source model performance on all metrics. Code: https://github.com/mattiasegu/darth.

FiGURe: Simple and Efficient Unsupervised Node Representations with Filter Augmentations

  • paper_url: http://arxiv.org/abs/2310.01892
  • repo_url: https://github.com/microsoft/figure
  • paper_authors: Chanakya Ekbote, Ajinkya Pankaj Deshpande, Arun Iyer, Ramakrishna Bairi, Sundararajan Sellamanickam
  • for: This paper aims to improve the performance of unsupervised node representations learnt using contrastive learning-based methods on downstream tasks.
  • methods: The authors propose a simple filter-based augmentation method to capture different parts of the eigen-spectrum, which leads to significant improvements. They also share the same weights across different filter augmentations to reduce computational load.
  • results: The proposed method, FiGURe, achieves an average gain of up to 4.4% compared to state-of-the-art unsupervised models across all datasets considered, both homophilic and heterophilic.
    Abstract Unsupervised node representations learnt using contrastive learning-based methods have shown good performance on downstream tasks. However, these methods rely on augmentations that mimic low-pass filters, limiting their performance on tasks requiring different eigen-spectrum parts. This paper presents a simple filter-based augmentation method to capture different parts of the eigen-spectrum. We show significant improvements using these augmentations. Further, we show that sharing the same weights across these different filter augmentations is possible, reducing the computational load. In addition, previous works have shown that good performance on downstream tasks requires high dimensional representations. Working with high dimensions increases the computations, especially when multiple augmentations are involved. We mitigate this problem and recover good performance through lower dimensional embeddings using simple random Fourier feature projections. Our method, FiGURe achieves an average gain of up to 4.4%, compared to the state-of-the-art unsupervised models, across all datasets in consideration, both homophilic and heterophilic. Our code can be found at: https://github.com/microsoft/figure.
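The dimensionality-reduction step mentioned above, random Fourier feature projections, is only a few lines of numpy. The bandwidth and output dimension below are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_fourier_features(X, out_dim, bandwidth=1.0):
    """Project rows of X to out_dim features approximating an RBF kernel:
    z(x) = sqrt(2/D) * cos(Wx + b), with W ~ N(0, 1/bandwidth^2), b ~ U[0, 2pi)."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, out_dim))
    b = rng.uniform(0.0, 2 * np.pi, size=out_dim)
    return np.sqrt(2.0 / out_dim) * np.cos(X @ W + b)

X = rng.normal(size=(100, 512))            # high-dimensional node embeddings
Z = random_fourier_features(X, out_dim=64)
print(Z.shape)                             # (100, 64)
# Inner products in Z approximate a Gaussian kernel on the original X,
# which is how low-dimensional embeddings can recover performance.
```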

Adaptive Hybrid Model for Enhanced Stock Market Predictions Using Improved VMD and Stacked Informer

  • paper_url: http://arxiv.org/abs/2310.01884
  • repo_url: https://github.com/DANNHIROAKI/Adaptive-Hybrid-Model-for-Enhanced-Stock-Market-Predictions-Using-Improved-VMD-and-Stacked-Informer
  • paper_authors: Jianan Zhang, Hongyi Duan
  • for: This study proposes an adaptive hybrid model for stock market prediction, leveraging an improved Variational Mode Decomposition (VMD), Feature Engineering (FE), and a stacked Informer with an adaptive loss function.
  • methods: The model combines the enhanced VMD, FE, and stacked Informer with an adaptive loss function.
  • results: Experiments show that the proposed model (termed Adam+GC+enhanced Informer, abbreviated VMGCformer) outperforms traditional and other hybrid models on stock market data in prediction accuracy, responsiveness, and generalization.
    Abstract This paper introduces an innovative adaptive hybrid model for stock market predictions, leveraging the capabilities of an enhanced Variational Mode Decomposition (VMD), Feature Engineering (FE), and stacked Informer integrated with an adaptive loss function. Through rigorous experimentation, the proposed model, termed Adam+GC+enhanced informer (We name it VMGCformer), demonstrates significant proficiency in addressing the intricate dynamics and volatile nature of stock market data. Experimental results, derived from multiple benchmark datasets, underscore the model's superiority in terms of prediction accuracy, responsiveness, and generalization capabilities over traditional and other hybrid models. The research further highlights potential avenues for optimization and introduces future directions to enhance predictive modeling, especially for small enterprises and feature engineering.

Towards Stable Backdoor Purification through Feature Shift Tuning

  • paper_url: http://arxiv.org/abs/2310.01875
  • repo_url: https://github.com/aisafety-hkust/stable_backdoor_purification
  • paper_authors: Rui Min, Zeyu Qin, Li Shen, Minhao Cheng
  • for: This work proposes a simple, easy-to-deploy defense against backdoor attacks, reducing the risk that deep neural networks (DNNs) are manipulated through a small set of poisoned training samples.
  • methods: Starting from fine-tuning, the most common and easily deployed backdoor defense, comprehensive evaluations show that vanilla tuning completely fails at low poisoning rates because backdoor and clean features become entangled. The paper therefore proposes Feature Shift Tuning (FST), which improves backdoor purification by actively deviating the classifier weights from the originally compromised weights, disentangling the backdoor and clean features.
  • results: Experiments show FST delivers consistently stable performance under different attack settings, including low poisoning rates, and purifies quickly, requiring only 10 training epochs and no complex parameter adjustments.
    Abstract It has been widely observed that deep neural networks (DNN) are vulnerable to backdoor attacks where attackers could manipulate the model behavior maliciously by tampering with a small set of training samples. Although a line of defense methods is proposed to mitigate this threat, they either require complicated modifications to the training process or heavily rely on the specific model architecture, which makes them hard to deploy into real-world applications. Therefore, in this paper, we instead start with fine-tuning, one of the most common and easy-to-deploy backdoor defenses, through comprehensive evaluations against diverse attack scenarios. Observations made through initial experiments show that in contrast to the promising defensive results on high poisoning rates, vanilla tuning methods completely fail at low poisoning rate scenarios. Our analysis shows that with the low poisoning rate, the entanglement between backdoor and clean features undermines the effect of tuning-based defenses. Therefore, it is necessary to disentangle the backdoor and clean features in order to improve backdoor purification. To address this, we introduce Feature Shift Tuning (FST), a method for tuning-based backdoor purification. Specifically, FST encourages feature shifts by actively deviating the classifier weights from the originally compromised weights. Extensive experiments demonstrate that our FST provides consistently stable performance under different attack settings. Without complex parameter adjustments, FST also achieves much lower tuning costs, only 10 epochs. Our codes are available at https://github.com/AISafety-HKUST/stable_backdoor_purification.
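The heart of FST is tuning on clean data while actively pushing the classifier head away from its potentially compromised weights. Below is a minimal sketch of that objective; the inner-product penalty form, coefficient, and head-only tuning are simplifications (the paper additionally constrains the weight norm, omitted here; see the authors' repo for the real objective):

```python
import torch
import torch.nn as nn

feat_dim, n_classes, alpha = 512, 10, 0.1    # illustrative values

backbone = nn.Linear(784, feat_dim)           # stand-in feature extractor
head = nn.Linear(feat_dim, n_classes)
w_compromised = head.weight.detach().clone()  # head weights after poisoned training

opt = torch.optim.SGD(head.parameters(), lr=1e-2)
ce = nn.CrossEntropyLoss()

x = torch.randn(64, 784)                      # stand-in clean tuning batch
y = torch.randint(0, n_classes, (64,))

for _ in range(10):
    opt.zero_grad()
    logits = head(backbone(x))
    # Encourage a feature shift: penalize alignment with the originally
    # compromised head while keeping clean accuracy via cross-entropy.
    shift_penalty = (head.weight * w_compromised).sum()
    loss = ce(logits, y) + alpha * shift_penalty
    loss.backward()
    opt.step()
print(float(loss))
```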

Conditional Instrumental Variable Regression with Representation Learning for Causal Inference

  • paper_url: http://arxiv.org/abs/2310.01865
  • repo_url: None
  • paper_authors: Debo Cheng, Ziqi Xu, Jiuyong Li, Lin Liu, Jixue Liu, Thuc Duy Le
  • for: This paper studies the challenging problem of estimating causal effects from observational data in the presence of unobserved (latent) confounders.
  • methods: Two-stage least squares (TSLS) and its variants with a standard instrumental variable (IV) are commonly used to remove confounding bias, including bias from unobserved confounders, but they rely on a linearity assumption, and the unconfounded-instrument condition imposed on a standard IV is too strong to be practical. This paper instead uses a conditional IV (CIV) to relax that condition and proposes a non-linear CIV regression with Confounding Balancing Representation Learning, CBRL.CIV, which jointly removes the bias from unobserved confounders and balances the observed confounders, without the linearity assumption; its soundness is demonstrated theoretically.
  • results: Extensive experiments on synthetic and two real-world datasets show that CBRL.CIV is competitive with state-of-the-art IV-based estimators and superior in handling non-linear situations.
    Abstract This paper studies the challenging problem of estimating causal effects from observational data, in the presence of unobserved confounders. The two-stage least square (TSLS) method and its variants with a standard instrumental variable (IV) are commonly used to eliminate confounding bias, including the bias caused by unobserved confounders, but they rely on the linearity assumption. Besides, the strict condition of unconfounded instruments posed on a standard IV is too strong to be practical. To address these challenging and practical problems of the standard IV method (linearity assumption and the strict condition), in this paper, we use a conditional IV (CIV) to relax the unconfounded instrument condition of standard IV and propose a non-linear CIV regression with Confounding Balancing Representation Learning, CBRL.CIV, for jointly eliminating the confounding bias from unobserved confounders and balancing the observed confounders, without the linearity assumption. We theoretically demonstrate the soundness of CBRL.CIV. Extensive experiments on synthetic and two real-world datasets show the competitive performance of CBRL.CIV against state-of-the-art IV-based estimators and superiority in dealing with the non-linear situation.

Fine-tuned vs. Prompt-tuned Supervised Representations: Which Better Account for Brain Language Representations?

  • paper_url: http://arxiv.org/abs/2310.01854
  • repo_url: None
  • paper_authors: Jingyuan Sun, Marie-Francine Moens
  • for: investigate the effectiveness of prompt-tuning compared to fine-tuning in generating representations that better account for the brain’s language representations
  • methods: using neural decoding to compare the performance of prompt-tuned and fine-tuned representations in predicting linguistic stimuli from brain activities
  • results: full fine-tuning does not significantly outperform prompt-tuning in neural decoding, and tasks dealing with fine-grained concept meaning yield representations that better decode brain activation patterns than other tasks.
    Abstract To decipher the algorithm underlying the human brain's language representation, previous work probed brain responses to language input with pre-trained artificial neural network (ANN) models fine-tuned on NLU tasks. However, full fine-tuning generally updates the entire parametric space and distorts pre-trained features, cognitively inconsistent with the brain's robust multi-task learning ability. Prompt-tuning, in contrast, protects pre-trained weights and learns task-specific embeddings to fit a task. Could prompt-tuning generate representations that better account for the brain's language representations than fine-tuning? If so, what kind of NLU task leads a pre-trained model to better decode the information represented in the human brain? We investigate these questions by comparing prompt-tuned and fine-tuned representations in neural decoding, that is predicting the linguistic stimulus from the brain activities evoked by the stimulus. We find that on none of the 10 NLU tasks, full fine-tuning significantly outperforms prompt-tuning in neural decoding, implicating that a more brain-consistent tuning method yields representations that better correlate with brain data. Moreover, we identify that tasks dealing with fine-grained concept meaning yield representations that better decode brain activation patterns than other tasks, especially the syntactic chunking task. This indicates that our brain encodes more fine-grained concept information than shallow syntactic information when representing languages.

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

  • paper_url: http://arxiv.org/abs/2310.01852
  • repo_url: https://github.com/pku-yuangroup/languagebind
  • paper_authors: Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, Li Yuan
  • for: Extend video-language pretraining beyond vision and language to improve the performance of multimodal models across N modalities (N >= 3).
  • methods: Use language as the bind across modalities: freeze the language encoder obtained from VL pretraining, then train encoders for the other modalities with contrastive learning so all modalities map into a shared feature space. Training is supported by the proposed VIDAL-10M dataset of Video, Infrared, Depth, Audio, and corresponding Language.
  • results: The approach outperforms ImageBind by 5.8% R@1 on the MSR-VTT zero-shot video-text retrieval task and also improves zero-shot video, audio, depth, and infrared understanding on multiple other benchmarks.
    Abstract The video-language (VL) pretraining has achieved remarkable improvement in multiple downstream tasks. However, the current VL pretraining framework is hard to extend to multiple modalities (N modalities, N>=3) beyond vision and language. We thus propose LanguageBind, taking the language as the bind across different modalities because the language modality is well-explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining, then train encoders for other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, implementing multi-modal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset with alignment data pairs centered on language. We thus propose VIDAL-10M with Video, Infrared, Depth, Audio and their corresponding Language, naming as VIDAL-10M. In our VIDAL-10M, all videos are from short video platforms with complete semantics rather than truncated segments from long videos, and all the video, depth, infrared, and audio modalities are aligned to their textual descriptions. After pretraining on VIDAL-10M, we outperform ImageBind by 5.8% R@1 on the MSR-VTT dataset with only 15% of the parameters in the zero-shot video-text retrieval task. Beyond this, our LanguageBind has greatly improved in the zero-shot video, audio, depth, and infrared understanding tasks. For instance, LanguageBind surpassing InterVideo by 1.9% on MSR-VTT, 8.8% on MSVD, 6.3% on DiDeMo, and 4.4% on ActivityNet. On the LLVIP and NYU-D datasets, LanguageBind outperforms ImageBind with 23.8% and 11.1% top-1 accuracy. Code address: https://github.com/PKU-YuanGroup/LanguageBind.
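The training recipe reduces to a CLIP-style contrastive loss in which the language tower stays frozen and each new modality's encoder is optimized toward it. A schematic version with toy linear encoders; the dimensions and temperature are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256
text_encoder = nn.Linear(300, D)        # stand-in for the pretrained language tower
for p in text_encoder.parameters():
    p.requires_grad = False             # language is the frozen bind

depth_encoder = nn.Linear(128, D)       # a new modality being aligned to language
opt = torch.optim.Adam(depth_encoder.parameters(), lr=1e-3)

text = torch.randn(32, 300)             # paired (text, depth) training batch
depth = torch.randn(32, 128)

for _ in range(5):
    opt.zero_grad()
    t = F.normalize(text_encoder(text), dim=-1)
    d = F.normalize(depth_encoder(depth), dim=-1)
    logits = d @ t.t() / 0.07           # temperature is an assumed value
    labels = torch.arange(32)           # matched pairs lie on the diagonal
    loss = (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
    loss.backward()
    opt.step()
print(float(loss))
```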

Zero-Shot Refinement of Buildings’ Segmentation Models using SAM

  • paper_url: http://arxiv.org/abs/2310.01845
  • repo_url: None
  • paper_authors: Ali Mayladan, Hasan Nasrallah, Hasan Moughnieh, Mustafa Shukor, Ali J. Ghandour
  • for: This paper aims to adapt foundation models for specific domains, specifically remote sensing imagery, to improve their generalization and recognition abilities.
  • methods: The authors introduce a novel approach that integrates a pre-trained CNN as a prompt generator to augment the Segment Anything Model (SAM) with recognition abilities. They evaluate their method on three remote sensing datasets and achieve improved performance.
  • results: A 5.47% increase in IoU and a 4.81% improvement in F1-score for out-of-distribution performance on the WHU dataset, and a 2.72% and 1.58% increase in True-Positive-IoU and True-Positive-F1 score, respectively, for in-distribution performance on the WHU dataset.
    Abstract Foundation models have excelled in various tasks but are often evaluated on general benchmarks. The adaptation of these models for specific domains, such as remote sensing imagery, remains an underexplored area. In remote sensing, precise building instance segmentation is vital for applications like urban planning. While Convolutional Neural Networks (CNNs) perform well, their generalization can be limited. For this aim, we present a novel approach to adapt foundation models to address existing models' generalization dropback. Among several models, our focus centers on the Segment Anything Model (SAM), a potent foundation model renowned for its prowess in class-agnostic image segmentation capabilities. We start by identifying the limitations of SAM, revealing its suboptimal performance when applied to remote sensing imagery. Moreover, SAM does not offer recognition abilities and thus fails to classify and tag localized objects. To address these limitations, we introduce different prompting strategies, including integrating a pre-trained CNN as a prompt generator. This novel approach augments SAM with recognition abilities, a first of its kind. We evaluated our method on three remote sensing datasets, including the WHU Buildings dataset, the Massachusetts Buildings dataset, and the AICrowd Mapping Challenge. For out-of-distribution performance on the WHU dataset, we achieve a 5.47% increase in IoU and a 4.81% improvement in F1-score. For in-distribution performance on the WHU dataset, we observe a 2.72% and 1.58% increase in True-Positive-IoU and True-Positive-F1 score, respectively. We intend to release our code repository, hoping to inspire further exploration of foundation models for domain-specific tasks within the remote sensing community.

Extending CAM-based XAI methods for Remote Sensing Imagery Segmentation

  • paper_url: http://arxiv.org/abs/2310.01837
  • repo_url: None
  • paper_authors: Abdul Karim Gizzini, Mustafa Shukor, Ali J. Ghandour
  • for: This paper aims to explain the behavior and decision-making process of deep learning models on high-resolution satellite imagery, improving their transparency and trustworthiness.
  • methods: Recent CAM-based XAI classification algorithms are adapted and made usable for multi-class image segmentation, focusing on building segmentation, together with a new XAI evaluation methodology and an "Entropy"-based metric for measuring model uncertainty.
  • results: The adapted techniques help interpret segmentation models on high-resolution satellite imagery, and monitoring entropy over the pixels within the target class proves a more robust evaluation signal than conventional probability-change metrics.
    Abstract Current AI-based methods do not provide comprehensible physical interpretations of the utilized data, extracted features, and predictions/inference operations. As a result, deep learning models trained using high-resolution satellite imagery lack transparency and explainability and can be merely seen as a black box, which limits their wide-level adoption. Experts need help understanding the complex behavior of AI models and the underlying decision-making process. The explainable artificial intelligence (XAI) field is an emerging field providing means for robust, practical, and trustworthy deployment of AI models. Several XAI techniques have been proposed for image classification tasks, whereas the interpretation of image segmentation remains largely unexplored. This paper offers to bridge this gap by adapting the recent XAI classification algorithms and making them usable for muti-class image segmentation, where we mainly focus on buildings' segmentation from high-resolution satellite images. To benchmark and compare the performance of the proposed approaches, we introduce a new XAI evaluation methodology and metric based on "Entropy" to measure the model uncertainty. Conventional XAI evaluation methods rely mainly on feeding area-of-interest regions from the image back to the pre-trained (utility) model and then calculating the average change in the probability of the target class. Those evaluation metrics lack the needed robustness, and we show that using Entropy to monitor the model uncertainty in segmenting the pixels within the target class is more suitable. We hope this work will pave the way for additional XAI research for image segmentation and applications in the remote sensing discipline.
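The proposed evaluation signal is predictive entropy over the pixels of the target class: higher entropy means higher model uncertainty. Computing it is straightforward (binary case shown; the probability map and threshold below are stand-ins):

```python
import numpy as np

def mean_entropy(prob_map, mask, eps=1e-8):
    """Mean binary entropy of class probabilities over pixels in `mask`.
    High entropy = high model uncertainty on the target-class region."""
    p = np.clip(prob_map[mask], eps, 1 - eps)
    return float(np.mean(-p * np.log2(p) - (1 - p) * np.log2(1 - p)))

rng = np.random.default_rng(0)
prob_map = rng.uniform(size=(64, 64))   # stand-in softmax "building" probabilities
building_mask = prob_map > 0.5          # pixels predicted as the target class

print(f"uncertainty: {mean_entropy(prob_map, building_mask):.3f} bits")
```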

Formalizing Natural Language Intent into Program Specifications via Large Language Models

  • paper_url: http://arxiv.org/abs/2310.01831
  • repo_url: None
  • paper_authors: Madeline Endres, Sarah Fakhoury, Saikat Chakraborty, Shuvendu K. Lahiri
  • for: This paper leverages Large Language Models (LLMs) to translate informal natural-language specifications into formal method postconditions, expressed as program assertions, to improve code quality and trustworthiness.
  • methods: The paper introduces and validates metrics to measure and compare different LLM4nl2post approaches based on the correctness and discriminative power of the generated postconditions, and assesses their quality with both qualitative and quantitative methods.
  • results: LLM4nl2post postconditions are generally correct and able to discriminate incorrect code; specifications generated from natural language caught 70 real-world historical bugs from Defects4J.
    Abstract Informal natural language that describes code functionality, such as code comments or function documentation, may contain substantial information about a program's intent. However, there is typically no guarantee that a program's implementation and natural language documentation are aligned. In the case of a conflict, leveraging information in code-adjacent natural language has the potential to enhance fault localization, debugging, and code trustworthiness. In practice, however, this information is often underutilized due to the inherent ambiguity of natural language which makes natural language intent challenging to check programmatically. The "emergent abilities" of Large Language Models (LLMs) have the potential to facilitate the translation of natural language intent to programmatically checkable assertions. However, it is unclear if LLMs can correctly translate informal natural language specifications into formal specifications that match programmer intent. Additionally, it is unclear if such translation could be useful in practice. In this paper, we describe LLM4nl2post, the problem of leveraging LLMs for transforming informal natural language to formal method postconditions, expressed as program assertions. We introduce and validate metrics to measure and compare different LLM4nl2post approaches, using the correctness and discriminative power of generated postconditions. We then perform qualitative and quantitative methods to assess the quality of LLM4nl2post postconditions, finding that they are generally correct and able to discriminate incorrect code. Finally, we find that LLM4nl2post via LLMs has the potential to be helpful in practice; specifications generated from natural language were able to catch 70 real-world historical bugs from Defects4J.
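Concretely, "correctness" and "discriminative power" of a generated postcondition mean that it must hold on a reference implementation and fail on buggy ones. A toy harness; the function and the candidate postcondition are invented examples:

```python
def reference_abs(x):
    return -x if x < 0 else x

def buggy_abs(x):
    return x  # forgets to flip negatives

# A postcondition an LLM might produce from "returns the absolute value":
postcondition = lambda x, result: result >= 0 and result in (x, -x)

inputs = [-3, -1, 0, 2, 7]
correct = all(postcondition(x, reference_abs(x)) for x in inputs)
discriminates = not all(postcondition(x, buggy_abs(x)) for x in inputs)

print(f"holds on reference: {correct}")        # True -> the postcondition is sound
print(f"rejects buggy code: {discriminates}")  # True -> it has discriminative power
```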

Trainable Noise Model as an XAI evaluation method: application on Sobol for remote sensing image segmentation

  • paper_url: http://arxiv.org/abs/2310.01828
  • repo_url: None
  • paper_authors: Hossein Shreim, Abdul Karim Gizzini, Ali J. Ghandour
  • for: This paper adapts the gradient-free Sobol XAI method to semantic segmentation, addressing the interpretability of black-box models in computer vision applications such as remote sensing.
  • methods: The adapted Sobol method is paired with a quantitative XAI evaluation method based on a learnable noise model, whose objective is to induce noise on the explanation maps: higher induced noise signifies lower accuracy and vice versa.
  • results: A benchmark analysis using the proposed noise-based evaluation compares Seg-Grad-CAM, Seg-Grad-CAM++, and Seg-Sobol on high-resolution satellite images, constituting the first attempt to run and evaluate XAI methods on such imagery.
    Abstract eXplainable Artificial Intelligence (XAI) has emerged as an essential requirement when dealing with mission-critical applications, ensuring transparency and interpretability of the employed black box AI models. The significance of XAI spans various domains, from healthcare to finance, where understanding the decision-making process of deep learning algorithms is essential. Most AI-based computer vision models are often black boxes; hence, providing explainability of deep neural networks in image processing is crucial for their wide adoption and deployment in medical image analysis, autonomous driving, and remote sensing applications. Recently, several XAI methods for image classification tasks have been introduced. On the contrary, image segmentation has received comparatively less attention in the context of explainability, although it is a fundamental task in computer vision applications, especially in remote sensing. Only some research proposes gradient-based XAI algorithms for image segmentation. This paper adapts the recent gradient-free Sobol XAI method for semantic segmentation. To measure the performance of the Sobol method for segmentation, we propose a quantitative XAI evaluation method based on a learnable noise model. The main objective of this model is to induce noise on the explanation maps, where higher induced noise signifies low accuracy and vice versa. A benchmark analysis is conducted to evaluate and compare performance of three XAI methods, including Seg-Grad-CAM, Seg-Grad-CAM++ and Seg-Sobol using the proposed noise-based evaluation technique. This constitutes the first attempt to run and evaluate XAI methods using high-resolution satellite images.

Learning and reusing primitive behaviours to improve Hindsight Experience Replay sample efficiency

  • paper_url: http://arxiv.org/abs/2310.01827
  • repo_url: https://github.com/franroldans/qmp-her
  • paper_authors: Francisco Roldan Sanchez, Qiang Wang, David Cordova Bulens, Kevin McGuinness, Stephen Redmond, Noel O’Connor
  • for: Improve the sample efficiency of training RL-based agents on goal-based robotic manipulation tasks with sparse rewards, where Hindsight Experience Replay (HER) alone provides no guidance during exploration.
  • methods: Use primitive behaviours previously learned on simple tasks to guide the agent toward more rewarding actions during exploration; a critic network decides at each timestep whether to use an action proposed by a previously learned primitive policy.
  • results: Compared against HER and more efficient variants on several block manipulation tasks, agents learn a successful policy faster in terms of both sample efficiency and computation time.
    Abstract Hindsight Experience Replay (HER) is a technique used in reinforcement learning (RL) that has proven to be very efficient for training off-policy RL-based agents to solve goal-based robotic manipulation tasks using sparse rewards. Even though HER improves the sample efficiency of RL-based agents by learning from mistakes made in past experiences, it does not provide any guidance while exploring the environment. This leads to very large training times due to the volume of experience required to train an agent using this replay strategy. In this paper, we propose a method that uses primitive behaviours that have been previously learned to solve simple tasks in order to guide the agent toward more rewarding actions during exploration while learning other more complex tasks. This guidance, however, is not executed by a manually designed curriculum, but rather using a critic network to decide at each timestep whether or not to use the actions proposed by the previously-learned primitive policies. We evaluate our method by comparing its performance against HER and other more efficient variations of this algorithm in several block manipulation tasks. We demonstrate the agents can learn a successful policy faster when using our proposed method, both in terms of sample efficiency and computation time. Code is available at https://github.com/franroldans/qmp-her.
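The gating idea, a critic deciding per timestep whether a pretrained primitive's action beats the learner's own proposal, fits in a few lines. Everything below (state sizes, the Q-network, the primitives themselves) is schematic:

```python
import torch
import torch.nn as nn

STATE_DIM, ACT_DIM = 8, 2

q_net = nn.Sequential(nn.Linear(STATE_DIM + ACT_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
policy = nn.Linear(STATE_DIM, ACT_DIM)           # the policy being trained

def primitive_reach(state):                      # previously learned behaviours
    return torch.tanh(state[..., :ACT_DIM])

def primitive_push(state):
    return -torch.tanh(state[..., :ACT_DIM])

primitives = [primitive_reach, primitive_push]

def select_action(state):
    """Let the critic pick the best action among the learner's own
    proposal and those suggested by the pretrained primitives."""
    candidates = [policy(state)] + [p(state) for p in primitives]
    with torch.no_grad():
        q_values = [q_net(torch.cat([state, a], dim=-1)) for a in candidates]
    best = int(torch.argmax(torch.stack([q.mean() for q in q_values])))
    return candidates[best]

state = torch.randn(1, STATE_DIM)
print(select_action(state))
```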

Empirical Study of PEFT techniques for Winter Wheat Segmentation

  • paper_url: http://arxiv.org/abs/2310.01825
  • repo_url: None
  • paper_authors: Mohamad Hasan Zahweh, Hasan Nasrallah, Mustafa Shukor, Ghaleb Faour, Ali J. Ghandour
  • for: This paper explores the feasibility of cross-area and cross-year out-of-distribution generalization for crop monitoring using the State-of-the-Art (SOTA) wheat crop monitoring model.
  • methods: Various PEFT (Parameter Efficient Fine Tuning) techniques, including BigFit, LoRA, Adaptformer, and prompt tuning, are used to adapt the SOTA TSViT model to winter wheat field segmentation.
  • results: Training only 0.7% of the parameters of the whole TSViT architecture yields results comparable to full fine-tuning. The in-house labeled Beqaa-Lebanon dataset comprises high-quality annotated polygons for wheat and non-wheat classes covering 170 km² over five consecutive years; using Sentinel-2 images, the model achieves an 84% F1-score.
    Abstract Parameter Efficient Fine Tuning (PEFT) techniques have recently experienced significant growth and have been extensively employed to adapt large vision and language models to various domains, enabling satisfactory model performance with minimal computational needs. Despite these advances, more research has yet to delve into potential PEFT applications in real-life scenarios, particularly in the critical domains of remote sensing and crop monitoring. The diversity of climates across different regions and the need for comprehensive large-scale datasets have posed significant obstacles to accurately identify crop types across varying geographic locations and changing growing seasons. This study seeks to bridge this gap by comprehensively exploring the feasibility of cross-area and cross-year out-of-distribution generalization using the State-of-the-Art (SOTA) wheat crop monitoring model. The aim of this work is to explore PEFT approaches for crop monitoring. Specifically, we focus on adapting the SOTA TSViT model to address winter wheat field segmentation, a critical task for crop monitoring and food security. This adaptation process involves integrating different PEFT techniques, including BigFit, LoRA, Adaptformer, and prompt tuning. Using PEFT techniques, we achieved notable results comparable to those achieved using full fine-tuning methods while training only a mere 0.7% of the parameters of the whole TSViT architecture. The in-house labeled dataset, referred to as the Beqaa-Lebanon dataset, comprises high-quality annotated polygons for wheat and non-wheat classes with a total surface of 170 km², over five consecutive years. Using Sentinel-2 images, our model achieved a 84% F1-score. We intend to publicly release the Lebanese winter wheat data set, code repository, and model weights.
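Of the PEFT techniques compared, LoRA is the easiest to show in isolation: the frozen pretrained weight is augmented by a trainable low-rank update. A self-contained sketch; the rank and scaling are common defaults, not the settings used in the paper:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W0 x + (alpha/r) * B(Ax); only A and B are trained."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # start at zero update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")  # ~2% at rank 8
```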

Mini-BEHAVIOR: A Procedurally Generated Benchmark for Long-horizon Decision-Making in Embodied AI

  • paper_url: http://arxiv.org/abs/2310.01824
  • repo_url: https://github.com/stanfordvl/mini_behavior
  • paper_authors: Emily Jin, Jiaheng Hu, Zhuoyi Huang, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, Roberto Martín-Martín
  • for: This work introduces Mini-BEHAVIOR, a new embodied-AI benchmark for testing and evaluating decision-making and planning on long-horizon household tasks.
  • methods: The benchmark is a fast, realistic Gridworld environment with procedural generation, enabling countless task variations and open-ended learning, together with implementations of household tasks from the original BEHAVIOR benchmark and starter code for data collection and reinforcement-learning agent training.
  • results: Mini-BEHAVIOR offers a fast, open-ended benchmark and a user-friendly entry point for evaluating and developing decision-making and planning solutions in embodied AI.
    Abstract We present Mini-BEHAVIOR, a novel benchmark for embodied AI that challenges agents to use reasoning and decision-making skills to solve complex activities that resemble everyday human challenges. The Mini-BEHAVIOR environment is a fast, realistic Gridworld environment that offers the benefits of rapid prototyping and ease of use while preserving a symbolic level of physical realism and complexity found in complex embodied AI benchmarks. We introduce key features such as procedural generation, to enable the creation of countless task variations and support open-ended learning. Mini-BEHAVIOR provides implementations of various household tasks from the original BEHAVIOR benchmark, along with starter code for data collection and reinforcement learning agent training. In essence, Mini-BEHAVIOR offers a fast, open-ended benchmark for evaluating decision-making and planning solutions in embodied AI. It serves as a user-friendly entry point for research and facilitates the evaluation and development of solutions, simplifying their assessment and development while advancing the field of embodied AI. Code is publicly available at https://github.com/StanfordVL/mini_behavior.

MIMO-NeRF: Fast Neural Rendering with Multi-input Multi-output Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2310.01821
  • repo_url: None
  • paper_authors: Takuhiro Kaneko
  • for: Improve the trade-off between rendering speed and quality in NeRFs while keeping training time reasonable.
  • methods: Replace the single-input single-output (SISO) MLP with a multi-input multi-output (MIMO) MLP that maps groups of samples jointly, reducing the number of MLP runs; a self-supervised learning method regularizes the MIMO MLP with multiple fast reformulated MLPs to alleviate the resulting ambiguity without pretrained models.
  • results: Comparative and ablation studies show MIMO-NeRF obtains a good speed-quality trade-off with reasonable training time, and it is compatible with and complementary to prior fast NeRFs such as DONeRF and TensoRF.
    Abstract Neural radiance fields (NeRFs) have shown impressive results for novel view synthesis. However, they depend on the repetitive use of a single-input single-output multilayer perceptron (SISO MLP) that maps 3D coordinates and view direction to the color and volume density in a sample-wise manner, which slows the rendering. We propose a multi-input multi-output NeRF (MIMO-NeRF) that reduces the number of MLPs running by replacing the SISO MLP with a MIMO MLP and conducting mappings in a group-wise manner. One notable challenge with this approach is that the color and volume density of each point can differ according to a choice of input coordinates in a group, which can lead to some notable ambiguity. We also propose a self-supervised learning method that regularizes the MIMO MLP with multiple fast reformulated MLPs to alleviate this ambiguity without using pretrained models. The results of a comprehensive experimental evaluation including comparative and ablation studies are presented to show that MIMO-NeRF obtains a good trade-off between speed and quality with a reasonable training time. We then demonstrate that MIMO-NeRF is compatible with and complementary to previous advancements in NeRFs by applying it to two representative fast NeRFs, i.e., a NeRF with sample reduction (DONeRF) and a NeRF with alternative representations (TensoRF).
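The MIMO idea replaces the per-sample SISO MLP with one that maps a group of G samples to G outputs in a single pass, cutting MLP invocations by a factor of G. A schematic with assumed group size and widths; note that each output can depend on every input in its group, which is exactly the ambiguity the paper's self-supervised regularization targets:

```python
import torch
import torch.nn as nn

G, IN_DIM, OUT_DIM = 4, 6, 4   # group size; (xyz + view dir); (rgb + sigma)

mimo_mlp = nn.Sequential(
    nn.Linear(G * IN_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, G * OUT_DIM),
)

n_samples = 4096                         # samples along all rays in a batch
pts = torch.randn(n_samples, IN_DIM)     # 3D coordinates + view directions
groups = pts.view(n_samples // G, G * IN_DIM)
out = mimo_mlp(groups).view(n_samples, OUT_DIM)
print(out.shape)   # one MLP call per G samples instead of one per sample
```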

Discrete, compositional, and symbolic representations through attractor dynamics

  • paper_url: http://arxiv.org/abs/2310.01807
  • repo_url: None
  • paper_authors: Andrew Nam, Eric Elmoznino, Nikolay Malkin, Chen Sun, Yoshua Bengio, Guillaume Lajoie
  • for: This paper investigates compositionality in discrete symbolic systems and how modeling attractor dynamics can implement discretization in a more neurally plausible manner.
  • methods: Building on established work in attractor networks and introducing novel training methods, the continuous representation space is partitioned into basins that correspond to sequences of symbols.
  • results: Imposing structure in the symbolic space produces compositionality in the attractor-supported representation space of rich sensory inputs; the model also exhibits an information bottleneck, decomposing the rich information of a sensory input into stable components encoding symbolic information, a process thought to play a role in conscious experience.
    Abstract Compositionality is an important feature of discrete symbolic systems, such as language and programs, as it enables them to have infinite capacity despite a finite symbol set. It serves as a useful abstraction for reasoning in both cognitive science and in AI, yet the interface between continuous and symbolic processing is often imposed by fiat at the algorithmic level, such as by means of quantization or a softmax sampling step. In this work, we explore how discretization could be implemented in a more neurally plausible manner through the modeling of attractor dynamics that partition the continuous representation space into basins that correspond to sequences of symbols. Building on established work in attractor networks and introducing novel training methods, we show that imposing structure in the symbolic space can produce compositionality in the attractor-supported representation space of rich sensory inputs. Lastly, we argue that our model exhibits the process of an information bottleneck that is thought to play a role in conscious experience, decomposing the rich information of a sensory input into stable components encoding symbolic information.
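
The paper builds on attractor networks; as a minimal, textbook-level illustration of how attractor dynamics carve a continuous space into basins around stored patterns, here is a Hopfield-style network with a Hebbian rule (this is standard machinery, not the authors' novel training method):

```python
import numpy as np

# Store a few binary patterns as attractors via the Hebbian outer-product rule.
patterns = np.array([[1, -1, 1, -1, 1, -1],
                     [1, 1, 1, -1, -1, -1]])
n = patterns.shape[1]
W = sum(np.outer(p, p) for p in patterns) / n
np.fill_diagonal(W, 0)

def settle(x, steps=20):
    """Iterate the attractor dynamics; a continuous start state falls into
    the basin of the nearest stored (symbolic) pattern."""
    for _ in range(steps):
        x = np.sign(W @ x)
        x[x == 0] = 1
    return x

noisy = np.array([0.9, -0.8, 1.1, -0.2, 0.7, -1.0])  # a perturbed pattern 0
print(settle(noisy))  # -> [ 1. -1.  1. -1.  1. -1.]
```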

Improvement and Enhancement of YOLOv5 Small Target Recognition Based on Multi-module Optimization

  • paper_url: http://arxiv.org/abs/2310.01806
  • repo_url: None
  • paper_authors: Qingyang Li, Yuchen Li, Hongyi Duan, JiaLiang Kang, Jianan Zhang, Xueqian Gan, Ruotong Xu
  • for: Deeply studies and addresses the limitations of the YOLOv5s model on small-target detection tasks.
  • methods: Proposes several improvement strategies: a GhostNet-based convolutional module, RepGFPN-based neck optimization, CA and Transformer attention mechanisms, and an NWD-based loss, to raise the model's precision, recall, and mAP.
  • results: Experiments confirm that these strategies improve performance, with the improved model excelling in real-world tests involving complex backgrounds and tiny targets.
    Abstract In this paper, the limitations of YOLOv5s model on small target detection task are deeply studied and improved. The performance of the model is successfully enhanced by introducing GhostNet-based convolutional module, RepGFPN-based Neck module optimization, CA and Transformer's attention mechanism, and loss function improvement using NWD. The experimental results validate the positive impact of these improvement strategies on model precision, recall and mAP. In particular, the improved model shows significant superiority in dealing with complex backgrounds and tiny targets in real-world application tests. This study provides an effective optimization strategy for the YOLOv5s model on small target detection, and lays a solid foundation for future related research and applications.
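
For context, here is a compact PyTorch sketch of the GhostNet-style convolution module that the first improvement builds on, following the published GhostNet design (channel counts are illustrative, and the authors' exact variant may differ):

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Half the output channels come from a regular conv; the other half are
    'ghost' features produced by a cheap depthwise conv, saving FLOPs."""
    def __init__(self, in_ch, out_ch, ratio=2):
        super().__init__()
        primary = out_ch // ratio
        self.primary_conv = nn.Sequential(
            nn.Conv2d(in_ch, primary, 1, bias=False),
            nn.BatchNorm2d(primary), nn.ReLU(inplace=True))
        self.cheap_conv = nn.Sequential(
            nn.Conv2d(primary, out_ch - primary, 3, padding=1,
                      groups=primary, bias=False),
            nn.BatchNorm2d(out_ch - primary), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary_conv(x)
        return torch.cat([y, self.cheap_conv(y)], dim=1)

print(GhostModule(64, 128)(torch.rand(1, 64, 32, 32)).shape)  # (1, 128, 32, 32)
```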

Comparative study of microgrid optimal scheduling under multi-optimization algorithm fusion

  • paper_url: http://arxiv.org/abs/2310.01805
  • repo_url: None
  • paper_authors: Hongyi Duan, Qingyang Li, Yuchen Li, Jianan Zhang, Yuming Xie
  • for: Explores the relationship between the operational and environmental costs of microgrids through multi-objective optimization models.
  • methods: Proposes an integrated approach that fuses several optimization algorithms: Genetic Algorithm, Simulated Annealing, Ant Colony Optimization, and Particle Swarm Optimization.
  • results: Simulations show that these algorithms yield different dispatch results under economic versus environmental dispatch, revealing the distinct roles of diesel generators and micro gas turbines in microgrids.
    Abstract As global attention on renewable and clean energy grows, the research and implementation of microgrids become paramount. This paper delves into the methodology of exploring the relationship between the operational and environmental costs of microgrids through multi-objective optimization models. By integrating various optimization algorithms like Genetic Algorithm, Simulated Annealing, Ant Colony Optimization, and Particle Swarm Optimization, we propose an integrated approach for microgrid optimization. Simulation results depict that these algorithms provide different dispatch results under economic and environmental dispatch, revealing distinct roles of diesel generators and micro gas turbines in microgrids. Overall, this study offers in-depth insights and practical guidance for microgrid design and operation.
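
As a toy illustration of the dispatch problem being optimized, the sketch below minimizes a weighted sum of economic and environmental cost for two units with a small particle swarm. All coefficients are invented for the example, and sweeping the weight w traces the economic/environmental trade-off:

```python
import numpy as np

rng = np.random.default_rng(0)
demand = 100.0  # kW supplied by a diesel generator (P[0]) and a micro gas turbine (P[1])

def cost(P, w=0.5):
    fuel = 0.25 * P[0]**2 + 10 * P[0] + 0.20 * P[1]**2 + 12 * P[1]  # economic
    emis = 0.80 * P[0] + 0.30 * P[1]                                # environmental
    penalty = 1e4 * abs(P.sum() - demand)                           # power balance
    return w * fuel + (1 - w) * emis + penalty

# A tiny particle swarm over the two generator setpoints.
x = rng.uniform(0, demand, size=(30, 2))
v = np.zeros_like(x)
pbest, pval = x.copy(), np.array([cost(p) for p in x])
for _ in range(200):
    gbest = pbest[pval.argmin()]
    v = (0.7 * v + 1.5 * rng.random(x.shape) * (pbest - x)
         + 1.5 * rng.random(x.shape) * (gbest - x))
    x = np.clip(x + v, 0, demand)
    f = np.array([cost(p) for p in x])
    better = f < pval
    pbest[better], pval[better] = x[better], f[better]
print("dispatch:", pbest[pval.argmin()], "cost:", pval.min())
```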

Large Language Models Cannot Self-Correct Reasoning Yet

  • paper_url: http://arxiv.org/abs/2310.01798
  • repo_url: None
  • paper_authors: Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, Denny Zhou
  • for: Examines the role and limitations of self-correction in LLMs, with the goal of improving the accuracy and appropriateness of their generated content.
  • methods: Critically evaluates intrinsic self-correction, in which an LLM attempts to correct its initial responses based solely on its inherent capabilities, without external feedback.
  • results: Without external feedback, LLMs struggle to self-correct their reasoning, and at times their performance even degrades after self-correction.
    Abstract Large Language Models (LLMs) have emerged as a groundbreaking technology with their unparalleled text generation capabilities across various applications. Nevertheless, concerns persist regarding the accuracy and appropriateness of their generated content. A contemporary methodology, self-correction, has been proposed as a remedy to these issues. Building upon this premise, this paper critically examines the role and efficacy of self-correction within LLMs, shedding light on its true potential and limitations. Central to our investigation is the notion of intrinsic self-correction, whereby an LLM attempts to correct its initial responses based solely on its inherent capabilities, without the crutch of external feedback. In the context of reasoning, our research indicates that LLMs struggle to self-correct their responses without external feedback, and at times, their performance might even degrade post self-correction. Drawing from these insights, we offer suggestions for future research and practical applications in this field.
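
The intrinsic self-correction protocol under study can be phrased as a simple prompting loop; `ask_llm` below is a hypothetical stand-in for the model API, and the prompt wording is illustrative:

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real client in practice."""
    raise NotImplementedError

def intrinsic_self_correct(question: str, rounds: int = 1) -> str:
    # Step 1: get an initial answer.
    answer = ask_llm(f"Answer the following question.\n\n{question}")
    for _ in range(rounds):
        # Step 2: ask the model to review its own answer -- no external
        # feedback (ground truth, tools, humans) is provided.
        critique = ask_llm(
            f"Question: {question}\nYour answer: {answer}\n"
            "Review your previous answer and find problems with it.")
        # Step 3: revise based solely on the self-generated critique.
        answer = ask_llm(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\n"
            "Based on the problems you found, improve your answer.")
    return answer
```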

Online POMDP Planning with Anytime Deterministic Guarantees

  • paper_url: http://arxiv.org/abs/2310.01791
  • repo_url: https://github.com/moranbar/Online-POMDP-Planning-with-Anytime-Deterministic-Guarantees
  • paper_authors: Moran Barenboim, Vadim Indelman
  • for: This paper is written for researchers and practitioners interested in planning under uncertainty, particularly in real-world scenarios where autonomous agents operate.
  • methods: The paper uses partially observable Markov decision processes (POMDPs) to mathematically formalize planning under uncertainty. It also employs approximate algorithms such as tree search and sample-based methodologies to solve POMDPs, which offer probabilistic and asymptotic guarantees towards the optimal solution.
  • results: The paper derives a deterministic relationship between a simplified solution and the theoretically optimal one, providing bounds for selecting a subset of observations to branch from while computing a complete belief at each posterior node. The paper also extends these bounds to support reduction of both the state and observation spaces, and demonstrates how these guarantees can be integrated with existing state-of-the-art solvers. Additionally, the paper provides supporting experimental results to substantiate its findings.
    Abstract Autonomous agents operating in real-world scenarios frequently encounter uncertainty and make decisions based on incomplete information. Planning under uncertainty can be mathematically formalized using partially observable Markov decision processes (POMDPs). However, finding an optimal plan for POMDPs can be computationally expensive and is feasible only for small tasks. In recent years, approximate algorithms, such as tree search and sample-based methodologies, have emerged as state-of-the-art POMDP solvers for larger problems. Despite their effectiveness, these algorithms offer only probabilistic and often asymptotic guarantees toward the optimal solution due to their dependence on sampling. To address these limitations, we derive a deterministic relationship between a simplified solution that is easier to obtain and the theoretically optimal one. First, we derive bounds for selecting a subset of the observations to branch from while computing a complete belief at each posterior node. Then, since a complete belief update may be computationally demanding, we extend the bounds to support reduction of both the state and the observation spaces. We demonstrate how our guarantees can be integrated with existing state-of-the-art solvers that sample a subset of states and observations. As a result, the returned solution holds deterministic bounds relative to the optimal policy. Lastly, we substantiate our findings with supporting experimental results.
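
For orientation, here is a minimal sketch of the exact discrete belief update the bounds reason about: branching on only a subset of observations means computing the posterior belief for some observations while the total probability mass of the skipped branches bounds their contribution. The toy model arrays are illustrative:

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """Exact POMDP belief update: b'(s') ∝ Z[a, s', o] * sum_s T[a, s, s'] b(s)."""
    pred = b @ T[a]               # predictive state distribution
    unnorm = Z[a][:, o] * pred    # weight by observation likelihood
    p_o = unnorm.sum()            # probability of observing o
    return unnorm / p_o, p_o

# Toy model: 2 states, 1 action, 2 observations.
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])   # T[a, s, s']
Z = np.array([[[0.8, 0.2], [0.3, 0.7]]])   # Z[a, s', o]
b = np.array([0.5, 0.5])
# Branch only on one observation; the remaining mass (1 - p_o) bounds the
# contribution of the observations we chose not to branch on.
b1, p_o = belief_update(b, a=0, o=0, T=T, Z=Z)
print(b1, "mass covered:", p_o)
```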

Can large language models provide useful feedback on research papers? A large-scale empirical analysis

  • paper_url: http://arxiv.org/abs/2310.01783
  • repo_url: https://github.com/weixin-liang/llm-scientific-feedback
  • paper_authors: Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Smith, Yian Yin, Daniel McFarland, James Zou
  • for: The paper aims to evaluate the utility of using large language models (LLMs) to generate scientific feedback on research manuscripts.
  • methods: The authors created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers and evaluated the quality of GPT-4’s feedback through two large-scale studies.
  • results: The study found that GPT-4’s generated feedback overlaps with human peer reviewer feedback, and more than half of the users found the feedback helpful. However, the authors also identified several limitations of using LLM-generated feedback.
    Abstract Expert feedback lays the foundation of rigorous research. However, the rapid growth of scholarly production and intricate knowledge specialization challenge the conventional scientific feedback mechanisms. High-quality peer reviews are increasingly difficult to obtain. Researchers who are more junior or from under-resourced settings have especially hard times getting timely feedback. With the breakthrough of large language models (LLM) such as GPT-4, there is growing interest in using LLMs to generate scientific feedback on research manuscripts. However, the utility of LLM-generated feedback has not been systematically studied. To address this gap, we created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers. We evaluated the quality of GPT-4's feedback through two large-scale studies. We first quantitatively compared GPT-4's generated feedback with human peer reviewer feedback in 15 Nature family journals (3,096 papers in total) and the ICLR machine learning conference (1,709 papers). The overlap in the points raised by GPT-4 and by human reviewers (average overlap 30.85% for Nature journals, 39.23% for ICLR) is comparable to the overlap between two human reviewers (average overlap 28.58% for Nature journals, 35.25% for ICLR). The overlap between GPT-4 and human reviewers is larger for the weaker papers. We then conducted a prospective user study with 308 researchers from 110 US institutions in the field of AI and computational biology to understand how researchers perceive feedback generated by our GPT-4 system on their own papers. Overall, more than half (57.4%) of the users found GPT-4 generated feedback helpful/very helpful and 82.4% found it more beneficial than feedback from at least some human reviewers. While our findings show that LLM-generated feedback can help researchers, we also identify several limitations.
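
The headline overlap numbers reduce, given matched comment sets, to a simple hit-rate computation; the hard part is the matcher that decides when two comments raise the same point, which is only stubbed here with a naive stand-in:

```python
def overlap(comments_a, comments_b, same_point):
    """Fraction of comments in A that raise a point also raised in B.
    `same_point` is a user-supplied matcher (e.g., semantic similarity)."""
    hits = sum(any(same_point(a, b) for b in comments_b) for a in comments_a)
    return hits / len(comments_a) if comments_a else 0.0

# Toy matcher for illustration only; the study used semantic matching.
naive = lambda a, b: a.lower() == b.lower()
gpt4 = ["missing ablations", "unclear notation", "no limitations section"]
human = ["No limitations section", "small sample size"]
print(f"{overlap(gpt4, human, naive):.2%}")  # 33.33%
```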

STAMP: Differentiable Task and Motion Planning via Stein Variational Gradient Descent

  • paper_url: http://arxiv.org/abs/2310.01775
  • repo_url: None
  • paper_authors: Yewon Lee, Philip Huang, Krishna Murthy Jatavallabhula, Andrew Z. Li, Fabian Damken, Eric Heiden, Kevin Smith, Derek Nowrouzezahrai, Fabio Ramos, Florian Shkurti
  • for: solves task and motion planning problems for manipulation tasks, such as using tools or assembling parts, by leveraging parallelization and differentiable simulation to efficiently search for multiple diverse plans.
  • methods: uses Stein Variational Gradient Descent and parallelized differentiable physics simulators on the GPU to efficiently obtain gradients for inference, and employs imitation learning to introduce action abstractions that reduce the inference problem to lower dimensions.
  • results: produces multiple diverse plans in parallel and searches for plans more efficiently compared to existing TAMP baselines.
    Abstract Planning for many manipulation tasks, such as using tools or assembling parts, often requires both symbolic and geometric reasoning. Task and Motion Planning (TAMP) algorithms typically solve these problems by conducting a tree search over high-level task sequences while checking for kinematic and dynamic feasibility. While performant, most existing algorithms are highly inefficient as their time complexity grows exponentially with the number of possible actions and objects. Additionally, they only find a single solution to problems in which many feasible plans may exist. To address these limitations, we propose a novel algorithm called Stein Task and Motion Planning (STAMP) that leverages parallelization and differentiable simulation to efficiently search for multiple diverse plans. STAMP relaxes discrete-and-continuous TAMP problems into continuous optimization problems that can be solved using variational inference. Our algorithm builds upon Stein Variational Gradient Descent, a gradient-based variational inference algorithm, and parallelized differentiable physics simulators on the GPU to efficiently obtain gradients for inference. Further, we employ imitation learning to introduce action abstractions that reduce the inference problem to lower dimensions. We demonstrate our method on two TAMP problems and empirically show that STAMP is able to: 1) produce multiple diverse plans in parallel; and 2) search for plans more efficiently compared to existing TAMP baselines.
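
At STAMP's core is Stein Variational Gradient Descent; the sketch below shows the standard SVGD update on a toy two-mode target, where the kernel repulsion term keeps particles diverse (which is what yields multiple plans in parallel). The TAMP relaxation and differentiable simulator are out of scope here:

```python
import numpy as np

def svgd_step(x, grad_logp, h=0.5, eps=0.1):
    """One SVGD update: particles follow the score while an RBF kernel
    repulsion term keeps them spread over distinct modes."""
    diff = x[:, None, :] - x[None, :, :]            # (n, n, d)
    K = np.exp(-(diff ** 2).sum(-1) / (2 * h ** 2)) # kernel matrix
    gradK = -diff / h ** 2 * K[..., None]           # grad of k(x_j, x_i) in x_j
    phi = (K @ grad_logp + gradK.sum(axis=0)) / len(x)
    return x + eps * phi

# Toy target: equal-weight mixture of two unit Gaussians at -2 and +2.
def grad_logp(x):
    w = 1 / (1 + np.exp(-4 * x))   # responsibility of the +2 mode
    return w * (2 - x) + (1 - w) * (-2 - x)

x = np.random.default_rng(0).normal(size=(50, 1))
for _ in range(300):
    x = svgd_step(x, grad_logp(x))
print("particles settle near both modes:", x.min(), x.max())
```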

A simple connection from loss flatness to compressed representations in neural networks

  • paper_url: http://arxiv.org/abs/2310.01770
  • repo_url: None
  • paper_authors: Shirui Chen, Stefano Recanatesi, Eric Shea-Brown
  • for: To study the generalization capacity of deep neural networks.
  • methods: Relates the shape of the loss landscape in parameter space to the structure of the representation manifold in feature space.
  • results: Shows that in the last phase of learning, compression of the representation-manifold volume correlates with the flatness of the loss around the minima explored by optimization, as predicted by a simple mathematical relationship: loss flatness implies compression of neural representations.
    Abstract Deep neural networks' generalization capacity has been studied in a variety of ways, including at least two distinct categories of approach: one based on the shape of the loss landscape in parameter space, and the other based on the structure of the representation manifold in feature space (that is, in the space of unit activities). These two approaches are related, but they are rarely studied together and explicitly connected. Here, we present a simple analysis that makes such a connection. We show that, in the last phase of learning of deep neural networks, compression of the volume of the manifold of neural representations correlates with the flatness of the loss around the minima explored by ongoing parameter optimization. We show that this is predicted by a relatively simple mathematical relationship: loss flatness implies compression of neural representations. Our results build closely on prior work of \citet{ma_linear_2021}, which shows how flatness (i.e., small eigenvalues of the loss Hessian) develops in late phases of learning and lead to robustness to perturbations in network inputs. Moreover, we show there is no similarly direct connection between local dimensionality and sharpness, suggesting that this property may be controlled by different mechanisms than volume and hence may play a complementary role in neural representations. Overall, we advance a dual perspective on generalization in neural networks in both parameter and feature space.
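
A minimal sketch of how the two quantities being correlated could be measured: loss sharpness via power iteration on the Hessian (its top eigenvalue), and representation volume via the log-determinant of the feature covariance. This is a plausible instrumentation, not the paper's exact protocol; model, loss, and features are placeholders:

```python
import torch

def top_hessian_eig(loss, params, iters=20):
    """Power iteration on the loss Hessian; small values = flat minimum."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]
    hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
    return sum((a * b).sum() for a, b in zip(hv, v)).item()  # Rayleigh quotient

def log_volume(features, eps=1e-5):
    """Log-det of the feature covariance as a proxy for manifold volume;
    more negative = more compressed representations."""
    z = features - features.mean(0)
    cov = z.T @ z / (len(z) - 1)
    return torch.logdet(cov + eps * torch.eye(cov.shape[0])).item()

# Tiny demo: the quadratic loss 1.5 * w^2 has Hessian eigenvalue 3.
w = torch.tensor([1.0], requires_grad=True)
print(top_hessian_eig(1.5 * (w ** 2).sum(), [w]))  # ~3.0
print(log_volume(torch.randn(100, 5)))             # ~0 for white features
```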

Differentially Encoded Observation Spaces for Perceptive Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2310.01767
  • repo_url: https://github.com/a2r-lab/diffcompressdrl
  • paper_authors: Lev Grossman, Brian Plancher
  • for: To reduce the training cost of deep reinforcement learning (DRL) systems so that they can learn on the edge and adapt to their environments.
  • methods: Reinterprets stored image-based observations as video and applies lossless differential video encoding to shrink the replay buffer by up to 14.2x and 16.7x without impacting training performance.
  • results: The savings enable large-scale perceptive DRL to run entirely in RAM instead of paging between flash and RAM, improving the latency of DMC tasks by as much as 32%.
    Abstract Perceptive deep reinforcement learning (DRL) has lead to many recent breakthroughs for complex AI systems leveraging image-based input data. Applications of these results range from super-human level video game agents to dexterous, physically intelligent robots. However, training these perceptive DRL-enabled systems remains incredibly compute and memory intensive, often requiring huge training datasets and large experience replay buffers. This poses a challenge for the next generation of field robots that will need to be able to learn on the edge in order to adapt to their environments. In this paper, we begin to address this issue through differentially encoded observation spaces. By reinterpreting stored image-based observations as a video, we leverage lossless differential video encoding schemes to compress the replay buffer without impacting training performance. We evaluate our approach with three state-of-the-art DRL algorithms and find that differential image encoding reduces the memory footprint by as much as 14.2x and 16.7x across tasks from the Atari 2600 benchmark and the DeepMind Control Suite (DMC) respectively. These savings also enable large-scale perceptive DRL that previously required paging between flash and RAM to be run entirely in RAM, improving the latency of DMC tasks by as much as 32%.
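
The core idea, storing a keyframe plus lossless frame-to-frame differences, can be sketched in a few lines of numpy. Real codecs add entropy coding on top, and the paper uses proper lossless video encoders rather than raw deltas; on the near-identical consecutive frames of RL episodes the deltas are mostly zeros and compress well:

```python
import numpy as np

def encode(frames):
    """Keyframe + int16 deltas; lossless because decoding inverts exactly."""
    deltas = np.diff(frames.astype(np.int16), axis=0)
    return frames[0], deltas

def decode(key, deltas):
    out = key.astype(np.int16) + np.concatenate(
        [np.zeros((1,) + key.shape, np.int16), np.cumsum(deltas, axis=0)])
    return out.astype(np.uint8)

frames = np.random.randint(0, 255, (8, 84, 84), dtype=np.uint8)
key, deltas = encode(frames)
assert np.array_equal(decode(key, deltas), frames)  # perfect reconstruction
```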

Improved Algorithms for Adversarial Bandits with Unbounded Losses

  • paper_url: http://arxiv.org/abs/2310.01756
  • repo_url: None
  • paper_authors: Mingyu Chen, Xuezhou Zhang
  • for: solve the Adversarial Multi-Armed Bandit (MAB) problem with unbounded losses, where the algorithms have no prior knowledge on the sizes of the losses.
  • methods: presents two algorithms, UMAB-NN and UMAB-G, for non-negative and general unbounded loss respectively.
  • results: achieves the first adaptive and scale-free regret bound without uniform exploration for non-negative unbounded loss, and can learn from arbitrary unbounded loss. Our analysis reveals the asymmetry between positive and negative losses in the MAB problem and provides additional insights.
    Abstract We consider the Adversarial Multi-Armed Bandits (MAB) problem with unbounded losses, where the algorithms have no prior knowledge on the sizes of the losses. We present UMAB-NN and UMAB-G, two algorithms for non-negative and general unbounded loss respectively. For non-negative unbounded loss, UMAB-NN achieves the first adaptive and scale free regret bound without uniform exploration. Built up on that, we further develop UMAB-G that can learn from arbitrary unbounded loss. Our analysis reveals the asymmetry between positive and negative losses in the MAB problem and provide additional insights. We also accompany our theoretical findings with extensive empirical evaluations, showing that our algorithms consistently out-performs all existing algorithms that handles unbounded losses.
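
The paper's UMAB-NN and UMAB-G are not reproduced here; for orientation only, the sketch below shows a standard EXP3 learner with a crude rescaling by the largest loss magnitude seen so far, the kind of adaptation to an unknown loss range that the paper's algorithms make principled and scale-free:

```python
import numpy as np

def exp3_scaled(K, T, sample_loss, eta=0.1, seed=0):
    """EXP3 with importance-weighted losses, rescaled by the largest loss
    magnitude observed so far -- an illustrative heuristic, not UMAB."""
    rng = np.random.default_rng(seed)
    cum = np.zeros(K)                        # cumulative scaled loss estimates
    scale, pulls = 1e-9, np.zeros(K, int)
    for _ in range(T):
        p = np.exp(-eta * (cum - cum.min()))
        p /= p.sum()
        arm = rng.choice(K, p=p)
        loss = sample_loss(arm)
        scale = max(scale, abs(loss))        # adapt to the unknown loss range
        cum[arm] += (loss / scale) / p[arm]  # importance-weighted estimate
        pulls[arm] += 1
    return pulls

# Arm 1 has lower mean loss; the learner should pull it most often.
print(exp3_scaled(K=2, T=5000,
                  sample_loss=lambda a: np.random.normal([5.0, 1.0][a], 1.0)))
```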

Blending Imitation and Reinforcement Learning for Robust Policy Improvement

  • paper_url: http://arxiv.org/abs/2310.01737
  • repo_url: None
  • paper_authors: Xuefeng Liu, Takuma Yoneda, Rick L. Stevens, Matthew R. Walter, Yuxin Chen
  • for: To improve the sample efficiency of reinforcement learning so that it can be applied more broadly across domains.
  • methods: combines imitation learning (IL) and reinforcement learning (RL), using oracle queries to facilitate exploration and gradually transitioning to RL as learning unfolds.
  • results: Outperforms existing state-of-the-art methods across multiple benchmark domains.
    Abstract While reinforcement learning (RL) has shown promising performance, its sample complexity continues to be a substantial hurdle, restricting its broader application across a variety of domains. Imitation learning (IL) utilizes oracles to improve sample efficiency, yet it is often constrained by the quality of the oracles deployed. To address these limitations, this work introduces Robust Policy Improvement (RPI), which actively interleaves between IL and RL based on an online estimate of their performance. RPI draws on the strengths of IL, using oracle queries to facilitate exploration, an aspect that is notably challenging in sparse-reward RL, particularly during the early stages of learning. As learning unfolds, RPI gradually transitions to RL, effectively treating the learned policy as an improved oracle. This algorithm is capable of learning from and improving upon a diverse set of black-box oracles. Integral to RPI are Robust Active Policy Selection (RAPS) and Robust Policy Gradient (RPG), both of which reason over whether to perform state-wise imitation from the oracles or learn from its own value function when the learner's performance surpasses that of the oracles in a specific state. Empirical evaluations and theoretical analysis validate that RPI excels in comparison to existing state-of-the-art methodologies, demonstrating superior performance across various benchmark domains.
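
The state-wise decision at the heart of RAPS can be sketched as below; the `value` and `action` methods are placeholders for learned critics and policies, and all names and signatures are assumptions for illustration:

```python
def choose_target(state, oracles, learner):
    """Per-state switch: imitate the best oracle where it beats the learner's
    own value estimate; otherwise learn from the learner's value function."""
    best_oracle = max(oracles, key=lambda o: o.value(state))
    if best_oracle.value(state) > learner.value(state):
        return ("imitate", best_oracle.action(state))   # IL-style update
    return ("reinforce", learner.action(state))         # RL-style update
```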

Learning Expected Appearances for Intraoperative Registration during Neurosurgery

  • paper_url: http://arxiv.org/abs/2310.01735
  • repo_url: None
  • paper_authors: Nazim Haouchine, Reuben Dorent, Parikshit Juvekar, Erickson Torio, William M. Wells III, Tina Kapur, Alexandra J. Golby, Sarah Frisken
  • for: Proposes a novel method for intraoperative patient-to-image registration by learning Expected Appearances.
  • methods: Uses preoperative imaging to synthesize patient-specific expected views through a surgical microscope for a predicted range of transformations, then estimates the camera pose by minimizing the dissimilarity between the live microscope view and the synthesized expected texture.
  • results: Outperforms state-of-the-art methods on synthetic data and on retrospective data from 6 clinical cases, achieving accuracy that meets current clinical standards.
    Abstract We present a novel method for intraoperative patient-to-image registration by learning Expected Appearances. Our method uses preoperative imaging to synthesize patient-specific expected views through a surgical microscope for a predicted range of transformations. Our method estimates the camera pose by minimizing the dissimilarity between the intraoperative 2D view through the optical microscope and the synthesized expected texture. In contrast to conventional methods, our approach transfers the processing tasks to the preoperative stage, reducing thereby the impact of low-resolution, distorted, and noisy intraoperative images, that often degrade the registration accuracy. We applied our method in the context of neuronavigation during brain surgery. We evaluated our approach on synthetic data and on retrospective data from 6 clinical cases. Our method outperformed state-of-the-art methods and achieved accuracies that met current clinical standards.
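
The pose-estimation step amounts to a similarity search between the live view and pre-synthesized expected appearances; below is a toy version over a discrete candidate set using normalized cross-correlation (the learned synthesis step is stubbed, and the paper's dissimilarity measure may differ):

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation; 1.0 means identical appearance."""
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

def estimate_pose(live_view, candidates):
    """candidates: list of (pose, expected_view) synthesized preoperatively.
    Returns the pose whose expected appearance best matches the live view."""
    return max(candidates, key=lambda c: ncc(live_view, c[1]))[0]

# Usage sketch: pose = estimate_pose(frame, [(pose_i, synth_view_i), ...])
```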

Nugget: Neural Agglomerative Embeddings of Text

  • paper_url: http://arxiv.org/abs/2310.01732
  • repo_url: None
  • paper_authors: Guanghui Qin, Benjamin Van Durme
  • for: Embedding text sequences is a widespread requirement in modern language understanding, yet existing approaches focus largely on fixed-size representations; this is problematic because the amount of information in text typically grows with input length. The paper proposes Nugget, which encodes language into a representation based on a dynamically selected subset of input tokens.
  • methods: The nuggets are learned through tasks such as autoencoding and machine translation, and intuitively segment language into meaningful units.
  • results: Nugget outperforms related approaches on semantic comparison tasks, and its compact units allow expanding the contextual window of a language model (LM), suggesting future LMs that can condition on significantly larger amounts of content.
    Abstract Embedding text sequences is a widespread requirement in modern language understanding. Existing approaches focus largely on constant-size representations. This is problematic, as the amount of information contained in text often varies with the length of the input. We propose a solution called Nugget, which encodes language into a representation based on a dynamically selected subset of input tokens. These nuggets are learned through tasks like autoencoding and machine translation, and intuitively segment language into meaningful units. We demonstrate Nugget outperforms related approaches in tasks involving semantic comparison. Finally, we illustrate these compact units allow for expanding the contextual window of a language model (LM), suggesting new future LMs that can condition on significantly larger amounts of content.
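
The dynamic selection can be sketched as a learned scorer followed by a length-proportional top-k pick over token encodings; dimensions and the selection ratio are illustrative assumptions, and the actual model learns selection end-to-end through autoencoding and translation:

```python
import torch
import torch.nn as nn

class NuggetSelector(nn.Module):
    """Scores each token encoding and keeps a length-proportional subset,
    so longer inputs get more nuggets."""
    def __init__(self, d_model=256, ratio=0.1):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)
        self.ratio = ratio

    def forward(self, token_encodings):              # (seq_len, d_model)
        scores = self.scorer(token_encodings).squeeze(-1)
        k = max(1, int(self.ratio * len(token_encodings)))
        idx = scores.topk(k).indices.sort().values   # keep original order
        return token_encodings[idx], idx

enc = torch.randn(120, 256)                          # e.g., encoder outputs
nuggets, positions = NuggetSelector()(enc)
print(nuggets.shape)                                 # 12 nuggets for 120 tokens
```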

Time-LLM: Time Series Forecasting by Reprogramming Large Language Models

  • paper_url: http://arxiv.org/abs/2310.01728
  • repo_url: None
  • paper_authors: Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y. Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, Qingsong Wen
  • for: To build a general time series forecasting model that can handle diverse time series data.
  • methods: Proposes Time-LLM, a reprogramming framework that repurposes a frozen large language model (LLM) for time series forecasting, using Prompt-as-Prefix (PaP) to enrich the input context and strengthen the model's reasoning over time series.
  • results: Time-LLM delivers strong forecasting performance, outperforming specialized models and excelling in both few-shot and zero-shot settings.
    Abstract Time series forecasting holds significant importance in many real-world dynamic systems and has been extensively studied. Unlike natural language process (NLP) and computer vision (CV), where a single large model can tackle multiple tasks, models for time series forecasting are often specialized, necessitating distinct designs for different tasks and applications. While pre-trained foundation models have made impressive strides in NLP and CV, their development in time series domains has been constrained by data sparsity. Recent studies have revealed that large language models (LLMs) possess robust pattern recognition and reasoning abilities over complex sequences of tokens. However, the challenge remains in effectively aligning the modalities of time series data and natural language to leverage these capabilities. In this work, we present Time-LLM, a reprogramming framework to repurpose LLMs for general time series forecasting with the backbone language models kept intact. We begin by reprogramming the input time series with text prototypes before feeding it into the frozen LLM to align the two modalities. To augment the LLM's ability to reason with time series data, we propose Prompt-as-Prefix (PaP), which enriches the input context and directs the transformation of reprogrammed input patches. The transformed time series patches from the LLM are finally projected to obtain the forecasts. Our comprehensive evaluations demonstrate that Time-LLM is a powerful time series learner that outperforms state-of-the-art, specialized forecasting models. Moreover, Time-LLM excels in both few-shot and zero-shot learning scenarios.
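
A minimal sketch of the reprogramming step: time-series patches are embedded and then expressed as mixtures over a small bank of text-prototype embeddings via cross-attention, producing token-like inputs for a frozen LLM. All sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PatchReprogrammer(nn.Module):
    """Maps time-series patches into the LLM's embedding space by attending
    over a small bank of text-prototype embeddings."""
    def __init__(self, patch_len=16, d_llm=768, n_proto=100):
        super().__init__()
        self.patch_len = patch_len
        self.embed = nn.Linear(patch_len, d_llm)
        self.prototypes = nn.Parameter(torch.randn(n_proto, d_llm))

    def forward(self, series):                        # (T,) univariate series
        patches = series.unfold(0, self.patch_len, self.patch_len)
        q = self.embed(patches)                       # queries from the series
        attn = torch.softmax(q @ self.prototypes.T / q.shape[-1] ** 0.5, dim=-1)
        return attn @ self.prototypes                 # tokens a frozen LLM can read

tokens = PatchReprogrammer()(torch.randn(128))
print(tokens.shape)  # (8, 768): ready to follow a Prompt-as-Prefix into the LLM
```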

Can GPT-4 Replicate Empirical Software Engineering Research?

  • paper_url: http://arxiv.org/abs/2310.01727
  • repo_url: None
  • paper_authors: Jenny T. Liang, Carmen Badea, Christian Bird, Robert DeLine, Denae Ford, Nicole Forsgren, Thomas Zimmermann
  • for: Explores the application of large language models (LLMs) in software engineering practice, to help practitioners and researchers understand and replicate existing empirical software engineering research.
  • methods: Uses a large language model (GPT-4) to replicate existing empirical software engineering studies, and evaluates the assumptions and analysis plans GPT-4 generates through a user study with 14 experts in software engineering research.
  • results: GPT-4 can surface correct assumptions but struggles to generate ones that reflect common knowledge about software engineering data; manual analysis of the generated code finds correct high-level logic but many small implementation-level errors. These results carry implications for using LLMs in software engineering research and for data scientists on software teams.
    Abstract Empirical software engineering research on production systems has brought forth a better understanding of the software engineering process for practitioners and researchers alike. However, only a small subset of production systems is studied, limiting the impact of this research. While software engineering practitioners benefit from replicating research on their own data, this poses its own set of challenges, since performing replications requires a deep understanding of research methodologies and subtle nuances in software engineering data. Given that large language models (LLMs), such as GPT-4, show promise in tackling both software engineering- and science-related tasks, these models could help democratize empirical software engineering research. In this paper, we examine LLMs' abilities to perform replications of empirical software engineering research on new data. We specifically study their ability to surface assumptions made in empirical software engineering research methodologies, as well as their ability to plan and generate code for analysis pipelines on seven empirical software engineering papers. We perform a user study with 14 participants with software engineering research expertise, who evaluate GPT-4-generated assumptions and analysis plans (i.e., a list of module specifications) from the papers. We find that GPT-4 is able to surface correct assumptions, but struggle to generate ones that reflect common knowledge about software engineering data. In a manual analysis of the generated code, we find that the GPT-4-generated code contains the correct high-level logic, given a subset of the methodology. However, the code contains many small implementation-level errors, reflecting a lack of software engineering knowledge. Our findings have implications for leveraging LLMs for software engineering research as well as practitioner data scientists in software teams.

PrACTiS: Perceiver-Attentional Copulas for Time Series

  • paper_url: http://arxiv.org/abs/2310.01720
  • repo_url: None
  • paper_authors: Cat P. Le, Chris Cannella, Ali Hasan, Yuting Ng, Vahid Tarokh
  • for: To improve time series forecasting performance.
  • methods: Combines the perceiver architecture with a copula structure, using midpoint inference and local attention mechanisms, together with copula-based attention and an output variance testing mechanism that capture the joint distribution of missing data while avoiding error propagation during prediction.
  • results: Achieves a consistent 20% improvement over prior methods on unimodal and multimodal benchmarks while using less than half of the available memory.
    Abstract Transformers incorporating copula structures have demonstrated remarkable performance in time series prediction. However, their heavy reliance on self-attention mechanisms demands substantial computational resources, thus limiting their practical utility across a wide range of tasks. In this work, we present a model that combines the perceiver architecture with a copula structure to enhance time-series forecasting. By leveraging the perceiver as the encoder, we efficiently transform complex, high-dimensional, multimodal data into a compact latent space, thereby significantly reducing computational demands. To further reduce complexity, we introduce midpoint inference and local attention mechanisms, enabling the model to capture dependencies within imputed samples effectively. Subsequently, we deploy the copula-based attention and output variance testing mechanism to capture the joint distribution of missing data, while simultaneously mitigating error propagation during prediction. Our experimental results on the unimodal and multimodal benchmarks showcase a consistent 20\% improvement over the state-of-the-art methods, while utilizing less than half of available memory resources.
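
One natural reading of midpoint inference is a coarse-to-fine imputation order in which the midpoint of each unobserved interval is inferred first, conditioning later points on nearby already-imputed ones; the sketch below shows that ordering (an assumption about the scheme, not code from the paper):

```python
def midpoint_order(lo, hi):
    """Impute the midpoint of each unobserved interval first, then recurse;
    this yields a coarse-to-fine inference order over the gap (lo, hi)."""
    if hi - lo < 2:
        return []
    mid = (lo + hi) // 2
    return [mid] + midpoint_order(lo, mid) + midpoint_order(mid, hi)

# For a gap between observed indices 0 and 8, impute in this order:
print(midpoint_order(0, 8))  # [4, 2, 1, 3, 6, 5, 7]
```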

Ensemble Distillation for Unsupervised Constituency Parsing

  • paper_url: http://arxiv.org/abs/2310.01717
  • repo_url: https://github.com/manga-uofa/ed4ucp
  • paper_authors: Behzad Shayegh, Yanshuai Cao, Xiaodan Zhu, Jackie C. K. Cheung, Lili Mou
  • for: Addresses unsupervised constituency parsing, which organizes the words and phrases of a sentence into a hierarchical structure without using linguistically annotated data.
  • methods: Proposes a notion of "tree averaging" and, building on it, a novel ensemble method; to improve inference efficiency, the ensemble knowledge is distilled into a student model, mitigating the over-smoothing problem of common multi-teacher distillation.
  • results: Experiments show the method surpasses all previous approaches, remaining effective and robust across runs, ensemble components, and domain-shift conditions.
    Abstract We investigate the unsupervised constituency parsing task, which organizes words and phrases of a sentence into a hierarchical structure without using linguistically annotated data. We observe that existing unsupervised parsers capture differing aspects of parsing structures, which can be leveraged to enhance unsupervised parsing performance. To this end, we propose a notion of "tree averaging," based on which we further propose a novel ensemble method for unsupervised parsing. To improve inference efficiency, we further distill the ensemble knowledge into a student model; such an ensemble-then-distill process is an effective approach to mitigate the over-smoothing problem existing in common multi-teacher distilling methods. Experiments show that our method surpasses all previous approaches, consistently demonstrating its effectiveness and robustness across various runs, with different ensemble components, and under domain-shift conditions.
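
One natural reading of "tree averaging" is to select the binary tree whose constituent spans collect the most votes from the ensemble members, computable exactly with a CKY-style dynamic program over spans; below is a small sketch under that assumption:

```python
from functools import lru_cache
from collections import Counter

def average_trees(span_sets, n):
    """'Tree averaging' in spirit: return the binary bracketing of n words
    whose constituent spans receive the most total votes from the ensemble."""
    votes = Counter(s for spans in span_sets for s in spans)

    @lru_cache(maxsize=None)
    def best(i, j):                      # best score and spans for words i..j
        if j - i == 1:
            return 0, ((i, j),)
        score, spans = max(
            (best(i, k)[0] + best(k, j)[0], best(i, k)[1] + best(k, j)[1])
            for k in range(i + 1, j))
        return score + votes[(i, j)], spans + ((i, j),)

    return best(0, n)[1]

# Three parsers disagree on a 4-word sentence; the consensus keeps span (0, 2).
parses = [{(0, 2), (2, 4), (0, 4)}, {(0, 2), (0, 3), (0, 4)}, {(1, 3), (0, 4)}]
print(sorted(average_trees(parses, 4)))
```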