cs.AI - 2023-11-28

Deep Regularized Compound Gaussian Network for Solving Linear Inverse Problems

  • paper_url: http://arxiv.org/abs/2311.17248
  • repo_url: None
  • paper_authors: Carter Lyons, Raghu G. Raj, Margaret Cheney
  • for: This paper proposes two novel approaches for solving linear inverse problems, with the aim of producing more stable and robust inverse-problem solutions.
  • methods: Two methods are developed: an iterative algorithm, generalized compound Gaussian least squares (G-CG-LS), which minimizes a regularized least squares objective whose regularization enforces a compound Gaussian (CG) prior; and a deep regularized (DR) neural network, DR-CG-Net, obtained by unrolling G-CG-LS, which learns the prior information.
  • results: Computational theory and extensive numerical experiments show that both methods deliver strong performance in tomographic imaging and compressive sensing, with DR-CG-Net outperforming competitive prior-art methods, especially in challenging low-training scenarios.
    Abstract Incorporating prior information into inverse problems, e.g. via maximum-a-posteriori estimation, is an important technique for facilitating robust inverse problem solutions. In this paper, we devise two novel approaches for linear inverse problems that permit problem-specific statistical prior selections within the compound Gaussian (CG) class of distributions. The CG class subsumes many commonly used priors in signal and image reconstruction methods including those of sparsity-based approaches. The first method developed is an iterative algorithm, called generalized compound Gaussian least squares (G-CG-LS), that minimizes a regularized least squares objective function where the regularization enforces a CG prior. G-CG-LS is then unrolled, or unfolded, to furnish our second method, which is a novel deep regularized (DR) neural network, called DR-CG-Net, that learns the prior information. A detailed computational theory on convergence properties of G-CG-LS and thorough numerical experiments for DR-CG-Net are provided. Due to the comprehensive nature of the CG prior, these experiments show that our unrolled DR-CG-Net outperforms competitive prior art methods in tomographic imaging and compressive sensing, especially in challenging low-training scenarios.
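A minimal sketch of the unrolling idea described above: an iterative solver for a regularized least-squares objective is unfolded into a fixed number of layers whose step sizes and regularization parameters become learnable. This is a generic proximal-gradient unrolling with a hypothetical soft-threshold regularizer, not the paper's G-CG-LS/DR-CG-Net.

```python
import torch
import torch.nn as nn

class UnrolledSolver(nn.Module):
    """K unrolled proximal-gradient steps for min_x ||Ax - y||^2 + R(x);
    the soft-threshold stands in for the CG-prior regularizer (an assumption)."""

    def __init__(self, num_steps: int = 5):
        super().__init__()
        self.step_sizes = nn.Parameter(torch.full((num_steps,), 0.005))
        self.thresholds = nn.Parameter(torch.full((num_steps,), 0.01))

    def forward(self, A: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        x = torch.zeros(A.shape[1])
        for eta, lam in zip(self.step_sizes, self.thresholds):
            x = x - eta * (A.T @ (A @ x - y))                      # gradient step on the data fit
            x = torch.sign(x) * torch.clamp(x.abs() - lam, min=0)  # proximal (regularization) step
        return x

# toy usage: the unrolled parameters could be trained end-to-end on (A, y, x) triples
torch.manual_seed(0)
A = torch.randn(30, 50)
x_true = torch.zeros(50); x_true[:5] = 1.0
y = A @ x_true + 0.01 * torch.randn(30)
x_hat = UnrolledSolver()(A, y)
```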

Quantifying the redundancy between prosody and text

  • paper_url: http://arxiv.org/abs/2311.17233
  • repo_url: https://github.com/lu-wo/quantifying-redundancy
  • paper_authors: Lukas Wolf, Tiago Pimentel, Evelina Fedorenko, Ryan Cotterell, Alex Warstadt, Ethan Wilcox, Tamar Regev
  • for: This paper aims to investigate the relationship between the information conveyed by prosody and the words themselves in spoken language.
  • methods: The authors use large language models (LLMs) to estimate the redundancy between prosody and the words themselves, and they extract prosodic features aligned to individual words from a large spoken corpus of English audiobooks.
  • results: The authors find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features, including intensity, duration, pauses, and pitch contours. They also find that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words.
    Abstract Prosody -- the suprasegmental component of speech, including pitch, loudness, and tempo -- carries critical aspects of meaning. However, the relationship between the information conveyed by prosody vs. by the words themselves remains poorly understood. We use large language models (LLMs) to estimate how much information is redundant between prosody and the words themselves. Using a large spoken corpus of English audiobooks, we extract prosodic features aligned to individual words and test how well they can be predicted from LLM embeddings, compared to non-contextual word embeddings. We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features, including intensity, duration, pauses, and pitch contours. Furthermore, a word's prosodic information is redundant with both the word itself and the context preceding as well as following it. Still, we observe that prosodic features can not be fully predicted from text, suggesting that prosody carries information above and beyond the words. Along with this paper, we release a general-purpose data processing pipeline for quantifying the relationship between linguistic information and extra-linguistic features.
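The probing setup described in the abstract can be loosely illustrated as follows, with synthetic arrays standing in for the paper's LLM embeddings and word-aligned prosodic features (names and data are placeholders, not the released pipeline):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# placeholders: one contextual embedding per word plus an aligned prosodic feature
# (e.g. word duration); in the paper these come from an LLM and an audiobook corpus
n_words, dim = 2000, 256
embeddings = rng.normal(size=(n_words, dim))
duration = embeddings[:, :8].sum(axis=1) + 0.5 * rng.normal(size=n_words)

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, duration, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)

# variance explained by the text-based probe ~ redundancy between prosody and the words
print("held-out R^2:", r2_score(y_te, probe.predict(X_te)))
```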

ReWaRD: Retinal Waves for Pre-Training Artificial Neural Networks Mimicking Real Prenatal Development

  • paper_url: http://arxiv.org/abs/2311.17232
  • repo_url: https://github.com/bennyca/reward
  • paper_authors: Benjamin Cappell, Andreas Stoll, Williams Chukwudi Umah, Bernhard Egger
  • for: This paper studies the development of human vision, focusing on its earliest pre- and post-natal stages.
  • methods: The study builds computational models that mimic this developmental mechanism by pre-training artificial convolutional neural networks on simulated retinal wave images.
  • results: The features obtained from this biologically plausible pre-training closely match the V1 features of the primate visual system, and the performance gain from retinal-wave pre-training is similar to that of a state-of-the-art pre-training pipeline.
    Abstract Computational models trained on a large amount of natural images are the state-of-the-art to study human vision - usually adult vision. Computational models of infant vision and its further development are gaining more and more attention in the community. In this work we aim at the very beginning of our visual experience - pre- and post-natal retinal waves which suggest to be a pre-training mechanism for the primate visual system at a very early stage of development. We see this approach as an instance of biologically plausible data driven inductive bias through pre-training. We built a computational model that mimics this development mechanism by pre-training different artificial convolutional neural networks with simulated retinal wave images. The resulting features of this biologically plausible pre-training closely match the V1 features of the primate visual system. We show that the performance gain by pre-training with retinal waves is similar to a state-of-the art pre-training pipeline. Our framework contains the retinal wave generator, as well as a training strategy, which can be a first step in a curriculum learning based training diet for various models of development. We release code, data and trained networks to build the basis for future work on visual development and based on a curriculum learning approach including prenatal development to support studies of innate vs. learned properties of the primate visual system. An additional benefit of our pre-trained networks for neuroscience or computer vision applications is the absence of biases inherited from datasets like ImageNet.
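A minimal sketch of the pre-training idea: a small convolutional network is pre-trained on wave-like stimuli before ever seeing natural images. The blob generator and the autoencoding objective below are stand-ins chosen for brevity, not the authors' retinal-wave simulator or training recipe (the linked repository is authoritative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_retinal_wave(batch: int, size: int = 32) -> torch.Tensor:
    """Placeholder stimulus: smooth random blobs, NOT the authors' wave simulator."""
    return F.avg_pool2d(torch.randn(batch, 1, size, size), 5, stride=1, padding=2)

encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
decoder = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 1, 3, padding=1))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(200):                        # pre-training loop on synthetic "waves"
    x = fake_retinal_wave(16)
    loss = F.mse_loss(decoder(encoder(x)), x)  # assumed reconstruction objective
    opt.zero_grad(); loss.backward(); opt.step()

# `encoder` would then be reused or fine-tuned on natural images downstream
```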

Survey on AI Ethics: A Socio-technical Perspective

  • paper_url: http://arxiv.org/abs/2311.17228
  • repo_url: None
  • paper_authors: Dave Mbiazi, Meghana Bhange, Maryam Babaei, Ivaxi Sheth, Patrik Joslin Kenfack
  • for: This paper aims to provide a comprehensive overview of the ethical concerns associated with the deployment of AI in society, including fairness, privacy and data protection, responsibility and accountability, safety and robustness, transparency and explainability, and environmental impact.
  • methods: The paper discusses the technical and social aspects of each ethical principle, providing a unified overview of the current and future ethical concerns of AI deployment.
  • results: The paper provides a comprehensive understanding of the ethical considerations for AI deployment, highlighting the need for a societal perspective on these issues.
    Abstract The past decade has observed a great advancement in AI with deep learning-based models being deployed in diverse scenarios including safety-critical applications. As these AI systems become deeply embedded in our societal infrastructure, the repercussions of their decisions and actions have significant consequences, making the ethical implications of AI deployment highly relevant and important. The ethical concerns associated with AI are multifaceted, including challenging issues of fairness, privacy and data protection, responsibility and accountability, safety and robustness, transparency and explainability, and environmental impact. These principles together form the foundations of ethical AI considerations that concern every stakeholder in the AI system lifecycle. In light of the present ethical and future x-risk concerns, governments have shown increasing interest in establishing guidelines for the ethical deployment of AI. This work unifies the current and future ethical concerns of deploying AI into society. While we acknowledge and appreciate the technical surveys for each of the ethical principles concerned, in this paper, we aim to provide a comprehensive overview that not only addresses each principle from a technical point of view but also discusses them from a social perspective.

War and Peace (WarAgent): Large Language Model-based Multi-Agent Simulation of World Wars

  • paper_url: http://arxiv.org/abs/2311.17227
  • repo_url: https://github.com/agiresearch/waragent
  • paper_authors: Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, Yongfeng Zhang
  • for: This work uses artificial intelligence (AI) and large language models (LLMs) to study complex collective human behaviors, specifically the international conflicts of human history.
  • methods: The authors propose WarAgent, an LLM-based multi-agent AI system that simulates the decisions and consequences of the participating countries in historical international conflicts, including WWI, WWII, and the Warring States Period in Ancient China; simulation effectiveness is evaluated to assess the advancements and limitations of current AI technology in studying complex collective human behaviors.
  • results: The findings offer data-driven, AI-augmented insights into international conflicts and strategies that may help prevent future ones; beyond historical analysis, they serve as a blueprint for conflict prevention and peacekeeping. Code and data are available at https://github.com/agiresearch/WarAgent.
    Abstract Can we avoid wars at the crossroads of history? This question has been pursued by individuals, scholars, policymakers, and organizations throughout human history. In this research, we attempt to answer the question based on the recent advances of Artificial Intelligence (AI) and Large Language Models (LLMs). We propose \textbf{WarAgent}, an LLM-powered multi-agent AI system, to simulate the participating countries, their decisions, and the consequences, in historical international conflicts, including the World War I (WWI), the World War II (WWII), and the Warring States Period (WSP) in Ancient China. By evaluating the simulation effectiveness, we examine the advancements and limitations of cutting-edge AI systems' abilities in studying complex collective human behaviors such as international conflicts under diverse settings. In these simulations, the emergent interactions among agents also offer a novel perspective for examining the triggers and conditions that lead to war. Our findings offer data-driven and AI-augmented insights that can redefine how we approach conflict resolution and peacekeeping strategies. The implications stretch beyond historical analysis, offering a blueprint for using AI to understand human history and possibly prevent future international conflicts. Code and data are available at \url{https://github.com/agiresearch/WarAgent}.
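A bare-bones sketch of the multi-agent simulation loop the abstract describes. The `query_llm` function, country profiles, and actions are hypothetical placeholders; WarAgent's actual prompts, action space, and LLM backend live in the linked repository.

```python
from dataclasses import dataclass, field

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real chat-completion client."""
    return "issue a diplomatic warning"           # placeholder decision

@dataclass
class CountryAgent:
    name: str
    profile: str                                  # goals, alliances, resources, ...
    history: list = field(default_factory=list)

    def decide(self, world_events: list) -> str:
        prompt = (f"You are {self.name}. Profile: {self.profile}\n"
                  f"Recent events: {world_events}\n"
                  "Choose your next action.")
        action = query_llm(prompt)
        self.history.append(action)
        return action

agents = [CountryAgent("Country A", "industrial power, allied with B"),
          CountryAgent("Country B", "maritime empire, rival of C")]
events = ["assassination of a political figure"]

for _ in range(3):                                # each round, every agent reacts to shared events
    events = [f"{a.name}: {a.decide(events)}" for a in agents]
print(events)
```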

Minimax Exploiter: A Data Efficient Approach for Competitive Self-Play

  • paper_url: http://arxiv.org/abs/2311.17190
  • repo_url: None
  • paper_authors: Daniel Bairamian, Philippe Marcotte, Joshua Romoff, Gabriel Robert, Derek Nowrouzezahrai
  • for: Improving Competitive Self-Play (CSP) techniques that achieve human-level performance in complex game environments, with a focus on data efficiency.
  • methods: Distributed Multi-Agent Reinforcement Learning (MARL) is used to create a pool of learning agents consisting of the Main Agent, past versions of the Main Agent, and Exploiter Agents, where the Exploiter Agents learn counter-strategies to the Main Agent.
  • results: The proposed Minimax Exploiter, a game-theoretic approach to exploiting the Main Agent that leverages knowledge of its opponent, significantly increases data efficiency and consistently demonstrates improved stability across a variety of settings.
    Abstract Recent advances in Competitive Self-Play (CSP) have achieved, or even surpassed, human level performance in complex game environments such as Dota 2 and StarCraft II using Distributed Multi-Agent Reinforcement Learning (MARL). One core component of these methods relies on creating a pool of learning agents -- consisting of the Main Agent, past versions of this agent, and Exploiter Agents -- where Exploiter Agents learn counter-strategies to the Main Agents. A key drawback of these approaches is the large computational cost and physical time that is required to train the system, making them impractical to deploy in highly iterative real-life settings such as video game productions. In this paper, we propose the Minimax Exploiter, a game theoretic approach to exploiting Main Agents that leverages knowledge of its opponents, leading to significant increases in data efficiency. We validate our approach in a diversity of settings, including simple turn based games, the arcade learning environment, and For Honor, a modern video game. The Minimax Exploiter consistently outperforms strong baselines, demonstrating improved stability and data efficiency, leading to a robust CSP-MARL method that is both flexible and easy to deploy.

SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery

  • paper_url: http://arxiv.org/abs/2311.17179
  • repo_url: None
  • paper_authors: Konstantin Klemmer, Esther Rolf, Caleb Robinson, Lester Mackey, Marc Rußwurm
  • for: This work introduces a global, general-purpose geographic location encoder that learns the characteristics of locations from openly available satellite imagery.
  • methods: Satellite Contrastive Location-Image Pretraining (SatCLIP) pretrains a location encoder on globally sampled multi-spectral Sentinel-2 satellite data, yielding embeddings that summarize the characteristics of any given location.
  • results: SatCLIP embeddings provide useful location information for a range of predictive tasks, consistently outperform embeddings from existing pretrained location encoders across tasks, and help to improve geographic generalization.
    Abstract Geographic location is essential for modeling tasks in fields ranging from ecology to epidemiology to the Earth system sciences. However, extracting relevant and meaningful characteristics of a location can be challenging, often entailing expensive data fusion or data distillation from global imagery datasets. To address this challenge, we introduce Satellite Contrastive Location-Image Pretraining (SatCLIP), a global, general-purpose geographic location encoder that learns an implicit representation of locations from openly available satellite imagery. Trained location encoders provide vector embeddings summarizing the characteristics of any given location for convenient usage in diverse downstream tasks. We show that SatCLIP embeddings, pretrained on globally sampled multi-spectral Sentinel-2 satellite data, can be used in various predictive tasks that depend on location information but not necessarily satellite imagery, including temperature prediction, animal recognition in imagery, and population density estimation. Across tasks, SatCLIP embeddings consistently outperform embeddings from existing pretrained location encoders, ranging from models trained on natural images to models trained on semantic context. SatCLIP embeddings also help to improve geographic generalization. This demonstrates the potential of general-purpose location encoders and opens the door to learning meaningful representations of our planet from the vast, varied, and largely untapped modalities of geospatial data.
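The contrastive pretraining objective can be sketched as a CLIP-style loss that pulls matched location/image pairs together in a shared embedding space. The toy encoders below (an MLP over lat/lon and a flattened-pixel image encoder) are assumptions for brevity; they are not SatCLIP's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

loc_encoder = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 128))  # lat/lon -> embedding
img_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 128))        # tiny image encoder

def contrastive_loss(coords: torch.Tensor, images: torch.Tensor, temperature: float = 0.07):
    z_loc = F.normalize(loc_encoder(coords), dim=-1)
    z_img = F.normalize(img_encoder(images), dim=-1)
    logits = z_loc @ z_img.T / temperature        # similarity of every location to every image
    targets = torch.arange(len(coords))           # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

coords = torch.rand(8, 2) * 2 - 1                 # toy normalized lat/lon pairs
images = torch.randn(8, 3, 16, 16)                # toy stand-ins for Sentinel-2 patches
contrastive_loss(coords, images).backward()       # after training, loc_encoder(coords) is the location embedding
```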

(Ir)rationality in AI: State of the Art, Research Challenges and Open Questions

  • paper_url: http://arxiv.org/abs/2311.17165
  • repo_url: None
  • paper_authors: Olivia Macmillan-Scott, Mirco Musolesi
  • for: This article surveys rationality and irrationality in artificial intelligence and sets out the open questions in this area.
  • methods: Focusing on the behaviour of artificial agents, the paper considers how irrational behaviour can be handled, including the identification of and interaction with irrational agents, drawing on work in economics, philosophy, and psychology.
  • results: The survey finds that some irrational behaviours can prove optimal in certain scenarios, and highlights many open questions about the interaction between humans and artificial agents, including the potentially irrational behaviour of both.
    Abstract The concept of rationality is central to the field of artificial intelligence. Whether we are seeking to simulate human reasoning, or the goal is to achieve bounded optimality, we generally seek to make artificial agents as rational as possible. Despite the centrality of the concept within AI, there is no unified definition of what constitutes a rational agent. This article provides a survey of rationality and irrationality in artificial intelligence, and sets out the open questions in this area. The understanding of rationality in other fields has influenced its conception within artificial intelligence, in particular work in economics, philosophy and psychology. Focusing on the behaviour of artificial agents, we consider irrational behaviours that can prove to be optimal in certain scenarios. Some methods have been developed to deal with irrational agents, both in terms of identification and interaction, however work in this area remains limited. Methods that have up to now been developed for other purposes, namely adversarial scenarios, may be adapted to suit interactions with artificial agents. We further discuss the interplay between human and artificial agents, and the role that rationality plays within this interaction; many questions remain in this area, relating to potentially irrational behaviour of both humans and artificial agents.

Pragmatic Radiology Report Generation

  • paper_url: http://arxiv.org/abs/2311.17154
  • repo_url: https://github.com/chicagohai/llm_radiology
  • paper_authors: Dang Nguyen, Chacha Chen, He He, Chenhao Tan
  • for: When pneumonia is not found on a chest X-ray, should the report describe this negative observation or omit it?
  • methods: The authors argue that this question cannot be answered from the X-ray alone and requires a pragmatic perspective, which captures the communicative goal that radiology reports serve between radiologists and patients.
  • results: The authors show that the indication, i.e. why the patient comes for an X-ray, drives the mentions of negative observations, and introduce indications as additional input to report generation. They also develop a framework to identify uninferable information in the image as a source of model hallucinations and limit it by cleaning groundtruth reports. Pragmatic models built with indications and cleaned groundtruth reports outperform existing methods not only on new pragmatics-inspired metrics (+4.3 Negative F1) but also on standard metrics (+6.3 Positive F1 and +11.0 BLEU-2).
    Abstract When pneumonia is not found on a chest X-ray, should the report describe this negative observation or omit it? We argue that this question cannot be answered from the X-ray alone and requires a pragmatic perspective, which captures the communicative goal that radiology reports serve between radiologists and patients. However, the standard image-to-text formulation for radiology report generation fails to incorporate such pragmatic intents. Following this pragmatic perspective, we demonstrate that the indication, which describes why a patient comes for an X-ray, drives the mentions of negative observations and introduce indications as additional input to report generation. With respect to the output, we develop a framework to identify uninferable information from the image as a source of model hallucinations, and limit them by cleaning groundtruth reports. Finally, we use indications and cleaned groundtruth reports to develop pragmatic models, and show that they outperform existing methods not only in new pragmatics-inspired metrics (+4.3 Negative F1) but also in standard metrics (+6.3 Positive F1 and +11.0 BLEU-2).

Mission-driven Exploration for Accelerated Deep Reinforcement Learning with Temporal Logic Task Specifications

  • paper_url: http://arxiv.org/abs/2311.17059
  • repo_url: None
  • paper_authors: Jun Wang, Hosein Hasanbeig, Kaiyuan Tan, Zihe Sun, Yiannis Kantaros
  • for: This paper addresses the design of optimal control policies for mobile robots whose mission and safety requirements are specified using Linear Temporal Logic (LTL), operating with unknown stochastic dynamics in environments of unknown geometric structure.
  • methods: A deep reinforcement learning (DRL) algorithm synthesizes the control policy; an automaton representation of the LTL task, together with a learned model of the system dynamics, drives a mission-driven exploration strategy that accelerates learning.
  • results: Comparative experiments on robot navigation tasks in unknown environments show that the algorithm learns control policies notably faster than similar methods and accomplishes the tasks efficiently.
    Abstract This paper addresses the problem of designing optimal control policies for mobile robots with mission and safety requirements specified using Linear Temporal Logic (LTL). We consider robots with unknown stochastic dynamics operating in environments with unknown geometric structure. The robots are equipped with sensors allowing them to detect obstacles. Our goal is to synthesize a control policy that maximizes the probability of satisfying an LTL-encoded task in the presence of motion and environmental uncertainty. Several deep reinforcement learning (DRL) algorithms have been proposed recently to address similar problems. A common limitation in related works is that of slow learning performance. In order to address this issue, we propose a novel DRL algorithm, which has the capability to learn control policies at a notably faster rate compared to similar methods. Its sample efficiency is due to a mission-driven exploration strategy that prioritizes exploration towards directions that may contribute to mission accomplishment. Identifying these directions relies on an automaton representation of the LTL task as well as a learned neural network that (partially) models the unknown system dynamics. We provide comparative experiments demonstrating the efficiency of our algorithm on robot navigation tasks in unknown environments.
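One way to picture the automaton-guided exploration: compile the LTL task into an automaton, track its state alongside the environment state, and bias action selection toward actions that a learned model predicts will advance the automaton. The toy 1-D world, labeling function, and scoring below are purely illustrative and are not the paper's algorithm.

```python
import random

# toy automaton for "eventually reach the goal, never touch the obstacle"
AUTOMATON = {("q0", "goal"): "q_acc", ("q0", "obstacle"): "q_rej", ("q0", "none"): "q0"}

def label(state: int) -> str:                        # hypothetical labeling of env states
    return "goal" if state == 9 else ("obstacle" if state == 0 else "none")

def automaton_step(q: str, state: int) -> str:
    return AUTOMATON.get((q, label(state)), q)

def model(state: int, action: int) -> int:           # stand-in for a learned dynamics model
    return max(0, min(9, state + action))

def choose_action(state: int, q: str, actions=(-1, 1), eps: float = 0.1) -> int:
    if random.random() < eps:                         # occasional undirected exploration
        return random.choice(actions)
    def score(a: int) -> float:                       # mission-driven bias
        s_next = model(state, a)
        q_next = automaton_step(q, s_next)
        if q_next == "q_rej": return -100             # predicted safety violation
        if q_next == "q_acc": return 100              # predicted task completion
        return -abs(9 - s_next)                       # otherwise, prefer progress toward the goal label
    return max(actions, key=score)

state, q = 2, "q0"
while q == "q0":
    action = choose_action(state, q)
    state = model(state, action)                      # the toy "environment" reuses the model
    q = automaton_step(q, state)
print("run ended in automaton state", q)
```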

Panoptic Video Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2311.17058
  • repo_url: https://github.com/jingkang50/openpvsg
  • paper_authors: Jingkang Yang, Wenxuan Peng, Xiangtai Li, Zujin Guo, Liangyu Chen, Bo Li, Zheng Ma, Kaiyang Zhou, Wayne Zhang, Chen Change Loy, Ziwei Liu
  • for: Towards building comprehensive real-world visual perception systems, this work proposes and studies a new problem: panoptic video scene graph generation (PVSG).
  • methods: Rather than grounding scene-graph nodes with bounding boxes, PVSG grounds them with precise, pixel-level panoptic segmentation masks, which facilitates holistic scene understanding.
  • results: The authors contribute the PVSG dataset of 400 videos (289 third-person + 111 egocentric), totaling 150K frames labeled with panoptic segmentation masks and fine, temporal scene graphs, together with a variety of baseline methods and useful design practices.
    Abstract Towards building comprehensive real-world visual perception systems, we propose and study a new problem called panoptic scene graph generation (PVSG). PVSG relates to the existing video scene graph generation (VidSGG) problem, which focuses on temporal interactions between humans and objects grounded with bounding boxes in videos. However, the limitation of bounding boxes in detecting non-rigid objects and backgrounds often causes VidSGG to miss key details crucial for comprehensive video understanding. In contrast, PVSG requires nodes in scene graphs to be grounded by more precise, pixel-level segmentation masks, which facilitate holistic scene understanding. To advance research in this new area, we contribute the PVSG dataset, which consists of 400 videos (289 third-person + 111 egocentric videos) with a total of 150K frames labeled with panoptic segmentation masks as well as fine, temporal scene graphs. We also provide a variety of baseline methods and share useful design practices for future work.

No Representation Rules Them All in Category Discovery

  • paper_url: http://arxiv.org/abs/2311.17055
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Sagar Vaze, Andrea Vedaldi, Andrew Zisserman
  • for: This paper tackles Generalized Category Discovery (GCD): given a dataset with labelled and unlabelled images, cluster all images in the unlabelled subset, whether or not they belong to the labelled categories.
  • methods: The authors first observe that most existing GCD benchmarks contain labels for only a single clustering of the data, making it difficult to tell whether models use the available labels to solve the GCD task or simply solve an unsupervised clustering problem. They therefore present a synthetic dataset, 'Clevr-4', for category discovery, containing four equally valid partitions of the data based on object shape, texture, color, or count; to solve the task, models must extrapolate the taxonomy specified by the labelled set rather than latch onto a single natural grouping of the data.
  • results: Using Clevr-4, the paper demonstrates the limitations of unsupervised clustering in the GCD setting (even very strong unsupervised models fail), examines the weaknesses of existing GCD algorithms, and proposes a new method, $\mu$GCD, based on 'mean teachers' and consistent findings from the representation learning literature, which substantially outperforms the implemented baselines on Clevr-4. Transferred to real data on the challenging Semantic Shift Benchmark (SSB), $\mu$GCD outperforms all prior work, setting a new state-of-the-art.
    Abstract In this paper we tackle the problem of Generalized Category Discovery (GCD). Specifically, given a dataset with labelled and unlabelled images, the task is to cluster all images in the unlabelled subset, whether or not they belong to the labelled categories. Our first contribution is to recognize that most existing GCD benchmarks only contain labels for a single clustering of the data, making it difficult to ascertain whether models are using the available labels to solve the GCD task, or simply solving an unsupervised clustering problem. As such, we present a synthetic dataset, named 'Clevr-4', for category discovery. Clevr-4 contains four equally valid partitions of the data, i.e based on object shape, texture, color or count. To solve the task, models are required to extrapolate the taxonomy specified by the labelled set, rather than simply latching onto a single natural grouping of the data. We use this dataset to demonstrate the limitations of unsupervised clustering in the GCD setting, showing that even very strong unsupervised models fail on Clevr-4. We further use Clevr-4 to examine the weaknesses of existing GCD algorithms, and propose a new method which addresses these shortcomings, leveraging consistent findings from the representation learning literature to do so. Our simple solution, which is based on 'mean teachers' and termed $\mu$GCD, substantially outperforms implemented baselines on Clevr-4. Finally, when we transfer these findings to real data on the challenging Semantic Shift Benchmark (SSB), we find that $\mu$GCD outperforms all prior work, setting a new state-of-the-art. For the project webpage, see https://www.robots.ox.ac.uk/~vgg/data/clevr4/
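The 'mean teachers' component of $\mu$GCD can be sketched as maintaining a teacher network whose weights are an exponential moving average (EMA) of the student's; the clustering and consistency losses of $\mu$GCD itself are omitted here.

```python
import copy
import torch
import torch.nn as nn

student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                  # the teacher receives no gradient updates

@torch.no_grad()
def ema_update(student: nn.Module, teacher: nn.Module, momentum: float = 0.999):
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# in a training loop, teacher predictions on one augmented view would supervise the
# student on another view; after every optimizer step on the student:
ema_update(student, teacher)
```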

Shadows Don’t Lie and Lines Can’t Bend! Generative Models don’t know Projective Geometry…for now

  • paper_url: http://arxiv.org/abs/2311.17138
  • repo_url: None
  • paper_authors: Ayush Sarkar, Hanlin Mai, Amitabh Mahapatra, Svetlana Lazebnik, D. A. Forsyth, Anand Bhattad
  • for: This paper demonstrates that generated images have geometric features different from those of real images, and that current generators cannot reliably reproduce geometric properties of real images.
  • methods: The paper uses a set of collections of generated images that are prequalified to fool simple, signal-based classifiers into believing they are real, and three classifiers that look only at derived geometric features to identify generated images reliably.
  • results: The paper shows that the classifiers can identify generated images more reliably than SOTA local signal-based detectors, for images from a number of distinct generators, and that saliency maps suggest that the classifiers can identify geometric problems reliably.
    Abstract Generative models can produce impressively realistic images. This paper demonstrates that generated images have geometric features different from those of real images. We build a set of collections of generated images, prequalified to fool simple, signal-based classifiers into believing they are real. We then show that prequalified generated images can be identified reliably by classifiers that only look at geometric properties. We use three such classifiers. All three classifiers are denied access to image pixels, and look only at derived geometric features. The first classifier looks at the perspective field of the image, the second looks at lines detected in the image, and the third looks at relations between detected objects and shadows. Our procedure detects generated images more reliably than SOTA local signal based detectors, for images from a number of distinct generators. Saliency maps suggest that the classifiers can identify geometric problems reliably. We conclude that current generators cannot reliably reproduce geometric properties of real images.

Generative Models: What do they know? Do they know things? Let’s find out!

  • paper_url: http://arxiv.org/abs/2311.17137
  • repo_url: None
  • paper_authors: Xiaodan Du, Nicholas Kolkin, Greg Shakhnarovich, Anand Bhattad
  • for: This work shows that generative models internally produce high-quality scene intrinsic maps, and introduces a universal, plug-and-play approach that turns any generative model into a scene intrinsic predictor, extracting intrinsic maps directly from the original generator network without additional decoders or full fine-tuning.
  • methods: The method, Intrinsic LoRA (I-LoRA), applies a Low-Rank Adaptation (LoRA) of key feature maps, with newly learned parameters making up less than 0.6% of the generative model's total parameters. Optimized with a small set of labeled images, the model-agnostic approach adapts to various generative architectures, including diffusion models, GANs, and autoregressive models.
  • results: The scene intrinsic maps produced by this method compare well with, and in some cases surpass, those generated by leading supervised techniques.
    Abstract Generative models have been shown to be capable of synthesizing highly detailed and realistic images. It is natural to suspect that they implicitly learn to model some image intrinsics such as surface normals, depth, or shadows. In this paper, we present compelling evidence that generative models indeed internally produce high-quality scene intrinsic maps. We introduce Intrinsic LoRA (I LoRA), a universal, plug-and-play approach that transforms any generative model into a scene intrinsic predictor, capable of extracting intrinsic scene maps directly from the original generator network without needing additional decoders or fully fine-tuning the original network. Our method employs a Low-Rank Adaptation (LoRA) of key feature maps, with newly learned parameters that make up less than 0.6% of the total parameters in the generative model. Optimized with a small set of labeled images, our model-agnostic approach adapts to various generative architectures, including Diffusion models, GANs, and Autoregressive models. We show that the scene intrinsic maps produced by our method compare well with, and in some cases surpass those generated by leading supervised techniques.
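The LoRA mechanism underlying Intrinsic LoRA can be sketched as a frozen linear map plus a trainable low-rank update. This is generic LoRA applied to a single linear layer, not the paper's exact placement on key feature maps of a generator.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B A x, where W is frozen and only A, B are trained."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                  # frozen pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
_ = layer(torch.randn(2, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")                # small, in the spirit of the <0.6% figure
```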

DiffuseBot: Breeding Soft Robots With Physics-Augmented Generative Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.17053
  • repo_url: None
  • paper_authors: Tsun-Hsuan Wang, Juntian Zheng, Pingchuan Ma, Yilun Du, Byungchul Kim, Andrew Spielberg, Joshua Tenenbaum, Chuang Gan, Daniela Rus
  • for: This paper aims to develop a generative model that can create capable soft robots, serving applications in physical soft robotics and virtual character creation.
  • methods: DiffuseBot is a physics-augmented generative diffusion model: the diffusion process is augmented with a physical dynamical simulation that provides a certificate of performance, and a co-design procedure jointly optimizes physical design and control by leveraging physical sensitivities from differentiable simulation, allowing the model to reason about function atop structure.
  • results: A range of simulated and fabricated robots demonstrates that DiffuseBot generates diverse soft robot morphologies and controllers that excel in a wide spectrum of tasks.
    Abstract Nature evolves creatures with a high complexity of morphological and behavioral intelligence, meanwhile computational methods lag in approaching that diversity and efficacy. Co-optimization of artificial creatures' morphology and control in silico shows promise for applications in physical soft robotics and virtual character creation; such approaches, however, require developing new learning algorithms that can reason about function atop pure structure. In this paper, we present DiffuseBot, a physics-augmented diffusion model that generates soft robot morphologies capable of excelling in a wide spectrum of tasks. DiffuseBot bridges the gap between virtually generated content and physical utility by (i) augmenting the diffusion process with a physical dynamical simulation which provides a certificate of performance, and (ii) introducing a co-design procedure that jointly optimizes physical design and control by leveraging information about physical sensitivities from differentiable simulation. We showcase a range of simulated and fabricated robots along with their capabilities. Check our website at https://diffusebot.github.io/

UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

  • paper_url: http://arxiv.org/abs/2311.17136
  • repo_url: None
  • paper_authors: Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, Wenhu Chen
  • for: This work proposes a unified, instruction-guided multimodal retrieval system able to serve users' diverse information-seeking needs, such as searching for images with text descriptions, news articles with headline images, or similar photos with a query image.
  • methods: UniIR is a single retrieval system jointly trained on ten diverse multimodal-IR datasets; multi-task training and instruction tuning allow it to interpret user instructions and execute eight distinct retrieval tasks across modalities.
  • results: UniIR shows robust performance on existing datasets and zero-shot generalization to new tasks. The authors also construct M-BEIR, a multimodal retrieval benchmark with comprehensive results, to standardize the evaluation of universal multimodal information retrieval.
    Abstract Existing information retrieval (IR) models often assume a homogeneous format, limiting their applicability to diverse user needs, such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image. To approach such different information-seeking demands, we introduce UniIR, a unified instruction-guided multimodal retriever capable of handling eight distinct retrieval tasks across modalities. UniIR, a single retrieval system jointly trained on ten diverse multimodal-IR datasets, interprets user instructions to execute various retrieval tasks, demonstrating robust performance across existing datasets and zero-shot generalization to new tasks. Our experiments highlight that multi-task training and instruction tuning are keys to UniIR's generalization ability. Additionally, we construct the M-BEIR, a multimodal retrieval benchmark with comprehensive results, to standardize the evaluation of universal multimodal information retrieval.

Efficient In-Context Learning in Vision-Language Models for Egocentric Videos

  • paper_url: http://arxiv.org/abs/2311.17041
  • repo_url: None
  • paper_authors: Keunwoo Peter Yu, Zheyuan Zhang, Fengyuan Hu, Joyce Chai
  • for: This work aims to elicit in-context learning in vision-language models (VLMs) for egocentric videos, so that they can adapt to new tasks from a few demonstrations.
  • methods: A novel training method, Efficient In-context Learning on Egocentric Videos (EILEV), elicits in-context learning without requiring massive, naturalistic egocentric video datasets; it combines architectural and training-data adaptations that allow the model to process contexts interleaved with video clips and narrations, sampling of in-context examples from clusters of similar verbs and nouns, and use of data with skewed, long-tailed distributions as well as homonyms and synonyms.
  • results: EILEV-trained models outperform larger VLMs trained on huge amounts of naturalistic data at in-context learning, and generalize via in-context learning not only to out-of-distribution but also to novel, rare egocentric videos and texts, demonstrating potential for cost-effective training and rapid post-deployment adaptability.
    Abstract Recent advancements in text-only large language models (LLMs) have highlighted the benefit of in-context learning for adapting to new tasks with a few demonstrations. However, extending in-context learning to large vision-language models (VLMs) using a huge amount of naturalistic vision-language data has shown limited success, particularly for egocentric videos, due to high data collection costs. We propose a novel training method $\mathbb{E}$fficient $\mathbb{I}$n-context $\mathbb{L}$earning on $\mathbb{E}$gocentric $\mathbb{V}$ideos ($\mathbb{EILEV}$), which elicits in-context learning in VLMs for egocentric videos without requiring massive, naturalistic egocentric video datasets. $\mathbb{EILEV}$ involves architectural and training data adaptations to allow the model to process contexts interleaved with video clips and narrations, sampling of in-context examples with clusters of similar verbs and nouns, use of data with skewed marginal distributions with a long tail of infrequent verbs and nouns, as well as homonyms and synonyms. Our evaluations show that $\mathbb{EILEV}$-trained models outperform larger VLMs trained on a huge amount of naturalistic data in in-context learning. Furthermore, they can generalize to not only out-of-distribution, but also novel, rare egocentric videos and texts via in-context learning, demonstrating potential for applications requiring cost-effective training, and rapid post-deployment adaptability. Our code and demo are available at \url{https://github.com/yukw777/EILEV}.

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

  • paper_url: http://arxiv.org/abs/2311.17030
  • repo_url: https://github.com/amakelov/activation-patching-illusion
  • paper_authors: Aleksandar Makelov, Georg Lange, Neel Nanda
  • for: This work examines mechanistic interpretability, i.e. understanding model behaviors in terms of specific, interpretable features through subspace interventions such as activation patching, and asks whether such interventions really attribute behavior to the targeted subspaces.
  • methods: Subspace activation patching is studied through a distilled mathematical example, two real-world domains (the indirect object identification task and factual recall), and supporting experiments.
  • results: Subspace patching can create an illusory sense of interpretability: even when an intervention makes the model's output behave as if a feature's value was changed, the effect may be achieved by activating a dormant parallel pathway through a subspace that is causally disconnected from model outputs. In factual recall, this phenomenon is linked to rank-1 fact editing, explaining previously observed inconsistencies between fact-editing performance and fact localization; the paper also shows what a success case looks like (indirect object identification, where prior manual circuit analysis informs the location of a feature) and discusses the additional evidence needed to argue that a patched subspace is faithful.
    Abstract Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features, often hypothesized to manifest as low-dimensional subspaces of activations. Specifically, recent studies have explored subspace interventions (such as activation patching) as a way to simultaneously manipulate model behavior and attribute the features behind it to given subspaces. In this work, we demonstrate that these two aims diverge, potentially leading to an illusory sense of interpretability. Counterintuitively, even if a subspace intervention makes the model's output behave as if the value of a feature was changed, this effect may be achieved by activating a dormant parallel pathway leveraging another subspace that is causally disconnected from model outputs. We demonstrate this phenomenon in a distilled mathematical example, in two real-world domains (the indirect object identification task and factual recall), and present evidence for its prevalence in practice. In the context of factual recall, we further show a link to rank-1 fact editing, providing a mechanistic explanation for previous work observing an inconsistency between fact editing performance and fact localization. However, this does not imply that activation patching of subspaces is intrinsically unfit for interpretability. To contextualize our findings, we also show what a success case looks like in a task (indirect object identification) where prior manual circuit analysis informs an understanding of the location of a feature. We explore the additional evidence needed to argue that a patched subspace is faithful.
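The mechanics of subspace activation patching are simple to state: replace the component of an activation lying in a chosen subspace with the corresponding component from a counterfactual run. A minimal numpy sketch follows; the paper's contribution concerns how to interpret the result of such an intervention, not these mechanics.

```python
import numpy as np

def patch_subspace(act_clean: np.ndarray, act_counterfactual: np.ndarray,
                   U: np.ndarray) -> np.ndarray:
    """Swap the component of `act_clean` lying in span(U) (orthonormal columns)
    for the corresponding component of `act_counterfactual`."""
    proj = U @ U.T                                    # projector onto the subspace
    return act_clean - proj @ act_clean + proj @ act_counterfactual

rng = np.random.default_rng(0)
d, k = 64, 2
U, _ = np.linalg.qr(rng.normal(size=(d, k)))          # a random k-dimensional subspace
a_clean, a_cf = rng.normal(size=d), rng.normal(size=d)
patched = patch_subspace(a_clean, a_cf, U)

# inside span(U) the patched activation matches the counterfactual; outside it is unchanged
assert np.allclose(U.T @ patched, U.T @ a_cf)
assert np.allclose(patched - U @ (U.T @ patched), a_clean - U @ (U.T @ a_clean))
```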

When the Few Outweigh the Many: Illicit Content Recognition with Few-Shot Learning

  • paper_url: http://arxiv.org/abs/2311.17026
  • repo_url: None
  • paper_authors: G. Cascavilla, G. Catolino, M. Conti, D. Mellios, D. A. Tamburri
  • for: This work investigates an alternative technique for recognizing illicit activities on the Dark web, classifying them from images originating from dark web content rather than from textual market content.
  • methods: Label-agnostic learning techniques, namely One-Shot and Few-Shot learning with Siamese neural networks (a state-of-the-art approach in the field), are applied to handle small-scale datasets.
  • results: Siamese neural networks reach 90.9% accuracy on 20-shot experiments over a 10-class dataset, suggesting that such models are a promising and cheaper alternative for automated law-enforcement machinery over the dark web.
    Abstract The anonymity and untraceability benefits of the Dark web account for the exponentially-increased potential of its popularity while creating a suitable womb for many illicit activities, to date. Hence, in collaboration with cybersecurity and law enforcement agencies, research has provided approaches for recognizing and classifying illicit activities with most exploiting textual dark web markets' content recognition; few such approaches use images that originated from dark web content. This paper investigates this alternative technique for recognizing illegal activities from images. In particular, we investigate label-agnostic learning techniques like One-Shot and Few-Shot learning featuring the use Siamese neural networks, a state-of-the-art approach in the field. Our solution manages to handle small-scale datasets with promising accuracy. In particular, Siamese neural networks reach 90.9% on 20-Shot experiments over a 10-class dataset; this leads us to conclude that such models are a promising and cheaper alternative to the definition of automated law-enforcing machinery over the dark web.
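A minimal sketch of the Siamese few-shot setup: a single shared encoder embeds images, pairs are pushed together or apart with a contrastive loss, and a query is classified by comparing its embedding to a handful of labelled support embeddings. The architecture and data below are toy placeholders, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(                          # shared weights = the "Siamese" part
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
    nn.Flatten(), nn.Linear(16 * 4 * 4, 64))

def embed(x: torch.Tensor) -> torch.Tensor:
    return F.normalize(encoder(x), dim=-1)

def contrastive_loss(x1, x2, same_class, margin: float = 1.0):
    d = torch.norm(embed(x1) - embed(x2), dim=-1)
    return torch.mean(same_class * d ** 2 +
                      (1 - same_class) * torch.clamp(margin - d, min=0.0) ** 2)

def predict(query, support_images, support_labels):
    """Few-shot inference: label a query by its nearest labelled support example."""
    dists = torch.cdist(embed(query), embed(support_images))
    return support_labels[dists.argmin(dim=-1)]

support = torch.randn(20, 3, 32, 32)              # e.g. a 20-shot support set
labels = torch.randint(0, 10, (20,))              # 10 classes, as in the paper's experiments
print(predict(torch.randn(1, 3, 32, 32), support, labels))
```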

Deployment of a Robust and Explainable Mortality Prediction Model: The COVID-19 Pandemic and Beyond

  • paper_url: http://arxiv.org/abs/2311.17133
  • repo_url: None
  • paper_authors: Jacob R. Epifano, Stephen Glass, Ravi P. Ramachandran, Sharad Patel, Aaron J. Masino, Ghulam Rasool
  • for: This study investigates the performance, explainability, and robustness of deployed artificial intelligence (AI) models for predicting mortality during the COVID-19 pandemic and beyond.
  • methods: The study uses Bayesian Neural Networks (BNNs) and intelligent training techniques, and finds that these models maintain performance amidst significant data shifts.
  • results: BNNs and intelligent training techniques yield high-performing, explainable AI models that provide reliable predictions in real-world clinical settings.
    Abstract This study investigated the performance, explainability, and robustness of deployed artificial intelligence (AI) models in predicting mortality during the COVID-19 pandemic and beyond. The first study of its kind, we found that Bayesian Neural Networks (BNNs) and intelligent training techniques allowed our models to maintain performance amidst significant data shifts. Our results emphasize the importance of developing robust AI models capable of matching or surpassing clinician predictions, even under challenging conditions. Our exploration of model explainability revealed that stochastic models generate more diverse and personalized explanations thereby highlighting the need for AI models that provide detailed and individualized insights in real-world clinical settings. Furthermore, we underscored the importance of quantifying uncertainty in AI models which enables clinicians to make better-informed decisions based on reliable predictions. Our study advocates for prioritizing implementation science in AI research for healthcare and ensuring that AI solutions are practical, beneficial, and sustainable in real-world clinical environments. By addressing unique challenges and complexities in healthcare settings, researchers can develop AI models that effectively improve clinical practice and patient outcomes.
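The kind of uncertainty quantification the study emphasizes can be illustrated with Monte Carlo dropout, a common approximation to Bayesian inference; the toy model and features below are placeholders, and the paper's actual models are Bayesian Neural Networks rather than this approximation.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(30, 64), nn.ReLU(), nn.Dropout(0.2),
                      nn.Linear(64, 1), nn.Sigmoid())     # toy mortality-risk model

@torch.no_grad()
def predict_with_uncertainty(x: torch.Tensor, n_samples: int = 50):
    model.train()                                          # keep dropout active at inference
    preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)             # predictive mean and spread

x = torch.randn(4, 30)                                     # 4 patients, 30 hypothetical clinical features
mean, std = predict_with_uncertainty(x)
for m, s in zip(mean.squeeze(-1), std.squeeze(-1)):
    print(f"predicted risk {m.item():.2f} +/- {s.item():.2f}")  # wide spread -> flag for clinician review
```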

Foundational Moral Values for AI Alignment

  • paper_url: http://arxiv.org/abs/2311.17017
  • repo_url: None
  • paper_authors: Betty Li Hou, Brian Patrick Green
  • for: This work aims to provide clear, defensible values towards which AI systems can align, grounded in the requisites for human existence.
  • methods: Five core, foundational values drawn from moral philosophy and built on the requisites for human existence are proposed as targets for technical alignment: survival, sustainable intergenerational existence, society, education, and truth.
  • results: These values not only provide a clearer direction for technical alignment work, but also serve as a framework to highlight threats and opportunities from AI systems to both obtain and sustain these values.
    Abstract Solving the AI alignment problem requires having clear, defensible values towards which AI systems can align. Currently, targets for alignment remain underspecified and do not seem to be built from a philosophically robust structure. We begin the discussion of this problem by presenting five core, foundational values, drawn from moral philosophy and built on the requisites for human existence: survival, sustainable intergenerational existence, society, education, and truth. We show that these values not only provide a clearer direction for technical alignment work, but also serve as a framework to highlight threats and opportunities from AI systems to both obtain and sustain these values.

TransNeXt: Robust Foveal Visual Perception for Vision Transformers

  • paper_url: http://arxiv.org/abs/2311.17132
  • repo_url: None
  • paper_authors: Dai Shi
  • for: Improving the natural visual perception and local modeling capability of vision transformer models.
  • methods: The paper proposes Aggregated Attention, a token mixer inspired by biological foveal vision and continuous eye movement, which adds learnable tokens that interact with conventional queries and keys to diversify the generation of affinity matrices beyond query-key similarity; it also proposes Convolutional GLU, a channel mixer bridging the GLU and SE mechanisms so that each token has channel attention based on its nearest-neighbor image features.
  • results: The resulting TransNeXt backbone achieves state-of-the-art performance across multiple model sizes: TransNeXt-Tiny attains 84.0% ImageNet accuracy at 224^2 resolution, surpassing ConvNeXt-B with 69% fewer parameters, and TransNeXt-Base achieves 86.2% ImageNet accuracy and 61.6% ImageNet-A accuracy at 384^2 resolution, a COCO object detection mAP of 57.1, and an ADE20K semantic segmentation mIoU of 54.7.
    Abstract Due to the depth degradation effect in residual connections, many efficient Vision Transformers models that rely on stacking layers for information exchange often fail to form sufficient information mixing, leading to unnatural visual perception. To address this issue, in this paper, we propose Aggregated Attention, a biomimetic design-based token mixer that simulates biological foveal vision and continuous eye movement while enabling each token on the feature map to have a global perception. Furthermore, we incorporate learnable tokens that interact with conventional queries and keys, which further diversifies the generation of affinity matrices beyond merely relying on the similarity between queries and keys. Our approach does not rely on stacking for information exchange, thus effectively avoiding depth degradation and achieving natural visual perception. Additionally, we propose Convolutional GLU, a channel mixer that bridges the gap between GLU and SE mechanism, which empowers each token to have channel attention based on its nearest neighbor image features, enhancing local modeling capability and model robustness. We combine aggregated attention and convolutional GLU to create a new visual backbone called TransNeXt. Extensive experiments demonstrate that our TransNeXt achieves state-of-the-art performance across multiple model sizes. At a resolution of $224^2$, TransNeXt-Tiny attains an ImageNet accuracy of 84.0%, surpassing ConvNeXt-B with 69% fewer parameters. Our TransNeXt-Base achieves an ImageNet accuracy of 86.2% and an ImageNet-A accuracy of 61.6% at a resolution of $384^2$, a COCO object detection mAP of 57.1, and an ADE20K semantic segmentation mIoU of 54.7.
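A rough sketch of a gated channel mixer in the spirit of the Convolutional GLU described above: the gating branch is passed through a depthwise convolution, so each token's channel attention depends on its local neighborhood. This is an interpretation of the abstract, not the paper's exact module.

```python
import torch
import torch.nn as nn

class ConvGLU(nn.Module):
    """Gated channel mixer: the gate branch sees a 3x3 depthwise-convolved
    neighborhood, so each token's channel gating depends on nearby features."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc_in = nn.Linear(dim, 2 * hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.fc_out = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, H, W, C)
        value, gate = self.fc_in(x).chunk(2, dim=-1)
        gate = gate.permute(0, 3, 1, 2)                    # (B, hidden, H, W) for the conv
        gate = torch.sigmoid(self.dwconv(gate)).permute(0, 2, 3, 1)
        return self.fc_out(value * gate)

block = ConvGLU(dim=96, hidden=192)
tokens = torch.randn(2, 14, 14, 96)                        # a 14x14 feature map of tokens
out = block(tokens)                                        # same shape as the input
```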

Computational Hypergraph Discovery, a Gaussian Process framework for connecting the dots

  • paper_url: http://arxiv.org/abs/2311.17007
  • repo_url: https://github.com/theobourdais/computationalhypergraphdiscovery
  • paper_authors: Théo Bourdais, Pau Batlle, Xianjin Yang, Ricardo Baptista, Nicolas Rouquette, Houman Owhadi
  • for: This paper targets data-driven problems in Computational Science and Engineering and Scientific Machine Learning, in particular "Type 3" problems in which the hypergraph of functional dependencies among variables is itself unknown and must be discovered from data.
  • methods: An interpretable Gaussian Process (GP) framework is proposed, based on a kernel generalization of Row Echelon Form reduction from linear to nonlinear systems and on variance-based analysis: variables are linked via GPs, and those contributing the most to the data variance reveal the hypergraph's structure.
  • results: The framework enables the data-driven discovery and completion of computational hypergraphs; applications to (algebraic) equation discovery, network discovery (gene pathways, chemical, and mechanical), and raw data analysis illustrate the scope and efficiency of the approach.
    Abstract Most scientific challenges can be framed into one of the following three levels of complexity of function approximation. Type 1: Approximate an unknown function given input/output data. Type 2: Consider a collection of variables and functions, some of which are unknown, indexed by the nodes and hyperedges of a hypergraph (a generalized graph where edges can connect more than two vertices). Given partial observations of the variables of the hypergraph (satisfying the functional dependencies imposed by its structure), approximate all the unobserved variables and unknown functions. Type 3: Expanding on Type 2, if the hypergraph structure itself is unknown, use partial observations of the variables of the hypergraph to discover its structure and approximate its unknown functions. While most Computational Science and Engineering and Scientific Machine Learning challenges can be framed as Type 1 and Type 2 problems, many scientific problems can only be categorized as Type 3. Despite their prevalence, these Type 3 challenges have been largely overlooked due to their inherent complexity. Although Gaussian Process (GP) methods are sometimes perceived as well-founded but old technology limited to Type 1 curve fitting, their scope has recently been expanded to Type 2 problems. In this paper, we introduce an interpretable GP framework for Type 3 problems, targeting the data-driven discovery and completion of computational hypergraphs. Our approach is based on a kernel generalization of Row Echelon Form reduction from linear systems to nonlinear ones and variance-based analysis. Here, variables are linked via GPs and those contributing to the highest data variance unveil the hypergraph's structure. We illustrate the scope and efficiency of the proposed approach with applications to (algebraic) equation discovery, network discovery (gene pathways, chemical, and mechanical) and raw data analysis.
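A loose illustration of the variance-based idea using an off-the-shelf GP: regress one variable on candidate parents with an anisotropic (ARD) kernel, then read relevance off the learned per-input length scales. This simplified stand-in is not the paper's kernel generalization of Row Echelon Form reduction.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))                  # candidate parent variables x1, x2, x3
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.01 * rng.normal(size=200)  # x3 is irrelevant

kernel = RBF(length_scale=[1.0, 1.0, 1.0]) + WhiteKernel(noise_level=1e-2)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# small learned length scale -> the input matters; very large -> effectively disconnected
length_scales = gp.kernel_.k1.length_scale
for i, ell in enumerate(length_scales):
    print(f"x{i + 1}: length scale {ell:.2f}")
```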
    摘要 多科学挑战可以划分为以下三级复杂性函数近似: Type 1:给定输入/输出数据,近似未知函数。 Type 2:考虑一个包含多个变量和函数的超graph(一种通用图,其边可以连接多个顶点),其中一些变量是未知的。给定超graph的部分观察数据,近似所有未知变量和未知函数。 Type 3:在 Type 2 基础上,如果超graph的结构本身是未知的,使用超graph的部分观察数据来探索其结构并近似其未知函数。虽然大多数计算科学和数学机器学习挑战都可以划分为 Type 1 和 Type 2 问题,但是许多科学问题只能划分为 Type 3 问题。尽管它们的复杂性使得它们受到了相对较少的关注,但是它们的存在是普遍的。虽然 Gaussian Process(GP)方法有时被视为已有的技术,但它们的范围已经扩展到 Type 2 问题。在这篇论文中,我们介绍了一种可解释的 GP 框架,用于 Type 3 问题,targeting 数据驱动的 computational hypergraph 的发现和完善。我们的方法基于 GP 的kernel化矩阵分解,并通过对数据的变差进行分析。在这种方法中,变量通过 GP 连接,而对数据变差最大的变量揭示出 computational hypergraph 的结构。我们通过应用于(代数)方程发现、网络发现(生物、化学和机械)和原始数据分析等场景来说明本方法的范围和效率。
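
To give a flavor of the variance-based analysis described above, here is a toy scikit-learn sketch: it fits a Gaussian Process of a target variable against candidate parent variables and uses the drop in fit quality when a variable is removed as a simple proxy for that variable's contribution. The data, kernel, and scoring rule are illustrative assumptions, not the paper's actual kernel generalization.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # candidate parent variables x0, x1, x2
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2       # ground truth depends only on x0 and x1

def gp_score(X, y):
    """Explained-variance (R^2) score of a GP fit; 1 = perfect, <= 0 = useless."""
    gp = GaussianProcessRegressor(RBF() + WhiteKernel(), normalize_y=True).fit(X, y)
    return gp.score(X, y)

full = gp_score(X, y)
for j in range(X.shape[1]):
    reduced = gp_score(np.delete(X, j, axis=1), y)
    print(f"drop x{j}: score {reduced:.3f} (full {full:.3f})")
# Variables whose removal hurts the fit the most are kept as hyperedge parents.
```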

Goal-conditioned Offline Planning from Curious Exploration

  • paper_url: http://arxiv.org/abs/2311.16996
  • repo_url: https://github.com/martius-lab/gcopfce
  • paper_authors: Marco Bagatella, Georg Martius
  • for: 该论文主要关注如何从无监督探索技术的产物中提取目标条件(goal-conditioned)行为,且无需与环境进行额外互动。
  • methods: 该论文提出了一种将基于模型的规划与基于图的值聚合相结合的方法,以修正学习到的值函数中的估计误差,提高零样本目标达成性能。
  • results: 该论文在多种模拟环境中表明,该方法可以提高零次目标达成性能,并且可以修正本地和全局的估计误差。
    Abstract Curiosity has established itself as a powerful exploration strategy in deep reinforcement learning. Notably, leveraging expected future novelty as intrinsic motivation has been shown to efficiently generate exploratory trajectories, as well as a robust dynamics model. We consider the challenge of extracting goal-conditioned behavior from the products of such unsupervised exploration techniques, without any additional environment interaction. We find that conventional goal-conditioned reinforcement learning approaches for extracting a value function and policy fall short in this difficult offline setting. By analyzing the geometry of optimal goal-conditioned value functions, we relate this issue to a specific class of estimation artifacts in learned values. In order to mitigate their occurrence, we propose to combine model-based planning over learned value landscapes with a graph-based value aggregation scheme. We show how this combination can correct both local and global artifacts, obtaining significant improvements in zero-shot goal-reaching performance across diverse simulated environments.
    摘要 好奇心(curiosity)已成为深度强化学习中一种强大的探索策略。特别是将预期的未来新颖性作为内在动机,已被证明能够高效地生成探索轨迹以及稳健的动力学模型。我们面临的挑战是:在不进行任何额外环境交互的情况下,从这类无监督探索技术的产物中提取目标条件行为。我们发现,用于提取值函数和策略的传统目标条件强化学习方法在这种困难的离线设定下表现不佳。通过分析最优目标条件值函数的几何性质,我们将该问题与学习值函数中一类特定的估计伪影联系起来。为了减少这类伪影,我们提议将基于模型的规划(在学习到的值函数地形上进行)与基于图的值聚合方案相结合。我们证明这种组合能够同时纠正局部和全局的估计伪影,从而在多种模拟环境中显著提升零样本目标达成性能。
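
A minimal sketch of the graph-based value aggregation idea, under illustrative assumptions: landmark states are taken from a buffer, a stand-in `learned_distance` plays the role of a distance implied by a learned goal-conditioned value, and only short-horizon (local) estimates are trusted as graph edges, so that long-horizon values are composed via shortest paths.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
landmarks = rng.uniform(0, 1, size=(30, 2))          # states sampled from a replay buffer

def learned_distance(a, b):
    """Stand-in for a distance implied by a learned goal-conditioned value,
    e.g. d = -log V(s, g) for a discounted reach-probability value."""
    return float(np.linalg.norm(a - b)) + rng.normal(0, 0.02)   # noisy local estimate

# Only trust short-horizon estimates (edges below a radius), then aggregate them
# into long-horizon values by composing them along shortest paths.
G = nx.DiGraph()
radius = 0.3
for i, a in enumerate(landmarks):
    for j, b in enumerate(landmarks):
        if i != j and np.linalg.norm(a - b) < radius:
            G.add_edge(i, j, weight=max(learned_distance(a, b), 0.0))

start, goal = 0, 17
if nx.has_path(G, start, goal):
    cost = nx.shortest_path_length(G, start, goal, weight="weight")
    print("aggregated start->goal cost:", round(cost, 3))
else:
    print("goal not reachable through trusted local edges")
```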

Debiasing Multimodal Models via Causal Information Minimization

  • paper_url: http://arxiv.org/abs/2311.16941
  • repo_url: https://github.com/vaidehi99/causalinfomin
  • paper_authors: Vaidehi Patil, Adyasha Maharana, Mohit Bansal
  • For: The paper addresses bias in multimodal models, in particular the reliance on approximate heuristics to represent biases, and proposes an approach that leverages causally-motivated information minimization to learn confounder representations and remove bias from models.
  • Methods: The paper uses causal graphs to study bias arising from confounders in multimodal data. It minimizes the information content of features obtained from a pretrained biased model, treats the resulting features as confounder representations, and uses them via causally-motivated techniques to remove bias.
  • Results: The learned confounder representations capture dataset biases, and the proposed debiasing methods improve out-of-distribution (OOD) performance on multiple multimodal datasets without sacrificing in-distribution performance. The paper also introduces a novel metric quantifying the sufficiency of spurious features in models' predictions, further demonstrating the effectiveness of the proposed methods.
    Abstract Most existing debiasing methods for multimodal models, including causal intervention and inference methods, utilize approximate heuristics to represent the biases, such as shallow features from early stages of training or unimodal features for multimodal tasks like VQA, etc., which may not be accurate. In this paper, we study bias arising from confounders in a causal graph for multimodal data and examine a novel approach that leverages causally-motivated information minimization to learn the confounder representations. Robust predictive features contain diverse information that helps a model generalize to out-of-distribution data. Hence, minimizing the information content of features obtained from a pretrained biased model helps learn the simplest predictive features that capture the underlying data distribution. We treat these features as confounder representations and use them via methods motivated by causal theory to remove bias from models. We find that the learned confounder representations indeed capture dataset biases, and the proposed debiasing methods improve out-of-distribution (OOD) performance on multiple multimodal datasets without sacrificing in-distribution performance. Additionally, we introduce a novel metric to quantify the sufficiency of spurious features in models' predictions that further demonstrates the effectiveness of our proposed methods. Our code is available at: https://github.com/Vaidehi99/CausalInfoMin
    摘要 现有的多Modal模型偏见纠正方法,包括 causal intervention 和推理方法,通常使用估计的方法来表示偏见,例如在训练的早期阶段获得的浅层特征或多Modal任务中的单modal特征等,这些方法可能不准确。在这篇文章中,我们研究了多Modal数据中的偏见,并研究了一种新的方法,即通过 causally-motivated information minimization 来学习偏见表示。我们发现这些表示 capture 了数据分布下的偏见,并且使用这些表示来 removes 偏见从模型中。我们发现这些方法可以在多个多Modal dataset 上提高 OOD 性能,而不是牺牲在 Distribution 性能。此外,我们引入了一个新的度量来衡量模型预测中的幂制特征的充分性,这进一步证明了我们提出的方法的有效性。我们的代码可以在 GitHub 上找到:https://github.com/Vaidehi99/CausalInfoMin。
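
The following is a hypothetical sketch of the "minimize the information content of biased features" idea using a variational-bottleneck-style objective; the paper's actual objective, feature source, and hyperparameters may differ, and `feat_dim`, `code_dim`, and the KL weight here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoMinConfounder(nn.Module):
    """Variational bottleneck over features of a pretrained *biased* model.

    Minimizing reconstruction + beta * KL keeps only the simplest, most
    compressive part of the biased features; that code is then treated as the
    confounder representation (a sketch of the idea, not the paper's code).
    """
    def __init__(self, feat_dim=512, code_dim=32):
        super().__init__()
        self.to_stats = nn.Linear(feat_dim, 2 * code_dim)
        self.decode = nn.Linear(code_dim, feat_dim)

    def forward(self, biased_feats):
        mu, logvar = self.to_stats(biased_feats).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()          # reparameterization
        recon = self.decode(z)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        rec = F.mse_loss(recon, biased_feats)
        return z, rec + 1e-2 * kl                                      # z = confounder code

feats = torch.randn(8, 512)            # e.g. pooled features from a biased VQA model
z, loss = InfoMinConfounder()(feats)
loss.backward()
print(z.shape, float(loss))
```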

Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

  • paper_url: http://arxiv.org/abs/2311.16922
  • repo_url: https://github.com/damo-nlp-sg/vcd
  • paper_authors: Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing
  • for: 提高大观语言模型(LVLM)的可靠性,减少模型生成的对象幻觉问题。
  • methods: 提出了一种名为视觉对比解码(VCD)的简单、训练自由方法,通过对原始和受损视觉输入的输出分布进行对比,减少模型对统计偏见和单Modal先天假设的依赖。
  • results: VCD在不同LVLM家族上进行测试,能够有效地减少对象幻觉问题,同时在普通LVLM测试中也表现出优异的表现,这表明VCD的广泛适用性。
    Abstract Large Vision-Language Models (LVLMs) have advanced considerably, intertwining visual recognition and language understanding to generate content that is not only coherent but also contextually attuned. Despite their success, LVLMs still suffer from the issue of object hallucinations, where models generate plausible yet incorrect outputs that include objects that do not exist in the images. To mitigate this issue, we introduce Visual Contrastive Decoding (VCD), a simple and training-free method that contrasts output distributions derived from original and distorted visual inputs. The proposed VCD effectively reduces the over-reliance on statistical bias and unimodal priors, two essential causes of object hallucinations. This adjustment ensures the generated content is closely grounded to visual inputs, resulting in contextually accurate outputs. Our experiments show that VCD, without either additional training or the usage of external tools, significantly mitigates the object hallucination issue across different LVLM families. Beyond mitigating object hallucinations, VCD also excels in general LVLM benchmarks, highlighting its wide-ranging applicability.
    摘要 大型视觉语言模型(LVLM)在视觉识别与语言理解的结合上取得了长足进展,能够生成连贯且符合语境的内容。然而,LVLM 仍存在对象幻觉问题,即模型会生成看似合理但实际错误、包含图像中并不存在的对象的输出。为缓解这一问题,我们提出了视觉对比解码(VCD),一种简单且无需训练的方法,通过对比原始视觉输入与受扰视觉输入所得到的输出分布,降低模型对统计偏差和单模态先验的过度依赖,而这两者正是对象幻觉的重要成因。这一调整使生成内容更紧密地贴合视觉输入,从而得到符合语境的准确输出。实验表明,VCD 无需额外训练或外部工具,即可在不同的 LVLM 系列上显著缓解对象幻觉问题;同时,VCD 在通用 LVLM 基准上也表现出色,体现了其广泛的适用性。
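
The core contrastive-decoding step can be sketched as a pure function over next-token logits. One commonly described form of this contrast is shown below; the distortion scheme (e.g., a heavily noised image), the value of `alpha`, and the toy vocabulary are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

def vcd_logits(logits_clean, logits_distorted, alpha=1.0):
    """Contrastive combination of next-token logits from the original image and
    from a distorted copy. Tokens whose score does not drop when the image is
    destroyed are prior-driven and get down-weighted; the (1 + alpha) / alpha
    form and the alpha value are illustrative."""
    return (1 + alpha) * logits_clean - alpha * logits_distorted

# Toy example: token 0 is visually grounded, token 1 is driven by language priors.
clean     = torch.tensor([2.0, 2.0, 0.0, 0.0])
distorted = torch.tensor([0.0, 2.0, 0.0, 0.0])   # grounded evidence vanishes, prior survives
print("standard decoding:", torch.softmax(clean, -1).tolist())
print("VCD decoding:     ", torch.softmax(vcd_logits(clean, distorted), -1).tolist())
```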

RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D

  • paper_url: http://arxiv.org/abs/2311.16918
  • repo_url: None
  • paper_authors: Lingteng Qiu, Guanying Chen, Xiaodong Gu, Qi Zuo, Mutian Xu, Yushuang Wu, Weihao Yuan, Zilong Dong, Liefeng Bo, Xiaoguang Han
  • for: 提高2D扩散为3D生成的精度和细节
  • methods: 学习一种可泛化的法向-深度(Normal-Depth)扩散模型,并借助可泛化的图像到深度和图像到法向先验模型进行训练
  • results: 在与其他文本到3D管道集成后,模型可以显著提高细节和精度,达到当前最佳效果
    Abstract Lifting 2D diffusion for 3D generation is a challenging problem due to the lack of geometric prior and the complex entanglement of materials and lighting in natural images. Existing methods have shown promise by first creating the geometry through score-distillation sampling (SDS) applied to rendered surface normals, followed by appearance modeling. However, relying on a 2D RGB diffusion model to optimize surface normals is suboptimal due to the distribution discrepancy between natural images and normals maps, leading to instability in optimization. In this paper, recognizing that the normal and depth information effectively describe scene geometry and be automatically estimated from images, we propose to learn a generalizable Normal-Depth diffusion model for 3D generation. We achieve this by training on the large-scale LAION dataset together with the generalizable image-to-depth and normal prior models. In an attempt to alleviate the mixed illumination effects in the generated materials, we introduce an albedo diffusion model to impose data-driven constraints on the albedo component. Our experiments show that when integrated into existing text-to-3D pipelines, our models significantly enhance the detail richness, achieving state-of-the-art results. Our project page is https://lingtengqiu.github.io/RichDreamer/.
    摘要 提高2D扩散为3D生成是一个具有挑战性的问题,因为自然图像中的物理和照明复杂相互纠缠,缺乏几何先验。现有方法通过首先通过 Rendered surface normals 进行 score-distillation sampling (SDS) 来创建 geometry,然后进行外观模型化。然而,通过2D RGB 扩散模型来优化表面法向是不优化的,因为自然图像和表面法向图像的分布差异导致优化过程中的不稳定。在这篇论文中,我们认为场景几何信息可以通过自动计算 Normal 和深度信息来描述,并且可以从图像中自动估计。因此,我们提议学习一种通用的 Normal-Depth 扩散模型。我们在 LAION 数据集上进行大规模训练,并且使用通用的图像-到-深度和 Normal 先验模型。为了解决生成物体材质中的混合照明效果,我们引入了一种 albedo 扩散模型,以在生成的材质中强制实施数据驱动的约束。我们的实验表明,当我们的模型与现有的文本-到-3D 管道结合使用时,可以增加细节充实,达到领域前景的最佳结果。我们的项目页面是
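
For readers unfamiliar with score-distillation sampling (SDS), on which this pipeline builds, here is a generic, hypothetical sketch of one SDS-style update. It is not RichDreamer's Normal-Depth model; `render`, `diffusion_eps`, and the toy noise schedule are stand-ins.

```python
import torch

def sds_step(render_params, render, diffusion_eps, num_timesteps=1000, guidance_w=1.0):
    """One generic score-distillation step (a sketch of SDS, not RichDreamer itself).

    render(params) -> image tensor that requires grad w.r.t. params.
    diffusion_eps(noisy_image, t) -> predicted noise from any pretrained 2D diffusion
    prior (in the paper this role is played by a normal-depth prior).
    """
    image = render(render_params)                              # differentiable rendering
    t = torch.randint(1, num_timesteps, (1,)).item()
    alpha_bar = 1.0 - t / num_timesteps                        # toy noise schedule
    noise = torch.randn_like(image)
    noisy = alpha_bar ** 0.5 * image + (1 - alpha_bar) ** 0.5 * noise
    with torch.no_grad():
        eps_pred = diffusion_eps(noisy, t)
    # SDS treats (eps_pred - noise) as the gradient of the distillation loss w.r.t. the image.
    image.backward(gradient=guidance_w * (eps_pred - noise))

# Toy usage: the "3D params" are just a learnable image; the "diffusion prior" is a dummy.
params = torch.zeros(1, 3, 64, 64, requires_grad=True)
sds_step(params, render=lambda p: p * 1.0, diffusion_eps=lambda x, t: torch.randn_like(x))
print(params.grad.abs().mean())
```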

Optimization Theory Based Deep Reinforcement Learning for Resource Allocation in Ultra-Reliable Wireless Networked Control Systems

  • paper_url: http://arxiv.org/abs/2311.16895
  • repo_url: None
  • paper_authors: Hamida Qumber Ali, Amirhassan Babazadeh Darabi, Sinem Coleri
  • for: 本研究 targets the joint design of controller and communication systems for Wireless Networked Control Systems (WNCS), with the objective of minimizing power consumption while satisfying schedulability and rate constraints, and stability constraints.
  • methods: 本研究提出了一种基于优化理论的深度强化学习框架,包括两个阶段:优化理论阶段和深度学习阶段。在优化理论阶段,根据问题的数学建模推导最优性条件,得到决策变量最优值之间的数学关系,从而将问题分解成多个构建块。在深度学习阶段,用深度强化学习替代其中经简化后仍不可解析求解的块。
  • results: 大量仿真实验表明,所提出的基于优化理论的深度强化学习方法优于单纯的优化理论方法和纯深度强化学习方法,在接近最优性能的同时具有更低的复杂度。
    Abstract The design of Wireless Networked Control System (WNCS) requires addressing critical interactions between control and communication systems with minimal complexity and communication overhead while providing ultra-high reliability. This paper introduces a novel optimization theory based deep reinforcement learning (DRL) framework for the joint design of controller and communication systems. The objective of minimum power consumption is targeted while satisfying the schedulability and rate constraints of the communication system in the finite blocklength regime and stability constraint of the control system. Decision variables include the sampling period in the control system, and blocklength and packet error probability in the communication system. The proposed framework contains two stages: optimization theory and DRL. In the optimization theory stage, following the formulation of the joint optimization problem, optimality conditions are derived to find the mathematical relations between the optimal values of the decision variables. These relations allow the decomposition of the problem into multiple building blocks. In the DRL stage, the blocks that are simplified but not tractable are replaced by DRL. Via extensive simulations, the proposed optimization theory based DRL approach is demonstrated to outperform the optimization theory and pure DRL based approaches, with close to optimal performance and much lower complexity.
    摘要 wireless网络控制系统(WNCS)的设计需要考虑控制和通信系统之间的关键互动,以最小化复杂性和通信负担,同时保证超高可靠性。本文介绍一种基于深度学习的优化理论框架,用于joint控制和通信系统的设计。目标是最小化能耗,同时满足通信系统的剩余时间和速率约束,以及控制系统的稳定性约束。决策变量包括控制系统中的采样时间,以及通信系统中的块长度和错误率。提议的框架包括两个阶段:优化理论和深度学习。在优化理论阶段,根据联合优化问题的形ulation,获得优化条件,以找到决策变量的数学关系。这些关系允许将问题分解为多个建筑块。在深度学习阶段,使用深度学习取代了不可解析的块,以简化问题。经过广泛的仿真实验,提议的优化理论基于深度学习方法在性能和复杂性方面均有较好的表现,与优化理论和纯深度学习方法相比,具有更高的性能和远远低于的复杂性。

Vulnerability Analysis of Transformer-based Optical Character Recognition to Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2311.17128
  • repo_url: None
  • paper_authors: Lucas Beerens, Desmond J. Higham
  • for: 这个论文主要研究了基于变换器的光学字符识别(OCR)系统的安全性。
  • methods: 作者们开发了一个新的评估攻击鲁棒性的框架,对无目标(untargeted)和有目标(targeted)攻击分别进行评估。
  • results: 研究发现,基于变换器的 OCR 系统在无目标攻击下非常脆弱,CER 可以超过 1 而扰动肉眼难以察觉。在有目标攻击下,成功率约为 25%,其设定是攻击单个 token,要求 TrOCR 输出大词表中第十高概率的 token。
    Abstract Recent advancements in Optical Character Recognition (OCR) have been driven by transformer-based models. OCR systems are critical in numerous high-stakes domains, yet their vulnerability to adversarial attack remains largely uncharted territory, raising concerns about security and compliance with emerging AI regulations. In this work we present a novel framework to assess the resilience of Transformer-based OCR (TrOCR) models. We develop and assess algorithms for both targeted and untargeted attacks. For the untargeted case, we measure the Character Error Rate (CER), while for the targeted case we use the success ratio. We find that TrOCR is highly vulnerable to untargeted attacks and somewhat less vulnerable to targeted attacks. On a benchmark handwriting data set, untargeted attacks can cause a CER of more than 1 without being noticeable to the eye. With a similar perturbation size, targeted attacks can lead to success rates of around $25\%$ -- here we attacked single tokens, requiring TrOCR to output the tenth most likely token from a large vocabulary.
    摘要 光学字符识别(OCR)的最新进展由 transformer 模型推动。OCR 系统广泛应用于众多高风险领域,但其面对对抗攻击的脆弱性在很大程度上仍未被研究,这引发了安全性以及与新兴 AI 法规合规性方面的担忧。在这项工作中,我们提出了一种新的框架来评估基于 transformer 的 OCR(TrOCR)模型的鲁棒性。我们开发并评估了针对无目标和有目标攻击的算法。对于无目标情况,我们使用字符错误率(CER)来衡量;对于有目标情况,我们使用成功率。我们发现 TrOCR 对无目标攻击非常脆弱,而对有目标攻击的脆弱性相对较低。在一个标准手写数据集上,无目标攻击可以在肉眼难以察觉的情况下使 CER 超过 1。在相近的扰动幅度下,有目标攻击的成功率约为 25%,此处我们攻击单个 token,要求 TrOCR 输出大词表中第十高概率的 token。
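
A minimal sketch of an untargeted FGSM-style attack on TrOCR, in the spirit of the evaluation above (not the authors' framework). It assumes network access to download the public microsoft/trocr-base-handwritten checkpoint; the blank dummy image, the reference transcript, and epsilon are illustrative, and the CER between clean and adversarial outputs could then be computed with a library such as jiwer.

```python
import torch
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten").eval()

image = Image.new("RGB", (384, 384), "white")              # stand-in for a handwriting crop
pixel_values = processor(images=image, return_tensors="pt").pixel_values.requires_grad_(True)
labels = processor.tokenizer("hello world", return_tensors="pt").input_ids

loss = model(pixel_values=pixel_values, labels=labels).loss   # loss w.r.t. the correct transcript
loss.backward()

eps = 8 / 255                                               # untargeted FGSM: move *against* the label
adv = (pixel_values + eps * pixel_values.grad.sign()).detach()

for name, x in [("clean", pixel_values.detach()), ("adversarial", adv)]:
    ids = model.generate(x, max_new_tokens=16)
    print(name, "->", processor.batch_decode(ids, skip_special_tokens=True)[0])
```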

The Falcon Series of Open Language Models

  • paper_url: http://arxiv.org/abs/2311.16867
  • repo_url: None
  • paper_authors: Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, Guilherme Penedo
  • For: The paper presents the Falcon series of causal decoder-only models, trained on large and diverse corpora predominantly assembled from web data.
  • Methods: The models are pretrained with a custom distributed training codebase on up to 4,096 A100s on cloud AWS infrastructure with limited interconnect.
  • Results: The largest model, Falcon-180B, significantly outperforms models such as PaLM or Chinchilla and approaches the performance of PaLM-2-Large at a reduced pretraining and inference cost.
    Abstract We introduce the Falcon series: 7B, 40B, and 180B parameters causal decoder-only models trained on a diverse high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text--the largest openly documented pretraining run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurrently developed models such as LLaMA 2 or Inflection-1. It nears the performance of PaLM-2-Large at a reduced pretraining and inference cost, making it, to our knowledge, one of the three best language models in the world along with GPT-4 and PaLM-2-Large. We report detailed evaluations, as well as a deep dive into the methods and custom tooling employed to pretrain Falcon. Notably, we report on our custom distributed training codebase, allowing us to efficiently pretrain these models on up to 4,096 A100s on cloud AWS infrastructure with limited interconnect. We release a 600B tokens extract of our web dataset, as well as the Falcon-7/40/180B models under a permissive license to foster open-science and accelerate the development of an open ecosystem of large language models.
    摘要 我们介绍飞鹰系列:7B、40B和180B参数 causal 解oder-仅模型,在高品质的网络数据集上进行了各种多样化的训练。最大模型飞鹰-180B 已经在超过3.5兆个字元的文本上进行了最大的公开文献训练。飞鹰-180B 对 PaLM 或 Chinchilla 等模型进行了明显的超越,并且在与 LLamA 2 或 Inflection-1 等模型同时发展的情况下,也获得了进一步的改进。它在与 PaLM-2-Large 的性能几乎相等,但具有较低的训练和测试成本,使其成为我们所知道的世界上第三好的语言模型之一,只次于 GPT-4 和 PaLM-2-Large。我们在详细的评估和自订工具的使用方面进行了深入的检视,包括我们自己开发的分布式训练代码库,让我们可以将这些模型在云端 AWS 基础设施上进行高效的训练,并且限制了互connect。我们释出了600亿个字元的网络数据抽出,以及飞鹰-7/40/180B 三个模型,并在允许性的授权下释出,以促进开源的科学和加速开发大型语言模型的开放生态系统。
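
A minimal usage sketch for the released checkpoints via Hugging Face transformers; it assumes a recent transformers version (older versions may additionally need `trust_remote_code=True`), the `accelerate` package, and a GPU with enough memory. The prompt and sampling settings are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "tiiuae/falcon-7b"          # released under a permissive license, per the abstract
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,        # assumes a GPU with bfloat16 support and ~16 GB memory
    device_map="auto",                 # requires the `accelerate` package
)

inputs = tokenizer("Open language models matter because", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```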

Edge AI for Internet of Energy: Challenges and Perspectives

  • paper_url: http://arxiv.org/abs/2311.16851
  • repo_url: None
  • paper_authors: Yassine Himeur, Aya Nabil Sayed, Abdullah Alsalemi, Faycal Bensaali, Abbes Amira
  • for: 这篇评论探讨了智能边缘技术如何重塑互联网络的能源互联网环境。
  • methods: 该评论采用了严格的研究方法,探讨了特制的边缘人工智能技术,并对其各种优点进行了详细的描述。
  • results: 该评论总结了边缘人工智能在互联网络中的多种优点,包括减少延迟和实时分析,以及关键的信息安全、可扩展性和成本效益等方面的进步。
    Abstract The digital landscape of the Internet of Energy (IoE) is on the brink of a revolutionary transformation with the integration of edge Artificial Intelligence (AI). This comprehensive review elucidates the promise and potential that edge AI holds for reshaping the IoE ecosystem. Commencing with a meticulously curated research methodology, the article delves into the myriad of edge AI techniques specifically tailored for IoE. The myriad benefits, spanning from reduced latency and real-time analytics to the pivotal aspects of information security, scalability, and cost-efficiency, underscore the indispensability of edge AI in modern IoE frameworks. As the narrative progresses, readers are acquainted with pragmatic applications and techniques, highlighting on-device computation, secure private inference methods, and the avant-garde paradigms of AI training on the edge. A critical analysis follows, offering a deep dive into the present challenges including security concerns, computational hurdles, and standardization issues. However, as the horizon of technology ever expands, the review culminates in a forward-looking perspective, envisaging the future symbiosis of 5G networks, federated edge AI, deep reinforcement learning, and more, painting a vibrant panorama of what the future beholds. For anyone vested in the domains of IoE and AI, this review offers both a foundation and a visionary lens, bridging the present realities with future possibilities.
    摘要 互联网的能源INTERNET(IoE)领域正在踏入一场革命性的变革,通过融合边缘人工智能(AI)。本综观篇探讨边缘AI在IoE生态系统中的承诺和潜力。文章开始采用仔细挑选的研究方法,探讨边缘AI技术对IoE的多种应用。这些技术包括实时分析、降低延迟、资安性、可扩展性和成本效益等,这些潜在效益强调了边缘AI在现代IoE框架中的不可或缺性。文章随后介绍了实际应用和技术,包括在设备上进行计算、安全隐私方法、以及前对边缘进行AI训练。文章还提供了深入分析,探讨现有挑战,包括安全性 Concerns、计算问题和标准化问题。然而,随着科技的发展,未来边缘AI与5G网络、联邦边缘AI、深度强化学习和更多的技术融合,将创造出一个缤纷的未来,文章终结于一个前瞻性的展望。对于IoE和AI领域的投资者和探险家来说,这篇综观篇提供了一个基础和未来可能性的桥梁,将现实与未来相连。

Two-step dynamic obstacle avoidance

  • paper_url: http://arxiv.org/abs/2311.16841
  • repo_url: None
  • paper_authors: Fabian Hart, Martin Waltz, Ostap Okhrin
  • for: 本研究旨在解决自动驾驶车辆面临的动态障碍避免(DOA)问题,无论是在海上、空中还是陆地上。
  • methods: 本文提出了一种两步架构,通过结合监督学习和强化学习(RL)来处理 DOA 任务。在第一步,我们提出了一种数据驱动的方法,使用循环神经网络来估算障碍物的碰撞风险,并以监督学习的方式训练。在第二步,我们将这些碰撞风险估算结果纳入 RL 智能体的观察空间,以提高其态势感知能力。
  • results: 我们在一个需要在多个障碍物间穿行的复杂环境中训练了不同的 RL 智能体,并证明了我们的两步架构可以提高 RL 智能体的性能。实验表明,将碰撞风险估算结果纳入观察空间可以使 RL 智能体的奖励翻倍,相当于将碰撞次数减半。此外,我们还证明了该架构带来的性能提升与所采用的 RL 算法无关。
    Abstract Dynamic obstacle avoidance (DOA) is a fundamental challenge for any autonomous vehicle, independent of whether it operates in sea, air, or land. This paper proposes a two-step architecture for handling DOA tasks by combining supervised and reinforcement learning (RL). In the first step, we introduce a data-driven approach to estimate the collision risk of an obstacle using a recurrent neural network, which is trained in a supervised fashion and offers robustness to non-linear obstacle movements. In the second step, we include these collision risk estimates into the observation space of an RL agent to increase its situational awareness.~We illustrate the power of our two-step approach by training different RL agents in a challenging environment that requires to navigate amid multiple obstacles. The non-linear movements of obstacles are exemplarily modeled based on stochastic processes and periodic patterns, although our architecture is suitable for any obstacle dynamics. The experiments reveal that integrating our collision risk metrics into the observation space doubles the performance in terms of reward, which is equivalent to halving the number of collisions in the considered environment. Furthermore, we show that the architecture's performance improvement is independent of the applied RL algorithm.
    摘要 动态障碍避免(DOA)是任何自主载具都必须面对的基本挑战,无论其运行于海上、空中还是陆地。本文提出了一种两步架构,通过结合监督学习和强化学习(RL)来处理 DOA 任务。在第一步,我们引入一种数据驱动的方法,使用循环神经网络估算障碍物的碰撞风险;该网络以监督方式训练,对非线性的障碍物运动具有鲁棒性。在第二步,我们将这些碰撞风险估计纳入 RL 智能体的观察空间,以提升其态势感知能力。我们在一个需要在多个障碍物间穿行的挑战性环境中训练不同的 RL 智能体,展示了这种两步方法的威力。障碍物的非线性运动以随机过程和周期模式为例进行建模,但我们的架构适用于任意障碍物动力学。实验表明,将碰撞风险指标纳入观察空间可使奖励翻倍,相当于在该环境中将碰撞次数减半。此外,我们还表明该架构带来的性能提升与所采用的 RL 算法无关。
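
A minimal sketch of the two-step structure under illustrative assumptions: a small GRU is trained (supervised, not shown here) to map an obstacle's recent track to a collision-risk score, and the risk scores are appended to the RL agent's observation; the feature layout and sizes are hypothetical.

```python
import torch
import torch.nn as nn

class CollisionRiskNet(nn.Module):
    """Step 1: supervised recurrent estimator of collision risk from an obstacle track."""
    def __init__(self, feat_dim=4, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, track):                       # track: (B, T, feat_dim), e.g. relative x, y, vx, vy
        _, h = self.rnn(track)
        return torch.sigmoid(self.head(h[-1]))      # risk in [0, 1] per obstacle

def augment_observation(base_obs, obstacle_tracks, risk_net):
    """Step 2: append per-obstacle risk estimates to the RL agent's observation."""
    with torch.no_grad():
        risks = risk_net(obstacle_tracks).squeeze(-1)       # (num_obstacles,)
    return torch.cat([base_obs, risks], dim=-1)

risk_net = CollisionRiskNet()
base_obs = torch.randn(6)                           # agent's own state (illustrative)
tracks = torch.randn(3, 10, 4)                      # 3 obstacles, 10 past timesteps each
print(augment_observation(base_obs, tracks, risk_net).shape)   # torch.Size([9])
```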

The Claire French Dialogue Dataset

  • paper_url: http://arxiv.org/abs/2311.16840
  • repo_url: None
  • paper_authors: Julie Hunter, Jérôme Louradour, Virgile Rennard, Ismaïl Harrando, Guokan Shang, Jean-Pierre Lorré
  • for: 这个论文是为了推动多语言、开源语言模型的发展而创建的 Claire French Dialogue Dataset (CFDD) 资源。
  • methods: 论文描述了 CFDD 资源的24个各自 corpora 和其原始来源的链接和引用,以及将 CFDD dataset 分解成8种类型的子 corpora 的过程。
  • results: 论文介绍了 CFDD dataset 的标准化格式和相关工作的未来方向。
    Abstract We present the Claire French Dialogue Dataset (CFDD), a resource created by members of LINAGORA Labs in the context of the OpenLLM France initiative. CFDD is a corpus containing roughly 160 million words from transcripts and stage plays in French that we have assembled and publicly released in an effort to further the development of multilingual, open source language models. This paper describes the 24 individual corpora of which CFDD is composed and provides links and citations to their original sources. It also provides our proposed breakdown of the full CFDD dataset into eight categories of subcorpora and describes the process we followed to standardize the format of the final dataset. We conclude with a discussion of similar work and future directions.
    摘要 我们介绍 Claire French Dialogue Dataset(CFDD),这是 LINAGORA Labs 在 OpenLLM France 计划中创建的资源。CFDD 包含约 1.6 亿个法语单词,来自对话转写与舞台剧本;我们已将其整理并公开发布,以促进多语言、开源语言模型的发展。这篇文章描述了组成 CFDD 的 24 个子语料库,并提供了每个子语料库的链接和参考文献。文章还介绍了我们将完整的 CFDD 数据集划分为八类子语料库的方案,并描述了我们对最终数据集格式进行标准化的过程。最后,我们讨论了相关工作和未来方向。

Modular Neural Networks for Time Series Forecasting: Interpretability and Feature Selection using Attention

  • paper_url: http://arxiv.org/abs/2311.16834
  • repo_url: None
  • paper_authors: Qiqi Su, Christos Kloukinas, Artur d’Avila Garcez
  • for: 这篇论文应用于多变量时间序列预测,并且需要建立可解释的深度学习模型。
  • methods: 本论文提出了一个模块化神经网络模型,具有选择性的特征选择和时间预测两大 componenets。这个模型可以实现可解释性,并且在时间序列预测任务中表现出比预先学习型和可解释性模型更好的预测性。
  • results: 实验结果显示,这种方法可以超过现有的可解释性模型(例如NAM)和其变形,并且与非可解释性方法(例如LSTM和XGBoost)的预测性相当。
    Abstract Multivariate time series have many applications, from healthcare and meteorology to life science. Although deep learning models have shown excellent predictive performance for time series, they have been criticised for being "black-boxes" or non-interpretable. This paper proposes a novel modular neural network model for multivariate time series prediction that is interpretable by construction. A recurrent neural network learns the temporal dependencies in the data while an attention-based feature selection component selects the most relevant features and suppresses redundant features used in the learning of the temporal dependencies. A modular deep network is trained from the selected features independently to show the users how features influence outcomes, making the model interpretable. Experimental results show that this approach can outperform state-of-the-art interpretable Neural Additive Models (NAM) and variations thereof in both regression and classification of time series tasks, achieving a predictive performance that is comparable to the top non-interpretable methods for time series, LSTM and XGBoost.
    摘要 多变量时间系列有很多应用,从医疗和气象到生命科学。虽然深度学习模型在时间序列预测方面表现出色,但它们受到了“黑盒子”或不可解释的批评。这篇论文提出了一种新的模块化神经网络模型,用于多变量时间序列预测,该模型具有可解释性。一个循环神经网络学习数据中的时间相关性,而另一个注意力机制选择器选择最 relevante 的特征,并且避免在学习时间相关性中使用 redundante 的特征。一个模块化深度网络从选择的特征独立地训练,以显示特征如何影响结果,使模型成为可解释的。实验结果表明,这种方法可以超越现有的可解释性 Neural Additive Models(NAM)和其变种,在时间序列预测任务中实现比较高的预测性能,并且与非可解释的方法,如 LSTM 和 XGBoost,的预测性能相比较。
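
The modular, interpretable design can be sketched as one small module per input feature plus a softmax gate over features, so both the gate weights and each module's contribution are directly inspectable. This is a hypothetical simplification (the paper uses recurrent modules and attention-based selection); the sizes and gating form are illustrative.

```python
import torch
import torch.nn as nn

class ModularForecaster(nn.Module):
    """One small module per input feature + a softmax gate over features.

    The forecast is a weighted sum of per-feature contributions, so the gate
    weights and each module's output can be read off directly (a sketch of the
    paper's idea, not its exact architecture).
    """
    def __init__(self, n_features: int, lookback: int, horizon: int, hidden: int = 16):
        super().__init__()
        self.feature_modules = nn.ModuleList([
            nn.Sequential(nn.Linear(lookback, hidden), nn.ReLU(), nn.Linear(hidden, horizon))
            for _ in range(n_features)
        ])
        self.gate_logits = nn.Parameter(torch.zeros(n_features))   # learned feature relevance

    def forward(self, x):                                  # x: (B, lookback, n_features)
        contribs = torch.stack(
            [m(x[..., i]) for i, m in enumerate(self.feature_modules)], dim=-1
        )                                                   # (B, horizon, n_features)
        weights = torch.softmax(self.gate_logits, dim=0)
        return (contribs * weights).sum(-1), weights        # forecast + interpretable weights

model = ModularForecaster(n_features=5, lookback=24, horizon=6)
y_hat, w = model(torch.randn(8, 24, 5))
print(y_hat.shape, w.detach())
```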

CharacterGLM: Customizing Chinese Conversational AI Characters with Large Language Models

  • paper_url: http://arxiv.org/abs/2311.16832
  • repo_url: None
  • paper_authors: Jinfeng Zhou, Zhuang Chen, Dazhen Wan, Bosi Wen, Yi Song, Jifan Yu, Yongkang Huang, Libiao Peng, Jiaming Yang, Xiyao Xiao, Sahand Sabour, Xiaohan Zhang, Wenjing Hou, Yijia Zhang, Yuxiao Dong, Jie Tang, Minlie Huang
  • for: 本研究旨在提供一系列基于ChatGLM的模型,用于生成基于人物的对话(CharacterDial),以满足人们的内在社交欲望和情感需求。
  • methods: 我们采用了CharacterGLM模型,可以自定义各种人工智能角色或社交代理人的特性和行为,包括人物特征、情感表达、互动模式等。
  • results: 根据人工评估,我们的模型在一致性、拟人度和吸引力等方面优于大多数主流闭源大语言模型(包括 GPT 系列)。我们将发布 6B 版本的 CharacterGLM 模型和一部分训练数据,以促进基于角色的对话生成方向的进一步研究。
    Abstract In this paper, we present CharacterGLM, a series of models built upon ChatGLM, with model sizes ranging from 6B to 66B parameters. Our CharacterGLM is designed for generating Character-based Dialogues (CharacterDial), which aims to equip a conversational AI system with character customization for satisfying people's inherent social desires and emotional needs. On top of CharacterGLM, we can customize various AI characters or social agents by configuring their attributes (identities, interests, viewpoints, experiences, achievements, social relationships, etc.) and behaviors (linguistic features, emotional expressions, interaction patterns, etc.). Our model outperforms most mainstream close-source large langauge models, including the GPT series, especially in terms of consistency, human-likeness, and engagement according to manual evaluations. We will release our 6B version of CharacterGLM and a subset of training data to facilitate further research development in the direction of character-based dialogue generation.
    摘要 在这篇论文中,我们介绍了CharacterGLM,一系列基于ChatGLM的模型,其参数量从6B到66B不等。CharacterGLM用于生成基于角色的对话(CharacterDial),旨在让对话式人工智能系统具备角色定制能力,满足人们内在的社交欲望和情感需求。在CharacterGLM之上,我们可以通过配置各种AI角色或社交代理人的属性(身份、兴趣、观点、经历、成就、社交关系等)和行为(语言特征、情感表达、互动模式等)来进行定制。根据人工评估,我们的模型胜过大多数主流闭源大语言模型(包括GPT系列),特别是在一致性、拟人度和吸引力方面。我们将发布6B版的CharacterGLM和一部分训练数据,以促进基于角色的对话生成方向的进一步研究。

A knowledge-driven AutoML architecture

  • paper_url: http://arxiv.org/abs/2311.17124
  • repo_url: None
  • paper_authors: Corneliu Cofaru, Johan Loeckx
  • for: 本研究提出了一种知识驱动的AutoML架构,用于管道和深度特征合成。目标是使AutoML过程可读性好,利用领域知识在合成管道和特征时。
  • methods: 本架构尝试了一些新的想法,包括将管道和深度特征的建构进行统一处理,驱动合成过程使用共享知识系统,在运行时根据数据的应用而作出决策。
  • results: 两个实验证明了提议的架构的可行性和优势,同时也揭示了一些负面影响和未来AutoML的潜在发展前景。
    Abstract This paper proposes a knowledge-driven AutoML architecture for pipeline and deep feature synthesis. The main goal is to render the AutoML process explainable and to leverage domain knowledge in the synthesis of pipelines and features. The architecture explores several novel ideas: first, the construction of pipelines and deep features is approached in an unified way. Next, synthesis is driven by a shared knowledge system, interactively queried as to what pipeline operations to use or features to compute. Lastly, the synthesis processes takes decisions at runtime using partial solutions and results of their application on data. Two experiments are conducted to demonstrate the functionality of a na\"{\i}ve implementation of the proposed architecture and to discuss its advantages, trade-offs as well as future potential for AutoML.
    摘要 This paper proposes a knowledge-driven AutoML architecture for pipeline and deep feature synthesis, built around three ideas: (1) unified construction of pipelines and deep features; (2) a shared knowledge system that drives synthesis and is interactively queried as to which pipeline operations to use or which features to compute; (3) runtime decision-making using partial solutions and the results of their application on data. Two experiments demonstrate the functionality of a naive implementation of the proposed architecture and discuss its advantages, trade-offs, and future potential for AutoML.

Agent-Aware Training for Agent-Agnostic Action Advising in Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.16807
  • repo_url: None
  • paper_authors: Yaoquan Wei, Shunyu Liu, Jie Song, Tongya Zheng, Kaixuan Chen, Yong Wang, Mingli Song
  • for: 这个论文是为了提高深度奖励学习(DRL)中的采样效率而写的。
  • methods: 该论文提出了一种名为 Agent-Aware trAining yet Agent-Agnostic Action Advising (A7) 的新框架,它利用状态特征的相似性作为请求建议的指标来寻求专家指导。与现有方法不同,A7 既不依赖容易出错的学习智能体本身、也不依赖与智能体无关的建议者来衡量状态特征相似性,而是使用一个代理模型提取既具判别性(适应智能体)又具通用性(对智能体噪声鲁棒)的状态特征。
  • results: 实验结果表明,A7在GridWorld、LunarLander和Atari游戏中的六个场景中都能够显著地加速学习过程,并在现有方法(包括特定于机器人和无关机器人)的比较中占据领先地位。
    Abstract Action advising endeavors to leverage supplementary guidance from expert teachers to alleviate the issue of sampling inefficiency in Deep Reinforcement Learning (DRL). Previous agent-specific action advising methods are hindered by imperfections in the agent itself, while agent-agnostic approaches exhibit limited adaptability to the learning agent. In this study, we propose a novel framework called Agent-Aware trAining yet Agent-Agnostic Action Advising (A7) to strike a balance between the two. The underlying concept of A7 revolves around utilizing the similarity of state features as an indicator for soliciting advice. However, unlike prior methodologies, the measurement of state feature similarity is performed by neither the error-prone learning agent nor the agent-agnostic advisor. Instead, we employ a proxy model to extract state features that are both discriminative (adaptive to the agent) and generally applicable (robust to agent noise). Furthermore, we utilize behavior cloning to train a model for reusing advice and introduce an intrinsic reward for the advised samples to incentivize the utilization of expert guidance. Experiments are conducted on the GridWorld, LunarLander, and six prominent scenarios from Atari games. The results demonstrate that A7 significantly accelerates the learning process and surpasses existing methods (both agent-specific and agent-agnostic) by a substantial margin. Our code will be made publicly available.
    摘要 动作建议(action advising)旨在利用专家教师提供的辅助指导,缓解深度强化学习(DRL)中的采样效率问题。以往针对特定智能体的动作建议方法受制于智能体自身的缺陷,而与智能体无关的方法对学习智能体的适应性又有限。在本研究中,我们提出了一种名为 Agent-Aware trAining yet Agent-Agnostic Action Advising(A7)的新框架,以在二者之间取得平衡。A7 的核心思想是利用状态特征的相似性作为请求建议的指标。与以往方法不同,状态特征相似性的度量既不由容易出错的学习智能体完成,也不由与智能体无关的建议者完成,而是通过一个代理模型提取既具判别性(适应智能体)又具通用性(对智能体噪声鲁棒)的状态特征。此外,我们利用行为克隆训练一个用于复用建议的模型,并为被建议的样本引入内在奖励,以激励智能体利用专家指导。我们在 GridWorld、LunarLander 以及六个 Atari 游戏场景中进行了实验。结果表明,A7 显著加速了学习过程,并大幅超越现有方法(包括特定于智能体的方法和与智能体无关的方法)。我们的代码将公开发布。
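
A minimal sketch of the advice-solicitation trigger described above, under illustrative assumptions: a frozen stand-in `proxy` embeds states, advice is requested only when no previously advised state is sufficiently similar (cosine similarity above a hypothetical threshold), and stored advice is reused otherwise; the behavior-cloning model and intrinsic reward from the paper are omitted.

```python
import torch
import torch.nn.functional as F

class AdviceReuse:
    """Ask the teacher only for unfamiliar states; reuse advice for familiar ones.

    `proxy` is a frozen feature extractor (agent-agnostic); the 0.9 threshold,
    feature size and teacher are illustrative stand-ins for the paper's setup.
    """
    def __init__(self, proxy, threshold=0.9):
        self.proxy, self.threshold = proxy, threshold
        self.keys, self.actions = [], []            # advised state features and teacher actions

    def act_or_ask(self, state, teacher):
        feat = F.normalize(self.proxy(state), dim=-1)
        if self.keys:
            sims = torch.stack(self.keys) @ feat
            best = int(sims.argmax())
            if sims[best] > self.threshold:          # familiar state: reuse stored advice
                return self.actions[best], False
        action = teacher(state)                      # unfamiliar state: spend one advice query
        self.keys.append(feat)
        self.actions.append(action)
        return action, True

proxy = torch.nn.Linear(8, 16)                       # stand-in proxy feature extractor
advisor = AdviceReuse(lambda s: proxy(s).detach())
for _ in range(5):
    a, asked = advisor.act_or_ask(torch.randn(8), teacher=lambda s: int(torch.randint(0, 4, (1,))))
    print("asked teacher:", asked, "| action:", a)
```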

ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis

  • paper_url: http://arxiv.org/abs/2311.17123
  • repo_url: None
  • paper_authors: Xiangjun Gao, Xiaoyu Li, Chaopeng Zhang, Qi Zhang, Yanpei Cao, Ying Shan, Long Quan
  • for: 实现单一图像中的3D人体呈现
  • methods: 使用特定的静止图像进行人体呈现,并使用深度和文本导向注意力注入来转换参考图像的内容到后视角度
  • results: 能够实现高质量和内容一致的人体呈现,并在实验中与先前基eline方法进行比较
    Abstract In this work, we propose a method to address the challenge of rendering a 3D human from a single image in a free-view manner. Some existing approaches could achieve this by using generalizable pixel-aligned implicit fields to reconstruct a textured mesh of a human or by employing a 2D diffusion model as guidance with the Score Distillation Sampling (SDS) method, to lift the 2D image into 3D space. However, a generalizable implicit field often results in an over-smooth texture field, while the SDS method tends to lead to a texture-inconsistent novel view with the input image. In this paper, we introduce a texture-consistent back view synthesis module that could transfer the reference image content to the back view through depth and text-guided attention injection. Moreover, to alleviate the color distortion that occurs in the side region, we propose a visibility-aware patch consistency regularization for texture mapping and refinement combined with the synthesized back view texture. With the above techniques, we could achieve high-fidelity and texture-consistent human rendering from a single image. Experiments conducted on both real and synthetic data demonstrate the effectiveness of our method and show that our approach outperforms previous baseline methods.
    摘要 在这项工作中,我们提出了一种以自由视角从单张图像渲染3D人体的方法。现有的一些方法通过可泛化的像素对齐隐式场(implicit field)重建带纹理的人体网格,或者以2D扩散模型作为引导,借助 Score Distillation Sampling(SDS)方法将2D图像提升到3D空间。然而,可泛化的隐式场往往导致过于平滑的纹理场,而SDS方法则容易产生与输入图像纹理不一致的新视角。在本文中,我们引入了一个纹理一致的背面视图合成模块,通过深度和文本引导的注意力注入,将参考图像的内容迁移到背面视角。此外,为缓解侧面区域出现的颜色失真,我们提出了一种可见性感知的图块一致性正则化,结合合成的背面纹理进行纹理映射与细化。借助上述技术,我们能够从单张图像获得高保真且纹理一致的人体渲染。在真实数据和合成数据上的实验表明了我们方法的有效性,并显示其优于此前的基线方法。

A Survey of the Evolution of Language Model-Based Dialogue Systems

  • paper_url: http://arxiv.org/abs/2311.16789
  • repo_url: https://github.com/ruleGreen/Survey-Evolution-DS
  • paper_authors: Hongru Wang, Lingzhi Wang, Yiming Du, Liang Chen, Jingyan Zhou, Yufei Wang, Kam-Fai Wong
  • for: This paper provides a comprehensive review of the evolution of dialogue systems, specifically highlighting the significant transformations and advancements in language models.
  • methods: The paper categorizes the evolution of dialogue systems into four distinct stages, each marked by pivotal breakthroughs in language models, including statistical language models, neural language models, pre-trained language models, and current language model-based dialogue systems.
  • results: The paper offers a chronological perspective on the development of dialogue systems, providing a comprehensive review of state-of-the-art research outcomes, and discusses emerging topics and open challenges in the field, guiding future developments in language model-based dialogue systems.
    Abstract Dialogue systems, including task-oriented dialogue systems (TOD) and open-domain dialogue systems (ODD), have undergone significant transformations, with language models (LM) playing a central role. This survey delves into the historical trajectory of dialogue systems, elucidating their intricate relationship with advancements in language models by categorizing this evolution into four distinct stages, each marked by pivotal LM breakthroughs: 1) Early stage: characterized by statistical LMs, resulting in rule-based or machine-learning-driven dialogue systems; 2) Independent development of TOD and ODD based on neural language models (NLM; e.g., LSTM and GRU), since NLMs lack intrinsic knowledge in their parameters; 3) Fusion between different types of dialogue systems with the advent of pre-trained language models (PLMs), starting from the fusion between four sub-tasks within TOD, and then TOD with ODD; and 4) Current LLM-based dialogue systems, wherein LLMs can be used to conduct TOD and ODD seamlessly. Thus, our survey provides a chronological perspective aligned with LM breakthroughs, offering a comprehensive review of state-of-the-art research outcomes. What's more, we focus on emerging topics and discuss open challenges, providing valuable insights into future directions for LLM-based dialogue systems. Through this exploration, we pave the way for a deeper comprehension of the evolution, guiding future developments in LM-based dialogue systems.
    摘要 对话系统,包括任务导向对话系统(TOD)和开放领域对话系统(ODD),已经经历了重要的变革,语言模型(LM)在这一过程中扮演着中心角色。本评论对对话系统的历史轨迹进行了深入的探讨,并将这一演化分为四个不同的阶段,每个阶段都受到了关键的LM突破:1. 早期阶段:以统计语言模型(SM)为基础,导致了规则驱动或机器学习驱动的对话系统;2. 独立发展TOD和ODD,基于神经语言模型(NLM,如LSTM和GRU),由于NLM的参数缺乏内在知识,因此独立发展TOD和ODD;3. 不同类型的对话系统的融合,通过预训练语言模型(PLM)的出现,从四个子任务内部的TOD开始,然后是TOD与ODD的融合;4. 当前的LLM-基于对话系统,可以使用LLM进行TOD和ODD的同时执行,从而提供了一种可靠的对话系统。因此,本评论提供了与LM突破的时间线相对应的一种探讨,并对当前领域的研究成果进行了全面的回顾。此外,我们还关注了emerging topic和未解决的挑战,为未来LLM-基于对话系统的发展提供了有价值的指导。通过这一探讨,我们为LLM-基于对话系统的进一步发展铺平了道路。

The curse of language biases in remote sensing VQA: the role of spatial attributes, language diversity, and the need for clear evaluation

  • paper_url: http://arxiv.org/abs/2311.16782
  • repo_url: None
  • paper_authors: Christel Chappuis, Eliot Walt, Vincent Mendez, Sylvain Lobry, Bertrand Le Saux, Devis Tuia
  • for: 这篇论文的目的是探讨遥感视觉问答(RSVQA)中的语言偏见问题,以便通过人机自然语言交互来使用遥感图像。
  • methods: 这篇论文结合自然语言处理和计算机视觉技术,采用三重分析策略探讨语言偏见问题,包括视觉盲模型、对抗测试和数据集分析。
  • results: 研究发现,遥感视觉问答中存在严重的语言偏见问题,其根源在于数据本身,而不仅仅是模型。研究还发现,现有的遥感视觉问答数据集存在地理相似性和稀疏性等特点,加之词汇和问题生成策略较为简单,使得偏见问题更加突出。
    Abstract Remote sensing visual question answering (RSVQA) opens new opportunities for the use of overhead imagery by the general public, by enabling human-machine interaction with natural language. Building on the recent advances in natural language processing and computer vision, the goal of RSVQA is to answer a question formulated in natural language about a remote sensing image. Language understanding is essential to the success of the task, but has not yet been thoroughly examined in RSVQA. In particular, the problem of language biases is often overlooked in the remote sensing community, which can impact model robustness and lead to wrong conclusions about the performances of the model. Thus, the present work aims at highlighting the problem of language biases in RSVQA with a threefold analysis strategy: visual blind models, adversarial testing and dataset analysis. This analysis focuses both on model and data. Moreover, we motivate the use of more informative and complementary evaluation metrics sensitive to the issue. The gravity of language biases in RSVQA is then exposed for all of these methods with the training of models discarding the image data and the manipulation of the visual input during inference. Finally, a detailed analysis of question-answer distribution demonstrates the root of the problem in the data itself. Thanks to this analytical study, we observed that biases in remote sensing are more severe than in standard VQA, likely due to the specifics of existing remote sensing datasets for the task, e.g. geographical similarities and sparsity, as well as a simpler vocabulary and question generation strategies. While new, improved and less-biased datasets appear as a necessity for the development of the promising field of RSVQA, we demonstrate that more informed, relative evaluation metrics remain much needed to transparently communicate results of future RSVQA methods.
    摘要 遥感视觉问答(RSVQA)使公众能够通过自然语言与机器交互来使用航空与卫星影像,为遥感图像的应用开辟了新的机会。基于自然语言处理和计算机视觉的最新进展,RSVQA 的目标是回答针对遥感图像以自然语言提出的问题。语言理解对该任务的成功至关重要,但在 RSVQA 中尚未得到充分研究。特别是,语言偏见问题在遥感领域常被忽视,而它可能影响模型的鲁棒性,并导致对模型性能得出错误结论。因此,本工作旨在通过三重分析策略凸显 RSVQA 中的语言偏见问题:视觉盲模型、对抗测试和数据集分析,这些分析同时涵盖模型和数据两个层面。此外,我们呼吁使用对该问题更敏感、信息量更大且互补的评估指标。随后,我们通过训练丢弃图像数据的模型以及在推理时操纵视觉输入,揭示了语言偏见在 RSVQA 中的严重程度。最后,对问题-答案分布的细致分析表明,问题的根源在于数据本身。通过这一分析研究,我们观察到遥感中的偏见比标准 VQA 更为严重,这很可能源于现有遥感数据集的特点,例如地理相似性与稀疏性,以及更简单的词汇和问题生成策略。虽然新的、改进且偏见更少的数据集对于 RSVQA 这一前景广阔的领域十分必要,但我们也表明,仍然需要信息量更充分的相对评估指标,以透明地呈现未来 RSVQA 方法的结果。
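
A minimal sketch of the "visual blind" probe used to expose language bias: a classifier trained on questions only, whose accuracy far above chance indicates that answers can be predicted from language priors alone. The toy question-answer pairs below stand in for an actual RSVQA split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in for (question, answer) pairs from an RSVQA split.
questions = ["is there a road", "is there a lake", "how many buildings are there",
             "is a road present", "is a lake present", "what is the number of buildings"] * 50
answers   = ["yes", "no", "4", "yes", "no", "4"] * 50

q_train, q_test, a_train, a_test = train_test_split(questions, answers, test_size=0.3, random_state=0)
blind = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
blind.fit(q_train, a_train)

# Accuracy well above chance, without ever seeing an image, signals language bias in the benchmark.
print("question-only accuracy:", round(blind.score(q_test, a_test), 3))
```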

Generation of Games for Opponent Model Differentiation

  • paper_url: http://arxiv.org/abs/2311.16781
  • repo_url: None
  • paper_authors: David Milec, Viliam Lisý, Christopher Kiekintveld
  • for: 这篇论文的目的是如何保护对抗攻击,即多智能问题中的攻击者是人类活动者,保护方法通常包括对手模型以提高表现。
  • methods: 本论文使用了心理学家所收集的人格类型数据,创建了一种新的模型,将参数链接到心理特征。该模型在参数化游戏中优化,创建了具有显著差异的游戏。
  • results: 该论文的研究结果表明,新模型可以在不同的游戏中显著地不同,并且可以识别模型之间的不同。这些结果可以帮助自动生成游戏,以及识别模型之间的不同。
    Abstract Protecting against adversarial attacks is a common multiagent problem. Attackers in the real world are predominantly human actors, and the protection methods often incorporate opponent models to improve the performance when facing humans. Previous results show that modeling human behavior can significantly improve the performance of the algorithms. However, modeling humans correctly is a complex problem, and the models are often simplified and assume humans make mistakes according to some distribution or train parameters for the whole population from which they sample. In this work, we use data gathered by psychologists who identified personality types that increase the likelihood of performing malicious acts. However, in the previous work, the tests on a handmade game could not show strategic differences between the models. We created a novel model that links its parameters to psychological traits. We optimized over parametrized games and created games in which the differences are profound. Our work can help with automatic game generation when we need a game in which some models will behave differently and to identify situations in which the models do not align.
    摘要 防御对抗性攻击是一个常见的多智能体问题。现实世界中的攻击者主要是人类行为者,防御方法通常会引入对手模型以提升面对人类时的表现。已有结果表明,对人类行为建模可以显著提升算法的性能。然而,正确地建模人类是一个复杂的问题,现有模型往往经过简化,要么假设人类按照某种分布犯错,要么针对整个被抽样的人群训练统一的参数。在这项工作中,我们使用心理学家收集的数据,这些数据识别出了会提高实施恶意行为可能性的人格类型。然而,在先前的工作中,基于手工设计的游戏进行的测试无法展示这些模型之间的策略差异。我们创建了一种将参数与心理特征关联起来的新模型,并在参数化游戏上进行优化,生成了差异显著的游戏。我们的工作可以帮助在需要让某些模型表现出不同行为时自动生成游戏,并识别模型之间不一致的情形。

Equilibrium in the Computing Continuum through Active Inference

  • paper_url: http://arxiv.org/abs/2311.16769
  • repo_url: None
  • paper_authors: Boris Sedlak, Victor Casamayor Pujol, Praveen Kumar Donta, Schahram Dustdar
  • for: 本研究旨在帮助分布式计算(Distributed Computing)系统保证每个计算层的复杂要求。
  • methods: 本研究提出了一个整合边缘智能的框架,使得个体边缘设备可以(1)了解如何执行服务水平目标(Service Level Objectives,SLO),以及(2)将知识传播以加速不同设备的上线。通过合作,边缘设备可以(3)提高服务水平目标的范围。
  • results: 在视频流传输中,使用本研究的框架可以在10轮训练后确保四个SLO,并且下面的 causal 结构也可以理解。新型设备的添加可以在后续进行,框架允许 reuse 现有模型,即使设备类型未知。最后,在设备群组内重新负载均衡可以使个体边缘设备从22%提高到89%的SLO 遵从率。
    Abstract Computing Continuum (CC) systems are challenged to ensure the intricate requirements of each computational tier. Given the system's scale, the Service Level Objectives (SLOs) which are expressed as these requirements, must be broken down into smaller parts that can be decentralized. We present our framework for collaborative edge intelligence enabling individual edge devices to (1) develop a causal understanding of how to enforce their SLOs, and (2) transfer knowledge to speed up the onboarding of heterogeneous devices. Through collaboration, they (3) increase the scope of SLO fulfillment. We implemented the framework and evaluated a use case in which a CC system is responsible for ensuring Quality of Service (QoS) and Quality of Experience (QoE) during video streaming. Our results showed that edge devices required only ten training rounds to ensure four SLOs; furthermore, the underlying causal structures were also rationally explainable. The addition of new types of devices can be done a posteriori, the framework allowed them to reuse existing models, even though the device type had been unknown. Finally, rebalancing the load within a device cluster allowed individual edge devices to recover their SLO compliance after a network failure from 22% to 89%.
    摘要 Computing Continuum(CC)系统面临着确保每个计算层的复杂需求的挑战。由于系统规模庞大,以需求形式表达的服务水平目标(SLO)必须被拆分为可去中心化处理的更小部分。我们提出了一个协作式边缘智能框架,使单个边缘设备能够:(1)建立关于如何执行其 SLO 的因果理解;(2)传递知识以加速异构设备的接入;并通过协作,(3)扩大 SLO 达成的范围。我们实现了该框架,并评估了一个由 CC 系统负责在视频流传输过程中保障服务质量(QoS)和体验质量(QoE)的用例。结果显示,边缘设备只需十轮训练即可保障四项 SLO,并且其底层的因果结构也具备合理的可解释性。新类型设备可以在事后加入,框架允许它们复用现有模型,即使该设备类型此前未知。最后,在设备集群内重新平衡负载,可以使单个边缘设备在网络故障后将其 SLO 达成率从 22% 恢复到 89%。

Towards Full-scene Domain Generalization in Multi-agent Collaborative Bird’s Eye View Segmentation for Connected and Autonomous Driving

  • paper_url: http://arxiv.org/abs/2311.16754
  • repo_url: None
  • paper_authors: Senkang Hu, Zhengru Fang, Xianhao Chen, Yuguang Fang, Sam Kwong
  • for: 提高自动驾驶系统的感知质量
  • methods: 使用统一的域泛化框架和振幅增强(AmpAug)方法
  • results: 性能优于现有的最先进(state-of-the-art)方法
    Abstract Collaborative perception has recently gained significant attention in autonomous driving, improving perception quality by enabling the exchange of additional information among vehicles. However, deploying collaborative perception systems can lead to domain shifts due to diverse environmental conditions and data heterogeneity among connected and autonomous vehicles (CAVs). To address these challenges, we propose a unified domain generalization framework applicable in both training and inference stages of collaborative perception. In the training phase, we introduce an Amplitude Augmentation (AmpAug) method to augment low-frequency image variations, broadening the model's ability to learn across various domains. We also employ a meta-consistency training scheme to simulate domain shifts, optimizing the model with a carefully designed consistency loss to encourage domain-invariant representations. In the inference phase, we introduce an intra-system domain alignment mechanism to reduce or potentially eliminate the domain discrepancy among CAVs prior to inference. Comprehensive experiments substantiate the effectiveness of our method in comparison with the existing state-of-the-art works. Code will be released at https://github.com/DG-CAVs/DG-CoPerception.git.
    摘要 协同感知在自动驾驶中得到了广泛关注,通过在车辆之间交换额外信息来提高感知质量。但是,部署协同感知系统可能会导致域偏移,因为网联自动驾驶车辆(CAV)之间的环境条件和数据具有多样性。为解决这些挑战,我们提出了一种统一的域泛化框架,可在协同感知的训练和推理阶段应用。在训练阶段,我们引入了振幅增强(AmpAug)方法,用于扩充低频图像变化,使模型能够跨不同域学习。我们还采用元一致性训练方案来模拟域偏移,并通过精心设计的一致性损失优化模型,以促使其学习域不变的表示。在推理阶段,我们引入了系统内域对齐机制,以在推理前减少甚至消除 CAV 之间的域差异。大量实验证明了我们方法的有效性,优于现有的最先进工作。代码将于 https://github.com/DG-CAVs/DG-CoPerception.git 发布。
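
A minimal sketch of Fourier-amplitude augmentation in the spirit of AmpAug: swap the low-frequency amplitude of an image with that of an image from another domain while keeping its phase, which perturbs appearance (illumination, style) without changing layout. This FDA-style recipe, the mixing radius, and the random arrays below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def amplitude_augment(img, ref, beta=0.1):
    """Replace the low-frequency FFT amplitude of `img` with that of `ref`.

    img, ref: float arrays of shape (H, W, C) in [0, 1]; `beta` controls the
    size of the swapped low-frequency square (illustrative choice).
    """
    fft_img = np.fft.fftshift(np.fft.fft2(img, axes=(0, 1)), axes=(0, 1))
    fft_ref = np.fft.fftshift(np.fft.fft2(ref, axes=(0, 1)), axes=(0, 1))
    amp, pha = np.abs(fft_img), np.angle(fft_img)
    amp_ref = np.abs(fft_ref)

    h, w = img.shape[:2]
    b = int(min(h, w) * beta)
    cy, cx = h // 2, w // 2
    amp[cy - b:cy + b, cx - b:cx + b] = amp_ref[cy - b:cy + b, cx - b:cx + b]

    mixed = amp * np.exp(1j * pha)
    out = np.fft.ifft2(np.fft.ifftshift(mixed, axes=(0, 1)), axes=(0, 1)).real
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
augmented = amplitude_augment(rng.random((128, 128, 3)), rng.random((128, 128, 3)))
print(augmented.shape, augmented.min() >= 0.0, augmented.max() <= 1.0)
```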

LLMs for Science: Usage for Code Generation and Data Analysis

  • paper_url: http://arxiv.org/abs/2311.16733
  • repo_url: https://github.com/luuca78/llms4science
  • paper_authors: Mohamed Nejjar, Luca Zacharias, Fabian Stiehle, Ingo Weber
  • for: 这个论文主要是为了研究大语言模型(LLM)在科学研究中的应用。
  • methods: 本研究使用了多种大语言模型(LLM)来支持科学研究的日常工作。
  • results: 研究发现,使用LLM可以帮助提高科学研究的效率和质量,但也存在一些问题,如输出的可靠性和一致性。
    Abstract Large language models (LLMs) have been touted to enable increased productivity in many areas of today's work life. Scientific research as an area of work is no exception: the potential of LLM-based tools to assist in the daily work of scientists has become a highly discussed topic across disciplines. However, we are only at the very onset of this subject of study. It is still unclear how the potential of LLMs will materialise in research practice. With this study, we give first empirical evidence on the use of LLMs in the research process. We have investigated a set of use cases for LLM-based tools in scientific research, and conducted a first study to assess to which degree current tools are helpful. In this paper we report specifically on use cases related to software engineering, such as generating application code and developing scripts for data analytics. While we studied seemingly simple use cases, results across tools differ significantly. Our results highlight the promise of LLM-based tools in general, yet we also observe various issues, particularly regarding the integrity of the output these tools provide.
    摘要 大型语言模型(LLM)被广泛认为能够提升当今工作生活中许多领域的生产力,科学研究也不例外:基于 LLM 的工具在科学家日常工作中的潜在价值,已成为跨学科讨论的热点。然而,这一研究课题尚处于起步阶段,LLM 的潜力将如何在科研实践中落地仍不明朗。在本研究中,我们提供了关于 LLM 在科研过程中使用情况的首批实证证据。我们调研了一组基于 LLM 的工具在科学研究中的使用场景,并开展了首次评估,以考察现有工具的实际帮助程度。本文重点报告与软件工程相关的使用场景,例如生成应用代码和编写数据分析脚本。尽管我们研究的使用场景看似简单,不同工具之间的结果却差异显著。我们的结果显示了基于 LLM 的工具总体上颇具前景,但同时也暴露出若干问题,尤其是这些工具输出内容的可靠性与完整性。

Graph Pre-training and Prompt Learning for Recommendation

  • paper_url: http://arxiv.org/abs/2311.16716
  • repo_url: None
  • paper_authors: Yuhao Yang, Lianghao Xia, Da Luo, Kangyi Lin, Chao Huang
  • for: 提高GNNS的推荐性能和可扩展性,帮助GNNS更好地适应用户的变化偏好和数据分布shift。
  • methods: combining parameter-efficient和动态图前training with prompt learning,使GNNS能够更好地捕捉用户长期偏好和短期行为动态。
  • results: 在大规模实际应用中,GraphPL可以减轻GNNS的训练和推荐负担,同时提高推荐效果和稳定性。
    Abstract GNN-based recommenders have excelled in modeling intricate user-item interactions through multi-hop message passing. However, existing methods often overlook the dynamic nature of evolving user-item interactions, which impedes the adaption to changing user preferences and distribution shifts in newly arriving data. Thus, their scalability and performances in real-world dynamic environments are limited. In this study, we propose GraphPL, a framework that incorporates parameter-efficient and dynamic graph pre-training with prompt learning. This novel combination empowers GNNs to effectively capture both long-term user preferences and short-term behavior dynamics, enabling the delivery of accurate and timely recommendations. Our GraphPL framework addresses the challenge of evolving user preferences by seamlessly integrating a temporal prompt mechanism and a graph-structural prompt learning mechanism into the pre-trained GNN model. The temporal prompt mechanism encodes time information on user-item interaction, allowing the model to naturally capture temporal context, while the graph-structural prompt learning mechanism enables the transfer of pre-trained knowledge to adapt to behavior dynamics without the need for continuous incremental training. We further bring in a dynamic evaluation setting for recommendation to mimic real-world dynamic scenarios and bridge the offline-online gap to a better level. Our extensive experiments including a large-scale industrial deployment showcases the lightweight plug-in scalability of our GraphPL when integrated with various state-of-the-art recommenders, emphasizing the advantages of GraphPL in terms of effectiveness, robustness and efficiency.
    摘要 In this study, we propose GraphPL, a framework that combines parameter-efficient and dynamic graph pre-training with prompt learning. This novel approach enables GNNs to effectively capture both long-term user preferences and short-term behavior dynamics, allowing for the delivery of accurate and timely recommendations.To address the challenge of evolving user preferences, our GraphPL framework integrates a temporal prompt mechanism and a graph-structural prompt learning mechanism into the pre-trained GNN model. The temporal prompt mechanism encodes time information on user-item interactions, allowing the model to naturally capture temporal context. The graph-structural prompt learning mechanism enables the transfer of pre-trained knowledge to adapt to behavior dynamics without the need for continuous incremental training.We also establish a dynamic evaluation setting for recommendation, mimicking real-world dynamic scenarios and bridging the offline-online gap to a better level. Our extensive experiments, including a large-scale industrial deployment, showcase the lightweight plug-in scalability of our GraphPL when integrated with various state-of-the-art recommenders. This emphasizes the advantages of GraphPL in terms of effectiveness, robustness, and efficiency.

LEDITS++: Limitless Image Editing using Text-to-Image Models

  • paper_url: http://arxiv.org/abs/2311.16711
  • repo_url: None
  • paper_authors: Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, Apolinário Passos
  • for: 本研究旨在提出一种高效 yet versatile和精准的文本图像修改技术,以解决现有的图像修改方法存在的缺陷。
  • methods: 本方法使用novel倒推approach,不需要优化或调整,可以在几步diffusion中生成高品质的图像。此外,本方法支持同时进行多个修改,并且architecture-agnostic。
  • results: 我们的实验结果表明,LEDITS++可以准确地修改图像,并且与原始图像的差异较小。此外,LEDITS++也比现有的方法更高效和更灵活。详细的实验结果可以参考https://leditsplusplus-project.static.hf.space。
    Abstract Text-to-image diffusion models have recently received increasing interest for their astonishing ability to produce high-fidelity images from solely text inputs. Subsequent research efforts aim to exploit and apply their capabilities to real image editing. However, existing image-to-image methods are often inefficient, imprecise, and of limited versatility. They either require time-consuming fine-tuning, deviate unnecessarily strongly from the input image, and/or lack support for multiple, simultaneous edits. To address these issues, we introduce LEDITS++, an efficient yet versatile and precise textual image manipulation technique. LEDITS++'s novel inversion approach requires no tuning nor optimization and produces high-fidelity results with a few diffusion steps. Second, our methodology supports multiple simultaneous edits and is architecture-agnostic. Third, we use a novel implicit masking technique that limits changes to relevant image regions. We propose the novel TEdBench++ benchmark as part of our exhaustive evaluation. Our results demonstrate the capabilities of LEDITS++ and its improvements over previous methods. The project page is available at https://leditsplusplus-project.static.hf.space .
    摘要 LEDITS++'s novel inversion approach does not require fine-tuning or optimization and produces high-fidelity results with just a few diffusion steps. Additionally, our method supports multiple simultaneous edits and is compatible with various architectures. To ensure accuracy, we use a novel implicit masking technique that limits changes to relevant image regions.We have created the TEdBench++ benchmark as part of our comprehensive evaluation. Our results show that LEDITS++ outperforms previous methods in terms of efficiency and accuracy. For more information, please visit the project page at .

Rethinking Intermediate Layers design in Knowledge Distillation for Kidney and Liver Tumor Segmentation

  • paper_url: http://arxiv.org/abs/2311.16700
  • repo_url: None
  • paper_authors: Vandan Gorade, Sparsh Mittal, Debesh Jha, Ulas Bagci
  • for: 这项研究的目的是针对医学影像任务(如肾脏和肝脏肿瘤分割),提出一种基于层次选择反馈蒸馏(HLFD)的知识蒸馏方法。
  • methods: 该方法采用一种层次选择的反馈蒸馏策略,将中间层组合的知识蒸馏到更早的层,并在特征层面和像素层面将最后一层的知识传递到中间层。这种设计使模型能够在较早的层学习到更高质量的表示,从而得到鲁棒且紧凑的学生模型。
  • results: 在多个数据集上的定量评估表明,HLFD 方法显著优于现有方法。例如,在肾脏分割任务中,HLFD 可以将学生模型(无 KD)的性能提升超过 10 个百分点,显著增强其对肿瘤特异性特征的关注。从定性角度看,使用 HLFD 训练的学生模型能够抑制无关信息,并聚焦于肿瘤的细节特征,这为更高效、更准确的诊断工具开辟了新途径。
    Abstract Knowledge distillation(KD) has demonstrated remarkable success across various domains, but its application to medical imaging tasks, such as kidney and liver tumor segmentation, has encountered challenges. Many existing KD methods are not specifically tailored for these tasks. Moreover, prevalent KD methods often lack a careful consideration of what and from where to distill knowledge from the teacher to the student. This oversight may lead to issues like the accumulation of training bias within shallower student layers, potentially compromising the effectiveness of KD. To address these challenges, we propose Hierarchical Layer-selective Feedback Distillation (HLFD). HLFD strategically distills knowledge from a combination of middle layers to earlier layers and transfers final layer knowledge to intermediate layers at both the feature and pixel levels. This design allows the model to learn higher-quality representations from earlier layers, resulting in a robust and compact student model. Extensive quantitative evaluations reveal that HLFD outperforms existing methods by a significant margin. For example, in the kidney segmentation task, HLFD surpasses the student model (without KD) by over 10pp, significantly improving its focus on tumor-specific features. From a qualitative standpoint, the student model trained using HLFD excels at suppressing irrelevant information and can focus sharply on tumor-specific details, which opens a new pathway for more efficient and accurate diagnostic tools.
    摘要 知识蒸馏(KD)在许多领域都取得了显著成功,但其在医学影像任务(如肾脏和肝脏肿瘤分割)中的应用仍面临挑战。许多现有的 KD 方法并非专为这类任务设计。此外,主流的 KD 方法往往缺乏对“蒸馏什么、从哪里蒸馏”的细致考量,这一疏忽可能导致训练偏差在较浅的学生层中累积,从而削弱 KD 的效果。为解决这些挑战,我们提出了层次选择反馈蒸馏(HLFD)。HLFD 有策略地将中间层组合的知识蒸馏到更早的层,并在特征层面和像素层面将最后一层的知识传递给中间层。这种设计使模型能够在较早的层学习到更高质量的表示,从而得到鲁棒且紧凑的学生模型。大量定量评估表明,HLFD 以显著优势超越现有方法。例如,在肾脏分割任务中,HLFD 将学生模型(无 KD)的性能提升超过 10 个百分点,显著增强其对肿瘤特异性特征的关注。从定性角度看,使用 HLFD 训练的学生模型善于抑制无关信息,并能敏锐聚焦于肿瘤特异性细节,这为更高效、更准确的诊断工具开辟了新途径。
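
A minimal sketch of the layer-selective feedback pattern: earlier student features are projected and matched (here with MSE) against deeper teacher features, e.g. middle-to-early and final-to-intermediate pairs, alongside the usual segmentation loss. The pairing, projections, and toy feature pyramids are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def feedback_distill_loss(student_feats, teacher_feats, pairs, projections):
    """Match selected *earlier* student features to *deeper* teacher features.

    pairs: list of (student_layer_idx, teacher_layer_idx), e.g. [(0, 2), (1, 3)]
    projections: 1x1 convs aligning student channels to teacher channels.
    The pairing and weighting are illustrative; the paper applies the transfer
    at both feature and pixel levels.
    """
    loss = 0.0
    for (s_idx, t_idx), proj in zip(pairs, projections):
        s = proj(student_feats[s_idx])
        t = teacher_feats[t_idx].detach()
        if s.shape[-2:] != t.shape[-2:]:
            s = F.interpolate(s, size=t.shape[-2:], mode="bilinear", align_corners=False)
        loss = loss + F.mse_loss(s, t)
    return loss

# Toy feature pyramids: the student is half the channel width of the teacher.
student = [torch.randn(2, c, 32 // 2**i, 32 // 2**i) for i, c in enumerate([16, 32, 64, 128])]
teacher = [torch.randn(2, c, 32 // 2**i, 32 // 2**i) for i, c in enumerate([32, 64, 128, 256])]
pairs = [(0, 2), (1, 3)]                       # early student layers <- deeper teacher layers
projs = nn.ModuleList([nn.Conv2d(16, 128, 1), nn.Conv2d(32, 256, 1)])
print(float(feedback_distill_loss(student, teacher, pairs, projs)))
```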

XAI for time-series classification leveraging image highlight methods

  • paper_url: http://arxiv.org/abs/2311.17110
  • repo_url: None
  • paper_authors: Georgios Makridis, Georgios Fatouros, Vasileios Koukos, Dimitrios Kotios, Dimosthenis Kyriazis, Ioannis Soldatos
  • for: 这篇论文的目的是为时间序列分类任务中的深度神经网络(DNN)提供可解释性。
  • methods: 本篇论文使用了师生架构(teacher-student architecture,即蒸馏模型)和图像高亮技术(如 LIME 和 GradCam),将时间序列转换为 2D 图形,以提高预测的可解释性。
  • results: 本篇论文的预测结果与基准模型相当,但是训练时间有所增加。
    Abstract Although much work has been done on explainability in the computer vision and natural language processing (NLP) fields, there is still much work to be done to explain methods applied to time series as time series by nature can not be understood at first sight. In this paper, we present a Deep Neural Network (DNN) in a teacher-student architecture (distillation model) that offers interpretability in time-series classification tasks. The explainability of our approach is based on transforming the time series to 2D plots and applying image highlight methods (such as LIME and GradCam), making the predictions interpretable. At the same time, the proposed approach offers increased accuracy competing with the baseline model with the trade-off of increasing the training time.
    摘要 尽管在计算机视觉和自然语言处理(NLP)领域已经做了很多可解释性方面的工作,但是对时序序列方法的解释仍然需要进一步的研究。在这篇论文中,我们提出了一种基于深度神经网络(DNN)的教师-学生架构(蒸馏模型),以提高时序序列分类任务中的解释性。我们的方法基于将时序序列转换为2D图表,并应用图像高亮方法(如LIME和GradCam),使预测结果变得可解释。同时,我们的方法与基线模型的精度相当,但是需要增加训练时间。
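
The core trick here is rendering a univariate time series as a 2D plot so that image-space attribution methods (LIME, Grad-CAM) can be reused. A minimal sketch of the series-to-image step with matplotlib follows; the downstream CNN and the specific highlight method are outside this snippet and would use the usual image-XAI pipelines.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def series_to_image(series, size_px=224, dpi=100):
    """Render a 1-D series as an RGB image array so image-based XAI tools can be applied."""
    fig, ax = plt.subplots(figsize=(size_px / dpi, size_px / dpi), dpi=dpi)
    ax.plot(np.asarray(series), linewidth=2, color="black")
    ax.axis("off")
    fig.canvas.draw()
    img = np.asarray(fig.canvas.buffer_rgba())[..., :3]  # drop the alpha channel
    plt.close(fig)
    return img  # shape (size_px, size_px, 3), dtype uint8

# usage sketch: img = series_to_image(x); feed `img` to a CNN classifier, then run
# LIME or Grad-CAM on that CNN to highlight the plot regions driving the prediction.
```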

Hyper-Relational Knowledge Graph Neural Network for Next POI

  • paper_url: http://arxiv.org/abs/2311.16683
  • repo_url: None
  • paper_authors: Jixiao Zhang, Yongkang Li, Ruotong Zou, Jingyuan Zhang, Zipei Fan, Xuan Song
  • for: 提高 Location-based Social Networks (LBSN) 中 Point of Interest (POI) 推荐系统的精度和效果,使用 Knowledge Graph (KG) 来缓解数据稀缺问题。
  • methods: 提出了 Hyper-Relational Knowledge Graph Neural Network (HKGNN) 模型,使用 Hyper-Relational Knowledge Graph (HKG) 来维护和利用 LBSN 数据中的较复杂的 semantics,并使用 Hypergraph Neural Network 和 self-attention network 来充分利用 HKG 中的结构信息和时间序列信息进行个性化推荐。
  • results: 在四个真实 LBSN 数据集上的实验表明,HKGNN 模型优于现有的最先进方法,能够更好地利用 LBSN 数据中的语义和结构信息来提高 POI 推荐的精度和效果。
    Abstract With the advancement of mobile technology, Point of Interest (POI) recommendation systems in Location-based Social Networks (LBSN) have brought numerous benefits to both users and companies. Many existing works employ Knowledge Graph (KG) to alleviate the data sparsity issue in LBSN. These approaches primarily focus on modeling the pair-wise relations in LBSN to enrich the semantics and thereby relieve the data sparsity issue. However, existing approaches seldom consider the hyper-relations in LBSN, such as the mobility relation (a 3-ary relation: user-POI-time). This makes the model hard to exploit the semantics accurately. In addition, prior works overlook the rich structural information inherent in KG, which consists of higher-order relations and can further alleviate the impact of data sparsity.To this end, we propose a Hyper-Relational Knowledge Graph Neural Network (HKGNN) model. In HKGNN, a Hyper-Relational Knowledge Graph (HKG) that models the LBSN data is constructed to maintain and exploit the rich semantics of hyper-relations. Then we proposed a Hypergraph Neural Network to utilize the structural information of HKG in a cohesive way. In addition, a self-attention network is used to leverage sequential information and make personalized recommendations. Furthermore, side information, essential in reducing data sparsity by providing background knowledge of POIs, is not fully utilized in current methods. In light of this, we extended the current dataset with available side information to further lessen the impact of data sparsity. Results of experiments on four real-world LBSN datasets demonstrate the effectiveness of our approach compared to existing state-of-the-art methods.
    摘要 随着移动技术的进步,基于位置的社交网络(LBSN)中的兴趣点(POI)推荐系统已经为用户和公司带来了很多利益。现有的方法主要采用知识图(KG)来缓解LBSN数据稀缺问题。这些方法主要是对LBSN数据中的成对关系进行建模,以此丰富语义并缓解数据稀缺问题。然而,现有的方法很少考虑LBSN中的超关系,如用户-POI-时间的移动关系(3元关系),这使得模型很难准确地利用语义。此外,先前的工作忽视了知识图中固有的丰富结构信息,这些信息包含更高阶的关系,可以进一步减轻数据稀缺的影响。为此,我们提出了一种超关系知识图神经网络(HKGNN)模型。在HKGNN中,构建了一个超关系知识图(HKG)来维护和利用LBSN数据中超关系的丰富语义。然后,我们提出了一种超图神经网络,以连贯的方式利用HKG中的结构信息。此外,我们还使用了自注意力网络,以利用序列信息并进行个性化推荐。另外,现有的方法很少充分利用可用的侧信息,而这些侧信息可以提供POI的背景知识,从而进一步减轻数据稀缺的影响。为此,我们将当前的数据集扩展为包含可用的侧信息。实验结果表明,我们的方法在四个实际LBSN数据集上比现有的最先进方法更为有效。

Understanding the (Extra-)Ordinary: Validating Deep Model Decisions with Prototypical Concept-based Explanations

  • paper_url: http://arxiv.org/abs/2311.16681
  • repo_url: None
  • paper_authors: Maximilian Dreyer, Reduan Achtibat, Wojciech Samek, Sebastian Lapuschkin
  • for: 这篇论文的目的是提出一种新的 post-hoc 基于概念的 XAI 框架,以便更好地理解深度神经网络(DNNs)的决策过程。
  • methods: 这篇论文提出的 XAI 框架通过原型(prototypes)同时刻画局部(instance-wise)和全局(class-wise)决策策略,以描述模型的决策过程。
  • results: 这篇论文在三个数据集(ImageNet、CUB-200 和 CIFAR-10)上使用 VGG、ResNet 和 EfficientNet 架构验证了该 XAI 框架的有效性,能够检测分布外样本、模型的虚假行为以及数据质量问题。
    Abstract Ensuring both transparency and safety is critical when deploying Deep Neural Networks (DNNs) in high-risk applications, such as medicine. The field of explainable AI (XAI) has proposed various methods to comprehend the decision-making processes of opaque DNNs. However, only few XAI methods are suitable of ensuring safety in practice as they heavily rely on repeated labor-intensive and possibly biased human assessment. In this work, we present a novel post-hoc concept-based XAI framework that conveys besides instance-wise (local) also class-wise (global) decision-making strategies via prototypes. What sets our approach apart is the combination of local and global strategies, enabling a clearer understanding of the (dis-)similarities in model decisions compared to the expected (prototypical) concept use, ultimately reducing the dependence on human long-term assessment. Quantifying the deviation from prototypical behavior not only allows to associate predictions with specific model sub-strategies but also to detect outlier behavior. As such, our approach constitutes an intuitive and explainable tool for model validation. We demonstrate the effectiveness of our approach in identifying out-of-distribution samples, spurious model behavior and data quality issues across three datasets (ImageNet, CUB-200, and CIFAR-10) utilizing VGG, ResNet, and EfficientNet architectures. Code is available on https://github.com/maxdreyer/pcx.

ROSO: Improving Robotic Policy Inference via Synthetic Observations

  • paper_url: http://arxiv.org/abs/2311.16680
  • repo_url: https://github.com/Yusuke710/ROSO
  • paper_authors: Yusuke Miyashita, Dimitris Gahtidis, Colin La, Jeremy Rabinowicz, Jurgen Leitner
  • for: 提升预训练策略的零样本(zero-shot)性能
  • methods: 使用生成式人工智能在推理时修改观察,以适应新物体和新环境
  • results: 实验结果显示,将生成式人工智能整合到机器人推理中可以增强策略的适应性,完成了多达 57% 原本无法由预训练策略完成的任务。
    Abstract In this paper, we propose the use of generative artificial intelligence (AI) to improve zero-shot performance of a pre-trained policy by altering observations during inference. Modern robotic systems, powered by advanced neural networks, have demonstrated remarkable capabilities on pre-trained tasks. However, generalizing and adapting to new objects and environments is challenging, and fine-tuning visuomotor policies is time-consuming. To overcome these issues we propose Robotic Policy Inference via Synthetic Observations (ROSO). ROSO uses stable diffusion to pre-process a robot's observation of novel objects during inference time to fit within its distribution of observations of the pre-trained policies. This novel paradigm allows us to transfer learned knowledge from known tasks to previously unseen scenarios, enhancing the robot's adaptability without requiring lengthy fine-tuning. Our experiments show that incorporating generative AI into robotic inference significantly improves successful outcomes, finishing up to 57% of tasks otherwise unsuccessful with the pre-trained policy.
    摘要 在这篇论文中,我们提出使用生成式人工智能(AI)来提高预训练策略的零样本性能。现代机器人系统搭载了先进的神经网络,已经在预训练任务上展示了惊人的能力。然而,泛化并适应新物体和新环境是困难的,而且微调视觉运动策略十分耗时。为了解决这些问题,我们提出了基于合成观察的机器人策略推理(ROSO)。ROSO使用Stable Diffusion在推理时对机器人观察到的新物体进行预处理,使其落在预训练策略的观察分布之内。这种新的范式允许我们将已学到的知识迁移到前所未见的情况,提高机器人的适应性,而不需要漫长的微调。我们的实验表明,将生成式AI整合到机器人推理中可以显著提高成功率,完成了多达57%原本无法由预训练策略完成的任务。
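
ROSO's key step is rewriting the robot's observation of a novel object with an image-to-image diffusion pass before handing it to the frozen policy. A rough sketch of that pre-processing step with the Hugging Face diffusers library is below; the prompt, strength, and checkpoint are illustrative guesses, not the paper's settings.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# load a standard Stable Diffusion img2img pipeline (checkpoint choice is an assumption)
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def synthesize_observation(rgb_obs: Image.Image, known_object: str) -> Image.Image:
    """Nudge an observation containing a novel object toward the distribution of
    objects the pre-trained policy was trained on, then return it for inference."""
    prompt = f"a robot workspace containing a {known_object} on a table"
    out = pipe(prompt=prompt, image=rgb_obs, strength=0.5, guidance_scale=7.5)
    return out.images[0]

# usage sketch: obs = synthesize_observation(camera_frame, "red cube")
#               action = pretrained_policy(obs, instruction)   # hypothetical policy call
```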

Large Language Models Meet Computer Vision: A Brief Survey

  • paper_url: http://arxiv.org/abs/2311.16673
  • repo_url: None
  • paper_authors: Raby Hamadi
  • for: 本文旨在探讨最新的 transformer 和其后继者在计算机视觉领域的发展,以及这些模型在自然语言处理和计算机视觉之间的交互。
  • methods: 本文使用了许多现代的 paid 和 open-source LLMs,并进行了比较分析,以探讨这些模型在不同任务上的表现。 此外,文章还收集了一些用于训练 LLMs 的数据集,为读者提供了不同数据的应用和演示。
  • results: 本文的研究发现, transformer 和其后继者在计算机视觉领域的应用具有潜在的潜力,可以提高模型的性能和可靠性。 此外,文章还提出了一些未来研究的方向,如如何更好地挖掘数据集,以及如何将 LLMs 应用于更多的计算机视觉任务。
    Abstract Recently, the intersection of Large Language Models (LLMs) and Computer Vision (CV) has emerged as a pivotal area of research, driving significant advancements in the field of Artificial Intelligence (AI). As transformers have become the backbone of many state-of-the-art models in both Natural Language Processing (NLP) and CV, understanding their evolution and potential enhancements is crucial. This survey paper delves into the latest progressions in the domain of transformers and their subsequent successors, emphasizing their potential to revolutionize Vision Transformers (ViTs) and LLMs. This survey also presents a comparative analysis, juxtaposing the performance metrics of several leading paid and open-source LLMs, shedding light on their strengths and areas of improvement as well as a literature review on how LLMs are being used to tackle vision related tasks. Furthermore, the survey presents a comprehensive collection of datasets employed to train LLMs, offering insights into the diverse data available to achieve high performance in various pre-training and downstream tasks of LLMs. The survey is concluded by highlighting open directions in the field, suggesting potential venues for future research and development. This survey aims to underscores the profound intersection of LLMs on CV, leading to a new era of integrated and advanced AI models.
    摘要 最近,大语言模型(LLMs)和计算机视觉(CV)的交叉领域已经成为人工智能(AI)领域的重要研究领域,导致了许多领域的进步。作为 transformers 在 NLP 和 CV 中的基础模型,理解它们的演化和可能的提高是关键。这篇评论paper 探讨了最新的 transformers 和其后继者的进展,强调它们在 ViTs 和 LLMs 中的潜在革命化作用。此外,这篇评论还进行了多种领先的付费和开源 LLMs 的比较分析,探讨了它们的优势和改进点,以及如何使用 LLMs 解决视觉相关任务的文献回顾。此外,评论还提供了许多用于训练 LLMs 的集成数据集,为在不同的预训练和下游任务中实现高性能提供了信息。最后,评论结束于 highlighting 当前领域的开放方向,建议未来的研究发展方向。这篇评论希望强调 LLMS 在 CV 领域的深入交叉,并促进一个新的融合和高级 AI 模型的时代。

SplitNeRF: Split Sum Approximation Neural Field for Joint Geometry, Illumination, and Material Estimation

  • paper_url: http://arxiv.org/abs/2311.16671
  • repo_url: None
  • paper_authors: Jesus Zarzar, Bernard Ghanem
  • for: 这篇论文旨在提出一种用于将真实世界物体数字化的新方法,该方法可以估算物体的几何结构、物理属性和环境照明,从一组固定照明的图像中。
  • methods: 该方法将实时基于物理渲染中使用的分割求和近似(split sum approximation)引入神经辐射场(NeRF)管道。另外,我们提议使用单个场景特定的多层感知机(MLP)来表示任意分辨率下的预积分图像基照明。此外,我们还提出了一种新的自遮挡预测监督方法,利用高效的蒙特卡洛采样来实现。
  • results: 实验结果表明,我们的方法可以高效地估算场景几何结构、物理属性和照明,并且可以在固定照明下实现state-of-the-art的重新照明质量,仅需要约 1 小时的训练。
    Abstract We present a novel approach for digitizing real-world objects by estimating their geometry, material properties, and environmental lighting from a set of posed images with fixed lighting. Our method incorporates into Neural Radiance Field (NeRF) pipelines the split sum approximation used with image-based lighting for real-time physical-based rendering. We propose modeling the scene's lighting with a single scene-specific MLP representing pre-integrated image-based lighting at arbitrary resolutions. We achieve accurate modeling of pre-integrated lighting by exploiting a novel regularizer based on efficient Monte Carlo sampling. Additionally, we propose a new method of supervising self-occlusion predictions by exploiting a similar regularizer based on Monte Carlo sampling. Experimental results demonstrate the efficiency and effectiveness of our approach in estimating scene geometry, material properties, and lighting. Our method is capable of attaining state-of-the-art relighting quality after only ${\sim}1$ hour of training in a single NVIDIA A100 GPU.
    摘要 我们提出了一种新的方法,用于将真实世界中的物体数字化:从一组固定照明下拍摄的图像中估计物体的几何结构、材质属性和环境照明。我们的方法将实时基于物理渲染中与图像基照明一起使用的分割求和近似(split sum approximation)引入神经辐射场(NeRF)管道。我们建议使用单个场景特定的多层感知机(MLP)来表示任意分辨率下的预积分图像基照明,并通过一种基于高效蒙特卡洛采样的新型正则项来准确地建模预积分照明。此外,我们还提出了一种新的方法,利用类似的基于蒙特卡洛采样的正则项来监督自遮挡预测。实验结果表明,我们的方法可以高效且有效地估计场景几何结构、材质属性和照明。我们的方法在单个NVIDIA A100 GPU上训练约1小时后,即可达到当前最佳的重新照明质量。
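
For reference, the split sum approximation the abstract builds on (standard in real-time image-based lighting) factors the reflectance integral into a pre-filtered lighting term and a pre-integrated BRDF term, roughly as below; SplitNeRF's contribution, per the abstract, is to represent the pre-integrated lighting term with a scene-specific MLP.

```latex
% Split sum approximation (image-based lighting form), written here as a reminder;
% notation: L_i incoming radiance, f the BRDF, n normal, v view direction, p the sampling pdf.
\int_{\Omega} L_i(\mathbf{l})\, f(\mathbf{l},\mathbf{v})\, (\mathbf{n}\cdot\mathbf{l})\, d\mathbf{l}
\;\approx\;
\underbrace{\frac{1}{N}\sum_{k=1}^{N} L_i(\mathbf{l}_k)}_{\text{pre-filtered / pre-integrated lighting}}
\;\cdot\;
\underbrace{\frac{1}{N}\sum_{k=1}^{N} \frac{f(\mathbf{l}_k,\mathbf{v})\,(\mathbf{n}\cdot\mathbf{l}_k)}{p(\mathbf{l}_k,\mathbf{v})}}_{\text{pre-integrated BRDF}}
```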

MultiModal-Learning for Predicting Molecular Properties: A Framework Based on Image and Graph Structures

  • paper_url: http://arxiv.org/abs/2311.16666
  • repo_url: None
  • paper_authors: Zhuoyuan Wang, Jiacong Mi, Shan Lu, Jieyue He
  • for: 这篇论文的目的是提出一种多模态的分子表示学习方法,以提高药物分子性质的预测性能。
  • methods: 这篇论文使用了自监督学习(SSL)方法和图神经网络(GNN)。
  • results: 与基线模型相比,这篇论文的模型在下游任务中的预测性能更高,特别是在分子性质预测方面。
    Abstract The quest for accurate prediction of drug molecule properties poses a fundamental challenge in the realm of Artificial Intelligence Drug Discovery (AIDD). An effective representation of drug molecules emerges as a pivotal component in this pursuit. Contemporary leading-edge research predominantly resorts to self-supervised learning (SSL) techniques to extract meaningful structural representations from large-scale, unlabeled molecular data, subsequently fine-tuning these representations for an array of downstream tasks. However, an inherent shortcoming of these studies lies in their singular reliance on one modality of molecular information, such as molecule image or SMILES representations, thus neglecting the potential complementarity of various molecular modalities. In response to this limitation, we propose MolIG, a novel MultiModaL molecular pre-training framework for predicting molecular properties based on Image and Graph structures. MolIG model innovatively leverages the coherence and correlation between molecule graph and molecule image to execute self-supervised tasks, effectively amalgamating the strengths of both molecular representation forms. This holistic approach allows for the capture of pivotal molecular structural characteristics and high-level semantic information. Upon completion of pre-training, Graph Neural Network (GNN) Encoder is used for the prediction of downstream tasks. In comparison to advanced baseline models, MolIG exhibits enhanced performance in downstream tasks pertaining to molecular property prediction within benchmark groups such as MoleculeNet Benchmark Group and ADMET Benchmark Group.
    摘要 寻求精准预测药物分子性质是艺术智能药物发现(AIDD)领域的基本挑战。一个有效的药物分子表示是这一目标的关键组件。当前的前沿研究主要采用自动学习(SSL)技术来从大规模、无标签的分子数据中提取有意义的结构表示,然后对这些表示进行微调以适应多个下游任务。然而,这些研究的潜在缺点在于它们单一依赖于一种分子信息模式,如分子图像或SMILES表示,因此忽略了不同分子模式之间的可衡合性。为了解决这一限制,我们提出了 MolIG,一种新的多Modal molecular预训练框架,用于预测分子性质基于图像和图 structures。MolIG模型创新地利用分子图和分子图像之间的准确性和相关性来执行自动学习任务,有效地结合了两种分子表示形式的优势。这种整体approach允许捕捉分子结构的重要特征和高级semantic信息。在预训练完成后,图 neural network(GNN)Encoder被用于预测下游任务。与先进基eline模型相比,MolIG在分子性质预测 benchmark groups 中表现出色,如MoleculeNet Benchmark Group和ADMET Benchmark Group。

ClimateX: Do LLMs Accurately Assess Human Expert Confidence in Climate Statements?

  • paper_url: http://arxiv.org/abs/2311.17107
  • repo_url: https://github.com/rlacombe/climatex
  • paper_authors: Romain Lacombe, Kerrie Wu, Eddie Dilworth
  • for: 本研究旨在评估现代自然语言模型(LLMs)在气候科学和政策领域的准确性。
  • methods: 研究者引入了新的专家标注的气候声明集(ClimateX)数据集,包含latest Intergovernmental Panel on Climate Change(IPCC)报告中的8094个气候声明,并对其进行了专家标注。研究者使用这些数据集,表明了最新的LLMs可以在几次学习 Setting中分类人类专家对气候声明的信任程度,但准确率有限(最高达47%)。
  • results: 研究发现,LLMs在低和中信任声明上表现出了一致的和显著的过度自信。
    Abstract Evaluating the accuracy of outputs generated by Large Language Models (LLMs) is especially important in the climate science and policy domain. We introduce the Expert Confidence in Climate Statements (ClimateX) dataset, a novel, curated, expert-labeled dataset consisting of 8094 climate statements collected from the latest Intergovernmental Panel on Climate Change (IPCC) reports, labeled with their associated confidence levels. Using this dataset, we show that recent LLMs can classify human expert confidence in climate-related statements, especially in a few-shot learning setting, but with limited (up to 47%) accuracy. Overall, models exhibit consistent and significant over-confidence on low and medium confidence statements. We highlight implications of our results for climate communication, LLMs evaluation strategies, and the use of LLMs in information retrieval systems.
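
The evaluation described above asks an LLM to recover the IPCC's calibrated confidence label for a statement in a few-shot setting. A minimal sketch of how such a few-shot prompt could be assembled is below; the label set follows IPCC usage (low / medium / high / very high), and the example statements are invented placeholders, not actual ClimateX items.

```python
LABELS = ["low", "medium", "high", "very high"]

FEW_SHOT_EXAMPLES = [  # invented placeholders for illustration only
    ("Global mean sea level rose over the 20th century.", "very high"),
    ("Regional monsoon precipitation will intensify in all basins.", "medium"),
]

def build_prompt(statement: str) -> str:
    """Assemble a few-shot classification prompt for expert-confidence prediction."""
    lines = [
        "You are assessing climate statements from IPCC reports.",
        f"Label each statement with the experts' confidence: {', '.join(LABELS)}.",
        "",
    ]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Statement: {text}\nConfidence: {label}\n")
    lines.append(f"Statement: {statement}\nConfidence:")
    return "\n".join(lines)

# usage sketch: send build_prompt(s) to any chat/completion API, map the reply back onto
# LABELS, and score accuracy against the IPCC-assigned confidence for that statement.
```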

Finnish 5th and 6th graders’ misconceptions about Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2311.16644
  • repo_url: None
  • paper_authors: Pekka Mertala, Janne Fagerlund
  • for: 这个研究旨在探讨儿童初步理解人工智能的问题,以帮助开发有教学价值的人工智能Literacy课程、方法和材料。
  • methods: 这个研究使用质量调查数据来回答三个研究问题:1) 芬兰5-6年级学生对人工智能的本质有什么假设?;2)这些假设与常见的假设类型有关?;和3)这些假设有多深刻?
  • results: 研究发现,5-6年级学生有三类假设:1)非技术人工智能,认为人工智能是人类认知过程(事实假设);2)人工智能人类化,认为人工智能是人类化的存在(通俗、非科学、概念假设);3)机器人工智能,认为人工智能有预装智能或知识(事实假设)。大多数孩子认为自己对人工智能的了解不足,这意味着这些假设是肤浅的而不是深刻的。研究还发现,特定语境下的语言特征可能导致学生对人工智能产生误解。这些结论有助于未来的研究和人工智能素养教育。
    Abstract Research on children's initial conceptions of AI is in an emerging state, which, from a constructivist viewpoint, challenges the development of pedagogically sound AI-literacy curricula, methods, and materials. To contribute to resolving this need in the present paper, qualitative survey data from 195 children were analyzed abductively to answer the following three research questions: What kind of misconceptions do Finnish 5th and 6th graders' have about the essence AI?; 2) How do these misconceptions relate to common misconception types?; and 3) How profound are these misconceptions? As a result, three misconception categories were identified: 1) Non-technological AI, in which AI was conceptualized as peoples' cognitive processes (factual misconception); 2) Anthropomorphic AI, in which AI was conceptualized as a human-like entity (vernacular, non-scientific, and conceptual misconception); and 3) AI as a machine with a pre-installed intelligence or knowledge (factual misconception). Majority of the children evaluated their AI-knowledge low, which implies that the misconceptions are more superficial than profound. The findings suggest that context-specific linguistic features can contribute to students' AI misconceptions. Implications for future research and AI literacy education are discussed.
    摘要 研究儿童初始对人工智能的理解处于发展阶段,从构建主义的视角来看,这会挑战开发教学上有效的人工智能Literacy课程、方法和材料的开发。为解决这个需求,本文通过qualitative survey数据分析了195名芬兰5-6年级学生对人工智能的误解,并回答以下三个研究问题:1)芬兰5-6年级学生对人工智能的本质有什么误解?2)这些误解与常见误解类型有关吗?3)这些误解有多深刻?结果显示,5-6年级学生对人工智能的误解可以分为3类:1)非技术人工智能,人工智能被理解为人类的认知过程(事实误解);2)人类化人工智能,人工智能被理解为人类化的存在(非科学、非正式、概念误解);3)人工智能为机器带有预先安装的智能或知识(事实误解)。大多数儿童评估自己对人工智能的知识为低,这意味着误解较浅。发现语言特性对学生的人工智能误解产生影响,有关未来研究和人工智能Literacy教育的探讨。

Single-Cell Clustering via Dual-Graph Alignment

  • paper_url: http://arxiv.org/abs/2311.17104
  • repo_url: None
  • paper_authors: Dayu Hu, Ke Liang, Xinwang Liu
  • for: 这份研究旨在开发一个具有更高精度和可靠性的单细胞聚集分析方法,以便更好地理解肿瘤微环境中单细胞的分布和特征。
  • methods: 本研究使用了双图对齐(dual-graph alignment)方法,基于自监督与无监督优化,联合利用细胞间关系图与由蛋白质-蛋白质互作(PPI)网络导出的基因图,以获得更加精确的单细胞聚类结果。
  • results: 实验结果显示,这个新的单细胞聚集分析方法能够更好地捕捉单细胞之间的关系和蛋白质-蛋白质互作网络的结构,并且能够将这些资讯纳入到聚集分析中,以获得更加精确和有意义的结果。
    Abstract In recent years, the field of single-cell RNA sequencing has seen a surge in the development of clustering methods. These methods enable the identification of cell subpopulations, thereby facilitating the understanding of tumor microenvironments. Despite their utility, most existing clustering algorithms primarily focus on the attribute information provided by the cell matrix or the network structure between cells, often neglecting the network between genes. This oversight could lead to loss of information and clustering results that lack clinical significance. To address this limitation, we develop an advanced single-cell clustering model incorporating dual-graph alignment, which integrates gene network information into the clustering process based on self-supervised and unsupervised optimization. Specifically, we designed a graph-based autoencoder enhanced by an attention mechanism to effectively capture relationships between cells. Moreover, we performed the node2vec method on Protein-Protein Interaction (PPI) networks to derive the gene network structure and maintained this structure throughout the clustering process. Our proposed method has been demonstrated to be effective through experimental results, showcasing its ability to optimize clustering outcomes while preserving the original associations between cells and genes. This research contributes to obtaining accurate cell subpopulations and generates clustering results that more closely resemble real-world biological scenarios. It provides better insights into the characteristics and distribution of diseased cells, ultimately building a foundation for early disease diagnosis and treatment.

LasTGL: An Industrial Framework for Large-Scale Temporal Graph Learning

  • paper_url: http://arxiv.org/abs/2311.16605
  • repo_url: None
  • paper_authors: Jintang Li, Jiawang Dan, Ruofan Wu, Jing Zhou, Sheng Tian, Yunfei Liu, Baokun Wang, Changhua Meng, Weiqiang Wang, Yuchang Zhu, Liang Chen, Zibin Zheng
  • for: 这篇论文是为了提供一个统一的框架,来解决时间变化的图形学习任务。
  • methods: 这篇论文使用了时序图神经网络(TGNN)来处理随时间演化的图数据。
  • results: 这篇论文提供了一个名为LasTGL的框架,可以帮助研究人员快速实现时间图形学习任务,并且提供了丰富的实验数据和说明。
    Abstract Over the past few years, graph neural networks (GNNs) have become powerful and practical tools for learning on (static) graph-structure data. However, many real-world applications, such as social networks and e-commerce, involve temporal graphs where nodes and edges are dynamically evolving. Temporal graph neural networks (TGNNs) have progressively emerged as an extension of GNNs to address time-evolving graphs and have gradually become a trending research topic in both academics and industry. Advancing research in such an emerging field requires new tools to compose TGNN models and unify their different schemes in dealing with temporal graphs. To facilitate research and application in temporal graph learning, we introduce LasTGL, an industrial framework that integrates unified and extensible implementations of common temporal graph learning algorithms for various advanced tasks. The purpose of LasTGL is to provide the essential building blocks for solving temporal graph learning tasks, focusing on the guiding principles of user-friendliness and quick prototyping on which PyTorch is based. In particular, LasTGL provides comprehensive temporal graph datasets, TGNN models and utilities along with well-documented tutorials, making it suitable for both absolute beginners and expert deep learning practitioners alike.
    摘要 在过去几年,图节点网络(GNN)已成为可靠和实用的工具,用于学习静止图结构数据。然而,许多实际应用,如社交网络和电子商务,涉及到时间演化的图结构数据。时间图节点网络(TGNN)逐渐出现,作为GNN的扩展,用于解决时间演化的图结构数据。为推动这个emergingfield的研究,我们提出了LasTGL框架,它将 integrate了多种时间图学习算法的统一和可扩展实现。LasTGL的目标是提供解决时间图学习任务的基本建筑块,注重用户友好性和快速原型化,基于PyTorch的指导原则。具体来说,LasTGL提供了完整的时间图数据集,TGNN模型和工具,以及详细的教程,适用于各种深度学习实践者。

GSP-KalmanNet: Tracking Graph Signals via Neural-Aided Kalman Filtering

  • paper_url: http://arxiv.org/abs/2311.16602
  • repo_url: None
  • paper_authors: Itay Buchnik, Guy Sagi, Nimrod Leinwand, Yuval Loya, Nir Shlezinger, Tirza Routtenberg
  • for: 这个论文主要研究了图信号的跟踪问题,即在社交网络、电力网络和交通等应用中遇到的动态图信号系统的跟踪。
  • methods: 该论文提出了一种混合的模型驱动/数据驱动方法,称为GSP-KalmanNet,它利用图信号处理(GSP)工具和深度学习(DL)技术,从图上的观测中跟踪隐藏的图状态。
  • results: 实验结果表明,GSP-KalmanNet可以在较高维度的信号处理中提高准确性和运行时间性,同时提高模型误差的抗性。
    Abstract Dynamic systems of graph signals are encountered in various applications, including social networks, power grids, and transportation. While such systems can often be described as state space (SS) models, tracking graph signals via conventional tools based on the Kalman filter (KF) and its variants is typically challenging. This is due to the nonlinearity, high dimensionality, irregularity of the domain, and complex modeling associated with real-world dynamic systems of graph signals. In this work, we study the tracking of graph signals using a hybrid model-based/data-driven approach. We develop the GSP-KalmanNet, which tracks the hidden graphical states from the graphical measurements by jointly leveraging graph signal processing (GSP) tools and deep learning (DL) techniques. The derivations of the GSP-KalmanNet are based on extending the KF to exploit the inherent graph structure via graph frequency domain filtering, which considerably simplifies the computational complexity entailed in processing high-dimensional signals and increases the robustness to small topology changes. Then, we use data to learn the Kalman gain following the recently proposed KalmanNet framework, which copes with partial and approximated modeling, without forcing a specific model over the noise statistics. Our empirical results demonstrate that the proposed GSP-KalmanNet achieves enhanced accuracy and run time performance as well as improved robustness to model misspecifications compared with both model-based and data-driven benchmarks.
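
The model-based half of GSP-KalmanNet filters signals in the graph frequency domain, i.e. in the eigenbasis of the graph Laplacian. A small NumPy sketch of that building block (graph Fourier transform plus a user-chosen frequency response) is given below; the learned Kalman-gain network from the paper is not reproduced here.

```python
import numpy as np

def graph_fourier_basis(adjacency):
    """Eigen-decompose the combinatorial Laplacian L = D - A of an undirected graph."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)   # eigvals = graph frequencies, eigvecs = GFT basis
    return eigvals, eigvecs

def graph_filter(signal, adjacency, freq_response):
    """Apply a graph-frequency-domain filter: x_out = U h(Lambda) U^T x."""
    eigvals, U = graph_fourier_basis(adjacency)
    x_freq = U.T @ signal                        # graph Fourier transform
    x_freq_filtered = freq_response(eigvals) * x_freq
    return U @ x_freq_filtered                   # inverse transform back to the vertex domain

# usage sketch: low-pass filtering of a noisy graph signal inside a tracking loop
# A = ...                                        # adjacency matrix of the sensor graph
# x_smooth = graph_filter(x_noisy, A, freq_response=lambda lam: 1.0 / (1.0 + 2.0 * lam))
```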

Single-cell Multi-view Clustering via Community Detection with Unknown Number of Clusters

  • paper_url: http://arxiv.org/abs/2311.17103
  • repo_url: https://github.com/dayuhuu/scunc
  • paper_authors: Dayu Hu, Zhibin Dong, Ke Liang, Jun Wang, Siwei Wang, Xinwang Liu
  • for: 这篇论文旨在探讨单个细胞多视图划分的问题,以便探索细胞内部的多样性。
  • methods: 这篇论文提出了一种名为scUNC的新型多视图划分方法,该方法可以自动地将不同视图的信息融合到一起,而无需先定义划分数。
  • results: 对于三个不同的单个细胞数据集,scUNC方法的结果表明,它在比较基eline方法时表现出了更好的性能。
    Abstract Single-cell multi-view clustering enables the exploration of cellular heterogeneity within the same cell from different views. Despite the development of several multi-view clustering methods, two primary challenges persist. Firstly, most existing methods treat the information from both single-cell RNA (scRNA) and single-cell Assay of Transposase Accessible Chromatin (scATAC) views as equally significant, overlooking the substantial disparity in data richness between the two views. This oversight frequently leads to a degradation in overall performance. Additionally, the majority of clustering methods necessitate manual specification of the number of clusters by users. However, for biologists dealing with cell data, precisely determining the number of distinct cell types poses a formidable challenge. To this end, we introduce scUNC, an innovative multi-view clustering approach tailored for single-cell data, which seamlessly integrates information from different views without the need for a predefined number of clusters. The scUNC method comprises several steps: initially, it employs a cross-view fusion network to create an effective embedding, which is then utilized to generate initial clusters via community detection. Subsequently, the clusters are automatically merged and optimized until no further clusters can be merged. We conducted a comprehensive evaluation of scUNC using three distinct single-cell datasets. The results underscored that scUNC outperforms the other baseline methods.
    摘要 单个细胞多视图划分可以探索同一个细胞中的细胞多样性从不同的视角。虽然有几种多视图划分方法的开发,但两大挑战仍然存在。首先,大多数现有方法往往对单细胞RNA(scRNA)和单细胞Assay of Transposase Accessible Chromatin(scATAC)视图中的信息进行平等对待,忽略了这两种视图中数据的差异。这会导致整体性能下降。其次,大多数划分方法需要用户手动指定划分数量。然而,对细胞数据进行划分是一项困难的任务,特别是为了准确地确定细胞中的几个不同类型。为此,我们介绍了scUNC方法,这是针对单细胞数据的创新的多视图划分方法,可以自动地将不同视图中的信息集成,无需先定划分数量。scUNC方法包括以下步骤:首先,它使用cross-view fusión网络创建有效的嵌入,然后使用社区检测来生成初始划分。接着,划分会自动地合并和优化,直到无法再合并划分。我们对三个不同的单细胞数据集进行了广泛的评估,结果表明scUNC方法在基准方法上表现出色。
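
scUNC initializes clusters by running community detection on an embedding fused from the two views and then merges clusters until no further merge helps, which is what removes the need for a user-specified cluster count. Below is a rough sketch of just the initialization step (kNN graph on a fused embedding plus modularity-based community detection); the cross-view fusion network and the merge criterion are the paper's own components and are only stubbed here.

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph
from networkx.algorithms.community import greedy_modularity_communities

def initial_clusters(fused_embedding, k=15):
    """Build a cell-cell kNN graph from a fused multi-view embedding and return
    community-detection labels (no predefined number of clusters)."""
    knn = kneighbors_graph(fused_embedding, n_neighbors=k, mode="connectivity")
    graph = nx.from_scipy_sparse_array(knn)
    communities = greedy_modularity_communities(graph)
    labels = np.empty(fused_embedding.shape[0], dtype=int)
    for cid, members in enumerate(communities):
        labels[list(members)] = cid
    return labels

# usage sketch: z = fusion_network(scrna_view, scatac_view)   # hypothetical cross-view encoder
#               labels = initial_clusters(z)                   # then merge clusters iteratively
```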

Monitor Placement for Fault Localization in Deep Neural Network Accelerators

  • paper_url: http://arxiv.org/abs/2311.16594
  • repo_url: None
  • paper_authors: Wei-Kai Liu, Benjamin Tan, Krishnendu Chakrabarty
  • For: The paper is written for improving the reliability of deep neural network (DNN) accelerators using systolic arrays.* Methods: The paper proposes a solution to optimize the placement of hardware monitors within systolic arrays to localize faulty processing elements (PEs) and improve the reliability of DNN inferencing.* Results: The paper shows that $2N-1$ monitors are needed to localize a single faulty PE, and derives the monitor placement. The paper also proposes a heuristic approach to balance the reliability and hardware resource utilization in DNN accelerators when the number of monitors is limited. Experimental evaluation shows that an area overhead of only 0.33% is incurred for a $256\times 256$ systolic array.
    Abstract Systolic arrays are a prominent choice for deep neural network (DNN) accelerators because they offer parallelism and efficient data reuse. Improving the reliability of DNN accelerators is crucial as hardware faults can degrade the accuracy of DNN inferencing. Systolic arrays make use of a large number of processing elements (PEs) for parallel processing, but when one PE is faulty, the error propagates and affects the outcomes of downstream PEs. Due to the large number of PEs, the cost associated with implementing hardware-based runtime monitoring of every single PE is infeasible. We present a solution to optimize the placement of hardware monitors within systolic arrays. We first prove that $2N-1$ monitors are needed to localize a single faulty PE and we also derive the monitor placement. We show that a second placement optimization problem, which minimizes the set of candidate faulty PEs for a given number of monitors, is NP-hard. Therefore, we propose a heuristic approach to balance the reliability and hardware resource utilization in DNN accelerators when number of monitors is limited. Experimental evaluation shows that to localize a single faulty PE, an area overhead of only 0.33% is incurred for a $256\times 256$ systolic array.
    摘要 systolic 阵列是深度神经网络(DNN)加速器的一种常用选择,因为它们提供并行处理和数据重用的机制。 因为硬件错误可以导致神经网络推理精度下降,因此提高 DNN 加速器的可靠性是非常重要的。 systolic 阵列使用大量的处理元素(PE)进行并行处理,但是当一个 PE 出现错误时,错误会向下游 PE 传播并影响其结果。由于 systolic 阵列中的 PE 的数量很大,因此实施硬件基础Runtime监控每个 PE 的成本是不可能的。我们提出一种优化 systolic 阵列中硬件监控器的放置方案。我们首先证明需要 2N-1 个监控器来确定单个错误的 PE,并且我们还 derivation 监控器的放置。我们发现,对于给定的监控器数量, minimize 候选错误 PE 的集合是 NP-hard 问题。因此,我们提出一种妥协方法,以平衡在 DNN 加速器中的可靠性和硬件资源利用率。实验表明,为了确定单个错误 PE,只需要在 256x256 的 systolic 阵列中增加 0.33% 的面积开销。

StreamFlow: Streamlined Multi-Frame Optical Flow Estimation for Video Sequences

  • paper_url: http://arxiv.org/abs/2311.17099
  • repo_url: None
  • paper_authors: Shangkun Sun, Jiaming Liu, Thomas H. Li, Huaxia Li, Guoqing Liu, Wei Gao
  • for: 这个研究旨在解决影像流动预测中的遮掩现象,这些遮掩现象会破坏 pixels 之间的对称性,从而导致预测结果的严重损失。
  • methods: 这个研究使用了一个整合了内核的 Streamlined In-batch Multi-frame (SIM) 架构,并导入了一个高效的 Integrative Spatio-temporal Coherence (ISC) 模型来有效地捕捉空间-时间相互关联。此外,它还具有一个全球时间调整器 (GTR),帮助更好地探索时间关系。
  • results: 这个研究获得了在难题的 KITTI 和 Sintel 数据集上的高性能,特别是在遮掩区域,并且与前一代多帧方法相比,实现了$63.82%$ 的速度提升。
    Abstract Occlusions between consecutive frames have long posed a significant challenge in optical flow estimation. The inherent ambiguity introduced by occlusions directly violates the brightness constancy constraint and considerably hinders pixel-to-pixel matching. To address this issue, multi-frame optical flow methods leverage adjacent frames to mitigate the local ambiguity. Nevertheless, prior multi-frame methods predominantly adopt recursive flow estimation, resulting in a considerable computational overlap. In contrast, we propose a streamlined in-batch framework that eliminates the need for extensive redundant recursive computations while concurrently developing effective spatio-temporal modeling approaches under in-batch estimation constraints. Specifically, we present a Streamlined In-batch Multi-frame (SIM) pipeline tailored to video input, attaining a similar level of time efficiency to two-frame networks. Furthermore, we introduce an efficient Integrative Spatio-temporal Coherence (ISC) modeling method for effective spatio-temporal modeling during the encoding phase, which introduces no additional parameter overhead. Additionally, we devise a Global Temporal Regressor (GTR) that effectively explores temporal relations during decoding. Benefiting from the efficient SIM pipeline and effective modules, StreamFlow not only excels in terms of performance on the challenging KITTI and Sintel datasets, with particular improvement in occluded areas but also attains a remarkable $63.82\%$ enhancement in speed compared with previous multi-frame methods. The code will be available soon at https://github.com/littlespray/StreamFlow.
    摘要 连续帧之间的遮挡长期以来一直是光流估计的重大挑战。遮挡引入的内在歧义直接违背了亮度恒定约束,并严重阻碍了像素到像素的匹配。为解决这个问题,多帧光流方法利用相邻帧来缓解局部歧义。然而,先前的多帧方法主要采用递归式光流估计,导致大量的计算重叠。相比之下,我们提出了一个精简的批内(in-batch)框架,消除了大量冗余的递归计算,同时在批内估计约束下发展有效的时空建模方法。具体来说,我们提出了适用于视频输入的精简批内多帧(SIM)管道,实现了与两帧网络相近的时间效率。此外,我们提出了一种高效的整合时空一致性(ISC)建模方法,在编码阶段进行有效的时空建模,且不增加额外参数。另外,我们设计了一种全局时间回归器(GTR),在解码阶段有效地探索时间关系。得益于高效的SIM管道和有效的模块,StreamFlow不仅在具有挑战性的KITTI和Sintel数据集上表现出色(尤其是在遮挡区域),还相比先前的多帧方法实现了63.82%的速度提升。代码将很快在GitHub上发布。

DyRA: Dynamic Resolution Adjustment for Scale-robust Object Detection

  • paper_url: http://arxiv.org/abs/2311.17098
  • repo_url: https://github.com/daeunfullgrace/dyra
  • paper_authors: Daeun Seo, Hoeseok Yang, Hyungshin Kim
  • for: 提高对象检测中的常规准确率,因为对象的大小弹性会导致模型的准确率变化。
  • methods: 提出了一种适应分辨率扩展网络(DyRA),该网络包括卷积和变换器编码块,可以与现有的检测器结合使用。DyRA返回一个从输入图像中获得的缩放因子,允许实例特定的缩放。
  • results: 在COCO、RetinaNet、Faster-RCNN、FCOS和Mask-RCNN等四个检测器上进行了实验,实现了与多resolution基线相比的1.3%、1.1%、1.3%和0.8%的准确率提高。
    Abstract In object detection, achieving constant accuracy is challenging due to the variability of object sizes. One possible solution to this problem is to optimize the input resolution, known as a multi-resolution strategy. Previous approaches for optimizing resolution are often based on pre-defined resolutions or a dynamic neural network, but there is a lack of study for run-time resolution optimization for existing architecture. In this paper, we propose an adaptive resolution scaling network called DyRA, which comprises convolutions and transformer encoder blocks, for existing detectors. Our DyRA returns a scale factor from an input image, which enables instance-specific scaling. This network is jointly trained with detectors with specially designed loss functions, namely ParetoScaleLoss and BalanceLoss. The ParetoScaleLoss produces an adaptive scale factor from the image, while the BalanceLoss optimizes the scale factor according to localization power for the dataset. The loss function is designed to minimize accuracy drop about the contrasting objective of small and large objects. Our experiments on COCO, RetinaNet, Faster-RCNN, FCOS, and Mask-RCNN achieved 1.3%, 1.1%, 1.3%, and 0.8% accuracy improvement than a multi-resolution baseline with solely resolution adjustment. The code is available at https://github.com/DaEunFullGrace/DyRA.git.
    摘要 在对象检测中,保持定点准确性是一个挑战,因为对象的大小变化可以导致模型的性能下降。一种可能的解决方案是使用多resolution策略,但前一些方法通常是基于预定的分辨率或动态神经网络。在这篇论文中,我们提出了一种名为 DyRA 的 adaptive resolution scaling network,该网络包括卷积和变换器编码块,用于现有的检测器。我们的 DyRA 从输入图像返回一个扩大因子,该因子可以根据实例特点进行实例化。我们与检测器一起培训了特定的损失函数,包括 ParetoScaleLoss 和 BalanceLoss。ParetoScaleLoss 生成了适应性的扩大因子,而 BalanceLoss 优化了扩大因子的地方化能力。损失函数的设计目的是尽可能地降低小对象和大对象之间的准确率下降。我们在 COCO、RetinaNet、Faster-RCNN、FCOS 和 Mask-RCNN 上进行了实验,并实现了与多resolution基eline相比的1.3%、1.1%、1.3%和0.8%的准确率提高。代码可以在 中找到。
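
DyRA's interface is simple: a small network maps the input image to a positive, instance-specific scale factor that resizes the image before it enters the detector. The sketch below shows that interface in PyTorch; the actual DyRA architecture (convolutions plus transformer encoder blocks) and its ParetoScaleLoss/BalanceLoss are not reproduced, only assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalePredictorSketch(nn.Module):
    """Toy stand-in for DyRA: predict one resolution scale factor per image."""
    def __init__(self, min_scale=0.5, max_scale=2.0):
        super().__init__()
        self.min_scale, self.max_scale = min_scale, max_scale
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1),
        )

    def forward(self, image):
        s = torch.sigmoid(self.backbone(image))          # (B, 1) in (0, 1)
        return self.min_scale + (self.max_scale - self.min_scale) * s

def rescale_and_detect(detector, scaler, image):
    scale = scaler(image)[0].item()                      # one image -> one scale factor
    h, w = image.shape[-2:]
    resized = F.interpolate(image, size=(int(h * scale), int(w * scale)),
                            mode="bilinear", align_corners=False)
    return detector(resized)                             # any off-the-shelf detector
```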

Anonymous Jamming Detection in 5G with Bayesian Network Model Based Inference Analysis

  • paper_url: http://arxiv.org/abs/2311.17097
  • repo_url: None
  • paper_authors: Ying Wang, Shashank Jere, Soumya Banerjee, Lingjia Liu, Sachin Shetty, Shehadi Dayekh
  • for: 这篇论文是为了探讨5G中的干扰检测和入侵检测,以保持可靠性、避免用户体验下降和基础设施失败。
  • methods: 该论文提出了一种基于协议堆栈参数的匿名干扰检测模型,使用超级vised学习和无级学习实现实时、高精度的干扰检测,包括未知类型干扰。
  • results: 实验结果显示,使用supervised模型可达AUC为0.964到1,比LSTM模型有AUC为0.923到1。然而,需要数据注释限制了supervised方法。为解决这一问题,paper提出了一种无级自适应异常检测方法,AUC为0.987,并能抗性 adversarial训练样本。此外,paper还介绍了一种基于 bayesian network的 causation analysis,以提供透明度和领域知识注入。
    Abstract Jamming and intrusion detection are critical in 5G research, aiming to maintain reliability, prevent user experience degradation, and avoid infrastructure failure. This paper introduces an anonymous jamming detection model for 5G based on signal parameters from the protocol stacks. The system uses supervised and unsupervised learning for real-time, high-accuracy detection of jamming, including unknown types. Supervised models reach an AUC of 0.964 to 1, compared to LSTM models with an AUC of 0.923 to 1. However, the need for data annotation limits the supervised approach. To address this, an unsupervised auto-encoder-based anomaly detection is presented with an AUC of 0.987. The approach is resistant to adversarial training samples. For transparency and domain knowledge injection, a Bayesian network-based causation analysis is introduced.
    摘要 “干扰和入侵探测是5G研究中的重要课题,以确保可靠性、避免用户体验下降和基础设施故障。本文提出了一个匿名干扰探测模型,基于协议堆栈的参数来进行实时、高精度的干扰探测,包括未知类型。使用监督学习和无监督学习,监督学习可以 дости到AUC的0.964-1,而LSTM模型则为AUC的0.923-1。但是,需要数据标注限制了监督方法。为了解决这个问题,本文提出了一个无监督自动化探测方法,其AUC为0.987。此方法具有防止反攻击训练样本的特点。为了透明度和领域知识注入,本文还提出了一个基于 Bayesian 网络的 causation 分析。”
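
The unsupervised detector described above is an auto-encoder trained on benign protocol-stack parameters; at run time, a reconstruction error above a threshold flags (possibly unknown) jamming. A compact PyTorch sketch of that pattern follows; feature dimensionality, layer sizes, and the thresholding rule are assumptions.

```python
import torch
import torch.nn as nn

class JammingAutoencoder(nn.Module):
    """Auto-encoder over per-sample KPI / protocol-stack feature vectors."""
    def __init__(self, n_features, latent=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                     nn.Linear(32, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(),
                                     nn.Linear(32, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

@torch.no_grad()
def is_jammed(model, x, threshold):
    """Flag samples whose reconstruction error exceeds a threshold calibrated on benign data."""
    err = ((model(x) - x) ** 2).mean(dim=1)
    return err > threshold

# training sketch: minimize MSE(model(x_benign), x_benign); pick `threshold` as, e.g.,
# a high percentile of the reconstruction errors on a benign validation split.
```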

Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models

  • paper_url: http://arxiv.org/abs/2311.17095
  • repo_url: None
  • paper_authors: Luo Jiayun, Siddhesh Khandelwal, Leonid Sigal, Boyang Li
  • for: 这篇论文的目的是提出一个训练无需的技术,将大规模感知语言模型(VLM)应用于开放词汇 semantic segmentation 任务。
  • methods: 这篇论文提出了一个简单 yet extremely effective的技术,即 Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS),它利用了 VLM 的直接文本至图像混合注意力和图像文本匹配损失来生成 semantic segmentation。
  • results: 相比于现有的技术,提出的方法不需要任何神经网络训练和进行适应器参数的调整,并且可以在没有 segmentation 标注的情况下进行预测。结果显示,PnP-OVSS 在 Pascal VOC、Pascal Context 和 MS COCO 等测试集上表现出较好的效果,甚至超过了一些基于 VLM 的基eline。
    Abstract From an enormous amount of image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which is vital for tasks such as image captioning and visual question answering. However, leveraging such pre-trained models for open-vocabulary semantic segmentation remains a challenge. In this paper, we propose a simple, yet extremely effective, training-free technique, Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) for this task. PnP-OVSS leverages a VLM with direct text-to-image cross-attention and an image-text matching loss to produce semantic segmentation. However, cross-attention alone tends to over-segment, whereas cross-attention plus GradCAM tend to under-segment. To alleviate this issue, we introduce Salience Dropout; by iteratively dropping patches that the model is most attentive to, we are able to better resolve the entire extent of the segmentation mask. Compared to existing techniques, the proposed method does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations, even for a validation set. PnP-OVSS demonstrates substantial improvements over a comparable baseline (+29.4% mIoU on Pascal VOC, +13.2% mIoU on Pascal Context, +14.0% mIoU on MS COCO, +2.4% mIoU on COCO Stuff) and even outperforms most baselines that conduct additional network training on top of pretrained VLMs.
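
Salience Dropout is the part of PnP-OVSS that keeps cross-attention from collapsing onto the most salient patches: the currently most-attended patches are dropped and attention is recomputed, so later rounds reveal the remaining extent of the object. The loop below is a hedged, framework-agnostic sketch of that idea; `cross_attention_map` stands in for the VLM's text-to-image attention and is an assumed interface, not a real API, and the round count and drop fraction are illustrative.

```python
import torch

def salience_dropout_masks(image_patches, text_query, cross_attention_map,
                           rounds=3, drop_frac=0.3):
    """Accumulate a soft segmentation mask over several rounds, dropping the
    top-attended patches after each round so less salient regions can surface.

    cross_attention_map(patches, text, keep_mask) -> (num_patches,) non-negative scores
    (assumed interface; in PnP-OVSS this comes from a frozen vision-language model).
    """
    num_patches = image_patches.shape[0]
    keep = torch.ones(num_patches, dtype=torch.bool)
    accumulated = torch.zeros(num_patches)
    for _ in range(rounds):
        attn = cross_attention_map(image_patches, text_query, keep)
        attn = attn * keep                       # dropped patches contribute nothing
        accumulated = torch.maximum(accumulated, attn)
        k = max(1, int(drop_frac * int(keep.sum())))
        top = torch.topk(attn, k).indices        # most attended among the kept patches
        keep[top] = False                        # drop them for the next round
    return accumulated                           # reshape to the patch grid for a mask

# usage sketch: mask = salience_dropout_masks(patches, "a photo of a dog", vlm_attention)
```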

Graph Prompt Learning: A Comprehensive Survey and Beyond

  • paper_url: http://arxiv.org/abs/2311.16534
  • repo_url: https://github.com/sheldonresearch/ProG
  • paper_authors: Xiangguo Sun, Jiawen Zhang, Xixi Wu, Hong Cheng, Yun Xiong, Jia Li
  • for: 本文旨在探讨人工通用智能(AGI)在图数据上的应用,特别是图数据处理方面的挑战和机遇。
  • methods: 本文提出了一个统一框架来理解图Prompt学习,并详细介绍了图Prompt的特点、灵活性和表达能力,以及与现有图模型的交互。
  • results: 本文分析了目前AGI在处理图数据方面的状况,并提出了一个综合分类法,将相关工作分为节点级、边级和图级预训任务。此外,本文还介绍了ProG库和相关网站,以支持和推动图Prompt研究。
    Abstract Artificial General Intelligence (AGI) has revolutionized numerous fields, yet its integration with graph data, a cornerstone in our interconnected world, remains nascent. This paper presents a pioneering survey on the emerging domain of graph prompts in AGI, addressing key challenges and opportunities in harnessing graph data for AGI applications. Despite substantial advancements in AGI across natural language processing and computer vision, the application to graph data is relatively underexplored. This survey critically evaluates the current landscape of AGI in handling graph data, highlighting the distinct challenges in cross-modality, cross-domain, and cross-task applications specific to graphs. Our work is the first to propose a unified framework for understanding graph prompt learning, offering clarity on prompt tokens, token structures, and insertion patterns in the graph domain. We delve into the intrinsic properties of graph prompts, exploring their flexibility, expressiveness, and interplay with existing graph models. A comprehensive taxonomy categorizes over 100 works in this field, aligning them with pre-training tasks across node-level, edge-level, and graph-level objectives. Additionally, we present, ProG, a Python library, and an accompanying website, to support and advance research in graph prompting. The survey culminates in a discussion of current challenges and future directions, offering a roadmap for research in graph prompting within AGI. Through this comprehensive analysis, we aim to catalyze further exploration and practical applications of AGI in graph data, underlining its potential to reshape AGI fields and beyond. ProG and the website can be accessed by \url{https://github.com/WxxShirley/Awesome-Graph-Prompt}, and \url{https://github.com/sheldonresearch/ProG}, respectively.
    摘要 人工通用智能(AGI)已经革命化了许多领域,但是它与图数据,我们现代世界中的重要基础,的结合仍然处于激进阶段。这篇论文提出了对emerging领域的图prompt在AGI中的全面评估,探讨了在应用图数据的关键挑战和机遇。虽然AGI在自然语言处理和计算机视觉方面已经取得了很大的进步,但是对图数据的应用还是相对未经explored。这篇论文对AGI在处理图数据方面的当前状况进行了严格的评估,揭示了跨模态、跨领域和跨任务特有的图数据应用中的独特挑战。我们的工作是首次提出了一个统一的框架 для理解图示学习,为图示学习提供了清晰的prompt Token、Token结构和插入模式。我们进一步探讨了图示的内在特性,包括其灵活性、表达能力和与现有图模型的交互。我们根据任务类型分类了超过100个相关研究,并提供了一个Python库和相应的网站,以支持和推动图示研究。这篇论文的结论是,通过进一步探讨图示的挑战和未来方向,我们可以激发AGI在图数据上的应用,并且这将对AGI相关领域和之外产生深远的影响。ProG和相应的网站可以在上获取。

Efficient Multimodal Diffusion Models Using Joint Data Infilling with Partially Shared U-Net

  • paper_url: http://arxiv.org/abs/2311.16488
  • repo_url: None
  • paper_authors: Zizhao Hu, Shaochong Jia, Mohammad Rostami
  • for: 这 paper 是为了提出一种高效的多Modal 扩散模型,能够 preserve 多Modal 细节和效率。
  • methods: 该模型使用 Partially Shared U-Net (PS-U-Net) 架构,允许文本和图像输入通过专门的层和跳过连接来保留模式特有的细节。 此外,它还提出了一种新的多Modal 采样方法,可以在只需学习单一的共享分布下进行条件生成新的enario。
  • results: 对于 MS-COCO 数据集,我们的方法可以生成高质量的多Modal text 和图像数据,比对exist的多Modal 扩散模型更高效,具有相似的大小、更快的训练、更快的多Modal 采样和更加灵活的生成。
    Abstract Recently, diffusion models have been used successfully to fit distributions for cross-modal data translation and multimodal data generation. However, these methods rely on extensive scaling, overlooking the inefficiency and interference between modalities. We develop Partially Shared U-Net (PS-U-Net) architecture which is an efficient multimodal diffusion model that allows text and image inputs to pass through dedicated layers and skip-connections for preserving modality-specific fine-grained details. Inspired by image inpainting, we also propose a new efficient multimodal sampling method that introduces new scenarios for conditional generation while only requiring a simple joint distribution to be learned. Our empirical exploration of the MS-COCO dataset demonstrates that our method generates multimodal text and image data with higher quality compared to existing multimodal diffusion models while having a comparable size, faster training, faster multimodal sampling, and more flexible generation.
    摘要 最近,扩散模型已成功应用于跨Modal数据翻译和多Modal数据生成。然而,这些方法依赖于广泛的扩散,忽略了Modalities之间的不效率和干扰。我们开发了半共享U-Net(PS-U-Net)架构,这是一种高效的多Modal扩散模型,允许文本和图像输入通过专门的层和跳过连接保留Modalities特有的细腻细节。受图像填充启发,我们还提议一种新的高效多Modal采样方法,该方法在只需学习一个简单的共享分布下可以生成新的Scene。我们对COCO dataset进行了实验,结果表明,我们的方法可以在与现有多Modal扩散模型相比生成高质量的文本和图像数据,同时具有相似的大小、更快的训练、更快的多Modal采样和更多的生成方式。

Enhancing Human Persuasion With Large Language Models

  • paper_url: http://arxiv.org/abs/2311.16466
  • repo_url: None
  • paper_authors: Minkyu Shin, Jin Kim
  • for: 这个论文研究了人工智能语言模型(LLM)在人类交流中的影响,具体来说是在金融业消费者投诉中。
  • methods: 该论文使用了一种AI检测工具对 более чем780万个消费者投诉数据进行分析,并发现了LLM在投诉书写中的使用。
  • results: 研究发现,使用LLM可以提高消费者投诉的语言质量,并且可以提高获得满意的结果的可能性(即金融机构提供的满意解决方案)。这些结果与前期注册的实验结果相符,证明LLM可以在人类交流中提高消费者的言语吸引力。
    Abstract Although large language models (LLMs) are reshaping various aspects of human life, our current understanding of their impacts remains somewhat constrained. Here we investigate the impact of LLMs on human communication, in the context of consumer complaints in the financial industry. Employing an AI detection tool on more than 780K complaints gathered by the Consumer Financial Protection Bureau (CFPB), we find evidence of LLM usage in the writing of complaints - shortly after the release of ChatGPT. Our analyses reveal that LLM usage is positively correlated with the likelihood of obtaining desirable outcomes (i.e., offer of relief from financial firms) and suggest that this positive correlation may be partly due to the linguistic features improved by LLMs. We test this conjecture with a preregistered experiment, which reveals results consistent with those from observational studies: Consumer complaints written with ChatGPT for improved linguistic qualities were more likely to receive hypothetical relief offers than the original consumer complaints, demonstrating the LLM's ability to enhance message persuasiveness in human communication. Being some of the earliest empirical evidence on LLM usage for enhancing persuasion, our results highlight the transformative potential of LLMs in human communication.

Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection

  • paper_url: http://arxiv.org/abs/2311.16464
  • repo_url: https://github.com/easonxiao-888/uvcom
  • paper_authors: Yicheng Xiao, Zhuoyan Luo, Yong Liu, Yue Ma, Hengwei Bian, Yatai Ji, Yujiu Yang, Xiu Li
  • for: Video Moment Retrieval (MR) and Highlight Detection (HD) 视频瞬间检索和精彩点检测
  • methods: 使用 transformer-based architecture 使用转换器基本结构
  • results: 对 QVHighlights、Charades-STA、TACoS、YouTube Highlights 和 TVSum datasets进行了广泛的实验,并在这些 datasets 上表现出了杰出的效果和合理性,超过了当前状态的方法。
    Abstract Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted significant attention due to the growing demand for video analysis. Recent approaches treat MR and HD as similar video grounding problems and address them together with transformer-based architecture. However, we observe that the emphasis of MR and HD differs, with one necessitating the perception of local relationships and the other prioritizing the understanding of global contexts. Consequently, the lack of task-specific design will inevitably lead to limitations in associating the intrinsic specialty of two tasks. To tackle the issue, we propose a Unified Video COMprehension framework (UVCOM) to bridge the gap and jointly solve MR and HD effectively. By performing progressive integration on intra and inter-modality across multi-granularity, UVCOM achieves the comprehensive understanding in processing a video. Moreover, we present multi-aspect contrastive learning to consolidate the local relation modeling and global knowledge accumulation via well aligned multi-modal space. Extensive experiments on QVHighlights, Charades-STA, TACoS , YouTube Highlights and TVSum datasets demonstrate the effectiveness and rationality of UVCOM which outperforms the state-of-the-art methods by a remarkable margin.
    摘要 视频时刻回归(MR)和突出点检测(HD)在视频分析方面引起了广泛的关注,因为它们能够帮助我们更好地理解视频内容。然而,我们发现MR和HD之间存在一定的区别,MR需要对视频中的本地关系进行感知,而HD则需要对视频的全局背景进行理解。因此,如果不采取特定任务的设计,将导致两个任务之间的关系不充分耦合,从而限制视频的理解。为了解决这个问题,我们提出了一个统一视频理解框架(UVCOM),它可以有效地解决MR和HD两个任务。UVCOM通过在多维度和多级别进行进程式的 интеграción,实现了视频的全面理解。此外,我们还提出了多方面对比学习,通过对多Modal空间进行满足的对比,来强化本地关系模型和全局知识储存。我们在QVHighlights、Charades-STA、TACoS、YouTube Highlights和TVSum等数据集上进行了广泛的实验,结果表明UVCOM可以具有显著的优势,与当前的方法相比,它的性能有remarkable提升。

Typhoon Intensity Prediction with Vision Transformer

  • paper_url: http://arxiv.org/abs/2311.16450
  • repo_url: https://github.com/chen-huanxin/tint
  • paper_authors: Huanxin Chen, Pengshuai Yin, Huichou Huang, Qingyao Wu, Ruirui Liu, Xiatian Zhu
  • for: 预测台风强度准确 across space and time, 以便发布及时的灾害警示和紧急应急救援。
  • methods: 利用卫星图像进行enario分析,并采用自注意机制和全球受辐激场进行Feature representation learning。
  • results: 比采用现有的卷积神经网络 (CNNs) 更高效,并且可以更好地捕捉到长距离依赖和全球上下文知识。
    Abstract Predicting typhoon intensity accurately across space and time is crucial for issuing timely disaster warnings and facilitating emergency response. This has vast potential for minimizing life losses and property damages as well as reducing economic and environmental impacts. Leveraging satellite imagery for scenario analysis is effective but also introduces additional challenges due to the complex relations among clouds and the highly dynamic context. Existing deep learning methods in this domain rely on convolutional neural networks (CNNs), which suffer from limited per-layer receptive fields. This limitation hinders their ability to capture long-range dependencies and global contextual knowledge during inference. In response, we introduce a novel approach, namely "Typhoon Intensity Transformer" (Tint), which leverages self-attention mechanisms with global receptive fields per layer. Tint adopts a sequence-to-sequence feature representation learning perspective. It begins by cutting a given satellite image into a sequence of patches and recursively employs self-attention operations to extract both local and global contextual relations between all patch pairs simultaneously, thereby enhancing per-patch feature representation learning. Extensive experiments on a publicly available typhoon benchmark validate the efficacy of Tint in comparison with both state-of-the-art deep learning and conventional meteorological methods. Our code is available at https://github.com/chen-huanxin/Tint.
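To make the patch-sequence idea concrete, here is a minimal ViT-style sketch of an intensity regressor: the image is cut into patches, every self-attention layer attends over all patches (a global receptive field), and a pooled representation feeds a scalar head. The module name, layer sizes, and the RGB-input choice are assumptions, not the released Tint architecture.

```python
# Minimal sketch of a patch-sequence transformer for intensity regression.
# Hyperparameters and input channels are illustrative only.
import torch
import torch.nn as nn

class TyphoonIntensityTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=256, depth=6, heads=8):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Cut the satellite image into non-overlapping patches and embed each one.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                   batch_first=True)
        # Each self-attention layer sees all patches at once.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, 1)  # scalar intensity estimate

    def forward(self, x):                        # x: (B, 3, H, W) satellite image
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.encoder(tokens + self.pos_embed)
        return self.head(tokens.mean(dim=1))     # pool patches, predict intensity

model = TyphoonIntensityTransformer()
intensity = model(torch.randn(2, 3, 224, 224))   # (2, 1)
```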

Text-Driven Image Editing via Learnable Regions

  • paper_url: http://arxiv.org/abs/2311.16432
  • repo_url: https://github.com/yuanze-lin/Learnable_Regions
  • paper_authors: Yuanze Lin, Yi-Wen Chen, Yi-Hsuan Tsai, Lu Jiang, Ming-Hsuan Yang
  • for: The purpose of this paper is to propose a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches.
  • methods: The method leverages an existing pretrained text-to-image model and introduces a bounding box generator to find the edit regions that are aligned with the textual prompts.
  • results: Our method can achieve highly flexible image editing and can handle complex prompts featuring multiple objects, complex sentences, or long paragraphs. We conduct an extensive user study to compare our method against state-of-the-art methods. Experimental results demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that align with the language descriptions provided.
    Abstract Language has emerged as a natural interface for image editing. In this paper, we introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches. Specifically, our approach leverages an existing pretrained text-to-image model and introduces a bounding box generator to find the edit regions that are aligned with the textual prompts. We show that this simple approach enables flexible editing that is compatible with current image generation models, and is able to handle complex prompts featuring multiple objects, complex sentences or long paragraphs. We conduct an extensive user study to compare our method against state-of-the-art methods. Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that align with the language descriptions provided. Our project webpage: https://yuanze-lin.me/LearnableRegions_page.
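One way to picture the bounding-box-driven pipeline is: propose candidate boxes, score each crop against the prompt with an image-text similarity model, and hand the winning box to an off-the-shelf text-to-image editor. In the sketch below, `image_text_score`, `propose_boxes`, and `select_edit_region` are hypothetical stand-ins for the paper's learned bounding box generator and its scoring, not the released code.

```python
# Hedged sketch: pick the region of an image that best matches an editing prompt.
import torch

def image_text_score(crop, prompt):
    # Stand-in for a CLIP-style image-text similarity; here just a random score.
    return torch.rand(())

def propose_boxes(h, w, sizes=(64, 128, 192), stride=64):
    boxes = []
    for s in sizes:
        for y in range(0, h - s + 1, stride):
            for x in range(0, w - s + 1, stride):
                boxes.append((x, y, x + s, y + s))
    return boxes

def select_edit_region(image, prompt):
    """image: (3, H, W) tensor. Returns the box whose crop best matches the prompt."""
    _, h, w = image.shape
    scored = []
    for (x0, y0, x1, y1) in propose_boxes(h, w):
        crop = image[:, y0:y1, x0:x1]
        scored.append((image_text_score(crop, prompt), (x0, y0, x1, y1)))
    return max(scored, key=lambda t: t[0])[1]

image = torch.rand(3, 256, 256)
box = select_edit_region(image, "a red vintage car")
# A pretrained text-to-image / inpainting model would then regenerate only the
# selected box, conditioned on the prompt.
```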

Manifold Preserving Guided Diffusion

  • paper_url: http://arxiv.org/abs/2311.16424
  • repo_url: None
  • paper_authors: Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov, Stefano Ermon
  • for: Proposes a training-free conditional generation framework that can be applied broadly across tasks.
  • methods: Builds on pretrained diffusion models and off-the-shelf neural networks with minimal additional inference cost, and uses the manifold hypothesis to refine the guided diffusion steps.
  • results: Experiments show that MPGD efficiently solves a variety of conditional generation applications in low-compute settings, delivers higher-quality samples compared to the baselines, and offers up to 3.8x speed-ups.
    Abstract Despite the recent advancements, conditional image generation still faces challenges of cost, generalizability, and the need for task-specific training. In this paper, we propose Manifold Preserving Guided Diffusion (MPGD), a training-free conditional generation framework that leverages pretrained diffusion models and off-the-shelf neural networks with minimal additional inference cost for a broad range of tasks. Specifically, we leverage the manifold hypothesis to refine the guided diffusion steps and introduce a shortcut algorithm in the process. We then propose two methods for on-manifold training-free guidance using pre-trained autoencoders and demonstrate that our shortcut inherently preserves the manifolds when applied to latent diffusion models. Our experiments show that MPGD is efficient and effective for solving a variety of conditional generation applications in low-compute settings, and can consistently offer up to 3.8x speed-ups with the same number of diffusion steps while maintaining high sample quality compared to the baselines.
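A rough picture of one manifold-preserving guidance step: estimate the clean sample from the current noisy one, nudge it along the guidance gradient, project it back toward the data manifold by round-tripping through a pretrained autoencoder, and re-noise. The sketch below is schematic, under assumed names (`mpgd_style_step`, DDIM-style re-noising with eta = 0) and placeholder modules; it is not the paper's exact algorithm.

```python
# Hedged sketch of a manifold-preserving guided diffusion step.
import torch

def mpgd_style_step(x_t, t, alpha_bar_t, eps_model, autoencoder, guidance_loss,
                    step_size=0.1):
    """x_t: current noisy sample; eps_model predicts noise; autoencoder exposes
    .encode/.decode; guidance_loss maps an estimated clean sample to a scalar."""
    eps = eps_model(x_t, t)
    # Tweedie-style estimate of the clean sample x0 from x_t.
    x0_hat = (x_t - torch.sqrt(1 - alpha_bar_t) * eps) / torch.sqrt(alpha_bar_t)

    # Guidance: descend the loss with respect to the clean estimate.
    x0_hat = x0_hat.detach().requires_grad_(True)
    grad, = torch.autograd.grad(guidance_loss(x0_hat), x0_hat)
    x0_guided = x0_hat - step_size * grad

    # Manifold projection: round-trip through the pretrained autoencoder so the
    # guided estimate stays close to the data manifold.
    with torch.no_grad():
        x0_proj = autoencoder.decode(autoencoder.encode(x0_guided))

    # Re-noise the projected estimate back to level t (DDIM-style, eta = 0).
    return torch.sqrt(alpha_bar_t) * x0_proj + torch.sqrt(1 - alpha_bar_t) * eps.detach()

# Toy usage with identity placeholders.
class _AE:  # trivial stand-in autoencoder
    def encode(self, x): return x
    def decode(self, z): return z

x_next = mpgd_style_step(torch.randn(1, 3, 8, 8), t=torch.tensor(10),
                         alpha_bar_t=torch.tensor(0.5),
                         eps_model=lambda x, t: torch.zeros_like(x),
                         autoencoder=_AE(),
                         guidance_loss=lambda x0: (x0 ** 2).mean())
```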

Multi-defender Security Games with Schedules

  • paper_url: http://arxiv.org/abs/2311.16392
  • repo_url: None
  • paper_authors: Zimeng Song, Chun Kai Ling, Fei Fang
  • For: The paper studies security games featuring multiple defenders and schedules simultaneously, and investigates the impact of schedules on the existence of equilibrium in these games.
  • Methods: The paper uses mathematical modeling and computational algorithms to study the security games, and proves that under certain restrictions non-existence of equilibrium is avoided and an equilibrium can be computed in polynomial time.
  • Results: The paper shows that the introduction of schedules can cause non-existence of equilibrium in security games, but that this can be avoided under certain restrictions. The paper also presents experimental results that demonstrate the scalability of the algorithms for games with multiple heterogeneous defenders.
    Abstract Stackelberg Security Games are often used to model strategic interactions in high-stakes security settings. The majority of existing models focus on single-defender settings where a single entity assumes command of all security assets. However, many realistic scenarios feature multiple heterogeneous defenders with their own interests and priorities embedded in a more complex system. Furthermore, defenders rarely choose targets to protect. Instead, they have a multitude of defensive resources or schedules at their disposal, each with different protective capabilities. In this paper, we study security games featuring multiple defenders and schedules simultaneously. We show that unlike prior work on multi-defender security games, the introduction of schedules can cause non-existence of equilibrium even under rather restricted environments. We prove that under the mild restriction that any subset of a schedule is also a schedule, not only is non-existence of equilibrium avoided, but an equilibrium can be computed in polynomial time in games with two defenders. Under additional assumptions, our algorithm can be extended to games with more than two defenders and its computation can be scaled up in special classes of games with compactly represented schedules such as those used in patrolling applications. Experimental results suggest that our methods scale gracefully with game size, making our algorithms amongst the few that can tackle multiple heterogeneous defenders.
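The key structural restriction, that any subset of a schedule is itself a schedule, can be checked directly on a defender's schedule set. The toy snippet below, with made-up targets and schedules, illustrates that check only; it is not the paper's equilibrium-computation algorithm.

```python
# Check whether a defender's schedule set is "subset-closed": every subset of
# a schedule is itself a schedule. Targets and schedules here are illustrative.
from itertools import chain, combinations

def powerset(s):
    s = list(s)
    return (frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1)))

def is_subset_closed(schedules):
    """schedules: iterable of frozensets of targets one defender can cover."""
    schedules = set(schedules)
    return all(sub in schedules for sched in schedules for sub in powerset(sched))

# Defender 1 can cover {a, b} jointly, each alone, or nothing (subset-closed);
# defender 2 only offers the joint schedule {b, c}, which is not subset-closed.
d1 = {frozenset(), frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})}
d2 = {frozenset(), frozenset({"b", "c"})}
print(is_subset_closed(d1))  # True  -> the paper's existence/computation results apply
print(is_subset_closed(d2))  # False -> an equilibrium may fail to exist
```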

Combating the “Sameness” in AI Art: Reflections on the Interactive AI Installation Fencing Hallucination

  • paper_url: http://arxiv.org/abs/2311.17080
  • repo_url: None
  • paper_authors: Weihao Qiu, George Legrady
  • for: addressing the issue of “sameness” in AI art, specifically in the context of AI image creation tools.
  • methods: reflecting on the design of AI art production to alleviate the sense of uniformity, maintain the uniqueness of images from an AI image synthesizer, and enhance the connection between the artworks and the audience.
  • results: stimulating the creation of distinctive AI art through the Fencing Hallucination project, which provides insights and efforts dedicated to addressing the issue of "sameness".
    Abstract The article summarizes three types of "sameness" issues in Artificial Intelligence (AI) art, each occurring at different stages of development in AI image creation tools. Through the Fencing Hallucination project, the article reflects on the design of AI art production in alleviating the sense of uniformity, maintaining the uniqueness of images from an AI image synthesizer, and enhancing the connection between the artworks and the audience. This paper endeavors to stimulate the creation of distinctive AI art by recounting the efforts and insights derived from the Fencing Hallucination project, all dedicated to addressing the issue of "sameness".