cs.AI - 2023-08-20

Towards Few-shot Coordination: Revisiting Ad-hoc Teamplay Challenge In the Game of Hanabi

  • paper_url: http://arxiv.org/abs/2308.10284
  • repo_url: None
  • paper_authors: Hadi Nekoei, Xutong Zhao, Janarthanan Rajendran, Miao Liu, Sarath Chandar
  • for: This work aims to move beyond zero-shot coordination in cooperative multi-agent reinforcement learning toward fast adaptation to unseen partners.
  • methods: The authors evaluate the adaptability of MARL algorithms by pairing state-of-the-art ZSC algorithms and independent learning algorithms with a diverse pool of pre-trained partners, measured with a new adaptation regret metric.
  • results: State-of-the-art ZSC agents and naive independent Q-learning agents adapt to new partners at comparable speeds, and two categories of hyper-parameters controlling training-data diversity and the optimization process have a significant impact on adaptability.
    Abstract Cooperative Multi-agent Reinforcement Learning (MARL) algorithms with Zero-Shot Coordination (ZSC) have gained significant attention in recent years. ZSC refers to the ability of agents to coordinate zero-shot (without additional interaction experience) with independently trained agents. While ZSC is crucial for cooperative MARL agents, it might not be possible for complex tasks and changing environments. Agents also need to adapt and improve their performance with minimal interaction with other agents. In this work, we show empirically that state-of-the-art ZSC algorithms have poor performance when paired with agents trained with different learning methods, and they require millions of interaction samples to adapt to these new partners. To investigate this issue, we formally defined a framework based on a popular cooperative multi-agent game called Hanabi to evaluate the adaptability of MARL methods. In particular, we created a diverse set of pre-trained agents and defined a new metric called adaptation regret that measures the agent's ability to efficiently adapt and improve its coordination performance when paired with some held-out pool of partners on top of its ZSC performance. After evaluating several SOTA algorithms using our framework, our experiments reveal that naive Independent Q-Learning (IQL) agents in most cases adapt as quickly as the SOTA ZSC algorithm Off-Belief Learning (OBL). This finding raises an interesting research question: How to design MARL algorithms with high ZSC performance and capability of fast adaptation to unseen partners. As a first step, we studied the role of different hyper-parameters and design choices on the adaptability of current MARL algorithms. Our experiments show that two categories of hyper-parameters controlling the training data diversity and optimization process have a significant impact on the adaptability of Hanabi agents.
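
As a rough illustration of the kind of metric the paper describes, the hypothetical sketch below computes an adaptation-regret-style quantity. The abstract does not give the formal definition, so the per-checkpoint gap to a partner-specific oracle score (and the name `oracle_score`) are assumptions made only for illustration.

```python
import numpy as np

def adaptation_regret(scores, oracle_score):
    """Hypothetical adaptation-regret-style metric (an assumption, not the
    paper's exact definition). `scores[t]` is the agent's average score with a
    held-out partner after t adaptation checkpoints; `oracle_score` stands in
    for the best score achievable with that partner. Summing the per-checkpoint
    gap rewards agents that start with strong ZSC performance and also close
    the remaining gap quickly."""
    scores = np.asarray(scores, dtype=float)
    return float(np.sum(oracle_score - scores))

# Example: an IQL-like agent improving from 12 to 22 points over 5 checkpoints.
print(adaptation_regret([12, 16, 19, 21, 22], oracle_score=24.0))  # 30.0
```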

Enhancing Spatiotemporal Traffic Prediction through Urban Human Activity Analysis

  • paper_url: http://arxiv.org/abs/2308.10282
  • repo_url: https://github.com/suminhan/traffic-uagcrntf
  • paper_authors: Sumin Han, Youngjun Park, Minji Lee, Jisun An, Dongman Lee
  • For: Improve the safety and convenience of urban traffic by improving existing traffic prediction models.
  • Methods: Graph convolution deep learning algorithms, using human activity frequency data from the National Household Travel Survey to strengthen the inferred causal relationship between activity and traffic patterns.
  • Results: With only minimal modifications to existing graph convolutional recurrent network and graph convolutional transformer architectures, the approach achieves state-of-the-art performance without excessive computational overhead.
    Abstract Traffic prediction is one of the key elements to ensure the safety and convenience of citizens. Existing traffic prediction models primarily focus on deep learning architectures to capture spatial and temporal correlation. They often overlook the underlying nature of traffic. Specifically, the sensor networks in most traffic datasets do not accurately represent the actual road network exploited by vehicles, failing to provide insights into the traffic patterns in urban activities. To overcome these limitations, we propose an improved traffic prediction method based on graph convolution deep learning algorithms. We leverage human activity frequency data from National Household Travel Survey to enhance the inference capability of a causal relationship between activity and traffic patterns. Despite making minimal modifications to the conventional graph convolutional recurrent networks and graph convolutional transformer architectures, our approach achieves state-of-the-art performance without introducing excessive computational overhead.

The DKU-DUKEECE System for the Manipulation Region Location Task of ADD 2023

  • paper_url: http://arxiv.org/abs/2308.10281
  • repo_url: None
  • paper_authors: Zexin Cai, Weiqing Wang, Yikang Wang, Ming Li
  • for: This work targets Track 2 of the Audio Deepfake Detection Challenge (ADD 2023), i.e., locating manipulated regions in audio.
  • methods: Multiple detection systems are combined, including a boundary detection system and a deepfake detection system, together with a VAE model trained exclusively on genuine data to judge the authenticity of an audio clip.
  • results: Fusing the three systems yields 82.23% sentence accuracy and an F1 score of 60.66%, for a final ADD score of 0.6713, ranking first in Track 2.
    Abstract This paper introduces our system designed for Track 2, which focuses on locating manipulated regions, in the second Audio Deepfake Detection Challenge (ADD 2023). Our approach involves the utilization of multiple detection systems to identify splicing regions and determine their authenticity. Specifically, we train and integrate two frame-level systems: one for boundary detection and the other for deepfake detection. Additionally, we employ a third VAE model trained exclusively on genuine data to determine the authenticity of a given audio clip. Through the fusion of these three systems, our top-performing solution for the ADD challenge achieves an impressive 82.23% sentence accuracy and an F1 score of 60.66%. This results in a final ADD score of 0.6713, securing the first rank in Track 2 of ADD 2023.

Learning Disentangled Representation with Mutual Information Maximization for Real-Time UAV Tracking

  • paper_url: http://arxiv.org/abs/2308.10262
  • repo_url: None
  • paper_authors: Xucheng Wang, Xiangyang Yang, Hengzhou Ye, Shuiwang Li
  • for: Improve both the efficiency and the precision of deep-learning-based UAV tracking by combining model compression with disentangled representations.
  • methods: Disentangled representation learning with mutual information maximization (DR-MIM), which separates the feature into identity-related and identity-unrelated parts to make the representation more effective for subsequent classification and regression.
  • results: On four UAV benchmarks, the DR-MIM tracker significantly outperforms state-of-the-art UAV tracking methods.
    Abstract Efficiency has been a critical problem in UAV tracking due to limitations in computation resources, battery capacity, and unmanned aerial vehicle maximum load. Although discriminative correlation filters (DCF)-based trackers prevail in this field for their favorable efficiency, some recently proposed lightweight deep learning (DL)-based trackers using model compression demonstrated quite remarkable CPU efficiency as well as precision. Unfortunately, the model compression methods utilized by these works, though simple, are still unable to achieve satisfying tracking precision with higher compression rates. This paper aims to exploit disentangled representation learning with mutual information maximization (DR-MIM) to further improve DL-based trackers' precision and efficiency for UAV tracking. The proposed disentangled representation separates the feature into an identity-related and an identity-unrelated features. Only the latter is used, which enhances the effectiveness of the feature representation for subsequent classification and regression tasks. Extensive experiments on four UAV benchmarks, including UAV123@10fps, DTB70, UAVDT and VisDrone2018, show that our DR-MIM tracker significantly outperforms state-of-the-art UAV tracking methods.

Large Transformers are Better EEG Learners

  • paper_url: http://arxiv.org/abs/2308.11654
  • repo_url: None
  • paper_authors: Bingxin Wang, Xiaowen Fu, Yuan Lan, Luchan Zhang, Yang Xiang
  • for: This paper studies how pre-trained large transformer models can be used for electroencephalogram (EEG) prediction tasks.
  • methods: AdaCE, plug-and-play adapters that convert EEG data into image and text forms, is used to fine-tune pre-trained vision and language transformers directly on EEG prediction tasks.
  • results: The AdaCE modules fine-tune pre-trained transformers effectively and achieve state-of-the-art performance on diverse EEG prediction tasks; for example, AdaCE on the pre-trained Swin-Transformer reaches 99.6% on the UCI HAR human activity recognition task, an absolute improvement of 9.2%. The paper also shows that applying AdaCE to larger pre-trained models yields better performance on EEG prediction tasks, indicating the adapters' potential for even larger transformers.
    Abstract Pre-trained large transformer models have achieved remarkable performance in the fields of natural language processing and computer vision. Since the magnitude of available labeled electroencephalogram (EEG) data is much lower than that of text and image data, it is difficult for transformer models pre-trained from EEG to be developed as large as GPT-4 100T to fully unleash the potential of this architecture. In this paper, we show that transformers pre-trained from images as well as text can be directly fine-tuned for EEG-based prediction tasks. We design AdaCE, plug-and-play Adapters for Converting EEG data into image as well as text forms, to fine-tune pre-trained vision and language transformers. The proposed AdaCE module is highly effective for fine-tuning pre-trained transformers while achieving state-of-the-art performance on diverse EEG-based prediction tasks. For example, AdaCE on the pre-trained Swin-Transformer achieves 99.6%, an absolute improvement of 9.2%, on the EEG-decoding task of human activity recognition (UCI HAR). Furthermore, we empirically show that applying the proposed AdaCE to fine-tune larger pre-trained models can achieve better performance on EEG-based predicting tasks, indicating the potential of our adapters for even larger transformers. The plug-and-play AdaCE module can be applied to fine-tuning most of the popular pre-trained transformers on many other time-series data with multiple channels, not limited to EEG data and the models we use. Our code will be available at https://github.com/wangbxj1234/AdaCE.
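
To make the "convert EEG into image form" idea concrete, here is a minimal sketch of one plausible conversion, assuming per-channel min-max normalization and bilinear resizing to the ViT input resolution; it is not the actual AdaCE module, whose internal design the abstract does not detail.

```python
import numpy as np
import torch

def eeg_to_image(eeg, out_size=224):
    """Sketch (assumption, not the AdaCE implementation) of turning a
    multi-channel EEG window of shape (channels, time_steps) into a
    3 x 224 x 224 tensor a pre-trained vision transformer can consume."""
    eeg = torch.as_tensor(eeg, dtype=torch.float32)
    lo = eeg.min(dim=1, keepdim=True).values
    hi = eeg.max(dim=1, keepdim=True).values
    eeg = (eeg - lo) / (hi - lo + 1e-8)                       # scale to [0, 1]
    img = torch.nn.functional.interpolate(
        eeg[None, None], size=(out_size, out_size),
        mode="bilinear", align_corners=False)                 # (1, 1, 224, 224)
    return img.repeat(1, 3, 1, 1)                             # replicate to RGB

x = eeg_to_image(np.random.randn(9, 128))   # e.g. a 9-channel sensor window
print(x.shape)  # torch.Size([1, 3, 224, 224])
```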

LMTuner: An user-friendly and highly-integrable Training Framework for fine-tuning Large Language Models

  • paper_url: http://arxiv.org/abs/2308.10252
  • repo_url: https://github.com/wengsyx/lmtuner
  • paper_authors: Yixuan Weng, Zhiqi Wang, Huanxuan Liao, Shizhu He, Shengping Liu, Kang Liu, Jun Zhao
  • for: This work aims to make incremental training of large language models (LLMs) efficient for the needs of specific domains and industries.
  • methods: LMTuner, a highly usable, integrable, and scalable system for training LLMs quickly; it consists of an Interaction Module, a Training Module, and an Inference Module.
  • results: LMTuner lets even a novice user start training an LLM within five minutes. It also integrates DeepSpeed and supports efficient fine-tuning methods such as LoRA and QLoRA, enabling training of language models from 300M up to 130B parameters on a single server.
    Abstract With the burgeoning development in the realm of large language models (LLMs), the demand for efficient incremental training tailored to specific industries and domains continues to increase. Currently, the predominantly employed frameworks lack modular design, it often takes a lot of coding work to kickstart the training of LLM. To address this, we present "LMTuner", a highly usable, integrable, and scalable system for training LLMs expeditiously and with minimal user-input. LMTuner comprises three main modules - the Interaction, Training, and Inference Modules. We advocate that LMTuner's usability and integrality alleviate the complexities in training large language models. Remarkably, even a novice user could commence training large language models within five minutes. Furthermore, it integrates DeepSpeed frameworks and supports Efficient Fine-Tuning methodologies like Low Rank Adaptation (LoRA), Quantized LoRA (QLoRA), etc., enabling the training of language models scaling from 300M to a whopping 130B parameters using a single server. The LMTuner's homepage (https://wengsyx.github.io/LMTuner/)and screencast video (https://youtu.be/nsXmWOmN3rE) are now publicly available.
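
For readers unfamiliar with the efficient fine-tuning methods LMTuner wraps, the sketch below shows plain LoRA via the Hugging Face `peft` library; the base model name and hyper-parameters are placeholders, and this is not LMTuner's own interface.

```python
# Illustration of LoRA-style parameter-efficient fine-tuning using the
# Hugging Face `peft` API directly (placeholder model and settings).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["c_attn"],        # attention projections in GPT-2
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()    # only the low-rank adapters are trainable
```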

Machine Learning-powered Combinatorial Clock Auction

  • paper_url: http://arxiv.org/abs/2308.10226
  • repo_url: https://github.com/marketdesignresearch/ml-cca
  • paper_authors: Ermis Soumalias, Jakob Weissteiner, Jakob Heiss, Sven Seuken
  • for: Improve the design of practical iterative combinatorial auctions (ICAs), where the bundle space grows exponentially in the number of items.
  • methods: Machine learning (ML) is used for preference elicitation, and an ML-powered combinatorial clock auction driven purely by demand queries is proposed.
  • results: Experiments in several spectrum auction domains, compared against the most established real-world ICA (the combinatorial clock auction), show significantly higher efficiency and vastly higher clearing potential.
    Abstract We study the design of iterative combinatorial auctions (ICAs). The main challenge in this domain is that the bundle space grows exponentially in the number of items. To address this, several papers have recently proposed machine learning (ML)-based preference elicitation algorithms that aim to elicit only the most important information from bidders. However, from a practical point of view, the main shortcoming of this prior work is that those designs elicit bidders' preferences via value queries (i.e., ``What is your value for the bundle $\{A,B\}$?''). In most real-world ICA domains, value queries are considered impractical, since they impose an unrealistically high cognitive burden on bidders, which is why they are not used in practice. In this paper, we address this shortcoming by designing an ML-powered combinatorial clock auction that elicits information from the bidders only via demand queries (i.e., ``At prices $p$, what is your most preferred bundle of items?''). We make two key technical contributions: First, we present a novel method for training an ML model on demand queries. Second, based on those trained ML models, we introduce an efficient method for determining the demand query with the highest clearing potential, for which we also provide a theoretical foundation. We experimentally evaluate our ML-based demand query mechanism in several spectrum auction domains and compare it against the most established real-world ICA: the combinatorial clock auction (CCA). Our mechanism significantly outperforms the CCA in terms of efficiency in all domains, it achieves higher efficiency in a significantly reduced number of rounds, and, using linear prices, it exhibits vastly higher clearing potential. Thus, with this paper we bridge the gap between research and practice and propose the first practical ML-powered ICA.
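
The demand query the mechanism relies on is easy to state in code. The brute-force sketch below (the toy value function and prices are made up) shows what a truthful bidder's response looks like; in real ICAs the bundle space is exponential, which is exactly why the paper trains ML models on such responses instead of enumerating bundles.

```python
from itertools import combinations

def demand_query(items, prices, value):
    """'At prices p, what is your most preferred bundle of items?'
    A truthful bidder answers with the bundle maximizing value(bundle) minus
    the sum of its item prices. Brute force here is purely for illustration."""
    best, best_u = frozenset(), 0.0
    for k in range(1, len(items) + 1):
        for bundle in combinations(items, k):
            u = value(frozenset(bundle)) - sum(prices[i] for i in bundle)
            if u > best_u:
                best, best_u = frozenset(bundle), u
    return best

# Toy complementarity: A and B together are worth more than separately.
toy_value = lambda b: {frozenset("A"): 3, frozenset("B"): 3,
                       frozenset("AB"): 10}.get(b, 0)
print(demand_query("AB", {"A": 2.0, "B": 2.5}, toy_value))  # frozenset({'A', 'B'})
```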

ChatEDA: A Large Language Model Powered Autonomous Agent for EDA

  • paper_url: http://arxiv.org/abs/2308.10204
  • repo_url: None
  • paper_authors: Zhuolun He, Haoyuan Wu, Xinyun Zhang, Xufeng Yao, Su Zheng, Haisheng Zheng, Bei Yu
  • for: Automate the circuit design workflow and improve its efficiency by using large language models to orchestrate collaboration among electronic design automation (EDA) tools.
  • methods: A large language model, AutoMage, is coupled with EDA tools serving as executors to perform task planning, script generation, and task execution.
  • results: Comprehensive experimental evaluations show that ChatEDA handles diverse requirements proficiently, and the fine-tuned AutoMage model outperforms GPT-4 and other similar LLMs.
    Abstract The integration of a complex set of Electronic Design Automation (EDA) tools to enhance interoperability is a critical concern for circuit designers. Recent advancements in large language models (LLMs) have showcased their exceptional capabilities in natural language processing and comprehension, offering a novel approach to interfacing with EDA tools. This research paper introduces ChatEDA, an autonomous agent for EDA empowered by a large language model, AutoMage, complemented by EDA tools serving as executors. ChatEDA streamlines the design flow from the Register-Transfer Level (RTL) to the Graphic Data System Version II (GDSII) by effectively managing task planning, script generation, and task execution. Through comprehensive experimental evaluations, ChatEDA has demonstrated its proficiency in handling diverse requirements, and our fine-tuned AutoMage model has exhibited superior performance compared to GPT-4 and other similar LLMs.

Soft Decomposed Policy-Critic: Bridging the Gap for Effective Continuous Control with Discrete RL

  • paper_url: http://arxiv.org/abs/2308.10203
  • repo_url: None
  • paper_authors: Yechen Zhang, Jian Sun, Gang Wang, Zhuo Li, Wei Chen
  • For: The paper aims to address the challenges of applying discrete reinforcement learning (RL) algorithms to continuous control problems and to develop a novel architecture that can effectively overcome these challenges.
  • Methods: The paper proposes the Soft Decomposed Policy-Critic (SDPC) architecture, which combines soft RL and actor-critic techniques with discrete RL methods. SDPC discretizes each action dimension independently and employs a shared critic network to maximize the soft $Q$-function. Two types of policies are introduced: decomposed actors, which lead to the Soft Decomposed Actor-Critic (SDAC) algorithm, and decomposed $Q$-networks, which generate Boltzmann soft exploration policies and result in the Soft Decomposed-Critic Q (SDCQ) algorithm.
  • Results: Extensive experiments demonstrate that the proposed SDPC architecture outperforms state-of-the-art continuous RL algorithms on a variety of continuous control tasks, including Mujoco's Humanoid and Box2d's BipedalWalker.
    Abstract Discrete reinforcement learning (RL) algorithms have demonstrated exceptional performance in solving sequential decision tasks with discrete action spaces, such as Atari games. However, their effectiveness is hindered when applied to continuous control problems due to the challenge of dimensional explosion. In this paper, we present the Soft Decomposed Policy-Critic (SDPC) architecture, which combines soft RL and actor-critic techniques with discrete RL methods to overcome this limitation. SDPC discretizes each action dimension independently and employs a shared critic network to maximize the soft $Q$-function. This novel approach enables SDPC to support two types of policies: decomposed actors that lead to the Soft Decomposed Actor-Critic (SDAC) algorithm, and decomposed $Q$-networks that generate Boltzmann soft exploration policies, resulting in the Soft Decomposed-Critic Q (SDCQ) algorithm. Through extensive experiments, we demonstrate that our proposed approach outperforms state-of-the-art continuous RL algorithms in a variety of continuous control tasks, including Mujoco's Humanoid and Box2d's BipedalWalker. These empirical results validate the effectiveness of the SDPC architecture in addressing the challenges associated with continuous control.
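
A minimal sketch of the per-dimension discretization idea (not the authors' code): each continuous action dimension gets its own small set of bins, so a D-dimensional action needs D discrete heads of size `bins` rather than one head over `bins**D` joint actions.

```python
import numpy as np

def make_bins(low, high, bins):
    """Candidate values per action dimension, shape (D, bins)."""
    low, high = np.asarray(low, float), np.asarray(high, float)
    return np.linspace(low, high, bins, axis=1)

def indices_to_action(bin_values, idx):
    """Map one discrete index per dimension back to a continuous action."""
    return bin_values[np.arange(len(idx)), idx]

# A 3-D action space in [-1, 1]^3 discretized into 11 values per dimension:
# the policy outputs 3 * 11 logits instead of 11**3 = 1331 joint logits.
bin_values = make_bins(low=[-1, -1, -1], high=[1, 1, 1], bins=11)
print(indices_to_action(bin_values, idx=np.array([0, 5, 10])))  # [-1.  0.  1.]
```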

Deep Reinforcement Learning for Artificial Upwelling Energy Management

  • paper_url: http://arxiv.org/abs/2308.10199
  • repo_url: None
  • paper_authors: Yiyuan Zhang, Wei Fan
  • for: This study aims to improve the efficiency of artificial upwelling (AU) systems so as to better stimulate seaweed growth and enhance ocean carbon sequestration.
  • methods: A deep reinforcement learning (DRL) algorithm is used to develop efficient air injection scheduling strategies for operating the AU system.
  • results: In simulations, the DRL approach reduces energy wastage more effectively than rule-based approaches and other DRL algorithms while ensuring stable and efficient operation of the AU system.
    Abstract The potential of artificial upwelling (AU) as a means of lifting nutrient-rich bottom water to the surface, stimulating seaweed growth, and consequently enhancing ocean carbon sequestration, has been gaining increasing attention in recent years. This has led to the development of the first solar-powered and air-lifted AU system (AUS) in China. However, efficient scheduling of air injection systems remains a crucial challenge in operating AUS, as it holds the potential to significantly improve system efficiency. Conventional approaches based on rules or models are often impractical due to the complex and heterogeneous nature of the marine environment and its associated disturbances. To address this challenge, we propose a novel energy management approach that utilizes deep reinforcement learning (DRL) algorithm to develop efficient strategies for operating AUS. Through extensive simulations, we evaluate the performance of our algorithm and demonstrate its superior effectiveness over traditional rule-based approaches and other DRL algorithms in reducing energy wastage while ensuring the stable and efficient operation of AUS. Our findings suggest that a DRL-based approach offers a promising way for improving the efficiency of AUS and enhancing the sustainability of seaweed cultivation and carbon sequestration in the ocean.

Efficient Real-time Path Planning with Self-evolving Particle Swarm Optimization in Dynamic Scenarios

  • paper_url: http://arxiv.org/abs/2308.10169
  • repo_url: https://github.com/xinjinghao/real-time-path-planning-with-sepso
  • paper_authors: Jinghao Xin, Zhi Li, Yang Zhang, Ning Li
  • for: This work aims to improve the computational efficiency of Particle Swarm Optimization (PSO) and avoid premature convergence so that it can be applied to path planning in dynamic scenarios.
  • methods: A Self-Evolving Particle Swarm Optimization (SEPSO) algorithm built on a Tensor Operation Form (TOF) that converts particle-wise manipulations to tensor operations, with a Hierarchical Self-Evolving Framework that autonomously tunes its own hyper-parameters, plus Priori Initialization and Auto Truncation mechanisms for real-time performance.
  • results: SEPSO is validated on four widely used benchmark optimization functions and, in a dynamic path planning simulation, generates superior paths with much better real-time performance (67 path planning computations per second on a regular desktop computer).
    Abstract Particle Swarm Optimization (PSO) has demonstrated efficacy in addressing static path planning problems. Nevertheless, such application on dynamic scenarios has been severely precluded by PSO's low computational efficiency and premature convergence downsides. To address these limitations, we proposed a Tensor Operation Form (TOF) that converts particle-wise manipulations to tensor operations, thereby enhancing computational efficiency. Harnessing the computational advantage of TOF, a variant of PSO, designated as Self-Evolving Particle Swarm Optimization (SEPSO) was developed. The SEPSO is underpinned by a novel Hierarchical Self-Evolving Framework (HSEF) that enables autonomous optimization of its own hyper-parameters to evade premature convergence. Additionally, a Priori Initialization (PI) mechanism and an Auto Truncation (AT) mechanism that substantially elevates the real-time performance of SEPSO on dynamic path planning problems were introduced. Comprehensive experiments on four widely used benchmark optimization functions have been initially conducted to corroborate the validity of SEPSO. Following this, a dynamic simulation environment that encompasses moving start/target points and dynamic/static obstacles was employed to assess the effectiveness of SEPSO on the dynamic path planning problem. Simulation results exhibit that the proposed SEPSO is capable of generating superior paths with considerably better real-time performance (67 path planning computations per second in a regular desktop computer) in contrast to alternative methods. The code of this paper can be accessed here.
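
The Tensor Operation Form amounts to expressing the swarm update as whole-array operations. The sketch below is a generic vectorized PSO step in NumPy (conventional coefficients, not the paper's self-evolving hyper-parameters), shown only to illustrate why this form is friendly to vectorized or GPU execution.

```python
import numpy as np

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=np.random):
    """One tensorized PSO update: all N particles in D dimensions are updated
    with a few (N, D) array operations instead of a per-particle loop."""
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v

# Toy usage on the sphere function f(x) = ||x||^2.
N, D = 64, 2
x = np.random.uniform(-5, 5, (N, D)); v = np.zeros((N, D))
pbest, pbest_f = x.copy(), np.sum(x**2, axis=1)
for _ in range(100):
    gbest = pbest[np.argmin(pbest_f)]
    x, v = pso_step(x, v, pbest, gbest)
    f = np.sum(x**2, axis=1)
    better = f < pbest_f
    pbest[better], pbest_f[better] = x[better], f[better]
print(pbest_f.min())  # approaches 0
```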

Rethinking Client Drift in Federated Learning: A Logit Perspective

  • paper_url: http://arxiv.org/abs/2308.10162
  • repo_url: None
  • paper_authors: Yunlu Yan, Chun-Mei Feng, Mang Ye, Wangmeng Zuo, Ping Li, Rick Siow Mong Goh, Lei Zhu, C. L. Philip Chen
  • for: This paper aims to alleviate the client drift problem in federated learning (FL) and thereby improve FL performance.
  • methods: A new algorithm, FedCSD, performs class prototype similarity distillation within a federated framework to align the local and global models.
  • results: Experiments show that FedCSD outperforms state-of-the-art federated learning methods in various heterogeneous settings and improves the quality of the global model.
    Abstract Federated Learning (FL) enables multiple clients to collaboratively learn in a distributed way, allowing for privacy protection. However, the real-world non-IID data will lead to client drift which degrades the performance of FL. Interestingly, we find that the difference in logits between the local and global models increases as the model is continuously updated, thus seriously deteriorating FL performance. This is mainly due to catastrophic forgetting caused by data heterogeneity between clients. To alleviate this problem, we propose a new algorithm, named FedCSD, a Class prototype Similarity Distillation in a federated framework to align the local and global models. FedCSD does not simply transfer global knowledge to local clients, as an undertrained global model cannot provide reliable knowledge, i.e., class similarity information, and its wrong soft labels will mislead the optimization of local models. Concretely, FedCSD introduces a class prototype similarity distillation to align the local logits with the refined global logits that are weighted by the similarity between local logits and the global prototype. To enhance the quality of global logits, FedCSD adopts an adaptive mask to filter out the terrible soft labels of the global models, thereby preventing them to mislead local optimization. Extensive experiments demonstrate the superiority of our method over the state-of-the-art federated learning approaches in various heterogeneous settings. The source code will be released.
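
A hedged sketch of what a class-prototype similarity distillation term could look like (the paper's exact weighting, temperature, and adaptive mask may differ): each sample's KL distillation toward the global model is scaled by how well its local feature agrees with the global prototype of its class, so unreliable global guidance contributes less.

```python
import torch
import torch.nn.functional as F

def csd_loss(local_logits, global_logits, local_feat, global_protos, labels, tau=2.0):
    """Illustrative prototype-similarity-weighted distillation (assumption)."""
    protos = global_protos[labels]                                     # (B, d)
    sim = F.cosine_similarity(local_feat, protos, dim=1).clamp(min=0)  # (B,)
    kl = F.kl_div(F.log_softmax(local_logits / tau, dim=1),
                  F.softmax(global_logits / tau, dim=1),
                  reduction="none").sum(dim=1)                         # (B,)
    return (sim * kl).mean() * tau * tau

# Shapes only: batch of 8, 10 classes, 32-d features.
B, C, d = 8, 10, 32
loss = csd_loss(torch.randn(B, C), torch.randn(B, C), torch.randn(B, d),
                torch.randn(C, d), torch.randint(0, C, (B,)))
print(loss.item())
```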

SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation

  • paper_url: http://arxiv.org/abs/2308.10156
  • repo_url: None
  • paper_authors: Chengyou Jia, Minnan Luo, Zhuohang Dang, Guang Dai, Xiaojun Chang, Mengmeng Wang, Jingdong Wang
  • for: The paper targets finer-grained controllability for Text-to-Image (T2I) generation through Layout-to-Image (L2I) generation, exploiting the spatial and semantic information in user-specified layouts to improve the quality and controllability of generated images.
  • methods: A novel Spatial-Semantic Map Guided (SSMG) diffusion model uses a feature map derived from the layout to guide the generation process, together with Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms for modeling object relationships and spatial information.
  • results: Extensive experiments show that SSMG achieves highly promising results, surpassing previous models across multiple metrics covering fidelity, diversity, and controllability.
    Abstract Despite significant progress in Text-to-Image (T2I) generative models, even lengthy and complex text descriptions still struggle to convey detailed controls. In contrast, Layout-to-Image (L2I) generation, aiming to generate realistic and complex scene images from user-specified layouts, has risen to prominence. However, existing methods transform layout information into tokens or RGB images for conditional control in the generative process, leading to insufficient spatial and semantic controllability of individual instances. To address these limitations, we propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance. Owing to rich spatial and semantic information encapsulated in well-designed feature maps, SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works. Additionally, we propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms. The former aims to model the relationships among multiple objects within scenes while the latter is designed to heighten the model's sensitivity to the spatial information embedded in the guidance. Extensive experiments demonstrate that SSMG achieves highly promising results, setting a new state-of-the-art across a range of metrics encompassing fidelity, diversity, and controllability.

Federated Pseudo Modality Generation for Incomplete Multi-Modal MRI Reconstruction

  • paper_url: http://arxiv.org/abs/2308.10910
  • repo_url: None
  • paper_authors: Yunlu Yan, Chun-Mei Feng, Yuexiang Li, Rick Siow Mong Goh, Lei Zhu
  • for: addresses the missing modality challenge in federated multi-modal MRI reconstruction.
  • methods: utilizes a pseudo modality generation mechanism to recover the missing modality for each single-modal client by sharing the distribution information of the amplitude spectrum in frequency space, and introduces a clustering scheme to reduce communication costs.
  • results: can effectively complete the missing modality within an acceptable communication cost, and attains similar performance with the ideal scenario.
    Abstract While multi-modal learning has been widely used for MRI reconstruction, it relies on paired multi-modal data which is difficult to acquire in real clinical scenarios. Especially in the federated setting, the common situation is that several medical institutions only have single-modal data, termed the modality missing issue. Therefore, it is infeasible to deploy a standard federated learning framework in such conditions. In this paper, we propose a novel communication-efficient federated learning framework, namely Fed-PMG, to address the missing modality challenge in federated multi-modal MRI reconstruction. Specifically, we utilize a pseudo modality generation mechanism to recover the missing modality for each single-modal client by sharing the distribution information of the amplitude spectrum in frequency space. However, the step of sharing the original amplitude spectrum leads to heavy communication costs. To reduce the communication cost, we introduce a clustering scheme to project the set of amplitude spectrum into finite cluster centroids, and share them among the clients. With such an elaborate design, our approach can effectively complete the missing modality within an acceptable communication cost. Extensive experiments demonstrate that our proposed method can attain similar performance with the ideal scenario, i.e., all clients have the full set of modalities. The source code will be released.
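
The sketch below illustrates, under stated assumptions, the two communication-saving ideas from the abstract: compute the amplitude spectrum of each client's single-modal images in frequency space, then share only a handful of k-means centroids instead of the full set of spectra. The slice size, `k`, and the use of scikit-learn's KMeans are placeholders, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def amplitude_spectra(images):
    """images: (N, H, W) single-modal MR slices -> (N, H*W) amplitude spectra."""
    amp = np.abs(np.fft.fft2(images, axes=(-2, -1)))
    return amp.reshape(len(images), -1)

def centroids_to_share(images, k=8):
    """Cluster the amplitude spectra and send only k centroids to the server."""
    return KMeans(n_clusters=k, n_init=10).fit(amplitude_spectra(images)).cluster_centers_

shared = centroids_to_share(np.random.rand(100, 32, 32), k=8)
print(shared.shape)  # (8, 1024): far smaller than 100 full spectra
```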

A Survey on Fairness in Large Language Models

  • paper_url: http://arxiv.org/abs/2308.10149
  • repo_url: None
  • paper_authors: Yingji Li, Mengnan Du, Rui Song, Xin Wang, Ying Wang
  • for: This article provides a comprehensive review of research on fairness in LLMs, covering evaluation metrics and debiasing methods for medium-scale LLMs as well as fairness research on large-scale LLMs.
  • methods: Evaluation metrics and debiasing methods for bias in LLMs are reviewed from the perspectives of intrinsic bias and extrinsic bias.
  • results: The survey summarizes fairness research on large-scale LLMs, including fairness evaluation, causes of bias, and debiasing methods, and discusses challenges and future directions for developing fairness in LLMs.
    Abstract Large language models (LLMs) have shown powerful performance and development prospect and are widely deployed in the real world. However, LLMs can capture social biases from unprocessed training data and propagate the biases to downstream tasks. Unfair LLM systems have undesirable social impacts and potential harms. In this paper, we provide a comprehensive review of related research on fairness in LLMs. First, for medium-scale LLMs, we introduce evaluation metrics and debiasing methods from the perspectives of intrinsic bias and extrinsic bias, respectively. Then, for large-scale LLMs, we introduce recent fairness research, including fairness evaluation, reasons for bias, and debiasing methods. Finally, we discuss and provide insight on the challenges and future directions for the development of fairness in LLMs.

ExpeL: LLM Agents Are Experiential Learners

  • paper_url: http://arxiv.org/abs/2308.10144
  • repo_url: https://github.com/Andrewzh112/ExpeL
  • paper_authors: Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, Gao Huang
  • for: This paper aims to build an LLM-based agent that learns autonomously from experience to make better decisions, without requiring parametric updates.
  • methods: The ExpeL agent autonomously gathers experiences from a collection of training tasks and extracts knowledge from them in natural language, recalling the extracted insights and past experiences at inference time.
  • results: experiments show that the proposed ExpeL agent exhibits robust learning efficacy and consistently enhances its performance as it accumulates experiences. Additionally, the paper explores the emerging capabilities and transfer learning potential of the ExpeL agent through qualitative observations and additional experiments.
    Abstract The recent surge in research interest in applying large language models (LLMs) to decision-making tasks has flourished by leveraging the extensive world knowledge embedded in LLMs. While there is a growing demand to tailor LLMs for custom decision-making tasks, finetuning them for specific tasks is resource-intensive and may diminish the model's generalization capabilities. Moreover, state-of-the-art language models like GPT-4 and Claude are primarily accessible through API calls, with their parametric weights remaining proprietary and unavailable to the public. This scenario emphasizes the growing need for new methodologies that allow learning from agent experiences without requiring parametric updates. To address these problems, we introduce the Experiential Learning (ExpeL) agent. Our agent autonomously gathers experiences and extracts knowledge using natural language from a collection of training tasks. At inference, the agent recalls its extracted insights and past experiences to make informed decisions. Our empirical results highlight the robust learning efficacy of the ExpeL agent, indicating a consistent enhancement in its performance as it accumulates experiences. We further explore the emerging capabilities and transfer learning potential of the ExpeL agent through qualitative observations and additional experiments.

A Review on Objective-Driven Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2308.10135
  • repo_url: None
  • paper_authors: Apoorv Singh
  • For: The paper aims to address the limitations of current AI technologies in understanding context, nuances, and subtle cues in communication, and to close the gap between human and machine intelligence.
  • Methods: The paper reviews prospective Machine Intelligence candidates, including hierarchical planning-based approaches, energy-based and latent-variable methods, and joint embedding predictive architecture methods.
  • Results: The paper discusses how these methods can help machines better understand context, make logical inferences, and predict outcomes in various situations, ultimately bridging the gap between human and machine intelligence.
    Abstract While advancing rapidly, Artificial Intelligence still falls short of human intelligence in several key aspects due to inherent limitations in current AI technologies and our understanding of cognition. Humans have an innate ability to understand context, nuances, and subtle cues in communication, which allows us to comprehend jokes, sarcasm, and metaphors. Machines struggle to interpret such contextual information accurately. Humans possess a vast repository of common-sense knowledge that helps us make logical inferences and predictions about the world. Machines lack this innate understanding and often struggle with making sense of situations that humans find trivial. In this article, we review the prospective Machine Intelligence candidates, a review from Prof. Yann LeCun, and other work that can help close this gap between human and machine intelligence. Specifically, we talk about what's lacking with the current AI techniques such as supervised learning, reinforcement learning, self-supervised learning, etc. Then we show how Hierarchical planning-based approaches can help us close that gap and deep-dive into energy-based, latent-variable methods and Joint embedding predictive architecture methods.

TransFace: Calibrating Transformer Training for Face Recognition from a Data-Centric Perspective

  • paper_url: http://arxiv.org/abs/2308.10133
  • repo_url: https://github.com/danjun6737/transface
  • paper_authors: Jun Dan, Yang Liu, Haoyu Xie, Jiankang Deng, Haoran Xie, Xuansong Xie, Baigui Sun
  • for: This paper targets face recognition (FR), aiming to improve the performance of ViT-based FR models.
  • methods: The proposed TransFace model uses a patch-level data augmentation strategy named DPAP and a hard sample mining strategy named EHSM. DPAP randomly perturbs the amplitude information of dominant patches to expand sample diversity and alleviate overfitting in ViTs, while EHSM uses the information entropy of local tokens to dynamically adjust the importance weights of easy and hard samples during training, leading to more stable predictions.
  • results: Experiments show that TransFace outperforms previous FR models on multiple benchmarks.
    Abstract Vision Transformers (ViTs) have demonstrated powerful representation ability in various visual tasks thanks to their intrinsic data-hungry nature. However, we unexpectedly find that ViTs perform vulnerably when applied to face recognition (FR) scenarios with extremely large datasets. We investigate the reasons for this phenomenon and discover that the existing data augmentation approach and hard sample mining strategy are incompatible with ViTs-based FR backbone due to the lack of tailored consideration on preserving face structural information and leveraging each local token information. To remedy these problems, this paper proposes a superior FR model called TransFace, which employs a patch-level data augmentation strategy named DPAP and a hard sample mining strategy named EHSM. Specially, DPAP randomly perturbs the amplitude information of dominant patches to expand sample diversity, which effectively alleviates the overfitting problem in ViTs. EHSM utilizes the information entropy in the local tokens to dynamically adjust the importance weight of easy and hard samples during training, leading to a more stable prediction. Experiments on several benchmarks demonstrate the superiority of our TransFace. Code and models are available at https://github.com/DanJun6737/TransFace.
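
A hedged sketch of a DPAP-style perturbation on a single patch: jitter the amplitude of its 2-D Fourier spectrum while keeping the phase (which carries most of the structural information) intact. The noise scale and the selection of "dominant" patches are assumptions for illustration; the paper's exact scheme may differ.

```python
import numpy as np

def dpap_like_augment(patch, sigma=0.1, rng=np.random):
    """Perturb the amplitude spectrum of one patch, preserving its phase."""
    spec = np.fft.fft2(patch)
    amp, phase = np.abs(spec), np.angle(spec)
    amp = amp * (1.0 + sigma * rng.standard_normal(amp.shape))  # jitter amplitude
    return np.real(np.fft.ifft2(amp * np.exp(1j * phase)))

aug = dpap_like_augment(np.random.rand(16, 16))
print(aug.shape)  # (16, 16)
```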

3D-Aware Neural Body Fitting for Occlusion Robust 3D Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2308.10123
  • repo_url: https://github.com/edz-o/3dnbf
  • paper_authors: Yi Zhang, Pengliang Ji, Angtian Wang, Jieru Mei, Adam Kortylewski, Alan Yuille
  • for: 3D human pose estimation with occlusion robustness
  • methods: 3D-aware Neural Body Fitting (3DNBF) with generative model of deep features and contrastive learning
  • results: outperforms other approaches on both occluded and standard benchmarks
    Abstract Regression-based methods for 3D human pose estimation directly predict the 3D pose parameters from a 2D image using deep networks. While achieving state-of-the-art performance on standard benchmarks, their performance degrades under occlusion. In contrast, optimization-based methods fit a parametric body model to 2D features in an iterative manner. The localized reconstruction loss can potentially make them robust to occlusion, but they suffer from the 2D-3D ambiguity. Motivated by the recent success of generative models in rigid object pose estimation, we propose 3D-aware Neural Body Fitting (3DNBF) - an approximate analysis-by-synthesis approach to 3D human pose estimation with SOTA performance and occlusion robustness. In particular, we propose a generative model of deep features based on a volumetric human representation with Gaussian ellipsoidal kernels emitting 3D pose-dependent feature vectors. The neural features are trained with contrastive learning to become 3D-aware and hence to overcome the 2D-3D ambiguity. Experiments show that 3DNBF outperforms other approaches on both occluded and standard benchmarks. Code is available at https://github.com/edz-o/3DNBF

Robust Mixture-of-Expert Training for Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2308.10110
  • repo_url: https://github.com/optml-group/robust-moe-cnn
  • paper_authors: Yihua Zhang, Ruisi Cai, Tianlong Chen, Guanhua Zhang, Huan Zhang, Pin-Yu Chen, Shiyu Chang, Zhangyang Wang, Sijia Liu
  • for: This paper investigates how to make Mixture-of-Expert (MoE) based CNN models adversarially robust.
  • methods: The paper analyzes adversarial training (AT) of MoE models along two dimensions, the robustness of the routers and the robustness of the experts, and proposes a new router-expert alternating adversarial training framework (AdvMoE) to improve the adversarial robustness of MoE models.
  • results: Experiments show that AdvMoE improves adversarial robustness by 1% ~ 4% over the original dense CNN across 4 commonly-used CNN architectures and 4 benchmark datasets, while retaining the efficiency merit of sparsity-gated MoE with more than 50% reduction in inference cost.
    Abstract Sparsely-gated Mixture of Expert (MoE), an emerging deep model architecture, has demonstrated a great promise to enable high-accuracy and ultra-efficient model inference. Despite the growing popularity of MoE, little work investigated its potential to advance convolutional neural networks (CNNs), especially in the plane of adversarial robustness. Since the lack of robustness has become one of the main hurdles for CNNs, in this paper we ask: How to adversarially robustify a CNN-based MoE model? Can we robustly train it like an ordinary CNN model? Our pilot study shows that the conventional adversarial training (AT) mechanism (developed for vanilla CNNs) no longer remains effective to robustify an MoE-CNN. To better understand this phenomenon, we dissect the robustness of an MoE-CNN into two dimensions: Robustness of routers (i.e., gating functions to select data-specific experts) and robustness of experts (i.e., the router-guided pathways defined by the subnetworks of the backbone CNN). Our analyses show that routers and experts are hard to adapt to each other in the vanilla AT. Thus, we propose a new router-expert alternating Adversarial training framework for MoE, termed AdvMoE. The effectiveness of our proposal is justified across 4 commonly-used CNN model architectures over 4 benchmark datasets. We find that AdvMoE achieves 1% ~ 4% adversarial robustness improvement over the original dense CNN, and enjoys the efficiency merit of sparsity-gated MoE, leading to more than 50% inference cost reduction. Codes are available at https://github.com/OPTML-Group/Robust-MoE-CNN.

ASPIRE: Language-Guided Augmentation for Robust Image Classification

  • paper_url: http://arxiv.org/abs/2308.10103
  • repo_url: None
  • paper_authors: Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Utkarsh Tyagi, Sakshi Singh, Sanjoy Chowdhury, Dinesh Manocha
  • for: This work aims to make neural image classifiers perform better in real-world atypical scenarios by reducing their reliance on spuriously correlated, non-predictive features.
  • methods: ASPIRE, a simple yet effective language-guided data augmentation method, expands the training set with synthetic images that lack the spurious features.
  • results: ASPIRE improves the classification accuracy of prior methods by 1% - 38% across 4 datasets and 9 baselines.
    Abstract Neural image classifiers can often learn to make predictions by overly relying on non-predictive features that are spuriously correlated with the class labels in the training data. This leads to poor performance in real-world atypical scenarios where such features are absent. Supplementing the training dataset with images without such spurious features can aid robust learning against spurious correlations via better generalization. This paper presents ASPIRE (Language-guided data Augmentation for SPurIous correlation REmoval), a simple yet effective solution for expanding the training dataset with synthetic images without spurious features. ASPIRE, guided by language, generates these images without requiring any form of additional supervision or existing examples. Precisely, we employ LLMs to first extract foreground and background features from textual descriptions of an image, followed by advanced language-guided image editing to discover the features that are spuriously correlated with the class label. Finally, we personalize a text-to-image generation model to generate diverse in-domain images without spurious features. We demonstrate the effectiveness of ASPIRE on 4 datasets, including the very challenging Hard ImageNet dataset, and 9 baselines and show that ASPIRE improves the classification accuracy of prior methods by 1% - 38%. Code soon at: https://github.com/Sreyan88/ASPIRE.

Open, Closed, or Small Language Models for Text Classification?

  • paper_url: http://arxiv.org/abs/2308.10092
  • repo_url: None
  • paper_authors: Hao Yu, Zachary Yang, Kellin Pelrine, Jean Francois Godbout, Reihaneh Rabbany
  • for: This study examines how three classes of models perform on three tasks: named entity recognition, political party prediction, and misinformation detection.
  • methods: Eight datasets are used to evaluate three classes of models: open-source LLMs, closed-source LLMs, and supervised smaller models.
  • results: Larger LLMs often lead to improved performance, but open-source models can rival their closed-source counterparts after fine-tuning, and smaller supervised models such as RoBERTa achieve similar or even greater performance than generative LLMs on many datasets; closed models retain an advantage on hard tasks that demand the most generalizability.
    Abstract Recent advancements in large language models have demonstrated remarkable capabilities across various NLP tasks. But many questions remain, including whether open-source models match closed ones, why these models excel or struggle with certain tasks, and what types of practical procedures can improve performance. We address these questions in the context of classification by evaluating three classes of models using eight datasets across three distinct tasks: named entity recognition, political party prediction, and misinformation detection. While larger LLMs often lead to improved performance, open-source models can rival their closed-source counterparts by fine-tuning. Moreover, supervised smaller models, like RoBERTa, can achieve similar or even greater performance in many datasets compared to generative LLMs. On the other hand, closed models maintain an advantage in hard tasks that demand the most generalizability. This study underscores the importance of model selection based on task requirements

GNNPipe: Accelerating Distributed Full-Graph GNN Training with Pipelined Model Parallelism

  • paper_url: http://arxiv.org/abs/2308.10087
  • repo_url: None
  • paper_authors: Jingji Chen, Zhuoming Chen, Xuehai Qian
  • for: This work targets improving the efficiency of distributed full-graph GNN training methods.
  • methods: A new training method, GNNPipe, adopts model parallelism instead of graph parallelism, which has a lower worst-case asymptotic communication complexity; it is combined with chunk-based pipelined training to keep GPU utilization high, and with hybrid model/graph parallelism when model-level parallelism is insufficient.
  • results: The method reduces per-epoch training time by up to 2.45x (on average 2.03x) and reduces communication volume and overhead by up to 22.51x and 27.21x (on average 10.27x and 14.96x), respectively, while achieving comparable model accuracy and convergence speed to graph parallelism.
    Abstract Current distributed full-graph GNN training methods adopt a variant of data parallelism, namely graph parallelism, in which the whole graph is divided into multiple partitions (subgraphs) and each GPU processes one of them. This incurs high communication overhead because of the inter-partition message passing at each layer. To this end, we proposed a new training method named GNNPipe that adopts model parallelism instead, which has a lower worst-case asymptotic communication complexity than graph parallelism. To ensure high GPU utilization, we proposed to combine model parallelism with a chunk-based pipelined training method, in which each GPU processes a different chunk of graph data at different layers concurrently. We further proposed hybrid parallelism that combines model and graph parallelism when the model-level parallelism is insufficient. We also introduced several tricks to ensure convergence speed and model accuracies to accommodate embedding staleness introduced by pipelining. Extensive experiments show that our method reduces the per-epoch training time by up to 2.45x (on average 2.03x) and reduces the communication volume and overhead by up to 22.51x and 27.21x (on average 10.27x and 14.96x), respectively, while achieving a comparable level of model accuracy and convergence speed compared to graph parallelism.

Contrastive Learning for Non-Local Graphs with Multi-Resolution Structural Views

  • paper_url: http://arxiv.org/abs/2308.10077
  • repo_url: None
  • paper_authors: Asif Khan, Amos Storkey
  • for: This work aims to learn node-level representations of heterophilic graphs, which benefits applications such as fraudster detection and protein function prediction.
  • methods: A novel multiview contrastive learning approach integrates diffusion filters on graphs, using multiple graph views as augmentations to capture structural equivalence and uncover hidden relationships and similarities not apparent in traditional node representations.
  • results: The method outperforms baselines on synthetic and real structural datasets, surpassing the best baseline by 16.06% on Cornell, 3.27% on Texas, and 8.04% on Wisconsin, and consistently achieves superior performance on proximal tasks, showing that it captures structural information useful for downstream applications.
    Abstract Learning node-level representations of heterophilic graphs is crucial for various applications, including fraudster detection and protein function prediction. In such graphs, nodes share structural similarity identified by the equivalence of their connectivity which is implicitly encoded in the form of higher-order hierarchical information in the graphs. The contrastive methods are popular choices for learning the representation of nodes in a graph. However, existing contrastive methods struggle to capture higher-order graph structures. To address this limitation, we propose a novel multiview contrastive learning approach that integrates diffusion filters on graphs. By incorporating multiple graph views as augmentations, our method captures the structural equivalence in heterophilic graphs, enabling the discovery of hidden relationships and similarities not apparent in traditional node representations. Our approach outperforms baselines on synthetic and real structural datasets, surpassing the best baseline by $16.06\%$ on Cornell, $3.27\%$ on Texas, and $8.04\%$ on Wisconsin. Additionally, it consistently achieves superior performance on proximal tasks, demonstrating its effectiveness in uncovering structural information and improving downstream applications.
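
As a rough illustration of "multiple graph views via diffusion filters", the sketch below smooths node features with powers of the symmetrically normalized adjacency matrix to obtain views at several structural resolutions; the actual filter family and augmentation scheme in the paper may differ.

```python
import numpy as np

def diffusion_views(adj, features, scales=(1, 2, 4)):
    """Generate multi-resolution views by repeatedly applying the normalized
    adjacency T = D^{-1/2} A D^{-1/2}, which smooths features over larger and
    larger neighborhoods (illustrative, not the paper's exact filter)."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    mask = deg > 0
    d_inv_sqrt[mask] = deg[mask] ** -0.5
    T = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    views, h = [], features
    for s in range(1, max(scales) + 1):
        h = T @ h
        if s in scales:
            views.append(h.copy())
    return views

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)  # toy path graph
X = np.eye(3)
print([v.shape for v in diffusion_views(A, X)])  # three (3, 3) views
```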

UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding

  • paper_url: http://arxiv.org/abs/2308.11592
  • repo_url: None
  • paper_authors: Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, Can Huang
  • for: This paper proposes a new large multimodal model that can simultaneously detect, recognize, spot, and understand text, improving performance on text-rich scenarios.
  • methods: The UniDoc model adds the text detection and recognition capabilities missing from existing approaches, leverages the representation power and world knowledge of large pre-trained models, and exploits the beneficial interactions among tasks to improve each individual task; it is built via unified multimodal instruction tuning on contributed large-scale instruction-following datasets.
  • results: Experiments show that UniDoc sets state-of-the-art scores on multiple challenging benchmarks; to the authors' knowledge, it is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding.
    Abstract In the era of Large Language Models (LLMs), tremendous strides have been made in the field of multimodal understanding. However, existing advanced algorithms are limited to effectively utilizing the immense representation capabilities and rich world knowledge inherent to these large pre-trained models, and the beneficial connections among tasks within the context of text-rich scenarios have not been sufficiently explored. In this work, we introduce UniDoc, a novel multimodal model equipped with text detection and recognition capabilities, which are deficient in existing approaches. Moreover, UniDoc capitalizes on the beneficial interactions among tasks to enhance the performance of each individual task. To implement UniDoc, we perform unified multimodal instruct tuning on the contributed large-scale instruction following datasets. Quantitative and qualitative experimental results show that UniDoc sets state-of-the-art scores across multiple challenging benchmarks. To the best of our knowledge, this is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding.