cs.AI - 2023-11-02

Implicit Chain of Thought Reasoning via Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2311.01460
  • repo_url: https://github.com/da03/implicit_chain_of_thought
  • paper_authors: Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, Stuart Shieber
  • for: This work aims to augment language models' reasoning ability, which is usually done by having models produce chain-of-thought steps before answering.
  • methods: Instead of explicitly producing chain-of-thought steps, the language model's internal hidden states are used to perform implicit reasoning; the implicit steps are distilled from a teacher model trained on explicit chain-of-thought, shifting reasoning from "horizontal" (one intermediate token at a time) to "vertical" (across the hidden states of different layers).
  • results: Experiments on a multi-digit multiplication task and a grade-school math dataset show that the approach solves tasks previously not solvable without explicit chain-of-thought, at a speed comparable to no chain-of-thought.
    Abstract To augment language models with the ability to reason, researchers usually prompt or finetune them to produce chain of thought reasoning steps before producing the final answer. However, although people use natural language to reason effectively, it may be that LMs could reason more effectively with some intermediate computation that is not in natural language. In this work, we explore an alternative reasoning approach: instead of explicitly producing the chain of thought reasoning steps, we use the language model's internal hidden states to perform implicit reasoning. The implicit reasoning steps are distilled from a teacher model trained on explicit chain-of-thought reasoning, and instead of doing reasoning "horizontally" by producing intermediate words one-by-one, we distill it such that the reasoning happens "vertically" among the hidden states in different layers. We conduct experiments on a multi-digit multiplication task and a grade school math problem dataset and find that this approach enables solving tasks previously not solvable without explicit chain-of-thought, at a speed comparable to no chain-of-thought.
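A minimal sketch of the "vertical" distillation idea described above, assuming hypothetical teacher/student wrappers that expose per-layer hidden states; the student is trained to match teacher hidden states taken at successive explicit reasoning steps rather than to emit those steps as tokens. This is an illustrative loss, not the paper's exact objective.

```python
import torch
import torch.nn as nn

def implicit_cot_distillation_loss(student_hiddens, teacher_step_hiddens, proj):
    """Align the student's per-layer hidden states (no explicit CoT tokens)
    with teacher hidden states collected at successive explicit CoT steps.

    student_hiddens:      list of L tensors, each (batch, d_model), one per student layer
    teacher_step_hiddens: list of L tensors, each (batch, d_model), one per reasoning step
    proj:                 nn.Linear mapping student space -> teacher space
    """
    loss = 0.0
    for h_s, h_t in zip(student_hiddens, teacher_step_hiddens):
        loss = loss + nn.functional.mse_loss(proj(h_s), h_t.detach())
    return loss / len(student_hiddens)
```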

Conformal Policy Learning for Sensorimotor Control Under Distribution Shifts

  • paper_url: http://arxiv.org/abs/2311.01457
  • repo_url: None
  • paper_authors: Huang Huang, Satvik Sharma, Antonio Loquercio, Anastasios Angelopoulos, Ken Goldberg, Jitendra Malik
  • for: Detecting and reacting to changes in the distribution of a sensorimotor controller's observables.
  • methods: Switching policies that take conformal quantiles as input ("conformal policy learning"), giving formal statistical guarantees; the quantiles are used either to switch between base policies with different characteristics (e.g., safety or speed) or to directly augment a policy's observation and train it with reinforcement learning.
  • results: Evaluated on simulated autonomous driving and active perception with a physical quadruped, the approach outperforms five baselines while remaining the simplest strategy besides one ablation.
    Abstract This paper focuses on the problem of detecting and reacting to changes in the distribution of a sensorimotor controller's observables. The key idea is the design of switching policies that can take conformal quantiles as input, which we define as conformal policy learning, that allows robots to detect distribution shifts with formal statistical guarantees. We show how to design such policies by using conformal quantiles to switch between base policies with different characteristics, e.g. safety or speed, or directly augmenting a policy observation with a quantile and training it with reinforcement learning. Theoretically, we show that such policies achieve the formal convergence guarantees in finite time. In addition, we thoroughly evaluate their advantages and limitations on two compelling use cases: simulated autonomous driving and active perception with a physical quadruped. Empirical results demonstrate that our approach outperforms five baselines. It is also the simplest of the baseline strategies besides one ablation. Being easy to use, flexible, and with formal guarantees, our work demonstrates how conformal prediction can be an effective tool for sensorimotor learning under uncertainty.
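A minimal sketch of the quantile-based switching idea under assumed names: nonconformity scores from a calibration set yield a conformal threshold, and the controller falls back to a conservative base policy whenever the current score exceeds it. The two base policies are toy stand-ins.

```python
import numpy as np

def conformal_quantile(calibration_scores, alpha=0.1):
    """Finite-sample conformal quantile of nonconformity scores."""
    n = len(calibration_scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(calibration_scores, min(level, 1.0))

def switching_policy(obs, score, q_hat, fast_policy, safe_policy):
    """Use the fast base policy in-distribution, the safe one under shift."""
    return safe_policy(obs) if score > q_hat else fast_policy(obs)

# usage with toy stand-ins for the two base policies
cal_scores = np.random.rand(500)            # nonconformity scores on calibration data
q_hat = conformal_quantile(cal_scores, alpha=0.1)
action = switching_policy(obs=np.zeros(4), score=0.97, q_hat=q_hat,
                          fast_policy=lambda o: "fast", safe_policy=lambda o: "safe")
```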

RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation

  • paper_url: http://arxiv.org/abs/2311.01455
  • repo_url: None
  • paper_authors: Yufei Wang, Zhou Xian, Feng Chen, Tsun-Hsuan Wang, Yian Wang, Katerina Fragkiadaki, Zackory Erickson, David Held, Chuang Gan
  • for: This work aims to transfer the extensive and versatile knowledge embedded in large-scale models to robotics, enabling a robotic agent to automatically learn diverse skills at scale.
  • methods: Generative simulation: foundation and generative models are used to automatically generate diversified tasks, scenes, and training supervisions, scaling up robotic skill learning through a self-guided propose-generate-learn cycle.
  • results: The fully generative pipeline can be queried repeatedly to produce an endless stream of skill demonstrations for diverse tasks and environments with minimal human supervision.
    Abstract We present RoboGen, a generative robotic agent that automatically learns diverse robotic skills at scale via generative simulation. RoboGen leverages the latest advancements in foundation and generative models. Instead of directly using or adapting these models to produce policies or low-level actions, we advocate for a generative scheme, which uses these models to automatically generate diversified tasks, scenes, and training supervisions, thereby scaling up robotic skill learning with minimal human supervision. Our approach equips a robotic agent with a self-guided propose-generate-learn cycle: the agent first proposes interesting tasks and skills to develop, and then generates corresponding simulation environments by populating pertinent objects and assets with proper spatial configurations. Afterwards, the agent decomposes the proposed high-level task into sub-tasks, selects the optimal learning approach (reinforcement learning, motion planning, or trajectory optimization), generates required training supervision, and then learns policies to acquire the proposed skill. Our work attempts to extract the extensive and versatile knowledge embedded in large-scale models and transfer them to the field of robotics. Our fully generative pipeline can be queried repeatedly, producing an endless stream of skill demonstrations associated with diverse tasks and environments.
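A schematic sketch of the propose-generate-learn cycle as described in the abstract; every method called here (propose_task, build_scene, decompose, choose_learner, generate_supervision) is a hypothetical placeholder for a foundation-model query or a simulator call, not RoboGen's actual API.

```python
def propose_generate_learn(agent, n_iterations=10):
    """Self-guided skill acquisition loop, following the abstract's description."""
    skills = []
    for _ in range(n_iterations):
        task = agent.propose_task()                  # model proposes an interesting task
        scene = agent.build_scene(task)              # populate objects / spatial layout
        for subtask in agent.decompose(task):
            learner = agent.choose_learner(subtask)  # RL, motion planning, or trajectory opt.
            supervision = agent.generate_supervision(subtask, scene)
            policy = learner.train(scene, supervision)
            skills.append((subtask, policy))
    return skills
```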

NOIR: Neural Signal Operated Intelligent Robots for Everyday Activities

  • paper_url: http://arxiv.org/abs/2311.01454
  • repo_url: None
  • paper_authors: Ruohan Zhang, Sharon Lee, Minjune Hwang, Ayano Hiranaka, Chen Wang, Wensi Ai, Jin Jie Ryan Tan, Shreya Gupta, Yilun Hao, Gabrael Levine, Ruohan Gao, Anthony Norcia, Li Fei-Fei, Jiajun Wu
  • for: This work develops NOIR, a general-purpose intelligent brain-robot interface system that lets humans command robots to perform everyday activities through brain signals.
  • methods: Human brain signals are captured with electroencephalography (EEG) to communicate intended objects and actions; robot learning algorithms are integrated so that NOIR adapts to individual users and predicts their intentions.
  • results: The system succeeds on 20 challenging everyday household activities, including cooking, cleaning, personal care, and entertainment, with effectiveness improved by the learning components.
    Abstract We present Neural Signal Operated Intelligent Robots (NOIR), a general-purpose, intelligent brain-robot interface system that enables humans to command robots to perform everyday activities through brain signals. Through this interface, humans communicate their intended objects of interest and actions to the robots using electroencephalography (EEG). Our novel system demonstrates success in an expansive array of 20 challenging, everyday household activities, including cooking, cleaning, personal care, and entertainment. The effectiveness of the system is improved by its synergistic integration of robot learning algorithms, allowing for NOIR to adapt to individual users and predict their intentions. Our work enhances the way humans interact with robots, replacing traditional channels of interaction with direct, neural communication. Project website: https://noir-corl.github.io/.

Time Series Anomaly Detection using Diffusion-based Models

  • paper_url: http://arxiv.org/abs/2311.01452
  • repo_url: https://github.com/fbrad/diffusionae
  • paper_authors: Ioana Pintilie, Andrei Manolache, Florin Brad
  • for: This paper investigates whether diffusion models can be leveraged for anomaly detection (AD) on multivariate time series (MTS).
  • methods: Two diffusion-based models are tested against several strong neural baselines; the PA%K protocol is also extended with a ROCK-AUC metric that is agnostic to both the detection threshold and the ratio K of correctly detected points.
  • results: The models outperform the baselines on synthetic datasets and are competitive on real-world datasets, illustrating the potential of diffusion-based methods for AD on multivariate time series.
    Abstract Diffusion models have been recently used for anomaly detection (AD) in images. In this paper we investigate whether they can also be leveraged for AD on multivariate time series (MTS). We test two diffusion-based models and compare them to several strong neural baselines. We also extend the PA%K protocol, by computing a ROCK-AUC metric, which is agnostic to both the detection threshold and the ratio K of correctly detected points. Our models outperform the baselines on synthetic datasets and are competitive on real-world datasets, illustrating the potential of diffusion-based methods for AD in multivariate time series.
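A minimal sketch of a reconstruction-style anomaly score with a diffusion model, one common way such models are used for AD (not necessarily the exact scoring in this paper): partially noise each window, denoise it with the trained model, and score by reconstruction error.

```python
import torch

@torch.no_grad()
def diffusion_anomaly_score(model, x, t, alphas_cumprod):
    """Score a batch of MTS windows x: (batch, time, channels).

    model(x_noisy, t) is assumed to predict the added noise (epsilon-parameterization).
    alphas_cumprod: (T,) cumulative noise schedule. Higher score = more anomalous.
    """
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(x)
    x_noisy = a_bar.sqrt() * x + (1 - a_bar).sqrt() * noise
    eps_hat = model(x_noisy, t)
    x_hat = (x_noisy - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()  # one-shot denoise
    return ((x - x_hat) ** 2).mean(dim=(1, 2))  # per-window anomaly score
```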

DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing

  • paper_url: http://arxiv.org/abs/2311.01450
  • repo_url: None
  • paper_authors: Vint Lee, Pieter Abbeel, Youngwoon Lee
  • for: This paper studies model-based reinforcement learning (MBRL), which learns complex behaviors sample-efficiently by planning actions through imagined trajectories with predicted rewards.
  • methods: A simple yet effective reward-smoothing approach, DreamSmooth, which learns to predict a temporally smoothed reward rather than the exact reward at each timestep.
  • results: DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks in both sample efficiency and final performance, without losing performance on common benchmarks such as the DeepMind Control Suite and Atari.
    Abstract Model-based reinforcement learning (MBRL) has gained much attention for its ability to learn complex behaviors in a sample-efficient way: planning actions by generating imaginary trajectories with predicted rewards. Despite its success, we found that surprisingly, reward prediction is often a bottleneck of MBRL, especially for sparse rewards that are challenging (or even ambiguous) to predict. Motivated by the intuition that humans can learn from rough reward estimates, we propose a simple yet effective reward smoothing approach, DreamSmooth, which learns to predict a temporally-smoothed reward, instead of the exact reward at the given timestep. We empirically show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks both in sample efficiency and final performance without losing performance on common benchmarks, such as Deepmind Control Suite and Atari benchmarks.
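A minimal sketch of temporal reward smoothing as a preprocessing step for the reward-prediction target, assuming a Gaussian kernel; the actual smoothing function and where it plugs into the world model are design choices the paper explores.

```python
import numpy as np

def smooth_rewards(rewards, sigma=2.0, radius=6):
    """Replace each reward with a Gaussian-weighted average of its neighbors,
    spreading sparse reward spikes over nearby timesteps."""
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(rewards, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

# a sparse reward at t=10 becomes an easier-to-predict bump around t=10
smoothed = smooth_rewards(np.eye(1, 20, 10).ravel())
```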

Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models

  • paper_url: http://arxiv.org/abs/2311.01441
  • repo_url: https://github.com/andyz245/discreteadversarialdistillation
  • paper_authors: Andy Zhou, Jindong Wang, Yu-Xiong Wang, Haohan Wang
  • for: Improving the out-of-distribution robustness of vision models.
  • methods: Knowledge distillation combined with data augmentation: a robust pretrained teacher generates adversarial examples, which are discretized with a VQGAN to create more informative samples than standard augmentation (Discrete Adversarial Distillation, DAD).
  • results: Strong gains in out-of-distribution robustness and clean accuracy across different student architectures, with minor computational overhead compared to similar techniques, and the method combines easily with other data augmentations.
    Abstract We propose a conceptually simple and lightweight framework for improving the robustness of vision models through the combination of knowledge distillation and data augmentation. We address the conjecture that larger models do not make for better teachers by showing strong gains in out-of-distribution robustness when distilling from pretrained foundation models. Following this finding, we propose Discrete Adversarial Distillation (DAD), which leverages a robust teacher to generate adversarial examples and a VQGAN to discretize them, creating more informative samples than standard data augmentation techniques. We provide a theoretical framework for the use of a robust teacher in the knowledge distillation with data augmentation setting and demonstrate strong gains in out-of-distribution robustness and clean accuracy across different student architectures. Notably, our method adds minor computational overhead compared to similar techniques and can be easily combined with other data augmentations for further improvements.
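A minimal sketch of a DAD-style training step under assumed components: an FGSM-style perturbation crafted against a frozen robust teacher, a VQGAN autoencoder used only to discretize (encode then decode) the perturbed image, and a standard distillation loss on the student. The real method's attack, codebook handling, and loss weighting may differ.

```python
import torch
import torch.nn.functional as F

def dad_step(student, teacher, vqgan, x, y, eps=4 / 255, temp=4.0, alpha=0.5):
    """One Discrete-Adversarial-Distillation-style training step (sketch)."""
    # 1) adversarial example crafted against the frozen robust teacher
    x_adv = x.clone().requires_grad_(True)
    loss_t = F.cross_entropy(teacher(x_adv), y)
    grad, = torch.autograd.grad(loss_t, x_adv)
    x_adv = (x + eps * grad.sign()).clamp(0, 1).detach()

    # 2) discretize the perturbation through the VQGAN bottleneck (assumed API)
    x_disc = vqgan.decode(vqgan.encode(x_adv)).detach()

    # 3) distill the teacher into the student on the discretized sample
    s_logits, t_logits = student(x_disc), teacher(x_disc).detach()
    kd = F.kl_div(F.log_softmax(s_logits / temp, dim=-1),
                  F.softmax(t_logits / temp, dim=-1), reduction="batchmean") * temp ** 2
    return alpha * kd + (1 - alpha) * F.cross_entropy(student(x), y)
```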

Tailoring Mixup to Data using Kernel Warping functions

  • paper_url: http://arxiv.org/abs/2311.01434
  • repo_url: https://github.com/ensta-u2is/torch-uncertainty
  • paper_authors: Quentin Bouniot, Pavlo Mozharovskyi, Florence d’Alché-Buc
  • for: Improving the performance and calibration of deep learning models by tailoring how training points are interpolated in mixup.
  • methods: The underlying distribution of mixup interpolation coefficients is changed dynamically through kernel warping functions, so that similar points are mixed more frequently and more strongly than dissimilar ones, within an efficient and flexible framework that preserves diversity.
  • results: Extensive classification and regression experiments show that the method improves both model performance and calibration.
    Abstract Data augmentation is an essential building block for learning efficient deep learning models. Among all augmentation techniques proposed so far, linear interpolation of training data points, also called mixup, has found to be effective for a large panel of applications. While the majority of works have focused on selecting the right points to mix, or applying complex non-linear interpolation, we are interested in mixing similar points more frequently and strongly than less similar ones. To this end, we propose to dynamically change the underlying distribution of interpolation coefficients through warping functions, depending on the similarity between data points to combine. We define an efficient and flexible framework to do so without losing in diversity. We provide extensive experiments for classification and regression tasks, showing that our proposed method improves both performance and calibration of models. Code available in https://github.com/ENSTA-U2IS/torch-uncertainty
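A minimal sketch of similarity-warped mixup under assumed forms: the interpolation coefficient is drawn from a Beta distribution and then warped toward stronger mixing when the two inputs are similar. The specific warping and similarity functions here are illustrative, not the paper's exact kernels.

```python
import numpy as np

def similarity(x1, x2, bandwidth=1.0):
    """Gaussian-kernel similarity in [0, 1] between two flattened inputs."""
    return float(np.exp(-np.sum((x1 - x2) ** 2) / (2 * bandwidth ** 2)))

def warped_mixup(x1, y1, x2, y2, alpha=1.0):
    lam = np.random.beta(alpha, alpha)
    s = similarity(x1.ravel(), x2.ravel())
    # warp lambda toward 0.5 (strong mixing) for similar pairs,
    # toward 0/1 (weak mixing) for dissimilar pairs
    lam_w = 0.5 + (lam - 0.5) * (1.0 - s)
    x_mix = lam_w * x1 + (1 - lam_w) * x2
    y_mix = lam_w * y1 + (1 - lam_w) * y2
    return x_mix, y_mix
```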

Castor: Causal Temporal Regime Structure Learning

  • paper_url: http://arxiv.org/abs/2311.01412
  • repo_url: None
  • paper_authors: Abdellah Rahmani, Pascal Frossard
  • for: This work addresses the discovery of causal relationships in multivariate time series data that follow multiple a priori unknown regimes, a problem that cuts across many domains.
  • methods: CASTOR maximizes a score function via the EM algorithm to infer the number of regimes and to learn linear or non-linear causal relationships within each regime.
  • results: Experiments show robust convergence and accurate identification of the regimes; on synthetic data and two real-world benchmarks, CASTOR outperforms baseline causal-discovery methods and, by learning a full temporal causal graph per regime, remains interpretable.
    Abstract The task of uncovering causal relationships among multivariate time series data stands as an essential and challenging objective that cuts across a broad array of disciplines ranging from climate science to healthcare. Such data entails linear or non-linear relationships, and usually follow multiple a priori unknown regimes. Existing causal discovery methods can infer summary causal graphs from heterogeneous data with known regimes, but they fall short in comprehensively learning both regimes and the corresponding causal graph. In this paper, we introduce CASTOR, a novel framework designed to learn causal relationships in heterogeneous time series data composed of various regimes, each governed by a distinct causal graph. Through the maximization of a score function via the EM algorithm, CASTOR infers the number of regimes and learns linear or non-linear causal relationships in each regime. We demonstrate the robust convergence properties of CASTOR, specifically highlighting its proficiency in accurately identifying unique regimes. Empirical evidence, garnered from exhaustive synthetic experiments and two real-world benchmarks, confirm CASTOR's superior performance in causal discovery compared to baseline methods. By learning a full temporal causal graph for each regime, CASTOR establishes itself as a distinctly interpretable method for causal discovery in heterogeneous time series.

Analysis of Information Propagation in Ethereum Network Using Combined Graph Attention Network and Reinforcement Learning to Optimize Network Efficiency and Scalability

  • paper_url: http://arxiv.org/abs/2311.01406
  • repo_url: None
  • paper_authors: Stefan Kambiz Behfar, Jon Crowcroft
  • for: This study analyzes information-propagation dynamics in the Ethereum network with the goal of improving network efficiency, security, and scalability.
  • methods: Graph convolutional networks (GCNs) are used to analyze propagation patterns over a transaction graph built from blocks, transactions, and node degrees, and a combined graph attention network (GAT) and reinforcement learning (RL) model is developed to optimize network efficiency and scalability, including gas limits for block processing.
  • results: On a large-scale Ethereum dataset, the proposed GAT-RL model outperforms other GCN models, effectively propagating information across the network, optimizing gas limits for block processing, and improving network efficiency.
    Abstract Blockchain technology has revolutionized the way information is propagated in decentralized networks. Ethereum plays a pivotal role in facilitating smart contracts and decentralized applications. Understanding information propagation dynamics in Ethereum is crucial for ensuring network efficiency, security, and scalability. In this study, we propose an innovative approach that utilizes Graph Convolutional Networks (GCNs) to analyze the information propagation patterns in the Ethereum network. The first phase of our research involves data collection from the Ethereum blockchain, consisting of blocks, transactions, and node degrees. We construct a transaction graph representation using adjacency matrices to capture the node embeddings; while our major contribution is to develop a combined Graph Attention Network (GAT) and Reinforcement Learning (RL) model to optimize the network efficiency and scalability. It learns the best actions to take in various network states, ultimately leading to improved network efficiency, throughput, and optimize gas limits for block processing. In the experimental evaluation, we analyze the performance of our model on a large-scale Ethereum dataset. We investigate effectively aggregating information from neighboring nodes capturing graph structure and updating node embeddings using GCN with the objective of transaction pattern prediction, accounting for varying network loads and number of blocks. Not only we design a gas limit optimization model and provide the algorithm, but also to address scalability, we demonstrate the use and implementation of sparse matrices in GraphConv, GraphSAGE, and GAT. The results indicate that our designed GAT-RL model achieves superior results compared to other GCN models in terms of performance. It effectively propagates information across the network, optimizing gas limits for block processing and improving network efficiency.

Vision-Language Foundation Models as Effective Robot Imitators

  • paper_url: http://arxiv.org/abs/2311.01378
  • repo_url: None
  • paper_authors: Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, Hang Li, Tao Kong
  • for: This work aims to solve robot manipulation tasks by simply fine-tuning existing vision-language models (VLMs) on robotics data.
  • methods: A simple vision-language manipulation framework, RoboFlamingo, built on the open-source VLM OpenFlamingo; the pre-trained VLM handles single-step vision-language comprehension while an explicit policy head models sequential history, and the whole system is lightly fine-tuned by imitation learning on language-conditioned manipulation datasets only.
  • results: RoboFlamingo exceeds state-of-the-art performance by a large margin on the tested benchmark and supports open-loop control and deployment on low-performance platforms; the experiments also reveal how different pre-trained VLMs behave on manipulation tasks, suggesting RoboFlamingo as a cost-effective and easy-to-use solution for robot manipulation.
    Abstract Recent progress in vision language foundation models has shown their ability to understand multimodal data and resolve complicated vision language tasks, including robotics manipulation. We seek a straightforward way of making use of existing vision-language models (VLMs) with simple fine-tuning on robotics data. To this end, we derive a simple and novel vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLMs, OpenFlamingo. Unlike prior works, RoboFlamingo utilizes pre-trained VLMs for single-step vision-language comprehension, models sequential history information with an explicit policy head, and is slightly fine-tuned by imitation learning only on language-conditioned manipulation datasets. Such a decomposition provides RoboFlamingo the flexibility for open-loop control and deployment on low-performance platforms. By exceeding the state-of-the-art performance with a large margin on the tested benchmark, we show RoboFlamingo can be an effective and competitive alternative to adapt VLMs to robot control. Our extensive experimental results also reveal several interesting conclusions regarding the behavior of different pre-trained VLMs on manipulation tasks. We believe RoboFlamingo has the potential to be a cost-effective and easy-to-use solution for robotics manipulation, empowering everyone with the ability to fine-tune their own robotics policy.

Recognize Any Regions

  • paper_url: http://arxiv.org/abs/2311.01373
  • repo_url: https://github.com/Surrey-UPLab/Recognize-Any-Regions
  • paper_authors: Haosen Yang, Chuofan Ma, Bin Wen, Yi Jiang, Zehuan Yuan, Xiatian Zhu
  • for: This work addresses understanding the semantics of individual regions or patches within unconstrained images, as required in open-world object detection.
  • methods: Building on image-level vision-language (ViL) foundation models such as CLIP, the authors propose RegionSpot, a novel, generic, and efficient region-recognition architecture that integrates position-aware localization knowledge from a localization foundation model (e.g., SAM) with semantic information from a ViL model; both foundation models are kept frozen and only a lightweight attention-based knowledge-integration module is optimized.
  • results: RegionSpot improves open-world object recognition while substantially reducing compute: training on 3 million examples takes a single day on 8 V100 GPUs, and it outperforms GLIP by 6.5% mean average precision (mAP), with an even larger 14.8% margin on more challenging and rare categories.
    Abstract Understanding the semantics of individual regions or patches within unconstrained images, such as in open-world object detection, represents a critical yet challenging task in computer vision. Building on the success of powerful image-level vision-language (ViL) foundation models like CLIP, recent efforts have sought to harness their capabilities by either training a contrastive model from scratch with an extensive collection of region-label pairs or aligning the outputs of a detection model with image-level representations of region proposals. Despite notable progress, these approaches are plagued by computationally intensive training requirements, susceptibility to data noise, and deficiency in contextual information. To address these limitations, we explore the synergistic potential of off-the-shelf foundation models, leveraging their respective strengths in localization and semantics. We introduce a novel, generic, and efficient region recognition architecture, named RegionSpot, designed to integrate position-aware localization knowledge from a localization foundation model (e.g., SAM) with semantic information extracted from a ViL model (e.g., CLIP). To fully exploit pretrained knowledge while minimizing training overhead, we keep both foundation models frozen, focusing optimization efforts solely on a lightweight attention-based knowledge integration module. Through extensive experiments in the context of open-world object recognition, our RegionSpot demonstrates significant performance improvements over prior alternatives, while also providing substantial computational savings. For instance, training our model with 3 million data in a single day using 8 V100 GPUs. Our model outperforms GLIP by 6.5 % in mean average precision (mAP), with an even larger margin by 14.8 % for more challenging and rare categories.
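A minimal sketch of the frozen-backbones-plus-lightweight-fusion idea under assumed tensor shapes: region tokens from a localization model attend over image-level ViL features, and only the small attention module and classifier are trained. This is illustrative, not RegionSpot's actual module design.

```python
import torch
import torch.nn as nn

class RegionFusion(nn.Module):
    """Lightweight attention module fusing frozen region and image-level features."""

    def __init__(self, d_region, d_vil, d_model=256, n_classes=1203):
        super().__init__()
        self.proj_r = nn.Linear(d_region, d_model)
        self.proj_v = nn.Linear(d_vil, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, region_tokens, vil_tokens):
        # region_tokens: (B, n_regions, d_region) from a frozen localization model (e.g., SAM)
        # vil_tokens:    (B, n_patches, d_vil) from a frozen ViL model (e.g., CLIP)
        q = self.proj_r(region_tokens)
        kv = self.proj_v(vil_tokens)
        fused, _ = self.attn(q, kv, kv)            # regions query image-level semantics
        return self.classifier(fused)              # per-region open-vocabulary logits
```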

Simplicial Models for the Epistemic Logic of Faulty Agents

  • paper_url: http://arxiv.org/abs/2311.01351
  • repo_url: None
  • paper_authors: Eric Goubault, Roman Kniazev, Jeremy Ledent, Sergio Rajsbaum
  • for: This paper studies simplicial models, models of epistemic logic based on higher-dimensional structures called simplicial complexes, and examines how different design choices in defining impure models affect the resulting logics.
  • methods: The authors systematically classify the design choices that arise when the purity assumption is dropped (so that the number of agents participating in a world can vary) and axiomatize the corresponding logics.
  • results: The logics are illustrated with distributed-computing examples of synchronous systems in which processes may crash.
    Abstract In recent years, several authors have been investigating simplicial models, a model of epistemic logic based on higher-dimensional structures called simplicial complexes. In the original formulation, simplicial models were always assumed to be pure, meaning that all worlds have the same dimension. This is equivalent to the standard S5n semantics of epistemic logic, based on Kripke models. By removing the assumption that models must be pure, we can go beyond the usual Kripke semantics and study epistemic logics where the number of agents participating in a world can vary. This approach has been developed in a number of papers, with applications in fault-tolerant distributed computing where processes may crash during the execution of a system. A difficulty that arises is that subtle design choices in the definition of impure simplicial models can result in different axioms of the resulting logic. In this paper, we classify those design choices systematically, and axiomatize the corresponding logics. We illustrate them via distributed computing examples of synchronous systems where processes may crash.

Like an Open Book? Read Neural Network Architecture with Simple Power Analysis on 32-bit Microcontrollers

  • paper_url: http://arxiv.org/abs/2311.01344
  • repo_url: None
  • paper_authors: Raphael Joud, Pierre-Alain Moellic, Simon Pontie, Jean-Baptiste Rigaud
  • for: This work investigates how much architecture information about deep neural network models can be recovered from EM side-channel traces, in order to inform the protection of AI systems deployed on edge devices.
  • methods: Combining theoretical knowledge of deep-learning practice with an analysis of a widespread implementation library (ARM CMSIS-NN), the authors propose an extraction methodology for traditional MLP and CNN models running on a high-end 32-bit microcontroller (Cortex-M7) that relies only on simple pattern-recognition analysis.
  • results: Despite a few challenging cases, architecture extraction succeeds and, unlike parameter extraction, the complexity of the attack is relatively low, highlighting the urgent need for practicable protections that fit the strong memory and latency requirements of such platforms.
    Abstract Model extraction is a growing concern for the security of AI systems. For deep neural network models, the architecture is the most important information an adversary aims to recover. Being a sequence of repeated computation blocks, neural network models deployed on edge-devices will generate distinctive side-channel leakages. The latter can be exploited to extract critical information when targeted platforms are physically accessible. By combining theoretical knowledge about deep learning practices and analysis of a widespread implementation library (ARM CMSIS-NN), our purpose is to answer this critical question: how far can we extract architecture information by simply examining an EM side-channel trace? For the first time, we propose an extraction methodology for traditional MLP and CNN models running on a high-end 32-bit microcontroller (Cortex-M7) that relies only on simple pattern recognition analysis. Despite few challenging cases, we claim that, contrary to parameters extraction, the complexity of the attack is relatively low and we highlight the urgent need for practicable protections that could fit the strong memory and latency requirements of such platforms.

Offline Imitation from Observation via Primal Wasserstein State Occupancy Matching

  • paper_url: http://arxiv.org/abs/2311.01331
  • repo_url: https://github.com/kaiyan289/pw-dice
  • paper_authors: Kai Yan, Alexander G. Schwing, Yu-xiong Wang
  • for: This work targets offline learning from observations (LfO), which reduces the need for costly environment interaction and expert action labels by learning only from expert states and task-agnostic non-expert state-action pairs.
  • methods: Existing DIstribution Correction Estimation (DICE) methods minimize the state-occupancy divergence between learner and expert policies but are limited to f-divergences (KL and chi-squared) or the Wasserstein distance with Rubinstein duality, which constrains the underlying distance metric; the proposed Primal Wasserstein DICE (PW-DICE) instead minimizes the primal Wasserstein distance between expert and learner state occupancies, with a pessimistic regularizer and a contrastively learned distance as the underlying metric.
  • results: Theoretically, PW-DICE generalizes the state-of-the-art SMODICE and unifies f-divergence and Wasserstein minimization; empirically, it outperforms several state-of-the-art methods on multiple testbeds.
    Abstract In real-world scenarios, arbitrary interactions with the environment can often be costly, and actions of expert demonstrations are not always available. To reduce the need for both, Offline Learning from Observations (LfO) is extensively studied, where the agent learns to solve a task with only expert states and \textit{task-agnostic} non-expert state-action pairs. The state-of-the-art DIstribution Correction Estimation (DICE) methods minimize the state occupancy divergence between the learner and expert policies. However, they are limited to either $f$-divergences (KL and $\chi^2$) or Wasserstein distance with Rubinstein duality, the latter of which constrains the underlying distance metric crucial to the performance of Wasserstein-based solutions. To address this problem, we propose Primal Wasserstein DICE (PW-DICE), which minimizes the primal Wasserstein distance between the expert and learner state occupancies with a pessimistic regularizer and leverages a contrastively learned distance as the underlying metric for the Wasserstein distance. Theoretically, we prove that our framework is a generalization of the state-of-the-art, SMODICE, and unifies $f$-divergence and Wasserstein minimization. Empirically, we find that PW-DICE improves upon several state-of-the-art methods on multiple testbeds.
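A minimal sketch of measuring the primal Wasserstein distance between expert and learner state samples with a plugged-in ground metric, using the POT library; the learned contrastive embedding here is a stand-in (identity by default), and the paper's full objective additionally includes the pessimistic regularizer and policy extraction.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def primal_wasserstein(expert_states, learner_states, embed=lambda s: s):
    """W1 between empirical state occupancies under a (possibly learned) metric."""
    ze, zl = embed(expert_states), embed(learner_states)   # (n, d), (m, d)
    cost = ot.dist(zl, ze, metric="euclidean")             # pairwise ground costs
    a = np.full(len(zl), 1.0 / len(zl))                    # uniform learner weights
    b = np.full(len(ze), 1.0 / len(ze))                    # uniform expert weights
    plan = ot.emd(a, b, cost)                              # exact primal OT plan
    return float((plan * cost).sum())
```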

A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories

  • paper_url: http://arxiv.org/abs/2311.01329
  • repo_url: https://github.com/kaiyan289/tailo
  • paper_authors: Kai Yan, Alexander G. Schwing, Yu-Xiong Wang
  • for: Solving offline imitation from observations, where only expert states and task-agnostic non-expert state-action pairs are available and trajectories may be incomplete.
  • methods: Weighted behavior cloning in which the weight for each transition is a discounted sum, along the future trajectory, of scores from a discriminator trained to identify expert states (Trajectory-Aware Imitation Learning from Observations, TAILO).
  • results: Across multiple testbeds, TAILO is more robust and effective than prior methods, particularly when learning from incomplete trajectories.
    Abstract Offline imitation from observations aims to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available. Offline imitation is useful in real-world scenarios where arbitrary interactions are costly and expert actions are unavailable. The state-of-the-art "DIstribution Correction Estimation" (DICE) methods minimize divergence of state occupancy between expert and learner policies and retrieve a policy with weighted behavior cloning; however, their results are unstable when learning from incomplete trajectories, due to a non-robust optimization in the dual domain. To address the issue, in this paper, we propose Trajectory-Aware Imitation Learning from Observations (TAILO). TAILO uses a discounted sum along the future trajectory as the weight for weighted behavior cloning. The terms for the sum are scaled by the output of a discriminator, which aims to identify expert states. Despite simplicity, TAILO works well if there exist trajectories or segments of expert behavior in the task-agnostic data, a common assumption in prior work. In experiments across multiple testbeds, we find TAILO to be more robust and effective, particularly with incomplete trajectories.
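A minimal sketch of the trajectory-aware weighting described above, assuming a discriminator score in which higher values indicate "more expert-like" states; the exact scaling of the discriminator output and the behavior-cloning loss it feeds are simplified.

```python
import numpy as np

def tailo_weights(disc_scores, gamma=0.98):
    """Weight each state-action pair by the discounted sum of future
    discriminator scores along its trajectory (computed right-to-left)."""
    weights = np.zeros_like(disc_scores, dtype=float)
    running = 0.0
    for t in reversed(range(len(disc_scores))):
        running = disc_scores[t] + gamma * running
        weights[t] = running
    return weights

# transitions near future expert-like states receive larger behavior-cloning weights
w = tailo_weights(np.array([0.1, 0.1, 0.9, 1.0, 0.2]))
```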

Better Together: Enhancing Generative Knowledge Graph Completion with Language Models and Neighborhood Information

  • paper_url: http://arxiv.org/abs/2311.01326
  • repo_url: https://github.com/screemix/kgc-t5-with-neighbors
  • paper_authors: Alla Chepurova, Aydar Bulatov, Yuri Kuratov, Mikhail Burtsev
  • for: This work addresses the incompleteness of real-world knowledge graphs (KGs), aiming to improve knowledge graph completion (KGC).
  • methods: Generative language models (T5-based, as in KGT5) are used to predict tail nodes directly, with node neighborhood information included as additional input to the model.
  • results: Including neighborhood information improves KGC, outperforming KGT5 and conventional KGC approaches on both inductive and transductive Wikidata subsets; the analysis also shows the importance of neighborhood information for model prediction and points to further gains from more effective neighborhood selection.
    Abstract Real-world Knowledge Graphs (KGs) often suffer from incompleteness, which limits their potential performance. Knowledge Graph Completion (KGC) techniques aim to address this issue. However, traditional KGC methods are computationally intensive and impractical for large-scale KGs, necessitating the learning of dense node embeddings and computing pairwise distances. Generative transformer-based language models (e.g., T5 and recent KGT5) offer a promising solution as they can predict the tail nodes directly. In this study, we propose to include node neighborhoods as additional information to improve KGC methods based on language models. We examine the effects of this imputation and show that, on both inductive and transductive Wikidata subsets, our method outperforms KGT5 and conventional KGC approaches. We also provide an extensive analysis of the impact of neighborhood on model prediction and show its importance. Furthermore, we point the way to significantly improve KGC through more effective neighborhood selection.
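A minimal sketch of how a neighborhood-augmented input might be serialized for a T5-style KGC model; the prompt format, separators, and the example query are illustrative assumptions, not the exact templates used in the paper or its repository.

```python
def build_kgc_input(head, relation, neighbors, max_neighbors=8):
    """Serialize a (head, relation, ?) query plus the head's neighborhood
    into a single text sequence for a seq2seq model to generate the tail."""
    neigh = " | ".join(f"{r} {t}" for r, t in neighbors[:max_neighbors])
    return f"predict tail: {head} | relation: {relation} | neighbors: {neigh}"

# example query over a tiny Wikidata-like neighborhood
text = build_kgc_input(
    head="Douglas Adams",
    relation="educated at",
    neighbors=[("occupation", "novelist"), ("country of citizenship", "United Kingdom")],
)
# a fine-tuned T5-style model would then generate a candidate tail entity from `text`
```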

Scattering Vision Transformer: Spectral Mixing Matters

  • paper_url: http://arxiv.org/abs/2311.01310
  • repo_url: None
  • paper_authors: Badri N. Patro, Vijay Srinivas Agneeswaran
  • for: This paper targets image classification, instance segmentation, and object detection, addressing attention complexity and the effective capture of fine-grained image details in vision transformers.
  • methods: A new architecture, the Scattering Vision Transformer (SVT), incorporates a spectrally scattering network to capture intricate image details, separates low-frequency and high-frequency components to avoid the invertibility issue of down-sampling, and introduces a spectral gating network with Einstein multiplication for token and channel mixing to reduce complexity.
  • results: SVT achieves state-of-the-art ImageNet results with significantly fewer parameters and FLOPS, improving over LiTv2 and iFormer by 2%: SVT-H-S reaches 84.2% top-1 accuracy, SVT-H-B 85.2% (state of the art among base versions), and SVT-H-L 85.7% (state of the art among large versions); it also performs well on instance segmentation and transfers well to CIFAR10, CIFAR100, Oxford Flower, and Stanford Car.
    Abstract Vision transformers have gained significant attention and achieved state-of-the-art performance in various computer vision tasks, including image classification, instance segmentation, and object detection. However, challenges remain in addressing attention complexity and effectively capturing fine-grained information within images. Existing solutions often resort to down-sampling operations, such as pooling, to reduce computational cost. Unfortunately, such operations are non-invertible and can result in information loss. In this paper, we present a novel approach called Scattering Vision Transformer (SVT) to tackle these challenges. SVT incorporates a spectrally scattering network that enables the capture of intricate image details. SVT overcomes the invertibility issue associated with down-sampling operations by separating low-frequency and high-frequency components. Furthermore, SVT introduces a unique spectral gating network utilizing Einstein multiplication for token and channel mixing, effectively reducing complexity. We show that SVT achieves state-of-the-art performance on the ImageNet dataset with a significant reduction in a number of parameters and FLOPS. SVT shows 2\% improvement over LiTv2 and iFormer. SVT-H-S reaches 84.2\% top-1 accuracy, while SVT-H-B reaches 85.2\% (state-of-art for base versions) and SVT-H-L reaches 85.7\% (again state-of-art for large versions). SVT also shows comparable results in other vision tasks such as instance segmentation. SVT also outperforms other transformers in transfer learning on standard datasets such as CIFAR10, CIFAR100, Oxford Flower, and Stanford Car datasets. The project page is available on this webpage.\url{https://badripatro.github.io/svt/}.
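A minimal sketch of a spectral gating layer in the spirit described above: tokens are moved to the frequency domain with an FFT, mixed with learnable complex weights via an einsum (Einstein multiplication), and transformed back. The actual SVT block additionally performs spectral scattering and frequency separation.

```python
import torch
import torch.nn as nn

class SpectralGating(nn.Module):
    """Frequency-domain token/channel mixing with learnable complex gates."""

    def __init__(self, n_tokens, dim):
        super().__init__()
        freq = n_tokens // 2 + 1                        # rfft output length
        self.gate = nn.Parameter(torch.randn(freq, dim, 2) * 0.02)

    def forward(self, x):                               # x: (batch, n_tokens, dim)
        x_f = torch.fft.rfft(x, dim=1)                  # to frequency domain
        gate = torch.view_as_complex(self.gate)         # (freq, dim) complex weights
        x_f = torch.einsum("bfd,fd->bfd", x_f, gate)    # Einstein-multiplication gating
        return torch.fft.irfft(x_f, n=x.size(1), dim=1)

y = SpectralGating(n_tokens=196, dim=64)(torch.randn(2, 196, 64))
```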

AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models

  • paper_url: http://arxiv.org/abs/2311.01305
  • repo_url: None
  • paper_authors: Baisong Li, Xingwang Wang, Haixiao Xu
  • for: Reducing the computational and storage costs of large language models (LLMs) through quantization while preserving accuracy.
  • methods: AWEQ, a post-training method requiring no additional training, transfers the difficulty of activation quantization to the weights via channel equalization, balancing the quantization difficulty of both, and further refines the equalization to mitigate quantization bias error.
  • results: Extensive experiments on popular models such as LLaMA and OPT show that AWEQ outperforms existing post-training quantization methods for large models, in both ultra-low-bit and 8-bit weight-and-activation (W8A8) quantization.
    Abstract Large language models(LLMs) exhibit excellent performance across a variety of tasks, but they come with significant computational and storage costs. Quantizing these models is an effective way to alleviate this issue. However, existing methods struggle to strike a balance between model accuracy and hardware efficiency. This is where we introduce AWEQ, a post-training method that requires no additional training overhead. AWEQ excels in both ultra-low-bit quantization and 8-bit weight and activation (W8A8) quantization. There is an observation that weight quantization is less challenging than activation quantization. AWEQ transfers the difficulty of activation quantization to weights using channel equalization, achieving a balance between the quantization difficulties of both, and thereby maximizing performance. We have further refined the equalization method to mitigate quantization bias error, ensuring the robustness of the model. Extensive experiments on popular models such as LLaMA and OPT demonstrate that AWEQ outperforms all existing post-training quantization methods for large models.
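A minimal sketch of per-channel activation-weight equalization in the spirit of the abstract (and of related methods such as SmoothQuant): a per-channel scale shifts quantization difficulty from activations onto weights while keeping the layer output unchanged. The exact scale formula and bias correction used by AWEQ may differ.

```python
import torch

def equalize_channels(act_absmax, weight, alpha=0.5, eps=1e-8):
    """Compute per-input-channel scales s and return the equalized weight.

    act_absmax: (in_features,) per-channel max |activation| from calibration data
    weight:     (out_features, in_features) linear-layer weight
    At inference, activations are divided by s and the weight columns are
    multiplied by s, so the layer output is unchanged but easier to quantize.
    """
    w_absmax = weight.abs().amax(dim=0).clamp(min=eps)              # per input channel
    s = (act_absmax.clamp(min=eps) ** alpha) / (w_absmax ** (1 - alpha))
    return s, weight * s.unsqueeze(0)

s, w_eq = equalize_channels(torch.rand(4096) * 20, torch.randn(4096, 4096))
```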

TRIALSCOPE: A Unifying Causal Framework for Scaling Real-World Evidence Generation with Biomedical Language Models

  • paper_url: http://arxiv.org/abs/2311.01301
  • repo_url: None
  • paper_authors: Javier González, Cliff Wong, Zelalem Gero, Jass Bagga, Risa Ueno, Isabel Chien, Eduard Orakvin, Emre Kiciman, Aditya Nori, Roshanthi Weerasinghe, Rom S. Leidner, Brian Piening, Tristan Naumann, Carlo Bifulco, Hoifung Poon
  • for: This work aims to optimize healthcare delivery and accelerate biomedical discovery by distilling real-world evidence from population-level observational data.
  • methods: TRIALSCOPE uses biomedical language models to structure clinical text at scale, advanced probabilistic modeling for denoising and imputation, and state-of-the-art causal inference techniques to combat common confounders, with clinical trial specifications serving as a generic representation for generating and reasoning with clinical hypotheses.
  • results: On a large-scale real-world dataset of over one million cancer patients from a large US healthcare network, TRIALSCOPE produces high-quality structuring of real-world data and generates results comparable to marquee cancer trials.
    Abstract The rapid digitization of real-world data offers an unprecedented opportunity for optimizing healthcare delivery and accelerating biomedical discovery. In practice, however, such data is most abundantly available in unstructured forms, such as clinical notes in electronic medical records (EMRs), and it is generally plagued by confounders. In this paper, we present TRIALSCOPE, a unifying framework for distilling real-world evidence from population-level observational data. TRIALSCOPE leverages biomedical language models to structure clinical text at scale, employs advanced probabilistic modeling for denoising and imputation, and incorporates state-of-the-art causal inference techniques to combat common confounders. Using clinical trial specification as generic representation, TRIALSCOPE provides a turn-key solution to generate and reason with clinical hypotheses using observational data. In extensive experiments and analyses on a large-scale real-world dataset with over one million cancer patients from a large US healthcare network, we show that TRIALSCOPE can produce high-quality structuring of real-world data and generates comparable results to marquee cancer trials. In addition to facilitating in-silicon clinical trial design and optimization, TRIALSCOPE may be used to empower synthetic controls, pragmatic trials, post-market surveillance, as well as support fine-grained patient-like-me reasoning in precision diagnosis and treatment.

UniFolding: Towards Sample-efficient, Scalable, and Generalizable Robotic Garment Folding

  • paper_url: http://arxiv.org/abs/2311.01267
  • repo_url: https://github.com/xiaoxiaoxh/UniFolding
  • paper_authors: Han Xue, Yutong Li, Wenqiang Xu, Huanyu Li, Dongzhe Zheng, Cewu Lu
  • for: This paper presents UniFolding, a sample-efficient, scalable, and generalizable robotic system for unfolding and folding various garments.
  • methods: The proposed UFONet neural network integrates unfolding and folding decisions into a single policy model, conditioned on a garment's partial point cloud to aid generalization and reduce sensitivity to variations in texture and shape; training data is collected through a low-cost, human-centric process with offline (Virtual Reality demonstrations) and online (human-in-the-loop fine-tuning) stages.
  • results: The system is tested on two garment types, long-sleeve and short-sleeve shirts, and evaluated on 20 shirts with significant variations in textures, shapes, and materials.
    Abstract This paper explores the development of UniFolding, a sample-efficient, scalable, and generalizable robotic system for unfolding and folding various garments. UniFolding employs the proposed UFONet neural network to integrate unfolding and folding decisions into a single policy model that is adaptable to different garment types and states. The design of UniFolding is based on a garment's partial point cloud, which aids in generalization and reduces sensitivity to variations in texture and shape. The training pipeline prioritizes low-cost, sample-efficient data collection. Training data is collected via a human-centric process with offline and online stages. The offline stage involves human unfolding and folding actions via Virtual Reality, while the online stage utilizes human-in-the-loop learning to fine-tune the model in a real-world setting. The system is tested on two garment types: long-sleeve and short-sleeve shirts. Performance is evaluated on 20 shirts with significant variations in textures, shapes, and materials. More experiments and videos can be found in the supplementary materials and on the website: https://unifolding.robotflow.ai

Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations

  • paper_url: http://arxiv.org/abs/2311.01260
  • repo_url: None
  • paper_authors: Hanglei Zhang, Yiwei Guo, Sen Liu, Xie Chen, Kai Yu
  • for: This study aims to provide a controllable expressive TTS model without requiring large amounts of style-annotated data.
  • methods: FreeStyleTTS (FS-TTS) uses a large language model (LLM) to turn expressive TTS into a style-retrieval task: the LLM selects the best-matching style references from annotated utterances based on external style prompts (raw input text or natural-language style descriptions), and the selected reference guides the TTS pipeline.
  • results: Experiments on a Mandarin storytelling corpus show that FS-TTS leverages the LLM's semantic inference ability to retrieve the desired style from either the input text or user-defined descriptions, producing synthetic speech closely aligned with the specified style.
    Abstract Expressive text-to-speech (TTS) aims to synthesize speeches with human-like tones, moods, or even artistic attributes. Recent advancements in expressive TTS empower users with the ability to directly control synthesis style through natural language prompts. However, these methods often require excessive training with a significant amount of style-annotated data, which can be challenging to acquire. Moreover, they may have limited adaptability due to fixed style annotations. In this work, we present FreeStyleTTS (FS-TTS), a controllable expressive TTS model with minimal human annotations. Our approach utilizes a large language model (LLM) to transform expressive TTS into a style retrieval task. The LLM selects the best-matching style references from annotated utterances based on external style prompts, which can be raw input text or natural language style descriptions. The selected reference guides the TTS pipeline to synthesize speeches with the intended style. This innovative approach provides flexible, versatile, and precise style control with minimal human workload. Experiments on a Mandarin storytelling corpus demonstrate FS-TTS's proficiency in leveraging LLM's semantic inference ability to retrieve desired styles from either input text or user-defined descriptions. This results in synthetic speeches that are closely aligned with the specified styles.
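A minimal sketch of the style-retrieval step, using a generic text-embedding callable as a stand-in for the LLM-based matching described in the abstract; the reference list, its descriptions, and how the selected reference conditions the TTS pipeline are assumptions for illustration.

```python
import numpy as np

def retrieve_style_reference(style_prompt, references, embed):
    """Pick the annotated utterance whose style description best matches the prompt.

    references: list of (style_description, reference_audio_path)
    embed:      callable mapping text -> 1-D numpy vector (e.g., an embedding model)
    """
    q = embed(style_prompt)
    scores = []
    for desc, _ in references:
        v = embed(desc)
        scores.append(float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-8)))
    best = int(np.argmax(scores))
    return references[best]  # this reference then conditions the TTS pipeline
```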

Formal Methods for Autonomous Systems

  • paper_url: http://arxiv.org/abs/2311.01258
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Tichakorn Wongpiromsarn, Mahsa Ghasemi, Murat Cubuktepe, Georgios Bakirtzis, Steven Carr, Mustafa O. Karabag, Cyrus Neary, Parham Gohari, Ufuk Topcu
  • for: This monograph surveys the current state of the art on applications of formal methods in the autonomous systems domain.
  • methods: It covers correct-by-construction synthesis under various formulations, including closed systems, reactive, and probabilistic settings, as well as bounding the behavior of systems that employ learning under uncertainty and synthesizing systems with monitoring.
  • results: The survey also shows how learning can overcome some limitations of formal methods themselves, and concludes with future directions in reinforcement learning, uncertainty, privacy, explainability of formal methods, and regulation and certification.
    Abstract Formal methods refer to rigorous, mathematical approaches to system development and have played a key role in establishing the correctness of safety-critical systems. The main building blocks of formal methods are models and specifications, which are analogous to behaviors and requirements in system design and give us the means to verify and synthesize system behaviors with formal guarantees. This monograph provides a survey of the current state of the art on applications of formal methods in the autonomous systems domain. We consider correct-by-construction synthesis under various formulations, including closed systems, reactive, and probabilistic settings. Beyond synthesizing systems in known environments, we address the concept of uncertainty and bound the behavior of systems that employ learning using formal methods. Further, we examine the synthesis of systems with monitoring, a mitigation technique for ensuring that once a system deviates from expected behavior, it knows a way of returning to normalcy. We also show how to overcome some limitations of formal methods themselves with learning. We conclude with future directions for formal methods in reinforcement learning, uncertainty, privacy, explainability of formal methods, and regulation and certification.

  • paper_url: http://arxiv.org/abs/2311.01256
  • repo_url: None
  • paper_authors: Sinan Gultekin, Achille Globo, Andrea Zugarini, Marco Ernandes, Leonardo Rigutini
  • for: This paper evaluates large language models (LLMs) against traditional approaches (e.g., SVMs) on the LexGLUE benchmark, taking into account not only performance (standard indices) but also alternative metrics such as timing, power consumption, and cost, i.e., the carbon footprint.
  • methods: A detailed quantitative comparison is carried out, evaluating the prototyping phase (model selection via training-validation-test iterations) and the in-production phase separately, since they follow different implementation procedures and require different resources.
  • results: Very often, the simplest algorithms achieve performance close to that of large LLMs but with much lower power consumption and resource demands, suggesting that companies should include such additional evaluations when choosing machine learning (ML) solutions.
    Abstract Most Machine Learning research evaluates the best solutions in terms of performance. However, in the race for the best performing model, many important aspects are often overlooked when, on the contrary, they should be carefully considered. In fact, sometimes the gaps in performance between different approaches are neglectable, whereas factors such as production costs, energy consumption, and carbon footprint must take into consideration. Large Language Models (LLMs) are extensively adopted to address NLP problems in academia and industry. In this work, we present a detailed quantitative comparison of LLM and traditional approaches (e.g. SVM) on the LexGLUE benchmark, which takes into account both performance (standard indices) and alternative metrics such as timing, power consumption and cost, in a word: the carbon-footprint. In our analysis, we considered the prototyping phase (model selection by training-validation-test iterations) and in-production phases separately, since they follow different implementation procedures and also require different resources. The results indicate that very often, the simplest algorithms achieve performance very close to that of large LLMs but with very low power consumption and lower resource demands. The results obtained could suggest companies to include additional evaluations in the choice of Machine Learning (ML) solutions.
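A minimal sketch of the kind of measurement such a comparison requires, using the codecarbon package to track energy and estimated CO2 while training a simple TF-IDF + SVM baseline with scikit-learn; the LexGLUE data loading and the LLM counterpart are omitted, and codecarbon is an assumed tooling choice rather than the paper's stated setup.

```python
from codecarbon import EmissionsTracker
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["the tenant shall pay rent monthly", "the employer terminated the contract"]
labels = [0, 1]  # placeholder labels; LexGLUE tasks would supply real data

tracker = EmissionsTracker(project_name="svm-baseline", log_level="error")
tracker.start()
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)                  # training phase being measured
emissions_kg_co2 = tracker.stop()         # estimated emissions for this run
print(f"estimated emissions: {emissions_kg_co2:.6f} kg CO2eq")
```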

Push it to the Demonstrated Limit: Multimodal Visuotactile Imitation Learning with Force Matching

  • paper_url: http://arxiv.org/abs/2311.01248
  • repo_url: None
  • paper_authors: Trevor Ablett, Oliver Limoyo, Adam Sigal, Affan Jilani, Jonathan Kelly, Kaleem Siddiqi, Francois Hogan, Gregory Dudek
  • for: This paper studies contact-rich robotic manipulation with an optical tactile sensor: a "see-through-your-skin" (STS) sensor that provides both visual and tactile modes via a semi-transparent surface and controllable lighting.
  • methods: Visuotactile sensing is paired with imitation learning: tactile force measurements and a novel algorithm are used during kinesthetic teaching to yield a force profile that better matches the human demonstrator's, visual/tactile STS mode switching is added as a control-policy output, and multiple observation configurations are compared against a wrist-mounted eye-in-hand camera.
  • results: In over 3,000 real test episodes of door opening and closing on a real manipulator, the results highlight the importance of tactile sensing for imitation learning, both for data collection (force matching) and for policy execution (accurate task feedback).
    Abstract Optical tactile sensors have emerged as an effective means to acquire dense contact information during robotic manipulation. A recently-introduced `see-through-your-skin' (STS) variant of this type of sensor has both visual and tactile modes, enabled by leveraging a semi-transparent surface and controllable lighting. In this work, we investigate the benefits of pairing visuotactile sensing with imitation learning for contact-rich manipulation tasks. First, we use tactile force measurements and a novel algorithm during kinesthetic teaching to yield a force profile that better matches that of the human demonstrator. Second, we add visual/tactile STS mode switching as a control policy output, simplifying the application of the sensor. Finally, we study multiple observation configurations to compare and contrast the value of visual/tactile data (both with and without mode switching) with visual data from a wrist-mounted eye-in-hand camera. We perform an extensive series of experiments on a real robotic manipulator with door-opening and closing tasks, including over 3,000 real test episodes. Our results highlight the importance of tactile sensing for imitation learning, both for data collection to allow force matching, and for policy execution to allow accurate task feedback.

FacadeNet: Conditional Facade Synthesis via Selective Editing

  • paper_url: http://arxiv.org/abs/2311.01240
  • repo_url: None
  • paper_authors: Yiangos Georgiou, Marios Loizou, Tom Kelly, Melinos Averkiou
  • for: This paper addresses synthesizing building facade images from diverse viewpoints.
  • methods: A conditional GAN takes a single view of a facade together with the desired viewpoint information and generates the facade from that viewpoint. A selective editing module, driven by image embeddings from a pre-trained vision transformer, precisely modifies view-dependent elements such as windows and doors while preserving view-independent components such as walls.
  • results: Experiments demonstrate state-of-the-art performance on building facade generation, surpassing alternative methods.
    Abstract We introduce FacadeNet, a deep learning approach for synthesizing building facade images from diverse viewpoints. Our method employs a conditional GAN, taking a single view of a facade along with the desired viewpoint information and generates an image of the facade from the distinct viewpoint. To precisely modify view-dependent elements like windows and doors while preserving the structure of view-independent components such as walls, we introduce a selective editing module. This module leverages image embeddings extracted from a pre-trained vision transformer. Our experiments demonstrated state-of-the-art performance on building facade generation, surpassing alternative methods.
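The selective-editing idea can be illustrated with a toy module: predict a per-patch editability mask from frozen ViT patch embeddings and use it to blend view-dependent edits with view-independent content. The shapes, the tiny MLP head, and the blending rule below are assumptions for illustration, not FacadeNet's actual architecture.

```python
import torch
import torch.nn as nn

B, N, D = 2, 196, 768          # batch, ViT patches (14x14 grid), embedding dim
H = W = 14                     # patch grid resolution

patch_emb = torch.randn(B, N, D)        # stand-in for pre-trained ViT embeddings
edited    = torch.randn(B, 3, H, W)     # generator output at patch resolution
preserved = torch.randn(B, 3, H, W)     # reconstruction of the input view

mask_head = nn.Sequential(nn.Linear(D, 128), nn.GELU(), nn.Linear(128, 1))
mask = torch.sigmoid(mask_head(patch_emb))          # (B, N, 1) editability per patch
mask = mask.transpose(1, 2).reshape(B, 1, H, W)     # back to spatial layout

# High mask -> take the view-dependent edit; low mask -> keep original structure.
output = mask * edited + (1.0 - mask) * preserved
print(output.shape)  # torch.Size([2, 3, 14, 14])
```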

  • paper_url: http://arxiv.org/abs/2311.01235
  • repo_url: None
  • paper_authors: Ryen W. White
  • for: This article examines how AI can improve search, helping search systems better support complex search tasks.
  • methods: It considers generative AI and assistive agents (AI copilots) built on this technology as means of helping searchers work through complex tasks.
  • results: The article argues that AI copilots will broadly improve information access and charts a course toward the redesign and future evolution of search systems.
    Abstract As many of us in the information retrieval (IR) research community know and appreciate, search is far from being a solved problem. Millions of people struggle with tasks on search engines every day. Often, their struggles relate to the intrinsic complexity of their task and the failure of search systems to fully understand the task and serve relevant results. The task motivates the search, creating the gap/problematic situation that searchers attempt to bridge/resolve and drives search behavior as they work through different task facets. Complex search tasks require more than support for rudimentary fact finding or re-finding. Research on methods to support complex tasks includes work on generating query and website suggestions, personalizing and contextualizing search, and developing new search experiences, including those that span time and space. The recent emergence of generative artificial intelligence (AI) and the arrival of assistive agents, or copilots, based on this technology, has the potential to offer further assistance to searchers, especially those engaged in complex tasks. There are profound implications from these advances for the design of intelligent systems and for the future of search itself. This article, based on a keynote by the author at the 2023 ACM SIGIR Conference, explores these issues and charts a course toward new horizons in information access guided by AI copilots.

Long Story Short: a Summarize-then-Search Method for Long Video Question Answering

  • paper_url: http://arxiv.org/abs/2311.01233
  • repo_url: None
  • paper_authors: Jiwan Chung, Youngjae Yu
  • for: This paper explores the ability of large language models like GPT-3 to adapt to new tasks without task-specific training data, specifically in the context of long multimodal narratives in multimedia content like drama, movies, and animation.
  • methods: The proposed framework, called Long Story Short, first summarizes the narrative of the video into a short plot and then searches for relevant parts of the video using CLIPCheck.
  • results: The model outperforms state-of-the-art supervised models by a large margin, demonstrating the potential of zero-shot QA for long videos.
    Abstract Large language models such as GPT-3 have demonstrated an impressive capability to adapt to new tasks without requiring task-specific training data. This capability has been particularly effective in settings such as narrative question answering, where the diversity of tasks is immense, but the available supervision data is small. In this work, we investigate if such language models can extend their zero-shot reasoning abilities to long multimodal narratives in multimedia content such as drama, movies, and animation, where the story plays an essential role. We propose Long Story Short, a framework for narrative video QA that first summarizes the narrative of the video to a short plot and then searches parts of the video relevant to the question. We also propose to enhance visual matching with CLIPCheck. Our model outperforms state-of-the-art supervised models by a large margin, highlighting the potential of zero-shot QA for long videos.
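A rough sketch of the visual-matching half of the pipeline, assuming CLIP from Hugging Face Transformers: encode the question and candidate frames, then rank frames by similarity so that answering can focus on the relevant part of the video. The model name, the dummy frames, and the omitted LLM-based plot summarization are assumptions of this sketch, not the paper's exact CLIPCheck procedure.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

question = "Why does the detective return to the warehouse at night?"
# Dummy solid-color frames standing in for sampled video frames.
frames = [Image.new("RGB", (224, 224), color=c) for c in ["black", "gray", "white"]]

inputs = processor(text=[question], images=frames, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image: similarity of each frame to the question text.
scores = out.logits_per_image.squeeze(-1)
topk = torch.topk(scores, k=2).indices.tolist()
print("frames to inspect for this question:", topk)
```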

Multi-Operational Mathematical Derivations in Latent Space

  • paper_url: http://arxiv.org/abs/2311.01230
  • repo_url: https://github.com/neuro-symbolic-ai/latent_mathematical_reasoning
  • paper_authors: Marco Valentino, Jordan Meadows, Lan Zhang, André Freitas
  • for: This paper investigates approximating multiple mathematical operations in latent space to enable expression derivation.
  • methods: The authors introduce several multi-operational representation paradigms that model mathematical operations as explicit geometric transformations. Using a symbolic engine, they construct a large-scale dataset of 1.7M derivation steps stemming from 61K premises and 6 operators, and analyse the properties of each paradigm when instantiated with state-of-the-art neural encoders.
  • results: The multi-operational paradigm is crucial for disentangling different operators, whereas discriminating the conclusions of a single operation is achievable in the original expression encoder. Architectural choices also heavily affect training dynamics, the structural organisation of the latent space, and generalisation, leading to significant variation across paradigms and encoder classes.
    Abstract This paper investigates the possibility of approximating multiple mathematical operations in latent space for expression derivation. To this end, we introduce different multi-operational representation paradigms, modelling mathematical operations as explicit geometric transformations. By leveraging a symbolic engine, we construct a large-scale dataset comprising 1.7M derivation steps stemming from 61K premises and 6 operators, analysing the properties of each paradigm when instantiated with state-of-the-art neural encoders. Specifically, we investigate how different encoding mechanisms can approximate equational reasoning in latent space, exploring the trade-off between learning different operators and specialising within single operations, as well as the ability to support multi-step derivations and out-of-distribution generalisation. Our empirical analysis reveals that the multi-operational paradigm is crucial for disentangling different operators, while discriminating the conclusions for a single operation is achievable in the original expression encoder. Moreover, we show that architectural choices can heavily affect the training dynamics, structural organisation, and generalisation of the latent space, resulting in significant variations across paradigms and classes of encoders.
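The core idea of treating operators as explicit transformations in latent space can be sketched as follows: encode a premise expression into a vector and apply one learned map per operator, so that a multi-step derivation becomes a composition of maps. The GRU encoder, affine operator maps, and toy vocabulary below are illustrative assumptions, not the paper's encoders.

```python
import torch
import torch.nn as nn

VOCAB, DIM, OPS = 64, 128, ["add", "mul", "diff", "integrate"]

class ExpressionEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)

    def forward(self, tokens):                 # tokens: (B, T) integer ids
        h, _ = self.rnn(self.emb(tokens))
        return h[:, -1]                        # (B, DIM) latent premise

# One learned geometric transformation per operator (here: simple affine maps).
operator_maps = nn.ModuleDict({op: nn.Linear(DIM, DIM) for op in OPS})

encoder = ExpressionEncoder()
premise = torch.randint(0, VOCAB, (2, 10))    # a batch of tokenized premises
z = encoder(premise)

# A two-step derivation in latent space: differentiate, then integrate.
z_step1 = operator_maps["diff"](z)
z_step2 = operator_maps["integrate"](z_step1)
print(z_step2.shape)  # torch.Size([2, 128])
```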

Diffusion Models for Reinforcement Learning: A Survey

  • paper_url: http://arxiv.org/abs/2311.01223
  • repo_url: https://github.com/apexrl/diff4rlsurvey
  • paper_authors: Zhengbang Zhu, Hanye Zhao, Haoran He, Yichao Zhong, Shenyu Zhang, Yong Yu, Weinan Zhang
  • for: This paper surveys progress in applying diffusion models to reinforcement learning (RL) and aims to inspire new research directions.
  • methods: It analyses challenges faced by current RL algorithms, then presents a taxonomy of diffusion-based RL methods according to the roles diffusion models play (e.g. trajectory planners, expressive policy classes, data synthesizers) and explains how the existing challenges are addressed.
  • results: The survey reviews successful applications of diffusion models across RL-related tasks, discusses the limitations of current approaches, and outlines future directions such as improving model performance and extending diffusion models to broader tasks. Papers and related resources are maintained at https://github.com/apexrl/Diff4RLSurvey.
    Abstract Diffusion models have emerged as a prominent class of generative models, surpassing previous methods regarding sample quality and training stability. Recent works have shown the advantages of diffusion models in improving reinforcement learning (RL) solutions, including as trajectory planners, expressive policy classes, data synthesizers, etc. This survey aims to provide an overview of the advancements in this emerging field and hopes to inspire new avenues of research. First, we examine several challenges encountered by current RL algorithms. Then, we present a taxonomy of existing methods based on the roles played by diffusion models in RL and explore how the existing challenges are addressed. We further outline successful applications of diffusion models in various RL-related tasks while discussing the limitations of current approaches. Finally, we conclude the survey and offer insights into future research directions, focusing on enhancing model performance and applying diffusion models to broader tasks. We are actively maintaining a GitHub repository for papers and other related resources in applying diffusion models in RL: https://github.com/apexrl/Diff4RLSurvey .

Multi-view Relation Learning for Cross-domain Few-shot Hyperspectral Image Classification

  • paper_url: http://arxiv.org/abs/2311.01212
  • repo_url: https://github.com/henulwy/stbdip
  • paper_authors: Chun Liu, Longwei Yang, Zheng Li, Wei Yang, Zhigang Han, Jianzhong Guo, Junyong Yu
  • for: This paper addresses cross-domain few-shot hyperspectral image classification, where knowledge learned from abundant labelled samples in a source domain must transfer to target-domain tasks with only a few labelled samples.
  • methods: Building on the DCFSL framework, which uses a domain discriminator to handle domain-level distribution differences, the proposed method learns sample relations from different views and incorporates them into model learning: contrastive learning captures class-level sample relations to obtain more discriminative features, and a transformer-based cross-attention module captures set-level relations between query and support samples.
  • results: Experiments show that the multi-view relation learning mechanism improves few-shot hyperspectral image classification compared with state-of-the-art methods.
    Abstract Cross-domain few-shot hyperspectral image classification focuses on learning prior knowledge from a large number of labeled samples from source domain and then transferring the knowledge to the tasks which contain only few labeled samples in target domains. Following the metric-based manner, many current methods first extract the features of the query and support samples, and then directly predict the classes of query samples according to their distance to the support samples or prototypes. The relations between samples have not been fully explored and utilized. Different from current works, this paper proposes to learn sample relations from different views and take them into the model learning process, to improve the cross-domain few-shot hyperspectral image classification. Building on current DCFSL method which adopts a domain discriminator to deal with domain-level distribution difference, the proposed method applys contrastive learning to learn the class-level sample relations to obtain more discriminable sample features. In addition, it adopts a transformer based cross-attention learning module to learn the set-level sample relations and acquire the attentions from query samples to support samples. Our experimental results have demonstrated the contribution of the multi-view relation learning mechanism for few-shot hyperspectral image classification when compared with the state of the art methods.
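The class-level relation learning component is essentially a supervised contrastive objective; a compact sketch is given below, with illustrative feature shapes and temperature. The set-level cross-attention module and the domain discriminator are not reproduced here.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(feats, labels, tau=0.1):
    """Pull same-class embeddings together, push different classes apart."""
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t() / tau                           # (N, N) similarities
    mask_pos = (labels[:, None] == labels[None, :]).float()
    mask_pos.fill_diagonal_(0)                              # exclude self-pairs
    logits = sim - 1e9 * torch.eye(len(feats))              # mask self in the denominator
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_per_row = mask_pos.sum(1).clamp(min=1)
    return -(mask_pos * log_prob).sum(1).div(pos_per_row).mean()

feats = torch.randn(8, 64, requires_grad=True)   # e.g., hyperspectral patch features
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(supervised_contrastive_loss(feats, labels).item())
```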

Attacking Graph Neural Networks with Bit Flips: Weisfeiler and Lehman Go Indifferent

  • paper_url: http://arxiv.org/abs/2311.01205
  • repo_url: None
  • paper_authors: Lorenz Kummer, Samir Moustafa, Nils N. Kriege, Wilfried N. Gansterer
  • for: This paper attacks the weights and biases of graph neural networks (GNNs), rather than relying on traditional graph poisoning or evasion attacks.
  • methods: The authors propose the Injectivity Bit Flip Attack, the first bit flip attack designed specifically for GNNs. It targets the learned neighborhood aggregation functions of quantized message passing neural networks, degrading their ability to distinguish graph structures and destroying the expressivity associated with the Weisfeiler-Lehman test.
  • results: Exploiting mathematical properties specific to certain GNN architectures significantly increases their vulnerability to bit flip attacks: flipping only a small fraction of the bits of maximally expressive Graph Isomorphism Networks trained on various graph property prediction datasets degrades their output to random, demonstrating higher destructive power than bit flip attacks transferred from convolutional neural networks.
    Abstract Prior attacks on graph neural networks have mostly focused on graph poisoning and evasion, neglecting the network's weights and biases. Traditional weight-based fault injection attacks, such as bit flip attacks used for convolutional neural networks, do not consider the unique properties of graph neural networks. We propose the Injectivity Bit Flip Attack, the first bit flip attack designed specifically for graph neural networks. Our attack targets the learnable neighborhood aggregation functions in quantized message passing neural networks, degrading their ability to distinguish graph structures and losing the expressivity of the Weisfeiler-Lehman test. Our findings suggest that exploiting mathematical properties specific to certain graph neural network architectures can significantly increase their vulnerability to bit flip attacks. Injectivity Bit Flip Attacks can degrade the maximal expressive Graph Isomorphism Networks trained on various graph property prediction datasets to random output by flipping only a small fraction of the network's bits, demonstrating its higher destructive power compared to a bit flip attack transferred from convolutional neural networks. Our attack is transparent and motivated by theoretical insights which are confirmed by extensive empirical results.
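A toy illustration of why single bit flips in quantized weights are so damaging: flipping the most-significant bit of an int8 weight changes its magnitude and sign drastically. The weights and quantization scale below are made up; the actual attack selects specific bits to break the injectivity of the aggregation function, which is not modelled here.

```python
import numpy as np

weights_fp = np.array([0.42, -0.13, 0.91, 0.05], dtype=np.float32)
scale = np.abs(weights_fp).max() / 127.0
w_q = np.clip(np.round(weights_fp / scale), -128, 127).astype(np.int8)

attacked = w_q.copy()
bits = attacked.view(np.uint8)       # reinterpret the int8 bytes as raw bits
bits[2] ^= np.uint8(1 << 7)          # flip the sign/most-significant bit of one weight

print("original   :", w_q,      "->", np.round(w_q * scale, 3))
print("bit-flipped:", attacked, "->", np.round(attacked * scale, 3))
```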

Cross-Modal Information-Guided Network using Contrastive Learning for Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2311.01202
  • repo_url: https://github.com/ivanxie416/cmignet
  • paper_authors: Yifan Xie, Jihua Zhu, Shiqi Li, Pengcheng Shi
  • for: This paper proposes a Cross-Modal Information-Guided Network (CMIGNet) for precise and robust point cloud registration.
  • methods: Projected images of the point clouds are incorporated, and the cross-modal features are fused with an attention mechanism. Two contrastive learning strategies are then employed: overlapping contrastive learning, which focuses on features in overlapping regions, and cross-modal contrastive learning, which emphasises correspondences between 2D and 3D features. Finally, a mask prediction module identifies keypoints in the point clouds.
  • results: Extensive experiments on several benchmark datasets show that the network achieves superior registration performance.
    Abstract The majority of point cloud registration methods currently rely on extracting features from points. However, these methods are limited by their dependence on information obtained from a single modality of points, which can result in deficiencies such as inadequate perception of global features and a lack of texture information. Actually, humans can employ visual information learned from 2D images to comprehend the 3D world. Based on this fact, we present a novel Cross-Modal Information-Guided Network (CMIGNet), which obtains global shape perception through cross-modal information to achieve precise and robust point cloud registration. Specifically, we first incorporate the projected images from the point clouds and fuse the cross-modal features using the attention mechanism. Furthermore, we employ two contrastive learning strategies, namely overlapping contrastive learning and cross-modal contrastive learning. The former focuses on features in overlapping regions, while the latter emphasizes the correspondences between 2D and 3D features. Finally, we propose a mask prediction module to identify keypoints in the point clouds. Extensive experiments on several benchmark datasets demonstrate that our network achieves superior registration performance.

Federated Learning on Edge Sensing Devices: A Review

  • paper_url: http://arxiv.org/abs/2311.01201
  • repo_url: None
  • paper_authors: Berrenur Saylam, Özlem Durmaz İncel
  • for: This review examines federated learning on edge sensing devices, addressing the privacy, hardware, and connectivity limitations that conventional machine learning pipelines face when sensor data is analysed on the cloud or a server.
  • methods: It surveys Federated Learning (FL) strategies in which models are trained collaboratively on edge devices without sharing raw data, covering key FL principles, software frameworks, and testbeds.
  • results: The review maps current sensor technologies, the properties of sensing devices, and the sensing applications where FL is used, and concludes with open issues and future research directions.
    Abstract The ability to monitor ambient characteristics, interact with them, and derive information about the surroundings has been made possible by the rapid proliferation of edge sensing devices like IoT, mobile, and wearable devices and their measuring capabilities with integrated sensors. Even though these devices are small and have less capacity for data storage and processing, they produce vast amounts of data. Some example application areas where sensor data is collected and processed include healthcare, environmental (including air quality and pollution levels), automotive, industrial, aerospace, and agricultural applications. These enormous volumes of sensing data collected from the edge devices are analyzed using a variety of Machine Learning (ML) and Deep Learning (DL) approaches. However, analyzing them on the cloud or a server presents challenges related to privacy, hardware, and connectivity limitations. Federated Learning (FL) is emerging as a solution to these problems while preserving privacy by jointly training a model without sharing raw data. In this paper, we review the FL strategies from the perspective of edge sensing devices to get over the limitations of conventional machine learning techniques. We focus on the key FL principles, software frameworks, and testbeds. We also explore the current sensor technologies, properties of the sensing devices and sensing applications where FL is utilized. We conclude with a discussion on open issues and future research directions on FL for further studies
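As a reference point for the surveyed strategies, here is a minimal FedAvg-style round: each edge device updates a model on its private data and only the weighted-averaged parameters leave the device. The linear model and synthetic sensor readings are stand-ins for a real edge workload.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

def local_update(w_global, n_samples):
    """One device: gradient steps on private data, return updated weights."""
    X = rng.normal(size=(n_samples, 2))
    y = X @ true_w + 0.1 * rng.normal(size=n_samples)   # private sensor readings
    w = w_global.copy()
    for _ in range(20):
        grad = 2 * X.T @ (X @ w - y) / n_samples
        w -= 0.05 * grad
    return w, n_samples

w_global = np.zeros(2)
for round_idx in range(5):
    updates = [local_update(w_global, n) for n in (30, 80, 50)]   # three devices
    total = sum(n for _, n in updates)
    w_global = sum(n * w for w, n in updates) / total             # weighted average
    print(f"round {round_idx}: w = {np.round(w_global, 3)}")
```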

AiluRus: A Scalable ViT Framework for Dense Prediction

  • paper_url: http://arxiv.org/abs/2311.01197
  • repo_url: https://github.com/caddyless/ailurus
  • paper_authors: Jin Li, Yaoming Wang, Xiaopeng Zhang, Bowen Shi, Dongsheng Jiang, Chenglin Li, Wenrui Dai, Hongkai Xiong, Qi Tian
  • for: To improve the efficiency of vision transformer (ViT) models on long token sequences, especially in dense prediction tasks that require high-resolution input.
  • methods: Adaptive resolution is applied to different image regions according to their importance. At an intermediate ViT layer, a spatial-aware density-based clustering algorithm selects representative tokens, and the remaining tokens are merged into their nearest representative, so semantically similar tokens form low-resolution regions while informative tokens are preserved at high resolution. Subsequent layers then process a much shorter token sequence, yielding acceleration.
  • results: Evaluated on three datasets, the method accelerates the "Segmenter ViT-L" model by 48% FPS without fine-tuning while maintaining performance. It can also accelerate fine-tuning, saving 52% of training time and achieving a 2.46x FPS speed-up with only a 0.09% performance drop. Code is available at https://github.com/caddyless/ailurus/tree/main.
    Abstract Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance. However, when it comes to handling long token sequences, especially in dense prediction tasks that require high-resolution input, the complexity of ViTs increases significantly. Notably, dense prediction tasks, such as semantic segmentation or object detection, emphasize more on the contours or shapes of objects, while the texture inside objects is less informative. Motivated by this observation, we propose to apply adaptive resolution for different regions in the image according to their importance. Specifically, at the intermediate layer of the ViT, we utilize a spatial-aware density-based clustering algorithm to select representative tokens from the token sequence. Once the representative tokens are determined, we proceed to merge other tokens into their closest representative token. Consequently, semantic similar tokens are merged together to form low-resolution regions, while semantic irrelevant tokens are preserved independently as high-resolution regions. This strategy effectively reduces the number of tokens, allowing subsequent layers to handle a reduced token sequence and achieve acceleration. We evaluate our proposed method on three different datasets and observe promising performance. For example, the "Segmenter ViT-L" model can be accelerated by 48% FPS without fine-tuning, while maintaining the performance. Additionally, our method can be applied to accelerate fine-tuning as well. Experimental results demonstrate that we can save 52% training time while accelerating 2.46 times FPS with only a 0.09% performance drop. The code is available at https://github.com/caddyless/ailurus/tree/main.
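A simplified sketch of the token-reduction step: estimate token density, keep the densest tokens as representatives, and merge every other token into its nearest representative by averaging. The k-nearest-neighbour density proxy and the merge-by-mean rule are assumptions; the paper's spatial-aware clustering and the downstream handling of the reduced sequence are not reproduced.

```python
import torch

def reduce_tokens(x, num_keep=16, k=5):
    """x: (N, D) tokens from one image. Returns (num_keep, D) merged tokens."""
    dist = torch.cdist(x, x)                               # (N, N) pairwise distances
    knn_dist, _ = dist.topk(k + 1, largest=False)          # includes the zero self-distance
    density = -knn_dist[:, 1:].mean(dim=1)                 # denser = smaller mean distance
    reps = density.topk(num_keep).indices                  # representative token ids

    assign = dist[:, reps].argmin(dim=1)                   # nearest representative per token
    merged = torch.zeros(num_keep, x.size(1))
    counts = torch.zeros(num_keep)
    merged.index_add_(0, assign, x)                        # sum tokens into their cluster
    counts.index_add_(0, assign, torch.ones(x.size(0)))
    return merged / counts.clamp(min=1).unsqueeze(1)

tokens = torch.randn(196, 384)        # e.g., 14x14 patch tokens at an intermediate layer
print(reduce_tokens(tokens).shape)    # torch.Size([16, 384])
```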

Batch Bayesian Optimization for Replicable Experimental Design

  • paper_url: http://arxiv.org/abs/2311.01195
  • repo_url: None
  • paper_authors: Zhongxiang Dai, Quoc Phong Nguyen, Sebastian Shenghong Tay, Daisuke Urano, Richalynn Leong, Bryan Kian Hsiang Low, Patrick Jaillet
  • for: This paper proposes the Batch Thompson Sampling for Replicable Experimental Design (BTS-RED) framework for experimental design problems that evaluate multiple conditions in parallel and replicate each condition to cope with large, heteroscedastic observation noise.
  • methods: Three algorithms are proposed: BTS-RED-Known and BTS-RED-Unknown, for known and unknown noise variance respectively, which choose the number of replications adaptively so that noisier inputs are replicated more often, and Mean-Var-BTS-RED, which targets risk-averse optimisation.
  • results: All three algorithms are shown to be asymptotically no-regret, and their effectiveness is demonstrated in two real-world applications, precision agriculture and AutoML.
    Abstract Many real-world experimental design problems (a) evaluate multiple experimental conditions in parallel and (b) replicate each condition multiple times due to large and heteroscedastic observation noise. Given a fixed total budget, this naturally induces a trade-off between evaluating more unique conditions while replicating each of them fewer times vs. evaluating fewer unique conditions and replicating each more times. Moreover, in these problems, practitioners may be risk-averse and hence prefer an input with both good average performance and small variability. To tackle both challenges, we propose the Batch Thompson Sampling for Replicable Experimental Design (BTS-RED) framework, which encompasses three algorithms. Our BTS-RED-Known and BTS-RED-Unknown algorithms, for, respectively, known and unknown noise variance, choose the number of replications adaptively rather than deterministically such that an input with a larger noise variance is replicated more times. As a result, despite the noise heteroscedasticity, both algorithms enjoy a theoretical guarantee and are asymptotically no-regret. Our Mean-Var-BTS-RED algorithm aims at risk-averse optimization and is also asymptotically no-regret. We also show the effectiveness of our algorithms in two practical real-world applications: precision agriculture and AutoML.
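The adaptive-replication idea can be illustrated on a discrete candidate set: Thompson sampling selects a batch of conditions, and each selected condition receives a number of replications proportional to its estimated noise variance. Gaussian posteriors over four arms replace the paper's Gaussian-process model, and all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = np.array([0.2, 0.5, 0.8, 0.4])      # unknown to the algorithm
noise_sd  = np.array([0.05, 0.30, 0.10, 0.20])  # heteroscedastic observation noise

counts = np.ones(4)
means = np.zeros(4)
budget_per_round, min_rep = 12, 1

for t in range(20):
    # Thompson sampling: draw from each condition's posterior over its mean.
    samples = rng.normal(means, 1.0 / np.sqrt(counts))
    batch = np.argsort(samples)[-2:]                     # evaluate two conditions in parallel

    # Allocate replications proportionally to estimated noise variance.
    var_hat = noise_sd[batch] ** 2                       # stand-in for a learned noise model
    reps = np.maximum(min_rep, np.round(budget_per_round * var_hat / var_hat.sum())).astype(int)

    for cond, r in zip(batch, reps):
        obs = rng.normal(true_mean[cond], noise_sd[cond], size=r)
        means[cond] = (means[cond] * counts[cond] + obs.sum()) / (counts[cond] + r)
        counts[cond] += r

print("estimated means:", np.round(means, 2), "| best condition:", int(np.argmax(means)))
```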

Contextual Confidence and Generative AI

  • paper_url: http://arxiv.org/abs/2311.01193
  • repo_url: None
  • paper_authors: Shrey Jain, Zoë Hitzig, Pamela Mishkin
  • for: This paper describes strategies for stabilising communication in the face of the challenges that generative AI models pose to contextual confidence.
  • methods: The strategies, spanning tools, technologies and policies, fall into two broad categories: containment strategies, which aim to reassert context in environments where it is threatened by the context-free expectations and norms established by the internet, and mobilization strategies, which treat the rise of generative AI as an opportunity to proactively set new and higher expectations around privacy and authenticity in mediated communication.
  • results: The paper argues that suitable combinations of these strategies can stabilise communication under the pressure of generative AI and raise expectations of privacy and authenticity in mediated communication.
    Abstract Generative AI models perturb the foundations of effective human communication. They present new challenges to contextual confidence, disrupting participants' ability to identify the authentic context of communication and their ability to protect communication from reuse and recombination outside its intended context. In this paper, we describe strategies--tools, technologies and policies--that aim to stabilize communication in the face of these challenges. The strategies we discuss fall into two broad categories. Containment strategies aim to reassert context in environments where it is currently threatened--a reaction to the context-free expectations and norms established by the internet. Mobilization strategies, by contrast, view the rise of generative AI as an opportunity to proactively set new and higher expectations around privacy and authenticity in mediated communication.

VIGraph: Self-supervised Learning for Class-Imbalanced Node Classification

  • paper_url: http://arxiv.org/abs/2311.01191
  • repo_url: None
  • paper_authors: Yulan Hu, Sheng Ouyang, Zhirui Yang, Yong Liu
  • for: This work tackles class imbalance in graph data to improve node classification performance for minority classes.
  • methods: A self-supervised learning (SSL) approach, VIGraph, is built on the self-supervised Variational Graph Auto-Encoder (VGAE) and uses Variational Inference to generate minority-class nodes, strictly adhering to the imbalance setting when constructing imbalanced graphs. A novel Siamese contrastive strategy at the decoding phase improves the quality of the generated nodes, and no reintegration into the original graph is required, avoiding the "Generating, Reintegrating, and Retraining" loop of SMOTE-based methods.
  • results: Experiments on multiple real-world datasets show that VIGraph generates high-quality minority nodes and achieves promising results on class-imbalanced node classification.
    Abstract Class imbalance in graph data poses significant challenges for node classification. Existing methods, represented by SMOTE-based approaches, partially alleviate this issue but still exhibit limitations during imbalanced scenario construction. Self-supervised learning (SSL) offers a promising solution by synthesizing minority nodes from the data itself, yet its potential remains unexplored. In this paper, we analyze the limitations of SMOTE-based approaches and introduce VIGraph, a novel SSL model based on the self-supervised Variational Graph Auto-Encoder (VGAE) that leverages Variational Inference (VI) to generate minority nodes. Specifically, VIGraph strictly adheres to the concept of imbalance when constructing imbalanced graphs and utilizes the generative VGAE to generate minority nodes. Moreover, VIGraph introduces a novel Siamese contrastive strategy at the decoding phase to improve the overall quality of generated nodes. VIGraph can generate high-quality nodes without reintegrating them into the original graph, eliminating the "Generating, Reintegrating, and Retraining" process found in SMOTE-based methods. Experiments on multiple real-world datasets demonstrate that VIGraph achieves promising results for class-imbalanced node classification tasks.

Revolutionizing Healthcare Image Analysis in Pandemic-Based Fog-Cloud Computing Architectures

  • paper_url: http://arxiv.org/abs/2311.01185
  • repo_url: None
  • paper_authors: Al Zahraa Elsayed, Khalil Mohamed, Hany Harb
  • for: This paper proposes an innovative healthcare architecture that addresses efficiency and accuracy in medical image analysis.
  • methods: The architecture uses fog computing together with a modified Convolutional Neural Network (CNN) for medical image analysis; different CNN layer architectures are explored and evaluated to optimise overall performance.
  • results: Compared against recent models such as VGG16, VGG19, MobileNet and related work, the proposed approach reaches 99.88% accuracy on normal cases, with a 96.5% validation rate, 100% precision and recall, and a 100% F1 score, indicating the potential of fog computing and modified CNNs for healthcare image analysis and diagnosis, both during pandemics and beyond.
    Abstract The emergence of pandemics has significantly emphasized the need for effective solutions in healthcare data analysis. One particular challenge in this domain is the manual examination of medical images, such as X-rays and CT scans. This process is time-consuming and involves the logistical complexities of transferring these images to centralized cloud computing servers. Additionally, the speed and accuracy of image analysis are vital for efficient healthcare image management. This research paper introduces an innovative healthcare architecture that tackles the challenges of analysis efficiency and accuracy by harnessing the capabilities of Artificial Intelligence (AI). Specifically, the proposed architecture utilizes fog computing and presents a modified Convolutional Neural Network (CNN) designed specifically for image analysis. Different architectures of CNN layers are thoroughly explored and evaluated to optimize overall performance. To demonstrate the effectiveness of the proposed approach, a dataset of X-ray images is utilized for analysis and evaluation. Comparative assessments are conducted against recent models such as VGG16, VGG19, MobileNet, and related research papers. Notably, the proposed approach achieves an exceptional accuracy rate of 99.88% in classifying normal cases, accompanied by a validation rate of 96.5%, precision and recall rates of 100%, and an F1 score of 100%. These results highlight the immense potential of fog computing and modified CNNs in revolutionizing healthcare image analysis and diagnosis, not only during pandemics but also in the future. By leveraging these technologies, healthcare professionals can enhance the efficiency and accuracy of medical image analysis, leading to improved patient care and outcomes.
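For orientation, a hypothetical compact CNN of the sort that could run on a fog node for binary X-ray triage is sketched below; the layer sizes are illustrative and are not the modified architecture evaluated in the paper.

```python
import torch
import torch.nn as nn

class SmallXrayCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout),
                                 nn.ReLU(), nn.MaxPool2d(2))
        self.features = nn.Sequential(block(1, 16), block(16, 32), block(32, 64))
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(64, num_classes))

    def forward(self, x):            # x: (B, 1, H, W) grayscale X-ray
        return self.head(self.features(x))

model = SmallXrayCNN()
logits = model(torch.randn(4, 1, 224, 224))
print(logits.shape)                  # torch.Size([4, 2])
```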

Generative Input: Towards Next-Generation Input Methods Paradigm

  • paper_url: http://arxiv.org/abs/2311.01166
  • repo_url: None
  • paper_authors: Keyu Ding, Yongcan Wang, Zihang Xu, Zhenzhen Jia, Shijin Wang, Cong Liu, Enhong Chen
  • for: This paper explores how generative models can improve Chinese input methods.
  • methods: It proposes a novel Generative Input paradigm, GeneInput, which uses prompts to handle all input scenarios and other intelligent auxiliary input functions, and optimises the model with user feedback to deliver personalised results.
  • results: GeneInput achieves state-of-the-art performance for the first time on the Full-mode Key-sequence to Characters (FK2C) task. A novel reward-model training method eliminates the need for additional manual annotation, and performance surpasses GPT-4 on tasks involving intelligent association and conversational assistance. Compared with traditional paradigms, GeneInput also exhibits better robustness, scalability, and online learning capability.
    Abstract Since the release of ChatGPT, generative models have achieved tremendous success and become the de facto approach for various NLP tasks. However, its application in the field of input methods remains under-explored. Many neural network approaches have been applied to the construction of Chinese input method engines(IMEs).Previous research often assumed that the input pinyin was correct and focused on Pinyin-to-character(P2C) task, which significantly falls short of meeting users' demands. Moreover, previous research could not leverage user feedback to optimize the model and provide personalized results. In this study, we propose a novel Generative Input paradigm named GeneInput. It uses prompts to handle all input scenarios and other intelligent auxiliary input functions, optimizing the model with user feedback to deliver personalized results. The results demonstrate that we have achieved state-of-the-art performance for the first time in the Full-mode Key-sequence to Characters(FK2C) task. We propose a novel reward model training method that eliminates the need for additional manual annotations and the performance surpasses GPT-4 in tasks involving intelligent association and conversational assistance. Compared to traditional paradigms, GeneInput not only demonstrates superior performance but also exhibits enhanced robustness, scalability, and online learning capabilities.

Weakly Supervised Semantic Parsing with Execution-based Spurious Program Filtering

  • paper_url: http://arxiv.org/abs/2311.01161
  • repo_url: None
  • paper_authors: Kang-il Lee, Segwang Kim, Kyomin Jung
  • for: This work addresses the problem of spurious programs when training a semantic parser from weak supervision.
  • methods: A domain-agnostic filtering mechanism based on program execution results is proposed: for each program obtained through search, a representation capturing its semantics as execution results under various inputs is constructed, and a majority vote over these representations identifies and filters out programs whose semantics differ significantly from the rest. The method is orthogonal to the program search process, so it can augment any existing weakly supervised semantic parsing framework.
  • results: On Natural Language Visual Reasoning and WikiTableQuestions, applying the method to existing semantic parsers yields significantly improved performance.
    Abstract The problem of spurious programs is a longstanding challenge when training a semantic parser from weak supervision. To eliminate such programs that have wrong semantics but correct denotation, existing methods focus on exploiting similarities between examples based on domain-specific knowledge. In this paper, we propose a domain-agnostic filtering mechanism based on program execution results. Specifically, for each program obtained through the search process, we first construct a representation that captures the program's semantics as execution results under various inputs. Then, we run a majority vote on these representations to identify and filter out programs with significantly different semantics from the other programs. In particular, our method is orthogonal to the program search process so that it can easily augment any of the existing weakly supervised semantic parsing frameworks. Empirical evaluations on the Natural Language Visual Reasoning and WikiTableQuestions demonstrate that applying our method to the existing semantic parsers induces significantly improved performances.
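The execution-based filtering step can be illustrated with toy arithmetic programs: run each candidate on a pool of probe inputs to obtain an execution signature, then keep only the programs whose signature agrees with the majority. Exact signature matching here is a simplification of the paper's representation-based majority vote.

```python
from collections import Counter

# Candidate programs that all produce the correct denotation on the training input x=2.
candidates = {
    "double":        lambda x: x * 2,
    "plus_two":      lambda x: x + 2,      # spurious: right answer for x=2 only
    "times_two_alt": lambda x: x + x,
    "square":        lambda x: x ** 2,     # spurious: right answer for x=2 only
}

probe_inputs = [0, 1, 2, 3, 5, 7]

def signature(program):
    """Execution results under various inputs act as a semantic fingerprint."""
    return tuple(program(x) for x in probe_inputs)

signatures = {name: signature(p) for name, p in candidates.items()}
majority_sig, _ = Counter(signatures.values()).most_common(1)[0]

kept = [name for name, sig in signatures.items() if sig == majority_sig]
print("kept programs:", kept)          # ['double', 'times_two_alt']
```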

A Review of Digital Twins and their Application in Cybersecurity based on Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2311.01154
  • repo_url: None
  • paper_authors: MohammadHossein Homaei, Oscar Mogollon Gutierrez, Jose Carlos Sancho Nunez, Mar Avila Vegas, Andres Caro Lindo
  • for: This review examines digital twin technology, its applications across domains, and the security risks that arise as systems are digitised.
  • methods: It surveys the interaction between digital twins and artificial intelligence tools and investigates how AI can be used to provide cybersecurity for digital twin versions of various industries.
  • results: The review identifies applications and open problems of digital twins, including data privacy and security issues, and serves as a roadmap for researchers and others interested in cybersecurity and digital security.
    Abstract The potential of digital twin technology is yet to be fully realized due to its diversity and untapped potential. Digital twins enable systems' analysis, design, optimization, and evolution to be performed digitally or in conjunction with a cyber-physical approach to improve speed, accuracy, and efficiency over traditional engineering methods. Industry 4.0, factories of the future, and digital twins continue to benefit from the technology and provide enhanced efficiency within existing systems. Due to the lack of information and security standards associated with the transition to cyber digitization, cybercriminals have been able to take advantage of the situation. Access to a digital twin of a product or service is equivalent to threatening the entire collection. There is a robust interaction between digital twins and artificial intelligence tools, which leads to strong interaction between these technologies, so it can be used to improve the cybersecurity of these digital platforms based on their integration with these technologies. This study aims to investigate the role of artificial intelligence in providing cybersecurity for digital twin versions of various industries, as well as the risks associated with these versions. In addition, this research serves as a road map for researchers and others interested in cybersecurity and digital security.

Revisiting the Knowledge Injection Frameworks

  • paper_url: http://arxiv.org/abs/2311.01150
  • repo_url: None
  • paper_authors: Peng Fu, Yiming Zhang, Haobo Wang, Weikang Qiu, Junbo Zhao
  • for: This paper examines how to adapt large language models (LLMs) to vertical, domain-specific tasks by injecting external knowledge.
  • methods: Prior work mostly relies on an alignment heuristic that injects the corresponding knowledge tuple into the associated text sample. The authors find, however, that injecting unaligned (random) knowledge tuples achieves comparable, and sometimes better, results than aligned injection; they investigate this finding across related prior work and provide a chain of potential interpretations.
  • results: Based on this analysis, they propose a simple remedy centred on pruning and purifying the external knowledge base before injection. Integrated into most knowledge injection frameworks and recent LLMs, it overcomes the identified sanity problem and further improves the performance of domain-adaptive LLMs.
    Abstract In recent years, large language models (LLMs), such as GPTs, have attained great impact worldwide. However, how to adapt these LLMs to better suit the vertical domain-specific tasks by utilizing external knowledge remains not completely solved. Indeed, there have emerged a few works on this line where most of them rely on an alignment heuristic that is built to inject the corresponding knowledge tuple into the associated text sample. However, despite the promise, we identify a pivotal problem in this work ubiquitously. Simply put, we find that injecting unaligned (i.e., random) knowledge tuple into the LLMs achieves comparable (and sometimes better) results than the aligned knowledge being injected. We therefore take a thorough investigation of this frustrating finding on a variety of related prior work and further provide a chain of potential interpretations for the phenomenon. Based on all that, we offer a simple remediated technique. Briefly, the core of this technique is rooted in an ideological emphasis on the pruning and purification of the external knowledge base to be injected into LLMs. At last, we show that by integrating this technique into most (if not all) knowledge injection frameworks and recent LLMs, it manages to overcome the aforementioned sanity problem and further pushes the boundary of the performance of the domain-adaptive LLMs.
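The proposed remedy, pruning and purifying the external knowledge base before injection, can be sketched as a relevance filter over knowledge tuples. The lexical-overlap scorer, threshold, and prompt format below are deliberately crude placeholders for whatever scorer and injection template one actually uses.

```python
def relevance(sample, triple):
    """Crude lexical-overlap score between a text sample and a knowledge triple."""
    tokens = set(sample.lower().split())
    triple_tokens = set(" ".join(triple).lower().split())
    return len(tokens & triple_tokens) / max(1, len(triple_tokens))

def inject(sample, knowledge_base, threshold=0.4):
    """Keep only tuples that clear the relevance threshold, then splice them in."""
    kept = [t for t in knowledge_base if relevance(sample, t) >= threshold]
    facts = " ".join(f"({s}, {r}, {o})" for s, r, o in kept)
    return f"{sample} [KNOWLEDGE] {facts}" if kept else sample

kb = [
    ("aspirin", "treats", "headache"),
    ("ibuprofen", "is_a", "NSAID"),
    ("tokyo", "capital_of", "japan"),
]
print(inject("Does aspirin help with a headache ?", kb))
```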

GREEMA: Proposal and Experimental Verification of Growing Robot by Eating Environmental MAterial for Landslide Disaster

  • paper_url: http://arxiv.org/abs/2311.01107
  • repo_url: None
  • paper_authors: Yusuke Tsunoda, Yuya Sato, Koichi Osuka
  • for: This work targets areas inaccessible to humans, such as the lunar surface and landslide sites, where multiple autonomous mobile robot systems must replace human workers. At river channel blockages in particular, robots must remove water and sediment quickly, yet transporting conventional construction machinery to such sites is costly and slow because of its size and weight.
  • methods: The authors propose GREEMA, a novel growing robot that eats environmental material: it is lightweight and compact during transportation, but once on site it actively takes in environmental materials such as water and sediment, uses them as its own structure, and removes them by moving itself.
  • results: Two types of GREEMA are developed and verified experimentally: a fin-type swimming robot that passively absorbs water with a water-absorbing polymer to form a body that can swim, and an arm-type robot that eats soil to increase the rigidity of its body. The results are discussed from the viewpoint of Explicit-Implicit control, and the design theory of GREEMA is described.
    Abstract In areas that are inaccessible to humans, such as the lunar surface and landslide sites, there is a need for multiple autonomous mobile robot systems that can replace human workers. In particular, at landslide sites such as river channel blockages, robots are required to remove water and sediment from the site as soon as possible. Conventionally, several construction machines have been deployed to the site for civil engineering work. However, because of the large size and weight of conventional construction equipment, it is difficult to move multiple units of construction equipment to the site, resulting in significant transportation costs and time. To solve such problems, this study proposes a novel growing robot by eating environmental material called GREEMA, which is lightweight and compact during transportation, but can function by eating on environmental materials once it arrives at the site. GREEMA actively takes in environmental materials such as water and sediment, uses them as its structure, and removes them by moving itself. In this paper, we developed and experimentally verified two types of GREEMAs. First, we developed a fin-type swimming robot that passively takes water into its body using a water-absorbing polymer and forms a body to express its swimming function. Second, we constructed an arm-type robot that eats soil to increase the rigidity of its body. We discuss the results of these two experiments from the viewpoint of Explicit-Implicit control and describe the design theory of GREEMA.

Ultra-Efficient On-Device Object Detection on AI-Integrated Smart Glasses with TinyissimoYOLO

  • paper_url: http://arxiv.org/abs/2311.01057
  • repo_url: None
  • paper_authors: Julian Moosmann, Pietro Bonazzi, Yawei Li, Sizhen Bian, Philipp Mayer, Luca Benini, Michele Magno
  • for: The paper is written for researchers and developers who are interested in integrating AI into smart glasses, specifically those who are looking to achieve prolonged continuous operation with limited battery capacity.
  • methods: The paper describes the design and implementation of tiny machine-learning algorithms that exploit novel low-power processors to enable energy- and latency-efficient object detection on smart glasses. The authors developed a family of novel tiny deep-learning models based on YOLO with sub-million parameters customized for microcontroller-based inference.
  • results: The paper reports that the proposed TinyissimoYOLO models achieve an inference latency of 17ms and energy consumption of 1.59mJ per inference, with acceptable detection accuracy. The end-to-end latency from image capturing to algorithm prediction is 56ms (equivalent to 18 fps), with a total power consumption of 62.9mW, which is equivalent to 9.3 hours of continuous run time on a 154mAh battery. These results outperform MCUNet (TinyNAS+TinyEngine), which achieves a simpler task (image classification) at just 7.3 fps per second.
    Abstract Smart glasses are rapidly gaining advanced functionality thanks to cutting-edge computing technologies, accelerated hardware architectures, and tiny AI algorithms. Integrating AI into smart glasses featuring a small form factor and limited battery capacity is still challenging when targeting full-day usage for a satisfactory user experience. This paper illustrates the design and implementation of tiny machine-learning algorithms exploiting novel low-power processors to enable prolonged continuous operation in smart glasses. We explore the energy- and latency-efficient of smart glasses in the case of real-time object detection. To this goal, we designed a smart glasses prototype as a research platform featuring two microcontrollers, including a novel milliwatt-power RISC-V parallel processor with a hardware accelerator for visual AI, and a Bluetooth low-power module for communication. The smart glasses integrate power cycling mechanisms, including image and audio sensing interfaces. Furthermore, we developed a family of novel tiny deep-learning models based on YOLO with sub-million parameters customized for microcontroller-based inference dubbed TinyissimoYOLO v1.3, v5, and v8, aiming at benchmarking object detection with smart glasses for energy and latency. Evaluations on the prototype of the smart glasses demonstrate TinyissimoYOLO's 17ms inference latency and 1.59mJ energy consumption per inference while ensuring acceptable detection accuracy. Further evaluation reveals an end-to-end latency from image capturing to the algorithm's prediction of 56ms or equivalently 18 fps, with a total power consumption of 62.9mW, equivalent to a 9.3 hours of continuous run time on a 154mAh battery. These results outperform MCUNet (TinyNAS+TinyEngine), which runs a simpler task (image classification) at just 7.3 fps per second.
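The reported 9.3-hour figure follows from simple energy arithmetic, assuming a nominal 3.8 V cell (the cell voltage is not stated in the abstract):

```python
battery_mah, cell_v = 154, 3.8           # 3.8 V nominal cell voltage is an assumption
power_mw = 62.9                          # reported end-to-end power at 18 fps

energy_mwh = battery_mah / 1000 * cell_v * 1000   # 154 mAh * 3.8 V = 585.2 mWh
hours = energy_mwh / power_mw
print(f"{energy_mwh:.0f} mWh / {power_mw} mW = {hours:.1f} h")   # ~9.3 h, matching the paper
```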

Multi-dimensional data refining strategy for effective fine-tuning LLMs

  • paper_url: http://arxiv.org/abs/2311.01049
  • repo_url: None
  • paper_authors: Thanh Nguyen Ngoc, Quang Nhat Tran, Arthur Tang, Bao Nguyen, Thuy Nguyen, Thanh Pham
  • for: This work addresses the data foundation needed for fine-tuning large language models, where acquiring suitable data remains challenging.
  • methods: A multidimensional strategy is used, including leveraging existing datasets in the English language and developing customised data-crawling scripts with the assistance of generative AI tools.
  • results: A Vietnamese language model fine-tuned on the resulting datasets performs well when generating Vietnamese news articles from prompts. The study offers practical solutions and guidance for future fine-tuning of models in languages such as Vietnamese.
    Abstract Data is a cornerstone for fine-tuning large language models, yet acquiring suitable data remains challenging. Challenges encompassed data scarcity, linguistic diversity, and domain-specific content. This paper presents lessons learned while crawling and refining data tailored for fine-tuning Vietnamese language models. Crafting such a dataset, while accounting for linguistic intricacies and striking a balance between inclusivity and accuracy, demands meticulous planning. Our paper presents a multidimensional strategy including leveraging existing datasets in the English language and developing customized data-crawling scripts with the assistance of generative AI tools. A fine-tuned LLM model for the Vietnamese language, which was produced using resultant datasets, demonstrated good performance while generating Vietnamese news articles from prompts. The study offers practical solutions and guidance for future fine-tuning models in languages like Vietnamese.
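A custom data-crawling and refining script of the kind mentioned above might look like the following sketch: fetch a page, keep only reasonably long paragraphs, and apply a cheap language heuristic. The URL, length threshold, and Vietnamese-character check are assumptions, not the authors' scripts.

```python
import re
import requests
from bs4 import BeautifulSoup

# Cheap heuristic: presence of Vietnamese diacritics (an assumption, not a real language ID).
VIETNAMESE_CHARS = re.compile(r"[ăâđêôơưàáảãạèéẻẽẹìíỉĩịòóỏõọùúủũụỳýỷỹỵ]", re.IGNORECASE)

def crawl_paragraphs(url, min_len=80):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    paras = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    # Keep paragraphs that are long enough and look Vietnamese.
    return [p for p in paras if len(p) >= min_len and VIETNAMESE_CHARS.search(p)]

if __name__ == "__main__":
    # Hypothetical article URL for illustration only.
    for para in crawl_paragraphs("https://example.com/some-vietnamese-article")[:3]:
        print(para[:120], "...")
```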

AI-assisted Learning for Electronic Engineering Courses in High Education

  • paper_url: http://arxiv.org/abs/2311.01048
  • repo_url: None
  • paper_authors: Thanh Nguyen Ngoc, Quang Nhat Tran, Arthur Tang, Bao Nguyen, Thuy Nguyen, Thanh Pham
  • for: This paper is written to evaluate the effectiveness of ChatGPT as a teaching and learning support tool in an integrated circuit systems course at a higher education institution in an Asian country.
  • methods: The study uses various question types to assess ChatGPT’s responses and gain valuable insights for further investigation. The study also includes the evaluation and reflection of different stakeholders: students, lecturers, and engineers.
  • results: The findings of this study shed light on the benefits and limitations of ChatGPT as an AI tool, paving the way for innovative learning approaches in technical disciplines. The study contributes to our understanding of how digital transformation is likely to unfold in the education sector.
    Abstract This study evaluates the efficacy of ChatGPT as an AI teaching and learning support tool in an integrated circuit systems course at a higher education institution in an Asian country. Various question types were completed, and ChatGPT responses were assessed to gain valuable insights for further investigation. The objective is to assess ChatGPT's ability to provide insights, personalized support, and interactive learning experiences in engineering education. The study includes the evaluation and reflection of different stakeholders: students, lecturers, and engineers. The findings of this study shed light on the benefits and limitations of ChatGPT as an AI tool, paving the way for innovative learning approaches in technical disciplines. Furthermore, the study contributes to our understanding of how digital transformation is likely to unfold in the education sector.

A Survey of Large Language Models for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2311.01043
  • repo_url: https://github.com/thinklab-sjtu/awesome-llm4ad
  • paper_authors: Zhenjie Yang, Xiaosong Jia, Hongyang Li, Junchi Yan
  • for: This survey examines the application of large language models (LLMs) to autonomous driving, with the aim of improving the interpretability and traceability of autonomous driving systems.
  • methods: It considers combining LLMs with foundation vision models to enable open-world understanding, reasoning, and few-shot learning, capabilities that current autonomous driving systems lack.
  • results: The survey systematically reviews the research line on Large Language Models for Autonomous Driving (LLM4AD), outlines the principal challenges and prospective directions, and provides real-time updates on the latest advances and related open-source resources at https://github.com/Thinklab-SJTU/Awesome-LLM4AD.
    Abstract Autonomous driving technology, a catalyst for revolutionizing transportation and urban mobility, has the tend to transition from rule-based systems to data-driven strategies. Traditional module-based systems are constrained by cumulative errors among cascaded modules and inflexible pre-set rules. In contrast, end-to-end autonomous driving systems have the potential to avoid error accumulation due to their fully data-driven training process, although they often lack transparency due to their ``black box" nature, complicating the validation and traceability of decisions. Recently, large language models (LLMs) have demonstrated abilities including understanding context, logical reasoning, and generating answers. A natural thought is to utilize these abilities to empower autonomous driving. By combining LLM with foundation vision models, it could open the door to open-world understanding, reasoning, and few-shot learning, which current autonomous driving systems are lacking. In this paper, we systematically review a research line about \textit{Large Language Models for Autonomous Driving (LLM4AD)}. This study evaluates the current state of technological advancements, distinctly outlining the principal challenges and prospective directions for the field. For the convenience of researchers in academia and industry, we provide real-time updates on the latest advances in the field as well as relevant open-source resources via the designated link: https://github.com/Thinklab-SJTU/Awesome-LLM4AD.

Learn to Refuse: Making Large Language Models More Controllable and Reliable through Knowledge Scope Limitation and Refusal Mechanism

  • paper_url: http://arxiv.org/abs/2311.01041
  • repo_url: https://github.com/windszzlang/Learn-to-Refuse
  • paper_authors: Lang Cao
  • for: This paper aims to reduce hallucination in large language models, particularly in question answering, by equipping them with a refusal mechanism.
  • methods: A simple yet effective solution called Learn to Refuse (L2R) combines a refusal mechanism with the language model so that it can recognise and decline questions it finds difficult to answer. A structured knowledge base, separate from the LLM and progressively expanded with validated knowledge, provides traceable gold knowledge, and a method for automatically and efficiently expanding this knowledge base is also introduced.
  • results: Qualitative and quantitative analyses show that the approach improves the controllability and reliability of LLMs.
    Abstract Large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, enabling them to answer a wide range of questions across various domains. However, these models are not flawless and often produce responses that contain errors or misinformation. These inaccuracies, commonly referred to as hallucinations, render LLMs unreliable and even unusable in many scenarios. In this paper, our focus is on mitigating the issue of hallucination in LLMs, particularly in the context of question-answering. Instead of attempting to answer all questions, we explore a refusal mechanism that instructs LLMs to refuse to answer challenging questions in order to avoid errors. We then propose a simple yet effective solution called Learn to Refuse (L2R), which incorporates the refusal mechanism to enable LLMs to recognize and refuse to answer questions that they find difficult to address. To achieve this, we utilize a structured knowledge base to represent all the LLM's understanding of the world, enabling it to provide traceable gold knowledge. This knowledge base is separate from the LLM and initially empty, and it is progressively expanded with validated knowledge. When an LLM encounters questions outside its domain, the system recognizes its knowledge scope and determines whether it can answer the question independently. Additionally, we introduce a method for automatically and efficiently expanding the knowledge base of LLMs. Through qualitative and quantitative analysis, we demonstrate that our approach enhances the controllability and reliability of LLMs.
    摘要 大型语言模型(LLM)展现了出色的语言理解和生成能力,能够回答各领域的问题。然而,这些模型并不完美,常常产生包含错误或不实信息的回答,即所谓的"幻觉"。这些幻觉使得 LLM 在许多场景下变得不可靠甚至无法使用。本文专注于缓解 LLM 在问答场景中的幻觉问题。我们不是试图回答所有问题,而是探索一种拒绝机制,让 LLM 在面对困难问题时拒绝回答,从而避免错误。我们进而提出了一种简单而有效的方案 Learn to Refuse (L2R),将拒绝机制融入其中,使 LLM 能够识别并拒绝回答其难以处理的问题。为此,我们利用一个结构化知识库来表示 LLM 对世界的全部理解,使其能够提供可追溯的标准知识。该知识库独立于 LLM,初始为空,并随经过验证的知识逐步扩充。当 LLM 遇到超出其领域的问题时,系统会识别其知识范围,并判断能否独立回答。此外,我们还提出了一种自动且高效地扩充 LLM 知识库的方法。通过定性和定量分析,我们证明该方法能提升 LLM 的可控性和可靠性。
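
The refusal mechanism described above can be illustrated with a minimal sketch (the knowledge base, the lexical retriever, and the threshold below are illustrative assumptions, not the paper's actual components): the model only answers when a question is close to something in its validated knowledge base, and refuses otherwise.

```python
# Minimal sketch of a refusal mechanism in the spirit of Learn to Refuse (L2R).
# The knowledge base, retriever, and threshold are illustrative assumptions.
from difflib import SequenceMatcher

knowledge_base = {
    "Who wrote Hamlet?": "William Shakespeare wrote Hamlet.",
    "What is the boiling point of water at sea level?": "100 degrees Celsius.",
}

def similarity(a: str, b: str) -> float:
    """Crude lexical similarity standing in for a learned retriever."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def answer_or_refuse(question: str, threshold: float = 0.6) -> str:
    # Retrieve the closest validated fact; refuse if nothing is close enough.
    best_q, best_score = max(
        ((q, similarity(question, q)) for q in knowledge_base),
        key=lambda x: x[1],
    )
    if best_score < threshold:
        return "I don't know enough to answer this reliably."
    return knowledge_base[best_q]

if __name__ == "__main__":
    print(answer_or_refuse("Who wrote Hamlet?"))
    print(answer_or_refuse("What is the GDP of Atlantis?"))
```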

ATHENA: Mathematical Reasoning with Thought Expansion

  • paper_url: http://arxiv.org/abs/2311.01036
  • repo_url: https://github.com/the-jb/athena-math
  • paper_authors: JB. Kim, Hazel Kim, Joonghyuk Hahn, Yo-Sub Han
  • for: 解决实际 math 问题需要如何表述问题,模型如何理解人类语言表达。
  • methods: 我们介绍了 Attention-based THought Expansion Network Architecture (ATHENA),它模仿人类思维扩展机制,通过神经网络传播来解决实际实践中的挑战。
  • results: 我们的实验显示,ATHENA 达到了新的 state-of-the-art 水平,即使训练示例的信息量受限,在变体问题上依然表现出色。
    Abstract Solving math word problems depends on how to articulate the problems, the lens through which models view human linguistic expressions. Real-world settings count on such a method even more due to the diverse practices of the same mathematical operations. Earlier works constrain available thinking processes by limited prediction strategies without considering their significance in acquiring mathematical knowledge. We introduce Attention-based THought Expansion Network Architecture (ATHENA) to tackle the challenges of real-world practices by mimicking human thought expansion mechanisms in the form of neural network propagation. A thought expansion recurrently generates the candidates carrying the thoughts of possible math expressions driven from the previous step and yields reasonable thoughts by selecting the valid pathways to the goal. Our experiments show that ATHENA achieves a new state-of-the-art stage toward the ideal model that is compelling in variant questions even when the informativeness in training examples is restricted.
    摘要 解决数学应用题取决于如何表述问题,即模型看待人类语言表达的视角。在现实场景中,同一数学运算有多种多样的表述方式,因而更加依赖这种能力。先前的工作用有限的预测策略限制了可用的思维过程,未考虑其对获取数学知识的重要性。我们提出了基于注意力的思维扩展网络架构(ATHENA),以神经网络传播的形式模拟人类的思维扩展机制,来应对现实场景中的挑战。思维扩展循环地基于上一步生成携带候选数学表达式的"思维",并通过选择通向目标的有效路径得到合理的思维。实验表明,即使训练示例的信息量受限,ATHENA 在变体问题上依然表现出色,达到了新的 state-of-the-art 水平。
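
The "thought expansion" idea can be pictured with a purely symbolic toy: candidate arithmetic expressions are grown step by step from the quantities in a problem, and a thought is kept when it reaches the goal. The operators, the goal check, and the breadth-first expansion below are illustrative stand-ins for ATHENA's neural propagation, not the model itself.

```python
# Toy illustration of iterative thought expansion for math word problems:
# candidates carrying possible math expressions are expanded from the previous
# step, and valid pathways to the goal are selected. A symbolic stand-in only.
from itertools import combinations

def expand_thoughts(quantities, goal, max_steps=3):
    # Each "thought" is (value, expression string).
    frontier = [(float(v), str(v)) for v in quantities]
    for _ in range(max_steps):
        new_thoughts = []
        for (a, ea), (b, eb) in combinations(frontier, 2):
            candidates = [(a + b, f"({ea}+{eb})"), (a - b, f"({ea}-{eb})"),
                          (a * b, f"({ea}*{eb})")]
            if b != 0:
                candidates.append((a / b, f"({ea}/{eb})"))
            new_thoughts.extend(candidates)
        frontier = frontier + new_thoughts
        for value, expr in frontier:
            if abs(value - goal) < 1e-9:
                return expr
    return None

if __name__ == "__main__":
    # "Tom has 3 bags of 5 apples and 2 loose apples; how many in total?" -> 17
    print(expand_thoughts([3, 5, 2], goal=17))
```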

Non-Autoregressive Diffusion-based Temporal Point Processes for Continuous-Time Long-Term Event Prediction

  • paper_url: http://arxiv.org/abs/2311.01033
  • repo_url: None
  • paper_authors: Wang-Tao Zhou, Zhao Kang, Ling Tian
  • for: 预测长期事件序列
  • methods: 基于扩散过程的非自回归模型
  • results: 比基于当前状态最佳的方法提高了预测质量
    Abstract Continuous-time long-term event prediction plays an important role in many application scenarios. Most existing works rely on autoregressive frameworks to predict event sequences, which suffer from error accumulation, thus compromising prediction quality. Inspired by the success of denoising diffusion probabilistic models, we propose a diffusion-based non-autoregressive temporal point process model for long-term event prediction in continuous time. Instead of generating events one at a time in an autoregressive way, our model predicts the future event sequence entirely as a whole. In order to perform diffusion processes on event sequences, we develop a bidirectional map between target event sequences and the Euclidean vector space. Furthermore, we design a novel denoising network to capture both sequential and contextual features for better sample quality. Extensive experiments are conducted to prove the superiority of our proposed model over state-of-the-art methods on long-term event prediction in continuous time. To the best of our knowledge, this is the first work to apply diffusion methods to long-term event prediction problems.
    摘要 连续时间的长期事件预测在许多应用场景中扮演着重要角色。现有工作大多基于自回归框架预测事件序列,这会导致误差积累,从而降低预测质量。受去噪扩散概率模型成功的启发,我们提出了一种基于扩散的非自回归时间点过程模型,用于连续时间的长期事件预测。我们的模型不是以自回归方式逐个生成事件,而是将未来事件序列作为整体一次性预测。为了在事件序列上进行扩散过程,我们构建了目标事件序列与欧氏向量空间之间的双向映射。此外,我们还设计了一种新的去噪网络,以捕捉事件序列中的顺序特征和上下文特征,从而提高样本质量。大量实验证明了所提模型在连续时间长期事件预测任务上优于现有最先进方法。据我们所知,这是首次将扩散方法应用于长期事件预测问题。
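
Running diffusion on event sequences requires the kind of bidirectional map mentioned in the abstract. A minimal sketch of one plausible construction is below; the log-transform of inter-event gaps is an illustrative assumption, not necessarily the paper's parameterization.

```python
# Sketch of a bidirectional map between a continuous-time event sequence and a
# Euclidean vector. The log-transform of inter-event gaps is an illustrative
# choice, not necessarily the parameterization used in the paper.
import numpy as np

EPS = 1e-6

def events_to_vector(arrival_times: np.ndarray) -> np.ndarray:
    """Map strictly increasing arrival times to an unconstrained vector."""
    gaps = np.diff(arrival_times, prepend=0.0)   # positive inter-event times
    return np.log(gaps + EPS)                     # unconstrained reals

def vector_to_events(x: np.ndarray) -> np.ndarray:
    """Inverse map: any real vector decodes to a valid event sequence."""
    gaps = np.exp(x)
    return np.cumsum(gaps)

if __name__ == "__main__":
    times = np.array([0.4, 1.1, 2.5, 2.9])
    z = events_to_vector(times)
    # A denoising model would operate on z; even a noisy z still decodes to an
    # ordered, positive-gap event sequence after the inverse map.
    noisy = z + 0.1 * np.random.randn(*z.shape)
    print(vector_to_events(noisy))
```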

Joint Learning of Local and Global Features for Aspect-based Sentiment Classification

  • paper_url: http://arxiv.org/abs/2311.01030
  • repo_url: None
  • paper_authors: Hao Niu, Yun Xiong, Xiaosu Wang, Philip S. Yu
  • for: 本文主要针对 aspect-based sentiment classification (ASC) 问题,即根据给定的方面词语判断句子中的 sentiment polarity。
  • methods: 本文提出了一种基于 local 和 global 特征的模型,包括 Gaussian 层和 covariance self-attention 层,以及一种 dual-level graph attention 网络。这些方法可以强制地模型 local 和 global 信息,从而更好地解决 ASC 问题。
  • results: 本文在 SemEval 2014 和 Twitter datasets 上 achieved state-of-the-art 性能。
    Abstract Aspect-based sentiment classification (ASC) aims to judge the sentiment polarity conveyed by the given aspect term in a sentence. The sentiment polarity is not only determined by the local context but also related to the words far away from the given aspect term. Most recent efforts related to the attention-based models can not sufficiently distinguish which words they should pay more attention to in some cases. Meanwhile, graph-based models are coming into ASC to encode syntactic dependency tree information. But these models do not fully leverage syntactic dependency trees as they neglect to incorporate dependency relation tag information into representation learning effectively. In this paper, we address these problems by effectively modeling the local and global features. Firstly, we design a local encoder containing: a Gaussian mask layer and a covariance self-attention layer. The Gaussian mask layer tends to adjust the receptive field around aspect terms adaptively to deemphasize the effects of unrelated words and pay more attention to local information. The covariance self-attention layer can distinguish the attention weights of different words more obviously. Furthermore, we propose a dual-level graph attention network as a global encoder by fully employing dependency tag information to capture long-distance information effectively. Our model achieves state-of-the-art performance on both SemEval 2014 and Twitter datasets.
    摘要 基于方面的情感分类(ASC)旨在判断句子中给定方面词所表达的情感极性。情感极性不仅由局部上下文决定,还与距离方面词较远的词语相关。最近大多数基于注意力的模型在某些情况下无法充分区分应当更加关注哪些词。同时,基于图的模型被引入 ASC 以编码句法依存树信息,但这些模型没有充分利用句法依存树,因为它们忽略了将依存关系标签信息有效地融入表示学习。本文通过有效建模局部和全局特征来解决这些问题。首先,我们设计了一个包含高斯掩码层和协方差自注意力层的局部编码器:高斯掩码层能够自适应地调整方面词周围的感受野,以弱化无关词的影响并更多地关注局部信息;协方差自注意力层能够更明显地区分不同词的注意力权重。此外,我们提出了一种双层图注意力网络作为全局编码器,充分利用依存标签信息来有效捕捉长距离信息。我们的模型在 SemEval 2014 和 Twitter 数据集上均取得了 state-of-the-art 性能。
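
The Gaussian mask in the local encoder amounts to weighting tokens by their distance to the aspect term. A small sketch is below; the fixed sigma and the way the mask scales hidden states are assumptions for illustration (in the paper the receptive field is adjusted adaptively).

```python
# Sketch of a Gaussian mask centered on the aspect term, illustrating how a
# local encoder can down-weight distant, likely-unrelated words. The fixed
# sigma and the way the mask scales hidden states are illustrative assumptions.
import torch

def gaussian_mask(seq_len: int, aspect_pos: int, sigma: float = 2.0) -> torch.Tensor:
    positions = torch.arange(seq_len, dtype=torch.float32)
    return torch.exp(-((positions - aspect_pos) ** 2) / (2 * sigma ** 2))

def apply_local_mask(hidden: torch.Tensor, aspect_pos: int, sigma: float = 2.0):
    """hidden: (seq_len, dim). Scale each token's representation by its mask weight."""
    mask = gaussian_mask(hidden.size(0), aspect_pos, sigma)   # (seq_len,)
    return hidden * mask.unsqueeze(-1)

if __name__ == "__main__":
    torch.manual_seed(0)
    hidden = torch.randn(8, 16)                      # token representations
    local = apply_local_mask(hidden, aspect_pos=1)   # aspect term at position 1
    print(gaussian_mask(8, aspect_pos=1))
```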

Distance-Based Propagation for Efficient Knowledge Graph Reasoning

  • paper_url: http://arxiv.org/abs/2311.01024
  • repo_url: https://github.com/harryshomer/tagnet
  • paper_authors: Harry Shomer, Yao Ma, Juanhui Li, Bo Wu, Charu C. Aggarwal, Jiliang Tang
  • for: 这个论文的目的是解决知识 graphs(KGs)中的新的边预测问题,以便发现新的事实。
  • methods: 这些方法使用路径信息的汇集来解决这个问题,但它们受到效率问题的困扰。虽有一些最近的尝试通过学习路径剪辑来解决这个问题,但它们通常会牺牲性能来换取效率。
  • results: 本文提出了一种新的方法TAGNet,可以高效地传播信息。这是通过只在每个源-目标对的固定窗口内汇集路径来实现的。我们示出了TAGNet的复杂性与层数无关。经验表明,TAGNet可以在多个KG数据集上剪枝90%的消息,同时保持与其他方法的竞争性。代码可以在https://github.com/HarryShomer/TAGNet上获取。
    Abstract Knowledge graph completion (KGC) aims to predict unseen edges in knowledge graphs (KGs), resulting in the discovery of new facts. A new class of methods have been proposed to tackle this problem by aggregating path information. These methods have shown tremendous ability in the task of KGC. However they are plagued by efficiency issues. Though there are a few recent attempts to address this through learnable path pruning, they often sacrifice the performance to gain efficiency. In this work, we identify two intrinsic limitations of these methods that affect the efficiency and representation quality. To address the limitations, we introduce a new method, TAGNet, which is able to efficiently propagate information. This is achieved by only aggregating paths in a fixed window for each source-target pair. We demonstrate that the complexity of TAGNet is independent of the number of layers. Extensive experiments demonstrate that TAGNet can cut down on the number of propagated messages by as much as 90% while achieving competitive performance on multiple KG datasets. The code is available at https://github.com/HarryShomer/TAGNet.
    摘要 知识图完成(KGC)目标是预测知识图(KG)中未被观测到的边,从而发现新的事实。一些新的方法已经被提出来解决这个问题,它们通过聚合路径信息来实现。这些方法在KGC任务中表现出了惊人的能力,但它们受到效率问题的困扰。虽然有一些最近的尝试通过学习路径剪枝来解决这个问题,但它们经常牺牲性能来获得效率。在本工作中,我们发现了这类方法影响效率和表示质量的两个内在限制。为了解决这些限制,我们提出了一种新的方法 TAGNet,它可以有效地传播信息。这是通过在每个源-目标对的固定窗口内聚合路径来实现的。我们证明 TAGNet 的复杂度独立于层数。广泛的实验表明,TAGNet 可以将传播的消息数量减少多达 90%,同时在多个知识图数据集上实现有竞争力的性能。代码可以在 https://github.com/HarryShomer/TAGNet 获取。
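
The message-pruning effect of aggregating only paths inside a fixed window for a source-target pair can be illustrated with a toy enumeration; the graph and the notion of a path "message" below are assumptions, and the real model propagates learned representations rather than enumerating paths explicitly.

```python
# Toy sketch of fixed-window path aggregation for a (source, target) pair:
# only paths whose length lies in [lo, hi] contribute, which is how messages
# get pruned. The graph and the per-path "message" are illustrative only.
from collections import defaultdict

def paths_in_window(adj, source, target, lo, hi):
    """Enumerate simple paths from source to target with length in [lo, hi]."""
    results = []
    def dfs(node, path):
        if len(path) - 1 > hi:
            return
        if node == target and lo <= len(path) - 1 <= hi:
            results.append(list(path))
        for nxt in adj[node]:
            if nxt not in path:            # simple paths only
                path.append(nxt)
                dfs(nxt, path)
                path.pop()
    dfs(source, [source])
    return results

if __name__ == "__main__":
    adj = defaultdict(list, {
        "a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"],
    })
    # Aggregate (e.g. sum of per-edge messages) only over paths of length 2..3.
    for p in paths_in_window(adj, "a", "e", lo=2, hi=3):
        print(" -> ".join(p))
```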

Augmentation is AUtO-Net: Augmentation-Driven Contrastive Multiview Learning for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.01023
  • repo_url: None
  • paper_authors: Yanming Guo
  • for: 这篇论文旨在利用深度学习分割算法提升医疗影像诊断的视觉能力,特别是针对视网膜血管分割任务。
  • methods: 这篇论文提出了一个对比式多视图学习框架和混合网络架构,并在卷积神经网络中引入注意力机制,以捕捉视网膜血管复杂的连续曲线结构。
  • results: 这篇论文在 CHASE-DB1 数据集上进行验证,取得了 83.46% 的 F1 分数和 71.62% 的交并比(IoU),均高于现有的基准方法。此外,论文还指出了现有方法的两个主要限制,即数据规模受限和对高计算资源的依赖。
    Abstract The utilisation of deep learning segmentation algorithms that learn complex organs and tissue patterns and extract essential regions of interest from the noisy background to improve the visual ability for medical image diagnosis has achieved impressive results in Medical Image Computing (MIC). This thesis focuses on retinal blood vessel segmentation tasks, providing an extensive literature review of deep learning-based medical image segmentation approaches while comparing the methodologies and empirical performances. The work also examines the limitations of current state-of-the-art methods by pointing out the two significant existing limitations: data size constraints and the dependency on high computational resources. To address such problems, this work proposes a novel efficient, simple multiview learning framework that contrastively learns invariant vessel feature representation by comparing with multiple augmented views by various transformations to overcome data shortage and improve generalisation ability. Moreover, the hybrid network architecture integrates the attention mechanism into a Convolutional Neural Network to further capture complex continuous curvilinear vessel structures. The result demonstrates the proposed method validated on the CHASE-DB1 dataset, attaining the highest F1 score of 83.46% and the highest Intersection over Union (IOU) score of 71.62% with UNet structure, surpassing existing benchmark UNet-based methods by 1.95% and 2.8%, respectively. The combination of the metrics indicates the model detects the vessel object accurately with a highly coincidental location with the ground truth. Moreover, the proposed approach could be trained within 30 minutes by consuming less than 3 GB GPU RAM, and such characteristics support the efficient implementation for real-world applications and deployments.
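
The contrastive part of the framework pulls together features of augmented views of the same image and pushes apart views of different images. A compact sketch follows; an InfoNCE-style loss and a linear placeholder encoder are assumed here in place of the paper's exact formulation and U-Net-style backbone.

```python
# Sketch of a contrastive multiview objective: embeddings of two augmented
# views of the same image are positives, other images in the batch negatives.
# An InfoNCE-style loss is assumed as a stand-in for the paper's formulation.
import torch
import torch.nn.functional as F

def multiview_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1):
    """z1, z2: (batch, dim) features of two augmented views of the same batch."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                    # (batch, batch) similarities
    targets = torch.arange(z1.size(0))            # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    torch.manual_seed(0)
    encoder = torch.nn.Linear(32, 16)             # placeholder for an image encoder
    x = torch.randn(8, 32)
    view1 = x + 0.05 * torch.randn_like(x)        # stand-ins for augmentations
    view2 = x + 0.05 * torch.randn_like(x)
    print(float(multiview_contrastive_loss(encoder(view1), encoder(view2))))
```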

NeuroWrite: Predictive Handwritten Digit Classification using Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2311.01022
  • repo_url: None
  • paper_authors: Kottakota Asish, P. Sarath Teja, R. Kishan Chander, Dr. D. Deva Hema
  • for: 这篇文章是为了探讨一种基于深度神经网络的手写数字识别方法,即NeuroWrite。
  • methods: 这篇文章详细介绍了 NeuroWrite 的数据准备、网络设计和训练方法,并结合卷积神经网络(CNN)和循环神经网络(RNN)等现代技术,以提高模型的准确性和泛化能力。
  • results: 结果显示,NeuroWrite 在手写数字的识别与分类上表现出色,具有高准确率和良好的泛化能力。文章还探讨了其在实际应用中的潜力,包括数字化文档中的数字识别、签名验证和自动邮政编码识别等。
    Abstract The rapid evolution of deep neural networks has revolutionized the field of machine learning, enabling remarkable advancements in various domains. In this article, we introduce NeuroWrite, a unique method for predicting the categorization of handwritten digits using deep neural networks. Our model exhibits outstanding accuracy in identifying and categorising handwritten digits by utilising the strength of convolutional neural networks (CNNs) and recurrent neural networks (RNNs).In this article, we give a thorough examination of the data preparation methods, network design, and training methods used in NeuroWrite. By implementing state-of-the-art techniques, we showcase how NeuroWrite can achieve high classification accuracy and robust generalization on handwritten digit datasets, such as MNIST. Furthermore, we explore the model's potential for real-world applications, including digit recognition in digitized documents, signature verification, and automated postal code recognition. NeuroWrite is a useful tool for computer vision and pattern recognition because of its performance and adaptability.The architecture, training procedure, and evaluation metrics of NeuroWrite are covered in detail in this study, illustrating how it can improve a number of applications that call for handwritten digit classification. The outcomes show that NeuroWrite is a promising method for raising the bar for deep neural network-based handwritten digit recognition.
    摘要 深度神经网络的快速演化革新了机器学习领域,使各个领域取得了前所未有的进步。在这篇文章中,我们介绍了一种名为 NeuroWrite 的手写数字分类预测方法,它利用卷积神经网络(CNN)和循环神经网络(RNN)的优势来识别和分类手写数字。我们详细介绍了 NeuroWrite 的数据准备方法、网络设计和训练方法,并通过采用最新技术,展示了 NeuroWrite 在手写数字数据集(如 MNIST)上的高分类精度和稳健的泛化能力。此外,我们还探讨了该模型在实际应用中的潜力,包括数字化文档中的数字识别、签名验证和自动邮政编码识别。NeuroWrite 因其性能和适应性而成为计算机视觉和模式识别中的有用工具。本文还详细介绍了 NeuroWrite 的架构、训练过程和评价指标,展示了它如何改进各种需要手写数字分类的应用。结果表明,NeuroWrite 是一种有前途的方法,能够提升基于深度神经网络的手写数字识别水平。
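
A minimal convolutional classifier for 28x28 digit images illustrates the kind of CNN backbone the article builds on; the layer sizes below are illustrative and are not the exact NeuroWrite architecture.

```python
# Minimal convolutional classifier for 28x28 handwritten digits. Layer sizes
# are illustrative; this is not the exact NeuroWrite architecture.
import torch
import torch.nn as nn

class DigitCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):                       # x: (batch, 1, 28, 28)
        h = self.features(x)
        return self.classifier(h.flatten(1))

if __name__ == "__main__":
    model = DigitCNN()
    logits = model(torch.randn(4, 1, 28, 28))
    print(logits.shape)                         # torch.Size([4, 10])
```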

Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion

  • paper_url: http://arxiv.org/abs/2311.01017
  • repo_url: None
  • paper_authors: Lunjun Zhang, Yuwen Xiong, Ze Yang, Sergio Casas, Rui Hu, Raquel Urtasun
  • for: This paper aims to improve the efficiency and effectiveness of world modeling for robotic applications such as autonomous driving.
  • methods: The proposed approach uses a novel combination of VQVAE and discrete diffusion to tokenize and predict the future of sensor observations.
  • results: The proposed method achieves significant improvements in reducing prior SOTA Chamfer distance for 1s and 3s predictions on three datasets (NuScenes, KITTI Odometry, and Argoverse2). Specifically, it reduces the Chamfer distance by more than 65% for 1s predictions and more than 50% for 3s predictions.
    Abstract Learning world models can teach an agent how the world works in an unsupervised manner. Even though it can be viewed as a special case of sequence modeling, progress for scaling world models on robotic applications such as autonomous driving has been somewhat less rapid than scaling language models with Generative Pre-trained Transformers (GPT). We identify two reasons as major bottlenecks: dealing with complex and unstructured observation space, and having a scalable generative model. Consequently, we propose a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion. To efficiently decode and denoise tokens in parallel, we recast Masked Generative Image Transformer into the discrete diffusion framework with a few simple changes, resulting in notable improvement. When applied to learning world models on point cloud observations, our model reduces prior SOTA Chamfer distance by more than 65% for 1s prediction, and more than 50% for 3s prediction, across NuScenes, KITTI Odometry, and Argoverse2 datasets. Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotic agents.
    摘要 学习世界模型可以以无监督的方式教会智能体世界的运作方式。尽管它可以视为序列建模的一种特例,但在自动驾驶等机器人应用上扩展世界模型的进展,要慢于使用生成式预训练 Transformer(GPT)扩展语言模型的进展。我们认为有两个主要瓶颈:处理复杂且非结构化的观测空间,以及拥有可扩展的生成模型。因此,我们提出了一种新的世界建模方法:首先使用 VQVAE 将传感器观测标记化(tokenize),再通过离散扩散预测未来。为了高效地并行解码和去噪 token,我们通过少量简单修改将 Masked Generative Image Transformer 重新纳入离散扩散框架,从而取得了明显的改进。当应用于点云观测上的世界模型学习时,我们的模型在 NuScenes、KITTI Odometry 和 Argoverse2 数据集上将先前 SOTA 的 Chamfer 距离在 1 秒预测中降低 65% 以上,在 3 秒预测中降低 50% 以上。我们的结果表明,对标记化的智能体经验应用离散扩散,能够为机器人智能体释放类 GPT 的无监督学习能力。
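
Recasting a masked generative transformer as discrete diffusion yields a parallel decode-and-denoise loop: all tokens start masked, and at each step the most confident predictions are committed while a shrinking fraction stays masked. The schematic below uses a random stand-in "model", and the cosine schedule and confidence rule are assumptions rather than the paper's exact recipe.

```python
# Schematic of parallel iterative unmasking (masked generative transformer
# viewed as discrete diffusion): start fully masked, commit the most confident
# token predictions each step, re-mask the rest. The "model" is a random
# stand-in; the cosine schedule and confidence rule are illustrative.
import math
import torch

VOCAB, LENGTH, STEPS, MASK = 512, 16, 8, -1

def toy_model(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for a transformer returning per-position logits."""
    return torch.randn(tokens.size(0), VOCAB)

def parallel_unmask():
    tokens = torch.full((LENGTH,), MASK, dtype=torch.long)
    for step in range(STEPS):
        probs = toy_model(tokens).softmax(-1)
        conf, pred = probs.max(-1)
        # Already-committed positions are never re-masked.
        conf = torch.where(tokens == MASK, conf, torch.full_like(conf, float("inf")))
        # Cosine schedule: fraction of positions still masked after this step.
        keep_masked = math.floor(LENGTH * math.cos(math.pi / 2 * (step + 1) / STEPS))
        tokens = torch.where(tokens == MASK, pred, tokens)
        order = conf.argsort()                   # least confident first
        tokens[order[:keep_masked]] = MASK
        print(f"step {step}: {(tokens == MASK).sum().item()} masked")
    return tokens

if __name__ == "__main__":
    torch.manual_seed(0)
    print(parallel_unmask())
```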

Revamping AI Models in Dermatology: Overcoming Critical Challenges for Enhanced Skin Lesion Diagnosis

  • paper_url: http://arxiv.org/abs/2311.01009
  • repo_url: None
  • paper_authors: Deval Mehta, Brigid Betz-Stablein, Toan D Nguyen, Yaniv Gal, Adrian Bowling, Martin Haskett, Maithili Sashindranath, Paul Bonnington, Victoria Mar, H Peter Soyer, Zongyuan Ge
  • for: 针对皮肤病变的诊断图像分析领域的深度学习模型的开发呈现了明显的增长趋势,然而这些模型在临床实践中受到一些挑战。现有的皮肤科AI模型具有一些局限性,如有限的诊断输出数量、对不常见皮肤病变的测试不充分、无法检测不符合分布图像等。
  • methods: 我们提出了一种一体化的层次-分布外-临床分诊(Hierarchical-Out of Distribution-Clinical Triage,HOT)模型,用于诊断皮肤病变。该模型对一张临床图像给出三种输出:层次预测、对分布外图像发出警告,以及在仅凭临床图像不足以诊断时建议进行皮肤镜检查。当采纳该建议时,模型会整合临床图像和皮肤镜图像,给出最终诊断。
  • results: 我们在一个代表性的皮肤病变数据集上进行了广泛的实验,并证明了我们的框架中每个组件的有效性和互补性。我们的多功能模型为皮肤病变诊断提供了有价值的决策支持,并为医学AI应用领域设置了一个可喜的先例。
    Abstract The surge in developing deep learning models for diagnosing skin lesions through image analysis is notable, yet their clinical adoption faces challenges. Current dermatology AI models have limitations: limited number of possible diagnostic outputs, lack of real-world testing on uncommon skin lesions, inability to detect out-of-distribution images, and over-reliance on dermoscopic images. To address these, we present an All-In-One Hierarchical-Out of Distribution-Clinical Triage (HOT) model. For a clinical image, our model generates three outputs: a hierarchical prediction, an alert for out-of-distribution images, and a recommendation for dermoscopy if clinical image alone is insufficient for diagnosis. When the recommendation is pursued, it integrates both clinical and dermoscopic images to deliver final diagnosis. Extensive experiments on a representative cutaneous lesion dataset demonstrate the effectiveness and synergy of each component within our framework. Our versatile model provides valuable decision support for lesion diagnosis and sets a promising precedent for medical AI applications.
    摘要 利用图像分析诊断皮肤病变的深度学习模型近年来发展迅猛,但它们在临床落地时面临挑战。目前的皮肤科 AI 模型存在一些局限:可能的诊断输出数量有限、缺乏针对罕见皮肤病变的真实场景测试、无法检测分布外图像,以及过度依赖皮肤镜图像。为了解决这些问题,我们提出了一种一体化的层次-分布外-临床分诊(HOT)模型。对于一张临床图像,该模型生成三种输出:层次预测、对分布外图像的警告,以及在仅凭临床图像不足以诊断时建议进行皮肤镜检查。当采纳该建议时,模型会整合临床图像和皮肤镜图像给出最终诊断。我们在一个具有代表性的皮肤病变数据集上进行了广泛的实验,证明了框架中各组件的有效性和协同作用。我们的多功能模型为皮肤病变诊断提供了有价值的决策支持,并为医疗 AI 应用树立了一个有前景的先例。

Effective Human-AI Teams via Learned Natural Language Rules and Onboarding

  • paper_url: http://arxiv.org/abs/2311.01007
  • repo_url: https://github.com/clinicalml/onboarding_human_ai
  • paper_authors: Hussein Mozannar, Jimin J Lee, Dennis Wei, Prasanna Sattigeri, Subhro Das, David Sontag
  • for: 本研究旨在学习基于数据区域和自然语言描述的人工智能(AI)和人合作规则,以提高人AI团队的准确性。
  • methods: 本研究使用了一种新的区域发现算法,可以在数据空间中找到本地区域,并使用迭代和对比过程将这些区域描述以便人类理解。
  • results: 通过对物体检测和问答任务进行人类学习和评估,研究发现,通过使用本研究提出的方法,人AI团队的准确性可以得到进一步提高。此外,研究还分别评估了区域发现和描述算法的效果。
    Abstract People are relying on AI agents to assist them with various tasks. The human must know when to rely on the agent, collaborate with the agent, or ignore its suggestions. In this work, we propose to learn rules grounded in data regions and described in natural language that illustrate how the human should collaborate with the AI. Our novel region discovery algorithm finds local regions in the data as neighborhoods in an embedding space that corrects the human prior. Each region is then described using an iterative and contrastive procedure where a large language model describes the region. We then teach these rules to the human via an onboarding stage. Through user studies on object detection and question-answering tasks, we show that our method can lead to more accurate human-AI teams. We also evaluate our region discovery and description algorithms separately.
    摘要 人们越来越依靠 AI 代理协助完成各种任务。人类需要知道何时依赖代理、何时与代理协作,以及何时忽略其建议。在这项工作中,我们提出学习以数据区域为基础、并用自然语言描述的规则,来说明人类应如何与 AI 协作。我们新的区域发现算法把数据中的局部区域定位为嵌入空间中能够修正人类先验的邻域。随后,我们通过一种迭代且对比式的过程,用大型语言模型描述每个区域。最后,我们在上手(onboarding)阶段将这些规则教给人类。通过物体检测和问答任务上的用户研究,我们表明该方法能够带来更准确的人类-AI 团队。我们还分别评估了区域发现和区域描述算法。
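
The region-discovery step can be pictured as clustering examples in an embedding space and comparing human versus AI correctness per region, then flagging regions where the default reliance behavior should change. KMeans and the accuracy-gap rule below are simple stand-ins for the paper's algorithm, and the natural-language descriptions are out of scope for this sketch.

```python
# Sketch of discovering local data regions where the human's default reliance
# on the AI should be revised: cluster an embedding space, then compare human
# and AI correctness per region. KMeans and the accuracy-gap rule are simple
# stand-ins for the paper's region-discovery procedure.
import numpy as np
from sklearn.cluster import KMeans

def discover_regions(embeddings, human_correct, ai_correct, n_regions=4, gap=0.2):
    """Return (region labels, regions where the AI beats the human by `gap`)."""
    labels = KMeans(n_clusters=n_regions, n_init=10, random_state=0).fit_predict(embeddings)
    rely_on_ai = []
    for r in range(n_regions):
        idx = labels == r
        if ai_correct[idx].mean() - human_correct[idx].mean() >= gap:
            rely_on_ai.append(r)   # a natural-language description would be generated here
    return labels, rely_on_ai

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(200, 8))
    human = (rng.random(200) < 0.7).astype(float)   # simulated per-example correctness
    ai = (rng.random(200) < 0.8).astype(float)
    labels, regions = discover_regions(emb, human, ai)
    print("rely on AI in regions:", regions)
```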

Sam-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning for Medical Image Captioning

  • paper_url: http://arxiv.org/abs/2311.01004
  • repo_url: None
  • paper_authors: Gaoang Wang, Zhenyu Zhang, Benlu Wang, Weijie Liang, Yizhi Li, Xuechen Guo, Guanhong Wang, Shiyan Li
  • for: 这篇论文旨在提出一种基于深度学习的医疗影像描述方法,以提供更好的诊断建议。
  • methods: 本论文利用 Segment Anything Model(SAM)引导编码,以兼顾整体与细节的特征提取,并采用混合语义学习的独特预训练策略,同时捕捉医疗影像的整体信息和细节。
  • results: 本论文证明了这种方法的有效性:与预训练的 BLIP2 模型相比,它在多项评估指标上表现更佳,能更好地描述医疗影像的内容。
    Abstract With the development of multimodality and large language models, the deep learning-based technique for medical image captioning holds the potential to offer valuable diagnostic recommendations. However, current generic text and image pre-trained models do not yield satisfactory results when it comes to describing intricate details within medical images. In this paper, we present a novel medical image captioning method guided by the segment anything model (SAM) to enable enhanced encoding with both general and detailed feature extraction. In addition, our approach employs a distinctive pre-training strategy with mixed semantic learning to simultaneously capture both the overall information and finer details within medical images. We demonstrate the effectiveness of this approach, as it outperforms the pre-trained BLIP2 model on various evaluation metrics for generating descriptions of medical images.
    摘要 随着多模态和大语言模型的发展,深度学习基于医疗图像描述技术具有诊断建议的潜在价值。然而,当前的通用文本和图像预训练模型无法准确描述医疗图像中的细节。在这篇论文中,我们提出了一种基于segment anything模型(SAM)的新型医疗图像描述方法,以便增强通用特征提取和细节特征提取。此外,我们的方法采用混合semantic学习策略,同时捕捉医疗图像的总体信息和细节信息。我们的实验表明,这种方法可以超过预训练的BLIP2模型,在不同的评价指标上为医疗图像生成描述具有更高的效果。

Robust Data Pruning under Label Noise via Maximizing Re-labeling Accuracy

  • paper_url: http://arxiv.org/abs/2311.01002
  • repo_url: None
  • paper_authors: Dongmin Park, Seola Choi, Doyoung Kim, Hwanjun Song, Jae-Gil Lee
  • for: 通过把大规模训练集裁剪为小而信息量高的子集(数据剪枝),降低深度学习的巨大计算成本。
  • methods: 提出了一种数据剪枝算法 Prune4Rel,通过最大化训练样本邻域的预测置信度之和,来选择最有利于重标注的子集。
  • results: 在四个真实数据集和一个合成噪声数据集上进行了广泛实验,结果显示,相比使用重标注模型的基线方法和使用标准模型的基线方法,Prune4Rel 的泛化性能分别最高提升 9.1% 和 21.6%。
    Abstract Data pruning, which aims to downsize a large training set into a small informative subset, is crucial for reducing the enormous computational costs of modern deep learning. Though large-scale data collections invariably contain annotation noise and numerous robust learning methods have been developed, data pruning for the noise-robust learning scenario has received little attention. With state-of-the-art Re-labeling methods that self-correct erroneous labels while training, it is challenging to identify which subset induces the most accurate re-labeling of erroneous labels in the entire training set. In this paper, we formalize the problem of data pruning with re-labeling. We first show that the likelihood of a training example being correctly re-labeled is proportional to the prediction confidence of its neighborhood in the subset. Therefore, we propose a novel data pruning algorithm, Prune4Rel, that finds a subset maximizing the total neighborhood confidence of all training examples, thereby maximizing the re-labeling accuracy and generalization performance. Extensive experiments on four real and one synthetic noisy datasets show that Prune4Rel outperforms the baselines with Re-labeling models by up to 9.1% as well as those with a standard model by up to 21.6%.
    摘要 数据剪枝旨在把大规模训练集裁剪为小而信息量高的子集,对降低现代深度学习的巨大计算成本至关重要。尽管大规模数据集不可避免地含有标注噪声,且已有许多噪声鲁棒学习方法被提出,但面向噪声鲁棒学习场景的数据剪枝却鲜有研究。当使用最新的重标注(Re-labeling)方法在训练中自我纠正错误标签时,很难判断哪个子集能使整个训练集中的错误标签得到最准确的重标注。在本文中,我们正式定义了带重标注的数据剪枝问题。我们首先证明,一个训练样本被正确重标注的可能性,与其邻域在所选子集中的预测置信度成正比。因此,我们提出了一种新的数据剪枝算法 Prune4Rel,它寻找能最大化全部训练样本邻域置信度总和的子集,从而最大化重标注准确率和泛化性能。在四个真实噪声数据集和一个合成噪声数据集上的大量实验表明,Prune4Rel 相比使用重标注模型的基线方法和使用标准模型的基线方法,分别最高提升 9.1% 和 21.6%。
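
Selecting a subset by neighborhood confidence can be approximated with a very small sketch: score each example by the summed prediction confidence of its k nearest neighbors in feature space and keep the top-scoring fraction. The kNN utility and the plain top-k choice are illustrative simplifications of the paper's objective, not its actual optimization.

```python
# Simplified sketch of confidence-aware data pruning: score each example by the
# summed prediction confidence of its k nearest neighbors, then keep the top
# fraction. An illustrative simplification of the neighborhood-confidence idea.
import numpy as np

def neighborhood_confidence_scores(features, confidences, k=5):
    """features: (n, d); confidences: (n,) model confidence per example."""
    # Pairwise squared distances (fine for a toy-sized n).
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    neighbors = np.argsort(d2, axis=1)[:, :k]        # (n, k) nearest neighbors
    return confidences[neighbors].sum(axis=1)

def prune(features, confidences, keep_ratio=0.5, k=5):
    scores = neighborhood_confidence_scores(features, confidences, k)
    n_keep = int(len(features) * keep_ratio)
    return np.argsort(-scores)[:n_keep]               # indices of the kept subset

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(100, 16))
    conf = rng.random(100)
    kept = prune(feats, conf)
    print(len(kept), "examples kept")
```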

Fully Quantized Always-on Face Detector Considering Mobile Image Sensors

  • paper_url: http://arxiv.org/abs/2311.01001
  • repo_url: None
  • paper_authors: Haechang Lee, Wongi Jeong, Dongil Ryu, Hyunwoo Je, Albert No, Kijeong Kim, Se Young Chun
  • for: 本研究旨在弥合面向移动图像传感器应用的 always-on 人脸检测场景中的差距。
  • methods: 我们提出的模型使用感知传感器特性的合成 RAW 输入,模拟在 ISP 链路"之前"进行的 always-on 人脸检测;并采用三元 (-1, 0, 1) 权重,便于在图像传感器内实现。
  • results: 仿真研究表明,我们的方法兼具合理的人脸检测性能和出色的效率。
    Abstract Despite significant research on lightweight deep neural networks (DNNs) designed for edge devices, the current face detectors do not fully meet the requirements for "intelligent" CMOS image sensors (iCISs) integrated with embedded DNNs. These sensors are essential in various practical applications, such as energy-efficient mobile phones and surveillance systems with always-on capabilities. One noteworthy limitation is the absence of suitable face detectors for the always-on scenario, a crucial aspect of image sensor-level applications. These detectors must operate directly with sensor RAW data before the image signal processor (ISP) takes over. This gap poses a significant challenge in achieving optimal performance in such scenarios. Further research and development are necessary to bridge this gap and fully leverage the potential of iCIS applications. In this study, we aim to bridge the gap by exploring extremely low-bit lightweight face detectors, focusing on the always-on face detection scenario for mobile image sensor applications. To achieve this, our proposed model utilizes sensor-aware synthetic RAW inputs, simulating always-on face detection processed "before" the ISP chain. Our approach employs ternary (-1, 0, 1) weights for potential implementations in image sensors, resulting in a relatively simple network architecture with shallow layers and extremely low-bitwidth. Our method demonstrates reasonable face detection performance and excellent efficiency in simulation studies, offering promising possibilities for practical always-on face detectors in real-world applications.
    摘要 尽管针对边缘设备的轻量级深度神经网络(DNN)已有大量研究,目前的人脸检测器仍未完全满足集成嵌入式 DNN 的智能 CMOS 图像传感器(iCIS)的需求。这类传感器在诸多实际应用中至关重要,例如高能效的手机和具备 always-on 能力的监控系统。一个突出的局限是缺少适用于 always-on 场景的人脸检测器,而这正是图像传感器级应用的关键环节:这类检测器必须在图像信号处理器(ISP)接管之前直接处理传感器 RAW 数据。这一空白使得在此类场景中取得最佳性能十分困难。为弥合这一差距,我们在本研究中探索了面向移动图像传感器应用 always-on 人脸检测场景的极低比特轻量级人脸检测器。我们提出的模型使用感知传感器特性的合成 RAW 输入,模拟在 ISP 链路之前进行的 always-on 人脸检测,并采用三元 (-1, 0, 1) 权重以便于在图像传感器内实现,使网络架构层数浅、位宽极低。仿真研究表明,该方法兼具合理的人脸检测性能和出色的效率,为实际应用中的 always-on 人脸检测器提供了可行的方向。
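
Ternary (-1, 0, 1) weights with a per-tensor scale are the kind of extreme quantization the abstract targets. A small sketch follows; the 0.7*mean(|w|) threshold is the common Ternary Weight Networks heuristic and is an assumption, not necessarily the paper's exact scheme.

```python
# Sketch of ternary (-1, 0, 1) weight quantization with a per-tensor scale.
# The 0.7*mean(|w|) threshold follows the common Ternary Weight Networks
# heuristic and is an assumption, not necessarily the paper's exact scheme.
import torch

def ternarize(w: torch.Tensor):
    """Return (ternary weights in {-1, 0, 1}, scale) approximating w as scale * t."""
    delta = 0.7 * w.abs().mean()
    t = torch.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0
    mask = t != 0
    scale = w[mask].abs().mean() if mask.any() else w.abs().mean()
    return t, scale

if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(64, 64)
    t, s = ternarize(w)
    err = (w - s * t).pow(2).mean().sqrt()
    print(f"unique values: {t.unique().tolist()}, scale: {float(s):.3f}, RMSE: {float(err):.3f}")
```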

Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia

  • paper_url: http://arxiv.org/abs/2311.00998
  • repo_url: https://github.com/exqrch/indonesiannmt
  • paper_authors: Lucky Susanto, Ryandito Diandaru, Adila Krisnadhi, Ayu Purwarianti, Derry Wijaya
  • for: 本研究旨在应对印度尼西亚低资源本地语言神经机器翻译面临的重大挑战,包括缺乏有代表性的基准和数据可用性有限。
  • methods: 本研究考察了多种训练方法、范式和数据规模,并初步研究了利用大语言模型为低资源语言生成合成平行数据。
  • results: 研究发现,尽管计算资源和文本数据有限,我们的多个 NMT 系统仍能达到有竞争力的翻译质量,可与零样本 gpt-3.5-turbo 的翻译质量相当。
    Abstract Neural machine translation (NMT) for low-resource local languages in Indonesia faces significant challenges, including the need for a representative benchmark and limited data availability. This work addresses these challenges by comprehensively analyzing training NMT systems for four low-resource local languages in Indonesia: Javanese, Sundanese, Minangkabau, and Balinese. Our study encompasses various training approaches, paradigms, data sizes, and a preliminary study into using large language models for synthetic low-resource languages parallel data generation. We reveal specific trends and insights into practical strategies for low-resource language translation. Our research demonstrates that despite limited computational resources and textual data, several of our NMT systems achieve competitive performances, rivaling the translation quality of zero-shot gpt-3.5-turbo. These findings significantly advance NMT for low-resource languages, offering valuable guidance for researchers in similar contexts.
    摘要 面向印度尼西亚低资源本地语言的神经机器翻译(NMT)面临重大挑战,包括需要有代表性的基准以及数据可用性有限。本工作通过对四种低资源本地语言(Javanese、Sundanese、Minangkabau 和 Balinese)的 NMT 系统进行全面分析来应对这些挑战。我们的研究涵盖不同的训练方法、范式、数据规模,以及利用大型语言模型为低资源语言生成合成平行数据的初步研究。我们揭示了低资源语言翻译的特定趋势和实用策略。研究表明,尽管计算资源和文本数据有限,我们的多个 NMT 系统仍能达到有竞争力的翻译质量,与零样本 gpt-3.5-turbo 相当。这些发现显著推进了低资源语言的 NMT,为处于类似情境的研究人员提供了有价值的指引。

Optimizing Inventory Routing: A Decision-Focused Learning Approach using Neural Networks

  • paper_url: http://arxiv.org/abs/2311.00983
  • repo_url: None
  • paper_authors: MD Shafikul Islam, Azmine Toushik Wasi
  • for: 解决供应链管理中的货物 Routing 问题 (IRP),这是一个关键的挑战,因为它涉及到最优化的路径选择,同时考虑到货物需求预测的不确定性。
  • methods: 解决 IRP 通常采用两阶段方法:先用机器学习技术预测需求,再用优化算法最小化路径成本。
  • results: 然而,我们发现机器学习模型无法达到完美准确性,因为货物储备水平受到动态商业环境的影响,这将在下一阶段的优化问题中产生不优化的决策。在这篇论文中,我们提出了一种专注于决策的学习基本方法,以解决真实世界中的 IRP。这种方法直接将货物预测和 Routing 优化 integrate 到一个综合系统中,可能提供一个有力的供应链策略。
    Abstract Inventory Routing Problem (IRP) is a crucial challenge in supply chain management as it involves optimizing efficient route selection while considering the uncertainty of inventory demand planning. To solve IRPs, usually a two-stage approach is employed, where demand is predicted using machine learning techniques first, and then an optimization algorithm is used to minimize routing costs. Our experiment shows machine learning models fall short of achieving perfect accuracy because inventory levels are influenced by the dynamic business environment, which, in turn, affects the optimization problem in the next stage, resulting in sub-optimal decisions. In this paper, we formulate and propose a decision-focused learning-based approach to solving real-world IRPs. This approach directly integrates inventory prediction and routing optimization within an end-to-end system potentially ensuring a robust supply chain strategy.
    摘要 库存路径问题(IRP)是供应链管理中的一个重要挑战,因为它需要在考虑库存需求规划不确定性的同时,优化路径选择。通常,解决 IRP 采用两阶段方法:首先使用机器学习技术预测需求,然后使用优化算法最小化路径成本。我们的实验表明,机器学习模型无法达到完美的准确性,因为库存水平受动态商业环境的影响,而这进而影响下一阶段的优化问题,导致次优决策。在这篇论文中,我们提出了一种以决策为中心(decision-focused)的学习方法,用于解决现实世界中的 IRP。该方法将库存预测与路径优化直接整合到一个端到端系统中,有望保障稳健的供应链策略。

An Integrated Framework Integrating Monte Carlo Tree Search and Supervised Learning for Train Timetabling Problem

  • paper_url: http://arxiv.org/abs/2311.00971
  • repo_url: None
  • paper_authors: Feiyu Yang
  • for: 解决单轨铁路列车时间安排问题 (TTP),这是一个重要和复杂的问题。
  • methods: 本文提出了一个集成 Monte Carlo Tree Search (MCTS) 的计算框架,该框架结合启发式方法、无监督学习方法和监督学习方法,在离散动作空间中求解 TTP。
  • results: 实验显示,提出的启发式 MCTS 方法对 TTP 有利,并且将 learners 应用于 MCTS 搜索过程可以提高数据效率。这种方法提供了一个新的 TTP 解决方案。
    Abstract The single-track railway train timetabling problem (TTP) is an important and complex problem. This article proposes an integrated Monte Carlo Tree Search (MCTS) computing framework that combines heuristic methods, unsupervised learning methods, and supervised learning methods for solving TTP in discrete action spaces. This article first describes the mathematical model and simulation system dynamics of TTP, analyzes the characteristics of the solution from the perspective of MCTS, and proposes some heuristic methods to improve MCTS. This article considers these methods as planners in the proposed framework. Secondly, this article utilizes deep convolutional neural networks to approximate the value of nodes and further applies them to the MCTS search process, referred to as learners. The experiment shows that the proposed heuristic MCTS method is beneficial for solving TTP; The algorithm framework that integrates planners and learners can improve the data efficiency of solving TTP; The proposed method provides a new paradigm for solving TTP.
    摘要 单轨铁路列车时刻表编制问题(TTP)是一个重要而复杂的问题。本文提出了一个集成 Monte Carlo Tree Search(MCTS)的计算框架,该框架结合启发式方法、无监督学习方法和监督学习方法,在离散动作空间中求解 TTP。文章首先描述了 TTP 的数学模型和仿真系统动力学,从 MCTS 的角度分析了解的特点,并提出了若干改进 MCTS 的启发式方法;在所提框架中,这些方法被视为规划器(planner)。其次,文章利用深度卷积神经网络近似节点的价值,并将其应用于 MCTS 搜索过程,称为学习器(learner)。实验表明,所提出的启发式 MCTS 方法有利于求解 TTP;集成规划器和学习器的算法框架能够提高求解 TTP 的数据效率;该方法为求解 TTP 提供了一种新的范式。
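
The "planner + learner" pattern can be sketched as a UCT-style tree search whose leaf evaluation is delegated to a learned value estimate instead of a full rollout. The toy environment below (pick digits so their sum hits a target) and the heuristic value function stand in for the timetabling simulator and the trained network; they are assumptions, not the paper's setup.

```python
# Compact sketch of the planner+learner pattern: UCT tree search whose leaf
# evaluation is delegated to a learned/heuristic value estimate rather than a
# full rollout. The toy environment and value function are stand-ins for the
# timetabling simulator and the trained network.
import math

ACTIONS, HORIZON, TARGET = (0, 1, 2), 5, 7

def is_terminal(state): return len(state) == HORIZON
def reward(state): return 1.0 if sum(state) == TARGET else 0.0
def value_estimate(state):
    """Learner: heuristic value in [0, 1]; a trained network would replace this."""
    remaining = HORIZON - len(state)
    need = TARGET - sum(state)
    return 1.0 if 0 <= need <= 2 * remaining else 0.0

class Node:
    def __init__(self, state):
        self.state, self.children, self.visits, self.total = state, {}, 0, 0.0

def uct_select(node, c=1.4):
    return max(node.children.values(),
               key=lambda ch: ch.total / (ch.visits + 1e-9)
               + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))

def simulate(node):
    if is_terminal(node.state):
        val = reward(node.state)
    elif not node.children:                    # expand, then ask the learner
        node.children = {a: Node(node.state + (a,)) for a in ACTIONS}
        val = value_estimate(node.state)
    else:
        val = simulate(uct_select(node))
    node.visits += 1
    node.total += val
    return val

if __name__ == "__main__":
    root = Node(())
    for _ in range(500):
        simulate(root)
    best = max(root.children.items(), key=lambda kv: kv[1].visits)
    print("first action chosen:", best[0])
```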

Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model

  • paper_url: http://arxiv.org/abs/2311.00968
  • repo_url: https://github.com/amaai-lab/video2music
  • paper_authors: Jaeyong Kang, Soujanya Poria, Dorien Herremans
  • for: 这个研究旨在开发一个可生成音乐的 AI 框架,以匹配提供的视频。
  • methods: 研究人员首先筹集了一个独特的音乐视频集,然后分析了这些音乐视频,从而获得了semantic、scene offset、motion和emotion等特征。这些特征被用作音乐生成模型的引导输入。
  • results: 研究人员通过一种名为 Affective Multimodal Transformer (AMT) 的新型模型,使得生成的音乐与视频内容的情感相似。此外,研究人员还使用了一种基于 bigGRU 的回归模型来估算视频特征中的音符密度和响度,以确保生成的和声与视频的匹配性。
    Abstract Numerous studies in the field of music generation have demonstrated impressive performance, yet virtually no models are able to directly generate music to match accompanying videos. In this work, we develop a generative music AI framework, Video2Music, that can match a provided video. We first curated a unique collection of music videos. Then, we analysed the music videos to obtain semantic, scene offset, motion, and emotion features. These distinct features are then employed as guiding input to our music generation model. We transcribe the audio files into MIDI and chords, and extract features such as note density and loudness. This results in a rich multimodal dataset, called MuVi-Sync, on which we train a novel Affective Multimodal Transformer (AMT) model to generate music given a video. This model includes a novel mechanism to enforce affective similarity between video and music. Finally, post-processing is performed based on a biGRU-based regression model to estimate note density and loudness based on the video features. This ensures a dynamic rendering of the generated chords with varying rhythm and volume. In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion. The musical quality, along with the quality of music-video matching is confirmed in a user study. The proposed AMT model, along with the new MuVi-Sync dataset, presents a promising step for the new task of music generation for videos.
    摘要 许多音乐生成研究已经表现出卓越表现,但是几乎没有模型可以直接生成与视频相匹配的音乐。在这个工作中,我们开发了一个生成音乐AI框架,即Video2Music,可以匹配提供的视频。我们首先筹集了一个独特的音乐视频集。然后,我们分析了音乐视频,以获取Semantic、Scene Offset、Motion和Emotion等特征。这些特征被用作音乐生成模型的导入输入。我们将音频文件转译成MIDI和和声,并提取特征 such as note density和 loudness。这结果了一个丰富的多模态数据集,称为MuVi-Sync,在这个数据集上我们训练了一个新的Affective Multimodal Transformer(AMT)模型,以生成音乐给视频。这个模型包括一个新的机制,以保证视频和音乐之间的情感相似性。最后,基于一个biGRU-based regression模型,我们进行了后处理,以估算视频特征基于的音乐的Note density和Loudness。这确保了生成的和声在不同的节奏和音量上进行了动态渲染。在一项全面的实验中,我们证明了我们的提议框架可以根据视频内容生成匹配的音乐,并且音乐质量和音乐-视频匹配质量得到了用户研究的证实。提出的AMT模型,加之新的MuVi-Sync数据集,对于新的音乐生成 для视频任务提出了一个可能的步骤。

Vision-Language Interpreter for Robot Task Planning

  • paper_url: http://arxiv.org/abs/2311.00967
  • repo_url: https://github.com/omron-sinicx/vilain
  • paper_authors: Keisuke Shirai, Cristian C. Beltran-Hernandez, Masashi Hamaya, Atsushi Hashimoto, Shohei Tanaka, Kento Kawaharazuka, Kazutoshi Tanaka, Yoshitaka Ushiku, Shinsuke Mori
  • for: 本研究的目的是提出一个新任务,即多模态规划问题规格化(Multimodal Planning Problem Specification,简称 MPPS),以便在语言引导的框架下驱动符号规划器(symbolic planner)求解问题。
  • methods: 本研究使用了 state-of-the-art 的语言模型和视觉语言模型来生成问题描述(Problem Description,简称PD),并通过Symbolic planner 的反馈来纠正生成的PD。
  • results: 实验结果表明,ViLaIn 可以生成有正确 syntax 的问题描述,并且可以生成有效的机器人计划,其中有效性高于 58%。
    Abstract Large language models (LLMs) are accelerating the development of language-guided robot planners. Meanwhile, symbolic planners offer the advantage of interpretability. This paper proposes a new task that bridges these two trends, namely, multimodal planning problem specification. The aim is to generate a problem description (PD), a machine-readable file used by the planners to find a plan. By generating PDs from language instruction and scene observation, we can drive symbolic planners in a language-guided framework. We propose a Vision-Language Interpreter (ViLaIn), a new framework that generates PDs using state-of-the-art LLM and vision-language models. ViLaIn can refine generated PDs via error message feedback from the symbolic planner. Our aim is to answer the question: How accurately can ViLaIn and the symbolic planner generate valid robot plans? To evaluate ViLaIn, we introduce a novel dataset called the problem description generation (ProDG) dataset. The framework is evaluated with four new evaluation metrics. Experimental results show that ViLaIn can generate syntactically correct problems with more than 99% accuracy and valid plans with more than 58% accuracy.
    摘要 大型语言模型(LLM)正在加速语言引导的机器人规划器的发展,而符号规划器则具有可解释性的优势。本文提出了一个连接这两种趋势的新任务,即多模态规划问题规格化。目标是生成问题描述(PD),即供规划器用于寻找计划的机器可读文件。通过从语言指令和场景观察生成 PD,我们可以在语言引导框架下驱动符号规划器。我们提出了一个名为视觉-语言解释器(Vision-Language Interpreter,ViLaIn)的新框架,它使用最先进的 LLM 和视觉语言模型来生成 PD。ViLaIn 可以利用符号规划器返回的错误消息来修正生成的 PD。我们的目标是回答这个问题:ViLaIn 和符号规划器能够多准确地生成有效的机器人计划?为了评估 ViLaIn,我们构建了一个名为问题描述生成(ProDG)的新数据集,并用四个新的评价指标对框架进行评估。实验结果显示,ViLaIn 能以高于 99% 的准确率生成语法正确的问题描述,并以高于 58% 的准确率生成有效的计划。

IndoToD: A Multi-Domain Indonesian Benchmark For End-to-End Task-Oriented Dialogue Systems

  • paper_url: http://arxiv.org/abs/2311.00958
  • repo_url: https://github.com/dehanalkautsar/indotod
  • paper_authors: Muhammad Dehan Al Kautsar, Rahmah Khoirussyifa’ Nurdini, Samuel Cahyawijaya, Genta Indra Winata, Ayu Purwarianti
  • for: 这个论文主要是为了开发高级语言(如英语和中文)以外的地域语言Task-oriented dialogue(ToD)系统,以拓宽对对话上下文的理解能力。
  • methods: 这篇论文使用了两个英语ToD数据集的泛化,通过去 lexicalization 来减少笔记注释的大小,并雇用了本地母语 speaker 手动翻译对话。
  • results: 这篇论文引入了一个综合多个领域的Indonesian ToDbenchmark,可以用于评估英语和INDONESIAN ToD系统,以及探索跨语言和双语权重学习方法的潜在利器。
    Abstract Task-oriented dialogue (ToD) systems have been mostly created for high-resource languages, such as English and Chinese. However, there is a need to develop ToD systems for other regional or local languages to broaden their ability to comprehend the dialogue contexts in various languages. This paper introduces IndoToD, an end-to-end multi domain ToD benchmark in Indonesian. We extend two English ToD datasets to Indonesian, comprising four different domains by delexicalization to efficiently reduce the size of annotations. To ensure a high-quality data collection, we hire native speakers to manually translate the dialogues. Along with the original English datasets, these new Indonesian datasets serve as an effective benchmark for evaluating Indonesian and English ToD systems as well as exploring the potential benefits of cross-lingual and bilingual transfer learning approaches.
    摘要 高度资源语言如英语和中文的任务对话(ToD)系统已经大多创建,但是有必要开发ToD系统 для其他地区或本地语言,以扩大对对话上下文的理解能力。这篇文章介绍了印度ToD,一个综合多领域的ToD benchmark在印度尼西亚语中。我们将英语ToD数据集扩展到印度尼西亚语,包括四个不同的领域,通过去除语言标记来有效地减少注释大小。为保证高质量数据采集,我们雇佣了本地母语speaker来手动翻译对话。与原始英语数据集一起,这些新的印度尼西亚语数据集将成为评估印度尼西亚语和英语ToD系统的有效 benchamark,以及探索跨语言和双语传输学习方法的潜在优势。

Gaussian Mixture Solvers for Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.00941
  • repo_url: https://github.com/guohanzhong/gms
  • paper_authors: Hanzhong Guo, Cheng Lu, Fan Bao, Tianyu Pang, Shuicheng Yan, Chao Du, Chongxuan Li
  • for: 这个论文主要针对的是 diffusion models 的生成任务中的样本生成问题。
  • methods: 该论文提出了一类新的基于 SDE 的求解器,称为 Gaussian Mixture Solvers (GMS),它在采样的每一步用广义矩方法拟合高斯混合转移核,以在有限的离散化步数下更好地保证样本质量。
  • results: 实验表明,GMS 可以在各种 diffusion models 中提供更高质量的样本,并且在 stroke-based synthesis 等任务中表现更好。
    Abstract Recently, diffusion models have achieved great success in generative tasks. Sampling from diffusion models is equivalent to solving the reverse diffusion stochastic differential equations (SDEs) or the corresponding probability flow ordinary differential equations (ODEs). In comparison, SDE-based solvers can generate samples of higher quality and are suited for image translation tasks like stroke-based synthesis. During inference, however, existing SDE-based solvers are severely constrained by the efficiency-effectiveness dilemma. Our investigation suggests that this is because the Gaussian assumption in the reverse transition kernel is frequently violated (even in the case of simple mixture data) given a limited number of discretization steps. To overcome this limitation, we introduce a novel class of SDE-based solvers called \emph{Gaussian Mixture Solvers (GMS)} for diffusion models. Our solver estimates the first three-order moments and optimizes the parameters of a Gaussian mixture transition kernel using generalized methods of moments in each step during sampling. Empirically, our solver outperforms numerous SDE-based solvers in terms of sample quality in image generation and stroke-based synthesis in various diffusion models, which validates the motivation and effectiveness of GMS. Our code is available at https://github.com/Guohanzhong/GMS.

Bridging the Gap: Addressing Discrepancies in Diffusion Model Training for Classifier-Free Guidance

  • paper_url: http://arxiv.org/abs/2311.00938
  • repo_url: None
  • paper_authors: Niket Patel, Luis Salamanca, Luis Barba
  • for: 本研究旨在探讨Diffusion模型的训练方法和生成结果之间的矛盾,以及如何改进Diffusion模型的生成质量。
  • methods: 本研究使用了一种更新的损失函数,以更好地对准Diffusion模型的训练目标和生成行为。
  • results: 实验结果表明,使用该更新后的损失函数可以生成更高质量的样本,并且可以降低指导缩放参数$w$的选择对生成结果的影响。
    Abstract Diffusion models have emerged as a pivotal advancement in generative models, setting new standards to the quality of the generated instances. In the current paper we aim to underscore a discrepancy between conventional training methods and the desired conditional sampling behavior of these models. While the prevalent classifier-free guidance technique works well, it's not without flaws. At higher values for the guidance scale parameter $w$, we often get out of distribution samples and mode collapse, whereas at lower values for $w$ we may not get the desired specificity. To address these challenges, we introduce an updated loss function that better aligns training objectives with sampling behaviors. Experimental validation with FID scores on CIFAR-10 elucidates our method's ability to produce higher quality samples with fewer sampling timesteps, and be more robust to the choice of guidance scale $w$. We also experiment with fine-tuning Stable Diffusion on the proposed loss, to provide early evidence that large diffusion models may also benefit from this refined loss function.
    摘要 扩散模型已成为生成模型领域的一项关键进展,为生成样本的质量树立了新标准。在本文中,我们着重指出常规训练方法与这些模型所期望的条件采样行为之间的不一致。虽然流行的无分类器引导(classifier-free guidance)技术总体效果良好,但并非没有缺陷:在较高的引导尺度参数 $w$ 下,我们常常得到分布外样本和模式坍塌,而在较低的 $w$ 下,又可能无法获得所需的特异性。为了解决这些问题,我们提出了一个改进的损失函数,使训练目标与采样行为更好地对齐。在 CIFAR-10 上以 FID 分数进行的实验验证表明,我们的方法能用更少的采样步数生成更高质量的样本,并且对引导尺度 $w$ 的选择更加鲁棒。我们还尝试用所提损失函数微调 Stable Diffusion,初步证明大型扩散模型同样可能从这种改进的损失函数中受益。
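
The sampling-time combination that the guidance scale w controls can be written in a few lines. The sketch below uses one common parameterization of classifier-free guidance (conventions for w vary across papers); the noise-prediction tensors are random placeholders for a trained denoiser evaluated with and without the condition.

```python
# One common parameterization of classifier-free guidance at sampling time:
# the conditional and unconditional noise predictions are extrapolated away
# from the unconditional one. The predictions below are random placeholders
# for a trained denoiser; conventions for w differ between papers.
import torch

def guided_noise(eps_uncond: torch.Tensor, eps_cond: torch.Tensor, w: float) -> torch.Tensor:
    # w = 0 recovers purely conditional sampling; larger w strengthens guidance
    # but, as the paper notes, pushes samples off-distribution.
    return eps_uncond + (1.0 + w) * (eps_cond - eps_uncond)

if __name__ == "__main__":
    torch.manual_seed(0)
    x_t = torch.randn(1, 3, 32, 32)
    eps_cond = torch.randn_like(x_t)        # denoiser(x_t, t, condition)
    eps_uncond = torch.randn_like(x_t)      # denoiser(x_t, t, null condition)
    for w in (0.0, 1.0, 5.0):
        print(w, float(guided_noise(eps_uncond, eps_cond, w).std()))
```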

Scalable Counterfactual Distribution Estimation in Multivariate Causal Models

  • paper_url: http://arxiv.org/abs/2311.00927
  • repo_url: None
  • paper_authors: Thong Pham, Shohei Shimizu, Hideitsu Hino, Tam Le
  • for: 估计多变量因果模型中多个关注量(例如结果)的反事实联合分布
  • methods: 利用由所有维度信息构建的稳健一维潜在子空间,以便更好地捕捉相关结构并生成良好的反事实分布估计
  • results: 相比现有方法提供更好的反事实分布估计,并在合成数据和真实数据上展现出优势
    Abstract We consider the problem of estimating the counterfactual joint distribution of multiple quantities of interests (e.g., outcomes) in a multivariate causal model extended from the classical difference-in-difference design. Existing methods for this task either ignore the correlation structures among dimensions of the multivariate outcome by considering univariate causal models on each dimension separately and hence produce incorrect counterfactual distributions, or poorly scale even for moderate-size datasets when directly dealing with such multivariate causal model. We propose a method that alleviates both issues simultaneously by leveraging a robust latent one-dimensional subspace of the original high-dimension space and exploiting the efficient estimation from the univariate causal model on such space. Since the construction of the one-dimensional subspace uses information from all the dimensions, our method can capture the correlation structures and produce good estimates of the counterfactual distribution. We demonstrate the advantages of our approach over existing methods on both synthetic and real-world data.
    摘要 我们研究在扩展自经典双重差分设计的多变量因果模型中,估计多个关注量(例如结果)的反事实联合分布的问题。现有方法要么对每个维度分别建立单变量因果模型、忽略多变量结果之间的相关结构,从而产生错误的反事实分布;要么在直接处理这种多变量因果模型时,即使对中等规模的数据集也难以扩展。我们提出了一种方法,可同时缓解这两个问题:利用在原始高维空间上构建的稳健一维潜在子空间,并在该空间上借助单变量因果模型进行高效估计。由于一维子空间的构建使用了所有维度的信息,我们的方法能够捕捉相关结构,并给出良好的反事实分布估计。我们在合成数据和真实数据上展示了该方法相对于现有方法的优势。

M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place

  • paper_url: http://arxiv.org/abs/2311.00926
  • repo_url: None
  • paper_authors: Wentao Yuan, Adithyavairavan Murali, Arsalan Mousavian, Dieter Fox
  • for: 这个论文的目的是提出一个单一的模型,能够在不同的物体上进行多种低层运动 Primitives,并在各种不同的场景中进行稳定的物体搬运。
  • methods: 这个模型使用了 transformer 模型,它可以根据触碰点来决定合适的握持位置,并预测不同的动作模式下的有效握持 pose。
  • results: 在一个包含 12.8 万个场景的大规模合成数据集上训练后,该模型在真实机器人上实现了零样本 sim2real 迁移,整体性能比采用最先进任务专用模型的基线系统高约 19%,在需要重新调整物体朝向才能无碰撞放置的挑战性场景中高 37.5%。此外,该模型也在 RLBench 的一部分语言条件任务上取得了最先进的结果。
    Abstract With the advent of large language models and large-scale robotic datasets, there has been tremendous progress in high-level decision-making for object manipulation. These generic models are able to interpret complex tasks using language commands, but they often have difficulties generalizing to out-of-distribution objects due to the inability of low-level action primitives. In contrast, existing task-specific models excel in low-level manipulation of unknown objects, but only work for a single type of action. To bridge this gap, we present M2T2, a single model that supplies different types of low-level actions that work robustly on arbitrary objects in cluttered scenes. M2T2 is a transformer model which reasons about contact points and predicts valid gripper poses for different action modes given a raw point cloud of the scene. Trained on a large-scale synthetic dataset with 128K scenes, M2T2 achieves zero-shot sim2real transfer on the real robot, outperforming the baseline system with state-of-the-art task-specific models by about 19% in overall performance and 37.5% in challenging scenes where the object needs to be re-oriented for collision-free placement. M2T2 also achieves state-of-the-art results on a subset of language conditioned tasks in RLBench. Videos of robot experiments on unseen objects in both real world and simulation are available on our project website https://m2-t2.github.io.

The Power of the Senses: Generalizable Manipulation from Vision and Touch through Masked Multimodal Learning

  • paper_url: http://arxiv.org/abs/2311.00924
  • repo_url: None
  • paper_authors: Carmelo Sferrazza, Younggyo Seo, Hao Liu, Youngwoon Lee, Pieter Abbeel
  • for: 本研究旨在开发一种可以结合视觉和感觉信息的多模态学习方法,以提高机器人 manipulate 物体的能力。
  • methods: 本研究提出了Masked Multimodal Learning(M3L)方法,它通过对视觉和感觉信息进行卷积自编码,同时学习策略和多模态表示。
  • results: 研究表明,通过在多模态 setting 学习,可以提高 sample efficiency 和泛化能力,并且vision-only 策略在测试时也受益于多模态学习。研究在三个 simulated 环境中进行了测试:机器人插入、门开合和灵活手部操作,结果表明了多模态策略的优势。
    Abstract Humans rely on the synergy of their senses for most essential tasks. For tasks requiring object manipulation, we seamlessly and effectively exploit the complementarity of our senses of vision and touch. This paper draws inspiration from such capabilities and aims to find a systematic approach to fuse visual and tactile information in a reinforcement learning setting. We propose Masked Multimodal Learning (M3L), which jointly learns a policy and visual-tactile representations based on masked autoencoding. The representations jointly learned from vision and touch improve sample efficiency, and unlock generalization capabilities beyond those achievable through each of the senses separately. Remarkably, representations learned in a multimodal setting also benefit vision-only policies at test time. We evaluate M3L on three simulated environments with both visual and tactile observations: robotic insertion, door opening, and dexterous in-hand manipulation, demonstrating the benefits of learning a multimodal policy. Code and videos of the experiments are available at https://sferrazza.cc/m3l_site.
    摘要 人类在日常任务中借靠感觉的协同才能完成大多数任务。在对物体操作任务中,我们会自然地和高效地利用视觉感和触觉感的相互补充。这篇论文从这些能力中得到灵感,旨在在强化学习设置下系统地融合视觉和触觉信息。我们提议的Masked Multimodal Learning(M3L)方法,同时学习策略和视觉和触觉的表示,基于遮盲自动编码。这些共同学习的表示,从视觉和触觉两种感知中各自提高样本效率,并在单独使用的感知方面也具有扩展的能力。更 remarkably,在多模态设置下学习的表示,也对视觉只的策略在测试时具有改善的效果。我们在三个 simulated 环境中进行了 inserting、开门和灵活的手部操作等三种任务的测试,证明了多模态策略的优势。代码和实验视频可以在 https://sferrazza.cc/m3l_site 上获取。

Artificial Intelligence Ethics Education in Cybersecurity: Challenges and Opportunities: a focus group report

  • paper_url: http://arxiv.org/abs/2311.00903
  • repo_url: None
  • paper_authors: Diane Jackson, Sorin Adam Matei, Elisa Bertino
  • for: 这篇论文的目的是探讨人工智能工具在网络安全领域中的应用和挑战。
  • methods: 论文采用焦点小组研讨方法,与网络安全方向的高年级研究生进行讨论,以了解人工智能工具在网络安全领域带来的挑战和机遇。
  • results: 论文发现,在网络安全领域使用人工智能工具既有挑战也有机遇,关键问题包括开源或免费工具的获取、文档、课程多样性以及伦理原则的清晰阐述;此外,研究还发现,破除 AI 网络安全工作中的"黑箱"心态、提升系统思维与有效沟通能力同样至关重要。
    Abstract The emergence of AI tools in cybersecurity creates many opportunities and uncertainties. A focus group with advanced graduate students in cybersecurity revealed the potential depth and breadth of the challenges and opportunities. The salient issues are access to open source or free tools, documentation, curricular diversity, and clear articulation of ethical principles for AI cybersecurity education. Confronting the "black box" mentality in AI cybersecurity work is also of the greatest importance, doubled by deeper and prior education in foundational AI work. Systems thinking and effective communication were considered relevant areas of educational improvement. Future AI educators and practitioners need to address these issues by implementing rigorous technical training curricula, clear documentation, and frameworks for ethically monitoring AI combined with critical and system's thinking and communication skills.
    摘要 人工智能工具在网络安全领域的出现带来了诸多机遇与不确定性。一个由网络安全方向高年级研究生组成的焦点小组揭示了这些挑战和机遇可能的深度与广度。突出的问题包括开源或免费工具的获取、文档、课程多样性,以及为 AI 网络安全教育清晰阐述伦理原则。破除 AI 网络安全工作中的"黑箱"心态同样极为重要,并需要以更深入、更早期的 AI 基础教育作为支撑。系统思维和有效沟通被认为是需要加强的教育领域。未来的 AI 教育者和从业者需要通过实施严格的技术训练课程、清晰的文档,以及结合批判性思维、系统思维与沟通能力的 AI 伦理监测框架来应对这些问题。