2023-12-06

cs.AI

A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints

paper_url: http://arxiv.org/abs/2312.03905
repo_url: None
paper_authors: Kareem Ahmed, Kai-Wei Chang, Guy Van den Broeck
for: bridges the gap between purely symbolic and neural approaches to learning, specifically for tasks that involve autoregressive distributions such as transformers.
methods: proposes a new approach to neuro-symbolic learning that involves maximizing the likelihood of a symbolic constraint w.r.t the neural network’s output distribution, using a pseudolikelihood-based approximation centered around a model sample, which is factorized and locally high-fidelity.
results: greatly improves upon the base model’s ability to predict logically-consistent outputs on Sudoku and shortest-path prediction tasks, and achieves State-of-the-Art (SoTA) detoxification compared to previous approaches on the task of detoxifying large language models by disallowing a list of toxic words.

Abstract
Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning. This often requires maximizing the likelihood of a symbolic constraint w.r.t the neural network's output distribution. Such output distributions are typically assumed to be fully-factorized. This limits the applicability of neuro-symbolic learning to the more expressive autoregressive distributions, e.g., transformers. Under such distributions, computing the likelihood of even simple constraints is #P-hard. Instead of attempting to enforce the constraint on the entire output distribution, we propose to do so on a random, local approximation thereof. More precisely, we optimize the likelihood of the constraint under a pseudolikelihood-based approximation centered around a model sample. Our approximation is factorized, allowing the reuse of solutions to sub-problems, a main tenet for efficiently computing neuro-symbolic losses. Moreover, it is a local, high-fidelity approximation of the likelihood, exhibiting low entropy and KL-divergence around the model sample. We evaluate our approach on Sudoku and shortest-path prediction cast as autoregressive generation, and observe that we greatly improve upon the base model's ability to predict logically-consistent outputs. We also evaluate on the task of detoxifying large language models. Using a simple constraint disallowing a list of toxic words, we are able to steer the model's outputs away from toxic generations, achieving SoTA detoxification compared to previous approaches.
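
The idea of evaluating a constraint under a factorized, sample-centered approximation can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the three-variable toy distribution `joint_prob`, the "at least one variable is true" constraint, and the function names are hypothetical and are not taken from the paper. The sketch only shows the general shape: fix a model sample, compute each variable's conditional given the other sampled values, and score the constraint under the product of those conditionals.

```python
import math


def joint_prob(y):
    """Toy autoregressive distribution over three binary variables (hypothetical)."""
    p1 = 0.3                            # P(y1 = 1)
    p2 = 0.8 if y[0] == 1 else 0.1      # P(y2 = 1 | y1)
    p3 = 0.6 if y[0] == y[1] else 0.2   # P(y3 = 1 | y1, y2)
    prob = 1.0
    for bit, p in zip(y, (p1, p2, p3)):
        prob *= p if bit == 1 else 1.0 - p
    return prob


def pseudo_marginals(sample):
    """q_i = P(y_i = 1 | every other variable fixed to its sampled value)."""
    qs = []
    for i in range(len(sample)):
        on = list(sample)
        off = list(sample)
        on[i], off[i] = 1, 0
        p_on, p_off = joint_prob(on), joint_prob(off)
        qs.append(p_on / (p_on + p_off))
    return qs


def constraint_prob_at_least_one(qs):
    """P(at least one variable is 1) under the factorized distribution prod_i q_i."""
    prob_all_off = 1.0
    for q in qs:
        prob_all_off *= 1.0 - q
    return 1.0 - prob_all_off


def pseudo_semantic_loss(sample):
    """Negative log-probability of the constraint under the local approximation."""
    return -math.log(constraint_prob_at_least_one(pseudo_marginals(sample)))


sample = (0, 0, 0)  # stands in for a sample drawn from the model
print(pseudo_marginals(sample))
print(pseudo_semantic_loss(sample))
```

Because the approximation is a product of independent per-variable terms, the constraint probability has a closed form here; under the full autoregressive distribution the same quantity would require summing over all assignments.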

A Masked Pruning Approach for Dimensionality Reduction in Communication-Efficient Federated Learning Systems

paper_url: http://arxiv.org/abs/2312.03889
repo_url: None
paper_authors: Tamir L. S. Gez, Kobi Cohen
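for: improves the applicability of Federated Learning (FL) algorithms on devices with limited communication resources, such as mobile or embedded devices.
methods: combines a masking-based pruning method with the FL process, so that nodes share low-dimensional representations and communication cost drops. Each node first trains its model locally, then computes a pruning mask and transmits it back to the server, which forms a consensus mask. This iterative process improves the stability and robustness of the model.
results: compared with existing methods, MPFL achieves greater bandwidth savings while maintaining model performance, as shown in an extensive experimental study. The authors have also developed an open-source software package for researchers and developers in related fields.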

Abstract
Federated Learning (FL) represents a growing machine learning (ML) paradigm designed for training models across numerous nodes that retain local datasets, all without directly exchanging the underlying private data with the parameter server (PS). Its increasing popularity is attributed to notable advantages in terms of training deep neural network (DNN) models under privacy aspects and efficient utilization of communication resources. Unfortunately, DNNs suffer from high computational and communication costs, as well as memory consumption in intricate tasks. These factors restrict the applicability of FL algorithms in communication-constrained systems with limited hardware resources. In this paper, we develop a novel algorithm that overcomes these limitations by synergistically combining a pruning-based method with the FL process, resulting in low-dimensional representations of the model with minimal communication cost, dubbed Masked Pruning over FL (MPFL). The algorithm operates by initially distributing weights to the nodes through the PS. Subsequently, each node locally trains its model and computes pruning masks. These low-dimensional masks are then transmitted back to the PS, which generates a consensus pruning mask, broadcasted back to the nodes. This iterative process enhances the robustness and stability of the masked pruning model. The generated mask is used to train the FL model, achieving significant bandwidth savings. We present an extensive experimental study demonstrating the superior performance of MPFL compared to existing methods. Additionally, we have developed an open-source software package for the benefit of researchers and developers in related fields.
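
The mask exchange described above can be sketched in a few lines. The abstract does not say how nodes compute their masks or how the parameter server forms the consensus, so this sketch assumes magnitude-based top-k masks on each node and a majority vote on the server; the function names and the example weights are hypothetical.

```python
def local_mask(weights, keep_ratio):
    """Binary mask keeping the largest-magnitude weights (assumed pruning rule)."""
    k = max(1, round(keep_ratio * len(weights)))
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(weights))]


def consensus_mask(masks, threshold=0.5):
    """Keep a weight if more than `threshold` of the nodes voted to keep it (assumed rule)."""
    n = len(masks)
    return [1 if sum(votes) / n > threshold else 0 for votes in zip(*masks)]


def apply_mask(weights, mask):
    """Zero out the pruned weights."""
    return [w * m for w, m in zip(weights, mask)]


# Three nodes with locally trained weights for the same four parameters.
node_weights = [
    [0.9, -0.1, 0.5, 0.05],
    [0.8, 0.7, -0.2, 0.1],
    [-0.6, 0.02, 0.4, 0.3],
]
masks = [local_mask(w, keep_ratio=0.5) for w in node_weights]  # sent to the server
shared = consensus_mask(masks)                                 # broadcast back to nodes
print(shared)  # [1, 0, 1, 0]
```

Only the binary masks travel between the nodes and the server in this sketch, which is where the bandwidth saving would come from.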

On The Fairness Impacts of Hardware Selection in Machine Learning

paper_url: http://arxiv.org/abs/2312.03886
repo_url: https://github.com/Aryia-Behroziuan/neurons
paper_authors: Sree Harsha Nelaturu, Nishaanth Kanna Ravichandran, Cuong Tran, Sara Hooker, Ferdinando Fioretto
for: investigates the impact of hardware choices on the generalization properties of machine learning models, particularly in the context of ML-as-a-service platforms.
methods: combines theoretical and empirical analysis to identify the factors that contribute to hardware-induced performance imbalances, and proposes a strategy for mitigating these imbalances.
results: demonstrates that hardware choices can exacerbate existing disparities in model performance and fairness, and provides insights into the underlying causes of these discrepancies.

Abstract
In the machine learning ecosystem, hardware selection is often regarded as a mere utility, overshadowed by the spotlight on algorithms and data. This oversight is particularly problematic in contexts like ML-as-a-service platforms, where users often lack control over the hardware used for model deployment. How does the choice of hardware impact generalization properties? This paper investigates the influence of hardware on the delicate balance between model performance and fairness. We demonstrate that hardware choices can exacerbate existing disparities, attributing these discrepancies to variations in gradient flows and loss surfaces across different demographic groups. Through both theoretical and empirical analysis, the paper not only identifies the underlying factors but also proposes an effective strategy for mitigating hardware-induced performance imbalances.

FoMo Rewards: Can we cast foundation models as reward functions?

paper_url: http://arxiv.org/abs/2312.03881
repo_url: None
paper_authors: Ekdeep Singh Lubana, Johann Brehmer, Pim de Haan, Taco Cohen
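for: studies whether foundation models can serve as reward functions for reinforcement learning.
methods: proposes a simple pipeline that interfaces an off-the-shelf vision model with a large language model. Specifically, given a trajectory of observations, it infers the likelihood of an instruction describing the task.
results: finds that this generic likelihood function has the characteristics expected of a reward function: it assigns high values to the desired behaviour and lower values to similar but incorrect policies. Overall, the work opens the possibility of designing open-ended agents for interactive tasks via foundation models.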

Abstract
We explore the viability of casting foundation models as generic reward functions for reinforcement learning. To this end, we propose a simple pipeline that interfaces an off-the-shelf vision model with a large language model. Specifically, given a trajectory of observations, we infer the likelihood of an instruction describing the task that the user wants an agent to perform. We show that this generic likelihood function exhibits the characteristics ideally expected from a reward function: it associates high values with the desired behaviour and lower values for several similar, but incorrect policies. Overall, our work opens the possibility of designing open-ended agents for interactive tasks via foundation models.

Scaling transformer neural networks for skillful and reliable medium-range weather forecasting

paper_url: http://arxiv.org/abs/2312.03876
repo_url: None
paper_authors: Tung Nguyen, Rohan Shah, Hritik Bansal, Troy Arcomano, Sandeep Madireddy, Romit Maulik, Veerabhadra Kotamarthi, Ian Foster, Aditya Grover
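for: proposes a deep-learning-based weather forecasting method to improve forecast accuracy and efficiency.
methods: uses a simple transformer model called Stormer, whose key components include weather-specific embedding, randomized dynamics forecasting, and a pressure-weighted loss.
results: on WeatherBench 2, Stormer is competitive at short- to medium-range forecasts and outperforms current methods beyond 7 days, while requiring orders of magnitude less training data and compute. The paper also shows that Stormer scales well: forecast accuracy improves as model size and the number of training tokens increase.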

Abstract
Weather forecasting is a fundamental problem for anticipating and mitigating the impacts of climate change. Recently, data-driven approaches for weather forecasting based on deep learning have shown great promise, achieving accuracies that are competitive with operational systems. However, those methods often employ complex, customized architectures without sufficient ablation analysis, making it difficult to understand what truly contributes to their success. Here we introduce Stormer, a simple transformer model that achieves state-of-the-art performance on weather forecasting with minimal changes to the standard transformer backbone. We identify the key components of Stormer through careful empirical analyses, including weather-specific embedding, randomized dynamics forecast, and pressure-weighted loss. At the core of Stormer is a randomized forecasting objective that trains the model to forecast the weather dynamics over varying time intervals. During inference, this allows us to produce multiple forecasts for a target lead time and combine them to obtain better forecast accuracy. On WeatherBench 2, Stormer performs competitively at short to medium-range forecasts and outperforms current methods beyond 7 days, while requiring orders-of-magnitude less training data and compute. Additionally, we demonstrate Stormer's favorable scaling properties, showing consistent improvements in forecast accuracy with increases in model size and training tokens. Code and checkpoints will be made publicly available.
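
The inference-time trick of producing several forecasts for one lead time and combining them can be sketched as follows. The abstract does not give the interval set or the combination rule, so this sketch assumes intervals of 6, 12, and 24 hours, rollouts that repeat a single interval, and a plain average; `toy_dynamics` is a hypothetical stand-in for the trained model.

```python
def rollout_schedules(lead_time, intervals=(6, 12, 24)):
    """All ways to reach `lead_time` hours by repeating a single interval (assumed scheme)."""
    schedules = [[dt] * (lead_time // dt) for dt in intervals if lead_time % dt == 0]
    if not schedules:
        raise ValueError("lead_time must be a multiple of at least one interval")
    return schedules


def toy_dynamics(state, hours):
    """Hypothetical stand-in for the model: predicted change of the state over `hours`."""
    return -0.01 * hours * state


def forecast(state, schedule, dynamics=toy_dynamics):
    """Roll the state forward by adding the predicted change for each step."""
    for dt in schedule:
        state = state + dynamics(state, dt)
    return state


def combined_forecast(state, lead_time):
    """Average the forecasts obtained from every valid rollout schedule."""
    preds = [forecast(state, s) for s in rollout_schedules(lead_time)]
    return sum(preds) / len(preds)


print(rollout_schedules(24))  # [[6, 6, 6, 6], [12, 12], [24]]
print(combined_forecast(100.0, 24))
```

A model trained on a single fixed interval would have only one way to reach a given lead time; training over varying intervals is what makes the extra schedules available at inference.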

The BigCode Project Governance Card

paper_url: http://arxiv.org/abs/2312.03872
repo_url: None
paper_authors: BigCode collaboration, Sean Hughes, Harm de Vries, Jennifer Robinson, Carlos Muñoz Ferrandis, Loubna Ben Allal, Leandro von Werra, Jennifer Ding, Sebastien Paquet, Yacine Jernite
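for: gives an overview of the different mechanisms and areas of governance in the BigCode project, in support of transparency and reproducibility.
methods: describes the mechanisms that support the project's governance, including its organization, stated goals and values, internal decision processes, and funding and resources.
results: by documenting these mechanisms and areas, the card makes the project's choices transparent to the broader public and offers an example that future open research projects can follow.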

Abstract
This document serves as an overview of the different mechanisms and areas of governance in the BigCode project. It aims to support transparency by providing relevant information about choices that were made during the project to the broader public, and to serve as an example of intentional governance of an open research project that future endeavors can leverage to shape their own approach. The first section, Project Structure, covers the project organization, its stated goals and values, its internal decision processes, and its funding and resources. The second section, Data and Model Governance, covers decisions relating to the questions of data subject consent, privacy, and model release.

Efficient Large Language Models: A Survey

paper_url: http://arxiv.org/abs/2312.03863
repo_url: https://github.com/aiot-mlsys-lab/efficientllms
paper_authors: Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, Mi Zhang
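for: provides a systematic and comprehensive survey of research on efficient LLMs, helping researchers and practitioners understand developments in the field.
methods: organizes the literature into three main categories, from model-centric, data-centric, and framework-centric perspectives, and compiles the surveyed papers in a GitHub repository.
results: delivers a systematic review of research developments across those three perspectives; the GitHub repository of related papers will be maintained and updated.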

Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in important tasks such as natural language understanding, language generation, and complex reasoning and have the potential to make a substantial impact on our society. Such capabilities, however, come with the considerable resources they demand, highlighting the strong need to develop effective techniques for addressing their efficiency challenges. In this survey, we provide a systematic and comprehensive review of efficient LLMs research. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient LLMs topics from model-centric, data-centric, and framework-centric perspective, respectively. We have also created a GitHub repository where we compile the papers featured in this survey at https://github.com/AIoT-MLSys-Lab/EfficientLLMs, https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey, and will actively maintain this repository and incorporate new research as it emerges. We hope our survey can serve as a valuable resource to help researchers and practitioners gain a systematic understanding of the research developments in efficient LLMs and inspire them to contribute to this important and exciting field.

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

paper_url: http://arxiv.org/abs/2312.03818
repo_url: https://github.com/sunzey/alphaclip
paper_authors: Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
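for: improves the controllability of CLIP to support finer understanding and controlled editing of images.
methods: adds an auxiliary alpha channel to indicate the regions to attend to, and fine-tunes CLIP on millions of constructed RGBA region-text pairs.
results: Alpha-CLIP preserves CLIP's visual recognition ability while enabling precise control over the emphasis of image contents. It is effective across tasks including open-world recognition, multimodal large language models, and conditional 2D/3D generation.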

Abstract
Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks. It aligns textual and visual modalities to comprehend the entire image, including all the details, even those irrelevant to specific tasks. However, for a finer understanding and controlled editing of images, it becomes crucial to focus on specific regions of interest, which can be indicated as points, masks, or boxes by humans or perception models. To fulfill the requirements, we introduce Alpha-CLIP, an enhanced version of CLIP with an auxiliary alpha channel to suggest attentive regions and fine-tuned with constructed millions of RGBA region-text pairs. Alpha-CLIP not only preserves the visual recognition ability of CLIP but also enables precise control over the emphasis of image contents. It demonstrates effectiveness in various tasks, including but not limited to open-world recognition, multimodal large language models, and conditional 2D / 3D generation. It has a strong potential to serve as a versatile tool for image-related tasks.

OneLLM: One Framework to Align All Modalities with Language

paper_url: http://arxiv.org/abs/2312.03700
repo_url: https://github.com/csuhan/onellm
paper_authors: Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue
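for: develops a multimodal large language model (MLLM) that handles many modalities within one framework, to improve multimodal understanding.
methods: uses a unified framework that aligns eight modalities to language through a progressive multimodal alignment pipeline. It also builds a universal projection module (UPM) by mixing multiple image projection modules with dynamic routing.
results: OneLLM performs well on 25 diverse benchmarks, covering multimodal captioning, question answering, and reasoning.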

Abstract
Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curated a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM
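
A universal projection module that mixes several projection modules through routing could look roughly like the sketch below. The abstract gives no architectural details, so this assumes linear projection experts, a linear router, and softmax gating; all names and the tiny matrices are hypothetical and chosen only to make the mixing visible.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def universal_projection(token, experts, router):
    """Mix the outputs of several projection experts with input-dependent weights."""
    gate = softmax(matvec(router, token))              # one weight per expert
    outputs = [matvec(expert, token) for expert in experts]
    mixed = [
        sum(g * out[d] for g, out in zip(gate, outputs))
        for d in range(len(outputs[0]))
    ]
    return mixed, gate


# Two hypothetical experts: the identity map and a map that doubles its input.
experts = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
router = [[0.0, 0.0], [0.0, 0.0]]  # all-zero router gives equal weights
mixed, gate = universal_projection([1.0, 2.0], experts, router)
print(gate)   # [0.5, 0.5]
print(mixed)  # [1.5, 3.0]
```

With a trained router, tokens from different modalities would receive different gate weights and so pass through different blends of the same experts.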

Intrinsic Harmonization for Illumination-Aware Compositing

paper_url: http://arxiv.org/abs/2312.03698
repo_url: None
paper_authors: Chris Careaga, S. Mahdi H. Miangoleh, Yağız Aksoy
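for: improves the realism and lighting accuracy of image composites.
methods: uses a self-supervised illumination harmonization approach: it estimates a simple global lighting model and refines the resulting shading with a network, so that the foreground matches the background in lighting and color appearance.
results: improves realism and lighting consistency on real-world composites; a user study objectively measures the enhanced realism relative to prior methods.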

Abstract
Despite significant advancements in network-based image harmonization techniques, there still exists a domain disparity between typical training pairs and real-world composites encountered during inference. Most existing methods are trained to reverse global edits made on segmented image regions, which fail to accurately capture the lighting inconsistencies between the foreground and background found in composited images. In this work, we introduce a self-supervised illumination harmonization approach formulated in the intrinsic image domain. First, we estimate a simple global lighting model from mid-level vision representations to generate a rough shading for the foreground region. A network then refines this inferred shading to generate a harmonious re-shading that aligns with the background scene. In order to match the color appearance of the foreground and background, we utilize ideas from prior harmonization approaches to perform parameterized image edits in the albedo domain. To validate the effectiveness of our approach, we present results from challenging real-world composites and conduct a user study to objectively measure the enhanced realism achieved compared to state-of-the-art harmonization methods.

MatterGen: a generative model for inorganic materials design

paper_url: http://arxiv.org/abs/2312.03687
repo_url: None
paper_authors: Claudio Zeni, Robert Pinsler, Daniel Zügner, Andrew Fowler, Matthew Horton, Xiang Fu, Sasha Shysheya, Jonathan Crabbé, Lixin Sun, Jake Smith, Ryota Tomioka, Tian Xie
for: The paper aims to develop a new generative model for designing functional materials with desired properties, particularly focusing on stability and novelty.
methods: The proposed model, called MatterGen, uses a diffusion-based generative process that refines atom types, coordinates, and the periodic lattice to produce crystalline structures. Adapter modules are introduced to enable fine-tuning towards specific property constraints.
results: MatterGen is able to generate stable, diverse inorganic materials across the periodic table, with a higher success rate and closer proximity to the local energy minimum compared to prior generative models. Fine-tuning the model allows for the design of materials with desired chemistry, symmetry, and multiple properties such as mechanical, electronic, and magnetic properties.

Abstract
The design of functional materials with desired properties is essential in driving technological advances in areas like energy storage, catalysis, and carbon capture. Generative models provide a new paradigm for materials design by directly generating entirely novel materials given desired property constraints. Despite recent progress, current generative models have low success rate in proposing stable crystals, or can only satisfy a very limited set of property constraints. Here, we present MatterGen, a model that generates stable, diverse inorganic materials across the periodic table and can further be fine-tuned to steer the generation towards a broad range of property constraints. To enable this, we introduce a new diffusion-based generative process that produces crystalline structures by gradually refining atom types, coordinates, and the periodic lattice. We further introduce adapter modules to enable fine-tuning towards any given property constraints with a labeled dataset. Compared to prior generative models, structures produced by MatterGen are more than twice as likely to be novel and stable, and more than 15 times closer to the local energy minimum. After fine-tuning, MatterGen successfully generates stable, novel materials with desired chemistry, symmetry, as well as mechanical, electronic and magnetic properties. Finally, we demonstrate multi-property materials design capabilities by proposing structures that have both high magnetic density and a chemical composition with low supply-chain risk. We believe that the quality of generated materials and the breadth of MatterGen's capabilities represent a major advancement towards creating a universal generative model for materials design.

LLM as OS (llmao), Agents as Apps: Envisioning AIOS, Agents and the AIOS-Agent Ecosystem

paper_url: http://arxiv.org/abs/2312.03815
repo_url: None
paper_authors: Yingqiang Ge, Yujie Ren, Wenyue Hua, Shuyuan Xu, Juntao Tan, Yongfeng Zhang
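for: envisions an AI operating system (AIOS) ecosystem built on large language models (LLMs), marking a paradigm shift for operating systems.
methods: treats the LLM as the core component of the operating system, with a range of LLM-based AI Agent Applications (AAPs) developed on top of it to enrich the AIOS-Agent ecosystem.
results: anticipates that the impact of LLMs will not be limited to the AI application level, but will also reshape the design and implementation of computer systems, architecture, software, and programming languages, with tools acting as devices and libraries at the hardware/middleware level.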

Abstract
This paper envisions a revolutionary AIOS-Agent ecosystem, where Large Language Model (LLM) serves as the (Artificial) Intelligent Operating System (IOS, or AIOS)--an operating system "with soul". Upon this foundation, a diverse range of LLM-based AI Agent Applications (Agents, or AAPs) are developed, enriching the AIOS-Agent ecosystem and signaling a paradigm shift from the traditional OS-APP ecosystem. We envision that LLM's impact will not be limited to the AI application level, instead, it will in turn revolutionize the design and implementation of computer system, architecture, software, and programming language, featured by several main concepts: LLM as OS (system-level), Agents as Applications (application-level), Natural Language as Programming Interface (user-level), and Tools as Devices/Libraries (hardware/middleware-level).

What Planning Problems Can A Relational Neural Network Solve?

paper_url: http://arxiv.org/abs/2312.03682
repo_url: https://github.com/concepts-ai/goal-regression-width
paper_authors: Jiayuan Mao, Tomás Lozano-Pérez, Joshua B. Tenenbaum, Leslie Pack Kaelbling
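for: investigates under what circumstances goal-conditioned policies can be learned and how efficient they will be.
methods: applies circuit complexity analysis, drawing connections with serialized goal regression search (S-GRS), to relational neural networks that represent policies for planning problems.
results: shows that there are three general classes of planning problems, distinguished by how circuit width and depth grow with the number of objects and the planning horizon, and gives constructive proofs. Also illustrates the utility of this analysis for designing neural networks for policy learning.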

Abstract
Goal-conditioned policies are generally understood to be "feed-forward" circuits, in the form of neural networks that map from the current state and the goal specification to the next action to take. However, under what circumstances such a policy can be learned and how efficient the policy will be are not well understood. In this paper, we present a circuit complexity analysis for relational neural networks (such as graph neural networks and transformers) representing policies for planning problems, by drawing connections with serialized goal regression search (S-GRS). We show that there are three general classes of planning problems, in terms of the growth of circuit width and depth as a function of the number of objects and planning horizon, providing constructive proofs. We also illustrate the utility of this analysis for designing neural networks for policy learning.

An Integration of Pre-Trained Speech and Language Models for End-to-End Speech Recognition

paper_url: http://arxiv.org/abs/2312.03668
repo_url: None
paper_authors: Yukiya Hono, Koh Mitsuda, Tianyu Zhao, Kentaro Mitsui, Toshiaki Wakatsuki, Kei Sawada
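for: proposes an end-to-end automatic speech recognition (ASR) model that integrates a pre-trained speech model with a pre-trained language model.
methods: combines a pre-trained speech representation model with a large language model (LLM); speech representations serve as prompts, and text tokens are generated autoregressively, drawing on the LLM's knowledge.
results: experiments show performance comparable to modern E2E ASR models; the model can also incorporate inference optimization and parameter-efficient domain adaptation.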

Abstract
Advances in machine learning have made it possible to perform various text and speech processing tasks, including automatic speech recognition (ASR), in an end-to-end (E2E) manner. Since typical E2E approaches require large amounts of training data and resources, leveraging pre-trained foundation models instead of training from scratch is gaining attention. Although there have been attempts to use pre-trained speech and language models in ASR, most of them are limited to using either. This paper explores the potential of integrating a pre-trained speech representation model with a large language model (LLM) for E2E ASR. The proposed model enables E2E ASR by generating text tokens in an autoregressive manner via speech representations as speech prompts, taking advantage of the vast knowledge provided by the LLM. Furthermore, the proposed model can incorporate remarkable developments for LLM utilization, such as inference optimization and parameter-efficient domain adaptation. Experimental results show that the proposed model achieves performance comparable to modern E2E ASR models.

Summary
Advances in machine learning allow a variety of text and speech processing tasks, including automatic speech recognition (ASR), to be performed end to end (E2E). Typical E2E approaches need large amounts of training data and resources, so leveraging pre-trained foundation models instead of training from scratch is attracting attention. Previous attempts to use pre-trained speech and language models in ASR are mostly limited to using one or the other. This paper explores integrating a pre-trained speech representation model with a large language model (LLM) for E2E ASR. The proposed model generates text tokens autoregressively, with speech representations acting as speech prompts, and so draws on the LLM's extensive knowledge. It can also incorporate developments in LLM utilization such as inference optimization and parameter-efficient domain adaptation. Experimental results show performance comparable to modern E2E ASR models.

paper_url: http://arxiv.org/abs/2312.03664
repo_url: https://github.com/google-deepmind/concordia
paper_authors: Alexander Sasha Vezhnevets, John P. Agapiou, Avia Aharon, Ron Ziv, Jayd Matyas, Edgar A. Duéñez-Guzmán, William A. Cunningham, Simon Osindero, Danny Karmon, Joel Z. Leibo
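for: This paper explores how agent-based modeling can use Large Language Models (LLMs) to make models more understandable and practical.
methods: Uses the Concordia library to build and work with language-mediated agent-based models. Concordia uses an LLM to apply common sense, act reasonably, recall common semantic knowledge, and control digital technologies through API calls.
results: Proposes a new approach to agent-based modeling that supports language-mediated simulations of physically- or digitally-grounded environments. The approach supports a wide range of applications, including scientific research and evaluating the performance of real digital services.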

Abstract
Agent-based modeling has been around for decades, and applied widely across the social and natural sciences. The scope of this research method is now poised to grow dramatically as it absorbs the new affordances provided by Large Language Models (LLM)s. Generative Agent-Based Models (GABM) are not just classic Agent-Based Models (ABM)s where the agents talk to one another. Rather, GABMs are constructed using an LLM to apply common sense to situations, act "reasonably", recall common semantic knowledge, produce API calls to control digital technologies like apps, and communicate both within the simulation and to researchers viewing it from the outside. Here we present Concordia, a library to facilitate constructing and working with GABMs. Concordia makes it easy to construct language-mediated simulations of physically- or digitally-grounded environments. Concordia agents produce their behavior using a flexible component system which mediates between two fundamental operations: LLM calls and associative memory retrieval. A special agent called the Game Master (GM), which was inspired by tabletop role-playing games, is responsible for simulating the environment where the agents interact. Agents take actions by describing what they want to do in natural language. The GM then translates their actions into appropriate implementations. In a simulated physical world, the GM checks the physical plausibility of agent actions and describes their effects. In digital environments simulating technologies such as apps and services, the GM may handle API calls to integrate with external tools such as general AI assistants (e.g., Bard, ChatGPT), and digital apps (e.g., Calendar, Email, Search, etc.). Concordia was designed to support a wide array of applications both in scientific research and for evaluating performance of real digital services by simulating users and/or generating synthetic data.

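Summary
Agent-based models have existed for decades and are widely applied across the social and natural sciences. With the new capabilities of large language models (LLMs), the scope of agent-based modeling is set to expand considerably. Generative agent-based models (GABMs) are not simply classic agent-based models (ABMs) in which the agents talk to each other; they use an LLM to apply common sense, act "reasonably", recall common semantic knowledge, and produce API calls that control digital technologies such as apps and services. Concordia is a library that makes it easy to build and work with GABMs, including language-mediated simulations of physical or digital environments. Concordia agents use a flexible component system that mediates between two basic operations: LLM calls and associative memory retrieval. A special agent called the Game Master (GM), inspired by tabletop role-playing games, simulates the environment in which the agents interact. Agents describe their actions in natural language, and the GM translates them into appropriate implementations. In a simulated physical world, the GM checks the physical plausibility of agent actions and describes their effects. In simulated digital environments, the GM may handle API calls to integrate with external tools such as general AI assistants (e.g., Bard, ChatGPT) and digital apps (e.g., Calendar, Email, Search). Concordia was designed to support a wide range of applications, from scientific research to evaluating the performance of real digital services.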

Pearl: A Production-ready Reinforcement Learning Agent

paper_url: http://arxiv.org/abs/2312.03814
repo_url: https://github.com/facebookresearch/pearl
paper_authors: Zheqing Zhu, Rodrigo de Salvo Braz, Jalaj Bhandari, Daniel Jiang, Yi Wan, Yonathan Efroni, Liyuan Wang, Ruiyang Xu, Hongbo Guo, Alex Nikulkov, Dmytro Korenkevych, Urun Dogan, Frank Cheng, Zheng Wu, Wanqiao Xu
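for: This paper addresses problems that arise when using the RL framework to achieve long-term goals, including delayed rewards, partial observability, the exploration and exploitation dilemma, using offline data to improve online performance, and ensuring safety constraints are met.
methods: Introduces Pearl, a production-ready RL agent software package that addresses these challenges in a modular fashion.
results: Presents preliminary benchmark results and highlights Pearl's industry adoptions to demonstrate its readiness for production use.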

Abstract
Reinforcement Learning (RL) offers a versatile framework for achieving long-term goals. Its generality allows us to formalize a wide range of problems that real-world intelligent systems encounter, such as dealing with delayed rewards, handling partial observability, addressing the exploration and exploitation dilemma, utilizing offline data to improve online performance, and ensuring safety constraints are met. Despite considerable progress made by the RL research community in addressing these issues, existing open-source RL libraries tend to focus on a narrow portion of the RL solution pipeline, leaving other aspects largely unattended. This paper introduces Pearl, a Production-ready RL agent software package explicitly designed to embrace these challenges in a modular fashion. In addition to presenting preliminary benchmark results, this paper highlights Pearl's industry adoptions to demonstrate its readiness for production usage. Pearl is open sourced on Github at github.com/facebookresearch/pearl and its official website is located at pearlagent.github.io.

Improving Activation Steering in Language Models with Mean-Centring

paper_url: http://arxiv.org/abs/2312.03813
repo_url: None
paper_authors: Ole Jorgensen, Dylan Cope, Nandi Schoots, Murray Shanahan
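for: This work aims to improve control over the outputs of large language models (LLMs) by finding steering vectors, which is hard because engineers typically do not know how features are represented in these models.
methods: Proposes mean-centring steering vectors: take the mean of the activations for a target dataset, then subtract the mean of all training activations. The method is tested on natural language tasks, steering away from toxic text and steering story completions toward a target genre.
results: On natural language tasks, mean-centred steering vectors make activation steering substantially more effective than previous baselines. Applied to function vectors, mean-centring also triggers the execution of a range of natural language tasks more effectively.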

Abstract
Recent work in activation steering has demonstrated the potential to better control the outputs of Large Language Models (LLMs), but it involves finding steering vectors. This is difficult because engineers do not typically know how features are represented in these models. We seek to address this issue by applying the idea of mean-centring to steering vectors. We find that taking the average of activations associated with a target dataset, and then subtracting the mean of all training activations, results in effective steering vectors. We test this method on a variety of models on natural language tasks by steering away from generating toxic text, and steering the completion of a story towards a target genre. We also apply mean-centring to extract function vectors, more effectively triggering the execution of a range of natural language tasks by a significant margin (compared to previous baselines). This suggests that mean-centring can be used to easily improve the effectiveness of activation steering in a wide range of contexts.
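The mean-centring recipe in the abstract above (mean of target-dataset activations minus the mean of all training activations) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the `strength` parameter, and the choice to add the vector directly to an activation are assumptions made for the example, and the arrays are toy data rather than real model activations.

```python
import numpy as np


def mean_centred_steering_vector(target_acts, training_acts):
    """Mean of target activations minus mean of all training activations.

    Both inputs are (n_samples, hidden_dim) arrays of activations taken
    from the same layer (hypothetical data in this sketch).
    """
    target_acts = np.asarray(target_acts, dtype=float)
    training_acts = np.asarray(training_acts, dtype=float)
    return target_acts.mean(axis=0) - training_acts.mean(axis=0)


def steer(activation, steering_vector, strength=1.0):
    """Add the scaled steering vector to one activation (assumed usage)."""
    return np.asarray(activation, dtype=float) + strength * steering_vector


if __name__ == "__main__":
    # Toy activations: two target samples and two "training" samples.
    target = [[1.0, 2.0], [3.0, 4.0]]
    training = [[0.0, 0.0], [2.0, 2.0]]
    vec = mean_centred_steering_vector(target, training)
    print(vec)
    print(steer([0.0, 0.0], vec, strength=2.0))
```

Subtracting the training mean removes the component shared by all activations, so the remaining vector points from "typical" activations toward the target dataset rather than toward the overall activation offset.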

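Summary
Recent work on activation steering shows that the outputs of large language models (LLMs) can be better controlled, but it requires finding steering vectors. This is difficult because engineers typically do not know how features are represented in these models. We address the problem by applying mean-centring to steering vectors. Taking the mean of the activations for a target dataset and subtracting the mean of all training activations yields effective steering vectors. We test the method on natural language tasks, steering away from toxic text and steering story completions toward a target genre. We also apply mean-centring to extract function vectors, which trigger a range of natural language tasks more effectively than previous baselines. This suggests that mean-centring can improve the effectiveness of activation steering in a wide range of contexts.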

Efficient Inverse Design Optimization through Multi-fidelity Simulations, Machine Learning, and Search Space Reduction Strategies

paper_url: http://arxiv.org/abs/2312.03654
repo_url: None
paper_authors: Luka Grbcic, Juliane Müller, Wibe Albert de Jong
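for: This paper aims to improve inverse design optimization in constrained settings, especially when compute is limited, by combining multi-fidelity evaluations, machine learning models, and optimization algorithms.
methods: Proposes a methodology that couples machine learning models with optimization algorithms to improve the efficiency and accuracy of inverse design optimization. It is analyzed on two distinct engineering inverse design problems. A machine learning model trained on low-fidelity simulation data predicts the target variable during optimization and decides whether a high-fidelity simulation is needed.
results: The results show that the method substantially improves the efficiency and accuracy of inverse design optimization and can be combined with different optimization algorithms. When compute is limited, it conserves computational resources and makes the optimization faster and more stable.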

Abstract
This paper introduces a methodology designed to augment the inverse design optimization process in scenarios constrained by limited compute, through the strategic synergy of multi-fidelity evaluations, machine learning models, and optimization algorithms. The proposed methodology is analyzed on two distinct engineering inverse design problems: airfoil inverse design and the scalar field reconstruction problem. It leverages a machine learning model trained with low-fidelity simulation data, in each optimization cycle, thereby proficiently predicting a target variable and discerning whether a high-fidelity simulation is necessitated, which notably conserves computational resources. Additionally, the machine learning model is strategically deployed prior to optimization to reduce the search space, thereby further accelerating convergence toward the optimal solution. The methodology has been employed to enhance two optimization algorithms, namely Differential Evolution and Particle Swarm Optimization. Comparative analyses illustrate performance improvements across both algorithms. Notably, this method is adeptly adaptable across any inverse design application, facilitating a harmonious synergy between a representative low-fidelity machine learning model, and high-fidelity simulation, and can be seamlessly applied across any variety of population-based optimization algorithms.
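The idea of letting a cheap model decide when an expensive simulation is worth running can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's procedure: the gating rule (run the high-fidelity evaluation only when the surrogate's prediction is within `margin` of the best value so far), the function names, and the toy objective are all hypothetical.

```python
def gated_evaluate(candidates, surrogate, high_fidelity, best_so_far, margin=0.1):
    """Evaluate candidates for a minimization problem.

    The surrogate (standing in for a model trained on low-fidelity data)
    scores every candidate. The expensive high-fidelity function is called
    only for candidates the surrogate considers promising.
    Returns (list of (candidate, value), number of high-fidelity calls).
    """
    results = []
    calls = 0
    for x in candidates:
        predicted = surrogate(x)
        if predicted <= best_so_far + margin:
            # Promising candidate: pay for the accurate evaluation.
            value = high_fidelity(x)
            calls += 1
            best_so_far = min(best_so_far, value)
        else:
            # Unpromising candidate: keep the cheap prediction.
            value = predicted
        results.append((x, value))
    return results, calls


def toy_high_fidelity(x):
    """Stand-in for an expensive simulation (hypothetical objective)."""
    return (x - 2.0) ** 2


def toy_surrogate(x):
    """Stand-in for the learned low-fidelity model: slightly biased."""
    return (x - 2.0) ** 2 + 0.05


if __name__ == "__main__":
    population = [0.0, 1.9, 2.0, 5.0]
    results, calls = gated_evaluate(
        population, toy_surrogate, toy_high_fidelity, best_so_far=1.0
    )
    print(results, calls)
```

In a population-based optimizer such as Differential Evolution or Particle Swarm Optimization, a routine like this would replace the per-generation fitness evaluation, so the number of high-fidelity calls shrinks as the best known value improves.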

Summary
The paper introduces a methodology that improves inverse design optimization when computing resources are limited by combining multi-fidelity evaluations, machine learning models, and optimization algorithms. A machine learning model trained with low-fidelity simulation data predicts a target variable in each optimization cycle, so high-fidelity simulations are run only when necessary, which conserves computational resources. The model is also deployed before optimization to reduce the search space, which further accelerates convergence toward the optimal solution. The methodology is applied to two engineering inverse design problems and used to enhance two optimization algorithms, Differential Evolution and Particle Swarm Optimization; comparative analyses show performance improvements for both. The method is adaptable to any inverse design application and can be applied to any population-based optimization algorithm.

MotionCtrl: A Unified and Flexible Motion Controller for Video Generation

paper_url: http://arxiv.org/abs/2312.03641
repo_url: None
paper_authors: Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, Ying Shan
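for: This paper proposes MotionCtrl, a motion controller that precisely controls camera motion and object motion in video.
methods: Uses a new motion controller architecture designed with camera motion, object motion, and the characteristics of the training data in mind, to provide flexible and precise motion control.
results: Compared with existing methods, MotionCtrl has three main advantages: 1) it controls camera and object motion precisely, allowing finer-grained motion control and diverse combinations of the two; 2) its motion conditions are determined by camera poses and trajectories, which are appearance-free and minimally affect object shape or appearance; 3) it is a relatively generalizable model that adapts to a wide range of camera poses and trajectories. Extensive qualitative and quantitative experiments show that it outperforms existing methods.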

Abstract
Motions in a video primarily consist of camera motion, induced by camera movement, and object motion, resulting from object movement. Accurate control of both camera and object motion is essential for video generation. However, existing works either mainly focus on one type of motion or do not clearly distinguish between the two, limiting their control capabilities and diversity. Therefore, this paper presents MotionCtrl, a unified and flexible motion controller for video generation designed to effectively and independently control camera and object motion. The architecture and training strategy of MotionCtrl are carefully devised, taking into account the inherent properties of camera motion, object motion, and imperfect training data. Compared to previous methods, MotionCtrl offers three main advantages: 1) It effectively and independently controls camera motion and object motion, enabling more fine-grained motion control and facilitating flexible and diverse combinations of both types of motion. 2) Its motion conditions are determined by camera poses and trajectories, which are appearance-free and minimally impact the appearance or shape of objects in generated videos. 3) It is a relatively generalizable model that can adapt to a wide array of camera poses and trajectories once trained. Extensive qualitative and quantitative experiments have been conducted to demonstrate the superiority of MotionCtrl over existing methods.

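Summary
Motion in a video consists mainly of camera motion, induced by camera movement, and object motion, resulting from object movement. Accurate control of both is essential for video generation. However, existing work either focuses mainly on one type of motion or does not clearly distinguish between the two, which limits control capability and diversity. This paper presents MotionCtrl, a unified and flexible motion controller for video generation that controls camera and object motion precisely and independently. Its architecture and training strategy take into account the inherent properties of camera motion, object motion, and imperfect training data. Compared with previous methods, MotionCtrl offers three main advantages: 1) it controls camera and object motion precisely and independently, enabling finer-grained motion control and more diverse generated videos; 2) its motion conditions are determined by camera poses and trajectories, which are appearance-free and minimally affect the appearance or shape of objects; 3) it is a relatively generalizable model that adapts to a wide range of camera poses and trajectories. Experiments show that, once trained, MotionCtrl adapts to many camera poses and trajectories.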

Not All Large Language Models (LLMs) Succumb to the “Reversal Curse”: A Comparative Study of Deductive Logical Reasoning in BERT and GPT Models

paper_url: http://arxiv.org/abs/2312.03633
repo_url: None
paper_authors: Jingye Yang, Da Wu, Kai Wang
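for: This study examines the "Reversal Curse", in which auto-regressive decoder large language models (LLMs) trained on "A is B" fail to learn "B is A", and asks whether this basic failure raises a red flag for general tasks such as constructing knowledge graphs.
methods: Examines a bidirectional LLM, BERT, and finds it immune to the reversal curse. The study also evaluates more complex logical reasoning: union and intersection operations on two sets, followed by combinations of those operations on three sets.
results: Both encoder and decoder language models perform well on operations over two sets but struggle on operations over three sets. Encoder and decoder models therefore differ in simple and complex logical reasoning, and in practice the choice between BERT and GPT should follow the specific requirements and nature of the task so that each model's strengths are used.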

Abstract
The "Reversal Curse" refers to the scenario where auto-regressive decoder large language models (LLMs), such as ChatGPT, trained on "A is B" fail to learn "B is A", demonstrating a basic failure of logical deduction. This raises a red flag in the use of GPT models for certain general tasks such as constructing knowledge graphs, considering their adherence to this symmetric principle. In our study, we examined a bidirectional LLM, BERT, and found that it is immune to the reversal curse. Driven by ongoing efforts to construct biomedical knowledge graphs with LLMs, we also embarked on evaluating more complex but essential deductive reasoning capabilities. This process included first training encoder and decoder language models to master the intersection ($\cap$) and union ($\cup$) operations on two sets and then moving on to assess their capability to infer different combinations of union ($\cup$) and intersection ($\cap$) operations on three newly created sets. The findings showed that while both encoder and decoder language models, trained for tasks involving two sets (union/intersection), were proficient in such scenarios, they encountered difficulties when dealing with operations that included three sets (various combinations of union and intersection). Our research highlights the distinct characteristics of encoder and decoder models in simple and complex logical reasoning. In practice, the choice between BERT and GPT should be guided by the specific requirements and nature of the task at hand, leveraging their respective strengths in bidirectional context comprehension and sequence prediction.
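To make the three-set evaluation concrete, the ground truth for every combination of union and intersection over three sets can be computed directly with Python sets. The set contents, the left-to-right grouping `(A op1 B) op2 C`, and the function name below are assumptions for illustration; the abstract does not specify the data format or the exact combinations used.

```python
from itertools import product

# Map operation names to the built-in set methods.
OPS = {"union": set.union, "intersection": set.intersection}


def combine(a, b, c, op1, op2):
    """Evaluate (a op1 b) op2 c, where each op is 'union' or 'intersection'."""
    return OPS[op2](OPS[op1](a, b), c)


if __name__ == "__main__":
    # Hypothetical example sets.
    A, B, C = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}
    for op1, op2 in product(OPS, repeat=2):
        print(f"(A {op1} B) {op2} C =", sorted(combine(A, B, C, op1, op2)))
```

A model that has only seen two-set examples must compose two operations to answer these queries, which is the step the abstract reports as difficult for both encoder and decoder models.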

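Summary
The "Reversal Curse" refers to auto-regressive decoder large language models (LLMs), such as ChatGPT, that are trained on "A is B" but fail to learn "B is A", a basic failure of logical deduction. This raises a red flag for using GPT models in general tasks such as constructing knowledge graphs, which depend on this symmetric principle. In our study, we examined a bidirectional model, BERT, and found it immune to the reversal curse. Motivated by ongoing efforts to build biomedical knowledge graphs with LLMs, we also evaluated more complex but essential deductive reasoning. We first trained language models to master the intersection ($\cap$) and union ($\cup$) operations on two sets, then assessed their ability to infer different combinations of union ($\cup$) and intersection ($\cap$) on three newly created sets. Both kinds of model performed well on the two-set tasks but struggled with three sets. Our research highlights the distinct characteristics of the two kinds of model in simple and complex logical reasoning. In practice, the choice between BERT and GPT should follow the specific requirements and nature of the task, leveraging their respective strengths in bidirectional context comprehension and sequence prediction.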

MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations

paper_url: http://arxiv.org/abs/2312.03631
repo_url: https://github.com/assafbk/mocha_code
paper_authors: Assaf Ben-Kish, Moran Yanuka, Morris Alper, Raja Giryes, Hadar Averbuch-Elor
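for: Improving the fidelity and semantic adequacy of image captions.
methods: Uses reinforcement learning to address hallucinations in image captioning, with a multi-objective reward function that jointly optimizes fidelity and semantic adequacy.
results: Across model scales, MOCHa optimizes fidelity and semantic adequacy together and performs well in the open-vocabulary setting. The paper also contributes OpenCHAIR, a new benchmark for quantifying open-vocabulary hallucinations.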

Abstract
While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations, the generation of spurious details that cannot be inferred from the given image. Dedicated methods for reducing hallucinations in image captioning largely focus on closed-vocabulary object tokens, ignoring most types of hallucinations that occur in practice. In this work, we propose MOCHa, an approach that harnesses advancements in reinforcement learning (RL) to address the sequence-level nature of hallucinations in an open-world setup. To optimize for caption fidelity to the input image, we leverage ground-truth reference captions as proxies to measure the logical consistency of generated captions. However, optimizing for caption fidelity alone fails to preserve the semantic adequacy of generations; therefore, we propose a multi-objective reward function that jointly targets these qualities, without requiring any strong supervision. We demonstrate that these goals can be simultaneously optimized with our framework, enhancing performance for various captioning models of different scales. Our qualitative and quantitative results demonstrate MOCHa's superior performance across various established metrics. We also demonstrate the benefit of our method in the open-vocabulary setting. To this end, we contribute OpenCHAIR, a new benchmark for quantifying open-vocabulary hallucinations in image captioning models, constructed using generative foundation models. We will release our code, benchmark, and trained models.

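Summary
Image-conditioned text generation has progressed rapidly in recent years, yet image captioning still suffers from hallucinations: generated details that are not present in the image. Existing methods for reducing hallucinations focus largely on closed-vocabulary object tokens and ignore most hallucinations that occur in practice. In this work, we propose MOCHa, a reinforcement learning (RL) approach that addresses the sequence-level nature of hallucinations in an open-world setup. To optimize caption fidelity to the input image, we use ground-truth reference captions as proxies for measuring the logical consistency of generated captions. Optimizing for fidelity alone does not preserve the semantic adequacy of generations, so we propose a multi-objective reward function that targets both qualities jointly without requiring strong supervision. We show that these goals can be optimized simultaneously within our framework, improving captioning models of different scales. Our qualitative and quantitative results demonstrate MOCHa's superior performance, and we show the benefit of our method in the open-vocabulary setting. To this end, we contribute OpenCHAIR, a new benchmark for quantifying open-vocabulary hallucinations in image captioning models. We will release our code, benchmark, and trained models.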

DreamComposer: Controllable 3D Object Generation via Multi-View Conditions

paper_url: http://arxiv.org/abs/2312.03611
repo_url: https://github.com/yhyang-myron/DreamComposer
paper_authors: Yunhan Yang, Yukun Huang, Xiaoyang Wu, Yuan-Chen Guo, Song-Hai Zhang, Hengshuang Zhao, Tong He, Xihui Liu
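for: This paper enhances existing view-aware diffusion models so that they can generate controllable novel views.
methods: A view-aware 3D lifting module obtains 3D representations of an object from multiple views. A multi-view feature fusion module then renders the latent features of the target view from those representations. Finally, the target-view features are injected into a pre-trained diffusion model to generate high-quality novel view images.
results: Experiments show that DreamComposer is compatible with existing diffusion models for zero-shot novel view synthesis. It generates high-quality novel view images that accurately capture the object's shape and position under multi-view conditions.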

Abstract
Utilizing pre-trained 2D large-scale generative models, recent works are capable of generating high-quality novel views from a single in-the-wild image. However, due to the lack of information from multiple views, these works encounter difficulties in generating controllable novel views. In this paper, we present DreamComposer, a flexible and scalable framework that can enhance existing view-aware diffusion models by injecting multi-view conditions. Specifically, DreamComposer first uses a view-aware 3D lifting module to obtain 3D representations of an object from multiple views. Then, it renders the latent features of the target view from 3D representations with the multi-view feature fusion module. Finally the target view features extracted from multi-view inputs are injected into a pre-trained diffusion model. Experiments show that DreamComposer is compatible with state-of-the-art diffusion models for zero-shot novel view synthesis, further enhancing them to generate high-fidelity novel view images with multi-view conditions, ready for controllable 3D object reconstruction and various other applications.

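Summary
Using pre-trained large-scale 2D generative models, recent work can generate high-quality novel views from a single in-the-wild image. However, lacking information from multiple views, these methods have difficulty generating controllable novel views. In this paper, we present DreamComposer, a flexible and scalable framework that enhances existing view-aware diffusion models. Specifically, DreamComposer first uses a view-aware 3D lifting module to obtain 3D representations of an object from multiple views. It then uses the multi-view feature fusion module to render the latent features of the target view. Finally, the target-view features extracted from the multi-view inputs are injected into a pre-trained diffusion model. Experiments show that DreamComposer is compatible with existing diffusion models and further enhances them to generate high-fidelity novel view images with multi-view conditions, ready for controllable 3D object reconstruction and various other applications.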

DiffusionSat: A Generative Foundation Model for Satellite Imagery

paper_url: http://arxiv.org/abs/2312.03606
repo_url: None
paper_authors: Samar Khanna, Patrick Liu, Linqi Zhou, Chenlin Meng, Robin Rombach, Marshall Burke, David Lobell, Stefano Ermon
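for: This paper targets generative models for remote sensing data, which is used in important applications such as environmental monitoring and crop-yield prediction.
methods: Proposes DiffusionSat, trained on a collection of large, publicly available, high-resolution remote sensing datasets, with a new conditioning technique that uses metadata such as geolocation as conditioning information for image generation.
results: Experiments show that DiffusionSat generates high-quality satellite images and handles multiple generative tasks, including temporal generation, super-resolution from multi-spectral inputs, and in-painting. It outperforms previous state-of-the-art models and is the first large-scale generative foundation model for satellite imagery.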

Abstract
Diffusion models have achieved state-of-the-art results on many modalities including images, speech, and video. However, existing models are not tailored to support remote sensing data, which is widely used in important applications including environmental monitoring and crop-yield prediction. Satellite images are significantly different from natural images -- they can be multi-spectral, irregularly sampled across time -- and existing diffusion models trained on images from the Web do not support them. Furthermore, remote sensing data is inherently spatio-temporal, requiring conditional generation tasks not supported by traditional methods based on captions or images. In this paper, we present DiffusionSat, to date the largest generative foundation model trained on a collection of publicly available large, high-resolution remote sensing datasets. As text-based captions are sparsely available for satellite images, we incorporate the associated metadata such as geolocation as conditioning information. Our method produces realistic samples and can be used to solve multiple generative tasks including temporal generation, superresolution given multi-spectral inputs and in-painting. Our method outperforms previous state-of-the-art methods for satellite image generation and is the first large-scale generative foundation model for satellite imagery.

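Summary
Diffusion models have achieved state-of-the-art results on many modalities, including images, speech, and video. However, existing models are not tailored to remote sensing data, which is widely used in important applications such as environmental monitoring and crop-yield prediction. Satellite images differ significantly from natural images: they can be multi-spectral and irregularly sampled across time, and existing diffusion models trained on images from the Web do not support them. Furthermore, remote sensing data is inherently spatio-temporal, requiring conditional generation tasks that traditional methods based on captions or images do not support. In this paper, we present DiffusionSat, to date the largest generative foundation model trained on a collection of publicly available, large, high-resolution remote sensing datasets. Because text captions are sparsely available for satellite images, we use associated metadata such as geolocation as conditioning information. Our method produces realistic samples and can be used for multiple generative tasks, including temporal generation, super-resolution given multi-spectral inputs, and in-painting. It outperforms previous state-of-the-art methods and is the first large-scale generative foundation model for satellite imagery.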

MMM: Generative Masked Motion Model

paper_url: http://arxiv.org/abs/2312.03596
repo_url: None
paper_authors: Ekkasit Pinyoanuntapong, Pu Wang, Minwoo Lee, Chen Chen
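for: This paper proposes MMM, a new motion generation method based on a Masked Motion Model, to resolve the trade-off between real-time performance and high fidelity in existing motion generation methods.
methods: Uses two key components: (1) a motion tokenizer that converts 3D human motion into a sequence of discrete tokens in latent space, and (2) a conditional masked motion transformer that learns to predict randomly masked motion tokens, conditioned on pre-computed text tokens.
results: Extensive experiments on the HumanML3D and KIT-ML datasets show that the method achieves both high-fidelity and high-speed motion generation, with advanced editing features such as body-part modification, motion in-betweening, and synthesis of long motion sequences. It is also two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models.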

Abstract
Recent advances in text-to-motion generation using diffusion and autoregressive models have shown promising results. However, these models often suffer from a trade-off between real-time performance, high fidelity, and motion editability. To address this gap, we introduce MMM, a novel yet simple motion generation paradigm based on Masked Motion Model. MMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into a sequence of discrete tokens in latent space, and (2) a conditional masked motion transformer that learns to predict randomly masked motion tokens, conditioned on the pre-computed text tokens. By attending to motion and text tokens in all directions, MMM explicitly captures inherent dependency among motion tokens and semantic mapping between motion and text tokens. During inference, this allows parallel and iterative decoding of multiple motion tokens that are highly consistent with fine-grained text descriptions, therefore simultaneously achieving high-fidelity and high-speed motion generation. In addition, MMM has innate motion editability. By simply placing mask tokens in the place that needs editing, MMM automatically fills the gaps while guaranteeing smooth transitions between editing and non-editing parts. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that MMM surpasses current leading methods in generating high-quality motion (evidenced by superior FID scores of 0.08 and 0.429), while offering advanced editing features such as body-part modification, motion in-betweening, and the synthesis of long motion sequences. In addition, MMM is two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models. Our project page is available at https://exitudio.github.io/MMM-page.
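The "parallel and iterative decoding" described in the abstract above can be illustrated with a toy loop: start from a fully masked sequence, predict every position at once, commit the most confident predictions, and repeat on what is still masked. This is a sketch under assumptions, not MMM's implementation: the confidence-based commit schedule, the `MASK` sentinel, and the stand-in predictor (which ignores its input, whereas a real model would condition on text tokens and on tokens already committed) are all hypothetical.

```python
import numpy as np

MASK = -1  # hypothetical sentinel for a masked position


def iterative_masked_decode(predict_logits, length, num_iters=4):
    """Fill a fully masked token sequence over a few parallel rounds."""
    tokens = np.full(length, MASK, dtype=int)
    for step in range(num_iters):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        logits = predict_logits(tokens)  # shape (length, vocab_size)
        # Softmax over the vocabulary to get per-position confidence.
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        best = probs.argmax(axis=1)
        confidence = probs.max(axis=1)
        # Spread the remaining masked positions over the remaining rounds.
        n_commit = int(np.ceil(masked.size / (num_iters - step)))
        # Commit the most confident masked positions; the rest stay masked.
        chosen = masked[np.argsort(-confidence[masked])][:n_commit]
        tokens[chosen] = best[chosen]
    return tokens


def toy_predict_logits(tokens, vocab_size=3):
    """Stand-in predictor: favours token (position mod vocab_size)."""
    logits = np.zeros((len(tokens), vocab_size))
    for i in range(len(tokens)):
        logits[i, i % vocab_size] = 1.0 + i
    return logits


if __name__ == "__main__":
    print(iterative_masked_decode(toy_predict_logits, length=6))
```

The same loop also covers editing: initializing `tokens` with known values and placing `MASK` only where changes are wanted would make the loop fill just those positions.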

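Summary
Recent text-to-motion generation with diffusion and autoregressive models has shown promising results. However, these models often face a trade-off among real-time performance, high fidelity, and motion editability. To address this gap, we introduce MMM, a novel yet simple motion generation paradigm based on a Masked Motion Model. MMM consists of two key components: (1) a motion tokenizer that converts 3D human motion into a sequence of discrete tokens in latent space, and (2) a conditional masked motion transformer that learns to predict masked motion tokens, conditioned on pre-computed text tokens. By attending to motion and text tokens together, MMM explicitly captures the inherent dependencies among motion tokens and the semantic mapping between motion and text tokens. During inference, it decodes multiple motion tokens in parallel, achieving both high-fidelity and high-speed motion generation. MMM also has built-in motion editability: placing mask tokens where editing is needed makes MMM fill the gaps automatically while keeping transitions between edited and unedited parts smooth. Extensive experiments on the HumanML3D and KIT-ML datasets show that MMM surpasses current leading methods in generating high-quality motion (FID scores of 0.08 and 0.429) while offering advanced editing features such as body-part modification, motion in-betweening, and synthesis of long motion sequences. MMM is also two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models. For more details, see the project page at https://exitudio.github.io/MMM-page.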

Foundation Model Assisted Weakly Supervised Semantic Segmentation

paper_url: http://arxiv.org/abs/2312.03585
repo_url: https://github.com/HAL-42/FMA-WSSS
paper_authors: Xiaobo Yang, Xiaojin Gong
for: Addressing weakly supervised semantic segmentation (WSSS) using image-level labels.
methods: Leveraging pre-trained foundation models (CLIP and SAM) to generate high-quality segmentation seeds, and using a coarse-to-fine framework with multi-label contrastive loss and CAM activation loss to learn the prompts.
results: Achieving state-of-the-art performance on PASCAL VOC 2012 and competitive results on MS COCO 2014.

Abstract
This work aims to leverage pre-trained foundation models, such as contrastive language-image pre-training (CLIP) and segment anything model (SAM), to address weakly supervised semantic segmentation (WSSS) using image-level labels. To this end, we propose a coarse-to-fine framework based on CLIP and SAM for generating high-quality segmentation seeds. Specifically, we construct an image classification task and a seed segmentation task, which are jointly performed by CLIP with frozen weights and two sets of learnable task-specific prompts. A SAM-based seeding (SAMS) module is designed and applied to each task to produce either coarse or fine seed maps. Moreover, we design a multi-label contrastive loss supervised by image-level labels and a CAM activation loss supervised by the generated coarse seed map. These losses are used to learn the prompts, which are the only parts that need to be learned in our framework. Once the prompts are learned, we input each image along with the learned segmentation-specific prompts into CLIP and the SAMS module to produce high-quality segmentation seeds. As in other two-stage WSSS methods, these seeds serve as pseudo labels to train an off-the-shelf segmentation network. Experiments show that our method achieves state-of-the-art performance on PASCAL VOC 2012 and competitive results on MS COCO 2014.

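Summary
This work aims to leverage pre-trained foundation models, such as contrastive language-image pre-training (CLIP) and the segment anything model (SAM), to address weakly supervised semantic segmentation (WSSS) using image-level labels. To this end, we propose a coarse-to-fine framework based on CLIP and SAM for generating high-quality segmentation seeds. Specifically, we construct an image classification task and a seed segmentation task, performed jointly by CLIP with frozen weights and two sets of learnable task-specific prompts. A SAM-based module is applied to each task to produce coarse or fine seed maps. We also design a multi-label contrastive loss supervised by image-level labels and a CAM activation loss supervised by the generated coarse seed map. These losses are used to learn the prompts, the only parts of the framework that need to be learned. Once the prompts are learned, each image is fed with the learned segmentation-specific prompts into CLIP and the SAM module to produce high-quality segmentation seeds. As in other two-stage WSSS methods, these seeds serve as pseudo labels for training an off-the-shelf segmentation network. Experiments show that our method achieves state-of-the-art performance on PASCAL VOC 2012 and competitive results on MS COCO 2014.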

Invariance & Causal Representation Learning: Prospects and Limitations

paper_url: http://arxiv.org/abs/2312.03580
repo_url: None
paper_authors: Simon Bing, Jonas Wahl, Urmi Ninad, Jakob Runge
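for: This paper studies the invariance of mechanisms in causal models.
methods: Uses theoretical impossibility results together with practical considerations to examine whether mechanism invariance can be used to find latent causal variables.
results: Invariance alone is insufficient to identify latent causal variables; additional constraints are needed to identify representations.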

Abstract
In causal models, a given mechanism is assumed to be invariant to changes of other mechanisms. While this principle has been utilized for inference in settings where the causal variables are observed, theoretical insights when the variables of interest are latent are largely missing. We assay the connection between invariance and causal representation learning by establishing impossibility results which show that invariance alone is insufficient to identify latent causal variables. Together with practical considerations, we use these theoretical findings to highlight the need for additional constraints in order to identify representations by exploiting invariance.

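Summary
In causal models, a given mechanism is assumed to be invariant to changes in other mechanisms. This principle has been used for inference when the causal variables are observed, but theoretical insight for the case of latent variables is largely missing. We establish impossibility results showing that invariance alone cannot identify latent causal variables. Together with practical considerations, we use these theoretical findings to highlight the need for additional constraints in order to identify representations by exploiting invariance.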

Generalization to New Sequential Decision Making Tasks with In-Context Learning

paper_url: http://arxiv.org/abs/2312.03801
repo_url: None
paper_authors: Sharath Chandra Raparthy, Eric Hambro, Robert Kirk, Mikael Henaff, Roberta Raileanu
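for: This paper tackles the problem of training agents that can learn new tasks from only a handful of demonstrations.
methods: Transformers can learn new language or vision tasks in context, but they cannot be applied directly to new tasks in the sequential decision making setting. The authors instead train on sequences of trajectories to obtain in-context learning of new tasks.
results: An illustrative example shows that training on sequences of trajectories enables in-context learning of new tasks. Among the design choices studied, larger model and dataset sizes, more task diversity, environment stochasticity, and trajectory burstiness all lead to better in-context learning of new tasks. Trained on large, diverse offline datasets, the model learns new MiniHack and Procgen tasks from a handful of demonstrations.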

Abstract
Training autonomous agents that can learn new tasks from only a handful of demonstrations is a long-standing problem in machine learning. Recently, transformers have been shown to learn new language or vision tasks without any weight updates from only a few examples, also referred to as in-context learning. However, the sequential decision making setting poses additional challenges having a lower tolerance for errors since the environment's stochasticity or the agent's actions can lead to unseen, and sometimes unrecoverable, states. In this paper, we use an illustrative example to show that naively applying transformers to sequential decision making problems does not enable in-context learning of new tasks. We then demonstrate how training on sequences of trajectories with certain distributional properties leads to in-context learning of new sequential decision making tasks. We investigate different design choices and find that larger model and dataset sizes, as well as more task diversity, environment stochasticity, and trajectory burstiness, all result in better in-context learning of new out-of-distribution tasks. By training on large diverse offline datasets, our model is able to learn new MiniHack and Procgen tasks without any weight updates from just a handful of demonstrations.
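One of the distributional properties named in the abstract above, trajectory burstiness, can be illustrated by how training sequences are assembled. The sketch below is an assumption about what such sampling could look like, not the paper's data pipeline: here a "bursty" sequence is one whose trajectories all come from a single task, `burst_prob` controls how often that happens, and the task and trajectory names are placeholders.

```python
import random


def sample_context(trajectories_by_task, n_trajectories, burst_prob, rng):
    """Sample one training sequence as a list of (task, trajectory) pairs.

    With probability burst_prob every trajectory comes from the same task
    (a bursty sequence); otherwise each trajectory's task is drawn
    independently.
    """
    tasks = list(trajectories_by_task)
    if rng.random() < burst_prob:
        # Bursty: repeat one task so the context demonstrates it several times.
        task = rng.choice(tasks)
        chosen_tasks = [task] * n_trajectories
    else:
        chosen_tasks = [rng.choice(tasks) for _ in range(n_trajectories)]
    return [(t, rng.choice(trajectories_by_task[t])) for t in chosen_tasks]


if __name__ == "__main__":
    # Placeholder data: trajectory ids grouped by task name.
    data = {"maze": ["m1", "m2"], "climb": ["c1", "c2"], "key": ["k1"]}
    rng = random.Random(0)
    print(sample_context(data, 3, burst_prob=0.9, rng=rng))
```

Bursty sequences give the model earlier trajectories that are informative about later ones in the same context, which is the situation it faces at test time when it is shown a few demonstrations of an unseen task.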

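Summary
Training autonomous agents that can learn new tasks from only a few demonstrations is a long-standing problem in machine learning. Recently, transformers have been shown to learn new language or vision tasks from only a few examples without any weight updates, which is known as in-context learning. However, sequential decision making has a lower tolerance for errors, because the environment's stochasticity or the agent's actions can lead to unseen and sometimes unrecoverable states. In this paper, we use an illustrative example to show that naively applying transformers to sequential decision making problems does not enable in-context learning of new tasks. We then show that training on sequences of trajectories with certain distributional properties enables in-context learning of new sequential decision making tasks. We investigate different design choices and find that larger model and dataset sizes, more task diversity, environment stochasticity, and trajectory burstiness all lead to better in-context learning of new out-of-distribution tasks. By training on large, diverse offline datasets, our model learns new MiniHack and Procgen tasks from a handful of demonstrations, without any weight updates.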

paper_url: http://arxiv.org/abs/2312.03543
repo_url: https://github.com/petrichor625/talk2car_cavg
paper_authors: Haicheng Liao, Huanming Shen, Zhenning Li, Chengyue Wang, Guofa Li, Yiming Bie, Chengzhong Xu
for: This paper aims to improve the ability of autonomous vehicles (AVs) to understand and execute linguistic commands in a visual context.
methods: The authors propose an encoder-decoder framework called Context-Aware Visual Grounding (CAVG), which integrates five core encoders (Text, Image, Context, and Cross-Modal) with a Multimodal decoder. The model is augmented by state-of-the-art Large Language Models (LLMs), including GPT-4, and incorporates multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation.
results: The CAVG model achieves new standards in prediction accuracy and operational efficiency on the Talk2Car dataset, a real-world benchmark. It demonstrates exceptional performance even with limited training data, and shows remarkable robustness and adaptability in challenging scenarios such as long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather conditions, and densely populated urban environments.

Abstract
In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This paper introduces a sophisticated encoder-decoder framework, developed to address visual grounding in AVs. Our Context-Aware Visual Grounding (CAVG) model is an advanced system that integrates five core encoders (Text, Image, Context, and Cross-Modal) with a Multimodal decoder. This integration enables the CAVG model to adeptly capture contextual semantics and to learn human emotional features, augmented by state-of-the-art Large Language Models (LLMs) including GPT-4. The architecture of CAVG is reinforced by the implementation of multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation. This architectural design enables the model to efficiently process and interpret a range of cross-modal inputs, yielding a comprehensive understanding of the correlation between verbal commands and corresponding visual scenes. Empirical evaluations on the Talk2Car dataset, a real-world benchmark, demonstrate that CAVG establishes new standards in prediction accuracy and operational efficiency. Notably, the model exhibits exceptional performance even with limited training data, ranging from 50% to 75% of the full dataset. This feature highlights its effectiveness and potential for deployment in practical AV applications. Moreover, CAVG has shown remarkable robustness and adaptability in challenging scenarios, including long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather conditions, and densely populated urban environments. The code for the proposed model is available at our Github.

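Summary
In the field of autonomous vehicles (AVs), accurately understanding commander intent and executing linguistic commands within a visual context is a significant challenge. This paper introduces an advanced encoder-decoder framework for visual grounding in AVs. Our Context-Aware Visual Grounding (CAVG) model comprises five core encoders (Text, Image, Context, and Cross-Modal) and a Multimodal decoder. This integration lets CAVG capture contextual semantics and learn human emotional features with the help of state-of-the-art Large Language Models (LLMs), including GPT-4. The architecture is reinforced by multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation. This design lets the model efficiently process and interpret a range of cross-modal inputs and capture the relationship between verbal commands and the corresponding visual scenes. Experiments show that CAVG sets new standards on the Talk2Car dataset and performs well even with limited training data. CAVG also shows strong robustness and adaptability in challenging scenarios, including long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather, and densely populated urban environments. The code for CAVG is available on our Github.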

Low-power, Continuous Remote Behavioral Localization with Event Cameras

paper_url: http://arxiv.org/abs/2312.03799
repo_url: None
paper_authors: Friedhelm Hamann, Suman Ghosh, Ignacio Juarez Martinez, Tom Hart, Alex Kacelnik, Guillermo Gallego
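for: This study aims to develop a reliable computer vision method for observing wild animals at remote locations, in order to automate the quantification of animal behavior.
methods: The study uses event cameras, whose low power consumption and high dynamic range suit battery-dependent remote monitoring of wildlife. The problem is formulated as a temporal action detection task, with event data labeled on 16 nests. The method consists of a generator of candidate time intervals (proposals) and a classifier of the actions within them.
results: Experiments show that the event cameras' natural response to motion is effective for continuous behavior monitoring and detection, reaching an mAP of 58% (63% in good weather conditions). The study also demonstrates robustness to varied lighting conditions. An event camera can record three times longer than a conventional camera. The work opens new possibilities for remote wildlife observation.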

Abstract
Researchers in natural science need reliable methods for quantifying animal behavior. Recently, numerous computer vision methods emerged to automate the process. However, observing wild species at remote locations remains a challenging task due to difficult lighting conditions and constraints on power supply and data storage. Event cameras offer unique advantages for battery-dependent remote monitoring due to their low power consumption and high dynamic range capabilities. We use this novel sensor to quantify a behavior in Chinstrap penguins called ecstatic display. We formulate the problem as a temporal action detection task, determining the start and end times of the behavior. For this purpose, we recorded a colony of breeding penguins in Antarctica over several weeks and labeled event data on 16 nests. The developed method consists of a generator of candidate time intervals (proposals) and a classifier of the actions within them. The experiments show that the event cameras' natural response to motion is effective for continuous behavior monitoring and detection, reaching a mean average precision (mAP) of 58% (which increases to 63% in good weather conditions). The results also demonstrate the robustness against various lighting conditions contained in the challenging dataset. The low-power capabilities of the event camera allow recording three times longer than with a conventional camera. This work pioneers the use of event cameras for remote wildlife observation, opening new interdisciplinary opportunities. https://tub-rip.github.io/eventpenguins/
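
The temporal action detection setup above scores each proposed time interval against the labeled start and end times of a behavior. A minimal sketch of that comparison follows, assuming intervals are `(start, end)` pairs in seconds and a temporal IoU threshold of 0.5. The function names, the threshold, and the greedy matching rule are illustrative assumptions, not the authors' evaluation code.

```python
def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) time intervals."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def match_proposals(proposals, truths, threshold=0.5):
    """Mark each (start, end, score) proposal as a hit or a miss.

    Proposals are visited from highest to lowest score, and each labeled
    interval can be matched at most once (assumed greedy rule).
    """
    matched = set()
    hits = []
    for start, end, _score in sorted(proposals, key=lambda p: -p[2]):
        best_j, best_iou = None, threshold
        for j, truth in enumerate(truths):
            if j in matched:
                continue
            iou = temporal_iou((start, end), truth)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            matched.add(best_j)
        hits.append(best_j is not None)
    return hits
```

Counting the hits at each score rank gives the precision and recall values from which an average precision could be computed.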

On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm

paper_url: http://arxiv.org/abs/2312.03526
repo_url: None
paper_authors: Peng Sun, Bei Shi, Daiwei Yu, Tao Lin
for: This paper aims to improve the efficiency and practicality of dataset distillation for large-scale real-world applications.
methods: The proposed method, RDED, focuses on three key properties (realism, diversity, and efficiency) and uses a novel computationally-efficient approach to distill large datasets.
results: RDED achieves notable results, including distilling the full ImageNet-1K to a small dataset within 7 minutes and achieving a 42% top-1 accuracy with ResNet-18 on a single GPU, outperforming the state-of-the-art.

Abstract
Contemporary machine learning requires training large neural networks on massive datasets and thus faces the challenges of high computational demands. Dataset distillation, as a recent emerging strategy, aims to compress real-world datasets for efficient training. However, this line of research currently struggles with large-scale and high-resolution datasets, hindering its practicality and feasibility. To this end, we re-examine the existing dataset distillation methods and identify three properties required for large-scale real-world applications, namely, realism, diversity, and efficiency. As a remedy, we propose RDED, a novel computationally-efficient yet effective data distillation paradigm, to enable both diversity and realism of the distilled data. Extensive empirical results over various neural architectures and datasets demonstrate the advancement of RDED: we can distill the full ImageNet-1K to a small dataset comprising 10 images per class within 7 minutes, achieving a notable 42% top-1 accuracy with ResNet-18 on a single RTX-4090 GPU (while the SOTA only achieves 21% but requires 6 hours).

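Summary
Modern machine learning requires training large neural networks and therefore faces high computational demands. Dataset distillation, an emerging strategy, aims to compress real-world datasets for efficient training. However, this line of research currently struggles with large-scale, high-resolution datasets, which limits its practicality and feasibility. To this end, we re-examine existing dataset distillation methods and identify three properties required for large-scale real-world applications: realism, diversity, and efficiency. As a remedy, we propose RDED, a new computationally efficient yet effective data distillation paradigm that enables both diversity and realism of the distilled data. Our experiments show that RDED can distill the full ImageNet-1K into a small dataset of 10 images per class within 7 minutes, achieving 42% top-1 accuracy with ResNet-18 (while the SOTA achieves only 21% and requires 6 hours).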

paper_url: http://arxiv.org/abs/2312.03796
repo_url: None
paper_authors: Hongbo Guo, Xinzi Xu, Hao Wu, Guoxing Wang
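for: This paper proposes a representation learning model for multi-modal biomedical time series (MBTS) data that copes with variations between modalities.
methods: The paper proposes a multi-scale and multi-modal biomedical time series representation learning (MBSL) network that uses contrastive learning to mitigate variations between modalities.
results: Across four biomedical applications, experiments show that MBSL outperforms previous state-of-the-art models by 33.9% MAE in respiration rate estimation, 13.8% MAE in exercise heart rate estimation, 1.41% accuracy in human activity recognition, and 1.14% F1-score in obstructive sleep apnea-hypopnea syndrome detection.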

Abstract
Multi-modal biomedical time series (MBTS) data offers a holistic view of the physiological state, holding significant importance in various bio-medical applications. Owing to inherent noise and distribution gaps across different modalities, MBTS can be complex to model. Various deep learning models have been developed to learn representations of MBTS but still fall short in robustness because they ignore modal-to-modal variations. This paper presents a multi-scale and multi-modal biomedical time series representation learning (MBSL) network with contrastive learning to mitigate these variations. Firstly, MBTS is grouped based on inter-modal distances, then each group with minimum intra-modal variations can be effectively modeled by individual encoders. Besides, to enhance the multi-scale feature extraction (encoder), various patch lengths and mask ratios are designed to generate tokens with semantic information at different scales and diverse contextual perspectives respectively. Finally, cross-modal contrastive learning is proposed to maximize consistency among inter-modal groups, maintaining useful information and eliminating noises. Experiments against four bio-medical applications show that MBSL outperforms state-of-the-art models by 33.9% mean average errors (MAE) in respiration rate, by 13.8% MAE in exercise heart rate, by 1.41% accuracy in human activity recognition, and by 1.14% F1-score in obstructive sleep apnea-hypopnea syndrome.

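Summary
Multi-modal biomedical time series (MBTS) data offers a holistic view of the physiological state and is important in many biomedical applications. However, because of inherent noise and distribution gaps across modalities, MBTS can be complex to model. Various deep learning models have been developed to learn MBTS representations, but they still lack robustness because they ignore variations between modalities. This paper presents a multi-scale and multi-modal biomedical time series representation learning (MBSL) network that uses contrastive learning to mitigate these variations. First, MBTS is grouped based on inter-modal distances, so that each group, with minimal intra-modal variation, can be effectively modeled by an individual encoder. In addition, to enhance multi-scale feature extraction (the encoder), various patch lengths and mask ratios are designed to generate tokens with semantic information at different scales and from diverse contextual perspectives. Finally, cross-modal contrastive learning is proposed to maximize consistency among inter-modal groups, preserving useful information and eliminating noise. Experiments on four biomedical applications show that MBSL improves on state-of-the-art models by 33.9% MAE in respiration rate, 13.8% MAE in exercise heart rate, 1.41% accuracy in human activity recognition, and 1.14% F1-score in obstructive sleep apnea-hypopnea syndrome.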

Optimal Wildfire Escape Route Planning for Drones under Dynamic Fire and Smoke

paper_url: http://arxiv.org/abs/2312.03521
repo_url: None
paper_authors: Chang Liu, Tamas Sziranyi
for: aid wildfire management efforts by planning an optimal escape route for drones
methods: use information fusion between UAV and satellite, multi-channel remote sensing data, UAV vision technology, and improved A* algorithm
results: enhance the safety and efficiency of drone operations in wildfire environments by considering dynamic fire and smoke models

Abstract
In recent years, the increasing prevalence and intensity of wildfires have posed significant challenges to emergency response teams. The utilization of unmanned aerial vehicles (UAVs), commonly known as drones, has shown promise in aiding wildfire management efforts. This work focuses on the development of an optimal wildfire escape route planning system specifically designed for drones, considering dynamic fire and smoke models. First, the location of the source of the wildfire can be well located by information fusion between UAV and satellite, and the road conditions in the vicinity of the fire can be assessed and analyzed using multi-channel remote sensing data. Second, the road network can be extracted and segmented in real time using UAV vision technology, and each road in the road network map can be given priority based on the results of road condition classification. Third, the spread model of dynamic fires calculates the new location of the fire source based on the fire intensity, wind speed and direction, and the radius increases as the wildfire spreads. Smoke is generated around the fire source to create a visual representation of a burning fire. Finally, based on the improved A* algorithm, which considers all the above factors, the UAV can quickly plan an escape route based on the starting and destination locations that avoid the location of the fire source and the area where it is spreading. By considering dynamic fire and smoke models, the proposed system enhances the safety and efficiency of drone operations in wildfire environments.

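Summary
In recent years, the growing prevalence and intensity of wildfires have posed significant challenges to emergency response teams. The use of unmanned aerial vehicles (UAVs) has shown promise in aiding wildfire management. This work focuses on developing a wildfire escape route planning system for UAVs that considers dynamic fire and smoke models. First, the location of the wildfire source is determined by fusing UAV and satellite information. Second, the road network near the fire is extracted and classified in real time using multi-channel remote sensing, and each road in the network map is assigned a priority. Third, a dynamic fire spread model computes the new location of the fire source from the fire intensity, wind speed, and wind direction. Smoke is generated around the fire source to create a visual representation of a burning fire. Finally, an improved A* algorithm that considers the above factors lets the UAV quickly plan an escape route that avoids the fire source and the area where it is spreading. By considering dynamic fire and smoke models, the proposed system improves the safety and efficiency of drone operations in wildfire environments.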
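
The planning step can be illustrated with a time-aware A* search on a small grid. This is a hedged sketch, not the paper's improved A* algorithm: the circular fire whose radius grows linearly with time, the parameters `r0` and `growth`, the 4-connected grid, and all function names are assumptions made for illustration, and wind and smoke are ignored.

```python
import heapq
import math


def fire_radius(t, r0=1.0, growth=0.25):
    """Assumed fire model: a circle whose radius grows linearly with time."""
    return r0 + growth * t


def is_burning(cell, t, fire_center, r0=1.0, growth=0.25):
    """True if the cell lies inside the fire circle at time step t."""
    return math.dist(cell, fire_center) <= fire_radius(t, r0, growth)


def plan_escape(start, goal, size, fire_center, max_steps=200):
    """A* over (cell, time) states on a size x size grid, one move per time step."""
    def h(cell):
        # Manhattan distance is admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start, [start])]
    seen = {(start, 0)}
    while frontier:
        _, t, cell, path = heapq.heappop(frontier)
        if cell == goal:
            return path
        if t >= max_steps:
            continue
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dx, cell[1] + dy)
            if not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
                continue
            # A cell is unusable if the fire has reached it on arrival.
            if is_burning(nxt, t + 1, fire_center) or (nxt, t + 1) in seen:
                continue
            seen.add((nxt, t + 1))
            heapq.heappush(frontier, (t + 1 + h(nxt), t + 1, nxt, path + [nxt]))
    return None  # no route avoids the fire within max_steps
```

Because the fire state depends on the arrival time, the search keys visited states by `(cell, time)` rather than by cell alone, which lets a route pass through a cell before the fire reaches it.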

Defense Against Adversarial Attacks using Convolutional Auto-Encoders

paper_url: http://arxiv.org/abs/2312.03520
repo_url: None
paper_authors: Shreyasi Mandal
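for: strengthening targeted classifier models against adversarial attacks
methods: a convolutional autoencoder-based approach that counters adversarial perturbations
results: aims to restore the model's accuracy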

Abstract
Deep learning models, while achieving state-of-the-art performance on many tasks, are susceptible to adversarial attacks that exploit inherent vulnerabilities in their architectures. Adversarial attacks manipulate the input data with imperceptible perturbations, causing the model to misclassify the data or produce erroneous outputs. This work is based on enhancing the robustness of targeted classifier models against adversarial attacks. To achieve this, a convolutional autoencoder-based approach is employed that effectively counters adversarial perturbations introduced to the input images. By generating images closely resembling the input images, the proposed methodology aims to restore the model's accuracy.

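Summary
Deep learning models achieve state-of-the-art performance on many tasks but are vulnerable to adversarial attacks. These attacks manipulate the input data with imperceptible perturbations, causing the model to misclassify the data or produce erroneous outputs. This work focuses on enhancing the robustness of targeted classifier models against adversarial attacks. To this end, we employ a convolutional autoencoder-based approach that effectively counters adversarial perturbations in the input images. By generating images that closely resemble the input images, our method aims to restore the model's accuracy.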

Active Wildfires Detection and Dynamic Escape Routes Planning for Humans through Information Fusion between Drones and Satellites

paper_url: http://arxiv.org/abs/2312.03519
repo_url: None
paper_authors: Chang Liu, Tamas Sziranyi
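for: This paper proposes a dynamic escape route planning method for people, based on UAV vision technology and satellite image analysis, that detects and locates wildfire sources and burning areas and provides real-time escape route planning.
methods: The methods include Sentinel 2 satellite image analysis, road segmentation and road condition assessment using D-linkNet and NDVI values in the central area of the fire source, and real-time dynamic optimal route planning for people.
results: A case study of the Chongqing wildfire on August 24, 2022 shows that the dynamic optimal route planning algorithm, using information from UAVs and satellite imagery, can provide people with an optimal escape route under real-time fire conditions.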

Abstract
UAVs are playing an increasingly important role in the field of wilderness rescue by virtue of their flexibility. This paper proposes a fusion of UAV vision technology and satellite image analysis technology for active wildfires detection and road networks extraction of wildfire areas and real-time dynamic escape route planning for people in distress. Firstly, the fire source location and the segmentation of smoke and flames are targeted based on Sentinel 2 satellite imagery. Secondly, the road segmentation and the road condition assessment are performed by D-linkNet and NDVI values in the central area of the fire source by UAV. Finally, the dynamic optimal route planning for humans in real time is performed by the weighted A* algorithm in the road network with the dynamic fire spread model. Taking the Chongqing wildfire on August 24, 2022, as a case study, the results demonstrate that the dynamic escape route planning algorithm can provide an optimal real-time navigation path for humans in the presence of fire through the information fusion of UAVs and satellites.

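Summary
UAVs are playing an increasingly important role in wilderness rescue thanks to their flexibility. This paper proposes fusing UAV vision technology with satellite image analysis for active wildfire detection, road network extraction in wildfire areas, and real-time dynamic escape route planning for people in distress. First, the fire source is located and smoke and flames are segmented from Sentinel 2 satellite imagery. Second, road segmentation and road condition assessment are performed with D-linkNet and NDVI values in the central area of the fire source. Finally, a weighted A* algorithm plans the optimal route in real time on the road network, using the dynamic fire spread model. Taking the Chongqing wildfire of August 24, 2022 as a case study, the results show that the dynamic escape route planning algorithm can provide an optimal real-time navigation path for people in the presence of fire through the fusion of UAV and satellite information.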
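
NDVI, used above for road condition assessment, is the standard normalized difference of near-infrared and red reflectance, (NIR - Red) / (NIR + Red); for Sentinel-2 these are commonly bands B8 and B4. The sketch below only computes the index per pixel. How the paper maps NDVI values to road conditions is not stated in the abstract, so no threshold is assumed here, and the function names are illustrative.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index for one pixel."""
    total = nir + red
    # Guard against division by zero on pixels with no signal.
    return (nir - red) / total if total != 0 else 0.0


def ndvi_map(nir_band, red_band):
    """Apply ndvi() element-wise to two equally sized 2D lists of reflectances."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]
```

Values lie in [-1, 1] for non-negative reflectances, with dense green vegetation toward the high end of the range.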

FRDiff: Feature Reuse for Exquisite Zero-shot Acceleration of Diffusion Models

paper_url: http://arxiv.org/abs/2312.03517
repo_url: None
paper_authors: Junhyuk So, Jungwon Lee, Eunhyeok Park
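for: Improving the computational efficiency of diffusion models so that they can be adopted more widely.
methods: Exploiting temporal redundancy by reusing feature maps with high temporal similarity, thereby reducing computation cost.
results: The proposed method, FRDiff, achieves a Pareto frontier that balances fidelity and latency across various generative tasks.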

Abstract
The substantial computational costs of diffusion models, particularly due to the repeated denoising steps crucial for high-quality image generation, present a major obstacle to their widespread adoption. While several studies have attempted to address this issue by reducing the number of score function evaluations using advanced ODE solvers without fine-tuning, the decreased number of denoising iterations misses the opportunity to update fine details, resulting in noticeable quality degradation. In our work, we introduce an advanced acceleration technique that leverages the temporal redundancy inherent in diffusion models. Reusing feature maps with high temporal similarity opens up a new opportunity to save computation without sacrificing output quality. To realize the practical benefits of this intuition, we conduct an extensive analysis and propose a novel method, FRDiff. FRDiff is designed to harness the advantages of both reduced NFE and feature reuse, achieving a Pareto frontier that balances fidelity and latency trade-offs in various generative tasks.

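Summary
Diffusion models are computationally expensive, particularly because high-quality image generation requires repeated denoising steps. Several studies have tried to reduce this cost by lowering the number of score function evaluations with advanced ODE solvers, but fewer denoising iterations miss the opportunity to update fine details, which degrades image quality. In our work, we introduce an advanced acceleration technique that exploits the temporal redundancy inherent in diffusion models. Reusing feature maps with high temporal similarity opens up a new opportunity to save computation without sacrificing output quality. To realize the practical benefits of this idea, we conduct an extensive analysis and propose a new method, FRDiff. FRDiff is designed to combine the advantages of reduced NFE and feature reuse, balancing fidelity and latency across various generative tasks.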
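
The feature reuse idea can be sketched with a toy denoising loop that recomputes an expensive block only every few steps and reuses its cached output in between. Everything here is a stand-in under stated assumptions: `expensive_block`, `cheap_head`, the fixed `reuse_interval` schedule, and the arithmetic inside them are hypothetical and do not reproduce FRDiff's actual layers or schedule.

```python
def expensive_block(x, t):
    """Stand-in for a costly network block; counts how often it runs."""
    expensive_block.calls += 1
    return [0.9 * v + 0.01 * t for v in x]


expensive_block.calls = 0


def cheap_head(x, features):
    """Stand-in for the inexpensive layers that run at every step."""
    return [0.5 * (a + b) for a, b in zip(x, features)]


def denoise(x, num_steps, reuse_interval):
    """Run num_steps updates, refreshing cached features every reuse_interval steps."""
    cached = None
    for step in range(num_steps):
        t = num_steps - step
        if cached is None or step % reuse_interval == 0:
            cached = expensive_block(x, t)  # full computation
        x = cheap_head(x, cached)           # reuse on the other steps
    return x
```

With `reuse_interval=1` the expensive block runs at every step; larger intervals trade some fidelity for fewer expensive evaluations, which is the fidelity and latency trade-off the abstract refers to.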

Speculative Exploration on the Concept of Artificial Agents Conducting Autonomous Research

paper_url: http://arxiv.org/abs/2312.03497
repo_url: https://github.com/t46/research-automation-perspective-paper
paper_authors: Shiro Takagi
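for: This paper explores the concept of an artificial agent capable of conducting research.
methods: The paper first characterizes the act of research conceptually, to provide a starting point for discussion. It then considers the core components of research: question formulation, hypothesis generation, and hypothesis verification. The discussion covers the potential and challenges of having machines perform these tasks autonomously.
results: The paper briefly discusses the overlapping themes and interconnections among these components. Finally, it presents preliminary thoughts on prototyping as a first step toward uncovering the challenges of developing such research-capable agents.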

Abstract
This paper engages in a speculative exploration of the concept of an artificial agent capable of conducting research. Initially, it examines how the act of research can be conceptually characterized, aiming to provide a starting point for discussions about what it means to create such agents. The focus then shifts to the core components of research: question formulation, hypothesis generation, and hypothesis verification. This discussion includes a consideration of the potential and challenges associated with enabling machines to autonomously perform these tasks. Subsequently, this paper briefly considers the overlapping themes and interconnections that underlie them. Finally, the paper presents preliminary thoughts on prototyping as an initial step towards uncovering the challenges involved in developing these research-capable agents.

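Summary
This paper explores the concept of an artificial agent capable of conducting research. It first characterizes the act of research, to provide a starting point for discussion. It then shifts to the core components of research: question formulation, hypothesis generation, and hypothesis verification. This discussion covers the potential and challenges of having machines perform these tasks autonomously. Next, the paper briefly considers the overlaps and connections among these themes. Finally, it offers preliminary thoughts as a first step toward assessing the challenges involved in developing such research-capable agents.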

Learning From Scenarios for Stochastic Repairable Scheduling

paper_url: http://arxiv.org/abs/2312.03492
repo_url: https://github.com/kimvandenhouten/learning-from-scenarios-for-repairable-stochastic-scheduling
paper_authors: Kim van den Houten, David M. J. Tax, Esteban Freydell, Mathijs de Weerdt
for: Linear objective optimization with uncertain parameter values in a stochastic scheduling problem.
methods: Decision-focused learning with stochastic smoothing to adapt existing techniques to the scheduling problem.
results: Extensive experimental evaluation to compare the performance of decision-focused learning with the state of the art for scenario-based stochastic optimization.

Abstract
When optimizing problems with uncertain parameter values in a linear objective, decision-focused learning enables end-to-end learning of these values. We are interested in a stochastic scheduling problem, in which processing times are uncertain, which brings uncertain values in the constraints, and thus repair of an initial schedule may be needed. Historical realizations of the stochastic processing times are available. We show how existing decision-focused learning techniques based on stochastic smoothing can be adapted to this scheduling problem. We include an extensive experimental evaluation to investigate in which situations decision-focused learning outperforms the state of the art for such situations: scenario-based stochastic optimization.

Summary
When optimizing problems with uncertain parameter values in a linear objective, decision-focused learning enables end-to-end learning of these values. We consider a stochastic scheduling problem in which processing times are uncertain, so that an initial schedule may need to be repaired. Historical realizations of the stochastic processing times are available. We show how existing decision-focused learning techniques based on stochastic smoothing can be adapted to this scheduling problem. We carry out an extensive experimental evaluation to investigate in which situations decision-focused learning outperforms the state of the art, scenario-based stochastic optimization.
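
Scenario-based stochastic optimization, the baseline named above, picks the decision with the lowest average cost over historical scenarios. The toy below illustrates only that principle on a single job: the cost model (reserved time plus a penalty per unit of overrun that needs repair), `repair_penalty`, and the function names are assumptions, and the paper's actual scheduling model is considerably richer.

```python
def scenario_cost(planned, actual, repair_penalty=3.0):
    """Reserved time plus a penalty for every unit of overrun that needs repair."""
    return planned + repair_penalty * max(0.0, actual - planned)


def best_planned_duration(scenarios, candidates, repair_penalty=3.0):
    """Return the candidate with the lowest average cost over all scenarios."""
    def average_cost(planned):
        total = sum(scenario_cost(planned, d, repair_penalty) for d in scenarios)
        return total / len(scenarios)

    return min(candidates, key=average_cost)
```

A decision-focused learning approach would instead train a predictor of the processing times so that the decisions made from its predictions have low cost, rather than averaging over scenarios directly.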

JAMMIN-GPT: Text-based Improvisation using LLMs in Ableton Live

paper_url: http://arxiv.org/abs/2312.03479
repo_url: https://github.com/supersational/jammin-gpt
paper_authors: Sven Hollowell, Tashi Namgyal, Paul Marshall
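for: The system is designed for Ableton Live users to create MIDI-clips by naming them with musical descriptions.
methods: The system prompts ChatGPT to reply in one of several text-based musical formats, such as ABC notation, chord symbols, or drum tablature, so that musical ideas can be inserted from Ableton's clip view.
results: The system helps users generate musical ideas quickly while staying in the flow of their creative process, without stopping to edit code. The approach could both improve the efficiency of music creation and lower the learning cost.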

Abstract
We introduce a system that allows users of Ableton Live to create MIDI-clips by naming them with musical descriptions. Users can compose by typing the desired musical content directly in Ableton's clip view, which is then inserted by our integrated system. This allows users to stay in the flow of their creative process while quickly generating musical ideas. The system works by prompting ChatGPT to reply using one of several text-based musical formats, such as ABC notation, chord symbols, or drum tablature. This is an important step in integrating generative AI tools into pre-existing musical workflows, and could be valuable for content makers who prefer to express their creative vision through descriptive language. Code is available at https://github.com/supersational/JAMMIN-GPT.

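Summary
We introduce a system that lets Ableton Live users create MIDI-clips by naming them with musical descriptions. Users type the desired musical content directly in Ableton's clip view, and our integrated system inserts it. This lets users stay in the flow of their creative process while quickly generating musical ideas. The system works by prompting ChatGPT to reply in one of several text-based musical formats, such as ABC notation, chord symbols, or drum tablature. This is an important step toward integrating generative AI tools into existing musical workflows, and could be valuable for content makers who prefer to express their creative vision through descriptive language. Code is available at https://github.com/supersational/JAMMIN-GPT.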

Molecule Joint Auto-Encoding: Trajectory Pretraining with 2D and 3D Diffusion

paper_url: http://arxiv.org/abs/2312.03475
repo_url: None
paper_authors: Weitao Du, Jiujiu Chen, Xuecang Zhang, Zhiming Ma, Shengchao Liu
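for: This study aims to advance the use of artificial intelligence in drug discovery, a topic of growing interest in both machine learning and chemistry.
methods: The study proposes a pretraining method, molecule joint auto-encoding (MoleculeJAE), that learns both 2D bond (topology) and 3D conformation (geometry) information. A diffusion process model mimics the augmented trajectories of the two modalities, so that the inherent chemical structure is learned in a self-supervised manner.
results: Experiments show that MoleculeJAE reaches state-of-the-art performance on 15 out of 20 tasks when compared with 12 competitive baselines.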

Abstract
Recently, artificial intelligence for drug discovery has raised increasing interest in both machine learning and chemistry domains. The fundamental building block for drug discovery is molecule geometry and thus, the molecule's geometrical representation is the main bottleneck to better utilize machine learning techniques for drug discovery. In this work, we propose a pretraining method for molecule joint auto-encoding (MoleculeJAE). MoleculeJAE can learn both the 2D bond (topology) and 3D conformation (geometry) information, and a diffusion process model is applied to mimic the augmented trajectories of such two modalities, based on which, MoleculeJAE will learn the inherent chemical structure in a self-supervised manner. Thus, the pretrained geometrical representation in MoleculeJAE is expected to benefit downstream geometry-related tasks. Empirically, MoleculeJAE proves its effectiveness by reaching state-of-the-art performance on 15 out of 20 tasks by comparing it with 12 competitive baselines.

Data is Overrated: Perceptual Metrics Can Lead Learning in the Absence of Training Data

paper_url: http://arxiv.org/abs/2312.03455
repo_url: None
paper_authors: Tashi Namgyal, Alexander Hepburn, Raul Santos-Rodriguez, Valero Laparra, Jesus Malo
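for: This paper studies perceptual metrics, which are traditionally used to evaluate the quality of natural signals such as images and audio, as loss functions for training generative models.
methods: The paper relies on perceptual metrics, which are designed to mimic the perceptual behaviour of human observers and usually reflect structures found in natural signals.
results: The paper finds that training with perceptual losses lets generative models better capture the structure of natural signals and improves the reconstruction of spectrograms and re-synthesized audio at test time, indicating better generalization to unseen natural signals.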

Abstract
Perceptual metrics are traditionally used to evaluate the quality of natural signals, such as images and audio. They are designed to mimic the perceptual behaviour of human observers and usually reflect structures found in natural signals. This motivates their use as loss functions for training generative models such that models will learn to capture the structure held in the metric. We take this idea to the extreme in the audio domain by training a compressive autoencoder to reconstruct uniform noise, in lieu of natural data. We show that training with perceptual losses improves the reconstruction of spectrograms and re-synthesized audio at test time over models trained with a standard Euclidean loss. This demonstrates better generalisation to unseen natural signals when using perceptual metrics.

Summary
Perceptual metrics are traditionally used to evaluate the quality of natural signals, such as images and audio. They are designed to mimic the perceptual behaviour of human observers and usually reflect structures found in natural signals. This motivates their use as loss functions for training generative models, so that models learn to capture the structure held in the metric. We push this idea to the extreme in the audio domain by training a compressive autoencoder to reconstruct uniform noise instead of natural data. We show that training with perceptual losses improves the reconstruction of spectrograms and re-synthesized audio at test time over models trained with a standard Euclidean loss. This demonstrates better generalization to unseen natural signals when using perceptual metrics.

Quantum-Inspired Neural Network Model of Optical Illusions

paper_url: http://arxiv.org/abs/2312.03447
repo_url: None
paper_authors: Ivan S. Maksymov
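for: This paper studies human perception of ambiguous figures such as the Necker cube.
methods: The author designs and trains a deep neural network model to simulate human perception of the Necker cube, and defines the weights of the neural network connections using a quantum generator of truly random numbers.
results: The study reveals that the actual perceptual state of the Necker cube is a qubit-like superposition of the two fundamental perceptual states predicted by classical theories. The results are expected to find applications in video games and virtual reality systems, and to be useful to researchers in machine learning, vision, psychology of perception, and quantum-mechanical models of the human mind and decision-making.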

Abstract
Ambiguous optical illusions have been a paradigmatic object of fascination, research and inspiration in arts, psychology and video games. However, accurate computational models of perception of ambiguous figures have been elusive. In this paper, we design and train a deep neural network model to simulate the human's perception of the Necker cube, an ambiguous drawing with several alternating possible interpretations. Defining the weights of the neural network connection using a quantum generator of truly random numbers, in agreement with the emerging concepts of quantum artificial intelligence and quantum cognition we reveal that the actual perceptual state of the Necker cube is a qubit-like superposition of the two fundamental perceptual states predicted by classical theories. Our results will find applications in video games and virtual reality systems employed for training of astronauts and operators of unmanned aerial vehicles. They will also be useful for researchers working in the fields of machine learning and vision, psychology of perception and quantum-mechanical models of human mind and decision-making.

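Summary
Ambiguous optical illusions have long been an object of fascination in the arts, psychology, and video games, but accurate computational models of how humans perceive them have been elusive. In this paper, we design a deep neural network model to simulate human perception of the Necker cube. By setting the weights with a quantum generator of truly random numbers, in agreement with emerging concepts of quantum artificial intelligence and quantum cognition, we find that the actual perceptual state of the Necker cube is a qubit-like superposition of the two fundamental perceptual states. Our results will find applications in video games and virtual reality systems used to train astronauts and operators of unmanned aerial vehicles. They will also be useful to researchers in machine learning, vision, and the psychology of perception, and to those studying quantum-mechanical models of the human mind and decision-making.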

Sports Recommender Systems: Overview and Research Issues

paper_url: http://arxiv.org/abs/2312.03785
repo_url: None
paper_authors: Alexander Felfernig, Manfred Wundara, Thi Ngoc Trang Tran, Viet-Man Le, Sebastian Lubos, Seda Polat-Erdeniz
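for: Sports recommender systems are receiving increasing attention for their potential in healthy living, personal well-being, and sports performance. These systems help people choose suitable food, training practices, talents, and teams.
methods: Based on different working examples, the paper gives an overview of the applications and techniques of sports recommender systems, including food recommendation, training practice recommendation, talent and team recommendation, and the recommendation of tactics in competitions.
results: The paper analyzes the related state of the art and discusses open research issues, to guide further work on the applications and techniques of sports recommender systems.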

Abstract
Sports recommender systems are receiving increasing attention due to their potential to foster healthy living, improve personal well-being, and increase performance in sport. These systems support people in sports, for example, by the recommendation of healthy and performance boosting food items, the recommendation of training practices, talent and team recommendation, and the recommendation of specific tactics in competitions. With applications in the virtual world, for example, the recommendation of maps or opponents in e-sports, these systems already transcend conventional sports scenarios where physical presence is needed. On the basis of different working examples, we present an overview of sports recommender systems applications and techniques. Overall, we analyze the related state-of-the-art and discuss open research issues.

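Summary
Sports recommender systems have received increasing attention in recent years, owing to their potential to foster healthy living, improve personal well-being, and increase sports performance. These systems support people in sports, for example by recommending healthy and performance-boosting food items, training practices, talent and teams, and specific tactics in competitions. With applications in the virtual world, such as e-sports, these systems already go beyond conventional sports scenarios that require physical presence. On the basis of different working examples, we give an overview of the applications and techniques of sports recommender systems, and summarize the related state of the art and open research issues.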

Approximating Solutions to the Knapsack Problem using the Lagrangian Dual Framework

paper_url: http://arxiv.org/abs/2312.03413
repo_url: None
paper_authors: Mitchell Keegan, Mahdi Abolghasemi
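for: The paper proposes neural network models based on the Lagrangian Dual Framework for approximating solutions to the Knapsack Problem, a classic combinatorial optimisation problem, while improving constraint satisfaction.
methods: The paper uses neural network models to approximate Knapsack Problem solutions and applies the Lagrangian Dual Framework to encourage constraint satisfaction.
results: Experiments show strong constraint satisfaction at the cost of a minor reduction in optimality. By comparison, a baseline neural network that does not explicitly model the constraints reaches higher optimality but satisfies the constraints less well.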

Abstract
The Knapsack Problem is a classic problem in combinatorial optimisation. Solving these problems may be computationally expensive. Recent years have seen a growing interest in the use of deep learning methods to approximate the solutions to such problems. A core problem is how to enforce or encourage constraint satisfaction in predicted solutions. A promising approach for predicting solutions to constrained optimisation problems is the Lagrangian Dual Framework which builds on the method of Lagrangian Relaxation. In this paper we develop neural network models to approximate Knapsack Problem solutions using the Lagrangian Dual Framework while improving constraint satisfaction. We explore the problems of output interpretation and model selection within this context. Experimental results show strong constraint satisfaction with a minor reduction of optimality as compared to a baseline neural network which does not explicitly model the constraints.
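
To make the idea of a Lagrangian dual training signal concrete, here is a minimal sketch in plain Python. It is not the authors' code: the function names, the step size, and the use of the negative item value as a stand-in for the prediction loss are all assumptions made for illustration. It shows a loss that adds a multiplier-weighted capacity violation, and a subgradient update that raises the multiplier while the constraint is violated.

```python
# Illustrative sketch only: names and hyperparameters are assumptions,
# not taken from the paper.

def knapsack_violation(x, weights, capacity):
    """Amount by which the selected items exceed the capacity (0 if feasible)."""
    load = sum(w * xi for w, xi in zip(weights, x))
    return max(0.0, load - capacity)

def lagrangian_loss(x, values, weights, capacity, lam):
    """Stand-in prediction loss (negative total value) plus lam * violation."""
    value = sum(v * xi for v, xi in zip(values, x))
    return -value + lam * knapsack_violation(x, weights, capacity)

def dual_update(lam, violation, step_size=0.1):
    """Subgradient ascent on the multiplier; it never drops below zero."""
    return max(0.0, lam + step_size * violation)

values, weights, capacity = [10.0, 6.0, 4.0], [5.0, 4.0, 3.0], 8.0
x = [1.0, 1.0, 0.0]  # hypothetical predicted selection; load 9 exceeds capacity 8
lam = 0.0
violation = knapsack_violation(x, weights, capacity)
loss_before = lagrangian_loss(x, values, weights, capacity, lam)
lam = dual_update(lam, violation)
loss_after = lagrangian_loss(x, values, weights, capacity, lam)
```

In a full training loop the multiplier update would alternate with gradient steps on the network, so that infeasible predictions become progressively more expensive.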

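Summary
The Knapsack Problem is a classic problem in combinatorial optimisation. Solving such problems can be computationally expensive. Recent years have seen growing interest in using deep learning methods to approximate their solutions. A core problem is how to enforce or encourage constraint satisfaction in predicted solutions. In this paper we develop neural network models based on the Lagrangian Dual Framework to approximate Knapsack Problem solutions while improving constraint satisfaction. We also explore the problems of output interpretation and model selection in this context. Experimental results show strong constraint satisfaction, with a minor reduction in optimality compared to the baseline neural network.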

Generalized Contrastive Divergence: Joint Training of Energy-Based Model and Diffusion Model through Inverse Reinforcement Learning

paper_url: http://arxiv.org/abs/2312.03397
repo_url: None
paper_authors: Sangwoong Yoon, Dohyun Kwon, Himchan Hwang, Yung-Kyun Noh, Frank C. Park
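for: The study proposes a new objective function for jointly training an energy-based model (EBM) and a diffusion model.
methods: The EBM and the diffusion model are trained simultaneously, with the joint training formulated as a minimax problem.
results: Preliminary results show that joint training improves the sample quality of the diffusion model and enables EBM training without MCMC, so that both models benefit.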

Abstract
We present Generalized Contrastive Divergence (GCD), a novel objective function for training an energy-based model (EBM) and a sampler simultaneously. GCD generalizes Contrastive Divergence (Hinton, 2002), a celebrated algorithm for training EBM, by replacing Markov Chain Monte Carlo (MCMC) distribution with a trainable sampler, such as a diffusion model. In GCD, the joint training of EBM and a diffusion model is formulated as a minimax problem, which reaches an equilibrium when both models converge to the data distribution. The minimax learning with GCD bears interesting equivalence to inverse reinforcement learning, where the energy corresponds to a negative reward, the diffusion model is a policy, and the real data is expert demonstrations. We present preliminary yet promising results showing that joint training is beneficial for both EBM and a diffusion model. GCD enables EBM training without MCMC while improving the sample quality of a diffusion model.

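Summary
We present Generalized Contrastive Divergence (GCD), a new objective function for training an energy-based model (EBM) and a diffusion model simultaneously. GCD generalizes Contrastive Divergence (Hinton, 2002) by replacing the Markov Chain Monte Carlo (MCMC) distribution with a trainable diffusion model. In GCD, the joint training of the EBM and the diffusion model is formulated as a minimax problem, which reaches an equilibrium when both models converge to the data distribution. Minimax learning with GCD bears an interesting equivalence to inverse reinforcement learning, where the energy corresponds to a negative reward, the diffusion model to a policy, and the real data to expert demonstrations. We present preliminary yet promising results showing that joint training benefits both the EBM and the diffusion model. GCD enables EBM training without MCMC while improving the sample quality of the diffusion model.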

Diffused Task-Agnostic Milestone Planner

paper_url: http://arxiv.org/abs/2312.03395
repo_url: None
paper_authors: Mineui Hong, Minjae Kang, Songhwai Oh
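for: The paper extends sequence-prediction-based decision-making to long-term planning, vision-based control, and multi-task decision-making.
methods: The study uses a diffusion-based generative sequence model to plan a series of milestones in a latent space and has an agent follow them to accomplish a given task. The method learns control-relevant, low-dimensional latent representations, which makes long-term planning and vision-based control efficient. It also exploits the generation flexibility of the diffusion model to plan for multi-task decision-making.
results: Evaluated on several offline reinforcement learning (RL) benchmarks and a visual manipulation environment, the method outperforms offline RL methods on long-horizon, sparse-reward tasks and multi-task problems, and achieves state-of-the-art performance on the most challenging vision-based manipulation benchmark.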

Abstract
Addressing decision-making problems using sequence modeling to predict future trajectories has shown promising results in recent years. In this paper, we take a step further to leverage the sequence predictive method in wider areas such as long-term planning, vision-based control, and multi-task decision-making. To this end, we propose a method to utilize a diffusion-based generative sequence model to plan a series of milestones in a latent space and to have an agent follow the milestones to accomplish a given task. The proposed method can learn control-relevant, low-dimensional latent representations of milestones, which makes it possible to efficiently perform long-term planning and vision-based control. Furthermore, our approach exploits the generation flexibility of the diffusion model, which makes it possible to plan diverse trajectories for multi-task decision-making. We demonstrate the proposed method across offline reinforcement learning (RL) benchmarks and a visual manipulation environment. The results show that our approach outperforms offline RL methods in solving long-horizon, sparse-reward tasks and multi-task problems, while also achieving state-of-the-art performance on the most challenging vision-based manipulation benchmark.

Lite-Mind: Towards Efficient and Versatile Brain Representation Network

paper_url: http://arxiv.org/abs/2312.03781
repo_url: None
paper_authors: Zixuan Gong, Qi Zhang, Duoqian Miao, Guangyin Bao, Liang Hu
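for: The paper aims to improve the decoding of visual information from non-invasive fMRI.
methods: The paper replaces the deep multilayer perceptron (MLP) that MindEye uses to align fMRI embeddings with CLIP's vision transformer by a lightweight network based on the discrete Fourier transform.
results: The proposed lightweight, efficient, and versatile brain representation network (Lite-Mind) efficiently aligns fMRI voxels to fine-grained information of CLIP. Experiments show that Lite-Mind achieves 94.3% fMRI-to-image retrieval accuracy on the NSD dataset for Subject 1, with 98.7% fewer parameters than MindEye.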

Abstract
Research in decoding visual information from the brain, particularly through the non-invasive fMRI method, is rapidly progressing. The challenge arises from the limited data availability and the low signal-to-noise ratio of fMRI signals, leading to a low-precision task of fMRI-to-image retrieval. State-of-the-art MindEye remarkably improves fMRI-to-image retrieval performance by leveraging a deep MLP with a high parameter count orders of magnitude, i.e., a 996M MLP Backbone per subject, to align fMRI embeddings to the final hidden layer of CLIP's vision transformer. However, significant individual variations exist among subjects, even within identical experimental setups, mandating the training of subject-specific models. The substantial parameters pose significant challenges in deploying fMRI decoding on practical devices, especially with the necessitating of specific models for each subject. To this end, we propose Lite-Mind, a lightweight, efficient, and versatile brain representation network based on discrete Fourier transform, that efficiently aligns fMRI voxels to fine-grained information of CLIP. Our experiments demonstrate that Lite-Mind achieves an impressive 94.3% fMRI-to-image retrieval accuracy on the NSD dataset for Subject 1, with 98.7% fewer parameters than MindEye. Lite-Mind is also proven to be able to be migrated to smaller brain datasets and establishes a new state-of-the-art for zero-shot classification on the GOD dataset. The code is available at https://github.com/gongzix/Lite-Mind.

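Summary
Research on decoding visual information from the brain is progressing rapidly, particularly through the non-invasive fMRI method. The challenge arises from limited data availability and the low signal-to-noise ratio of fMRI signals, which make fMRI-to-image retrieval a low-precision task. The state-of-the-art MindEye markedly improves fMRI-to-image retrieval performance by using a deep MLP with a very high parameter count, i.e., a 996M MLP backbone per subject, to align fMRI embeddings to the final hidden layer of CLIP's vision transformer. However, subjects differ from one another even within identical experimental setups, so subject-specific models must be trained. The large parameter count poses significant challenges for deployment on practical devices. To this end, we propose Lite-Mind, a lightweight, efficient, and versatile brain representation network based on the discrete Fourier transform, which efficiently aligns fMRI voxels to fine-grained information of CLIP. Our experiments show that Lite-Mind achieves 94.3% fMRI-to-image retrieval accuracy on the NSD dataset, with 98.7% fewer parameters than MindEye. Lite-Mind can also be migrated to smaller brain datasets and establishes a new state of the art for zero-shot classification on the GOD dataset. The code is available at https://github.com/gongzix/Lite-Mind.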

Demand response for residential building heating: Effective Monte Carlo Tree Search control based on physics-informed neural networks

paper_url: http://arxiv.org/abs/2312.03365
repo_url: None
paper_authors: Fabio Pavirani, Gargya Gokhale, Bert Claessens, Chris Develder
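for: Controlling energy consumption in buildings through demand response (DR) to reduce global carbon emissions and limit climate change.
methods: Monte Carlo Tree Search (MCTS) combined with a Physics-informed Neural Network (PiNN) model is used to control the heating system of a residential building and improve DR control performance.
results: The MCTS implementation with a PiNN model obtains a 3% higher reward than a rule-based controller, leading to a 10% cost reduction and a 35% reduction in the temperature difference from the desired one. Adding a deep learning layer to guide the tree search further reduces the computational cost.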

Abstract
Controlling energy consumption in buildings through demand response (DR) has become increasingly important to reduce global carbon emissions and limit climate change. In this paper, we specifically focus on controlling the heating system of a residential building to optimize its energy consumption while respecting the user's thermal comfort. Recent works in this area have mainly focused on either model-based control, e.g., model predictive control (MPC), or model-free reinforcement learning (RL) to implement practical DR algorithms. A specific RL method that recently has achieved impressive success in domains such as board games (Go, chess) is Monte Carlo Tree Search (MCTS). Yet, for building control it has remained largely unexplored. Thus, we study MCTS specifically for building demand response. Its natural structure allows a flexible optimization that implicitly integrates exogenous constraints (as opposed, for example, to conventional RL solutions), making MCTS a promising candidate for DR control problems. We demonstrate how to improve MCTS control performance by incorporating a Physics-informed Neural Network (PiNN) model for its underlying thermal state prediction, as opposed to traditional purely data-driven Black-Box approaches. Our MCTS implementation aligned with a PiNN model is able to obtain a 3% increment of the obtained reward compared to a rule-based controller, leading to a 10% cost reduction and a 35% reduction in the temperature difference from the desired one when applied to an artificial price profile. We further implemented a Deep Learning layer into the Monte Carlo Tree Search technique using a neural network that leads the tree search through more optimal nodes. We then compared this addition with its Vanilla version, showing the improvement in computational cost required.
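
The abstract does not say which selection rule the tree search uses. As a point of reference, the sketch below shows the textbook UCT rule that many MCTS implementations apply when choosing which child node to explore; the action names, the statistics, and the exploration constant are hypothetical and are not taken from the paper.

```python
import math

def uct_score(total_reward, visits, parent_visits, c=1.4):
    """Mean reward plus an exploration bonus; unvisited nodes are tried first."""
    if visits == 0:
        return math.inf
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits, c=1.4):
    """children maps an action name to (total_reward, visits)."""
    return max(children, key=lambda a: uct_score(*children[a], parent_visits, c))

# Hypothetical statistics for three heating actions at one tree node.
children = {"heat_on": (3.0, 5), "heat_off": (1.0, 1), "half_power": (0.0, 0)}
first_pick = select_child(children, parent_visits=6)

# With no exploration bonus (c=0) the choice reduces to the best mean reward.
visited = {"heat_on": (3.0, 5), "heat_off": (1.0, 1)}
greedy_pick = select_child(visited, parent_visits=6, c=0.0)
```

A learned guide such as the neural network mentioned in the abstract would typically change how candidate nodes are scored or expanded; how the authors do this is not described here.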

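Summary
Controlling the energy consumption of buildings has become an important way to reduce global carbon emissions and limit climate change. In this paper, we focus on controlling the heating system of a residential building to optimize its energy consumption while preserving the user's thermal comfort. Existing work has mainly used model predictive control (MPC) or model-free reinforcement learning (RL) to implement practical DR algorithms. Monte Carlo Tree Search (MCTS) in particular has recently achieved impressive success in board games such as Go and chess, yet it remains largely unexplored for building control. We therefore study MCTS for building demand response. Its natural structure allows flexible optimization that implicitly integrates exogenous constraints (unlike conventional RL solutions), which makes MCTS a promising candidate for DR control problems. We improve control performance by combining MCTS with a Physics-informed Neural Network (PiNN) model. Our MCTS implementation with a PiNN model obtains a 3% higher reward than a rule-based controller, leading to a 10% cost reduction and a 35% reduction in temperature difference. We also add a deep learning layer to the MCTS technique, using a neural network to guide the tree search through more optimal nodes. Compared with the vanilla version, this addition reduces the computational cost required.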

Teaching Specific Scientific Knowledge into Large Language Models through Additional Training

paper_url: http://arxiv.org/abs/2312.03360
repo_url: None
paper_authors: Kan Hatakeyama-Sato, Yasuhiko Igarashi, Shun Katakami, Yuta Nabae, Teruaki Hayakawa
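for: Exploring how to embed specialized scientific knowledge into a large language model (LLM) through additional training.
methods: We use text augmentation, including style conversions and translations, to tackle the scarcity of specialized texts. We also carry out hyperparameter optimization.
results: We succeed in partially embedding knowledge, but the study shows that incorporating specialized information into LLMs involves complexities and limitations, and points to areas for further improvement.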

Abstract
Through additional training, we explore embedding specialized scientific knowledge into the Llama 2 Large Language Model (LLM). Key findings reveal that effective knowledge integration requires reading texts from multiple perspectives, especially in instructional formats. We utilize text augmentation to tackle the scarcity of specialized texts, including style conversions and translations. Hyperparameter optimization proves crucial, with different size models (7b, 13b, and 70b) reasonably undergoing additional training. Validating our methods, we construct a dataset of 65,000 scientific papers. Although we have succeeded in partially embedding knowledge, the study highlights the complexities and limitations of incorporating specialized information into LLMs, suggesting areas for further improvement.

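Summary
Through additional training, we explore embedding specialized scientific knowledge into a large language model (LLM). Key findings show that effective knowledge integration requires reading texts from multiple perspectives, especially in instructional formats. We use text augmentation, including style conversions and translations, to tackle the scarcity of specialized texts. Hyperparameter optimization proves crucial, and models of different sizes (7b, 13b, and 70b) can all undergo additional training. To validate our methods, we construct a dataset of 65,000 scientific papers. Although we succeed in partially embedding knowledge, the study shows the complexities and limitations of incorporating specialized information into LLMs and suggests areas for further improvement.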

Online Vectorized HD Map Construction using Geometry

paper_url: http://arxiv.org/abs/2312.03341
repo_url: https://github.com/cnzzx/gemap
paper_authors: Zhixin Zhang, Yiyuan Zhang, Xiaohan Ding, Fusheng Jin, Xiangyu Yue
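for: The paper proposes a map-learning method based on Euclidean geometry to support downstream prediction and planning in urban road systems.
methods: The proposed method, GeMap, captures the Euclidean shapes and relations of instances in urban road systems and handles shapes and relations independently.
results: GeMap achieves new state-of-the-art performance on the NuScenes and Argoverse 2 datasets. On Argoverse 2 it reaches 71.8% mAP, outperforming MapTR V2 by +4.4% and surpassing the 70% mAP threshold for the first time.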

Abstract
The construction of online vectorized High-Definition (HD) maps is critical for downstream prediction and planning. Recent efforts have built strong baselines for this task. However, shapes and relations of instances in urban road systems, such as parallelism, perpendicularity, or rectangular shape, are still under-explored. In our work, we propose GeMap (Geometry Map), which end-to-end learns Euclidean shapes and relations of map instances beyond basic perception. Specifically, we design a geometric loss based on angle and distance clues, which is robust to rigid transformations. We also decouple self-attention to independently handle Euclidean shapes and relations. Our method achieves new state-of-the-art performance on the NuScenes and Argoverse 2 datasets. Remarkably, it reaches a 71.8% mAP on the large-scale Argoverse 2 dataset, outperforming MapTR V2 by +4.4% and surpassing the 70% mAP threshold for the first time. Code is available at https://github.com/cnzzx/GeMap
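
The claim that a loss built from angles and distances is robust to rigid transformations can be checked with a small example. The sketch below is not GeMap's actual loss: it is a simplified stand-in, with hypothetical function names, that compares the edge lengths and turning angles of two polylines. Because rotation and translation leave both quantities unchanged, a rigidly moved copy of a shape scores (numerically) zero, while a stretched copy does not.

```python
import math

def rigid_transform(points, angle, tx, ty):
    """Rotate 2D points by `angle` radians, then translate by (tx, ty)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def edge_lengths(points):
    """Distance clue: length of each consecutive edge of the polyline."""
    return [math.dist(points[i], points[i + 1]) for i in range(len(points) - 1)]

def turning_angles(points):
    """Angle clue: signed angle between each pair of consecutive edges."""
    angles = []
    for i in range(1, len(points) - 1):
        ax, ay = points[i][0] - points[i - 1][0], points[i][1] - points[i - 1][1]
        bx, by = points[i + 1][0] - points[i][0], points[i + 1][1] - points[i][1]
        angles.append(math.atan2(ax * by - ay * bx, ax * bx + ay * by))
    return angles

def geometry_l1(pred, target):
    """L1 gap between the distance clues plus L1 gap between the angle clues."""
    d = sum(abs(p - t) for p, t in zip(edge_lengths(pred), edge_lengths(target)))
    a = sum(abs(p - t) for p, t in zip(turning_angles(pred), turning_angles(target)))
    return d + a

rect = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]       # e.g. a crosswalk outline
moved = rigid_transform(rect, angle=0.7, tx=3.0, ty=-1.0)      # same shape, new pose
stretched = [(0.0, 0.0), (5.0, 0.0), (5.0, 2.0), (0.0, 2.0)]   # different shape

loss_moved = geometry_l1(moved, rect)
loss_stretched = geometry_l1(stretched, rect)
```

A point-wise coordinate loss would penalize `moved` heavily even though its shape is identical, which is the motivation for comparing geometric clues instead. This sketch ignores angle wrap-around near ±π, which a real implementation would need to handle.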

Summary
The construction of online vectorized high-definition (HD) maps is critical for downstream prediction and planning. Recent efforts have built strong baselines for this task, but shapes and relations of instances in urban road systems, such as parallelism, perpendicularity, or rectangular shape, are still under-explored. In our work, we propose GeMap, which learns Euclidean shapes and relations of map instances end to end, beyond basic perception. Specifically, we design a geometric loss based on angle and distance clues, which is robust to rigid transformations. We also decouple self-attention to handle Euclidean shapes and relations independently. Our method achieves new state-of-the-art performance on the NuScenes and Argoverse 2 datasets. Remarkably, it reaches 71.8% mAP on the large-scale Argoverse 2 dataset, outperforming MapTR V2 by +4.4% and surpassing the 70% mAP threshold for the first time.

Benchmarking Continual Learning from Cognitive Perspectives

paper_url: http://arxiv.org/abs/2312.03309
repo_url: None
paper_authors: Xiaoqian Liu, Junge Zhang, Mingyi Zhang, Peipei Yang
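for: The study addresses continual learning, i.e., continuously acquiring and transferring knowledge without forgetting old knowledge.
methods: The study evaluates continual learning models with desiderata derived from cognitive properties, together with multiple evaluation metrics.
results: Experiments show that existing continual learning models do not yet satisfy all the desiderata and are still far from truly continual learning. Although some methods show a degree of adaptability and efficiency, none can identify task relationships under dynamic task variations or strike a trade-off in learning similarities and differences between tasks.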

Abstract
Continual learning addresses the problem of continuously acquiring and transferring knowledge without catastrophic forgetting of old concepts. While humans achieve continual learning via diverse neurocognitive mechanisms, there is a mismatch between cognitive properties and evaluation methods of continual learning models. First, the measurement of continual learning models mostly relies on evaluation metrics at a micro-level, which cannot characterize cognitive capacities of the model. Second, the measurement is method-specific, emphasizing model strengths in one aspect while obscuring potential weaknesses in other respects. To address these issues, we propose to integrate model cognitive capacities and evaluation metrics into a unified evaluation paradigm. We first characterize model capacities via desiderata derived from cognitive properties supporting human continual learning. The desiderata concern (1) adaptability in varying lengths of task sequence; (2) sensitivity to dynamic task variations; and (3) efficiency in memory usage and training time consumption. Then we design evaluation protocols for each desideratum to assess cognitive capacities of recent continual learning models. Experimental results show that no method we consider has satisfied all the desiderata and is still far away from realizing truly continual learning. Although some methods exhibit some degree of adaptability and efficiency, no method is able to identify task relationships when encountering dynamic task variations, or achieve a trade-off in learning similarities and differences between tasks. Inspired by these results, we discuss possible factors that influence model performance in these desiderata and provide guidance for the improvement of continual learning models.

Summary
First, the evaluation metrics used are mainly micro-level, which cannot fully capture the cognitive abilities of the model. Second, the evaluation is method-specific, highlighting the strengths of the model in one aspect while hiding its potential weaknesses in other areas. To address these issues, we propose integrating model cognitive abilities and evaluation metrics into a unified evaluation paradigm. We first define the cognitive capabilities of the model based on the cognitive properties that support human continual learning, including the ability to adapt to varying task sequences, sensitivity to dynamic task variations, and efficient use of memory and training time. Then, we design evaluation protocols for each of these desiderata to assess the cognitive abilities of recent continual learning models. The experimental results show that none of the methods we considered have fully met all of the desiderata and are still far from achieving true continual learning. While some methods have shown some degree of adaptability and efficiency, they have failed to identify task relationships when facing dynamic task variations or balance learning similarities and differences between tasks. Inspired by these results, we discuss potential factors that may influence model performance in these desiderata and provide guidance for improving continual learning models.

Dyport: Dynamic Importance-based Hypothesis Generation Benchmarking Technique

paper_url: http://arxiv.org/abs/2312.03303
repo_url: https://github.com/ilyatyagin/dyport
paper_authors: Ilya Tyagin, Ilya Safro
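for: The paper presents Dyport, a new benchmarking framework for evaluating biomedical hypothesis generation systems.
methods: The approach uses curated datasets to make the evaluation more realistic. It integrates knowledge from curated databases into a dynamic graph and provides a method to quantify discovery importance, assessing not only the accuracy of hypotheses but also their potential impact in biomedical research, which significantly extends traditional link prediction benchmarks.
results: Experiments with several link prediction systems applied to biomedical semantic knowledge graphs demonstrate the applicability and flexibility of the benchmarking system.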

Abstract
This paper presents a novel benchmarking framework Dyport for evaluating biomedical hypothesis generation systems. Utilizing curated datasets, our approach tests these systems under realistic conditions, enhancing the relevance of our evaluations. We integrate knowledge from the curated databases into a dynamic graph, accompanied by a method to quantify discovery importance. This not only assesses hypothesis accuracy but also their potential impact in biomedical research which significantly extends traditional link prediction benchmarks. Applicability of our benchmarking process is demonstrated on several link prediction systems applied on biomedical semantic knowledge graphs. Being flexible, our benchmarking system is designed for broad application in hypothesis generation quality verification, aiming to expand the scope of scientific discovery within the biomedical research community. Availability and implementation: Dyport framework is fully open-source. All code and datasets are available at: https://github.com/IlyaTyagin/Dyport

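Summary
This paper presents Dyport, a new benchmarking framework for evaluating biomedical hypothesis generation systems. The framework uses curated datasets, which makes our evaluations more realistic. We integrate knowledge from the curated databases into a dynamic graph and provide a method to quantify discovery importance. This assesses not only the accuracy of hypotheses but also their potential impact in biomedical research, significantly extending traditional link prediction benchmarks. The applicability of our benchmarking process is demonstrated on several link prediction systems. Being flexible, our benchmarking system can be applied broadly to hypothesis generation quality verification, with the aim of expanding the scope of scientific discovery within the biomedical research community. Availability and implementation: the Dyport framework is fully open-source. All code and datasets are available at: https://github.com/IlyaTyagin/Dyport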

SoftMAC: Differentiable Soft Body Simulation with Forecast-based Contact Model and Two-way Coupling with Articulated Rigid Bodies and Clothes

paper_url: http://arxiv.org/abs/2312.03297
repo_url: https://github.com/damianliumin/SoftMAC
paper_authors: Min Liu, Gang Yang, Siyuan Luo, Chen Yu, Lin Shao
for: This paper aims to provide a unified framework for simulating diverse robotic manipulation scenarios by integrating soft bodies, articulated rigid bodies, and clothes.
methods: The proposed method, called SoftMAC, uses the Material Point Method (MPM) to simulate soft bodies and a forecast-based contact model to reduce artifacts. It also includes a penetration tracing algorithm to couple MPM particles with deformable and non-volumetric clothes meshes.
results: The authors validate the effectiveness and accuracy of the proposed differentiable pipeline through comprehensive experiments in downstream robotic manipulation applications.

Abstract
Differentiable physics simulation provides an avenue for tackling previously intractable challenges through gradient-based optimization, thereby greatly improving the efficiency of solving robotics-related problems. To apply differentiable simulation in diverse robotic manipulation scenarios, a key challenge is to integrate various materials in a unified framework. We present SoftMAC, a differentiable simulation framework coupling soft bodies with articulated rigid bodies and clothes. SoftMAC simulates soft bodies with the continuum-mechanics-based Material Point Method (MPM). We provide a forecast-based contact model for MPM, which greatly reduces artifacts like penetration and unnatural rebound. To couple MPM particles with deformable and non-volumetric clothes meshes, we also propose a penetration tracing algorithm that reconstructs the signed distance field in local area. Based on simulators for each modality and the contact model, we develop a differentiable coupling mechanism to simulate the interactions between soft bodies and the other two types of materials. Comprehensive experiments are conducted to validate the effectiveness and accuracy of the proposed differentiable pipeline in downstream robotic manipulation applications. Supplementary materials and videos are available on our project website at https://sites.google.com/view/softmac.

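Summary
Differentiable physics simulation offers a new avenue for tackling previously intractable challenges through gradient-based optimization, greatly improving the efficiency of solving robotics-related problems. To apply differentiable simulation in diverse robotic manipulation scenarios, a key challenge is to integrate various materials in a unified framework. We present SoftMAC, a differentiable simulation framework that couples soft bodies with articulated rigid bodies and clothes. SoftMAC simulates soft bodies with the Material Point Method (MPM) and provides a forecast-based contact model that reduces penetration and unnatural rebound. To couple MPM particles with deformable and non-volumetric clothes meshes, we also propose a penetration tracing algorithm that reconstructs the signed distance field in a local area. Based on the simulators and the contact model, we develop a differentiable coupling mechanism to simulate the interactions between soft bodies and the other two types of materials. We conduct comprehensive experiments to validate the effectiveness and accuracy of the proposed pipeline in downstream robotic manipulation applications. Supplementary materials and videos are available on our project website at https://sites.google.com/view/softmac.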

OMNIINPUT: A Model-centric Evaluation Framework through Output Distribution

paper_url: http://arxiv.org/abs/2312.03291
repo_url: None
paper_authors: Weitang Liu, Ying Wai Li, Tianle Wang, Yi-Zhuang You, Jingbo Shang
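for: Evaluating the quality of an AI/ML model's predictions, especially on human-unrecognizable inputs.
methods: Model quality is evaluated using an automatically generated test set and the model's own output distribution, rather than traditional data-centric evaluation.
results: The framework enables a more fine-grained comparison between models, especially when their performance on pre-defined datasets is almost the same, leading to new findings and insights that help train more robust, generalizable models.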

Abstract
We propose a novel model-centric evaluation framework, OmniInput, to evaluate the quality of an AI/ML model's predictions on all possible inputs (including human-unrecognizable ones), which is crucial for AI safety and reliability. Unlike traditional data-centric evaluation based on pre-defined test sets, the test set in OmniInput is self-constructed by the model itself and the model quality is evaluated by investigating its output distribution. We employ an efficient sampler to obtain representative inputs and the output distribution of the trained model, which, after selective annotation, can be used to estimate the model's precision and recall at different output values and a comprehensive precision-recall curve. Our experiments demonstrate that OmniInput enables a more fine-grained comparison between models, especially when their performance is almost the same on pre-defined datasets, leading to new findings and insights for how to train more robust, generalizable models.
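
One way to read the estimation step is sketched below. It is an interpretation, not the paper's procedure: the bin layout, the field names, and the numbers are assumptions. Each bin holds an output value, the estimated number of inputs the model maps to that value (from the sampled output distribution), and the fraction of annotated samples in that bin that humans judged positive. Precision and recall at a threshold then follow from mass-weighted sums.

```python
def precision_recall(bins, threshold):
    """bins: list of (output_value, mass, positive_fraction) tuples.

    mass is the estimated number of inputs with that output value;
    positive_fraction is the annotated share of true positives in the bin.
    """
    tp_all = sum(m * f for _, m, f in bins)
    selected = [(m, f) for v, m, f in bins if v >= threshold]
    mass = sum(m for m, _ in selected)
    tp = sum(m * f for m, f in selected)
    precision = tp / mass if mass > 0 else 0.0
    recall = tp / tp_all if tp_all > 0 else 0.0
    return precision, recall

# Hypothetical bins: most inputs get a low output value and are not positives.
bins = [(0.2, 50, 0.0), (0.5, 30, 0.5), (0.9, 20, 1.0)]
p_mid, r_mid = precision_recall(bins, threshold=0.5)
p_high, r_high = precision_recall(bins, threshold=0.9)

# Sweeping the threshold over the observed output values traces out a curve.
curve = [precision_recall(bins, v) for v, _, _ in bins]
```

Raising the threshold in this toy example trades recall for precision, which is the behaviour a precision-recall curve summarizes.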

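Summary
We propose a new model-centric evaluation framework, OmniInput, to evaluate the quality of an AI/ML model's predictions on all possible inputs (including human-unrecognizable ones), which is crucial for AI safety and reliability. Unlike traditional data-centric evaluation based on pre-defined test sets, the test set in OmniInput is constructed by the model itself, and model quality is evaluated by investigating its output distribution. We use an efficient sampler to obtain representative inputs and the output distribution of the trained model; after selective annotation, these are used to estimate the model's precision and recall at different output values and to produce a comprehensive precision-recall curve. Our experiments show that OmniInput enables a more fine-grained comparison between models, especially when their performance on pre-defined datasets is almost the same, leading to new findings and insights that help train more robust, generalizable models.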

Can language agents be alternatives to PPO? A Preliminary Empirical Study On OpenAI Gym

paper_url: http://arxiv.org/abs/2312.03290
repo_url: https://github.com/mail-ecnu/text-gym-agents
paper_authors: Junjie Sheng, Zixiao Huang, Chuyun Shen, Wenhao Li, Yun Hua, Bo Jin, Hongyuan Zha, Xiangfeng Wang
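for: The study investigates whether language agents can replace traditional PPO agents in sequential decision-making tasks.
methods: The researchers take environments collected in OpenAI Gym as testbeds and convert them into textual environments, allowing straightforward and efficient comparison with language agents.
results: Through numerical experiments and ablation studies, the researchers extract valuable insights into the decision-making capabilities of language agents and make a preliminary evaluation of their potential as alternatives to PPO agents.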

Abstract
The formidable capacity for zero- or few-shot decision-making in language agents encourages us to pose a compelling question: Can language agents be alternatives to PPO agents in traditional sequential decision-making tasks? To investigate this, we first take environments collected in OpenAI Gym as our testbeds and ground them to textual environments that construct the TextGym simulator. This allows for straightforward and efficient comparisons between PPO agents and language agents, given the widespread adoption of OpenAI Gym. To ensure a fair and effective benchmarking, we introduce 5 levels of scenario for accurate domain-knowledge controlling and a unified RL-inspired framework for language agents. Additionally, we propose an innovative explore-exploit-guided language (EXE) agent to solve tasks within TextGym. Through numerical experiments and ablation studies, we extract valuable insights into the decision-making capabilities of language agents and make a preliminary evaluation of their potential to be alternatives to PPO in classical sequential decision-making problems. This paper sheds light on the performance of language agents and paves the way for future research in this exciting domain. Our code is publicly available at https://github.com/mail-ecnu/Text-Gym-Agents.

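Summary
The paper poses a compelling question: can language agents replace PPO agents in traditional sequential decision-making tasks? To investigate this, we first take environments collected in OpenAI Gym as testbeds and convert them into textual environments, which makes comparisons between language agents and PPO agents straightforward and efficient. To ensure fair and effective benchmarking, we introduce 5 levels of scenario for controlling domain knowledge and propose a unified RL-inspired framework for language agents. In addition, we propose an explore-exploit-guided language (EXE) agent to solve tasks within TextGym. Through numerical experiments and ablation studies, we extract valuable insights into the decision-making capabilities of language agents and make a preliminary evaluation of whether they can replace PPO. This paper sheds light on the performance of language agents and paves the way for future research in this exciting domain. Our code is publicly available at https://github.com/mail-ecnu/Text-Gym-Agents.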

STEP CATFormer: Spatial-Temporal Effective Body-Part Cross Attention Transformer for Skeleton-based Action Recognition

paper_url: http://arxiv.org/abs/2312.03288
repo_url: https://github.com/maclong01/STEP-CATFormer
paper_authors: Nguyen Huu Bao Long
for: This study explores the application and optimization of graph convolutional networks (GCNs) in skeleton-based action recognition.
methods: The study proposes three channel-wise topology graph convolutions based on CTR-GCN and combines them with two joint cross-attention modules to capture upper-lower body part and hand-foot relationships in skeleton features. It also proposes Temporal Attention Transformers to extract skeleton features effectively.
results: The study achieves notably high performance on the NTU RGB+D and NTU RGB+D 120 datasets.

Abstract
Graph convolutional networks (GCNs) have been widely used and have achieved remarkable results in skeleton-based action recognition. We think the key to skeleton-based action recognition is how a skeleton changes across frames, so we focus on how graph convolutional networks learn different topologies and effectively aggregate joint features over global and local temporal ranges. In this work, we propose three Channel-wise Topology Graph Convolutions based on Channel-wise Topology Refinement Graph Convolution (CTR-GCN). Combining CTR-GCN with two joint cross-attention modules captures upper-lower body-part and hand-foot relationship features of the skeleton. After that, to capture the features of human skeletons changing across frames, we design Temporal Attention Transformers, which learn the temporal features of human skeleton sequences. Finally, we fuse the temporal feature outputs with an MLP and perform classification. We develop a powerful graph convolutional network named Spatial-Temporal Effective Body-Part Cross Attention Transformer (STEP CATFormer), which achieves notably high performance on the NTU RGB+D and NTU RGB+D 120 datasets. Our code and models are available at https://github.com/maclong01/STEP-CATFormer

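Summary
Graph convolutional networks (GCNs) have been widely used in skeleton-based action recognition with excellent results. The authors regard the way the skeleton changes across frames as the key, and therefore focus on how GCNs learn different topologies and aggregate joint features over global and local temporal ranges. They propose three channel-wise topology graph convolutions based on Channel-wise Topology Refinement Graph Convolution (CTR-GCN). Combining CTR-GCN with two cross-attention modules captures upper-lower body and hand-foot relationship features of the skeleton. To extract how the human skeleton changes across frames, they design Temporal Attention Transformers, which learn the temporal features of skeleton sequences. Finally, the temporal features are fused with a multilayer perceptron (MLP) and a classifier, yielding a high-performance graph convolutional network named Spatial-Temporal Effective Body-Part Cross Attention Transformer. The code and models are available at https://github.com/maclong01/STEP-CATFormer.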
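
The body-part cross-attention idea above can be illustrated with a bare scaled dot-product cross-attention, in which features of one body part (for example the upper body) query features of another (for example the lower body). This is a minimal sketch, not the STEP CATFormer implementation: it omits the learned query/key/value projections and multi-head structure, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: feature vectors of one body part (e.g. upper-body joints)
    keys, values: feature vectors of the other part (e.g. lower-body joints)
    Returns one attended vector per query.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        out.append(attended)
    return out
```

With an all-zero query, every key receives the same weight, so the output is simply the mean of the value vectors; a non-zero query shifts the weights toward the joints whose keys align with it.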

VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation

paper_url: http://arxiv.org/abs/2312.03275
repo_url: None
paper_authors: Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, Bernadette Bucher
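for: This paper proposes a zero-shot navigation method that helps robots find target objects in environments they were not trained on.
methods: The method builds occupancy maps from depth observations to identify frontiers, and uses RGB observations with a pre-trained vision-language model to generate a language-grounded value map.
results: The method achieves state-of-the-art results on the Gibson, Habitat-Matterport 3D, and Matterport 3D datasets, and was deployed in the real world on the Boston Dynamics Spot mobile manipulation platform, where it navigates efficiently to target objects.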

Abstract
Understanding how humans leverage semantic knowledge to navigate unfamiliar environments and decide where to explore next is pivotal for developing robots capable of human-like search behaviors. We introduce a zero-shot navigation approach, Vision-Language Frontier Maps (VLFM), which is inspired by human reasoning and designed to navigate towards unseen semantic objects in novel environments. VLFM builds occupancy maps from depth observations to identify frontiers, and leverages RGB observations and a pre-trained vision-language model to generate a language-grounded value map. VLFM then uses this map to identify the most promising frontier to explore for finding an instance of a given target object category. We evaluate VLFM in photo-realistic environments from the Gibson, Habitat-Matterport 3D (HM3D), and Matterport 3D (MP3D) datasets within the Habitat simulator. Remarkably, VLFM achieves state-of-the-art results on all three datasets as measured by success weighted by path length (SPL) for the Object Goal Navigation task. Furthermore, we show that VLFM's zero-shot nature enables it to be readily deployed on real-world robots such as the Boston Dynamics Spot mobile manipulation platform. We deploy VLFM on Spot and demonstrate its capability to efficiently navigate to target objects within an office building in the real world, without any prior knowledge of the environment. The accomplishments of VLFM underscore the promising potential of vision-language models in advancing the field of semantic navigation. Videos of real-world deployment can be viewed at naoki.io/vlfm.

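Summary
Understanding how humans use semantic knowledge to explore unfamiliar environments and decide where to go next is important for building robots with human-like search behavior. The authors propose a zero-shot navigation method, Vision-Language Frontier Maps (VLFM), which is inspired by human reasoning and navigates toward unseen semantic objects in novel environments. VLFM builds occupancy maps from depth observations, and uses RGB observations with a pre-trained vision-language model to generate a language-grounded value map. It then uses this map to pick the most promising frontier to explore when looking for an instance of a given target object category. Evaluated in the Habitat simulator on the Gibson, Habitat-Matterport 3D, and Matterport 3D datasets, VLFM achieves the best results on all three, as measured by success weighted by path length (SPL) on the Object Goal Navigation task. Its zero-shot nature also makes it easy to deploy on real-world robots such as the Boston Dynamics Spot mobile manipulation platform: deployed on Spot, VLFM navigates efficiently to target objects in a real building without prior knowledge of the environment. These results underscore the potential of vision-language models for semantic navigation. Videos are available at naoki.io/vlfm.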
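
The frontier-selection step described above can be sketched on a small grid. The sketch assumes the occupancy map and the value map are already available as 2D lists. In VLFM the value map comes from a pre-trained vision-language model, whereas here it is just a table of made-up numbers. The cell encoding and function names are hypothetical.

```python
FREE, OCCUPIED, UNKNOWN = 0, 1, -1


def find_frontiers(grid):
    """Return free cells that touch at least one unknown cell (4-connectivity)."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break
    return frontiers


def best_frontier(grid, value_map):
    """Pick the frontier cell with the highest value, or None if none exist."""
    frontiers = find_frontiers(grid)
    if not frontiers:
        return None
    return max(frontiers, key=lambda rc: value_map[rc[0]][rc[1]])


# Toy example: the right-hand column is partly unexplored.
grid = [
    [FREE, FREE, UNKNOWN],
    [FREE, OCCUPIED, UNKNOWN],
    [FREE, FREE, FREE],
]
value_map = [
    [0.1, 0.2, 0.0],
    [0.1, 0.0, 0.0],
    [0.3, 0.5, 0.9],
]
```

Here `find_frontiers(grid)` yields the two free cells bordering unexplored space, and `best_frontier` chooses the one where the value map is higher, which is the cell a robot would head to next under this simplified rule.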

Weathering Ongoing Uncertainty: Learning and Planning in a Time-Varying Partially Observable Environment

paper_url: http://arxiv.org/abs/2312.03263
repo_url: None
paper_authors: Gokul Puthumanaillam, Xiangyu Liu, Negar Mehr, Melkior Ornik
for: This paper aims to improve the optimal decision-making of autonomous systems in uncertain, stochastic, and time-varying environments.
methods: The paper combines Time-Varying Markov Decision Processes (TVMDP) with partial observability and introduces Time-Varying Partially Observable Markov Decision Processes (TV-POMDP). The proposed approach includes Memory Prioritized State Estimation (MPSE) and an MPSE-integrated planning strategy.
results: The proposed framework and algorithms demonstrate superior performance over standard methods in simulated and real-world experiments, showcasing their effectiveness in stochastic, uncertain, time-varying domains.

Abstract
Optimal decision-making presents a significant challenge for autonomous systems operating in uncertain, stochastic, and time-varying environments. Environmental variability over time can significantly impact the system's optimal decision-making strategy for mission completion. To model such environments, our work combines the previous notion of Time-Varying Markov Decision Processes (TVMDP) with partial observability and introduces Time-Varying Partially Observable Markov Decision Processes (TV-POMDP). We propose a two-pronged approach to accurately estimate and plan within the TV-POMDP: 1) Memory Prioritized State Estimation (MPSE), which leverages weighted memory to provide more accurate time-varying transition estimates; and 2) an MPSE-integrated planning strategy that optimizes long-term rewards while accounting for temporal constraints. We validate the proposed framework and algorithms using simulations and hardware, with robots exploring partially observable, time-varying environments. Our results demonstrate superior performance over standard methods, highlighting the framework's effectiveness in stochastic, uncertain, time-varying domains.

Summary
Optimal decision-making is a significant challenge for autonomous systems operating in uncertain, stochastic, and time-varying environments, because environmental change over time can alter the best strategy for completing a mission. To model such environments, the authors combine Time-Varying Markov Decision Processes (TVMDP) with partial observability and introduce Time-Varying Partially Observable Markov Decision Processes (TV-POMDP). Their approach has two parts: Memory Prioritized State Estimation (MPSE), which uses weighted memory to give more accurate time-varying transition estimates, and an MPSE-integrated planning strategy that optimizes long-term rewards while accounting for temporal constraints. The framework is validated in simulation and on hardware, with robots exploring partially observable, time-varying environments, and it outperforms standard methods.
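
To make the notion of "weighted memory" for time-varying transition estimates more concrete, the sketch below weights each remembered transition by its recency before normalizing counts into probabilities. The exponential decay is an assumption made purely for illustration. The abstract does not say how MPSE prioritizes memory, and `weighted_transition_estimate` is a hypothetical helper rather than part of the authors' method.

```python
from collections import defaultdict


def weighted_transition_estimate(memory, now, decay=0.5):
    """Estimate P(next_state | state, action) from timestamped transitions.

    memory: list of (time, state, action, next_state) tuples
    now:    current time step
    decay:  assumed per-step discount, so older samples count less
    """
    weights = defaultdict(lambda: defaultdict(float))
    for t, state, action, next_state in memory:
        weights[(state, action)][next_state] += decay ** (now - t)

    estimate = {}
    for key, outcomes in weights.items():
        total = sum(outcomes.values())
        estimate[key] = {s: w / total for s, w in outcomes.items()}
    return estimate


# The environment drifted: "go" from A used to lead to B, now leads to C.
memory = [
    (0, "A", "go", "B"),
    (1, "A", "go", "B"),
    (2, "A", "go", "C"),
    (3, "A", "go", "C"),
]
estimate = weighted_transition_estimate(memory, now=3)
```

An unweighted count would put B and C at 0.5 each, while the recency-weighted estimate assigns most of the probability mass to C, reflecting the more recent behavior of the environment.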

Customizable Combination of Parameter-Efficient Modules for Multi-Task Learning

paper_url: http://arxiv.org/abs/2312.03248
repo_url: None
paper_authors: Haowen Wang, Tao Sun, Cong Fan, Jinjie Gu
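for: Improving sample efficiency in multi-task learning.
methods: Combines task-common and task-specific skills whose parameters use low-rank techniques, together with a jointly learned skill assignment matrix.
results: In experiments, C-Poly shows clear performance gains over fully-shared, task-specific, and skill-indistinguishable baselines.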

Abstract
Modular and composable transfer learning is an emerging direction in the field of Parameter-Efficient Fine-Tuning, as it enables neural networks to better organize various aspects of knowledge, leading to improved cross-task generalization. In this paper, we introduce a novel approach, Customized Polytropon (C-Poly), that combines task-common skills and task-specific skills, with the skill parameters parameterized using low-rank techniques. Each task is associated with a customizable number of exclusive specialized skills and also benefits from skills shared with peer tasks. A skill assignment matrix is jointly learned. To evaluate our approach, we conducted extensive experiments on the Super-NaturalInstructions and SuperGLUE benchmarks. Our findings demonstrate that C-Poly outperforms fully-shared, task-specific, and skill-indistinguishable baselines, significantly enhancing sample efficiency in multi-task learning scenarios.

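Summary
Modular and composable transfer learning is an emerging direction in Parameter-Efficient Fine-Tuning, because it lets neural networks organize different aspects of knowledge better and thereby improves cross-task generalization. The paper introduces Customized Polytropon (C-Poly), which combines task-common and task-specific skills and parameterizes the skills with low-rank techniques. Each task has a customizable number of exclusive specialized skills and also draws on skills shared with peer tasks, and a skill assignment matrix is learned jointly. In extensive experiments on the Super-NaturalInstructions and SuperGLUE benchmarks, C-Poly significantly improves sample efficiency in multi-task learning.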
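
A rough picture of how shared and task-specific low-rank skills might be combined for one task is given below. It is a sketch under several assumptions: each skill is a LoRA-style pair of factors whose product is a weight update, the task's row of the skill assignment matrix is supplied as fixed numbers instead of being learned, and all names are hypothetical. It is not the C-Poly implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    return [[x * s for x in row] for row in a]


def low_rank_delta(skill):
    """A skill is a pair (A, B) with A: d x r and B: r x d; its update is A @ B."""
    A, B = skill
    return matmul(A, B)


def task_delta(shared_skills, allocation_row, private_skills):
    """Weight update for one task.

    shared_skills:  skills available to every task
    allocation_row: this task's row of the skill assignment matrix
    private_skills: skills reserved for this task only
    """
    d = len(shared_skills[0][0])
    delta = [[0.0] * d for _ in range(d)]
    for weight, skill in zip(allocation_row, shared_skills):
        delta = add(delta, scale(low_rank_delta(skill), weight))
    for skill in private_skills:
        delta = add(delta, low_rank_delta(skill))
    return delta


# Two shared rank-1 skills and one private rank-1 skill for a 2 x 2 layer.
shared = [
    ([[1.0], [0.0]], [[1.0, 0.0]]),
    ([[0.0], [1.0]], [[0.0, 1.0]]),
]
private = [([[1.0], [1.0]], [[0.5, 0.5]])]
delta = task_delta(shared, [0.25, 0.75], private)
```

The resulting `delta` would be added to a frozen base weight matrix; only the small factors and the assignment weights would need training, which is where the parameter efficiency comes from.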

A Simple Framework to Enhance the Adversarial Robustness of Deep Learning-based Intrusion Detection System

paper_url: http://arxiv.org/abs/2312.03245
repo_url: None
paper_authors: Xinwei Yuan, Shu Han, Wei Huang, Hongliang Ye, Xianglong Kong, Fan Zhang
for: The paper proposes a novel intrusion detection system (IDS) architecture that combines conventional machine learning (ML) models and deep learning (DL) models to enhance the robustness of IDS against adversarial attacks.
methods: The proposed IDS architecture consists of three components: a DL-based IDS, an adversarial example (AE) detector, and an ML-based IDS. The AE detector is based on local intrinsic dimensionality (LID), and the ML-based IDS is used to determine the maliciousness of AEs. The fusion mechanism leverages the high prediction accuracy of DL models and the low attack transferability between DL models and ML models to improve the robustness of the whole system.
results: The paper shows a significant improvement in the prediction performance of the IDS when subjected to adversarial attack, achieving high accuracy with low resource consumption.

Abstract
Deep learning based intrusion detection systems (DL-based IDS) have emerged as one of the best choices for providing security solutions against various network intrusion attacks. However, due to the emergence and development of adversarial deep learning technologies, it becomes challenging for the adoption of DL models into IDS. In this paper, we propose a novel IDS architecture that can enhance the robustness of IDS against adversarial attacks by combining conventional machine learning (ML) models and Deep Learning models. The proposed DLL-IDS consists of three components: DL-based IDS, adversarial example (AE) detector, and ML-based IDS. We first develop a novel AE detector based on the local intrinsic dimensionality (LID). Then, we exploit the low attack transferability between DL models and ML models to find a robust ML model that can assist us in determining the maliciousness of AEs. If the input traffic is detected as an AE, the ML-based IDS will predict the maliciousness of input traffic, otherwise the DL-based IDS will work for the prediction. The fusion mechanism can leverage the high prediction accuracy of DL models and low attack transferability between DL models and ML models to improve the robustness of the whole system. In our experiments, we observe a significant improvement in the prediction performance of the IDS when subjected to adversarial attack, achieving high accuracy with low resource consumption.

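Summary
Deep learning based intrusion detection systems (DL-based IDS) have become one of the preferred choices for network security, but the development of adversarial deep learning techniques makes it challenging to adopt DL models in IDS. This paper proposes a new IDS architecture that strengthens robustness against adversarial attacks by combining deep learning models with conventional machine learning models. The proposal has three components: a DL-based IDS, an adversarial example (AE) detector, and an ML-based IDS. The authors first develop an AE detector based on local intrinsic dimensionality (LID). They then exploit the low attack transferability between DL models and ML models to find a robust ML model that determines the maliciousness of AEs. If the input traffic is detected as an AE, the ML-based IDS predicts its maliciousness; otherwise the DL-based IDS makes the prediction. This fusion mechanism uses the high prediction accuracy of DL models and the low attack transferability between DL and ML models to improve the robustness of the whole system. In the experiments, the prediction performance of the IDS under adversarial attack improves significantly, reaching high accuracy with low resource consumption.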
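
The routing logic described above can be sketched in a few lines. The LID estimate below uses the standard maximum-likelihood estimator over distances to the k nearest neighbours. Treating a high LID score as the adversarial signal and using a fixed threshold are assumptions made for illustration, as are the function names and the placeholder models. The paper's actual detector may be built differently.

```python
import math


def lid_mle(distances, k):
    """Maximum-likelihood LID estimate from positive neighbour distances.

    LID = -1 / mean(log(r_i / r_k)) over the k nearest neighbours,
    where r_k is the distance to the k-th nearest neighbour.
    """
    nearest = sorted(distances)[:k]
    r_k = nearest[-1]
    mean_log = sum(math.log(r / r_k) for r in nearest) / k
    if mean_log == 0.0:
        return float("inf")  # all neighbours equidistant
    return -1.0 / mean_log


def fused_predict(x, neighbour_distances, dl_model, ml_model, lid_threshold, k=3):
    """Route suspected adversarial inputs to the ML model, others to the DL model."""
    if lid_mle(neighbour_distances, k) > lid_threshold:
        return ml_model(x)
    return dl_model(x)


# Placeholder models standing in for trained classifiers.
def dl_model(x):
    return "benign"


def ml_model(x):
    return "malicious"
```

For example, `fused_predict("flow", [1.0, 2.0, 4.0], dl_model, ml_model, lid_threshold=1.0)` routes the input to the ML model because its estimated LID exceeds the threshold, whereas a threshold of 2.0 would leave the decision to the DL model.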

Multicoated and Folded Graph Neural Networks with Strong Lottery Tickets

paper_url: http://arxiv.org/abs/2312.03236
repo_url: https://github.com/louivalley/slt-gnn
paper_authors: Jiale Yan, Hiroaki Ito, Ángel López García-Arias, Yasuyuki Okoshi, Hikari Otsuka, Kazushi Kawamura, Thiem Van Chu, Masato Motomura
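for: This study explores applying the Strong Lottery Ticket Hypothesis (SLTH) to deep graph neural networks (GNNs) in order to improve accuracy and reduce memory consumption.
methods: The study uses Multicoated Supermasks (M-Sup), a scalar pruning mask method, and proposes a strategy for setting its pruning thresholds adaptively in deep GNNs.
results: Evaluated on several datasets, including the Open Graph Benchmark (OGB), SLTH-based GNNs achieve high sparsity, competitive performance, and high memory efficiency, with up to 98.7% memory reduction.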

Abstract
The Strong Lottery Ticket Hypothesis (SLTH) demonstrates the existence of high-performing subnetworks within a randomly initialized model, discoverable through pruning a convolutional neural network (CNN) without any weight training. A recent study, called Untrained GNNs Tickets (UGT), expanded SLTH from CNNs to shallow graph neural networks (GNNs). However, discrepancies persist when comparing baseline models with learned dense weights. Additionally, there remains an unexplored area in applying SLTH to deeper GNNs, which, despite delivering improved accuracy with additional layers, suffer from excessive memory requirements. To address these challenges, this work utilizes Multicoated Supermasks (M-Sup), a scalar pruning mask method, and implements it in GNNs by proposing a strategy for setting its pruning thresholds adaptively. In the context of deep GNNs, this research uncovers the existence of untrained recurrent networks, which exhibit performance on par with their trained feed-forward counterparts. This paper also introduces the Multi-Stage Folding and Unshared Masks methods to expand the search space in terms of both architecture and parameters. Through the evaluation of various datasets, including the Open Graph Benchmark (OGB), this work establishes a triple-win scenario for SLTH-based GNNs: by achieving high sparsity, competitive performance, and high memory efficiency with up to 98.7% reduction, it demonstrates suitability for energy-efficient graph processing.

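Summary
The Strong Lottery Ticket Hypothesis (SLTH) states that high-performing subnetworks exist inside a randomly initialized model and can be found by pruning a convolutional neural network (CNN) without any weight training. A recent study called Untrained GNNs Tickets (UGT) extended SLTH to shallow graph neural networks (GNNs). However, a gap remains relative to baseline models with learned dense weights, and deeper GNNs still suffer from excessive memory requirements. To address these challenges, this work uses Multicoated Supermasks (M-Sup), a scalar pruning mask method, and implements it in GNNs. For deep GNNs, the study finds untrained recurrent networks that perform on par with their trained feed-forward counterparts. The paper also proposes the Multi-Stage Folding and Unshared Masks methods to expand the search space in both architecture and parameters. Evaluated on several datasets, including the Open Graph Benchmark (OGB), the work establishes a triple-win scenario for SLTH-based GNNs: high sparsity, competitive performance, and high memory efficiency, which makes them suitable for energy-efficient graph processing.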
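
The idea of a scalar pruning mask can be illustrated as several binary masks ("coats") stacked over the same importance scores, so that frozen random weights with higher scores are scaled by larger integers. The sketch below assumes fixed, hand-picked thresholds and hypothetical function names. The adaptive threshold-setting strategy proposed in the paper is not reproduced here.

```python
def multicoated_mask(scores, thresholds):
    """Sum of binary masks, one per threshold, over the same scores.

    A weight whose score clears n thresholds receives mask value n,
    so the result is an integer-valued (scalar) mask, not a binary one.
    """
    return [sum(1 for t in thresholds if s >= t) for s in scores]


def apply_mask(weights, mask):
    """Scale frozen random weights by the scalar mask; 0 prunes the weight."""
    return [w * m for w, m in zip(weights, mask)]


scores = [0.1, 0.5, 0.9, 0.7]          # learned importance scores (made up here)
random_weights = [0.3, -0.2, 0.5, -0.4]  # never trained
mask = multicoated_mask(scores, thresholds=[0.4, 0.8])
effective = apply_mask(random_weights, mask)
```

With these numbers the first weight is pruned, the second and fourth pass one coat, and the third passes both coats and is therefore doubled. Only the scores would be optimized during training, while the random weights stay fixed.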

Deep Multimodal Fusion for Surgical Feedback Classification

paper_url: http://arxiv.org/abs/2312.03231
repo_url: None
paper_authors: Rafal Kocielnik, Elyssa Y. Wong, Timothy N. Chu, Lydia Lin, De-An Huang, Jiayun Wang, Anima Anandkumar, Andrew J. Hung
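for: The goal of this study is to automate the annotation of real-time contextual surgical feedback at scale.
methods: A multi-label machine learning model classifies surgical feedback from text, audio, and video modalities. Training is staged: each modality is first pre-trained separately, and then all modalities are trained jointly.
results: The automated classification reaches AUCs between 71.5 and 77.6 across the five feedback categories "Anatomic", "Technical", "Procedural", "Praise", and "Visual Aid". High-quality manual transcriptions of the feedback audio raise AUCs to between 76.5 and 96.2.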

Abstract
Quantification of real-time informal feedback delivered by an experienced surgeon to a trainee during surgery is important for skill improvements in surgical training. Such feedback in the live operating room is inherently multimodal, consisting of verbal conversations (e.g., questions and answers) as well as non-verbal elements (e.g., through visual cues like pointing to anatomic elements). In this work, we leverage a clinically-validated five-category classification of surgical feedback: "Anatomic", "Technical", "Procedural", "Praise" and "Visual Aid". We then develop a multi-label machine learning model to classify these five categories of surgical feedback from inputs of text, audio, and video modalities. The ultimate goal of our work is to help automate the annotation of real-time contextual surgical feedback at scale. Our automated classification of surgical feedback achieves AUCs ranging from 71.5 to 77.6 with the fusion improving performance by 3.1%. We also show that high-quality manual transcriptions of feedback audio from experts improve AUCs to between 76.5 and 96.2, which demonstrates a clear path toward future improvements. Empirically, we find that the Staged training strategy, with first pre-training each modality separately and then training them jointly, is more effective than training different modalities altogether. We also present intuitive findings on the importance of modalities for different feedback categories. This work offers an important first look at the feasibility of automated classification of real-world live surgical feedback based on text, audio, and video modalities.

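Summary
Real-time feedback from an experienced surgeon to a trainee during surgery is important for improving skills in surgical training. In the live operating room this feedback is multimodal, consisting of verbal conversation (such as questions and answers) and non-verbal elements (such as visual cues). This work adopts a clinically validated five-category classification of surgical feedback: "Anatomic", "Technical", "Procedural", "Praise", and "Visual Aid". The authors develop a multi-label machine learning model that classifies these five categories from text, audio, and video inputs. The automated classification achieves AUCs between 71.5 and 77.6, with fusion of modalities improving performance by 3.1%. High-quality manual transcriptions of the feedback audio by experts raise AUCs to between 76.5 and 96.2, which indicates a clear path toward future improvements. Experiments show that staged training, which first pre-trains each modality separately and then trains them jointly, is more effective, and that the importance of each modality differs across feedback categories. This work offers an important first investigation into automated classification of real-time contextual surgical feedback.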

SDSRA: A Skill-Driven Skill-Recombination Algorithm for Efficient Policy Learning

paper_url: http://arxiv.org/abs/2312.03216
repo_url: https://github.com/ericjiang18/sdsra
paper_authors: Eric H. Jiang, Andrew Lizarraga
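for: Improving the efficiency of achieving maximum entropy in reinforcement learning tasks.
methods: Uses the Skill-Driven Skill Recombination Algorithm (SDSRA), a new framework that integrates skill-based strategies within the Actor-Critic framework.
results: SDSRA converges faster than the traditional Soft Actor-Critic (SAC) algorithm and produces improved policies, showing remarkable adaptability and performance across a variety of complex and diverse benchmarks.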

Abstract
In this paper, we introduce a novel algorithm - the Skill-Driven Skill Recombination Algorithm (SDSRA) - an innovative framework that significantly enhances the efficiency of achieving maximum entropy in reinforcement learning tasks. We find that SDSRA achieves faster convergence compared to the traditional Soft Actor-Critic (SAC) algorithm and produces improved policies. By integrating skill-based strategies within the robust Actor-Critic framework, SDSRA demonstrates remarkable adaptability and performance across a wide array of complex and diverse benchmarks.

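Summary
This paper introduces a new algorithm, the Skill-Driven Skill Recombination Algorithm (SDSRA), an innovative framework that improves the efficiency of achieving maximum entropy in reinforcement learning tasks. The authors find that SDSRA converges faster than the traditional Soft Actor-Critic (SAC) algorithm and produces better policies. By integrating skill-based strategies within the robust Actor-Critic framework, SDSRA shows remarkable adaptability and performance across a wide range of complex and diverse benchmarks.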

2023-12-06

A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints

A Masked Pruning Approach for Dimensionality Reduction in Communication-Efficient Federated Learning Systems

On The Fairness Impacts of Hardware Selection in Machine Learning

FoMo Rewards: Can we cast foundation models as reward functions?

Scaling transformer neural networks for skillful and reliable medium-range weather forecasting

The BigCode Project Governance Card

Efficient Large Language Models: A Survey

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

OneLLM: One Framework to Align All Modalities with Language

Intrinsic Harmonization for Illumination-Aware Compositing

MatterGen: a generative model for inorganic materials design

LLM as OS (llmao), Agents as Apps: Envisioning AIOS, Agents and the AIOS-Agent Ecosystem

What Planning Problems Can A Relational Neural Network Solve?

An Integration of Pre-Trained Speech and Language Models for End-to-End Speech Recognition

Generative agent-based modeling with actions grounded in physical, social, or digital space using Concordia

Pearl: A Production-ready Reinforcement Learning Agent

Improving Activation Steering in Language Models with Mean-Centring

Efficient Inverse Design Optimization through Multi-fidelity Simulations, Machine Learning, and Search Space Reduction Strategies

MotionCtrl: A Unified and Flexible Motion Controller for Video Generation

Not All Large Language Models (LLMs) Succumb to the “Reversal Curse”: A Comparative Study of Deductive Logical Reasoning in BERT and GPT Models

MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations

DreamComposer: Controllable 3D Object Generation via Multi-View Conditions

DiffusionSat: A Generative Foundation Model for Satellite Imagery

MMM: Generative Masked Motion Model

Foundation Model Assisted Weakly Supervised Semantic Segmentation

Invariance & Causal Representation Learning: Prospects and Limitations

Generalization to New Sequential Decision Making Tasks with In-Context Learning

GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models

Low-power, Continuous Remote Behavioral Localization with Event Cameras

On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm

Multi-Scale and Multi-Modal Contrastive Learning Network for Biomedical Time Series

Optimal Wildfire Escape Route Planning for Drones under Dynamic Fire and Smoke

Defense Against Adversarial Attacks using Convolutional Auto-Encoders

Active Wildfires Detection and Dynamic Escape Routes Planning for Humans through Information Fusion between Drones and Satellites

FRDiff: Feature Reuse for Exquisite Zero-shot Acceleration of Diffusion Models

Speculative Exploration on the Concept of Artificial Agents Conducting Autonomous Research

Learning From Scenarios for Stochastic Repairable Scheduling

JAMMIN-GPT: Text-based Improvisation using LLMs in Ableton Live

Molecule Joint Auto-Encoding: Trajectory Pretraining with 2D and 3D Diffusion

Data is Overrated: Perceptual Metrics Can Lead Learning in the Absence of Training Data

Quantum-Inspired Neural Network Model of Optical Illusions

Sports Recommender Systems: Overview and Research Issues

Approximating Solutions to the Knapsack Problem using the Lagrangian Dual Framework

Generalized Contrastive Divergence: Joint Training of Energy-Based Model and Diffusion Model through Inverse Reinforcement Learning

Diffused Task-Agnostic Milestone Planner

Lite-Mind: Towards Efficient and Versatile Brain Representation Network

Demand response for residential building heating: Effective Monte Carlo Tree Search control based on physics-informed neural networks

Teaching Specific Scientific Knowledge into Large Language Models through Additional Training

Online Vectorized HD Map Construction using Geometry

Benchmarking Continual Learning from Cognitive Perspectives

Dyport: Dynamic Importance-based Hypothesis Generation Benchmarking Technique

SoftMAC: Differentiable Soft Body Simulation with Forecast-based Contact Model and Two-way Coupling with Articulated Rigid Bodies and Clothes

OMNIINPUT: A Model-centric Evaluation Framework through Output Distribution

Can language agents be alternatives to PPO? A Preliminary Empirical Study On OpenAI Gym

STEP CATFormer: Spatial-Temporal Effective Body-Part Cross Attention Transformer for Skeleton-based Action Recognition

VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation

Weathering Ongoing Uncertainty: Learning and Planning in a Time-Varying Partially Observable Environment

Customizable Combination of Parameter-Efficient Modules for Multi-Task Learning

A Simple Framework to Enhance the Adversarial Robustness of Deep Learning-based Intrusion Detection System

Multicoated and Folded Graph Neural Networks with Strong Lottery Tickets

Deep Multimodal Fusion for Surgical Feedback Classification

SDSRA: A Skill-Driven Skill-Recombination Algorithm for Efficient Policy Learning