cs.AI - 2023-09-29

On the Equivalence of Graph Convolution and Mixup

  • paper_url: http://arxiv.org/abs/2310.00183
  • repo_url: None
  • paper_authors: Xiaotian Han, Hanqing Zeng, Yu Chen, Shaoliang Nie, Jingzhou Liu, Kanika Narang, Zahra Shakeri, Karthik Abinav Sankararaman, Song Jiang, Madian Khabsa, Qifan Wang, Xia Hu
  • for: This work investigates the relationship between graph convolution and the Mixup technique.
  • methods: The authors analyze graph convolution and Mixup as two ways of combining information from multiple samples to derive feature representations for a node or sample.
  • results: Under two mild conditions, graph convolution can be viewed as a special form of Mixup applied during both training and testing. The two conditions are: 1) Homophily Relabel - assigning the target node's label to all its neighbors, and 2) Test-Time Mixup - mixing features at test time. The paper proves mathematically that graph convolutional networks (GCN) and simplified graph convolution (SGC) can be expressed as Mixup, and verifies the equivalence empirically by training an MLP under the two conditions.
    Abstract This paper investigates the relationship between graph convolution and Mixup techniques. Graph convolution in a graph neural network involves aggregating features from neighboring samples to learn representative features for a specific node or sample. On the other hand, Mixup is a data augmentation technique that generates new examples by averaging features and one-hot labels from multiple samples. One commonality between these techniques is their utilization of information from multiple samples to derive feature representation. This study aims to explore whether a connection exists between these two approaches. Our investigation reveals that, under two mild conditions, graph convolution can be viewed as a specialized form of Mixup that is applied during both the training and testing phases. The two conditions are: 1) \textit{Homophily Relabel} - assigning the target node's label to all its neighbors, and 2) \textit{Test-Time Mixup} - Mixup the feature during the test time. We establish this equivalence mathematically by demonstrating that graph convolution networks (GCN) and simplified graph convolution (SGC) can be expressed as a form of Mixup. We also empirically verify the equivalence by training an MLP using the two conditions to achieve comparable performance.
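    Example: A minimal numpy sketch (not from the paper; the toy graph, node indices, and uniform mixing coefficients are illustrative assumptions) of why mean-aggregation graph convolution over a relabeled neighborhood produces exactly a Mixup sample:
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy neighborhood: node 0 plus two neighbors (self-loop included). Under the
# Homophily Relabel condition, every neighbor is relabeled with node 0's label.
X = rng.normal(size=(3, 4))          # features of node 0 and its two neighbors
y0 = np.array([1.0, 0.0])            # one-hot label of the target node 0

# Graph convolution for node 0: mean aggregation over the neighborhood.
h_gcn = X.mean(axis=0)

# Mixup over the same samples with uniform coefficients 1/3 and relabeled targets.
lam = np.full(3, 1.0 / 3.0)
h_mix = (lam[:, None] * X).sum(axis=0)
y_mix = (lam[:, None] * np.tile(y0, (3, 1))).sum(axis=0)

assert np.allclose(h_gcn, h_mix)     # aggregated features coincide with the Mixup feature
assert np.allclose(y_mix, y0)        # the mixed label collapses to the target node's label
```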

Junk DNA Hypothesis: A Task-Centric Angle of LLM Pre-trained Weights through Sparsity

  • paper_url: http://arxiv.org/abs/2310.02277
  • repo_url: https://github.com/vita-group/junk_dna_hypothesis
  • paper_authors: Lu Yin, Shiwei Liu, Ajay Jaiswal, Souvik Kundu, Zhangyang Wang
  • for: The paper asks whether the low-magnitude weights in large language models (LLMs) are truly redundant, and whether they matter for downstream tasks.
  • methods: Sparsity is used as a tool to isolate and quantify the nuanced significance of low-magnitude weights in pre-trained LLMs, viewed from a downstream task-centric angle.
  • results: Although small-magnitude weights may appear "useless" for simple tasks and suitable for pruning, they actually encode crucial knowledge needed for solving harder downstream tasks - a kind of latent "junk DNA" whose removal causes irreversible knowledge forgetting and performance damage. These findings may change our understanding of how LLMs encode knowledge and open research directions in model pruning and task-aware conditional computation.
    Abstract The traditional notion of "Junk DNA" has long been linked to non-coding segments within the human genome, constituting roughly 98% of its composition. However, recent research has unveiled the critical roles some of these seemingly non-functional DNA sequences play in cellular processes. Intriguingly, the weights within deep neural networks exhibit a remarkable similarity to the redundancy observed in human genes. It was believed that weights in gigantic models contained excessive redundancy, and could be removed without compromising performance. This paper challenges this conventional wisdom by presenting a compelling counter-argument. We employ sparsity as a tool to isolate and quantify the nuanced significance of low-magnitude weights in pre-trained large language models (LLMs). Our study demonstrates a strong correlation between these weight magnitudes and the knowledge they encapsulate, from a downstream task-centric angle. We raise the "Junk DNA Hypothesis" backed by our in-depth investigation: while small-magnitude weights may appear "useless" for simple tasks and suitable for pruning, they actually encode crucial knowledge necessary for solving more difficult downstream tasks. Removing these seemingly insignificant weights can lead to irreversible knowledge forgetting and performance damage in difficult tasks. These findings offer fresh insights into how LLMs encode knowledge in a task-sensitive manner, pave future research direction in model pruning, and open avenues for task-aware conditional computation during inference.
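    Example: The pruning operation at the center of the hypothesis is standard unstructured magnitude pruning. A minimal PyTorch sketch (my own illustration, not the authors' code; the layer size and sparsity level are arbitrary) of removing the seemingly "useless" low-magnitude weights:
```python
import torch

def magnitude_prune_(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out, in place, the `sparsity` fraction of entries with the smallest magnitude."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    weight.mul_((weight.abs() > threshold).float())
    return weight

# Toy example: prune 50% of a random "pre-trained" weight matrix.
w = torch.randn(256, 256)
magnitude_prune_(w, sparsity=0.5)
print(f"remaining non-zeros: {(w != 0).float().mean():.2f}")
```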

Motif: Intrinsic Motivation from Artificial Intelligence Feedback

  • paper_url: http://arxiv.org/abs/2310.00166
  • repo_url: https://github.com/facebookresearch/motif
  • paper_authors: Martin Klissarov, Pierluca D’Oro, Shagun Sodhani, Roberta Raileanu, Pierre-Luc Bacon, Pascal Vincent, Amy Zhang, Mikael Henaff
  • for: The work aims to interface the prior knowledge contained in a large language model (LLM) with an agent, so that the agent can explore rich environments and evaluate its actions without prior knowledge.
  • methods: The proposed method, Motif, grounds LLMs for decision-making without requiring them to interact with the environment: it elicits preferences from an LLM over pairs of captions to construct an intrinsic reward, which is then used to train agents with reinforcement learning.
  • results: On the NetHack game, Motif achieves a higher game score than an algorithm trained directly to maximize the score, even though it only learns to maximize its intrinsic reward. Combined with the environment reward, the method significantly outperforms existing approaches and makes progress on tasks where no advances had previously been made without demonstrations. Motif mostly generates intuitive, human-aligned behaviors that can be steered easily through prompt modifications, and it scales well with the LLM size and the amount of information in the prompt.
    Abstract Exploring rich environments and evaluating one's actions without prior knowledge is immensely challenging. In this paper, we propose Motif, a general method to interface such prior knowledge from a Large Language Model (LLM) with an agent. Motif is based on the idea of grounding LLMs for decision-making without requiring them to interact with the environment: it elicits preferences from an LLM over pairs of captions to construct an intrinsic reward, which is then used to train agents with reinforcement learning. We evaluate Motif's performance and behavior on the challenging, open-ended and procedurally-generated NetHack game. Surprisingly, by only learning to maximize its intrinsic reward, Motif achieves a higher game score than an algorithm directly trained to maximize the score itself. When combining Motif's intrinsic reward with the environment reward, our method significantly outperforms existing approaches and makes progress on tasks where no advancements have ever been made without demonstrations. Finally, we show that Motif mostly generates intuitive human-aligned behaviors which can be steered easily through prompt modifications, while scaling well with the LLM size and the amount of information given in the prompt.
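    Example: a toy sketch of the core recipe - derive a scalar reward from pairwise LLM preferences over event captions via a Bradley-Terry fit. Everything here is a stand-in: the paper trains a neural reward model on observations and queries a real LLM, whereas llm_prefers below is a placeholder heuristic.
```python
import math
import random
from collections import defaultdict

def llm_prefers(caption_a: str, caption_b: str) -> int:
    """Hypothetical stand-in for prompting an LLM to pick the preferable caption."""
    return int(len(caption_b) > len(caption_a))   # placeholder heuristic, not a real LLM

def fit_rewards(captions, n_pairs=500, lr=0.1, epochs=50):
    """Fit one scalar reward per caption from pairwise preferences (Bradley-Terry)."""
    pairs = [(random.randrange(len(captions)), random.randrange(len(captions)))
             for _ in range(n_pairs)]
    prefs = [llm_prefers(captions[i], captions[j]) for i, j in pairs]
    r = defaultdict(float)                        # reward (logit) per caption index
    for _ in range(epochs):
        for (i, j), pref in zip(pairs, prefs):
            p_j = 1.0 / (1.0 + math.exp(r[i] - r[j]))   # model prob. that j is preferred
            r[j] += lr * (pref - p_j)                   # gradient ascent on log-likelihood
            r[i] -= lr * (pref - p_j)
    return {captions[k]: v for k, v in r.items()}

print(fit_rewards(["you see a wall", "you pick up a silver ring", "you descend the stairs"]))
```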

Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection

  • paper_url: http://arxiv.org/abs/2310.00161
  • repo_url: https://github.com/google-research/google-research/tree/master/fvlm/dito
  • paper_authors: Dahun Kim, Anelia Angelova, Weicheng Kuo
  • for: bridging the gap between image-level pretraining and open-vocabulary object detection
  • methods: using detection-oriented image-text pretraining with a detector architecture, and a shifted-window learning approach upon window attention
  • results: setting a new state of the art of 40.4 mask AP$_r$ on the LVIS open-vocabulary detection benchmark with the common ViT-L backbone, outperforming the best existing approach by +6.5 mask AP$_r$ at system level, and achieving a competitive 40.8 novel AP on the COCO benchmark without pseudo labeling or weak supervision
    Abstract We present a new open-vocabulary detection approach based on detection-oriented image-text pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we replace the commonly used classification architecture with the detector architecture, which better serves the region-level recognition needs of detection by enabling the detector heads to learn from noisy image-text pairs. Using only standard contrastive loss and no pseudo-labeling, our approach is a simple yet effective extension of the contrastive learning method to learn emergent object-semantic cues. In addition, we propose a shifted-window learning approach upon window attention to make the backbone representation more robust, translation-invariant, and less biased by the window pattern. On the popular LVIS open-vocabulary detection benchmark, our approach sets a new state of the art of 40.4 mask AP$_r$ using the common ViT-L backbone, significantly outperforming the best existing approach by +6.5 mask AP$_r$ at system level. On the COCO benchmark, we achieve very competitive 40.8 novel AP without pseudo labeling or weak supervision. In addition, we evaluate our approach on the transfer detection setup, where ours outperforms the baseline significantly. Visualization reveals emerging object locality from the pretraining recipes compared to the baseline. Code and models will be publicly released.

Self-Specialization: Uncovering Latent Expertise within Large Language Models

  • paper_url: http://arxiv.org/abs/2310.00160
  • repo_url: None
  • paper_authors: Junmo Kang, Hongyin Luo, Yada Zhu, James Glass, David Cox, Alan Ritter, Rogerio Feris, Leonid Karlinsky
  • for: This work focuses on self-alignment for expert domain specialization (e.g., biomedicine), finding it very effective for improving zero-shot and few-shot performance in target domains of interest.
  • methods: Self-specialization leverages domain-specific unlabelled data and a few labeled seeds for the self-alignment process, augmented with retrieval to reduce hallucination and enhance the concurrency of the alignment.
  • results: The self-specialized model (30B) outperforms its base model, MPT-30B, by a large margin on the biomedical domain and even surpasses larger popular models based on LLaMA-65B, highlighting its potential and practicality for specialization, especially given its efficiency in terms of data and parameters.
    Abstract Recent works have demonstrated the effectiveness of self-alignment in which a large language model is, by itself, aligned to follow general instructions through the automatic generation of instructional data using a handful of human-written seeds. Instead of general alignment, in this work, we focus on self-alignment for expert domain specialization (e.g., biomedicine), discovering it to be very effective for improving zero-shot and few-shot performance in target domains of interest. As a preliminary, we first present the benchmark results of existing aligned models within a specialized domain, which reveals the marginal effect that "generic" instruction-following training has on downstream expert domains' performance. To remedy this, we explore self-specialization that leverages domain-specific unlabelled data and a few labeled seeds for the self-alignment process. When augmented with retrieval to reduce hallucination and enhance concurrency of the alignment, self-specialization offers an effective (and efficient) way of "carving out" an expert model out of a "generalist", pre-trained LLM where different domains of expertise are originally combined in a form of "superposition". Our experimental results on a biomedical domain show that our self-specialized model (30B) outperforms its base model, MPT-30B by a large margin and even surpasses larger popular models based on LLaMA-65B, highlighting its potential and practicality for specialization, especially considering its efficiency in terms of data and parameters.

Feedback-guided Data Synthesis for Imbalanced Classification

  • paper_url: http://arxiv.org/abs/2310.00158
  • repo_url: None
  • paper_authors: Reyhane Askari Hemmat, Mohammad Pezeshki, Florian Bordes, Michal Drozdzal, Adriana Romero-Soriano
  • for: improving image classification performance under long-tailed and group-imbalanced distributions
  • methods: augmenting static datasets with synthetic samples from a generative model, using one-shot feedback from the classifier to drive the sampling so that generated images are useful for improving the classifier
  • results: state-of-the-art results on ImageNet-LT, with over 4% improvement on underrepresented classes while being twice as efficient in the number of generated synthetic samples, and over 5% gains in worst-group accuracy on NICO++
    Abstract Current status quo in machine learning is to use static datasets of real images for training, which often come from long-tailed distributions. With the recent advances in generative models, researchers have started augmenting these static datasets with synthetic data, reporting moderate performance improvements on classification tasks. We hypothesize that these performance gains are limited by the lack of feedback from the classifier to the generative model, which would promote the usefulness of the generated samples to improve the classifier's performance. In this work, we introduce a framework for augmenting static datasets with useful synthetic samples, which leverages one-shot feedback from the classifier to drive the sampling of the generative model. In order for the framework to be effective, we find that the samples must be close to the support of the real data of the task at hand, and be sufficiently diverse. We validate three feedback criteria on a long-tailed dataset (ImageNet-LT) as well as a group-imbalanced dataset (NICO++). On ImageNet-LT, we achieve state-of-the-art results, with over 4 percent improvement on underrepresented classes while being twice efficient in terms of the number of generated synthetic samples. NICO++ also enjoys marked boosts of over 5 percent in worst group accuracy. With these results, our framework paves the path towards effectively leveraging state-of-the-art text-to-image models as data sources that can be queried to improve downstream applications.
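    Example: the exact feedback criteria are the paper's contribution; the sketch below only illustrates the general shape of classifier-feedback-driven selection. The "hard but near the real-data support" criterion, the assumption that the classifier returns both logits and features, and the thresholds are all my own illustrative choices.
```python
import torch
import torch.nn.functional as F

def select_useful_synthetic(classifier, x_synth, y_synth, real_feats, tau=0.8, k=256):
    """Keep synthetic samples that the classifier finds hard (high loss) while
    staying close to the support of real features (high cosine similarity)."""
    with torch.no_grad():
        logits, feats = classifier(x_synth)          # assumed to return (logits, features)
        loss = F.cross_entropy(logits, y_synth, reduction="none")
        sims = F.normalize(feats, dim=1) @ F.normalize(real_feats, dim=1).T
        near_support = (sims.max(dim=1).values > tau).float()
    scores = loss * near_support                     # discard samples far from real data
    return scores.topk(min(k, scores.numel())).indices
```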

Learning Generalizable Tool-use Skills through Trajectory Generation

  • paper_url: http://arxiv.org/abs/2310.00156
  • repo_url: None
  • paper_authors: Carl Qi, Sarthak Shetty, Xingyu Lin, David Held
  • for: Autonomous systems that use tools efficiently could assist humans with many everyday tasks such as cooking and cleaning, but current systems fall short of human-level intelligence when adapting to novel tools.
  • methods: The paper learns a generative model of tool-use trajectories, represented as sequences of point clouds, that generalizes across tool shapes. Given a novel tool, it first generates a tool-use trajectory and then optimizes the sequence of tool poses to align with the generated trajectory. A single model is trained across four challenging deformable-object manipulation tasks.
  • results: Trained with demonstration data from only a single tool per task, the model generalizes to various novel tools and significantly outperforms baselines. Additional materials are available on the project website: https://sites.google.com/view/toolgen.
    Abstract Autonomous systems that efficiently utilize tools can assist humans in completing many common tasks such as cooking and cleaning. However, current systems fall short of matching human-level of intelligence in terms of adapting to novel tools. Prior works based on affordance often make strong assumptions about the environments and cannot scale to more complex, contact-rich tasks. In this work, we tackle this challenge and explore how agents can learn to use previously unseen tools to manipulate deformable objects. We propose to learn a generative model of the tool-use trajectories as a sequence of point clouds, which generalizes to different tool shapes. Given any novel tool, we first generate a tool-use trajectory and then optimize the sequence of tool poses to align with the generated trajectory. We train a single model for four different challenging deformable object manipulation tasks. Our model is trained with demonstration data from just a single tool for each task and is able to generalize to various novel tools, significantly outperforming baselines. Additional materials can be found on our project website: https://sites.google.com/view/toolgen.

Primal-Dual Continual Learning: Stability and Plasticity through Lagrange Multipliers

  • paper_url: http://arxiv.org/abs/2310.00154
  • repo_url: None
  • paper_authors: Juan Elenter, Navid NaderiAlizadeh, Tara Javidi, Alejandro Ribeiro
  • for: The goal is to handle the no-forgetting requirement in continual learning directly, i.e., learning new tasks without forgetting previously learned ones.
  • methods: The work leverages Lagrangian duality to solve the constrained continual learning problem explicitly, focusing on memory-based methods and analyzing two versions of the problem: a coarse approach with constraints at the task level and a fine approach with constraints at the sample level.
  • results: Dual variables indicate the sensitivity of the optimal value to constraint perturbations; this is used to partition the replay buffer in the coarse approach (allocating more resources to harder tasks) and to populate it in the fine approach (including only impactful samples). Sub-optimality bounds are derived and corroborated empirically on various continual learning benchmarks, and the limitations with respect to available memory and the number of constraints are discussed.
    Abstract Continual learning is inherently a constrained learning problem. The goal is to learn a predictor under a \emph{no-forgetting} requirement. Although several prior studies formulate it as such, they do not solve the constrained problem explicitly. In this work, we show that it is both possible and beneficial to undertake the constrained optimization problem directly. To do this, we leverage recent results in constrained learning through Lagrangian duality. We focus on memory-based methods, where a small subset of samples from previous tasks can be stored in a replay buffer. In this setting, we analyze two versions of the continual learning problem: a coarse approach with constraints at the task level and a fine approach with constraints at the sample level. We show that dual variables indicate the sensitivity of the optimal value with respect to constraint perturbations. We then leverage this result to partition the buffer in the coarse approach, allocating more resources to harder tasks, and to populate the buffer in the fine approach, including only impactful samples. We derive sub-optimality bounds, and empirically corroborate our theoretical results in various continual learning benchmarks. We also discuss the limitations of these methods with respect to the amount of memory available and the number of constraints involved in the optimization problem.
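    Example: a sketch of the primal-dual pattern the paper builds on, for the coarse (task-level) variant - minimize the current-task loss subject to per-task replay losses staying below a forgetting tolerance, updating one Lagrange multiplier per past task by dual ascent. The tolerance eps, the dual step size, and the use of cross-entropy are my assumptions, not the paper's exact formulation.
```python
import torch
import torch.nn.functional as F

def primal_dual_step(model, optimizer, current_batch, task_buffers, lambdas, eps, dual_lr=0.05):
    """One primal (gradient descent) step on the Lagrangian, then one dual (ascent) step."""
    x, y = current_batch
    lagrangian = F.cross_entropy(model(x), y)
    slacks = []
    for lam, (xb, yb) in zip(lambdas, task_buffers):         # one replay buffer per past task
        replay_loss = F.cross_entropy(model(xb), yb)
        lagrangian = lagrangian + lam * (replay_loss - eps)  # constraint: replay_loss <= eps
        slacks.append((replay_loss - eps).detach())
    optimizer.zero_grad()
    lagrangian.backward()
    optimizer.step()
    for i, slack in enumerate(slacks):                       # dual variables stay non-negative
        lambdas[i] = max(0.0, lambdas[i] + dual_lr * slack.item())
    return lambdas
```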

3D Reconstruction in Noisy Agricultural Environments: A Bayesian Optimization Perspective for View Planning

  • paper_url: http://arxiv.org/abs/2310.00145
  • repo_url: None
  • paper_authors: Athanasios Bacharis, Konstantinos D. Polyzos, Henry J. Nelson, Georgios B. Giannakis, Nikolaos Papanikolopoulos
  • for: improving 3D reconstruction performance in noisy environments
  • methods: view planning with a Bayesian optimization algorithm and a geometric criterion that selects a small number of informative camera positions while accounting for environment noise, relying only on a few noise realizations
  • results: effective 3D reconstruction in noisy agricultural environments using only a small number of cameras
    Abstract 3D reconstruction is a fundamental task in robotics that gained attention due to its major impact in a wide variety of practical settings, including agriculture, underwater, and urban environments. An important approach for this task, known as view planning, is to judiciously place a number of cameras in positions that maximize the visual information improving the resulting 3D reconstruction. Circumventing the need for a large number of arbitrary images, geometric criteria can be applied to select fewer yet more informative images to markedly improve the 3D reconstruction performance. Nonetheless, incorporating the noise of the environment that exists in various real-world scenarios into these criteria may be challenging, particularly when prior information about the noise is not provided. To that end, this work advocates a novel geometric function that accounts for the existing noise, relying solely on a relatively small number of noise realizations without requiring its closed-form expression. With no analytic expression of the geometric function, this work puts forth a Bayesian optimization algorithm for accurate 3D reconstruction in the presence of noise. Numerical tests on noisy agricultural environments showcase the impressive merits of the proposed approach for 3D reconstruction with even a small number of available cameras.
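    Example: a minimal Bayesian optimization loop over candidate camera positions. The synthetic noisy_view_quality objective, the RBF kernel, and the UCB acquisition are stand-ins for the paper's noise-aware geometric criterion, chosen only to show the overall loop.
```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def noisy_view_quality(view, rng):
    """Synthetic stand-in for the noise-aware geometric reconstruction criterion."""
    return -np.sum((view - 0.3) ** 2) + 0.05 * rng.normal()

rng = np.random.default_rng(0)
candidates = rng.uniform(0, 1, size=(200, 3))                 # candidate camera positions
X = [candidates[i] for i in range(5)]                         # initial random evaluations
y = [noisy_view_quality(v, rng) for v in X]

for _ in range(20):                                           # Bayesian optimization loop
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2)).fit(np.array(X), y)
    mu, sigma = gp.predict(candidates, return_std=True)
    acquisition = mu + 2.0 * sigma                            # upper confidence bound
    best = candidates[int(np.argmax(acquisition))]
    X.append(best)
    y.append(noisy_view_quality(best, rng))

print("best views found:", np.array(X)[np.argsort(y)[-3:]])
```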

Probabilistic Sampling-Enhanced Temporal-Spatial GCN: A Scalable Framework for Transaction Anomaly Detection in Ethereum Networks

  • paper_url: http://arxiv.org/abs/2310.00144
  • repo_url: None
  • paper_authors: Stefan Kambiz Behfar, Jon Crowcroft
  • for: improving the security and transparency of blockchain platforms by detecting anomalous transactions and transaction bursts on the Ethereum network
  • methods: a fusion of graph convolutional networks (GCNs) with temporal random walks (TRW) enhanced by probabilistic sampling, capturing both spatial relationships and time-based transactional sequences
  • results: the TRW-GCN framework substantially improves performance metrics over conventional GCNs in detecting anomalies and transaction bursts
    Abstract The rapid evolution of the Ethereum network necessitates sophisticated techniques to ensure its robustness against potential threats and to maintain transparency. While Graph Neural Networks (GNNs) have pioneered anomaly detection in such platforms, capturing the intricacies of both spatial and temporal transactional patterns has remained a challenge. This study presents a fusion of Graph Convolutional Networks (GCNs) with Temporal Random Walks (TRW) enhanced by probabilistic sampling to bridge this gap. Our approach, unlike traditional GCNs, leverages the strengths of TRW to discern complex temporal sequences in Ethereum transactions, thereby providing a more nuanced transaction anomaly detection mechanism. Preliminary evaluations demonstrate that our TRW-GCN framework substantially advances the performance metrics over conventional GCNs in detecting anomalies and transaction bursts. This research not only underscores the potential of temporal cues in Ethereum transactional data but also offers a scalable and effective methodology for ensuring the security and transparency of decentralized platforms. By harnessing both spatial relationships and time-based transactional sequences as node features, our model introduces an additional layer of granularity, making the detection process more robust and less prone to false positives. This work lays the foundation for future research aimed at optimizing and enhancing the transparency of blockchain technologies, and serves as a testament to the significance of considering both time and space dimensions in the ever-evolving landscape of the decentralized platforms.
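    Example: a toy temporal random walk over a transaction edge list, the kind of time-respecting walk that TRW-GCN feeds into the GCN; the probabilistic sampling weights and the node-feature construction used in the paper are omitted here.
```python
import random
from collections import defaultdict

def temporal_random_walk(edges, start, walk_len, rng=random):
    """Sample a walk that only follows edges with non-decreasing timestamps,
    so the path respects the temporal order of the transactions."""
    out = defaultdict(list)                      # node -> list of (timestamp, neighbor)
    for src, dst, t in edges:
        out[src].append((t, dst))
    walk, node, t_prev = [start], start, float("-inf")
    for _ in range(walk_len - 1):
        later = [(t, v) for t, v in out[node] if t >= t_prev]
        if not later:
            break
        t_prev, node = rng.choice(later)
        walk.append(node)
    return walk

# Toy transaction graph: (sender, receiver, timestamp).
edges = [("a", "b", 1), ("b", "c", 2), ("b", "d", 3), ("c", "a", 4)]
print(temporal_random_walk(edges, "a", walk_len=4))   # e.g. ['a', 'b', 'c', 'a']
```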

GASS: Generalizing Audio Source Separation with Large-scale Data

  • paper_url: http://arxiv.org/abs/2310.00140
  • repo_url: None
  • paper_authors: Jordi Pons, Xiaoyu Liu, Santiago Pascual, Joan Serrà
  • for: separating speech, music, and sound events in a supervised fashion
  • methods: training a single general audio source separation (GASS) model on a large-scale dataset to separate speech, music, and sound events in a supervised fashion, then fine-tuning it on individual benchmarks
  • results: strong in-distribution performance and competitive out-of-distribution results in sound event and speech separation, although generalizing to out-of-distribution cinematic and music content remains challenging; all fine-tuned models (except the music separation one) obtain state-of-the-art results in their respective benchmarks
    Abstract Universal source separation targets at separating the audio sources of an arbitrary mix, removing the constraint to operate on a specific domain like speech or music. Yet, the potential of universal source separation is limited because most existing works focus on mixes with predominantly sound events, and small training datasets also limit its potential for supervised learning. Here, we study a single general audio source separation (GASS) model trained to separate speech, music, and sound events in a supervised fashion with a large-scale dataset. We assess GASS models on a diverse set of tasks. Our strong in-distribution results show the feasibility of GASS models, and the competitive out-of-distribution performance in sound event and speech separation shows its generalization abilities. Yet, it is challenging for GASS models to generalize for separating out-of-distribution cinematic and music content. We also fine-tune GASS models on each dataset and consistently outperform the ones without pre-training. All fine-tuned models (except the music separation one) obtain state-of-the-art results in their respective benchmarks.

ABScribe: Rapid Exploration of Multiple Writing Variations in Human-AI Co-Writing Tasks using Large Language Models

  • paper_url: http://arxiv.org/abs/2310.00117
  • repo_url: None
  • paper_authors: Mohi Reza, Nathan Laundry, Ilya Musabirov, Peter Dushniku, Zhi Yuan “Michael” Yu, Kashish Mittal, Tovi Grossman, Michael Liut, Anastasia Kuzminykh, Joseph Jay Williams
  • for: improving the efficiency and experience of human-AI co-writing by supporting rapid exploration of multiple writing variations generated with large language models
  • methods: ABScribe lets users swiftly produce multiple variations using LLM prompts that are auto-converted into reusable buttons; variations are stored adjacently within text segments for rapid in-place comparison via mouse-over interactions on a context toolbar
  • results: a user study with 12 writers shows that ABScribe significantly reduces task workload (d = 1.20, p < 0.001) and enhances user perceptions of the revision process (d = 2.41, p < 0.001) compared to a popular baseline workflow, and provides insights into how writers explore variations using LLMs
    Abstract Exploring alternative ideas by rewriting text is integral to the writing process. State-of-the-art large language models (LLMs) can simplify writing variation generation. However, current interfaces pose challenges for simultaneous consideration of multiple variations: creating new versions without overwriting text can be difficult, and pasting them sequentially can clutter documents, increasing workload and disrupting writers' flow. To tackle this, we present ABScribe, an interface that supports rapid, yet visually structured, exploration of writing variations in human-AI co-writing tasks. With ABScribe, users can swiftly produce multiple variations using LLM prompts, which are auto-converted into reusable buttons. Variations are stored adjacently within text segments for rapid in-place comparisons using mouse-over interactions on a context toolbar. Our user study with 12 writers shows that ABScribe significantly reduces task workload (d = 1.20, p < 0.001), enhances user perceptions of the revision process (d = 2.41, p < 0.001) compared to a popular baseline workflow, and provides insights into how writers explore variations using LLMs.

Certified Robustness via Dynamic Margin Maximization and Improved Lipschitz Regularization

  • paper_url: http://arxiv.org/abs/2310.00116
  • repo_url: None
  • paper_authors: Mahyar Fazlyab, Taha Entesari, Aniket Roy, Rama Chellappa
  • for: improving the certified robustness of deep classifiers against adversarial perturbations
  • methods: existing approaches rely on Lipschitz-capped architectures or modified training procedures (e.g., min-max optimization, constrained learning, or regularization), which may not directly increase the margin in the input space; this work instead proposes a robust training algorithm that increases the margin in the output (logit) space while regularizing the model's Lipschitz constant along vulnerable directions, and shows that these two objectives directly promote larger input-space margins
  • results: a scalable method for computing guaranteed, differentiable upper bounds on the Lipschitz constant of neural networks accurately and efficiently; the bounds can also be used to design new layers with controllable Lipschitz constants. Experiments on MNIST, CIFAR-10, and Tiny-ImageNet show results competitive with the state of the art.
    Abstract To improve the robustness of deep classifiers against adversarial perturbations, many approaches have been proposed, such as designing new architectures with better robustness properties (e.g., Lipschitz-capped networks), or modifying the training process itself (e.g., min-max optimization, constrained learning, or regularization). These approaches, however, might not be effective at increasing the margin in the input (feature) space. As a result, there has been an increasing interest in developing training procedures that can directly manipulate the decision boundary in the input space. In this paper, we build upon recent developments in this category by developing a robust training algorithm whose objective is to increase the margin in the output (logit) space while regularizing the Lipschitz constant of the model along vulnerable directions. We show that these two objectives can directly promote larger margins in the input space. To this end, we develop a scalable method for calculating guaranteed differentiable upper bounds on the Lipschitz constant of neural networks accurately and efficiently. The relative accuracy of the bounds prevents excessive regularization and allows for more direct manipulation of the decision boundary. Furthermore, our Lipschitz bounding algorithm exploits the monotonicity and Lipschitz continuity of the activation layers, and the resulting bounds can be used to design new layers with controllable bounds on their Lipschitz constant. Experiments on the MNIST, CIFAR-10, and Tiny-ImageNet data sets verify that our proposed algorithm obtains competitively improved results compared to the state-of-the-art.
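    Example: for context, the classical (loose) Lipschitz upper bound is the product of layer-wise spectral norms; the paper's contribution is a tighter, differentiable bound, which is not reproduced here. A sketch of the classical bound via power iteration, with arbitrary layer sizes:
```python
import torch
import torch.nn.functional as F

def spectral_norm(weight: torch.Tensor, n_iter: int = 50) -> torch.Tensor:
    """Largest singular value of a weight matrix, estimated by power iteration."""
    w = weight.reshape(weight.shape[0], -1)
    v = torch.randn(w.shape[1])
    for _ in range(n_iter):
        u = F.normalize(w @ v, dim=0)
        v = F.normalize(w.T @ u, dim=0)
    return u @ w @ v

def naive_lipschitz_bound(model: torch.nn.Sequential) -> torch.Tensor:
    """Product of per-layer spectral norms; 1-Lipschitz activations such as ReLU contribute 1."""
    bound = torch.tensor(1.0)
    for layer in model:
        if isinstance(layer, torch.nn.Linear):
            bound = bound * spectral_norm(layer.weight)
    return bound

model = torch.nn.Sequential(torch.nn.Linear(784, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 10))
print(float(naive_lipschitz_bound(model)))
```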

HyperMask: Adaptive Hypernetwork-based Masks for Continual Learning

  • paper_url: http://arxiv.org/abs/2310.00113
  • repo_url: https://github.com/gmum/hypermask
  • paper_authors: Kamil Książek, Przemysław Spurek
  • for: addressing catastrophic forgetting in artificial neural networks that are trained sequentially on multiple tasks
  • methods: a hypernetwork generates, from the task identity, semi-binary masks that are applied to the weights of a single shared target network
  • results: the proposed HyperMask method trains one network for all tasks; the hypernetwork-produced masks carve out task-dedicated subnetworks, inheriting the hypernetwork's ability to adapt to new tasks with minimal forgetting while, following the lottery ticket hypothesis, reusing a single network with weighted subnetworks per task
    Abstract Artificial neural networks suffer from catastrophic forgetting when they are sequentially trained on multiple tasks. To overcome this problem, there exist many continual learning strategies. One of the most effective is the hypernetwork-based approach. The hypernetwork generates the weights of a target model based on the task's identity. The model's main limitation is that hypernetwork can produce completely different nests for each task. Consequently, each task is solved separately. The model does not use information from the network dedicated to previous tasks and practically produces new architectures when it learns the subsequent tasks. To solve such a problem, we use the lottery ticket hypothesis, which postulates the existence of sparse subnetworks, named winning tickets, that preserve the performance of a full network. In the paper, we propose a method called HyperMask, which trains a single network for all tasks. Hypernetwork produces semi-binary masks to obtain target subnetworks dedicated to new tasks. This solution inherits the ability of the hypernetwork to adapt to new tasks with minimal forgetting. Moreover, due to the lottery ticket hypothesis, we can use a single network with weighted subnets dedicated to each task.
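    Example: a minimal sketch of the mask-generating pattern - a task embedding drives a hypernetwork that outputs a semi-binary (sigmoid-gated) mask over a shared layer's weights. The layer sizes, the sigmoid gating, and the embedding dimension are illustrative assumptions rather than the paper's configuration.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperMaskSketch(nn.Module):
    """Task identity -> hypernetwork -> semi-binary mask applied to a shared linear layer."""
    def __init__(self, n_tasks, in_dim, out_dim, emb_dim=32):
        super().__init__()
        self.task_emb = nn.Embedding(n_tasks, emb_dim)
        self.hyper = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(),
                                   nn.Linear(128, out_dim * in_dim))
        self.shared = nn.Linear(in_dim, out_dim)          # weights shared by all tasks

    def forward(self, x, task_id):
        emb = self.task_emb(torch.tensor(task_id))
        mask = torch.sigmoid(self.hyper(emb)).view_as(self.shared.weight)   # values in (0, 1)
        return F.linear(x, self.shared.weight * mask, self.shared.bias)

model = HyperMaskSketch(n_tasks=5, in_dim=16, out_dim=8)
out = model(torch.randn(4, 16), task_id=2)                # task-specific forward pass
```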
    摘要 为解决这个问题,我们使用了彩票假设,假设存在一些稀疏的子网络,称为赢家票,它们保持了全网络的性能。在我们的论文中,我们提出了一种方法called HyperMask,它在所有任务上训练单个网络。hypernetwork 生成了半二进制的面纱,以获取新任务的专门的子网络。这种解决方案继承了 hypernetwork 对新任务的适应能力,同时因为彩票假设,我们可以使用单个网络,每个任务都有权重的子网络。

FashionFlow: Leveraging Diffusion Models for Dynamic Fashion Video Synthesis from Static Imagery

  • paper_url: http://arxiv.org/abs/2310.00106
  • repo_url: None
  • paper_authors: Tasin Islam, Alina Miron, XiaoHui Liu, Yongmin Li
  • for: building a diffusion-based image-to-video generator, FashionFlow, that synthesizes short fashion videos from a single still image
  • methods: developing and connecting the relevant components around a diffusion model, including pseudo-3D convolutional layers to generate videos efficiently, with VAE and CLIP encoders capturing key characteristics of the still image to condition the diffusion model
  • results: successful synthesis of fashion videos showing models posing from various angles, highlighting the fit and appearance of the garment, with promise for improving the online fashion shopping experience
    Abstract Our study introduces a new image-to-video generator called FashionFlow. By utilising a diffusion model, we are able to create short videos from still images. Our approach involves developing and connecting relevant components with the diffusion model, which sets our work apart. The components include the use of pseudo-3D convolutional layers to generate videos efficiently. VAE and CLIP encoders capture vital characteristics from still images to influence the diffusion model. Our research demonstrates a successful synthesis of fashion videos featuring models posing from various angles, showcasing the fit and appearance of the garment. Our findings hold great promise for improving and enhancing the shopping experience for the online fashion industry.
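    Example: pseudo-3D convolutions factor a video convolution into a 2D spatial convolution per frame followed by a 1D temporal convolution per pixel location, which is much cheaper than a full 3D convolution. A sketch of that pattern (channel counts and kernel sizes are illustrative, not the paper's):
```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """2D spatial conv over each frame, then 1D temporal conv over each pixel location."""
    def __init__(self, channels, kernel=3):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel, padding=kernel // 2)
        self.temporal = nn.Conv1d(channels, channels, kernel, padding=kernel // 2)

    def forward(self, x):                           # x: (batch, channels, time, height, width)
        b, c, t, h, w = x.shape
        x = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        x = x.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        x = self.temporal(x)
        return x.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)

video = torch.randn(1, 8, 4, 32, 32)                # 4 frames of 32x32 with 8 channels
print(Pseudo3DConv(8)(video).shape)                 # torch.Size([1, 8, 4, 32, 32])
```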

Multilingual Natural Language Processing Model for Radiology Reports – The Summary is all you need!

  • paper_url: http://arxiv.org/abs/2310.00100
  • repo_url: None
  • paper_authors: Mariana Lindo, Ana Sofia Santos, André Ferreira, Jianning Li, Gijs Luijten, Gustavo Correia, Moon Kim, Jens Kleesiek, Jan Egger, Victor Alves
  • for: automatically generating the impression section of radiology reports in multiple languages, enabling future research and deep learning models that incorporate data from patients with different ethnic backgrounds
  • methods: fine-tuning a publicly available multilingual text-to-text Transformer to summarize the findings of English, Portuguese, and German radiology reports
  • results: in a blind test, two board-certified radiologists judged that for at least 70% of the system-generated summaries the quality matched or exceeded the corresponding human-written summaries, suggesting substantial clinical reliability; the multilingual model also outperformed models specialized in summarizing radiology reports in only one language, as well as models not designed for radiology report summarization, such as ChatGPT
    Abstract The impression section of a radiology report summarizes important radiology findings and plays a critical role in communicating these findings to physicians. However, the preparation of these summaries is time-consuming and error-prone for radiologists. Recently, numerous models for radiology report summarization have been developed. Nevertheless, there is currently no model that can summarize these reports in multiple languages. Such a model could greatly improve future research and the development of Deep Learning models that incorporate data from patients with different ethnic backgrounds. In this study, the generation of radiology impressions in different languages was automated by fine-tuning a model, publicly available, based on a multilingual text-to-text Transformer to summarize findings available in English, Portuguese, and German radiology reports. In a blind test, two board-certified radiologists indicated that for at least 70% of the system-generated summaries, the quality matched or exceeded the corresponding human-written summaries, suggesting substantial clinical reliability. Furthermore, this study showed that the multilingual model outperformed other models that specialized in summarizing radiology reports in only one language, as well as models that were not specifically designed for summarizing radiology reports, such as ChatGPT.

Voice2Action: Language Models as Agent for Efficient Real-Time Interaction in Virtual Reality

  • paper_url: http://arxiv.org/abs/2310.00092
  • repo_url: https://github.com/yang-su2000/vr-multimodal-interaction
  • paper_authors: Yang Su
  • for: improving the efficiency and accuracy of LLM-driven autonomous agents that interact in real time within virtual reality environments
  • methods: the Voice2Action framework hierarchically analyzes customized voice signals and textual commands through action and entity extraction, and divides execution tasks into canonical interaction subsets in real time, with error prevention from environment feedback
  • results: in an urban engineering VR environment with synthetic instruction data, Voice2Action performs more efficiently and accurately than approaches without these optimizations
    Abstract Large Language Models (LLMs) are trained and aligned to follow natural language instructions with only a handful of examples, and they are prompted as task-driven autonomous agents to adapt to various sources of execution environments. However, deploying agent LLMs in virtual reality (VR) has been challenging due to the lack of efficiency in online interactions and the complex manipulation categories in 3D environments. In this work, we propose Voice2Action, a framework that hierarchically analyzes customized voice signals and textual commands through action and entity extraction and divides the execution tasks into canonical interaction subsets in real-time with error prevention from environment feedback. Experiment results in an urban engineering VR environment with synthetic instruction data show that Voice2Action can perform more efficiently and accurately than approaches without optimizations.

SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation

  • paper_url: http://arxiv.org/abs/2310.00074
  • repo_url: https://github.com/hornhehhf/socreval
  • paper_authors: Hangfeng He, Hongming Zhang, Dan Roth
  • for: assessing the step-by-step reasoning capabilities of current models in a scalable manner
  • methods: using GPT-4 with tailored Socratic-method prompts (SocREval) to automatically evaluate reasoning chain quality, without human-crafted reference chains or fine-tuning
  • results: on four human-annotated datasets, SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics
    Abstract To comprehensively assess the capacity of current models for complex reasoning, it is crucial to assess their step-by-step reasoning in a scalable manner. Established reference-based evaluation metrics rely on human-annotated reasoning chains to assess the model-derived chains. However, such ``gold-standard'' human-written reasoning chains may not be unique and their acquisition is often labor-intensive. Existing reference-free reasoning metrics eliminate the need for human-crafted reasoning chains as references, but they typically require fine-tuning on datasets with human-derived reasoning chains, which complicates the process and raises concerns regarding generalizability across diverse datasets. To address these challenges, we harness GPT-4 to automatically evaluate reasoning chain quality, obviating the need for human-crafted references. Leveraging the Socratic method, we devise tailored prompts to enhance reference-free reasoning evaluation, which we term SocREval (Socratic method for Reasoning Evaluation). Empirical results from four human annotated datasets reveal that SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics. Beyond its demonstrated efficacy, our proposed framework, large language models (LLMs) with the Socratic method, proves to be both cost-efficient and robust to prompt writing and example selection, as substantiated by our in-depth analysis.

Emotional Listener Portrait: Neural Listener Head Generation with Emotion

  • paper_url: http://arxiv.org/abs/2310.00068
  • repo_url: None
  • paper_authors: Luchuan Song, Guojun Yin, Zhenchao Jin, Xiaoyi Dong, Chenliang Xu
  • for: generating a listener's non-verbal behaviors (e.g., a smile) in response to a speaker, making human-agent conversation more natural and diverse
  • methods: the Emotional Listener Portrait (ELP) model treats each fine-grained facial motion as a composition of several discrete motion codewords and explicitly models the probability distribution of the motions under different emotions in conversation
  • results: ELP generates natural and diverse responses by sampling from the learned distribution, can produce controllable responses with a predetermined attitude, and shows significant improvements over previous methods on several quantitative metrics
    Abstract Listener head generation centers on generating non-verbal behaviors (e.g., smile) of a listener in reference to the information delivered by a speaker. A significant challenge when generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation, which varies depending on the emotions and attitudes of both the speaker and the listener. To tackle this problem, we propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords and explicitly models the probability distribution of the motions under different emotion in conversation. Benefiting from the ``explicit'' and ``discrete'' design, our ELP model can not only automatically generate natural and diverse responses toward a given speaker via sampling from the learned distribution but also generate controllable responses with a predetermined attitude. Under several quantitative metrics, our ELP exhibits significant improvements compared to previous methods.

AI ensemble for signal detection of higher order gravitational wave modes of quasi-circular, spinning, non-precessing binary black hole mergers

  • paper_url: http://arxiv.org/abs/2310.00052
  • repo_url: None
  • paper_authors: Minyang Tian, E. A. Huerta, Huihuo Zheng
  • for: developing an AI-based search for signals containing higher-order gravitational wave modes from quasi-circular, spinning, non-precessing binary black hole mergers
  • methods: 2.4 million IMRPhenomXPHM waveforms - covering component masses $m_{\{1,2\}}\in[3M_\odot, 50 M_\odot]$, individual spins $s^z_{\{1,2\}}\in[-0.9, 0.9]$, the $(\ell, |m|) = \{(2, 2), (2, 1), (3, 3), (3, 2), (4, 4)\}$ modes, and mode-mixing effects - were used to train three AI classifiers with distributed training over 96 NVIDIA V100 GPUs on the Summit supercomputer; transfer learning then produced AI predictors that estimate the total mass of candidate binary black holes
  • results: the ensemble of 3 classifiers and 2 predictors processed a year-long test set containing 300,000 injected signals in 5.19 minutes on the Polaris and ThetaKNL supercomputers, provides state-of-the-art signal detection accuracy with only 2 misclassifications per year of searched data, and is the first AI ensemble designed to search for and find higher-order gravitational wave mode signals
    Abstract We introduce spatiotemporal-graph models that concurrently process data from the twin advanced LIGO detectors and the advanced Virgo detector. We trained these AI classifiers with 2.4 million \texttt{IMRPhenomXPHM} waveforms that describe quasi-circular, spinning, non-precessing binary black hole mergers with component masses $m_{\{1,2\}\in[3M_\odot, 50 M_\odot]$, and individual spins $s^z_{\{1,2\}\in[-0.9, 0.9]$; and which include the $(\ell, |m|) = \{(2, 2), (2, 1), (3, 3), (3, 2), (4, 4)\}$ modes, and mode mixing effects in the $\ell = 3, |m| = 2$ harmonics. We trained these AI classifiers within 22 hours using distributed training over 96 NVIDIA V100 GPUs in the Summit supercomputer. We then used transfer learning to create AI predictors that estimate the total mass of potential binary black holes identified by all AI classifiers in the ensemble. We used this ensemble, 3 AI classifiers and 2 predictors, to process a year-long test set in which we injected 300,000 signals. This year-long test set was processed within 5.19 minutes using 1024 NVIDIA A100 GPUs in the Polaris supercomputer (for AI inference) and 128 CPU nodes in the ThetaKNL supercomputer (for post-processing of noise triggers), housed at the Argonne Leadership Supercomputing Facility. These studies indicate that our AI ensemble provides state-of-the-art signal detection accuracy, and reports 2 misclassifications for every year of searched data. This is the first AI ensemble designed to search for and find higher order gravitational wave mode signals.

Efficient Streaming Language Models with Attention Sinks

  • paper_url: http://arxiv.org/abs/2309.17453
  • repo_url: https://github.com/mit-han-lab/streaming-llm
  • paper_authors: Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis
  • for: deploying large language models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected
  • methods: the StreamingLLM framework enables LLMs trained with a finite-length attention window to generalize to infinite sequence lengths without any fine-tuning, by keeping the key-value states of the initial "attention sink" tokens alongside a rolling window of recent tokens
  • results: StreamingLLM enables several state-of-the-art LLMs to perform stable and efficient language modeling with up to 4 million tokens and more, and outperforms the sliding-window recomputation baseline by up to 22.2x speedup in streaming settings
    Abstract Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach -- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a ``sink'' even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. Code and datasets are provided at https://github.com/mit-han-lab/streaming-llm.
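    Example: the cache policy behind StreamingLLM is simple to state - always keep the key-value states of the first few tokens (the attention sinks) plus a rolling window of the most recent tokens. A toy sketch of that eviction rule (the sink count and window size are illustrative; the real implementation stores tensors and re-indexes positions):
```python
def evict_kv_cache(cache, n_sink=4, window=1020):
    """Keep the first `n_sink` entries (attention sinks) plus the most recent `window` entries."""
    if len(cache) <= n_sink + window:
        return cache
    return cache[:n_sink] + cache[-window:]

# Toy demonstration with token positions standing in for (key, value) tensors.
cache = list(range(2048))
cache = evict_kv_cache(cache)
print(cache[:6], "...", cache[-3:])    # [0, 1, 2, 3, 1028, 1029] ... [2045, 2046, 2047]
```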

ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

  • paper_url: http://arxiv.org/abs/2309.17452
  • repo_url: https://github.com/microsoft/tora
  • paper_authors: Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, Weizhu Chen
  • for: improving large language models' performance on mathematical problems by seamlessly integrating natural language reasoning with external tools (e.g., computation libraries and symbolic solvers), combining the analytical strengths of language with the computational efficiency of tools
  • methods: the ToRA (Tool-integrated Reasoning Agent) models are trained by curating interactive tool-use trajectories on mathematical datasets, applying imitation learning on the annotations, and using output-space shaping to further refine the models' reasoning behavior
  • results: ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets across all scales, with 13%-19% absolute improvements on average; ToRA-7B reaches 44.6% on the competition-level MATH dataset, surpassing the best open-source model WizardMath-70B by 22% absolute, and ToRA-Code-34B is the first open-source model to exceed 50% accuracy on MATH, outperforming GPT-4's CoT result and remaining competitive with GPT-4 solving problems with programs
    Abstract Large language models have made significant progress in various language tasks, yet they still struggle with complex mathematics. In this paper, we propose ToRA a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical problems by seamlessly integrating natural language reasoning with the utilization of external tools (e.g., computation libraries and symbolic solvers), thereby amalgamating the analytical prowess of language and the computational efficiency of tools. To train ToRA, we curate interactive tool-use trajectories on mathematical datasets, apply imitation learning on the annotations, and propose output space shaping to further refine models' reasoning behavior. As a result, ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets across all scales with 13%-19% absolute improvements on average. Notably, ToRA-7B reaches 44.6% on the competition-level dataset MATH, surpassing the best open-source model WizardMath-70B by 22% absolute. ToRA-Code-34B is also the first open-source model that achieves an accuracy exceeding 50% on MATH, which significantly outperforms GPT-4's CoT result, and is competitive with GPT-4 solving problems with programs. Additionally, we conduct a comprehensive analysis of the benefits and remaining challenges of tool interaction for mathematical reasoning, providing valuable insights for future research.
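    Example: a toy tool-integrated reasoning loop - the model alternates between natural-language rationales and program calls whose execution output is fed back into the context. The scripted fake_llm and the code/output tag format are illustrative stand-ins; the real agent queries a trained LLM.
```python
import contextlib
import io
import re

def fake_llm(history: str) -> str:
    """Hypothetical stand-in for the language model: emit a program, then a final answer."""
    if "output:" not in history:
        return ("Compute the sum of squares up to 10.\n"
                "<code>print(sum(i * i for i in range(1, 11)))</code>")
    return "The answer is 385."

def tool_integrated_reasoning(question: str, max_turns: int = 4) -> str:
    history = question
    for _ in range(max_turns):
        step = fake_llm(history)
        history += "\n" + step
        match = re.search(r"<code>(.*?)</code>", step, re.S)
        if not match:                              # no tool call, treat the step as the answer
            return step
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):      # execute the emitted program
            exec(match.group(1), {})
        history += "\noutput: " + buf.getvalue()   # feed the tool output back to the model
    return history

print(tool_integrated_reasoning("What is 1^2 + 2^2 + ... + 10^2?"))
```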

LLM-grounded Video Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.17444
  • repo_url: https://github.com/TonyLianLong/LLM-groundedVideoDiffusion
  • paper_authors: Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li
  • for: 这篇论文旨在提高文本生成视频的质量,尤其是处理复杂的时空提示。
  • methods: 这个论文使用了语言模型(LLM)生成动态场景布局,并将布局用于导引扩散模型进行视频生成。
  • results: 研究发现,LLM 仅凭文本就能理解复杂的时空动态,并生成与现实世界中物体运动模式高度一致的布局。该方法可与任何支持分类器引导的视频扩散模型结合使用,并显著优于其基础模型和多个强基线方法。
    Abstract Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion (e.g., even lacking the ability to be prompted for objects moving from left to right). To address these limitations, we introduce LLM-grounded Video Diffusion (LVD). Instead of directly generating videos from the text inputs, LVD first leverages a large language model (LLM) to generate dynamic scene layouts based on the text inputs and subsequently uses the generated layouts to guide a diffusion model for video generation. We show that LLMs are able to understand complex spatiotemporal dynamics from text alone and generate layouts that align closely with both the prompts and the object motion patterns typically observed in the real world. We then propose to guide video diffusion models with these layouts by adjusting the attention maps. Our approach is training-free and can be integrated into any video diffusion model that admits classifier guidance. Our results demonstrate that LVD significantly outperforms its base video diffusion model and several strong baseline methods in faithfully generating videos with the desired attributes and motion patterns.
    摘要 文本条件扩散模型已成为神经视频生成的有力工具。然而,当前模型仍然难以处理复杂的时空提示,常常生成受限或错误的运动(例如甚至无法按提示生成从左向右移动的物体)。为了解决这些限制,我们提出了基于大语言模型的视频扩散(LVD)。LVD 不直接从文本输入生成视频,而是先利用大语言模型(LLM)根据文本输入生成动态场景布局,再用生成的布局来引导扩散模型进行视频生成。我们发现 LLM 仅凭文本就能理解复杂的时空动态,并生成与提示以及现实世界中常见物体运动模式高度一致的布局。随后,我们提出通过调整注意力图,用这些布局来引导视频扩散模型。我们的方法无需训练,可以集成到任何支持分类器引导的视频扩散模型中。实验结果表明,在忠实生成具有目标属性与运动模式的视频方面,LVD 显著优于其基础视频扩散模型以及多个强基线方法。
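As a rough illustration of the intermediate representation, the sketch below turns an LLM-proposed dynamic scene layout (one bounding box per frame) into per-frame spatial masks; in LVD these layouts are then used to adjust the cross-attention maps of the video diffusion model, a step that is only described, not implemented, here. The toy layout and resolution are assumptions.

```python
import numpy as np

def layout_to_masks(boxes, num_frames=16, h=64, w=64):
    """Turn a dynamic scene layout -- one (x0, y0, x1, y1) box per frame in
    [0, 1] coordinates -- into per-frame binary masks that could bias the
    cross-attention of the object's text token toward the boxed region."""
    masks = np.zeros((num_frames, h, w), dtype=np.float32)
    for t, (x0, y0, x1, y1) in enumerate(boxes[:num_frames]):
        masks[t, int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return masks

# toy layout for the prompt "a ball moving from left to right"
layout = [(0.05 + 0.7 * t / 15, 0.4, 0.25 + 0.7 * t / 15, 0.6) for t in range(16)]
masks = layout_to_masks(layout)
print(masks.shape, masks[0].sum(), masks[15].sum())
```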

Learning Decentralized Flocking Controllers with Spatio-Temporal Graph Neural Network

  • paper_url: http://arxiv.org/abs/2309.17437
  • repo_url: None
  • paper_authors: Siji Chen, Yanshen Sun, Peihan Li, Lifeng Zhou, Chang-Tien Lu
  • for: 为多机器人(如 Crazyflie 无人机)群体实现分布式的凝聚集群飞行控制
  • methods: 使用时空图神经网络(STGNN):空间扩展收集远距离邻居的延迟状态,时间扩展利用近邻的历史状态
  • results: 实现了模仿中央专家控制策略的分布式控制,并在不同场景下完成了凝聚集群飞行、领航跟随和避障等任务。
    Abstract Recently a line of researches has delved the use of graph neural networks (GNNs) for decentralized control in swarm robotics. However, it has been observed that relying solely on the states of immediate neighbors is insufficient to imitate a centralized control policy. To address this limitation, prior studies proposed incorporating $L$-hop delayed states into the computation. While this approach shows promise, it can lead to a lack of consensus among distant flock members and the formation of small clusters, consequently resulting in the failure of cohesive flocking behaviors. Instead, our approach leverages spatiotemporal GNN, named STGNN that encompasses both spatial and temporal expansions. The spatial expansion collects delayed states from distant neighbors, while the temporal expansion incorporates previous states from immediate neighbors. The broader and more comprehensive information gathered from both expansions results in more effective and accurate predictions. We develop an expert algorithm for controlling a swarm of robots and employ imitation learning to train our decentralized STGNN model based on the expert algorithm. We simulate the proposed STGNN approach in various settings, demonstrating its decentralized capacity to emulate the global expert algorithm. Further, we implemented our approach to achieve cohesive flocking, leader following and obstacle avoidance by a group of Crazyflie drones. The performance of STGNN underscores its potential as an effective and reliable approach for achieving cohesive flocking, leader following and obstacle avoidance tasks.
    摘要 近期一系列研究探索了使用图神经网络(GNN)实现群体机器人中的分布式控制。然而,已有观察表明,仅依赖近邻的当前状态不足以模仿中央控制策略。为了解决这一限制,先前的研究提出将 $L$ 跳延迟状态纳入计算。这种方法虽有前景,但可能导致相距较远的群体成员之间缺乏共识并形成小簇,最终导致凝聚集群行为失败。与之不同,我们的方法利用名为 STGNN 的时空图神经网络,它同时包含空间扩展与时间扩展:空间扩展收集远距离邻居的延迟状态,时间扩展则纳入近邻的历史状态。来自两种扩展的更广泛、更全面的信息带来了更有效、更准确的预测。我们开发了一个控制机器人群体的专家算法,并基于该专家算法用模仿学习训练分布式的 STGNN 模型。我们在多种场景下对所提出的 STGNN 方法进行了仿真,展示了其以分布式方式模拟全局专家算法的能力。此外,我们在一组 Crazyflie 无人机上实现了该方法,完成了凝聚集群飞行、领航跟随与避障任务。STGNN 的表现突显了它作为实现这些任务的有效且可靠方法的潜力。

DREAM: Decentralized Reinforcement Learning for Exploration and Efficient Energy Management in Multi-Robot Systems

  • paper_url: http://arxiv.org/abs/2309.17433
  • repo_url: None
  • paper_authors: Dipam Patel, Phu Pham, Kshitij Tiwari, Aniket Bera
  • for: 这篇论文目的是提出一种可以有效地管理资源的多机器人系统,以提高其性能和可靠性。
  • methods: 这篇论文使用强化学习进行环境探索与避障,并利用图神经网络结合能耗模型优化目标分配,以确保在资源受限的情况下完成任务。
  • results: 研究人员在多种仿真环境中进行了测试,发现该方法相比基线方法可将资源受限机器人系统的性能提升约 25%。
    Abstract Resource-constrained robots often suffer from energy inefficiencies, underutilized computational abilities due to inadequate task allocation, and a lack of robustness in dynamic environments, all of which strongly affect their performance. This paper introduces DREAM - Decentralized Reinforcement Learning for Exploration and Efficient Energy Management in Multi-Robot Systems, a comprehensive framework that optimizes the allocation of resources for efficient exploration. It advances beyond conventional heuristic-based task planning as observed conventionally. The framework incorporates Operational Range Estimation using Reinforcement Learning to perform exploration and obstacle avoidance in unfamiliar terrains. DREAM further introduces an Energy Consumption Model for goal allocation, thereby ensuring mission completion under constrained resources using a Graph Neural Network. This approach also ensures that the entire Multi-Robot System can survive for an extended period of time for further missions compared to the conventional approach of randomly allocating goals, which compromises one or more agents. Our approach adapts to prioritizing agents in real-time, showcasing remarkable resilience against dynamic environments. This robust solution was evaluated in various simulated environments, demonstrating adaptability and applicability across diverse scenarios. We observed a substantial improvement of about 25% over the baseline method, leading the way for future research in resource-constrained robotics.
    摘要 资源受限的机器人常常面临能源利用低效、任务分配不当导致计算能力未被充分利用,以及在动态环境中缺乏鲁棒性等问题,这些都严重影响其性能。本文提出了 DREAM——面向多机器人系统探索与高效能源管理的分布式强化学习框架,它通过优化资源分配来实现高效探索,超越了传统的基于启发式的任务规划。该框架利用强化学习进行作业范围估计,以在陌生地形中完成探索与避障;DREAM 进一步引入用于目标分配的能耗模型,借助图神经网络确保在资源受限的情况下完成任务。与随机分配目标(会牺牲一个或多个智能体)的传统做法相比,该方法还能让整个多机器人系统存活更长时间以执行后续任务。我们的方法能够实时调整智能体的优先级,展现出对动态环境的出色韧性。我们在多种仿真环境中评估了这一方案,验证了其在不同场景下的适应性与适用性,并观察到相比基线方法约 25% 的显著提升,为资源受限机器人领域的后续研究开辟了道路。

CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets

  • paper_url: http://arxiv.org/abs/2309.17428
  • repo_url: https://github.com/lifan-yuan/craft
  • paper_authors: Lifan Yuan, Yangyi Chen, Xingyao Wang, Yi R. Fung, Hao Peng, Heng Ji
  • for: 这篇论文是为了提高大语言模型(LLM)的功能和可用性而设计的。
  • methods: 该论文使用的方法包括:(1)通过生成代码段和通过任务特定的应用程序编程接口(API)执行它们,以将某些功能委托给专门的外部模块。(2)为每个任务收集特定的代码解决方案,并将其抽象为可重用的代码段。(3)在推理时,通过语言模型获取代码段并执行它们或生成输出。
  • results: 该论文的实验结果表明,使用该方法可以在视觉语言任务、表格处理任务和数学推理任务中实现显著的提升,并且可以在不需要训练的情况下适应新的领域和模式。此外,该论文还进行了深入的分析,并证明了:(1)随着工具集大小和后向模型的能力增加,可以实现一致的性能提升;(2)该方法的每个组件都对性能增加做出了贡献;(3)创建的工具是可靠且具有低复杂性和原子性。
    Abstract Large language models (LLMs) are often augmented with tools to solve complex tasks. By generating code snippets and executing them through task-specific Application Programming Interfaces (APIs), they can offload certain functions to dedicated external modules, such as image encoding and performing calculations. However, most existing approaches to augment LLMs with tools are constrained by general-purpose APIs and lack the flexibility for tailoring them to specific tasks. In this work, we present CRAFT, a general tool creation and retrieval framework for LLMs. It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks. For each task, we collect specific code solutions by prompting GPT-4 to solve the training examples. Following a validation step ensuring the correctness, these solutions are abstracted into code snippets to enhance reusability, and deduplicated for higher quality. At inference time, the language model retrieves snippets from the toolsets and then executes them or generates the output conditioning on the retrieved snippets. Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning. Experiments on vision-language, tabular processing, and mathematical reasoning tasks show that our approach achieves substantial improvements compared to strong baselines. In addition, our in-depth analysis reveals that: (1) consistent performance improvement can be achieved by scaling up the number of tools and the capability of the backbone models; (2) each component of our approach contributes to the performance gains; (3) the created tools are well-structured and reliable with low complexity and atomicity. The code is available at \url{https://github.com/lifan-yuan/CRAFT}.
    摘要 大型语言模型(LLM)通常会借助工具来解决复杂任务:通过生成代码片段并经由任务特定的应用程序编程接口(API)执行,它们可以把某些功能(如图像编码和计算)交给专门的外部模块。然而,现有为 LLM 配备工具的方法大多受限于通用 API,缺乏针对具体任务定制的灵活性。在这项工作中,我们提出了 CRAFT,一个面向 LLM 的通用工具创建与检索框架。它为各类任务专门构建工具集,并为 LLM 配备一个从这些工具集中检索工具的组件,以增强其解决复杂任务的能力。对于每个任务,我们通过提示 GPT-4 求解训练样例来收集具体的代码解决方案;经过确保正确性的验证步骤后,这些解决方案被抽象为代码片段以提升可复用性,并进行去重以保证质量。在推理阶段,语言模型先从工具集中检索代码片段,然后执行它们,或以检索到的片段为条件生成输出。我们的方法设计灵活,提供了一种即插即用的方式,使现成的 LLM 无需任何微调即可适应未见过的领域和模态。在视觉语言、表格处理和数学推理任务上的实验表明,我们的方法相比强基线取得了显著提升。此外,深入分析表明:(1)扩大工具数量和增强骨干模型的能力都能带来一致的性能提升;(2)方法的每个组成部分都对性能增益有所贡献;(3)所创建的工具结构良好、可靠,具有较低的复杂度和良好的原子性。代码见 https://github.com/lifan-yuan/CRAFT。
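A toy version of the create-and-retrieve idea: tools abstracted from validated solutions are stored with a short description, and at inference time the k most similar ones are retrieved and prepended to the prompt. The hash-based `embed` function and the three snippets are placeholders for a real text encoder and a real toolset.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real text encoder; a hashed bag-of-words keeps the
    sketch runnable without any model download."""
    v = np.zeros(256)
    for tok in text.lower().split():
        v[hash(tok) % 256] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

# a tiny "toolset": validated, deduplicated snippets abstracted from solutions
TOOLSET = {
    "def count_objects(image): ...": "count how many objects appear in an image",
    "def table_lookup(df, col, key): ...": "look up a value in a table column",
    "def solve_quadratic(a, b, c): ...": "solve a quadratic equation",
}
TOOL_EMB = {code: embed(desc) for code, desc in TOOLSET.items()}

def retrieve_tools(query: str, k: int = 2):
    """Return the k snippets whose descriptions best match the query."""
    q = embed(query)
    scored = sorted(TOOL_EMB.items(), key=lambda kv: -float(q @ kv[1]))
    return [code for code, _ in scored[:k]]

# at inference time the retrieved snippets would be prepended to the LLM prompt
print(retrieve_tools("how many dogs are in the picture?"))
```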

Classification of Potholes Based on Surface Area Using Pre-Trained Models of Convolutional Neural Network

  • paper_url: http://arxiv.org/abs/2309.17426
  • repo_url: None
  • paper_authors: Chauhdary Fazeel Ahmad, Abdullah Cheema, Waqas Qayyum, Rana Ehtisham, Muhammad Haroon Yousaf, Junaid Mir, Nasim Shakouri Mahmoudabadi, Afaq Ahmad
  • for: 本研究旨在比较三种预训练的卷积神经网络模型(ResNet 50、ResNet 18、MobileNet)在识别南亚国家道路上的损害性痕迹(坑洞)方面的表现。
  • methods: 本研究使用了三种预训练的卷积神经网络模型,对道路图像进行分类,以判断图像中是否存在坑洞,并将图像分为三类:小坑洞、大坑洞和正常道路。
  • results: 研究发现,MobileNet v2 检测坑洞的准确率为 98%;在三分类任务中,从 2 英尺高度拍摄的图像对大坑洞、小坑洞和正常路面的分类准确率分别为 87.33%、88.67% 和 92%,而从腰部高度(3.5 英尺,FFW)拍摄的图像对应准确率分别为 98.67%、98.67% 和 100%。
    Abstract Potholes are fatal and can cause severe damage to vehicles as well as can cause deadly accidents. In South Asian countries, pavement distresses are the primary cause due to poor subgrade conditions, lack of subsurface drainage, and excessive rainfalls. The present research compares the performance of three pre-trained Convolutional Neural Network (CNN) models, i.e., ResNet 50, ResNet 18, and MobileNet. At first, pavement images are classified to find whether images contain potholes, i.e., Potholes or Normal. Secondly, pavements images are classi-fied into three categories, i.e., Small Pothole, Large Pothole, and Normal. Pavement images are taken from 3.5 feet (waist height) and 2 feet. MobileNet v2 has an accuracy of 98% for detecting a pothole. The classification of images taken at the height of 2 feet has an accuracy value of 87.33%, 88.67%, and 92% for classifying the large, small, and normal pavement, respectively. Similarly, the classification of the images taken from full of waist (FFW) height has an accuracy value of 98.67%, 98.67%, and 100%.
    摘要 坑洞危害严重,既会对车辆造成严重损害,也可能导致致命事故。在南亚国家,路基状况差、缺乏地下排水以及降雨过多是路面损坏的主要原因。本研究比较了三种预训练卷积神经网络(CNN)模型(ResNet 50、ResNet 18 和 MobileNet)的表现。首先,对路面图像进行二分类,判断图像中是否存在坑洞(坑洞或正常);其次,将路面图像分为三类:小坑洞、大坑洞和正常路面。路面图像分别在 3.5 英尺(腰部高度)和 2 英尺的高度拍摄。MobileNet v2 检测坑洞的准确率为 98%。从 2 英尺高度拍摄的图像,大坑洞、小坑洞和正常路面的分类准确率分别为 87.33%、88.67% 和 92%;从腰部高度(FFW)拍摄的图像,对应准确率分别为 98.67%、98.67% 和 100%。
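The classification setup can be reproduced with standard transfer learning; the sketch below (PyTorch/torchvision, illustrative hyperparameters only, not the authors' exact recipe) freezes a pre-trained MobileNetV2 backbone and replaces its head with a 3-way classifier for small pothole / large pothole / normal pavement.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights

NUM_CLASSES = 3  # small pothole, large pothole, normal pavement

model = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1)
for p in model.features.parameters():
    p.requires_grad = False                      # keep the pre-trained backbone frozen
model.classifier[1] = nn.Linear(model.last_channel, NUM_CLASSES)

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# one illustrative training step on a random batch of 224x224 road images
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, NUM_CLASSES, (8,))
logits = model(images)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
print(float(loss))
```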

Data Filtering Networks

  • paper_url: http://arxiv.org/abs/2309.17425
  • repo_url: https://github.com/jpr5/ngrep
  • paper_authors: Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, Vaishaal Shankar
  • for: 本研究旨在学习一种数据筛选网络(DFN),用于从大量未经过处理的数据集中选择高质量数据集。
  • methods: 我们利用新的数据筛选网络构建了新的图像文本数据集,并证明由此得到的数据集可用于训练最先进的模型。
  • results: 性能最佳的 DFN-5B 数据集使模型在多种任务上达到最佳表现,包括在 ImageNet 上实现 83.0% 的零样本迁移精度。此外,我们还发布了一个包含 20 亿示例的新数据集 DFN-2B,并证明仅用公开数据即可从零训练出高性能的数据筛选网络。
    Abstract Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best performing dataset DFN-5B enables us to train state-of-the-art models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 83.0% zero-shot transfer accuracy on ImageNet, out-performing models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. In order to facilitate further research in dataset design, we also release a new 2 billion example dataset DFN-2B and show that high performance data filtering networks can be trained from scratch using only publicly available data.
    摘要 大规模训练集已成为机器学习的基石,也是语言建模与多模态学习近期进展的基础。虽然预训练数据的整理往往仍是临时性的,但一种常见范式是:先从互联网收集海量候选数据,再通过各种启发式规则把候选池筛选成实际训练集。在这项工作中,我们研究如何为第二步——筛选大规模未整理数据集——学习一个数据筛选网络(DFN)。我们的关键发现是:网络的筛选质量与其在下游任务上的表现是两回事。例如,一个在 ImageNet 上表现出色的模型,其筛选出的训练集可能反而不如一个 ImageNet 准确率较低、但在少量高质量数据上训练的模型所筛出的训练集。基于这些洞见,我们构建了新的数据筛选网络,得到了最先进的图像文本数据集。具体来说,我们性能最好的 DFN-5B 数据集使我们能够在相应计算预算下训练最先进的模型:除了在多种任务上的其他改进外,在该数据集上训练的 ViT-H 在 ImageNet 上取得 83.0% 的零样本迁移精度,优于在 LAION-2B、DataComp-1B 或 OpenAI WIT 等其他数据集上训练的模型。为了促进数据集设计方面的进一步研究,我们还发布了一个包含 20 亿示例的新数据集 DFN-2B,并证明仅用公开可得的数据即可从零训练出高性能的数据筛选网络。
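The second, filtering stage of the data pipeline reduces to scoring each candidate pair with the filtering network and keeping the best fraction. The sketch below uses random embeddings and cosine similarity as stand-ins for a trained DFN; the keep fraction is an arbitrary choice for illustration.

```python
import numpy as np

def dfn_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between image and caption embeddings. In the paper the
    embeddings come from a trained data filtering network; random vectors stand in here."""
    return float(image_emb @ text_emb /
                 (np.linalg.norm(image_emb) * np.linalg.norm(text_emb) + 1e-8))

def filter_pool(pairs, keep_fraction: float = 0.2):
    """Keep the top-scoring fraction of an uncurated (image, text) pool."""
    scored = sorted(pairs, key=lambda p: -dfn_score(p["image_emb"], p["text_emb"]))
    return scored[: int(len(scored) * keep_fraction)]

rng = np.random.default_rng(0)
pool = [{"url": f"img_{i}", "image_emb": rng.normal(size=512),
         "text_emb": rng.normal(size=512)} for i in range(1000)]
curated = filter_pool(pool)
print(len(curated), "examples kept out of", len(pool))
```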

Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks

  • paper_url: http://arxiv.org/abs/2309.17410
  • repo_url: https://github.com/vaidehi99/infodeletionattacks
  • paper_authors: Vaidehi Patil, Peter Hase, Mohit Bansal
  • for: 本研究旨在 mitigating the safety and informational issues of pre-trained language models, such as memorized personal information and harmful output.
  • methods: 我们提出了一个攻击-防御框架,用于研究直接从模型权重中删除敏感信息。我们研究直接编辑模型权重的原因是,这种方法可以保证删除的信息从未被未来的提问攻击抽取出来,同时也可以保护白盒攻击。
  • results: 我们的实验表明,即使使用现有的模型编辑方法如ROME,也无法真正地从GPT-J模型中删除事实信息,我们的白盒和黑盒攻击可以从编辑后的模型中恢复“删除”的信息38%的时间。这些攻击利用了两个关键观察:(1)删除的信息可以在模型的中间隐藏状态中找到踪迹,(2)应用 editing 方法于一个问题后可能无法删除对重叠版本的问题中的信息。最后,我们提供了新的防御方法,但我们没有找到一个通用有效的防御方法。
    Abstract Pretrained language models sometimes possess knowledge that we do not wish them to, including memorized personal information and knowledge that could be used to harm people. They can also output toxic or harmful text. To mitigate these safety and informational issues, we propose an attack-and-defense framework for studying the task of deleting sensitive information directly from model weights. We study direct edits to model weights because (1) this approach should guarantee that particular deleted information is never extracted by future prompt attacks, and (2) it should protect against whitebox attacks, which is necessary for making claims about safety/privacy in a setting where publicly available model weights could be used to elicit sensitive information. Our threat model assumes that an attack succeeds if the answer to a sensitive question is located among a set of B generated candidates, based on scenarios where the information would be insecure if the answer is among B candidates. Experimentally, we show that even state-of-the-art model editing methods such as ROME struggle to truly delete factual information from models like GPT-J, as our whitebox and blackbox attacks can recover "deleted" information from an edited model 38% of the time. These attacks leverage two key observations: (1) that traces of deleted information can be found in intermediate model hidden states, and (2) that applying an editing method for one question may not delete information across rephrased versions of the question. Finally, we provide new defense methods that protect against some extraction attacks, but we do not find a single universally effective defense method. Our results suggest that truly deleting sensitive information is a tractable but difficult problem, since even relatively low attack success rates have potentially severe societal implications for real-world deployment of language models.
    摘要 有些预训练语言模型掌握着我们并不希望它们拥有的知识,包括被记忆的个人信息以及可能被用来伤害他人的知识,它们还可能输出有毒或有害的文本。为缓解这些安全与信息问题,我们提出了一个攻击-防御框架,用于研究直接从模型权重中删除敏感信息的任务。我们之所以研究直接编辑模型权重,是因为:(1)这种方法应能保证被删除的特定信息永远不会被未来的提示攻击提取出来;(2)它应能抵御白盒攻击,这对于在公开模型权重可能被用来诱出敏感信息的场景下做出安全/隐私声明是必要的。我们的威胁模型假设:若敏感问题的答案出现在 B 个生成候选之中,则攻击成功,这对应于答案只要落在 B 个候选中信息便不再安全的情形。实验表明,即便是 ROME 这类最先进的模型编辑方法,也难以真正地从 GPT-J 等模型中删除事实性信息:我们的白盒与黑盒攻击可以在 38% 的情况下从编辑后的模型中恢复"已删除"的信息。这些攻击利用了两个关键观察:(1)被删除信息的痕迹可以在模型的中间隐藏状态中找到;(2)针对某个问题应用编辑方法,未必能删除该问题改写版本中的相应信息。最后,我们提出了一些新的防御方法,可以抵御部分提取攻击,但我们没有找到一种普遍有效的防御方法。我们的结果表明,真正删除敏感信息是一个可行但困难的问题,因为即使攻击成功率相对较低,也可能对语言模型的真实部署产生严重的社会影响。

Adversarial Machine Learning in Latent Representations of Neural Networks

  • paper_url: http://arxiv.org/abs/2309.17401
  • repo_url: None
  • paper_authors: Milin Zhang, Mohammad Abdi, Francesco Restuccia
  • for: 本文研究分布式深度神经网络(DNN)在边缘计算场景中的对抗鲁棒性;分布式 DNN 虽能减轻移动设备的计算负担并降低端到端推理延迟,但其对对抗攻击的韧性此前尚未被系统研究。
  • methods: 该论文使用了信息理论的概念来研究分布式 DNN 对抗攻击的稳定性。作者们引入了两个新的度量来衡量损害和抗性。
  • results: 通过在 ImageNet-1K 数据集上对 6 种 DNN 架构、6 种分布式 DNN 方案和 10 种对抗攻击进行的大量实验分析,作者们发现:(一)在相同的信息失真水平下,潜在特征始终比输入表示更鲁棒;(二)对抗鲁棒性由特征维度与 DNN 的泛化能力共同决定。实验结果表明,与针对输入空间的攻击相比,压缩后的潜在表示最多可将攻击成功率降低 88%,平均降低 57%。
    Abstract Distributed deep neural networks (DNNs) have been shown to reduce the computational burden of mobile devices and decrease the end-to-end inference latency in edge computing scenarios. While distributed DNNs have been studied, to the best of our knowledge the resilience of distributed DNNs to adversarial action still remains an open problem. In this paper, we fill the existing research gap by rigorously analyzing the robustness of distributed DNNs against adversarial action. We cast this problem in the context of information theory and introduce two new measurements for distortion and robustness. Our theoretical findings indicate that (i) assuming the same level of information distortion, latent features are always more robust than input representations; (ii) the adversarial robustness is jointly determined by the feature dimension and the generalization capability of the DNN. To test our theoretical findings, we perform extensive experimental analysis by considering 6 different DNN architectures, 6 different approaches for distributed DNN and 10 different adversarial attacks to the ImageNet-1K dataset. Our experimental results support our theoretical findings by showing that the compressed latent representations can reduce the success rate of adversarial attacks by 88% in the best case and by 57% on the average compared to attacks to the input space.
    摘要 分布式深度神经网络(DNN)已被证明可以减轻移动设备的计算负担,并降低边缘计算场景中的端到端推理延迟。尽管分布式 DNN 已有相关研究,但据我们所知,其面对对抗攻击的韧性仍是一个悬而未决的问题。本文通过严谨地分析分布式 DNN 对对抗行为的鲁棒性来填补这一研究空白。我们将该问题置于信息论的框架下,并引入两个新的度量来刻画失真与鲁棒性。我们的理论结果表明:(一)在相同的信息失真水平下,潜在特征始终比输入表示更鲁棒;(二)对抗鲁棒性由特征维度与 DNN 的泛化能力共同决定。为检验理论结论,我们在 ImageNet-1K 数据集上针对 6 种 DNN 架构、6 种分布式 DNN 方案和 10 种对抗攻击进行了大量实验。实验结果支持了理论发现:与针对输入空间的攻击相比,压缩后的潜在表示最多可将对抗攻击成功率降低 88%,平均降低 57%。

LoRA ensembles for large language model fine-tuning

  • paper_url: http://arxiv.org/abs/2310.00035
  • repo_url: None
  • paper_authors: Xi Wang, Laurence Aitchison, Maja Rudolph
  • for: 改善微调后大语言模型(LLM)的不确定性量化与预测质量,缓解其常见的过度自信、校准不佳以及在测试数据或分布外样本上预测不可靠的问题。
  • methods: 使用 Low-Rank Adapters(LoRA) ensemble,LoRA 是一种具有几个参数的精简技术,可以构建大型 ensemble 而不导致计算开销增加。
  • results: LoRA ensemble 可以提高预测精度和不确定性评估,并且可以与现有的 regularization 技术结合使用。
    Abstract Finetuned LLMs often exhibit poor uncertainty quantification, manifesting as overconfidence, poor calibration, and unreliable prediction results on test data or out-of-distribution samples. One approach commonly used in vision for alleviating this issue is a deep ensemble, which constructs an ensemble by training the same model multiple times using different random initializations. However, there is a huge challenge to ensembling LLMs: the most effective LLMs are very, very large. Keeping a single LLM in memory is already challenging enough: keeping an ensemble of e.g. 5 LLMs in memory is impossible in many settings. To address these issues, we propose an ensemble approach using Low-Rank Adapters (LoRA), a parameter-efficient fine-tuning technique. Critically, these low-rank adapters represent a very small number of parameters, orders of magnitude less than the underlying pre-trained model. Thus, it is possible to construct large ensembles of LoRA adapters with almost the same computational overhead as using the original model. We find that LoRA ensembles, applied on its own or on top of pre-existing regularization techniques, gives consistent improvements in predictive accuracy and uncertainty quantification.
    摘要 微调后的大语言模型(LLM)往往表现出较差的不确定性量化,具体表现为过度自信、校准不佳,以及在测试数据或分布外样本上预测结果不可靠。在视觉领域,缓解这一问题的常用方法是深度集成,即用不同的随机初始化多次训练同一模型并加以组合。然而,对 LLM 做集成面临巨大挑战:最有效的 LLM 规模极其庞大,在内存中保留一个 LLM 已经很困难,在许多场景下同时保留例如 5 个 LLM 根本不可行。为解决这些问题,我们提出了一种基于低秩适配器(LoRA)的集成方法;LoRA 是一种参数高效的微调技术。关键在于,这些低秩适配器的参数量极少,比底层预训练模型少若干个数量级,因此可以在几乎不增加计算开销的情况下构建大规模的 LoRA 适配器集成。我们发现,无论是单独使用,还是叠加在已有的正则化技术之上,LoRA 集成都能稳定地提升预测精度和不确定性量化。
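The ensemble itself is cheap because only the adapters differ between members while the base weights are shared. Below is a minimal sketch with a single linear layer standing in for an LLM; a real setup would attach adapters to a pre-trained transformer (e.g. via the peft library) and average token-level predictive distributions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update A @ B (rank r)."""
    def __init__(self, base: nn.Linear, r: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(base.out_features, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, base.in_features))

    def forward(self, x):
        return self.base(x) + x @ (self.A @ self.B).T

base = nn.Linear(32, 5)                          # shared, frozen "pre-trained" model
members = [LoRALinear(base) for _ in range(5)]   # 5 adapters = 5 ensemble members

x = torch.randn(8, 32)
with torch.no_grad():
    # averaging the members' predictive distributions is what improves
    # calibration and uncertainty quantification
    probs = torch.stack([m(x).softmax(dim=-1) for m in members]).mean(dim=0)
print(probs.shape)
```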

Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency

  • paper_url: http://arxiv.org/abs/2309.17382
  • repo_url: https://github.com/agentification/RAFA_code
  • paper_authors: Zhihan Liu, Hao Hu, Shenao Zhang, Hongyi Guo, Shuqi Ke, Boyi Liu, Zhaoran Wang
  • for: The paper aims to improve the ability of large language models (LLMs) to complete tasks provably within a minimum number of interactions with the external environment.
  • methods: The proposed “reason for future, act for now” (\texttt{RAFA}) framework combines long-term reasoning and short-term acting to achieve provable regret guarantees. The framework includes a prompt template for reasoning, learning and planning in Bayesian adaptive Markov decision processes (MDPs), and an “in-context” actor-critic update.
  • results: The paper shows that the proposed framework achieves a $\sqrt{T}$ regret bound, outperforming various existing frameworks, and achieves nearly perfect scores on a few benchmarks.
    Abstract Large language models (LLMs) demonstrate impressive reasoning abilities, but translating reasoning into actions in the real world remains challenging. In particular, it remains unclear how to complete a given task provably within a minimum number of interactions with the external environment, e.g., through an internal mechanism of reasoning. To this end, we propose a principled framework with provable regret guarantees to orchestrate reasoning and acting, which we call "reason for future, act for now" (\texttt{RAFA}). Specifically, we design a prompt template for reasoning that learns from the memory buffer and plans a future trajectory over a long horizon ("reason for future"). At each step, the LLM agent takes the initial action of the planned trajectory ("act for now"), stores the collected feedback in the memory buffer, and reinvokes the reasoning routine to replan the future trajectory from the new state. The key idea is to cast reasoning in LLMs as learning and planning in Bayesian adaptive Markov decision processes (MDPs). Correspondingly, we prompt LLMs to form an updated posterior of the unknown environment from the memory buffer (learning) and generate an optimal trajectory for multiple future steps that maximizes a value function (planning). The learning and planning subroutines are performed in an "in-context" manner to emulate the actor-critic update for MDPs. Our theoretical analysis proves that the novel combination of long-term reasoning and short-term acting achieves a $\sqrt{T}$ regret. In particular, the regret bound highlights an intriguing interplay between the prior knowledge obtained through pretraining and the uncertainty reduction achieved by reasoning and acting. Our empirical validation shows that it outperforms various existing frameworks and achieves nearly perfect scores on a few benchmarks.
    摘要 大型语言模型(LLM)展现出令人印象深刻的推理能力,但将推理转化为现实世界中的行动仍然充满挑战。特别是,如何通过内部的推理机制,以可证明的方式用最少的环境交互次数完成给定任务仍不清楚。为此,我们提出了一个具有可证明遗憾(regret)保证的原则性框架来协调推理与行动,称之为"为未来推理,为当下行动"(RAFA)。具体而言,我们设计了一个用于推理的提示模板,它从记忆缓冲区中学习并规划一条跨越长时间范围的未来轨迹("为未来推理")。在每一步中,LLM 智能体执行所规划轨迹的第一个动作("为当下行动"),将收集到的反馈存入记忆缓冲区,然后从新状态出发重新调用推理例程、重新规划未来轨迹。其核心思想是将 LLM 中的推理视为贝叶斯自适应马尔可夫决策过程(MDP)中的学习与规划:我们提示 LLM 依据记忆缓冲区对未知环境形成更新后的后验(学习),并生成一条最大化价值函数的多步最优轨迹(规划)。学习与规划子程序以"上下文内"的方式执行,以模拟 MDP 的 actor-critic 更新。我们的理论分析证明,这种长期推理与短期行动的新颖结合可以达到 $\sqrt{T}$ 的遗憾界;该遗憾界尤其揭示了预训练获得的先验知识与通过推理和行动实现的不确定性消减之间有趣的相互作用。实验验证表明,它优于多种现有框架,并在若干基准上取得了接近满分的成绩。

Revolutionizing Mobile Interaction: Enabling a 3 Billion Parameter GPT LLM on Mobile

  • paper_url: http://arxiv.org/abs/2310.01434
  • repo_url: None
  • paper_authors: Samuel Carreira, Tomás Marques, José Ribeiro, Carlos Grilo
  • for: 这篇论文旨在提出一种新的大语言模型(LLM)执行方法,使得大型LLM可以在移动设备上直接执行,不需要网络连接。
  • methods: 该方法包括精度调整和模型压缩技术,以便在设备上运行大型LLM。
  • results: 实验结果表明,这种方法可以在设备上运行一个精度调整后的GPT大语言模型,并且可以在4GB的内存下运行。此外,该方法还实现了文本到动作的功能,使得用户可以通过文本输入来控制移动设备。
    Abstract The field of Artificial Intelligence has witnessed remarkable progress in recent years, especially with the emergence of powerful large language models (LLMs) based on the transformer architecture. Cloud-based LLMs, such as OpenAI's ChatGPT, offer impressive capabilities but come with concerns regarding latency and privacy due to network dependencies. This article presents an innovative approach to LLM inference, envisioning a future where LLMs with billions of parameters can be executed directly on mobile devices without network connectivity. The article showcases a fine-tuned GPT LLM with 3 billion parameters that can operate smoothly on devices with as low as 4GB of memory. Through the integration of native code and model quantization techniques, the application not only serves as a general-purpose assistant but also facilitates seamless mobile interactions with text-to-actions features. The article provides insights into the training pipeline, implementation details, test results, and future directions of on-device LLM inference. This breakthrough technology opens up possibilities for empowering users with sophisticated AI capabilities while preserving their privacy and eliminating latency concerns.
    摘要 近年来,人工智能领域取得了显著进展,尤其是基于 transformer 架构的强大大型语言模型(LLM)的出现。OpenAI 的 ChatGPT 等云端 LLM 功能强大,但由于依赖网络,存在延迟和隐私方面的顾虑。本文提出了一种创新的 LLM 推理方案,展望了一个无需网络连接、即可在移动设备上直接运行数十亿参数 LLM 的未来。文中展示了一个经过微调、拥有 30 亿参数的 GPT 大语言模型,它可以在内存仅 4GB 的设备上流畅运行。通过结合原生代码与模型量化技术,该应用不仅可以充当通用助手,还借助文本到动作(text-to-actions)功能实现了流畅的移动端交互。文章介绍了训练流程、实现细节、测试结果以及端侧 LLM 推理的未来方向。这一突破性技术在保护用户隐私、消除延迟顾虑的同时为用户提供先进的 AI 能力,开启了新的可能性。

Neural Lithography: Close the Design-to-Manufacturing Gap in Computational Optics with a ‘Real2Sim’ Learned Photolithography Simulator

  • paper_url: http://arxiv.org/abs/2309.17343
  • repo_url: None
  • paper_authors: Cheng Zheng, Guangyuan Zhao, Peter T. C. So
  • for: bridging the "design-to-manufacturing" gap in computational optics
  • methods: a fully differentiable design framework that integrates a pre-trained photolithography simulator, leveraging physics-informed modeling and data-driven training
  • results: improved optical performance on task-specific metrics for a holographic optical element (HOE) and a multi-level diffractive lens (MDL) fabricated with a two-photon lithography system
    Abstract We introduce neural lithography to address the 'design-to-manufacturing' gap in computational optics. Computational optics with large design degrees of freedom enable advanced functionalities and performance beyond traditional optics. However, the existing design approaches often overlook the numerical modeling of the manufacturing process, which can result in significant performance deviation between the design and the fabricated optics. To bridge this gap, we, for the first time, propose a fully differentiable design framework that integrates a pre-trained photolithography simulator into the model-based optical design loop. Leveraging a blend of physics-informed modeling and data-driven training using experimentally collected datasets, our photolithography simulator serves as a regularizer on fabrication feasibility during design, compensating for structure discrepancies introduced in the lithography process. We demonstrate the effectiveness of our approach through two typical tasks in computational optics, where we design and fabricate a holographic optical element (HOE) and a multi-level diffractive lens (MDL) using a two-photon lithography system, showcasing improved optical performance on the task-specific metrics.
    摘要 我们引入"神经光刻"(neural lithography)来弥合计算光学中的"设计到制造"鸿沟。具有大量设计自由度的计算光学能够实现超越传统光学的先进功能与性能,但现有设计方法往往忽略对制造过程的数值建模,可能导致设计与实际加工出的光学器件之间出现显著的性能偏差。为弥合这一差距,我们首次提出一个完全可微的设计框架,将预训练的光刻仿真器嵌入基于模型的光学设计闭环中。借助物理先验建模与基于实验采集数据的训练,该光刻仿真器在设计阶段充当可制造性的正则项,补偿光刻工艺引入的结构偏差。我们通过计算光学中的两个典型任务验证了方法的有效性:使用双光子光刻系统设计并制造了全息光学元件(HOE)和多级衍射透镜(MDL),并在任务相关指标上展示了更优的光学性能。

  • paper_url: http://arxiv.org/abs/2309.17341
  • repo_url: None
  • paper_authors: Eliska Kloberdanz, Wei Le
  • for: 创建高效的深度神经网络(DNNs),以便在具有限制计算资源和实时系统的平台上部署。
  • methods: 使用量化技术,以低于 f32 浮点精度的位宽执行计算并存储张量,从而减小模型体积、缩短推理延迟。
  • results: 提出了名为 MixQuant 的搜索算法,可根据舍入误差为每层权重寻找最优的定制量化位宽,并且可以作为预处理优化与任何量化方法结合使用,以提高量化模型的准确性。
    Abstract Quantization is a technique for creating efficient Deep Neural Networks (DNNs), which involves performing computations and storing tensors at lower bit-widths than f32 floating point precision. Quantization reduces model size and inference latency, and therefore allows for DNNs to be deployed on platforms with constrained computational resources and real-time systems. However, quantization can lead to numerical instability caused by roundoff error which leads to inaccurate computations and therefore, a decrease in quantized model accuracy. Similarly to prior works, which have shown that both biases and activations are more sensitive to quantization and are best kept in full precision or quantized with higher bit-widths, we show that some weights are more sensitive than others which should be reflected on their quantization bit-width. To that end we propose MixQuant, a search algorithm that finds the optimal custom quantization bit-width for each layer weight based on roundoff error and can be combined with any quantization method as a form of pre-processing optimization. We show that combining MixQuant with BRECQ, a state-of-the-art quantization method, yields better quantized model accuracy than BRECQ alone. Additionally, we combine MixQuant with vanilla asymmetric quantization to show that MixQuant has the potential to optimize the performance of any quantization technique.
    摘要 量化是一种构建高效深度神经网络(DNN)的技术,它以低于 f32 浮点精度的位宽执行计算并存储张量。量化能够减小模型体积、降低推理延迟,从而使 DNN 得以部署在计算资源受限的平台和实时系统上。然而,量化产生的舍入误差可能引发数值不稳定,导致计算不准确,进而降低量化模型的精度。先前工作表明偏置和激活对量化更为敏感,最好保留全精度或用更高位宽量化;与之类似,我们发现不同权重的敏感程度也不相同,这应当反映在它们的量化位宽上。为此,我们提出 MixQuant,一种基于舍入误差为每层权重寻找最优定制量化位宽的搜索算法,可作为预处理优化与任何量化方法结合使用。我们证明,将 MixQuant 与最先进的量化方法 BRECQ 结合,可以获得比单独使用 BRECQ 更高的量化模型精度。此外,我们还将 MixQuant 与普通的非对称量化结合,表明 MixQuant 具有优化任何量化技术性能的潜力。
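A simplified rendering of choosing per-layer bit-widths from weight roundoff error. The uniform quantizer, the candidate set, and the fixed error threshold below are assumptions for illustration rather than the paper's exact search criterion.

```python
import numpy as np

def quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Uniform symmetric quantization (a stand-in for whichever quantizer
    MixQuant is combined with, e.g. BRECQ)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def choose_bitwidth(w: np.ndarray, candidates=(2, 3, 4, 6, 8), tol=1e-6):
    """Return the smallest candidate bit-width whose weight roundoff MSE stays
    below `tol` (an illustrative simplification of the roundoff-error-based search)."""
    err = None
    for b in sorted(candidates):
        err = float(np.mean((w - quantize(w, b)) ** 2))
        if err <= tol:
            return b, err
    return max(candidates), err

rng = np.random.default_rng(0)
layers = {
    "attn.q_proj": rng.normal(0.0, 0.02, size=(512, 512)),
    "mlp.fc1": rng.standard_t(df=3, size=(512, 512)) * 0.02,  # heavier-tailed weights
}
for name, w in layers.items():
    bits, err = choose_bitwidth(w)
    print(f"{name}: {bits}-bit, roundoff MSE {err:.2e}")
```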

Improving Trajectory Prediction in Dynamic Multi-Agent Environment by Dropping Waypoints

  • paper_url: http://arxiv.org/abs/2309.17338
  • repo_url: None
  • paper_authors: Pranav Singh Chib, Pravendra Singh
  • for: 本研究旨在提高轨迹预测的准确性,应对轨迹多样性与不确定性建模的挑战。
  • methods: 该研究提出了一种新的框架——时间路径点丢弃(Temporal Waypoint Dropping, TWD),通过丢弃路径点促进模型对时间相关性的显式学习。
  • results: 对三个数据集(NBA Sports VU、ETH-UCY、TrajNet++)进行了广泛的实验,表明 TWD 能够有效地强制模型学习复杂的时间相关性。
    Abstract The inherently diverse and uncertain nature of trajectories presents a formidable challenge in accurately modeling them. Motion prediction systems must effectively learn spatial and temporal information from the past to forecast the future trajectories of the agent. Many existing methods learn temporal motion via separate components within stacked models to capture temporal features. This paper introduces a novel framework, called Temporal Waypoint Dropping (TWD), that promotes explicit temporal learning through the waypoint dropping technique. Learning through waypoint dropping can compel the model to improve its understanding of temporal correlations among agents, thus leading to a significant enhancement in trajectory prediction. Trajectory prediction methods often operate under the assumption that observed trajectory waypoint sequences are complete, disregarding real-world scenarios where missing values may occur, which can influence their performance. Moreover, these models frequently exhibit a bias towards particular waypoint sequences when making predictions. Our TWD is capable of effectively addressing these issues. It incorporates stochastic and fixed processes that regularize projected past trajectories by strategically dropping waypoints based on temporal sequences. Through extensive experiments, we demonstrate the effectiveness of TWD in forcing the model to learn complex temporal correlations among agents. Our approach can complement existing trajectory prediction methods to enhance prediction accuracy. We also evaluate our proposed method across three datasets: NBA Sports VU, ETH-UCY, and TrajNet++.
    摘要 自然的轨迹具有内在的多样性和不确定性,这些特点使轨迹预测变得非常困难。轨迹预测系统需要从过去获取空间和时间信息,以更好地预测未来轨迹。许多现有方法在堆叠模型中分别学习时间特征,以捕捉时间特征。本文提出了一种新的框架,即时间点掉除(TWD),它通过掉除时间点来显式地学习时间相关性。通过这种方法,模型可以更好地理解 agent之间的时间相关性,从而导致轨迹预测的显著改进。轨迹预测方法经常假设观察到的轨迹点序列是完整的,忽略了现实世界中的缺失数据,这可能会影响其性能。此外,这些模型经常偏爱某些轨迹点序列,这会导致预测不准确。我们的 TWD 可以有效地解决这些问题。它将随机过程和固定过程混合,通过掉除时间点来规范预测过去轨迹的投影。通过广泛的实验,我们证明了 TWD 的效果是让模型学习复杂的时间相关性。我们的方法可以补充现有的轨迹预测方法,以提高预测精度。我们还对我们的提议方法进行了三个数据集的评估:NBA Sports VU、ETH-UCY 和 TrajNet++。
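The augmentation itself is easy to sketch: given an observed trajectory, waypoints are dropped either stochastically or on a fixed schedule before the sequence is fed to the predictor. Whether dropped steps are masked out (as below, with NaN) or removed entirely is an implementation assumption.

```python
import numpy as np

def drop_waypoints(traj: np.ndarray, p: float = 0.2, mode: str = "stochastic",
                   rng=np.random.default_rng(0)):
    """Temporal waypoint dropping on an observed trajectory of shape (T, 2).
    Dropped steps are masked (NaN) so the predictor still sees a fixed-length input."""
    out = traj.copy()
    T = traj.shape[0]
    if mode == "stochastic":                 # drop each past waypoint with prob p
        mask = rng.random(T) < p
    else:                                    # "fixed": drop every k-th waypoint
        k = max(int(round(1 / p)), 2)
        mask = np.zeros(T, dtype=bool)
        mask[::k] = True
    mask[-1] = False                         # always keep the most recent state
    out[mask] = np.nan
    return out

traj = np.stack([np.linspace(0, 7, 8), np.linspace(0, 3.5, 8)], axis=1)
print(drop_waypoints(traj, p=0.25, mode="fixed"))
```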

Toward Operationalizing Pipeline-aware ML Fairness: A Research Agenda for Developing Practical Guidelines and Tools

  • paper_url: http://arxiv.org/abs/2309.17337
  • repo_url: None
  • paper_authors: Emily Black, Rakshit Naidu, Rayid Ghani, Kit T. Rodolfa, Daniel E. Ho, Hoda Heidari
  • for: 本研究旨在提供一个完整的研究资源库,用于帮助机器学习(ML)研究人员、实践人员和学生在实现算法公平中使用管道驱动的方法。
  • methods: 本研究使用了文献综述的方法,收集和组织了过去的做法,以便为研究人员、实践人员和学生提供一个完整的研究资源库。
  • results: 本研究提出了一个研究资源库,用于帮助研究人员、实践人员和学生在实现算法公平中使用管道驱动的方法。
    Abstract While algorithmic fairness is a thriving area of research, in practice, mitigating issues of bias often gets reduced to enforcing an arbitrarily chosen fairness metric, either by enforcing fairness constraints during the optimization step, post-processing model outputs, or by manipulating the training data. Recent work has called on the ML community to take a more holistic approach to tackle fairness issues by systematically investigating the many design choices made through the ML pipeline, and identifying interventions that target the issue's root cause, as opposed to its symptoms. While we share the conviction that this pipeline-based approach is the most appropriate for combating algorithmic unfairness on the ground, we believe there are currently very few methods of \emph{operationalizing} this approach in practice. Drawing on our experience as educators and practitioners, we first demonstrate that without clear guidelines and toolkits, even individuals with specialized ML knowledge find it challenging to hypothesize how various design choices influence model behavior. We then consult the fair-ML literature to understand the progress to date toward operationalizing the pipeline-aware approach: we systematically collect and organize the prior work that attempts to detect, measure, and mitigate various sources of unfairness through the ML pipeline. We utilize this extensive categorization of previous contributions to sketch a research agenda for the community. We hope this work serves as the stepping stone toward a more comprehensive set of resources for ML researchers, practitioners, and students interested in exploring, designing, and testing pipeline-oriented approaches to algorithmic fairness.
    摘要 虽然算法公平是一个蓬勃发展的研究领域,但在实践中,缓解偏见问题往往被简化为强制满足某个任意选定的公平度量:要么在优化阶段施加公平约束,要么对模型输出做后处理,要么对训练数据进行干预。近期的工作呼吁机器学习社区以更整体的方式处理公平问题——系统地审视机器学习流水线中的各种设计选择,并找出针对问题根源而非其表象的干预手段。我们同样认为这种基于流水线的思路最适合在实际中对抗算法不公平,但我们也认为,目前几乎没有将这一思路付诸实践的可操作方法。基于我们作为教育者和从业者的经验,我们首先说明:在缺乏清晰指南和工具包的情况下,即便是具备机器学习专业知识的人,也很难推断各种设计选择会如何影响模型行为。随后,我们梳理公平机器学习文献,以了解将"流水线感知"方法落地的已有进展:我们系统地收集并组织了试图检测、度量和缓解机器学习流水线中各类不公平来源的先前工作,并利用这一系统归类为社区勾勒出一份研究议程。我们希望这项工作能成为迈向更完整资源体系的基石,服务于有志于探索、设计和检验面向流水线的算法公平方法的研究者、从业者和学生。

Asynchronous Graph Generators

  • paper_url: http://arxiv.org/abs/2309.17335
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Christopher P. Ley, Felipe Tobar
  • for: 这篇论文描述了一种新的图神经网络架构,用于多通道时间序列数据的插补和预测。
  • methods: 这篇论文使用名为异步图生成器(asynchronous graph generator, AGG)的新型图神经网络架构,将观测值建模为动态图上的节点,并利用注意力机制学习时间序列各变量之间的表达性关系。
  • results: 实验结果表明,AGG 在 Beijing Air Quality、PhysioNet Challenge 2012 和 UCI localisation 等基准数据集上的时间序列插补、预测和分类任务中取得了最先进(state-of-the-art)的结果。
    Abstract We introduce the asynchronous graph generator (AGG), a novel graph neural network architecture for multi-channel time series which models observations as nodes on a dynamic graph and can thus perform data imputation by transductive node generation. Completely free from recurrent components or assumptions about temporal regularity, AGG represents measurements, timestamps and metadata directly in the nodes via learnable embeddings, to then leverage attention to learn expressive relationships across the variables of interest. This way, the proposed architecture implicitly learns a causal graph representation of sensor measurements which can be conditioned on unseen timestamps and metadata to predict new measurements by an expansion of the learnt graph. The proposed AGG is compared both conceptually and empirically to previous work, and the impact of data augmentation on the performance of AGG is also briefly discussed. Our experiments reveal that AGG achieved state-of-the-art results in time series data imputation, classification and prediction for the benchmark datasets Beijing Air Quality, PhysioNet Challenge 2012 and UCI localisation.
    摘要 我们提出了异步图生成器(AGG),一种面向多通道时间序列的新型图神经网络架构。AGG 将观测值建模为动态图上的节点,因而可以通过转导式节点生成完成数据插补。AGG 完全不依赖循环组件,也不对时间规律性做任何假设,而是通过可学习的嵌入将测量值、时间戳和元数据直接表示在节点中,再利用注意力机制学习感兴趣变量之间的表达性关系。这样,所提出的架构隐式地学习了传感器测量的因果图表示,可以以未见过的时间戳和元数据为条件,通过对已学习图的扩展来预测新的测量值。我们从概念和实验两个方面将 AGG 与先前工作进行了比较,并简要讨论了数据增强对 AGG 性能的影响。实验表明,AGG 在 Beijing Air Quality、PhysioNet Challenge 2012 和 UCI localisation 等基准数据集上的时间序列插补、分类和预测任务中取得了最先进的结果。

Efficient Anatomical Labeling of Pulmonary Tree Structures via Implicit Point-Graph Networks

  • paper_url: http://arxiv.org/abs/2309.17329
  • repo_url: None
  • paper_authors: Kangxian Xie, Jiancheng Yang, Donglai Wei, Ziqiao Weng, Pascal Fua
  • for: 了解肺系统的复杂3D树状结构,以提高肺病的治疗。
  • methods: 提出一种保留树骨架图连通性并结合隐式曲面表示的点云方法,在较低的计算开销下达到 SOTA 精度。
  • results: 得到的模型具有可用的表面;鉴于公开数据稀缺,我们还整理了一个大规模数据集用于评估,并将予以公开。
    Abstract Pulmonary diseases rank prominently among the principal causes of death worldwide. Curing them will require, among other things, a better understanding of the many complex 3D tree-shaped structures within the pulmonary system, such as airways, arteries, and veins. In theory, they can be modeled using high-resolution image stacks. Unfortunately, standard CNN approaches operating on dense voxel grids are prohibitively expensive. To remedy this, we introduce a point-based approach that preserves graph connectivity of tree skeleton and incorporates an implicit surface representation. It delivers SOTA accuracy at a low computational cost and the resulting models have usable surfaces. Due to the scarcity of publicly accessible data, we have also curated an extensive dataset to evaluate our approach and will make it public.
    摘要 肺疾病在全球死亡原因中排名前列,需要更好地理解肺系统中复杂的3D树状结构,如呼吸道、血管和血液。在理论上,这些结构可以通过高分辨率图像堆栈来模拟。然而,使用标准的density voxel网格方法会非常昂贵。为此,我们介绍了点基方法,保留树状结构的图 Connectivity和Surface representation。它可以达到低计算成本下的SOTA准确率,并且模型的表面可以使用。由于肺疾病数据的公共访问性缺乏,我们还编辑了一个广泛的数据集,以评估我们的方法,并将其公开。

Assessing Look-Ahead Bias in Stock Return Predictions Generated By GPT Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2309.17322
  • repo_url: None
  • paper_authors: Paul Glasserman, Caden Lin
  • for: 这个论文是关于大语言模型(LLM)从新闻文本中提取股票交易信号的研究。
  • methods: 作者使用了LLM来分析新闻文本中的情绪,并通过去除文本中相关公司的标识符来避免干扰和look-ahead偏见。
  • results: 研究发现,在训练窗口内(样本内),去除公司标识符后的标题反而表现更好,表明干扰效应的影响大于前视偏差,且这种效应对大公司尤为明显;在样本外,前视偏差不再是问题,但干扰效应仍可能存在。
    Abstract Large language models (LLMs), including ChatGPT, can extract profitable trading signals from the sentiment in news text. However, backtesting such strategies poses a challenge because LLMs are trained on many years of data, and backtesting produces biased results if the training and backtesting periods overlap. This bias can take two forms: a look-ahead bias, in which the LLM may have specific knowledge of the stock returns that followed a news article, and a distraction effect, in which general knowledge of the companies named interferes with the measurement of a text's sentiment. We investigate these sources of bias through trading strategies driven by the sentiment of financial news headlines. We compare trading performance based on the original headlines with de-biased strategies in which we remove the relevant company's identifiers from the text. In-sample (within the LLM training window), we find, surprisingly, that the anonymized headlines outperform, indicating that the distraction effect has a greater impact than look-ahead bias. This tendency is particularly strong for larger companies--companies about which we expect an LLM to have greater general knowledge. Out-of-sample, look-ahead bias is not a concern but distraction remains possible. Our proposed anonymization procedure is therefore potentially useful in out-of-sample implementation, as well as for de-biased backtesting.
    摘要 大型语言模型(LLM),包括ChatGPT,可以从新闻文本中提取有利交易信号。然而,回测这些策略具有挑战,因为LLM被训练了许多年的数据,而回测期间与训练期间重叠,可能会导致偏见。这种偏见可以有两种形式:look-ahead偏见, LLM可能知道新闻文本后的股票收益,以及拖垮效应,通常知道公司名称会对文本情绪的测量产生干扰。我们通过基于新闻标题的交易策略来研究这些偏见的来源。我们将基于原始标题和去除相关公司标识符的两种策略进行比较,并发现在样本内(在LLM训练窗口内),去除公司标识符后的策略表现更好, indicating that the distraction effect is more significant than look-ahead bias。这种倾向尤其强于大型公司——我们预期LLM对这些公司有更多通用知识。外样(外 LLM训练窗口),look-ahead偏见不是问题,但distraction仍然可能存在。我们提议的匿名处理程序因此可能有用,不仅在外样实现,还在做偏见回测。
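The de-biasing step is straightforward to express in code: strip the firm's name and ticker from each headline before it is scored, so the LLM cannot bring either memorized post-publication returns or firm-specific general knowledge to bear. The placeholder replacement text and the keyword-based sentiment stub are assumptions; the paper prompts an LLM for the sentiment score.

```python
import re

def anonymize_headline(headline: str, company: str, ticker: str) -> str:
    """Remove the company's identifiers from a headline before sentiment scoring."""
    text = re.sub(re.escape(company), "the company", headline, flags=re.I)
    text = re.sub(rf"\(?\b{re.escape(ticker)}\b\)?", "", text)
    return re.sub(r"\s+", " ", text).strip()

def sentiment_signal(headline: str) -> int:
    """Stand-in for the LLM sentiment call; a keyword rule keeps the sketch runnable."""
    h = headline.lower()
    return 1 if "beats" in h or "record" in h else (-1 if "misses" in h else 0)

raw = "Acme Corp (ACME) beats quarterly earnings estimates on record sales"
clean = anonymize_headline(raw, "Acme Corp", "ACME")
print(clean, "->", sentiment_signal(clean))
```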

Building Privacy-Preserving and Secure Geospatial Artificial Intelligence Foundation Models

  • paper_url: http://arxiv.org/abs/2309.17319
  • repo_url: None
  • paper_authors: Jinmeng Rao, Song Gao, Gengchen Mai, Krzysztof Janowicz
  • for: 本研究旨在揭示地理空间人工智能(GeoAI)基础模型所涉及的隐私与安全风险,并探讨应对这些风险的可能方案。
  • methods: 本文采用综述视角,覆盖 GeoAI 基础模型从模型训练、测试到实际应用的完整生命周期。
  • results: 本文指出 GeoAI 基础模型的开发与应用可能带来严重的隐私与安全风险,并提出了一份涵盖研究方向与预防控制策略的完整蓝图,帮助研究人员和政策制定者理解并应对这些问题。
    Abstract In recent years we have seen substantial advances in foundation models for artificial intelligence, including language, vision, and multimodal models. Recent studies have highlighted the potential of using foundation models in geospatial artificial intelligence, known as GeoAI Foundation Models, for geographic question answering, remote sensing image understanding, map generation, and location-based services, among others. However, the development and application of GeoAI foundation models can pose serious privacy and security risks, which have not been fully discussed or addressed to date. This paper introduces the potential privacy and security risks throughout the lifecycle of GeoAI foundation models and proposes a comprehensive blueprint for research directions and preventative and control strategies. Through this vision paper, we hope to draw the attention of researchers and policymakers in geospatial domains to these privacy and security risks inherent in GeoAI foundation models and advocate for the development of privacy-preserving and secure GeoAI foundation models.
    摘要 近年来,人工智能基础模型取得了长足进展,包括语言、视觉和多模态模型。最新研究表明,在地理空间人工智能中使用基础模型(即 GeoAI 基础模型)在地理问答、遥感影像理解、地图生成和基于位置的服务等方面具有巨大潜力。然而,GeoAI 基础模型的开发与应用可能带来严重的隐私与安全风险,这些风险迄今尚未得到充分讨论或解决。本文梳理了 GeoAI 基础模型全生命周期中潜在的隐私与安全风险,并提出了一份涵盖研究方向以及预防与控制策略的完整蓝图。我们希望通过这篇愿景论文,引起地理空间领域研究人员和政策制定者对 GeoAI 基础模型固有隐私与安全风险的关注,并倡导开发保护隐私且安全的 GeoAI 基础模型。

AutoAgents: A Framework for Automatic Agent Generation

  • paper_url: http://arxiv.org/abs/2309.17288
  • repo_url: https://github.com/Link-AGI/AutoAgents
  • paper_authors: Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje F. Karlsson, Jie Fu, Yemin Shi
  • for: 这篇论文的目的是提出一个创新的框架,可以自动生成和调度多个专门的智能代理人来解决不同的任务。
  • methods: 这篇论文使用了自动生成和调度多个专门的智能代理人,并且将任务和角色之间的关系实现为动态生成的专业代理人。
  • results: experiments 表明,这个框架可以产生更有条理和更准确的解决方案,比起现有的多代理人方法。
    Abstract Large language models (LLMs) have enabled remarkable advances in automated task-solving with multi-agent systems. However, most existing LLM-based multi-agent approaches rely on predefined agents to handle simple tasks, limiting the adaptability of multi-agent collaboration to different scenarios. Therefore, we introduce AutoAgents, an innovative framework that adaptively generates and coordinates multiple specialized agents to build an AI team according to different tasks. Specifically, AutoAgents couples the relationship between tasks and roles by dynamically generating multiple required agents based on task content and planning solutions for the current task based on the generated expert agents. Multiple specialized agents collaborate with each other to efficiently accomplish tasks. Concurrently, an observer role is incorporated into the framework to reflect on the designated plans and agents' responses and improve upon them. Our experiments on various benchmarks demonstrate that AutoAgents generates more coherent and accurate solutions than the existing multi-agent methods. This underscores the significance of assigning different roles to different tasks and of team cooperation, offering new perspectives for tackling complex tasks. The repository of this project is available at https://github.com/Link-AGI/AutoAgents.
    摘要 大型语言模型(LLM)已经实现了自动任务解决的多智能体系统的很多创新。然而,大多数现有的 LLM 基于多智能体系统仍然依赖于预定义的代理人来处理简单任务,这限制了多智能体团队在不同场景下的适应性。因此,我们介绍了 AutoAgents,一个创新的框架,可以动态生成和协调多个专业代理人,以建立基于任务的 AI 团队。具体来说,AutoAgents 将任务和角色之间的关系 coupling 到一起,通过动态生成多个需要的代理人,根据任务内容和规划解决方案来生成专业代理人。多个专业代理人之间协同合作,以高效地完成任务。同时,框架中还包含了观察者角色,可以反思指定的计划和代理人的回应,并改进它们。我们在多个标准 benchmark 上进行了实验,结果表明,AutoAgents 可以生成更 coherent 和更准确的解决方案,而不是现有的多智能体方法。这说明了分配不同任务不同角色的重要性,以及团队合作的新视角,可以用于解决复杂任务。AutoAgents 项目的存储库可以在 GitHub 上找到:https://github.com/Link-AGI/AutoAgents。

AI-Aristotle: A Physics-Informed framework for Systems Biology Gray-Box Identification

  • paper_url: http://arxiv.org/abs/2310.01433
  • repo_url: None
  • paper_authors: Nazanin Ahmadi Daryakenari, Mario De Florio, Khemraj Shukla, George Em Karniadakis
  • for: 这个研究旨在找到生物系统中未知的物理方程式,并且使用调教数据来推导这些方程式。
  • methods: 这个方法结合了EXTreme Theory of Functional Connections(X-TFC)领域分解和物理调教神经网络(PINNs),以及symbolic regression(SR)技术,实现参数发现和灰色盒识别。
  • results: 这个方法在两个系统生物 benchmark 问题上进行了测试,结果显示了高准确、快速、灵活和可靠的性能。
    Abstract Discovering mathematical equations that govern physical and biological systems from observed data is a fundamental challenge in scientific research. We present a new physics-informed framework for parameter estimation and missing physics identification (gray-box) in the field of Systems Biology. The proposed framework -- named AI-Aristotle -- combines eXtreme Theory of Functional Connections (X-TFC) domain-decomposition and Physics-Informed Neural Networks (PINNs) with symbolic regression (SR) techniques for parameter discovery and gray-box identification. We test the accuracy, speed, flexibility and robustness of AI-Aristotle based on two benchmark problems in Systems Biology: a pharmacokinetics drug absorption model, and an ultradian endocrine model for glucose-insulin interactions. We compare the two machine learning methods (X-TFC and PINNs), and moreover, we employ two different symbolic regression techniques to cross-verify our results. While the current work focuses on the performance of AI-Aristotle based on synthetic data, it can equally handle noisy experimental data and can even be used for black-box identification in just a few minutes on a laptop. More broadly, our work provides insights into the accuracy, cost, scalability, and robustness of integrating neural networks with symbolic regressors, offering a comprehensive guide for researchers tackling gray-box identification challenges in complex dynamical systems in biomedicine and beyond.
    摘要 找到物理和生物系统中的数学方程是科学研究中的基本挑战。我们介绍了一个新的物理学习框架,以帮助在系统生物中进行参数估计和缺失物理特征的恢复(灰色盒)。该框架被称为AI-Aristotle,它结合了极限理论函数连接(X-TFC)领域分解和物理学习网络(PINNs)以及符号回归(SR)技术来进行参数发现和灰色盒特征的恢复。我们在两个系统生物 benchmark 问题中测试了AI-Aristotle 的准确性、速度、灵活性和可靠性。我们将 X-TFC 和 PINNs 两种机器学习方法进行比较,并使用了两种不同的符号回归技术来跨验我们的结果。当前的工作主要基于合成数据进行测试,但它可以同时处理噪音的实验数据,并且可以在几分钟内在笔记计算机上进行黑盒特征的恢复。更广泛地说,我们的工作提供了 integrating 神经网络与符号回归器的精度、成本、可扩展性和可靠性的评估,这将为处理复杂的生物医学系统中的灰色盒特征恢复问题提供一个完整的指南。

Split and Merge: Aligning Position Biases in Large Language Model based Evaluators

  • paper_url: http://arxiv.org/abs/2310.01432
  • repo_url: None
  • paper_authors: Zongjie Li, Chaozheng Wang, Pingchuan Ma, Daoyuan Wu, Shuai Wang, Cuiyun Gao, Yang Liu
  • for: 提高LLMs的可靠性和扩展性,以便更好地用于自动评估AI系统的回答质量。
  • methods: 提出了一种名为PORTIA的Alignment-based系统,通过对答案分割、对相似内容进行对齐,然后将其传递给LLMs进行评估。
  • results: 对11,520个答案对进行了广泛的实验,发现PORTIA可以显著提高LLMs的一致率,最高可达98%;同时,PORTIA可以使用较为简单的GPT模型达到与State-of-the-art GPT-4模型相当的性能,减少了评估成本。
    Abstract Large language models (LLMs) have shown promise as automated evaluators for assessing the quality of answers generated by AI systems. However, these LLM-based evaluators exhibit position bias, or inconsistency, when used to evaluate candidate answers in pairwise comparisons, favoring either the first or second answer regardless of content. To address this limitation, we propose PORTIA, an alignment-based system designed to mimic human comparison strategies to calibrate position bias in a lightweight yet effective manner. Specifically, PORTIA splits the answers into multiple segments, aligns similar content across candidate answers, and then merges them back into a single prompt for evaluation by LLMs. We conducted extensive experiments with six diverse LLMs to evaluate 11,520 answer pairs. Our results show that PORTIA markedly enhances the consistency rates for all the models and comparison forms tested, achieving an average relative improvement of 47.46%. Remarkably, PORTIA enables less advanced GPT models to achieve 88% agreement with the state-of-the-art GPT-4 model at just 10% of the cost. Furthermore, it rectifies around 80% of the position bias instances within the GPT-4 model, elevating its consistency rate up to 98%. Subsequent human evaluations indicate that the PORTIA-enhanced GPT-3.5 model can even surpass the standalone GPT-4 in terms of alignment with human evaluators. These findings highlight PORTIA's ability to correct position bias, improve LLM consistency, and boost performance while keeping cost-efficiency. This represents a valuable step toward a more reliable and scalable use of LLMs for automated evaluations across diverse applications.
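
A toy sketch of the split-align-merge idea described in the abstract. The sentence-level splitting and greedy difflib-based alignment are assumptions made for illustration; PORTIA's actual segmentation and alignment procedure may differ.

```python
# Toy split-align-merge comparison prompt (an assumption-laden reading of PORTIA,
# not the authors' implementation). Answers are split into sentences, similar
# sentences are greedily aligned, and the aligned pieces form one evaluation prompt.
from difflib import SequenceMatcher
import re

def split_segments(answer: str):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s.strip()]

def align(segs_a, segs_b):
    pairs, used = [], set()
    for a in segs_a:
        best_j, best_score = None, 0.0
        for j, b in enumerate(segs_b):
            if j in used:
                continue
            score = SequenceMatcher(None, a, b).ratio()
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used.add(best_j)
            pairs.append((a, segs_b[best_j]))
    return pairs

def build_prompt(question, answer_a, answer_b):
    pairs = align(split_segments(answer_a), split_segments(answer_b))
    body = "\n".join(f"A{i+1}: {a}\nB{i+1}: {b}" for i, (a, b) in enumerate(pairs))
    return (f"Question: {question}\n"
            f"Compare the two answers segment by segment and say which is better overall.\n{body}")

print(build_prompt("What causes tides?",
                   "The Moon's gravity pulls the ocean. The Sun also matters.",
                   "Solar gravity contributes too. Tides are mainly caused by lunar gravity."))
```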

PB-LLM: Partially Binarized Large Language Models

  • paper_url: http://arxiv.org/abs/2310.00034
  • repo_url: https://github.com/hahnyuan/binaryllm
  • paper_authors: Yuzhang Shang, Zhihang Yuan, Qiang Wu, Zhen Dong
  • for: Explores network binarization, a radical form of quantization that compresses model weights to a single bit, specifically for large language model (LLM) compression. Because previous binarization methods collapse LLMs, the authors propose Partially-Binarized LLM (PB-LLM), which achieves extreme low-bit quantization while preserving the linguistic reasoning capacity of the compressed model.
  • methods: The exploration first shows that naive applications of existing binarization algorithms are ineffective and highlights the imperative role of salient weights. PB-LLM therefore filters a small ratio of salient weights during binarization and allocates them to higher-bit storage (partial binarization); it is further extended to recover the capacity of quantized LLMs from both the post-training quantization (PTQ) and quantization-aware training (QAT) perspectives.
  • results: Experiments show that PB-LLM achieves extreme low-bit quantization while maintaining the linguistic reasoning capacity of LLMs. Under PTQ, a Hessian-guided reconstruction of the binarized weight matrix recovers reasoning capacity at low bit-widths; under QAT, freezing the salient weights and applying a derived optimal scaling strategy to the residual binarized weights further improves accuracy.
    Abstract This paper explores network binarization, a radical form of quantization, compressing model weights to a single bit, specifically for Large Language Models (LLMs) compression. Due to previous binarization methods collapsing LLMs, we propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs. Specifically, our exploration first uncovers the ineffectiveness of naive applications of existing binarization algorithms and highlights the imperative role of salient weights in achieving low-bit quantization. Thus, PB-LLM filters a small ratio of salient weights during binarization, allocating them to higher-bit storage, i.e., partial binarization. PB-LLM is extended to recover the capacities of quantized LLMs, by analyzing from the perspective of post-training quantization (PTQ) and quantization-aware training (QAT). Under PTQ, combining the concepts from GPTQ, we reconstruct the binarized weight matrix guided by the Hessian matrix and successfully recover the reasoning capacity of PB-LLM in low-bit. Under QAT, we freeze the salient weights during training, explore the derivation of optimal scaling factors crucial for minimizing the quantization error, and propose a scaling mechanism based on this derived scaling strategy for residual binarized weights. Those explorations and the developed methodologies significantly contribute to rejuvenating the performance of low-bit quantized LLMs and present substantial advancements in the field of network binarization for LLMs. The code is available at https://github.com/hahnyuan/BinaryLLM.
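
A small numpy sketch of partial binarization: keep the largest-magnitude ("salient") weights in full precision and binarize the remainder to {-alpha, +alpha} with a per-row scale. The salient ratio and scaling rule are illustrative assumptions, not the paper's exact PTQ or QAT procedures.

```python
# Sketch of partial weight binarization (illustrative; not the paper's exact recipe).
import numpy as np

def partially_binarize(W: np.ndarray, salient_ratio: float = 0.05) -> np.ndarray:
    """Keep the top `salient_ratio` weights (by magnitude) of each row in full
    precision, binarize the rest to {-alpha, +alpha} with a per-row scale alpha."""
    W_q = np.empty_like(W)
    k = max(1, int(salient_ratio * W.shape[1]))
    for i, row in enumerate(W):
        mask = np.zeros_like(row, dtype=bool)
        mask[np.argsort(np.abs(row))[-k:]] = True          # salient weights stay as-is
        residual = row[~mask]
        alpha = np.mean(np.abs(residual)) if residual.size else 0.0  # per-row scale
        W_q[i] = np.where(mask, row, alpha * np.sign(row))  # rest binarized
    return W_q

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 64))
W_q = partially_binarize(W, salient_ratio=0.1)
print("relative reconstruction error:", np.linalg.norm(W - W_q) / np.linalg.norm(W))
```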

  • paper_url: http://arxiv.org/abs/2309.17280
  • repo_url: None
  • paper_authors: Yang Zhong, Diane Litman
  • for: Proposes a structure-controllable summarization method for long legal opinions that takes the argument structure of the document into account.
  • methods: Predicted argument role information guides the model to generate coherent summaries that follow a provided structure pattern.
  • results: Evaluated on a dataset of legal opinions, the approach outperforms several strong baselines with respect to ROUGE, BERTScore, and structure similarity.
    Abstract We propose an approach for the structure controllable summarization of long legal opinions that considers the argument structure of the document. Our approach involves using predicted argument role information to guide the model in generating coherent summaries that follow a provided structure pattern. We demonstrate the effectiveness of our approach on a dataset of legal opinions and show that it outperforms several strong baselines with respect to ROUGE, BERTScore, and structure similarity.

Suspicion-Agent: Playing Imperfect Information Games with Theory of Mind Aware GPT-4

  • paper_url: http://arxiv.org/abs/2309.17277
  • repo_url: https://github.com/cr-gjx/suspicion-agent
  • paper_authors: Jiaxian Guo, Bo Yang, Paul Yoo, Bill Yuchen Lin, Yusuke Iwasawa, Yutaka Matsuo
  • for: Investigates whether the knowledge learned by the large language model GPT-4 can be applied to imperfect information games.
  • methods: Introduces Suspicion-Agent, an agent that leverages GPT-4's knowledge retrieval and reasoning abilities, prompt engineering for different functions, and a theory-of-mind-aware planning strategy to play imperfect information games given only the game rules and descriptions of observations.
  • results: Experiments show that the GPT-4-based agent performs competitively across different imperfect information card games, potentially outperforming traditional algorithms, without any specialized training or examples.
    Abstract Unlike perfect information games, where all elements are known to every player, imperfect information games emulate the real-world complexities of decision-making under uncertain or incomplete information. GPT-4, the recent breakthrough in large language models (LLMs) trained on massive passive data, is notable for its knowledge retrieval and reasoning abilities. This paper delves into the applicability of GPT-4's learned knowledge for imperfect information games. To achieve this, we introduce \textbf{Suspicion-Agent}, an innovative agent that leverages GPT-4's capabilities for performing in imperfect information games. With proper prompt engineering to achieve different functions, Suspicion-Agent based on GPT-4 demonstrates remarkable adaptability across a range of imperfect information card games. Importantly, GPT-4 displays a strong high-order theory of mind (ToM) capacity, meaning it can understand others and intentionally impact others' behavior. Leveraging this, we design a planning strategy that enables GPT-4 to competently play against different opponents, adapting its gameplay style as needed, while requiring only the game rules and descriptions of observations as input. In the experiments, we qualitatively showcase the capabilities of Suspicion-Agent across three different imperfect information games and then quantitatively evaluate it in Leduc Hold'em. The results show that Suspicion-Agent can potentially outperform traditional algorithms designed for imperfect information games, without any specialized training or examples. In order to encourage and foster deeper insights within the community, we make our game-related data publicly available.

Enhancing Large Language Models in Coding Through Multi-Perspective Self-Consistency

  • paper_url: http://arxiv.org/abs/2309.17272
  • repo_url: None
  • paper_authors: Baizhou Huang, Shuai Lu, Weizhu Chen, Xiaojun Wan, Nan Duan
  • methods: Samples multiple diverse outputs from three perspectives (solution, specification, and test case), organizes them into a multipartite graph, embeds both inter- and intra-consistency information into the graph using two predefined consistency measures, and selects the optimal output based on a consistency analysis over the graph.
  • results: The MPSC framework substantially boosts performance on popular benchmarks, improving Pass@1 over the original ChatGPT outputs on HumanEval (+17.60%), HumanEval Plus (+17.61%), MBPP (+6.50%), and CodeContests (+11.82%), even surpassing GPT-4.
    Abstract Large language models (LLMs) have exhibited remarkable ability in textual generation. However, in complex reasoning tasks such as code generation, generating the correct answer in a single attempt remains a formidable challenge for LLMs. Previous research has explored solutions by aggregating multiple outputs, leveraging the consistency among them. However, none of them have comprehensively captured this consistency from different perspectives. In this paper, we propose the Multi-Perspective Self-Consistency (MPSC) framework, a novel decoding strategy for LLM that incorporates both inter-consistency across outputs from multiple perspectives and intra-consistency within a single perspective. Specifically, we ask LLMs to sample multiple diverse outputs from various perspectives for a given query and then construct a multipartite graph based on them. With two predefined measures of consistency, we embed both inter- and intra-consistency information into the graph. The optimal choice is then determined based on consistency analysis in the graph. We conduct comprehensive evaluation on the code generation task by introducing solution, specification and test case as three perspectives. We leverage a code interpreter to quantitatively measure the inter-consistency and propose several intra-consistency measure functions. Our MPSC framework significantly boosts the performance on various popular benchmarks, including HumanEval (+17.60%), HumanEval Plus (+17.61%), MBPP (+6.50%) and CodeContests (+11.82%) in Pass@1, when compared to original outputs generated from ChatGPT, and even surpassing GPT-4.
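
A simplified, two-perspective rendering of the inter-consistency idea: candidate solutions are scored by how many candidate test cases they pass when executed, and the most consistent solution is selected. The full MPSC method builds a multipartite graph over three perspectives and additionally uses intra-consistency measures.

```python
# Simplified two-perspective consistency check (solutions x test cases), in the spirit
# of MPSC's inter-consistency measure; not the full multipartite-graph method.
solutions = [
    "def add(a, b):\n    return a - b",     # buggy candidate
    "def add(a, b):\n    return a + b",     # correct candidate
]
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0", "assert add(0, 0) == 0"]

def passes(solution_src: str, test_src: str) -> bool:
    env = {}
    try:
        exec(solution_src, env)   # define the candidate function
        exec(test_src, env)       # run one candidate test against it
        return True
    except Exception:
        return False

# Inter-consistency score of each solution = fraction of candidate tests it satisfies.
scores = [sum(passes(s, t) for t in tests) / len(tests) for s in solutions]
best = solutions[max(range(len(solutions)), key=lambda i: scores[i])]
print("scores:", scores)          # e.g. [0.33, 1.0] -> pick the second solution
print(best)
```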

A Foundation Model for General Moving Object Segmentation in Medical Images

  • paper_url: http://arxiv.org/abs/2309.17264
  • repo_url: None
  • paper_authors: Zhongnuo Yan, Tong Han, Yuhao Huang, Lian Liu, Han Zhou, Jiongquan Chen, Wenlong Shi, Yan Cao, Xin Yang, Dong Ni
  • for: Aims to improve the accuracy of medical image segmentation, which plays a crucial role in clinical diagnosis.
  • methods: Uses Moving Object Segmentation (MOS), which requires only a small amount of annotation, and proposes iMOS, the first foundation model for MOS in medical images.
  • results: Experiments on a large multi-modal medical dataset show that iMOS achieves satisfactory bi-directional tracking and segmentation of moving objects throughout entire sequences while requiring annotation of only a small number of images.
    Abstract Medical image segmentation aims to delineate the anatomical or pathological structures of interest, playing a crucial role in clinical diagnosis. A substantial amount of high-quality annotated data is crucial for constructing high-precision deep segmentation models. However, medical annotation is highly cumbersome and time-consuming, especially for medical videos or 3D volumes, due to the huge labeling space and poor inter-frame consistency. Recently, a fundamental task named Moving Object Segmentation (MOS) has made significant advancements in natural images. Its objective is to delineate moving objects from the background within image sequences, requiring only minimal annotations. In this paper, we propose the first foundation model, named iMOS, for MOS in medical images. Extensive experiments on a large multi-modal medical dataset validate the effectiveness of the proposed iMOS. Specifically, with the annotation of only a small number of images in the sequence, iMOS can achieve satisfactory tracking and segmentation performance of moving objects throughout the entire sequence in bi-directions. We hope that the proposed iMOS can help accelerate the annotation speed of experts, and boost the development of medical foundation models.

PlaceNav: Topological Navigation through Place Recognition

  • paper_url: http://arxiv.org/abs/2309.17260
  • repo_url: None
  • paper_authors: Lauri Suomela, Jussi Kalliola, Harry Edelman, Joni-Kristian Kämäräinen
  • for: Improves navigation performance by using visual place recognition for subgoal selection and by increasing the availability of training data.
  • methods: Subdivides the robot-independent part of topological navigation into navigation-specific and generic computer-vision components, uses visual place recognition for subgoal selection, and applies Bayesian filtering to improve the temporal consistency of subgoals.
  • results: Experiments show a 76% higher success rate in indoor and a 23% higher success rate in outdoor navigation tasks, together with higher computational efficiency.
    Abstract Recent results suggest that splitting topological navigation into robot-independent and robot-specific components improves navigation performance by enabling the robot-independent part to be trained with data collected by different robot types. However, the navigation methods are still limited by the scarcity of suitable training data and suffer from poor computational scaling. In this work, we present PlaceNav, subdividing the robot-independent part into navigation-specific and generic computer vision components. We utilize visual place recognition for the subgoal selection of the topological navigation pipeline. This makes subgoal selection more efficient and enables leveraging large-scale datasets from non-robotics sources, increasing training data availability. Bayesian filtering, enabled by place recognition, further improves navigation performance by increasing the temporal consistency of subgoals. Our experimental results verify the design and the new model obtains a 76% higher success rate in indoor and 23% higher in outdoor navigation tasks with higher computational efficiency.
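
A toy discrete Bayes filter over topological-map nodes, showing how place-recognition similarity scores can be fused with a simple forward-motion prior to keep subgoal selection temporally consistent. The transition probabilities and fusion rule are assumptions, not PlaceNav's exact model.

```python
# Toy discrete Bayes filter over map nodes (an illustrative reading of how place
# recognition plus Bayesian filtering can stabilise subgoal selection).
import numpy as np

def predict(belief, p_stay=0.4, p_advance=0.6):
    """Motion prior: the robot either stays at its current node or advances to the next."""
    new = p_stay * belief
    new[1:] += p_advance * belief[:-1]
    new[-1] += p_advance * belief[-1]          # absorb at the final node
    return new / new.sum()

def update(belief, similarity):
    """Measurement update with place-recognition similarity scores (one per map node)."""
    posterior = belief * np.asarray(similarity)
    return posterior / posterior.sum()

belief = np.full(5, 1 / 5)                      # uniform prior over 5 map nodes
for sim in ([0.9, 0.5, 0.1, 0.1, 0.1],          # similarity scores over a few time steps
            [0.4, 0.8, 0.3, 0.1, 0.1],
            [0.2, 0.5, 0.9, 0.2, 0.1]):
    belief = update(predict(belief), sim)
    print("subgoal node:", int(belief.argmax()), belief.round(2))
```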

Knowledge Graphs for the Life Sciences: Recent Developments, Challenges and Opportunities

  • paper_url: http://arxiv.org/abs/2309.17255
  • repo_url: None
  • paper_authors: Jiaoyan Chen, Hang Dong, Janna Hastings, Ernesto Jiménez-Ruiz, Vanessa López, Pierre Monnin, Catia Pesquita, Petr Škoda, Valentina Tamma
  • for: Addresses data management and scientific discovery in the life sciences, discussing recent developments and advances in graph-based technologies and their likely future impact on these fields.
  • methods: Covers the construction and management of Knowledge Graphs (KGs), the use of KGs and associated technologies for the discovery of new knowledge, and the use of KGs in artificial intelligence applications that support explanations (explainable AI).
  • results: Through a few exemplary use cases per topic, the paper describes developments, challenges, and open research questions for each of the three themes, and concludes with a perspective and outlook that summarizes the overarching challenges and their potential solutions as a guide for future research.
    Abstract The term life sciences refers to the disciplines that study living organisms and life processes, and include chemistry, biology, medicine, and a range of other related disciplines. Research efforts in life sciences are heavily data-driven, as they produce and consume vast amounts of scientific data, much of which is intrinsically relational and graph-structured. The volume of data and the complexity of scientific concepts and relations referred to therein promote the application of advanced knowledge-driven technologies for managing and interpreting data, with the ultimate aim to advance scientific discovery. In this survey and position paper, we discuss recent developments and advances in the use of graph-based technologies in life sciences and set out a vision for how these technologies will impact these fields into the future. We focus on three broad topics: the construction and management of Knowledge Graphs (KGs), the use of KGs and associated technologies in the discovery of new knowledge, and the use of KGs in artificial intelligence applications to support explanations (explainable AI). We select a few exemplary use cases for each topic, discuss the challenges and open research questions within these topics, and conclude with a perspective and outlook that summarizes the overarching challenges and their potential solutions as a guide for future research.

Forest Mixing: investigating the impact of multiple search trees and a shared refinements pool on ontology learning

  • paper_url: http://arxiv.org/abs/2309.17252
  • repo_url: None
  • paper_authors: Marco Pop-Mihali, Adrian Groza
  • for: Aims to develop white-box machine learning algorithms, specifically algorithms for learning axioms in description logic.
  • methods: Extends the Class Expression Learning for Ontology Engineering (CELOE) algorithm in the DL-Learner tool with multiple search trees and a shared pool of refinements that split the search space into smaller subspaces, and introduces a conjunction operation over the best class expressions from each tree that keeps the most informative results.
  • results: With the current implementation and settings, the Forest Mixing approach does not outperform the traditional CELOE; nevertheless, the conceptual proposal may stimulate future improvements in finding class expressions in ontologies, particularly when traversing large search spaces.
    Abstract We aim at developing white-box machine learning algorithms. We focus here on algorithms for learning axioms in description logic. We extend the Class Expression Learning for Ontology Engineering (CELOE) algorithm contained in the DL-Learner tool. The approach uses multiple search trees and a shared pool of refinements in order to split the search space in smaller subspaces. We introduce the conjunction operation of best class expressions from each tree, keeping the results which give the most information. The aim is to foster exploration from a diverse set of starting classes and to streamline the process of finding class expressions in ontologies, particularly in large search spaces. The current implementation and settings indicated that the Forest Mixing approach did not outperform the traditional CELOE. Despite these results, the conceptual proposal brought forward by this approach may stimulate future improvements in class expression finding in ontologies and influence the way we traverse search spaces in general.

Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering

  • paper_url: http://arxiv.org/abs/2309.17249
  • repo_url: None
  • paper_authors: Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine Heller, Subhrajit Roy
  • for: Addressing the problem of prompt brittleness and various bias factors in large language models (LLMs) to improve their performance.
  • methods: Propose a simple and intuitive calibration method called Batch Calibration (BC) that controls contextual bias from batched input and unifies various prior approaches.
  • results: Demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks using PaLM 2-(S, M, L) and CLIP models.
    Abstract Prompting and in-context learning (ICL) have become efficient learning paradigms for large language models (LLMs). However, LLMs suffer from prompt brittleness and various bias factors in the prompt, including but not limited to the formatting, the choice verbalizers, and the ICL examples. To address this problem that results in unexpected performance degradation, calibration methods have been developed to mitigate the effects of these biases while recovering LLM performance. In this work, we first conduct a systematic analysis of the existing calibration methods, where we both provide a unified view and reveal the failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches, and effectively addresses the aforementioned issues. BC is zero-shot, inference-only, and incurs negligible additional costs. In the few-shot setup, we further extend BC to allow it to learn the contextual bias from labeled data. We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks.
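
One plausible reading of the batch calibration idea in code: estimate the contextual prior over labels as the mean per-label log-probability across a batch of test inputs, then subtract it before taking the argmax. The exact estimator used in the paper may differ.

```python
# Sketch of batch calibration for in-context classification scores (one plausible reading
# of BC: estimate the contextual bias as the batch-mean log-probability per label).
import numpy as np

def batch_calibrated_predictions(log_probs: np.ndarray) -> np.ndarray:
    """log_probs: shape (batch_size, num_labels), per-label LLM log-probabilities."""
    bias = log_probs.mean(axis=0, keepdims=True)     # contextual prior estimated from the batch
    return (log_probs - bias).argmax(axis=1)

# Example: a prompt that systematically inflates label 0.
log_probs = np.array([[-0.2, -1.9],
                      [-0.3, -0.5],
                      [-0.4, -0.6],
                      [-0.1, -2.2]])
print("uncalibrated:", log_probs.argmax(axis=1))          # all biased toward label 0
print("calibrated:  ", batch_calibrated_predictions(log_probs))
```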

MORPH: Design Co-optimization with Reinforcement Learning via a Differentiable Hardware Model Proxy

  • paper_url: http://arxiv.org/abs/2309.17227
  • repo_url: None
  • paper_authors: Zhanpeng He, Matei Ciocarlie
  • for: Describes MORPH, a method for co-optimizing hardware design parameters and control policies in simulation using reinforcement learning.
  • methods: Uses a differentiable proxy hardware model that is co-optimized alongside a long-horizon control policy with RL, while keeping the proxy as close as possible to its realistic counterpart.
  • results: On simulated 2D reaching and 3D multi-fingered manipulation tasks, MORPH effectively co-optimizes hardware design parameters and control policies while ensuring the hardware proxy stays close to the realistic hardware model and the task is still completed.
    Abstract We introduce MORPH, a method for co-optimization of hardware design parameters and control policies in simulation using reinforcement learning. Like most co-optimization methods, MORPH relies on a model of the hardware being optimized, usually simulated based on the laws of physics. However, such a model is often difficult to integrate into an effective optimization routine. To address this, we introduce a proxy hardware model, which is always differentiable and enables efficient co-optimization alongside a long-horizon control policy using RL. MORPH is designed to ensure that the optimized hardware proxy remains as close as possible to its realistic counterpart, while still enabling task completion. We demonstrate our approach on simulated 2D reaching and 3D multi-fingered manipulation tasks.

RSAM: Learning on manifolds with Riemannian Sharpness-aware Minimization

  • paper_url: http://arxiv.org/abs/2309.17215
  • repo_url: None
  • paper_authors: Tuan Truong, Hoang-Phi Nguyen, Tung Pham, Minh-Tuan Tran, Mehrtash Harandi, Dinh Phung, Trung Le
  • for: Improving the generalization ability and robustness of models.
  • methods: Applies geometric principles to optimization and extends the Sharpness-Aware Minimization (SAM) optimizer to Riemannian manifolds, introducing a notion of sharpness on manifolds together with a theoretical analysis that yields a tighter bound on the generalization gap.
  • results: Proposes the Riemannian Sharpness-Aware Minimization (RSAM) algorithm and shows, on image classification and contrastive learning across datasets including CIFAR100, CIFAR10, and FGVCAircraft, that RSAM improves generalization ability and robustness.
    Abstract Nowadays, understanding the geometry of the loss landscape shows promise in enhancing a model's generalization ability. In this work, we draw upon prior works that apply geometric principles to optimization and present a novel approach to improve robustness and generalization ability for constrained optimization problems. Indeed, this paper aims to generalize the Sharpness-Aware Minimization (SAM) optimizer to Riemannian manifolds. In doing so, we first extend the concept of sharpness and introduce a novel notion of sharpness on manifolds. To support this notion of sharpness, we present a theoretical analysis characterizing generalization capabilities with respect to manifold sharpness, which demonstrates a tighter bound on the generalization gap, a result not known before. Motivated by this analysis, we introduce our algorithm, Riemannian Sharpness-Aware Minimization (RSAM). To demonstrate RSAM's ability to enhance generalization ability, we evaluate and contrast our algorithm on a broad set of problems, such as image classification and contrastive learning across different datasets, including CIFAR100, CIFAR10, and FGVCAircraft. Our code is publicly available at \url{https://t.ly/RiemannianSAM}.
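
A schematic comparison of the Euclidean SAM objective and a Riemannian analogue, written out to make the manifold generalization concrete; this is my rendering of the idea, not necessarily the exact formulation used by RSAM.

```latex
% Schematic comparison (illustrative rendering; the paper's exact formulation may differ).
% Euclidean SAM: perturb weights within a Euclidean ball before the update.
\min_{w}\ \max_{\|\epsilon\|_2 \le \rho} L(w + \epsilon),
\qquad
\epsilon^{*} \approx \rho\,\frac{\nabla L(w)}{\|\nabla L(w)\|_2}.

% Riemannian analogue: replace the Euclidean gradient by the Riemannian gradient
% \mathrm{grad}\,L(w) \in T_w\mathcal{M} and the additive step by a retraction R_w.
\min_{w \in \mathcal{M}}\ \max_{\epsilon \in T_w\mathcal{M},\ \|\epsilon\|_w \le \rho} L\bigl(R_w(\epsilon)\bigr),
\qquad
\epsilon^{*} \approx \rho\,\frac{\mathrm{grad}\,L(w)}{\|\mathrm{grad}\,L(w)\|_w}.
```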

ComSD: Balancing Behavioral Quality and Diversity in Unsupervised Skill Discovery

  • paper_url: http://arxiv.org/abs/2309.17203
  • repo_url: https://github.com/liuxin0824/comsd
  • paper_authors: Xin Liu, Yaran Chen, Dongbin Zhao
  • for: Proposes Contrastive multi-objectives Skill Discovery (ComSD), an unsupervised skill discovery method intended to produce diverse, qualified behaviors that adapt well to various downstream tasks.
  • methods: Uses contrastive learning for a more reasonable estimation of skill-conditioned entropy in the mutual-information decomposition, and proposes a dynamic weighting mechanism that balances the different entropy estimates into a novel multi-objective intrinsic reward.
  • results: In numerical evaluations, ComSD exhibits state-of-the-art adaptation performance, significantly outperforming recent advanced skill discovery methods on all skill combination tasks and most skill finetuning tasks.
    Abstract Learning diverse and qualified behaviors for utilization and adaptation without supervision is a key ability of intelligent creatures. Ideal unsupervised skill discovery methods are able to produce diverse and qualified skills in the absence of extrinsic reward, while the discovered skill set can efficiently adapt to downstream tasks in various ways. Maximizing the Mutual Information (MI) between skills and visited states can achieve ideal skill-conditioned behavior distillation in theory. However, it's difficult for recent advanced methods to well balance behavioral quality (exploration) and diversity (exploitation) in practice, which may be attributed to the unreasonable MI estimation by their rigid intrinsic reward design. In this paper, we propose Contrastive multi-objectives Skill Discovery (ComSD) which tries to mitigate the quality-versus-diversity conflict of discovered behaviors through a more reasonable MI estimation and a dynamically weighted intrinsic reward. ComSD proposes to employ contrastive learning for a more reasonable estimation of skill-conditioned entropy in MI decomposition. In addition, a novel weighting mechanism is proposed to dynamically balance different entropy (in MI decomposition) estimations into a novel multi-objective intrinsic reward, to improve both skill diversity and quality. For challenging robot behavior discovery, ComSD can produce a qualified skill set consisting of diverse behaviors at different activity levels, which recent advanced methods cannot. On numerical evaluations, ComSD exhibits state-of-the-art adaptation performance, significantly outperforming recent advanced skill discovery methods across all skill combination tasks and most skill finetuning tasks. Codes will be released at https://github.com/liuxin0824/ComSD.

An Investigation Into Race Bias in Random Forest Models Based on Breast DCE-MRI Derived Radiomics Features

  • paper_url: http://arxiv.org/abs/2309.17197
  • repo_url: None
  • paper_authors: Mohamed Huti, Tiarna Lee, Elinor Sawyer, Andrew P. King
  • for: Investigates whether random forest (RF) models trained on radiomics features derived from dynamic contrast-enhanced MRI (DCE-MRI) to predict breast cancer molecular subtype are affected by race bias.
  • methods: Uses radiomics features derived from DCE-MRI data and trains RF models to predict the race of breast cancer patients from these features.
  • results: Radiomics features derived from DCE-MRI data contain race-identifiable information: RF models can predict White versus Black race with 60-70% accuracy, depending on the feature subset. Moreover, RF models trained to predict tumour molecular subtype on race-imbalanced data appear to behave in a biased way, performing better on test data from the race on which they were trained.
    Abstract Recent research has shown that artificial intelligence (AI) models can exhibit bias in performance when trained using data that are imbalanced by protected attribute(s). Most work to date has focused on deep learning models, but classical AI techniques that make use of hand-crafted features may also be susceptible to such bias. In this paper we investigate the potential for race bias in random forest (RF) models trained using radiomics features. Our application is prediction of tumour molecular subtype from dynamic contrast enhanced magnetic resonance imaging (DCE-MRI) of breast cancer patients. Our results show that radiomics features derived from DCE-MRI data do contain race-identifiable information, and that RF models can be trained to predict White and Black race from these data with 60-70% accuracy, depending on the subset of features used. Furthermore, RF models trained to predict tumour molecular subtype using race-imbalanced data seem to produce biased behaviour, exhibiting better performance on test data from the race on which they were trained.

PARF: Primitive-Aware Radiance Fusion for Indoor Scene Novel View Synthesis

  • paper_url: http://arxiv.org/abs/2309.17190
  • repo_url: None
  • paper_authors: Haiyang Ying, Baowei Jiang, Jinzhi Zhang, Di Xu, Tao Yu, Qionghai Dai, Lu Fang
  • for: Proposes a method for fast scene radiance field reconstruction with strong novel view synthesis performance and convenient scene editing functionality.
  • methods: Uses semantic parsing and primitive extraction to constrain and accelerate radiance field reconstruction, and proposes a primitive-aware hybrid rendering strategy that enjoys the best of both volumetric and primitive rendering; a reconstruction pipeline performs primitive parsing and radiance field learning iteratively for each input frame, fusing semantic, primitive, and radiance information in a single framework.
  • results: Extensive evaluations demonstrate fast reconstruction, high rendering quality, and convenient editing functionality.
    Abstract This paper proposes a method for fast scene radiance field reconstruction with strong novel view synthesis performance and convenient scene editing functionality. The key idea is to fully utilize semantic parsing and primitive extraction for constraining and accelerating the radiance field reconstruction process. To fulfill this goal, a primitive-aware hybrid rendering strategy was proposed to enjoy the best of both volumetric and primitive rendering. We further contribute a reconstruction pipeline conducts primitive parsing and radiance field learning iteratively for each input frame which successfully fuses semantic, primitive, and radiance information into a single framework. Extensive evaluations demonstrate the fast reconstruction ability, high rendering quality, and convenient editing functionality of our method.

Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training

  • paper_url: http://arxiv.org/abs/2309.17179
  • repo_url: https://github.com/waterhorse1/llm_tree_search
  • paper_authors: Xidong Feng, Ziyu Wan, Muning Wen, Ying Wen, Weinan Zhang, Jun Wang
  • for: Improving the reasoning and decoding ability of large language models (LLMs).
  • methods: Uses an AlphaZero-like tree-search framework with a learned value function to guide LLM decoding during both inference and training.
  • results: Empirical evaluations on reasoning, planning, and RLHF alignment tasks validate the effectiveness of TS-LLM, which is generally applicable and scalable, even on trees with a depth of 64.
    Abstract Large language models (LLMs) typically employ sampling or beam search, accompanied by prompts such as Chain-of-Thought (CoT), to boost reasoning and decoding ability. Recent work like Tree-of-Thought (ToT) and Reasoning via Planning (RAP) aim to augment the reasoning capabilities of LLMs by utilizing tree-search algorithms to guide multi-step reasoning. These methods mainly focus on LLMs' reasoning ability during inference and heavily rely on human-designed prompts to activate LLM as a value function, which lacks general applicability and scalability. To address these limitations, we present an AlphaZero-like tree-search framework for LLMs (termed TS-LLM), systematically illustrating how tree-search with a learned value function can guide LLMs' decoding ability. TS-LLM distinguishes itself in two key ways: (1) Leveraging a learned value function, our approach can be generally applied to different tasks beyond reasoning (such as RLHF alignment), and LLMs of any size, without prompting advanced, large-scale models. (2) It can guide LLM's decoding during both inference and training. Empirical evaluations across reasoning, planning, and RLHF alignment tasks validate the effectiveness of TS-LLM, even on trees with a depth of 64.
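
A minimal AlphaZero-style tree search over token sequences, with PUCT selection, expansion from a proposal distribution, leaf evaluation by a value function, and backpropagation. The propose_fn and value_fn stubs stand in for an LLM policy and a learned value model; they and the hyperparameters are purely illustrative.

```python
# Minimal AlphaZero-style tree search over token sequences (illustrative sketch only).
import math

class Node:
    def __init__(self, seq, prior=1.0):
        self.seq, self.prior = seq, prior
        self.children, self.visits, self.value_sum = {}, 0, 0.0
    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def propose_fn(seq):                       # stand-in for LLM next-token proposals with priors
    return {"a": 0.5, "b": 0.3, "c": 0.2}

def value_fn(seq):                         # stand-in for a learned value function
    return seq.count("a") / max(len(seq), 1)

def select_child(node, c_puct=1.5):
    def score(child):
        u = c_puct * child.prior * math.sqrt(node.visits) / (1 + child.visits)
        return child.q() + u
    return max(node.children.values(), key=score)

def search(root_seq=(), n_sims=200, max_depth=5):
    root = Node(tuple(root_seq))
    for _ in range(n_sims):
        node, path = root, [root]
        while node.children and len(node.seq) < max_depth:      # selection
            node = select_child(node)
            path.append(node)
        if len(node.seq) < max_depth and not node.children:     # expansion
            for tok, p in propose_fn(node.seq).items():
                node.children[tok] = Node(node.seq + (tok,), prior=p)
        value = value_fn(node.seq)                               # leaf evaluation
        for n in path:                                           # backpropagation
            n.visits += 1
            n.value_sum += value
    best = max(root.children.values(), key=lambda c: c.visits)
    return best.seq[-1]

print("chosen next token:", search())     # decoding repeats this loop once per emitted token
```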

RLAdapter: Bridging Large Language Models to Reinforcement Learning in Open Worlds

  • paper_url: http://arxiv.org/abs/2309.17176
  • repo_url: None
  • paper_authors: Wanpeng Zhang, Zongqing Lu
  • for: Enhancing decision-making performance in reinforcement learning by using large language models (LLMs) to help agents learn policies.
  • methods: The RLAdapter framework attaches an adapter model to the RL training process; fine-tuning a lightweight language model with information generated during the RL agent's training helps the LLM adapt to the downstream task and thus provide better guidance to the agent.
  • results: Experiments in the Crafter environment show that RLAdapter surpasses the SOTA baselines, and agents under the framework exhibit common-sense behaviors that are absent in baseline models.
    Abstract While reinforcement learning (RL) shows remarkable success in decision-making problems, it often requires a lot of interactions with the environment, and in sparse-reward environments, it is challenging to learn meaningful policies. Large Language Models (LLMs) can potentially provide valuable guidance to agents in learning policies, thereby enhancing the performance of RL algorithms in such environments. However, LLMs often encounter difficulties in understanding downstream tasks, which hinders their ability to optimally assist agents in these tasks. A common approach to mitigating this issue is to fine-tune the LLMs with task-related data, enabling them to offer useful guidance for RL agents. However, this approach encounters several difficulties, such as inaccessible model weights or the need for significant computational resources, making it impractical. In this work, we introduce RLAdapter, a framework that builds a better connection between RL algorithms and LLMs by incorporating an adapter model. Within the RLAdapter framework, fine-tuning a lightweight language model with information generated during the training process of RL agents significantly aids LLMs in adapting to downstream tasks, thereby providing better guidance for RL agents. We conducted experiments to evaluate RLAdapter in the Crafter environment, and the results show that RLAdapter surpasses the SOTA baselines. Furthermore, agents under our framework exhibit common-sense behaviors that are absent in baseline models.

A Vision-Guided Robotic System for Grasping Harvested Tomato Trusses in Cluttered Environments

  • paper_url: http://arxiv.org/abs/2309.17170
  • repo_url: None
  • paper_authors: Luuk van den Bent, Tomás Coleman, Robert Babuska
  • for: automating truss tomato weighing and packaging processes
  • methods: deep learning-based vision system to identify and grasp trusses in a crate with clutter
  • results: 100% clearance rate and 93% success rate of grasping trusses on the first try
    Abstract Currently, truss tomato weighing and packaging require significant manual work. The main obstacle to automation lies in the difficulty of developing a reliable robotic grasping system for already harvested trusses. We propose a method to grasp trusses that are stacked in a crate with considerable clutter, which is how they are commonly stored and transported after harvest. The method consists of a deep learning-based vision system to first identify the individual trusses in the crate and then determine a suitable grasping location on the stem. To this end, we have introduced a grasp pose ranking algorithm with online learning capabilities. After selecting the most promising grasp pose, the robot executes a pinch grasp without needing touch sensors or geometric models. Lab experiments with a robotic manipulator equipped with an eye-in-hand RGB-D camera showed a 100% clearance rate when tasked to pick all trusses from a pile. 93% of the trusses were successfully grasped on the first try, while the remaining 7% required more attempts.

An evaluation of GPT models for phenotype concept recognition

  • paper_url: http://arxiv.org/abs/2309.17169
  • repo_url: None
  • paper_authors: Tudor Groza, Harry Caufield, Dylan Gration, Gareth Baynam, Melissa A Haendel, Peter N Robinson, Chris J Mungall, Justin T Reese
  • for: Examines the performance of the latest Generative Pre-trained Transformer (GPT) models underpinning ChatGPT on clinical deep phenotyping, i.e., phenotype concept recognition.
  • methods: The experimental setup includes seven prompts of varying specificity, two GPT models (gpt-3.5 and gpt-4.0), and an established gold standard for phenotype recognition.
  • results: The models have not yet reached state-of-the-art performance: the best run, using few-shot learning, achieved an F1 score of 0.41, compared with 0.62 for the current best-in-class tool.
    Abstract Objective: Clinical deep phenotyping plays a critical role in both the diagnosis of patients with rare disorders as well as in building care coordination plans. The process relies on modelling and curating patient profiles using ontology concepts, usually from the Human Phenotype Ontology. Machine learning methods have been widely adopted to support this phenotype concept recognition task. With the significant shift in the use of large language models (LLMs) for most NLP tasks, herewithin, we examine the performance of the latest Generative Pre-trained Transformer (GPT) models underpinning ChatGPT in clinical deep phenotyping. Materials and Methods: The experimental setup of the study included seven prompts of various levels of specificity, two GPT models (gpt-3.5 and gpt-4.0) and an established gold standard for phenotype recognition. Results: Our results show that, currently, these models have not yet achieved state of the art performance. The best run, using few-shots learning, achieved 0.41 F1 score, compared to a 0.62 F1 score achieved by the current best in class tool. Conclusion: The non-deterministic nature of the outcomes and the lack of concordance between different runs using the same prompt and input makes the use of these LLMs in clinical settings problematic.

DyVal: Graph-informed Dynamic Evaluation of Large Language Models

  • paper_url: http://arxiv.org/abs/2309.17167
  • repo_url: None
  • paper_authors: Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, Xing Xie
  • for: Introduces DyVal, a novel, general, and flexible protocol for the dynamic evaluation of large language models (LLMs).
  • methods: Uses directed acyclic graphs (DAGs) to dynamically generate evaluation samples with controllable complexity.
  • results: Experiments show that LLMs perform worse on DyVal-generated evaluation samples of varying complexity, underscoring the importance of dynamic evaluation. The authors also analyze failure cases and the results of different prompting methods, and find that DyVal-generated samples are useful not only as evaluation sets but also as fine-tuning data that improves LLM performance on existing benchmarks.
    Abstract Large language models (LLMs) have achieved remarkable performance in various evaluation benchmarks. However, concerns about their performance are raised on potential data contamination in their considerable volume of training corpus. Moreover, the static nature and fixed complexity of current benchmarks may inadequately gauge the advancing capabilities of LLMs. In this paper, we introduce DyVal, a novel, general, and flexible evaluation protocol for dynamic evaluation of LLMs. Based on our proposed dynamic evaluation framework, we build graph-informed DyVal by leveraging the structural advantage of directed acyclic graphs to dynamically generate evaluation samples with controllable complexities. DyVal generates challenging evaluation sets on reasoning tasks including mathematics, logical reasoning, and algorithm problems. We evaluate various LLMs ranging from Flan-T5-large to ChatGPT and GPT4. Experiments demonstrate that LLMs perform worse in DyVal-generated evaluation samples with different complexities, emphasizing the significance of dynamic evaluation. We also analyze the failure cases and results of different prompting methods. Moreover, DyVal-generated samples are not only evaluation sets, but also helpful data for fine-tuning to improve the performance of LLMs on existing benchmarks. We hope that DyVal can shed light on the future evaluation research of LLMs.
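
An illustrative generator of DAG-based arithmetic evaluation samples, where complexity is controlled by the number of internal nodes; the real DyVal generation rules and complexity controls are richer than this toy version.

```python
# Sketch of DAG-based dynamic sample generation for arithmetic reasoning (toy version).
import random, operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def generate_dag_problem(n_internal=4, seed=None):
    rng = random.Random(seed)
    nodes = []                               # each node: (name, description, value)
    for i in range(3):                       # leaf nodes hold constants
        v = rng.randint(1, 9)
        nodes.append((f"x{i}", f"x{i} = {v}", v))
    for _ in range(n_internal):              # internal nodes combine earlier nodes (acyclic)
        a, b = rng.sample(range(len(nodes)), 2)
        op = rng.choice(list(OPS))
        name = f"x{len(nodes)}"
        value = OPS[op](nodes[a][2], nodes[b][2])
        nodes.append((name, f"{name} = {nodes[a][0]} {op} {nodes[b][0]}", value))
    question = "\n".join(desc for _, desc, _ in nodes) + f"\nWhat is the value of {nodes[-1][0]}?"
    return question, nodes[-1][2]            # prompt text and ground-truth answer

q, answer = generate_dag_problem(n_internal=5, seed=0)
print(q)
print("ground truth:", answer)               # complexity scales with the number of internal nodes
```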

Advances in Kidney Biopsy Structural Assessment through Dense Instance Segmentation

  • paper_url: http://arxiv.org/abs/2309.17166
  • repo_url: None
  • paper_authors: Zhan Xiong, Junling He, Pieter Valkema, Tri Q. Nguyen, Maarten Naesens, Jesper Kers, Fons J. Verbeek
  • for: Addresses the problem of automatically obtaining statistics per segmented anatomical object in kidney biopsies, in order to reduce labor and the inter-observer variability of semi-quantitative lesion scores.
  • methods: Proposes an anchor-free instance segmentation model that combines diffusion models, transformer modules, and regional convolutional neural networks (RCNNs), and can efficiently handle densely touching anatomical structures of multiple classes and varying sizes and shapes.
  • results: Trained on a single NVIDIA GeForce RTX 3090 GPU, the model can efficiently recognize more than 500 objects per renal biopsy across three common anatomical classes (glomeruli, tubuli, and arteries). The dataset consists of 303 patches extracted from 148 Jones' silver-stained renal whole slide images (WSIs), with 249 patches used for training and 54 for evaluation. Without adjustment or retraining, the model transfers directly to PAS-stained WSIs with decent instance segmentation results, outperforming baseline models and reaching a new state-of-the-art detection AP of 51.7%.
    Abstract The kidney biopsy is the gold standard for the diagnosis of kidney diseases. Lesion scores made by expert renal pathologists are semi-quantitative and suffer from high inter-observer variability. Automatically obtaining statistics per segmented anatomical object, therefore, can bring significant benefits in reducing labor and this inter-observer variability. Instance segmentation for a biopsy, however, has been a challenging problem due to (a) the on average large number (around 300 to 1000) of densely touching anatomical structures, (b) with multiple classes (at least 3) and (c) in different sizes and shapes. The currently used instance segmentation models cannot simultaneously deal with these challenges in an efficient yet generic manner. In this paper, we propose the first anchor-free instance segmentation model that combines diffusion models, transformer modules, and RCNNs (regional convolution neural networks). Our model is trained on just one NVIDIA GeForce RTX 3090 GPU, but can efficiently recognize more than 500 objects with 3 common anatomical object classes in renal biopsies, i.e., glomeruli, tubuli, and arteries. Our data set consisted of 303 patches extracted from 148 Jones' silver-stained renal whole slide images (WSIs), where 249 patches were used for training and 54 patches for evaluation. In addition, without adjustment or retraining, the model can directly transfer its domain to generate decent instance segmentation results from PAS-stained WSIs. Importantly, it outperforms other baseline models and reaches an AP 51.7% in detection as the new state-of-the-art.

Compromise in Multilateral Negotiations and the Global Regulation of Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2309.17158
  • repo_url: None
  • paper_authors: Michal Natorski
  • for: Examines the international negotiations that led to UNESCO's Recommendation on the Ethics of Artificial Intelligence, adopted in November 2021, and the controversies and points of compromise among UNESCO member states.
  • methods: Draws on a unique set of primary sources, including written positions and recorded deliberations, to explain how a global compromise was achieved despite the multiplicity of member-state positions reflecting a variety of liberal and sovereignist preferences.
  • results: Building on Boltanski's pragmatic sociology, the paper attributes the multilateral compromise to two mechanisms embedded in the practice of multilateral negotiations: structural normative hybridity and situated normative ambiguity, which together link macro-normative structures with the situated debates of multilateral negotiations.
    Abstract As artificial intelligence (AI) technologies spread worldwide, international discussions have increasingly focused on their consequences for democracy, human rights, fundamental freedoms, security, and economic and social development. In this context, UNESCO's Recommendation on the Ethics of Artificial Intelligence, adopted in November 2021, has emerged as the first global normative framework for AI development and deployment. The intense negotiations of every detail of the document brought forth numerous controversies among UNESCO member states. Drawing on a unique set of primary sources, including written positions and recorded deliberations, this paper explains the achievement of global compromise on AI regulation despite the multiplicity of UNESCO member-state positions representing a variety of liberal and sovereignist preferences. Building upon Boltanski's pragmatic sociology, it conceptualises the practice of multilateral negotiations and attributes the multilateral compromise to two embedded therein mechanisms: Structural normative hybridity and situated normative ambiguity allowed to accomplish a compromise by linking macro-normative structures with situated debates of multilateral negotiations.

Age Group Discrimination via Free Handwriting Indicators

  • paper_url: http://arxiv.org/abs/2309.17156
  • repo_url: None
  • paper_authors: Eugenio Lomurno, Simone Toffoli, Davide Di Febbo, Matteo Matteucci, Francesca Lunardini, Simona Ferrante
  • for: Explores an approach that uses an instrumented ink pen to ecologically assess handwriting for age group classification, with a view to improving the assessment and early detection of frailty in older adults.
  • methods: Computes fourteen gesture- and tremor-related indicators from content-free handwriting data of 80 healthy participants in different age groups, and uses them in five classification tasks (including discriminating between adjacent and non-adjacent age groups) with CatBoost and Logistic Regression classifiers.
  • results: The classifiers show exceptional performance, with accuracy ranging from 82.5% to 97.5%, precision from 81.8% to 100%, recall from 75% to 100%, and ROC-AUC from 92.2% to 100%.
    Abstract The growing global elderly population is expected to increase the prevalence of frailty, posing significant challenges to healthcare systems. Frailty, a syndrome associated with ageing, is characterised by progressive health decline, increased vulnerability to stressors and increased risk of mortality. It represents a significant burden on public health and reduces the quality of life of those affected. The lack of a universally accepted method to assess frailty and a standardised definition highlights a critical research gap. Given this lack and the importance of early prevention, this study presents an innovative approach using an instrumented ink pen to ecologically assess handwriting for age group classification. Content-free handwriting data from 80 healthy participants in different age groups (20-40, 41-60, 61-70 and 70+) were analysed. Fourteen gesture- and tremor-related indicators were computed from the raw data and used in five classification tasks. These tasks included discriminating between adjacent and non-adjacent age groups using Catboost and Logistic Regression classifiers. Results indicate exceptional classifier performance, with accuracy ranging from 82.5% to 97.5%, precision from 81.8% to 100%, recall from 75% to 100% and ROC-AUC from 92.2% to 100%. Model interpretability, facilitated by SHAP analysis, revealed age-dependent sensitivity of temporal and tremor-related handwriting features. Importantly, this classification method offers potential for early detection of abnormal signs of ageing in uncontrolled settings such as remote home monitoring, thereby addressing the critical issue of frailty detection and contributing to improved care for older adults.
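
An illustrative classification pipeline over a 14-dimensional handwriting feature matrix, using scikit-learn's Logistic Regression (a CatBoost classifier could be swapped in) and the same metrics reported in the abstract. The data below are synthetic placeholders, not the study's handwriting indicators.

```python
# Illustrative age-group classification on 14 handwriting indicators (synthetic stand-in data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 14))                          # 80 participants x 14 indicators (synthetic)
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=80) > 0).astype(int)  # binary age group

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)
print("accuracy :", accuracy_score(y_te, pred))
print("precision:", precision_score(y_te, pred))
print("recall   :", recall_score(y_te, pred))
print("roc-auc  :", roc_auc_score(y_te, proba))
```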

Using Large Language Models for Qualitative Analysis can Introduce Serious Bias

  • paper_url: http://arxiv.org/abs/2309.17147
  • repo_url: None
  • paper_authors: Julian Ashwin, Aditya Chhabra, Vijayendra Rao
  • for: Asks whether large language models (LLMs) can help analyse large-N qualitative data from open-ended interviews, with an application to transcripts of interviews with Rohingya refugees in Cox's Bazaar, Bangladesh.
  • methods: Uses LLMs to annotate interview transcripts and evaluates the quality and biases of those annotations.
  • results: Using LLMs to annotate text carries a risk of introducing biases that can lead to misleading inferences, since the errors LLMs make are not random with respect to the characteristics of the interview subjects. Training simpler supervised models on high-quality human annotations with flexible coding leads to less measurement error and bias, so the authors argue it is probably preferable to train a bespoke model on such annotations than to use an LLM for annotation.
    Abstract Large Language Models (LLMs) are quickly becoming ubiquitous, but the implications for social science research are not yet well understood. This paper asks whether LLMs can help us analyse large-N qualitative data from open-ended interviews, with an application to transcripts of interviews with Rohingya refugees in Cox's Bazaar, Bangladesh. We find that a great deal of caution is needed in using LLMs to annotate text as there is a risk of introducing biases that can lead to misleading inferences. We here mean bias in the technical sense, that the errors that LLMs make in annotating interview transcripts are not random with respect to the characteristics of the interview subjects. Training simpler supervised models on high-quality human annotations with flexible coding leads to less measurement error and bias than LLM annotations. Therefore, given that some high quality annotations are necessary in order to asses whether an LLM introduces bias, we argue that it is probably preferable to train a bespoke model on these annotations than it is to use an LLM for annotation.

Prototype Generation: Robust Feature Visualisation for Data Independent Interpretability

  • paper_url: http://arxiv.org/abs/2309.17144
  • repo_url: None
  • paper_authors: Arush Tagade, Jessica Rumbelow
  • for: Presents a stricter and more robust feature visualisation method for model-agnostic, data-independent interpretability of image classification models.
  • methods: Introduces Prototype Generation, which generates inputs that result in natural internal activation paths, countering previous claims that feature visualisation algorithms are untrustworthy due to unnatural internal activations.
  • results: The generated prototypes produce internal activations that are quantitatively similar to those of natural images, and interpreting them yields important insights, such as revealing spurious correlations and biases learned by models that quantitative evaluation over test sets cannot identify.
    Abstract We introduce Prototype Generation, a stricter and more robust form of feature visualisation for model-agnostic, data-independent interpretability of image classification models. We demonstrate its ability to generate inputs that result in natural activation paths, countering previous claims that feature visualisation algorithms are untrustworthy due to the unnatural internal activations. We substantiate these claims by quantitatively measuring similarity between the internal activations of our generated prototypes and natural images. We also demonstrate how the interpretation of generated prototypes yields important insights, highlighting spurious correlations and biases learned by models which quantitative methods over test-sets cannot identify.
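
For orientation, a vanilla activation-maximisation baseline: optimize an input image to maximize a target class logit of a (stand-in) classifier. Prototype Generation goes beyond this baseline by keeping the generated inputs on natural internal activation paths; this sketch is not the paper's method.

```python
# Vanilla activation-maximisation baseline (the naive approach Prototype Generation improves on).
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(                       # stand-in classifier; use a trained model in practice
    torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
    torch.nn.Linear(8, 10),
).eval()

target_class = 3
x = torch.zeros(1, 3, 64, 64, requires_grad=True)  # optimise the input image itself
opt = torch.optim.Adam([x], lr=0.05)

for step in range(200):
    opt.zero_grad()
    logit = model(x)[0, target_class]
    loss = -logit + 1e-3 * x.norm()                # maximise the class logit, lightly regularised
    loss.backward()
    opt.step()

print("final target logit:", model(x)[0, target_class].item())
```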

Revisiting Cephalometric Landmark Detection from the view of Human Pose Estimation with Lightweight Super-Resolution Head

  • paper_url: http://arxiv.org/abs/2309.17143
  • repo_url: https://github.com/5k5000/cldetection2023
  • paper_authors: Qian Wu, Si Yong Yeo, Yufei Chen, Jun Liu
  • for: Aims to improve the accuracy of cephalometric landmark detection by transferring techniques from human pose estimation (HPE) to this task.
  • methods: Builds a robust and adaptable baseline on the well-established MMPose HPE codebase and incorporates a lightweight, efficient super-resolution module that predicts heatmaps on high-resolution features, reducing quantization bias and further improving performance.
  • results: In the MICCAI CLDetection2023 challenge, the method ranks first on three metrics and third on the remaining one.
    Abstract Accurate localization of cephalometric landmarks holds great importance in the fields of orthodontics and orthognathics due to its potential for automating key point labeling. In the context of landmark detection, particularly in cephalometrics, it has been observed that existing methods often lack standardized pipelines and well-designed bias reduction processes, which significantly impact their performance. In this paper, we revisit a related task, human pose estimation (HPE), which shares numerous similarities with cephalometric landmark detection (CLD), and emphasize the potential for transferring techniques from the former field to benefit the latter. Motivated by this insight, we have developed a robust and adaptable benchmark based on the well-established HPE codebase known as MMPose. This benchmark can serve as a dependable baseline for achieving exceptional CLD performance. Furthermore, we introduce an upscaling design within the framework to further enhance performance. This enhancement involves the incorporation of a lightweight and efficient super-resolution module, which generates heatmap predictions on high-resolution features and leads to further performance refinement, benefiting from its ability to reduce quantization bias. In the MICCAI CLDetection2023 challenge, our method achieves 1st place ranking on three metrics and 3rd place on the remaining one. The code for our method is available at https://github.com/5k5000/CLdetection2023.

Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle?

  • paper_url: http://arxiv.org/abs/2309.17122
  • repo_url: https://github.com/aksw/llm-kg-bench
  • paper_authors: Johannes Frey, Lars-Peter Meyer, Natanael Arndt, Felix Brei, Kirill Bulert
  • for: 评估大型自然语言模型(LLM)在知识图工程中的能力
  • methods: 创建了五个任务来评估不同 LLM 的能力,包括解析、理解、分析和生成知识图,并将其集成到了 LLM-KG-Bench 自动评估系统中
  • results: 研究发现,最新的商业模型在使用 Turtle 语言方面优于其前代模型,但在严格遵守输出格式要求上仍存在明显不足,需要进一步改进。
    Abstract Large Language Models (LLMs) are advancing at a rapid pace, with significant improvements at natural language processing and coding tasks. Yet, their ability to work with formal languages representing data, specifically within the realm of knowledge graph engineering, remains under-investigated. To evaluate the proficiency of various LLMs, we created a set of five tasks that probe their ability to parse, understand, analyze, and create knowledge graphs serialized in Turtle syntax. These tasks, each embodying distinct degrees of complexity and being able to scale with the size of the problem, have been integrated into our automated evaluation system, the LLM-KG-Bench. The evaluation encompassed four commercially available LLMs - GPT-3.5, GPT-4, Claude 1.3, and Claude 2.0, as well as two freely accessible offline models, GPT4All Vicuna and GPT4All Falcon 13B. This analysis offers an in-depth understanding of the strengths and shortcomings of LLMs in relation to their application within RDF knowledge graph engineering workflows utilizing Turtle representation. While our findings show that the latest commercial models outperform their forerunners in terms of proficiency with the Turtle language, they also reveal an apparent weakness. These models fall short when it comes to adhering strictly to the output formatting constraints, a crucial requirement in this context.
    摘要 大型语言模型(LLM)发展迅速,在自然语言处理和编程任务上均有显著提升,但它们在知识图谱工程中处理以形式化语言表示的数据的能力仍缺乏研究。为评估不同 LLM 的水平,我们创建了五个任务,考察其解析、理解、分析和生成以 Turtle 语法序列化的知识图谱的能力;这些任务复杂度各不相同,并能随问题规模扩展,均已集成到我们的自动评估系统 LLM-KG-Bench 中。评估涵盖四种商用 LLM(GPT-3.5、GPT-4、Claude 1.3 和 Claude 2.0)以及两种可自由获取的离线模型(GPT4All Vicuna 和 GPT4All Falcon 13B)。该分析深入揭示了 LLM 在基于 Turtle 表示的 RDF 知识图谱工程流程中的优势与不足:最新的商用模型在 Turtle 语言能力上超越其前代模型,但在严格遵守输出格式约束这一关键要求上仍存在明显缺陷。
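
A toy illustration of one kind of check such a benchmark can automate: ask a model for Turtle output and verify that it parses with rdflib. The `call_llm` stub and the prompt are placeholders, not part of LLM-KG-Bench.

```python
# Ask an LLM for RDF in Turtle syntax and check that the output parses.
from rdflib import Graph

def call_llm(prompt: str) -> str:
    # placeholder: return a hard-coded answer instead of querying a real model
    return '@prefix ex: <http://example.org/> . ex:Alice ex:knows ex:Bob .'

def turtle_parses(text: str) -> bool:
    try:
        Graph().parse(data=text, format="turtle")
        return True
    except Exception:
        return False

answer = call_llm("Express 'Alice knows Bob' as RDF in Turtle syntax.")
print("syntactically valid Turtle:", turtle_parses(answer))
```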

Meta-Path Learning for Multi-relational Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2309.17113
  • repo_url: https://github.com/francescoferrini/multirelationalgnn
  • paper_authors: Francesco Ferrini, Antonio Longa, Andrea Passerini, Manfred Jaeger
  • for: 该论文旨在解决多关系图神经网络中识别信息性关系(informative relations)的问题。
  • methods: 该论文提出了一种新的方法,通过在小量 informative meta-paths 上学习 GNNs 来解决这个问题。关键元素是一种用于评估关系的潜在有用性的分数函数。
  • results: 实验表明,该方法即使在关系数量很多的情况下也能正确识别有用的 meta-paths,并在合成数据与真实数据的实验中显著优于现有的多关系 GNN。
    Abstract Existing multi-relational graph neural networks use one of two strategies for identifying informative relations: either they reduce this problem to low-level weight learning, or they rely on handcrafted chains of relational dependencies, called meta-paths. However, the former approach faces challenges in the presence of many relations (e.g., knowledge graphs), while the latter requires substantial domain expertise to identify relevant meta-paths. In this work we propose a novel approach to learn meta-paths and meta-path GNNs that are highly accurate based on a small number of informative meta-paths. Key element of our approach is a scoring function for measuring the potential informativeness of a relation in the incremental construction of the meta-path. Our experimental evaluation shows that the approach manages to correctly identify relevant meta-paths even with a large number of relations, and substantially outperforms existing multi-relational GNNs on synthetic and real-world experiments.
    摘要 现有的多关系图神经网络在识别信息性关系时采用两种策略之一:要么把该问题归结为低层次的权重学习,要么依赖人工设计的关系依赖链,即元路径(meta-path)。然而,前者在关系数量很多时(例如知识图谱)面临困难,后者则需要大量领域知识来确定相关的元路径。在这项工作中,我们提出一种新方法,仅基于少量信息性元路径来学习元路径及元路径 GNN,并达到很高的准确率。方法的关键在于一个分数函数,用于在逐步构建元路径的过程中衡量某个关系的潜在信息量。实验评估表明,即使关系数量很大,该方法也能正确识别相关的元路径,并在合成数据与真实数据实验中显著优于现有的多关系 GNN。

Dynamic Interpretability for Model Comparison via Decision Rules

  • paper_url: http://arxiv.org/abs/2309.17095
  • repo_url: None
  • paper_authors: Adam Rida, Marie-Jeanne Lesot, Xavier Renard, Christophe Marsala
  • for: 本文旨在解决可解释人工智能(XAI)方法难以准确刻画多个机器学习模型之间差异的问题。
  • methods: 本文提出了一种名为DeltaXplainer的模型无关方法,可以生成基于规则的解释,描述两个二分类器之间的差异。
  • results: 在合成数据集和真实数据集上针对多种模型比较场景(涉及不同类型的概念漂移)进行了实验,证明 DeltaXplainer 能有效刻画模型间的差异。
    Abstract Explainable AI (XAI) methods have mostly been built to investigate and shed light on single machine learning models and are not designed to capture and explain differences between multiple models effectively. This paper addresses the challenge of understanding and explaining differences between machine learning models, which is crucial for model selection, monitoring and lifecycle management in real-world applications. We propose DeltaXplainer, a model-agnostic method for generating rule-based explanations describing the differences between two binary classifiers. To assess the effectiveness of DeltaXplainer, we conduct experiments on synthetic and real-world datasets, covering various model comparison scenarios involving different types of concept drift.
    摘要 Explainable AI (XAI) 方法大多是为了研究和解释单个机器学习模型,而不是设计用于捕捉和解释多个模型之间的差异。这篇论文面临着理解和解释多个机器学习模型之间的差异的挑战,这是实际应用中的模型选择、监测和生命周期管理中非常重要的。我们提议了DeltaXplainer,一种模型无关的方法,用于生成对二分类模型之间差异的规则式解释。为了评估DeltaXplainer的有效性,我们在synthetic和实际世界数据集上进行了实验,覆盖了不同类型的概念漂移场景。
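
The general idea of rule-based difference explanation can be sketched as follows: label each sample by whether two classifiers disagree on it and fit a shallow, interpretable surrogate to that disagreement. This is an illustration of the concept only, not DeltaXplainer's rule-generation procedure; the dataset and models are synthetic stand-ins.

```python
# Fit an interpretable surrogate to the disagreement between two classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
model_a = LogisticRegression(max_iter=1000).fit(X, y)
model_b = RandomForestClassifier(random_state=0).fit(X, y)

disagree = (model_a.predict(X) != model_b.predict(X)).astype(int)   # 1 = models differ
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, disagree)
print(export_text(surrogate))   # rules describing where the two models diverge
```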

GAIA-1: A Generative World Model for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.17080
  • repo_url: None
  • paper_authors: Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, Gianluca Corrado
  • for: 本研究旨在通过生成逼真的驾驶场景并对自车(ego vehicle)行为进行控制,提高自动驾驶系统的安全性和可靠性。
  • methods: 我们引入了 GAIA-1(面向自动驾驶的生成式世界模型),它以视频、文本和动作为输入生成逼真的驾驶场景,并对自车行为和场景要素提供细粒度控制。我们将世界建模视为无监督的序列建模问题:把输入映射为离散 token,并预测序列中的下一个 token。
  • results: GAIA-1 能够学习高层结构和场景动力学、上下文感知、泛化能力以及对几何的理解;其学习到的表示能够捕捉对未来事件的预期,同时可以生成逼真的样本,为自动驾驶技术的训练带来新的可能并加速其发展。
    Abstract Autonomous driving promises transformative improvements to transportation, but building systems capable of safely navigating the unstructured complexity of real-world scenarios remains challenging. A critical problem lies in effectively predicting the various potential outcomes that may emerge in response to the vehicle's actions as the world evolves. To address this challenge, we introduce GAIA-1 ('Generative AI for Autonomy'), a generative world model that leverages video, text, and action inputs to generate realistic driving scenarios while offering fine-grained control over ego-vehicle behavior and scene features. Our approach casts world modeling as an unsupervised sequence modeling problem by mapping the inputs to discrete tokens, and predicting the next token in the sequence. Emerging properties from our model include learning high-level structures and scene dynamics, contextual awareness, generalization, and understanding of geometry. The power of GAIA-1's learned representation that captures expectations of future events, combined with its ability to generate realistic samples, provides new possibilities for innovation in the field of autonomy, enabling enhanced and accelerated training of autonomous driving technology.
    摘要 自动驾驶有望为交通带来变革性的提升,但构建能够在非结构化、复杂的真实场景中安全行驶的系统仍然充满挑战。其中一个关键问题是:随着世界的演化,如何有效预测车辆行为可能引发的各种潜在结果。为应对这一挑战,我们提出 GAIA-1(面向自动驾驶的生成式 AI),一种生成式世界模型,以视频、文本和动作为输入生成逼真的驾驶场景,并对自车行为和场景要素提供细粒度控制。我们的方法将世界建模视为无监督的序列建模问题:把输入映射为离散 token,并预测序列中的下一个 token。模型涌现出的能力包括学习高层结构与场景动力学、上下文感知、泛化以及对几何的理解。GAIA-1 学到的表示能够捕捉对未来事件的预期,结合其生成逼真样本的能力,为自动驾驶领域的创新提供了新的可能,有助于增强并加速自动驾驶技术的训练。
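
The "world modeling as unsupervised sequence modeling" framing can be illustrated with a tiny causal Transformer trained to predict the next discrete token; the tokeniser that would map video, text and actions into this vocabulary is assumed to exist and is not shown, and the architecture below is far smaller and simpler than GAIA-1.

```python
# Tiny next-token world model over an already-discretised token stream.
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    def __init__(self, vocab=1024, dim=128, heads=4, layers=2, ctx=256):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(ctx, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                        # tokens: (B, T) ints
        T = tokens.size(1)
        x = self.tok(tokens) + self.pos(torch.arange(T, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        return self.head(self.encoder(x, mask=mask))  # (B, T, vocab) next-token logits

tokens = torch.randint(0, 1024, (2, 32))
logits = TinyWorldModel()(tokens)
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 1024), tokens[:, 1:].reshape(-1))
print(loss.item())
```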

Assessment and treatment of visuospatial neglect using active learning with Gaussian processes regression

  • paper_url: http://arxiv.org/abs/2310.13701
  • repo_url: None
  • paper_authors: Ivan De Boi, Elissa Embrechts, Quirine Schatteman, Rudi Penne, Steven Truijen, Wim Saeys
  • for: 评估和诊断视空间忽视症的人工智能解决方案
  • methods: 基于高斯过程回归(Gaussian process regression)的主动学习方法,以减少患者完成评估所需的负担
  • results: 在真实临床环境中开展临床试验,与目前临床常用的视空间忽视测试相比,我们基于 AI 的 VR 评估更加敏感,同时保持较高的评分者内部一致性,证实了其作为诊断与监测工具的潜力。
    Abstract Visuospatial neglect is a disorder characterised by impaired awareness for visual stimuli located in regions of space and frames of reference. It is often associated with stroke. Patients can struggle with all aspects of daily living and community participation. Assessment methods are limited and show several shortcomings, considering they are mainly performed on paper and do not implement the complexity of daily life. Similarly, treatment options are sparse and often show only small improvements. We present an artificial intelligence solution designed to accurately assess a patient's visuospatial neglect in a three-dimensional setting. We implement an active learning method based on Gaussian process regression to reduce the effort it takes a patient to undergo an assessment. Furthermore, we describe how this model can be utilised in patient oriented treatment and how this opens the way to gamification, tele-rehabilitation and personalised healthcare, providing a promising avenue for improving patient engagement and rehabilitation outcomes. To validate our assessment module, we conducted clinical trials involving patients in a real-world setting. We compared the results obtained using our AI-based assessment with the widely used conventional visuospatial neglect tests currently employed in clinical practice. The validation process serves to establish the accuracy and reliability of our model, confirming its potential as a valuable tool for diagnosing and monitoring visuospatial neglect. Our VR application proves to be more sensitive, while intra-rater reliability remains high.
    摘要 视空间忽视是一种对特定空间区域和参照系中的视觉刺激感知受损的障碍,常与脑卒中相关,患者在日常生活和社区参与的各个方面都可能遇到困难。现有评估方法有限且存在诸多不足:它们主要在纸面上进行,无法体现日常生活的复杂性;治疗手段同样稀少,且往往改善有限。我们提出一种人工智能解决方案,用于在三维环境中精确评估患者的视空间忽视。我们采用基于高斯过程回归的主动学习方法,以减少患者完成评估所需的负担。此外,我们还说明了该模型如何用于以患者为中心的治疗,以及它如何为游戏化、远程康复和个性化医疗打开大门,为提升患者参与度和康复效果提供有前景的途径。为验证评估模块,我们在真实临床环境中开展了临床试验,将基于 AI 的评估结果与目前临床广泛使用的传统视空间忽视测试进行对比,以确认模型的准确性与可靠性及其作为诊断与监测工具的潜力。结果显示,我们的 VR 应用更加敏感,同时评分者内部一致性保持较高。
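
A minimal sketch of variance-driven active learning with Gaussian process regression, the core mechanism the assessment module relies on: fit a GP to the responses collected so far and query the candidate stimulus the model is most uncertain about. The 2-D candidate grid and the simulated patient response are illustrative assumptions.

```python
# Query-by-uncertainty active learning with a GP over a 2-D stimulus grid.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
candidates = np.stack(np.meshgrid(np.linspace(-1, 1, 20), np.linspace(-1, 1, 20)), -1).reshape(-1, 2)

def patient_response(x):                    # toy stand-in: weaker detection on the left side
    return 1.0 / (1.0 + np.exp(-6 * x[0])) + 0.05 * rng.standard_normal()

X = [candidates[rng.integers(len(candidates))]]
y = [patient_response(X[0])]
for _ in range(15):                         # 15 queries instead of testing all 400 points
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5)).fit(np.array(X), np.array(y))
    _, std = gp.predict(candidates, return_std=True)
    nxt = candidates[np.argmax(std)]        # most informative next stimulus
    X.append(nxt)
    y.append(patient_response(nxt))
print("queried", len(X), "locations")
```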

SCALE: Synergized Collaboration of Asymmetric Language Translation Engines

  • paper_url: http://arxiv.org/abs/2309.17061
  • repo_url: https://github.com/hannibal046/scale
  • paper_authors: Xin Cheng, Xun Wang, Tao Ge, Si-Qing Chen, Furu Wei, Dongyan Zhao, Rui Yan
  • for: 本文引入了一种名为 SCALE 的协作框架,将专用翻译模型(STM)与大型语言模型(LLM)连接为统一的翻译引擎,以提升机器翻译质量。
  • methods: 本文将 STM 的翻译结果纳入三元组上下文示例中,使 LLM 能够对译文进行润色和中转(pivot),从而缓解 LLM 的语言偏置和 STM 的平行数据偏置。
  • results: 实验表明,SCALE 在低资源场景下显著优于少样本 LLM(GPT-4)和专用模型(NLLB);在科萨语到英语翻译中,无需微调 LLM 即可稳定提升 4 个 BLEURT 分,配合仅 600M 参数的小模型时,较少样本 GPT-4 高出 2.5 个 COMET 分和 3.8 个 BLEURT 分。此外,SCALE 能以英语为中心的 STM 作为枢轴在任意语言对之间翻译,在八个翻译方向上平均超过少样本 GPT-4 约 6 个 COMET 分。
    Abstract In this paper, we introduce SCALE, a collaborative framework that connects compact Specialized Translation Models (STMs) and general-purpose Large Language Models (LLMs) as one unified translation engine. By introducing translation from STM into the triplet in-context demonstrations, SCALE unlocks refinement and pivoting ability of LLM, thus mitigating language bias of LLM and parallel data bias of STM, enhancing LLM speciality without sacrificing generality, and facilitating continual learning without expensive LLM fine-tuning. Our comprehensive experiments show that SCALE significantly outperforms both few-shot LLMs (GPT-4) and specialized models (NLLB) in challenging low-resource settings. Moreover, in Xhosa to English translation, SCALE experiences consistent improvement by a 4 BLEURT score without tuning LLM and surpasses few-shot GPT-4 by 2.5 COMET score and 3.8 BLEURT score when equipped with a compact model consisting of merely 600M parameters. SCALE could also effectively exploit the existing language bias of LLMs by using an English-centric STM as a pivot for translation between any language pairs, outperforming few-shot GPT-4 by an average of 6 COMET points across eight translation directions. Furthermore we provide an in-depth analysis of SCALE's robustness, translation characteristics, and latency costs, providing solid foundation for future studies exploring the potential synergy between LLMs and more specialized, task-specific models.
    摘要 在这篇论文中,我们提出协作框架 SCALE,把小型的专用翻译模型(STM)与通用大型语言模型(LLM)连接为统一的翻译引擎。通过把 STM 的翻译结果引入三元组上下文示例,SCALE 解锁了 LLM 的润色与中转(pivot)能力,从而缓解 LLM 的语言偏置与 STM 的平行数据偏置,在不牺牲通用性的前提下增强 LLM 的专用性,并无需昂贵的 LLM 微调即可实现持续学习。大量实验表明,SCALE 在具有挑战性的低资源场景中显著优于少样本 LLM(GPT-4)和专用模型(NLLB)。此外,在科萨语到英语的翻译中,SCALE 无需调整 LLM 即可稳定提升 4 个 BLEURT 分;在仅配备 600M 参数的小模型时,较少样本 GPT-4 高出 2.5 个 COMET 分和 3.8 个 BLEURT 分。SCALE 还能利用 LLM 现有的语言偏置,以英语为中心的 STM 作为枢轴实现任意语言对之间的翻译,在八个翻译方向上平均超过少样本 GPT-4 约 6 个 COMET 分。我们还对 SCALE 的稳健性、翻译特性和延迟开销进行了深入分析,为未来探索 LLM 与更专用的任务特定模型之间协同潜力的研究奠定了基础。

Tell Me a Story! Narrative-Driven XAI with Large Language Models

  • paper_url: http://arxiv.org/abs/2309.17057
  • repo_url: https://github.com/admantwerp/xaistories
  • paper_authors: David Martens, Camille Dams, James Hinns, Mark Vergouwen
  • for: 该论文的目的是解释人工智能预测结果,提供人类可理解的解释方式。
  • methods: 该论文使用 SHAP 值和反事实(counterfactual, CF)解释方法,并借助大型语言模型将这些解释转换成人类可理解的故事形式,以便更好地解释 AI 预测结果。
  • results: 调查显示,超过 90% 的普通受访者认为 SHAPstories 生成的叙述具有说服力;92% 的数据科学家认为它有助于非专业人士理解 AI 预测,83% 表示愿意为此使用 SHAPstories。在图像分类中,超过 75% 的普通用户认为 CFstories 与自己撰写的故事同样或更具说服力,同时生成叙述的速度提高约十倍、准确率较人工撰写提升超过 20%。
    Abstract In today's critical domains, the predominance of black-box machine learning models amplifies the demand for Explainable AI (XAI). The widely used SHAP values, while quantifying feature importance, are often too intricate and lack human-friendly explanations. Furthermore, counterfactual (CF) explanations present `what ifs' but leave users grappling with the 'why'. To bridge this gap, we introduce XAIstories. Leveraging Large Language Models, XAIstories provide narratives that shed light on AI predictions: SHAPstories do so based on SHAP explanations to explain a prediction score, while CFstories do so for CF explanations to explain a decision. Our results are striking: over 90% of the surveyed general audience finds the narrative generated by SHAPstories convincing. Data scientists primarily see the value of SHAPstories in communicating explanations to a general audience, with 92% of data scientists indicating that it will contribute to the ease and confidence of nonspecialists in understanding AI predictions. Additionally, 83% of data scientists indicate they are likely to use SHAPstories for this purpose. In image classification, CFstories are considered more or equally convincing as users own crafted stories by over 75% of lay user participants. CFstories also bring a tenfold speed gain in creating a narrative, and improves accuracy by over 20% compared to manually created narratives. The results thereby suggest that XAIstories may provide the missing link in truly explaining and understanding AI predictions.
    摘要 在当今的关键领域中,黑盒机器学习模型的盛行放大了对可解释人工智能(XAI)的需求。广泛使用的 SHAP 值虽然能量化特征重要性,但往往过于复杂、缺乏对人友好的解释;反事实(CF)解释给出了"如果……会怎样",却让用户仍然困惑于"为什么"。为弥补这一差距,我们提出 XAIstories:借助大型语言模型生成阐明 AI 预测的叙述,其中 SHAPstories 基于 SHAP 解释来说明预测分数,CFstories 基于 CF 解释来说明决策。结果十分显著:超过 90% 的普通受访者认为 SHAPstories 生成的叙述具有说服力;数据科学家主要看重 SHAPstories 在向普通受众传达解释方面的价值,92% 的数据科学家认为它有助于非专业人士轻松且自信地理解 AI 预测,83% 表示愿意为此使用 SHAPstories。在图像分类中,超过 75% 的普通用户认为 CFstories 与自己撰写的故事同样或更具说服力,同时生成叙述的速度提升约十倍,准确率较人工撰写的叙述提高超过 20%。上述结果表明,XAIstories 可能提供了真正解释并理解 AI 预测所缺失的一环。
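
A minimal sketch of the SHAPstories idea: serialise per-feature SHAP values for one prediction into a prompt and ask an LLM to narrate them. The SHAP values, the loan-scoring scenario and the `call_llm` stub are all illustrative placeholders, not the paper's prompt template.

```python
# Turn per-feature SHAP contributions into a narration prompt for an LLM.
def call_llm(prompt: str) -> str:
    return "<narrative explanation generated by the LLM>"

shap_values = {"income": +0.31, "num_late_payments": -0.42, "age": +0.05}
prediction = "loan rejected (score 0.34)"

lines = [f"- {name}: SHAP value {value:+.2f}" for name, value in
         sorted(shap_values.items(), key=lambda kv: -abs(kv[1]))]
prompt = (
    "The model predicted: " + prediction + "\n"
    "Feature contributions (positive pushes towards approval):\n"
    + "\n".join(lines) + "\n"
    "Write a short, plain-language story explaining this decision."
)
print(call_llm(prompt))
```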

On Continuity of Robust and Accurate Classifiers

  • paper_url: http://arxiv.org/abs/2309.17048
  • repo_url: None
  • paper_authors: Ramin Barati, Reza Safabakhsh, Mohammad Rahmati
  • for: 本研究旨在探讨机器学习模型的可靠性问题,以及如何提高模型的Robustness和准确率。
  • methods: 本研究使用了对各种机器学习任务的实验研究,以及对学习过程中函数的分析和推理。
  • results: 研究发现,连续函数无法有效地学习最优的鲁棒假设,而不连续假设能够更好地兼顾鲁棒性与准确率。此外,研究还提供了一个在学习理论框架下严格研究调和(harmonic)与全纯(holomorphic)假设的框架。
    Abstract The reliability of a learning model is key to the successful deployment of machine learning in various applications. Creating a robust model, particularly one unaffected by adversarial attacks, requires a comprehensive understanding of the adversarial examples phenomenon. However, it is difficult to describe the phenomenon due to the complicated nature of the problems in machine learning. It has been shown that adversarial training can improve the robustness of the hypothesis. However, this improvement comes at the cost of decreased performance on natural samples. Hence, it has been suggested that robustness and accuracy of a hypothesis are at odds with each other. In this paper, we put forth the alternative proposal that it is the continuity of a hypothesis that is incompatible with its robustness and accuracy. In other words, a continuous function cannot effectively learn the optimal robust hypothesis. To this end, we will introduce a framework for a rigorous study of harmonic and holomorphic hypothesis in learning theory terms and provide empirical evidence that continuous hypotheses does not perform as well as discontinuous hypotheses in some common machine learning tasks. From a practical point of view, our results suggests that a robust and accurate learning rule would train different continuous hypotheses for different regions of the domain. From a theoretical perspective, our analysis explains the adversarial examples phenomenon as a conflict between the continuity of a sequence of functions and its uniform convergence to a discontinuous function.
    摘要 机器学习模型的可靠性是其在各类应用中成功部署的关键。要构建稳健的模型,尤其是不受对抗攻击影响的模型,需要对对抗样本现象有全面的理解;但由于机器学习问题本身的复杂性,这一现象难以刻画。已有研究表明对抗训练可以提高假设的鲁棒性,但这种提升是以牺牲自然样本上的性能为代价的,因此有观点认为假设的鲁棒性与准确性彼此冲突。本文提出另一种观点:与鲁棒性和准确性不相容的其实是假设的连续性,换言之,连续函数无法有效地学习最优的鲁棒假设。为此,我们引入一个在学习理论框架下严格研究调和与全纯假设的框架,并给出实证证据,表明在一些常见的机器学习任务中连续假设的表现不如不连续假设。从实践角度看,这意味着一个既鲁棒又准确的学习规则应在定义域的不同区域训练不同的连续假设;从理论角度看,我们的分析将对抗样本现象解释为函数序列的连续性与其一致收敛到不连续函数之间的矛盾。

Refined Kolmogorov Complexity of Analog, Evolving and Stochastic Recurrent Neural Networks

  • paper_url: http://arxiv.org/abs/2309.17032
  • repo_url: None
  • paper_authors: Jérémie Cabessa, Yann Strozecki
  • for: 本文旨在填补现有结果的空白,为模拟、演化与随机神经网络的超图灵计算能力提供统一的刻画。
  • methods: 分别基于实数权重、演化权重和实数概率的 Kolmogorov 复杂度,定义并研究模拟、演化与随机循环神经网络的超图灵计算能力。
  • results: 得到了模拟网络与演化网络的无穷复杂度层级,位于 $\mathbf{P}$ 与 $\mathbf{P/poly}$ 之间;随机网络的层级则介于 $\mathbf{BPP}$ 与 $\mathbf{BPP/log^*}$ 之间。此外,还给出了基于复杂度递增的函数类来构造这些层级的通用方法。
    Abstract We provide a refined characterization of the super-Turing computational power of analog, evolving, and stochastic neural networks based on the Kolmogorov complexity of their real weights, evolving weights, and real probabilities, respectively. First, we retrieve an infinite hierarchy of classes of analog networks defined in terms of the Kolmogorov complexity of their underlying real weights. This hierarchy is located between the complexity classes $\mathbf{P}$ and $\mathbf{P/poly}$. Then, we generalize this result to the case of evolving networks. A similar hierarchy of Kolomogorov-based complexity classes of evolving networks is obtained. This hierarchy also lies between $\mathbf{P}$ and $\mathbf{P/poly}$. Finally, we extend these results to the case of stochastic networks employing real probabilities as source of randomness. An infinite hierarchy of stochastic networks based on the Kolmogorov complexity of their probabilities is therefore achieved. In this case, the hierarchy bridges the gap between $\mathbf{BPP}$ and $\mathbf{BPP/log^*}$. Beyond proving the existence and providing examples of such hierarchies, we describe a generic way of constructing them based on classes of functions of increasing complexity. For the sake of clarity, this study is formulated within the framework of echo state networks. Overall, this paper intends to fill the missing results and provide a unified view about the refined capabilities of analog, evolving and stochastic neural networks.
    摘要 我们基于实数权重、演化权重和实数概率的 Kolmogorov 复杂度,对模拟、演化与随机循环神经网络的超图灵计算能力给出了更精细的刻画。首先,我们得到了一个依据底层实数权重的 Kolmogorov 复杂度定义的模拟网络无穷层级,该层级位于复杂度类 $\mathbf{P}$ 与 $\mathbf{P/poly}$ 之间。随后,我们将这一结果推广到演化网络,得到类似的基于 Kolmogorov 复杂度的层级,同样位于 $\mathbf{P}$ 与 $\mathbf{P/poly}$ 之间。最后,我们把结果扩展到以实数概率作为随机性来源的随机网络,得到基于概率 Kolmogorov 复杂度的无穷层级,该层级衔接了 $\mathbf{BPP}$ 与 $\mathbf{BPP/log^*}$。除证明这些层级的存在并给出示例外,我们还描述了一种基于复杂度递增的函数类来构造它们的通用方法。为表述清晰,本研究在回声状态网络(Echo State Networks)的框架下展开。总体而言,本文旨在填补已有结果的空白,并为模拟、演化与随机神经网络的精细计算能力提供统一的视角。

Scalable Multi-Temporal Remote Sensing Change Data Generation via Simulating Stochastic Change Process

  • paper_url: http://arxiv.org/abs/2309.17031
  • repo_url: https://github.com/Z-Zheng/Changen
  • paper_authors: Zhuo Zheng, Shiqi Tian, Ailong Ma, Liangpei Zhang, Yanfei Zhong
  • for: 本文提出一种基于生成式建模的可扩展多时相遥感变化数据生成器,以缓解大规模收集、预处理和标注多时相遥感影像的困难。
  • methods: 所提方法 Changen 基于生成对抗网络(GAN),将复杂的模拟问题解耦为两个更易处理的子问题:变化事件模拟与语义变化合成。
  • results: 大量实验表明 Changen 具有出色的生成能力,使用 Changen 预训练的变化检测器在真实变化数据集上表现出良好的迁移能力。
    Abstract Understanding the temporal dynamics of Earth's surface is a mission of multi-temporal remote sensing image analysis, significantly promoted by deep vision models with its fuel -- labeled multi-temporal images. However, collecting, preprocessing, and annotating multi-temporal remote sensing images at scale is non-trivial since it is expensive and knowledge-intensive. In this paper, we present a scalable multi-temporal remote sensing change data generator via generative modeling, which is cheap and automatic, alleviating these problems. Our main idea is to simulate a stochastic change process over time. We consider the stochastic change process as a probabilistic semantic state transition, namely generative probabilistic change model (GPCM), which decouples the complex simulation problem into two more trackable sub-problems, \ie, change event simulation and semantic change synthesis. To solve these two problems, we present the change generator (Changen), a GAN-based GPCM, enabling controllable object change data generation, including customizable object property, and change event. The extensive experiments suggest that our Changen has superior generation capability, and the change detectors with Changen pre-training exhibit excellent transferability to real-world change datasets.
    摘要 理解地球表面的时间动态是多时相遥感影像分析的使命,深度视觉模型及其所需的带标注多时相影像极大地推动了这一方向;然而,大规模收集、预处理和标注多时相遥感影像代价高昂且依赖专业知识。本文提出一种基于生成式建模的可扩展多时相遥感变化数据生成器,成本低且自动化,从而缓解上述问题。我们的核心思想是模拟随时间演变的随机变化过程,并将其视为概率性的语义状态转移,即生成式概率变化模型(GPCM),它把复杂的模拟问题解耦为两个更易处理的子问题:变化事件模拟与语义变化合成。为解决这两个问题,我们提出基于 GAN 的变化生成器 Changen,能够可控地生成对象变化数据,包括可定制的对象属性与变化事件。大量实验表明,Changen 具有出色的生成能力,使用 Changen 预训练的变化检测器在真实变化数据集上表现出良好的迁移能力。

Sarcasm in Sight and Sound: Benchmarking and Expansion to Improve Multimodal Sarcasm Detection

  • paper_url: http://arxiv.org/abs/2310.01430
  • repo_url: None
  • paper_authors: Swapnil Bhosale, Abhra Chaudhuri, Alex Lee Robert Williams, Divyank Tiwari, Anjan Dutta, Xiatian Zhu, Pushpak Bhattacharyya, Diptesh Kanojia
  • for: 本文旨在对 MUStARD++ 数据集进行严格的基准测试,充分利用其多模态信息,将 macro-F1 较现有基准提高 2%。
  • methods: 本文采用当前最优的语言、语音与视觉编码器以充分利用多模态表示;此外,为缓解“讽刺类型”类别的不平衡问题,提出名为 MUStARD++ Balanced 的扩展,将新增样本划分到训练集与测试集中,使 macro-F1 再提高 2.4%。
  • results: 新增片段取自电视剧 House MD 这一全新来源,增加了数据集的多样性,并由多名标注者人工标注,标注者间一致性(Cohen's kappa 与 Krippendorf's alpha)较高;代码、扩展数据与基准模型均已公开。
    Abstract The introduction of the MUStARD dataset, and its emotion recognition extension MUStARD++, have identified sarcasm to be a multi-modal phenomenon -- expressed not only in natural language text, but also through manners of speech (like tonality and intonation) and visual cues (facial expression). With this work, we aim to perform a rigorous benchmarking of the MUStARD++ dataset by considering state-of-the-art language, speech, and visual encoders, for fully utilizing the totality of the multi-modal richness that it has to offer, achieving a 2\% improvement in macro-F1 over the existing benchmark. Additionally, to cure the imbalance in the `sarcasm type' category in MUStARD++, we propose an extension, which we call \emph{MUStARD++ Balanced}, benchmarking the same with instances from the extension split across both train and test sets, achieving a further 2.4\% macro-F1 boost. The new clips were taken from a novel source -- the TV show, House MD, which adds to the diversity of the dataset, and were manually annotated by multiple annotators with substantial inter-annotator agreement in terms of Cohen's kappa and Krippendorf's alpha. Our code, extended data, and SOTA benchmark models are made public.
    摘要 With the introduction of the MUStARD dataset and its extension MUStARD++, researchers have found that sarcasm is a multi-modal phenomenon, expressed not only in natural language text but also through speech and visual cues. In this work, we perform a comprehensive benchmarking of the MUStARD++ dataset using state-of-the-art language, speech, and visual encoders to fully utilize the richness of the multi-modal data, achieving a 2% improvement in macro-F1 over the existing benchmark. Additionally, to address the imbalance in the "sarcasm type" category, we propose an extension, called MUStARD++ Balanced, splitting the new instances across both train and test sets, which yields a further 2.4% macro-F1 boost. The new clips were taken from a novel source, the TV show House MD, adding to the diversity of the dataset, and were manually annotated by multiple annotators with substantial inter-annotator agreement in terms of Cohen's kappa and Krippendorf's alpha. Our code, extended data, and benchmark models are publicly available.

Benchmarking Cognitive Biases in Large Language Models as Evaluators

  • paper_url: http://arxiv.org/abs/2309.17012
  • repo_url: https://github.com/minnesotanlp/cobbler
  • paper_authors: Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, Dongyeop Kang
  • for: 本研究目的是评估大自然语言模型(LLMs)作为自动评估器的可靠性,并研究其评估输出的质量。
  • methods: 本研究选取 15 个不同规模的 LLM,让它们对彼此的输出回复进行偏好排序,并使用 CoBBLEr 基准衡量其评估输出中的六种认知偏差。
  • results: 研究发现,LLM 作为评估器存在明显的认知偏差,在所有模型的比较中平均约 40% 显示出偏差迹象;人类偏好与机器偏好之间的平均 Rank-Biased Overlap(RBO)仅为 49.6%,表明机器偏好与人类偏好并不一致。
    Abstract Large Language Models (LLMs) have recently been shown to be effective as automatic evaluators with simple prompting and in-context learning. In this work, we assemble 15 LLMs of four different size ranges and evaluate their output responses by preference ranking from the other LLMs as evaluators, such as System Star is better than System Square. We then evaluate the quality of ranking outputs introducing the Cognitive Bias Benchmark for LLMs as Evaluators (CoBBLEr), a benchmark to measure six different cognitive biases in LLM evaluation outputs, such as the Egocentric bias where a model prefers to rank its own outputs highly in evaluation. We find that LLMs are biased text quality evaluators, exhibiting strong indications on our bias benchmark (average of 40% of comparisons across all models) within each of their evaluations that question their robustness as evaluators. Furthermore, we examine the correlation between human and machine preferences and calculate the average Rank-Biased Overlap (RBO) score to be 49.6%, indicating that machine preferences are misaligned with humans. According to our findings, LLMs may still be unable to be utilized for automatic annotation aligned with human preferences. Our project page is at: https://minnesotanlp.github.io/cobbler.
    摘要 大型语言模型(LLM)最近被证明只需简单提示和上下文学习即可作为自动评估器。本工作汇集了四个规模区间内的 15 个 LLM,让它们作为评估器对彼此的输出回复进行偏好排序(例如“系统 Star 优于系统 Square”)。随后,我们提出用于衡量 LLM 评估输出中六种认知偏差的基准 CoBBLEr,例如模型倾向于把自己的输出排在前列的自我中心偏差。我们发现 LLM 是带有偏差的文本质量评估器:在各模型的评估中,平均约 40% 的比较在偏差基准上表现出明显迹象,这使其作为评估器的可靠性受到质疑。此外,我们考察了人类偏好与机器偏好的相关性,计算得到的平均 Rank-Biased Overlap(RBO)为 49.6%,表明机器偏好与人类偏好并不一致。根据我们的发现,LLM 目前可能仍难以用于与人类偏好对齐的自动标注。项目页面:https://minnesotanlp.github.io/cobbler。
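
Rank-Biased Overlap, the measure used above to compare human and machine preference rankings, can be computed for two finite rankings as in the sketch below (truncated form, without the extrapolation term of the full definition); the example rankings are made up.

```python
# Truncated Rank-Biased Overlap between two rankings of the same items.
def rbo(ranking_a, ranking_b, p=0.9):
    depth = min(len(ranking_a), len(ranking_b))
    score = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(ranking_a[:d]) & set(ranking_b[:d]))
        score += (p ** (d - 1)) * overlap / d   # agreement at depth d, geometrically weighted
    return (1 - p) * score

human   = ["sysA", "sysB", "sysC", "sysD", "sysE"]
machine = ["sysB", "sysA", "sysE", "sysC", "sysD"]
print(round(rbo(human, machine), 3))
```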

Medical Foundation Models are Susceptible to Targeted Misinformation Attacks

  • paper_url: http://arxiv.org/abs/2309.17007
  • repo_url: https://github.com/peterhan91/fm_adv
  • paper_authors: Tianyu Han, Sven Nebelung, Firas Khader, Tianci Wang, Gustav Mueller-Franzes, Christiane Kuhl, Sebastian Försch, Jens Kleesiek, Christoph Haarburger, Keno K. Bressem, Jakob Nikolas Kather, Daniel Truhn
  • for: 这个研究目的是要检查大型自然语言模型(LLMs)在医疗领域中的可靠性和安全性。
  • methods: 研究人员通过对模型仅 1.1% 的权重进行定向篡改,向模型注入错误的生物医学信息。
  • results: 研究发现,通过修改模型的1.1%的权重,可以故意导入错误的生物医学信息,并且这些错误信息将被模型输出,而模型的其他生物医学任务性能则保持不变。这些结果表明了LLMs在医疗领域的可靠性和安全性存在问题。
    Abstract Large language models (LLMs) have broad medical knowledge and can reason about medical information across many domains, holding promising potential for diverse medical applications in the near future. In this study, we demonstrate a concerning vulnerability of LLMs in medicine. Through targeted manipulation of just 1.1% of the model's weights, we can deliberately inject an incorrect biomedical fact. The erroneous information is then propagated in the model's output, whilst its performance on other biomedical tasks remains intact. We validate our findings in a set of 1,038 incorrect biomedical facts. This peculiar susceptibility raises serious security and trustworthiness concerns for the application of LLMs in healthcare settings. It accentuates the need for robust protective measures, thorough verification mechanisms, and stringent management of access to these models, ensuring their reliable and safe use in medical practice.
    摘要

Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks

  • paper_url: http://arxiv.org/abs/2309.17002
  • repo_url: None
  • paper_authors: Hao Chen, Jindong Wang, Ankit Shah, Ran Tao, Hongxin Wei, Xing Xie, Masashi Sugiyama, Bhiksha Raj
  • for: 本研究旨在理解预训练数据中的标签噪声对下游任务的影响,并提出一种轻量级的黑盒微调方法来缓解其有害影响。
  • methods: 通过在含合成噪声的 ImageNet-1K 和 YFCC15M 数据集上对监督预训练模型进行大量实验,我们证明适度的噪声可以提升域内(ID)迁移性能(训练与测试数据同分布),但总会损害域外(OOD)性能(训练与测试数据分布不同);我们还验证了其原因在于噪声使预训练塑造出不同的特征空间。
  • results: 我们在以噪声数据预训练的流行视觉与语言模型上进行评估,结果表明所提方法能有效缓解预训练噪声的影响,同时提升 ID 与 OOD 任务的泛化性能。
    Abstract Pre-training on large-scale datasets and then fine-tuning on downstream tasks have become a standard practice in deep learning. However, pre-training data often contain label noise that may adversely affect the generalization of the model. This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks. More specifically, through extensive experiments of supervised pre-training models on synthetic noisy ImageNet-1K and YFCC15M datasets, we demonstrate that while slight noise in pre-training can benefit in-domain (ID) transfer performance, where the training and testing data share the same distribution, it always deteriorates out-of-domain (OOD) performance, where training and testing data distribution are different. We empirically verify that the reason behind is noise in pre-training shapes the feature space differently. We then propose a lightweight black-box tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization on both ID and OOD tasks, considering one may not be able to fully fine-tune or even access the pre-trained models. We conduct practical experiments on popular vision and language models that are pre-trained on noisy data for evaluation of our approach. Our analysis and results show the importance of this interesting and novel research direction, which we term Noisy Model Learning.
    摘要 先在大规模数据集上预训练、再在下游任务上微调,已成为深度学习的标准做法。然而,预训练数据通常包含标签噪声,可能损害模型的泛化能力。本文旨在理解预训练数据中噪声的性质,并缓解其对下游任务的影响。具体而言,我们在含合成噪声的 ImageNet-1K 和 YFCC15M 数据集上对监督预训练模型进行大量实验,发现轻微的预训练噪声可以提升域内(ID)迁移性能(训练与测试数据同分布),但总会损害域外(OOD)性能(训练与测试数据分布不同)。我们通过实验验证,其原因在于噪声使预训练塑造出不同的特征空间。考虑到实际中可能无法完全微调甚至无法访问预训练模型,我们提出一种轻量级的黑盒微调方法(NMTune),对特征空间进行仿射调整,以缓解噪声的有害影响并提升 ID 与 OOD 任务的泛化。我们在以噪声数据预训练的流行视觉与语言模型上进行了实验评估。分析与结果表明了这一有趣且新颖的研究方向的重要性,我们将其称为噪声模型学习(Noisy Model Learning)。
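
A minimal sketch in the spirit of lightweight black-box tuning on frozen, possibly noisily pre-trained features: only an affine transform plus a linear probe are learned on the downstream task, while the backbone stays untouched. This illustrates the setting, not NMTune's exact objective.

```python
# Learn an affine re-scaling of frozen features plus a linear probe.
import torch
import torch.nn as nn

class AffineFeatureTune(nn.Module):
    def __init__(self, dim, n_classes):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))   # learned per-dimension rescaling
        self.shift = nn.Parameter(torch.zeros(dim))  # learned per-dimension offset
        self.probe = nn.Linear(dim, n_classes)

    def forward(self, frozen_feats):
        return self.probe(frozen_feats * self.scale + self.shift)

feats = torch.randn(16, 512)                         # features from a frozen backbone
labels = torch.randint(0, 10, (16,))
head = AffineFeatureTune(512, 10)
loss = nn.functional.cross_entropy(head(feats), labels)
print(loss.item())
```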

A Closer Look at Bearing Fault Classification Approaches

  • paper_url: http://arxiv.org/abs/2309.17001
  • repo_url: None
  • paper_authors: Harika Abburi, Tanya Chaudhary, Haider Ilyas, Lakshmi Manne, Deepak Mittal, Don Williams, Derek Snaidauf, Edward Bowen, Balaji Veeramani
  • for: 本研究旨在提高滚动轴承故障诊断的效率和准确性,以避免意外停机并优化维护计划。
  • methods: 本研究考察了基于现代机器学习技术(包括深度学习架构)、利用振动数据进行轴承故障分类的方法。
  • results: 研究发现,在轴承故障分类中,数据划分方式、模型评价指标以及跑至失效实验中故障标签的生成方法对结果具有重要影响,并据此提出了面向实际场景的模型开发考虑因素。
    Abstract Rolling bearing fault diagnosis has garnered increased attention in recent years owing to its presence in rotating machinery across various industries, and an ever increasing demand for efficient operations. Prompt detection and accurate prediction of bearing failures can help reduce the likelihood of unexpected machine downtime and enhance maintenance schedules, averting lost productivity. Recent technological advances have enabled monitoring the health of these assets at scale using a variety of sensors, and predicting the failures using modern Machine Learning (ML) approaches including deep learning architectures. Vibration data has been collected using accelerated run-to-failure of overloaded bearings, or by introducing known failure in bearings, under a variety of operating conditions such as rotating speed, load on the bearing, type of bearing fault, and data acquisition frequency. However, in the development of bearing failure classification models using vibration data there is a lack of consensus in the metrics used to evaluate the models, data partitions used to evaluate models, and methods used to generate failure labels in run-to-failure experiments. An understanding of the impact of these choices is important to reliably develop models, and deploy them in practical settings. In this work, we demonstrate the significance of these choices on the performance of the models using publicly-available vibration datasets, and suggest model development considerations for real world scenarios. Our experimental findings demonstrate that assigning vibration data from a given bearing across training and evaluation splits leads to over-optimistic performance estimates, PCA-based approach is able to robustly generate labels for failure classification in run-to-failure experiments, and $F$ scores are more insightful to evaluate the models with unbalanced real-world failure data.
    摘要 However, there is a lack of consensus in the metrics used to evaluate the models, the data partitions used to evaluate the models, and the methods used to generate failure labels in run-to-failure experiments. These choices have a significant impact on the performance of the models, and it is essential to understand their influence to develop reliable models that can be deployed in practical settings.In this work, we investigate the impact of these choices on the performance of bearing failure classification models using publicly-available vibration datasets. We demonstrate that assigning vibration data from a given bearing across training and evaluation splits leads to over-optimistic performance estimates, and that a PCA-based approach can robustly generate labels for failure classification in run-to-failure experiments. Additionally, we find that $F$ scores are more insightful to evaluate the models with unbalanced real-world failure data. Our findings provide practical considerations for developing and deploying bearing failure classification models in real-world scenarios.
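
The data-partition pitfall discussed above can be made concrete with scikit-learn: holding out whole bearings via `GroupKFold` avoids the over-optimistic estimates that arise when windows from the same bearing appear in both train and test. Features, labels and bearing IDs below are synthetic placeholders.

```python
# Bearing-wise cross-validation with GroupKFold and macro-F1 scoring.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((600, 12))                   # vibration features per window
y = rng.integers(0, 2, 600)                          # healthy / faulty label
bearing_id = np.repeat(np.arange(6), 100)            # 6 physical bearings, 100 windows each

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=GroupKFold(n_splits=3), groups=bearing_id, scoring="f1_macro")
print("bearing-wise macro-F1:", scores.round(3))
```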

AI Algorithm for the Generation of Three-Dimensional Accessibility Ramps in Grasshopper / Rhinoceros 7

  • paper_url: http://arxiv.org/abs/2310.07728
  • repo_url: None
  • paper_authors: Antonio Li, Leila Yi, Brandon Yeo Pei Hui
  • for: This paper aims to provide an algorithm for the automatic generation of wheelchair-accessible ramps in urban development, with the goal of improving accessibility for people with mobile impairments and able-bodied third parties.
  • methods: The algorithm uses AI search algorithms to determine the optimal pathway connecting initial and terminal points within a 3D model of the environment, taking into account essential components such as elevation differentials, spatial constraints, and gradient specifications.
  • results: The algorithm generates a full-scale, usable model of a ramp that can be easily exported and transformed through inter-software exchanges, providing significant efficiency gains in the design process and lowering the threshold for the incorporation of accessibility features in future urban design.
    Abstract Often overlooked as a component of urban development, accessibility infrastructure is undeniably crucial in daily life. Accessibility ramps are one of the most common types of accessibility infrastructure, and serve to benefit not only people with mobile impairments but also able-bodied third parties. While the necessity of accessibility ramps is acknowledged, actual implementation fails in light of the limits of manpower required for the design stage. In response, we present an algorithm capable of the automatic generation of a feasible accessibility ramp based on a 3D model of the relevant environment. Through the manual specification of initial and terminal points within a 3D model, the algorithm uses AI search algorithms to determine the optimal pathway connecting these points. Essential components in devising a wheelchair-accessible ramp are encoded within the process, as evaluated by the algorithm, including but not limited to elevation differentials, spatial constraints, and gradient specifications. From this, the algorithm then generates the pathway to be expanded into a full-scale, usable model of a ramp, which then can be easily exported and transformed through inter-software exchanges. Though some human input is still required following the generation stage, the minimising of human resources provides significant boosts of efficiency in the design process thus lowering the threshold for the incorporation of accessibility features in future urban design.
    摘要 无障碍基础设施常被视为城市发展中容易忽略的一环,但它在日常生活中无疑至关重要。无障碍坡道是最常见的无障碍设施之一,不仅惠及行动不便者,也方便健全人士。尽管其必要性已被广泛认可,实际建设却常受限于设计阶段所需的人力。为此,我们提出一种算法,能够基于相关环境的 3D 模型自动生成可行的无障碍坡道。通过在 3D 模型中手动指定起点与终点,该算法利用 AI 搜索算法确定连接两点的最优路径。设计轮椅可达坡道所需的关键要素(包括但不限于高程差、空间约束与坡度规范)都在算法评估过程中被纳入考虑。在此基础上,算法生成的路径可扩展为实际尺寸、可直接使用的坡道模型,并能方便地通过软件间交换导出与转换。尽管生成之后仍需少量人工介入,但人力投入的大幅减少显著提高了设计效率,从而降低了在未来城市设计中纳入无障碍要素的门槛。
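
A minimal sketch of search-based ramp routing under a gradient constraint: a shortest path is found over a height grid while rejecting moves steeper than the accessibility limit. The grid, the 1:12 limit and the 4-neighbour moves are illustrative assumptions; the paper operates on full 3D Rhino/Grasshopper geometry.

```python
# Dijkstra-style routing on a height grid with a maximum-gradient constraint.
import heapq
import numpy as np

def route_ramp(heights, start, goal, cell=1.0, max_gradient=1 / 12):
    dist = {start: 0.0}
    came = {}
    frontier = [(0.0, start)]
    while frontier:
        d, node = heapq.heappop(frontier)
        if node == goal:
            path = [node]
            while node in came:
                node = came[node]
                path.append(node)
            return path[::-1]
        for dr, dc in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
            nxt = (node[0] + dr, node[1] + dc)
            if not (0 <= nxt[0] < heights.shape[0] and 0 <= nxt[1] < heights.shape[1]):
                continue
            rise = abs(heights[nxt] - heights[node])
            if rise / cell > max_gradient:           # too steep for a wheelchair ramp
                continue
            nd = d + cell
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], came[nxt] = nd, node
                heapq.heappush(frontier, (nd, nxt))
    return None                                      # no feasible ramp path

heights = np.linspace(0.0, 0.5, 8).reshape(1, -1).repeat(8, axis=0)   # gentle slope
print(route_ramp(heights, (0, 0), (7, 7)))
```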

Reliability Quantification of Deep Reinforcement Learning-based Control

  • paper_url: http://arxiv.org/abs/2309.16977
  • repo_url: None
  • paper_authors: Hitoshi Yoshioka, Hirotada Hashimoto
  • for: 这个研究的目的是为了量化深度强化学习(DRL)控制的可靠性,以便在安全重要系统中应用人工智能(AI)。
  • methods: 本研究提出了一种量化 DRL 控制可靠性的方法。首先,将现有的随机噪声蒸馏方法用于可靠性评估,以厘清需要解决的问题;随后提出一种新的可靠性量化方法:使用结构相同、初始参数相同的参考网络与评估网络,训练时更新评估网络参数,使两者在训练数据上的输出差异最大化,从而依据两网络输出之差评估 DRL 控制在某一状态下的可靠性。
  • results: 本研究以基于 DQN 的简单任务控制为例验证了方法的有效性;随后将该方法应用于按状态切换已训练模型的问题,通过依据可靠性切换模型提升了 DRL 控制的性能。
    Abstract Reliability quantification of deep reinforcement learning (DRL)-based control is a significant challenge for the practical application of artificial intelligence (AI) in safety-critical systems. This study proposes a method for quantifying the reliability of DRL-based control. First, an existing method, random noise distillation, was applied to the reliability evaluation to clarify the issues to be solved. Second, a novel method for reliability quantification was proposed to solve these issues. The reliability is quantified using two neural networks: reference and evaluator. They have the same structure with the same initial parameters. The outputs of the two networks were the same before training. During training, the evaluator network parameters were updated to maximize the difference between the reference and evaluator networks for trained data. Thus, the reliability of the DRL-based control for a state can be evaluated based on the difference in output between the two networks. The proposed method was applied to DQN-based control as an example of a simple task, and its effectiveness was demonstrated. Finally, the proposed method was applied to the problem of switching trained models depending on the state. Con-sequently, the performance of the DRL-based control was improved by switching the trained models according to their reliability.
    摘要 深度强化学习(DRL)控制的可靠性量化,是人工智能(AI)在安全关键系统中实际应用面临的一大挑战。本研究提出了一种量化 DRL 控制可靠性的方法。首先,我们将现有的随机噪声蒸馏方法用于可靠性评估,以厘清需要解决的问题;随后提出一种新的可靠性量化方法。该方法使用两个神经网络:参考网络与评估网络,二者结构相同、初始参数相同,训练前输出一致。训练过程中更新评估网络的参数,使两者在训练数据上的输出差异最大化,因此可以依据两网络输出之差评估 DRL 控制在某一状态下的可靠性。我们以基于 DQN 的简单任务控制为例验证了该方法的有效性,并将其应用于按状态切换已训练模型的问题,通过依据可靠性切换模型提升了 DRL 控制的性能。
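
A minimal sketch of the reference/evaluator pair described above, under the stated assumption that the evaluator is trained to maximise its output gap to a frozen, identically initialised reference on states seen during training, so that the gap can later be read as a reliability signal. Network sizes, the optimiser and the training loop are illustrative.

```python
# Two identically initialised networks; only the evaluator is trained.
import copy
import torch
import torch.nn as nn

state_dim, out_dim = 8, 16
reference = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))
evaluator = copy.deepcopy(reference)                  # same structure, same initial parameters
for p in reference.parameters():
    p.requires_grad_(False)                           # reference stays frozen

opt = torch.optim.Adam(evaluator.parameters(), lr=1e-3)
train_states = torch.randn(256, state_dim)            # states visited during DRL training
for _ in range(200):
    gap = (evaluator(train_states) - reference(train_states)).pow(2).mean()
    opt.zero_grad()
    (-gap).backward()                                  # maximise the gap on trained data
    opt.step()

def reliability_score(state):                          # larger gap -> more like training data
    with torch.no_grad():
        return (evaluator(state) - reference(state)).pow(2).mean().item()

print(reliability_score(torch.randn(1, state_dim)))
```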

A Quantum States Preparation Method Based on Difference-Driven Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2309.16972
  • repo_url: None
  • paper_authors: Wenjie Liu, Jing Xu, Bosi Wang
  • for: 提高二量子比特系统量子态制备的收敛速度与保真度。
  • methods: 提出一种差异驱动的强化学习算法,改进奖励函数与动作选择策略,帮助算法快速获得最大期望累积奖励。
  • results: 实验结果表明,所提算法能够在有限条件下高保真地制备二量子比特系统的量子态;与其他算法相比,其在收敛速度和最终量子态保真度上均有不同程度的提升。
    Abstract Due to the large state space of the two-qubit system, and the adoption of ladder reward function in the existing quantum state preparation methods, the convergence speed is slow and it is difficult to prepare the desired target quantum state with high fidelity under limited conditions. To solve the above problems, a difference-driven reinforcement learning (RL) algorithm for quantum state preparation of two-qubit system is proposed by improving the reward function and action selection strategy. Firstly, a model is constructed for the problem of preparing quantum states of a two-qubit system, with restrictions on the type of quantum gates and the time for quantum state evolution. In the preparation process, a weighted differential dynamic reward function is designed to assist the algorithm quickly obtain the maximum expected cumulative reward. Then, an adaptive e-greedy action selection strategy is adopted to achieve a balance between exploration and utilization to a certain extent, thereby improving the fidelity of the final quantum state. The simulation results show that the proposed algorithm can prepare quantum state with high fidelity under limited conditions. Compared with other algorithms, it has different degrees of improvement in convergence speed and fidelity of the final quantum state.
    摘要 由于二量子比特系统的状态空间较大,且现有量子态制备方法采用阶梯式奖励函数,其收敛速度较慢,在有限条件下难以高保真地制备目标量子态。为解决上述问题,本文通过改进奖励函数和动作选择策略,提出了一种面向二量子比特系统量子态制备的差异驱动强化学习(RL)算法。首先,针对二量子比特系统的量子态制备问题建立模型,并对量子门类型和量子态演化时间加以约束;在制备过程中设计加权差分动态奖励函数,帮助算法快速获得最大期望累积奖励;随后采用自适应 ε-贪心动作选择策略,在探索与利用之间取得一定平衡,从而提高最终量子态的保真度。仿真结果表明,所提算法能够在有限条件下高保真地制备量子态,并且在收敛速度和最终量子态保真度方面相比其他算法均有不同程度的提升。

Discrete-Choice Model with Generalized Additive Utility Network

  • paper_url: http://arxiv.org/abs/2309.16970
  • repo_url: None
  • paper_authors: Tomoki Nishi, Yusuke Hara
  • for: 提供有价值的决策行为分析结果,帮助政策制定者和企业做出更好的决策。
  • methods: 在多项 Logit 模型(MNL)框架下引入神经网络效用函数(如 ASU-DNN),以提高选择行为预测的准确性。
  • results: 提出一种基于广义可加模型的新型效用网络 GAUNet,并在东京的出行调查数据上进行评估;其准确性与 ASU-DNN 相当,同时具有更好的可解释性。
    Abstract Discrete-choice models are a powerful framework for analyzing decision-making behavior to provide valuable insights for policymakers and businesses. Multinomial logit models (MNLs) with linear utility functions have been used in practice because they are ease to use and interpretable. Recently, MNLs with neural networks (e.g., ASU-DNN) have been developed, and they have achieved higher prediction accuracy in behavior choice than classical MNLs. However, these models lack interpretability owing to complex structures. We developed utility functions with a novel neural-network architecture based on generalized additive models, named generalized additive utility network ( GAUNet), for discrete-choice models. We evaluated the performance of the MNL with GAUNet using the trip survey data collected in Tokyo. Our models were comparable to ASU-DNN in accuracy and exhibited improved interpretability compared to previous models.
    摘要
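
A minimal sketch of a generalized-additive utility network for discrete choice: each attribute gets its own small subnetwork, the per-attribute utilities are summed per alternative, and choice probabilities follow from a softmax as in an MNL. Dimensions and layer sizes are illustrative, not GAUNet's.

```python
# Additive per-attribute utility subnetworks feeding a softmax choice model.
import torch
import torch.nn as nn

class AdditiveUtility(nn.Module):
    def __init__(self, n_attributes):
        super().__init__()
        self.subnets = nn.ModuleList([
            nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
            for _ in range(n_attributes)
        ])

    def forward(self, x):                       # x: (batch, alternatives, attributes)
        parts = [net(x[..., i:i + 1]) for i, net in enumerate(self.subnets)]
        utility = torch.stack(parts, dim=-1).sum(dim=-1).squeeze(-1)   # (batch, alternatives)
        return torch.log_softmax(utility, dim=-1)                      # log choice probabilities

x = torch.randn(5, 3, 4)                        # 5 trips, 3 travel modes, 4 attributes each
print(AdditiveUtility(4)(x).shape)              # torch.Size([5, 3])
```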

Adversarial Driving Behavior Generation Incorporating Human Risk Cognition for Autonomous Vehicle Evaluation

  • paper_url: http://arxiv.org/abs/2310.00029
  • repo_url: None
  • paper_authors: Zhen Liu, Hang Gao, Hao Ma, Shuo Cai, Yunfeng Hu, Ting Qu, Hong Chen, Xun Gong
  • for: 本研究旨在开发一种新的敌对驾驶行为生成框架,用于检测自动驾驶车(AV)的弱点。
  • methods: 该研究采用强化学习(RL)方法,并结合累积前景理论(CPT)来刻画人类风险认知;同时利用 CPT 行动价值函数,提出扩展版的深度确定性策略梯度(DDPG)算法来训练对抗策略,并保证训练的稳定性。
  • results: 在高保真硬件在环(HiL)平台上针对切入(cut-in)场景进行的对比案例研究表明,所提对抗方法能够有效揭示被测 AV 的弱点。
    Abstract Autonomous vehicle (AV) evaluation has been the subject of increased interest in recent years both in industry and in academia. This paper focuses on the development of a novel framework for generating adversarial driving behavior of background vehicle interfering against the AV to expose effective and rational risky events. Specifically, the adversarial behavior is learned by a reinforcement learning (RL) approach incorporated with the cumulative prospect theory (CPT) which allows representation of human risk cognition. Then, the extended version of deep deterministic policy gradient (DDPG) technique is proposed for training the adversarial policy while ensuring training stability as the CPT action-value function is leveraged. A comparative case study regarding the cut-in scenario is conducted on a high fidelity Hardware-in-the-Loop (HiL) platform and the results demonstrate the adversarial effectiveness to infer the weakness of the tested AV.
    摘要 近年来,自动驾驶汽车(AV)评估在工业界与学术界都受到越来越多的关注。本文着眼于构建一种新的对抗驾驶行为生成框架,让背景车辆对 AV 实施干扰,以暴露有效且合理的危险事件。具体而言,对抗行为通过引入累积前景理论(CPT)的强化学习(RL)方法学习得到,CPT 用于刻画人类的风险认知;在此基础上,利用 CPT 行动价值函数,提出扩展版的深度确定性策略梯度(DDPG)技术来训练对抗策略,同时保证训练稳定性。在高保真硬件在环(HiL)平台上针对切入场景开展的对比案例研究表明,该对抗方法能够有效揭示被测 AV 的弱点。
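
The cumulative-prospect-theory ingredients such a reward design can build on are easy to write down; the sketch below uses Tversky and Kahneman's commonly cited parameter estimates, and how the paper combines them with the DDPG action-value function is not shown.

```python
# CPT value function (loss-averse, S-shaped) and inverse-S probability weighting.
def cpt_value(x, alpha=0.88, beta=0.88, lam=2.25):
    return x ** alpha if x >= 0 else -lam * (-x) ** beta

def cpt_weight(p, gamma=0.61):
    return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)

print(cpt_value(10.0), cpt_value(-10.0))   # losses loom larger than gains
print(cpt_weight(0.05), cpt_weight(0.95))  # small probabilities are over-weighted
```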

Axiomatic Aggregations of Abductive Explanations

  • paper_url: http://arxiv.org/abs/2310.03131
  • repo_url: https://github.com/elitalobo/Axiomatic-Aggregations-of-Abductive-Explanations
  • paper_authors: Gagan Biradar, Yacine Izza, Elita Lobo, Vignesh Viswanathan, Yair Zick
  • for: 针对事后模型近似解释方法(如 LIME 和 SHAP)在稳健性方面受到的批评,本文研究模型精确的溯因解释,并解决同一数据点可能存在多个有效溯因解释的问题。
  • methods: 溯因解释为每个数据点给出一个足以产生该预测结果的最小特征子集;由于同一数据点可能有多个有效解释,仅给出其一可能不充分,全部给出又难以理解。本文通过将多个溯因解释聚合为特征重要性分数来解决这一问题。
  • results: 本文提出三种聚合方法:两种基于合作博弈论中的权力指数,另一种基于常用的因果强度度量;并以公理化方式刻画了这三种方法,证明各自唯一地满足一组理想性质。在多个数据集上的评估表明,这些解释对能欺骗 SHAP 和 LIME 的攻击具有稳健性。
    Abstract The recent criticisms of the robustness of post hoc model approximation explanation methods (like LIME and SHAP) have led to the rise of model-precise abductive explanations. For each data point, abductive explanations provide a minimal subset of features that are sufficient to generate the outcome. While theoretically sound and rigorous, abductive explanations suffer from a major issue -- there can be several valid abductive explanations for the same data point. In such cases, providing a single abductive explanation can be insufficient; on the other hand, providing all valid abductive explanations can be incomprehensible due to their size. In this work, we solve this issue by aggregating the many possible abductive explanations into feature importance scores. We propose three aggregation methods: two based on power indices from cooperative game theory and a third based on a well-known measure of causal strength. We characterize these three methods axiomatically, showing that each of them uniquely satisfies a set of desirable properties. We also evaluate them on multiple datasets and show that these explanations are robust to the attacks that fool SHAP and LIME.
    摘要 近来对事后模型近似解释方法(如 LIME 和 SHAP)稳健性的批评,促使模型精确的溯因解释兴起。对每个数据点,溯因解释给出一个足以产生该预测结果的最小特征子集。溯因解释虽然在理论上严谨,但存在一个主要问题:同一数据点可能有多个有效的溯因解释。在这种情况下,只给出一个解释可能不充分,而给出全部有效解释又会因数量过多而难以理解。本文通过将多个可能的溯因解释聚合为特征重要性分数来解决这一问题。我们提出三种聚合方法:两种基于合作博弈论中的权力指数,另一种基于一种常用的因果强度度量。我们以公理化方式刻画了这三种方法,证明它们各自唯一地满足一组理想性质。我们还在多个数据集上对其进行了评估,并表明这些解释对能够欺骗 SHAP 和 LIME 的攻击具有稳健性。
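
A toy illustration of turning several abductive explanations (each a minimal sufficient feature subset) into per-feature importance scores by simple frequency counting; the paper's axiomatic power-index and causal-strength aggregators are more refined than this.

```python
# Aggregate several abductive explanations into feature importance by frequency.
from collections import Counter

abductive_explanations = [        # each set is one minimal sufficient subset
    {"income", "credit_history"},
    {"income", "employment_length"},
    {"credit_history"},
]

counts = Counter(f for expl in abductive_explanations for f in expl)
total = len(abductive_explanations)
importance = {feat: n / total for feat, n in counts.items()}
print(sorted(importance.items(), key=lambda kv: -kv[1]))
```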

On Generating Explanations for Reinforcement Learning Policies: An Empirical Study

  • paper_url: http://arxiv.org/abs/2309.16960
  • repo_url: None
  • paper_authors: Mikihisa Yuasa, Huy T. Tran, Ramavarapu S. Sreenivas
  • for: 提供策略解释
  • methods: 使用线性时序逻辑(LTL)公式为策略提供解释,说明策略最终达成的目标及其在执行过程中所维持的前提条件
  • results: 在模拟的夺旗(capture the flag)环境中验证了所提方法的有效性,并给出了未来研究方向的建议
    Abstract In this paper, we introduce a set of \textit{Linear Temporal Logic} (LTL) formulae designed to provide explanations for policies. Our focus is on crafting explanations that elucidate both the ultimate objectives accomplished by the policy and the prerequisites it upholds throughout its execution. These LTL-based explanations feature a structured representation, which is particularly well-suited for local-search techniques. The effectiveness of our proposed approach is illustrated through a simulated capture the flag environment. The paper concludes with suggested directions for future research.
    摘要 在这篇论文中,我们介绍了一组用于为策略提供解释的线性时序逻辑(LTL)公式。我们的重点在于构造既能阐明策略最终达成的目标、又能说明其在执行过程中所维持的前提条件的解释。这类基于 LTL 的解释具有结构化的表示形式,特别适合局部搜索技术。我们在模拟的夺旗环境中展示了所提方法的有效性,并在文末给出了未来研究的方向。

Denoising Diffusion Bridge Models

  • paper_url: http://arxiv.org/abs/2309.16948
  • repo_url: https://github.com/alexzhou907/DDBM
  • paper_authors: Linqi Zhou, Aaron Lou, Samar Khanna, Stefano Ermon
  • for: 这篇论文的目的是提出一种新的扩展 diffusion models 的方法,以便在图像编辑等应用中更好地处理非随机噪声的输入数据。
  • methods: 该方法基于扩散桥(diffusion bridges)这一在两个给定端点分布之间插值的过程族,从数据中学习扩散桥的分数,并基于所学分数求解一个(随机)微分方程,将一个端点分布映射到另一个端点分布。
  • results: 实验表明,该方法在具有挑战性的图像数据集上取得显著改进;当把源分布设为随机噪声、问题退化为图像生成时,DDBM 尽管面向更一般的任务,其 FID 分数仍可与最先进方法相当。
    Abstract Diffusion models are powerful generative models that map noise to data using stochastic processes. However, for many applications such as image editing, the model input comes from a distribution that is not random noise. As such, diffusion models must rely on cumbersome methods like guidance or projected sampling to incorporate this information in the generative process. In our work, we propose Denoising Diffusion Bridge Models (DDBMs), a natural alternative to this paradigm based on diffusion bridges, a family of processes that interpolate between two paired distributions given as endpoints. Our method learns the score of the diffusion bridge from data and maps from one endpoint distribution to the other by solving a (stochastic) differential equation based on the learned score. Our method naturally unifies several classes of generative models, such as score-based diffusion models and OT-Flow-Matching, allowing us to adapt existing design and architectural choices to our more general problem. Empirically, we apply DDBMs to challenging image datasets in both pixel and latent space. On standard image translation problems, DDBMs achieve significant improvement over baseline methods, and, when we reduce the problem to image generation by setting the source distribution to random noise, DDBMs achieve comparable FID scores to state-of-the-art methods despite being built for a more general task.
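To make the bridge-sampling idea concrete, the following is a schematic sketch under strong simplifying assumptions: the learned score network is replaced by the closed-form score of a Brownian bridge, and an Euler-Maruyama loop integrates the corresponding SDE from a source sample toward its paired target. It illustrates the mechanism only and is not the DDBM training or sampling code.

```python
import numpy as np

def score_fn(x, x_target, t, T=1.0, sigma=1.0):
    # Placeholder: analytic score of a Brownian bridge pinned at x_target at time T.
    # A trained score network would replace this closed form.
    return (x_target - x) / (sigma**2 * max(T - t, 1e-4))

def sample_bridge(x_source, x_target, steps=200, T=1.0, sigma=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    x, dt = np.array(x_source, dtype=float), T / steps
    for i in range(steps - 1):          # stop one step before T to avoid the pinned-endpoint singularity
        t = i * dt
        drift = sigma**2 * score_fn(x, np.asarray(x_target, float), t, T, sigma)
        x = x + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

# Toy "translation": move a 2-D source point toward a paired target point.
print(sample_bridge([0.0, 0.0], [3.0, -1.0]))   # ends near the target, up to residual noise
```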

Asynchrony-Robust Collaborative Perception via Bird’s Eye View Flow

  • paper_url: http://arxiv.org/abs/2309.16940
  • repo_url: https://github.com/MediaBrain-SJTU/CoBEVFlow
  • paper_authors: Sizhe Wei, Yuxi Wei, Yue Hu, Yifan Lu, Yiqi Zhong, Siheng Chen, Ya Zhang
  • for: This paper addresses temporal asynchrony in multi-agent collaboration, which causes information mismatch during fusion and degrades perception accuracy in real-world scenarios.
  • methods: The proposed method, CoBEVFlow, uses a bird's eye view (BEV) flow to compensate for motion and align asynchronous collaboration messages sent by multiple agents, enabling robust and efficient collaboration even in extremely asynchronous settings (a toy warping sketch follows this entry).
  • results: The paper presents extensive experiments conducted on both synthetic and real-world datasets, which demonstrate the efficacy of CoBEVFlow in mitigating the impact of asynchrony and outperforming other baselines. The code for CoBEVFlow is available online.
    Abstract Collaborative perception can substantially boost each agent's perception ability by facilitating communication among multiple agents. However, temporal asynchrony among agents is inevitable in the real world due to communication delays, interruptions, and clock misalignments. This issue causes information mismatch during multi-agent fusion, seriously shaking the foundation of collaboration. To address this issue, we propose CoBEVFlow, an asynchrony-robust collaborative perception system based on bird's eye view (BEV) flow. The key intuition of CoBEVFlow is to compensate motions to align asynchronous collaboration messages sent by multiple agents. To model the motion in a scene, we propose BEV flow, which is a collection of the motion vector corresponding to each spatial location. Based on BEV flow, asynchronous perceptual features can be reassigned to appropriate positions, mitigating the impact of asynchrony. CoBEVFlow has two advantages: (i) CoBEVFlow can handle asynchronous collaboration messages sent at irregular, continuous time stamps without discretization; and (ii) with BEV flow, CoBEVFlow only transports the original perceptual features, instead of generating new perceptual features, avoiding additional noises. To validate CoBEVFlow's efficacy, we create IRregular V2V(IRV2V), the first synthetic collaborative perception dataset with various temporal asynchronies that simulate different real-world scenarios. Extensive experiments conducted on both IRV2V and the real-world dataset DAIR-V2X show that CoBEVFlow consistently outperforms other baselines and is robust in extremely asynchronous settings. The code is available at https://github.com/MediaBrain-SJTU/CoBEVFlow.
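A toy sketch of the motion-compensation step described above (my own illustration, not the CoBEVFlow code): per-cell BEV flow vectors, scaled by the message's time lag, are used to move the sender's original features to time-aligned cells without synthesizing new features. The grid size, flow units, and nearest-cell rounding are assumptions made for the example.

```python
import numpy as np

def warp_bev(features, flow, lag):
    """features: (H, W, C) BEV features; flow: (H, W, 2) displacement per second; lag: seconds."""
    H, W, _ = features.shape
    warped = np.zeros_like(features)
    for i in range(H):
        for j in range(W):
            di, dj = np.round(flow[i, j] * lag).astype(int)   # displacement accumulated over the lag
            ni, nj = i + di, j + dj
            if 0 <= ni < H and 0 <= nj < W:
                warped[ni, nj] = features[i, j]               # transport the original feature; nothing new is generated
    return warped

feats = np.zeros((8, 8, 4)); feats[2, 2] = 1.0                # one "detected object" feature at cell (2, 2)
flow = np.zeros((8, 8, 2)); flow[2, 2] = [1.0, 2.0]           # that object moves 1 row and 2 columns per second
aligned = warp_bev(feats, flow, lag=1.0)                       # the collaborator's message is 1 s stale
print(np.argwhere(aligned[..., 0] > 0))                        # [[3 4]]: feature reassigned to its time-aligned cell
```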

PC-Adapter: Topology-Aware Adapter for Efficient Domain Adaption on Point Clouds with Rectified Pseudo-label

  • paper_url: http://arxiv.org/abs/2309.16936
  • repo_url: None
  • paper_authors: Joonhyung Park, Hyunjin Seo, Eunho Yang
  • for: To address domain adaptation for point clouds, where shifts in data distribution caused by varying object scales, sensor angles, and self-occlusion make real-world point clouds hard to understand.
  • methods: Proposes PC-Adapter, which preserves the global geometry of the source domain with an attention-based adapter while learning the local characteristics of the target domain with a second adapter equipped with graph convolution; it also introduces a pseudo-labeling strategy that adjusts confidence scores using class-wise confidence distributions to counter classifier bias (a toy version of this rectification follows this entry).
  • results: Outperforms baselines under various domain-shift settings on the PointDA, GraspNetPC, and PointSegDA benchmarks.
    Abstract Understanding point clouds captured from the real-world is challenging due to shifts in data distribution caused by varying object scales, sensor angles, and self-occlusion. Prior works have addressed this issue by combining recent learning principles such as self-supervised learning, self-training, and adversarial training, which leads to significant computational overhead. Toward succinct yet powerful domain adaptation for point clouds, we revisit the unique challenges of point cloud data under domain shift scenarios and discover the importance of the global geometry of source data and trends of target pseudo-labels biased to the source label distribution. Motivated by our observations, we propose an adapter-guided domain adaptation method, PC-Adapter, that preserves the global shape information of the source domain using an attention-based adapter, while learning the local characteristics of the target domain via another adapter equipped with graph convolution. Additionally, we propose a novel pseudo-labeling strategy resilient to the classifier bias by adjusting confidence scores using their class-wise confidence distributions to consider relative confidences. Our method demonstrates superiority over baselines on various domain shift settings in benchmark datasets - PointDA, GraspNetPC, and PointSegDA.
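The rectification idea can be illustrated with a small sketch: confidences are standardized within each predicted class before selecting pseudo-labels, so the classes a source-biased classifier is systematically over-confident about do not dominate. The exact thresholding below is an assumption made for illustration, not the paper's rule.

```python
import numpy as np

def rectified_pseudo_labels(probs, keep_ratio=0.5):
    """probs: (N, K) softmax outputs on target data -> list of (sample index, pseudo-label)."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    relative = np.empty_like(conf)
    for k in np.unique(pred):
        mask = pred == k
        mu, sigma = conf[mask].mean(), conf[mask].std() + 1e-8
        relative[mask] = (conf[mask] - mu) / sigma            # confidence relative to class k's own distribution
    cutoff = np.quantile(relative, 1.0 - keep_ratio)          # keep samples confident relative to their own class
    return [(i, int(pred[i])) for i in np.where(relative >= cutoff)[0]]

rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 3)); logits[:, 0] += 2.0        # class 0 is systematically over-confident
probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
print(rectified_pseudo_labels(probs))
```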

TranDRL: A Transformer-Driven Deep Reinforcement Learning Enabled Prescriptive Maintenance Framework

  • paper_url: http://arxiv.org/abs/2309.16935
  • repo_url: None
  • paper_authors: Yang Zhao, Wenbo Wang
  • for: To provide a reliable prescriptive maintenance strategy that improves the operational efficiency of industrial systems and reduces machine downtime.
  • methods: Proposes an integrated framework combining a transformer model with deep reinforcement learning (DRL): the transformer captures complex temporal patterns in sensor data to accurately predict the Remaining Useful Life (RUL) of equipment, while the DRL component provides cost-effective and timely maintenance recommendations (a toy RUL predictor is sketched after this entry).
  • results: Validated on the NASA C-MPASS dataset, where the framework shows significant gains in both RUL prediction accuracy and the optimization of maintenance actions, offering a data-driven methodology for prescriptive maintenance.
    Abstract Industrial systems demand reliable predictive maintenance strategies to enhance operational efficiency and reduce downtime. This paper introduces a novel, integrated framework that leverages the power of transformer neural networks and deep reinforcement learning (DRL) algorithms to optimize maintenance actions. Our approach employs the transformer model to effectively capture complex temporal patterns in sensor data, thereby accurately predicting the Remaining Useful Life (RUL) of equipment. Simultaneously, the DRL component of our framework provides cost-effective and timely maintenance recommendations. We validate the efficacy of our framework on the NASA C-MPASS dataset, where it demonstrates significant advancements in both RUL prediction accuracy and the optimization of maintenance actions. Consequently, our pioneering approach provides an innovative data-driven methodology for prescriptive maintenance, addressing key challenges in industrial operations and leading the way to more efficient, cost-effective, and reliable systems.
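For readers unfamiliar with the RUL-prediction half of the pipeline, here is a minimal stand-in (not the paper's architecture or hyperparameters): a small transformer encoder that maps a window of multivariate sensor readings to a scalar RUL estimate, which a DRL maintenance policy could then consume. The sensor count and window length are placeholders.

```python
import torch
import torch.nn as nn

class RULTransformer(nn.Module):
    def __init__(self, n_sensors=14, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(n_sensors, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)      # scalar RUL prediction

    def forward(self, x):                      # x: (batch, window, n_sensors)
        h = self.encoder(self.embed(x))
        return self.head(h[:, -1])             # predict RUL from the last time step's encoding

model = RULTransformer()
window = torch.randn(8, 30, 14)                # 8 engines, 30-cycle windows, 14 sensor channels (placeholders)
print(model(window).shape)                     # torch.Size([8, 1])
```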

Learning to Receive Help: Intervention-Aware Concept Embedding Models

  • paper_url: http://arxiv.org/abs/2309.16928
  • repo_url: https://github.com/mateoespinosa/cem
  • paper_authors: Mateo Espinosa Zarlenga, Katherine M. Collins, Krishnamurthy Dvijotham, Adrian Weller, Zohreh Shams, Mateja Jamnik
  • for: To address a shortcoming of Concept Bottleneck Models (CBMs): their lack of train-time incentives to be receptive to concept interventions, which makes intervention efficacy brittle at test time.
  • methods: Proposes Intervention-aware Concept Embedding models (IntCEMs), a new CBM-based architecture and training paradigm that learns a concept-intervention policy end-to-end and samples meaningful intervention trajectories at train time, conditioning the model to select and receive interventions effectively at test time (the basic intervention mechanism is sketched after this entry).
  • results: IntCEMs significantly outperform state-of-the-art concept-interpretable models when provided with test-time concept interventions.
    Abstract Concept Bottleneck Models (CBMs) tackle the opacity of neural architectures by constructing and explaining their predictions using a set of high-level concepts. A special property of these models is that they permit concept interventions, wherein users can correct mispredicted concepts and thus improve the model's performance. Recent work, however, has shown that intervention efficacy can be highly dependent on the order in which concepts are intervened on and on the model's architecture and training hyperparameters. We argue that this is rooted in a CBM's lack of train-time incentives for the model to be appropriately receptive to concept interventions. To address this, we propose Intervention-aware Concept Embedding models (IntCEMs), a novel CBM-based architecture and training paradigm that improves a model's receptiveness to test-time interventions. Our model learns a concept intervention policy in an end-to-end fashion from where it can sample meaningful intervention trajectories at train-time. This conditions IntCEMs to effectively select and receive concept interventions when deployed at test-time. Our experiments show that IntCEMs significantly outperform state-of-the-art concept-interpretable models when provided with test-time concept interventions, demonstrating the effectiveness of our approach.
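The test-time intervention mechanism that IntCEMs are trained to be receptive to can be illustrated with a generic concept-bottleneck toy (this shows the baseline mechanism only, not the IntCEM architecture or its learned intervention policy): a user overwrites selected predicted concepts with ground-truth values before the label predictor runs.

```python
import torch
import torch.nn as nn

class ToyCBM(nn.Module):
    def __init__(self, n_features=16, n_concepts=5, n_classes=3):
        super().__init__()
        self.concept_net = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                         nn.Linear(32, n_concepts), nn.Sigmoid())
        self.label_net = nn.Linear(n_concepts, n_classes)

    def forward(self, x, intervened_idx=None, true_concepts=None):
        c = self.concept_net(x)
        if intervened_idx is not None:                        # test-time concept intervention
            c = c.clone()
            c[:, intervened_idx] = true_concepts[:, intervened_idx]
        return self.label_net(c), c

model = ToyCBM()
x = torch.randn(4, 16)
true_c = torch.randint(0, 2, (4, 5)).float()
logits_plain, _ = model(x)
logits_fixed, _ = model(x, intervened_idx=[0, 3], true_concepts=true_c)  # expert corrects concepts 0 and 3
print(logits_plain.shape, logits_fixed.shape)
```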

Mode Connectivity and Data Heterogeneity of Federated Learning

  • paper_url: http://arxiv.org/abs/2309.16923
  • repo_url: None
  • paper_authors: Tailin Zhou, Jun Zhang, Danny H. K. Tsang
  • for: To study the relationship between client and global modes in federated learning, and where client updates drift under data heterogeneity.
  • methods: Uses mode connectivity, which measures performance change (connectivity) along parametric paths between different modes, combining empirical studies with a theoretical analysis based on mean-field theory and dropout stability (a small linear-connectivity probe is sketched after this entry).
  • results: Reducing data heterogeneity makes the connectivity along different paths more similar and creates more low-error overlap between client and global modes; a barrier appears when two global modes are connected linearly but disappears under non-linear mode connectivity, and a quantitative bound shows connectivity improves as heterogeneity decreases and trained models widen.
    Abstract Federated learning (FL) enables multiple clients to train a model while keeping their data private collaboratively. Previous studies have shown that data heterogeneity between clients leads to drifts across client updates. However, there are few studies on the relationship between client and global modes, making it unclear where these updates end up drifting. We perform empirical and theoretical studies on this relationship by utilizing mode connectivity, which measures performance change (i.e., connectivity) along parametric paths between different modes. Empirically, reducing data heterogeneity makes the connectivity on different paths more similar, forming more low-error overlaps between client and global modes. We also find that a barrier to connectivity occurs when linearly connecting two global modes, while it disappears with considering non-linear mode connectivity. Theoretically, we establish a quantitative bound on the global-mode connectivity using mean-field theory or dropout stability. The bound demonstrates that the connectivity improves when reducing data heterogeneity and widening trained models. Numerical results further corroborate our analytical findings.
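A minimal sketch of the linear mode-connectivity probe used in this line of analysis (a generic illustration, not the paper's code): interpolate between two trained parameter vectors and record the loss along the path; a pronounced bump in the middle is the connectivity barrier discussed above.

```python
import torch
import torch.nn as nn

def loss_along_path(model_a, model_b, make_model, data, target, n_points=11):
    crit, losses = nn.CrossEntropyLoss(), []
    sa, sb = model_a.state_dict(), model_b.state_dict()
    for k in range(n_points):
        alpha = k / (n_points - 1)
        merged = make_model()
        merged.load_state_dict({name: (1 - alpha) * sa[name] + alpha * sb[name] for name in sa})
        with torch.no_grad():
            losses.append(crit(merged(data), target).item())   # loss at this point on the linear path
    return losses

make_model = lambda: nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 4))
a, b = make_model(), make_model()                     # stand-ins for two client/global models of the same shape
x, y = torch.randn(64, 10), torch.randint(0, 4, (64,))
print(loss_along_path(a, b, make_model, x, y))        # connectivity profile along the linear path
```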

ACGAN-GNNExplainer: Auxiliary Conditional Generative Explainer for Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2309.16918
  • repo_url: None
  • paper_authors: Yiqiao Li, Jianlong Zhou, Yifei Dong, Niusha Shafiabady, Fang Chen
  • for: To propose a generator-discriminator based explainer for graph neural networks (GNNs) that produces more reliable explanations to support decision-making.
  • methods: ACGAN-GNNExplainer uses a generator to produce explanations for the input graphs while a discriminator oversees the generation process, ensuring explanation fidelity and improving accuracy (a schematic of the ACGAN-style training signal follows this entry).
  • results: Experiments on both synthetic and real-world graph datasets show that ACGAN-GNNExplainer outperforms existing GNN explainers.
    Abstract Graph neural networks (GNNs) have proven their efficacy in a variety of real-world applications, but their underlying mechanisms remain a mystery. To address this challenge and enable reliable decision-making, many GNN explainers have been proposed in recent years. However, these methods often encounter limitations, including their dependence on specific instances, lack of generalizability to unseen graphs, producing potentially invalid explanations, and yielding inadequate fidelity. To overcome these limitations, we, in this paper, introduce the Auxiliary Classifier Generative Adversarial Network (ACGAN) into the field of GNN explanation and propose a new GNN explainer dubbed~\emph{ACGAN-GNNExplainer}. Our approach leverages a generator to produce explanations for the original input graphs while incorporating a discriminator to oversee the generation process, ensuring explanation fidelity and improving accuracy. Experimental evaluations conducted on both synthetic and real-world graph datasets demonstrate the superiority of our proposed method compared to other existing GNN explainers.
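A schematic of the ACGAN-style training signal, written as a toy stand-in (MLPs over per-edge scores rather than an actual GNN, and an unweighted loss sum; all names and shapes are assumptions): the generator proposes an explanation mask conditioned on the class label, and the discriminator provides both an adversarial signal and an auxiliary classification signal that pushes explanations toward label consistency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_edges, n_classes = 20, 3
G = nn.Sequential(nn.Linear(n_edges + n_classes, 64), nn.ReLU(), nn.Linear(64, n_edges), nn.Sigmoid())
D_shared = nn.Sequential(nn.Linear(n_edges, 64), nn.ReLU())
D_adv, D_cls = nn.Linear(64, 1), nn.Linear(64, n_classes)     # real/fake head and auxiliary class head

edge_feat = torch.rand(8, n_edges)                      # toy per-edge scores of the input graphs
labels = torch.randint(0, n_classes, (8,))
onehot = torch.eye(n_classes)[labels]

mask = G(torch.cat([edge_feat, onehot], dim=1))         # generated explanation mask per graph
h = D_shared(mask)
adv_loss = F.binary_cross_entropy_with_logits(D_adv(h).squeeze(1), torch.ones(8))
cls_loss = F.cross_entropy(D_cls(h), labels)
gen_loss = adv_loss + cls_loss                          # generator objective: fool D and match the target class
print(mask.shape, float(gen_loss))
```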

ONNXExplainer: an ONNX Based Generic Framework to Explain Neural Networks Using Shapley Values

  • paper_url: http://arxiv.org/abs/2309.16916
  • repo_url: None
  • paper_authors: Yong Zhao, Runxin He, Nicholas Kersting, Can Liu, Shubham Agrawal, Chiranjeet Chetia, Yu Gu
  • for: To propose a generic framework in the ONNX ecosystem for explaining neural network predictions using Shapley values.
  • methods: Develops its own automatic differentiation and optimization approach, enabling one-shot deployment of inference and explanation while reducing memory consumption and compute cost; the same optimization is also implemented in TensorFlow and PyTorch for a fair comparison against SHAP (a generic Shapley-sampling illustration follows this entry).
  • results: Extensive benchmarks on VGG19, ResNet50, DenseNet201, and EfficientNetB0 show that the proposed optimization improves explanation latency by as much as 500% over the open-source counterpart SHAP.
    Abstract Understanding why a neural network model makes certain decisions can be as important as the inference performance. Various methods have been proposed to help practitioners explain the prediction of a neural network model, of which Shapley values are most popular. SHAP package is a leading implementation of Shapley values to explain neural networks implemented in TensorFlow or PyTorch but lacks cross-platform support, one-shot deployment and is highly inefficient. To address these problems, we present the ONNXExplainer, which is a generic framework to explain neural networks using Shapley values in the ONNX ecosystem. In ONNXExplainer, we develop its own automatic differentiation and optimization approach, which not only enables One-Shot Deployment of neural networks inference and explanations, but also significantly improves the efficiency to compute explanation with less memory consumption. For fair comparison purposes, we also implement the same optimization in TensorFlow and PyTorch and measure its performance against the current state of the art open-source counterpart, SHAP. Extensive benchmarks demonstrate that the proposed optimization approach improves the explanation latency of VGG19, ResNet50, DenseNet201, and EfficientNetB0 by as much as 500%.
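For context on the quantity being computed, here is a generic permutation-sampling estimator of Shapley values over input features. This illustrates Shapley attribution itself; it is not ONNXExplainer's algorithm, its ONNX-graph implementation, or its optimization.

```python
import numpy as np

def shapley_sampling(model, x, baseline, n_perm=200, rng=None):
    """Monte Carlo Shapley values: average marginal contributions over random feature orderings."""
    rng = rng or np.random.default_rng(0)
    d, phi = len(x), np.zeros(len(x))
    for _ in range(n_perm):
        order = rng.permutation(d)
        z = baseline.copy()
        prev = model(z)
        for j in order:                       # reveal features one by one in a random order
            z[j] = x[j]
            cur = model(z)
            phi[j] += cur - prev              # marginal contribution of feature j in this ordering
            prev = cur
    return phi / n_perm

model = lambda v: 3.0 * v[0] + 2.0 * v[1] * v[2]      # toy "network"
x, baseline = np.array([1.0, 1.0, 2.0]), np.zeros(3)
print(shapley_sampling(model, x, baseline))           # approx. [3.0, 2.0, 2.0]
```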

ASAP: Automated Sequence Planning for Complex Robotic Assembly with Physical Feasibility

  • paper_url: http://arxiv.org/abs/2309.16909
  • repo_url: None
  • paper_authors: Yunsheng Tian, Karl D. D. Willis, Bassel Al Omari, Jieliang Luo, Pingchuan Ma, Yichen Li, Farhad Javid, Edward Gu, Joshua Jacob, Shinjiro Sueda, Hui Li, Sachin Chitta, Wojciech Matusik
  • for: To automatically generate physically feasible assembly sequence plans for putting many parts together.
  • methods: A physics-based planner that accounts for gravity so that each sub-assembly is physically stable with a limited number of held parts and a support surface; efficient tree search reduces the combinatorial complexity of finding a sequence, and the search can be guided by geometric heuristics or by graph neural networks trained on simulation labels (a toy disassembly-search loop is sketched after this entry).
  • results: On a large dataset of hundreds of complex product assemblies, ASAP generates physically realistic assembly sequence plans, and its applicability is demonstrated in both simulation and real-world robotic setups.
    Abstract The automated assembly of complex products requires a system that can automatically plan a physically feasible sequence of actions for assembling many parts together. In this paper, we present ASAP, a physics-based planning approach for automatically generating such a sequence for general-shaped assemblies. ASAP accounts for gravity to design a sequence where each sub-assembly is physically stable with a limited number of parts being held and a support surface. We apply efficient tree search algorithms to reduce the combinatorial complexity of determining such an assembly sequence. The search can be guided by either geometric heuristics or graph neural networks trained on data with simulation labels. Finally, we show the superior performance of ASAP at generating physically realistic assembly sequence plans on a large dataset of hundreds of complex product assemblies. We further demonstrate the applicability of ASAP on both simulation and real-world robotic setups. Project website: asap.csail.mit.edu
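A toy sketch of the overall planning loop's shape (assembly-by-disassembly with a feasibility check at every step): the precedence relation and the `is_stable` predicate below stand in for ASAP's geometric reasoning and physics-based stability evaluation, which this snippet does not reproduce.

```python
def plan_assembly(parts, precedence, is_stable):
    """Greedy disassembly search: repeatedly remove a part whose removal leaves a stable,
    precedence-respecting sub-assembly; the reversed removal order is the assembly plan."""
    remaining, removal_order = set(parts), []
    while remaining:
        candidates = [p for p in remaining
                      if all(q not in remaining for q in precedence.get(p, []))  # p is not blocked
                      and is_stable(remaining - {p})]                            # rest stays stable
        if not candidates:
            return None                                   # dead end: no physically feasible next step
        part = candidates[0]
        remaining.remove(part)
        removal_order.append(part)
    return list(reversed(removal_order))                  # assemble in the reverse of the disassembly order

# Hypothetical example: the "roof" must come off before the "walls"; everything rests on the "base".
precedence = {"walls": ["roof"], "base": ["walls", "roof"]}   # part -> parts that must be removed first
is_stable = lambda subassembly: "base" in subassembly or not subassembly  # the base supports everything
print(plan_assembly(["base", "walls", "roof"], precedence, is_stable))    # ['base', 'walls', 'roof']
```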