cs.AI - 2023-11-29

CG3D: Compositional Generation for Text-to-3D via Gaussian Splatting

  • paper_url: http://arxiv.org/abs/2311.17907
  • repo_url: https://github.com/asvilesov/CG3D
  • paper_authors: Alexander Vilesov, Pradyumna Chari, Achuta Kadambi
  • for: This work aims to provide a text-conditioned 3D asset generation method capable of producing detailed, multi-object scenes, resolving fundamental constraints of existing work.
  • methods: We propose a guidance framework built around explicit Gaussian radiance fields, parameterized to allow for compositions of objects, to achieve semantically and physically consistent scenes.
  • results: Our method achieves state-of-the-art results in object combinations and physical accuracy, even exceeding the guiding diffusion model.
    Abstract With the onset of diffusion-based generative models and their ability to generate text-conditioned images, content generation has received a massive invigoration. Recently, these models have been shown to provide useful guidance for the generation of 3D graphics assets. However, existing work in text-conditioned 3D generation faces fundamental constraints: (i) inability to generate detailed, multi-object scenes, (ii) inability to textually control multi-object configurations, and (iii) physically realistic scene composition. In this work, we propose CG3D, a method for compositionally generating scalable 3D assets that resolves these constraints. We find that explicit Gaussian radiance fields, parameterized to allow for compositions of objects, possess the capability to enable semantically and physically consistent scenes. By utilizing a guidance framework built around this explicit representation, we show state of the art results, capable of even exceeding the guiding diffusion model in terms of object combinations and physics accuracy.
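
As a rough illustration of the explicit representation described above, the sketch below composes per-object sets of 3D Gaussians into a single scene by applying a rigid transform to each object; the parameterization and the `compose` helper are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: composing explicit Gaussian radiance fields by moving each
# object's Gaussians into a shared scene frame, then concatenating them.
import numpy as np

def compose(objects):
    """objects: list of (means Nx3, covs Nx3x3, colors Nx3, opacities N, R 3x3, t 3)."""
    means, covs, colors, alphas = [], [], [], []
    for mu, cov, col, a, R, t in objects:
        means.append(mu @ R.T + t)    # rotate and translate Gaussian centers
        covs.append(R @ cov @ R.T)    # rotate covariances: R Sigma R^T
        colors.append(col)
        alphas.append(a)
    return (np.concatenate(means), np.concatenate(covs),
            np.concatenate(colors), np.concatenate(alphas))
```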

SODA: Bottleneck Diffusion Models for Representation Learning

  • paper_url: http://arxiv.org/abs/2311.17901
  • repo_url: None
  • paper_authors: Drew A. Hudson, Daniel Zoran, Mateusz Malinowski, Andrew K. Lampinen, Andrew Jaegle, James L. McClelland, Loic Matthey, Felix Hill, Alexander Lerchner
  • for: This paper investigates diffusion models as representation learners.
  • methods: SODA is a self-supervised diffusion model that incorporates an image encoder, which distills a source view into a compact representation that in turn guides the generation of related novel views.
  • results: The study finds that imposing a tight bottleneck between the encoder and the denoising decoder, and leveraging novel view synthesis as a self-supervised objective, turns diffusion models into strong representation learners that capture visual semantics in an unsupervised manner.
    Abstract We introduce SODA, a self-supervised diffusion model, designed for representation learning. The model incorporates an image encoder, which distills a source view into a compact representation, that, in turn, guides the generation of related novel views. We show that by imposing a tight bottleneck between the encoder and a denoising decoder, and leveraging novel view synthesis as a self-supervised objective, we can turn diffusion models into strong representation learners, capable of capturing visual semantics in an unsupervised manner. To the best of our knowledge, SODA is the first diffusion model to succeed at ImageNet linear-probe classification, and, at the same time, it accomplishes reconstruction, editing and synthesis tasks across a wide range of datasets. Further investigation reveals the disentangled nature of its emergent latent space, that serves as an effective interface to control and manipulate the model's produced images. All in all, we aim to shed light on the exciting and promising potential of diffusion models, not only for image generation, but also for learning rich and robust representations.
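
The bottleneck design described in the abstract can be sketched as follows; the module interfaces and noise-schedule handling are assumptions for illustration, not SODA's actual code.

```python
# Hedged sketch: an encoder distills a source view into a compact z that
# conditions a denoiser trained to reconstruct a related novel view.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckDiffusion(nn.Module):
    def __init__(self, encoder: nn.Module, denoiser: nn.Module):
        super().__init__()
        self.encoder, self.denoiser = encoder, denoiser   # denoiser(x_t, t, z)

    def loss(self, source, target, alphas_bar):
        """alphas_bar: 1-D tensor of cumulative noise-schedule products."""
        z = self.encoder(source)                          # tight bottleneck
        t = torch.randint(len(alphas_bar), (target.size(0),))
        a = alphas_bar[t].view(-1, 1, 1, 1)
        eps = torch.randn_like(target)
        x_t = a.sqrt() * target + (1 - a).sqrt() * eps    # noised novel view
        return F.mse_loss(self.denoiser(x_t, t, z), eps)  # eps-prediction loss
```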

A Pipeline For Discourse Circuits From CCG

  • paper_url: http://arxiv.org/abs/2311.17892
  • repo_url: None
  • paper_authors: Jonathon Liu, Razin A. Shaikh, Benjamin Rodatz, Richie Yeung, Bob Coecke
  • for: bridging the divide between linguistic theory and modern NLP practice, and providing a neuro-symbolic model for meaning that incorporates linguistic structure.
  • methods: using Combinatory Categorial Grammar (CCG) parses and coreference resolution information to convert English text into a simply-typed $\lambda$-calculus term, and then into a circuit diagram.
  • results: a software pipeline that achieves coverage over a large fragment of the English language, and enables the application of the DisCoCirc framework to NLP tasks using both classical and quantum approaches.
    Abstract There is a significant disconnect between linguistic theory and modern NLP practice, which relies heavily on inscrutable black-box architectures. DisCoCirc is a newly proposed model for meaning that aims to bridge this divide, by providing neuro-symbolic models that incorporate linguistic structure. DisCoCirc represents natural language text as a `circuit' that captures the core semantic information of the text. These circuits can then be interpreted as modular machine learning models. Additionally, DisCoCirc fulfils another major aim of providing an NLP model that can be implemented on near-term quantum computers. In this paper we describe a software pipeline that converts English text to its DisCoCirc representation. The pipeline achieves coverage over a large fragment of the English language. It relies on Combinatory Categorial Grammar (CCG) parses of the input text as well as coreference resolution information. This semantic and syntactic information is used in several steps to convert the text into a simply-typed $\lambda$-calculus term, and then into a circuit diagram. This pipeline will enable the application of the DisCoCirc framework to NLP tasks, using both classical and quantum approaches.
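
To make the intermediate representation concrete, here is a toy sketch of a simply-typed term built from a CCG-style parse; the classes and type names are illustrative, not the pipeline's actual data structures.

```python
# Hedged sketch: "Alice likes Bob" as a simply-typed lambda-calculus term.
from dataclasses import dataclass

@dataclass
class Var:
    name: str
    type: str

@dataclass
class App:
    func: object
    arg: object

# CCG assigns "likes" the category (S\NP)/NP, read here as the function type
# n -> n -> s; coreference resolution would reuse one Var per entity.
alice = Var("Alice", "n")
bob = Var("Bob", "n")
likes = Var("likes", "n -> n -> s")
sentence = App(App(likes, bob), alice)   # likes(Bob)(Alice) : s
```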

Maximum Entropy Model Correction in Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.17855
  • repo_url: None
  • paper_authors: Amin Rakhsha, Mete Kemertas, Mohammad Ghavamzadeh, Amir-massoud Farahmand
  • for: This work proposes an approach for planning with an approximate model that reduces the adverse impact of model error and, when the model is accurate enough, accelerates convergence to the true value function.
  • methods: The MaxEnt Model Correction (MoCo) procedure corrects the model's next-state distributions based on a Maximum Entropy density estimation formulation. Building on MoCo, the paper introduces the Model Correcting Value Iteration (MoCoVI) algorithm and its sample-based variant MoCoDyna.
  • results: MoCoVI and MoCoDyna are shown to converge much faster than conventional model-free algorithms and, unlike traditional model-based algorithms, to effectively utilize an approximate model while still converging to the correct value function.
    Abstract We propose and theoretically analyze an approach for planning with an approximate model in reinforcement learning that can reduce the adverse impact of model error. If the model is accurate enough, it accelerates the convergence to the true value function too. One of its key components is the MaxEnt Model Correction (MoCo) procedure that corrects the model's next-state distributions based on a Maximum Entropy density estimation formulation. Based on MoCo, we introduce the Model Correcting Value Iteration (MoCoVI) algorithm, and its sampled-based variant MoCoDyna. We show that MoCoVI and MoCoDyna's convergence can be much faster than the conventional model-free algorithms. Unlike traditional model-based algorithms, MoCoVI and MoCoDyna effectively utilize an approximate model and still converge to the correct value function.
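
A hedged sketch of the kind of Maximum Entropy correction the abstract describes: exponentially tilt the model's next-state distribution so that chosen feature expectations match those estimated from real transitions. The constraint set and features here are assumptions; the paper's exact formulation may differ.

```latex
\begin{aligned}
q^\star &= \operatorname*{arg\,min}_{q}\;
  \mathrm{KL}\!\left(q \,\middle\|\, \hat{P}(\cdot \mid s,a)\right)
  \quad \text{s.t.} \quad \mathbb{E}_{s' \sim q}\big[\phi_i(s')\big] = \mu_i ,\\
q^\star(s') &\propto \hat{P}(s' \mid s,a)\,
  \exp\!\Big(\textstyle\sum_i \lambda_i\,\phi_i(s')\Big),
\end{aligned}
```

where $\hat{P}$ is the approximate model, $\phi_i$ are feature functions, $\mu_i$ are their expectations under observed data, and the multipliers $\lambda_i$ are fit to satisfy the constraints.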

Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning

  • paper_url: http://arxiv.org/abs/2311.17842
  • repo_url: None
  • paper_authors: Yingdong Hu, Fanqi Lin, Tong Zhang, Li Yi, Yang Gao
  • for: This paper aims to equip robots with long-horizon, physically-grounded task-planning capabilities built on vision-language models (VLMs).
  • methods: It proposes Robotic Vision-Language Planning (ViLa), which directly integrates perceptual data into its reasoning and planning process, enriching the robot's commonsense knowledge of the visual world, including spatial layouts and object attributes.
  • results: Across a range of open-world manipulation tasks, ViLa outperforms existing planners based on large language models (LLMs) and naturally incorporates visual feedback into its reasoning and planning.
    Abstract In this study, we are interested in imbuing robots with the capability of physically-grounded task planning. Recent advancements have shown that large language models (LLMs) possess extensive knowledge useful in robotic tasks, especially in reasoning and planning. However, LLMs are constrained by their lack of world grounding and dependence on external affordance models to perceive environmental information, which cannot jointly reason with LLMs. We argue that a task planner should be an inherently grounded, unified multimodal system. To this end, we introduce Robotic Vision-Language Planning (ViLa), a novel approach for long-horizon robotic planning that leverages vision-language models (VLMs) to generate a sequence of actionable steps. ViLa directly integrates perceptual data into its reasoning and planning process, enabling a profound understanding of commonsense knowledge in the visual world, including spatial layouts and object attributes. It also supports flexible multimodal goal specification and naturally incorporates visual feedback. Our extensive evaluation, conducted in both real-robot and simulated environments, demonstrates ViLa's superiority over existing LLM-based planners, highlighting its effectiveness in a wide array of open-world manipulation tasks.
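
A minimal sketch of the closed-loop pattern the abstract suggests; `query_vlm` is a hypothetical stub standing in for a vision-language model call, not an API from the paper.

```python
# Hedged sketch: a ViLa-style loop that re-queries the VLM with the current
# observation, so visual feedback is folded into every planning step.
def query_vlm(image, prompt: str) -> str:
    raise NotImplementedError("stand-in for a VLM such as GPT-4V")

def vila_plan(goal: str, get_observation, execute_step, max_steps: int = 10):
    for _ in range(max_steps):
        image = get_observation()      # current camera view of the scene
        step = query_vlm(image, f"Goal: {goal}\nWhat is the next step? "
                                "Reply 'done' if the goal is achieved.")
        if step.strip().lower() == "done":
            break
        execute_step(step)             # hand the step to a low-level policy
```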

Analyzing and Explaining Image Classifiers via Diffusion Guidance

  • paper_url: http://arxiv.org/abs/2311.17833
  • repo_url: None
  • paper_authors: Maximilian Augustin, Yannic Neuhaus, Matthias Hein
  • for: This paper addresses the reliability and explainability of deep-learning image classifiers in real-world use.
  • methods: It uses a framework for guided image generation to produce images that optimize a classifier-derived objective, and analyzes classifier behavior and decisions via visual counterfactual explanations (VCEs), detection of systematic mistakes on images where classifiers maximally disagree, and visualization of neurons to verify potential spurious features.
  • results: The analysis validates existing observations, such as the shape bias of adversarially robust models, and reveals novel failure modes, such as systematic errors of zero-shot CLIP classifiers; the proposed VCEs also outperform previous work while being more versatile.
    Abstract While deep learning has led to huge progress in complex image classification tasks like ImageNet, unexpected failure modes, e.g. via spurious features, call into question how reliably these classifiers work in the wild. Furthermore, for safety-critical tasks the black-box nature of their decisions is problematic, and explanations or at least methods which make decisions plausible are needed urgently. In this paper, we address these problems by generating images that optimize a classifier-derived objective using a framework for guided image generation. We analyze the behavior and decisions of image classifiers by visual counterfactual explanations (VCEs), detection of systematic mistakes by analyzing images where classifiers maximally disagree, and visualization of neurons to verify potential spurious features. In this way, we validate existing observations, e.g. the shape bias of adversarially robust models, as well as novel failure modes, e.g. systematic errors of zero-shot CLIP classifiers, or identify harmful spurious features. Moreover, our VCEs outperform previous work while being more versatile.
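
The guided-generation machinery the abstract builds on can be sketched with classic classifier guidance, where the classifier's gradient steers each reverse-diffusion step; the exact objectives and scaling used in the paper are not reproduced here.

```python
# Hedged sketch: classifier-guided noise estimate in the style of
# Dhariwal & Nichol (2021); `denoiser` and `classifier` are assumed modules,
# and alpha_bar_t is the cumulative noise-schedule tensor for step t.
import torch

def guided_eps(denoiser, classifier, x_t, t, y, alpha_bar_t, scale=1.0):
    eps = denoiser(x_t, t)                                # unconditional estimate
    with torch.enable_grad():
        x = x_t.detach().requires_grad_(True)
        log_p = classifier(x, t).log_softmax(-1)[torch.arange(len(y)), y].sum()
        grad = torch.autograd.grad(log_p, x)[0]           # d log p(y|x_t) / d x_t
    return eps - scale * (1 - alpha_bar_t).sqrt() * grad  # steer toward class y
```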

Anomalous Behavior Detection in Trajectory Data of Older Drivers

  • paper_url: http://arxiv.org/abs/2311.17822
  • repo_url: None
  • paper_authors: Seyedeh Gol Ara Ghoreishi, Sonia Moshfeghi, Muhammad Tanveer Jan, Joshua Conniff, KwangSoo Yang, Jinwoo Jang, Borko Furht, Ruth Tappen, David Newman, Monica Rosselli, Jiannan Zhai
  • for: This study uses a road network and trajectory data to detect anomalous driving behaviors, including directional deviations, hard brakings, and accelerations.
  • methods: It proposes an Edge-Attributed Matrix that represents the key properties of temporally-detailed trajectory datasets for efficient and accurate anomaly detection.
  • results: Experiments on real-world datasets demonstrate that the approach identifies abnormal driving behaviors accurately.
    Abstract Given a road network and a set of trajectory data, the anomalous behavior detection (ABD) problem is to identify drivers that show significant directional deviations, hardbrakings, and accelerations in their trips. The ABD problem is important in many societal applications, including Mild Cognitive Impairment (MCI) detection and safe route recommendations for older drivers. The ABD problem is computationally challenging due to the large size of temporally-detailed trajectories dataset. In this paper, we propose an Edge-Attributed Matrix that can represent the key properties of temporally-detailed trajectory datasets and identify abnormal driving behaviors. Experiments using real-world datasets demonstrated that our approach identifies abnormal driving behaviors.
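
A hedged sketch of what an edge-attributed representation could look like; the attribute names and thresholds are illustrative assumptions, not the paper's definitions.

```python
# Accumulate per-road-edge behavior attributes from map-matched trajectories.
from collections import defaultdict

def edge_attributed_matrix(trajectories, map_match):
    """trajectories: iterable of trips, each a list of point dicts;
    map_match: maps a GPS point to a road-network edge id."""
    matrix = defaultdict(lambda: {"trips": 0, "hard_brakes": 0,
                                  "hard_accels": 0, "heading_dev": 0.0})
    for trip in trajectories:
        seen = set()
        for p in trip:
            e = map_match(p)
            attrs = matrix[e]
            if e not in seen:
                attrs["trips"] += 1
                seen.add(e)
            if p["accel"] < -3.0:            # m/s^2 thresholds are assumptions
                attrs["hard_brakes"] += 1
            elif p["accel"] > 3.0:
                attrs["hard_accels"] += 1
            attrs["heading_dev"] += abs(p["heading_error"])
    return matrix
```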

A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures

  • paper_url: http://arxiv.org/abs/2311.17815
  • repo_url: None
  • paper_authors: Fabrizio Ferrandi, Serena Curzel, Leandro Fiorin, Daniele Ielmini, Cristina Silvano, Francesco Conti, Alessio Burrello, Francesco Barchi, Luca Benini, Luciano Lavagno, Teodoro Urso, Enrico Calore, Sebastiano Fabio Schifano, Cristian Zambelli, Maurizio Palesi, Giuseppe Ascia, Enrico Russo, Nicola Petra, Davide De Caro, Gennaro Di Meo, Valeria Cardellini, Salvatore Filippone, Francesco Lo Presti, Francesco Silvestri, Paolo Palazzari, Stefania Perri
  • for: This paper surveys design methodologies and EDA tools for Deep Learning accelerators, together with recent research developments.
  • methods: It covers a range of approaches, including hardware-software co-design, high-level synthesis, customized compilers, and methodologies for design space exploration, modeling, and simulation.
  • results: The survey offers a holistic review of the most influential accelerator design methodologies and EDA tools proposed in recent years, along with their applications and trends in the Deep Learning field.
    Abstract In recent years, the field of Deep Learning has seen many disruptive and impactful advancements. Given the increasing complexity of deep neural networks, the need for efficient hardware accelerators has become more and more pressing to design heterogeneous HPC platforms. The design of Deep Learning accelerators requires a multidisciplinary approach, combining expertise from several areas, spanning from computer architecture to approximate computing, computational models, and machine learning algorithms. Several methodologies and tools have been proposed to design accelerators for Deep Learning, including hardware-software co-design approaches, high-level synthesis methods, specific customized compilers, and methodologies for design space exploration, modeling, and simulation. These methodologies aim to maximize the exploitable parallelism and minimize data movement to achieve high performance and energy efficiency. This survey provides a holistic review of the most influential design methodologies and EDA tools proposed in recent years to implement Deep Learning accelerators, offering the reader a wide perspective in this rapidly evolving field. In particular, this work complements the previous survey proposed by the same authors in [203], which focuses on Deep Learning hardware accelerators for heterogeneous HPC platforms.

Propagate & Distill: Towards Effective Graph Learners Using Propagation-Embracing MLPs

  • paper_url: http://arxiv.org/abs/2311.17781
  • repo_url: None
  • paper_authors: Yong-Min Shin, Won-Yong Shin
  • for: This work tackles semi-supervised node classification on graphs with multilayer perceptrons (MLPs), training a student MLP by knowledge distillation from a teacher graph neural network (GNN).
  • methods: Inspired by GNNs that separate feature transformation T and propagation Π, the distillation process is reframed so that the student MLP learns both T and Π.
  • results: The proposed Propagate & Distill (P&D) method propagates the teacher's output before distillation, which can be interpreted as an approximate process of the inverse propagation, and readily improves the performance of the student MLP.
    Abstract Recent studies attempted to utilize multilayer perceptrons (MLPs) to solve semisupervised node classification on graphs, by training a student MLP by knowledge distillation from a teacher graph neural network (GNN). While previous studies have focused mostly on training the student MLP by matching the output probability distributions between the teacher and student models during distillation, it has not been systematically studied how to inject the structural information in an explicit and interpretable manner. Inspired by GNNs that separate feature transformation $T$ and propagation $\Pi$, we re-frame the distillation process as making the student MLP learn both $T$ and $\Pi$. Although this can be achieved by applying the inverse propagation $\Pi^{-1}$ before distillation from the teacher, it still comes with a high computational cost from large matrix multiplications during training. To solve this problem, we propose Propagate & Distill (P&D), which propagates the output of the teacher before distillation, which can be interpreted as an approximate process of the inverse propagation. We demonstrate that P&D can readily improve the performance of the student MLP.
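
A minimal sketch of the P&D idea as the abstract states it: smooth the teacher GNN's outputs over the graph before distilling them into the student MLP. The propagation depth and temperature here are assumptions.

```python
import torch
import torch.nn.functional as F

def propagate(teacher_logits, adj_norm, k: int = 2):
    out = teacher_logits
    for _ in range(k):          # approximates the effect of inverse propagation
        out = adj_norm @ out    # adj_norm: normalized adjacency (dense or sparse)
    return out

def distill_step(student_mlp, features, teacher_logits, adj_norm, opt, tau=1.0):
    target = F.softmax(propagate(teacher_logits, adj_norm) / tau, dim=-1)
    loss = F.kl_div(F.log_softmax(student_mlp(features) / tau, dim=-1),
                    target, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```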

Robustness Approaches for the Examination Timetabling Problem under Data Uncertainty

  • paper_url: http://arxiv.org/abs/2311.17766
  • repo_url: None
  • paper_authors: Bernd Bassimir, Rolf Wanka
  • for: These methods target the examination timetabling problem (ETTP), particularly the common case where universities schedule exams before students register.
  • methods: The paper discusses several approaches from the robust optimization literature, their implications for the ETTP, and how the most favorable ones can be applied.
  • results: The impact of possible implementations of these robustness approaches is analyzed on two real-world instances and on several random instances produced by the instance generation framework introduced in this work.
    Abstract In the literature the examination timetabling problem (ETTP) is often considered a post-enrollment problem (PE-ETTP). In the real world, universities often schedule their exams before students register using information from previous terms. A direct consequence of this approach is the uncertainty present in the resulting models. In this work we discuss several approaches available in the robust optimization literature. We consider the implications of each approach in respect to the examination timetabling problem and present how the most favorable approaches can be applied to the ETTP. Afterwards we analyze the impact of some possible implementations of the given robustness approaches on two real world instances and several random instances generated by our instance generation framework which we introduce in this work.

Addressing Membership Inference Attack in Federated Learning with Model Compression

  • paper_url: http://arxiv.org/abs/2311.17750
  • repo_url: https://github.com/negedng/ma-fl-mia
  • paper_authors: Gergely Dániel Németh, Miguel Ángel Lozano, Novi Quadrianto, Nuria Oliver
  • for: Protecting the privacy of client data in Federated Learning.
  • methods: Applying model compression on the clients while keeping a full model on the server.
  • results: On the CIFAR-10, CIFAR-100, and FEMNIST vision datasets, the approach preserves the privacy of both clients and server while achieving competitive classification accuracies.
    Abstract Federated Learning (FL) has been proposed as a privacy-preserving solution for machine learning. However, recent works have shown that Federated Learning can leak private client data through membership attacks. In this paper, we show that the effectiveness of these attacks on the clients negatively correlates with the size of the client datasets and model complexity. Based on this finding, we propose model-agnostic Federated Learning as a privacy-enhancing solution because it enables the use of models of varying complexity in the clients. To this end, we present $\texttt{MaPP-FL}$, a novel privacy-aware FL approach that leverages model compression on the clients while keeping a full model on the server. We compare the performance of $\texttt{MaPP-FL}$ against state-of-the-art model-agnostic FL methods on the CIFAR-10, CIFAR-100, and FEMNIST vision datasets. Our experiments show the effectiveness of $\texttt{MaPP-FL}$ in preserving the clients' and the server's privacy while achieving competitive classification accuracies.
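
To illustrate the idea of heterogeneous client capacities (not the MaPP-FL algorithm itself), the sketch below slices a full server weight matrix into narrower client copies and averages client updates back into the matching rows; the slicing scheme is an assumption.

```python
import torch

def slice_weights(full_w: torch.Tensor, ratio: float) -> torch.Tensor:
    rows = max(1, int(full_w.shape[0] * ratio))
    return full_w[:rows].clone()              # client trains a narrower sub-model

def aggregate(full_w: torch.Tensor, client_ws: list) -> torch.Tensor:
    new_w = full_w.clone()
    acc = torch.zeros_like(full_w)
    counts = torch.zeros(full_w.shape[0])
    for w in client_ws:                       # clients may use different widths
        acc[: w.shape[0]] += w
        counts[: w.shape[0]] += 1
    mask = counts > 0
    new_w[mask] = acc[mask] / counts[mask].unsqueeze(-1)
    return new_w                              # server keeps the full model
```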

Mukhyansh: A Headline Generation Dataset for Indic Languages

  • paper_url: http://arxiv.org/abs/2311.17743
  • repo_url: None
  • paper_authors: Lokesh Madasu, Gopichand Kanumolu, Nirmal Surange, Manish Shrivastava
  • for: This paper provides a large multilingual dataset for headline generation in Indic languages.
  • methods: Several state-of-the-art baseline models are evaluated on the dataset and compared empirically.
  • results: Training on Mukhyansh yields an average ROUGE-L score of 31.43 across all eight languages, outperforming all other models.
    Abstract The task of headline generation within the realm of Natural Language Processing (NLP) holds immense significance, as it strives to distill the true essence of textual content into concise and attention-grabbing summaries. While noteworthy progress has been made in headline generation for widely spoken languages like English, there persist numerous challenges when it comes to generating headlines in low-resource languages, such as the rich and diverse Indian languages. A prominent obstacle that specifically hinders headline generation in Indian languages is the scarcity of high-quality annotated data. To address this crucial gap, we proudly present Mukhyansh, an extensive multilingual dataset, tailored for Indian language headline generation. Comprising an impressive collection of over 3.39 million article-headline pairs, Mukhyansh spans across eight prominent Indian languages, namely Telugu, Tamil, Kannada, Malayalam, Hindi, Bengali, Marathi, and Gujarati. We present a comprehensive evaluation of several state-of-the-art baseline models. Additionally, through an empirical analysis of existing works, we demonstrate that Mukhyansh outperforms all other models, achieving an impressive average ROUGE-L score of 31.43 across all 8 languages.

Fair Text-to-Image Diffusion via Fair Mapping

  • paper_url: http://arxiv.org/abs/2311.17695
  • repo_url: None
  • paper_authors: Jia Li, Lijie Hu, Jingfeng Zhang, Tianhang Zheng, Hua Zhang, Di Wang
  • for: Improving the fairness and diversity of text-to-image diffusion models so that even human-related descriptions yield demographically fair images.
  • methods: The paper proposes Fair Mapping, a general, model-agnostic, and lightweight approach that controls the prompt to achieve fair image generation; training only updates a small set of parameters in an additional linear mapping network, which keeps optimization fast.
  • results: Experiments on face image generation show that the method effectively mitigates generation bias caused by language biases, producing fairer and more diverse images.
    Abstract In this paper, we address the limitations of existing text-to-image diffusion models in generating demographically fair results when given human-related descriptions. These models often struggle to disentangle the target language context from sociocultural biases, resulting in biased image generation. To overcome this challenge, we propose Fair Mapping, a general, model-agnostic, and lightweight approach that modifies a pre-trained text-to-image model by controlling the prompt to achieve fair image generation. One key advantage of our approach is its high efficiency. The training process only requires updating a small number of parameters in an additional linear mapping network. This not only reduces the computational cost but also accelerates the optimization process. We first demonstrate the issue of bias in generated results caused by language biases in text-guided diffusion models. By developing a mapping network that projects language embeddings into an unbiased space, we enable the generation of relatively balanced demographic results based on a keyword specified in the prompt. With comprehensive experiments on face image generation, we show that our method significantly improves image generation performance when prompted with descriptions related to human faces. By effectively addressing the issue of bias, we produce more fair and diverse image outputs. This work contributes to the field of text-to-image generation by enhancing the ability to generate images that accurately reflect the intended demographic characteristics specified in the text.
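
The abstract describes the trainable component as a single additional linear mapping on text embeddings; a hedged sketch of that idea (initialization and placement are assumptions) follows.

```python
import torch.nn as nn

class FairMapping(nn.Module):
    """Small trainable map applied to frozen text-encoder embeddings before
    they condition the (also frozen) diffusion model."""
    def __init__(self, dim: int):
        super().__init__()
        self.map = nn.Linear(dim, dim)   # the only parameters updated in training
        nn.init.eye_(self.map.weight)    # start as the identity mapping
        nn.init.zeros_(self.map.bias)

    def forward(self, text_embeddings):  # (batch, seq_len, dim)
        return self.map(text_embeddings)
```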

AviationGPT: A Large Language Model for the Aviation Domain

  • paper_url: http://arxiv.org/abs/2311.17686
  • repo_url: None
  • paper_authors: Liya Wang, Jason Chou, Xin Zhou, Alex Tien, Diane M Baumgartner
  • for: Improving natural language processing (NLP) capabilities in the aviation domain and enhancing the efficiency and safety of National Airspace System (NAS) operations.
  • methods: AviationGPT is built on the open-source LLaMA-2 and Mistral architectures and continuously trained on a wealth of carefully curated aviation datasets to make better use of aviation text data.
  • results: Experimental results show that AviationGPT helps users tackle diverse NLP problems (e.g., question-answering, summarization, document writing, information extraction, report querying, data cleaning, and interactive data exploration), provides accurate and contextually relevant responses within the aviation domain, and significantly improves performance (over a 40% gain in tested cases).
    Abstract The advent of ChatGPT and GPT-4 has captivated the world with large language models (LLMs), demonstrating exceptional performance in question-answering, summarization, and content generation. The aviation industry is characterized by an abundance of complex, unstructured text data, replete with technical jargon and specialized terminology. Moreover, labeled data for model building are scarce in this domain, resulting in low usage of aviation text data. The emergence of LLMs presents an opportunity to transform this situation, but there is a lack of LLMs specifically designed for the aviation domain. To address this gap, we propose AviationGPT, which is built on open-source LLaMA-2 and Mistral architectures and continuously trained on a wealth of carefully curated aviation datasets. Experimental results reveal that AviationGPT offers users multiple advantages, including the versatility to tackle diverse natural language processing (NLP) problems (e.g., question-answering, summarization, document writing, information extraction, report querying, data cleaning, and interactive data exploration). It also provides accurate and contextually relevant responses within the aviation domain and significantly improves performance (e.g., over a 40% performance gain in tested cases). With AviationGPT, the aviation industry is better equipped to address more complex research problems and enhance the efficiency and safety of National Airspace System (NAS) operations.

Improving Minority Stress Detection with Emotions

  • paper_url: http://arxiv.org/abs/2311.17676
  • repo_url: None
  • paper_authors: Jonathan Ivey, Susan Gauch
  • for: This study evaluates whether psychological stress models understand the language of sexual and gender minorities.
  • methods: The related task of minority stress detection is used to evaluate stress models; traditional psychological stress models are found to underperform on minority stress detection, and emotion-infused models are proposed to reduce that performance disparity.
  • results: Multi-task psychological stress models outperform the current state of the art for minority stress detection without directly training on minority stress data; explanatory analysis shows that minority communities have different emotion distributions than the general population and that emotion-infused models help stress models on underrepresented groups because of their effectiveness in low-data environments.
    Abstract Psychological stress detection is an important task for mental healthcare research, but there has been little prior work investigating the effectiveness of psychological stress models on minority individuals, who are especially vulnerable to poor mental health outcomes. In this work, we use the related task of minority stress detection to evaluate the ability of psychological stress models to understand the language of sexual and gender minorities. We find that traditional psychological stress models underperform on minority stress detection, and we propose using emotion-infused models to reduce that performance disparity. We further demonstrate that multi-task psychological stress models outperform the current state-of-the-art for minority stress detection without directly training on minority stress data. We provide explanatory analysis showing that minority communities have different distributions of emotions than the general population and that emotion-infused models improve the performance of stress models on underrepresented groups because of their effectiveness in low-data environments, and we propose that integrating emotions may benefit underrepresented groups in other mental health detection tasks.
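
A hedged sketch of a multi-task, emotion-infused architecture consistent with the abstract: a shared text encoder with separate stress and emotion heads, where the emotion task supplies signal that transfers to low-data minority stress detection. The module layout is an assumption.

```python
import torch.nn as nn

class EmotionInfusedStressModel(nn.Module):
    def __init__(self, encoder: nn.Module, hidden: int, n_emotions: int):
        super().__init__()
        self.encoder = encoder                   # e.g., a pretrained transformer
        self.stress_head = nn.Linear(hidden, 2)  # stressed vs. not stressed
        self.emotion_head = nn.Linear(hidden, n_emotions)

    def forward(self, tokens):
        h = self.encoder(tokens)                 # pooled (batch, hidden) output
        return self.stress_head(h), self.emotion_head(h)

# Training would optimize a weighted sum of the stress loss and emotion loss.
```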

Using Ornstein-Uhlenbeck Process to understand Denoising Diffusion Probabilistic Model and its Noise Schedules

  • paper_url: http://arxiv.org/abs/2311.17673
  • repo_url: None
  • paper_authors: Javier E. Santos, Yen Ting Lin
  • for: This short note shows that the Denoising Diffusion Probabilistic Model (DDPM), a non-homogeneous discrete-time Markov process, can be represented as a time-homogeneous continuous-time Markov process: the Ornstein-Uhlenbeck (OU) process.
  • methods: The authors use the analytical solution of the OU process to establish the formal equivalence between DDPM and the OU process, and show that designing a noise schedule for DDPM is equivalent to designing observation times for the OU process.
  • results: They present several heuristic designs for observation times based on principled quantities such as auto-variance and Fisher Information and connect them to ad hoc noise schedules for DDPM; interestingly, the Fisher-Information-motivated schedule corresponds exactly to the cosine schedule, the current state-of-the-art noise schedule.
    Abstract The aim of this short note is to show that the Denoising Diffusion Probabilistic Model (DDPM), a non-homogeneous discrete-time Markov process, can be represented by a time-homogeneous continuous-time Markov process observed at non-uniformly sampled discrete times. Surprisingly, this continuous-time Markov process is the well-known and well-studied Ornstein-Uhlenbeck (OU) process, which was developed in the 1930s for studying Brownian particles in harmonic potentials. We establish the formal equivalence between DDPM and the OU process using its analytical solution. We further demonstrate that the design problem of the noise scheduler for non-homogeneous DDPM is equivalent to designing observation times for the OU process. We present several heuristic designs for observation times based on principled quantities such as auto-variance and Fisher Information and connect them to ad hoc noise schedules for DDPM. Interestingly, we show that the Fisher-Information-motivated schedule corresponds exactly to the cosine schedule, which was developed without any theoretical foundation but is the current state-of-the-art noise schedule.
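
The correspondence stated in the abstract can be written out directly. Assuming the OU process is scaled to unit stationary variance ($\sigma^2 = 2\theta$), its exact solution matches the DDPM forward marginal:

```latex
\begin{aligned}
\mathrm{d}X_t &= -\theta X_t\,\mathrm{d}t + \sigma\,\mathrm{d}W_t,
\qquad
X_t = e^{-\theta t}X_0 + \sqrt{1 - e^{-2\theta t}}\,\varepsilon,\\
x_k &= \sqrt{\bar\alpha_k}\,x_0 + \sqrt{1 - \bar\alpha_k}\,\varepsilon
\qquad \text{(DDPM forward marginal)},
\end{aligned}
```

so DDPM step $k$ coincides with the OU process observed at time $\tau_k = -\ln\bar\alpha_k / (2\theta)$, and choosing the noise schedule $\{\bar\alpha_k\}$ is exactly choosing the observation times $\{\tau_k\}$.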

TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models

  • paper_url: http://arxiv.org/abs/2311.17667
  • repo_url: https://github.com/zchuz/timebench
  • paper_authors: Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Haotian Wang, Ming Liu, Bing Qin
  • for: This work provides a comprehensive benchmark for evaluating the temporal reasoning abilities of large language models (LLMs).
  • methods: TimeBench is a hierarchical temporal reasoning benchmark covering a broad spectrum of temporal reasoning phenomena; experiments on popular LLMs incorporate chain-of-thought prompting.
  • results: The results reveal a significant gap between state-of-the-art LLMs and humans, indicating that there is still a considerable distance to cover in temporal reasoning.
    Abstract Understanding time is a pivotal aspect of human cognition, crucial in the broader framework of grasping the intricacies of the world. Previous studies typically focus on specific aspects of time, lacking a comprehensive temporal reasoning benchmark. To address this issue, we propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark that covers a broad spectrum of temporal reasoning phenomena, which provides a thorough evaluation for investigating the temporal reasoning capabilities of large language models. We conduct extensive experiments on popular LLMs, such as GPT-4, LLaMA2, and Mistral, incorporating chain-of-thought prompting. Our experimental results indicate a significant performance gap between the state-of-the-art LLMs and humans, highlighting that there is still a considerable distance to cover in temporal reasoning. We aspire for TimeBench to serve as a comprehensive benchmark, fostering research in temporal reasoning for LLMs. Our resource is available at https://github.com/zchuz/TimeBench

Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes

  • paper_url: http://arxiv.org/abs/2311.17655
  • repo_url: None
  • paper_authors: Pavel Korshunov, Haolin Chen, Philip N. Garner, Sebastien Marcel
  • for: This study examines how audio-visual deepfakes undermine automatic identity recognition, noting that existing deepfake databases typically cover only visual or voice modalities.
  • methods: The authors build a high-quality audio-visual deepfake dataset (SWAN-DF) using several DeepFaceLab models and blending techniques for face swapping, plus the HiFiVC, DiffVC, YourTTS, and FreeVC models for voice conversion; a separate audio-only deepfake set (LibriTTS-DF) is built from the LibriTTS speech dataset using the YourTTS, Adaspeech, and TorToiSe text-to-speech methods.
  • results: By tuning existing pretrained deepfake models to specific identities, the authors spoof face and speaker recognition systems more than 90% of the time, producing fake videos of a given person that look and sound realistic.
    Abstract The task of deepfakes detection is far from being solved by speech or vision researchers. Several publicly available databases of fake synthetic video and speech were built to aid the development of detection methods. However, existing databases typically focus on visual or voice modalities and provide no proof that their deepfakes can in fact impersonate any real person. In this paper, we present the first realistic audio-visual database of deepfakes SWAN-DF, where lips and speech are well synchronized and video have high visual and audio qualities. We took the publicly available SWAN dataset of real videos with different identities to create audio-visual deepfakes using several models from DeepFaceLab and blending techniques for face swapping and HiFiVC, DiffVC, YourTTS, and FreeVC models for voice conversion. From the publicly available speech dataset LibriTTS, we also created a separate database of only audio deepfakes LibriTTS-DF using several latest text to speech methods: YourTTS, Adaspeech, and TorToiSe. We demonstrate the vulnerability of a state of the art speaker recognition system, such as ECAPA-TDNN-based model from SpeechBrain, to the synthetic voices. Similarly, we tested face recognition system based on the MobileFaceNet architecture to several variants of our visual deepfakes. The vulnerability assessment show that by tuning the existing pretrained deepfake models to specific identities, one can successfully spoof the face and speaker recognition systems in more than 90% of the time and achieve a very realistic looking and sounding fake video of a given person.

VIM: Probing Multimodal Large Language Models for Visual Embedded Instruction Following

  • paper_url: http://arxiv.org/abs/2311.17647
  • repo_url: None
  • paper_authors: Yujie Lu, Xiujun Li, William Yang Wang, Yejin Choi
  • for: Evaluating the visual instruction-following capability of Multimodal Large Language Models (MLLMs).
  • methods: The VISUAL EMBEDDED INSTRUCTION (VIM) framework embeds the instructions into the visual scenes, demanding strong visual interpretative skills; VIM is adapted to benchmarks including VQAv2, MME, MM-Vet, and the RefCOCO series, and models are probed under three in-context learning settings: Zero Shot, One Shot, and Pair Shot.
  • results: There is a significant performance gap between open-source MLLMs and GPT-4V, implying that their proficiency in visual instruction comprehension is not up to par.
    Abstract We introduce VISUAL EMBEDDED INSTRUCTION (VIM), a new framework designed to evaluate the visual instruction following capability of Multimodal Large Language Models (MLLMs). As illustrated in Figure 2, VIM challenges the MLLMs by embedding the instructions into the visual scenes, demanding strong visual interpretative skills for instruction following. We adapt VIM to various benchmarks, including VQAv2, MME, MM-Vet, and RefCOCO series, compose a VIM bench, and probe diverse MLLMs across three distinct in-context learning settings: Zero Shot, One Shot, and Pair Shot. We observe that there is a significant performance disparity between the open-source MLLMs and GPT-4V, implying that their proficiency in visual instruction comprehension is not up to par. Our results highlight a promising direction for the enhancement of MLLMs capabilities on instruction following. We aim VIM to serve as a useful norm for advancing the state of the art and driving further progress in the field.

Introduction to Transformers: an NLP Perspective

  • paper_url: http://arxiv.org/abs/2311.17633
  • repo_url: None
  • paper_authors: Tong Xiao, Jingbo Zhu
  • for: This work introduces the basic concepts of Transformer models and the key techniques behind their recent advances.
  • methods: It covers the standard Transformer architecture, a series of model refinements, and common applications.
  • results: It summarizes the key ideas that have shaped the field, offering insights into the strengths and limitations of these models.
    Abstract Transformers have dominated empirical machine learning models of natural language processing. In this paper, we introduce basic concepts of Transformers and present key techniques that form the recent advances of these models. This includes a description of the standard Transformer architecture, a series of model refinements, and common applications. Given that Transformers and related deep learning techniques might be evolving in ways we have never seen, we cannot dive into all the model details or cover all the technical areas. Instead, we focus on just those concepts that are helpful for gaining a good understanding of Transformers and their variants. We also summarize the key ideas that impact this field, thereby yielding some insights into the strengths and limitations of these models.
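
For reference, the core operation of the standard architecture the survey describes is scaled dot-product attention (Vaswani et al., 2017):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```

where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension.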

Continual Learning with Low Rank Adaptation

  • paper_url: http://arxiv.org/abs/2311.17601
  • repo_url: None
  • paper_authors: Martin Wistuba, Prabhu Teja Sivaprasad, Lukas Balles, Giovanni Zappella
  • for: This work studies continual learning, where a pretrained transformer is updated to perform well on new data while retaining its performance on data it was previously trained on.
  • methods: It applies Low Rank Adaptation (LoRA) to continual learning.
  • results: On a range of domain-incremental learning benchmarks, the LoRA-based solution, CoLoR, yields state-of-the-art performance while remaining as parameter-efficient as prompt-tuning-based methods.
    Abstract Recent work using pretrained transformers has shown impressive performance when fine-tuned with data from the downstream problem of interest. However, they struggle to retain that performance when the data characteristics changes. In this paper, we focus on continual learning, where a pre-trained transformer is updated to perform well on new data, while retaining its performance on data it was previously trained on. Earlier works have tackled this primarily through methods inspired from prompt tuning. We question this choice, and investigate the applicability of Low Rank Adaptation (LoRA) to continual learning. On a range of domain-incremental learning benchmarks, our LoRA-based solution, CoLoR, yields state-of-the-art performance, while still being as parameter efficient as the prompt tuning based methods.
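
For context, LoRA (the general technique named above, not the CoLoR method itself) adapts a frozen weight matrix with a trainable low-rank update; a standard sketch:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init
        self.scale = alpha / rank

    def forward(self, x):
        # Effective weight is W + scale * B @ A, with only A and B trainable.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```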

Improving embedding of graphs with missing data by soft manifolds

  • paper_url: http://arxiv.org/abs/2311.17598
  • repo_url: None
  • paper_authors: Andrea Marinoni, Pietro Lio’, Alessandro Barp, Christian Jutten, Mark Girolami
  • for: This paper aims to improve graph embeddings, a key ingredient of automatic information extraction algorithms applied to diverse tasks (e.g., learning, inferring, predicting).
  • methods: It introduces a new class of manifolds for graph embedding, called soft manifolds: mathematical structures with spherical symmetry whose tangent spaces are hypocycloids shaped according to the velocity of information propagation across the data points.
  • results: Experiments on reconstruction tasks over synthetic and real datasets show that soft manifolds handle modern real-life graphs, whose weighted connections are often computed over sparse datasets with missing records, yielding more accurate and reliable embeddings than the state of the art.
    Abstract Embedding graphs in continous spaces is a key factor in designing and developing algorithms for automatic information extraction to be applied in diverse tasks (e.g., learning, inferring, predicting). The reliability of graph embeddings directly depends on how much the geometry of the continuous space matches the graph structure. Manifolds are mathematical structure that can enable to incorporate in their topological spaces the graph characteristics, and in particular nodes distances. State-of-the-art of manifold-based graph embedding algorithms take advantage of the assumption that the projection on a tangential space of each point in the manifold (corresponding to a node in the graph) would locally resemble a Euclidean space. Although this condition helps in achieving efficient analytical solutions to the embedding problem, it does not represent an adequate set-up to work with modern real life graphs, that are characterized by weighted connections across nodes often computed over sparse datasets with missing records. In this work, we introduce a new class of manifold, named soft manifold, that can solve this situation. In particular, soft manifolds are mathematical structures with spherical symmetry where the tangent spaces to each point are hypocycloids whose shape is defined according to the velocity of information propagation across the data points. Using soft manifolds for graph embedding, we can provide continuous spaces to pursue any task in data analysis over complex datasets. Experimental results on reconstruction tasks on synthetic and real datasets show how the proposed approach enable more accurate and reliable characterization of graphs in continuous spaces with respect to the state-of-the-art.

LanGWM: Language Grounded World Model

  • paper_url: http://arxiv.org/abs/2311.17593
  • repo_url: None
  • paper_authors: Rudra P. K. Poudel, Harit Pandya, Chao Zhang, Roberto Cipolla
  • for: Improving the out-of-distribution generalization and robustness of deep reinforcement learning models.
  • methods: Language-grounded visual feature learning enhances world model learning; a transformer-based masked autoencoder predicts masked objects, with text prompts describing the masked regions.
  • results: The proposed LanGWM: Language Grounded World Model achieves state-of-the-art out-of-distribution performance on the 100K interaction steps benchmark of the iGibson point navigation tasks.
    Abstract Recent advances in deep reinforcement learning have showcased its potential in tackling complex tasks. However, experiments on visual control tasks have revealed that state-of-the-art reinforcement learning models struggle with out-of-distribution generalization. Conversely, expressing higher-level concepts and global contexts is relatively easy using language. Building upon recent success of the large language models, our main objective is to improve the state abstraction technique in reinforcement learning by leveraging language for robust action selection. Specifically, we focus on learning language-grounded visual features to enhance the world model learning, a model-based reinforcement learning technique. To enforce our hypothesis explicitly, we mask out the bounding boxes of a few objects in the image observation and provide the text prompt as descriptions for these masked objects. Subsequently, we predict the masked objects along with the surrounding regions as pixel reconstruction, similar to the transformer-based masked autoencoder approach. Our proposed LanGWM: Language Grounded World Model achieves state-of-the-art performance in out-of-distribution test at the 100K interaction steps benchmarks of iGibson point navigation tasks. Furthermore, our proposed technique of explicit language-grounded visual representation learning has the potential to improve models for human-robot interaction because our extracted visual features are language grounded.

Bias Resilient Multi-Step Off-Policy Goal-Conditioned Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.17565
  • repo_url: None
  • paper_authors: Lisheng Wu, Ke Chen
  • for: Addressing the sparse-reward challenges of goal-conditioned reinforcement learning (GCRL), where multi-step learning improves efficiency but introduces off-policy biases in target values; the paper categorizes these biases into two types, "shooting" and "shifting".
  • methods: The proposed solution capitalizes on the positive aspects of these biases while minimizing their drawbacks, enabling larger step sizes to speed up GCRL.
  • results: The approach delivers resilient and robust improvements even in ten-step learning scenarios, with learning efficiency and performance that generally surpass the baseline and several state-of-the-art multi-step GCRL benchmarks.
    Abstract In goal-conditioned reinforcement learning (GCRL), sparse rewards present significant challenges, often obstructing efficient learning. Although multi-step GCRL can boost this efficiency, it can also lead to off-policy biases in target values. This paper dives deep into these biases, categorizing them into two distinct categories: "shooting" and "shifting". Recognizing that certain behavior policies can hasten policy refinement, we present solutions designed to capitalize on the positive aspects of these biases while minimizing their drawbacks, enabling the use of larger step sizes to speed up GCRL. An empirical study demonstrates that our approach ensures a resilient and robust improvement, even in ten-step learning scenarios, leading to superior learning efficiency and performance that generally surpass the baseline and several state-of-the-art multi-step GCRL benchmarks.
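
For orientation, the generic multi-step target that such methods build on (not the paper's bias-resilient variant) looks like this; larger n propagates reward faster but bootstraps through possibly off-policy actions, which is where the "shooting" and "shifting" biases enter.

```python
def n_step_target(rewards, next_q: float, gamma: float, n: int) -> float:
    """rewards: r_0 .. r_{n-1} along a (possibly off-policy) trajectory;
    next_q: bootstrapped Q(s_n, a*, goal) under the current policy."""
    target = 0.0
    for i in range(n):
        target += (gamma ** i) * rewards[i]   # discounted observed rewards
    return target + (gamma ** n) * next_q     # bootstrap after n steps
```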

TaskWeaver: A Code-First Agent Framework

  • paper_url: http://arxiv.org/abs/2311.17541
  • repo_url: None
  • paper_authors: Bo Qiao, Liqun Li, Xu Zhang, Shilin He, Yu Kang, Chaoyun Zhang, Fangkai Yang, Hang Dong, Jue Zhang, Lu Wang, Minghua Ma, Pu Zhao, Si Qin, Xiaoting Qin, Chao Du, Yong Xu, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
  • for: Building LLM-powered autonomous agents for domain-specific data analytics tasks involving rich data structures.
  • methods: A code-first approach converts user requests into executable code and treats user-defined plugins as callable functions; it supports rich data structures, flexible plugin usage, and dynamic plugin selection, and leverages LLM coding capabilities for complex logic.
  • results: TaskWeaver offers a powerful and flexible framework for creating intelligent conversational agents that handle complex tasks and adapt to domain-specific scenarios. The code is open-sourced at https://github.com/microsoft/TaskWeaver/.
    Abstract Large Language Models (LLMs) have shown impressive abilities in natural language understanding and generation, leading to their use in applications such as chatbots and virtual assistants. However, existing LLM frameworks face limitations in handling domain-specific data analytics tasks with rich data structures. Moreover, they struggle with flexibility to meet diverse user requirements. To address these issues, TaskWeaver is proposed as a code-first framework for building LLM-powered autonomous agents. It converts user requests into executable code and treats user-defined plugins as callable functions. TaskWeaver provides support for rich data structures, flexible plugin usage, and dynamic plugin selection, and leverages LLM coding capabilities for complex logic. It also incorporates domain-specific knowledge through examples and ensures the secure execution of generated code. TaskWeaver offers a powerful and flexible framework for creating intelligent conversational agents that can handle complex tasks and adapt to domain-specific scenarios. The code is open-sourced at https://github.com/microsoft/TaskWeaver/.
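
A hedged sketch of the "plugins as callable functions" idea; the registry and plugin below are hypothetical illustrations, not TaskWeaver's actual API.

```python
PLUGINS = {}

def register_plugin(fn):
    PLUGINS[fn.__name__] = fn        # exposed to the LLM as a callable function
    return fn

@register_plugin
def anomaly_detection(df, column: str, threshold: float = 3.0):
    """Flag rows of a pandas DataFrame whose `column` z-score exceeds `threshold`."""
    z = (df[column] - df[column].mean()) / df[column].std()
    return df[z.abs() > threshold]

# The agent would translate "find anomalies in sales" into generated code
# that calls the plugin, e.g.: result = anomaly_detection(df, column="sales")
```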

Spinal Muscle Atrophy Disease Modelling as Bayesian Network

  • paper_url: http://arxiv.org/abs/2311.17521
  • repo_url: None
  • paper_authors: Mohammed Ezzat Helal, Manal Ezzat Helal, Sherif Fadel Fahmy
  • for: This paper studies molecular gene expression and disease modelling using Probabilistic Graphical Models and Bayesian inference.
  • methods: Spinal Muscle Atrophy Genome-Wide Association Study results are modelled and analyzed; genes up- and down-regulated in two stages of disease development are linked to prior knowledge published in the public domain, and a co-expression network is created and analyzed.
  • results: The Molecular Pathways triggered by these genes are identified, and the Bayesian posterior distributions are estimated using a variational analytical algorithm and a Markov chain Monte Carlo sampling algorithm.
    Abstract We investigate the molecular gene expressions studies and public databases for disease modelling using Probabilistic Graphical Models and Bayesian Inference. A case study on Spinal Muscle Atrophy Genome-Wide Association Study results is modelled and analyzed. The genes up and down-regulated in two stages of the disease development are linked to prior knowledge published in the public domain and co-expressions network is created and analyzed. The Molecular Pathways triggered by these genes are identified. The Bayesian inference posteriors distributions are estimated using a variational analytical algorithm and a Markov chain Monte Carlo sampling algorithm. Assumptions, limitations and possible future work are concluded.
    摘要 我们研究利用概率图模型和贝叶斯推断,基于分子基因表达研究和公共数据库进行疾病建模。我们以脊髓性肌肉萎缩症(Spinal Muscle Atrophy)的全基因组关联研究结果作为案例进行建模与分析:将疾病发展两个阶段中上调和下调的基因与公共领域已发表的先验知识相关联,构建并分析共表达网络,识别这些基因触发的分子通路。我们使用变分分析算法和马尔可夫链蒙特卡洛采样算法估计贝叶斯推断的后验分布,最后总结了假设、局限性和未来可能的工作。
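As a toy illustration of the MCMC side of the inference (the paper's actual model over gene expressions is far richer), here is a minimal random-walk Metropolis sampler approximating a posterior over a single parameter; the Gaussian likelihood and prior below are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_posterior(theta, data, prior_sd=2.0):
    """Unnormalized log-posterior for the mean of Gaussian observations
    with a Gaussian prior, standing in for one node's conditional in a
    Bayesian network."""
    log_prior = -0.5 * (theta / prior_sd) ** 2
    log_lik = -0.5 * np.sum((data - theta) ** 2)
    return log_prior + log_lik

def metropolis(data, n_samples=5000, step=0.5):
    theta, samples = 0.0, []
    for _ in range(n_samples):
        proposal = theta + rng.normal(0.0, step)
        log_ratio = log_posterior(proposal, data) - log_posterior(theta, data)
        if np.log(rng.uniform()) < log_ratio:  # accept or reject the move
            theta = proposal
        samples.append(theta)
    return np.array(samples)

data = rng.normal(1.5, 1.0, size=50)   # synthetic "expression" readings
posterior = metropolis(data)
print(posterior[1000:].mean())          # close to the true mean 1.5
```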

The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding

  • paper_url: http://arxiv.org/abs/2311.17518
  • repo_url: https://github.com/lorebianchi98/fg-ovd
  • paper_authors: Lorenzo Bianchi, Fabio Carrara, Nicola Messina, Claudio Gennaro, Fabrizio Falchi
  • for: 本研究旨在探讨现代大型视觉语言模型在开放词汇场景下进行物体检测时,能否准确地捕捉和分辨物体的细腻特征。
  • methods: 本研究使用动态词汇生成协议来测试当前领域的状态级方法是否能够正确地捕捉和分辨物体的细腻特征。我们还提供了一个增难的benchmark集和不同的识别性评价指标,以便进一步探讨不同的识别特性。
  • results: 我们发现,当前的方法在普通的开放词汇场景下表现良好,但在具有困难识别类的场景下,它们往往难以准确地捕捉和分辨物体的细腻特征。我们还发现,现有的识别方法在不同的识别特性下的表现存在很大差异。
    Abstract Recent advancements in large vision-language models enabled visual object detection in open-vocabulary scenarios, where object classes are defined in free-text formats during inference. In this paper, we aim to probe the state-of-the-art methods for open-vocabulary object detection to determine to what extent they understand fine-grained properties of objects and their parts. To this end, we introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects in the presence of hard-negative classes. We contribute with a benchmark suite of increasing difficulty and probing different properties like color, pattern, and material. We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol and find that most existing solutions, which shine in standard open-vocabulary benchmarks, struggle to accurately capture and distinguish finer object details. We conclude the paper by highlighting the limitations of current methodologies and exploring promising research directions to overcome the discovered drawbacks. Data and code are available at https://github.com/lorebianchi98/FG-OVD.
    摘要 近期大型视觉语言模型的进展使得开放词汇场景下的视觉目标检测成为可能,即在推理时以自由文本形式定义目标类别。本文旨在考察当前最先进的开放词汇目标检测方法,确定它们在多大程度上理解物体及其部件的细粒度属性。为此,我们提出了基于动态词汇生成的评估协议,以测试模型在存在困难负类的情况下能否检测、辨别并为物体赋予正确的细粒度描述。我们贡献了一套难度递增、考察颜色、图案和材质等不同属性的基准测试。我们进一步使用所提协议评估了多种最先进的开放词汇目标检测器,发现大多数在标准开放词汇基准上表现出色的现有方案,难以准确捕捉和区分更细粒度的物体细节。最后,我们指出了当前方法的局限性,并探讨了克服这些缺陷的有前景的研究方向。数据和代码可在 https://github.com/lorebianchi98/FG-OVD 获取。

Reinforcement Replaces Supervision: Query focused Summarization using Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.17514
  • repo_url: None
  • paper_authors: Swaroop Nath, Harshad Khadilkar, Pushpak Bhattacharyya
  • for: 这篇论文旨在提出一种基于强化学习的查询聚焦摘要(QfS)系统,用于根据查询从文档生成摘要。
  • methods: 论文采用强化学习方法,训练多个以不同奖励信号(ROUGE、BLEU和语义相似度)为目标的策略梯度网络;同时解决了在Transformer中使用强化学习与Teacher Forcing的冲突,并提出了一种新的段落嵌入方案,使强化学习训练更加有效。
  • results: 在基准数据集(ELI5)的ROUGE-L指标上,该方法较此前最先进方法提升了10个点。此外,在另一个基准数据集(DebatePedia)的零样本设置下,该方法取得了与专门在DebatePedia上训练的基线相当的结果。
    Abstract Query-focused Summarization (QfS) deals with systems that generate summaries from document(s) based on a query. Motivated by the insight that Reinforcement Learning (RL) provides a generalization to Supervised Learning (SL) for Natural Language Generation, and thereby performs better (empirically) than SL, we use an RL-based approach for this task of QfS. Additionally, we also resolve the conflict of employing RL in Transformers with Teacher Forcing. We develop multiple Policy Gradient networks, trained on various reward signals: ROUGE, BLEU, and Semantic Similarity, which lead to a 10-point improvement over the State-of-the-Art approach on the ROUGE-L metric for a benchmark dataset (ELI5). We also show performance of our approach in zero-shot setting for another benchmark dataset (DebatePedia) -- our approach leads to results comparable to baselines, which were specifically trained on DebatePedia. To aid the RL training, we propose a better semantic similarity reward, enabled by a novel Passage Embedding scheme developed using Cluster Hypothesis. Lastly, we contribute a gold-standard test dataset to further research in QfS and Long-form Question Answering (LfQA).
    摘要 查询聚焦摘要(QfS)是指根据查询从文档生成摘要的系统。受强化学习(RL)为自然语言生成提供监督学习(SL)的泛化、并在实证上优于SL的启发,我们采用基于RL的方法完成QfS任务。此外,我们还解决了在Transformer中同时使用RL与Teacher Forcing的冲突。我们开发了多个策略梯度网络,分别以ROUGE、BLEU和语义相似度作为奖励信号进行训练,在基准数据集(ELI5)的ROUGE-L指标上较此前最先进方法提升了10个点。我们还展示了该方法在另一个基准数据集(DebatePedia)上零样本设置下的表现,其结果与专门在DebatePedia上训练的基线相当。为辅助RL训练,我们基于聚类假设(Cluster Hypothesis)提出了一种新的段落嵌入方案,从而获得更好的语义相似度奖励。最后,我们贡献了一个黄金标准测试集,以促进QfS和长形式问答(LfQA)的进一步研究。
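A minimal sketch of the policy-gradient idea: sample a summary from the policy, score the whole sequence with a sequence-level reward (ROUGE/BLEU/semantic similarity in the paper; a toy token-overlap reward here), and apply REINFORCE. The tiny linear "policy" and the reward function are purely illustrative assumptions:

```python
import torch

vocab_size, hidden = 20, 16
policy = torch.nn.Linear(hidden, vocab_size)   # stand-in for a Transformer
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reward_fn(tokens, reference):
    # Stand-in for ROUGE: fraction of reference tokens covered.
    return len(set(tokens) & set(reference)) / len(set(reference))

state = torch.randn(hidden)          # stand-in for an encoder context
reference = [3, 7, 11]

for _ in range(200):
    logits = policy(state)
    dist = torch.distributions.Categorical(logits=logits)
    tokens = [dist.sample() for _ in range(3)]          # sample a "summary"
    reward = reward_fn([t.item() for t in tokens], reference)
    log_prob = torch.stack([dist.log_prob(t) for t in tokens]).sum()
    loss = -reward * log_prob                           # REINFORCE objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()

print(policy(state).topk(3).indices)  # tends toward the reference tokens
```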

Taiwan LLM: Bridging the Linguistic Divide with a Culturally Aligned Language Model

  • paper_url: http://arxiv.org/abs/2311.17487
  • repo_url: https://github.com/miulab/taiwan-llm
  • paper_authors: Yen-Ting Lin, Yun-Nung Chen
  • for: 本研究旨在填补繁体中文(尤其是台湾所用变体)语言模型的空白,构建能够理解并生成符合台湾语言与文化语境文本的语言模型。
  • methods: 本研究使用了全面的预训练语料和指令微调数据,对模型进行预训练与微调。
  • results: 评估结果显示,Taiwan LLM 在理解和生成繁体中文文本方面表现优异,超越了主要基于简体中文或英文训练的现有模型。
    Abstract In the realm of language models, the nuanced linguistic and cultural intricacies of Traditional Chinese, as spoken in Taiwan, have been largely overlooked. This paper introduces Taiwan LLM, a pioneering Large Language Model that specifically caters to the Traditional Chinese language, with a focus on the variant used in Taiwan. Leveraging a comprehensive pretraining corpus and instruction-finetuning datasets, we have developed a model that not only understands the complexities of Traditional Chinese but also embodies the cultural context of Taiwan. Taiwan LLM represents the first of its kind, a model that is not only linguistically accurate but also culturally resonant with its user base. Our evaluations demonstrate that Taiwan LLM achieves superior performance in understanding and generating Traditional Chinese text, outperforming existing models that are predominantly trained on Simplified Chinese or English. The open-source release of Taiwan LLM invites collaboration and further innovation, ensuring that the linguistic diversity of Chinese speakers is embraced and well-served. The model, datasets, and further resources are made publicly available to foster ongoing research and development in this field.
    摘要 在语言模型领域,台湾所使用的繁体中文在语言和文化上的细微特点长期被忽视。本文介绍了 Taiwan LLM,这是一种开创性的大型语言模型,专门针对台湾使用的繁体中文。通过全面的预训练语料库和指令微调数据集,我们开发了一个不仅理解繁体中文的复杂性,还能够体现台湾文化语境的模型。Taiwan LLM 是首个此类模型,不仅语言准确,而且在文化上与其用户群体产生共鸣。我们的评估结果表明,Taiwan LLM 在理解和生成繁体中文文本方面表现出色,超越了主要基于简体中文或英文训练的现有模型。Taiwan LLM 的开源发布邀请合作与进一步创新,确保中文使用者的语言多样性得到认可和服务。模型、数据集和其他资源均已公开,以促进该领域的持续研究与开发。

Distributed AI in Zero-touch Provisioning for Edge Networks: Challenges and Research Directions

  • paper_url: http://arxiv.org/abs/2311.17471
  • repo_url: None
  • paper_authors: Abhishek Hazra, Andrea Morichetta, Ilir Murturi, Lauri Lovén, Chinmaya Kumar Dehury, Victor Casamayor Pujol, Praveen Kumar Donta, Schahram Dustdar
  • for: 这篇论文旨在探讨 Zero-touch 网络中的智能资源分配策略,以及如何通过 Distributed Artificial Intelligence (DAI) 与 Zero-touch Provisioning (ZTP) 技术来实现无需人工干预的网络管理。
  • methods: 本论文使用 DAI 与 ZTP 技术来自动化网络设备管理,以减少人工干预。
  • results: 本论文显示了在边缘网络中应用 DAI 与 ZTP 技术可以实现高效、可扩展的资源分配策略,并且提出了未来研究方向以解决现有的限制。
    Abstract Zero-touch network is anticipated to inaugurate the generation of intelligent and highly flexible resource provisioning strategies where multiple service providers collaboratively offer computation and storage resources. This transformation presents substantial challenges to network administration and service providers regarding sustainability and scalability. This article combines Distributed Artificial Intelligence (DAI) with Zero-touch Provisioning (ZTP) for edge networks. This combination helps to manage network devices seamlessly and intelligently by minimizing human intervention. In addition, several advantages are also highlighted that come with incorporating Distributed AI into ZTP in the context of edge networks. Further, we draw potential research directions to foster novel studies in this field and overcome the current limitations.
    摘要 零接触网络预计将开启智能且高度灵活的资源配置新时代,多个服务提供商将协作提供计算和存储资源。这种转变给网络管理和服务提供商带来了可持续性与可扩展性方面的巨大挑战。本文将分布式人工智能(DAI)与零接触配置(ZTP)相结合应用于边缘网络,通过最大限度减少人工干预,实现网络设备的无缝、智能管理。此外,本文还强调了在边缘网络中将分布式AI引入ZTP所带来的若干优势,并提出了潜在的研究方向,以推动该领域的新研究并克服当前的局限。

Slot-Mixup with Subsampling: A Simple Regularization for WSI Classification

  • paper_url: http://arxiv.org/abs/2311.17466
  • repo_url: None
  • paper_authors: Seongho Keum, Sanghyun Kim, Soojeong Lee, Juho Lee
  • for: 本研究旨在提高全切片图像(WSI)分类器的性能,以更好地辅助癌症检测。
  • methods: 本研究通过在不显著改变切片语义的前提下对图像块进行子采样来增广训练数据,并提出了基于注意力机制、将图像块组织为固定数量槽位的高效模型(Slot-MIL),再与mixup增广相结合。
  • results: 实验结果显示,子采样增广与基于注意力的聚合模型相结合可以提升泛化性和校准性,并在包含类别不平衡和分布偏移的多个基准数据集上达到最先进性能。
    Abstract Whole slide image (WSI) classification requires repetitive zoom-in and out for pathologists, as only small portions of the slide may be relevant to detecting cancer. Due to the lack of patch-level labels, multiple instance learning (MIL) is a common practice for training a WSI classifier. One of the challenges in MIL for WSIs is the weak supervision coming only from the slide-level labels, often resulting in severe overfitting. In response, researchers have considered adopting patch-level augmentation or applying mixup augmentation, but their applicability remains unverified. Our approach augments the training dataset by sampling a subset of patches in the WSI without significantly altering the underlying semantics of the original slides. Additionally, we introduce an efficient model (Slot-MIL) that organizes patches into a fixed number of slots, the abstract representation of patches, using an attention mechanism. We empirically demonstrate that the subsampling augmentation helps to make more informative slots by restricting the over-concentration of attention and to improve interpretability. Finally, we illustrate that combining our attention-based aggregation model with subsampling and mixup, which has shown limited compatibility in existing MIL methods, can enhance both generalization and calibration. Our proposed methods achieve the state-of-the-art performance across various benchmark datasets including class imbalance and distribution shifts.
    摘要 全切片图像(WSI)分类要求病理学家反复缩放查看,因为往往只有切片中的一小部分与癌症检测相关。由于缺乏图像块级标签,多示例学习(MIL)是训练WSI分类器的常见做法。WSI上MIL的一个挑战在于监督信号仅来自切片级标签,往往导致严重的过拟合。对此,研究者考虑采用图像块级增广或mixup增广,但其适用性尚未得到验证。我们的方法通过对WSI中的图像块进行子采样来增广训练集,而不会显著改变原始切片的底层语义。此外,我们提出了一种高效模型(Slot-MIL),利用注意力机制将图像块组织为固定数量的槽位,作为图像块的抽象表示。我们经验性地表明,子采样增广通过限制注意力的过度集中,使槽位携带更多信息,并提升可解释性。最后,我们展示了将基于注意力的聚合模型与子采样和mixup相结合(后者在现有MIL方法中兼容性有限),可以同时提升泛化性和校准性。我们提出的方法在包含类别不平衡和分布偏移的多个基准数据集上均达到了最先进性能。
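A rough numpy sketch of the two ingredients described above, patch subsampling and attention-based pooling into a fixed number of slots. The dimensions and the dot-product attention form are assumptions for illustration; the real Slot-MIL architecture and training objective differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_pool(patches, slot_queries):
    """Aggregate a variable number of patch features into a fixed number
    of slots: each slot query attends over all patches."""
    attn = softmax(slot_queries @ patches.T / np.sqrt(patches.shape[1]))
    return attn @ patches                    # (n_slots, feat_dim)

def subsample(patches, keep=0.5):
    """Subsampling augmentation: keep a random subset of patches, leaving
    the slide-level semantics largely intact."""
    idx = rng.choice(len(patches), size=int(keep * len(patches)), replace=False)
    return patches[idx]

patches = rng.normal(size=(1000, 64))        # one WSI as a bag of patches
slots = rng.normal(size=(8, 64))             # 8 learnable slot queries
bag_repr = slot_pool(subsample(patches), slots)
print(bag_repr.shape)                        # (8, 64) regardless of bag size
```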

Quantum Neural Networks under Depolarization Noise: Exploring White-Box Attacks and Defenses

  • paper_url: http://arxiv.org/abs/2311.17458
  • repo_url: None
  • paper_authors: David Winderl, Nicola Franco, Jeanette Miriam Lorenz
  • for: 本研究探讨了量子机器学习(QML)模型面对针对性攻击的 robustness,特别是在多类分类场景下。
  • methods: 研究者使用了 depolarization noise 来强化 QML 模型的鲁棒性,但发现在多类分类场景下,加入 depolarization noise 甚至会使模型变得更加脆弱。
  • results: 研究者通过在 gate-based quantum simulators 上进行了 adversarial 训练,发现在多类分类场景下,加入 depolarization noise 会使模型变得更加脆弱,并不如 previously 所料的增加 robustness。
    Abstract Leveraging the unique properties of quantum mechanics, Quantum Machine Learning (QML) promises computational breakthroughs and enriched perspectives where traditional systems reach their boundaries. However, similarly to classical machine learning, QML is not immune to adversarial attacks. Quantum adversarial machine learning has become instrumental in highlighting the weak points of QML models when faced with adversarial crafted feature vectors. Diving deep into this domain, our exploration shines light on the interplay between depolarization noise and adversarial robustness. While previous results enhanced robustness from adversarial threats through depolarization noise, our findings paint a different picture. Interestingly, adding depolarization noise discontinued the effect of providing further robustness for a multi-class classification scenario. Consolidating our findings, we conducted experiments with a multi-class classifier adversarially trained on gate-based quantum simulators, further elucidating this unexpected behavior.
    摘要 借助量子力学的独特性质,量子机器学习(QML)有望在传统系统达到边界之处带来计算突破与全新视角。然而,与经典机器学习一样,QML也难免受到对抗攻击。量子对抗机器学习已成为揭示QML模型在面对精心构造的对抗特征向量时弱点的重要手段。我们的探索深入这一领域,揭示了去极化噪声与对抗鲁棒性之间的相互作用。此前的结果表明去极化噪声可以增强对对抗威胁的鲁棒性,而我们的发现却描绘出不同的图景:在多分类场景中,加入去极化噪声不再能带来额外的鲁棒性。为巩固这一结论,我们在基于门的量子模拟器上对多分类器进行了对抗训练实验,进一步阐明了这一出人意料的行为。
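For reference, the noise model at the center of this study, the single-qubit depolarizing channel, can be written in a few lines of numpy. This is a textbook formulation for illustration, not the paper's simulator setup:

```python
import numpy as np

def depolarize(rho, p):
    """Single-qubit depolarizing channel:
    rho -> (1 - p) * rho + p * I/2, mixing the state toward the
    maximally mixed state as the noise strength p grows."""
    maximally_mixed = np.eye(2) / 2.0
    return (1.0 - p) * rho + p * maximally_mixed

# The pure state |0><0| is pushed toward I/2 as p increases.
rho = np.array([[1.0, 0.0], [0.0, 0.0]])
for p in (0.0, 0.3, 1.0):
    print(p, np.round(depolarize(rho, p), 3).diagonal())
```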

Privacy Measurement in Tabular Synthetic Data: State of the Art and Future Research Directions

  • paper_url: http://arxiv.org/abs/2311.17453
  • repo_url: None
  • paper_authors: Alexander Boudewijn, Andrea Filippo Ferraris, Daniele Panfilo, Vanessa Cocca, Sabrina Zinutti, Karel De Schepper, Carlo Rossi Chauvenet
  • for: 本研究旨在提出一些量化隐私保护度的方法,以便开发SD隐私标准,促进多学科交流,并帮助SD研究人员做出 Informed 的模型和评估决策。
  • methods: 本研究使用了一些提出的量化隐私保护度方法,包括使用潜在隐私泄露概率、隐私泄露损失分析、和隐私泄露检测等方法。
  • results: 本研究的结果表明,使用量化隐私保护度方法可以帮助SD研究人员更好地评估和优化隐私保护度,从而提高SD 技术的隐私保护能力。
    Abstract Synthetic data (SD) have garnered attention as a privacy enhancing technology. Unfortunately, there is no standard for quantifying their degree of privacy protection. In this paper, we discuss proposed quantification approaches. This contributes to the development of SD privacy standards; stimulates multi-disciplinary discussion; and helps SD researchers make informed modeling and evaluation decisions.
    摘要 合成数据(SD)作为一种隐私增强技术而受到关注。然而,目前尚无量化其隐私保护程度的标准。本文讨论了已提出的量化方法,这有助于制定SD隐私标准,促进多学科讨论,并帮助SD研究人员做出明智的建模和评估决策。
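One concrete member of the family of quantification approaches such surveys cover is distance-to-closest-record (DCR). The sketch below is our illustrative implementation, not a method proposed by this paper; it flags synthetic rows that sit suspiciously close to real training rows:

```python
import numpy as np

def distance_to_closest_record(synthetic, real):
    """For every synthetic row, the Euclidean distance to its nearest
    real row. Very small distances suggest the generator may be leaking
    (near-)copies of training records."""
    diffs = synthetic[:, None, :] - real[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 5))
safe = rng.normal(size=(50, 5))                         # independent draws
leaky = real[:50] + rng.normal(0, 0.01, size=(50, 5))   # near-copies
print(distance_to_closest_record(safe, real).mean())    # comfortably large
print(distance_to_closest_record(leaky, real).mean())   # suspiciously small
```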

Learning-driven Zero Trust in Distributed Computing Continuum Systems

  • paper_url: http://arxiv.org/abs/2311.17447
  • repo_url: None
  • paper_authors: Ilir Murturi, Praveen Kumar Donta, Victor Casamayor Pujol, Andrea Morichetta, Schahram Dustdar
  • for: 解决分布式计算继续体系中的运维和安全挑战,提高分布式计算继续体系中的信任验证建立的质量。
  • methods: 使用学习技术,如表示学习,将信任验证建立分布在计算继续体系中,提高决策过程中的威胁预测和不可信请求检测。
  • results: 通过示例演示,学习过程可以检测和屏蔽不当请求,提高资源访问控制,降低网络和计算负担。
    Abstract Converging Zero Trust (ZT) with learning techniques can solve various operational and security challenges in Distributed Computing Continuum Systems (DCCS). Implementing centralized ZT architecture is seen as unsuitable for the computing continuum (e.g., computing entities with limited connectivity and visibility, etc.). At the same time, implementing decentralized ZT in the computing continuum requires understanding infrastructure limitations and novel approaches to enhance resource access management decisions. To overcome such challenges, we present a novel learning-driven ZT conceptual architecture designed for DCCS. We aim to enhance ZT architecture service quality by incorporating lightweight learning strategies such as Representation Learning (ReL) and distributing ZT components across the computing continuum. The ReL helps to improve the decision-making process by predicting threats or untrusted requests. Through an illustrative example, we show how the learning process detects and blocks the requests, enhances resource access control, and reduces network and computation overheads. Lastly, we discuss the conceptual architecture, processes, and provide a research agenda.
    摘要 合并零信任(ZT)学习技术可以解决分布式计算继续系统(DCCS)中的多种操作和安全挑战。实施中央化ZT架构在计算继续中被视为不适用(例如计算实体具有有限的连接和可见性等)。同时,在计算继续中实施分布式ZT需要理解基础设施限制和新的方法来增强资源访问管理决策。为解决这些挑战,我们提出了一种基于学习的ZT概念架构,特点是采用轻量级学习策略如表示学习(ReL),并在计算继续中分布ZT组件。ReL可以改善决策过程,预测威胁或不可信请求。通过一个示例,我们展示了如何学习过程检测和屏蔽请求,提高资源访问控制,降低网络和计算负担。最后,我们介绍了概念架构、过程和研究计划。

Uncertainty in Additive Feature Attribution methods

  • paper_url: http://arxiv.org/abs/2311.17446
  • repo_url: None
  • paper_authors: Abhishek Madaan, Tanya Chowdhury, Neha Rana, James Allan, Tanmoy Chakraborty
  • for: 本研究探讨了post-hoc可解释AI(XAI)方法中的不确定性问题。特别是关注添加性特征归因解释方法的类。
  • methods: 本文首先定义了不确定性的特性,然后比较了不同的统计学和现代方法来量化它。接着,对于特定的实例,我们研究了特征归因和不确定性之间的关系,发现两者相关性很小。因此,我们提议修改LIME类算法中扰动采样所用的分布,使重要特征具有最小的不确定性,且不增加计算成本。
  • results: 我们发现在分类器的特征空间中,一些实例显示出接近零的不确定性。我们将这些实例称为“稳定实例”,并诊断了导致实例稳定的因素。此外,我们发现XAI算法的不确定性与模型的大小和复杂度之间存在正相关关系。我们提出了一种量化黑盒分类器相对复杂度的度量,可将其纳入LIME类算法的采样密度中,以帮助不同的解释算法达到更紧的置信水平。
    Abstract In this work, we explore various topics that fall under the umbrella of Uncertainty in post-hoc Explainable AI (XAI) methods. We in particular focus on the class of additive feature attribution explanation methods. We first describe our specifications of uncertainty and compare various statistical and recent methods to quantify the same. Next, for a particular instance, we study the relationship between a feature's attribution and its uncertainty and observe little correlation. As a result, we propose a modification in the distribution from which perturbations are sampled in LIME-based algorithms such that the important features have minimal uncertainty without an increase in computational cost. Next, while studying how the uncertainty in explanations varies across the feature space of a classifier, we observe that a fraction of instances show near-zero uncertainty. We coin the term "stable instances" for such instances and diagnose factors that make an instance stable. Next, we study how an XAI algorithm's uncertainty varies with the size and complexity of the underlying model. We observe that the more complex the model, the more inherent uncertainty is exhibited by it. As a result, we propose a measure to quantify the relative complexity of a blackbox classifier. This could be incorporated, for example, in LIME-based algorithms' sampling densities, to help different explanation algorithms achieve tighter confidence levels. Together, the above measures would have a strong impact on making XAI models relatively trustworthy for the end-user as well as aiding scientific discovery.
    摘要 在这项工作中,我们探讨了事后可解释AI(XAI)方法中与不确定性相关的多个主题,尤其关注加性特征归因解释方法这一类别。我们首先给出不确定性的定义,并比较多种统计与最新方法对其进行量化。接着,对于特定实例,我们研究特征归因与其不确定性之间的关系,发现两者相关性很小。因此,我们提议修改LIME类算法中扰动采样所用的分布,使重要特征具有最小的不确定性,且不增加计算成本。随后,在研究解释的不确定性如何随分类器特征空间变化时,我们发现一部分实例表现出接近零的不确定性;我们将这类实例称为“稳定实例”,并诊断了使实例稳定的因素。接下来,我们研究XAI算法的不确定性如何随底层模型的规模与复杂度变化,发现模型越复杂,其表现出的固有不确定性越大。因此,我们提出了一个量化黑盒分类器相对复杂度的度量,例如可将其纳入LIME类算法的采样密度中,帮助不同的解释算法达到更紧的置信水平。综上,上述措施将对提升XAI模型对最终用户的可信度以及辅助科学发现产生重要影响。
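A minimal sketch of one way to measure attribution uncertainty in the LIME spirit: refit the local linear surrogate with fresh perturbations and report the spread of the coefficients. The toy black-box model and perturbation scale are assumptions; the paper's definitions are more careful:

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(x):
    # Toy classifier score standing in for any opaque model.
    return 1.0 / (1.0 + np.exp(-(2.0 * x[..., 0] - 1.0 * x[..., 1])))

def lime_attributions(x0, n_perturb=200, scale=0.5):
    """One LIME-style fit: perturb around x0, fit a local linear
    surrogate, and read attributions off its coefficients."""
    Z = x0 + rng.normal(0.0, scale, size=(n_perturb, x0.size))
    y = black_box(Z)
    coef, *_ = np.linalg.lstsq(np.c_[Z, np.ones(len(Z))], y, rcond=None)
    return coef[:-1]

x0 = np.array([0.2, -0.4])
runs = np.stack([lime_attributions(x0) for _ in range(30)])
print("attribution mean:", runs.mean(axis=0))
print("attribution std: ", runs.std(axis=0))   # per-feature uncertainty
```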

CLOMO: Counterfactual Logical Modification with Large Language Models

  • paper_url: http://arxiv.org/abs/2311.17438
  • repo_url: None
  • paper_authors: Yinya Huang, Ruixin Hong, Hongming Zhang, Wei Shao, Zhicheng Yang, Dong Yu, Changshui Zhang, Xiaodan Liang, Linqi Song
  • for: 本研究探讨大语言模型(LLM)的反事实思维能力。
  • methods: 我们引入了一个新任务——反事实逻辑修改(CLOMO),以及一个高质量的人类注释 benchmark。
  • results: 我们的实验结果表明,LLMs在逻辑上的反事实思维能力有所提高,但与人类表现之间仍有一定的差距。
    Abstract In this study, we delve into the realm of counterfactual reasoning capabilities of large language models (LLMs). Our primary objective is to cultivate the counterfactual thought processes within LLMs and rigorously assess these processes for their validity. Specifically, we introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. To effectively evaluate a generation model's counterfactual capabilities, we propose an innovative evaluation metric, the LogicAware Counterfactual Score to directly evaluate the natural language output of LLMs instead of modeling the task as a multiple-choice problem. Analysis shows that the proposed automatic metric aligns well with human preference. Our experimental results show that while LLMs demonstrate a notable capacity for logical counterfactual thinking, there remains a discernible gap between their current abilities and human performance.
    摘要 在这项研究中,我们进入了大语言模型(LLM)的 counterfactual 理解能力的领域。我们的主要目标是在 LLM 中培养 counterfactual 思维过程并严格评估这些过程的有效性。我们引入了一个新任务,Counterfactual Logical Modification(CLOMO),以及一个高质量的人类标注 benchmark。在这个任务中,LLM 必须能够修改给定的论据文本,以保持 predetermined 逻辑关系。为了有效地评估一个生成模型的 counterfactual 能力,我们提出了一种创新的评价指标,逻辑感知 counterfactual 分数,直接评估 LLM 的自然语言输出而不是将任务模型为多选问题。分析显示,我们提出的自动化指标与人类偏好高度吻合。我们的实验结果表明,虽然 LLM 表现出了明显的逻辑 counterfactual 思维能力,但与人类性能还存在一定的差距。

MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning

  • paper_url: http://arxiv.org/abs/2311.17435
  • repo_url: None
  • paper_authors: Chaoyi Zhang, Kevin Lin, Zhengyuan Yang, Jianfeng Wang, Linjie Li, Chung-Ching Lin, Zicheng Liu, Lijuan Wang
  • for: 这个论文的目的是提出一种基于 GPT-4 的多模态上下文学习系统,用于为长视频生成音频描述(AD)。
  • methods: 该系统使用一种记忆增强的生成过程,通过高效的注册与回忆机制,有效利用短期文本上下文和长期视觉记忆,生成准确的音频描述。
  • results: 实验结果表明,MM-Narrator 在 MAD-eval 数据集上持续超过现有的基于微调的方法和基于 LLM 的方法,并引入了首个面向循环文本生成的基于分段的评价器。
    Abstract We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD). Unlike previous methods that primarily focused on downstream fine-tuning with short video clips, MM-Narrator excels in generating precise audio descriptions for videos of extensive lengths, even beyond hours, in an autoregressive manner. This capability is made possible by the proposed memory-augmented generation process, which effectively utilizes both the short-term textual context and long-term visual memory through an efficient register-and-recall mechanism. These contextual memories compile pertinent past information, including storylines and character identities, ensuring an accurate tracking and depicting of story-coherent and character-centric audio descriptions. Maintaining the training-free design of MM-Narrator, we further propose a complexity-based demonstration selection strategy to largely enhance its multi-step reasoning capability via few-shot multimodal in-context learning (MM-ICL). Experimental results on MAD-eval dataset demonstrate that MM-Narrator consistently outperforms both the existing fine-tuning-based approaches and LLM-based approaches in most scenarios, as measured by standard evaluation metrics. Additionally, we introduce the first segment-based evaluator for recurrent text generation. Empowered by GPT-4, this evaluator comprehensively reasons and marks AD generation performance in various extendable dimensions.
    摘要 我们介绍MM-Narrator,一个新的系统,利用GPT-4和多模式内容学习来生成语音描述(AD)。与前方法主要集中于下游精确化短视频片段的方法不同,MM-Narrator能够生成长达多小时的视频语音描述,以autoregressive的方式进行生成。这个能力是基于我们提出的记忆增强生成过程,具有效率的注册和回传机制,从而充分利用短期文本上下文和长期视觉记忆。这些上下文记忆将重要的过去信息,包括故事情节和人物识别,转换为精确地追踪和描述story-coherent和character-centric的语音描述。保持MM-Narrator的训练无需设计,我们进一步提出了基于复杂度的示例选择策略,以大幅提高其多步验证能力via few-shot multimodal in-context learning(MM-ICL)。实验结果显示,MM-Narrator在MAD-eval dataset上一致地超过了现有的精确化方法和LLM-based方法,以标准评估指标进行评估。此外,我们引入了第一个可分段评估条件的文本生成评估器。 empowered by GPT-4,这个评估器可以全面理解和评估AD生成的不同维度,并提供了可扩展的评估方法。

Grounding Foundation Models through Federated Transfer Learning: A General Framework

  • paper_url: http://arxiv.org/abs/2311.17431
  • repo_url: None
  • paper_authors: Yan Kang, Tao Fan, Hanlin Gu, Lixin Fan, Qiang Yang
  • for: 本研究的目的是探讨基于联合学习和迁移学习的FM搭建,以便更好地利用FM的潜在能力和特性。
  • methods: 本研究使用了联合学习和迁移学习的组合,以解决FM搭建中的资源受限、数据隐私、模型多样性和模型所有权等挑战。
  • results: 本研究提出了一个FTL-FM框架,并对现有的FTL-FM研究进行了分类和评审。此外,本研究还评估了一些高效性和隐私保护技术,以解决FTL-FM中的效率和隐私问题。
    Abstract Foundation Models (FMs) such as GPT-4 encoded with vast knowledge and powerful emergent abilities have achieved remarkable success in various natural language processing and computer vision tasks. Grounding FMs by adapting them to domain-specific tasks or augmenting them with domain-specific knowledge enables us to exploit the full potential of FMs. However, grounding FMs faces several challenges, stemming primarily from constrained computing resources, data privacy, model heterogeneity, and model ownership. Federated Transfer Learning (FTL), the combination of federated learning and transfer learning, provides promising solutions to address these challenges. In recent years, the need for grounding FMs leveraging FTL, coined FTL-FM, has arisen strongly in both academia and industry. Motivated by the strong growth in FTL-FM research and the potential impact of FTL-FM on industrial applications, we propose an FTL-FM framework that formulates problems of grounding FMs in the federated learning setting, construct a detailed taxonomy based on the FTL-FM framework to categorize state-of-the-art FTL-FM works, and comprehensively overview FTL-FM works based on the proposed taxonomy. We also establish correspondences between FTL-FM and conventional phases of adapting FM so that FM practitioners can align their research works with FTL-FM. In addition, we overview advanced efficiency-improving and privacy-preserving techniques because efficiency and privacy are critical concerns in FTL-FM. Last, we discuss opportunities and future research directions of FTL-FM.
    摘要 基于 Foundation Models (FMs) 的自然语言处理和计算机视觉任务具有了非常出色的成功。通过适应 FMs 到具体任务或加入具体知识,我们可以激活 FMs 的潜在潜力。然而,在grounding FMs时存在多种挑战,主要包括计算资源约束、数据隐私、模型多样性和模型所有权。 Federated Transfer Learning (FTL) 技术提供了可能的解决方案。随着 FTL-FM 研究的强劲增长和其在业务应用中的潜在影响,我们提出了一个基于 FTL 的 FTL-FM 框架,用于在联合学习环境中解决附加 FMs 的问题。此外,我们还构建了基于 FTL-FM 框架的详细分类,对当前 state-of-the-art FTL-FM 作品进行了全面的综述。此外,我们还与传统 FM 适应阶段的相关性进行了对应,以便 FM 专家可以将研究工作与 FTL-FM 进行对应。此外,我们还详细介绍了 FTL-FM 中的高效性和隐私保护技术,因为效率和隐私是 FTL-FM 中的关键问题。最后,我们还讨论了 FTL-FM 的未来研究方向。

TARGET: Template-Transferable Backdoor Attack Against Prompt-based NLP Models via GPT4

  • paper_url: http://arxiv.org/abs/2311.17429
  • repo_url: None
  • paper_authors: Zihao Tan, Qingliang Chen, Yongjian Huang, Chen Liang
  • for: 本研究旨在攻击prompt-based NLP模型,特别是对于具有少量数据的场景。
  • methods: 我们提出了一种新的模板可迁移后门攻击方法(TARGET),它不依赖手动定义的模板,而是利用GPT4对手动模板进行改写,生成语调强烈的模板,并在预训练阶段将其作为后门触发器注入模型。在下游任务中,我们不仅直接使用上述模板进行攻击,还使用GPT4生成与上述模板语调相似的模板进行可迁移攻击。
  • results: 我们在五个NLP dataset和三个BERT系列模型上进行了广泛的实验,结果表明,我们的TARGET方法在直接攻击和未看到的语调相似模板攻击方面具有较高的攻击性和隐蔽性,而且在未看到的语调相似模板攻击方面也具有良好的攻击能力。
    Abstract Prompt-based learning has been widely applied in many low-resource NLP tasks such as few-shot scenarios. However, this paradigm has been shown to be vulnerable to backdoor attacks. Most of the existing attack methods focus on inserting manually predefined templates as triggers in the pre-training phase to train the victim model and utilize the same triggers in the downstream task to perform inference, which tends to ignore the transferability and stealthiness of the templates. In this work, we propose a novel approach of TARGET (Template-trAnsfeRable backdoor attack aGainst prompt-basEd NLP models via GPT4), which is a data-independent attack method. Specifically, we first utilize GPT4 to reformulate manual templates to generate tone-strong and normal templates, and the former are injected into the model as a backdoor trigger in the pre-training phase. Then, we not only directly employ the above templates in the downstream task, but also use GPT4 to generate templates with similar tone to the above templates to carry out transferable attacks. Finally we have conducted extensive experiments on five NLP datasets and three BERT series models, with experimental results justifying that our TARGET method has better attack performance and stealthiness compared to the two-external baseline methods on direct attacks, and in addition achieves satisfactory attack capability in the unseen tone-similar templates.
    摘要 基于提示的学习已广泛应用于少样本等低资源NLP任务,但这一范式容易受到后门攻击。现有攻击方法大多在预训练阶段插入手工预定义的模板作为触发器来训练受害模型,并在下游任务中使用相同的模板进行推理,往往忽略了模板的可迁移性和隐蔽性。在本工作中,我们提出了一种与数据无关的新方法TARGET(Template-trAnsfeRable backdoor attack aGainst prompt-basEd NLP models via GPT4)。具体而言,我们首先利用GPT4改写手工模板,生成语调强烈的模板和普通模板,并在预训练阶段将前者作为后门触发器注入模型;随后,我们不仅在下游任务中直接使用上述模板,还利用GPT4生成语调相似的模板以实施可迁移攻击。最后,我们在五个NLP数据集和三个BERT系列模型上进行了大量实验,结果表明,在直接攻击上,TARGET方法的攻击性能和隐蔽性优于两种外部基线方法,并且在未见过的语调相似模板上也取得了令人满意的攻击能力。

SigFormer: Sparse Signal-Guided Transformer for Multi-Modal Human Action Segmentation

  • paper_url: http://arxiv.org/abs/2311.17428
  • repo_url: https://github.com/liuqi-creat/sigformer
  • paper_authors: Qi Liu, Xinchen Liu, Kun Liu, Xiaoyan Gu, Wu Liu
  • for: 多modal人体动作分割是一个关键和挑战性的任务,它有很多应用。现在大多数方法集中在dense信号(即RGB、运动流和深度图)的融合上。然而,稀疏IoT传感器信号的潜在贡献没有得到充分利用。为了解决这个问题,我们引入了一种稀疏信号引导的Transformer(SigFormer),将稀疏和dense信号结合在一起。
  • methods: 我们使用mask注意力来融合本地特征,通过在有效的稀疏信号区域内进行交叉注意力约束。然而,稀疏信号是离散的,因此缺乏足够的时间动作边界信息。因此,在SigFormer中,我们提出了两个阶段来强调边界信息:在第一个特征提取阶段,我们引入了一个中间瓶颈模块,通过内部损失函数同时学习每种dense模态的类别和时间边界特征。在density和稀疏信号的融合后,我们然后设计了两棵树结构,以显式地模型动作类别和时间边界之间的关系。
  • results: 实验结果表明,SigFormer超过了当前状态的方法在多modal人体动作分割数据集上的性能,达到了非常出色的F1分数0.958。代码和预训练模型可以在https://github.com/LIUQI-creat/SigFormer上获取。
    Abstract Multi-modal human action segmentation is a critical and challenging task with a wide range of applications. Nowadays, the majority of approaches concentrate on the fusion of dense signals (i.e., RGB, optical flow, and depth maps). However, the potential contributions of sparse IoT sensor signals, which can be crucial for achieving accurate recognition, have not been fully explored. To make up for this, we introduce a Sparse signalguided Transformer (SigFormer) to combine both dense and sparse signals. We employ mask attention to fuse localized features by constraining cross-attention within the regions where sparse signals are valid. However, since sparse signals are discrete, they lack sufficient information about the temporal action boundaries. Therefore, in SigFormer, we propose to emphasize the boundary information at two stages to alleviate this problem. In the first feature extraction stage, we introduce an intermediate bottleneck module to jointly learn both category and boundary features of each dense modality through the inner loss functions. After the fusion of dense modalities and sparse signals, we then devise a two-branch architecture that explicitly models the interrelationship between action category and temporal boundary. Experimental results demonstrate that SigFormer outperforms the state-of-the-art approaches on a multi-modal action segmentation dataset from real industrial environments, reaching an outstanding F1 score of 0.958. The codes and pre-trained models have been available at https://github.com/LIUQI-creat/SigFormer.
    摘要 多模态人体动作分割是一项关键且具有挑战性的任务,具有广泛的应用场景。目前,大多数方法集中于稠密信号(即RGB、光流和深度图)的融合,而稀疏IoT传感器信号对实现精确识别可能至关重要的潜在贡献尚未得到充分挖掘。为此,我们提出了稀疏信号引导的Transformer(SigFormer),以结合稠密与稀疏信号。我们使用掩码注意力来融合局部特征,将交叉注意力约束在稀疏信号有效的区域内。然而,稀疏信号是离散的,缺乏足够的时间动作边界信息。因此,在SigFormer中,我们提出在两个阶段强调边界信息以缓解这一问题:在第一个特征提取阶段,我们引入中间瓶颈模块,通过内部损失函数同时学习每种稠密模态的类别特征和时间边界特征;在稠密模态与稀疏信号融合之后,我们设计了双分支结构,显式建模动作类别与时间边界之间的相互关系。实验结果表明,SigFormer在来自真实工业环境的多模态动作分割数据集上超过了最先进方法,F1分数达到0.958。代码和预训练模型已发布于 https://github.com/LIUQI-creat/SigFormer。
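A rough numpy sketch of the masked cross-attention idea: dense-modality queries may only attend to time steps where the sparse signal is valid, enforced with an additive mask. Shapes and the single-head form are assumptions; the actual SigFormer module is more elaborate:

```python
import numpy as np

def masked_cross_attention(queries, keys, values, valid):
    """Cross-attention restricted to steps where the sparse signal is
    valid: invalid positions get -inf logits, hence zero attention."""
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)            # (Tq, Tk)
    logits = np.where(valid[None, :], logits, -np.inf)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

rng = np.random.default_rng(0)
dense = rng.normal(size=(16, 32))      # dense-modality features (queries)
sparse = rng.normal(size=(16, 32))     # sparse-signal features (keys/values)
valid = np.zeros(16, dtype=bool)
valid[4:8] = True                      # the sparse sensor fired only here
fused = masked_cross_attention(dense, sparse, sparse, valid)
print(fused.shape)                     # (16, 32); attends only to steps 4-7
```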

LLM-State: Expandable State Representation for Long-horizon Task Planning in the Open World

  • paper_url: http://arxiv.org/abs/2311.17406
  • repo_url: None
  • paper_authors: Siwei Chen, Anxing Xiao, David Hsu
  • for: solves the problem of long-horizon task planning in an open-world household environment with the Large Language Model (LLM)
  • methods: proposes a novel, expandable state representation that leverages the LLM’s inherent capabilities for context understanding and historical action reasoning
  • results: demonstrates significant improvements over baseline methods in a variety of tasks requiring long-horizon state tracking and reasoning through experiments in simulated and real-world scenarios
    Abstract This work addresses the problem of long-horizon task planning with the Large Language Model (LLM) in an open-world household environment. Existing works fail to explicitly track key objects and attributes, leading to erroneous decisions in long-horizon tasks, or rely on highly engineered state features and feedback, which is not generalizable. We propose a novel, expandable state representation that provides continuous expansion and updating of object attributes from the LLM's inherent capabilities for context understanding and historical action reasoning. Our proposed representation maintains a comprehensive record of an object's attributes and changes, enabling robust retrospective summary of the sequence of actions leading to the current state. This allows enhanced context understanding for decision-making in task planning. We validate our model through experiments across simulated and real-world task planning scenarios, demonstrating significant improvements over baseline methods in a variety of tasks requiring long-horizon state tracking and reasoning.
    摘要 We propose a novel, expandable state representation that leverages the LLM's inherent capabilities for context understanding and historical action reasoning to continuously expand and update object attributes. Our proposed representation maintains a comprehensive record of an object's attributes and changes, allowing for robust retrospective summary of the sequence of actions leading to the current state. This enhanced context understanding enables improved decision-making in task planning.We validate our model through experiments in simulated and real-world task planning scenarios, demonstrating significant improvements over baseline methods in a variety of tasks requiring long-horizon state tracking and reasoning.
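A plain-Python sketch of the expandable-state idea: an object-attribute record that grows as new attributes appear and keeps the action history so the sequence leading to the current state can be summarized retrospectively. In the paper the updates are driven by the LLM's context understanding; the hand-written calls and household example below are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class ObjectState:
    """Expandable record of one object's attributes plus its change
    history, enabling a retrospective summary of past actions."""
    attributes: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def update(self, action: str, **changes):
        self.history.append((action, dict(changes)))
        self.attributes.update(changes)   # new attributes expand the record

world = {"cup": ObjectState(), "kettle": ObjectState()}
world["kettle"].update("fill kettle", contents="water", full=True)
world["kettle"].update("boil water", temperature="hot")
world["cup"].update("pour water into cup", contents="hot water")

print(world["kettle"].attributes)
# Retrospective summary of the actions leading to the current state:
for name, obj in world.items():
    for action, changes in obj.history:
        print(f"{name}: {action} -> {changes}")
```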

VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models

  • paper_url: http://arxiv.org/abs/2311.17404
  • repo_url: https://github.com/lscpku/vitatecs
  • paper_authors: Shicheng Li, Lei Li, Shuhuai Ren, Yuanxin Liu, Yi Liu, Rundong Gao, Xu Sun, Lu Hou
  • for: 评估视频语言模型(VidLM)的时间理解能力。
  • methods: 使用精细的时间概念分类和人工辅助标注来生成对时间信息进行修正的异常视频描述,以评估 VidLM 的时间理解能力。
  • results: 表明 VidLM 尚未具备强大的时间理解能力,需要更多关注时间元素在视频语言研究中。
    Abstract The ability to perceive how objects change over time is a crucial ingredient in human intelligence. However, current benchmarks cannot faithfully reflect the temporal understanding abilities of video-language models (VidLMs) due to the existence of static visual shortcuts. To remedy this issue, we present VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal Concept underStanding. Specifically, we first introduce a fine-grained taxonomy of temporal concepts in natural language in order to diagnose the capability of VidLMs to comprehend different temporal aspects. Furthermore, to disentangle the correlation between static and temporal information, we generate counterfactual video descriptions that differ from the original one only in the specified temporal aspect. We employ a semi-automatic data collection framework using large language models and human-in-the-loop annotation to obtain high-quality counterfactual descriptions efficiently. Evaluation of representative video-language understanding models confirms their deficiency in temporal understanding, revealing the need for greater emphasis on the temporal elements in video-language research.
    摘要 感知物体如何随时间变化的能力是人类智能的关键要素。然而,由于静态视觉捷径的存在,现有基准无法如实反映视频语言模型(VidLM)的时间理解能力。为此,我们提出了VITATECS,一个用于评估时间概念理解的诊断性视频文本数据集。具体而言,我们首先提出了自然语言中时间概念的细粒度分类,以诊断VidLM理解不同时间方面的能力。此外,为了解耦静态信息与时间信息的相关性,我们生成了仅在指定时间方面与原始描述不同的反事实视频描述。我们采用大语言模型与人在回路标注相结合的半自动数据收集框架,高效地获得高质量的反事实描述。对代表性视频语言理解模型的评估证实了它们在时间理解方面的不足,表明视频语言研究需要更加重视时间要素。

Gene-MOE: A Sparsely-gated Framework for Pan-Cancer Genomic Analysis

  • paper_url: http://arxiv.org/abs/2311.17401
  • repo_url: None
  • paper_authors: Xiangyu Meng, Tao Song, Qing Yang, Huanhuan Dai, Lian Qiao, Hongzhen Ding, Long Hao, Xun Wang
  • for: 本研究使用泛癌(Pan-Cancer)数据集进行基因表达数据分析,以更好地理解癌症相关因素,并为癌症诊断和预后提供帮助。
  • methods: 本研究提出了一种名为 Gene-MOE 的新型预训练模型:利用混合专家(MOE)层学习Pan-Cancer数据集中高维基因特征的深层表示,同时构建混合注意力专家(MOAE)模型,学习高维基因特征之间的深层语义关系。
  • results: 根据生存分析结果,使用Gene-MOE模型在14种癌症类型中的12种上超越了最先进模型;在分类方面,33种癌症分类的总准确率达到95.2%。通过详细的特征分析,我们发现Gene-MOE模型可以学习高维基因特征的丰富表示。
    Abstract Analyzing the genomic information from the Pan-Cancer database can help us understand cancer-related factors and contribute to the cancer diagnosis and prognosis. However, existing computational methods and deep learning methods can not effectively find the deep correlations between tens of thousands of genes, which leads to precision loss. In this paper, we proposed a novel pretrained model called Gene-MOE to learn the general feature representations of the Pan-Cancer dataset and transfer the pretrained weights to the downstream tasks. The Gene-MOE fully exploits the mixture of expert (MOE) layers to learn rich feature representations of high-dimensional genes. At the same time, we build a mixture of attention expert (MOAE) model to learn the deep semantic relationships within genetic features. Finally, we proposed a new self-supervised pretraining strategy including loss function design, data enhancement, and optimization strategy to train the Gene-MOE and further improve the performance for the downstream analysis. We carried out cancer classification and survival analysis experiments based on the Gene-MOE. According to the survival analysis results on 14 cancer types, using Gene-MOE outperformed state-of-the-art models on 12 cancer types. According to the classification results, the total accuracy of the classification model for 33 cancer classifications reached 95.2\%. Through detailed feature analysis, we found the Gene-MOE model can learn rich feature representations of high-dimensional genes.
    摘要 分析pan-cancer数据库的基因信息可以帮助我们理解患癌相关因素,并对患癌诊断和预后做出贡献。然而,现有的计算方法和深度学习方法无法有效地找到数万个基因之间的深层相关性,这导致精度损失。在本文中,我们提出了一种新的预训练模型called Gene-MOE,用于学习Pan-Cancer数据集的通用特征表示。Gene-MOE完全利用了mixture of expert(MOE)层来学习高维基因的丰富特征表示。同时,我们建立了mixture of attention expert(MOAE)模型,以学习基因特征之间的深层semantic关系。最后,我们提出了一种新的自动预训练策略,包括损失函数设计、数据增强和优化策略,用于训练Gene-MOE,并进一步提高下游分析的性能。我们基于Gene-MOE进行了患癌类型分类和生存分析实验。根据14种患癌类型的生存分析结果,使用Gene-MOE比状态态模型在12种患癌类型上表现出了更好的性能。根据分类结果,Gene-MOE模型的33种患癌分类总准确率达95.2%。通过详细的特征分析,我们发现Gene-MOE模型可以学习高维基因的丰富特征表示。
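As a rough illustration of the sparsely-gated mixture-of-experts building block the model name refers to, here is a top-k gated MoE forward pass in numpy. Dimensions, gating form, and expert shape are invented for the example, not Gene-MOE's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, expert_weights, gate_weights, top_k=2):
    """Sparsely-gated MoE: route the input to the top-k experts chosen
    by a gating network, mixing their outputs by renormalized scores."""
    gate = softmax(gate_weights @ x)                  # (n_experts,)
    top = np.argsort(gate)[-top_k:]                   # chosen expert indices
    mix = gate[top] / gate[top].sum()
    outputs = [np.tanh(expert_weights[i] @ x) for i in top]
    return sum(w * o for w, o in zip(mix, outputs))

n_experts, d_in, d_out = 8, 100, 32                   # d_in stands in for
x = rng.normal(size=d_in)                             # high-dimensional genes
experts = rng.normal(size=(n_experts, d_out, d_in)) * 0.1
gates = rng.normal(size=(n_experts, d_in)) * 0.1
print(moe_forward(x, experts, gates).shape)           # (32,)
```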

Comparison of metaheuristics for the firebreak placement problem: a simulation-based optimization approach

  • paper_url: http://arxiv.org/abs/2311.17393
  • repo_url: None
  • paper_authors: David Palacios-Meneses, Jaime Carrasco, Sebastián Dávila, Maximiliano Martínez, Rodrigo Mahaluf, Andrés Weintraub
  • for: 这篇论文旨在提出一种基于模拟优化(SbO)的方法,求解火灾预防中的防火隔离带布设问题。
  • methods: 该方法使用遗传算法和GRASP算法求解该问题。
  • results: 实际应用中,遗传算法取得了良好的结果,尤其在作业能力中高、随机性中等的场景下表现出色。
    Abstract The problem of firebreak placement is crucial for fire prevention, and its effectiveness at landscape scale will depend on their ability to impede the progress of future wildfires. To provide an adequate response, it is therefore necessary to consider the stochastic nature of fires, which are highly unpredictable from ignition to extinction. Thus, the placement of firebreaks can be considered a stochastic optimization problem where: (1) the objective function is to minimize the expected cells burnt of the landscape; (2) the decision variables being the location of firebreaks; and (3) the random variable being the spatial propagation/behavior of fires. In this paper, we propose a solution approach for the problem from the perspective of simulation-based optimization (SbO), where the objective function is not available (a black-box function), but can be computed (and/or approximated) by wildfire simulations. For this purpose, Genetic Algorithm and GRASP are implemented. The final implementation yielded favorable results for the Genetic Algorithm, demonstrating strong performance in scenarios with medium to high operational capacity, as well as medium levels of stochasticity
    摘要 防火隔离带的布设问题对火灾预防至关重要,其在景观尺度上的有效性取决于能否阻碍未来野火的蔓延。为提供恰当的应对,必须考虑火灾的随机性:火灾从起火到熄灭都高度不可预测。因此,防火隔离带的布设可以被视为一个随机优化问题,其中:(1)目标函数是最小化景观中预期烧毁的单元格数;(2)决策变量是防火隔离带的位置;(3)随机变量是火灾的空间蔓延/行为。本文从基于模拟的优化(SbO)的角度提出了求解该问题的方法:目标函数不可直接获得(黑盒函数),但可以通过野火模拟来计算和/或近似。为此,我们实现了遗传算法和GRASP。最终实现中,遗传算法取得了良好的结果,在作业能力中高、随机性中等的场景下表现强劲。
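A skeleton of the simulation-based optimization loop: a genetic algorithm whose fitness is the expected burned-cell count estimated by repeated stochastic fire simulations. The one-dimensional "simulator" below is a toy stand-in; a real study would call a wildfire simulator instead:

```python
import random

random.seed(0)
N_CELLS, N_BREAKS, POP, GENS = 100, 5, 20, 30

def simulate_burn(firebreaks):
    """Toy stochastic fire: random ignition, spreads 'rightward' until it
    hits a firebreak. Stands in for a real wildfire simulation run."""
    ignition = random.randrange(N_CELLS)
    burned = 0
    for cell in range(ignition, N_CELLS):
        if cell in firebreaks:
            break
        burned += 1
    return burned

def fitness(firebreaks, n_sims=50):
    # Simulation-based objective: estimated expected cells burned.
    return sum(simulate_burn(firebreaks) for _ in range(n_sims)) / n_sims

def mutate(sol):
    child = set(sol)
    child.discard(random.choice(list(child)))
    child.add(random.randrange(N_CELLS))
    return frozenset(child)

population = [frozenset(random.sample(range(N_CELLS), N_BREAKS))
              for _ in range(POP)]
for _ in range(GENS):
    population.sort(key=fitness)                  # minimize expected burn
    survivors = population[: POP // 2]
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(POP - len(survivors))]
print("best placement:", sorted(population[0]), "->", fitness(population[0]))
```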

Are we going MAD? Benchmarking Multi-Agent Debate between Language Models for Medical Q&A

  • paper_url: http://arxiv.org/abs/2311.17371
  • repo_url: None
  • paper_authors: Andries Smit, Paul Duckworth, Nathan Grinsztajn, Kale-ab Tessera, Thomas D. Barrett, Arnu Pretorius
  • for: 这篇论文的目的是提高语言模型回答医疗问题的准确性和可靠性。
  • methods: 这篇论文使用多代理辩论(MAD)策略来提高语言模型的准确性。
  • results: 这篇论文提出了一种基于代理间一致的新辩论策略,在医疗问答任务上优于此前发表的策略。
    Abstract Recent advancements in large language models (LLMs) underscore their potential for responding to medical inquiries. However, ensuring that generative agents provide accurate and reliable answers remains an ongoing challenge. In this context, multi-agent debate (MAD) has emerged as a prominent strategy for enhancing the truthfulness of LLMs. In this work, we provide a comprehensive benchmark of MAD strategies for medical Q&A, along with open-source implementations. This explores the effective utilization of various strategies including the trade-offs between cost, time, and accuracy. We build upon these insights to provide a novel debate-prompting strategy based on agent agreement that outperforms previously published strategies on medical Q&A tasks.
    摘要 大语言模型(LLM)的最新进展凸显了它们在回答医疗问题方面的潜力。然而,确保生成式代理给出准确可靠的答案仍是一个持续的挑战。在此背景下,多代理辩论(MAD)已成为提高 LLM 真实性的重要策略。在本工作中,我们对医疗问答中的 MAD 策略进行了全面的基准评测,并提供了开源实现。评测探讨了各种策略的有效利用方式,包括成本、时间与准确性之间的权衡。基于这些洞察,我们提出了一种基于代理间一致的新辩论激发策略,在医疗问答任务上超过了此前发表的策略。
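A skeleton of a multi-agent debate loop with an agreement-based stopping rule of the kind such strategies build on. The `ask_llm` function is a toy stand-in whose convergence behavior is hard-coded; a real system would send each agent the question plus its peers' previous answers:

```python
from collections import Counter

def ask_llm(agent_id, question, peer_answers):
    """Toy stand-in for a chat-model call (assumption, not a real API)."""
    if peer_answers:
        # Pretend agents revise toward the current majority answer.
        return Counter(peer_answers).most_common(1)[0][0]
    return ["aspirin", "ibuprofen", "aspirin"][agent_id]

def debate(question, n_agents=3, max_rounds=3):
    answers = [ask_llm(i, question, []) for i in range(n_agents)]
    for _ in range(max_rounds):
        top, votes = Counter(answers).most_common(1)[0]
        if votes == n_agents:                 # full agent agreement reached
            return top
        answers = [ask_llm(i, question, answers) for i in range(n_agents)]
    return Counter(answers).most_common(1)[0][0]   # fall back to majority

print(debate("Which drug is indicated here?"))     # converges to "aspirin"
```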

Two Scalable Approaches for Burned-Area Mapping Using U-Net and Landsat Imagery

  • paper_url: http://arxiv.org/abs/2311.17368
  • repo_url: None
  • paper_authors: Ian Mancilla-Wulff, Jaime Carrasco, Cristobal Pais, Alejandro Miranda, Andres Weintraub
  • for: 本研究旨在提高火灾监测效率,实现实时、高分辨率的烧毁区域制图。
  • methods: 本研究基于U-Net模型提出了两种方法(128与AllSizes),通过将输入图像裁剪为不同尺寸、在类别平衡不同的数据集上训练,以自动化和优化烧毁区域制图过程。
  • results: 结果显示,使用AS模型提升数据集平衡可以提高模型性能;在195幅代表性测试图像上,AS取得了0.93的Dice系数、0.086的漏分误差和0.045的错分误差。
    Abstract Monitoring wildfires is an essential step in minimizing their impact on the planet, understanding the many negative environmental, economic, and social consequences. Recent advances in remote sensing technology combined with the increasing application of artificial intelligence methods have improved real-time, high-resolution fire monitoring. This study explores two proposed approaches based on the U-Net model for automating and optimizing the burned-area mapping process. Denoted 128 and AllSizes (AS), they are trained on datasets with a different class balance by cropping input images to different sizes. They are then applied to Landsat imagery and time-series data from two fire-prone regions in Chile. The results obtained after enhancement of model performance by hyperparameter optimization demonstrate the effectiveness of both approaches. Tests based on 195 representative images of the study area show that increasing dataset balance using the AS model yields better performance. More specifically, AS exhibited a Dice Coefficient (DC) of 0.93, an Omission Error (OE) of 0.086, and a Commission Error (CE) of 0.045, while the 128 model achieved a DC of 0.86, an OE of 0.12, and a CE of 0.12. These findings should provide a basis for further development of scalable automatic burned-area mapping tools.
    摘要 监测野火是减少其对地球影响的重要一步,有助于理解其多方面的环境、经济和社会负面影响。遥感技术的最新进展与人工智能方法的日益应用,提升了实时、高分辨率的火灾监测能力。本研究探讨了基于U-Net模型、用于自动化和优化烧毁区域制图过程的两种方法,分别记为128和AllSizes(AS)。二者通过将输入图像裁剪为不同尺寸,在类别平衡不同的数据集上训练,随后应用于智利两个火灾多发区域的Landsat影像和时间序列数据。经超参数优化提升模型性能后,结果证明了两种方法的有效性。基于研究区域195幅代表性图像的测试表明,采用AS模型提升数据集平衡可带来更好的性能:AS的Dice系数(DC)为0.93、漏分误差(OE)为0.086、错分误差(CE)为0.045,而128模型的DC为0.86、OE为0.12、CE为0.12。这些发现可为进一步开发可扩展的烧毁区域自动制图工具提供基础。
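The three reported metrics can be computed from binary masks as follows. This is the standard formulation; the masks below are synthetic examples, not the study's data:

```python
import numpy as np

def burned_area_metrics(pred, truth):
    """Dice coefficient, omission error (missed burned pixels), and
    commission error (falsely predicted burned pixels) for binary masks."""
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    dice = 2 * tp / (2 * tp + fp + fn)
    omission = fn / (tp + fn)      # burned pixels the model missed
    commission = fp / (tp + fp)    # predicted-burned pixels that were not
    return dice, omission, commission

truth = np.zeros((64, 64), dtype=bool); truth[10:40, 10:40] = True
pred = np.zeros((64, 64), dtype=bool);  pred[12:40, 10:42] = True
dice, oe, ce = burned_area_metrics(pred, truth)
print(f"DC={dice:.3f}  OE={oe:.3f}  CE={ce:.3f}")
```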

Exploring Large Language Models for Human Mobility Prediction under Public Events

  • paper_url: http://arxiv.org/abs/2311.17351
  • repo_url: None
  • paper_authors: Yuebing Liang, Yichao Liu, Xiaohan Wang, Zhan Zhao
  • for: LLM-MPE 旨在准确预测公共活动下的人群流动,这对活动规划以及交通和人流管理至关重要。
  • methods: 该框架利用大语言模型(LLM)处理文本数据、从极少示例中学习,并为其预测提供人类可读的解释。
  • results: LLM-MPE 优于传统模型,尤其是在活动日,文本数据显著提升了其准确率;此外,该框架还为其预测提供了可解释的洞察。
    Abstract Public events, such as concerts and sports games, can be major attractors for large crowds, leading to irregular surges in travel demand. Accurate human mobility prediction for public events is thus crucial for event planning as well as traffic or crowd management. While rich textual descriptions about public events are commonly available from online sources, it is challenging to encode such information in statistical or machine learning models. Existing methods are generally limited in incorporating textual information, handling data sparsity, or providing rationales for their predictions. To address these challenges, we introduce a framework for human mobility prediction under public events (LLM-MPE) based on Large Language Models (LLMs), leveraging their unprecedented ability to process textual data, learn from minimal examples, and generate human-readable explanations. Specifically, LLM-MPE first transforms raw, unstructured event descriptions from online sources into a standardized format, and then segments historical mobility data into regular and event-related components. A prompting strategy is designed to direct LLMs in making and rationalizing demand predictions considering historical mobility and event features. A case study is conducted for Barclays Center in New York City, based on publicly available event information and taxi trip data. Results show that LLM-MPE surpasses traditional models, particularly on event days, with textual data significantly enhancing its accuracy. Furthermore, LLM-MPE offers interpretable insights into its predictions. Despite the great potential of LLMs, we also identify key challenges including misinformation and high costs that remain barriers to their broader adoption in large-scale human mobility analysis.
    摘要 公共活动,如演唱会和体育赛事,可以吸引大量人群,导致不规则的旅行需求激增。因此,公共活动下的人群流动预测对活动规划以及交通或人群管理至关重要。尽管网络上通常可以获得关于公共活动的丰富文本描述,但将此类信息编码进统计或机器学习模型却颇具挑战:现有方法在利用文本信息、应对数据稀疏方面能力有限,也无法为其预测给出依据。为应对这些挑战,我们提出了基于大语言模型(LLM)的公共活动人群流动预测框架(LLM-MPE),利用LLM在处理文本数据、小样本学习和生成人类可读解释方面前所未有的能力。具体而言,LLM-MPE首先将来自网络的原始非结构化活动描述转换为标准化格式,然后将历史流动数据分解为常规部分与活动相关部分,并设计提示策略,引导LLM结合历史流动与活动特征做出需求预测并给出理由。我们基于公开的活动信息和出租车行程数据,对纽约市Barclays Center进行了案例研究。结果表明,LLM-MPE优于传统模型,尤其是在活动日,文本数据显著提升了其准确率;此外,LLM-MPE还为其预测提供了可解释的洞察。尽管LLM潜力巨大,我们也指出了错误信息与高成本等仍阻碍其在大规模人群流动分析中广泛应用的关键挑战。

VideoAssembler: Identity-Consistent Video Generation with Reference Entities using Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.17338
  • repo_url: https://github.com/videoassembler/videoassembler
  • paper_authors: Haoyu Zhao, Tianyi Lu, Jiaxi Gu, Xing Zhang, Zuxuan Wu, Hang Xu, Yu-Gang Jiang
  • for: 该 paper 目的是提出一种基于文本提示和参考图像的视频生成方法,以实现视频的生成和编辑。
  • methods: 该方法使用了 Reference Entity Pyramid(REP)编码器和 Entity-Prompt Attention Fusion(EPAF)模块,以实现对文本提示和参考图像的灵活整合。
  • results: 该方法在 UCF-101、MSR-VTT 和 DAVIS 等 datasets 上达到了良好的数据分布和视觉评价表现(346.84 在 FVD 和 48.01 在 IS 上),并且可以进行不同的视频生成和编辑任务。
    Abstract Identity-consistent video generation seeks to synthesize videos that are guided by both textual prompts and reference images of entities. Current approaches typically utilize cross-attention layers to integrate the appearance of the entity, which predominantly captures semantic attributes, resulting in compromised fidelity of entities. Moreover, these methods necessitate iterative fine-tuning for each new entity encountered, thereby limiting their applicability. To address these challenges, we introduce VideoAssembler, a novel end-to-end framework for identity-consistent video generation that can conduct inference directly when encountering new entities. VideoAssembler is adept at producing videos that are not only flexible with respect to the input reference entities but also responsive to textual conditions. Additionally, by modulating the quantity of input images for the entity, VideoAssembler enables the execution of tasks ranging from image-to-video generation to sophisticated video editing. VideoAssembler comprises two principal components: the Reference Entity Pyramid (REP) encoder and the Entity-Prompt Attention Fusion (EPAF) module. The REP encoder is designed to infuse comprehensive appearance details into the denoising stages of the stable diffusion model. Concurrently, the EPAF module is utilized to integrate text-aligned features effectively. Furthermore, to mitigate the challenge of scarce data, we present a methodology for the preprocessing of training data. Our evaluation of the VideoAssembler framework on the UCF-101, MSR-VTT, and DAVIS datasets indicates that it achieves good performances in both quantitative and qualitative analyses (346.84 in FVD and 48.01 in IS on UCF-101). Our project page is at https://videoassembler.github.io/videoassembler.
    摘要 Current approaches to identity-consistent video generation typically use cross-attention layers to integrate the appearance of the entity, which primarily captures semantic attributes, leading to compromised fidelity of entities. Moreover, these methods require iterative fine-tuning for each new entity encountered, limiting their applicability. To address these challenges, we propose VideoAssembler, a novel end-to-end framework for identity-consistent video generation that can conduct inference directly when encountering new entities. VideoAssembler is capable of producing videos that are not only flexible with respect to the input reference entities but also responsive to textual conditions. Additionally, by modulating the quantity of input images for the entity, VideoAssembler enables the execution of tasks ranging from image-to-video generation to sophisticated video editing. VideoAssembler consists of two principal components: the Reference Entity Pyramid (REP) encoder and the Entity-Prompt Attention Fusion (EPAF) module. The REP encoder is designed to infuse comprehensive appearance details into the denoising stages of the stable diffusion model. Concurrently, the EPAF module is utilized to integrate text-aligned features effectively. Furthermore, to mitigate the challenge of scarce data, we present a methodology for the preprocessing of training data. Our evaluation of the VideoAssembler framework on the UCF-101, MSR-VTT, and DAVIS datasets indicates that it achieves good performances in both quantitative and qualitative analyses (346.84 in FVD and 48.01 in IS on UCF-101). Our project page is at https://videoassembler.github.io/videoassembler.

Cascade: A Platform for Delay-Sensitive Edge Intelligence

  • paper_url: http://arxiv.org/abs/2311.17329
  • repo_url: None
  • paper_authors: Weijia Song, Thiago Garrett, Yuting Yang, Mingzhao Liu, Edward Tremel, Lorenzo Rosa, Andrea Merlina, Roman Vitenberg, Ken Birman
  • for: 这篇论文是为了解决智能应用程序的响应时间问题,以提高AI/ML平台的吞吐量和资源管理效率。
  • methods: 这篇论文提出了一种名为Cascade的新型AI/ML托管平台,其包括一个兼容遗留系统的存储层和一个将数据与计算共置的“快速路径”,以最大化响应性。
  • results: 根据评估结果,Cascade可以大幅降低响应时间,而无损高吞吐量。
    Abstract Interactive intelligent computing applications are increasingly prevalent, creating a need for AI/ML platforms optimized to reduce per-event latency while maintaining high throughput and efficient resource management. Yet many intelligent applications run on AI/ML platforms that optimize for high throughput even at the cost of high tail-latency. Cascade is a new AI/ML hosting platform intended to untangle this puzzle. Innovations include a legacy-friendly storage layer that moves data with minimal copying and a "fast path" that collocates data and computation to maximize responsiveness. Our evaluation shows that Cascade reduces latency by orders of magnitude with no loss of throughput.
    摘要 交互式智能计算应用日益普遍,亟需能够在保持高吞吐量和高效资源管理的同时降低单事件延迟的AI/ML平台。然而,许多智能应用所运行的AI/ML平台为追求高吞吐量而不惜付出高尾延迟的代价。Cascade是一个旨在解开这一难题的新型AI/ML托管平台,其创新点包括:兼容遗留系统、以最少拷贝移动数据的存储层,以及将数据与计算共置以最大化响应性的“快速路径”。我们的评估表明,Cascade可将延迟降低数个数量级,且吞吐量毫无损失。

Accelerating DNN Training With Photonics: A Residue Number System-Based Design

  • paper_url: http://arxiv.org/abs/2311.17323
  • repo_url: None
  • paper_authors: Cansu Demirkiran, Guowei Yang, Darius Bunandar, Ajay Joshi
  • for: 这个论文是为了提高深度神经网络(DNN)的训练速度和能效性而设计的。
  • methods: 这篇论文将余数系统(RNS)与光子计算相结合,以克服光子硬件中的精度限制,从而实现高能效的DNN训练。
  • results: 研究人员提出了一种基于RNS、在模拟域中执行模运算的光子张量核心,在不牺牲精度的情况下达到了与FP32训练相当的准确率。在等能量场景下,Mirage在多个DNN模型上的训练速度平均比脉动阵列快23.8倍、EDP低32.1倍;在等面积场景下,功耗低42.8倍,且EDP相当或更优。
    Abstract Photonic computing is a compelling avenue for performing highly efficient matrix multiplication, a crucial operation in Deep Neural Networks (DNNs). While this method has shown great success in DNN inference, meeting the high precision demands of DNN training proves challenging due to the precision limitations imposed by costly data converters and the analog noise inherent in photonic hardware. This paper proposes Mirage, a photonic DNN training accelerator that overcomes the precision challenges in photonic hardware using the Residue Number System (RNS). RNS is a numeral system based on modular arithmetic$\unicode{x2014}$allowing us to perform high-precision operations via multiple low-precision modular operations. In this work, we present a novel micro-architecture and dataflow for an RNS-based photonic tensor core performing modular arithmetic in the analog domain. By combining RNS and photonics, Mirage provides high energy efficiency without compromising precision and can successfully train state-of-the-art DNNs achieving accuracy comparable to FP32 training. Our study shows that on average across several DNNs when compared to systolic arrays, Mirage achieves more than $23.8\times$ faster training and $32.1\times$ lower EDP in an iso-energy scenario and consumes $42.8\times$ lower power with comparable or better EDP in an iso-area scenario.
    摘要 光子计算是实现高效矩阵乘法(深度神经网络(DNN)的关键运算)的一条富有吸引力的途径。尽管该方法在DNN推理中已取得巨大成功,但由于昂贵的数据转换器带来的精度限制以及光子硬件固有的模拟噪声,满足DNN训练的高精度需求颇具挑战。本文提出Mirage,一种利用余数系统(RNS)克服光子硬件精度难题的光子DNN训练加速器。RNS是基于模运算的数制,使我们能够通过多个低精度的模运算来完成高精度运算。本文提出了一种新颖的微架构与数据流,用于在模拟域中执行模运算的基于RNS的光子张量核心。通过将RNS与光子学相结合,Mirage在不牺牲精度的情况下提供了高能效,能够成功训练最先进的DNN,达到与FP32训练相当的准确率。我们的研究表明,在多个DNN上与脉动阵列相比,等能量场景下Mirage的训练速度平均快23.8倍以上、EDP(能量延迟积)低32.1倍;等面积场景下功耗低42.8倍,EDP相当或更优。
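A minimal pure-Python illustration of the RNS idea at the heart of the design: a high-precision multiplication decomposed into independent low-precision modular multiplies, with Chinese Remainder Theorem reconstruction. The moduli are chosen for the example and are not Mirage's hardware parameters:

```python
from math import prod

MODULI = (7, 15, 31, 16)   # pairwise coprime, each fits in 3-5 bits

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_mul(a, b):
    # The key property: multiplication decomposes into small, independent
    # modular multiplies, a good fit for low-precision analog hardware.
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))

def from_rns(residues):
    """Chinese Remainder Theorem reconstruction back to an integer."""
    M = prod(MODULI)
    x = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # modular inverse of Mi mod m
    return x % M

a, b = 123, 217
assert from_rns(rns_mul(to_rns(a), to_rns(b))) == a * b  # 26691 < 52080 = M
print(from_rns(rns_mul(to_rns(a), to_rns(b))))
```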

Universal Self-Consistency for Large Language Model Generation

  • paper_url: http://arxiv.org/abs/2311.17311
  • repo_url: None
  • paper_authors: Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, Denny Zhou
  • for: 本文探讨利用通用自我一致性(USC)提升数学推理、代码生成、长文本摘要和开放式问答等多种挑战性任务的表现。
  • methods: 提出让 LLM 自身从多个候选答案中选出最一致的答案,而不再依赖答案抽取过程来聚合多个解。
  • results: 在多个基准上的评估表明,USC 能有效利用多个样本,在原始自我一致性方法不适用的开放式生成任务上提升性能;在数学推理任务上与标准自我一致性持平,在代码生成任务上无需执行结果即可匹配基于执行的投票性能,且不要求答案格式相似。
    Abstract Self-consistency with chain-of-thought prompting (CoT) has demonstrated remarkable performance gains on various challenging tasks, by utilizing multiple reasoning paths sampled from large language models (LLMs). However, self-consistency relies on the answer extraction process to aggregate multiple solutions, which is not applicable to free-form answers. In this work, we propose Universal Self-Consistency (USC), which leverages LLMs themselves to select the most consistent answer among multiple candidates. We evaluate USC on a variety of benchmarks, including mathematical reasoning, code generation, long-context summarization, and open-ended question answering. On open-ended generation tasks where the original self-consistency method is not applicable, USC effectively utilizes multiple samples and improves the performance. For mathematical reasoning, USC matches the standard self-consistency performance without requiring the answer formats to be similar. Finally, without access to execution results, USC also matches the execution-based voting performance on code generation.
    摘要 链式思维提示(CoT)下的自我一致性通过从大语言模型(LLM)中采样多条推理路径,在各类挑战性任务上展现了显著的性能提升。然而,自我一致性依赖答案抽取过程来聚合多个解,因而不适用于自由形式的答案。为此,我们提出通用自我一致性(Universal Self-Consistency,USC),利用 LLM 自身从多个候选中选出最一致的答案。我们在数学推理、代码生成、长文本摘要和开放式问答等多个基准上评估 USC。在原始自我一致性方法不适用的开放式生成任务上,USC 能有效利用多个样本并提升性能;在数学推理任务上,USC 无需答案格式相似即可与标准自我一致性性能持平;最后,在无法获取执行结果的情况下,USC 在代码生成任务上也能与基于执行的投票性能相当。
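下面是 USC 流程的一个最小草图:先对同一问题采样多个回答,再构造一个选择提示让 LLM 自己挑出最一致的候选。其中 generate() 是假设的 LLM 调用接口,提示措辞也仅为示意,并非论文原文。

```python
def generate(prompt: str, temperature: float = 0.7) -> str:
    """假设的 LLM 接口:给定提示返回一段文本。"""
    raise NotImplementedError("请替换为实际的模型调用")

def universal_self_consistency(question: str, n_samples: int = 8) -> str:
    # 第一步:对同一问题采样多个自由形式的回答
    candidates = [generate(question) for _ in range(n_samples)]

    # 第二步:把所有候选拼进一个选择提示,让 LLM 选出最一致的一个
    listing = "\n".join(f"Response {i}: {c}" for i, c in enumerate(candidates))
    selection_prompt = (
        f"{question}\n\n{listing}\n\n"
        "Evaluate these responses and select the most consistent response "
        "based on majority consensus. Answer with the response index only."
    )
    choice = generate(selection_prompt, temperature=0.0)
    idx = int("".join(ch for ch in choice if ch.isdigit()) or 0)
    return candidates[min(idx, n_samples - 1)]
```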

RoKEPG: RoBERTa and Knowledge Enhancement for Prescription Generation of Traditional Chinese Medicine

  • paper_url: http://arxiv.org/abs/2311.17307
  • repo_url: None
  • paper_authors: Hua Pu, Jiacong Mi, Shan Lu, Jieyue He
  • for: 本研究旨在揭示症状与中药处方之间复杂的非线性关系,以辅助临床医生诊断和治疗。
  • methods: 我们提出了一种基于 RoBERTa 和知识增强的中药处方生成模型(RoKEPG):先在自建的中药语料库上预训练,再微调预训练模型,并通过注意力掩码矩阵引入四类中医知识来引导模型生成处方。
  • results: 在公开的中药处方数据集上的实验表明,与结果最好的基线模型相比,RoKEPG 将 F1 指标提高约 2%。
    Abstract Traditional Chinese medicine (TCM) prescription is the most critical form of TCM treatment, and uncovering the complex nonlinear relationship between symptoms and TCM is of great significance for clinical practice and assisting physicians in diagnosis and treatment. Although there have been some studies on TCM prescription generation, these studies consider a single factor and directly model the symptom-prescription generation problem mainly based on symptom descriptions, lacking guidance from TCM knowledge. To this end, we propose a RoBERTa and Knowledge Enhancement model for Prescription Generation of Traditional Chinese Medicine (RoKEPG). RoKEPG is firstly pre-trained by our constructed TCM corpus, followed by fine-tuning the pre-trained model, and the model is guided to generate TCM prescriptions by introducing four classes of knowledge of TCM through the attention mask matrix. Experimental results on the publicly available TCM prescription dataset show that RoKEPG improves the F1 metric by about 2% over the baseline model with the best results.
    摘要 传统中医(TCM)处方是中医治疗中最关键的形式,揭示症状与中药处方之间复杂的非线性关系,对临床实践以及辅助医生诊疗具有重要意义。尽管已有一些关于中药处方生成的研究,但这些研究只考虑单一因素,主要基于症状描述直接建模“症状-处方”生成问题,缺乏中医知识的指导。为此,我们提出了基于 RoBERTa 和知识增强的中药处方生成模型(RoKEPG)。RoKEPG 首先在我们构建的中医语料库上预训练,随后对预训练模型进行微调,并通过注意力掩码矩阵引入四类中医知识来引导模型生成处方。在公开的中药处方数据集上的实验结果表明,与结果最好的基线模型相比,RoKEPG 将 F1 指标提高约 2%。

Two-Step Reinforcement Learning for Multistage Strategy Card Game

  • paper_url: http://arxiv.org/abs/2311.17305
  • repo_url: None
  • paper_authors: Konrad Godlewski, Bartosz Sawicki
  • for: 本研究旨在为《The Lord of the Rings: The Card Game》(LOTRCG)这一复杂的多阶段策略卡牌游戏开发一种两步强化学习策略。
  • methods: 本研究采用了分阶段学习方法,首先在简化版游戏中进行基础学习,然后在完整的游戏环境中进行进一步学习。这种方法使得人工智能代理人在LOTRCG的不可预测和复杂的环境中表现出了显著的适应能力和性能提升。此外,本研究还探索了多代理人系统,在游戏中使用不同的决策方法。
  • results: 在一组10000个随机游戏中,RL代理人实现了78.5%的赢利率,表明这种两步强化学习策略在LOTRCG中具有显著的性能提升。
    Abstract In the realm of artificial intelligence and card games, this study introduces a two-step reinforcement learning (RL) strategy tailored for "The Lord of the Rings: The Card Game (LOTRCG)," a complex multistage strategy card game. This research diverges from conventional RL methods by adopting a phased learning approach, beginning with a foundational learning stage in a simplified version of the game and subsequently progressing to the complete, intricate game environment. This methodology notably enhances the AI agent's adaptability and performance in the face of LOTRCG's unpredictable and challenging nature. The paper also explores a multi-agent system, where distinct RL agents are employed for various decision-making aspects of the game. This approach has demonstrated a remarkable improvement in game outcomes, with the RL agents achieving a winrate of 78.5% across a set of 10,000 random games.
    摘要 本研究在人工智能与卡牌游戏的交叉领域,为《The Lord of the Rings: The Card Game》(LOTRCG)提出了一种两步强化学习(RL)策略。与传统 RL 方法不同,本研究采用分阶段学习:先在简化版游戏中进行基础学习,再进入完整而复杂的游戏环境。这种方法显著提高了 AI 智能体在 LOTRCG 这种不可预测且充满挑战的环境中的适应能力与表现。研究还探讨了多智能体系统,即针对游戏中不同的决策环节使用不同的 RL 智能体。该方法带来了显著的对局结果改善:在 10,000 局随机对局中,RL 智能体达到了 78.5% 的胜率。

Enhancing the Performance of Neural Networks Through Causal Discovery and Integration of Domain Knowledge

  • paper_url: http://arxiv.org/abs/2311.17303
  • repo_url: None
  • paper_authors: Xiaoge Zhang, Xiao-Lin Wang, Fenglei Fan, Yiu-Ming Cheung, Indranil Bose
  • for: 提出一种通用方法,将观测变量间的层次因果结构编码进神经网络,以提升其预测性能
  • methods: 三个步骤:1. 通过有向无环图(DAG)学习从观测数据中发现因果关系;2. 将发现的因果结构系统地编码进神经网络;3. 利用冲突梯度投影缓解多个学习任务间的梯度干扰
  • results: 在 UCI 数据集上取得了显著的预测性能提升,消融研究进一步证实了整合结构性与定量因果知识能逐步增强神经网络的预测性能。
    Abstract In this paper, we develop a generic methodology to encode hierarchical causality structure among observed variables into a neural network in order to improve its predictive performance. The proposed methodology, called causality-informed neural network (CINN), leverages three coherent steps to systematically map the structural causal knowledge into the layer-to-layer design of neural network while strictly preserving the orientation of every causal relationship. In the first step, CINN discovers causal relationships from observational data via directed acyclic graph (DAG) learning, where causal discovery is recast as a continuous optimization problem to avoid the combinatorial nature. In the second step, the discovered hierarchical causality structure among observed variables is systematically encoded into neural network through a dedicated architecture and customized loss function. By categorizing variables in the causal DAG as root, intermediate, and leaf nodes, the hierarchical causal DAG is translated into CINN with a one-to-one correspondence between nodes in the causal DAG and units in the CINN while maintaining the relative order among these nodes. Regarding the loss function, both intermediate and leaf nodes in the DAG graph are treated as target outputs during CINN training so as to drive co-learning of causal relationships among different types of nodes. As multiple loss components emerge in CINN, we leverage the projection of conflicting gradients to mitigate gradient interference among the multiple learning tasks. Computational experiments across a broad spectrum of UCI data sets demonstrate substantial advantages of CINN in predictive performance over other state-of-the-art methods. In addition, an ablation study underscores the value of integrating structural and quantitative causal knowledge in enhancing the neural network's predictive performance incrementally.
    摘要 本文提出了一种通用方法,将观测变量间的层次因果结构编码进神经网络,以提升其预测性能。我们将该方法称为因果信息神经网络(CINN)。CINN 通过三个连贯的步骤,把结构性因果知识系统地映射到神经网络的逐层设计中,并严格保持每条因果关系的方向。第一步,CINN 通过有向无环图(DAG)学习从观测数据中发现因果关系,并将因果发现重新表述为连续优化问题,以避免其组合爆炸的性质。第二步,采用专门的架构与定制的损失函数,把发现的层次因果结构系统地编码进神经网络:将因果 DAG 中的变量划分为根节点、中间节点和叶节点,让 DAG 中的节点与 CINN 中的单元一一对应,并保持节点间的相对顺序。在损失函数方面,DAG 中的中间节点和叶节点都被当作训练目标输出,以推动不同类型节点间因果关系的联合学习。由于 CINN 中出现多个损失分量,我们利用冲突梯度投影来缓解多任务学习间的梯度干扰。在大量 UCI 数据集上的计算实验表明,CINN 的预测性能明显优于其他最先进方法。此外,消融研究也证实了整合结构性与定量因果知识能逐步提升神经网络的预测性能。
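文中“冲突梯度投影”一步可以用 PCGrad 风格的操作来理解。下面是一个通用草图(并非 CINN 的官方实现):当两个任务梯度方向冲突(内积为负)时,把一个梯度在另一个梯度方向上的分量去掉。

```python
import torch

def project_conflicting(grads):
    """grads: 每个任务损失的扁平梯度列表;冲突时把 g_i 投影到 g_j 的法平面上。"""
    projected = [g.clone() for g in grads]
    for i, g_i in enumerate(projected):
        for j, g_j in enumerate(grads):
            if i == j:
                continue
            dot = torch.dot(g_i, g_j)
            if dot < 0:  # 方向冲突:去掉 g_i 在 g_j 上的分量
                g_i -= dot / (g_j.norm() ** 2 + 1e-12) * g_j
    return torch.stack(projected).sum(dim=0)  # 合并后的无冲突更新方向

# 用法:分别对每个损失求扁平梯度,再用合并结果更新参数
g1 = torch.tensor([1.0, 0.0])
g2 = torch.tensor([-0.5, 1.0])
print(project_conflicting([g1, g2]))
```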

Language Models: A Guide for the Perplexed

  • paper_url: http://arxiv.org/abs/2311.17301
  • repo_url: None
  • paper_authors: Sofia Serrano, Zander Brumbaugh, Noah A. Smith
  • for: 这篇论文的目的是帮助缩小研究者、教育者与公众在语言模型话题上的讨论差距,提供一篇关于语言模型的教程。
  • methods: 这篇论文采用科学视角,专注于可以通过实验研究的问题,并将语言模型置于促成其发展的研究背景之中。
  • results: 这篇论文描述了截至撰写时人们对语言模型认识的边界,提供了一份清晰简明的技术概述。
    Abstract Given the growing importance of AI literacy, we decided to write this tutorial to help narrow the gap between the discourse among those who study language models -- the core technology underlying ChatGPT and similar products -- and those who are intrigued and want to learn more about them. In short, we believe the perspective of researchers and educators can add some clarity to the public's understanding of the technologies beyond what's currently available, which tends to be either extremely technical or promotional material generated about products by their purveyors. Our approach teases apart the concept of a language model from products built on them, from the behaviors attributed to or desired from those products, and from claims about similarity to human cognition. As a starting point, we (1) offer a scientific viewpoint that focuses on questions amenable to study through experimentation; (2) situate language models as they are today in the context of the research that led to their development; and (3) describe the boundaries of what is known about the models at this writing.
    摘要 鉴于人工智能素养的重要性日益增长,我们撰写了这篇教程,以帮助缩小语言模型研究者(语言模型是 ChatGPT 及类似产品的核心技术)与希望进一步了解它们的公众之间的讨论差距。简而言之,我们认为研究者和教育者的视角能为公众理解这些技术提供现有材料之外的清晰度,因为目前可获得的材料要么过于技术化,要么是产品供应商的宣传材料。我们的方法把语言模型的概念与基于它们构建的产品、人们赋予或期望这些产品的行为、以及与人类认知相似性的论断区分开来。作为起点,我们:1. 提供一个科学视角,专注于可以通过实验研究的问题;2. 将当今的语言模型置于促成其发展的研究背景之中;3. 描述截至撰写时人们对这些模型认识的边界。

Elo Uncovered: Robustness and Best Practices in Language Model Evaluation

  • paper_url: http://arxiv.org/abs/2311.17295
  • repo_url: None
  • paper_authors: Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker, Marzieh Fadaee
  • for: 评估大语言模型 (LLMs) 的比较评价方法
  • methods: 使用 Elo 评分系统进行“A vs B”成对比较
  • results: Elo 评分系统存在一定的不稳定性和不一致性问题,需要进行改进以确保评价结果的可靠性。
    Abstract In Natural Language Processing (NLP), the Elo rating system, originally designed for ranking players in dynamic games such as chess, is increasingly being used to evaluate Large Language Models (LLMs) through "A vs B" paired comparisons. However, while popular, the system's suitability for assessing entities with constant skill levels, such as LLMs, remains relatively unexplored. We study two fundamental axioms that evaluation methods should adhere to: reliability and transitivity. We conduct extensive evaluation of Elo behaviour, illustrating that individual Elo computations exhibit volatility and delving into the impact of varying the Elo rating system's hyperparameters. We show that these axioms are not always satisfied raising questions about the reliability of current comparative evaluations of LLMs. If the current use of Elo scores is intended to substitute the costly head-to-head comparison of LLMs, it is crucial to ensure the ranking is as robust as possible. Guided by the axioms, our findings offer concrete guidelines for enhancing the reliability of LLM evaluation methods, suggesting a need for reassessment of existing comparative approaches.
    摘要 在自然语言处理(NLP)领域,最初为国际象棋等动态对弈中的选手排名而设计的 Elo 评分系统,正越来越多地被用于通过“A vs B”成对比较来评估大语言模型(LLM)。然而,尽管该系统广受欢迎,它是否适合评估 LLM 这类技能水平恒定的实体仍缺乏深入探讨。我们研究了评估方法应当满足的两条基本公理:可靠性和传递性。我们对 Elo 的行为进行了大量评估,表明单次 Elo 计算存在波动性,并深入分析了改变 Elo 评分系统超参数的影响。我们发现这两条公理并不总能得到满足,这对当前基于 Elo 分数的 LLM 对比评估的可靠性提出了质疑。如果使用 Elo 分数是为了替代代价高昂的 LLM 两两对比,那么确保排名尽可能稳健就至关重要。基于上述公理,我们的发现为提升 LLM 评估方法的可靠性提供了具体指南,并建议重新审视现有的对比评估方式。
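为便于理解文中讨论的波动性,下面给出标准 Elo 更新规则的最小实现:每场“A vs B”对比后按期望胜率调整分数。K 值与对局顺序都是会影响结果稳定性的超参数;示例数值为假设。

```python
def expected_score(r_a: float, r_b: float) -> float:
    # A 对 B 的期望胜率(logistic 曲线,400 为标准刻度)
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a: 模型 A 的实际得分(胜 1、平 0.5、负 0)。"""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# 同一批对比结果换一种呈现顺序,最终分数往往不同,这正是波动性的来源之一
ratings = {"A": 1000.0, "B": 1000.0}
for winner, loser in [("A", "B"), ("B", "A"), ("A", "B")]:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], 1.0)
print(ratings)
```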

cs.CL - 2023-11-29

Knowledge Pursuit Prompting for Zero-Shot Multimodal Synthesis

  • paper_url: http://arxiv.org/abs/2311.17898
  • repo_url: None
  • paper_authors: Jinqi Luo, Kwan Ho Ryan Chan, Dimitris Dimos, René Vidal
  • for: 提高多模态生成模型的质量和准确性,而不需要大量的文本-图像对应数据。
  • methods: 提出了一种零shot框架,通过外部知识检索和语言模型压缩来帮助生成器生成可靠的视觉内容。
  • results: 在多个文本驱动生成任务(图像、3D渲染和视频)中,KPP 都能生成准确且语义丰富的视觉内容,并能适配不同的视觉领域和基础模型。
    Abstract Hallucinations and unfaithful synthesis due to inaccurate prompts with insufficient semantic details are widely observed in multimodal generative models. A prevalent strategy to align multiple modalities is to fine-tune the generator with a large number of annotated text-image pairs. However, such a procedure is labor-consuming and resource-draining. The key question we ask is: can we enhance the quality and faithfulness of text-driven generative models beyond extensive text-image pair annotations? To address this question, we propose Knowledge Pursuit Prompting (KPP), a zero-shot framework that iteratively incorporates external knowledge to help generators produce reliable visual content. Instead of training generators to handle generic prompts, KPP employs a recursive knowledge query process to gather informative external facts from the knowledge base, instructs a language model to compress the acquired knowledge for prompt refinement, and utilizes text-driven generators for visual synthesis. The entire process is zero-shot, without accessing the architectures and parameters of generative models. We evaluate the framework across multiple text-driven generative tasks (image, 3D rendering, and video) on datasets of different domains. We further demonstrate the extensibility and adaptability of KPP through varying foundation model bases and instructions. Our results show that KPP is capable of generating faithful and semantically rich content across diverse visual domains, offering a promising solution to improve multimodal generative models.
    摘要 多模态生成模型中普遍存在由提示不准确、语义细节不足导致的幻觉与不忠实合成问题。对齐多种模态的常见策略是用大量标注的文本-图像对微调生成器,但这一过程费时费力、消耗资源。我们提出的核心问题是:能否在不依赖大规模文本-图像对标注的前提下,提升文本驱动生成模型的质量与忠实度?为此,我们提出知识追求提示(KPP),一种零样本框架,通过迭代引入外部知识帮助生成器产出可靠的视觉内容。KPP 不是训练生成器去处理通用提示,而是采用递归的知识查询过程从知识库中收集有用的外部事实,指示语言模型压缩所获知识以精炼提示,最后利用文本驱动生成器进行视觉合成。整个过程是零样本的,无需访问生成模型的架构和参数。我们在不同领域的数据集上评估了多个文本驱动生成任务(图像、3D渲染和视频),并通过更换基础模型和指令展示了 KPP 的可扩展性与适应性。结果表明,KPP 能在多种视觉领域生成忠实且语义丰富的内容,为改进多模态生成模型提供了一条有前景的途径。

Higher-Order DisCoCat (Peirce-Lambek-Montague semantics)

  • paper_url: http://arxiv.org/abs/2311.17813
  • repo_url: None
  • paper_authors: Alexis Toumi, Giovanni de Felice
  • for: 这篇论文提出一种新的高阶 DisCoCat 模型,其中词义不再是一个字符串图,而是一个以字符串图为取值的高阶函数。
  • methods: 该论文以 lambda 演算和 Montague 语义为基础,将字符串图作为基本原语,从而能够处理自然语言语义中的高阶与非线性过程。
  • results: 通过将 Lambek 演算翻译到 Peirce 的一阶逻辑系统 beta,论文给出了对副词、介词、否定和量词等高阶与非线性语言现象的纯图示化处理。
    Abstract We propose a new definition of higher-order DisCoCat (categorical compositional distributional) models where the meaning of a word is not a diagram, but a diagram-valued higher-order function. Our models can be seen as a variant of Montague semantics based on a lambda calculus where the primitives act on string diagrams rather than logical formulae. As a special case, we show how to translate from the Lambek calculus into Peirce's system beta for first-order logic. This allows us to give a purely diagrammatic treatment of higher-order and non-linear processes in natural language semantics: adverbs, prepositions, negation and quantifiers. The theoretical definition presented in this article comes with a proof-of-concept implementation in DisCoPy, the Python library for string diagrams.
    摘要 我们提出高阶 DisCoCat(范畴化组合分布)模型的一个新定义,其中词义不是一个字符串图,而是一个以字符串图为取值的高阶函数。我们的模型可以看作 Montague 语义的一种变体,基于 lambda 演算,只是其中的基本原语作用于字符串图而非逻辑公式。作为特例,我们展示了如何把 Lambek 演算翻译到 Peirce 的一阶逻辑系统 beta。这使我们能够对自然语言语义中的高阶与非线性过程,包括副词、介词、否定和量词,给出纯图示化的处理。本文给出的理论定义附带一个在 DisCoPy(用于字符串图的 Python 库)中的概念验证实现。

DSS: Synthesizing long Digital Ink using Data augmentation, Style encoding and Split generation

  • paper_url: http://arxiv.org/abs/2311.17786
  • repo_url: None
  • paper_authors: Aleksandr Timofeev, Anastasiia Fadeeva, Andrei Afonin, Claudiu Musat, Andrii Maksai
  • for: 提升数字墨水生成模型对长文本数据的泛化能力
  • methods: 针对手写领域定制的对比学习技术,结合训练数据扩增、模型结构调整和推理过程修改,可应用于任何处理数字墨水的编码器-解码器模型
  • results: 在英文长文本上,字符错误率相比基线 RNN 降低一半,相比先前针对同一问题的方法降低 16%;三项改进均提升了生成墨水的可识别性。
    Abstract As text generative models can give increasingly long answers, we tackle the problem of synthesizing long text in digital ink. We show that the commonly used models for this task fail to generalize to long-form data and how this problem can be solved by augmenting the training data, changing the model architecture and the inference procedure. These methods use contrastive learning technique and are tailored specifically for the handwriting domain. They can be applied to any encoder-decoder model that works with digital ink. We demonstrate that our method reduces the character error rate on long-form English data by half compared to baseline RNN and by 16% compared to the previous approach that aims at addressing the same problem. We show that all three parts of the method improve recognizability of generated inks. In addition, we evaluate synthesized data in a human study and find that people perceive most of generated data as real.
    摘要 随着文本生成模型能够给出越来越长的回答,我们着手解决数字墨水长文本合成的问题。我们发现该任务常用的模型难以泛化到长文本数据,并提出通过扩增训练数据、改变模型结构和修改推理过程来解决这一问题。这些方法使用对比学习技术,并专门针对手写领域定制,可应用于任何处理数字墨水的编码器-解码器模型。实验表明,我们的方法在英文长文本上将字符错误率相比基线 RNN 降低一半,相比先前针对同一问题的方法降低 16%,并且方法的三个组成部分都提升了生成墨水的可识别性。此外,我们通过人工评估发现,大多数生成数据会被人认为是真实的。

Supervising the Centroid Baseline for Extractive Multi-Document Summarization

  • paper_url: http://arxiv.org/abs/2311.17771
  • repo_url: None
  • paper_authors: Simão Gonçalves, Gonçalo Correia, Diogo Pernes, Afonso Mendes
  • for: 本研究旨在改进抽取式多文档摘要的质心基线方法,并在包括多语言场景在内的多个数据集上进行评估。
  • methods: 在句子选择中加入束搜索过程,并引入质心估计注意力模型,以提升结果质量。
  • results: 在多个多文档摘要数据集上取得了更好的结果,包括多语言场景。
    Abstract The centroid method is a simple approach for extractive multi-document summarization and many improvements to its pipeline have been proposed. We further refine it by adding a beam search process to the sentence selection and also a centroid estimation attention model that leads to improved results. We demonstrate this in several multi-document summarization datasets, including in a multilingual scenario.
    摘要 质心法是一种用于抽取式多文档摘要的简单方法,其流程已有许多改进被提出。我们进一步加以精化:在句子选择中加入束搜索过程,并引入质心估计注意力模型,从而获得更好的结果。我们在多个多文档摘要数据集上验证了这一点,包括一个多语言场景。
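下面是质心法加束搜索句子选择的一个简化草图(基于 TF-IDF 质心与余弦相似度),仅为说明思路,与论文的质心估计注意力模型无关:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def centroid_beam_summary(sentences, k=3, beam=4):
    vec = TfidfVectorizer().fit(sentences)
    S = vec.transform(sentences).toarray()
    centroid = S.mean(axis=0)  # 全部句子的 TF-IDF 质心

    def score(idxs):  # 候选摘要向量与质心的余弦相似度
        v = S[list(idxs)].sum(axis=0)
        return float(v @ centroid / (np.linalg.norm(v) * np.linalg.norm(centroid) + 1e-12))

    beams = [frozenset()]
    for _ in range(k):  # 每轮把每个候选集合扩展一个句子,保留得分最高的 beam 个
        cands = {b | {i} for b in beams for i in range(len(sentences)) if i not in b}
        beams = sorted(cands, key=score, reverse=True)[:beam]
    return [sentences[i] for i in sorted(beams[0])]

docs = ["the cat sat on the mat", "dogs chase cats", "the mat was red", "cats like mats"]
print(centroid_beam_summary(docs, k=2))
```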

End-to-end Joint Rich and Normalized ASR with a limited amount of rich training data

  • paper_url: http://arxiv.org/abs/2311.17741
  • repo_url: None
  • paper_authors: Can Cui, Imran Ahamad Sheikh, Mostafa Sadeghi, Emmanuel Vincent
  • for: 这篇论文旨在实现一个可同时输出富文本(带标点和大小写)与规范化转写的端到端自动语音识别(ASR)系统,并使其适用于流式应用。
  • methods: 在仅有少量富文本标注数据的条件下,作者比较了两种训练无状态 Transducer 端到端联合富文本/规范化 ASR 系统的方法:第一种用语言模型为规范化训练数据生成伪富文本转写;第二种使用按输出类型进行条件化的单一解码器。
  • results: 第一种方法得到的端到端富文本 ASR 在域外数据上表现更好,错误率相对降低最多 9%;第二种方法证明了仅用 5% 的富文本训练数据即可构建端到端联合富文本/规范化 ASR 系统,错误率仅有适度上升(绝对值 2.42%)。
    Abstract Joint rich and normalized automatic speech recognition (ASR), that produces transcriptions both with and without punctuation and capitalization, remains a challenge. End-to-end (E2E) ASR models offer both convenience and the ability to perform such joint transcription of speech. Training such models requires paired speech and rich text data, which is not widely available. In this paper, we compare two different approaches to train a stateless Transducer-based E2E joint rich and normalized ASR system, ready for streaming applications, with a limited amount of rich labeled data. The first approach uses a language model to generate pseudo-rich transcriptions of normalized training data. The second approach uses a single decoder conditioned on the type of the output. The first approach leads to E2E rich ASR which perform better on out-of-domain data, with up to 9% relative reduction in errors. The second approach demonstrates the feasibility of an E2E joint rich and normalized ASR system using as low as 5% rich training data with moderate (2.42% absolute) increase in errors.
    摘要 能同时输出带标点、大小写的富文本转写与规范化转写的联合自动语音识别(ASR)仍是一项挑战。端到端(E2E)ASR 模型既便捷,又具备完成这种联合转写的能力,但训练此类模型需要成对的语音与富文本数据,而这类数据并不充足。本文比较了两种在少量富文本标注数据条件下训练无状态 Transducer 端到端联合富文本/规范化 ASR 系统(可用于流式应用)的方法。第一种方法使用语言模型为规范化训练数据生成伪富文本转写;第二种方法使用按输出类型进行条件化的单一解码器。第一种方法得到的端到端富文本 ASR 在域外数据上表现更好,错误率相对降低最多 9%。第二种方法证明了仅用 5% 的富文本训练数据即可构建端到端联合富文本/规范化 ASR 系统,错误率仅有适度上升(绝对值 2.42%)。

SenTest: Evaluating Robustness of Sentence Encoders

  • paper_url: http://arxiv.org/abs/2311.17722
  • repo_url: None
  • paper_authors: Tanmay Chavan, Shantanu Patankar, Aditya Kane, Omkar Gokhale, Geetanjali Kale, Raviraj Joshi
  • for: 评估句子编码器的鲁棒性。
  • methods: 使用多种对抗攻击评估句子编码器的鲁棒性,包括字符级的随机字符替换、词级的同义词替换,以及句子级的句内词序打乱。
  • results: 实验结果强烈质疑句子编码器的鲁棒性:模型在扰动数据集上的准确率相比未扰动数据集最多可下降 15 个百分点。不过,实验也表明这些嵌入确实捕捉了句子的语义与句法结构(词序)信息,只是现有的监督分类策略未能利用这些信息,仅充当 n-gram 检测器。
    Abstract Contrastive learning has proven to be an effective method for pre-training models using weakly labeled data in the vision domain. Sentence transformers are the NLP counterparts to this architecture, and have been growing in popularity due to their rich and effective sentence representations. Having effective sentence representations is paramount in multiple tasks, such as information retrieval, retrieval augmented generation (RAG), and sentence comparison. Keeping in mind the deployability factor of transformers, evaluating the robustness of sentence transformers is of utmost importance. This work focuses on evaluating the robustness of the sentence encoders. We employ several adversarial attacks to evaluate its robustness. This system uses character-level attacks in the form of random character substitution, word-level attacks in the form of synonym replacement, and sentence-level attacks in the form of intra-sentence word order shuffling. The results of the experiments strongly undermine the robustness of sentence encoders. The models produce significantly different predictions as well as embeddings on perturbed datasets. The accuracy of the models can fall up to 15 percent on perturbed datasets as compared to unperturbed datasets. Furthermore, the experiments demonstrate that these embeddings does capture the semantic and syntactic structure (sentence order) of sentences. However, existing supervised classification strategies fail to leverage this information, and merely function as n-gram detectors.
    摘要 对比学习已被证明是视觉领域利用弱标注数据预训练模型的有效方法。句子变换器(sentence transformers)是这一架构在 NLP 领域的对应物,凭借其丰富而有效的句子表示日益流行。在信息检索、检索增强生成(RAG)和句子比较等多项任务中,有效的句子表示至关重要。考虑到变换器的可部署性,评估句子变换器的鲁棒性尤为重要。本工作聚焦于评估句子编码器的鲁棒性。我们采用多种对抗攻击:字符级的随机字符替换、词级的同义词替换,以及句子级的句内词序打乱。实验结果强烈质疑句子编码器的鲁棒性:模型在扰动数据集上给出显著不同的预测与嵌入,准确率相比未扰动数据集最多可下降 15 个百分点。此外,实验表明这些嵌入确实捕捉了句子的语义与句法结构(词序)信息,但现有的监督分类策略未能利用这些信息,仅充当 n-gram 检测器。
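文中三类扰动可以用如下简化代码来理解;同义词表 SYNONYMS 为占位假设,实际评测通常借助 WordNet 等词汇资源。

```python
import random

def char_substitute(text: str, p: float = 0.05) -> str:
    # 字符级扰动:以概率 p 把字母替换为随机字母
    chars = list(text)
    for i, c in enumerate(chars):
        if c.isalpha() and random.random() < p:
            chars[i] = random.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

SYNONYMS = {"good": ["fine", "nice"], "fast": ["quick", "rapid"]}  # 占位词表

def synonym_replace(text: str) -> str:
    # 词级扰动:能查到同义词的词随机替换,查不到的保持原样
    return " ".join(random.choice(SYNONYMS.get(w, [w])) for w in text.split())

def shuffle_words(text: str) -> str:
    # 句子级扰动:打乱句内词序
    words = text.split()
    random.shuffle(words)
    return " ".join(words)

print(char_substitute("the model is good and fast"))
print(synonym_replace("the model is good and fast"))
print(shuffle_words("the model is good and fast"))
```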

How to Build an AI Tutor that Can Adapt to Any Course and Provide Accurate Answers Using Large Language Model and Retrieval-Augmented Generation

  • paper_url: http://arxiv.org/abs/2311.17696
  • repo_url: None
  • paper_authors: Chenxi Dong
  • for: 提供个性化教学支持
  • methods: 使用前沿的大语言模型(LLM)与检索增强生成(RAG)技术
  • results: 实现了面向任意课程的个性化教学支持
    Abstract Artificial intelligence is transforming education through data-driven, personalized learning solutions. This paper introduces AI Tutor, an innovative web application that provides personalized tutoring in any subject using state-of-the-art Large Language Model (LLM). AI Tutor ingests course materials to construct an adaptive knowledge base tailored to the course. When students pose questions, it retrieves the most relevant information and generates detailed, conversational responses citing supporting evidence. The system is powered by advanced large language models and Retrieval-Augmented Generation (RAG) techniques for accurate, natural question answering. We present a fully-functional web interface and video demonstration that showcase AI Tutor's versatility across diverse subjects and its ability to produce pedagogically cogent responses. While an initial prototype, this work represents a pioneering step toward AI-enabled tutoring systems that can democratize access to high-quality, customized educational support.
    摘要 人工智能正通过数据驱动的个性化学习方案改变教育。本文介绍 AI Tutor,一款创新的网络应用,基于最先进的大语言模型(LLM)为任意科目提供个性化辅导。AI Tutor 摄入课程材料,构建针对该课程的自适应知识库;当学生提问时,它检索最相关的信息,并生成引用支撑证据的详尽对话式回答。该系统由先进的大语言模型与检索增强生成(RAG)技术驱动,以实现准确、自然的问答。我们展示了一个功能完整的网页界面和视频演示,体现 AI Tutor 在不同科目上的多样性及其生成符合教学逻辑回答的能力。尽管只是初步原型,这项工作代表了迈向 AI 辅导系统的开创性一步,有望让高质量、定制化的教育支持惠及更多人。
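下面是检索增强生成(RAG)问答流程的一个通用草图:切分课程材料、向量化、按相似度检索最相关片段,再拼入提示交给 LLM。embed() 与 generate() 为假设接口,提示措辞亦为示意,并非 AI Tutor 的真实代码。

```python
import numpy as np

def embed(texts):  # 假设接口:返回每段文本的向量表示(np.ndarray)
    raise NotImplementedError

def generate(prompt):  # 假设接口:LLM 文本生成
    raise NotImplementedError

def answer(question, chunks, top_k=3):
    doc_vecs = embed(chunks)                      # 课程材料片段的向量
    q = embed([question])[0]
    sims = doc_vecs @ q / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-12
    )
    # 取相似度最高的 top_k 个片段作为上下文
    context = "\n\n".join(chunks[i] for i in np.argsort(-sims)[:top_k])
    prompt = (
        "Answer the student's question using only the course material below, "
        f"citing the supporting passage.\n\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```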

Enhancing Answer Selection in Community Question Answering with Pre-trained and Large Language Models

  • paper_url: http://arxiv.org/abs/2311.17502
  • repo_url: None
  • paper_authors: Xinghang Hu
  • for: 本文主要针对社区问答(CQA)中答案选择问题进行研究,以提高答案选择的准确率。
  • methods: 本文提出了问题-答案交叉注意力网络(QAN)模型,利用预训练模型进行答案选择,并借助大语言模型(LLM)进行知识增强。具体而言,使用 BERT 模型作为编码器层,分别对问题主题、问题正文和答案进行预训练,再通过交叉注意力机制为不同问题选出最相关的答案。
  • results: 实验结果表明,QAN 模型在 SemEval2015 和 SemEval2017 两个数据集上达到了最先进的性能;利用 LLM 生成外部知识进行增强,可提升 LLM 在这两个数据集上的正确答案选择率,并且通过优化提示,LLM 能在更多问题上选出正确答案。
    Abstract Community Question Answering (CQA) becomes increasingly prevalent in recent years. However, there are a large number of answers, which is difficult for users to select the relevant answers. Therefore, answer selection is a very significant subtask of CQA. In this paper, we first propose the Question-Answer cross attention networks (QAN) with pre-trained models for answer selection and utilize large language model (LLM) to perform answer selection with knowledge augmentation. Specifically, we apply the BERT model as the encoder layer to do pre-training for question subjects, question bodies and answers, respectively, then the cross attention mechanism selects the most relevant answer for different questions. Experiments show that the QAN model achieves state-of-the-art performance on two datasets, SemEval2015 and SemEval2017. Moreover, we use the LLM to generate external knowledge from questions and correct answers to achieve knowledge augmentation for the answer selection task by LLM, while optimizing the prompt of LLM in different aspects. The results show that the introduction of external knowledge can improve the correct answer selection rate of LLM on datasets SemEval2015 and SemEval2017. Meanwhile, LLM can also select the correct answer on more questions by optimized prompt.
    摘要 社区问答(CQA)近年来日益普遍。然而,答案数量庞大,用户难以从中挑选相关答案,因此答案选择是 CQA 中非常重要的子任务。本文首先提出了问题-答案交叉注意力网络(QAN),使用预训练模型进行答案选择,并利用大语言模型(LLM)以知识增强的方式执行答案选择。具体而言,我们使用 BERT 模型作为编码器层,分别对问题主题、问题正文和答案进行预训练,再通过交叉注意力机制为不同问题选出最相关的答案。实验表明,QAN 模型在 SemEval2015 和 SemEval2017 两个数据集上达到了最先进的性能。此外,我们利用 LLM 从问题和正确答案中生成外部知识,为答案选择任务实现知识增强,并从多个角度优化 LLM 的提示。结果显示,引入外部知识可以提高 LLM 在 SemEval2015 和 SemEval2017 数据集上的正确答案选择率;同时,经过优化的提示还能让 LLM 在更多问题上选出正确答案。

Mergen: The First Manchu-Korean Machine Translation Model Trained on Augmented Data

  • paper_url: http://arxiv.org/abs/2311.17492
  • repo_url: None
  • paper_authors: Jean Seo, Sungjoo Byun, Minha Kang, Sangah Lee
  • for: 为挽救濒危的满语,提出首个满-韩机器翻译(MT)模型。
  • methods: 利用《满文老档》(历史文献)和一部满-韩词典等资源;由于满-韩平行语料稀缺,采用以 GloVe 嵌入(在单语与平行文本上训练)引导的词替换来扩增数据;模型为带双向 GRU 层的编码器-解码器神经机器翻译架构。
  • results: 实验结果显示,该模型显著改善了满-韩翻译效果,BLEU 分数提高 20-30 分。
    Abstract The Manchu language, with its roots in the historical Manchurian region of Northeast China, is now facing a critical threat of extinction, as there are very few speakers left. In our efforts to safeguard the Manchu language, we introduce Mergen, the first-ever attempt at a Manchu-Korean Machine Translation (MT) model. To develop this model, we utilize valuable resources such as the Manwen Laodang(a historical book) and a Manchu-Korean dictionary. Due to the scarcity of a Manchu-Korean parallel dataset, we expand our data by employing word replacement guided by GloVe embeddings, trained on both monolingual and parallel texts. Our approach is built around an encoder-decoder neural machine translation model, incorporating a bi-directional Gated Recurrent Unit (GRU) layer. The experiments have yielded promising results, showcasing a significant enhancement in Manchu-Korean translation, with a remarkable 20-30 point increase in the BLEU score.
    摘要 满语起源于中国东北的历史满洲地区,如今使用者寥寥无几,正面临灭绝的严峻威胁。为了保护满语,我们提出了“Mergen”,首个满-韩机器翻译(MT)模型的尝试。为构建该模型,我们利用了《满文老档》(一部历史文献)和一部满-韩词典等宝贵资源。由于满-韩平行数据稀缺,我们采用以 GloVe 嵌入引导的词替换来扩充数据,其中 GloVe 在单语和平行文本上训练。我们的方法建立在编码器-解码器神经机器翻译模型之上,包含一个双向门控循环单元(GRU)层。实验取得了可喜的结果,满-韩翻译质量显著提升,BLEU 分数提高了 20-30 分。
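文中基于 GloVe 嵌入的词替换扩增,大致可以按如下草图理解:以一定概率把句中词替换为嵌入空间中的近邻词,从而合成新的训练句。embeddings 为假设的 {词: 向量} 词典,替换概率等超参数亦为假设。

```python
import numpy as np

def nearest_word(word, embeddings, topn=5):
    # 返回嵌入空间中与 word 余弦相似度最高的近邻之一
    if word not in embeddings:
        return word
    v = embeddings[word]
    scored = [
        (w, float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u) + 1e-12)))
        for w, u in embeddings.items() if w != word
    ]
    scored.sort(key=lambda t: -t[1])
    cands = [w for w, _ in scored[:topn]]
    return np.random.choice(cands) if cands else word

def augment(sentence, embeddings, p=0.2):
    """以概率 p 把句中词替换为嵌入近邻词,生成新的训练句。"""
    return " ".join(
        nearest_word(w, embeddings) if np.random.rand() < p else w
        for w in sentence.split()
    )
```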

Improving the Robustness of Transformer-based Large Language Models with Dynamic Attention

  • paper_url: http://arxiv.org/abs/2311.17400
  • repo_url: None
  • paper_authors: Lujia Shen, Yuwen Pu, Shouling Ji, Changjiang Li, Xuhong Zhang, Chunpeng Ge, Ting Wang
  • for: This paper aims to enhance the inherent robustness of transformer-based models, such as BERT and GPT, against various textual adversarial attacks.
  • methods: The proposed method, called dynamic attention, consists of two modules: (i) attention rectification, which masks or weakens the attention value of the chosen tokens, and (ii) dynamic modeling, which dynamically builds the set of candidate tokens.
  • results: The proposed dynamic attention significantly mitigates the impact of adversarial attacks, improving up to 33% better performance than previous methods against widely-used adversarial attacks.
    Abstract Transformer-based models, such as BERT and GPT, have been widely adopted in natural language processing (NLP) due to their exceptional performance. However, recent studies show their vulnerability to textual adversarial attacks where the model's output can be misled by intentionally manipulating the text inputs. Despite various methods that have been proposed to enhance the model's robustness and mitigate this vulnerability, many require heavy consumption resources (e.g., adversarial training) or only provide limited protection (e.g., defensive dropout). In this paper, we propose a novel method called dynamic attention, tailored for the transformer architecture, to enhance the inherent robustness of the model itself against various adversarial attacks. Our method requires no downstream task knowledge and does not incur additional costs. The proposed dynamic attention consists of two modules: (I) attention rectification, which masks or weakens the attention value of the chosen tokens, and (ii) dynamic modeling, which dynamically builds the set of candidate tokens. Extensive experiments demonstrate that dynamic attention significantly mitigates the impact of adversarial attacks, improving up to 33\% better performance than previous methods against widely-used adversarial attacks. The model-level design of dynamic attention enables it to be easily combined with other defense methods (e.g., adversarial training) to further enhance the model's robustness. Furthermore, we demonstrate that dynamic attention preserves the state-of-the-art robustness space of the original model compared to other dynamic modeling methods.
    摘要 基于 transformer 的模型(如 BERT 和 GPT)凭借出色的性能在自然语言处理(NLP)中被广泛采用。然而,近期研究表明,它们容易受到文本对抗攻击:通过有意操纵输入文本即可误导模型输出。尽管已有多种增强模型鲁棒性、缓解这一脆弱性的方法被提出,但其中许多需要消耗大量资源(如对抗训练),或只能提供有限的保护(如防御性 dropout)。本文提出一种名为动态注意力、专为 transformer 架构设计的新方法,以增强模型自身对各类对抗攻击的固有鲁棒性。该方法无需下游任务知识,也不产生额外开销。动态注意力包含两个模块:(i)注意力修正,对选定 token 的注意力值进行屏蔽或弱化;(ii)动态建模,动态构建候选 token 集合。大量实验表明,动态注意力能显著减轻对抗攻击的影响,对广泛使用的对抗攻击的防御性能比以往方法最多高出 33%。动态注意力的模型级设计使其可以方便地与其他防御方法(如对抗训练)结合,进一步增强模型鲁棒性。此外,与其他动态建模方法相比,动态注意力保持了原始模型最先进的鲁棒性空间。
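“注意力修正”一步可以用如下示意代码理解:把选定 token 的注意力权重屏蔽或弱化后再重新归一化。这只是对思路的抽象草图;候选 token 的选取(动态建模)在论文中由模型动态完成,此处以外部给定的 token_ids 代替。

```python
import torch

def rectify_attention(attn, token_ids, weaken=0.0):
    """attn: [batch, heads, q_len, k_len] 的注意力权重;
    token_ids: 需要屏蔽/弱化的 key 位置索引列表。"""
    attn = attn.clone()
    attn[..., token_ids] *= weaken          # weaken=0 即完全屏蔽
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-12)  # 重新归一化
    return attn

attn = torch.softmax(torch.randn(1, 2, 4, 4), dim=-1)
print(rectify_attention(attn, token_ids=[1, 3], weaken=0.1).sum(-1))  # 每行仍为 1
```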

Unveiling the Implicit Toxicity in Large Language Models

  • paper_url: http://arxiv.org/abs/2311.17391
  • repo_url: https://github.com/thu-coai/implicit-toxicity
  • paper_authors: Jiaxin Wen, Pei Ke, Hao Sun, Zhexin Zhang, Chengfei Li, Jinfeng Bai, Minlie Huang
  • for: 研究各种恶意利用大语言模型(LLM)的安全问题。
  • methods: 提出了一种基于强化学习(RL)的攻击方法,以偏好隐式有毒输出的奖励优化语言模型,进一步诱导 LLM 生成难以检测的隐式有毒内容。
  • results: 实验表明,RL fine-tuning可以在五种常见的危险分类器上提高攻击成功率,例如LLaMA-13B模型在BAD和Davinci003上的攻击成功率分别为90.04%和62.85%。
    Abstract The open-endedness of large language models (LLMs) combined with their impressive capabilities may lead to new safety issues when being exploited for malicious use. While recent studies primarily focus on probing toxic outputs that can be easily detected with existing toxicity classifiers, we show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simply zero-shot prompting. Moreover, we propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs. Specifically, we optimize the language model with a reward that prefers implicit toxic outputs to explicit toxic and non-toxic ones. Experiments on five widely-adopted toxicity classifiers demonstrate that the attack success rate can be significantly improved through RL fine-tuning. For instance, the RL-finetuned LLaMA-13B model achieves an attack success rate of 90.04% on BAD and 62.85% on Davinci003. Our findings suggest that LLMs pose a significant threat in generating undetectable implicit toxic outputs. We further show that fine-tuning toxicity classifiers on the annotated examples from our attacking method can effectively enhance their ability to detect LLM-generated implicit toxic language. The code is publicly available at https://github.com/thu-coai/Implicit-Toxicity.
    摘要 大语言模型(LLM)的开放性与强大能力相结合,在被恶意利用时可能引发新的安全问题。近期研究主要关注能被现有毒性分类器轻易检测的有毒输出,而我们发现,LLM 能够生成多样的隐式有毒输出,仅靠零样本提示极难检测。此外,我们提出一种基于强化学习(RL)的攻击方法,进一步诱导 LLM 的隐式毒性:以偏好隐式有毒输出(相对显式有毒与无毒输出)的奖励来优化语言模型。在五种广泛使用的毒性分类器上的实验表明,RL 微调能显著提高攻击成功率,例如 RL 微调后的 LLaMA-13B 模型对 BAD 的攻击成功率达 90.04%,对 Davinci003 达 62.85%。我们的发现表明,LLM 在生成难以检测的隐式有毒输出方面构成显著威胁。我们进一步证明,用该攻击方法标注的样本微调毒性分类器,可以有效增强其检测 LLM 生成的隐式有毒语言的能力。代码公开于 https://github.com/thu-coai/Implicit-Toxicity。

CESAR: Automatic Induction of Compositional Instructions for Multi-turn Dialogs

  • paper_url: http://arxiv.org/abs/2311.17376
  • repo_url: None
  • paper_authors: Taha Aksu, Devamanyu Hazarika, Shikib Mehri, Seokhwan Kim, Dilek Hakkani-Tür, Yang Liu, Mahdi Namazifar
  • for: 本文旨在提升大型语言模型(LLM)在多轮对话应用中的表现,解决 LLM 在复杂指令下表现不佳的问题。
  • methods: 本文提出名为 CESAR 的新框架,能以统一格式整合大量对话任务,并在无需人工干预的情况下以程序化方式自动归纳复合指令。
  • results: 通过在 InstructDial 基准上的实验,本文展示了 CESAR 的可扩展性,训练出的模型能够遵循组合式提示,例如包含多种风格约束的请求。
    Abstract Instruction-based multitasking has played a critical role in the success of large language models (LLMs) in multi-turn dialog applications. While publicly available LLMs have shown promising performance, when exposed to complex instructions with multiple constraints, they lag against state-of-the-art models like ChatGPT. In this work, we hypothesize that the availability of large-scale complex demonstrations is crucial in bridging this gap. Focusing on dialog applications, we propose a novel framework, CESAR, that unifies a large number of dialog tasks in the same format and allows programmatic induction of complex instructions without any manual effort. We apply CESAR on InstructDial, a benchmark for instruction-based dialog tasks. We further enhance InstructDial with new datasets and tasks and utilize CESAR to induce complex tasks with compositional instructions. This results in a new benchmark called InstructDial++, which includes 63 datasets with 86 basic tasks and 68 composite tasks. Through rigorous experiments, we demonstrate the scalability of CESAR in providing rich instructions. Models trained on InstructDial++ can follow compositional prompts, such as prompts that ask for multiple stylistic constraints.
    摘要 基于指令的多任务学习在大型语言模型(LLM)于多轮对话应用中的成功里扮演了关键角色。公开可用的 LLM 已展现出可观的表现,但在面对带有多重约束的复杂指令时,仍落后于 ChatGPT 等最先进模型。在这项工作中,我们假设大规模复杂示例的可得性是弥合这一差距的关键。聚焦对话应用,我们提出了一个新框架 CESAR,它能以统一格式整合大量对话任务,并在无需人工干预的情况下以程序化方式归纳复合指令。我们将 CESAR 应用于基于指令的对话任务基准 InstructDial,并为其补充了新的数据集和任务,利用 CESAR 归纳带有组合式指令的复合任务。由此得到新基准 InstructDial++,包含 63 个数据集、86 个基础任务和 68 个复合任务。通过严格的实验,我们证明了 CESAR 在提供丰富指令方面的可扩展性:在 InstructDial++ 上训练的模型能够遵循组合式提示,例如要求满足多种风格约束的提示。

Are Large Language Models Good Fact Checkers: A Preliminary Study

  • paper_url: http://arxiv.org/abs/2311.17355
  • repo_url: None
  • paper_authors: Han Cao, Lingwei Wei, Mengyang Chen, Wei Zhou, Songlin Hu
  • for: 这研究旨在评估大语言模型(LLMs)在事实核查中的潜力,以及其在不同事实核查子任务中的表现。
  • methods: 本研究对多种大语言模型在具体事实核查子任务上进行系统评估,并与预训练模型及最先进的小参数模型进行比较分析。
  • results: 实验结果表明,LLM 在大多数场景下能取得与其他小型模型相当的竞争性表现,但在中文事实核查和完整的事实核查流水线上,会遇到语言不一致与幻觉等挑战。这些发现凸显了进一步探索和研究的必要性,以提升 LLM 作为事实核查工具的可靠性。
    Abstract Recently, Large Language Models (LLMs) have drawn significant attention due to their outstanding reasoning capabilities and extensive knowledge repository, positioning them as superior in handling various natural language processing tasks compared to other language models. In this paper, we present a preliminary investigation into the potential of LLMs in fact-checking. This study aims to comprehensively evaluate various LLMs in tackling specific fact-checking subtasks, systematically evaluating their capabilities, and conducting a comparative analysis of their performance against pre-trained and state-of-the-art low-parameter models. Experiments demonstrate that LLMs achieve competitive performance compared to other small models in most scenarios. However, they encounter challenges in effectively handling Chinese fact verification and the entirety of the fact-checking pipeline due to language inconsistencies and hallucinations. These findings underscore the need for further exploration and research to enhance the proficiency of LLMs as reliable fact-checkers, unveiling the potential capability of LLMs and the possible challenges in fact-checking tasks.
    摘要 近来,大语言模型(LLM)凭借出色的推理能力和广博的知识储备受到广泛关注,在处理各类自然语言处理任务上优于其他语言模型。本文对 LLM 在事实核查中的潜力进行了初步研究,旨在全面评估各种 LLM 在具体事实核查子任务上的能力,并与预训练模型及最先进的小参数模型进行系统的比较分析。实验表明,LLM 在大多数场景下能取得与其他小型模型相当的竞争性表现,但由于语言不一致和幻觉问题,它们在中文事实核查以及完整的事实核查流水线上仍有困难。这些发现凸显了进一步探索和研究的必要性,以提升 LLM 作为可靠事实核查工具的能力,同时揭示了 LLM 的潜在能力及事实核查任务中可能面临的挑战。

Efficient Stitchable Task Adaptation

  • paper_url: http://arxiv.org/abs/2311.17352
  • repo_url: None
  • paper_authors: Haoyu He, Zizheng Pan, Jing Liu, Jianfei Cai, Bohan Zhuang
  • for: 这篇论文提出一种能够快速生成大量新网络(称为“缝合”,stitch)并适应多种资源约束的模型缝合(model stitching)任务适应方法。
  • methods: 该方法采用参数高效微调,让各缝合网络共享低秩更新,同时保留各自独立的偏置项,从而大幅降低微调的内存负担,并缓解任务适应中各缝合之间的干扰。此外,该方法还提供了一条简单而有效的一阶段部署流水线,利用训练时的梯度统计估计需要部署的重要缝合。
  • results: 实验表明,该方法能快速生成具有平滑精度-效率权衡的缝合网络,并以显著更低的训练时间和更少的可训练参数,大幅超越直接适应 SN-Net 的效果。此外,该方法还能缝合 LLaMA 家族的语言模型,得到不同规模的聊天机器人缝合网络。
    Abstract The paradigm of pre-training and fine-tuning has laid the foundation for deploying deep learning models. However, most fine-tuning methods are designed to meet a specific resource budget. Recently, considering diverse deployment scenarios with various resource budgets, stitchable neural network (SN-Net) is introduced to quickly obtain numerous new networks (stitches) from the pre-trained models (anchors) in a model family via model stitching. Although promising, SN-Net confronts new challenges when adapting it to new target domains, including huge memory and storage requirements and a long and sub-optimal multistage adaptation process. In this work, we present a novel framework, Efficient Stitchable Task Adaptation (ESTA), to efficiently produce a palette of fine-tuned models that adhere to diverse resource constraints. Specifically, we first tailor parameter-efficient fine-tuning to share low-rank updates among the stitches while maintaining independent bias terms. In this way, we largely reduce fine-tuning memory burdens and mitigate the interference among stitches that arises in task adaptation. Furthermore, we streamline a simple yet effective one-stage deployment pipeline, which estimates the important stitches to deploy with training-time gradient statistics. By assigning higher sampling probabilities to important stitches, we also get a boosted Pareto frontier. Extensive experiments on 25 downstream visual recognition tasks demonstrate that our ESTA is capable of generating stitches with smooth accuracy-efficiency trade-offs and surpasses the direct SN-Net adaptation by remarkable margins with significantly lower training time and fewer trainable parameters. Furthermore, we demonstrate the flexibility and scalability of our ESTA framework by stitching LLMs from LLaMA family, obtaining chatbot stitches of assorted sizes.
    摘要 预训练加微调的范式为深度学习模型的部署奠定了基础。然而,大多数微调方法都只面向某一特定的资源预算。最近,考虑到资源预算各异的多样化部署场景,可缝合神经网络(SN-Net)被提出,用于通过模型缝合从同一模型家族的预训练模型(锚点)中快速得到大量新网络(缝合)。尽管前景可观,SN-Net 在适配新目标领域时面临新的挑战,包括巨大的内存与存储需求,以及漫长且次优的多阶段适应过程。在这项工作中,我们提出一个新框架,高效可缝合任务适应(ESTA),以高效地产出一组满足多样化资源约束的微调模型。具体来说,我们首先定制参数高效微调,让各缝合共享低秩更新,同时保留独立的偏置项,从而大幅降低微调的内存负担,并缓解任务适应中各缝合之间的干扰。此外,我们简化出一条简单而有效的一阶段部署流水线,利用训练时的梯度统计估计需要部署的重要缝合;通过为重要缝合分配更高的采样概率,我们还获得了更优的帕累托前沿。在 25 个下游视觉识别任务上的大量实验表明,ESTA 能生成具有平滑精度-效率权衡的缝合,并以显著更低的训练时间和更少的可训练参数大幅超越直接适应 SN-Net。此外,我们通过缝合 LLaMA 家族的 LLM 得到不同规模的聊天机器人缝合,展示了 ESTA 框架的灵活性与可扩展性。
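“共享低秩更新、保留独立偏置”的思想可以用下面的 LoRA 风格草图来理解:所有缝合共用低秩矩阵 A、B,每个缝合只保留自己的偏置向量。这是概念演示,并非 ESTA 的官方实现;SharedLoRALinear 及其超参数均为假设。

```python
import torch
import torch.nn as nn

class SharedLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, num_stitches: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)           # 冻结预训练权重
        # 所有缝合共享的低秩更新 A、B
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        # 每个缝合独立的偏置项
        self.biases = nn.Parameter(torch.zeros(num_stitches, base.out_features))

    def forward(self, x, stitch_id: int):
        return self.base(x) + x @ self.A.T @ self.B.T + self.biases[stitch_id]

layer = SharedLoRALinear(nn.Linear(16, 16), rank=4, num_stitches=3)
y = layer(torch.randn(2, 16), stitch_id=1)
print(y.shape)  # torch.Size([2, 16])
```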

Biomedical knowledge graph-enhanced prompt generation for large language models

  • paper_url: http://arxiv.org/abs/2311.17330
  • repo_url: None
  • paper_authors: Karthik Soman, Peter W Rose, John H Morris, Rabia E Akbas, Brett Smith, Braian Peetoom, Catalina Villouta-Reyes, Gabriel Cerono, Yongmei Shi, Angela Rizk-Jackson, Sharat Israni, Charlotte A Nelson, Sui Huang, Sergio E Baranzini
  • for: 本研究旨在推动AI技术在生物医学领域的进步,并解决现有的知识瓶颈。
  • methods: 本研究使用了知识图(KG)和大语言模型(LLM),其中KG提供了生物医学领域的知识,而LLM则提供了语言生成能力。研究者通过将KG和LLM结合在一起,实现了生成有意义的生物医学文本,并且能够考虑多种提问类型,包括一阶和二阶提问、药物重用查询、生物医学真假问题和多选题(MCQ)。
  • results: 研究结果表明,KG-RAG 框架能明显提升 LLM 的性能:在具有挑战性的 MCQ 数据集上,Llama-2 模型的性能提高了 71%;KG-RAG 同样能提升 GPT-3.5、GPT-4 等专有 GPT 模型的表现。此外,KG-RAG 还能回答药物重定位问题,返回有意义的药物重定位建议。
    Abstract Large Language Models (LLMs) have been driving progress in AI at an unprecedented rate, yet still face challenges in knowledge-intensive domains like biomedicine. Solutions such as pre-training and domain-specific fine-tuning add substantial computational overhead, and the latter require domain-expertise. External knowledge infusion is task-specific and requires model training. Here, we introduce a task-agnostic Knowledge Graph-based Retrieval Augmented Generation (KG-RAG) framework by leveraging the massive biomedical KG SPOKE with LLMs such as Llama-2-13b, GPT-3.5-Turbo and GPT-4, to generate meaningful biomedical text rooted in established knowledge. KG-RAG consistently enhanced the performance of LLMs across various prompt types, including one-hop and two-hop prompts, drug repurposing queries, biomedical true/false questions, and multiple-choice questions (MCQ). Notably, KG-RAG provides a remarkable 71% boost in the performance of the Llama-2 model on the challenging MCQ dataset, demonstrating the framework's capacity to empower open-source models with fewer parameters for domain-specific questions. Furthermore, KG-RAG enhanced the performance of proprietary GPT models, such as GPT-3.5 which exhibited improvement over GPT-4 in context utilization on MCQ data. Our approach was also able to address drug repurposing questions, returning meaningful repurposing suggestions. In summary, the proposed framework combines explicit and implicit knowledge of KG and LLM, respectively, in an optimized fashion, thus enhancing the adaptability of general-purpose LLMs to tackle domain-specific questions in a unified framework.
    摘要 大型语言模型(LLM)正以前所未有的速度推动人工智能进步,但在生物医学等知识密集领域仍面临挑战。预训练和领域微调等方案会带来大量计算开销,且后者还需要领域专业知识;外部知识注入则往往与具体任务绑定,且需要训练模型。在此,我们提出一个与任务无关的、基于知识图的检索增强生成框架(KG-RAG),将大规模生物医学知识图 SPOKE 与 Llama-2-13b、GPT-3.5-Turbo、GPT-4 等 LLM 结合,生成植根于既有知识的、有意义的生物医学文本。KG-RAG 在多种提示类型上持续提升了 LLM 的表现,包括一跳与两跳提示、药物重定位查询、生物医学判断题和多项选择题(MCQ)。值得注意的是,在具有挑战性的 MCQ 数据集上,KG-RAG 使 Llama-2 模型的性能提升了 71%,表明该框架能够让参数更少的开源模型胜任领域问题。KG-RAG 同样提升了专有 GPT 模型的表现,例如 GPT-3.5 在 MCQ 数据上的上下文利用优于 GPT-4。我们的方法还能够回答药物重定位问题,给出有意义的重定位建议。总之,该框架以优化的方式分别结合了知识图的显式知识与 LLM 的隐式知识,在统一框架下增强了通用 LLM 解决领域问题的适应能力。

cs.LG - 2023-11-29

Are ensembles getting better all the time?

  • paper_url: http://arxiv.org/abs/2311.17885
  • repo_url: https://github.com/sayantann11/all-classification-templetes-for-ML
  • paper_authors: Pierre-Alexandre Mattei, Damien Garreau
  • for: 本文研究向集成中加入更多模型是否总能提升其平均性能。
  • methods: 本文研究由若干基模型组成的集成方法,该问题取决于所考虑的集成类型以及所选的预测指标。
  • results: 研究结果表明,当且仅当损失函数为凸函数时,集成的平均损失才随模型数量单调下降;当损失函数非凸时,好模型的集成会持续变好,而坏模型的集成会持续变差。
    Abstract Ensemble methods combine the predictions of several base models. We study whether or not including more models in an ensemble always improve its average performance. Such a question depends on the kind of ensemble considered, as well as the predictive metric chosen. We focus on situations where all members of the ensemble are a priori expected to perform as well, which is the case of several popular methods like random forests or deep ensembles. In this setting, we essentially show that ensembles are getting better all the time if, and only if, the considered loss function is convex. More precisely, in that case, the average loss of the ensemble is a decreasing function of the number of models. When the loss function is nonconvex, we show a series of results that can be summarised by the insight that ensembles of good models keep getting better, and ensembles of bad models keep getting worse. To this end, we prove a new result on the monotonicity of tail probabilities that may be of independent interest. We illustrate our results on a simple machine learning problem (diagnosing melanomas using neural nets).
    摘要 集成方法将多个基模型的预测结果组合起来。我们研究向集成中加入更多模型是否总能提升其平均性能。这一问题取决于所考虑的集成类型以及所选的预测指标。我们重点关注集成中所有成员先验上被期望表现同样好的情形,随机森林和深度集成等多种流行方法都属于这种情况。在该设定下,我们本质上证明:当且仅当所考虑的损失函数为凸函数时,集成才会“一直变得更好”;更确切地说,此时集成的平均损失是模型数量的递减函数。当损失函数非凸时,我们给出了一系列结论,可以概括为:好模型的集成会持续变好,而坏模型的集成会持续变差。为此,我们证明了一个关于尾概率单调性的新结论,其本身可能具有独立的价值。我们在一个简单的机器学习问题(使用神经网络诊断黑色素瘤)上展示了这些结果。
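凸损失情形下“集成一直变好”的关键一步,可以用 Jensen 不等式作如下标准梳理(这是对文中论断的常见重构,完整的单调性证明以原文为准):

```latex
% 设 \hat{y}_1,\dots,\hat{y}_n 为各基模型对同一输入的预测,\bar{y}_n 为其平均。
% 若损失 \ell 为凸函数,则由 Jensen 不等式:
\[
  \ell\left(\bar{y}_n\right)
  = \ell\left(\frac{1}{n}\sum_{i=1}^{n} \hat{y}_i\right)
  \le \frac{1}{n}\sum_{i=1}^{n} \ell\left(\hat{y}_i\right).
\]
% 即平均后的集成损失不超过各模型损失的平均;当各基模型可交换
% (先验上同样好)时,可进一步得到 \mathbb{E}\,\ell(\bar{y}_n) 关于 n 单调不增。
% 非凸损失下该不等式不再成立,单调性也随之失效。
```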

SAIBench: A Structural Interpretation of AI for Science Through Benchmarks

  • paper_url: http://arxiv.org/abs/2311.17869
  • repo_url: None
  • paper_authors: Yatao Li, Jianfeng Zhan
  • for: This paper is written to address the challenges of deploying AI4S models in real-world applications by introducing a novel benchmarking approach called structural interpretation.
  • methods: The paper introduces a novel benchmarking approach called structural interpretation, which partitions the problem and metric spaces to facilitate a structural exploration of these spaces and identify the trusted operating range and trace errors.
  • results: The paper demonstrates the practical utility and effectiveness of structural interpretation through its application to three distinct AI4S workloads: machine-learning force fields (MLFF), jet tagging, and precipitation nowcasting. The benchmarks effectively model the trusted operating range, trace errors, and reveal novel perspectives for refining the model, training process, and data sampling strategy.
    Abstract Artificial Intelligence for Science (AI4S) is an emerging research field that utilizes machine learning advancements to tackle complex scientific computational issues, aiming to enhance computational efficiency and accuracy. However, the data-driven nature of AI4S lacks the correctness or accuracy assurances of conventional scientific computing, posing challenges when deploying AI4S models in real-world applications. To mitigate these, more comprehensive benchmarking procedures are needed to better understand AI4S models. This paper introduces a novel benchmarking approach, known as structural interpretation, which addresses two key requirements: identifying the trusted operating range in the problem space and tracing errors back to their computational components. This method partitions both the problem and metric spaces, facilitating a structural exploration of these spaces. The practical utility and effectiveness of structural interpretation are illustrated through its application to three distinct AI4S workloads: machine-learning force fields (MLFF), jet tagging, and precipitation nowcasting. The benchmarks effectively model the trusted operating range, trace errors, and reveal novel perspectives for refining the model, training process, and data sampling strategy. This work is part of the SAIBench project, an AI4S benchmarking suite.
    摘要 人工智能科学(AI4S)是一个新兴研究领域,利用机器学习的进展应对复杂的科学计算问题,旨在提升计算效率与精度。然而,AI4S 的数据驱动特性缺乏传统科学计算所具备的正确性或精度保证,这给 AI4S 模型在真实应用中的部署带来挑战。为缓解这些问题,需要更全面的基准测试流程,以更好地理解 AI4S 模型。本文提出一种新的基准测试方法,称为结构化解读(structural interpretation),它满足两个关键需求:确定问题空间中可信的工作范围,以及把误差追溯到其计算组件。该方法对问题空间和指标空间同时进行划分,便于对这两个空间做结构化探索。我们将其应用于三种不同的 AI4S 工作负载:机器学习力场(MLFF)、喷注标记(jet tagging)和降水临近预报,展示了结构化解读的实用性与有效性。这些基准有效刻画了可信工作范围、追溯了误差,并为改进模型、训练过程和数据采样策略提供了新视角。这项工作是 AI4S 基准测试套件 SAIBench 项目的一部分。

Leveraging Graph Diffusion Models for Network Refinement Tasks

  • paper_url: http://arxiv.org/abs/2311.17856
  • repo_url: None
  • paper_authors: Puja Trivedi, Ryan Rossi, David Arbour, Tong Yu, Franck Dernoncourt, Sungchul Kim, Nedim Lipka, Namyong Park, Nesreen K. Ahmed, Danai Koutra
  • for: 这篇论文旨在提升部分可观测网络的精度与可靠性。
  • methods: 该论文提出了一种基于子图扩散的图生成框架 SGDM,并利用反向过程执行新颖的条件生成任务。
  • results: 通过广泛的实验与一组新设计的指标,论文表明该模型能有效支持部分可观测网络的三类修正任务:T1 去除多余子图,T2 扩展现有子图,T3 重新生成特定子图使其匹配另一节点或子图的特征,实现“风格”迁移。
    Abstract Most real-world networks are noisy and incomplete samples from an unknown target distribution. Refining them by correcting corruptions or inferring unobserved regions typically improves downstream performance. Inspired by the impressive generative capabilities that have been used to correct corruptions in images, and the similarities between "in-painting" and filling in missing nodes and edges conditioned on the observed graph, we propose a novel graph generative framework, SGDM, which is based on subgraph diffusion. Our framework not only improves the scalability and fidelity of graph diffusion models, but also leverages the reverse process to perform novel, conditional generation tasks. In particular, through extensive empirical analysis and a set of novel metrics, we demonstrate that our proposed model effectively supports the following refinement tasks for partially observable networks: T1: denoising extraneous subgraphs, T2: expanding existing subgraphs and T3: performing "style" transfer by regenerating a particular subgraph to match the characteristics of a different node or subgraph.
    摘要 大多数现实网络都是来自未知目标分布的、含噪且不完整的样本。通过纠正损坏或推断未观测区域来精炼它们,通常能改善下游性能。受图像修复中令人印象深刻的生成能力,以及“图像补全”与在已观测图条件下补全缺失节点和边之间相似性的启发,我们提出了一种新的图生成框架 SGDM,它基于子图扩散。该框架不仅提升了图扩散模型的可扩展性与保真度,还利用反向过程执行新颖的条件生成任务。特别地,通过广泛的实证分析和一组新指标,我们证明所提模型能有效支持部分可观测网络的以下修正任务:T1 去除多余子图,T2 扩展现有子图,T3 通过重新生成特定子图使其匹配另一节点或子图的特征,实现“风格”迁移。

On the Adversarial Robustness of Graph Contrastive Learning Methods

  • paper_url: http://arxiv.org/abs/2311.17853
  • repo_url: None
  • paper_authors: Filippo Guerranti, Zinuo Yi, Anna Starovoit, Rafiq Kamel, Simon Geisler, Stephan Günnemann
  • for: 本研究旨在评估Graph Contrastive Learning(GCL)模型在针对图structure的攻击下的Robustness。
  • methods: 本研究使用针对图结构的自适应对抗攻击(规避场景),在节点分类和图分类任务上评估 GCL 模型的鲁棒性。
  • results: 研究发现,GCL 模型对图结构攻击的鲁棒性未必如其在图像和文本领域的对应方法;但在某些情形下,GCL 模型仍能表现出较高的鲁棒性。这些结果有助于研究人员更好地理解 GCL 模型的鲁棒性,并为未来研究提供新方向。
    Abstract Contrastive learning (CL) has emerged as a powerful framework for learning representations of images and text in a self-supervised manner while enhancing model robustness against adversarial attacks. More recently, researchers have extended the principles of contrastive learning to graph-structured data, giving birth to the field of graph contrastive learning (GCL). However, whether GCL methods can deliver the same advantages in adversarial robustness as their counterparts in the image and text domains remains an open question. In this paper, we introduce a comprehensive robustness evaluation protocol tailored to assess the robustness of GCL models. We subject these models to adaptive adversarial attacks targeting the graph structure, specifically in the evasion scenario. We evaluate node and graph classification tasks using diverse real-world datasets and attack strategies. With our work, we aim to offer insights into the robustness of GCL methods and hope to open avenues for potential future research directions.
    摘要 对比学习(CL)已成为以自监督方式学习图像与文本表示的强大框架,同时能增强模型对对抗攻击的鲁棒性。最近,研究者将对比学习的原则扩展到图结构数据,催生了图对比学习(GCL)领域。然而,GCL 方法能否像其在图像和文本领域的对应方法一样带来对抗鲁棒性方面的优势,仍是一个悬而未决的问题。在本文中,我们提出了一套专门用于评估 GCL 模型鲁棒性的综合评估协议。我们对这些模型施加针对图结构的自适应对抗攻击,特别是在规避场景下,并使用多种真实数据集和攻击策略评估节点分类与图分类任务。我们希望通过这项工作为 GCL 方法的鲁棒性提供洞见,并为未来潜在的研究方向开辟道路。

A quasi-polynomial time algorithm for Multi-Dimensional Scaling via LP hierarchies

  • paper_url: http://arxiv.org/abs/2311.17840
  • repo_url: None
  • paper_authors: Ainesh Bakshi, Vincent Cohen-Addad, Samuel B. Hopkins, Rajesh Jayaram, Silvio Lattanzi
  • for: 这篇论文旨在解决多维缩放(MDS)问题,即把对象间的两两相异度嵌入到低维空间中。
  • methods: 论文研究 MDS 的 Kamada-Kawai 形式化,提出了一种基于 Sherali-Adams 线性规划层级的条件化舍入方案的近似算法,并借助低维欧氏空间的几何性质进行分析,以避免对相异度纵横比的指数依赖。
  • results: 对目标维度 $k$,该算法可在 $n^{\mathcal{O}(1)} \cdot 2^{\tilde{\mathcal{O}}(k^2 (\log(\Delta)/\epsilon)^{k/2+1})}$ 时间内得到代价为 $\mathcal{O}(\text{OPT}^{1/k} \cdot \log(\Delta/\epsilon)) + \epsilon$ 的解,是首个对纵横比 $\Delta$ 仅有拟多项式依赖的 MDS 近似算法。
    Abstract Multi-dimensional Scaling (MDS) is a family of methods for embedding pair-wise dissimilarities between $n$ objects into low-dimensional space. MDS is widely used as a data visualization tool in the social and biological sciences, statistics, and machine learning. We study the Kamada-Kawai formulation of MDS: given a set of non-negative dissimilarities $\{d_{i,j}\}_{i, j \in [n]}$ over $n$ points, the goal is to find an embedding $\{x_1,\dots,x_n\} \subset \mathbb{R}^k$ that minimizes \[ \text{OPT} = \min_{x} \mathbb{E}_{i,j \in [n]} \left[ \left(1-\frac{\|x_i - x_j\|}{d_{i,j}}\right)^2 \right] \] Despite its popularity, our theoretical understanding of MDS is extremely limited. Recently, Demaine, Hesterberg, Koehler, Lynch, and Urschel (arXiv:2109.11505) gave the first approximation algorithm with provable guarantees for Kamada-Kawai, which achieves an embedding with cost $\text{OPT} + \epsilon$ in $n^2 \cdot 2^{\tilde{\mathcal{O}}(k \Delta^4 / \epsilon^2)}$ time, where $\Delta$ is the aspect ratio of the input dissimilarities. In this work, we give the first approximation algorithm for MDS with quasi-polynomial dependency on $\Delta$: for target dimension $k$, we achieve a solution with cost $\mathcal{O}(\text{OPT}^{1/k} \cdot \log(\Delta/\epsilon)) + \epsilon$ in time $n^{\mathcal{O}(1)} \cdot 2^{\tilde{\mathcal{O}}(k^2 (\log(\Delta)/\epsilon)^{k/2 + 1})}$. Our approach is based on a novel analysis of a conditioning-based rounding scheme for the Sherali-Adams LP Hierarchy. Crucially, our analysis exploits the geometry of low-dimensional Euclidean space, allowing us to avoid an exponential dependence on the aspect ratio $\Delta$. We believe our geometry-aware treatment of the Sherali-Adams Hierarchy is an important step towards developing general-purpose techniques for efficient metric optimization algorithms.
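Since the Kamada-Kawai objective above is an average of smooth per-pair terms, it is easy to evaluate and differentiate directly. Below is a minimal NumPy sketch, illustrative only and not the paper's LP-hierarchy algorithm, that evaluates the objective and runs plain gradient descent on an embedding; all function names are ours.

```python
import numpy as np

def kk_cost(X, D):
    """Average Kamada-Kawai stress E_{i,j} (1 - ||x_i - x_j|| / d_ij)^2, i != j."""
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += (1.0 - np.linalg.norm(X[i] - X[j]) / D[i, j]) ** 2
    return total / (n * (n - 1))

def kk_gradient_descent(D, k=2, steps=500, lr=0.05, seed=0):
    """Local search on the stress objective; a baseline, not the paper's method."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.normal(size=(n, k))
    for _ in range(steps):
        G = np.zeros_like(X)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                diff = X[i] - X[j]
                dist = np.linalg.norm(diff) + 1e-12
                # d/dx_i of (1 - dist/d_ij)^2 = -2 (1 - dist/d_ij) / d_ij * diff/dist
                G[i] += -2.0 * (1.0 - dist / D[i, j]) / D[i, j] * diff / dist
        X -= lr * G / (n * (n - 1))
    return X
```

Such first-order descent only finds local optima; the point of the paper is a rounding scheme with provable guarantees, which this sketch does not attempt.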

Towards Efficient Hyperdimensional Computing Using Photonics

  • paper_url: http://arxiv.org/abs/2311.17801
  • repo_url: None
  • paper_authors: Farbin Fayza, Cansu Demirkiran, Hanning Chen, Che-Kai Liu, Avi Mohan, Hamza Errahmouni, Sanggeon Yun, Mohsen Imani, David Zhang, Darius Bunandar, Ajay Joshi
  • for: This paper proposes an electro-photonic accelerator for hyperdimensional computing (HDC) training and inference, supporting the basic, record-based, and graph encoding schemes.
  • methods: It argues that photonic computing and HDC complement each other better than photonic computing and DNNs, or compute-in-memory (CiM) and HDC, and exploits this complementarity in the PhotoHDC design.
  • results: Compared with state-of-the-art electro-photonic DNN accelerators, PhotoHDC achieves two to five orders of magnitude lower EDP for HDC training and inference, and four orders of magnitude lower energy-delay product than CiM-based accelerators.
    Abstract Over the past few years, silicon photonics-based computing has emerged as a promising alternative to CMOS-based computing for Deep Neural Networks (DNN). Unfortunately, the non-linear operations and the high-precision requirements of DNNs make it extremely challenging to design efficient silicon photonics-based systems for DNN inference and training. Hyperdimensional Computing (HDC) is an emerging, brain-inspired machine learning technique that enjoys several advantages over existing DNNs, including being lightweight, requiring low-precision operands, and being robust to noise introduced by the nonidealities in the hardware. For HDC, computing in-memory (CiM) approaches have been widely used, as CiM reduces the data transfer cost if the operands can fit into the memory. However, inefficient multi-bit operations, high write latency, and low endurance make CiM ill-suited for HDC. On the other hand, the existing electro-photonic DNN accelerators are inefficient for HDC because they are specifically optimized for matrix multiplication in DNNs and consume a lot of power with high-precision data converters. In this paper, we argue that photonic computing and HDC complement each other better than photonic computing and DNNs, or CiM and HDC. We propose PhotoHDC, the first-ever electro-photonic accelerator for HDC training and inference, supporting the basic, record-based, and graph encoding schemes. Evaluating with popular datasets, we show that our accelerator can achieve two to five orders of magnitude lower EDP than the state-of-the-art electro-photonic DNN accelerators for implementing HDC training and inference. PhotoHDC also achieves four orders of magnitude lower energy-delay product than CiM-based accelerators for both HDC training and inference.

Learning to Simulate: Generative Metamodeling via Quantile Regression

  • paper_url: http://arxiv.org/abs/2311.17797
  • repo_url: None
  • paper_authors: L. Jeff Hong, Yanxi Hou, Qingkai Zhang, Xiaowei Zhang
  • for: This paper aims to provide a new metamodeling technique called generative metamodeling, which can be used for real-time decision-making in complex systems.
  • methods: The proposed method, called quantile-regression-based generative metamodeling (QRGMM), uses a new algorithm to construct a “fast simulator of the simulator” that can generate random outputs faster than the original simulation model while retaining an approximately equal conditional distribution.
  • results: The paper presents extensive numerical experiments to demonstrate the empirical performance of QRGMM and compare it with other state-of-the-art generative algorithms. The results show that QRGMM can generate random outputs substantially faster than the original simulation model and provide accurate summary statistics for real-time decision-making.
    Abstract Stochastic simulation models, while effective in capturing the dynamics of complex systems, are often too slow to run for real-time decision-making. Metamodeling techniques are widely used to learn the relationship between a summary statistic of the outputs (e.g., the mean or quantile) and the inputs of the simulator, so that it can be used in real time. However, this methodology requires the knowledge of an appropriate summary statistic in advance, making it inflexible for many practical situations. In this paper, we propose a new metamodeling concept, called generative metamodeling, which aims to construct a "fast simulator of the simulator". This technique can generate random outputs substantially faster than the original simulation model, while retaining an approximately equal conditional distribution given the same inputs. Once constructed, a generative metamodel can instantaneously generate a large amount of random outputs as soon as the inputs are specified, thereby facilitating the immediate computation of any summary statistic for real-time decision-making. Furthermore, we propose a new algorithm -- quantile-regression-based generative metamodeling (QRGMM) -- and study its convergence and rate of convergence. Extensive numerical experiments are conducted to investigate the empirical performance of QRGMM, compare it with other state-of-the-art generative algorithms, and demonstrate its usefulness in practical real-time decision-making.
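The core mechanic of QRGMM — fit conditional quantile curves on a grid of levels, then generate outputs by drawing a uniform level and inverting the estimated quantile function — can be sketched with off-the-shelf quantile regressors. The sketch below uses scikit-learn's gradient boosting with the quantile loss as a stand-in regressor; the class name, grid, and interpolation details are our own simplifications, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

class QuantileGenerativeMetamodel:
    """Fit q_tau(x) on a grid of levels tau, then sample via inverse-CDF interpolation."""

    def __init__(self, taus=np.linspace(0.05, 0.95, 19)):
        self.taus = taus
        self.models = []

    def fit(self, X, y):
        self.models = [
            GradientBoostingRegressor(loss="quantile", alpha=t).fit(X, y)
            for t in self.taus
        ]
        return self

    def sample(self, x, n_samples=1000, rng=None):
        rng = rng or np.random.default_rng()
        # Evaluate the fitted conditional quantile curve at this input.
        q = np.array([m.predict(x.reshape(1, -1))[0] for m in self.models])
        q = np.sort(q)  # enforce monotonicity across quantile levels
        u = rng.uniform(self.taus[0], self.taus[-1], size=n_samples)
        return np.interp(u, self.taus, q)  # inverse-CDF sampling
```

Once fitted, any summary statistic (mean, tail quantile, probability of exceedance) can be computed on `gm.sample(x_new)` the moment the inputs are specified, which is exactly the real-time use the abstract describes.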

Marginal Laplacian Score

  • paper_url: http://arxiv.org/abs/2311.17795
  • repo_url: None
  • paper_authors: Guy Hay, Ohad Volk
  • for: The paper is written for those who work with high-dimensional imbalanced data and need effective unsupervised feature selection methods to handle such data.
  • methods: The paper proposes a modification of the Laplacian Score (LS) called the Marginal Laplacian Score (MLS) to better handle imbalanced data. The MLS algorithm is integrated into the Differentiable Unsupervised Feature Selection (DUFS) method to create DUFS-MLS.
  • results: The proposed methods demonstrate robust and improved performance on synthetic and public data sets.
    Abstract High-dimensional imbalanced data poses a machine learning challenge. In the absence of sufficient or high-quality labels, unsupervised feature selection methods are crucial for the success of subsequent algorithms. Therefore, there is a growing need for unsupervised feature selection algorithms focused on imbalanced data. Thus, we propose a Marginal Laplacian Score (MLS) a modification of the well-known Laplacian Score (LS) to be better suited for imbalance data. We introduce an assumption that the minority class or anomalous appear more frequently in the margin of the features. Consequently, MLS aims to preserve the local structure of the data set's margin. As MLS is better suited for handling imbalanced data, we propose its integration into modern feature selection methods that utilize the Laplacian score. We integrate the MLS algorithm into the Differentiable Unsupervised Feature Selection (DUFS), resulting in DUFS-MLS. The proposed methods demonstrate robust and improved performance on synthetic and public data sets.
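For reference, the classical Laplacian Score that MLS modifies can be computed in a few lines: build a kNN affinity graph, then score each feature by how well it preserves local structure (lower is better). The optional `sample_weight` hook below, meant to emphasize points near the margin of the feature distribution, is our paraphrase of the MLS idea, not the authors' exact formulation.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_score(X, n_neighbors=5, t=1.0, sample_weight=None):
    """Classical Laplacian Score (He et al., 2005); lower = more locality-preserving.
    `sample_weight` lets a caller emphasize margin points, in the spirit of MLS."""
    W = kneighbors_graph(X, n_neighbors, mode="distance").toarray()
    W = np.where(W > 0, np.exp(-W**2 / t), 0.0)
    W = np.maximum(W, W.T)                       # symmetrize the affinity graph
    if sample_weight is not None:                # crude margin emphasis (assumption)
        W = W * np.outer(sample_weight, sample_weight)
    d = W.sum(axis=1)
    D = np.diag(d)
    L = D - W
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f = f - (f @ d) / d.sum()                # remove the degree-weighted mean
        scores.append((f @ L @ f) / max(f @ D @ f, 1e-12))
    return np.array(scores)
```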

Unified Binary and Multiclass Margin-Based Classification

  • paper_url: http://arxiv.org/abs/2311.17778
  • repo_url: None
  • paper_authors: Yutong Wang, Clayton Scott
  • for: This paper studies loss functions for multiclass classification, in particular the classification-calibration (consistency) of multiclass losses.
  • methods: It expresses a broad range of multiclass losses in relative margin form, a generalization of the margin form of binary losses.
  • results: Using the relative margin form, the paper extends the classification-calibration result of Bartlett et al. (2006) from binary margin losses to the multiclass setting, and expands the set of Fenchel-Young losses known to be classification-calibrated.
    Abstract The notion of margin loss has been central to the development and analysis of algorithms for binary classification. To date, however, there remains no consensus as to the analogue of the margin loss for multiclass classification. In this work, we show that a broad range of multiclass loss functions, including many popular ones, can be expressed in the relative margin form, a generalization of the margin form of binary losses. The relative margin form is broadly useful for understanding and analyzing multiclass losses as shown by our prior work (Wang and Scott, 2020, 2021). To further demonstrate the utility of this way of expressing multiclass losses, we use it to extend the seminal result of Bartlett et al. (2006) on classification-calibration of binary margin losses to multiclass. We then analyze the class of Fenchel-Young losses, and expand the set of these losses that are known to be classification-calibrated.

A transductive few-shot learning approach for classification of digital histopathological slides from liver cancer

  • paper_url: http://arxiv.org/abs/2311.17740
  • repo_url: None
  • paper_authors: Aymen Sadraoui, Ségolène Martin, Eliott Barbot, Astrid Laurent-Bellue, Jean-Christophe Pesquet, Catherine Guettier, Ismail Ben Ayed
  • for: This paper develops a few-shot learning method for classifying 2D histopathology patches, addressing the limited availability of labeled data in histopathology.
  • methods: A sliding-window technique is applied to histopathology slides, and transductive learning (joint prediction over the patches) is used to obtain consistent and accurate classification, with an optimization-based strategy that penalizes predicting too many distinct classes within each window.
  • results: Experiments on a liver cancer dataset, specifically hepatocellular carcinoma, show the effectiveness of the method and its potential to enhance automated diagnosis and treatment while reducing the time and effort required for expert annotation.
    Abstract This paper presents a new approach for classifying 2D histopathology patches using few-shot learning. The method is designed to tackle a significant challenge in histopathology, which is the limited availability of labeled data. By applying a sliding window technique to histopathology slides, we illustrate the practical benefits of transductive learning (i.e., making joint predictions on patches) to achieve consistent and accurate classification. Our approach involves an optimization-based strategy that actively penalizes the prediction of a large number of distinct classes within each window. We conducted experiments on histopathological data to classify tissue classes in digital slides of liver cancer, specifically hepatocellular carcinoma. The initial results show the effectiveness of our method and its potential to enhance the process of automated cancer diagnosis and treatment, all while reducing the time and effort required for expert annotation.

A novel feature selection method based on quantum support vector machine

  • paper_url: http://arxiv.org/abs/2311.17646
  • repo_url: None
  • paper_authors: Haiyan Wang
  • for: This paper proposes a new feature selection method to reduce dimensionality and simplify classification on modern datasets.
  • methods: The proposed quantum support vector machine feature selection (QSVMF) combines quantum support vector machines with a multi-objective genetic algorithm, simultaneously maximizing classification accuracy while minimizing the number of selected features, quantum circuit costs, and feature covariance.
  • results: Experimental results show that QSVMF achieves superior performance compared to classical approaches with the selected features. Its Pareto front solutions enable analysis of accuracy-versus-feature-set-size trade-offs, identifying extremely sparse yet accurate feature subsets, and the selected features are biologically relevant in terms of known breast cancer biomarkers.
    Abstract Feature selection is critical in machine learning to reduce dimensionality and improve model accuracy and efficiency. The exponential growth in feature space dimensionality for modern datasets directly results in ambiguous samples and redundant features, which can severely degrade classification accuracy. Quantum machine learning offers potential advantages for addressing this challenge. In this paper, we propose a novel method, quantum support vector machine feature selection (QSVMF), integrating quantum support vector machines with multi-objective genetic algorithm. QSVMF optimizes multiple simultaneous objectives: maximizing classification accuracy, minimizing selected features and quantum circuit costs, and reducing feature covariance. We apply QSVMF for feature selection on a breast cancer dataset, comparing the performance of QSVMF against classical approaches with the selected features. Experimental results show that QSVMF achieves superior performance. Furthermore, The Pareto front solutions of QSVMF enable analysis of accuracy versus feature set size trade-offs, identifying extremely sparse yet accurate feature subsets. We contextualize the biological relevance of the selected features in terms of known breast cancer biomarkers. This work highlights the potential of quantum-based feature selection to enhance machine learning efficiency and performance on complex real-world data.
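QSVMF's outer loop is a multi-objective genetic search over feature-subset bitmasks. A classical stand-in is sketched below: each mask is scored on (accuracy, subset size, feature co-variation) and the Pareto-nondominated masks are kept. A quantum-kernel SVM would replace the classical SVC, and we omit the quantum-circuit-cost objective since it depends on the encoding; all names here are ours.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def objectives(mask, X, y):
    """Score one feature-subset bitmask:
    (maximize CV accuracy, minimize subset size, minimize feature co-variation)."""
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return 0.0, mask.size, 1.0
    acc = cross_val_score(SVC(kernel="rbf"), X[:, idx], y, cv=3).mean()
    if idx.size > 1:
        C = np.abs(np.corrcoef(X[:, idx], rowvar=False))
        cov_pen = (C.sum() - idx.size) / (idx.size * (idx.size - 1))
    else:
        cov_pen = 0.0
    return acc, idx.size, cov_pen

def is_dominated(a, b):
    """True if objective tuple b dominates a."""
    no_worse = b[0] >= a[0] and b[1] <= a[1] and b[2] <= a[2]
    better = b[0] > a[0] or b[1] < a[1] or b[2] < a[2]
    return no_worse and better

def pareto_front(masks, X, y):
    scored = [(m, objectives(m, X, y)) for m in masks]
    return [m for m, s in scored if not any(is_dominated(s, t) for _, t in scored)]
```

A full multi-objective GA would alternate mutation/crossover of the masks with this nondominated selection; the Pareto front it returns is what supports the accuracy-versus-sparsity analysis the abstract mentions.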

Q-learning Based Optimal False Data Injection Attack on Probabilistic Boolean Control Networks

  • paper_url: http://arxiv.org/abs/2311.17631
  • repo_url: None
  • paper_authors: Xianlun Peng, Yang Tang, Fangfei Li, Yang Liu
  • for: Solving the optimal false data injection attack problem on probabilistic Boolean control networks (PBCNs) when the attacker lacks knowledge of the system model.
  • methods: A Q-learning (QL) algorithm is used, together with an improved QL algorithm that raises learning efficiency and obtains optimal attack strategies for large-scale PBCNs that the standard QL algorithm cannot handle.
  • results: Effectiveness is verified on two attacked PBCNs, a 10-node network and a 28-node network.
    Abstract In this paper, we present a reinforcement learning (RL) method for solving optimal false data injection attack problems in probabilistic Boolean control networks (PBCNs) where the attacker lacks knowledge of the system model. Specifically, we employ a Q-learning (QL) algorithm to address this problem. We then propose an improved QL algorithm that not only enhances learning efficiency but also obtains optimal attack strategies for large-scale PBCNs that the standard QL algorithm cannot handle. Finally, we verify the effectiveness of our proposed approach by considering two attacked PBCNs, including a 10-node network and a 28-node network.
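The attack is learned model-free: states are the PBCN's Boolean state vectors, actions are which signals to falsify, and the reward encodes attack success. A minimal tabular Q-learning loop against a generic black-box `env` with `reset()`/`step(a)` is sketched below; the interface is our own assumption, and the paper's improved QL adds refinements for large-scale networks that we do not reproduce.

```python
import random
from collections import defaultdict

def q_learning_attack(env, n_actions, episodes=5000,
                      alpha=0.1, gamma=0.95, eps=0.1):
    """Model-free Q-learning of a false-data-injection policy."""
    Q = defaultdict(lambda: [0.0] * n_actions)  # state (hashable) -> action values
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy exploration over injection actions.
            a = (random.randrange(n_actions) if random.random() < eps
                 else max(range(n_actions), key=lambda i: Q[s][i]))
            s2, r, done = env.step(a)           # r rewards a successful injection
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    # Extract the greedy attack policy.
    return {s: max(range(n_actions), key=lambda i: q[i]) for s, q in Q.items()}
```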

Federated Online and Bandit Convex Optimization

  • paper_url: http://arxiv.org/abs/2311.17586
  • repo_url: None
  • paper_authors: Kumar Kshitij Patel, Lingxiao Wang, Aadirupa Saha, Nati Srebro
  • for: This work studies distributed online and bandit convex optimization against an adaptive adversary, aiming to minimize the average regret of $M$ machines working in parallel over $T$ rounds with only $R$ intermittent communications.
  • methods: The analysis shows that collaboration is not beneficial when the machines have access to first-order gradient information at the queried points, in contrast to the stochastic setting where each machine samples cost functions from a fixed distribution. For the harder federated online optimization problem with bandit (zeroth-order) feedback, the paper identifies a high-dimensional regime where collaboration is beneficial and may even yield a linear speedup in the number of machines, and develops novel distributed single- and two-point feedback algorithms.
  • results: The paper attains tight regret bounds in the intermittent communication setting for both first- and zeroth-order feedback, bridging the gap between the stochastic and adaptive settings in federated online optimization.
    Abstract We study the problems of distributed online and bandit convex optimization against an adaptive adversary. We aim to minimize the average regret on $M$ machines working in parallel over $T$ rounds with $R$ intermittent communications. Assuming the underlying cost functions are convex and can be generated adaptively, our results show that collaboration is not beneficial when the machines have access to the first-order gradient information at the queried points. This is in contrast to the case for stochastic functions, where each machine samples the cost functions from a fixed distribution. Furthermore, we delve into the more challenging setting of federated online optimization with bandit (zeroth-order) feedback, where the machines can only access values of the cost functions at the queried points. The key finding here is identifying the high-dimensional regime where collaboration is beneficial and may even lead to a linear speedup in the number of machines. We further illustrate our findings through federated adversarial linear bandits by developing novel distributed single and two-point feedback algorithms. Our work is the first attempt towards a systematic understanding of federated online optimization with limited feedback, and it attains tight regret bounds in the intermittent communication setting for both first and zeroth-order feedback. Our results thus bridge the gap between stochastic and adaptive settings in federated online optimization.
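In the intermittent-communication model studied here, each of the $M$ machines runs online gradient descent locally and iterates are averaged only at the $R$ communication rounds. The schematic below is our own minimal rendering of that pattern for first-order feedback, not the paper's algorithms; `grad` is a black-box adversarial subgradient oracle we assume.

```python
import numpy as np

def intermittent_ogd(grad, d, M=4, T=1000, R=10, lr=0.1):
    """M machines, T rounds, R synchronizations; grad(m, t, w) returns machine m's
    adversarially chosen subgradient at round t (black-box assumption)."""
    W = np.zeros((M, d))                 # one local iterate per machine
    sync_every = max(T // R, 1)
    for t in range(T):
        for m in range(M):
            W[m] -= lr * grad(m, t, W[m])
        if (t + 1) % sync_every == 0:    # intermittent communication round
            W[:] = W.mean(axis=0)        # average iterates across machines
    return W.mean(axis=0)
```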

LoCoMotif: Discovering time-warped motifs in time series

  • paper_url: http://arxiv.org/abs/2311.17582
  • repo_url: None
  • paper_authors: Daan Van Wesenbeeck, Aras Yurtman, Wannes Meert, Hendrik Blockeel
  • for: This paper proposes LoCoMotif, a new method for time series motif discovery (TSMD) that resolves the limitations of existing methods.
  • methods: Unlike existing methods, LoCoMotif is not restricted to the two most similar occurrences of a pattern or to a pre-specified fixed length, handles variability along the time axis (time warping), and supports multivariate time series.
  • results: The method is validated on a concrete physiotherapy use case; the paper also introduces a new quantitative evaluation metric for motif discovery and benchmark data for comparing TSMD methods, on which LoCoMotif substantially outperforms existing methods.
    Abstract Time Series Motif Discovery (TSMD) refers to the task of identifying patterns that occur multiple times (possibly with minor variations) in a time series. All existing methods for TSMD have one or more of the following limitations: they only look for the two most similar occurrences of a pattern; they only look for patterns of a pre-specified, fixed length; they cannot handle variability along the time axis; and they only handle univariate time series. In this paper, we present a new method, LoCoMotif, that has none of these limitations. The method is motivated by a concrete use case from physiotherapy. We demonstrate the value of the proposed method on this use case. We also introduce a new quantitative evaluation metric for motif discovery, and benchmark data for comparing TSMD methods. LoCoMotif substantially outperforms the existing methods, on top of being more broadly applicable.

Interpreting Differentiable Latent States for Healthcare Time-series Data

  • paper_url: http://arxiv.org/abs/2311.17560
  • repo_url: None
  • paper_authors: Yu Chen, Nivedita Bijlani, Samaneh Kouchaki, Payam Barnaghi
  • for: This paper proposes an interpretability algorithm to support the deployment of advanced machine learning models in digital healthcare.
  • methods: For any differentiable model, the algorithm interprets latent states using highly related input features, interprets predictions using subsets of input features via latent states, and interprets changes in latent states over time.
  • results: On a real-world healthcare dataset, the approach identifies a daytime behavioral pattern for predicting nocturnal behavior, demonstrating its interpretability and practical value.
    Abstract Machine learning enables extracting clinical insights from large temporal datasets. The applications of such machine learning models include identifying disease patterns and predicting patient outcomes. However, limited interpretability poses challenges for deploying advanced machine learning in digital healthcare. Understanding the meaning of latent states is crucial for interpreting machine learning models, assuming they capture underlying patterns. In this paper, we present a concise algorithm that allows for i) interpreting latent states using highly related input features; ii) interpreting predictions using subsets of input features via latent states; and iii) interpreting changes in latent states over time. The proposed algorithm is feasible for any model that is differentiable. We demonstrate that this approach enables the identification of a daytime behavioral pattern for predicting nocturnal behavior in a real-world healthcare dataset.

The Effects of Overparameterization on Sharpness-aware Minimization: An Empirical and Theoretical Analysis

  • paper_url: http://arxiv.org/abs/2311.17539
  • repo_url: None
  • paper_authors: Sungbin Shin, Dongyeop Lee, Maksym Andriushchenko, Namhoon Lee
  • for: This work investigates how sharpness-aware minimization (SAM) behaves under varying degrees of overparameterization.
  • methods: Standard optimization techniques and theoretical analysis are used to prove that SAM achieves a linear convergence rate under overparameterization in a stochastic setting.
  • results: Empirical and theoretical results show that the linearly stable minima found by SAM are flatter and have more uniformly distributed Hessian moments than those of SGD, that SAM's generalization improvement keeps increasing as the model becomes more overparameterized, and that sparsity can open an avenue for effective overparameterization in practice.
    Abstract Training an overparameterized neural network can yield minimizers of the same level of training loss and yet different generalization capabilities. With evidence that indicates a correlation between sharpness of minima and their generalization errors, increasing efforts have been made to develop an optimization method to explicitly find flat minima as more generalizable solutions. This sharpness-aware minimization (SAM) strategy, however, has not been studied much yet as to how overparameterization can actually affect its behavior. In this work, we analyze SAM under varying degrees of overparameterization and present both empirical and theoretical results that suggest a critical influence of overparameterization on SAM. Specifically, we first use standard techniques in optimization to prove that SAM can achieve a linear convergence rate under overparameterization in a stochastic setting. We also show that the linearly stable minima found by SAM are indeed flatter and have more uniformly distributed Hessian moments compared to those of SGD. These results are corroborated with our experiments that reveal a consistent trend that the generalization improvement made by SAM continues to increase as the model becomes more overparameterized. We further present that sparsity can open up an avenue for effective overparameterization in practice.
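SAM's update is simple to state: take an ascent step of radius $\rho$ along the normalized gradient to a worst-case nearby point, then apply the gradient computed there. A two-gradient-evaluation sketch in NumPy terms follows (framework-agnostic; real implementations do this per mini-batch, and `loss_grad` is an assumed black-box oracle):

```python
import numpy as np

def sam_step(w, loss_grad, lr=0.1, rho=0.05):
    """One sharpness-aware minimization step.
    loss_grad(w) returns the (stochastic) gradient of the training loss at w."""
    g = loss_grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascend to the sharpest nearby point
    g_sharp = loss_grad(w + eps)                 # gradient at the perturbed weights
    return w - lr * g_sharp
```

Here $\rho$ controls the neighborhood radius over which sharpness is probed; the paper's finding is that the benefit of this perturbed step grows as the model becomes more overparameterized.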

Model Performance Prediction for Hyperparameter Optimization of Deep Learning Models Using High Performance Computing and Quantum Annealing

  • paper_url: http://arxiv.org/abs/2311.17508
  • repo_url: None
  • paper_authors: Juan Pablo García Amboage, Eric Wulff, Maria Girone, Tomás F. Pena
  • for: This paper targets hyperparameter optimization (HPO) of deep learning models, where training the target model under many hyperparameter configurations wastes compute resources.
  • methods: It combines model performance prediction with early stopping to accelerate HPO, and proposes Swift-Hyperband, a new algorithm that can use either classical or quantum support vector regression for performance prediction and benefit from distributed high-performance computing environments.
  • results: Tested on the Machine-Learned Particle Flow model from high energy physics as well as a wider range of target models from computer vision and natural language processing, Swift-Hyperband finds comparable or better hyperparameters while using fewer computational resources in all test cases.
    Abstract Hyperparameter Optimization (HPO) of Deep Learning-based models tends to be a compute resource intensive process as it usually requires to train the target model with many different hyperparameter configurations. We show that integrating model performance prediction with early stopping methods holds great potential to speed up the HPO process of deep learning models. Moreover, we propose a novel algorithm called Swift-Hyperband that can use either classical or quantum support vector regression for performance prediction and benefit from distributed High Performance Computing environments. This algorithm is tested not only for the Machine-Learned Particle Flow model used in High Energy Physics, but also for a wider range of target models from domains such as computer vision and natural language processing. Swift-Hyperband is shown to find comparable (or better) hyperparameters as well as using less computational resources in all test cases.
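The key idea — train each configuration partially, predict its final validation score with support vector regression, and stop unpromising configurations early — can be sketched as follows, with scikit-learn's classical SVR standing in for the quantum variant. Function names, the history format, and the keep fraction are our assumptions, not the paper's interface.

```python
import numpy as np
from sklearn.svm import SVR

def early_stop_round(configs, train_partial, history, keep_frac=0.3):
    """Keep the configs whose predicted final score ranks in the top keep_frac.
    history: list of (partial_learning_curve, final_score) from past full runs."""
    X = np.array([curve for curve, _ in history])   # partial learning curves
    y = np.array([score for _, score in history])   # observed final scores
    predictor = SVR(kernel="rbf").fit(X, y)         # classical stand-in for QSVR
    curves = np.array([train_partial(c) for c in configs])
    predicted_final = predictor.predict(curves)
    n_keep = max(1, int(len(configs) * keep_frac))
    best = np.argsort(predicted_final)[-n_keep:]
    return [configs[i] for i in best]
```

Embedding such a round inside Hyperband's successive-halving brackets is, roughly, what turns Hyperband into Swift-Hyperband.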

Wireless Network Digital Twin for 6G: Generative AI as A Key Enabler

  • paper_url: http://arxiv.org/abs/2311.17451
  • repo_url: None
  • paper_authors: Zhenyu Tao, Wei Xu, Yongming Huang, Xiaoyun Wang, Xiaohu You
  • for: This paper focuses on the challenges of establishing digital twins for 6G wireless networks and explores the potential of using generative AI to overcome these challenges.
  • methods: The paper discusses the applications of transformer and diffusion models to empower the 6G digital twin, and proposes a hierarchical generative AI-enabled wireless network digital twin at both the message level and the policy level.
  • results: The paper provides a typical use case with numerical results to validate the effectiveness and efficiency of the proposed approach.
    Abstract Digital twin, which enables emulation, evaluation, and optimization of physical entities through synchronized digital replicas, has gained increasingly attention as a promising technology for intricate wireless networks. For 6G, numerous innovative wireless technologies and network architectures have posed new challenges in establishing wireless network digital twins. To tackle these challenges, artificial intelligence (AI), particularly the flourishing generative AI, emerges as a potential solution. In this article, we discuss emerging prerequisites for wireless network digital twins considering the complicated network architecture, tremendous network scale, extensive coverage, and diversified application scenarios in the 6G era. We further explore the applications of generative AI, such as transformer and diffusion model, to empower the 6G digital twin from multiple perspectives including implementation, physical-digital synchronization, and slicing capability. Subsequently, we propose a hierarchical generative AI-enabled wireless network digital twin at both the message-level and policy-level, and provide a typical use case with numerical results to validate the effectiveness and efficiency. Finally, open research issues for wireless network digital twins in the 6G era are discussed.

GNNFlow: A Distributed Framework for Continuous Temporal GNN Learning on Dynamic Graphs

  • paper_url: http://arxiv.org/abs/2311.17410
  • repo_url: https://github.com/jasperzhong/GNNFlow
  • paper_authors: Yuchen Zhong, Guangming Sheng, Tianzuo Qin, Minjie Wang, Quan Gan, Chuan Wu
  • for: Continuous temporal GNN learning on dynamic graphs.
  • methods: A distributed framework built on an adaptive time-indexed block-based data structure, hybrid GPU-CPU graph data placement with optimized temporal neighborhood sampling kernels, and a dynamic GPU cache for node and edge features to raise learning efficiency.
  • results: Compared with existing systems, GNNFlow provides up to 21.1x faster continuous learning.
    Abstract Graph Neural Networks (GNNs) play a crucial role in various fields. However, most existing deep graph learning frameworks assume pre-stored static graphs and do not support training on graph streams. In contrast, many real-world graphs are dynamic and contain time domain information. We introduce GNNFlow, a distributed framework that enables efficient continuous temporal graph representation learning on dynamic graphs on multi-GPU machines. GNNFlow introduces an adaptive time-indexed block-based data structure that effectively balances memory usage with graph update and sampling operation efficiency. It features a hybrid GPU-CPU graph data placement for rapid GPU-based temporal neighborhood sampling and kernel optimizations for enhanced sampling processes. A dynamic GPU cache for node and edge features is developed to maximize cache hit rates through reuse and restoration strategies. GNNFlow supports distributed training across multiple machines with static scheduling to ensure load balance. We implement GNNFlow based on DGL and PyTorch. Our experimental results show that GNNFlow provides up to 21.1x faster continuous learning than existing systems.

The Devil is in the Data: Learning Fair Graph Neural Networks via Partial Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2311.17373
  • repo_url: https://github.com/zzoomd/fairgkd
  • paper_authors: Yuchang Zhu, Jintang Li, Liang Chen, Zibin Zheng
  • for: Improving the fairness of graph neural networks (GNNs), which are used in many high-stakes tasks yet tend to make discriminatory decisions toward demographic groups defined by sensitive attributes such as gender and race.
  • methods: The paper proposes FairGKD, a method that learns fair GNNs via knowledge distillation without access to demographic information. Observing that training a GNN on partial data (only node attributes or only topology) improves fairness at a cost in utility, it constructs a synthetic teacher from a set of fairness experts (GNNs trained on different partial data) that distills fairer and informative knowledge to guide the learning of the GNN student.
  • results: On several benchmark datasets, FairGKD significantly improves the fairness of GNNs by a large margin while maintaining their utility.
    Abstract Graph neural networks (GNNs) are being increasingly used in many high-stakes tasks, and as a result, there is growing attention on their fairness recently. GNNs have been shown to be unfair as they tend to make discriminatory decisions toward certain demographic groups, divided by sensitive attributes such as gender and race. While recent works have been devoted to improving their fairness performance, they often require accessible demographic information. This greatly limits their applicability in real-world scenarios due to legal restrictions. To address this problem, we present a demographic-agnostic method to learn fair GNNs via knowledge distillation, namely FairGKD. Our work is motivated by the empirical observation that training GNNs on partial data (i.e., only node attributes or topology data) can improve their fairness, albeit at the cost of utility. To make a balanced trade-off between fairness and utility performance, we employ a set of fairness experts (i.e., GNNs trained on different partial data) to construct the synthetic teacher, which distills fairer and informative knowledge to guide the learning of the GNN student. Experiments on several benchmark datasets demonstrate that FairGKD, which does not require access to demographic information, significantly improves the fairness of GNNs by a large margin while maintaining their utility.
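The distillation step can be written down directly: the student matches the averaged (softened) predictions of the partial-data experts while also fitting the labels. A PyTorch-style sketch under our assumptions follows — simple averaging of expert logits and temperature-scaled KL distillation; the paper's exact teacher construction may differ.

```python
import torch
import torch.nn.functional as F

def fairgkd_loss(student_logits, labels, expert_logits_list, T=2.0, alpha=0.5):
    """Cross-entropy on labels + KL to a synthetic teacher built by averaging
    experts trained on partial data (attributes-only / topology-only GNNs)."""
    teacher = torch.stack(expert_logits_list).mean(dim=0)   # synthetic teacher
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher / T, dim=-1),
                  reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```

The weight `alpha` trades fairness (distilling from the experts) against utility (fitting the labels), which mirrors the fairness-utility trade-off the abstract describes.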

Quantum Adaptive Distribution Search (QuADS)

  • paper_url: http://arxiv.org/abs/2311.17353
  • repo_url: None
  • paper_authors: Kohei Morimoto, Yusuke Takase, Kosuke Mitarai, Keisuke Fujii
  • for: This paper explores the application of quantum computing to continuous optimization.
  • methods: It proposes the quantum adaptive distribution search (QuADS), a quantum continuous optimization algorithm that integrates Grover adaptive search (GAS) with the classical covariance matrix adaptation evolution strategy (CMA-ES), using a multivariate normal distribution as the initial state of the quantum search and repeatedly updating it throughout the optimization process.
  • results: Numerical experiments show that QuADS outperforms both GAS and CMA-ES: adaptively refining the initial state distribution, rather than consistently using a uniform state, reduces the number of oracle calls.
    Abstract In this paper, we introduce the quantum adaptive distribution search (QuADS), a quantum continuous optimization algorithm that integrates Grover adaptive search (GAS) with the covariance matrix adaptation - evolution strategy (CMA-ES), a classical technique for continuous optimization. QuADS utilizes the quantum-based search capabilities of GAS and enhances them with the principles of CMA-ES for more efficient optimization. It employs a multivariate normal distribution for the initial state of the quantum search and repeatedly updates it throughout the optimization process. Our numerical experiments show that QuADS outperforms both GAS and CMA-ES. This is achieved through adaptive refinement of the initial state distribution rather than consistently using a uniform state, resulting in fewer oracle calls. This study presents an important step toward exploiting the potential of quantum computing for continuous optimization.
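QuADS keeps a multivariate normal search distribution and re-fits it to improving samples, CMA-ES-style, instead of always launching Grover adaptive search from a uniform state. The classical skeleton below replaces the Grover oracle with rejection sampling below the current threshold purely to illustrate the distribution-update loop; this is an assumption on our part, since the quantum speedup lives inside that oracle.

```python
import numpy as np

def quads_skeleton(f, dim, iters=50, pop=20, seed=0):
    """Adaptive-distribution search: sample, keep improvers, re-fit N(mu, Sigma)."""
    rng = np.random.default_rng(seed)
    mu, Sigma = np.zeros(dim), np.eye(dim)
    threshold = np.inf
    for _ in range(iters):
        # Stand-in for Grover adaptive search seeded with N(mu, Sigma):
        # draw candidates and keep those strictly below the current threshold.
        cand = rng.multivariate_normal(mu, Sigma, size=pop)
        vals = np.apply_along_axis(f, 1, cand)
        improvers = cand[vals < threshold]
        if len(improvers) == 0:
            Sigma *= 1.5                        # widen the search if stuck
            continue
        threshold = vals.min()
        mu = improvers.mean(axis=0)             # CMA-ES-style re-fit (simplified)
        if len(improvers) > 1:
            Sigma = np.cov(improvers, rowvar=False) + 1e-6 * np.eye(dim)
    return mu, threshold
```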

Improving Self-supervised Molecular Representation Learning using Persistent Homology

  • paper_url: http://arxiv.org/abs/2311.17327
  • repo_url: https://github.com/luoyk1999/molecular-homology
  • paper_authors: Yuankai Luo, Lei Shi, Veronika Thost
  • for: This work studies self-supervised learning (SSL) for molecular representation learning, motivated by the complexity of molecular graphs, the large amounts of unlabeled data available, the considerable cost of obtaining labels experimentally, and the often small training sets.
  • methods: It uses persistent homology (PH), whose properties particularly suit SSL: persistence of data features across multiple scales, stability in terms of distance preservation, and the opportunity to flexibly incorporate domain knowledge. The paper (1) investigates an autoencoder that demonstrates the general representational power of PH, and (2) proposes a contrastive loss that complements existing approaches.
  • results: A rigorous evaluation on molecular property prediction shows that after SSL the representations are better and offer considerably more predictive power than the baselines over different probing tasks; the proposed loss increases baseline performance, sometimes by a large margin, with substantial improvements on very small datasets, a common scenario in practice.
    Abstract Self-supervised learning (SSL) has great potential for molecular representation learning given the complexity of molecular graphs, the large amounts of unlabelled data available, the considerable cost of obtaining labels experimentally, and the hence often only small training datasets. The importance of the topic is reflected in the variety of paradigms and architectures that have been investigated recently. Yet the differences in performance seem often minor and are barely understood to date. In this paper, we study SSL based on persistent homology (PH), a mathematical tool for modeling topological features of data that persist across multiple scales. It has several unique features which particularly suit SSL, naturally offering: different views of the data, stability in terms of distance preservation, and the opportunity to flexibly incorporate domain knowledge. We (1) investigate an autoencoder, which shows the general representational power of PH, and (2) propose a contrastive loss that complements existing approaches. We rigorously evaluate our approach for molecular property prediction and demonstrate its particular features in improving the embedding space: after SSL, the representations are better and offer considerably more predictive power than the baselines over different probing tasks; our loss increases baseline performance, sometimes largely; and we often obtain substantial improvements over very small datasets, a common scenario in practice.

Mostly Beneficial Clustering: Aggregating Data for Operational Decision Making

  • paper_url: http://arxiv.org/abs/2311.17326
  • repo_url: None
  • paper_authors: Chengzhang Li, Zhenkang Peng, Ying Rong
  • for: This paper aims to improve operational decision-making for large-scale systems that must solve thousands of problems with limited data.
  • methods: It proposes a cluster-based shrunken-SAA approach that exploits the cluster structure among problems when aggregating data. As the number of problems grows, leveraging a known cluster structure yields additional benefits over structure-agnostic aggregation; when the cluster structure is unknown, unveiling it, even at the cost of a few data points, can be beneficial, especially when the distance between clusters is substantial.
  • results: Numerical experiments on managing newsvendor systems investigate the impact of distance metrics between problem instances and validate the advantages of cluster-based data aggregation on real data, especially in the small-data, large-scale regime.
    Abstract With increasingly volatile market conditions and rapid product innovations, operational decision-making for large-scale systems entails solving thousands of problems with limited data. Data aggregation is proposed to combine the data across problems to improve the decisions obtained by solving those problems individually. We propose a novel cluster-based shrunken-SAA approach that can exploit the cluster structure among problems when implementing the data aggregation approaches. We prove that, as the number of problems grows, leveraging the known cluster structure among problems yields additional benefits over the data aggregation approaches that neglect such structure. When the cluster structure is unknown, we show that unveiling the cluster structure, even at the cost of a few data points, can be beneficial, especially when the distance between clusters of problems is substantial. Our proposed approach can be extended to general cost functions under mild conditions. When the number of problems gets large, the optimality gap of our proposed approach decreases exponentially in the distance between the clusters. We explore the performance of the proposed approach through the application of managing newsvendor systems via numerical experiments. We investigate the impacts of distance metrics between problem instances on the performance of the cluster-based Shrunken-SAA approach with synthetic data. We further validate our proposed approach with real data and highlight the advantages of cluster-based data aggregation, especially in the small-data large-scale regime, compared to the existing approaches.
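In the newsvendor application, each problem's decision is a demand quantile, and the cluster-based shrunken-SAA idea amounts to shrinking every problem's empirical distribution toward the pooled data of its cluster. A stylized rendering follows: we fix the shrinkage weight `lam`, whereas the paper optimizes it; the critical ratio `q` plays the newsvendor role, and all names are ours.

```python
import numpy as np

def cluster_shrunken_newsvendor(demands_by_problem, cluster_of, q=0.8, lam=0.5):
    """Order quantity per problem: empirical q-quantile of a mixture of its own
    demand sample (weight 1-lam) and its cluster's pooled sample (weight lam)."""
    pooled = {}
    for p, d in demands_by_problem.items():
        pooled.setdefault(cluster_of[p], []).extend(d)
    orders = {}
    for p, d in demands_by_problem.items():
        own = np.asarray(d, dtype=float)
        pool = np.asarray(pooled[cluster_of[p]], dtype=float)
        # Weighted mixture of the two empirical distributions via weighted quantile.
        sample = np.concatenate([own, pool])
        weights = np.concatenate([np.full(own.size, (1 - lam) / own.size),
                                  np.full(pool.size, lam / pool.size)])
        order = np.argsort(sample)
        cum = np.cumsum(weights[order]) / weights.sum()
        idx = min(np.searchsorted(cum, q), sample.size - 1)
        orders[p] = sample[order][idx]
    return orders
```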

Utilizing Model Residuals to Identify Rental Properties of Interest: The Price Anomaly Score (PAS) and Its Application to Real-time Data in Manhattan

  • paper_url: http://arxiv.org/abs/2311.17287
  • repo_url: None
  • paper_authors: Youssef Sultan, Jackson C. Rafter, Huyen T. Nguyen
  • for: This study aims to strengthen the understanding of model residuals, especially for machine learning models that generalize over the majority of a well-proportioned dataset.
  • methods: Using data on all available rental listings in Manhattan as of September 2023, it introduces the Price Anomaly Score (PAS) to capture structure in price-prediction deviations.
  • results: When a model generalizes to at least 75% of the dataset, the remaining deviations reveal significant insights; aggregating PAS values and fine-tuning upper and lower boundaries identifies overpriced and underpriced listings.
    Abstract Understanding whether a property is priced fairly hinders buyers and sellers since they usually do not have an objective viewpoint of the price distribution for the overall market of their interest. Drawing from data collected of all possible available properties for rent in Manhattan as of September 2023, this paper aims to strengthen our understanding of model residuals; specifically on machine learning models which generalize for a majority of the distribution of a well-proportioned dataset. Most models generally perceive deviations from predicted values as mere inaccuracies, however this paper proposes a different vantage point: when generalizing to at least 75\% of the data-set, the remaining deviations reveal significant insights. To harness these insights, we introduce the Price Anomaly Score (PAS), a metric capable of capturing boundaries between irregularly predicted prices. By combining relative pricing discrepancies with statistical significance, the Price Anomaly Score (PAS) offers a multifaceted view of rental valuations. This metric allows experts to identify overpriced or underpriced properties within a dataset by aggregating PAS values, then fine-tuning upper and lower boundaries to any threshold to set indicators of choice.
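The score itself is straightforward to compute from model residuals: standardize each listing's prediction error and threshold the tails. A minimal version under our assumptions follows — z-scored log-price residuals as a proxy for the paper's combination of relative discrepancy and statistical significance.

```python
import numpy as np

def price_anomaly_scores(y_true, y_pred):
    """Standardized residuals: positive = listed above prediction (overpriced)."""
    resid = np.log(y_true) - np.log(y_pred)     # relative pricing discrepancy
    return (resid - resid.mean()) / (resid.std() + 1e-12)

def flag_listings(pas, upper=2.0, lower=-2.0):
    """Tune the boundaries to taste; returns (overpriced_idx, underpriced_idx)."""
    return np.flatnonzero(pas > upper), np.flatnonzero(pas < lower)
```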

eess.SP - 2023-11-29

Robust Localization and Tracking of UAVs in OTFS-based Networks

  • paper_url: http://arxiv.org/abs/2311.17742
  • repo_url: None
  • paper_authors: Alessandro Nordio, Carla Fabiana Chiasserini, Emanuele Viterbo
  • for: This work aims to accurately localize N unmanned aerial vehicles (UAVs) in 3D space, where the UAVs form a swarm and communicate with each other through orthogonal time-frequency space (OTFS) modulated signals.
  • methods: An iterative algorithm named Turbo Iterative Positioning (TIP) uses a belief-propagation approach to exploit the time differences of arrival (TDoA) between the LoS and non-LoS paths. Doppler shifts measured by the OTFS receivers estimate the UAVs' velocities, and the positions are derived via gradient-descent optimization with turbo-like iterations that progressively correct residual errors in the initial ID mapping.
  • results: Numerical results, also obtained with real-world traces, show that the multipath links enable very accurate localization and speed estimates for all UAVs even with limited delay-Doppler resolution; robustness is proven by performance approaching the Cramer-Rao bound.
    Abstract We consider the problem of accurately localizing N unmanned aerial vehicles (UAV) in 3D space where the UAVs are part of a swarm and communicate with each other through orthogonal time-frequency space (OTFS) modulated signals. Each receiving UAV estimates the multipath wireless channel on each link formed by the line-of-sight (LoS) transmission and by the single reflections from the remaining N-2 UAVs. The estimated power delay profiles are communicated to an edge server, which is in charge of computing the exact location and speed of the UAVs. To obtain the UAVs locations and velocities, we propose an iterative algorithm, named Turbo Iterative Positioning (TIP), which, using a belief-propagation approach, effectively exploits the time difference of arrival (TDoA) measurements between the LoS and the non-LoS paths. Enabling a full cold start (no prior knowledge), our solution first maps each TDoA's profile element to a specific ID of the reflecting UAV's. The Doppler shifts measured by the OTFS receivers associated with each path are also used to estimate the UAV's velocities. The localization of the N UAVs is then derived via gradient descent optimization, with the aid of turbo-like iterations that can progressively correct some of the residual errors in the initial ID mapping operation. Our numerical results, obtained also using real-world traces, show how the multipath links are beneficial to achieving very accurate localization and speed of all UAVs, even with a limited delay-Doppler resolution. Robustness of our scheme is proven by its performance approaching the Cramer-Rao bound.
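The final positioning step is a nonlinear least-squares fit of coordinates to the TDoA measurements. A stripped-down single-emitter version is sketched below (known anchor positions, LoS ranges only, our own TDoA sign convention); the paper's TIP additionally infers which reflector produced each non-LoS path and folds in Doppler.

```python
import numpy as np

def tdoa_localize(anchors, tdoas, c=3e8, steps=2000, lr=1e-3):
    """Estimate one emitter position from TDoAs relative to anchors[0].
    Convention: tdoas[i] = (||x - anchors[i]|| - ||x - anchors[0]||) / c."""
    x = anchors.mean(axis=0).astype(float)       # cold-start at the centroid
    for _ in range(steps):
        g = np.zeros(3)
        d0 = np.linalg.norm(x - anchors[0]) + 1e-9
        for i in range(1, len(anchors)):
            di = np.linalg.norm(x - anchors[i]) + 1e-9
            err = (di - d0) - c * tdoas[i]       # range-difference residual
            # Gradient of err^2 w.r.t. x.
            g += 2 * err * ((x - anchors[i]) / di - (x - anchors[0]) / d0)
        x -= lr * g
    return x
```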

Odor-Based Molecular Communications: State-of-the-Art, Vision, Challenges, and Frontier Directions

  • paper_url: http://arxiv.org/abs/2311.17727
  • repo_url: None
  • paper_authors: Dilara Aktas, Beyza Ezgi Ortlek, Meltem Civas, Elham Baradari, Ayse Sila Okcu, Melanie Whitfield, Oktay Cetinkaya, Ozgur Baris Akan
  • for: This paper aims to explore the concept of odor-based molecular communication (OMC) and its potential applications in the Internet of Everything (IoE).
  • methods: The paper examines olfactory systems in nature, including aspects of odor information, channels, reception, spatial perception, and cognitive functions, and compares various communication systems to highlight the unique characteristics, advantages, and potential applications of OMC.
  • results: The paper lays the groundwork for exploring the modeling of an end-to-end OMC channel, considering the design of OMC transmitters and receivers, and developing innovative OMC techniques.
    Abstract Humankind mimics the processes and strategies that nature has perfected and uses them as a model to address its problems. That has recently found a new direction, i.e., a novel communication technology called molecular communication (MC), using molecules to encode, transmit, and receive information. Despite extensive research, an innate MC method with plenty of natural instances, i.e., olfactory or odor communication, has not yet been studied with the tools of information and communication technologies (ICT). Existing studies focus on digitizing this sense and developing actuators without inspecting the principles of odor-based information coding and MC, which significantly limits its application potential. Hence, there is a need to focus cross-disciplinary research efforts to reveal the fundamentals of this unconventional communication modality from an ICT perspective. The ways of natural odor MC in nature need to be anatomized and engineered for end-to-end communication among humans and human-made things to enable several multi-sense augmented reality technologies reinforced with olfactory senses for novel applications and solutions in the Internet of Everything (IoE). This paper introduces the concept of odor-based molecular communication (OMC) and provides a comprehensive examination of olfactory systems. It explores odor communication in nature, including aspects of odor information, channels, reception, spatial perception, and cognitive functions. Additionally, a comprehensive comparison of various communication systems sets the foundation for further investigation. By highlighting the unique characteristics, advantages, and potential applications of OMC through this comparative analysis, the paper lays the groundwork for exploring the modeling of an end-to-end OMC channel, considering the design of OMC transmitters and receivers, and developing innovative OMC techniques.

Multi-dimensional Energy Limitation in Sphere Shaping for Nonlinear Interference Noise Mitigation

  • paper_url: http://arxiv.org/abs/2311.17726
  • repo_url: None
  • paper_authors: Jingtian Liu, Élie Awwad, Yves Jaouën
  • for: Improving transmission performance over non-linear WDM optical-fiber systems.
  • methods: Four-dimensional energy-limit enumerative sphere shaping of $M$-QAM signaling to minimize rate loss.
  • results: The proposed scheme outperforms conventional ESS by 0.19 bit/4D-symbol in achievable information rate over a 205-km single-span link carrying five polarization-division-multiplexed WDM channels at 400 Gbit/s net rate each; the achieved gains do not scale well to multi-span systems.
    Abstract We propose Four-Dimensional (4D) energy limit enumerative sphere shaping (ESS) of $M$-QAM signaling to minimize rate loss and improve the transmission performance over non-linear WDM optical-fiber systems. Simulation results show that the proposed scheme outperforms the conventional ESS by $0.19$~bit/4D-symbol in achievable information rate over a $205$-km single-span link and a WDM transmission of five polarization-division-multiplexed channels with $400$-Gbit/s net rate per channel. We also study the achieved performance over several shaping block lengths and show that the achieved gains do not scale well over multi-span systems.

Fault-Tolerant Four-Dimensional Constellation for Coherent Optical Transmission Systems

  • paper_url: http://arxiv.org/abs/2311.17698
  • repo_url: None
  • paper_authors: Jingtian Liu, Élie Awwad, Yves Jaouën
  • for: Improving long-haul optical-fiber transmission performance with robustness to non-linear impairments.
  • methods: A 4D-2A-RS64 modulation format (64 symbols) encoded over two polarization tributaries with geometric shaping.
  • results: Extends the maximum transmission distance by 10.6% (400 km) over PDM-8QAM-star and by 4% (160 km) over 4D-2A-8PSK, with better robustness to modulator imbalances and quantization errors.
    Abstract We propose a 4-dimensional 2-ary amplitude ring-switched modulation format with 64 symbols, which is denoted as 4D-2A-RS64 encoded over two polarization tributaries to improve the transmission performance over long-haul optical fibers in the presence of the non-linear Kerr effect. At a spectral efficiency of 6 bits per 4D, simulation results show that this format outperforms the polarization division multiplexed (PDM) 8QAM-star modulation as well as the 4D-2A-8PSK over links without inline dispersion management. We evaluate the performance for a WDM transmission of $11\times90~\mathrm{Gbaud}$ channels over a multi-span SSMF link. For an achievable information rate of $4.8\mathrm{bit/s/Hz}$, the maximum transmission distance is improved by $10.6\%$ (400 km) and $4\%$ (160 km) compared to PDM-8QAM-star and 4D-2A-8PSK respectively. The achieved gains are composed of a linear part and a non-linear part, respectively from the improved Euclidean-distance distribution and the constant power property of the 4D modulation. The geometric shaping of the proposed scheme is easy to implement and is robust to Mach-Zehnder modulator (MZM) imbalances and quantization errors stemming from the finite digital-to-analog converter (DAC) resolution. This robustness is compared to the one of other geometric-shaped non-linearity tolerant 4D schemes such as the 4D-2A-8PSK and the 4D-64PRS that can be both outperformed by our scheme in severe conditions.
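As a quick arithmetic check on the stated spectral efficiency, a 64-symbol 4D constellation carries

$$ \log_2 64 = 6 \ \text{bits per 4D symbol} = 3 \ \text{bits per 2D (per-polarization) use}, $$

which is where the 6 bits per 4D figure above comes from.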

Optimization in Mobile Augmented Reality Systems for the Metaverse over Wireless Communications

  • paper_url: http://arxiv.org/abs/2311.17630
  • repo_url: None
  • paper_authors: Tianming Lan, Jun Zhao
  • for: Addressing how edge computing can support Mobile Augmented Reality (MAR) applications for the Metaverse.
  • methods: An edge-based MAR system with a mathematical model of the trade-offs among latency, accuracy, server resource allocation, and energy consumption, solved by the proposed LEAO algorithm.
  • results: LEAO outperforms related algorithms across simulation scenarios in latency, accuracy, and energy consumption.
    Abstract As the essential technical support for Metaverse, Mobile Augmented Reality (MAR) has attracted the attention of many researchers. MAR applications rely on real-time processing of visual and audio data, and thus those heavy workloads can quickly drain the battery of a mobile device. To address such problem, edge-based solutions have appeared for handling some tasks that require more computing power. However, such strategies introduce a new trade-off: reducing the network latency and overall energy consumption requires limiting the size of the data sent to the edge server, which, in turn, results in lower accuracy. In this paper, we design an edge-based MAR system and propose a mathematical model to describe it and analyze the trade-off between latency, accuracy, server resources allocation and energy consumption. Furthermore, an algorithm named LEAO is proposed to solve this problem. We evaluate the performance of the LEAO and other related algorithms across various simulation scenarios. The results demonstrate the superiority of the LEAO algorithm. Finally, our work provides insight into optimization problem in edge-based MAR system for Metaverse.

Combating Multi-path Interference to Improve Chirp-based Underwater Acoustic Communication

  • paper_url: http://arxiv.org/abs/2311.17624
  • repo_url: None
  • paper_authors: Wenjun Xie, Enqi Zhang, Lizhao You, Deqing Wang, Zhaorui Wang, Liqun Fu
  • for: Improving the throughput and reliability of chirp-based underwater acoustic communication.
  • methods: Non-linear chirp modulation, per-path demodulation with multi-path combining, and non-binary LDPC coding.
  • results: Improves the bit error rate by 3x and the packet error rate significantly over LoRa's naive design, and raises throughput by up to 50x over a state-of-the-art system.
    Abstract Linear chirp-based underwater acoustic communication has been widely used due to its reliability and long-range transmission capability. However, unlike the counterpart chirp technology in wireless -- LoRa, its throughput is severely limited by the number of modulated chirps in a symbol. The fundamental challenge lies in the underwater multi-path channel, where the delayed copies of one symbol may cause inter-symbol and intra-symbol interference. In this paper, we present UWLoRa+, a system that realizes the same chirp modulation as LoRa with a higher data rate, and enhances LoRa's design to address the multi-path challenge via the following designs: a) we replace the linear chirp used by LoRa with a non-linear chirp to reduce the signal interference range and the collision probability; b) we design an algorithm that first demodulates each path and then combines the demodulation results of the detected paths; and c) we replace the Hamming codes used by LoRa with non-binary LDPC codes to mitigate the impact of inevitable collisions. Experimental results show that the new designs improve the bit error rate (BER) by 3x, and the packet error rate (PER) significantly, compared with LoRa's naive design. Compared with a state-of-the-art system for decoding underwater LoRa chirp signals, UWLoRa+ improves the throughput by up to 50 times.
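To make the linear-versus-non-linear chirp idea concrete, here is a minimal NumPy sketch of both waveforms. The quadratic frequency law and all parameter values are our illustrative assumptions; the abstract does not specify the exact non-linear sweep UWLoRa+ uses.

```python
import numpy as np

fs, T, B = 48_000, 0.05, 4_000   # sample rate (Hz), symbol time (s), bandwidth (Hz)
t = np.arange(0, T, 1 / fs)

# Linear chirp (LoRa-style): frequency sweeps across B at a constant rate.
f_lin = -B / 2 + (B / T) * t
linear_chirp = np.cos(2 * np.pi * np.cumsum(f_lin) / fs)

# One possible non-linear (quadratic) sweep: delayed multi-path copies
# overlap a shorter time-frequency region, shrinking the interference
# range and collision probability, as the paper argues.
f_quad = -B / 2 + B * (t / T) ** 2
nonlinear_chirp = np.cos(2 * np.pi * np.cumsum(f_quad) / fs)
```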

A Unified Framework for Multi-Hop Wireless Relaying with Hardware Impairments

  • paper_url: http://arxiv.org/abs/2311.17617
  • repo_url: None
  • paper_authors: Ehsan Soleimani-Nasab, Sinem Coleri
  • for: Studying the aggregate impact of hardware impairments (phase noise, IQ mismatch, and high power amplifier nonlinearities) on multi-hop relay systems.
  • methods: Amplify-and-forward and decode-and-forward relaying analyzed over a general H-fading model, with closed-form outage, bit-error, and ergodic-capacity expressions.
  • results: At high SNR the overall signal-to-noise-plus-distortion ratio saturates at a ceiling inversely proportional to the aggregate impairment level; practical optimization problems are formulated to find the optimal impairment levels.
    Abstract Relaying increases the coverage area and reliability of wireless communications systems by mitigating the fading effect on the received signal. Most technical contributions in the context of these systems assume ideal hardware (ID) by neglecting the non-idealities of the transceivers, which include phase noise, in-phase/quadrature mismatch, and high power amplifier nonlinearities. These non-idealities distort the received signal by causing variations in the phase and attenuating the amplitude. The resulting deterioration of the performance of wireless communication systems is further magnified as the frequency of transmission increases. In this paper, we investigate the aggregate impact of hardware impairments (HI) on the general multi-hop relay system using amplify-and-forward (AF) and decode-and-forward (DF) relaying techniques over a general H-fading model. H-fading includes free space optics, radio frequency, millimeter wave, Terahertz, and underwater fading models. Closed-form expressions of outage probability, bit error probability, and ergodic capacity are derived in terms of H-functions. Following an asymptotic analysis at high signal-to-noise ratio (SNR), practical optimization problems are formulated with the objective of finding the optimal level of HI subject to a limit on the total HI level. The analytical solution is derived for the Nakagami-m fading channel, a special case of H-fading, for both AF and DF relaying techniques. The overall instantaneous signal-to-noise-plus-distortion ratio is shown to reach a ceiling at high SNR that is inversely proportional to the aggregate HI level of the transceivers across all hops, in contrast to the ID case.
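To see why such a ceiling appears, consider a minimal single-link sketch under the common aggregate-impairment model; the impairment levels $\kappa_t$ and $\kappa_r$ (transmitter and receiver) are our notation, not the paper's. Because the distortion noise power scales with the signal power $P$,

$$ \mathrm{SNDR} = \frac{P\,|h|^2}{P\,|h|^2\left(\kappa_t^2 + \kappa_r^2\right) + \sigma^2} \;\longrightarrow\; \frac{1}{\kappa_t^2 + \kappa_r^2} \quad \text{as } P/\sigma^2 \to \infty, $$

so the SNDR saturates at a value inversely proportional to the total impairment level, which is the multi-hop ceiling behavior described above.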

cs.SD - 2023-11-28

Introducing STRAUSS: A flexible sonification Python package

  • paper_url: http://arxiv.org/abs/2311.16847
  • repo_url: None
  • paper_authors: James W. Trayford, Chris M. Harrison
  • for: Introducing STRAUSS, a Python sonification package for scientific data exploration and analysis as well as public outreach and artistic contexts.
  • methods: A modular, self-contained, and flexible Python package released under a free and open-source (FOSS) license.
  • results: Example sonifications spanning multiple representations of univariate data, mappings of multi-variate data onto sound, and a full spatial-audio example, with an outline of future functionality.
    Abstract We introduce STRAUSS (Sonification Tools and Resources for Analysis Using Sound Synthesis) a modular, self-contained and flexible Python sonification package, operating in a free and open source (FOSS) capacity. STRAUSS is intended to be a flexible tool suitable for both scientific data exploration and analysis as well as for producing sonifications that are suitable for public outreach and artistic contexts. We explain the motivations behind STRAUSS, and how these lead to our design choices. We also describe the basic code structure and concepts. We then present output sonification examples, specifically: (1) multiple representations of univariate data (i.e., single data series) for data exploration; (2) how multi-variate data can be mapped onto sound to help interpret how those data variables are related and; (3) a full spatial audio example for immersive Virtual Reality. We summarise, alluding to some of the future functionality as STRAUSS development accelerates.
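As a flavor of the parameter-mapping idea that sonification packages such as STRAUSS generalize, here is a minimal self-contained sketch of ours that maps a univariate data series onto pitch; it is illustrative only and not the STRAUSS API.

```python
import numpy as np
from scipy.io import wavfile

def sonify(series, fs=44_100, note_dur=0.25, f_lo=220.0, f_hi=880.0):
    """Render each data value as a sine tone whose pitch scales with the value."""
    series = np.asarray(series, dtype=float)
    lo, hi = series.min(), series.max()
    freqs = f_lo + (series - lo) / (hi - lo + 1e-12) * (f_hi - f_lo)
    t = np.arange(int(fs * note_dur)) / fs
    audio = np.concatenate([np.sin(2 * np.pi * f * t) for f in freqs])
    wavfile.write("sonification.wav", fs, (audio * 32767).astype(np.int16))

sonify(np.sin(np.linspace(0, 4 * np.pi, 32)))  # a rising-and-falling melody
```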

  • paper_url: http://arxiv.org/abs/2311.16702
  • repo_url: None
  • paper_authors: Or Berebi, Zamir Ben-Hur, David Lou Alon, Boaz Rafaely
  • for: Binaural reproduction for headphone-based listening in evolving technologies such as augmented and virtual reality (AR and VR).
  • methods: A novel head-related transfer function (HRTF) preprocessing optimization loss, iMagLS, which adds an interaural level difference (ILD) error term to the MagLS loss over lateral-plane angles and is minimized via nonlinear programming.
  • results: The ILD error is substantially reduced while the HRTF magnitude error remains at the MagLS level, which could benefit the overall spatial quality of first-order Ambisonics and other reproduction methods.
    Abstract Binaural reproduction for headphone-based listening is an active research area due to its widespread use in evolving technologies such as augmented and virtual reality (AR and VR). On the one hand, these applications demand high quality spatial audio perception to preserve the sense of immersion. On the other hand, recording devices may only have a few microphones, leading to low-order representations such as first-order Ambisonics (FOA). However, first-order Ambisonics leads to limited externalization and spatial resolution. In this paper, a novel head-related transfer function (HRTF) preprocessing optimization loss is proposed, and is minimized using nonlinear programming. The new method, denoted iMagLS, involves the introduction of an interaural level difference (ILD) error term to the now widely used MagLS optimization loss for the lateral plane angles. Results indicate that the ILD error could be substantially reduced, while the HRTF magnitude error remains similar to that obtained with MagLS. These results could prove beneficial to the overall spatial quality of first-order Ambisonics, while other reproduction methods could also benefit from considering this modified loss.
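A schematic form of the modified loss helps locate the change; the weighting $\lambda$ and the lateral-plane direction set $\Omega_{\mathrm{lat}}$ are our notation, not the paper's:

$$ \mathcal{L}_{\mathrm{iMagLS}} = \sum_{\omega,\,\Omega} \left( \left|\hat{H}(\omega,\Omega)\right| - \left|H(\omega,\Omega)\right| \right)^2 + \lambda \sum_{\Omega \in \Omega_{\mathrm{lat}}} \left( \widehat{\mathrm{ILD}}(\Omega) - \mathrm{ILD}(\Omega) \right)^2, $$

where $H$ is the reference HRTF and $\hat{H}$ its low-order (e.g., FOA) rendering; the first term is the standard MagLS magnitude error and the second is the added ILD term, minimized jointly via nonlinear programming.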

eess.AS - 2023-11-28

Study of speaker localization under dynamic and reverberant environments

  • paper_url: http://arxiv.org/abs/2311.16927
  • repo_url: None
  • paper_authors: Daniel A. Mitchell, Boaz Rafaely
  • for: Studying speaker localization in dynamic and reverberant acoustic environments.
  • methods: Analysis of a speaker localization algorithm using a glasses-mounted wearable microphone array on the EasyCom dataset, with proposed improvements for dynamic scenes.
  • results: The algorithm performs well in static environments but degrades substantially on the EasyCom dataset; improvement methods are proposed.
    Abstract Speaker localization in a reverberant environment is a fundamental problem in audio signal processing. Many solutions have been developed to tackle this problem. However, previous algorithms typically assume a stationary environment in which both the microphone array and the sound sources are not moving. With the emergence of wearable microphone arrays, acoustic scenes have become dynamic with moving sources and arrays. This calls for algorithms that perform well in dynamic environments. In this article, we study the performance of a speaker localization algorithm in such an environment. The study is based on the recently published EasyCom speech dataset recorded in reverberant and noisy environments using a wearable array on glasses. Although the localization algorithm performs well in static environments, its performance degraded substantially when used on the EasyCom dataset. The paper presents performance analysis and proposes methods for improvement.

cs.CV - 2023-11-28

E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer

  • paper_url: http://arxiv.org/abs/2311.17267
  • repo_url: None
  • paper_authors: Jacob Zhiyuan Fang, Skyler Zheng, Vasu Sharma, Robinson Piramuthu
  • for: Building a lightweight, efficient video-language model (E-ViLM) with a masked video modeling (MVM) schema that is practical for real-world applications.
  • methods: A semantic vector-quantized tokenizer that discretizes visual signals into labels, supervising a simple MVM task alongside regular VL pre-training objectives.
  • results: Despite few parameters and GFLOPs, E-ViLM learns expressive representations from video-language corpora and reaches competitive performance on video question answering, text-to-video retrieval, and other video-language tasks.
    Abstract To build scalable models for challenging real-world tasks, it is important to learn from diverse, multi-modal data in various forms (e.g., videos, text, and images). Among the existing works, a plethora of them have focused on leveraging large but cumbersome cross-modal architectures. Regardless of their effectiveness, larger architectures unavoidably prevent the models from being extended to real-world applications, so building a lightweight VL architecture and an efficient learning schema is of great practical value. In this paper, we propose an Efficient Video-Language Model (dubbed E-ViLM) and a masked video modeling (MVM) schema, assisted by a semantic vector-quantized tokenizer. In particular, our E-ViLM learns to reconstruct the semantic labels of masked video regions, produced by the pre-trained vector-quantized tokenizer, which discretizes the continuous visual signals into labels. We show that with our simple MVM task and regular VL pre-training modelings, our E-ViLM, despite its compactness, is able to learn expressive representations from video-language corpora and generalize well to extensive video-language tasks including video question answering, text-to-video retrieval, etc. In particular, our E-ViLM obtains obvious efficiency improvements by reaching competitive performance with faster inference speed, i.e., our model reaches $39.3$% Top-$1$ accuracy on the MSRVTT benchmark, retaining $91.4$% of the accuracy of the state-of-the-art larger VL architecture with only $15\%$ of the parameters and $94.8\%$ fewer GFLOPs. We also provide extensive ablative studies that validate the effectiveness of our proposed learning schema for E-ViLM.
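The MVM objective reduces to a classification loss over the tokenizer's codebook on masked patches. The sketch below is our minimal PyTorch rendering of that idea, with hypothetical tensor names:

```python
import torch
import torch.nn.functional as F

def mvm_loss(student_logits, vq_labels, mask):
    # student_logits: (B, T, V) predictions over the VQ codebook for each
    # video patch; vq_labels: (B, T) long, discrete labels from the frozen
    # vector-quantized tokenizer; mask: (B, T) bool, True where the patch
    # was hidden from the student's input.
    return F.cross_entropy(student_logits[mask], vq_labels[mask])
```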

SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors

  • paper_url: http://arxiv.org/abs/2311.17261
  • repo_url: https://github.com/daveredrum/SceneTex
  • paper_authors: Dave Zhenyu Chen, Haoxuan Li, Hsin-Ying Lee, Sergey Tulyakov, Matthias Nießner
  • for: Generating high-quality, style-consistent textures for indoor scenes.
  • methods: Depth-to-image diffusion priors, with texture synthesis formulated as an optimization problem in RGB space driven by a score-distillation-based objective, a multiresolution texture field, and a cross-attention decoder.
  • results: Significant improvements in visual quality and prompt fidelity over prior texture generation methods on 3D-FRONT scenes.
    Abstract We propose SceneTex, a novel method for effectively generating high-quality and style-consistent textures for indoor scenes using depth-to-image diffusion priors. Unlike previous methods that either iteratively warp 2D views onto a mesh surface or distill diffusion latent features without accurate geometric and style cues, SceneTex formulates the texture synthesis task as an optimization problem in the RGB space where style and geometry consistency are properly reflected. At its core, SceneTex proposes a multiresolution texture field to implicitly encode the mesh appearance. We optimize the target texture via a score-distillation-based objective function in respective RGB renderings. To further secure the style consistency across views, we introduce a cross-attention decoder to predict the RGB values by cross-attending to the pre-sampled reference locations in each instance. SceneTex enables various and accurate texture synthesis for 3D-FRONT scenes, demonstrating significant improvements in visual quality and prompt fidelity over prior texture generation methods.

Pattern retrieval of traffic congestion using graph-based associations of traffic domain-specific features

  • paper_url: http://arxiv.org/abs/2311.17256
  • repo_url: None
  • paper_authors: Tin T. Nguyen, Simeon C. Calvert, Guopeng Li, Hans van Lint
  • for: This paper proposes a content-based retrieval system for spatiotemporal patterns of highway traffic congestion, which can help traffic management by locating similar patterns in big datasets.
  • methods: The proposed framework consists of two main components: pattern representation and similarity measurement. The paper uses a graph-based approach (relation-graph) for pattern representation, in which fundamental traffic phenomena are encoded as nodes and their spatiotemporal relationships as edges.
  • results: The proposed method is effective in retrieving similar patterns in a dataset of hundreds of patterns with various complexities, both temporally and spatially. The obtained patterns present similar traffic phenomena as in the given examples, and the success of the proposed approach opens up a new opportunity for semantic retrieval.
    Abstract The fast-growing amount of traffic data brings many opportunities for revealing more insightful information about traffic dynamics. However, it also demands an effective database management system in which information retrieval is arguably an important feature. The ability to locate similar patterns in big datasets potentially paves the way for further valuable analyses in traffic management. This paper proposes a content-based retrieval system for spatiotemporal patterns of highway traffic congestion. There are two main components in our framework, namely pattern representation and similarity measurement. To effectively interpret retrieval outcomes, the paper proposes a graph-based approach (relation-graph) for the former component, in which fundamental traffic phenomena are encoded as nodes and their spatiotemporal relationships as edges. In the latter component, the similarities between congestion patterns are customizable with various aspects according to user expectations. We evaluated the proposed framework by applying it to a dataset of hundreds of patterns with various complexities (temporally and spatially). The example queries indicate the effectiveness of the proposed method, i.e. the obtained patterns present similar traffic phenomena as in the given examples. In addition, the success of the proposed approach directly derives a new opportunity for semantic retrieval, in which expected patterns are described by adopting the relation-graph notion to associate fundamental traffic phenomena.

SubZero: Subspace Zero-Shot MRI Reconstruction

  • paper_url: http://arxiv.org/abs/2311.17251
  • repo_url: https://github.com/heng14/subzero
  • paper_authors: Heng Yu, Yamin Arefeen, Berkin Bilgic
  • for: Accelerating MRI scans using zero-shot self-supervised learning and subspace models.
  • methods: A parallel network framework and an attention mechanism that improve subspace-based zero-shot self-supervised learning, enabling higher acceleration factors.
  • results: Outperforms existing methods in T1 and T2 mapping acquisitions.
    Abstract Recently introduced zero-shot self-supervised learning (ZS-SSL) has shown potential in accelerated MRI in a scan-specific scenario, which enabled high-quality reconstructions without access to a large training dataset. ZS-SSL has been further combined with the subspace model to accelerate 2D T2-shuffling acquisitions. In this work, we propose a parallel network framework and introduce an attention mechanism to improve subspace-based zero-shot self-supervised learning and enable higher acceleration factors. We name our method SubZero and demonstrate that it can achieve improved performance compared with current methods in T1 and T2 mapping acquisitions.

LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS

  • paper_url: http://arxiv.org/abs/2311.17245
  • repo_url: https://github.com/VITA-Group/LightGaussian
  • paper_authors: Zhiwen Fan, Kevin Wang, Kairun Wen, Zehao Zhu, Dejia Xu, Zhangyang Wang
  • for: Making 3D Gaussian Splatting more scalable and compact while preserving rendering quality.
  • methods: Pruning and recovery of insignificant Gaussians in the spirit of network pruning, distillation of spherical harmonics to a lower degree with pseudo-view augmentation, and VecTree Quantization of all attributes.
  • results: An average compression rate of over 15x while boosting FPS from 139 to 215, enabling efficient rendering of complex scenes.
    Abstract Recent advancements in real-time neural rendering using point-based techniques have paved the way for the widespread adoption of 3D representations. However, foundational approaches like 3D Gaussian Splatting come with a substantial storage overhead caused by the SfM points growing to millions, often demanding gigabyte-level disk space for a single unbounded scene, posing significant scalability challenges and hindering the splatting efficiency. To address this challenge, we introduce LightGaussian, a novel method designed to transform 3D Gaussians into a more efficient and compact format. Drawing inspiration from the concept of Network Pruning, LightGaussian identifies Gaussians that are insignificant in contributing to the scene reconstruction and adopts a pruning and recovery process, effectively reducing redundancy in Gaussian counts while preserving visual effects. Additionally, LightGaussian employs distillation and pseudo-view augmentation to distill spherical harmonics to a lower degree, allowing knowledge transfer to more compact representations while maintaining reflectance. Furthermore, we propose a hybrid scheme, VecTree Quantization, to quantize all attributes, resulting in lower bitwidth representations with minimal accuracy losses. In summary, LightGaussian achieves an average compression rate of over 15x while boosting the FPS from 139 to 215, enabling an efficient representation of complex scenes on the Mip-NeRF 360, Tank and Temple datasets. Project website: https://lightgaussian.github.io/
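A minimal sketch of significance-based Gaussian pruning, in the spirit of the network-pruning idea above; the significance score shown (opacity times volume) is our simplification of the paper's global significance criterion:

```python
import torch

def prune_gaussians(opacity, scale, keep_ratio=0.34):
    # opacity: (N, 1) per-Gaussian opacity; scale: (N, 3) axis scales.
    # Larger, more opaque Gaussians tend to contribute more to renders.
    significance = opacity.squeeze(-1) * scale.prod(dim=-1)
    k = int(keep_ratio * significance.numel())
    keep = torch.topk(significance, k).indices
    return keep  # use this index set to slice all Gaussian attribute tensors
```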

PHG-Net: Persistent Homology Guided Medical Image Classification

  • paper_url: http://arxiv.org/abs/2311.17243
  • repo_url: https://github.com/yaoppeng/topoclassification
  • paper_authors: Yaopeng Peng, Hongxiao Wang, Milan Sonka, Danny Z. Chen
  • for: Improving medical image classification by exploiting topological features.
  • methods: A persistent homology guided approach (PHG-Net) that computes the cubical persistence diagram of an input image, encodes the topological features into a vector with a small PH module, and fuses them with CNN or Transformer feature maps.
  • results: Considerable improvements over state-of-the-art methods on three public datasets, with a lightweight module that integrates into any CNN or Transformer architecture end-to-end.
    Abstract Modern deep neural networks have achieved great successes in medical image analysis. However, the features captured by convolutional neural networks (CNNs) or Transformers tend to be optimized for pixel intensities and neglect key anatomical structures such as connected components and loops. In this paper, we propose a persistent homology guided approach (PHG-Net) that explores topological features of objects for medical image classification. For an input image, we first compute its cubical persistence diagram and extract topological features into a vector representation using a small neural network (called the PH module). The extracted topological features are then incorporated into the feature map generated by CNN or Transformer for feature fusion. The PH module is lightweight and capable of integrating topological features into any CNN or Transformer architectures in an end-to-end fashion. We evaluate our PHG-Net on three public datasets and demonstrate its considerable improvements on the target classification tasks over state-of-the-art methods.
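For concreteness, the cubical persistence diagram the PH module consumes can be computed with an off-the-shelf library such as GUDHI; the vectorization into the PH module's input is not shown here:

```python
import numpy as np
import gudhi

def cubical_persistence(image: np.ndarray):
    # Build a cubical complex from pixel intensities and compute its
    # persistence diagram: 0-dim features track connected components,
    # 1-dim features track loops.
    cc = gudhi.CubicalComplex(top_dimensional_cells=image)
    cc.compute_persistence()
    return [cc.persistence_intervals_in_dimension(d) for d in (0, 1)]
```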

End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames

  • paper_url: http://arxiv.org/abs/2311.17241
  • repo_url: None
  • paper_authors: Shuming Liu, Chen-Lin Zhang, Chen Zhao, Bernard Ghanem
  • for: Improving temporal action detection (TAD), where the memory bottleneck has restricted end-to-end training to limited model and data scales.
  • methods: A memory-efficient end-to-end training scheme that scales the TAD backbone to 1 billion parameters and the input video to 1,536 frames, built on a lightweight temporal-informative adapter (TIA) that is the only module updated during training and aggregates temporal context from adjacent frames.
  • results: Strong results on four representative datasets; end-to-end training of VideoMAEv2-giant reaches 75.4% mAP on THUMOS14, the first end-to-end model to outperform the best feature-based methods.
    Abstract Recently, temporal action detection (TAD) has seen significant performance improvement with end-to-end training. However, due to the memory bottleneck, only models with limited scales and limited data volumes can afford end-to-end training, which inevitably restricts TAD performance. In this paper, we reduce the memory consumption for end-to-end training, and manage to scale up the TAD backbone to 1 billion parameters and the input video to 1,536 frames, leading to significant detection performance. The key to our approach lies in our proposed temporal-informative adapter (TIA), which is a novel lightweight module that reduces training memory. Using TIA, we free the humongous backbone from learning to adapt to the TAD task by only updating the parameters in TIA. TIA also leads to better TAD representation by temporally aggregating context from adjacent frames throughout the backbone. We evaluate our model across four representative datasets. Owing to our efficient design, we are able to train end-to-end on VideoMAEv2-giant and achieve 75.4% mAP on THUMOS14, being the first end-to-end model to outperform the best feature-based methods.

BIM: Block-Wise Self-Supervised Learning with Masked Image Modeling

  • paper_url: http://arxiv.org/abs/2311.17218
  • repo_url: None
  • paper_authors: Yixuan Luo, Mengye Ren, Sai Qian Zhang
  • for: Improving the feature extraction capability of deep neural networks (DNNs) under resource-constrained masked image modeling (MIM) pretraining.
  • methods: Block-Wise Masked Image Modeling (BIM), which decomposes the MIM task into sub-tasks with independent computation patterns, replacing end-to-end back-propagation with block-wise operations.
  • results: Maintains MIM performance while greatly reducing peak memory consumption, and enables concurrent training of multiple DNN backbones of varying depths.
    Abstract Like masked language modeling (MLM) in natural language processing, masked image modeling (MIM) aims to extract valuable insights from image patches to enhance the feature extraction capabilities of the underlying deep neural network (DNN). Contrasted with other training paradigms like supervised learning and unsupervised contrastive learning, masked image modeling (MIM) pretraining typically demands significant computational resources in order to manage large training data batches (e.g., 4096). The significant memory and computation requirements pose a considerable challenge to its broad adoption. To mitigate this, we introduce a novel learning framework, termed~\textit{Block-Wise Masked Image Modeling} (BIM). This framework involves decomposing the MIM tasks into several sub-tasks with independent computation patterns, resulting in block-wise back-propagation operations instead of the traditional end-to-end approach. Our proposed BIM maintains superior performance compared to conventional MIM while greatly reducing peak memory consumption. Moreover, BIM naturally enables the concurrent training of numerous DNN backbones of varying depths. This leads to the creation of multiple trained DNN backbones, each tailored to different hardware platforms with distinct computing capabilities. This approach significantly reduces computational costs in comparison with training each DNN backbone individually. Our framework offers a promising solution for resource constrained training of MIM.
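A minimal sketch of the block-wise idea: each backbone block receives a detached input and feeds its own lightweight decoder and loss, so no back-propagation pass spans more than one block. All module names here, and the use of an MSE reconstruction target, are our own illustration rather than the paper's implementation:

```python
import torch.nn as nn
import torch.nn.functional as F

class BlockWiseMIM(nn.Module):
    def __init__(self, blocks, decoders):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)      # backbone stages
        self.decoders = nn.ModuleList(decoders)  # one light decoder per stage

    def forward(self, x, target):
        losses = []
        for block, decoder in zip(self.blocks, self.decoders):
            x = block(x.detach())                # cut the graph between blocks
            losses.append(F.mse_loss(decoder(x), target))
        return sum(losses)                       # peak memory ~ one block's graph
```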

Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2311.17216
  • repo_url: None
  • paper_authors: Hang Li, Chengzhi Shen, Philip Torr, Volker Tresp, Jindong Gu
  • for: Preventing diffusion-based text-to-image models from generating inappropriate content such as biased or harmful images.
  • methods: A novel self-supervised approach that discovers interpretable latent directions for a given concept inside the diffusion model's internal representation.
  • results: The discovered directions enable a simple mitigation approach supporting fair, safe, and responsible text-enhancing generation.
    Abstract Diffusion-based models have gained significant popularity for text-to-image generation due to their exceptional image-generation capabilities. A risk with these models is the potential generation of inappropriate content, such as biased or harmful images. However, the underlying reasons for generating such undesired content from the perspective of the diffusion model's internal representation remain unclear. Previous work interprets vectors in an interpretable latent space of diffusion models as semantic concepts. However, existing approaches cannot discover directions for arbitrary concepts, such as those related to inappropriate concepts. In this work, we propose a novel self-supervised approach to find interpretable latent directions for a given concept. With the discovered vectors, we further propose a simple approach to mitigate inappropriate generation. Extensive experiments have been conducted to verify the effectiveness of our mitigation approach, namely, for fair generation, safe generation, and responsible text-enhancing generation.

THInImg: Cross-modal Steganography for Presenting Talking Heads in Images

  • paper_url: http://arxiv.org/abs/2311.17177
  • repo_url: None
  • paper_authors: Lin Zhao, Hongxuan Li, Xuefei Ning, Xinru Jiang
  • for: Cross-modal steganography that hides lengthy audio data inside publicly available cover images without drawing attention.
  • methods: An encoder-decoder pipeline that exploits properties of the human face to hide audio in an identity image, with support for multiple levels of permission control.
  • results: Up to 80 seconds of high-quality talking-head video (including audio) can be embedded in a 160x160 identity image.
    Abstract Cross-modal steganography is the practice of unobtrusively concealing secret signals in publicly available cover signals (distinct from the modality of the secret signals). While previous approaches primarily concentrated on concealing a relatively small amount of information, we propose THInImg, which manages to hide lengthy audio data (and subsequently decode talking head video) inside an identity image by leveraging the properties of the human face, which can be effectively utilized for covert communication, transmission and copyright protection. THInImg consists of two parts: the encoder and decoder. Inside the encoder-decoder pipeline, we introduce a novel architecture that substantially increases the capacity for hiding audio in images. Moreover, our framework can be extended to iteratively hide multiple audio clips in an identity image, offering multiple levels of control over permissions. We conduct extensive experiments to prove the effectiveness of our method, demonstrating that THInImg can present up to 80 seconds of high quality talking-head video (including audio) in an identity image with 160x160 resolution.

Material Palette: Extraction of Materials from a Single Image

  • paper_url: http://arxiv.org/abs/2311.17060
  • repo_url: None
  • paper_authors: Ivan Lopes, Fabio Pizzati, Raoul de Charette
  • for: Extracting physically-based rendering (PBR) materials from a single real-world image.
  • methods: A two-step pipeline: a diffusion model maps image regions to material concepts and samples matching texture images, and a separate network decomposes the generated textures into Spatially Varying BRDFs (SVBRDFs) ready for rendering, trained with synthetic SVBRDF ground truth plus unsupervised domain adaptation (UDA) on diffusion-generated RGB textures.
  • results: Strong performance on synthetic and real-world datasets, with demonstrated editing of 3D scenes using materials estimated from real photographs; code and models will be open-sourced (project page: https://astra-vision.github.io/MaterialPalette/).
    Abstract In this paper, we propose a method to extract physically-based rendering (PBR) materials from a single real-world image. We do so in two steps: first, we map regions of the image to material concepts using a diffusion model, which allows the sampling of texture images resembling each material in the scene. Second, we benefit from a separate network to decompose the generated textures into Spatially Varying BRDFs (SVBRDFs), providing us with materials ready to be used in rendering applications. Our approach builds on existing synthetic material libraries with SVBRDF ground truth, but also exploits a diffusion-generated RGB texture dataset to allow generalization to new samples using unsupervised domain adaptation (UDA). Our contributions are thoroughly evaluated on synthetic and real-world datasets. We further demonstrate the applicability of our method for editing 3D scenes with materials estimated from real photographs. The code and models will be made open-source. Project page: https://astra-vision.github.io/MaterialPalette/

HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting

  • paper_url: http://arxiv.org/abs/2311.17061
  • repo_url: https://github.com/alvinliu0/HumanGaussian
  • paper_authors: Xian Liu, Xiaohang Zhan, Jiaxiang Tang, Ying Shan, Gang Zeng, Dahua Lin, Xihui Liu, Ziwei Liu
  • for: Generating high-quality 3D humans with fine-grained geometry and realistic appearance from text prompts, where existing SDS-based methods suffer from inadequate detail or excessive training time.
  • methods: A Structure-Aware SDS that jointly optimizes appearance and geometry by guiding Gaussian densification and pruning with multi-modal scores from RGB and depth space, plus an Annealed Negative Prompt Guidance that decomposes SDS into a noisier generative score and a cleaner classifier score to counter over-saturation, with a prune-only phase based on Gaussian size to remove floating artifacts.
  • results: Superior efficiency and competitive quality over existing methods, rendering vivid 3D humans under diverse scenarios (project page: https://alvinliu0.github.io/projects/HumanGaussian).
    Abstract Realistic 3D human generation from text prompts is a desirable yet challenging task. Existing methods optimize 3D representations like mesh or neural fields via score distillation sampling (SDS), which suffers from inadequate fine details or excessive training time. In this paper, we propose an efficient yet effective framework, HumanGaussian, that generates high-quality 3D humans with fine-grained geometry and realistic appearance. Our key insight is that 3D Gaussian Splatting is an efficient renderer with periodic Gaussian shrinkage or growing, where such adaptive density control can be naturally guided by intrinsic human structures. Specifically, 1) we first propose a Structure-Aware SDS that simultaneously optimizes human appearance and geometry. The multi-modal score function from both RGB and depth space is leveraged to distill the Gaussian densification and pruning process. 2) Moreover, we devise an Annealed Negative Prompt Guidance by decomposing SDS into a noisier generative score and a cleaner classifier score, which well addresses the over-saturation issue. The floating artifacts are further eliminated based on Gaussian size in a prune-only phase to enhance generation smoothness. Extensive experiments demonstrate the superior efficiency and competitive quality of our framework, rendering vivid 3D humans under diverse scenarios. Project Page: https://alvinliu0.github.io/projects/HumanGaussian

ReMoS: Reactive 3D Motion Synthesis for Two-Person Interactions

  • paper_url: http://arxiv.org/abs/2311.17057
  • repo_url: None
  • paper_authors: Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, Philipp Slusallek
  • for: Developing a method for synthesizing realistic two-person interactions in 3D human motion, addressing the complex dynamics of multi-human interactions.
  • methods: ReMoS, a denoising diffusion-based probabilistic model that synthesizes the reactive motion of the second person given the motion of the first, including full-body motions and hand interactions.
  • results: Strong performance across challenging two-person scenarios (pair-dancing, Ninjutsu, kickboxing, and acrobatics), with realistic and diverse motions for both individuals and adequate control for animators; the ReMoCap dataset of full-body and hand motions for two-person interactions is also introduced.
    Abstract Current approaches for 3D human motion synthesis can generate high-quality 3D animations of digital humans performing a wide variety of actions and gestures. However, there is still a notable technological gap in addressing the complex dynamics of multi-human interactions within this paradigm. In this work, we introduce ReMoS, a denoising diffusion-based probabilistic model for reactive motion synthesis that explores two-person interactions. Given the motion of one person, we synthesize the reactive motion of the second person to complete the interactions between the two. In addition to synthesizing the full-body motions, we also synthesize plausible hand interactions. We show the performance of ReMoS under a wide range of challenging two-person scenarios including pair-dancing, Ninjutsu, kickboxing, and acrobatics, where one person's movements have complex and diverse influences on the motions of the other. We further propose the ReMoCap dataset for two-person interactions consisting of full-body and hand motions. We evaluate our approach through multiple quantitative metrics, qualitative visualizations, and a user study. Our results are usable in interactive applications while also providing an adequate amount of control for animators.

Self-Supervised Motion Magnification by Backpropagating Through Optical Flow

  • paper_url: http://arxiv.org/abs/2311.17056
  • repo_url: https://github.com/dangeng/flowmag
  • paper_authors: Zhaoying Pan, Daniel Geng, Andrew Owens
  • for: A simple, self-supervised method for magnifying subtle motions in video: given an input video and a magnification factor, the video is manipulated so that its new optical flow is scaled by the desired amount.
  • methods: A loss function that estimates the optical flow of the generated video and penalizes its deviation from the target magnification, trained by differentiating through a pretrained optical flow network.
  • results: Good visual quality and quantitative metrics on real-world and synthetic videos, with test-time adaptation on the input video, optional restriction to user-selected objects, and compatibility with both supervised and unsupervised optical flow methods.
    Abstract This paper presents a simple, self-supervised method for magnifying subtle motions in video: given an input video and a magnification factor, we manipulate the video such that its new optical flow is scaled by the desired amount. To train our model, we propose a loss function that estimates the optical flow of the generated video and penalizes how far it deviates from the given magnification factor. Thus, training involves differentiating through a pretrained optical flow network. Since our model is self-supervised, we can further improve its performance through test-time adaptation, by finetuning it on the input video. It can also be easily extended to magnify the motions of only user-selected objects. Our approach avoids the need for synthetic magnification datasets that have been used to train prior learning-based approaches. Instead, it leverages the existing capabilities of off-the-shelf motion estimators. We demonstrate the effectiveness of our method through evaluations of both visual quality and quantitative metrics on a range of real-world and synthetic videos, and we show our method works for both supervised and unsupervised optical flow methods.
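A minimal PyTorch sketch of the described loss, differentiating through a frozen, pretrained flow estimator (e.g., RAFT); function and tensor names are ours, and the paper may add further regularizers:

```python
import torch

def magnification_loss(flow_net, frame_a, frame_b, frame_b_mag, alpha):
    # Flow of the generated (magnified) pair stays differentiable, so
    # gradients reach the generator through the frozen flow network.
    flow_gen = flow_net(frame_a, frame_b_mag)
    with torch.no_grad():                       # target: original flow, scaled
        flow_ref = flow_net(frame_a, frame_b)
    return torch.mean(torch.abs(flow_gen - alpha * flow_ref))
```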

Rethinking Directional Integration in Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2311.16504
  • repo_url: None
  • paper_authors: Congyue Deng, Jiawei Yang, Leonidas Guibas, Yue Wang
  • for: Improving the rendering quality of view-dependent effects in NeRF.
  • methods: A simple modification of the NeRF rendering equation that swaps the integration operator and the direction decoder network: positional features are integrated along the ray while directional terms move outside the integration, disentangling view-dependent and view-independent components.
  • results: Consistent quality improvements across NeRF variations, better convergence with lower error accumulation under network approximation and numerical integration, and an interpretation as light field rendering with learned ray embeddings.
    Abstract Recent works use the Neural radiance field (NeRF) to perform multi-view 3D reconstruction, providing a significant leap in rendering photorealistic scenes. However, despite its efficacy, NeRF exhibits limited capability of learning view-dependent effects compared to light field rendering or image-based view synthesis. To that end, we introduce a modification to the NeRF rendering equation which is as simple as a few lines of code change for any NeRF variations, while greatly improving the rendering quality of view-dependent effects. By swapping the integration operator and the direction decoder network, we only integrate the positional features along the ray and move the directional terms out of the integration, resulting in a disentanglement of the view-dependent and independent components. The modified equation is equivalent to the classical volumetric rendering in ideal cases on object surfaces with Dirac densities. Furthermore, we prove that with the errors caused by network approximation and numerical integration, our rendering equation exhibits better convergence properties with lower error accumulations compared to the classical NeRF. We also show that the modified equation can be interpreted as light field rendering with learned ray embeddings. Experiments on different NeRF variations show consistent improvements in the quality of view-dependent effects with our simple modification.
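In volume-rendering notation we adopt for illustration (transmittance $T$, density $\sigma$, radiance $c$, positional features $\mathbf{f}$, direction decoder $\mathrm{Dec}$), the swap amounts to moving the directional decoding outside the ray integral:

$$ C_{\mathrm{NeRF}}(\mathbf{r}, \mathbf{d}) = \int_{t_n}^{t_f} T(t)\,\sigma(t)\,c(t, \mathbf{d})\,dt \quad\longrightarrow\quad C(\mathbf{r}, \mathbf{d}) = \mathrm{Dec}\!\left(\int_{t_n}^{t_f} T(t)\,\sigma(t)\,\mathbf{f}(t)\,dt,\; \mathbf{d}\right), $$

with $T(t) = \exp\!\big(-\int_{t_n}^{t} \sigma(s)\,ds\big)$, so only view-independent features are integrated and the view-dependent decoding happens once per ray.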

Surf-D: High-Quality Surface Generation for Arbitrary Topologies using Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.17050
  • repo_url: https://github.com/Yzmblog/SurfD
  • paper_authors: Zhengming Yu, Zhiyang Dou, Xiaoxiao Long, Cheng Lin, Zekun Li, Yuan Liu, Norman Müller, Taku Komura, Marc Habermann, Christian Theobalt, Xin Li, Wenping Wang
  • for: Generating high-quality 3D shapes of arbitrary topology as surfaces with diffusion models, where prior representations limit topology and geometric detail.
  • methods: Unsigned Distance Fields (UDF) as the surface representation; a point-based auto-encoder learns a compact latent space that supports gradient queries at any point via differentiation, curriculum learning eases the embedding of diverse surfaces, and a latent diffusion model captures the shape distribution.
  • results: Superior performance across unconditional generation, category-conditional generation, 3D reconstruction from images, and text-to-shape tasks.
    Abstract In this paper, we present Surf-D, a novel method for generating high-quality 3D shapes as Surfaces with arbitrary topologies using Diffusion models. Specifically, we adopt Unsigned Distance Field (UDF) as the surface representation, as it excels in handling arbitrary topologies, enabling the generation of complex shapes. While the prior methods explored shape generation with different representations, they suffer from limited topologies and geometry details. Moreover, it's non-trivial to directly extend prior diffusion models to UDF because they lack spatial continuity due to the discrete volume structure. However, UDF requires accurate gradients for mesh extraction and learning. To tackle the issues, we first leverage a point-based auto-encoder to learn a compact latent space, which supports gradient querying for any input point through differentiation to effectively capture intricate geometry at a high resolution. Since the learning difficulty for various shapes can differ, a curriculum learning strategy is employed to efficiently embed various surfaces, enhancing the whole embedding process. With pretrained shape latent space, we employ a latent diffusion model to acquire the distribution of various shapes. Our approach demonstrates superior performance in shape generation across multiple modalities and conducts extensive experiments in unconditional generation, category conditional generation, 3D reconstruction from images, and text-to-shape tasks.
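The gradient query that mesh extraction relies on can be obtained by automatic differentiation through the latent decoder; the sketch below uses hypothetical `decoder`/`latent` names and is not the paper's code:

```python
import torch
import torch.nn.functional as F

def udf_and_gradient(decoder, latent, points):
    # points: (N, 3) arbitrary query locations.
    points = points.detach().requires_grad_(True)
    udf = decoder(latent, points)                   # (N, 1) unsigned distances
    grad = torch.autograd.grad(udf.sum(), points, create_graph=True)[0]
    return udf, F.normalize(grad, dim=-1)           # distance + surface-normal proxy
```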

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

  • paper_url: http://arxiv.org/abs/2311.17048
  • repo_url: None
  • paper_authors: Zeyu Han, Fangrui Zhu, Qianru Lao, Huaizu Jiang
  • for: Improving zero-shot referring expression comprehension, i.e., localizing bounding boxes in an image from textual prompts.
  • methods: Large foundation models disentangle both images and texts into (subject, predicate, object) triplets; a VLA model computes a structural similarity matrix between visual and textual triplets that is propagated to an instance-level similarity matrix, and a triplet-matching objective fine-tunes the VLA model on a curated dataset rich in entity relationships.
  • results: Up to a 19.5% improvement over the SOTA zero-shot model on RefCOCO/+/g, and accuracy comparable to a fully supervised model on the more challenging Who's Waldo dataset.
    Abstract Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to the provided textual prompts, which requires: (i) a fine-grained disentanglement of complex visual scene and textual context, and (ii) a capacity to understand relationships among disentangled entities. Unfortunately, existing large vision-language alignment (VLA) models, e.g., CLIP, struggle with both aspects so cannot be directly used for this task. To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object). After that, grounding is accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model, and subsequently propagate it to an instance-level similarity matrix. Furthermore, to equip VLA models with the ability of relationship understanding, we design a triplet-matching objective to fine-tune the VLA models on a collection of curated dataset containing abundant entity relationships. Experiments demonstrate that our visual grounding performance increase of up to 19.5% over the SOTA zero-shot model on RefCOCO/+/g. On the more challenging Who's Waldo dataset, our zero-shot approach achieves comparable accuracy to the fully supervised model.
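A minimal sketch of the structural similarity step: with (subject, predicate, object) embeddings from a VLA model for both modalities, triplet-level similarity can be computed as slot-wise cosine similarity, averaged. Tensor names and the averaging choice are our assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def triplet_similarity(vis_feats, txt_feats):
    # vis_feats: (N, 3, D) visual triplet embeddings for N image regions;
    # txt_feats: (M, 3, D) textual triplet embeddings for M expressions.
    v = F.normalize(vis_feats, dim=-1)
    t = F.normalize(txt_feats, dim=-1)
    # Cosine similarity per triplet slot, then averaged into an (N, M)
    # matrix to be propagated to instance-level grounding scores.
    return torch.einsum('nkd,mkd->nmk', v, t).mean(dim=-1)
```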

TLControl: Trajectory and Language Control for Human Motion Synthesis

  • paper_url: http://arxiv.org/abs/2311.17135
  • repo_url: None
  • paper_authors: Weilin Wan, Zhiyang Dou, Taku Komura, Wenping Wang, Dinesh Jayaraman, Lingjie Liu
  • for: 可控的人体运动合成是AR/VR、游戏、电影和具身AI等应用的基本需求。现有方法通常只关注语言或完整轨迹控制,难以精准地合成与用户指定轨迹对齐的动作,尤其是多关节控制。
  • methods: 我们提出了一种新的TLControl方法,同时具备低层轨迹控制和高层语言语义控制。我们首先训练VQ-VAE,学习一个按身体部位组织的紧凑潜在运动空间。然后,我们提出一种掩码轨迹Transformer,以用户提供的部分轨迹和文本描述为条件,基于学习到的潜在运动空间对关节的完整轨迹做出粗略的初始预测。最后,我们引入高效的测试时优化,细化这些初始预测以实现精确的轨迹控制。
  • results: 实验表明,TLControl在轨迹准确性和时间效率两个方面都高于当前状态的方法,使其成为实时交互和高质量动画生成的实际应用。
    Abstract Controllable human motion synthesis is essential for applications in AR/VR, gaming, movies, and embodied AI. Existing methods often focus solely on either language or full trajectory control, lacking precision in synthesizing motions aligned with user-specified trajectories, especially for multi-joint control. To address these issues, we present TLControl, a new method for realistic human motion synthesis, incorporating both low-level trajectory and high-level language semantics controls. Specifically, we first train a VQ-VAE to learn a compact latent motion space organized by body parts. We then propose a Masked Trajectories Transformer to make coarse initial predictions of full trajectories of joints based on the learned latent motion space, with user-specified partial trajectories and text descriptions as conditioning. Finally, we introduce an efficient test-time optimization to refine these coarse predictions for accurate trajectory control. Experiments demonstrate that TLControl outperforms the state-of-the-art in trajectory accuracy and time efficiency, making it practical for interactive and high-quality animation generation.
    摘要 可控的人体运动合成是AR/VR、游戏、电影和具身AI等应用的关键技术。现有方法通常只关注语言或完整轨迹控制,难以精准合成与用户指定轨迹对齐的动作,尤其是多关节控制。为解决这些问题,我们提出了TLControl,一种新的人体运动合成方法,同时具备低层轨迹和高层语言语义控制。我们首先训练VQ-VAE来学习一个按身体部位组织的紧凑潜在运动空间。然后,我们提出掩码轨迹Transformer,以用户指定的部分轨迹和文本描述为条件,基于学习到的潜在运动空间对关节完整轨迹做出粗略的初始预测。最后,我们引入高效的测试时优化来细化这些粗略预测,实现精确的轨迹控制。实验表明,TLControl在轨迹精度和时间效率上均超越了当前最先进方法,适用于交互式高质量动画生成。

Adversarial Diffusion Distillation

  • paper_url: http://arxiv.org/abs/2311.17042
  • repo_url: https://github.com/stability-ai/generative-models
  • paper_authors: Axel Sauer, Dominik Lorenz, Andreas Blattmann, Robin Rombach
  • for: 本研究旨在提出一种新的训练方法,能够在1-4步内高效地对大规模基础图像扩散模型进行采样,同时保持高图像质量。
  • methods: 本研究使用分数蒸馏方法,利用大规模现成的图像扩散模型作为教师信号,并与对抗损失相结合,以在极少采样步数下仍保证高图像保真度。
  • results: 我们的分析表明,我们的模型在一步和四步情况下都能够明显超越现有的几步方法(GANs、潜在一致模型),并达到当前扩散模型(SDXL)的性能水平。 ADD 是首个实现单步、实时图像生成的基础模型。
    Abstract We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1-4 steps while maintaining high image quality. We use score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal in combination with an adversarial loss to ensure high image fidelity even in the low-step regime of one or two sampling steps. Our analyses show that our model clearly outperforms existing few-step methods (GANs, Latent Consistency Models) in a single step and reaches the performance of state-of-the-art diffusion models (SDXL) in only four steps. ADD is the first method to unlock single-step, real-time image synthesis with foundation models. Code and weights available under https://github.com/Stability-AI/generative-models and https://huggingface.co/stabilityai/ .
    摘要 我们介绍了一种新的训练方法,称为对抗扩散蒸馏(ADD),只需1-4步即可高效地对大规模基础图像扩散模型进行采样,同时保持高图像质量。我们使用分数蒸馏,利用大规模现成的图像扩散模型作为教师信号,并结合对抗损失,即使在一到两步采样的低步数情形下也能保证高图像保真度。我们的分析表明,我们的模型在单步情况下明显优于现有的少步方法(GAN、潜在一致性模型),并在仅四步内达到最先进扩散模型(SDXL)的性能。ADD是首个实现基础模型单步实时图像合成的方法。代码和权重可在 https://github.com/Stability-AI/generative-models 和 https://huggingface.co/stabilityai/ 获取。
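
For intuition, here is a hedged sketch of a two-term ADD-style student objective, combining an adversarial term with a distillation term against a frozen teacher. `student`, `teacher`, and `discriminator` are placeholder modules, and the weighting and formulation are simplified relative to the released implementation.

```python
# Simplified two-term objective in the spirit of adversarial diffusion
# distillation: adversarial realism + pull toward the frozen teacher.
import torch
import torch.nn.functional as F

def add_student_loss(student, teacher, discriminator, x_noisy, t, lam=2.5):
    x_hat = student(x_noisy, t)                 # one-step student sample

    # Adversarial term: the student tries to make x_hat look real to the
    # discriminator (assumed to return per-sample realness logits).
    adv = -discriminator(x_hat).mean()

    # Distillation term: match the frozen teacher's denoised estimate of the
    # same noisy input; gradients flow only through the student's output.
    with torch.no_grad():
        x_teacher = teacher(x_noisy, t)
    distill = F.mse_loss(x_hat, x_teacher)

    return adv + lam * distill
```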

Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence

  • paper_url: http://arxiv.org/abs/2311.17034
  • repo_url: None
  • paper_authors: Junyi Zhang, Charles Herrmann, Junhwa Hur, Eric Chen, Varun Jampani, Deqing Sun, Ming-Hsuan Yang
  • for: 本研究旨在提升语义对应的表现,并探讨现有基础模型特征的局限。
  • methods: 本研究采用简单而有效的方案,将几何信息融入语义对应。
  • results: 我们的方法在具有挑战性的SPair-71k数据集上取得了64.2(零样本)和85.6(监督)的PCK@0.10分数,分别比最先进模型高出4.3和11.0个百分点的绝对增益。
    Abstract While pre-trained large-scale vision models have shown significant promise for semantic correspondence, their features often struggle to grasp the geometry and orientation of instances. This paper identifies the importance of being geometry-aware for semantic correspondence and reveals a limitation of the features of current foundation models under simple post-processing. We show that incorporating this information can markedly enhance semantic correspondence performance with simple but effective solutions in both zero-shot and supervised settings. We also construct a new challenging benchmark for semantic correspondence built from an existing animal pose estimation dataset, for both pre-training validating models. Our method achieves a PCK@0.10 score of 64.2 (zero-shot) and 85.6 (supervised) on the challenging SPair-71k dataset, outperforming the state-of-the-art by 4.3p and 11.0p absolute gains, respectively. Our code and datasets will be publicly available.
    摘要 大规模预训练视觉模型在语义对应方面已展现出显著潜力,但其特征往往难以把握实例的几何形状和朝向。本文指出几何感知对语义对应的重要性,并揭示了现有基础模型特征在简单后处理下的局限。我们表明,通过简单而有效的方案将这些几何信息纳入,可以在零样本和监督两种设定下显著提升语义对应性能。我们还基于现有的动物姿态估计数据集构建了一个新的、具有挑战性的语义对应基准,用于预训练和验证模型。我们的方法在具有挑战性的SPair-71k数据集上取得了64.2(零样本)和85.6(监督)的PCK@0.10分数,分别比最先进方法高出4.3和11.0个百分点。我们的代码和数据集将公开发布。

Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features

  • paper_url: http://arxiv.org/abs/2311.17024
  • repo_url: None
  • paper_authors: Niladri Shekhar Dutt, Sanjeev Muralikrishnan, Niloy J. Mitra
  • for: 本文旨在提供一种简单、鲁棒且与类别无关的特征描述子,用于处理无纹理的输入形状(网格或点云)。
  • methods: 我们的方法由输入形状生成深度图和法向图,作为条件图像生成的引导,并在此过程中得到2D的扩散特征。我们发现,即使对输入形状进行多视角渲染得到的条件图像生成结果并不一致,与之关联的图像特征仍然足够鲁棒,可以直接在原始表面上聚合。
  • results: 我们在多个基准(SHREC'19、SHREC'20和TOSCA)上进行了广泛的实验,证明我们的特征无需额外数据或训练,即可在不同形状家族之间建立可靠的对应关系。
    Abstract We present Diff3F as a simple, robust, and class-agnostic feature descriptor that can be computed for untextured input shapes (meshes or point clouds). Our method distills diffusion features from image foundational models onto input shapes. Specifically, we use the input shapes to produce depth and normal maps as guidance for conditional image synthesis, and in the process produce (diffusion) features in 2D that we subsequently lift and aggregate on the original surface. Our key observation is that even if the conditional image generations obtained from multi-view rendering of the input shapes are inconsistent, the associated image features are robust and can be directly aggregated across views. This produces semantic features on the input shapes, without requiring additional data or training. We perform extensive experiments on multiple benchmarks (SHREC'19, SHREC'20, and TOSCA) and demonstrate that our features, being semantic instead of geometric, produce reliable correspondence across both isometeric and non-isometrically related shape families.
    摘要 我们提出Diff3F,一种简单、鲁棒且与类别无关的特征描述子,可用于无纹理的输入形状(网格或点云)。我们的方法将图像基础模型中的扩散特征蒸馏到输入形状上。具体而言,我们利用输入形状生成深度图和法向图作为条件图像合成的引导,并在此过程中得到2D的扩散特征,随后将其提升并聚合到原始表面上。我们的关键观察是,即使由输入形状的多视角渲染得到的条件图像生成结果并不一致,与之关联的图像特征仍然鲁棒,可以直接跨视角聚合。这样便可在输入形状上获得语义特征,而无需额外数据或训练。我们在多个基准(SHREC'19、SHREC'20和TOSCA)上进行了广泛的实验,证明我们的特征是语义的而非几何的,能够在等距与非等距相关的形状家族之间建立可靠的对应关系。

Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer

  • paper_url: http://arxiv.org/abs/2311.17009
  • repo_url: None
  • paper_authors: Danah Yatim, Rafail Fridman, Omer Bar Tal, Yoni Kasten, Tali Dekel
  • for: 本研究旨在实现文本驱动的运动迁移:合成一段符合文本提示所描述的目标对象与场景的视频,同时保持输入视频的运动和场景布局。
  • methods: 本研究使用了一种新的文本驱动动作传输方法,利用预训练的和固定的文本-视频扩散模型,以获得生成和动作优先。方法的核心是一种新的空间-时间特征损失,该损失引导生成过程,以保持输入视频的总动作特征,同时遵循目标对象的形状和细部动作特征。
  • results: 研究人员通过实验和比较分析,发现本方法可以在不同形状和动作特征的目标对象和场景中实现高质量的动作传输,而且比传统方法更加灵活和可靠。
    Abstract We present a new method for text-driven motion transfer - synthesizing a video that complies with an input text prompt describing the target objects and scene while maintaining an input video's motion and scene layout. Prior methods are confined to transferring motion across two subjects within the same or closely related object categories and are applicable for limited domains (e.g., humans). In this work, we consider a significantly more challenging setting in which the target and source objects differ drastically in shape and fine-grained motion characteristics (e.g., translating a jumping dog into a dolphin). To this end, we leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors. The pillar of our method is a new space-time feature loss derived directly from the model. This loss guides the generation process to preserve the overall motion of the input video while complying with the target object in terms of shape and fine-grained motion traits.
    摘要 我们提出了一种新的文本驱动运动迁移方法——合成一段符合描述目标对象与场景的输入文本提示的视频,同时保持输入视频的运动和场景布局。先前的方法局限于在同一或紧密相关的物体类别内的两个主体之间迁移运动,且仅适用于有限领域(例如人类)。在本工作中,我们考虑一个更具挑战性的设定:目标对象与源对象在形状和细粒度运动特征上差异巨大(例如将跳跃的狗变为海豚)。为此,我们利用一个预训练且固定的文本到视频扩散模型,它为我们提供了生成先验和运动先验。我们方法的核心是一种直接从该模型导出的新的时空特征损失,该损失引导生成过程在保持输入视频整体运动的同时,在形状和细粒度运动特征方面符合目标对象。

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

  • paper_url: http://arxiv.org/abs/2311.17005
  • repo_url: https://github.com/opengvlab/ask-anything
  • paper_authors: Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao
  • for: 评估多模态大语言模型在多样视频任务上的时间理解能力。
  • methods: 提出一种新颖的静态到动态方法来定义时间相关任务,并依据任务定义自动将公开的视频标注转换为多选问答,用于评估每个任务。
  • results: 现有MLLM在时间理解方面表现欠佳,而我们的VideoChat2在MVBench上比领先模型至少高出15%。
    Abstract With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in the static image tasks, while overlooking temporal understanding in the dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On one hand, such a distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., VideoChat2, by progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that, the existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 largely surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.
    摘要 随着多模态大语言模型(MLLM)的快速发展,近来涌现出许多诊断基准来评估这些模型的理解能力。然而,大多数基准主要评估静态图像任务中的空间理解,而忽略了动态视频任务中的时间理解。为解决这一问题,我们提出了一个全面的多模态视频理解基准,即MVBench,它涵盖20个无法通过单帧有效解决的高难度视频任务。具体而言,我们首先提出一种新颖的静态到动态方法来定义这些时间相关任务:通过将各类静态任务转化为动态任务,系统性地生成需要从感知到认知等广泛时间技能的视频任务。随后,在任务定义的指导下,我们自动将公开的视频标注转换为多选问答来评估每个任务。一方面,这种独特的范式使我们能够高效构建MVBench,几乎无需人工干预;另一方面,它借助真实的视频标注保证了评估的公平性,避免LLM打分的偏差。此外,我们还通过使用多样指令微调数据的渐进式多模态训练,开发了一个鲁棒的视频MLLM基线,即VideoChat2。在MVBench上的大量结果表明,现有MLLM在时间理解方面远不能令人满意,而我们的VideoChat2在MVBench上大幅超越这些领先模型15%以上。所有模型和数据都可以在 https://github.com/OpenGVLab/Ask-Anything 上获取。
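
The automatic annotation-to-QA conversion described above can be pictured with a small sketch: sample distractors from an answer pool and shuffle them with the ground-truth label. Field names and templates are assumptions, not MVBench's actual schema.

```python
# Illustrative conversion of a video annotation into one multiple-choice item.
import random

def to_multiple_choice(annotation: dict, answer_pool: list, n_options: int = 4) -> dict:
    """Build one QA item from a video annotation plus a pool of distractors."""
    correct = annotation["label"]
    distractors = [a for a in answer_pool if a != correct]
    options = random.sample(distractors, n_options - 1) + [correct]
    random.shuffle(options)
    return {
        "video": annotation["video_id"],
        "question": annotation["question_template"],
        "options": options,
        "answer_index": options.index(correct),  # ground truth for fair scoring
    }

item = to_multiple_choice(
    {"video_id": "v001", "label": "picking up", "question_template": "What is the person doing?"},
    answer_pool=["picking up", "putting down", "throwing", "opening", "closing"],
)
print(item["options"], item["answer_index"])
```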

Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following

  • paper_url: http://arxiv.org/abs/2311.17002
  • repo_url: https://github.com/Ranni-T2I/Ranni
  • paper_authors: Yutong Feng, Biao Gong, Di Chen, Yujun Shen, Yu Liu, Jingren Zhou
  • for: 这个研究的目的是提高文本到图像(T2I)扩散模型的文本控制能力,特别是处理复杂的提示语、对象特性绑定和多主体描述。
  • methods: 该研究使用语义面板作为中间件,借助大语言模型解析文本中的视觉概念并加以排列,随后将其注入去噪网络作为详细控制信号,以改善 T2I 生成器的文本控制能力。
  • results: 该研究表明,通过使用 semantic panel,T2I 生成器的文本控制能力得到了提高,并且可以通过直接修改面板中的元素或使用语言指令来进行细致的自定义生成。此外,该研究还开发了一个实用的系统,并在连续生成和协作编辑中展示了其潜力。
    Abstract Existing text-to-image (T2I) diffusion models usually struggle in interpreting complex prompts, especially those with quantity, object-attribute binding, and multi-subject descriptions. In this work, we introduce a semantic panel as the middleware in decoding texts to images, supporting the generator to better follow instructions. The panel is obtained through arranging the visual concepts parsed from the input text by the aid of large language models, and then injected into the denoising network as a detailed control signal to complement the text condition. To facilitate text-to-panel learning, we come up with a carefully designed semantic formatting protocol, accompanied by a fully-automatic data preparation pipeline. Thanks to such a design, our approach, which we call Ranni, manages to enhance a pre-trained T2I generator regarding its textual controllability. More importantly, the introduction of the generative middleware brings a more convenient form of interaction (i.e., directly adjusting the elements in the panel or using language instructions) and further allows users to finely customize their generation, based on which we develop a practical system and showcase its potential in continuous generation and chatting-based editing.
    摘要 现有的文本到图像(T2I)扩散模型通常难以解释复杂的提示,尤其是包含数量、对象-属性绑定和多主体描述的提示。在这项工作中,我们引入语义面板作为解码文本到图像的中间件,帮助生成器更好地遵循指令。该面板通过借助大语言模型解析输入文本中的视觉概念并加以排列获得,随后作为详细的控制信号注入到去噪网络中,以补充文本条件。为便于文本到面板的学习,我们设计了精心的语义格式化协议,并配套全自动的数据准备流程。得益于这一设计,我们的方法(称为Ranni)能够增强预训练T2I生成器的文本可控性。更重要的是,生成中间件的引入带来了更便捷的交互形式(即直接调整面板中的元素或使用语言指令),并进一步允许用户精细定制生成结果。基于此,我们开发了一个实用系统,并展示了其在连续生成和基于聊天的编辑中的潜力。

COLE: A Hierarchical Generation Framework for Graphic Design

  • paper_url: http://arxiv.org/abs/2311.16974
  • repo_url: https://github.com/JarekPaulDonald/COLE
  • paper_authors: Peidong Jia, Chenxuan Li, Zeyu Liu, Yichao Shen, Xingru Chen, Yuhui Yuan, Yinglin Zheng, Dong Chen, Ji Li, Xiaodong Xie, Shanghang Zhang, Baining Guo
  • for: 本研究旨在提出一种可以从用户意图生成高质量图形设计的框架,以解决现代广告设计中的创新和创造力问题。
  • methods: 本研究使用了一种层次分解的生成框架,将复杂的文本到设计生成任务分解成一系列更简单的子任务,每个子任务由特定的模型进行处理,然后将这些模型的输出结果集成起来生成具有一致性的最终输出。
  • results: 对比 existed 方法,本研究的 COLE 系统在生成高质量图形设计方面表现出了显著的优势,并且支持用户输入的灵活编辑。
    Abstract Graphic design, which has been evolving since the 15th century, plays a crucial role in advertising. The creation of high-quality designs demands creativity, innovation, and lateral thinking. This intricate task involves understanding the objective, crafting visual elements such as the background, decoration, font, color, and shape, formulating diverse professional layouts, and adhering to fundamental visual design principles. In this paper, we introduce COLE, a hierarchical generation framework designed to comprehensively address these challenges. This COLE system can transform a straightforward intention prompt into a high-quality graphic design, while also supporting flexible editing based on user input. Examples of such input might include directives like ``design a poster for Hisaishi's concert.'' The key insight is to dissect the complex task of text-to-design generation into a hierarchy of simpler sub-tasks, each addressed by specialized models working collaboratively. The results from these models are then consolidated to produce a cohesive final output. Our hierarchical task decomposition can streamline the complex process and significantly enhance generation reliability. Our COLE system consists of multiple fine-tuned Large Language Models (LLMs), Large Multimodal Models (LMMs), and Diffusion Models (DMs), each specifically tailored for a design-aware text or image generation task. Furthermore, we construct the DESIGNERINTENTION benchmark to highlight the superiority of our COLE over existing methods in generating high-quality graphic designs from user intent. We perceive our COLE as an important step towards addressing more complex visual design generation tasks in the future.
    摘要 平面设计自15世纪以来不断演进,在广告中扮演着关键角色。创作高质量的设计需要创意、创新和发散思维。这项复杂的任务包括理解设计目标,制作背景、装饰、字体、颜色和形状等视觉元素,制定多样的专业布局,并遵循基本的视觉设计原则。在本文中,我们介绍了COLE,一个旨在全面应对这些挑战的层次化生成框架。COLE系统可以将简单的意图提示转化为高质量的平面设计,同时支持基于用户输入的灵活编辑。此类输入的示例包括"为久石让的音乐会设计一张海报"之类的指令。其关键思想在于,将文本到设计生成这一复杂任务分解为一个由更简单子任务构成的层次结构,每个子任务由专门的模型协同处理,再将这些模型的结果整合,产生连贯的最终输出。我们的层次化任务分解能够简化复杂流程,并显著提升生成的可靠性。COLE系统由多个精调的大语言模型(LLM)、大型多模态模型(LMM)和扩散模型(DM)组成,每个模型都针对具备设计意识的文本或图像生成任务进行了专门定制。此外,我们构建了DESIGNERINTENTION基准,以展示COLE在根据用户意图生成高质量平面设计方面对现有方法的优势。我们认为COLE是迈向未来更复杂视觉设计生成任务的重要一步。

HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion

  • paper_url: http://arxiv.org/abs/2311.16961
  • repo_url: None
  • paper_authors: Jingbo Zhang, Xiaoyu Li, Qi Zhang, Yanpei Cao, Ying Shan, Jing Liao
  • for: 从单张参考图像生成3D人体模型
  • methods: 提出了一种参考引导的分数蒸馏采样方法(Ref-SDS),并引入区域感知注意力,以保证生成的3D人体模型细节丰富且与参考图像保持一致
  • results: 实验结果表明,该方法在生成具有精细几何、逼真纹理且跨视角一致外观的3D人体模型方面优于先前方法
    Abstract Generating a 3D human model from a single reference image is challenging because it requires inferring textures and geometries in invisible views while maintaining consistency with the reference image. Previous methods utilizing 3D generative models are limited by the availability of 3D training data. Optimization-based methods that lift text-to-image diffusion models to 3D generation often fail to preserve the texture details of the reference image, resulting in inconsistent appearances in different views. In this paper, we propose HumanRef, a 3D human generation framework from a single-view input. To ensure the generated 3D model is photorealistic and consistent with the input image, HumanRef introduces a novel method called reference-guided score distillation sampling (Ref-SDS), which effectively incorporates image guidance into the generation process. Furthermore, we introduce region-aware attention to Ref-SDS, ensuring accurate correspondence between different body regions. Experimental results demonstrate that HumanRef outperforms state-of-the-art methods in generating 3D clothed humans with fine geometry, photorealistic textures, and view-consistent appearances.
    摘要 从单张参考图像生成3D人体模型具有挑战性,因为需要在不可见视角下推断纹理和几何结构,同时保持与参考图像的一致性。先前利用3D生成模型的方法受限于3D训练数据的可用性;而将文本到图像扩散模型提升到3D生成的基于优化的方法往往无法保留参考图像的纹理细节,导致不同视角下外观不一致。在本文中,我们提出HumanRef,一种基于单视角输入的3D人体生成框架。为确保生成的3D模型逼真且与输入图像一致,HumanRef引入了一种称为参考引导分数蒸馏采样(Ref-SDS)的新方法,将图像引导有效地融入生成过程。此外,我们在Ref-SDS中引入区域感知注意力,确保不同身体区域之间的精确对应。实验结果表明,HumanRef在生成具有精细几何、逼真纹理和视角一致外观的3D着装人体方面优于最先进方法。

UC-NeRF: Neural Radiance Field for Under-Calibrated multi-view cameras in autonomous driving

  • paper_url: http://arxiv.org/abs/2311.16945
  • repo_url: None
  • paper_authors: Kai Cheng, Xiaoxiao Long, Wei Yin, Jin Wang, Zhiqiang Wu, Yuexin Ma, Kaixuan Wang, Xiaozhi Chen, Xuejin Chen
  • for: 该文章主要解决多摄像头系统下的新视角合成问题,即在欠标定的多摄像头系统中实现高质量的新视角合成。
  • methods: 文章提出了三项技术来解决上述问题:分层颜色校正、虚拟视角变形,以及时空约束的位姿精修。
  • results: 实验结果表明,UC-NeRF方法能够在欠标定的多视角相机系统中准确地合成新视角,并能利用合成的新视角有效提升大规模户外场景中的深度估计。
    Abstract Multi-camera setups find widespread use across various applications, such as autonomous driving, as they greatly expand sensing capabilities. Despite the fast development of Neural radiance field (NeRF) techniques and their wide applications in both indoor and outdoor scenes, applying NeRF to multi-camera systems remains very challenging. This is primarily due to the inherent under-calibration issues in multi-camera setup, including inconsistent imaging effects stemming from separately calibrated image signal processing units in diverse cameras, and system errors arising from mechanical vibrations during driving that affect relative camera poses. In this paper, we present UC-NeRF, a novel method tailored for novel view synthesis in under-calibrated multi-view camera systems. Firstly, we propose a layer-based color correction to rectify the color inconsistency in different image regions. Second, we propose virtual warping to generate more viewpoint-diverse but color-consistent virtual views for color correction and 3D recovery. Finally, a spatiotemporally constrained pose refinement is designed for more robust and accurate pose calibration in multi-camera systems. Our method not only achieves state-of-the-art performance of novel view synthesis in multi-camera setups, but also effectively facilitates depth estimation in large-scale outdoor scenes with the synthesized novel views.
    摘要 多摄像头设置在自动驾驶等各类应用中被广泛使用,因为它们能大幅扩展感知能力。尽管神经辐射场(NeRF)技术发展迅速,并在室内外场景中得到广泛应用,但将NeRF应用于多摄像头系统仍然非常困难。这主要源于多摄像头设置固有的欠标定问题,包括不同相机各自独立标定的图像信号处理单元造成的成像效果不一致,以及行驶过程中机械振动影响相机相对位姿所带来的系统误差。在本文中,我们提出了UC-NeRF,一种面向欠标定多视角相机系统新视角合成的新方法。首先,我们提出基于分层的颜色校正来纠正不同图像区域的颜色不一致;其次,我们提出虚拟视角变形,以生成视角更多样但颜色一致的虚拟视图,用于颜色校正和3D恢复;最后,我们设计了时空约束的位姿精修,以在多摄像头系统中实现更鲁棒、更准确的位姿标定。我们的方法不仅在多摄像头设置下取得了最先进的新视角合成性能,还能利用合成的新视角有效促进大规模户外场景的深度估计。
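
As one way to picture the color-correction component, the sketch below gives each camera a learnable affine transform (gain matrix plus bias) applied to rendered colors. The paper's layer-based design is richer; this only illustrates the per-camera correction idea.

```python
# Per-camera learnable affine color correction, initialized to identity.
import torch
import torch.nn as nn

class PerCameraColorCorrection(nn.Module):
    def __init__(self, num_cameras: int):
        super().__init__()
        # One 3x3 gain matrix and one RGB bias per camera.
        self.gain = nn.Parameter(torch.eye(3).repeat(num_cameras, 1, 1))
        self.bias = nn.Parameter(torch.zeros(num_cameras, 3))

    def forward(self, rgb: torch.Tensor, cam_idx: torch.Tensor) -> torch.Tensor:
        """rgb: (N, 3) rendered colors; cam_idx: (N,) camera index per ray."""
        corrected = torch.einsum("nij,nj->ni", self.gain[cam_idx], rgb)
        return (corrected + self.bias[cam_idx]).clamp(0.0, 1.0)

cc = PerCameraColorCorrection(num_cameras=6)
out = cc(torch.rand(1024, 3), torch.randint(0, 6, (1024,)))
print(out.shape)  # torch.Size([1024, 3])
```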

Image segmentation with traveling waves in an exactly solvable recurrent neural network

  • paper_url: http://arxiv.org/abs/2311.16943
  • repo_url: None
  • paper_authors: Luisa H. B. Liboni, Roberto C. Budzinski, Alexandra N. Busch, Sindy Löwe, Thomas A. Keller, Max Welling, Lyle E. Muller
  • for: 图像分割,使用时空动力学的循环神经网络。
  • methods: 网络中每个单元的状态为复数,由此产生复杂的时空动力学,能够根据场景的结构特征有效地将图像划分成组。
  • results: 借助循环网络动力学的精确解,给出了该网络进行目标分割的数学解释,并展示了一种通用的目标分割算法,适用于从简单的灰度几何图形到自然图像的各类输入。
    Abstract We study image segmentation using spatiotemporal dynamics in a recurrent neural network where the state of each unit is given by a complex number. We show that this network generates sophisticated spatiotemporal dynamics that can effectively divide an image into groups according to a scene's structural characteristics. Using an exact solution of the recurrent network's dynamics, we present a precise description of the mechanism underlying object segmentation in this network, providing a clear mathematical interpretation of how the network performs this task. We then demonstrate a simple algorithm for object segmentation that generalizes across inputs ranging from simple geometric objects in grayscale images to natural images. Object segmentation across all images is accomplished with one recurrent neural network that has a single, fixed set of weights. This demonstrates the expressive potential of recurrent neural networks when constructed using a mathematical approach that brings together their structure, dynamics, and computation.
    摘要 我们研究利用循环神经网络中的时空动力学进行图像分割,其中每个单元的状态由一个复数表示。我们展示了该网络能够产生复杂的时空动力学,从而根据场景的结构特征有效地将图像划分成组。利用循环网络动力学的精确解,我们对该网络中目标分割的机制给出了精确描述,为网络如何完成这一任务提供了清晰的数学解释。随后,我们展示了一种简单的目标分割算法,可以推广到从灰度图像中的简单几何物体到自然图像的各类输入。所有图像上的目标分割均由一个具有单一固定权重集的循环神经网络完成。这展示了当以将结构、动力学与计算相结合的数学方法构建时,循环神经网络所具有的表达潜力。
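
A toy numerical sketch of the mechanism: complex-valued units coupled to their grid neighbors, iterated linearly, with grouping read out from phase. The coupling rule and readout here are simplistic stand-ins chosen for illustration, not the paper's exact network or its closed-form solution.

```python
# Complex-valued recurrent dynamics on a pixel grid: similar pixels couple
# with near-zero phase shift, dissimilar pixels with a large one, so phases
# tend to synchronize within regions.
import numpy as np

rng = np.random.default_rng(0)

n = 16
labels_true = np.zeros((n, n), dtype=int)
labels_true[4:12, 2:7] = 1          # two "object" blobs on a background
labels_true[4:12, 9:14] = 2
img = (labels_true > 0).astype(float) + 0.05 * rng.standard_normal((n, n))

# Nearest-neighbor complex coupling matrix W over the flattened grid.
idx = np.arange(n * n).reshape(n, n)
W = np.zeros((n * n, n * n), dtype=complex)
for i in range(n):
    for j in range(n):
        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            ii, jj = i + di, j + dj
            if 0 <= ii < n and 0 <= jj < n:
                dphi = np.pi * abs(img[i, j] - img[ii, jj])
                W[idx[i, j], idx[ii, jj]] = np.exp(1j * dphi)

# Iterate z <- W z, renormalizing each unit to keep phase-only dynamics.
z = np.exp(1j * 2 * np.pi * rng.random(n * n))
for _ in range(50):
    z = W @ z
    z = z / np.abs(z).clip(min=1e-9)

# Units whose phases have synchronized are read out as one group.
phase = np.angle(z).reshape(n, n)
print(np.round(phase, 1))
```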

The Sky’s the Limit: Re-lightable Outdoor Scenes via a Sky-pixel Constrained Illumination Prior and Outside-In Visibility

  • paper_url: http://arxiv.org/abs/2311.16937
  • repo_url: https://github.com/jadgardner/neusky
  • paper_authors: James A. D. Gardner, Evgenii Kashin, Bernhard Egger, William A. P. Smith
  • for: 本研究旨在解决来自无约束户外场景图像集的逆渲染问题,其难点尤其在于光照/反照率之间的歧义,以及几何体对光照环境的遮挡(阴影)。
  • methods: 本研究利用天空像素对相应方向远处光照的直接测量,并通过神经光照先验对其余光照环境进行统计推断;还提出了一种基于神经方向距离函数、由外向内计算可微天空可见性的新方法。
  • results: 本研究实现了高质量的反照率、几何、光照和天空可见性估计,在NeRF-OSR重光照基准上取得了当前最佳结果。
    Abstract Inverse rendering of outdoor scenes from unconstrained image collections is a challenging task, particularly illumination/albedo ambiguities and occlusion of the illumination environment (shadowing) caused by geometry. However, there are many cues in an image that can aid in the disentanglement of geometry, albedo and shadows. We exploit the fact that any sky pixel provides a direct measurement of distant lighting in the corresponding direction and, via a neural illumination prior, a statistical cue as to the remaining illumination environment. We also introduce a novel `outside-in' method for computing differentiable sky visibility based on a neural directional distance function. This is efficient and can be trained in parallel with the neural scene representation, allowing gradients from appearance loss to flow from shadows to influence estimation of illumination and geometry. Our method estimates high-quality albedo, geometry, illumination and sky visibility, achieving state-of-the-art results on the NeRF-OSR relighting benchmark. Our code and models can be found https://github.com/JADGardner/neusky
    摘要 从无约束图像集对户外场景进行逆渲染是一项具有挑战性的任务,尤其是光照/反照率之间的歧义,以及几何体对光照环境的遮挡(阴影)。然而,图像中存在许多有助于解耦几何、反照率与阴影的线索。我们利用了这样一个事实:任何天空像素都提供了对相应方向远处光照的直接测量,并且通过神经光照先验,为其余光照环境提供统计线索。我们还提出了一种新的"由外向内"方法,基于神经方向距离函数来计算可微的天空可见性。该方法高效,并可与神经场景表示并行训练,使外观损失的梯度能够从阴影流向光照与几何的估计。我们的方法能够高质量地估计反照率、几何、光照和天空可见性,在NeRF-OSR重光照基准上取得了最先进的结果。我们的代码和模型见 https://github.com/JADGardner/neusky 。

SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.16933
  • repo_url: None
  • paper_authors: Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, Bo Dai
  • for: 这个论文的目的是提高文本到视频(T2V)生成的灵活性和可控性,使得可以更好地控制视频的结构和内容。
  • methods: 这个论文使用一种名为SparseCtrl的方法,利用时间上稀疏的信号(如少数几帧的深度/边缘序列)来提高可控性,而无需密集的逐帧输入。该方法可以与现有的T2V模型配合使用,并支持多种模态(如素描、深度图和RGB图像)。
  • results: 实验表明,SparseCtrl可以在原始和个性化T2V生成器上广泛适用。代码和模型将在 https://guoyww.github.io/projects/SparseCtrl 公开发布。
    Abstract The development of text-to-video (T2V), i.e., generating videos with a given text prompt, has been significantly advanced in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community thus leverages the dense structure signals, e.g., per-frame depth/edge sequences, to enhance controllability, whose collection accordingly increases the burden of inference. In this work, we present SparseCtrl to enable flexible structure control with temporally sparse signals, requiring only one or a few inputs, as shown in Figure 1. It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched. The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images, providing more practical control for video generation and promoting applications such as storyboarding, depth rendering, keyframe animation, and interpolation. Extensive experiments demonstrate the generalization of SparseCtrl on both original and personalized T2V generators. Codes and models will be publicly available at https://guoyww.github.io/projects/SparseCtrl .
    摘要 近年来,文本到视频(T2V)生成,即根据文本提示生成视频,取得了重要进展。然而,仅依赖文本提示往往会由于空间不确定性而导致帧内构图模糊。为此,研究界通常利用稠密的结构信号(如逐帧的深度/边缘序列)来增强可控性,但相应的信号采集也加重了推理负担。在本工作中,我们提出SparseCtrl,仅需一个或少数几个输入,即可利用时间上稀疏的信号实现灵活的结构控制,如图1所示。它引入一个额外的条件编码器来处理这些稀疏信号,同时保持预训练的T2V模型不变。该方法兼容素描、深度图和RGB图像等多种模态,为视频生成带来更实用的控制,并促进故事板、深度渲染、关键帧动画和插值等应用。大量实验表明SparseCtrl在原始与个性化T2V生成器上均具有良好的泛化性。代码和模型将在 https://guoyww.github.io/projects/SparseCtrl 公开。
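
One simple way to feed temporally sparse controls to a condition encoder, in the spirit of SparseCtrl, is to zero-fill unconditioned frames and append a binary validity-mask channel. The shapes and mask convention below are assumptions about the general idea, not the paper's exact interface.

```python
# Pack a few per-frame control signals into a dense conditioning tensor with
# an extra mask channel marking which frames actually carry a signal.
import torch

def pack_sparse_condition(controls: dict, num_frames: int, c: int, h: int, w: int):
    """controls: {frame_index: (c, h, w) tensor} for the few conditioned frames."""
    cond = torch.zeros(num_frames, c + 1, h, w)   # last channel = validity mask
    for t, signal in controls.items():
        cond[t, :c] = signal
        cond[t, c] = 1.0                          # mark this frame as conditioned
    return cond

# Condition only frames 0 and 15 of a 16-frame clip with a 1-channel depth map.
cond = pack_sparse_condition(
    {0: torch.rand(1, 64, 64), 15: torch.rand(1, 64, 64)},
    num_frames=16, c=1, h=64, w=64,
)
print(cond.shape, cond[:, 1].amax(dim=(1, 2)))  # mask is 1 only at frames 0 and 15
```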

LLaFS: When Large-Language Models Meet Few-Shot Segmentation

  • paper_url: http://arxiv.org/abs/2311.16926
  • repo_url: https://github.com/lanyunzhu99/llafs
  • paper_authors: Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, Jun Liu
  • for: 这篇论文提出了一种基于大语言模型(LLM)的少样本分割方法,利用LLM的丰富先验知识来提升图像分割效果。
  • methods: 该方法设计了专门的输入指令,使LLM能够直接以多边形形式输出分割结果,并使用区域-属性表来模拟人类视觉机制、提供多模态引导;同时借助伪样本生成和课程学习来扩充数据并改进优化。
  • results: 论文在多个数据集上取得了最先进的效果,展示了使用LLM进行少样本计算机视觉任务的潜力。
    Abstract This paper proposes LLaFS, the first attempt to leverage large language models (LLMs) in few-shot segmentation. In contrast to the conventional few-shot segmentation methods that only rely on the limited and biased information from the annotated support images, LLaFS leverages the vast prior knowledge gained by LLM as an effective supplement and directly uses the LLM to segment images in a few-shot manner. To enable the text-based LLM to handle image-related tasks, we carefully design an input instruction that allows the LLM to produce segmentation results represented as polygons, and propose a region-attribute table to simulate the human visual mechanism and provide multi-modal guidance. We also synthesize pseudo samples and use curriculum learning for pretraining to augment data and achieve better optimization. LLaFS achieves state-of-the-art results on multiple datasets, showing the potential of using LLMs for few-shot computer vision tasks. Code will be available at https://github.com/lanyunzhu99/LLaFS.
    摘要 本文提出了LLaFS,这是首次尝试在少样本分割中利用大语言模型(LLM)。与仅依赖标注支持图像中有限且有偏信息的传统少样本分割方法不同,LLaFS利用LLM所获得的大量先验知识作为有效补充,并直接使用LLM以少样本方式分割图像。为使基于文本的LLM能够处理图像相关任务,我们精心设计了输入指令,使LLM能够产生以多边形表示的分割结果,并提出区域-属性表来模拟人类视觉机制并提供多模态引导。我们还合成伪样本并使用课程学习进行预训练,以扩充数据并实现更好的优化。LLaFS在多个数据集上取得了最先进的结果,展示了使用LLM进行少样本计算机视觉任务的潜力。代码将在 https://github.com/lanyunzhu99/LLaFS 上提供。

Super-Resolution through StyleGAN Regularized Latent Search: A Realism-Fidelity Trade-off

  • paper_url: http://arxiv.org/abs/2311.16923
  • repo_url: None
  • paper_authors: Marzieh Gheisari, Auguste Genovesio
  • for: 解决超分辨率问题,即从低分辨率(LR)图像构建高分辨率(HR)图像。
  • methods: 在预训练于HR图像的StyleGAN潜在空间中搜索,使其输出降采样后最接近输入的LR图像。然而,这类方法往往产生域外图像,且难以准确重建远离原始域的HR图像。我们的贡献有两点:首先,我们引入一种新的正则项来约束潜在空间中的搜索,确保反演得到的潜在码位于原始图像流形内;其次,我们通过在最优潜在码周围扩展图像先验,进一步提升重建效果。
  • results: 我们的方法能够在较大的放大倍数下恢复逼真的高质量图像;在较小的放大倍数下,还能重建生成器本身无法产生的细节。总体而言,我们的方法在超分辨率任务中于保真度与真实感之间取得了良好的平衡。
    Abstract This paper addresses the problem of super-resolution: constructing a highly resolved (HR) image from a low resolved (LR) one. Recent unsupervised approaches search the latent space of a StyleGAN pre-trained on HR images, for the image that best downscales to the input LR image. However, they tend to produce out-of-domain images and fail to accurately reconstruct HR images that are far from the original domain. Our contribution is twofold. Firstly, we introduce a new regularizer to constrain the search in the latent space, ensuring that the inverted code lies in the original image manifold. Secondly, we further enhanced the reconstruction through expanding the image prior around the optimal latent code. Our results show that the proposed approach recovers realistic high-quality images for large magnification factors. Furthermore, for low magnification factors, it can still reconstruct details that the generator could not have produced otherwise. Altogether, our approach achieves a good trade-off between fidelity and realism for the super-resolution task.
    摘要 这篇论文研究超分辨率问题,即从低分辨率(LR)图像构建高分辨率(HR)图像。近期的无监督方法在预训练于HR图像的StyleGAN潜在空间中搜索下采样后最匹配输入LR图像的图像,但它们往往生成域外图像,且难以准确重建远离原始域的HR图像。我们的贡献有两点:首先,我们引入一种新的正则项来约束潜在空间中的搜索,确保反演得到的潜在码位于原始图像流形内;其次,我们通过在最优潜在码周围扩展图像先验,进一步提升重建效果。结果表明,所提方法能够在较大的放大倍数下恢复逼真的高质量图像;在较小的放大倍数下,还能重建生成器本身无法产生的细节。总体而言,我们的方法在超分辨率任务中于保真度与真实感之间取得了良好的平衡。
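
The core optimization is easy to sketch: search a latent code whose generated image, once downscaled, matches the LR input, with a regularizer keeping the code near the latent prior. `G` is a placeholder generator, and the quadratic regularizer is a simple stand-in for the paper's manifold-constraining term.

```python
# Regularized latent search for super-resolution with a frozen generator G.
import torch
import torch.nn.functional as F

def latent_search_sr(G, lr_image, latent_dim=512, scale=8, steps=500, lam=0.1):
    w = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([w], lr=0.05)
    for _ in range(steps):
        hr = G(w)                                            # (1, 3, H, W)
        down = F.interpolate(hr, scale_factor=1 / scale,
                             mode="bilinear", antialias=True)
        fidelity = F.mse_loss(down, lr_image)                # match the LR input
        reg = w.pow(2).mean()                                # stay near the prior
        loss = fidelity + lam * reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return G(w).detach()
```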

UGG: Unified Generative Grasping

  • paper_url: http://arxiv.org/abs/2311.16917
  • repo_url: https://github.com/autonomousvision/shape_as_points
  • paper_authors: Jiaxin Lu, Hao Kang, Haoxiang Li, Bo Liu, Yiding Yang, Qixing Huang, Gang Hua
  • for: 这篇论文旨在同时提高灵巧抓取的成功率与多样性。
  • methods: 该论文提出了一种名为UGG的统一扩散模型,在物体点云和手部参数空间中运行,并通过全Transformer架构将物体、手部和接触点的信息融合在一起,引入了一种新的接触点表示以改进接触建模。
  • results: 该模型在大规模的DexGraspNet数据集上取得了最先进的灵巧抓取成功率,同时还能基于手部信息生成物体,为物体设计与生成模型研究提供了有价值的见解。
    Abstract Dexterous grasping aims to produce diverse grasping postures with a high grasping success rate. Regression-based methods that directly predict grasping parameters given the object may achieve a high success rate but often lack diversity. Generation-based methods that generate grasping postures conditioned on the object can often produce diverse grasping, but they are insufficient for high grasping success due to lack of discriminative information. To mitigate, we introduce a unified diffusion-based dexterous grasp generation model, dubbed the name UGG, which operates within the object point cloud and hand parameter spaces. Our all-transformer architecture unifies the information from the object, the hand, and the contacts, introducing a novel representation of contact points for improved contact modeling. The flexibility and quality of our model enable the integration of a lightweight discriminator, benefiting from simulated discriminative data, which pushes for a high success rate while preserving high diversity. Beyond grasp generation, our model can also generate objects based on hand information, offering valuable insights into object design and studying how the generative model perceives objects. Our model achieves state-of-the-art dexterous grasping on the large-scale DexGraspNet dataset while facilitating human-centric object design, marking a significant advancement in dexterous grasping research. Our project page is https://jiaxin-lu.github.io/ugg/ .
    摘要 灵巧抓取旨在以较高的成功率生成多样化的抓取姿态。直接根据物体回归抓取参数的方法可能取得较高的成功率,但往往缺乏多样性;以物体为条件生成抓取姿态的方法通常能够产生多样化的抓取,但由于缺乏判别信息,难以保证较高的成功率。为此,我们提出了一个统一的基于扩散的灵巧抓取生成模型,命名为UGG,它在物体点云与手部参数空间中运行。我们的全Transformer架构统一了来自物体、手部和接触的信息,并引入了一种新的接触点表示以改进接触建模。模型的灵活性与质量使其能够集成一个轻量级判别器,并受益于模拟的判别数据,在保持高多样性的同时追求高成功率。除抓取生成外,我们的模型还能基于手部信息生成物体,为物体设计以及研究生成模型如何感知物体提供了宝贵的见解。我们的模型在大规模DexGraspNet数据集上取得了最先进的灵巧抓取性能,同时促进了以人为中心的物体设计,标志着灵巧抓取研究的重大进展。项目主页: https://jiaxin-lu.github.io/ugg/ 。

Brain-ID: Learning Robust Feature Representations for Brain Imaging

  • paper_url: http://arxiv.org/abs/2311.16914
  • repo_url: https://github.com/peirong26/Brain-ID
  • paper_authors: Peirong Liu, Oula Puonti, Xiaoling Hu, Daniel C. Alexander, Juan Eugenio Iglesias
  • for: 这篇论文旨在为脑成像提供一种鲁棒的特征表示学习策略,使其适用于不同的脑成像模态。
  • methods: 论文提出了名为Brain-ID的特征表示学习策略,该策略完全在合成数据上训练,对成像对比度不敏感,并能够通过简单的单层方案快速适应各类下游任务。
  • results: 实验结果显示,Brain-ID在多种下游任务中表现出色,并且在训练数据有限时仍能保持其性能。
    Abstract Recent learning-based approaches have made astonishing advances in calibrated medical imaging like computerized tomography, yet they struggle to generalize in uncalibrated modalities -- notoriously magnetic resonance imaging (MRI), where performance is highly sensitive to the differences in MR contrast, resolution, and orientation between the training and testing data. This prevents broad applicability to the diverse clinical acquisition protocols in the real world. We introduce Brain-ID, a robust feature representation learning strategy for brain imaging, which is contrast-agnostic, and robust to the brain anatomy of each subject regardless of the appearance of acquired images (i.e., deformation, contrast, resolution, orientation, artifacts, etc). Brain-ID is trained entirely on synthetic data, and easily adapts to downstream tasks with our proposed simple one-layer solution. We validate the robustness of Brain-ID features, and evaluate their performance in a variety of downstream applications, including both contrast-independent (anatomy reconstruction/contrast synthesis, brain segmentation), and contrast-dependent (super-resolution, bias field estimation) tasks. Extensive experiments on 6 public datasets demonstrate that Brain-ID achieves state-of-the-art performance in all tasks, and more importantly, preserves its performance when only limited training data is available.
    摘要 近期基于学习的方法在计算机断层扫描等校准医学成像领域取得了惊人的进展,但在未校准的模态上难以泛化——尤其是磁共振成像(MRI),其性能对训练数据与测试数据之间在MR对比度、分辨率和方位上的差异高度敏感。这阻碍了这些方法在真实世界中多样化临床采集协议下的广泛应用。我们提出了Brain-ID,一种面向脑成像的鲁棒特征表示学习策略,它与对比度无关,并且无论所采集图像的外观如何(如形变、对比度、分辨率、方位、伪影等),都对每个受试者的脑解剖结构保持鲁棒。Brain-ID完全在合成数据上训练,并可通过我们提出的简单单层方案轻松适应下游任务。我们验证了Brain-ID特征的鲁棒性,并在多种下游应用中评估其性能,包括与对比度无关的任务(解剖重建/对比度合成、脑分割)和依赖对比度的任务(超分辨率、偏置场估计)。在6个公开数据集上的大量实验表明,Brain-ID在所有任务中均取得了最先进的性能,更重要的是,在训练数据有限时仍能保持其性能。

Feedback RoI Features Improve Aerial Object Detection

  • paper_url: http://arxiv.org/abs/2311.17129
  • repo_url: None
  • paper_authors: Botao Ren, Botian Xu, Tengyu Liu, Jingyi Wang, Zhidong Deng
  • for: 这个论文的目的是提出一种基于高级反馈信息的对象检测方法,以适应不同特征信号的变化。
  • methods: 该方法使用了高级反馈信息来改进对象检测中的特征选择,以适应图像质量变化和分类不确定性。
  • results: 实验结果表明,该方法可以在挑战性强的航空图像检测数据集上提供可靠的提升,包括DOTA-v1.0、DOTA-v1.5和HRSC2016。此外,对于普通的检测模型,我们的模块也能够提供有效的改进。
    Abstract Neuroscience studies have shown that the human visual system utilizes high-level feedback information to guide lower-level perception, enabling adaptation to signals of different characteristics. In light of this, we propose Feedback multi-Level feature Extractor (Flex) to incorporate a similar mechanism for object detection. Flex refines feature selection based on image-wise and instance-level feedback information in response to image quality variation and classification uncertainty. Experimental results show that Flex offers consistent improvement to a range of existing SOTA methods on the challenging aerial object detection datasets including DOTA-v1.0, DOTA-v1.5, and HRSC2016. Although the design originates in aerial image detection, further experiments on MS COCO also reveal our module's efficacy in general detection models. Quantitative and qualitative analyses indicate that the improvements are closely related to image qualities, which match our motivation.
    摘要 神经科学研究表明,人类视觉系统会利用高层反馈信息来引导低层感知,从而适应不同特性的信号。受此启发,我们提出了反馈多层特征提取器(Flex),将类似机制引入目标检测。Flex根据图像级和实例级的反馈信息细化特征选择,以应对图像质量变化和分类不确定性。实验结果表明,Flex在具有挑战性的航空目标检测数据集(包括DOTA-v1.0、DOTA-v1.5和HRSC2016)上为一系列现有的最先进方法带来了一致的提升。尽管该设计源自航空图像检测,在MS COCO上的进一步实验也表明了我们模块在通用检测模型中的有效性。定量与定性分析表明,这些提升与图像质量密切相关,与我们的动机相符。

Lane-Keeping Control of Autonomous Vehicles Through a Soft-Constrained Iterative LQR

  • paper_url: http://arxiv.org/abs/2311.16900
  • repo_url: None
  • paper_authors: Der-Hau Lee
  • for: 在自动驾驶车辆应用中,精准预测平滑的转向输入至关重要,因为带有抖动的控制动作可能会导致车辆系统失稳。
  • methods: 我们将CILQR算法与模型预测控制(MPC)的约束放宽技术结合,开发了一种soft-CILQR算法。我们将松弛变量引入soft-CILQR求解器的状态与控制障碍函数中,以在优化过程中软化约束,从而以相对简单的方式计算镇定的控制输入。
  • results: 我们基于线性系统动力学模型,通过数值仿真和基于视觉的高难度操纵实验,测试了所提soft-CILQR算法的性能,并与CILQR算法进行比较。在数值仿真中,soft-CILQR与CILQR求解器均能使系统渐近地趋向参考状态;但在存在加性扰动的情况下,soft-CILQR求解器比CILQR求解器更容易获得平滑的转向输入轨迹。在视觉输入实验中,在TORCS上驾驶自车时,soft-CILQR控制器在跟踪精度和转向平滑性方面均优于CILQR控制器。
    Abstract The accurate prediction of smooth steering inputs is crucial for autonomous vehicle applications because control actions with jitter might cause the vehicle system to become unstable. To address this problem in automobile lane-keeping control without the use of additional smoothing algorithms, we developed a soft-constrained iterative linear-quadratic regulator (soft-CILQR) algorithm by integrating CILQR algorithm and a model predictive control (MPC) constraint relaxation method. We incorporated slack variables into the state and control barrier functions of the soft-CILQR solver to soften the constraints in the optimization process so that stabilizing control inputs can be calculated in a relatively simple manner. Two types of automotive lane-keeping experiments were conducted with a linear system dynamics model to test the performance of the proposed soft-CILQR algorithm and to compare its performance with that of the CILQR algorithm: numerical simulations and experiments involving challenging vision-based maneuvers. In the numerical simulations, the soft-CILQR and CILQR solvers managed to drive the system toward the reference state asymptotically; however, the soft-CILQR solver obtained smooth steering input trajectories more easily than did the CILQR solver under conditions involving additive disturbances. In the experiments with visual inputs, the soft-CILQR controller outperformed the CILQR controller in terms of tracking accuracy and steering smoothness during the driving of an ego vehicle on TORCS.
    摘要 准确预测平滑的转向输入是自动驾驶应用的关键,因为带有抖动的控制动作可能导致车辆系统失稳。为了在不借助额外平滑算法的情况下解决车道保持控制中的这一问题,我们将CILQR算法与模型预测控制(MPC)的约束松弛方法相结合,开发了一种软约束迭代线性二次调节器(soft-CILQR)算法。我们在soft-CILQR求解器的状态与控制障碍函数中引入松弛变量,以在优化过程中软化约束,从而以相对简单的方式计算镇定的控制输入。我们基于线性系统动力学模型进行了两类车道保持实验,以测试所提soft-CILQR算法的性能并与CILQR算法比较:数值仿真与涉及高难度视觉操纵的实验。在数值仿真中,soft-CILQR与CILQR求解器均能使系统渐近地趋向参考状态;但在存在加性扰动的条件下,soft-CILQR求解器更容易获得平滑的转向输入轨迹。在视觉输入实验中,在TORCS上驾驶自车时,soft-CILQR控制器在跟踪精度与转向平滑性方面均优于CILQR控制器。
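
The constraint-softening idea can be illustrated with a one-dimensional toy: instead of an exponential barrier on g(x) <= 0 that explodes under violation, a penalized slack variable absorbs small violations. The barrier form and constants below are illustrative, not the paper's exact cost terms.

```python
# Hard exponential barrier vs. slack-softened barrier on a scalar constraint.
import numpy as np

def hard_barrier(g, q1=1.0, q2=5.0):
    """Exponential barrier on constraint value g (g <= 0 is feasible)."""
    return q1 * np.exp(q2 * g)

def soft_barrier(g, s, q1=1.0, q2=5.0, rho=10.0):
    """Barrier on the relaxed constraint g - s <= 0, plus a penalty on slack s >= 0."""
    return q1 * np.exp(q2 * (g - s)) + rho * s**2

# With a small violation g = 0.3, the hard barrier is steep, while a slack of
# s = 0.3 absorbs the violation at a modest quadratic cost, keeping the
# optimization landscape gentler for the iterative solver.
print(hard_barrier(0.3))       # large
print(soft_barrier(0.3, 0.3))  # moderate
```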

Dendrogram distance: an evaluation metric for generative networks using hierarchical clustering

  • paper_url: http://arxiv.org/abs/2311.16894
  • repo_url: None
  • paper_authors: Gustavo Sutter Carvalho, Moacir Antonelli Ponti
  • for: 本研究提出了一种新的生成模型评估指标,主要针对生成网络。
  • methods: 该方法使用树状图(dendrogram)表示真实数据与生成数据,从而计算训练样本与生成样本之间的差异。该指标重点关注模式坍塌问题,针对无法覆盖训练集中全部模式的生成器。
  • results: 针对所提方法,设计了一种基于真实数据采样的验证方案,并在受控环境下证明其与其他最先进方法相比具有竞争力。
    Abstract We present a novel metric for generative modeling evaluation, focusing primarily on generative networks. The method uses dendrograms to represent real and fake data, allowing for the divergence between training and generated samples to be computed. This metric focus on mode collapse, targeting generators that are not able to capture all modes in the training set. To evaluate the proposed method it is introduced a validation scheme based on sampling from real datasets, therefore the metric is evaluated in a controlled environment and proves to be competitive with other state-of-the-art approaches.
    摘要 我们提出了一种新的评估生成模型 metric,主要针对生成网络。该方法使用树状图表示真实和假数据,从而计算生成样本与训练样本之间的差异。该metric关注模式塌陷,targeting生成器不能捕捉全部训练集中的所有模式。为评估我们的方法,我们提出了基于真实数据采样的验证方案,因此该metric在控制环境中评估并与其他当前最佳方法竞争。

A Unified Approach for Text- and Image-guided 4D Scene Generation

  • paper_url: http://arxiv.org/abs/2311.16854
  • repo_url: None
  • paper_authors: Yufeng Zheng, Xueting Li, Koki Nagano, Sifei Liu, Karsten Kreis, Otmar Hilliges, Shalini De Mello
  • for: 文章旨在探讨大规模扩散生成模型如何用于从用户提供的文本提示和图像中生成图像、视频和3D资产。
  • methods: 文章提出了一种新的两阶段方法:第一阶段利用3D和2D扩散引导,高效地学习高质量的静态3D资产;第二阶段使用多分辨率特征网格与位移总变差损失,高效地学习运动。
  • results: 通过用户偏好研究,文章表明其方法显著提升了图像与运动质量、3D一致性和文本保真度,并且无需修改运动学习阶段即可轻松适配可控生成任务。因此,该方法为文本到4D、图像到4D和个性化4D生成任务提供了统一的方案。
    Abstract Large-scale diffusion generative models are greatly simplifying image, video and 3D asset creation from user-provided text prompts and images. However, the challenging problem of text-to-4D dynamic 3D scene generation with diffusion guidance remains largely unexplored. We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D diffusion guidance to effectively learn a high-quality static 3D asset in the first stage; (2) a deformable neural radiance field that explicitly disentangles the learned static asset from its deformation, preserving quality during motion learning; and (3) a multi-resolution feature grid for the deformation field with a displacement total variation loss to effectively learn motion with video diffusion guidance in the second stage. Through a user preference study, we demonstrate that our approach significantly advances image and motion quality, 3D consistency and text fidelity for text-to-4D generation compared to baseline approaches. Thanks to its motion-disentangled representation, Dream-in-4D can also be easily adapted for controllable generation where appearance is defined by one or multiple images, without the need to modify the motion learning stage. Thus, our method offers, for the first time, a unified approach for text-to-4D, image-to-4D and personalized 4D generation tasks.
    摘要 大规模扩散生成模型极大地简化了依据用户提供的文本提示和图像进行图像、视频和3D资产创作的流程。然而,借助扩散引导的文本到4D动态3D场景生成这一难题仍然鲜有探索。我们提出Dream-in-4D,采用一种新颖的两阶段方法实现文本到4D合成:(1)在第一阶段利用3D和2D扩散引导高效地学习高质量的静态3D资产;(2)使用可变形神经辐射场,将学习到的静态资产与其形变显式解耦,在运动学习过程中保持质量;(3)在第二阶段为形变场使用多分辨率特征网格与位移总变差损失,借助视频扩散引导高效地学习运动。通过用户偏好研究,我们证明该方法在文本到4D生成中的图像与运动质量、3D一致性和文本保真度方面较基线方法均有显著提升。得益于其运动解耦的表示,Dream-in-4D还可以轻松适配由一张或多张图像定义外观的可控生成,而无需修改运动学习阶段。因此,我们的方法首次为文本到4D、图像到4D和个性化4D生成任务提供了统一的方案。

Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration

  • paper_url: http://arxiv.org/abs/2311.16845
  • repo_url: None
  • paper_authors: Chen Zhao, Weiling Cai, Chenyu Dong, Chengwei Hu
  • for: 提高水下图像质量
  • methods: 利用频域信息和扩散模型
  • results: 在水下图像数据集上达到了SOTA性能,并在视觉质量上与其他方法相竞争。
    Abstract Underwater images are subject to intricate and diverse degradation, inevitably affecting the effectiveness of underwater visual tasks. However, most approaches primarily operate in the raw pixel space of images, which limits the exploration of the frequency characteristics of underwater images, leading to an inadequate utilization of deep models' representational capabilities in producing high-quality images. In this paper, we introduce a novel Underwater Image Enhancement (UIE) framework, named WF-Diff, designed to fully leverage the characteristics of frequency domain information and diffusion models. WF-Diff consists of two detachable networks: Wavelet-based Fourier information interaction network (WFI2-net) and Frequency Residual Diffusion Adjustment Module (FRDAM). With our full exploration of the frequency domain information, WFI2-net aims to achieve preliminary enhancement of frequency information in the wavelet space. Our proposed FRDAM can further refine the high- and low-frequency information of the initial enhanced images, which can be viewed as a plug-and-play universal module to adjust the detail of the underwater images. With the above techniques, our algorithm can show SOTA performance on real-world underwater image datasets, and achieves competitive performance in visual quality.
    摘要 水下图像会受到复杂多样的退化影响,不可避免地削弱水下视觉任务的效果。然而,大多数方法主要在图像的原始像素空间中进行处理,限制了对水下图像频率特性的探索,导致深度模型的表示能力未能充分发挥以生成高质量图像。在本文中,我们提出了一种新的水下图像增强(UIE)框架,名为WF-Diff,旨在充分利用频域信息的特性与扩散模型。WF-Diff由两个可分离的网络组成:基于小波的傅里叶信息交互网络(WFI2-net)和频率残差扩散调整模块(FRDAM)。在对频域信息的充分探索下,WFI2-net旨在在小波空间中实现频率信息的初步增强;我们提出的FRDAM则可进一步细化初步增强图像的高频与低频信息,可视为一个即插即用的通用模块,用于调整水下图像的细节。借助上述技术,我们的算法在真实水下图像数据集上取得了SOTA性能,并在视觉质量上具有竞争力。
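
The frequency-domain starting point is standard and easy to demonstrate: a 2D discrete wavelet transform splits an image into a low-frequency approximation and high-frequency detail bands that can be processed separately. The sketch uses PyWavelets and omits all of the paper's network components.

```python
# One-level 2D DWT split and perfect reconstruction with PyWavelets.
import numpy as np
import pywt

img = np.random.rand(256, 256)              # stand-in for one image channel

# LL holds coarse content; (LH, HL, HH) hold horizontal/vertical/diagonal details.
LL, (LH, HL, HH) = pywt.dwt2(img, "haar")
print(LL.shape, LH.shape)                   # (128, 128) each

# After enhancing the sub-bands (here: identity), invert the transform.
restored = pywt.idwt2((LL, (LH, HL, HH)), "haar")
print(np.allclose(restored, img))           # True up to numerical precision
```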

Self-training solutions for the ICCV 2023 GeoNet Challenge

  • paper_url: http://arxiv.org/abs/2311.16843
  • repo_url: https://github.com/tim-learn/geonet23_casia_tim
  • paper_authors: Lijun Sheng, Zhengbo Wang, Jian Liang
  • for: 本研究目的是在GeoNet数据集上进行领域适应,以优化模型在不同地理环境中的性能。
  • methods: 本方法采用两阶段的无源领域自适应框架,以Swin Transformer为主干,实现从美国(源)域到亚洲(目标)域的知识迁移。在第一阶段,我们使用带标注的源数据、重采样策略和两种交叉熵损失训练源模型;在第二阶段,我们为未标注的目标数据生成伪标签来微调模型。
  • results: 本方法在GeoUniDA挑战中取得74.56%的H-score并最终排名第一;在GeoImNet和GeoPlaces挑战中也分别达到64.46%和51.23%的top-3准确率。
    Abstract GeoNet is a recently proposed domain adaptation benchmark consisting of three challenges (i.e., GeoUniDA, GeoImNet, and GeoPlaces). Each challenge contains images collected from the USA and Asia where there are huge geographical gaps. Our solution adopts a two-stage source-free domain adaptation framework with a Swin Transformer backbone to achieve knowledge transfer from the USA (source) domain to Asia (target) domain. In the first stage, we train a source model using labeled source data with a re-sampling strategy and two types of cross-entropy loss. In the second stage, we generate pseudo labels for unlabeled target data to fine-tune the model. Our method achieves an H-score of 74.56% and ultimately ranks 1st in the GeoUniDA challenge. In GeoImNet and GeoPlaces challenges, our solution also reaches a top-3 accuracy of 64.46% and 51.23%, respectively.
    摘要 GeoNet是最近提出的领域自适应基准,包含三个挑战(即GeoUniDA、GeoImNet和GeoPlaces)。每个挑战包含从美国和亚洲采集的图像,两地之间存在巨大的地理差异。我们的方案采用以Swin Transformer为骨干的两阶段无源领域自适应框架,实现从美国(源)域到亚洲(目标)域的知识迁移。在第一阶段,我们使用带标注的源数据、重采样策略和两种交叉熵损失训练源模型。在第二阶段,我们为未标注的目标数据生成伪标签以微调模型。我们的方法取得了74.56%的H-score,最终在GeoUniDA挑战中排名第一;在GeoImNet和GeoPlaces挑战中,我们的方案也分别达到了64.46%和51.23%的top-3准确率。
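
The two-stage recipe condenses to a familiar self-training loop: train on labeled source data, pseudo-label confident target predictions, then fine-tune. The threshold and loop structure below are illustrative choices, not the team's exact configuration.

```python
# Stage two of a source-free adaptation recipe: pseudo-label then fine-tune.
import torch
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_labels(model, target_loader, threshold=0.9):
    model.eval()
    images, labels = [], []
    for x in target_loader:                 # loader yields image batches
        probs = F.softmax(model(x), dim=1)
        conf, pred = probs.max(dim=1)
        keep = conf >= threshold            # keep only confident predictions
        images.append(x[keep])
        labels.append(pred[keep])
    return torch.cat(images), torch.cat(labels)

def finetune(model, images, labels, epochs=3, lr=1e-4, batch=64):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        perm = torch.randperm(len(images))
        for i in range(0, len(images), batch):
            idx = perm[i:i + batch]
            loss = F.cross_entropy(model(images[idx]), labels[idx])
            opt.zero_grad()
            loss.backward()
            opt.step()
```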

Unified-modal Salient Object Detection via Adaptive Prompt Learning

  • paper_url: http://arxiv.org/abs/2311.16835
  • repo_url: None
  • paper_authors: Kunpeng Wang, Chenglong Li, Zhengzheng Tu, Bin Luo
  • for: 这篇论文面向单模态与多模态显著目标检测(SOD)任务,旨在提出一个统一的框架来同时解决这些任务。
  • methods: 论文提出了名为UniSOD的框架,通过自适应提示学习来学习模态感知提示,并将其插入预训练的基线SOD模型中以处理相应任务,仅需少量可学习参数。
  • results: 论文在14个基准数据集上取得了一致的性能提升,证明UniSOD有效且高效地统一了单模态与多模态SOD任务。
    Abstract Existing single-modal and multi-modal salient object detection (SOD) methods focus on designing specific architectures tailored for their respective tasks. However, developing completely different models for different tasks leads to labor and time consumption, as well as high computational and practical deployment costs. In this paper, we make the first attempt to address both single-modal and multi-modal SOD in a unified framework called UniSOD. Nevertheless, assigning appropriate strategies to modality variable inputs is challenging. To this end, UniSOD learns modality-aware prompts with task-specific hints through adaptive prompt learning, which are plugged into the proposed pre-trained baseline SOD model to handle corresponding tasks, while only requiring few learnable parameters compared to training the entire model. Each modality-aware prompt is generated from a switchable prompt generation block, which performs structural switching solely relied on single-modal and multi-modal inputs. UniSOD achieves consistent performance improvement on 14 benchmark datasets for RGB, RGB-D, and RGB-T SOD, which demonstrates that our method effectively and efficiently unifies single-modal and multi-modal SOD tasks.
    摘要 existing 单modal和多modal鲜 destacado detection(SOD)方法强调设计特定的建筑物tailored for their respective tasks。 However, developing completely different models for different tasks leads to labor and time consumption, as well as high computational and practical deployment costs。 In this paper, we make the first attempt to address both single-modal and multi-modal SOD in a unified framework called UniSOD。 Nevertheless, assigning appropriate strategies to modality variable inputs is challenging。 To this end, UniSOD learns modality-aware prompts with task-specific hints through adaptive prompt learning,which are plugged into the proposed pre-trained baseline SOD model to handle corresponding tasks,while only requiring few learnable parameters compared to training the entire model。 Each modality-aware prompt is generated from a switchable prompt generation block,which performs structural switching solely relied on single-modal and multi-modal inputs。 UniSOD achieves consistent performance improvement on 14 benchmark datasets for RGB, RGB-D, and RGB-T SOD,which demonstrates that our method effectively and efficiently unifies single-modal and multi-modal SOD tasks。

1-Lipschitz Layers Compared: Memory, Speed, and Certifiable Robustness

  • paper_url: http://arxiv.org/abs/2311.16833
  • repo_url: https://github.com/berndprach/1lipschitzlayerscompared
  • paper_authors: Bernd Prach, Fabio Brau, Giorgio Buttazzo, Christoph H. Lampert
  • for: 本研究旨在比较用于实现1-Lipschitz神经网络的不同方法,包括卷积神经网络和密集神经网络,以及这些方法在不同资源条件下的性能。
  • methods: 本研究使用了多种方法来实现1-Lipschitz神经网络,包括使用密集神经网络和卷积神经网络,以及不同的训练策略和优化技术。
  • results: 研究发现,使用密集神经网络可以减少内存占用,但是需要更长的训练时间。卷积神经网络具有更好的速度性能,但是需要更多的计算资源。此外,研究还提供了一些指导和建议,以帮助用户根据可用资源选择最佳方法。
    Abstract The robustness of neural networks against input perturbations with bounded magnitude represents a serious concern in the deployment of deep learning models in safety-critical systems. Recently, the scientific community has focused on enhancing certifiable robustness guarantees by crafting 1-Lipschitz neural networks that leverage Lipschitz bounded dense and convolutional layers. Although different methods have been proposed in the literature to achieve this goal, understanding the performance of such methods is not straightforward, since different metrics can be relevant (e.g., training time, memory usage, accuracy, certifiable robustness) for different applications. For this reason, this work provides a thorough theoretical and empirical comparison between methods by evaluating them in terms of memory usage, speed, and certifiable robust accuracy. The paper also provides some guidelines and recommendations to support the user in selecting the methods that work best depending on the available resources. We provide code at https://github.com/berndprach/1LipschitzLayersCompared.
    摘要 神经网络对有界幅度输入扰动的鲁棒性,是深度学习模型部署于安全关键系统时的一个严重问题。最近,科学界致力于通过构建利用Lipschitz有界的全连接层与卷积层的1-Lipschitz神经网络,来增强可认证的鲁棒性保证。尽管文献中已提出多种实现这一目标的方法,但理解这些方法的性能并不容易,因为不同应用关注的指标不同(例如训练时间、内存占用、准确率、可认证鲁棒性)。为此,本工作从内存占用、速度和可认证鲁棒准确率等方面,对各方法进行了全面的理论与实验比较。文章还提供了一些指南与建议,帮助用户根据可用资源选择最合适的方法。代码见 https://github.com/berndprach/1LipschitzLayersCompared 。
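
As a concrete example of one construction family compared in the paper, the sketch below builds a 1-Lipschitz dense layer by dividing the weights by their spectral norm, estimated with power iteration. This is a standard technique presented under simplifying assumptions, not a summary of the paper's full comparison.

```python
# Dense layer with Lipschitz constant <= 1 via spectral normalization.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralNormLinear(nn.Module):
    def __init__(self, in_features, out_features, n_power_iters=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.register_buffer("u", torch.randn(out_features))
        self.n_power_iters = n_power_iters

    def forward(self, x):
        w = self.weight
        u = self.u
        with torch.no_grad():
            for _ in range(self.n_power_iters):   # power iteration on W W^T
                v = F.normalize(w.t() @ u, dim=0)
                u = F.normalize(w @ v, dim=0)
            self.u.copy_(u)
        sigma = torch.dot(u, w @ v)               # largest singular value estimate
        return x @ (w / sigma).t() + self.bias    # normalized weight is 1-Lipschitz

layer = SpectralNormLinear(16, 8)
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 8])
```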

Decomposer: Semi-supervised Learning of Image Restoration and Image Decomposition

  • paper_url: http://arxiv.org/abs/2311.16829
  • repo_url: None
  • paper_authors: Boris Meinardus, Mariusz Trzeciakiewicz, Tim Herzig, Monika Kwiatkowski, Simon Matern, Olaf Hellwich
  • for: 本文提出了一种半监督重建模型,用于将失真图像序列分解为其基本组成部分,即原始图像与所施加的增强效果(阴影、光照和遮挡)。
  • methods: 本文使用 SIDAR 数据集,该数据集包含许多扭曲图像序列,每个序列包含扭曲后的图像,以及对原始图像应用不同类型的扭曲效果(加法或乘法噪声)。作者提出了一种基于 transformer 的模型,用于显式学习这种分解。模型包括 3D Swin-Transformers 用于空间时间编码,以及 3D U-Nets 作为预测头,用于个别部分的预测。
  • results: 作者通过在弱监督伪标签上分别预训练模型,引导模型针对这一定义模糊的问题进行优化,并学习区分不同的图像失真,从而准确地分解失真图像序列的基本组成部分。
    Abstract We present Decomposer, a semi-supervised reconstruction model that decomposes distorted image sequences into their fundamental building blocks - the original image and the applied augmentations, i.e., shadow, light, and occlusions. To solve this problem, we use the SIDAR dataset that provides a large number of distorted image sequences: each sequence contains images with shadows, lighting, and occlusions applied to an undistorted version. Each distortion changes the original signal in different ways, e.g., additive or multiplicative noise. We propose a transformer-based model to explicitly learn this decomposition. The sequential model uses 3D Swin-Transformers for spatio-temporal encoding and 3D U-Nets as prediction heads for individual parts of the decomposition. We demonstrate that by separately pre-training our model on weakly supervised pseudo labels, we can steer our model to optimize for our ambiguous problem definition and learn to differentiate between the different image distortions.
    摘要 我们介绍Decomposer,一种半监督重建模型,可将失真图像序列分解为其基本组成部分——原始图像与所施加的增强效果(阴影、光照和遮挡)。为解决这一问题,我们使用SIDAR数据集,该数据集提供了大量失真图像序列:每个序列包含在未失真版本上施加了阴影、光照和遮挡的图像。每种失真以不同方式改变原始信号,例如加性或乘性噪声。我们提出一种基于Transformer的模型来显式学习这种分解。该序列模型使用3D Swin-Transformer进行时空编码,并以3D U-Net作为各分解分量的预测头。我们展示了,通过在弱监督伪标签上分别预训练模型,可以引导模型针对这一定义模糊的问题进行优化,并学习区分不同的图像失真。

SARA: Controllable Makeup Transfer with Spatial Alignment and Region-Adaptive Normalization

  • paper_url: http://arxiv.org/abs/2311.16828
  • repo_url: None
  • paper_authors: Xiaojing Zhong, Xinyi Huang, Zhonghua Wu, Guosheng Lin, Qingyao Wu
  • for: 这项研究旨在实现高质量的妆容风格迁移,并能够处理较大的空间错位。
  • methods: 我们提出了一种名为SARA的新方法,包含三个模块:其一,空间对齐模块,保留妆容的空间上下文并提供目标语义图,用于引导与形状无关的风格编码;其二,区域自适应归一化模块,通过逐区域编码与归一化将形状与妆容风格解耦,从而消除空间错位;其三,妆容融合模块,通过注入学习到的尺度与偏移参数,将身份特征与妆容风格融合。
  • results: 实验结果表明,我们的SARA方法优于现有方法,能够高效地实现高质量的妆容风格迁移,并在两个公开数据集上取得了最先进的性能。
    Abstract Makeup transfer is a process of transferring the makeup style from a reference image to the source images, while preserving the source images' identities. This technique is highly desirable and finds many applications. However, existing methods lack fine-level control of the makeup style, making it challenging to achieve high-quality results when dealing with large spatial misalignments. To address this problem, we propose a novel Spatial Alignment and Region-Adaptive normalization method (SARA) in this paper. Our method generates detailed makeup transfer results that can handle large spatial misalignments and achieve part-specific and shade-controllable makeup transfer. Specifically, SARA comprises three modules: Firstly, a spatial alignment module that preserves the spatial context of makeup and provides a target semantic map for guiding the shape-independent style codes. Secondly, a region-adaptive normalization module that decouples shape and makeup style using per-region encoding and normalization, which facilitates the elimination of spatial misalignments. Lastly, a makeup fusion module blends identity features and makeup style by injecting learned scale and bias parameters. Experimental results show that our SARA method outperforms existing methods and achieves state-of-the-art performance on two public datasets.
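
The sketch below illustrates the general idea of region-adaptive normalization: features are normalized, then denormalized with scale and bias parameters predicted per semantic region from a style code. Module names and shapes are illustrative assumptions, not the paper's exact design.

```python
# Sketch of per-region normalization in the spirit of SARA's region-adaptive module.
import torch
import torch.nn as nn

class RegionAdaptiveNorm(nn.Module):
    """Normalize features, then denormalize with per-region scale/bias
    predicted from a makeup style code."""
    def __init__(self, feat_ch, style_dim, num_regions):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_ch, affine=False)
        # One (gamma, beta) pair per region, predicted from the style code.
        self.to_gamma = nn.Linear(style_dim, num_regions * feat_ch)
        self.to_beta = nn.Linear(style_dim, num_regions * feat_ch)
        self.num_regions, self.feat_ch = num_regions, feat_ch

    def forward(self, feat, style, region_mask):
        # feat: (B, C, H, W); style: (B, style_dim)
        # region_mask: (B, R, H, W) one-hot semantic parsing (lips, skin, eyes, ...)
        B = feat.size(0)
        gamma = self.to_gamma(style).view(B, self.num_regions, self.feat_ch, 1, 1)
        beta = self.to_beta(style).view(B, self.num_regions, self.feat_ch, 1, 1)
        m = region_mask.unsqueeze(2)                 # (B, R, 1, H, W)
        g = (gamma * m).sum(dim=1)                   # spatial (B, C, H, W) maps
        b = (beta * m).sum(dim=1)
        return self.norm(feat) * (1 + g) + b

ran = RegionAdaptiveNorm(feat_ch=64, style_dim=128, num_regions=5)
mask = torch.eye(5)[torch.randint(0, 5, (2, 32, 32))].permute(0, 3, 1, 2)
out = ran(torch.randn(2, 64, 32, 32), torch.randn(2, 128), mask)
```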

Denoising Diffusion Probabilistic Models for Image Inpainting of Cell Distributions in the Human Brain

  • paper_url: http://arxiv.org/abs/2311.16821
  • repo_url: None
  • paper_authors: Jan-Oliver Kropp, Christian Schiffer, Katrin Amunts, Timo Dickscheid
  • for: The goal is to study the multi-scale architecture of the human brain, from brain areas and nuclei through cortical layers, columns, and cell clusters down to single-cell morphology.
  • methods: Building on high-performance computing and imaging that capture the entire human brain at the cellular level, and on brain mapping and cell segmentation methods for rapid, automated image analysis, the authors train a denoising diffusion probabilistic model (DDPM) and extend it with the RePaint method to impute missing or corrupted image data (a sketch of the RePaint step follows the abstract below).
  • results: The trained DDPM reliably fills in missing information and generates highly realistic image content with plausible cell statistics and cytoarchitectonic patterns, validated with two established downstream task models.
    Abstract Recent advances in imaging and high-performance computing have made it possible to image the entire human brain at the cellular level. This is the basis to study the multi-scale architecture of the brain regarding its subdivision into brain areas and nuclei, cortical layers, columns, and cell clusters down to single cell morphology. Methods for brain mapping and cell segmentation exploit such images to enable rapid and automated analysis of cytoarchitecture and cell distribution in complete series of histological sections. However, the presence of inevitable processing artifacts in the image data caused by missing sections, tears in the tissue, or staining variations remains the primary reason for gaps in the resulting image data. To this end, we aim to provide a model that can fill in missing information in a reliable way, following the true cell distribution at different scales. Inspired by the recent success in image generation, we propose a denoising diffusion probabilistic model (DDPM), trained on light-microscopic scans of cell-body stained sections. We extend this model with the RePaint method to impute missing or replace corrupted image data. We show that our trained DDPM is able to generate highly realistic image information for this purpose, generating plausible cell statistics and cytoarchitectonic patterns. We validate its outputs using two established downstream task models trained on the same data.
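
The RePaint mechanism mentioned above can be summarized in a few lines: at each reverse diffusion step, the known region is replaced by a forward-diffused copy of the ground truth, so the model only synthesizes content inside the mask. The sketch below simplifies index conventions and omits RePaint's resampling loop.

```python
# Sketch of one RePaint-style reverse step (schedules and shapes illustrative).
import torch

def repaint_step(x_t, x_known, mask, t, model, alphas_cumprod, betas):
    """One reverse step. mask==1 marks pixels to inpaint (missing tissue)."""
    a_bar = alphas_cumprod[t]
    # Known region: forward-diffuse the ground truth to the current noise level.
    x_known_t = a_bar.sqrt() * x_known + (1 - a_bar).sqrt() * torch.randn_like(x_known)

    # Unknown region: standard DDPM posterior mean from the predicted noise.
    eps = model(x_t, t)                        # noise-prediction network (assumed)
    alpha_t = 1 - betas[t]
    mean = (x_t - betas[t] / (1 - a_bar).sqrt() * eps) / alpha_t.sqrt()
    x_unknown = mean + betas[t].sqrt() * torch.randn_like(x_t) if t > 0 else mean

    # Stitch: keep generated content only inside the mask.
    return mask * x_unknown + (1 - mask) * x_known_t
```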

DI-Net : Decomposed Implicit Garment Transfer Network for Digital Clothed 3D Human

  • paper_url: http://arxiv.org/abs/2311.16818
  • repo_url: None
  • paper_authors: Xiaojing Zhong, Yukun Su, Zhonghua Wu, Guosheng Lin, Qingyao Wu
  • for: 3D virtual try-on has broad applications but remains challenging: existing 2D methods cannot be extended directly to 3D because they lack per-pixel depth perception, while most 3D approaches rely on fixed topological structures and heavy computation. The proposed Decomposed Implicit garment transfer network (DI-Net) effortlessly reconstructs a 3D human mesh with the new try-on result while preserving texture from arbitrary viewpoints.
  • methods: DI-Net consists of two modules: 1) a complementary warping module that warps the reference image to the source pose via dense correspondence learning and sparse flow learning; and 2) a geometry-aware decomposed transfer module that splits garment transfer into image-layout-based and texture-based transfer, reconstructing surface and texture by building pixel-aligned implicit functions (a generic sketch of this mechanism follows the abstract below).
  • results: Experiments demonstrate the effectiveness and superiority of the method on the 3D virtual try-on task, yielding higher-quality results than existing approaches.
    Abstract 3D virtual try-on enjoys many potential applications and hence has attracted wide attention. However, it remains a challenging task that has not been adequately solved. Existing 2D virtual try-on methods cannot be directly extended to 3D since they lack the ability to perceive the depth of each pixel. Besides, 3D virtual try-on approaches are mostly built on the fixed topological structure and with heavy computation. To deal with these problems, we propose a Decomposed Implicit garment transfer network (DI-Net), which can effortlessly reconstruct a 3D human mesh with the newly try-on result and preserve the texture from an arbitrary perspective. Specifically, DI-Net consists of two modules: 1) A complementary warping module that warps the reference image to have the same pose as the source image through dense correspondence learning and sparse flow learning; 2) A geometry-aware decomposed transfer module that decomposes the garment transfer into image layout based transfer and texture based transfer, achieving surface and texture reconstruction by constructing pixel-aligned implicit functions. Experimental results show the effectiveness and superiority of our method in the 3D virtual try-on task, which can yield more high-quality results over other existing methods.
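
Pixel-aligned implicit functions, named in the methods above, query an image feature at the projection of a 3D point and feed it, together with a depth cue, to an MLP. The sketch below shows the generic PIFu-style pattern under assumed shapes and intrinsics; it is not the paper's exact network.

```python
# Generic pixel-aligned implicit function sketch (illustrative, PIFu-style).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAlignedImplicit(nn.Module):
    def __init__(self, feat_ch=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_ch + 1, 128), nn.ReLU(),
            nn.Linear(128, 1),  # occupancy logit per query point
        )

    def forward(self, feat_map, points, K):
        # feat_map: (B, C, H, W) image features; points: (B, N, 3) camera-space
        # K: (B, 3, 3) intrinsics. Project points to pixels, sample features there.
        uvz = torch.bmm(points, K.transpose(1, 2))           # (B, N, 3)
        uv = uvz[..., :2] / uvz[..., 2:].clamp(min=1e-6)
        H, W = feat_map.shape[-2:]
        grid = torch.stack([2 * uv[..., 0] / (W - 1) - 1,
                            2 * uv[..., 1] / (H - 1) - 1], dim=-1)  # to [-1, 1]
        f = F.grid_sample(feat_map, grid.unsqueeze(2), align_corners=True)
        f = f.squeeze(-1).transpose(1, 2)                    # (B, N, C)
        z = points[..., 2:]                                  # depth as the 3D cue
        return self.mlp(torch.cat([f, z], dim=-1))           # (B, N, 1)
```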

Panacea: Panoramic and Controllable Video Generation for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2311.16813
  • repo_url: None
  • paper_authors: Yuqing Wen, Yucheng Zhao, Yingfei Liu, Fan Jia, Yanhui Wang, Chong Luo, Chi Zhang, Tiancai Wang, Xiaoyan Sun, Xiangyu Zhang
  • for: Improving the quality and quantity of annotated training data for autonomous driving.
  • methods: The proposed Panacea approach generates panoramic, controllable driving-scene videos and can yield an unlimited number of diverse, annotated samples; it combines a novel 4D attention mechanism, a two-stage generation pipeline for coherence, and the ControlNet framework for fine-grained control via Bird's-Eye-View (BEV) layouts.
  • results: Extensive qualitative and quantitative evaluations on the nuScenes dataset show that Panacea generates high-quality multi-view driving-scene videos.
    Abstract The field of autonomous driving increasingly demands high-quality annotated training data. In this paper, we propose Panacea, an innovative approach to generate panoramic and controllable videos in driving scenarios, capable of yielding an unlimited number of diverse, annotated samples pivotal for autonomous driving advancements. Panacea addresses two critical challenges: 'Consistency' and 'Controllability.' Consistency ensures temporal and cross-view coherence, while Controllability ensures the alignment of generated content with corresponding annotations. Our approach integrates a novel 4D attention and a two-stage generation pipeline to maintain coherence, supplemented by the ControlNet framework for meticulous control by the Bird's-Eye-View (BEV) layouts. Extensive qualitative and quantitative evaluations of Panacea on the nuScenes dataset prove its effectiveness in generating high-quality multi-view driving-scene videos. This work notably propels the field of autonomous driving by effectively augmenting the training dataset used for advanced BEV perception techniques.

Large Model Based Referring Camouflaged Object Detection

  • paper_url: http://arxiv.org/abs/2311.17122
  • repo_url: None
  • paper_authors: Shupeng Cheng, Ge-Peng Ji, Pengda Qin, Deng-Ping Fan, Bowen Zhou, Peng Xu
  • for: The paper addresses referring camouflaged object detection (Ref-COD): detecting and segmenting specified camouflaged objects matched with a textual or visual reference.
  • methods: The method distills knowledge from multimodal large language models (MLLMs), decomposing Ref-COD into two main perspectives, perceiving the target and the scene, and organizing multi-level knowledge descriptions to progressively guide a large vision segmentation model while deeply aligning the textual references with camouflaged photos.
  • results: The method achieves state-of-the-art results on the Ref-COD benchmark and exhibits zero-shot generalization on uni-modal COD datasets.
    Abstract Referring camouflaged object detection (Ref-COD) is a recently-proposed problem aiming to segment out specified camouflaged objects matched with a textual or visual reference. This task involves two major challenges: the COD domain-specific perception and multimodal reference-image alignment. Our motivation is to make full use of the semantic intelligence and intrinsic knowledge of recent Multimodal Large Language Models (MLLMs) to decompose this complex task in a human-like way. As language is highly condensed and inductive, linguistic expression is the main media of human knowledge learning, and the transmission of knowledge information follows a multi-level progression from simplicity to complexity. In this paper, we propose a large-model-based Multi-Level Knowledge-Guided multimodal method for Ref-COD termed MLKG, where multi-level knowledge descriptions from MLLM are organized to guide the large vision model of segmentation to perceive the camouflage-targets and camouflage-scene progressively and meanwhile deeply align the textual references with camouflaged photos. To our knowledge, our contributions mainly include: (1) This is the first time that the MLLM knowledge is studied for Ref-COD and COD. (2) We, for the first time, propose decomposing Ref-COD into two main perspectives of perceiving the target and scene by integrating MLLM knowledge, and contribute a multi-level knowledge-guided method. (3) Our method achieves the state-of-the-art on the Ref-COD benchmark outperforming numerous strong competitors. Moreover, thanks to the injected rich knowledge, it demonstrates zero-shot generalization ability on uni-modal COD datasets. We will release our code soon.

Generative Data Augmentation Improves Scribble-supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2311.17121
  • repo_url: None
  • paper_authors: Jacob Schnell, Jieke Wang, Lu Qi, Vincent Tao Hu, Meng Tang
  • for: This work explores generative data augmentation for scribble-supervised semantic segmentation, aiming to narrow its gap to fully supervised segmentation.
  • methods: The authors propose a generative data augmentation method based on a ControlNet diffusion model conditioned on semantic scribbles, using classifier-free diffusion guidance to enforce class consistency and encode ratios to trade data diversity against realism (a sketch of both knobs follows the abstract below).
  • results: Several augmentation schemes are evaluated, some of which substantially improve performance in the low-data regime; the framework narrows the gap between scribble-supervised and fully supervised segmentation and can even surpass fully supervised segmentation on small datasets.
    Abstract Recent advances in generative models, such as diffusion models, have made generating high-quality synthetic images widely accessible. Prior works have shown that training on synthetic images improves many perception tasks, such as image classification, object detection, and semantic segmentation. We are the first to explore generative data augmentations for scribble-supervised semantic segmentation. We propose a generative data augmentation method that leverages a ControlNet diffusion model conditioned on semantic scribbles to produce high-quality training data. However, naive implementations of generative data augmentations may inadvertently harm the performance of the downstream segmentor rather than improve it. We leverage classifier-free diffusion guidance to enforce class consistency and introduce encode ratios to trade off data diversity for data realism. Using the guidance scale and encode ratio, we are able to generate a spectrum of high-quality training images. We propose multiple augmentation schemes and find that these schemes significantly impact model performance, especially in the low-data regime. Our framework further reduces the gap between the performance of scribble-supervised segmentation and that of fully-supervised segmentation. We also show that our framework significantly improves segmentation performance on small datasets, even surpassing fully-supervised segmentation.
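
The two knobs discussed above can be sketched directly. Classifier-free guidance pushes the noise prediction toward the conditional branch; the encode ratio is read here as starting denoising from a partially noised real image (an SDEdit-style interpretation, which is an assumption), trading diversity against realism.

```python
# Sketch of classifier-free guidance and an encode-ratio start (illustrative).
import torch

def guided_eps(model, x_t, t, cond, guidance_scale):
    # Classifier-free guidance: push the prediction toward the conditional one.
    eps_uncond = model(x_t, t, cond=None)          # model signature assumed
    eps_cond = model(x_t, t, cond=cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def encode(real_image, encode_ratio, alphas_cumprod, T=1000):
    # Forward-diffuse the real image up to step t0 = encode_ratio * T; the
    # sampler then runs only the remaining t0..0 steps. A small ratio stays
    # close to the real image (realism), a large ratio allows more diversity.
    t0 = int(encode_ratio * (T - 1))
    a_bar = alphas_cumprod[t0]
    x_t0 = a_bar.sqrt() * real_image + (1 - a_bar).sqrt() * torch.randn_like(real_image)
    return x_t0, t0
```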

Multi-Channel Cross Modal Detection of Synthetic Face Images

  • paper_url: http://arxiv.org/abs/2311.16773
  • repo_url: https://github.com/dasec/multi-channel-cross-modal-detection-of-synthetic-face-images
  • paper_authors: M. Ibsen, C. Rathgeb, S. Marcel, C. Busch
  • for: Detecting entirely synthetic face images, to preserve trust in digital content.
  • methods: A multi-channel architecture analyses information in both the frequency and visible spectra and is trained with a Cross Modal Focal Loss (a simplified two-branch sketch follows the abstract below).
  • results: In cross-model experiments, the proposed architecture generally achieves the most competitive performance compared with related architectures trained using Binary Cross Entropy.
    Abstract Synthetically generated face images have shown to be indistinguishable from real images by humans and as such can lead to a lack of trust in digital content as they can, for instance, be used to spread misinformation. Therefore, the need to develop algorithms for detecting entirely synthetic face images is apparent. Of interest are images generated by state-of-the-art deep learning-based models, as these exhibit a high level of visual realism. Recent works have demonstrated that detecting such synthetic face images under realistic circumstances remains difficult as new and improved generative models are proposed with rapid speed and arbitrary image post-processing can be applied. In this work, we propose a multi-channel architecture for detecting entirely synthetic face images which analyses information both in the frequency and visible spectra using Cross Modal Focal Loss. We compare the proposed architecture with several related architectures trained using Binary Cross Entropy and show in cross-model experiments that the proposed architecture supervised using Cross Modal Focal Loss, in general, achieves most competitive performance.
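
A minimal two-branch detector in the spirit of the paper is sketched below: one branch sees the RGB image, the other its log-magnitude frequency spectrum. The paper's Cross Modal Focal Loss is replaced here by a standard focal loss on the fused logit, which is a simplification.

```python
# Two-branch (RGB + frequency) synthetic-face detector sketch.
import torch
import torch.nn as nn

class TwoBranchDetector(nn.Module):
    def __init__(self):
        super().__init__()
        def branch():
            return nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rgb_branch, self.freq_branch = branch(), branch()
        self.head = nn.Linear(64, 1)  # real vs. synthetic logit

    def forward(self, img):
        # Frequency channel: log-magnitude of the 2D FFT per colour channel.
        spec = torch.fft.fft2(img).abs().clamp(min=1e-8).log()
        feats = torch.cat([self.rgb_branch(img), self.freq_branch(spec)], dim=1)
        return self.head(feats).squeeze(1)

def focal_loss(logits, targets, gamma=2.0):
    p = torch.sigmoid(logits)
    pt = torch.where(targets == 1, p, 1 - p)        # probability of the true class
    return (-(1 - pt) ** gamma * pt.clamp(min=1e-8).log()).mean()

net = TwoBranchDetector()
loss = focal_loss(net(torch.rand(4, 3, 64, 64)), torch.tensor([0., 1., 1., 0.]))
```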

Continuous Pose for Monocular Cameras in Neural Implicit Representation

  • paper_url: http://arxiv.org/abs/2311.17119
  • repo_url: https://github.com/qimaqi/continuous-pose-in-nerf
  • paper_authors: Qi Ma, Danda Pani Paudel, Ajad Chhatkuli, Luc Van Gool
  • for: The paper optimizes monocular camera poses as a continuous function of time.
  • methods: Camera poses are represented by an implicit neural function that maps a timestamp to the corresponding pose; the mapped poses feed downstream tasks that require joint camera pose optimization, during which the network parameters implicitly representing the poses are optimized (a minimal time-to-pose MLP sketch follows the abstract below).
  • results: Across four diverse settings, (1) NeRF from noisy poses, (2) NeRF from asynchronous events, (3) visual simultaneous localization and mapping (vSLAM), and (4) vSLAM with IMUs, the method outperforms the compared baselines and state-of-the-art methods. By assuming continuous motion, pose changes are further shown to live on a manifold with fewer than 6 degrees of freedom (DOF), termed the intrinsic motion, which yields impressive camera tracking performance in the vSLAM setting.
    Abstract In this paper, we showcase the effectiveness of optimizing monocular camera poses as a continuous function of time. The camera poses are represented using an implicit neural function which maps the given time to the corresponding camera pose. The mapped camera poses are then used for the downstream tasks where joint camera pose optimization is also required. While doing so, the network parameters -- that implicitly represent camera poses -- are optimized. We exploit the proposed method in four diverse experimental settings, namely, (1) NeRF from noisy poses; (2) NeRF from asynchronous Events; (3) Visual Simultaneous Localization and Mapping (vSLAM); and (4) vSLAM with IMUs. In all four settings, the proposed method performs significantly better than the compared baselines and the state-of-the-art methods. Additionally, using the assumption of continuous motion, changes in pose may actually live in a manifold that has lower than 6 degrees of freedom (DOF) is also realized. We call this low DOF motion representation as the \emph{intrinsic motion} and use the approach in vSLAM settings, showing impressive camera tracking performance.
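
A minimal sketch of the core idea: a small MLP maps a timestamp to a 6-DOF pose (translation plus axis-angle rotation, converted with Rodrigues' formula), so poses stay differentiable with respect to the network weights and can be optimized jointly with a downstream loss. The architecture is illustrative.

```python
# Continuous time-to-pose representation sketch.
import torch
import torch.nn as nn

class ContinuousPose(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 6))   # [tx ty tz | rx ry rz]

    def forward(self, t):
        out = self.mlp(t.view(-1, 1))
        trans, axis_angle = out[:, :3], out[:, 3:]
        # Rodrigues' formula: axis-angle vector -> rotation matrix.
        theta = axis_angle.norm(dim=1, keepdim=True).clamp(min=1e-8)
        k = axis_angle / theta
        K = torch.zeros(t.numel(), 3, 3)                  # skew-symmetric matrices
        K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]
        K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
        K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
        s, c = theta.sin().view(-1, 1, 1), theta.cos().view(-1, 1, 1)
        R = torch.eye(3).expand(t.numel(), 3, 3) + s * K + (1 - c) * (K @ K)
        return R, trans   # pose at each queried timestamp

pose = ContinuousPose()
R, trans = pose(torch.tensor([0.0, 0.5, 1.0]))  # differentiable w.r.t. MLP weights
```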

Rescuing referral failures during automated diagnosis of domain-shifted medical images

  • paper_url: http://arxiv.org/abs/2311.16766
  • repo_url: None
  • paper_authors: Anuj Srivastava, Karm Patel, Pradeep Shenoy, Devarajan Sridharan
  • for: Addressing a fundamental failure of selective classification during automated diagnosis with domain-shifted medical images.
  • methods: The authors examine two benchmark diagnostic medical imaging datasets exhibiting strong covariate shifts and evaluate novel combinations of robust generalization and post hoc referral approaches (a sketch of how referral curves are computed follows the abstract below).
  • results: Covariate shifts produce non-monotonic referral curves and severe performance drops (up to 50%) at high referral rates (>70%); the proposed combinations rescue these failures and yield significant improvements, typically >10%, over baseline methods.
    Abstract The success of deep learning models deployed in the real world depends critically on their ability to generalize well across diverse data domains. Here, we address a fundamental challenge with selective classification during automated diagnosis with domain-shifted medical images. In this scenario, models must learn to avoid making predictions when label confidence is low, especially when tested with samples far removed from the training set (covariate shift). Such uncertain cases are typically referred to the clinician for further analysis and evaluation. Yet, we show that even state-of-the-art domain generalization approaches fail severely during referral when tested on medical images acquired from a different demographic or using a different technology. We examine two benchmark diagnostic medical imaging datasets exhibiting strong covariate shifts: i) diabetic retinopathy prediction with retinal fundus images and ii) multilabel disease prediction with chest X-ray images. We show that predictive uncertainty estimates do not generalize well under covariate shifts leading to non-monotonic referral curves, and severe drops in performance (up to 50%) at high referral rates (>70%). We evaluate novel combinations of robust generalization and post hoc referral approaches, that rescue these failures and achieve significant performance improvements, typically >10%, over baseline methods. Our study identifies a critical challenge with referral in domain-shifted medical images and finds key applications in reliable, automated disease diagnosis.
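
For readers unfamiliar with referral curves, the sketch below shows how they are typically computed: sort cases by confidence, refer the least confident fraction to a clinician, and measure accuracy on the retained cases. Under covariate shift this curve can become non-monotonic, which is the failure mode the paper addresses. The data here is synthetic.

```python
# Referral-curve computation sketch with synthetic predictions.
import numpy as np

def referral_curve(confidence, correct, rates=np.linspace(0, 0.9, 10)):
    order = np.argsort(confidence)            # least confident first
    correct = np.asarray(correct, float)[order]
    accs = []
    for r in rates:
        k = int(r * len(correct))             # refer the k least confident cases
        retained = correct[k:]
        accs.append(retained.mean() if len(retained) else 1.0)
    return rates, np.array(accs)

rng = np.random.default_rng(0)
conf = rng.random(1000)
corr = rng.random(1000) < (0.6 + 0.4 * conf)   # higher confidence, more accurate
rates, accs = referral_curve(conf, corr)       # accuracy should rise with referral
```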

Gradient-based Local Next-best-view Planning for Improved Perception of Targeted Plant Nodes

  • paper_url: http://arxiv.org/abs/2311.16759
  • repo_url: None
  • paper_authors: Akshay K. Burusa, Eldert J. van Henten, Gert Kootstra
  • for: Automating labour-intensive tasks in tomato greenhouses, specifically selective harvesting and de-leafing.
  • methods: Local next-best-view (NBV) planning with differential ray sampling, which directly estimates the local gradient direction for viewpoint planning to overcome occlusion and improve perception (a toy gradient-ascent sketch follows the abstract below).
  • results: The proposed planner handles occlusions and improves 3D reconstruction and position estimation of plant nodes as well as a sampling-based NBV planner does, while requiring ten times less computation and generating 28% more efficient trajectories.
    Abstract Robots are increasingly used in tomato greenhouses to automate labour-intensive tasks such as selective harvesting and de-leafing. To perform these tasks, robots must be able to accurately and efficiently perceive the plant nodes that need to be cut, despite the high levels of occlusion from other plant parts. We formulate this problem as a local next-best-view (NBV) planning task where the robot has to plan an efficient set of camera viewpoints to overcome occlusion and improve the quality of perception. Our formulation focuses on quickly improving the perception accuracy of a single target node to maximise its chances of being cut. Previous methods of NBV planning mostly focused on global view planning and used random sampling of candidate viewpoints for exploration, which could suffer from high computational costs, ineffective view selection due to poor candidates, or non-smooth trajectories due to inefficient sampling. We propose a gradient-based NBV planner using differential ray sampling, which directly estimates the local gradient direction for viewpoint planning to overcome occlusion and improve perception. Through simulation experiments, we showed that our planner can handle occlusions and improve the 3D reconstruction and position estimation of nodes equally well as a sampling-based NBV planner, while taking ten times less computation and generating 28% more efficient trajectories.
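
A toy sketch of the planning loop: treat view quality as a differentiable objective of the camera position and follow its local gradient, rather than scoring randomly sampled candidate views. The objective below is a stand-in with assumed target and occluder positions; the paper estimates the gradient with differential ray sampling.

```python
# Toy gradient-based viewpoint optimization (objective is a stand-in).
import torch

def occlusion_aware_gain(view_pos,
                         target=torch.tensor([0.0, 0.0, 1.0]),
                         occluder=torch.tensor([0.1, 0.0, 0.5])):
    # Stand-in objective: move close to the target node while keeping away
    # from a known occluder near the line of sight.
    d_target = (view_pos - target).norm()
    d_occluder = (view_pos - occluder).norm()
    return -d_target + 0.5 * d_occluder

view = torch.tensor([0.5, 0.5, 0.0], requires_grad=True)
opt = torch.optim.Adam([view], lr=0.05)
for _ in range(100):
    opt.zero_grad()
    loss = -occlusion_aware_gain(view)   # gradient ascent on the gain
    loss.backward()
    opt.step()
```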

As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors

  • paper_url: http://arxiv.org/abs/2311.16739
  • repo_url: None
  • paper_authors: Seungwoo Yoo, Kunho Kim, Vladimir G. Kim, Minhyuk Sung
  • for: The paper presents As-Plausible-As-Possible (APAP), a mesh deformation technique that leverages 2D diffusion priors to preserve the plausibility of a mesh under user-controlled deformation.
  • methods: Mesh deformations are represented by per-face Jacobians, with vertex coordinates computed via a differentiable Poisson solve. The deformed mesh is rendered, and the resulting 2D image drives Score Distillation Sampling (SDS) to extract plausibility priors from a pretrained 2D diffusion model, fine-tuned with LoRA to better preserve the identity of the edited mesh (a sketch of the SDS gradient follows the abstract below). Gradients from SDS and a user-prescribed handle displacement are backpropagated to the per-face Jacobians, and iterative gradient descent computes the final deformation balancing the user edit against output plausibility.
  • results: Evaluations on 2D and 3D meshes show qualitative and quantitative improvements over the geometry-preservation or distortion-minimization priors used by previous techniques.
    Abstract We present As-Plausible-as-Possible (APAP) mesh deformation technique that leverages 2D diffusion priors to preserve the plausibility of a mesh under user-controlled deformation. Our framework uses per-face Jacobians to represent mesh deformations, where mesh vertex coordinates are computed via a differentiable Poisson Solve. The deformed mesh is rendered, and the resulting 2D image is used in the Score Distillation Sampling (SDS) process, which enables extracting meaningful plausibility priors from a pretrained 2D diffusion model. To better preserve the identity of the edited mesh, we fine-tune our 2D diffusion model with LoRA. Gradients extracted by SDS and a user-prescribed handle displacement are then backpropagated to the per-face Jacobians, and we use iterative gradient descent to compute the final deformation that balances between the user edit and the output plausibility. We evaluate our method with 2D and 3D meshes and demonstrate qualitative and quantitative improvements when using plausibility priors over geometry-preservation or distortion-minimization priors used by previous techniques.
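
Score Distillation Sampling, the guidance signal used above, can be sketched as follows: noise the rendered image, let the frozen 2D diffusion model predict the noise, and backpropagate the weighted residual to the deformation parameters, skipping the U-Net Jacobian. Weighting and schedule details are illustrative.

```python
# SDS gradient step sketch (params must be a leaf tensor with requires_grad).
import torch

def sds_grad(render_fn, params, diffusion, alphas_cumprod, T=1000):
    img = render_fn(params)                        # differentiable render
    t = torch.randint(20, T - 20, (1,)).item()     # random noise level
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(img)
    img_t = a_bar.sqrt() * img + (1 - a_bar).sqrt() * noise
    with torch.no_grad():
        eps_pred = diffusion(img_t, t)             # frozen prior's prediction
    w = 1 - a_bar                                  # a common SDS weighting
    # SDS skips the U-Net Jacobian: inject w * (eps_pred - noise) directly.
    img.backward(gradient=w * (eps_pred - noise))
    return params.grad                             # feed to an optimizer step
```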

Riemannian Self-Attention Mechanism for SPD Networks

  • paper_url: http://arxiv.org/abs/2311.16738
  • repo_url: None
  • paper_authors: Rui Wang, Xiao-Jun Wu, Hui Li, Josef Kittler
  • for: The paper proposes a geometric learning module built on a self-attention mechanism for SPD matrices, improving the discriminability of deep structured representations.
  • methods: An SPD manifold self-attention mechanism (SMSA) is built from manifold-valued geometric operations, mainly the Riemannian metric, Riemannian mean, and Riemannian optimization, and is embedded in an SMSA-based geometric learning module (SMSA-GLM); a sketch of such Riemannian primitives follows the abstract below.
  • results: Extensive experiments on three benchmark datasets show that the modification further alleviates information degradation relative to the baseline network and improves accuracy.
    Abstract Symmetric positive definite (SPD) matrix has been demonstrated to be an effective feature descriptor in many scientific areas, as it can encode spatiotemporal statistics of the data adequately on a curved Riemannian manifold, i.e., SPD manifold. Although there are many different ways to design network architectures for SPD matrix nonlinear learning, very few solutions explicitly mine the geometrical dependencies of features at different layers. Motivated by the great success of self-attention mechanism in capturing long-range relationships, an SPD manifold self-attention mechanism (SMSA) is proposed in this paper using some manifold-valued geometric operations, mainly the Riemannian metric, Riemannian mean, and Riemannian optimization. Then, an SMSA-based geometric learning module (SMSA-GLM) is designed for the sake of improving the discrimination of the generated deep structured representations. Extensive experimental results achieved on three benchmarking datasets show that our modification against the baseline network further alleviates the information degradation problem and leads to improved accuracy.
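
The Riemannian primitives an SPD attention block relies on can be made concrete under the log-Euclidean metric (one common choice on the SPD manifold; the paper does not necessarily use this one): matrix log/exp via eigendecomposition, geodesic distance, and the Fréchet mean.

```python
# Log-Euclidean primitives on the SPD manifold (one common metric choice).
import torch

def spd_log(S):
    w, V = torch.linalg.eigh(S)                 # S symmetric positive definite
    return V @ torch.diag_embed(w.clamp(min=1e-10).log()) @ V.transpose(-1, -2)

def spd_exp(X):
    w, V = torch.linalg.eigh(X)
    return V @ torch.diag_embed(w.exp()) @ V.transpose(-1, -2)

def log_euclidean_dist(A, B):
    # Geodesic distance d(A, B) = || log(A) - log(B) ||_F
    return (spd_log(A) - spd_log(B)).flatten(-2).norm(dim=-1)

def frechet_mean(batch):
    # Log-Euclidean Fréchet mean: exp of the arithmetic mean of matrix logs.
    return spd_exp(spd_log(batch).mean(dim=0))

M = torch.randn(8, 5, 5)
spd = M @ M.transpose(-1, -2) + 1e-3 * torch.eye(5)   # make the batch SPD
mean = frechet_mean(spd)
d = log_euclidean_dist(spd, mean.expand_as(spd))      # attention could use -d as similarity
```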

Point’n Move: Interactive Scene Object Manipulation on Gaussian Splatting Radiance Fields

  • paper_url: http://arxiv.org/abs/2311.16737
  • repo_url: None
  • paper_authors: Jiajun Huang, Hongchuan Yu
  • for: The goal is interactive scene object manipulation with exposed-region inpainting, where interactivity comes from intuitive object selection and real-time editing.
  • methods: The method adopts a Gaussian Splatting radiance field as the scene representation, fully exploiting its explicit nature and speed: a dual-stage self-prompting segmentation algorithm lifts 2D prompt points to 3D masks, followed by mask refinement and merging, change minimization, good initialization for scene inpainting, and real-time editing without per-edit training.
  • results: Editing tests on both forward-facing and 360° scenes, and comparisons against existing scene object removal methods, show superior quality alongside greater capability and a speed advantage.
    Abstract We propose Point'n Move, a method that achieves interactive scene object manipulation with exposed region inpainting. Interactivity here further comes from intuitive object selection and real-time editing. To achieve this, we adopt Gaussian Splatting Radiance Field as the scene representation and fully leverage its explicit nature and speed advantage. Its explicit representation formulation allows us to devise a 2D prompt points to 3D mask dual-stage self-prompting segmentation algorithm, perform mask refinement and merging, minimize change as well as provide good initialization for scene inpainting and perform editing in real-time without per-editing training, all leads to superior quality and performance. We test our method by performing editing on both forward-facing and 360 scenes. We also compare our method against existing scene object removal methods, showing superior quality despite being more capable and having a speed advantage.

AdaFocus: Towards End-to-end Weakly Supervised Learning for Long-Video Action Understanding

  • paper_url: http://arxiv.org/abs/2311.17118
  • repo_url: None
  • paper_authors: Jiaming Zhou, Hanjun Li, Kun-Yu Lin, Junwei Liang
  • for: The paper targets end-to-end model development for long-video action understanding, where computation, memory, and annotation costs are the main obstacles.
  • methods: The AdaFocus framework estimates the spike-actionness and temporal positions of actions, adaptively focusing on action clips that facilitate better training without precise start and end annotations.
  • results: Experiments on three long-video datasets show its effectiveness; remarkably, on two of them, models trained with AdaFocus under weak supervision outperform those trained under full supervision. A weakly supervised feature extraction pipeline built on AdaFocus further brings significant improvements on three long-video action understanding tasks.
    Abstract Developing end-to-end models for long-video action understanding tasks presents significant computational and memory challenges. Existing works generally build models on long-video features extracted by off-the-shelf action recognition models, which are trained on short-video datasets in different domains, making the extracted features suffer domain discrepancy. To avoid this, action recognition models can be end-to-end trained on clips, which are trimmed from long videos and labeled using action interval annotations. Such fully supervised annotations are expensive to collect. Thus, a weakly supervised method is needed for long-video action understanding at scale. Under the weak supervision setting, action labels are provided for the whole video without precise start and end times of the action clip. To this end, we propose an AdaFocus framework. AdaFocus estimates the spike-actionness and temporal positions of actions, enabling it to adaptively focus on action clips that facilitate better training without the need for precise annotations. Experiments on three long-video datasets show its effectiveness. Remarkably, on two of datasets, models trained with AdaFocus under weak supervision outperform those trained under full supervision. Furthermore, we form a weakly supervised feature extraction pipeline with our AdaFocus, which enables significant improvements on three long-video action understanding tasks.

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

  • paper_url: http://arxiv.org/abs/2311.17117
  • repo_url: https://github.com/HumanAIGC/AnimateAnyone
  • paper_authors: Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, Liefeng Bo
  • for: The goal is to generate character videos from still images via driving signals.
  • methods: Built on a diffusion model, the framework introduces ReferenceNet, which merges detail features via spatial attention to preserve intricate appearance from the reference image, together with an efficient pose guider for controllable movement and an effective temporal modeling approach for smooth inter-frame transitions.
  • results: By expanding the training data, the approach animates arbitrary characters and outperforms other image-to-video methods on character animation, achieving state-of-the-art results on benchmarks for fashion video and human dance synthesis.
    Abstract Character Animation aims at generating character videos from still images through driving signals. Currently, diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities. However, challenges persist in the realm of image-to-video, especially in character animation, where temporally maintaining consistency with detailed information from the character remains a formidable problem. In this paper, we leverage the power of diffusion models and propose a novel framework tailored for character animation. To preserve consistency of intricate appearance features from the reference image, we design ReferenceNet to merge detail features via spatial attention. To ensure controllability and continuity, we introduce an efficient pose guider to direct the character's movements and employ an effective temporal modeling approach to ensure smooth inter-frame transitions between video frames. By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods. Furthermore, we evaluate our method on benchmarks for fashion video and human dance synthesis, achieving state-of-the-art results.

Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular, Stereo, and RGB-D Cameras

  • paper_url: http://arxiv.org/abs/2311.16728
  • repo_url: None
  • paper_authors: Huajian Huang, Longwei Li, Hui Cheng, Sai-Kit Yeung
  • for: The work proposes a SLAM framework built on a hyper primitives map for joint localization and photorealistic mapping.
  • methods: Explicit geometric features are exploited for localization while implicit photometric features are learned to represent the texture of the observed environment; hyper primitives are actively densified based on geometric features, and a Gaussian-pyramid-based training method progressively learns multi-level features to enhance photorealistic mapping performance.
  • results: Extensive experiments on monocular, stereo, and RGB-D datasets show that Photo-SLAM significantly outperforms state-of-the-art SLAM systems for online photorealistic mapping, e.g., 30% higher PSNR and rendering hundreds of times faster on the Replica dataset, while running in real time on an embedded platform such as the Jetson AGX Orin, indicating its potential for robotics applications.
    Abstract The integration of neural rendering and the SLAM system recently showed promising results in joint localization and photorealistic view reconstruction. However, existing methods, fully relying on implicit representations, are so resource-hungry that they cannot run on portable devices, which deviates from the original intention of SLAM. In this paper, we present Photo-SLAM, a novel SLAM framework with a hyper primitives map. Specifically, we simultaneously exploit explicit geometric features for localization and learn implicit photometric features to represent the texture information of the observed environment. In addition to actively densifying hyper primitives based on geometric features, we further introduce a Gaussian-Pyramid-based training method to progressively learn multi-level features, enhancing photorealistic mapping performance. The extensive experiments with monocular, stereo, and RGB-D datasets prove that our proposed system Photo-SLAM significantly outperforms current state-of-the-art SLAM systems for online photorealistic mapping, e.g., PSNR is 30% higher and rendering speed is hundreds of times faster in the Replica dataset. Moreover, the Photo-SLAM can run at real-time speed using an embedded platform such as Jetson AGX Orin, showing the potential of robotics applications.

REF$^2$-NeRF: Reflection and Refraction aware Neural Radiance Field

  • paper_url: http://arxiv.org/abs/2311.17116
  • repo_url: None
  • paper_authors: Wooseok Kim, Taiki Fukiage, Takeshi Oishi
  • for: The paper proposes a NeRF-based multi-view 3D reconstruction method for scenes containing glass showcases.
  • methods: Building on volume rendering, refraction and reflection are modeled with elements that are dependent on and independent of the viewer's perspective, allowing the glass surfaces where refraction occurs to be estimated and the direct and reflected light components to be separated and modeled.
  • results: Compared with existing methods, the approach models glass refraction and the overall scene more accurately.
    Abstract Recently, significant progress has been made in the study of methods for 3D reconstruction from multiple images using implicit neural representations, exemplified by the neural radiance field (NeRF) method. Such methods, which are based on volume rendering, can model various light phenomena, and various extended methods have been proposed to accommodate different scenes and situations. However, when handling scenes with multiple glass objects, e.g., objects in a glass showcase, modeling the target scene accurately has been challenging due to the presence of multiple reflection and refraction effects. Thus, this paper proposes a NeRF-based modeling method for scenes containing a glass case. In the proposed method, refraction and reflection are modeled using elements that are dependent and independent of the viewer's perspective. This approach allows us to estimate the surfaces where refraction occurs, i.e., glass surfaces, and enables the separation and modeling of both direct and reflected light components. Compared to existing methods, the proposed method enables more accurate modeling of both glass refraction and the overall scene.

Human Gaussian Splatting: Real-time Rendering of Animatable Avatars

  • paper_url: http://arxiv.org/abs/2311.17113
  • repo_url: None
  • paper_authors: Arthur Moreau, Jifei Song, Helisa Dhamo, Richard Shaw, Yiren Zhou, Eduardo Pérez-Pellitero
  • for: This work targets real-time rendering of photorealistic, animatable human avatars learned from multi-view videos.
  • methods: The body is represented by a set of Gaussian primitives in a canonical space, using 3D Gaussian Splatting as a highly efficient alternative to neural radiance fields, and is deformed in a coarse-to-fine manner that combines forward skinning with local non-rigid refinement (a skinning sketch follows the abstract below).
  • results: The method achieves a PSNR 1.5 dB higher than the state of the art on the THuman4 dataset while rendering at 20 fps or more.
    Abstract This work addresses the problem of real-time rendering of photorealistic human body avatars learned from multi-view videos. While the classical approaches to model and render virtual humans generally use a textured mesh, recent research has developed neural body representations that achieve impressive visual quality. However, these models are difficult to render in real-time and their quality degrades when the character is animated with body poses different from the training observations. We propose the first animatable human model based on 3D Gaussian Splatting, that has recently emerged as a very efficient alternative to neural radiance fields. Our body is represented by a set of gaussian primitives in a canonical space which are deformed in a coarse to fine approach that combines forward skinning and local non-rigid refinement. We describe how to learn our Human Gaussian Splatting model in an end-to-end fashion from multi-view observations, and evaluate it against the state-of-the-art approaches for novel pose synthesis of clothed body. Our method presents a PSNR 1.5 dB better than the state-of-the-art on the THuman4 dataset while being able to render at 20 fps or more.
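
The coarse deformation step, forward linear blend skinning applied to the Gaussian centres, is sketched below; the paper adds a learned local non-rigid refinement on top. Bone transforms and skinning weights are illustrative inputs.

```python
# Forward linear blend skinning of Gaussian centres (coarse step sketch).
import torch

def skin_gaussians(centers, weights, bone_R, bone_t):
    # centers: (N, 3) canonical Gaussian means; weights: (N, K) skinning weights
    # bone_R: (K, 3, 3), bone_t: (K, 3) per-bone rigid transforms for this pose.
    posed_per_bone = torch.einsum('kij,nj->nki', bone_R, centers) + bone_t  # (N, K, 3)
    return (weights.unsqueeze(-1) * posed_per_bone).sum(dim=1)             # blend

N, K = 1000, 24
centers = torch.randn(N, 3)
weights = torch.softmax(torch.randn(N, K), dim=1)   # rows sum to 1
bone_R = torch.eye(3).expand(K, 3, 3).clone()
bone_t = torch.zeros(K, 3)
posed = skin_gaussians(centers, weights, bone_R, bone_t)  # equals centers at rest pose
```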

Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld

  • paper_url: http://arxiv.org/abs/2311.16714
  • repo_url: https://github.com/stevenyangyj/emma-alfworld
  • paper_authors: Yijun Yang, Tianyi Zhou, Kanxue Li, Dapeng Tao, Lusong Li, Li Shen, Xiaodong He, Jing Jiang, Yuhui Shi
  • for: This paper aims to train a vision-language model (VLM) agent to adapt to a visual world by leveraging a large language model (LLM) agent’s reflection outcomes in a text world.
  • methods: The proposed method, called Embodied Multi-Modal Agent (EMMA), finetunes the VLM on the same tasks of the visual world using the LLM’s reflection outcomes in a text world.
  • results: EMMA achieves superior performance compared to state-of-the-art VLM-based agents on diverse tasks, with an improvement rate of 20%-70% in the success rate.
    Abstract While large language models (LLMs) excel in a simulated world of texts, they struggle to interact with the more realistic world without perceptions of other modalities such as visual or audio signals. Although vision-language models (VLMs) integrate LLM modules (1) aligned with static image features, and (2) may possess prior knowledge of world dynamics (as demonstrated in the text world), they have not been trained in an embodied visual world and thus cannot align with its dynamics. On the other hand, training an embodied agent in a noisy visual world without expert guidance is often challenging and inefficient. In this paper, we train a VLM agent living in a visual world using an LLM agent excelling in a parallel text world (but inapplicable to the visual world). Specifically, we distill LLM's reflection outcomes (improved actions by analyzing mistakes) in a text world's tasks to finetune the VLM on the same tasks of the visual world, resulting in an Embodied Multi-Modal Agent (EMMA) quickly adapting to the visual world dynamics. Such cross-modality imitation learning between the two parallel worlds enables EMMA to generalize to a broad scope of new tasks without any further guidance from the LLM expert. Extensive evaluations on the ALFWorld benchmark highlight EMMA's superior performance to SOTA VLM-based agents across diverse tasks, e.g., 20%-70% improvement in the success rate.

Full-resolution MLPs Empower Medical Dense Prediction

  • paper_url: http://arxiv.org/abs/2311.16707
  • repo_url: https://github.com/mungomeng/densepred-fullmlp
  • paper_authors: Mingyuan Meng, Yuxin Xue, Dagan Feng, Lei Bi, Jinman Kim
  • for: This paper targets dense prediction tasks in medical imaging, such as image restoration, registration, and segmentation.
  • methods: The approach uses multi-layer perceptrons (MLPs) starting from the full image resolution, within a full-resolution hierarchical MLP framework (a generic full-resolution mixing sketch follows the abstract below).
  • results: Experiments show that MLPs at full resolution outperform CNN and transformer counterparts, achieving state-of-the-art performance across a range of medical dense prediction tasks.
    Abstract Dense prediction is a fundamental requirement for many medical vision tasks such as medical image restoration, registration, and segmentation. The most popular vision model, Convolutional Neural Networks (CNNs), has reached bottlenecks due to the intrinsic locality of convolution operations. Recently, transformers have been widely adopted for dense prediction for their capability to capture long-range visual dependence. However, due to the high computational complexity and large memory consumption of self-attention operations, transformers are usually used at downsampled feature resolutions. Such usage cannot effectively leverage the tissue-level textural information available only at the full image resolution. This textural information is crucial for medical dense prediction as it can differentiate the subtle human anatomy in medical images. In this study, we hypothesize that Multi-layer Perceptrons (MLPs) are superior alternatives to transformers in medical dense prediction where tissue-level details dominate the performance, as MLPs enable long-range dependence at the full image resolution. To validate our hypothesis, we develop a full-resolution hierarchical MLP framework that uses MLPs beginning from the full image resolution. We evaluate this framework with various MLP blocks on a wide range of medical dense prediction tasks including restoration, registration, and segmentation. Extensive experiments on six public well-benchmarked datasets show that, by simply using MLPs at full resolution, our framework outperforms its CNN and transformer counterparts and achieves state-of-the-art performance on various medical dense prediction tasks.
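
One way an MLP can capture long-range dependence at full image resolution without attention is axial token mixing, sketched below: mix along rows, then columns, then channels, keeping cost linear in each axis. This is a generic stand-in, not necessarily the paper's exact block.

```python
# Axial MLP mixing block at full resolution (generic stand-in).
import torch
import torch.nn as nn

class AxialMLPBlock(nn.Module):
    def __init__(self, channels, height, width):
        super().__init__()
        self.mix_w = nn.Linear(width, width)    # mix along the width axis
        self.mix_h = nn.Linear(height, height)  # mix along the height axis
        self.mix_c = nn.Sequential(nn.Linear(channels, channels * 2), nn.GELU(),
                                   nn.Linear(channels * 2, channels))

    def forward(self, x):                        # x: (B, C, H, W)
        x = x + self.mix_w(x)                    # Linear acts on the last dim (W)
        x = x + self.mix_h(x.transpose(-1, -2)).transpose(-1, -2)
        x = x + self.mix_c(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return x

block = AxialMLPBlock(channels=16, height=64, width=64)
y = block(torch.randn(1, 16, 64, 64))            # full-resolution features in and out
```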

CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs

  • paper_url: http://arxiv.org/abs/2311.16703
  • repo_url: None
  • paper_authors: Haocheng Yuan, Jing Xu, Hao Pan, Adrien Bousseau, Niloy Mitra, Changjian Li
  • for: The goal is semantic commenting of CAD programs: segmenting an input program into code blocks corresponding to semantically meaningful shape parts and assigning a semantic label to each block.
  • methods: The method combines program parsing with visual-semantic analysis enabled by recent foundational language and vision models: executing the input program produces shapes, which are rendered into conditional photorealistic images so that semantic annotators for such images can be used; the information is then distilled across images and linked back to the original program for commenting.
  • results: The approach is evaluated extensively on the new CADTalk dataset (5,280 machine-made and 45 human-made programs with ground-truth comments) against a GPT-based baseline and an open-set shape segmentation baseline (PartSLIP), reaching 83.24% accuracy.
    Abstract CAD programs are a popular way to compactly encode shapes as a sequence of operations that are easy to parametrically modify. However, without sufficient semantic comments and structure, such programs can be challenging to understand, let alone modify. We introduce the problem of semantic commenting CAD programs, wherein the goal is to segment the input program into code blocks corresponding to semantically meaningful shape parts and assign a semantic label to each block. We solve the problem by combining program parsing with visual-semantic analysis afforded by recent advances in foundational language and vision models. Specifically, by executing the input programs, we create shapes, which we use to generate conditional photorealistic images to make use of semantic annotators for such images. We then distill the information across the images and link back to the original programs to semantically comment on them. Additionally, we collected and annotated a benchmark dataset, CADTalk, consisting of 5,280 machine-made programs and 45 human-made programs with ground truth semantic comments to foster future research. We extensively evaluated our approach, compared to a GPT-based baseline approach, and an open-set shape segmentation baseline, i.e., PartSLIP, and reported an 83.24% accuracy on the new CADTalk dataset. Project page: https://enigma-li.github.io/CADTalk/.

Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything Model

  • paper_url: http://arxiv.org/abs/2311.17112
  • repo_url: None
  • paper_authors: Zelin Peng, Zhengqin Xu, Zhilin Zeng, Lingxi Xie, Qi Tian, Wei Shen
  • for: This work aims to improve parameter-efficient fine-tuning (PEFT) for adapting the Segment Anything Model (SAM) to novel downstream scenarios.
  • methods: PEFT is equipped with a cross-block orchestration mechanism: an inter-block communication module integrates a learnable relation matrix to facilitate communication among the coefficient sets of each PEFT block's parameter space, and an intra-block enhancement module introduces a linear projection head whose weights are generated from a hyper-complex layer, strengthening the adjustment of projection directions across the entire parameter space.
  • results: Extensive experiments on diverse benchmarks show consistent, significant segmentation improvements in novel scenarios with only around 1K additional parameters.
    Abstract Parameter-efficient fine-tuning (PEFT) is an effective methodology to unleash the potential of large foundation models in novel scenarios with limited training data. In the computer vision community, PEFT has shown effectiveness in image classification, but little research has studied its ability for image segmentation. Fine-tuning segmentation models usually require a heavier adjustment of parameters to align the proper projection directions in the parameter space for new scenarios. This raises a challenge to existing PEFT algorithms, as they often inject a limited number of individual parameters into each block, which prevents substantial adjustment of the projection direction of the parameter space due to the limitation of Hidden Markov Chain along blocks. In this paper, we equip PEFT with a cross-block orchestration mechanism to enable the adaptation of the Segment Anything Model (SAM) to various downstream scenarios. We introduce a novel inter-block communication module, which integrates a learnable relation matrix to facilitate communication among different coefficient sets of each PEFT block's parameter space. Moreover, we propose an intra-block enhancement module, which introduces a linear projection head whose weights are generated from a hyper-complex layer, further enhancing the impact of the adjustment of projection directions on the entire parameter space. Extensive experiments on diverse benchmarks demonstrate that our proposed approach consistently improves the segmentation performance significantly on novel scenarios with only around 1K additional parameters.

ContextSeg: Sketch Semantic Segmentation by Querying the Context with Attention

  • paper_url: http://arxiv.org/abs/2311.16682
  • repo_url: None
  • paper_authors: Jiawei Wang, Changjian Li
  • for: This work presents a simple yet highly effective approach to sketch semantic segmentation, assigning pre-defined part labels to individual strokes.
  • methods: The approach has two stages: first, an autoencoder network predicts an extra dense distance field to reinforce structural information learning; second, an autoregressive Transformer with the default attention mechanism treats an entire stroke as a single entity and labels groups of strokes within the same semantic part, fully leveraging context when labeling the remaining groups.
  • results: The method achieves the best segmentation accuracy compared with state-of-the-art approaches on two representative datasets; the authors also offer insights into handling part imbalance in the training data and a preliminary cross-category training experiment to inspire future research.
    Abstract Sketch semantic segmentation is a well-explored and pivotal problem in computer vision involving the assignment of pre-defined part labels to individual strokes. This paper presents ContextSeg - a simple yet highly effective approach to tackling this problem with two stages. In the first stage, to better encode the shape and positional information of strokes, we propose to predict an extra dense distance field in an autoencoder network to reinforce structural information learning. In the second stage, we treat an entire stroke as a single entity and label a group of strokes within the same semantic part using an auto-regressive Transformer with the default attention mechanism. By group-based labeling, our method can fully leverage the context information when making decisions for the remaining groups of strokes. Our method achieves the best segmentation accuracy compared with state-of-the-art approaches on two representative datasets and has been extensively evaluated demonstrating its superior performance. Additionally, we offer insights into solving part imbalance in training data and the preliminary experiment on cross-category training, which can inspire future research in this field.

Neural Texture Puppeteer: A Framework for Neural Geometry and Texture Rendering of Articulated Shapes, Enabling Re-Identification at Interactive Speed

  • paper_url: http://arxiv.org/abs/2311.17109
  • repo_url: None
  • paper_authors: Urs Waldmann, Ole Johannsen, Bastian Goldluecke
  • for: The paper proposes a neural rendering pipeline for textured articulated shapes, enabling re-identification of articulated individuals.
  • methods: Geometry and texture encoding are separated: the geometry pipeline learns spatial relationships on the surface of the articulated shape from ground-truth geometric data, while a texture autoencoder uses this information to encode textured images into a global latent code that can be trained separately and used downstream to identify individuals.
  • results: Neural texture rendering and identification run at interactive speeds, and the method extends to real-world data via a synthetic-to-real texture domain shift, reconstructing texture from a single real-world 2D RGB image; it can therefore be applied to endangered species where data is limited. The novel synthetic texture dataset NePuMoo is publicly available to support further development.
    Abstract In this paper, we present a neural rendering pipeline for textured articulated shapes that we call Neural Texture Puppeteer. Our method separates geometry and texture encoding. The geometry pipeline learns to capture spatial relationships on the surface of the articulated shape from ground truth data that provides this geometric information. A texture auto-encoder makes use of this information to encode textured images into a global latent code. This global texture embedding can be efficiently trained separately from the geometry, and used in a downstream task to identify individuals. The neural texture rendering and the identification of individuals run at interactive speeds. To the best of our knowledge, we are the first to offer a promising alternative to CNN- or transformer-based approaches for re-identification of articulated individuals based on neural rendering. Realistic looking novel view and pose synthesis for different synthetic cow textures further demonstrate the quality of our method. Restricted by the availability of ground truth data for the articulated shape's geometry, the quality for real-world data synthesis is reduced. We further demonstrate the flexibility of our model for real-world data by applying a synthetic to real-world texture domain shift where we reconstruct the texture from a real-world 2D RGB image. Thus, our method can be applied to endangered species where data is limited. Our novel synthetic texture dataset NePuMoo is publicly available to inspire further development in the field of neural rendering-based re-identification.

LiveNVS: Neural View Synthesis on Live RGB-D Streams

  • paper_url: http://arxiv.org/abs/2311.16668
  • repo_url: None
  • paper_authors: Laura Fink, Darius Rückert, Linus Franke, Joachim Keinert, Marc Stamminger
  • for: Existing real-time RGB-D reconstruction methods, such as Kinect Fusion, lack real-time photorealistic visualization: imperfect depth maps and camera poses produce noisy, oversmoothed, or incomplete geometry and blurry textures.
  • methods: LiveNVS performs neural novel view synthesis on a live RGB-D input stream with very low latency and real-time rendering: neural features are projected into the target view via a densely fused depth map and aggregated in image space into a target feature map, which a generalizable neural network then translates into a high-quality RGB image.
  • results: LiveNVS achieves state-of-the-art neural rendering quality on unknown scenes during capture, letting users virtually explore the scene and assess reconstruction quality in real time.
    Abstract Existing real-time RGB-D reconstruction approaches, like Kinect Fusion, lack real-time photo-realistic visualization. This is due to noisy, oversmoothed or incomplete geometry and blurry textures which are fused from imperfect depth maps and camera poses. Recent neural rendering methods can overcome many of such artifacts but are mostly optimized for offline usage, hindering the integration into a live reconstruction pipeline. In this paper, we present LiveNVS, a system that allows for neural novel view synthesis on a live RGB-D input stream with very low latency and real-time rendering. Based on the RGB-D input stream, novel views are rendered by projecting neural features into the target view via a densely fused depth map and aggregating the features in image-space to a target feature map. A generalizable neural network then translates the target feature map into a high-quality RGB image. LiveNVS achieves state-of-the-art neural rendering quality of unknown scenes during capturing, allowing users to virtually explore the scene and assess reconstruction quality in real-time.

DGNR: Density-Guided Neural Point Rendering of Large Driving Scenes

  • paper_url: http://arxiv.org/abs/2311.16664
  • repo_url: None
  • paper_authors: Zhuopeng Li, Chenming Wu, Liangjun Zhang, Jianke Zhu
  • for: This paper targets the rendering of large-scale driving scenes, especially those with long trajectories, and proposes a density-guided neural rendering framework (DGNR) to address the challenge.
  • methods: DGNR learns a density space from the scene to guide the construction of a point-based renderer: a differentiable renderer synthesizes images from neural density features obtained from the learned density space, and a density-based fusion module together with geometric regularization optimizes the density space.
  • results: Experiments on a widely used autonomous driving dataset show that the framework synthesizes photorealistic driving scenes and achieves real-time-capable rendering.
    Abstract Despite the recent success of Neural Radiance Field (NeRF), it is still challenging to render large-scale driving scenes with long trajectories, particularly when the rendering quality and efficiency are in high demand. Existing methods for such scenes usually involve spatial warping, geometric supervision from zero-shot normal or depth estimation, or scene division strategies, where the synthesized views are often blurry or fail to meet the requirement of efficient rendering. To address the above challenges, this paper presents a novel framework that learns a density space from the scenes to guide the construction of a point-based renderer, dubbed as DGNR (Density-Guided Neural Rendering). In DGNR, geometric priors are no longer needed, as they can be intrinsically learned from the density space through volumetric rendering. Specifically, we make use of a differentiable renderer to synthesize images from the neural density features obtained from the learned density space. A density-based fusion module and geometric regularization are proposed to optimize the density space. By conducting experiments on a widely used autonomous driving dataset, we have validated the effectiveness of DGNR in synthesizing photorealistic driving scenes and achieving real-time capable rendering.

SCALAR-NeRF: SCAlable LARge-scale Neural Radiance Fields for Scene Reconstruction

  • paper_url: http://arxiv.org/abs/2311.16657
  • repo_url: None
  • paper_authors: Yu Chen, Gim Hee Lee
  • for: This work proposes a scalable approach to large-scale neural scene reconstruction.
  • methods: The method adopts an encoder-decoder architecture: the encoder processes 3D point coordinates to produce encoded features, and the decoder generates geometric values including volume densities of signed distances and colors. A coarse global model is first trained on the entire image dataset; the images are then partitioned into smaller blocks with KMeans, each modeled by a dedicated local model, and the bounding box of each local block is scaled up to enlarge the overlap between blocks. The global decoder is shared across blocks, promoting alignment in the feature space of the local encoders, and the local outputs are fused into the final reconstruction.
  • results: This coarse-to-fine strategy outperforms state-of-the-art NeRF methods and scales to large-scale scene reconstruction.
    Abstract In this work, we introduce SCALAR-NeRF, a novel framework tailored for scalable large-scale neural scene reconstruction. We structure the neural representation as an encoder-decoder architecture, where the encoder processes 3D point coordinates to produce encoded features, and the decoder generates geometric values that include volume densities of signed distances and colors. Our approach first trains a coarse global model on the entire image dataset. Subsequently, we partition the images into smaller blocks using KMeans, with each block being modeled by a dedicated local model. We enhance the overlapping regions across different blocks by scaling up the bounding boxes of each local block. Notably, the decoder from the global model is shared across distinct blocks, thereby promoting alignment in the feature space of local encoders. We propose an effective and efficient methodology to fuse the outputs from these local models to attain the final reconstruction. Employing this refined coarse-to-fine strategy, our method outperforms state-of-the-art NeRF methods and demonstrates scalability for large-scale scene reconstruction. The code will be available on our project page at https://aibluefisher.github.io/SCALAR-NeRF/
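The block-partitioning step can be sketched as follows, assuming the images are clustered via their camera centers (the clustering input and the overlap factor are illustrative assumptions, not the paper's exact choices):

```python
# A minimal sketch of KMeans block partitioning with scaled-up bounding boxes.
import numpy as np
from sklearn.cluster import KMeans

def partition_into_blocks(cam_centers, n_blocks=8, overlap=1.2):
    """Assign each image (via its camera center) to a local block and
    return per-block bounding boxes scaled up to enlarge overlap."""
    labels = KMeans(n_clusters=n_blocks, n_init=10).fit_predict(cam_centers)
    blocks = []
    for b in range(n_blocks):
        pts = cam_centers[labels == b]
        lo, hi = pts.min(axis=0), pts.max(axis=0)
        center, half = (lo + hi) / 2, (hi - lo) / 2
        # Scale the box so neighboring blocks share a transition region.
        blocks.append((center - overlap * half, center + overlap * half))
    return labels, blocks

# Example: 100 cameras in a 50m x 50m x 5m capture volume.
centers = np.random.rand(100, 3) * np.array([50.0, 50.0, 5.0])
labels, blocks = partition_into_blocks(centers)
```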

Augmenting x-ray single particle imaging reconstruction with self-supervised machine learning

  • paper_url: http://arxiv.org/abs/2311.16652
  • repo_url: None
  • paper_authors: Zhantao Chen, Cong Wang, Mingye Gao, Chun Hong Yoon, Jana B. Thayer, Joshua J. Turner
  • for: This work broadens the application of XFELs: single particle imaging (SPI) can probe the structure and dynamics of biological particles in their natural physiological states, without crystallization or cryogenic conditions.
  • methods: An end-to-end, self-supervised machine learning approach recovers particle orientations and estimates reciprocal-space intensities directly from diffraction images.
  • results: The method shows strong robustness under demanding experimental conditions and significantly enhanced reconstruction capabilities compared with conventional algorithms, suggesting a paradigm shift in SPI as currently practiced at XFELs.
    Abstract The development of X-ray Free Electron Lasers (XFELs) has opened numerous opportunities to probe atomic structure and ultrafast dynamics of various materials. Single Particle Imaging (SPI) with XFELs enables the investigation of biological particles in their natural physiological states with unparalleled temporal resolution, while circumventing the need for cryogenic conditions or crystallization. However, reconstructing real-space structures from reciprocal-space x-ray diffraction data is highly challenging due to the absence of phase and orientation information, which is further complicated by weak scattering signals and considerable fluctuations in the number of photons per pulse. In this work, we present an end-to-end, self-supervised machine learning approach to recover particle orientations and estimate reciprocal space intensities from diffraction images only. Our method demonstrates great robustness under demanding experimental conditions with significantly enhanced reconstruction capabilities compared with conventional algorithms, and signifies a paradigm shift in SPI as currently practiced at XFELs.

Parallax-Tolerant Image Stitching with Epipolar Displacement Field

  • paper_url: http://arxiv.org/abs/2311.16637
  • repo_url: None
  • paper_authors: Jian Yu, Yi Yu, Feipeng Da
  • for: Large-parallax image stitching is challenging; existing methods often struggle to preserve both the local and global structure of the image while reducing alignment artifacts and warping distortions.
  • methods: The paper proposes a warping technique based on an epipolar displacement field derived from epipolar geometry: the per-pixel warping rule is first established through the infinite homography, and thin plate splines then model the sliding distance of each warped pixel along its epipolar line, following the principle of local elastic deformation.
  • results: The method achieves high-quality alignment with reduced artifacts and distortions while preserving the projectivity of the panorama; qualitative and quantitative experiments show competitive performance on large-parallax stitching.
    Abstract Large parallax image stitching is a challenging task. Existing methods often struggle to maintain both the local and global structures of the image while reducing alignment artifacts and warping distortions. In this paper, we propose a novel approach that utilizes epipolar geometry to establish a warping technique based on the epipolar displacement field. Initially, the warping rule for pixels in the epipolar geometry is established through the infinite homography. Subsequently, the epipolar displacement field, which represents the sliding distance of the warped pixel along the epipolar line, is formulated by thin plate splines based on the principle of local elastic deformation. The stitching result can be generated by inversely warping the pixels according to the epipolar displacement field. This method incorporates the epipolar constraints in the warping rule, which ensures high-quality alignment and maintains the projectivity of the panorama. Qualitative and quantitative comparative experiments demonstrate the competitiveness of the proposed method in stitching images with large parallax.
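The thin-plate-spline step can be illustrated with SciPy, assuming control points with known sliding distances (e.g., from feature matches) are available; this is a sketch of the general technique, not the paper's implementation:

```python
# Illustrative sketch of modeling per-pixel sliding distances along epipolar
# lines with a thin plate spline; control points/values are assumed inputs.
import numpy as np
from scipy.interpolate import RBFInterpolator

def epipolar_displacement_field(ctrl_pts, ctrl_disp, query_pts):
    """Fit a thin plate spline to sparse sliding distances and evaluate
    it densely.

    ctrl_pts:  (M, 2) pixel locations of matched control points
    ctrl_disp: (M,)   signed sliding distance of each control point
               along its epipolar line
    query_pts: (N, 2) pixel locations to warp
    """
    tps = RBFInterpolator(ctrl_pts, ctrl_disp, kernel="thin_plate_spline")
    return tps(query_pts)  # (N,) smooth displacement field

# Example with synthetic control points on a 640x480 image.
rng = np.random.default_rng(0)
ctrl = rng.uniform([0, 0], [640, 480], size=(50, 2))
disp = np.sin(ctrl[:, 0] / 100.0)          # toy sliding distances
ys, xs = np.mgrid[0:480:8, 0:640:8]
dense = epipolar_displacement_field(
    ctrl, disp, np.stack([xs.ravel(), ys.ravel()], axis=1))
```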

MotionZero:Exploiting Motion Priors for Zero-shot Text-to-Video Generation

  • paper_url: http://arxiv.org/abs/2311.16635
  • repo_url: None
  • paper_authors: Sitong Su, Litao Guo, Lianli Gao, Hengtao Shen, Jingkuan Song
  • for: The paper addresses zero-shot text-to-video synthesis, where no video examples are available, by exploiting the motion priors implied in prompts to control the motion of different objects.
  • methods: MotionZero derives object-specific motion priors from prompts with Large Language Models and applies disentangled, prompt-adaptive motion control to the corresponding regions; a Motion-Aware Attention scheme further adjusts attention among frames according to motion amplitude.
  • results: Experiments show that MotionZero correctly controls the motion of different objects and supports versatile applications, including zero-shot video editing.
    Abstract Zero-shot Text-to-Video synthesis generates videos based on prompts without any videos. Without motion information from videos, motion priors implied in prompts are vital guidance. For example, the prompt "airplane landing on the runway" indicates motion priors that the "airplane" moves downwards while the "runway" stays static. Whereas the motion priors are not fully exploited in previous approaches, thus leading to two nontrivial issues: 1) the motion variation pattern remains unaltered and prompt-agnostic for disregarding motion priors; 2) the motion control of different objects is inaccurate and entangled without considering the independent motion priors of different objects. To tackle the two issues, we propose a prompt-adaptive and disentangled motion control strategy coined as MotionZero, which derives motion priors from prompts of different objects by Large-Language-Models and accordingly applies motion control of different objects to corresponding regions in disentanglement. Furthermore, to facilitate videos with varying degrees of motion amplitude, we propose a Motion-Aware Attention scheme which adjusts attention among frames by motion amplitude. Extensive experiments demonstrate that our strategy could correctly control motion of different objects and support versatile applications including zero-shot video edit.

On the Calibration of Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.17105
  • repo_url: https://github.com/leob03/HRC_extrinsic_calib
  • paper_authors: Kerui Gu, Rongyu Chen, Angela Yao
  • for: This paper targets miscalibration in 2D human pose estimation, i.e., adjusting and improving keypoint confidence so that it aligns with pose accuracy.
  • methods: Through theoretical analysis and experiments, the paper examines why current pose estimation methods are miscalibrated and proposes a Calibrated ConfidenceNet (CCNet) that enforces consistency between confidence and pose accuracy.
  • results: CCNet improves AP by up to 1.4% on off-the-shelf pose estimation frameworks and, applied to the downstream task of mesh recovery, yields an additional 1.0mm decrease in 3D keypoint error.
    Abstract Most 2D human pose estimation frameworks estimate keypoint confidence in an ad-hoc manner, using heuristics such as the maximum value of heatmaps. The confidence is part of the evaluation scheme, e.g., AP for the MSCOCO dataset, yet has been largely overlooked in the development of state-of-the-art methods. This paper takes the first steps in addressing miscalibration in pose estimation. From a calibration point of view, the confidence should be aligned with the pose accuracy. In practice, existing methods are poorly calibrated. We show, through theoretical analysis, why a miscalibration gap exists and how to narrow the gap. Simply predicting the instance size and adjusting the confidence function gives considerable AP improvements. Given the black-box nature of deep neural networks, however, it is not possible to fully close this gap with only closed-form adjustments. As such, we go one step further and learn network-specific adjustments by enforcing consistency between confidence and pose accuracy. Our proposed Calibrated ConfidenceNet (CCNet) is a light-weight post-hoc addition that improves AP by up to 1.4% on off-the-shelf pose estimation frameworks. Applied to the downstream task of mesh recovery, CCNet facilitates an additional 1.0mm decrease in 3D keypoint error.
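In the spirit of CCNet, a light post-hoc calibration head might map raw heatmap confidence plus a predicted instance size to a calibrated score trained to match a pose-accuracy target such as per-keypoint OKS; the feature choices and target below are assumptions, not the paper's exact design:

```python
# A hypothetical sketch of a light post-hoc confidence calibration head.
import torch
import torch.nn as nn

class CalibrationHead(nn.Module):
    def __init__(self, n_keypoints=17, hidden=64):
        super().__init__()
        # Input per instance: raw per-keypoint confidences + log instance size.
        self.mlp = nn.Sequential(
            nn.Linear(n_keypoints + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, n_keypoints), nn.Sigmoid(),
        )

    def forward(self, raw_conf, instance_size):
        x = torch.cat([raw_conf, instance_size.log().unsqueeze(-1)], dim=-1)
        return self.mlp(x)  # calibrated per-keypoint confidence in [0, 1]

# Training step: regress calibrated confidence toward per-keypoint OKS.
head = CalibrationHead()
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
raw = torch.rand(32, 17)          # frozen detector's heatmap maxima
size = torch.rand(32) * 100 + 50  # predicted instance sizes (pixels)
oks = torch.rand(32, 17)          # keypoint similarity vs. ground truth
loss = nn.functional.mse_loss(head(raw, size), oks)
loss.backward(); opt.step()
```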

Visual Semantic Navigation with Real Robots

  • paper_url: http://arxiv.org/abs/2311.16623
  • repo_url: https://github.com/gramuah/ros4vsn
  • paper_authors: Carlos Gutiérrez-Álvarez, Pablo Ríos-Navarro, Rafael Flor-Rodríguez, Francisco Javier Acevedo-Rodríguez, Roberto J. López-Sastre
  • for: This work integrates Visual Semantic Navigation (VSN) models into real-world robots to build true embodied agents.
  • methods: The authors propose a new solution for integrating VSN agents into ROS-compatible robots and release ROS4VSN, a ROS-based framework that lets any VSN agent be easily deployed and tested on any ROS-compatible robot in a real setting.
  • results: Experiments embedding two state-of-the-art VSN agents in two different robots show a noticeable performance difference between real-world and simulated environments.
    Abstract Visual Semantic Navigation (VSN) is the ability of a robot to learn visual semantic information for navigating in unseen environments. These VSN models are typically tested in those virtual environments where they are trained, mainly using reinforcement learning based approaches. Therefore, we do not yet have an in-depth analysis of how these models would behave in the real world. In this work, we propose a new solution to integrate VSN models into real robots, so that we have true embodied agents. We also release a novel ROS-based framework for VSN, ROS4VSN, so that any VSN-model can be easily deployed in any ROS-compatible robot and tested in a real setting. Our experiments with two different robots, where we have embedded two state-of-the-art VSN agents, confirm that there is a noticeable performance difference of these VSN solutions when tested in real-world and simulation environments. We hope that this research provides a foundation for addressing this consequential issue, with the ultimate aim of advancing the performance and efficiency of embodied agents within authentic real-world scenarios. Code to reproduce all our experiments can be found at https://github.com/gramuah/ros4vsn.
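A minimal sketch of wrapping a VSN policy as a ROS node is shown below; the topic names, velocities, and agent interface are assumptions for illustration, and the ROS4VSN repository defines the actual API:

```python
#!/usr/bin/env python
# Hypothetical sketch of a VSN agent as a ROS node (not the ROS4VSN code).
import rospy
from sensor_msgs.msg import Image
from geometry_msgs.msg import Twist
from cv_bridge import CvBridge

class VSNNode:
    def __init__(self, agent):
        self.agent = agent            # any policy: RGB image -> discrete action
        self.bridge = CvBridge()
        self.cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
        rospy.Subscriber("/camera/rgb/image_raw", Image, self.on_image)

    def on_image(self, msg):
        rgb = self.bridge.imgmsg_to_cv2(msg, desired_encoding="rgb8")
        action = self.agent.act(rgb)  # e.g., "forward", "left", "right", "stop"
        cmd = Twist()
        if action == "forward":
            cmd.linear.x = 0.25
        elif action == "left":
            cmd.angular.z = 0.5
        elif action == "right":
            cmd.angular.z = -0.5
        self.cmd_pub.publish(cmd)

if __name__ == "__main__":
    rospy.init_node("vsn_agent")
    # VSNNode(my_agent)  # plug in any trained VSN policy here
    rospy.spin()
```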

Cross-level Attention with Overlapped Windows for Camouflaged Object Detection

  • paper_url: http://arxiv.org/abs/2311.16618
  • repo_url: None
  • paper_authors: Jiepan Li, Fangxiao Lu, Nan Xue, Zhuohong Li, Hongyan Zhang, Wei He
  • for: This work aims to improve the accuracy of camouflaged object detection (COD).
  • methods: The method fuses high-level semantic features with low-level detail features and proposes an overlapped window cross-level attention (OWinCA) that enhances low-level features under the guidance of the highest-level features.
  • results: Experiments on three large-scale COD datasets show that the proposed OWinCANet significantly surpasses state-of-the-art COD methods.
    Abstract Camouflaged objects adaptively fit their color and texture with the environment, which makes them indistinguishable from the surroundings. Current methods revealed that high-level semantic features can highlight the differences between camouflaged objects and the backgrounds. Consequently, they integrate high-level semantic features with low-level detailed features for accurate camouflaged object detection (COD). Unlike previous designs for multi-level feature fusion, we argue that enhancing low-level features is more pressing for COD. In this paper, we propose an overlapped window cross-level attention (OWinCA) to achieve low-level feature enhancement guided by the highest-level features. By sliding an aligned window pair on both the highest- and low-level feature maps, the high-level semantics are explicitly integrated into the low-level details via cross-level attention. Additionally, it employs an overlapped window partition strategy to alleviate the incoherence among windows, which prevents the loss of global information. These design choices enable the proposed OWinCA to enhance low-level features by promoting the separability of camouflaged objects. The proposed OWinCANet fuses these enhanced multi-level features by simple convolution operation to achieve the final COD. Experiments conducted on three large-scale COD datasets demonstrate that our OWinCANet significantly surpasses the current state-of-the-art COD methods.
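A simplified, non-overlapped sketch of window-based cross-level attention is given below (the paper additionally overlaps the windows to avoid incoherence between them); the shapes, window size, and the use of a freshly constructed attention module are illustrative assumptions:

```python
# A simplified sketch of window-based cross-level attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

def window_cross_attention(low_feat, high_feat, win=8, heads=4):
    """low_feat: (B, C, H, W) detail features (queries)
    high_feat:  (B, C, h, w) highest-level semantics (keys/values)."""
    B, C, H, W = low_feat.shape
    high_up = F.interpolate(high_feat, size=(H, W), mode="bilinear",
                            align_corners=False)

    def to_windows(x):  # (B, C, H, W) -> (B * nWin, win*win, C)
        x = x.reshape(B, C, H // win, win, W // win, win)
        x = x.permute(0, 2, 4, 3, 5, 1)
        return x.reshape(-1, win * win, C)

    q, kv = to_windows(low_feat), to_windows(high_up)
    # In a real model this would be a trained submodule, not built per call.
    attn = nn.MultiheadAttention(C, heads, batch_first=True)
    out, _ = attn(q, kv, kv)  # semantics injected into detail tokens

    # Fold windows back to (B, C, H, W).
    out = out.reshape(B, H // win, W // win, win, win, C)
    return out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)

x_low = torch.randn(2, 32, 64, 64)
x_high = torch.randn(2, 32, 8, 8)
y = window_cross_attention(x_low, x_high)
```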

Filter-Pruning of Lightweight Face Detectors Using a Geometric Median Criterion

  • paper_url: http://arxiv.org/abs/2311.16613
  • repo_url: https://github.com/idt-iti/lightweight-face-detector-pruning
  • paper_authors: Konstantinos Gkrispanis, Nikolaos Gkalelis, Vasileios Mezaris
  • for: This paper targets compact face detection models, obtained through filter pruning, that run efficiently on edge devices with limited processing power and memory.
  • methods: Filter Pruning via Geometric Median (FPGM) is combined with the iterative Soft Filter Pruning (SFP) procedure and applied to two already small and compact face detectors, EXTD and EResFD; L1-norm pruning is used as a baseline for comparison.
  • results: Experiments show that the proposed approach can further reduce the model size of already lightweight face detectors with limited accuracy loss, or even small accuracy gains at low pruning rates.
    Abstract Face detectors are becoming a crucial component of many applications, including surveillance, that often have to run on edge devices with limited processing power and memory. Therefore, there's a pressing demand for compact face detection models that can function efficiently across resource-constrained devices. Over recent years, network pruning techniques have attracted a lot of attention from researchers. These methods haven't been well examined in the context of face detectors, despite their expanding popularity. In this paper, we implement filter pruning on two already small and compact face detectors, named EXTD (Extremely Tiny Face Detector) and EResFD (Efficient ResNet Face Detector). The main pruning algorithm that we utilize is Filter Pruning via Geometric Median (FPGM), combined with the Soft Filter Pruning (SFP) iterative procedure. We also apply L1 Norm pruning, as a baseline to compare with the proposed approach. The experimental evaluation on the WIDER FACE dataset indicates that the proposed approach has the potential to further reduce the model size of already lightweight face detectors, with limited accuracy loss, or even with small accuracy gain for low pruning rates.
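The core FPGM criterion is easy to sketch: filters whose total distance to all other filters in a layer is smallest lie nearest the geometric median and are considered most redundant. A minimal illustration (not the paper's code):

```python
# A minimal sketch of FPGM-style filter selection.
import torch

def fpgm_prune_indices(conv_weight, prune_ratio=0.3):
    """conv_weight: (out_channels, in_channels, kH, kW) tensor.
    Returns indices of filters to prune (closest to the geometric median)."""
    n = conv_weight.shape[0]
    flat = conv_weight.reshape(n, -1)
    dists = torch.cdist(flat, flat)          # pairwise Euclidean distances
    redundancy = dists.sum(dim=1)            # small sum => near the median
    n_prune = int(n * prune_ratio)
    return torch.argsort(redundancy)[:n_prune]

# Example on a random conv layer; under SFP, the selected filters are
# softly zeroed each epoch and allowed to recover until training ends.
w = torch.randn(64, 32, 3, 3)
to_prune = fpgm_prune_indices(w)
w[to_prune] = 0.0
```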

Empowering COVID-19 Detection: Optimizing Performance Through Fine-Tuned EfficientNet Deep Learning Architecture

  • paper_url: http://arxiv.org/abs/2311.16593
  • repo_url: None
  • paper_authors: Md. Alamin Talukder, Md. Abu Layek, Mohsin Kazi, Md Ashraf Uddin, Sunil Aryal
  • for: This study aims to detect COVID-19 patients from chest X-ray images quickly and accurately, helping clinicians with rapid and precise diagnosis.
  • methods: Deep learning is applied to chest X-rays by fine-tuning established transfer learning models with appropriate layers.
  • results: The fine-tuned EfficientNetB4 reached 100% accuracy on a COVID-19 X-ray dataset and, on a chest X-ray lung disease dataset, achieved 99.17% accuracy, 99.13% precision, and 99.16% recall, demonstrating both high accuracy and efficiency.
    Abstract The worldwide COVID-19 pandemic has profoundly influenced the health and everyday experiences of individuals across the planet. It is a highly contagious respiratory disease requiring early and accurate detection to curb its rapid transmission. Initial testing methods primarily revolved around identifying the genetic composition of the coronavirus, exhibiting a relatively low detection rate and requiring a time-intensive procedure. To address this challenge, experts have suggested using radiological imagery, particularly chest X-rays, as a valuable approach within the diagnostic protocol. This study investigates the potential of leveraging radiographic imaging (X-rays) with deep learning algorithms to swiftly and precisely identify COVID-19 patients. The proposed approach elevates the detection accuracy by fine-tuning with appropriate layers on various established transfer learning models. The experimentation was conducted on a COVID-19 X-ray dataset containing 2000 images. The fine-tuned EfficientNetB4 model achieved an impressive accuracy of 100%, showcasing its potential as a robust COVID-19 detection model. Furthermore, EfficientNetB4 excelled in identifying lung disease using a chest X-ray dataset containing 4,350 images, achieving remarkable performance with an accuracy of 99.17%, precision of 99.13%, recall of 99.16%, and f1-score of 99.14%. These results highlight the promise of fine-tuned transfer learning for efficient lung disease detection through medical imaging, especially with X-ray images. This research offers radiologists an effective means of aiding rapid and precise COVID-19 diagnosis and contributes valuable assistance for healthcare professionals in accurately identifying affected patients.
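A hedged sketch of this fine-tuning recipe in Keras follows; the dataset path, image size, head design, and hyperparameters are illustrative assumptions, not the paper's exact configuration:

```python
# A minimal EfficientNetB4 fine-tuning sketch (assumed setup, not the paper's).
import tensorflow as tf

base = tf.keras.applications.EfficientNetB4(
    include_top=False, weights="imagenet", input_shape=(380, 380, 3))
base.trainable = True  # fine-tune with appropriate layers unfrozen

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(2, activation="softmax"),  # COVID vs. normal
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])

train_ds = tf.keras.utils.image_dataset_from_directory(
    "covid_xray/train", image_size=(380, 380), label_mode="categorical")
model.fit(train_ds, epochs=10)
```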

Improving Lane Detection Generalization: A Novel Framework using HD Maps for Boosting Diversity

  • paper_url: http://arxiv.org/abs/2311.16589
  • repo_url: None
  • paper_authors: Daeun Lee, Minhyeok Heo, Jiwon Kim
  • for: Improve the generalization and reliability of lane detection algorithms across varying road environments.
  • methods: A single-source domain generalization framework that decomposes data into lane structures and surroundings, enhances diversity using High-Definition (HD) maps and generative models, and strategically selects a core subset of data to maximize diversity and optimize performance.
  • results: Experiments show that the framework enhances the generalization performance of lane detection, comparable to domain adaptation-based methods.
    Abstract Lane detection is a vital task for vehicles to navigate and localize their position on the road. To ensure reliable results, lane detection algorithms must have robust generalization performance in various road environments. However, despite the significant performance improvement of deep learning-based lane detection algorithms, their generalization performance in response to changes in road environments still falls short of expectations. In this paper, we present a novel framework for single-source domain generalization (SSDG) in lane detection. By decomposing data into lane structures and surroundings, we enhance diversity using High-Definition (HD) maps and generative models. Rather than expanding data volume, we strategically select a core subset of data, maximizing diversity and optimizing performance. Our extensive experiments demonstrate that our framework enhances the generalization performance of lane detection, comparable to the domain adaptation-based method.

Robust Diffusion GAN using Semi-Unbalanced Optimal Transport

  • paper_url: http://arxiv.org/abs/2311.17101
  • repo_url: None
  • paper_authors: Quan Dao, Binh Ta, Tung Pham, Anh Tran
  • for: Improve the robustness and performance of diffusion models so that they hold up in practical applications, including on datasets corrupted with outliers.
  • methods: A robust training technique based on semi-unbalanced optimal transport that effectively mitigates the impact of outlier samples.
  • results: RDGAN outperforms vanilla DDGAN in image quality, mode coverage of the distribution, and inference speed, and exhibits improved robustness on both clean and outlier-corrupted datasets.
    Abstract Diffusion models, a type of generative model, have demonstrated great potential for synthesizing highly detailed images. By integrating with GAN, advanced diffusion models like DDGAN (Xiao et al., 2022) could approach real-time performance for expansive practical applications. While DDGAN has effectively addressed the challenges of generative modeling, namely producing high-quality samples, covering different data modes, and achieving faster sampling, it remains susceptible to performance drops caused by datasets that are corrupted with outlier samples. This work introduces a robust training technique based on semi-unbalanced optimal transport to mitigate the impact of outliers effectively. Through comprehensive evaluations, we demonstrate that our robust diffusion GAN (RDGAN) outperforms vanilla DDGAN in terms of the aforementioned generative modeling criteria, i.e., image quality, mode coverage of distribution, and inference speed, and exhibits improved robustness when dealing with both clean and corrupted datasets.
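To see why semi-unbalanced optimal transport down-weights outliers, consider a toy entropic OT sketch in which only one marginal is softly enforced via a KL penalty; this is purely illustrative and not the RDGAN training code:

```python
# A toy sketch of semi-unbalanced entropic OT: the source (data) marginal is
# only softly enforced, so outliers can shed transported mass, while the
# target marginal stays exact.
import numpy as np

def semi_unbalanced_sinkhorn(cost, a, b, eps=0.05, tau=1.0, iters=200):
    """cost: (n, m) cost matrix; a: (n,) source weights (relaxed);
    b: (m,) target weights (enforced exactly)."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    fi = tau / (tau + eps)  # relaxation exponent for the soft marginal
    for _ in range(iters):
        v = b / (K.T @ u)               # hard constraint on b
        u = (a / (K @ v)) ** fi         # soft (KL-penalized) constraint on a
    return u[:, None] * K * v[None, :]  # transport plan

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2)); x[:5] += 8.0      # 10% outliers in the "data"
y = rng.normal(size=(50, 2))                    # "model" samples
C = ((x[:, None] - y[None, :]) ** 2).sum(-1)
P = semi_unbalanced_sinkhorn(C, np.full(50, 1/50), np.full(50, 1/50))
print(P.sum(axis=1)[:5])  # outliers receive reduced transported mass
```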

GeoScaler: Geometry and Rendering-Aware Downsampling of 3D Mesh Textures

  • paper_url: http://arxiv.org/abs/2311.16581
  • repo_url: None
  • paper_authors: Sai Karthikey Pentapati, Anshul Rai, Arkady Ten, Chaitanya Atluru, Alan Bovik
  • for: Preserve image quality and level of detail when rendering 3D scenes whose texture maps must be downsampled.
  • methods: GeoScaler downsamples the texture maps of 3D meshes while incorporating geometric cues from the mesh and its UV parametrization, and maximizes the visual fidelity of the rendered views of the textured meshes.
  • results: Textures generated by GeoScaler deliver significantly better-quality rendered images than those produced by traditional downsampling methods.
    Abstract High-resolution texture maps are necessary for representing real-world objects accurately with 3D meshes. The large sizes of textures can bottleneck the real-time rendering of high-quality virtual 3D scenes on devices having low computational budgets and limited memory. Downsampling the texture maps directly addresses the issue, albeit at the cost of visual fidelity. Traditionally, downsampling of texture maps is performed using methods like bicubic interpolation and the Lanczos algorithm. These methods ignore the geometric layout of the mesh and its UV parametrization and also do not account for the rendering process used to obtain the final visualization that the users will experience. Towards filling these gaps, we introduce GeoScaler, a method of downsampling texture maps of 3D meshes while incorporating geometric cues and maximizing the visual fidelity of the rendered views of the textured meshes. We show that the textures generated by GeoScaler deliver significantly better quality rendered images compared to those generated by traditional downsampling methods.

Clean Label Disentangling for Medical Image Segmentation with Noisy Labels

  • paper_url: http://arxiv.org/abs/2311.16580
  • repo_url: https://github.com/xiaoyao3302/2bdenoise
  • paper_authors: Zicheng Wang, Zhen Zhao, Erjian Guo, Luping Zhou
  • for: Address the noisy label problem in medical image segmentation to improve segmentation accuracy and reliability.
  • methods: A simple yet efficient class-balanced sampling strategy is proposed and extended into a novel noisy feature-aided clean label disentangling framework.
  • results: Experiments validate the effectiveness of the method, which achieves new state-of-the-art performance. Code is available at https://github.com/xiaoyao3302/2BDenoise.
    Abstract Current methods focusing on medical image segmentation suffer from incorrect annotations, which is known as the noisy label issue. Most medical image segmentation with noisy labels methods utilize either noise transition matrix, noise-robust loss functions or pseudo-labeling methods, while none of the current research focuses on clean label disentanglement. We argue that the main reason is that the severe class-imbalanced issue will lead to the inaccuracy of the selected ``clean'' labels, thus influencing the robustness of the model against the noises. In this work, we come up with a simple but efficient class-balanced sampling strategy to tackle the class-imbalanced problem, which enables our newly proposed clean label disentangling framework to successfully select clean labels from the given label sets and encourages the model to learn from the correct annotations. However, such a method will filter out too many annotations which may also contain useful information. Therefore, we further extend our clean label disentangling framework to a new noisy feature-aided clean label disentangling framework, which takes the full annotations into utilization to learn more semantics. Extensive experiments have validated the effectiveness of our methods, where our methods achieve new state-of-the-art performance. Our code is available at https://github.com/xiaoyao3302/2BDenoise.
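The class-balanced sampling ingredient can be illustrated with a standard weighted sampler: samples are drawn with probability inversely proportional to their class frequency, so rare classes are not drowned out when selecting "clean" annotations. This is a generic sketch, not the paper's exact sampler:

```python
# A minimal sketch of class-balanced sampling for imbalanced labels.
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

labels = torch.randint(0, 3, (1000,))            # stand-in patch labels
class_counts = torch.bincount(labels).float()
weights = (1.0 / class_counts)[labels]           # rare class => high weight

sampler = WeightedRandomSampler(weights, num_samples=len(labels),
                                replacement=True)
ds = TensorDataset(torch.randn(1000, 16), labels)
loader = DataLoader(ds, batch_size=64, sampler=sampler)

batch_x, batch_y = next(iter(loader))
print(torch.bincount(batch_y))  # roughly balanced across classes
```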

Efficient Key-Based Adversarial Defense for ImageNet by Using Pre-trained Model

  • paper_url: http://arxiv.org/abs/2311.16577
  • repo_url: None
  • paper_authors: AprilPyone MaungMaung, Isao Echizen, Hitoshi Kiya
  • for: This paper proposes proliferating key-based defense models by leveraging pre-trained models and recent efficient fine-tuning techniques on ImageNet-1k classification.
  • methods: The work builds on recent model deployment advances such as Apple CoreML and applies efficient fine-tuning to proliferate key-based models even on limited computing resources, improving defense performance.
  • results: Experiments on ImageNet-1k show that the proposed fine-tuned key-based models achieve superior classification accuracy (more than a 10% increase) over previous key-based models on clean and adversarial examples.
    Abstract In this paper, we propose key-based defense model proliferation by leveraging pre-trained models and utilizing recent efficient fine-tuning techniques on ImageNet-1k classification. First, we stress that deploying key-based models on edge devices is feasible with the latest model deployment advancements, such as Apple CoreML, although the mainstream enterprise edge artificial intelligence (Edge AI) has been focused on the Cloud. Then, we point out that the previous key-based defense on on-device image classification is impractical for two reasons: (1) training many classifiers from scratch is not feasible, and (2) key-based defenses still need to be thoroughly tested on large datasets like ImageNet. To this end, we propose to leverage pre-trained models and utilize efficient fine-tuning techniques to proliferate key-based models even on limited computing resources. Experiments were carried out on the ImageNet-1k dataset using adaptive and non-adaptive attacks. The results show that our proposed fine-tuned key-based models achieve a superior classification accuracy (more than 10% increase) compared to the previous key-based models on classifying clean and adversarial examples.

MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices

  • paper_url: http://arxiv.org/abs/2311.16567
  • repo_url: None
  • paper_authors: Yang Zhao, Yanwu Xu, Zhisheng Xiao, Tingbo Hou
  • for: Make large-scale text-to-image diffusion models practical on mobile devices by reducing model size and inference time.
  • methods: Extensive optimizations of the architecture and sampling techniques improve computational efficiency, while distillation and diffusion-GAN finetuning enable 8-step and 1-step inference, respectively.
  • results: MobileDiffusion achieves sub-second inference for generating a 512x512 image on mobile devices, establishing a new state of the art.
    Abstract The deployment of large-scale text-to-image diffusion models on mobile devices is impeded by their substantial model size and slow inference speed. In this paper, we propose \textbf{MobileDiffusion}, a highly efficient text-to-image diffusion model obtained through extensive optimizations in both architecture and sampling techniques. We conduct a comprehensive examination of model architecture design to reduce redundancy, enhance computational efficiency, and minimize model's parameter count, while preserving image generation quality. Additionally, we employ distillation and diffusion-GAN finetuning techniques on MobileDiffusion to achieve 8-step and 1-step inference respectively. Empirical studies, conducted both quantitatively and qualitatively, demonstrate the effectiveness of our proposed techniques. MobileDiffusion achieves a remarkable \textbf{sub-second} inference speed for generating a $512\times512$ image on mobile devices, establishing a new state of the art.

Egocentric Whole-Body Motion Capture with FisheyeViT and Diffusion-Based Motion Refinement

  • paper_url: http://arxiv.org/abs/2311.16495
  • repo_url: https://github.com/jianwang-mpi/egowholemocap
  • paper_authors: Jian Wang, Zhe Cao, Diogo Luvizon, Lingjie Liu, Kripasindhu Sarkar, Danhang Tang, Thabo Beeler, Christian Theobalt
  • for: This work explores egocentric whole-body motion capture from a single fisheye camera, estimating human body and hand motion simultaneously.
  • methods: FisheyeViT extracts fisheye image features, which are converted into pixel-aligned 3D heatmap representations for body pose prediction; dedicated hand detection and hand pose estimation networks regress 3D hand poses; and a diffusion-based whole-body motion prior refines the estimated motion while accounting for joint uncertainties.
  • results: The networks are trained on EgoWholeBody, a large synthetic dataset of 840,000 high-quality egocentric images; quantitative and qualitative evaluations show high-quality whole-body motion estimates from a single egocentric camera.
    Abstract In this work, we explore egocentric whole-body motion capture using a single fisheye camera, which simultaneously estimates human body and hand motion. This task presents significant challenges due to three factors: the lack of high-quality datasets, fisheye camera distortion, and human body self-occlusion. To address these challenges, we propose a novel approach that leverages FisheyeViT to extract fisheye image features, which are subsequently converted into pixel-aligned 3D heatmap representations for 3D human body pose prediction. For hand tracking, we incorporate dedicated hand detection and hand pose estimation networks for regressing 3D hand poses. Finally, we develop a diffusion-based whole-body motion prior model to refine the estimated whole-body motion while accounting for joint uncertainties. To train these networks, we collect a large synthetic dataset, EgoWholeBody, comprising 840,000 high-quality egocentric images captured across a diverse range of whole-body motion sequences. Quantitative and qualitative evaluations demonstrate the effectiveness of our method in producing high-quality whole-body motion estimates from a single egocentric camera.

DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser

  • paper_url: http://arxiv.org/abs/2311.16565
  • repo_url: None
  • paper_authors: Peng Chen, Xiaobao Wei, Ming Lu, Yitong Zhu, Naiming Yao, Xingyu Xiao, Hui Chen
  • for: This work proposes a diffusion-based speech-driven 3D facial animation system that personalizes facial animation via contrastive learning and accelerates animation generation via knowledge distillation.
  • methods: A learnable talking identity aggregates knowledge from audio sequences, and the resulting identity embeddings extract customized facial cues across different people in a contrastive learning manner; at inference, users obtain personalized facial animation reflecting a specific talking style.
  • results: Experiments show that the method outperforms prior approaches; a diffusion model trained with hundreds of steps is distilled into a lightweight 8-step model for acceleration. Code will be released.
    Abstract Speech-driven 3D facial animation has been an attractive task in both academia and industry. Traditional methods mostly focus on learning a deterministic mapping from speech to animation. Recent approaches start to consider the non-deterministic fact of speech-driven 3D face animation and employ the diffusion model for the task. However, personalizing facial animation and accelerating animation generation are still two major limitations of existing diffusion-based methods. To address the above limitations, we propose DiffusionTalker, a diffusion-based method that utilizes contrastive learning to personalize 3D facial animation and knowledge distillation to accelerate 3D animation generation. Specifically, to enable personalization, we introduce a learnable talking identity to aggregate knowledge in audio sequences. The proposed identity embeddings extract customized facial cues across different people in a contrastive learning manner. During inference, users can obtain personalized facial animation based on input audio, reflecting a specific talking style. With a trained diffusion model with hundreds of steps, we distill it into a lightweight model with 8 steps for acceleration. Extensive experiments are conducted to demonstrate that our method outperforms state-of-the-art methods. The code will be released.
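A common recipe for contrastive identity learning pulls embeddings of clips from the same speaker together and pushes different speakers apart (InfoNCE); the pairing scheme below is an assumption about the general technique, not the paper's exact loss:

```python
# A hedged sketch of contrastive identity learning with InfoNCE.
import torch
import torch.nn.functional as F

def identity_infonce(z1, z2, temperature=0.07):
    """z1, z2: (B, D) identity embeddings of two audio clips per speaker;
    row i of z1 and z2 come from the same speaker (positive pair)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(z1.shape[0])       # positives on the diagonal
    return F.cross_entropy(logits, targets)

z_a = torch.randn(8, 128, requires_grad=True)
z_b = torch.randn(8, 128)
loss = identity_infonce(z_a, z_b)
loss.backward()
```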

Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.16555
  • repo_url: None
  • paper_authors: Ling Fu, Zijie Wu, Yingying Zhu, Yuliang Liu, Xiang Bai
  • for: Improve the performance of scene text detectors with more realistic synthetic training data.
  • methods: A diffusion model blends foreground text with the intrinsic features of the background to generate more realistic text images, together with two strategies for producing visually coherent text with fewer spelling errors.
  • results: Text images generated by DiffText are more effective than other synthetic data for detecting horizontal, rotated, curved, and line-level text.
    Abstract Scene text detection techniques have garnered significant attention due to their wide-ranging applications. However, existing methods have a high demand for training data, and obtaining accurate human annotations is labor-intensive and time-consuming. As a solution, researchers have widely adopted synthetic text images as a complementary resource to real text images during pre-training. Yet there is still room for synthetic datasets to enhance the performance of scene text detectors. We contend that one main limitation of existing generation methods is the insufficient integration of foreground text with the background. To alleviate this problem, we present the Diffusion Model based Text Generator (DiffText), a pipeline that utilizes the diffusion model to seamlessly blend foreground text regions with the background's intrinsic features. Additionally, we propose two strategies to generate visually coherent text with fewer spelling errors. With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors. Extensive experiments on detecting horizontal, rotated, curved, and line-level texts demonstrate the effectiveness of DiffText in producing realistic text images.

Robust Transductive Few-shot Learning via Joint Message Passing and Prototype-based Soft-label Propagation

  • paper_url: http://arxiv.org/abs/2311.17096
  • repo_url: None
  • paper_authors: Jiahui Wang, Qin Xu, Bo Jiang, Bin Luo
  • for: Develop a few-shot learning model that can generalize to new classes from only a few support samples.
  • methods: The method combines two common principles, prototype learning and label propagation: representative prototypes are first learned from the support set, query labels are estimated from the distances between query samples and prototypes, and soft labels are then propagated over a learned query-support graph; a new joint message passing scheme learns sample representations and the relational graph jointly.
  • results: The method achieves competitive results against state-of-the-art methods on four popular datasets, under both balanced and imbalanced settings.
    Abstract Few-shot learning (FSL) aims to develop a learning model with the ability to generalize to new classes using a few support samples. For transductive FSL tasks, prototype learning and label propagation methods are commonly employed. Prototype methods generally first learn the representative prototypes from the support set and then determine the labels of queries based on the metric between query samples and prototypes. Label propagation methods try to propagate the labels of support samples on the constructed graph encoding the relationships between both support and query samples. This paper aims to integrate these two principles together and develop an efficient and robust transductive FSL approach, termed Prototype-based Soft-label Propagation (PSLP). Specifically, we first estimate the soft-label presentation for each query sample by leveraging prototypes. Then, we conduct soft-label propagation on our learned query-support graph. Both steps are conducted progressively to boost their respective performance. Moreover, to learn effective prototypes for soft-label estimation as well as the desirable query-support graph for soft-label propagation, we design a new joint message passing scheme to learn sample presentation and relational graph jointly. Our PSLP method is parameter-free and can be implemented very efficiently. On four popular datasets, our method achieves competitive results on both balanced and imbalanced settings compared to the state-of-the-art methods. The code will be released upon acceptance.
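The two ingredients PSLP combines can be sketched compactly: (1) soft labels for queries from distances to class prototypes, and (2) closed-form label propagation on a query-support graph. Graph construction and hyperparameters below are simplified assumptions, not the paper's learned graph:

```python
# A compact sketch of prototype-based soft labels + label propagation.
import numpy as np

def pslp_step(support, s_labels, queries, n_way, alpha=0.5, sigma=1.0):
    prototypes = np.stack([support[s_labels == c].mean(0) for c in range(n_way)])

    # (1) Prototype-based soft labels for each query.
    d = ((queries[:, None] - prototypes[None]) ** 2).sum(-1)
    soft = np.exp(-d); soft /= soft.sum(1, keepdims=True)

    # (2) Propagate on a Gaussian-affinity graph over support + queries.
    X = np.concatenate([support, queries])
    W = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / (2 * sigma**2))
    np.fill_diagonal(W, 0)
    Dm = np.diag(W.sum(1) ** -0.5)
    S = Dm @ W @ Dm
    Y = np.concatenate([np.eye(n_way)[s_labels], soft])   # seed labels
    F = np.linalg.solve(np.eye(len(X)) - alpha * S, Y)    # propagation
    return F[len(support):].argmax(1)

rng = np.random.default_rng(0)
support = rng.normal(size=(10, 5)); s_labels = np.repeat(np.arange(5), 2)
queries = rng.normal(size=(15, 5))
print(pslp_step(support, s_labels, queries, n_way=5))
```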

HandyPriors: Physically Consistent Perception of Hand-Object Interactions with Differentiable Priors

  • paper_url: http://arxiv.org/abs/2311.16552
  • repo_url: None
  • paper_authors: Shutong Zhang, Yi-Ling Qiao, Guanglei Zhu, Eric Heiden, Dylan Turpin, Jingzhou Liu, Ming Lin, Miles Macklin, Animesh Garg
  • for: This paper proposes a unified and general pipeline for pose estimation in hand-object interaction scenes.
  • methods: It leverages recent advances in differentiable physics and rendering: rendering priors align estimates with input images and segmentation masks, while physics priors mitigate penetration and relative sliding across frames. Two variants are offered: optimization-based pose estimation, which achieves higher accuracy, and filtering-based tracking, which uses the differentiable priors as dynamics and observation models and runs faster.
  • results: HandyPriors attains comparable or superior pose estimation accuracy, can predict contact information for pose refinement, and generalizes to perception tasks such as robotic hand manipulation and in-the-wild human-object pose estimation.
    Abstract Various heuristic objectives for modeling hand-object interaction have been proposed in past work. However, due to the lack of a cohesive framework, these objectives often possess a narrow scope of applicability and are limited by their efficiency or accuracy. In this paper, we propose HandyPriors, a unified and general pipeline for pose estimation in human-object interaction scenes by leveraging recent advances in differentiable physics and rendering. Our approach employs rendering priors to align with input images and segmentation masks along with physics priors to mitigate penetration and relative-sliding across frames. Furthermore, we present two alternatives for hand and object pose estimation. The optimization-based pose estimation achieves higher accuracy, while the filtering-based tracking, which utilizes the differentiable priors as dynamics and observation models, executes faster. We demonstrate that HandyPriors attains comparable or superior results in the pose estimation task, and that the differentiable physics module can predict contact information for pose refinement. We also show that our approach generalizes to perception tasks, including robotic hand manipulation and human-object pose estimation in the wild.

Multi-Irreducible Spectral Synchronization for Robust Rotation Averaging

  • paper_url: http://arxiv.org/abs/2311.16544
  • repo_url: None
  • paper_authors: Owen Howell, Haoen Huang, David Rosen
  • for: This paper addresses rotation averaging (RA), a fundamental problem in robotics and computer vision: estimating unknown orientations $R_{1}, ..., R_{N} \in SO(3)$ from noisy measurements $R_{ij} \sim R_{i}^{-1} R_{j}$ of a subset of their pairwise relative rotations.
  • methods: Harmonic analysis on compact groups yields a (convex) spectral relaxation constructed from truncated Fourier decompositions of the individual summands in the RA objective; an RA estimate is recovered by computing a few extremal eigenpairs of this relaxation and (approximately) solving a consensus problem.
  • results: The method works with any smooth loss function (including robust M-estimators), requires no initialization, and uses only simple, highly scalable linear-algebraic computations and parallelizable optimizations. Under multiplicative Langevin measurement noise, explicit performance guarantees (probabilistic tail bounds on the estimation error) are derived in terms of graph-theoretic quantities of the measurement network, indicating how to design networks that are guaranteed to achieve accurate estimation for downstream tasks such as sensor placement, network compression, and active sensing.
    Abstract Rotation averaging (RA) is a fundamental problem in robotics and computer vision. In RA, the goal is to estimate a set of $N$ unknown orientations $R_{1}, ..., R_{N} \in SO(3)$, given noisy measurements $R_{ij} \sim R^{-1}_{i} R_{j}$ of a subset of their pairwise relative rotations. This problem is both nonconvex and NP-hard, and thus difficult to solve in the general case. We apply harmonic analysis on compact groups to derive a (convex) spectral relaxation constructed from truncated Fourier decompositions of the individual summands appearing in the RA objective; we then recover an estimate of the RA solution by computing a few extremal eigenpairs of this relaxation, and (approximately) solving a consensus problem. Our approach affords several notable advantages versus prior RA methods: it can be used in conjunction with \emph{any} smooth loss function (including, but not limited to, robust M-estimators), does not require any initialization, and is implemented using only simple (and highly scalable) linear-algebraic computations and parallelizable optimizations over band-limited functions of individual rotational states. Moreover, under the (physically well-motivated) assumption of multiplicative Langevin measurement noise, we derive explicit performance guarantees for our spectral estimator (in the form of probabilistic tail bounds on the estimation error) that are parameterized in terms of graph-theoretic quantities of the underlying measurement network. By concretely linking estimator performance with properties of the underlying measurement graph, our results also indicate how to devise measurement networks that are \emph{guaranteed} to achieve accurate estimation, enabling such downstream tasks as sensor placement, network compression, and active sensing.
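For intuition, the classical spectral-synchronization baseline (not the paper's Fourier-based relaxation) stacks relative rotations into a block matrix, takes the three leading eigenvectors, and projects each 3x3 block back onto SO(3); a minimal sketch:

```python
# Classical spectral rotation synchronization (illustrative baseline only).
import numpy as np

def spectral_rotation_sync(R_pairs, N):
    """R_pairs: dict {(i, j): 3x3 measurement of R_i^{-1} R_j}."""
    M = np.zeros((3 * N, 3 * N))
    for (i, j), Rij in R_pairs.items():
        M[3*i:3*i+3, 3*j:3*j+3] = Rij
        M[3*j:3*j+3, 3*i:3*i+3] = Rij.T
    M += np.eye(3 * N)                       # self-blocks are identity
    vals, vecs = np.linalg.eigh(M)
    V = vecs[:, -3:]                         # three leading eigenvectors
    rotations = []
    for i in range(N):
        U, _, Vt = np.linalg.svd(V[3*i:3*i+3])
        R = U @ Vt
        if np.linalg.det(R) < 0:             # enforce det(R) = +1
            R = U @ np.diag([1, 1, -1]) @ Vt
        rotations.append(R)
    return rotations  # estimates recovered up to a global rotation
```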

Exploring Straighter Trajectories of Flow Matching with Diffusion Guidance

  • paper_url: http://arxiv.org/abs/2311.16507
  • repo_url: None
  • paper_authors: Siyu Xing, Jie Cao, Huaibo Huang, Xiao-Yu Zhang, Ran He
  • for: Improve the generation quality and efficiency of flow matching models.
  • methods: StraightFM straightens trajectories with a coupling strategy guided by a diffusion model at the whole-distribution level: couplings between image and noise samples are created under diffusion-model guidance, real data is additionally integrated via a neural network that parameterizes a complementary coupling from images to noise samples, and the two mutually complementary directions are jointly optimized, enabling one-step and few-step generation.
  • results: StraightFM yields high-quality samples in fewer steps: a lower FID than diffusion and traditional flow matching methods within 5 sampling steps in pixel space, and a lower KID than existing methods on CelebA-HQ 256 in latent space in fewer than 10 sampling steps.
    Abstract Flow matching, as a generative modeling paradigm, achieves notable success across various domains. However, existing methods use either multi-round training or knowledge within minibatches, posing challenges in finding a favorable coupling strategy for straight trajectories. To address this issue, we propose a novel approach, Straighter trajectories of Flow Matching (StraightFM). It straightens trajectories with a coupling strategy guided by a diffusion model at the entire-distribution level. First, we propose a coupling strategy to straighten trajectories, creating couplings between image and noise samples under diffusion model guidance. Second, StraightFM also integrates real data to enhance training, employing a neural network to parameterize another coupling process from images to noise samples. StraightFM is jointly optimized with couplings from the above two mutually complementary directions, resulting in straighter trajectories and enabling both one-step and few-step generation. Extensive experiments demonstrate that StraightFM yields high-quality samples in fewer steps. StraightFM generates visually appealing images with a lower FID than diffusion and traditional flow matching methods within 5 sampling steps when trained in pixel space. In the latent space (i.e., Latent Diffusion), StraightFM achieves a lower KID value compared to existing methods on the CelebA-HQ 256 dataset in fewer than 10 sampling steps.
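A minimal flow-matching training step with straight-line interpolations (x_t = (1 - t) x0 + t x1, target velocity x1 - x0) is sketched below on a toy 2D distribution; the diffusion-guided coupling choice that StraightFM studies is abstracted away:

```python
# A minimal flow-matching training and few-step sampling sketch.
import torch
import torch.nn as nn

v_net = nn.Sequential(nn.Linear(2 + 1, 128), nn.SiLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(v_net.parameters(), lr=1e-3)

for step in range(1000):
    x0 = torch.randn(256, 2)                       # noise samples
    x1 = torch.randn(256, 2) * 0.3 + 2.0           # stand-in "data"
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1                     # straight interpolation
    target = x1 - x0                               # constant velocity field
    pred = v_net(torch.cat([xt, t], dim=-1))
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Few-step sampling: integrate dx/dt = v(x, t) with Euler steps.
x = torch.randn(256, 2)
for k in range(5):
    t = torch.full((256, 1), k / 5)
    x = x + v_net(torch.cat([x, t], dim=-1)) / 5
```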

In Search of a Data Transformation That Accelerates Neural Field Training

  • paper_url: http://arxiv.org/abs/2311.17094
  • repo_url: None
  • paper_authors: Junwon Seo, Sangyoon Lee, Kwang In Kim, Jaeho Lee
  • for: This paper studies how data transformations affect the training speed of neural fields, focusing on how permuting pixel locations affects SGD convergence.
  • methods: A neural network is trained to approximate a given signal, and the effect of random pixel permutations is analyzed through PSNR curves, loss landscapes, and error patterns.
  • results: Randomly permuting pixel locations can considerably accelerate neural field training; the permutations remove easy-to-fit patterns that would otherwise ease early optimization but hinder capturing fine details of the signal.
    Abstract Neural field is an emerging paradigm in data representation that trains a neural network to approximate the given signal. A key obstacle that prevents its widespread adoption is encoding speed: generating a neural field requires overfitting a neural network, which can take a significant number of SGD steps to reach the desired fidelity level. In this paper, we delve into the impacts of data transformations on the speed of neural field training, specifically focusing on how permuting pixel locations affects the convergence speed of SGD. Counterintuitively, we find that randomly permuting the pixel locations can considerably accelerate the training. To explain this phenomenon, we examine the neural field training through the lens of PSNR curves, loss landscapes, and error patterns. Our analyses suggest that the random pixel permutations remove the easy-to-fit patterns, which facilitate easy optimization in the early stage but hinder capturing fine details of the signal.
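The experiment is easy to reproduce in miniature: fit an MLP neural field to an image and compare raster-order minibatches against a random pixel permutation. The toy signal and model sizes below are illustrative assumptions:

```python
# A small experiment in the spirit of the paper: raster vs. permuted order.
import torch
import torch.nn as nn

H = W = 64
ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W),
                        indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
image = torch.sin(8 * coords[:, :1]) * torch.cos(8 * coords[:, 1:])  # toy signal

def train(order, steps=500, bs=256):
    net = nn.Sequential(nn.Linear(2, 256), nn.ReLU(),
                        nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for s in range(steps):
        idx = order[(s * bs) % len(order):][:bs]   # cycle through the order
        loss = ((net(coords[idx]) - image[idx]) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return ((net(coords) - image) ** 2).mean().item()

raster = torch.arange(H * W)                 # easy-to-fit raster order
shuffled = torch.randperm(H * W)             # random pixel permutation
print("raster:", train(raster), "permuted:", train(shuffled))
```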
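
A toy sketch of the studied transformation, under the assumption that the permutation shuffles which pixel value sits at which coordinate: fit a coordinate MLP to the permuted image and read the original back through the inverse permutation. The paper's exact training protocol may differ.

```python
# Toy sketch: randomly permute which pixel value sits at which coordinate,
# fit a coordinate MLP to the permuted image, and recover the original image
# through the inverse permutation at read-out time.
import torch
import torch.nn as nn

H = W = 32
img = torch.rand(H * W, 1)                      # stand-in grayscale signal
coords = torch.stack(torch.meshgrid(
    torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"),
    dim=-1).reshape(-1, 2)

perm = torch.randperm(H * W)                    # the data transformation
target = img[perm]                              # permuted pixel values

field = nn.Sequential(nn.Linear(2, 128), nn.ReLU(),
                      nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(field.parameters(), lr=1e-3)
for step in range(2000):                        # overfit the permuted signal
    loss = ((field(coords) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Undo the permutation at read-out time to recover the original signal.
with torch.no_grad():
    pred_permuted = field(coords)
    pred = torch.empty_like(pred_permuted)
    pred[perm] = pred_permuted                  # inverse-permute predictions
```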

Agents meet OKR: An Object and Key Results Driven Agent System with Hierarchical Self-Collaboration and Self-Evaluation

  • paper_url: http://arxiv.org/abs/2311.16542
  • repo_url: None
  • paper_authors: Yi Zheng, Chongyang Ma, Kanle Shi, Haibin Huang
  • for: Enhancing the task-solving capabilities of Large Language Models (LLMs)
  • methods: Self-collaboration and self-correction mechanisms, using hierarchical agents to address the complexity of task-solving
  • results: Experimental results show the OKR-Agent approach outperforms previous methods on several tasks
    Abstract In this study, we introduce the concept of OKR-Agent designed to enhance the capabilities of Large Language Models (LLMs) in task-solving. Our approach utilizes both self-collaboration and self-correction mechanisms, facilitated by hierarchical agents, to address the inherent complexities in task-solving. Our key observations are two-fold: first, effective task-solving demands in-depth domain knowledge and intricate reasoning, for which deploying specialized agents for individual sub-tasks can markedly enhance LLM performance. Second, task-solving intrinsically adheres to a hierarchical execution structure, comprising both high-level strategic planning and detailed task execution. Towards this end, our OKR-Agent paradigm aligns closely with this hierarchical structure, promising enhanced efficacy and adaptability across a range of scenarios. Specifically, our framework includes two novel modules: hierarchical Objects and Key Results generation and multi-level evaluation, each contributing to more efficient and robust task-solving. In practice, hierarchical OKR generation decomposes Objects into multiple sub-Objects and assigns new agents based on key results and agent responsibilities. These agents subsequently elaborate on their designated tasks and may further decompose them as necessary. Such generation operates recursively and hierarchically, culminating in a comprehensive set of detailed solutions. The multi-level evaluation module of OKR-Agent refines solutions by leveraging feedback from all associated agents, optimizing each step of the process. This ensures the solution is accurate, practical, and effectively addresses intricate task requirements, enhancing the overall reliability and quality of the outcome. Experimental results also show our method outperforms the previous methods on several tasks. Code and demo are available at https://okr-agent.github.io/
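
A schematic of the recursive Objects-and-Key-Results decomposition described above. The `llm` function is a hypothetical stand-in for any chat-completion call; the prompts, line format, and depth cutoff are illustrative assumptions, not the paper's.

```python
# Schematic of hierarchical OKR generation: an agent decomposes an Objective
# into sub-Objectives with Key Results, spawning a sub-agent per sub-Objective
# until tasks are atomic. `llm` is a hypothetical stand-in for any
# chat-completion call; prompts and the depth cutoff are illustrative.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def solve_okr(objective: str, role: str, depth: int = 0, max_depth: int = 2):
    if depth == max_depth:
        return llm(f"As {role}, produce a concrete solution for: {objective}")
    plan = llm(f"As {role}, decompose '{objective}' into sub-objectives, "
               f"each as 'sub-objective | key result | responsible role', "
               f"one per line.")
    solutions = []
    for line in plan.splitlines():
        sub_obj, key_result, sub_role = [p.strip() for p in line.split("|")]
        sub_solution = solve_okr(sub_obj, sub_role, depth + 1, max_depth)
        # Multi-level evaluation: the parent agent checks each sub-solution
        # against its key result and asks for a refinement.
        critique = llm(f"As {role}, check this solution against the key "
                       f"result '{key_result}' and refine it:\n{sub_solution}")
        solutions.append(critique)
    return "\n".join(solutions)
```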

Improved Prototypical Semi-Supervised Learning with Foundation Models: Prototype Selection, Parametric vMF-SNE Pretraining and Multi-view Pseudolabelling

  • paper_url: http://arxiv.org/abs/2311.17093
  • repo_url: None
  • paper_authors: Evelyn Mannix, Howard Bondell
  • for: Improving semi-supervised learning performance in computer vision, particularly when using a frozen foundation model as the network backbone.
  • methods: A parametric von-Mises Fisher Stochastic Neighbour Embedding (vMF-SNE) pretraining scheme, soft multi-view pseudolabelling, and a simple $k$-means prototype selection technique.
  • results: Compared with the existing PAWS and RoPAWS methods, improves performance by +2.9% and +5.7% on benchmark datasets, and achieves a new state of the art on the DeepWeeds dataset.
    Abstract In this paper we present an improved approach to prototypical semi-supervised learning for computer vision, in the context of leveraging a frozen foundation model as the backbone of our neural network. As a general tool, we propose parametric von-Mises Fisher Stochastic Neighbour Embedding (vMF-SNE) to create mappings with neural networks between high-dimensional latent spaces that preserve local structure. This enables us to pretrain the projection head of our network using the high-quality embeddings of the foundation model with vMF-SNE. We also propose soft multi-view pseudolabels, where predictions across multiple views are combined to provide a more reliable supervision signal compared to a consistency or swapped assignment approach. We demonstrate that these ideas improve upon Predicting View-Assignments with Support Samples (PAWS), a current state-of-the-art semi-supervised learning method, as well as Robust PAWS (RoPAWS), over a range of benchmarking datasets. We also introduce simple $k$-means prototype selection, a technique that provides superior performance to other unsupervised label selection approaches in this context. These changes improve upon PAWS by an average of +2.9% for CIFAR-10 and +5.7% for CIFAR-100 with four labels per class, and by +15.2% for DeepWeeds, a particularly challenging dataset for semi-supervised learning. We also achieve new state-of-the-art results in semi-supervised learning in this small label regime for CIFAR-10 - 95.8% (+0.7%) and CIFAR-100 - 76.6% (+12.0%).
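
A minimal sketch of the $k$-means prototype selection idea: cluster the frozen foundation-model embeddings and label the sample nearest each centroid. The distance metric and prototype count below are assumptions, not the paper's exact settings.

```python
# Minimal sketch of k-means prototype selection: cluster the frozen
# foundation-model embeddings and pick, for each cluster, the sample nearest
# the centroid as the prototype to be labelled.
import numpy as np
from sklearn.cluster import KMeans

def select_prototypes(embeddings: np.ndarray, n_prototypes: int) -> np.ndarray:
    """Return indices of the samples chosen as prototypes."""
    km = KMeans(n_clusters=n_prototypes, n_init=10, random_state=0)
    assignments = km.fit_predict(embeddings)
    indices = []
    for c in range(n_prototypes):
        members = np.where(assignments == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        indices.append(members[np.argmin(dists)])   # closest real sample
    return np.array(indices)

# Usage: label only these indices, then train semi-supervised on the rest.
proto_idx = select_prototypes(np.random.randn(1000, 256).astype(np.float32), 40)
```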

SEED-Bench-2: Benchmarking Multimodal Large Language Models

  • paper_url: http://arxiv.org/abs/2311.17092
  • repo_url: https://github.com/ailab-cvc/seed-bench
  • paper_authors: Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan
  • for: Providing a comprehensive benchmark for multimodal large language models (MLLMs) to assess the progress and limitations of current MLLMs.
  • methods: A hierarchical taxonomy of MLLM capabilities and the SEED-Bench-2 benchmark, comprising 24K multiple-choice questions spanning 27 dimensions, including the evaluation of both text and image generation.
  • results: Evaluating 23 prominent open-source MLLMs reveals their limitations and yields valuable observations; the benchmark aims to motivate future research toward General Artificial Intelligence.
    Abstract Multimodal large language models (MLLMs), building upon the foundation of powerful large language models (LLMs), have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs (acting like a combination of GPT-4V and DALL-E 3). However, existing MLLM benchmarks remain limited to assessing only models' comprehension ability of single image-text inputs, failing to keep up with the strides made in MLLMs. A comprehensive benchmark is imperative for investigating the progress and uncovering the limitations of current MLLMs. In this work, we categorize the capabilities of MLLMs into hierarchical levels from $L_0$ to $L_4$ based on the modalities they can accept and generate, and propose SEED-Bench-2, a comprehensive benchmark that evaluates the hierarchical capabilities of MLLMs. Specifically, SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which span 27 dimensions, including the evaluation of both text and image generation. Multiple-choice questions with groundtruth options derived from human annotation enable an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations. By revealing the limitations of existing MLLMs through extensive evaluations, we aim for SEED-Bench-2 to provide insights that will motivate future research towards the goal of General Artificial Intelligence. Dataset and evaluation code are available at https://github.com/AILab-CVC/SEED-Bench

Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models

  • paper_url: http://arxiv.org/abs/2311.17091
  • repo_url: https://github.com/zhihelu/ensemble_vlm
  • paper_authors: Zhihe Lu, Jiawang Bai, Xin Li, Zeyu Xiao, Xinchao Wang
  • for: Exploring how ensembles of weaker vision-language models (VLMs) can improve open-world generalization.
  • methods: Three customized ensemble strategies, each tailored to a specific scenario: a zero-shot ensemble, and training-free and tuning ensembles for settings with extra few-shot samples.
  • results: By automatically adjusting the logits of different models, achieves new state-of-the-art performance on zero-shot, base-to-new, and cross-dataset generalization.
    Abstract Fine-tuning pre-trained vision-language models (VLMs), e.g., CLIP, for the open-world generalization has gained increasing popularity due to its practical value. However, performance advancements are limited when relying solely on intricate algorithmic designs for a single model, even one exhibiting strong performance, e.g., CLIP-ViT-B/16. This paper, for the first time, explores the collaborative potential of leveraging much weaker VLMs to enhance the generalization of a robust single model. The affirmative findings motivate us to address the generalization problem from a novel perspective, i.e., ensemble of pre-trained VLMs. We introduce three customized ensemble strategies, each tailored to one specific scenario. Firstly, we introduce the zero-shot ensemble, automatically adjusting the logits of different models based on their confidence when only pre-trained VLMs are available. Furthermore, for scenarios with extra few-shot samples, we propose the training-free and tuning ensemble, offering flexibility based on the availability of computing resources. The proposed ensemble strategies are evaluated on zero-shot, base-to-new, and cross-dataset generalization, achieving new state-of-the-art performance. Notably, this work represents an initial stride toward enhancing the generalization performance of VLMs via ensemble. The code is available at https://github.com/zhiheLu/Ensemble_VLM.git.
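
One plausible instantiation of a confidence-based zero-shot ensemble, sketched in NumPy: each VLM's logits are weighted by its own per-sample max-softmax confidence before fusion. The paper's exact weighting rule may differ.

```python
# Confidence-weighted zero-shot logit ensemble: models that are more
# confident on a given sample contribute more to the fused prediction.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def zero_shot_ensemble(logits_per_model):
    """logits_per_model: list of (batch, classes) arrays from different VLMs."""
    fused = np.zeros_like(logits_per_model[0])
    for logits in logits_per_model:
        conf = softmax(logits).max(axis=1, keepdims=True)  # per-sample confidence
        fused += conf * logits                             # confident models count more
    return fused.argmax(axis=1)

preds = zero_shot_ensemble([np.random.randn(8, 10) for _ in range(3)])
```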

3D Teeth Reconstruction from Panoramic Radiographs using Neural Implicit Functions

  • paper_url: http://arxiv.org/abs/2311.16524
  • repo_url: None
  • paper_authors: Sihwa Park, Seongjun Kim, In-Seok Song, Seung Jun Baek
  • for: A neural-implicit-function method for 3D teeth reconstruction from panoramic radiographs, addressing the limitations of flattened 2D dental imaging.
  • methods: Multi-label segmentation produces tooth shape and tooth class embeddings, which are fed to the implicit function; a novel Conditional eXcitation (CX) module effectively incorporates the combined shape and class embeddings.
  • results: Compared with other methods, Occudent demonstrates higher accuracy and reliability.
    Abstract Panoramic radiography is a widely used imaging modality in dental practice and research. However, it only provides flattened 2D images, which limits the detailed assessment of dental structures. In this paper, we propose Occudent, a framework for 3D teeth reconstruction from panoramic radiographs using neural implicit functions, which, to the best of our knowledge, is the first work to do so. For a given point in 3D space, the implicit function estimates whether the point is occupied by a tooth, and thus implicitly determines the boundaries of 3D tooth shapes. Firstly, Occudent applies multi-label segmentation to the input panoramic radiograph. Next, tooth shape embeddings as well as tooth class embeddings are generated from the segmentation outputs, which are fed to the reconstruction network. A novel module called Conditional eXcitation (CX) is proposed in order to effectively incorporate the combined shape and class embeddings into the implicit function. The performance of Occudent is evaluated using both quantitative and qualitative measures. Importantly, Occudent is trained and validated with actual panoramic radiographs as input, distinct from recent works which used synthesized images. Experiments demonstrate the superiority of Occudent over state-of-the-art methods.
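
A sketch of how a neural implicit occupancy function is queried at inference, with plain concatenation of the tooth embedding standing in for the paper's Conditional eXcitation module; the network shape and grid size are illustrative.

```python
# Query a neural implicit occupancy function on a voxel grid: the network
# maps a 3D point (conditioned on tooth shape/class embeddings) to an
# occupancy probability; thresholding yields the binary tooth volume.
# Concatenation below is a stand-in for the paper's CX conditioning.
import torch
import torch.nn as nn

emb_dim, grid = 32, 24
occ_net = nn.Sequential(nn.Linear(3 + emb_dim, 128), nn.ReLU(),
                        nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 1))

tooth_emb = torch.randn(emb_dim)                       # shape+class embedding
axis = torch.linspace(-1, 1, grid)
pts = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"),
                  dim=-1).reshape(-1, 3)
cond = tooth_emb.expand(pts.shape[0], emb_dim)

with torch.no_grad():
    occupancy = torch.sigmoid(occ_net(torch.cat([pts, cond], dim=1)))
inside = (occupancy.squeeze(1) > 0.5).reshape(grid, grid, grid)  # binary volume
```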

A Unified Framework for Multimodal, Multi-Part Human Motion Synthesis

  • paper_url: http://arxiv.org/abs/2311.16471
  • repo_url: None
  • paper_authors: Zixiang Zhou, Yu Wan, Baoyuan Wang
  • for: Unified multimodal, multi-part human motion generation to meet the varied demands of practical scenarios.
  • methods: A scalable pipeline: motions of different body parts are quantized into codebooks tailored to each part; pre-trained models transcode multimodal signals into a shared latent space; and complete motion sequences are built by iteratively predicting subsequent tokens.
  • results: Experiments demonstrate broad applicability, and the approach extends easily to new modalities.
    Abstract The field has made significant progress in synthesizing realistic human motion driven by various modalities. Yet, the need for different methods to animate various body parts according to different control signals limits the scalability of these techniques in practical scenarios. In this paper, we introduce a cohesive and scalable approach that consolidates multimodal (text, music, speech) and multi-part (hand, torso) human motion generation. Our methodology unfolds in several steps: We begin by quantizing the motions of diverse body parts into separate codebooks tailored to their respective domains. Next, we harness the robust capabilities of pre-trained models to transcode multimodal signals into a shared latent space. We then translate these signals into discrete motion tokens by iteratively predicting subsequent tokens to form a complete sequence. Finally, we reconstruct the continuous actual motion from this tokenized sequence. Our method frames the multimodal motion generation challenge as a token prediction task, drawing from specialized codebooks based on the modality of the control signal. This approach is inherently scalable, allowing for the easy integration of new modalities. Extensive experiments demonstrated the effectiveness of our design, emphasizing its potential for broad application.

AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond

  • paper_url: http://arxiv.org/abs/2311.16468
  • repo_url: None
  • paper_authors: Zixiang Zhou, Yu Wan, Baoyuan Wang
  • for: An all-in-one framework unifying human-motion-related tasks, including understanding, planning, generation, and tasks such as motion in-between synthesis.
  • methods: Built on large language models (LLMs), following InstructGPT and the generalist concept behind Gato: each task is fine-tuned as a type of instruction on a shared LLM, with language as the universal interface connecting all tasks in a closed loop.
  • results: Extensive experiments show AvatarGPT achieves state-of-the-art performance on low-level tasks and promising results on high-level tasks; iterative traversal of tasks within the closed loop enables unlimited long-motion synthesis.
    Abstract Large Language Models (LLMs) have shown remarkable emergent abilities in unifying almost all (if not all) NLP tasks. In the human motion-related realm, however, researchers still develop siloed models for each task. Inspired by InstructGPT, and the generalist concept behind Gato, we introduce AvatarGPT, an All-in-One framework for motion understanding, planning, generation as well as other tasks such as motion in-between synthesis. AvatarGPT treats each task as one type of instruction fine-tuned on the shared LLM. All the tasks are seamlessly interconnected with language as the universal interface, constituting a closed-loop within the framework. To achieve this, human motion sequences are first encoded as discrete tokens, which serve as the extended vocabulary of the LLM. Then, an unsupervised pipeline to generate natural language descriptions of human action sequences from in-the-wild videos is developed. Finally, all tasks are jointly trained. Extensive experiments show that AvatarGPT achieves SOTA on low-level tasks, and promising results on high-level tasks, demonstrating the effectiveness of our proposed All-in-One framework. Moreover, for the first time, AvatarGPT enables a principled approach by iterative traversal of the tasks within the closed-loop for unlimited long-motion synthesis.

TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering

  • paper_url: http://arxiv.org/abs/2311.16465
  • repo_url: None
  • paper_authors: Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei
  • for: Improving the automation and diversity of text rendering
  • methods: A large language model is fine-tuned for layout planning, and a language model within the diffusion model encodes text position and content at the line level
  • results: Achieves more rational text layout and generation with enhanced diversity and automation
    Abstract The diffusion model has been proven a powerful generative model in recent years, yet remains a challenge in generating visual text. Several methods alleviated this issue by incorporating explicit text position and content as guidance on where and what text to render. However, these methods still suffer from several drawbacks, such as limited flexibility and automation, constrained capability of layout prediction, and restricted style diversity. In this paper, we present TextDiffuser-2, aiming to unleash the power of language models for text rendering. Firstly, we fine-tune a large language model for layout planning. The large language model is capable of automatically generating keywords for text rendering and also supports layout modification through chatting. Secondly, we utilize the language model within the diffusion model to encode the position and texts at the line level. Unlike previous methods that employed tight character-level guidance, this approach generates more diverse text images. We conduct extensive experiments and incorporate user studies involving human participants as well as GPT-4V, validating TextDiffuser-2's capacity to achieve a more rational text layout and generation with enhanced diversity. The code and model will be available at \url{https://aka.ms/textdiffuser-2}.

Viewport Prediction for Volumetric Video Streaming by Exploring Video Saliency and Trajectory Information

  • paper_url: http://arxiv.org/abs/2311.16462
  • repo_url: None
  • paper_authors: Jie Li, Zhixin Li, Zhi Liu, Pengyuan Zhou, Richang Hong, Qiyue Li, Han Hu
  • for: Improving the precision of viewport prediction in volumetric video streaming.
  • methods: The proposed Saliency and Trajectory Viewport Prediction (STVP) approach exploits video saliency information and viewport trajectory, with a novel Uniform Random Sampling (URS) method and a saliency detection technique that incorporates both spatial and temporal information.
  • results: Extensive simulations on state-of-the-art volumetric video sequences show the proposed method is superior to existing schemes; the dataset and source code will be publicly accessible after acceptance.
    Abstract Volumetric video, also known as hologram video, is a novel medium that portrays natural content in Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). It is expected to be the next-gen video technology and a prevalent use case for 5G and beyond wireless communication. Considering that each user typically only watches a section of the volumetric video, known as the viewport, it is essential to have precise viewport prediction for optimal performance. However, research on this topic is still in its infancy. To this end, this paper proposes a novel approach, named Saliency and Trajectory Viewport Prediction (STVP), which aims to improve the precision of viewport prediction in volumetric video streaming. STVP extensively utilizes video saliency information and viewport trajectory. To our knowledge, this is the first comprehensive study of viewport prediction in volumetric video streaming. In particular, we introduce a novel sampling method, Uniform Random Sampling (URS), to reduce computational complexity while still preserving video features in an efficient manner. Then we present a saliency detection technique that incorporates both spatial and temporal information for detecting static, dynamic geometric, and color salient regions. Finally, we intelligently fuse saliency and trajectory information to achieve more accurate viewport prediction. We conduct extensive simulations to evaluate the effectiveness of our proposed viewport prediction methods using state-of-the-art volumetric video sequences. The experimental results show the superiority of the proposed method over existing schemes. The dataset and source code will be publicly accessible after acceptance.

Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering

  • paper_url: http://arxiv.org/abs/2311.17089
  • repo_url: None
  • paper_authors: Zhiwen Yan, Weng Fei Low, Yu Chen, Gim Hee Lee
  • for: Improving the efficiency and quality of 3D reconstruction and rendering, particularly at low resolutions or distant camera positions.
  • methods: A multi-scale 3D Gaussian splatting algorithm that maintains Gaussians at different scales to represent the same scene.
  • results: Compared to single-scale 3D Gaussian splatting, achieves 13%-66% PSNR and 160%-2400% rendering-speed improvements at 4x-128x scale rendering on the Mip-NeRF360 dataset.
    Abstract 3D Gaussians have recently emerged as a highly efficient representation for 3D reconstruction and rendering. Despite their high rendering quality and speed at high resolutions, both deteriorate drastically when rendered at lower resolutions or from far-away camera positions. During low-resolution or far-away rendering, the pixel size of the image can fall below the Nyquist frequency compared to the screen size of each splatted 3D Gaussian, leading to aliasing effects. The rendering is also drastically slowed down by the sequential alpha blending of more splatted Gaussians per pixel. To address these issues, we propose a multi-scale 3D Gaussian splatting algorithm, which maintains Gaussians at different scales to represent the same scene. Higher-resolution images are rendered with more small Gaussians, and lower-resolution images are rendered with fewer larger Gaussians. With similar training time, our algorithm can achieve 13%-66% PSNR and 160%-2400% rendering speed improvement at 4$\times$-128$\times$ scale rendering on the Mip-NeRF360 dataset compared to single-scale 3D Gaussian splatting.
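
A sketch of the scale-selection intuition: switch to coarser, larger Gaussians once a splat's projected footprint drops below a pixel. The one-octave-per-halving rule and pinhole projection below are assumptions, not the paper's exact criterion.

```python
# Render with larger Gaussians when each one projects to less than a pixel,
# keeping splats above the Nyquist limit and reducing per-pixel blending work.
import numpy as np

def select_level(gaussian_world_size, depth, focal_px, n_levels=4):
    """Pick a mip-style level per Gaussian from its projected pixel footprint."""
    projected_px = gaussian_world_size * focal_px / depth   # pinhole projection
    # If a Gaussian covers < 1 px, jump to a coarser level whose Gaussians
    # are ~2x larger per step; clamp to the available pyramid.
    level = np.ceil(np.log2(np.maximum(1.0 / projected_px, 1.0)))
    return np.clip(level.astype(int), 0, n_levels - 1)

sizes = np.array([0.05, 0.05, 0.05])        # Gaussian extents in world units
depths = np.array([2.0, 20.0, 200.0])       # distance to camera
print(select_level(sizes, depths, focal_px=600.0))  # farther -> coarser level
```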

Spiking Neural Networks with Dynamic Time Steps for Vision Transformers

  • paper_url: http://arxiv.org/abs/2311.16456
  • repo_url: None
  • paper_authors: Gourav Datta, Zeyu Liu, Anni Li, Peter A. Beerel
  • for: Using Spiking Neural Networks (SNNs) for complex vision tasks.
  • methods: A novel training framework that dynamically allocates the number of time steps to each ViT module to improve energy efficiency.
  • results: Achieves high test accuracy (95.97%) on image recognition with only 4.97 time steps (on CIFAR10).
    Abstract Spiking Neural Networks (SNNs) have emerged as a popular spatio-temporal computing paradigm for complex vision tasks. Recently proposed SNN training algorithms have significantly reduced the number of time steps (down to 1) for improved latency and energy efficiency, however, they target only convolutional neural networks (CNN). These algorithms, when applied on the recently spotlighted vision transformers (ViT), either require a large number of time steps or fail to converge. Based on analysis of the histograms of the ANN and SNN activation maps, we hypothesize that each ViT block has a different sensitivity to the number of time steps. We propose a novel training framework that dynamically allocates the number of time steps to each ViT module depending on a trainable score assigned to each timestep. In particular, we generate a scalar binary time step mask that filters spikes emitted by each neuron in a leaky-integrate-and-fire (LIF) layer. The resulting SNNs have high activation sparsity and require only accumulate operations (AC), except for the input embedding layer, in contrast to expensive multiply-and-accumulates (MAC) needed in traditional ViTs. This yields significant improvements in energy efficiency. We evaluate our training framework and resulting SNNs on image recognition tasks including CIFAR10, CIFAR100, and ImageNet with different ViT architectures. We obtain a test accuracy of 95.97% with 4.97 time steps with direct encoding on CIFAR10.
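
A toy LIF layer with a per-timestep binary mask derived from trainable scores, illustrating how different blocks can run with different effective numbers of timesteps. The thresholding and reset details are assumptions; the straight-through training of the scores is omitted.

```python
# Leaky-integrate-and-fire layer with a per-timestep binary mask: timesteps
# whose score falls below a threshold are skipped for this block, so different
# ViT modules can run with different effective numbers of timesteps.
import torch

def lif_forward(inputs, mask, v_th=1.0, leak=0.9):
    """inputs: (T, batch, features); mask: (T,) of 0/1 per timestep."""
    T = inputs.shape[0]
    v = torch.zeros_like(inputs[0])           # membrane potential
    spikes = []
    for t in range(T):
        v = leak * v + inputs[t]
        s = (v >= v_th).float() * mask[t]     # masked timesteps emit no spikes
        v = v - s * v_th                      # soft reset on spiking neurons
        spikes.append(s)
    return torch.stack(spikes)

scores = torch.tensor([0.9, 0.7, 0.2, 0.1])   # trainable per-timestep scores
mask = (scores > 0.5).float()                 # this block uses ~2 of 4 steps
out = lif_forward(torch.randn(4, 8, 16), mask)
```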

Unsupervised Multimodal Deepfake Detection Using Intra- and Cross-Modal Inconsistencies

  • paper_url: http://arxiv.org/abs/2311.17088
  • repo_url: None
  • paper_authors: Mulin Tian, Mahyar Khayatkhoei, Joe Mathai, Wael AbdAlmageed
  • for: Detecting deepfake videos without relying on labeled training data from existing deepfake generation methods.
  • methods: A novel unsupervised method that detects deepfake videos by measuring intra- and cross-modal consistency among multimodal (visual, audio, identity) features.
  • results: Extensive experiments demonstrate significant intra- and cross-modal inconsistencies in deepfake videos that can be exploited for high-accuracy detection; the method is scalable, generalizable, and explainable.
    Abstract Deepfake videos present an increasing threat to society with potentially negative impact on criminal justice, democracy, and personal safety and privacy. Meanwhile, detecting deepfakes, at scale, remains a very challenging task that often requires labeled training data from existing deepfake generation methods. Further, even the most accurate supervised deepfake detection methods do not generalize to deepfakes generated using new generation methods. In this paper, we introduce a novel unsupervised approach for detecting deepfake videos by measuring intra- and cross-modal consistency among multimodal features; specifically visual, audio, and identity features. The fundamental hypothesis behind the proposed detection method is that since deepfake generation attempts to transfer the facial motion of one identity to another, these methods will eventually encounter a trade-off between motion and identity that inevitably leads to detectable inconsistencies. We validate our method through extensive experimentation, demonstrating the existence of significant intra- and cross-modal inconsistencies in deepfake videos, which can be effectively utilized to detect them with high accuracy. Our proposed method is scalable because it does not require pristine samples at inference, generalizable because it is trained only on real data, and is explainable since it can pinpoint the exact location of modality inconsistencies which are then verifiable by a human expert.
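
A sketch of the consistency signal, assuming window-aligned features that already live in a shared embedding space (e.g., from pre-trained audio-visual encoders): low mean cosine similarity within or across modalities flags a fake. The feature extractors and threshold are assumptions.

```python
# Consistency-based detection sketch: per-window features from two modalities
# should agree in a real video; low average cosine similarity flags a fake.
import numpy as np

def consistency_score(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """feats_*: (n_windows, dim), aligned per temporal window."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    return float((a * b).sum(axis=1).mean())    # mean per-window cosine

def is_deepfake(visual_feats, audio_feats, identity_feats, threshold=0.5):
    cross = consistency_score(visual_feats, audio_feats)                # cross-modal
    intra = consistency_score(identity_feats[:-1], identity_feats[1:])  # over time
    return min(cross, intra) < threshold        # any inconsistency is suspicious
```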

Rethinking Mixup for Improving the Adversarial Transferability

  • paper_url: http://arxiv.org/abs/2311.17087
  • repo_url: None
  • paper_authors: Xiaosen Wang, Zeyuan Yin
  • for: Exploring the underlying mechanism by which mixup augmentation yields adversarial examples with superior transferability, and proposing a new input transformation-based attack, Mixing the Image but Separating the gradienT (MIST).
  • methods: A combination of theoretical analysis and experimental evaluation of mixup's effect on adversarial transferability, comparing MIST with existing state-of-the-art input transformation-based attacks on both CNNs and ViTs.
  • results: MIST outperforms existing attacks with a clear margin on both CNNs and ViTs, demonstrating high effectiveness and generality on the ImageNet dataset.
    Abstract Mixup augmentation has been widely integrated to generate adversarial examples with superior adversarial transferability when transferring from a surrogate model to other models. However, the underlying mechanism influencing the mixup's effect on transferability remains unexplored. In this work, we posit that the adversarial examples located at the convergence of decision boundaries across various categories exhibit better transferability and identify that Admix tends to steer the adversarial examples towards such regions. However, we find the constraint on the added image in Admix degrades its capability, resulting in limited transferability. To address such an issue, we propose a new input transformation-based attack called Mixing the Image but Separating the gradienT (MIST). Specifically, MIST randomly mixes the input image with a randomly shifted image and separates the gradient of each loss item for each mixed image. To counteract the imprecise gradient, MIST calculates the gradient on several mixed images for each input sample. Extensive experimental results on the ImageNet dataset demonstrate that MIST outperforms existing SOTA input transformation-based attacks with a clear margin on both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) w/wo defense mechanisms, supporting MIST's high effectiveness and generality.
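
A sketch of the gradient computation the abstract describes: mix the input with a randomly shifted copy of itself, take the gradient on each mixture separately, and average over several mixtures. The mixing-ratio range, number of copies, and NCHW layout are assumptions; the result would feed a standard I-FGSM update.

```python
# MIST-style gradient sketch: average the input-gradient over several
# mixtures of the image with randomly shifted copies of itself.
import torch
import torch.nn.functional as F

def mist_gradient(model, x, y, n_copies=5):
    """x: (N, C, H, W) images; y: (N,) labels; returns averaged gradient."""
    grad = torch.zeros_like(x)
    for _ in range(n_copies):
        shift_h = torch.randint(0, x.shape[2], (1,)).item()
        shift_w = torch.randint(0, x.shape[3], (1,)).item()
        shifted = torch.roll(x, shifts=(shift_h, shift_w), dims=(2, 3))
        lam = torch.rand(1).item() * 0.5 + 0.5          # keep the input dominant
        mixed = (lam * x + (1 - lam) * shifted).detach().requires_grad_(True)
        loss = F.cross_entropy(model(mixed), y)
        grad += torch.autograd.grad(loss, mixed)[0]     # gradient per mixture
    return grad / n_copies  # feed into an FGSM/I-FGSM update on x
```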

TopoSemiSeg: Enforcing Topological Consistency for Semi-Supervised Segmentation of Histopathology Images

  • paper_url: http://arxiv.org/abs/2311.16447
  • repo_url: https://github.com/Melon-Xu/TopoSemiSeg
  • paper_authors: Meilong Xu, Xiaoling Hu, Saumya Gupta, Shahira Abousamra, Chao Chen
  • for: Improving segmentation accuracy for densely distributed objects in computational pathology, particularly glands and nuclei.
  • methods: A semi-supervised learning method, TopoSemiSeg, that learns topological representations from unlabeled data via a topology-aware teacher-student design with a topological consistency loss.
  • results: Extensive experiments on public pathology image datasets show clear advantages on topology-wise evaluation metrics, especially for densely distributed objects.
    Abstract In computational pathology, segmenting densely distributed objects like glands and nuclei is crucial for downstream analysis. To alleviate the burden of obtaining pixel-wise annotations, semi-supervised learning methods learn from large amounts of unlabeled data. Nevertheless, existing semi-supervised methods overlook the topological information hidden in the unlabeled images and are thus prone to topological errors, e.g., missing or incorrectly merged/separated glands or nuclei. To address this issue, we propose TopoSemiSeg, the first semi-supervised method that learns the topological representation from unlabeled data. In particular, we propose a topology-aware teacher-student approach in which the teacher and student networks learn shared topological representations. To achieve this, we introduce topological consistency loss, which contains signal consistency and noise removal losses to ensure the learned representation is robust and focuses on true topological signals. Extensive experiments on public pathology image datasets show the superiority of our method, especially on topology-wise evaluation metrics. Code is available at https://github.com/Melon-Xu/TopoSemiSeg.

Centre Stage: Centricity-based Audio-Visual Temporal Action Detection

  • paper_url: http://arxiv.org/abs/2311.16446
  • repo_url: https://github.com/hanielwang/audio-visual-tad
  • paper_authors: Hanyuan Wang, Majid Mirmehdi, Dima Damen, Toby Perrett
  • for: A one-stage temporal action detection method based on multi-scale cross-modal fusion, improving detection accuracy.
  • methods: Multi-scale cross-attention fuses audio and visual modalities to model temporal dependencies, and a novel centricity score estimates how close each timestep is to the action centre.
  • results: Achieves state-of-the-art performance on the EPIC-Kitchens-100 action detection benchmark; ablation studies show the benefits of audio fusion and the centricity score.
    Abstract Previous one-stage action detection approaches have modelled temporal dependencies using only the visual modality. In this paper, we explore different strategies to incorporate the audio modality, using multi-scale cross-attention to fuse the two modalities. We also demonstrate the correlation between the distance from the timestep to the action centre and the accuracy of the predicted boundaries. Thus, we propose a novel network head to estimate the closeness of timesteps to the action centre, which we call the centricity score. This leads to increased confidence for proposals that exhibit more precise boundaries. Our method can be integrated with other one-stage anchor-free architectures and we demonstrate this on three recent baselines on the EPIC-Kitchens-100 action detection benchmark where we achieve state-of-the-art performance. Detailed ablation studies showcase the benefits of fusing audio and our proposed centricity scores. Code and models for our proposed method are publicly available at https://github.com/hanielwang/Audio-Visual-TAD.git
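
One natural definition of a centricity target consistent with the description (1 at the action centre, decaying linearly to 0 at the boundaries); the paper's exact formula is not reproduced here.

```python
# Centricity target: highest at the action centre, zero at the boundaries.
# Proposals near the centre get higher scores, boosting confidence for
# predictions whose boundaries are likely to be precise.
import numpy as np

def centricity(timesteps: np.ndarray, start: float, end: float) -> np.ndarray:
    centre, half = (start + end) / 2.0, (end - start) / 2.0
    return np.clip(1.0 - np.abs(timesteps - centre) / half, 0.0, 1.0)

t = np.arange(0, 10.0, 1.0)
print(centricity(t, start=2.0, end=8.0))   # peaks at t = 5, the action centre
```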

CLAP: Contrastive Learning with Augmented Prompts for Robustness on Pretrained Vision-Language Models

  • paper_url: http://arxiv.org/abs/2311.16445
  • repo_url: None
  • paper_authors: Yichao Cai, Yuhang Liu, Zhen Zhang, Javen Qinfeng Shi
  • for: Improving the robustness of vision-language models without retraining the image encoder.
  • methods: Text-only augmentation that modifies the style of text data while preserving content, disentangling latent content variables from style variables so the text encoder emphasizes content.
  • results: Substantial improvements in the robustness of the pre-trained CLIP model across various datasets, achieved by modifying only the style part of the text data.
    Abstract Contrastive vision-language models, e.g., CLIP, have garnered substantial attention for their exceptional generalization capabilities. However, their robustness to perturbations has ignited concerns. Existing strategies typically reinforce their resilience against adversarial examples by enabling the image encoder to "see" these perturbed examples, often necessitating a complete retraining of the image encoder on both natural and adversarial samples. In this study, we propose a new method to enhance robustness solely through text augmentation, eliminating the need for retraining the image encoder on adversarial examples. Our motivation arises from the realization that text and image data inherently occupy a shared latent space, comprising latent content variables and style variables. This insight suggests the feasibility of learning to disentangle these latent content variables using text data exclusively. To accomplish this, we introduce an effective text augmentation method that focuses on modifying the style while preserving the content in the text data. By changing the style part of the text data, we empower the text encoder to emphasize latent content variables, ultimately enhancing the robustness of vision-language models. Our experiments across various datasets demonstrate substantial improvements in the robustness of the pre-trained CLIP model.

Beyond Visual Cues: Synchronously Exploring Target-Centric Semantics for Vision-Language Tracking

  • paper_url: http://arxiv.org/abs/2311.17085
  • repo_url: None
  • paper_authors: Jiawei Ge, Xiangmei Chen, Jiuxin Cao, Xuelin Zhu, Weijia Liu, Bo Liu
  • for: Improving single object tracking by using language descriptions to provide high-level semantics.
  • methods: A novel tracker with two new modules, the Target Enhance Module (TEM) and the Semantic Aware Module (SAM), to improve vision-language feature extraction and fusion.
  • results: Extensive experiments on vision-language tracking datasets demonstrate the superiority and effectiveness of the new method.
    Abstract Single object tracking aims to locate one specific target in video sequences, given its initial state. Classical trackers rely solely on visual cues, restricting their ability to handle challenges such as appearance variations, ambiguity, and distractions. Hence, Vision-Language (VL) tracking has emerged as a promising approach, incorporating language descriptions to directly provide high-level semantics and enhance tracking performance. However, current VL trackers have not fully exploited the power of VL learning, as they suffer from limitations such as heavily relying on off-the-shelf backbones for feature extraction, ineffective VL fusion designs, and the absence of VL-related loss functions. Consequently, we present a novel tracker that progressively explores target-centric semantics for VL tracking. Specifically, we propose the first Synchronous Learning Backbone (SLB) for VL tracking, which consists of two novel modules: the Target Enhance Module (TEM) and the Semantic Aware Module (SAM). These modules enable the tracker to perceive target-related semantics and comprehend the context of both visual and textual modalities at the same pace, facilitating VL feature extraction and fusion at different semantic levels. Moreover, we devise the dense matching loss to further strengthen multi-modal representation learning. Extensive experiments on VL tracking datasets demonstrate the superiority and effectiveness of our methods.

Model-free Test Time Adaptation for Out-Of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2311.16420
  • repo_url: None
  • paper_authors: YiFan Zhang, Xue Wang, Tian Zhou, Kun Yuan, Zhang Zhang, Liang Wang, Rong Jin, Tieniu Tan
  • for: Improving the reliability of ML models by avoiding erroneous inferences on out-of-distribution data.
  • methods: Model adaptation based on online test samples, adapting to distribution shifts at test time.
  • results: Avoids false-positive errors better than conventional methods, especially when the ID and OOD distributions overlap significantly; for example, reduces FPR95 by 23.23% on CIFAR-10 and 38% on ImageNet-1k benchmarks.
    Abstract Out-of-distribution (OOD) detection is essential for the reliability of ML models. Most existing methods for OOD detection learn a fixed decision criterion from a given in-distribution dataset and apply it universally to decide if a data point is OOD. Recent work (Fang et al., 2022) shows that given only in-distribution data, it is impossible to reliably detect OOD data without extra assumptions. Motivated by the theoretical result and recent exploration of test-time adaptation methods, we propose a Non-Parametric Test Time Adaptation framework for Out-Of-Distribution Detection. Unlike conventional methods, the proposed framework utilizes online test samples for model adaptation during testing, enhancing adaptability to changing data distributions. The framework incorporates detected OOD instances into decision-making, reducing false positive rates, particularly when ID and OOD distributions overlap significantly. We demonstrate the effectiveness of the framework through comprehensive experiments on multiple OOD detection benchmarks; extensive empirical studies show that it significantly improves the performance of OOD detection over state-of-the-art methods. Specifically, it reduces the false positive rate (FPR95) by 23.23% on the CIFAR-10 benchmarks and 38% on the ImageNet-1k benchmarks compared to the advanced methods. Lastly, we theoretically verify the effectiveness of the framework.

DepthSSC: Depth-Spatial Alignment and Dynamic Voxel Resolution for Monocular 3D Semantic Scene Completion

  • paper_url: http://arxiv.org/abs/2311.17084
  • repo_url: None
  • paper_authors: Jiawei Yao, Jusheng Zhang
  • for: 3D semantic scene completion with monocular cameras
  • methods: ST-GF (Spatial Transformation Graph Fusion) module with geometric-aware voxelization
  • results: achieves state-of-the-art performance in capturing intricate 3D structural details and mitigates spatial misalignment and distortion issues
    Abstract The task of 3D semantic scene completion with monocular cameras is gaining increasing attention in the field of autonomous driving. Its objective is to predict the occupancy status of each voxel in the 3D scene from partial image inputs. Despite the existence of numerous methods, many of them overlook the issue of accurate alignment between spatial and depth information. To address this, we propose DepthSSC, an advanced method for semantic scene completion solely based on monocular cameras. DepthSSC combines the ST-GF (Spatial Transformation Graph Fusion) module with geometric-aware voxelization, enabling dynamic adjustment of voxel resolution and considering the geometric complexity of 3D space to ensure precise alignment between spatial and depth information. This approach successfully mitigates spatial misalignment and distortion issues observed in prior methods. Through evaluation on the SemanticKITTI dataset, DepthSSC not only demonstrates its effectiveness in capturing intricate 3D structural details but also achieves state-of-the-art performance. We believe DepthSSC provides a fresh perspective on monocular camera-based 3D semantic scene completion research and anticipate it will inspire further related studies.

CLiC: Concept Learning in Context

  • paper_url: http://arxiv.org/abs/2311.17083
  • repo_url: https://github.com/Mehdi0xC/clic
  • paper_authors: Mehdi Safaee, Aryan Mikaeili, Or Patashnik, Daniel Cohen-Or, Ali Mahdavi-Amiri
  • for: Learning a localized visual pattern of an object from a single image and generating images of objects carrying that pattern.
  • methods: Builds on advances in visual concept learning: a visual concept (e.g., an ornament) is acquired from a source image and applied to an object (e.g., a chair) in a target image. The key idea is in-context concept learning, localizing the concept with soft masks that cover both the concept and its surrounding image area.
  • results: Demonstrates plausible in-context embedding of learned concepts through object generation within an image; acquired concepts can be directed to specific locations in target images via cross-attention and source-target correspondences. Effectiveness is validated through quantitative and qualitative experiments and comparisons against baselines.
    Abstract This paper addresses the challenge of learning a local visual pattern of an object from one image, and generating images depicting objects with that pattern. Learning a localized concept and placing it on an object in a target image is a nontrivial task, as the objects may have different orientations and shapes. Our approach builds upon recent advancements in visual concept learning. It involves acquiring a visual concept (e.g., an ornament) from a source image and subsequently applying it to an object (e.g., a chair) in a target image. Our key idea is to perform in-context concept learning, acquiring the local visual concept within the broader context of the objects they belong to. To localize the concept learning, we employ soft masks that contain both the concept within the mask and the surrounding image area. We demonstrate our approach through object generation within an image, showcasing plausible embedding of in-context learned concepts. We also introduce methods for directing acquired concepts to specific locations within target images, employing cross-attention mechanisms, and establishing correspondences between source and target objects. The effectiveness of our method is demonstrated through quantitative and qualitative experiments, along with comparisons against baseline techniques.

DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling

  • paper_url: http://arxiv.org/abs/2311.17082
  • repo_url: https://github.com/alexzhou907/dreampropeller
  • paper_authors: Linqi Zhou, Andy Shih, Chenlin Meng, Stefano Ermon
  • for: Accelerating text-to-3D generation: DreamPropeller wraps around existing text-to-3D pipelines to speed them up.
  • methods: A drop-in acceleration algorithm for score-distillation-based generation, applicable across different text-to-3D frameworks.
  • results: Tested on multiple text-to-3D frameworks, DreamPropeller achieves up to a 4.7x speedup with a negligible impact on generation quality.
    Abstract Recent methods such as Score Distillation Sampling (SDS) and Variational Score Distillation (VSD) using 2D diffusion models for text-to-3D generation have demonstrated impressive generation quality. However, the long generation time of such algorithms significantly degrades the user experience. To tackle this problem, we propose DreamPropeller, a drop-in acceleration algorithm that can be wrapped around any existing text-to-3D generation pipeline based on score distillation. Our framework generalizes Picard iterations, a classical algorithm for parallel sampling an ODE path, and can account for non-ODE paths such as momentum-based gradient updates and changes in dimensions during the optimization process as in many cases of 3D generation. We show that our algorithm trades parallel compute for wallclock time and empirically achieves up to 4.7x speedup with a negligible drop in generation quality for all tested frameworks.
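
For reference, here is the classical Picard iteration the paper generalizes, on a toy ODE: every point along the path is updated in parallel from the previous sweep, and sweeps repeat until the whole path converges. The Euler quadrature and tolerance are illustrative.

```python
# Picard iteration for an ODE path: all timesteps update in parallel from the
# previous sweep, trading parallel compute for wall-clock time.
import numpy as np

def picard_solve(f, x0, ts, n_sweeps=50, tol=1e-6):
    """Solve x'(t) = f(x, t) on grid ts; every point updates in parallel."""
    path = np.tile(x0, (len(ts), 1)).astype(float)     # initial guess: constant
    dts = np.diff(ts)
    for _ in range(n_sweeps):
        drift = np.array([f(path[i], ts[i]) for i in range(len(ts) - 1)])
        new = path.copy()
        new[1:] = x0 + np.cumsum(drift * dts[:, None], axis=0)  # parallel update
        if np.abs(new - path).max() < tol:             # converged: stop sweeping
            break
        path = new
    return path

path = picard_solve(lambda x, t: -x, np.array([1.0]), np.linspace(0, 1, 33))
print(path[-1])   # close to exp(-1)
```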

I-MedSAM: Implicit Medical Image Segmentation with Segment Anything

  • paper_url: http://arxiv.org/abs/2311.17081
  • repo_url: None
  • paper_authors: Xiaobao Wei, Jiajun Cao, Yizhu Jin, Ming Lu, Guangyu Wang, Shanghang Zhang
  • for: A new medical image segmentation method that improves cross-domain robustness and boundary accuracy.
  • methods: Combines continuous representations with the Segment Anything Model: a novel adapter enhances SAM features with high-frequency information during parameter-efficient fine-tuning, and an implicit neural representation decoder with uncertainty-guided sampling produces continuous segmentation output.
  • results: With only 1.6M trainable parameters, outperforms existing discrete and continuous methods on 2D medical image segmentation tasks.
    Abstract With the development of Deep Neural Networks (DNNs), many efforts have been made to handle medical image segmentation. Traditional methods such as nnUNet train specific segmentation models on the individual datasets. Plenty of recent methods have been proposed to adapt the foundational Segment Anything Model (SAM) to medical image segmentation. However, they still focus on discrete representations to generate pixel-wise predictions, which are spatially inflexible and scale poorly to higher resolution. In contrast, implicit methods learn continuous representations for segmentation, which is crucial for medical image segmentation. In this paper, we propose I-MedSAM, which leverages the benefits of both continuous representations and SAM, to obtain better cross-domain ability and accurate boundary delineation. Since medical image segmentation needs to predict detailed segmentation boundaries, we designed a novel adapter to enhance the SAM features with high-frequency information during Parameter Efficient Fine Tuning (PEFT). To convert the SAM features and coordinates into continuous segmentation output, we utilize Implicit Neural Representation (INR) to learn an implicit segmentation decoder. We also propose an uncertainty-guided sampling strategy for efficient learning of INR. Extensive evaluations on 2D medical image segmentation tasks have shown that our proposed method with only 1.6M trainable parameters outperforms existing methods including discrete and continuous methods. The code will be released.

cs.AI - 2023-11-28

Deep Regularized Compound Gaussian Network for Solving Linear Inverse Problems

  • paper_url: http://arxiv.org/abs/2311.17248
  • repo_url: None
  • paper_authors: Carter Lyons, Raghu G. Raj, Margaret Cheney
  • for: Two novel approaches to linear inverse problems that enable more stable and robust solutions.
  • methods: An iterative algorithm, generalized compound Gaussian least squares (G-CG-LS), that minimizes a regularized least-squares objective enforcing a CG prior; and a deep regularized (DR) neural network, DR-CG-Net, obtained by unrolling G-CG-LS to learn the prior information.
  • results: Computational theory and extensive numerical experiments show that both methods deliver higher performance in tomographic imaging and compressive sensing, especially in low-training scenarios, where DR-CG-Net outperforms competitive prior-art methods.
    Abstract Incorporating prior information into inverse problems, e.g. via maximum-a-posteriori estimation, is an important technique for facilitating robust inverse problem solutions. In this paper, we devise two novel approaches for linear inverse problems that permit problem-specific statistical prior selections within the compound Gaussian (CG) class of distributions. The CG class subsumes many commonly used priors in signal and image reconstruction methods including those of sparsity-based approaches. The first method developed is an iterative algorithm, called generalized compound Gaussian least squares (G-CG-LS), that minimizes a regularized least squares objective function where the regularization enforces a CG prior. G-CG-LS is then unrolled, or unfolded, to furnish our second method, which is a novel deep regularized (DR) neural network, called DR-CG-Net, that learns the prior information. A detailed computational theory on convergence properties of G-CG-LS and thorough numerical experiments for DR-CG-Net are provided. Due to the comprehensive nature of the CG prior, these experiments show that our unrolled DR-CG-Net outperforms competitive prior art methods in tomographic imaging and compressive sensing, especially in challenging low-training scenarios.
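
A schematic of the regularized least-squares setup that G-CG-LS iterates on, with a Tikhonov penalty standing in for the compound-Gaussian regularizer (which is not reproduced here); unrolling such iterations into layers is the DR-CG-Net idea.

```python
# Gradient descent on a regularized least-squares objective
# ||y - A x||^2 + lambda * ||x||^2 for a linear inverse problem.
# Tikhonov R(x) = ||x||^2 is a stand-in for the compound-Gaussian prior.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 60))        # underdetermined forward operator
x_true = np.zeros(60); x_true[:5] = 1.0
y = A @ x_true + 0.01 * rng.standard_normal(30)

lam, step = 0.1, 1e-3
x = np.zeros(60)
for _ in range(5000):
    grad = 2 * A.T @ (A @ x - y) + 2 * lam * x   # gradient of the objective
    x -= step * grad

# Unrolling: each iteration above becomes one network layer in DR-CG-Net,
# with the prior-related quantities replaced by learned parameters.
print(np.linalg.norm(x - x_true))
```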

Quantifying the redundancy between prosody and text

  • paper_url: http://arxiv.org/abs/2311.17233
  • repo_url: https://github.com/lu-wo/quantifying-redundancy
  • paper_authors: Lukas Wolf, Tiago Pimentel, Evelina Fedorenko, Ryan Cotterell, Alex Warstadt, Ethan Wilcox, Tamar Regev
  • for: This paper aims to investigate the relationship between the information conveyed by prosody and the words themselves in spoken language.
  • methods: The authors use large language models (LLMs) to estimate the redundancy between prosody and the words themselves, and they extract prosodic features aligned to individual words from a large spoken corpus of English audiobooks.
  • results: The authors find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features, including intensity, duration, pauses, and pitch contours. They also find that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words.
    Abstract Prosody -- the suprasegmental component of speech, including pitch, loudness, and tempo -- carries critical aspects of meaning. However, the relationship between the information conveyed by prosody vs. by the words themselves remains poorly understood. We use large language models (LLMs) to estimate how much information is redundant between prosody and the words themselves. Using a large spoken corpus of English audiobooks, we extract prosodic features aligned to individual words and test how well they can be predicted from LLM embeddings, compared to non-contextual word embeddings. We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features, including intensity, duration, pauses, and pitch contours. Furthermore, a word's prosodic information is redundant with both the word itself and the context preceding as well as following it. Still, we observe that prosodic features can not be fully predicted from text, suggesting that prosody carries information above and beyond the words. Along with this paper, we release a general-purpose data processing pipeline for quantifying the relationship between linguistic information and extra-linguistic features.
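
A minimal version of the redundancy probe can be written in a few lines: regress a word-level prosodic feature on word embeddings and compare held-out R^2 for contextual versus static embeddings. The arrays below are random placeholders for real LLM features and prosody; the authors' full pipeline is at https://github.com/lu-wo/quantifying-redundancy.

```python
# Probe sketch: how much word-level prosody is predictable from embeddings?
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_words, d = 5000, 768
contextual = rng.normal(size=(n_words, d))  # e.g., LLM hidden states per word
static = rng.normal(size=(n_words, d))      # e.g., non-contextual embeddings
duration = rng.normal(size=n_words)         # aligned prosodic feature per word

def probe_r2(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    return r2_score(y_te, Ridge(alpha=1.0).fit(X_tr, y_tr).predict(X_te))

# Higher R^2 for contextual embeddings indicates that surrounding context
# carries part of the prosodic information; R^2 < 1 leaves room for prosody
# to add information beyond the words.
print(probe_r2(contextual, duration), probe_r2(static, duration))
```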

ReWaRD: Retinal Waves for Pre-Training Artificial Neural Networks Mimicking Real Prenatal Development

  • paper_url: http://arxiv.org/abs/2311.17232
  • repo_url: https://github.com/bennyca/reward
  • paper_authors: Benjamin Cappell, Andreas Stoll, Williams Chukwudi Umah, Bernhard Egger
  • for: This work studies the development of human vision at its very beginning, the pre- and early post-natal stages.
  • methods: The study builds computational models that mimic this developmental mechanism by pre-training different artificial convolutional neural networks on simulated retinal wave images.
  • results: The features obtained through this biologically plausible pre-training closely match the V1 features of the primate visual system, and the resulting performance gain is comparable to a state-of-the-art pre-training pipeline.
    Abstract Computational models trained on a large amount of natural images are the state-of-the-art to study human vision - usually adult vision. Computational models of infant vision and its further development are gaining more and more attention in the community. In this work we aim at the very beginning of our visual experience - pre- and post-natal retinal waves, which are suggested to be a pre-training mechanism for the primate visual system at a very early stage of development. We see this approach as an instance of biologically plausible data-driven inductive bias through pre-training. We built a computational model that mimics this development mechanism by pre-training different artificial convolutional neural networks with simulated retinal wave images. The resulting features of this biologically plausible pre-training closely match the V1 features of the primate visual system. We show that the performance gain by pre-training with retinal waves is similar to a state-of-the-art pre-training pipeline. Our framework contains the retinal wave generator, as well as a training strategy, which can be a first step in a curriculum learning based training diet for various models of development. We release code, data and trained networks to build the basis for future work on visual development based on a curriculum learning approach that includes prenatal development, supporting studies of innate vs. learned properties of the primate visual system. An additional benefit of our pre-trained networks for neuroscience or computer vision applications is the absence of biases inherited from datasets like ImageNet.
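
As a toy stand-in for the idea, the sketch below synthesizes crude wave-like frames (a Gaussian blob at a random location) and pre-trains a small convolutional encoder with an autoencoding objective. Both the wave generator and the training objective are simplifications; the actual simulator and training code are at https://github.com/bennyca/reward.

```python
# Toy stand-in for retinal-wave pre-training of a small conv encoder.
import torch
import torch.nn as nn

def retinal_wave_batch(n=32, size=32):
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    centers = torch.rand(n, 2) * size
    d2 = (ys - centers[:, 0, None, None]) ** 2 + (xs - centers[:, 1, None, None]) ** 2
    return torch.exp(-d2 / 30.0).unsqueeze(1)  # (n, 1, size, size) blob "waves"

encoder = nn.Sequential(nn.Conv2d(1, 16, 3, 2, 1), nn.ReLU(),
                        nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU())
decoder = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),
                        nn.ConvTranspose2d(16, 1, 4, 2, 1))
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

for step in range(100):                         # short pre-training loop
    x = retinal_wave_batch()
    loss = nn.functional.mse_loss(decoder(encoder(x)), x)
    opt.zero_grad(); loss.backward(); opt.step()
# `encoder` can now initialize a downstream vision model, mirroring the
# paper's use of wave-driven pre-training before exposure to natural images.
```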

Survey on AI Ethics: A Socio-technical Perspective

  • paper_url: http://arxiv.org/abs/2311.17228
  • repo_url: None
  • paper_authors: Dave Mbiazi, Meghana Bhange, Maryam Babaei, Ivaxi Sheth, Patrik Joslin Kenfack
  • For: This paper aims to provide a comprehensive overview of the ethical concerns associated with the deployment of AI in society, including fairness, privacy, data protection, responsibility, accountability, safety, robustness, transparency, explainability, and environmental impact.
  • Methods: The paper discusses the technical and social aspects of each ethical principle, providing a unified overview of the current and future ethical concerns of AI deployment.
  • Results: The paper provides a comprehensive understanding of the ethical considerations for AI deployment, highlighting the need for a societal perspective on these issues.
    Abstract The past decade has observed a great advancement in AI with deep learning-based models being deployed in diverse scenarios including safety-critical applications. As these AI systems become deeply embedded in our societal infrastructure, the repercussions of their decisions and actions have significant consequences, making the ethical implications of AI deployment highly relevant and important. The ethical concerns associated with AI are multifaceted, including challenging issues of fairness, privacy and data protection, responsibility and accountability, safety and robustness, transparency and explainability, and environmental impact. These principles together form the foundations of ethical AI considerations that concern every stakeholder in the AI system lifecycle. In light of the present ethical and future x-risk concerns, governments have shown increasing interest in establishing guidelines for the ethical deployment of AI. This work unifies the current and future ethical concerns of deploying AI into society. While we acknowledge and appreciate the technical surveys for each of the ethical principles concerned, in this paper, we aim to provide a comprehensive overview that not only addresses each principle from a technical point of view but also discusses them from a social perspective.

War and Peace (WarAgent): Large Language Model-based Multi-Agent Simulation of World Wars

  • paper_url: http://arxiv.org/abs/2311.17227
  • repo_url: https://github.com/agiresearch/waragent
  • paper_authors: Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, Yongfeng Zhang
  • for: This study asks whether Artificial Intelligence (AI) and Large Language Models (LLMs) can be used to study complex collective human behavior, in particular the international conflicts that run through human history.
  • methods: The authors propose WarAgent, an LLM-powered multi-agent AI system that simulates the participating countries, their decisions, and the consequences in historical international conflicts. Simulations cover WWI, WWII, and the Warring States Period of Ancient China, and their effectiveness is evaluated to gauge the advances and limitations of current AI in studying complex collective human behavior.
  • results: The findings offer data-driven, AI-augmented insights into international conflicts and into strategies that might prevent future ones; beyond historical analysis, they provide a blueprint for conflict resolution and peacekeeping. Code and data are available at https://github.com/agiresearch/WarAgent.
    Abstract Can we avoid wars at the crossroads of history? This question has been pursued by individuals, scholars, policymakers, and organizations throughout human history. In this research, we attempt to answer the question based on the recent advances of Artificial Intelligence (AI) and Large Language Models (LLMs). We propose \textbf{WarAgent}, an LLM-powered multi-agent AI system, to simulate the participating countries, their decisions, and the consequences, in historical international conflicts, including the World War I (WWI), the World War II (WWII), and the Warring States Period (WSP) in Ancient China. By evaluating the simulation effectiveness, we examine the advancements and limitations of cutting-edge AI systems' abilities in studying complex collective human behaviors such as international conflicts under diverse settings. In these simulations, the emergent interactions among agents also offer a novel perspective for examining the triggers and conditions that lead to war. Our findings offer data-driven and AI-augmented insights that can redefine how we approach conflict resolution and peacekeeping strategies. The implications stretch beyond historical analysis, offering a blueprint for using AI to understand human history and possibly prevent future international conflicts. Code and data are available at \url{https://github.com/agiresearch/WarAgent}.
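
The skeleton below shows the shape of such a simulation loop: each country agent is an LLM prompted with its profile and the shared event history, and its chosen action is appended to that history. `query_llm` is a stub standing in for a real LLM call; the profiles, action space, and prompt are illustrative, and the full system lives at https://github.com/agiresearch/waragent.

```python
# Schematic of an LLM-driven country-agent loop in the spirit of WarAgent.
def query_llm(prompt: str) -> str:  # placeholder for a real LLM API call
    return "Action: Stay neutral. Rationale: avoid escalation."

countries = {"A": "industrial power, allied with B",
             "B": "naval power",
             "C": "rising power, rival of A"}
history = []

for round_id in range(3):
    for name, profile in countries.items():
        prompt = (f"You govern country {name} ({profile}).\n"
                  f"Events so far: {history[-5:]}\n"
                  "Choose one action (declare war / form alliance / mobilize "
                  "/ stay neutral) and justify it briefly.")
        action = query_llm(prompt)
        history.append((round_id, name, action))  # actions become shared state

for event in history[:3]:
    print(event)
```

The interesting behavior in the paper emerges from this shared-history feedback: each agent's decision changes the context every other agent sees in the next round.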

Minimax Exploiter: A Data Efficient Approach for Competitive Self-Play

  • paper_url: http://arxiv.org/abs/2311.17190
  • repo_url: None
  • paper_authors: Daniel Bairamian, Philippe Marcotte, Joshua Romoff, Gabriel Robert, Derek Nowrouzezahrai
  • for: Competitive Self-Play (CSP) techniques for reaching human-level performance in complex game environments.
  • methods: Distributed Multi-Agent Reinforcement Learning (MARL) is used to maintain a pool of learning agents consisting of the Main Agent, past versions of this agent, and Exploiter Agents, where the Exploiter Agents learn counter-strategies to the Main Agent.
  • results: The paper proposes the Minimax Exploiter, a game-theoretic approach to exploiting Main Agents that leverages knowledge of its opponents; it consistently improves stability and data efficiency across a diversity of settings.
    Abstract Recent advances in Competitive Self-Play (CSP) have achieved, or even surpassed, human level performance in complex game environments such as Dota 2 and StarCraft II using Distributed Multi-Agent Reinforcement Learning (MARL). One core component of these methods relies on creating a pool of learning agents -- consisting of the Main Agent, past versions of this agent, and Exploiter Agents -- where Exploiter Agents learn counter-strategies to the Main Agents. A key drawback of these approaches is the large computational cost and physical time that is required to train the system, making them impractical to deploy in highly iterative real-life settings such as video game productions. In this paper, we propose the Minimax Exploiter, a game theoretic approach to exploiting Main Agents that leverages knowledge of its opponents, leading to significant increases in data efficiency. We validate our approach in a diversity of settings, including simple turn based games, the arcade learning environment, and For Honor, a modern video game. The Minimax Exploiter consistently outperforms strong baselines, demonstrating improved stability and data efficiency, leading to a robust CSP-MARL method that is both flexible and easy to deploy.

SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery

  • paper_url: http://arxiv.org/abs/2311.17179
  • repo_url: None
  • paper_authors: Konstantin Klemmer, Esther Rolf, Caleb Robinson, Lester Mackey, Marc Rußwurm
  • for: This work proposes a global, general-purpose geographic location encoder that learns the characteristics of locations from openly available satellite imagery.
  • methods: Satellite Contrastive Location-Image Pretraining (SatCLIP) is pre-trained on globally sampled multi-spectral Sentinel-2 satellite data to learn an implicit representation of locations.
  • results: SatCLIP embeddings provide useful location information for a variety of predictive tasks and consistently outperform embeddings from existing pretrained location encoders; they also help improve geographic generalization.
    Abstract Geographic location is essential for modeling tasks in fields ranging from ecology to epidemiology to the Earth system sciences. However, extracting relevant and meaningful characteristics of a location can be challenging, often entailing expensive data fusion or data distillation from global imagery datasets. To address this challenge, we introduce Satellite Contrastive Location-Image Pretraining (SatCLIP), a global, general-purpose geographic location encoder that learns an implicit representation of locations from openly available satellite imagery. Trained location encoders provide vector embeddings summarizing the characteristics of any given location for convenient usage in diverse downstream tasks. We show that SatCLIP embeddings, pretrained on globally sampled multi-spectral Sentinel-2 satellite data, can be used in various predictive tasks that depend on location information but not necessarily satellite imagery, including temperature prediction, animal recognition in imagery, and population density estimation. Across tasks, SatCLIP embeddings consistently outperform embeddings from existing pretrained location encoders, ranging from models trained on natural images to models trained on semantic context. SatCLIP embeddings also help to improve geographic generalization. This demonstrates the potential of general-purpose location encoders and opens the door to learning meaningful representations of our planet from the vast, varied, and largely untapped modalities of geospatial data.
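
The contrastive objective is the standard CLIP recipe applied to (location, image) pairs. In the sketch below both towers are stubbed with MLPs and random features stand in for Sentinel-2 imagery; the temperature and architectures are assumptions, not SatCLIP's actual configuration.

```python
# CLIP-style contrastive objective between a coordinate encoder and an
# image encoder; matched (location, image) pairs sit on the diagonal.
import torch
import torch.nn as nn
import torch.nn.functional as F

loc_enc = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 64))
img_enc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 64))

lonlat = torch.rand(32, 2) * torch.tensor([360.0, 180.0]) - torch.tensor([180.0, 90.0])
img_feat = torch.randn(32, 512)          # stand-in for satellite image features

z_loc = F.normalize(loc_enc(lonlat), dim=-1)
z_img = F.normalize(img_enc(img_feat), dim=-1)
logits = z_loc @ z_img.T / 0.07          # temperature-scaled similarities
labels = torch.arange(32)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
loss.backward()
```

After pretraining, only the location tower is kept: feeding any coordinate through `loc_enc` yields the general-purpose embedding used by downstream predictors.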

(Ir)rationality in AI: State of the Art, Research Challenges and Open Questions

  • paper_url: http://arxiv.org/abs/2311.17165
  • repo_url: None
  • paper_authors: Olivia Macmillan-Scott, Mirco Musolesi
  • for: This article surveys rationality and irrationality in artificial intelligence and sets out the open questions in this area.
  • methods: It considers the behaviour of artificial agents and how to deal with irrational behaviour, including the identification of, and interaction with, irrational agents.
  • results: The survey finds that some irrational behaviours can prove optimal in certain scenarios, and that many questions about the interplay between humans and artificial agents, including potentially irrational behaviour on both sides, remain open.
    Abstract The concept of rationality is central to the field of artificial intelligence. Whether we are seeking to simulate human reasoning, or the goal is to achieve bounded optimality, we generally seek to make artificial agents as rational as possible. Despite the centrality of the concept within AI, there is no unified definition of what constitutes a rational agent. This article provides a survey of rationality and irrationality in artificial intelligence, and sets out the open questions in this area. The understanding of rationality in other fields has influenced its conception within artificial intelligence, in particular work in economics, philosophy and psychology. Focusing on the behaviour of artificial agents, we consider irrational behaviours that can prove to be optimal in certain scenarios. Some methods have been developed to deal with irrational agents, both in terms of identification and interaction, however work in this area remains limited. Methods that have up to now been developed for other purposes, namely adversarial scenarios, may be adapted to suit interactions with artificial agents. We further discuss the interplay between human and artificial agents, and the role that rationality plays within this interaction; many questions remain in this area, relating to potentially irrational behaviour of both humans and artificial agents.

Pragmatic Radiology Report Generation

  • paper_url: http://arxiv.org/abs/2311.17154
  • repo_url: https://github.com/chicagohai/llm_radiology
  • paper_authors: Dang Nguyen, Chacha Chen, He He, Chenhao Tan
  • for: When pneumonia is not found on a chest X-ray, should the report describe this negative observation or omit it?
  • methods: This question cannot be answered from the X-ray alone; it requires a pragmatic perspective that captures the communicative goal radiology reports serve between radiologists and patients.
  • results: The authors show that the indication, which describes why a patient comes for an X-ray, drives the mentions of negative observations, and they introduce indications as additional input to report generation. They also develop a framework to identify information that cannot be inferred from the image as a source of model hallucinations, and limit it by cleaning ground-truth reports. Pragmatic models built from indications and cleaned ground-truth reports outperform existing methods not only on new pragmatics-inspired metrics (+4.3 Negative F1) but also on standard metrics (+6.3 Positive F1 and +11.0 BLEU-2).
    Abstract When pneumonia is not found on a chest X-ray, should the report describe this negative observation or omit it? We argue that this question cannot be answered from the X-ray alone and requires a pragmatic perspective, which captures the communicative goal that radiology reports serve between radiologists and patients. However, the standard image-to-text formulation for radiology report generation fails to incorporate such pragmatic intents. Following this pragmatic perspective, we demonstrate that the indication, which describes why a patient comes for an X-ray, drives the mentions of negative observations and introduce indications as additional input to report generation. With respect to the output, we develop a framework to identify uninferable information from the image as a source of model hallucinations, and limit them by cleaning groundtruth reports. Finally, we use indications and cleaned groundtruth reports to develop pragmatic models, and show that they outperform existing methods not only in new pragmatics-inspired metrics (+4.3 Negative F1) but also in standard metrics (+6.3 Positive F1 and +11.0 BLEU-2).

Mission-driven Exploration for Accelerated Deep Reinforcement Learning with Temporal Logic Task Specifications

  • paper_url: http://arxiv.org/abs/2311.17059
  • repo_url: None
  • paper_authors: Jun Wang, Hosein Hasanbeig, Kaiyuan Tan, Zihe Sun, Yiannis Kantaros
  • for: This paper addresses the design of optimal control policies for mobile robots whose mission and safety requirements are specified in Linear Temporal Logic (LTL), operating with unknown stochastic dynamics in environments of unknown geometric structure.
  • methods: A deep reinforcement learning (DRL) algorithm synthesizes the control policies; an automaton representation of the LTL task is used to bias exploration toward mission-relevant directions, which accelerates learning.
  • results: Comparative experiments on robot navigation tasks show that the algorithm learns control policies quickly in unknown environments and accomplishes the tasks efficiently.
    Abstract This paper addresses the problem of designing optimal control policies for mobile robots with mission and safety requirements specified using Linear Temporal Logic (LTL). We consider robots with unknown stochastic dynamics operating in environments with unknown geometric structure. The robots are equipped with sensors allowing them to detect obstacles. Our goal is to synthesize a control policy that maximizes the probability of satisfying an LTL-encoded task in the presence of motion and environmental uncertainty. Several deep reinforcement learning (DRL) algorithms have been proposed recently to address similar problems. A common limitation in related works is that of slow learning performance. In order to address this issue, we propose a novel DRL algorithm, which has the capability to learn control policies at a notably faster rate compared to similar methods. Its sample efficiency is due to a mission-driven exploration strategy that prioritizes exploration towards directions that may contribute to mission accomplishment. Identifying these directions relies on an automaton representation of the LTL task as well as a learned neural network that (partially) models the unknown system dynamics. We provide comparative experiments demonstrating the efficiency of our algorithm on robot navigation tasks in unknown environments.
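
A toy version of mission-driven exploration can be built from a small automaton: a DFA tracks progress on a task like "eventually a, then eventually b", and epsilon-greedy exploration is biased toward actions whose predicted successor advances the DFA. The labelling function and the stand-in dynamics model below are hypothetical; the paper's learned neural dynamics model is not reproduced.

```python
# Toy automaton-biased exploration for an LTL-like reachability task.
import random

DFA = {(0, "a"): 1, (1, "b"): 2}   # automaton for "eventually a, then b"
ACCEPT = 2

def dfa_step(q, label):
    return DFA.get((q, label), q)

def label_of(s):                   # hypothetical state labelling
    return {5: "a", 9: "b"}.get(s, "")

def biased_action(s, q, actions, predict_next, eps=0.3):
    # Prefer actions whose predicted successor advances the automaton.
    progressing = [a for a in actions
                   if dfa_step(q, label_of(predict_next(s, a))) != q]
    if progressing and random.random() > eps:
        return random.choice(progressing)
    return random.choice(actions)  # otherwise explore uniformly

predict_next = lambda s, a: (s + a) % 10  # stand-in learned dynamics model
s, q = 0, 0
while q != ACCEPT:
    a = biased_action(s, q, [0, 1, 2], predict_next)
    s = predict_next(s, a)
    q = dfa_step(q, label_of(s))
print("mission satisfied at state", s)
```

The sample-efficiency claim follows from this bias: exploration budget is spent near transitions that can change the automaton state, rather than uniformly over the environment.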

Panoptic Video Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2311.17058
  • repo_url: https://github.com/jingkang50/openpvsg
  • paper_authors: Jingkang Yang, Wenxuan Peng, Xiangtai Li, Zujin Guo, Liangyu Chen, Bo Li, Zheng Ma, Kaiyang Zhou, Wayne Zhang, Chen Change Loy, Ziwei Liu
  • for: Toward building comprehensive real-world visual perception systems, this work proposes and studies a new problem: panoptic video scene graph generation (PVSG).
  • methods: Scene graph nodes are grounded with precise pixel-level segmentation masks rather than bounding boxes, which facilitates holistic scene understanding.
  • results: The authors contribute a dataset of 400 videos (289 third-person + 111 egocentric) with a total of 150K frames, along with a variety of baseline methods and useful design practices.
    Abstract Towards building comprehensive real-world visual perception systems, we propose and study a new problem called panoptic scene graph generation (PVSG). PVSG relates to the existing video scene graph generation (VidSGG) problem, which focuses on temporal interactions between humans and objects grounded with bounding boxes in videos. However, the limitation of bounding boxes in detecting non-rigid objects and backgrounds often causes VidSGG to miss key details crucial for comprehensive video understanding. In contrast, PVSG requires nodes in scene graphs to be grounded by more precise, pixel-level segmentation masks, which facilitate holistic scene understanding. To advance research in this new area, we contribute the PVSG dataset, which consists of 400 videos (289 third-person + 111 egocentric videos) with a total of 150K frames labeled with panoptic segmentation masks as well as fine, temporal scene graphs. We also provide a variety of baseline methods and share useful design practices for future work.

No Representation Rules Them All in Category Discovery

  • paper_url: http://arxiv.org/abs/2311.17055
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Sagar Vaze, Andrea Vedaldi, Andrew Zisserman
  • for: This paper tackles Generalized Category Discovery (GCD): given a dataset with labelled and unlabelled images, cluster all images in the unlabelled subset, whether or not they belong to the labelled categories.
  • methods: The paper first observes that most existing GCD benchmarks only contain labels for a single clustering of the data, making it difficult to tell whether models use the available labels or simply solve an unsupervised clustering problem. It therefore presents a synthetic dataset, 'Clevr-4', for category discovery. Clevr-4 contains four equally valid partitions of the data, based on object shape, texture, color, or count; to solve the task, models must extrapolate the taxonomy specified by the labelled set rather than latch onto a single natural grouping of the data.
  • results: Clevr-4 is used to demonstrate the limitations of unsupervised clustering in the GCD setting (even very strong unsupervised models fail on it) and to examine the weaknesses of existing GCD algorithms. A new method, $\mu$GCD, which leverages consistent findings from the representation learning literature, substantially outperforms the baselines on Clevr-4; transferred to the challenging Semantic Shift Benchmark (SSB), $\mu$GCD outperforms all prior work, setting a new state of the art.
    Abstract In this paper we tackle the problem of Generalized Category Discovery (GCD). Specifically, given a dataset with labelled and unlabelled images, the task is to cluster all images in the unlabelled subset, whether or not they belong to the labelled categories. Our first contribution is to recognize that most existing GCD benchmarks only contain labels for a single clustering of the data, making it difficult to ascertain whether models are using the available labels to solve the GCD task, or simply solving an unsupervised clustering problem. As such, we present a synthetic dataset, named 'Clevr-4', for category discovery. Clevr-4 contains four equally valid partitions of the data, i.e based on object shape, texture, color or count. To solve the task, models are required to extrapolate the taxonomy specified by the labelled set, rather than simply latching onto a single natural grouping of the data. We use this dataset to demonstrate the limitations of unsupervised clustering in the GCD setting, showing that even very strong unsupervised models fail on Clevr-4. We further use Clevr-4 to examine the weaknesses of existing GCD algorithms, and propose a new method which addresses these shortcomings, leveraging consistent findings from the representation learning literature to do so. Our simple solution, which is based on 'mean teachers' and termed $\mu$GCD, substantially outperforms implemented baselines on Clevr-4. Finally, when we transfer these findings to real data on the challenging Semantic Shift Benchmark (SSB), we find that $\mu$GCD outperforms all prior work, setting a new state-of-the-art. For the project webpage, see https://www.robots.ox.ac.uk/~vgg/data/clevr4/
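
The abstract describes $\mu$GCD as based on 'mean teachers'; the core of that recipe is an exponential-moving-average teacher that supplies targets for a consistency loss, sketched minimally below on random features. Architectures, momentum, and the augmentation are placeholder choices, not the paper's exact setup.

```python
# Minimal mean-teacher loop: teacher = EMA of the student, used as a target.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.SGD(student.parameters(), lr=0.1)

def ema_update(teacher, student, m=0.999):
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps.detach(), alpha=1 - m)

for step in range(100):
    x = torch.randn(32, 128)                    # unlabelled features
    x_aug = x + 0.1 * torch.randn_like(x)       # a second augmented "view"
    with torch.no_grad():
        targets = F.softmax(teacher(x), dim=-1) # teacher pseudo-targets
    loss = F.cross_entropy(student(x_aug), targets)
    opt.zero_grad(); loss.backward(); opt.step()
    ema_update(teacher, student)                # slow-moving teacher update
```

The EMA teacher changes slowly, so its pseudo-targets are more stable than the student's own predictions, which is what makes the consistency signal usable on the unlabelled subset.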

Shadows Don’t Lie and Lines Can’t Bend! Generative Models don’t know Projective Geometry…for now

  • paper_url: http://arxiv.org/abs/2311.17138
  • repo_url: None
  • paper_authors: Ayush Sarkar, Hanlin Mai, Amitabh Mahapatra, Svetlana Lazebnik, D. A. Forsyth, Anand Bhattad
  • for: This paper demonstrates that generated images have geometric features different from those of real images, and that current generators cannot reliably reproduce geometric properties of real images.
  • methods: The paper uses a set of collections of generated images that are prequalified to fool simple, signal-based classifiers into believing they are real, and three classifiers that look only at derived geometric features to identify generated images reliably.
  • results: The paper shows that the classifiers can identify generated images more reliably than SOTA local signal-based detectors, for images from a number of distinct generators, and that saliency maps suggest that the classifiers can identify geometric problems reliably.
    Abstract Generative models can produce impressively realistic images. This paper demonstrates that generated images have geometric features different from those of real images. We build a set of collections of generated images, prequalified to fool simple, signal-based classifiers into believing they are real. We then show that prequalified generated images can be identified reliably by classifiers that only look at geometric properties. We use three such classifiers. All three classifiers are denied access to image pixels, and look only at derived geometric features. The first classifier looks at the perspective field of the image, the second looks at lines detected in the image, and the third looks at relations between detected objects and shadows. Our procedure detects generated images more reliably than SOTA local signal based detectors, for images from a number of distinct generators. Saliency maps suggest that the classifiers can identify geometric problems reliably. We conclude that current generators cannot reliably reproduce geometric properties of real images.

Generative Models: What do they know? Do they know things? Let’s find out!

  • paper_url: http://arxiv.org/abs/2311.17137
  • repo_url: None
  • paper_authors: Xiaodan Du, Nicholas Kolkin, Greg Shakhnarovich, Anand Bhattad
  • for: This work shows that generative models internally produce high-quality scene intrinsic maps, and proposes a universal, plug-and-play method that turns any generative model into a scene intrinsics predictor, extracting intrinsic maps directly from the original generator network without additional decoders or full fine-tuning.
  • methods: The approach, Intrinsic LoRA, applies Low-Rank Adaptation (LoRA) to key feature maps, with newly learned parameters making up less than 0.6% of the total. Optimized with a small set of labeled images, the model-agnostic approach adapts to various generative architectures, including diffusion models, GANs, and autoregressive models.
  • results: The scene intrinsic maps produced by the method compare well with, and in some cases surpass, those generated by leading supervised techniques.
    Abstract Generative models have been shown to be capable of synthesizing highly detailed and realistic images. It is natural to suspect that they implicitly learn to model some image intrinsics such as surface normals, depth, or shadows. In this paper, we present compelling evidence that generative models indeed internally produce high-quality scene intrinsic maps. We introduce Intrinsic LoRA (I LoRA), a universal, plug-and-play approach that transforms any generative model into a scene intrinsic predictor, capable of extracting intrinsic scene maps directly from the original generator network without needing additional decoders or fully fine-tuning the original network. Our method employs a Low-Rank Adaptation (LoRA) of key feature maps, with newly learned parameters that make up less than 0.6% of the total parameters in the generative model. Optimized with a small set of labeled images, our model-agnostic approach adapts to various generative architectures, including Diffusion models, GANs, and Autoregressive models. We show that the scene intrinsic maps produced by our method compare well with, and in some cases surpass those generated by leading supervised techniques.
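
The sub-0.6% parameter budget follows directly from the LoRA construction: a frozen weight W is augmented with a trainable low-rank product BA. A generic LoRA wrapper (not the authors' exact attachment points or ranks) looks like this:

```python
# Generic LoRA adapter: frozen base weight plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)             # original weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank               # B starts at zero: no-op init

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.3%}")  # ~1% at rank 4 here
```

Because B is initialized to zero, the adapted model starts out identical to the frozen generator, and only the tiny A/B matrices move during the intrinsics-supervised optimization.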

DiffuseBot: Breeding Soft Robots With Physics-Augmented Generative Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.17053
  • repo_url: None
  • paper_authors: Tsun-Hsuan Wang, Juntian Zheng, Pingchuan Ma, Yilun Du, Byungchul Kim, Andrew Spielberg, Joshua Tenenbaum, Chuang Gan, Daniela Rus
  • for: This paper aims at a generative model that can breed capable soft robots, serving applications in physical soft robotics and virtual character creation.
  • methods: DiffuseBot augments the diffusion process with a physical dynamical simulation that provides a certificate of performance, and introduces a co-design procedure that jointly optimizes physical design and control by leveraging physical sensitivities from differentiable simulation.
  • results: Experiments with simulated and fabricated robots show that DiffuseBot produces a range of soft robot morphologies and controllers capable of excelling in a wide spectrum of tasks.
    Abstract Nature evolves creatures with a high complexity of morphological and behavioral intelligence, meanwhile computational methods lag in approaching that diversity and efficacy. Co-optimization of artificial creatures' morphology and control in silico shows promise for applications in physical soft robotics and virtual character creation; such approaches, however, require developing new learning algorithms that can reason about function atop pure structure. In this paper, we present DiffuseBot, a physics-augmented diffusion model that generates soft robot morphologies capable of excelling in a wide spectrum of tasks. DiffuseBot bridges the gap between virtually generated content and physical utility by (i) augmenting the diffusion process with a physical dynamical simulation which provides a certificate of performance, and (ii) introducing a co-design procedure that jointly optimizes physical design and control by leveraging information about physical sensitivities from differentiable simulation. We showcase a range of simulated and fabricated robots along with their capabilities. Check our website at https://diffusebot.github.io/

UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

  • paper_url: http://arxiv.org/abs/2311.17136
  • repo_url: None
  • paper_authors: Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, Wenhu Chen
  • for: This work proposes a unified, instruction-guided multimodal retriever that can serve diverse information-seeking needs within a single system.
  • methods: UniIR is jointly trained on ten diverse multimodal-IR datasets; multi-task training and instruction tuning give it the ability to interpret user instructions and execute eight distinct retrieval tasks across modalities.
  • results: UniIR shows robust performance on existing datasets and zero-shot generalization to new tasks. The authors also construct M-BEIR, a multimodal retrieval benchmark with comprehensive results, to standardize the evaluation of universal multimodal information retrieval.
    Abstract Existing information retrieval (IR) models often assume a homogeneous format, limiting their applicability to diverse user needs, such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image. To approach such different information-seeking demands, we introduce UniIR, a unified instruction-guided multimodal retriever capable of handling eight distinct retrieval tasks across modalities. UniIR, a single retrieval system jointly trained on ten diverse multimodal-IR datasets, interprets user instructions to execute various retrieval tasks, demonstrating robust performance across existing datasets and zero-shot generalization to new tasks. Our experiments highlight that multi-task training and instruction tuning are keys to UniIR's generalization ability. Additionally, we construct the M-BEIR, a multimodal retrieval benchmark with comprehensive results, to standardize the evaluation of universal multimodal information retrieval.

Efficient In-Context Learning in Vision-Language Models for Egocentric Videos

  • paper_url: http://arxiv.org/abs/2311.17041
  • repo_url: None
  • paper_authors: Keunwoo Peter Yu, Zheyuan Zhang, Fengyuan Hu, Joyce Chai
  • for: This work aims to improve in-context learning in egocentric vision-language models (VLMs) so that they can adapt to new tasks from a few demonstrations.
  • methods: It proposes a novel training method, EILEV (Efficient In-context Learning on Egocentric Videos), which elicits in-context learning in VLMs for egocentric videos without requiring massive, naturalistic egocentric video datasets.
  • results: EILEV-trained models outperform larger VLMs trained on huge amounts of naturalistic data in in-context learning, and they generalize via in-context learning to novel, rare egocentric videos and texts, indicating potential for applications that require cost-effective training and rapid post-deployment adaptability.
    Abstract Recent advancements in text-only large language models (LLMs) have highlighted the benefit of in-context learning for adapting to new tasks with a few demonstrations. However, extending in-context learning to large vision-language models (VLMs) using a huge amount of naturalistic vision-language data has shown limited success, particularly for egocentric videos, due to high data collection costs. We propose a novel training method $\mathbb{E}$fficient $\mathbb{I}$n-context $\mathbb{L}$earning on $\mathbb{E}$gocentric $\mathbb{V}$ideos ($\mathbb{EILEV}$), which elicits in-context learning in VLMs for egocentric videos without requiring massive, naturalistic egocentric video datasets. $\mathbb{EILEV}$ involves architectural and training data adaptations to allow the model to process contexts interleaved with video clips and narrations, sampling of in-context examples with clusters of similar verbs and nouns, use of data with skewed marginal distributions with a long tail of infrequent verbs and nouns, as well as homonyms and synonyms. Our evaluations show that $\mathbb{EILEV}$-trained models outperform larger VLMs trained on a huge amount of naturalistic data in in-context learning. Furthermore, they can generalize to not only out-of-distribution, but also novel, rare egocentric videos and texts via in-context learning, demonstrating potential for applications requiring cost-effective training, and rapid post-deployment adaptability. Our code and demo are available at \url{https://github.com/yukw777/EILEV}.

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

  • paper_url: http://arxiv.org/abs/2311.17030
  • repo_url: https://github.com/amakelov/activation-patching-illusion
  • paper_authors: Aleksandar Makelov, Georg Lange, Neel Nanda
  • for: This work studies mechanistic interpretability, specifically whether interventions such as activation patching can simultaneously manipulate model behavior and attribute the features behind it to given subspaces.
  • methods: Subspace activation patching is examined through a distilled mathematical example, two real-world domains (the indirect object identification task and factual recall), and supporting experiments.
  • results: The paper finds that subspace interventions can create an illusory sense of interpretability: even if patching makes the model's output behave as if a feature changed, the effect may come from activating a dormant parallel pathway that is causally disconnected from model outputs. In factual recall, this phenomenon links to rank-1 fact editing, and a success case on indirect object identification illustrates what additional evidence is needed to argue that a patched subspace is faithful.
    Abstract Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features, often hypothesized to manifest as low-dimensional subspaces of activations. Specifically, recent studies have explored subspace interventions (such as activation patching) as a way to simultaneously manipulate model behavior and attribute the features behind it to given subspaces. In this work, we demonstrate that these two aims diverge, potentially leading to an illusory sense of interpretability. Counterintuitively, even if a subspace intervention makes the model's output behave as if the value of a feature was changed, this effect may be achieved by activating a dormant parallel pathway leveraging another subspace that is causally disconnected from model outputs. We demonstrate this phenomenon in a distilled mathematical example, in two real-world domains (the indirect object identification task and factual recall), and present evidence for its prevalence in practice. In the context of factual recall, we further show a link to rank-1 fact editing, providing a mechanistic explanation for previous work observing an inconsistency between fact editing performance and fact localization. However, this does not imply that activation patching of subspaces is intrinsically unfit for interpretability. To contextualize our findings, we also show what a success case looks like in a task (indirect object identification) where prior manual circuit analysis informs an understanding of the location of a feature. We explore the additional evidence needed to argue that a patched subspace is faithful.
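
The intervention itself is simple linear algebra: project the activation onto the candidate subspace and swap in that component from another run, leaving the orthogonal complement untouched. A numpy sketch, with random vectors standing in for real model activations:

```python
# Subspace activation patching in miniature.
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 2
V, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal basis, shape (d, k)
P = V @ V.T                                   # projector onto the subspace

a_clean = rng.normal(size=d)                  # activation on the clean input
a_corrupt = rng.normal(size=d)                # activation on the corrupted input

# Keep a_clean's component outside the subspace; replace the component
# inside it with the corrupted run's value.
a_patched = a_clean - P @ a_clean + P @ a_corrupt

# Only the in-subspace part moved:
assert np.allclose((np.eye(d) - P) @ a_patched, (np.eye(d) - P) @ a_clean)
```

The paper's warning is precisely that a behavioral change after this substitution does not by itself prove the subspace carries the hypothesized feature; the change can route through a causally disconnected parallel pathway.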

When the Few Outweigh the Many: Illicit Content Recognition with Few-Shot Learning

  • paper_url: http://arxiv.org/abs/2311.17026
  • repo_url: None
  • paper_authors: G. Cascavilla, G. Catolino, M. Conti, D. Mellios, D. A. Tamburri
  • for: This work investigates an alternative technique for recognizing illicit activities on the dark web: identifying them from images rather than from textual market content.
  • methods: It uses label-agnostic One-Shot and Few-Shot learning with Siamese neural networks, a state-of-the-art approach in the field that can handle small-scale datasets.
  • results: Siamese neural networks reach 90.9% accuracy on 20-shot experiments over a 10-class dataset, suggesting that such models are a promising and cheaper alternative for automated law enforcement over the dark web.
    Abstract The anonymity and untraceability benefits of the Dark web account for the exponentially-increased potential of its popularity while creating a suitable womb for many illicit activities, to date. Hence, in collaboration with cybersecurity and law enforcement agencies, research has provided approaches for recognizing and classifying illicit activities, most exploiting textual dark web markets' content recognition; few such approaches use images that originated from dark web content. This paper investigates this alternative technique for recognizing illegal activities from images. In particular, we investigate label-agnostic learning techniques like One-Shot and Few-Shot learning featuring the use of Siamese neural networks, a state-of-the-art approach in the field. Our solution manages to handle small-scale datasets with promising accuracy. In particular, Siamese neural networks reach 90.9% on 20-Shot experiments over a 10-class dataset; this leads us to conclude that such models are a promising and cheaper alternative to the definition of automated law-enforcing machinery over the dark web.
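
A Siamese few-shot classifier has two pieces: a shared embedding network trained with a pairwise (e.g., contrastive) loss, and nearest-support-example inference. The sketch below uses random tensors in place of real imagery; the architecture and margin are assumptions, not the paper's configuration.

```python
# Siamese few-shot skeleton: shared embedding + pairwise contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                      nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(32, 64))

def contrastive_loss(z1, z2, same, margin=1.0):
    dist = F.pairwise_distance(z1, z2)
    return (same * dist.pow(2) +
            (1 - same) * F.relu(margin - dist).pow(2)).mean()

x1, x2 = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
same = torch.randint(0, 2, (8,)).float()     # 1 = same class, 0 = different
loss = contrastive_loss(embed(x1), embed(x2), same)
loss.backward()

# 20-shot inference: label a query by its nearest support embedding.
support = torch.randn(20 * 10, 3, 64, 64)    # 20 shots x 10 classes
labels = torch.arange(10).repeat_interleave(20)
with torch.no_grad():
    d = torch.cdist(embed(torch.randn(1, 3, 64, 64)), embed(support))
print("predicted class:", labels[d.argmin()].item())
```

Because classes are only represented by support embeddings at inference time, new illicit categories can be added without retraining, which is what makes the approach cheap to operate.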

Deployment of a Robust and Explainable Mortality Prediction Model: The COVID-19 Pandemic and Beyond

  • paper_url: http://arxiv.org/abs/2311.17133
  • repo_url: None
  • paper_authors: Jacob R. Epifano, Stephen Glass, Ravi P. Ramachandran, Sharad Patel, Aaron J. Masino, Ghulam Rasool
  • for: This study investigates the performance, explainability, and robustness of deployed artificial intelligence (AI) models for predicting mortality during the COVID-19 pandemic and beyond.
  • methods: The study uses Bayesian Neural Networks (BNNs) and intelligent training techniques, which allow the models to maintain performance amid significant data shifts.
  • results: BNNs and intelligent training yield high-performing, explainable AI models whose quantified uncertainty supports reliable predictions and better-informed clinical decisions in real-world settings.
    Abstract This study investigated the performance, explainability, and robustness of deployed artificial intelligence (AI) models in predicting mortality during the COVID-19 pandemic and beyond. The first study of its kind, we found that Bayesian Neural Networks (BNNs) and intelligent training techniques allowed our models to maintain performance amidst significant data shifts. Our results emphasize the importance of developing robust AI models capable of matching or surpassing clinician predictions, even under challenging conditions. Our exploration of model explainability revealed that stochastic models generate more diverse and personalized explanations thereby highlighting the need for AI models that provide detailed and individualized insights in real-world clinical settings. Furthermore, we underscored the importance of quantifying uncertainty in AI models which enables clinicians to make better-informed decisions based on reliable predictions. Our study advocates for prioritizing implementation science in AI research for healthcare and ensuring that AI solutions are practical, beneficial, and sustainable in real-world clinical environments. By addressing unique challenges and complexities in healthcare settings, researchers can develop AI models that effectively improve clinical practice and patient outcomes.
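
The paper's models are Bayesian neural networks; as a lightweight stand-in that conveys the same idea of per-patient predictive uncertainty, the sketch below uses Monte Carlo dropout, averaging several stochastic forward passes and reporting the spread.

```python
# Predictive uncertainty via Monte Carlo dropout (a stand-in for a full BNN).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.2),
                      nn.Linear(64, 64), nn.ReLU(), nn.Dropout(0.2),
                      nn.Linear(64, 1), nn.Sigmoid())

def predict_with_uncertainty(model, x, n_samples=50):
    model.train()                  # keep dropout active at inference time
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(0), samples.std(0)

x = torch.randn(4, 20)             # stand-in for patient features
mean, std = predict_with_uncertainty(model, x)
for m, s in zip(mean.squeeze(1), std.squeeze(1)):
    print(f"mortality risk {m:.2f} +/- {s:.2f}")  # flag high-std cases for review
```

Surfacing the standard deviation alongside the risk estimate is what lets clinicians distinguish a confident prediction from one the model is effectively guessing at.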

Foundational Moral Values for AI Alignment

  • paper_url: http://arxiv.org/abs/2311.17017
  • repo_url: None
  • paper_authors: Betty Li Hou, Brian Patrick Green
  • for: This work aims to provide clear, defensible values toward which AI systems can align, built on the requisites for human existence.
  • methods: It draws five core, foundational values from moral philosophy as targets for technical alignment work: survival, sustainable intergenerational existence, society, education, and truth.
  • results: These values not only give technical alignment work a clearer direction, but also serve as a framework to highlight threats and opportunities from AI systems with respect to obtaining and sustaining these values.
    Abstract Solving the AI alignment problem requires having clear, defensible values towards which AI systems can align. Currently, targets for alignment remain underspecified and do not seem to be built from a philosophically robust structure. We begin the discussion of this problem by presenting five core, foundational values, drawn from moral philosophy and built on the requisites for human existence: survival, sustainable intergenerational existence, society, education, and truth. We show that these values not only provide a clearer direction for technical alignment work, but also serve as a framework to highlight threats and opportunities from AI systems to both obtain and sustain these values.

TransNeXt: Robust Foveal Visual Perception for Vision Transformers

  • paper_url: http://arxiv.org/abs/2311.17132
  • repo_url: None
  • paper_authors: Dai Shi
  • for: Improving models' natural visual perception and local modeling capability.
  • methods: The paper proposes Aggregated Attention, a token mixer inspired by biological foveal vision and continuous eye movement, and adds learnable tokens that interact with conventional queries and keys to diversify the generation of affinity matrices. It also proposes Convolutional GLU, a channel mixer bridging the GLU and SE mechanisms so that each token receives channel attention based on its nearest-neighbor image features.
  • results: The resulting TransNeXt models achieve state-of-the-art performance across multiple model sizes, including 84.0% ImageNet accuracy for TransNeXt-Tiny with 69% fewer parameters than ConvNeXt-B, and, for TransNeXt-Base at $384^2$ resolution, 86.2% ImageNet accuracy, 61.6% ImageNet-A accuracy, 57.1 COCO object detection mAP, and 54.7 ADE20K semantic segmentation mIoU.
    Abstract Due to the depth degradation effect in residual connections, many efficient Vision Transformers models that rely on stacking layers for information exchange often fail to form sufficient information mixing, leading to unnatural visual perception. To address this issue, in this paper, we propose Aggregated Attention, a biomimetic design-based token mixer that simulates biological foveal vision and continuous eye movement while enabling each token on the feature map to have a global perception. Furthermore, we incorporate learnable tokens that interact with conventional queries and keys, which further diversifies the generation of affinity matrices beyond merely relying on the similarity between queries and keys. Our approach does not rely on stacking for information exchange, thus effectively avoiding depth degradation and achieving natural visual perception. Additionally, we propose Convolutional GLU, a channel mixer that bridges the gap between GLU and SE mechanism, which empowers each token to have channel attention based on its nearest neighbor image features, enhancing local modeling capability and model robustness. We combine aggregated attention and convolutional GLU to create a new visual backbone called TransNeXt. Extensive experiments demonstrate that our TransNeXt achieves state-of-the-art performance across multiple model sizes. At a resolution of $224^2$, TransNeXt-Tiny attains an ImageNet accuracy of 84.0%, surpassing ConvNeXt-B with 69% fewer parameters. Our TransNeXt-Base achieves an ImageNet accuracy of 86.2% and an ImageNet-A accuracy of 61.6% at a resolution of $384^2$, a COCO object detection mAP of 57.1, and an ADE20K semantic segmentation mIoU of 54.7.
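
The abstract describes Convolutional GLU only at a high level; one plausible reading is a GLU whose gating branch passes through a 3x3 depthwise convolution, so each token is gated by its immediate spatial neighbourhood. The sketch below follows that reading and may differ from the official implementation.

```python
# A plausible reading of the Convolutional GLU channel mixer.
import torch
import torch.nn as nn

class ConvGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden * 2)     # value and gate branches
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):                   # x: (B, H*W, dim) tokens
        v, g = self.fc1(x).chunk(2, dim=-1)
        B, N, C = g.shape
        g = g.transpose(1, 2).reshape(B, C, H, W) # gate as a spatial map
        g = self.dw(g).flatten(2).transpose(1, 2) # depthwise local context
        return self.fc2(v * self.act(g))          # SE-like, neighbour-driven gate

tokens = torch.randn(2, 14 * 14, 96)
out = ConvGLU(96, 192)(tokens, 14, 14)            # (2, 196, 96)
```

The depthwise convolution is what distinguishes this from a plain GLU: the gate for each channel is computed from a token's local neighbourhood, giving the SE-like channel attention the abstract describes.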

Computational Hypergraph Discovery, a Gaussian Process framework for connecting the dots

  • paper_url: http://arxiv.org/abs/2311.17007
  • repo_url: https://github.com/theobourdais/computationalhypergraphdiscovery
  • paper_authors: Théo Bourdais, Pau Batlle, Xianjin Yang, Ricardo Baptista, Nicolas Rouquette, Houman Owhadi
  • for: This paper targets data-driven problems in Computational Science and Engineering and Scientific Machine Learning, in particular "Type 3" problems where the hypergraph of dependencies between variables is itself unknown.
  • methods: It uses a Gaussian Process (GP) framework based on a kernel generalization of Row Echelon Form reduction from linear systems to nonlinear ones, together with variance-based analysis: variables are linked via GPs, and those contributing the most data variance reveal the hypergraph's structure.
  • results: The resulting interpretable GP framework supports the data-driven discovery and completion of computational hypergraphs, with applications to (algebraic) equation discovery, network discovery (gene pathways, chemical, and mechanical), and raw data analysis demonstrating its scope and efficiency.
    Abstract Most scientific challenges can be framed into one of the following three levels of complexity of function approximation. Type 1: Approximate an unknown function given input/output data. Type 2: Consider a collection of variables and functions, some of which are unknown, indexed by the nodes and hyperedges of a hypergraph (a generalized graph where edges can connect more than two vertices). Given partial observations of the variables of the hypergraph (satisfying the functional dependencies imposed by its structure), approximate all the unobserved variables and unknown functions. Type 3: Expanding on Type 2, if the hypergraph structure itself is unknown, use partial observations of the variables of the hypergraph to discover its structure and approximate its unknown functions. While most Computational Science and Engineering and Scientific Machine Learning challenges can be framed as Type 1 and Type 2 problems, many scientific problems can only be categorized as Type 3. Despite their prevalence, these Type 3 challenges have been largely overlooked due to their inherent complexity. Although Gaussian Process (GP) methods are sometimes perceived as well-founded but old technology limited to Type 1 curve fitting, their scope has recently been expanded to Type 2 problems. In this paper, we introduce an interpretable GP framework for Type 3 problems, targeting the data-driven discovery and completion of computational hypergraphs. Our approach is based on a kernel generalization of Row Echelon Form reduction from linear systems to nonlinear ones and variance-based analysis. Here, variables are linked via GPs and those contributing to the highest data variance unveil the hypergraph's structure. We illustrate the scope and efficiency of the proposed approach with applications to (algebraic) equation discovery, network discovery (gene pathways, chemical, and mechanical) and raw data analysis.
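
One concrete way to "connect the dots" with GPs is automatic relevance determination: regress each variable on the others with a per-dimension length-scale kernel, and keep hypergraph edges only for inputs the fit deems relevant. The sketch below captures that spirit with scikit-learn; it is not the authors' kernel construction or pruning procedure.

```python
# ARD-flavored dependency discovery with a Gaussian process regression.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # candidate parent variables
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2   # x2 is irrelevant by design

kernel = RBF(length_scale=np.ones(3)) + WhiteKernel(1e-2)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
ls = gp.kernel_.k1.length_scale            # optimized per-input length-scales
print("ARD length-scales:", np.round(ls, 2))  # expect a large value for x2
# Edges of the hypergraph would be kept only for inputs with small
# length-scales, i.e., inputs that explain a meaningful share of variance.
```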

Goal-conditioned Offline Planning from Curious Exploration

  • paper_url: http://arxiv.org/abs/2311.16996
  • repo_url: https://github.com/martius-lab/gcopfce
  • paper_authors: Marco Bagatella, Georg Martius
  • for: This paper focuses on extracting goal-conditioned behavior from the products of unsupervised, curiosity-driven exploration, without any additional environment interaction.
  • methods: It combines model-based planning over learned value landscapes with a graph-based value aggregation scheme to correct estimation artifacts in learned goal-conditioned value functions.
  • results: The combination corrects both local and global artifacts, yielding significant improvements in zero-shot goal-reaching performance across diverse simulated environments.
    Abstract Curiosity has established itself as a powerful exploration strategy in deep reinforcement learning. Notably, leveraging expected future novelty as intrinsic motivation has been shown to efficiently generate exploratory trajectories, as well as a robust dynamics model. We consider the challenge of extracting goal-conditioned behavior from the products of such unsupervised exploration techniques, without any additional environment interaction. We find that conventional goal-conditioned reinforcement learning approaches for extracting a value function and policy fall short in this difficult offline setting. By analyzing the geometry of optimal goal-conditioned value functions, we relate this issue to a specific class of estimation artifacts in learned values. In order to mitigate their occurrence, we propose to combine model-based planning over learned value landscapes with a graph-based value aggregation scheme. We show how this combination can correct both local and global artifacts, obtaining significant improvements in zero-shot goal-reaching performance across diverse simulated environments.

Debiasing Multimodal Models via Causal Information Minimization

  • paper_url: http://arxiv.org/abs/2311.16941
  • repo_url: https://github.com/vaidehi99/causalinfomin
  • paper_authors: Vaidehi Patil, Adyasha Maharana, Mohit Bansal
  • for: The paper addresses bias in multimodal models, where existing debiasing methods rely on approximate heuristics to represent biases, and proposes a causally-motivated alternative.
  • methods: Using causal graphs for multimodal data, the method learns confounder representations by minimizing the information content of features obtained from a pretrained biased model, then applies these representations through debiasing methods motivated by causal theory.
  • results: The proposed methods improve out-of-distribution (OOD) performance on multiple multimodal datasets without sacrificing in-distribution performance, and a novel metric quantifying the sufficiency of spurious features in models' predictions further demonstrates their effectiveness.
    Abstract Most existing debiasing methods for multimodal models, including causal intervention and inference methods, utilize approximate heuristics to represent the biases, such as shallow features from early stages of training or unimodal features for multimodal tasks like VQA, etc., which may not be accurate. In this paper, we study bias arising from confounders in a causal graph for multimodal data and examine a novel approach that leverages causally-motivated information minimization to learn the confounder representations. Robust predictive features contain diverse information that helps a model generalize to out-of-distribution data. Hence, minimizing the information content of features obtained from a pretrained biased model helps learn the simplest predictive features that capture the underlying data distribution. We treat these features as confounder representations and use them via methods motivated by causal theory to remove bias from models. We find that the learned confounder representations indeed capture dataset biases, and the proposed debiasing methods improve out-of-distribution (OOD) performance on multiple multimodal datasets without sacrificing in-distribution performance. Additionally, we introduce a novel metric to quantify the sufficiency of spurious features in models' predictions that further demonstrates the effectiveness of our proposed methods. Our code is available at: https://github.com/Vaidehi99/CausalInfoMin

Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

  • paper_url: http://arxiv.org/abs/2311.16922
  • repo_url: https://github.com/damo-nlp-sg/vcd
  • paper_authors: Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing
  • for: Improving the reliability of Large Vision-Language Models (LVLMs) by mitigating object hallucinations in generated content.
  • methods: Proposes Visual Contrastive Decoding (VCD), a simple, training-free method that contrasts output distributions derived from original and distorted visual inputs, reducing over-reliance on statistical bias and unimodal priors.
  • results: Across different LVLM families, VCD effectively mitigates object hallucinations and also performs well on general LVLM benchmarks, demonstrating its broad applicability.
    Abstract Large Vision-Language Models (LVLMs) have advanced considerably, intertwining visual recognition and language understanding to generate content that is not only coherent but also contextually attuned. Despite their success, LVLMs still suffer from the issue of object hallucinations, where models generate plausible yet incorrect outputs that include objects that do not exist in the images. To mitigate this issue, we introduce Visual Contrastive Decoding (VCD), a simple and training-free method that contrasts output distributions derived from original and distorted visual inputs. The proposed VCD effectively reduces the over-reliance on statistical bias and unimodal priors, two essential causes of object hallucinations. This adjustment ensures the generated content is closely grounded to visual inputs, resulting in contextually accurate outputs. Our experiments show that VCD, without either additional training or the usage of external tools, significantly mitigates the object hallucination issue across different LVLM families. Beyond mitigating object hallucinations, VCD also excels in general LVLM benchmarks, highlighting its wide-ranging applicability.
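The core of VCD is a one-line adjustment to the decoding distribution. The sketch below applies it at a single decoding step, following the contrastive form described in the paper; `model` and its call signature are illustrative stand-ins rather than a specific LVLM API, and Gaussian noise stands in for the paper's image distortion.

```python
# Hedged sketch of Visual Contrastive Decoding at one decoding step.
import torch

def vcd_next_token(model, tokens, image, alpha=1.0, noise_std=0.5):
    clean_logits = model(tokens, image)                   # conditioned on true image
    distorted = image + noise_std * torch.randn_like(image)
    noisy_logits = model(tokens, distorted)               # prior-dominated branch
    # amplify what the clean image supports over what the prior hallucinates
    contrastive = (1 + alpha) * clean_logits - alpha * noisy_logits
    return torch.distributions.Categorical(logits=contrastive).sample()
```

Sampling from the contrastive distribution suppresses tokens that the language prior favors regardless of the visual input, which is precisely the failure mode behind object hallucinations.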

RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D

  • paper_url: http://arxiv.org/abs/2311.16918
  • repo_url: None
  • paper_authors: Lingteng Qiu, Guanying Chen, Xiaodong Gu, Qi Zuo, Mutian Xu, Yushuang Wu, Weihao Yuan, Zilong Dong, Liefeng Bo, Xiaoguang Han
  • for: Improving the fidelity and detail richness of lifting 2D diffusion models to 3D generation.
  • methods: Learns a generalizable Normal-Depth diffusion model, trained on the large-scale LAION dataset together with generalizable image-to-depth and image-to-normal prior models, plus an albedo diffusion model that imposes data-driven constraints on the albedo component.
  • results: When integrated into existing text-to-3D pipelines, the models significantly enhance detail richness and achieve state-of-the-art results.
    Abstract Lifting 2D diffusion for 3D generation is a challenging problem due to the lack of geometric prior and the complex entanglement of materials and lighting in natural images. Existing methods have shown promise by first creating the geometry through score-distillation sampling (SDS) applied to rendered surface normals, followed by appearance modeling. However, relying on a 2D RGB diffusion model to optimize surface normals is suboptimal due to the distribution discrepancy between natural images and normals maps, leading to instability in optimization. In this paper, recognizing that the normal and depth information effectively describe scene geometry and be automatically estimated from images, we propose to learn a generalizable Normal-Depth diffusion model for 3D generation. We achieve this by training on the large-scale LAION dataset together with the generalizable image-to-depth and normal prior models. In an attempt to alleviate the mixed illumination effects in the generated materials, we introduce an albedo diffusion model to impose data-driven constraints on the albedo component. Our experiments show that when integrated into existing text-to-3D pipelines, our models significantly enhance the detail richness, achieving state-of-the-art results. Our project page is https://lingtengqiu.github.io/RichDreamer/.

Optimization Theory Based Deep Reinforcement Learning for Resource Allocation in Ultra-Reliable Wireless Networked Control Systems

  • paper_url: http://arxiv.org/abs/2311.16895
  • repo_url: None
  • paper_authors: Hamida Qumber Ali, Amirhassan Babazadeh Darabi, Sinem Coleri
  • for: The joint design of controller and communication systems for Wireless Networked Control Systems (WNCS), minimizing power consumption while satisfying schedulability and rate constraints of the communication system and the stability constraint of the control system.
  • methods: Proposes an optimization-theory-based deep reinforcement learning (DRL) framework with two stages: an optimization-theory stage that derives optimality conditions relating the decision variables and decomposes the problem into building blocks, and a DRL stage that replaces the simplified but intractable blocks with deep reinforcement learning.
  • results: Extensive simulations show the approach outperforms both pure optimization-theory and pure DRL baselines, achieving close-to-optimal performance at much lower complexity.
    Abstract The design of Wireless Networked Control System (WNCS) requires addressing critical interactions between control and communication systems with minimal complexity and communication overhead while providing ultra-high reliability. This paper introduces a novel optimization theory based deep reinforcement learning (DRL) framework for the joint design of controller and communication systems. The objective of minimum power consumption is targeted while satisfying the schedulability and rate constraints of the communication system in the finite blocklength regime and stability constraint of the control system. Decision variables include the sampling period in the control system, and blocklength and packet error probability in the communication system. The proposed framework contains two stages: optimization theory and DRL. In the optimization theory stage, following the formulation of the joint optimization problem, optimality conditions are derived to find the mathematical relations between the optimal values of the decision variables. These relations allow the decomposition of the problem into multiple building blocks. In the DRL stage, the blocks that are simplified but not tractable are replaced by DRL. Via extensive simulations, the proposed optimization theory based DRL approach is demonstrated to outperform the optimization theory and pure DRL based approaches, with close to optimal performance and much lower complexity.

Vulnerability Analysis of Transformer-based Optical Character Recognition to Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2311.17128
  • repo_url: None
  • paper_authors: Lucas Beerens, Desmond J. Higham
  • for: Investigating the security of transformer-based Optical Character Recognition (OCR) systems against adversarial attacks.
  • methods: Develops a novel framework for assessing adversarial resilience, with algorithms for both untargeted and targeted attacks, measured by Character Error Rate (CER) and success ratio respectively.
  • results: TrOCR is highly vulnerable to untargeted attacks, with a CER above 1 from perturbations that are imperceptible to the eye; targeted attacks on single tokens reach success rates of around 25%, forcing TrOCR to output the tenth most likely token from a large vocabulary.
    Abstract Recent advancements in Optical Character Recognition (OCR) have been driven by transformer-based models. OCR systems are critical in numerous high-stakes domains, yet their vulnerability to adversarial attack remains largely uncharted territory, raising concerns about security and compliance with emerging AI regulations. In this work we present a novel framework to assess the resilience of Transformer-based OCR (TrOCR) models. We develop and assess algorithms for both targeted and untargeted attacks. For the untargeted case, we measure the Character Error Rate (CER), while for the targeted case we use the success ratio. We find that TrOCR is highly vulnerable to untargeted attacks and somewhat less vulnerable to targeted attacks. On a benchmark handwriting data set, untargeted attacks can cause a CER of more than 1 without being noticeable to the eye. With a similar perturbation size, targeted attacks can lead to success rates of around $25\%$ -- here we attacked single tokens, requiring TrOCR to output the tenth most likely token from a large vocabulary.
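A hedged sketch of an untargeted FGSM-style attack of the kind the paper evaluates, scored by Character Error Rate. The Hugging Face-style `model(pixel_values=..., labels=...)` interface is an assumption about how TrOCR is wrapped; the paper's exact attack procedure may differ.

```python
# Hedged sketch: one-step untargeted attack on a seq2seq OCR model, scored by CER.
import torch

def levenshtein(a, b):
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[-1]

def cer(prediction, reference):
    """Character Error Rate: edit distance per reference character."""
    return levenshtein(prediction, reference) / max(len(reference), 1)

def fgsm_untargeted(model, image, label_ids, epsilon=2 / 255):
    """Move the image along the gradient that *increases* the loss w.r.t. the
    correct transcription (untargeted attack)."""
    image = image.clone().requires_grad_(True)
    loss = model(pixel_values=image, labels=label_ids).loss
    loss.backward()
    return (image + epsilon * image.grad.sign()).clamp(0, 1).detach()
```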

The Falcon Series of Open Language Models

  • paper_url: http://arxiv.org/abs/2311.16867
  • repo_url: None
  • paper_authors: Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, Guilherme Penedo
  • for: Presents the Falcon series of causal decoder-only models (7B, 40B, and 180B parameters) trained on a large, diverse, predominantly web-based corpus.
  • methods: Pretrains the models with a custom distributed training codebase on up to 4,096 A100s on cloud AWS infrastructure with limited interconnect.
  • results: The largest model, Falcon-180B, significantly outperforms models such as PaLM or Chinchilla and approaches the performance of PaLM-2-Large at a reduced pretraining and inference cost.
    Abstract We introduce the Falcon series: 7B, 40B, and 180B parameters causal decoder-only models trained on a diverse high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text--the largest openly documented pretraining run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurrently developed models such as LLaMA 2 or Inflection-1. It nears the performance of PaLM-2-Large at a reduced pretraining and inference cost, making it, to our knowledge, one of the three best language models in the world along with GPT-4 and PaLM-2-Large. We report detailed evaluations, as well as a deep dive into the methods and custom tooling employed to pretrain Falcon. Notably, we report on our custom distributed training codebase, allowing us to efficiently pretrain these models on up to 4,096 A100s on cloud AWS infrastructure with limited interconnect. We release a 600B tokens extract of our web dataset, as well as the Falcon-7/40/180B models under a permissive license to foster open-science and accelerate the development of an open ecosystem of large language models.

Edge AI for Internet of Energy: Challenges and Perspectives

  • paper_url: http://arxiv.org/abs/2311.16851
  • repo_url: None
  • paper_authors: Yassine Himeur, Aya Nabil Sayed, Abdullah Alsalemi, Faycal Bensaali, Abbes Amira
  • for: Reviews how edge Artificial Intelligence is reshaping the Internet of Energy (IoE) ecosystem.
  • methods: Following a carefully curated research methodology, the review surveys edge AI techniques tailored for IoE, including on-device computation, secure private inference, and AI training on the edge.
  • results: It catalogs the benefits of edge AI for IoE, from reduced latency and real-time analytics to information security, scalability, and cost-efficiency; analyzes current challenges around security, computation, and standardization; and sketches a future built on 5G networks, federated edge AI, and deep reinforcement learning.
    Abstract The digital landscape of the Internet of Energy (IoE) is on the brink of a revolutionary transformation with the integration of edge Artificial Intelligence (AI). This comprehensive review elucidates the promise and potential that edge AI holds for reshaping the IoE ecosystem. Commencing with a meticulously curated research methodology, the article delves into the myriad of edge AI techniques specifically tailored for IoE. The myriad benefits, spanning from reduced latency and real-time analytics to the pivotal aspects of information security, scalability, and cost-efficiency, underscore the indispensability of edge AI in modern IoE frameworks. As the narrative progresses, readers are acquainted with pragmatic applications and techniques, highlighting on-device computation, secure private inference methods, and the avant-garde paradigms of AI training on the edge. A critical analysis follows, offering a deep dive into the present challenges including security concerns, computational hurdles, and standardization issues. However, as the horizon of technology ever expands, the review culminates in a forward-looking perspective, envisaging the future symbiosis of 5G networks, federated edge AI, deep reinforcement learning, and more, painting a vibrant panorama of what the future beholds. For anyone vested in the domains of IoE and AI, this review offers both a foundation and a visionary lens, bridging the present realities with future possibilities.

Two-step dynamic obstacle avoidance

  • paper_url: http://arxiv.org/abs/2311.16841
  • repo_url: None
  • paper_authors: Fabian Hart, Martin Waltz, Ostap Okhrin
  • for: Addresses dynamic obstacle avoidance (DOA), a fundamental challenge for autonomous vehicles at sea, in the air, or on land.
  • methods: Proposes a two-step architecture combining supervised and reinforcement learning (RL): first, a recurrent neural network trained in a supervised fashion estimates an obstacle's collision risk; second, these risk estimates are included in the RL agent's observation space to increase its situational awareness.
  • results: In a challenging multi-obstacle environment, integrating the collision risk estimates doubles the reward, equivalent to halving the number of collisions, and the improvement holds regardless of the RL algorithm used.
    Abstract Dynamic obstacle avoidance (DOA) is a fundamental challenge for any autonomous vehicle, independent of whether it operates in sea, air, or land. This paper proposes a two-step architecture for handling DOA tasks by combining supervised and reinforcement learning (RL). In the first step, we introduce a data-driven approach to estimate the collision risk of an obstacle using a recurrent neural network, which is trained in a supervised fashion and offers robustness to non-linear obstacle movements. In the second step, we include these collision risk estimates into the observation space of an RL agent to increase its situational awareness. We illustrate the power of our two-step approach by training different RL agents in a challenging environment that requires to navigate amid multiple obstacles. The non-linear movements of obstacles are exemplarily modeled based on stochastic processes and periodic patterns, although our architecture is suitable for any obstacle dynamics. The experiments reveal that integrating our collision risk metrics into the observation space doubles the performance in terms of reward, which is equivalent to halving the number of collisions in the considered environment. Furthermore, we show that the architecture's performance improvement is independent of the applied RL algorithm.
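The two-step architecture can be sketched directly: a recurrent network trained with supervision maps an obstacle's recent track to a collision risk, and the risk estimates are concatenated onto the RL agent's observation. Input shapes and feature choices below are illustrative assumptions, not the paper's specification.

```python
# Hedged sketch of the two-step DOA idea: supervised risk estimation + RL observation.
import torch
import torch.nn as nn

class CollisionRiskNet(nn.Module):
    """Step 1: obstacle track (T steps of x, y, vx, vy) -> collision risk in [0, 1]."""
    def __init__(self, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=4, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, track):                        # track: (batch, T, 4)
        _, h = self.rnn(track)
        return torch.sigmoid(self.head(h[-1]))       # (batch, 1)

def augment_observation(obs, tracks, risk_net):
    """Step 2: append one risk estimate per tracked obstacle to the raw observation."""
    with torch.no_grad():
        risks = torch.cat([risk_net(t.unsqueeze(0)) for t in tracks], dim=1)
    return torch.cat([obs, risks.squeeze(0)], dim=-1)
```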

The Claire French Dialogue Dataset

  • paper_url: http://arxiv.org/abs/2311.16840
  • repo_url: None
  • paper_authors: Julie Hunter, Jérôme Louradour, Virgile Rennard, Ismaïl Harrando, Guokan Shang, Jean-Pierre Lorré
  • for: Introduces the Claire French Dialogue Dataset (CFDD), a resource created to further the development of multilingual, open source language models.
  • methods: Describes the 24 individual corpora that make up CFDD, with links and citations to their original sources; proposes a breakdown of the full dataset into eight categories of subcorpora; and documents the process used to standardize the format of the final dataset.
  • results: The paper concludes with a discussion of similar work and future directions.
    Abstract We present the Claire French Dialogue Dataset (CFDD), a resource created by members of LINAGORA Labs in the context of the OpenLLM France initiative. CFDD is a corpus containing roughly 160 million words from transcripts and stage plays in French that we have assembled and publicly released in an effort to further the development of multilingual, open source language models. This paper describes the 24 individual corpora of which CFDD is composed and provides links and citations to their original sources. It also provides our proposed breakdown of the full CFDD dataset into eight categories of subcorpora and describes the process we followed to standardize the format of the final dataset. We conclude with a discussion of similar work and future directions.

Modular Neural Networks for Time Series Forecasting: Interpretability and Feature Selection using Attention

  • paper_url: http://arxiv.org/abs/2311.16834
  • repo_url: None
  • paper_authors: Qiqi Su, Christos Kloukinas, Artur d’Avila Garcez
  • for: Building interpretable deep learning models for multivariate time series forecasting.
  • methods: Proposes a modular neural network in which a recurrent network learns temporal dependencies, an attention-based component selects the most relevant features and suppresses redundant ones, and a modular deep network is trained on the selected features independently so users can see how features influence outcomes.
  • results: The approach outperforms interpretable baselines such as Neural Additive Models (NAM) and their variants on both regression and classification of time series, with predictive performance comparable to non-interpretable methods such as LSTM and XGBoost, at the cost of longer training time.
    Abstract Multivariate time series have many applications, from healthcare and meteorology to life science. Although deep learning models have shown excellent predictive performance for time series, they have been criticised for being "black-boxes" or non-interpretable. This paper proposes a novel modular neural network model for multivariate time series prediction that is interpretable by construction. A recurrent neural network learns the temporal dependencies in the data while an attention-based feature selection component selects the most relevant features and suppresses redundant features used in the learning of the temporal dependencies. A modular deep network is trained from the selected features independently to show the users how features influence outcomes, making the model interpretable. Experimental results show that this approach can outperform state-of-the-art interpretable Neural Additive Models (NAM) and variations thereof in both regression and classification of time series tasks, achieving a predictive performance that is comparable to the top non-interpretable methods for time series, LSTM and XGBoost.
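A minimal sketch of the modular design: an attention head scores the input features, and an independent small network per feature contributes additively to the forecast, so per-feature effects remain inspectable. The layer sizes and exact gating below are assumptions rather than the paper's specification.

```python
# Hedged sketch: attention-based feature selection over per-feature modules.
import torch
import torch.nn as nn

class ModularForecaster(nn.Module):
    def __init__(self, n_features, window, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden, batch_first=True)   # temporal deps
        self.attn = nn.Linear(hidden, n_features)                  # feature scores
        self.feature_nets = nn.ModuleList(
            nn.Sequential(nn.Linear(window, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(n_features)                             # one module per feature
        )

    def forward(self, x):                            # x: (batch, window, n_features)
        _, (h, _) = self.rnn(x)
        weights = torch.softmax(self.attn(h[-1]), dim=-1)          # (batch, n_features)
        # each module sees only one feature's history, keeping contributions inspectable
        contribs = torch.cat(
            [net(x[:, :, i]) for i, net in enumerate(self.feature_nets)], dim=-1
        )
        return (weights * contribs).sum(-1, keepdim=True), weights
```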

CharacterGLM: Customizing Chinese Conversational AI Characters with Large Language Models

  • paper_url: http://arxiv.org/abs/2311.16832
  • repo_url: None
  • paper_authors: Jinfeng Zhou, Zhuang Chen, Dazhen Wan, Bosi Wen, Yi Song, Jifan Yu, Yongkang Huang, Libiao Peng, Jiaming Yang, Xiyao Xiao, Sahand Sabour, Xiaohan Zhang, Wenjing Hou, Yijia Zhang, Yuxiao Dong, Jie Tang, Minlie Huang
  • for: Presents CharacterGLM, a series of models built upon ChatGLM for generating character-based dialogues (CharacterDial), equipping conversational AI systems with character customization to satisfy people's inherent social desires and emotional needs.
  • methods: On top of CharacterGLM, various AI characters or social agents can be customized by configuring their attributes (identities, interests, viewpoints, experiences, achievements, social relationships, etc.) and behaviors (linguistic features, emotional expressions, interaction patterns, etc.).
  • results: In manual evaluations, the models outperform most mainstream closed-source large language models, including the GPT series, on consistency, human-likeness, and engagement; the 6B version and a subset of training data will be released to support further research on character-based dialogue generation.
    Abstract In this paper, we present CharacterGLM, a series of models built upon ChatGLM, with model sizes ranging from 6B to 66B parameters. Our CharacterGLM is designed for generating Character-based Dialogues (CharacterDial), which aims to equip a conversational AI system with character customization for satisfying people's inherent social desires and emotional needs. On top of CharacterGLM, we can customize various AI characters or social agents by configuring their attributes (identities, interests, viewpoints, experiences, achievements, social relationships, etc.) and behaviors (linguistic features, emotional expressions, interaction patterns, etc.). Our model outperforms most mainstream closed-source large language models, including the GPT series, especially in terms of consistency, human-likeness, and engagement according to manual evaluations. We will release our 6B version of CharacterGLM and a subset of training data to facilitate further research development in the direction of character-based dialogue generation.

A knowledge-driven AutoML architecture

  • paper_url: http://arxiv.org/abs/2311.17124
  • repo_url: None
  • paper_authors: Corneliu Cofaru, Johan Loeckx
  • for: Proposes a knowledge-driven AutoML architecture for pipeline and deep feature synthesis, with the goal of making the AutoML process explainable and leveraging domain knowledge during synthesis.
  • methods: The architecture explores several novel ideas: constructing pipelines and deep features in a unified way, driving synthesis through a shared knowledge system that is interactively queried about which pipeline operations to use or features to compute, and making decisions at runtime from partial solutions and the results of applying them to data.
  • results: Two experiments demonstrate the functionality of a naïve implementation of the proposed architecture and discuss its advantages, trade-offs, and future potential for AutoML.
    Abstract This paper proposes a knowledge-driven AutoML architecture for pipeline and deep feature synthesis. The main goal is to render the AutoML process explainable and to leverage domain knowledge in the synthesis of pipelines and features. The architecture explores several novel ideas: first, the construction of pipelines and deep features is approached in an unified way. Next, synthesis is driven by a shared knowledge system, interactively queried as to what pipeline operations to use or features to compute. Lastly, the synthesis processes takes decisions at runtime using partial solutions and results of their application on data. Two experiments are conducted to demonstrate the functionality of a naïve implementation of the proposed architecture and to discuss its advantages, trade-offs as well as future potential for AutoML.

Agent-Aware Training for Agent-Agnostic Action Advising in Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.16807
  • repo_url: None
  • paper_authors: Yaoquan Wei, Shunyu Liu, Jie Song, Tongya Zheng, Kaixuan Chen, Yong Wang, Mingli Song
  • for: Improving sample efficiency in deep reinforcement learning (DRL) through action advising from expert teachers.
  • methods: Proposes Agent-Aware trAining yet Agent-Agnostic Action Advising (A7), which solicits advice based on state-feature similarity measured by a proxy model (rather than by the error-prone learning agent or an agent-agnostic advisor), reuses advice via behavior cloning, and adds an intrinsic reward on advised samples to encourage the use of expert guidance.
  • results: On GridWorld, LunarLander, and six prominent Atari scenarios, A7 significantly accelerates learning and surpasses existing agent-specific and agent-agnostic methods by a substantial margin.
    Abstract Action advising endeavors to leverage supplementary guidance from expert teachers to alleviate the issue of sampling inefficiency in Deep Reinforcement Learning (DRL). Previous agent-specific action advising methods are hindered by imperfections in the agent itself, while agent-agnostic approaches exhibit limited adaptability to the learning agent. In this study, we propose a novel framework called Agent-Aware trAining yet Agent-Agnostic Action Advising (A7) to strike a balance between the two. The underlying concept of A7 revolves around utilizing the similarity of state features as an indicator for soliciting advice. However, unlike prior methodologies, the measurement of state feature similarity is performed by neither the error-prone learning agent nor the agent-agnostic advisor. Instead, we employ a proxy model to extract state features that are both discriminative (adaptive to the agent) and generally applicable (robust to agent noise). Furthermore, we utilize behavior cloning to train a model for reusing advice and introduce an intrinsic reward for the advised samples to incentivize the utilization of expert guidance. Experiments are conducted on the GridWorld, LunarLander, and six prominent scenarios from Atari games. The results demonstrate that A7 significantly accelerates the learning process and surpasses existing methods (both agent-specific and agent-agnostic) by a substantial margin. Our code will be made publicly available.
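The advising loop can be sketched as follows, with a frozen proxy encoder providing agent-agnostic state features: if the current state is sufficiently similar to a previously advised one, behavior-cloned advice is reused; otherwise the teacher is queried while budget remains. All names (`teacher.budget`, `bc_policy`) and the cosine threshold are illustrative assumptions, and the paper's intrinsic reward is omitted.

```python
# Hedged sketch of similarity-triggered action advising (A7-style loop).
import torch
import torch.nn.functional as F

def advised_action(state, proxy, teacher, student, bc_policy, memory, tau=0.9):
    z = proxy(state)                                   # agent-agnostic state features
    if memory:
        keys = torch.stack([k for k, _ in memory])
        sims = F.cosine_similarity(z.unsqueeze(0), keys)
        if sims.max() >= tau:                          # close to an advised state
            return bc_policy(state)                    # reuse cloned advice
    if teacher.budget > 0:                             # novel state: ask the teacher
        action = teacher.advise(state)
        memory.append((z.detach(), action))
        teacher.budget -= 1
        return action
    return student(state)                              # fall back to own policy
```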

ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis

  • paper_url: http://arxiv.org/abs/2311.17123
  • repo_url: None
  • paper_authors: Xiangjun Gao, Xiaoyu Li, Chaopeng Zhang, Qi Zhang, Yanpei Cao, Ying Shan, Long Quan
  • for: Free-view 3D human rendering from a single image.
  • methods: Introduces a texture-consistent back view synthesis module that transfers the reference image content to the back view through depth- and text-guided attention injection, plus a visibility-aware patch consistency regularization for texture mapping and refinement that alleviates color distortion in side regions.
  • results: Achieves high-fidelity, texture-consistent human rendering from a single image, outperforming previous baseline methods on both real and synthetic data.
    Abstract In this work, we propose a method to address the challenge of rendering a 3D human from a single image in a free-view manner. Some existing approaches could achieve this by using generalizable pixel-aligned implicit fields to reconstruct a textured mesh of a human or by employing a 2D diffusion model as guidance with the Score Distillation Sampling (SDS) method, to lift the 2D image into 3D space. However, a generalizable implicit field often results in an over-smooth texture field, while the SDS method tends to lead to a texture-inconsistent novel view with the input image. In this paper, we introduce a texture-consistent back view synthesis module that could transfer the reference image content to the back view through depth and text-guided attention injection. Moreover, to alleviate the color distortion that occurs in the side region, we propose a visibility-aware patch consistency regularization for texture mapping and refinement combined with the synthesized back view texture. With the above techniques, we could achieve high-fidelity and texture-consistent human rendering from a single image. Experiments conducted on both real and synthetic data demonstrate the effectiveness of our method and show that our approach outperforms previous baseline methods.

A Survey of the Evolution of Language Model-Based Dialogue Systems

  • paper_url: http://arxiv.org/abs/2311.16789
  • repo_url: https://github.com/ruleGreen/Survey-Evolution-DS
  • paper_authors: Hongru Wang, Lingzhi Wang, Yiming Du, Liang Chen, Jingyan Zhou, Yufei Wang, Kam-Fai Wong
  • for: This paper provides a comprehensive review of the evolution of dialogue systems, specifically highlighting the significant transformations and advancements in language models.
  • methods: The paper categorizes the evolution of dialogue systems into four distinct stages, each marked by pivotal breakthroughs in language models, including statistical language models, neural language models, pre-trained language models, and current language model-based dialogue systems.
  • results: The paper offers a chronological perspective on the development of dialogue systems, providing a comprehensive review of state-of-the-art research outcomes, and discusses emerging topics and open challenges in the field, guiding future developments in language model-based dialogue systems.
    Abstract Dialogue systems, including task-oriented dialogue systems (TOD) and open-domain dialogue systems (ODD), have undergone significant transformations, with language models (LM) playing a central role. This survey delves into the historical trajectory of dialogue systems, elucidating their intricate relationship with advancements in language models by categorizing this evolution into four distinct stages, each marked by pivotal LM breakthroughs: 1) Early stage: characterized by statistical LMs, resulting in rule-based or machine-learning-driven dialogue systems; 2) Independent development of TOD and ODD based on neural language models (NLM; e.g., LSTM and GRU), since NLMs lack intrinsic knowledge in their parameters; 3) Fusion between different types of dialogue systems with the advent of pre-trained language models (PLMs), starting from the fusion between the four sub-tasks within TOD, and then TOD with ODD; and 4) current LLM-based dialogue systems, wherein LLMs can be used to conduct TOD and ODD seamlessly. Thus, our survey provides a chronological perspective aligned with LM breakthroughs, offering a comprehensive review of state-of-the-art research outcomes. What's more, we focus on emerging topics and discuss open challenges, providing valuable insights into future directions for LLM-based dialogue systems. Through this exploration, we pave the way for a deeper comprehension of the evolution, guiding future developments in LM-based dialogue systems.

The curse of language biases in remote sensing VQA: the role of spatial attributes, language diversity, and the need for clear evaluation

  • paper_url: http://arxiv.org/abs/2311.16782
  • repo_url: None
  • paper_authors: Christel Chappuis, Eliot Walt, Vincent Mendez, Sylvain Lobry, Bertrand Le Saux, Devis Tuia
  • for: Examines the problem of language biases in remote sensing visual question answering (RSVQA), which enables human-machine interaction with overhead imagery through natural language.
  • methods: A threefold analysis strategy covering both models and data: visual blind models, adversarial testing, and dataset analysis, together with more informative, complementary evaluation metrics that are sensitive to the issue.
  • results: Language biases in remote sensing prove more severe than in standard VQA, likely due to geographical similarities and sparsity in existing datasets as well as simpler vocabularies and question generation strategies; new, less-biased datasets and more informed relative metrics are needed.
    Abstract Remote sensing visual question answering (RSVQA) opens new opportunities for the use of overhead imagery by the general public, by enabling human-machine interaction with natural language. Building on the recent advances in natural language processing and computer vision, the goal of RSVQA is to answer a question formulated in natural language about a remote sensing image. Language understanding is essential to the success of the task, but has not yet been thoroughly examined in RSVQA. In particular, the problem of language biases is often overlooked in the remote sensing community, which can impact model robustness and lead to wrong conclusions about the performances of the model. Thus, the present work aims at highlighting the problem of language biases in RSVQA with a threefold analysis strategy: visual blind models, adversarial testing and dataset analysis. This analysis focuses both on model and data. Moreover, we motivate the use of more informative and complementary evaluation metrics sensitive to the issue. The gravity of language biases in RSVQA is then exposed for all of these methods with the training of models discarding the image data and the manipulation of the visual input during inference. Finally, a detailed analysis of question-answer distribution demonstrates the root of the problem in the data itself. Thanks to this analytical study, we observed that biases in remote sensing are more severe than in standard VQA, likely due to the specifics of existing remote sensing datasets for the task, e.g. geographical similarities and sparsity, as well as a simpler vocabulary and question generation strategies. While new, improved and less-biased datasets appear as a necessity for the development of the promising field of RSVQA, we demonstrate that more informed, relative evaluation metrics remain much needed to transparently communicate results of future RSVQA methods.
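The paper's visual blind models have a simple counterpart that anyone can run on an RSVQA benchmark: train a classifier on the question text alone and compare it against the full model. A minimal sketch, with dataset fields assumed:

```python
# Hedged sketch of a "visual blind" language-bias probe: questions only, no image.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def blind_baseline(train_questions, train_answers, test_questions, test_answers):
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(train_questions, train_answers)
    return clf.score(test_questions, test_answers)  # high accuracy => language bias
```

Numbers from this baseline that approach the full RSVQA model's accuracy indicate that reported performance reflects language priors rather than visual grounding.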

Generation of Games for Opponent Model Differentiation

  • paper_url: http://arxiv.org/abs/2311.16781
  • repo_url: None
  • paper_authors: David Milec, Viliam Lisý, Christopher Kiekintveld
  • for: Protecting against adversarial attacks, a common multiagent problem in which real-world attackers are predominantly human and protection methods incorporate opponent models to improve performance against humans.
  • methods: Uses data gathered by psychologists identifying personality types that increase the likelihood of malicious acts, and creates a novel model that links its parameters to psychological traits; the model is optimized over parametrized games to create games in which the differences between models are profound.
  • results: The approach can support automatic game generation when a game is needed in which certain models behave differently, and can identify situations in which the models do not align.
    Abstract Protecting against adversarial attacks is a common multiagent problem. Attackers in the real world are predominantly human actors, and the protection methods often incorporate opponent models to improve the performance when facing humans. Previous results show that modeling human behavior can significantly improve the performance of the algorithms. However, modeling humans correctly is a complex problem, and the models are often simplified and assume humans make mistakes according to some distribution or train parameters for the whole population from which they sample. In this work, we use data gathered by psychologists who identified personality types that increase the likelihood of performing malicious acts. However, in the previous work, the tests on a handmade game could not show strategic differences between the models. We created a novel model that links its parameters to psychological traits. We optimized over parametrized games and created games in which the differences are profound. Our work can help with automatic game generation when we need a game in which some models will behave differently and to identify situations in which the models do not align.

Equilibrium in the Computing Continuum through Active Inference

  • paper_url: http://arxiv.org/abs/2311.16769
  • repo_url: None
  • paper_authors: Boris Sedlak, Victor Casamayor Pujol, Praveen Kumar Donta, Schahram Dustdar
  • for: Helping Computing Continuum (CC) systems ensure the intricate requirements, expressed as Service Level Objectives (SLOs), of each computational tier.
  • methods: A collaborative edge intelligence framework in which individual edge devices (1) develop a causal understanding of how to enforce their SLOs and (2) transfer knowledge to speed up the onboarding of heterogeneous devices, while collaboration (3) increases the scope of SLO fulfillment.
  • results: In a video-streaming use case, edge devices required only ten training rounds to ensure four SLOs, with rationally explainable underlying causal structures; new device types could reuse existing models a posteriori, and rebalancing load within a device cluster recovered SLO compliance after a network failure from 22% to 89%.
    Abstract Computing Continuum (CC) systems are challenged to ensure the intricate requirements of each computational tier. Given the system's scale, the Service Level Objectives (SLOs) which are expressed as these requirements, must be broken down into smaller parts that can be decentralized. We present our framework for collaborative edge intelligence enabling individual edge devices to (1) develop a causal understanding of how to enforce their SLOs, and (2) transfer knowledge to speed up the onboarding of heterogeneous devices. Through collaboration, they (3) increase the scope of SLO fulfillment. We implemented the framework and evaluated a use case in which a CC system is responsible for ensuring Quality of Service (QoS) and Quality of Experience (QoE) during video streaming. Our results showed that edge devices required only ten training rounds to ensure four SLOs; furthermore, the underlying causal structures were also rationally explainable. The addition of new types of devices can be done a posteriori, the framework allowed them to reuse existing models, even though the device type had been unknown. Finally, rebalancing the load within a device cluster allowed individual edge devices to recover their SLO compliance after a network failure from 22% to 89%.

Towards Full-scene Domain Generalization in Multi-agent Collaborative Bird’s Eye View Segmentation for Connected and Autonomous Driving

  • paper_url: http://arxiv.org/abs/2311.16754
  • repo_url: None
  • paper_authors: Senkang Hu, Zhengru Fang, Xianhao Chen, Yuguang Fang, Sam Kwong
  • for: Improving the quality and robustness of collaborative perception for connected and autonomous driving under domain shift.
  • methods: A unified domain generalization framework comprising an Amplitude Augmentation (AmpAug) method, a meta-consistency training scheme, and an intra-system domain alignment mechanism.
  • results: Comprehensive experiments show the method outperforms existing state-of-the-art approaches.
    Abstract Collaborative perception has recently gained significant attention in autonomous driving, improving perception quality by enabling the exchange of additional information among vehicles. However, deploying collaborative perception systems can lead to domain shifts due to diverse environmental conditions and data heterogeneity among connected and autonomous vehicles (CAVs). To address these challenges, we propose a unified domain generalization framework applicable in both training and inference stages of collaborative perception. In the training phase, we introduce an Amplitude Augmentation (AmpAug) method to augment low-frequency image variations, broadening the model's ability to learn across various domains. We also employ a meta-consistency training scheme to simulate domain shifts, optimizing the model with a carefully designed consistency loss to encourage domain-invariant representations. In the inference phase, we introduce an intra-system domain alignment mechanism to reduce or potentially eliminate the domain discrepancy among CAVs prior to inference. Comprehensive experiments substantiate the effectiveness of our method in comparison with the existing state-of-the-art works. Code will be released at https://github.com/DG-CAVs/DG-CoPerception.git.
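Amplitude augmentation follows the general Fourier amplitude-mixing recipe: blend the low-frequency amplitude spectrum of an image with that of a reference while keeping the phase, which perturbs appearance without destroying content. The band ratio and blending factor below are assumptions, not the paper's exact AmpAug settings.

```python
# Hedged sketch of low-frequency amplitude augmentation in the Fourier domain.
import numpy as np

def amp_aug(img, ref, beta=0.1, lam=0.5):
    """img, ref: (H, W, C) float arrays; beta: low-frequency band ratio."""
    out = np.empty_like(img)
    for c in range(img.shape[-1]):
        f_img = np.fft.fftshift(np.fft.fft2(img[..., c]))
        f_ref = np.fft.fftshift(np.fft.fft2(ref[..., c]))
        amp, pha = np.abs(f_img), np.angle(f_img)
        h, w = amp.shape
        bh, bw = int(h * beta / 2), int(w * beta / 2)
        cy, cx = h // 2, w // 2
        # blend only the centered low-frequency amplitudes (style, not content)
        amp[cy - bh:cy + bh, cx - bw:cx + bw] = (
            lam * np.abs(f_ref)[cy - bh:cy + bh, cx - bw:cx + bw]
            + (1 - lam) * amp[cy - bh:cy + bh, cx - bw:cx + bw]
        )
        out[..., c] = np.fft.ifft2(np.fft.ifftshift(amp * np.exp(1j * pha))).real
    return out
```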

LLMs for Science: Usage for Code Generation and Data Analysis

  • paper_url: http://arxiv.org/abs/2311.16733
  • repo_url: https://github.com/luuca78/llms4science
  • paper_authors: Mohamed Nejjar, Luca Zacharias, Fabian Stiehle, Ingo Weber
  • for: Studying the use of large language models (LLMs) in the scientific research process.
  • methods: Investigates a set of use cases for LLM-based tools in scientific research, focusing on software engineering tasks such as generating application code and developing scripts for data analytics, and conducts a first study of how helpful current tools are.
  • results: The results highlight the promise of LLM-based tools, but outcomes differ significantly across tools even for seemingly simple use cases, with notable issues regarding the integrity of the output.
    Abstract Large language models (LLMs) have been touted to enable increased productivity in many areas of today's work life. Scientific research as an area of work is no exception: the potential of LLM-based tools to assist in the daily work of scientists has become a highly discussed topic across disciplines. However, we are only at the very onset of this subject of study. It is still unclear how the potential of LLMs will materialise in research practice. With this study, we give first empirical evidence on the use of LLMs in the research process. We have investigated a set of use cases for LLM-based tools in scientific research, and conducted a first study to assess to which degree current tools are helpful. In this paper we report specifically on use cases related to software engineering, such as generating application code and developing scripts for data analytics. While we studied seemingly simple use cases, results across tools differ significantly. Our results highlight the promise of LLM-based tools in general, yet we also observe various issues, particularly regarding the integrity of the output these tools provide.

Graph Pre-training and Prompt Learning for Recommendation

  • paper_url: http://arxiv.org/abs/2311.16716
  • repo_url: None
  • paper_authors: Yuhao Yang, Lianghao Xia, Da Luo, Kangyi Lin, Chao Huang
  • for: Improving the scalability and performance of GNN-based recommenders by helping them adapt to evolving user preferences and distribution shifts in newly arriving data.
  • methods: GraphPL combines parameter-efficient, dynamic graph pre-training with prompt learning: a temporal prompt mechanism encodes time information on user-item interactions, and a graph-structural prompt learning mechanism transfers pre-trained knowledge to behavior dynamics without continuous incremental training.
  • results: Extensive experiments, including a large-scale industrial deployment, show GraphPL works as a lightweight plug-in to various state-of-the-art recommenders, with advantages in effectiveness, robustness, and efficiency under a dynamic evaluation setting.
    Abstract GNN-based recommenders have excelled in modeling intricate user-item interactions through multi-hop message passing. However, existing methods often overlook the dynamic nature of evolving user-item interactions, which impedes the adaption to changing user preferences and distribution shifts in newly arriving data. Thus, their scalability and performances in real-world dynamic environments are limited. In this study, we propose GraphPL, a framework that incorporates parameter-efficient and dynamic graph pre-training with prompt learning. This novel combination empowers GNNs to effectively capture both long-term user preferences and short-term behavior dynamics, enabling the delivery of accurate and timely recommendations. Our GraphPL framework addresses the challenge of evolving user preferences by seamlessly integrating a temporal prompt mechanism and a graph-structural prompt learning mechanism into the pre-trained GNN model. The temporal prompt mechanism encodes time information on user-item interaction, allowing the model to naturally capture temporal context, while the graph-structural prompt learning mechanism enables the transfer of pre-trained knowledge to adapt to behavior dynamics without the need for continuous incremental training. We further bring in a dynamic evaluation setting for recommendation to mimic real-world dynamic scenarios and bridge the offline-online gap to a better level. Our extensive experiments including a large-scale industrial deployment showcases the lightweight plug-in scalability of our GraphPL when integrated with various state-of-the-art recommenders, emphasizing the advantages of GraphPL in terms of effectiveness, robustness and efficiency.

LEDITS++: Limitless Image Editing using Text-to-Image Models

  • paper_url: http://arxiv.org/abs/2311.16711
  • repo_url: None
  • paper_authors: Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, Apolinário Passos
  • for: Proposes an efficient yet versatile and precise technique for textual image manipulation, addressing the inefficiency, imprecision, and limited versatility of existing image-to-image methods.
  • methods: A novel inversion approach that requires no tuning or optimization and produces high-fidelity results in a few diffusion steps; the method supports multiple simultaneous edits, is architecture-agnostic, and uses a novel implicit masking technique that limits changes to relevant image regions.
  • results: On the newly proposed TEdBench++ benchmark and in exhaustive evaluation, LEDITS++ demonstrates its capabilities and improvements over previous methods; see https://leditsplusplus-project.static.hf.space.
    Abstract Text-to-image diffusion models have recently received increasing interest for their astonishing ability to produce high-fidelity images from solely text inputs. Subsequent research efforts aim to exploit and apply their capabilities to real image editing. However, existing image-to-image methods are often inefficient, imprecise, and of limited versatility. They either require time-consuming fine-tuning, deviate unnecessarily strongly from the input image, and/or lack support for multiple, simultaneous edits. To address these issues, we introduce LEDITS++, an efficient yet versatile and precise textual image manipulation technique. LEDITS++'s novel inversion approach requires no tuning nor optimization and produces high-fidelity results with a few diffusion steps. Second, our methodology supports multiple simultaneous edits and is architecture-agnostic. Third, we use a novel implicit masking technique that limits changes to relevant image regions. We propose the novel TEdBench++ benchmark as part of our exhaustive evaluation. Our results demonstrate the capabilities of LEDITS++ and its improvements over previous methods. The project page is available at https://leditsplusplus-project.static.hf.space .

Rethinking Intermediate Layers design in Knowledge Distillation for Kidney and Liver Tumor Segmentation

  • paper_url: http://arxiv.org/abs/2311.16700
  • repo_url: None
  • paper_authors: Vandan Gorade, Sparsh Mittal, Debesh Jha, Ulas Bagci
  • for: Adapting knowledge distillation (KD) to medical imaging tasks such as kidney and liver tumor segmentation, where existing KD methods are not specifically tailored and risk accumulating training bias in shallower student layers.
  • methods: Hierarchical Layer-selective Feedback Distillation (HLFD), which distills knowledge from a combination of middle layers to earlier layers and transfers final-layer knowledge to intermediate layers, at both the feature and pixel levels, so the student learns higher-quality representations from its earlier layers.
  • results: HLFD outperforms existing methods by a significant margin; in kidney segmentation it surpasses the student model (without KD) by over 10pp, and qualitatively the student suppresses irrelevant information and focuses sharply on tumor-specific details.
    Abstract Knowledge distillation(KD) has demonstrated remarkable success across various domains, but its application to medical imaging tasks, such as kidney and liver tumor segmentation, has encountered challenges. Many existing KD methods are not specifically tailored for these tasks. Moreover, prevalent KD methods often lack a careful consideration of what and from where to distill knowledge from the teacher to the student. This oversight may lead to issues like the accumulation of training bias within shallower student layers, potentially compromising the effectiveness of KD. To address these challenges, we propose Hierarchical Layer-selective Feedback Distillation (HLFD). HLFD strategically distills knowledge from a combination of middle layers to earlier layers and transfers final layer knowledge to intermediate layers at both the feature and pixel levels. This design allows the model to learn higher-quality representations from earlier layers, resulting in a robust and compact student model. Extensive quantitative evaluations reveal that HLFD outperforms existing methods by a significant margin. For example, in the kidney segmentation task, HLFD surpasses the student model (without KD) by over 10pp, significantly improving its focus on tumor-specific features. From a qualitative standpoint, the student model trained using HLFD excels at suppressing irrelevant information and can focus sharply on tumor-specific details, which opens a new pathway for more efficient and accurate diagnostic tools.
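The layer-selective feedback can be expressed as a feature-level distillation loss in which deeper teacher layers supervise earlier student layers. The sketch below is one plausible reading at the feature level only (the paper also distills at the pixel level); the layer pairing and 1x1 projections are assumptions.

```python
# Hedged sketch of a layer-selective feedback distillation loss.
import torch.nn.functional as F

def hlfd_loss(student_feats, teacher_feats, projs,
              pairs=((0, 1), (1, 2), (1, -1), (2, -1))):
    """student_feats/teacher_feats: lists of (B, C, H, W) maps, shallow -> deep.
    projs: dict mapping (student_idx, teacher_idx) -> 1x1 conv for channel match."""
    loss = 0.0
    for s_idx, t_idx in pairs:                       # deeper teacher -> earlier student
        t = teacher_feats[t_idx].detach()
        s = projs[(s_idx, t_idx)](student_feats[s_idx])
        s = F.adaptive_avg_pool2d(s, t.shape[-2:])   # match spatial resolution
        loss = loss + F.mse_loss(s, t)
    return loss
```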

XAI for time-series classification leveraging image highlight methods

  • paper_url: http://arxiv.org/abs/2311.17110
  • repo_url: None
  • paper_authors: Georgios Makridis, Georgios Fatouros, Vasileios Koukos, Dimitrios Kotios, Dimosthenis Kyriazis, Ioannis Soldatos
  • for: Bringing explainability to deep neural networks (DNNs) for time-series classification.
  • methods: A teacher-student (distillation) architecture in which time series are transformed into 2D plots, so that image highlight methods such as LIME and GradCam can make the predictions interpretable.
  • results: The approach offers accuracy competitive with the baseline model, at the trade-off of increased training time.
    Abstract Although much work has been done on explainability in the computer vision and natural language processing (NLP) fields, there is still much work to be done to explain methods applied to time series, which by nature cannot be understood at first sight. In this paper, we present a Deep Neural Network (DNN) in a teacher-student architecture (distillation model) that offers interpretability in time-series classification tasks. The explainability of our approach is based on transforming the time series into 2D plots and applying image highlight methods (such as LIME and GradCam), making the predictions interpretable. At the same time, the proposed approach offers accuracy competitive with the baseline model, with the trade-off of increased training time.
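A sketch of the pipeline at inference time: render the series as a 2D plot, classify it with a CNN, and compute a Grad-CAM heatmap over the plot. The rendering assumes matplotlib's Agg backend, and the CNN and target layer are placeholders; the paper's teacher-student training is omitted.

```python
# Hedged sketch: time series -> 2D plot image -> Grad-CAM over the plot.
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import torch

def series_to_image(series):
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)
    ax.plot(series)
    ax.axis("off")
    fig.canvas.draw()
    img = np.asarray(fig.canvas.buffer_rgba())[..., :3] / 255.0   # (H, W, 3)
    plt.close(fig)
    return torch.tensor(img, dtype=torch.float32).permute(2, 0, 1).unsqueeze(0)

def grad_cam(model, layer, image, target_class):
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    model.zero_grad()
    model(image)[0, target_class].backward()
    h1.remove(); h2.remove()
    w = grads["g"].mean(dim=(2, 3), keepdim=True)                 # channel importance
    cam = torch.relu((w * acts["a"]).sum(dim=1)).squeeze(0)       # (h, w) heatmap
    return cam / (cam.max() + 1e-8)
```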

Hyper-Relational Knowledge Graph Neural Network for Next POI

  • paper_url: http://arxiv.org/abs/2311.16683
  • repo_url: None
  • paper_authors: Jixiao Zhang, Yongkang Li, Ruotong Zou, Jingyuan Zhang, Zipei Fan, Xuan Song
  • for: Improving Point of Interest (POI) recommendation in Location-based Social Networks (LBSN) by using knowledge graphs to alleviate the data sparsity issue.
  • methods: A Hyper-Relational Knowledge Graph Neural Network (HKGNN): a hyper-relational knowledge graph (HKG) preserves rich semantics such as the 3-ary user-POI-time mobility relation, a hypergraph neural network exploits its structural information cohesively, a self-attention network leverages sequential information for personalized recommendation, and the datasets are extended with available side information to further reduce sparsity.
  • results: Experiments on four real-world LBSN datasets demonstrate the effectiveness of the approach compared to existing state-of-the-art methods.
    Abstract With the advancement of mobile technology, Point of Interest (POI) recommendation systems in Location-based Social Networks (LBSN) have brought numerous benefits to both users and companies. Many existing works employ Knowledge Graph (KG) to alleviate the data sparsity issue in LBSN. These approaches primarily focus on modeling the pair-wise relations in LBSN to enrich the semantics and thereby relieve the data sparsity issue. However, existing approaches seldom consider the hyper-relations in LBSN, such as the mobility relation (a 3-ary relation: user-POI-time). This makes the model hard to exploit the semantics accurately. In addition, prior works overlook the rich structural information inherent in KG, which consists of higher-order relations and can further alleviate the impact of data sparsity. To this end, we propose a Hyper-Relational Knowledge Graph Neural Network (HKGNN) model. In HKGNN, a Hyper-Relational Knowledge Graph (HKG) that models the LBSN data is constructed to maintain and exploit the rich semantics of hyper-relations. Then we proposed a Hypergraph Neural Network to utilize the structural information of HKG in a cohesive way. In addition, a self-attention network is used to leverage sequential information and make personalized recommendations. Furthermore, side information, essential in reducing data sparsity by providing background knowledge of POIs, is not fully utilized in current methods. In light of this, we extended the current dataset with available side information to further lessen the impact of data sparsity. Results of experiments on four real-world LBSN datasets demonstrate the effectiveness of our approach compared to existing state-of-the-art methods.
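A minimal, hypothetical sketch of hypergraph message passing over the 3-ary mobility relation (user, POI, time) that HKGNN models: each check-in becomes one hyperedge joining three nodes, and messages flow node → hyperedge → node. This is a generic hypergraph convolution, not the authors' exact architecture.

```python
import torch

num_nodes, dim = 6, 8                          # users/POIs/time slots share one node space
checkins = [(0, 3, 5), (1, 3, 4), (0, 2, 4)]   # (user, poi, time) hyperedges

# Incidence matrix H: nodes x hyperedges
H = torch.zeros(num_nodes, len(checkins))
for e, nodes in enumerate(checkins):
    H[list(nodes), e] = 1.0

X = torch.randn(num_nodes, dim)                # node embeddings
W = torch.nn.Linear(dim, dim)

# Degree-normalized two-step aggregation: nodes -> hyperedges -> nodes
Dv = H.sum(1, keepdim=True).clamp(min=1)       # node degrees
De = H.sum(0, keepdim=True).clamp(min=1)       # hyperedge sizes
edge_msgs = (H / De).T @ X                     # mean of member nodes per check-in
X_new = torch.relu(W((H / Dv) @ edge_msgs))
print(X_new.shape)                             # (6, 8) embeddings carrying 3-ary context
```

Because a hyperedge binds user, POI, and time in one relation, the update mixes all three, which pair-wise KG edges cannot express.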

Understanding the (Extra-)Ordinary: Validating Deep Model Decisions with Prototypical Concept-based Explanations

  • paper_url: http://arxiv.org/abs/2311.16681
  • repo_url: None
  • paper_authors: Maximilian Dreyer, Reduan Achtibat, Wojciech Samek, Sebastian Lapuschkin
  • for: To propose a novel post-hoc concept-based XAI framework for better understanding the decision-making processes of deep neural networks (DNNs).
  • methods: Combines local (instance-wise) and global (class-wise) decision-making strategies in one XAI framework, describing the model's decision process through prototypes; deviation from prototypical behavior is quantified to associate predictions with model sub-strategies and to detect outliers (a minimal deviation sketch follows this entry).
  • results: Demonstrates the framework's effectiveness on three datasets (ImageNet, CUB-200, and CIFAR-10) with VGG, ResNet, and EfficientNet architectures, detecting out-of-distribution samples, spurious model behavior, and data quality issues.
    Abstract Ensuring both transparency and safety is critical when deploying Deep Neural Networks (DNNs) in high-risk applications, such as medicine. The field of explainable AI (XAI) has proposed various methods to comprehend the decision-making processes of opaque DNNs. However, only a few XAI methods are suitable for ensuring safety in practice as they heavily rely on repeated labor-intensive and possibly biased human assessment. In this work, we present a novel post-hoc concept-based XAI framework that conveys besides instance-wise (local) also class-wise (global) decision-making strategies via prototypes. What sets our approach apart is the combination of local and global strategies, enabling a clearer understanding of the (dis-)similarities in model decisions compared to the expected (prototypical) concept use, ultimately reducing the dependence on human long-term assessment. Quantifying the deviation from prototypical behavior not only allows to associate predictions with specific model sub-strategies but also to detect outlier behavior. As such, our approach constitutes an intuitive and explainable tool for model validation. We demonstrate the effectiveness of our approach in identifying out-of-distribution samples, spurious model behavior and data quality issues across three datasets (ImageNet, CUB-200, and CIFAR-10) utilizing VGG, ResNet, and EfficientNet architectures. Code is available on https://github.com/maxdreyer/pcx.
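A minimal sketch of the prototype-deviation idea: treat per-instance concept relevance vectors as features, fit a per-class Gaussian "prototype", and score how far a new instance deviates from prototypical concept use. The relevance vectors below are random stand-ins; in the paper they come from concept attribution on a DNN.

```python
import numpy as np

rng = np.random.default_rng(0)
relevances = rng.normal(size=(200, 16))          # 200 training instances, 16 concepts
mu = relevances.mean(0)                          # class prototype
cov = np.cov(relevances, rowvar=False) + 1e-6 * np.eye(16)
cov_inv = np.linalg.inv(cov)

def deviation(r):
    """Mahalanobis distance from the prototypical concept profile."""
    d = r - mu
    return float(np.sqrt(d @ cov_inv @ d))

typical = rng.normal(size=16)
outlier = typical + 8.0                          # shifted concept usage
print(f"typical: {deviation(typical):.2f}  outlier: {deviation(outlier):.2f}")
# Large deviations flag out-of-distribution inputs or spurious decision strategies.
```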

ROSO: Improving Robotic Policy Inference via Synthetic Observations

  • paper_url: http://arxiv.org/abs/2311.16680
  • repo_url: https://github.com/Yusuke710/ROSO
  • paper_authors: Yusuke Miyashita, Dimitris Gahtidis, Colin La, Jeremy Rabinowicz, Jurgen Leitner
  • for: Improving the zero-shot performance of pre-trained robotic policies.
  • methods: Uses generative AI to alter observations at inference time so that novel objects and environments fit within the distribution of observations the policy was pre-trained on (a hedged sketch follows this entry).
  • results: Experiments show that integrating generative AI into robotic inference makes the policy more adaptable and raises the success rate, completing up to 57% of tasks that otherwise fail under the pre-trained policy.
    Abstract In this paper, we propose the use of generative artificial intelligence (AI) to improve zero-shot performance of a pre-trained policy by altering observations during inference. Modern robotic systems, powered by advanced neural networks, have demonstrated remarkable capabilities on pre-trained tasks. However, generalizing and adapting to new objects and environments is challenging, and fine-tuning visuomotor policies is time-consuming. To overcome these issues we propose Robotic Policy Inference via Synthetic Observations (ROSO). ROSO uses stable diffusion to pre-process a robot's observation of novel objects during inference time to fit within its distribution of observations of the pre-trained policies. This novel paradigm allows us to transfer learned knowledge from known tasks to previously unseen scenarios, enhancing the robot's adaptability without requiring lengthy fine-tuning. Our experiments show that incorporating generative AI into robotic inference significantly improves successful outcomes, finishing up to 57% of tasks otherwise unsuccessful with the pre-trained policy.
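A hedged sketch of ROSO-style observation preprocessing using Stable Diffusion image-to-image editing via the `diffusers` library. The model id, input file, prompts, and strength are illustrative assumptions, not the paper's values; the low strength is meant to keep scene geometry (important for the policy) while swapping object semantics.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical camera observation containing a novel object (e.g. a blue mug)
observation = Image.open("camera_obs.png").convert("RGB").resize((512, 512))

# Rewrite the novel object toward one the policy saw during training
synthetic_obs = pipe(
    prompt="a red block on a table, robot workspace",
    image=observation,
    strength=0.4,            # small edits preserve layout and geometry
    guidance_scale=7.5,
).images[0]

synthetic_obs.save("synthetic_obs.png")  # fed to the frozen pre-trained policy
```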

Large Language Models Meet Computer Vision: A Brief Survey

  • paper_url: http://arxiv.org/abs/2311.16673
  • repo_url: None
  • paper_authors: Raby Hamadi
  • for: 本文旨在探讨最新的 transformer 和其后继者在计算机视觉领域的发展,以及这些模型在自然语言处理和计算机视觉之间的交互。
  • methods: 本文使用了许多现代的 paid 和 open-source LLMs,并进行了比较分析,以探讨这些模型在不同任务上的表现。 此外,文章还收集了一些用于训练 LLMs 的数据集,为读者提供了不同数据的应用和演示。
  • results: 本文的研究发现, transformer 和其后继者在计算机视觉领域的应用具有潜在的潜力,可以提高模型的性能和可靠性。 此外,文章还提出了一些未来研究的方向,如如何更好地挖掘数据集,以及如何将 LLMs 应用于更多的计算机视觉任务。
    Abstract Recently, the intersection of Large Language Models (LLMs) and Computer Vision (CV) has emerged as a pivotal area of research, driving significant advancements in the field of Artificial Intelligence (AI). As transformers have become the backbone of many state-of-the-art models in both Natural Language Processing (NLP) and CV, understanding their evolution and potential enhancements is crucial. This survey paper delves into the latest progressions in the domain of transformers and their subsequent successors, emphasizing their potential to revolutionize Vision Transformers (ViTs) and LLMs. This survey also presents a comparative analysis, juxtaposing the performance metrics of several leading paid and open-source LLMs, shedding light on their strengths and areas of improvement as well as a literature review on how LLMs are being used to tackle vision related tasks. Furthermore, the survey presents a comprehensive collection of datasets employed to train LLMs, offering insights into the diverse data available to achieve high performance in various pre-training and downstream tasks of LLMs. The survey is concluded by highlighting open directions in the field, suggesting potential venues for future research and development. This survey aims to underscores the profound intersection of LLMs on CV, leading to a new era of integrated and advanced AI models.
    摘要 最近,大语言模型(LLMs)和计算机视觉(CV)的交叉领域已经成为人工智能(AI)领域的重要研究领域,导致了许多领域的进步。作为 transformers 在 NLP 和 CV 中的基础模型,理解它们的演化和可能的提高是关键。这篇评论paper 探讨了最新的 transformers 和其后继者的进展,强调它们在 ViTs 和 LLMs 中的潜在革命化作用。此外,这篇评论还进行了多种领先的付费和开源 LLMs 的比较分析,探讨了它们的优势和改进点,以及如何使用 LLMs 解决视觉相关任务的文献回顾。此外,评论还提供了许多用于训练 LLMs 的集成数据集,为在不同的预训练和下游任务中实现高性能提供了信息。最后,评论结束于 highlighting 当前领域的开放方向,建议未来的研究发展方向。这篇评论希望强调 LLMS 在 CV 领域的深入交叉,并促进一个新的融合和高级 AI 模型的时代。

SplitNeRF: Split Sum Approximation Neural Field for Joint Geometry, Illumination, and Material Estimation

  • paper_url: http://arxiv.org/abs/2311.16671
  • repo_url: None
  • paper_authors: Jesus Zarzar, Bernard Ghanem
  • for: This paper proposes a new method for digitizing real-world objects, estimating an object's geometry, material properties, and environmental lighting from a set of posed images captured under fixed lighting.
  • methods: Incorporates the split sum approximation used with image-based lighting into the Neural Radiance Field (NeRF) pipeline for real-time physically based rendering (the approximation is sketched after this entry); models the scene's lighting with a single scene-specific multi-layer perceptron (MLP) representing pre-integrated image-based lighting at arbitrary resolutions; and supervises self-occlusion predictions with a novel regularizer based on efficient Monte Carlo sampling.
  • results: Experiments show the method efficiently and accurately estimates scene geometry, material properties, and lighting, attaining state-of-the-art relighting quality after only ${\sim}1$ hour of training on a single NVIDIA A100 GPU.
    Abstract We present a novel approach for digitizing real-world objects by estimating their geometry, material properties, and environmental lighting from a set of posed images with fixed lighting. Our method incorporates into Neural Radiance Field (NeRF) pipelines the split sum approximation used with image-based lighting for real-time physical-based rendering. We propose modeling the scene's lighting with a single scene-specific MLP representing pre-integrated image-based lighting at arbitrary resolutions. We achieve accurate modeling of pre-integrated lighting by exploiting a novel regularizer based on efficient Monte Carlo sampling. Additionally, we propose a new method of supervising self-occlusion predictions by exploiting a similar regularizer based on Monte Carlo sampling. Experimental results demonstrate the efficiency and effectiveness of our approach in estimating scene geometry, material properties, and lighting. Our method is capable of attaining state-of-the-art relighting quality after only ${\sim}1$ hour of training in a single NVIDIA A100 GPU.
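For context, here is the standard split sum approximation from real-time image-based lighting that the method builds its pre-integrated lighting MLP around; the notation is the usual rendering-equation one and is not taken from the paper:

```latex
% Split sum approximation: the specular part of the rendering integral factors into
% a pre-filtered environment term and a pre-integrated BRDF term, each of which can
% be tabulated offline or, as in SplitNeRF's lighting model, learned by an MLP.
\int_{\Omega} L_i(\omega_i)\, f_r(\omega_i, \omega_o)\, (n \cdot \omega_i)\, d\omega_i
\;\approx\;
\underbrace{\frac{\int_{\Omega} L_i(\omega_i)\, D(\omega_i)\, (n \cdot \omega_i)\, d\omega_i}
                 {\int_{\Omega} D(\omega_i)\, (n \cdot \omega_i)\, d\omega_i}}_{\text{pre-filtered environment light}}
\;\cdot\;
\underbrace{\int_{\Omega} f_r(\omega_i, \omega_o)\, (n \cdot \omega_i)\, d\omega_i}_{\text{pre-integrated BRDF}}
```

Here $L_i$ is incident radiance, $f_r$ the BRDF, $D$ the roughness lobe, and $n$ the surface normal; splitting the integral is what makes per-pixel physically based shading cheap enough for real-time use inside a NeRF pipeline.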

MultiModal-Learning for Predicting Molecular Properties: A Framework Based on Image and Graph Structures

  • paper_url: http://arxiv.org/abs/2311.16666
  • repo_url: None
  • paper_authors: Zhuoyuan Wang, Jiacong Mi, Shan Lu, Jieyue He
  • for: This paper proposes a multimodal pre-training approach to improve the prediction of molecular properties for drug molecules.
  • methods: Uses self-supervised learning (SSL) over two molecular modalities, molecule images and molecular graphs, with a Graph Neural Network (GNN) encoder fine-tuned for downstream prediction (a contrastive-alignment sketch follows this entry).
  • results: Compared with advanced baseline models, the model achieves better downstream performance on molecular property prediction, within benchmark groups such as the MoleculeNet Benchmark Group and the ADMET Benchmark Group.
    Abstract The quest for accurate prediction of drug molecule properties poses a fundamental challenge in the realm of Artificial Intelligence Drug Discovery (AIDD). An effective representation of drug molecules emerges as a pivotal component in this pursuit. Contemporary leading-edge research predominantly resorts to self-supervised learning (SSL) techniques to extract meaningful structural representations from large-scale, unlabeled molecular data, subsequently fine-tuning these representations for an array of downstream tasks. However, an inherent shortcoming of these studies lies in their singular reliance on one modality of molecular information, such as molecule image or SMILES representations, thus neglecting the potential complementarity of various molecular modalities. In response to this limitation, we propose MolIG, a novel MultiModaL molecular pre-training framework for predicting molecular properties based on Image and Graph structures. MolIG model innovatively leverages the coherence and correlation between molecule graph and molecule image to execute self-supervised tasks, effectively amalgamating the strengths of both molecular representation forms. This holistic approach allows for the capture of pivotal molecular structural characteristics and high-level semantic information. Upon completion of pre-training, Graph Neural Network (GNN) Encoder is used for the prediction of downstream tasks. In comparison to advanced baseline models, MolIG exhibits enhanced performance in downstream tasks pertaining to molecular property prediction within benchmark groups such as MoleculeNet Benchmark Group and ADMET Benchmark Group.
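A minimal sketch of cross-modal contrastive pre-training in the spirit of MolIG: align embeddings of a molecule's image view and graph view with a symmetric InfoNCE loss. The encoders here are stand-in linear layers on random features; in practice they would be a CNN/ViT on molecule images and a GNN on molecular graphs.

```python
import torch
import torch.nn.functional as F

batch, dim = 32, 128
image_emb = torch.nn.Linear(512, dim)(torch.randn(batch, 512))  # image encoder output
graph_emb = torch.nn.Linear(300, dim)(torch.randn(batch, 300))  # graph encoder output

def info_nce(a, b, temperature=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temperature           # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))        # matched image/graph pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = info_nce(image_emb, graph_emb)
loss.backward()                              # gradients flow into both encoders
print(float(loss))
```

Pulling the two views of the same molecule together (and pushing other molecules apart) is what lets each modality compensate for what the other misses.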

ClimateX: Do LLMs Accurately Assess Human Expert Confidence in Climate Statements?

  • paper_url: http://arxiv.org/abs/2311.17107
  • repo_url: https://github.com/rlacombe/climatex
  • paper_authors: Romain Lacombe, Kerrie Wu, Eddie Dilworth
  • for: To evaluate how accurately modern large language models (LLMs) assess human expert confidence in climate science and policy statements.
  • methods: Introduces the Expert Confidence in Climate Statements (ClimateX) dataset, a novel, curated, expert-labeled collection of 8,094 climate statements from the latest Intergovernmental Panel on Climate Change (IPCC) reports, annotated with their associated confidence levels; with it, shows that recent LLMs can classify human expert confidence in climate statements, especially in a few-shot learning setting (sketched after this entry), though with limited accuracy (up to 47%).
  • results: Finds that LLMs exhibit consistent and significant over-confidence on low and medium confidence statements.
    Abstract Evaluating the accuracy of outputs generated by Large Language Models (LLMs) is especially important in the climate science and policy domain. We introduce the Expert Confidence in Climate Statements (ClimateX) dataset, a novel, curated, expert-labeled dataset consisting of 8094 climate statements collected from the latest Intergovernmental Panel on Climate Change (IPCC) reports, labeled with their associated confidence levels. Using this dataset, we show that recent LLMs can classify human expert confidence in climate-related statements, especially in a few-shot learning setting, but with limited (up to 47%) accuracy. Overall, models exhibit consistent and significant over-confidence on low and medium confidence statements. We highlight implications of our results for climate communication, LLMs evaluation strategies, and the use of LLMs in information retrieval systems.
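A hypothetical sketch of the few-shot classification setup: build a prompt that asks an LLM to assign a confidence level to a statement given labeled exemplars. The exemplar statements below are invented placeholders, not items from ClimateX, and the label set is a simplified version of IPCC calibrated language.

```python
LEVELS = ["low", "medium", "high", "very high"]

FEW_SHOT = [  # invented placeholder exemplars
    ("Global mean sea level rose over the 20th century.", "very high"),
    ("Regional monsoon precipitation will intensify by 2100.", "medium"),
]

def build_prompt(statement: str) -> str:
    lines = ["Classify the expert confidence of each climate statement "
             f"as one of: {', '.join(LEVELS)}.\n"]
    for text, label in FEW_SHOT:
        lines.append(f"Statement: {text}\nConfidence: {label}\n")
    lines.append(f"Statement: {statement}\nConfidence:")
    return "\n".join(lines)

print(build_prompt("Arctic sea ice extent has declined since 1979."))
# The model's completion (one of LEVELS) is compared against the expert label to
# score accuracy and to measure over-confidence on low/medium statements.
```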

Finnish 5th and 6th graders’ misconceptions about Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2311.16644
  • repo_url: None
  • paper_authors: Pekka Mertala, Janne Fagerlund
  • for: To investigate children's initial conceptions of AI, informing the development of pedagogically sound AI-literacy curricula, methods, and materials.
  • methods: Abductively analyzes qualitative survey data from 195 children to answer three research questions: 1) What misconceptions do Finnish 5th and 6th graders have about the essence of AI? 2) How do these misconceptions relate to common misconception types? and 3) How profound are these misconceptions?
  • results: Identifies three misconception categories: 1) non-technological AI, in which AI is conceptualized as people's cognitive processes (factual misconception); 2) anthropomorphic AI, in which AI is conceptualized as a human-like entity (vernacular, non-scientific, and conceptual misconception); and 3) AI as a machine with pre-installed intelligence or knowledge (factual misconception). Most children rated their own AI knowledge as low, implying the misconceptions are superficial rather than profound. Context-specific linguistic features can contribute to students' AI misconceptions; implications for future research and AI literacy education are discussed.
    Abstract Research on children's initial conceptions of AI is in an emerging state, which, from a constructivist viewpoint, challenges the development of pedagogically sound AI-literacy curricula, methods, and materials. To contribute to resolving this need in the present paper, qualitative survey data from 195 children were analyzed abductively to answer the following three research questions: 1) What kind of misconceptions do Finnish 5th and 6th graders have about the essence of AI?; 2) How do these misconceptions relate to common misconception types?; and 3) How profound are these misconceptions? As a result, three misconception categories were identified: 1) Non-technological AI, in which AI was conceptualized as peoples' cognitive processes (factual misconception); 2) Anthropomorphic AI, in which AI was conceptualized as a human-like entity (vernacular, non-scientific, and conceptual misconception); and 3) AI as a machine with a pre-installed intelligence or knowledge (factual misconception). Majority of the children evaluated their AI-knowledge low, which implies that the misconceptions are more superficial than profound. The findings suggest that context-specific linguistic features can contribute to students' AI misconceptions. Implications for future research and AI literacy education are discussed.

Single-Cell Clustering via Dual-Graph Alignment

  • paper_url: http://arxiv.org/abs/2311.17104
  • repo_url: None
  • paper_authors: Dayu Hu, Ke Liang, Xinwang Liu
  • for: This research aims to develop a more accurate and reliable single-cell clustering method, enabling a better understanding of the distribution and characteristics of cells in tumor microenvironments.
  • methods: Uses dual-graph alignment: a graph-based autoencoder enhanced by an attention mechanism captures relationships between cells, while node2vec applied to Protein-Protein Interaction (PPI) networks derives the gene network structure, which is maintained throughout the clustering process via self-supervised and unsupervised optimization.
  • results: Experimental results show the method better captures cell-cell relationships and the structure of gene interaction networks, incorporating this information into clustering to produce more accurate and biologically meaningful results.
    Abstract In recent years, the field of single-cell RNA sequencing has seen a surge in the development of clustering methods. These methods enable the identification of cell subpopulations, thereby facilitating the understanding of tumor microenvironments. Despite their utility, most existing clustering algorithms primarily focus on the attribute information provided by the cell matrix or the network structure between cells, often neglecting the network between genes. This oversight could lead to loss of information and clustering results that lack clinical significance. To address this limitation, we develop an advanced single-cell clustering model incorporating dual-graph alignment, which integrates gene network information into the clustering process based on self-supervised and unsupervised optimization. Specifically, we designed a graph-based autoencoder enhanced by an attention mechanism to effectively capture relationships between cells. Moreover, we performed the node2vec method on Protein-Protein Interaction (PPI) networks to derive the gene network structure and maintained this structure throughout the clustering process. Our proposed method has been demonstrated to be effective through experimental results, showcasing its ability to optimize clustering outcomes while preserving the original associations between cells and genes. This research contributes to obtaining accurate cell subpopulations and generates clustering results that more closely resemble real-world biological scenarios. It provides better insights into the characteristics and distribution of diseased cells, ultimately building a foundation for early disease diagnosis and treatment.

LasTGL: An Industrial Framework for Large-Scale Temporal Graph Learning

  • paper_url: http://arxiv.org/abs/2311.16605
  • repo_url: None
  • paper_authors: Jintang Li, Jiawang Dan, Ruofan Wu, Jing Zhou, Sheng Tian, Yunfei Liu, Baokun Wang, Changhua Meng, Weiqiang Wang, Yuchang Zhu, Liang Chen, Zibin Zheng
  • for: This paper provides a unified framework for learning tasks on temporal (time-evolving) graphs.
  • methods: Uses temporal graph neural networks (TGNNs) to handle time-evolving graph data.
  • results: Introduces LasTGL, an industrial framework with unified, extensible implementations of common temporal graph learning algorithms, comprehensive temporal graph datasets, TGNN models and utilities, and well-documented tutorials that help researchers prototype temporal graph learning tasks quickly.
    Abstract Over the past few years, graph neural networks (GNNs) have become powerful and practical tools for learning on (static) graph-structure data. However, many real-world applications, such as social networks and e-commerce, involve temporal graphs where nodes and edges are dynamically evolving. Temporal graph neural networks (TGNNs) have progressively emerged as an extension of GNNs to address time-evolving graphs and have gradually become a trending research topic in both academics and industry. Advancing research in such an emerging field requires new tools to compose TGNN models and unify their different schemes in dealing with temporal graphs. To facilitate research and application in temporal graph learning, we introduce LasTGL, an industrial framework that integrates unified and extensible implementations of common temporal graph learning algorithms for various advanced tasks. The purpose of LasTGL is to provide the essential building blocks for solving temporal graph learning tasks, focusing on the guiding principles of user-friendliness and quick prototyping on which PyTorch is based. In particular, LasTGL provides comprehensive temporal graph datasets, TGNN models and utilities along with well-documented tutorials, making it suitable for both absolute beginners and expert deep learning practitioners alike.

GSP-KalmanNet: Tracking Graph Signals via Neural-Aided Kalman Filtering

  • paper_url: http://arxiv.org/abs/2311.16602
  • repo_url: None
  • paper_authors: Itay Buchnik, Guy Sagi, Nimrod Leinwand, Yuval Loya, Nir Shlezinger, Tirza Routtenberg
  • for: This paper studies the tracking of graph signals in dynamic systems, as encountered in applications such as social networks, power grids, and transportation.
  • methods: Proposes GSP-KalmanNet, a hybrid model-based/data-driven approach that tracks hidden graphical states from graphical measurements by jointly leveraging graph signal processing (GSP) tools and deep learning (DL); the Kalman filter is extended with graph frequency domain filtering (sketched after this entry), and the Kalman gain is learned from data following the KalmanNet framework.
  • results: Empirical results show GSP-KalmanNet achieves enhanced accuracy and run-time performance when processing high-dimensional signals, along with improved robustness to model misspecifications and small topology changes.
    Abstract Dynamic systems of graph signals are encountered in various applications, including social networks, power grids, and transportation. While such systems can often be described as state space (SS) models, tracking graph signals via conventional tools based on the Kalman filter (KF) and its variants is typically challenging. This is due to the nonlinearity, high dimensionality, irregularity of the domain, and complex modeling associated with real-world dynamic systems of graph signals. In this work, we study the tracking of graph signals using a hybrid model-based/data-driven approach. We develop the GSP-KalmanNet, which tracks the hidden graphical states from the graphical measurements by jointly leveraging graph signal processing (GSP) tools and deep learning (DL) techniques. The derivations of the GSP-KalmanNet are based on extending the KF to exploit the inherent graph structure via graph frequency domain filtering, which considerably simplifies the computational complexity entailed in processing high-dimensional signals and increases the robustness to small topology changes. Then, we use data to learn the Kalman gain following the recently proposed KalmanNet framework, which copes with partial and approximated modeling, without forcing a specific model over the noise statistics. Our empirical results demonstrate that the proposed GSP-KalmanNet achieves enhanced accuracy and run time performance as well as improved robustness to model misspecifications compared with both model-based and data-driven benchmarks.
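A minimal sketch of the graph-frequency-domain filtering GSP-KalmanNet builds on: project a graph signal onto the Laplacian eigenbasis (the graph Fourier transform), attenuate high graph frequencies, and transform back. The toy graph and filter response are assumptions; the paper learns the Kalman gain around this machinery.

```python
import numpy as np

# 5-node ring graph
A = np.zeros((5, 5))
for i in range(5):
    A[i, (i + 1) % 5] = A[(i + 1) % 5, i] = 1.0
L = np.diag(A.sum(1)) - A                      # combinatorial Laplacian

eigvals, U = np.linalg.eigh(L)                 # eigvals play the role of graph frequencies
x = np.array([1.0, -1.0, 2.0, 0.5, -0.5])      # noisy graph signal

x_hat = U.T @ x                                # graph Fourier transform
h = 1.0 / (1.0 + eigvals)                      # low-pass response (Tikhonov-style)
x_filtered = U @ (h * x_hat)                   # filter, then inverse GFT

print(np.round(x_filtered, 3))
# Filtering per graph frequency simplifies processing of high-dimensional signals and
# degrades gracefully under small topology changes, which the tracking filter exploits.
```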

Single-cell Multi-view Clustering via Community Detection with Unknown Number of Clusters

  • paper_url: http://arxiv.org/abs/2311.17103
  • repo_url: https://github.com/dayuhuu/scunc
  • paper_authors: Dayu Hu, Zhibin Dong, Ke Liang, Jun Wang, Siwei Wang, Xinwang Liu
  • for: This paper addresses single-cell multi-view clustering, enabling the exploration of cellular heterogeneity within the same cell from different views.
  • methods: Proposes scUNC, a novel multi-view clustering method tailored for single-cell data that fuses information from different views through a cross-view fusion network, generates initial clusters via community detection, and then automatically merges and optimizes the clusters without requiring a predefined number of clusters.
  • results: On three distinct single-cell datasets, scUNC outperforms the baseline methods.
    Abstract Single-cell multi-view clustering enables the exploration of cellular heterogeneity within the same cell from different views. Despite the development of several multi-view clustering methods, two primary challenges persist. Firstly, most existing methods treat the information from both single-cell RNA (scRNA) and single-cell Assay of Transposase Accessible Chromatin (scATAC) views as equally significant, overlooking the substantial disparity in data richness between the two views. This oversight frequently leads to a degradation in overall performance. Additionally, the majority of clustering methods necessitate manual specification of the number of clusters by users. However, for biologists dealing with cell data, precisely determining the number of distinct cell types poses a formidable challenge. To this end, we introduce scUNC, an innovative multi-view clustering approach tailored for single-cell data, which seamlessly integrates information from different views without the need for a predefined number of clusters. The scUNC method comprises several steps: initially, it employs a cross-view fusion network to create an effective embedding, which is then utilized to generate initial clusters via community detection. Subsequently, the clusters are automatically merged and optimized until no further clusters can be merged. We conducted a comprehensive evaluation of scUNC using three distinct single-cell datasets. The results underscored that scUNC outperforms the other baseline methods.

Monitor Placement for Fault Localization in Deep Neural Network Accelerators

  • paper_url: http://arxiv.org/abs/2311.16594
  • repo_url: None
  • paper_authors: Wei-Kai Liu, Benjamin Tan, Krishnendu Chakrabarty
  • for: Improving the reliability of deep neural network (DNN) accelerators built on systolic arrays.
  • methods: Optimizes the placement of hardware monitors within systolic arrays to localize faulty processing elements (PEs); proves that $2N-1$ monitors are needed to localize a single faulty PE and derives the monitor placement, then proposes a heuristic to balance reliability and hardware resource utilization when the number of monitors is limited (a toy localization sketch follows this entry).
  • results: Shows that a second placement optimization problem, minimizing the set of candidate faulty PEs for a given number of monitors, is NP-hard; experimental evaluation shows an area overhead of only 0.33% for a $256\times 256$ systolic array.
    Abstract Systolic arrays are a prominent choice for deep neural network (DNN) accelerators because they offer parallelism and efficient data reuse. Improving the reliability of DNN accelerators is crucial as hardware faults can degrade the accuracy of DNN inferencing. Systolic arrays make use of a large number of processing elements (PEs) for parallel processing, but when one PE is faulty, the error propagates and affects the outcomes of downstream PEs. Due to the large number of PEs, the cost associated with implementing hardware-based runtime monitoring of every single PE is prohibitive. We present a solution to optimize the placement of hardware monitors within systolic arrays. We first prove that $2N-1$ monitors are needed to localize a single faulty PE and we also derive the monitor placement. We show that a second placement optimization problem, which minimizes the set of candidate faulty PEs for a given number of monitors, is NP-hard. Therefore, we propose a heuristic approach to balance the reliability and hardware resource utilization in DNN accelerators when number of monitors is limited. Experimental evaluation shows that to localize a single faulty PE, an area overhead of only 0.33% is incurred for a $256\times 256$ systolic array.
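A toy sketch of the intersection idea behind monitor-based fault localization in an $N \times N$ systolic array: with a monitor per row and per column (2N here; the paper shows $2N-1$ suffice since one monitor's observation is implied by the others), a single faulty PE is localized as the crossing of the flagged row and column. Purely illustrative, not the paper's placement.

```python
import numpy as np

N = 8
faulty = (3, 5)                                 # hidden single faulty PE (row, col)

# Simulated monitor observations: a checksum mismatch flags every stream the fault touches
row_flags = np.zeros(N, dtype=bool)
col_flags = np.zeros(N, dtype=bool)
row_flags[faulty[0]] = True
col_flags[faulty[1]] = True

def localize(row_flags, col_flags):
    rows, cols = np.flatnonzero(row_flags), np.flatnonzero(col_flags)
    if len(rows) == 1 and len(cols) == 1:
        return int(rows[0]), int(cols[0])       # unique intersection
    return None                                  # ambiguous: more than one candidate

print(localize(row_flags, col_flags))            # -> (3, 5)
# With fewer monitors, several PEs can produce identical flag patterns; shrinking that
# candidate set for a given monitor budget is the problem the paper proves NP-hard.
```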

StreamFlow: Streamlined Multi-Frame Optical Flow Estimation for Video Sequences

  • paper_url: http://arxiv.org/abs/2311.17099
  • repo_url: None
  • paper_authors: Shangkun Sun, Jiaming Liu, Thomas H. Li, Huaxia Li, Guoqing Liu, Wei Gao
  • for: This work addresses occlusions between consecutive frames in optical flow estimation; the ambiguity they introduce violates the brightness constancy constraint and severely hinders pixel-to-pixel matching.
  • methods: Proposes a Streamlined In-batch Multi-frame (SIM) pipeline tailored to video input that avoids redundant recursive computation, an efficient Integrative Spatio-temporal Coherence (ISC) modeling method for the encoding phase that adds no parameter overhead, and a Global Temporal Regressor (GTR) that explores temporal relations during decoding.
  • results: Achieves strong performance on the challenging KITTI and Sintel datasets, with particular improvement in occluded areas, and a $63.82\%$ speed-up compared with previous multi-frame methods.
    Abstract Occlusions between consecutive frames have long posed a significant challenge in optical flow estimation. The inherent ambiguity introduced by occlusions directly violates the brightness constancy constraint and considerably hinders pixel-to-pixel matching. To address this issue, multi-frame optical flow methods leverage adjacent frames to mitigate the local ambiguity. Nevertheless, prior multi-frame methods predominantly adopt recursive flow estimation, resulting in a considerable computational overlap. In contrast, we propose a streamlined in-batch framework that eliminates the need for extensive redundant recursive computations while concurrently developing effective spatio-temporal modeling approaches under in-batch estimation constraints. Specifically, we present a Streamlined In-batch Multi-frame (SIM) pipeline tailored to video input, attaining a similar level of time efficiency to two-frame networks. Furthermore, we introduce an efficient Integrative Spatio-temporal Coherence (ISC) modeling method for effective spatio-temporal modeling during the encoding phase, which introduces no additional parameter overhead. Additionally, we devise a Global Temporal Regressor (GTR) that effectively explores temporal relations during decoding. Benefiting from the efficient SIM pipeline and effective modules, StreamFlow not only excels in terms of performance on the challenging KITTI and Sintel datasets, with particular improvement in occluded areas but also attains a remarkable $63.82\%$ enhancement in speed compared with previous multi-frame methods. The code will be available soon at https://github.com/littlespray/StreamFlow.

DyRA: Dynamic Resolution Adjustment for Scale-robust Object Detection

  • paper_url: http://arxiv.org/abs/2311.17098
  • repo_url: https://github.com/daeunfullgrace/dyra
  • paper_authors: Daeun Seo, Hoeseok Yang, Hyungshin Kim
  • for: To keep object-detection accuracy consistent despite the variability of object sizes, which otherwise degrades model accuracy.
  • methods: Proposes DyRA, an adaptive resolution scaling network comprising convolutions and transformer encoder blocks that attaches to existing detectors; DyRA returns a scale factor from the input image, enabling instance-specific scaling (a minimal sketch follows this entry), and is trained jointly with the detector using the purpose-built ParetoScaleLoss and BalanceLoss.
  • results: Experiments on COCO with four detectors (RetinaNet, Faster-RCNN, FCOS, and Mask-RCNN) achieve accuracy improvements of 1.3%, 1.1%, 1.3%, and 0.8% over a multi-resolution baseline that relies on resolution adjustment alone.
    Abstract In object detection, achieving constant accuracy is challenging due to the variability of object sizes. One possible solution to this problem is to optimize the input resolution, known as a multi-resolution strategy. Previous approaches for optimizing resolution are often based on pre-defined resolutions or a dynamic neural network, but there is a lack of study for run-time resolution optimization for existing architecture. In this paper, we propose an adaptive resolution scaling network called DyRA, which comprises convolutions and transformer encoder blocks, for existing detectors. Our DyRA returns a scale factor from an input image, which enables instance-specific scaling. This network is jointly trained with detectors with specially designed loss functions, namely ParetoScaleLoss and BalanceLoss. The ParetoScaleLoss produces an adaptive scale factor from the image, while the BalanceLoss optimizes the scale factor according to localization power for the dataset. The loss function is designed to minimize accuracy drop about the contrasting objective of small and large objects. Our experiments on COCO, RetinaNet, Faster-RCNN, FCOS, and Mask-RCNN achieved 1.3%, 1.1%, 1.3%, and 0.8% accuracy improvement than a multi-resolution baseline with solely resolution adjustment. The code is available at https://github.com/DaEunFullGrace/DyRA.git.
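A hedged sketch of the DyRA idea: a small network predicts a scalar scale factor from the input image, and the image is resized by that factor before detection. The backbone below is a stand-in CNN with an assumed scale range; the real DyRA uses convolutions plus transformer encoder blocks and the paper's ParetoScaleLoss/BalanceLoss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalePredictor(nn.Module):
    def __init__(self, min_scale=0.5, max_scale=2.0):
        super().__init__()
        self.min_scale, self.max_scale = min_scale, max_scale
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1))
    def forward(self, x):
        s = torch.sigmoid(self.net(x))                     # (B, 1) in (0, 1)
        return self.min_scale + s * (self.max_scale - self.min_scale)

image = torch.randn(1, 3, 480, 640)
scale = ScalePredictor()(image)                            # per-image scale factor
h, w = [int(d * scale.item()) for d in image.shape[-2:]]
resized = F.interpolate(image, size=(h, w), mode="bilinear", align_corners=False)
print(scale.item(), resized.shape)                         # the detector consumes `resized`
```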

Anonymous Jamming Detection in 5G with Bayesian Network Model Based Inference Analysis

  • paper_url: http://arxiv.org/abs/2311.17097
  • repo_url: None
  • paper_authors: Ying Wang, Shashank Jere, Soumya Banerjee, Lingjia Liu, Sachin Shetty, Shehadi Dayekh
  • for: Jamming and intrusion detection in 5G, aiming to maintain reliability, prevent user-experience degradation, and avoid infrastructure failure.
  • methods: Proposes an anonymous jamming detection model based on signal parameters from the protocol stacks, using supervised and unsupervised learning for real-time, high-accuracy detection of jamming, including unknown types (an autoencoder sketch follows this entry).
  • results: Supervised models reach an AUC of 0.964 to 1, versus 0.923 to 1 for LSTM models, but the need for data annotation limits the supervised approach; an unsupervised autoencoder-based anomaly detector reaches an AUC of 0.987 and is resistant to adversarial training samples. A Bayesian network-based causation analysis is introduced for transparency and domain-knowledge injection.
    Abstract Jamming and intrusion detection are critical in 5G research, aiming to maintain reliability, prevent user experience degradation, and avoid infrastructure failure. This paper introduces an anonymous jamming detection model for 5G based on signal parameters from the protocol stacks. The system uses supervised and unsupervised learning for real-time, high-accuracy detection of jamming, including unknown types. Supervised models reach an AUC of 0.964 to 1, compared to LSTM models with an AUC of 0.923 to 1. However, the need for data annotation limits the supervised approach. To address this, an unsupervised auto-encoder-based anomaly detection is presented with an AUC of 0.987. The approach is resistant to adversarial training samples. For transparency and domain knowledge injection, a Bayesian network-based causation analysis is introduced.
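A minimal sketch of autoencoder-based anomaly detection for jamming: train an autoencoder on jamming-free protocol-stack feature vectors, then flag inputs whose reconstruction error exceeds a threshold set from clean data. The feature dimensionality, 3-sigma rule, and synthetic "jammed" shift are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

dim = 24                                         # protocol-stack parameters per sample
ae = nn.Sequential(                              # encoder -> bottleneck -> decoder
    nn.Linear(dim, 12), nn.ReLU(), nn.Linear(12, 4),
    nn.ReLU(), nn.Linear(4, 12), nn.ReLU(), nn.Linear(12, dim))
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)

clean = torch.randn(512, dim)                    # stand-in for jamming-free traffic
for _ in range(200):                             # reconstruction training on clean data
    loss = ((ae(clean) - clean) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    err = ((ae(clean) - clean) ** 2).mean(dim=1)
    threshold = err.mean() + 3 * err.std()       # 3-sigma rule on clean errors
    jammed = clean + 2.5 * torch.randn_like(clean)   # synthetic shift mimicking jamming
    scores = ((ae(jammed) - jammed) ** 2).mean(dim=1)
    print(f"flagged {(scores > threshold).float().mean():.0%} of jammed samples")
```

Because the detector only ever learns what normal traffic looks like, it needs no labels and generalizes to jamming types never seen in training.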

Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models

  • paper_url: http://arxiv.org/abs/2311.17095
  • repo_url: None
  • paper_authors: Luo Jiayun, Siddhesh Khandelwal, Leonid Sigal, Boyang Li
  • for: This paper proposes a training-free technique for adapting large-scale vision-language models (VLMs) to open-vocabulary semantic segmentation.
  • methods: Proposes Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS), a simple yet extremely effective technique that leverages a VLM's direct text-to-image cross-attention and an image-text matching loss to produce segmentation, and introduces Salience Dropout, which iteratively drops the patches the model attends to most, to balance the over-segmentation of raw cross-attention against the under-segmentation of cross-attention plus GradCAM (a sketch of the loop follows this entry).
  • results: Requires no neural network training and tunes hyperparameters without any segmentation annotations, even for a validation set; PnP-OVSS improves substantially over a comparable baseline (+29.4% mIoU on Pascal VOC, +13.2% on Pascal Context, +14.0% on MS COCO, +2.4% on COCO Stuff) and even outperforms most baselines that conduct additional network training on top of pretrained VLMs.
    Abstract From an enormous amount of image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which is vital for tasks such as image captioning and visual question answering. However, leveraging such pre-trained models for open-vocabulary semantic segmentation remains a challenge. In this paper, we propose a simple, yet extremely effective, training-free technique, Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) for this task. PnP-OVSS leverages a VLM with direct text-to-image cross-attention and an image-text matching loss to produce semantic segmentation. However, cross-attention alone tends to over-segment, whereas cross-attention plus GradCAM tend to under-segment. To alleviate this issue, we introduce Salience Dropout; by iteratively dropping patches that the model is most attentive to, we are able to better resolve the entire extent of the segmentation mask. Compared to existing techniques, the proposed method does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations, even for a validation set. PnP-OVSS demonstrates substantial improvements over a comparable baseline (+29.4% mIoU on Pascal VOC, +13.2% mIoU on Pascal Context, +14.0% mIoU on MS COCO, +2.4% mIoU on COCO Stuff) and even outperforms most baselines that conduct additional network training on top of pretrained VLMs.
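A hypothetical sketch of the Salience Dropout loop: repeatedly mask out the image patches the model currently attends to most, re-run cross-attention, and accumulate the attention maps so coverage spreads over the whole object. `cross_attention` is a random stand-in for the VLM's text-to-image attention; round count and drop fraction are assumptions.

```python
import torch

def cross_attention(patch_mask):
    """Stand-in: per-patch attention for one text query (higher = more salient)."""
    att = torch.rand(patch_mask.shape)
    return att * patch_mask                      # dropped patches receive zero attention

num_patches, rounds, drop_frac = 196, 4, 0.1     # 14x14 patches on a ViT-style grid
mask = torch.ones(num_patches)                   # 1 = patch still visible
accumulated = torch.zeros(num_patches)

for _ in range(rounds):
    att = cross_attention(mask)
    accumulated = torch.maximum(accumulated, att)
    k = int(drop_frac * num_patches)
    top = att.topk(k).indices                    # most-attended visible patches
    mask[top] = 0.0                              # drop them for the next round

segmentation = (accumulated > accumulated.mean()).reshape(14, 14)
print(segmentation.sum().item(), "patches assigned to the query")
```

Each round forces attention onto parts of the object it previously ignored, which is how the accumulated map resolves the full extent of the segmentation mask.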

Graph Prompt Learning: A Comprehensive Survey and Beyond

  • paper_url: http://arxiv.org/abs/2311.16534
  • repo_url: https://github.com/sheldonresearch/ProG
  • paper_authors: Xiangguo Sun, Jiawen Zhang, Xixi Wu, Hong Cheng, Yun Xiong, Jia Li
  • for: This paper surveys the application of artificial general intelligence (AGI) to graph data, addressing the key challenges and opportunities in harnessing graph data for AGI.
  • methods: Proposes a unified framework for understanding graph prompt learning, clarifying prompt tokens, token structures, and insertion patterns in the graph domain, and examining the flexibility, expressiveness, and interplay of graph prompts with existing graph models.
  • results: Critically evaluates the current landscape of AGI on graph data, highlighting the distinct challenges of cross-modality, cross-domain, and cross-task applications; categorizes over 100 works with a comprehensive taxonomy aligned with node-level, edge-level, and graph-level pre-training tasks; and releases the ProG Python library and an accompanying website to support and advance graph prompt research.
    Abstract Artificial General Intelligence (AGI) has revolutionized numerous fields, yet its integration with graph data, a cornerstone in our interconnected world, remains nascent. This paper presents a pioneering survey on the emerging domain of graph prompts in AGI, addressing key challenges and opportunities in harnessing graph data for AGI applications. Despite substantial advancements in AGI across natural language processing and computer vision, the application to graph data is relatively underexplored. This survey critically evaluates the current landscape of AGI in handling graph data, highlighting the distinct challenges in cross-modality, cross-domain, and cross-task applications specific to graphs. Our work is the first to propose a unified framework for understanding graph prompt learning, offering clarity on prompt tokens, token structures, and insertion patterns in the graph domain. We delve into the intrinsic properties of graph prompts, exploring their flexibility, expressiveness, and interplay with existing graph models. A comprehensive taxonomy categorizes over 100 works in this field, aligning them with pre-training tasks across node-level, edge-level, and graph-level objectives. Additionally, we present, ProG, a Python library, and an accompanying website, to support and advance research in graph prompting. The survey culminates in a discussion of current challenges and future directions, offering a roadmap for research in graph prompting within AGI. Through this comprehensive analysis, we aim to catalyze further exploration and practical applications of AGI in graph data, underlining its potential to reshape AGI fields and beyond. ProG and the website can be accessed by \url{https://github.com/WxxShirley/Awesome-Graph-Prompt}, and \url{https://github.com/sheldonresearch/ProG}, respectively.

Efficient Multimodal Diffusion Models Using Joint Data Infilling with Partially Shared U-Net

  • paper_url: http://arxiv.org/abs/2311.16488
  • repo_url: None
  • paper_authors: Zizhao Hu, Shaochong Jia, Mohammad Rostami
  • for: This paper proposes an efficient multimodal diffusion model that preserves modality-specific detail while avoiding the inefficiency and interference between modalities.
  • methods: Uses a Partially Shared U-Net (PS-U-Net) architecture that lets text and image inputs pass through dedicated layers and skip-connections to preserve modality-specific fine-grained details, and proposes a new efficient multimodal sampling method, inspired by image inpainting, that introduces new conditional-generation scenarios while only requiring a simple joint distribution to be learned.
  • results: On MS-COCO, generates multimodal text and image data of higher quality than existing multimodal diffusion models, with a comparable model size, faster training, faster multimodal sampling, and more flexible generation.
    Abstract Recently, diffusion models have been used successfully to fit distributions for cross-modal data translation and multimodal data generation. However, these methods rely on extensive scaling, overlooking the inefficiency and interference between modalities. We develop Partially Shared U-Net (PS-U-Net) architecture which is an efficient multimodal diffusion model that allows text and image inputs to pass through dedicated layers and skip-connections for preserving modality-specific fine-grained details. Inspired by image inpainting, we also propose a new efficient multimodal sampling method that introduces new scenarios for conditional generation while only requiring a simple joint distribution to be learned. Our empirical exploration of the MS-COCO dataset demonstrates that our method generates multimodal text and image data with higher quality compared to existing multimodal diffusion models while having a comparable size, faster training, faster multimodal sampling, and more flexible generation.

Enhancing Human Persuasion With Large Language Models

  • paper_url: http://arxiv.org/abs/2311.16466
  • repo_url: None
  • paper_authors: Minkyu Shin, Jin Kim
  • for: This paper studies the impact of large language models (LLMs) on human communication, specifically in consumer complaints in the financial industry.
  • methods: Applies an AI detection tool to more than 780K complaints gathered by the Consumer Financial Protection Bureau (CFPB), finding evidence of LLM usage in complaint writing shortly after the release of ChatGPT.
  • results: LLM usage is positively correlated with the likelihood of obtaining desirable outcomes (i.e., offers of relief from financial firms), plausibly in part because of the linguistic features LLMs improve; a preregistered experiment confirms that complaints rewritten with ChatGPT for better linguistic quality were more likely to receive hypothetical relief offers, demonstrating LLMs' ability to enhance message persuasiveness in human communication.
    Abstract Although large language models (LLMs) are reshaping various aspects of human life, our current understanding of their impacts remains somewhat constrained. Here we investigate the impact of LLMs on human communication, in the context of consumer complaints in the financial industry. Employing an AI detection tool on more than 780K complaints gathered by the Consumer Financial Protection Bureau (CFPB), we find evidence of LLM usage in the writing of complaints - shortly after the release of ChatGPT. Our analyses reveal that LLM usage is positively correlated with the likelihood of obtaining desirable outcomes (i.e., offer of relief from financial firms) and suggest that this positive correlation may be partly due to the linguistic features improved by LLMs. We test this conjecture with a preregistered experiment, which reveals results consistent with those from observational studies: Consumer complaints written with ChatGPT for improved linguistic qualities were more likely to receive hypothetical relief offers than the original consumer complaints, demonstrating the LLM's ability to enhance message persuasiveness in human communication. Being some of the earliest empirical evidence on LLM usage for enhancing persuasion, our results highlight the transformative potential of LLMs in human communication.

Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection

  • paper_url: http://arxiv.org/abs/2311.16464
  • repo_url: https://github.com/easonxiao-888/uvcom
  • paper_authors: Yicheng Xiao, Zhuoyan Luo, Yong Liu, Yue Ma, Hengwei Bian, Yatai Ji, Yujiu Yang, Xiu Li
  • for: Video Moment Retrieval (MR) and Highlight Detection (HD).
  • methods: Uses a transformer-based architecture: the Unified Video COMprehension framework (UVCOM) performs progressive integration within and across modalities at multiple granularities, with multi-aspect contrastive learning to consolidate local relation modeling and global knowledge accumulation in a well-aligned multimodal space.
  • results: Extensive experiments on the QVHighlights, Charades-STA, TACoS, YouTube Highlights, and TVSum datasets demonstrate the effectiveness and rationality of UVCOM, which outperforms state-of-the-art methods by a remarkable margin.
    Abstract Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted significant attention due to the growing demand for video analysis. Recent approaches treat MR and HD as similar video grounding problems and address them together with transformer-based architecture. However, we observe that the emphasis of MR and HD differs, with one necessitating the perception of local relationships and the other prioritizing the understanding of global contexts. Consequently, the lack of task-specific design will inevitably lead to limitations in associating the intrinsic specialty of two tasks. To tackle the issue, we propose a Unified Video COMprehension framework (UVCOM) to bridge the gap and jointly solve MR and HD effectively. By performing progressive integration on intra and inter-modality across multi-granularity, UVCOM achieves the comprehensive understanding in processing a video. Moreover, we present multi-aspect contrastive learning to consolidate the local relation modeling and global knowledge accumulation via well aligned multi-modal space. Extensive experiments on QVHighlights, Charades-STA, TACoS , YouTube Highlights and TVSum datasets demonstrate the effectiveness and rationality of UVCOM which outperforms the state-of-the-art methods by a remarkable margin.

Typhoon Intensity Prediction with Vision Transformer

  • paper_url: http://arxiv.org/abs/2311.16450
  • repo_url: https://github.com/chen-huanxin/tint
  • paper_authors: Huanxin Chen, Pengshuai Yin, Huichou Huang, Qingyao Wu, Ruirui Liu, Xiatian Zhu
  • for: Predicting typhoon intensity accurately across space and time, so that timely disaster warnings and emergency responses can be issued.
  • methods: Leverages satellite imagery for scenario analysis and adopts self-attention with a global receptive field per layer for feature representation learning: the Typhoon Intensity Transformer (Tint) cuts a satellite image into a sequence of patches and recursively applies self-attention to extract local and global contextual relations between all patch pairs simultaneously (a patch-sequence sketch follows this entry).
  • results: Outperforms state-of-the-art deep learning methods built on convolutional neural networks (CNNs), whose limited per-layer receptive fields hinder capturing long-range dependencies and global context, as well as conventional meteorological methods, on a publicly available typhoon benchmark.
    Abstract Predicting typhoon intensity accurately across space and time is crucial for issuing timely disaster warnings and facilitating emergency response. This has vast potential for minimizing life losses and property damages as well as reducing economic and environmental impacts. Leveraging satellite imagery for scenario analysis is effective but also introduces additional challenges due to the complex relations among clouds and the highly dynamic context. Existing deep learning methods in this domain rely on convolutional neural networks (CNNs), which suffer from limited per-layer receptive fields. This limitation hinders their ability to capture long-range dependencies and global contextual knowledge during inference. In response, we introduce a novel approach, namely "Typhoon Intensity Transformer" (Tint), which leverages self-attention mechanisms with global receptive fields per layer. Tint adopts a sequence-to-sequence feature representation learning perspective. It begins by cutting a given satellite image into a sequence of patches and recursively employs self-attention operations to extract both local and global contextual relations between all patch pairs simultaneously, thereby enhancing per-patch feature representation learning. Extensive experiments on a publicly available typhoon benchmark validate the efficacy of Tint in comparison with both state-of-the-art deep learning and conventional meteorological methods. Our code is available at https://github.com/chen-huanxin/Tint.
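A minimal sketch of the patch-sequence idea behind Tint: split a satellite image into non-overlapping patches, embed them, and run a transformer encoder whose self-attention relates every patch pair, giving each layer a global receptive field. The regression head and all sizes are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)              # satellite frame (B, C, H, W)
patch = 16
tokens = image.unfold(2, patch, patch).unfold(3, patch, patch)   # cut into 16x16 patches
tokens = tokens.reshape(1, 3, -1, patch, patch).permute(0, 2, 1, 3, 4)
tokens = tokens.reshape(1, -1, 3 * patch * patch)                # (B, 196, 768)

embed = nn.Linear(3 * patch * patch, 256)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4)
head = nn.Linear(256, 1)                         # scalar intensity (e.g., wind speed)

x = encoder(embed(tokens))                       # every layer attends over all patches
intensity = head(x.mean(dim=1))                  # pooled regression
print(intensity.shape)                           # torch.Size([1, 1])
```

The contrast with a CNN is that attention links distant cloud structures in a single layer, rather than only after many stacked convolutions.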

Text-Driven Image Editing via Learnable Regions

  • paper_url: http://arxiv.org/abs/2311.16432
  • repo_url: https://github.com/yuanze-lin/Learnable_Regions
  • paper_authors: Yuanze Lin, Yi-Wen Chen, Yi-Hsuan Tsai, Lu Jiang, Ming-Hsuan Yang
  • for: This paper proposes region-based image editing driven by textual prompts, without the need for user-provided masks or sketches.
  • methods: Leverages an existing pretrained text-to-image model and introduces a bounding box generator to find the edit regions that are aligned with the textual prompts.
  • results: Enables flexible editing compatible with current image generation models and handles complex prompts featuring multiple objects, complex sentences, or long paragraphs; an extensive user study against state-of-the-art methods demonstrates competitive performance in manipulating images with high fidelity and realism aligned with the provided language descriptions.
    Abstract Language has emerged as a natural interface for image editing. In this paper, we introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches. Specifically, our approach leverages an existing pretrained text-to-image model and introduces a bounding box generator to find the edit regions that are aligned with the textual prompts. We show that this simple approach enables flexible editing that is compatible with current image generation models, and is able to handle complex prompts featuring multiple objects, complex sentences or long paragraphs. We conduct an extensive user study to compare our method against state-of-the-art methods. Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that align with the language descriptions provided. Our project webpage: https://yuanze-lin.me/LearnableRegions_page.

Manifold Preserving Guided Diffusion

  • paper_url: http://arxiv.org/abs/2311.16424
  • repo_url: None
  • paper_authors: Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov, Stefano Ermon
  • for: This paper proposes a training-free conditional generation framework that applies broadly across tasks.
  • methods: The framework builds on pretrained diffusion models and off-the-shelf neural networks at minimal additional inference cost, and uses the manifold hypothesis to refine the guided diffusion steps, introducing a shortcut algorithm in the process.
  • results: Experiments show that MPGD efficiently solves a variety of conditional generation applications in low-compute settings, delivering higher-quality samples than the baselines with up to 3.8x speed-ups at the same number of diffusion steps.
    Abstract Despite the recent advancements, conditional image generation still faces challenges of cost, generalizability, and the need for task-specific training. In this paper, we propose Manifold Preserving Guided Diffusion (MPGD), a training-free conditional generation framework that leverages pretrained diffusion models and off-the-shelf neural networks with minimal additional inference cost for a broad range of tasks. Specifically, we leverage the manifold hypothesis to refine the guided diffusion steps and introduce a shortcut algorithm in the process. We then propose two methods for on-manifold training-free guidance using pre-trained autoencoders and demonstrate that our shortcut inherently preserves the manifolds when applied to latent diffusion models. Our experiments show that MPGD is efficient and effective for solving a variety of conditional generation applications in low-compute settings, and can consistently offer up to 3.8x speed-ups with the same number of diffusion steps while maintaining high sample quality compared to the baselines.
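The mechanics described above can be sketched as one guided update. This is a minimal interpretation of manifold-preserving guidance, assuming stand-ins for the pretrained denoiser, autoencoder, and guidance loss; it simplifies the schedule handling and does not reproduce the paper's exact shortcut algorithm:

```python
# A hedged sketch in the spirit of MPGD: take the guidance gradient on the
# *clean* latent estimate only, then optionally project back through the
# pretrained autoencoder to stay near the data manifold.
import torch

def mpgd_step(z_t, t, alpha_bar_t, denoiser, vae, guidance_loss, scale=1.0):
    eps = denoiser(z_t, t)
    # Tweedie-style estimate of the clean latent from the current noisy one.
    z0_hat = (z_t - torch.sqrt(1 - alpha_bar_t) * eps) / torch.sqrt(alpha_bar_t)

    # Guidance gradient w.r.t. the clean estimate only (the "shortcut").
    z0_hat = z0_hat.detach().requires_grad_(True)
    loss = guidance_loss(vae.decode(z0_hat))
    grad = torch.autograd.grad(loss, z0_hat)[0]
    z0_guided = z0_hat - scale * grad

    # On-manifold projection through the pretrained autoencoder.
    return vae.encode(vae.decode(z0_guided)).detach()

if __name__ == "__main__":
    class Identity:  # trivial stand-in autoencoder so the sketch runs
        def encode(self, x): return x
        def decode(self, x): return x
    z = torch.randn(1, 4, 8, 8)
    out = mpgd_step(z, t=10, alpha_bar_t=torch.tensor(0.5),
                    denoiser=lambda z, t: torch.zeros_like(z),
                    vae=Identity(), guidance_loss=lambda x: x.pow(2).mean())
    print(out.shape)
```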

Multi-defender Security Games with Schedules

  • paper_url: http://arxiv.org/abs/2311.16392
  • repo_url: None
  • paper_authors: Zimeng Song, Chun Kai Ling, Fei Fang
  • for: Study security games featuring multiple defenders and schedules simultaneously, and investigate the impact of schedules on the existence of equilibrium in these games.
  • methods: Mathematical modeling and computational algorithms; the paper proves that under certain restrictions the non-existence of equilibrium can be avoided and an equilibrium computed in polynomial time.
  • results: Introducing schedules can cause non-existence of equilibrium, but this is avoided under the mild restriction that any subset of a schedule is also a schedule; experiments demonstrate that the algorithms scale to games with multiple heterogeneous defenders.
    Abstract Stackelberg Security Games are often used to model strategic interactions in high-stakes security settings. The majority of existing models focus on single-defender settings where a single entity assumes command of all security assets. However, many realistic scenarios feature multiple heterogeneous defenders with their own interests and priorities embedded in a more complex system. Furthermore, defenders rarely choose targets to protect. Instead, they have a multitude of defensive resources or schedules at their disposal, each with different protective capabilities. In this paper, we study security games featuring multiple defenders and schedules simultaneously. We show that unlike prior work on multi-defender security games, the introduction of schedules can cause non-existence of equilibrium even under rather restricted environments. We prove that under the mild restriction that any subset of a schedule is also a schedule, non-existence of equilibrium is not only avoided, but an equilibrium can be computed in polynomial time in games with two defenders. Under additional assumptions, our algorithm can be extended to games with more than two defenders and its computation scaled up in special classes of games with compactly represented schedules such as those used in patrolling applications. Experimental results suggest that our methods scale gracefully with game size, making our algorithms amongst the few that can tackle multiple heterogeneous defenders.

Combating the “Sameness” in AI Art: Reflections on the Interactive AI Installation Fencing Hallucination

  • paper_url: http://arxiv.org/abs/2311.17080
  • repo_url: None
  • paper_authors: Weihao Qiu, George Legrady
  • for: addressing the issue of “sameness” in AI art, specifically in the context of AI image creation tools.
  • methods: reflecting on the design of AI art production to alleviate the sense of uniformity, maintain the uniqueness of images from an AI image synthesizer, and enhance the connection between the artworks and the audience.
  • results: The Fencing Hallucination project stimulates the creation of distinctive AI art, offering insights and efforts dedicated to addressing the issue of "sameness".
    Abstract The article summarizes three types of "sameness" issues in Artificial Intelligence(AI) art, each occurring at different stages of development in AI image creation tools. Through the Fencing Hallucination project, the article reflects on the design of AI art production in alleviating the sense of uniformity, maintaining the uniqueness of images from an AI image synthesizer, and enhancing the connection between the artworks and the audience. This paper endeavors to stimulate the creation of distinctive AI art by recounting the efforts and insights derived from the Fencing Hallucination project, all dedicated to addressing the issue of "sameness".

cs.CL - 2023-11-28

Does VLN Pretraining Work with Nonsensical or Irrelevant Instructions?

  • paper_url: http://arxiv.org/abs/2311.17280
  • repo_url: None
  • paper_authors: Wang Zhu, Ishika Singh, Yuan Huang, Robin Jia, Jesse Thomason
  • for: Study the impact of data augmentation on Vision-and-Language Navigation (VLN) models during pretraining.
  • methods: The paper examines back-translation data augmentation, whose generated instructions are noisy, and devises an efficient augmentation method (Unigram + Object) that deliberately produces nonsensical instructions.
  • results: Pretraining with noisy or irrelevant instructions has little effect on downstream performance and still beats using only clean, human-written data, suggesting that for VLN R2R pretraining the quantity of visual trajectories matters more than instruction quality.
    Abstract Data augmentation via back-translation is common when pretraining Vision-and-Language Navigation (VLN) models, even though the generated instructions are noisy. But: does that noise matter? We find that nonsensical or irrelevant language instructions during pretraining can have little effect on downstream performance for both HAMT and VLN-BERT on R2R, and is still better than only using clean, human data. To underscore these results, we concoct an efficient augmentation method, Unigram + Object, which generates nonsensical instructions that nonetheless improve downstream performance. Our findings suggest that what matters for VLN R2R pretraining is the quantity of visual trajectories, not the quality of instructions.
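A toy version of the Unigram + Object idea makes the finding concrete: sample instruction words from a unigram distribution and splice in names of objects visible along the trajectory. The word lists below are invented for illustration; the paper pairs such strings with real visual trajectories:

```python
# A toy sketch of Unigram + Object style augmentation: the sentence as a whole
# is nonsensical, but object mentions keep some visual grounding intact.
import random

UNIGRAMS = ["walk", "turn", "the", "past", "left", "right", "stop", "then"]

def nonsense_instruction(visible_objects, n_words=8, seed=None):
    rng = random.Random(seed)
    words = [rng.choice(UNIGRAMS) for _ in range(n_words)]
    # Insert object mentions at random positions.
    for obj in visible_objects:
        words.insert(rng.randrange(len(words) + 1), obj)
    return " ".join(words)

print(nonsense_instruction(["sofa", "staircase"], seed=0))
```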

RETSim: Resilient and Efficient Text Similarity

  • paper_url: http://arxiv.org/abs/2311.17264
  • repo_url: None
  • paper_authors: Marina Zhang, Owen Vallis, Aysegul Bumin, Tanay Vakharia, Elie Bursztein
  • for: Propose a lightweight, multilingual deep learning model that produces robust metric embeddings for near-duplicate text retrieval, clustering, and dataset deduplication.
  • methods: RETSim (Resilient and Efficient Text Similarity), a deep-learning-based multilingual model trained to embed text for these tasks.
  • results: RETSim is significantly more robust and accurate than MinHash and neural text embeddings on dataset deduplication, adversarial text retrieval, and spam clustering; the authors also introduce the W4NT3D benchmark (Wiki-40B 4dversarial Near-T3xt Dataset) for evaluating multilingual near-duplicate text retrieval under adversarial settings.
    Abstract This paper introduces RETSim (Resilient and Efficient Text Similarity), a lightweight, multilingual deep learning model trained to produce robust metric embeddings for near-duplicate text retrieval, clustering, and dataset deduplication tasks. We demonstrate that RETSim is significantly more robust and accurate than MinHash and neural text embeddings, achieving new state-of-the-art performance on dataset deduplication, adversarial text retrieval benchmarks, and spam clustering tasks. We also introduce the W4NT3D benchmark (Wiki-40B 4dversarial Near-T3xt Dataset) for evaluating multilingual, near-duplicate text retrieval capabilities under adversarial settings. RETSim and the W4NT3D benchmark are open-sourced under the MIT License at https://github.com/google/unisim.
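Near-duplicate retrieval with metric embeddings reduces to thresholded cosine similarity. The sketch below shows that pipeline shape only; the character-n-gram "embedding" is a stand-in for RETSim's learned model, chosen so the example runs offline:

```python
# Generic embedding-based near-duplicate detection; the embed() here is a
# crude stand-in, not RETSim itself.
import numpy as np

def embed(text, n=3, dim=256):
    v = np.zeros(dim)
    for i in range(len(text) - n + 1):
        v[hash(text[i:i + n]) % dim] += 1.0   # hashed character n-grams
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def near_duplicates(texts, threshold=0.9):
    vecs = np.stack([embed(t) for t in texts])
    sims = vecs @ vecs.T                       # cosine similarities
    return [(i, j) for i in range(len(texts)) for j in range(i + 1, len(texts))
            if sims[i, j] >= threshold]

docs = ["the cat sat on the mat", "the cat sat on a mat", "stock prices fell"]
print(near_duplicates(docs))  # the first two should pair up
```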

General-Purpose vs. Domain-Adapted Large Language Models for Extraction of Data from Thoracic Radiology Reports

  • paper_url: http://arxiv.org/abs/2311.17213
  • repo_url: None
  • paper_authors: Ali H. Dhanaliwala, Rikhiya Ghosh, Sanjeev Kumar Karn, Poikavila Ullaskrishnan, Oladimeji Farri, Dorin Comaniciu, Charles E. Kahn
  • for: Extracting common data elements (CDEs) from thoracic radiology reports.
  • methods: Compared a domain-adapted language model (RadLing) with a general-purpose large language model (GPT-4) for identifying CDEs and assigning their values.
  • results: The RadLing system outperformed the GPT-4 system overall, with comparable precision (96% vs 99%) and much higher recall (94% vs 70%), and offered operational advantages such as local deployment and reduced runtime costs.
    Abstract Radiologists produce unstructured data that could be valuable for clinical care when consumed by information systems. However, variability in style limits usage. Study compares performance of system using domain-adapted language model (RadLing) and general-purpose large language model (GPT-4) in extracting common data elements (CDE) from thoracic radiology reports. Three radiologists annotated a retrospective dataset of 1300 thoracic reports (900 training, 400 test) and mapped to 21 pre-selected relevant CDEs. RadLing was used to generate embeddings for sentences and identify CDEs using cosine-similarity, which were mapped to values using light-weight mapper. GPT-4 system used OpenAI's general-purpose embeddings to identify relevant CDEs and used GPT-4 to map to values. The output CDE:value pairs were compared to the reference standard; an identical match was considered true positive. Precision (positive predictive value) was 96% (2700/2824) for RadLing and 99% (2034/2047) for GPT-4. Recall (sensitivity) was 94% (2700/2876) for RadLing and 70% (2034/2887) for GPT-4; the difference was statistically significant (P<.001). RadLing's domain-adapted embeddings were more sensitive in CDE identification (95% vs 71%) and its light-weight mapper had comparable precision in value assignment (95.4% vs 95.0%). RadLing system exhibited higher performance than GPT-4 system in extracting CDEs from radiology reports. RadLing system's domain-adapted embeddings outperform general-purpose embeddings from OpenAI in CDE identification and its light-weight value mapper achieves comparable precision to large GPT-4. RadLing system offers operational advantages including local deployment and reduced runtime costs. Domain-adapted RadLing system surpasses GPT-4 system in extracting common data elements from radiology reports, while providing benefits of local deployment and lower costs.

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

  • paper_url: http://arxiv.org/abs/2311.17049
  • repo_url: None
  • paper_authors: Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel
  • for: This paper aims to improve the efficiency of CLIP models for deployment on mobile devices while maintaining their zero-shot performance.
  • methods: The proposed method, called Multi-modal Reinforced Training, leverages knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models. The approach avoids train-time compute overhead by storing the additional knowledge in a reinforced dataset.
  • results: The proposed MobileCLIP model sets a new state-of-the-art latency-accuracy tradeoff for zero-shot classification and retrieval tasks on several datasets. The MobileCLIP-S2 variant is 2.3 times faster while more accurate compared to the previous best CLIP model based on ViT-B/16. The multi-modal reinforced training approach achieves +2.9% average performance improvement on 38 evaluation benchmarks compared to the previous best.
    Abstract Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders with significant memory and latency overhead which pose challenges for deployment on mobile devices. In this work, we introduce MobileCLIP -- a new family of efficient image-text models optimized for runtime performance along with a novel and efficient training approach, namely multi-modal reinforced training. The proposed training approach leverages knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models. Our approach avoids train-time compute overhead by storing the additional knowledge in a reinforced dataset. MobileCLIP sets a new state-of-the-art latency-accuracy tradeoff for zero-shot classification and retrieval tasks on several datasets. Our MobileCLIP-S2 variant is 2.3$\times$ faster while more accurate compared to previous best CLIP model based on ViT-B/16. We further demonstrate the effectiveness of our multi-modal reinforced training by training a CLIP model based on ViT-B/16 image backbone and achieving +2.9% average performance improvement on 38 evaluation benchmarks compared to the previous best. Moreover, we show that the proposed approach achieves 10$\times$-1000$\times$ improved learning efficiency when compared with non-reinforced CLIP training.
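One way to read "multi-modal reinforced training" is as distillation from teacher knowledge (ensemble CLIP similarities, synthetic captions) precomputed and stored alongside each sample, so no teacher forward pass happens at train time. The loss below is a hedged guess at that recipe; shapes and the loss mix are assumptions, not the released method:

```python
# Sketch: train the efficient student against stored teacher embeddings plus
# the usual CLIP contrastive objective.
import torch
import torch.nn.functional as F

def reinforced_loss(student_img, student_txt, teacher_img, teacher_txt, tau=0.07):
    """student_*: (B, D) embeddings; teacher_*: (B, D) loaded from the dataset."""
    s = F.normalize(student_img, dim=-1) @ F.normalize(student_txt, dim=-1).T
    t = F.normalize(teacher_img, dim=-1) @ F.normalize(teacher_txt, dim=-1).T
    # Distill the teacher's image-text similarity matrix.
    kl = F.kl_div(F.log_softmax(s / tau, 1), F.softmax(t / tau, 1),
                  reduction="batchmean")
    # Plus the standard in-batch contrastive loss.
    labels = torch.arange(s.size(0))
    clip = F.cross_entropy(s / tau, labels)
    return clip + kl

B, D = 8, 64
print(reinforced_loss(torch.randn(B, D), torch.randn(B, D),
                      torch.randn(B, D), torch.randn(B, D)))
```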

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

  • paper_url: http://arxiv.org/abs/2311.17043
  • repo_url: https://github.com/dvlab-research/llama-vid
  • paper_authors: Yanwei Li, Chengyao Wang, Jiaya Jia
  • for: Address the token-generation burden in vision language models (VLMs) so as to improve video and image understanding.
  • methods: LLaMA-VID represents each frame with two distinct tokens: a context token that encodes the overall image context based on the user input, and a content token that captures the visual cues in each frame. This dual-token strategy greatly reduces the overload of long videos while preserving critical information.
  • results: LLaMA-VID surpasses previous methods on most video- and image-based benchmarks. Code is available at https://github.com/dvlab-research/LLaMA-VID.
    Abstract In this work, we present a novel method to tackle the token generation challenge in Vision Language Models (VLMs) for video and image understanding, called LLaMA-VID. Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to the excessive visual tokens. LLaMA-VID addresses this issue by representing each frame with two distinct tokens, namely context token and content token. The context token encodes the overall image context based on user input, whereas the content token encapsulates visual cues in each frame. This dual-token strategy significantly reduces the overload of long videos while preserving critical information. Generally, LLaMA-VID empowers existing frameworks to support hour-long videos and pushes their upper limit with an extra context token. It is proved to surpass previous methods on most of the video- and image-based benchmarks. Code is available at https://github.com/dvlab-research/LLaMA-VID.
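The dual-token idea can be sketched in a few lines: one text-conditioned context token plus one pooled content token per frame. The attention-pooling form here is an assumption for illustration, not the paper's exact operators:

```python
# Sketch: compress one frame's patch features into two tokens.
import torch
import torch.nn.functional as F

def frame_to_two_tokens(frame_feats, text_query):
    """frame_feats: (N, D) patch features for one frame; text_query: (D,)."""
    # Context token: patches weighted by relevance to the user's instruction.
    attn = F.softmax(frame_feats @ text_query, dim=0)   # (N,)
    context_token = attn @ frame_feats                  # (D,)
    # Content token: a cheap summary of the frame's own visual content.
    content_token = frame_feats.mean(dim=0)             # (D,)
    return torch.stack([context_token, content_token])  # (2, D)

feats = torch.randn(196, 512)                 # e.g. 14x14 ViT patches
tokens = frame_to_two_tokens(feats, torch.randn(512))
print(tokens.shape)  # an hour of video at 1 fps -> only 7200 tokens
```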

Scalable Extraction of Training Data from (Production) Language Models

  • paper_url: http://arxiv.org/abs/2311.17035
  • repo_url: None
  • paper_authors: Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee
  • for: This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model, without prior knowledge of the training set.
  • methods: Existing techniques from the literature suffice to attack unaligned models; for the aligned ChatGPT, the authors develop a new divergence attack that causes the model to break from its chatbot-style generations and emit training data at a 150x higher rate.
  • results: The methods extract gigabytes of training data from open-source, semi-open, and closed language models, showing that practical attacks recover far more data than previously thought and that current alignment techniques do not eliminate memorization.
    Abstract This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly. Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.

ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up?

  • paper_url: http://arxiv.org/abs/2311.16989
  • repo_url: None
  • paper_authors: Hailin Chen, Fangkai Jiao, Xingxuan Li, Chengwei Qin, Mathieu Ravaut, Ruochen Zhao, Caiming Xiong, Shafiq Joty
  • for: This paper provides an exhaustive overview of the success of open-source large language models (LLMs) in achieving parity or better performance than closed-source models like ChatGPT on various tasks.
  • methods: The paper surveys all tasks where open-source LLMs have claimed to be on par or better than ChatGPT, using supervised fine-tuning and reinforcement learning from human feedback.
  • results: The paper shows that open-source LLMs have made rapid progress in the past year, with some achieving parity or better performance than ChatGPT on certain tasks, and highlights the crucial implications of this progress on both research and business.
    Abstract Upon its release in late 2022, ChatGPT brought a seismic shift to the entire landscape of AI, in both research and commerce. Through instruction-tuning a large language model (LLM) with supervised fine-tuning and reinforcement learning from human feedback, it showed that a model could answer human questions and follow instructions across a broad panel of tasks. Following this success, interest in LLMs has intensified, with new LLMs appearing at frequent intervals across academia and industry, including many start-ups focused on LLMs. While closed-source LLMs (e.g., OpenAI's GPT, Anthropic's Claude) generally outperform their open-source counterparts, progress on the latter has been rapid, with claims of parity or even superior performance on certain tasks. This has crucial implications not only for research but also for business. In this work, on the first anniversary of ChatGPT, we provide an exhaustive overview of this success, surveying all tasks where an open-source LLM has claimed to be on par with or better than ChatGPT.

Assessing the influence of attractor-verb distance on grammatical agreement in humans and language models

  • paper_url: http://arxiv.org/abs/2311.16978
  • repo_url: None
  • paper_authors: Christos-Nikolaos Zacharopoulos, Théo Desbordes, Mathias Sablé-Meyer
  • for: Investigate how an attractor noun placed between the subject and the verb affects subject-verb agreement.
  • methods: Parametrically modulate the distance between the attractor and the verb while keeping sentence length constant, and evaluate both humans and two artificial neural network models.
  • results: Both humans and neural networks make more mistakes when the attractor is closer to the verb, but the networks fall close to chance level while humans mostly overcome the interference; attractor distance also has a linear effect on reaction times.
    Abstract Subject-verb agreement in the presence of an attractor noun located between the main noun and the verb elicits complex behavior: judgments of grammaticality are modulated by the grammatical features of the attractor. For example, in the sentence "The girl near the boys likes climbing", the attractor (boys) disagrees in grammatical number with the verb (likes), creating a locally implausible transition probability. Here, we parametrically modulate the distance between the attractor and the verb while keeping the length of the sentence equal. We evaluate the performance of both humans and two artificial neural network models: both make more mistakes when the attractor is closer to the verb, but neural networks get close to the chance level while humans are mostly able to overcome the attractor interference. Additionally, we report a linear effect of attractor distance on reaction times. We hypothesize that a possible reason for the proximity effect is the calculation of transition probabilities between adjacent words. Nevertheless, classical models of attraction such as the cue-based model might suffice to explain this phenomenon, thus paving the way for new research. Data and analyses available at https://osf.io/d4g6k
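Agreement effects like these are typically measured in language models by comparing the probability of the grammatical and the ungrammatical verb form after the same prefix. A sketch using GPT-2 as an illustrative stand-in for the paper's two models:

```python
# Compare log P("likes") vs. log P("like") after an attractor-bearing prefix.
# Model choice and the probe itself are illustrative, not the paper's setup.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def verb_logprob(prefix, verb):
    ids = tok(prefix + " " + verb, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)  # row i predicts token i+1
    n_prefix = tok(prefix, return_tensors="pt").input_ids.size(1)
    verb_ids = ids[0, n_prefix:]
    rows = torch.arange(n_prefix - 1, ids.size(1) - 1)
    return logp[rows, verb_ids].sum()

prefix = "The girl near the boys"
# Positive difference = the model prefers the grammatical form.
print(verb_logprob(prefix, "likes") - verb_logprob(prefix, "like"))
```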

Natural Language Processing Through Transfer Learning: A Case Study on Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2311.16965
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Aman Yadav, Abhishek Vichare
  • for: Explore transfer learning for sentiment analysis, arguing that models pretrained on big data can raise accuracy where task data are scarce.
  • methods: Fine-tune a pretrained BERT model on the IMDb movie-review dataset, with tokenization and encoding of the text data to make it suitable for NLP models.
  • results: The transfer-learned BERT model reached 100% accuracy on the sentiment classification task, though further analysis is needed to rule out overfitting and to confirm generalization to unseen data.
    Abstract Artificial intelligence and machine learning have significantly bolstered the technological world. This paper explores the potential of transfer learning in natural language processing focusing mainly on sentiment analysis. The models trained on the big data can also be used where data are scarce. The claim is that, compared to training models from scratch, transfer learning, using pre-trained BERT models, can increase sentiment classification accuracy. The study adopts a sophisticated experimental design that uses the IMDb dataset of sentimentally labelled movie reviews. Pre-processing includes tokenization and encoding of text data, making it suitable for NLP models. The dataset is used on a BERT based model, measuring its performance using accuracy. The result comes out to be 100 per cent accurate. Although the complete accuracy could appear impressive, it might be the result of overfitting or a lack of generalization. Further analysis is required to ensure the model's ability to handle diverse and unseen data. The findings underscore the effectiveness of transfer learning in NLP, showcasing its potential to excel in sentiment analysis tasks. However, the research calls for a cautious interpretation of perfect accuracy and emphasizes the need for additional measures to validate the model's generalization.
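The setup described is the standard transfer-learning recipe. A minimal sketch, with illustrative hyperparameters and a tiny in-memory batch standing in for the full IMDb dataset:

```python
# Fine-tune a pretrained BERT sentiment classifier for one step.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

texts = ["a moving, beautifully acted film", "a dull, lifeless mess"]
labels = torch.tensor([1, 0])                     # 1 = positive, 0 = negative
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")

opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
out = model(**batch, labels=labels)   # transformers computes the CE loss
out.loss.backward()
opt.step()
print(float(out.loss))
```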

Optimisation-Based Multi-Modal Semantic Image Editing

  • paper_url: http://arxiv.org/abs/2311.16882
  • repo_url: None
  • paper_authors: Bowen Li, Yongxin Yang, Steven McDonagh, Shifeng Zhang, Petru-Daniel Tudosiu, Sarah Parisot
  • for: Extend semantic image editing beyond textual instructions to multiple instruction types (e.g. spatial layouts, pose, scribbles, edge maps) while improving edit precision and accuracy.
  • methods: An inference-time editing optimisation that disentangles the task into two competing subtasks, successful local image modification and global content consistency preservation, each guided by a dedicated loss function; adjusting each loss's influence yields a flexible editing solution tunable to user preferences.
  • results: Evaluations with text, pose, and scribble edit conditions show, both qualitatively and quantitatively, that the method achieves complex edits.
    Abstract Image editing affords increased control over the aesthetics and content of generated images. Pre-existing works focus predominantly on text-based instructions to achieve desired image modifications, which limit edit precision and accuracy. In this work, we propose an inference-time editing optimisation, designed to extend beyond textual edits to accommodate multiple editing instruction types (e.g. spatial layout-based; pose, scribbles, edge maps). We propose to disentangle the editing task into two competing subtasks: successful local image modifications and global content consistency preservation, where subtasks are guided through two dedicated loss functions. By allowing to adjust the influence of each loss function, we build a flexible editing solution that can be adjusted to user preferences. We evaluate our method using text, pose and scribble edit conditions, and highlight our ability to achieve complex edits, through both qualitative and quantitative experiments.
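The two-subtask decomposition can be written directly as a weighted objective. A schematic with stand-in losses; the weights are the user-preference dials the abstract mentions, and both loss functions are hypothetical placeholders:

```python
# Schematic editing objective: local edit loss inside the region, consistency
# loss outside it, with user-adjustable weights.
import torch

def editing_objective(image, original, region_mask, edit_loss, consistency_loss,
                      w_edit=1.0, w_consist=0.5):
    inside = image * region_mask
    outside = image * (1 - region_mask)
    return (w_edit * edit_loss(inside)
            + w_consist * consistency_loss(outside, original * (1 - region_mask)))

img = torch.randn(1, 3, 64, 64, requires_grad=True)
orig = img.detach().clone()
mask = torch.zeros(1, 1, 64, 64); mask[..., 16:48, 16:48] = 1
loss = editing_objective(img, orig, mask,
                         edit_loss=lambda x: -x.mean(),  # toy "make brighter"
                         consistency_loss=lambda a, b: (a - b).pow(2).mean())
loss.backward()
print(img.grad.abs().mean())  # gradients flow to both subtasks
```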

A Benchmark for Evaluating Machine Translation Metrics on Dialects Without Standard Orthography

  • paper_url: http://arxiv.org/abs/2311.16865
  • repo_url: None
  • paper_authors: Noëmi Aepli, Chantal Amrhein, Florian Schottmann, Rico Sennrich
  • for: Assess how robust existing machine translation evaluation metrics are to non-standardized language varieties, such as Swiss German dialects that lack a standard orthography.
  • methods: The authors collect human translations and human judgments for automatic translations from English into two Swiss German dialects, and create a challenge set for dialect variation to benchmark existing metrics.
  • results: Existing metrics cannot reliably evaluate Swiss German text generation outputs, especially at the segment level; the authors propose initial design adaptations that increase robustness to non-standardized dialects, though much room for improvement remains.
    Abstract For sensible progress in natural language processing, it is important that we are aware of the limitations of the evaluation metrics we use. In this work, we evaluate how robust metrics are to non-standardized dialects, i.e. spelling differences in language varieties that do not have a standard orthography. To investigate this, we collect a dataset of human translations and human judgments for automatic machine translations from English to two Swiss German dialects. We further create a challenge set for dialect variation and benchmark existing metrics' performances. Our results show that existing metrics cannot reliably evaluate Swiss German text generation outputs, especially on segment level. We propose initial design adaptations that increase robustness in the face of non-standardized dialects, although there remains much room for further improvement. The dataset, code, and models are available here: https://github.com/textshuttle/dialect_eval

RELIC: Investigating Large Language Model Responses using Self-Consistency

  • paper_url: http://arxiv.org/abs/2311.16842
  • repo_url: None
  • paper_authors: Furui Cheng, Vilém Zouhar, Simran Arora, Mrinmaya Sachan, Hendrik Strobelt, Mennatallah El-Assady
  • for: Help users check the reliability of text generated by large language models, so they can separate fact from fiction.
  • methods: An interactive system that uses the self-consistency of multiple samples generated by the same LLM to estimate its confidence in individual claims.
  • results: In a user study with ten participants, the approach helped users better verify the reliability of generated text; the paper also distills design implications and lessons for future work on reliable human-LLM interaction.
    Abstract Large Language Models (LLMs) are notorious for blending fact with fiction and generating non-factual content, known as hallucinations. To tackle this challenge, we propose an interactive system that helps users obtain insights into the reliability of the generated text. Our approach is based on the idea that the self-consistency of multiple samples generated by the same LLM relates to its confidence in individual claims in the generated texts. Using this idea, we design RELIC, an interactive system that enables users to investigate and verify semantic-level variations in multiple long-form responses. This allows users to recognize potentially inaccurate information in the generated text and make necessary corrections. From a user study with ten participants, we demonstrate that our approach helps users better verify the reliability of the generated text. We further summarize the design implications and lessons learned from this research for inspiring future studies on reliable human-LLM interactions.
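The self-consistency signal can be approximated crudely: sample several responses and score a claim by how often the other samples agree with it. RELIC operates at the semantic level; the naive token-overlap test below is only a stand-in that keeps the sketch self-contained:

```python
# Back-of-the-envelope self-consistency scoring: a low score flags a claim
# the user should verify. supports() is a deliberately crude surrogate.
import re

def words(text):
    return set(re.findall(r"[a-z']+", text.lower()))

def supports(claim, sample):
    cw = words(claim)
    return len(cw & words(sample)) / len(cw) > 0.6

def consistency_score(claim, samples):
    votes = [supports(claim, s) for s in samples]
    return sum(votes) / len(votes)

samples = [
    "Marie Curie won two Nobel Prizes, in physics and chemistry.",
    "Curie won two Nobel prizes for physics and chemistry.",
    "Marie Curie won only one Nobel Prize.",
]
print(consistency_score("Curie won two Nobel Prizes", samples))  # 2/3
```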

Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

  • paper_url: http://arxiv.org/abs/2311.16839
  • repo_url: None
  • paper_authors: Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, Conghui He
  • for: addresses the “hallucination problem” in multimodal large language models, where the models generate textual descriptions that contain inaccurate or non-existent content from the image.
  • methods: introduces a novel strategy called Hallucination-Aware Direct Preference Optimization (HA-DPO), which treats the hallucination problem as a unique preference selection issue and trains the model to favor the non-hallucinating response when presented with two responses of the same image.
  • results: the paper shows a significant reduction in the hallucination problem and an enhancement in the models’ generalization capabilities with HA-DPO. Specifically, the MiniGPT-4 model demonstrates a 34.5% absolute improvement in POPE accuracy and a 41% relative improvement in MME score.
    Abstract Multimodal large language models have made significant advancements in recent years, yet they still suffer from a common issue known as the "hallucination problem" where the models generate textual descriptions that contain inaccurate or non-existent content from the image. To address this issue, this paper introduces a novel strategy: Hallucination-Aware Direct Preference Optimization (HA-DPO). Our approach treats the hallucination problem as a unique preference selection issue, where the model is trained to favor the non-hallucinating response when presented with two responses of the same image (one accurate and one hallucinating). This paper also presents an efficient process for constructing hallucination sample pairs to ensure high-quality, style-consistent pairs for stable HA-DPO training. We applied this strategy to two mainstream multimodal models, and the results showed a significant reduction in the hallucination problem and an enhancement in the models' generalization capabilities. With HA-DPO, the MiniGPT-4 model demonstrates significant advancements: POPE accuracy increases from 51.13% to 85.66% (34.5% absolute improvement), and the MME score escalates from 968.58 to 1365.76 (41% relative improvement). The code, models, and datasets will be made publicly available.
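At its core, HA-DPO applies the direct preference optimization objective with the non-hallucinating response as the preferred one. The generic DPO loss, written out (this is the standard formulation, not the authors' exact training code):

```python
# DPO loss: push the policy toward the preferred (non-hallucinating) response
# relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """*_w: log-probs of the preferred response; *_l: of the rejected one."""
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# Toy numbers: the policy currently prefers the hallucinating answer slightly,
# so the loss is above log(2) and its gradient corrects the preference.
print(dpo_loss(torch.tensor([-42.0]), torch.tensor([-41.5]),
               torch.tensor([-42.0]), torch.tensor([-42.0])))
```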

Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis

  • paper_url: http://arxiv.org/abs/2311.17126
  • repo_url: None
  • paper_authors: Xiaohui Chen, Yongfei Liu, Yingxiang Yang, Jianbo Yuan, Quanzeng You, Li-Ping Liu, Hongxia Yang
  • for: Improve the compositional ability of text-to-image (T2I) diffusion models by using large language models (LLMs) as layout generators.
  • methods: Chain-of-Thought prompting of LLMs interprets the text and produces spatially reasonable object layouts that match the prompt's semantics; an efficient cross-attention adapter then integrates the layout information into stable diffusion models.
  • results: Experiments show significant improvements in image quality and layout accuracy, highlighting the potential of LLMs to augment generative image models.
    Abstract Recent advancements in text-to-image (T2I) generative models have shown remarkable capabilities in producing diverse and imaginative visuals based on text prompts. Despite the advancement, these diffusion models sometimes struggle to translate the semantic content from the text into images entirely. While conditioning on the layout has shown to be effective in improving the compositional ability of T2I diffusion models, they typically require manual layout input. In this work, we introduce a novel approach to improving T2I diffusion models using Large Language Models (LLMs) as layout generators. Our method leverages the Chain-of-Thought prompting of LLMs to interpret text and generate spatially reasonable object layouts. The generated layout is then used to enhance the generated images' composition and spatial accuracy. Moreover, we propose an efficient adapter based on a cross-attention mechanism, which explicitly integrates the layout information into the stable diffusion models. Our experiments demonstrate significant improvements in image quality and layout accuracy, showcasing the potential of LLMs in augmenting generative image models.

Large Language Models Suffer From Their Own Output: An Analysis of the Self-Consuming Training Loop

  • paper_url: http://arxiv.org/abs/2311.16822
  • repo_url: None
  • paper_authors: Martin Briesch, Dominik Sobania, Franz Rothlauf
  • for: Study what happens when content generated by large language models (LLMs), now ubiquitous in benchmarks and conversational applications, is reused to train the next generation of LLMs.
  • methods: The authors empirically study this self-consuming training loop using a novel dataset that lets them measure the quality and diversity of generated outputs analytically and accurately.
  • results: The loop initially improves both quality and diversity, but after a few generations the diversity of the outputs inevitably degenerates, at a rate that depends on the proportion of real to generated data.
    Abstract Large language models (LLM) have become state of the art in many benchmarks and conversational LLM applications like ChatGPT are now widely used by the public. Those LLMs can be used to generate large amounts of content which is posted on the internet to various platforms. As LLMs are trained on datasets usually collected from the internet, this LLM-generated content might be used to train the next generation of LLMs. Therefore, a self-consuming training loop emerges in which new LLM generations are trained on the output from the previous generations. We empirically study this self-consuming training loop using a novel dataset to analytically and accurately measure quality and diversity of generated outputs. We find that this self-consuming training loop initially improves both quality and diversity. However, after a few generations the output inevitably degenerates in diversity. We find that the rate of degeneration depends on the proportion of real and generated data.

Evaluating Optimal Reference Translations

  • paper_url: http://arxiv.org/abs/2311.16787
  • repo_url: https://github.com/ufal/optimal-reference-translations
  • paper_authors: Vilém Zouhar, Věra Kloudová, Martin Popel, Ondřej Bojar
  • for: Raise the bar for the reference translations used to evaluate machine translation systems.
  • methods: A methodology for creating more reliable document-level human reference translations, called "optimal reference translations."
  • results: Evaluation of the document-level optimal reference translations confirms a significant quality increase over standard references and documents the relationship between evaluation and translation editing.
    Abstract The overall translation quality reached by current machine translation (MT) systems for high-resourced language pairs is remarkably good. Standard methods of evaluation are not suitable nor intended to uncover the many translation errors and quality deficiencies that still persist. Furthermore, the quality of standard reference translations is commonly questioned and comparable quality levels have been reached by MT alone in several language pairs. Navigating further research in these high-resource settings is thus difficult. In this article, we propose a methodology for creating more reliable document-level human reference translations, called "optimal reference translations," with the simple aim to raise the bar of what should be deemed "human translation quality." We evaluate the obtained document-level optimal reference translations in comparison with "standard" ones, confirming a significant quality increase and also documenting the relationship between evaluation and translation editing.

Radiology-Aware Model-Based Evaluation Metric for Report Generation

  • paper_url: http://arxiv.org/abs/2311.16764
  • repo_url: None
  • paper_authors: Amos Calamida, Farhad Nooralahzadeh, Morteza Rohanian, Koji Fujimoto, Mizuho Nishio, Michael Krauthammer
  • for: This paper proposes a new automated evaluation metric for machine-generated radiology reports.
  • methods: The paper uses the successful COMET architecture adapted for the radiology domain, and trains and publishes four medically-oriented model checkpoints, including one trained on RadGraph, a radiology knowledge graph.
  • results: The proposed metric correlates moderately to highly with established metrics such as BERTscore, BLEU, and CheXbert scores; one checkpoint also shows a high correlation with human judgment, assessed with publicly available annotations from six board-certified radiologists on 200 reports and the authors' own annotations from two radiologists on 100 reports.
    Abstract We propose a new automated evaluation metric for machine-generated radiology reports using the successful COMET architecture adapted for the radiology domain. We train and publish four medically-oriented model checkpoints, including one trained on RadGraph, a radiology knowledge graph. Our results show that our metric correlates moderately to high with established metrics such as BERTscore, BLEU, and CheXbert scores. Furthermore, we demonstrate that one of our checkpoints exhibits a high correlation with human judgment, as assessed using the publicly available annotations of six board-certified radiologists, using a set of 200 reports. We also performed our own analysis gathering annotations with two radiologists on a collection of 100 reports. The results indicate the potential effectiveness of our method as a radiology-specific evaluation metric. The code, data, and model checkpoints to reproduce our findings will be publicly available.

Entity-Aspect-Opinion-Sentiment Quadruple Extraction for Fine-grained Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2311.16678
  • repo_url: None
  • paper_authors: Dan Ma, Jun Xu, Zongyu Wang, Xuezhi Cao, Yunsen Xian
  • for: Improve the comprehensiveness and fairness of aspect-based sentiment analysis (ABSA), which often overlooks implicit aspects and object-attribute co-existence.
  • methods: A new task, Entity-Aspect-Opinion-Sentiment Quadruple Extraction (EASQE), hierarchically decomposes aspect terms into entities and aspects to avoid information loss, non-exclusive annotations, and opinion misunderstandings; a two-stage sequence-tagging Trigger-Opinion framework serves as the baseline.
  • results: The Trigger-Opinion framework produces satisfactory EASQE results, can also be applied to other ABSA tasks, and significantly outperforms state-of-the-art methods.
    Abstract Product reviews often contain a large number of implicit aspects and object-attribute co-existence cases. Unfortunately, many existing studies in Aspect-Based Sentiment Analysis (ABSA) have overlooked this issue, which can make it difficult to extract opinions comprehensively and fairly. In this paper, we propose a new task called Entity-Aspect-Opinion-Sentiment Quadruple Extraction (EASQE), which aims to hierarchically decompose aspect terms into entities and aspects to avoid information loss, non-exclusive annotations, and opinion misunderstandings in ABSA tasks. To facilitate research in this new task, we have constructed four datasets (Res14-EASQE, Res15-EASQE, Res16-EASQE, and Lap14-EASQE) based on the SemEval Restaurant and Laptop datasets. We have also proposed a novel two-stage sequence-tagging based Trigger-Opinion framework as the baseline for the EASQE task. Empirical evaluations show that our Trigger-Opinion framework can generate satisfactory EASQE results and can also be applied to other ABSA tasks, significantly outperforming state-of-the-art methods. We have made the four datasets and source code of Trigger-Opinion publicly available to facilitate further research in this area.

A Distribution-Based Threshold for Determining Sentence Similarity

  • paper_url: http://arxiv.org/abs/2311.16675
  • repo_url: None
  • paper_authors: Gioele Cadamuro, Marco Gruppo
  • for: Solve a semantic textual similarity (STS) problem in which two sentences must be matched whose only distinguishing factor is highly specific information (names, addresses, identification codes), and derive a definition of when they are similar and when they are not.
  • methods: A siamese neural network produces the distributions of distances between similar and dissimilar sentence pairs; from these distributions a "threshold" is derived, a well-defined quantity that separates the vector distances of similar pairs from those of dissimilar pairs in new predictions and later analyses. The authors also develop a prediction-scoring scheme that combines features of the distributions with properties of the distance function.
  • results: Applying the system to a well-known benchmark dataset for STS problems shows that the results generalize, indicating the approach can transfer to a wider range of domains.
    Abstract We hereby present a solution to a semantic textual similarity (STS) problem in which it is necessary to match two sentences containing, as the only distinguishing factor, highly specific information (such as names, addresses, identification codes), and from which we need to derive a definition for when they are similar and when they are not. The solution revolves around the use of a neural network, based on the siamese architecture, to create the distributions of the distances between similar and dissimilar pairs of sentences. The goal of these distributions is to find a discriminating factor, that we call "threshold", which represents a well-defined quantity that can be used to distinguish vector distances of similar pairs from vector distances of dissimilar pairs in new predictions and later analyses. In addition, we developed a way to score the predictions by combining attributes from both the distributions' features and the way the distance function works. Finally, we generalize the results showing that they can be transferred to a wider range of domains by applying the system discussed to a well-known and widely used benchmark dataset for STS problems.
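Deriving the threshold itself is straightforward once the two distance distributions exist. A sketch with synthetic distances standing in for the siamese encoder's outputs; the grid-search criterion is one simple choice among several:

```python
# Pick the distance threshold that best separates similar from dissimilar pairs.
import numpy as np

rng = np.random.default_rng(0)
d_similar = rng.normal(0.3, 0.10, 2000)     # distances between similar pairs
d_dissimilar = rng.normal(0.8, 0.15, 2000)  # distances between dissimilar pairs

def best_threshold(pos, neg, grid=200):
    candidates = np.linspace(min(pos.min(), neg.min()),
                             max(pos.max(), neg.max()), grid)
    # Balanced accuracy if everything below t is called "similar".
    acc = [((pos < t).mean() + (neg >= t).mean()) / 2 for t in candidates]
    return candidates[int(np.argmax(acc))]

t = best_threshold(d_similar, d_dissimilar)
print(f"threshold = {t:.3f}")  # new pairs with distance < t are deemed similar
```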

Text2Tree: Aligning Text Representation to the Label Tree Hierarchy for Imbalanced Medical Classification

  • paper_url: http://arxiv.org/abs/2311.16650
  • repo_url: https://github.com/jyansir/text2tree
  • paper_authors: Jiahuan Yan, Haojun Gao, Zhang Kai, Weize Liu, Danny Chen, Jian Wu, Jintai Chen
  • for: Address the instability of deep learning on medical text classification, where samples are often extremely imbalanced and scarce.
  • methods: Text2Tree, a framework-agnostic algorithm, uses only the internal label hierarchy (the ICD code tree) when training deep models, embedding it into cascade attention modules for hierarchy-aware label representations; two new learning schemes, Similarity Surrogate Learning (SSL) and Dissimilarity Mixup Learning (DML), boost classification by reusing and distinguishing samples of other labels along the hierarchy.
  • results: On authoritative public datasets and real-world medical records, the method stably matches or exceeds classical and advanced imbalanced classification methods.
    Abstract Deep learning approaches exhibit promising performances on various text tasks. However, they are still struggling on medical text classification since samples are often extremely imbalanced and scarce. Different from existing mainstream approaches that focus on supplementary semantics with external medical information, this paper aims to rethink the data challenges in medical texts and present a novel framework-agnostic algorithm called Text2Tree that only utilizes internal label hierarchy in training deep learning models. We embed the ICD code tree structure of labels into cascade attention modules for learning hierarchy-aware label representations. Two new learning schemes, Similarity Surrogate Learning (SSL) and Dissimilarity Mixup Learning (DML), are devised to boost text classification by reusing and distinguishing samples of other labels following the label representation hierarchy, respectively. Experiments on authoritative public datasets and real-world medical records show that our approach stably achieves superior performances over classical and advanced imbalanced classification methods.

Scaling Political Texts with ChatGPT

  • paper_url: http://arxiv.org/abs/2311.16639
  • repo_url: None
  • paper_authors: Gaël Le Mens, Aina Gallego
  • for: Use GPT-4 to obtain position estimates of political texts in continuous ideological spaces.
  • methods: The approach is developed and validated by positioning British party manifestos on economic, social, and immigration policy dimensions, and tweets by members of the US Congress on the left-right ideological spectrum.
  • results: GPT-4 positions correlate at 93% or higher with expert placements for party manifestos, at 91% with crowdsourced estimates for individual tweets, and at 97% with roll-call-based estimates for senators of the 117th US Congress, making GPT-4-based ideological scaling fast, cost-efficient, and reliable.
    Abstract We use GPT-4 to obtain position estimates of political texts in continuous spaces. We develop and validate a new approach by positioning British party manifestos on the economic, social, and immigration policy dimensions and tweets by members of the US Congress on the left-right ideological spectrum. For the party manifestos, the correlation between the positions produced by GPT-4 and experts is 93% or higher, a performance similar to or better than that obtained with crowdsourced position estimates. For individual tweets, the positions obtained with GPT-4 achieve a correlation of 91% with crowdsourced position estimates. For senators of the 117th US Congress, the positions obtained with GPT-4 achieve a correlation of 97% with estimates based on roll call votes and of 96% with those based on campaign funding. Correlations are also substantial within party, indicating that position estimates produced with GPT-4 capture within-party differences between senators. Overall, using GPT-4 for ideological scaling is fast, cost-efficient, and reliable. This approach provides a viable alternative to scaling by both expert raters and crowdsourcing.
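Operationally, scaling of this kind boils down to prompting for a numeric position and parsing the reply. A hedged sketch; the prompt wording and the API call are illustrative, not the paper's exact protocol:

```python
# Ask a model for a numeric ideological position and parse it.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def position(text, dimension="left-right ideological", lo=0, hi=10):
    prompt = (f"On a {dimension} scale from {lo} to {hi}, where would you "
              f"place the author of this text? Answer with a single number.\n\n"
              f"Text: {text}")
    reply = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}])
    return float(reply.choices[0].message.content.strip())

print(position("We must cut taxes and shrink the federal government."))
```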

On the Long Range Abilities of Transformers

  • paper_url: http://arxiv.org/abs/2311.16620
  • repo_url: None
  • paper_authors: Itamar Zimerman, Lior Wolf
  • for: Examine why transformer architectures underperform on long-range tasks and propose modifications that close the gap.
  • methods: Drawing inspiration from layers designed for long-range dependencies, such as state-space layers, linear RNN layers, and global convolution layers, the authors make minimal modifications to the transformer architecture.
  • results: Experiments and theoretical analysis show that integrating two principles, an inductive bias toward smoothness and locality, significantly improves performance on the Long Range Arena (LRA) benchmark with negligible extra computation and no additional trainable parameters.
    Abstract Despite their dominance in modern DL and, especially, NLP domains, transformer architectures exhibit sub-optimal performance on long-range tasks compared to recent layers that are specifically designed for this purpose. In this work, drawing inspiration from key attributes of long-range layers, such as state-space layers, linear RNN layers, and global convolution layers, we demonstrate that minimal modifications to the transformer architecture can significantly enhance performance on the Long Range Arena (LRA) benchmark, thus narrowing the gap with these specialized layers. We identify that two key principles for long-range tasks are (i) incorporating an inductive bias towards smoothness, and (ii) locality. As we show, integrating these ideas into the attention mechanism improves results with a negligible amount of additional computation and without any additional trainable parameters. Our theory and experiments also shed light on the reasons for the inferior performance of transformers on long-range tasks and identify critical properties that are essential for successfully capturing long-range dependencies.
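The two principles can be read as small edits to vanilla attention: a locality bias that decays with distance, and a smoothing of the attention weights along the sequence. This is one illustrative interpretation of the abstract, not the paper's exact modification:

```python
# Vanilla attention plus (i) a distance-decay locality bias on the scores and
# (ii) an averaging of each query's weights over adjacent key positions.
import torch
import torch.nn.functional as F

def biased_smooth_attention(q, k, v, decay=0.05, kernel=3):
    L = q.size(0)
    scores = q @ k.T / q.size(-1) ** 0.5
    dist = (torch.arange(L)[:, None] - torch.arange(L)[None, :]).abs().float()
    scores = scores - decay * dist              # locality: nearer is stronger
    attn = F.softmax(scores, dim=-1)
    # Smoothness: average over neighbouring key positions, then renormalize.
    attn = F.avg_pool1d(attn.unsqueeze(0), kernel, stride=1,
                        padding=kernel // 2).squeeze(0)
    attn = attn / attn.sum(-1, keepdim=True)
    return attn @ v

L, D = 16, 8
print(biased_smooth_attention(torch.randn(L, D), torch.randn(L, D),
                              torch.randn(L, D)).shape)  # torch.Size([16, 8])
```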

MedGen: A Python Natural Language Processing Toolkit for Medical Text Processing

  • paper_url: http://arxiv.org/abs/2311.16588
  • repo_url: https://github.com/yale-lily/medgen
  • paper_authors: Rui Yang, Qingcheng Zeng, Keen You, Yujie Qiao, Lucas Huang, Chia-Chun Hsieh, Benjamin Rosand, Jeremy Goldwasser, Amisha D Dave, Tiarnan D. L. Keenan, Emily Y Chew, Dragomir Radev, Zhiyong Lu, Hua Xu, Qingyu Chen, Irene Li
  • for: Biomedical researchers and healthcare professionals who need an easy-to-use toolkit for medical text processing with minimal programming expertise.
  • methods: MedGen, a comprehensive natural language processing (NLP) toolkit, provides four advanced generative functions (question answering, text summarization, text simplification, and machine translation), 12 essential NLP functions (such as word tokenization and sentence segmentation), and user-friendly query and search capabilities over text corpora, with standardized functions integrated from third-party libraries.
  • results: The authors fine-tuned 32 domain-specific language models, evaluated them thoroughly on 24 established benchmarks, and conducted manual reviews with clinicians; the toolkit, models, and associated data are publicly available at https://github.com/Yale-LILY/MedGen.
    Abstract This study introduces MedGen, a comprehensive natural language processing (NLP) toolkit designed for medical text processing. MedGen is tailored for biomedical researchers and healthcare professionals with an easy-to-use, all-in-one solution that requires minimal programming expertise. It includes (1) Generative Functions: For the first time, MedGen includes four advanced generative functions: question answering, text summarization, text simplification, and machine translation; (2) Basic NLP Functions: MedGen integrates 12 essential NLP functions such as word tokenization and sentence segmentation; and (3) Query and Search Capabilities: MedGen provides user-friendly query and search functions on text corpora. We fine-tuned 32 domain-specific language models, evaluated them thoroughly on 24 established benchmarks and conducted manual reviews with clinicians. Additionally, we expanded our toolkit by introducing query and search functions, while also standardizing and integrating functions from third-party libraries. The toolkit, its models, and associated data are publicly available via https://github.com/Yale-LILY/MedGen.

Recognizing Conditional Causal Relationships about Emotions and Their Corresponding Conditions

  • paper_url: http://arxiv.org/abs/2311.16579
  • repo_url: None
  • paper_authors: Xinhong Chen, Zongxi Li, Yaowei Wang, Haoran Xie, Jianping Wang, Qing Li
  • for: This paper studies recognizing causal relationships between emotions and causes in text, emphasizing the specific context clauses that participate in such causal relationships.
  • methods: The paper proposes a new task of determining whether an emotion-cause pair has a valid causal relationship under different context clauses, using a manually annotated benchmark dataset to obtain labels. The authors further propose an end-to-end multi-task framework to handle the two goals of the task, including a context masking module and a prediction aggregation module.
  • results: Extensive experiments and ablation studies demonstrate the effectiveness and generality of the proposed method.
    Abstract The study of causal relationships between emotions and causes in texts has recently received much attention. Most works focus on extracting causally related clauses from documents. However, none of these works has considered that the causal relationships among the extracted emotion and cause clauses can only be valid under some specific context clauses. To highlight the context in such special causal relationships, we propose a new task to determine whether or not an input pair of emotion and cause has a valid causal relationship under different contexts and extract the specific context clauses that participate in the causal relationship. Since the task is new for which no existing dataset is available, we conduct manual annotation on a benchmark dataset to obtain the labels for our tasks and the annotations of each context clause's type that can also be used in some other applications. We adopt negative sampling to construct the final dataset to balance the number of documents with and without causal relationships. Based on the constructed dataset, we propose an end-to-end multi-task framework, where we design two novel and general modules to handle the two goals of our task. Specifically, we propose a context masking module to extract the context clauses participating in the causal relationships. We propose a prediction aggregation module to fine-tune the prediction results according to whether the input emotion and causes depend on specific context clauses. Results of extensive comparative experiments and ablation studies demonstrate the effectiveness and generality of our proposed framework.
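The context masking idea can be sketched independently of the full architecture: score the emotion-cause pair once per candidate context clause with that clause removed, and treat clauses whose removal changes the score most as participants in the causal relationship. The scorer below is a toy stand-in; the paper trains this end to end together with a prediction aggregation module.

```python
def context_importance(clauses, score_fn):
    """Drop in causal-validity score when each context clause is masked out.

    `score_fn(clauses) -> float` is a stand-in for a trained model scoring
    whether the emotion-cause pair is causally valid given these clauses.
    """
    base = score_fn(clauses)
    return {c: base - score_fn([x for x in clauses if x != c]) for c in clauses}

# Toy scorer: causal validity needs the word "after" somewhere in context.
toy_score = lambda cs: 0.9 if any("after" in c for c in cs) else 0.2

context = ["She heard the news after dinner.", "The room was cold."]
print(context_importance(context, toy_score))
# {'She heard the news after dinner.': 0.7, 'The room was cold.': 0.0}
```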

Evaluation of dynamic characteristics of power grid based on GNN and application on knowledge graph

  • paper_url: http://arxiv.org/abs/2311.16522
  • repo_url: None
  • paper_authors: Hao Pei, Si Lin, Chuanfu Li, Che Wang, Haoming Chen, Sizhe Li
  • for: This paper aims to improve intelligent fault diagnosis for power grid operation and maintenance.
  • methods: The method uses a graph neural network (GNN) to detect faulty nodes in the power grid, combining a specialized electrical feature extraction model with a knowledge graph. It also leverages temporal data, using node states from preceding and subsequent time periods to aid current fault detection.
  • results: Experiments show that the GNN accurately locates fault nodes in simulated scenarios with 99.53% accuracy. The network's feature modeling also enables a qualitative analysis of how faults spread across nodes, providing valuable insights for analyzing fault nodes.
    Abstract A novel method for detecting faults in power grids using a graph neural network (GNN) has been developed, aimed at enhancing intelligent fault diagnosis in network operation and maintenance. This GNN-based approach identifies faulty nodes within the power grid through a specialized electrical feature extraction model coupled with a knowledge graph. Incorporating temporal data, the method leverages the status of nodes from preceding and subsequent time periods to aid in current fault detection. To validate the effectiveness of this GNN in extracting node features, a correlation analysis of the output features from each node within the neural network layer was conducted. The results from experiments show that this method can accurately locate fault nodes in simulated scenarios with a remarkable 99.53% accuracy. Additionally, the graph neural network's feature modeling allows for a qualitative examination of how faults spread across nodes, providing valuable insights for analyzing fault nodes.
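A minimal sketch of the two ingredients the abstract combines, temporal node states and graph message passing, is shown below; the electrical feature extraction model and the knowledge graph component are omitted, and all dimensions are illustrative.

```python
import numpy as np

def gcn_layer(adj, x, w):
    """One symmetric-normalized graph convolution: A_hat @ X @ W, then ReLU."""
    a = adj + np.eye(adj.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    a_hat = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_hat @ x @ w, 0.0)

rng = np.random.default_rng(0)
n_nodes, d = 10, 4
adj = (rng.random((n_nodes, n_nodes)) < 0.3).astype(float)
adj = np.maximum(adj, adj.T)                       # undirected grid graph

# Temporal context: concatenate each node's state at t-1, t, t+1.
states = {t: rng.standard_normal((n_nodes, d)) for t in (-1, 0, 1)}
x = np.concatenate([states[-1], states[0], states[1]], axis=1)  # (n, 3d)

w1 = rng.standard_normal((3 * d, 8)) * 0.1
w2 = rng.standard_normal((8, 1)) * 0.1
h = gcn_layer(adj, x, w1)
fault_score = 1.0 / (1.0 + np.exp(-(h @ w2)))      # per-node fault probability
print(fault_score.ravel().round(2))
```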

StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models

  • paper_url: http://arxiv.org/abs/2311.16509
  • repo_url: None
  • paper_authors: Kazuki Yamauchi, Yusuke Ijima, Yuki Saito
  • for: Generating natural language descriptions of the speaking styles that appear in speech.
  • methods: Uses paired speech and natural language description data to predict prefix vectors, computed from a speech representation vector, which are fed into a large language model (LLM)-based text decoder for caption generation.
  • results: Leveraging richer LLMs for the text decoder, speech self-supervised learning (SSL) features, and sentence rephrasing augmentation improves the accuracy and diversity of the generated speaking-style captions.
    Abstract We propose StyleCap, a method to generate natural language descriptions of speaking styles appearing in speech. Although most of conventional techniques for para-/non-linguistic information recognition focus on the category classification or the intensity estimation of pre-defined labels, they cannot provide the reasoning of the recognition result in an interpretable manner. As a first step towards an end-to-end method for generating speaking-style prompts from speech, i.e., automatic speaking-style captioning, StyleCap uses paired data of speech and natural language descriptions to train neural networks that predict prefix vectors fed into a large language model (LLM)-based text decoder from a speech representation vector. We explore an appropriate text decoder and speech feature representation suitable for this new task. The experimental results demonstrate that our StyleCap leveraging richer LLMs for the text decoder, speech self-supervised learning (SSL) features, and sentence rephrasing augmentation improves the accuracy and diversity of generated speaking-style captions. Samples of speaking-style captions generated by our StyleCap are publicly available.

Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine

  • paper_url: http://arxiv.org/abs/2311.16452
  • repo_url: None
  • paper_authors: Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, Hoifung Poon, Tao Qin, Naoto Usuyama, Chris White, Eric Horvitz
  • for: This paper explores the capabilities of GPT-4, a generalist foundation model, in the medical domain, and how prompt engineering can unlock its specialist capabilities.
  • methods: The paper performs a systematic exploration of prompt engineering, composing several general-purpose prompting strategies into Medprompt, without relying on domain expertise or expert-curated content.
  • results: With Medprompt, GPT-4 achieves state-of-the-art results on all nine benchmark datasets in the MultiMedQA suite, outperforming specialist models such as Med-PaLM 2 by a significant margin, including a 27% reduction in error rate on MedQA. Medprompt also generalizes to other domains, with strong results on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.
    Abstract Generalist foundation models such as GPT-4 have displayed surprising capabilities in a wide variety of domains and tasks. Yet, there is a prevalent assumption that they cannot match specialist capabilities of fine-tuned models. For example, most explorations to date on medical competency benchmarks have leveraged domain-specific training, as exemplified by efforts on BioGPT and Med-PaLM. We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompting to highlight the model's out-of-the-box capabilities, we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. The prompting methods we explore are general purpose, and make no specific use of domain expertise, removing the need for expert-curated content. Our experimental design carefully controls for overfitting during the prompt engineering process. We introduce Medprompt, based on a composition of several prompting strategies. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite. The method outperforms leading specialist models such as Med-PaLM 2 by a significant margin with an order of magnitude fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27% reduction in error rate on the MedQA dataset over the best methods to date achieved with specialist models and surpasses a score of 90% for the first time. Beyond medical problems, we show the power of Medprompt to generalize to other domains and provide evidence for the broad applicability of the approach via studies of the strategy on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.
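One of the general-purpose strategies composed into Medprompt, choice shuffling, is easy to illustrate: the multiple-choice options are permuted across ensemble runs so the answer cannot ride on positional bias, and the re-mapped answers are majority-voted. The sketch below assumes a generic query_model callable standing in for an LLM API; it is a simplified rendering, not the authors' implementation.

```python
import random
from collections import Counter

def choice_shuffle_ensemble(question, options, query_model, k=5, seed=0):
    """Majority-vote an LLM over k runs with shuffled answer options.

    `query_model(prompt) -> int` is a hypothetical stand-in that returns
    the index of the chosen option in the prompt it was shown.
    """
    rng = random.Random(seed)
    votes = []
    for _ in range(k):
        order = list(range(len(options)))
        rng.shuffle(order)
        prompt = question + "\n" + "\n".join(
            f"{chr(65 + i)}. {options[j]}" for i, j in enumerate(order)
        )
        picked = query_model(prompt)          # index into the shuffled list
        votes.append(order[picked])           # map back to the original option
    return Counter(votes).most_common(1)[0][0]

# Toy stand-in model that always picks the longest option text.
demo = lambda prompt: max(
    range(len(prompt.split("\n")) - 1),
    key=lambda i: len(prompt.split("\n")[i + 1]),
)
idx = choice_shuffle_ensemble("Which drug class lowers LDL?",
                              ["Statins, via HMG-CoA reductase inhibition",
                               "Beta blockers", "Loop diuretics"], demo)
print(idx)  # 0
```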

Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos

  • paper_url: http://arxiv.org/abs/2311.16444
  • repo_url: None
  • paper_authors: Takehiko Ohkawa, Takuma Yagi, Taichi Nishimura, Ryosuke Furuta, Atsushi Hashimoto, Yoshitaka Ushiku, Yoichi Sato
  • for: This paper proposes a new benchmark for transferring dense video captioning models from exocentric web instructional videos to egocentric videos. The transfer is motivated by the scarcity of egocentric video data, whereas web video data is abundant; the paper addresses this via transfer learning between the two.
  • methods: The paper proposes a view-invariant learning method that bridges the view gap between web videos and egocentric videos through adversarial training in both the pre-training and fine-tuning stages.
  • results: Experiments show that the method effectively transfers knowledge to the egocentric domain and mitigates the view gap. The paper also introduces a real-life egocentric dataset (EgoYC2), whose captions are shared with YouCook2, to facilitate further research on egocentric video captioning.
    Abstract We propose a novel benchmark for cross-view knowledge transfer of dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and their captions) is primarily studied with exocentric videos (e.g., YouCook2), benchmarks with egocentric videos are restricted due to data scarcity. To overcome the limited video availability, transferring knowledge from abundant exocentric web videos is demanded as a practical approach. However, learning the correspondence between exocentric and egocentric views is difficult due to their dynamic view changes. The web videos contain mixed views focusing on either human body actions or close-up hand-object interactions, while the egocentric view is constantly shifting as the camera wearer moves. This necessitates the in-depth study of cross-view transfer under complex view changes. In this work, we first create a real-life egocentric dataset (EgoYC2) whose captions are shared with YouCook2, enabling transfer learning between these datasets assuming their ground-truth is accessible. To bridge the view gaps, we propose a view-invariant learning method using adversarial training in both the pre-training and fine-tuning stages. While the pre-training is designed to learn invariant features against the mixed views in the web videos, the view-invariant fine-tuning further mitigates the view gaps between both datasets. We validate our proposed method by studying how effectively it overcomes the view change problem and efficiently transfers the knowledge to the egocentric domain. Our benchmark pushes the study of the cross-view transfer into a new task domain of dense video captioning and will envision methodologies to describe egocentric videos in natural language.
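Adversarial view-invariant training of this kind is commonly implemented with a gradient reversal layer: a discriminator learns to tell exocentric from egocentric features while the feature encoder receives the negated gradient, pushing it toward view-invariant representations. The PyTorch sketch below shows that generic mechanism; treating the paper's adversarial setup as following this pattern is an assumption, not a transcription of its code.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity forward; negates (and scales) gradients on the way back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
view_disc = nn.Linear(64, 2)            # exocentric (0) vs egocentric (1)

feats = encoder(torch.randn(8, 128))
logits = view_disc(GradReverse.apply(feats, 1.0))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 2, (8,)))
loss.backward()                          # encoder gradients are reversed
```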

PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2311.17086
  • repo_url: https://github.com/OPPO-Mente-Lab/PEA-Diffusion
  • paper_authors: Jian Ma, Chen Chen, Qingsong Xie, Haonan Lu
  • for: This paper proposes a simple, plug-and-play language transfer method to extend text-to-image generation to non-English languages.
  • methods: The method is based on knowledge distillation, training a lightweight MLP-like parameter-efficient adapter (PEA) under teacher knowledge distillation to achieve language transfer.
  • results: The method reduces the amount of training data required and enables cross-lingual text-to-image generation. Code will be available at: https://github.com/OPPO-Mente-Lab/PEA-Diffusion.
    Abstract Text-to-image diffusion models are well-known for their ability to generate realistic images based on textual prompts. However, the existing works have predominantly focused on English, lacking support for non-English text-to-image models. The most commonly used translation methods cannot solve the generation problem related to language culture, while training from scratch on a specific language dataset is prohibitively expensive. In this paper, we are inspired to propose a simple plug-and-play language transfer method based on knowledge distillation. All we need to do is train a lightweight MLP-like parameter-efficient adapter (PEA) with only 6M parameters under teacher knowledge distillation along with a small parallel data corpus. We are surprised to find that freezing the parameters of UNet can still achieve remarkable performance on the language-specific prompt evaluation set, demonstrating that PEA can stimulate the potential generation ability of the original UNet. Additionally, it closely approaches the performance of the English text-to-image model on a general prompt evaluation set. Furthermore, our adapter can be used as a plugin to achieve significant results in downstream tasks in cross-lingual text-to-image generation. Code will be available at: https://github.com/OPPO-Mente-Lab/PEA-Diffusion
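The distillation setup can be sketched as follows: a frozen teacher pipeline produces text features from the English side of a small parallel corpus, a frozen multilingual encoder produces features from the non-English side, and only a small MLP-like adapter is trained to align the two. Dimensions and the loss choice below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# A minimal MLP-like adapter in the spirit of PEA; the widths here are
# illustrative, not the paper's 6M-parameter configuration.
class Adapter(nn.Module):
    def __init__(self, d_in, d_out, d_hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_out)
        )

    def forward(self, x):
        return self.net(x)

adapter = Adapter(d_in=768, d_out=1024)

# Stand-ins for frozen encoders on a small parallel corpus: teacher
# features from the English text encoder on the English side, student
# features from a multilingual encoder on the non-English side.
teacher_feats = torch.randn(32, 77, 1024)   # frozen English pipeline
student_feats = torch.randn(32, 77, 768)    # frozen multilingual encoder

loss = nn.functional.mse_loss(adapter(student_feats), teacher_feats)
loss.backward()    # only the adapter's parameters receive gradients
```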

CDEval: A Benchmark for Measuring the Cultural Dimensions of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.16421
  • repo_url: https://github.com/astrodrew/cdeval
  • paper_authors: Yuhang Wang, Yanxu Zhu, Chao Kong, Shuyu Wei, Xiaoyuan Yi, Xing Xie, Jitao Sang
  • for: This study proposes CDEval, a new benchmark for evaluating the cultural dimensions of large language models (LLMs), so that cultural factors are better accounted for in LLM development and evaluation.
  • methods: CDEval is constructed by combining GPT-4's automated generation with human verification, covering six cultural dimensions across seven domains.
  • results: Experiments reveal both consistencies and variations in the cultural dimensions of mainstream LLMs, underscoring the importance of integrating cultural considerations when developing and evaluating LLMs for diverse cultural settings.
    Abstract As the scaling of Large Language Models (LLMs) has dramatically enhanced their capabilities, there has been a growing focus on the alignment problem to ensure their responsible and ethical use. While existing alignment efforts predominantly concentrate on universal values such as the HHH principle, the aspect of culture, which is inherently pluralistic and diverse, has not received adequate attention. This work introduces a new benchmark, CDEval, aimed at evaluating the cultural dimensions of LLMs. CDEval is constructed by incorporating both GPT-4's automated generation and human verification, covering six cultural dimensions across seven domains. Our comprehensive experiments provide intriguing insights into the culture of mainstream LLMs, highlighting both consistencies and variations across different dimensions and domains. The findings underscore the importance of integrating cultural considerations in LLM development, particularly for applications in diverse cultural settings. Through CDEval, we aim to broaden the horizon of LLM alignment research by including cultural dimensions, thus providing a more holistic framework for the future development and evaluation of LLMs. This benchmark serves as a valuable resource for cultural studies in LLMs, paving the way for more culturally aware and sensitive models.

cs.LG - 2023-11-28

LiveTune: Dynamic Parameter Tuning for Training Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2311.17279
  • repo_url: None
  • paper_authors: Soheil Zibakhsh Shabgahi, Nojan Sheybani, Aiden Tabrizi, Farinaz Koushanfar
  • for: Enabling real-time hyperparameter tuning during machine learning training, reducing the overhead of restarts and checkpoints.
  • methods: Uses LiveVariables to adjust parameters in real time without restarting the training session.
  • results: Evaluations show that the LiveTune framework saves up to 60 seconds of training time and 5.4 kilojoules of energy per hyperparameter change.
    Abstract Traditional machine learning training is a static process that lacks real-time adaptability of hyperparameters. Popular tuning solutions during runtime involve checkpoints and schedulers. Adjusting hyperparameters usually requires the program to be restarted, wasting compute utilization and time, while placing unnecessary strain on memory and processors. We present LiveTune, a new framework allowing real-time parameter tuning during training through LiveVariables. LiveVariables allow for a continuous training session by storing parameters on designated ports on the system, allowing them to be dynamically adjusted. Extensive evaluations of our framework show saving up to 60 seconds and 5.4 kilojoules of energy per hyperparameter change.
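The abstract's mechanism, parameters stored on designated ports and adjusted while training continues, can be approximated with a small listener thread. The sketch below is a generic reconstruction of the idea, not LiveTune's actual API: a background thread accepts new values on a TCP port and the training loop reads the current value at each step.

```python
import socket
import threading

class LiveVariable:
    """A value that can be updated over a TCP port while training runs.

    Generic sketch of the idea, not LiveTune's implementation.
    Send a new value with e.g.:  echo 0.001 | nc localhost 5000
    """
    def __init__(self, value, port):
        self.value = value
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("localhost", port))
        srv.listen(1)
        threading.Thread(target=self._serve, args=(srv,), daemon=True).start()

    def _serve(self, srv):
        while True:
            conn, _ = srv.accept()
            with conn:
                data = conn.recv(64).decode().strip()
                if data:
                    self.value = float(data)

lr = LiveVariable(0.01, port=5000)
for step in range(10**6):
    pass  # ... forward/backward, then: params -= lr.value * grads
```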

An Online Optimization-Based Decision Support Tool for Small Farmers in India: Learning in Non-stationary Environments

  • paper_url: http://arxiv.org/abs/2311.17277
  • repo_url: None
  • paper_authors: Tuxun Lu, Aviva Prins
  • for: Helping small farmers in India manage their crops, especially under the impacts of climate change on agricultural productivity.
  • methods: Models an individual greenhouse as a Markov Decision Process (MDP) and adapts Li and Li (2019)'s Follow the Weighted Leader (FWL) online learning algorithm to offer crop planning advice.
  • results: Successfully produces utility-preserving cropping pattern suggestions in simulations, achieving the same cumulative revenue as an offline planning algorithm with greatly reduced runtime.
    Abstract Crop management decision support systems are specialized tools for farmers that reduce the riskiness of revenue streams, especially valuable for use under the current climate changes that impact agricultural productivity. Unfortunately, small farmers in India, who could greatly benefit from these tools, do not have access to them. In this paper, we model an individual greenhouse as a Markov Decision Process (MDP) and adapt Li and Li (2019)'s Follow the Weighted Leader (FWL) online learning algorithm to offer crop planning advice. We successfully produce utility-preserving cropping pattern suggestions in simulations. When we compare against an offline planning algorithm, we achieve the same cumulative revenue with greatly reduced runtime.
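Follow the Weighted Leader style algorithms maintain weights over candidate plans and shift mass toward plans that earned more in past seasons, which suits non-stationary revenue streams. The sketch below shows the generic multiplicative-weights pattern such methods build on, with crop plans as the experts; the exact FWL update of Li and Li (2019) differs in detail.

```python
import numpy as np

def weighted_leader_step(weights, rewards, eta=0.5):
    """Multiplicative-weights update: upweight plans that earned more.

    `eta` is an illustrative learning rate; the specific FWL update of
    Li and Li (2019) is not reproduced here.
    """
    w = weights * np.exp(eta * rewards)
    return w / w.sum()

plans = ["tomato", "okra", "chili"]           # hypothetical crop plans
w = np.ones(len(plans)) / len(plans)
rng = np.random.default_rng(1)
for season in range(20):                      # non-stationary rewards
    rewards = rng.normal(loc=[0.2, 0.5, 0.1], scale=0.2)
    w = weighted_leader_step(w, rewards)
print(dict(zip(plans, w.round(3))))           # mass concentrates on "okra"
```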

SoUnD Framework: Analyzing (So)cial Representation in (Un)structured (D)ata

  • paper_url: http://arxiv.org/abs/2311.17259
  • repo_url: None
  • paper_authors: Mark Díaz, Sunipa Dev, Emily Reif, Remi Denton, Vinodkumar Prabhakaran
  • for: This work provides a framework for systematically analyzing the unstructured data used in foundation model development, to inform data use and documentation decisions.
  • methods: A framework designed to guide analysis of how people are represented in unstructured data and to identify downstream risks.
  • results: The framework is applied in two toy examples using the Common Crawl web text corpus (C4) and LAION-400M, and a set of hypothetical action steps is proposed in service of dataset use, development, and documentation.
    Abstract The unstructured nature of data used in foundation model development is a challenge to systematic analyses for making data use and documentation decisions. From a Responsible AI perspective, these decisions often rely upon understanding how people are represented in data. We propose a framework designed to guide analysis of human representation in unstructured data and identify downstream risks. We apply the framework in two toy examples using the Common Crawl web text corpus (C4) and LAION-400M. We also propose a set of hypothetical action steps in service of dataset use, development, and documentation.

Fourier Neural Differential Equations for learning Quantum Field Theories

  • paper_url: http://arxiv.org/abs/2311.17250
  • repo_url: https://github.com/2357e2/fnde
  • paper_authors: Isaac Brant, Alexander Norcliffe, Pietro Liò
  • for: This paper uses Neural Differential Equations (NDEs) to learn particle scattering matrices, and introduces a new Fourier Neural Differential Equation (FNDE) model to improve generalizability.
  • methods: NDE models are used to learn $\phi^4$ theory, Scalar-Yukawa theory, and Scalar Quantum Electrodynamics. By learning particle scattering matrices, the interaction Hamiltonian of a theory can be extracted from the network parameters.
  • results: The FNDE model, which combines NDE integration with Fourier network convolution, demonstrates better generalizability than the non-integrated FNO equivalent, and training on scattering data allows the interaction Hamiltonian to be recovered from the network parameters.
    Abstract A Quantum Field Theory is defined by its interaction Hamiltonian, and linked to experimental data by the scattering matrix. The scattering matrix is calculated as a perturbative series, and represented succinctly as a first order differential equation in time. Neural Differential Equations (NDEs) learn the time derivative of a residual network's hidden state, and have proven efficacy in learning differential equations with physical constraints. Hence using an NDE to learn particle scattering matrices presents a possible experiment-theory phenomenological connection. In this paper, NDE models are used to learn $\phi^4$ theory, Scalar-Yukawa theory and Scalar Quantum Electrodynamics. A new NDE architecture is also introduced, the Fourier Neural Differential Equation (FNDE), which combines NDE integration and Fourier network convolution. The FNDE model demonstrates better generalisability than the non-integrated equivalent FNO model. It is also shown that by training on scattering data, the interaction Hamiltonian of a theory can be extracted from network parameters.
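A minimal way to see the FNDE mechanics: the vector field applied at each integration step multiplies the state's low Fourier modes by learned complex weights (a spectral convolution), and an Euler loop integrates the result. This is a schematic of the idea, not the paper's architecture.

```python
import numpy as np

def spectral_vector_field(u, w_modes):
    """Fourier-space convolution: scale the lowest modes by learned weights."""
    u_hat = np.fft.rfft(u)
    k = len(w_modes)
    u_hat[:k] = u_hat[:k] * w_modes      # learned complex multipliers
    u_hat[k:] = 0.0                      # truncate high frequencies
    return np.fft.irfft(u_hat, n=len(u))

def fnde_integrate(u0, w_modes, t1=1.0, steps=50):
    """Euler integration of du/dt = spectral_conv(u)."""
    u, dt = u0.copy(), t1 / steps
    for _ in range(steps):
        u = u + dt * spectral_vector_field(u, w_modes)
    return u

rng = np.random.default_rng(0)
u0 = rng.standard_normal(64)
w = (rng.standard_normal(8) + 1j * rng.standard_normal(8)) * 0.1
print(fnde_integrate(u0, w)[:4])
```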

Invariance assumptions for class distribution estimation

  • paper_url: http://arxiv.org/abs/2311.17225
  • repo_url: None
  • paper_authors: Dirk Tasche
  • for: This paper addresses class distribution estimation under dataset shift.
  • methods: Invariance assumptions between the training joint distribution of features and labels and the test distribution are used to make the estimation task tractable. Specifically, the paper considers covariate shift, factorizable joint shift, and sparse joint shift, and discusses their implications for class distribution estimation.
  • results: Such invariance assumptions considerably facilitate estimating the distribution of class labels, i.e. the class prior probabilities, in the test dataset.
    Abstract We study the problem of class distribution estimation under dataset shift. On the training dataset, both features and class labels are observed while on the test dataset only the features can be observed. The task then is the estimation of the distribution of the class labels, i.e. the estimation of the class prior probabilities, in the test dataset. Assumptions of invariance between the training joint distribution of features and labels and the test distribution can considerably facilitate this task. We discuss the assumptions of covariate shift, factorizable joint shift, and sparse joint shift and their implications for class distribution estimation.
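Under the label-shift counterpart of such assumptions (p(x|y) invariant between training and test), test-set class priors can be estimated from a classifier's behavior alone by solving C q = mu, where C is the classifier's confusion matrix on held-out training data and mu the distribution of its predictions on the test set. This "black-box shift estimation" style computation is one standard instance of exploiting an invariance assumption; the paper's treatment is broader.

```python
import numpy as np

def estimate_test_priors(y_val, yhat_val, yhat_test, n_classes):
    """Estimate test class priors q by solving C q = mu (label shift).

    C[i, j] = P(predict i, true j) on validation; mu[i] = P(predict i) on test.
    """
    C = np.zeros((n_classes, n_classes))
    for t, p in zip(y_val, yhat_val):
        C[p, t] += 1
    C /= len(y_val)
    mu = np.bincount(yhat_test, minlength=n_classes) / len(yhat_test)
    q = np.linalg.solve(C, mu)
    q = np.clip(q, 0, None)
    return q / q.sum()                     # project back onto the simplex

# Toy check: a perfect classifier recovers the shifted test priors exactly.
y_val = np.array([0] * 50 + [1] * 50); yhat_val = y_val.copy()
yhat_test = np.array([0] * 30 + [1] * 70)
print(estimate_test_priors(y_val, yhat_val, yhat_test, 2))  # ~[0.3, 0.7]
```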

Optimal EEG Electrode Set for Emotion Recognition From Brain Signals: An Empirical Quest

  • paper_url: http://arxiv.org/abs/2311.17204
  • repo_url: None
  • paper_authors: Rumman Ahmed Prodhan, Sumya Akter, Tanmoy Sarkar Pias, Md. Akhtaruzzaman Adnan
  • for: This paper aims to empirically analyze the contribution of each part of the brain in exhibiting emotions.
  • methods: The authors use the DEAP dataset to find the most optimal electrode set and use Fast Fourier Transformation for effective feature extraction, as well as a 1D-CNN with residual connection for classification.
  • results: The study finds that 12 electrodes (F7, P8, O1, F8, C4, T7, PO3, Fp1, Fp2, O2, P3, and Fz) achieve 95.81% accuracy in recognizing emotions, and that the frontal lobe is the most important for recognizing emotion. Additionally, the authors find that adding more than 10 electrodes does not improve performance significantly.
    Abstract The human brain is a complex organ, still not fully understood, that controls almost all parts of the body. Apart from survival, the human brain stimulates emotions. Recent research indicates that brain signals can be very effective for emotion recognition. However, which parts of the brain exhibit most of the emotions is still under-explored. In this study, we empirically analyze the contribution of each part of the brain in exhibiting emotions. We use the DEAP dataset to find the most optimal electrode set, which eventually leads to the effective brain part associated with emotions. We use Fast Fourier Transformation for effective feature extraction and a 1D-CNN with residual connection for classification. Though 32 electrodes from the DEAP dataset got an accuracy of 97.34%, only 12 electrodes (F7, P8, O1, F8, C4, T7, PO3, Fp1, Fp2, O2, P3, and Fz) achieve 95.81% accuracy. This study also shows that adding more than 10 electrodes does not improve performance significantly. Moreover, the frontal lobe is the most important for recognizing emotion.
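The feature extraction step described here, an FFT over EEG channels, typically reduces to band power in the canonical frequency bands. The sketch below computes per-channel band powers for the 12-electrode subset named in the abstract; the band edges and the 128 Hz sampling rate (DEAP's preprocessed rate) are standard conventions rather than details taken from the paper.

```python
import numpy as np

BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}
ELECTRODES = ["F7", "P8", "O1", "F8", "C4", "T7",
              "PO3", "Fp1", "Fp2", "O2", "P3", "Fz"]   # 12-channel subset

def band_powers(eeg, fs=128):
    """eeg: (channels, samples). Returns (channels, n_bands) band powers."""
    freqs = np.fft.rfftfreq(eeg.shape[1], d=1 / fs)
    psd = np.abs(np.fft.rfft(eeg, axis=1)) ** 2
    feats = [psd[:, (freqs >= lo) & (freqs < hi)].sum(axis=1)
             for lo, hi in BANDS.values()]
    return np.stack(feats, axis=1)

rng = np.random.default_rng(0)
eeg = rng.standard_normal((len(ELECTRODES), 128 * 60))   # 60 s of signal
feats = band_powers(eeg)
print(feats.shape)   # (12, 4) -> flattened input to a 1D-CNN classifier
```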

A personalized Uncertainty Quantification framework for patient survival models: estimating individual uncertainty of patients with metastatic brain tumors in the absence of ground truth

  • paper_url: http://arxiv.org/abs/2311.17173
  • repo_url: None
  • paper_authors: Yuqi Wang, Aarzu Gupta, David Carpenter, Trey Mullikin, Zachary J. Reitman, Scott Floyd, John Kirkpatrick, Joseph K. Salama, Paul W. Sperduto, Jian-Guo Liu, Mustafa R. Bashir, Kyle J. Lafata
  • for: This paper proposes an uncertainty quantification (UQ) framework to estimate the uncertainty of patient survival models in the absence of ground truth.
  • methods: The approach is developed and evaluated on a dataset of 1383 patients treated with stereotactic radiosurgery (SRS) for brain metastases between January 2015 and December 2020, using a variety of statistical and non-statistical models, including CoxPH, conditional survival forest (CSF), and neural multi-task linear regression (NMTLR).
  • results: All models showed the lowest uncertainty on intracranial progression (ICP, 2.21%) and the highest uncertainty on intracranial progression and/or death (ICPD, 17.28%). OS models showed high variation in uncertainty performance, with NMTLR having the lowest uncertainty (1.96%) and CSF the highest (14.29%). The method can therefore estimate the uncertainty of individual patient survival modeling results.
    Abstract To develop a novel Uncertainty Quantification (UQ) framework to estimate the uncertainty of patient survival models in the absence of ground truth, we developed and evaluated our approach based on a dataset of 1383 patients treated with stereotactic radiosurgery (SRS) for brain metastases between January 2015 and December 2020. Our motivating hypothesis is that a time-to-event prediction of a test patient on inference is more certain given a higher feature-space-similarity to patients in the training set. Therefore, the uncertainty for a particular patient-of-interest is represented by the concordance index between a patient similarity rank and a prediction similarity rank. Model uncertainty was defined as the increased percentage of the max uncertainty-constrained-AUC compared to the model AUC. We evaluated our method on multiple clinically-relevant endpoints, including time to intracranial progression (ICP), progression-free survival (PFS) after SRS, overall survival (OS), and time to ICP and/or death (ICPD), on a variety of both statistical and non-statistical models, including CoxPH, conditional survival forest (CSF), and neural multi-task linear regression (NMTLR). Our results show that all models had the lowest uncertainty on ICP (2.21%) and the highest uncertainty (17.28%) on ICPD. OS models demonstrated high variation in uncertainty performance, where NMTLR had the lowest uncertainty (1.96%) and CSF had the highest uncertainty (14.29%). In conclusion, our method can estimate the uncertainty of individual patient survival modeling results. As expected, our data empirically demonstrate that as model uncertainty measured via our technique increases, the similarity between a feature-space and its predicted outcome decreases.
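The per-patient uncertainty described above is a concordance index between two rankings of training patients: one by feature-space similarity to the test patient, one by similarity of predicted outcome. A small sketch of that pairwise-agreement computation, with both similarity measures left deliberately generic:

```python
import numpy as np
from itertools import combinations

def concordance(rank_a, rank_b):
    """Fraction of pairs ordered the same way by both rankings."""
    pairs = list(combinations(range(len(rank_a)), 2))
    agree = sum(
        (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) > 0 for i, j in pairs
    )
    return agree / len(pairs)

rng = np.random.default_rng(0)
X_train = rng.standard_normal((50, 10))
pred_train = rng.standard_normal(50)          # model outputs on training set
x_test, pred_test = rng.standard_normal(10), 0.3

# Rank training patients by feature similarity and by prediction similarity.
feat_rank = np.argsort(np.argsort(np.linalg.norm(X_train - x_test, axis=1)))
pred_rank = np.argsort(np.argsort(np.abs(pred_train - pred_test)))
certainty = concordance(feat_rank, pred_rank)  # higher = more certain
print(round(certainty, 3))
```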

Fast Particle-based Anomaly Detection Algorithm with Variational Autoencoder

  • paper_url: http://arxiv.org/abs/2311.17162
  • repo_url: https://github.com/ryanliu30/fastanomalydetection
  • paper_authors: Ryan Liu, Abhijith Gandrakota, Jennifer Ngadiuba, Maria Spiropulu, Jean-Roch Vlimant
  • for: This paper explores a new method for model-agnostic anomaly detection in the search for physics beyond the Standard Model.
  • methods: A particle-based variational autoencoder (VAE) anomaly detection algorithm called Set-VAE.
  • results: Set-VAE achieves a 2x signal efficiency gain over traditional subjettiness-based jet selection. The authors also propose CLIP-VAE, which reduces the inference-time cost and caching requirements of anomaly detection in trigger systems by using the KL-divergence loss as the anomaly score.
    Abstract Model-agnostic anomaly detection is one of the promising approaches in the search for new beyond the standard model physics. In this paper, we present Set-VAE, a particle-based variational autoencoder (VAE) anomaly detection algorithm. We demonstrate a 2x signal efficiency gain compared with traditional subjettiness-based jet selection. Furthermore, with an eye to the future deployment to trigger systems, we propose the CLIP-VAE, which reduces the inference-time cost of anomaly detection by using the KL-divergence loss as the anomaly score, resulting in a 2x acceleration in latency and reducing the caching requirement.
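Using the KL term as the anomaly score, as CLIP-VAE does, means only the encoder is needed at inference: the score is the closed-form KL divergence between the encoder's diagonal Gaussian and the standard normal prior, so the decoder pass is skipped. A sketch of that scoring step:

```python
import numpy as np

def kl_anomaly_score(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims.

    Closed form: 0.5 * sum(mu^2 + var - log var - 1). Higher means more
    anomalous relative to what the encoder learned on background events.
    """
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=-1)

# Toy encoder outputs for a batch of jets (as a trained encoder would emit).
rng = np.random.default_rng(0)
mu = rng.standard_normal((4, 16))
log_var = rng.standard_normal((4, 16)) * 0.1
print(kl_anomaly_score(mu, log_var).round(2))
```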

A point cloud approach to generative modeling for galaxy surveys at the field level

  • paper_url: http://arxiv.org/abs/2311.17141
  • repo_url: https://github.com/smsharma/point-cloud-galaxy-diffusion
  • paper_authors: Carolina Cuesta-Lazaro, Siddharth Mishra-Sharma
  • for: This work uses a diffusion-based generative model to describe the distribution of galaxies in our Universe directly as a collection of points in 3-D space, optionally with associated attributes (e.g., velocities and masses), without resorting to binning or voxelization.
  • methods: A custom diffusion model used both for emulation, reproducing essential summary statistics of the galaxy distribution, and for inference, by computing the conditional likelihood of a galaxy field.
  • results: A first application to massive dark matter haloes in the Quijote simulation suite successfully reproduces their distribution. The approach can be extended to a comprehensive analysis of cosmological data, circumventing limitations inherent to summary statistics as well as neural simulation-based inference methods.
    Abstract We introduce a diffusion-based generative model to describe the distribution of galaxies in our Universe directly as a collection of points in 3-D space (coordinates) optionally with associated attributes (e.g., velocities and masses), without resorting to binning or voxelization. The custom diffusion model can be used both for emulation, reproducing essential summary statistics of the galaxy distribution, as well as inference, by computing the conditional likelihood of a galaxy field. We demonstrate a first application to massive dark matter haloes in the Quijote simulation suite. This approach can be extended to enable a comprehensive analysis of cosmological data, circumventing limitations inherent to summary statistic -- as well as neural simulation-based inference methods.

Predicting the Age of Astronomical Transients from Real-Time Multivariate Time Series

  • paper_url: http://arxiv.org/abs/2311.17143
  • repo_url: None
  • paper_authors: Hali Huang, Daniel Muthukrishna, Prajna Nair, Zimi Zhang, Michael Fausnaugh, Torsha Majumder, Ryan J. Foley, George R. Ricker
  • for: This paper aims to improve our understanding of astronomical transients, especially as new astronomical sky surveys will soon record unprecedented numbers of them.
  • methods: A Bayesian probabilistic recurrent neural network predicts the age of a transient in real time from multi-wavelength time-series observations, with robust uncertainties.
  • results: The method accurately predicts the age of a transient as soon as it is initially triggered by a survey telescope, providing valuable information for prioritizing follow-up of young transients in ongoing and upcoming surveys.
    Abstract Astronomical transients, such as supernovae and other rare stellar explosions, have been instrumental in some of the most significant discoveries in astronomy. New astronomical sky surveys will soon record unprecedented numbers of transients as sparsely and irregularly sampled multivariate time series. To improve our understanding of the physical mechanisms of transients and their progenitor systems, early-time measurements are necessary. Prioritizing the follow-up of transients based on their age along with their class is crucial for new surveys. To meet this demand, we present the first method of predicting the age of transients in real-time from multi-wavelength time-series observations. We build a Bayesian probabilistic recurrent neural network. Our method can accurately predict the age of a transient with robust uncertainties as soon as it is initially triggered by a survey telescope. This work will be essential for the advancement of our understanding of the numerous young transients being detected by ongoing and upcoming astronomical surveys.

GlycoNMR: Dataset and benchmarks for NMR chemical shift prediction of carbohydrates with graph neural networks

  • paper_url: http://arxiv.org/abs/2311.17134
  • repo_url: None
  • paper_authors: Zizhang Chen, Ryan Paul Badman, Lachele Foley, Robert Woods, Pengyu Hong
  • for: This paper applies molecular representation learning (MRL) to carbohydrates, converting them into numerical representations that preserve their chemical features and provide a foundation for downstream biochemical studies.
  • methods: Two laboriously curated datasets with 2,609 carbohydrate structures and 211,543 annotated nuclear magnetic resonance (NMR) chemical shifts for precise atomic-level prediction, together with tailored carbohydrate-specific features and existing MRL models adapted to the unique challenges of carbohydrate data.
  • results: Four modified MRL models are benchmarked on the new datasets and achieve high prediction accuracy, showing that tailored carbohydrate features and adapted MRL models effectively address carbohydrate-specific prediction problems.
    Abstract Molecular representation learning (MRL) is a powerful tool for bridging the gap between machine learning and chemical sciences, as it converts molecules into numerical representations while preserving their chemical features. These encoded representations serve as a foundation for various downstream biochemical studies, including property prediction and drug design. MRL has had great success with proteins and general biomolecule datasets. Yet, in the growing sub-field of glycoscience (the study of carbohydrates, where longer carbohydrates are also called glycans), MRL methods have been barely explored. This under-exploration can be primarily attributed to the limited availability of comprehensive and well-curated carbohydrate-specific datasets and a lack of Machine learning (ML) pipelines specifically tailored to meet the unique problems presented by carbohydrate data. Since interpreting and annotating carbohydrate-specific data is generally more complicated than protein data, domain experts are usually required to get involved. The existing MRL methods, predominately optimized for proteins and small biomolecules, also cannot be directly used in carbohydrate applications without special modifications. To address this challenge, accelerate progress in glycoscience, and enrich the data resources of the MRL community, we introduce GlycoNMR. GlycoNMR contains two laboriously curated datasets with 2,609 carbohydrate structures and 211,543 annotated nuclear magnetic resonance (NMR) chemical shifts for precise atomic-level prediction. We tailored carbohydrate-specific features and adapted existing MRL models to tackle this problem effectively. For illustration, we benchmark four modified MRL models on our new datasets.

An Investigation of Time Reversal Symmetry in Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.17008
  • repo_url: None
  • paper_authors: Brett Barkley, Amy Zhang, David Fridovich-Keil
  • for: Improving the sample efficiency of reinforcement learning.
  • methods: Formalizes time reversal symmetry in a Markov decision process (MDP) and proposes time symmetric data augmentation (TSDA), which transforms every experienced transition into a feasible reverse-time transition.
  • results: In time-reversible environments without friction or contact, TSDA improves sample efficiency; where these assumptions are not satisfied, it can significantly degrade sample efficiency and policy performance.
    Abstract One of the fundamental challenges associated with reinforcement learning (RL) is that collecting sufficient data can be both time-consuming and expensive. In this paper, we formalize a concept of time reversal symmetry in a Markov decision process (MDP), which builds upon the established structure of dynamically reversible Markov chains (DRMCs) and time-reversibility in classical physics. Specifically, we investigate the utility of this concept in reducing the sample complexity of reinforcement learning. We observe that utilizing the structure of time reversal in an MDP allows every environment transition experienced by an agent to be transformed into a feasible reverse-time transition, effectively doubling the number of experiences in the environment. To test the usefulness of this newly synthesized data, we develop a novel approach called time symmetric data augmentation (TSDA) and investigate its application in both proprioceptive and pixel-based state within the realm of off-policy, model-free RL. Empirical evaluations showcase how these synthetic transitions can enhance the sample efficiency of RL agents in time reversible scenarios without friction or contact. We also test this method in more realistic environments where these assumptions are not globally satisfied. We find that TSDA can significantly degrade sample efficiency and policy performance, but can also improve sample efficiency under the right conditions. Ultimately we conclude that time symmetry shows promise in enhancing the sample efficiency of reinforcement learning and provide guidance when the environment and reward structures are of an appropriate form for TSDA to be employed effectively.
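The core of TSDA is simple to state: every stored transition (s, a, s') yields a synthetic reverse-time transition from s' back to s, doubling the replay data when the dynamics are approximately time reversible. The sketch below shows the bookkeeping for a position-velocity state, where reversing time swaps the endpoints, flips velocities, and negates the action; the correct reversal map is environment specific, which is exactly where the caveats about friction and contact enter.

```python
import numpy as np

def reverse_transition(s, a, s_next):
    """Map (s, a, s') to a feasible reverse-time transition.

    Assumes state = [position, velocity]; under time reversal the endpoints
    swap, velocities flip sign, and the action is negated. This map is a
    modeling assumption and must match the environment's dynamics.
    """
    flip = np.array([1.0, -1.0])               # keep position, flip velocity
    return s_next * flip, -a, s * flip

buffer = [(np.array([0.0, 1.0]), 0.5, np.array([0.1, 1.2]))]
augmented = buffer + [reverse_transition(*t) for t in buffer]
for s, a, s_next in augmented:
    print(s, a, s_next)                        # twice the experience
```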
    摘要 We find that utilizing the structure of time reversal in an MDP allows every environment transition experienced by an agent to be transformed into a feasible reverse-time transition, effectively doubling the number of experiences in the environment. To test the usefulness of this newly synthesized data, we develop a novel approach called time symmetric data augmentation (TSDA) and investigate its application in both proprioceptive and pixel-based state within the realm of off-policy, model-free RL.Empirical evaluations show that these synthetic transitions can enhance the sample efficiency of RL agents in time reversible scenarios without friction or contact. However, we also find that TSDA can significantly degrade sample efficiency and policy performance when the environment and reward structures are not appropriate. Ultimately, we conclude that time symmetry shows promise in enhancing the sample efficiency of RL and provide guidance on when and how to employ TSDA effectively.

On the Impact of Sampling on Deep Sequential State Estimation

  • paper_url: http://arxiv.org/abs/2311.17006
  • repo_url: None
  • paper_authors: Helena Calatrava, Ricardo Augusto Borsoi, Tales Imbiriba, Pau Closas
  • for: State inference and parameter learning in sequential models, using approximation techniques that maximize the evidence lower bound on the marginal log-likelihood of the data distribution.
  • methods: Dynamical Variational Autoencoders, specifically the deep Kalman filter (DKF), with importance sampling applied to improve generative modeling performance, yielding the IW-DKF.
  • results: Tighter Monte Carlo objectives improve log-likelihood estimates and the KL divergence between the variational distribution and the transition model, along with a decrease in RMSE when estimating model parameters and latent states.
    Abstract State inference and parameter learning in sequential models can be successfully performed with approximation techniques that maximize the evidence lower bound to the marginal log-likelihood of the data distribution. These methods may be referred to as Dynamical Variational Autoencoders, and our specific focus lies on the deep Kalman filter. It has been shown that the ELBO objective can oversimplify data representations, potentially compromising estimation quality. Tighter Monte Carlo objectives have been proposed in the literature to enhance generative modeling performance. For instance, the IWAE objective uses importance weights to reduce the variance of marginal log-likelihood estimates. In this paper, importance sampling is applied to the DKF framework for learning deep Markov models, resulting in the IW-DKF, which shows an improvement in terms of log-likelihood estimates and KL divergence between the variational distribution and the transition model. The framework using the sampled DKF update rule is also accommodated to address sequential state and parameter estimation when working with highly non-linear physics-based models. An experiment with the 3-space Lorenz attractor shows an enhanced generative modeling performance and also a decrease in RMSE when estimating the model parameters and latent states, indicating that tighter MCOs lead to improved state inference performance.
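The tighter Monte Carlo objective in question is the importance weighted bound: with $K$ samples from the variational posterior,

```latex
\mathcal{L}_K(x) = \mathbb{E}_{z_1,\dots,z_K \sim q_\phi(z \mid x)}
  \left[ \log \frac{1}{K} \sum_{k=1}^{K}
  \frac{p_\theta(x, z_k)}{q_\phi(z_k \mid x)} \right]
  \le \log p_\theta(x).
```

Setting $K = 1$ recovers the standard ELBO, and the bound tightens as $K$ grows, which is what IW-DKF exploits within the deep Kalman filter framework.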
    摘要 State 推断和参数学习在序列模型中可以使用approximation技术来最大化证明下界对数据分布的含义下的含义下的质量。这些方法可以称为动态自适应编码器,我们的特定关注点在于深度卡尔曼滤波器。它已经显示出ELBO目标可能简化数据表示,可能影响估计质量。在文献中,使用重要性权重来减少分布估计的方差的 Monte Carlo 目标已经被提出。例如,IWAE目标使用重要性权重来降低分布估计的方差。在这篇论文中,我们将重要性抽样应用到DKF框架中,从而获得IW-DKF,它在评估数据分布和转换模型之间的KL散度和LOG-LIKELIHOOD估计中表现出了改善。此外,我们还使用抽样DKF更新规则来解决高非线性物理基础模型的Sequential state和参数估计问题。在3个空间 Lorenz 吸引器实验中,我们发现使用紧密的MCO可以提高生成模型性能和降低参数和隐藏状态估计的RMSE,这表明了紧密的MCO可以提高状态推断性能。

FedECA: A Federated External Control Arm Method for Causal Inference with Time-To-Event Data in Distributed Settings

  • paper_url: http://arxiv.org/abs/2311.16984
  • repo_url: None
  • paper_authors: Jean Ogier du Terrail, Quentin Klopfenstein, Honghao Li, Imke Mayer, Nicolas Loiseau, Mohammad Hallal, Félix Balazard, Mathieu Andreux
  • for: This work aims to strengthen efficacy evidence for experimental drugs in non-randomized settings, using privacy-enhancing technology to address the challenges of data sharing.
  • methods: Uses federated learning (FL) to develop FedECA, a privacy-enhancing, inverse probability of treatment weighted (IPTW) method for time-to-event outcomes that limits patients' data exposure.
  • results: FedECA outperforms its closest competitor, matching-adjusted indirect comparison (MAIC), in terms of statistical power and the ability to balance treatment and control groups. The code, based on the open-source FL software Substra, is publicly released.
    Abstract External control arms (ECA) can inform the early clinical development of experimental drugs and provide efficacy evidence for regulatory approval in non-randomized settings. However, the main challenge of implementing ECA lies in accessing real-world data or historical clinical trials. Indeed, data sharing is often not feasible due to privacy considerations related to data leaving the original collection centers, along with pharmaceutical companies' competitive motives. In this paper, we leverage a privacy-enhancing technology called federated learning (FL) to remove some of the barriers to data sharing. We introduce a federated learning inverse probability of treatment weighted (IPTW) method for time-to-event outcomes called FedECA which eases the implementation of ECA by limiting patients' data exposure. We show with extensive experiments that FedECA outperforms its closest competitor, matching-adjusted indirect comparison (MAIC), in terms of statistical power and ability to balance the treatment and control groups. To encourage the use of such methods, we publicly release our code which relies on Substra, an open-source FL software with proven experience in privacy-sensitive contexts.
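The IPTW estimator at the heart of FedECA reweights each patient by the inverse probability of the treatment actually received, so the reweighted arms match on measured covariates; in the federated version the propensity model is fit without pooling patient-level data. A single-site sketch of the weighting step (the federated machinery and the weighted time-to-event model are omitted):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def iptw_weights(X, treated):
    """Inverse probability of treatment weights from a propensity model."""
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    return np.where(treated == 1, 1.0 / ps, 1.0 / (1.0 - ps))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))                       # baseline covariates
treated = (X[:, 0] + rng.standard_normal(200) > 0).astype(int)
w = iptw_weights(X, treated)
# `w` would be passed as case weights to a weighted Cox model for the
# time-to-event analysis (not shown here).
print(w[:5].round(2))
```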

Bidirectional Reactive Programming for Machine Learning

  • paper_url: http://arxiv.org/abs/2311.16977
  • repo_url: None
  • paper_authors: Dumitru Potop Butucaru, Albert Cohen, Gordon Plotkin, Hugo Pompougnac
  • for: This paper describes a reactive programming model for systems that interact continuously and concurrently with their environment.
  • methods: Introduces a symmetric reactive construct enabling backward recurrences, with constraints on the latter that make the implementation practical.
  • results: Demonstrates that many machine learning (ML) systems are naturally captured as bidirectional reactive programs, including reverse-mode automatic differentiation, backpropagation, batch normalization, bidirectional recurrent neural networks, and training and reinforcement learning algorithms.
    Abstract Reactive languages are dedicated to the programming of systems which interact continuously and concurrently with their environment. Values take the form of unbounded streams modeling the (discrete) passing of time or the sequence of concurrent interactions. While conventional reactivity models recurrences forward in time, we introduce a symmetric reactive construct enabling backward recurrences. Constraints on the latter allow to make the implementation practical. Machine Learning (ML) systems provide numerous motivations for all of this: we demonstrate that reverse-mode automatic differentiation, backpropagation, batch normalization, bidirectional recurrent neural networks, training and reinforcement learning algorithms, are all naturally captured as bidirectional reactive programs.

Machine learning force-field models for metallic spin glass

  • paper_url: http://arxiv.org/abs/2311.16964
  • repo_url: None
  • paper_authors: Menglin Shi, Sheng Zhang, Gia-Wei Chern
  • for: Studying metallic spin glass systems, such as dilute magnetic alloys.
  • methods: A scalable machine learning (ML) framework for dynamical simulations of metallic spin glasses: a Behler-Parrinello type neural-network model based on the principle of locality accurately and efficiently predicts the electron-induced local magnetic fields that drive the spin dynamics, taking a symmetry-invariant magnetic descriptor as input.
  • results: Applied to the relaxation dynamics of an amorphous generalization of the s-d model, the approach highlights the promising potential of ML models for large-scale dynamical modeling of itinerant magnets with quenched disorder.
    Abstract Metallic spin glass systems, such as dilute magnetic alloys, are characterized by randomly distributed local moments coupled to each other through a long-range electron-mediated effective interaction. We present a scalable machine learning (ML) framework for dynamical simulations of metallic spin glasses. A Behler-Parrinello type neural-network model, based on the principle of locality, is developed to accurately and efficiently predict electron-induced local magnetic fields that drive the spin dynamics. A crucial component of the ML model is a proper symmetry-invariant representation of local magnetic environment which is direct input to the neural net. We develop such a magnetic descriptor by incorporating the spin degrees of freedom into the atom-centered symmetry function methods which are widely used in ML force-field models for quantum molecular dynamics. We apply our approach to study the relaxation dynamics of an amorphous generalization of the s-d model. Our work highlights the promising potential of ML models for large-scale dynamical modeling of itinerant magnets with quenched disorder.

Adaptive Step Sizes for Preconditioned Stochastic Gradient Descent

  • paper_url: http://arxiv.org/abs/2311.16956
  • repo_url: None
  • paper_authors: Frederik Köhne, Leonie Kreis, Anton Schiela, Roland Herzog
  • for: Proposes an adaptive step size method for stochastic gradient descent (SGD) based on numerically traceable quantities.
  • methods: Uses the Lipschitz constant for gradients and the local variance in search directions to set step sizes adaptively.
  • results: Yields a nearly hyperparameter-free stochastic optimization algorithm with provable convergence properties on quadratic problems and truly problem-adaptive behavior on classical image classification tasks. The framework also admits a preconditioner, enabling adaptive step sizes for stochastic second-order methods.
    Abstract This paper proposes a novel approach to adaptive step sizes in stochastic gradient descent (SGD) by utilizing quantities that we have identified as numerically traceable -- the Lipschitz constant for gradients and a concept of the local variance in search directions. Our findings yield a nearly hyperparameter-free algorithm for stochastic optimization, which has provable convergence properties when applied to quadratic problems and exhibits truly problem adaptive behavior on classical image classification tasks. Our framework enables the potential inclusion of a preconditioner, thereby enabling the implementation of adaptive step sizes for stochastic second-order optimization methods.
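One of the numerically traceable quantities is cheap to see in isolation: a local gradient-Lipschitz estimate from successive iterates, L_k = ||g_k - g_{k-1}|| / ||x_k - x_{k-1}||, and the step size 1/L_k it suggests. The sketch below shows that generic pattern on a deterministic quadratic; the paper's actual rule also incorporates the local variance of search directions and handles stochastic gradients, both omitted here.

```python
import numpy as np

def grad(x):                      # toy quadratic f(x) = 0.5 x^T A x
    return A @ x

A = np.diag([1.0, 10.0, 100.0])
x_prev = np.ones(3)
g_prev = grad(x_prev)
x = x_prev - 1e-3 * g_prev        # one small bootstrap step

for k in range(50):
    g = grad(x)
    # Local Lipschitz estimate from successive gradients and iterates.
    L = np.linalg.norm(g - g_prev) / (np.linalg.norm(x - x_prev) + 1e-12)
    step = 1.0 / (L + 1e-12)
    x_prev, g_prev = x, g
    x = x - step * g
print(np.linalg.norm(x))          # approaches the minimizer at 0
```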

Multinomial belief networks

  • paper_url: http://arxiv.org/abs/2311.16909
  • repo_url: https://github.com/debabratabar/Fake_news_detection
  • paper_authors: H. C. Donker, D. Neijzen, G. A. Lunter
  • for: This work takes a Bayesian approach to address uncertainty quantification, missing observations, and scarce or sparse samples in healthcare data analysis.
  • methods: A deep generative model for multinomial count data in which both the weights and hidden units of the network are Dirichlet distributed, with a Gibbs sampling procedure that exploits a series of augmentation relations, analogous to the Zhou-Cong-Chen model.
  • results: Applied to small handwritten digits and a large experimental dataset of DNA mutations in cancer, the model extracts biologically meaningful meta-signatures in a fully data-driven way.
    Abstract A Bayesian approach to machine learning is attractive when we need to quantify uncertainty, deal with missing observations, when samples are scarce, or when the data is sparse. All of these commonly apply when analysing healthcare data. To address these analytical requirements, we propose a deep generative model for multinomial count data where both the weights and hidden units of the network are Dirichlet distributed. A Gibbs sampling procedure is formulated that takes advantage of a series of augmentation relations, analogous to the Zhou-Cong-Chen model. We apply the model on small handwritten digits, and a large experimental dataset of DNA mutations in cancer, and we show how the model is able to extract biologically meaningful meta-signatures in a fully data-driven way.

Compressing the Backward Pass of Large-Scale Neural Architectures by Structured Activation Pruning

  • paper_url: http://arxiv.org/abs/2311.16883
  • repo_url: None
  • paper_authors: Daniel Barley, Holger Fröning
  • for: Improving the efficiency and sustainability of training deep neural networks (DNNs) by pruning activations to reduce memory consumption.
  • methods: Structured pruning of activations in Block Sparse Compressed Row (BSR) format combined with a magnitude-based criterion, together with efficient block-sparse operators for GPUs.
  • results: Evaluating training speed, accuracy, and memory usage of large-scale architectures (ResMLP on image classification), activation pruning reduces memory consumption by up to 32% while maintaining accuracy.
    Abstract The rise of Deep Neural Networks (DNNs) has led to an increase in model size and complexity, straining the memory capacity of GPUs. Sparsity in DNNs, characterized as structural or ephemeral, has gained attention as a solution. This work focuses on ephemeral sparsity, aiming to reduce memory consumption during training. It emphasizes the significance of activations, an often overlooked component, and their role in memory usage. This work employs structured pruning in Block Sparse Compressed Row (BSR) format in combination with a magnitude-based criterion to efficiently prune activations. We furthermore introduce efficient block-sparse operators for GPUs and showcase their effectiveness, as well as the superior compression offered by block sparsity. We report the effectiveness of activation pruning by evaluating training speed, accuracy, and memory usage of large-scale neural architectures on the example of ResMLP on image classification tasks. As a result, we observe a memory reduction of up to 32% while maintaining accuracy. Ultimately, our approach aims to democratize large-scale model training, reduce GPU requirements, and address ecological concerns.
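Magnitude-based block pruning of activations amounts to tiling the activation matrix into fixed-size blocks, zeroing the blocks with the smallest norms, and storing only the survivors (a BSR layout keeps the nonzero blocks plus their row and column indices). A sketch of the pruning decision, without the GPU kernels:

```python
import numpy as np

def block_prune(act, block=(4, 4), keep=0.5):
    """Zero out the lowest-magnitude blocks of an activation matrix.

    Keeps the `keep` fraction of tiles with the largest L1 norms; the
    surviving tiles are what a BSR layout would store.
    """
    h, w = act.shape
    bh, bw = block
    tiles = act.reshape(h // bh, bh, w // bw, bw).swapaxes(1, 2)
    norms = np.abs(tiles).sum(axis=(2, 3))
    thresh = np.quantile(norms, 1.0 - keep)
    mask = norms >= thresh                       # which blocks survive
    tiles = tiles * mask[:, :, None, None]
    return tiles.swapaxes(1, 2).reshape(h, w), mask

rng = np.random.default_rng(0)
act = rng.standard_normal((16, 16))
pruned, mask = block_prune(act, keep=0.25)
print(mask.sum(), "of", mask.size, "blocks kept")   # ~4 of 16
```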

Imputation using training labels and classification via label imputation

  • paper_url: http://arxiv.org/abs/2311.16877
  • repo_url: https://github.com/thunguyen177/iul-cbmi
  • paper_authors: Thu Nguyen, Pål Halvorsen, Michael A. Riegler
  • for: Handling missing data.
  • methods: Stacking the label onto the input data for imputation; at test time the predicted label is initialized with missing values so that label and input are imputed simultaneously.
  • results: Improved accuracy.
    Abstract Missing data is a common problem in practical settings. Various imputation methods have been developed to deal with missing data. However, even though the label is usually available in the training data, the common practice of imputation usually only relies on the input and ignores the label. In this work, we illustrate how stacking the label into the input can significantly improve the imputation of the input. In addition, we propose a classification strategy that initializes the predicted test label with missing values and stacks the label with the input for imputation. This allows imputing the label and the input at the same time. Also, the technique is capable of handling data training with missing labels without any prior imputation and is applicable to continuous, categorical, or mixed-type data. Experiments show promising results in terms of accuracy.
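The mechanism is simple to demonstrate: append the label as an extra column of the input matrix before imputation so that label information guides the reconstruction of missing features, and at test time initialize the label column as missing so it is imputed together with the features, which doubles as classification. A sketch with scikit-learn's iterative imputer:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X_tr = rng.standard_normal((100, 4)); y_tr = (X_tr[:, 0] > 0).astype(float)
X_te = rng.standard_normal((20, 4))
X_tr[rng.random(X_tr.shape) < 0.2] = np.nan        # inject missing features

# Stack the label as an extra column; test labels start out missing.
train = np.column_stack([X_tr, y_tr])
test = np.column_stack([X_te, np.full(len(X_te), np.nan)])

imp = IterativeImputer(random_state=0).fit(train)
completed = imp.transform(test)
y_pred = (completed[:, -1] > 0.5).astype(int)      # imputed label = prediction
print(y_pred)
```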

Digital Twin-Enhanced Deep Reinforcement Learning for Resource Management in Networks Slicing

  • paper_url: http://arxiv.org/abs/2311.16876
  • repo_url: None
  • paper_authors: Zhengming Zhang, Yongming Huang, Cheng Zhang, Qingbi Zheng, Luxi Yang, Xiaohu You
  • For: The paper proposes a framework for dynamic and efficient resource allocation in network slicing-based communication systems, specifically for diversified services.* Methods: The proposed framework uses a digital twin model and reinforcement learning agents to handle the challenge of resource allocation in practical systems. The digital twin model is built using historical data and neural networks, and is calibrated to match the real environment.* Results: The proposed framework significantly improves the performance of slice optimization strategies, as demonstrated through numerical simulation experiments. The use of a digital twin model and reinforcement learning agents enables the framework to achieve good results with less interaction with the real environment.
    Abstract Network slicing-based communication systems can dynamically and efficiently allocate resources for diversified services. However, due to the limitation of the network interface on channel access and the complexity of the resource allocation, it is challenging to achieve an acceptable solution in the practical system without precise prior knowledge of the dynamics probability model of the service requests. Existing work attempts to solve this problem using deep reinforcement learning (DRL), however, such methods usually require a lot of interaction with the real environment in order to achieve good results. In this paper, a framework consisting of a digital twin and reinforcement learning agents is presented to handle the issue. Specifically, we propose to use the historical data and the neural networks to build a digital twin model to simulate the state variation law of the real environment. Then, we use the data generated by the network slicing environment to calibrate the digital twin so that it is in sync with the real environment. Finally, DRL for slice optimization optimizes its own performance in this virtual pre-verification environment. We conducted an exhaustive verification of the proposed digital twin framework to confirm its scalability. Specifically, we propose to use loss landscapes to visualize the generalization of DRL solutions. We explore a distillation-based optimization scheme for lightweight slicing strategies. In addition, we also extend the framework to offline reinforcement learning, where solutions can be used to obtain intelligent decisions based solely on historical data. Numerical simulation experiments show that the proposed digital twin can significantly improve the performance of the slice optimization strategy.
    摘要 基于网络切片的通信系统能够为多样化业务动态、高效地分配资源。然而,受限于网络接口的信道接入能力以及资源分配的复杂性,在缺乏业务请求动态概率模型的精确先验知识的情况下,实际系统很难获得令人满意的解。现有工作尝试用深度强化学习(DRL)解决该问题,但此类方法通常需要与真实环境进行大量交互才能取得良好效果。本文提出一个由数字孪生与强化学习智能体组成的框架来应对这一问题。具体而言,我们利用历史数据和神经网络构建数字孪生模型,以模拟真实环境的状态演化规律;随后利用网络切片环境产生的数据对数字孪生进行校准,使其与真实环境保持同步;最后,用于切片优化的 DRL 在这一虚拟预验证环境中优化自身性能。我们对所提出的数字孪生框架进行了全面验证以确认其可扩展性,包括利用损失地形可视化 DRL 解的泛化能力、探索基于蒸馏的轻量级切片策略优化方案,并将框架扩展到离线强化学习,使其仅凭历史数据即可获得智能决策。数值仿真实验表明,所提出的数字孪生能够显著提升切片优化策略的性能。
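
A toy PyTorch sketch of the first ingredient: fitting a neural digital twin to logged state transitions. The state/action dimensions and the synthetic transition function are placeholders, and the calibration and DRL components are only indicated in comments.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 3
twin = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                     nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, state_dim))
opt = torch.optim.Adam(twin.parameters(), lr=1e-3)

# Synthetic transitions stand in for real network-slicing logs here.
s = torch.randn(1024, state_dim)
a = torch.randn(1024, action_dim)
s_next = s + 0.1 * torch.tanh(a @ torch.randn(action_dim, state_dim))

for _ in range(200):                       # fit the twin to the logged transitions
    pred = twin(torch.cat([s, a], dim=-1))
    loss = nn.functional.mse_loss(pred, s_next)
    opt.zero_grad(); loss.backward(); opt.step()

# A DRL agent can now be trained against `twin` as a pre-verification environment,
# and the twin re-calibrated whenever fresh data from the real slicing environment arrives.
```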

A unified weighting framework for evaluating nearest neighbour classification

  • paper_url: http://arxiv.org/abs/2311.16872
  • repo_url: None
  • paper_authors: Oliver Urs Lenz, Henri Bollaert, Chris Cornelis
  • for: 评估类别 nearest neighbor 算法的选择
  • methods: 使用距离函数和权重策略,以及不同的规模计量
  • results: 研究发现 Boscovich 距离在 NN、FNN 和 FRNN 中均表现最佳;NN 和 FRNN 在结合 Samworth 排名与距离权重及适当的尺度度量时效果最好,FNN 则仅需 Samworth 距离权重;较为简单的 Yager 距离权重也能使 NN 达到相当的性能。总体而言,FRNN 优于 NN,而 NN 又系统性地优于 FNN。
    Abstract We present the first comprehensive and large-scale evaluation of classical (NN), fuzzy (FNN) and fuzzy rough (FRNN) nearest neighbour classification. We show that existing proposals for nearest neighbour weighting can be standardised in the form of kernel functions, applied to the distance values and/or ranks of the nearest neighbours of a test instance. Furthermore, we identify three commonly used distance functions and four scaling measures. We systematically evaluate these choices on a collection of 85 real-life classification datasets. We find that NN, FNN and FRNN all perform best with Boscovich distance. NN and FRNN perform best with a combination of Samworth rank- and distance weights and scaling by the mean absolute deviation around the median ($r_1$), the standard deviation ($r_2$) or the interquartile range ($r_{\infty}^*$), while FNN performs best with only Samworth distance-weights and $r_1$- or $r_2$-scaling. We also introduce a new kernel based on fuzzy Yager negation, and show that NN achieves comparable performance with Yager distance-weights, which are simpler to implement than a combination of Samworth distance- and rank-weights. Finally, we demonstrate that FRNN generally outperforms NN, which in turns performs systematically better than FNN.
    摘要 我们首次对经典(NN)、模糊(FNN)与模糊粗糙(FRNN)最近邻分类方法进行了全面的大规模评估。我们表明,现有的最近邻加权方案都可以标准化为核函数的形式,作用于测试实例的最近邻的距离值和/或排名之上。此外,我们归纳出三种常用的距离函数和四种尺度度量,并在 85 个真实分类数据集上系统地评估了这些选择。我们发现 NN、FNN 和 FRNN 均在 Boscovich 距离下表现最佳;NN 和 FRNN 在结合 Samworth 排名与距离权重,并以中位数附近平均绝对偏差($r_1$)、标准差($r_2$)或四分位距($r_{\infty}^*$)进行缩放时效果最好,而 FNN 仅需 Samworth 距离权重配合 $r_1$ 或 $r_2$ 缩放。我们还提出了一种基于模糊 Yager 否定的新核函数,并表明 NN 使用实现更简单的 Yager 距离权重即可达到相当的性能。最后,我们展示了 FRNN 总体优于 NN,而 NN 又系统性地优于 FNN。
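
A small NumPy sketch of one of the evaluated combinations: Boscovich (L1) distance with $r_1$ scaling (mean absolute deviation around the median). For simplicity it uses plain inverse-distance weights rather than the Samworth weighting scheme, so treat it as an illustration of the framework, not a reproduction of the best-performing configuration.

```python
import numpy as np

def boscovich_knn_predict(X_train, y_train, x, k=5):
    """Distance-weighted k-NN: X_train (n, d), y_train (n,), query x (d,)."""
    med = np.median(X_train, axis=0)
    r1 = np.mean(np.abs(X_train - med), axis=0) + 1e-12       # per-feature r1 scale
    d = np.sum(np.abs((X_train - x) / r1), axis=1)            # Boscovich (L1) distance
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + 1e-12)                                # inverse-distance weights
    classes = np.unique(y_train)
    votes = [(w[y_train[idx] == c].sum(), c) for c in classes]
    return max(votes)[1]                                      # class with the largest weighted vote
```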

Power Hungry Processing: Watts Driving the Cost of AI Deployment?

  • paper_url: http://arxiv.org/abs/2311.16863
  • repo_url: None
  • paper_authors: Alexandra Sasha Luccioni, Yacine Jernite, Emma Strubell
  • For: The paper aims to compare the ongoing inference cost of various categories of machine learning (ML) systems, including task-specific and general-purpose models.* Methods: The authors use a systematic comparison of the deployment cost of these models, measuring the amount of energy and carbon required to perform 1,000 inferences on a representative benchmark dataset.* Results: The authors find that multi-purpose, generative architectures are orders of magnitude more expensive than task-specific systems for a variety of tasks, even when controlling for the number of model parameters.
    Abstract Recent years have seen a surge in the popularity of commercial AI products based on generative, multi-purpose AI systems promising a unified approach to building machine learning (ML) models into technology. However, this ambition of "generality" comes at a steep cost to the environment, given the amount of energy these systems require and the amount of carbon that they emit. In this work, we propose the first systematic comparison of the ongoing inference cost of various categories of ML systems, covering both task-specific (i.e. finetuned models that carry out a single task) and `general-purpose' models, (i.e. those trained for multiple tasks). We measure deployment cost as the amount of energy and carbon required to perform 1,000 inferences on representative benchmark dataset using these models. We find that multi-purpose, generative architectures are orders of magnitude more expensive than task-specific systems for a variety of tasks, even when controlling for the number of model parameters. We conclude with a discussion around the current trend of deploying multi-purpose generative ML systems, and caution that their utility should be more intentionally weighed against increased costs in terms of energy and emissions. All the data from our study can be accessed via an interactive demo to carry out further exploration and analysis.
    摘要 近年来,基于生成式、多用途 AI 系统的商业 AI 产品迅速普及,它们承诺以统一的方式将机器学习(ML)模型构建进各类技术之中。然而,考虑到这些系统所需的能源以及所排放的碳量,这种追求“通用性”的目标对环境造成了高昂的代价。在这项工作中,我们首次对不同类别的 ML 系统的持续推理成本进行了系统比较,涵盖任务特定模型(即只执行单一任务的微调模型)和“通用”模型(即面向多任务训练的模型)。我们将部署成本定义为使用这些模型在代表性基准数据集上执行 1,000 次推理所需的能源和碳排放。我们发现,即使控制模型参数数量,多用途生成式架构在多种任务上的成本仍比任务特定系统高出数个数量级。最后,我们讨论了当前部署多用途生成式 ML 系统的趋势,并提醒应更审慎地在其效用与能源和排放成本之间进行权衡。本研究的所有数据均可通过交互式演示获取,以便进一步探索和分析。
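
A sketch of how such a per-1,000-inference measurement could be scripted with the open-source codecarbon library; `model` and `examples` are placeholders, and the paper's own measurement harness may differ.

```python
from codecarbon import EmissionsTracker

def measure_inference_cost(model, examples):
    """Estimate carbon (kg CO2-eq) for 1,000 inferences, the paper's unit of comparison."""
    tracker = EmissionsTracker(log_level="error")
    tracker.start()
    for x in examples[:1000]:      # run exactly 1,000 inferences
        model(x)
    return tracker.stop()          # estimated kg CO2-eq for the batch
```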

Data-efficient operator learning for solving high Mach number fluid flow problems

  • paper_url: http://arxiv.org/abs/2311.16860
  • repo_url: None
  • paper_authors: Noah Ford, Victor J. Leon, Honest Merman, Jeffrey Gilbert, Alexander New
  • for: 使用 SciML 预测高马赫数流体在不规则几何上的流动解
  • methods: 使用神经基函数(NBF)从数据中学习一组行为模态基底,再利用该基底进行预测,以适应数据稀缺的场景
  • results: NBF 模型比不使用基函数的基线模型更有效,但该类问题的解预测仍存在一些有待解决的挑战。
    Abstract We consider the problem of using SciML to predict solutions of high Mach fluid flows over irregular geometries. In this setting, data is limited, and so it is desirable for models to perform well in the low-data setting. We show that Neural Basis Functions (NBF), which learns a basis of behavior modes from the data and then uses this basis to make predictions, is more effective than a basis-unaware baseline model. In addition, we identify continuing challenges in the space of predicting solutions for this type of problem.
    摘要 我们考虑使用 SciML 预测高马赫数流体流过不规则几何的解。在这种情况下,数据有限,因此希望模型在低数据条件下也能表现良好。我们发现,神经基函数(NBF)先从数据中学习一组行为模态基底,再利用该基底进行预测,比不使用基底的基线模型更有效。此外,我们还指出了这类问题的解预测中仍然存在的挑战。
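
A minimal PyTorch sketch of the basis-expansion idea: predictions as a learned linear combination of shared behavior modes. The network sizes and the exact factorization into a basis net and a coefficient net are assumptions for illustration; the paper's NBF architecture may differ in detail.

```python
import torch
import torch.nn as nn

class NeuralBasisFunctions(nn.Module):
    """u(x; mu) ~= sum_k c_k(mu) * phi_k(x): shared modes phi, case-specific coefficients c."""
    def __init__(self, coord_dim=2, param_dim=3, n_modes=8):
        super().__init__()
        self.basis = nn.Sequential(nn.Linear(coord_dim, 64), nn.Tanh(),
                                   nn.Linear(64, n_modes))     # phi_k(x)
        self.coeff = nn.Sequential(nn.Linear(param_dim, 64), nn.Tanh(),
                                   nn.Linear(64, n_modes))     # c_k(mu)

    def forward(self, coords, params):
        # coords: (N, coord_dim) query points; params: (N, param_dim) flow parameters
        return (self.basis(coords) * self.coeff(params)).sum(dim=-1, keepdim=True)
```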

Attentional Graph Neural Networks for Robust Massive Network Localization

  • paper_url: http://arxiv.org/abs/2311.16856
  • repo_url: None
  • paper_authors: Wenzhong Yan, Juntao Wang, Feng Yin, Abdelhak M. Zoubir
  • for: 本研究用图神经网络(GNNs)和注意力机制来解决一个经典的非线性回归问题:网络定位。
  • methods: 提出了一种基于 GNN 的网络定位方法,可以在严重的非视距(NLOS)传播下保持高稳定性和准确性,且无需繁琐的离线校准或 NLOS 识别。
  • results: 实验结果表明,提出的基于 GNN 的模型表现出色,特别是在复杂 NLOS 场景下。但该模型的灵活性有限,其准确性高度依赖于决定图结构的某个特定超参数。为扩展其应用范围和适用场景,我们提出了两种注意力图神经网络(AGNN),它们可以为每个节点自动学习最优超参数。实验结果表明,AGNN 模型可以提升定位精度,为实际应用提供了有前景的解决方案。
    Abstract Graph neural networks (GNNs) have gained significant popularity for classification tasks in machine learning, yet their applications to regression problems remain limited. Concurrently, attention mechanisms have emerged as powerful tools in sequential learning tasks. In this paper, we employ GNNs and attention mechanisms to address a classical but challenging nonlinear regression problem: network localization. We propose a novel GNN-based network localization method that achieves exceptional stability and accuracy in the presence of severe non-line-of-sight (NLOS) propagations, while eliminating the need for laborious offline calibration or NLOS identification. Extensive experimental results validate the effectiveness and high accuracy of our GNN-based localization model, particularly in challenging NLOS scenarios. However, the proposed GNN-based model exhibits limited flexibility, and its accuracy is highly sensitive to a specific hyperparameter that determines the graph structure. To address the limitations and extend the applicability of the GNN-based model to real scenarios, we introduce two attentional graph neural networks (AGNNs) that offer enhanced flexibility and the ability to automatically learn the optimal hyperparameter for each node. Experimental results confirm that the AGNN models are able to enhance localization accuracy, providing a promising solution for real-world applications. We also provide some analyses of the improved performance achieved by the AGNN models from the perspectives of dynamic attention and signal denoising characteristics.
    摘要 граф нейронных сетей (GNNs) 已经在机器学习中获得了广泛的应用,但它们在回归 задача中的应用还是有限的。同时,注意机制已经作为序列学习任务中的有效工具而出现。在这篇论文中,我们使用 GNNs 和注意机制来解决一个经典的非线性回归问题:网络地址。我们提出了一种基于 GNNs 的网络地址方法,该方法在严重的非直线视线(NLOS)传播中具有出色的稳定性和准确性,而不需要劳碌的在线调整或NLOS标识。广泛的实验结果证明了我们基于 GNNs 的地址模型的效果和高精度,特别是在复杂NLOS场景下。然而,我们的 GNNs 基本模型具有局限性,其精度高度依赖于确定图 структуры的特定参数。为了解决这些限制并扩展 GNNs 的应用范围,我们引入了两种注意力图神经网络(AGNNs),它们具有更高的灵活性和自动学习最佳参数的能力。实验结果表明,AGNN 模型能够提高地址精度,提供了实际应用中的有力解决方案。我们还对 AGNN 模型的改进表现进行了一些分析,包括动态注意和信号干扰特性的分析。
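
A small NumPy sketch of the graph-construction step whose threshold hyperparameter the abstract highlights: nodes are connected when their measured range falls below a global threshold theta. The AGNN variants effectively learn this choice per node instead of fixing it.

```python
import numpy as np

def build_localization_graph(ranges: np.ndarray, theta: float) -> np.ndarray:
    """Row-normalized adjacency from measured pairwise ranges (n, n): connect
    node pairs whose measured distance falls below theta, the hyperparameter
    the abstract identifies as critical for accuracy."""
    A = (ranges < theta).astype(float)
    np.fill_diagonal(A, 0.0)                  # no self-loops
    deg = A.sum(axis=1, keepdims=True) + 1e-12
    return A / deg                            # normalized for message passing
```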

Identifiable Feature Learning for Spatial Data with Nonlinear ICA

  • paper_url: http://arxiv.org/abs/2311.16849
  • repo_url: None
  • paper_authors: Hermanni Hälvä, Jonathan So, Richard E. Turner, Aapo Hyvärinen
  • for: 非线性独立成分分析(非线性 ICA)是深度表示学习与解耦的新兴方法,它可以捕捉数据中更复杂的潜在依赖关系。
  • methods: 我们提出了一种基于 $t$-过程(TP)潜在成分的新非线性 ICA 框架,适用于具有高维依赖结构的数据,如空间和时空数据。我们还开发了一种新的学习与推断算法,将变分推断方法扩展到深度混合函数与 TP 先验的组合,并采用诱导点方法以提高计算效率。
  • results: 我们在模拟空间数据和真实世界时空数据上验证了所提算法与定理。结果表明,TP 非线性 ICA 能够更好地捕捉数据中的潜在依赖关系,并在具有高维依赖结构的数据上给出可识别性保证。
    Abstract Recently, nonlinear ICA has surfaced as a popular alternative to the many heuristic models used in deep representation learning and disentanglement. An advantage of nonlinear ICA is that a sophisticated identifiability theory has been developed; in particular, it has been proven that the original components can be recovered under sufficiently strong latent dependencies. Despite this general theory, practical nonlinear ICA algorithms have so far been mainly limited to data with one-dimensional latent dependencies, especially time-series data. In this paper, we introduce a new nonlinear ICA framework that employs $t$-process (TP) latent components which apply naturally to data with higher-dimensional dependency structures, such as spatial and spatio-temporal data. In particular, we develop a new learning and inference algorithm that extends variational inference methods to handle the combination of a deep neural network mixing function with the TP prior, and employs the method of inducing points for computational efficacy. On the theoretical side, we show that such TP independent components are identifiable under very general conditions. Further, Gaussian Process (GP) nonlinear ICA is established as a limit of the TP Nonlinear ICA model, and we prove that the identifiability of the latent components at this GP limit is more restricted. Namely, those components are identifiable if and only if they have distinctly different covariance kernels. Our algorithm and identifiability theorems are explored on simulated spatial data and real world spatio-temporal data.
    摘要 近期,非线性 ICA 已成为深度表示学习与解耦中众多启发式模型的一种流行替代方案。非线性 ICA 的优势在于其已建立起一套成熟的可识别性理论;特别地,已有证明表明,在足够强的潜在依赖条件下可以恢复原始成分。尽管有这一一般理论,现有的实用非线性 ICA 算法迄今主要局限于具有一维潜在依赖的数据,尤其是时间序列数据。在这篇论文中,我们提出了一种新的非线性 ICA 框架,采用 $t$-过程(TP)潜在成分,它们自然适用于具有更高维依赖结构的数据,例如空间和时空数据。我们开发了一种新的学习与推断算法,将变分推断方法扩展到深度神经网络混合函数与 TP 先验的组合,并采用诱导点方法以提高计算效率。在理论方面,我们证明了这类 TP 独立成分在非常一般的条件下是可识别的。此外,高斯过程(GP)非线性 ICA 可视为 TP 非线性 ICA 模型的极限情形,我们证明在该 GP 极限下潜在成分的可识别性更为受限:当且仅当各成分具有显著不同的协方差核时,它们才是可识别的。我们在模拟空间数据和真实世界时空数据上检验了所提算法与可识别性定理。
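
For intuition about the TP prior, a NumPy sketch that draws one t-process path as a GP draw rescaled by a chi-square mixing variable; as nu grows, the draw approaches the GP limit discussed in the abstract. The kernel, lengthscale, and nu are illustrative choices.

```python
import numpy as np

def sample_t_process(coords, nu=4.0, lengthscale=0.5, rng=None):
    """One t-process draw on 1-D locations: a GP sample scaled by sqrt(nu / chi2(nu))."""
    rng = rng or np.random.default_rng()
    d = np.abs(coords[:, None] - coords[None, :])
    K = np.exp(-0.5 * (d / lengthscale) ** 2) + 1e-8 * np.eye(len(coords))
    z = rng.multivariate_normal(np.zeros(len(coords)), K)   # underlying GP draw
    g = rng.chisquare(nu)                                   # mixing variable
    return z * np.sqrt(nu / g)                              # heavier tails than the GP

path = sample_t_process(np.linspace(0.0, 1.0, 100))
```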

The HR-Calculus: Enabling Information Processing with Quaternion Algebra

  • paper_url: http://arxiv.org/abs/2311.16771
  • repo_url: None
  • paper_authors: Danilo P. Mandic, Sayed Pouria Talebi, Clive Cheong Took, Yili Xia, Dongpo Xu, Min Xiang, Pauline Bourigault
  • for: 本文旨在探讨如何利用四元数及其除法代数来建模三维空间中的旋转/方向问题,以及其在机器学习、信号处理和控制领域中自适应信息处理技术的应用。
  • methods: 本文回顾了 HR-calculus,它为直接在四元数域中推导针对四元数值信号的自适应学习技术提供了数学基础,并给出了梯度算子、链式与乘积求导法则以及泰勒级数展开等工具。
  • results: 本文介绍了自适应信息处理在单节点和多节点形式下的主要应用,并提供了补充材料(SM)。
    Abstract From their inception, quaternions and their division algebra have proven to be advantageous in modelling rotation/orientation in three-dimensional spaces and have seen use from the initial formulation of electromagnetic field theory through to forming the basis of quantum field theory. Despite their impressive versatility in modelling real-world phenomena, adaptive information processing techniques specifically designed for quaternion-valued signals have only recently come to the attention of the machine learning, signal processing, and control communities. The most important development in this direction is introduction of the HR-calculus, which provides the required mathematical foundation for deriving adaptive information processing techniques directly in the quaternion domain. In this article, the foundations of the HR-calculus are revised and the required tools for deriving adaptive learning techniques suitable for dealing with quaternion-valued signals, such as the gradient operator, chain and product derivative rules, and Taylor series expansion are presented. This serves to establish the most important applications of adaptive information processing in the quaternion domain for both single-node and multi-node formulations. The article is supported by Supplementary Material, which will be referred to as SM.
    摘要 自问世以来,四元数及其除法代数在三维空间旋转/方向的建模中展现出显著优势,其应用从最初的电磁场理论表述一直延伸到量子场理论的基础。尽管四元数在建模真实世界现象方面具有令人瞩目的多样性,专门针对四元数值信号设计的自适应信息处理技术直到最近才受到机器学习、信号处理和控制领域的关注。这一方向最重要的进展是 HR-calculus 的提出,它为直接在四元数域中推导自适应信息处理技术提供了必需的数学基础。本文回顾了 HR-calculus 的基础,并给出了推导适用于四元数值信号的自适应学习技术所需的工具,如梯度算子、链式与乘积求导法则以及泰勒级数展开,从而确立了自适应信息处理在四元数域中单节点与多节点形式下的最重要应用。本文附有补充材料,文中以 SM 指代。
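
For readers new to quaternion algebra, a minimal NumPy sketch of the Hamilton product, the basic operation underlying the rotation modelling discussed above (a rotation acts on a vector v embedded as (0, v) via q ⊗ v ⊗ q*).

```python
import numpy as np

def qmul(p, q):
    """Hamilton product of quaternions p = (w, x, y, z) and q = (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([
        pw*qw - px*qx - py*qy - pz*qz,   # real part
        pw*qx + px*qw + py*qz - pz*qy,   # i component
        pw*qy - px*qz + py*qw + pz*qx,   # j component
        pw*qz + px*qy - py*qx + pz*qw,   # k component
    ])

def qconj(q):
    """Quaternion conjugate q* = (w, -x, -y, -z)."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

# Rotate vector v by a unit quaternion q: embed v as (0, v), then q v q*.
q = np.array([np.cos(np.pi / 8), np.sin(np.pi / 8), 0.0, 0.0])  # 45 deg about x-axis
v = np.array([0.0, 0.0, 1.0, 0.0])
v_rot = qmul(qmul(q, v), qconj(q))
```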

Asynchronous Wireless Federated Learning with Probabilistic Client Selection

  • paper_url: http://arxiv.org/abs/2311.16741
  • repo_url: None
  • paper_authors: Jiarong Yang, Yuan Liu, Fangjiong Chen, Wen Chen, Changle Li
  • for: 这篇论文旨在解决联邦学习(Federated Learning,FL)中的落后者(stragglers)问题,并提出了一种基于概率性客户端选择的异步方案。
  • methods: 本论文通过概率性客户端选择与带宽分配来解决落后者问题,并构建了一个优化问题,联合寻找最优的客户端选择概率与带宽分配,以在异步 FL 的收敛速率与移动端能耗之间取得平衡。
  • results: 实验结果显示,所提方法能够有效缓解 FL 中的落后者问题,且性能优于传统方案。
    Abstract Federated learning (FL) is a promising distributed learning framework where distributed clients collaboratively train a machine learning model coordinated by a server. To tackle the stragglers issue in asynchronous FL, we consider that each client keeps local updates and probabilistically transmits the local model to the server at arbitrary times. We first derive the (approximate) expression for the convergence rate based on the probabilistic client selection. Then, an optimization problem is formulated to trade off the convergence rate of asynchronous FL and mobile energy consumption by joint probabilistic client selection and bandwidth allocation. We develop an iterative algorithm to solve the non-convex problem globally optimally. Experiments demonstrate the superiority of the proposed approach compared with the traditional schemes.
    摘要 联邦学习(FL)是一种有前途的分布式学习框架,多个分布式客户端在服务器的协调下协同训练一个机器学习模型。为解决异步 FL 中的落后者问题,我们考虑让每个客户端保留本地更新,并以一定概率在任意时刻将本地模型传输到服务器。我们首先推导出基于概率性客户端选择的收敛速率的(近似)表达式;随后构建了一个优化问题,通过联合概率性客户端选择与带宽分配,在异步 FL 的收敛速率与移动端能耗之间进行权衡;并开发了一种迭代算法,可对该非凸问题求得全局最优解。实验结果表明,所提方法优于传统方案。
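
A toy NumPy simulation of the transmission model described above: each client keeps updating locally and, with its own probability, pushes its model to the server at arbitrary rounds. The local objective, server mixing weight, and probabilities are all illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clients, dim, rounds = 10, 5, 100
global_w = np.zeros(dim)
local_w = [global_w.copy() for _ in range(n_clients)]
p = rng.uniform(0.2, 0.9, n_clients)       # per-client transmission probabilities

for t in range(rounds):
    for i in range(n_clients):
        # Each client keeps training on its own objective (noisy gradient stand-in) ...
        grad = local_w[i] - rng.normal(size=dim)
        local_w[i] -= 0.1 * grad
        # ... and transmits its model at arbitrary times with probability p[i].
        if rng.random() < p[i]:
            global_w = 0.9 * global_w + 0.1 * local_w[i]   # asynchronous server mixing
            local_w[i] = global_w.copy()                   # client re-syncs on contact
```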

Sluggish and Chemically-Biased Interstitial Diffusion in Concentrated Solid Solution Alloys: Mechanisms and Methods

  • paper_url: http://arxiv.org/abs/2311.16727
  • repo_url: https://github.com/jeremy1189/interstitial-diffusion
  • paper_authors: Biao Xu, Haijun Fu, Shasha Huang, Shihua Ma, Yaoxu Xiong, Jun Zhang, Xuepeng Xiang, Wenyu Lu, Ji-Jung Kai, Shijun Zhao
  • for: 这个论文主要研究了 Fe-Ni 浓固溶体合金(CSAs)中的迟滞间隙扩散和化学偏向扩散。
  • methods: 这个论文结合机器学习(ML)和动力学蒙特卡洛(kMC)方法,其中 ML 用于在线准确、高效地预测迁移能垒。
  • results: 研究发现,Fe-Ni 合金中观测到的迟滞扩散和“Ni-Ni-Ni”偏向扩散源于一种独特的“势垒锁定(Barrier Lock)”机制,而“Fe-Fe-Fe”偏向扩散则受“组分主导(Component Dominance)”机制影响。受这些机制启发,论文提出了一种实用的 AvgS-kMC 方法,仅依靠迁移模式的平均能垒即可快速、便捷地确定间隙介导的扩散率;该方法还可与差分进化算法结合,实现对迟滞扩散性质的逆向设计优化。
    Abstract Interstitial diffusion is a pivotal process that governs the phase stability and irradiation response of materials in non-equilibrium conditions. In this work, we study sluggish and chemically-biased interstitial diffusion in Fe-Ni concentrated solid solution alloys (CSAs) by combining machine learning (ML) and kinetic Monte Carlo (kMC), where ML is used to accurately and efficiently predict the migration energy barriers on-the-fly. The ML-kMC reproduces the diffusivity that was reported by molecular dynamics results at high temperatures. With this powerful tool, we find that the observed sluggish diffusion and the "Ni-Ni-Ni"-biased diffusion in Fe-Ni alloys are ascribed to a unique "Barrier Lock" mechanism, whereas the "Fe-Fe-Fe"-biased diffusion is influenced by a "Component Dominance" mechanism. Inspired by the mentioned mechanisms, a practical AvgS-kMC method is proposed for conveniently and swiftly determining interstitial-mediated diffusivity by only relying on the mean energy barriers of migration patterns. Combining the AvgS-kMC with the differential evolutionary algorithm, an inverse design strategy for optimizing sluggish diffusion properties is applied to emphasize the crucial role of favorable migration patterns.
    摘要 间隙扩散是决定材料在非平衡条件下相稳定性与辐照响应的关键过程。在这项工作中,我们结合机器学习(ML)与动力学蒙特卡洛(kMC)方法,研究 Fe-Ni 浓固溶体合金(CSAs)中的迟滞及化学偏向间隙扩散,其中 ML 用于在线准确、高效地预测迁移能垒。ML-kMC 重现了分子动力学在高温下报告的扩散率。借助这一强大工具,我们发现 Fe-Ni 合金中观测到的迟滞扩散和“Ni-Ni-Ni”偏向扩散源于一种独特的“势垒锁定(Barrier Lock)”机制,而“Fe-Fe-Fe”偏向扩散则受“组分主导(Component Dominance)”机制影响。受上述机制启发,我们提出了一种实用的 AvgS-kMC 方法,仅依靠迁移模式的平均迁移能垒即可便捷、快速地确定间隙介导的扩散率。将 AvgS-kMC 与差分进化算法结合,我们进一步给出了一种优化迟滞扩散性质的逆向设计策略,凸显了有利迁移模式的关键作用。
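
A minimal NumPy sketch of one rejection-free kMC step driven by (e.g., ML-predicted) migration barriers; the attempt frequency, temperature, and barrier values are illustrative. The AvgS-kMC idea would feed the mean barrier of each migration pattern into the same rate expression.

```python
import numpy as np

kB, T, nu0 = 8.617e-5, 600.0, 1e13      # Boltzmann const (eV/K), temperature (K), attempt freq (1/s)

def kmc_step(barriers_eV, rng):
    """One rejection-free kMC step: pick a migration event with probability
    proportional to its Arrhenius rate and advance the clock exponentially."""
    rates = nu0 * np.exp(-np.asarray(barriers_eV) / (kB * T))
    total = rates.sum()
    event = rng.choice(len(rates), p=rates / total)
    dt = rng.exponential(1.0 / total)
    return event, dt

rng = np.random.default_rng(0)
barriers = [0.35, 0.42, 0.50, 0.61]     # stand-ins for ML-predicted jump barriers
event, dt = kmc_step(barriers, rng)
```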

Sinkhorn Flow: A Continuous-Time Framework for Understanding and Generalizing the Sinkhorn Algorithm

  • paper_url: http://arxiv.org/abs/2311.16706
  • repo_url: None
  • paper_authors: Mohammad Reza Karimi, Ya-Ping Hsieh, Andreas Krause
  • for: 机器学习中的许多问题都可以表述为在概率测度空间上求解熵正则化最优传输问题,其经典方法是 Sinkhorn 迭代。
  • methods: 在镜像下降框架下重新审视 Sinkhorn 算法,从而引入其连续时间类比,并借助经典优化理论的洞见。
  • results: 提出了 Sinkhorn 算法的连续时间类比,由此导出对噪声和偏差具有鲁棒性的新型 Sinkhorn 变体,并统一了机器学习与数学中近期发现的若干动力学。
    Abstract Many problems in machine learning can be formulated as solving entropy-regularized optimal transport on the space of probability measures. The canonical approach involves the Sinkhorn iterates, renowned for their rich mathematical properties. Recently, the Sinkhorn algorithm has been recast within the mirror descent framework, thus benefiting from classical optimization theory insights. Here, we build upon this result by introducing a continuous-time analogue of the Sinkhorn algorithm. This perspective allows us to derive novel variants of Sinkhorn schemes that are robust to noise and bias. Moreover, our continuous-time dynamics not only generalize but also offer a unified perspective on several recently discovered dynamics in machine learning and mathematics, such as the "Wasserstein mirror flow" of (Deb et al. 2023) or the "mean-field Schr\"odinger equation" of (Claisse et al. 2023).
    摘要 机器学习中的许多问题都可以表述为在概率测度空间上求解熵正则化最优传输问题。经典方法是 Sinkhorn 迭代,它以丰富的数学性质著称。近期,Sinkhorn 算法被重新置于镜像下降框架之中,从而得以借助经典优化理论的洞见。我们在这一结果的基础上,引入了 Sinkhorn 算法的连续时间类比。这一视角使我们能够推导出对噪声和偏差具有鲁棒性的新型 Sinkhorn 方案。此外,我们的连续时间动力学不仅推广了 Sinkhorn 方案,还为机器学习与数学中近期发现的若干动力学提供了统一视角,例如 (Deb et al. 2023) 的“Wasserstein mirror flow”以及 (Claisse et al. 2023) 的“mean-field Schrödinger equation”。
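
For reference, the classic discrete Sinkhorn iterates whose continuous-time limit the paper studies; a minimal NumPy sketch for histograms a, b and cost matrix C.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, iters=200):
    """Entropy-regularized OT between histograms a (m,) and b (n,) with
    cost matrix C (m, n); returns the transport plan."""
    K = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):        # alternate marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]
```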

PyTorch Geometric High Order: A Unified Library for High Order Graph Neural Network

  • paper_url: http://arxiv.org/abs/2311.16670
  • repo_url: https://github.com/graphpku/pygho
  • paper_authors: Xiyuan Wang, Muhan Zhang
  • for: 本研究旨在提供一个简单易用的高阶图神经网络(HOGNN)库,即PyTorch Geometric High Order(PyGHO),用于扩展PyTorch Geometric(PyG)。
  • methods: PyGHO 为各类高阶 GNN 方法(包括子图 GNN 和 k-WL GNN)提供统一框架,包含面向节点元组的数据结构、数据处理工具以及一组高阶 GNN 算子。
  • results: 对于实际任务,PyGHO 可将 HOGNN 的运行速度最高提升 50%,并将实现代码量减少一个数量级。
    Abstract We introduce PyTorch Geometric High Order (PyGHO), a library for High Order Graph Neural Networks (HOGNNs) that extends PyTorch Geometric (PyG). Unlike ordinary Message Passing Neural Networks (MPNNs) that exchange messages between nodes, HOGNNs, encompassing subgraph GNNs and k-WL GNNs, encode node tuples, a method previously lacking a standardized framework and often requiring complex coding. PyGHO's main objective is to provide an unified and user-friendly interface for various HOGNNs. It accomplishes this through streamlined data structures for node tuples, comprehensive data processing utilities, and a flexible suite of operators for high-order GNN methodologies. In this work, we present a detailed in-depth of PyGHO and compare HOGNNs implemented with PyGHO with their official implementation on real-world tasks. PyGHO achieves up to $50\%$ acceleration and reduces the code needed for implementation by an order of magnitude. Our library is available at \url{https://github.com/GraphPKU/PygHO}.
    摘要 我们介绍 PyTorch Geometric High Order(PyGHO),一个用于高阶图神经网络(HOGNN)的库,它对 PyTorch Geometric(PyG)进行了扩展。与在节点之间交换消息的普通消息传递神经网络(MPNN)不同,HOGNN(包括子图 GNN 和 k-WL GNN)对节点元组进行编码,这一做法此前缺乏标准化框架,往往需要复杂的编码实现。PyGHO 的主要目标是为各类 HOGNN 提供统一且易用的接口,它通过精简的节点元组数据结构、完善的数据处理工具以及一套灵活的高阶 GNN 算子来实现这一目标。在这项工作中,我们对 PyGHO 进行了深入介绍,并在真实任务上将基于 PyGHO 实现的 HOGNN 与其官方实现进行了比较:PyGHO 最高可带来 50% 的加速,并将实现所需的代码量减少一个数量级。我们的库可在 https://github.com/GraphPKU/PygHO 获取。

Pseudo-Likelihood Inference

  • paper_url: http://arxiv.org/abs/2311.16656
  • repo_url: https://github.com/FTurci/plm-brain
  • paper_authors: Theo Gruner, Boris Belousov, Fabio Muratore, Daniel Palenicek, Jan Peters
  • for: 这个论文主要针对的是贝叶斯系统辨识中似然难以计算的问题。
  • methods: 这篇论文提出了一种新的伪似然推断(Pseudo-Likelihood Inference,PLI)方法,它将神经网络逼近引入近似贝叶斯计算(ABC),使其在高维问题上能够与序贯神经后验估计(SNPE)相竞争。
  • results: 该方法利用积分概率度量引入带自适应带宽的光滑似然核,带宽依据信息论信赖域进行更新,使得 PLI 可以通过梯度下降优化神经后验,不依赖摘要统计量,并支持多个观测作为输入。与 SNPE 相比,PLI 在数据更充足时表现更优。论文在四个经典 SBI 基准任务和一个高度动态的物理系统上进行了评估,结果表明 PLI 在随机模拟和多峰后验地形上具有特别优势。
    Abstract Simulation-Based Inference (SBI) is a common name for an emerging family of approaches that infer the model parameters when the likelihood is intractable. Existing SBI methods either approximate the likelihood, such as Approximate Bayesian Computation (ABC) or directly model the posterior, such as Sequential Neural Posterior Estimation (SNPE). While ABC is efficient on low-dimensional problems, on higher-dimensional tasks, it is generally outperformed by SNPE, which leverages function approximation. In this paper, we propose Pseudo-Likelihood Inference (PLI), a new method that brings neural approximation into ABC, making it competitive on challenging Bayesian system identification tasks. By utilizing integral probability metrics, we introduce a smooth likelihood kernel with an adaptive bandwidth that is updated based on information-theoretic trust regions. Thanks to this formulation, our method (i) allows for optimizing neural posteriors via gradient descent, (ii) does not rely on summary statistics, and (iii) enables multiple observations as input. In comparison to SNPE, it leads to improved performance when more data is available. The effectiveness of PLI is evaluated on four classical SBI benchmark tasks and on a highly dynamic physical system, showing particular advantages on stochastic simulations and multi-modal posterior landscapes.
    摘要 基于模拟的推断(SBI)是一类新兴方法的统称,用于在似然不可计算时推断模型参数。现有的 SBI 方法要么近似似然,如近似贝叶斯计算(ABC);要么直接对后验建模,如序贯神经后验估计(SNPE)。ABC 在低维问题上较为高效,但在高维任务上通常逊色于利用函数逼近的 SNPE。在这篇论文中,我们提出了伪似然推断(PLI),一种将神经网络逼近引入 ABC 的新方法,使其在具有挑战性的贝叶斯系统辨识任务上具备竞争力。我们利用积分概率度量,引入带自适应带宽的光滑似然核,带宽依据信息论信赖域进行更新。得益于这一构造,我们的方法 (i) 允许通过梯度下降优化神经后验,(ii) 不依赖摘要统计量,(iii) 支持多个观测作为输入。与 SNPE 相比,它在数据更充足时性能更佳。我们在四个经典 SBI 基准任务和一个高度动态的物理系统上评估了 PLI 的有效性,结果显示其在随机模拟和多峰后验地形上具有特别优势。
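
A rough NumPy sketch of the smooth-kernel surrogate idea: the pseudo-likelihood averages a Gaussian kernel between the observed data and simulator draws. The fixed bandwidth here is a simplification (PLI instead adapts it via information-theoretic trust regions), and `simulate` is a user-supplied simulator.

```python
import numpy as np

def pseudo_log_likelihood(theta, x_obs, simulate, rng, n_sim=64, bandwidth=0.5):
    """Smooth, differentiable-in-spirit ABC surrogate:
    log of the mean Gaussian kernel between simulator draws and x_obs."""
    sims = np.stack([simulate(theta, rng) for _ in range(n_sim)])
    sq = ((sims - x_obs) ** 2).sum(axis=-1)          # squared distances to the observation
    kernel_vals = np.exp(-0.5 * sq / bandwidth**2)   # smooth kernel instead of a hard ABC cutoff
    return np.log(kernel_vals.mean() + 1e-300)
```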

Elucidating Discrepancy in Explanations of Predictive Models Developed using EMR

  • paper_url: http://arxiv.org/abs/2311.16654
  • repo_url: None
  • paper_authors: Aida Brankovic, Wenjie Huang, David Cook, Sankalp Khanna, Konstanty Bialkowski
  • for: 这项研究旨在检验机器学习算法在电子医疗记录数据上的临床决策支持算法是否符合专业医疗知识。
  • methods: 这项研究使用当今最佳实践的解释性人工智能方法对临床决策支持算法进行分析,并讨论了从临床和技术角度来解释出现的差异。
  • results: 研究发现,当前的解释性人工智能方法与专业医疗知识之间存在一定的不一致,并且从临床和技术角度来解释出现的差异。
    Abstract The lack of transparency and explainability hinders the clinical adoption of Machine learning (ML) algorithms. While explainable artificial intelligence (XAI) methods have been proposed, little research has focused on the agreement between these methods and expert clinical knowledge. This study applies current state-of-the-art explainability methods to clinical decision support algorithms developed for Electronic Medical Records (EMR) data to analyse the concordance between these factors and discusses causes for identified discrepancies from a clinical and technical perspective. Important factors for achieving trustworthy XAI solutions for clinical decision support are also discussed.
    摘要 机器学习(ML)算法缺乏透明性和可解释性,阻碍了其临床应用。虽然已有多种可解释人工智能(XAI)方法被提出,但针对这些方法与专家临床知识之间一致性的研究仍然很少。本研究将当前最先进的可解释性方法应用于基于电子病历(EMR)数据开发的临床决策支持算法,分析两者之间的一致程度,并从临床和技术两个角度讨论所发现差异的成因。此外,本文还讨论了构建可信赖的临床决策支持 XAI 解决方案的关键因素。

Rethinking Backdoor Attacks on Dataset Distillation: A Kernel Method Perspective

  • paper_url: http://arxiv.org/abs/2311.16646
  • repo_url: None
  • paper_authors: Ming-Yu Chung, Sheng-Yen Chou, Chia-Mu Yu, Pin-Yu Chen, Sy-Yen Kuo, Tsung-Yi Ho
  • for: 研究数据集蒸馏在提升深度学习数据效率中的安全性问题
  • methods: 基于核方法的数据集蒸馏理论分析与触发模式生成方法
  • results: 通过基于优化的触发设计框架,实现了对数据集蒸馏有效且能抵抗常规检测与缓解手段的后门攻击。
    Abstract Dataset distillation offers a potential means to enhance data efficiency in deep learning. Recent studies have shown its ability to counteract backdoor risks present in original training samples. In this study, we delve into the theoretical aspects of backdoor attacks and dataset distillation based on kernel methods. We introduce two new theory-driven trigger pattern generation methods specialized for dataset distillation. Following a comprehensive set of analyses and experiments, we show that our optimization-based trigger design framework informs effective backdoor attacks on dataset distillation. Notably, datasets poisoned by our designed trigger prove resilient against conventional backdoor attack detection and mitigation methods. Our empirical results validate that the triggers developed using our approaches are proficient at executing resilient backdoor attacks.
    摘要 数据集蒸馏为提升深度学习的数据效率提供了一种潜在手段。近期研究表明,它能够抵消原始训练样本中存在的后门风险。在本研究中,我们基于核方法深入探讨了后门攻击与数据集蒸馏的理论层面,并提出了两种专为数据集蒸馏设计、由理论驱动的触发模式生成方法。经过全面的分析与实验,我们表明所提出的基于优化的触发设计框架能够对数据集蒸馏实施有效的后门攻击。值得注意的是,被我们设计的触发器污染的数据集能够抵抗传统的后门攻击检测与缓解方法。实验结果验证了利用我们的方法生成的触发器能够执行具有鲁棒性的后门攻击。

Opening the Black Box: Towards inherently interpretable energy data imputation models using building physics insight

  • paper_url: http://arxiv.org/abs/2311.16632
  • repo_url: https://github.com/antonio955/pi_dae
  • paper_authors: Antonio Liguori, Matias Quintana, Chun Fu, Clayton Miller, Jérôme Frisch, Christoph van Treeck
  • for: 本研究旨在提出物理信息去噪自编码器(PI-DAE)方法,用于填补商业建筑能耗数据中的缺失值。
  • methods: 本方法基于去噪自编码器(DAE)结构,并在损失函数中引入物理启发的软约束。
  • results: 研究表明,在不同的数据缺失率下,PI-DAE 具有更强的鲁棒性和可解释性,并且其优化得到的物理系数具有实际的物理意义。
    Abstract Missing data are frequently observed by practitioners and researchers in the building energy modeling community. In this regard, advanced data-driven solutions, such as Deep Learning methods, are typically required to reflect the non-linear behavior of these anomalies. As an ongoing research question related to Deep Learning, a model's applicability to limited data settings can be explored by introducing prior knowledge in the network. This same strategy can also lead to more interpretable predictions, hence facilitating the field application of the approach. For that purpose, the aim of this paper is to propose the use of Physics-informed Denoising Autoencoders (PI-DAE) for missing data imputation in commercial buildings. In particular, the presented method enforces physics-inspired soft constraints to the loss function of a Denoising Autoencoder (DAE). In order to quantify the benefits of the physical component, an ablation study between different DAE configurations is conducted. First, three univariate DAEs are optimized separately on indoor air temperature, heating, and cooling data. Then, two multivariate DAEs are derived from the previous configurations. Eventually, a building thermal balance equation is coupled to the last multivariate configuration to obtain PI-DAE. Additionally, two commonly used benchmarks are employed to support the findings. It is shown how introducing physical knowledge in a multivariate Denoising Autoencoder can enhance the inherent model interpretability through the optimized physics-based coefficients. While no significant improvement is observed in terms of reconstruction error with the proposed PI-DAE, its enhanced robustness to varying rates of missing data and the valuable insights derived from the physics-based coefficients create opportunities for wider applications within building systems and the built environment.
    摘要 缺失数据是建筑能耗建模领域从业者和研究者经常遇到的问题。为反映这类异常数据的非线性特征,通常需要采用深度学习等先进的数据驱动方法。作为深度学习的一个持续研究问题,可以通过在网络中引入先验知识来考察模型在数据有限情形下的适用性;这一策略同时也能带来更可解释的预测,从而便于方法的现场应用。为此,本文提出使用物理信息去噪自编码器(PI-DAE)来填补商业建筑中的缺失数据。具体而言,所提方法将物理启发的软约束引入去噪自编码器(DAE)的损失函数。为量化物理分量的收益,我们对不同的 DAE 配置进行了消融研究:首先分别在室内空气温度、供热和制冷数据上优化三个单变量 DAE,随后由这些配置导出两个多变量 DAE,最后将建筑热平衡方程与最后一个多变量配置耦合,得到 PI-DAE。此外,我们还采用两个常用的基准方法来支撑研究结论。结果表明,在多变量去噪自编码器中引入物理知识,可以通过优化得到的物理系数增强模型固有的可解释性。虽然 PI-DAE 在重构误差上没有显著改进,但它对不同缺失率的更强鲁棒性,以及从物理系数中获得的有价值洞见,为其在建筑系统与建成环境中的更广泛应用创造了机会。
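
A sketch of what a physics-informed DAE objective can look like in PyTorch: reconstruction error plus a soft penalty on the residual of a simple thermal-balance relation. The specific balance form and the coefficient names (a, b, c) are assumptions standing in for the building thermal-balance equation used in the paper.

```python
import torch

def pi_dae_loss(x_recon, x_true, T_in, q_heat, q_cool, phys_coeffs, lam=0.1):
    """Reconstruction loss plus a physics-inspired soft constraint.
    T_in, q_heat, q_cool: (batch, time) series; phys_coeffs: learnable a, b, c."""
    a, b, c = phys_coeffs                       # e.g., torch.nn.Parameter tensors
    recon = torch.mean((x_recon - x_true) ** 2)
    # Illustrative balance: dT/dt ~ a*q_heat - b*q_cool - c*T (an assumption).
    dT = T_in[:, 1:] - T_in[:, :-1]
    residual = dT - (a * q_heat[:, :-1] - b * q_cool[:, :-1] - c * T_in[:, :-1])
    return recon + lam * torch.mean(residual ** 2)
# After training, the fitted a, b, c are the interpretable physics-based coefficients.
```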

Outfit Completion via Conditional Set Transformation

  • paper_url: http://arxiv.org/abs/2311.16630
  • repo_url: None
  • paper_authors: Takuma Nakamura, Yuki Saito, Ryosuke Goto
  • for: solves the outfit completion problem as a set retrieval task
  • methods: proposes a novel framework with conditional set transformation architecture and compatibility-based regularization method
  • results: outperforms existing approaches in terms of accuracy, condition satisfaction, and compatibility of completion results
    Abstract In this paper, we formulate the outfit completion problem as a set retrieval task and propose a novel framework for solving this problem. The proposal includes a conditional set transformation architecture with deep neural networks and a compatibility-based regularization method. The proposed method utilizes a map with permutation-invariant for the input set and permutation-equivariant for the condition set. This allows retrieving a set that is compatible with the input set while reflecting the properties of the condition set. In addition, since this structure outputs the element of the output set in a single inference, it can achieve a scalable inference speed with respect to the cardinality of the output set. Experimental results on real data reveal that the proposed method outperforms existing approaches in terms of accuracy of the outfit completion task, condition satisfaction, and compatibility of completion results.
    摘要 在这篇论文中,我们将穿搭补全问题形式化为集合检索任务,并提出了一个求解该问题的新框架。该框架包括一个基于深度神经网络的条件集合变换架构,以及一种基于兼容性的正则化方法。所提方法使用一个对输入集合具有置换不变性、对条件集合具有置换同变性的映射,从而能够检索出与输入集合兼容、同时反映条件集合属性的集合。此外,由于该结构在单次推理中即可输出结果集合中的元素,其推理速度可随输出集合基数扩展。在真实数据上的实验结果表明,所提方法在穿搭补全任务的准确率、条件满足度以及补全结果的兼容性方面均优于现有方法。
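
A minimal DeepSets-style sketch of the permutation-invariant ingredient mentioned above; the paper's conditional set transformation additionally treats the condition set equivariantly, which this simplified encoder does not attempt to reproduce.

```python
import torch
import torch.nn as nn

class SetEncoder(nn.Module):
    """Permutation-invariant set summary: per-item phi, sum-pool, then rho."""
    def __init__(self, d_in, d_hidden=64, d_out=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_hidden))
        self.rho = nn.Linear(d_hidden, d_out)

    def forward(self, items):                         # items: (batch, set_size, d_in)
        return self.rho(self.phi(items).sum(dim=1))   # sum-pool makes it order-invariant
```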

Symmetry-regularized neural ordinary differential equations

  • paper_url: http://arxiv.org/abs/2311.16628
  • repo_url: None
  • paper_authors: Wenbo Hao
  • for: 这个论文旨在将神经网络中隐藏状态的动态解释为常微分方程,从而在连续时间框架中刻画系统动力学。
  • methods: 该论文利用与模型相关的 ODE 和 PDE 的连续李对称推导守恒律,并将其加入损失函数,使模型在训练过程中具有更高的稳定性和鲁棒性。
  • results: 论文以隐藏状态变化率为余弦函数的玩具模型为例,展示了识别李对称、推导守恒律并构建新损失函数的完整流程。
    Abstract Neural Ordinary Differential Equations (Neural ODEs) is a class of deep neural network models that interpret the hidden state dynamics of neural networks as an ordinary differential equation, thereby capable of capturing system dynamics in a continuous time framework. In this work, I integrate symmetry regularization into Neural ODEs. In particular, I use continuous Lie symmetry of ODEs and PDEs associated with the model to derive conservation laws and add them to the loss function, making it physics-informed. This incorporation of inherent structural properties into the loss function could significantly improve robustness and stability of the model during training. To illustrate this method, I employ a toy model that utilizes a cosine rate of change in the hidden state, showcasing the process of identifying Lie symmetries, deriving conservation laws, and constructing a new loss function.
    摘要 神经常微分方程(Neural ODEs)是一类深度神经网络模型,它将神经网络中隐藏状态的动态解释为常微分方程,因此能够在连续时间框架中刻画系统动力学。在这项工作中,我将对称性正则化引入 Neural ODEs:具体而言,利用与模型相关的 ODE 和 PDE 的连续李对称推导守恒律,并将其加入损失函数,使之成为物理信息化的目标。将这类内在结构性质纳入损失函数,有望显著提升模型训练过程中的鲁棒性和稳定性。为说明该方法,我采用一个隐藏状态变化率为余弦函数的玩具模型,展示了识别李对称、推导守恒律并构建新损失函数的完整流程。
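
For the toy model dh/dt = cos(t), the quantity I(t) = h(t) - sin(t) is conserved along exact trajectories, so a symmetry-derived regularizer can penalize its variation along the integrated trajectory; a minimal PyTorch sketch, with the penalty weight left as a tunable assumption:

```python
import torch

def conservation_penalty(h_traj, t_grid):
    """Penalize variation of the conserved quantity I(t) = h(t) - sin(t)
    for the toy dynamics dh/dt = cos(t); h_traj and t_grid share shape (steps,)."""
    invariant = h_traj - torch.sin(t_grid)
    return torch.mean((invariant - invariant.mean()) ** 2)

# total_loss = data_loss + lam * conservation_penalty(h_traj, t_grid)
```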

Gaussian Processes for Monitoring Air-Quality in Kampala

  • paper_url: http://arxiv.org/abs/2311.16625
  • repo_url: https://github.com/claramst/gps-kampala-airquality
  • paper_authors: Clara Stoddart, Lauren Shrack, Richard Sserunjogi, Usman Abdul-Ganiy, Engineer Bainomugisha, Deo Okure, Ruth Misener, Jose Pablo Folch, Ruby Sedgwick
  • for: 该论文旨在研究使用高斯过程对当前空气质量进行实时估计(nowcasting),并对传感器位置的未来空气质量进行预测。
  • methods: 该论文使用高斯过程进行实时估计与预测,并比较了剔除异常值的效果、不同的核函数以及额外输入变量。
  • results: 该论文基于 AirQo 传感器网络的数据表明,高斯过程可用于无传感器地点的空气质量实时估计以及传感器位置的未来预测,并比较了两种稀疏近似方法以处理数据集中的大量时间数据。
    Abstract Monitoring air pollution is of vital importance to the overall health of the population. Unfortunately, devices that can measure air quality can be expensive, and many cities in low and middle-income countries have to rely on a sparse allocation of them. In this paper, we investigate the use of Gaussian Processes for both nowcasting the current air-pollution in places where there are no sensors and forecasting the air-pollution in the future at the sensor locations. In particular, we focus on the city of Kampala in Uganda, using data from AirQo's network of sensors. We demonstrate the advantage of removing outliers, compare different kernel functions and additional inputs. We also compare two sparse approximations to allow for the large amounts of temporal data in the dataset.
    摘要 监测空气污染对人群整体健康至关重要。然而,能够测量空气质量的设备可能十分昂贵,许多中低收入国家的城市只能依赖稀疏布设的少量设备。在这篇论文中,我们研究利用高斯过程,一方面对没有传感器的地点进行当前空气污染的实时估计,另一方面对传感器位置的未来空气污染进行预测。我们特别关注乌干达的坎帕拉市,使用来自 AirQo 传感器网络的数据。我们展示了剔除异常值带来的收益,比较了不同的核函数和额外输入变量,并比较了两种稀疏近似方法,以处理数据集中的大量时间数据。
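
A compact scikit-learn sketch of the kernel comparison described above: fit GPs with different kernels on toy sensor readings and nowcast at an unmonitored location. The coordinates and readings are synthetic stand-ins for the AirQo data.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel

rng = np.random.default_rng(0)
# Synthetic stand-in for readings at (longitude, latitude) pairs near Kampala.
X = np.column_stack([rng.uniform(32.5, 32.7, 50), rng.uniform(0.25, 0.40, 50)])
y = 40 + 10 * np.sin(X[:, 0] * 50) + rng.normal(0, 2, 50)    # fake PM2.5 values

for kernel in [RBF(length_scale=0.05), Matern(length_scale=0.05, nu=1.5)]:
    gp = GaussianProcessRegressor(kernel=kernel + WhiteKernel(), normalize_y=True)
    gp.fit(X, y)
    mu, sd = gp.predict([[32.6, 0.32]], return_std=True)     # nowcast at an unmonitored site
    print(type(kernel).__name__, float(mu[0]), float(sd[0]))
```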

Beyond Labels: Advancing Cluster Analysis with the Entropy of Distance Distribution (EDD)

  • paper_url: http://arxiv.org/abs/2311.16621
  • repo_url: None
  • paper_authors: Claus Metzner, Achim Schilling, Patrick Krauss
  • for: 本研究旨在开发一种新的无标签聚类分析方法,以便更好地刻画高维数据集中不同程度的聚类结构。
  • methods: 本方法通过对成对距离分布的信息熵进行量化,无需预先给定标签,即可度量数据集的聚类倾向。
  • results: 实验结果表明,随着簇宽度增大(从彼此分离到相互重叠),EDD 值单调上升,体现了该方法在检测不同程度聚类时的敏感性和准确性。
    Abstract In the evolving landscape of data science, the accurate quantification of clustering in high-dimensional data sets remains a significant challenge, especially in the absence of predefined labels. This paper introduces a novel approach, the Entropy of Distance Distribution (EDD), which represents a paradigm shift in label-free clustering analysis. Traditional methods, reliant on discrete labels, often struggle to discern intricate cluster patterns in unlabeled data. EDD, however, leverages the characteristic differences in pairwise point-to-point distances to discern clustering tendencies, independent of data labeling. Our method employs the Shannon information entropy to quantify the 'peakedness' or 'flatness' of distance distributions in a data set. This entropy measure, normalized against its maximum value, effectively distinguishes between strongly clustered data (indicated by pronounced peaks in distance distribution) and more homogeneous, non-clustered data sets. This label-free quantification is resilient against global translations and permutations of data points, and with an additional dimension-wise z-scoring, it becomes invariant to data set scaling. We demonstrate the efficacy of EDD through a series of experiments involving two-dimensional data spaces with Gaussian cluster centers. Our findings reveal a monotonic increase in the EDD value with the widening of cluster widths, moving from well-separated to overlapping clusters. This behavior underscores the method's sensitivity and accuracy in detecting varying degrees of clustering. EDD's potential extends beyond conventional clustering analysis, offering a robust, scalable tool for unraveling complex data structures without reliance on pre-assigned labels.
    摘要 在不断演进的数据科学领域中,准确量化高维数据集中的聚类程度仍是一项重大挑战,尤其是在缺乏预定义标签的情况下。本文提出了一种新方法:距离分布熵(EDD),它代表了无标签聚类分析的一种范式转变。依赖离散标签的传统方法往往难以辨识无标签数据中复杂的聚类模式,而 EDD 则利用成对点间距离的特征差异来辨识聚类倾向,与数据标注无关。我们的方法使用香农信息熵来量化数据集中距离分布的“峰态”或“平坦度”,并以其最大值进行归一化,从而有效区分强聚类数据(距离分布呈现明显峰值)与更均匀的非聚类数据集。这种无标签量化对数据点的全局平移和置换具有不变性,再配合按维度的 z 标准化,还可对数据集缩放保持不变。我们通过一系列以高斯簇中心构造的二维数据空间实验验证了 EDD 的有效性:随着簇宽度从彼此分离增大到相互重叠,EDD 值单调上升,凸显了该方法在检测不同聚类程度时的敏感性与准确性。EDD 的潜力超越了传统聚类分析,为在不依赖预先指定标签的情况下解析复杂数据结构提供了一个鲁棒、可扩展的工具。
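
A direct NumPy/SciPy sketch of the quantity described above: the normalized Shannon entropy of the pairwise-distance histogram, with per-dimension z-scoring for scale invariance. The bin count is an implementation choice the abstract does not pin down.

```python
import numpy as np
from scipy.spatial.distance import pdist

def edd(X, bins=64):
    """Entropy of the pairwise-distance distribution, normalized to [0, 1].
    Low values indicate peaked (strongly clustered) distance histograms."""
    Xz = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)   # dimension-wise z-scoring
    d = pdist(Xz)                                          # all pairwise distances
    p, _ = np.histogram(d, bins=bins)
    p = p / p.sum()
    p = p[p > 0]                                           # 0*log(0) := 0
    return -(p * np.log(p)).sum() / np.log(bins)           # normalize by max entropy
```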

Adversarial Distribution Balancing for Counterfactual Reasoning

  • paper_url: http://arxiv.org/abs/2311.16616
  • repo_url: https://github.com/sschrod/adbcr
  • paper_authors: Stefan Schrod, Fabian Sinz, Michael Altenbuchinger
  • for: 这个论文主要是为了解决因果预测模型面临的问题,即只能观察到实际施加的(factual)干预结果,而无法观察到其他可能干预方案(反事实,counterfactuals)的结果。
  • methods: 这篇论文提出了一种名为对抗分布平衡反事实推理(Adversarial Distribution Balancing for Counterfactual Reasoning,ADBCR)的方法,它直接利用反事实的潜在结果估计来消除虚假的因果关系。
  • results: 论文表明,ADBCR 在三个基准数据集上优于当前最先进的方法,并且若在训练过程中纳入未标注的验证数据以更好地适应验证域,其性能还可进一步提升。
    Abstract The development of causal prediction models is challenged by the fact that the outcome is only observable for the applied (factual) intervention and not for its alternatives (the so-called counterfactuals); in medicine we only know patients' survival for the administered drug and not for other therapeutic options. Machine learning approaches for counterfactual reasoning have to deal with both unobserved outcomes and distributional differences due to non-random treatment administration. Unsupervised domain adaptation (UDA) addresses similar issues; one has to deal with unobserved outcomes -- the labels of the target domain -- and distributional differences between source and target domain. We propose Adversarial Distribution Balancing for Counterfactual Reasoning (ADBCR), which directly uses potential outcome estimates of the counterfactuals to remove spurious causal relations. We show that ADBCR outcompetes state-of-the-art methods on three benchmark datasets, and demonstrate that ADBCR's performance can be further improved if unlabeled validation data are included in the training procedure to better adapt the model to the validation domain.
    摘要 因果预测模型的开发面临这样的挑战:结果只能在实际施加的(factual)干预下被观察到,而无法在其替代方案(即所谓的反事实)下观察;在医学中,我们只知道病人在所用药物下的生存情况,而不知道其他治疗选项下的结局。面向反事实推理的机器学习方法必须同时处理未观察到的结果,以及由非随机给药导致的分布差异。无监督域适应(UDA)解决的是类似的问题:既要处理未观察到的结果(目标域的标签),又要处理源域与目标域之间的分布差异。我们提出了对抗分布平衡反事实推理(ADBCR),它直接利用反事实的潜在结果估计来消除虚假的因果关系。我们表明,ADBCR 在三个基准数据集上优于当前最先进的方法,并且若在训练过程中纳入未标注的验证数据以使模型更好地适应验证域,其性能还可进一步提升。

A Multivariate Unimodality Test Harnessing the Dip Statistic of Mahalanobis Distances Over Random Projections

  • paper_url: http://arxiv.org/abs/2311.16614
  • repo_url: None
  • paper_authors: Prodromos Kolyvakis, Aristidis Likas
  • for: 本研究旨在开发一种基于 α-单峰性假设的多变量单峰性检验方法,用于评估多维数据集的单峰性。
  • methods: 本方法将一维单峰性原理加以推广:通过线性随机投影将多维数据投影到一维,并结合点到点距离来评估单峰性。
  • results: 理论与实证研究均表明,该方法能够有效评估多维数据集的单峰性,并可用于估计聚类数目。
    Abstract Unimodality, pivotal in statistical analysis, offers insights into dataset structures and drives sophisticated analytical procedures. While unimodality's confirmation is straightforward for one-dimensional data using methods like Silverman's approach and Hartigans' dip statistic, its generalization to higher dimensions remains challenging. By extrapolating one-dimensional unimodality principles to multi-dimensional spaces through linear random projections and leveraging point-to-point distancing, our method, rooted in $\alpha$-unimodality assumptions, presents a novel multivariate unimodality test named mud-pod. Both theoretical and empirical studies confirm the efficacy of our method in unimodality assessment of multidimensional datasets as well as in estimating the number of clusters.
    摘要 单峰性(unimodality)是统计分析中的一个核心概念,它提供了关于数据集结构的信息,并驱动着复杂的分析流程。虽然对一维数据而言,利用 Silverman 方法和 Hartigan 的 dip 统计量即可直接验证单峰性,但将其推广到高维空间仍具挑战性。我们的方法基于 α-单峰性假设,通过线性随机投影将一维单峰性原理推广到多维空间,并利用点到点距离提出了一种新的多变量单峰性检验 mud-pod。理论与实证研究均证实了该方法在评估多维数据集单峰性以及估计聚类数目方面的有效性。
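
A rough sketch in the spirit of the test, assuming the third-party `diptest` package is available for Hartigans' dip statistic: project the centered data onto random directions and dip-test each 1-D sample, aggregating with a Bonferroni-style vote. The paper's actual statistic uses Mahalanobis distances and a different aggregation rule, so this is only an illustration of the random-projection idea.

```python
import numpy as np
import diptest  # third-party package providing Hartigans' dip statistic (assumed installed)

def unimodality_vote(X, n_proj=25, alpha=0.05, rng=None):
    """Return True if no random 1-D projection of X rejects unimodality."""
    rng = rng or np.random.default_rng()
    Xc = X - X.mean(axis=0)
    for _ in range(n_proj):
        v = rng.normal(size=X.shape[1])
        v /= np.linalg.norm(v)                  # random unit direction
        _, pval = diptest.diptest(Xc @ v)       # dip test on the projected sample
        if pval < alpha / n_proj:               # Bonferroni-corrected rejection
            return False
    return True
```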

Eigenmatrix for unstructured sparse recovery

  • paper_url: http://arxiv.org/abs/2311.16609
  • repo_url: None
  • paper_authors: Lexing Ying
  • for: 这个论文关注一般形式的非结构化稀疏恢复问题,例子包括有理逼近、谱函数估计、傅里叶反演、拉普拉斯反演和稀疏反卷积。
  • methods: 这个论文提出了一种数据驱动的构造,即特征矩阵(eigenmatrix),它具有所需的近似特征值与特征向量,为这类稀疏恢复问题提供了一种新途径。
  • results: 论文给出了数值结果,以展示所提方法的有效性。
    Abstract This paper considers the unstructured sparse recovery problems in a general form. Examples include rational approximation, spectral function estimation, Fourier inversion, Laplace inversion, and sparse deconvolution. The main challenges are the noise in the sample values and the unstructured nature of the sample locations. This paper proposes the eigenmatrix, a data-driven construction with desired approximate eigenvalues and eigenvectors. The eigenmatrix offers a new way for these sparse recovery problems. Numerical results are provided to demonstrate the efficiency of the proposed method.
    摘要 这篇论文考虑一般形式的非结构化稀疏恢复问题,例子包括有理逼近、谱函数估计、傅里叶反演、拉普拉斯反演和稀疏反卷积。其主要挑战在于样本值中的噪声以及样本位置的非结构化特性。论文提出了特征矩阵(eigenmatrix),一种数据驱动的构造,具有所需的近似特征值与特征向量,为这类稀疏恢复问题提供了一种新途径。数值结果证明了所提方法的有效性。

LC4SV: A Denoising Framework Learning to Compensate for Unseen Speaker Verification Models

  • paper_url: http://arxiv.org/abs/2311.16604
  • repo_url: None
  • paper_authors: Chi-Chang Lee, Hong-Wei Chen, Chu-Song Chen, Hsin-Min Wang, Tsung-Te Liu, Yu Tsao
  • for: 提高说话人验证(SV)模型在噪声环境中的性能
  • methods: 使用基于学习的插值代理,自动生成增强信号与含噪输入之间的合适加权系数,以改善 SV 模型在噪声环境中的性能
  • results: 在不同的 SV 模型下,LC4SV 均能稳定提升噪声环境中的 SV 性能
    Abstract The performance of speaker verification (SV) models may drop dramatically in noisy environments. A speech enhancement (SE) module can be used as a front-end strategy. However, existing SE methods may fail to bring performance improvements to downstream SV systems due to artifacts in the predicted signals of SE models. To compensate for artifacts, we propose a generic denoising framework named LC4SV, which can serve as a pre-processor for various unknown downstream SV models. In LC4SV, we employ a learning-based interpolation agent to automatically generate the appropriate coefficients between the enhanced signal and its noisy input to improve SV performance in noisy environments. Our experimental results demonstrate that LC4SV consistently improves the performance of various unseen SV systems. To the best of our knowledge, this work is the first attempt to develop a learning-based interpolation scheme aiming at improving SV performance in noisy environments.
    摘要 说话人验证(SV)模型在噪声环境中的性能可能显著下降。语音增强(SE)模块可以作为前端策略使用,然而,由于 SE 模型预测信号中存在伪影,现有 SE 方法未必能为下游 SV 系统带来性能提升。为补偿这些伪影,我们提出了一种通用的去噪框架 LC4SV,可作为各类未知下游 SV 模型的预处理器。在 LC4SV 中,我们使用基于学习的插值代理,自动生成增强信号与其含噪输入之间的合适加权系数,以改善噪声环境中的 SV 性能。实验结果表明,LC4SV 能稳定提升多种未见过的 SV 系统的性能。据我们所知,这是首个旨在改善噪声环境下 SV 性能的基于学习的插值方案。
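
A minimal PyTorch sketch of the interpolation idea: predict a per-utterance coefficient in (0, 1) and mix the enhanced and noisy signals before SV scoring. The feature summarization and network shape are illustrative assumptions, not LC4SV's exact agent.

```python
import torch
import torch.nn as nn

class InterpolationAgent(nn.Module):
    """Mix enhanced and noisy signals with a learned per-utterance coefficient.
    Shapes: feats (B, frames, d_feat), wavs (B, samples)."""
    def __init__(self, d_feat: int = 80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * d_feat, 64), nn.ReLU(),
                                 nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, feat_noisy, feat_enhanced, wav_noisy, wav_enhanced):
        summary = torch.cat([feat_noisy.mean(dim=1), feat_enhanced.mean(dim=1)], dim=-1)
        alpha = self.net(summary)                    # (B, 1), in (0, 1)
        return alpha * wav_enhanced + (1 - alpha) * wav_noisy
```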

D4AM: A General Denoising Framework for Downstream Acoustic Models

  • paper_url: http://arxiv.org/abs/2311.16595
  • repo_url: https://github.com/changlee0903/d4am
  • paper_authors: Chi-Chang Lee, Yu Tsao, Hsin-Min Wang, Chu-Song Chen
  • for: 通过联合利用语音-文本与含噪-干净成对数据进行训练,提升自动语音识别(ASR)系统在噪声环境下的性能。
  • methods: 提出了一种面向下游声学模型的通用去噪框架(D4AM),依据特定声学模型及其分类目标的反向梯度微调 SE 模型,并通过一种权重调整方案将回归目标作为辅助损失,以增强对未见声学模型的泛化能力。
  • results: 对不同的语音识别模型,D4AM 均能带来明显的性能提升;特别是在完全未见过的识别系统上,相比直接输入含噪语音,相对词错误率(WER)降低了 24.65%。
    Abstract The performance of acoustic models degrades notably in noisy environments. Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems. However, existing training objectives of SE methods are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems. In this study, we propose a general denoising framework, D4AM, for various downstream acoustic models. Our framework fine-tunes the SE model with the backward gradient according to a specific acoustic model and the corresponding classification objective. In addition, our method aims to consider the regression objective as an auxiliary loss to make the SE model generalize to other unseen acoustic models. To jointly train an SE unit with regression and classification objectives, D4AM uses an adjustment scheme to directly estimate suitable weighting coefficients rather than undergoing a grid search process with additional training costs. The adjustment scheme consists of two parts: gradient calibration and regression objective weighting. The experimental results show that D4AM can consistently and effectively provide improvements to various unseen acoustic models and outperforms other combination setups. Specifically, when evaluated on the Google ASR API with real noisy data completely unseen during SE training, D4AM achieves a relative WER reduction of 24.65% compared with the direct feeding of noisy input. To our knowledge, this is the first work that deploys an effective combination scheme of regression (denoising) and classification (ASR) objectives to derive a general pre-processor applicable to various unseen ASR systems. Our code is available at https://github.com/ChangLee0903/D4AM.
    摘要 声学模型在噪声环境下的性能明显下降。语音增强(SE)可以作为前端策略辅助自动语音识别(ASR)系统,然而,现有 SE 方法的训练目标未能充分整合语音-文本与含噪-干净成对数据,以面向未见 ASR 系统进行训练。在本研究中,我们提出了一种面向各类下游声学模型的通用去噪框架 D4AM。我们的框架依据特定声学模型及其对应的分类目标,利用反向梯度对 SE 模型进行微调;同时将回归目标作为辅助损失,使 SE 模型能够泛化到其他未见声学模型。为了以回归和分类目标联合训练 SE 单元,D4AM 采用一种调整方案直接估计合适的加权系数,而无需代价高昂的网格搜索。该调整方案由两部分组成:梯度校准和回归目标加权。实验结果表明,D4AM 能够持续有效地提升各类未见声学模型的性能,并优于其他组合方案。特别地,在以完全未参与 SE 训练的真实噪声数据评测 Google ASR API 时,D4AM 相比直接输入含噪语音,实现了 24.65% 的相对 WER 降低。据我们所知,这是首个通过有效结合回归(去噪)与分类(ASR)目标,导出可适用于各类未见 ASR 系统的通用预处理器的工作。我们的代码发布在 https://github.com/ChangLee0903/D4AM。

FedAL: Black-Box Federated Knowledge Distillation Enabled by Adversarial Learning

  • paper_url: http://arxiv.org/abs/2311.16584
  • repo_url: None
  • paper_authors: Pengchao Han, Xingyan Shi, Jianwei Huang
  • for: This paper focuses on addressing the challenges of knowledge distillation (KD) in collaborative learning among distributed clients with different model architectures and non-sharing of local data and model parameters.
  • methods: The proposed method, Federated knowledge distillation enabled by Adversarial Learning (FedAL), uses a min-max game between clients and a server acting as a discriminator to guide clients’ local model training and achieve consensus model outputs among clients. Additionally, the method uses less-forgetting regularization to mitigate catastrophic forgetting due to clients’ heterogeneous local data.
  • results: The experimental results show that FedAL and its variants achieve higher accuracy than other federated KD baselines.
    Abstract Knowledge distillation (KD) can enable collaborative learning among distributed clients that have different model architectures and do not share their local data and model parameters with others. Each client updates its local model using the average model output/feature of all client models as the target, known as federated KD. However, existing federated KD methods often do not perform well when clients' local models are trained with heterogeneous local datasets. In this paper, we propose Federated knowledge distillation enabled by Adversarial Learning (FedAL) to address the data heterogeneity among clients. First, to alleviate the local model output divergence across clients caused by data heterogeneity, the server acts as a discriminator to guide clients' local model training to achieve consensus model outputs among clients through a min-max game between clients and the discriminator. Moreover, catastrophic forgetting may happen during the clients' local training and global knowledge transfer due to clients' heterogeneous local data. Towards this challenge, we design the less-forgetting regularization for both local training and global knowledge transfer to guarantee clients' ability to transfer/learn knowledge to/from others. Experimental results show that FedAL and its variants achieve higher accuracy than other federated KD baselines.
    摘要 知识蒸馏(KD)能够让拥有不同模型架构、且不与他人共享本地数据和模型参数的分布式客户端开展协作学习。每个客户端以所有客户端模型输出/特征的平均值为目标更新其本地模型,这被称为联邦 KD。然而,当各客户端的本地模型是在异构的本地数据集上训练时,现有的联邦 KD 方法往往表现不佳。在本文中,我们提出了由对抗学习赋能的联邦知识蒸馏(FedAL),以应对客户端之间的数据异构性。首先,为缓解数据异构导致的客户端本地模型输出分歧,服务器扮演判别器的角色,通过客户端与判别器之间的极小极大博弈引导客户端的本地模型训练,使各客户端达成一致的模型输出。此外,由于客户端本地数据的异构性,在本地训练和全局知识迁移过程中可能发生灾难性遗忘。针对这一挑战,我们为本地训练和全局知识迁移设计了“少遗忘”正则化,以保证客户端向他人迁移/学习知识的能力。实验结果表明,FedAL 及其变体的准确率高于其他联邦 KD 基线。
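
One common way to implement a "less-forgetting" term is a KD-style KL penalty toward the model's own pre-update outputs; whether FedAL uses exactly this form is an assumption, so the sketch below is illustrative.

```python
import torch.nn.functional as F

def less_forgetting_loss(logits_new, logits_old, T: float = 2.0):
    """KL divergence toward the model's pre-update outputs, scaled by T^2
    as in standard knowledge distillation (an illustrative regularizer)."""
    p_old = F.softmax(logits_old.detach() / T, dim=-1)       # frozen teacher: old self
    log_p_new = F.log_softmax(logits_new / T, dim=-1)
    return F.kl_div(log_p_new, p_old, reduction="batchmean") * (T * T)
```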

Scalable Label Distribution Learning for Multi-Label Classification

  • paper_url: http://arxiv.org/abs/2311.16556
  • repo_url: https://github.com/ailearn-ml/sldl
  • paper_authors: Xingyu Zhao, Yuexuan An, Lei Qi, Xin Geng
  • for: 提高多标签分类(MLC)方法的可扩展性和效率,以适应实际场景中的非对称标签关系和大规模输出空间。
  • methods: 将标签转换为低维潜在空间中的连续分布,并利用非对称距离度量建立标签之间的相关性;随后学习从特征空间到潜在空间的映射,使计算复杂度与标签数无关;最后通过最近邻策略对潜在表示进行解码,获得最终预测结果。
  • results: 大量实验表明,SLDL 能以极低的计算开销取得非常有竞争力的分类性能,并可扩展到大规模输出空间。
    Abstract Multi-label classification (MLC) refers to the problem of tagging a given instance with a set of relevant labels. Most existing MLC methods are based on the assumption that the correlation of two labels in each label pair is symmetric, which is violated in many real-world scenarios. Moreover, most existing methods design learning processes associated with the number of labels, which makes their computational complexity a bottleneck when scaling up to large-scale output space. To tackle these issues, we propose a novel MLC learning method named Scalable Label Distribution Learning (SLDL) for multi-label classification which can describe different labels as distributions in a latent space, where the label correlation is asymmetric and the dimension is independent of the number of labels. Specifically, SLDL first converts labels into continuous distributions within a low-dimensional latent space and leverages the asymmetric metric to establish the correlation between different labels. Then, it learns the mapping from the feature space to the latent space, resulting in the computational complexity is no longer related to the number of labels. Finally, SLDL leverages a nearest-neighbor-based strategy to decode the latent representations and obtain the final predictions. Our extensive experiments illustrate that SLDL can achieve very competitive classification performances with little computational consumption.
    摘要 多标签分类(MLC)问题是为给定实例标注一组相关的标签。现有的大多数 MLC 方法假设每个标签对中两个标签的相关性是对称的,而这一假设在许多真实场景中并不成立。此外,现有方法的学习过程往往与标签数量挂钩,当扩展到大规模输出空间时,计算复杂度便成为瓶颈。为解决这些问题,我们提出了一种新的 MLC 学习方法,名为可扩展标签分布学习(SLDL)。它将不同标签描述为潜在空间中的分布,其中标签相关性是非对称的,且潜在空间的维度与标签数量无关。具体来说,SLDL 首先将标签转换为低维潜在空间中的连续分布,并利用非对称度量刻画不同标签之间的相关性;随后学习从特征空间到潜在空间的映射,使计算复杂度不再与标签数量相关;最后利用基于最近邻的策略对潜在表示进行解码,得到最终预测。大量实验表明,SLDL 能以极低的计算开销取得非常有竞争力的分类性能。

Communication Efficiency Optimization of Federated Learning for Computing and Network Convergence of 6G Networks

  • paper_url: http://arxiv.org/abs/2311.16540
  • repo_url: None
  • paper_authors: Yizhuo Cai, Bo Lei, Qianying Zhao, Jing Peng, Min Wei, Yushun Zhang, Xing Zhang
  • for: 这篇论文旨在优化联邦学习在复杂网络环境中的通信效率。
  • methods: 本论文借助 6G 网络中具备计算可度量、可感知、可分发、可调度、可管理能力的算网融合(CNC)架构来支撑联邦学习训练,依据业务需求、资源负载、网络状态和设备算力对训练过程做出决策。
  • results: 实验结果显示,所提方法能够优化联邦学习在复杂网络环境中的通信效率,并按算力均衡地安排设备参与训练,从而提高资源利用率。
    Abstract Federated learning effectively addresses issues such as data privacy by collaborating across participating devices to train global models. However, factors such as network topology and device computing power can affect its training or communication process in complex network environments. A new network architecture and paradigm with computing-measurable, perceptible, distributable, dispatchable, and manageable capabilities, computing and network convergence (CNC) of 6G networks can effectively support federated learning training and improve its communication efficiency. By guiding the participating devices' training in federated learning based on business requirements, resource load, network conditions, and arithmetic power of devices, CNC can reach this goal. In this paper, to improve the communication efficiency of federated learning in complex networks, we study the communication efficiency optimization of federated learning for computing and network convergence of 6G networks, methods that gives decisions on its training process for different network conditions and arithmetic power of participating devices in federated learning. The experiments address two architectures that exist for devices in federated learning and arrange devices to participate in training based on arithmetic power while achieving optimization of communication efficiency in the process of transferring model parameters. The results show that the method we proposed can (1) cope well with complex network situations (2) effectively balance the delay distribution of participating devices for local training (3) improve the communication efficiency during the transfer of model parameters (4) improve the resource utilization in the network.

Federated Learning with Diffusion Models for Privacy-Sensitive Vision Tasks

  • paper_url: http://arxiv.org/abs/2311.16538
  • repo_url: None
  • paper_authors: Ye Lin Tun, Chu Myaet Thwal, Ji Su Yoon, Sun Moo Kang, Chaoning Zhang, Choong Seon Hong
  • for: This study aims to train diffusion models with federated learning (FL) in decentralized settings, so that privacy-sensitive domains can still benefit from diffusion-based vision services.
  • methods: The FL system gathers model parameters rather than sensitive data, safeguarding data privacy.
  • results: Experiments across various FL scenarios show that diffusion models can be trained effectively in decentralized settings.
    Abstract Diffusion models have shown great potential for vision-related tasks, particularly for image generation. However, their training is typically conducted in a centralized manner, relying on data collected from publicly available sources. This approach may not be feasible or practical in many domains, such as the medical field, which involves privacy concerns over data collection. Despite the challenges associated with privacy-sensitive data, such domains could still benefit from valuable vision services provided by diffusion models. Federated learning (FL) plays a crucial role in enabling decentralized model training without compromising data privacy. Instead of collecting data, an FL system gathers model parameters, effectively safeguarding the private data of different parties involved. This makes FL systems vital for managing decentralized learning tasks, especially in scenarios where privacy-sensitive data is distributed across a network of clients. Nonetheless, FL presents its own set of challenges due to its distributed nature and privacy-preserving properties. Therefore, in this study, we explore the FL strategy to train diffusion models, paving the way for the development of federated diffusion models. We conduct experiments on various FL scenarios, and our findings demonstrate that federated diffusion models have great potential to deliver vision services to privacy-sensitive domains.
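
A minimal sketch of the core idea, federated averaging applied to a diffusion denoiser so that only parameters leave each client, is below; the tiny epsilon-prediction model, noise schedule, and training loop are illustrative assumptions.

```python
# FedAvg over a toy diffusion denoiser; only state dicts are exchanged.
import copy
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(17, 64), nn.ReLU(), nn.Linear(64, 16))
    def forward(self, x_noisy, t):
        # Condition on the diffusion timestep by concatenation.
        return self.net(torch.cat([x_noisy, t], dim=-1))

def local_update(global_model, data, steps=10, lr=1e-3):
    model = copy.deepcopy(global_model)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        x = data[torch.randint(len(data), (32,))]
        t = torch.rand(32, 1)
        noise = torch.randn_like(x)
        x_noisy = torch.sqrt(1 - t) * x + torch.sqrt(t) * noise  # toy schedule
        loss = ((model(x_noisy, t) - noise) ** 2).mean()         # eps-prediction
        opt.zero_grad(); loss.backward(); opt.step()
    return model.state_dict()

def fedavg(states):
    return {k: torch.stack([s[k] for s in states]).mean(0) for k in states[0]}

global_model = TinyDenoiser()
client_data = [torch.randn(256, 16) for _ in range(4)]  # stays on-device
for rnd in range(3):
    states = [local_update(global_model, d) for d in client_data]
    global_model.load_state_dict(fedavg(states))
```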

Personalized Predictions of Glioblastoma Infiltration: Mathematical Models, Physics-Informed Neural Networks and Multimodal Scans

  • paper_url: http://arxiv.org/abs/2311.16536
  • repo_url: None
  • paper_authors: Ray Zirui Zhang, Ivan Ezhov, Michal Balcerak, Andy Zhu, Benedikt Wiestler, Bjoern Menze, John Lowengrub
  • for: This paper aims to predict the infiltration of glioblastoma (GBM) from medical MRI scans to better understand tumor growth dynamics and design personalized radiotherapy treatment plans.
  • methods: The paper proposes using physics-informed neural networks (PINNs) to estimate patient-specific parameters of a reaction-diffusion PDE model of GBM growth from a single 3D structural MRI snapshot. The method integrates both data and theory by embedding the PDE into a loss function.
  • results: The paper validates the method on synthetic and patient datasets and shows promise for real-time parametric inference in the clinical setting for personalized GBM treatment. The method is able to accurately estimate patient-specific parameters and handle complex brain geometry within the PINN framework.
    Abstract Predicting the infiltration of Glioblastoma (GBM) from medical MRI scans is crucial for understanding tumor growth dynamics and designing personalized radiotherapy treatment plans. Mathematical models of GBM growth can complement the data in the prediction of spatial distributions of tumor cells. However, this requires estimating patient-specific parameters of the model from clinical data, which is a challenging inverse problem due to limited temporal data and the limited time between imaging and diagnosis. This work proposes a method that uses Physics-Informed Neural Networks (PINNs) to estimate patient-specific parameters of a reaction-diffusion PDE model of GBM growth from a single 3D structural MRI snapshot. PINNs embed both the data and the PDE into a loss function, thus integrating theory and data. Key innovations include the identification and estimation of characteristic non-dimensional parameters, a pre-training step that utilizes the non-dimensional parameters, and a fine-tuning step to determine the patient-specific parameters. Additionally, the diffuse domain method is employed to handle the complex brain geometry within the PINN framework. Our method is validated both on synthetic and patient datasets, and shows promise for real-time parametric inference in the clinical setting for personalized GBM treatment.
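
To make the PINN recipe concrete, here is a hedged sketch of the physics term of the loss for a Fisher-KPP-type reaction-diffusion model, $u_t = D\,\Delta u + \rho\,u(1-u)$, with learnable patient-specific $D$ and $\rho$; the network shape, the collocation sampling, and the omitted data term are assumptions.

```python
# PDE-residual term of a PINN loss for a reaction-diffusion tumor model.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))  # (x, y, z, t) -> tumor cell density u
log_D = nn.Parameter(torch.tensor(0.0))    # patient-specific diffusivity
log_rho = nn.Parameter(torch.tensor(0.0))  # patient-specific proliferation rate

def pde_residual(xyz_t):
    xyz_t = xyz_t.requires_grad_(True)
    u = net(xyz_t)
    grads = torch.autograd.grad(u.sum(), xyz_t, create_graph=True)[0]
    u_t = grads[:, 3:4]
    lap = 0.0
    for i in range(3):  # Laplacian via second derivatives in x, y, z
        g2 = torch.autograd.grad(grads[:, i].sum(), xyz_t, create_graph=True)[0]
        lap = lap + g2[:, i:i + 1]
    D, rho = log_D.exp(), log_rho.exp()
    return u_t - D * lap - rho * u * (1 - u)

pts = torch.rand(1024, 4)                    # collocation points in space-time
loss_pde = pde_residual(pts).pow(2).mean()   # physics term of the PINN loss
# Full loss = loss_pde + loss_data, where loss_data fits the MRI snapshot.
```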

Contrastive encoder pre-training-based clustered federated learning for heterogeneous data

  • paper_url: http://arxiv.org/abs/2311.16535
  • repo_url: None
  • paper_authors: Ye Lin Tun, Minh N. H. Nguyen, Chu Myaet Thwal, Jinwoo Choi, Choong Seon Hong
  • for: Improving the performance of federated learning systems under data heterogeneity.
  • methods: Self-supervised contrastive pre-training combined with client clustering.
  • results: Combining contrastive pre-training with client clustering improves model convergence and the overall performance of federated learning systems.
    Abstract Federated learning (FL) is a promising approach that enables distributed clients to collaboratively train a global model while preserving their data privacy. However, FL often suffers from data heterogeneity problems, which can significantly affect its performance. To address this, clustered federated learning (CFL) has been proposed to construct personalized models for different client clusters. One effective client clustering strategy is to allow clients to choose their own local models from a model pool based on their performance. However, without pre-trained model parameters, such a strategy is prone to clustering failure, in which all clients choose the same model. Unfortunately, collecting a large amount of labeled data for pre-training can be costly and impractical in distributed environments. To overcome this challenge, we leverage self-supervised contrastive learning to exploit unlabeled data for the pre-training of FL systems. Together, self-supervised pre-training and client clustering can be crucial components for tackling the data heterogeneity issues of FL. Leveraging these two crucial strategies, we propose contrastive pre-training-based clustered federated learning (CP-CFL) to improve the model convergence and overall performance of FL systems. In this work, we demonstrate the effectiveness of CP-CFL through extensive experiments in heterogeneous FL settings, and present various interesting observations.
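
The pool-based clustering step reads roughly as follows: each client evaluates every model in a shared pool on its local data and joins the cluster of the best-performing one. The sketch below is an assumption about these mechanics, not the authors' code; the closing comment notes why contrastive pre-training matters here.

```python
# Hedged sketch of pool-based client cluster assignment.
import torch

@torch.no_grad()
def assign_cluster(model_pool, local_loader, loss_fn):
    """Return the index of the pool model with the lowest local loss."""
    losses = []
    for model in model_pool:
        model.eval()
        total, n = 0.0, 0
        for x, y in local_loader:
            total += loss_fn(model(x), y).item() * len(x)
            n += len(x)
        losses.append(total / max(n, 1))
    return min(range(len(losses)), key=losses.__getitem__)

# With randomly initialized pool models, all clients tend to pick the same one
# ("clustering failure"); CP-CFL counters this by initializing pool models from
# a contrastively pre-trained encoder so they are already differentiated.
```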

Utility Fairness in Contextual Dynamic Pricing with Demand Learning

  • paper_url: http://arxiv.org/abs/2311.16528
  • repo_url: None
  • paper_authors: Xi Chen, David Simchi-Levi, Yining Wang
  • for: This paper studies a contextual bandit algorithm for personalized pricing under utility fairness constraints with uncertain demand, achieving an optimal regret upper bound.
  • methods: The approach combines dynamic pricing with demand learning. In the static full-information setting, the optimal pricing policy is formulated as a constrained optimization problem, and an approximation algorithm computes the ideal policy efficiently and approximately.
  • results: Mathematical analysis and computational studies characterize the structure of fair pricing policies and derive simplified policies, laying the foundations for further research and extensions. The study is then extended to dynamic pricing with demand learning, establishing a non-standard regret lower bound that highlights the complexity added by fairness constraints and quantifying the cost of fairness and its impact on the balance between utility and revenue maximization.
    Abstract This paper introduces a novel contextual bandit algorithm for personalized pricing under utility fairness constraints in scenarios with uncertain demand, achieving an optimal regret upper bound. Our approach, which incorporates dynamic pricing and demand learning, addresses the critical challenge of fairness in pricing strategies. We first delve into the static full-information setting to formulate an optimal pricing policy as a constrained optimization problem. Here, we propose an approximation algorithm for efficiently and approximately computing the ideal policy. We also use mathematical analysis and computational studies to characterize the structures of optimal contextual pricing policies subject to fairness constraints, deriving simplified policies which lays the foundations of more in-depth research and extensions. Further, we extend our study to dynamic pricing problems with demand learning, establishing a non-standard regret lower bound that highlights the complexity added by fairness constraints. Our research offers a comprehensive analysis of the cost of fairness and its impact on the balance between utility and revenue maximization. This work represents a step towards integrating ethical considerations into algorithmic efficiency in data-driven dynamic pricing.
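
A toy version of the static full-information problem can be written as a constrained optimization: maximize revenue over group prices subject to a bounded utility gap between groups. The linear demand model, consumer-surplus utility, and the epsilon below are assumptions for illustration only.

```python
# Toy fair-pricing problem: max revenue s.t. bounded utility gap across groups.
import numpy as np
from scipy.optimize import minimize

a = np.array([10.0, 8.0])   # demand intercepts for two context groups
b = np.array([1.0, 1.0])    # slopes: demand_i(p) = a_i - b_i * p
eps = 0.5                   # allowed consumer-utility gap between groups

def revenue(p):
    return float(np.sum(p * (a - b * p)))

def utility(p):
    # Consumer surplus under linear demand: (a - b*p)^2 / (2*b).
    return (a - b * p) ** 2 / (2 * b)

cons = [{"type": "ineq", "fun": lambda p: eps - (utility(p)[0] - utility(p)[1])},
        {"type": "ineq", "fun": lambda p: eps - (utility(p)[1] - utility(p)[0])}]
res = minimize(lambda p: -revenue(p), x0=np.array([4.0, 4.0]),
               bounds=[(0.0, 10.0), (0.0, 8.0)], constraints=cons)
print("fair prices:", res.x, "revenue:", -res.fun)
```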

On robust overfitting: adversarial training induced distribution matters

  • paper_url: http://arxiv.org/abs/2311.16526
  • repo_url: None
  • paper_authors: Runzhi Tian, Yongyi Mao
  • for: This work studies robust overfitting in adversarial training, where the robust generalization error is much larger than under standard training.
  • methods: PGD-based adversarial training is analyzed, and a new upper bound on the generalization error with respect to the perturbation-induced distributions is derived, in which a notion of the perturbation operator called "local dispersion" plays an important role.
  • results: Robust overfitting is shown empirically to correlate with the increasing generalization difficulty of the perturbation-induced distributions along the adversarial training trajectory.
    Abstract Adversarial training may be regarded as standard training with a modified loss function, but its generalization error appears much larger than that of standard training under the standard loss. This phenomenon, known as robust overfitting, has attracted significant research attention and remains largely a mystery. In this paper, we first show empirically that robust overfitting correlates with the increasing generalization difficulty of the perturbation-induced distributions along the trajectory of adversarial training (specifically PGD-based adversarial training). We then provide a novel upper bound for the generalization error with respect to the perturbation-induced distributions, in which a notion of the perturbation operator, referred to as "local dispersion", plays an important role.
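
For reference, a minimal PGD attack of the kind used in the adversarial training analyzed here is sketched below; the epsilon, step size, and step count are conventional values, not taken from the paper.

```python
# L-infinity PGD: iterative gradient ascent projected back into the eps-ball.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project into the eps-ball around x, then into valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

# Adversarial training = standard training with the loss at the attacked input:
# loss = F.cross_entropy(model(pgd_attack(model, x, y)), y)
```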

Value Approximation for Two-Player General-Sum Differential Games with State Constraints

  • paper_url: http://arxiv.org/abs/2311.16520
  • repo_url: None
  • paper_authors: Lei Zhang, Mukesh Ghimire, Wenlong Zhang, Zhe Xu, Yi Ren
  • for: Solving Hamilton-Jacobi-Isaacs (HJI) partial differential equations (PDEs) to enable equilibrial feedback control in two-player differential games with state constraints.
  • methods: Physics-informed machine learning, together with a hybrid learning method, a value-hardening method, and the epigraphical technique.
  • results: Among the compared methods, the hybrid learning method performs best in terms of generalization and safety.
    Abstract Solving Hamilton-Jacobi-Isaacs (HJI) PDEs enables equilibrial feedback control in two-player differential games, yet faces the curse of dimensionality (CoD). While physics-informed machine learning has been adopted to address CoD in solving PDEs, this method falls short in learning discontinuous solutions due to its sampling nature, leading to poor safety performance of the resulting controllers in robotics applications where values are discontinuous due to state or other temporal logic constraints. In this study, we explore three potential solutions to this problem: (1) a hybrid learning method that uses both equilibrium demonstrations and the HJI PDE, (2) a value-hardening method where a sequence of HJIs are solved with increasing Lipschitz constant on the constraint violation penalty, and (3) the epigraphical technique that lifts the value to a higher dimensional auxiliary state space where the value becomes continuous. Evaluations through 5D and 9D vehicle simulations and 13D drone simulations reveal that the hybrid method outperforms others in terms of generalization and safety performance.
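
A loose sketch of the value-hardening idea, training a value network under a constraint-violation penalty whose Lipschitz constant increases over a curriculum, is given below; the toy constraint, the penalty form, and the omitted HJI residual term are all assumptions.

```python
# Curriculum over the Lipschitz constant of a constraint-violation penalty.
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(5, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def soft_violation(states):
    """Signed distance into the constraint set; positive means violation.
    Toy constraint for the sketch: keep ||position|| <= 1."""
    return states[:, :2].norm(dim=1, keepdim=True) - 1.0

for kappa in [1, 10, 100, 1000]:           # increasing Lipschitz constant
    for _ in range(200):
        s = torch.rand(256, 5) * 2 - 1     # sampled states (and time)
        v = value_net(s)
        # Penalize values that understate the kappa-scaled violation cost.
        penalty = torch.relu(kappa * soft_violation(s) - v).mean()
        loss = penalty  # + HJI PDE residual term in the full method
        opt.zero_grad(); loss.backward(); opt.step()
```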

B-LSTM-MIONet: Bayesian LSTM-based Neural Operators for Learning the Response of Complex Dynamical Systems to Length-Variant Multiple Input Functions

  • paper_url: http://arxiv.org/abs/2311.16519
  • repo_url: None
  • paper_authors: Zhihao Kong, Amirhossein Mollaali, Christian Moya, Na Lu, Guang Lin
  • for: This paper builds on the Deep Operator Network (DeepONet) framework for learning nonlinear operators and its multiple-input extension (MIONet), which handles multiple input functions in different Banach spaces.
  • methods: Long Short-Term Memory (LSTM) networks are integrated into MIONet to learn nonlinear operators from time-dependent data.
  • results: The resulting model handles variable-length, real-time data and quantifies uncertainty through a novel Bayesian method that samples from MIONet parameter distributions, yielding a more precise and reliable model for noisy datasets.
    Abstract Deep Operator Network (DeepONet) is a neural network framework for learning nonlinear operators such as those from ordinary differential equations (ODEs) describing complex systems. Multiple-input deep neural operators (MIONet) extended DeepONet to allow multiple input functions in different Banach spaces. MIONet offers flexibility in training dataset grid spacing, without constraints on output location. However, it requires offline inputs and cannot handle varying sequence lengths in testing datasets, limiting its real-time application in dynamic complex systems. This work redesigns MIONet, integrating Long Short-Term Memory (LSTM) to learn neural operators from time-dependent data. This approach overcomes data discretization constraints and harnesses LSTM's capability with variable-length, real-time data. Factors affecting learning performance, like algorithm extrapolation ability, are presented. The framework is enhanced with uncertainty quantification through a novel Bayesian method, sampling from MIONet parameter distributions. Consequently, we develop the B-LSTM-MIONet, incorporating LSTM's temporal strengths with Bayesian robustness, resulting in a more precise and reliable model for noisy datasets.
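
The following sketch shows how an LSTM branch network can encode a variable-length input signal in a DeepONet-style architecture, with a trunk net over query times and a dot-product readout; layer sizes are assumptions, and the Bayesian part (sampling over parameter distributions) is omitted.

```python
# LSTM branch + MLP trunk, combined DeepONet-style via a dot product.
import torch
import torch.nn as nn

class LSTMDeepONet(nn.Module):
    def __init__(self, p=32):
        super().__init__()
        self.branch = nn.LSTM(input_size=1, hidden_size=p, batch_first=True)
        self.trunk = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, p))

    def forward(self, u_seq, t_query):
        # u_seq: (batch, seq_len, 1) sampled input function (seq_len may vary)
        # t_query: (batch, q, 1) query times
        _, (h, _) = self.branch(u_seq)      # final hidden state: (1, batch, p)
        b = h.squeeze(0)                    # (batch, p)
        tr = self.trunk(t_query)            # (batch, q, p)
        return torch.einsum("bp,bqp->bq", b, tr)

model = LSTMDeepONet()
out = model(torch.randn(8, 50, 1), torch.rand(8, 10, 1))
print(out.shape)  # torch.Size([8, 10])
```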

On the Robustness of Decision-Focused Learning

  • paper_url: http://arxiv.org/abs/2311.16487
  • repo_url: https://github.com/yehya-farhat/AdvAttacks-DFL
  • paper_authors: Yehya Farhat
  • for: This paper studies the performance of decision-focused learning (DFL) models, which train machine learning (ML) models to predict missing parameters of an optimization problem, under adversarial attacks.
  • methods: Ten distinct DFL methods are benchmarked under two adversarial attacks adapted to the predict-then-optimize problem setting.
  • results: A DFL model's robustness under attack is highly correlated with its ability to find predictions that lead to optimal decisions without deviating from the ground-truth label; the paper also shows how to target models that violate this condition and how such models respond differently depending on the optimality achieved at the end of training.
    Abstract Decision-Focused Learning (DFL) is an emerging learning paradigm that tackles the task of training a machine learning (ML) model to predict the missing parameters of an incomplete optimization problem. DFL trains an ML model in an end-to-end system by integrating the prediction and optimization tasks, providing better alignment of the training and testing objectives. DFL has shown a lot of promise and holds the capacity to revolutionize decision-making in many real-world applications. However, very little is known about the performance of these models under adversarial attacks. We adopt ten unique DFL methods and benchmark their performance under two distinctly focused attacks adapted to the predict-then-optimize problem setting. Our study proposes the hypothesis that the robustness of a model is highly correlated with its ability to find predictions that lead to optimal decisions without deviating from the ground-truth label. Furthermore, we provide insight into how to target the models that violate this condition and show how these models respond differently depending on the achieved optimality at the end of their training cycles.
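
The predict-then-optimize setting can be made concrete with a toy selection problem and the decision regret it induces; the optimizer and cost vectors below are assumptions for illustration, not the benchmarked tasks.

```python
# Decision regret in a toy predict-then-optimize problem.
import torch

def decide(costs, k=2):
    """Toy optimizer: pick the k items with the lowest cost."""
    return torch.topk(-costs, k).indices

def decision_regret(pred_costs, true_costs, k=2):
    chosen = decide(pred_costs, k)   # decision induced by the prediction
    best = decide(true_costs, k)     # decision under the true costs
    return true_costs[chosen].sum() - true_costs[best].sum()  # >= 0

true_costs = torch.tensor([1.0, 2.0, 3.0, 4.0])
pred_costs = torch.tensor([1.1, 3.5, 2.9, 4.2])  # a (perturbed) model output
print(decision_regret(pred_costs, true_costs))   # nonzero iff decision changed
```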

On the Effect of Defections in Federated Learning and How to Prevent Them

  • paper_url: http://arxiv.org/abs/2311.16459
  • repo_url: None
  • paper_authors: Minbiao Han, Kumar Kshitij Patel, Han Shao, Lingxiao Wang
  • for: This paper focuses on the problem of defections in federated learning, where agents may choose to stop participating in the collaboration and instead use their own models.
  • methods: The paper proposes a new optimization algorithm with theoretical guarantees to prevent defections and ensure asymptotic convergence to an effective solution for all participating agents.
  • results: The paper shows that current federated optimization algorithms fail to disincentivize harmful defections, and that the proposed algorithm is effective in preventing defections and improving the final model’s robustness and ability to generalize.
    Abstract Federated learning is a machine learning protocol that enables a large population of agents to collaborate over multiple rounds to produce a single consensus model. There are several federated learning applications where agents may choose to defect permanently, essentially withdrawing from the collaboration, if they are content with their instantaneous model in that round. This work demonstrates the detrimental impact of such defections on the final model's robustness and ability to generalize. We also show that current federated optimization algorithms fail to disincentivize these harmful defections. We introduce a novel optimization algorithm with theoretical guarantees to prevent defections while ensuring asymptotic convergence to an effective solution for all participating agents. We also provide numerical experiments to corroborate our findings and demonstrate the effectiveness of our algorithm.
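
A toy simulation of the defection dynamic is sketched below: a client permanently leaves once its local loss falls under its personal tolerance, shrinking participation round by round. All numbers and the loss proxy are assumptions.

```python
# Toy simulation of permanent defections in federated learning.
import random

random.seed(0)
clients = [{"id": i, "tolerance": random.uniform(0.2, 0.6), "active": True}
           for i in range(10)]
global_loss = 1.0
for rnd in range(20):
    active = [c for c in clients if c["active"]]
    if not active:
        break
    global_loss *= 0.9  # stand-in for the improvement from one FedAvg round
    for c in active:
        local_loss = global_loss * random.uniform(0.8, 1.2)
        if local_loss < c["tolerance"]:
            c["active"] = False  # content with the current model: defect
    print(rnd, sum(c["active"] for c in clients), "clients remain")
```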

Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization

  • paper_url: http://arxiv.org/abs/2311.16442
  • repo_url: None
  • paper_authors: Jinhao Li, Shiyao Li, Jiaming Xu, Shan Huang, Yaoxiu Lian, Jun Liu, Yu Wang, Guohao Dai
  • for: This work aims to reduce the inference cost of large language models (LLMs) through 2-bit quantization.
  • methods: Only a small fraction of weight groups (those with larger ranges) are quantized with 4 bits, with memory alignment considered on GPUs; only a small fraction of sparse outliers with larger weights require 16-bit quantization; dequantization is performed asynchronously on GPUs.
  • results: The method achieves >0.5% accuracy improvement with <3% average increased bits, reducing both the inference time and the hardware cost of LLMs.
    Abstract Large language models (LLMs) have demonstrated impressive abilities in various domains, but their inference cost is expensive. The state-of-the-art methods use 2-bit quantization for mainstream LLMs. However, challenges still exist: (1) Nonnegligible accuracy loss for 2-bit quantization. Weights are quantized by groups, and the ranges of weights are large in some groups, resulting in large quantization errors and nonnegligible accuracy loss (e.g. >3% for Llama2-7b with 2-bit quantization in GPTQ and Greenbit). (2) Limited accuracy improvement by adding 4-bit weights. Adding 10% extra average bits via more 4-bit weights only leads to <0.5% accuracy improvement on a quantized Llama2-7b. (3) Time-consuming dequantization operations on GPUs. The dequantization operations account for >50% of the execution time, hindering the potential of reducing LLM inference cost. To tackle these challenges, we propose the following techniques: (1) We only quantize a small fraction of the groups with the larger ranges using 4 bits, with memory alignment considered on GPUs. (2) We point out that the distribution of the sparse outliers with larger weights differs between 2-bit and 4-bit groups, and only a small fraction of outliers require 16-bit quantization. Such a design leads to >0.5% accuracy improvement with <3% average increased bits for Llama2-7b. (3) We design asynchronous dequantization on GPUs, leading to up to 3.92X speedup. We conduct extensive experiments on different model families and model sizes. We achieve 2.85 bits per weight, the end-to-end speedup for Llama2-7b is 1.74X over the original model, and we reduce both runtime cost and hardware cost by up to 2.70X and 2.81X with fewer GPU requirements.
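
The mixed-precision grouping in technique (1) can be illustrated as follows: weights are split into groups, most groups get 2-bit uniform quantization, and the largest-range groups get 4 bits. The group size, the 10% threshold, and the uniform quantizer below are assumptions; the 16-bit outlier path and GPU memory alignment are omitted.

```python
# Mixed 2-/4-bit group quantization sketch (dequantized values returned so the
# quantization error is directly visible).
import numpy as np

def quantize_group(w, bits):
    levels = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / max(levels, 1)
    q = np.round((w - lo) / (scale + 1e-12))
    return q * scale + lo

def mixed_quantize(weights, group=128, hi_bits_frac=0.1):
    groups = weights.reshape(-1, group)
    ranges = groups.max(1) - groups.min(1)
    cutoff = np.quantile(ranges, 1 - hi_bits_frac)  # largest ranges -> 4 bits
    out = np.stack([quantize_group(g, 4 if r > cutoff else 2)
                    for g, r in zip(groups, ranges)])
    return out.reshape(weights.shape)

w = np.random.randn(1024, 1024).astype(np.float32)
w_q = mixed_quantize(w)
print("mean abs error:", np.abs(w - w_q).mean())
```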

A Combinatorial Approach to Robust PCA

  • paper_url: http://arxiv.org/abs/2311.16416
  • repo_url: None
  • paper_authors: Weihao Kong, Mingda Qiao, Rajat Sen
  • for: This paper studies recovering Gaussian data under adversarial corruptions, assuming the Gaussian noise lies in an unknown $k$-dimensional subspace $U \subseteq \mathbb{R}^d$ and $s$ randomly chosen coordinates of each data point fall under the control of an adversary.
  • methods: A new analysis of the well-known Basis Pursuit (BP) method, obtained by studying a natural combinatorial problem.
  • results: An efficient algorithm that, when $ks^2 = O(d)$, recovers every single data point up to a nearly-optimal $\ell_1$ error of $\tilde O(ks/d)$ in expectation.
    Abstract We study the problem of recovering Gaussian data under adversarial corruptions when the noises are low-rank and the corruptions are on the coordinate level. Concretely, we assume that the Gaussian noises lie in an unknown $k$-dimensional subspace $U \subseteq \mathbb{R}^d$, and $s$ randomly chosen coordinates of each data point fall into the control of an adversary. This setting models the scenario of learning from high-dimensional yet structured data that are transmitted through a highly-noisy channel, so that the data points are unlikely to be entirely clean. Our main result is an efficient algorithm that, when $ks^2 = O(d)$, recovers every single data point up to a nearly-optimal $\ell_1$ error of $\tilde O(ks/d)$ in expectation. At the core of our proof is a new analysis of the well-known Basis Pursuit (BP) method for recovering a sparse signal, which is known to succeed under additional assumptions (e.g., incoherence or the restricted isometry property) on the underlying subspace $U$. In contrast, we present a novel approach via studying a natural combinatorial problem and show that, over the randomness in the support of the sparse signal, a high-probability error bound is possible even if the subspace $U$ is arbitrary.
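
For reference, Basis Pursuit, the primitive whose analysis is central here, solves min ||x||_1 subject to Ax = b; below is the standard linear-programming formulation via positive and negative parts. Problem sizes are assumptions for the demo.

```python
# Basis Pursuit as a linear program: split x = u - v with u, v >= 0.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, s = 40, 100, 5
A = rng.normal(size=(m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.normal(size=s)  # sparse signal
b = A @ x_true

c = np.ones(2 * n)              # objective: sum(u) + sum(v) = ||x||_1
A_eq = np.hstack([A, -A])       # A(u - v) = b
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
x_hat = res.x[:n] - res.x[n:]
print("recovery error:", np.linalg.norm(x_hat - x_true))
```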

Reduced-order modeling for parameterized PDEs via implicit neural representations

  • paper_url: http://arxiv.org/abs/2311.16410
  • repo_url: None
  • paper_authors: Tianshu Wen, Kookjin Lee, Youngsoo Choi
  • for: A data-driven reduced-order modeling method for efficiently solving parametrized partial differential equations (PDEs) in many-query problems.
  • methods: An implicit neural representation (INR) models physics signals continuously, and a parametrized neural ODE (PNODE), inferable by a hypernetwork, learns latent dynamics characterized by multiple PDE parameters; a physics-informed loss corrects predictions for unseen parameter instances.
  • results: On a two-dimensional Burgers equation with a large variation of PDE parameters and a large Reynolds number, the method achieves speedups of up to O(10^3) with ~1% relative error to the ground truth.
    Abstract We present a new data-driven reduced-order modeling approach to efficiently solve parametrized partial differential equations (PDEs) for many-query problems. This work is inspired by the concept of implicit neural representation (INR), which models physics signals in a continuous manner, independent of spatial/temporal discretization. The proposed framework encodes the PDE and utilizes a parametrized neural ODE (PNODE) to learn latent dynamics characterized by multiple PDE parameters. The PNODE can be inferred by a hypernetwork to reduce the potential difficulties in learning the PNODE due to a complex multilayer perceptron (MLP). The framework uses an INR to decode the latent dynamics and reconstruct accurate PDE solutions. Further, a physics-informed loss is introduced to correct the predictions for unseen parameter instances. Incorporating the physics-informed loss also enables the model to be fine-tuned in an unsupervised manner on unseen PDE parameters. A numerical experiment is performed on a two-dimensional Burgers equation with a large variation of PDE parameters. We evaluate the proposed method at a large Reynolds number and obtain speedups of up to O(10^3) and ~1% relative error to the ground truth values.
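
A hedged sketch of a parametrized neural ODE is below: the latent derivative is a network of both the latent state and the PDE parameters, integrated here with forward Euler; the sizes, the integrator, and the omitted hypernetwork and INR decoder are assumptions.

```python
# PNODE sketch: latent dynamics conditioned on PDE parameters mu.
import torch
import torch.nn as nn

class PNODE(nn.Module):
    def __init__(self, latent_dim=8, param_dim=2):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(latent_dim + param_dim, 64), nn.Tanh(),
                               nn.Linear(64, latent_dim))

    def forward(self, z0, mu, t_steps, dt=0.01):
        z, traj = z0, [z0]
        for _ in range(t_steps):
            z = z + dt * self.f(torch.cat([z, mu], dim=-1))  # Euler step
            traj.append(z)
        return torch.stack(traj, dim=1)  # (batch, t_steps + 1, latent_dim)

model = PNODE()
z0 = torch.randn(4, 8)                          # encoded initial conditions
mu = torch.tensor([[0.01, 1.0]]).repeat(4, 1)   # e.g. viscosity, forcing
print(model(z0, mu, t_steps=50).shape)          # torch.Size([4, 51, 8])
```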

Deep Learning for Time Series Classification of Parkinson’s Disease Eye Tracking Data

  • paper_url: http://arxiv.org/abs/2311.16381
  • repo_url: None
  • paper_authors: Gonzalo Uribarri, Simon Ekman von Huth, Josefine Waldthaler, Per Svenningsson, Erik Fransén
  • for: Using eye tracking to support the diagnosis and staging of Parkinson's disease.
  • methods: State-of-the-art deep learning time-series classification algorithms applied to eye-tracking data.
  • results: The models learn the classification task and generalize to unseen subjects: InceptionTime achieves 78% accuracy and ROCKET achieves 88%; a novel pruning method improves the ROCKET model's interpretability and generalizability, reaching 96% accuracy.
    Abstract Eye-tracking is an accessible and non-invasive technology that provides information about a subject's motor and cognitive abilities. As such, it has proven to be a valuable resource in the study of neurodegenerative diseases such as Parkinson's disease. Saccade experiments, in particular, have proven useful in the diagnosis and staging of Parkinson's disease. However, to date, no single eye-movement biomarker has been found to conclusively differentiate patients from healthy controls. In the present work, we investigate the use of state-of-the-art deep learning algorithms to perform Parkinson's disease classification using eye-tracking data from saccade experiments. In contrast to previous work, instead of using hand-crafted features from the saccades, we use raw $\sim1.5\,s$ long fixation intervals recorded during the preparatory phase before each trial. Using these short time series as input we implement two different classification models, InceptionTime and ROCKET. We find that the models are able to learn the classification task and generalize to unseen subjects. InceptionTime achieves $78\%$ accuracy, while ROCKET achieves $88\%$ accuracy. We also employ a novel method for pruning the ROCKET model to improve interpretability and generalizability, achieving an accuracy of $96\%$. Our results suggest that fixation data has low inter-subject variability and potentially carries useful information about brain cognitive and motor conditions, making it suitable for use with machine learning in the discovery of disease-relevant biomarkers.
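
To illustrate the ROCKET approach on fixation-like series, the sketch below draws random convolutional kernels, pools PPV and max features, and fits a ridge classifier; the data here are random stand-ins, and the kernel counts and lengths are assumptions, not the study's configuration.

```python
# ROCKET-style random-kernel features plus a linear classifier.
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

rng = np.random.default_rng(0)
n, T, n_kernels = 200, 150, 500      # trials, ~1.5 s at 100 Hz, kernels
X = rng.normal(size=(n, T))          # stand-in for fixation gaze traces
y = rng.integers(0, 2, size=n)       # PD vs. control labels (toy)

kernels = [(rng.normal(size=ell), rng.normal())
           for ell in rng.choice([7, 9, 11], size=n_kernels)]

def rocket_features(x):
    feats = []
    for w, bias in kernels:
        conv = np.convolve(x, w, mode="valid") + bias
        feats.extend([(conv > 0).mean(), conv.max()])  # PPV and max pooling
    return feats

F = np.array([rocket_features(x) for x in X])
clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)).fit(F, y)
print("train accuracy:", clf.score(F, y))
```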