cs.AI - 2023-11-27

Improving Denoising Diffusion Probabilistic Models via Exploiting Shared Representations

  • paper_url: http://arxiv.org/abs/2311.16353
  • repo_url: None
  • paper_authors: Delaram Pirhayatifard, Mohammad Taha Toghani, Guha Balakrishnan, César A. Uribe
  • for: Multi-task image generation with limited data for denoising diffusion probabilistic models (DDPM).
  • methods: Representation-based techniques from few-shot learning; a core meta-architecture with shared parameters plus task-specific layers with exclusive parameters.
  • results: Outperforms both unconditional and conditional DDPM on standard image datasets in terms of FID and SSIM.
    Abstract In this work, we address the challenge of multi-task image generation with limited data for denoising diffusion probabilistic models (DDPM), a class of generative models that produce high-quality images by reversing a noisy diffusion process. We propose a novel method, SR-DDPM, that leverages representation-based techniques from few-shot learning to effectively learn from fewer samples across different tasks. Our method consists of a core meta architecture with shared parameters, i.e., task-specific layers with exclusive parameters. By exploiting the similarity between diverse data distributions, our method can scale to multiple tasks without compromising the image quality. We evaluate our method on standard image datasets and show that it outperforms both unconditional and conditional DDPM in terms of FID and SSIM metrics.
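Below is a minimal, hedged sketch of the core architectural idea: a denoiser with shared parameters plus exclusive task-specific layers. The module names, sizes, and the omission of timestep conditioning are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a denoiser with a shared core and per-task heads, in the
# spirit of SR-DDPM's "shared parameters + task-specific layers" design.
# Timestep conditioning and the full DDPM training loop are omitted for brevity.
import torch
import torch.nn as nn

class SharedTaskDenoiser(nn.Module):
    def __init__(self, channels: int = 64, num_tasks: int = 4):
        super().__init__()
        # Core parameters shared across all generation tasks.
        self.shared_core = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
        )
        # Exclusive parameters: one lightweight head per task.
        self.task_heads = nn.ModuleList(
            nn.Conv2d(channels, 3, 3, padding=1) for _ in range(num_tasks)
        )

    def forward(self, x_noisy: torch.Tensor, task_id: int) -> torch.Tensor:
        h = self.shared_core(x_noisy)
        return self.task_heads[task_id](h)  # predicted noise for this task

model = SharedTaskDenoiser()
eps_hat = model(torch.randn(2, 3, 32, 32), task_id=1)
print(eps_hat.shape)  # torch.Size([2, 3, 32, 32])
```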

Compositional Chain-of-Thought Prompting for Large Multimodal Models

  • paper_url: http://arxiv.org/abs/2311.17076
  • repo_url: None
  • paper_authors: Chancharik Mitra, Brandon Huang, Trevor Darrell, Roei Herzig
  • for: Improving multimodal task performance, particularly compositional reasoning in vision-and-language tasks.
  • methods: Uses scene graphs (SGs) to extract compositional knowledge from a large multimodal model (LMM); the zero-shot Compositional Chain-of-Thought (CCoT) prompting method first generates an SG with the LMM and then includes it in the prompt to produce the response.
  • results: CCoT improves LMM performance on multimodal benchmarks without requiring fine-tuning or ground-truth scene graph annotations.
    Abstract The combination of strong visual backbones and Large Language Model (LLM) reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks. However, recent research has shown that even the most advanced LMMs still struggle to capture aspects of compositional visual reasoning, such as attributes and relationships between objects. One solution is to utilize scene graphs (SGs)--a formalization of objects and their relations and attributes that has been extensively used as a bridge between the visual and textual domains. Yet, scene graph data requires scene graph annotations, which are expensive to collect and thus not easily scalable. Moreover, finetuning an LMM based on SG data can lead to catastrophic forgetting of the pretraining objective. To overcome this, inspired by chain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a novel zero-shot Chain-of-Thought prompting method that utilizes SG representations in order to extract compositional knowledge from an LMM. Specifically, we first generate an SG using the LMM, and then use that SG in the prompt to produce a response. Through extensive experiments, we find that the proposed CCoT approach not only improves LMM performance on several vision and language VL compositional benchmarks but also improves the performance of several popular LMMs on general multimodal benchmarks, without the need for fine-tuning or annotated ground-truth SGs.
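A minimal sketch of the two-stage CCoT prompting flow described above: the LMM first generates a scene graph, which is then inserted into the answer prompt. `query_lmm` and the prompt wording are hypothetical stand-ins, not the paper's exact prompts.

```python
# Two-stage zero-shot prompting: (1) ask the LMM for a scene graph,
# (2) reuse that scene graph as context when answering the question.
def query_lmm(image_path: str, prompt: str) -> str:
    raise NotImplementedError("Replace with a call to your multimodal model of choice.")

def ccot_answer(image_path: str, question: str) -> str:
    sg_prompt = (
        "For the provided image and its associated question, generate a scene "
        "graph in JSON format that includes the objects relevant to the "
        "question, their attributes, and the relationships between them.\n"
        f"Question: {question}"
    )
    scene_graph = query_lmm(image_path, sg_prompt)   # step 1: SG generation

    answer_prompt = (
        f"Scene graph: {scene_graph}\n"
        "Use the image and the scene graph as context to answer the question.\n"
        f"Question: {question}"
    )
    return query_lmm(image_path, answer_prompt)      # step 2: grounded answer
```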

Reward Shaping for Improved Learning in Real-time Strategy Game Play

  • paper_url: http://arxiv.org/abs/2311.16339
  • repo_url: None
  • paper_authors: John Kliem, Prithviraj Dasgupta
  • for: Investigating reward shaping to improve reinforcement learning performance in a real-time strategy, capture-the-flag game.
  • methods: Designs reward shaping functions applied to different game events and validates them in a simulated two-player marine capture-the-flag environment.
  • results: Reward shaping is an effective means to understand the importance of different sub-tasks during game-play, to encode secondary objectives such as energy efficiency into the player's behavior, and to learn generalizable policies that perform well against opponents of different skill levels.
    Abstract We investigate the effect of reward shaping in improving the performance of reinforcement learning in the context of the real-time strategy, capture-the-flag game. The game is characterized by sparse rewards that are associated with infrequently occurring events such as grabbing or capturing the flag, or tagging the opposing player. We show that appropriately designed reward shaping functions applied to different game events can significantly improve the player's performance and training times of the player's learning algorithm. We have validated our reward shaping functions within a simulated environment for playing a marine capture-the-flag game between two players. Our experimental results demonstrate that reward shaping can be used as an effective means to understand the importance of different sub-tasks during game-play towards winning the game, to encode a secondary objective functions such as energy efficiency into a player's game-playing behavior, and, to improve learning generalizable policies that can perform well against different skill levels of the opponent.
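A toy sketch of event-based reward shaping with an energy-efficiency term, as described in the abstract; the event names, weights, and penalty coefficient are illustrative values, not the paper's.

```python
# Dense shaped reward built from sparse game events plus a secondary
# energy-efficiency objective. Values below are illustrative assumptions.
SHAPING_WEIGHTS = {
    "grab_flag": 0.25,      # sub-goal: picked up the opposing flag
    "capture_flag": 1.0,    # terminal objective: brought the flag home
    "tag_opponent": 0.1,    # tagged the opposing player
    "got_tagged": -0.1,     # penalty for being tagged
}

def shaped_reward(events: set[str], energy_used: float,
                  energy_weight: float = 0.01) -> float:
    """Sum of event bonuses minus an energy-efficiency penalty."""
    reward = sum(SHAPING_WEIGHTS.get(e, 0.0) for e in events)
    reward -= energy_weight * energy_used
    return reward

print(shaped_reward({"grab_flag", "tag_opponent"}, energy_used=3.0))  # ~0.32
```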

Releasing the CRaQAn (Coreference Resolution in Question-Answering): An open-source dataset and dataset creation methodology using instruction-following models

  • paper_url: http://arxiv.org/abs/2311.16338
  • repo_url: None
  • paper_authors: Rob Grzywinski, Joshua D’Arcy, Rob Naidoff, Ashish Shukla, Alex Browne, Ren Gibbons, Brinnae Bent
  • for: Improving information retrieval for question-answering applications, specifically with respect to coreference resolution.
  • methods: Uses an instruction-following model (GPT-4) and a Recursive Criticism and Improvement loop to create a high-quality dataset.
  • results: Releases over 250 question-answer pairs containing coreferences, supporting further research on coreference resolution in question-answering.
    Abstract Instruction-following language models demand robust methodologies for information retrieval to augment instructions for question-answering applications. A primary challenge is the resolution of coreferences in the context of chunking strategies for long documents. The critical barrier to experimentation of handling coreferences is a lack of open source datasets, specifically in question-answering tasks that require coreference resolution. In this work we present our Coreference Resolution in Question-Answering (CRaQAn) dataset, an open-source dataset that caters to the nuanced information retrieval requirements of coreference resolution in question-answering tasks by providing over 250 question-answer pairs containing coreferences. To develop this dataset, we developed a novel approach for creating high-quality datasets using an instruction-following model (GPT-4) and a Recursive Criticism and Improvement Loop.
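A rough sketch of a Recursive Criticism and Improvement loop driven by an instruction-following model; `call_llm`, the prompts, and the stopping test are hypothetical stand-ins rather than the authors' pipeline.

```python
# Draft -> critique -> improve loop for generating coreference-bearing QA pairs.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your instruction-following model (e.g., GPT-4).")

def generate_craqan_pair(passage: str, max_rounds: int = 3) -> str:
    draft = call_llm(
        "Write a question-answer pair about the passage below whose answer "
        f"requires resolving a coreference across sentences.\n\n{passage}"
    )
    for _ in range(max_rounds):
        critique = call_llm(
            "Critique this QA pair: does answering it truly require "
            f"coreference resolution, and is the answer correct?\n\n{draft}"
        )
        if "no issues" in critique.lower():
            break  # reviewer model is satisfied; stop refining
        draft = call_llm(
            "Improve the QA pair according to the critique.\n\n"
            f"QA pair:\n{draft}\n\nCritique:\n{critique}"
        )
    return draft
```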

Domain-Specific Deep Learning Feature Extractor for Diabetic Foot Ulcer Detection

  • paper_url: http://arxiv.org/abs/2311.16312
  • repo_url: None
  • paper_authors: Reza Basiri, Milos R. Popovic, Shehroz S. Khan
  • for: Developing a deep-learning network for automatic detection of diabetic foot ulcer (DFU) wounds and identifying the most accurate feature extractor.
  • methods: Compares 14 deep-learning networks, including UNet and EfficientNetb3, evaluated with mAP and F1-score on the publicly available DFU2020 dataset.
  • results: The combination of UNet and the EfficientNetb3 feature extractor achieved the best evaluation scores; these can serve as the basis of a comprehensive DFU domain-specific autonomous wound detection pipeline.
    Abstract Diabetic Foot Ulcer (DFU) is a condition requiring constant monitoring and evaluations for treatment. DFU patient population is on the rise and will soon outpace the available health resources. Autonomous monitoring and evaluation of DFU wounds is a much-needed area in health care. In this paper, we evaluate and identify the most accurate feature extractor that is the core basis for developing a deep-learning wound detection network. For the evaluation, we used mAP and F1-score on the publicly available DFU2020 dataset. A combination of UNet and EfficientNetb3 feature extractor resulted in the best evaluation among the 14 networks compared. UNet and Efficientnetb3 can be used as the classifier in the development of a comprehensive DFU domain-specific autonomous wound detection pipeline.
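One hedged way to pair a UNet decoder with an EfficientNet-B3 feature extractor, using the third-party segmentation_models_pytorch package; this is a convenience sketch, not the authors' pipeline, and the detection head and mAP/F1 evaluation are omitted.

```python
# Building a UNet with an EfficientNet-B3 encoder (the combination the paper
# found best). Requires the segmentation_models_pytorch package.
import torch
import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="efficientnet-b3",   # feature extractor identified as best
    encoder_weights="imagenet",       # ImageNet-pretrained backbone
    in_channels=3,
    classes=1,                        # wound vs. background
)

with torch.no_grad():
    mask_logits = model(torch.randn(1, 3, 256, 256))
print(mask_logits.shape)  # torch.Size([1, 1, 256, 256])
```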

A Graph Neural Network-Based QUBO-Formulated Hamiltonian-Inspired Loss Function for Combinatorial Optimization using Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.16277
  • repo_url: None
  • paper_authors: Redwan Ahmed Rizvee, Raheeb Hasan, Md. Mosaddek Khan
  • for: Addressing combinatorial optimization (CO) problems over graphs by leveraging the Quadratic Unconstrained Binary Optimization (QUBO) formulation and the Ising Hamiltonian used in quantum optimization algorithms.
  • methods: Builds on PI-GNN, a generic framework combining a Graph Neural Network (GNN) architecture with a QUBO-formulated Hamiltonian-inspired loss function; evaluates the QUBO-formulated Hamiltonian as a generic reward function in a reinforcement-learning paradigm; and introduces a novel Monte Carlo Tree Search-based strategy with GNN that applies guided search through manual perturbation of node labels during training.
  • results: Up to 44% improvement in the number of constraint violations compared to PI-GNN.
    Abstract Quadratic Unconstrained Binary Optimization (QUBO) is a generic technique to model various NP-hard Combinatorial Optimization problems (CO) in the form of binary variables. Ising Hamiltonian is used to model the energy function of a system. QUBO to Ising Hamiltonian is regarded as a technique to solve various canonical optimization problems through quantum optimization algorithms. Recently, PI-GNN, a generic framework, has been proposed to address CO problems over graphs based on Graph Neural Network (GNN) architecture. They introduced a generic QUBO-formulated Hamiltonian-inspired loss function that was directly optimized using GNN. PI-GNN is highly scalable but there lies a noticeable decrease in the number of satisfied constraints when compared to problem-specific algorithms and becomes more pronounced with increased graph densities. Here, We identify a behavioral pattern related to it and devise strategies to improve its performance. Another group of literature uses Reinforcement learning (RL) to solve the aforementioned NP-hard problems using problem-specific reward functions. In this work, we also focus on creating a bridge between the RL-based solutions and the QUBO-formulated Hamiltonian. We formulate and empirically evaluate the compatibility of the QUBO-formulated Hamiltonian as the generic reward function in the RL-based paradigm in the form of rewards. Furthermore, we also introduce a novel Monty Carlo Tree Search-based strategy with GNN where we apply a guided search through manual perturbation of node labels during training. We empirically evaluated our methods and observed up to 44% improvement in the number of constraint violations compared to the PI-GNN.
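A small sketch of the central ingredient: the QUBO-formulated Hamiltonian used as a differentiable loss over relaxed GNN outputs, in the spirit of PI-GNN. The toy graph, the MaxCut-style QUBO matrix, and the tiny adjacency-based GNN are assumptions for illustration.

```python
# Relaxed QUBO Hamiltonian p^T Q p as a training loss: a small GNN outputs a
# probability per node, and minimizing the Hamiltonian maximizes the cut.
import torch
import torch.nn as nn

n = 6
adj = torch.zeros(n, n)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

# MaxCut QUBO: x^T Q x = sum_edges (2 x_i x_j - x_i - x_j) = -cut(x)
Q = adj - torch.diag(adj.sum(dim=1))

feat = torch.eye(n)                                  # one-hot node features
layer1, layer2 = nn.Linear(n, 16), nn.Linear(16, 1)
opt = torch.optim.Adam(list(layer1.parameters()) + list(layer2.parameters()), lr=0.05)

for step in range(200):
    h = torch.relu(layer1(adj @ feat))               # one round of neighbor aggregation
    p = torch.sigmoid(layer2(adj @ h)).squeeze(-1)   # relaxed assignments in (0, 1)
    loss = p @ Q @ p                                 # QUBO-formulated Hamiltonian
    opt.zero_grad(); loss.backward(); opt.step()

assignment = (p > 0.5).int()
print("relaxed Hamiltonian:", loss.item(), "assignment:", assignment.tolist())
```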

RelVAE: Generative Pretraining for few-shot Visual Relationship Detection

  • paper_url: http://arxiv.org/abs/2311.16261
  • repo_url: None
  • paper_authors: Sotiris Karapiperis, Markos Diomataris, Vassilis Pitsikalis
  • for: Targets the problem of few-shot Visual Relationship Detection (VRD), which has been neglected by the community due to the lack of high-quality, diverse, and large-scale datasets.
  • methods: Introduces a generative model that captures the variation of semantic, visual, and spatial information of relations in a latent space, and exploits its representations for efficient few-shot classification.
  • results: Outperforms baselines on the VG200 and VRD datasets under few-shot training splits, and provides qualitative experiments to interpret the model's decisions.
    Abstract Visual relations are complex, multimodal concepts that play an important role in the way humans perceive the world. As a result of their complexity, high-quality, diverse and large scale datasets for visual relations are still absent. In an attempt to overcome this data barrier, we choose to focus on the problem of few-shot Visual Relationship Detection (VRD), a setting that has been so far neglected by the community. In this work we present the first pretraining method for few-shot predicate classification that does not require any annotated relations. We achieve this by introducing a generative model that is able to capture the variation of semantic, visual and spatial information of relations inside a latent space and later exploiting its representations in order to achieve efficient few-shot classification. We construct few-shot training splits and show quantitative experiments on VG200 and VRD datasets where our model outperforms the baselines. Lastly we attempt to interpret the decisions of the model by conducting various qualitative experiments.

Removing NSFW Concepts from Vision-and-Language Models for Text-to-Image Retrieval and Generation

  • paper_url: http://arxiv.org/abs/2311.16254
  • repo_url: https://github.com/aimagelab/safe-clip
  • paper_authors: Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
  • for: Making vision-and-language models safer for use in sensitive and trustworthy contexts.
  • methods: Removes sensitivity to not-safe-for-work concepts by distilling from a large language model that converts between safe and unsafe sentences, fine-tuned starting from just 100 manually curated pairs.
  • results: Extensive experiments on the resulting embedding space, for both retrieval and text-to-image generation, show that the model can be properly employed with pre-trained image generators.
    Abstract Vision-and-Language models such as CLIP have demonstrated remarkable effectiveness across a wide range of tasks. However, these models are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concern in their adoption. To overcome these limitations, we introduce a methodology to make Vision-and-Language models safer by removing their sensitivity to not-safe-for-work concepts. We show how this can be done by distilling from a large language model which converts between safe and unsafe sentences and which is fine-tuned starting from just 100 manually-curated pairs. We conduct extensive experiments on the resulting embedding space for both retrieval and text-to-image generation, where we show that our model can also be properly employed with pre-trained image generators. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip.
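A rough sketch of the kind of fine-tuning objective described above: pull the embedding of each unsafe sentence toward its safe counterpart while keeping safe embeddings close to the original encoder. The toy encoders and loss weighting are assumptions, not the released safe-clip training code.

```python
# Redirect unsafe-sentence embeddings toward their safe counterparts while
# preserving safe embeddings relative to a frozen reference encoder.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, vocab = 64, 1000
tunable_encoder = nn.EmbeddingBag(vocab, embed_dim)        # stand-in text encoder
frozen_reference = copy.deepcopy(tunable_encoder).eval()   # original CLIP-like encoder
for p in frozen_reference.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(tunable_encoder.parameters(), lr=1e-3)

def cosine_loss(a, b):
    return 1.0 - F.cosine_similarity(a, b, dim=-1).mean()

safe_tokens = torch.randint(0, vocab, (8, 12))    # token ids of safe sentences
unsafe_tokens = torch.randint(0, vocab, (8, 12))  # token ids of paired unsafe sentences

for step in range(100):
    with torch.no_grad():
        safe_ref = frozen_reference(safe_tokens)                  # anchor embeddings
    redirect = cosine_loss(tunable_encoder(unsafe_tokens), safe_ref)
    preserve = cosine_loss(tunable_encoder(safe_tokens), safe_ref)
    loss = redirect + preserve
    opt.zero_grad(); loss.backward(); opt.step()
```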

IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers

  • paper_url: http://arxiv.org/abs/2311.17072
  • repo_url: None
  • paper_authors: Chenglin Yang, Siyuan Qiao, Yuan Cao, Yu Zhang, Tao Zhu, Alan Yuille, Jiahui Yu
  • for: Narrowing the gap between generatively trained visual-language models and discriminative models on zero-shot classification tasks.
  • methods: Redesigns the captioner's scoring objective to alleviate the distributional bias inherited from text-only language modeling and to measure the information gain brought by the visual inputs, and designs a matching generative training objective.
  • results: The IG captioner achieves >18% improvement over the standard captioner on zero-shot ImageNet classification, reaching performance comparable to the CLIP classifier, and performs strongly on zero-shot image-text retrieval on MSCOCO and Flickr30K.
    Abstract Generative training has been demonstrated to be powerful for building visual-language models. However, on zero-shot discriminative benchmarks, there is still a performance gap between models trained with generative and discriminative objectives. In this paper, we aim to narrow this gap by improving the efficacy of generative training on classification tasks, without any finetuning processes or additional modules. Specifically, we focus on narrowing the gap between the generative captioner and the CLIP classifier. We begin by analysing the predictions made by the captioner and classifier and observe that the caption generation inherits the distribution bias from the language model trained with pure text modality, making it less grounded on the visual signal. To tackle this problem, we redesign the scoring objective for the captioner to alleviate the distributional bias and focus on measuring the gain of information brought by the visual inputs. We further design a generative training objective to match the evaluation objective. We name our model trained and evaluated from the novel procedures as Information Gain (IG) captioner. We pretrain the models on the public Laion-5B dataset and perform a series of discriminative evaluations. For the zero-shot classification on ImageNet, IG captioner achieves $> 18\%$ improvements over the standard captioner, achieving comparable performances with the CLIP classifier. IG captioner also demonstrated strong performance on zero-shot image-text retrieval tasks on MSCOCO and Flickr30K. We hope this paper inspires further research towards unifying generative and discriminative training procedures for visual-language models.
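A minimal sketch of information-gain scoring for zero-shot classification with a captioner: a class is scored by how much the image raises the likelihood of its caption relative to a visually uninformative input. `caption_log_likelihood`, the prompt template, and the null-image choice are hypothetical stand-ins.

```python
# Score each class by log p(caption | image) - log p(caption | null image),
# i.e., the information the visual signal adds, then pick the highest score.
import torch

def caption_log_likelihood(image: torch.Tensor, text: str) -> float:
    raise NotImplementedError("Replace with your captioner's log-likelihood scoring.")

def ig_classify(image: torch.Tensor, class_names: list[str]) -> str:
    null_image = torch.zeros_like(image)   # proxy for "no visual information"
    scores = {}
    for name in class_names:
        text = f"a photo of a {name}"
        gain = caption_log_likelihood(image, text) - \
               caption_log_likelihood(null_image, text)
        scores[name] = gain                # information gain from the image
    return max(scores, key=scores.get)     # predicted class
```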

Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models

  • paper_url: http://arxiv.org/abs/2311.16103
  • repo_url: https://github.com/pku-yuangroup/video-bench
  • paper_authors: Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan
  • for: Proposing a comprehensive evaluation system to guide the development of video-based large language models (Video-LLMs) capable of perception, comprehension, and decision-making.
  • methods: A benchmark of 10 carefully crafted tasks evaluating Video-LLMs at three levels (video-exclusive understanding, prior knowledge-based question-answering, and comprehension and decision-making), together with an automatic toolkit that processes model outputs, computes metrics, and produces final scores.
  • results: Evaluation of 8 representative Video-LLMs shows that current models still fall considerably short of human-level comprehension and analysis of real-world videos, offering valuable insights for future research.
    Abstract Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries. In pursuit of the ultimate goal of achieving artificial general intelligence, a truly intelligent Video-LLM model should not only see and understand the surroundings, but also possess human-level commonsense, and make well-informed decisions for the users. To guide the development of such a model, the establishment of a robust and comprehensive evaluation system becomes crucial. To this end, this paper proposes \textit{Video-Bench}, a new comprehensive benchmark along with a toolkit specifically designed for evaluating Video-LLMs. The benchmark comprises 10 meticulously crafted tasks, evaluating the capabilities of Video-LLMs across three distinct levels: Video-exclusive Understanding, Prior Knowledge-based Question-Answering, and Comprehension and Decision-making. In addition, we introduce an automatic toolkit tailored to process model outputs for various tasks, facilitating the calculation of metrics and generating convenient final scores. We evaluate 8 representative Video-LLMs using \textit{Video-Bench}. The findings reveal that current Video-LLMs still fall considerably short of achieving human-like comprehension and analysis of real-world videos, offering valuable insights for future research directions. The benchmark and toolkit are available at: \url{https://github.com/PKU-YuanGroup/Video-Bench}.
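A small sketch of the kind of post-processing such a toolkit performs: matching free-form answers to multiple-choice options and aggregating accuracy. The matching heuristic and record layout are illustrative assumptions, not Video-Bench's actual code.

```python
# Map a model's free-form answer onto the closest option, then compute accuracy.
from difflib import SequenceMatcher

def pick_option(model_answer: str, options: dict[str, str]) -> str:
    """Return the option letter whose text best matches the model's answer."""
    def similarity(text: str) -> float:
        return SequenceMatcher(None, model_answer.lower(), text.lower()).ratio()
    return max(options, key=lambda k: similarity(options[k]))

def task_accuracy(records: list[dict]) -> float:
    correct = sum(
        pick_option(r["model_answer"], r["options"]) == r["ground_truth"]
        for r in records
    )
    return correct / len(records)

records = [
    {"model_answer": "The man is cooking dinner.", "ground_truth": "B",
     "options": {"A": "playing guitar", "B": "cooking a meal", "C": "sleeping"}},
]
print(task_accuracy(records))  # 1.0
```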

Test-time Adaptation of Discriminative Models via Diffusion Generative Feedback

  • paper_url: http://arxiv.org/abs/2311.16102
  • repo_url: https://github.com/mihirp1998/Diffusion-TTA
  • paper_authors: Mihir Prabhudesai, Tsung-Wei Ke, Alexander C. Li, Deepak Pathak, Katerina Fragkiadaki
  • for: Investigating how generative models can be used to improve the accuracy of discriminative models.
  • methods: Diffusion-TTA, a test-time adaptation method that adapts pre-trained discriminative models to each unlabelled test example using generative feedback from a diffusion model: the diffusion model's conditioning is modulated with the discriminative model's output, and the image likelihood objective is maximized by backpropagating gradients into the discriminative model's parameters.
  • results: Diffusion-TTA significantly improves the accuracy of large-scale pre-trained discriminative models, including ImageNet classifiers, CLIP models, image pixel labellers, and image depth predictors, and outperforms existing test-time adaptation methods such as TTT-MAE and TENT, particularly in online adaptation where the model is continually adapted to each test example. Code, results, and visualizations: https://diffusion-tta.github.io/.
    Abstract The advancements in generative modeling, particularly the advent of diffusion models, have sparked a fundamental question: how can these models be effectively used for discriminative tasks? In this work, we find that generative models can be great test-time adapters for discriminative models. Our method, Diffusion-TTA, adapts pre-trained discriminative models such as image classifiers, segmenters and depth predictors, to each unlabelled example in the test set using generative feedback from a diffusion model. We achieve this by modulating the conditioning of the diffusion model using the output of the discriminative model. We then maximize the image likelihood objective by backpropagating the gradients to discriminative model's parameters. We show Diffusion-TTA significantly enhances the accuracy of various large-scale pre-trained discriminative models, such as, ImageNet classifiers, CLIP models, image pixel labellers and image depth predictors. Diffusion-TTA outperforms existing test-time adaptation methods, including TTT-MAE and TENT, and particularly shines in online adaptation setups, where the discriminative model is continually adapted to each example in the test set. We provide access to code, results, and visualizations on our website: https://diffusion-tta.github.io/.
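A compact sketch of the Diffusion-TTA loop: classifier probabilities modulate the diffusion conditioning, and the resulting noise-prediction loss is backpropagated into the classifier for each test example. The tiny stand-in networks and the single fixed noise level are simplifying assumptions, not the paper's pretrained models.

```python
# Test-time adaptation of a classifier using generative (diffusion) feedback.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, img_dim, cond_dim = 10, 3 * 32 * 32, 32
classifier = nn.Linear(img_dim, num_classes)              # "pretrained" discriminative model
class_embed = nn.Embedding(num_classes, cond_dim)         # diffusion conditioning table
denoiser = nn.Linear(img_dim + cond_dim, img_dim)         # toy conditional noise predictor
opt = torch.optim.SGD(classifier.parameters(), lr=1e-3)   # adapt only the classifier

def tta_step(image: torch.Tensor) -> None:
    probs = F.softmax(classifier(image), dim=-1)           # (1, C)
    cond = probs @ class_embed.weight                      # probability-weighted conditioning
    noise = torch.randn_like(image)
    noisy = image + noise                                  # single noise level for brevity
    pred_noise = denoiser(torch.cat([noisy, cond], dim=-1))
    loss = F.mse_loss(pred_noise, noise)                   # generative feedback
    opt.zero_grad(); loss.backward(); opt.step()           # update classifier parameters

for test_image in torch.randn(5, img_dim):                 # online adaptation over the test set
    tta_step(test_image.unsqueeze(0))
```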

On Bringing Robots Home

  • paper_url: http://arxiv.org/abs/2311.16098
  • repo_url: https://github.com/notmahi/dobb-e
  • paper_authors: Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, Lerrel Pinto
  • for: Developing an affordable yet versatile general-purpose system for learning robotic manipulation in household settings.
  • methods: A demonstration collection tool ("The Stick"), built from cheap parts and iPhones, was used to gather 13 hours of data across 22 New York City homes and to train Home Pretrained Representations (HPR); in a novel home, five minutes of demonstrations and fifteen minutes of adapting the HPR model let Dobb-E reliably solve the task on the Stretch, a commercially available mobile robot.
  • results: Over roughly 30 days of experimentation in 10 homes in and around New York City, Dobb-E achieved an 81% success rate across 109 tasks; the experiments also revealed challenges absent or ignored in lab robotics, such as strong shadows and variable demonstration quality from non-expert users.
    Abstract Throughout history, we have successfully integrated various machines into our homes. Dishwashers, laundry machines, stand mixers, and robot vacuums are a few recent examples. However, these machines excel at performing only a single task effectively. The concept of a "generalist machine" in homes - a domestic assistant that can adapt and learn from our needs, all while remaining cost-effective - has long been a goal in robotics that has been steadily pursued for decades. In this work, we initiate a large-scale effort towards this goal by introducing Dobb-E, an affordable yet versatile general-purpose system for learning robotic manipulation within household settings. Dobb-E can learn a new task with only five minutes of a user showing it how to do it, thanks to a demonstration collection tool ("The Stick") we built out of cheap parts and iPhones. We use the Stick to collect 13 hours of data in 22 homes of New York City, and train Home Pretrained Representations (HPR). Then, in a novel home environment, with five minutes of demonstrations and fifteen minutes of adapting the HPR model, we show that Dobb-E can reliably solve the task on the Stretch, a mobile robot readily available on the market. Across roughly 30 days of experimentation in homes of New York City and surrounding areas, we test our system in 10 homes, with a total of 109 tasks in different environments, and finally achieve a success rate of 81%. Beyond success percentages, our experiments reveal a plethora of unique challenges absent or ignored in lab robotics. These range from effects of strong shadows, to variable demonstration quality by non-expert users. With the hope of accelerating research on home robots, and eventually seeing robot butlers in every home, we open-source Dobb-E software stack and models, our data, and our hardware designs at https://dobb-e.com

Interactive Autonomous Navigation with Internal State Inference and Interactivity Estimation

  • paper_url: http://arxiv.org/abs/2311.16091
  • repo_url: None
  • paper_authors: Jiachen Li, David Isele, Kanghoon Lee, Jinkyoo Park, Kikuo Fujimura, Mykel J. Kochenderfer
  • for: Improving the ability of intelligent agents (e.g., autonomous vehicles) to navigate complex, highly interactive scenarios while providing explainable intermediate indicators for decision making.
  • methods: Three auxiliary tasks with spatio-temporal relational reasoning are integrated into a standard deep reinforcement learning framework; a spatio-temporal graph neural network encodes relations between dynamic entities to strengthen internal state inference and decision making; and an interactivity estimation mechanism measures the ego agent's influence on other agents from the difference between trajectories predicted with and without the ego agent.
  • results: Tested in an intersection driving simulator based on the Intelligent Intersection Driver Model (IIDM) that simulates vehicles and pedestrians, the method achieves robust, state-of-the-art performance on standard evaluation metrics and provides explainable intermediate indicators (internal states and interactivity scores).
    Abstract Deep reinforcement learning (DRL) provides a promising way for intelligent agents (e.g., autonomous vehicles) to learn to navigate complex scenarios. However, DRL with neural networks as function approximators is typically considered a black box with little explainability and often suffers from suboptimal performance, especially for autonomous navigation in highly interactive multi-agent environments. To address these issues, we propose three auxiliary tasks with spatio-temporal relational reasoning and integrate them into the standard DRL framework, which improves the decision making performance and provides explainable intermediate indicators. We propose to explicitly infer the internal states (i.e., traits and intentions) of surrounding agents (e.g., human drivers) as well as to predict their future trajectories in the situations with and without the ego agent through counterfactual reasoning. These auxiliary tasks provide additional supervision signals to infer the behavior patterns of other interactive agents. Multiple variants of framework integration strategies are compared. We also employ a spatio-temporal graph neural network to encode relations between dynamic entities, which enhances both internal state inference and decision making of the ego agent. Moreover, we propose an interactivity estimation mechanism based on the difference between predicted trajectories in these two situations, which indicates the degree of influence of the ego agent on other agents. To validate the proposed method, we design an intersection driving simulator based on the Intelligent Intersection Driver Model (IIDM) that simulates vehicles and pedestrians. Our approach achieves robust and state-of-the-art performance in terms of standard evaluation metrics and provides explainable intermediate indicators (i.e., internal states, and interactivity scores) for decision making.
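A small sketch of the interactivity estimation idea: the ego agent's influence on another agent is the discrepancy between that agent's trajectory predicted with the ego present and the counterfactual trajectory predicted without it. The predictor stub and the averaged-displacement metric are assumptions.

```python
# Interactivity score from factual vs. counterfactual trajectory predictions.
import numpy as np

def predict_trajectory(agent_state: np.ndarray, ego_present: bool) -> np.ndarray:
    """Placeholder for a learned predictor returning (T, 2) future positions."""
    raise NotImplementedError("Replace with the trained trajectory predictor.")

def interactivity_score(agent_state: np.ndarray) -> float:
    traj_with_ego = predict_trajectory(agent_state, ego_present=True)
    traj_without_ego = predict_trajectory(agent_state, ego_present=False)
    # Mean per-step displacement between factual and counterfactual futures.
    return float(np.linalg.norm(traj_with_ego - traj_without_ego, axis=-1).mean())
```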

MAST: Model-Agnostic Sparsified Training

  • paper_url: http://arxiv.org/abs/2311.16086
  • repo_url: https://github.com/konstmish/opt_methods
  • paper_authors: Yury Demidovich, Grigory Malinovsky, Egor Shulgin, Peter Richtárik
  • for: Improving the efficiency and theoretical understanding of machine learning model training through a sparsification-aware optimization approach.
  • methods: A new optimization problem formulation that explicitly incorporates an initially pre-trained model and random sketch operators, allowing sparsification of both the model and the gradient during training.
  • results: Establishes properties of the proposed objective and presents adapted SGD variants (general sampling, a distributed version, and variance-reduction techniques), achieving tighter convergence rates under relaxed assumptions and covering techniques such as Dropout and sparse training.
    Abstract We introduce a novel optimization problem formulation that departs from the conventional way of minimizing machine learning model loss as a black-box function. Unlike traditional formulations, the proposed approach explicitly incorporates an initially pre-trained model and random sketch operators, allowing for sparsification of both the model and gradient during training. We establish insightful properties of the proposed objective function and highlight its connections to the standard formulation. Furthermore, we present several variants of the Stochastic Gradient Descent (SGD) method adapted to the new problem formulation, including SGD with general sampling, a distributed version, and SGD with variance reduction techniques. We achieve tighter convergence rates and relax assumptions, bridging the gap between theoretical principles and practical applications, covering several important techniques such as Dropout and Sparse training. This work presents promising opportunities to enhance the theoretical understanding of model training through a sparsification-aware optimization approach.
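A rough sketch of one sparsified training step in the spirit of the formulation: random sketch (masking) operators are applied to the model before computing the loss and to the gradient before the update. The Bernoulli mask and the quadratic toy loss are illustrative assumptions.

```python
# Sparsified SGD step: mask the model when evaluating the loss, mask the
# gradient when applying the update (unbiased Dropout-style scaling).
import torch

def random_mask(shape, keep_prob=0.5):
    return (torch.rand(shape) < keep_prob).float() / keep_prob  # unbiased sketch operator

x = torch.randn(100, requires_grad=True)    # model parameters (pretrained init)
x_star = torch.randn(100)                   # toy optimum defining the loss
lr = 0.1

for step in range(200):
    model_sketch = random_mask(x.shape)
    grad_sketch = random_mask(x.shape)
    loss = 0.5 * ((model_sketch * x - x_star) ** 2).sum()   # loss of the sparsified model
    loss.backward()
    with torch.no_grad():
        x -= lr * grad_sketch * x.grad                       # sparsified gradient step
        x.grad.zero_()
```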

Transformer-QEC: Quantum Error Correction Code Decoding with Transferable Transformers

  • paper_url: http://arxiv.org/abs/2311.16082
  • repo_url: None
  • paper_authors: Hanrui Wang, Pengyu Liu, Kevin Shao, Dantong Li, Jiaqi Gu, David Z. Pan, Yongshan Ding, Song Han
  • for: Developing a transformer-based quantum error correction (QEC) decoder to lower logical error rates in quantum computing systems.
  • methods: A transformer decoder whose self-attention provides a global receptive field over all input syndromes, trained with a mixed loss combining local physical-error and global parity-label terms; its adaptability to variable-length inputs enables transfer learning across code distances without retraining.
  • results: Evaluated on six code distances and ten error configurations, the model consistently outperforms non-ML decoders (Union Find, Minimum Weight Perfect Matching) and other ML decoders, achieving the best logical error rates; transfer learning saves over 10x in training cost.
    Abstract Quantum computing has the potential to solve problems that are intractable for classical systems, yet the high error rates in contemporary quantum devices often exceed tolerable limits for useful algorithm execution. Quantum Error Correction (QEC) mitigates this by employing redundancy, distributing quantum information across multiple data qubits and utilizing syndrome qubits to monitor their states for errors. The syndromes are subsequently interpreted by a decoding algorithm to identify and correct errors in the data qubits. This task is complex due to the multiplicity of error sources affecting both data and syndrome qubits as well as syndrome extraction operations. Additionally, identical syndromes can emanate from different error sources, necessitating a decoding algorithm that evaluates syndromes collectively. Although machine learning (ML) decoders such as multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs) have been proposed, they often focus on local syndrome regions and require retraining when adjusting for different code distances. We introduce a transformer-based QEC decoder which employs self-attention to achieve a global receptive field across all input syndromes. It incorporates a mixed loss training approach, combining both local physical error and global parity label losses. Moreover, the transformer architecture's inherent adaptability to variable-length inputs allows for efficient transfer learning, enabling the decoder to adapt to varying code distances without retraining. Evaluation on six code distances and ten different error configurations demonstrates that our model consistently outperforms non-ML decoders, such as Union Find (UF) and Minimum Weight Perfect Matching (MWPM), and other ML decoders, thereby achieving best logical error rates. Moreover, the transfer learning can save over 10x of training cost.
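A minimal sketch of the mixed-loss objective: a small transformer reads the syndrome sequence and is supervised with both local physical-error labels and a global parity label. The model size, label shapes, and loss weighting are illustrative assumptions, not the paper's architecture.

```python
# Mixed loss: per-token physical-error supervision + pooled global parity label.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_syndromes, d_model = 24, 64
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
embed = nn.Linear(1, d_model)           # embed each syndrome bit as a token
local_head = nn.Linear(d_model, 1)      # per-token physical error prediction
global_head = nn.Linear(d_model, 1)     # pooled logical-parity prediction

def mixed_loss(syndromes, physical_errors, parity_label, alpha=0.5):
    tokens = embed(syndromes.unsqueeze(-1))                 # (B, N, d_model)
    h = encoder(tokens)                                     # global receptive field via self-attention
    local_logits = local_head(h).squeeze(-1)                # (B, N)
    global_logit = global_head(h.mean(dim=1)).squeeze(-1)   # (B,)
    local = F.binary_cross_entropy_with_logits(local_logits, physical_errors)
    global_term = F.binary_cross_entropy_with_logits(global_logit, parity_label)
    return alpha * local + (1 - alpha) * global_term

syndromes = torch.randint(0, 2, (8, num_syndromes)).float()
physical_errors = torch.randint(0, 2, (8, num_syndromes)).float()
parity_label = torch.randint(0, 2, (8,)).float()
print(mixed_loss(syndromes, physical_errors, parity_label))
```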

ViT-Lens-2: Gateway to Omni-modal Intelligence

  • paper_url: http://arxiv.org/abs/2311.16081
  • repo_url: https://github.com/TencentARC/ViT-Lens
  • paper_authors: Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou
  • for: Extending large foundation models beyond vision and language so AI agents can perceive diverse modalities in open-world environments.
  • methods: ViT-Lens-2 perceives novel modalities with a pretrained ViT: modality-specific lenses project any-modal signals into an intermediate embedding space, which is processed by the ViT with pretrained visual knowledge and aligned to a modality-independent space pre-defined by off-the-shelf foundation models.
  • results: Sets new state-of-the-art results across understanding tasks (e.g., zero-shot classification) for 3D point cloud, depth, audio, tactile, and EEG representations; integrated into multimodal foundation models, it enables zero-shot any-modality-to-text and image generation.
    Abstract Aiming to advance AI agents, large foundation models significantly improve reasoning and instruction execution, yet the current focus on vision and language neglects the potential of perceiving diverse modalities in open-world environments. However, the success of data-driven vision and language models is costly or even infeasible to be reproduced for rare modalities. In this paper, we present ViT-Lens-2 that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, the modality-specific lens is tuned to project any-modal signals to an intermediate embedding space, which are then processed by a strong ViT with pre-trained visual knowledge. The encoded representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models. ViT-Lens-2 provides a unified solution for representation learning of increasing modalities with two appealing advantages: (i) Unlocking the great potential of pretrained ViTs to novel modalities effectively with efficient data regime; (ii) Enabling emergent downstream capabilities through modality alignment and shared ViT parameters. We tailor ViT-Lens-2 to learn representations for 3D point cloud, depth, audio, tactile and EEG, and set new state-of-the-art results across various understanding tasks, such as zero-shot classification. By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation in a zero-shot manner. Code and models are available at https://github.com/TencentARC/ViT-Lens.
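A compact sketch of the lens idea: a modality-specific lens maps a new modality (here a point cloud) into the token space of a frozen ViT stand-in, and the output is aligned to anchor features from an off-the-shelf model (e.g., CLIP embeddings). All module sizes and the toy ViT are assumptions.

```python
# Modality lens -> frozen ViT -> alignment with a pre-defined anchor space.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_token, num_tokens, d_anchor = 768, 16, 512

class PointCloudLens(nn.Module):
    """Maps (B, N, 3) points to a sequence of ViT-compatible tokens."""
    def __init__(self):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, d_token))
        self.queries = nn.Parameter(torch.randn(num_tokens, d_token))
        self.attn = nn.MultiheadAttention(d_token, num_heads=8, batch_first=True)

    def forward(self, points):
        feats = self.point_mlp(points)                       # (B, N, d_token)
        q = self.queries.expand(points.size(0), -1, -1)      # learned query tokens
        tokens, _ = self.attn(q, feats, feats)               # cross-attend into point features
        return tokens                                        # (B, num_tokens, d_token)

frozen_vit = nn.TransformerEncoder(                          # stand-in for a pretrained ViT
    nn.TransformerEncoderLayer(d_token, nhead=8, batch_first=True), num_layers=2).eval()
for p in frozen_vit.parameters():
    p.requires_grad_(False)
project = nn.Linear(d_token, d_anchor)                       # into the pre-defined anchor space

lens = PointCloudLens()
points = torch.randn(4, 256, 3)
anchor = F.normalize(torch.randn(4, d_anchor), dim=-1)       # e.g., CLIP features of paired text
pred = F.normalize(project(frozen_vit(lens(points)).mean(dim=1)), dim=-1)
loss = 1.0 - (pred * anchor).sum(dim=-1).mean()              # cosine alignment loss
```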

MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

  • paper_url: http://arxiv.org/abs/2311.16079
  • repo_url: https://github.com/epfllm/meditron
  • paper_authors: Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit, Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, Antoine Bosselut
  • for: MEDITRON aims to improve access to large-scale medical language models, with the goal of democratizing medical knowledge.
  • methods: A suite of open-source language models with 7B and 70B parameters, adapted from Llama-2 and pretrained on a comprehensively curated medical corpus.
  • results: Achieves significant performance gains over several state-of-the-art baselines on major medical benchmarks, with a 6% absolute gain over the best public baseline and performance within 5% of GPT-4.
    Abstract Large language models (LLMs) can potentially democratize access to medical knowledge. While many efforts have been made to harness and improve LLMs' medical knowledge and reasoning capacities, the resulting models are either closed-source (e.g., PaLM, GPT-4) or limited in scale (<= 13B parameters), which restricts their abilities. In this work, we improve access to large-scale medical LLMs by releasing MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain. MEDITRON builds on Llama-2 (through our adaptation of Nvidia's Megatron-LM distributed trainer), and extends pretraining on a comprehensively curated medical corpus, including selected PubMed articles, abstracts, and internationally-recognized medical guidelines. Evaluations using four major medical benchmarks show significant performance gains over several state-of-the-art baselines before and after task-specific finetuning. Overall, MEDITRON achieves a 6% absolute performance gain over the best public baseline in its parameter class and 3% over the strongest baseline we finetuned from Llama-2. Compared to closed-source LLMs, MEDITRON-70B outperforms GPT-3.5 and Med-PaLM and is within 5% of GPT-4 and 10% of Med-PaLM-2. We release our code for curating the medical pretraining corpus and the MEDITRON model weights to drive open-source development of more capable medical LLMs.

BioLORD-2023: Semantic Textual Representations Fusing LLM and Clinical Knowledge Graph Insights

  • paper_url: http://arxiv.org/abs/2311.16075
  • repo_url: None
  • paper_authors: François Remy, Kris Demuynck, Thomas Demeester
  • for: Using large language models to complement biomedical knowledge graphs when training semantic models for the biomedical and clinical domains.
  • methods: A three-step approach drawing on the UMLS knowledge graph and large language models: an improved contrastive learning phase, a novel self-distillation phase, and a weight averaging phase.
  • results: Rigorous evaluation on the BioLORD testing suite and diverse downstream tasks shows consistent, substantial improvements over the previous state of the art (+2 pts on MedSTS, +2.5 pts on MedNLI-S, +6.1 pts on EHR-Rel-B); the authors also distill and release a multilingual model compatible with 50+ languages and finetuned on 7 European languages, from which many clinical pipelines can benefit.
    Abstract In this study, we investigate the potential of Large Language Models to complement biomedical knowledge graphs in the training of semantic models for the biomedical and clinical domains. Drawing on the wealth of the UMLS knowledge graph and harnessing cutting-edge Large Language Models, we propose a new state-of-the-art approach for obtaining high-fidelity representations of biomedical concepts and sentences, consisting of three steps: an improved contrastive learning phase, a novel self-distillation phase, and a weight averaging phase. Through rigorous evaluations via the extensive BioLORD testing suite and diverse downstream tasks, we demonstrate consistent and substantial performance improvements over the previous state of the art (e.g. +2pts on MedSTS, +2.5pts on MedNLI-S, +6.1pts on EHR-Rel-B). Besides our new state-of-the-art biomedical model for English, we also distill and release a multilingual model compatible with 50+ languages and finetuned on 7 European languages. Many clinical pipelines can benefit from our latest models. Our new multilingual model enables a range of languages to benefit from our advancements in biomedical semantic representation learning, opening a new avenue for bioinformatics researchers around the world. As a result, we hope to see BioLORD-2023 becoming a precious tool for future biomedical applications.

A Survey on Vulnerability of Federated Learning: A Learning Algorithm Perspective

  • paper_url: http://arxiv.org/abs/2311.16065
  • repo_url: https://github.com/rand2ai/awesome-vulnerability-of-federated-learning
  • paper_authors: Xianghua Xie, Chen Hu, Hanchi Ren, Jingjing Deng
  • for: Surveying malicious attacks against federated learning (FL) systems, categorized from new perspectives on attack origins and targets.
  • methods: Categorizes existing threat models by attack source and target into four types: Data to Model (D2M), Model to Data (M2D), Model to Model (M2M), and composite attacks; for each type, discusses defense strategies, which have evolved from single metrics and excluding malicious clients toward multifaceted approaches that examine client models at various phases.
  • results: Finds that the to-learn data, learning gradients, and learned model at different stages can all be manipulated, enabling attacks that undermine model performance, reconstruct private local data, or insert backdoors; the threats are also becoming more insidious, shifting from amplifying malicious gradients to subtly altering the least significant weights in local models to bypass defenses.
    Abstract This review paper takes a comprehensive look at malicious attacks against FL, categorizing them from new perspectives on attack origins and targets, and providing insights into their methodology and impact. In this survey, we focus on threat models targeting the learning process of FL systems. Based on the source and target of the attack, we categorize existing threat models into four types, Data to Model (D2M), Model to Data (M2D), Model to Model (M2M) and composite attacks. For each attack type, we discuss the defense strategies proposed, highlighting their effectiveness, assumptions and potential areas for improvement. Defense strategies have evolved from using a singular metric to excluding malicious clients, to employing a multifaceted approach examining client models at various phases. In this survey paper, our research indicates that the to-learn data, the learning gradients, and the learned model at different stages all can be manipulated to initiate malicious attacks that range from undermining model performance, reconstructing private local data, and to inserting backdoors. We have also seen these threat are becoming more insidious. While earlier studies typically amplified malicious gradients, recent endeavors subtly alter the least significant weights in local models to bypass defense measures. This literature review provides a holistic understanding of the current FL threat landscape and highlights the importance of developing robust, efficient, and privacy-preserving defenses to ensure the safe and trusted adoption of FL in real-world applications.

OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2311.16038
  • repo_url: https://github.com/wzzheng/occworld
  • paper_authors: Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, Jiwen Lu
  • for: The paper aims to improve the understanding of the 3D scene evolution in autonomous driving by proposing a new framework called OccWorld, which learns a world model in the 3D occupancy space.
  • methods: The proposed method uses a reconstruction-based scene tokenizer to obtain discrete scene tokens, and a GPT-like spatial-temporal generative transformer to generate subsequent scene and ego tokens.
  • results: The paper demonstrates the effectiveness of OccWorld in modeling the evolution of driving scenes through extensive experiments on the nuScenes benchmark, and shows competitive planning results without using instance and map supervision.
    Abstract Understanding how the 3D scene evolves is vital for making decisions in autonomous driving. Most existing methods achieve this by predicting the movements of object boxes, which cannot capture more fine-grained scene information. In this paper, we explore a new framework of learning a world model, OccWorld, in the 3D Occupancy space to simultaneously predict the movement of the ego car and the evolution of the surrounding scenes. We propose to learn a world model based on 3D occupancy rather than 3D bounding boxes and segmentation maps for three reasons: 1) expressiveness. 3D occupancy can describe the more fine-grained 3D structure of the scene; 2) efficiency. 3D occupancy is more economical to obtain (e.g., from sparse LiDAR points). 3) versatility. 3D occupancy can adapt to both vision and LiDAR. To facilitate the modeling of the world evolution, we learn a reconstruction-based scene tokenizer on the 3D occupancy to obtain discrete scene tokens to describe the surrounding scenes. We then adopt a GPT-like spatial-temporal generative transformer to generate subsequent scene and ego tokens to decode the future occupancy and ego trajectory. Extensive experiments on the widely used nuScenes benchmark demonstrate the ability of OccWorld to effectively model the evolution of the driving scenes. OccWorld also produces competitive planning results without using instance and map supervision. Code: https://github.com/wzzheng/OccWorld.
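A high-level sketch of the forecasting loop: discrete scene tokens from a tokenizer are extended autoregressively by a GPT-like transformer to produce the next frame's scene and ego tokens. The stand-in tokenizer, vocabulary size, and greedy decoding are assumptions, not OccWorld's design.

```python
# Autoregressive generation of the next occupancy frame's tokens.
import torch
import torch.nn as nn

vocab, tokens_per_frame, d_model = 512, 32, 128
embed = nn.Embedding(vocab, d_model)
gpt = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
to_logits = nn.Linear(d_model, vocab)

def generate_next_frame(history_tokens: torch.Tensor) -> torch.Tensor:
    """Greedily generate one future frame of scene/ego tokens from history."""
    tokens = history_tokens
    for _ in range(tokens_per_frame):
        h = embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = gpt(h, mask=mask)                       # causal spatial-temporal attention
        next_token = to_logits(h[:, -1]).argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens[:, -tokens_per_frame:]            # tokens of the predicted frame

history = torch.randint(0, vocab, (1, 2 * tokens_per_frame))   # two past occupancy frames
future_frame = generate_next_frame(history)
print(future_frame.shape)  # torch.Size([1, 32])
```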

RobustState: Boosting Fidelity of Quantum State Preparation via Noise-Aware Variational Training

  • paper_url: http://arxiv.org/abs/2311.16035
  • repo_url: None
  • paper_authors: Hanrui Wang, Yilian Liu, Pengyu Liu, Jiaqi Gu, Zirui Li, Zhiding Liang, Jinglei Cheng, Yongshan Ding, Xuehai Qian, Yiyu Shi, David Z. Pan, Frederic T. Chong, Song Han
  • for: Proposing an efficient, noise-robust training methodology for variational quantum state preparation to improve fidelity on real quantum machines.
  • methods: RobustState builds on variational quantum state preparation (VQSP), which iteratively tunes ansatz parameters to approximate the target state; measurement outcomes from real machines are used in back-propagation through classical simulators, incorporating real quantum noise into the gradient calculations.
  • results: On state preparation tasks for 4 quantum algorithms across 10 real quantum machines, RobustState reduces coherent error by up to 7.1x and improves state fidelity by up to 96% and 81% for 4-qubit and 5-qubit states, respectively; on average, it improves fidelity by 50% and 72% over baseline approaches.
    Abstract Quantum state preparation, a crucial subroutine in quantum computing, involves generating a target quantum state from initialized qubits. Arbitrary state preparation algorithms can be broadly categorized into arithmetic decomposition (AD) and variational quantum state preparation (VQSP). AD employs a predefined procedure to decompose the target state into a series of gates, whereas VQSP iteratively tunes ansatz parameters to approximate target state. VQSP is particularly apt for Noisy-Intermediate Scale Quantum (NISQ) machines due to its shorter circuits. However, achieving noise-robust parameter optimization still remains challenging. We present RobustState, a novel VQSP training methodology that combines high robustness with high training efficiency. The core idea involves utilizing measurement outcomes from real machines to perform back-propagation through classical simulators, thus incorporating real quantum noise into gradient calculations. RobustState serves as a versatile, plug-and-play technique applicable for training parameters from scratch or fine-tuning existing parameters to enhance fidelity on target machines. It is adaptable to various ansatzes at both gate and pulse levels and can even benefit other variational algorithms, such as variational unitary synthesis. Comprehensive evaluation of RobustState on state preparation tasks for 4 distinct quantum algorithms using 10 real quantum machines demonstrates a coherent error reduction of up to 7.1 $\times$ and state fidelity improvement of up to 96\% and 81\% for 4-Q and 5-Q states, respectively. On average, RobustState improves fidelity by 50\% and 72\% for 4-Q and 5-Q states compared to baseline approaches.

Machine Learning-Enhanced Aircraft Landing Scheduling under Uncertainties

  • paper_url: http://arxiv.org/abs/2311.16030
  • repo_url: None
  • paper_authors: Yutian Pang, Peng Zhao, Jueming Hu, Yongming Liu
  • for: Reducing aircraft delays and the associated safety risks and financial losses through a machine learning (ML)-enhanced landing scheduling methodology that improves automation and safety.
  • methods: A multi-stage conditional ML predictor estimates separation times based on flight events; the predictions are integrated as safety constraints into a time-constrained traveling salesman problem solved with mixed-integer linear programming (MILP), with historical flight recordings and model predictions used to handle uncertainties between successive flights.
  • results: Validated on real-world data from the Atlanta Air Route Traffic Control Center (ARTCC ZTL); case studies show an average 17.2% reduction in total landing time compared to the First-Come-First-Served (FCFS) rule, and unlike FCFS the method accounts for uncertainties, instilling confidence in the schedule.
    Abstract This paper addresses aircraft delays, emphasizing their impact on safety and financial losses. To mitigate these issues, an innovative machine learning (ML)-enhanced landing scheduling methodology is proposed, aiming to improve automation and safety. Analyzing flight arrival delay scenarios reveals strong multimodal distributions and clusters in arrival flight time durations. A multi-stage conditional ML predictor enhances separation time prediction based on flight events. ML predictions are then integrated as safety constraints in a time-constrained traveling salesman problem formulation, solved using mixed-integer linear programming (MILP). Historical flight recordings and model predictions address uncertainties between successive flights, ensuring reliability. The proposed method is validated using real-world data from the Atlanta Air Route Traffic Control Center (ARTCC ZTL). Case studies demonstrate an average 17.2% reduction in total landing time compared to the First-Come-First-Served (FCFS) rule. Unlike FCFS, the proposed methodology considers uncertainties, instilling confidence in scheduling. The study concludes with remarks and outlines future research directions.
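A minimal sketch of the scheduling core: a single-runway landing-order MILP with pairwise separation constraints, written with the PuLP solver. The flight set, earliest landing times, and the constant 2-minute separations are made-up placeholders for the ML-predicted, event-conditional separation times used in the paper.

```python
import itertools
import pulp

flights = ["F1", "F2", "F3", "F4"]
earliest = {"F1": 0, "F2": 2, "F3": 3, "F4": 5}        # minutes after a reference time
sep = {(i, j): 2 for i, j in itertools.permutations(flights, 2)}  # stand-in separations

M = 1_000                                              # big-M constant
prob = pulp.LpProblem("landing_schedule", pulp.LpMinimize)
t = {f: pulp.LpVariable(f"t_{f}", lowBound=earliest[f]) for f in flights}
order = {(i, j): pulp.LpVariable(f"o_{i}_{j}", cat="Binary")
         for i, j in itertools.permutations(flights, 2)}   # order[i, j] = 1 if i lands before j

prob += pulp.lpSum(t.values())                         # minimise total landing time

for i, j in itertools.combinations(flights, 2):
    prob += order[i, j] + order[j, i] == 1             # exactly one ordering holds
    prob += t[j] >= t[i] + sep[i, j] - M * (1 - order[i, j])
    prob += t[i] >= t[j] + sep[j, i] - M * (1 - order[j, i])

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({f: t[f].value() for f in flights})
```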

An HCAI Methodological Framework: Putting It Into Action to Enable Human-Centered AI

  • paper_url: http://arxiv.org/abs/2311.16027
  • repo_url: None
  • paper_authors: Wei Xu, Zaifeng Gao, Marvin Dainoff
  • for: Provide a comprehensive and interdisciplinary methodological framework for Human-Centered AI (HCAI) to guide its implementation and overcome the current challenges in the field.
  • methods: The proposed framework integrates seven components: design goals, design principles, implementation approaches, design paradigms, interdisciplinary teams, methods, and processes. It is designed to be systematic and executable, and can be applied to develop, transfer, and implement HCAI-based intelligent systems.
  • results: The framework is expected to overcome the weaknesses of current frameworks and the challenges currently faced in implementing HCAI, enabling the design, development, and deployment of HCAI-based intelligent systems that maximize the benefits of AI technology to humans while minimizing its potential adverse effects.
    Abstract Human-centered AI (HCAI), as a design philosophy, advocates prioritizing humans in designing, developing, and deploying intelligent systems, aiming to maximize the benefits of AI technology to humans and avoid its potential adverse effects. While HCAI has gained momentum, the lack of guidance on methodology in its implementation makes its adoption challenging. After assessing the needs for a methodological framework for HCAI, this paper first proposes a comprehensive and interdisciplinary HCAI methodological framework integrated with seven components, including design goals, design principles, implementation approaches, design paradigms, interdisciplinary teams, methods, and processes. The implications of the framework are also discussed. This paper also presents a "three-layer" approach to facilitate the implementation of the framework. We believe the proposed framework is systematic and executable, which can overcome the weaknesses in current frameworks and the challenges currently faced in implementing HCAI. Thus, the framework can help put it into action to develop, transfer, and implement HCAI in practice, eventually enabling the design, development, and deployment of HCAI-based intelligent systems.

Generative AI and US Intellectual Property Law

  • paper_url: http://arxiv.org/abs/2311.16023
  • repo_url: None
  • paper_authors: Cherie M Poland
  • for: Assess the legal and ethical questions raised by generative AI, including artists' rights, content production, data collection, privacy, accuracy of information, and intellectual property rights.
  • methods: Reviews recent administrative and case law challenges to examine whether generative AI software systems hold independent intellectual property rights in the content they generate.
  • results: The legal and ethical questions remain unsettled and court decisions are mixed; it is not yet clear whether, or to what degree, human creators will be able to protect their intellectual property rights against generative AI software, its developers, operators, and owners.
    Abstract The rapidity with which generative AI has been adopted and advanced has raised legal and ethical questions related to the impact on artists rights, content production, data collection, privacy, accuracy of information, and intellectual property rights. Recent administrative and case law challenges have shown that generative AI software systems do not have independent intellectual property rights in the content that they generate. It remains to be seen whether human content creators can retain their intellectual property rights against generative AI software, its developers, operators, and owners for the misappropriation of the work of human creatives, given the metes and bounds of existing law. Early signs from various courts are mixed as to whether and to what degree the results generated by AI models meet the legal standards of infringement under existing law.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

  • paper_url: http://arxiv.org/abs/2311.16502
  • repo_url: https://github.com/MMMU-Benchmark/MMMU
  • paper_authors: Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen
  • for: Evaluate multimodal models on massive multi-discipline tasks that demand college-level subject knowledge and deliberate reasoning.
  • methods: 11.5K meticulously collected multimodal questions covering six core disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering), spanning 30 subjects and 183 subfields with 30 highly heterogeneous image types such as charts, diagrams, maps, tables, music sheets, and chemical structures.
  • results: Evaluating 14 open-source LMMs and the proprietary GPT-4V shows that MMMU poses substantial challenges: even GPT-4V reaches only 56% accuracy, indicating significant room for improvement. The authors believe MMMU will stimulate the community to build next-generation multimodal foundation models on the path toward expert AGI.
    Abstract We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. Our evaluation of 14 open-source LMMs and the proprietary GPT-4V(ision) highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V only achieves a 56% accuracy, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

RIDE: Real-time Intrusion Detection via Explainable Machine Learning Implemented in a Memristor Hardware Architecture

  • paper_url: http://arxiv.org/abs/2311.16018
  • repo_url: None
  • paper_authors: Jingdi Chen, Lei Zhang, Joseph Riem, Gina Adam, Nathaniel D. Bastian, Tian Lan
  • for: Proposes a deep learning-based network intrusion detection solution for real-time, packet-level detection of malicious traffic behavior patterns in high-speed communication networks.
  • methods: A recurrent autoencoder encodes an arbitrary-length packet sequence into a compact joint feature embedding that is fed to a DNN-based classifier; a software-hardware co-design approach then converts the learned detection policy into decision trees and implements them on an emerging memristor-based architecture (a toy distillation sketch follows the abstract below).
  • results: The approach keeps detection accuracy high while sharply reducing computation time and resource use, enabling real-time detection; on real-world datasets (e.g., the UNSW and CIC-IDS datasets) it reaches nearly three-nines detection accuracy with a speedup of nearly four orders of magnitude.
    Abstract Deep Learning (DL) based methods have shown great promise in network intrusion detection by identifying malicious network traffic behavior patterns with high accuracy, but their applications to real-time, packet-level detections in high-speed communication networks are challenging due to the high computation time and resource requirements of Deep Neural Networks (DNNs), as well as lack of explainability. To this end, we propose a packet-level network intrusion detection solution that makes novel use of Recurrent Autoencoders to integrate an arbitrary-length sequence of packets into a more compact joint feature embedding, which is fed into a DNN-based classifier. To enable explainability and support real-time detections at micro-second speed, we further develop a Software-Hardware Co-Design approach to efficiently realize the proposed solution by converting the learned detection policies into decision trees and implementing them using an emerging architecture based on memristor devices. By jointly optimizing associated software and hardware constraints, we show that our approach leads to an extremely efficient, real-time solution with high detection accuracy at the packet level. Evaluation results on real-world datasets (e.g., UNSW and CIC-IDS datasets) demonstrate nearly three-nines detection accuracy with a substantial speedup of nearly four orders of magnitude.
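As a toy illustration of the policy-to-decision-tree step described above, the sketch below trains a small neural classifier and then distils its predictions into a shallow decision tree, the kind of structure that maps naturally onto hardware. The synthetic dataset, model sizes, and tree depth are arbitrary; the real system operates on recurrent-autoencoder packet embeddings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Random features stand in for the recurrent-autoencoder packet embeddings.
X, y = make_classification(n_samples=4000, n_features=16, random_state=0)

dnn = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
dnn.fit(X, y)

# Distillation: the tree learns to reproduce the DNN's labels, not the ground truth.
tree = DecisionTreeClassifier(max_depth=6, random_state=0)
tree.fit(X, dnn.predict(X))

print("DNN accuracy :", accuracy_score(y, dnn.predict(X)))
print("Tree fidelity:", accuracy_score(dnn.predict(X), tree.predict(X)))
```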

Decoding Logic Errors: A Comparative Study on Bug Detection by Students and Large Language Models

  • paper_url: http://arxiv.org/abs/2311.16017
  • repo_url: None
  • paper_authors: Stephen MacNeil, Paul Denny, Andrew Tran, Juho Leinonen, Seth Bernstein, Arto Hellas, Sami Sarsa, Joanne Kim
  • for: Investigate whether large language models (LLMs) can automatically detect logic errors in code and explain them in a way that is accessible to novice programmers.
  • methods: Two popular LLMs (GPT-3 and GPT-4) are evaluated on a logic-error detection task and compared with a large cohort of introductory computing students (n=964).
  • results: The current generation of LLMs improves markedly over the previous one, and both generations significantly outperform the students; the authors discuss integrating such models into computing education tools to support students learning to program.
    Abstract Identifying and resolving logic errors can be one of the most frustrating challenges for novices programmers. Unlike syntax errors, for which a compiler or interpreter can issue a message, logic errors can be subtle. In certain conditions, buggy code may even exhibit correct behavior -- in other cases, the issue might be about how a problem statement has been interpreted. Such errors can be hard to spot when reading the code, and they can also at times be missed by automated tests. There is great educational potential in automatically detecting logic errors, especially when paired with suitable feedback for novices. Large language models (LLMs) have recently demonstrated surprising performance for a range of computing tasks, including generating and explaining code. These capabilities are closely linked to code syntax, which aligns with the next token prediction behavior of LLMs. On the other hand, logic errors relate to the runtime performance of code and thus may not be as well suited to analysis by LLMs. To explore this, we investigate the performance of two popular LLMs, GPT-3 and GPT-4, for detecting and providing a novice-friendly explanation of logic errors. We compare LLM performance with a large cohort of introductory computing students $(n=964)$ solving the same error detection task. Through a mixed-methods analysis of student and model responses, we observe significant improvement in logic error identification between the previous and current generation of LLMs, and find that both LLM generations significantly outperform students. We outline how such models could be integrated into computing education tools, and discuss their potential for supporting students when learning programming.

Forecasting Auxiliary Energy Consumption for Electric Heavy-Duty Vehicles

  • paper_url: http://arxiv.org/abs/2311.16003
  • repo_url: None
  • paper_authors: Yuantao Fan, Zhenkan Wang, Sepideh Pashami, Slawomir Nowaczyk, Henrik Ydreskog
  • for: Accurate energy-consumption forecasting for electric commercial heavy-duty vehicles is key to optimizing operations and charge-aware route planning, and the predictions must be explainable to earn user trust and be deployed in practice; because transport tasks, ambient conditions, and drivers vary, the population behind such an AI system is heterogeneous.
  • methods: Multiple regression models are trained on subsets of the data, which avoids the misleading results that existing XAI methods such as LIME or SHAP can produce on heterogeneous populations; this yields both better regression performance and more intuitive, consistent explanations (a minimal cluster-then-regress sketch follows the abstract below).
  • results: Experiments on synthetic and real-world datasets show that splitting the complex problem into simpler ones improves both regression performance and interpretability.
    Abstract Accurate energy consumption prediction is crucial for optimizing the operation of electric commercial heavy-duty vehicles, e.g., route planning for charging. Moreover, understanding why certain predictions are cast is paramount for such a predictive model to gain user trust and be deployed in practice. Since commercial vehicles operate differently as transportation tasks, ambient, and drivers vary, a heterogeneous population is expected when building an AI system for forecasting energy consumption. The dependencies between the input features and the target values are expected to also differ across sub-populations. One well-known example of such a statistical phenomenon is the Simpson paradox. In this paper, we illustrate that such a setting poses a challenge for existing XAI methods that produce global feature statistics, e.g. LIME or SHAP, causing them to yield misleading results. We demonstrate a potential solution by training multiple regression models on subsets of data. It not only leads to superior regression performance but also more relevant and consistent LIME explanations. Given that the employed groupings correspond to relevant sub-populations, the associations between the input features and the target values are consistent within each cluster but different across clusters. Experiments on both synthetic and real-world datasets show that such splitting of a complex problem into simpler ones yields better regression performance and interpretability.
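A minimal sketch of the cluster-then-regress idea on synthetic data: a single global regressor struggles when two sub-populations have opposite feature-target relationships, while one regressor per cluster recovers them. The clustering method, the number of clusters, and the synthetic data generator are illustrative choices, not the paper's setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))
cluster_true = (X[:, 0] > 0).astype(int)
# Two sub-populations with opposite feature-target relationships (Simpson-like effect).
y = np.where(cluster_true == 1, 3.0 * X[:, 1] + 1.0, -2.0 * X[:, 1] - 1.0)
y += 0.1 * rng.normal(size=len(y))

global_model = LinearRegression().fit(X, y)
print("global R^2:", r2_score(y, global_model.predict(X)))

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
local_pred = np.empty_like(y)
for c in np.unique(clusters):
    idx = clusters == c
    local_pred[idx] = LinearRegression().fit(X[idx], y[idx]).predict(X[idx])
print("per-cluster R^2:", r2_score(y, local_pred))
```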

InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery

  • paper_url: http://arxiv.org/abs/2311.16208
  • repo_url: https://github.com/IDEA-XL/InstructMol
  • paper_authors: He Cao, Zijing Liu, Xingyu Lu, Yuan Yao, Yu Li
  • for: Explore how AI for drug discovery can overcome its generalization and training-cost challenges by letting large language models interact with complex molecular data.
  • methods: InstructMol, a multi-modal LLM, aligns molecular structures with natural language through an instruction-tuning approach, using a two-stage training strategy that combines limited domain-specific data with molecular and textual information.
  • results: InstructMol delivers substantial performance gains on drug-discovery-related molecular tasks, surpassing leading LLMs and significantly narrowing the gap to specialized models, thereby establishing a robust foundation for a versatile and reliable drug-discovery assistant.
    Abstract The rapid evolution of artificial intelligence in drug discovery encounters challenges with generalization and extensive training, yet Large Language Models (LLMs) offer promise in reshaping interactions with complex molecular data. Our novel contribution, InstructMol, a multi-modal LLM, effectively aligns molecular structures with natural language via an instruction-tuning approach, utilizing a two-stage training strategy that adeptly combines limited domain-specific data with molecular and textual information. InstructMol showcases substantial performance improvements in drug discovery-related molecular tasks, surpassing leading LLMs and significantly reducing the gap with specialized models, thereby establishing a robust foundation for a versatile and dependable drug discovery assistant.

Unified Batch Normalization: Identifying and Alleviating the Feature Condensation in Batch Normalization and a Unified Framework

  • paper_url: http://arxiv.org/abs/2311.15993
  • repo_url: None
  • paper_authors: Shaobo Wang, Xiangdong Zhang, Junchi Yan
  • for: Improve the training stability of deep neural networks.
  • methods: A simple feature-condensation threshold alleviates the feature-condensation problem that arises with batch normalization, and various normalization variants are unified to strengthen each component of BN (a hedged sketch follows the abstract below).
  • results: Significant performance gains across different vision backbones and notably faster convergence, especially in early training; about a 3% improvement in top-1 accuracy on ImageNet classification with large batch sizes.
    Abstract Batch Normalization (BN) has become an essential technique in contemporary neural network design, enhancing training stability. Specifically, BN employs centering and scaling operations to standardize features along the batch dimension and uses an affine transformation to recover features. Although standard BN has shown its capability to improve deep neural network training and convergence, it still exhibits inherent limitations in certain cases. Most existing techniques that enhance BN consider a single or a few aspects of BN. In this paper, we first identify problems with BN from a feature perspective and explore that feature condensation exists in the learning when employing BN, which negatively affects testing performance. To tackle this problem, we propose a two-stage unified framework called Unified Batch Normalization (UBN). In the first stage, we utilize a simple feature condensation threshold to alleviate the feature condensation, which hinders inappropriate statistic updates in normalization. In the second stage, we unify various normalization variants to boost each component of BN. Our experimental results reveal that UBN significantly enhances performance across different visual backbones and notably expedites network training convergence, particularly in early training stages. Notably, our method improved about 3% in top-1 accuracy on ImageNet classification with large batch sizes, showing the effectiveness of our approach in real-world scenarios.
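One possible reading of the feature-condensation threshold, sketched below as an assumption: when the per-sample features in a batch become too similar to one another, the running statistics are not updated (here by normalizing with batch statistics at zero momentum). The cosine-similarity criterion and the 0.95 threshold are illustrative guesses, not the paper's exact mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThresholdedBN2d(nn.BatchNorm2d):
    def __init__(self, num_features, condensation_threshold=0.95, **kwargs):
        super().__init__(num_features, **kwargs)
        self.condensation_threshold = condensation_threshold

    def forward(self, x):
        if self.training and x.size(0) > 1:
            # Mean pairwise cosine similarity between per-sample feature vectors.
            flat = F.normalize(x.flatten(1), dim=1)
            sim = flat @ flat.t()
            n = sim.size(0)
            mean_sim = (sim.sum() - n) / (n * (n - 1))
            if mean_sim > self.condensation_threshold:
                # Condensed features: normalise with batch statistics but leave the
                # running statistics untouched (no tracked buffers, momentum 0).
                return F.batch_norm(x, None, None, self.weight, self.bias,
                                    True, 0.0, self.eps)
        return super().forward(x)

x = torch.randn(8, 16, 4, 4)
print(ThresholdedBN2d(16)(x).shape)
```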

CoSeR: Bridging Image and Language for Cognitive Super-Resolution

  • paper_url: http://arxiv.org/abs/2311.16512
  • repo_url: https://github.com/CoSeR-main/CoSeR-main.github.io
  • paper_authors: Haoze Sun, Wenbo Li, Jianzhuang Liu, Haoyu Chen, Renjing Pei, Xueyi Zou, Youliang Yan, Yujiu Yang
  • for: Improve the semantic fidelity of super-resolution (SR) models by enabling them to comprehend the global semantics of low-resolution images.
  • methods: The Cognitive Super-Resolution (CoSeR) framework marries image appearance with language understanding to produce a cognitive embedding, which activates prior knowledge in large text-to-image diffusion models and guides the generation of high-quality reference images to optimize the SR process; an "All-in-Attention" condition-injection scheme consolidates all conditional information into a single module.
  • results: Experiments show that CoSeR restores semantically correct and photorealistic details, achieving state-of-the-art performance across multiple benchmarks.
    Abstract Existing super-resolution (SR) models primarily focus on restoring local texture details, often neglecting the global semantic information within the scene. This oversight can lead to the omission of crucial semantic details or the introduction of inaccurate textures during the recovery process. In our work, we introduce the Cognitive Super-Resolution (CoSeR) framework, empowering SR models with the capacity to comprehend low-resolution images. We achieve this by marrying image appearance and language understanding to generate a cognitive embedding, which not only activates prior information from large text-to-image diffusion models but also facilitates the generation of high-quality reference images to optimize the SR process. To further improve image fidelity, we propose a novel condition injection scheme called "All-in-Attention", consolidating all conditional information into a single module. Consequently, our method successfully restores semantically correct and photorealistic details, demonstrating state-of-the-art performance across multiple benchmarks.

Sparsify-then-Classify: From Internal Neurons of Large Language Models To Efficient Text Classifiers

  • paper_url: http://arxiv.org/abs/2311.15983
  • repo_url: https://github.com/difanj0713/sparsify-then-classify
  • paper_authors: Yilun Liu, Difan Jiao, Ashton Anderson
  • for: Improve the performance, efficiency, and interpretability of pretrained language models (LLMs) on text classification tasks.
  • methods: Multiple pooling strategies are applied to the activations and hidden states of all layers; task-specific features are sparsified layer by layer and then aggregated across layers to classify the text (Sparsify-then-Classify, STC), a plug-and-play module on top of existing LLMs (a toy sketch follows the abstract below).
  • results: Experiments across a range of models and datasets show that STC improves the classification performance of both pretrained and fine-tuned models while being more efficient for training and inference and more intrinsically interpretable.
    Abstract Among the many tasks that Large Language Models (LLMs) have revolutionized is text classification. However, existing approaches for applying pretrained LLMs to text classification predominantly rely on using single token outputs from only the last layer of hidden states. As a result, they suffer from limitations in efficiency, task-specificity, and interpretability. In our work, we contribute an approach that uses all internal representations by employing multiple pooling strategies on all activation and hidden states. Our novel lightweight strategy, Sparsify-then-Classify (STC) first sparsifies task-specific features layer-by-layer, then aggregates across layers for text classification. STC can be applied as a seamless plug-and-play module on top of existing LLMs. Our experiments on a comprehensive set of models and datasets demonstrate that STC not only consistently improves the classification performance of pretrained and fine-tuned models, but is also more efficient for both training and inference, and is more intrinsically interpretable.
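A toy sketch of the Sparsify-then-Classify recipe: pool hidden states from every layer of a frozen language model, concatenate them, and fit a sparse (L1-penalised) linear classifier on top. The model choice (distilbert-base-uncased), mean pooling over all tokens, and the single-stage L1 classifier are simplifications of the layer-wise sparsification described in the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

texts = ["great movie, loved it", "terrible plot and acting",
         "wonderful film", "awful waste of time"]
labels = [1, 0, 1, 0]

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()

with torch.no_grad():
    batch = tok(texts, padding=True, return_tensors="pt")
    out = model(**batch, output_hidden_states=True)
    # Mean-pool tokens in every layer, then concatenate layers into one feature vector.
    pooled = [h.mean(dim=1) for h in out.hidden_states]
    features = torch.cat(pooled, dim=1).numpy()

# The L1 penalty drives most per-layer features to zero ("sparsify"), then classifies.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(features, labels)
print("nonzero weights:", (clf.coef_ != 0).sum(), "/", clf.coef_.size)
```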

  • paper_url: http://arxiv.org/abs/2311.15979
  • repo_url: None
  • paper_authors: Weiying Zhao, Natalia Efremova
  • for: Accurately estimate soil organic carbon (SOC) to support sustainable land and agricultural management.
  • methods: Graph Neural Networks (GNNs) combined with positional encoders are applied to high-resolution satellite mapping data to capture the complex relationships between soil and climate features; four GNN operators are compared within the positional-encoder framework.
  • results: On the LUCAS database, the PESAGE and PETransformer models estimate SOC best, indicating that these models can capture the complex relationship between SOC and climate features.
    Abstract Soil organic carbon (SOC) plays a pivotal role in the global carbon cycle, impacting climate dynamics and necessitating accurate estimation for sustainable land and agricultural management. While traditional methods of SOC estimation face resolution and accuracy challenges, recent technological solutions harness remote sensing, machine learning, and high-resolution satellite mapping. Graph Neural Networks (GNNs), especially when integrated with positional encoders, can capture complex relationships between soil and climate. Using the LUCAS database, this study compared four GNN operators in the positional encoder framework. Results revealed that the PESAGE and PETransformer models outperformed others in SOC estimation, indicating their potential in capturing the complex relationship between SOC and climate features. Our findings confirm the feasibility of applications of GNN architectures in SOC prediction, establishing a framework for future explorations of this topic with more advanced GNN models.

Efficient Pre-training for Localized Instruction Generation of Videos

  • paper_url: http://arxiv.org/abs/2311.15964
  • repo_url: None
  • paper_authors: Anil Batra, Davide Moltisanti, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller
  • for: Improve techniques for localizing steps in instructional (procedural) videos and generating textual instructions for them.
  • methods: A data-curation technique automatically filters and improves video transcripts: irrelevant transcript content is sieved out, and the remaining transcripts are automatically swapped for human-written instructions from a text-only recipe dataset (a toy curation sketch follows the abstract below).
  • results: Models pre-trained on the curated dataset achieve state-of-the-art step localization and instruction generation in both zero-shot and fine-tuning settings, while using a fraction of the computational resources.
    Abstract Procedural videos show step-by-step demonstrations of tasks like recipe preparation. Understanding such videos is challenging, involving the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance, but demands significant computational resources. Furthermore, transcripts contain irrelevant content and exhibit style variation compared to instructions written by human annotators. To mitigate both issues, we propose a technique, Sieve-&-Swap, to automatically curate a smaller dataset: (i) Sieve filters irrelevant transcripts and (ii) Swap enhances the quality of the text instruction by automatically replacing the transcripts with human-written instructions from a text-only recipe dataset. The curated dataset, three orders of magnitude smaller than current web-scale datasets, enables efficient training of large-scale models with competitive performance. We complement our Sieve-\&-Swap approach with a Procedure Transformer (ProcX) for end-to-end step localization and instruction generation for procedural videos. When this model is pre-trained on our curated dataset, it achieves state-of-the-art performance in zero-shot and finetuning settings on YouCook2 and Tasty, while using a fraction of the computational resources.
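A toy sketch of the curation step using sentence embeddings: transcript sentences with no close match in a human-written instruction corpus are sieved out, and the rest are swapped for their nearest instruction. The embedding model, the 0.4 threshold, and the tiny corpora are illustrative stand-ins for the paper's actual sieve and swap criteria.

```python
from sentence_transformers import SentenceTransformer, util

transcripts = [
    "hey guys welcome back to my channel",
    "so now I'm just going to chop the onions really fine",
    "don't forget to like and subscribe",
    "add the onions to the hot pan and stir for two minutes",
]
recipe_instructions = [
    "Finely chop the onions.",
    "Saute the onions in a hot pan for 2 minutes.",
    "Season with salt and pepper.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
t_emb = model.encode(transcripts, convert_to_tensor=True)
r_emb = model.encode(recipe_instructions, convert_to_tensor=True)

sim = util.cos_sim(t_emb, r_emb)          # [num_transcripts, num_instructions]
best_sim, best_idx = sim.max(dim=1)

SIEVE_THRESHOLD = 0.4
for text, s, idx in zip(transcripts, best_sim.tolist(), best_idx.tolist()):
    if s < SIEVE_THRESHOLD:
        print(f"SIEVED : {text!r}")                                   # dropped as irrelevant
    else:
        print(f"SWAPPED: {text!r} -> {recipe_instructions[idx]!r}")   # replaced by instruction
```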
    摘要 “执行视频显示了逐步示例的任务,如食谱准备。理解这些视频是具有挑战性,需要精确地确定步骤的地方和生成文本指示。手动标注步骤和写入指示是成本高昂的,这限制了当前数据集的大小,阻碍有效学习。利用大量但具有噪音的视频脚本数据集进行预训练可以提高性能,但需要 significativ computational resources。此外,脚本中含有不相关的内容,并且样式与人工标注员写的指示不同。为了解决这两个问题,我们提出了一种方法:(i)筛选不相关的脚本,(ii)将脚本替换为人工标注员写的文本指示。这些替换的指示来自文本只recipe数据集。我们称之为Sieve-&-Swap。我们的Sieve-&-Swap方法可以自动筛选出许多不相关的脚本,并将其替换为高质量的文本指示,从而生成一个许多小于当前网络规模的数据集。我们的curated dataset可以高效地训练大规模模型,并达到竞争性性能。我们补充了我们的Sieve-&-Swap方法,使用Procedure Transformer(ProcX)来实现步骤地理位和指示生成。当我们的模型在我们的curated dataset上进行预训练时,它在零shot和 fine-tuning 设置下在 YouCook2 和 Tasty 上达到了状态艺术性能,同时使用的计算资源只是当前模型的一部分。”

Addressing Long-Horizon Tasks by Integrating Program Synthesis and State Machines

  • paper_url: http://arxiv.org/abs/2311.15960
  • repo_url: None
  • paper_authors: Yu-An Lin, Chen-Tao Lee, Guan-Ting Liu, Pu-Jen Cheng, Shao-Hua Sun
  • for: Address deep reinforcement learning's poor generalizability and interpretability, in particular its difficulty with long-horizon tasks.
  • methods: Program Machine Policies (POMPs) combine programmatic RL with state-machine policies to represent complex behaviors and tackle long-horizon tasks: a retrieval method finds a set of effective, diverse, and compatible programs, these programs become the modes of a state machine, and a transition function is learned to switch between mode programs, capturing long-horizon repetitive behavior (a minimal sketch follows the abstract below).
  • results: The proposed framework outperforms programmatic RL and deep RL baselines on a range of tasks and generalizes inductively to longer horizons without any fine-tuning; ablation studies confirm the effectiveness of the program-retrieval search.
    Abstract Deep reinforcement learning excels in various domains but lacks generalizability and interoperability. Programmatic RL methods (Trivedi et al., 2021; Liu et al., 2023) reformulate solving RL tasks as synthesizing interpretable programs that can be executed in the environments. Despite encouraging results, these methods are limited to short-horizon tasks. On the other hand, representing RL policies using state machines (Inala et al., 2020) can inductively generalize to long-horizon tasks; however, it struggles to scale up to acquire diverse and complex behaviors. This work proposes Program Machine Policies (POMPs), which bridge the advantages of programmatic RL and state machine policies, allowing for the representation of complex behaviors and the address of long-term tasks. Specifically, we introduce a method that can retrieve a set of effective, diverse, compatible programs. Then, we use these programs as modes of a state machine and learn a transition function to transition among mode programs, allowing for capturing long-horizon repetitive behaviors. Our proposed framework outperforms programmatic RL and deep RL baselines on various tasks and demonstrates the ability to generalize to even longer horizons without any fine-tuning inductively. Ablation studies justify the effectiveness of our proposed search algorithm for retrieving a set of programs as modes.
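A minimal, purely illustrative sketch of a program machine policy: hand-written programs serve as state-machine modes and a transition function picks the next mode from the current observation. In POMP both the mode programs and the transition function are learned; the corridor environment, programs, and rule-based transition below are made up.

```python
from typing import Callable, Dict

Program = Callable[[dict], str]   # a program maps an observation to an action

def walk_forward(obs: dict) -> str:
    return "move"

def open_door(obs: dict) -> str:
    return "toggle"

modes: Dict[str, Program] = {"walk": walk_forward, "open": open_door}

def transition(mode: str, obs: dict) -> str:
    # Learned in POMP; here a simple rule: switch to "open" when in front of a door.
    return "open" if obs["front_is_door"] else "walk"

def run_episode(steps: int = 6) -> None:
    mode, position = "walk", 0
    for t in range(steps):
        obs = {"front_is_door": position == 3}
        mode = transition(mode, obs)          # pick the mode program for this step
        action = modes[mode](obs)             # execute the current mode program
        print(f"t={t} pos={position} mode={mode} action={action}")
        position += action == "move"

run_episode()
```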

CheapNET: Improving Light-weight speech enhancement network by projected loss function

  • paper_url: http://arxiv.org/abs/2311.15959
  • repo_url: None
  • paper_authors: Kaijun Tan, Benzhe Dai, Jiakui Li, Wenyu Mao
  • for: Improve speech quality through noise suppression and echo cancellation in lightweight models.
  • methods: A projection-based loss function replaces MSE-style amplitude-mask training, and echo cancellation is performed by predicting directly on LAEC pre-processed outputs (a hedged sketch of a projection-style loss follows the abstract below).
  • results: Near state-of-the-art noise suppression with only 3.1M parameters and 0.4 GFLOPs/s, and echo cancellation that outperforms replicated industry-leading models.
    Abstract Noise suppression and echo cancellation are critical in speech enhancement and essential for smart devices and real-time communication. Deployed in voice processing front-ends and edge devices, these algorithms must ensure efficient real-time inference with low computational demands. Traditional edge-based noise suppression often uses MSE-based amplitude spectrum mask training, but this approach has limitations. We introduce a novel projection loss function, diverging from MSE, to enhance noise suppression. This method uses projection techniques to isolate key audio components from noise, significantly improving model performance. For echo cancellation, the function enables direct predictions on LAEC pre-processed outputs, substantially enhancing performance. Our noise suppression model achieves near state-of-the-art results with only 3.1M parameters and 0.4GFlops/s computational load. Moreover, our echo cancellation model outperforms replicated industry-leading models, introducing a new perspective in speech enhancement.
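The abstract does not spell out the projection loss, so the sketch below shows one common projection-based formulation (SI-SDR-style): the estimate is decomposed into its projection onto the clean target plus an orthogonal residual, and the residual-to-projection energy ratio is penalised. Treat this as an assumption about the general idea rather than the paper's exact loss.

```python
import torch

def projection_loss(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # estimate, target: [batch, samples]
    target = target - target.mean(dim=-1, keepdim=True)
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    # Component of the estimate that lies along the clean target ("signal" part).
    scale = (estimate * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    projection = scale * target
    residual = estimate - projection
    # Ratio of residual energy to projected energy; smaller is better.
    return (residual.pow(2).sum(-1) / (projection.pow(2).sum(-1) + eps)).mean()

clean = torch.randn(2, 16000)
noisy_estimate = clean + 0.1 * torch.randn(2, 16000)
print(projection_loss(noisy_estimate, clean))
```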

Replay across Experiments: A Natural Extension of Off-Policy RL

  • paper_url: http://arxiv.org/abs/2311.15951
  • repo_url: None
  • paper_authors: Dhruva Tirumala, Thomas Lampe, Jose Enrique Chen, Tuomas Haarnoja, Sandy Huang, Guy Lever, Ben Moran, Tim Hertweck, Leonard Hasenclever, Martin Riedmiller, Nicolas Heess, Markus Wulfmeier
  • for: Improve controller performance and shorten research iteration times.
  • methods: Experience from previous experiments is replayed to improve exploration and bootstrap learning, while keeping the changes to the standard off-policy RL workflow to a minimum (a minimal mixed-replay sketch follows the abstract below).
  • results: Benefits are shown across a number of RL algorithms and challenging control domains spanning locomotion and manipulation, including hard-exploration tasks from egocentric vision.
    Abstract Replaying data is a principal mechanism underlying the stability and data efficiency of off-policy reinforcement learning (RL). We present an effective yet simple framework to extend the use of replays across multiple experiments, minimally adapting the RL workflow for sizeable improvements in controller performance and research iteration times. At its core, Replay Across Experiments (RaE) involves reusing experience from previous experiments to improve exploration and bootstrap learning while reducing required changes to a minimum in comparison to prior work. We empirically show benefits across a number of RL algorithms and challenging control domains spanning both locomotion and manipulation, including hard exploration tasks from egocentric vision. Through comprehensive ablations, we demonstrate robustness to the quality and amount of data available and various hyperparameter choices. Finally, we discuss how our approach can be applied more broadly across research life cycles and can increase resilience by reloading data across random seeds or hyperparameter variations.
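A minimal sketch of the replay-across-experiments idea: training batches in a new experiment mix freshly collected transitions with transitions reloaded from earlier experiments. The transition format, the loader, and the 50/50 mixing ratio are illustrative placeholders, not the paper's settings.

```python
import random
from collections import deque

def load_prior_experience(n: int = 1000) -> list:
    # Placeholder for reloading transitions logged by earlier experiments.
    return [{"obs": i, "action": 0, "reward": 0.0, "next_obs": i + 1} for i in range(n)]

prior_buffer = load_prior_experience()
online_buffer: deque = deque(maxlen=10_000)

def sample_batch(batch_size: int = 8, prior_fraction: float = 0.5) -> list:
    n_prior = int(batch_size * prior_fraction) if online_buffer else batch_size
    n_online = batch_size - n_prior
    batch = random.sample(prior_buffer, n_prior)
    if n_online:
        batch += random.sample(list(online_buffer), min(n_online, len(online_buffer)))
    return batch

# New experiment: collect a little fresh data, then train on mixed batches.
for step in range(20):
    online_buffer.append({"obs": step, "action": 1, "reward": 1.0, "next_obs": step + 1})
batch = sample_batch()
print(len(batch), "transitions,", sum(t["action"] == 0 for t in batch), "from prior experiments")
```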

Auto-CsiNet: Scenario-customized Automatic Neural Network Architecture Generation for Massive MIMO CSI Feedback

  • paper_url: http://arxiv.org/abs/2311.15950
  • repo_url: None
  • paper_authors: Xiangyi Li, Jiajia Guo, Chao-Kai Wen, Shi Jin
  • for: Automatically generate scenario-customized channel state information (CSI) feedback neural network architectures for massive MIMO, so that deep learning reaches its best performance in each deployment environment.
  • methods: Neural architecture search (NAS) based on automated machine learning and gradient descent generates scenario-customized CSI feedback architectures efficiently and cost-effectively; implicit scene knowledge is incorporated in a data-driven manner, and early-stopping and elastic-selection mechanisms curb excessive search.
  • results: The automatically generated architecture, Auto-CsiNet, outperforms manually designed models in both reconstruction performance (about a 14% improvement) and complexity (about a 50% reduction).
    Abstract Deep learning has revolutionized the design of the channel state information (CSI) feedback module in wireless communications. However, designing the optimal neural network (NN) architecture for CSI feedback can be a laborious and time-consuming process. Manual design can be prohibitively expensive for customizing NNs to different scenarios. This paper proposes using neural architecture search (NAS) to automate the generation of scenario-customized CSI feedback NN architectures, thereby maximizing the potential of deep learning in exclusive environments. By employing automated machine learning and gradient-descent-based NAS, an efficient and cost-effective architecture design process is achieved. The proposed approach leverages implicit scene knowledge, integrating it into the scenario customization process in a data-driven manner, and fully exploits the potential of deep learning for each specific scenario. To address the issue of excessive search, early stopping and elastic selection mechanisms are employed, enhancing the efficiency of the proposed scheme. The experimental results demonstrate that the automatically generated architecture, known as Auto-CsiNet, outperforms manually-designed models in both reconstruction performance (achieving approximately a 14% improvement) and complexity (reducing it by approximately 50%). Furthermore, the paper analyzes the impact of the scenario on the NN architecture and its capacity.

A new fuzzy multi-attribute group decision-making method based on TOPSIS and optimization models

  • paper_url: http://arxiv.org/abs/2311.15933
  • repo_url: None
  • paper_authors: Qixiao Hu, Shiquan Zhang, Chaolang Hu, Yuetong Liu
  • for: Solve multi-attribute group decision-making problems in an interval-valued intuitionistic fuzzy environment using TOPSIS and optimization models.
  • methods: An optimization model determines expert weights by minimizing the sum of differences between individual evaluations and the overall consistent evaluation of all experts; an improved closeness index for each alternative is obtained from the TOPSIS method; and attribute weights are determined by an optimization model that maximizes the closeness of each alternative, after which the closeness index ranks the alternatives (a crisp TOPSIS closeness computation is sketched after the abstract below).
  • results: The complete algorithm exploits the advantages of both subjective and objective weighting methods, and a real case study verifies its feasibility and effectiveness.
    Abstract In this paper, a new method based on TOPSIS and optimization models is proposed for multi-attribute group decision-making in the environment of interval-valued intuitionistic fuzzy sets.Firstly, by minimizing the sum of differences between individual evaluations and the overallconsistent evaluations of all experts, a new optimization model is established for determining expert weights. Secondly, based on TOPSIS method, the improved closeness index for evaluating each alternative is obtained. Finally, the attribute weight is determined by establishing an optimization model with the goal of maximizing the closeness of each alternative, and it is brought into the closeness index so that the alternatives can be ranked. Combining all these together, the complete fuzzy multi-attribute group decision-making algorithm is formulated, which can give full play to the advantages of subjective and objective weighting methods. In the end, the feasibility and effectiveness of the provided method are verified by a real case study.
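For reference, the sketch below computes the classic (crisp) TOPSIS closeness index that the improved index in the paper builds on; the interval-valued intuitionistic fuzzy arithmetic and the two weight-optimization models are omitted. The decision matrix and attribute weights are made-up numbers.

```python
import numpy as np

# rows = alternatives, columns = attributes (all benefit-type here)
X = np.array([[7.0, 9.0, 8.0],
              [8.0, 7.0, 6.0],
              [9.0, 6.0, 9.0]])
w = np.array([0.5, 0.3, 0.2])            # attribute weights (sum to 1)

R = X / np.linalg.norm(X, axis=0)         # vector-normalised decision matrix
V = R * w                                 # weighted normalised matrix
ideal, anti_ideal = V.max(axis=0), V.min(axis=0)

d_plus = np.linalg.norm(V - ideal, axis=1)        # distance to the positive ideal
d_minus = np.linalg.norm(V - anti_ideal, axis=1)  # distance to the negative ideal
closeness = d_minus / (d_plus + d_minus)

ranking = np.argsort(-closeness)
print("closeness:", closeness.round(3), "ranking (best first):", ranking)
```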

WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models

  • paper_url: http://arxiv.org/abs/2311.15930
  • repo_url: https://github.com/facebookresearch/worldsense
  • paper_authors: Youssef Benchekroun, Megi Dervishi, Mark Ibrahim, Jean-Baptiste Gaya, Xavier Martinet, Grégoire Mialon, Thomas Scialom, Emmanuel Dupoux, Dieuwke Hupkes, Pascal Vincent
  • for: Assess whether LLMs can consistently sustain tacit world models by testing how they draw simple inferences from descriptions of simple arrangements of entities.
  • methods: WorldSense is a synthetic benchmark with three problem types, each with its own trivial control, that explicitly avoids bias by decorrelating the abstract structure of problems from vocabulary and expressions and by decorrelating all problem subparts from the correct response.
  • results: Three state-of-the-art chat LLMs (GPT-3.5, GPT-4, and Llama2-chat) make errors even with as few as three objects and show heavy response biases irrespective of the question; the errors persist under chain-of-thought prompting and in-context learning, and while fine-tuning on similar problems yields substantial in- and out-of-distribution improvements, the fine-tuned models do not generalize beyond the constrained problem space.
    Abstract We propose WorldSense, a benchmark designed to assess the extent to which LLMs are consistently able to sustain tacit world models, by testing how they draw simple inferences from descriptions of simple arrangements of entities. Worldsense is a synthetic benchmark with three problem types, each with their own trivial control, which explicitly avoids bias by decorrelating the abstract structure of problems from the vocabulary and expressions, and by decorrelating all problem subparts with the correct response. We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat) and show that these models make errors even with as few as three objects. Furthermore, they have quite heavy response biases, preferring certain responses irrespective of the question. Errors persist even with chain-of-thought prompting and in-context learning. Lastly, we show that while finetuning on similar problems does result in substantial improvements -- within- and out-of-distribution -- the finetuned models do not generalise beyond a constraint problem space.

Reinforcement Learning for Wildfire Mitigation in Simulated Disaster Environments

  • paper_url: http://arxiv.org/abs/2311.15925
  • repo_url: https://github.com/mitrefireline/simfire
  • paper_authors: Alexander Tapley, Marissa Dotter, Michael Doyle, Aidan Fennelly, Dhanuj Gandikota, Savanna Smith, Michael Threet, Tim Welsh
  • for: Provide a realistic wildfire-spread simulator and an adaptive machine learning framework to help researchers and practitioners better anticipate and respond to the growing threat of wildfires.
  • methods: SimFire, a versatile wildland fire projection simulator, generates realistic wildfire scenarios, and SimHarness, a modular agent-based machine learning wrapper, automatically generates land-management strategies within SimFire to reduce overall damage; together they allow users to emulate and assess firefighter interventions and formulate strategic plans that prioritize value preservation and resource-allocation optimization.
  • results: The publicly available system supports evaluating the effectiveness of mitigation responses and optimizing resource allocation; the repositories can be downloaded at https://github.com/mitrefireline.
    Abstract Climate change has resulted in a year over year increase in adverse weather and weather conditions which contribute to increasingly severe fire seasons. Without effective mitigation, these fires pose a threat to life, property, ecology, cultural heritage, and critical infrastructure. To better prepare for and react to the increasing threat of wildfires, more accurate fire modelers and mitigation responses are necessary. In this paper, we introduce SimFire, a versatile wildland fire projection simulator designed to generate realistic wildfire scenarios, and SimHarness, a modular agent-based machine learning wrapper capable of automatically generating land management strategies within SimFire to reduce the overall damage to the area. Together, this publicly available system allows researchers and practitioners the ability to emulate and assess the effectiveness of firefighter interventions and formulate strategic plans that prioritize value preservation and resource allocation optimization. The repositories are available for download at https://github.com/mitrefireline.

Diagnosis driven Anomaly Detection for CPS

  • paper_url: http://arxiv.org/abs/2311.15924
  • repo_url: None
  • paper_authors: Henrik S. Steude, Lukas Moddemann, Alexander Diedrich, Jonas Ehrhardt, Oliver Niggemann
  • for: Propose a deep learning-based anomaly detection method whose outputs are suitable as inputs for diagnosis.
  • methods: Deep learning-based anomaly detection is combined with Consistency-Based Diagnosis (CBD): the detector supplies the temporally and spatially isolated symptoms that diagnosis algorithms require, providing a holistic diagnosis solution for cyber-physical systems (a toy symptom-extraction sketch follows the abstract below).
  • results: On a simulated and a real-world CPS dataset, the model performs strongly relative to other state-of-the-art models.
    Abstract In Cyber-Physical Systems (CPS) research, anomaly detection (detecting abnormal behavior) and diagnosis (identifying the underlying root cause) are often treated as distinct, isolated tasks. However, diagnosis algorithms require symptoms, i.e. temporally and spatially isolated anomalies, as input. Thus, anomaly detection and diagnosis must be developed together to provide a holistic solution for diagnosis in CPS. We therefore propose a method for utilizing deep learning-based anomaly detection to generate inputs for Consistency-Based Diagnosis (CBD). We evaluate our approach on a simulated and a real-world CPS dataset, where our model demonstrates strong performance relative to other state-of-the-art models.
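A toy sketch of how anomaly detection can feed consistency-based diagnosis: a reconstruction model (PCA here, a deep model in the paper) is fit on normal sensor data, sensors whose reconstruction residual exceeds a threshold become symptoms, and the symptom set is passed to a diagnoser stub. The sensor names, the PCA reconstructor, the threshold, and the diagnoser are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
sensors = ["pressure", "flow", "temperature", "valve_pos"]

# Normal operation: four sensors driven by two latent process variables.
latent = rng.normal(size=(500, 2))
mixing = np.array([[1.0, 0.0, 0.5, 0.0],
                   [0.0, 1.0, 0.0, 0.5]])
normal = latent @ mixing + 0.05 * rng.normal(size=(500, 4))
pca = PCA(n_components=2).fit(normal)

def symptoms(sample: np.ndarray, threshold: float = 0.5) -> set:
    """Sensors whose reconstruction residual is abnormally large become symptoms."""
    recon = pca.inverse_transform(pca.transform(sample[None]))[0]
    residual = np.abs(sample - recon)
    return {s for s, r in zip(sensors, residual) if r > threshold}

def consistency_based_diagnosis(observed: set) -> str:
    # Placeholder for a CBD engine that searches for fault hypotheses
    # consistent with the observed symptoms.
    return "healthy" if not observed else f"isolate fault from symptoms {sorted(observed)}"

faulty_sample = np.array([4.0, -4.0, 0.0, 0.0])   # readings violating the learned correlations
print(consistency_based_diagnosis(symptoms(faulty_sample)))
```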

A Fully Data-Driven Approach for Realistic Traffic Signal Control Using Offline Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.15920
  • repo_url: None
  • paper_authors: Jianxiong Li, Shichao Lin, Tianyu Shi, Chujie Tian, Yu Mei, Jian Song, Xianyuan Zhan, Ruimin Li
  • for: Propose a fully data-driven, simulator-free framework for realistic traffic signal control (D2TSC) that enables efficient control of real-world traffic systems.
  • methods: Well-established traffic flow theory is combined with machine learning to build a reward-inference model that recovers reward signals from coarse-grained traffic data; with the inferred rewards, a sample-efficient offline RL method learns the signal-control policy directly from historical offline datasets of real-world intersections.
  • results: Using real-world data from an actual intersection and a highly customized simulation environment that strictly follows real data characteristics, the method outperforms conventional and offline RL baselines and shows much better real-world applicability.
    Abstract The optimization of traffic signal control (TSC) is critical for an efficient transportation system. In recent years, reinforcement learning (RL) techniques have emerged as a popular approach for TSC and show promising results for highly adaptive control. However, existing RL-based methods suffer from notably poor real-world applicability and hardly have any successful deployments. The reasons for such failures are mostly due to the reliance on over-idealized traffic simulators for policy optimization, as well as using unrealistic fine-grained state observations and reward signals that are not directly obtainable from real-world sensors. In this paper, we propose a fully Data-Driven and simulator-free framework for realistic Traffic Signal Control (D2TSC). Specifically, we combine well-established traffic flow theory with machine learning to construct a reward inference model to infer the reward signals from coarse-grained traffic data. With the inferred rewards, we further propose a sample-efficient offline RL method to enable direct signal control policy learning from historical offline datasets of real-world intersections. To evaluate our approach, we collect historical traffic data from a real-world intersection, and develop a highly customized simulation environment that strictly follows real data characteristics. We demonstrate through extensive experiments that our approach achieves superior performance over conventional and offline RL baselines, and also enjoys much better real-world applicability.

Continual Instruction Tuning for Large Multimodal Models

  • paper_url: http://arxiv.org/abs/2311.16206
  • repo_url: None
  • paper_authors: Jinghan He, Haiyun Guo, Ming Tang, Jinqiao Wang
  • for: Investigate whether large multimodal models (LMMs) suffer catastrophic forgetting during continual instruction tuning, and whether the three existing classes of continual learning methods remain applicable in this setting.
  • methods: A benchmark for continual instruction tuning of LMMs is established; multi-task joint instruction tuning is studied as a way to strengthen continual learning, and classic strategies such as data replay and model expansion are integrated and adapted, including task-similarity-informed regularization and model-expansion methods.
  • results: Catastrophic forgetting is still observed, but multi-task joint instruction tuning improves continual learning ability and mitigates it; data replay and model expansion help across diverse scenarios, whereas regularization-based methods only work well on models that were jointly instruction-tuned, and the proposed approach consistently boosts performance.
    Abstract Instruction tuning is now a widely adopted approach to aligning large multimodal models (LMMs) to follow human intent. It unifies the data format of vision-language tasks, enabling multi-task joint training. However, vision-language tasks are constantly being created in practice. Instead of always re-training LMMs when new tasks arrive, continual learning offers flexibility for models to continually and efficiently exploit the evolving data. This work aims to explore the following two questions: 1) Do LMMs still suffer from catastrophic forgetting in continual instruction tuning? 2) Are the existing three classes of continual learning methods still applicable to the continual instruction tuning of LMMs? An extensive study is conducted to address the above questions. First, we establish the first benchmark in this setting and reveal that catastrophic forgetting is still observed when continually instruction-tuning LMMs. However, the multi-task joint instruction tuning can facilitate the model's continual learning ability and mitigate forgetting. Second, we integrate and adapt classic continual learning methods to our context, demonstrating the efficacy of data replay and model expansion strategies across diverse scenarios. In contrast, regularization-based methods only perform well on models that have been jointly instruction-tuned on multiple tasks. Third, we delve into the correlation and forgetting dynamics between vision-language task pairs and propose task-similarity-informed regularization and model expansion methods for continual instruction tuning of LMMs. Experimental results show that our approach consistently boosts the model's performance.

Towards Adaptive RF Fingerprint-based Authentication of IIoT devices

  • paper_url: http://arxiv.org/abs/2311.15888
  • repo_url: None
  • paper_authors: Emmanuel Lomba, Ricardo Severino, Ana Fernández Vilas
  • for: Address safety and cyber-security for IoT technologies entering sensitive domains such as Medical and Industrial IoT, where effective device authentication is paramount.
  • methods: AI-driven adaptive selection and tuning of Radio Frequency Fingerprinting techniques at the PHY layer is used for highly accurate device authentication.
  • results: A first step toward powerful and flexible IIoT device authentication that remains accurate over challenging RF environments.
    Abstract As IoT technologies mature, they are increasingly finding their way into more sensitive domains, such as Medical and Industrial IoT, in which safety and cyber-security are of great importance. While the number of deployed IoT devices continues to increase exponentially, they still present severe cyber-security vulnerabilities. Effective authentication is paramount to support trustworthy IIoT communications, however, current solutions focus on upper-layer identity verification or key-based cryptography which are often inadequate to the heterogeneous IIoT environment. In this work, we present a first step towards achieving powerful and flexible IIoT device authentication, by leveraging AI adaptive Radio Frequency Fingerprinting technique selection and tuning, at the PHY layer for highly accurate device authentication over challenging RF environments.

RO-LLaMA: Generalist LLM for Radiation Oncology via Noise Augmentation and Consistency Regularization

  • paper_url: http://arxiv.org/abs/2311.15876
  • repo_url: None
  • paper_authors: Kwanyoung Kim, Yujin Oh, Sangjoon Park, Hwa Kyung Byun, Jin Sung Kim, Yong Bae Kim, Jong Chul Ye
  • for: Develop a generalist large language model (LLM) tailored to the workflow of radiation oncologists.
  • methods: A novel Consistency Embedding Fine-Tuning (CEFTune) technique improves the model's robustness to errors introduced at intermediate steps while preserving its behavior on clean inputs, and the same concept is carried into an LLM-driven segmentation framework as Consistency Embedding Segmentation (CESEG) (a hedged consistency-loss sketch follows the abstract below).
  • results: Experiments on multi-center cohort sets show promising performance across diverse tasks, including clinical report summarization, radiation therapy plan suggestion, and plan-guided target volume segmentation, with good generalization.
    Abstract Recent advancements in Artificial Intelligence (AI) have profoundly influenced medical fields, by providing tools to reduce clinical workloads. However, most AI models are constrained to execute uni-modal tasks, in stark contrast to the comprehensive approaches utilized by medical professionals. To address this, here we present RO-LLaMA, a versatile generalist large language model (LLM) tailored for the field of radiation oncology. This model seamlessly covers a wide range of the workflow of radiation oncologists, adept at various tasks such as clinical report summarization, radiation therapy plan suggestion, and plan-guided therapy target volume segmentation. In particular, to maximize the end-to-end performance, we further present a novel Consistency Embedding Fine-Tuning (CEFTune) technique, which boosts LLM's robustness to additional errors at the intermediates while preserving the capability of handling clean inputs, and creatively transform this concept into LLM-driven segmentation framework as Consistency Embedding Segmentation (CESEG). Experimental results on multi-centre cohort sets demonstrate our proposed RO-LLaMA's promising performance for diverse tasks with generalization capabilities.
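A hedged sketch of the consistency-plus-noise-augmentation idea behind CEFTune: the model is trained on a noise-augmented copy of the input while a consistency term keeps its embedding close to that of the clean input. The toy encoder, Gaussian noise model, and loss weighting below are illustrative assumptions; the actual method operates on LLM embeddings within the clinical workflow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
head = nn.Linear(16, 2)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)

def ceft_step(x_clean: torch.Tensor, y: torch.Tensor, noise_std: float = 0.2, lam: float = 0.5):
    x_noisy = x_clean + noise_std * torch.randn_like(x_clean)   # noise augmentation
    z_clean = encoder(x_clean).detach()     # clean embedding acts as the consistency target
    z_noisy = encoder(x_noisy)
    task_loss = F.cross_entropy(head(z_noisy), y)
    consistency = F.mse_loss(z_noisy, z_clean)
    loss = task_loss + lam * consistency
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

x = torch.randn(8, 32)
y = torch.randint(0, 2, (8,))
print(ceft_step(x, y))
```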

Utilizing Explainability Techniques for Reinforcement Learning Model Assurance

  • paper_url: http://arxiv.org/abs/2311.15838
  • repo_url: https://github.com/mitre/arlin
  • paper_authors: Alexander Tapley, Kyle Gatesman, Luis Robaina, Brett Bissey, Joseph Weissman
  • for: Increase the transparency of deep reinforcement learning (DRL) models to improve user trust and adoption in real-world use cases.
  • methods: Explainable reinforcement learning (XRL) techniques are used to identify potential vulnerabilities and critical points within a trained DRL model prior to deployment, limiting the potential for mission failure or mistakes by the system.
  • results: The ARLIN (Assured RL Model Interrogation) Toolkit, an open-source Python library, produces detailed, human-interpretable explainability outputs; its effectiveness is illustrated with explainability visualizations and a vulnerability analysis of a publicly available DRL model. The code is available at https://github.com/mitre/arlin.
    Abstract Explainable Reinforcement Learning (XRL) can provide transparency into the decision-making process of a Deep Reinforcement Learning (DRL) model and increase user trust and adoption in real-world use cases. By utilizing XRL techniques, researchers can identify potential vulnerabilities within a trained DRL model prior to deployment, therefore limiting the potential for mission failure or mistakes by the system. This paper introduces the ARLIN (Assured RL Model Interrogation) Toolkit, an open-source Python library that identifies potential vulnerabilities and critical points within trained DRL models through detailed, human-interpretable explainability outputs. To illustrate ARLIN's effectiveness, we provide explainability visualizations and vulnerability analysis for a publicly available DRL model. The open-source code repository is available for download at https://github.com/mitre/arlin.

Scale-Dropout: Estimating Uncertainty in Deep Neural Networks Using Stochastic Scale

  • paper_url: http://arxiv.org/abs/2311.15816
  • repo_url: None
  • paper_authors: Soyed Tuhin Ahmed, Kamal Danouchi, Michael Hefenbrock, Guillaume Prenat, Lorena Anghel, Mehdi B. Tahoori
  • for: Improving the reliability of and confidence in neural network (NN) predictions, especially in safety-critical applications.
  • methods: Proposes Scale Dropout, a regularization technique for binary neural networks (BNNs), and MC-Scale Dropout-based Bayesian NNs for efficient uncertainty estimation, implemented on a spintronic-memory Computation-In-Memory (CIM) architecture.
  • results: Only one stochastic unit is needed regardless of model size, making the Bayesian NN highly scalable; the proposed CIM architecture achieves more than 100x energy savings over the state of the art, with up to 1% better predictive performance and superior uncertainty estimates compared to related works.
    Abstract Uncertainty estimation in Neural Networks (NNs) is vital in improving reliability and confidence in predictions, particularly in safety-critical applications. Bayesian Neural Networks (BayNNs) with Dropout as an approximation offer a systematic approach to quantifying uncertainty, but they inherently suffer from high hardware overhead in terms of power, memory, and computation. Thus, the applicability of BayNNs to edge devices with limited resources or to high-performance applications is challenging. Some of the inherent costs of BayNNs can be reduced by accelerating them in hardware on a Computation-In-Memory (CIM) architecture with spintronic memories and binarizing their parameters. However, numerous stochastic units are required to implement conventional dropout-based BayNN. In this paper, we propose the Scale Dropout, a novel regularization technique for Binary Neural Networks (BNNs), and Monte Carlo-Scale Dropout (MC-Scale Dropout)-based BayNNs for efficient uncertainty estimation. Our approach requires only one stochastic unit for the entire model, irrespective of the model size, leading to a highly scalable Bayesian NN. Furthermore, we introduce a novel Spintronic memory-based CIM architecture for the proposed BayNN that achieves more than $100\times$ energy savings compared to the state-of-the-art. We validated our method to show up to a $1\%$ improvement in predictive performance and superior uncertainty estimates compared to related works.
    摘要 在神经网络(NN)中进行不确定性估计对于提高预测的可靠性和可信度至关重要,尤其是在安全关键应用中。以Dropout作为近似的贝叶斯神经网络(BayNN)提供了一种系统化的不确定性量化方法,但其在功耗、内存和计算方面的硬件开销很高,因此难以应用于资源受限的边缘设备或高性能场景。部分开销可以通过在采用自旋电子存储器的存内计算(CIM)架构上加速并对参数二值化来降低,然而实现传统的基于Dropout的BayNN仍需要大量随机单元。本文提出了Scale Dropout——一种面向二值神经网络(BNN)的新型正则化技术,以及基于MC-Scale Dropout的BayNN,用于高效的不确定性估计。该方法无论模型大小如何都只需一个随机单元,因而具有很强的可扩展性。此外,我们还提出了一种基于自旋电子存储器的CIM架构,相比最先进方案可节省100倍以上的能耗。实验验证表明,与相关工作相比,该方法可带来最高1%的预测性能提升和更优的不确定性估计。
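
The following is a minimal, hedged sketch of the core idea named in the abstract: a single stochastic scale unit shared by the whole model, sampled anew on each forward pass, with Monte Carlo sampling at inference for uncertainty. It is an illustration of the concept, not the paper's exact formulation or its binary/CIM implementation; the probability and scale values are arbitrary.

```python
# Monte Carlo "scale dropout" sketch: one Bernoulli-driven scale factor per
# forward pass, shared by the whole activation tensor.
import torch
import torch.nn as nn

class ScaleDropout(nn.Module):
    def __init__(self, p: float = 0.5, scale: float = 2.0):
        super().__init__()
        self.p, self.scale = p, scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A single stochastic draw decides whether the whole tensor is rescaled.
        if torch.rand(()) < self.p:
            return x * self.scale
        return x

model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(), ScaleDropout(p=0.5),
    nn.Linear(64, 10),
)

def mc_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 20):
    """Several stochastic forward passes; mean = prediction, std = uncertainty."""
    preds = torch.stack([model(x).softmax(-1) for _ in range(n_samples)])
    return preds.mean(0), preds.std(0)

mean, std = mc_predict(model, torch.randn(4, 16))
print(mean.shape, std.shape)  # (4, 10), (4, 10)
```

Because only one random draw is needed per pass, the hardware cost of the stochastic unit stays constant as the model grows, which is the scalability argument made in the abstract.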

FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax

  • paper_url: http://arxiv.org/abs/2311.15813
  • repo_url: https://github.com/aniki-ly/FlowZero
  • paper_authors: Yu Lu, Linchao Zhu, Hehe Fan, Yi Yang
  • for: Proposes FlowZero, a framework for translating text into temporally coherent videos.
  • methods: Uses large language models (LLMs) to derive a dynamic scene syntax (scene descriptions, object layouts, and background motion patterns) that guides an image diffusion model for video generation.
  • results: Generates coherent zero-shot videos with smooth object motion and frame-to-frame consistency.
    Abstract Text-to-video (T2V) generation is a rapidly growing research area that aims to translate the scenes, objects, and actions within complex video text into a sequence of coherent visual frames. We present FlowZero, a novel framework that combines Large Language Models (LLMs) with image diffusion models to generate temporally-coherent videos. FlowZero uses LLMs to understand complex spatio-temporal dynamics from text, where LLMs can generate a comprehensive dynamic scene syntax (DSS) containing scene descriptions, object layouts, and background motion patterns. These elements in DSS are then used to guide the image diffusion model for video generation with smooth object motions and frame-to-frame coherence. Moreover, FlowZero incorporates an iterative self-refinement process, enhancing the alignment between the spatio-temporal layouts and the textual prompts for the videos. To enhance global coherence, we propose enriching the initial noise of each frame with motion dynamics to control the background movement and camera motion adaptively. By using spatio-temporal syntaxes to guide the diffusion process, FlowZero achieves improvement in zero-shot video synthesis, generating coherent videos with vivid motion.

Video Anomaly Detection via Spatio-Temporal Pseudo-Anomaly Generation : A Unified Approach

  • paper_url: http://arxiv.org/abs/2311.16514
  • repo_url: None
  • paper_authors: Ayush K. Rai, Tarun Krishna, Feiyan Hu, Alexandru Drimbarean, Kevin McGuinness, Alan F. Smeaton, Noel E. O’Connor
  • for: Proposes a new way to generate pseudo-anomalies (PAs) for open-set video anomaly detection (VAD), so that anomalies can be detected with an autoencoder (AE)-based reconstruction model under the one-class classification (OCC) setting.
  • methods: Inpaints a masked-out image region with a pre-trained Latent Diffusion Model and perturbs the optical flow with mixup to emulate spatio-temporal distortions; a simple unified framework then detects real-world anomalies under the OCC setting.
  • results: Performs on par with existing state-of-the-art PA-generation and reconstruction-based methods on four VAD benchmarks, and the analysis examines how PAs transfer and generalize across datasets.
    Abstract Video Anomaly Detection (VAD) is an open-set recognition task, which is usually formulated as a one-class classification (OCC) problem, where training data is comprised of videos with normal instances while test data contains both normal and anomalous instances. Recent works have investigated the creation of pseudo-anomalies (PAs) using only the normal data and making strong assumptions about real-world anomalies with regards to abnormality of objects and speed of motion to inject prior information about anomalies in an autoencoder (AE) based reconstruction model during training. This work proposes a novel method for generating generic spatio-temporal PAs by inpainting a masked out region of an image using a pre-trained Latent Diffusion Model and further perturbing the optical flow using mixup to emulate spatio-temporal distortions in the data. In addition, we present a simple unified framework to detect real-world anomalies under the OCC setting by learning three types of anomaly indicators, namely reconstruction quality, temporal irregularity and semantic inconsistency. Extensive experiments on four VAD benchmark datasets namely Ped2, Avenue, ShanghaiTech and UBnormal demonstrate that our method performs on par with other existing state-of-the-art PAs generation and reconstruction based methods under the OCC setting. Our analysis also examines the transferability and generalisation of PAs across these datasets, offering valuable insights by identifying real-world anomalies through PAs.
    摘要 视频异常检测(VAD)是一种开放集成识别任务,通常被视为一种一类分类(OCC)问题,其训练数据包含正常的视频实例,测试数据则包含正常和异常的视频实例。现有研究使用只有正常数据创建 pseudo-anomalies(PA),并假设了real-world异常的物体和运动速度以将先验知识注入到自动编码器(AE)基于模型中的恢复模型中。这种方法提出了一种生成通用的空间temporal Pseudo-anomalies(PAs),通过在一个遮盖区域的图像中使用预训练的潜在扩散模型进行填充,并使用mixup进行光流的杂化以模拟数据中的空间temporal扭曲。此外,我们提出了一个简单的一致框架,用于在 OCC 设定下检测真实的异常。我们的方法在四个 VAD benchmark 数据集上(Ped2、Avenue、ShanghaiTech和UBnormal)进行了广泛的实验,结果表明我们的方法与其他现有的 PA 生成和恢复基于方法在 OCC 设定下表现相当。我们的分析还检验了 PA 的传输性和通用性,提供了价值的洞察,通过使用 PA 来识别真实的异常。
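
A small, hedged sketch of just the optical-flow mixup perturbation mentioned in the abstract is given below; the diffusion-based inpainting step is omitted because it requires a pre-trained latent diffusion model. The mixing coefficient and Beta parameters are illustrative choices, not the authors' settings.

```python
# Mixup-style optical-flow perturbation: blend the flow of a normal clip with
# flow from another clip to emulate a spatio-temporal distortion.
import numpy as np

def mixup_flow(flow_a: np.ndarray, flow_b: np.ndarray, alpha: float = 0.4) -> np.ndarray:
    """flow_*: arrays of shape (H, W, 2) holding per-pixel (dx, dy)."""
    lam = np.random.beta(alpha, alpha)
    return lam * flow_a + (1.0 - lam) * flow_b

flow_normal = np.random.randn(64, 64, 2).astype(np.float32)
flow_other = np.random.randn(64, 64, 2).astype(np.float32)
pseudo_anomalous_flow = mixup_flow(flow_normal, flow_other)
print(pseudo_anomalous_flow.shape)  # (64, 64, 2)
```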

Planning for the Efficient Updating of Mutual Fund Portfolios

  • paper_url: http://arxiv.org/abs/2311.16204
  • repo_url: None
  • paper_authors: Tomás de la Rosa
  • for: Updating or rebalancing a portfolio of funds.
  • methods: Linear programming and heuristic search approaches that produce plans for executing the update.
  • results: Cost improvements over the compared baseline strategy.
    Abstract Once there is a decision of rebalancing or updating a portfolio of funds, the process of changing the current portfolio to the target one, involves a set of transactions that are susceptible of being optimized. This is particularly relevant when managers have to handle the implications of different types of instruments. In this work we present linear programming and heuristic search approaches that produce plans for executing the update. The evaluation of our proposals shows cost improvements over the compared based strategy. The models can be easily extended to other realistic scenarios in which a holistic portfolio management is required
    摘要 一旦决定对基金投资组合进行再平衡或更新,将当前组合调整为目标组合就涉及一系列可以优化的交易;当管理者需要处理不同类型金融工具的影响时,这一点尤为重要。本文提出了线性规划和启发式搜索方法,用于生成执行更新的计划。评估结果显示,与对比的基准策略相比,我们的方案可降低成本。这些模型可以方便地扩展到其他需要整体投资组合管理的实际场景。
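
To illustrate what "casting the update as an optimizable set of transactions" can look like, here is a hedged toy linear program: choose buys and sells that bring current weights within a tolerance of the target weights while minimizing proportional transaction costs. The paper's actual formulation and instrument-specific constraints are not reproduced; all numbers are invented for the example.

```python
# Toy rebalancing LP: minimise proportional trading cost subject to reaching
# the target weights within a tolerance and a self-financing constraint.
import numpy as np
from scipy.optimize import linprog

w_cur = np.array([0.40, 0.35, 0.25])      # current weights
w_tgt = np.array([0.30, 0.30, 0.40])      # target weights
cost = np.array([0.002, 0.001, 0.003])    # proportional cost per asset traded
tol = 0.01                                 # allowed deviation from target
n = len(w_cur)

# Decision variables x = [buy_1..buy_n, sell_1..sell_n], all >= 0.
c = np.concatenate([cost, cost])           # objective: total trading cost

# |w_cur + buy - sell - w_tgt| <= tol  ->  two blocks of inequalities.
I = np.eye(n)
A_ub = np.block([[I, -I], [-I, I]])
b_ub = np.concatenate([tol + (w_tgt - w_cur), tol - (w_tgt - w_cur)])

# Self-financing: total bought equals total sold (costs ignored in the balance).
A_eq = np.concatenate([np.ones(n), -np.ones(n)]).reshape(1, -1)
b_eq = [0.0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (2 * n), method="highs")
buy, sell = res.x[:n], res.x[n:]
print("trade per asset:", buy - sell)
```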

A Social-aware Gaussian Pre-trained Model for Effective Cold-start Recommendation

  • paper_url: http://arxiv.org/abs/2311.15790
  • repo_url: None
  • paper_authors: Siwei Liu, Xi Wang, Craig Macdonald, Iadh Ounis
  • for: Improving recommender-system performance, particularly for cold-start users.
  • methods: Pre-trains on social relations as well as interaction data (via a graph neural network) and factorises the pre-trained embeddings with a Gaussian Mixture Model (GMM) during fine-tuning.
  • results: Against 16 baselines on three public datasets, SGP outperforms the best baseline by up to 7.7% in NDCG@10 and effectively alleviates the cold-start problem.
    Abstract The use of pre-training is an emerging technique to enhance a neural model's performance, which has been shown to be effective for many neural language models such as BERT. This technique has also been used to enhance the performance of recommender systems. In such recommender systems, pre-training models are used to learn a better initialisation for both users and items. However, recent existing pre-trained recommender systems tend to only incorporate the user interaction data at the pre-training stage, making it difficult to deliver good recommendations, especially when the interaction data is sparse. To alleviate this common data sparsity issue, we propose to pre-train the recommendation model not only with the interaction data but also with other available information such as the social relations among users, thereby providing the recommender system with a better initialisation compared with solely relying on the user interaction data. We propose a novel recommendation model, the Social-aware Gaussian Pre-trained model (SGP), which encodes the user social relations and interaction data at the pre-training stage in a Graph Neural Network (GNN). Afterwards, in the subsequent fine-tuning stage, our SGP model adopts a Gaussian Mixture Model (GMM) to factorise these pre-trained embeddings for further training, thereby benefiting the cold-start users from these pre-built social relations. Our extensive experiments on three public datasets show that, in comparison to 16 competitive baselines, our SGP model significantly outperforms the best baseline by upto 7.7% in terms of NDCG@10. In addition, we show that SGP permits to effectively alleviate the cold-start problem, especially when users newly register to the system through their friends' suggestions.
    摘要 使用预训练技术可以提高神经网络模型的性能,这种技术已经被证明是对许多神经语言模型,如BERT,有效。这种技术还被用来提高推荐系统的性能。在这些推荐系统中,预训练模型用于学习更好的初始化 для用户和物品。然而,现有的预训练推荐系统通常只是在预训练阶段使用用户交互数据,这使得提供好的推荐became difficult,特别是当交互数据 scarcity 。为解决这个常见的数据稀缺问题,我们提议在预训练阶段不仅使用交互数据,还使用其他可用的信息,如用户之间的社交关系,以提供推荐系统更好的初始化。我们提出了一种新的推荐模型,社交意识 Gaussian Pre-trained Model (SGP),它在图 neural network (GNN) 中编码用户社交关系和交互数据。在后续的精度调整阶段,我们的 SGP 模型采用 Gaussian Mixture Model (GMM) 来因子化这些预训练嵌入,以进一步训练,从而为冷启用户带来优势。我们在三个公共数据集上进行了广泛的实验,与 16 个基线比较,我们的 SGP 模型在 NDCG@10 指标上与最佳基线之间比较,高达 7.7%。此外,我们还证明了 SGP 可以有效地解决冷启用户问题,特别是当用户通过朋友的建议注册到系统时。
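
Below is a hedged sketch of the fine-tuning-stage idea named in the abstract: fit a Gaussian Mixture Model over pre-trained user embeddings and use the soft component memberships as additional structure, so that a cold-start user represented through social relations can be placed in the same space. The GNN pre-training step and SGP's exact factorisation are not reproduced; sizes and the friends-mean initialisation are illustrative assumptions.

```python
# GMM over pre-trained user embeddings; soft memberships as extra features.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
user_emb = rng.normal(size=(1000, 64))    # embeddings from the pre-training stage

gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(user_emb)

memberships = gmm.predict_proba(user_emb)            # shape (1000, 8)

# A cold-start user's embedding could, for example, be initialised from the
# embeddings of the friends who invited them, then scored the same way.
cold_start = user_emb[:5].mean(axis=0, keepdims=True)
print(gmm.predict_proba(cold_start))
```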

YUAN 2.0: A Large Language Model with Localized Filtering-based Attention

  • paper_url: http://arxiv.org/abs/2311.15786
  • repo_url: https://github.com/ieit-yuan/yuan-2.0
  • paper_authors: Shaohua Wu, Xudong Zhao, Shenling Wang, Jiangang Luo, Lingjun Li, Xi Chen, Bing Zhao, Wei Wang, Tong Yu, Rongguo Zhang, Jiahua Zhang, Chao Wang
  • for: Introduces Localized Filtering-based Attention (LFA), which injects prior knowledge of local dependencies in natural language into attention, and builds the Yuan 2.0 large language model on top of it.
  • methods: Uses LFA, a data filtering and generation method for building high-quality pre-training and fine-tuning datasets, and a distributed training scheme combining non-uniform pipeline, data, and optimizer parallelism that reduces intra-node communication bandwidth requirements and scales well.
  • results: Yuan 2.0 displays impressive ability in code generation, math problem solving, and chat compared with existing models.
    Abstract In this work, the Localized Filtering-based Attention (LFA) is introduced to incorporate prior knowledge of local dependencies of natural language into Attention. Based on LFA, we develop and release Yuan 2.0, a large language model with parameters ranging from 2.1 billion to 102.6 billion. A data filtering and generation method is presented to build pretraining and fine-tuning dataset in high quality. A distributed training method with non-uniform pipeline parallel, data parallel, and optimizer parallel is proposed, which greatly reduces the bandwidth requirements of intra-node communication, and achieves good performance in large-scale distributed training. Yuan 2.0 models display impressive ability in code generation, math problem-solving, and chat compared with existing models. The latest version of YUAN 2.0, including model weights and source code, is accessible at Github.
    摘要 在这项工作中,我们引入了基于本地化滤波的注意力机制(LFA),以将自然语言中局部依赖关系的先验知识引入注意力。基于LFA,我们开发并发布了Yuan 2.0,这是一个参数规模从21亿到1026亿的大型语言模型。我们提出了一种数据过滤与生成方法,用于构建高质量的预训练和微调数据集;还提出了一种结合非均匀流水线并行、数据并行和优化器并行的分布式训练方法,有效降低了节点内通信带宽需求,并在大规模分布式训练中取得了良好性能。与现有模型相比,Yuan 2.0在代码生成、数学问题求解和对话中表现出色。最新版本的Yuan 2.0(含模型权重和源代码)可在GitHub上获取。
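
The entry does not spell out how LFA is built, so the following is only a hedged illustration of the general idea of localized, filtering-style attention: a depthwise 1-D convolution over neighbouring tokens injects local dependencies before standard self-attention. It is not Yuan 2.0's actual LFA operator, and the symmetric padding used here is non-causal (a causal variant would pad only on the left).

```python
# Illustrative "local filter + attention" block, not Yuan 2.0's LFA.
import torch
import torch.nn as nn

class LocalFilterAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int = 4, kernel_size: int = 3):
        super().__init__()
        # Depthwise conv over the sequence dimension mixes each token with its neighbours.
        self.local = nn.Conv1d(dim, dim, kernel_size,
                               padding=kernel_size // 2, groups=dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        local = self.local(x.transpose(1, 2)).transpose(1, 2)
        h = self.norm(x + local)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

block = LocalFilterAttention(dim=64)
print(block(torch.randn(2, 16, 64)).shape)  # (2, 16, 64)
```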

TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.16503
  • repo_url: None
  • paper_authors: Yushi Huang, Ruihao Gong, Jing Liu, Tianlong Chen, Xianglong Liu
  • for: Addresses the obstacles to deploying diffusion models broadly, namely long inference times and large memory requirements, via post-training quantization.
  • methods: Proposes the Temporal Feature Maintenance Quantization (TFMQ) framework built on a Temporal Information Block (TIB), with temporal information aware reconstruction (TIAR) and finite set calibration (FSC) to align the full-precision temporal features during quantization.
  • results: Maintains the temporal information and end-to-end generation quality; 4-bit weight quantization performs nearly on par with the full-precision model, with almost no extra computational cost and 2.0x faster quantization on LSUN-Bedrooms 256x256 than previous works.
    Abstract The Diffusion model, a prevalent framework for image generation, encounters significant challenges in terms of broad applicability due to its extended inference times and substantial memory requirements. Efficient Post-training Quantization (PTQ) is pivotal for addressing these issues in traditional models. Different from traditional models, diffusion models heavily depend on the time-step $t$ to achieve satisfactory multi-round denoising. Usually, $t$ from the finite set $\{1, \ldots, T\}$ is encoded to a temporal feature by a few modules totally irrespective of the sampling data. However, existing PTQ methods do not optimize these modules separately. They adopt inappropriate reconstruction targets and complex calibration methods, resulting in a severe disturbance of the temporal feature and denoising trajectory, as well as a low compression efficiency. To solve these, we propose a Temporal Feature Maintenance Quantization (TFMQ) framework building upon a Temporal Information Block which is just related to the time-step $t$ and unrelated to the sampling data. Powered by the pioneering block design, we devise temporal information aware reconstruction (TIAR) and finite set calibration (FSC) to align the full-precision temporal features in a limited time. Equipped with the framework, we can maintain the most temporal information and ensure the end-to-end generation quality. Extensive experiments on various datasets and diffusion models prove our state-of-the-art results. Remarkably, our quantization approach, for the first time, achieves model performance nearly on par with the full-precision model under 4-bit weight quantization. Additionally, our method incurs almost no extra computational cost and accelerates quantization time by $2.0 \times$ on LSUN-Bedrooms $256 \times 256$ compared to previous works.
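
For readers unfamiliar with the post-training quantization (PTQ) that TFMQ builds on, here is a generic, hedged illustration of the uniform 4-bit weight quantize/de-quantize round trip. It shows the basic operation only; the temporal-feature-aware reconstruction and finite set calibration of TFMQ-DM are not reproduced.

```python
# Generic symmetric, per-tensor uniform quantisation to n_bits (default 4).
import torch

def quantize_dequantize(w: torch.Tensor, n_bits: int = 4):
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1   # e.g. [-8, 7]
    scale = w.abs().max() / qmax                                # symmetric scale
    q = torch.clamp(torch.round(w / scale), qmin, qmax)         # integer codes
    return q * scale, q                                         # dequantised weights, codes

w = torch.randn(256, 256)
w_hat, codes = quantize_dequantize(w, n_bits=4)
print("mean abs error:", (w - w_hat).abs().mean().item())
```

PTQ methods like TFMQ then choose scales and reconstruction targets so that this rounding error disturbs the model's behaviour, and in particular its time-step-dependent features, as little as possible.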

Increasing Coverage and Precision of Textual Information in Multilingual Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2311.15781
  • repo_url: https://github.com/apple/ml-kge
  • paper_authors: Simone Conia, Min Li, Daniel Lee, Umar Farooq Minhas, Ihab Ilyas, Yunyao Li
  • for: Bridging the gap in both the quantity and quality of textual information (entity names and descriptions) between English and non-English languages in knowledge graphs such as Wikidata.
  • methods: Introduces the task of automatic Knowledge Graph Enhancement (KGE) and M-NTA, a novel unsupervised approach combining Machine Translation (MT), Web Search (WS), and Large Language Models (LLMs) to generate high-quality multilingual textual information.
  • results: Shows that MT, WS, and LLMs alone struggle with the task, releases WikiKGE-10, a human-curated benchmark covering 10 languages across 7 language families, and studies the impact of better multilingual coverage and precision on Entity Linking, Knowledge Graph Completion, and Question Answering.
    Abstract Recent work in Natural Language Processing and Computer Vision has been using textual information -- e.g., entity names and descriptions -- available in knowledge graphs to ground neural models to high-quality structured data. However, when it comes to non-English languages, the quantity and quality of textual information are comparatively scarce. To address this issue, we introduce the novel task of automatic Knowledge Graph Enhancement (KGE) and perform a thorough investigation on bridging the gap in both the quantity and quality of textual information between English and non-English languages. More specifically, we: i) bring to light the problem of increasing multilingual coverage and precision of entity names and descriptions in Wikidata; ii) demonstrate that state-of-the-art methods, namely, Machine Translation (MT), Web Search (WS), and Large Language Models (LLMs), struggle with this task; iii) present M-NTA, a novel unsupervised approach that combines MT, WS, and LLMs to generate high-quality textual information; and, iv) study the impact of increasing multilingual coverage and precision of non-English textual information in Entity Linking, Knowledge Graph Completion, and Question Answering. As part of our effort towards better multilingual knowledge graphs, we also introduce WikiKGE-10, the first human-curated benchmark to evaluate KGE approaches in 10 languages across 7 language families.
    摘要 近期的自然语言处理和计算机视觉研究利用知识图中的文本信息(例如实体名称和描述),将神经网络模型与高质量的结构化数据相对接。然而,对于非英语语言而言,此类文本信息在数量和质量上都相对匮乏。为解决这一问题,我们提出了自动知识图增强(KGE)这一新任务,并就如何弥合英语与非英语语言之间文本信息在数量和质量上的差距进行了深入研究。具体而言,我们:1)揭示了提升Wikidata中实体名称和描述的多语言覆盖率与准确率的问题;2)证明了机器翻译、网络搜索和大语言模型等最先进方法在该任务上表现欠佳;3)提出了M-NTA,一种结合机器翻译、网络搜索和大语言模型来生成高质量文本信息的新型无监督方法;4)研究了提升非英语文本信息的覆盖率与准确率对实体链接、知识图补全和问答任务的影响。作为构建更好的多语言知识图工作的一部分,我们还发布了WikiKGE-10,这是首个由人工构建的KGE评测基准,覆盖7个语系中的10种语言。

Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs

  • paper_url: http://arxiv.org/abs/2311.15759
  • repo_url: None
  • paper_authors: Yunxin Li, Baotian Hu, Wei Wang, Xiaochun Cao, Min Zhang
  • for: Enhancing large language models (LLMs) by letting them store and share multimodal (visual) knowledge, rather than only mapping visual inputs into the language space.
  • methods: Proposes MKS2, which adds a Modular Visual Memory inside the LLM's internal blocks to store open-world visual information efficiently, together with a soft Mixtures-of-Multimodal Experts architecture that invokes multimodal knowledge collaboration during generation.
  • results: Substantially improves LLM reasoning in contexts requiring physical or commonsense knowledge and achieves competitive results on multimodal benchmarks.
    Abstract Recent advancements in multimodal large language models (MLLMs) have achieved significant multimodal generation capabilities, akin to GPT-4. These models predominantly map visual information into language representation space, leveraging the vast knowledge and powerful text generation abilities of LLMs to produce multimodal instruction-following responses. We could term this method as LLMs for Vision because of its employing LLMs for visual-language understanding, yet observe that these MLLMs neglect the potential of harnessing visual knowledge to enhance overall capabilities of LLMs, which could be regraded as Vision Enhancing LLMs. In this paper, we propose an approach called MKS2, aimed at enhancing LLMs through empowering Multimodal Knowledge Storage and Sharing in LLMs. Specifically, we introduce the Modular Visual Memory, a component integrated into the internal blocks of LLMs, designed to store open-world visual information efficiently. Additionally, we present a soft Mixtures-of-Multimodal Experts architecture in LLMs to invoke multimodal knowledge collaboration during generation. Our comprehensive experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts necessitating physical or commonsense knowledge. It also delivers competitive results on multimodal benchmarks.
    摘要 最近的多模态大语言模型(MLLMs)技术已经取得了显著的多模态生成能力,类似于GPT-4。这些模型主要将视觉信息映射到语言表示空间中,利用大语言模型的广泛知识和强大的文本生成能力来生成多模态指令遵循Response。我们可以称这种方法为“视觉语言理解”,但我们注意到这些 MLLMs 可能会忽略利用视觉知识来提高整体语言模型的能力,这可以被称为“视觉增强语言模型”。在这篇论文中,我们提出了一种方法called MKS2,旨在通过强化语言模型来提高多模态知识存储和共享。具体来说,我们引入了内部块中的 Modular Visual Memory 组件,用于高效地存储开放世界的视觉信息。此外,我们还提出了一种软 Mixtures-of-Multimodal Experts 架构,用于在生成过程中调用多模态知识协作。我们的全面实验表明,MKS2 可以强化语言模型在需要物理或通用知识的上下文中的理解能力,同时也可以在多模态标准准则上提供竞争力的结果。

SceneDM: Scene-level Multi-agent Trajectory Generation with Consistent Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.15736
  • repo_url: None
  • paper_authors: Zhiming Guo, Xing Gao, Jianlan Zhou, Xinyu Cai, Botian Shi
  • for: Proposes SceneDM, a diffusion-model framework that generates joint, consistent future trajectories for all agents in a scene (vehicles, bicycles, pedestrians, etc.) to support developing and evaluating self-driving algorithms.
  • methods: Uses a Transformer-based network to handle agent-agent interactions in the reverse diffusion process, a simple yet effective consistent-diffusion scheme that exploits short-term temporal dependencies to keep trajectories smooth, and a scene-level scoring function to assess safety and road adherence.
  • results: Achieves state-of-the-art results on the Waymo Sim Agents Benchmark.
    Abstract Realistic scene-level multi-agent motion simulations are crucial for developing and evaluating self-driving algorithms. However, most existing works focus on generating trajectories for a certain single agent type, and typically ignore the consistency of generated trajectories. In this paper, we propose a novel framework based on diffusion models, called SceneDM, to generate joint and consistent future motions of all the agents, including vehicles, bicycles, pedestrians, etc., in a scene. To enhance the consistency of the generated trajectories, we resort to a new Transformer-based network to effectively handle agent-agent interactions in the inverse process of motion diffusion. In consideration of the smoothness of agent trajectories, we further design a simple yet effective consistent diffusion approach, to improve the model in exploiting short-term temporal dependencies. Furthermore, a scene-level scoring function is attached to evaluate the safety and road-adherence of the generated agent's motions and help filter out unrealistic simulations. Finally, SceneDM achieves state-of-the-art results on the Waymo Sim Agents Benchmark. Project webpage is available at https://alperen-hub.github.io/SceneDM.
    摘要 真实的场景级多智能体运动仿真对于开发和评估自动驾驶算法至关重要。然而,现有工作大多只为某一类单一智能体生成轨迹,且通常忽略生成轨迹的一致性。本文提出了一种基于扩散模型的新框架SceneDM,用于生成场景中所有智能体(包括车辆、自行车、行人等)的联合且一致的未来运动。为提高生成轨迹的一致性,我们采用一种新的基于Transformer的网络,在运动扩散的逆过程中有效处理智能体之间的交互;考虑到轨迹的平滑性,我们进一步设计了一种简单而有效的一致扩散方法,以更好地利用短期时间依赖。此外,我们引入场景级评分函数,用于评估生成运动的安全性与道路遵从性,并过滤不真实的仿真结果。SceneDM在Waymo Sim Agents Benchmark上取得了最先进的结果。项目主页:https://alperen-hub.github.io/SceneDM。

Adinkra Symbol Recognition using Classical Machine Learning and Deep Learning

  • paper_url: http://arxiv.org/abs/2311.15728
  • repo_url: None
  • paper_authors: Michael Adjeisah, Kwame Omono Asamoah, Martha Asamoah Yeboah, Raji Rafiu King, Godwin Ferguson Achaab, Kingsley Adjei
  • for: Raising awareness of and engagement with AI in Black communities and African countries through an accessible application: recognizing Adinkra symbols.
  • methods: Builds the ADINKRA dataset, trains a small VGG-like CNN for classification and recognition, and uses pre-trained models such as VGG and ResNet for feature extraction and transfer learning alongside classical machine learning models.
  • results: Proposes a simple CNN with optional dropout regularization, evaluates its accuracy and convergence rate, and visualizes the regions that most influence its predictions; these evaluations serve as a foundational benchmark for future assessments of the ADINKRA dataset.
    Abstract Artificial intelligence (AI) has emerged as a transformative influence, engendering paradigm shifts in global societies, spanning academia and industry. However, in light of these rapid advances, addressing the underrepresentation of black communities and African countries in AI is crucial. Boosting enthusiasm for AI can be effectively accomplished by showcasing straightforward applications around tasks like identifying and categorizing traditional symbols, such as Adinkra symbols, or familiar objects within the community. In this research endeavor, we dived into classical machine learning and harnessed the power of deep learning models to tackle the intricate task of classifying and recognizing Adinkra symbols. The idea led to a newly constructed ADINKRA dataset comprising 174,338 images meticulously organized into 62 distinct classes, each representing a singular and emblematic symbol. We constructed a CNN model for classification and recognition using six convolutional layers, three fully connected (FC) layers, and optional dropout regularization. The model is a simpler and smaller version of VGG, with fewer layers, smaller channel sizes, and a fixed kernel size. Additionally, we tap into the transfer learning capabilities provided by pre-trained models like VGG and ResNet. These models assist us in both classifying images and extracting features that can be used with classical machine learning models. We assess the model's performance by measuring its accuracy and convergence rate and visualizing the areas that significantly influence its predictions. These evaluations serve as a foundational benchmark for future assessments of the ADINKRA dataset. We hope this application exemplar inspires ideas on the various uses of AI in organizing our traditional and modern lives.
    摘要 人工智能(AI)已成为一股变革性力量,在学术界和产业界引发了全球性的范式转变。然而,面对这些快速进展,解决黑人社区和非洲国家在AI领域代表性不足的问题至关重要。通过展示一些直观的应用,例如识别和分类传统符号(如Adinkra符号)或社区中熟悉的物品,可以有效激发对AI的热情。在这项研究中,我们结合经典机器学习与深度学习模型,来解决Adinkra符号分类与识别这一复杂任务。为此,我们构建了ADINKRA数据集,包含174,338张图像,并精心划分为62个类别,每个类别对应一种符号。我们构建了一个用于分类和识别的CNN模型,包含六个卷积层、三个全连接(FC)层以及可选的dropout正则化;该模型是VGG的简化小型版本,层数更少、通道更小并采用固定大小的卷积核。此外,我们利用VGG和ResNet等预训练模型的迁移学习能力,既用于图像分类,也用于提取可供经典机器学习模型使用的特征。我们通过测量准确率和收敛速度来评估模型性能,并可视化对预测影响最大的区域。这些评估可作为ADINKRA数据集未来评测的基础基准。我们希望这一应用示例能够启发人们思考AI在组织传统与现代生活中的多种用途。
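
A hedged sketch of a classifier with the shape described in the abstract (six 3x3 convolutional layers, three fully connected layers, optional dropout) is shown below. The channel sizes, pooling placement, and the 128x128 input resolution are assumptions, not the authors' exact configuration.

```python
# Small VGG-style CNN: 6 conv layers, 3 FC layers, dropout; 62 output classes.
import torch
import torch.nn as nn

def conv_block(c_in: int, c_out: int) -> nn.Sequential:
    return nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                         nn.ReLU(inplace=True))

class SmallAdinkraCNN(nn.Module):
    def __init__(self, n_classes: int = 62, dropout: float = 0.5):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 16),  conv_block(16, 16), nn.MaxPool2d(2),
            conv_block(16, 32), conv_block(32, 32), nn.MaxPool2d(2),
            conv_block(32, 64), conv_block(64, 64), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 256), nn.ReLU(inplace=True), nn.Dropout(dropout),
            nn.Linear(256, 128), nn.ReLU(inplace=True), nn.Dropout(dropout),
            nn.Linear(128, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = SmallAdinkraCNN()
print(model(torch.randn(1, 3, 128, 128)).shape)  # (1, 62)
```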

Italian Crossword Generator: Enhancing Education through Interactive Word Puzzles

  • paper_url: http://arxiv.org/abs/2311.15723
  • repo_url: None
  • paper_authors: Kamyar Zeinalipour, Tommaso laquinta, Asya Zanollo, Giovanni Angelini, Leonardo Rigutini, Marco Maggini, Marco Gori
  • for: Increasing student engagement, understanding, critical thinking, and memory retention through educational crossword puzzles.
  • methods: Uses recent NLP and machine-learning models such as GPT3-DaVinci, GPT3-Curie, GPT3-Babbage, GPT3-Ada, and BERT-uncased to generate and verify high-quality crossword clues, fine-tuning on a clue-answer dataset and using zero/few-shot clue generation from text.
  • results: Produces high-standard educational crosswords that offer students an engaging and rewarding learning experience.
    Abstract Educational crosswords offer numerous benefits for students, including increased engagement, improved understanding, critical thinking, and memory retention. Creating high-quality educational crosswords can be challenging, but recent advances in natural language processing and machine learning have made it possible to use language models to generate nice wordplays. The exploitation of cutting-edge language models like GPT3-DaVinci, GPT3-Curie, GPT3-Babbage, GPT3-Ada, and BERT-uncased has led to the development of a comprehensive system for generating and verifying crossword clues. A large dataset of clue-answer pairs was compiled to fine-tune the models in a supervised manner to generate original and challenging clues from a given keyword. On the other hand, for generating crossword clues from a given text, Zero/Few-shot learning techniques were used to extract clues from the input text, adding variety and creativity to the puzzles. We employed the fine-tuned model to generate data and labeled the acceptability of clue-answer parts with human supervision. To ensure quality, we developed a classifier by fine-tuning existing language models on the labeled dataset. Conversely, to assess the quality of clues generated from the given text using zero/few-shot learning, we employed a zero-shot learning approach to check the quality of generated clues. The results of the evaluation have been very promising, demonstrating the effectiveness of the approach in creating high-standard educational crosswords that offer students engaging and rewarding learning experiences.

GLIME: General, Stable and Local LIME Explanation

  • paper_url: http://arxiv.org/abs/2311.15722
  • repo_url: https://github.com/thutzr/glime-general-stable-and-local-lime-explanation
  • paper_authors: Zeren Tan, Yang Tian, Jian Li
  • for: Explaining the predictions of black-box machine learning models and improving the interpretability of such explanations.
  • methods: Proposes GLIME, an enhanced version of LIME with significantly faster convergence, improved stability, and a local, unbiased sampling distribution that can be chosen to suit the scenario.
  • results: GLIME delivers higher local fidelity than LIME and produces explanations that are independent of the reference choice, while adapting flexibly to different scenarios.
    Abstract As black-box machine learning models grow in complexity and find applications in high-stakes scenarios, it is imperative to provide explanations for their predictions. Although Local Interpretable Model-agnostic Explanations (LIME) [22] is a widely adpoted method for understanding model behaviors, it is unstable with respect to random seeds [35,24,3] and exhibits low local fidelity (i.e., how well the explanation approximates the model's local behaviors) [21,16]. Our study shows that this instability problem stems from small sample weights, leading to the dominance of regularization and slow convergence. Additionally, LIME's sampling neighborhood is non-local and biased towards the reference, resulting in poor local fidelity and sensitivity to reference choice. To tackle these challenges, we introduce GLIME, an enhanced framework extending LIME and unifying several prior methods. Within the GLIME framework, we derive an equivalent formulation of LIME that achieves significantly faster convergence and improved stability. By employing a local and unbiased sampling distribution, GLIME generates explanations with higher local fidelity compared to LIME. GLIME explanations are independent of reference choice. Moreover, GLIME offers users the flexibility to choose a sampling distribution based on their specific scenarios.
    摘要 随着黑盒机器学习模型日益复杂并被应用于高风险场景,为其预测提供解释变得至关重要。尽管局部可解释、与模型无关的解释方法(LIME)被广泛用于理解模型行为,但它对随机种子不稳定,且局部保真度(即解释对模型局部行为的逼近程度)较低。我们的研究表明,这种不稳定性源于过小的样本权重,导致正则化项占主导地位且收敛缓慢。此外,LIME的采样邻域并非局部的,且偏向参考点,造成局部保真度差并对参考点选择敏感。为应对这些挑战,我们提出了GLIME,一个扩展LIME并统一若干已有方法的增强框架。在GLIME框架中,我们推导出LIME的一个等价形式,其收敛显著更快且更稳定。通过采用局部且无偏的采样分布,GLIME生成的解释比LIME具有更高的局部保真度,并且与参考点选择无关。此外,GLIME允许用户根据具体场景选择采样分布。
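
To ground the "local and unbiased sampling" idea, here is a hedged sketch of the generic recipe GLIME refines: draw perturbations from a distribution centred at the instance itself (rather than a distant reference), query the black box, and fit a linear surrogate whose coefficients serve as the explanation. GLIME's exact formulation and its equivalence result are not reproduced; the Gaussian width and ridge penalty are arbitrary.

```python
# Local linear surrogate with an unbiased Gaussian sampling distribution.
import numpy as np
from sklearn.linear_model import Ridge

def local_linear_explanation(black_box, x: np.ndarray, sigma: float = 0.2,
                             n_samples: int = 2000, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # Perturbations are centred at x itself: local and unbiased.
    Z = x + sigma * rng.normal(size=(n_samples, x.shape[0]))
    y = black_box(Z)
    surrogate = Ridge(alpha=1e-3).fit(Z - x, y)   # centred features
    return surrogate.coef_                         # per-feature attribution

# Toy black box: a nonlinear scorer over 5 features.
black_box = lambda Z: np.tanh(2.0 * Z[:, 0]) + 0.5 * Z[:, 1] ** 2 - Z[:, 3]
x0 = np.array([0.1, -0.3, 0.7, 0.2, 0.0])
print(local_linear_explanation(black_box, x0))
```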

Variational Autoencoders for Feature Exploration and Malignancy Prediction of Lung Lesions

  • paper_url: http://arxiv.org/abs/2311.15719
  • repo_url: https://github.com/benkeel/vae_lung_lesion_bmvc
  • paper_authors: Benjamin Keel, Aaron Quyn, David Jayne, Samuel D. Relton
  • for: Developing an accurate and interpretable AI model for lung cancer diagnosis from routine CT scans.
  • methods: Uses Variational Autoencoders (VAEs) to learn latent vector representations of lung cancer lesions, which are then fed to a multi-layer perceptron (MLP) classifier for diagnosis; compares a Gaussian VAE (GVAE) with a Dirichlet VAE (DirVAE).
  • results: The best model achieves state-of-the-art metrics of AUC 0.98 and 93.1% accuracy; cluster analysis shows the VAE latent space separates malignant and benign lesions along meaningful feature components, and latent-space traversals correspond to clinically meaningful feature changes.
    Abstract Lung cancer is responsible for 21% of cancer deaths in the UK and five-year survival rates are heavily influenced by the stage the cancer was identified at. Recent studies have demonstrated the capability of AI methods for accurate and early diagnosis of lung cancer from routine scans. However, this evidence has not translated into clinical practice with one barrier being a lack of interpretable models. This study investigates the application Variational Autoencoders (VAEs), a type of generative AI model, to lung cancer lesions. Proposed models were trained on lesions extracted from 3D CT scans in the LIDC-IDRI public dataset. Latent vector representations of 2D slices produced by the VAEs were explored through clustering to justify their quality and used in an MLP classifier model for lung cancer diagnosis, the best model achieved state-of-the-art metrics of AUC 0.98 and 93.1% accuracy. Cluster analysis shows the VAE latent space separates the dataset of malignant and benign lesions based on meaningful feature components including tumour size, shape, patient and malignancy class. We also include a comparative analysis of the standard Gaussian VAE (GVAE) and the more recent Dirichlet VAE (DirVAE), which replaces the prior with a Dirichlet distribution to encourage a more explainable latent space with disentangled feature representation. Finally, we demonstrate the potential for latent space traversals corresponding to clinically meaningful feature changes.
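
As a reference point for how the latent representations in this entry are produced, here is a minimal, hedged Gaussian VAE sketch: an encoder producing (mu, log_var), the reparameterisation trick, and the ELBO loss. The 64x64 patch size, layer widths, and latent dimension are assumptions; the Dirichlet variant and the downstream MLP classifier are not shown.

```python
# Minimal Gaussian VAE on flattened 2-D lesion slices in [0, 1].
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianVAE(nn.Module):
    def __init__(self, in_dim: int = 64 * 64, latent_dim: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU())
        self.mu = nn.Linear(512, latent_dim)
        self.log_var = nn.Linear(512, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                 nn.Linear(512, in_dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterise
        return self.dec(z), mu, log_var

def elbo_loss(x, x_hat, mu, log_var):
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

vae = GaussianVAE()
x = torch.rand(8, 64 * 64)
x_hat, mu, log_var = vae(x)
print(elbo_loss(x, x_hat, mu, log_var).item())
```

The latent vectors (mu) from such an encoder are what downstream clustering and an MLP classifier would operate on.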

Cerbero-7B: A Leap Forward in Language-Specific LLMs Through Enhanced Chat Corpus Generation and Evaluation

  • paper_url: http://arxiv.org/abs/2311.15698
  • repo_url: None
  • paper_authors: Federico A. Galatolo, Mario G. C. A. Cimino
  • for: Generating high-quality, language-specific chat corpora via a self-chat mechanism, with a focus on underrepresented languages such as Italian.
  • methods: Combines a generator LLM with an embedder LLM to create diverse new samples, and proposes a new MLM model-based quality-assessment metric for evaluating and filtering the corpora.
  • results: The refined Italian chat corpus and the fine-tuned model, cerbero-7b, show significantly enhanced language comprehension and question-answering skills, establishing a new state of the art for Italian LLMs.
    Abstract This study introduces a novel approach for generating high-quality, language-specific chat corpora using a self-chat mechanism. We combine a generator LLM for creating new samples and an embedder LLM to ensure diversity. A new Masked Language Modelling (MLM) model-based quality assessment metric is proposed for evaluating and filtering the corpora. Utilizing the llama2-70b as the generator and a multilingual sentence transformer as embedder, we generate an Italian chat corpus and refine the Fauno corpus, which is based on translated English ChatGPT self-chat data. The refinement uses structural assertions and Natural Language Processing techniques. Both corpora undergo a comprehensive quality evaluation using the proposed MLM model-based quality metric. The Italian LLM fine-tuned with these corpora demonstrates significantly enhanced language comprehension and question-answering skills. The resultant model, cerbero-7b, establishes a new state-of-the-art for Italian LLMs. This approach marks a substantial advancement in the development of language-specific LLMs, with a special emphasis on augmenting corpora for underrepresented languages like Italian.
    摘要 本研究提出了一种利用自我对话机制生成高质量、特定语言聊天语料的新方法。我们使用一个生成器大语言模型(LLM)产生新样本,并使用一个嵌入器LLM来保证多样性,同时提出了一种基于掩码语言建模(MLM)模型的质量评估指标,用于评估和过滤语料。以llama2-70b作为生成器、多语言句子转换器作为嵌入器,我们生成了一个意大利语聊天语料,并对基于翻译后英文ChatGPT自我对话数据的Fauno语料进行了精炼,精炼过程结合了结构断言和自然语言处理技术。两个语料均使用所提出的MLM质量指标进行了全面的质量评估。用这些语料微调的意大利语LLM在语言理解和问答能力上显著提升,所得到的模型cerbero-7b为意大利语LLM树立了新的最优水平。该方法标志着特定语言LLM的发展取得了重要进展,尤其是在为意大利语等代表性不足的语言扩充语料方面。

Peptide Binding Classification on Quantum Computers

  • paper_url: http://arxiv.org/abs/2311.15696
  • repo_url: https://github.com/cqcl/peptide-binding-classification-on-quantum-computers
  • paper_authors: Charles London, Douglas Brown, Wenduan Xu, Sezen Vatansever, Christopher James Langmead, Dimitri Kartsaklis, Stephen Clark, Konstantinos Meichanetzidis
  • for: Applying near-term quantum computers to a computational-biology task relevant to the design of therapeutic proteins.
  • methods: Builds quantum models from parameterised quantum circuits for sequence classification, runs the best-performing, resource-friendly models on emulators of state-of-the-art noisy quantum processors with error mitigation, and then executes them on the Quantinuum H1-1 trapped-ion processor.
  • results: Performance is competitive with classical baselines of similar scale, the hardware runs agree closely with noiseless exact simulation, and feature-attribution analysis shows the quantum models identify sensible relationships at least as well as the classical baselines.
    Abstract We conduct an extensive study on using near-term quantum computers for a task in the domain of computational biology. By constructing quantum models based on parameterised quantum circuits we perform sequence classification on a task relevant to the design of therapeutic proteins, and find competitive performance with classical baselines of similar scale. To study the effect of noise, we run some of the best-performing quantum models with favourable resource requirements on emulators of state-of-the-art noisy quantum processors. We then apply error mitigation methods to improve the signal. We further execute these quantum models on the Quantinuum H1-1 trapped-ion quantum processor and observe very close agreement with noiseless exact simulation. Finally, we perform feature attribution methods and find that the quantum models indeed identify sensible relationships, at least as well as the classical baselines. This work constitutes the first proof-of-concept application of near-term quantum computing to a task critical to the design of therapeutic proteins, opening the route toward larger-scale applications in this and related fields, in line with the hardware development roadmaps of near-term quantum technologies.
    摘要 我们开展了一项广泛的研究,探索利用近期量子计算机解决计算生物学领域中与治疗性蛋白设计相关的一项任务。我们基于参数化量子线路构建量子模型,针对该任务执行序列分类,并取得了与同等规模经典基线相当的性能。为研究噪声的影响,我们在模拟当前最先进含噪量子处理器的仿真器上运行若干资源需求较低、性能最佳的量子模型,并应用误差缓解方法来提升信号。随后,我们在Quantinuum H1-1离子阱量子处理器上执行这些量子模型,观察到其结果与无噪声的精确模拟高度一致。最后,我们采用特征归因方法,发现量子模型确实能够识别合理的特征关系,至少不逊于经典基线。这项工作首次将近期量子计算应用于治疗性蛋白设计中的关键任务,为在该领域及相关领域开展更大规模的应用铺平了道路,与近期量子技术的硬件发展路线图相一致。

Regularization by Texts for Latent Diffusion Inverse Solvers

  • paper_url: http://arxiv.org/abs/2311.15658
  • repo_url: None
  • paper_authors: Jeongsol Kim, Geon Yeong Park, Hyungjin Chung, Jong Chul Ye
  • for: Solving inverse problems that use diffusion models as effective generative priors.
  • methods: Introduces regularization by texts (TReg): a textual description of the preconceived solution is applied during the reverse sampling phase and dynamically reinforced through null-text optimization for adaptive negation.
  • results: TReg successfully mitigates ambiguity in latent diffusion inverse solvers, improving their effectiveness and accuracy.
    Abstract The recent advent of diffusion models has led to significant progress in solving inverse problems, leveraging these models as effective generative priors. Nonetheless, challenges related to the ill-posed nature of such problems remain, often due to inherent ambiguities in measurements. Drawing inspiration from the human ability to resolve visual ambiguities through perceptual biases, here we introduce a novel latent diffusion inverse solver by incorporating regularization by texts (TReg). Specifically, TReg applies the textual description of the preconception of the solution during the reverse sampling phase, which is dynamically reinforced through null-text optimization for adaptive negation. Our comprehensive experimental results demonstrate that TReg successfully mitigates ambiguity in latent diffusion inverse solvers, enhancing their effectiveness and accuracy.
    摘要 扩散模型的出现为求解逆问题带来了重要进展,这类模型可被用作有效的生成先验。然而,由于测量中固有的歧义,这类问题的不适定性仍然是一个挑战。受人类通过感知偏好来消除视觉歧义这一能力的启发,我们提出了一种新的潜在扩散逆问题求解器,引入基于文本的正则化(TReg)。具体而言,TReg在逆向采样阶段施加对解的先验设想的文本描述,并通过空文本(null-text)优化进行自适应否定,从而动态强化该描述。大量实验结果表明,TReg能够有效缓解潜在扩散逆求解器中的歧义,提升其有效性与准确性。

RoboGPT: an intelligent agent of making embodied long-term decisions for daily instruction tasks

  • paper_url: http://arxiv.org/abs/2311.15649
  • repo_url: None
  • paper_authors: Yaran Chen, Wenbo Cui, Yuanwen Chen, Mining Tan, Xinyao Zhang, Dongbin Zhao, He Wang
  • for: Developing a RoboGPT agent that can make embodied long-term decisions for daily tasks from natural language instructions, addressing the feasibility and correctness issues of LLM-generated task plans.
  • methods: Two modules: 1) LLM-based planning with re-planning to break the task into multiple sub-goals, enhanced with a new robotic dataset (RoboGPT); and 2) RoboSkill, individually designed for sub-goals to learn better navigation and manipulation skills.
  • results: Outperforms SOTA methods on the ALFRED daily tasks, and the planner exceeds SOTA LLM-based planners such as ChatGPT in task-planning rationality on hundreds of unseen daily tasks and other domain tasks, while keeping the large model's original broad applicability and generality.
    Abstract Robotic agents must master common sense and long-term sequential decisions to solve daily tasks through natural language instruction. The developments in Large Language Models (LLMs) in natural language processing have inspired efforts to use LLMs in complex robot planning. Despite LLMs' great generalization and comprehension of instruction tasks, LLMs-generated task plans sometimes lack feasibility and correctness. To address the problem, we propose a RoboGPT agent\footnote{our code and dataset will be released soon} for making embodied long-term decisions for daily tasks, with two modules: 1) LLMs-based planning with re-plan to break the task into multiple sub-goals; 2) RoboSkill individually designed for sub-goals to learn better navigation and manipulation skills. The LLMs-based planning is enhanced with a new robotic dataset and re-plan, called RoboGPT. The new robotic dataset of 67k daily instruction tasks is gathered for fine-tuning the Llama model and obtaining RoboGPT. RoboGPT planner with strong generalization can plan hundreds of daily instruction tasks. Additionally, a low-computational Re-Plan module is designed to allow plans to flexibly adapt to the environment, thereby addressing the nomenclature diversity challenge. The proposed RoboGPT agent outperforms SOTA methods on the ALFRED daily tasks. Moreover, RoboGPT planner exceeds SOTA LLM-based planners like ChatGPT in task-planning rationality for hundreds of unseen daily tasks, and even other domain tasks, while keeping the large model's original broad application and generality.
    摘要 机器人代理人需要掌握常识和长期顺序决策,以完成日常任务通过自然语言指令。大型自然语言处理(LLMs)的发展已经激发了使用LLMs在复杂机器人规划中的尝试。despite LLMs的很好的总结和指令任务的理解,LLMs生成的任务计划有时缺乏可行性和正确性。为解决这问题,我们提议一种名为RoboGPT的机器人代理人,用于实现身体内的长期决策,包括两个模块:1)基于LLMs的规划,通过重新规划破解任务为多个子目标;2)RoboSkill,特制 для每个子目标,以学习更好的导航和抓取技能。LLMs基于的规划得到了一个新的机器人数据集和重新规划(RoboGPT)的改进。新的机器人数据集包括67k天日指令任务,用于精度调整Llama模型并获得RoboGPT。RoboGPT规划器具有强大的通用化能力,可以计划百余天日指令任务。此外,我们还设计了一个低计算量的重新规划模块,以让计划能够灵活适应环境,解决了命名多样性挑战。提议的RoboGPT代理人超出了当前最佳方法在ALFRED日常任务上的表现,同时RoboGPT规划器也超越了基于LLM的其他域的规划器,包括ChatGPT,在未看到的日常任务上的任务规划理智,而且保持了大型模型的原始广泛应用和通用性。

  • paper_url: http://arxiv.org/abs/2311.15648
  • repo_url: None
  • paper_authors: Aboli Marathe
  • for: The paper is written for image generation using model-agnostic learning, with a focus on aligning semantic priors with generative capabilities.
  • methods: The paper proposes two methods for image generation: Reinforcement Learning from Diffusion Feedback (RLDF) and Noisy Diffusion Gradient. Both methods use a special Continuous Feature Grammar (CFG) encoding for continual semantic guidance.
  • results: The paper reports that RLDF generates high-quality images over varied domains, including retail, sports, and agriculture, with class-consistency and strong visual diversity. The results are demonstrated using only a single input image and no text input.
    Abstract Large vision-language models are steadily gaining personalization capabilities at the cost of fine-tuning or data augmentation. We present two models for image generation using model-agnostic learning that align semantic priors with generative capabilities. RLDF, or Reinforcement Learning from Diffusion Feedback, is a singular approach for visual imitation through prior-preserving reward function guidance. This employs Q-learning (with standard Q*) for generation and follows a semantic-rewarded trajectory for image search through finite encoding-tailored actions. The second proposed method, noisy diffusion gradient, is optimization driven. At the root of both methods is a special CFG encoding that we propose for continual semantic guidance. Using only a single input image and no text input, RLDF generates high-quality images over varied domains including retail, sports and agriculture showcasing class-consistency and strong visual diversity. Project website is available at https://infernolia.github.io/RLDF.
    摘要 大型视语模型逐渐增强个性化功能,而这与数据增强或微调相对应。我们提出了两种图像生成方法,使用模型无关学习来保持语义指导。RLDF(强化学习从扩散反馈)是一种单一的视觉模仿方法,通过保持语义指导的奖励函数导航。这使用Q学习(使用标准Q*)进行生成,并跟踪语义奖励轨迹来进行图像搜索,通过finite编码适应的动作。第二种提议的方法是噪声扩散梯度,它是依靠优化驱动的。这两种方法的核心都是我们提议的特殊CFG编码,用于持续Semantic导航。只需单个输入图像和没有文本输入,RLDF可以生成高质量图像,覆盖多个领域,包括零售、运动和农业,展示了类型一致性和强大的视觉多样性。项目网站可以在https://infernolia.github.io/RLDF查看。

ChatTraffic: Text-to-Traffic Generation via Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.16203
  • repo_url: https://github.com/ChyaZhang/ChatTraffic
  • paper_authors: Chengyang Zhang, Yong Zhang, Qitan Shao, Bo Li, Yisheng Lv, Xinglin Piao, Baocai Yin
  • for: Traffic prediction from text descriptions of the traffic system, addressing two weaknesses of conventional predictors: insensitivity to unusual events and poor long-term performance.
  • methods: Defines the Text-to-Traffic Generation (TTG) task and proposes ChatTraffic, a diffusion model augmented with a Graph Convolutional Network (GCN) that associates text with the road network and traffic data to generate realistic traffic situations.
  • results: Experiments show ChatTraffic can generate realistic traffic situations from text; code and the text-traffic dataset are available at https://github.com/ChyaZhang/ChatTraffic.
    Abstract Traffic prediction is one of the most significant foundations in Intelligent Transportation Systems (ITS). Traditional traffic prediction methods rely only on historical traffic data to predict traffic trends and face two main challenges. 1) insensitivity to unusual events. 2) poor performance in long-term prediction. In this work, we explore how generative models combined with text describing the traffic system can be applied for traffic generation and name the task Text-to-Traffic Generation (TTG). The key challenge of the TTG task is how to associate text with the spatial structure of the road network and traffic data for generating traffic situations. To this end, we propose ChatTraffic, the first diffusion model for text-to-traffic generation. To guarantee the consistency between synthetic and real data, we augment a diffusion model with the Graph Convolutional Network (GCN) to extract spatial correlations of traffic data. In addition, we construct a large dataset containing text-traffic pairs for the TTG task. We benchmarked our model qualitatively and quantitatively on the released dataset. The experimental results indicate that ChatTraffic can generate realistic traffic situations from the text. Our code and dataset are available at https://github.com/ChyaZhang/ChatTraffic.
    摘要 交通预测是智能交通系统(ITS)的重要基础之一。传统交通预测方法仅依靠历史交通数据预测交通趋势,面临两大挑战:1)对异常事件不敏感;2)长期预测性能不佳。在这项工作中,我们探讨如何将生成模型与交通系统的文本描述结合使用,并将该任务命名为文本到交通生成(TTG)。TTG任务的关键挑战在于如何将文本与路网结构和交通数据相关联,以生成交通情况。为此,我们提出了ChatTraffic,首个用于文本到交通生成的扩散模型。为保证生成数据与真实数据的一致性,我们在扩散模型中加入图卷积网络(GCN)以提取交通数据的空间相关性。此外,我们构建了一个包含文本-交通数据对的大规模数据集用于TTG任务。我们在该数据集上对模型进行了定性与定量评测,结果表明ChatTraffic能够根据文本生成真实的交通情况。代码和数据集见 https://github.com/ChyaZhang/ChatTraffic。

Phonetic-aware speaker embedding for far-field speaker verification

  • paper_url: http://arxiv.org/abs/2311.15627
  • repo_url: None
  • paper_authors: Zezhong Jin, Youzhi Tu, Man-Wai Mak
  • for: Improving the performance of far-field speaker verification systems.
  • methods: Jointly trains speech recognition and speaker recognition (JTSS) so that speaker embeddings preserve phonetic information, by matching frame-level feature maps of the speaker embedding network with wav2vec vectors.
  • results: Outperforms the standard speaker embedding on the VOiCES Challenge 2019 evaluation set and the VoxCeleb1 test set.
    Abstract When a speaker verification (SV) system operates far from the sound source, significant challenges arise due to the interference of noise and reverberation. Studies have shown that incorporating phonetic information into speaker embedding can improve the performance of text-independent SV. Inspired by this observation, we propose a joint-training speech recognition and speaker recognition (JTSS) framework to exploit phonetic content for far-field SV. The framework encourages speaker embeddings to preserve phonetic information by matching the frame-based feature maps of a speaker embedding network with wav2vec's vectors. The intuition is that phonetic information can preserve low-level acoustic dynamics with speaker information and thus partly compensate for the degradation due to noise and reverberation. Results show that the proposed framework outperforms the standard speaker embedding on the VOiCES Challenge 2019 evaluation set and the VoxCeleb1 test set. This indicates that leveraging phonetic information under far-field conditions is effective for learning robust speaker representations.
    摘要 当说话人验证(SV)系统远离声源工作时,噪声和混响的干扰会带来显著挑战。已有研究表明,在说话人嵌入中融入音素信息可以提升文本无关SV的性能。受此启发,我们提出了一种联合训练语音识别与说话人识别(JTSS)的框架,以利用音素内容改进远场SV。该框架通过将说话人嵌入网络的帧级特征图与wav2vec向量进行匹配,促使说话人嵌入保留音素信息。其直觉在于:音素信息能够连同说话人信息一起保留低层声学动态,从而部分补偿噪声和混响造成的性能退化。结果显示,所提框架在VOiCES Challenge 2019评估集和VoxCeleb1测试集上优于标准说话人嵌入,表明在远场条件下利用音素信息有助于学习鲁棒的说话人表示。
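
The abstract describes matching frame-level feature maps to wav2vec vectors alongside speaker supervision. Below is a hedged sketch of one way such a joint objective can be written: a speaker classification loss plus an MSE matching loss toward precomputed wav2vec-style phonetic vectors. The architecture, pooling, loss weighting, and all dimensions are illustrative assumptions, not the paper's JTSS implementation.

```python
# Joint objective sketch: speaker classification + frame-level phonetic matching.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerNet(nn.Module):
    def __init__(self, feat_dim=40, frame_dim=256, emb_dim=192, n_speakers=100):
        super().__init__()
        self.frame_encoder = nn.Sequential(nn.Linear(feat_dim, frame_dim), nn.ReLU(),
                                           nn.Linear(frame_dim, frame_dim))
        self.speaker_head = nn.Linear(frame_dim, emb_dim)
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, feats):                        # feats: (B, T, feat_dim)
        frames = self.frame_encoder(feats)           # frame-level feature maps
        emb = self.speaker_head(frames.mean(dim=1))  # utterance-level embedding
        return frames, emb, self.classifier(emb)

net = SpeakerNet()
feats = torch.randn(4, 200, 40)           # e.g. filter-bank frames
wav2vec_vecs = torch.randn(4, 200, 256)   # precomputed phonetic vectors (stand-in)
speaker_ids = torch.randint(0, 100, (4,))

frames, emb, logits = net(feats)
loss = F.cross_entropy(logits, speaker_ids) + 0.1 * F.mse_loss(frames, wav2vec_vecs)
loss.backward()
print(float(loss))
```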

Injecting linguistic knowledge into BERT for Dialogue State Tracking

  • paper_url: http://arxiv.org/abs/2311.15623
  • repo_url: None
  • paper_authors: Xiaohan Feng, Xixin Wu, Helen Meng
  • for: Improving the performance and interpretability of dialogue state tracking (DST) models without extra annotations or training data.
  • methods: Extracts linguistic knowledge with an unsupervised, computationally economical framework and injects it into BERT through simple additional neural modules.
  • results: Uses the Convex Polytopic Model (CPM) as the feature extractor, shows that the acquired features correlate with syntactic and semantic patterns in the dialogues (making the model's decision process easier to understand), and observes a notable accuracy improvement across DST tasks.
    Abstract Dialogue State Tracking (DST) models often employ intricate neural network architectures, necessitating substantial training data, and their inference processes lack transparency. This paper proposes a method that extracts linguistic knowledge via an unsupervised framework and subsequently utilizes this knowledge to augment BERT's performance and interpretability in DST tasks. The knowledge extraction procedure is computationally economical and does not necessitate annotations or additional training data. The injection of the extracted knowledge necessitates the addition of only simple neural modules. We employ the Convex Polytopic Model (CPM) as a feature extraction tool for DST tasks and illustrate that the acquired features correlate with the syntactic and semantic patterns in the dialogues. This correlation facilitates a comprehensive understanding of the linguistic features influencing the DST model's decision-making process. We benchmark this framework on various DST tasks and observe a notable improvement in accuracy.
    摘要 对话状态跟踪(DST)模型经常使用复杂的神经网络架构,需要大量的训练数据,而且在推理过程中缺乏透明度。这篇论文提出了一种方法,通过无监督的框架提取语言知识,然后将这些知识添加到BERT模型中,以提高其性能和可读性在DST任务中。知识提取过程具有计算经济的优点,不需要标注或额外的训练数据。我们使用 convex polytopic model(CPM)作为对话状态任务的特征提取工具,并证明所获取的特征与对话中的 sintactic和semantic 模式呈相关关系。这种相关性使得我们更好地理解DST模型做出决策的语言特征。我们在不同的DST任务上 benchmark 这个框架,并观察到明显的准确率提升。

Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition

  • paper_url: http://arxiv.org/abs/2311.15619
  • repo_url: None
  • paper_authors: Yifei Chen, Dapeng Chen, Ruijin Liu, Sai Zhou, Wenyuan Xue, Wei Peng
  • for: 提高视频分类任务中的表达力和泛化能力,尤其是面对未经见或未经分类的动作类别时。
  • methods: 提出了一种新的“对齐然后适应”（ALT）模式，在这种模式下，首先利用每帧图像中的实体-区域对应关系来进行对齐，然后将对齐实体的文本嵌入作为查询传递给一个基于 Transformer 的视频适配器，以提取视频中最重要实体的语义。
  • results: 在完全监督的情况下，ALT在Kinetics-400上达到 88.1% 的 top-1 准确率，并且在 2-shot 实验中分别在 HMDB-51 和 UCF-101 上超过此前 state-of-the-art 7.1% 和 9.2%。
    Abstract Large-scale visual-language pre-trained models have achieved significant success in various video tasks. However, most existing methods follow an "adapt then align" paradigm, which adapts pre-trained image encoders to model video-level representations and utilizes one-hot or text embedding of the action labels for supervision. This paradigm overlooks the challenge of mapping from static images to complicated activity concepts. In this paper, we propose a novel "Align before Adapt" (ALT) paradigm. Prior to adapting to video representation learning, we exploit the entity-to-region alignments for each frame. The alignments are fulfilled by matching the region-aware image embeddings to an offline-constructed text corpus. With the aligned entities, we feed their text embeddings to a transformer-based video adapter as the queries, which can help extract the semantics of the most important entities from a video to a vector. This paradigm reuses the visual-language alignment of VLP during adaptation and tries to explain an action by the underlying entities. This helps understand actions by bridging the gap with complex activity semantics, particularly when facing unfamiliar or unseen categories. ALT achieves competitive performance and superior generalizability while requiring significantly low computational costs. In fully supervised scenarios, it achieves 88.1% top-1 accuracy on Kinetics-400 with only 4947 GFLOPs. In 2-shot experiments, ALT outperforms the previous state-of-the-art by 7.1% and 9.2% on HMDB-51 and UCF-101, respectively.
    摘要 大规模视觉-语言预训练模型已经在各种视频任务中取得显著成功。然而，现有方法大多采用“适应然后对齐”（adapt then align）模式：先将预训练的图像编码器适配为视频级表示，再使用动作标签的独热编码（one-hot）或文本嵌入作为监督。这种模式忽略了将静态图像映射到复杂活动概念上的挑战。在这篇论文中，我们提出了一种新的“对齐然后适应”（ALT）模式。在适应视频表示学习之前，我们先利用每帧的实体-区域对应关系进行对齐；这些对齐通过将区域感知的图像嵌入与离线构建的文本语料进行匹配来实现。随后，将已对齐实体的文本嵌入作为查询送入基于 Transformer 的视频适配器，从而把视频中最重要实体的语义提取为一个向量。这种模式在适应过程中复用了视觉-语言预训练的对齐能力，并尝试通过底层实体来解释动作，从而弥合与复杂活动语义之间的差距，尤其是在面对不熟悉或未见过的类别时。ALT 在保持极低计算开销的同时取得了有竞争力的性能和更优的泛化能力：在完全监督的情况下，仅用 4947 GFLOPs 就在 Kinetics-400 上达到 88.1% 的 top-1 准确率；在 2-shot 实验中，分别在 HMDB-51 和 UCF-101 上超过此前最先进方法 7.1% 和 9.2%。
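The sketch below illustrates the adapter idea described above: aligned entity text embeddings act as queries in cross-attention over region-aware frame features, pooling the video into a single vector. Layer sizes, the mean pooling, and the residual/LayerNorm layout are assumptions, not the paper's architecture.

```python
import torch.nn as nn

class EntityQueryVideoAdapter(nn.Module):
    """Transformer-style adapter: entity text embeddings query region features."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, entity_text_emb, region_feats):
        # entity_text_emb: (B, E, D) text embeddings of aligned entities
        # region_feats:    (B, T*R, D) region-aware features of all frames
        out, _ = self.attn(entity_text_emb, region_feats, region_feats)
        out = self.norm(out + entity_text_emb)
        return out.mean(dim=1)  # one video-level vector
```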

Spatially Covariant Image Registration with Text Prompts

  • paper_url: http://arxiv.org/abs/2311.15607
  • repo_url: None
  • paper_authors: Hang Zhang, Xiang Chen, Rongguang Wang, Renjiu Hu, Dongdong Liu, Gaolei Li
  • for: 医学图像通常具有结构化的解剖表示和空间不均匀的对比度，利用解剖先验可以提高可变形图像配准在资源受限临床环境中的实用性。
  • methods: 文章提出了一种新方法 textSCF，它将视觉-语言模型编码的解剖区域文本嵌入与空间协变滤波器相结合，通过优化一个关联文本嵌入与滤波器权重的隐函数，放宽了卷积操作通常的平移不变性约束。
  • results: 实验结果表明，textSCF 不仅提升了计算效率，还能保持或提高配准精度。它能够捕捉解剖区域之间的上下文交互，具有出色的跨区域迁移能力，并能在配准过程中保留结构不连续性。在 MICCAI Learn2Reg 2021 挑战中，textSCF 超过了现有最先进模型并位居榜首；在腹部配准任务中，大型模型变体将 Dice 分数较第二名提高 11.3%，而小型变体在保持相近精度的同时减少了 89.13% 的网络参数和 98.34% 的计算操作。
    Abstract Medical images are often characterized by their structured anatomical representations and spatially inhomogeneous contrasts. Leveraging anatomical priors in neural networks can greatly enhance their utility in resource-constrained clinical settings. Prior research has harnessed such information for image segmentation, yet progress in deformable image registration has been modest. Our work introduces textSCF, a novel method that integrates spatially covariant filters and textual anatomical prompts encoded by visual-language models, to fill this gap. This approach optimizes an implicit function that correlates text embeddings of anatomical regions to filter weights, relaxing the typical translation-invariance constraint of convolutional operations. TextSCF not only boosts computational efficiency but can also retain or improve registration accuracy. By capturing the contextual interplay between anatomical regions, it offers impressive inter-regional transferability and the ability to preserve structural discontinuities during registration. TextSCF's performance has been rigorously tested on inter-subject brain MRI and abdominal CT registration tasks, outperforming existing state-of-the-art models in the MICCAI Learn2Reg 2021 challenge and leading the leaderboard. In abdominal registrations, textSCF's larger model variant improved the Dice score by 11.3% over the second-best model, while its smaller variant maintained similar accuracy but with an 89.13% reduction in network parameters and a 98.34\% decrease in computational operations.
    摘要 医学图像通常具有结构化的解剖表示和空间不均匀的对比度。在神经网络中利用解剖先验可以显著提升其在资源受限临床环境中的实用性。已有研究将此类信息用于图像分割，但在可变形图像配准方面进展有限。我们提出了 textSCF，一种将空间协变滤波器与视觉-语言模型编码的解剖文本提示相结合的新方法，以填补这一空白。该方法优化一个将解剖区域文本嵌入与滤波器权重相关联的隐函数，从而放宽卷积操作通常的平移不变性约束。textSCF 不仅提升了计算效率，还能保持或提高配准精度。通过捕捉解剖区域之间的上下文交互，它具有出色的跨区域迁移能力，并能在配准过程中保留结构不连续性。textSCF 的性能在跨个体脑部 MRI 和腹部 CT 配准任务上经过严格测试，在 MICCAI Learn2Reg 2021 挑战中超过现有最先进模型并位居榜首。在腹部配准中，textSCF 的大型模型变体将 Dice 分数较第二名提高 11.3%，而小型变体在保持相近精度的同时减少了 89.13% 的网络参数和 98.34% 的计算操作。
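A very rough sketch of the spatially covariant filtering idea: a small MLP maps each anatomical region's text embedding to filter weights, and each pixel uses the filter of its region. The 1x1 filters, the MLP, and the region-map lookup are illustrative assumptions; textSCF's actual implicit function is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionedFilters(nn.Module):
    """Per-region 1x1 conv weights generated from anatomical text embeddings."""
    def __init__(self, text_dim=512, in_ch=32, out_ch=32):
        super().__init__()
        self.in_ch, self.out_ch = in_ch, out_ch
        self.to_weight = nn.Sequential(
            nn.Linear(text_dim, 256), nn.ReLU(),
            nn.Linear(256, out_ch * in_ch),
        )

    def forward(self, feat, region_map, region_text_emb):
        # feat: (B, C, H, W); region_map: (B, H, W) long region ids
        # region_text_emb: (R, text_dim), one embedding per anatomical region
        w = self.to_weight(region_text_emb)          # (R, out_ch*in_ch)
        w = w.view(-1, self.out_ch, self.in_ch)      # (R, out_ch, in_ch)
        w_pix = w[region_map]                        # (B, H, W, out_ch, in_ch)
        out = torch.einsum("bhwoc,bchw->bohw", w_pix, feat)
        return F.relu(out)
```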

QuickDrop: Efficient Federated Unlearning by Integrated Dataset Distillation

  • paper_url: http://arxiv.org/abs/2311.15603
  • repo_url: None
  • paper_authors: Akash Dhasade, Yaohong Ding, Song Guo, Anne-marie Kermarrec, Martijn De Vos, Leijie Wu
  • for: 这项研究旨在实现联邦遗忘（Federated Unlearning, FU），即从联邦学习（FL）训练得到的模型中删除特定训练数据的影响。
  • methods: QuickDrop 使用数据集蒸馏（Dataset Distillation, DD）来加速遗忘，并相比现有方法大幅降低计算开销。每个客户端用 DD 生成一个代表原始训练数据的紧凑蒸馏数据集，并在遗忘阶段使用该数据集。
  • results: 与从头重新训练模型相比，QuickDrop 将遗忘时间缩短 463.8 倍；与现有 FU 方法相比缩短 65.1 倍。它还能在 100 个客户端的规模下处理多次遗忘操作。
    Abstract Federated Unlearning (FU) aims to delete specific training data from an ML model trained using Federated Learning (FL). We introduce QuickDrop, an efficient and original FU method that utilizes dataset distillation (DD) to accelerate unlearning and drastically reduces computational overhead compared to existing approaches. In QuickDrop, each client uses DD to generate a compact dataset representative of the original training dataset, called a distilled dataset, and uses this compact dataset during unlearning. To unlearn specific knowledge from the global model, QuickDrop has clients execute Stochastic Gradient Ascent with samples from the distilled datasets, thus significantly reducing computational overhead compared to conventional FU methods. We further increase the efficiency of QuickDrop by ingeniously integrating DD into the FL training process. By reusing the gradient updates produced during FL training for DD, the overhead of creating distilled datasets becomes close to negligible. Evaluations on three standard datasets show that, with comparable accuracy guarantees, QuickDrop reduces the duration of unlearning by 463.8x compared to model retraining from scratch and 65.1x compared to existing FU approaches. We also demonstrate the scalability of QuickDrop with 100 clients and show its effectiveness while handling multiple unlearning operations.
    摘要 联邦遗忘（Federated Unlearning, FU）的目标是从使用联邦学习（FL）训练的机器学习模型中删除特定训练数据的影响。我们提出了 QuickDrop，一种高效且新颖的 FU 方法，它利用数据集蒸馏（Dataset Distillation, DD）加速遗忘，并相比现有方法大幅降低计算开销。在 QuickDrop 中，每个客户端使用 DD 生成一个能代表原始训练数据的紧凑数据集（蒸馏数据集），并在遗忘阶段使用该数据集。为了从全局模型中去除特定知识，QuickDrop 让客户端在蒸馏数据集的样本上执行随机梯度上升，从而相较于传统 FU 方法显著降低计算开销。我们进一步将 DD 巧妙地融入 FL 训练过程：通过复用 FL 训练中产生的梯度更新来构建蒸馏数据集，使其构建开销几乎可以忽略不计。在三个标准数据集上的评估表明，在精度相当的前提下，QuickDrop 的遗忘耗时比从头重新训练缩短 463.8 倍，比现有 FU 方法缩短 65.1 倍。我们还展示了 QuickDrop 在 100 个客户端规模下的可扩展性及其处理多次遗忘操作的有效性。
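The unlearning step itself is straightforward to sketch: the abstract describes clients executing Stochastic Gradient Ascent on samples from their distilled datasets. The loop below is a generic illustration of that idea; the learning rate, step count, and optimizer choice are assumptions.

```python
import torch

def unlearn_on_distilled(model, distilled_x, distilled_y, loss_fn,
                         lr=1e-3, steps=10):
    """Gradient ascent on a client's distilled samples to erase their influence."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(distilled_x), distilled_y)
        # ascend the loss (negate it) so the model forgets these samples
        (-loss).backward()
        opt.step()
    return model
```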

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

  • paper_url: http://arxiv.org/abs/2311.15599
  • repo_url: https://github.com/ailab-cvc/unireplknet
  • paper_authors: Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan
  • for: 本文主要研究大核卷积神经网络（ConvNet），特别是其架构设计方法以及在视觉之外多种模态领域中的感知能力。
  • methods: 本文提出了四条架构设计指南，其核心思想是利用大核区别于小核的本质特性——无需加深网络即可获得大感受野。此外，作者还引入了若干与模态相关的预处理技术，使模型在不同领域均能表现出色。
  • results: 该模型在多个任务上取得领先性能，例如 ImageNet 准确率达 88.0%、ADE20K mIoU 达 55.6%、COCO box AP 达 56.4%。此外，模型在时间序列预测和音频识别任务上也达到了最先进（state-of-the-art）的表现。
    Abstract Large-kernel convolutional neural networks (ConvNets) have recently received extensive research attention, but there are two unresolved and critical issues that demand further investigation. 1) The architectures of existing large-kernel ConvNets largely follow the design principles of conventional ConvNets or transformers, while the architectural design for large-kernel ConvNets remains under-addressed. 2) As transformers have dominated multiple modalities, it remains to be investigated whether ConvNets also have a strong universal perception ability in domains beyond vision. In this paper, we contribute from two aspects. 1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels - they can see wide without going deep. Following such guidelines, our proposed large-kernel ConvNet shows leading performance in image recognition. For example, our models achieve an ImageNet accuracy of 88.0%, ADE20K mIoU of 55.6%, and COCO box AP of 56.4%, demonstrating better performance and higher speed than a number of recently proposed powerful competitors. 2) We discover that large kernels are the key to unlocking the exceptional performance of ConvNets in domains where they were originally not proficient. With certain modality-related preprocessing approaches, the proposed model achieves state-of-the-art performance on time-series forecasting and audio recognition tasks even without modality-specific customization to the architecture. Code and all the models at https://github.com/AILab-CVC/UniRepLKNet.
    摘要 大核卷积神经网络（ConvNet）近来受到广泛的研究关注，但仍有两个关键问题亟待解决。1）现有大核 ConvNet 的架构大多沿用传统 ConvNet 或 Transformer 的设计原则，针对大核 ConvNet 本身的架构设计仍未得到充分研究。2）Transformer 已在多种模态中占据主导地位，而 ConvNet 在视觉之外的领域是否同样具备强大的通用感知能力，仍有待探究。本文从两方面做出贡献。1）我们提出了设计大核 ConvNet 的四条架构指南，核心是利用大核区别于小核的本质特性——无需加深网络即可获得大感受野。遵循这些指南，所提出的大核 ConvNet 在图像识别上表现领先：例如在 ImageNet 上达到 88.0% 的准确率、ADE20K mIoU 达 55.6%、COCO box AP 达 56.4%，相比近期提出的多种强力竞争模型具有更好的性能和更高的速度。2）我们发现大核是解锁 ConvNet 在其原本并不擅长领域中卓越表现的关键：借助特定的模态相关预处理方法，所提模型在时间序列预测和音频识别任务上无需针对模态定制架构即可达到最先进水平。代码和所有模型见 https://github.com/AILab-CVC/UniRepLKNet。
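A minimal sketch of the "see wide without going deep" idea: a depthwise large-kernel convolution (with a parallel small-kernel branch) followed by pointwise convolutions. Kernel sizes, normalization, and the block layout are assumptions and do not reproduce UniRepLKNet's guidelines in detail.

```python
import torch.nn as nn

class LargeKernelBlock(nn.Module):
    """Depthwise large-kernel block: wide receptive field without extra depth."""
    def __init__(self, dim, kernel_size=13):
        super().__init__()
        pad = kernel_size // 2
        self.dw_large = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)
        self.dw_small = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        self.pw = nn.Sequential(nn.Conv2d(dim, 4 * dim, 1), nn.GELU(),
                                nn.Conv2d(4 * dim, dim, 1))

    def forward(self, x):
        y = self.norm(self.dw_large(x) + self.dw_small(x))
        return x + self.pw(y)  # residual connection
```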

Networked Multiagent Safe Reinforcement Learning for Low-carbon Demand Management in Distribution Network

  • paper_url: http://arxiv.org/abs/2311.15594
  • repo_url: None
  • paper_authors: Jichen Zhang, Linwei Sang, Yinliang Xu, Hongbin Sun
  • for: 该文章提出了一种基于多智能体的双层运行框架，用于配电网中的低碳需求管理，并考虑需求侧的碳排放配额。
  • methods: 文章使用了分布式柔性负荷智能体、分布式优化方法和安全强化学习算法来求解该问题。
  • results: 案例研究表明，该方法能够满足需求侧的碳排放约束，保证配电网的安全运行，并保护供需双方的隐私。
    Abstract This paper proposes a multiagent based bi-level operation framework for the low-carbon demand management in distribution networks considering the carbon emission allowance on the demand side. In the upper level, the aggregate load agents optimize the control signals for various types of loads to maximize the profits; in the lower level, the distribution network operator makes optimal dispatching decisions to minimize the operational costs and calculates the distribution locational marginal price and carbon intensity. The distributed flexible load agent has only incomplete information of the distribution network and cooperates with other agents using networked communication. Finally, the problem is formulated into a networked multi-agent constrained Markov decision process, which is solved using a safe reinforcement learning algorithm called consensus multi-agent constrained policy optimization considering the carbon emission allowance for each agent. Case studies with the IEEE 33-bus and 123-bus distribution network systems demonstrate the effectiveness of the proposed approach, in terms of satisfying the carbon emission constraint on demand side, ensuring the safe operation of the distribution network and preserving privacy of both sides.
    摘要 本文提出了一种基于多智能体的双层运行框架，用于考虑需求侧碳排放配额的配电网低碳需求管理。在上层，聚合负荷智能体优化各类负荷的控制信号以最大化收益；在下层，配电网运营商做出最优调度决策以最小化运行成本，并计算配电节点边际电价和碳强度。分布式柔性负荷智能体仅掌握配电网的不完整信息，并通过网络化通信与其他智能体协作。最终，该问题被建模为网络化多智能体约束马尔可夫决策过程，并采用一种考虑各智能体碳排放配额的安全强化学习算法（共识多智能体约束策略优化）求解。基于 IEEE 33 节点和 123 节点配电网系统的案例研究表明，所提方法能够满足需求侧碳排放约束、保证配电网安全运行，并保护双方隐私。

Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2311.16201
  • repo_url: None
  • paper_authors: Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, Alexander Toshev
  • for: 这篇论文旨在探讨自回归方式的文本到图像生成能否受益于预训练语言模型，并发现预训练语言模型对自回归图像生成的帮助有限。
  • methods: 这篇论文将预训练语言模型改造用于自回归文本到图像生成，并分析了两种模态的 token，发现图像 token 与文本 token 之间存在很大的语义差异，使得预训练语言模型无法比随机初始化的模型更有效地建模图像 token。
  • results: 研究发现，图文数据集中的文本 token 相比通常的语言模型预训练数据过于简单，导致语言模型能力出现灾难性退化，因而预训练语言模型在图文任务上表现不佳。
    Abstract Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in the image-text datasets are too simple compared to normal language model pre-training data, which causes the catastrophic degradation of language models' capability.
    摘要 近期图像分词器（如 VQ-VAE）的进展使得文本到图像生成可以采用类似语言建模的自回归方法。然而，这些方法尚未利用预训练语言模型，尽管后者可以适应多种下游任务。在这项工作中，我们探索这一空白：将预训练语言模型改造用于自回归文本到图像生成，发现预训练语言模型带来的帮助有限。我们通过分析两种模态的 token 给出双重解释：首先，图像 token 在语义上与文本 token 有很大差异，使预训练语言模型在建模图像 token 时并不比随机初始化的模型更有效；其次，图文数据集中的文本 token 相比通常的语言模型预训练数据过于简单，导致语言模型能力出现灾难性退化。
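For context, the sketch below shows the auto-regressive setup the paper studies: text tokens and VQ image tokens share one vocabulary, and a causal Transformer predicts the next token. The tiny model, vocabulary sizes, and single joint embedding table are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyAutoregressiveT2I(nn.Module):
    """Causal next-token prediction over concatenated text and image tokens."""
    def __init__(self, text_vocab=32000, image_vocab=8192, dim=512, layers=4):
        super().__init__()
        self.embed = nn.Embedding(text_vocab + image_vocab, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.head = nn.Linear(dim, text_vocab + image_vocab)

    def forward(self, tokens):
        # tokens: (B, L) text ids followed by (shifted) image token ids
        L = tokens.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool,
                                       device=tokens.device), diagonal=1)
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.head(h)  # next-token logits over the joint vocabulary
```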

Improving Adaptability and Generalizability of Efficient Transfer Learning for Vision-Language Models

  • paper_url: http://arxiv.org/abs/2311.15569
  • repo_url: None
  • paper_authors: Yongjin Yang, Jongwoo Ko, Se-Young Yun
  • for: 这 paper 旨在解释CLIP类vision-language模型在不同下游任务中的应用,以及如何使用提示或适应器进行高效的转移学习。
  • methods: 这篇论文使用视觉提示和文本适配器来研究 VLM 的整体行为，并提出了一种自适应集成（adaptive ensemble）方法，根据迁移难度结合通用知识与任务特定知识。
  • results: 实验结果表明，使用视觉提示提升类别可分性、使用文本适配器进行任务适配，并结合我们提出的自适应集成方法，可以在不同领域中提升泛化能力与任务特定知识的迁移。
    Abstract Vision-Language Models (VLMs) like CLIP have demonstrated remarkable applicability across a variety of downstream tasks, including zero-shot image classification. Recently, the use of prompts or adapters for efficient transfer learning has gained significant attention for effectively adapting to downstream tasks. However, the roles of vision and text prompts, as well as adapters in terms of generalization and transfer difficulty, have been overlooked, limiting performance on unseen tasks. In this paper, we empirically analyze how VLMs behave when using vision and text prompts, adapters, and a combination of these components, marking a novel exploration by our study. Our observations find that utilizing vision prompts for class separability and text adapters for task adaptation is crucial for adaptability and generalizability. Moreover, to improve generalization across every domain, we propose an adaptive ensemble method that effectively combines the general knowledge of VLMs with task-specific knowledge according to transfer difficulty. Upon experimenting with extensive benchmarks, our method consistently outperforms all baselines, particularly on unseen tasks, demonstrating the effectiveness of our proposed approach.
    摘要 CLIP 等视觉-语言模型（VLM）在多种下游任务中表现出色，包括零样本图像分类。近期，使用提示或适配器进行高效迁移学习以适应下游任务受到了广泛关注。然而，视觉提示和文本提示以及适配器在泛化能力和迁移难度方面的作用尚未得到充分探讨，这限制了模型在未见任务上的表现。本文通过实验分析了 VLM 在使用视觉提示、文本提示、适配器及其组合时的行为，这是一项新的探索。我们的观察发现，利用视觉提示提升类别可分性、利用文本适配器进行任务适配，对于适应性和泛化能力至关重要。此外，为了在每个领域中提升泛化能力，我们提出了一种自适应集成方法，依据迁移难度有效结合 VLM 的通用知识与任务特定知识。在大量基准上的实验表明，我们的方法持续优于所有基线，特别是在未见任务上，证明了所提方法的有效性。
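A hypothetical sketch of the adaptive-ensemble idea: blend the VLM's zero-shot (general) predictions with the adapted (task-specific) predictions, leaning more on general knowledge when the transfer is judged difficult. The paper's actual difficulty measure and weighting rule are not reproduced; `transfer_difficulty` in [0, 1] is an assumption.

```python
import torch

def adaptive_ensemble(zero_shot_logits, adapter_logits, transfer_difficulty):
    """Difficulty-weighted blend of general and task-specific predictions."""
    w = torch.clamp(torch.as_tensor(transfer_difficulty), 0.0, 1.0)
    # harder transfer -> rely more on the VLM's general (zero-shot) knowledge
    return w * zero_shot_logits + (1.0 - w) * adapter_logits
```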

Evaluating the Efficacy of Hybrid Deep Learning Models in Distinguishing AI-Generated Text

  • paper_url: http://arxiv.org/abs/2311.15565
  • repo_url: None
  • paper_authors: Finbarrs Oketunji
  • for: 本研究探讨了利用现代混合深度学习模型准确区分人工生成的文本和人类写作。
  • methods: 我们采用了一套可靠的方法论，利用从多个来源挑选、并附有指令标注的 AI 文本与人类文本数据集。先进的自然语言处理技术支持文本特征分析；通过结合复杂的神经网络，自定义模型能够探测 AI 内容与人类内容之间的细微差异。
  • results: 研究结果表明,自定义模型可以准确地区分AI和人类文本,并且在不同的文本类型和大小上具有高度的泛化能力。
    Abstract My research investigates the use of cutting-edge hybrid deep learning models to accurately differentiate between AI-generated text and human writing. I applied a robust methodology, utilising a carefully selected dataset comprising AI and human texts from various sources, each tagged with instructions. Advanced natural language processing techniques facilitated the analysis of textual features. Combining sophisticated neural networks, the custom model enabled it to detect nuanced differences between AI and human content.
    摘要 我的研究探讨了使用前沿混合深度学习模型来准确地分辨 AI 生成的文本和人类写作。我采用了一套可靠的方法论，利用从多个来源挑选、并附有指令标注的 AI 文本与人类文本数据集，并通过先进的自然语言处理技术进行文本特征分析。通过结合复杂的神经网络，我的自定义模型能够检测 AI 内容与人类内容之间的细微差别。

Instruct2Attack: Language-Guided Semantic Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2311.15551
  • repo_url: None
  • paper_authors: Jiang Liu, Chen Wei, Yuxiang Guo, Heng Yu, Alan Yuille, Soheil Feizi, Chun Pong Lau, Rama Chellappa
  • for: 本文旨在开发一种语言引导的语义攻击，能够根据自由形式的语言指令生成具有语义含义的扰动。
  • methods: 该攻击利用现成的潜在扩散模型，通过对逆扩散过程施加对抗引导，搜索以输入图像和文本指令为条件的对抗潜在编码。
  • results: 与现有的基于噪声的攻击和语义攻击相比，I2A 能生成更自然、更多样的对抗样本，同时提供更好的可控性和可解释性。作者使用 GPT-4 自动生成与图像相关的文本指令，成功攻破了采用强对抗防御的最先进深度神经网络，并在多种网络架构间展现出良好的迁移性。
    Abstract We propose Instruct2Attack (I2A), a language-guided semantic attack that generates semantically meaningful perturbations according to free-form language instructions. We make use of state-of-the-art latent diffusion models, where we adversarially guide the reverse diffusion process to search for an adversarial latent code conditioned on the input image and text instruction. Compared to existing noise-based and semantic attacks, I2A generates more natural and diverse adversarial examples while providing better controllability and interpretability. We further automate the attack process with GPT-4 to generate diverse image-specific text instructions. We show that I2A can successfully break state-of-the-art deep neural networks even under strong adversarial defenses, and demonstrate great transferability among a variety of network architectures.
    摘要 我们提出 Instruct2Attack（I2A），一种语言引导的语义攻击，能够根据自由形式的文本指令生成具有语义含义的扰动。我们利用最先进的潜在扩散模型，以对抗方式引导逆扩散过程，搜索以输入图像和文本指令为条件的对抗潜在编码。与现有的基于噪声的攻击和语义攻击相比，I2A 能生成更自然、更多样的对抗样本，同时提供更好的可控性和可解释性。我们进一步使用 GPT-4 自动生成与图像相关的多样化文本指令。实验表明，即便在强对抗防御下，I2A 也能成功攻破最先进的深度神经网络，并在多种网络架构之间展现出良好的迁移性。
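The sketch below illustrates one way to "adversarially guide the reverse diffusion process": at each step, take a gradient of the victim classifier's loss with respect to the latent and nudge the latent toward the attack target. The denoiser/classifier interfaces and the single-gradient update are assumptions, not I2A's exact procedure.

```python
import torch
import torch.nn.functional as F

def adversarial_guidance_step(latent, denoiser, classifier, target_label,
                              t, guidance_scale=2.0):
    """One reverse-diffusion step with classifier-based adversarial guidance."""
    latent = latent.detach().requires_grad_(True)
    denoised = denoiser(latent, t)          # assumed: returns a denoised latent
    logits = classifier(denoised)           # assumed: differentiable victim model
    target = torch.full((logits.size(0),), target_label,
                        device=logits.device, dtype=torch.long)
    loss = F.cross_entropy(logits, target)
    grad = torch.autograd.grad(loss, latent)[0]
    # descend the target-class loss on the latent to steer the sample
    return (latent - guidance_scale * grad).detach()
```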

From Prediction to Action: The Critical Role of Proper Performance Estimation for Machine-Learning-Driven Materials Discovery

  • paper_url: http://arxiv.org/abs/2311.15549
  • repo_url: None
  • paper_authors: Mario Boley, Felix Luong, Simon Teshuva, Daniel F Schmidt, Lucas Foppa, Matthias Scheffler
  • for: 本研究旨在提高数据驱动材料发现的效率和可靠性，即基于统计性质模型进行迭代决策。
  • methods: 本研究分析了以模型为指导的采集函数（acquisition function）驱动的迭代发现过程，其目标是随时间最大化某种“奖励”，例如迄今发现的最大性质值。
  • results: 研究发现，传统的分布内预测性能并不直接对应发现奖励。此外，本文提出了一种新的性能估计方法，能够从四个候选方案中成功预测出“高斯过程 + 期望改进（expected improvement）采集函数”为最佳选择。
    Abstract Materials discovery driven by statistical property models is an iterative decision process, during which an initial data collection is extended with new data proposed by a model-informed acquisition function--with the goal to maximize a certain "reward" over time, such as the maximum property value discovered so far. While the materials science community achieved much progress in developing property models that predict well on average with respect to the training distribution, this form of in-distribution performance measurement is not directly coupled with the discovery reward. This is because an iterative discovery process has a shifting reward distribution that is over-proportionally determined by the model performance for exceptional materials. We demonstrate this problem using the example of bulk modulus maximization among double perovskite oxides. We find that the in-distribution predictive performance suggests random forests as superior to Gaussian process regression, while the results are inverse in terms of the discovery rewards. We argue that the lack of proper performance estimation methods from pre-computed data collections is a fundamental problem for improving data-driven materials discovery, and we propose a novel such estimator that, in contrast to na\"ive reward estimation, successfully predicts Gaussian processes with the "expected improvement" acquisition function as the best out of four options in our demonstrational study for double perovskites. Importantly, it does so without requiring the over thousand ab initio computations that were needed to confirm this prediction.
    摘要 由统计性质模型驱动的材料发现是一个迭代决策过程：初始数据集不断被模型指导的采集函数所提议的新数据扩充，目标是随时间最大化某种“奖励”，例如迄今发现的最大性质值。尽管材料科学界在开发相对训练分布平均预测良好的性质模型方面取得了很大进展，但这种分布内的性能度量并不直接与发现奖励挂钩。这是因为迭代发现过程的奖励分布不断变化，并在很大程度上取决于模型对特殊（异常优异）材料的预测表现。我们以双钙钛矿氧化物的体积模量最大化为例演示了这一问题：分布内的预测性能显示随机森林优于高斯过程回归，而从发现奖励来看结论恰好相反。我们认为，缺乏能够基于预先计算的数据集进行恰当性能估计的方法，是改进数据驱动材料发现的一个根本性问题，并为此提出了一种新的估计器。与朴素的奖励估计不同，该估计器在我们针对双钙钛矿的演示研究中成功地从四个候选方案中预测出“高斯过程 + 期望改进采集函数”为最佳选择，而且无需为确认这一预测所需的上千次第一性原理计算。
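The acquisition function named in this entry has a standard closed form; the sketch below computes expected improvement for maximization from a Gaussian-process posterior. It only illustrates the acquisition step, not the paper's proposed performance estimator.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far, xi=0.0):
    """Expected improvement (maximization) given GP posterior mean and std."""
    sigma = np.maximum(sigma, 1e-12)          # avoid division by zero
    z = (mu - best_so_far - xi) / sigma
    return (mu - best_so_far - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# usage sketch: next_idx = np.argmax(expected_improvement(mu, sigma, y_max))
```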

Deficiency of Large Language Models in Finance: An Empirical Examination of Hallucination

  • paper_url: http://arxiv.org/abs/2311.15548
  • repo_url: None
  • paper_authors: Haoqiang Kang, Xiao-Yang Liu
  • for: This paper aims to empirically investigate the hallucination behaviors of large language models (LLMs) in financial tasks, and to evaluate the effectiveness of four practical methods for mitigating these behaviors.
  • methods: The paper uses empirical investigation and evaluation of four practical methods to study the hallucination behaviors of LLMs in financial tasks. The methods include few-shot learning, Decoding by Contrasting Layers (DoLa), the Retrieval Augmentation Generation (RAG) method, and the prompt-based tool learning method.
  • results: The paper finds that off-the-shelf LLMs experience serious hallucination behaviors in financial tasks, highlighting the urgent need for research efforts to mitigate these behaviors.
    Abstract The hallucination issue is recognized as a fundamental deficiency of large language models (LLMs), especially when applied to fields such as finance, education, and law. Despite the growing concerns, there has been a lack of empirical investigation. In this paper, we provide an empirical examination of LLMs' hallucination behaviors in financial tasks. First, we empirically investigate LLM model's ability of explaining financial concepts and terminologies. Second, we assess LLM models' capacity of querying historical stock prices. Third, to alleviate the hallucination issue, we evaluate the efficacy of four practical methods, including few-shot learning, Decoding by Contrasting Layers (DoLa), the Retrieval Augmentation Generation (RAG) method and the prompt-based tool learning method for a function to generate a query command. Finally, our major finding is that off-the-shelf LLMs experience serious hallucination behaviors in financial tasks. Therefore, there is an urgent need to call for research efforts in mitigating LLMs' hallucination.
    摘要 幻觉问题被认为是大语言模型（LLM）的一项基本缺陷，在金融、教育和法律等领域尤为突出。尽管担忧日益增加，相关的实证研究仍然缺乏。本文对 LLM 在金融任务中的幻觉行为进行了实证考察。首先，我们实证检验了 LLM 解释金融概念和术语的能力；其次，我们评估了 LLM 查询历史股价的能力；第三，为缓解幻觉问题，我们评估了四种实用方法的有效性，包括少样本学习、对比层解码（DoLa）、检索增强生成（RAG）方法以及基于提示的工具学习方法（用于生成查询命令的函数）。最后，我们的主要发现是：现成的 LLM 在金融任务中存在严重的幻觉行为，因此迫切需要开展缓解 LLM 幻觉的研究工作。
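As a simple illustration of the Retrieval Augmentation Generation (RAG) mitigation evaluated in the paper, the helper below assembles a prompt that forces the model to answer from retrieved financial passages. The prompt wording and passage formatting are assumptions, not the paper's template.

```python
def build_rag_prompt(question, retrieved_passages, max_passages=3):
    """Assemble a retrieval-augmented prompt from retrieved context passages."""
    context = "\n\n".join(
        f"[{i + 1}] {p}" for i, p in enumerate(retrieved_passages[:max_passages]))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```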

Out-of-Distribution Generalized Dynamic Graph Neural Network for Human Albumin Prediction

  • paper_url: http://arxiv.org/abs/2311.15545
  • repo_url: None
  • paper_authors: Zeyang Zhang, Xingwang Li, Fei Teng, Ning Lin, Xueling Zhu, Xin Wang, Wenwu Zhu
  • for: 预测人体血清白蛋白水平，以便为重症患者维持最佳的血液白蛋白水平。
  • methods: 我们提出了名为 DyG-HAP（Out-of-Distribution Generalized Dynamic Graph Neural Network for Human Albumin Prediction）的框架，用于预测重症监护病房（ICU）患者住院期间的白蛋白水平。我们首先将人体白蛋白预测建模为动态图回归问题，以刻画动态变化和患者之间的关系；然后提出一种解耦的动态图注意力机制，以捕捉并区分不同类型的模式；最后提出一种不变动态图回归方法，鼓励模型依靠不变模式进行预测。
  • results: 与多个基线方法相比，我们的方法在人体白蛋白预测中表现出色，取得了更高的准确率。
    Abstract Human albumin is essential for indicating the body's overall health. Accurately predicting plasma albumin levels and determining appropriate doses are urgent clinical challenges, particularly in critically ill patients, to maintain optimal blood levels. However, human albumin prediction is non-trivial that has to leverage the dynamics of biochemical markers as well as the experience of treating patients. Moreover, the problem of distribution shift is often encountered in real clinical data, which may lead to a decline in the model prediction performance and reduce the reliability of the model's application. In this paper, we propose a framework named Out-of-Distribution Generalized Dynamic Graph Neural Network for Human Albumin Prediction (DyG-HAP), which is able to provide accurate albumin predictions for Intensity Care Unit (ICU) patients during hospitalization. We first model human albumin prediction as a dynamic graph regression problem to model the dynamics and patient relationship. Then, we propose a disentangled dynamic graph attention mechanism to capture and disentangle the patterns whose relationship to labels under distribution shifts is invariant and variant respectively. Last, we propose an invariant dynamic graph regression method to encourage the model to rely on invariant patterns to make predictions. Moreover, we propose a dataset named Albumin level testing and nutritional dosing data for Intensive Care (ANIC) for evaluation. Extensive experiments demonstrate the superiority of our method compared to several baseline methods in human albumin prediction.
    摘要 人体白蛋白是反映身体整体健康状况的重要指标。准确预测血浆白蛋白水平并确定合适的剂量是紧迫的临床挑战，特别是在重症患者中，以维持最佳的血液水平。然而，人体白蛋白预测并非易事，既要利用生化指标的动态变化，也要借鉴治疗患者的经验。此外，真实临床数据中经常出现分布偏移问题，可能导致模型预测性能下降，降低模型应用的可靠性。本文提出了名为 DyG-HAP（Out-of-Distribution Generalized Dynamic Graph Neural Network for Human Albumin Prediction）的框架，能够在住院期间为重症监护病房（ICU）患者提供准确的白蛋白预测。我们首先将人体白蛋白预测建模为动态图回归问题，以刻画动态变化和患者之间的关系；然后提出一种解耦的动态图注意力机制，以捕捉并区分在分布偏移下与标签关系不变和可变的模式；最后提出一种不变动态图回归方法，鼓励模型依靠不变模式进行预测。此外，我们还构建了名为 ANIC（Albumin level testing and nutritional dosing data for Intensive Care）的数据集用于评估。大量实验表明，我们的方法在人体白蛋白预测上优于多种基线方法。

MI-Gen: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images

  • paper_url: http://arxiv.org/abs/2311.16480
  • repo_url: None
  • paper_authors: Pingyi Chen, Honglin Li, Chenglu Zhu, Sunyi Zheng, Lin Yang
  • for: 全切片图像是数字病理学中癌症诊断与治疗的基础；本文旨在根据全切片图像自动生成病理报告，以减轻病理医生的工作量并提升临床自动化水平。
  • methods: 本文构建了目前最大的 WSI-文本数据集（TCGA-PathoText），并提出多示例生成模型（MI-Gen），用于为千兆像素级全切片图像生成病理报告。
  • results: 实验结果表明，该模型能够生成包含多条临床线索的病理报告，并可迁移到癌症分级和表型分析等下游诊断任务；仅对报告进行简单的语义提取即可在 BRCA 分型上取得最佳表现（F1 分数 0.838），无需额外参数或复杂微调。
    Abstract Whole slide images are the foundation of digital pathology for the diagnosis and treatment of carcinomas. Writing pathology reports is laborious and error-prone for inexperienced pathologists. To reduce the workload and improve clinical automation, we investigate how to generate pathology reports given whole slide images. On the data end, we curated the largest WSI-text dataset (TCGA-PathoText). In specific, we collected nearly 10000 high-quality WSI-text pairs for visual-language models by recognizing and cleaning pathology reports which narrate diagnostic slides in TCGA. On the model end, we propose the multiple instance generative model (MI-Gen) which can produce pathology reports for gigapixel WSIs. We benchmark our model on the largest subset of TCGA-PathoText. Experimental results show our model can generate pathology reports which contain multiple clinical clues. Furthermore, WSI-text prediction can be seen as an approach of visual-language pre-training, which enables our model to be transferred to downstream diagnostic tasks like carcinoma grading and phenotyping. We observe that simple semantic extraction from the pathology reports can achieve the best performance (0.838 of F1 score) on BRCA subtyping without adding extra parameters or tricky fine-tuning. Our collected dataset and related code will all be publicly available.
    摘要 全切片图像（WSI）是数字病理学中癌症诊断与治疗的基础。撰写病理报告对经验不足的病理医生而言费时费力且容易出错。为减轻工作量并提升临床自动化水平，我们研究如何根据全切片图像生成病理报告。在数据方面，我们构建了目前最大的 WSI-文本数据集（TCGA-PathoText）：通过识别并清洗 TCGA 中描述诊断切片的病理报告，收集了近 10000 对高质量的 WSI-文本数据，可用于视觉-语言模型。在模型方面，我们提出多示例生成模型（MI-Gen），能够为千兆像素级 WSI 生成病理报告。我们在 TCGA-PathoText 的最大子集上对模型进行了基准测试，实验结果表明模型生成的病理报告包含多条临床线索。此外，WSI-文本预测可被视为一种视觉-语言预训练方式，使模型能够迁移到癌症分级和表型分析等下游诊断任务。我们观察到，仅对病理报告进行简单的语义提取即可在 BRCA 分型上取得最佳表现（F1 分数 0.838），无需增加额外参数或复杂微调。我们收集的数据集和相关代码都将公开。

SSIN: Self-Supervised Learning for Rainfall Spatial Interpolation

  • paper_url: http://arxiv.org/abs/2311.15530
  • repo_url: https://github.com/jlidw/ssin
  • paper_authors: Jia Li, Yanyan Shen, Lei Chen, Charles Wang Wai NG
  • for: 这篇论文的目的是提出一个新的数据驱动自监督学习框架（SSIN），用于降水的空间插值。
  • methods: SSIN 使用基于 Transformer 架构的 SpaFormer 模型，通过随机掩码构建丰富的自监督信号，从降水数据中学习有用的嵌入表示。
  • results: 实验结果显示，SSIN 在两个真实世界雨量站数据集上超过现有方法的性能。此外，SSIN 还在一个大型真实世界交通数据集上取得了最佳表现，证明了该方法的有效性和通用性。
    Abstract The acquisition of accurate rainfall distribution in space is an important task in hydrological analysis and natural disaster pre-warning. However, it is impossible to install rain gauges on every corner. Spatial interpolation is a common way to infer rainfall distribution based on available raingauge data. However, the existing works rely on some unrealistic pre-settings to capture spatial correlations, which limits their performance in real scenarios. To tackle this issue, we propose the SSIN, which is a novel data-driven self-supervised learning framework for rainfall spatial interpolation by mining latent spatial patterns from historical observation data. Inspired by the Cloze task and BERT, we fully consider the characteristics of spatial interpolation and design the SpaFormer model based on the Transformer architecture as the core of SSIN. Our main idea is: by constructing rich self-supervision signals via random masking, SpaFormer can learn informative embeddings for raw data and then adaptively model spatial correlations based on rainfall spatial context. Extensive experiments on two real-world raingauge datasets show that our method outperforms the state-of-the-art solutions. In addition, we take traffic spatial interpolation as another use case to further explore the performance of our method, and SpaFormer achieves the best performance on one large real-world traffic dataset, which further confirms the effectiveness and generality of our method.
    摘要 获取空间上准确的降水分布是水文分析和自然灾害预警中的重要任务。然而，不可能在每个角落都安装雨量计。空间插值是基于现有雨量站数据推断降水分布的常用方法，但现有工作依赖一些不切实际的预设来刻画空间相关性，限制了其在真实场景中的表现。为解决这一问题，我们提出 SSIN——一种新的数据驱动自监督学习框架，通过挖掘历史观测数据中的潜在空间模式来进行降水空间插值。受完形填空任务和 BERT 的启发，我们充分考虑空间插值的特点，基于 Transformer 架构设计了 SpaFormer 模型作为 SSIN 的核心。我们的主要思路是：通过随机掩码构建丰富的自监督信号，SpaFormer 可以为原始数据学习有信息量的嵌入，进而基于降水的空间上下文自适应地建模空间相关性。在两个真实世界雨量站数据集上的大量实验表明，我们的方法优于现有最先进方案。此外，我们以交通空间插值作为另一个应用场景进一步考察方法性能，SpaFormer 在一个大型真实交通数据集上取得最佳表现，进一步验证了方法的有效性和通用性。
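The self-supervision signal described above can be sketched as a masked-reconstruction step: hide a random subset of gauge readings and train the model to recover them from the visible stations. The `model(coords, masked_values, mask)` interface and the mask ratio are assumptions, not SpaFormer's actual API.

```python
import torch

def masked_interpolation_step(model, coords, values, mask_ratio=0.3):
    """One self-supervised step: reconstruct randomly masked gauge readings."""
    B, N = values.shape
    mask = torch.rand(B, N, device=values.device) < mask_ratio  # True = hidden
    masked_values = values.masked_fill(mask, 0.0)
    pred = model(coords, masked_values, mask)                   # (B, N) predictions
    loss = ((pred - values) ** 2)[mask].mean()                  # loss on masked stations
    loss.backward()   # caller zeroes gradients and steps the optimizer
    return loss.detach()
```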

Generation of patient specific cardiac chamber models using generative neural networks under a Bayesian framework for electroanatomical mapping

  • paper_url: http://arxiv.org/abs/2311.16197
  • repo_url: None
  • paper_authors: Sunil Mathew, Jasbir Sra, Daniel B. Rowe
  • for: 用于心脏消融手术中的诊断、治疗规划和实时导航。
  • methods: 使用概率机器学习模型，在贝叶斯框架下对心腔模型进行表面重建。
  • results: 可以从少量的 3D 点云数据生成准确的心腔模型，减少手术时间和 X 射线暴露；此外，贝叶斯方法为模型提供了自然的可解释性框架，可以洞察神经网络从用于训练的 CT/MRI 分割图像中学到了什么。
    Abstract Electroanatomical mapping is a technique used in cardiology to create a detailed 3D map of the electrical activity in the heart. It is useful for diagnosis, treatment planning and real time guidance in cardiac ablation procedures to treat arrhythmias like atrial fibrillation. A probabilistic machine learning model trained on a library of CT/MRI scans of the heart can be used during electroanatomical mapping to generate a patient-specific 3D model of the chamber being mapped. The use of probabilistic machine learning models under a Bayesian framework provides a way to quantify uncertainty in results and provide a natural framework of interpretability of the model. Here we introduce a Bayesian approach to surface reconstruction of cardiac chamber models from a sparse 3D point cloud data acquired during electroanatomical mapping. We show how probabilistic graphical models trained on segmented CT/MRI data can be used to generate cardiac chamber models from few acquired locations thereby reducing procedure time and x-ray exposure. We show how they provide insight into what the neural network learns from the segmented CT/MRI images used to train the network, which provides explainability to the resulting cardiac chamber models generated by the model.
    摘要 电解剖标测是心脏病学中用于构建心脏电活动详细三维图的一种技术，可用于诊断、治疗规划以及心脏消融手术（治疗房颤等心律失常）中的实时导航。在电解剖标测过程中，可以利用在心脏 CT/MRI 扫描图像库上训练的概率机器学习模型，为所标测的心腔生成患者特异的三维模型。在贝叶斯框架下使用概率机器学习模型，既可以量化结果中的不确定性，又为模型提供了自然的可解释性框架。本文提出了一种贝叶斯方法，用于从电解剖标测过程中获取的稀疏三维点云数据重建心腔模型表面。我们展示了在分割后的 CT/MRI 数据上训练的概率图模型如何仅凭少量采样位置生成心腔模型，从而缩短手术时间并减少 X 射线暴露。我们还展示了该方法如何揭示神经网络从训练所用的 CT/MRI 分割图像中学到了什么，从而为模型生成的心腔模型提供可解释性。

Active Foundational Models for Fault Diagnosis of Electrical Motors

  • paper_url: http://arxiv.org/abs/2311.15516
  • repo_url: None
  • paper_authors: Sriram Anbalagan, Sai Shashank GP, Deepesh Agarwal, Balasubramaniam Natarajan, Babji Srinivasan
  • for: 这篇研究旨在提高电动机异常探测和诊断的精度和可靠性,以确保各种工业系统的安全和可靠运行。
  • methods: 本研究提出了一种基于基础模型的主动学习框架，只需少量最具信息量的标注样本，并通过结合主动学习与对比自监督学习技术，充分利用大量可用的无标注状态监测数据。
  • results: 实验评估结果显示，所提方法在多个机器轴承故障诊断任务上仅使用更少量的标注数据，即可取得优于现有最先进方法的表现。
    Abstract Fault detection and diagnosis of electrical motors are of utmost importance in ensuring the safe and reliable operation of several industrial systems. Detection and diagnosis of faults at the incipient stage allows corrective actions to be taken in order to reduce the severity of faults. The existing data-driven deep learning approaches for machine fault diagnosis rely extensively on huge amounts of labeled samples, where annotations are expensive and time-consuming. However, a major portion of unlabeled condition monitoring data is not exploited in the training process. To overcome this limitation, we propose a foundational model-based Active Learning framework that utilizes less amount of labeled samples, which are most informative and harnesses a large amount of available unlabeled data by effectively combining Active Learning and Contrastive Self-Supervised Learning techniques. It consists of a transformer network-based backbone model trained using an advanced nearest-neighbor contrastive self-supervised learning method. This approach empowers the backbone to learn improved representations of samples derived from raw, unlabeled vibration data. Subsequently, the backbone can undergo fine-tuning to address a range of downstream tasks, both within the same machines and across different machines. The effectiveness of the proposed methodology has been assessed through the fine-tuning of the backbone for multiple target tasks using three distinct machine-bearing fault datasets. The experimental evaluation demonstrates a superior performance as compared to existing state-of-the-art fault diagnosis methods with less amount of labeled data.
    摘要 电机故障检测与诊断对于保障各类工业系统的安全可靠运行至关重要。在故障初期进行检测与诊断，可以及时采取纠正措施以降低故障的严重程度。现有的数据驱动深度学习故障诊断方法严重依赖大量标注样本，而标注既昂贵又耗时；与此同时，大量无标注的状态监测数据并未在训练过程中被利用。为克服这一局限，我们提出了一种基于基础模型的主动学习框架：通过将主动学习与对比自监督学习有效结合，只需少量最具信息量的标注样本，同时充分利用大量可用的无标注数据。该框架以基于 Transformer 网络的骨干模型为核心，采用先进的最近邻对比自监督学习方法进行训练，使骨干模型能够从原始无标注振动数据中学习更好的样本表示；随后，骨干模型可以进行微调，以应对同一设备内以及跨设备的多种下游任务。我们在三个不同的机器轴承故障数据集上对骨干模型进行多目标任务微调，以评估所提方法的有效性。实验评估表明，在使用更少标注数据的情况下，该方法的性能优于现有最先进的故障诊断方法。
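A sketch of a nearest-neighbour contrastive objective of the kind this framework builds on: each query embedding is contrasted against the support-bank nearest neighbour of its augmented view. Batch/bank handling is simplified and the exact loss used in the paper is not reproduced.

```python
import torch
import torch.nn.functional as F

def nn_contrastive_loss(queries, keys, support_bank, temperature=0.1):
    """Nearest-neighbour contrastive loss (NNCLR-style), simplified."""
    queries = F.normalize(queries, dim=-1)          # view-1 embeddings (B, D)
    keys = F.normalize(keys, dim=-1)                # view-2 embeddings (B, D)
    bank = F.normalize(support_bank, dim=-1)        # support bank     (M, D)
    # replace each key by its nearest neighbour in the support bank
    nn_idx = (keys @ bank.t()).argmax(dim=-1)
    positives = bank[nn_idx]                        # (B, D)
    logits = queries @ positives.t() / temperature  # (B, B)
    labels = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, labels)
```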

Improving Word Sense Disambiguation in Neural Machine Translation with Salient Document Context

  • paper_url: http://arxiv.org/abs/2311.15507
  • repo_url: None
  • paper_authors: Elijah Rippeth, Marine Carpuat, Kevin Duh, Matt Post
  • for: 解决机器翻译中的词汇歧义（lexical ambiguity）问题。
  • methods: 通过在神经机器翻译模型中加入少量句外（extra-sentential）上下文来消解翻译歧义。
  • results: 在翻译歧义源词方面优于强句子级基线，并与文档级基线相当，同时降低了训练成本。
    Abstract Lexical ambiguity is a challenging and pervasive problem in machine translation (\mt). We introduce a simple and scalable approach to resolve translation ambiguity by incorporating a small amount of extra-sentential context in neural \mt. Our approach requires no sense annotation and no change to standard model architectures. Since actual document context is not available for the vast majority of \mt training data, we collect related sentences for each input to construct pseudo-documents. Salient words from pseudo-documents are then encoded as a prefix to each source sentence to condition the generation of the translation. To evaluate, we release \docmucow, a challenge set for translation disambiguation based on the English-German \mucow \cite{raganato-etal-2020-evaluation} augmented with document IDs. Extensive experiments show that our method translates ambiguous source words better than strong sentence-level baselines and comparable document-level baselines while reducing training costs.
    摘要 词汇歧义是机器翻译（MT）中一个具有挑战性且普遍存在的问题。我们提出了一种简单且可扩展的方法，通过在神经机器翻译中引入少量句外上下文来消解翻译歧义。该方法既不需要词义标注，也无需改动标准模型架构。由于绝大多数机器翻译训练数据缺乏真实的文档上下文，我们为每条输入收集相关句子以构建伪文档，然后将伪文档中的显著词编码为每个源句的前缀，以此为条件生成译文。为了评估，我们发布了 \docmucow——一个在英德 \mucow \cite{raganato-etal-2020-evaluation} 基础上加入文档 ID 的翻译消歧挑战集。大量实验表明，我们的方法在翻译歧义源词方面优于强句子级基线，与文档级基线相当，同时降低了训练成本。
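The prefix construction can be sketched as follows: pick salient words from the pseudo-document and prepend them to the source sentence before translation. The frequency-based salience, the `<sep>` marker, and `top_k` are illustrative assumptions rather than the paper's exact recipe.

```python
from collections import Counter

def build_salient_prefix(source_sentence, pseudo_document, top_k=5,
                         stopwords=frozenset()):
    """Prepend salient pseudo-document words as disambiguating context."""
    tokens = [w.lower() for w in pseudo_document.split()
              if w.lower() not in stopwords and w.isalpha()]
    salient = [w for w, _ in Counter(tokens).most_common(top_k)]
    return " ".join(salient) + " <sep> " + source_sentence
```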

Adaptive Image Registration: A Hybrid Approach Integrating Deep Learning and Optimization Functions for Enhanced Precision

  • paper_url: http://arxiv.org/abs/2311.15497
  • repo_url: None
  • paper_authors: Gabriel De Araujo, Shanlin Sun, Xiaohui Xie
  • for: 这篇论文旨在将基于学习的方法与基于优化的方法相结合，用于图像配准。
  • methods: 该论文以基于学习方法的输出作为优化的初始参数，并将计算资源优先分配给损失最大的图像对，从而在单一流程中结合两类方法的优势。
  • results: 研究结果表明，以性能最佳的最新模型作为框架骨干时，测试精度提高了 0.3%，同时保持相同的推理时间，形变场平滑度仅下降 0.8%。
    Abstract Image registration has traditionally been done using two distinct approaches: learning based methods, relying on robust deep neural networks, and optimization-based methods, applying complex mathematical transformations to warp images accordingly. Of course, both paradigms offer advantages and disadvantages, and, in this work, we seek to combine their respective strengths into a single streamlined framework, using the outputs of the learning based method as initial parameters for optimization while prioritizing computational power for the image pairs that offer the greatest loss. Our investigations showed that an improvement of 0.3\% in testing when utilizing the best performing state-of-the-art model as the backbone of the framework, while maintaining the same inference time and with only a 0.8\% loss in deformation field smoothness.
    摘要 图像配准传统上采用两类方法：基于学习的方法依赖于稳健的深度神经网络，基于优化的方法则通过复杂的数学变换对图像进行形变。两种范式各有优缺点。在本工作中，我们将二者的优势结合到一个统一的框架中：以基于学习方法的输出作为优化的初始参数，并将计算资源优先分配给损失最大的图像对。实验表明，以性能最佳的最新模型作为框架骨干时，测试精度提高 0.3%，推理时间保持不变，形变场平滑度仅下降 0.8%。
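The hybrid scheme described above can be sketched as: run the trained network once to get an initial displacement field, then refine that field with instance-specific optimization. The `net` and `optimize_step` interfaces, iteration count, and optimizer are assumptions.

```python
import torch

def hybrid_register(net, optimize_step, moving, fixed, iters=50, lr=0.1):
    """Learning-based initialization followed by optimization-based refinement."""
    with torch.no_grad():
        field = net(moving, fixed)            # assumed: network predicts a field
    field = field.clone().requires_grad_(True)
    opt = torch.optim.Adam([field], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = optimize_step(field)           # assumed: similarity + smoothness
        loss.backward()
        opt.step()
    return field.detach()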

Optimizing and Fine-tuning Large Language Model for Urban Renewal

  • paper_url: http://arxiv.org/abs/2311.15490
  • repo_url: None
  • paper_authors: Xi Wang, Xianyao Ling, Tom Zhang, Xuecao Li, Shaolan Wang, Zhixing Li, Liang Zhang, Peng Gong
  • for: 本研究旨在创新性地探索大语言模型（LLM）在城市更新领域的适应性应用，并提升其在知识问答（QA）任务中的性能与文本生成质量。
  • methods: 研究人员以自指导（self-instruct）方式利用城市更新科学文献语料自动生成 QA 数据集，并采用 Prefix 与 LoRA 联合微调方法训练面向城市更新的 LLM。
  • results: 实验结果显示，所提出的联合微调方法可以显著提升 LLM 在 QA 任务中的表现：与 LoRA 微调相比，测试集上的 Bleu 和 Rouge 指标提升约 5%；与微调前的模型相比，提升约 15%-20%。
    Abstract This study aims to innovatively explore adaptive applications of large language models (LLM) in urban renewal. It also aims to improve its performance and text generation quality for knowledge question-answering (QA) tasks. Based on the ChatGLM, we automatically generate QA datasets using urban renewal scientific literature corpora in a self-instruct manner and then conduct joint fine-tuning training on the model using the Prefix and LoRA fine-tuning methods to create an LLM for urban renewal. By guiding the LLM to automatically generate QA data based on prompt words and given text, it is possible to quickly obtain datasets in the urban renewal field and provide data support for the fine-tuning training of LLMs. The experimental results show that the joint fine-tuning training method proposed in this study can significantly improve the performance of LLM on the QA tasks. Compared with LoRA fine-tuning, the method improves the Bleu and Rouge metrics on the test by about 5%; compared with the model before fine-tuning, the method improves the Bleu and Rouge metrics by about 15%-20%. This study demonstrates the effectiveness and superiority of the joint fine-tuning method using Prefix and LoRA for ChatGLM in the urban renewal knowledge QA tasks. It provides a new approach for fine-tuning LLMs on urban renewal-related tasks.
    摘要 本研究旨在创新性地探索大语言模型（LLM）在城市更新领域的适应性应用，并提升其在知识问答（QA）任务中的性能与文本生成质量。基于 ChatGLM，我们以自指导（self-instruct）方式利用城市更新科学文献语料自动生成 QA 数据集，随后采用 Prefix 与 LoRA 联合微调方法训练面向城市更新的 LLM。通过引导 LLM 依据提示词和给定文本自动生成 QA 数据，可以快速获得城市更新领域的数据集，为 LLM 的微调训练提供数据支撑。实验结果表明，所提出的联合微调方法能显著提升 LLM 在 QA 任务上的性能：与 LoRA 微调相比，测试集上的 Bleu 和 Rouge 指标提升约 5%；与微调前的模型相比，提升约 15%-20%。该研究验证了 Prefix 与 LoRA 联合微调 ChatGLM 在城市更新知识问答任务中的有效性与优越性，为城市更新相关任务上的 LLM 微调提供了新思路。
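For readers unfamiliar with the two techniques being combined, the sketch below shows the core mechanics of LoRA (a trainable low-rank update on a frozen linear layer) and prefix tuning (trainable vectors prepended to the token embeddings). This is a generic PyTorch illustration, not the paper's ChatGLM-specific joint fine-tuning setup.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

class PrefixEmbeddings(nn.Module):
    """Trainable prefix vectors prepended to the token embeddings."""
    def __init__(self, n_prefix, dim):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(n_prefix, dim) * 0.02)

    def forward(self, token_embeds):         # (B, L, D) -> (B, n_prefix + L, D)
        B = token_embeds.size(0)
        return torch.cat([self.prefix.expand(B, -1, -1), token_embeds], dim=1)
```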

Global $\mathcal{L}^2$ minimization with certainty via geometrically adapted gradient descent in Deep Learning

  • paper_url: http://arxiv.org/abs/2311.15487
  • repo_url: None
  • paper_authors: Thomas Chen
  • for: 这篇论文主要研究深度学习网络中 $\mathcal{L}^2$ 代价函数的最小化问题，并提出了两种修改后的梯度下降流：一种适用于过参数化情形，另一种适用于欠参数化情形。
  • methods: 这两种修改后的梯度下降流具有明确而自然的几何不变意义：在过参数化情形中考虑了拉回向量丛（pullback vector bundle）结构，在欠参数化情形中考虑了推前向量丛（pushforward vector bundle）结构。
  • results: 论文证明，在过参数化情形下，只要满足某个秩条件，修改后梯度下降的所有轨道都会以一致的指数收敛速率将 $\mathcal{L}^2$ 代价驱动至其全局最小值。作者还指出了这一结果与次黎曼几何的联系。
    Abstract We consider the gradient descent flow widely used for the minimization of the $\mathcal{L}^2$ cost function in Deep Learning networks, and introduce two modified versions; one adapted for the overparametrized setting, and the other for the underparametrized setting. Both have a clear and natural invariant geometric meaning, taking into account the pullback vector bundle structure in the overparametrized, and the pushforward vector bundle structure in the underparametrized setting. In the overparametrized case, we prove that, provided that a rank condition holds, all orbits of the modified gradient descent drive the $\mathcal{L}^2$ cost to its global minimum at a uniform exponential convergence rate. We point out relations of the latter to sub-Riemannian geometry.
    摘要 我们考虑深度学习网络中广泛用于最小化 $\mathcal{L}^2$ 代价函数的梯度下降流，并提出了两种修改版本：一种适用于过参数化情形，另一种适用于欠参数化情形。两者都具有明确而自然的几何不变意义：在过参数化情形中考虑拉回向量丛结构，在欠参数化情形中考虑推前向量丛结构。在过参数化情形下，我们证明：只要满足某个秩条件，修改后梯度下降的所有轨道都会以一致的指数收敛速率将 $\mathcal{L}^2$ 代价驱动至其全局最小值。我们还指出了后者与次黎曼几何的关系。

Automatic Time Signature Determination for New Scores Using Lyrics for Latent Rhythmic Structure

  • paper_url: http://arxiv.org/abs/2311.15480
  • repo_url: None
  • paper_authors: Callie C. Liao, Duoduo Liao, Jesse Guessford
  • for: 这篇论文旨在开发一种仅以歌词（lyrics）为输入、自动为新乐谱确定拍号（time signature）的算法，以提高 AI 音乐生成的质量。
  • methods: 这 paper 使用了 explainable machine learning 模型,并提出了多种关于发现 lyrical patterns 和创建新特征的方法,以同时包含 lyrical、rhythmic 和统计信息。
  • results: 这 paper 的实验结果显示,使用这种方法可以达到 97.6% F1 分数和 0.996 AUC ROC 分数的水平。
    Abstract There has recently been a sharp increase in interest in Artificial Intelligence-Generated Content (AIGC). Despite this, musical components such as time signatures have not been studied sufficiently to form an algorithmic determination approach for new compositions, especially lyrical songs. This is likely because of the neglect of musical details, which is critical for constructing a robust framework. Specifically, time signatures establish the fundamental rhythmic structure for almost all aspects of a song, including the phrases and notes. In this paper, we propose a novel approach that only uses lyrics as input to automatically generate a fitting time signature for lyrical songs and uncover the latent rhythmic structure utilizing explainable machine learning models. In particular, we devise multiple methods that are associated with discovering lyrical patterns and creating new features that simultaneously contain lyrical, rhythmic, and statistical information. In this approach, the best of our experimental results reveal a 97.6% F1 score and a 0.996 Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) score. In conclusion, our research directly generates time signatures from lyrics automatically for new scores utilizing machine learning, which is an innovative idea that approaches an understudied component of musicology and therefore contributes significantly to the future of Artificial Intelligence (AI) music generation.
    摘要 近来，人工智能生成内容（AIGC）受到的关注急剧上升。然而，拍号等音乐要素尚未得到充分研究，难以形成针对新作品（尤其是带歌词歌曲）的算法化确定方法。这很可能是因为对音乐细节的忽视，而这些细节对构建稳健的框架至关重要：拍号为歌曲几乎所有方面（包括乐句和音符）确立了基本的节奏结构。本文提出了一种仅以歌词为输入、自动为带歌词歌曲生成合适拍号的新方法，并利用可解释的机器学习模型揭示潜在的节奏结构。特别地，我们设计了多种用于发现歌词模式并构建同时包含歌词、节奏和统计信息的新特征的方法。实验中最佳结果达到 97.6% 的 F1 分数和 0.996 的 ROC 曲线下面积（AUC）。总之，本研究利用机器学习直接从歌词自动生成新乐谱的拍号，这一创新思路触及音乐学中一个研究不足的组成部分，对人工智能音乐生成的未来具有重要贡献。

Privacy-Preserving Data Sharing in Agriculture: Enforcing Policy Rules for Secure and Confidential Data Synthesis

  • paper_url: http://arxiv.org/abs/2311.15460
  • repo_url: https://github.com/ebiquity/policy_enforced_data_generation
  • paper_authors: Anantaa Kotal, Lavanya Elluri, Deepti Gupta, Varun Mandalapu, Anupam Joshi
  • for: 这个论文旨在推动农业社区利用大数据技术优化资源使用、提高生产力和提高农业实践的可持续性。
  • methods: 该论文使用大数据技术收集和分析各种数据源，如传感器、卫星和农民调查。同时，该论文还使用深度学习技术生成合成数据以保护数据主体的隐私，并提出了在隐私保护数据生成算法中强制执行数据隐私政策规则的新框架。
  • results: 该论文通过实验表明，使用这种隐私保护技术可以在农业领域广泛共享数据而不侵犯数据主体的隐私；生成的合成数据在下游任务中具有实用价值，且该框架能够依据适用的监管政策规则规避潜在威胁、保护数据。
    Abstract Big Data empowers the farming community with the information needed to optimize resource usage, increase productivity, and enhance the sustainability of agricultural practices. The use of Big Data in farming requires the collection and analysis of data from various sources such as sensors, satellites, and farmer surveys. While Big Data can provide the farming community with valuable insights and improve efficiency, there is significant concern regarding the security of this data as well as the privacy of the participants. Privacy regulations, such as the EU GDPR, the EU Code of Conduct on agricultural data sharing by contractual agreement, and the proposed EU AI law, have been created to address the issue of data privacy and provide specific guidelines on when and how data can be shared between organizations. To make confidential agricultural data widely available for Big Data analysis without violating the privacy of the data subjects, we consider privacy-preserving methods of data sharing in agriculture. Deep learning-based synthetic data generation has been proposed for privacy-preserving data sharing. However, there is a lack of compliance with documented data privacy policies in such privacy-preserving efforts. In this study, we propose a novel framework for enforcing privacy policy rules in privacy-preserving data generation algorithms. We explore several available agricultural codes of conduct, extract knowledge related to the privacy constraints in data, and use the extracted knowledge to define privacy bounds in a privacy-preserving generative model. We use our framework to generate synthetic agricultural data and present experimental results that demonstrate the utility of the synthetic dataset in downstream tasks. We also show that our framework can evade potential threats and secure data based on applicable regulatory policy rules.
    摘要 大数据为农业社区提供了优化资源使用、提高生产力和增强农业实践可持续性所需的信息。在农业中使用大数据需要收集和分析来自传感器、卫星和农民调查等各种来源的数据。虽然大数据可以为农业社区提供有价值的洞察并提高效率，但数据安全和参与者隐私的问题也备受关注。为解决这个问题，欧盟 GDPR、欧盟基于合同协议的农业数据共享行为准则以及拟议中的欧盟人工智能法等隐私法规相继出台，对数据何时以及如何在组织间共享给出了具体指引。为了在不侵犯数据主体隐私的前提下使机密农业数据广泛可用于大数据分析，我们考虑了农业中保护隐私的数据共享方法。基于深度学习的合成数据生成已被提议用于隐私保护的数据共享，然而现有的此类工作往往不符合成文的数据隐私政策。在本研究中，我们提出了一种在隐私保护数据生成算法中强制执行隐私政策规则的新框架：我们考察了多份现有的农业行为准则，从中提取与数据隐私约束相关的知识，并使用提取的知识在隐私保护生成模型中定义隐私边界。我们使用该框架生成合成农业数据，并通过实验结果展示了合成数据集在下游任务中的实用性。我们还表明该框架可以规避潜在的威胁，并根据适用的监管政策规则保护数据。