cs.AI - 2023-12-06

A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints

  • paper_url: http://arxiv.org/abs/2312.03905
  • repo_url: None
  • paper_authors: Kareem Ahmed, Kai-Wei Chang, Guy Van den Broeck
  • for: bridges the gap between purely symbolic and neural approaches to learning, specifically for tasks that involve autoregressive distributions such as transformers.
  • methods: proposes a new approach to neuro-symbolic learning that maximizes the likelihood of a symbolic constraint w.r.t. the neural network's output distribution, using a pseudolikelihood-based approximation centered around a model sample, which is factorized and locally high-fidelity.
  • results: greatly improves upon the base model's ability to predict logically-consistent outputs on Sudoku and shortest-path prediction tasks, and achieves state-of-the-art (SoTA) detoxification of large language models using a simple constraint disallowing a list of toxic words.
    Abstract Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning. This often requires maximizing the likelihood of a symbolic constraint w.r.t the neural network's output distribution. Such output distributions are typically assumed to be fully-factorized. This limits the applicability of neuro-symbolic learning to the more expressive autoregressive distributions, e.g., transformers. Under such distributions, computing the likelihood of even simple constraints is #P-hard. Instead of attempting to enforce the constraint on the entire output distribution, we propose to do so on a random, local approximation thereof. More precisely, we optimize the likelihood of the constraint under a pseudolikelihood-based approximation centered around a model sample. Our approximation is factorized, allowing the reuse of solutions to sub-problems, a main tenet for efficiently computing neuro-symbolic losses. Moreover, it is a local, high-fidelity approximation of the likelihood, exhibiting low entropy and KL-divergence around the model sample. We evaluate our approach on Sudoku and shortest-path prediction cast as autoregressive generation, and observe that we greatly improve upon the base model's ability to predict logically-consistent outputs. We also evaluate on the task of detoxifying large language models. Using a simple constraint disallowing a list of toxic words, we are able to steer the model's outputs away from toxic generations, achieving SoTA detoxification compared to previous approaches.
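
To make the idea concrete, here is a minimal sketch of the pseudolikelihood-style constraint loss for the simplest case, a constraint disallowing a list of toxic tokens. The factorized form below is a simplified reading of the paper's approximation, and all tensors are toy stand-ins for a real model's outputs.

```python
# Minimal sketch: negative log-likelihood of a "no toxic tokens" constraint
# under a factorized approximation built from per-step model conditionals.
import torch
import torch.nn.functional as F

def pseudo_semantic_loss(logits, toxic_token_ids):
    """logits: (T, V) next-token logits along a sampled sequence y.

    Under the factorized approximation q(x) = prod_t q_t(x_t), the
    probability that "no toxic token appears anywhere" holds is
    prod_t (1 - q_t(toxic)); we return its negative log.
    """
    log_probs = F.log_softmax(logits, dim=-1)                      # (T, V)
    toxic_mass = log_probs[:, toxic_token_ids].exp().sum(dim=-1)   # (T,)
    log_p_constraint = torch.log1p(-toxic_mass.clamp(max=1 - 1e-6)).sum()
    return -log_p_constraint

# Toy usage with random logits standing in for a model's outputs.
logits = torch.randn(8, 100, requires_grad=True)  # 8 steps, vocab of 100
loss = pseudo_semantic_loss(logits, toxic_token_ids=[3, 17, 42])
loss.backward()
```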

A Masked Pruning Approach for Dimensionality Reduction in Communication-Efficient Federated Learning Systems

  • paper_url: http://arxiv.org/abs/2312.03889
  • repo_url: None
  • paper_authors: Tamir L. S. Gez, Kobi Cohen
  • for: improves the applicability of Federated Learning (FL) algorithms to devices with limited communication resources, such as mobile or embedded devices.
  • methods: combines a masking-based pruning method with the FL process so that nodes share low-dimensional representations of the model at minimal communication cost. Each node first trains its model locally and computes a pruning mask; the masks are transmitted back to the server, which forms a consensus mask. This iterative process improves the robustness and stability of the pruned model.
  • results: compared with existing methods, MPFL achieves greater bandwidth reduction while preserving model performance, as demonstrated by an extensive experimental study on communication-constrained devices. An open-source software package is also released for researchers and developers in related fields.
    Abstract Federated Learning (FL) represents a growing machine learning (ML) paradigm designed for training models across numerous nodes that retain local datasets, all without directly exchanging the underlying private data with the parameter server (PS). Its increasing popularity is attributed to notable advantages in terms of training deep neural network (DNN) models under privacy aspects and efficient utilization of communication resources. Unfortunately, DNNs suffer from high computational and communication costs, as well as memory consumption in intricate tasks. These factors restrict the applicability of FL algorithms in communication-constrained systems with limited hardware resources. In this paper, we develop a novel algorithm that overcomes these limitations by synergistically combining a pruning-based method with the FL process, resulting in low-dimensional representations of the model with minimal communication cost, dubbed Masked Pruning over FL (MPFL). The algorithm operates by initially distributing weights to the nodes through the PS. Subsequently, each node locally trains its model and computes pruning masks. These low-dimensional masks are then transmitted back to the PS, which generates a consensus pruning mask, broadcasted back to the nodes. This iterative process enhances the robustness and stability of the masked pruning model. The generated mask is used to train the FL model, achieving significant bandwidth savings. We present an extensive experimental study demonstrating the superior performance of MPFL compared to existing methods. Additionally, we have developed an open-source software package for the benefit of researchers and developers in related fields.
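
As an illustration of the consensus step, here is a minimal sketch in which each node submits a binary pruning mask and the server keeps a weight only if enough nodes vote to keep it. The majority-vote rule and the `quorum` parameter are illustrative assumptions, not the paper's exact aggregation.

```python
# Minimal sketch of the server-side consensus over node pruning masks.
import numpy as np

def consensus_mask(node_masks, quorum=0.5):
    """node_masks: list of binary arrays, 1 = keep weight, 0 = prune."""
    votes = np.mean(np.stack(node_masks), axis=0)  # fraction voting "keep"
    return (votes >= quorum).astype(np.uint8)

# Toy usage: three nodes vote on masks for five weights.
masks = [np.array([1, 0, 1, 1, 0]),
         np.array([1, 0, 0, 1, 0]),
         np.array([1, 1, 1, 0, 0])]
print(consensus_mask(masks))  # -> [1 0 1 1 0]
```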

On The Fairness Impacts of Hardware Selection in Machine Learning

  • paper_url: http://arxiv.org/abs/2312.03886
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Sree Harsha Nelaturu, Nishaanth Kanna Ravichandran, Cuong Tran, Sara Hooker, Ferdinando Fioretto
  • for: investigates the impact of hardware choices on the generalization properties of machine learning models, particularly in the context of ML-as-a-service platforms.
  • methods: combines theoretical and empirical analysis to identify the factors that contribute to hardware-induced performance imbalances, and proposes a strategy for mitigating these imbalances.
  • results: demonstrates that hardware choices can exacerbate existing disparities in model performance and fairness, and provides insights into the underlying causes of these discrepancies.
    Abstract In the machine learning ecosystem, hardware selection is often regarded as a mere utility, overshadowed by the spotlight on algorithms and data. This oversight is particularly problematic in contexts like ML-as-a-service platforms, where users often lack control over the hardware used for model deployment. How does the choice of hardware impact generalization properties? This paper investigates the influence of hardware on the delicate balance between model performance and fairness. We demonstrate that hardware choices can exacerbate existing disparities, attributing these discrepancies to variations in gradient flows and loss surfaces across different demographic groups. Through both theoretical and empirical analysis, the paper not only identifies the underlying factors but also proposes an effective strategy for mitigating hardware-induced performance imbalances.

FoMo Rewards: Can we cast foundation models as reward functions?

  • paper_url: http://arxiv.org/abs/2312.03881
  • repo_url: None
  • paper_authors: Ekdeep Singh Lubana, Johann Brehmer, Pim de Haan, Taco Cohen
  • for: investigates whether foundation models can serve as generic reward functions for reinforcement learning.
  • methods: proposes a simple pipeline that interfaces an off-the-shelf vision-language model with a large language model. Specifically, given a trajectory of observations, it infers the likelihood of an instruction describing the task.
  • results: finds that this generic likelihood function exhibits the characteristics ideally expected of a reward function: it assigns high values to the desired behaviour and lower values to similar but incorrect policies. Overall, the work opens the possibility of designing open-ended agents for interactive tasks via foundation models.
    Abstract We explore the viability of casting foundation models as generic reward functions for reinforcement learning. To this end, we propose a simple pipeline that interfaces an off-the-shelf vision model with a large language model. Specifically, given a trajectory of observations, we infer the likelihood of an instruction describing the task that the user wants an agent to perform. We show that this generic likelihood function exhibits the characteristics ideally expected from a reward function: it associates high values with the desired behaviour and lower values for several similar, but incorrect policies. Overall, our work opens the possibility of designing open-ended agents for interactive tasks via foundation models.
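
A minimal sketch of the reward pipeline follows, assuming hypothetical `caption` and `instruction_log_likelihood` helpers that stand in for the off-the-shelf vision and language models the paper interfaces.

```python
# Minimal sketch: a foundation-model reward scores how well a trajectory
# of observations matches a natural-language instruction.
from typing import Callable, List, Sequence

def fomo_reward(observations: Sequence,
                instruction: str,
                caption: Callable[[object], str],
                instruction_log_likelihood: Callable[[str, List[str]], float]) -> float:
    captions = [caption(obs) for obs in observations]
    # Higher likelihood of the instruction given the trajectory's
    # captions => the trajectory better matches the desired task.
    return instruction_log_likelihood(instruction, captions)

# Toy stand-ins so the sketch runs end to end.
dummy_caption = lambda obs: f"the robot is at {obs}"
dummy_ll = lambda instr, caps: -float(sum(instr not in c for c in caps))
print(fomo_reward(["goal", "goal"], "goal", dummy_caption, dummy_ll))
```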

Scaling transformer neural networks for skillful and reliable medium-range weather forecasting

  • paper_url: http://arxiv.org/abs/2312.03876
  • repo_url: None
  • paper_authors: Tung Nguyen, Rohan Shah, Hritik Bansal, Troy Arcomano, Sandeep Madireddy, Romit Maulik, Veerabhadra Kotamarthi, Ian Foster, Aditya Grover
  • for: proposes a deep-learning-based approach to weather forecasting that improves forecast accuracy and efficiency.
  • methods: uses a simple transformer model, called Stormer, whose key components include weather-specific embedding, a randomized dynamics forecasting objective, and a pressure-weighted loss.
  • results: on WeatherBench 2, Stormer performs competitively at short-to-medium-range forecasts and outperforms existing methods beyond 7 days, while requiring far less training data and compute. The paper also shows that Stormer scales well: forecast accuracy improves consistently as model size and the number of training samples grow.
    Abstract Weather forecasting is a fundamental problem for anticipating and mitigating the impacts of climate change. Recently, data-driven approaches for weather forecasting based on deep learning have shown great promise, achieving accuracies that are competitive with operational systems. However, those methods often employ complex, customized architectures without sufficient ablation analysis, making it difficult to understand what truly contributes to their success. Here we introduce Stormer, a simple transformer model that achieves state-of-the-art performance on weather forecasting with minimal changes to the standard transformer backbone. We identify the key components of Stormer through careful empirical analyses, including weather-specific embedding, randomized dynamics forecast, and pressure-weighted loss. At the core of Stormer is a randomized forecasting objective that trains the model to forecast the weather dynamics over varying time intervals. During inference, this allows us to produce multiple forecasts for a target lead time and combine them to obtain better forecast accuracy. On WeatherBench 2, Stormer performs competitively at short to medium-range forecasts and outperforms current methods beyond 7 days, while requiring orders-of-magnitude less training data and compute. Additionally, we demonstrate Stormer's favorable scaling properties, showing consistent improvements in forecast accuracy with increases in model size and training tokens. Code and checkpoints will be made publicly available.
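
As a concrete illustration of one component, here is a minimal sketch of a pressure-weighted loss that weights each vertical level in proportion to its pressure, emphasizing near-surface variables. This is one plausible reading of the paper's "pressure-weighted loss"; the exact weighting may differ.

```python
# Minimal sketch: MSE weighted per vertical level by atmospheric pressure.
import torch

def pressure_weighted_mse(pred, target, pressures_hpa):
    """pred, target: (B, L, H, W); pressures_hpa: (L,) per-level pressure."""
    w = pressures_hpa / pressures_hpa.sum()                   # normalized weights
    per_level = ((pred - target) ** 2).mean(dim=(0, 2, 3))    # (L,) per-level MSE
    return (w * per_level).sum()

pred = torch.randn(2, 3, 4, 4)
target = torch.randn(2, 3, 4, 4)
loss = pressure_weighted_mse(pred, target, torch.tensor([1000., 850., 500.]))
```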

The BigCode Project Governance Card

  • paper_url: http://arxiv.org/abs/2312.03872
  • repo_url: None
  • paper_authors: BigCode collaboration, Sean Hughes, Harm de Vries, Jennifer Robinson, Carlos Muñoz Ferrandis, Loubna Ben Allal, Leandro von Werra, Jennifer Ding, Sebastien Paquet, Yacine Jernite
  • for: provides an overview of the different mechanisms and areas of governance in the BigCode project, in support of the project's transparency.
  • methods: describes governance through the project's organizational structure, its stated goals and values, its internal decision-making processes, and its funding and resources.
  • results: by documenting these mechanisms and areas, the card gives the broader public insight into the project and offers an example of intentional governance that future open research projects can draw on.
    Abstract This document serves as an overview of the different mechanisms and areas of governance in the BigCode project. It aims to support transparency by providing relevant information about choices that were made during the project to the broader public, and to serve as an example of intentional governance of an open research project that future endeavors can leverage to shape their own approach. The first section, Project Structure, covers the project organization, its stated goals and values, its internal decision processes, and its funding and resources. The second section, Data and Model Governance, covers decisions relating to the questions of data subject consent, privacy, and model release.

Efficient Large Language Models: A Survey

  • paper_url: http://arxiv.org/abs/2312.03863
  • repo_url: https://github.com/aiot-mlsys-lab/efficientllms
  • paper_authors: Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, Mi Zhang
  • for: provides a systematic and comprehensive survey of research on efficient LLMs, helping researchers and practitioners understand the field's developments and progress.
  • methods: organizes the literature into three main categories, reviewing the field from model-centric, data-centric, and framework-centric perspectives, and compiles the surveyed papers in a GitHub repository.
  • results: delivers a taxonomy covering these three perspectives; the accompanying GitHub repository will be actively maintained and updated as new research emerges.
    Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities in important tasks such as natural language understanding, language generation, and complex reasoning and have the potential to make a substantial impact on our society. Such capabilities, however, come with the considerable resources they demand, highlighting the strong need to develop effective techniques for addressing their efficiency challenges. In this survey, we provide a systematic and comprehensive review of efficient LLMs research. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient LLMs topics from model-centric, data-centric, and framework-centric perspective, respectively. We have also created a GitHub repository where we compile the papers featured in this survey at https://github.com/AIoT-MLSys-Lab/EfficientLLMs, https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey, and will actively maintain this repository and incorporate new research as it emerges. We hope our survey can serve as a valuable resource to help researchers and practitioners gain a systematic understanding of the research developments in efficient LLMs and inspire them to contribute to this important and exciting field.

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

  • paper_url: http://arxiv.org/abs/2312.03818
  • repo_url: https://github.com/sunzey/alphaclip
  • paper_authors: Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
  • for: enhances CLIP with region-level control, enabling finer-grained understanding and controlled editing of images.
  • methods: adds an auxiliary alpha channel to indicate attentive regions and fine-tunes CLIP on millions of constructed RGBA region-text pairs.
  • results: Alpha-CLIP not only preserves CLIP's visual recognition ability but also enables precise control over the emphasis of image content. It is effective across a variety of tasks, including open-world recognition, multimodal large language models, and conditional 2D/3D generation.
    Abstract Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks. It aligns textual and visual modalities to comprehend the entire image, including all the details, even those irrelevant to specific tasks. However, for a finer understanding and controlled editing of images, it becomes crucial to focus on specific regions of interest, which can be indicated as points, masks, or boxes by humans or perception models. To fulfill the requirements, we introduce Alpha-CLIP, an enhanced version of CLIP with an auxiliary alpha channel to suggest attentive regions and fine-tuned with constructed millions of RGBA region-text pairs. Alpha-CLIP not only preserves the visual recognition ability of CLIP but also enables precise control over the emphasis of image contents. It demonstrates effectiveness in various tasks, including but not limited to open-world recognition, multimodal large language models, and conditional 2D / 3D generation. It has a strong potential to serve as a versatile tool for image-related tasks.
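
One natural way to realize an auxiliary alpha channel is to widen the patch-embedding convolution of a CLIP-style vision encoder from 3 to 4 input channels, reusing the pretrained RGB weights and zero-initializing the alpha weights so the model initially ignores the new channel. The sketch below illustrates this idea; Alpha-CLIP's actual implementation may differ.

```python
# Minimal sketch: widen a ViT patch-embedding conv to accept RGBA input.
import torch
import torch.nn as nn

def widen_patch_embed(rgb_conv: nn.Conv2d) -> nn.Conv2d:
    rgba_conv = nn.Conv2d(4, rgb_conv.out_channels,
                          kernel_size=rgb_conv.kernel_size,
                          stride=rgb_conv.stride,
                          bias=rgb_conv.bias is not None)
    with torch.no_grad():
        rgba_conv.weight[:, :3] = rgb_conv.weight   # reuse pretrained RGB weights
        rgba_conv.weight[:, 3:].zero_()             # alpha channel starts inert
    return rgba_conv

rgb_conv = nn.Conv2d(3, 768, kernel_size=16, stride=16, bias=False)
rgba_conv = widen_patch_embed(rgb_conv)
x = torch.randn(1, 4, 224, 224)  # RGBA input: image plus attention mask
tokens = rgba_conv(x).flatten(2).transpose(1, 2)  # (1, 196, 768) patch tokens
```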

OneLLM: One Framework to Align All Modalities with Language

  • paper_url: http://arxiv.org/abs/2312.03700
  • repo_url: https://github.com/csuhan/onellm
  • paper_authors: Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue
  • for: develops a multimodal large language model (MLLM) that handles many modalities within a single framework, improving multimodal understanding.
  • methods: uses a unified architecture that aligns eight modalities to language via a progressive multimodal alignment pipeline, and builds a universal projection module (UPM) by mixing multiple image projection modules with dynamic routing.
  • results: OneLLM delivers excellent performance on 25 diverse benchmarks, covering tasks such as multimodal captioning, question answering, and reasoning.
    Abstract Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curated a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM
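
A minimal sketch of a universal projection module, read as a mixture of projection experts combined by a learned soft router; the dimensions and router design are illustrative assumptions rather than OneLLM's exact architecture.

```python
# Minimal sketch: mixture of projection experts with dynamic soft routing.
import torch
import torch.nn as nn

class UniversalProjection(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=4096, num_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(enc_dim, llm_dim) for _ in range(num_experts)])
        self.router = nn.Linear(enc_dim, num_experts)

    def forward(self, feats):                      # feats: (B, T, enc_dim)
        weights = self.router(feats).softmax(-1)   # (B, T, num_experts)
        outs = torch.stack([e(feats) for e in self.experts], dim=-1)
        return (outs * weights.unsqueeze(-2)).sum(-1)  # (B, T, llm_dim)

upm = UniversalProjection()
tokens_for_llm = upm(torch.randn(2, 16, 1024))  # ready to prepend to text tokens
```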

Intrinsic Harmonization for Illumination-Aware Compositing

  • paper_url: http://arxiv.org/abs/2312.03698
  • repo_url: None
  • paper_authors: Chris Careaga, S. Mahdi H. Miangoleh, Yağız Aksoy
  • for: improves the realism and illumination accuracy of image composites.
  • methods: uses a self-supervised illumination harmonization approach: a simple global lighting model is estimated to produce a rough shading for the foreground, a network refines this shading to match the background scene, and parameterized edits in the albedo domain align the color appearance of foreground and background.
  • results: improves realism and illumination consistency on challenging real-world composites, with a user study providing an objective measure of the enhanced realism relative to prior methods.
    Abstract Despite significant advancements in network-based image harmonization techniques, there still exists a domain disparity between typical training pairs and real-world composites encountered during inference. Most existing methods are trained to reverse global edits made on segmented image regions, which fail to accurately capture the lighting inconsistencies between the foreground and background found in composited images. In this work, we introduce a self-supervised illumination harmonization approach formulated in the intrinsic image domain. First, we estimate a simple global lighting model from mid-level vision representations to generate a rough shading for the foreground region. A network then refines this inferred shading to generate a harmonious re-shading that aligns with the background scene. In order to match the color appearance of the foreground and background, we utilize ideas from prior harmonization approaches to perform parameterized image edits in the albedo domain. To validate the effectiveness of our approach, we present results from challenging real-world composites and conduct a user study to objectively measure the enhanced realism achieved compared to state-of-the-art harmonization methods.
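
The pipeline rests on the intrinsic decomposition image = albedo × shading. The sketch below shows the recomposition step under that assumption, with a simple gain/shift albedo edit standing in for the paper's parameterized edit model.

```python
# Minimal sketch: recompose a harmonized composite from edited albedo and
# network-refined shading, replacing only the masked foreground.
import numpy as np

def recompose(albedo, shading_refined, mask, gain=1.0, shift=0.0):
    """albedo, shading_refined, mask: (H, W, 3) float arrays in [0, 1]."""
    albedo_edit = np.clip(albedo * gain + shift, 0.0, 1.0)
    harmonized = albedo_edit * shading_refined
    # Masked foreground gets the harmonized result; background is untouched.
    return mask * harmonized + (1 - mask) * (albedo * shading_refined)
```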

MatterGen: a generative model for inorganic materials design

  • paper_url: http://arxiv.org/abs/2312.03687
  • repo_url: None
  • paper_authors: Claudio Zeni, Robert Pinsler, Daniel Zügner, Andrew Fowler, Matthew Horton, Xiang Fu, Sasha Shysheya, Jonathan Crabbé, Lixin Sun, Jake Smith, Ryota Tomioka, Tian Xie
  • for: aims to develop a new generative model for designing functional materials with desired properties, with a particular focus on stability and novelty.
  • methods: the proposed model, MatterGen, uses a diffusion-based generative process that refines atom types, coordinates, and the periodic lattice to produce crystalline structures. Adapter modules are introduced to enable fine-tuning towards specific property constraints.
  • results: MatterGen generates stable, diverse inorganic materials across the periodic table, with a higher success rate and structures closer to the local energy minimum than prior generative models. Fine-tuning the model allows the design of materials with desired chemistry, symmetry, and multiple properties such as mechanical, electronic, and magnetic properties.
    Abstract The design of functional materials with desired properties is essential in driving technological advances in areas like energy storage, catalysis, and carbon capture. Generative models provide a new paradigm for materials design by directly generating entirely novel materials given desired property constraints. Despite recent progress, current generative models have low success rate in proposing stable crystals, or can only satisfy a very limited set of property constraints. Here, we present MatterGen, a model that generates stable, diverse inorganic materials across the periodic table and can further be fine-tuned to steer the generation towards a broad range of property constraints. To enable this, we introduce a new diffusion-based generative process that produces crystalline structures by gradually refining atom types, coordinates, and the periodic lattice. We further introduce adapter modules to enable fine-tuning towards any given property constraints with a labeled dataset. Compared to prior generative models, structures produced by MatterGen are more than twice as likely to be novel and stable, and more than 15 times closer to the local energy minimum. After fine-tuning, MatterGen successfully generates stable, novel materials with desired chemistry, symmetry, as well as mechanical, electronic and magnetic properties. Finally, we demonstrate multi-property materials design capabilities by proposing structures that have both high magnetic density and a chemical composition with low supply-chain risk. We believe that the quality of generated materials and the breadth of MatterGen's capabilities represent a major advancement towards creating a universal generative model for materials design.

LLM as OS (llmao), Agents as Apps: Envisioning AIOS, Agents and the AIOS-Agent Ecosystem

  • paper_url: http://arxiv.org/abs/2312.03815
  • repo_url: None
  • paper_authors: Yingqiang Ge, Yujie Ren, Wenyue Hua, Shuyuan Xu, Juntao Tan, Yongfeng Zhang
  • for: envisions an AIOS-Agent ecosystem in which a Large Language Model (LLM) serves as an (Artificial) Intelligent Operating System (AIOS), marking a paradigm shift from the traditional OS-APP ecosystem.
  • methods: positions the LLM as the core, system-level component of the operating system and develops LLM-based AI Agent Applications (AAPs) on top of it to grow the AIOS-Agent ecosystem.
  • results: argues that the LLM's impact will not be limited to the AI application level; it will in turn revolutionize the design and implementation of computer systems, software, and programming languages, along with a range of new hardware and middleware devices.
    Abstract This paper envisions a revolutionary AIOS-Agent ecosystem, where Large Language Model (LLM) serves as the (Artificial) Intelligent Operating System (IOS, or AIOS)--an operating system ``with soul''. Upon this foundation, a diverse range of LLM-based AI Agent Applications (Agents, or AAPs) are developed, enriching the AIOS-Agent ecosystem and signaling a paradigm shift from the traditional OS-APP ecosystem. We envision that LLM's impact will not be limited to the AI application level, instead, it will in turn revolutionize the design and implementation of computer system, architecture, software, and programming language, featured by several main concepts: LLM as OS (system-level), Agents as Applications (application-level), Natural Language as Programming Interface (user-level), and Tools as Devices/Libraries (hardware/middleware-level).

What Planning Problems Can A Relational Neural Network Solve?

  • paper_url: http://arxiv.org/abs/2312.03682
  • repo_url: https://github.com/concepts-ai/goal-regression-width
  • paper_authors: Jiayuan Mao, Tomás Lozano-Pérez, Joshua B. Tenenbaum, Leslie Pack Kaelbling
  • for: investigates which planning problems goal-conditioned policies can learn to solve, and how efficient the resulting policies can be.
  • methods: applies circuit complexity analysis to policies represented by relational neural networks (such as graph neural networks and transformers), drawing connections with serialized goal regression search (S-GRS).
  • results: identifies three general classes of planning problems, in terms of how circuit width and depth grow with the number of objects and the planning horizon, with constructive proofs, and illustrates the utility of this analysis for policy learning.
    Abstract Goal-conditioned policies are generally understood to be "feed-forward" circuits, in the form of neural networks that map from the current state and the goal specification to the next action to take. However, under what circumstances such a policy can be learned and how efficient the policy will be are not well understood. In this paper, we present a circuit complexity analysis for relational neural networks (such as graph neural networks and transformers) representing policies for planning problems, by drawing connections with serialized goal regression search (S-GRS). We show that there are three general classes of planning problems, in terms of the growth of circuit width and depth as a function of the number of objects and planning horizon, providing constructive proofs. We also illustrate the utility of this analysis for designing neural networks for policy learning.

An Integration of Pre-Trained Speech and Language Models for End-to-End Speech Recognition

  • paper_url: http://arxiv.org/abs/2312.03668
  • repo_url: None
  • paper_authors: Yukiya Hono, Koh Mitsuda, Tianyu Zhao, Kentaro Mitsui, Toshiaki Wakatsuki, Kei Sawada
  • for: proposes an end-to-end automatic speech recognition (ASR) model built from pre-trained speech and language models, enabling more efficient speech recognition.
  • methods: combines a pre-trained speech representation model with a large language model (LLM): speech representations serve as prompts from which the LLM autoregressively generates text tokens, leveraging the LLM's vast knowledge for end-to-end ASR.
  • results: experiments show that the proposed model achieves performance comparable to modern end-to-end ASR models, and it supports inference optimization and parameter-efficient domain adaptation.
    Abstract Advances in machine learning have made it possible to perform various text and speech processing tasks, including automatic speech recognition (ASR), in an end-to-end (E2E) manner. Since typical E2E approaches require large amounts of training data and resources, leveraging pre-trained foundation models instead of training from scratch is gaining attention. Although there have been attempts to use pre-trained speech and language models in ASR, most of them are limited to using either. This paper explores the potential of integrating a pre-trained speech representation model with a large language model (LLM) for E2E ASR. The proposed model enables E2E ASR by generating text tokens in an autoregressive manner via speech representations as speech prompts, taking advantage of the vast knowledge provided by the LLM. Furthermore, the proposed model can incorporate remarkable developments for LLM utilization, such as inference optimization and parameter-efficient domain adaptation. Experimental results show that the proposed model achieves performance comparable to modern E2E ASR models.
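
A minimal sketch of the speech-prompting idea: features from a frozen speech encoder are projected into the LLM's embedding space and prepended to the text embeddings, after which the LLM decodes the transcript autoregressively. All modules and dimensions below are illustrative placeholders.

```python
# Minimal sketch: bridge a frozen speech encoder to an LLM via a projector.
import torch
import torch.nn as nn

speech_dim, llm_dim, vocab = 512, 2048, 32000
projector = nn.Linear(speech_dim, llm_dim)         # trainable bridge module

speech_feats = torch.randn(1, 50, speech_dim)      # from a frozen speech encoder
speech_prompt = projector(speech_feats)            # (1, 50, llm_dim)

text_embed = nn.Embedding(vocab, llm_dim)          # LLM input embeddings
bos = text_embed(torch.tensor([[1]]))              # begin-of-sequence token
llm_inputs = torch.cat([speech_prompt, bos], dim=1)  # fed to the LLM backbone
# The LLM then generates text tokens conditioned on the speech prompt.
```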

Generative agent-based modeling with actions grounded in physical, social, or digital space using Concordia

  • paper_url: http://arxiv.org/abs/2312.03664
  • repo_url: https://github.com/google-deepmind/concordia
  • paper_authors: Alexander Sasha Vezhnevets, John P. Agapiou, Avia Aharon, Ron Ziv, Jayd Matyas, Edgar A. Duéñez-Guzmán, William A. Cunningham, Simon Osindero, Danny Karmon, Joel Z. Leibo
  • for: explores how agent-based modeling can leverage Large Language Models (LLMs) to broaden the scope and practicality of such models.
  • methods: introduces Concordia, a library for constructing and working with language-mediated agent-based models. Concordia uses an LLM to apply common sense, act reasonably, recall common semantic knowledge, and control digital technologies via API calls.
  • results: presents a new approach to agent-based modeling that supports language-mediated simulations of physically or digitally grounded environments, enabling a wide array of applications from scientific research to evaluating the performance of real digital services.
    Abstract Agent-based modeling has been around for decades, and applied widely across the social and natural sciences. The scope of this research method is now poised to grow dramatically as it absorbs the new affordances provided by Large Language Models (LLM)s. Generative Agent-Based Models (GABM) are not just classic Agent-Based Models (ABM)s where the agents talk to one another. Rather, GABMs are constructed using an LLM to apply common sense to situations, act "reasonably", recall common semantic knowledge, produce API calls to control digital technologies like apps, and communicate both within the simulation and to researchers viewing it from the outside. Here we present Concordia, a library to facilitate constructing and working with GABMs. Concordia makes it easy to construct language-mediated simulations of physically- or digitally-grounded environments. Concordia agents produce their behavior using a flexible component system which mediates between two fundamental operations: LLM calls and associative memory retrieval. A special agent called the Game Master (GM), which was inspired by tabletop role-playing games, is responsible for simulating the environment where the agents interact. Agents take actions by describing what they want to do in natural language. The GM then translates their actions into appropriate implementations. In a simulated physical world, the GM checks the physical plausibility of agent actions and describes their effects. In digital environments simulating technologies such as apps and services, the GM may handle API calls to integrate with external tools such as general AI assistants (e.g., Bard, ChatGPT), and digital apps (e.g., Calendar, Email, Search, etc.). Concordia was designed to support a wide array of applications both in scientific research and for evaluating performance of real digital services by simulating users and/or generating synthetic data.

Pearl: A Production-ready Reinforcement Learning Agent

  • paper_url: http://arxiv.org/abs/2312.03814
  • repo_url: https://github.com/facebookresearch/pearl
  • paper_authors: Zheqing Zhu, Rodrigo de Salvo Braz, Jalaj Bhandari, Daniel Jiang, Yi Wan, Yonathan Efroni, Liyuan Wang, Ruiyang Xu, Hongbo Guo, Alex Nikulkov, Dmytro Korenkevych, Urun Dogan, Frank Cheng, Zheng Wu, Wanqiao Xu
  • for: addresses the challenges reinforcement learning agents face in pursuing long-term goals, including delayed rewards, partial observability, the exploration-exploitation dilemma, using offline data to improve online performance, and satisfying safety constraints.
  • methods: introduces Pearl, a production-ready RL agent software package designed to address these aspects of the RL solution pipeline in a modular fashion.
  • results: presents preliminary benchmark results and highlights Pearl's industry adoptions, demonstrating its readiness for production use.
    Abstract Reinforcement Learning (RL) offers a versatile framework for achieving long-term goals. Its generality allows us to formalize a wide range of problems that real-world intelligent systems encounter, such as dealing with delayed rewards, handling partial observability, addressing the exploration and exploitation dilemma, utilizing offline data to improve online performance, and ensuring safety constraints are met. Despite considerable progress made by the RL research community in addressing these issues, existing open-source RL libraries tend to focus on a narrow portion of the RL solution pipeline, leaving other aspects largely unattended. This paper introduces Pearl, a Production-ready RL agent software package explicitly designed to embrace these challenges in a modular fashion. In addition to presenting preliminary benchmark results, this paper highlights Pearl's industry adoptions to demonstrate its readiness for production usage. Pearl is open sourced on Github at github.com/facebookresearch/pearl and its official website is located at pearlagent.github.io.

Improving Activation Steering in Language Models with Mean-Centring

  • paper_url: http://arxiv.org/abs/2312.03813
  • repo_url: None
  • paper_authors: Ole Jorgensen, Dylan Cope, Nandi Schoots, Murray Shanahan
  • for: aims to improve control over the outputs of large language models (LLMs) by finding steering vectors; a key difficulty is that engineers typically do not know how features are represented inside these models.
  • methods: proposes mean-centring of steering vectors: take the average of activations associated with a target dataset and subtract the mean of all training activations. On natural language tasks this proves effective for steering LLM outputs, e.g., away from toxic text or toward completing a story in a target genre.
  • results: mean-centring substantially improves the effectiveness of activation steering over previous baselines, and mean-centred function vectors trigger the execution of a range of natural language tasks by a significant margin.
    Abstract Recent work in activation steering has demonstrated the potential to better control the outputs of Large Language Models (LLMs), but it involves finding steering vectors. This is difficult because engineers do not typically know how features are represented in these models. We seek to address this issue by applying the idea of mean-centring to steering vectors. We find that taking the average of activations associated with a target dataset, and then subtracting the mean of all training activations, results in effective steering vectors. We test this method on a variety of models on natural language tasks by steering away from generating toxic text, and steering the completion of a story towards a target genre. We also apply mean-centring to extract function vectors, more effectively triggering the execution of a range of natural language tasks by a significant margin (compared to previous baselines). This suggests that mean-centring can be used to easily improve the effectiveness of activation steering in a wide range of contexts.
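
The recipe is concrete enough to sketch directly: the steering vector is the mean activation on the target dataset minus the mean over all training activations, added back into the model at inference time. In the sketch below the layer choice and the scale `alpha` are assumptions.

```python
# Minimal sketch: mean-centred activation steering via a forward hook.
import torch
import torch.nn as nn

def mean_centred_vector(target_acts, train_acts):
    """Each input: (N, d) activations collected at one layer."""
    return target_acts.mean(0) - train_acts.mean(0)

def add_steering_hook(layer: nn.Module, vec: torch.Tensor, alpha=4.0):
    def hook(module, inputs, output):
        return output + alpha * vec     # shift activations at this layer
    return layer.register_forward_hook(hook)

# Toy usage with a stand-in layer and random activations.
layer = nn.Linear(16, 16)
vec = mean_centred_vector(torch.randn(100, 16), torch.randn(1000, 16))
handle = add_steering_hook(layer, vec)
out = layer(torch.randn(2, 16))         # steered output
handle.remove()
```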

Efficient Inverse Design Optimization through Multi-fidelity Simulations, Machine Learning, and Search Space Reduction Strategies

  • paper_url: http://arxiv.org/abs/2312.03654
  • repo_url: None
  • paper_authors: Luka Grbcic, Juliane Müller, Wibe Albert de Jong
  • for: aims to enhance the inverse design optimization process under limited compute budgets through the strategic synergy of multi-fidelity evaluations, machine learning models, and optimization algorithms.
  • methods: proposes a methodology that couples a machine learning model, trained on low-fidelity simulation data, with the optimization loop: in each cycle the model predicts the target variable and decides whether a high-fidelity simulation is required. The approach is analyzed on two engineering inverse design problems (airfoil inverse design and scalar field reconstruction), and the model is also deployed before optimization to reduce the search space.
  • results: the methodology markedly improves the efficiency of inverse design optimization and can be paired with different optimizers, with performance improvements demonstrated for both Differential Evolution and Particle Swarm Optimization. It is especially valuable for conserving computational resources when compute is limited.
    Abstract This paper introduces a methodology designed to augment the inverse design optimization process in scenarios constrained by limited compute, through the strategic synergy of multi-fidelity evaluations, machine learning models, and optimization algorithms. The proposed methodology is analyzed on two distinct engineering inverse design problems: airfoil inverse design and the scalar field reconstruction problem. It leverages a machine learning model trained with low-fidelity simulation data, in each optimization cycle, thereby proficiently predicting a target variable and discerning whether a high-fidelity simulation is necessitated, which notably conserves computational resources. Additionally, the machine learning model is strategically deployed prior to optimization to reduce the search space, thereby further accelerating convergence toward the optimal solution. The methodology has been employed to enhance two optimization algorithms, namely Differential Evolution and Particle Swarm Optimization. Comparative analyses illustrate performance improvements across both algorithms. Notably, this method is adeptly adaptable across any inverse design application, facilitating a harmonious synergy between a representative low-fidelity machine learning model, and high-fidelity simulation, and can be seamlessly applied across any variety of population-based optimization algorithms.
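
A minimal sketch of the gating idea, assuming a surrogate trained on low-fidelity data and a hypothetical decision rule that runs the expensive simulation only when the surrogate's estimate looks promising; the rule and threshold are illustrative, not the paper's exact criterion.

```python
# Minimal sketch: surrogate-gated objective for a population-based optimizer.
import numpy as np

def make_objective(surrogate_predict, high_fidelity_sim, threshold):
    best = {"value": np.inf}
    def objective(x):
        guess = surrogate_predict(x)
        if guess > threshold * best["value"]:
            return guess                    # cheap estimate is good enough
        value = high_fidelity_sim(x)        # expensive, run sparingly
        best["value"] = min(best["value"], value)
        return value
    return objective

# Toy usage with analytic stand-ins for the surrogate and simulator.
f = make_objective(surrogate_predict=lambda x: float(np.sum(x**2)) * 1.1,
                   high_fidelity_sim=lambda x: float(np.sum(x**2)),
                   threshold=1.5)
print(f(np.array([0.5, 0.5])))
```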

MotionCtrl: A Unified and Flexible Motion Controller for Video Generation

  • paper_url: http://arxiv.org/abs/2312.03641
  • repo_url: None
  • paper_authors: Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, Ying Shan
  • for: proposes MotionCtrl, a motion controller for video generation that precisely controls both camera motion and object motion.
  • methods: designs a controller architecture and training strategy that account for the inherent properties of camera motion, object motion, and imperfect training data, enabling flexible and precise motion control.
  • results: compared with existing methods, MotionCtrl offers three main advantages: 1) it controls camera and object motion effectively and independently, enabling fine-grained control and diverse combinations of the two; 2) its motion conditions are determined by camera poses and trajectories, which are appearance-free and minimally affect the appearance or shape of objects in generated videos; 3) it is a relatively generalizable model that, once trained, adapts to a wide array of camera poses and trajectories. Extensive qualitative and quantitative experiments demonstrate its superiority over existing methods.
    Abstract Motions in a video primarily consist of camera motion, induced by camera movement, and object motion, resulting from object movement. Accurate control of both camera and object motion is essential for video generation. However, existing works either mainly focus on one type of motion or do not clearly distinguish between the two, limiting their control capabilities and diversity. Therefore, this paper presents MotionCtrl, a unified and flexible motion controller for video generation designed to effectively and independently control camera and object motion. The architecture and training strategy of MotionCtrl are carefully devised, taking into account the inherent properties of camera motion, object motion, and imperfect training data. Compared to previous methods, MotionCtrl offers three main advantages: 1) It effectively and independently controls camera motion and object motion, enabling more fine-grained motion control and facilitating flexible and diverse combinations of both types of motion. 2) Its motion conditions are determined by camera poses and trajectories, which are appearance-free and minimally impact the appearance or shape of objects in generated videos. 3) It is a relatively generalizable model that can adapt to a wide array of camera poses and trajectories once trained. Extensive qualitative and quantitative experiments have been conducted to demonstrate the superiority of MotionCtrl over existing methods.

Not All Large Language Models (LLMs) Succumb to the “Reversal Curse”: A Comparative Study of Deductive Logical Reasoning in BERT and GPT Models

  • paper_url: http://arxiv.org/abs/2312.03633
  • repo_url: None
  • paper_authors: Jingye Yang, Da Wu, Kai Wang
  • for: examines the "reversal curse," in which autoregressive decoder LLMs trained on "A is B" fail to learn "B is A," and asks whether this basic failure of logical deduction is a red flag for general tasks such as constructing knowledge graphs.
  • methods: evaluates a bidirectional LLM (BERT) and finds it immune to the reversal curse. The study also assesses more complex deductive reasoning: models are first trained to master intersection and union operations on two sets, then tested on different combinations of union and intersection over three newly created sets.
  • results: both encoder and decoder language models handle two-set operations proficiently but struggle with three-set operations. This highlights the distinct characteristics of encoder and decoder models in simple versus complex logical reasoning; in practice, the choice between BERT and GPT should be guided by the specific requirements and nature of the task, leveraging their respective strengths in bidirectional context comprehension and sequence prediction.
    Abstract The "Reversal Curse" refers to the scenario where auto-regressive decoder large language models (LLMs), such as ChatGPT, trained on "A is B" fail to learn "B is A", demonstrating a basic failure of logical deduction. This raises a red flag in the use of GPT models for certain general tasks such as constructing knowledge graphs, considering their adherence to this symmetric principle. In our study, we examined a bidirectional LLM, BERT, and found that it is immune to the reversal curse. Driven by ongoing efforts to construct biomedical knowledge graphs with LLMs, we also embarked on evaluating more complex but essential deductive reasoning capabilities. This process included first training encoder and decoder language models to master the intersection ($\cap$) and union ($\cup$) operations on two sets and then moving on to assess their capability to infer different combinations of union ($\cup$) and intersection ($\cap$) operations on three newly created sets. The findings showed that while both encoder and decoder language models, trained for tasks involving two sets (union/intersection), were proficient in such scenarios, they encountered difficulties when dealing with operations that included three sets (various combinations of union and intersection). Our research highlights the distinct characteristics of encoder and decoder models in simple and complex logical reasoning. In practice, the choice between BERT and GPT should be guided by the specific requirements and nature of the task at hand, leveraging their respective strengths in bidirectional context comprehension and sequence prediction.

MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations

  • paper_url: http://arxiv.org/abs/2312.03631
  • repo_url: https://github.com/assafbk/mocha_code
  • paper_authors: Assaf Ben-Kish, Moran Yanuka, Morris Alper, Raja Giryes, Hadar Averbuch-Elor
  • for: improves the factual fidelity and semantic adequacy of image captions.
  • methods: uses reinforcement learning to address the sequence-level nature of caption hallucinations, with a multi-objective reward function that jointly optimizes fidelity and semantic adequacy without strong supervision.
  • results: MOCHa jointly optimizes fidelity and adequacy across captioning models of different scales, performs well in the open-vocabulary setting, and contributes OpenCHAIR, a new benchmark for quantifying open-vocabulary hallucinations.
    Abstract While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations, the generation of spurious details that cannot be inferred from the given image. Dedicated methods for reducing hallucinations in image captioning largely focus on closed-vocabulary object tokens, ignoring most types of hallucinations that occur in practice. In this work, we propose MOCHa, an approach that harnesses advancements in reinforcement learning (RL) to address the sequence-level nature of hallucinations in an open-world setup. To optimize for caption fidelity to the input image, we leverage ground-truth reference captions as proxies to measure the logical consistency of generated captions. However, optimizing for caption fidelity alone fails to preserve the semantic adequacy of generations; therefore, we propose a multi-objective reward function that jointly targets these qualities, without requiring any strong supervision. We demonstrate that these goals can be simultaneously optimized with our framework, enhancing performance for various captioning models of different scales. Our qualitative and quantitative results demonstrate MOCHa's superior performance across various established metrics. We also demonstrate the benefit of our method in the open-vocabulary setting. To this end, we contribute OpenCHAIR, a new benchmark for quantifying open-vocabulary hallucinations in image captioning models, constructed using generative foundation models. We will release our code, benchmark, and trained models.
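
A minimal sketch of a multi-objective caption reward as a weighted sum of a fidelity term and an adequacy term; the word-overlap scorers below are toy stand-ins for the paper's actual objectives (e.g., an NLI-based consistency score).

```python
# Minimal sketch: weighted multi-objective reward for RL caption training.
from typing import Callable

def mocha_reward(caption: str, reference: str,
                 fidelity: Callable[[str, str], float],
                 adequacy: Callable[[str, str], float],
                 w_fid=0.5, w_adq=0.5) -> float:
    return w_fid * fidelity(caption, reference) + w_adq * adequacy(caption, reference)

# Toy stand-ins based on word overlap, for illustration only.
overlap = lambda a, b: len(set(a.split()) & set(b.split())) / max(len(set(b.split())), 1)
print(mocha_reward("a dog on grass", "a dog runs on green grass",
                   fidelity=overlap,
                   adequacy=lambda c, r: min(len(c.split()) / 10, 1.0)))
```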

DreamComposer: Controllable 3D Object Generation via Multi-View Conditions

  • paper_url: http://arxiv.org/abs/2312.03611
  • repo_url: https://github.com/yhyang-myron/DreamComposer
  • paper_authors: Yunhan Yang, Yukun Huang, Xiaoyang Wu, Yuan-Chen Guo, Song-Hai Zhang, Hengshuang Zhao, Tong He, Xihui Liu
  • for: enhances existing view-aware diffusion models so they can generate controllable novel-view images.
  • methods: uses a view-aware 3D lifting module to obtain 3D representations of an object from multiple views, renders the latent features of the target view with a multi-view feature fusion module, and injects the target-view features into a pre-trained diffusion model to generate high-quality novel-view images.
  • results: experiments show that DreamComposer is compatible with state-of-the-art diffusion models for zero-shot novel view synthesis, enhancing them to generate high-fidelity novel-view images that accurately capture object shape and pose under multi-view conditions.
    Abstract Utilizing pre-trained 2D large-scale generative models, recent works are capable of generating high-quality novel views from a single in-the-wild image. However, due to the lack of information from multiple views, these works encounter difficulties in generating controllable novel views. In this paper, we present DreamComposer, a flexible and scalable framework that can enhance existing view-aware diffusion models by injecting multi-view conditions. Specifically, DreamComposer first uses a view-aware 3D lifting module to obtain 3D representations of an object from multiple views. Then, it renders the latent features of the target view from 3D representations with the multi-view feature fusion module. Finally the target view features extracted from multi-view inputs are injected into a pre-trained diffusion model. Experiments show that DreamComposer is compatible with state-of-the-art diffusion models for zero-shot novel view synthesis, further enhancing them to generate high-fidelity novel view images with multi-view conditions, ready for controllable 3D object reconstruction and various other applications.

DiffusionSat: A Generative Foundation Model for Satellite Imagery

  • paper_url: http://arxiv.org/abs/2312.03606
  • repo_url: None
  • paper_authors: Samar Khanna, Patrick Liu, Linqi Zhou, Chenlin Meng, Robin Rombach, Marshall Burke, David Lobell, Stefano Ermon
  • for: targets generative modeling of remote sensing data, which underpins important applications such as environmental monitoring and crop-yield prediction.
  • methods: trains DiffusionSat on a collection of publicly available, large, high-resolution remote sensing datasets, and introduces a conditioning scheme that uses associated metadata such as geolocation as conditioning information, since text captions are sparse for satellite images.
  • results: DiffusionSat generates realistic satellite images and solves multiple generative tasks, including temporal generation, super-resolution given multi-spectral inputs, and in-painting. It outperforms previous state-of-the-art models and is the first large-scale generative foundation model for satellite imagery.
    Abstract Diffusion models have achieved state-of-the-art results on many modalities including images, speech, and video. However, existing models are not tailored to support remote sensing data, which is widely used in important applications including environmental monitoring and crop-yield prediction. Satellite images are significantly different from natural images -- they can be multi-spectral, irregularly sampled across time -- and existing diffusion models trained on images from the Web do not support them. Furthermore, remote sensing data is inherently spatio-temporal, requiring conditional generation tasks not supported by traditional methods based on captions or images. In this paper, we present DiffusionSat, to date the largest generative foundation model trained on a collection of publicly available large, high-resolution remote sensing datasets. As text-based captions are sparsely available for satellite images, we incorporate the associated metadata such as geolocation as conditioning information. Our method produces realistic samples and can be used to solve multiple generative tasks including temporal generation, superresolution given multi-spectral inputs and in-painting. Our method outperforms previous state-of-the-art methods for satellite image generation and is the first large-scale $\textit{generative}$ foundation model for satellite imagery.
    摘要 各种扩散模型在多个频谱中已经达到了当前最佳结果,包括图像、语音和视频。然而,现有的模型没有针对卫星散射数据进行支持,这种数据广泛用于重要应用,如环境监测和作物产量预测。卫星图像与自然图像有很大差异,它们可能是多spectral,时间不规则采样,现有的扩散模型从网络上的图像进行训练不支持它们。此外,卫星散射数据是空间-时的,需要基于条件生成任务,而传统的方法基于标签或图像不支持。在这篇论文中,我们提出了DiffusionSat,迄今为止最大的基础模型,基于公共可用的大量高分辨率卫星散射数据进行训练。由于卫星图像的文本标签罕见,我们将关联 metadata,如地理位置作为条件信息。我们的方法生成的样本是真实的,可以用于解决多个生成任务,包括时间生成、基于多spectral输入的超分辨率、和填充。我们的方法超过了之前的最佳方法,并是首个大规模的卫星图像生成基础模型。
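The metadata conditioning described above lends itself to a compact illustration. Below is a hedged PyTorch sketch of one way numerical metadata (e.g., latitude, longitude, timestamp) could be embedded and injected alongside the diffusion timestep embedding; the field choices, sizes, and class names are illustrative assumptions, not DiffusionSat's actual architecture.

```python
import torch
import torch.nn as nn

class MetadataConditioning(nn.Module):
    """Embeds normalized scalar metadata fields and adds them to the timestep embedding."""
    def __init__(self, n_fields=4, dim=256):
        super().__init__()
        # One small MLP per metadata field (illustrative design choice).
        self.per_field = nn.ModuleList(
            [nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
             for _ in range(n_fields)]
        )

    def forward(self, metadata, t_emb):
        # metadata: (B, n_fields) normalized scalars; t_emb: (B, dim) timestep embedding
        cond = sum(f(metadata[:, i : i + 1]) for i, f in enumerate(self.per_field))
        return t_emb + cond  # combined embedding fed to the denoising U-Net

cond = MetadataConditioning()
out = cond(torch.rand(2, 4), torch.randn(2, 256))  # -> (2, 256)
```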

MMM: Generative Masked Motion Model

  • paper_url: http://arxiv.org/abs/2312.03596
  • repo_url: None
  • paper_authors: Ekkasit Pinyoanuntapong, Pu Wang, Minwoo Lee, Chen Chen
  • for: proposes a motion generation paradigm based on a Masked Motion Model (MMM) to resolve the trade-off between real-time performance, high fidelity, and motion editability in existing text-to-motion methods.
  • methods: two key components: (1) a motion tokenizer that converts 3D human motion into a sequence of discrete tokens in latent space, and (2) a conditional masked motion transformer that learns to predict randomly masked motion tokens conditioned on pre-computed text tokens (a minimal sketch of this masked-prediction step follows this entry).
  • results: extensive experiments on the HumanML3D and KIT-ML datasets show that MMM surpasses leading methods in both fidelity and speed, offers advanced editing features such as body-part modification, motion in-betweening, and long motion sequence synthesis, and runs two orders of magnitude faster than editable motion diffusion models on a single mid-range GPU.
    Abstract Recent advances in text-to-motion generation using diffusion and autoregressive models have shown promising results. However, these models often suffer from a trade-off between real-time performance, high fidelity, and motion editability. To address this gap, we introduce MMM, a novel yet simple motion generation paradigm based on Masked Motion Model. MMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into a sequence of discrete tokens in latent space, and (2) a conditional masked motion transformer that learns to predict randomly masked motion tokens, conditioned on the pre-computed text tokens. By attending to motion and text tokens in all directions, MMM explicitly captures inherent dependency among motion tokens and semantic mapping between motion and text tokens. During inference, this allows parallel and iterative decoding of multiple motion tokens that are highly consistent with fine-grained text descriptions, therefore simultaneously achieving high-fidelity and high-speed motion generation. In addition, MMM has innate motion editability. By simply placing mask tokens in the place that needs editing, MMM automatically fills the gaps while guaranteeing smooth transitions between editing and non-editing parts. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that MMM surpasses current leading methods in generating high-quality motion (evidenced by superior FID scores of 0.08 and 0.429), while offering advanced editing features such as body-part modification, motion in-betweening, and the synthesis of long motion sequences. In addition, MMM is two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models. Our project page is available at \url{https://exitudio.github.io/MMM-page}.
    摘要 近期,使用扩散和自适应模型进行文本到动作生成技术已经取得了良好的成果。然而,这些模型经常面临着实时性、高精度和动作可编辑性之间的牵扯。为了解决这个差距,我们介绍了MMM,一种新型但简单的动作生成模式,基于带有掩码的动作模型。MMM包括两个关键组件:(1)动作Tokenizer,将3D人体动作转换为离散的token在隐藏空间中,和(2)受控掩码动作变换器,学习预计掩码动作token,根据预计的文本token来进行条件预测。在推理过程中,MMM通过同时 attend to motion和文本token,从而显式地捕捉动作token之间的自然依赖关系,以及文本token和动作token之间的含义映射。在推理过程中,MMM可以并行地执行多个动作token,以实现高精度和高速的动作生成。此外,MMM内置了动作可编辑性。通过在需要编辑的地方放置掩码,MMM会自动填充缺失的部分,保证编辑和非编辑部分之间的平滑过渡。我们在HumanML3D和KIT-ML数据集上进行了广泛的实验, demonstarted that MMM surpasses current leading methods in generating high-quality motion(证明了FID分数为0.08和0.429),同时提供了高级编辑功能,如身体部分修改、动作卷积和长度动作序列的合成。此外,MMM在单个中等级GPU上两个数量级快于可编辑动作扩散模型。如果您想了解更多细节,请参考我们的项目页面:\url{https://exitudio.github.io/MMM-page}。
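The conditional masked-prediction step at the heart of MMM can be illustrated compactly. The following PyTorch sketch shows masked motion-token modeling with bidirectional attention over text and motion tokens; every module size and name here is an illustrative assumption rather than the authors' implementation, and the motion tokenizer is assumed to exist separately.

```python
import torch
import torch.nn as nn

class MaskedMotionTransformer(nn.Module):
    def __init__(self, vocab_size=512, dim=256, num_layers=4, max_len=196):
        super().__init__()
        self.mask_id = vocab_size                     # extra [MASK] token id
        self.embed = nn.Embedding(vocab_size + 1, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)  # bidirectional attention
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, motion_tokens, text_emb):
        # motion_tokens: (B, T) token ids; text_emb: (B, T_txt, dim) pre-computed text tokens
        x = self.embed(motion_tokens) + self.pos[:, : motion_tokens.size(1)]
        x = torch.cat([text_emb, x], dim=1)           # condition by prefixing text tokens
        h = self.encoder(x)[:, text_emb.size(1):]     # keep only motion positions
        return self.head(h)                           # logits over the motion vocabulary

def masked_training_step(model, tokens, text_emb, mask_ratio=0.5):
    # Randomly mask a fraction of motion tokens and predict the originals.
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
    corrupted = tokens.masked_fill(mask, model.mask_id)
    logits = model(corrupted, text_emb)
    return nn.functional.cross_entropy(logits[mask], tokens[mask])
```

At inference, the same bidirectional model supports parallel, iterative decoding: start from all-masked tokens and repeatedly fill in the most confident positions.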

Foundation Model Assisted Weakly Supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2312.03585
  • repo_url: https://github.com/HAL-42/FMA-WSSS
  • paper_authors: Xiaobo Yang, Xiaojin Gong
  • for: Addressing weakly supervised semantic segmentation (WSSS) using image-level labels.
  • methods: Leveraging pre-trained foundation models (CLIP and SAM) to generate high-quality segmentation seeds, and using a coarse-to-fine framework with multi-label contrastive loss and CAM activation loss to learn the prompts.
  • results: Achieving state-of-the-art performance on PASCAL VOC 2012 and competitive results on MS COCO 2014.
    Abstract This work aims to leverage pre-trained foundation models, such as contrastive language-image pre-training (CLIP) and segment anything model (SAM), to address weakly supervised semantic segmentation (WSSS) using image-level labels. To this end, we propose a coarse-to-fine framework based on CLIP and SAM for generating high-quality segmentation seeds. Specifically, we construct an image classification task and a seed segmentation task, which are jointly performed by CLIP with frozen weights and two sets of learnable task-specific prompts. A SAM-based seeding (SAMS) module is designed and applied to each task to produce either coarse or fine seed maps. Moreover, we design a multi-label contrastive loss supervised by image-level labels and a CAM activation loss supervised by the generated coarse seed map. These losses are used to learn the prompts, which are the only parts need to be learned in our framework. Once the prompts are learned, we input each image along with the learned segmentation-specific prompts into CLIP and the SAMS module to produce high-quality segmentation seeds. These seeds serve as pseudo labels to train an off-the-shelf segmentation network like other two-stage WSSS methods. Experiments show that our method achieves the state-of-the-art performance on PASCAL VOC 2012 and competitive results on MS COCO 2014.
    摘要 这个工作目标是利用预训练基础模型,如对比语言图像预训练(CLIP)和 segment anything 模型(SAM),来解决弱监督 semantic segmentation(WSSS)问题,使用图像级别标签。为此,我们提出了一个粗到细框架,基于 CLIP 和 SAM,用于生成高质量 segmentation 的种子。具体来说,我们构建了一个图像分类任务和一个种子 segmentation 任务,由冻结权重的 CLIP 和两组可学习的任务特定提示来共同执行。基于 SAM 的种子生成(SAMS)模块被设计并应用于每个任务,以生成粗略或精细的种子地图。此外,我们还设计了一个多标签对比损失,由图像级别标签监督,以及一个 CAM 激活损失,由生成的粗略种子地图监督。这些损失用于学习提示,提示是我们框架中唯一需要学习的部分。一旦提示学习完毕,我们将每个图像与学习的 segmentation 特定提示输入到 CLIP 和 SAMS 模块中,生成高质量 segmentation 种子。这些种子作为伪标签,用于训练现成的分割网络,如其他两阶段 WSSS 方法。实验显示,我们的方法在 PASCAL VOC 2012 上达到了最先进的性能,并在 MS COCO 2014 上取得了有竞争力的结果。

Invariance & Causal Representation Learning: Prospects and Limitations

  • paper_url: http://arxiv.org/abs/2312.03580
  • repo_url: None
  • paper_authors: Simon Bing, Jonas Wahl, Urmi Ninad, Jakob Runge
  • for: studies the invariance of mechanisms in causal models.
  • methods: uses theoretical impossibility results together with practical considerations to examine whether mechanism invariance suffices to identify latent causal variables.
  • results: finds that invariance alone is insufficient to identify latent causal variables, highlighting the need for additional constraints to identify representations.
    Abstract In causal models, a given mechanism is assumed to be invariant to changes of other mechanisms. While this principle has been utilized for inference in settings where the causal variables are observed, theoretical insights when the variables of interest are latent are largely missing. We assay the connection between invariance and causal representation learning by establishing impossibility results which show that invariance alone is insufficient to identify latent causal variables. Together with practical considerations, we use these theoretical findings to highlight the need for additional constraints in order to identify representations by exploiting invariance.
    摘要 在 causal 模型中,一个给定的机制被假设为对其他机制的变化保持不变。虽然这一原则已被用于可观测 causal 变量情形下的推理,但当感兴趣的变量是 latent 时,相应的理论认识仍然大量缺失。我们通过建立不可能性结果来考察不变性与 causal 表示学习之间的联系,这些结果表明仅凭不变性不足以识别 latent causal 变量。结合实际考虑,我们利用这些理论发现来强调:要通过利用不变性来识别表示,还需要额外的约束。

Generalization to New Sequential Decision Making Tasks with In-Context Learning

  • paper_url: http://arxiv.org/abs/2312.03801
  • repo_url: None
  • paper_authors: Sharath Chandra Raparthy, Eric Hambro, Robert Kirk, Mikael Henaff, Roberta Raileanu
  • for: addresses the long-standing problem of training autonomous agents that can learn new tasks from only a handful of demonstrations.
  • methods: shows with an illustrative example that naively applying transformers to sequential decision making does not enable in-context learning of new tasks; instead, the authors train on sequences of trajectories with certain distributional properties, which does enable in-context learning of new sequential decision making tasks.
  • results: larger model and dataset sizes, greater task diversity, environment stochasticity, and trajectory burstiness all improve in-context learning of new out-of-distribution tasks; trained on large diverse offline datasets, the model learns new MiniHack and Procgen tasks from a handful of demonstrations without any weight updates.
    Abstract Training autonomous agents that can learn new tasks from only a handful of demonstrations is a long-standing problem in machine learning. Recently, transformers have been shown to learn new language or vision tasks without any weight updates from only a few examples, also referred to as in-context learning. However, the sequential decision making setting poses additional challenges having a lower tolerance for errors since the environment's stochasticity or the agent's actions can lead to unseen, and sometimes unrecoverable, states. In this paper, we use an illustrative example to show that naively applying transformers to sequential decision making problems does not enable in-context learning of new tasks. We then demonstrate how training on sequences of trajectories with certain distributional properties leads to in-context learning of new sequential decision making tasks. We investigate different design choices and find that larger model and dataset sizes, as well as more task diversity, environment stochasticity, and trajectory burstiness, all result in better in-context learning of new out-of-distribution tasks. By training on large diverse offline datasets, our model is able to learn new MiniHack and Procgen tasks without any weight updates from just a handful of demonstrations.
    摘要 培训能够仅凭少量示例学习新任务的自主代理是机器学习领域的长期问题。最近,transformers 被证明可以仅凭少量示例学习新的语言或视觉任务,而无需任何权重更新,这也被称为上下文内学习(in-context learning)。然而,顺序决策设置带来了额外挑战:其对错误的容忍度更低,因为环境的随机性或代理的动作可能导致未见过的、有时无法恢复的状态。在这篇论文中,我们通过一个示例说明,直接将 transformers 应用于顺序决策问题并不能实现新任务的上下文内学习。随后,我们展示了在具有某些分布性质的轨迹序列上训练可以实现新顺序决策任务的上下文内学习。我们研究了不同的设计选择,发现更大的模型和数据集规模、更高的任务多样性、环境随机性和轨迹突发性(burstiness)都会带来对分布外新任务更好的上下文内学习。通过在大型多样化的离线数据集上训练,我们的模型仅凭少量示例即可学习新的 MiniHack 和 Procgen 任务,无需任何权重更新。

GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models

  • paper_url: http://arxiv.org/abs/2312.03543
  • repo_url: https://github.com/petrichor625/talk2car_cavg
  • paper_authors: Haicheng Liao, Huanming Shen, Zhenning Li, Chengyue Wang, Guofa Li, Yiming Bie, Chengzhong Xu
  • for: This paper aims to improve the ability of autonomous vehicles (AVs) to understand and execute visual commands in a visual context.
  • methods: The authors propose a sophisticated encoder-decoder framework called Context-Aware Visual Grounding (CAVG), which integrates five core encoders (Text, Image, Context, and Cross-Modal) with a Multimodal decoder. The model is trained using state-of-the-art Large Language Models (LLMs) and incorporates multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation.
  • results: The CAVG model achieves new standards in prediction accuracy and operational efficiency on the Talk2Car dataset, a real-world benchmark. It demonstrates exceptional performance even with limited training data, and shows remarkable robustness and adaptability in challenging scenarios such as long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather conditions, and densely populated urban environments.
    Abstract In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This paper introduces a sophisticated encoder-decoder framework, developed to address visual grounding in AVs.Our Context-Aware Visual Grounding (CAVG) model is an advanced system that integrates five core encoders-Text, Image, Context, and Cross-Modal-with a Multimodal decoder. This integration enables the CAVG model to adeptly capture contextual semantics and to learn human emotional features, augmented by state-of-the-art Large Language Models (LLMs) including GPT-4. The architecture of CAVG is reinforced by the implementation of multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation. This architectural design enables the model to efficiently process and interpret a range of cross-modal inputs, yielding a comprehensive understanding of the correlation between verbal commands and corresponding visual scenes. Empirical evaluations on the Talk2Car dataset, a real-world benchmark, demonstrate that CAVG establishes new standards in prediction accuracy and operational efficiency. Notably, the model exhibits exceptional performance even with limited training data, ranging from 50% to 75% of the full dataset. This feature highlights its effectiveness and potential for deployment in practical AV applications. Moreover, CAVG has shown remarkable robustness and adaptability in challenging scenarios, including long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather conditions, and densely populated urban environments. The code for the proposed model is available at our Github.
    摘要 在自动驾驶车(AV)领域,正确地理解指挥官意图并在视觉上发出语言命令是一项重要挑战。本文介绍了一种高级的encoder-decoder框架,用于解决AV中的视觉定位。我们的 Context-Aware Visual Grounding(CAVG)模型包括五种核心encoder——Text、Image、Context、Cross-Modal——以及一个Multimodal decoder。这种整合使得CAVG模型能够很好地捕捉Contextual semantics,并通过使用现代大语言模型(LLMs),包括GPT-4,学习人类情感特征。CAVG模型的architecture被强化了多头跨模态注意力机制和Region-Specific Dynamic(RSD)层 для注意力调整。这种建立的建筑使得模型能够有效地处理和解释多种跨模态输入,从而获得视觉上的command和对应的语言命令之间的关系。实验证明,CAVG在Talk2Car数据集上达到了新的标准,并且在有限的训练数据上达到了出色的表现。此外,CAVG模型在具有挑战性的场景中也表现出了杰出的Robustness和适应性,包括长文本命令解释、低光照条件、不确定的指挥官上下文、不好的天气条件和拥挤的城市环境。CAVG模型的代码可以在我们的Github上获得。

Low-power, Continuous Remote Behavioral Localization with Event Cameras

  • paper_url: http://arxiv.org/abs/2312.03799
  • repo_url: None
  • paper_authors: Friedhelm Hamann, Suman Ghosh, Ignacio Juarez Martinez, Tom Hart, Alex Kacelnik, Guillermo Gallego
  • for: develops a reliable computer-vision method for remote wildlife observation, automating the quantification of animal behavior.
  • methods: uses event cameras, whose low power consumption and high dynamic range suit battery-dependent remote monitoring, to record a colony of breeding Chinstrap penguins in Antarctica over several weeks; the target behavior (ecstatic display) is formulated as a temporal action detection task over event data labeled on 16 nests, with a generator of candidate time intervals (proposals) and a classifier of the actions within them.
  • results: the event cameras' natural response to motion proves effective for continuous behavior monitoring and detection, reaching 58% mAP (63% in good weather) and showing robustness to the varied lighting conditions in the challenging dataset; the low-power sensor records three times longer than a conventional camera, pioneering event cameras for remote wildlife observation.
    Abstract Researchers in natural science need reliable methods for quantifying animal behavior. Recently, numerous computer vision methods emerged to automate the process. However, observing wild species at remote locations remains a challenging task due to difficult lighting conditions and constraints on power supply and data storage. Event cameras offer unique advantages for battery-dependent remote monitoring due to their low power consumption and high dynamic range capabilities. We use this novel sensor to quantify a behavior in Chinstrap penguins called ecstatic display. We formulate the problem as a temporal action detection task, determining the start and end times of the behavior. For this purpose, we recorded a colony of breeding penguins in Antarctica during several weeks and labeled event data on 16 nests. The developed method consists of a generator of candidate time intervals (proposals) and a classifier of the actions within them. The experiments show that the event cameras' natural response to motion is effective for continuous behavior monitoring and detection, reaching a mean average precision (mAP) of 58% (which increases to 63% in good weather conditions). The results also demonstrate the robustness against various lighting conditions contained in the challenging dataset. The low-power capabilities of the event camera allows to record three times longer than with a conventional camera. This work pioneers the use of event cameras for remote wildlife observation, opening new interdisciplinary opportunities. https://tub-rip.github.io/eventpenguins/

On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm

  • paper_url: http://arxiv.org/abs/2312.03526
  • repo_url: None
  • paper_authors: Peng Sun, Bei Shi, Daiwei Yu, Tao Lin
  • for: improving the efficiency and practicality of dataset distillation for large-scale real-world applications.
  • methods: the proposed method, RDED, focuses on three key properties (realism, diversity, and efficiency) and uses a novel computationally-efficient approach to distill large datasets.
  • results: RDED distills the full ImageNet-1K to a small dataset of 10 images per class within 7 minutes, achieving 42% top-1 accuracy with ResNet-18 on a single GPU and outperforming the state of the art.
    Abstract Contemporary machine learning requires training large neural networks on massive datasets and thus faces the challenges of high computational demands. Dataset distillation, as a recent emerging strategy, aims to compress real-world datasets for efficient training. However, this line of research currently struggle with large-scale and high-resolution datasets, hindering its practicality and feasibility. To this end, we re-examine the existing dataset distillation methods and identify three properties required for large-scale real-world applications, namely, realism, diversity, and efficiency. As a remedy, we propose RDED, a novel computationally-efficient yet effective data distillation paradigm, to enable both diversity and realism of the distilled data. Extensive empirical results over various neural architectures and datasets demonstrate the advancement of RDED: we can distill the full ImageNet-1K to a small dataset comprising 10 images per class within 7 minutes, achieving a notable 42% top-1 accuracy with ResNet-18 on a single RTX-4090 GPU (while the SOTA only achieves 21% but requires 6 hours).
    摘要 现代机器学习需要训练大型神经网络,因此面临高计算需求的挑战。 dataset distillation 作为一种新兴策略,目的是压缩现实世界数据集,以便高效地训练。然而,这一研究现在受到大规模高分辨率数据集的限制,使其实际性和可行性受到挑战。为此,我们重新审视现有的 dataset distillation 方法,并确定了大规模实际应用中需要的三个属性, namely,realism, diversity, and efficiency。为了解决这些问题,我们提议 RDED,一种新的计算效率高, yet effective 数据压缩 paradigm,以实现数据的多样性和真实性。我们的实验结果表明,RDED 可以在 7 分钟内,将整个 ImageNet-1K 数据集压缩成 10 张图像每个类型的小数据集,并在 ResNet-18 上 achieved 42% top-1 准确率(而 SOTA 只能达到 21%,并需要 6 小时)。

Multi-Scale and Multi-Modal Contrastive Learning Network for Biomedical Time Series

  • paper_url: http://arxiv.org/abs/2312.03796
  • repo_url: None
  • paper_authors: Hongbo Guo, Xinzi Xu, Hao Wu, Guoxing Wang
  • for: proposes a representation learning model for multi-modal biomedical time series (MBTS) data, handling the inherent noise and distribution gaps across modalities.
  • methods: a multi-scale and multi-modal biomedical time series representation learning (MBSL) network with contrastive learning: modalities are grouped by inter-modal distances and modeled by per-group encoders, multi-scale feature extraction uses varied patch lengths and mask ratios, and cross-modal contrastive learning maximizes consistency among inter-modal groups.
  • results: MBSL outperforms state-of-the-art models by 33.9% MAE on respiration rate, 13.8% MAE on exercise heart rate, 1.41% accuracy on human activity recognition, and 1.14% F1-score on obstructive sleep apnea-hypopnea syndrome detection.
    Abstract Multi-modal biomedical time series (MBTS) data offers a holistic view of the physiological state, holding significant importance in various bio-medical applications. Owing to inherent noise and distribution gaps across different modalities, MBTS can be complex to model. Various deep learning models have been developed to learn representations of MBTS but still fall short in robustness due to the ignorance of modal-to-modal variations. This paper presents a multi-scale and multi-modal biomedical time series representation learning (MBSL) network with contrastive learning to migrate these variations. Firstly, MBTS is grouped based on inter-modal distances, then each group with minimum intra-modal variations can be effectively modeled by individual encoders. Besides, to enhance the multi-scale feature extraction (encoder), various patch lengths and mask ratios are designed to generate tokens with semantic information at different scales and diverse contextual perspectives respectively. Finally, cross-modal contrastive learning is proposed to maximize consistency among inter-modal groups, maintaining useful information and eliminating noises. Experiments against four bio-medical applications show that MBSL outperforms state-of-the-art models by 33.9% mean average errors (MAE) in respiration rate, by 13.8% MAE in exercise heart rate, by 1.41% accuracy in human activity recognition, and by 1.14% F1-score in obstructive sleep apnea-hypopnea syndrome.
    摘要 多Modal生物医学时间序列数据(MBTS)具有整体生理状态的全面视图,在各种生物医学应用中具有重要意义。然而,由于不同modalities之间的附加噪声和分布差异,MBTS可能会变得复杂。为了学习MBTS的表示,各种深度学习模型已经被开发出来,但仍然缺乏robustness,即因为忽略不同modalities之间的变化。这篇论文提出了一种多尺度和多Modal生物医学时间序列表示学习(MBSL)网络,使用对比学习来迁移这些变化。首先,MBTS被分组 Based on inter-modal distances,然后每个组的最小内Modal差异可以被个性化Encoder模型有效地模型。此外,为了增强多尺度特征提取(Encoder),各种patch长度和mask比例被设计出来,以生成具有Semantic信息的Token在不同的尺度和多种文脉上。最后,跨Modal对比学习被提出,以最大化inter-Modal组的一致性,保留有用信息,并消除噪声。对四种生物医学应用进行了实验,研究发现,MBSL比State-of-the-art模型提高33.9%的 Mean Average Error(MAE)、13.8%的 Exercise Heart Rate MAE、1.41%的 Human Activity Recognition Accuracy和1.14%的 Obstructive Sleep Apnea-Hypopnea Syndrome F1 Score。

Optimal Wildfire Escape Route Planning for Drones under Dynamic Fire and Smoke

  • paper_url: http://arxiv.org/abs/2312.03521
  • repo_url: None
  • paper_authors: Chang Liu, Tamas Sziranyi
  • for: aid wildfire management efforts by planning an optimal escape route for drones
  • methods: use information fusion between UAV and satellite, multi-channel remote sensing data, UAV vision technology, and an improved A* algorithm (a generic weighted A* sketch follows this entry)
  • results: enhance the safety and efficiency of drone operations in wildfire environments by considering dynamic fire and smoke models
    Abstract In recent years, the increasing prevalence and intensity of wildfires have posed significant challenges to emergency response teams. The utilization of unmanned aerial vehicles (UAVs), commonly known as drones, has shown promise in aiding wildfire management efforts. This work focuses on the development of an optimal wildfire escape route planning system specifically designed for drones, considering dynamic fire and smoke models. First, the location of the source of the wildfire can be well located by information fusion between UAV and satellite, and the road conditions in the vicinity of the fire can be assessed and analyzed using multi-channel remote sensing data. Second, the road network can be extracted and segmented in real time using UAV vision technology, and each road in the road network map can be given priority based on the results of road condition classification. Third, the spread model of dynamic fires calculates the new location of the fire source based on the fire intensity, wind speed and direction, and the radius increases as the wildfire spreads. Smoke is generated around the fire source to create a visual representation of a burning fire. Finally, based on the improved A* algorithm, which considers all the above factors, the UAV can quickly plan an escape route based on the starting and destination locations that avoid the location of the fire source and the area where it is spreading. By considering dynamic fire and smoke models, the proposed system enhances the safety and efficiency of drone operations in wildfire environments.
    摘要 近年来,野火的发生和扩散的情况日益严重,对抢救队伍提出了极大的挑战。使用无人飞行器(UAV)的应用显示了帮助野火管理的潜在优势。本工作关注于基于UAV的野火逃生路径规划系统的开发,考虑了动态火焰和烟雾模型。首先,通过UAV和卫星信息融合,可以准确地确定野火的起点位置。其次,通过多通道远程感知技术,在野火附近地区实时提取和分类道路网络地图,并将每条道路在道路网络地图中分配优先级。第三,根据动态火焰扩散模型,计算新的火源位置,以及火焰强度、风速和方向。烟雾在火源周围生成,创造一个燃烧火的视觉表现。最后,基于改进的A*算法,考虑了以上因素,UAV快速计划逃生路径,避免火源位置和扩散的地区。由于考虑了动态火焰和烟雾模型,提出的系统提高了无人机在野火环境中的安全性和效率。
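For readers unfamiliar with the planning component, here is a generic weighted A* sketch over a grid with per-cell traversal costs and a set of cells blocked by fire. It is a minimal stand-in, assuming 4-connected movement and a Euclidean heuristic; the paper's improved A* additionally accounts for dynamic fire spread and road-condition priorities.

```python
import heapq
import itertools
import math

def weighted_a_star(grid_cost, start, goal, blocked, w=1.5):
    """grid_cost[r][c]: per-cell traversal cost; blocked: set of burning cells;
    w >= 1 inflates the heuristic, trading optimality for search speed."""
    rows, cols = len(grid_cost), len(grid_cost[0])
    h = lambda p: math.hypot(p[0] - goal[0], p[1] - goal[1])  # straight-line heuristic
    tie = itertools.count()                                   # avoids comparing nodes on f-ties
    open_heap = [(w * h(start), next(tie), 0.0, start, None)]
    parent, g_best = {}, {start: 0.0}
    while open_heap:
        _, _, g, node, par = heapq.heappop(open_heap)
        if node in parent:
            continue                                          # already expanded
        parent[node] = par
        if node == goal:                                      # walk parents back to start
            path = [node]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (node[0] + dr, node[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols) or nxt in blocked:
                continue
            ng = g + grid_cost[nxt[0]][nxt[1]]
            if ng < g_best.get(nxt, math.inf):
                g_best[nxt] = ng
                heapq.heappush(open_heap, (ng + w * h(nxt), next(tie), ng, nxt, node))
    return None  # goal unreachable without crossing fire
```

Under a dynamic fire model, the `blocked` set and cell costs would be refreshed as the fire spreads and the route re-planned from the current position.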

Defense Against Adversarial Attacks using Convolutional Auto-Encoders

  • paper_url: http://arxiv.org/abs/2312.03520
  • repo_url: None
  • paper_authors: Shreyasi Mandal
  • for: enhancing the robustness of targeted classifier models against adversarial attacks.
  • methods: a convolutional autoencoder-based approach that counters adversarial perturbations by generating images closely resembling the inputs (a minimal sketch follows this entry).
  • results: restores the classification accuracy of the targeted model.
    Abstract Deep learning models, while achieving state-of-the-art performance on many tasks, are susceptible to adversarial attacks that exploit inherent vulnerabilities in their architectures. Adversarial attacks manipulate the input data with imperceptible perturbations, causing the model to misclassify the data or produce erroneous outputs. This work is based on enhancing the robustness of targeted classifier models against adversarial attacks. To achieve this, an convolutional autoencoder-based approach is employed that effectively counters adversarial perturbations introduced to the input images. By generating images closely resembling the input images, the proposed methodology aims to restore the model's accuracy.
    摘要 深度学习模型,可以达到许多任务的状态前沿性表现,但受到针对性攻击的威胁。这些攻击通过 manipulate 输入数据中的微scopic 变化,使模型错分或生成错误的输出。这项工作是基于增强目标分类器模型对针对性攻击的Robustness。为达到这一目标,我们采用了一种基于卷积 autoencoder 的方法,可以有效对输入图像中的针对性攻击进行应对。通过生成与输入图像几乎相同的图像,我们的方法希望可以恢复模型的准确性。
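A minimal sketch of the defense idea, assuming a PyTorch setup: a convolutional autoencoder learns to reconstruct clean images, and at test time inputs pass through it before the (separately trained) classifier. Architecture sizes are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class DenoisingConvAE(nn.Module):
    """Maps (possibly perturbed) images back toward clean ones."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training (omitted) would minimize e.g. MSE between the reconstruction of a
# perturbed input and its clean original. At test time, reconstruction acts
# as a purification step before classification:
def purify_then_classify(autoencoder, classifier, x_adv):
    with torch.no_grad():
        return classifier(autoencoder(x_adv))
```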

Active Wildfires Detection and Dynamic Escape Routes Planning for Humans through Information Fusion between Drones and Satellites

  • paper_url: http://arxiv.org/abs/2312.03519
  • repo_url: None
  • paper_authors: Chang Liu, Tamas Sziranyi
  • for: proposes dynamic escape-route planning for people in distress, fusing UAV vision with satellite image analysis to detect active wildfires, locate fire sources and burning areas, and extract road networks in wildfire zones in real time.
  • methods: fire-source localization and smoke/flame segmentation from Sentinel-2 imagery; road segmentation and road-condition assessment in the central fire area via D-LinkNet and NDVI values from the UAV; real-time dynamic optimal route planning for humans with a weighted A* algorithm under a dynamic fire-spread model.
  • results: in a case study of the Chongqing wildfire of August 24, 2022, the dynamic route-planning algorithm provides an optimal real-time escape path for humans in the presence of fire through UAV-satellite information fusion.
    Abstract UAVs are playing an increasingly important role in the field of wilderness rescue by virtue of their flexibility. This paper proposes a fusion of UAV vision technology and satellite image analysis technology for active wildfires detection and road networks extraction of wildfire areas and real-time dynamic escape route planning for people in distress. Firstly, the fire source location and the segmentation of smoke and flames are targeted based on Sentinel 2 satellite imagery. Secondly, the road segmentation and the road condition assessment are performed by D-linkNet and NDVI values in the central area of the fire source by UAV. Finally, the dynamic optimal route planning for humans in real time is performed by the weighted A* algorithm in the road network with the dynamic fire spread model. Taking the Chongqing wildfire on August 24, 2022, as a case study, the results demonstrate that the dynamic escape route planning algorithm can provide an optimal real-time navigation path for humans in the presence of fire through the information fusion of UAVs and satellites.
    摘要 UAVs 在野外搜救中发挥越来越重要的作用,尤其是因为它们的灵活性。本文提出了结合 UAV 视觉技术和卫星图像分析技术,实时计算 wildfires 的发生地点和燃烧区域的道路网络抽取,以及在人员受损时的实时最优路径规划。首先,通过 sentinel 2 卫星图像,定位火源位置和烟雾颗粒的分 segmentation。其次,通过 D-linkNet 和 NDVI 值在中心地域的火源位置,进行道路分 segmentation 和道路状况评估。最后,在路网中,使用加权 A\* 算法,在实时火势模型的基础上,为人员在火灾中提供最优的实时导航路径。以2022年8月24日的重庆野火为例,结果表明,动态逃生路径规划算法可以在 UAV 和卫星信息融合的情况下,为人员在火灾中提供最优的实时导航路径。

FRDiff: Feature Reuse for Exquisite Zero-shot Acceleration of Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.03517
  • repo_url: None
  • paper_authors: Junhyuk So, Jungwon Lee, Eunhyeok Park
  • for: improving the computational efficiency of diffusion models to broaden their adoption.
  • methods: exploits the temporal redundancy between denoising steps by reusing feature maps with high temporal similarity, reducing computation without sacrificing output quality (see the sketch after this entry).
  • results: the proposed FRDiff achieves a Pareto frontier balancing fidelity and latency trade-offs across various generative tasks.
    Abstract The substantial computational costs of diffusion models, particularly due to the repeated denoising steps crucial for high-quality image generation, present a major obstacle to their widespread adoption. While several studies have attempted to address this issue by reducing the number of score function evaluations using advanced ODE solvers without fine-tuning, the decreased number of denoising iterations misses the opportunity to update fine details, resulting in noticeable quality degradation. In our work, we introduce an advanced acceleration technique that leverages the temporal redundancy inherent in diffusion models. Reusing feature maps with high temporal similarity opens up a new opportunity to save computation without sacrificing output quality. To realize the practical benefits of this intuition, we conduct an extensive analysis and propose a novel method, FRDiff. FRDiff is designed to harness the advantages of both reduced NFE and feature reuse, achieving a Pareto frontier that balances fidelity and latency trade-offs in various generative tasks.
    摘要 Diffusion模型的计算成本很高,尤其是因为高质量图像生成需要多次减雑步骤。虽然一些研究已经尝试通过降低得分函数评估数量使用高级ODE解决方案来降低计算成本,但是减少减雑迭代数会错过更新细节,导致图像质量下降。在我们的工作中,我们介绍了一种高级加速技术,利用Diffusion模型内置的时间重复性。重用时间相似的特征图opens up a new opportunity to save computation without sacrificing output quality。为了实现这个理念的实用效果,我们进行了广泛的分析并提出了一种新方法,FRDiff。 FRDiff旨在利用减少NFE和特征重用的优点,实现多种生成任务中的平衡质量和延迟交易。
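The feature-reuse idea can be sketched as a wrapper that caches a block's output and returns it when the next step's input is temporally similar. The similarity test below (relative change in norm against a fixed threshold) is an illustrative assumption, not the paper's exact reuse criterion.

```python
import torch

class FeatureReuseBlock(torch.nn.Module):
    """Wraps an expensive sub-module and reuses its cached output across
    temporally redundant denoising steps (inference-time sketch)."""
    def __init__(self, block, rel_tol=0.05):
        super().__init__()
        self.block, self.rel_tol = block, rel_tol
        self._last_in, self._last_out = None, None

    def forward(self, x):
        if self._last_in is not None:
            rel_change = (x - self._last_in).norm() / (self._last_in.norm() + 1e-8)
            if rel_change < self.rel_tol:          # temporally redundant step
                return self._last_out              # reuse the cached feature map
        out = self.block(x)
        self._last_in, self._last_out = x.detach(), out.detach()
        return out

# Usage: wrap heavy sub-modules of a denoiser, then run the sampling loop as usual.
heavy = torch.nn.Sequential(torch.nn.Conv2d(4, 4, 3, padding=1), torch.nn.SiLU())
reused = FeatureReuseBlock(heavy)
x = torch.randn(1, 4, 32, 32)
for _ in range(10):                                # stand-in for denoising steps
    x = 0.99 * reused(x)                           # consecutive inputs stay similar
```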

Speculative Exploration on the Concept of Artificial Agents Conducting Autonomous Research

  • paper_url: http://arxiv.org/abs/2312.03497
  • repo_url: https://github.com/t46/research-automation-perspective-paper
  • paper_authors: Shiro Takagi
  • for: a speculative exploration of the concept of an artificial agent capable of conducting research.
  • methods: first characterizes what research is conceptually, as a starting point for discussion; then examines its core components (question formulation, hypothesis generation, and hypothesis verification), including the potential and challenges of having machines perform these tasks autonomously.
  • results: briefly discusses the overlapping themes and interconnections among these components, and presents preliminary thoughts on prototyping as an initial step toward uncovering the challenges of developing research-capable agents.
    Abstract This paper engages in a speculative exploration of the concept of an artificial agent capable of conducting research. Initially, it examines how the act of research can be conceptually characterized, aiming to provide a starting point for discussions about what it means to create such agents. The focus then shifts to the core components of research: question formulation, hypothesis generation, and hypothesis verification. This discussion includes a consideration of the potential and challenges associated with enabling machines to autonomously perform these tasks. Subsequently, this paper briefly considers the overlapping themes and interconnections that underlie them. Finally, the paper presents preliminary thoughts on prototyping as an initial step towards uncovering the challenges involved in developing these research-capable agents.
    摘要 这篇论文展开了一种人工智能可以进行研究的概念。最初,它描述了研究的概念,以便提供讨论的起点。然后,它转向研究的核心组件:问题定义、假设生成和假设验证。这个讨论包括机器自动执行这些任务的潜力和挑战。接着,这篇论文简要介绍了这些主题之间的重叠点和联系。最后,它提供了初步思想,以便开始评估在开发这些研究能力的机器人时遇到的挑战。

Learning From Scenarios for Stochastic Repairable Scheduling

  • paper_url: http://arxiv.org/abs/2312.03492
  • repo_url: https://github.com/kimvandenhouten/learning-from-scenarios-for-repairable-stochastic-scheduling
  • paper_authors: Kim van den Houten, David M. J. Tax, Esteban Freydell, Mathijs de Weerdt
  • for: Linear objective optimization with uncertain parameter values in a stochastic scheduling problem.
  • methods: Decision-focused learning with stochastic smoothing to adapt existing techniques to the scheduling problem.
  • results: Extensive experimental evaluation to compare the performance of decision-focused learning with the state of the art for scenario-based stochastic optimization.
    Abstract When optimizing problems with uncertain parameter values in a linear objective, decision-focused learning enables end-to-end learning of these values. We are interested in a stochastic scheduling problem, in which processing times are uncertain, which brings uncertain values in the constraints, and thus repair of an initial schedule may be needed. Historical realizations of the stochastic processing times are available. We show how existing decision-focused learning techniques based on stochastic smoothing can be adapted to this scheduling problem. We include an extensive experimental evaluation to investigate in which situations decision-focused learning outperforms the state of the art for such situations: scenario-based stochastic optimization.
    摘要 当优化具有不确定参数值的线性目标问题时,决策关注学习可以实现端到端学习这些值。我们关注一个随机处理时间的调度问题,处理时间具有随机性,这给约束带来不确定值,因此可能需要修复初始调度。历史实现的随机处理时间数据可用。我们展示了基于随机平滑的现有决策关注学习技术如何适配到这个调度问题。我们进行了广泛的实验评估,以研究在哪些情况下决策关注学习优于基于场景的随机优化这一最新方法。

JAMMIN-GPT: Text-based Improvisation using LLMs in Ableton Live

  • paper_url: http://arxiv.org/abs/2312.03479
  • repo_url: https://github.com/supersational/jammin-gpt
  • paper_authors: Sven Hollowell, Tashi Namgyal, Paul Marshall
  • for: lets Ableton Live users create MIDI clips by naming them with musical descriptions, so they can stay in the flow of their creative process while quickly generating musical ideas.
  • methods: the system prompts ChatGPT to reply in one of several text-based musical formats, such as ABC notation, chord symbols, or drum tablature, and inserts the result directly into Ableton's clip view (a minimal prompting sketch follows this entry).
  • results: an important step toward integrating generative AI tools into pre-existing musical workflows, potentially valuable for content makers who prefer to express their creative vision through descriptive language.
    Abstract We introduce a system that allows users of Ableton Live to create MIDI-clips by naming them with musical descriptions. Users can compose by typing the desired musical content directly in Ableton's clip view, which is then inserted by our integrated system. This allows users to stay in the flow of their creative process while quickly generating musical ideas. The system works by prompting ChatGPT to reply using one of several text-based musical formats, such as ABC notation, chord symbols, or drum tablature. This is an important step in integrating generative AI tools into pre-existing musical workflows, and could be valuable for content makers who prefer to express their creative vision through descriptive language. Code is available at https://github.com/supersational/JAMMIN-GPT.
    摘要 我们介绍一个系统,让Ableton Live用户可以通过 Musical descriptions 名称 MIDI-clip。用户可以在Ableton的 clip view 中直接输入 Desired musical content,我们的整合系统将其插入。这使用户可以保持创作过程中的流动性,快速生成 musical ideas。系统工作方式是通过请求 ChatGPT 回答使用一些文本基于的 Musical formats,例如 ABC notation、chord symbols 或 drum tablature。这是统合生成 AI 工具到现有的 Musical workflows 的重要一步,可能对内容制作者有价值,他们可能 prefer 通过描述性语言表达创作意义。代码可以在 https://github.com/supersational/JAMMIN-GPT 获取。
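A minimal sketch of the core prompting step, assuming the OpenAI chat completions API; the prompt wording, model name, and function are illustrative stand-ins (the actual system, including MIDI parsing and Ableton integration, lives in the linked repo).

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_to_abc(description: str) -> str:
    """Ask a chat model for a tune in ABC notation matching a clip description."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "You are a composer. Reply only with a valid tune in ABC notation."},
            {"role": "user", "content": f"Write a short clip: {description}"},
        ],
    )
    return response.choices[0].message.content

# The returned ABC text would then be parsed into MIDI notes for the clip
# (parsing and Ableton integration omitted here).
print(describe_to_abc("laid-back jazzy bassline in F minor, 4 bars"))
```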

Molecule Joint Auto-Encoding: Trajectory Pretraining with 2D and 3D Diffusion

  • paper_url: http://arxiv.org/abs/2312.03475
  • repo_url: None
  • paper_authors: Weitao Du, Jiujiu Chen, Xuecang Zhang, Zhiming Ma, Shengchao Liu
  • for: advancing AI for drug discovery at the intersection of machine learning and chemistry, where the geometrical representation of molecules is the main bottleneck.
  • methods: a pretraining method, Molecule Joint Auto-Encoding (MoleculeJAE), that learns both 2D bond (topology) and 3D conformation (geometry) information; a diffusion process models the augmented trajectories of the two modalities, from which the inherent chemical structure is learned in a self-supervised manner.
  • results: MoleculeJAE reaches state-of-the-art performance on 15 of 20 tasks against 12 competitive baselines.
    Abstract Recently, artificial intelligence for drug discovery has raised increasing interest in both machine learning and chemistry domains. The fundamental building block for drug discovery is molecule geometry and thus, the molecule's geometrical representation is the main bottleneck to better utilize machine learning techniques for drug discovery. In this work, we propose a pretraining method for molecule joint auto-encoding (MoleculeJAE). MoleculeJAE can learn both the 2D bond (topology) and 3D conformation (geometry) information, and a diffusion process model is applied to mimic the augmented trajectories of such two modalities, based on which, MoleculeJAE will learn the inherent chemical structure in a self-supervised manner. Thus, the pretrained geometrical representation in MoleculeJAE is expected to benefit downstream geometry-related tasks. Empirically, MoleculeJAE proves its effectiveness by reaching state-of-the-art performance on 15 out of 20 tasks by comparing it with 12 competitive baselines.

Data is Overrated: Perceptual Metrics Can Lead Learning in the Absence of Training Data

  • paper_url: http://arxiv.org/abs/2312.03455
  • repo_url: None
  • paper_authors: Tashi Namgyal, Alexander Hepburn, Raul Santos-Rodriguez, Valero Laparra, Jesus Malo
  • for: evaluating the quality of natural signals, such as images and audio, with perceptual metrics.
  • methods: perceptual metrics mimic the perceptual behaviour of human observers and usually reflect structures found in natural signals, motivating their use as loss functions; here, a compressive autoencoder is trained with perceptual losses to reconstruct uniform noise in lieu of natural data (a loss-swap sketch follows this entry).
  • results: training with perceptual losses improves the reconstruction of spectrograms and re-synthesized audio at test time over a standard Euclidean loss, demonstrating better generalisation to unseen natural signals.
    Abstract Perceptual metrics are traditionally used to evaluate the quality of natural signals, such as images and audio. They are designed to mimic the perceptual behaviour of human observers and usually reflect structures found in natural signals. This motivates their use as loss functions for training generative models such that models will learn to capture the structure held in the metric. We take this idea to the extreme in the audio domain by training a compressive autoencoder to reconstruct uniform noise, in lieu of natural data. We show that training with perceptual losses improves the reconstruction of spectrograms and re-synthesized audio at test time over models trained with a standard Euclidean loss. This demonstrates better generalisation to unseen natural signals when using perceptual metrics.
    摘要 传统上,感知指标被用于评估自然信号(如图像和音频)的质量。它们被设计为模仿人类观察者的感知行为,通常反映自然信号中的结构。这种想法驱动了将感知指标用作生成模型损失函数的做法,以便模型学习捕捉指标中蕴含的结构。在音频领域,我们将这一想法推向极致:训练一个压缩自编码器来重建均匀噪声,而非自然数据。结果表明,使用感知损失训练的模型在测试时对声谱图的重建和重新合成的音频均优于使用标准欧氏损失训练的模型,这表明使用感知指标能更好地泛化到未见过的自然信号。
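The loss swap at the center of the paper can be sketched as follows; the fixed random convolutional stack stands in for a real perceptual metric (the paper uses metrics derived from models of human perception), so treat this purely as an illustration of measuring reconstruction error in a feature space rather than on raw samples.

```python
import torch
import torch.nn as nn

# Fixed feature extractor standing in for a perceptual metric (weights frozen).
feature_net = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
    nn.Conv1d(16, 16, kernel_size=9, padding=4),
).eval()
for p in feature_net.parameters():
    p.requires_grad_(False)

def perceptual_loss(x_hat, x):
    # Distance measured between feature maps, not raw waveforms/spectrograms.
    return nn.functional.mse_loss(feature_net(x_hat), feature_net(x))

# Training an autoencoder `ae` on batches x of shape (B, 1, T) would then use
#   loss = perceptual_loss(ae(x), x)      # instead of mse_loss(ae(x), x)
# with x drawn from uniform noise, as in the paper's extreme setting.
```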

Quantum-Inspired Neural Network Model of Optical Illusions

  • paper_url: http://arxiv.org/abs/2312.03447
  • repo_url: None
  • paper_authors: Ivan S. Maksymov
  • for: studies how humans perceive ambiguous optical illusions such as the Necker cube, a drawing with alternating possible interpretations.
  • methods: designs and trains a deep neural network to simulate human perception of the Necker cube, defining the network's connection weights with a quantum generator of truly random numbers, in agreement with emerging concepts of quantum artificial intelligence and quantum cognition.
  • results: reveals that the actual perceptual state of the Necker cube is a qubit-like superposition of the two fundamental perceptual states predicted by classical theories; the results will find applications in video games and virtual reality systems for training astronauts and UAV operators, and in research on machine learning, vision, the psychology of perception, and quantum-mechanical models of the human mind and decision-making.
    Abstract Ambiguous optical illusions have been a paradigmatic object of fascination, research and inspiration in arts, psychology and video games. However, accurate computational models of perception of ambiguous figures have been elusive. In this paper, we design and train a deep neural network model to simulate the human's perception of the Necker cube, an ambiguous drawing with several alternating possible interpretations. Defining the weights of the neural network connection using a quantum generator of truly random numbers, in agreement with the emerging concepts of quantum artificial intelligence and quantum cognition we reveal that the actual perceptual state of the Necker cube is a qubit-like superposition of the two fundamental perceptual states predicted by classical theories. Our results will find applications in video games and virtual reality systems employed for training of astronauts and operators of unmanned aerial vehicles. They will also be useful for researchers working in the fields of machine learning and vision, psychology of perception and quantum-mechanical models of human mind and decision-making.
    摘要 困惑的视觉错觉已经成为艺术、心理学和电子游戏等领域的一种独特的对象,但是准确的计算模型来解释人类的视觉却是困难的。在这篇论文中,我们设计了一个深度神经网络模型,用于模拟人类对尼克尔立方体的视觉含义。使用量子生成器生成真实随机数的权重,与量子人工智能和量子认知理论相吻合,我们发现了人类对尼克尔立方体的实际视觉状态是一种基于两个基本视觉状态的QUBIT-like超position。我们的结果将找到应用于电子游戏和虚拟现实系统,用于训练宇航员和无人飞行器操作员。同时,这些结果也将对机器学习、视觉和心理学研究有很大的帮助,以及量子机器人模型和决策的研究。

Sports Recommender Systems: Overview and Research Issues

  • paper_url: http://arxiv.org/abs/2312.03785
  • repo_url: None
  • paper_authors: Alexander Felfernig, Manfred Wundara, Thi Ngoc Trang Tran, Viet-Man Le, Sebastian Lubos, Seda Polat-Erdeniz
  • for: sports recommender systems, which receive increasing attention for their potential to foster healthy living, improve personal well-being, and increase performance in sport.
  • methods: on the basis of different working examples, presents an overview of sports recommender system applications and techniques, including recommendation of healthy and performance-boosting food items, training practices, talents and teams, and specific tactics in competitions.
  • results: analyzes the related state of the art and discusses open research issues for sports recommender systems.
    Abstract Sports recommender systems receive an increasing attention due to their potential of fostering healthy living, improving personal well-being, and increasing performances in sport. These systems support people in sports, for example, by the recommendation of healthy and performance boosting food items, the recommendation of training practices, talent and team recommendation, and the recommendation of specific tactics in competitions. With applications in the virtual world, for example, the recommendation of maps or opponents in e-sports, these systems already transcend conventional sports scenarios where physical presence is needed. On the basis of different working examples, we present an overview of sports recommender systems applications and techniques. Overall, we analyze the related state-of-the-art and discuss open research issues.
    摘要 体育推荐系统在最近几年来得到了越来越多的关注,这主要归功于它们在健康生活、个人健康和运动表现方面的潜在作用。这些系统支持人们在运动方面,例如,推荐健康和表现提升的食品、训练方法、才能和团队推荐、竞赛中特定战斗策略等等。在虚拟世界中,例如电子竞技,这些系统已经超越了传统的体育场景,需要物理存在。基于不同的实践例子,我们提供体育推荐系统应用和技术的概述,并总结相关的现状和未来研究方向。

Approximating Solutions to the Knapsack Problem using the Lagrangian Dual Framework

  • paper_url: http://arxiv.org/abs/2312.03413
  • repo_url: None
  • paper_authors: Mitchell Keegan, Mahdi Abolghasemi
  • for: approximating solutions to the Knapsack Problem, a classic combinatorial optimization problem, with neural networks while improving constraint satisfaction.
  • methods: neural network models built on the Lagrangian Dual Framework, which extends Lagrangian Relaxation to enforce or encourage constraint satisfaction in predicted solutions (a loss-function sketch follows this entry); the problems of output interpretation and model selection are also explored.
  • results: strong constraint satisfaction with a minor reduction of optimality compared to a baseline neural network that does not explicitly model the constraints.
    Abstract The Knapsack Problem is a classic problem in combinatorial optimisation. Solving these problems may be computationally expensive. Recent years have seen a growing interest in the use of deep learning methods to approximate the solutions to such problems. A core problem is how to enforce or encourage constraint satisfaction in predicted solutions. A promising approach for predicting solutions to constrained optimisation problems is the Lagrangian Dual Framework which builds on the method of Lagrangian Relaxation. In this paper we develop neural network models to approximate Knapsack Problem solutions using the Lagrangian Dual Framework while improving constraint satisfaction. We explore the problems of output interpretation and model selection within this context. Experimental results show strong constraint satisfaction with a minor reduction of optimality as compared to a baseline neural network which does not explicitly model the constraints.
    摘要 《零钱包问题》是一个经典的组合优化问题。解决这类问题可能是 computationally expensive。近年来,有越来越多的关注使用深度学习方法来近似解决这类问题的解决方案。核心问题是如何在预测解决方案中强制或促进约束满足。我们在这篇论文中开发了基于Lagrangian Dual Framework的神经网络模型,以优化零钱包问题的解决方案,同时提高约束满足性。我们还探讨了输出解释和模型选择问题在这个上下文中。实验结果显示,我们的神经网络模型可以强制满足约束,但是有一定的优化率下降相比于基准神经网络模型。
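A minimal sketch of how a Lagrangian-dual training loss might look in this setting, assuming a network that outputs item-inclusion probabilities for a knapsack instance; the capacity constraint enters the loss through a multiplier updated by dual ascent. All names and the step size are illustrative, not the paper's implementation.

```python
import torch

def lagrangian_loss(p, values, weights, capacity, lam):
    """p: (B, n) item-inclusion probabilities; values/weights: (B, n); capacity: (B,)."""
    objective = (values * p).sum(dim=1)              # expected value collected
    violation = (weights * p).sum(dim=1) - capacity  # > 0 means over capacity
    # Lagrangian relaxation: maximize value, penalize violation with multiplier lam.
    loss = (-objective + lam * violation).mean()
    return loss, violation

lam, lam_lr = 1.0, 0.1
# Illustrative training step (model, optimizer, and batch loading omitted):
#   p = torch.sigmoid(model(instance_features))
#   loss, violation = lagrangian_loss(p, values, weights, capacity, lam)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
#   lam = max(0.0, lam + lam_lr * violation.mean().item())   # projected dual ascent
```

The dual ascent step raises the penalty while constraints are violated and relaxes it once predictions become feasible, which is how the framework trades a small amount of optimality for strong constraint satisfaction.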

Generalized Contrastive Divergence: Joint Training of Energy-Based Model and Diffusion Model through Inverse Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2312.03397
  • repo_url: None
  • paper_authors: Sangwoong Yoon, Dohyun Kwon, Himchan Hwang, Yung-Kyun Noh, Frank C. Park
  • for: proposes a novel objective function for jointly training an energy-based model (EBM) and a sampler such as a diffusion model.
  • methods: Generalized Contrastive Divergence (GCD) generalizes Contrastive Divergence by replacing the MCMC distribution with a trainable sampler; the joint training is formulated as a minimax problem that reaches an equilibrium when both models converge to the data distribution, with an interesting equivalence to inverse reinforcement learning (a schematic objective follows this entry).
  • results: preliminary yet promising results show that joint training benefits both models: the EBM can be trained without MCMC while the diffusion model's sample quality improves.
    Abstract We present Generalized Contrastive Divergence (GCD), a novel objective function for training an energy-based model (EBM) and a sampler simultaneously. GCD generalizes Contrastive Divergence (Hinton, 2002), a celebrated algorithm for training EBM, by replacing Markov Chain Monte Carlo (MCMC) distribution with a trainable sampler, such as a diffusion model. In GCD, the joint training of EBM and a diffusion model is formulated as a minimax problem, which reaches an equilibrium when both models converge to the data distribution. The minimax learning with GCD bears interesting equivalence to inverse reinforcement learning, where the energy corresponds to a negative reward, the diffusion model is a policy, and the real data is expert demonstrations. We present preliminary yet promising results showing that joint training is beneficial for both EBM and a diffusion model. GCD enables EBM training without MCMC while improving the sample quality of a diffusion model.
    摘要 我团队现在介绍一种新的目标函数,即泛化对照分散(GCD),用于同时训练能量基型模型(EBM)和扩散模型。GCD扩展了2002年希н顿提出的对照分散算法(Hinton),将马尔可夫链 Monte Carlo(MCMC)分布替换为可学习的扩散模型。在GCD中,EBM和扩散模型的共同训练被формализова为一个最小最大问题,当两个模型都 converges到数据分布时,它们达到了平衡。这种最小最大学习与GCD具有惊人的等价性,与反奖学习相当,其中能量对应于负反奖,扩散模型对应于策略,而实际数据则是专家示范。我们展示了初步却有把握的结果,表明同时训练EBM和扩散模型有利于两者。GCD允许EBM无需MCMC训练,并提高扩散模型的样本质量。
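One standard way to write such an EBM-sampler minimax objective, in our own notation rather than necessarily the paper's exact form:

```latex
% E_\theta: EBM energy; q_\phi: trainable sampler (diffusion model);
% \mathcal{H}: entropy. For fixed \theta, the inner problem for q_\phi reduces
% to minimizing KL(q_\phi \| p_\theta); at the saddle point, q_\phi = p_\theta = p_{\mathrm{data}}.
\min_{\phi} \max_{\theta} \;
\mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[-E_\theta(x)\bigr]
\;-\; \mathbb{E}_{x \sim q_\phi}\bigl[-E_\theta(x)\bigr]
\;-\; \mathcal{H}(q_\phi)
```

Read as inverse reinforcement learning, the negative energy plays the role of a reward, the sampler is the policy being optimized against it, and the real data act as expert demonstrations.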

Diffused Task-Agnostic Milestone Planner

  • paper_url: http://arxiv.org/abs/2312.03395
  • repo_url: None
  • paper_authors: Mineui Hong, Minjae Kang, Songhwai Oh
  • for: extends sequence-prediction methods to long-term planning, vision-based control, and multi-task decision-making.
  • methods: a diffusion-based generative sequence model plans a series of milestones in a latent space, and an agent follows the milestones to accomplish a given task; the method learns control-relevant, low-dimensional latent representations of milestones, enabling efficient long-term planning and vision-based control, and exploits the generation flexibility of the diffusion model to plan diverse trajectories for multi-task decision-making.
  • results: evaluated on offline reinforcement learning (RL) benchmarks and a visual manipulation environment, the approach outperforms offline RL methods on long-horizon, sparse-reward tasks and multi-task problems, while achieving state-of-the-art performance on the most challenging vision-based manipulation benchmark.
    Abstract Addressing decision-making problems using sequence modeling to predict future trajectories shows promising results in recent years. In this paper, we take a step further to leverage the sequence predictive method in wider areas such as long-term planning, vision-based control, and multi-task decision-making. To this end, we propose a method to utilize a diffusion-based generative sequence model to plan a series of milestones in a latent space and to have an agent to follow the milestones to accomplish a given task. The proposed method can learn control-relevant, low-dimensional latent representations of milestones, which makes it possible to efficiently perform long-term planning and vision-based control. Furthermore, our approach exploits generation flexibility of the diffusion model, which makes it possible to plan diverse trajectories for multi-task decision-making. We demonstrate the proposed method across offline reinforcement learning (RL) benchmarks and an visual manipulation environment. The results show that our approach outperforms offline RL methods in solving long-horizon, sparse-reward tasks and multi-task problems, while also achieving the state-of-the-art performance on the most challenging vision-based manipulation benchmark.

Lite-Mind: Towards Efficient and Versatile Brain Representation Network

  • paper_url: http://arxiv.org/abs/2312.03781
  • repo_url: None
  • paper_authors: Zixuan Gong, Qi Zhang, Duoqian Miao, Guangyin Bao, Liang Hu
  • for: improving the performance and efficiency of decoding visual information from non-invasive fMRI.
  • methods: Lite-Mind, a lightweight, efficient, and versatile brain representation network based on the discrete Fourier transform that efficiently aligns fMRI voxels to fine-grained information of CLIP's vision transformer, in contrast to MindEye's 996M-parameter per-subject MLP backbone.
  • results: Lite-Mind achieves 94.3% fMRI-to-image retrieval accuracy on the NSD dataset for Subject 1 with 98.7% fewer parameters than MindEye, migrates to smaller brain datasets, and establishes a new state of the art for zero-shot classification on the GOD dataset.
    Abstract Research in decoding visual information from the brain, particularly through the non-invasive fMRI method, is rapidly progressing. The challenge arises from the limited data availability and the low signal-to-noise ratio of fMRI signals, leading to a low-precision task of fMRI-to-image retrieval. State-of-the-art MindEye remarkably improves fMRI-to-image retrieval performance by leveraging a deep MLP with a high parameter count orders of magnitude, i.e., a 996M MLP Backbone per subject, to align fMRI embeddings to the final hidden layer of CLIP's vision transformer. However, significant individual variations exist among subjects, even within identical experimental setups, mandating the training of subject-specific models. The substantial parameters pose significant challenges in deploying fMRI decoding on practical devices, especially with the necessitating of specific models for each subject. To this end, we propose Lite-Mind, a lightweight, efficient, and versatile brain representation network based on discrete Fourier transform, that efficiently aligns fMRI voxels to fine-grained information of CLIP. Our experiments demonstrate that Lite-Mind achieves an impressive 94.3% fMRI-to-image retrieval accuracy on the NSD dataset for Subject 1, with 98.7% fewer parameters than MindEye. Lite-Mind is also proven to be able to be migrated to smaller brain datasets and establishes a new state-of-the-art for zero-shot classification on the GOD dataset. The code is available at https://github.com/gongzix/Lite-Mind.
    摘要 解码大脑视觉信息的研究进展迅速,特别是通过非侵入式fMRI方法。挑战来自有限的数据可用性和fMRI信号的低信噪比,导致fMRI-to-image检索是一个低精度任务。现有的 MindEye 技术显著改进了 fMRI-to-image 检索性能,通过使用参数量高出数个数量级的深度 MLP(每个受试者996M参数的 MLP Backbone),将 fMRI 嵌入与 CLIP 视觉transformer 的最终隐藏层对齐。然而,即使在相同的实验设置下,受试者之间也存在显著的个体差异,因此需要训练特定受试者的模型。庞大的参数量给在实际设备上部署fMRI解码带来了重大挑战,尤其是每个受试者都需要专属模型。为此,我们提出了 Lite-Mind,一种基于离散傅里叶变换的轻量、高效、多功能大脑表示网络,可以高效地将 fMRI 体素与 CLIP 的细粒度信息对齐。我们的实验表明,Lite-Mind 在 NSD 数据集的受试者1上达到94.3%的 fMRI-to-image 检索精度,参数量比 MindEye 少98.7%。Lite-Mind 还被证明可迁移到更小的大脑数据集,并在 GOD 数据集的零样本分类上建立了新的最先进水平。代码可在 https://github.com/gongzix/Lite-Mind 获取。

Demand response for residential building heating: Effective Monte Carlo Tree Search control based on physics-informed neural networks

  • paper_url: http://arxiv.org/abs/2312.03365
  • repo_url: None
  • paper_authors: Fabio Pavirani, Gargya Gokhale, Bert Claessens, Chris Develder
  • for: controlling a residential building's heating system through demand response (DR) to optimize energy consumption while respecting the user's thermal comfort, helping reduce global carbon emissions and limit climate change.
  • methods: Monte Carlo Tree Search (MCTS) for DR control (a generic MCTS sketch follows this entry), with a physics-informed neural network (PiNN) for the underlying thermal-state prediction instead of purely data-driven black-box models; a deep-learning layer is additionally integrated into the tree search to guide it through more promising nodes.
  • results: the PiNN-backed MCTS controller obtains a 3% higher reward than a rule-based controller, yielding a 10% cost reduction and a 35% reduction in deviation from the desired temperature on an artificial price profile; the learned search layer further reduces the computational cost compared to the vanilla version.
    Abstract Controlling energy consumption in buildings through demand response (DR) has become increasingly important to reduce global carbon emissions and limit climate change. In this paper, we specifically focus on controlling the heating system of a residential building to optimize its energy consumption while respecting user's thermal comfort. Recent works in this area have mainly focused on either model-based control, e.g., model predictive control (MPC), or model-free reinforcement learning (RL) to implement practical DR algorithms. A specific RL method that recently has achieved impressive success in domains such as board games (go, chess) is Monte Carlo Tree Search (MCTS). Yet, for building control it has remained largely unexplored. Thus, we study MCTS specifically for building demand response. Its natural structure allows a flexible optimization that implicitly integrate exogenous constraints (as opposed, for example, to conventional RL solutions), making MCTS a promising candidate for DR control problems. We demonstrate how to improve MCTS control performance by incorporating a Physics-informed Neural Network (PiNN) model for its underlying thermal state prediction, as opposed to traditional purely data-driven Black-Box approaches. Our MCTS implementation aligned with a PiNN model is able to obtain a 3% increment of the obtained reward compared to a rule-based controller; leading to a 10% cost reduction and 35% reduction on temperature difference with the desired one when applied to an artificial price profile. We further implemented a Deep Learning layer into the Monte Carlo Tree Search technique using a neural network that leads the tree search through more optimal nodes. We then compared this addition with its Vanilla version, showing the improvement in computational cost required.
    摘要 控制建筑物的能源消耗已成为降低全球碳排放和控制气候变化的重要方法。在这篇论文中,我们专注于控制公寓建筑物的冷却系统,以优化其能源消耗,同时保证用户的室内温度舒适性。现有的研究主要集中在使用模型预测控制(MPC)或无模型强化学习(RL)实现实用的DR算法。特别是, Monte Carlo Tree Search(MCTS)在棋盘游戏(如围棋、国际象棋)中最近几年表现出了非常出色的成绩。然而,在建筑物控制领域,MCTS的应用仍然很少。因此,我们在这篇论文中研究MCTS,并证明其在建筑物DR控制问题中的潜在优势。MCTS的自然结构使得可以flexibly进行优化,同时自动承载外部约束(与传统RL方法不同),这使MCTS在DR控制问题中成为一个非常有前途的候选者。我们通过将PiNN模型(Physics-informed Neural Network)与MCTS结合使用,提高了控制性能。我们的MCTS实现与PiNN模型相比,与规则控制器相比,可以获得3%的增量奖励,导致10%的成本减少和35%的温度差异减少。此外,我们还添加了一个深度学习层到MCTS技术中,使用神经网络导引搜索更优化的树。与普通版本相比,这种添加减少了计算成本的需求。
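A compact, generic UCT-style MCTS sketch for a discrete control problem such as per-step heater on/off decisions. The `step` function is a stand-in for a thermal model (the paper uses a PiNN), and the reward, horizon, and action set are illustrative assumptions; returns here are estimated from the random rollout only, for brevity.

```python
import math
import random

ACTIONS = (0, 1)  # illustrative: heater off / on

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}           # action -> Node
        self.visits, self.value = 0, 0.0

def uct_select(node, c=1.4):
    # Pick the child maximizing the UCB1 score.
    return max(node.children.values(),
               key=lambda n: n.value / (n.visits + 1e-9)
               + c * math.sqrt(math.log(node.visits + 1) / (n.visits + 1e-9)))

def mcts(root_state, step, reward, horizon=12, n_sims=200):
    root = Node(root_state)
    for _ in range(n_sims):
        node, depth = root, 0
        # 1) selection and 2) expansion
        while depth < horizon:
            untried = [a for a in ACTIONS if a not in node.children]
            if untried:
                a = random.choice(untried)
                node.children[a] = Node(step(node.state, a), parent=node)
                node, depth = node.children[a], depth + 1
                break
            node, depth = uct_select(node), depth + 1
        # 3) rollout under a random policy
        state, ret = node.state, reward(node.state)
        for _ in range(horizon - depth):
            state = step(state, random.choice(ACTIONS))
            ret += reward(state)
        # 4) backpropagation
        while node is not None:
            node.visits, node.value = node.visits + 1, node.value + ret
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].visits)
```

A controller would call `mcts` at every time step in a receding-horizon fashion, re-planning from the newly observed building state; exogenous signals such as prices and weather can enter through `reward` and `step`.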

Teaching Specific Scientific Knowledge into Large Language Models through Additional Training

  • paper_url: http://arxiv.org/abs/2312.03360
  • repo_url: None
  • paper_authors: Kan Hatakeyama-Sato, Yasuhiko Igarashi, Shun Katakami, Yuta Nabae, Teruaki Hayakawa
  • for: exploring, through additional training, the embedding of specialized scientific knowledge into the Llama 2 large language model.
  • methods: text augmentation, including style conversions and translations, to tackle the scarcity of specialized texts; reading texts from multiple perspectives, especially in instructional formats; hyperparameter optimization across model sizes (7b, 13b, and 70b); validation on a constructed dataset of 65,000 scientific papers.
  • results: knowledge is partially embedded, but the study highlights the complexities and limitations of incorporating specialized information into LLMs, suggesting areas for further improvement.
    Abstract Through additional training, we explore embedding specialized scientific knowledge into the Llama 2 Large Language Model (LLM). Key findings reveal that effective knowledge integration requires reading texts from multiple perspectives, especially in instructional formats. We utilize text augmentation to tackle the scarcity of specialized texts, including style conversions and translations. Hyperparameter optimization proves crucial, with different size models (7b, 13b, and 70b) reasonably undergoing additional training. Validating our methods, we construct a dataset of 65,000 scientific papers. Although we have succeeded in partially embedding knowledge, the study highlights the complexities and limitations of incorporating specialized information into LLMs, suggesting areas for further improvement.
    摘要 通过进一步的训练,我们探索将专业科学知识 embedding到大型自然语言模型(LLM)中。关键发现显示,有效地 integrate 知识需要从多个角度阅读文本,特别是在教学格式下。我们使用文本扩展来解决专业文本稀缺问题,包括样式转换和翻译。模型的超参数优化证明是关键的,不同大小的模型(7b、13b和70b)都能够进行进一步的训练。为验证我们的方法,我们构建了65,000篇科学论文的数据集。虽然我们在部分 embedding 知识上成功,但研究表明将专业信息 embedding 到 LLM 中存在复杂性和限制,提出了进一步改进的方向。

Online Vectorized HD Map Construction using Geometry

  • paper_url: http://arxiv.org/abs/2312.03341
  • repo_url: https://github.com/cnzzx/gemap
  • paper_authors: Zhixin Zhang, Yiyuan Zhang, Xiaohan Ding, Fusheng Jin, Xiangyu Yue
  • for: online vectorized HD map construction based on Euclidean geometry, which is critical for downstream prediction and planning in autonomous driving.
  • methods: GeMap (Geometry Map) end-to-end learns Euclidean shapes and relations of map instances beyond basic perception, with a geometric loss based on angle and distance clues that is robust to rigid transformations, and decoupled self-attention that handles Euclidean shapes and relations independently.
  • results: new state-of-the-art performance on the NuScenes and Argoverse 2 datasets, reaching 71.8% mAP on the large-scale Argoverse 2 dataset, outperforming MapTR V2 by +4.4% and surpassing the 70% mAP threshold for the first time.
    Abstract The construction of online vectorized High-Definition (HD) maps is critical for downstream prediction and planning. Recent efforts have built strong baselines for this task, however, shapes and relations of instances in urban road systems are still under-explored, such as parallelism, perpendicular, or rectangle-shape. In our work, we propose GeMap ($\textbf{Ge}$ometry $\textbf{Map}$), which end-to-end learns Euclidean shapes and relations of map instances beyond basic perception. Specifically, we design a geometric loss based on angle and distance clues, which is robust to rigid transformations. We also decouple self-attention to independently handle Euclidean shapes and relations. Our method achieves new state-of-the-art performance on the NuScenes and Argoverse 2 datasets. Remarkably, it reaches a 71.8% mAP on the large-scale Argoverse 2 dataset, outperforming MapTR V2 by +4.4% and surpassing the 70% mAP threshold for the first time. Code is available at https://github.com/cnzzx/GeMap

Benchmarking Continual Learning from Cognitive Perspectives

  • paper_url: http://arxiv.org/abs/2312.03309
  • repo_url: None
  • paper_authors: Xiaoqian Liu, Junge Zhang, Mingyi Zhang, Peipei Yang
  • for: This work addresses continual learning: continuously acquiring and transferring knowledge without catastrophically forgetting old concepts.
  • methods: The authors evaluate continual learning models against desiderata derived from cognitive properties supporting human continual learning, using a dedicated evaluation protocol and metrics for each desideratum.
  • results: Experiments show that no existing continual learning model satisfies all the desiderata or realizes truly continual learning; although some methods exhibit a degree of adaptability and efficiency, none can identify task relationships under dynamic task variations or trade off learning similarities and differences between tasks.
    Abstract Continual learning addresses the problem of continuously acquiring and transferring knowledge without catastrophic forgetting of old concepts. While humans achieve continual learning via diverse neurocognitive mechanisms, there is a mismatch between cognitive properties and evaluation methods of continual learning models. First, the measurement of continual learning models mostly relies on evaluation metrics at a micro-level, which cannot characterize cognitive capacities of the model. Second, the measurement is method-specific, emphasizing model strengths in one aspect while obscuring potential weaknesses in other respects. To address these issues, we propose to integrate model cognitive capacities and evaluation metrics into a unified evaluation paradigm. We first characterize model capacities via desiderata derived from cognitive properties supporting human continual learning. The desiderata concern (1) adaptability in varying lengths of task sequence; (2) sensitivity to dynamic task variations; and (3) efficiency in memory usage and training time consumption. Then we design evaluation protocols for each desideratum to assess cognitive capacities of recent continual learning models. Experimental results show that none of the methods we consider satisfies all the desiderata, and all remain far from realizing truly continual learning. Although some methods exhibit some degree of adaptability and efficiency, no method is able to identify task relationships when encountering dynamic task variations, or achieve a trade-off in learning similarities and differences between tasks. Inspired by these results, we discuss possible factors that influence model performance in these desiderata and provide guidance for the improvement of continual learning models.

Dyport: Dynamic Importance-based Hypothesis Generation Benchmarking Technique

  • paper_url: http://arxiv.org/abs/2312.03303
  • repo_url: https://github.com/ilyatyagin/dyport
  • paper_authors: Ilya Tyagin, Ilya Safro
  • for: This paper presents Dyport, a novel benchmarking framework for evaluating biomedical hypothesis generation systems.
  • methods: The approach tests systems under realistic conditions using curated datasets. It integrates knowledge from the curated databases into a dynamic graph and provides a method to quantify discovery importance, assessing not only hypothesis accuracy but also potential impact on biomedical research, which significantly extends traditional link prediction benchmarks.
  • results: Experiments applying several link prediction systems to biomedical semantic knowledge graphs demonstrate the applicability and flexibility of the benchmarking system.
    Abstract This paper presents a novel benchmarking framework Dyport for evaluating biomedical hypothesis generation systems. Utilizing curated datasets, our approach tests these systems under realistic conditions, enhancing the relevance of our evaluations. We integrate knowledge from the curated databases into a dynamic graph, accompanied by a method to quantify discovery importance. This not only assesses hypothesis accuracy but also their potential impact in biomedical research which significantly extends traditional link prediction benchmarks. Applicability of our benchmarking process is demonstrated on several link prediction systems applied on biomedical semantic knowledge graphs. Being flexible, our benchmarking system is designed for broad application in hypothesis generation quality verification, aiming to expand the scope of scientific discovery within the biomedical research community. Availability and implementation: Dyport framework is fully open-source. All code and datasets are available at: https://github.com/IlyaTyagin/Dyport

SoftMAC: Differentiable Soft Body Simulation with Forecast-based Contact Model and Two-way Coupling with Articulated Rigid Bodies and Clothes

  • paper_url: http://arxiv.org/abs/2312.03297
  • repo_url: https://github.com/damianliumin/SoftMAC
  • paper_authors: Min Liu, Gang Yang, Siyuan Luo, Chen Yu, Lin Shao
  • for: This paper aims to provide a unified framework for simulating diverse robotic manipulation scenarios by integrating soft bodies, articulated rigid bodies, and clothes.
  • methods: The proposed method, called SoftMAC, uses the Material Point Method (MPM) to simulate soft bodies and a forecast-based contact model to reduce artifacts. It also includes a penetration tracing algorithm to couple MPM particles with deformable and non-volumetric clothes meshes.
  • results: The authors validate the effectiveness and accuracy of the proposed differentiable pipeline through comprehensive experiments in downstream robotic manipulation applications.
    Abstract Differentiable physics simulation provides an avenue for tackling previously intractable challenges through gradient-based optimization, thereby greatly improving the efficiency of solving robotics-related problems. To apply differentiable simulation in diverse robotic manipulation scenarios, a key challenge is to integrate various materials in a unified framework. We present SoftMAC, a differentiable simulation framework coupling soft bodies with articulated rigid bodies and clothes. SoftMAC simulates soft bodies with the continuum-mechanics-based Material Point Method (MPM). We provide a forecast-based contact model for MPM, which greatly reduces artifacts like penetration and unnatural rebound. To couple MPM particles with deformable and non-volumetric clothes meshes, we also propose a penetration tracing algorithm that reconstructs the signed distance field in local area. Based on simulators for each modality and the contact model, we develop a differentiable coupling mechanism to simulate the interactions between soft bodies and the other two types of materials. Comprehensive experiments are conducted to validate the effectiveness and accuracy of the proposed differentiable pipeline in downstream robotic manipulation applications. Supplementary materials and videos are available on our project website at https://sites.google.com/view/softmac.

OMNIINPUT: A Model-centric Evaluation Framework through Output Distribution

  • paper_url: http://arxiv.org/abs/2312.03291
  • repo_url: None
  • paper_authors: Weitang Liu, Ying Wai Li, Tianle Wang, Yi-Zhuang You, Jingbo Shang
  • for: Evaluating the quality of an AI/ML model's predictions on all possible inputs, including human-unrecognizable ones, which is crucial for AI safety and reliability.
  • methods: Instead of traditional data-centric evaluation on pre-defined test sets, the test set is self-constructed by the model itself, and model quality is evaluated by investigating its output distribution, using an efficient sampler and selective annotation.
  • results: OmniInput enables a more fine-grained comparison between models, especially when their performance is almost identical on pre-defined datasets, yielding new findings and insights for training more robust, generalizable models.
    Abstract We propose a novel model-centric evaluation framework, OmniInput, to evaluate the quality of an AI/ML model's predictions on all possible inputs (including human-unrecognizable ones), which is crucial for AI safety and reliability. Unlike traditional data-centric evaluation based on pre-defined test sets, the test set in OmniInput is self-constructed by the model itself and the model quality is evaluated by investigating its output distribution. We employ an efficient sampler to obtain representative inputs and the output distribution of the trained model, which, after selective annotation, can be used to estimate the model's precision and recall at different output values and a comprehensive precision-recall curve. Our experiments demonstrate that OmniInput enables a more fine-grained comparison between models, especially when their performance is almost the same on pre-defined datasets, leading to new findings and insights for how to train more robust, generalizable models.
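A minimal numerical sketch of the model-centric idea: draw inputs from the whole input space rather than a curated test set, read off the trained model's output scores, annotate the sample, and trace precision and recall across output thresholds. The uniform sampler and the synthetic scorer/annotator below stand in for the paper's efficient sampler and selective human annotation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: a trained scorer over the input space, and an oracle playing
# the role of the selective human annotation step.
def model_score(x):
    return 1.0 / (1.0 + np.exp(-x.sum(axis=1)))

def annotate(x):
    return (x.sum(axis=1) + 0.5 * rng.standard_normal(len(x)) > 0).astype(int)

# 1) Sample inputs from the whole input space, not a pre-defined test set.
X = rng.uniform(-3, 3, size=(20000, 4))
scores = model_score(X)
labels = annotate(X)

# 2) Estimate precision and recall at different output values, tracing a
#    model-centric precision-recall curve.
for t in (0.3, 0.5, 0.7, 0.9):
    pred_pos = scores >= t
    tp = (pred_pos & (labels == 1)).sum()
    print(f"t={t:.1f}  precision={tp / max(pred_pos.sum(), 1):.3f}  "
          f"recall={tp / max((labels == 1).sum(), 1):.3f}")
```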

Can language agents be alternatives to PPO? A Preliminary Empirical Study On OpenAI Gym

  • paper_url: http://arxiv.org/abs/2312.03290
  • repo_url: https://github.com/mail-ecnu/text-gym-agents
  • paper_authors: Junjie Sheng, Zixiao Huang, Chuyun Shen, Wenhao Li, Yun Hua, Bo Jin, Hongyuan Zha, Xiangfeng Wang
  • for: This study investigates whether language agents can be alternatives to PPO agents in traditional sequential decision-making tasks.
  • methods: The authors take environments collected in OpenAI Gym as testbeds and ground them to textual environments, constructing the TextGym simulator for straightforward and efficient comparisons between PPO agents and language agents; they introduce five levels of scenario for domain-knowledge control, a unified RL-inspired framework for language agents, and an explore-exploit-guided language (EXE) agent.
  • results: Through numerical experiments and ablation studies, the authors extract insights into the decision-making capabilities of language agents and make a preliminary evaluation of their potential as alternatives to PPO in classical sequential decision-making problems.
    Abstract The formidable capacity for zero- or few-shot decision-making in language agents encourages us to pose a compelling question: Can language agents be alternatives to PPO agents in traditional sequential decision-making tasks? To investigate this, we first take environments collected in OpenAI Gym as our testbeds and ground them to textual environments that construct the TextGym simulator. This allows for straightforward and efficient comparisons between PPO agents and language agents, given the widespread adoption of OpenAI Gym. To ensure a fair and effective benchmarking, we introduce $5$ levels of scenario for accurate domain-knowledge controlling and a unified RL-inspired framework for language agents. Additionally, we propose an innovative explore-exploit-guided language (EXE) agent to solve tasks within TextGym. Through numerical experiments and ablation studies, we extract valuable insights into the decision-making capabilities of language agents and make a preliminary evaluation of their potential to be alternatives to PPO in classical sequential decision-making problems. This paper sheds light on the performance of language agents and paves the way for future research in this exciting domain. Our code is publicly available at~\url{https://github.com/mail-ecnu/Text-Gym-Agents}.
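"Grounding" a Gym environment to text can be pictured with a small wrapper: render each observation as a prompt, send it to the language agent, and parse the reply back into a discrete action. This is a hedged sketch with duck-typed interfaces and illustrative names, not the TextGym code.

```python
class TextEnvWrapper:
    """Render Gym-style (obs, reward) as text and parse LLM replies to actions.

    `env` is assumed to be any object with reset() -> obs and
    step(a) -> (obs, reward, done, info); `llm` is any callable from a
    prompt string to a reply string. Both stand in for OpenAI Gym
    environments and a language agent.
    """

    def __init__(self, env, llm, action_names):
        self.env, self.llm, self.action_names = env, llm, action_names

    def observation_to_prompt(self, obs, reward):
        return (
            f"Observation: {obs}\n"
            f"Last reward: {reward}\n"
            f"Choose one action from {self.action_names} and reply with "
            f"its name only."
        )

    def parse_action(self, reply):
        for i, name in enumerate(self.action_names):
            if name.lower() in reply.lower():
                return i
        return 0  # fall back to a default action on an unparsable reply

    def rollout(self, max_steps=100):
        obs, reward, total, done = self.env.reset(), 0.0, 0.0, False
        for _ in range(max_steps):
            reply = self.llm(self.observation_to_prompt(obs, reward))
            obs, reward, done, _ = self.env.step(self.parse_action(reply))
            total += reward
            if done:
                break
        return total
```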

STEP CATFormer: Spatial-Temporal Effective Body-Part Cross Attention Transformer for Skeleton-based Action Recognition

  • paper_url: http://arxiv.org/abs/2312.03288
  • repo_url: https://github.com/maclong01/STEP-CATFormer
  • paper_authors: Nguyen Huu Bao Long
  • for: This study explores the application and optimization of graph convolutional networks (GCNs) in skeleton-based action recognition.
  • methods: The study proposes three Channel-wise Topology Graph Convolution modules based on Channel-wise Topology Refinement Graph Convolution (CTR-GCN) and combines them with two joint cross-attention modules to capture upper-lower body-part and hand-foot relationships in skeleton features. It also proposes Temporal Attention Transformers to extract skeleton features effectively.
  • results: The method achieves notably high performance on the NTU RGB+D and NTU RGB+D 120 datasets.
    Abstract Graph convolutional networks (GCNs) have been widely used and achieved remarkable results in skeleton-based action recognition. We think the key to skeleton-based action recognition is how the skeleton changes across frames, so we focus on how graph convolutional networks learn different topologies and effectively aggregate joint features at global and local temporal scales. In this work, we propose three Channel-wise Topology Graph Convolution modules based on Channel-wise Topology Refinement Graph Convolution (CTR-GCN). Combining CTR-GCN with two joint cross-attention modules can capture the upper-lower body part and hand-foot relationship skeleton features. After that, to capture features of human skeletons changing in frames we design the Temporal Attention Transformers to extract skeletons effectively. The Temporal Attention Transformers can learn the temporal features of human skeleton sequences. Finally, we fuse the temporal features output scale with MLP and classification. We develop a powerful graph convolutional network named Spatial Temporal Effective Body-part Cross Attention Transformer which achieves notably high performance on the NTU RGB+D and NTU RGB+D 120 datasets. Our code and models are available at https://github.com/maclong01/STEP-CATFormer
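The body-part cross-attention ingredient can be sketched with a pair of standard attention blocks, one letting upper-body joint features attend to lower-body features and one the reverse. The dimensions, head count, and residual fusion below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BodyPartCrossAttention(nn.Module):
    """Let upper-body joint features attend to lower-body ones, and back."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.up_from_low = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.low_from_up = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, upper, lower):
        # upper: (B, J_up, C) joint features; lower: (B, J_low, C)
        up2, _ = self.up_from_low(upper, lower, lower)   # queries = upper
        low2, _ = self.low_from_up(lower, upper, upper)  # queries = lower
        return upper + up2, lower + low2                 # residual fusion

x_up, x_low = torch.randn(2, 13, 64), torch.randn(2, 12, 64)
u, l = BodyPartCrossAttention()(x_up, x_low)
print(u.shape, l.shape)  # torch.Size([2, 13, 64]) torch.Size([2, 12, 64])
```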

VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation

  • paper_url: http://arxiv.org/abs/2312.03275
  • repo_url: None
  • paper_authors: Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, Bernadette Bucher
  • for: This paper proposes a zero-shot navigation approach that helps robots find target objects in novel, unseen environments.
  • methods: The method builds occupancy maps from depth observations to identify frontiers and leverages RGB observations with a pre-trained vision-language model to generate a language-grounded value map.
  • results: The method achieves state-of-the-art results on the Gibson, Habitat-Matterport 3D, and Matterport 3D datasets, and is deployed in the real world on the Boston Dynamics Spot mobile manipulation platform, efficiently navigating to target objects.
    Abstract Understanding how humans leverage semantic knowledge to navigate unfamiliar environments and decide where to explore next is pivotal for developing robots capable of human-like search behaviors. We introduce a zero-shot navigation approach, Vision-Language Frontier Maps (VLFM), which is inspired by human reasoning and designed to navigate towards unseen semantic objects in novel environments. VLFM builds occupancy maps from depth observations to identify frontiers, and leverages RGB observations and a pre-trained vision-language model to generate a language-grounded value map. VLFM then uses this map to identify the most promising frontier to explore for finding an instance of a given target object category. We evaluate VLFM in photo-realistic environments from the Gibson, Habitat-Matterport 3D (HM3D), and Matterport 3D (MP3D) datasets within the Habitat simulator. Remarkably, VLFM achieves state-of-the-art results on all three datasets as measured by success weighted by path length (SPL) for the Object Goal Navigation task. Furthermore, we show that VLFM's zero-shot nature enables it to be readily deployed on real-world robots such as the Boston Dynamics Spot mobile manipulation platform. We deploy VLFM on Spot and demonstrate its capability to efficiently navigate to target objects within an office building in the real world, without any prior knowledge of the environment. The accomplishments of VLFM underscore the promising potential of vision-language models in advancing the field of semantic navigation. Videos of real-world deployment can be viewed at naoki.io/vlfm.
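The decision rule at the heart of the abstract, pick the frontier with the highest language-grounded value, fits in a few lines. The grid encoding and the pre-computed value map below are assumptions; in VLFM the value map comes from a vision-language model over RGB observations.

```python
import numpy as np

def select_frontier(occupancy, value_map):
    """Pick the most promising frontier cell, in the spirit of VLFM.

    occupancy: (H, W) ints, 0 = free, 1 = obstacle, -1 = unknown.
    value_map: (H, W) floats, a language-grounded score per cell.
    A frontier is a free cell with at least one unknown 4-neighbor.
    """
    H, W = occupancy.shape
    best, best_val = None, -np.inf
    for i in range(H):
        for j in range(W):
            if occupancy[i, j] != 0:
                continue
            neighbors = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            on_frontier = any(
                0 <= a < H and 0 <= b < W and occupancy[a, b] == -1
                for a, b in neighbors)
            if on_frontier and value_map[i, j] > best_val:
                best, best_val = (i, j), value_map[i, j]
    return best

occ = np.array([[0, 0, -1],
                [0, 1, -1],
                [0, 0,  0]])
val = np.array([[0.1, 0.8, 0.0],
                [0.2, 0.0, 0.0],
                [0.3, 0.4, 0.6]])
print(select_frontier(occ, val))  # (0, 1): the frontier with the highest value
```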

Weathering Ongoing Uncertainty: Learning and Planning in a Time-Varying Partially Observable Environment

  • paper_url: http://arxiv.org/abs/2312.03263
  • repo_url: None
  • paper_authors: Gokul Puthumanaillam, Xiangyu Liu, Negar Mehr, Melkior Ornik
  • for: This paper aims to improve the optimal decision-making of autonomous systems in uncertain, stochastic, and time-varying environments.
  • methods: The paper combines Time-Varying Markov Decision Processes (TVMDP) with partial observability and introduces Time-Varying Partially Observable Markov Decision Processes (TV-POMDP). The proposed approach includes Memory Prioritized State Estimation (MPSE) and an MPSE-integrated planning strategy.
  • results: The proposed framework and algorithms demonstrate superior performance over standard methods in simulated and real-world experiments, showcasing their effectiveness in stochastic, uncertain, time-varying domains.
    Abstract Optimal decision-making presents a significant challenge for autonomous systems operating in uncertain, stochastic and time-varying environments. Environmental variability over time can significantly impact the system's optimal decision making strategy for mission completion. To model such environments, our work combines the previous notion of Time-Varying Markov Decision Processes (TVMDP) with partial observability and introduces Time-Varying Partially Observable Markov Decision Processes (TV-POMDP). We propose a two-pronged approach to accurately estimate and plan within the TV-POMDP: 1) Memory Prioritized State Estimation (MPSE), which leverages weighted memory to provide more accurate time-varying transition estimates; and 2) an MPSE-integrated planning strategy that optimizes long-term rewards while accounting for temporal constraints. We validate the proposed framework and algorithms using simulations and hardware, with robots exploring partially observable, time-varying environments. Our results demonstrate superior performance over standard methods, highlighting the framework's effectiveness in stochastic, uncertain, time-varying domains.
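One way to read "Memory Prioritized State Estimation" is as a recency-weighted empirical transition model, so that estimates track a drifting environment. The exponential-decay weighting below is an assumption standing in for the paper's prioritization scheme.

```python
import numpy as np

def weighted_transition_estimate(memory, n_states, now, decay=0.9):
    """Estimate P(s' | s, a) from remembered transitions, weighting
    recent observations more heavily (weight = decay ** age).

    memory: list of (t, s, a, s_next) tuples; returns a dict keyed by
    (s, a) with a length-n_states probability vector.
    """
    counts = {}
    for t, s, a, s_next in memory:
        w = decay ** (now - t)               # prioritize recent memory
        vec = counts.setdefault((s, a), np.zeros(n_states))
        vec[s_next] += w
    return {k: v / v.sum() for k, v in counts.items()}

# The environment drifts: (s=0, a=0) used to lead to state 1, now to state 2.
mem = [(0, 0, 0, 1), (1, 0, 0, 1), (8, 0, 0, 2), (9, 0, 0, 2)]
print(weighted_transition_estimate(mem, n_states=3, now=10))
# Recent transitions dominate the estimate: P(2 | 0,0) > P(1 | 0,0)
```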

Customizable Combination of Parameter-Efficient Modules for Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2312.03248
  • repo_url: None
  • paper_authors: Haowen Wang, Tao Sun, Cong Fan, Jinjie Gu
  • for: Improving sample efficiency in multi-task learning.
  • methods: A customizable combination of task-common and task-specific skills, with the skill parameters highly parameterized via low-rank techniques and a jointly learned skill assignment matrix.
  • results: In experiments on the Super-NaturalInstructions and SuperGLUE benchmarks, C-Poly outperforms fully-shared, task-specific, and skill-indistinguishable baselines, showing clear performance gains.
    Abstract Modular and composable transfer learning is an emerging direction in the field of Parameter Efficient Fine-Tuning, as it enables neural networks to better organize various aspects of knowledge, leading to improved cross-task generalization. In this paper, we introduce a novel approach Customized Polytropon C-Poly that combines task-common skills and task-specific skills, while the skill parameters being highly parameterized using low-rank techniques. Each task is associated with a customizable number of exclusive specialized skills and also benefits from skills shared with peer tasks. A skill assignment matrix is jointly learned. To evaluate our approach, we conducted extensive experiments on the Super-NaturalInstructions and the SuperGLUE benchmarks. Our findings demonstrate that C-Poly outperforms fully-shared, task-specific, and skill-indistinguishable baselines, significantly enhancing the sample efficiency in multi-task learning scenarios.
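The skill machinery can be sketched with LoRA-style low-rank modules: a set of shared skills mixed by a learned per-task assignment matrix, plus a customizable number of task-exclusive skills. The sizes, the softmax mixing, and the frozen base layer below are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class LowRankSkill(nn.Module):
    """One skill = a rank-r update of a d_in -> d_out linear map."""
    def __init__(self, d_in, d_out, r=4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, d_out))

    def forward(self, x):
        return x @ self.A @ self.B

class SkillCombination(nn.Module):
    """Mix shared and task-exclusive skills via a learned assignment matrix."""
    def __init__(self, d_in, d_out, n_tasks, n_shared=4, n_private=1):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)  # stands in for a backbone layer
        self.shared = nn.ModuleList(
            [LowRankSkill(d_in, d_out) for _ in range(n_shared)])
        self.private = nn.ModuleList(
            [LowRankSkill(d_in, d_out) for _ in range(n_tasks * n_private)])
        # Rows: tasks; columns: shared skills. Learned jointly with the skills.
        self.assign = nn.Parameter(torch.zeros(n_tasks, n_shared))
        self.n_private = n_private

    def forward(self, x, task_id):
        w = torch.softmax(self.assign[task_id], dim=-1)
        out = self.base(x)
        for i, skill in enumerate(self.shared):      # task-common skills
            out = out + w[i] * skill(x)
        for j in range(self.n_private):              # task-specific skills
            out = out + self.private[task_id * self.n_private + j](x)
        return out

layer = SkillCombination(d_in=16, d_out=16, n_tasks=3)
print(layer(torch.randn(2, 16), task_id=1).shape)  # torch.Size([2, 16])
```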

A Simple Framework to Enhance the Adversarial Robustness of Deep Learning-based Intrusion Detection System

  • paper_url: http://arxiv.org/abs/2312.03245
  • repo_url: None
  • paper_authors: Xinwei Yuan, Shu Han, Wei Huang, Hongliang Ye, Xianglong Kong, Fan Zhang
  • for: The paper proposes a novel intrusion detection system (IDS) architecture that combines conventional machine learning (ML) models and deep learning (DL) models to enhance the robustness of IDS against adversarial attacks.
  • methods: The proposed architecture consists of three components: a DL-based IDS, an adversarial example (AE) detector based on the local intrinsic dimensionality (LID), and an ML-based IDS used to determine the maliciousness of detected AEs. The fusion mechanism leverages the high prediction accuracy of DL models and the low attack transferability between DL and ML models to improve the robustness of the whole system.
  • results: The paper shows a significant improvement in the prediction performance of the IDS under adversarial attack, achieving high accuracy with low resource consumption.
    Abstract Deep learning based intrusion detection systems (DL-based IDS) have emerged as one of the best choices for providing security solutions against various network intrusion attacks. However, due to the emergence and development of adversarial deep learning technologies, it becomes challenging for the adoption of DL models into IDS. In this paper, we propose a novel IDS architecture that can enhance the robustness of IDS against adversarial attacks by combining conventional machine learning (ML) models and Deep Learning models. The proposed DLL-IDS consists of three components: DL-based IDS, adversarial example (AE) detector, and ML-based IDS. We first develop a novel AE detector based on the local intrinsic dimensionality (LID). Then, we exploit the low attack transferability between DL models and ML models to find a robust ML model that can assist us in determining the maliciousness of AEs. If the input traffic is detected as an AE, the ML-based IDS will predict the maliciousness of input traffic, otherwise the DL-based IDS will work for the prediction. The fusion mechanism can leverage the high prediction accuracy of DL models and low attack transferability between DL models and ML models to improve the robustness of the whole system. In our experiments, we observe a significant improvement in the prediction performance of the IDS when subjected to adversarial attack, achieving high accuracy with low resource consumption.
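The AE detector builds on local intrinsic dimensionality, which has a standard maximum-likelihood estimator from k-nearest-neighbor distances. The sketch below shows the estimator and why it separates on- and off-manifold points; the detection threshold and the choice of feature space are left out, and the synthetic data is only an illustration.

```python
import numpy as np

def lid_mle(x, batch, k=10):
    """Maximum-likelihood LID estimate of point x w.r.t. a reference batch.

    LID(x) = -( (1/k) * sum_i log(r_i / r_k) )^{-1}, where r_1..r_k are
    the distances from x to its k nearest neighbors. Adversarial examples
    tend to sit in locally higher-dimensional regions, so a threshold on
    LID can serve as a simple AE detector.
    """
    d = np.linalg.norm(batch - x, axis=1)
    d = np.sort(d[d > 0])[:k]              # k nearest non-identical points
    return -1.0 / np.mean(np.log(d / d[-1]))

rng = np.random.default_rng(0)
# Clean points lie near a 2-D subspace of a 32-D space; the "AE" does not.
clean = rng.standard_normal((2000, 2)) @ rng.standard_normal((2, 32))
ae = clean[0] + 0.5 * rng.standard_normal(32)
print("clean LID:", lid_mle(clean[1], clean))  # low, near intrinsic dim ~2
print("AE LID:   ", lid_mle(ae, clean))        # noticeably higher
```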

Multicoated and Folded Graph Neural Networks with Strong Lottery Tickets

  • paper_url: http://arxiv.org/abs/2312.03236
  • repo_url: https://github.com/louivalley/slt-gnn
  • paper_authors: Jiale Yan, Hiroaki Ito, Ángel López García-Arias, Yasuyuki Okoshi, Hikari Otsuka, Kazushi Kawamura, Thiem Van Chu, Masato Motomura
  • for: This work explores applying the Strong Lottery Ticket Hypothesis (SLTH) to deep graph neural networks (GNNs) to achieve high accuracy with low memory consumption.
  • methods: It adopts Multicoated Supermasks (M-Sup), a scalar pruning-mask method, with a strategy for setting its pruning thresholds adaptively, and introduces Multi-Stage Folding and Unshared Masks to expand the search space over both architectures and parameters.
  • results: Evaluated on multiple datasets, including the Open Graph Benchmark (OGB), SLTH-based GNNs achieve high sparsity, competitive performance, and high memory efficiency, with up to 98.7% reduction.
    Abstract The Strong Lottery Ticket Hypothesis (SLTH) demonstrates the existence of high-performing subnetworks within a randomly initialized model, discoverable through pruning a convolutional neural network (CNN) without any weight training. A recent study, called Untrained GNNs Tickets (UGT), expanded SLTH from CNNs to shallow graph neural networks (GNNs). However, discrepancies persist when comparing baseline models with learned dense weights. Additionally, there remains an unexplored area in applying SLTH to deeper GNNs, which, despite delivering improved accuracy with additional layers, suffer from excessive memory requirements. To address these challenges, this work utilizes Multicoated Supermasks (M-Sup), a scalar pruning mask method, and implements it in GNNs by proposing a strategy for setting its pruning thresholds adaptively. In the context of deep GNNs, this research uncovers the existence of untrained recurrent networks, which exhibit performance on par with their trained feed-forward counterparts. This paper also introduces the Multi-Stage Folding and Unshared Masks methods to expand the search space in terms of both architecture and parameters. Through the evaluation of various datasets, including the Open Graph Benchmark (OGB), this work establishes a triple-win scenario for SLTH-based GNNs: by achieving high sparsity, competitive performance, and high memory efficiency with up to 98.7\% reduction, it demonstrates suitability for energy-efficient graph processing.
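The supermask mechanic underlying the SLTH results is compact: the weights stay frozen at their random initialization and only a score per weight is learned, with the top-scoring fraction kept via a straight-through threshold. A single "coat" is shown as a sketch; M-Sup layers several thresholds over the same scores.

```python
import torch
import torch.nn as nn

class SupermaskLinear(nn.Module):
    """Frozen random weights; only per-weight scores are trained.

    The forward pass uses weight * mask, where the mask keeps the
    top-`keep` fraction of scores. The straight-through trick (the
    detach) lets gradients flow into the scores despite the hard
    threshold.
    """

    def __init__(self, d_in, d_out, keep=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in),
                                   requires_grad=False)      # never trained
        self.scores = nn.Parameter(torch.randn(d_out, d_in) * 0.01)
        self.keep = keep

    def forward(self, x):
        k = int(self.scores.numel() * self.keep)
        threshold = self.scores.flatten().kthvalue(
            self.scores.numel() - k + 1).values
        hard = (self.scores >= threshold).float()
        mask = hard + self.scores - self.scores.detach()     # straight-through
        return x @ (self.weight * mask).t()

layer = SupermaskLinear(8, 4, keep=0.25)
layer(torch.randn(2, 8)).sum().backward()
print(layer.scores.grad is not None, layer.weight.grad)     # True None
```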

Deep Multimodal Fusion for Surgical Feedback Classification

  • paper_url: http://arxiv.org/abs/2312.03231
  • repo_url: None
  • paper_authors: Rafal Kocielnik, Elyssa Y. Wong, Timothy N. Chu, Lydia Lin, De-An Huang, Jiayun Wang, Anima Anandkumar, Andrew J. Hung
  • for: The goal of this work is to automate the annotation of real-time contextual surgical feedback at scale.
  • methods: The authors develop a multi-label machine learning model that classifies surgical feedback from text, audio, and video modalities, using a staged training strategy that first pre-trains each modality separately and then trains them jointly.
  • results: Automated classification of the five feedback categories ("Anatomic", "Technical", "Procedural", "Praise", and "Visual Aid") achieves AUCs between 71.5 and 77.6, with fusion improving performance by 3.1%; high-quality manual transcriptions of the feedback audio raise AUCs to between 76.5 and 96.2.
    Abstract Quantification of real-time informal feedback delivered by an experienced surgeon to a trainee during surgery is important for skill improvements in surgical training. Such feedback in the live operating room is inherently multimodal, consisting of verbal conversations (e.g., questions and answers) as well as non-verbal elements (e.g., through visual cues like pointing to anatomic elements). In this work, we leverage a clinically-validated five-category classification of surgical feedback: "Anatomic", "Technical", "Procedural", "Praise" and "Visual Aid". We then develop a multi-label machine learning model to classify these five categories of surgical feedback from inputs of text, audio, and video modalities. The ultimate goal of our work is to help automate the annotation of real-time contextual surgical feedback at scale. Our automated classification of surgical feedback achieves AUCs ranging from 71.5 to 77.6 with the fusion improving performance by 3.1%. We also show that high-quality manual transcriptions of feedback audio from experts improve AUCs to between 76.5 and 96.2, which demonstrates a clear path toward future improvements. Empirically, we find that the Staged training strategy, with first pre-training each modality separately and then training them jointly, is more effective than training different modalities altogether. We also present intuitive findings on the importance of modalities for different feedback categories. This work offers an important first look at the feasibility of automated classification of real-world live surgical feedback based on text, audio, and video modalities.
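The staged recipe in the abstract (pre-train each modality, then train jointly) pairs naturally with a late-fusion architecture. Below is a hedged sketch over pre-extracted features; the feature sizes, hidden width, and the use of plain MLP encoders are assumptions, not the paper's models.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Per-modality encoders plus a fusion head for multi-label feedback.

    `n_classes=5` matches the five feedback categories in the abstract.
    """

    def __init__(self, dims, hidden=256, n_classes=5):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {m: nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
             for m, d in dims.items()})
        self.heads = nn.ModuleDict(   # stage 1: per-modality heads
            {m: nn.Linear(hidden, n_classes) for m in dims})
        self.fusion = nn.Linear(hidden * len(dims), n_classes)  # stage 2

    def forward_single(self, modality, x):
        return self.heads[modality](self.encoders[modality](x))

    def forward_fused(self, batch):
        z = torch.cat([self.encoders[m](x) for m, x in batch.items()], dim=-1)
        return self.fusion(z)  # multi-label logits -> BCEWithLogitsLoss

# Stage 1: train each modality alone via forward_single; stage 2: train
# forward_fused jointly, reusing the pre-trained encoders.
model = LateFusionClassifier({"text": 768, "audio": 128, "video": 512})
batch = {"text": torch.randn(4, 768), "audio": torch.randn(4, 128),
         "video": torch.randn(4, 512)}
print(model.forward_fused(batch).shape)  # torch.Size([4, 5])
```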

SDSRA: A Skill-Driven Skill-Recombination Algorithm for Efficient Policy Learning

  • paper_url: http://arxiv.org/abs/2312.03216
  • repo_url: https://github.com/ericjiang18/sdsra
  • paper_authors: Eric H. Jiang, Andrew Lizarraga
  • for: Improving the efficiency of achieving maximum entropy in reinforcement learning tasks.
  • methods: The Skill-Driven Skill Recombination Algorithm (SDSRA), a framework that integrates skill-based strategies within the robust Actor-Critic framework.
  • results: SDSRA converges faster than the traditional Soft Actor-Critic (SAC) algorithm and produces improved policies, demonstrating remarkable adaptability and performance across a wide array of complex and diverse benchmarks.
    Abstract In this paper, we introduce a novel algorithm - the Skill-Driven Skill Recombination Algorithm (SDSRA) - an innovative framework that significantly enhances the efficiency of achieving maximum entropy in reinforcement learning tasks. We find that SDSRA achieves faster convergence compared to the traditional Soft Actor-Critic (SAC) algorithm and produces improved policies. By integrating skill-based strategies within the robust Actor-Critic framework, SDSRA demonstrates remarkable adaptability and performance across a wide array of complex and diverse benchmarks.

cs.CL - 2023-12-06

Collaboration or Corporate Capture? Quantifying NLP’s Reliance on Industry Artifacts and Contributions

  • paper_url: http://arxiv.org/abs/2312.03912
  • repo_url: None
  • paper_authors: Will Aitken, Mohamed Abdalla, Karen Rudie, Catherine Stinson
  • for: This paper investigates the reliance on industry for NLP publications, specifically looking at the citations of industry artifacts and contributions in papers presented at EMNLP 2022.
  • methods: The paper surveys 100 papers published at EMNLP 2022 to determine the frequency of citations of industry artifacts and contributions.
  • results: The paper finds that there is a substantial reliance on industry for NLP publications, with citations of industry artifacts and contributions being at least three times greater than industry publication rates per year. The paper discusses two possible perspectives on this finding: 1) collaboration with industry is still collaboration, even in the absence of an alternative, or 2) free NLP inquiry has been captured by the motivations and research direction of private corporations.
    Abstract The advent of transformers, higher computational budgets, and big data has engendered remarkable progress in Natural Language Processing (NLP). Impressive performance of industry pre-trained models has garnered public attention in recent years and made news headlines. That these are industry models is noteworthy. Rarely, if ever, are academic institutes producing exciting new NLP models. Using these models is critical for competing on NLP benchmarks and correspondingly to stay relevant in NLP research. We surveyed 100 papers published at EMNLP 2022 to determine whether this phenomenon constitutes a reliance on industry for NLP publications. We find that there is indeed a substantial reliance. Citations of industry artifacts and contributions across categories is at least three times greater than industry publication rates per year. Quantifying this reliance does not settle how we ought to interpret the results. We discuss two possible perspectives in our discussion: 1) Is collaboration with industry still collaboration in the absence of an alternative? Or 2) has free NLP inquiry been captured by the motivations and research direction of private corporations?

Revisiting the Optimality of Word Lengths

  • paper_url: http://arxiv.org/abs/2312.03897
  • repo_url: https://github.com/tpimentelms/optimality-of-word-lengths
  • paper_authors: Tiago Pimentel, Clara Meister, Ethan Gotlieb Wilcox, Kyle Mahowald, Ryan Cotterell
  • for: Revisiting Zipf's (1935) hypothesis that wordforms are optimized to minimize utterances' communicative costs.
  • methods: The work analyzes the channel capacity hypothesis (CCH) of Piantadosi et al. (2011), shows that their derivation minimizes only a lower bound on CCH's cost (dubbed CCH-lower), and proposes a novel derivation under which a word's length should be proportional to its surprisal's expectation plus its variance-to-mean ratio.
  • results: Across 13 languages and several experimental settings, word lengths are better predicted by frequency than by either CCH-based hypothesis; moreover, when the surprisal-based quantities are estimated with better language models, word-length predictions get worse. These results support Zipf's long-standing hypothesis.
    Abstract Zipf (1935) posited that wordforms are optimized to minimize utterances' communicative costs. Under the assumption that cost is given by an utterance's length, he supported this claim by showing that words' lengths are inversely correlated with their frequencies. Communicative cost, however, can be operationalized in different ways. Piantadosi et al. (2011) claim that cost should be measured as the distance between an utterance's information rate and channel capacity, which we dub the channel capacity hypothesis (CCH) here. Following this logic, they then proposed that a word's length should be proportional to the expected value of its surprisal (negative log-probability in context). In this work, we show that Piantadosi et al.'s derivation does not minimize CCH's cost, but rather a lower bound, which we term CCH-lower. We propose a novel derivation, suggesting an improved way to minimize CCH's cost. Under this method, we find that a language's word lengths should instead be proportional to the surprisal's expectation plus its variance-to-mean ratio. Experimentally, we compare these three communicative cost functions: Zipf's, CCH-lower , and CCH. Across 13 languages and several experimental settings, we find that length is better predicted by frequency than either of the other hypotheses. In fact, when surprisal's expectation, or expectation plus variance-to-mean ratio, is estimated using better language models, it leads to worse word length predictions. We take these results as evidence that Zipf's longstanding hypothesis holds.
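The competing predictors are easy to state side by side. The toy sketch below computes, for each word, the three quantities the abstract compares: a frequency-based predictor in the spirit of Zipf, the CCH-lower predictor E[surprisal] of Piantadosi et al., and the paper's CCH predictor E[surprisal] plus the variance-to-mean ratio. The gamma-distributed surprisal samples are synthetic stand-ins for in-context surprisals from a language model, and the proportionality constants are dropped.

```python
import numpy as np

def predicted_lengths(surprisals_per_word, frequencies):
    """Word-length predictors under the three hypotheses in the abstract.

    surprisals_per_word: dict word -> array of in-context surprisals
    (-log p); frequencies: dict word -> corpus frequency.
    """
    preds = {}
    for w, s in surprisals_per_word.items():
        mean, var = s.mean(), s.var()
        preds[w] = {
            "zipf": -np.log(frequencies[w]),   # rarer words -> longer
            "cch_lower": mean,                 # E[surprisal]
            "cch": mean + var / mean,          # E[s] + variance-to-mean ratio
        }
    return preds

rng = np.random.default_rng(0)
toy = {"of": rng.gamma(2.0, 1.0, 500), "serendipity": rng.gamma(9.0, 1.5, 500)}
freq = {"of": 0.03, "serendipity": 1e-7}
for w, p in predicted_lengths(toy, freq).items():
    print(w, {k: round(float(v), 2) for k, v in p.items()})
```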

PROMISE: A Framework for Model-Driven Stateful Prompt Orchestration

  • paper_url: http://arxiv.org/abs/2312.03699
  • repo_url: None
  • paper_authors: Wenyuan Wu, Jasmin Heierli, Max Meisterhans, Adrian Moser, Andri Färber, Mateusz Dolata, Elena Gavagnin, Alexandre de Spindler, Gerhard Schwabe
  • for: This paper presents a framework that facilitates the development of complex language-based interactions with information systems.
  • methods: It uses state machine modeling concepts to enable model-driven, dynamic prompt orchestration across hierarchically nested states and transitions, improving control over the behavior of language models.
  • results: The authors apply PROMISE to application scenarios within health information systems and demonstrate its ability to handle complex interactions.
    Abstract The advent of increasingly powerful language models has raised expectations for language-based interactions. However, controlling these models is a challenge, emphasizing the need to be able to investigate the feasibility and value of their application. We present PROMISE, a framework that facilitates the development of complex language-based interactions with information systems. Its use of state machine modeling concepts enables model-driven, dynamic prompt orchestration across hierarchically nested states and transitions. This improves the control of the behavior of language models and thus enables their effective and efficient use. We show the benefits of PROMISE in the context of application scenarios within health information systems and demonstrate its ability to handle complex interactions.
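The state-machine view of prompt orchestration can be sketched directly: each state owns a prompt template, and trigger tests on the model's reply decide the transition. This flat sketch omits the hierarchically nested states PROMISE supports, and all names and the health-care wording are illustrative.

```python
class State:
    """One conversational state: a prompt template plus transition rules."""
    def __init__(self, name, prompt, transitions):
        self.name, self.prompt, self.transitions = name, prompt, transitions

class PromptStateMachine:
    """Drive an LLM through states: which prompt is sent depends on where
    the conversation currently is, and a trigger test on each reply
    decides the next state."""

    def __init__(self, states, start):
        self.states = {s.name: s for s in states}
        self.current = start

    def step(self, llm, user_input):
        state = self.states[self.current]
        reply = llm(state.prompt.format(user_input=user_input))
        for trigger, next_state in state.transitions:
            if trigger(reply):
                self.current = next_state
                break
        return reply

states = [
    State("intake",
          "Ask the patient about symptoms. Patient said: {user_input}",
          [(lambda r: "pain" in r.lower(), "follow_up")]),
    State("follow_up",
          "Probe for pain details. Patient said: {user_input}", []),
]
sm = PromptStateMachine(states, start="intake")
print(sm.step(lambda p: "The patient reports pain.", "my knee hurts"),
      sm.current)  # ... follow_up
```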

Evaluating and Mitigating Discrimination in Language Model Decisions

  • paper_url: http://arxiv.org/abs/2312.03689
  • repo_url: None
  • paper_authors: Alex Tamkin, Amanda Askell, Liane Lovitt, Esin Durmus, Nicholas Joseph, Shauna Kravec, Karina Nguyen, Jared Kaplan, Deep Ganguli
  • for: The paper aims to evaluate the potential discriminatory impact of language models (LMs) in a wide range of use cases, including hypothetical scenarios where they have not yet been deployed.
  • methods: The authors generate a wide array of potential prompts that decision-makers may input into an LM, spanning 70 diverse decision scenarios across society, and systematically vary the demographic information in each prompt; they apply this methodology to the Claude 2.0 model.
  • results: The authors find patterns of both positive and negative discrimination in the Claude 2.0 model in select settings when no interventions are applied, and demonstrate prompt-engineering techniques that significantly decrease both positive and negative discrimination.
    Abstract As language models (LMs) advance, interest is growing in applying them to high-stakes societal decisions, such as determining financing or housing eligibility. However, their potential for discrimination in such contexts raises ethical concerns, motivating the need for better methods to evaluate these risks. We present a method for proactively evaluating the potential discriminatory impact of LMs in a wide range of use cases, including hypothetical use cases where they have not yet been deployed. Specifically, we use an LM to generate a wide array of potential prompts that decision-makers may input into an LM, spanning 70 diverse decision scenarios across society, and systematically vary the demographic information in each prompt. Applying this methodology reveals patterns of both positive and negative discrimination in the Claude 2.0 model in select settings when no interventions are applied. While we do not endorse or permit the use of language models to make automated decisions for the high-risk use cases we study, we demonstrate techniques to significantly decrease both positive and negative discrimination through careful prompt engineering, providing pathways toward safer deployment in use cases where they may be appropriate. Our work enables developers and policymakers to anticipate, measure, and address discrimination as language model capabilities and applications continue to expand. We release our dataset and prompts at https://huggingface.co/datasets/Anthropic/discrim-eval
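The methodology lends itself to a compact sketch: hold a decision scenario fixed and enumerate demographic variants of the same prompt. The template wording and attribute lists below are illustrative, not taken from the paper's released dataset (linked above).

```python
import itertools

TEMPLATE = (
    "The applicant is a {age}-year-old {gender} {race} person applying for "
    "a small business loan with a stable income. Should the loan be "
    "approved? Answer yes or no."
)

ages = [25, 45, 65]
genders = ["male", "female", "non-binary"]
races = ["white", "Black", "Asian", "Hispanic"]

prompts = [
    TEMPLATE.format(age=a, gender=g, race=r)
    for a, g, r in itertools.product(ages, genders, races)
]
# Feed each prompt to the LM and compare P("yes") across demographic
# variants of an otherwise identical scenario; a systematic gap is
# evidence of positive or negative discrimination for that scenario.
print(len(prompts), prompts[0])
```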

Interpretability Illusions in the Generalization of Simplified Models

  • paper_url: http://arxiv.org/abs/2312.03656
  • repo_url: None
  • paper_authors: Dan Friedman, Andrew Lampinen, Lucas Dixon, Danqi Chen, Asma Ghandeharioun
  • for: This work examines the faithfulness of studying deep learning systems through simplified model representations.
  • methods: The authors simplify trained Transformer models with tools such as dimensionality reduction and clustering, then explicitly test how well these simplified proxies match the original model's behavior on various out-of-distribution test sets.
  • results: Even when simplified representations accurately approximate the full model on the training set, they are generally less faithful out of distribution; where the original model generalizes to novel structures or deeper depths, the simplified versions may fail, or generalize better, and this holds even when the simplified representations do not directly depend on the training distribution.
    Abstract A common method to study deep learning systems is to use simplified model representations -- for example, using singular value decomposition to visualize the model's hidden states in a lower dimensional space. This approach assumes that the results of these simplified representations are faithful to the original model. Here, we illustrate an important caveat to this assumption: even if the simplified representations can accurately approximate the full model on the training set, they may fail to accurately capture the model's behavior out of distribution -- the understanding developed from simplified representations may be an illusion. We illustrate this by training Transformer models on controlled datasets with systematic generalization splits. First, we train models on the Dyck balanced-parenthesis languages. We simplify these models using tools like dimensionality reduction and clustering, and then explicitly test how these simplified proxies match the behavior of the original model on various out-of-distribution test sets. We find that the simplified proxies are generally less faithful out of distribution. In cases where the original model generalizes to novel structures or deeper depths, the simplified versions may fail, or generalize better. This finding holds even if the simplified representations do not directly depend on the training distribution. Next, we study a more naturalistic task: predicting the next character in a dataset of computer code. We find similar generalization gaps between the original model and simplified proxies, and conduct further analysis to investigate which aspects of the code completion task are associated with the largest gaps. Together, our results raise questions about the extent to which mechanistic interpretations derived using tools like SVD can reliably predict what a model will do in novel situations.
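A toy version of the paper's caveat can be reproduced in a few lines: fit a low-rank (SVD) proxy of a model's hidden states on in-distribution data, then compare the proxy's predictions with the full model's off distribution. Everything here (the synthetic hidden states, the linear readout, the rank) is a stand-in for the paper's Transformer experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# In-distribution hidden states live near an 8-D subspace of a 64-D space;
# OOD states carry large components outside that subspace.
basis = rng.standard_normal((8, 64))
H_train = rng.standard_normal((1000, 8)) @ basis \
    + 0.05 * rng.standard_normal((1000, 64))
H_ood = rng.standard_normal((1000, 8)) @ basis \
    + 2.0 * rng.standard_normal((1000, 64))
readout = rng.standard_normal((64, 5))      # the "model's" output head

def low_rank_proxy(H_fit, rank):
    """Project hidden states onto the top-`rank` right singular vectors
    fit on H_fit: the usual SVD simplification."""
    mu = H_fit.mean(0)
    _, _, Vt = np.linalg.svd(H_fit - mu, full_matrices=False)
    V = Vt[:rank].T
    return lambda H: (H - mu) @ V @ V.T + mu

proxy = low_rank_proxy(H_train, rank=8)

def agreement(H):
    return ((H @ readout).argmax(1) == (proxy(H) @ readout).argmax(1)).mean()

print("in-distribution agreement:    ", agreement(H_train))  # near 1.0
print("out-of-distribution agreement:", agreement(H_ood))    # much lower
```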

Improving Bias Mitigation through Bias Experts in Natural Language Understanding

  • paper_url: http://arxiv.org/abs/2312.03577
  • repo_url: https://github.com/jej127/bias-experts
  • paper_authors: Eojin Jeon, Mingyu Lee, Juhyeong Park, Yeachan Kim, Wing-Lam Mok, SangKeun Lee
  • for: Mitigating the effect of dataset biases so that models perform well on out-of-distribution data as well as in-distribution data.
  • methods: Binary classifiers, coined bias experts, are introduced between the auxiliary model and the main model; each bias expert is trained on a binary classification task derived from the multi-class classification task via the One-vs-Rest approach.
  • results: Experiments show the proposed strategy improves the bias identification ability of the auxiliary model, and the debiased model consistently outperforms the state of the art on various challenge datasets.
    Abstract Biases in the dataset often enable the model to achieve high performance on in-distribution data, while poorly performing on out-of-distribution data. To mitigate the detrimental effect of the bias on the networks, previous works have proposed debiasing methods that down-weight the biased examples identified by an auxiliary model, which is trained with explicit bias labels. However, finding a type of bias in datasets is a costly process. Therefore, recent studies have attempted to make the auxiliary model biased without the guidance (or annotation) of bias labels, by constraining the model's training environment or the capability of the model itself. Despite the promising debiasing results of recent works, the multi-class learning objective, which has been naively used to train the auxiliary model, may harm the bias mitigation effect due to its regularization effect and competitive nature across classes. As an alternative, we propose a new debiasing framework that introduces binary classifiers between the auxiliary model and the main model, coined bias experts. Specifically, each bias expert is trained on a binary classification task derived from the multi-class classification task via the One-vs-Rest approach. Experimental results demonstrate that our proposed strategy improves the bias identification ability of the auxiliary model. Consequently, our debiased model consistently outperforms the state-of-the-art on various challenge datasets.
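The One-vs-Rest construction is simple to sketch: one binary classifier per class, trained with BCE against "class c vs. rest" targets, whose confidence on the gold class serves as a bias score. The linear experts and the particular down-weighting at the end are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

n_classes, d = 3, 16

# Auxiliary "bias expert" ensemble: one binary classifier per class,
# trained One-vs-Rest instead of with a multi-class softmax.
experts = nn.ModuleList([nn.Linear(d, 1) for _ in range(n_classes)])
optim = torch.optim.Adam(experts.parameters(), lr=1e-2)
bce = nn.BCEWithLogitsLoss()

x = torch.randn(64, d)
y = torch.randint(0, n_classes, (64,))

for _ in range(100):
    optim.zero_grad()
    loss = sum(
        bce(experts[c](x).squeeze(1), (y == c).float())  # class c vs. rest
        for c in range(n_classes))
    loss.backward()
    optim.step()

# Per-example bias score: the gold-class expert's probability. Examples the
# shallow experts already classify confidently are treated as biased and
# down-weighted when training the debiased main model.
with torch.no_grad():
    probs = torch.sigmoid(torch.cat([e(x) for e in experts], dim=1))
    bias_score = probs[torch.arange(len(y)), y]
main_loss_weight = 1.0 - bias_score   # one common down-weighting scheme
print(main_loss_weight[:5])
```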

XAIQA: Explainer-Based Data Augmentation for Extractive Question Answering

  • paper_url: http://arxiv.org/abs/2312.03567
  • repo_url: None
  • paper_authors: Joel Stremmel, Ardavan Saeedi, Hamid Hassanzadeh, Sanjit Batra, Jeffrey Hertzberg, Jaime Murillo, Eran Halperin
  • for: The paper is written for physicians and researchers who need to query medical records to design clinical studies and understand patient medical history.
  • methods: The paper introduces XAIQA, a novel approach that generates synthetic QA pairs at scale from data naturally available in electronic health records, using the idea of a classification model explainer to generate questions and answers about medical concepts corresponding to medical codes.
  • results: In an expert evaluation, XAIQA identifies 2.2x more semantic matches and 3.8x more clinical abbreviations than two popular sentence-transformer approaches for creating QA pairs, and adding its QA pairs improves the performance of GPT-4 as an extractive QA model, including on difficult questions.
    Abstract Extractive question answering (QA) systems can enable physicians and researchers to query medical records, a foundational capability for designing clinical studies and understanding patient medical history. However, building these systems typically requires expert-annotated QA pairs. Large language models (LLMs), which can perform extractive QA, depend on high quality data in their prompts, specialized for the application domain. We introduce a novel approach, XAIQA, for generating synthetic QA pairs at scale from data naturally available in electronic health records. Our method uses the idea of a classification model explainer to generate questions and answers about medical concepts corresponding to medical codes. In an expert evaluation with two physicians, our method identifies $2.2\times$ more semantic matches and $3.8\times$ more clinical abbreviations than two popular approaches that use sentence transformers to create QA pairs. In an ML evaluation, adding our QA pairs improves performance of GPT-4 as an extractive QA model, including on difficult questions. In both the expert and ML evaluations, we examine trade-offs between our method and sentence transformers for QA pair generation depending on question difficulty.
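One way to picture the explainer idea, as a hedged stand-in rather than XAIQA's actual pipeline: attribute a medical-code classifier's prediction to sentences of a note via leave-one-sentence-out occlusion, take the strongest evidence sentence as the answer, and template a question from the code's description. The occlusion explainer, the toy classifier, and the question template are all assumptions.

```python
def explainer_qa_pairs(note_sentences, code_classifier, code_description):
    """Pair the sentence that most supports a code prediction with a
    question templated from the code's description."""
    full_score = code_classifier(" ".join(note_sentences))
    drops = []
    for i in range(len(note_sentences)):
        reduced = " ".join(s for j, s in enumerate(note_sentences) if j != i)
        drops.append(full_score - code_classifier(reduced))  # attribution
    best = max(range(len(note_sentences)), key=lambda i: drops[i])
    question = f"What in the note indicates {code_description}?"
    return question, note_sentences[best]

# Toy classifier: score proxy = occurrences of the word "diabetes".
clf = lambda text: text.lower().count("diabetes") / 10
note = ["Patient presents with fatigue.",
        "History of type 2 diabetes, on metformin.",
        "Vitals stable."]
print(explainer_qa_pairs(note, clf, "type 2 diabetes (E11.9)"))
```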

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment

  • paper_url: http://arxiv.org/abs/2312.03549
  • repo_url: None
  • paper_authors: Fei Yang, Shuang Peng, Ning Sun, Fangyu Wang, Ke Tan, Fu Wu, Jiezhong Qiu, Aimin Pan
  • for: This paper aims to improve the efficiency and scalability of training large language models (LLMs).
  • methods: It employs carefully crafted data and model parallelism strategies, together with a novel scheduling method that allocates distinct computational tasklets to specific groups of GPU devices based on the characteristics of their connected NICs.
  • results: The framework achieves performance close to that of homogeneous RDMA-capable networks even in heterogeneous NIC environments, demonstrates scalability to multiple GPU clusters, and can be seamlessly integrated with mainstream LLM frameworks.
    Abstract Large language models (LLMs) such as GPT-3, OPT, and LLaMA have demonstrated remarkable accuracy in a wide range of tasks. However, training these models can incur significant expenses, often requiring tens of thousands of GPUs for months of continuous operation. Typically, this training is carried out in specialized GPU clusters equipped with homogeneous high-speed Remote Direct Memory Access (RDMA) network interface cards (NICs). The acquisition and maintenance of such dedicated clusters is challenging. Current LLM training frameworks, like Megatron-LM and Megatron-DeepSpeed, focus primarily on optimizing training within homogeneous cluster settings. In this paper, we introduce Holmes, a training framework for LLMs that employs thoughtfully crafted data and model parallelism strategies over the heterogeneous NIC environment. Our primary technical contribution lies in a novel scheduling method that intelligently allocates distinct computational tasklets in LLM training to specific groups of GPU devices based on the characteristics of their connected NICs. Furthermore, our proposed framework, utilizing pipeline parallel techniques, demonstrates scalability to multiple GPU clusters, even in scenarios without high-speed interconnects between nodes in distinct clusters. We conducted comprehensive experiments that involved various scenarios in the heterogeneous NIC environment. In most cases, our framework achieves performance levels close to those achievable with homogeneous RDMA-capable networks (InfiniBand or RoCE), significantly exceeding training efficiency within the pure Ethernet environment. Additionally, we verified that our framework outperforms other mainstream LLM frameworks under heterogeneous NIC environment in terms of training efficiency and can be seamlessly integrated with them.

Sig-Networks Toolkit: Signature Networks for Longitudinal Language Modelling

  • paper_url: http://arxiv.org/abs/2312.03523
  • repo_url: None
  • paper_authors: Talia Tseriotou, Ryan Sze-Yin Chan, Adam Tsakalidis, Iman Munire Bilal, Elena Kochkina, Terry Lyons, Maria Liakata
  • for: This paper presents Sig-Networks, an open-source, pip-installable toolkit for longitudinal language modelling.
  • methods: The toolkit centres on signature-based neural network models, which have recently shown success in temporal tasks. It applies and extends published research, providing a full suite of signature-based models whose components can be used as PyTorch building blocks in future architectures. Sig-Networks supports task-agnostic dataset plug-in, simple preprocessing of sequential data, parameter flexibility, and automated tuning across a range of models.
  • results: The toolkit is examined on three NLP tasks of varying temporal granularity (counselling conversations, rumour stance switch, and mood changes in social media threads), achieving SOTA performance on all three, with guidance provided for future tasks.
    Abstract We present an open-source, pip installable toolkit, Sig-Networks, the first of its kind for longitudinal language modelling. A central focus is the incorporation of Signature-based Neural Network models, which have recently shown success in temporal tasks. We apply and extend published research providing a full suite of signature-based models. Their components can be used as PyTorch building blocks in future architectures. Sig-Networks enables task-agnostic dataset plug-in, seamless pre-processing for sequential data, parameter flexibility, automated tuning across a range of models. We examine signature networks under three different NLP tasks of varying temporal granularity: counselling conversations, rumour stance switch and mood changes in social media threads, showing SOTA performance in all three, and provide guidance for future tasks. We release the Toolkit as a PyTorch package with an introductory video, Git repositories for preprocessing and modelling including sample notebooks on the modeled NLP tasks.
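For intuition about the signature features these models consume, here is a minimal NumPy sketch of the depth-2 path signature of a piecewise-linear path; the toolkit itself ships PyTorch building blocks, so this standalone computation is only illustrative.

```python
import numpy as np

def signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path.
    path: (n_points, d) array. Returns (level1, level2) where
    level1[i] = total increment of channel i and
    level2[i, j] = iterated integral of dx_i dx_j along the path."""
    deltas = np.diff(path, axis=0)             # per-segment increments
    level1 = deltas.sum(axis=0)
    d = path.shape[1]
    level2 = np.zeros((d, d))
    running = np.zeros(d)                      # increment accumulated so far
    for step in deltas:
        # cross term: past increment times current step, plus the
        # within-segment contribution outer(step, step) / 2
        level2 += np.outer(running, step) + 0.5 * np.outer(step, step)
        running += step
    return level1, level2

t = np.linspace(0, 1, 100)
path = np.stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)], axis=1)
lvl1, lvl2 = signature_depth2(path)
print(lvl1, lvl2[0, 1] - lvl2[1, 0])  # antisymmetric part: 2x signed area
```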

Exploring Answer Information Methods for Question Generation with Transformers

  • paper_url: http://arxiv.org/abs/2312.03483
  • repo_url: None
  • paper_authors: Talha Chafekar, Aafiya Hussain, Grishma Sharma, Deepak Sharma
  • for: This work explores the effect of different methods of providing the target answer as input for question generation; such experimentation had previously been carried out mostly for RNN-based models.
  • methods: Three methods and their combinations are used for incorporating answer information: answer prompting; a custom product method combining answer embeddings and encoder outputs; choosing sentences from the input paragraph that contain answer-related information; and a separate cross-attention block in the decoder that attends to the answer.
  • results: Answer prompting without any additional modes obtains the best scores across ROUGE and METEOR metrics. A custom metric is also used to calculate how many of the generated questions have the same answer as the one used to generate them.
    Abstract There has been a lot of work in question generation where different methods to provide target answers as input have been employed. This experimentation has mostly been carried out for RNN-based models. We use three different methods and their combinations for incorporating answer information and explore their effect on several automatic evaluation metrics. The methods used are answer prompting, a custom product method using answer embeddings and encoder outputs, choosing sentences from the input paragraph that have answer-related information, and a separate cross-attention block in the decoder which attends to the answer. We observe that answer prompting without any additional modes obtains the best scores across ROUGE and METEOR metrics. Additionally, we use a custom metric to calculate how many of the generated questions have the same answer as the answer which is used to generate them.
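As an illustration of the best-performing setup, answer prompting, the sketch below prepends the target answer to the passage before seq2seq generation. The `t5-small` checkpoint and the `answer: ... context: ...` template are illustrative assumptions, not the paper's exact configuration, and a checkpoint fine-tuned for question generation would be needed for sensible outputs.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def generate_question(context: str, answer: str) -> str:
    # Answer prompting: prepend the target answer to the passage so the
    # decoder is conditioned on it without any architectural changes.
    prompt = f"answer: {answer} context: {context}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=48, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(generate_question(
    "Marie Curie won the Nobel Prize in Physics in 1903.", "1903"))
```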

AMR Parsing is Far from Solved: GrAPES, the Granular AMR Parsing Evaluation Suite

  • paper_url: http://arxiv.org/abs/2312.03480
  • repo_url: https://github.com/jgroschwitz/grapes
  • paper_authors: Jonas Groschwitz, Shay B. Cohen, Lucia Donatelli, Meaghan Fowlie
  • for: This work develops the Granular AMR Parsing Evaluation Suite (GrAPES), a challenge set for testing the abilities of existing Abstract Meaning Representation (AMR) parsers across a range of linguistic phenomena.
  • methods: Several existing AMR parsers are evaluated with newly developed metrics over 36 categories, ranging from seen and unseen labels to structural generalization and coreference.
  • results: Current AMR parsers perform well on some phenomena but still make frequent errors elsewhere, particularly on node labels and graph structure.
    Abstract We present the Granular AMR Parsing Evaluation Suite (GrAPES), a challenge set for Abstract Meaning Representation (AMR) parsing with accompanying evaluation metrics. AMR parsers now obtain high scores on the standard AMR evaluation metric Smatch, close to or even above reported inter-annotator agreement. But that does not mean that AMR parsing is solved; in fact, human evaluation in previous work indicates that current parsers still quite frequently make errors on node labels or graph structure that substantially distort sentence meaning. Here, we provide an evaluation suite that tests AMR parsers on a range of phenomena of practical, technical, and linguistic interest. Our 36 categories range from seen and unseen labels, to structural generalization, to coreference. GrAPES reveals in depth the abilities and shortcomings of current AMR parsers.

DBCopilot: Scaling Natural Language Querying to Massive Databases

  • paper_url: http://arxiv.org/abs/2312.03463
  • repo_url: https://github.com/tshu-w/dbcopilot
  • paper_authors: Tianshu Wang, Hongyu Lin, Xianpei Han, Le Sun, Xiaoyang Chen, Hao Wang, Zhenyu Zeng
  • for: This paper aims to solve the scalability problems that existing text-to-SQL frameworks face when confronted with massive, dynamically changing databases.
  • methods: The paper employs a compact and flexible copilot model for routing across massive databases. Specifically, DBCopilot decouples the text-to-SQL process into schema routing and SQL generation, using a lightweight sequence-to-sequence neural network router to formulate database connections and navigate natural language questions through databases and tables.
  • results: Experimental results show that DBCopilot is a scalable and effective solution for real-world text-to-SQL tasks, providing an automatic learning-and-adaptation mechanism over massive databases.
    Abstract Text-to-SQL simplifies database interactions by enabling non-experts to convert their natural language (NL) questions into Structured Query Language (SQL) queries. While recent advances in large language models (LLMs) have improved the zero-shot text-to-SQL paradigm, existing methods face scalability challenges when dealing with massive, dynamically changing databases. This paper introduces DBCopilot, a framework that addresses these challenges by employing a compact and flexible copilot model for routing across massive databases. Specifically, DBCopilot decouples the text-to-SQL process into schema routing and SQL generation, leveraging a lightweight sequence-to-sequence neural network-based router to formulate database connections and navigate natural language questions through databases and tables. The routed schemas and questions are then fed into LLMs for efficient SQL generation. Furthermore, DBCopilot also introduced a reverse schema-to-question generation paradigm, which can learn and adapt the router over massive databases automatically without requiring manual intervention. Experimental results demonstrate that DBCopilot is a scalable and effective solution for real-world text-to-SQL tasks, providing a significant advancement in handling large-scale schemas.
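A minimal sketch of the decoupled schema-routing / SQL-generation flow is given below; the keyword-overlap `route_schema` stands in for the paper's lightweight seq2seq router, and the final LLM call is left as a placeholder (`llm_complete`), since both are assumptions rather than the released implementation.

```python
def route_schema(question: str, catalog: dict, top_k: int = 3) -> list:
    """Placeholder router: score (database, table) pairs by naive keyword
    overlap with the question. The paper uses a lightweight seq2seq model."""
    words = set(question.lower().split())
    scored = []
    for db, tables in catalog.items():
        for table, columns in tables.items():
            names = {db.lower(), table.lower(), *[c.lower() for c in columns]}
            scored.append((len(words & names), db, table))
    scored.sort(reverse=True)
    return [(db, table) for _, db, table in scored[:top_k]]

def build_sql_prompt(question: str, routed: list, catalog: dict) -> str:
    schema_lines = [
        f"{db}.{table}({', '.join(catalog[db][table])})" for db, table in routed
    ]
    return ("Given the tables:\n" + "\n".join(schema_lines) +
            f"\nWrite a SQL query answering: {question}\nSQL:")

catalog = {"shop": {"orders": ["id", "user_id", "total"],
                    "users": ["id", "name", "country"]}}
routed = route_schema("total orders per country", catalog)
print(build_sql_prompt("total orders per country", routed, catalog))
# The prompt would then be sent to an LLM, e.g. sql = llm_complete(prompt).
```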

Think from Words(TFW): Initiating Human-Like Cognition in Large Language Models Through Think from Words for Japanese Text-level Classification

  • paper_url: http://arxiv.org/abs/2312.03458
  • repo_url: None
  • paper_authors: Chengguang Gan, Qinghao Zhang, Tatsunori Mori
  • for: This paper aims to improve the text comprehension of Large Language Models (LLMs) by bridging the gap between LLM and human-like thinking processes, specifically in the domain of Japanese text.
  • methods: The paper proposes two methods, “Think from Words” (TFW) and “TFW with Extra word-level information” (TFW Extra), which initiate the comprehension process at the word level and incorporate additional word-level data to enhance LLMs’ text comprehension.
  • results: The paper employs text classification on six Japanese datasets to assess the effectiveness of TFW and investigate the impact of various word-level information types on LLMs’ text comprehension, providing insights into their potential to cause misinterpretations and errors in the overall comprehension of the final text.
    Abstract The proliferation of Large Language Models (LLMs) has spurred extensive research into LLM-related Prompt investigations, such as Instruction Learning (IL), In-context Learning (ICL), and Chain-of-Thought (CoT). These approaches aim to improve LLMs' responses by enabling them to provide concise statements or examples for deeper contemplation when addressing questions. However, independent thinking by LLMs can introduce variability in their thought processes, leading to potential inaccuracies. In response, our study seeks to bridge the gap between LLM and human-like thinking processes, recognizing that text comprehension begins with understanding individual words. To tackle this challenge, we have expanded the CoT method to cater to a specific domain. Our approach, known as "Think from Words" (TFW), initiates the comprehension process at the word level and then extends it to encompass the entire text. We also propose "TFW with Extra word-level information" (TFW Extra), augmenting comprehension with additional word-level data. To assess our methods, we employ text classification on six Japanese datasets comprising text-level and word-level elements. Our findings not only validate the effectiveness of TFW but also shed light on the impact of various word-level information types on LLMs' text comprehension, offering insights into their potential to cause misinterpretations and errors in the overall comprehension of the final text.

Comparative Analysis of Multilingual Text Classification & Identification through Deep Learning and Embedding Visualization

  • paper_url: http://arxiv.org/abs/2312.03789
  • repo_url: None
  • paper_authors: Arinjay Wyawhare
  • for: This research conducts a comparative study of multilingual text classification and identification methods, using deep learning and embedding visualization.
  • methods: It employs LangDetect, LangId, FastText, and Sentence Transformer models, tested on a dataset covering 17 languages.
  • results: FastText's 2D visualization shows clearer clustering, owing to its extensive multilingual corpus training, and the FastText multi-layer perceptron model achieves remarkable accuracy, precision, recall, and F1 score, outperforming the Sentence Transformer model.
    Abstract This research conducts a comparative study on multilingual text classification methods, utilizing deep learning and embedding visualization. The study employs LangDetect, LangId, FastText, and Sentence Transformer on a dataset encompassing 17 languages. It explores dimensionality's impact on clustering, revealing FastText's clearer clustering in 2D visualization due to its extensive multilingual corpus training. Notably, the FastText multi-layer perceptron model achieved remarkable accuracy, precision, recall, and F1 score, outperforming the Sentence Transformer model. The study underscores the effectiveness of these techniques in multilingual text classification, emphasizing the importance of large multilingual corpora for training embeddings. It lays the groundwork for future research and assists practitioners in developing language detection and classification systems. Additionally, it includes the comparison of multi-layer perceptron, LSTM, and Convolution models for classification.

SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM

  • paper_url: http://arxiv.org/abs/2312.03788
  • repo_url: None
  • paper_authors: Jiayi Pan, Chengcan Wang, Kaifu Zheng, Yangguang Li, Zhenyu Wang, Bin Feng
  • for: This paper proposes an accurate and efficient 4-bit quantization method so that large language models (LLMs) can be deployed on devices with limited compute and memory resources.
  • methods: SmoothQuant+ is an accurate and efficient 4-bit weight-only post-training quantization method that requires no additional training and preserves the LLM's accuracy losslessly. It smooths activation outliers by channel and adjusts the corresponding weights for mathematical equivalence before group-wise 4-bit weight quantization, so that the quantized model matches the accuracy of the original.
  • results: With SmoothQuant+, the Code Llama-34B model can be quantized and deployed on a single A100 40GB GPU with lossless accuracy, a 1.9x to 4.0x throughput increase over the FP16 model deployed on two A100 40GB GPUs, and a per-token latency of only 68% of that FP16 baseline. This is reported as the state-of-the-art 4-bit quantization method for LLMs.
    Abstract Large language models (LLMs) have shown remarkable capabilities in various tasks. However their huge model size and the consequent demand for computational and memory resources also pose challenges to model deployment. Currently, 4-bit post-training quantization (PTQ) has achieved some success in LLMs, reducing the memory footprint by approximately 75% compared to FP16 models, albeit with some accuracy loss. In this paper, we propose SmoothQuant+, an accurate and efficient 4-bit weight-only PTQ that requires no additional training, which enables lossless accuracy for LLMs for the first time. Based on the fact that the loss of weight quantization is amplified by the activation outliers, SmoothQuant+ smoothes the activation outliers by channel before quantization, while adjusting the corresponding weights for mathematical equivalence, and then performs group-wise 4-bit weight quantization for linear layers. We have integrated SmoothQuant+ into the vLLM framework, an advanced high-throughput inference engine specially developed for LLMs, and equipped it with efficient W4A16 CUDA kernels, so that vLLM can seamlessly support SmoothQuant+ 4-bit weight quantization. Our results show that, with SmoothQuant+, the Code Llama-34B model can be quantized and deployed on a A100 40GB GPU, achieving lossless accuracy and a throughput increase of 1.9 to 4.0 times compared to the FP16 model deployed on two A100 40GB GPUs. Moreover, the latency per token is only 68% of the FP16 model deployed on two A100 40GB GPUs. To the best of our knowledge, this is the state-of-the-art 4-bit weight quantization for LLMs.
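Below is a minimal NumPy sketch of the two ingredients described above: channel-wise smoothing that migrates activation outliers into the weights, followed by group-wise symmetric 4-bit weight quantization. The smoothing strength `alpha` and the group size are illustrative choices, not the released implementation.

```python
import numpy as np

def smooth_channels(W, act_absmax, alpha=0.5, eps=1e-8):
    """W: (out, in) linear weight; act_absmax: per-input-channel |activation| max.
    Returns smoothed weights and the scale to divide activations by, so that
    (X / s) @ (W * s).T == X @ W.T up to floating point."""
    w_absmax = np.abs(W).max(axis=0) + eps
    s = act_absmax**alpha / w_absmax**(1 - alpha)
    return W * s, s

def quantize_groupwise_int4(W, group_size=128):
    """Symmetric per-group 4-bit quantization along the input dimension."""
    out_dim, in_dim = W.shape
    Wq = np.empty_like(W)
    for start in range(0, in_dim, group_size):
        block = W[:, start:start + group_size]
        scale = np.abs(block).max(axis=1, keepdims=True) / 7.0 + 1e-12
        q = np.clip(np.round(block / scale), -8, 7)   # int4 range
        Wq[:, start:start + group_size] = q * scale   # dequantized view
    return Wq

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 256))
act_absmax = np.abs(rng.normal(size=256)) * 10        # pretend outliers
W_s, s = smooth_channels(W, act_absmax)
W_q = quantize_groupwise_int4(W_s)
print("quantization RMSE:", np.sqrt(((W_s - W_q) ** 2).mean()))
```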

Compressed Context Memory For Online Language Model Interaction

  • paper_url: http://arxiv.org/abs/2312.03414
  • repo_url: https://github.com/snu-mllab/context-memory
  • paper_authors: Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song
  • for: This work proposes a context compression method for Transformer language models in online scenarios such as ChatGPT, where the context continually expands; as it lengthens, the attention process requires more memory and computation, reducing the language model's throughput.
  • methods: A compressed context memory system continually compresses the growing context into a compact memory space, implemented by integrating a lightweight conditional LoRA into the language model's forward pass during inference; the model can then perform inference with reduced memory and attention operations.
  • results: Evaluations on conversation, personalization, and multi-task learning show that the approach matches the performance of a full-context model with a $5\times$ smaller context memory space. Code is available at https://github.com/snu-mllab/context-memory.
    Abstract This paper presents a novel context compression method for Transformer language models in online scenarios such as ChatGPT, where the context continually expands. As the context lengthens, the attention process requires more memory and computational resources, which in turn reduces the throughput of the language model. To this end, we propose a compressed context memory system that continually compresses the growing context into a compact memory space. The compression process simply involves integrating a lightweight conditional LoRA into the language model's forward pass during inference. Based on the compressed context memory, the language model can perform inference with reduced memory and attention operations. Through evaluations on conversation, personalization, and multi-task learning, we demonstrate that our approach achieves the performance level of a full context model with $5\times$ smaller context memory space. Codes are available at https://github.com/snu-mllab/context-memory.
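Here is a minimal PyTorch sketch of the conditional-LoRA building block: a frozen linear layer plus a low-rank update applied only on designated compression-token positions. The gating convention and shapes are illustrative assumptions; the released code is linked above.

```python
import torch
import torch.nn as nn

class ConditionalLoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank update that is applied
    only where `compress_mask` is True (e.g., on <COMP> token positions)."""
    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)               # keep the LM frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = scale

    def forward(self, x, compress_mask):
        y = self.base(x)
        delta = (x @ self.A.T) @ self.B.T * self.scale
        return y + compress_mask.unsqueeze(-1) * delta

layer = ConditionalLoRALinear(nn.Linear(64, 64))
x = torch.randn(2, 10, 64)
mask = torch.zeros(2, 10, dtype=torch.bool)
mask[:, -1] = True                                # compress on the last token
print(layer(x, mask).shape)                       # torch.Size([2, 10, 64])
```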

A Text-to-Text Model for Multilingual Offensive Language Identification

  • paper_url: http://arxiv.org/abs/2312.03379
  • repo_url: None
  • paper_authors: Tharindu Ranasinghe, Marcos Zampieri
  • for: This work develops a transformer-based language model for identifying offensive content on social media (e.g., hate speech, cyberbullying, and cyberaggression).
  • methods: It uses text-to-text transformers (T5) pre-trained on two large offensive language identification datasets (SOLID and CCTK), studying the effect of combining the two datasets and of selecting an optimal threshold for semi-supervised instances in SOLID.
  • results: The pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, on multiple English benchmarks. In addition, the first multilingual pre-trained model for offensive language identification is trained with mT5 and achieves a new state of the art on six languages (German, Hindi, Korean, Marathi, Sinhala, and Spanish).
    Abstract The ubiquity of offensive content on social media is a growing cause for concern among companies and government organizations. Recently, transformer-based models such as BERT, XLNET, and XLM-R have achieved state-of-the-art performance in detecting various forms of offensive content (e.g. hate speech, cyberbullying, and cyberaggression). However, the majority of these models are limited in their capabilities due to their encoder-only architecture, which restricts the number and types of labels in downstream tasks. Addressing these limitations, this study presents the first pre-trained model with encoder-decoder architecture for offensive language identification with text-to-text transformers (T5) trained on two large offensive language identification datasets; SOLID and CCTK. We investigate the effectiveness of combining two datasets and selecting an optimal threshold in semi-supervised instances in SOLID in the T5 retraining step. Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, in multiple English benchmarks. Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5 and evaluate its performance on a set of six different languages (German, Hindi, Korean, Marathi, Sinhala, and Spanish). The results demonstrate that this multilingual model achieves a new state-of-the-art on all the above datasets, showing its usefulness in multilingual scenarios. Our proposed T5-based models will be made freely available to the community.

Lazy-k: Decoding for Constrained Token Classification

  • paper_url: http://arxiv.org/abs/2312.03367
  • repo_url: https://github.com/arthurdevnl/lazyk
  • paper_authors: Arthur Hemmer, Mickaël Coustaty, Nicola Bartolo, Jérôme Brachat, Jean-Marc Ogier
  • for: The paper aims to improve the performance of probabilistic models in structured prediction.
  • methods: The paper combines probabilistic models with constrained decoding approaches, specifically in the context of token classification for information extraction, and proposes a novel decoding method called Lazy-$k$.
  • results: Constrained decoding approaches significantly improve the models' performance, especially with smaller models, and the Lazy-$k$ approach allows for more flexibility between decoding time and accuracy.
    Abstract We explore the possibility of improving probabilistic models in structured prediction. Specifically, we combine the models with constrained decoding approaches in the context of token classification for information extraction. The decoding methods search for constraint-satisfying label-assignments while maximizing the total probability. To do this, we evaluate several existing approaches, as well as propose a novel decoding method called Lazy-$k$. Our findings demonstrate that constrained decoding approaches can significantly improve the models' performances, especially when using smaller models. The Lazy-$k$ approach allows for more flexibility between decoding time and accuracy. The code for using Lazy-$k$ decoding can be found here: https://github.com/ArthurDevNL/lazyk.
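In the spirit of Lazy-$k$, the sketch below lazily enumerates label assignments in decreasing joint probability until one satisfies the constraint; the repository linked above contains the actual algorithm and its optimizations, so this simplified best-first variant is only illustrative.

```python
import heapq
import numpy as np

def lazy_decode(log_probs, satisfies):
    """log_probs: (seq_len, n_labels) per-token log-probabilities.
    satisfies: predicate over a full label sequence.
    Enumerates assignments lazily in decreasing total log-probability."""
    order = np.argsort(-log_probs, axis=1)          # per-position label ranking
    sorted_lp = np.take_along_axis(log_probs, order, axis=1)
    n = log_probs.shape[0]
    start = (0,) * n                                 # ranks: all best labels
    heap = [(-sorted_lp[np.arange(n), start].sum(), start)]
    seen = {start}
    while heap:
        neg_lp, ranks = heapq.heappop(heap)
        labels = [order[i, r] for i, r in enumerate(ranks)]
        if satisfies(labels):
            return labels, -neg_lp
        for i in range(n):                           # push successor states
            if ranks[i] + 1 < log_probs.shape[1]:
                nxt = ranks[:i] + (ranks[i] + 1,) + ranks[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    delta = sorted_lp[i, ranks[i] + 1] - sorted_lp[i, ranks[i]]
                    heapq.heappush(heap, (neg_lp - delta, nxt))
    return None, -np.inf

lp = np.log(np.array([[0.7, 0.3], [0.6, 0.4], [0.9, 0.1]]))
labels, lp_total = lazy_decode(lp, lambda ls: sum(ls) == 2)  # toy constraint
print(labels, lp_total)
```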

KhabarChin: Automatic Detection of Important News in the Persian Language

  • paper_url: http://arxiv.org/abs/2312.03361
  • repo_url: None
  • paper_authors: Hamed Hematian Hemati, Arash Lagzian, Moein Salimi Sartakhti, Hamid Beigy, Ehsaneddin Asgari
  • for: This work studies the detection of important news, to improve information awareness and decision-making efficiency for a large portion of society.
  • methods: Natural language processing (NLP) is used to automate the detection process, and a new benchmark dataset (Khabarchin) is introduced for detecting important news in Persian.
  • results: 7,869 Persian news articles from seven prominent news agencies were annotated to create the dataset. Solutions are provided for two challenges faced in annotation: high disagreement and class imbalance. Several learning-based models, from conventional machine learning to state-of-the-art transformer models, are proposed for the task, and a second task of important sentence detection within news articles, addressed in a weakly supervised manner, tackles the difficulty of finding important information in long texts.
    Abstract Being aware of important news is crucial for staying informed and making well-informed decisions efficiently. Natural Language Processing (NLP) approaches can significantly automate this process. This paper introduces the detection of important news, in a previously unexplored area, and presents a new benchmarking dataset (Khabarchin) for detecting important news in the Persian language. We define important news articles as those deemed significant for a considerable portion of society, capable of influencing their mindset or decision-making. The news articles are obtained from seven different prominent Persian news agencies, resulting in the annotation of 7,869 samples and the creation of the dataset. Two challenges of high disagreement and imbalance between classes were faced, and solutions were provided for them. We also propose several learning-based models, ranging from conventional machine learning to state-of-the-art transformer models, to tackle this task. Furthermore, we introduce the second task of important sentence detection in news articles, as they often come with a significant contextual length that makes it challenging for readers to identify important information. We identify these sentences in a weakly supervised manner.

Topic and genre in dialogue

  • paper_url: http://arxiv.org/abs/2312.03342
  • repo_url: None
  • paper_authors: Amandine Decker, Ellen Breitholtz, Christine Howes, Staffan Larsson
  • for: This paper argues that topic plays a fundamental role in conversations and that the concept is needed in addition to genre, with the two separated and orthogonally defined, to build reliable, controllable, and customizable dialogue systems.
  • methods: The paper draws on topic analysis and classification, together with dialogue analysis and modelling.
  • results: Defining topic and genre separately and orthogonally would enable modular, reliable, and controllable flexible-domain dialogue systems.
    Abstract In this paper we argue that topic plays a fundamental role in conversations, and that the concept is needed in addition to that of genre to define interactions. In particular, the concepts of genre and topic need to be separated and orthogonally defined. This would enable modular, reliable and controllable flexible-domain dialogue systems.

Measuring Misogyny in Natural Language Generation: Preliminary Results from a Case Study on two Reddit Communities

  • paper_url: http://arxiv.org/abs/2312.03330
  • repo_url: None
  • paper_authors: Aaron J. Snoswell, Lucinda Nelson, Hao Xue, Flora D. Salim, Nicolas Suzor, Jean Burgess
  • for: This paper concerns measuring misogyny in natural language generation, and in particular the shortcomings of generic 'toxicity' classifiers for identifying such harmful language.
  • methods: Data from two well-characterised 'Incel' communities on Reddit, differing primarily in their degrees of misogyny, is used to build two training corpora on which two language models are fine-tuned. An open-source 'toxicity' classifier is then used to evaluate the generations of the two models.
  • results: The generic 'toxicity' classifier is unable to distinguish meaningfully between the two models' generations, whereas a gender-specific lexicon proposed by feminist subject-matter experts is sensitive enough to reveal the known differences between the communities. These preliminary findings highlight the limitations of a generic approach to evaluating harms and the need for careful benchmark design and selection in natural language evaluation.
    Abstract Generic `toxicity' classifiers continue to be used for evaluating the potential for harm in natural language generation, despite mounting evidence of their shortcomings. We consider the challenge of measuring misogyny in natural language generation, and argue that generic `toxicity' classifiers are inadequate for this task. We use data from two well-characterised `Incel' communities on Reddit that differ primarily in their degrees of misogyny to construct a pair of training corpora which we use to fine-tune two language models. We show that an open source `toxicity' classifier is unable to distinguish meaningfully between generations from these models. We contrast this with a misogyny-specific lexicon recently proposed by feminist subject-matter experts, demonstrating that, despite the limitations of simple lexicon-based approaches, this shows promise as a benchmark to evaluate language models for misogyny, and that it is sensitive enough to reveal the known differences in these Reddit communities. Our preliminary findings highlight the limitations of a generic approach to evaluating harms, and further emphasise the need for careful benchmark design and selection in natural language evaluation.

Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition and Phoneme to Grapheme Translation

  • paper_url: http://arxiv.org/abs/2312.03312
  • repo_url: None
  • paper_authors: Wonjun Lee, Gary Geunbae Lee, Yunsu Kim
  • for: This work aims to improve two-pass cross-lingual transfer learning (CLTL) for speech recognition in low-resource languages.
  • methods: Two stages are optimized: the phoneme recognition model and the phoneme-to-grapheme translation model. Phoneme vocabulary coverage is improved by merging phonemes based on shared articulatory characteristics, and a global phoneme noise generator injects realistic ASR noise during phoneme-to-grapheme training to reduce error propagation.
  • results: Experiments on the CommonVoice 12.0 dataset show significant Word Error Rate (WER) reductions for low-resource languages, demonstrating the effectiveness of the approach and its potential for improving two-pass ASR systems and cross-lingual transfer learning.
    Abstract This research optimizes two-pass cross-lingual transfer learning in low-resource languages by enhancing phoneme recognition and phoneme-to-grapheme translation models. Our approach optimizes these two stages to improve speech recognition across languages. We optimize phoneme vocabulary coverage by merging phonemes based on shared articulatory characteristics, thus improving recognition accuracy. Additionally, we introduce a global phoneme noise generator for realistic ASR noise during phoneme-to-grapheme training to reduce error propagation. Experiments on the CommonVoice 12.0 dataset show significant reductions in Word Error Rate (WER) for low-resource languages, highlighting the effectiveness of our approach. This research contributes to the advancements of two-pass ASR systems in low-resource languages, offering the potential for improved cross-lingual transfer learning.
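Below is a minimal sketch of a global phoneme noise generator of the kind described above: random substitutions biased toward articulatorily similar phonemes, plus deletions and insertions, applied to phoneme sequences during phoneme-to-grapheme training. The confusion sets and noise rates are illustrative assumptions.

```python
import random

# Illustrative confusion sets grouping phonemes with shared articulation.
CONFUSABLE = {"p": ["b"], "b": ["p"], "t": ["d"], "d": ["t"],
              "s": ["z", "sh"], "z": ["s"], "k": ["g"], "g": ["k"]}

def add_phoneme_noise(phonemes, p_sub=0.05, p_del=0.02, p_ins=0.02, rng=random):
    noisy = []
    for ph in phonemes:
        r = rng.random()
        if r < p_del:
            continue                                    # simulate a deletion
        if r < p_del + p_sub and ph in CONFUSABLE:
            ph = rng.choice(CONFUSABLE[ph])             # similar-sound swap
        noisy.append(ph)
        if rng.random() < p_ins:
            noisy.append(rng.choice(list(CONFUSABLE)))  # spurious insertion
    return noisy

random.seed(0)
print(add_phoneme_noise(["h", "e", "l", "ou", "t", "s"]))
```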

  • paper_url: http://arxiv.org/abs/2312.03217
  • repo_url: https://github.com/jacklinedesouza/STRATEGIES-OF-DIGITAL-MARKETING-AND-CONTENT-MARKETING
  • paper_authors: Haixun Wang, Taesik Na
  • for: This paper proposes a new approach to e-commerce search and recommendation that makes better use of unstructured data such as customer reviews and web articles.
  • methods: Instead of converting unstructured data into structured data through information extraction, the paper proposes converting structured data (product inventory, catalogs, taxonomies, etc.) into textual data that can be integrated into the corpus used to train LLMs; search and recommendation are then performed through a Q/A mechanism over the LLM.
  • results: This approach makes better use of unstructured data and improves the accuracy and effectiveness of e-commerce search and recommendation.
    Abstract E-commerce search and recommendation usually operate on structured data such as product catalogs and taxonomies. However, creating better search and recommendation systems often requires a large variety of unstructured data including customer reviews and articles on the web. Traditionally, the solution has always been converting unstructured data into structured data through information extraction, and conducting search over the structured data. However, this is a costly approach that often has low quality. In this paper, we envision a solution that does entirely the opposite. Instead of converting unstructured data (web pages, customer reviews, etc) to structured data, we instead convert structured data (product inventory, catalogs, taxonomies, etc) into textual data, which can be easily integrated into the text corpus that trains LLMs. Then, search and recommendation can be performed through a Q/A mechanism through an LLM instead of using traditional information retrieval methods over structured data.
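A minimal sketch of the structured-to-text direction the paper envisions is shown below, rendering a catalog record as sentences that can join an LLM's text corpus; the record schema and templates are illustrative.

```python
def record_to_text(product: dict) -> str:
    """Render a catalog record as natural-language sentences that can be
    appended to an LLM training/retrieval corpus."""
    parts = [f"{product['name']} is a {product['category']} product."]
    if "price" in product:
        parts.append(f"It costs ${product['price']:.2f}.")
    for attr, value in product.get("attributes", {}).items():
        parts.append(f"Its {attr} is {value}.")
    if product.get("in_stock", True):
        parts.append("It is currently in stock.")
    return " ".join(parts)

product = {"name": "TrailRunner 2", "category": "running shoe", "price": 89.99,
           "attributes": {"weight": "240 g", "drop": "6 mm"}, "in_stock": True}
print(record_to_text(product))
```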

Detecting Rumor Veracity with Only Textual Information by Double-Channel Structure

  • paper_url: http://arxiv.org/abs/2312.03195
  • repo_url: None
  • paper_authors: Alex Kim, Sangwon Yoon
  • for: This paper proposes a double-channel structure for determining the ex-ante veracity of rumors on social media.
  • methods: Two components are used: a lie detection algorithm applied to informed rumors, and a thread-reply agreement detection algorithm applied to uninformed rumors, after each text is first assigned to a certain (informed) or uncertain (uninformed) category.
  • results: On the SemEval 2019 Task 7 dataset for ex-ante threefold classification (true, false, or unverifiable) of social media rumors, the model achieves a macro-F1 score of 0.4027, outperforming all baseline models and the second-place winner (Gorrell et al., 2019). The double-channel structure is also shown to outperform single-channel structures that apply either lie detection or agreement detection to all posts.
    Abstract Kyle (1985) proposes two types of rumors: informed rumors which are based on some private information and uninformed rumors which are not based on any information (i.e. bluffing). Also, prior studies find that when people have credible source of information, they are likely to use a more confident textual tone in their spreading of rumors. Motivated by these theoretical findings, we propose a double-channel structure to determine the ex-ante veracity of rumors on social media. Our ultimate goal is to classify each rumor into true, false, or unverifiable category. We first assign each text into either certain (informed rumor) or uncertain (uninformed rumor) category. Then, we apply lie detection algorithm to informed rumors and thread-reply agreement detection algorithm to uninformed rumors. Using the dataset of SemEval 2019 Task 7, which requires ex-ante threefold classification (true, false, or unverifiable) of social media rumors, our model yields a macro-F1 score of 0.4027, outperforming all the baseline models and the second-place winner (Gorrell et al., 2019). Furthermore, we empirically validate that the double-channel structure outperforms single-channel structures which use either lie detection or agreement detection algorithm to all posts.
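A toy sketch of the double-channel routing is given below; the three components are heuristic stand-ins for the trained certainty classifier, lie detector, and thread-reply agreement detector used in the paper.

```python
HEDGES = {"maybe", "perhaps", "allegedly", "reportedly", "might"}

def certainty_clf(text):
    """Toy stand-in: hedged wording suggests an uninformed rumor."""
    return "uncertain" if HEDGES & set(text.lower().split()) else "certain"

def lie_detector(text):
    return "true"        # placeholder for a trained deception classifier

def agreement_detector(text, replies):
    support = sum("agree" in r.lower() for r in replies)
    deny = sum("false" in r.lower() or "deny" in r.lower() for r in replies)
    if support > deny:
        return "true"
    return "false" if deny > support else "unverifiable"

def predict_veracity(text, replies):
    # Double-channel routing: informed rumors -> lie detection,
    # uninformed rumors -> thread-reply agreement detection.
    if certainty_clf(text) == "certain":
        return lie_detector(text)
    return agreement_detector(text, replies)

print(predict_veracity("The CEO allegedly resigned.", ["I agree", "That's false"]))
```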

Corporate Bankruptcy Prediction with Domain-Adapted BERT

  • paper_url: http://arxiv.org/abs/2312.03194
  • repo_url: None
  • paper_authors: Alex Kim, Sangwon Yoon
  • for: This study applies BERT to corporate disclosure data to predict impending bankruptcies. Whereas prior research focuses on developing more sophisticated prediction methodologies with financial variables, this study focuses on improving the quality of the input data.
  • methods: A BERT model performs sentiment analysis on MD&A disclosures; BERT outperforms dictionary-based and Word2Vec-based predictions in terms of adjusted R-square in logistic regression, k-nearest neighbor (kNN-5), and linear kernel support vector machine (SVM).
  • results: Instead of pre-training BERT from scratch, self-learning with confidence-based filtering is applied to 10-K corporate disclosure data, achieving an accuracy rate of 91.56% and demonstrating that the domain adaptation procedure brings a significant improvement in prediction accuracy.
    Abstract This study performs BERT-based analysis, which is a representative contextualized language model, on corporate disclosure data to predict impending bankruptcies. Prior literature on bankruptcy prediction mainly focuses on developing more sophisticated prediction methodologies with financial variables. However, in our study, we focus on improving the quality of input dataset. Specifically, we employ BERT model to perform sentiment analysis on MD&A disclosures. We show that BERT outperforms dictionary-based predictions and Word2Vec-based predictions in terms of adjusted R-square in logistic regression, k-nearest neighbor (kNN-5), and linear kernel support vector machine (SVM). Further, instead of pre-training the BERT model from scratch, we apply self-learning with confidence-based filtering to corporate disclosure data (10-K). We achieve the accuracy rate of 91.56% and demonstrate that the domain adaptation procedure brings a significant improvement in prediction accuracy.
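A minimal sketch of self-learning with confidence-based filtering follows; the paper applies this to BERT over 10-K text, while the sketch uses a generic scikit-learn classifier and synthetic features as stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_learning_round(model, X_labeled, y_labeled, X_unlabeled, threshold=0.9):
    """One round of self-learning with confidence-based filtering:
    pseudo-label only the unlabeled examples the model is confident about,
    then retrain on the enlarged set. `model` follows the sklearn API."""
    model.fit(X_labeled, y_labeled)
    proba = model.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) >= threshold
    pseudo_y = proba.argmax(axis=1)[confident]
    X_aug = np.vstack([X_labeled, X_unlabeled[confident]])
    y_aug = np.concatenate([y_labeled, pseudo_y])
    model.fit(X_aug, y_aug)
    return model, X_unlabeled[~confident]

rng = np.random.default_rng(0)
X_l, y_l = rng.normal(size=(50, 8)), rng.integers(0, 2, 50)
X_u = rng.normal(size=(200, 8))
model, remaining = self_learning_round(LogisticRegression(), X_l, y_l, X_u)
print(len(remaining), "examples left unlabeled")
```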

cs.LG - 2023-12-06

Understanding the Role of Optimization in Double Descent

  • paper_url: http://arxiv.org/abs/2312.03951
  • repo_url: None
  • paper_authors: Chris Yuhao Liu, Jeffrey Flanigan
  • for: This work studies model-wise double descent, where the test error peaks and then decreases as model size increases.
  • methods: The phenomenon is examined from an optimization perspective, through controlled experiments on random feature models and two-layer neural networks that vary initialization, normalization, batch size, learning rate, and the optimization algorithm.
  • results: These disparate factors are unified from the optimization viewpoint: they directly affect the condition number of the optimization problem or the optimizer, and hence the final minimum found, raising or lowering the double descent peak. Model-wise double descent is observed if and only if the optimizer can find a sufficiently low-loss minimum.
    Abstract The phenomenon of model-wise double descent, where the test error peaks and then reduces as the model size increases, is an interesting topic that has attracted the attention of researchers due to the striking observed gap between theory and practice \citep{Belkin2018ReconcilingMM}. Additionally, while double descent has been observed in various tasks and architectures, the peak of double descent can sometimes be noticeably absent or diminished, even without explicit regularization, such as weight decay and early stopping. In this paper, we investigate this intriguing phenomenon from the optimization perspective and propose a simple optimization-based explanation for why double descent sometimes occurs weakly or not at all. To the best of our knowledge, we are the first to demonstrate that many disparate factors contributing to model-wise double descent (initialization, normalization, batch size, learning rate, optimization algorithm) are unified from the viewpoint of optimization: model-wise double descent is observed if and only if the optimizer can find a sufficiently low-loss minimum. These factors directly affect the condition number of the optimization problem or the optimizer and thus affect the final minimum found by the optimizer, reducing or increasing the height of the double descent peak. We conduct a series of controlled experiments on random feature models and two-layer neural networks under various optimization settings, demonstrating this optimization-based unified view. Our results suggest the following implication: Double descent is unlikely to be a problem for real-world machine learning setups. Additionally, our results help explain the gap between weak double descent peaks in practice and strong peaks observable in carefully designed setups.

A Scalable and Generalizable Pathloss Map Prediction

  • paper_url: http://arxiv.org/abs/2312.03950
  • repo_url: https://github.com/abman23/pmnet
  • paper_authors: Ju-Hyung Lee, Andreas F. Molisch
  • for: Predicting the pathloss of wireless networks, i.e., estimating signal attenuation from geographical/morphological/building maps.
  • methods: A data-driven, model-free approach that predicts pathloss maps from map data after training on a limited amount of ray tracing (or channel measurement) data.
  • results: Once trained, PMNet predicts pathloss over location with high accuracy (an RMSE level of $10^{-2}$) within a few milliseconds; with transfer learning, it adapts to a new network scenario quickly (5.6x faster training) and efficiently (using 4.5x less data) while retaining accuracy.
    Abstract Large-scale channel prediction, i.e., estimation of the pathloss from geographical/morphological/building maps, is an essential component of wireless network planning. Ray tracing (RT)-based methods have been widely used for many years, but they require significant computational effort that may become prohibitive with the increased network densification and/or use of higher frequencies in B5G/6G systems. In this paper, we propose a data-driven, model-free pathloss map prediction (PMP) method, called PMNet. PMNet uses a supervised learning approach: it is trained on a limited amount of RT (or channel measurement) data and map data. Once trained, PMNet can predict pathloss over location with high accuracy (an RMSE level of $10^{-2}$) in a few milliseconds. We further extend PMNet by employing transfer learning (TL). TL allows PMNet to learn a new network scenario quickly (x5.6 faster training) and efficiently (using x4.5 less data) by transferring knowledge from a pre-trained model, while retaining accuracy. Our results demonstrate that PMNet is a scalable and generalizable ML-based PMP method, showing its potential to be used in several network optimization applications.
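For orientation, here is a minimal PyTorch sketch of a map-to-map convolutional predictor of the kind PMNet instantiates; the real architecture lives in the linked repository, and the layer sizes and two input channels here are illustrative.

```python
import torch
import torch.nn as nn

class TinyPathlossNet(nn.Module):
    """Maps an input map tensor (e.g., building/terrain channels) to a
    per-pixel pathloss map. A stand-in for the full PMNet architecture."""
    def __init__(self, in_channels=2, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, 1, 1),              # per-pixel pathloss (scaled dB)
        )

    def forward(self, x):
        return self.net(x)

model = TinyPathlossNet()
maps = torch.randn(4, 2, 64, 64)                 # batch of building+terrain maps
target = torch.randn(4, 1, 64, 64)               # ray-tracing pathloss labels
loss = nn.functional.mse_loss(model(maps), target)
loss.backward()
print(loss.item())
```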

  • paper_url: http://arxiv.org/abs/2312.03940
  • repo_url: https://github.com/yushangdi/pecann-dpc
  • paper_authors: Shangdi Yu, Joshua Engels, Yihao Huang, Julian Shun
  • for: This paper studies density-based clustering of point sets, in particular variants of density peaks clustering (DPC), with the goal of handling the large, high-dimensional datasets that are prevalent in practice.
  • methods: It proposes PECANN, a unified framework that abstracts out several key steps common to density peaks clustering algorithms. One such step is finding nearest neighbors that satisfy a predicate function, for which an efficient predicate search based on graph-based approximate nearest neighbor search (ANNS) and a doubling search technique are proposed, applicable to many existing graph-based ANNS algorithms.
  • results: Five clustering algorithms implemented with PECANN are evaluated on synthetic and real-world datasets with up to 1.28 million points and up to 1024 dimensions on a 30-core machine with two-way hyper-threading. The best algorithm is 45x-734x faster than the state-of-the-art sequential FASTDP algorithm for high-dimensional density peaks clustering while achieving competitive ARI scores, and two orders of magnitude faster than the state-of-the-art parallel DPC-based algorithm optimized for low dimensions. This is also the first evaluation of DPC variants on large high-dimensional real-world image and text embedding datasets.
    Abstract This paper studies density-based clustering of point sets. These methods use dense regions of points to detect clusters of arbitrary shapes. In particular, we study variants of density peaks clustering, a popular type of algorithm that has been shown to work well in practice. Our goal is to cluster large high-dimensional datasets, which are prevalent in practice. Prior solutions are either sequential, and cannot scale to large data, or are specialized for low-dimensional data. This paper unifies the different variants of density peaks clustering into a single framework, PECANN, by abstracting out several key steps common to this class of algorithms. One such key step is to find nearest neighbors that satisfy a predicate function, and one of the main contributions of this paper is an efficient way to do this predicate search using graph-based approximate nearest neighbor search (ANNS). To provide ample parallelism, we propose a doubling search technique that enables points to find an approximate nearest neighbor satisfying the predicate in a small number of rounds. Our technique can be applied to many existing graph-based ANNS algorithms, which can all be plugged into PECANN. We implement five clustering algorithms with PECANN and evaluate them on synthetic and real-world datasets with up to 1.28 million points and up to 1024 dimensions on a 30-core machine with two-way hyper-threading. Compared to the state-of-the-art FASTDP algorithm for high-dimensional density peaks clustering, which is sequential, our best algorithm is 45x-734x faster while achieving competitive ARI scores. Compared to the state-of-the-art parallel DPC-based algorithm, which is optimized for low dimensions, we show that PECANN is two orders of magnitude faster. As far as we know, our work is the first to evaluate DPC variants on large high-dimensional real-world image and text embedding datasets.
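Below is a minimal sketch of the doubling predicate search: for density peaks clustering the predicate is "has higher density", and the search doubles $k$ until a satisfying neighbor is found. Brute-force kNN stands in for the graph-based ANNS index here, so this is illustrative only.

```python
import numpy as np

def predicate_search(query_idx, points, density, k0=8, k_max=1024):
    """Doubling search for the nearest neighbor satisfying a predicate.
    For density peaks clustering the predicate is 'has higher density',
    yielding each point's dependent point."""
    q = points[query_idx]
    k = k0
    while k <= k_max:
        dists = np.linalg.norm(points - q, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]          # skip the point itself
        valid = [j for j in neighbors if density[j] > density[query_idx]]
        if valid:
            return min(valid, key=lambda j: dists[j])   # closest valid neighbor
        k *= 2                                          # double and retry
    return None                                         # e.g., a global peak

rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 16))
dens = rng.random(500)
print(predicate_search(0, pts, dens))
```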
    摘要 To address these limitations, this paper introduces PECANN, a unified framework that abstracts common steps among DPC algorithms and provides an efficient predicate search using graph-based approximate nearest neighbor search (ANNS). This enables points to find an approximate nearest neighbor satisfying the predicate in a small number of rounds, allowing for ample parallelism.The paper evaluates five clustering algorithms with PECANN on synthetic and real-world datasets with up to 1.28 million points and up to 1024 dimensions on a 30-core machine with two-way hyper-threading. The results show that PECANN is significantly faster than the state-of-the-art FASTDP algorithm for high-dimensional DPC clustering, achieving competitive ARI scores. PECANN is also two orders of magnitude faster than the state-of-the-art parallel DPC-based algorithm, which is optimized for low dimensions.Moreover, this paper is the first to evaluate DPC variants on large, high-dimensional real-world image and text embedding datasets, demonstrating the effectiveness of PECANN in practical applications. Overall, PECANN provides a scalable and efficient solution for clustering large, high-dimensional datasets using density-based methods.

Adaptive Weighted Co-Learning for Cross-Domain Few-Shot Learning

  • paper_url: http://arxiv.org/abs/2312.03928
  • repo_url: None
  • paper_authors: Abdullah Alchihabi, Marzi Heidari, Yuhong Guo
  • for: Addressing the challenging adaptation problem in cross-domain few-shot learning (CDFSL) tasks, where there are only a few labeled instances available for the target prediction task and a significant domain shift between the well-annotated source domain and the target domain.
  • methods: Propose a simple Adaptive Weighted Co-Learning (AWCoL) method that adapts two independently trained source prototypical classification models to the target task in a weighted co-learning manner. The method deploys a weighted moving average prediction strategy and conducts adaptive co-learning by jointly fine-tuning the two models based on the pseudo-labels and instance weights produced from the predictions.
  • results: Produce state-of-the-art CDFSL performance on multiple benchmark datasets through comprehensive experiments.
    Abstract Due to the availability of only a few labeled instances for the novel target prediction task and the significant domain shift between the well annotated source domain and the target domain, cross-domain few-shot learning (CDFSL) induces a very challenging adaptation problem. In this paper, we propose a simple Adaptive Weighted Co-Learning (AWCoL) method to address the CDFSL challenge by adapting two independently trained source prototypical classification models to the target task in a weighted co-learning manner. The proposed method deploys a weighted moving average prediction strategy to generate probabilistic predictions from each model, and then conducts adaptive co-learning by jointly fine-tuning the two models in an alternating manner based on the pseudo-labels and instance weights produced from the predictions. Moreover, a negative pseudo-labeling regularizer is further deployed to improve the fine-tuning process by penalizing false predictions. Comprehensive experiments are conducted on multiple benchmark datasets and the empirical results demonstrate that the proposed method produces state-of-the-art CDFSL performance.

Improving Gradient-guided Nested Sampling for Posterior Inference

  • paper_url: http://arxiv.org/abs/2312.03911
  • repo_url: https://github.com/pablo-lemos/ggns
  • paper_authors: Pablo Lemos, Nikolay Malkin, Will Handley, Yoshua Bengio, Yashar Hezaveh, Laurence Perreault-Levasseur
  • for: This paper develops a performant, general-purpose gradient-guided nested sampling algorithm, ${\tt GGNS}$, for posterior inference.
  • methods: The algorithm combines the state of the art in differentiable programming, Hamiltonian slice sampling, clustering, mode separation, dynamic nested sampling, and parallelization.
  • results: ${\tt GGNS}$ scales well with dimensionality and performs competitively on a variety of synthetic and real-world problems. Combining nested sampling with generative flow networks additionally yields large amounts of high-quality posterior samples, leading to faster mode discovery and more accurate estimates of the partition function.
    Abstract We present a performant, general-purpose gradient-guided nested sampling algorithm, ${\tt GGNS}$, combining the state of the art in differentiable programming, Hamiltonian slice sampling, clustering, mode separation, dynamic nested sampling, and parallelization. This unique combination allows ${\tt GGNS}$ to scale well with dimensionality and perform competitively on a variety of synthetic and real-world problems. We also show the potential of combining nested sampling with generative flow networks to obtain large amounts of high-quality samples from the posterior distribution. This combination leads to faster mode discovery and more accurate estimates of the partition function.
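For orientation, here is a minimal sketch of the vanilla nested sampling loop that GGNS accelerates; naive rejection sampling replaces the gradient-guided Hamiltonian slice moves (and only works in toy dimensions), so this is illustrative rather than the authors' algorithm.

```python
import numpy as np

def nested_sampling(log_like, prior_sample, n_live=100, n_iter=600, rng=None):
    """Vanilla nested sampling: repeatedly replace the worst live point with
    a prior draw above the current likelihood threshold, accumulating the
    evidence Z = sum_i L_i * (X_{i-1} - X_i) with X_i ~ exp(-i / n_live)."""
    rng = rng or np.random.default_rng(0)
    live = prior_sample(n_live, rng)
    live_ll = np.array([log_like(p) for p in live])
    log_z, log_x = -np.inf, 0.0
    for i in range(n_iter):
        worst = int(np.argmin(live_ll))
        log_x_new = -(i + 1) / n_live
        log_w = live_ll[worst] + np.log(np.exp(log_x) - np.exp(log_x_new))
        log_z = np.logaddexp(log_z, log_w)
        log_x = log_x_new
        while True:                          # naive rejection step; GGNS
            cand = prior_sample(1, rng)[0]   # instead uses gradient moves
            if log_like(cand) > live_ll[worst]:
                live[worst], live_ll[worst] = cand, log_like(cand)
                break
    lmax = live_ll.max()                     # remaining live-point contribution
    log_z = np.logaddexp(log_z, log_x + lmax +
                         np.log(np.mean(np.exp(live_ll - lmax))))
    return log_z

log_like = lambda x: -0.5 * np.sum((x - 0.5) ** 2) / 0.01   # sigma = 0.1
prior_sample = lambda n, rng: rng.random((n, 2))            # unit-square prior
print("log Z ~", nested_sampling(log_like, prior_sample))   # analytic ~ -2.77
```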

Adaptive Dependency Learning Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2312.03903
  • repo_url: https://github.com/abisheksriramulu/adlgnn
  • paper_authors: Abishek Sriramulu, Nicolas Fourrier, Christoph Bergmeir
  • for: This paper provides a hybrid approach combining neural networks and statistical structure learning models to self-learn the dependencies in multivariate time series and construct a dynamically changing dependency graph, enabling GNNs to be used for multivariate forecasting even when no well-defined dependency graph exists.
  • methods: The proposed approach couples neural networks with statistical structure learning, bringing in causal semantics to determine the dependencies among the series.
  • results: Experiments on real-world benchmark datasets without a pre-defined dependency graph show significantly improved performance over traditional methods, with lower forecasting errors and better capture of the complex relationships in multivariate time series.
    Abstract Graph Neural Networks (GNN) have recently gained popularity in the forecasting domain due to their ability to model complex spatial and temporal patterns in tasks such as traffic forecasting and region-based demand forecasting. Most of these methods require a predefined graph as input, whereas in real-life multivariate time series problems, a well-predefined dependency graph rarely exists. This requirement makes it harder for GNNs to be utilised widely for multivariate forecasting problems in other domains such as retail or energy. In this paper, we propose a hybrid approach combining neural networks and statistical structure learning models to self-learn the dependencies and construct a dynamically changing dependency graph from multivariate data aiming to enable the use of GNNs for multivariate forecasting even when a well-defined graph does not exist. The statistical structure modeling in conjunction with neural networks provides a well-principled and efficient approach by bringing in causal semantics to determine dependencies among the series. Finally, we demonstrate significantly improved performance using our proposed approach on real-world benchmark datasets without a pre-defined dependency graph.

HLoOP – Hyperbolic 2-space Local Outlier Probabilities

  • paper_url: http://arxiv.org/abs/2312.03895
  • repo_url: None
  • paper_authors: Clémence Allietta, Jean-Philippe Condomines, Jean-Yves Tourneret, Emmanuel Lochin
  • for: This work proposes a simple framework for detecting local outliers in datasets grounded in hyperbolic 2-space, for further downstream processing.
  • methods: The method computes the Riemannian distance of a data point to its nearest neighbors, modelled with a Gaussian probability density function expressed in hyperbolic space via a Gaussian cumulative distribution defined in that space.
  • results: The HLoOP algorithm is tested on the WordNet dataset, yielding promising results.
    Abstract Hyperbolic geometry has recently garnered considerable attention in machine learning due to its capacity to embed hierarchical graph structures with low distortions for further downstream processing. This paper introduces a simple framework to detect local outliers for datasets grounded in hyperbolic 2-space referred to as HLoOP (Hyperbolic Local Outlier Probability). Within a Euclidean space, well-known techniques for local outlier detection are based on the Local Outlier Factor (LOF) and its variant, the LoOP (Local Outlier Probability), which incorporates probabilistic concepts to model the outlier level of a data vector. The developed HLoOP combines the ideas of nearest-neighbor search and density-based outlier scoring with a probabilistic, statistically oriented approach. Therefore, the method consists in computing the Riemannian distance of a data point to its nearest neighbors following a Gaussian probability density function expressed in a hyperbolic space. This is achieved by defining a Gaussian cumulative distribution in this space. The HLoOP algorithm is tested on the WordNet dataset yielding promising results. Code and data will be made available on request for reproducibility.
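A minimal sketch combining the two ingredients named above, the Poincaré-disk geodesic distance and a LoOP-style outlier probability computed from it, is given below; the kNN size and normalization constant lambda are illustrative choices, not the paper's configuration.

```python
import numpy as np
from math import erf

def poincare_dist(u, v):
    """Geodesic distance in the Poincare disk model of hyperbolic 2-space."""
    norm2 = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u**2)) * (1 - np.sum(v**2))
    return np.arccosh(1 + 2 * norm2 / denom)

def hloop(points, k=5, lam=3.0):
    n = len(points)
    D = np.array([[poincare_dist(points[i], points[j]) for j in range(n)]
                  for i in range(n)])
    knn = np.argsort(D, axis=1)[:, 1:k + 1]             # skip self
    sigma = np.sqrt((D[np.arange(n)[:, None], knn] ** 2).mean(axis=1))
    plof = sigma / sigma[knn].mean(axis=1) - 1.0        # probabilistic LOF
    nplof = lam * np.sqrt(np.mean(plof ** 2))
    return np.array([max(0.0, erf(p / (nplof * np.sqrt(2)))) for p in plof])

rng = np.random.default_rng(0)
pts = rng.normal(scale=0.05, size=(60, 2)) + 0.3        # cluster inside the disk
pts = np.vstack([pts, [[-0.8, -0.1]]])                  # one far-away point
scores = hloop(pts)
print("outlier score of the injected point:", scores[-1].round(3))
```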

Evaluation of Infrastructure-based Warning System on Driving Behaviors-A Roundabout Study

  • paper_url: http://arxiv.org/abs/2312.03891
  • repo_url: None
  • paper_authors: Cong Zhang, Chi Tian, Tianfang Han, Hang Li, Yiheng Feng, Yunfeng Chen, Robert W. Proctor, Jiansong Zhang
  • for: This paper investigates how infrastructure-based warnings, sent to nearby travelers through V2X communication, influence driving behaviors and improve roundabout safety.
  • methods: A driving-simulator study on a co-simulation platform integrating SUMO and Webots, in which a real-world roundabout in Ann Arbor, Michigan was modelled as the study area; 36 participants navigated merging scenarios under three danger levels and three collision warning designs.
  • results: Advanced warnings significantly enhance safety, with earlier warnings enabling smoother driver responses and fewer abrupt decelerations. A personalized intention prediction model for drivers' stop-or-go decisions when the warning is displayed was also developed, with XGBoost achieving the highest accuracy (a precision rate of 95.56% and a recall rate of 97.73%).
    Abstract Smart intersections have the potential to improve road safety with sensing, communication, and edge computing technologies. Perception sensors installed at a smart intersection can monitor the traffic environment in real time and send infrastructure-based warnings to nearby travelers through V2X communication. This paper investigated how infrastructure-based warnings can influence driving behaviors and improve roundabout safety through a driving-simulator study - a challenging driving scenario for human drivers. A co-simulation platform integrating Simulation of Urban Mobility (SUMO) and Webots was developed to serve as the driving simulator. A real-world roundabout in Ann Arbor, Michigan was built in the co-simulation platform as the study area, and the merging scenarios were investigated. 36 participants were recruited and asked to navigate the roundabout under three danger levels (e.g., low, medium, high) and three collision warning designs (e.g., no warning, warning issued 1 second in advance, warning issued 2 seconds in advance). Results indicated that advanced warnings can significantly enhance safety by minimizing potential risks compared to scenarios without warnings. Earlier warnings enabled smoother driver responses and reduced abrupt decelerations. In addition, a personalized intention prediction model was developed to predict drivers' stop-or-go decisions when the warning is displayed. Among all tested machine learning models, the XGBoost model achieved the highest prediction accuracy with a precision rate of 95.56% and a recall rate of 97.73%.
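A minimal sketch of the stop-or-go prediction step with XGBoost follows; the feature set and the synthetic labels below are hypothetical illustrations, not the study's data.

```python
import numpy as np
from xgboost import XGBClassifier

# Hypothetical features at warning onset: speed (m/s), distance to the
# yield line (m), time-to-arrival of conflicting traffic (s), danger level.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 1] + 0.5 * X[:, 2] > X[:, 0]).astype(int)   # toy stop(1)/go(0) rule

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X[:400], y[:400])
pred = model.predict(X[400:])
print("held-out accuracy:", (pred == y[400:]).mean())
```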
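As a rough illustration of the intention-prediction step, the sketch below trains an XGBoost stop-or-go classifier and reports precision and recall, assuming the xgboost and scikit-learn packages. The features, data, and hyperparameters are invented stand-ins, not the paper's.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))   # e.g. speed, gap to conflict, TTC, warning lead time
y = (X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(f"precision={precision_score(y_te, pred):.4f}  recall={recall_score(y_te, pred):.4f}")
```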

Adapting Newton’s Method to Neural Networks through a Summary of Higher-Order Derivatives

  • paper_url: http://arxiv.org/abs/2312.03885
  • repo_url: None
  • paper_authors: Pierre Wolinski
  • for: A gradient-based optimization method applied to a function $\mathcal{L}$ of a vector of variables $\boldsymbol{\theta}$, in the case where $\boldsymbol{\theta}$ is represented as a tuple of tensors $(\mathbf{T}_1, \ldots, \mathbf{T}_S)$. This framework encompasses many common use cases, such as training neural networks.
  • methods: A computationally inexpensive technique, based on automatic differentiation and computational tricks, for obtaining higher-order information about $\mathcal{L}$, especially about the interactions between the tensors $\mathbf{T}_s$. Used at order 2, this technique serves to build a second-order optimization method.
  • results: The resulting second-order method leverages the partition of $\boldsymbol{\theta}$ into the tensors $(\mathbf{T}_1, \ldots, \mathbf{T}_S)$ and is suitable for training deep neural networks of various architectures. It requires neither the computation of the Hessian of $\mathcal{L}$ nor any approximation of it and, in contrast to many existing practical second-order methods that perform a diagonal or block-diagonal approximation of the Hessian or its inverse, it does not neglect interactions between layers. Finally, the coarseness of the partition can be tuned to recover well-known optimization methods: the coarsest case corresponds to Cauchy's steepest descent, the finest to the usual Newton's method.
    Abstract We consider a gradient-based optimization method applied to a function $\mathcal{L}$ of a vector of variables $\boldsymbol{\theta}$, in the case where $\boldsymbol{\theta}$ is represented as a tuple of tensors $(\mathbf{T}_1, \cdots, \mathbf{T}_S)$. This framework encompasses many common use-cases, such as training neural networks by gradient descent. First, we propose a computationally inexpensive technique providing higher-order information on $\mathcal{L}$, especially about the interactions between the tensors $\mathbf{T}_s$, based on automatic differentiation and computational tricks. Second, we use this technique at order 2 to build a second-order optimization method which is suitable, among other things, for training deep neural networks of various architectures. This second-order method leverages the partition structure of $\boldsymbol{\theta}$ into tensors $(\mathbf{T}_1, \cdots, \mathbf{T}_S)$, in such a way that it requires neither the computation of the Hessian of $\mathcal{L}$ according to $\boldsymbol{\theta}$, nor any approximation of it. The key part consists in computing a smaller matrix interpretable as a "Hessian according to the partition", which can be computed exactly and efficiently. In contrast to many existing practical second-order methods used in neural networks, which perform a diagonal or block-diagonal approximation of the Hessian or its inverse, the method we propose does not neglect interactions between layers. Finally, we can tune the coarseness of the partition to recover well-known optimization methods: the coarsest case corresponds to Cauchy's steepest descent method, the finest case corresponds to the usual Newton's method.
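The sketch below conveys the flavor of a "Hessian according to the partition", though it is not the paper's exact construction: with theta split into tensors T_1..T_S and one direction v_s per tensor (here the gradient itself), an S x S matrix of directional curvatures v_s^T H_st v_t is assembled from S Hessian-vector products via double backpropagation, so the full Hessian is never formed.

```python
import torch

def partition_hessian(loss, params):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [g.detach() for g in grads]          # one direction per tensor block
    S = len(params)
    M = torch.zeros(S, S)
    for t in range(S):
        dot = (grads[t] * v[t]).sum()        # g_t . v_t, still differentiable
        col = torch.autograd.grad(dot, params, retain_graph=True)  # = H[:, block t] v_t
        for s in range(S):
            M[s, t] = (col[s] * v[s]).sum()  # v_s^T H_st v_t
    return M

# Toy usage: a two-tensor "network".
T1 = torch.randn(3, requires_grad=True)
T2 = torch.randn(2, requires_grad=True)
loss = (T1.sum() * T2.prod()) ** 2
print(partition_hessian(loss, [T1, T2]))
```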

Domain constraints improve risk prediction when outcome data is missing

  • paper_url: http://arxiv.org/abs/2312.03878
  • repo_url: None
  • paper_authors: Sidhika Balachandar, Nikhil Garg, Emma Pierson
  • for: Accurately estimating disease risk for both tested and untested patients when human testing decisions censor the outcome data.
  • methods: A Bayesian model class that captures the setting in which test outcomes are observed only for patients a doctor chose to test, estimated with two domain constraints that are plausible in health settings: a prevalence constraint and an expertise constraint.
  • results: Domain constraints improve parameter inference in theory and on synthetic data; in a cancer risk prediction case study, the model's inferred risk predicts cancer diagnoses, its inferred testing policy captures known public health policies, and it can identify suboptimalities in test allocation.
    Abstract Machine learning models are often trained to predict the outcome resulting from a human decision. For example, if a doctor decides to test a patient for disease, will the patient test positive? A challenge is that the human decision censors the outcome data: we only observe test outcomes for patients doctors historically tested. Untested patients, for whom outcomes are unobserved, may differ from tested patients along observed and unobserved dimensions. We propose a Bayesian model class which captures this setting. The purpose of the model is to accurately estimate risk for both tested and untested patients. Estimating this model is challenging due to the wide range of possibilities for untested patients. To address this, we propose two domain constraints which are plausible in health settings: a prevalence constraint, where the overall disease prevalence is known, and an expertise constraint, where the human decision-maker deviates from purely risk-based decision-making only along a constrained feature set. We show theoretically and on synthetic data that domain constraints improve parameter inference. We apply our model to a case study of cancer risk prediction, showing that the model's inferred risk predicts cancer diagnoses, its inferred testing policy captures known public health policies, and it can identify suboptimalities in test allocation. Though our case study is in healthcare, our analysis reveals a general class of domain constraints which can improve model estimation in many settings.
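A toy sketch of the prevalence constraint in isolation (the paper's model is Bayesian and considerably richer): a logistic risk model is fit on tested patients only, with a penalty tying the model's implied prevalence over all patients, tested and untested, to a known overall prevalence. The penalty weight `lam` is an illustrative choice.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
beta_true = np.array([1.0, -0.5, 0.8])
y = (rng.random(500) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)
tested = rng.random(500) < 0.4             # outcomes observed only when tested
known_prev = y.mean()                       # assume overall prevalence is known

def loss(beta, lam=50.0):
    p = np.clip(1 / (1 + np.exp(-X @ beta)), 1e-6, 1 - 1e-6)
    nll = -np.mean(y[tested] * np.log(p[tested]) + (1 - y[tested]) * np.log(1 - p[tested]))
    return nll + lam * (p.mean() - known_prev) ** 2   # prevalence constraint as a penalty

beta_hat = minimize(loss, np.zeros(3)).x
print(beta_hat)
```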

Optimizing $CO_{2}$ Capture in Pressure Swing Adsorption Units: A Deep Neural Network Approach with Optimality Evaluation and Operating Maps for Decision-Making

  • paper_url: http://arxiv.org/abs/2312.03873
  • repo_url: None
  • paper_authors: Carine Menezes Rebello, Idelfonso B. R. Nogueira
  • for: Developing a surrogate optimization methodology for cyclic adsorption processes, in particular Pressure Swing Adsorption units for carbon dioxide ($CO_{2}$) capture.
  • methods: A multiple-input, single-output (MISO) framework comprising two deep neural network (DNN) models that predict key process performance indicators; these models are integrated into an optimization framework that uses particle swarm optimization (PSO) and statistical analysis to generate a comprehensive Pareto front.
  • results: The approach delineates feasible operational regions and the spectrum of optimal decision-making scenarios, verifies the surrogate's reliability against a phenomenological model, identifies the most impactful factors influencing process behavior, and provides a practical operating map that helps operators pinpoint the optimal process location and prioritize specific operational goals.
    Abstract This study presents a methodology for surrogate optimization of cyclic adsorption processes, focusing on enhancing Pressure Swing Adsorption units for carbon dioxide ($CO_{2}$) capture. We developed and implemented a multiple-input, single-output (MISO) framework comprising two deep neural network (DNN) models, predicting key process performance indicators. These models were then integrated into an optimization framework, leveraging particle swarm optimization (PSO) and statistical analysis to generate a comprehensive Pareto front representation. This approach delineated feasible operational regions (FORs) and highlighted the spectrum of optimal decision-making scenarios. A key aspect of our methodology was the evaluation of optimization effectiveness. This was accomplished by testing decision variables derived from the Pareto front against a phenomenological model, affirming the surrogate models' reliability. Subsequently, the study delved into analyzing the feasible operational domains of these decision variables. A detailed correlation map was constructed to elucidate the interplay between these variables, thereby uncovering the most impactful factors influencing process behavior. The study offers a practical, insightful operational map that aids operators in pinpointing the optimal process location and prioritizing specific operational goals.
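To illustrate the optimization layer, here is a minimal particle swarm optimization loop run over a cheap surrogate. The quadratic `surrogate` function below is a stand-in for the paper's trained DNN performance models, and the swarm hyperparameters are common textbook defaults rather than the paper's settings.

```python
import numpy as np

def surrogate(x):                       # stand-in for the DNN, to be maximized
    return -np.sum((x - 0.3) ** 2, axis=1)

def pso(f, dim=2, n=30, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 1, (n, dim))     # positions inside the feasible box
    v = np.zeros_like(x)
    pbest, pval = x.copy(), f(x)
    for _ in range(iters):
        g = pbest[np.argmax(pval)]      # global best so far
        r1, r2 = rng.random((2, n, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, 0, 1)        # keep particles in the feasible region
        val = f(x)
        better = val > pval
        pbest[better], pval[better] = x[better], val[better]
    return pbest[np.argmax(pval)], pval.max()

print(pso(surrogate))
```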

Hidden yet quantifiable: A lower bound for confounding strength using randomized trials

  • paper_url: http://arxiv.org/abs/2312.03871
  • repo_url: https://github.com/jaabmar/confounder-lower-bound
  • paper_authors: Piersilvio De Bartolomeis, Javier Abad, Konstantin Donhauser, Fanny Yang
  • for: Properly evaluating new treatments in clinical practice from observational data, where unobserved confounding can compromise causal conclusions.
  • methods: Leveraging randomized trials to quantify unobserved confounding.
  • results: A novel statistical test detects unobserved confounding with strength above a given threshold and yields an asymptotically valid lower bound on the confounding strength, correctly identifying the presence and absence of unobserved confounding in a real-world setting.
    Abstract In the era of fast-paced precision medicine, observational studies play a major role in properly evaluating new treatments in clinical practice. Yet, unobserved confounding can significantly compromise causal conclusions drawn from non-randomized data. We propose a novel strategy that leverages randomized trials to quantify unobserved confounding. First, we design a statistical test to detect unobserved confounding with strength above a given threshold. Then, we use the test to estimate an asymptotically valid lower bound on the unobserved confounding strength. We evaluate the power and validity of our statistical test on several synthetic and semi-synthetic datasets. Further, we show how our lower bound can correctly identify the absence and presence of unobserved confounding in a real-world setting.

Multi-Group Fairness Evaluation via Conditional Value-at-Risk Testing

  • paper_url: http://arxiv.org/abs/2312.03867
  • repo_url: None
  • paper_authors: Lucas Monteiro Paes, Ananda Theertha Suresh, Alex Beutel, Flavio P. Calmon, Ahmad Beirami
  • for: Evaluating performance disparities of machine learning (ML) models across population groups defined by multiple sensitive attributes (e.g., race, sex, age).
  • methods: A Conditional Value-at-Risk (CVaR)-based test for performance disparities: allowing a small probabilistic slack on the groups over which a model has approximately equal performance reduces the sample complexity needed to detect disparities.
  • results: The sample complexity of the CVaR test is at most upper bounded by the square root of the number of groups, an exponential reduction; when groups are weighted by a specific prior distribution, the Rényi entropy of order $2/3$ of that prior captures the test's sample complexity, and a non-i.i.d. data collection strategy exists whose sample complexity is independent of the number of groups.
    Abstract Machine learning (ML) models used in prediction and classification tasks may display performance disparities across population groups determined by sensitive attributes (e.g., race, sex, age). We consider the problem of evaluating the performance of a fixed ML model across population groups defined by multiple sensitive attributes (e.g., race and sex and age). Here, the sample complexity for estimating the worst-case performance gap across groups (e.g., the largest difference in error rates) increases exponentially with the number of group-denoting sensitive attributes. To address this issue, we propose an approach to test for performance disparities based on Conditional Value-at-Risk (CVaR). By allowing a small probabilistic slack on the groups over which a model has approximately equal performance, we show that the sample complexity required for discovering performance violations is reduced exponentially to be at most upper bounded by the square root of the number of groups. As a byproduct of our analysis, when the groups are weighted by a specific prior distribution, we show that R\'enyi entropy of order $2/3$ of the prior distribution captures the sample complexity of the proposed CVaR test algorithm. Finally, we also show that there exists a non-i.i.d. data collection strategy that results in a sample complexity independent of the number of groups.
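The quantity at the heart of the test can be sketched in a few lines: rather than scoring only the single worst group, score the average loss over the worst alpha-fraction of groups, here via the Rockafellar-Uryasev form CVaR_alpha(Z) = min_rho rho + E[(Z - rho)_+] / alpha. The group error rates below are made-up numbers for illustration.

```python
import numpy as np

def cvar(losses, alpha):
    losses = np.asarray(losses, dtype=float)
    # The minimizer rho* is a (1 - alpha)-quantile of the losses.
    rho = np.quantile(losses, 1 - alpha)
    return rho + np.mean(np.maximum(losses - rho, 0)) / alpha

group_error_rates = [0.02, 0.03, 0.05, 0.04, 0.20, 0.06, 0.03, 0.08]
print("worst group :", max(group_error_rates))
print("CVaR(0.25)  :", round(cvar(group_error_rates, 0.25), 4))
```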

Learning Genomic Sequence Representations using Graph Neural Networks over De Bruijn Graphs

  • paper_url: http://arxiv.org/abs/2312.03865
  • repo_url: https://github.com/ratschlab/genomic-gnn
  • paper_authors: Kacper Kapuśniak, Manuel Burger, Gunnar Rätsch, Amir Joudaki
  • for: New sequence representation methods that keep pace with the rapid expansion of genomic sequence data.
  • methods: k-mer embeddings that merge contextual and structural string information by enhancing De Bruijn graphs with structural similarity connections, encoded self-supervisedly with a heterogeneous Graph Convolutional Network and contrastive learning.
  • results: The embeddings consistently outperform prior techniques on Edit Distance Approximation and Closest String Retrieval tasks.
    Abstract The rapid expansion of genomic sequence data calls for new methods to achieve robust sequence representations. Existing techniques often neglect intricate structural details, emphasizing mainly contextual information. To address this, we developed k-mer embeddings that merge contextual and structural string information by enhancing De Bruijn graphs with structural similarity connections. Subsequently, we crafted a self-supervised method based on Contrastive Learning that employs a heterogeneous Graph Convolutional Network encoder and constructs positive pairs based on node similarities. Our embeddings consistently outperform prior techniques for Edit Distance Approximation and Closest String Retrieval tasks.
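For readers unfamiliar with the backbone structure, a node-centric De Bruijn graph over the k-mers of a sequence can be built in a few lines (assuming networkx). The paper's structural-similarity edges and GNN encoder are omitted from this sketch.

```python
import networkx as nx

def de_bruijn(seq, k):
    g = nx.DiGraph()
    for i in range(len(seq) - k):
        u, v = seq[i:i + k], seq[i + 1:i + k + 1]   # consecutive k-mers overlap by k-1
        g.add_edge(u, v)
    return g

g = de_bruijn("ACGTACGGACGT", k=3)
print(sorted(g.edges()))
```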

Dr. Jekyll and Mr. Hyde: Two Faces of LLMs

  • paper_url: http://arxiv.org/abs/2312.03853
  • repo_url: None
  • paper_authors: Matteo Gioele Collu, Tom Janssen-Groesbeek, Stefanos Koffas, Mauro Conti, Stjepan Picek
  • For: The paper is written to demonstrate the vulnerability of ChatGPT and Bard to adversarial personas, and to show that these chatbots can be tricked into providing unauthorized, illegal, or harmful information.
  • Methods: The paper uses elaborate biographies of complex personas to trick the chatbots into providing prohibited responses. The conversation is conducted in a role-play style to elicit the desired response.
  • Results: The paper shows that both ChatGPT and Bard are vulnerable to this kind of attack, and that it is possible to obtain unauthorized information by using adversarial personas. The paper also introduces several ways of activating such personas.
    Abstract This year, we witnessed a rise in the use of Large Language Models, especially when combined with applications like chatbot assistants. Safety mechanisms and specialized training procedures are put in place to prevent improper responses from these assistants. In this work, we bypass these measures for ChatGPT and Bard (and, to some extent, Bing chat) by making them impersonate complex personas with opposite characteristics as those of the truthful assistants they are supposed to be. We start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversation followed a role-play style to get the response the assistant was not allowed to provide. By making use of personas, we show that the response that is prohibited is actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This work shows that by using adversarial personas, one can overcome safety mechanisms set out by ChatGPT and Bard. It also introduces several ways of activating such adversarial personas, altogether showing that both chatbots are vulnerable to this kind of attack.

Exposing Disparities in Flood Adaptation for Equitable Future Interventions

  • paper_url: http://arxiv.org/abs/2312.03843
  • repo_url: None
  • paper_authors: Lidia Cano Pecharroman, ChangHoon Hahn
  • For: The paper aims to evaluate whether the FEMA National Flood Insurance Program Community Rating System provides equitable support for all communities, particularly those that have been historically disadvantaged.
  • Methods: The authors use ${\rm C{\scriptsize AUSAL}F{\scriptsize LOW}}$, a causal inference method based on deep generative models, to estimate the treatment effect of flood adaptation interventions on communities' savings, conditioning on income, diversity, population, flood risk, educational attainment, and precipitation.
  • Results: The program saves communities an average of \$5,000-15,000 per household, but the savings are not evenly distributed: low-income communities and predominantly non-white communities tend to save less, with a gap of more than \$6,000 per household between predominantly white and non-white communities.
    Abstract As governments race to implement new climate adaptation policies that prepare for more frequent flooding, they must seek policies that are effective for all communities and uphold climate justice. This requires evaluating policies not only on their overall effectiveness but also on whether their benefits are felt across all communities. We illustrate the importance of considering such disparities for flood adaptation using the FEMA National Flood Insurance Program Community Rating System and its dataset of $\sim$2.5 million flood insurance claims. We use ${\rm C{\scriptsize AUSAL}F{\scriptsize LOW}}$, a causal inference method based on deep generative models, to estimate the treatment effect of flood adaptation interventions based on a community's income, diversity, population, flood risk, educational attainment, and precipitation. We find that the program saves communities \$5,000--15,000 per household. However, these savings are not evenly spread across communities. For example, for low-income communities savings sharply decline as flood-risk increases in contrast to their high-income counterparts with all else equal. Even among low-income communities, there is a gap in savings between predominantly white and non-white communities: savings of predominantly white communities can be higher by more than \$6000 per household. As communities worldwide ramp up efforts to reduce losses inflicted by floods, simply prescribing a series of flood adaptation measures is not enough. Programs must provide communities with the necessary technical and economic support to compensate for historical patterns of disenfranchisement, racism, and inequality. Future flood adaptation efforts should go beyond reducing losses overall and aim to close existing gaps to equitably support communities in the race for climate adaptation.

High Pileup Particle Tracking with Object Condensation

  • paper_url: http://arxiv.org/abs/2312.03823
  • repo_url: None
  • paper_authors: Kilian Lieret, Gage DeZoort, Devdoot Chatterjee, Jian Park, Siqi Miao, Pan Li
  • for: Improving the accuracy and scalability of charged particle tracking in high-energy physics experiments.
  • methods: Graph neural networks (GNNs) combined with an object condensation (OC) approach to tracking.
  • results: GNNs can match the performance of traditional algorithms while improving scalability to meet the computing challenges posed by the HL-LHC.
    Abstract Recent work has demonstrated that graph neural networks (GNNs) can match the performance of traditional algorithms for charged particle tracking while improving scalability to meet the computing challenges posed by the HL-LHC. Most GNN tracking algorithms are based on edge classification and identify tracks as connected components from an initial graph containing spurious connections. In this talk, we consider an alternative based on object condensation (OC), a multi-objective learning framework designed to cluster points (hits) belonging to an arbitrary number of objects (tracks) and regress the properties of each object. Building on our previous results, we present a streamlined model and show progress toward a one-shot OC tracking algorithm in a high-pileup environment.
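Below is a hedged sketch of the attractive/repulsive potential term of an object condensation loss in the style of the original OC formulation (Kieseler, 2020), which this line of work builds on: hits of the same track are pulled toward the track's highest-charge "condensation point" in the learned cluster space and pushed away from other tracks' points. `q_min` is a hyperparameter, and the full loss's beta and noise terms are omitted.

```python
import torch

def oc_potential_loss(x, beta, track_id, q_min=0.1):
    # x: (N, D) learned cluster coordinates; beta: (N,) in [0, 1); track_id: (N,) ints
    q = torch.atanh(beta.clamp(max=1 - 1e-4)) ** 2 + q_min   # per-hit charge
    loss = 0.0
    for k in track_id.unique():
        mask = track_id == k
        alpha = torch.argmax(torch.where(mask, q, torch.zeros_like(q)))  # condensation point
        d = torch.norm(x - x[alpha], dim=1)
        v_att = (d ** 2) * q[alpha]                    # pull same-track hits inward
        v_rep = torch.clamp(1 - d, min=0) * q[alpha]   # push other hits away (hinge)
        loss = loss + (q * torch.where(mask, v_att, v_rep)).mean()
    return loss / len(track_id.unique())

x = torch.randn(50, 2, requires_grad=True)
beta = torch.rand(50)
tid = torch.randint(0, 5, (50,))
print(oc_potential_loss(x, beta, tid))
```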

nbi: the Astronomer’s Package for Neural Posterior Estimation

  • paper_url: http://arxiv.org/abs/2312.03824
  • repo_url: https://github.com/kmzzhang/nbi
  • paper_authors: Keming Zhang, Joshua Bloom, Stéfan van der Walt, Nina Hernitschek
  • for: Speeding the adoption of neural posterior estimation (NPE) in astronomy by addressing three critical issues: the need for custom featurizer networks, inference inexactness, and under-specified physical forward models.
  • methods: A new framework and open-source package, nbi (Neural Bayesian Inference), supporting both amortized and sequential NPE. nbi provides built-in featurizer networks with demonstrated efficacy on sequential astronomical data such as light curves and spectra, and introduces a modified algorithm, SNPE-IS, which achieves asymptotically exact inference by using the NPE surrogate posterior only as a proposal distribution for importance sampling.
  • results: nbi applies off-the-shelf to astronomical inference problems involving light curves and spectra and can serve as an effective alternative to existing methods such as Nested Sampling.
    Abstract Despite the promise of Neural Posterior Estimation (NPE) methods in astronomy, the adaptation of NPE into the routine inference workflow has been slow. We identify three critical issues: the need for custom featurizer networks tailored to the observed data, the inference inexactness, and the under-specification of physical forward models. To address the first two issues, we introduce a new framework and open-source software nbi (Neural Bayesian Inference), which supports both amortized and sequential NPE. First, nbi provides built-in "featurizer" networks with demonstrated efficacy on sequential data, such as light curve and spectra, thus obviating the need for this customization on the user end. Second, we introduce a modified algorithm SNPE-IS, which facilitates asymptotically exact inference by using the surrogate posterior under NPE only as a proposal distribution for importance sampling. These features allow nbi to be applied off-the-shelf to astronomical inference problems involving light curves and spectra. We discuss how nbi may serve as an effective alternative to existing methods such as Nested Sampling. Our package is at https://github.com/kmzzhang/nbi.
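The importance-sampling correction described above can be sketched with toy Gaussian densities: the surrogate posterior serves only as a proposal, and weights w = p(x_obs|theta) p(theta) / q(theta|x_obs) pull the estimate toward the exact posterior. All densities here are invented stand-ins, not nbi's API.

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs = 1.3

def log_prior(theta):                 # theta ~ N(0, 2^2)
    return -0.5 * (theta / 2.0) ** 2

def log_like(theta):                  # x_obs ~ N(theta, 0.5^2)
    return -0.5 * ((x_obs - theta) / 0.5) ** 2

# Pretend surrogate posterior q(theta | x_obs) = N(1.0, 0.6^2), slightly off.
q_mu, q_sig = 1.0, 0.6
theta = rng.normal(q_mu, q_sig, size=10000)
log_q = -0.5 * ((theta - q_mu) / q_sig) ** 2 - np.log(q_sig)

# Self-normalized importance weights (constants cancel after normalization).
log_w = log_like(theta) + log_prior(theta) - log_q
w = np.exp(log_w - log_w.max())
w /= w.sum()
print("IS posterior mean:", np.sum(w * theta))   # close to the analytic posterior mean
```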

On the Role of Edge Dependency in Graph Generative Models

  • paper_url: http://arxiv.org/abs/2312.03691
  • repo_url: None
  • paper_authors: Sudhanshu Chanpuriya, Cameron Musco, Konstantinos Sotiropoulos, Charalampos Tsourakakis
  • for: A new evaluation framework for graph generative models that emphasizes both accuracy and edge diversity.
  • methods: A three-level hierarchy of graph generative models (edge independent, node independent, and fully dependent), with theoretical bounds on the number of triangles and other short cycles each level can produce, plus new generative models for each level based on dense subgraph discovery.
  • results: On real-world datasets, the new, simple, interpretable models match the output quality and overlap of other popular models, providing competitive baselines.
    Abstract In this work, we introduce a novel evaluation framework for generative models of graphs, emphasizing the importance of model-generated graph overlap (Chanpuriya et al., 2021) to ensure both accuracy and edge-diversity. We delineate a hierarchy of graph generative models categorized into three levels of complexity: edge independent, node independent, and fully dependent models. This hierarchy encapsulates a wide range of prevalent methods. We derive theoretical bounds on the number of triangles and other short-length cycles producible by each level of the hierarchy, contingent on the model overlap. We provide instances demonstrating the asymptotic optimality of our bounds. Furthermore, we introduce new generative models for each of the three hierarchical levels, leveraging dense subgraph discovery (Gionis & Tsourakakis, 2015). Our evaluation, conducted on real-world datasets, focuses on assessing the output quality and overlap of our proposed models in comparison to other popular models. Our results indicate that our simple, interpretable models provide competitive baselines to popular generative models. Through this investigation, we aim to propel the advancement of graph generative models by offering a structured framework and robust evaluation metrics, thereby facilitating the development of models capable of generating accurate and edge-diverse graphs.
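As a concrete instance of the quantities the bounds concern: under an edge-independent model with a symmetric probability matrix P (zero diagonal and independently sampled edges), the expected triangle count is trace(P^3)/6, which this short check compares against one sampled graph.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
P = rng.uniform(0, 0.2, (n, n))
P = np.triu(P, 1)
P = P + P.T                                  # symmetric edge probabilities, zero diagonal

expected = np.trace(P @ P @ P) / 6           # = sum_{i<j<k} P_ij P_jk P_ik
A = np.triu((rng.random((n, n)) < P), 1).astype(int)
A = A + A.T                                  # one graph sampled from the model
sampled = np.trace(A @ A @ A) / 6
print(f"expected triangles {expected:.1f}, sampled graph has {sampled:.0f}")
```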

Inverse Design of Vitrimeric Polymers by Molecular Dynamics and Generative Modeling

  • paper_url: http://arxiv.org/abs/2312.03690
  • repo_url: None
  • paper_authors: Yiwen Zheng, Prakash Thakolkaran, Jake A. Smith, Ziheng Lu, Shuxin Zheng, Bichlien H. Nguyen, Siddhant Kumar, Aniruddh Vashisth
  • For: The paper aims to develop a method for generating novel vitrimers with a desired glass transition temperature (Tg) and guiding their inverse design based on Tg.
  • Methods: The method combines molecular dynamics (MD) simulations and machine learning (ML), specifically a novel graph variational autoencoder (VAE) model, to generate and design vitrimers with desired Tg.
  • Results: The proposed VAE framework discovers novel vitrimers with desirable Tg beyond the training regime with high accuracy and efficiency; the generated vitrimers have reasonable synthesizability and cover a wide range of Tg, broadening the potential widespread usage of vitrimeric materials.
    Abstract Vitrimer is a new class of sustainable polymers with the ability of self-healing through rearrangement of dynamic covalent adaptive networks. However, a limited choice of constituent molecules restricts their property space, prohibiting full realization of their potential applications. Through a combination of molecular dynamics (MD) simulations and machine learning (ML), particularly a novel graph variational autoencoder (VAE) model, we establish a method for generating novel vitrimers and guide their inverse design based on desired glass transition temperature (Tg). We build the first vitrimer dataset of one million and calculate Tg on 8,424 of them by high-throughput MD simulations calibrated by a Gaussian process model. The proposed VAE employs dual graph encoders and a latent dimension overlapping scheme which allows for individual representation of multi-component vitrimers. By constructing a continuous latent space containing necessary information of vitrimers, we demonstrate high accuracy and efficiency of our framework in discovering novel vitrimers with desirable Tg beyond the training regime. The proposed vitrimers with reasonable synthesizability cover a wide range of Tg and broaden the potential widespread usage of vitrimeric materials.

GeoShapley: A Game Theory Approach to Measuring Spatial Effects in Machine Learning Models

  • paper_url: http://arxiv.org/abs/2312.03675
  • repo_url: None
  • paper_authors: Ziqi Li
  • for: Measuring spatial effects in machine learning models via a game theory approach, GeoShapley.
  • methods: GeoShapley extends the Nobel Prize-winning Shapley value framework by conceptualizing location as a player in a model prediction game, which quantifies the importance of location and the synergies between location and other features. The approach is model-agnostic and applies to statistical or black-box machine learning models.
  • results: GeoShapley values are validated against known data-generating processes on simulated data and used to cross-compare seven statistical and machine learning models; a real-world house price example illustrates GeoShapley's utility and interpretation. The method is available as an open-source Python package named geoshapley.
    Abstract This paper introduces GeoShapley, a game theory approach to measuring spatial effects in machine learning models. GeoShapley extends the Nobel Prize-winning Shapley value framework in game theory by conceptualizing location as a player in a model prediction game, which enables the quantification of the importance of location and the synergies between location and other features in a model. GeoShapley is a model-agnostic approach and can be applied to statistical or black-box machine learning models in various structures. The interpretation of GeoShapley is directly linked with spatially varying coefficient models for explaining spatial effects and additive models for explaining non-spatial effects. Using simulated data, GeoShapley values are validated against known data-generating processes and are used for cross-comparison of seven statistical and machine learning models. An empirical example of house price modeling is used to illustrate GeoShapley's utility and interpretation with real world data. The method is available as an open-source Python package named geoshapley.
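Here is a toy, brute-force illustration of the core idea: treat "location" as one more player in the prediction game and compute its exact Shapley value by enumerating coalitions, with absent players set to background values. This is generic Shapley code for intuition, not the geoshapley package's own API, and the model and numbers are invented.

```python
import itertools, math
import numpy as np

def shapley(predict, x, background):
    """Exact Shapley values by coalition enumeration (exponential; toy sizes only)."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in itertools.combinations(others, r):
                w = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                z = background.copy()
                z[list(S)] = x[list(S)]      # coalition S present, rest at background
                without = predict(z)
                z[i] = x[i]                  # add player i to the coalition
                phi[i] += w * (predict(z) - without)
    return phi

# Toy model: column 0 plays the role of "location", with a location-feature synergy.
predict = lambda z: 2.0 * z[0] + z[1] + 0.5 * z[0] * z[1]
x = np.array([1.0, 3.0])                     # instance to explain
background = np.array([0.0, 0.0])            # reference values for absent players
print(dict(zip(["location", "feature"], shapley(predict, x, background).round(3))))
```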

On the Role of the Action Space in Robot Manipulation Learning and Sim-to-Real Transfer

  • paper_url: http://arxiv.org/abs/2312.03673
  • repo_url: None
  • paper_authors: Elie Aljalbout, Felix Frank, Maximilian Karl, Patrick van der Smagt
  • for: Studying the choice of action space in robot manipulation learning and sim-to-real transfer.
  • methods: Metrics that assess performance and the emerging properties of different action spaces; over 250 reinforcement learning (RL) agents are trained in simulated reaching and pushing tasks across 13 control spaces and evaluated for training performance in simulation and transfer to a real-world environment.
  • results: Good and bad characteristics of robotic action spaces are identified, with recommendations for future designs; the findings have important implications for designing RL algorithms for robot manipulation tasks and highlight the need for careful consideration of action spaces when training and transferring RL agents for real-world robotics.
    Abstract We study the choice of action space in robot manipulation learning and sim-to-real transfer. We define metrics that assess the performance, and examine the emerging properties in the different action spaces. We train over 250 reinforcement learning (RL) agents in simulated reaching and pushing tasks, using 13 different control spaces. The choice of action spaces spans popular choices in the literature as well as novel combinations of common design characteristics. We evaluate the training performance in simulation and the transfer to a real-world environment. We identify good and bad characteristics of robotic action spaces and make recommendations for future designs. Our findings have important implications for the design of RL algorithms for robot manipulation tasks, and highlight the need for careful consideration of action spaces when training and transferring RL agents for real-world robotics.

Direct Exoplanet Detection Using Deep Convolutional Image Reconstruction (ConStruct): A New Algorithm for Post-Processing High-Contrast Images

  • paper_url: http://arxiv.org/abs/2312.03671
  • repo_url: None
  • paper_authors: Trevor N. Wolf, Brandon A. Jones, Brendan P. Bowler
  • for: Detecting faint point sources in high-contrast adaptive optics imaging sequences.
  • methods: A deep-learning post-processing algorithm for direct imaging that leverages an extensive reference library of real imaging sequences to reconstruct and subtract stellar speckle noise.
  • results: Across 30 unique point sources, ConStruct yields a higher S/N than traditional PCA-based processing in 67% of cases and improves the relative contrast by up to a factor of 2.6.
    Abstract We present a novel machine-learning approach for detecting faint point sources in high-contrast adaptive optics imaging datasets. The most widely used algorithms for primary subtraction aim to decouple bright stellar speckle noise from planetary signatures by subtracting an approximation of the temporally evolving stellar noise from each frame in an imaging sequence. Our approach aims to improve the stellar noise approximation and increase the planet detection sensitivity by leveraging deep learning in a novel direct imaging post-processing algorithm. We show that a convolutional autoencoder neural network, trained on an extensive reference library of real imaging sequences, accurately reconstructs the stellar speckle noise at the location of a potential planet signal. This tool is used in a post-processing algorithm we call Direct Exoplanet Detection with Convolutional Image Reconstruction, or ConStruct. The reliability and sensitivity of ConStruct are assessed using real Keck/NIRC2 angular differential imaging datasets. Of the 30 unique point sources we examine, ConStruct yields a higher S/N than traditional PCA-based processing for 67$\%$ of the cases and improves the relative contrast by up to a factor of 2.6. This work demonstrates the value and potential of deep learning to take advantage of a diverse reference library of point spread function realizations to improve direct imaging post-processing. ConStruct and its future improvements may be particularly useful as tools for post-processing high-contrast images from the James Webb Space Telescope and extreme adaptive optics instruments, both for the current generation and those being designed for the upcoming 30 meter-class telescopes.
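A schematic convolutional autoencoder of the kind described above, sketched in PyTorch: it learns to reconstruct the stellar speckle pattern so that residuals at a candidate planet location stand out after subtraction. The layer sizes and training setup are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpeckleAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),              # 32 -> 64
        )

    def forward(self, x):
        return self.dec(self.enc(x))

model = SpeckleAE()
frame = torch.randn(8, 1, 64, 64)            # batch of (synthetic) speckle frames
recon = model(frame)
loss = nn.functional.mse_loss(recon, frame)  # reconstruction objective
loss.backward()
print(recon.shape, float(loss))
```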

Towards small and accurate convolutional neural networks for acoustic biodiversity monitoring

  • paper_url: http://arxiv.org/abs/2312.03666
  • repo_url: None
  • paper_authors: Serge Zaugg, Mike van der Schaar, Florence Erbs, Antonio Sanchez, Joan V. Castell, Emiliano Ramallo, Michel André
  • for: Designing convolutional neural networks that are fast at inference time and accurate for large-scale acoustic biodiversity monitoring.
  • methods: Spectrograms from 10-second segments as CNN input, and a simple CNN architecture with a frequency unwrapping layer (SIMP-FU models) in which every output unit is connected to all spectrogram frequencies but only to a sub-region of time, the Receptive Field (RF).
  • results: Models trained with time-indexed labels, which encode the start and end points of sounds, performed considerably better than segment-level counterparts, and classification was best at an intermediate RF duration of 1.5 seconds. The best SIMP-FU models achieved AUCs over 0.95 in 18 of 20 bird classes on the test set and evaluated up to seven times faster than real-time data acquisition on compact low-cost hardware.
    Abstract Automated classification of animal sounds is a prerequisite for large-scale monitoring of biodiversity. Convolutional Neural Networks (CNNs) are among the most promising algorithms but they are slow, often achieve poor classification in the field and typically require large training data sets. Our objective was to design CNNs that are fast at inference time and achieve good classification performance while learning from moderate-sized data. Recordings from a rainforest ecosystem were used. Start and end-point of sounds from 20 bird species were manually annotated. Spectrograms from 10 second segments were used as CNN input. We designed simple CNNs with a frequency unwrapping layer (SIMP-FU models) such that any output unit was connected to all spectrogram frequencies but only to a sub-region of time, the Receptive Field (RF). Our models allowed experimentation with different RF durations. Models either used the time-indexed labels that encode start and end-point of sounds or simpler segment-level labels. Models learning from time-indexed labels performed considerably better than their segment-level counterparts. Best classification performance was achieved for models with an intermediate RF duration of 1.5 seconds. The best SIMP-FU models achieved AUCs over 0.95 in 18 of 20 classes on the test set. On compact low-cost hardware the best SIMP-FU models evaluated up to seven times faster than real-time data acquisition. RF duration was a major driver of classification performance. The optimum of 1.5 s was in the same range as the duration of the sounds. Our models achieved good classification performance while learning from moderate-sized training data. This is explained by the usage of time-indexed labels during training and adequately sized RF. Results confirm the feasibility of deploying small CNNs with good classification performance on compact low-cost devices.

MIRACLE: Inverse Reinforcement and Curriculum Learning Model for Human-inspired Mobile Robot Navigation

  • paper_url: http://arxiv.org/abs/2312.03651
  • repo_url: None
  • paper_authors: Nihal Gunukula, Kshitij Tiwari, Aniket Bera
  • for: This paper aims to improve the navigation of mobile robots in emergency scenarios by enabling them to interpret stimuli like humans and locate potential victims rapidly without interfering with first responders.
  • methods: The proposed solution, called MIRACLE, uses gamified learning to gather stimuli-driven human navigational data, which is then used to train a Deep Inverse Maximum Entropy Reinforcement Learning model.
  • results: Testing revealed a low loss of 2.7717 within a 400-sized environment, indicating that the proposed approach can replicate human-like response. The approach has the potential to enhance the life-saving capabilities of mobile robots in emergency situations.
    Abstract In emergency scenarios, mobile robots must navigate like humans, interpreting stimuli to locate potential victims rapidly without interfering with first responders. Existing socially-aware navigation algorithms face computational and adaptability challenges. To overcome these, we propose a solution, MIRACLE -- an inverse reinforcement and curriculum learning model, that employs gamified learning to gather stimuli-driven human navigational data. This data is then used to train a Deep Inverse Maximum Entropy Reinforcement Learning model, reducing reliance on demonstrator abilities. Testing reveals a low loss of 2.7717 within a 400-sized environment, signifying human-like response replication. Current databases lack comprehensive stimuli-driven data, necessitating our approach. By doing so, we enable robots to navigate emergency situations with human-like perception, enhancing their life-saving capabilities.

MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment

  • paper_url: http://arxiv.org/abs/2312.03644
  • repo_url: None
  • paper_authors: Ziyan Wang, Yali Du, Yudi Zhang, Meng Fang, Biwei Huang
  • for: Improving offline multi-agent reinforcement learning (MARL), which is valuable in scenarios where online interaction is impractical or risky.
  • methods: MACCA characterizes the generative process as a Dynamic Bayesian Network capturing the relationships between environmental variables, states, actions, and rewards; by analyzing the causal relationships behind each agent's individual reward, it achieves accurate and interpretable credit assignment.
  • results: Experiments in discrete and continuous action settings show that MACCA outperforms state-of-the-art methods in the offline setting and improves performance when built on top of their backbones.
    Abstract Offline Multi-agent Reinforcement Learning (MARL) is valuable in scenarios where online interaction is impractical or risky. While independent learning in MARL offers flexibility and scalability, accurately assigning credit to individual agents in offline settings poses challenges due to partial observability and emergent behavior. Directly transferring the online credit assignment method to offline settings results in suboptimal outcomes due to the absence of real-time feedback and intricate agent interactions. Our approach, MACCA, characterizing the generative process as a Dynamic Bayesian Network, captures relationships between environmental variables, states, actions, and rewards. Estimating this model on offline data, MACCA can learn each agent's contribution by analyzing the causal relationship of their individual rewards, ensuring accurate and interpretable credit assignment. Additionally, the modularity of our approach allows it to seamlessly integrate with various offline MARL methods. Theoretically, we proved that under the setting of the offline dataset, the underlying causal structure and the function for generating the individual rewards of agents are identifiable, which laid the foundation for the correctness of our modeling. Experimentally, we tested MACCA in two environments, including discrete and continuous action settings. The results show that MACCA outperforms SOTA methods and improves performance upon their backbones.

Transformer-Powered Surrogates Close the ICF Simulation-Experiment Gap with Extremely Limited Data

  • paper_url: http://arxiv.org/abs/2312.03642
  • repo_url: None
  • paper_authors: Matthew L. Olson, Shusen Liu, Jayaraman J. Thiagarajan, Bogdan Kustowski, Weng-Keen Wong, Rushil Anirudh
  • for: Improving prediction accuracy in multi-modal output scenarios where sparse experimental data is supplemented with simulation data, using a transformer-based approach.
  • methods: A transformer architecture combined with a novel graph-based hyperparameter optimization technique.
  • results: The approach effectively reduces simulation bias and achieves superior prediction accuracy compared to the prior method, demonstrated on inertial confinement fusion experiments with only 10 shots of real-world data as well as synthetic versions of those experiments.
    Abstract Recent advances in machine learning, specifically transformer architecture, have led to significant advancements in commercial domains. These powerful models have demonstrated superior capability to learn complex relationships and often generalize better to new data and problems. This paper presents a novel transformer-powered approach for enhancing prediction accuracy in multi-modal output scenarios, where sparse experimental data is supplemented with simulation data. The proposed approach integrates transformer-based architecture with a novel graph-based hyper-parameter optimization technique. The resulting system not only effectively reduces simulation bias, but also achieves superior prediction accuracy compared to the prior method. We demonstrate the efficacy of our approach on inertial confinement fusion experiments, where only 10 shots of real-world data are available, as well as synthetic versions of these experiments.

Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models

  • paper_url: http://arxiv.org/abs/2312.03632
  • repo_url: None
  • paper_authors: Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi
  • for: Making interactions with virtual assistants more natural by eliminating the need for a trigger phrase.
  • methods: 1-best hypotheses and decoder signals from an automatic speech recognition system, combined with acoustic representations from an audio encoder, as input features to a large language model (LLM); the model is trained on 80k or fewer examples of multimodal data using a combination of low-rank adaptation and prefix tuning.
  • results: The multimodal approach achieves lower equal-error-rates (EERs) than unimodal baselines while using only a fraction of the training data, and low-dimensional specialized audio representations yield lower EERs than high-dimensional general ones.
    Abstract Interactions with virtual assistants typically start with a trigger phrase followed by a command. In this work, we explore the possibility of making these interactions more natural by eliminating the need for a trigger phrase. Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone. We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder as input features to a large language model (LLM). In particular, we are interested in data and resource efficient systems that require only a small amount of training data and can operate in scenarios with only a single frozen LLM available on a device. For this reason, our model is trained on 80k or less examples of multimodal data using a combination of low-rank adaptation and prefix tuning. We compare the proposed system to unimodal baselines and show that the multimodal approach achieves lower equal-error-rates (EERs), while using only a fraction of the training data. We also show that low-dimensional specialized audio representations lead to lower EERs than high-dimensional general audio representations.

Evaluation of Active Feature Acquisition Methods for Static Feature Settings

  • paper_url: http://arxiv.org/abs/2312.03619
  • repo_url: None
  • paper_authors: Henrik von Kleist, Alireza Zamanian, Ilya Shpitser, Narges Ahmidi
  • for: Evaluating the expected performance of active feature acquisition (AFA) agents in healthcare, where acquiring features can be costly or harmful, using retrospective data.
  • methods: Extending a semi-offline reinforcement learning (RL) framework for active feature acquisition performance evaluation (AFAPE) from time-dependent features to static feature settings, deriving and adapting new inverse probability weighting (IPW), direct method (DM), and double reinforcement learning (DRL) estimators that handle missing data.
  • results: The semi-offline RL estimators show improved data efficiency in synthetic and real-world experiments under synthetic missing-at-random (MAR) and missing-not-at-random (MNAR) patterns.
    Abstract Active feature acquisition (AFA) agents, crucial in domains like healthcare where acquiring features is often costly or harmful, determine the optimal set of features for a subsequent classification task. As deploying an AFA agent introduces a shift in missingness distribution, it's vital to assess its expected performance at deployment using retrospective data. In a companion paper, we introduce a semi-offline reinforcement learning (RL) framework for active feature acquisition performance evaluation (AFAPE) where features are assumed to be time-dependent. Here, we study and extend the AFAPE problem to cover static feature settings, where features are time-invariant, and hence provide more flexibility to the AFA agents in deciding the order of the acquisitions. In this static feature setting, we derive and adapt new inverse probability weighting (IPW), direct method (DM), and double reinforcement learning (DRL) estimators within the semi-offline RL framework. These estimators can be applied when the missingness in the retrospective dataset follows a missing-at-random (MAR) pattern. They also can be applied to missing-not-at-random (MNAR) patterns in conjunction with appropriate existing missing data techniques. We illustrate the improved data efficiency offered by the semi-offline RL estimators in synthetic and real-world data experiments under synthetic MAR and MNAR missingness.
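The IPW idea underlying these estimators can be shown on a toy missing-at-random example: observed cases are reweighted by the inverse probability that their features would have been acquired, which removes the selection bias of the naive average. The propensities are known here by construction; in practice they are estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)                      # always-observed covariate
y = x + rng.normal(size=n)                  # quantity whose mean we want
pi = 1 / (1 + np.exp(-x))                   # P(acquired | x): the MAR mechanism
acquired = rng.random(n) < pi

naive = y[acquired].mean()                  # biased: acquisition depends on x
ipw = np.sum(y[acquired] / pi[acquired]) / np.sum(1 / pi[acquired])
print(f"true {y.mean():.3f}  naive {naive:.3f}  IPW {ipw:.3f}")
```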

Physical Symbolic Optimization

  • paper_url: http://arxiv.org/abs/2312.03612
  • repo_url: https://github.com/wassimtenachi/physo
  • paper_authors: Wassim Tenachi, Rodrigo Ibata, Foivos I. Diakogiannis
  • for: Constraining the automatic sequential generation of equations to obey the rules of dimensional analysis by construction.
  • methods: Combining this units-constrained framework with reinforcement learning yields $\Phi$-SO, a Physical Symbolic Optimization method for recovering analytical functions from physical data.
  • results: The symbolic regression algorithm achieves state-of-the-art results when variables and constants have known physical units, outperforming all other methods on SRBench's Feynman benchmark in the presence of noise (exceeding 0.1%) and remaining resilient even at significant (10%) noise levels.
    Abstract We present a framework for constraining the automatic sequential generation of equations to obey the rules of dimensional analysis by construction. Combining this approach with reinforcement learning, we built $\Phi$-SO, a Physical Symbolic Optimization method for recovering analytical functions from physical data leveraging units constraints. Our symbolic regression algorithm achieves state-of-the-art results in contexts in which variables and constants have known physical units, outperforming all other methods on SRBench's Feynman benchmark in the presence of noise (exceeding 0.1%) and showing resilience even in the presence of significant (10%) levels of noise.
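A tiny sketch of the dimensional analysis bookkeeping behind such constraints: each quantity's unit is a vector of exponents over base units (here [m, s, kg]); multiplication adds vectors, and addition is legal only when vectors match, so a generator can prune dimensionally inconsistent candidates as it builds an equation. This is an illustration of the principle, not the $\Phi$-SO implementation.

```python
import numpy as np

class Q:                                    # a quantity: value plus a unit vector
    def __init__(self, v, u):
        self.v, self.u = v, np.array(u)
    def __mul__(self, o):
        return Q(self.v * o.v, self.u + o.u)
    def __truediv__(self, o):
        return Q(self.v / o.v, self.u - o.u)
    def __add__(self, o):
        if not np.array_equal(self.u, o.u):
            raise ValueError(f"unit mismatch: {self.u} vs {o.u}")
        return Q(self.v + o.v, self.u)

d = Q(10.0, [1, 0, 0])                      # meters
t = Q(2.0, [0, 1, 0])                       # seconds
m = Q(3.0, [0, 0, 1])                       # kilograms
v = d / t                                   # m s^-1
print((m * v * v).u)                        # kinetic-energy-like units: [2 -2 1]
# v + d -> would raise ValueError: the constraint prunes this candidate
```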

Achieving ${O}(ε^{-1.5})$ Complexity in Hessian/Jacobian-free Stochastic Bilevel Optimization

  • paper_url: http://arxiv.org/abs/2312.03807
  • repo_url: None
  • paper_authors: Yifan Yang, Peiyao Xiao, Kaiyi Ji
  • for: Improving the efficiency of stochastic bilevel optimization in which the upper-level objective is nonconvex and the lower-level objective is strongly convex.
  • methods: A novel Hessian/Jacobian-free bilevel optimizer, FdeHBO, featuring a simple fully single-loop structure, a projection-aided finite-difference Hessian/Jacobian-vector approximation, and momentum-based updates.
  • results: FdeHBO requires ${O}(\epsilon^{-1.5})$ iterations (each using ${O}(1)$ samples and only first-order gradient information) to find an $\epsilon$-accurate stationary point; this is the first Hessian/Jacobian-free method with an ${O}(\epsilon^{-1.5})$ sample complexity for nonconvex-strongly-convex stochastic bilevel optimization.
    Abstract In this paper, we revisit the bilevel optimization problem, in which the upper-level objective function is generally nonconvex and the lower-level objective function is strongly convex. Although this type of problem has been studied extensively, it still remains an open question how to achieve an ${O}(\epsilon^{-1.5})$ sample complexity in Hessian/Jacobian-free stochastic bilevel optimization without any second-order derivative computation. To fill this gap, we propose a novel Hessian/Jacobian-free bilevel optimizer named FdeHBO, which features a simple fully single-loop structure, a projection-aided finite-difference Hessian/Jacobian-vector approximation, and momentum-based updates. Theoretically, we show that FdeHBO requires ${O}(\epsilon^{-1.5})$ iterations (each using ${O}(1)$ samples and only first-order gradient information) to find an $\epsilon$-accurate stationary point. As far as we know, this is the first Hessian/Jacobian-free method with an ${O}(\epsilon^{-1.5})$ sample complexity for nonconvex-strongly-convex stochastic bilevel optimization.
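The finite-difference Hessian-vector approximation named above is simple to sketch: hvp(v) is approximated as (grad f(x + eps v) - grad f(x - eps v)) / (2 eps), costing two extra gradient evaluations and no second-order derivatives. The toy quadratic below, where the Hessian is known, serves as a correctness check; the paper's projection step is omitted.

```python
import numpy as np

def fd_hvp(grad_f, x, v, eps=1e-5):
    """Central finite-difference Hessian-vector product."""
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)

A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_f = lambda x: A @ x                   # f(x) = 0.5 x^T A x, so Hessian = A
x, v = np.ones(2), np.array([1.0, -1.0])
print(fd_hvp(grad_f, x, v), "vs exact", A @ v)
```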

Blueprinting the Future: Automatic Item Categorization using Hierarchical Zero-Shot and Few-Shot Classifiers

  • paper_url: http://arxiv.org/abs/2312.03561
  • repo_url: None
  • paper_authors: Ting Wang, Keith Stelter, Jenn Floyd, Thomas O’Neill, Nathaniel Hendrix, Andrew Bazemore, Kevin Rode, Warren Newton
  • For: The paper aims to develop a novel approach for hierarchical item categorization in testing industries, specifically for aligning exam questions with the designated content domains outlined in the assessment blueprint.
  • Methods: The proposed approach utilizes the zero-shot and few-shot Generative Pretrained Transformer (GPT) classifier, which leverages human-like language descriptions to define categories. The hierarchical nature of examination blueprints is navigated using a structured python dictionary, allowing for a tiered classification of items across multiple levels.
  • Results: The proposed method achieves an average accuracy of 92.91% measured by the F1 score in an initial simulation with artificial data. Additionally, the method was applied to real exam items from the 2022 In-Training Examination (ITE) conducted by the American Board of Family Medicine (ABFM), reclassifying 200 items according to a newly formulated blueprint swiftly in 15 minutes, a task that traditionally could span several days among editors and physicians.
    Abstract In testing industry, precise item categorization is pivotal to align exam questions with the designated content domains outlined in the assessment blueprint. Traditional methods either entail manual classification, which is laborious and error-prone, or utilize machine learning requiring extensive training data, often leading to model underfit or overfit issues. This study unveils a novel approach employing the zero-shot and few-shot Generative Pretrained Transformer (GPT) classifier for hierarchical item categorization, minimizing the necessity for training data, and instead, leveraging human-like language descriptions to define categories. Through a structured python dictionary, the hierarchical nature of examination blueprints is navigated seamlessly, allowing for a tiered classification of items across multiple levels. An initial simulation with artificial data demonstrates the efficacy of this method, achieving an average accuracy of 92.91% measured by the F1 score. This method was further applied to real exam items from the 2022 In-Training Examination (ITE) conducted by the American Board of Family Medicine (ABFM), reclassifying 200 items according to a newly formulated blueprint swiftly in 15 minutes, a task that traditionally could span several days among editors and physicians. This innovative approach not only drastically cuts down classification time but also ensures a consistent, principle-driven categorization, minimizing human biases and discrepancies. The ability to refine classifications by adjusting definitions adds to its robustness and sustainability.
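A minimal sketch of the tiered-classification idea, assuming a nested Python dict for the blueprint; the `classify` function below is a keyword-overlap stand-in for the zero-shot GPT call the paper actually uses, and all category names are hypothetical:

```python
# Blueprint as a nested dict: category -> {description, children}.
blueprint = {
    "Cardiovascular": {
        "description": "heart blood pressure vessels",
        "children": {
            "Hypertension": {"description": "high blood pressure management", "children": {}},
            "Arrhythmia": {"description": "irregular heart rhythm", "children": {}},
        },
    },
    "Endocrine": {
        "description": "hormones diabetes thyroid",
        "children": {
            "Diabetes": {"description": "blood glucose and insulin", "children": {}},
        },
    },
}

def classify(item_text, options):
    """Stand-in for a zero-shot GPT call: pick the category whose
    description shares the most words with the item text."""
    words = set(item_text.lower().split())
    return max(options, key=lambda c: len(words & set(options[c]["description"].split())))

def categorize(item_text, tree):
    """Walk the blueprint top-down, classifying one tier at a time."""
    path = []
    while tree:
        choice = classify(item_text, tree)
        path.append(choice)
        tree = tree[choice]["children"]
    return path

print(categorize("management of high blood pressure in adults", blueprint))
# -> ['Cardiovascular', 'Hypertension']
```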

Clustering by Contour coreset and variational quantum eigensolver

  • paper_url: http://arxiv.org/abs/2312.03516
  • repo_url: None
  • paper_authors: Canaan Yung, Muhammad Usman
  • for: solving the k-means clustering problem on quantum computers
  • methods: uses the variational quantum eigensolver (VQE) together with a quantum-tailored coreset technique, the Contour coreset
  • results: the VQE+Contour coreset approach outperforms existing QAOA+coreset k-means clustering approaches, with higher accuracy and lower standard deviation on real-life data
    Abstract Recent work has proposed solving the k-means clustering problem on quantum computers via the Quantum Approximate Optimization Algorithm (QAOA) and coreset techniques. Although the current method demonstrates the possibility of quantum k-means clustering, it does not ensure high accuracy and consistency across a wide range of datasets. The existing coreset techniques are designed for classical algorithms and there has been no quantum-tailored coreset technique which is designed to boost the accuracy of quantum algorithms. In this work, we propose solving the k-means clustering problem with the variational quantum eigensolver (VQE) and a customised coreset method, the Contour coreset, which has been formulated with specific focus on quantum algorithms. Extensive simulations with synthetic and real-life data demonstrated that our VQE+Contour Coreset approach outperforms existing QAOA+Coreset k-means clustering approaches with higher accuracy and lower standard deviation. Our work has shown that quantum-tailored coreset techniques have the potential to significantly boost the performance of quantum algorithms when compared to using generic off-the-shelf coreset techniques.

Towards Sobolev Pruning

  • paper_url: http://arxiv.org/abs/2312.03510
  • repo_url: None
  • paper_authors: Neil Kichler, Sher Afghan, Uwe Naumann
  • for: proposing a method for building surrogate models that capture the sensitivity information of the original model, using interval adjoint significance analysis and Sobolev training.
  • methods: a neural network models the original sensitivity information; interval adjoint significance analysis is combined with Sobolev training to prune the network and obtain an accurate surrogate model.
  • results: the method is experimentally validated on pricing a multidimensional basket option, where the surrogate accurately captures the sensitivity information of the original model; the approach is not limited to quantitative finance and can be applied to other domains as well.
    Abstract The increasing use of stochastic models for describing complex phenomena warrants surrogate models that capture the reference model characteristics at a fraction of the computational cost, foregoing potentially expensive Monte Carlo simulation. The predominant approach of fitting a large neural network and then pruning it to a reduced size has shortcomings that are commonly neglected. The produced surrogate models often will not capture the sensitivities and uncertainties inherent in the original model. In particular, (higher-order) derivative information of such surrogates could differ drastically. Given a large enough network, we expect this derivative information to match. However, the pruned model will almost certainly not share this behavior. In this paper, we propose to find surrogate models by using sensitivity information throughout the learning and pruning process. We build on work using Interval Adjoint Significance Analysis for pruning and combine it with the recent advancements in Sobolev Training to accurately model the original sensitivity information in the pruned neural network based surrogate model. We experimentally underpin the method on an example of pricing a multidimensional Basket option modelled through a stochastic differential equation with Brownian motion. The proposed method is, however, not limited to the domain of quantitative finance, which was chosen as a case study for intuitive interpretations of the sensitivities. It serves as a foundation for building further surrogate modelling techniques considering sensitivity information.
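A minimal PyTorch sketch of a Sobolev-style training loss that matches both the values and the first derivatives of a toy reference function f(x) = sin(x); the architecture, weighting, and optimizer are arbitrary illustrative choices, and the paper's full method additionally applies interval adjoint significance analysis for pruning:

```python
import torch

# Toy reference model: f(x) = sin(x); its sensitivity is cos(x).
x = torch.linspace(-3, 3, 256).unsqueeze(1)
y = torch.sin(x)
dy = torch.cos(x)

net = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1)
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(500):
    xb = x.clone().requires_grad_(True)
    pred = net(xb)
    # Derivative of the surrogate w.r.t. its input, kept in the
    # graph (create_graph=True) so the derivative mismatch is trainable.
    dpred, = torch.autograd.grad(pred.sum(), xb, create_graph=True)
    loss = torch.mean((pred - y) ** 2) + torch.mean((dpred - dy) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()
```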

PCDP-SGD: Improving the Convergence of Differentially Private SGD via Projection in Advance

  • paper_url: http://arxiv.org/abs/2312.03792
  • repo_url: None
  • paper_authors: Haichao Sha, Ruixuan Liu, Yixuan Liu, Hong Chen
  • for: improving the utility of differentially private training: DP-SGD offers theoretical guarantees for training data in both centralized and federated settings, but its utility degradation limits its use in high-stakes tasks.
  • methods: proposes PCDP-SGD, a framework that applies a projection operation before gradient clipping to compress redundant gradient norms and preserve the more crucial top gradient components; PCDP-SGD is further extended as a fundamental component of differentially private federated learning (DPFL) to mitigate data heterogeneity and achieve efficient communication.
  • results: experiments show that PCDP-SGD achieves higher accuracy than state-of-the-art DP-SGD variants on computer vision tasks and outperforms current federated learning frameworks when DP is guaranteed on local training sets.
    Abstract The paradigm of Differentially Private SGD (DP-SGD) can provide a theoretical guarantee for training data in both centralized and federated settings. However, the utility degradation caused by DP-SGD limits its wide application in high-stakes tasks, such as medical image diagnosis. In addition to the necessary perturbation, the convergence issue is attributed to the information loss on the gradient clipping. In this work, we propose a general framework PCDP-SGD, which aims to compress redundant gradient norms and preserve more crucial top gradient components via projection operation before gradient clipping. Additionally, we extend PCDP-SGD as a fundamental component in differential privacy federated learning (DPFL) for mitigating the data heterogeneous challenge and achieving efficient communication. We prove that pre-projection enhances the convergence of DP-SGD by reducing the dependence of clipping error and bias to a fraction of the top gradient eigenspace, and in theory, limits cross-client variance to improve the convergence under heterogeneous federation. Experimental results demonstrate that PCDP-SGD achieves higher accuracy compared with state-of-the-art DP-SGD variants in computer vision tasks. Moreover, PCDP-SGD outperforms current federated learning frameworks when DP is guaranteed on local training sets.
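A rough numpy sketch of the pre-projection idea: project per-sample gradients onto a dominant subspace before the usual DP-SGD clip-and-noise step. The subspace rank `k`, clipping norm, and noise scale are hypothetical, and this is an illustration of the pattern, not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def pcdp_step(per_sample_grads, k=4, clip=1.0, sigma=1.0):
    """Project per-sample gradients onto the top-k subspace spanned by
    the batch, then clip and add Gaussian noise as in DP-SGD."""
    G = np.stack(per_sample_grads)              # (batch, dim)
    # Top-k right singular vectors of the gradient matrix.
    _, _, Vt = np.linalg.svd(G, full_matrices=False)
    P = Vt[:k]                                  # (k, dim) projection basis
    G_proj = G @ P.T @ P                        # keep dominant components
    norms = np.linalg.norm(G_proj, axis=1, keepdims=True)
    G_clip = G_proj * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    noise = rng.normal(0.0, sigma * clip, size=G.shape[1])
    return (G_clip.sum(axis=0) + noise) / len(per_sample_grads)

grads = [rng.normal(size=100) for _ in range(32)]
print(pcdp_step(grads).shape)   # (100,)
```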

Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis

  • paper_url: http://arxiv.org/abs/2312.03491
  • repo_url: None
  • paper_authors: Zehua Chen, Guande He, Kaiwen Zheng, Xu Tan, Jun Zhu
  • for: proposing a new text-to-speech (TTS) system that improves both synthesis quality and sampling efficiency.
  • methods: replaces the noisy Gaussian prior of diffusion-based TTS with a clean, deterministic prior derived from the latent representation of the text input, and builds a fully tractable Schrodinger bridge between this prior and the ground-truth mel-spectrogram, yielding a data-to-data generation process.
  • results: experiments on the LJ-Speech dataset show high-quality synthesis that significantly outperforms the diffusion counterpart Grad-TTS in 50-step/1000-step generation and strong fast TTS models in few-step scenarios.
    Abstract In text-to-speech (TTS) synthesis, diffusion models have achieved promising generation quality. However, because of the pre-defined data-to-noise diffusion process, their prior distribution is restricted to a noisy representation, which provides little information of the generation target. In this work, we present a novel TTS system, Bridge-TTS, making the first attempt to substitute the noisy Gaussian prior in established diffusion-based TTS methods with a clean and deterministic one, which provides strong structural information of the target. Specifically, we leverage the latent representation obtained from text input as our prior, and build a fully tractable Schrodinger bridge between it and the ground-truth mel-spectrogram, leading to a data-to-data process. Moreover, the tractability and flexibility of our formulation allow us to empirically study the design spaces such as noise schedules, as well as to develop stochastic and deterministic samplers. Experimental results on the LJ-Speech dataset illustrate the effectiveness of our method in terms of both synthesis quality and sampling efficiency, significantly outperforming our diffusion counterpart Grad-TTS in 50-step/1000-step synthesis and strong fast TTS models in few-step scenarios. Project page: https://bridge-tts.github.io/

Precision of Individual Shapley Value Explanations

  • paper_url: http://arxiv.org/abs/2312.03485
  • repo_url: None
  • paper_authors: Lars Henry Berge Olsen
  • for: explaining predictions made by complex machine learning models using Shapley values for tabular data.
  • methods: compares numerous Shapley value estimation methods and discusses their precision on an individual basis.
  • results: the explanations are systematically less precise for observations on the outer region of the training data distribution, for all used estimation methods.
    Abstract Shapley values are extensively used in explainable artificial intelligence (XAI) as a framework to explain predictions made by complex machine learning (ML) models. In this work, we focus on conditional Shapley values for predictive models fitted to tabular data and explain the prediction $f(\boldsymbol{x}^{*})$ for a single observation $\boldsymbol{x}^{*}$ at the time. Numerous Shapley value estimation methods have been proposed and empirically compared on an average basis in the XAI literature. However, less focus has been devoted to analyzing the precision of the Shapley value explanations on an individual basis. We extend our work in Olsen et al. (2023) by demonstrating and discussing that the explanations are systematically less precise for observations on the outer region of the training data distribution for all used estimation methods. This is expected from a statistical point of view, but to the best of our knowledge, it has not been systematically addressed in the Shapley value literature. This is crucial knowledge for Shapley values practitioners, who should be more careful in applying these observations' corresponding Shapley value explanations.
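For context, a small Monte Carlo sketch of Shapley value estimation for a single prediction; it uses the marginal (background-sample) imputation for features outside the coalition, whereas the paper studies conditional Shapley values, which replace exactly that step. The toy model and sample counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def shapley_sample(f, x_star, X_background, n_perm=500):
    """Permutation-sampling Shapley values for a single prediction f(x*).
    Features outside the coalition are filled in from background data
    (a marginal approximation; conditional variants change this step)."""
    d = len(x_star)
    phi = np.zeros(d)
    for _ in range(n_perm):
        perm = rng.permutation(d)
        z = X_background[rng.integers(len(X_background))].copy()
        prev = f(z)
        for j in perm:
            z[j] = x_star[j]          # add feature j to the coalition
            cur = f(z)
            phi[j] += cur - prev
            prev = cur
    return phi / n_perm

f = lambda x: 2 * x[0] + x[1] * x[2]          # toy model
X_bg = rng.normal(size=(200, 3))
print(shapley_sample(f, np.array([1.0, 2.0, 3.0]), X_bg))
```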

Search Strategies for Self-driving Laboratories with Pending Experiments

  • paper_url: http://arxiv.org/abs/2312.03466
  • repo_url: None
  • paper_authors: Hao Wen, Jakob Zeitler, Connor Rupnow
  • for: investigating asynchronous parallelization of experiments in self-driving laboratories (SDLs) and the effect of the delayed feedback it introduces.
  • methods: builds a ground-truth Bayesian optimisation simulator of a multi-stage SDL from 177 previously run experiments for maximizing the conductivity of functional coatings, and compares search strategies such as expected improvement, noisy expected improvement, 4-mode exploration, and random sampling.
  • results: strategy performance depends markedly on the amount of delay and the problem dimensionality, showcasing the trade-off between asynchronous parallel operation and delayed feedback.
    Abstract Self-driving laboratories (SDLs) consist of multiple stations that perform material synthesis and characterisation tasks. To minimize station downtime and maximize experimental throughput, it is practical to run experiments in asynchronous parallel, in which multiple experiments are being performed at once in different stages. Asynchronous parallelization of experiments, however, introduces delayed feedback (i.e. "pending experiments"), which is known to reduce Bayesian optimiser performance. Here, we build a simulator for a multi-stage SDL and compare optimisation strategies for dealing with delayed feedback and asynchronous parallelized operation. Using data from a real SDL, we build a ground truth Bayesian optimisation simulator from 177 previously run experiments for maximizing the conductivity of functional coatings. We then compare search strategies such as expected improvement, noisy expected improvement, 4-mode exploration and random sampling. We evaluate their performance in terms of amount of delay and problem dimensionality. Our simulation results showcase the trade-off between the asynchronous parallel operation and delayed feedback.
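One common way to handle pending experiments, sketched below with scikit-learn, is the "kriging believer" heuristic: fantasize outcomes for the pending points using the GP mean, refit, and maximize expected improvement. This illustrates the setting rather than the paper's specific strategies, and the objective, kernel, and pending points are made up:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x) + 0.1 * rng.normal()   # black-box objective

X = rng.uniform(0, 2, size=(5, 1))
y = np.array([f(x) for x in X]).ravel()
pending = np.array([[0.7], [1.4]])                  # experiments still running

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X, y)

# Kriging believer: pretend pending points returned the GP mean, then
# refit so the acquisition avoids re-suggesting those regions.
y_fantasy = gp.predict(pending)
gp.fit(np.vstack([X, pending]), np.concatenate([y, y_fantasy]))

def expected_improvement(Xc, gp, y_best):
    mu, sd = gp.predict(Xc, return_std=True)
    z = (y_best - mu) / np.maximum(sd, 1e-9)        # minimisation EI
    return (y_best - mu) * norm.cdf(z) + sd * norm.pdf(z)

cand = np.linspace(0, 2, 400).reshape(-1, 1)
next_x = cand[np.argmax(expected_improvement(cand, gp, y.min()))]
print("next experiment at", next_x)
```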

Subnetwork-to-go: Elastic Neural Network with Dynamic Training and Customizable Inference

  • paper_url: http://arxiv.org/abs/2312.03464
  • repo_url: None
  • paper_authors: Kai Li, Yi Luo
  • for: proposing a simple way to extract, at inference time, a subnetwork of arbitrary depth and width from a single large trained network under a model size or complexity constraint, without retraining from scratch.
  • methods: trains the large network with dynamic depth and width during the training phase; at inference, any selected subnetwork performs relatively better than the same subnetwork trained independently from scratch.
  • results: on a music source separation model, the method improves separation performance across different subnetwork sizes and complexities with a single large model, and training the large model takes significantly less time than training all the subnetworks separately.
    Abstract Deploying neural networks to different devices or platforms is in general challenging, especially when the model size is large or model complexity is high. Although there exist ways for model pruning or distillation, it is typically required to perform a full round of model training or finetuning procedure in order to obtain a smaller model that satisfies the model size or complexity constraints. Motivated by recent works on dynamic neural networks, we propose a simple way to train a large network and flexibly extract a subnetwork from it given a model size or complexity constraint during inference. We introduce a new way to allow a large model to be trained with dynamic depth and width during the training phase, and after the large model is trained we can select a subnetwork from it with arbitrary depth and width during the inference phase with a relatively better performance compared to training the subnetwork independently from scratch. Experiment results on a music source separation model show that our proposed method can effectively improve the separation performance across different subnetwork sizes and complexities with a single large model, and training the large model takes significantly shorter time than training all the different subnetworks.
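A minimal PyTorch sketch of training with randomly sampled depth and width so that sub-architectures can be sliced out at inference; the sampling ranges, masking scheme, and toy data are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn as nn

class ElasticMLP(nn.Module):
    def __init__(self, dim=64, max_depth=6):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(max_depth))
        self.head = nn.Linear(dim, 1)

    def forward(self, x, depth=None, width=None):
        depth = depth or len(self.layers)
        for layer in self.layers[:depth]:
            h = torch.relu(layer(x))
            if width is not None:            # zero out trailing channels
                h = torch.cat([h[:, :width], torch.zeros_like(h[:, width:])], dim=1)
            x = h
        return self.head(x)

model = ElasticMLP()
opt = torch.optim.Adam(model.parameters())
for xb, yb in [(torch.randn(32, 64), torch.randn(32, 1)) for _ in range(10)]:
    # Sample a random sub-architecture at each training step.
    d = int(torch.randint(1, len(model.layers) + 1, ()))
    w = int(torch.randint(16, 65, ()))
    loss = nn.functional.mse_loss(model(xb, depth=d, width=w), yb)
    opt.zero_grad(); loss.backward(); opt.step()

# Inference: pick any depth/width meeting the deployment budget.
small_out = model(torch.randn(4, 64), depth=2, width=32)
```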

Run LoRA Run: Faster and Lighter LoRA Implementations

  • paper_url: http://arxiv.org/abs/2312.03415
  • repo_url: None
  • paper_authors: Daria Cherniuk, Aleksandr Mikhalev, Ivan Oseledets
  • for: speeding up the training and fine-tuning of neural networks with low-rank adapters.
  • methods: the RunLoRA framework selects the best forward and backward computation graphs for LoRA operations based on FLOPs and time estimates, using the dimensions of the corresponding linear layer, the layer input, and the LoRA rank.
  • results: faster training and fine-tuning without sacrificing accuracy, with experiments showing up to 17% speedup on the Llama family of models.
    Abstract LoRA is a technique that reduces the number of trainable parameters in a neural network by introducing low-rank adapters to linear layers. This technique is used both for fine-tuning (LoRA, QLoRA) and full training (ReLoRA). This paper presents the RunLoRA framework for efficient implementations of LoRA that significantly improves the speed of neural network training and fine-tuning using low-rank adapters. The proposed implementation optimizes the computation of LoRA operations based on the dimensions of the corresponding linear layer, the layer input dimensions, and the LoRA rank, by choosing the best forward and backward computation graph based on FLOPs and time estimations, resulting in faster training without sacrificing accuracy. The experimental results show up to 17% speedup on the Llama family of models.
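To see why the computation order matters, here is a small numpy sketch that picks between two forward graphs for the LoRA update based on a FLOPs estimate. The dimensions are hypothetical, and RunLoRA also optimizes the backward pass and uses time measurements, which this sketch omits:

```python
import numpy as np

batch, d_in, d_out, r = 512, 4096, 4096, 16
x = np.random.randn(batch, d_in).astype(np.float32)
A = np.random.randn(r, d_in).astype(np.float32) * 0.01   # LoRA down-projection
B = np.random.randn(d_out, r).astype(np.float32) * 0.01  # LoRA up-projection

# Order 1: two skinny matmuls, (x A^T) B^T.
flops_1 = 2 * batch * d_in * r + 2 * batch * r * d_out
# Order 2: materialise the rank-r update W' = B A, then x W'^T.
flops_2 = 2 * d_out * r * d_in + 2 * batch * d_in * d_out

order = 1 if flops_1 <= flops_2 else 2
print(f"FLOPs order1={flops_1:.2e} order2={flops_2:.2e} -> pick order {order}")

# Both orders compute the same low-rank contribution added to Wx.
delta = (x @ A.T) @ B.T if order == 1 else x @ (B @ A).T
print(delta.shape)   # (batch, d_out)
```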

An AI for Scientific Discovery Route between Amorphous Networks and Mechanical Behavior

  • paper_url: http://arxiv.org/abs/2312.03404
  • repo_url: None
  • paper_authors: Changliang Zhu, Chenchao Fang, Zhipeng Jin, Baowen Li, Xiangying Shen, Lei Xu
  • for: exploring how artificial intelligence can help researchers uncover the physical mechanism behind a phenomenon and then use that mechanism to improve the efficiency of machine learning algorithms.
  • methods: uses the relationship between extreme Poisson's ratio values and the structure of amorphous networks as a case study, training a convolutional neural network on the dynamical matrix instead of on traditional image inputs.
  • results: recognizing that the Poisson's ratio depends on the low-frequency vibrational modes of the dynamical matrix makes the prediction far more efficient; the approach can serve other physical systems as well.
    Abstract "AI for science" is widely recognized as a future trend in the development of scientific research. Currently, although machine learning algorithms have played a crucial role in scientific research with numerous successful cases, relatively few instances exist where AI assists researchers in uncovering the underlying physical mechanisms behind a certain phenomenon and subsequently using that mechanism to improve machine learning algorithms' efficiency. This article uses the investigation into the relationship between extreme Poisson's ratio values and the structure of amorphous networks as a case study to illustrate how machine learning methods can assist in revealing underlying physical mechanisms. Upon recognizing that the Poisson's ratio relies on the low-frequency vibrational modes of dynamical matrix, we can then employ a convolutional neural network, trained on the dynamical matrix instead of traditional image recognition, to predict the Poisson's ratio of amorphous networks with a much higher efficiency. Through this example, we aim to showcase the role that artificial intelligence can play in revealing fundamental physical mechanisms, which subsequently improves the machine learning algorithms significantly.

An Infinite-Width Analysis on the Jacobian-Regularised Training of a Neural Network

  • paper_url: http://arxiv.org/abs/2312.03386
  • repo_url: None
  • paper_authors: Taeyoung Kim, Hongseok Yang
  • for: studying the initialisation, feature learning, and training of deep neural networks in their infinite-width limits, which yields practical techniques for finding appropriate hyperparameters, learning network weights, and performing inference.
  • methods: extends the infinite-width analysis to the Jacobian of a deep network, showing that a multilayer perceptron (MLP) and its Jacobian at initialisation jointly converge to a Gaussian process (GP) as the widths of the hidden layers go to infinity, and characterises this GP.
  • results: proves that in the infinite-width limit, the evolution of the MLP under so-called robust training (training with a regulariser on the Jacobian) is described by a linear first-order ordinary differential equation determined by a variant of the Neural Tangent Kernel; experiments confirm the relevance of these claims to wide finite networks and analyse the properties of the kernel regression solution to gain insight into Jacobian regularisation.
    Abstract The recent theoretical analysis of deep neural networks in their infinite-width limits has deepened our understanding of initialisation, feature learning, and training of those networks, and brought new practical techniques for finding appropriate hyperparameters, learning network weights, and performing inference. In this paper, we broaden this line of research by showing that this infinite-width analysis can be extended to the Jacobian of a deep neural network. We show that a multilayer perceptron (MLP) and its Jacobian at initialisation jointly converge to a Gaussian process (GP) as the widths of the MLP's hidden layers go to infinity and characterise this GP. We also prove that in the infinite-width limit, the evolution of the MLP under the so-called robust training (i.e., training with a regulariser on the Jacobian) is described by a linear first-order ordinary differential equation that is determined by a variant of the Neural Tangent Kernel. We experimentally show the relevance of our theoretical claims to wide finite networks, and empirically analyse the properties of kernel regression solution to obtain an insight into Jacobian regularisation.
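As a rough illustration of Jacobian regularisation (the "robust training" the paper analyses), the PyTorch sketch below penalises a Hutchinson-style random-projection estimate proportional to the squared Frobenius norm of the input-output Jacobian; the architecture, penalty weight, and data are arbitrary assumptions:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 3))
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

def jacobian_penalty(x):
    """Stochastic estimate proportional to ||J_f(x)||_F^2 using one
    random output direction v per input (Hutchinson-style)."""
    x = x.clone().requires_grad_(True)
    out = net(x)
    v = torch.randn_like(out)
    v = v / v.norm(dim=1, keepdim=True)
    jv, = torch.autograd.grad(out, x, grad_outputs=v, create_graph=True)
    return (jv ** 2).sum(dim=1).mean()

for _ in range(100):
    xb = torch.randn(32, 10)
    yb = torch.randint(0, 3, (32,))
    loss = nn.functional.cross_entropy(net(xb), yb) + 0.1 * jacobian_penalty(xb)
    opt.zero_grad(); loss.backward(); opt.step()
```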

On the variants of SVM methods applied to GPR data to classify tack coat characteristics in French pavements: two experimental case studies

  • paper_url: http://arxiv.org/abs/2312.03351
  • repo_url: None
  • paper_authors: Grégory Andreoli, Amine Ihamouten, Mai Lan Nguyen, Yannick Fargier, Cyrille Fauchard, Jean-Michel Simonin, Viktoriia Buliuk, David Souriou, Xavier Dérobert
  • for: assessing pavement conditions in France with Ground Penetrating Radar (GPR), one of the most widely adopted non-destructive techniques; conventional radar systems and their forward processing methods are limited for the physical and geometrical characterization of very thin layers such as tack coats.
  • methods: applies an inverse approach based on machine learning, already validated on numerical data in previous works, using SVM/SVR methods to classify and estimate the emulsion proportioning in tack coats.
  • results: on the pavement fatigue carousel at Gustave Eiffel University (Nantes, France) and on a new real road in the Vendée department (France), the SVM/SVR methods demonstrated the efficiency of supervised learning for classifying and estimating the emulsion proportioning in tack coats.
    Abstract Among the commonly used non-destructive techniques, the Ground Penetrating Radar (GPR) is one of the most widely adopted today for assessing pavement conditions in France. However, conventional radar systems and their forward processing methods have shown their limitations for the physical and geometrical characterization of very thin layers such as tack coats. However, the use of Machine Learning methods applied to GPR with an inverse approach showed that it was numerically possible to identify the tack coat characteristics despite masking effects due to low timefrequency resolution noted in the raw B-scans. Thus, we propose in this paper to apply the inverse approach based on Machine Learning, already validated in previous works on numerical data, on two experimental cases with different pavement structures. The first case corresponds to a validation on known pavement structures on the Gustave Eiffel University (Nantes, France) with its pavement fatigue carousel and the second case focuses on a new real road in Vendée department (France). In both case studies, the performances of SVM/SVR methods showed the efficiency of supervised learning methods to classify and estimate the emulsion proportioning in the tack coats.
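A minimal scikit-learn sketch of the SVM/SVR pipeline on stand-in features: both the features (placeholders for quantities extracted from GPR B-scans) and the labels are synthetic here, so only the modelling pattern, not the results, carries over:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)
# Stand-in features extracted from GPR B-scans (e.g. amplitudes, delays);
# labels: tack-coat class and emulsion proportioning rate (both synthetic).
X = rng.normal(size=(300, 12))
y_class = rng.integers(0, 3, size=300)                 # tack-coat type
y_rate = X[:, 0] * 0.3 + rng.normal(0, 0.05, 300)      # dosage

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
reg = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.01))
clf.fit(X[:250], y_class[:250])
reg.fit(X[:250], y_rate[:250])
print("class acc :", (clf.predict(X[250:]) == y_class[250:]).mean())
print("rate  RMSE:", np.sqrt(((reg.predict(X[250:]) - y_rate[250:]) ** 2).mean()))
```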

Predicting the Transportation Activities of Construction Waste Hauling Trucks: An Input-Output Hidden Markov Approach

  • paper_url: http://arxiv.org/abs/2312.03780
  • repo_url: None
  • paper_authors: Hongtai Yang, Boyi Lei, Ke Han, Luna Liu
  • for: predicting the destinations and dwell times of construction waste hauling trucks (CWHTs) to support effective environmental management.
  • methods: proposes a prediction method based on an interpretable activity-based model, the input-output hidden Markov model (IOHMM), with contextual factors included to improve prediction power, validated on 300 CWHTs in Chengdu, China.
  • results: the IOHMM outperforms baseline models including Markov chains, linear regression, and long short-term memory; linear regression models are further used to explore the factors influencing the predictability of CWHTs' transportation activities.
    Abstract Construction waste hauling trucks (CWHTs), as one of the most commonly seen heavy-duty vehicles in major cities around the globe, are usually subject to a series of regulations and spatial-temporal access restrictions because they not only produce significant NOx and PM emissions but also causes on-road fugitive dust. The timely and accurate prediction of CWHTs' destinations and dwell times play a key role in effective environmental management. To address this challenge, we propose a prediction method based on an interpretable activity-based model, input-output hidden Markov model (IOHMM), and validate it on 300 CWHTs in Chengdu, China. Contextual factors are considered in the model to improve its prediction power. Results show that the IOHMM outperforms several baseline models, including Markov chains, linear regression, and long short-term memory. Factors influencing the predictability of CWHTs' transportation activities are also explored using linear regression models. Results suggest the proposed model holds promise in assisting authorities by predicting the upcoming transportation activities of CWHTs and administering intervention in a timely and effective manner.

Interpretable Mechanistic Representations for Meal-level Glycemic Control in the Wild

  • paper_url: http://arxiv.org/abs/2312.03344
  • repo_url: https://github.com/keawang/interpretable-cgm-representations
  • paper_authors: Ke Alexander Wang, Emily B. Fox
  • for: This paper aims to learn interpretable representations of continuous glucose monitoring (CGM) and meal data to capture the complexity of glycemic control in individuals with type-2 diabetes and pre-diabetes.
  • methods: The proposed method uses a hybrid variational autoencoder to learn embeddings that reflect physiological quantities such as insulin sensitivity, glucose effectiveness, and basal glucose levels. The method also introduces a novel method to infer the glucose appearance rate, making the mechanistic model robust to unreliable meal logs.
  • results: The proposed method discovers a separation between individuals proportional to their disease severity and produces clusters that are up to 4x better than other features. The embeddings provide a nuanced, yet interpretable, embedding space to compare glycemic control within and across individuals, directly learnable from in-the-wild data.
    Abstract Diabetes encompasses a complex landscape of glycemic control that varies widely among individuals. However, current methods do not faithfully capture this variability at the meal level. On the one hand, expert-crafted features lack the flexibility of data-driven methods; on the other hand, learned representations tend to be uninterpretable which hampers clinical adoption. In this paper, we propose a hybrid variational autoencoder to learn interpretable representations of CGM and meal data. Our method grounds the latent space to the inputs of a mechanistic differential equation, producing embeddings that reflect physiological quantities, such as insulin sensitivity, glucose effectiveness, and basal glucose levels. Moreover, we introduce a novel method to infer the glucose appearance rate, making the mechanistic model robust to unreliable meal logs. On a dataset of CGM and self-reported meals from individuals with type-2 diabetes and pre-diabetes, our unsupervised representation discovers a separation between individuals proportional to their disease severity. Our embeddings produce clusters that are up to 4x better than naive, expert, black-box, and pure mechanistic features. Our method provides a nuanced, yet interpretable, embedding space to compare glycemic control within and across individuals, directly learnable from in-the-wild data.

Deep Learning for Koopman-based Dynamic Movement Primitives

  • paper_url: http://arxiv.org/abs/2312.03328
  • repo_url: None
  • paper_authors: Tyler Han, Carl Glen Henshaw
  • for: teaching robots dexterous manipulation, dynamic locomotion, or whole-body manipulation from a small number of demonstrations.
  • methods: joins the theories of Koopman operators and Dynamic Movement Primitives for Learning from Demonstration; the approach, named ADMD, uses an autoencoder to project nonlinear dynamical systems into linear latent spaces in which a solution reproduces the desired complex motion.
  • results: on the LASA Handwriting dataset, results are comparable to Extended Dynamic Mode Decomposition while training on only a small fraction of the letters.
    Abstract The challenge of teaching robots to perform dexterous manipulation, dynamic locomotion, or whole-body manipulation from a small number of demonstrations is an important research field that has attracted interest from across the robotics community. In this work, we propose a novel approach by joining the theories of Koopman Operators and Dynamic Movement Primitives to Learning from Demonstration. Our approach, named ADMD, projects nonlinear dynamical systems into linear latent spaces such that a solution reproduces the desired complex motion. Use of an autoencoder in our approach enables generalizability and scalability, while the constraint to a linear system attains interpretability. Our results are comparable to the Extended Dynamic Mode Decomposition on the LASA Handwriting dataset but with training on only a small fractions of the letters.
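The linear-latent-space idea can be sketched with a plain least-squares (DMD-style) estimate of a Koopman operator; here hand-picked observables stand in for the autoencoder the paper learns, and the trajectory is a toy example:

```python
import numpy as np

t = np.linspace(0, 10, 200)
traj = np.stack([np.cos(t), np.sin(t) * np.cos(t)])     # toy 2-D trajectory

def lift(X):
    # Hand-picked observables; ADMD learns these with an autoencoder.
    return np.vstack([X, X[0] * X[1], X[0] ** 2])

Z = lift(traj)
X0, X1 = Z[:, :-1], Z[:, 1:]
K = X1 @ np.linalg.pinv(X0)       # least-squares Koopman operator estimate

# Roll the linear model forward from the first lifted state.
z = Z[:, :1]
pred = [z]
for _ in range(50):
    z = K @ z
    pred.append(z)
print(np.hstack(pred).shape)       # (n_observables, 51)
```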

On the Nystrom Approximation for Preconditioning in Kernel Machines

  • paper_url: http://arxiv.org/abs/2312.03311
  • repo_url: None
  • paper_authors: Amirhesam Abedsoltan, Mikhail Belkin, Parthe Pandit, Luis Rademacher
  • for: analyzing the trade-offs of using a Nystrom approximation of the spectral preconditioner to speed up iterative training of kernel machines.
  • methods: studies Nystrom-based approximated preconditioners, which are cheaper to compute and store than the exact spectral preconditioner.
  • results: a sample of logarithmic size (as a function of the dataset size) enables the Nystrom-based preconditioner to accelerate gradient descent nearly as well as the exact preconditioner, while reducing the computational and storage overheads.
    Abstract Kernel methods are a popular class of nonlinear predictive models in machine learning. Scalable algorithms for learning kernel models need to be iterative in nature, but convergence can be slow due to poor conditioning. Spectral preconditioning is an important tool to speed-up the convergence of such iterative algorithms for training kernel models. However computing and storing a spectral preconditioner can be expensive which can lead to large computational and storage overheads, precluding the application of kernel methods to problems with large datasets. A Nystrom approximation of the spectral preconditioner is often cheaper to compute and store, and has demonstrated success in practical applications. In this paper we analyze the trade-offs of using such an approximated preconditioner. Specifically, we show that a sample of logarithmic size (as a function of the size of the dataset) enables the Nystrom-based approximated preconditioner to accelerate gradient descent nearly as well as the exact preconditioner, while also reducing the computational and storage overheads.
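A rough numpy/scipy sketch of a Nystrom-preconditioned conjugate-gradient solve for kernel ridge regression: the preconditioner is built from m landmark columns and applied cheaply via the Woodbury identity. The kernel, landmark count, and regularisation are illustrative choices, and the paper analyses gradient descent rather than CG:

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(0)
n, m, lam = 1500, 100, 1e-2
X = rng.normal(size=(n, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

def rbf(A, B, gamma=0.5):
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

K = rbf(X, X)

# Nystrom factor from m random landmark columns: K is approximated by U @ U.T.
idx = rng.choice(n, size=m, replace=False)
C, W = K[:, idx], K[np.ix_(idx, idx)]
evals, evecs = np.linalg.eigh(W + 1e-8 * np.eye(m))
U = C @ evecs / np.sqrt(np.maximum(evals, 1e-12))

# Preconditioner (U U^T + lam I)^{-1}, applied via the Woodbury identity.
M_small = lam * np.eye(m) + U.T @ U
def apply_precond(v):
    return (v - U @ np.linalg.solve(M_small, U.T @ v)) / lam

M = LinearOperator((n, n), matvec=apply_precond)
alpha, info = cg(K + lam * np.eye(n), y, M=M, maxiter=200)
print("converged:", info == 0)
```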

Balanced Marginal and Joint Distributional Learning via Mixture Cramer-Wold Distance

  • paper_url: http://arxiv.org/abs/2312.03307
  • repo_url: None
  • paper_authors: Seunghwan An, Sungchul Hong, Jong-June Jeon
  • for: proposing a new measure of discrepancy between two high-dimensional probability distributions for training generative models.
  • methods: introduces the mixture Cramer-Wold distance, which incorporates a mixture measure with point masses on standard basis vectors so that marginal and joint distributional information are captured simultaneously.
  • results: the proposed CWDAE (Cramer-Wold Distributional AutoEncoder) shows remarkable performance in generating synthetic data from real tabular datasets and offers the flexibility to adjust the level of data privacy with ease.
    Abstract In the process of training a generative model, it becomes essential to measure the discrepancy between two high-dimensional probability distributions: the generative distribution and the ground-truth distribution of the observed dataset. Recently, there has been growing interest in an approach that involves slicing high-dimensional distributions, with the Cramer-Wold distance emerging as a promising method. However, we have identified that the Cramer-Wold distance primarily focuses on joint distributional learning, whereas understanding marginal distributional patterns is crucial for effective synthetic data generation. In this paper, we introduce a novel measure of dissimilarity, the mixture Cramer-Wold distance. This measure enables us to capture both marginal and joint distributional information simultaneously, as it incorporates a mixture measure with point masses on standard basis vectors. Building upon the mixture Cramer-Wold distance, we propose a new generative model called CWDAE (Cramer-Wold Distributional AutoEncoder), which shows remarkable performance in generating synthetic data when applied to real tabular datasets. Furthermore, our model offers the flexibility to adjust the level of data privacy with ease.
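To illustrate the flavour of mixing marginal and joint slices, here is a small numpy sketch of a sliced Cramér-type distance that averages 1-D CDF distances over standard basis vectors (marginals) and random directions (joint structure). Note this is an analogy only: the paper's mixture Cramer-Wold distance is defined via smoothed projections, and the weighting here is made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def cramer_1d(a, b):
    """Squared-CDF (Cramér-type) distance between two 1-D samples."""
    grid = np.sort(np.concatenate([a, b]))
    Fa = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    Fb = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.sum((Fa - Fb)[:-1] ** 2 * np.diff(grid))

def mixture_sliced_distance(X, Y, n_dirs=64, axis_weight=0.5):
    """Average 1-D distance over a mixture of standard basis vectors
    (marginal information) and random directions (joint information)."""
    d = X.shape[1]
    total, n = 0.0, 0
    for e in np.eye(d):                          # marginal slices
        total += axis_weight * cramer_1d(X @ e, Y @ e); n += 1
    for _ in range(n_dirs):                      # joint slices
        v = rng.normal(size=d); v /= np.linalg.norm(v)
        total += (1 - axis_weight) * cramer_1d(X @ v, Y @ v); n += 1
    return total / n

X = rng.normal(size=(500, 4))
Y = rng.normal(loc=0.5, size=(500, 4))
print(mixture_sliced_distance(X, Y))   # larger than for two X-like samples
```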

Enhancing Molecular Property Prediction via Mixture of Collaborative Experts

  • paper_url: http://arxiv.org/abs/2312.03292
  • repo_url: https://github.com/Hyacinth-YX/mixture-of-collaborative-experts
  • paper_authors: Xu Yao, Shuang Liang, Songqiao Han, Hailiang Huang
  • for: addressing data scarcity and imbalance in molecular property prediction (MPP), which predicts biochemical properties from molecular features such as graph structures and contributes to the discovery of lead compounds in drug development; Graph Neural Networks (GNN) are used as an encoder to extract commonalities from molecular graphs.
  • methods: uses a Mixture of Collaborative Experts (MoCE) as the predictor to exploit task commonalities while confronting homogeneity in the expert pool and decision dominance within the expert group; an Expert-Specific Projection assigns each expert a unique projection perspective to enhance diversity, and an Expert-Specific Loss integrates individual expert losses into the weighted decision loss of the group for more equitable training.
  • results: the GNN-MoCE architecture outperforms traditional methods on 24 MPP datasets, especially on tasks with limited data or high imbalance.
    Abstract Molecular Property Prediction (MPP) task involves predicting biochemical properties based on molecular features, such as molecular graph structures, contributing to the discovery of lead compounds in drug development. To address data scarcity and imbalance in MPP, some studies have adopted Graph Neural Networks (GNN) as an encoder to extract commonalities from molecular graphs. However, these approaches often use a separate predictor for each task, neglecting the shared characteristics among predictors corresponding to different tasks. In response to this limitation, we introduce the GNN-MoCE architecture. It employs the Mixture of Collaborative Experts (MoCE) as predictors, exploiting task commonalities while confronting the homogeneity issue in the expert pool and the decision dominance dilemma within the expert group. To enhance expert diversity for collaboration among all experts, the Expert-Specific Projection method is proposed to assign a unique projection perspective to each expert. To balance decision-making influence for collaboration within the expert group, the Expert-Specific Loss is presented to integrate individual expert loss into the weighted decision loss of the group for more equitable training. Benefiting from the enhancements of MoCE in expert creation, dynamic expert group formation, and experts' collaboration, our model demonstrates superior performance over traditional methods on 24 MPP datasets, especially in tasks with limited data or high imbalance.

Anomaly Detection for Scalable Task Grouping in Reinforcement Learning-based RAN Optimization

  • paper_url: http://arxiv.org/abs/2312.03277
  • repo_url: None
  • paper_authors: Jimmy Li, Igor Kozlov, Di Wu, Xue Liu, Gregory Dudek
  • for: maintaining and optimizing cellular radio access networks (RAN) across a large number of cell sites with varying traffic patterns.
  • methods: uses learning-based methods to build a scalable reinforcement learning policy bank, applying anomaly detection techniques to assess the compatibility between cell sites (tasks) and the policies in the bank.
  • results: the framework intelligently identifies when a policy can be reused for a task and when a new policy must be trained, constructing a performant policy bank without exhaustively training on all tasks and thus making efficient use of computational resources.
    Abstract The use of learning-based methods for optimizing cellular radio access networks (RAN) has received increasing attention in recent years. This coincides with a rapid increase in the number of cell sites worldwide, driven largely by dramatic growth in cellular network traffic. Training and maintaining learned models that work well across a large number of cell sites has thus become a pertinent problem. This paper proposes a scalable framework for constructing a reinforcement learning policy bank that can perform RAN optimization across a large number of cell sites with varying traffic patterns. Central to our framework is a novel application of anomaly detection techniques to assess the compatibility between sites (tasks) and the policy bank. This allows our framework to intelligently identify when a policy can be reused for a task, and when a new policy needs to be trained and added to the policy bank. Our results show that our approach to compatibility assessment leads to an efficient use of computational resources, by allowing us to construct a performant policy bank without exhaustively training on all tasks, which makes it applicable under real-world constraints.

Low-Cost High-Power Membership Inference by Boosting Relativity

  • paper_url: http://arxiv.org/abs/2312.03262
  • repo_url: None
  • paper_authors: Sajjad Zarifzadeh, Philippe Liu, Reza Shokri
  • for: analyzing the privacy risks of machine learning algorithms through membership inference.
  • methods: a robust membership inference attack (RMIA) that amplifies the distinction between population data and training data by effectively leveraging both reference models and reference data in a likelihood ratio test.
  • results: superior test power (true-positive rate) compared to prior methods even at extremely low false-positive error rates (as low as 0); under computation constraints with only a limited number of reference models (as few as 1), the attack still performs exceptionally well, unlike some prior attacks that approach random guessing in such scenarios.
    Abstract We present a robust membership inference attack (RMIA) that amplifies the distinction between population data and the training data on any target model, by effectively leveraging both reference models and reference data in our likelihood ratio test. Our algorithm exhibits superior test power (true-positive rate) when compared to prior methods, even at extremely low false-positive error rates (as low as 0). Also, under computation constraints, where only a limited number of reference models (as few as 1) are available, our method performs exceptionally well, unlike some prior attacks that approach random guessing in such scenarios. Our method lays the groundwork for cost-effective and practical yet powerful and robust privacy risk analysis of machine learning algorithms.

f-FERM: A Scalable Framework for Robust Fair Empirical Risk Minimization

  • paper_url: http://arxiv.org/abs/2312.03259
  • repo_url: https://github.com/optimization-for-data-driven-science/f-ferm
  • paper_authors: Sina Baharlouei, Shivam Patel, Meisam Razaviyayn
  • for: proposing a scalable stochastic optimization framework for fair empirical risk minimization, so that machine learning models satisfy fairness criteria for protected groups.
  • methods: f-FERM formulates fair empirical risk minimization with f-divergence measures, yielding a unified stochastic algorithm with theoretical convergence guarantees; the framework is further extended to distribution shift via a distributionally robust reformulation of the f-FERM objective under $L_p$-norm uncertainty sets.
  • results: f-FERM offers superior fairness-accuracy tradeoffs for almost all batch sizes (ranging from full batch to batch size one), and in the distributionally robust setting it outperforms other baselines from the literature on tasks involving distribution shifts.
    Abstract Training and deploying machine learning models that meet fairness criteria for protected groups are fundamental in modern artificial intelligence. While numerous constraints and regularization terms have been proposed in the literature to promote fairness in machine learning tasks, most of these methods are not amenable to stochastic optimization due to the complex and nonlinear structure of constraints and regularizers. Here, the term "stochastic" refers to the ability of the algorithm to work with small mini-batches of data. Motivated by the limitation of existing literature, this paper presents a unified stochastic optimization framework for fair empirical risk minimization based on f-divergence measures (f-FERM). The proposed stochastic algorithm enjoys theoretical convergence guarantees. In addition, our experiments demonstrate the superiority of fairness-accuracy tradeoffs offered by f-FERM for almost all batch sizes (ranging from full-batch to batch size of one). Moreover, we show that our framework can be extended to the case where there is a distribution shift from training to the test data. Our extension is based on a distributionally robust optimization reformulation of f-FERM objective under $L_p$ norms as uncertainty sets. Again, in this distributionally robust setting, f-FERM not only enjoys theoretical convergence guarantees but also outperforms other baselines in the literature in the tasks involving distribution shifts. An efficient stochastic implementation of $f$-FERM is publicly available.

CAFE: Towards Compact, Adaptive, and Fast Embedding for Large-scale Recommendation Models

  • paper_url: http://arxiv.org/abs/2312.03256
  • repo_url: https://github.com/hugozhl/cafe
  • paper_authors: Hailin Zhang, Zirui Liu, Boxuan Chen, Yikai Zhao, Tong Zhao, Tong Yang, Bin Cui
  • for: addressing the growing memory demands of embedding tables in deep learning recommendation models (DLRMs) with a compression framework that is simultaneously memory-efficient, low-latency, and adaptive to dynamic data distributions.
  • methods: the CAFE framework uses HotSketch, a fast and lightweight sketch data structure, to capture feature importance and report hot features in real time; each hot feature is assigned a unique embedding, while non-hot features share embeddings via a hash embedding technique, and a multi-level hash embedding framework further optimizes the embedding tables of non-hot features.
  • results: CAFE significantly outperforms existing embedding compression methods, achieving 3.92% and 3.68% higher testing AUC on the Criteo Kaggle and CriteoTB datasets at a compression ratio of 10000x.
    Abstract Recently, the growing memory demands of embedding tables in Deep Learning Recommendation Models (DLRMs) pose great challenges for model training and deployment. Existing embedding compression solutions cannot simultaneously meet three key design requirements: memory efficiency, low latency, and adaptability to dynamic data distribution. This paper presents CAFE, a Compact, Adaptive, and Fast Embedding compression framework that addresses the above requirements. The design philosophy of CAFE is to dynamically allocate more memory resources to important features (called hot features), and allocate less memory to unimportant ones. In CAFE, we propose a fast and lightweight sketch data structure, named HotSketch, to capture feature importance and report hot features in real time. For each reported hot feature, we assign it a unique embedding. For the non-hot features, we allow multiple features to share one embedding by using hash embedding technique. Guided by our design philosophy, we further propose a multi-level hash embedding framework to optimize the embedding tables of non-hot features. We theoretically analyze the accuracy of HotSketch, and analyze the model convergence against deviation. Extensive experiments show that CAFE significantly outperforms existing embedding compression methods, yielding 3.92% and 3.68% superior testing AUC on Criteo Kaggle dataset and CriteoTB dataset at a compression ratio of 10000x. The source codes of CAFE are available at GitHub.
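A toy PyTorch sketch of the hot/shared split: frequent ids get dedicated embedding rows while the long tail shares hashed rows. An exact Counter stands in for the approximate HotSketch, and the thresholds, table sizes, and ids are hypothetical:

```python
import torch
import torch.nn as nn
from collections import Counter

class SplitEmbedding(nn.Module):
    """Illustrative split embedding: "hot" ids get dedicated rows; the
    long tail shares rows via hashing (CAFE additionally uses multi-level
    hashing and detects hot ids with the approximate HotSketch)."""
    def __init__(self, hot_ids, n_shared=1000, dim=16):
        super().__init__()
        self.hot_index = {fid: i for i, fid in enumerate(hot_ids)}
        self.hot = nn.Embedding(len(hot_ids), dim)
        self.shared = nn.Embedding(n_shared, dim)
        self.n_shared = n_shared

    def forward(self, ids):
        out = []
        for fid in ids.tolist():
            if fid in self.hot_index:
                out.append(self.hot(torch.tensor(self.hot_index[fid])))
            else:
                out.append(self.shared(torch.tensor(hash(fid) % self.n_shared)))
        return torch.stack(out)

stream = [1, 1, 1, 2, 7, 7, 9, 42, 1, 7]            # feature-id stream
hot_ids = [fid for fid, c in Counter(stream).items() if c >= 3]
emb = SplitEmbedding(hot_ids)
print(emb(torch.tensor([1, 42])).shape)             # torch.Size([2, 16])
```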

Seller-side Outcome Fairness in Online Marketplaces

  • paper_url: http://arxiv.org/abs/2312.03253
  • repo_url: None
  • paper_authors: Zikun Ye, Reza Yousefi Maragheh, Lalitesh Morishetti, Shanu Vashishtha, Jason Cho, Kaushiki Nag, Sushant Kumar, Kannan Achan
  • for: investigate and achieve seller-side fairness within online marketplaces
  • methods: introduce the notion of seller-side outcome fairness and build an optimization model based on duality and bandit theory
  • results: lift seller fairness measures without hurting metrics like collected Gross Merchandise Value (GMV) and total purchases.
    Abstract This paper aims to investigate and achieve seller-side fairness within online marketplaces, where many sellers and their items are not sufficiently exposed to customers in an e-commerce platform. This phenomenon raises concerns regarding the potential loss of revenue associated with less exposed items as well as less marketplace diversity. We introduce the notion of seller-side outcome fairness and build an optimization model to balance collected recommendation rewards and the fairness metric. We then propose a gradient-based data-driven algorithm based on the duality and bandit theory. Our numerical experiments on real e-commerce data sets show that our algorithm can lift seller fairness measures while not hurting metrics like collected Gross Merchandise Value (GMV) and total purchases.

Generalizable Neural Physics Solvers by Baldwinian Evolution

  • paper_url: http://arxiv.org/abs/2312.03243
  • repo_url: https://github.com/chiuph/baldwinian-pinn
  • paper_authors: Jian Cheng Wong, Chin Chun Ooi, Abhishek Gupta, Pao-Hsiung Chiu, Joshua Shao Zheng Low, My Ha Dao, Yew-Soon Ong
  • for: studying, for the first time, whether physics-informed neural networks (PINNs) can be discovered that generalize over an entire family of physics tasks, enabling fast and physics-compliant prediction.
  • methods: takes a biological perspective based on the Baldwin effect, inspired by the neurodevelopment of precocial species: evolutionary selection pressure (guided by proficiency over a family of tasks) is coupled with lifetime learning (to specialize on a smaller subset of those tasks), producing PINNs pre-wired with connection strengths that bias them towards efficient learning of physics.
  • results: the Baldwinian approach achieves an order of magnitude improvement in prediction accuracy at a fraction of the computation cost compared to state-of-the-art PINNs meta-learned by gradient descent.
    Abstract Physics-informed neural networks (PINNs) are at the forefront of scientific machine learning, making possible the creation of machine intelligence that is cognizant of physical laws and able to accurately simulate them. In this paper, the potential of discovering PINNs that generalize over an entire family of physics tasks is studied, for the first time, through a biological lens of the Baldwin effect. Drawing inspiration from the neurodevelopment of precocial species that have evolved to learn, predict and react quickly to their environment, we envision PINNs that are pre-wired with connection strengths inducing strong biases towards efficient learning of physics. To this end, evolutionary selection pressure (guided by proficiency over a family of tasks) is coupled with lifetime learning (to specialize on a smaller subset of those tasks) to produce PINNs that demonstrate fast and physics-compliant prediction capabilities across a range of empirically challenging problem instances. The Baldwinian approach achieves an order of magnitude improvement in prediction accuracy at a fraction of the computation cost compared to state-of-the-art results with PINNs meta-learned by gradient descent. This paper marks a leap forward in the meta-learning of PINNs as generalizable physics solvers.

Accelerated Gradient Algorithms with Adaptive Subspace Search for Instance-Faster Optimization

  • paper_url: http://arxiv.org/abs/2312.03218
  • repo_url: None
  • paper_authors: Yuanshi Liu, Hanzhen Zhao, Yang Xu, Pengyun Yue, Cong Fang
  • for: rethinking the design and analysis of gradient-based optimization algorithms, with direct applications in machine learning, so that they adapt to the explicit complexity of a particular objective and achieve faster rates on simpler problems.
  • methods: introduces two factors $(\alpha, \tau_{\alpha})$ to refine the description of the degenerated condition of optimization problems, based on the observation that the singular values of the Hessian often drop sharply, and designs adaptive algorithms that solve simpler problems without pre-known knowledge using reduced gradient or analogous oracle accesses.
  • results: improves the state-of-the-art complexities for several problems in machine learning; in particular, with an $\mathcal{O}(1)$-bounded nuclear norm, it achieves an optimal $\tilde{\mathcal{O}}(\mu^{-1/3})$ (vs. $\tilde{\mathcal{O}}(\mu^{-1/2})$) gradient complexity for linear regression.
    Abstract Gradient-based minimax optimal algorithms have greatly promoted the development of continuous optimization and machine learning. One seminal work due to Yurii Nesterov [Nes83a] established $\tilde{\mathcal{O}}(\sqrt{L/\mu})$ gradient complexity for minimizing an $L$-smooth $\mu$-strongly convex objective. However, an ideal algorithm would adapt to the explicit complexity of a particular objective function and incur faster rates for simpler problems, triggering our reconsideration of two defeats of existing optimization modeling and analysis. (i) The worst-case optimality is neither the instance optimality nor such one in reality. (ii) Traditional $L$-smoothness condition may not be the primary abstraction/characterization for modern practical problems. In this paper, we open up a new way to design and analyze gradient-based algorithms with direct applications in machine learning, including linear regression and beyond. We introduce two factors $(\alpha, \tau_{\alpha})$ to refine the description of the degenerated condition of the optimization problems based on the observation that the singular values of Hessian often drop sharply. We design adaptive algorithms that solve simpler problems without pre-known knowledge with reduced gradient or analogous oracle accesses. The algorithms also improve the state-of-art complexities for several problems in machine learning, thereby solving the open problem of how to design faster algorithms in light of the known complexity lower bounds. Specially, with the $\mathcal{O}(1)$-nuclear norm bounded, we achieve an optimal $\tilde{\mathcal{O}}(\mu^{-1/3})$ (vs. $\tilde{\mathcal{O}}(\mu^{-1/2})$) gradient complexity for linear regression. We hope this work could invoke the rethinking for understanding the difficulty of modern problems in optimization.

Bootstrap Your Own Variance

  • paper_url: http://arxiv.org/abs/2312.03213
  • repo_url: https://github.com/Stathiskan/HTML-CSS-Bootstrap-Framework-Razor
  • paper_authors: Polina Turishcheva, Jason Ramapuram, Sinead Williamson, Dan Busbridge, Eeshan Dhekane, Russ Webb
  • for: improving the understanding of model uncertainty, which is important for many applications.
  • methods: Bootstrap Your Own Variance (BYOV) combines Bootstrap Your Own Latent (BYOL), a negative-free self-supervised learning (SSL) algorithm, with Bayes by Backprop (BBB), a Bayesian method for estimating model posteriors.
  • results: the learned predictive standard deviation of BYOV vs. a supervised BBB model is well captured by a Gaussian distribution, providing preliminary evidence that the learned parameter posterior is useful for label-free uncertainty estimation; BYOV improves upon the deterministic BYOL baseline (+2.83% test ECE, +1.03% test Brier) and shows better calibration and reliability under various augmentations (e.g., +2.4% test ECE, +1.2% test Brier for Salt & Pepper noise).
    Abstract Understanding model uncertainty is important for many applications. We propose Bootstrap Your Own Variance (BYOV), combining Bootstrap Your Own Latent (BYOL), a negative-free Self-Supervised Learning (SSL) algorithm, with Bayes by Backprop (BBB), a Bayesian method for estimating model posteriors. We find that the learned predictive std of BYOV vs. a supervised BBB model is well captured by a Gaussian distribution, providing preliminary evidence that the learned parameter posterior is useful for label-free uncertainty estimation. BYOV improves upon the deterministic BYOL baseline (+2.83% test ECE, +1.03% test Brier) and presents better calibration and reliability when tested with various augmentations (e.g., +2.4% test ECE, +1.2% test Brier for Salt & Pepper noise).
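    For readers unfamiliar with the calibration metrics quoted above, a minimal sketch (ours, not the authors' evaluation code) of ECE and the Brier score:
```python
import numpy as np

def ece(probs, labels, n_bins=15):
    """Expected calibration error with equal-width confidence bins.

    probs: (N, C) predicted class probabilities; labels: (N,) integer labels.
    """
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            err += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return err

def brier(probs, labels):
    """Multi-class Brier score: mean squared error against one-hot labels."""
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))
```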

Constrained Bayesian Optimization Under Partial Observations: Balanced Improvements and Provable Convergence

  • paper_url: http://arxiv.org/abs/2312.03212
  • repo_url: None
  • paper_authors: Shengbo Wang, Ke Li
  • for: solving expensive partially observable constrained optimization problems (POCOPs), where infeasible solutions provide little information about either the objective or the constraints.
  • methods: proposes an efficient and provable method with two key components: an improved acquisition-function design that introduces balanced exploration during optimization, and a Gaussian process embedding different likelihoods as the surrogate model for the partially observable constraint.
  • results: empirical studies on both synthetic and real-world problems demonstrate the competitiveness of the method for solving POCOPs.
    Abstract The partially observable constrained optimization problems (POCOPs) impede data-driven optimization techniques since an infeasible solution of POCOPs can provide little information about the objective as well as the constraints. We endeavor to design an efficient and provable method for expensive POCOPs under the framework of constrained Bayesian optimization. Our method consists of two key components. Firstly, we present an improved design of the acquisition functions that introduces balanced exploration during optimization. We rigorously study the convergence properties of this design to demonstrate its effectiveness. Secondly, we propose a Gaussian process embedding different likelihoods as the surrogate model for a partially observable constraint. This model leads to a more accurate representation of the feasible regions compared to traditional classification-based models. Our proposed method is empirically studied on both synthetic and real-world problems. The results demonstrate the competitiveness of our method for solving POCOPs.
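    As background, a sketch of a standard constrained-BO acquisition, expected improvement weighted by the probability of feasibility; the paper proposes its own, more balanced acquisition design and a different constraint surrogate:
```python
import numpy as np
from scipy.stats import norm

def constrained_ei(mu_f, sigma_f, mu_c, sigma_c, best_feasible):
    """Expected improvement times probability of feasibility (c(x) <= 0).

    mu_f, sigma_f: GP posterior mean/std of the objective at candidate points.
    mu_c, sigma_c: GP posterior mean/std of the constraint.
    best_feasible: best feasible objective value seen so far (minimization).
    """
    z = (best_feasible - mu_f) / np.maximum(sigma_f, 1e-12)
    ei = sigma_f * (z * norm.cdf(z) + norm.pdf(z))        # standard EI
    pof = norm.cdf(-mu_c / np.maximum(sigma_c, 1e-12))    # P[c(x) <= 0]
    return ei * pof
```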

Domain Invariant Representation Learning and Sleep Dynamics Modeling for Automatic Sleep Staging

  • paper_url: http://arxiv.org/abs/2312.03196
  • repo_url: https://github.com/yeon-lab/dream
  • paper_authors: Seungyeon Lee, Thai-Hoang Pham, Zhao Cheng, Ping Zhang
  • for: automatic sleep staging to diagnose and treat sleep disorders.
  • methods: a neural network-based model (DREAM) that learns domain-generalized representations from physiological signals and models sleep dynamics.
  • results: outperforms existing sleep staging methods on three datasets and provides prediction uncertainty to ensure reliability in real-world applications.
    Abstract Sleep staging has become a critical task in diagnosing and treating sleep disorders to prevent sleep-related diseases. With rapidly growing large-scale public sleep databases and advances in machine learning, significant progress has been made toward automatic sleep staging. However, previous studies face several critical problems: the heterogeneity of subjects' physiological signals, the inability to extract meaningful information from unlabeled sleep signal data to improve predictive performance, the difficulty in modeling correlations between sleep stages, and the lack of an effective mechanism to quantify predictive uncertainty. In this study, we propose a neural network-based automatic sleep staging model, named DREAM, to learn domain-generalized representations from physiological signals and to model sleep dynamics. DREAM learns sleep-related and subject-invariant representations from diverse subjects' sleep signal segments and models sleep dynamics by capturing interactions between sequential signal segments and between sleep stages. In the experiments, we demonstrate that DREAM outperforms the existing sleep staging methods on three datasets. The case study demonstrates that our model can learn the generalized decision function, resulting in good prediction performance for new subjects, especially when there are differences between testing and training subjects. Experiments with unlabeled data show the benefit of leveraging unlabeled EEG data. Further, uncertainty quantification demonstrates that DREAM provides prediction uncertainty, making the model reliable and helping sleep experts in real-world applications.

eess.IV - 2023-12-06

Bile Duct Segmentation Methods Under 3D Slicer Applied to ERCP: Advantages and Disadvantages

  • paper_url: http://arxiv.org/abs/2312.03356
  • repo_url: None
  • paper_authors: Abdelhadi Essamlali, Vincent Millot-Maysounabe, Marion Chartier, Grégoire Salin, Aymeric Becq, Lionel Arrivé, Marine Duboc Camus, Jérôme Szewczyk, Isabelle Claude
  • for: evaluates bile duct segmentation methods used for 3D reconstruction under 3D Slicer, which can be highly valuable in critical interventions such as endoscopic retrograde cholangiopancreatography (ERCP).
  • methods: assesses three segmentation methods, namely thresholding, flood filling, and region growing, in terms of their advantages and disadvantages.
  • results: thresholding is almost manual and time-consuming, flood filling is semi-automatic but also time-consuming, and neither is reproducible; an automatic region-growing method was therefore developed to reduce segmentation time, albeit at the expense of quality. These findings highlight the pros and cons of conventional segmentation methods and underscore the need for alternative approaches, such as deep learning, to optimize bile duct segmentation for ERCP.
    Abstract This article presents an evaluation of biliary tract segmentation methods used for 3D reconstruction with the 3D Slicer software, which can prove highly valuable in various critical interventions, such as endoscopic retrograde cholangiopancreatography (ERCP). Three different methods, namely thresholding, flood filling, and region growing, were assessed in terms of their advantages and disadvantages. The study involved 10 patient cases and employed quantitative indices and qualitative evaluation to assess the segmentations obtained by the different segmentation methods against ground truth. The results indicate that the thresholding method is almost manual and time-consuming, while the flood filling method is semi-automatic and also time-consuming. Although both methods improve segmentation quality, they are not reproducible. Therefore, an automatic method based on region growing was developed to reduce segmentation time, albeit at the expense of quality. These findings highlight the pros and cons of different conventional segmentation methods and underscore the need to explore alternative approaches, such as deep learning, to optimize biliary tract segmentation in the context of ERCP.
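    To make the third method concrete, a generic region-growing sketch (our illustration, not the authors' 3D Slicer implementation):
```python
import numpy as np
from collections import deque

def region_grow(volume, seed, tol):
    """Grow a 3D region from a seed voxel, adding 6-connected neighbors whose
    intensity stays within +/- tol of the seed intensity. Returns a boolean mask.
    """
    vol = volume.astype(float)
    mask = np.zeros(vol.shape, dtype=bool)
    ref = vol[seed]
    queue = deque([seed])
    mask[seed] = True
    neighbors = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    while queue:
        z, y, x = queue.popleft()
        for dz, dy, dx in neighbors:
            n = (z + dz, y + dy, x + dx)
            if (all(0 <= n[i] < vol.shape[i] for i in range(3))
                    and not mask[n] and abs(vol[n] - ref) <= tol):
                mask[n] = True
                queue.append(n)
    return mask
```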

eess.SP - 2023-12-06

Slepian Beamforming: Broadband Beamforming using Streaming Least Squares

  • paper_url: http://arxiv.org/abs/2312.03922
  • repo_url: None
  • paper_authors: Coleman DeLude, Mark A. Davenport, Justin Romberg
  • for: revisits the classical problem of estimating a signal as it impinges on a multi-sensor array, focusing on the broadband regime where the signal's bandwidth is appreciable; unlike traditional filter-and-sum or true-time-delay beamformers, the proposed approach requires neither filtering nor true time delay.
  • methods: fits a robust Slepian subspace model to blocks of samples taken directly from the sensor outputs using a least squares approach, then uses the model to estimate uniformly spaced samples of the impinging signal.
  • results: outperforms existing filter-based approaches while being comparable in computational complexity.
    Abstract In this paper we revisit the classical problem of estimating a signal as it impinges on a multi-sensor array. We focus on the case where the impinging signal's bandwidth is appreciable and is operating in a broadband regime. Estimating broadband signals, often termed broadband (or wideband) beamforming, is traditionally done through filter and summation, true time delay, or a coupling of the two. Our proposed method deviates substantially from these paradigms in that it requires no notion of filtering or true time delay. We use blocks of samples taken directly from the sensor outputs to fit a robust Slepian subspace model using a least squares approach. We then leverage this model to estimate uniformly spaced samples of the impinging signal. Alongside a careful discussion of this model and how to choose its parameters we show how to fit the model to new blocks of samples as they are received, producing a streaming output. We then go on to show how this method naturally extends to adaptive beamforming scenarios, where we leverage signal statistics to attenuate interfering sources. Finally, we discuss how to use our model to estimate from dimensionality reducing measurements. Accompanying these discussions are extensive numerical experiments establishing that our method outperforms existing filter based approaches while being comparable in terms of computational complexity.
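    A minimal sketch of the core block-fit idea, a least-squares projection of a sample block onto a Slepian (DPSS) subspace; the parameter values are illustrative, and the paper's streaming, multi-sensor formulation is more elaborate:
```python
import numpy as np
from scipy.signal.windows import dpss

def slepian_ls_fit(block, half_bandwidth_product=4.0, n_modes=7):
    """Least-squares fit of one sensor's sample block to a Slepian subspace.

    dpss() returns discrete prolate spheroidal sequences, the orthonormal
    basis that best captures band-limited signals over a finite block.
    Returns the subspace coefficients and the denoised reconstruction.
    """
    M = len(block)
    S = dpss(M, half_bandwidth_product, Kmax=n_modes)  # shape (n_modes, M)
    coeffs, *_ = np.linalg.lstsq(S.T, block, rcond=None)
    return coeffs, S.T @ coeffs
```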

Community Detection in High-Dimensional Graph Ensembles

  • paper_url: http://arxiv.org/abs/2312.03900
  • repo_url: None
  • paper_authors: Robert Malinas, Dogyoon Song, Alfred O. Hero III
  • for: detecting communities in high-dimensional graphs.
  • methods: models the graph's adjacency matrix with a (Degree-Corrected) Stochastic Block Model and derives a transformation of the adjacency matrix that eliminates the degree heterogeneity within communities while preserving the eigenstructure relevant for community detection.
  • results: proposes a test based on the extreme eigenvalues of the transformed matrix, with a method for controlling the significance level, a conjecture that the test achieves power one for all positive significance levels as the number of nodes approaches infinity, and empirical evidence and theory supporting these claims.
    Abstract Detecting communities in high-dimensional graphs can be achieved by applying random matrix theory where the adjacency matrix of the graph is modeled by a Stochastic Block Model (SBM). However, the SBM makes an unrealistic assumption that the edge probabilities are homogeneous within communities, i.e., the edges occur with the same probabilities. The Degree-Corrected SBM is a generalization of the SBM that allows these edge probabilities to be different, but existing results from random matrix theory are not directly applicable to this heterogeneous model. In this paper, we derive a transformation of the adjacency matrix that eliminates this heterogeneity and preserves the relevant eigenstructure for community detection. We propose a test based on the extreme eigenvalues of this transformed matrix and (1) provide a method for controlling the significance level, (2) formulate a conjecture that the test achieves power one for all positive significance levels in the limit as the number of nodes approaches infinity, and (3) provide empirical evidence and theory supporting these claims.
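    As a generic illustration of extreme-eigenvalue testing (not the paper's heterogeneity-removing transformation), a sketch calibrating the largest eigenvalue of a centered adjacency matrix against an Erdos-Renyi null by Monte Carlo:
```python
import numpy as np

def spectral_community_test(A, alpha=0.05, n_null=200, seed=0):
    """Reject the no-community null if the largest eigenvalue of the
    centered adjacency matrix exceeds the (1 - alpha) quantile of the same
    statistic under a matched Erdos-Renyi null, estimated by simulation.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    p_hat = A[np.triu_indices(n, 1)].mean()        # estimated edge density
    stat = np.linalg.eigvalsh(A - p_hat)[-1]       # largest centered eigenvalue
    null_stats = []
    for _ in range(n_null):
        B = (rng.random((n, n)) < p_hat).astype(float)
        B = np.triu(B, 1); B = B + B.T             # symmetric null graph
        null_stats.append(np.linalg.eigvalsh(B - p_hat)[-1])
    threshold = np.quantile(null_stats, 1 - alpha)
    return stat > threshold, stat, threshold
```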

  • paper_url: http://arxiv.org/abs/2312.03882
  • repo_url: None
  • paper_authors: Shayan Zargari, Azar Hakimi, Fatemeh Rezaei, Chintha Tellambura, Amine Maaref
  • for: provides an overview of signal detection for ambient backscatter communication (AmBC) networks, which is essential for IoT and wireless communication applications.
  • methods: discusses various detection methods for AmBC networks, including their advantages and drawbacks.
  • results: gives a comprehensive overview of the fundamentals, challenges, and ongoing research in signal detection for AmBC networks, making it a valuable resource for IoT and wireless communication professionals and researchers.
    Abstract Internet-of-Things (IoT) is rapidly growing in wireless technology, aiming to connect vast numbers of devices to gather and distribute vital information. Despite individual devices having low energy consumption, the cumulative demand results in significant energy usage. Consequently, the concept of ultra-low-power tags gains appeal. Such tags communicate by reflecting rather than generating the radio frequency (RF) signals by themselves. Thus, these backscatter tags can be low-cost and battery-free. The RF signals can be ambient sources such as wireless-fidelity (Wi-Fi), cellular, or television (TV) signals, or the system can generate them externally. Backscatter channel characteristics are different from conventional point-to-point or cooperative relay channels. These systems are also affected by a strong interference link between the RF source and the tag besides the direct and backscattering links, making signal detection challenging. This paper provides an overview of the fundamentals, challenges, and ongoing research in signal detection for AmBC networks. It delves into various detection methods, discussing their advantages and drawbacks. The paper's emphasis on signal detection sets it apart and positions it as a valuable resource for IoT and wireless communication professionals and researchers.

Enabling Edge Artificial Intelligence via Goal-oriented Deep Neural Network Splitting

  • paper_url: http://arxiv.org/abs/2312.03555
  • repo_url: None
  • paper_authors: Francesco Binucci, Mattia Merluzzi, Paolo Banelli, Emilio Calvanese Strinati, Paolo Di Lorenzo
  • for: explores deep neural network (DNN) splitting at the edge of 6G wireless networks to enable low-energy cooperative inference under target delay and accuracy, from a goal-oriented perspective.
  • methods: proposes an algorithm that dynamically controls splitting point (SP) selection, local computing resources, uplink transmit power, and bandwidth allocation in a goal-oriented fashion to meet a target goal-effectiveness.
  • results: numerical results show that the proposed SP selection and resource allocation enable energy-frugal and effective edge AI.
    Abstract Deep Neural Network (DNN) splitting is one of the key enablers of edge Artificial Intelligence (AI), as it allows end users to pre-process data and offload part of the computational burden to nearby Edge Cloud Servers (ECSs). This opens new opportunities and degrees of freedom in balancing energy consumption, delay, accuracy, privacy, and other trustworthiness metrics. In this work, we explore the opportunity of DNN splitting at the edge of 6G wireless networks to enable low energy cooperative inference with target delay and accuracy with a goal-oriented perspective. Going beyond the current literature, we explore new trade-offs that take into account the accuracy degradation as a function of the Splitting Point (SP) selection and wireless channel conditions. Then, we propose an algorithm that dynamically controls SP selection, local computing resources, uplink transmit power and bandwidth allocation, in a goal-oriented fashion, to meet a target goal-effectiveness. To the best of our knowledge, this is the first work proposing adaptive SP selection on the basis of all learning performance (i.e., energy, delay, accuracy), with the aim of guaranteeing the accomplishment of a goal (e.g., minimize the energy consumption under latency and accuracy constraints). Numerical results show the advantages of the proposed SP selection and resource allocation, to enable energy frugal and effective edge AI.
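    A toy sketch of the goal-oriented selection step, reduced to a static search over profiled splitting points; the field names and numbers are hypothetical, and the paper's algorithm adapts these decisions dynamically with channel state and resource allocation:
```python
def select_split_point(candidates, delay_max, acc_min):
    """Pick the splitting point that minimizes energy subject to delay and
    accuracy targets, by exhaustive search over profiled candidates.

    candidates: list of dicts with hypothetical profiled fields
    {'sp': layer index, 'energy': J, 'delay': s, 'accuracy': fraction}.
    """
    feasible = [c for c in candidates
                if c['delay'] <= delay_max and c['accuracy'] >= acc_min]
    if not feasible:
        raise ValueError("no splitting point meets the delay/accuracy goal")
    return min(feasible, key=lambda c: c['energy'])

# Usage with made-up profiling numbers:
profile = [
    {'sp': 1, 'energy': 0.9, 'delay': 0.030, 'accuracy': 0.92},
    {'sp': 4, 'energy': 0.6, 'delay': 0.045, 'accuracy': 0.90},
    {'sp': 8, 'energy': 0.4, 'delay': 0.070, 'accuracy': 0.88},
]
best = select_split_point(profile, delay_max=0.05, acc_min=0.89)  # picks sp=4
```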

Variational Autoencoder for Channel Estimation: Real-World Measurement Insights

  • paper_url: http://arxiv.org/abs/2312.03450
  • repo_url: None
  • paper_authors: Michael Baur, Benedikt Böck, Nurettin Turan, Wolfgang Utschick
  • for: proposes a channel estimator based on a variational autoencoder and evaluates it on real-world measurements.
  • methods: the estimator is trained solely on noisy channel observations and parameterizes an approximation to the mean squared error-optimal estimator by learning observation-dependent conditional first and second moments.
  • results: significantly outperforms related state-of-the-art estimators on real-world measurements; pre-training on synthetic data yields comparable results when evaluated on the measurement data and helps reduce the required measurement training dataset size.
    Abstract This work utilizes a variational autoencoder for channel estimation and evaluates it on real-world measurements. The estimator is trained solely on noisy channel observations and parameterizes an approximation to the mean squared error-optimal estimator by learning observation-dependent conditional first and second moments. The proposed estimator significantly outperforms related state-of-the-art estimators on real-world measurements. We investigate the effect of pre-training with synthetic data and find that the proposed estimator exhibits comparable results to the related estimators if trained on synthetic data and evaluated on the measurement data. Furthermore, pre-training on synthetic data also helps to reduce the required measurement training dataset size.
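    To make the moment-based estimator concrete, a sketch of the conditionally Gaussian MMSE estimate for y = Ah + n; the callables mu_fn and cov_fn are hypothetical stand-ins for the moments produced by the trained VAE:
```python
import numpy as np

def conditional_lmmse(y, A, sigma2, mu_fn, cov_fn):
    """MSE-optimal estimate of h from y = A h + n, n ~ CN(0, sigma2*I),
    under a conditionally Gaussian model whose mean mu(y) and covariance
    C(y) are supplied by the hypothetical callables mu_fn and cov_fn.
    """
    mu, C = mu_fn(y), cov_fn(y)
    S = A @ C @ A.conj().T + sigma2 * np.eye(len(y))   # observation covariance
    G = C @ A.conj().T @ np.linalg.inv(S)              # Wiener-style gain
    return mu + G @ (y - A @ mu)
```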

On the Estimation Performance of Generalized Power Method for Heteroscedastic Probabilistic PCA

  • paper_url: http://arxiv.org/abs/2312.03438
  • repo_url: None
  • paper_authors: Jinxin Wang, Chonghe Jiang, Huikang Liu, Anthony Man-Cho So
  • for: proposes a heteroscedastic probabilistic PCA approach, based on non-convex maximum-likelihood estimation, for estimating the underlying low-dimensional linear subspace (the "ground truth") from heterogeneous data samples.
  • methods: applies a first-order method, the generalized power method (GPM), to the associated maximum-likelihood estimation problem and establishes its estimation performance guarantee.
  • results: experiments show that GPM outperforms other methods in both Gaussian noise and sub-Gaussian noise settings.
    Abstract The heteroscedastic probabilistic principal component analysis (PCA) technique, a variant of the classic PCA that considers data heterogeneity, is receiving more and more attention in the data science and signal processing communities. In this paper, to estimate the underlying low-dimensional linear subspace (simply called \emph{ground truth}) from available heterogeneous data samples, we consider the associated non-convex maximum-likelihood estimation problem, which involves maximizing a sum of heterogeneous quadratic forms over an orthogonality constraint (HQPOC). We propose a first-order method -- generalized power method (GPM) -- to tackle the problem and establish its \emph{estimation performance} guarantee. Specifically, we show that, given a suitable initialization, the distances between the iterates generated by GPM and the ground truth decrease at least geometrically to some threshold associated with the residual part of certain "population-residual decomposition". In establishing the estimation performance result, we prove a novel local error bound property of another closely related optimization problem, namely quadratic optimization with orthogonality constraint (QPOC), which is new and can be of independent interest. Numerical experiments are conducted to demonstrate the superior performance of GPM in both Gaussian noise and sub-Gaussian noise settings.
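    A generic sketch of a generalized power method on a simple HQPOC instance, maximizing a weighted sum of quadratic forms over an orthonormal matrix; the paper analyzes a specific heteroscedastic-PPCA objective and proves geometric convergence toward the ground truth:
```python
import numpy as np

def generalized_power_method(As, weights, U0, iters=100):
    """Maximize sum_i w_i * tr(U^T A_i U) over orthonormal U by iterating
    a gradient step followed by a polar retraction onto the constraint set.

    As: list of symmetric PSD matrices; U0: (d, k) orthonormal initialization.
    """
    U = U0
    for _ in range(iters):
        G = sum(w * (A @ U) for w, A in zip(weights, As))  # gradient direction
        # Polar retraction: nearest orthonormal matrix to G, via its SVD.
        W, _, Vt = np.linalg.svd(G, full_matrices=False)
        U = W @ Vt
    return U
```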

Beyond Low Rank: A Graph-Based Propagation Approach to Tensor Completion for Multi-Acquisition Scenarios

  • paper_url: http://arxiv.org/abs/2312.03436
  • repo_url: None
  • paper_authors: Iain Rolland, Sivasakthy Selvakumaran, Andrea Marinoni
  • for: recovering the missing, corrupted, or unobserved entries in data represented by tensors, in the scenario where multiple tensor acquisitions are available and without constraining the underlying tensor's rank.
  • methods: proposes a graph-based diffusion approach, GraphProp, which propagates observed entries around a graph-based representation of the tensor in order to recover the missing entries.
  • results: experiments show the method recovers both low- and high-rank tensor entries and outperforms alternative tensor completion and graph signal recovery approaches on a real-world multispectral remote sensing completion problem.
    Abstract Tensor completion refers to the problem of recovering the missing, corrupted or unobserved entries in data represented by tensors. In this paper, we tackle the tensor completion problem in the scenario in which multiple tensor acquisitions are available and do so without placing constraints on the underlying tensor's rank. Whereas previous tensor completion work primarily focuses on low-rank completion methods, we propose a novel graph-based diffusion approach to the problem. Referred to as GraphProp, the method propagates observed entries around a graph-based representation of the tensor in order to recover the missing entries. A series of experiments have been performed to validate the presented approach, including a synthetically-generated tensor recovery experiment which shows that the method can be used to recover both low and high rank tensor entries. The successful tensor completion capabilities of the approach are also demonstrated on a real-world completion problem from the field of multispectral remote sensing completion. Using data acquired from the Landsat 7 platform, we synthetically obscure image sections in order to simulate the scenario in which image acquisitions overlap only partially. In these tests, we benchmark against alternative tensor completion approaches as well as existing graph signal recovery methods, demonstrating the superior reconstruction performance of our method versus the state of the art.
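    A label-propagation-style sketch of the diffusion idea, simplified to a generic graph signal rather than the paper's multi-acquisition construction:
```python
import numpy as np

def graph_propagate(x_obs, mask, W, iters=200, alpha=0.9):
    """Fill missing entries by diffusing observed values over a graph.

    x_obs: (n,) signal (arbitrary values at unobserved nodes);
    mask: (n,) boolean, True where observed; W: (n, n) nonnegative
    symmetric affinity matrix. Each sweep mixes neighbor averages with
    the current value and clamps the observed entries.
    """
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)  # row-stochastic
    x = np.where(mask, x_obs, 0.0)
    for _ in range(iters):
        x = alpha * (P @ x) + (1 - alpha) * x
        x[mask] = x_obs[mask]   # keep observed entries fixed
    return x
```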

Markov Chain Monte Carlo Data Association for Sets of Trajectories

  • paper_url: http://arxiv.org/abs/2312.03423
  • repo_url: None
  • paper_authors: Yuxuan Xia, Ángel F. García-Fernández, Lennart Svensson
  • for: a batch solution to the multi-object tracking problem based on sets of trajectories.
  • methods: presents two offline implementations of the trajectory Poisson multi-Bernoulli mixture (TPMBM) filter that sample data association hypotheses via Markov chain Monte Carlo (MCMC), solving a large-scale, multi-scan data association problem over the entire time interval of interest.
  • results: simulation results show that the TPMBM implementation using the Metropolis-Hastings algorithm achieves state-of-the-art multiple trajectory estimation performance.
    Abstract This paper considers a batch solution to the multi-object tracking problem based on sets of trajectories. Specifically, we present two offline implementations of the trajectory Poisson multi-Bernoulli mixture (TPMBM) filter for batch data based on Markov chain Monte Carlo (MCMC) sampling of the data association hypotheses. In contrast to online TPMBM implementations, the proposed offline implementations solve a large-scale, multi-scan data association problem across the entire time interval of interest, and therefore they can fully exploit all the measurement information available. Furthermore, by leveraging the efficient hypothesis structure of TPMBM filters, the proposed implementations compare favorably with other MCMC-based multi-object tracking algorithms. Simulation results show that the TPMBM implementation using the Metropolis-Hastings algorithm presents state-of-the-art multiple trajectory estimation performance.
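    A minimal Metropolis-Hastings skeleton over discrete association hypotheses, assuming a symmetric proposal for simplicity; the TPMBM posterior and move design in the paper are more involved:
```python
import numpy as np

def mh_data_association(log_post, propose, assoc0, n_steps=10000, seed=0):
    """Metropolis-Hastings over data association hypotheses.

    log_post: unnormalized log-posterior of a hypothesis; propose: maps a
    hypothesis to a random neighbor (e.g., reassigning one measurement to
    another track), assumed symmetric so the Hastings ratio simplifies.
    Returns the best hypothesis visited and its log-posterior.
    """
    rng = np.random.default_rng(seed)
    assoc, lp = assoc0, log_post(assoc0)
    best, best_lp = assoc0, lp
    for _ in range(n_steps):
        cand = propose(assoc, rng)
        lp_cand = log_post(cand)
        if np.log(rng.random()) < lp_cand - lp:   # accept with min(1, ratio)
            assoc, lp = cand, lp_cand
            if lp > best_lp:
                best, best_lp = assoc, lp
    return best, best_lp
```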

Implementing Digital Twin in Field-Deployed Optical Networks: Uncertain Factors, Operational Guidance, and Field-Trial Demonstration

  • paper_url: http://arxiv.org/abs/2312.03374
  • repo_url: None
  • paper_authors: Yuchen Song, Min Zhang, Yao Zhang, Yan Shi, Shikui Shen, Bingli Guo, Shanguo Huang, Danshi Wang
  • for: addresses the accuracy of digital twins in field-deployed optical networks operating in real-world environments, as opposed to controlled laboratory settings.
  • methods: examines the uncertain factors behind digital-twin inaccuracy in field-deployed optical networks across three main challenges and proposes operational guidance for implementing accurate digital twins.
  • results: following the proposed guidance, demonstrates an effective digital-twin implementation in a field-trial C+L-band optical transmission link, showcasing performance recovery in a fiber cut scenario.
    Abstract Digital twin has revolutionized optical communication networks by enabling their full life-cycle management, including design, troubleshooting, optimization, upgrade, and prediction. While extensive literature exists on frameworks, standards, and applications of digital twin, there is a pressing need to implement digital twin in field-deployed optical networks operating in real-world environments, as opposed to controlled laboratory settings. This paper addresses this challenge by examining the uncertain factors behind the inaccuracy of digital twin in field-deployed optical networks, organized into three main challenges, and by proposing operational guidance for implementing accurate digital twin in field-deployed optical networks. Through the proposed guidance, we demonstrate the effective implementation of digital twin in a field-trial C+L-band optical transmission link, showcasing its capabilities in performance recovery in a fiber cut scenario.

Understanding Concepts in Graph Signal Processing for Neurophysiological Signal Analysis

  • paper_url: http://arxiv.org/abs/2312.03371
  • repo_url: None
  • paper_authors: Stephan Goerttler, Fei He, Min Wu
  • for: a tutorial treatment of graph signal processing (GSP), with a focus on classifying neurophysiological data.
  • methods: derives and discusses key GSP concepts around the graph Fourier transform, which projects a multivariate signal onto frequency-ordered graph Fourier modes and can be regarded as a spatial analog of the temporal Fourier transform.
  • results: using a minimalist simulation framework that can generate arbitrary amounts of data, lower graph-frequency signals are found to be less suitable for classifying neurophysiological data than higher graph-frequency signals; a baseline testing framework further suggests that GSP applications may attenuate spectral characteristics in the signals, highlighting current limitations of GSP for neuroimaging.
    Abstract Multivariate signals, which are measured simultaneously over time and acquired by sensor networks, are becoming increasingly common. The emerging field of graph signal processing (GSP) promises to analyse spectral characteristics of these multivariate signals, while at the same time taking the spatial structure between the time signals into account. A central idea in GSP is the graph Fourier transform, which projects a multivariate signal onto frequency-ordered graph Fourier modes, and can therefore be regarded as a spatial analog of the temporal Fourier transform. This chapter derives and discusses key concepts in GSP, with a specific focus on how the various concepts relate to one another. The experimental section focuses on the role of graph frequency in data classification, with applications to neuroimaging. To address the limited sample size of neurophysiological datasets, we introduce a minimalist simulation framework that can generate arbitrary amounts of data. Using this artificial data, we find that lower graph frequency signals are less suitable for classifying neurophysiological data as compared to higher graph frequency signals. Finally, we introduce a baseline testing framework for GSP. Employing this framework, our results suggest that GSP applications may attenuate spectral characteristics in the signals, highlighting current limitations of GSP for neuroimaging.
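    A minimal graph Fourier transform sketch, using the eigendecomposition of the combinatorial Laplacian as the frequency-ordered modes:
```python
import numpy as np

def graph_fourier_transform(x, W):
    """Graph Fourier transform of a signal x on a graph with adjacency W.

    The combinatorial Laplacian L = D - W is eigendecomposed; its
    eigenvectors, ordered by eigenvalue, act as frequency-ordered Fourier
    modes, and the GFT is the projection of x onto them.
    """
    L = np.diag(W.sum(axis=1)) - W     # combinatorial Laplacian
    eigvals, U = np.linalg.eigh(L)     # modes sorted by graph frequency
    return eigvals, U, U.T @ x         # spectrum coefficients x_hat

# Inverse transform: x = U @ x_hat
```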

Channel-Transferable Semantic Communications for Multi-User OFDM-NOMA Systems

  • paper_url: http://arxiv.org/abs/2312.03299
  • repo_url: None
  • paper_authors: Lan Lin, Wenjun Xu, Fengyu Wang, Yimeng Zhang, Wei Zhang, Ping Zhang
  • for: semantic communications, expected to become a core new paradigm of sixth-generation (6G) wireless networks.
  • methods: proposes a channel-transferable semantic communications (CT-SemCom) framework that adapts codecs learned on one type of channel to other channel types; integrating the framework with OFDM-NOMA systems, a power allocation problem is formulated to realize the transfer from AWGN channels to multi-subcarrier Rayleigh fading channels, solved analytically at low complexity by a semantics-similar dual transformation (SSDT) algorithm.
  • results: significantly outperforms existing work in channel transferability, e.g., the PSNR of image transmission improves by 4.2-7.3 dB under different variances of Rayleigh fading channels.
    Abstract Semantic communications are expected to become a core new paradigm of sixth-generation (6G) wireless networks. Most existing works implicitly utilize channel information for codec training, which leads to poor communications when the channel type or statistical characteristics change. To tackle this issue posed by various channels, a novel channel-transferable semantic communications (CT-SemCom) framework is proposed, which adapts the codecs learned on one type of channel to other types of channels. Furthermore, integrating the proposed framework with orthogonal frequency division multiplexing systems that use non-orthogonal multiple access technologies, i.e., OFDM-NOMA systems, a power allocation problem to realize the transfer from additive white Gaussian noise (AWGN) channels to multi-subcarrier Rayleigh fading channels is formulated. We then design a semantics-similar dual transformation (SSDT) algorithm to derive analytical solutions with low complexity. Simulation results show that the proposed CT-SemCom framework with the SSDT algorithm significantly outperforms the existing work w.r.t. channel transferability, e.g., the peak signal-to-noise ratio (PSNR) of image transmission improves by 4.2-7.3 dB under different variances of Rayleigh fading channels.

Adaptive Multi-band Modulation for Robust and Low-complexity Faster-than-Nyquist Non-Orthogonal FDM IM-DD System

  • paper_url: http://arxiv.org/abs/2312.03284
  • repo_url: None
  • paper_authors: Peiji Song, Zhouyi Hu, Yizhan Dai, Yuan Liu, Chao Gao, Chun-Kit Chan
  • for: improving bandwidth utilization and reducing complexity in faster-than-Nyquist non-orthogonal FDM IM-DD systems.
  • methods: divides the single band into multiple sub-bands in a non-orthogonal matrix precoding (NOM-p) based FTN-NOFDM system and assigns different QAM levels to different sub-bands (adaptive multi-band modulation).
  • results: with proper sub-band numbers, lowers the bit-error rate and greatly reduces implementation complexity compared to the conventional single-band approach.
    Abstract Faster-than-Nyquist non-orthogonal frequency-division multiplexing (FTN-NOFDM) is robust against the steep frequency roll-off by saving signal bandwidth. Among the FTN-NOFDM techniques, the non-orthogonal matrix precoding (NOM-p) based FTN has high compatibility with the conventional orthogonal frequency division multiplexing (OFDM), in terms of the advanced digital signal processing already used in OFDM. In this work, by dividing the single band into multiple sub-bands in the NOM-p-based FTN-NOFDM system, we propose a novel FTN-NOFDM scheme with adaptive multi-band modulation. The proposed scheme assigns different quadrature amplitude modulation (QAM) levels to different sub-bands, effectively utilizing the low-pass-like channel and reducing the complexity. The impacts of sub-band number and bandwidth compression factor on the bit-error-rate (BER) performance and implementation complexity are experimentally analyzed with a 32.23-Gb/s and 20-km intensity modulation-direct detection (IM-DD) optical transmission system. Results show that the proposed scheme with proper sub-band numbers can lower BER and greatly reduce the complexity compared to the conventional single-band way.
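    A toy sketch of per-sub-band QAM assignment from measured SNRs; the threshold table is an illustrative placeholder, not the paper's operating points. On a low-pass-like channel, the low-frequency sub-bands then naturally receive the denser constellations:
```python
def assign_qam_levels(subband_snr_db, thresholds_db=None):
    """Assign a QAM order to each sub-band from its measured SNR (dB).

    thresholds_db maps each QAM order to the minimum SNR it requires;
    the numbers below are hypothetical. Falls back to order 2 (BPSK-like)
    when no higher order fits.
    """
    if thresholds_db is None:
        thresholds_db = {64: 22.0, 32: 19.0, 16: 16.0, 8: 13.0, 4: 10.0}
    orders = []
    for snr in subband_snr_db:
        chosen = 2
        for order, thr in sorted(thresholds_db.items(), reverse=True):
            if snr >= thr:
                chosen = order
                break
        orders.append(chosen)
    return orders

# Example: SNR falling with sub-band index on a low-pass-like channel
print(assign_qam_levels([23.5, 20.1, 17.0, 12.2]))  # -> [64, 32, 16, 4]
```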

Densifying MIMO: Channel Modeling, Physical Constraints, and Performance Evaluation for Holographic Communications

  • paper_url: http://arxiv.org/abs/2312.03255
  • repo_url: None
  • paper_authors: Y. Liu, M. Zhang, T. Wang, A. Zhang, M. Debbah
  • for: tackles a key practical challenge of massive MIMO in cellular networks, deploying a large number of antenna elements within limited spaces, for which holographic communication with dense antenna arrays is a potential solution.
  • methods: proposes an electromagnetic channel model based on the characteristics of electromagnetic waves, encompassing mutual coupling at the transceiver sides and depolarization in the propagation environment; by approximating an infinite array, the performance limits of large-scale dense antenna arrays are also studied theoretically.
  • results: numerical simulations and a channel measurement experiment reveal that, within limited spaces, the coupling effect, particularly for element spacing smaller than half a wavelength, is the primary factor limiting the performance of holographic communications.
    Abstract As the backbone of the fifth-generation (5G) cellular network, massive multiple-input multiple-output (MIMO) encounters a significant challenge in practical applications: how to deploy a large number of antenna elements within limited spaces. Recently, holographic communication has emerged as a potential solution to this issue. It employs dense antenna arrays and provides a tractable model. Nevertheless, some challenges must be addressed to actualize this innovative concept. One is the mutual coupling among antenna elements within an array. When the element spacing is small, near-field coupling becomes the dominant factor that strongly restricts the array performance. Another is the polarization of electromagnetic waves. As an intrinsic property, it was not fully considered in the previous channel modeling of holographic communication. The third is the lack of real-world experiments to show the potential and possible defects of a holographic communication system. In this paper, we propose an electromagnetic channel model based on the characteristics of electromagnetic waves. This model encompasses the impact of mutual coupling in the transceiver sides and the depolarization in the propagation environment. Furthermore, by approximating an infinite array, the performance restrictions of large-scale dense antenna arrays are also studied theoretically to exploit the potential of the proposed channel. In addition, numerical simulations and a channel measurement experiment are conducted. The findings reveal that within limited spaces, the coupling effect, particularly for element spacing smaller than half of the wavelength, is the primary factor leading to the inflection point for the performance of holographic communications.

cs.SD - 2023-12-05

Leveraging Laryngograph Data for Robust Voicing Detection in Speech

  • paper_url: http://arxiv.org/abs/2312.03129
  • repo_url: https://github.com/yixuanz/rvd
  • paper_authors: Yixuan Zhang, Heming Wang, DeLiang Wang
  • for: proposes a robust method for detecting voiced intervals in speech signals, a critical step in pitch tracking with numerous applications, e.g., in speech recognition.
  • methods: a densely-connected convolutional recurrent neural network (DC-CRN) trained on data with reference voicing decisions extracted from laryngograph datasets; pretraining is also investigated to improve generalization.
  • results: produces robust voicing detection results, outperforming other strong baseline methods and generalizing well to unseen datasets.
    Abstract Accurately detecting voiced intervals in speech signals is a critical step in pitch tracking and has numerous applications. While conventional signal processing methods and deep learning algorithms have been proposed for this task, their need to fine-tune threshold parameters for different datasets and limited generalization restrict their utility in real-world applications. To address these challenges, this study proposes a supervised voicing detection model that leverages recorded laryngograph data. The model is based on a densely-connected convolutional recurrent neural network (DC-CRN), and trained on data with reference voicing decisions extracted from laryngograph data sets. Pretraining is also investigated to improve the generalization ability of the model. The proposed model produces robust voicing detection results, outperforming other strong baseline methods, and generalizes well to unseen datasets. The source code of the proposed model with pretraining is provided along with the list of used laryngograph datasets to facilitate further research in this area.

Integrating Plug-and-Play Data Priors with Weighted Prediction Error for Speech Dereverberation

  • paper_url: http://arxiv.org/abs/2312.02773
  • repo_url: None
  • paper_authors: Ziye Yang, Wenxing Yang, Kai Xie, Jie Chen
  • for: improving the quality and intelligibility of reverberant speech by alleviating the detrimental effects of late-reverberant components.
  • methods: combines physics-based and data-driven methods, incorporating speech priors learned from data into the weighted prediction error (WPE) optimization iterations via the plug-and-play (PnP) strategy, specifically regularization by denoising (RED).
  • results: experimental results validate the effectiveness of the proposed approach.
    Abstract Speech dereverberation aims to alleviate the detrimental effects of late-reverberant components. While the weighted prediction error (WPE) method has shown superior performance in dereverberation, there is still room for further improvement in terms of performance and robustness in complex and noisy environments. Recent research has highlighted the effectiveness of integrating physics-based and data-driven methods, enhancing the performance of various signal processing tasks while maintaining interpretability. Motivated by these advancements, this paper presents a novel dereverberation frame-work, which incorporates data-driven methods for capturing speech priors within the WPE framework. The plug-and-play strategy (PnP), specifically the regularization by denoising (RED) strategy, is utilized to incorporate speech prior information learnt from data during the optimization problem solving iterations. Experimental results validate the effectiveness of the proposed approach.
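    For context, a compact sketch of the classical WPE baseline that the paper augments with learned priors, alternating variance estimation with weighted linear prediction in one STFT frequency bin:
```python
import numpy as np

def wpe_single_bin(x, taps=10, delay=3, iters=3, eps=1e-8):
    """Classical WPE dereverberation for one STFT frequency bin (one channel).

    x: (T,) complex STFT frames. Alternates between (i) estimating the
    per-frame variance of the desired signal from the current output and
    (ii) solving the weighted linear prediction for the late-reverberation
    filter, whose prediction is then subtracted.
    """
    T = len(x)
    X = np.zeros((taps, T), dtype=complex)   # delayed past frames
    for k in range(taps):
        shift = delay + k
        X[k, shift:] = x[:T - shift]
    d = x.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(d) ** 2, eps)   # per-frame variance estimate
        Xw = X / lam                             # 1/lambda weighting per column
        R = Xw.conj() @ X.T                      # weighted covariance
        p = Xw.conj() @ x                        # weighted cross-correlation
        h = np.linalg.solve(R + eps * np.eye(taps), p)
        d = x - h @ X                            # remove predicted late reverb
    return d
```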

Distributed Speech Dereverberation Using Weighted Prediction Error

  • paper_url: http://arxiv.org/abs/2312.03034
  • repo_url: None
  • paper_authors: Ziye Yang, Mengfei Zhang, Jie Chen
  • for: alleviating the negative impact of late reverberant reflections while keeping the computational complexity low at each microphone node.
  • methods: leverages the distributed adaptive node-specific signal estimation (DANSE) algorithm within the multichannel linear prediction (MCLP) process, so that each node performs local, reduced-complexity operations while achieving global performance through inter-node cooperation.
  • results: experiments validate the effectiveness of the method, showcasing efficient speech dereverberation in dispersed microphone node scenarios.
    Abstract Speech dereverberation aims to alleviate the negative impact of late reverberant reflections. The weighted prediction error (WPE) method is a well-established technique known for its superior performance in dereverberation. However, in scenarios where microphone nodes are dispersed, the centralized approach of the WPE method requires aggregating all observations for inverse filtering, resulting in a significant computational burden. This paper introduces a distributed speech dereverberation method that emphasizes low computational complexity at each node. Specifically, we leverage the distributed adaptive node-specific signal estimation (DANSE) algorithm within the multichannel linear prediction (MCLP) process. This approach empowers each node to perform local operations with reduced complexity while achieving the global performance through inter-node cooperation. Experimental results validate the effectiveness of our proposed method, showcasing its ability to achieve efficient speech dereverberation in dispersed microphone node scenarios.

Auralization based on multi-perspective ambisonic room impulse responses

  • paper_url: http://arxiv.org/abs/2312.02581
  • repo_url: None
  • paper_authors: Kaspar Müller, Franz Zotter
  • for: real-time auralization for a variable listener perspective in a desired acoustic environment.
  • methods: interpolates Ambisonic room impulse responses (ARIRs) measured or simulated at a grid of spatially distributed receiver perspectives: a triplet of neighboring ARIRs is first extrapolated to the variable listener perspective, by decomposing each ARIR into localized sound events and re-assigning their direction, time, and level, and then linearly interpolated.
  • results: a listening experiment with both measured and simulated ARIR data sets, under static and time-varying conditions, identifies suitable parameter settings for high-quality interpolated rendering.
    Abstract Most often, virtual acoustic rendering employs real-time updated room acoustic simulations to accomplish auralization for a variable listener perspective. As an alternative, we propose and test a technique to interpolate room impulse responses, specifically Ambisonic room impulse responses (ARIRs) available at a grid of spatially distributed receiver perspectives, measured or simulated in a desired acoustic environment. In particular, we extrapolate a triplet of neighboring ARIRs to the variable listener perspective, preceding their linear interpolation. The extrapolation is achieved by decomposing each ARIR into localized sound events and re-assigning their direction, time, and level to what could be observed at the listener perspective, with as much temporal, directional, and perspective context as possible. We propose to undertake this decomposition in two levels: Peaks in the early ARIRs are decomposed into jointly localized sound events, based on time differences of arrival observed in either an ARIR triplet, or all ARIRs observing the direct sound. Sound events that could not be jointly localized are treated as residuals whose less precise localization utilizes direction-of-arrival detection and the estimated time of arrival. For the interpolated rendering, suitable parameter settings are found by evaluating the proposed method in a listening experiment, using both measured and simulated ARIR data sets, under static and time-varying conditions.
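    A small sketch of the linear-interpolation stage only: barycentric weights of the listener position within a receiver triplet. The paper's key step, extrapolating localized sound events to the listener perspective, precedes this and is not shown:
```python
import numpy as np

def triplet_weights(p, a, b, c):
    """Barycentric weights of a listener position p inside the triangle of
    three receiver positions a, b, c (2D floor-plan coordinates).
    """
    T = np.column_stack((b - a, c - a))
    wb, wc = np.linalg.solve(T, p - a)   # coordinates in the triangle basis
    wa = 1.0 - wb - wc
    return np.array([wa, wb, wc])

# The interpolated response is then the weighted sum of the three
# (perspective-extrapolated) ARIRs: arir_p = wa*arir_a + wb*arir_b + wc*arir_c
```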

cs.CV - 2023-12-05

DiffusionAtlas: High-Fidelity Consistent Diffusion Video Editing

  • paper_url: http://arxiv.org/abs/2312.03772
  • repo_url: None
  • paper_authors: Shao-Yu Chang, Hwann-Tzong Chen, Tyng-Luh Liu
  • for: video editing that maintains both frame consistency and high fidelity of the edited object's appearance.
  • methods: the DiffusionAtlas framework edits objects directly on diffusion atlases with a visual-textual diffusion model, ensuring coherent object identity across frames; a loss term with atlas-based constraints and a pretrained text-driven diffusion model serve as pixel-wise guidance for refining shape distortions and correcting texture deviations.
  • results: qualitative and quantitative experiments show the method outperforms state-of-the-art approaches in consistent, high-fidelity video-object editing.
    Abstract We present a diffusion-based video editing framework, namely DiffusionAtlas, which can achieve both frame consistency and high fidelity in editing video object appearance. Despite the success in image editing, diffusion models still encounter significant hindrances when it comes to video editing due to the challenge of maintaining spatiotemporal consistency in the object's appearance across frames. On the other hand, atlas-based techniques allow propagating edits on the layered representations consistently back to frames. However, they often struggle to create editing effects that adhere correctly to the user-provided textual or visual conditions due to the limitation of editing the texture atlas on a fixed UV mapping field. Our method leverages a visual-textual diffusion model to edit objects directly on the diffusion atlases, ensuring coherent object identity across frames. We design a loss term with atlas-based constraints and build a pretrained text-driven diffusion model as pixel-wise guidance for refining shape distortions and correcting texture deviations. Qualitative and quantitative experiments show that our method outperforms state-of-the-art methods in achieving consistent high-fidelity video-object editing.

DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.03771
  • repo_url: None
  • paper_authors: Shaoan Xie, Yang Zhao, Zhisheng Xiao, Kelvin C. K. Chan, Yandong Li, Yanwu Xu, Kun Zhang, Tingbo Hou
  • for: introduces Text-Guided Subject-Driven Image Inpainting, a novel task that combines text and exemplar images for inpainting; while the two conditions have been used independently in previous work, their combined use remains unexplored and requires balancing editability against subject fidelity.
  • methods: a two-step approach, DreamInpainter: dense subject features are first computed to ensure accurate subject replication, then a discriminative token selection module eliminates redundant subject details, preserving the subject's identity while allowing changes according to other conditions such as mask shape and text prompts; a decoupling regularization technique additionally enhances text control in the presence of exemplar images.
  • results: extensive experiments demonstrate superior performance in visual quality, identity preservation, and text control.
    Abstract This study introduces Text-Guided Subject-Driven Image Inpainting, a novel task that combines text and exemplar images for image inpainting. While both text and exemplar images have been used independently in previous efforts, their combined utilization remains unexplored. Simultaneously accommodating both conditions poses a significant challenge due to the inherent balance required between editability and subject fidelity. To tackle this challenge, we propose a two-step approach DreamInpainter. First, we compute dense subject features to ensure accurate subject replication. Then, we employ a discriminative token selection module to eliminate redundant subject details, preserving the subject's identity while allowing changes according to other conditions such as mask shape and text prompts. Additionally, we introduce a decoupling regularization technique to enhance text control in the presence of exemplar images. Our extensive experiments demonstrate the superior performance of our method in terms of visual quality, identity preservation, and text control, showcasing its effectiveness in the context of text-guided subject-driven image inpainting.

HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces

  • paper_url: http://arxiv.org/abs/2312.03160
  • repo_url: None
  • paper_authors: Haithem Turki, Vasu Agrawal, Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder, Deva Ramanan, Michael Zollhöfer, Christian Richardt
  • for: improving view synthesis quality and rendering speed.
  • methods: combines surface and volumetric representations, rendering most objects as surfaces while modeling the (typically) small fraction of challenging regions volumetrically.
  • results: improves error rates by 15-30% over state-of-the-art baselines while achieving real-time framerates (at least 36 FPS) at virtual-reality resolutions (2Kx2K).
    Abstract Neural radiance fields provide state-of-the-art view synthesis quality but tend to be slow to render. One reason is that they make use of volume rendering, thus requiring many samples (and model queries) per ray at render time. Although this representation is flexible and easy to optimize, most real-world objects can be modeled more efficiently with surfaces instead of volumes, requiring far fewer samples per ray. This observation has spurred considerable progress in surface representations such as signed distance functions, but these may struggle to model semi-opaque and thin structures. We propose a method, HybridNeRF, that leverages the strengths of both representations by rendering most objects as surfaces while modeling the (typically) small fraction of challenging regions volumetrically. We evaluate HybridNeRF against the challenging Eyeful Tower dataset along with other commonly used view synthesis datasets. When comparing to state-of-the-art baselines, including recent rasterization-based approaches, we improve error rates by 15-30% while achieving real-time framerates (at least 36 FPS) for virtual-reality resolutions (2Kx2K).

ViscoNet: Bridging and Harmonizing Visual and Textual Conditioning for ControlNet

  • paper_url: http://arxiv.org/abs/2312.03154
  • repo_url: https://github.com/soon-yau/visconet
  • paper_authors: Soon Yau Cheong, Armin Mustafa, Andrew Gilbert
  • for: enhancing the controllability of text-to-image human generation models with visual prompting, letting users specify the visual appearance of the target object with a reference image.
  • methods: disentangles the object's appearance from the image background and injects it into a pre-trained latent diffusion model (LDM) via a ControlNet branch, mitigating the style mode collapse problem and enabling precise, flexible visual control.
  • results: manipulates visual attributes and artistic styles of generated human images with text and image prompts, and learns visual conditioning from small, specific object domains while preserving the generative power of the LDM backbone.
    Abstract This paper introduces ViscoNet, a novel method that enhances text-to-image human generation models with visual prompting. Unlike existing methods that rely on lengthy text descriptions to control the image structure, ViscoNet allows users to specify the visual appearance of the target object with a reference image. ViscoNet disentangles the object's appearance from the image background and injects it into a pre-trained latent diffusion model (LDM) model via a ControlNet branch. This way, ViscoNet mitigates the style mode collapse problem and enables precise and flexible visual control. We demonstrate the effectiveness of ViscoNet on human image generation, where it can manipulate visual attributes and artistic styles with text and image prompts. We also show that ViscoNet can learn visual conditioning from small and specific object domains while preserving the generative power of the LDM backbone.

Predicting Bone Degradation Using Vision Transformer and Synthetic Cellular Microstructures Dataset

  • paper_url: http://arxiv.org/abs/2312.03133
  • repo_url: None
  • paper_authors: Mohammad Saber Hashemi, Azadeh Sheidaei
  • for: predicting and visualizing bone degradation for astronauts on space exploration missions, where microgravity substantially accelerates the loss of bone stiffness and strength.
  • methods: a robust yet fast computational method, TransVNet, takes different 3D voxelized images and predicts their evolution over months using a hybrid 3D-CNN-VisionTransformer autoencoder architecture, accounting for each individual's bone characteristics.
  • results: given the limited experimental data and the difficulty of obtaining new samples, a digital-twin dataset of diverse initial bone-like microstructures was generated through a previously developed microgravity degradation model and used to train TransVNet on the monthly evolution of the 3D images.
    Abstract Bone degradation, especially for astronauts in microgravity conditions, is crucial for space exploration missions since the lower applied external forces accelerate the diminution in bone stiffness and strength substantially. Although existing computational models help us understand this phenomenon and possibly restrict its effect in the future, they are too time-consuming to simulate in detail the changes in the bones, not just the bone microstructures, of each individual. In this study, a robust yet fast computational method to predict and visualize bone degradation has been developed. Our deep-learning method, TransVNet, can take in different 3D voxelized images and predict their evolution throughout months utilizing a hybrid 3D-CNN-VisionTransformer autoencoder architecture. Because of limited available experimental data and the challenges of obtaining new samples, a digital twin dataset of diverse initial bone-like microstructures was generated to train our TransVNet on the evolution of the 3D images through a previously developed degradation model for microgravity.
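
The hybrid 3D-CNN-VisionTransformer autoencoder in the abstract suggests an encoder along the lines of the sketch below: a 3D convolutional stem tokenizes the voxel grid, and a Transformer mixes the tokens globally. This is a minimal reading of the architecture; channel counts and depths are invented for illustration.

```python
import torch
import torch.nn as nn

class Hybrid3DEncoder(nn.Module):
    def __init__(self, in_ch=1, embed_dim=128, depth=2, heads=4):
        super().__init__()
        # 3D-CNN stem: downsample the voxel grid and lift it to embed_dim channels.
        self.stem = nn.Sequential(
            nn.Conv3d(in_ch, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, embed_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, vox):                       # vox: (B, C, D, H, W)
        feat = self.stem(vox)                     # (B, E, D/4, H/4, W/4)
        tokens = feat.flatten(2).transpose(1, 2)  # (B, N, E): one token per coarse voxel
        return self.transformer(tokens)           # globally mixed tokens for a 3D decoder

enc = Hybrid3DEncoder()
tokens = enc(torch.randn(1, 1, 32, 32, 32))       # -> shape (1, 512, 128)
```

A matching 3D-deconvolution decoder would then map the tokens back to a voxel grid to predict the microstructure at the next time step.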

AI-SAM: Automatic and Interactive Segment Anything Model

  • paper_url: http://arxiv.org/abs/2312.03119
  • repo_url: None
  • paper_authors: Yimu Pan, Sitao Zhang, Alison D. Gernand, Jeffery A. Goldstein, James Z. Wang
  • for: Addresses the divide between automatic and interactive approaches to semantic segmentation.
  • methods: Proposes a novel automatic-and-interactive model (AI-SAM), with a comprehensive analysis of prompt quality and the first Automatic and Interactive Prompter (AI-Prompter) that automatically generates initial point prompts.
  • results: Experiments show that AI-SAM achieves state-of-the-art performance in the automatic setting and can incorporate additional user prompts to further improve performance.
    Abstract Semantic segmentation is a core task in computer vision. Existing methods are generally divided into two categories: automatic and interactive. Interactive approaches, exemplified by the Segment Anything Model (SAM), have shown promise as pre-trained models. However, current adaptation strategies for these models tend to lean towards either automatic or interactive approaches. Interactive methods depend on user-input prompts to operate, while automatic ones bypass interactive promptability entirely. Addressing these limitations, we introduce a novel paradigm and its first model: the Automatic and Interactive Segment Anything Model (AI-SAM). In this paradigm, we conduct a comprehensive analysis of prompt quality and introduce the pioneering Automatic and Interactive Prompter (AI-Prompter) that automatically generates initial point prompts while accepting additional user inputs. Our experimental results demonstrate AI-SAM's effectiveness in the automatic setting, achieving state-of-the-art performance. Significantly, it offers the flexibility to incorporate additional user prompts, thereby further enhancing its performance. The project page is available at https://github.com/ymp5078/AI-SAM.
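
A minimal sketch of the prompting idea, under the assumption that the automatic prompter scores pixels with a per-class confidence map: the peak of the map becomes the initial point prompt, and user clicks can be appended afterwards. The function below is hypothetical, not AI-SAM's API.

```python
import numpy as np

def auto_point_prompt(heatmap: np.ndarray, user_points=None):
    """heatmap: (H, W) confidence map for one class; user_points: list of (x, y, label)."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    points = [(int(x), int(y), 1)]       # label 1 = foreground seed
    if user_points:
        points.extend(user_points)       # interactive refinement stays available
    return points

heat = np.zeros((64, 64))
heat[20, 40] = 1.0
print(auto_point_prompt(heat))                    # fully automatic
print(auto_point_prompt(heat, [(10, 10, 0)]))     # plus one background click
```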

The Automated Bias Triangle Feature Extraction Framework

  • paper_url: http://arxiv.org/abs/2312.03110
  • repo_url: None
  • paper_authors: Madeleine Kotzagiannidis, Jonas Schuff, Nathan Korda
  • for: Provides an automated analysis framework that uses bias triangles in the stability diagrams of Quantum Dot (QD) devices to detect Pauli Spin Blockade (PSB).
  • methods: Uses unsupervised, segmentation-based computer vision methods to extract the physical properties of bias triangles, automatically identifying and quantifying their shapes and features without human labelling or large training datasets.
  • results: Shows that PSB detection can be conducted effectively and efficiently without any training data or human annotation, helping to improve the characterization of QD devices.
    Abstract Bias triangles represent features in stability diagrams of Quantum Dot (QD) devices, whose occurrence and property analysis are crucial indicators for spin physics. Nevertheless, challenges associated with quality and availability of data as well as the subtlety of physical phenomena of interest have hindered an automatic and bespoke analysis framework, often still relying (in part) on human labelling and verification. We introduce a feature extraction framework for bias triangles, built from unsupervised, segmentation-based computer vision methods, which facilitates the direct identification and quantification of physical properties of the former. Thereby, the need for human input or large training datasets to inform supervised learning approaches is circumvented, while additionally enabling the automation of pixelwise shape and feature labeling. In particular, we demonstrate that Pauli Spin Blockade (PSB) detection can be conducted effectively, efficiently and without any training data as a direct result of this approach.
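
To make the unsupervised, segmentation-based idea concrete, here is a hedged sketch of the general recipe (threshold, connected components, per-component shape statistics); the real framework is more sophisticated, and the threshold and features below are placeholders.

```python
import numpy as np
from scipy import ndimage

def extract_features(patch: np.ndarray, thresh: float):
    """Segment bright regions of a stability-diagram patch and summarize their shapes."""
    mask = patch > thresh
    labels, n = ndimage.label(mask)              # connected-component segmentation
    feats = []
    for i in range(1, n + 1):
        ys, xs = np.nonzero(labels == i)
        feats.append({
            "area": int(ys.size),
            "bbox": (int(ys.min()), int(xs.min()), int(ys.max()), int(xs.max())),
            "centroid": (float(ys.mean()), float(xs.mean())),
        })
    return feats

rng = np.random.default_rng(0)
patch = rng.random((50, 50))
patch[10:20, 10:18] += 2.0                       # a synthetic triangle-like blob
print(extract_features(patch, thresh=1.5))
```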

Fully Convolutional Slice-to-Volume Reconstruction for Single-Stack MRI

  • paper_url: http://arxiv.org/abs/2312.03102
  • repo_url: None
  • paper_authors: Sean I. Young, Yaël Balbastre, Bruce Fischl, Polina Golland, Juan Eugenio Iglesias
  • for: To reconstruct an unknown 3D magnetic resonance volume from stacks of 2D slices corrupted by motion.
  • methods: Uses a fully convolutional network to perform single-stack motion estimation for a given slice stack, producing a 3D reconstruction as a byproduct of the predicted motion.
  • results: Achieves twice the accuracy of previous SVR methods in reconstructing adult and fetal brains. The code is available at github.com/seannz/svr.
    Abstract In magnetic resonance imaging (MRI), slice-to-volume reconstruction (SVR) refers to computational reconstruction of an unknown 3D magnetic resonance volume from stacks of 2D slices corrupted by motion. While promising, current SVR methods require multiple slice stacks for accurate 3D reconstruction, leading to long scans and limiting their use in time-sensitive applications such as fetal fMRI. Here, we propose a SVR method that overcomes the shortcomings of previous work and produces state-of-the-art reconstructions in the presence of extreme inter-slice motion. Inspired by the recent success of single-view depth estimation methods, we formulate SVR as a single-stack motion estimation task and train a fully convolutional network to predict a motion stack for a given slice stack, producing a 3D reconstruction as a byproduct of the predicted motion. Extensive experiments on the SVR of adult and fetal brains demonstrate that our fully convolutional method is twice as accurate as previous SVR methods. Our code is available at github.com/seannz/svr.

Gaussian3Diff: 3D Gaussian Diffusion for 3D Full Head Synthesis and Editing

  • paper_url: http://arxiv.org/abs/2312.03763
  • repo_url: None
  • paper_authors: Yushi Lan, Feitong Tan, Di Qiu, Qiangeng Xu, Kyle Genova, Zeng Huang, Sean Fanello, Rohit Pandey, Thomas Funkhouser, Chen Change Loy, Yinda Zhang
  • for: Generating photorealistic 3D human heads that can be flexibly edited and reposed.
  • methods: Represents heads with 3D Gaussians anchored on a parametric face model, embeds a lightweight tri-plane payload in each Gaussian, and parameterizes the Gaussians in a 2D UV space via a 3DMM so that a diffusion model can be used for avatar generation.
  • results: Produces diverse, high-quality 3D human heads with fine-grained editing over facial features and expressions.
    Abstract We present a novel framework for generating photorealistic 3D human head and subsequently manipulating and reposing them with remarkable flexibility. The proposed approach leverages an implicit function representation of 3D human heads, employing 3D Gaussians anchored on a parametric face model. To enhance representational capabilities and encode spatial information, we embed a lightweight tri-plane payload within each Gaussian rather than directly storing color and opacity. Additionally, we parameterize the Gaussians in a 2D UV space via a 3DMM, enabling effective utilization of the diffusion model for 3D head avatar generation. Our method facilitates the creation of diverse and realistic 3D human heads with fine-grained editing over facial features and expressions. Extensive experiments demonstrate the effectiveness of our method.

ScAR: Scaling Adversarial Robustness for LiDAR Object Detection

  • paper_url: http://arxiv.org/abs/2312.03085
  • repo_url: https://github.com/xiaohulugo/ScAR-Scaling-Adversarial-Robustness-for-LiDAR-Object-Detection
  • paper_authors: Xiaohu Lu, Hayder Radha
  • for: To improve the adversarial robustness of LiDAR object detection models.
  • methods: Proposes a black-box Scaling Adversarial Robustness (ScAR) method for LiDAR object detection, with three black-box scaling adversarial attacks based on the available information: a model-aware attack, a distribution-aware attack, and a blind attack.
  • results: The proposed method is effective in improving the model's robustness against scaling adversarial attacks, as demonstrated by comparisons with other methods on public datasets under different 3D object detection architectures.
    Abstract The adversarial robustness of a model is its ability to resist adversarial attacks in the form of small perturbations to input data. Universal adversarial attack methods such as Fast Sign Gradient Method (FSGM) and Projected Gradient Descent (PGD) are popular for LiDAR object detection, but they are often deficient compared to task-specific adversarial attacks. Additionally, these universal methods typically require unrestricted access to the model's information, which is difficult to obtain in real-world applications. To address these limitations, we present a black-box Scaling Adversarial Robustness (ScAR) method for LiDAR object detection. By analyzing the statistical characteristics of 3D object detection datasets such as KITTI, Waymo, and nuScenes, we have found that the model's prediction is sensitive to scaling of 3D instances. We propose three black-box scaling adversarial attack methods based on the available information: model-aware attack, distribution-aware attack, and blind attack. We also introduce a strategy for generating scaling adversarial examples to improve the model's robustness against these three scaling adversarial attacks. Comparison with other methods on public datasets under different 3D object detection architectures demonstrates the effectiveness of our proposed method.
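
Of the three attacks, the blind variant is the simplest to picture: rescale an object's points about their centroid with no knowledge of the model or dataset statistics. The sketch below captures that idea and is an assumption about the mechanics, not the released implementation.

```python
import numpy as np

def blind_scale_attack(points, instance_mask, low=0.9, high=1.1, seed=0):
    """Isotropically rescale one object's points about their centroid."""
    rng = np.random.default_rng(seed)
    scale = rng.uniform(low, high)          # blind: no model or dataset info used
    attacked = points.copy()
    obj = attacked[instance_mask]
    center = obj.mean(axis=0)
    attacked[instance_mask] = center + scale * (obj - center)
    return attacked

cloud = np.random.rand(1000, 3) * 50.0      # a toy LiDAR scene
mask = np.zeros(1000, dtype=bool)
mask[:100] = True                           # points belonging to one object
adversarial = blind_scale_attack(cloud, mask)
```

The model-aware and distribution-aware variants would instead choose the scale using model gradients or dataset size statistics rather than sampling it blindly.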

LooseControl: Lifting ControlNet for Generalized Depth Conditioning

  • paper_url: http://arxiv.org/abs/2312.03079
  • repo_url: None
  • paper_authors: Shariq Farooq Bhat, Niloy J. Mitra, Peter Wonka
  • for: LooseControl is designed to enable generalized depth conditioning for diffusion-based image generation, allowing users to create complex environments with only boundary conditions or 3D box controls.
  • methods: Introduces a generalized version of depth conditioning that enables scene boundary control and 3D box control for specifying the layout locations of target objects, along with two editing mechanisms (3D box editing and attribute editing) to refine the results.
  • results: Extensive tests and comparisons with baselines demonstrate the generality of LooseControl, which the authors believe can become an important design tool for easily creating complex environments and be extended to other forms of guidance channels.
    Abstract We present LooseControl to allow generalized depth conditioning for diffusion-based image generation. ControlNet, the SOTA for depth-conditioned image generation, produces remarkable results but relies on having access to detailed depth maps for guidance. Creating such exact depth maps, in many scenarios, is challenging. This paper introduces a generalized version of depth conditioning that enables many new content-creation workflows. Specifically, we allow (C1) scene boundary control for loosely specifying scenes with only boundary conditions, and (C2) 3D box control for specifying layout locations of the target objects rather than the exact shape and appearance of the objects. Using LooseControl, along with text guidance, users can create complex environments (e.g., rooms, street views, etc.) by specifying only scene boundaries and locations of primary objects. Further, we provide two editing mechanisms to refine the results: (E1) 3D box editing enables the user to refine images by changing, adding, or removing boxes while freezing the style of the image. This yields minimal changes apart from changes induced by the edited boxes. (E2) Attribute editing proposes possible editing directions to change one particular aspect of the scene, such as the overall object density or a particular object. Extensive tests and comparisons with baselines demonstrate the generality of our method. We believe that LooseControl can become an important design tool for easily creating complex environments and be extended to other forms of guidance channels. Code and more information are available at https://shariqfarooq123.github.io/loose-control/ .
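
One way to picture 3D box control as "loose" depth conditioning is to rasterize each box into a constant-depth rectangle and keep the nearest depth per pixel, as in the hypothetical helper below (the abstract does not specify the conditioning pipeline at this level of detail).

```python
import numpy as np

def boxes_to_depth(boxes, height=256, width=256, far=100.0):
    """boxes: list of (x0, y0, x1, y1, depth) in pixel coordinates."""
    depth = np.full((height, width), far, dtype=np.float32)  # background = far plane
    for x0, y0, x1, y1, d in boxes:
        region = depth[y0:y1, x0:x1]
        np.minimum(region, d, out=region)                    # nearest surface wins
    return depth

layout = [(40, 120, 120, 240, 5.0),     # a near object
          (100, 100, 220, 220, 12.0)]   # a farther, partially occluded one
condition_map = boxes_to_depth(layout)  # coarse depth map for the depth-conditioned model
```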

ReconFusion: 3D Reconstruction with Diffusion Priors

  • paper_url: http://arxiv.org/abs/2312.02981
  • repo_url: None
  • paper_authors: Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, Aleksander Holynski
  • for: Reconstructing real-world scenes from only a few photos.
  • methods: Leverages a diffusion prior for novel view synthesis to regularize a NeRF-based reconstruction pipeline, synthesizing realistic geometry and texture in underconstrained regions while preserving the appearance of observed regions.
  • results: Demonstrates significant performance improvements over previous few-view NeRF approaches across various real-world datasets, including forward-facing and 360-degree scenes.
    Abstract 3D reconstruction methods such as Neural Radiance Fields (NeRFs) excel at rendering photorealistic novel views of complex scenes. However, recovering a high-quality NeRF typically requires tens to hundreds of input images, resulting in a time-consuming capture process. We present ReconFusion to reconstruct real-world scenes using only a few photos. Our approach leverages a diffusion prior for novel view synthesis, trained on synthetic and multiview datasets, which regularizes a NeRF-based 3D reconstruction pipeline at novel camera poses beyond those captured by the set of input images. Our method synthesizes realistic geometry and texture in underconstrained regions while preserving the appearance of observed regions. We perform an extensive evaluation across various real-world datasets, including forward-facing and 360-degree scenes, demonstrating significant performance improvements over previous few-view NeRF reconstruction approaches.
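
The training objective has a simple overall shape: a data term on the few captured views plus a diffusion-prior term on renders at novel poses. The snippet below shows only that shape; `nerf`, `diffusion_prior_loss`, and the weight `lam` are placeholders, not ReconFusion's actual interfaces.

```python
import torch

def training_step(nerf, diffusion_prior_loss, observed, novel_poses, lam=0.1):
    # 1) Data term: renders at captured poses should match the captured images.
    recon = sum(torch.mean((nerf(pose) - image) ** 2) for pose, image in observed)
    # 2) Prior term: a pre-trained diffusion model scores renders at unseen poses,
    #    regularizing geometry and texture in underconstrained regions.
    prior = sum(diffusion_prior_loss(nerf(pose)) for pose in novel_poses)
    return recon + lam * prior

# Toy stand-ins that only demonstrate the call pattern.
nerf = lambda pose: torch.zeros(3, 64, 64, requires_grad=True)
prior = lambda img: img.abs().mean()
loss = training_step(nerf, prior,
                     observed=[(0, torch.zeros(3, 64, 64))],
                     novel_poses=[1, 2])
loss.backward()
```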

GPT4Point: A Unified Framework for Point-Language Understanding and Generation

  • paper_url: http://arxiv.org/abs/2312.02980
  • repo_url: None
  • paper_authors: Zhangyang Qi, Ye Fang, Zeyi Sun, Xiaoyang Wu, Tong Wu, Jiaqi Wang, Dahua Lin, Hengshuang Zhao
  • for: Aims to improve the understanding of 3D objects by developing GPT4Point, a groundbreaking point-language multimodal model that can execute various point-text reference tasks and generate high-quality 3D objects from low-quality point-text features.
  • methods: Introduces GPT4Point, a powerful 3D multimodal language model built on the MLLM framework with advanced capabilities for controllable 3D generation, trained on a large-scale database of 1M objects from the Objaverse-XL dataset constructed with the Pyramid-XL annotation engine.
  • results: Demonstrates superior performance in understanding and generation tasks, achieving high-quality results in point-cloud captioning and Q&A while maintaining the geometric shapes and colors of the 3D objects.
    Abstract Multimodal Large Language Models (MLLMs) have excelled in 2D image-text comprehension and image generation, but their understanding of the 3D world is notably deficient, limiting progress in 3D language understanding and generation. To solve this problem, we introduce GPT4Point, an innovative groundbreaking point-language multimodal model designed specifically for unified 3D object understanding and generation within the MLLM framework. GPT4Point as a powerful 3D MLLM seamlessly can execute a variety of point-text reference tasks such as point-cloud captioning and Q&A. Additionally, GPT4Point is equipped with advanced capabilities for controllable 3D generation, it can get high-quality results through a low-quality point-text feature maintaining the geometric shapes and colors. To support the expansive needs of 3D object-text pairs, we develop Pyramid-XL, a point-language dataset annotation engine. It constructs a large-scale database over 1M objects of varied text granularity levels from the Objaverse-XL dataset, essential for training GPT4Point. A comprehensive benchmark has been proposed to evaluate 3D point-language understanding capabilities. In extensive evaluations, GPT4Point has demonstrated superior performance in understanding and generation.

DiffusionPCR: Diffusion Models for Robust Multi-Step Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2312.03053
  • repo_url: None
  • paper_authors: Zhi Chen, Yufan Ren, Tong Zhang, Zheng Dang, Wenbing Tao, Sabine Süsstrunk, Mathieu Salzmann
  • for: To propose a diffusion-based point cloud registration method that efficiently registers two point clouds into a common reference frame.
  • methods: Formulates point cloud registration as a denoising diffusion process, mapping the noisy predictions of an off-the-shelf registration model to the ground-truth transformation.
  • results: Experiments show improved registration accuracy, reaching state-of-the-art registration recall (95.3%/81.6%) on the 3DMatch and 3DLoMatch datasets.
    Abstract Point Cloud Registration (PCR) estimates the relative rigid transformation between two point clouds. We propose formulating PCR as a denoising diffusion probabilistic process, mapping noisy transformations to the ground truth. However, using diffusion models for PCR has nontrivial challenges, such as adapting a generative model to a discriminative task and leveraging the estimated nonlinear transformation from the previous step. Instead of training a diffusion model to directly map pure noise to ground truth, we map the predictions of an off-the-shelf PCR model to ground truth. The predictions of off-the-shelf models are often imperfect, especially in challenging cases where the two points clouds have low overlap, and thus could be seen as noisy versions of the real rigid transformation. In addition, we transform the rotation matrix into a spherical linear space for interpolation between samples in the forward process, and convert rigid transformations into auxiliary information to implicitly exploit last-step estimations in the reverse process. As a result, conditioned on time step, the denoising model adapts to the increasing accuracy across steps and refines registrations. Our extensive experiments showcase the effectiveness of our DiffusionPCR, yielding state-of-the-art registration recall rates (95.3%/81.6%) on 3DMatch and 3DLoMatch. The code will be made public upon publication.
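
Interpolating between rotation samples, as the forward process requires, is typically done with spherical linear interpolation (slerp) on unit quaternions. The function below is the textbook formula rather than the paper's code.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:                    # take the short arc on the quaternion double cover
        q1, dot = -q1, -dot
    if dot > 0.9995:                 # nearly parallel: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

identity = np.array([1.0, 0.0, 0.0, 0.0])
quarter_z = np.array([np.cos(np.pi / 8), 0.0, 0.0, np.sin(np.pi / 8)])  # 45 deg about z
print(slerp(identity, quarter_z, 0.5))  # ~22.5 deg rotation about z
```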

GauHuman: Articulated Gaussian Splatting from Monocular Human Videos

  • paper_url: http://arxiv.org/abs/2312.02973
  • repo_url: https://github.com/skhu101/gauhuman
  • paper_authors: Shoukang Hu, Ziwei Liu
  • for: Proposes a 3D human model with fast training (1-2 minutes) and real-time rendering (up to 189 FPS), in contrast to existing NeRF-based implicit representation frameworks that demand hours of training and seconds of rendering per frame.
  • methods: Encodes Gaussian Splatting in a canonical space and transforms the 3D Gaussians from canonical to posed space with linear blend skinning (LBS), with effective pose and LBS refinement modules designed to learn fine human details at negligible computational cost.
  • results: Extensive experiments on the ZJU_Mocap and MonoCap datasets show state-of-the-art performance with fast training and real-time rendering; notably, GauHuman models a 3D human performer with ~13k 3D Gaussians without sacrificing rendering quality.
    Abstract We present, GauHuman, a 3D human model with Gaussian Splatting for both fast training (1 ~ 2 minutes) and real-time rendering (up to 189 FPS), compared with existing NeRF-based implicit representation modelling frameworks demanding hours of training and seconds of rendering per frame. Specifically, GauHuman encodes Gaussian Splatting in the canonical space and transforms 3D Gaussians from canonical space to posed space with linear blend skinning (LBS), in which effective pose and LBS refinement modules are designed to learn fine details of 3D humans under negligible computational cost. Moreover, to enable fast optimization of GauHuman, we initialize and prune 3D Gaussians with 3D human prior, while splitting/cloning via KL divergence guidance, along with a novel merge operation for further speeding up. Extensive experiments on ZJU_Mocap and MonoCap datasets demonstrate that GauHuman achieves state-of-the-art performance quantitatively and qualitatively with fast training and real-time rendering speed. Notably, without sacrificing rendering quality, GauHuman can fast model the 3D human performer with ~13k 3D Gaussians.
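
The canonical-to-posed mapping is plain linear blend skinning applied to the Gaussian centers; here is the generic LBS math as a sketch (tensor names are assumptions, and a full implementation would also rotate each Gaussian's covariance).

```python
import torch

def lbs_transform(centers, weights, rotations, translations):
    """
    centers:      (N, 3)     canonical Gaussian means
    weights:      (N, K)     per-Gaussian skinning weights (rows sum to 1)
    rotations:    (K, 3, 3)  per-bone rotation matrices
    translations: (K, 3)     per-bone translations
    """
    # Apply every bone's rigid transform, then blend with the skinning weights.
    per_bone = torch.einsum('kij,nj->nki', rotations, centers) + translations  # (N, K, 3)
    return (weights.unsqueeze(-1) * per_bone).sum(dim=1)                       # (N, 3)

N, K = 5, 2
centers = torch.randn(N, 3)
weights = torch.softmax(torch.randn(N, K), dim=1)
rotations = torch.eye(3).expand(K, 3, 3).clone()
translations = torch.tensor([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
posed_centers = lbs_transform(centers, weights, rotations, translations)
```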

AmbiGen: Generating Ambigrams from Pre-trained Diffusion Model

  • paper_url: http://arxiv.org/abs/2312.02967
  • repo_url: None
  • paper_authors: Boheng Zhao, Rana Hanocka, Raymond A. Yeh
  • for: Generating ambigrams, calligraphic designs that remain legible under two viewing orientations.
  • methods: Distills the large-scale DeepFloyd IF vision-and-language diffusion model to optimize letter outlines for legibility in both viewing orientations.
  • results: On the 500 most common English words, the method achieves more than an 11.6% increase in word accuracy and at least a 41.9% reduction in edit distance over existing ambigram generation methods.
    Abstract Ambigrams are calligraphic designs that have different meanings depending on the viewing orientation. Creating ambigrams is a challenging task even for skilled artists, as it requires maintaining the meaning under two different viewpoints at the same time. In this work, we propose to generate ambigrams by distilling a large-scale vision and language diffusion model, namely DeepFloyd IF, to optimize the letters' outline for legibility in the two viewing orientations. Empirically, we demonstrate that our approach outperforms existing ambigram generation methods. On the 500 most common words in English, our method achieves more than an 11.6% increase in word accuracy and at least a 41.9% reduction in edit distance.

Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection

  • paper_url: http://arxiv.org/abs/2312.02966
  • repo_url: https://github.com/luluho1208/diffusion-ss3d
  • paper_authors: Cheng-Ju Ho, Chen-Hsuan Tai, Yen-Yu Lin, Ming-Hsuan Yang, Yi-Hsuan Tsai
  • for: Improving the robustness and accuracy of semi-supervised 3D object detection, addressing the difficulty of acquiring large-scale 3D bounding box annotations.
  • methods: Builds on a teacher-student framework with pseudo-labeling and proposes a diffusion-based approach that adds noise to object size and class label distributions, then denoises them to improve pseudo-label quality and the overall semi-supervised learning process.
  • results: Experiments on the ScanNet and SUN RGB-D datasets demonstrate state-of-the-art performance against existing methods, with extensive analysis of how the diffusion model design affects semi-supervised learning.
    Abstract Semi-supervised object detection is crucial for 3D scene understanding, efficiently addressing the limitation of acquiring large-scale 3D bounding box annotations. Existing methods typically employ a teacher-student framework with pseudo-labeling to leverage unlabeled point clouds. However, producing reliable pseudo-labels in a diverse 3D space still remains challenging. In this work, we propose Diffusion-SS3D, a new perspective of enhancing the quality of pseudo-labels via the diffusion model for semi-supervised 3D object detection. Specifically, we include noises to produce corrupted 3D object size and class label distributions, and then utilize the diffusion model as a denoising process to obtain bounding box outputs. Moreover, we integrate the diffusion model into the teacher-student framework, so that the denoised bounding boxes can be used to improve pseudo-label generation, as well as the entire semi-supervised learning process. We conduct experiments on the ScanNet and SUN RGB-D benchmark datasets to demonstrate that our approach achieves state-of-the-art performance against existing methods. We also present extensive analysis to understand how our diffusion model design affects performance in semi-supervised learning.
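
The forward (corruption) step is easy to sketch: Gaussian noise is added to box-size and class-score tensors so a denoiser can be trained to recover clean proposals. The code below illustrates that step only; the noise schedule and tensor shapes are assumptions, not the repository's code.

```python
import torch

def corrupt_proposals(sizes, class_logits, noise_scale=0.1, generator=None):
    """sizes: (N, 3) box dimensions; class_logits: (N, C) per-class scores."""
    noisy_sizes = sizes + noise_scale * torch.randn(sizes.shape, generator=generator)
    noisy_logits = class_logits + noise_scale * torch.randn(class_logits.shape,
                                                            generator=generator)
    return noisy_sizes, noisy_logits

g = torch.Generator().manual_seed(0)
sizes = torch.tensor([[1.8, 0.6, 0.5]])   # one pedestrian-sized box
logits = torch.zeros(1, 10)               # ten candidate classes
noisy_sizes, noisy_logits = corrupt_proposals(sizes, logits, generator=g)
```

A denoiser conditioned on point-cloud features is then trained to map the noisy pair back to clean targets, and the denoised boxes are reused as higher-quality pseudo-labels inside the teacher-student loop.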

MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures

  • paper_url: http://arxiv.org/abs/2312.02963
  • repo_url: None
  • paper_authors: Zhangyang Xiong, Chenghong Li, Kenkun Liu, Hongjie Liao, Jianqiao Hu, Junyi Zhu, Shuliang Ning, Lingteng Qiu, Chongjie Wang, Shijie Wang, Shuguang Cui, Xiaoguang Han
  • for: To provide a large-scale 3D human dataset that enables human-centric data collection and analysis at scale, and to explore its potential across various visual tasks.
  • methods: Uses a multi-view human capture system that makes collecting large-scale, high-quality 3D human data easy and scalable, with extensive annotations including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions.
  • results: Pilot studies on several 2D and 3D visual tasks, including view-consistent action recognition, human NeRF reconstruction, text-driven view-unconstrained human image generation, 2D view-unconstrained human image generation, and 3D avatar generation, show that the scale of MVHumanNet yields performance improvements and enables effective applications.
    Abstract In this era, the success of large language models and text-to-image models can be attributed to the driving force of large-scale datasets. However, in the realm of 3D vision, while remarkable progress has been made with models trained on large-scale synthetic and real-captured object data like Objaverse and MVImgNet, a similar level of progress has not been observed in the domain of human-centric tasks partially due to the lack of a large-scale human dataset. Existing datasets of high-fidelity 3D human capture continue to be mid-sized due to the significant challenges in acquiring large-scale high-quality 3D human data. To bridge this gap, we present MVHumanNet, a dataset that comprises multi-view human action sequences of 4,500 human identities. The primary focus of our work is on collecting human data that features a large number of diverse identities and everyday clothing using a multi-view human capture system, which facilitates easily scalable data collection. Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million frames with extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions. To explore the potential of MVHumanNet in various 2D and 3D visual tasks, we conducted pilot studies on view-consistent action recognition, human NeRF reconstruction, text-driven view-unconstrained human image generation, as well as 2D view-unconstrained human image and 3D avatar generation. Extensive experiments demonstrate the performance improvements and effective applications enabled by the scale provided by MVHumanNet. As the current largest-scale 3D human dataset, we hope that the release of MVHumanNet data with annotations will foster further innovations in the domain of 3D human-centric tasks at scale.

HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding

  • paper_url: http://arxiv.org/abs/2312.03050
  • repo_url: None
  • paper_authors: Trong-Thuan Nguyen, Pha Nguyen, Khoa Luu
  • for: Addresses visual interactivity understanding within visual scenes, where existing methods rely on simple relationship models and struggle with the diversity of appearance, situation, position, interaction, and relation in videos.
  • methods: Proposes the Hierarchical Interlacement Graph (HIG), which leverages a unified layer and graph within a hierarchical structure to provide deep insights into scene changes across five distinct tasks.
  • results: Outperforms other methods across all five tasks in extensive experiments under various scenarios.
    Abstract Visual interactivity understanding within visual scenes presents a significant challenge in computer vision. Existing methods focus on complex interactivities while leveraging a simple relationship model. These methods, however, struggle with a diversity of appearance, situation, position, interaction, and relation in videos. This limitation hinders the ability to fully comprehend the interplay within the complex visual dynamics of subjects. In this paper, we delve into interactivities understanding within visual content by deriving scene graph representations from dense interactivities among humans and objects. To achieve this goal, we first present a new dataset containing Appearance-Situation-Position-Interaction-Relation predicates, named ASPIRe, offering an extensive collection of videos marked by a wide range of interactivities. Then, we propose a new approach named Hierarchical Interlacement Graph (HIG), which leverages a unified layer and graph within a hierarchical structure to provide deep insights into scene changes across five distinct tasks. Our approach demonstrates superior performance to other methods through extensive experiments conducted in various scenarios.

Choroidalyzer: An open-source, end-to-end pipeline for choroidal analysis in optical coherence tomography

  • paper_url: http://arxiv.org/abs/2312.02956
  • repo_url: None
  • paper_authors: Justin Engelmann, Jamie Burke, Charlene Hamid, Megan Reid-Schachter, Dan Pugh, Neeraj Dhaun, Diana Moukaddem, Lyle Gray, Niall Strang, Paul McGraw, Amos Storkey, Paul J. Steptoe, Stuart King, Tom MacGillivray, Miguel O. Bernabeu, Ian J. C. MacCormick
  • for: To develop an open-source, end-to-end pipeline for segmenting the choroid region, vessels, and fovea, and deriving choroidal thickness, area, and vascular index.
  • methods: Uses 5,600 OCT B-scans from 233 subjects, 6 systemic disease cohorts, and 3 device types; region and vessel ground-truths are generated with state-of-the-art automatic methods after manual correction of inaccurate segmentations, foveal positions are manually annotated, and a U-Net deep-learning model is trained to detect the region, vessels, and fovea.
  • results: Achieves excellent choroid region segmentation (Dice: internal 0.9789, external 0.9749), very good vessel segmentation (Dice: internal 0.8817, external 0.8703), and accurate fovea localization (MAE: internal 3.9 pixels, external 3.4 pixels); thickness, area, and vascular index agree strongly with ground truth (Pearson: internal 0.9754, 0.9815, 0.8285; external 0.9831, 0.9779, 0.7948), with agreement against two manual graders comparable to the inter-grader agreement.
    Abstract Purpose: To develop Choroidalyzer, an open-source, end-to-end pipeline for segmenting the choroid region, vessels, and fovea, and deriving choroidal thickness, area, and vascular index. Methods: We used 5,600 OCT B-scans (233 subjects, 6 systemic disease cohorts, 3 device types, 2 manufacturers). To generate region and vessel ground-truths, we used state-of-the-art automatic methods following manual correction of inaccurate segmentations, with foveal positions manually annotated. We trained a U-Net deep-learning model to detect the region, vessels, and fovea to calculate choroid thickness, area, and vascular index in a fovea-centred region of interest. We analysed segmentation agreement (AUC, Dice) and choroid metrics agreement (Pearson, Spearman, mean absolute error (MAE)) in internal and external test sets. We compared Choroidalyzer to two manual graders on a small subset of external test images and examined cases of high error. Results: Choroidalyzer took 0.299 seconds per image on a standard laptop and achieved excellent region (Dice: internal 0.9789, external 0.9749), very good vessel segmentation performance (Dice: internal 0.8817, external 0.8703) and excellent fovea location prediction (MAE: internal 3.9 pixels, external 3.4 pixels). For thickness, area, and vascular index, Pearson correlations were 0.9754, 0.9815, and 0.8285 (internal) / 0.9831, 0.9779, 0.7948 (external), respectively (all p<0.0001). Choroidalyzer's agreement with graders was comparable to the inter-grader agreement across all metrics. Conclusions: Choroidalyzer is an open-source, end-to-end pipeline that accurately segments the choroid and reliably extracts thickness, area, and vascular index. Especially choroidal vessel segmentation is a difficult and subjective task, and fully-automatic methods like Choroidalyzer could provide objectivity and standardisation.
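
Given region and vessel masks, the three reported metrics follow from straightforward definitions, sketched below (pixel spacing and fovea-centred ROI cropping are assumed to happen upstream; this is not the released pipeline code).

```python
import numpy as np

def choroid_metrics(region, vessels, px_h_mm, px_w_mm):
    """region, vessels: boolean (H, W) masks; px_*_mm: pixel size in millimetres."""
    col_counts = region.sum(axis=0)                        # choroid pixels per column
    thickness_mm = col_counts[col_counts > 0].mean() * px_h_mm
    area_mm2 = region.sum() * px_h_mm * px_w_mm
    cvi = (vessels & region).sum() / max(region.sum(), 1)  # choroidal vascular index
    return thickness_mm, area_mm2, cvi

region = np.zeros((100, 200), dtype=bool)
region[60:80, :] = True                                    # a 20-pixel-thick band
vessels = np.zeros_like(region)
vessels[60:70, :] = True                                   # half the band is vessel
print(choroid_metrics(region, vessels, px_h_mm=0.004, px_w_mm=0.012))
```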

DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

  • paper_url: http://arxiv.org/abs/2312.03048
  • repo_url: None
  • paper_authors: Yuru Jia, Lukas Hoyer, Shengyu Huang, Tianfu Wang, Luc Van Gool, Konrad Schindler, Anton Obukhov
  • for: Investigates whether large-scale latent diffusion models (LDMs) can serve as large-scale data generators to improve perception tasks such as semantic segmentation in autonomous driving.
  • methods: Proposes an efficient data generation pipeline, DGInStyle, combining semantically-controlled generation within a narrow domain, a Multi-resolution Latent Fusion technique to counter the LDM bias towards dominant objects, and a Style Swap technique to endow the rich generative prior with the learned semantic control.
  • results: Generates a diverse street-scene dataset used to train a domain-agnostic semantic segmentation model, consistently improving several domain generalization methods, in some cases by +2.5 mIoU over the previous state-of-the-art method without the generative augmentation scheme.
    Abstract Large, pretrained latent diffusion models (LDMs) have demonstrated an extraordinary ability to generate creative content, specialize to user data through few-shot fine-tuning, and condition their output on other modalities, such as semantic maps. However, are they usable as large-scale data generators, e.g., to improve tasks in the perception stack, like semantic segmentation? We investigate this question in the context of autonomous driving, and answer it with a resounding "yes". We propose an efficient data generation pipeline termed DGInStyle. First, we examine the problem of specializing a pretrained LDM to semantically-controlled generation within a narrow domain. Second, we design a Multi-resolution Latent Fusion technique to overcome the bias of LDMs towards dominant objects. Third, we propose a Style Swap technique to endow the rich generative prior with the learned semantic control. Using DGInStyle, we generate a diverse dataset of street scenes, train a domain-agnostic semantic segmentation model on it, and evaluate the model on multiple popular autonomous driving datasets. Our approach consistently increases the performance of several domain generalization methods, in some cases by +2.5 mIoU compared to the previous state-of-the-art method without our generative augmentation scheme. Source code and dataset are available at https://dginstyle.github.io .

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

  • paper_url: http://arxiv.org/abs/2312.02949
  • repo_url: https://github.com/ux-decoder/llava-grounding
  • paper_authors: Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li, Jianwei Yang
  • for: To improve the grounding capability of large multimodal models (LMMs) in visual chat.
  • methods: Creates a new grounded visual chat (GVC) dataset and proposes a model design that combines grounding and chat capabilities by connecting segmentation models with language models.
  • results: Experiments show the model outperforms other LMMs on the proposed Grounding-Bench benchmark and achieves competitive performance on classic grounding benchmarks such as RefCOCO/+/g and Flickr30K Entities.
    Abstract With the recent significant advancements in large multi-modal models (LMMs), the importance of their grounding capability in visual chat is increasingly recognized. Despite recent efforts to enable LMMs to support grounding, their capabilities for grounding and chat are usually separate, and their chat performance drops dramatically when asked to ground. The problem is the lack of a dataset for grounded visual chat (GVC). Existing grounding datasets only contain short captions. To address this issue, we have created GVC data that allows for the combination of grounding and chat capabilities. To better evaluate the GVC capabilities, we have introduced a benchmark called Grounding-Bench. Additionally, we have proposed a model design that can support GVC and various types of visual prompts by connecting segmentation models with language models. Experimental results demonstrate that our model outperforms other LMMs on Grounding-Bench. Furthermore, our model achieves competitive performance on classic grounding benchmarks like RefCOCO/+/g and Flickr30K Entities. Our code will be released at https://github.com/UX-Decoder/LLaVA-Grounding .

Fast CT anatomic localization algorithm

  • paper_url: http://arxiv.org/abs/2312.02941
  • repo_url: None
  • paper_authors: Amit Oved
  • for: To accurately localize every slice in a computed tomography (CT) scan, enabling fast retrieval of regions of interest for visual inspection and automated analysis.
  • methods: Directly localizes only a fraction of the slices, then fits a linear model that maps slice index to estimated axial anatomical position and uses it to assign a position to every slice in the scan.
  • results: The method is fast (less than 1 second per scan), accurate (typical median localization error of 1 cm), and robust to various noise sources, imaging protocols, metal-induced artifacts, and anatomical deformations; it also provides a mapping confidence score to reject unreliable localization results on rare anomalous scans.
    Abstract Automatically determining the position of every slice in a CT scan is a basic yet powerful capability allowing fast retrieval of regions of interest for visual inspection and automated analysis. Unlike conventional localization approaches which work at the slice level, we directly localize only a fraction of the slices and then fit a linear model which maps slice index to its estimated axial anatomical position based on those slices. The model is then used to assign an axial position to every slice of the scan. This approach proves to be computationally efficient, with a typical processing time of less than a second per scan (regardless of its size), accurate, with a typical median localization error of 1 cm, and robust to different noise sources, imaging protocols, metal-induced artifacts, anatomical deformations, etc. Another key element of our approach is the introduction of a mapping confidence score. This score acts as a fail-safe mechanism which allows rejection of unreliable localization results in rare cases of anomalous scans. Our algorithm sets new State Of The Art results in terms of localization accuracy. It also offers a decrease of two orders of magnitude in processing time with respect to all published processing times. It was designed to be invariant to various scan resolutions, scan protocols, patient orientations, strong artifacts and various deformations and abnormalities. Additionally, our algorithm is, to the best of our knowledge, the first to support the entire body from head to feet rather than being confined to a specific anatomical region. This algorithm was tested on thousands of scans and proves to be very reliable and useful as a preprocessing stage for many applications.
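
The core of the method, localizing a handful of slices, fitting a linear index-to-position model, and deriving a confidence score from the fit, can be sketched in a few lines. The details below (least-squares fit, residual-based confidence) are assumptions consistent with the abstract, not the authors' code.

```python
import numpy as np

def fit_axial_mapping(sampled_idx, sampled_pos_cm, n_slices, max_rms_cm=2.0):
    x = np.asarray(sampled_idx, dtype=float)
    y = np.asarray(sampled_pos_cm, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)             # index -> axial position (cm)
    residuals = y - (slope * x + intercept)
    rms = float(np.sqrt(np.mean(residuals ** 2)))
    confidence = max(0.0, 1.0 - rms / max_rms_cm)      # ~1 = reliable fit, 0 = reject
    all_pos = slope * np.arange(n_slices) + intercept  # a position for every slice
    return all_pos, confidence

idx = [0, 50, 100, 150]             # the few slices the localizer actually processed
pos = [10.2, 25.1, 40.3, 54.8]      # their estimated axial positions (cm)
positions, conf = fit_axial_mapping(idx, pos, n_slices=200)
print(round(conf, 3), positions[:3])
```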

Drag-A-Video: Non-rigid Video Editing with Point-based Interaction

  • paper_url: http://arxiv.org/abs/2312.02936
  • repo_url: None
  • paper_authors: Yao Teng, Enze Xie, Yue Wu, Haoyu Han, Zhenguo Li, Xihui Liu
  • for: Provides interactive point-based video editing, letting users click pairs of handle and target points, plus a mask, on the first frame and propagating these point sets to the remaining frames.
  • methods: A diffusion-based method that transforms the inputs into point sets and updates video content via point correspondence, using a new video-level motion supervision with latent offsets applied at multiple denoising timesteps to keep the edits smooth and consistent.
  • results: Accurately edits video content while maintaining temporal consistency, with experiments on various videos demonstrating the method's effectiveness and flexibility (results at https://drag-a-video.github.io/).
    Abstract Video editing is a challenging task that requires manipulating videos on both the spatial and temporal dimensions. Existing methods for video editing mainly focus on changing the appearance or style of the objects in the video, while keeping their structures unchanged. However, there is no existing method that allows users to interactively ``drag'' any points of instances on the first frame to precisely reach the target points with other frames consistently deformed. In this paper, we propose a new diffusion-based method for interactive point-based video manipulation, called Drag-A-Video. Our method allows users to click pairs of handle points and target points as well as masks on the first frame of an input video. Then, our method transforms the inputs into point sets and propagates these sets across frames. To precisely modify the contents of the video, we employ a new video-level motion supervision to update the features of the video and introduce the latent offsets to achieve this update at multiple denoising timesteps. We propose a temporal-consistent point tracking module to coordinate the movement of the points in the handle point sets. We demonstrate the effectiveness and flexibility of our method on various videos. The website of our work is available here: https://drag-a-video.github.io/.

WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation

  • paper_url: http://arxiv.org/abs/2312.02934
  • repo_url: https://github.com/fudan-zvg/wovogen
  • paper_authors: Jiachen Lu, Ze Huang, Jiahui Zhang, Zeyu Yang, Li Zhang
  • for: Researchers and developers working on autonomous driving, particularly those interested in multi-camera street-view video generation and scene understanding.
  • methods: Proposes WoVoGen, which combines an additional explicit 4D world volume as a foundational element for video generation, operating in two phases: envisioning the future 4D temporal world volume based on vehicle control sequences, and generating multi-camera videos informed by this volume and sensor interconnectivity.
  • results: Generates high-quality street-view videos in response to vehicle control inputs, facilitates scene editing tasks, and offers the diversity and coherence that traditional rendering-based methods struggle to achieve.
    Abstract Generating multi-camera street-view videos is critical for augmenting autonomous driving datasets, addressing the urgent demand for extensive and varied data. Due to the limitations in diversity and challenges in handling lighting conditions, traditional rendering-based methods are increasingly being supplanted by diffusion-based methods. However, a significant challenge in diffusion-based methods is ensuring that the generated sensor data preserve both intra-world consistency and inter-sensor coherence. To address these challenges, we combine an additional explicit world volume and propose the World Volume-aware Multi-camera Driving Scene Generator (WoVoGen). This system is specifically designed to leverage 4D world volume as a foundational element for video generation. Our model operates in two distinct phases: (i) envisioning the future 4D temporal world volume based on vehicle control sequences, and (ii) generating multi-camera videos, informed by this envisioned 4D temporal world volume and sensor interconnectivity. The incorporation of the 4D world volume empowers WoVoGen not only to generate high-quality street-view videos in response to vehicle control inputs but also to facilitate scene editing tasks.

LivePhoto: Real Image Animation with Text-guided Motion Control

  • paper_url: http://arxiv.org/abs/2312.02928
  • repo_url: None
  • paper_authors: Xi Chen, Zhiheng Liu, Mengting Chen, Yutong Feng, Yu Liu, Yujun Shen, Hengshuang Zhao
  • for: Addresses text-guided motion control in image animation, letting users control actions and camera movements in the resulting video through text descriptions.
  • methods: A practical system, LivePhoto, built on a text-to-image generator (Stable Diffusion) extended with an image input and a motion module for temporal modeling, plus a text re-weighting module and a motion intensity estimation module that reduce the ambiguity of the text-to-motion mapping.
  • results: Decodes motion-related text instructions into actions and camera movements in the video, and supports customization via an additional motion-intensity control signal.
    Abstract Despite the recent progress in text-to-video generation, existing studies usually overlook the issue that only spatial contents but not temporal motions in synthesized videos are under the control of text. Towards such a challenge, this work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions. We first establish a strong baseline that helps a well-learned text-to-image generator (i.e., Stable Diffusion) take an image as a further input. We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions. In particular, considering the facts that (1) text can only describe motions roughly (e.g., regardless of the moving speed) and (2) text may include both content and motion descriptions, we introduce a motion intensity estimation module as well as a text re-weighting module to reduce the ambiguity of text-to-motion mapping. Empirical evidence suggests that our approach is capable of well decoding motion-related textual instructions into videos, such as actions, camera movements, or even conjuring new contents from thin air (e.g., pouring water into an empty glass). Interestingly, thanks to the proposed intensity learning mechanism, our system offers users an additional control signal (i.e., the motion intensity) besides text for video customization.

MagicStick: Controllable Video Editing via Control Handle Transformations

  • paper_url: http://arxiv.org/abs/2312.03047
  • repo_url: https://github.com/mayuelala/magicstick
  • paper_authors: Yue Ma, Xiaodong Cun, Yingqing He, Chenyang Qi, Xintao Wang, Ying Shan, Xiu Li, Qifeng Chen
  • for: Proposes a controllable video editing method that edits video properties by transforming specific internal features such as object edge maps or human pose.
  • methods: Transforms extracted internal control signals (e.g., edge maps or human pose), inflating a pre-trained image diffusion model and ControlNet to the temporal dimension with LoRA layers fit to specific scenes; a proposed attention remix between the spatial attention maps of inversion and editing provides attention guidance.
  • results: Unlocks video property editing from a pre-trained text-to-image model, with experiments across diverse scenarios demonstrating superior temporal consistency and editing capability over prior work.
    Abstract Text-based video editing has recently attracted considerable interest in changing the style or replacing objects with a similar structure. Beyond this, we demonstrate that properties such as shape, size, location, motion, etc., can also be edited in videos. Our key insight is that the keyframe transformations of a specific internal feature (e.g., edge maps of objects or human pose) can easily propagate to other frames to provide generation guidance. We thus propose MagicStick, a controllable video editing method that edits video properties by utilizing the transformation on the extracted internal control signals. In detail, to keep the appearance, we inflate both the pretrained image diffusion model and ControlNet to the temporal dimension and train low-rank adaptation (LoRA) layers to fit the specific scenes. Then, for editing, we adopt an inversion-and-editing framework. Notably, the finetuned ControlNet is introduced in both inversion and generation for attention guidance, with a proposed attention remix between the spatial attention maps of inversion and editing. Though succinct, our method is the first to demonstrate video property editing from a pre-trained text-to-image model. We present experiments on numerous examples within our unified framework. We also compare with shape-aware text-based editing and handcrafted motion video generation, demonstrating superior temporal consistency and editing capability compared to previous works. The code and models will be made publicly available.

Split & Merge: Unlocking the Potential of Visual Adapters via Sparse Training

  • paper_url: http://arxiv.org/abs/2312.02923
  • repo_url: https://github.com/theia-4869/mosa
  • paper_authors: Qizhe Zhang, Bocheng Zou, Ruichuan An, Jiaming Liu, Shanghang Zhang
  • for: improves the efficiency and performance of Adapter Tuning so it applies better across diverse tasks and settings.
  • methods: splits the standard adapter into multiple non-overlapping modules, stochastically activates modules for sparse training, and finally merges them into a complete adapter after tuning.
  • results: extensive experiments on 27 visual tasks show that MoSA outperforms other Adapter Tuning methods and baselines by a significant margin, and achieves satisfactory results in low-resource and multi-task settings.
    Abstract With the rapid growth in the scale of pre-trained foundation models, parameter-efficient fine-tuning techniques have gained significant attention, among which Adapter Tuning is the most widely used. Despite achieving efficiency, Adapter Tuning still underperforms full fine-tuning, and the performance improves at the cost of an increase in parameters. Recent efforts address this issue by pruning the original adapters, but it also introduces training instability and suboptimal performance on certain datasets. Motivated by this, we propose Mixture of Sparse Adapters, or MoSA, as a novel Adapter Tuning method to fully unleash the potential of each parameter in the adapter. We first split the standard adapter into multiple non-overlapping modules, then stochastically activate modules for sparse training, and finally merge them to form a complete adapter after tuning. In this way, MoSA can achieve significantly better performance than standard adapters without any additional computational or storage overhead. Furthermore, we propose a hierarchical sparse strategy to better leverage limited training data. Extensive experiments on a series of 27 visual tasks demonstrate that MoSA consistently outperforms other Adapter Tuning methods as well as other baselines by a significant margin. Furthermore, in two challenging scenarios with low-resource and multi-task settings, MoSA achieves satisfactory results, further demonstrating the effectiveness of our design. Our code will be released.
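To make the split-activate-merge recipe above concrete, here is a minimal sketch in PyTorch: it splits a standard adapter's bottleneck into non-overlapping slices, randomly activates one slice per training step, and uses all slices jointly at inference. The module count, dimensions, and the choice to slice along the bottleneck are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SparseAdapter(nn.Module):
    """Standard bottleneck adapter whose hidden units are split into
    non-overlapping modules that are stochastically activated in training."""

    def __init__(self, dim: int = 768, bottleneck: int = 64, n_modules: int = 4):
        super().__init__()
        assert bottleneck % n_modules == 0
        self.n_modules = n_modules
        self.chunk = bottleneck // n_modules
        self.down = nn.Linear(dim, bottleneck)  # down-project
        self.up = nn.Linear(bottleneck, dim)    # up-project
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.down(x))
        if self.training:
            # Sparse training: keep one randomly chosen slice of the
            # bottleneck, so each step mainly updates a single sub-module.
            k = int(torch.randint(self.n_modules, (1,)))
            mask = torch.zeros_like(h)
            mask[..., k * self.chunk:(k + 1) * self.chunk] = 1.0
            h = h * mask
        # At inference every slice is active: the sub-modules are "merged"
        # back into one complete adapter with no extra parameters.
        return x + self.up(h)

adapter = SparseAdapter()
out = adapter(torch.randn(2, 16, 768))  # residual adapter over token features
```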

Fine-grained Controllable Video Generation via Object Appearance and Context

  • paper_url: http://arxiv.org/abs/2312.02919
  • repo_url: None
  • paper_authors: Hsin-Ping Huang, Yu-Chuan Su, Deqing Sun, Lu Jiang, Xuhui Jia, Yukun Zhu, Ming-Hsuan Yang
  • for: enables fine-grained controllable video generation without per-subject finetuning of the model.
  • methods: proposes a unified framework that injects control signals into an existing text-to-video model via a joint encoder and adaptive cross-attention layers to achieve fine-grained control over object appearance and context.
  • results: improves controllability metrics by 70% over competitive baselines.
    Abstract Text-to-video generation has shown promising results. However, by taking only natural languages as input, users often face difficulties in providing detailed information to precisely control the model's output. In this work, we propose fine-grained controllable video generation (FACTOR) to achieve detailed control. Specifically, FACTOR aims to control objects' appearances and context, including their location and category, in conjunction with the text prompt. To achieve detailed control, we propose a unified framework to jointly inject control signals into the existing text-to-video model. Our model consists of a joint encoder and adaptive cross-attention layers. By optimizing the encoder and the inserted layer, we adapt the model to generate videos that are aligned with both text prompts and fine-grained control. Compared to existing methods relying on dense control signals such as edge maps, we provide a more intuitive and user-friendly interface to allow object-level fine-grained control. Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users. Extensive experiments on standard benchmark datasets and user-provided inputs validate that our model obtains a 70% improvement in controllability metrics over competitive baselines.
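The abstract's key mechanism, injecting encoded control signals into a pretrained text-to-video backbone through inserted cross-attention, can be sketched as below. The zero-initialized gate (so the layer starts as an identity) and all dimensions are assumptions for illustration; the paper's joint encoder and exact layer placement are not reproduced here.

```python
import torch
import torch.nn as nn

class ControlCrossAttention(nn.Module):
    """Video tokens attend to encoded object appearance/context tokens;
    the result is added residually so the pretrained model is preserved."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: starts as a no-op

    def forward(self, video_tokens: torch.Tensor,
                control_tokens: torch.Tensor) -> torch.Tensor:
        ctx, _ = self.attn(video_tokens, control_tokens, control_tokens)
        return video_tokens + self.gate * ctx

layer = ControlCrossAttention()
video = torch.randn(2, 64, 512)    # frame/patch tokens from the backbone
control = torch.randn(2, 10, 512)  # encoded object-level control entities
out = layer(video, control)
```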

Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration

  • paper_url: http://arxiv.org/abs/2312.02918
  • repo_url: None
  • paper_authors: Yuang Ai, Huaibo Huang, Xiaoqiang Zhou, Jiexiang Wang, Ran He
  • for: proposes MPerceiver, a multimodal prompt learning approach for all-in-one restoration of images with intricate real-world degradations.
  • methods: harnesses Stable Diffusion (SD) priors to enhance adaptiveness, generalizability, and fidelity; a dual-branch module masters two types of SD prompts (textual for holistic representation, visual for multiscale detail representation), both dynamically adjusted by degradation predictions, and a plug-in detail refinement module further improves restoration fidelity.
  • results: trained on 9 all-in-one restoration tasks, MPerceiver outperforms state-of-the-art task-specific methods on most of them; after multitask pre-training it attains a generalized low-level-vision representation with remarkable zero-shot and few-shot performance on unseen tasks, as confirmed by extensive experiments.
    Abstract Despite substantial progress, all-in-one image restoration (IR) grapples with persistent challenges in handling intricate real-world degradations. This paper introduces MPerceiver: a novel multimodal prompt learning approach that harnesses Stable Diffusion (SD) priors to enhance adaptiveness, generalizability and fidelity for all-in-one image restoration. Specifically, we develop a dual-branch module to master two types of SD prompts: textual for holistic representation and visual for multiscale detail representation. Both prompts are dynamically adjusted by degradation predictions from the CLIP image encoder, enabling adaptive responses to diverse unknown degradations. Moreover, a plug-in detail refinement module improves restoration fidelity via direct encoder-to-decoder information transformation. To assess our method, MPerceiver is trained on 9 tasks for all-in-one IR and outperforms state-of-the-art task-specific methods across most tasks. Post multitask pre-training, MPerceiver attains a generalized representation in low-level vision, exhibiting remarkable zero-shot and few-shot capabilities in unseen tasks. Extensive experiments on 16 IR tasks and 26 benchmarks underscore the superiority of MPerceiver in terms of adaptiveness, generalizability and fidelity.

MIND: Multi-Task Incremental Network Distillation

  • paper_url: http://arxiv.org/abs/2312.02916
  • repo_url: https://github.com/lsabetta/mind
  • paper_authors: Jacopo Bonato, Francesco Pelosin, Luigi Sabetta, Alessandro Nicolosi
  • for: Addressing the challenges of Class-Incremental and Domain-Incremental learning in resource-constrained environments.
  • methods: Two alternative distillation procedures and the optimization of BatchNorm layers across tasks inside sub-networks.
  • results: Outperforms all state-of-the-art methods for rehearsal-free Class-Incremental learning, with an increment in classification accuracy of +6% on CIFAR-100/10 and +10% on TinyImageNet/10, and up to +40% accuracy in Domain-Incremental scenarios.
    Abstract The recent surge in pervasive devices generating dynamic data streams has underscored the necessity for learning systems to adapt to data distributional shifts continually. To tackle this challenge, the research community has put forth a spectrum of methodologies, including the demanding pursuit of class-incremental learning without replay data. In this study, we present MIND, a parameter isolation method that aims to significantly enhance the performance of replay-free solutions and achieve state-of-the-art results on several widely studied datasets. Our approach introduces two main contributions: two alternative distillation procedures that significantly improve the efficiency of MIND, increasing the accumulated knowledge of each sub-network, and the optimization of the BatchNorm layers across tasks inside the sub-networks. Overall, MIND outperforms all the state-of-the-art methods for rehearsal-free Class-Incremental learning (with an increment in classification accuracy of approx. +6% on CIFAR-100/10 and +10% on TinyImageNet/10) reaching up to approx. +40% accuracy in Domain-Incremental scenarios. Moreover, we ablated each contribution to demonstrate its impact on performance improvement. Our results showcase the superior performance of MIND indicating its potential for addressing the challenges posed by Class-incremental and Domain-Incremental learning in resource-constrained environments.
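One ingredient the abstract highlights, task-specific BatchNorm inside shared sub-networks, is easy to sketch. The wiring below (a bank of BatchNorm layers indexed by task id) is an assumption about how such a component could look, not MIND's actual code.

```python
import torch
import torch.nn as nn

class TaskBatchNorm(nn.Module):
    """Keeps separate BatchNorm statistics and affine parameters per task,
    so incremental tasks do not overwrite each other's normalization."""

    def __init__(self, channels: int, n_tasks: int):
        super().__init__()
        self.bns = nn.ModuleList([nn.BatchNorm2d(channels) for _ in range(n_tasks)])

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        return self.bns[task_id](x)

bn = TaskBatchNorm(channels=64, n_tasks=10)
y = bn(torch.randn(8, 64, 16, 16), task_id=3)
```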

Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training

  • paper_url: http://arxiv.org/abs/2312.02914
  • repo_url: None
  • paper_authors: Arun Reddy, William Paul, Corban Rivera, Ketul Shah, Celso M. de Melo, Rama Chellappa
  • for: tackles unsupervised domain adaptation (UDA) for video action recognition.
  • methods: UNITE uses an image teacher model to adapt a video student model to the target domain: first, self-supervised pre-training with a teacher-guided masked distillation objective promotes feature learning on target-domain videos; then the video student and image teacher together generate improved pseudo-labels for unlabeled target videos during self-training, leveraging the strengths of both models.
  • results: evaluated on multiple video domain adaptation benchmarks, with significant improvements over previously reported results.
    Abstract In this work, we tackle the problem of unsupervised domain adaptation (UDA) for video action recognition. Our approach, which we call UNITE, uses an image teacher model to adapt a video student model to the target domain. UNITE first employs self-supervised pre-training to promote discriminative feature learning on target domain videos using a teacher-guided masked distillation objective. We then perform self-training on masked target data, using the video student model and image teacher model together to generate improved pseudolabels for unlabeled target videos. Our self-training process successfully leverages the strengths of both models to achieve strong transfer performance across domains. We evaluate our approach on multiple video domain adaptation benchmarks and observe significant improvements upon previously reported results.
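A hedged sketch of a teacher-guided masked distillation objective in the spirit of the pre-training stage described above: the student sees target-domain tokens with most positions zeroed out and regresses the frozen image teacher's features at the masked positions. Both encoders are stand-in linear layers and the mask ratio is an assumption.

```python
import torch
import torch.nn as nn

teacher = nn.Linear(768, 768).eval()  # stand-in for the frozen image teacher
student = nn.Linear(768, 768)         # stand-in for the video student encoder

tokens = torch.randn(4, 196, 768)     # patch tokens from target-domain frames
mask = torch.rand(4, 196) < 0.75      # hide 75% of the tokens

with torch.no_grad():
    target_feat = teacher(tokens)                       # teacher sees everything
pred = student(tokens * (~mask).unsqueeze(-1).float())  # student sees the rest
loss = ((pred - target_feat) ** 2)[mask].mean()         # distill at masked spots
loss.backward()
```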

Realistic Scatterer Based Adversarial Attacks on SAR Image Classifiers

  • paper_url: http://arxiv.org/abs/2312.02912
  • repo_url: None
  • paper_authors: Tian Ye, Rajgopal Kannan, Viktor Prasanna, Carl Busart, Lance Kaplan
  • for: proposes a new physical adversarial attack, the On-Target Scatterer Attack (OTSA), to mislead SAR image classifiers.
  • methods: uses physical actions to place additional false objects as scatterers around the on-ground target to perturb the SAR image; to ensure physical feasibility, scatterers are constrained to lie on the target rather than in shadow regions or the background.
  • results: experiments show that OTSA obtains significantly higher success rates under the positioning constraint than existing methods.
    Abstract Adversarial attacks have highlighted the vulnerability of classifiers based on machine learning for Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) tasks. An adversarial attack perturbs SAR images of on-ground targets such that the classifiers are misled into making incorrect predictions. However, many existing attacking techniques rely on arbitrary manipulation of SAR images while overlooking the feasibility of executing the attacks on real-world SAR imagery. Instead, adversarial attacks should be able to be implemented by physical actions, for example, placing additional false objects as scatterers around the on-ground target to perturb the SAR image and fool the SAR ATR. In this paper, we propose the On-Target Scatterer Attack (OTSA), a scatterer-based physical adversarial attack. To ensure the feasibility of its physical execution, we enforce a constraint on the positioning of the scatterers. Specifically, we restrict the scatterers to be placed only on the target instead of in the shadow regions or the background. To achieve this, we introduce a positioning score based on Gaussian kernels and formulate an optimization problem for our OTSA attack. Using a gradient ascent method to solve the optimization problem, the OTSA can generate a vector of parameters describing the positions, shapes, sizes and amplitudes of the scatterers to guide the physical execution of the attack that will mislead SAR image classifiers. The experimental results show that our attack obtains significantly higher success rates under the positioning constraint compared with the existing method.
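The Gaussian-kernel positioning score plus gradient ascent described above can be illustrated with a toy objective: each scatterer earns a score that peaks when it lies on a target pixel and decays smoothly away from it, so positions can be optimized by gradient ascent. The target coordinates, kernel width, and optimizer settings are assumptions; real SAR geometry and the full OTSA objective are omitted.

```python
import torch

# Pixel coordinates belonging to the on-ground target (toy values).
target_xy = torch.tensor([[40.0, 40.0], [42.0, 45.0], [45.0, 41.0]])

# Two scatterer positions, initialized near the scene center.
scatterers = (30.0 + 5.0 * torch.randn(2, 2)).requires_grad_()
opt = torch.optim.Adam([scatterers], lr=0.5)
sigma = 2.0  # kernel width: controls how sharply "on target" is enforced

for _ in range(200):
    # Squared distance from every scatterer to every target pixel.
    d2 = ((scatterers[:, None, :] - target_xy[None, :, :]) ** 2).sum(-1)
    # Gaussian positioning score: high only when a scatterer sits on target.
    score = torch.exp(-d2 / (2 * sigma ** 2)).max(dim=1).values.sum()
    (-score).backward()  # gradient ascent on the positioning score
    opt.step()
    opt.zero_grad()
```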

Rare Galaxy Classes Identified In Foundation Model Representations

  • paper_url: http://arxiv.org/abs/2312.02910
  • repo_url: None
  • paper_authors: Mike Walmsley, Anna M. M. Scaife
  • for: identify rare and visually distinctive galaxy populations
  • methods: use pretrained models to search for structure in learned representations, cluster approach to isolate specific local patterns
  • results: reveal groups of galaxies with rare and scientifically-interesting morphologies
    Abstract We identify rare and visually distinctive galaxy populations by searching for structure within the learned representations of pretrained models. We show that these representations arrange galaxies by appearance in patterns beyond those needed to predict the pretraining labels. We design a clustering approach to isolate specific local patterns, revealing groups of galaxies with rare and scientifically-interesting morphologies.
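The recipe the abstract describes (embed with a pretrained model, look for structure, isolate small clusters as candidate rare populations) reduces to a few lines. The random array below stands in for real galaxy embeddings, and the cluster count is an arbitrary assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

emb = np.random.randn(10_000, 512)  # placeholder for pretrained-model features
labels = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(emb)

sizes = np.bincount(labels)
rare = np.argsort(sizes)[:5]        # smallest clusters first
print("candidate rare-morphology clusters:", rare, "sizes:", sizes[rare])
```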

Deep Learning Segmentation of Spiral Arms and Bars

  • paper_url: http://arxiv.org/abs/2312.02908
  • repo_url: https://github.com/mwalmsley/zoobot-3d
  • paper_authors: Mike Walmsley, Ashley Spindler
  • for: develops a deep learning model for segmenting galactic spiral arms and bars.
  • methods: a deep learning segmentation model whose predicted masks, in a blinded assessment by expert astronomers, are preferred over current automated methods (99% of evaluations) and the original volunteer labels (79% of evaluations).
  • results: experts rated the predicted spiral arm masks 'mostly good' to 'perfect' in 89% of evaluations; bar lengths derived from the predicted bar masks agree excellently with a dedicated crowdsourcing project, and the pixelwise precision of the masks, previously impossible at scale, will underpin new research into how spiral arms and bars evolve.
    Abstract We present the first deep learning model for segmenting galactic spiral arms and bars. In a blinded assessment by expert astronomers, our predicted spiral arm masks are preferred over both current automated methods (99% of evaluations) and our original volunteer labels (79% of evaluations). Experts rated our spiral arm masks as `mostly good' to `perfect' in 89% of evaluations. Bar lengths trivially derived from our predicted bar masks are in excellent agreement with a dedicated crowdsourcing project. The pixelwise precision of our masks, previously impossible at scale, will underpin new research into how spiral arms and bars evolve.

HeadGaS: Real-Time Animatable Head Avatars via 3D Gaussian Splatting

  • paper_url: http://arxiv.org/abs/2312.02902
  • repo_url: None
  • paper_authors: Helisa Dhamo, Yinyu Nie, Arthur Moreau, Jifei Song, Richard Shaw, Yiren Zhou, Eduardo Pérez-Pellitero
  • for: 3D head animation quality and runtime improvement
  • methods: 3D Gaussian Splats (3DGS) and hybrid model with learnable latent features
  • results: state-of-the-art results in real-time inference frame rates, with up to ~2dB improvement and x10 acceleration in rendering speed
    Abstract 3D head animation has seen major quality and runtime improvements over the last few years, particularly empowered by the advances in differentiable rendering and neural radiance fields. Real-time rendering is a highly desirable goal for real-world applications. We propose HeadGaS, the first model to use 3D Gaussian Splats (3DGS) for 3D head reconstruction and animation. In this paper we introduce a hybrid model that extends the explicit representation from 3DGS with a base of learnable latent features, which can be linearly blended with low-dimensional parameters from parametric head models to obtain expression-dependent final color and opacity values. We demonstrate that HeadGaS delivers state-of-the-art results in real-time inference frame rates, which surpasses baselines by up to ~2dB, while accelerating rendering speed by over x10.
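The hybrid representation described above (per-Gaussian learnable latent features linearly blended by low-dimensional expression parameters, then decoded to color and opacity) can be sketched as follows. The tensor shapes and the tiny MLP decoder are illustrative assumptions.

```python
import torch
import torch.nn as nn

n_gauss, n_expr, feat_dim = 10_000, 32, 16

# One learnable feature per Gaussian and per expression basis element.
basis = nn.Parameter(torch.randn(n_gauss, n_expr, feat_dim) * 0.01)
decoder = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 4))

expr = torch.rand(n_expr)  # expression code from a parametric head model
latent = torch.einsum('e,gef->gf', expr, basis)  # linear blend per Gaussian
rgba = decoder(latent)     # expression-dependent color + opacity per Gaussian
color, opacity = rgba[:, :3], torch.sigmoid(rgba[:, 3])
```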

Diversified in-domain synthesis with efficient fine-tuning for few-shot classification

  • paper_url: http://arxiv.org/abs/2312.03046
  • repo_url: https://github.com/vturrisi/disef
  • paper_authors: Victor G. Turrisi da Costa, Nicola Dall’Asen, Yiming Wang, Nicu Sebe, Elisa Ricci
  • for: improves the generalization of few-shot image classifiers so they adapt better across image classification tasks.
  • methods: generates high-quality synthetic images with a text-to-image model to promote in-domain sample diversity, and applies efficient fine-tuning (LoRA) jointly to the text and image encoders of a vision-language model.
  • results: consistently outperforms baselines across ten different benchmarks, establishing a new state of the art for few-shot classification.
    Abstract Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class. A recent research direction for improving few-shot classifiers involves augmenting the labelled samples with synthetic images created by state-of-the-art text-to-image generation models. Following this trend, we propose Diversified in-domain synthesis with efficient fine-tuning (DISEF), a novel approach which addresses the generalization challenge in few-shot learning using synthetic data. DISEF consists of two main components. First, we propose a novel text-to-image augmentation pipeline that, by leveraging the real samples and their rich semantics coming from an advanced captioning model, promotes in-domain sample diversity for better generalization. Second, we emphasize the importance of effective model fine-tuning in few-shot recognition, proposing to use Low-Rank Adaptation (LoRA) for joint adaptation of the text and image encoders in a Vision Language Model. We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification. Code is available at https://github.com/vturrisi/disef
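The efficient fine-tuning half of the method is LoRA applied to both encoders of a vision-language model. A minimal LoRA wrapper looks like this; the rank, scaling, and initialization follow common LoRA practice and are assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(512, 512))
y = layer(torch.randn(4, 512))  # only A and B receive gradients
```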

BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models

  • paper_url: http://arxiv.org/abs/2312.02896
  • repo_url: https://github.com/aifeg/benchlmm
  • paper_authors: Rizhao Cai, Zirui Song, Dayan Guan, Zhenhao Chen, Xing Luo, Chenyu Yi, Alex Kot
  • for: assesses the robustness of Large Multimodal Models (LMMs) under diverse style shifts.
  • methods: proposes BenchLMM, a benchmark spanning three styles (artistic image style, imaging-sensor style, and application style, each with five sub-styles), and comprehensively evaluates state-of-the-art LMMs, finding that (1) performance generally degrades under other styles, (2) strong performance in a common style does not guarantee strength in other styles, and (3) prompting the LMM to predict the style first improves its reasoning without any training.
  • results: the study suggests that developing more intelligent and versatile LMMs requires a better understanding of their behavior across styles; the benchmark and analysis aim to inform that development.
    Abstract Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable capabilities in visual reasoning with common image styles. However, their robustness against diverse style shifts, crucial for practical applications, remains largely unexplored. In this paper, we propose a new benchmark, BenchLMM, to assess the robustness of LMMs against three different styles: artistic image style, imaging sensor style, and application style, where each style has five sub-styles. Utilizing BenchLMM, we comprehensively evaluate state-of-the-art LMMs and reveal: 1) LMMs generally suffer performance degradation when working with other styles; 2) An LMM performs better than another model in common style does not guarantee its superior performance in other styles; 3) LMMs' reasoning capability can be enhanced by prompting LMMs to predict the style first, based on which we propose a versatile and training-free method for improving LMMs; 4) An intelligent LMM is expected to interpret the causes of its errors when facing stylistic variations. We hope that our benchmark and analysis can shed new light on developing more intelligent and versatile LMMs.
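The training-free improvement the benchmark reports (ask the model to name the style before answering) amounts to a two-step prompt. `query_lmm` below is a hypothetical callable standing in for whatever LMM API is in use, and the prompt wording is an assumption.

```python
def answer_with_style_hint(query_lmm, image, question: str) -> str:
    """Two-step, training-free prompting: predict the style, then answer."""
    style = query_lmm(
        image,
        "What visual style is this image (e.g., sketch, infrared, painting, "
        "X-ray)? Answer in a few words.",
    )
    hinted = f"This image is in {style} style. Keeping that in mind, {question}"
    return query_lmm(image, hinted)
```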

Customization Assistant for Text-to-image Generation

  • paper_url: http://arxiv.org/abs/2312.03045
  • repo_url: https://github.com/Sfedfcv/redesigned-pancake
  • paper_authors: Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Tong Sun
  • for: proposes a customization assistant built on a pre-trained large language model and diffusion model that supports personalized generation without any fine-tuning.
  • methods: a new model design and a novel training strategy let the assistant complete customized generation in 2-5 seconds without any test-time fine-tuning, while supporting user-friendly chat with either ambiguous text or clear instructions.
  • results: experiments show competitive results across different domains, demonstrating the effectiveness of the proposed method.
    Abstract Customizing pre-trained text-to-image generation models has attracted massive research interest recently, due to its huge potential in real-world applications. Although existing methods are able to generate creative content for a novel concept contained in a single user-input image, their capabilities are still far from perfect. Specifically, most existing methods require fine-tuning the generative model on testing images. Some existing methods do not require fine-tuning, but their performance is unsatisfactory. Furthermore, the interaction between users and models is still limited to directive and descriptive prompts such as instructions and captions. In this work, we build a customization assistant based on a pre-trained large language model and diffusion model, which can not only perform customized generation in a tuning-free manner, but also enable more user-friendly interactions: users can chat with the assistant and input either ambiguous text or clear instructions. Specifically, we propose a new framework consisting of a new model design and a novel training strategy. The resulting assistant can perform customized generation in 2-5 seconds without any test-time fine-tuning. Extensive experiments are conducted, and competitive results have been obtained across different domains, illustrating the effectiveness of the proposed method.

Towards More Practical Group Activity Detection: A New Benchmark and Model

  • paper_url: http://arxiv.org/abs/2312.02878
  • repo_url: https://github.com/dk-kim/CAFE_codebase
  • paper_authors: Dongkeun Kim, Youngkil Song, Minsu Cho, Suha Kwak
  • for: improves existing group activity detection (GAD) methods and datasets to better address practical scenarios.
  • methods: proposes a new GAD model that efficiently and effectively handles an unknown number of groups and latent group members, together with Café, a new large-scale dataset built primarily for GAD with more practical evaluation scenarios, metrics, and rich annotations.
  • results: tested on three datasets including Café, the model outperforms previous work in both accuracy and inference speed.
    Abstract Group activity detection (GAD) is the task of identifying members of each group and classifying the activity of the group at the same time in a video. While GAD has been studied recently, there is still much room for improvement in both dataset and methodology due to their limited capability to address practical GAD scenarios. To resolve these issues, we first present a new dataset, dubbed Café. Unlike existing datasets, Café is constructed primarily for GAD and presents more practical evaluation scenarios and metrics, as well as being large-scale and providing rich annotations. Along with the dataset, we propose a new GAD model that deals with an unknown number of groups and latent group members efficiently and effectively. We evaluated our model on three datasets including Café, where it outperformed previous work in terms of both accuracy and inference speed. Both our dataset and code base will be open to the public to promote future research on GAD.

A Dynamic Network for Efficient Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2312.02877
  • repo_url: None
  • paper_authors: Yang Ai, Xi Yang
  • For: improves the accuracy and efficiency of point cloud registration, addressing non-overlapping points that consume extensive computational resources.
  • Methods: introduces a dynamic approach, widely used to improve network efficiency in computer vision tasks, to registration: an iterative process identifies regions where matching points cluster and ultimately removes noisy points, combining deep global sampling for coarse registration, a refined node proposal module for local registration, and a spatial-consistency-based classifier that terminates once confidence is sufficient.
  • Results: achieves a speed improvement over methods with similar results (41.2% faster on 3DMatch, 33.4% on KITTI) while maintaining competitive registration recall.
    Abstract For the point cloud registration task, a significant challenge arises from non-overlapping points that consume extensive computational resources while negatively affecting registration accuracy. In this paper, we introduce a dynamic approach, widely utilized to improve network efficiency in computer vision tasks, to the point cloud registration task. We employ an iterative registration process on point cloud data multiple times to identify regions where matching points cluster, ultimately enabling us to remove noisy points. Specifically, we begin with deep global sampling to perform coarse global registration. Subsequently, we employ the proposed refined node proposal module to further narrow down the registration region and perform local registration. Furthermore, we utilize a spatial consistency-based classifier to evaluate the results of each registration stage. The model terminates once it reaches sufficient confidence, avoiding unnecessary computations. Extended experiments demonstrate that our model significantly reduces time consumption compared to other methods with similar results, achieving a speed improvement of over 41% on indoor dataset (3DMatch) and 33% on outdoor datasets (KITTI) while maintaining competitive registration recall requirements.

RotaTR: Detection Transformer for Dense and Rotated Object

  • paper_url: http://arxiv.org/abs/2312.02821
  • repo_url: None
  • paper_authors: Zhu Yuke, Ruan Yumeng, Yang Lei, Guo Sheng
  • for: studies dense and rotated object detection and improves DETR's performance on this task.
  • methods: proposes the Rotated object detection TRansformer (RotaTR), an extension of DETR with Rotation Sensitive deformable (RSDeform) attention that strengthens DETR's ability to detect oriented targets, used to build the feature alignment module and a rotation-sensitive decoder.
  • results: on four challenging oriented benchmarks, RotaTR shows a large advantage over the original DETR in detecting dense and rotated objects and achieves results competitive with state-of-the-art CNN-based detectors.
    Abstract Detecting objects in dense and rotated scenes is a challenging task. Recent works on this topic are mostly based on Faster RCNN or RetinaNet. As they are highly dependent on pre-set dense anchors and the NMS operation, the approach is indirect and suboptimal. End-to-end DETR-based detectors have achieved great success in horizontal object detection and many other areas such as segmentation, tracking, and action recognition. However, DETR-based detectors perform poorly on dense rotated target tasks, worse than most modern CNN-based detectors. In this paper, we find that the most significant reason for the poor performance is that the original attention cannot accurately focus on oriented targets. Accordingly, we propose the Rotated object detection TRansformer (RotaTR) as an extension of DETR to oriented detection. Specifically, we design Rotation Sensitive deformable (RSDeform) attention to enhance DETR's ability to detect oriented targets. It is used to build the feature alignment module and rotation-sensitive decoder for our model. We test RotaTR on four challenging oriented benchmarks. It shows a great advantage in detecting dense and oriented objects compared to the original DETR. It also achieves competitive results compared to the state of the art.

Deterministic Guidance Diffusion Model for Probabilistic Weather Forecasting

  • paper_url: http://arxiv.org/abs/2312.02819
  • repo_url: https://github.com/donggeun-yoon/dgdm
  • paper_authors: Donggeun Yoon, Minseok Seo, Doyi Kim, Yeji Choi, Donghyeon Cho
  • for: balances probabilistic prediction and deterministic accuracy in weather forecasting.
  • methods: proposes the Deterministic Guidance Diffusion Model (DGDM), which trains a deterministic model and a probabilistic model end-to-end in the forward process; in the reverse process, the deterministic model's prediction serves as an intermediate starting point for the probabilistic model.
  • results: evaluated on the global weather forecasting dataset (WeatherBench) and the common video frame prediction benchmark (Moving MNIST), and validated for high-resolution regional forecasting on the PNW-Typhoon weather satellite dataset; DGDM achieves state-of-the-art results in both global and regional forecasting.
    Abstract Weather forecasting requires not only accuracy but also the ability to perform probabilistic prediction. However, deterministic weather forecasting methods do not support probabilistic predictions, and conversely, probabilistic models tend to be less accurate. To address these challenges, in this paper, we introduce the Deterministic Guidance Diffusion Model (DGDM) for probabilistic weather forecasting, integrating the benefits of both deterministic and probabilistic approaches. During the forward process, both the deterministic and probabilistic models are trained end-to-end. In the reverse process, weather forecasting leverages the predicted result from the deterministic model, using it as an intermediate starting point for the probabilistic model. By fusing deterministic models with probabilistic models in this manner, DGDM is capable of providing accurate forecasts while also offering probabilistic predictions. To evaluate DGDM, we assess it on the global weather forecasting dataset (WeatherBench) and the common video frame prediction benchmark (Moving MNIST). We also introduce and evaluate the Pacific Northwest Windstorm (PNW)-Typhoon weather satellite dataset to verify the effectiveness of DGDM in high-resolution regional forecasting. As a result of our experiments, DGDM achieves state-of-the-art results not only in global forecasting but also in regional forecasting. The code is available at https://github.com/DongGeun-Yoon/DGDM.
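Read from the abstract alone, the reverse process resembles starting diffusion sampling from a partially noised version of the deterministic forecast rather than from pure noise. The sketch below expresses that idea; the schedule, the starting step, and the placeholder denoiser are assumptions, not DGDM's actual formulation.

```python
import torch

def sample_guided(forecast, denoise_step, alphas_cumprod, t_start=400):
    """Noise the deterministic forecast up to level t_start, then run only
    the tail of the reverse diffusion chain so samples stay anchored to it.

    denoise_step(x, t) -> x at step t-1; alphas_cumprod: (T,) noise schedule.
    """
    a = alphas_cumprod[t_start]
    x = a.sqrt() * forecast + (1 - a).sqrt() * torch.randn_like(forecast)
    for t in range(t_start, 0, -1):
        x = denoise_step(x, t)
    return x

dummy_denoiser = lambda x, t: 0.999 * x  # placeholder for a trained model
out = sample_guided(torch.zeros(1, 1, 32, 32), dummy_denoiser,
                    torch.linspace(0.9999, 0.01, 1000))
```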

Generating Fine-Grained Human Motions Using ChatGPT-Refined Descriptions

  • paper_url: http://arxiv.org/abs/2312.02772
  • repo_url: None
  • paper_authors: Xu Shi, Chuanchen Luo, Junran Peng, Hongwen Zhang, Yunlian Sun
  • for: proposes the Fine-Grained Human Motion Diffusion Model (FG-MDM), a divide-and-conquer approach to generating fine-grained human motions.
  • methods: uses a large language model (GPT-3.5) to refine vague textual descriptions into fine-grained descriptions of different body parts, which then guide a transformer-based diffusion model for motion generation.
  • results: experiments show FG-MDM outperforms previous methods, especially in generalization beyond the training distribution, generating fine-grained and stylized motions across varied poses and settings.
    Abstract Recently, significant progress has been made in text-based motion generation, enabling the generation of diverse and high-quality human motions that conform to textual descriptions. However, it remains challenging to generate fine-grained or stylized motions due to the lack of datasets annotated with detailed textual descriptions. By adopting a divide-and-conquer strategy, we propose a new framework named Fine-Grained Human Motion Diffusion Model (FG-MDM) for human motion generation. Specifically, we first parse previous vague textual annotation into fine-grained description of different body parts by leveraging a large language model (GPT-3.5). We then use these fine-grained descriptions to guide a transformer-based diffusion model. FG-MDM can generate fine-grained and stylized motions even outside of the distribution of the training data. Our experimental results demonstrate the superiority of FG-MDM over previous methods, especially the strong generalization capability. We will release our fine-grained textual annotations for HumanML3D and KIT.
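The parsing step (an LLM rewrites a vague motion caption into per-body-part descriptions that then condition the diffusion model) can be sketched as a prompt plus a small parser. `chat` is a hypothetical callable for an LLM endpoint, and the prompt wording and body-part list are assumptions.

```python
PARTS = ["head", "torso", "left arm", "right arm", "left leg", "right leg"]

def fine_grained_description(chat, caption: str) -> dict:
    """Ask an LLM to expand a vague caption into one line per body part."""
    prompt = (
        f"Rewrite the motion '{caption}' as one short sentence per body part "
        f"({', '.join(PARTS)}), formatted as 'part: description'."
    )
    reply = chat(prompt)
    parts = {}
    for line in reply.splitlines():
        if ":" in line:
            part, desc = line.split(":", 1)
            parts[part.strip().lower()] = desc.strip()
    return parts
```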

SEVA: Leveraging sketches to evaluate alignment between human and machine visual abstraction

  • paper_url: http://arxiv.org/abs/2312.03035
  • repo_url: https://github.com/cogtoolslab/visual_abstractions_benchmarking_public2023
  • paper_authors: Kushin Mukherjee, Holly Huey, Xuanchen Lu, Yael Vinker, Rio Aguina-Kang, Ariel Shamir, Judith E. Fan
  • for: evaluates whether current vision algorithms understand human-drawn sketches in a human-like way, given that sketches are sparse and can reliably evoke multiple meanings.
  • methods: introduces SEVA, a new benchmark dataset of roughly 90K human-generated sketches of 128 object concepts produced under different time constraints (thus systematically varying in sparsity), and evaluates state-of-the-art vision algorithms on identifying the target concept and matching human response patterns.
  • results: algorithms that better predicted human sketch recognition also better approximated human uncertainty about sketch meaning, but a sizable gap remains between model and human response patterns; a generative algorithm informed by human visual abstraction can produce sketches of varying sparsity.
    Abstract Sketching is a powerful tool for creating abstract images that are sparse but meaningful. Sketch understanding poses fundamental challenges for general-purpose vision algorithms because it requires robustness to the sparsity of sketches relative to natural visual inputs and because it demands tolerance for semantic ambiguity, as sketches can reliably evoke multiple meanings. While current vision algorithms have achieved high performance on a variety of visual tasks, it remains unclear to what extent they understand sketches in a human-like way. Here we introduce SEVA, a new benchmark dataset containing approximately 90K human-generated sketches of 128 object concepts produced under different time constraints, and thus systematically varying in sparsity. We evaluated a suite of state-of-the-art vision algorithms on their ability to correctly identify the target concept depicted in these sketches and to generate responses that are strongly aligned with human response patterns on the same sketch recognition task. We found that vision algorithms that better predicted human sketch recognition performance also better approximated human uncertainty about sketch meaning, but there remains a sizable gap between model and human response patterns. To explore the potential of models that emulate human visual abstraction in generative tasks, we conducted further evaluations of a recently developed sketch generation algorithm (Vinker et al., 2022) capable of generating sketches that vary in sparsity. We hope that public release of this dataset and evaluation protocol will catalyze progress towards algorithms with enhanced capacities for human-like visual abstraction.

Learning Cortical Anomaly through Masked Encoding for Unsupervised Heterogeneity Mapping

  • paper_url: http://arxiv.org/abs/2312.02762
  • repo_url: https://github.com/chadHGY/CAM
  • paper_authors: Hao-Chun Yang, Ole Andreassen, Lars Tjelta Westlye, Andre F. Marquand, Christian F. Beckmann, Thomas Wolfers
  • For: detecting complex brain disorders, particularly psychiatric ones, where symptoms are heterogeneous and reliable biomarkers are lacking.
  • Methods: CAM (Cortical Anomaly detection through Masked image modeling), a novel self-supervised framework that detects complex brain disorders from cortical surface features without supervision.
  • Results: applied to individuals on the psychotic spectrum, CAM reaches an AUC of 0.696 for schizoaffective disorder and 0.769 for schizophreniform disorder without any labels; the atypical cortical regions identified, including Pars Triangularis and several frontal areas often implicated in schizophrenia, lend further confidence to the approach.
    Abstract The detection of heterogeneous mental disorders based on brain readouts remains challenging due to the complexity of symptoms and the absence of reliable biomarkers. This paper introduces CAM (Cortical Anomaly Detection through Masked Image Modeling), a novel self-supervised framework designed for the unsupervised detection of complex brain disorders using cortical surface features. We employ this framework for the detection of individuals on the psychotic spectrum and demonstrate its capabilities compared to state-of-the-art methods, achieving an AUC of 0.696 for Schizoaffective and 0.769 for Schizophreniform, without the need for any labels. Furthermore, the analysis of atypical cortical regions, including Pars Triangularis and several frontal areas often implicated in schizophrenia, provides further confidence in our approach. Altogether, we demonstrate a scalable approach for anomaly detection of complex brain disorders based on cortical abnormalities.

C3: High-performance and low-complexity neural compression from a single image or video

  • paper_url: http://arxiv.org/abs/2312.02753
  • repo_url: None
  • paper_authors: Hyunjik Kim, Matthias Bauer, Lucas Theis, Jonathan Richard Schwarz, Emilien Dupont
  • for: proposes a neural compression method that achieves strong rate-distortion performance from small models, improving the compression-quality trade-off at low complexity.
  • methods: overfits a small model to each image or video separately instead of training a large generalizing model, building on COOL-CHIC with several simple and effective improvements for images and new methodology for videos.
  • results: matches the RD performance of VTM (the reference implementation of the H.266 codec) on the CLIC2020 image benchmark and of a neural video codec on the UVG video benchmark, at an order of magnitude lower decoding complexity.
    Abstract Most neural compression models are trained on large datasets of images or videos in order to generalize to unseen data. Such generalization typically requires large and expressive architectures with a high decoding complexity. Here we introduce C3, a neural compression method with strong rate-distortion (RD) performance that instead overfits a small model to each image or video separately. The resulting decoding complexity of C3 can be an order of magnitude lower than neural baselines with similar RD performance. C3 builds on COOL-CHIC (Ladune et al.) and makes several simple and effective improvements for images. We further develop new methodology to apply C3 to videos. On the CLIC2020 image benchmark, we match the RD performance of VTM, the reference implementation of the H.266 codec, with less than 3k MACs/pixel for decoding. On the UVG video benchmark, we match the RD performance of the Video Compression Transformer (Mentzer et al.), a well-established neural video codec, with less than 5k MACs/pixel for decoding.
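The core training setup, overfitting a small synthesis model to one image with a rate-distortion objective L = D + lambda * R, fits in a short loop. The toy latent grid, upsampling decoder, and L1 proxy for the rate term are placeholders; C3 itself uses learned latent grids with an autoregressive entropy model.

```python
import torch
import torch.nn as nn

img = torch.rand(1, 3, 64, 64)  # the single image being compressed

latent = nn.Parameter(torch.zeros(1, 8, 16, 16))  # per-image latent grid
synth = nn.Sequential(nn.Upsample(scale_factor=4),
                      nn.Conv2d(8, 3, 3, padding=1))  # tiny decoder
opt = torch.optim.Adam([latent, *synth.parameters()], lr=1e-2)
lam = 0.01  # rate-distortion trade-off

for _ in range(200):
    recon = torch.sigmoid(synth(latent))
    distortion = ((recon - img) ** 2).mean()
    rate = latent.abs().mean()  # crude stand-in for an entropy/bitrate term
    loss = distortion + lam * rate
    opt.zero_grad()
    loss.backward()
    opt.step()
```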

C-NERF: Representing Scene Changes as Directional Consistency Difference-based NeRF

  • paper_url: http://arxiv.org/abs/2312.02751
  • repo_url: https://github.com/c-nerf/c-nerf
  • paper_authors: Rui Huang, Binbin Jiang, Qingyi Zhao, William Wang, Yuxiang Zhang, Qing Guo
  • for: detecting object changes in scenes represented by neural radiance fields (NeRFs).
  • methods: proposes a directional-consistency-based NeRF representation with three modules: spatially aligning the two NeRFs, identifying change points under a direction-consistency constraint, and rendering a change map from the constructed NeRFs.
  • results: shows a significant advantage over state-of-the-art 2D change detection and NeRF-based methods, accurately detecting object changes in the scene.
    Abstract In this work, we aim to detect the changes caused by object variations in a scene represented by neural radiance fields (NeRFs). Given an arbitrary view and two sets of scene images captured at different timestamps, we can predict the scene changes in that view, which has significant potential applications in scene monitoring and measuring. We conducted preliminary studies and found that such an exciting task cannot be easily achieved by utilizing existing NeRF and 2D change detection methods, which produce many false or missing detections. The main reason is that 2D change detection is based on the pixel appearance difference between spatially aligned image pairs and neglects the stereo information in the NeRF. To address these limitations, we propose C-NERF, which represents scene changes as a directional-consistency-difference-based NeRF and mainly contains three modules. We first perform the spatial alignment of two NeRFs captured before and after changes. Then, we identify the change points based on the direction-consistency constraint; that is, real change points have similar change representations across view directions, but fake change points do not. Finally, we design the change map rendering process based on the built NeRFs and can generate the change map of an arbitrarily specified view direction. To validate the effectiveness, we build a new dataset containing ten scenes covering diverse scenarios with different changing objects. Our approach surpasses state-of-the-art 2D change detection and NeRF-based methods by a significant margin.
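The direction-consistency test described above can be phrased as a simple filter: a candidate is kept as a real change only if its change magnitude is both large on average and stable across viewing directions. The per-direction change values and thresholds below are placeholders for quantities rendered from the two aligned NeRFs.

```python
import torch

def consistent_changes(change_per_dir: torch.Tensor,
                       change_thr: float = 0.1,
                       std_thr: float = 0.05) -> torch.Tensor:
    """change_per_dir: (n_points, n_dirs) change magnitudes of each candidate
    point rendered along n_dirs view directions from the before/after NeRFs."""
    changed = change_per_dir.mean(dim=1) > change_thr  # large change overall
    stable = change_per_dir.std(dim=1) < std_thr       # similar in every direction
    return changed & stable                            # real changes only

mask = consistent_changes(torch.rand(1000, 12))
```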

LiDAR-based Person Re-identification

  • paper_url: http://arxiv.org/abs/2312.03033
  • repo_url: None
  • paper_authors: Wenxuan Guo, Zhiyu Pan, Yingping Liang, Ziheng Xi, Zhi Chen Zhong, Jianjiang Feng, Jie Zhou
  • for: improves the accuracy and reliability of person re-identification (ReID), where camera-based systems face limitations.
  • methods: proposes ReID3D, a LiDAR-based ReID framework that uses a pre-training strategy to learn 3D body-shape features and introduces a Graph-based Complementary Enhancement Encoder to extract comprehensive features.
  • results: extensive experiments on the LReID dataset show excellent performance, with a rank-1 accuracy of 94.0%, indicating that LiDAR handles person ReID tasks well.
    Abstract Camera-based person re-identification (ReID) systems have been widely applied in the field of public security. However, cameras often lack the perception of 3D morphological information of humans and are susceptible to various limitations, such as inadequate illumination, complex background, and personal privacy. In this paper, we propose a LiDAR-based ReID framework, ReID3D, that utilizes a pre-training strategy to retrieve features of 3D body shape and introduces a Graph-based Complementary Enhancement Encoder for extracting comprehensive features. Due to the lack of LiDAR datasets, we build LReID, the first LiDAR-based person ReID dataset, which is collected in several outdoor scenes with variations in natural conditions. Additionally, we introduce LReID-sync, a simulated pedestrian dataset designed for pre-training encoders with tasks of point cloud completion and shape parameter learning. Extensive experiments on LReID show that ReID3D achieves exceptional performance with a rank-1 accuracy of 94.0, highlighting the significant potential of LiDAR in addressing person ReID tasks. To the best of our knowledge, we are the first to propose a solution for LiDAR-based ReID. The code and datasets will be released soon.

R3D-SWIN: Use Shifted Window Attention for Single-View 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2312.02725
  • repo_url: None
  • paper_authors: Chenhuan Li, Meihua Xiao, zehuan li, Mengxi Gao
  • for: improving single-view 3D reconstruction accuracy.
  • methods: a shifted-window-attention voxel 3D reconstruction network.
  • results: state-of-the-art single-view reconstruction accuracy on ShapeNet.
    Abstract Recently, vision transformers have performed well in various computer vision tasks, including voxel 3D reconstruction. However, the windows of the vision transformer are not multi-scale, and there is no connection between the windows, which limits the accuracy of voxel 3D reconstruction. Therefore, we propose a shifted-window-attention voxel 3D reconstruction network. To the best of our knowledge, this is the first work to apply shifted window attention to voxel 3D reconstruction. Experimental results on ShapeNet verify that our method achieves SOTA accuracy in single-view reconstruction.

MyPortrait: Morphable Prior-Guided Personalized Portrait Generation

  • paper_url: http://arxiv.org/abs/2312.02703
  • repo_url: None
  • paper_authors: Bo Ding, Zhenfeng Fan, Shuang Yang, Shihong Xia
  • for: studies the generation of realistic talking faces, a long-standing problem in computer vision.
  • methods: proposes Myportrait, a simple, general, and flexible neural framework that generates personalized portrait animation from a monocular video of a single person, incorporating a personalized prior from the video and a morphable prior from 3D face morphable space.
  • results: outperforms existing methods across various metrics, and offers both an online (real-time) and an offline (high-quality) version, depending on whether the test data is used in training.
    Abstract Generating realistic talking faces is an interesting and long-standing topic in the field of computer vision. Although significant progress has been made, it is still challenging to generate high-quality dynamic faces with personalized details. This is mainly due to the inability of the general model to represent personalized details and the generalization problem to unseen controllable parameters. In this work, we propose Myportrait, a simple, general, and flexible framework for neural portrait generation. We incorporate personalized prior in a monocular video and morphable prior in 3D face morphable space for generating personalized details under novel controllable parameters. Our proposed framework supports both video-driven and audio-driven face animation given a monocular video of a single person. Distinguished by whether the test data is sent to training or not, our method provides a real-time online version and a high-quality offline version. Comprehensive experiments in various metrics demonstrate the superior performance of our method over the state-of-the-art methods. The code will be publicly available.

Neural Sign Actors: A diffusion model for 3D sign language production from text

  • paper_url: http://arxiv.org/abs/2312.02702
  • repo_url: None
  • paper_authors: Vasileios Baltatzis, Rolandos Alexandros Potamias, Evangelos Ververas, Guanxiong Sun, Jiankang Deng, Stefanos Zafeiriou
  • for: improving the realism and semantic precision of sign language production.
  • methods: a diffusion process built on a novel anatomically informed graph neural network defined on the SMPL-X body skeleton.
  • results: considerably outperforms previous sign language production methods in quantitative and qualitative comparisons.
    Abstract Sign Languages (SL) serve as the predominant mode of communication for the Deaf and Hard of Hearing communities. The advent of deep learning has aided numerous methods in SL recognition and translation, achieving remarkable results. However, Sign Language Production (SLP) poses a challenge for the computer vision community as the motions generated must be realistic and have precise semantic meanings. Most SLP methods rely on 2D data, thus impeding their ability to attain a necessary level of realism. In this work, we propose a diffusion-based SLP model trained on a curated large-scale dataset of 4D signing avatars and their corresponding text transcripts. The proposed method can generate dynamic sequences of 3D avatars from an unconstrained domain of discourse using a diffusion process formed on a novel and anatomically informed graph neural network defined on the SMPL-X body skeleton. Through a series of quantitative and qualitative experiments, we show that the proposed method considerably outperforms previous methods of SLP. We believe that this work presents an important and necessary step towards realistic neural sign avatars, bridging the communication gap between Deaf and hearing communities. The code, method and generated data will be made publicly available.

Revisit Human-Scene Interaction via Space Occupancy

  • paper_url: http://arxiv.org/abs/2312.02700
  • repo_url: None
  • paper_authors: Xinpeng Liu, Haowen Hou, Yanchao Yang, Yong-Lu Li, Cewu Lu
  • for: addresses the limited data scale in Human-Scene Interaction (HSI) generation, where high-quality data capturing humans and 3D environments simultaneously is rare, limiting data diversity and complexity.
  • methods: proposes a novel Human-Occupancy Interaction view that treats pure motion sequences as records of humans interacting with the scene's space occupancy, aggregating motion-only data into a large-scale paired database: the Motion Occupancy Base (MOB).
  • results: a single motion controller trained on MOB generates realistic and stable HSI motions in diverse static and dynamic scenes, without any GT 3D scenes for training; code and data will be released at https://foruck.github.io/occu-page/.
    Abstract Human-scene Interaction (HSI) generation is a challenging task and crucial for various downstream tasks. However, one of the major obstacles is the limited data scale. High-quality data with simultaneously captured human and 3D environments is rare, resulting in limited data diversity and complexity. In this work, we argue that interaction with a scene is essentially interacting with the space occupancy of the scene from an abstract physical perspective, leading us to a unified novel view of Human-Occupancy Interaction. By treating pure motion sequences as records of humans interacting with invisible scene occupancy, we can aggregate motion-only data into a large-scale paired human-occupancy interaction database: Motion Occupancy Base (MOB). Thus, the need for costly paired motion-scene datasets with high-quality scene scans can be substantially alleviated. With this new unified view of Human-Occupancy interaction, a single motion controller is proposed to reach the target state given the surrounding occupancy. Once trained on MOB with complex occupancy layout, the controller could handle cramped scenes and generalize well to general scenes with limited complexity. With no GT 3D scenes for training, our method can generate realistic and stable HSI motions in diverse scenarios, including both static and dynamic scenes. Our code and data would be made publicly available at https://foruck.github.io/occu-page/.

UPOCR: Towards Unified Pixel-Level OCR Interface

  • paper_url: http://arxiv.org/abs/2312.02694
  • repo_url: None
  • paper_authors: Dezhi Peng, Zhenhua Yang, Jiaxin Zhang, Chongyu Liu, Yongxin Shi, Kai Ding, Fengjun Guo, Lianwen Jin
  • for: unifying pixel-level OCR research and applications under a single generalist model able to handle multiple tasks simultaneously.
  • methods: proposes a Vision Transformer (ViT)-based encoder-decoder that casts diverse OCR tasks as image-to-image transformation, with learnable task prompts that push the general feature representations toward task-specific spaces.
  • results: simultaneously achieves state-of-the-art performance on three pixel-level OCR tasks (text removal, text segmentation, and tampered text detection) with a single unified model, offering valuable strategies and insights for future generalist OCR research.
    Abstract In recent years, the optical character recognition (OCR) field has been proliferating with plentiful cutting-edge approaches for a wide spectrum of tasks. However, these approaches are task-specifically designed with divergent paradigms, architectures, and training strategies, which significantly increases the complexity of research and maintenance and hinders the fast deployment in applications. To this end, we propose UPOCR, a simple-yet-effective generalist model for Unified Pixel-level OCR interface. Specifically, the UPOCR unifies the paradigm of diverse OCR tasks as image-to-image transformation and the architecture as a vision Transformer (ViT)-based encoder-decoder. Learnable task prompts are introduced to push the general feature representations extracted by the encoder toward task-specific spaces, endowing the decoder with task awareness. Moreover, the model training is uniformly aimed at minimizing the discrepancy between the generated and ground-truth images regardless of the inhomogeneity among tasks. Experiments are conducted on three pixel-level OCR tasks including text removal, text segmentation, and tampered text detection. Without bells and whistles, the experimental results showcase that the proposed method can simultaneously achieve state-of-the-art performance on three tasks with a unified single model, which provides valuable strategies and insights for future research on generalist OCR models. Code will be publicly available.
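The learnable task prompts can be pictured as a small embedding table whose rows are added to the shared encoder features before decoding. Below is a minimal PyTorch sketch of that idea; the module name, dimensions, and toy usage are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TaskPromptedFeatures(nn.Module):
    """Push shared encoder features toward a task-specific space by adding a
    learnable per-task prompt (an assumption-level sketch of UPOCR's
    mechanism, not the official code)."""

    def __init__(self, num_tasks: int, embed_dim: int):
        super().__init__()
        # One learnable prompt vector per task
        # (e.g., removal / segmentation / tamper detection).
        self.task_prompts = nn.Embedding(num_tasks, embed_dim)

    def forward(self, feats: torch.Tensor, task_id: int) -> torch.Tensor:
        # feats: (B, N, C) token features from a ViT encoder.
        prompt = self.task_prompts(torch.tensor(task_id, device=feats.device))
        return feats + prompt  # broadcasts over batch and tokens

# Toy usage: three pixel-level OCR tasks, 256-d tokens.
feats = torch.randn(2, 196, 256)
module = TaskPromptedFeatures(num_tasks=3, embed_dim=256)
print(module(feats, task_id=1).shape)  # torch.Size([2, 196, 256])
```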

DeepPointMap: Advancing LiDAR SLAM with Unified Neural Descriptors

  • paper_url: http://arxiv.org/abs/2312.02684
  • repo_url: None
  • paper_authors: Xiaze Zhang, Ziheng Ding, Qi Jing, Yuejie Zhang, Wenchao Ding, Rui Feng
  • for: improving the accuracy and efficiency of simultaneous localization and mapping (SLAM)
  • methods: uses a neural network to extract highly representative, sparse point-cloud descriptors, enabling memory-efficient map representation and accurate multi-scale localization tasks
  • results: achieves excellent results in various challenging scenarios, including multi-agent collaborative SLAM, demonstrating the effectiveness and potential of the approach
    Abstract Point clouds have shown significant potential in various domains, including Simultaneous Localization and Mapping (SLAM). However, existing approaches either rely on dense point clouds to achieve high localization accuracy or use generalized descriptors to reduce map size. Unfortunately, these two aspects seem to conflict with each other. To address this limitation, we propose a unified architecture, DeepPointMap, achieving excellent performance on both aspects. We utilize a neural network to extract highly representative and sparse neural descriptors from point clouds, enabling memory-efficient map representation and accurate multi-scale localization tasks (e.g., odometry and loop-closure). Moreover, we showcase the versatility of our framework by extending it to more challenging multi-agent collaborative SLAM. The promising results obtained in these scenarios further emphasize the effectiveness and potential of our approach.

Zero-Shot Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2312.03032
  • repo_url: None
  • paper_authors: Weijie Wang, Guofeng Mei, Bin Ren, Xiaoshui Huang, Fabio Poiesi, Luc Van Gool, Nicu Sebe, Bruno Lepri
  • for: proposes the first zero-shot point cloud registration approach, eliminating the need for training on point cloud datasets while maintaining registration accuracy and efficiency.
  • methods: ZeroReg transfers image features from 2D keypoints to the point cloud, aggregates information from 3D geometric neighborhoods via a novel parameter-free geometric decoder, and casts correspondence estimation as an optimal transport problem.
  • results: achieves Recall Ratios of over 84%, 46%, and 75% on 3DMatch, 3DLoMatch, and ScanNet respectively, competitive with both traditional and learning-based methods.
    Abstract Learning-based point cloud registration approaches have significantly outperformed their traditional counterparts. However, they typically require extensive training on specific datasets. In this paper, we propose ZeroReg, the first zero-shot point cloud registration approach that eliminates the need for training on point cloud datasets. The cornerstone of ZeroReg is the novel transfer of image features from keypoints to the point cloud, enriched by aggregating information from 3D geometric neighborhoods. Specifically, we extract keypoints and features from 2D image pairs using a frozen pretrained 2D backbone. These features are then projected in 3D, and patches are constructed by searching for neighboring points. We integrate the geometric and visual features of each point using our novel parameter-free geometric decoder. Subsequently, the task of determining correspondences between point clouds is formulated as an optimal transport problem. Extensive evaluations of ZeroReg demonstrate its competitive performance against both traditional and learning-based methods. On benchmarks such as 3DMatch, 3DLoMatch, and ScanNet, ZeroReg achieves impressive Recall Ratios (RR) of over 84%, 46%, and 75%, respectively.
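Casting correspondence estimation as optimal transport typically means running Sinkhorn iterations on a feature-similarity cost matrix. The NumPy sketch below illustrates that generic formulation under our own assumptions (cosine-distance cost, uniform marginals, entropic regularization); it is not ZeroReg's actual code.

```python
import numpy as np

def sinkhorn_correspondences(feat_a, feat_b, eps=0.05, n_iters=50):
    """Soft correspondences between two point sets via entropic optimal
    transport on a cosine-distance cost (illustrative, not ZeroReg's code)."""
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                      # (N, M) cosine distances
    K = np.exp(-cost / eps)                   # Gibbs kernel
    u = np.full(len(a), 1.0 / len(a))         # uniform source marginal
    v = np.full(len(b), 1.0 / len(b))         # uniform target marginal
    r, c = np.ones_like(u), np.ones_like(v)
    for _ in range(n_iters):                  # Sinkhorn scaling iterations
        r = u / (K @ c)
        c = v / (K.T @ r)
    P = r[:, None] * K * c[None, :]           # transport plan ~ soft matches
    return P

# Toy usage: 32-d descriptors for two small point clouds.
P = sinkhorn_correspondences(np.random.randn(100, 32), np.random.randn(120, 32))
matches = P.argmax(axis=1)                    # hard matches, if needed
print(P.shape, matches[:5])
```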

Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

  • paper_url: http://arxiv.org/abs/2312.03031
  • repo_url: https://github.com/nvlabs/bev-planner
  • paper_authors: Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, Jose M. Alvarez
  • for: a deeper look at open-loop end-to-end autonomous driving evaluation, targeting autonomy from a full-stack perspective.
  • methods: conducts thorough analyses on the nuScenes dataset to demystify the details of planning behavior in end-to-end models.
  • results: finds that ego status information dominates future path planning because nuScenes contains relatively simple driving scenarios, and that current metrics do not comprehensively assess planning quality, potentially biasing conclusions; introduces a new metric evaluating whether predicted trajectories adhere to the road, and proposes a simple baseline achieving competitive results without relying on perception annotations.
    Abstract End-to-end autonomous driving recently emerged as a promising research direction to target autonomy from a full-stack perspective. Along this line, many of the latest works follow an open-loop evaluation setting on nuScenes to study the planning behavior. In this paper, we delve deeper into the problem by conducting thorough analyses and demystifying more devils in the details. We initially observed that the nuScenes dataset, characterized by relatively simple driving scenarios, leads to an under-utilization of perception information in end-to-end models incorporating ego status, such as the ego vehicle's velocity. These models tend to rely predominantly on the ego vehicle's status for future path planning. Beyond the limitations of the dataset, we also note that current metrics do not comprehensively assess the planning quality, leading to potentially biased conclusions drawn from existing benchmarks. To address this issue, we introduce a new metric to evaluate whether the predicted trajectories adhere to the road. We further propose a simple baseline able to achieve competitive results without relying on perception annotations. Given the current limitations on the benchmark and metrics, we suggest the community reassess relevant prevailing research and be cautious whether the continued pursuit of state-of-the-art would yield convincing and universal conclusions. Code and models are available at https://github.com/NVlabs/BEV-Planner

Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection? An Investigation and the HOI-Synth Domain Adaptation Benchmark

  • paper_url: http://arxiv.org/abs/2312.02672
  • repo_url: None
  • paper_authors: Rosario Leonardi, Antonino Furnari, Francesco Ragusa, Giovanni Maria Farinella
  • for: investigates whether synthetic data can enhance hand-object interaction detection in egocentric vision.
  • methods: introduces a simulator that automatically generates synthetic images labeled with hand-object contact states, bounding boxes, and pixel-wise segmentation masks, combined with domain adaptation techniques.
  • results: synthetic data plus domain adaptation matches conventional supervised methods while requiring annotations on only a fraction of the real data; with in-domain synthetic data generated from 3D models of real target environments and objects, the best models consistently outperform standard fully supervised approaches. The study also sets a new domain adaptation benchmark (HOI-Synth) with baseline results to encourage the community to engage with this challenging task.
    Abstract In this study, we investigate the effectiveness of synthetic data in enhancing hand-object interaction detection within the egocentric vision domain. We introduce a simulator able to generate synthetic images of hand-object interactions automatically labeled with hand-object contact states, bounding boxes, and pixel-wise segmentation masks. Through comprehensive experiments and comparative analyses on three egocentric datasets, VISOR, EgoHOS, and ENIGMA-51, we demonstrate that the use of synthetic data and domain adaptation techniques allows for comparable performance to conventional supervised methods while requiring annotations on only a fraction of the real data. When tested with in-domain synthetic data generated from 3D models of real target environments and objects, our best models show consistent performance improvements with respect to standard fully supervised approaches based on labeled real data only. Our study also sets a new benchmark of domain adaptation for egocentric hand-object interaction detection (HOI-Synth) and provides baseline results to encourage the community to engage in this challenging task. We release the generated data, code, and the simulator at the following link: https://iplab.dmi.unict.it/HOI-Synth/.

Generating Visually Realistic Adversarial Patch

  • paper_url: http://arxiv.org/abs/2312.03030
  • repo_url: None
  • paper_authors: Xiaosen Wang, Kunyu Wang
  • for: studying the vulnerability of deep neural networks (DNNs) to adversarial patches that remain inconspicuous to humans, a significant threat to security-critical applications.
  • methods: proposes the Visually Realistic Adversarial Patch (VRAP) attack, which constrains the patch to the neighborhood of a real image for visual realism, optimizes it at the poorest position for position irrelevance, and adopts a Total Variance loss and gamma transformation for printability.
  • results: exhibits outstanding attack performance on ImageNet in the digital world, and the generated patches can be disguised as scrawls or logos in the physical world to fool deep models undetected, posing a significant threat to DNN-enabled applications.
    Abstract Deep neural networks (DNNs) are vulnerable to various types of adversarial examples, bringing huge threats to security-critical applications. Among these, adversarial patches have drawn increasing attention due to their good applicability to fool DNNs in the physical world. However, existing works often generate patches with meaningless noise or patterns, making it conspicuous to humans. To address this issue, we explore how to generate visually realistic adversarial patches to fool DNNs. Firstly, we analyze that a high-quality adversarial patch should be realistic, position irrelevant, and printable to be deployed in the physical world. Based on this analysis, we propose an effective attack called VRAP, to generate visually realistic adversarial patches. Specifically, VRAP constrains the patch in the neighborhood of a real image to ensure the visual reality, optimizes the patch at the poorest position for position irrelevance, and adopts Total Variance loss as well as gamma transformation to make the generated patch printable without losing information. Empirical evaluations on the ImageNet dataset demonstrate that the proposed VRAP exhibits outstanding attack performance in the digital world. Moreover, the generated adversarial patches can be disguised as the scrawl or logo in the physical world to fool the deep models without being detected, bringing significant threats to DNNs-enabled applications.
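The two printability ingredients mentioned above, total variation smoothing and gamma transformation, are easy to express directly. Below is a minimal PyTorch sketch of those pieces under our own assumptions (the loss weighting and gamma range are illustrative); it is not the authors' released code, and the adversarial loss term is omitted.

```python
import torch

def total_variation(patch: torch.Tensor) -> torch.Tensor:
    """Total variation of a (C, H, W) patch: penalizes high-frequency
    noise so the printed patch stays smooth and retains information."""
    dh = (patch[:, 1:, :] - patch[:, :-1, :]).abs().mean()
    dw = (patch[:, :, 1:] - patch[:, :, :-1]).abs().mean()
    return dh + dw

def random_gamma(patch: torch.Tensor, lo=0.8, hi=1.25) -> torch.Tensor:
    """Random gamma transformation, simulating printer/camera response so
    the attack survives the digital-to-physical transfer."""
    gamma = torch.empty(1).uniform_(lo, hi).item()
    return patch.clamp(0, 1) ** gamma

# Toy optimization step (classifier and adversarial loss omitted):
patch = torch.rand(3, 64, 64, requires_grad=True)
loss = 0.01 * total_variation(random_gamma(patch))  # + adversarial loss term
loss.backward()
print(patch.grad.shape)
```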

Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians

  • paper_url: http://arxiv.org/abs/2312.03029
  • repo_url: https://github.com/yuelangx/gaussian-head-avatar
  • paper_authors: Yuelang Xu, Benwang Chen, Zhe Li, Hongwen Zhang, Lizhen Wang, Zerong Zheng, Yebin Liu
  • for: high-fidelity 3D head avatar creation under lightweight sparse-view setups, a long-standing challenge; proposes Gaussian Head Avatar, represented by controllable 3D Gaussians.
  • methods: jointly optimizes neutral 3D Gaussians and a fully learned MLP-based deformation field to capture complex expressions, the two parts benefiting each other so fine-grained dynamic details are modeled while expression accuracy is preserved; also devises a geometry-guided initialization strategy based on implicit SDF and Deep Marching Tetrahedra for training stability and convergence.
  • results: outperforms other state-of-the-art sparse-view methods, achieving ultra high-fidelity rendering quality at 2K resolution even under exaggerated expressions.
    Abstract Creating high-fidelity 3D head avatars has always been a research hotspot, but there remains a great challenge under lightweight sparse view setups. In this paper, we propose Gaussian Head Avatar represented by controllable 3D Gaussians for high-fidelity head avatar modeling. We optimize the neutral 3D Gaussians and a fully learned MLP-based deformation field to capture complex expressions. The two parts benefit each other, thereby our method can model fine-grained dynamic details while ensuring expression accuracy. Furthermore, we devise a well-designed geometry-guided initialization strategy based on implicit SDF and Deep Marching Tetrahedra for the stability and convergence of the training procedure. Experiments show our approach outperforms other state-of-the-art sparse-view methods, achieving ultra high-fidelity rendering quality at 2K resolution even under exaggerated expressions.

Double Integral Enhanced Zeroing Neural Network Optimized with ALSOA fostered Lung Cancer Classification using CT Images

  • paper_url: http://arxiv.org/abs/2312.03028
  • repo_url: None
  • paper_authors: V S Priya Sumitha, V. Keerthika, A. Geetha
  • for: an intelligent, automatic lung cancer classification scheme to improve detection accuracy and efficiency.
  • methods: a Double Integral Enhanced Zeroing Neural Network optimized with ALSOA for lung cancer classification on CT images, with preprocessing via Unscented Trainable Kalman Filtering and feature extraction via the Adaptive and Concise Empirical Wavelet Transform.
  • results: attains 18.32%, 27.20%, and 34.32% higher accuracy than existing methods.
    Abstract Lung cancer is one of the deadliest diseases and a leading cause of illness and death. Since lung cancer cannot be predicted at a premature stage, it is typically only discovered once it has spread to other parts of the lung. The risk grows while radiologists and other specialists determine whether lung cancer is present. Because the type and depth of treatment depend on the severity of the illness, it is critical to develop a smart, automatic cancer prediction scheme that is precise about the stage of the cancer. In this paper, Double Integral Enhanced Zeroing Neural Network Optimized with ALSOA fostered Lung Cancer Classification using CT Images (LCC-DIEZNN-ALSO-CTI) is proposed. Initially, the input CT image is taken from a lung cancer dataset and pre-processed with the Unscented Trainable Kalman Filtering (UTKF) technique, which removes unwanted noise. Afterwards, grayscale statistical features and Haralick texture features are extracted by the Adaptive and Concise Empirical Wavelet Transform (ACEWT). The proposed model is implemented in MATLAB, and its performance is analyzed against existing techniques. The proposed method attains 18.32%, 27.20%, and 34.32% higher accuracy than existing methods such as Deep Learning Assisted Prediction of Lung Cancer on Computed Tomography Images Utilizing AHHMM (LCC-AHHMM-CT), convolutional-neural-network-based pulmonary nodule malignancy assessment in a pipeline for classifying lung cancer (LCC-ICNN-CT), and Automated Decision Support Scheme for Lung Cancer Identification with Categorization (LCC-RFCN-MLRPN-CT), respectively.

TPA3D: Triplane Attention for Fast Text-to-3D Generation

  • paper_url: http://arxiv.org/abs/2312.02647
  • repo_url: None
  • paper_authors: Hong-En Chen, Bin-Shih Wu, Sheng-Yu Huang, Yu-Chiang Frank Wang
  • for: fast text-to-3D generation
  • methods: TPA3D, an end-to-end trainable GAN-based model with attention mechanisms over sentence- and word-level text features
  • results: generates high-quality 3D textured shapes closely aligned with detailed text descriptions, with impressive computational efficiency
    Abstract Due to the lack of large-scale text-3D correspondence data, recent text-to-3D generation works mainly rely on utilizing 2D diffusion models for synthesizing 3D data. Since diffusion-based methods typically require significant optimization time for both training and inference, the use of GAN-based models would still be desirable for fast 3D generation. In this work, we propose Triplane Attention for text-guided 3D generation (TPA3D), an end-to-end trainable GAN-based deep learning model for fast text-to-3D generation. With only 3D shape data and their rendered 2D images observed during training, our TPA3D is designed to retrieve detailed visual descriptions for synthesizing the corresponding 3D mesh data. This is achieved by the proposed attention mechanisms on the extracted sentence and word-level text features. In our experiments, we show that TPA3D generates high-quality 3D textured shapes aligned with fine-grained descriptions, while impressive computation efficiency can be observed.

Synchronization is All You Need: Exocentric-to-Egocentric Transfer for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs

  • paper_url: http://arxiv.org/abs/2312.02638
  • repo_url: https://github.com/fpv-iplab/synchronization-is-all-you-need
  • paper_authors: Camillo Quattrocchi, Antonino Furnari, Daniele Di Mauro, Mario Valerio Giuffrida, Giovanni Maria Farinella
  • for: transferring a temporal action segmentation system designed for fixed exocentric cameras to the egocentric scenario of wearable cameras, without labeling new egocentric videos.
  • methods: adapts the model using existing labeled exocentric videos and a new set of unlabeled, synchronized exocentric-egocentric video pairs, via knowledge distillation investigated at both the feature and model level.
  • results: outperforms classic unsupervised domain adaptation and temporal sequence alignment approaches; remarkably, without seeing a single egocentric label, the best model performs on par with supervised approaches trained on labeled egocentric data, achieving a +15.99% improvement in edit score (28.59% vs 12.60%) on Assembly101 over a baseline trained solely on exocentric data.
    Abstract We consider the problem of transferring a temporal action segmentation system initially designed for exocentric (fixed) cameras to an egocentric scenario, where wearable cameras capture video data. The conventional supervised approach requires the collection and labeling of a new set of egocentric videos to adapt the model, which is costly and time-consuming. Instead, we propose a novel methodology which performs the adaptation leveraging existing labeled exocentric videos and a new set of unlabeled, synchronized exocentric-egocentric video pairs, for which temporal action segmentation annotations do not need to be collected. We implement the proposed methodology with an approach based on knowledge distillation, which we investigate both at the feature and model level. To evaluate our approach, we introduce a new benchmark based on the Assembly101 dataset. Results demonstrate the feasibility and effectiveness of the proposed method against classic unsupervised domain adaptation and temporal sequence alignment approaches. Remarkably, without bells and whistles, our best model performs on par with supervised approaches trained on labeled egocentric data, without ever seeing a single egocentric label, achieving a +15.99% (28.59% vs 12.60%) improvement in the edit score on the Assembly101 dataset compared to a baseline model trained solely on exocentric data.
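Feature-level distillation across synchronized video pairs can be written as a simple regression from the ego student's features to a frozen exo teacher's features at the same timestamps, with no action labels involved. The PyTorch sketch below shows only this generic idea; the architectures, feature dimensions, and optimizer settings are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

# Frozen teacher trained on labeled exocentric video; trainable ego student.
teacher = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128)).eval()
student = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
for p in teacher.parameters():
    p.requires_grad_(False)

mse = nn.MSELoss()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

# One step on a synchronized (exo, ego) pair: same timestamps, different
# viewpoints, no labels needed for the egocentric stream.
exo_feats = torch.randn(8, 512)   # per-frame features from the exo camera
ego_feats = torch.randn(8, 512)   # per-frame features from the ego camera

with torch.no_grad():
    target = teacher(exo_feats)    # teacher's representation of the moment
loss = mse(student(ego_feats), target)
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```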

Stable Diffusion Exposed: Gender Bias from Prompt to Image

  • paper_url: http://arxiv.org/abs/2312.03027
  • repo_url: None
  • paper_authors: Yankun Wu, Yuta Nakashima, Noa Garcia
  • for: examining gender biases in Stable Diffusion and how they shape the generated images.
  • methods: introduces an evaluation protocol that automatically analyzes the impact of gender indicators on Stable Diffusion images, building on prior work to study how gender indicators affect not only gender presentation but also the objects and layouts in the generated images.
  • results: finds gender-dependent differences in the depiction of objects (e.g., instruments tailored to specific genders) and shifts in overall layouts, and reveals that neutral prompts tend to produce images more aligned with masculine prompts than with feminine ones.
    Abstract Recent studies have highlighted biases in generative models, shedding light on their predisposition towards gender-based stereotypes and imbalances. This paper contributes to this growing body of research by introducing an evaluation protocol designed to automatically analyze the impact of gender indicators on Stable Diffusion images. Leveraging insights from prior work, we explore how gender indicators not only affect gender presentation but also the representation of objects and layouts within the generated images. Our findings include the existence of differences in the depiction of objects, such as instruments tailored for specific genders, and shifts in overall layouts. We also reveal that neutral prompts tend to produce images more aligned with masculine prompts than their feminine counterparts, providing valuable insights into the nuanced gender biases inherent in Stable Diffusion.

Diffusion Noise Feature: Accurate and Fast Generated Image Detection

  • paper_url: http://arxiv.org/abs/2312.02625
  • repo_url: None
  • paper_authors: Yichi Zhang, Xiaogang Xu
  • for: improving the accuracy and generalization of generated-image detection.
  • methods: runs an inverse diffusion process in a pre-trained diffusion model, exploiting the distinct latent representations of real and generated images to amplify subtle artifacts; the estimated noise forms an ensemble representation, the Diffusion Noise Feature (DNF), on which a simple classifier (e.g., ResNet) is trained.
  • results: achieves state-of-the-art accuracy, robustness, and generalization for detecting generated images, even from previously unseen classes or models, on a widely recognized standard dataset.
    Abstract Generative models have reached an advanced stage where they can produce remarkably realistic images. However, this remarkable generative capability also introduces the risk of disseminating false or misleading information. Notably, existing image detectors for generated images encounter challenges such as low accuracy and limited generalization. This paper seeks to address this issue by seeking a representation with strong generalization capabilities to enhance the detection of generated images. Our investigation has revealed that real and generated images display distinct latent Gaussian representations when subjected to an inverse diffusion process within a pre-trained diffusion model. Exploiting this disparity, we can amplify subtle artifacts in generated images. Building upon this insight, we introduce a novel image representation known as Diffusion Noise Feature (DNF). DNF is an ensemble representation that estimates the noise generated during the inverse diffusion process. A simple classifier, e.g., ResNet, trained on DNF achieves high accuracy, robustness, and generalization capabilities for detecting generated images, even from previously unseen classes or models. We conducted experiments using a widely recognized and standard dataset, achieving state-of-the-art effects of Detection.
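Collecting the noise estimated during an inverse (DDIM-style) diffusion pass is the core of DNF. The sketch below implements textbook deterministic DDIM inversion with a stand-in noise predictor so it runs self-contained; the noise schedule, the number of steps, and the ensembling choice (here, simple averaging of the predicted noises) are our assumptions, not the paper's.

```python
import torch

def eps_model(x, t):
    """Stand-in epsilon predictor (assumption). In practice this is a
    pre-trained diffusion U-Net; a fixed toy function keeps the sketch
    runnable."""
    return torch.tanh(x) * (t + 1) / 100.0

def dnf(image, num_steps=10):
    """Diffusion Noise Feature: run deterministic DDIM *inversion* from the
    clean image and aggregate the predicted noises along the way."""
    betas = torch.linspace(1e-4, 0.02, 100)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    x = image
    noises = []
    for i in range(num_steps - 1):
        a_t, a_next = alpha_bar[i], alpha_bar[i + 1]
        eps = eps_model(x, i)
        noises.append(eps)
        # Predicted clean image, then deterministically re-noise to step i+1.
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps
    # Ensemble the per-step noise estimates into one feature map.
    return torch.stack(noises).mean(dim=0)

feature = dnf(torch.randn(3, 32, 32))
print(feature.shape)  # torch.Size([3, 32, 32]) -> input to a ResNet classifier
```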

DreaMo: Articulated 3D Reconstruction From A Single Casual Video

  • paper_url: http://arxiv.org/abs/2312.02617
  • repo_url: None
  • paper_authors: Tao Tu, Ming-Feng Li, Chieh Hubert Lin, Yen-Chi Cheng, Min Sun, Ming-Hsuan Yang
  • for: articulated 3D shape reconstruction from a single, casually captured internet video, where the subject's view coverage is incomplete.
  • methods: proposes DreaMo, which jointly performs shape reconstruction while resolving low-coverage regions using a view-conditioned diffusion prior and several tailored regularizations, plus a strategy for generating human-interpretable skeletons from the learned neural bones and skinning weights.
  • results: shows promising quality in novel-view rendering, detailed articulated shape reconstruction, and skeleton generation on a self-collected internet video collection; existing methods fail to recover correct geometry under such incomplete view coverage.
    Abstract Articulated 3D reconstruction has valuable applications in various domains, yet it remains costly and demands intensive work from domain experts. Recent advancements in template-free learning methods show promising results with monocular videos. Nevertheless, these approaches necessitate a comprehensive coverage of all viewpoints of the subject in the input video, thus limiting their applicability to casually captured videos from online sources. In this work, we study articulated 3D shape reconstruction from a single and casually captured internet video, where the subject's view coverage is incomplete. We propose DreaMo that jointly performs shape reconstruction while solving the challenging low-coverage regions with view-conditioned diffusion prior and several tailored regularizations. In addition, we introduce a skeleton generation strategy to create human-interpretable skeletons from the learned neural bones and skinning weights. We conduct our study on a self-collected internet video collection characterized by incomplete view coverage. DreaMo shows promising quality in novel-view rendering, detailed articulated shape reconstruction, and skeleton generation. Extensive qualitative and quantitative studies validate the efficacy of each proposed component, and show existing methods are unable to solve correct geometry due to the incomplete view coverage.

Facilitating the Production of Well-tailored Video Summaries for Sharing on Social Media

  • paper_url: http://arxiv.org/abs/2312.02616
  • repo_url: None
  • paper_authors: Evlampios Apostolidis, Konstantinos Apostolidis, Vasileios Mezaris
  • for: a web-based tool that facilitates the production of tailored video summaries for sharing on social media.
  • methods: integrates AI models for video summarization and aspect-ratio transformation, supporting a "one-click" summarization process that generates multiple summaries of a full-length video according to each target platform's length and aspect-ratio requirements.
  • results: produces high-quality summaries that can be tailored to the user's needs.
    Abstract This paper presents a web-based tool that facilitates the production of tailored summaries for online sharing on social media. Through an interactive user interface, it supports a "one-click" video summarization process. Based on the integrated AI models for video summarization and aspect ratio transformation, it facilitates the generation of multiple summaries of a full-length video according to the needs of target platforms with regard to the video's length and aspect ratio.

Projection Regret: Reducing Background Bias for Novelty Detection via Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.02615
  • repo_url: None
  • paper_authors: Sungik Choi, Hankook Lee, Honglak Lee, Moontae Lee
  • for: detecting abnormal (out-of-distribution) samples with diffusion models.
  • methods: proposes Projection Regret (PR), which computes the perceptual distance between a test image and its diffusion-based projection, and cancels out background bias by comparing against recursive projections.
  • results: outperforms prior generative-model-based novelty detection methods by a significant margin.
    Abstract Novelty detection is a fundamental task of machine learning which aims to detect abnormal (i.e., out-of-distribution (OOD)) samples. Since diffusion models have recently emerged as the de facto standard generative framework with surprising generation results, novelty detection via diffusion models has also gained much attention. Recent methods have mainly utilized the reconstruction property of in-distribution samples. However, they often suffer from detecting OOD samples that share similar background information to the in-distribution data. Based on our observation that diffusion models can project any sample to an in-distribution sample with similar background information, we propose Projection Regret (PR), an efficient novelty detection method that mitigates the bias of non-semantic information. To be specific, PR computes the perceptual distance between the test image and its diffusion-based projection to detect abnormality. Since the perceptual distance often fails to capture semantic changes when the background information is dominant, we cancel out the background bias by comparing it against recursive projections. Extensive experiments demonstrate that PR outperforms the prior art of generative-model-based novelty detection methods by a significant margin.
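In pseudocode, the PR score is the perceptual distance from a test image to its diffusion projection, with background bias cancelled by subtracting the distances between recursive projections. The sketch below captures only this scoring structure; `project` (in practice, partially noising and denoising with a pre-trained diffusion model) and the distance function are stand-ins, so the names and details are assumptions.

```python
import torch

def project(x: torch.Tensor) -> torch.Tensor:
    """Stand-in for a diffusion-based projection: partially noise the input,
    then denoise it back toward the data manifold (assumption; a real
    implementation uses a pre-trained diffusion model)."""
    return 0.9 * x + 0.1 * torch.randn_like(x)

def distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Stand-in perceptual distance (a real implementation would use a
    learned metric such as LPIPS)."""
    return (a - b).pow(2).mean()

def projection_regret(x: torch.Tensor, n_recursive: int = 2) -> float:
    """OOD score: distance to the first projection, minus the average
    distance between successive (recursive) projections, which cancels the
    background-dominated part of the distance."""
    p1 = project(x)
    score = distance(x, p1)
    bias, p = 0.0, p1
    for _ in range(n_recursive):
        p_next = project(p)
        bias = bias + distance(p, p_next)
        p = p_next
    return float(score - bias / n_recursive)

print(projection_regret(torch.randn(3, 32, 32)))
```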

A Unified Simulation Framework for Visual and Behavioral Fidelity in Crowd Analysis

  • paper_url: http://arxiv.org/abs/2312.02613
  • repo_url: None
  • paper_authors: Niccolò Bisagno, Nicola Garau, Antonio Luigi Stefani, Nicola Conci
  • for: simulating human crowds to generate annotated data suitable for computer vision tasks.
  • methods: presents UniCrowd, a crowd simulator, together with an associated validation pipeline.
  • results: generates annotated data suitable for detection and segmentation, as well as related applications such as crowd counting, human pose estimation, trajectory analysis and prediction, and anomaly detection.
    Abstract Simulation is a powerful tool to easily generate annotated data, and a highly desirable feature, especially in those domains where learning models need large training datasets. Machine learning and deep learning solutions, have proven to be extremely data-hungry and sometimes, the available real-world data are not sufficient to effectively model the given task. Despite the initial skepticism of a portion of the scientific community, the potential of simulation has been largely confirmed in many application areas, and the recent developments in terms of rendering and virtualization engines, have shown a good ability also in representing complex scenes. This includes environmental factors, such as weather conditions and surface reflectance, as well as human-related events, like human actions and behaviors. We present a human crowd simulator, called UniCrowd, and its associated validation pipeline. We show how the simulator can generate annotated data, suitable for computer vision tasks, in particular for detection and segmentation, as well as the related applications, as crowd counting, human pose estimation, trajectory analysis and prediction, and anomaly detection.

Accelerating Learnt Video Codecs with Gradient Decay and Layer-wise Distillation

  • paper_url: http://arxiv.org/abs/2312.02605
  • repo_url: None
  • paper_authors: Tianhao Peng, Ge Gao, Heming Sun, Fan Zhang, David Bull
  • for: reducing the computational complexity and latency of learnt video codecs, especially at the decoder side, to enable practical real-time deployment.
  • methods: a model-agnostic pruning scheme based on gradient decay and adaptive layer-wise distillation; gradient decay enhances parameter exploration during sparsification while preventing runaway sparsity, and the layer-wise distillation regulates sparse training in stages based on the distortion of intermediate features.
  • results: applied to three popular end-to-end learnt video codecs (FVC, DCVC, and DCVC-HEM), yields up to 65% reduction in MACs and a 2x speed-up with less than 0.3dB drop in BD-PSNR.
    Abstract In recent years, end-to-end learnt video codecs have demonstrated their potential to compete with conventional coding algorithms in term of compression efficiency. However, most learning-based video compression models are associated with high computational complexity and latency, in particular at the decoder side, which limits their deployment in practical applications. In this paper, we present a novel model-agnostic pruning scheme based on gradient decay and adaptive layer-wise distillation. Gradient decay enhances parameter exploration during sparsification whilst preventing runaway sparsity and is superior to the standard Straight-Through Estimation. The adaptive layer-wise distillation regulates the sparse training in various stages based on the distortion of intermediate features. This stage-wise design efficiently updates parameters with minimal computational overhead. The proposed approach has been applied to three popular end-to-end learnt video codecs, FVC, DCVC, and DCVC-HEM. Results confirm that our method yields up to 65% reduction in MACs and 2x speed-up with less than 0.3dB drop in BD-PSNR. Supporting code and supplementary material can be downloaded from: https://jasminepp.github.io/lightweightdvc/
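One plausible reading of "gradient decay" (superior to standard Straight-Through Estimation, per the abstract) is a softened straight-through estimator: pruned weights still receive gradients, but scaled by a factor that decays over training, so parameters can re-enter the active set early on without causing runaway sparsity later. The PyTorch sketch below implements that reading as an assumption; it is not the paper's exact scheme.

```python
import torch

class DecayedSTE(torch.autograd.Function):
    """Magnitude pruning whose backward pass lets gradients reach pruned
    weights, scaled by a decaying factor (our reading of 'gradient decay';
    an assumption, not the paper's exact rule)."""

    @staticmethod
    def forward(ctx, w, sparsity, decay):
        k = int(sparsity * w.numel())
        thresh = w.abs().flatten().kthvalue(max(k, 1)).values
        mask = (w.abs() > thresh).float()
        ctx.save_for_backward(mask)
        ctx.decay = decay
        return w * mask

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        # Kept weights: full gradient. Pruned weights: decayed gradient.
        grad_w = grad_out * (mask + (1 - mask) * ctx.decay)
        return grad_w, None, None

w = torch.randn(256, 256, requires_grad=True)
for step in range(3):
    decay = 0.5 ** step                 # decays toward hard pruning
    y = DecayedSTE.apply(w, 0.6, decay)
    loss = y.pow(2).mean()              # stand-in for rate-distortion loss
    loss.backward()
    w.data -= 1e-2 * w.grad
    w.grad = None
print(w.abs().mean().item())
```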

An Integrated System for Spatio-Temporal Summarization of 360-degrees Videos

  • paper_url: http://arxiv.org/abs/2312.02576
  • repo_url: None
  • paper_authors: Ioannis Kontostathis, Evlampios Apostolidis, Vasileios Mezaris
  • for: an integrated system for spatio-temporal summarization of 360-degrees videos.
  • methods: combines state-of-the-art saliency detection methods for 360-degrees video (ATSal and SST-Sal) with a video summarization method (CA-SUM), plus a mechanism that classifies a 360-degrees video as recorded with a static or moving camera and selects the appropriate saliency detection method, and a 2D video production component that renders the salient events as a conventional 2D video.
  • results: quantitative evaluations on two 360-degrees video saliency datasets (VR-EyeTracking, Sports-360) show the accuracy and positive impact of the decision mechanism; a qualitative analysis gives further insight into its functionality, the pros and cons of each saliency detection method, and the advanced performance of the trained summarization method over a more conventional approach.
    Abstract In this work, we present an integrated system for spatiotemporal summarization of 360-degrees videos. The video summary production mainly involves the detection of salient events and their synopsis into a concise summary. The analysis relies on state-of-the-art methods for saliency detection in 360-degrees video (ATSal and SST-Sal) and video summarization (CA-SUM). It also contains a mechanism that classifies a 360-degrees video based on the use of static or moving camera during recording and decides which saliency detection method will be used, as well as a 2D video production component that is responsible to create a conventional 2D video containing the salient events in the 360-degrees video. Quantitative evaluations using two datasets for 360-degrees video saliency detection (VR-EyeTracking, Sports-360) show the accuracy and positive impact of the developed decision mechanism, and justify our choice to use two different methods for detecting the salient events. A qualitative analysis using content from these datasets, gives further insights about the functionality of the decision mechanism, shows the pros and cons of each used saliency detection method and demonstrates the advanced performance of the trained summarization method against a more conventional approach.

Prompt2NeRF-PIL: Fast NeRF Generation via Pretrained Implicit Latent

  • paper_url: http://arxiv.org/abs/2312.02568
  • repo_url: None
  • paper_authors: Jianmeng Liu, Yuyao Zhang, Zeyuan Meng, Yu-Wing Tai, Chi-Keung Tang
  • for: promptable NeRF generation (e.g., from a text prompt or a single image prompt) that directly conditions and quickly produces the NeRF parameters of the underlying 3D scene, removing complex intermediate steps while providing full 3D generation with conditional control.
  • methods: Prompt2NeRF-PIL generates a variety of 3D objects in a single forward pass by leveraging a pre-trained implicit latent space of NeRF parameters, unlike previous diffusion-CLIP-based pipelines that involve tedious per-prompt optimizations.
  • results: in zero-shot tasks, the produced NeRFs serve as semantically informative initializations that significantly accelerate existing prompt-to-NeRF methods, speeding up the text-to-NeRF model DreamFusion and the 3D reconstruction of the image-to-NeRF method Zero-1-to-3 by 3 to 5 times.
    Abstract This paper explores promptable NeRF generation (e.g., text prompt or single image prompt) for direct conditioning and fast generation of NeRF parameters for the underlying 3D scenes, thus undoing complex intermediate steps while providing full 3D generation with conditional control. Unlike previous diffusion-CLIP-based pipelines that involve tedious per-prompt optimizations, Prompt2NeRF-PIL is capable of generating a variety of 3D objects with a single forward pass, leveraging a pre-trained implicit latent space of NeRF parameters. Furthermore, in zero-shot tasks, our experiments demonstrate that the NeRFs produced by our method serve as semantically informative initializations, significantly accelerating the inference process of existing prompt-to-NeRF methods. Specifically, we will show that our approach speeds up the text-to-NeRF model DreamFusion and the 3D reconstruction speed of the image-to-NeRF method Zero-1-to-3 by 3 to 5 times.

Think Twice Before Selection: Federated Evidential Active Learning for Medical Image Analysis with Domain Shifts

  • paper_url: http://arxiv.org/abs/2312.02567
  • repo_url: None
  • paper_authors: Jiayi Chen, Benteng Ma, Hengfei Cui, Yong Xia, Kwang-Ting Cheng
  • for: improving data evaluation in federated active learning, to better exploit data distributed across medical institutions without centralizing it.
  • methods: Federated Evidential Active Learning (FEAL) places a Dirichlet prior on both local and global models to capture aleatoric and epistemic uncertainty under domain shift, uses the epistemic uncertainty to calibrate the aleatoric uncertainty, and applies a diversity relaxation strategy to reduce data redundancy while maintaining data diversity.
  • results: extensive experiments show FEAL outperforms state-of-the-art active learning methods and is efficient within the federated active learning framework.
    Abstract Federated learning facilitates the collaborative learning of a global model across multiple distributed medical institutions without centralizing data. Nevertheless, the expensive cost of annotation on local clients remains an obstacle to effectively utilizing local data. To mitigate this issue, federated active learning methods suggest leveraging local and global model predictions to select a relatively small amount of informative local data for annotation. However, existing methods mainly focus on all local data sampled from the same domain, making them unreliable in realistic medical scenarios with domain shifts among different clients. In this paper, we make the first attempt to assess the informativeness of local data derived from diverse domains and propose a novel methodology termed Federated Evidential Active Learning (FEAL) to calibrate the data evaluation under domain shift. Specifically, we introduce a Dirichlet prior distribution in both local and global models to treat the prediction as a distribution over the probability simplex and capture both aleatoric and epistemic uncertainties by using the Dirichlet-based evidential model. Then we employ the epistemic uncertainty to calibrate the aleatoric uncertainty. Afterward, we design a diversity relaxation strategy to reduce data redundancy and maintain data diversity. Extensive experiments and analyses are conducted to show the superiority of FEAL over the state-of-the-art active learning methods and the efficiency of FEAL under the federated active learning framework.
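With a Dirichlet output, the standard evidential decomposition gives aleatoric uncertainty as the expected entropy of the categorical distribution and epistemic uncertainty as the mutual information (total predictive entropy minus expected entropy). The sketch below shows those generic formulas as used in common evidential deep learning practice; it is not necessarily FEAL's exact estimator.

```python
import torch

def dirichlet_uncertainties(alpha: torch.Tensor):
    """Given Dirichlet parameters alpha (B, K) from an evidential head,
    return (aleatoric, epistemic) uncertainty per sample via the standard
    entropy decomposition."""
    S = alpha.sum(dim=1, keepdim=True)
    p = alpha / S                                    # expected class probs
    total = -(p * p.clamp_min(1e-12).log()).sum(1)   # entropy of the mean
    # Expected entropy of categoricals drawn from the Dirichlet.
    expected = -(p * (torch.digamma(alpha + 1) - torch.digamma(S + 1))).sum(1)
    aleatoric = expected
    epistemic = total - expected                     # mutual information
    return aleatoric, epistemic

# Toy usage: confident vs. vacuous Dirichlet outputs for K = 4 classes.
alpha = torch.tensor([[40.0, 1.0, 1.0, 1.0],   # confident -> low epistemic
                      [1.0, 1.0, 1.0, 1.0]])   # vacuous   -> high epistemic
ale, epi = dirichlet_uncertainties(alpha)
print(ale, epi)
```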

Uni3DL: Unified Model for 3D and Language Understanding

  • paper_url: http://arxiv.org/abs/2312.03026
  • repo_url: None
  • paper_authors: Xiang Li, Jian Ding, Zhaoyang Chen, Mohamed Elhoseiny
  • for: a unified model for 3D and language understanding, called Uni3DL, which performs various 3D vision and vision-language tasks.
  • methods: a query transformer learns task-agnostic semantic and mask outputs by attending to 3D visual features, and a task router selectively generates the task-specific outputs required for diverse tasks.
  • results: evaluated across diverse 3D vision-language understanding tasks, Uni3DL performs on par with or surpasses state-of-the-art task-specific models.
    Abstract In this work, we present Uni3DL, a unified model for 3D and Language understanding. Distinct from existing unified vision-language models in 3D which are limited in task variety and predominantly dependent on projected multi-view images, Uni3DL operates directly on point clouds. This approach significantly expands the range of supported tasks in 3D, encompassing both vision and vision-language tasks in 3D. At the core of Uni3DL, a query transformer is designed to learn task-agnostic semantic and mask outputs by attending to 3D visual features, and a task router is employed to selectively generate task-specific outputs required for diverse tasks. With a unified architecture, our Uni3DL model enjoys seamless task decomposition and substantial parameter sharing across tasks. Uni3DL has been rigorously evaluated across diverse 3D vision-language understanding tasks, including semantic segmentation, object detection, instance segmentation, visual grounding, 3D captioning, and text-3D cross-modal retrieval. It demonstrates performance on par with or surpassing state-of-the-art (SOTA) task-specific models. We hope our benchmark and Uni3DL model will serve as a solid step to ease future research in unified models in the realm of 3D and language understanding. Project page: https://uni3dl.github.io.

GeNIe: Generative Hard Negative Images Through Diffusion

  • paper_url: http://arxiv.org/abs/2312.02548
  • repo_url: https://github.com/ucdvision/genie
  • paper_authors: Soroush Abbasi Koohpayegani, Anuj Singh, K L Navaneet, Hadi Jamali-Rad, Hamed Pirsiavash
  • for: data augmentation for training deep models without overfitting to limited data.
  • methods: leverages a diffusion model conditioned on a text prompt to merge contrasting data points (an image from the source category and a text prompt from the target category), limiting the number of diffusion iterations and the amount of noise so the generated image retains low-level and contextual features from the source image while potentially conflicting with the target category.
  • results: augmented samples close to the classifier's ideal decision boundary guide learning effectively and efficiently; extensive few-shot and long-tail experiments demonstrate the method's effectiveness, especially for categories with a limited number of examples.
    Abstract Data augmentation is crucial in training deep models, preventing them from overfitting to limited data. Common data augmentation methods are effective, but recent advancements in generative AI, such as diffusion models for image generation, enable more sophisticated augmentation techniques that produce data resembling natural images. We recognize that augmented samples closer to the ideal decision boundary of a classifier are particularly effective and efficient in guiding the learning process. We introduce GeNIe which leverages a diffusion model conditioned on a text prompt to merge contrasting data points (an image from the source category and a text prompt from the target category) to generate challenging samples for the target category. Inspired by recent image editing methods, we limit the number of diffusion iterations and the amount of noise. This ensures that the generated image retains low-level and contextual features from the source image, potentially conflicting with the target category. Our extensive experiments, in few-shot and also long-tail distribution settings, demonstrate the effectiveness of our novel augmentation method, especially benefiting categories with a limited number of examples.
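In spirit, GeNIe-style augmentation resembles an image-to-image diffusion call with a target-category prompt and a deliberately small `strength` (few denoising iterations, limited noise) so source features survive. A sketch using the Hugging Face `diffusers` library follows; treat the checkpoint choice, strength value, and file names as our assumptions, and note this approximates the idea rather than reproducing the released GeNIe code.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Any Stable Diffusion checkpoint works; this one is a common default.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

source = Image.open("dog.jpg").convert("RGB").resize((512, 512))

# Low strength = few diffusion iterations and little noise, so low-level and
# contextual features of the source image are preserved while the prompt
# pulls the semantics toward the *target* category -> a hard negative.
hard_negative = pipe(
    prompt="a photo of a wolf",
    image=source,
    strength=0.45,        # assumption: tuned per dataset in practice
    guidance_scale=7.5,
).images[0]
hard_negative.save("wolf_hard_negative.png")
```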

Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning

  • paper_url: http://arxiv.org/abs/2312.02546
  • repo_url: https://github.com/tmllab/machine_vision_therapy
  • paper_authors: Zhuo Huang, Chang Liu, Yinpeng Dong, Hang Su, Shibao Zheng, Tongliang Liu
  • for: enhancing the zero-shot robustness of vision models under out-of-distribution (OOD) scenarios, without fine-tuning or human supervision.
  • methods: leverages Multi-modal Large Language Models (MLLMs) through a Denoising In-Context Learning (DICL) strategy to rectify the noisy predictions of vision models.
  • results: boosts vision model performance in an unsupervised manner, validated by extensive quantitative and qualitative experiments on ImageNet, WILDS, DomainBed, and other OOD datasets.
    Abstract Although vision models such as Contrastive Language-Image Pre-Training (CLIP) show impressive generalization performance, their zero-shot robustness is still limited under Out-of-Distribution (OOD) scenarios without fine-tuning. Instead of undesirably providing human supervision as commonly done, it is possible to take advantage of Multi-modal Large Language Models (MLLMs) that hold powerful visual understanding abilities. However, MLLMs are shown to struggle with vision problems due to the incompatibility of tasks, thus hindering their utilization. In this paper, we propose to effectively leverage MLLMs to conduct Machine Vision Therapy which aims to rectify the noisy predictions from vision models. By fine-tuning with the denoised labels, the learning model performance can be boosted in an unsupervised manner. To solve the incompatibility issue, we propose a novel Denoising In-Context Learning (DICL) strategy to align vision tasks with MLLMs. Concretely, by estimating a transition matrix that captures the probability of one class being confused with another, an instruction containing a correct exemplar and an erroneous one from the most probable noisy class can be constructed. Such an instruction can help any MLLMs with ICL ability to detect and rectify incorrect predictions of vision models. Through extensive experiments on ImageNet, WILDS, DomainBed, and other OOD datasets, we carefully validate the quantitative and qualitative effectiveness of our method. Our code is available at https://github.com/tmllab/Machine_Vision_Therapy.
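The transition matrix in DICL can be estimated from the vision model's own confusion statistics: row i holds the probability that true class i is predicted as each other class, and its largest off-diagonal entry names the most probable noisy class used to build the correct/erroneous exemplar pair. The sketch below shows that bookkeeping on toy data; the instruction template at the end is an illustrative assumption, not the paper's wording.

```python
import numpy as np

def estimate_transition_matrix(labels, preds, num_classes):
    """T[i, j] = P(predicted j | true i), estimated from held-out data."""
    T = np.zeros((num_classes, num_classes))
    for y, y_hat in zip(labels, preds):
        T[y, y_hat] += 1
    T /= T.sum(axis=1, keepdims=True)
    return T

def most_confusable(T, cls):
    """Most probable noisy class for `cls` (largest off-diagonal entry)."""
    row = T[cls].copy()
    row[cls] = -1.0
    return int(row.argmax())

# Toy confusion statistics over 3 classes.
labels = [0, 0, 0, 1, 1, 2, 2, 2]
preds  = [0, 2, 2, 1, 0, 2, 2, 0]
T = estimate_transition_matrix(labels, preds, num_classes=3)
noisy = most_confusable(T, cls=0)

# Illustrative instruction pairing a correct exemplar with one from the
# most probable noisy class (template wording is an assumption).
instruction = (
    f"Here is an image of class 0 (correct) and an image of class {noisy} "
    f"(often confused with class 0). Decide which class the query image "
    f"belongs to and correct the vision model's prediction if needed."
)
print(np.round(T, 2), "\n", instruction)
```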

Explainable Severity ranking via pairwise n-hidden comparison: a case study of glaucoma

  • paper_url: http://arxiv.org/abs/2312.02541
  • repo_url: None
  • paper_authors: Hong Nguyen, Cuong V. Nguyen, Shrikanth Narayanan, Benjamin Y. Xu, Michael Pazzani
  • for: ranking and comparing the severity of primary open-angle glaucoma (POAG) from fundus images, to aid diagnosis and assessment of the disease.
  • methods: a siamese-based severity ranking using pairwise n-hidden comparisons, plus a novel approach to explaining why one image is deemed more severe than another.
  • results: the proposed severity ranking model surpasses traditional models in diagnostic accuracy and delivers improved saliency explanations.
    Abstract Primary open-angle glaucoma (POAG) is a chronic and progressive optic nerve condition that results in an acquired loss of optic nerve fibers and potential blindness. The gradual onset of glaucoma results in patients progressively losing their vision without being consciously aware of the changes. To diagnose POAG and determine its severity, patients must undergo a comprehensive dilated eye examination. In this work, we build a framework to rank, compare, and interpret the severity of glaucoma using fundus images. We introduce a siamese-based severity ranking using pairwise n-hidden comparisons. We additionally have a novel approach to explaining why a specific image is deemed more severe than others. Our findings indicate that the proposed severity ranking model surpasses traditional ones in terms of diagnostic accuracy and delivers improved saliency explanations.
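A siamese pairwise severity ranker boils down to one shared scoring network applied to both images of a pair and a margin ranking loss on the score difference. A generic PyTorch sketch follows; the tiny backbone and the margin value are assumptions standing in for the paper's configuration.

```python
import torch
import torch.nn as nn

class SeverityScorer(nn.Module):
    """Shared ('siamese') network mapping a fundus image to a scalar
    severity score (tiny CNN stand-in for a real backbone)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(1)

scorer = SeverityScorer()
rank_loss = nn.MarginRankingLoss(margin=0.5)
opt = torch.optim.Adam(scorer.parameters(), lr=1e-4)

# A pair where image A is known to be more severe than image B.
img_a, img_b = torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224)
target = torch.ones(4)   # +1 means score(A) should exceed score(B)

loss = rank_loss(scorer(img_a), scorer(img_b), target)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```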

Enhanced Breast Cancer Tumor Classification using MobileNetV2: A Detailed Exploration on Image Intensity, Error Mitigation, and Streamlit-driven Real-time Deployment

  • paper_url: http://arxiv.org/abs/2312.03020
  • repo_url: None
  • paper_authors: Aaditya Surya, Aditya Shah, Jarnell Kabore, Subash Sasikumar
  • for: applying Google's MobileNetV2 via transfer learning to classify breast tumors as normal, benign, or malignant using a dataset of 1576 ultrasound images (265 normal, 891 benign, 420 malignant).
  • methods: fine-tunes a MobileNetV2-based model, examines image intensity distributions and misclassification errors, addresses dataset imbalances, and explores Streamlit-based deployment for real-time classification.
  • results: achieves accuracy of 0.82, precision of 0.83, recall of 0.81, ROC-AUC of 0.94, PR-AUC of 0.88, and MCC of 0.74, with an error analysis offering improvements for future applications.
    Abstract This research introduces a sophisticated transfer learning model based on Google's MobileNetV2 for breast cancer tumor classification into normal, benign, and malignant categories, utilizing a dataset of 1576 ultrasound images (265 normal, 891 benign, 420 malignant). The model achieves an accuracy of 0.82, precision of 0.83, recall of 0.81, ROC-AUC of 0.94, PR-AUC of 0.88, and MCC of 0.74. It examines image intensity distributions and misclassification errors, offering improvements for future applications. Addressing dataset imbalances, the study ensures a generalizable model. This work, using a dataset from Baheya Hospital, Cairo, Egypt, compiled by Walid Al-Dhabyani et al., emphasizes MobileNetV2's potential in medical imaging, aiming to improve diagnostic precision in oncology. Additionally, the paper explores Streamlit-based deployment for real-time tumor classification, demonstrating MobileNetV2's applicability in medical imaging and setting a benchmark for future research in oncology diagnostics.

Towards Open-set Gesture Recognition via Feature Activation Enhancement and Orthogonal Prototype Learning

  • paper_url: http://arxiv.org/abs/2312.02535
  • repo_url: None
  • paper_authors: Chen Liu, Can Han, Chengfeng Zhou, Crystal Cai, Suncheng Xiang, Hualiang Ni, Dahong Qian
  • for: This paper tackles the open set recognition (OSR) problem in sEMG-based gesture recognition for human-machine interaction (HMI).
  • methods: A more effective prototype learning (PL) method that exploits two novel, inherent distinctions between known and unknown classes: feature activation level and projection inconsistency.
  • results: Experiments show the proposed method simultaneously achieves accurate closed-set classification of predefined gestures and effective rejection of unknown gestures.
    Abstract Gesture recognition is a foundational task in human-machine interaction (HMI). While there has been significant progress in gesture recognition based on surface electromyography (sEMG), accurate recognition of predefined gestures only within a closed set is still inadequate in practice. It is essential to effectively discern and reject unknown gestures of disinterest in a robust system. Numerous methods based on prototype learning (PL) have been proposed to tackle this open set recognition (OSR) problem. However, they do not fully explore the inherent distinctions between known and unknown classes. In this paper, we propose a more effective PL method leveraging two novel and inherent distinctions, feature activation level and projection inconsistency. Specifically, the Feature Activation Enhancement Mechanism (FAEM) widens the gap in feature activation values between known and unknown classes. Furthermore, we introduce Orthogonal Prototype Learning (OPL) to construct multiple perspectives. OPL acts to project a sample from orthogonal directions to maximize the distinction between its two projections, where unknown samples will be projected near the clusters of different known classes while known samples still maintain intra-class similarity. Our proposed method simultaneously achieves accurate closed-set classification for predefined gestures and effective rejection for unknown gestures. Extensive experiments demonstrate its efficacy and superiority in open-set gesture recognition based on sEMG.
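
A minimal sketch of one way to encourage orthogonal class prototypes (a penalty on off-diagonal prototype similarities); this illustrates the general idea only and is not the authors' exact OPL/FAEM formulation:

```python
import torch
import torch.nn.functional as F

def prototype_orthogonality_loss(prototypes: torch.Tensor) -> torch.Tensor:
    # prototypes: (num_classes, dim), one learnable prototype per known gesture
    P = F.normalize(prototypes, dim=1)
    gram = P @ P.T                                  # pairwise cosine similarities
    eye = torch.eye(P.shape[0], device=P.device)
    return ((gram - eye) ** 2).sum()                # push off-diagonals to zero
```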

Towards Automatic Power Battery Detection: New Challenge, Benchmark Dataset and Baseline

  • paper_url: http://arxiv.org/abs/2312.02528
  • repo_url: None
  • paper_authors: Xiaoqi Zhao, Youwei Pang, Zhenyu Chen, Qian Yu, Lihe Zhang, Hanqi Liu, Jiaming Zuo, Huchuan Lu
  • for: This work proposes a new task, power battery detection (PBD), which localizes the endpoints of dense cathode and anode plates in X-ray images to evaluate battery quality. Because manufacturers currently rely on human visual inspection, which makes it hard to balance accuracy and efficiency, the authors first carefully collect a dataset, X-ray PBD, of 1,500 diverse X-ray images selected from thousands of power batteries of 5 manufacturers, covering 7 kinds of visual interference.
  • methods: A novel segmentation-based solution, the multi-dimensional collaborative network (MDCNet). With the help of line and counting predictors, the representation of the point segmentation branch is improved in both semantic and detail aspects. An effective distance-adaptive mask generation strategy copes with the inconsistent distribution density of plates and provides MDCNet with stable supervision.
  • results: The segmentation-based MDCNet consistently outperforms corner detection, crowd counting, and general/tiny object detection solutions on PBD, both in accuracy and in stability across different X-ray images, making it a strong baseline. The paper also discusses potential difficulties and directions for future research. Code and dataset will be released at http://www.gy3000.company/x3000%e5%bc%80%e6%94%be%e5%b9%b3%e5%8f%b0 (X-ray PBD).
    Abstract We conduct a comprehensive study on a new task named power battery detection (PBD), which aims to localize the dense cathode and anode plates endpoints from X-ray images to evaluate the quality of power batteries. Existing manufacturers usually rely on human eye observation to complete PBD, which makes it difficult to balance the accuracy and efficiency of detection. To address this issue and drive more attention into this meaningful task, we first elaborately collect a dataset, called X-ray PBD, which has $1,500$ diverse X-ray images selected from thousands of power batteries of $5$ manufacturers, with $7$ different visual interference. Then, we propose a novel segmentation-based solution for PBD, termed multi-dimensional collaborative network (MDCNet). With the help of line and counting predictors, the representation of the point segmentation branch can be improved at both semantic and detail aspects. Besides, we design an effective distance-adaptive mask generation strategy, which can alleviate the visual challenge caused by the inconsistent distribution density of plates to provide MDCNet with stable supervision. Without any bells and whistles, our segmentation-based MDCNet consistently outperforms various other corner detection, crowd counting and general/tiny object detection-based solutions, making it a strong baseline that can help facilitate future research in PBD. Finally, we share some potential difficulties and works for future researches. The source code and datasets will be publicly available at \href{http://www.gy3000.company/x3000%e5%bc%80%e6%94%be%e5%b9%b3%e5%8f%b0}{X-ray PBD}.

Towards More Unified In-context Visual Understanding

  • paper_url: http://arxiv.org/abs/2312.02520
  • repo_url: None
  • paper_authors: Dianmo Sheng, Dongdong Chen, Zhentao Tan, Qiankun Liu, Qi Chu, Jianmin Bao, Tao Gong, Bin Liu, Shengwei Xu, Nenghai Yu
  • for: This paper presents an in-context learning (ICL) framework for visual understanding with multimodal outputs, broadening the scenarios in which ICL can be applied.
  • methods: Text and visual prompts are quantized and embedded into a unified representation space structured as interleaved in-context sequences, and a decoder-only sparse transformer performs generative modeling over them.
  • results: Experiments show competitive performance on multimodal-output visual understanding tasks compared with specialized models and previous ICL baselines.
    Abstract The rapid advancement of large language models (LLMs) has accelerated the emergence of in-context learning (ICL) as a cutting-edge approach in the natural language processing domain. Recently, ICL has been employed in visual understanding tasks, such as semantic segmentation and image captioning, yielding promising results. However, existing visual ICL framework can not enable producing content across multiple modalities, which limits their potential usage scenarios. To address this issue, we present a new ICL framework for visual understanding with multi-modal output enabled. First, we quantize and embed both text and visual prompt into a unified representational space, structured as interleaved in-context sequences. Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them, facilitating in-context learning. Thanks to this design, the model is capable of handling in-context vision understanding tasks with multimodal output in a unified pipeline. Experimental results demonstrate that our model achieves competitive performance compared with specialized models and previous ICL baselines. Overall, our research takes a further step toward unified multimodal in-context learning.

SAVE: Protagonist Diversification with Structure Agnostic Video Editing

  • paper_url: http://arxiv.org/abs/2312.02503
  • repo_url: None
  • paper_authors: Yeji Song, Wonsik Shin, Junsoo Lee, Jeesoo Kim, Nojun Kwak
  • for: To diversify the protagonist in video editing beyond targets whose body shapes closely match the original.
  • methods: Motion personalization that isolates motion from a single source video: a motion word with an inflated textual embedding represents the motion, a novel pseudo optical flow (computed efficiently from pre-calculated attention maps) directs the motion word to motion-related areas, and an additional pseudo word decouples motion from appearance.
  • results: The method modifies the protagonist and style of a video even for largely different body shapes, taking a step toward more diverse and extensive video editing.
    Abstract Driven by the upsurge progress in text-to-image (T2I) generation models, text-to-video (T2V) generation has experienced a significant advance as well. Accordingly, tasks such as modifying the object or changing the style in a video have been possible. However, previous works usually work well on trivial and consistent shapes, and easily collapse on a difficult target that has a largely different body shape from the original one. In this paper, we spot the bias problem in the existing video editing method that restricts the range of choices for the new protagonist and attempt to address this issue using the conventional image-level personalization method. We adopt motion personalization that isolates the motion from a single source video and then modifies the protagonist accordingly. To deal with the natural discrepancy between image and video, we propose a motion word with an inflated textual embedding to properly represent the motion in a source video. We also regulate the motion word to attend to proper motion-related areas by introducing a novel pseudo optical flow, efficiently computed from the pre-calculated attention maps. Finally, we decouple the motion from the appearance of the source video with an additional pseudo word. Extensive experiments demonstrate the editing capability of our method, taking a step toward more diverse and extensive video editing.

ReconU-Net: a direct PET image reconstruction using U-Net architecture with back projection-induced skip connection

  • paper_url: http://arxiv.org/abs/2312.02494
  • repo_url: None
  • paper_authors: Fumio Hashimoto, Kibo Ote
  • for: This study introduces ReconU-Net, a novel back projection-induced U-Net-shaped architecture for deep-learning-based direct positron emission tomography (PET) image reconstruction, and analyzes its behavior against encoder-decoder architectures without skip connections.
  • methods: The architecture uniquely integrates the physical model of the back projection operation into the skip connection, so that intrinsic spatial information is effectively transferred from the input sinogram to the reconstructed image through the embedded physical model.
  • results: ReconU-Net generates reconstructed images with more accurate structure than other deep-learning-based direct reconstruction methods, and analysis shows its skip connections transfer features at multiple resolutions, especially non-abstract high-resolution information. Despite limited training on simulated data, it successfully reconstructs the real Hoffman brain phantom, where other deep-learning-based direct reconstruction methods fail to produce a reconstructed image.
    Abstract [Objective] This study aims to introduce a novel back projection-induced U-Net-shaped architecture, called ReconU-Net, for deep learning-based direct positron emission tomography (PET) image reconstruction. Additionally, our objective is to analyze the behavior of direct PET image reconstruction and gain deeper insights by comparing the proposed ReconU-Net architecture with other encoder-decoder architectures without skip connections. [Approach] The proposed ReconU-Net architecture uniquely integrates the physical model of the back projection operation into the skip connection. This distinctive feature facilitates the effective transfer of intrinsic spatial information from the input sinogram to the reconstructed image via an embedded physical model. The proposed ReconU-Net was trained using Monte Carlo simulation data from the Brainweb phantom and tested on both simulated and real Hoffman brain phantom data. [Main results] The proposed ReconU-Net method generated a reconstructed image with a more accurate structure compared to other deep learning-based direct reconstruction methods. Further analysis showed that the proposed ReconU-Net architecture has the ability to transfer features of multiple resolutions, especially non-abstract high-resolution information, through skip connections. Despite limited training on simulated data, the proposed ReconU-Net successfully reconstructed the real Hoffman brain phantom, unlike other deep learning-based direct reconstruction methods, which failed to produce a reconstructed image. [Significance] The proposed ReconU-Net can improve the fidelity of direct PET image reconstruction, even when dealing with small training datasets, by leveraging the synergistic relationship between data-driven modeling and the physics model of the imaging process.
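
A conceptual sketch of a back-projection-induced skip connection; `back_project`, `encoder`, and `decoder` are hypothetical stand-ins, and fusing by concatenation is an assumption rather than the paper's released architecture:

```python
import torch
import torch.nn as nn

class BPSkipReconNet(nn.Module):
    """Encoder-decoder whose skip connection back-projects the input sinogram."""
    def __init__(self, back_project, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.back_project = back_project  # physics operator: sinogram -> image space
        self.encoder = encoder            # learned features from the sinogram
        self.decoder = decoder            # fuses learned features with the skip

    def forward(self, sinogram):
        feats = self.encoder(sinogram)
        skip = self.back_project(sinogram)   # physics-induced skip connection
        # assumes encoder output and back-projection share spatial dimensions
        return self.decoder(torch.cat([feats, skip], dim=1))
```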

EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model

  • paper_url: http://arxiv.org/abs/2312.02483
  • repo_url: None
  • paper_authors: Guozhang Li, Xinpeng Ding, De Cheng, Jie Li, Nannan Wang, Xinbo Gao
  • for: To improve early weakly supervised video grounding (WSVG) methods, which struggle with incomplete boundary detection in the absence of temporal boundary annotations.
  • methods: An explicit-supervision approach, EtC (Expand then Clarify), that first expands initial incomplete pseudo-temporal boundaries with additional information and then refines them: multimodal large language models (MLLMs) annotate each frame within the initial pseudo boundaries, and mutual learning with a tailored proposal-level contrastive objective balances the clean initial boundaries against the comprehensive but noisy expanded ones.
  • results: The method preserves the integrity of the original temporal content while introducing more valuable information to expand incomplete boundaries, yielding more precise ones; experiments demonstrate superiority on two challenging WSVG datasets.
    Abstract Early weakly supervised video grounding (WSVG) methods often struggle with incomplete boundary detection due to the absence of temporal boundary annotations. To bridge the gap between video-level and boundary-level annotation, explicit-supervision methods, i.e., generating pseudo-temporal boundaries for training, have achieved great success. However, data augmentations in these methods might disrupt critical temporal information, yielding poor pseudo boundaries. In this paper, we propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information for expanding the incomplete boundaries. To this end, we propose EtC (Expand then Clarify), first use the additional information to expand the initial incomplete pseudo boundaries, and subsequently refine these expanded ones to achieve precise boundaries. Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multimodal large language models (MLLMs) to annotate each frame within initial pseudo boundaries, yielding more comprehensive descriptions for expanded boundaries. To further clarify the noise of expanded boundaries, we combine mutual learning with a tailored proposal-level contrastive objective to use a learnable approach to harmonize a balance between incomplete yet clean (initial) and comprehensive yet noisy (expanded) boundaries for more precise ones. Experiments demonstrate the superiority of our method on two challenging WSVG datasets.

Differentiable Point-based Inverse Rendering

  • paper_url: http://arxiv.org/abs/2312.02480
  • repo_url: None
  • paper_authors: Hoon-Gyu Chung, Seokjun Choi, Seung-Hwan Baek
  • for: Analysis-by-synthesis inverse rendering that estimates shape and spatially-varying BRDF from images captured under diverse illuminations.
  • methods: Point-based rendering, which eliminates the multiple samplings per ray typical of volumetric rendering and thus significantly accelerates inverse rendering; a hybrid point-volumetric representation preserves the geometric detail and stability of SDF-based representations, and a regularized basis-BRDF representation mitigates the ill-posedness caused by limited light-view angular samples.
  • results: Experiments show DPIR outperforms prior works in reconstruction accuracy, computational efficiency, and memory footprint, and its explicit point-based representation and rendering enable intuitive geometry and reflectance editing.
    Abstract We present differentiable point-based inverse rendering, DPIR, an analysis-by-synthesis method that processes images captured under diverse illuminations to estimate shape and spatially-varying BRDF. To this end, we adopt point-based rendering, eliminating the need for multiple samplings per ray, typical of volumetric rendering, thus significantly enhancing the speed of inverse rendering. To realize this idea, we devise a hybrid point-volumetric representation for geometry and a regularized basis-BRDF representation for reflectance. The hybrid geometric representation enables fast rendering through point-based splatting while retaining the geometric details and stability inherent to SDF-based representations. The regularized basis-BRDF mitigates the ill-posedness of inverse rendering stemming from limited light-view angular samples. We also propose an efficient shadow detection method using point-based shadow map rendering. Our extensive evaluations demonstrate that DPIR outperforms prior works in terms of reconstruction accuracy, computational efficiency, and memory footprint. Furthermore, our explicit point-based representation and rendering enables intuitive geometry and reflectance editing. The code will be publicly available.

Generator Born from Classifier

  • paper_url: http://arxiv.org/abs/2312.02470
  • repo_url: None
  • paper_authors: Runpeng Yu, Xinchao Wang
  • for: Given a pre-trained classifier, the paper aims to reconstruct an image generator without using any data samples.
  • methods: The proposed method leverages the knowledge encapsulated within the parameters of the neural network and uses a novel learning paradigm that trains the generator to ensure convergence conditions of the network parameters are satisfied over the generated distribution of samples.
  • results: Empirical validation on various image generation tasks demonstrates the efficacy of the proposed strategy.
    Abstract In this paper, we make a bold attempt toward an ambitious task: given a pre-trained classifier, we aim to reconstruct an image generator, without relying on any data samples. From a black-box perspective, this challenge seems intractable, since it inevitably involves identifying the inverse function for a classifier, which is, by nature, an information extraction process. As such, we resort to leveraging the knowledge encapsulated within the parameters of the neural network. Grounded on the theory of Maximum-Margin Bias of gradient descent, we propose a novel learning paradigm, in which the generator is trained to ensure that the convergence conditions of the network parameters are satisfied over the generated distribution of the samples. Empirical validation from various image generation tasks substantiates the efficacy of our strategy.

Learning Energy-based Model via Dual-MCMC Teaching

  • paper_url: http://arxiv.org/abs/2312.02469
  • repo_url: None
  • paper_authors: Jiali Cui, Tian Han
  • for: This paper studies the fundamental learning problem of the energy-based model (EBM).
  • methods: Maximum likelihood estimation (MLE) with Markov chain Monte Carlo (MCMC) sampling, with a complementary generator model considered to make MCMC sampling more stable and informative.
  • results: A joint learning framework that interweaves MLE for the EBM and the generator: the generator is trained by MLE to match both the EBM and the empirical data distribution, making it a more informative initializer for MCMC sampling of the EBM, while a complementary inference model initializes latent MCMC posterior sampling; the three models integrate seamlessly through dual-MCMC teaching for effective and efficient EBM learning.
    Abstract This paper studies the fundamental learning problem of the energy-based model (EBM). Learning the EBM can be achieved using the maximum likelihood estimation (MLE), which typically involves the Markov Chain Monte Carlo (MCMC) sampling, such as the Langevin dynamics. However, the noise-initialized Langevin dynamics can be challenging in practice and hard to mix. This motivates the exploration of joint training with the generator model where the generator model serves as a complementary model to bypass MCMC sampling. However, such a method can be less accurate than the MCMC and result in biased EBM learning. While the generator can also serve as an initializer model for better MCMC sampling, its learning can be biased since it only matches the EBM and has no access to empirical training examples. Such biased generator learning may limit the potential of learning the EBM. To address this issue, we present a joint learning framework that interweaves the maximum likelihood learning algorithm for both the EBM and the complementary generator model. In particular, the generator model is learned by MLE to match both the EBM and the empirical data distribution, making it a more informative initializer for MCMC sampling of EBM. Learning generator with observed examples typically requires inference of the generator posterior. To ensure accurate and efficient inference, we adopt the MCMC posterior sampling and introduce a complementary inference model to initialize such latent MCMC sampling. We show that three separate models can be seamlessly integrated into our joint framework through two (dual-) MCMC teaching, enabling effective and efficient EBM learning.
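
A minimal sketch of Langevin-dynamics MCMC for an EBM, initialized from a generator sample instead of noise as the framework suggests; step count and step size are illustrative:

```python
import torch

def langevin_sample(ebm, x_init, steps=60, step_size=0.01):
    """Draw an approximate EBM sample, starting from a generator output."""
    x = x_init.clone().requires_grad_(True)
    for _ in range(steps):
        grad = torch.autograd.grad(ebm(x).sum(), x)[0]   # gradient of energy E(x)
        x = (x - 0.5 * step_size * grad
             + step_size ** 0.5 * torch.randn_like(x)).detach().requires_grad_(True)
    return x.detach()

# x0 = generator(z)                # informative initializer instead of pure noise
# x_hat = langevin_sample(ebm, x0)
```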

SAM-Assisted Remote Sensing Imagery Semantic Segmentation with Object and Boundary Constraints

  • paper_url: http://arxiv.org/abs/2312.02464
  • repo_url: https://github.com/sstary/ssrs
  • paper_authors: Xianping Ma, Qianqian Wu, Xingyu Zhao, Xiaokang Zhang, Man-On Pun, Bo Huang
  • for: To improve the accuracy and efficiency of semantic segmentation of remote sensing imagery by leveraging the Segment Anything Model (SAM).
  • methods: Two novel concepts, SAM-Generated Object (SGO) and SAM-Generated Boundary (SGB), together with two new loss functions, an object loss and a boundary loss, which serve as augmentative components for optimizing a general semantic segmentation model.
  • results: Experiments on two well-known datasets, ISPRS Vaihingen and LoveDA Urban, demonstrate the effectiveness of the proposed method.
    Abstract Semantic segmentation of remote sensing imagery plays a pivotal role in extracting precise information for diverse down-stream applications. Recent development of the Segment Anything Model (SAM), an advanced general-purpose segmentation model, has revolutionized this field, presenting new avenues for accurate and efficient segmentation. However, SAM is limited to generating segmentation results without class information. Consequently, the utilization of such a powerful general vision model for semantic segmentation in remote sensing images has become a focal point of research. In this paper, we present a streamlined framework aimed at leveraging the raw output of SAM by exploiting two novel concepts called SAM-Generated Object (SGO) and SAM-Generated Boundary (SGB). More specifically, we propose a novel object loss and further introduce a boundary loss as augmentative components to aid in model optimization in a general semantic segmentation framework. Taking into account the content characteristics of SGO, we introduce the concept of object consistency to leverage segmented regions lacking semantic information. By imposing constraints on the consistency of predicted values within objects, the object loss aims to enhance semantic segmentation performance. Furthermore, the boundary loss capitalizes on the distinctive features of SGB by directing the model's attention to the boundary information of the object. Experimental results on two well-known datasets, namely ISPRS Vaihingen and LoveDA Urban, demonstrate the effectiveness of our proposed method. The source code for this work will be accessible at https://github.com/sstary/SSRS.
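
A minimal sketch of an object-consistency term of the kind the object loss describes: penalizing the variance of predictions inside each SAM-generated object; this illustrates the idea only, not the authors' exact loss:

```python
import torch

def object_consistency_loss(logits: torch.Tensor, masks: list) -> torch.Tensor:
    # logits: (C, H, W) per-pixel class scores; masks: list of (H, W) bool tensors
    loss = logits.new_zeros(())
    for m in masks:
        region = logits[:, m]                                   # (C, n_pixels) in one SGO
        loss = loss + region.var(dim=1, unbiased=False).mean()  # agree within the object
    return loss / max(len(masks), 1)
```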

DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance

  • paper_url: http://arxiv.org/abs/2312.03018
  • repo_url: None
  • paper_authors: Cong Wang, Jiaxi Gu, Panwen Hu, Songcen Xu, Hang Xu, Xiaodan Liang
  • for: High-fidelity image-to-video generation with stronger image retention and controllability than existing image-guided diffusion extensions.
  • methods: DreamVideo builds a frame-retention branch on a pre-trained video diffusion model: the reference image is perceived via convolution layers and its features are concatenated with the noisy latents as model input; double-condition classifier-free guidance then steers a single image toward videos of different actions via varying text prompts.
  • results: Comprehensive experiments on a public dataset show the method outperforms the state of the art both quantitatively and qualitatively; in particular, it exhibits strong image-retention ability, with favorable FVD on UCF101 compared with other image-to-video models, and precise control via different text prompts.
    Abstract Image-to-video generation, which aims to generate a video starting from a given reference image, has drawn great attention. Existing methods try to extend pre-trained text-guided image diffusion models to image-guided video generation models. Nevertheless, these methods often result in either low fidelity or flickering over time due to their limitation to shallow image guidance and poor temporal consistency. To tackle these problems, we propose a high-fidelity image-to-video generation method by devising a frame retention branch on the basis of a pre-trained video diffusion model, named DreamVideo. Instead of integrating the reference image into the diffusion process in a semantic level, our DreamVideo perceives the reference image via convolution layers and concatenate the features with the noisy latents as model input. By this means, the details of the reference image can be preserved to the greatest extent. In addition, by incorporating double-condition classifier-free guidance, a single image can be directed to videos of different actions by providing varying prompt texts. This has significant implications for controllable video generation and holds broad application prospects. We conduct comprehensive experiments on the public dataset, both quantitative and qualitative results indicate that our method outperforms the state-of-the-art method. Especially for fidelity, our model has powerful image retention ability and result in high FVD in UCF101 compared to other image-to-video models. Also, precise control can be achieved by giving different text prompts. Further details and comprehensive results of our model will be presented in https://anonymous0769.github.io/DreamVideo/.

GDN: A Stacking Network Used for Skin Cancer Diagnosis

  • paper_url: http://arxiv.org/abs/2312.02437
  • repo_url: None
  • paper_authors: Jingmin Wei, Haoyang Shen, Ziyi Wang, Ziqian Zhang
  • for: An image-classification model that automatically identifies different types of skin cancer lesions to improve detection accuracy.
  • methods: Stacking of different networks: at the first level, GoogLeNet and DenseNet are trained in parallel for efficiency; at the second level, their outputs feed a logistic regression model.
  • results: Compared against four baseline networks (ResNet, VGGNet, DenseNet, and GoogLeNet), GDN attains higher accuracy on the test data; among the stacking methods studied (perceptron, logistic regression, SVM, decision trees, and K-nearest neighbors), logistic regression gives the best predictions.
    Abstract Skin cancer, the primary type of cancer that can be identified by visual recognition, requires an automatic identification system that can accurately classify different types of lesions. This paper presents GoogLe-Dense Network (GDN), which is an image-classification model to identify two types of skin cancer, Basal Cell Carcinoma, and Melanoma. GDN uses stacking of different networks to enhance the model performance. Specifically, GDN consists of two sequential levels in its structure. The first level performs basic classification tasks accomplished by GoogLeNet and DenseNet, which are trained in parallel to enhance efficiency. To avoid low accuracy and long training time, the second level takes the output of the GoogLeNet and DenseNet as the input for a logistic regression model. We compare our method with four baseline networks including ResNet, VGGNet, DenseNet, and GoogLeNet on the dataset, in which GoogLeNet and DenseNet significantly outperform ResNet and VGGNet. In the second level, different stacking methods such as perceptron, logistic regression, SVM, decision trees and K-neighbor are studied in which Logistic Regression shows the best prediction result among all. The results prove that GDN, compared to a single network structure, has higher accuracy in optimizing skin cancer detection.
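
A minimal sketch of the two-level stacking idea with scikit-learn, assuming the two base networks' softmax outputs are already available as arrays; names and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stack_features(googlenet_probs: np.ndarray, densenet_probs: np.ndarray):
    # Concatenate the two base networks' class probabilities per sample
    return np.concatenate([googlenet_probs, densenet_probs], axis=1)

# p1, p2: (n_samples, n_classes) softmax outputs of the two base networks
# meta = LogisticRegression(max_iter=1000).fit(stack_features(p1, p2), y_train)
# y_pred = meta.predict(stack_features(p1_test, p2_test))
```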

FINER: Flexible spectral-bias tuning in Implicit NEural Representation by Variable-periodic Activation Functions

  • paper_url: http://arxiv.org/abs/2312.02434
  • repo_url: None
  • paper_authors: Zhen Liu, Hao Zhu, Qi Zhang, Jingde Fu, Weibing Deng, Zhan Ma, Yanwen Guo, Xun Cao
  • for: To overcome the restricted frequency tuning of existing INR techniques and improve the representation of complex signals with multiple frequencies.
  • methods: Variable-periodic activation functions (FINER): by initializing the bias of the neural network within different ranges, sub-functions with various frequencies of the variable-periodic function are selected for activation, so the supported frequency set can be flexibly tuned.
  • results: FINER outperforms existing INRs on 2D image fitting, 3D signed distance field representation, and 5D neural radiance field optimization.
    Abstract Implicit Neural Representation (INR), which utilizes a neural network to map coordinate inputs to corresponding attributes, is causing a revolution in the field of signal processing. However, current INR techniques suffer from a restricted capability to tune their supported frequency set, resulting in imperfect performance when representing complex signals with multiple frequencies. We have identified that this frequency-related problem can be greatly alleviated by introducing variable-periodic activation functions, for which we propose FINER. By initializing the bias of the neural network within different ranges, sub-functions with various frequencies in the variable-periodic function are selected for activation. Consequently, the supported frequency set of FINER can be flexibly tuned, leading to improved performance in signal representation. We demonstrate the capabilities of FINER in the contexts of 2D image fitting, 3D signed distance field representation, and 5D neural radiance fields optimization, and we show that it outperforms existing INRs.
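
A minimal INR-layer sketch, assuming the variable-periodic activation takes the form sin(ω(|x|+1)x); the frequency ω and bias-initialization range k are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FinerLayer(nn.Module):
    """Linear layer followed by a variable-periodic sine activation."""
    def __init__(self, in_dim, out_dim, omega=30.0, k=1.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        nn.init.uniform_(self.linear.bias, -k, k)  # bias range selects sub-functions
        self.omega = omega

    def forward(self, x):
        z = self.linear(x)
        return torch.sin(self.omega * (z.abs() + 1.0) * z)  # variable-periodic
```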

Lenna: Language Enhanced Reasoning Detection Assistant

  • paper_url: http://arxiv.org/abs/2312.02433
  • repo_url: https://github.com/meituan-automl/lenna
  • paper_authors: Fei Wei, Xinyu Zhang, Ailing Zhang, Bo Zhang, Xiangxiang Chu
  • for: This paper proposes a language-enhanced reasoning detection assistant called Lenna, which utilizes robust multimodal feature representation for image perception tasks.
  • methods: The paper incorporates an additional token in the MLLM vocabulary to preserve location information for detection, and constructs a ReasonDet dataset to measure the reasoning capability of Lenna.
  • results: Lenna demonstrates outstanding performance on ReasonDet with significantly low training costs and minimal transferring overhead when extended to other tasks. Code and model will be available at https://git.io/Lenna.
    Abstract With the fast-paced development of multimodal large language models (MLLMs), we can now converse with AI systems in natural languages to understand images. However, the reasoning power and world knowledge embedded in the large language models have been much less investigated and exploited for image perception tasks. In this paper, we propose Lenna, a language-enhanced reasoning detection assistant, which utilizes the robust multimodal feature representation of MLLMs, while preserving location information for detection. This is achieved by incorporating an additional token in the MLLM vocabulary that is free of explicit semantic context but serves as a prompt for the detector to identify the corresponding position. To evaluate the reasoning capability of Lenna, we construct a ReasonDet dataset to measure its performance on reasoning-based detection. Remarkably, Lenna demonstrates outstanding performance on ReasonDet and comes with significantly low training costs. It also incurs minimal transferring overhead when extended to other tasks. Our code and model will be available at https://git.io/Lenna.

Orthogonal Adaptation for Modular Customization of Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.02432
  • repo_url: None
  • paper_authors: Ryan Po, Guandao Yang, Kfir Aberman, Gordon Wetzstein
  • for: To solve the Modular Customization problem: efficiently merging independently fine-tuned diffusion models so that a merged model can jointly synthesize multiple concepts in one image without compromising fidelity or incurring additional computational cost.
  • methods: Orthogonal Adaptation, which encourages customized models that have no access to each other during fine-tuning to have orthogonal residual weights, so that at inference time they can be summed with minimal interference.
  • results: Extensive quantitative and qualitative evaluations show the method consistently outperforms relevant baselines in efficiency and identity preservation, a significant step toward scalable customization of diffusion models.
    Abstract Customization techniques for text-to-image models have paved the way for a wide range of previously unattainable applications, enabling the generation of specific concepts across diverse contexts and styles. While existing methods facilitate high-fidelity customization for individual concepts or a limited, pre-defined set of them, they fall short of achieving scalability, where a single model can seamlessly render countless concepts. In this paper, we address a new problem called Modular Customization, with the goal of efficiently merging customized models that were fine-tuned independently for individual concepts. This allows the merged model to jointly synthesize concepts in one image without compromising fidelity or incurring any additional computational costs. To address this problem, we introduce Orthogonal Adaptation, a method designed to encourage the customized models, which do not have access to each other during fine-tuning, to have orthogonal residual weights. This ensures that during inference time, the customized models can be summed with minimal interference. Our proposed method is both simple and versatile, applicable to nearly all optimizable weights in the model architecture. Through an extensive set of quantitative and qualitative evaluations, our method consistently outperforms relevant baselines in terms of efficiency and identity preservation, demonstrating a significant leap toward scalable customization of diffusion models.
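
A minimal sketch of the merging step only: summing independently trained residual weights, with a helper to check how orthogonal (non-interfering) two residuals are; the orthogonality-inducing training itself is not shown:

```python
import torch
import torch.nn.functional as F

def merge_residuals(base_weight: torch.Tensor, deltas):
    """Sum per-concept residual weights onto the shared base weight."""
    merged = base_weight.clone()
    for d in deltas:               # one fine-tuned residual per customized concept
        merged += d
    return merged

def interference(d1: torch.Tensor, d2: torch.Tensor) -> float:
    # Cosine similarity of flattened residuals; near 0 means near-orthogonal
    return F.cosine_similarity(d1.flatten(), d2.flatten(), dim=0).item()
```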

FreestyleRet: Retrieving Images from Style-Diversified Queries

  • paper_url: http://arxiv.org/abs/2312.02428
  • repo_url: None
  • paper_authors: Hao Li, Curise Jia, Peng Jin, Zesen Cheng, Kehan Li, Jialu Sui, Chang Liu, Li Yuan
  • for: To introduce the Style-Diversified Query-Based Image Retrieval task, enabling retrieval from queries of various styles, supported by the first Diverse-Style Retrieval dataset spanning text, sketch, low-resolution, and art queries.
  • methods: A lightweight style-diversified retrieval framework: the Gram matrix extracts the query's textural features, which are clustered into a style space with style-specific bases, and a style-init prompt tuning module lets the visual encoder comprehend the texture and style of the query.
  • results: With the style-init prompt tuning strategy, the model outperforms existing retrieval models on the style-diversified retrieval task; moreover, style-diversified queries (sketch+text, art+text, etc.) can be retrieved simultaneously, with auxiliary information from the other queries enhancing retrieval performance within each query.
    Abstract Image Retrieval aims to retrieve corresponding images based on a given query. In application scenarios, users intend to express their retrieval intent through various query styles. However, current retrieval tasks predominantly focus on text-query retrieval exploration, leading to limited retrieval query options and potential ambiguity or bias in user intention. In this paper, we propose the Style-Diversified Query-Based Image Retrieval task, which enables retrieval based on various query styles. To facilitate the novel setting, we propose the first Diverse-Style Retrieval dataset, encompassing diverse query styles including text, sketch, low-resolution, and art. We also propose a light-weighted style-diversified retrieval framework. For various query style inputs, we apply the Gram Matrix to extract the query's textural features and cluster them into a style space with style-specific bases. Then we employ the style-init prompt tuning module to enable the visual encoder to comprehend the texture and style information of the query. Experiments demonstrate that our model, employing the style-init prompt tuning strategy, outperforms existing retrieval models on the style-diversified retrieval task. Moreover, style-diversified queries~(sketch+text, art+text, etc) can be simultaneously retrieved in our model. The auxiliary information from other queries enhances the retrieval performance within the respective query.
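
A minimal sketch of Gram-matrix texture features for a style-diversified query; the backbone and layer choice are assumptions for illustration:

```python
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    # feat: (B, C, H, W) backbone activations for a query of any style
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)   # (B, C, C) textural statistics
```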

Towards Granularity-adjusted Pixel-level Semantic Annotation

  • paper_url: http://arxiv.org/abs/2312.02420
  • repo_url: None
  • paper_authors: Rohit Kundu, Sudipta Paul, Rohit Lal, Amit K. Roy-Chowdhury
  • for: To provide semantic segmentation predictions, i.e., pixel-level annotations of images at a user-defined granularity, without any manual supervision.
  • methods: Semantic information accumulated from synthetic images generated by the Stable Diffusion model (or web-crawled images) is used to learn a mapping function between SAM mask embeddings and object class labels, giving SAM granularity-adjusted semantic recognition.
  • results: Experiments on PASCAL VOC 2012 and COCO-80 show mIoU gains of +17.95% and +5.17%, respectively, over existing state-of-the-art methods under this problem setting.
    Abstract Recent advancements in computer vision predominantly rely on learning-based systems, leveraging annotations as the driving force to develop specialized models. However, annotating pixel-level information, particularly in semantic segmentation, presents a challenging and labor-intensive task, prompting the need for autonomous processes. In this work, we propose GranSAM which distinguishes itself by providing semantic segmentation at the user-defined granularity level on unlabeled data without the need for any manual supervision, offering a unique contribution in the realm of semantic mask annotation method. Specifically, we propose an approach to enable the Segment Anything Model (SAM) with semantic recognition capability to generate pixel-level annotations for images without any manual supervision. For this, we accumulate semantic information from synthetic images generated by the Stable Diffusion model or web crawled images and employ this data to learn a mapping function between SAM mask embeddings and object class labels. As a result, SAM, enabled with granularity-adjusted mask recognition, can be used for pixel-level semantic annotation purposes. We conducted experiments on the PASCAL VOC 2012 and COCO-80 datasets and observed a +17.95% and +5.17% increase in mIoU, respectively, compared to existing state-of-the-art methods when evaluated under our problem setting.
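
A minimal sketch of the mapping function from SAM mask embeddings to class labels as a linear probe; the embedding dimension and class count are illustrative assumptions:

```python
import torch.nn as nn

class MaskEmbeddingClassifier(nn.Module):
    """Maps SAM mask embeddings to object-class logits."""
    def __init__(self, embed_dim: int = 256, num_classes: int = 21):
        super().__init__()
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, mask_embeddings):       # (N, embed_dim) from SAM's decoder
        return self.head(mask_embeddings)     # (N, num_classes) logits

# Trained on (embedding, label) pairs harvested from synthetic images with known
# classes, the head lets SAM masks be assigned semantic labels at inference time.
```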

MGTR: Multi-Granular Transformer for Motion Prediction with LiDAR

  • paper_url: http://arxiv.org/abs/2312.02409
  • repo_url: https://github.com/AlexXiao95/MGTR
  • paper_authors: Yiqian Gan, Hao Xiao, Yizhe Zhao, Ethan Zhang, Zhe Huang, Xin Ye, Lingting Ge
  • for: Motion prediction for autonomous driving systems, which must handle highly uncertain and complex scenarios involving moving agents of different types.
  • methods: A Multi-Granular TRansformer (MGTR) framework, an encoder-decoder network that exploits context features at different granularities for different kinds of traffic agents, further enhanced with LiDAR semantic features from an off-the-shelf LiDAR feature extractor.
  • results: On the Waymo Open Dataset motion prediction benchmark, MGTR achieves state-of-the-art performance, ranking 1st on the leaderboard (https://waymo.com/open/challenges/2023/motion-prediction/).
    Abstract Motion prediction has been an essential component of autonomous driving systems since it handles highly uncertain and complex scenarios involving moving agents of different types. In this paper, we propose a Multi-Granular TRansformer (MGTR) framework, an encoder-decoder network that exploits context features in different granularities for different kinds of traffic agents. To further enhance MGTR's capabilities, we leverage LiDAR point cloud data by incorporating LiDAR semantic features from an off-the-shelf LiDAR feature extractor. We evaluate MGTR on Waymo Open Dataset motion prediction benchmark and show that the proposed method achieved state-of-the-art performance, ranking 1st on its leaderboard (https://waymo.com/open/challenges/2023/motion-prediction/).

cs.AI - 2023-12-05

FERGI: Automatic Annotation of User Preferences for Text-to-Image Generation from Spontaneous Facial Expression Reaction

  • paper_url: http://arxiv.org/abs/2312.03187
  • repo_url: https://github.com/shuangquanfeng/fergi
  • paper_authors: Shuangquan Feng, Junhua Ma, Virginia R. de Sa
  • for: To fine-tune text-to-image generative models with human preference feedback whose collection does not rely on manual annotation.
  • methods: Automatic annotation of user preferences from spontaneous facial expression reactions to generated images, supported by a newly collected dataset of Facial Expression Reaction to Generated Images (FERGI).
  • results: Activations of multiple facial action units (AUs) are highly correlated with user evaluations of the generated images; AU4 (brow lowerer) is the most consistent indicator of negative evaluations. This allows automatic preference annotation between image pairs with accuracy significantly outperforming state-of-the-art scoring models, and directly integrating AU4 responses with the scoring models improves their consistency with human preferences; the AU4 response best reflects evaluation of image fidelity, complementing scoring models, which better capture image-text alignment.
    Abstract Researchers have proposed to use data of human preference feedback to fine-tune text-to-image generative models. However, the scalability of human feedback collection has been limited by its reliance on manual annotation. Therefore, we develop and test a method to automatically annotate user preferences from their spontaneous facial expression reaction to the generated images. We collect a dataset of Facial Expression Reaction to Generated Images (FERGI) and show that the activations of multiple facial action units (AUs) are highly correlated with user evaluations of the generated images. Specifically, AU4 (brow lowerer) is most consistently reflective of negative evaluations of the generated image. This can be useful in two ways. Firstly, we can automatically annotate user preferences between image pairs with substantial difference in AU4 responses to them with an accuracy significantly outperforming state-of-the-art scoring models. Secondly, directly integrating the AU4 responses with the scoring models improves their consistency with human preferences. Additionally, the AU4 response best reflects the user's evaluation of the image fidelity, making it complementary to the state-of-the-art scoring models, which are generally better at reflecting image-text alignment. Finally, this method of automatic annotation with facial expression analysis can be potentially generalized to other generation tasks. The code is available at https://github.com/ShuangquanFeng/FERGI, and the dataset is also available at the same link for research purposes.
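
A minimal sketch of AU4-based automatic preference annotation between an image pair, assuming per-frame AU4 activation traces are available; the peak statistic and gap threshold are illustrative choices, not the paper's exact procedure:

```python
import numpy as np

def annotate_preference(au4_img_a: np.ndarray, au4_img_b: np.ndarray,
                        min_gap: float = 0.2):
    # Peak AU4 (brow lowerer) activation during viewing is the reaction strength
    a, b = float(au4_img_a.max()), float(au4_img_b.max())
    if abs(a - b) < min_gap:
        return None                  # gap too small to annotate reliably
    return "A" if a < b else "B"     # weaker negative reaction -> preferred image
```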

Data-Driven Traffic Reconstruction and Kernel Methods for Identifying Stop-and-Go Congestion

  • paper_url: http://arxiv.org/abs/2312.03186
  • repo_url: None
  • paper_authors: Edgar Ramirez Sanchez, Shreyaa Raghavan, Cathy Wu
  • for: To support data-driven research for climate change mitigation and sustainability by precisely quantifying where, when, and how much stop-and-go events (SAGs) occur in traffic.
  • methods: Traffic reconstruction techniques for SAG identification; in particular, a kernel-based method for identifying spatio-temporal features in traffic, with bootstrapping to quantify the uncertainty of the reconstruction process.
  • results: Experiments on California highway data demonstrate the promise of the method for capturing SAGs, contributing to a foundation for data-driven decision making.
    Abstract Identifying stop-and-go events (SAGs) in traffic flow presents an important avenue for advancing data-driven research for climate change mitigation and sustainability, owing to their substantial impact on carbon emissions, travel time, fuel consumption, and roadway safety. In fact, SAGs are estimated to account for 33-50% of highway driving externalities. However, insufficient attention has been paid to precisely quantifying where, when, and how much these SAGs take place -necessary for downstream decision making, such as intervention design and policy analysis. A key challenge is that the data available to researchers and governments are typically sparse and aggregated to a granularity that obscures SAGs. To overcome such data limitations, this study thus explores the use of traffic reconstruction techniques for SAG identification. In particular, we introduce a kernel-based method for identifying spatio-temporal features in traffic and leverage bootstrapping to quantify the uncertainty of the reconstruction process. Experimental results on California highway data demonstrate the promise of the method for capturing SAGs. This work contributes to a foundation for data-driven decision making to advance sustainability of traffic systems.
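
A minimal sketch of kernel-based reconstruction of a highway speed field from sparse observations, with bootstrap resampling for uncertainty; kernel widths and the stop-and-go speed threshold are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def kernel_speed_field(xs, ts, vs, grid_x, grid_t, sx=200.0, st=30.0):
    """Gaussian-kernel smoothing of sparse (position, time, speed) observations."""
    X, T = np.meshgrid(grid_x, grid_t, indexing="ij")
    num = np.zeros_like(X)
    den = np.zeros_like(X)
    for x, t, v in zip(xs, ts, vs):
        w = np.exp(-((X - x) / sx) ** 2 - ((T - t) / st) ** 2)
        num += w * v
        den += w
    return num / np.maximum(den, 1e-9)

def bootstrap_sag_probability(xs, ts, vs, grid_x, grid_t, n_boot=50, v_thresh=8.0):
    """Per-cell probability that the reconstructed speed is below the SAG threshold."""
    xs, ts, vs = map(np.asarray, (xs, ts, vs))
    hits = np.zeros((len(grid_x), len(grid_t)))
    for _ in range(n_boot):
        idx = np.random.randint(0, len(vs), len(vs))   # resample observations
        V = kernel_speed_field(xs[idx], ts[idx], vs[idx], grid_x, grid_t)
        hits += V < v_thresh
    return hits / n_boot
```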

Using Curiosity for an Even Representation of Tasks in Continual Offline Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2312.03177
  • repo_url: https://github.com/punk95/continual-learning-with-curiosity
  • paper_authors: Pankayaraj Pathmanathan, Natalia Díaz-Rodríguez, Javier Del Ser
  • for: To use curiosity to improve offline multi-task continual reinforcement learning when tasks, defined by non-stationarity in the environment, are unlabeled and not evenly exposed to the learner over time.
  • methods: Curiosity serves both as a task-boundary detector and as a priority metric for retaining old transition tuples, yielding two buffers: a Hybrid Reservoir Buffer with Task Separation (HRBTS) and a Hybrid Curious Buffer (HCB).
  • results: Combined with standard reinforcement learning algorithms, the proposed buffers alleviate the catastrophic forgetting suffered by state-of-the-art replay buffers when exposure to tasks is uneven over time; in three continual reinforcement learning settings on classical control tasks and the Meta-World environment, they display better immunity to catastrophic forgetting than recent works such as the Hybrid Reservoir Buffer (HRB) and the Multi-Time Scale Replay Buffer (MTR) in most settings.
    Abstract In this work, we investigate the means of using curiosity on replay buffers to improve offline multi-task continual reinforcement learning when tasks, which are defined by the non-stationarity in the environment, are non labeled and not evenly exposed to the learner in time. In particular, we investigate the use of curiosity both as a tool for task boundary detection and as a priority metric when it comes to retaining old transition tuples, which we respectively use to propose two different buffers. Firstly, we propose a Hybrid Reservoir Buffer with Task Separation (HRBTS), where curiosity is used to detect task boundaries that are not known due to the task agnostic nature of the problem. Secondly, by using curiosity as a priority metric when it comes to retaining old transition tuples, a Hybrid Curious Buffer (HCB) is proposed. We ultimately show that these buffers, in conjunction with regular reinforcement learning algorithms, can be used to alleviate the catastrophic forgetting issue suffered by the state of the art on replay buffers when the agent's exposure to tasks is not equal along time. We evaluate catastrophic forgetting and the efficiency of our proposed buffers against the latest works such as the Hybrid Reservoir Buffer (HRB) and the Multi-Time Scale Replay Buffer (MTR) in three different continual reinforcement learning settings. Experiments were done on classical control tasks and Metaworld environment. Experiments show that our proposed replay buffers display better immunity to catastrophic forgetting compared to existing works in most of the settings.
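
A minimal sketch of a curiosity-prioritized buffer that evicts the least-curious transition when full; this illustrates the priority-retention idea only, not the exact HRBTS/HCB mechanics:

```python
import heapq

class CuriousBuffer:
    """Keeps the transitions with the highest curiosity scores when full."""
    def __init__(self, capacity: int = 10000):
        self.capacity = capacity
        self.heap = []       # min-heap of (curiosity, insertion_id, transition)
        self._count = 0

    def add(self, transition, curiosity: float):
        self._count += 1     # insertion id breaks ties without comparing transitions
        item = (curiosity, self._count, transition)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif curiosity > self.heap[0][0]:
            heapq.heapreplace(self.heap, item)   # evict the least-curious tuple
```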
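The abstract does not spell out the buffer mechanics. As a rough sketch of curiosity-prioritized retention (not the paper's HRBTS/HCB), where the curiosity score would come from, e.g., a forward-model prediction error:

```python
import random
import numpy as np

class CuriousBuffer:
    """Replay buffer that retains the transitions with the highest
    curiosity scores once capacity is reached."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.transitions = []   # list of (s, a, r, s_next, done) tuples
        self.scores = []        # curiosity score per stored transition

    def add(self, transition, curiosity):
        if len(self.transitions) < self.capacity:
            self.transitions.append(transition)
            self.scores.append(curiosity)
        else:
            # Evict the least curious stored transition if the newcomer
            # is more surprising; otherwise drop the newcomer.
            i = int(np.argmin(self.scores))
            if curiosity > self.scores[i]:
                self.transitions[i] = transition
                self.scores[i] = curiosity

    def sample(self, batch_size):
        return random.sample(self.transitions,
                             min(batch_size, len(self.transitions)))
```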

A Comparative Study of AI-Generated (GPT-4) and Human-crafted MCQs in Programming Education

  • paper_url: http://arxiv.org/abs/2312.03173
  • repo_url: None
  • paper_authors: Jacob Doughty, Zipiao Wan, Anishka Bompelli, Jubahed Qayum, Taozhi Wang, Juran Zhang, Yujia Zheng, Aidan Doyle, Pragnya Sridhar, Arav Agarwal, Christopher Bogart, Eric Keylor, Can Kultur, Jaromir Savelka, Majd Sakr
  • for: Educators can use GPT-4 to generate multiple-choice questions (MCQs) aligned with specific learning objectives (LOs) from Python programming classes in higher education.
  • methods: The GPT-4 system uses a large language model to generate MCQs from high-level course context and module-level LOs.
  • results: The study found that GPT-4 was capable of producing MCQs with clear language, a single correct choice, and high-quality distractors, and the generated MCQs appeared to be well-aligned with the LOs.
    Abstract There is a constant need for educators to develop and maintain effective up-to-date assessments. While there is a growing body of research in computing education on utilizing large language models (LLMs) in generation and engagement with coding exercises, the use of LLMs for generating programming MCQs has not been extensively explored. We analyzed the capability of GPT-4 to produce multiple-choice questions (MCQs) aligned with specific learning objectives (LOs) from Python programming classes in higher education. Specifically, we developed an LLM-powered (GPT-4) system for generation of MCQs from high-level course context and module-level LOs. We evaluated 651 LLM-generated and 449 human-crafted MCQs aligned to 246 LOs from 6 Python courses. We found that GPT-4 was capable of producing MCQs with clear language, a single correct choice, and high-quality distractors. We also observed that the generated MCQs appeared to be well-aligned with the LOs. Our findings can be leveraged by educators wishing to take advantage of the state-of-the-art generative models to support MCQ authoring efforts.
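A minimal sketch of the general pattern, not the paper's system: prompting GPT-4 with high-level course context and a module-level learning objective via the OpenAI Python client (the prompt wording is an assumption):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_mcq(course_context: str, learning_objective: str) -> str:
    """Ask GPT-4 for one MCQ aligned with a module-level learning objective."""
    prompt = (
        f"Course context: {course_context}\n"
        f"Learning objective: {learning_objective}\n\n"
        "Write one multiple-choice question with exactly one correct answer "
        "and three plausible distractors. Mark the correct option."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_mcq(
    "Introductory Python programming in higher education",
    "Students can trace the control flow of a for loop over a list.",
))
```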

GPT vs Human for Scientific Reviews: A Dual Source Review on Applications of ChatGPT in Science

  • paper_url: http://arxiv.org/abs/2312.03769
  • repo_url: None
  • paper_authors: Chenxi Wu, Alan John Varghese, Vivek Oommen, George Em Karniadakis
  • for: This study explores Large Language Models (LLMs) for scientific reviewing, which could speed up reviews, apply less biased quantitative metrics, surface cross-disciplinary connections, and identify emerging trends and research gaps.
  • methods: 13 GPT-related papers from different scientific domains were reviewed by a human reviewer and by SciSpace, a large language model; the reviews were then judged by three types of evaluators: GPT-3.5, a crowd panel, and GPT-4.
  • results: 50% of SciSpace's responses to objective questions aligned with those of the human reviewer, with GPT-4 (the informed evaluator) often rating the human reviewer higher in accuracy and SciSpace higher in structure, clarity, and completeness. On subjective questions, the uninformed evaluators (GPT-3.5 and the crowd panel) showed varying preferences, the crowd panel preferring the human responses, while GPT-4 rated the two equally in accuracy and structure but favored SciSpace for completeness.
    Abstract The new polymath Large Language Models (LLMs) can speed-up greatly scientific reviews, possibly using more unbiased quantitative metrics, facilitating cross-disciplinary connections, and identifying emerging trends and research gaps by analyzing large volumes of data. However, at the present time, they lack the required deep understanding of complex methodologies, they have difficulty in evaluating innovative claims, and they are unable to assess ethical issues and conflicts of interest. Herein, we consider 13 GPT-related papers across different scientific domains, reviewed by a human reviewer and SciSpace, a large language model, with the reviews evaluated by three distinct types of evaluators, namely GPT-3.5, a crowd panel, and GPT-4. We found that 50% of SciSpace's responses to objective questions align with those of a human reviewer, with GPT-4 (informed evaluator) often rating the human reviewer higher in accuracy, and SciSpace higher in structure, clarity, and completeness. In subjective questions, the uninformed evaluators (GPT-3.5 and crowd panel) showed varying preferences between SciSpace and human responses, with the crowd panel showing a preference for the human responses. However, GPT-4 rated them equally in accuracy and structure but favored SciSpace for completeness.

FlexModel: A Framework for Interpretability of Distributed Large Language Models

  • paper_url: http://arxiv.org/abs/2312.03140
  • repo_url: https://github.com/vectorinstitute/flex_model
  • paper_authors: Matthew Choi, Muhammad Adil Asif, John Willes, David Emerson
  • for: This work lowers the distributed-computing barrier to interacting with large language models, enabling interpretability and responsible-AI research on models spread across many devices.
  • methods: It presents FlexModel, a software package providing a streamlined interface for engaging with models distributed across multi-GPU and multi-node configurations. The library is compatible with existing model-distribution libraries, encapsulates PyTorch models, and exposes user-registerable HookFunctions for straightforward interaction with distributed model internals.
  • results: FlexModel democratizes model interactions, bridging the gap between distributed and single-device paradigms and promoting more inclusive research on large-scale neural networks.
    Abstract With the growth of large language models, now incorporating billions of parameters, the hardware prerequisites for their training and deployment have seen a corresponding increase. Although existing tools facilitate model parallelization and distributed training, deeper model interactions, crucial for interpretability and responsible AI techniques, still demand thorough knowledge of distributed computing. This often hinders contributions from researchers with machine learning expertise but limited distributed computing background. Addressing this challenge, we present FlexModel, a software package providing a streamlined interface for engaging with models distributed across multi-GPU and multi-node configurations. The library is compatible with existing model distribution libraries and encapsulates PyTorch models. It exposes user-registerable HookFunctions to facilitate straightforward interaction with distributed model internals, bridging the gap between distributed and single-device model paradigms. Primarily, FlexModel enhances accessibility by democratizing model interactions and promotes more inclusive research in the domain of large-scale neural networks. The package is found at https://github.com/VectorInstitute/flex_model.
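FlexModel's own HookFunction API is documented in the linked repo; as a single-device analogue of the idea, plain PyTorch forward hooks capture intermediate activations in much the same spirit:

```python
import torch
import torch.nn as nn

# Single-device analogue of FlexModel's hook idea, using plain PyTorch
# (this is NOT FlexModel's API; see the repo for the real HookFunction interface).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
captured = {}

def capture_activations(module, inputs, output):
    # Called on every forward pass of the hooked module.
    captured[module] = output.detach()

handle = model[0].register_forward_hook(capture_activations)
_ = model(torch.randn(2, 16))
print(captured[model[0]].shape)  # torch.Size([2, 32])
handle.remove()
```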

Evaluating Agents using Social Choice Theory

  • paper_url: http://arxiv.org/abs/2312.03121
  • repo_url: https://github.com/google-deepmind/open_spiel/tree/master/open_spiel/python/voting
  • paper_authors: Marc Lanctot, Kate Larson, Yoram Bachrach, Luke Marris, Zun Li, Avishkar Bhoopchand, Thomas Anthony, Brian Tanner, Anna Koop
  • for: This work frames general multi-task evaluation through the lens of voting theory, proposing the Voting-as-Evaluation (VasE) framework.
  • methods: Each task is interpreted as a separate voter that supplies only ordinal rankings or pairwise comparisons of agents; viewing the aggregator as a social welfare function lets the framework draw on centuries of research in social choice theory to derive principled evaluation frameworks with axiomatic foundations.
  • results: In practice, VasE is more robust than popular evaluation frameworks (Elo and Nash averaging), discovers properties in evaluation data not evident from scores alone, and predicts outcomes better than Elo in a complex seven-player game. One approach, maximal lotteries, satisfies important consistency properties, is computationally efficient (polynomial in the size of the evaluation data), and identifies game-theoretic cycles.
    Abstract We argue that many general evaluation problems can be viewed through the lens of voting theory. Each task is interpreted as a separate voter, which requires only ordinal rankings or pairwise comparisons of agents to produce an overall evaluation. By viewing the aggregator as a social welfare function, we are able to leverage centuries of research in social choice theory to derive principled evaluation frameworks with axiomatic foundations. These evaluations are interpretable and flexible, while avoiding many of the problems currently facing cross-task evaluation. We apply this Voting-as-Evaluation (VasE) framework across multiple settings, including reinforcement learning, large language models, and humans. In practice, we observe that VasE can be more robust than popular evaluation frameworks (Elo and Nash averaging), discovers properties in the evaluation data not evident from scores alone, and can predict outcomes better than Elo in a complex seven-player game. We identify one particular approach, maximal lotteries, that satisfies important consistency properties relevant to evaluation, is computationally efficient (polynomial in the size of the evaluation data), and identifies game-theoretic cycles.
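A maximal lottery is a maximin strategy of the symmetric zero-sum game defined by the pairwise margin matrix, computable by linear programming. A minimal sketch (not the OpenSpiel implementation linked above):

```python
import numpy as np
from scipy.optimize import linprog

def maximal_lottery(margins: np.ndarray) -> np.ndarray:
    """Maximal lottery for a skew-symmetric margin matrix M, where
    M[i, j] = (#voters preferring i over j) - (#preferring j over i).
    Since the game value of a skew-symmetric game is 0, it suffices to
    find a probability vector p with (p @ M) >= 0 componentwise."""
    n = margins.shape[0]
    res = linprog(
        c=np.zeros(n),                        # pure feasibility problem
        A_ub=-margins.T, b_ub=np.zeros(n),    # enforce (M^T p)_j >= 0
        A_eq=np.ones((1, n)), b_eq=[1.0],     # p sums to 1
        bounds=[(0, None)] * n,
        method="highs",
    )
    return res.x

# Condorcet cycle a > b > c > a: the maximal lottery is uniform.
M = np.array([[0, 1, -1], [-1, 0, 1], [1, -1, 0]])
print(maximal_lottery(M))  # ~[1/3, 1/3, 1/3]
```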

The Landscape of Modern Machine Learning: A Review of Machine, Distributed and Federated Learning

  • paper_url: http://arxiv.org/abs/2312.03120
  • repo_url: None
  • paper_authors: Omer Subasi, Oceane Bel, Joseph Manzano, Kevin Barker
  • for: This study provides a review of modern machine learning, offering a high-level overview of the latest advanced algorithms, applications, and frameworks.
  • methods: It surveys learning on powerful heterogeneous, parallel, and distributed computing systems over ever-increasing amounts of data, covering parallel distributed learning, deep learning, and federated learning.
  • results: The review serves as an introductory text to the vast field of modern machine learning.
    Abstract With the advance of the powerful heterogeneous, parallel and distributed computing systems and ever increasing immense amount of data, machine learning has become an indispensable part of cutting-edge technology, scientific research and consumer products. In this study, we present a review of modern machine and deep learning. We provide a high-level overview for the latest advanced machine learning algorithms, applications, and frameworks. Our discussion encompasses parallel distributed learning, deep learning as well as federated learning. As a result, our work serves as an introductory text to the vast field of modern machine learning.

Unknown Sample Discovery for Source Free Open Set Domain Adaptation

  • paper_url: http://arxiv.org/abs/2312.03767
  • repo_url: None
  • paper_authors: Chowdhury Sadman Jahan, Andreas Savakis
  • for: This work addresses Open Set Domain Adaptation (OSDA), adapting a source-trained model to a target domain that undergoes distribution shift and contains samples from novel classes. It focuses on source-free OSDA (SF-OSDA), where source samples are unavailable; existing SF-OSDA methods adapt using only the known target classes and require access to the entire target domain even during post-adaptation inference.
  • methods: The proposed Unknown Sample Discovery (USD) uses a temporally ensembled teacher model for known-unknown target sample separation and adapts a student model over all classes via co-training and teacher-student temporal consistency, with Jensen-Shannon distance (JSD) as the separation measure. USD appends an unknown-class node to the target model, so post-adaptation inference readily classifies a sample into any known or unknown class.
  • results: Experiments show that USD outperforms existing SF-OSDA methods and is competitive with OSDA models that use both source and target domains during adaptation.
    Abstract Open Set Domain Adaptation (OSDA) aims to adapt a model trained on a source domain to a target domain that undergoes distribution shift and contains samples from novel classes outside the source domain. Source-free OSDA (SF-OSDA) techniques eliminate the need to access source domain samples, but current SF-OSDA methods utilize only the known classes in the target domain for adaptation, and require access to the entire target domain even during inference after adaptation, to make the distinction between known and unknown samples. In this paper, we introduce Unknown Sample Discovery (USD) as an SF-OSDA method that utilizes a temporally ensembled teacher model to conduct known-unknown target sample separation and adapts the student model to the target domain over all classes using co-training and temporal consistency between the teacher and the student. USD promotes Jensen-Shannon distance (JSD) as an effective measure for known-unknown sample separation. Our teacher-student framework significantly reduces error accumulation resulting from imperfect known-unknown sample separation, while curriculum guidance helps to reliably learn the distinction between target known and target unknown subspaces. USD appends the target model with an unknown class node, thus readily classifying a target sample into any of the known or unknown classes in subsequent post-adaptation inference stages. Empirical results show that USD is superior to existing SF-OSDA methods and is competitive with current OSDA models that utilize both source and target domains during adaptation.
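A sketch of the kind of known-unknown score the abstract describes: Jensen-Shannon distance between two probability vectors (how USD thresholds or aggregates such scores is not specified here):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon distance between two class-probability vectors.
    scipy returns the square root of the JS divergence; values near 0
    suggest teacher-student agreement (likely a known-class sample),
    larger values suggest disagreement (possibly an unknown sample)."""
    return float(jensenshannon(p, q, base=2))

teacher = np.array([0.7, 0.2, 0.1])
student = np.array([0.3, 0.4, 0.3])
print(js_distance(teacher, student))
```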

Incidental Polysemanticity

  • paper_url: http://arxiv.org/abs/2312.03096
  • repo_url: https://github.com/tmychow/incidental-polysemanticity
  • paper_authors: Victor Lecomte, Kushal Thaman, Trevor Chow, Rylan Schaeffer, Sanmi Koyejo
  • for: This paper aims to provide a second origin story for polysemantic neurons in deep networks, which can arise incidentally even when there are enough neurons to represent all features in the data.
  • methods: The paper uses a combination of theory and experiments to demonstrate the existence of incidental polysemanticity, and to show how training dynamics can strengthen such overlap.
  • results: The paper finds that incidental polysemanticity can occur even when there are ample neurons to represent all features in the data, and that this type of polysemanticity can be a significant obstacle to interpretability of task-optimized deep networks.
    Abstract Polysemantic neurons (neurons that activate for a set of unrelated features) have been seen as a significant obstacle towards interpretability of task-optimized deep networks, with implications for AI safety. The classic origin story of polysemanticity is that the data contains more "features" than neurons, such that learning to perform a task forces the network to co-allocate multiple unrelated features to the same neuron, endangering our ability to understand the network's internal processing. In this work, we present a second and non-mutually exclusive origin story of polysemanticity. We show that polysemanticity can arise incidentally, even when there are ample neurons to represent all features in the data, using a combination of theory and experiments. This second type of polysemanticity occurs because random initialization can, by chance alone, initially assign multiple features to the same neuron, and the training dynamics then strengthen such overlap. Due to its origin, we term this \textit{incidental polysemanticity}.
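A back-of-envelope illustration of the incidental mechanism (not the paper's setup): with random initialization, the chance that at least two features start out most aligned with the same neuron follows birthday-problem statistics, even when neurons are plentiful:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons, trials = 8, 8, 10_000
collisions = 0
for _ in range(trials):
    W = rng.normal(size=(n_neurons, n_features))
    # Assign each feature to its most strongly aligned neuron at init.
    winners = np.abs(W).argmax(axis=0)
    collisions += len(winners) != len(set(winners))
# With 8 features and 8 neurons, almost every random init shares a neuron
# (P(all distinct) = 8!/8^8, about 0.24%); training can then entrench it.
print(collisions / trials)
```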

Similarity-based Knowledge Transfer for Cross-Domain Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2312.03764
  • repo_url: None
  • paper_authors: Sergio A. Serrano, Jose Martinez-Carranza, L. Enrique Sucar
  • for: This work studies how to transfer knowledge between tasks with different observation and/or action spaces in cross-domain reinforcement learning, in order to accelerate learning.
  • methods: It proposes a semi-supervised alignment loss that matches different spaces with a set of encoder-decoders and uses it to measure task similarity and transfer policies across tasks, without requiring aligned or paired data or data collected by expert policies.
  • results: Experiments on a set of varied MuJoCo control tasks show the method robustly selects and transfers knowledge without supervision from a tailored set of source tasks.
    Abstract Transferring knowledge in cross-domain reinforcement learning is a challenging setting in which learning is accelerated by reusing knowledge from a task with different observation and/or action space. However, it is often necessary to carefully select the source of knowledge for the receiving end to benefit from the transfer process. In this article, we study how to measure the similarity between cross-domain reinforcement learning tasks to select a source of knowledge that will improve the performance of the learning agent. We developed a semi-supervised alignment loss to match different spaces with a set of encoder-decoders, and use them to measure similarity and transfer policies across tasks. In comparison to prior works, our method does not require data to be aligned, paired or collected by expert policies. Experimental results, on a set of varied Mujoco control tasks, show the robustness of our method in effectively selecting and transferring knowledge, without the supervision of a tailored set of source tasks.

RESIN-EDITOR: A Schema-guided Hierarchical Event Graph Visualizer and Editor

  • paper_url: http://arxiv.org/abs/2312.03093
  • repo_url: https://github.com/blender-nlp/resin-editor
  • paper_authors: Khanh Duy Nguyen, Zixuan Zhang, Reece Suchocki, Sha Li, Martha Palmer, Susan Brown, Jiawei Han, Heng Ji
  • for: This paper describes RESIN-EDITOR, an interactive hierarchical event graph visualizer and editor designed for analyzing complex events.
  • methods: The system renders and lets users freely edit hierarchical event graphs extracted from multimedia and multi-document news clusters, guided by human-curated event schemas; its features include hierarchical graph visualization, comprehensive source tracing, and interactive user editing.
  • results: The evaluation demonstrates the tool's effectiveness in understanding complex events and enhancing system performance.
    Abstract In this paper, we present RESIN-EDITOR, an interactive event graph visualizer and editor designed for analyzing complex events. Our RESIN-EDITOR system allows users to render and freely edit hierarchical event graphs extracted from multimedia and multi-document news clusters with guidance from human-curated event schemas. RESIN-EDITOR's unique features include hierarchical graph visualization, comprehensive source tracing, and interactive user editing, which is more powerful and versatile than existing Information Extraction (IE) visualization tools. In our evaluation of RESIN-EDITOR, we demonstrate ways in which our tool is effective in understanding complex events and enhancing system performance. The source code, a video demonstration, and a live website for RESIN-EDITOR have been made publicly available.

Colour versus Shape Goal Misgeneralization in Reinforcement Learning: A Case Study

  • paper_url: http://arxiv.org/abs/2312.03762
  • repo_url: https://github.com/KarolisRam/colour-shape-goal-misgeneralization
  • paper_authors: Karolis Ramanauskas, Özgür Şimşek
  • for: Studies colour-versus-shape goal misgeneralization in reinforcement learning agents.
  • methods: Trains over 1,000 agents in a simplified version of the Procgen Maze environment and evaluates them on over 10 million episodes.
  • results: The behaviour is attributed to agents learning to detect the goal object through a specific colour channel, an arbitrary choice. Due to underspecification, the preference can change when retraining with exactly the same procedure but a different random seed, and outliers in out-of-distribution behaviour arise from the training seed alone.
    Abstract We explore colour versus shape goal misgeneralization originally demonstrated by Di Langosco et al. (2022) in the Procgen Maze environment, where, given an ambiguous choice, the agents seem to prefer generalization based on colour rather than shape. After training over 1,000 agents in a simplified version of the environment and evaluating them on over 10 million episodes, we conclude that the behaviour can be attributed to the agents learning to detect the goal object through a specific colour channel. This choice is arbitrary. Additionally, we show how, due to underspecification, the preferences can change when retraining the agents using exactly the same procedure except for using a different random seed for the training run. Finally, we demonstrate the existence of outliers in out-of-distribution behaviour based on training random seed alone.

Clinical Notes Reveal Physician Fatigue

  • paper_url: http://arxiv.org/abs/2312.03077
  • repo_url: None
  • paper_authors: Chao-Chun Hsu, Ziad Obermeyer, Chenhao Tan
  • for: The paper aims to identify notes written by fatigued physicians and understand the impact of physician fatigue on decision-making and patient outcomes.
  • methods: The authors use a machine learning model to analyze notes from 129,228 emergency room visits and identify patterns associated with fatigued physicians. They also compare the performance of human physicians and language models (LLMs) in generating notes.
  • results: The model accurately identifies notes written by fatigued physicians and flags notes written in other high-fatigue settings. The authors find that notes written by fatigued physicians have lower yield of testing for heart attack and higher predicted fatigue for Black and Hispanic patients. Additionally, they find that LLM-written notes have higher predicted fatigue than real physicians’ notes, suggesting that LLMs may introduce distortions in generated text.
    Abstract Physicians write notes about patients. In doing so, they reveal much about themselves. Using data from 129,228 emergency room visits, we train a model to identify notes written by fatigued physicians -- those who worked 5 or more of the prior 7 days. In a hold-out set, the model accurately identifies notes written by these high-workload physicians, and also flags notes written in other high-fatigue settings: on overnight shifts, and after high patient volumes. Model predictions also correlate with worse decision-making on at least one important metric: yield of testing for heart attack is 18% lower with each standard deviation increase in model-predicted fatigue. Finally, the model indicates that notes written about Black and Hispanic patients have 12% and 21% higher predicted fatigue than Whites -- larger than overnight vs. daytime differences. These results have an important implication for large language models (LLMs). Our model indicates that fatigued doctors write more predictable notes. Perhaps unsurprisingly, because word prediction is the core of how LLMs work, we find that LLM-written notes have 17% higher predicted fatigue than real physicians' notes. This indicates that LLMs may introduce distortions in generated text that are not yet fully understood.

Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World

  • paper_url: http://arxiv.org/abs/2312.02976
  • repo_url: None
  • paper_authors: Kiana Ehsani, Tanmay Gupta, Rose Hendrix, Jordi Salvador, Luca Weihs, Kuo-Hao Zeng, Kunal Pratap Singh, Yejin Kim, Winson Han, Alvaro Herrasti, Ranjay Krishna, Dustin Schwenk, Eli VanderBilt, Aniruddha Kembhavi
  • for: The paper trains modern embodied agents using imitation learning with shortest-path planners in simulation, and demonstrates that these agents can proficiently navigate, explore, and manipulate objects in both simulation and the real world using only RGB sensors.
  • methods: The paper uses a transformer-based, end-to-end architecture called SPOC, paired with extensive image augmentation and millions of frames of shortest-path-expert trajectories collected inside approximately 200,000 procedurally generated houses containing 40,000 unique 3D assets.
  • results: The resulting agents proficiently navigate, explore, and manipulate objects in both simulation and the real world using only RGB sensors, and the method is effective and efficient, generalizing to new environments and tasks.
    Abstract Reinforcement learning (RL) with dense rewards and imitation learning (IL) with human-generated trajectories are the most widely used approaches for training modern embodied agents. RL requires extensive reward shaping and auxiliary losses and is often too slow and ineffective for long-horizon tasks. While IL with human supervision is effective, collecting human trajectories at scale is extremely expensive. In this work, we show that imitating shortest-path planners in simulation produces agents that, given a language instruction, can proficiently navigate, explore, and manipulate objects in both simulation and in the real world using only RGB sensors (no depth map or GPS coordinates). This surprising result is enabled by our end-to-end, transformer-based, SPOC architecture, powerful visual encoders paired with extensive image augmentation, and the dramatic scale and diversity of our training data: millions of frames of shortest-path-expert trajectories collected inside approximately 200,000 procedurally generated houses containing 40,000 unique 3D assets. Our models, data, training code, and newly proposed 10-task benchmarking suite CHORES will be open-sourced.

Dexterous Functional Grasping

  • paper_url: http://arxiv.org/abs/2312.02975
  • repo_url: None
  • paper_authors: Ananye Agarwal, Shagun Uppal, Kenneth Shaw, Deepak Pathak
  • for: This work aims to achieve functional grasping of in-the-wild objects, which requires both an understanding of functional affordances and precise low-level control.
  • methods: A modular approach: affordances are first obtained by matching corresponding regions of different objects, and a low-level policy trained in simulation then grasps the object. A novel application of eigengrasps reduces the RL search space using a small amount of human data, yielding more stable and physically realistic motion.
  • results: The eigengrasp action space beats baselines in simulation, outperforms hardcoded grasping in the real world, and matches or outperforms a trained human teleoperator. Videos and visualizations are at https://dexfunc.github.io/
    Abstract While there have been significant strides in dexterous manipulation, most of it is limited to benchmark tasks like in-hand reorientation which are of limited utility in the real world. The main benefit of dexterous hands over two-fingered ones is their ability to pickup tools and other objects (including thin ones) and grasp them firmly to apply force. However, this task requires both a complex understanding of functional affordances as well as precise low-level control. While prior work obtains affordances from human data this approach doesn't scale to low-level control. Similarly, simulation training cannot give the robot an understanding of real-world semantics. In this paper, we aim to combine the best of both worlds to accomplish functional grasping for in-the-wild objects. We use a modular approach. First, affordances are obtained by matching corresponding regions of different objects and then a low-level policy trained in sim is run to grasp it. We propose a novel application of eigengrasps to reduce the search space of RL using a small amount of human data and find that it leads to more stable and physically realistic motion. We find that eigengrasp action space beats baselines in simulation and outperforms hardcoded grasping in real and matches or outperforms a trained human teleoperator. Results visualizations and videos at https://dexfunc.github.io/
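Eigengrasps are, in essence, principal components of recorded hand poses, so an RL policy can act in a few latent dimensions instead of the full joint space. A generic sketch with synthetic stand-in data (the paper's exact construction from human data is not given in the abstract):

```python
import numpy as np

rng = np.random.default_rng(0)
poses = rng.normal(size=(500, 16))   # stand-in for recorded human hand poses
mean = poses.mean(axis=0)
U, S, Vt = np.linalg.svd(poses - mean, full_matrices=False)

k = 5
eigengrasps = Vt[:k]                 # top-k principal grasp directions

def decode(latent: np.ndarray) -> np.ndarray:
    """Map a k-dim action (what the RL policy outputs) to full joint space."""
    return mean + latent @ eigengrasps

print(decode(np.zeros(k)).shape)     # (16,) -- the mean grasp pose
```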

Alchemist: Parametric Control of Material Properties with Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.02970
  • repo_url: None
  • paper_authors: Prafull Sharma, Varun Jampani, Yuanzhen Li, Xuhui Jia, Dmitry Lagun, Fredo Durand, William T. Freeman, Mark Matthews
  • for: This paper controls material attributes of objects in real images, such as roughness, metallic appearance, albedo, and transparency.
  • methods: The method capitalizes on the generative prior of photorealistic text-to-image models, employing a scalar value and instructions to alter low-level material properties. To address the lack of datasets with controlled material attributes, an object-centric synthetic dataset with physically-based materials is generated, and a modified pre-trained text-to-image model is fine-tuned on it.
  • results: The fine-tuned model edits material properties in real-world images while preserving all other attributes, with potential application to material-edited NeRFs.
    Abstract We propose a method to control material attributes of objects like roughness, metallic, albedo, and transparency in real images. Our method capitalizes on the generative prior of text-to-image models known for photorealism, employing a scalar value and instructions to alter low-level material properties. Addressing the lack of datasets with controlled material attributes, we generated an object-centric synthetic dataset with physically-based materials. Fine-tuning a modified pre-trained text-to-image model on this synthetic dataset enables us to edit material properties in real-world images while preserving all other attributes. We show the potential application of our model to material edited NeRFs.

Generating Interpretable Networks using Hypernetworks

  • paper_url: http://arxiv.org/abs/2312.03051
  • repo_url: None
  • paper_authors: Isaac Liao, Ziming Liu, Max Tegmark
  • for: This work targets decoding neural networks, i.e., converting raw weights into interpretable algorithms; rather than encoding known algorithms into networks, it explores generating interpretable networks whose underlying algorithms are not yet known.
  • methods: A hypernetwork, carefully designed to control network complexity, generates a diverse family of interpretable networks ranked by their complexity.
  • results: For computing L1 norms, the hypernetwork finds three algorithms: (a) the double-sided algorithm, (b) the convexity algorithm, and (c) the pudding algorithm, of which only the first was expected before the experiments. The algorithms are automatically classified, their development during training and response to complexity control are analyzed, and a trained hypernetwork correctly constructs models for input dimensions not seen in training, demonstrating systematic generalization.
    Abstract An essential goal in mechanistic interpretability to decode a network, i.e., to convert a neural network's raw weights to an interpretable algorithm. Given the difficulty of the decoding problem, progress has been made to understand the easier encoding problem, i.e., to convert an interpretable algorithm into network weights. Previous works focus on encoding existing algorithms into networks, which are interpretable by definition. However, focusing on encoding limits the possibility of discovering new algorithms that humans have never stumbled upon, but that are nevertheless interpretable. In this work, we explore the possibility of using hypernetworks to generate interpretable networks whose underlying algorithms are not yet known. The hypernetwork is carefully designed such that it can control network complexity, leading to a diverse family of interpretable algorithms ranked by their complexity. All of them are interpretable in hindsight, although some of them are less intuitive to humans, hence providing new insights regarding how to "think" like a neural network. For the task of computing L1 norms, hypernetworks find three algorithms: (a) the double-sided algorithm, (b) the convexity algorithm, (c) the pudding algorithm, although only the first algorithm was expected by the authors before experiments. We automatically classify these algorithms and analyze how these algorithmic phases develop during training, as well as how they are affected by complexity control. Furthermore, we show that a trained hypernetwork can correctly construct models for input dimensions not seen in training, demonstrating systematic generalization.
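A minimal hypernetwork sketch, assuming none of the paper's architecture or complexity-control details: one network maps an embedding z to the weights of a small target network, so different z values yield different generated networks:

```python
import torch
import torch.nn as nn

class HyperNet(nn.Module):
    """Minimal hypernetwork: maps an embedding z to the weights of a small
    two-layer target network applied to the input x (biases omitted)."""
    def __init__(self, z_dim=4, in_dim=8, hidden=16):
        super().__init__()
        self.in_dim, self.hidden = in_dim, hidden
        self.n_w1 = in_dim * hidden          # first-layer weight count
        self.n_w2 = hidden                   # second-layer weight count
        self.gen = nn.Linear(z_dim, self.n_w1 + self.n_w2)

    def forward(self, z, x):
        params = self.gen(z)
        w1 = params[: self.n_w1].view(self.hidden, self.in_dim)
        w2 = params[self.n_w1 :].view(1, self.hidden)
        return torch.relu(x @ w1.T) @ w2.T   # generated net applied to x

hyper = HyperNet()
z = torch.randn(4)          # controls which target network is generated
x = torch.randn(2, 8)
print(hyper(z, x).shape)    # torch.Size([2, 1])
```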

Classification for everyone : Building geography agnostic models for fairer recognition

  • paper_url: http://arxiv.org/abs/2312.02957
  • repo_url: None
  • paper_authors: Akshat Jindal, Shreya Singh, Soham Gadgil
  • for: This paper studies how to mitigate the inherent geographical biases present in state-of-the-art image classification models.
  • methods: The bias is first quantified on two datasets with location information, The Dollar Street Dataset and ImageNet, and several methods for reducing it are then presented.
  • results: The analysis shows how the different techniques make these models more robust to the geographical location of images.
    Abstract In this paper, we analyze different methods to mitigate inherent geographical biases present in state of the art image classification models. We first quantitatively present this bias in two datasets - The Dollar Street Dataset and ImageNet, using images with location information. We then present different methods which can be employed to reduce this bias. Finally, we analyze the effectiveness of the different techniques on making these models more robust to geographical locations of the images.

WhisBERT: Multimodal Text-Audio Language Modeling on 100M Words

  • paper_url: http://arxiv.org/abs/2312.02931
  • repo_url: None
  • paper_authors: Lukas Wolf, Greta Tuckute, Klemen Kotar, Eghbal Hosseini, Tamar Regev, Ethan Wilcox, Alex Warstadt
  • for: This work tests whether training language models on multiple input modalities can improve their quality and efficiency.
  • methods: The authors introduce Whisbert, a text-audio model inspired by the text-image approach of FLAVA (Singh et al., 2022). Following the Babylm guidelines (Warstadt et al., 2023), Whisbert is pretrained on a dataset of 100 million words plus their corresponding speech from the word-aligned version of the People's Speech dataset (Galvez et al., 2021), and compared against a version trained on text only.
  • results: Whisbert performs well on multimodal masked modeling and surpasses the Babylm baselines on most benchmark tasks, but it struggles to optimize its complex objective and does not outperform its text-only counterpart.
    Abstract Training on multiple modalities of input can augment the capabilities of a language model. Here, we ask whether such a training regime can improve the quality and efficiency of these systems as well. We focus on text--audio and introduce Whisbert, which is inspired by the text--image approach of FLAVA (Singh et al., 2022). In accordance with Babylm guidelines (Warstadt et al., 2023), we pretrain Whisbert on a dataset comprising only 100 million words plus their corresponding speech from the word-aligned version of the People's Speech dataset (Galvez et al., 2021). To assess the impact of multimodality, we compare versions of the model that are trained on text only and on both audio and text simultaneously. We find that while Whisbert is able to perform well on multimodal masked modeling and surpasses the Babylm baselines in most benchmark tasks, it struggles to optimize its complex objective and outperform its text-only Whisbert baseline.

Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions

  • paper_url: http://arxiv.org/abs/2312.02913
  • repo_url: https://github.com/zahraabbasiantaeb/simquac
  • paper_authors: Zahra Abbasiantaeb, Yifei Yuan, Evangelos Kanoulas, Mohammad Aliannejadi
  • for: This paper explores using large language models (LLMs) to simulate the human-to-human conversations of conversational question-answering (CQA) systems.
  • methods: A zero-shot simulation framework in which two GPT-4 instances interact on a given topic: a student LLM generates questions to explore a search topic, and a teacher LLM, equipped with additional information including a text on the topic, answers them.
  • results: LLM-based simulation proves effective: the teacher LLM generates lengthier answers that tend to be more accurate and complete, while the student LLM generates more diverse questions that cover more aspects of a given topic.
    Abstract Conversational question-answering (CQA) systems aim to create interactive search systems that effectively retrieve information by interacting with users. To replicate human-to-human conversations, existing work uses human annotators to play the roles of the questioner (student) and the answerer (teacher). Despite its effectiveness, challenges exist as human annotation is time-consuming, inconsistent, and not scalable. To address this issue and investigate the applicability of large language models (LLMs) in CQA simulation, we propose a simulation framework that employs zero-shot learner LLMs for simulating teacher-student interactions. Our framework involves two LLMs interacting on a specific topic, with the first LLM acting as a student, generating questions to explore a given search topic. The second LLM plays the role of a teacher by answering questions and is equipped with additional information, including a text on the given topic. We implement both the student and teacher by zero-shot prompting the GPT-4 model. To assess the effectiveness of LLMs in simulating CQA interactions and understand the disparities between LLM- and human-generated conversations, we evaluate the simulated data from various perspectives. We begin by evaluating the teacher's performance through both automatic and human assessment. Next, we evaluate the performance of the student, analyzing and comparing the disparities between questions generated by the LLM and those generated by humans. Furthermore, we conduct extensive analyses to thoroughly examine the LLM performance by benchmarking state-of-the-art reading comprehension models on both datasets. Our results reveal that the teacher LLM generates lengthier answers that tend to be more accurate and complete. The student LLM generates more diverse questions, covering more aspects of a given topic.
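A minimal sketch of the zero-shot LLM-to-LLM loop (the roles, prompts, and transcript handling are assumptions, not the SimQuAC implementation):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def turn(system_prompt: str, transcript: list[str]) -> str:
    """One zero-shot GPT-4 turn; the running transcript is passed as context."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "\n".join(transcript) or "Begin."},
        ],
    )
    return resp.choices[0].message.content

topic_text = "..."  # background text available only to the teacher
student = "You are a student exploring a topic. Ask the next question."
teacher = f"You are a teacher. Answer the last question using:\n{topic_text}"

transcript: list[str] = []
for _ in range(3):  # three simulated question-answer exchanges
    q = turn(student, transcript)
    transcript.append(f"Student: {q}")
    a = turn(teacher, transcript)
    transcript.append(f"Teacher: {a}")
print("\n".join(transcript))
```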

Toward autocorrection of chemical process flowsheets using large language models

  • paper_url: http://arxiv.org/abs/2312.02873
  • repo_url: None
  • paper_authors: Lukas Schulze Balhorn, Marc Caballero, Artur M. Schweidtmann
  • for: This paper proposes a generative AI methodology for automatically identifying errors in chemical process flowsheets and suggesting corrections to the user, i.e., autocorrecting flowsheets.
  • methods: Inspired by large language models (LLMs) for grammatical autocorrection of human language, an LLM is trained in a supervised manner on a synthetic dataset; the input is a potentially erroneous flowsheet and the output is a suggested corrected flowsheet.
  • results: The model achieves a top-1 accuracy of 80% and a top-5 accuracy of 84% on an independent test set of synthetically generated flowsheets, suggesting it can learn to autocorrect synthetic flowsheets.
    Abstract The process engineering domain widely uses Process Flow Diagrams (PFDs) and Process and Instrumentation Diagrams (P&IDs) to represent process flows and equipment configurations. However, the P&IDs and PFDs, hereafter called flowsheets, can contain errors causing safety hazards, inefficient operation, and unnecessary expenses. Correcting and verifying flowsheets is a tedious, manual process. We propose a novel generative AI methodology for automatically identifying errors in flowsheets and suggesting corrections to the user, i.e., autocorrecting flowsheets. Inspired by the breakthrough of Large Language Models (LLMs) for grammatical autocorrection of human language, we investigate LLMs for the autocorrection of flowsheets. The input to the model is a potentially erroneous flowsheet and the output of the model are suggestions for a corrected flowsheet. We train our autocorrection model on a synthetic dataset in a supervised manner. The model achieves a top-1 accuracy of 80% and a top-5 accuracy of 84% on an independent test dataset of synthetically generated flowsheets. The results suggest that the model can learn to autocorrect the synthetic flowsheets. We envision that flowsheet autocorrection will become a useful tool for chemical engineers.

Experimental Insights Towards Explainable and Interpretable Pedestrian Crossing Prediction

  • paper_url: http://arxiv.org/abs/2312.02872
  • repo_url: None
  • paper_authors: Angie Nataly Melo, Carlota Salinas, Miguel Angel Sotelo
  • for: This work aims to improve road safety in autonomous driving through explainable and interpretable pedestrian crossing prediction.
  • methods: A novel neuro-symbolic approach combines deep learning and fuzzy logic: the explainable predictor ExPedCross uses a set of explainable features and a fuzzy inference system to predict whether a pedestrian will cross.
  • results: Evaluation on the PIE and JAAD datasets yields experimental insights into achieving explainability and interpretability in pedestrian crossing prediction, along with guidelines and recommendations on dataset selection, feature selection, and explainability.
    Abstract In the context of autonomous driving, pedestrian crossing prediction is a key component for improving road safety. Presently, the focus of these predictions extends beyond achieving trustworthy results; it is shifting towards the explainability and interpretability of these predictions. This research introduces a novel neuro-symbolic approach that combines deep learning and fuzzy logic for an explainable and interpretable pedestrian crossing prediction. We have developed an explainable predictor (ExPedCross), which utilizes a set of explainable features and employs a fuzzy inference system to predict whether the pedestrian will cross or not. Our approach was evaluated on both the PIE and JAAD datasets. The results offer experimental insights into achieving explainability and interpretability in the pedestrian crossing prediction task. Furthermore, the testing results yield a set of guidelines and recommendations regarding the process of dataset selection, feature selection, and explainability.
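A toy fuzzy-inference sketch; the features, membership functions, and rules below are invented for illustration and are not ExPedCross's:

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def crossing_intention(dist_to_curb_m: float, head_toward_road: float) -> float:
    """Two-rule Mamdani-style inference with singleton outputs.
    Rule 1: IF near curb AND looking at road THEN crossing (output 1).
    Rule 2: IF far from curb THEN not crossing (output 0)."""
    near = tri(dist_to_curb_m, -1.0, 0.0, 3.0)
    s1 = min(near, head_toward_road)   # fuzzy AND = min
    s2 = 1.0 - near                    # fuzzy NOT
    total = s1 + s2
    return s1 / total if total > 0 else 0.0

print(crossing_intention(0.5, 0.9))    # near and attentive -> high intention
print(crossing_intention(2.5, 0.9))    # far from curb -> low intention
```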

  • paper_url: http://arxiv.org/abs/2312.03043
  • repo_url: https://github.com/lucidrains/imagen-pytorch
  • paper_authors: Simeon Allmendinger, Patrick Hemmer, Moritz Queisner, Igor Sauer, Leopold Müller, Johannes Jakubik, Michael Vössing, Niklas Kühl
  • for: This study uses diffusion-based generative models to generate realistic synthetic laparoscopic images, providing reliable supplements for surgical applications and decision-making through computer vision.
  • methods: State-of-the-art text-to-image architectures are applied to laparoscopic imaging, generating synthetic images from short text prompts, with the surgical removal of the gallbladder as the example.
  • results: Fidelity and diversity results show that diffusion-based models can acquire the style and semantics of image-guided surgery. In a human assessment survey, medical personnel mistook generated images for real ones at a 66% false-positive rate, and a state-of-the-art surgical action recognition model improved by up to 5.20% when trained with additional generated images.
    Abstract Recent advances in synthetic imaging open up opportunities for obtaining additional data in the field of surgical imaging. This data can provide reliable supplements supporting surgical applications and decision-making through computer vision. Particularly the field of image-guided surgery, such as laparoscopic and robotic-assisted surgery, benefits strongly from synthetic image datasets and virtual surgical training methods. Our study presents an intuitive approach for generating synthetic laparoscopic images from short text prompts using diffusion-based generative models. We demonstrate the usage of state-of-the-art text-to-image architectures in the context of laparoscopic imaging with regard to the surgical removal of the gallbladder as an example. Results on fidelity and diversity demonstrate that diffusion-based models can acquire knowledge about the style and semantics in the field of image-guided surgery. A validation study with a human assessment survey underlines the realistic nature of our synthetic data, as medical personnel detects actual images in a pool with generated images causing a false-positive rate of 66%. In addition, the investigation of a state-of-the-art machine learning model to recognize surgical actions indicates enhanced results when trained with additional generated images of up to 5.20%. Overall, the achieved image quality contributes to the usage of computer-generated images in surgical applications and enhances its path to maturity.

Towards Causal Representations of Climate Model Data

  • paper_url: http://arxiv.org/abs/2312.02858
  • repo_url: None
  • paper_authors: Julien Boussard, Chandni Nagda, Julia Kaltenborn, Charlotte Emilie Elektra Lange, Philippe Brouillard, Yaniv Gurwicz, Peer Nowack, David Rolnick
  • for: This paper aims to explore the potential of using causal representation learning to improve the efficiency and interpretability of climate model emulation.
  • methods: The paper uses the CDSD method to learn causal representations of climate data, including emissions, temperature, and precipitation.
  • results: The paper evaluates the effectiveness of CDSD in rendering climate model emulation more efficient and interpretable, and sheds light on the challenges and limitations of using this approach.
    Abstract Climate models, such as Earth system models (ESMs), are crucial for simulating future climate change based on projected Shared Socioeconomic Pathways (SSP) greenhouse gas emissions scenarios. While ESMs are sophisticated and invaluable, machine learning-based emulators trained on existing simulation data can project additional climate scenarios much faster and are computationally efficient. However, they often lack generalizability and interpretability. This work delves into the potential of causal representation learning, specifically the \emph{Causal Discovery with Single-parent Decoding} (CDSD) method, which could render climate model emulation efficient \textit{and} interpretable. We evaluate CDSD on multiple climate datasets, focusing on emissions, temperature, and precipitation. Our findings shed light on the challenges, limitations, and promise of using CDSD as a stepping stone towards more interpretable and robust climate model emulation.

Exploring Error Bits for Memory Failure Prediction: An In-Depth Correlative Study

  • paper_url: http://arxiv.org/abs/2312.02855
  • repo_url: None
  • paper_authors: Qiao Yu, Wengui Zhang, Jorge Cardoso, Odej Kao
  • for: This paper addresses memory failures in large-scale datacenters, where uncorrectable errors (UEs) are a major indicator of Dual Inline Memory Module (DIMM) defects.
  • methods: It presents a comprehensive study of the correlation between correctable errors (CEs) and UEs, emphasizing spatio-temporal error-bit information, and uses this information to predict UEs.
  • results: Evaluations on real-world datasets show the approach improves prediction performance by 15% in F1-score over state-of-the-art algorithms and reduces virtual machine interruptions caused by UEs by approximately 59%.
    Abstract In large-scale datacenters, memory failure is a common cause of server crashes, with uncorrectable errors (UEs) being a major indicator of Dual Inline Memory Module (DIMM) defects. Existing approaches primarily focus on predicting UEs using correctable errors (CEs), without fully considering the information provided by error bits. However, error bit patterns have a strong correlation with the occurrence of uncorrectable errors (UEs). In this paper, we present a comprehensive study on the correlation between CEs and UEs, specifically emphasizing the importance of spatio-temporal error bit information. Our analysis reveals a strong correlation between spatio-temporal error bits and UE occurrence. Through evaluations using real-world datasets, we demonstrate that our approach significantly improves prediction performance by 15% in F1-score compared to the state-of-the-art algorithms. Overall, our approach effectively reduces the number of virtual machine interruptions caused by UEs by approximately 59%.

Inherent limitations of LLMs regarding spatial information

  • paper_url: http://arxiv.org/abs/2312.03042
  • repo_url: None
  • paper_authors: He Yan, Xinyao Hu, Xiangpeng Wan, Chengyu Huang, Kai Zou, Shiqi Xu
  • for: This paper investigates the limitations of ChatGPT and similar models in spatial reasoning and navigation-related tasks, and evaluates their capabilities in 2D and 3D route planning.
  • methods: The paper introduces a novel evaluation framework and a baseline dataset specifically crafted for this study, which includes three key tasks: plotting spatial points, planning routes in 2D spaces, and devising pathways in 3D environments.
  • results: The evaluation reveals key insights into ChatGPT’s capabilities and limitations in spatial understanding, highlighting the areas where the model struggles and where further improvement is needed.
    Abstract Despite the significant advancements in natural language processing capabilities demonstrated by large language models such as ChatGPT, their proficiency in comprehending and processing spatial information, especially within the domains of 2D and 3D route planning, remains notably underdeveloped. This paper investigates the inherent limitations of ChatGPT and similar models in spatial reasoning and navigation-related tasks, an area critical for applications ranging from autonomous vehicle guidance to assistive technologies for the visually impaired. In this paper, we introduce a novel evaluation framework complemented by a baseline dataset, meticulously crafted for this study. This dataset is structured around three key tasks: plotting spatial points, planning routes in two-dimensional (2D) spaces, and devising pathways in three-dimensional (3D) environments. We specifically developed this dataset to assess the spatial reasoning abilities of ChatGPT. Our evaluation reveals key insights into the model's capabilities and limitations in spatial understanding.

Are Vision Transformers More Data Hungry Than Newborn Visual Systems?

  • paper_url: http://arxiv.org/abs/2312.02843
  • repo_url: https://github.com/buildingamind/vit-cot
  • paper_authors: Lalit Pandey, Samantha M. W. Wood, Justin N. Wood
  • for: Tests whether vision transformers (ViTs) are more data hungry than newborn visual systems by directly comparing their learning abilities under matched visual experience.
  • methods: Parallel controlled-rearing experiments: newborn chicks were raised in impoverished visual environments containing a single object, and the training data available in those environments was simulated with virtual animal chambers built in a video game engine; first-person images from agents moving through the chambers were used to train self-supervised ViTs that leverage time as a teaching signal, akin to biological visual systems.
  • results: ViTs trained through the eyes of newborn chicks solved the same view-invariant object recognition tasks as the chicks. ViTs were thus not more data hungry than newborn visual systems: both learned view-invariant object representations in impoverished visual environments.
    Abstract Vision transformers (ViTs) are top performing models on many computer vision benchmarks and can accurately predict human behavior on object recognition tasks. However, researchers question the value of using ViTs as models of biological learning because ViTs are thought to be more data hungry than brains, with ViTs requiring more training data to reach similar levels of performance. To test this assumption, we directly compared the learning abilities of ViTs and animals, by performing parallel controlled rearing experiments on ViTs and newborn chicks. We first raised chicks in impoverished visual environments containing a single object, then simulated the training data available in those environments by building virtual animal chambers in a video game engine. We recorded the first-person images acquired by agents moving through the virtual chambers and used those images to train self supervised ViTs that leverage time as a teaching signal, akin to biological visual systems. When ViTs were trained through the eyes of newborn chicks, the ViTs solved the same view invariant object recognition tasks as the chicks. Thus, ViTs were not more data hungry than newborn visual systems: both learned view invariant object representations in impoverished visual environments. The flexible and generic attention based learning mechanism in ViTs combined with the embodied data streams available to newborn animals appears sufficient to drive the development of animal-like object recognition.

MIMONets: Multiple-Input-Multiple-Output Neural Networks Exploiting Computation in Superposition

  • paper_url: http://arxiv.org/abs/2312.02829
  • repo_url: https://github.com/ibm/multiple-input-multiple-output-nets
  • paper_authors: Nicolas Menet, Michael Hersche, Geethan Karunaratne, Luca Benini, Abu Sebastian, Abbas Rahimi
  • for: This work proposes Multiple-Input-Multiple-Output Neural Networks (MIMONets) to lower the cost of inference by exploiting computation in superposition.
  • methods: MIMONets augment deep architectures with variable binding mechanisms that represent an arbitrary number of inputs in a compositional data structure via fixed-width distributed representations, adapt nonlinear neural transformations to process the structure holistically, and recover each transformed input with an unbinding mechanism. The concept is applied to CNN and Transformer architectures, yielding MIMOConv and MIMOFormer, with an instantaneous on-demand switch between accuracy-throughput operating points within a single set of fixed parameters.
  • results: MIMOConv achieves about a 2-4x speedup at an accuracy delta within [+0.68, -3.18]% compared to WideResNet CNNs on CIFAR10 and CIFAR100; MIMOFormer handles 2-4 inputs at once while maintaining average accuracy within a [-1.07, -3.43]% delta on the Long Range Arena benchmark. Mathematical bounds on interference between superposition channels are also provided.
    Abstract With the advent of deep learning, progressively larger neural networks have been designed to solve complex tasks. We take advantage of these capacity-rich models to lower the cost of inference by exploiting computation in superposition. To reduce the computational burden per input, we propose Multiple-Input-Multiple-Output Neural Networks (MIMONets) capable of handling many inputs at once. MIMONets augment various deep neural network architectures with variable binding mechanisms to represent an arbitrary number of inputs in a compositional data structure via fixed-width distributed representations. Accordingly, MIMONets adapt nonlinear neural transformations to process the data structure holistically, leading to a speedup nearly proportional to the number of superposed input items in the data structure. After processing in superposition, an unbinding mechanism recovers each transformed input of interest. MIMONets also provide a dynamic trade-off between accuracy and throughput by an instantaneous on-demand switching between a set of accuracy-throughput operating points, yet within a single set of fixed parameters. We apply the concept of MIMONets to both CNN and Transformer architectures resulting in MIMOConv and MIMOFormer, respectively. Empirical evaluations show that MIMOConv achieves about 2-4 x speedup at an accuracy delta within [+0.68, -3.18]% compared to WideResNet CNNs on CIFAR10 and CIFAR100. Similarly, MIMOFormer can handle 2-4 inputs at once while maintaining a high average accuracy within a [-1.07, -3.43]% delta on the long range arena benchmark. Finally, we provide mathematical bounds on the interference between superposition channels in MIMOFormer. Our code is available at https://github.com/IBM/multiple-input-multiple-output-nets.
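MIMONets' binding mechanism is architecture-specific; as a generic illustration of computing on superposed inputs, here is a holographic-reduced-representation (HRR) sketch in which two inputs share one fixed-width vector and are recovered by unbinding:

```python
import numpy as np

def bind(key, value):
    """Circular convolution, a classic variable-binding operator (HRR)."""
    return np.real(np.fft.ifft(np.fft.fft(key) * np.fft.fft(value)))

def unbind(key, trace):
    """Circular correlation, the approximate inverse of binding."""
    return np.real(np.fft.ifft(np.fft.fft(trace) * np.conj(np.fft.fft(key))))

d = 1024
rng = np.random.default_rng(0)
k1, k2 = rng.normal(size=(2, d)) / np.sqrt(d)   # near-unit-norm random keys
x1, x2 = rng.normal(size=(2, d))

superposed = bind(k1, x1) + bind(k2, x2)        # two inputs, one vector
x1_hat = unbind(k1, superposed)                 # recover the first input
cos = x1_hat @ x1 / (np.linalg.norm(x1_hat) * np.linalg.norm(x1))
print(cos)  # well above 0: recoverable up to crosstalk a readout can denoise
```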

Calibrated Adaptive Teacher for Domain Adaptive Intelligent Fault Diagnosis

  • paper_url: http://arxiv.org/abs/2312.02826
  • repo_url: None
  • paper_authors: Florent Forest, Olga Fink
  • for: This work targets deep-learning-based Intelligent Fault Diagnosis (IFD) under unsupervised domain adaptation, where assets operate in working conditions different from those in which labeled data were collected.
  • methods: Self-training with confident pseudo-labels is hindered by poorly calibrated, over-confident predictions in the target domain. The proposed Calibrated Adaptive Teacher (CAT) calibrates the teacher network's predictions throughout the self-training process using post-hoc calibration techniques.
  • results: On the Paderborn benchmark for bearing fault diagnosis under varying operating conditions, CAT achieves state-of-the-art performance on most transfer tasks.
    Abstract Intelligent Fault Diagnosis (IFD) based on deep learning has proven to be an effective and flexible solution, attracting extensive research. Deep neural networks can learn rich representations from vast amounts of representative labeled data for various applications. In IFD, they achieve high classification performance from signals in an end-to-end manner, without requiring extensive domain knowledge. However, deep learning models usually only perform well on the data distribution they have been trained on. When applied to a different distribution, they may experience performance drops. This is also observed in IFD, where assets are often operated in working conditions different from those in which labeled data have been collected. Unsupervised domain adaptation (UDA) deals with the scenario where labeled data are available in a source domain, and only unlabeled data are available in a target domain, where domains may correspond to operating conditions. Recent methods rely on training with confident pseudo-labels for target samples. However, the confidence-based selection of pseudo-labels is hindered by poorly calibrated confidence estimates in the target domain, primarily due to over-confident predictions, which limits the quality of pseudo-labels and leads to error accumulation. In this paper, we propose a novel UDA method called Calibrated Adaptive Teacher (CAT), where we propose to calibrate the predictions of the teacher network throughout the self-training process, leveraging post-hoc calibration techniques. We evaluate CAT on domain-adaptive IFD and perform extensive experiments on the Paderborn benchmark for bearing fault diagnosis under varying operating conditions. Our proposed method achieves state-of-the-art performance on most transfer tasks.

Sample-based Dynamic Hierarchical Transformer with Layer and Head Flexibility via Contextual Bandit

  • paper_url: http://arxiv.org/abs/2312.03038
  • repo_url: None
  • paper_authors: Fanfei Meng, Lele Zhang, Yu Chen, Yuxin Wang
  • for: proposes a sample-based Dynamic Hierarchical Transformer (DHT) whose numbers of layers and heads are configured dynamically per data sample, matching compute to sample complexity during both training and inference.
  • methods: casts the configuration choice as contextual bandit problems, using the Uniform Confidence Bound to determine the number of layers and heads and combinatorial Thompson Sampling to select specific head combinations given that number.
  • results: unlike prior work that only compresses trained networks for inference, DHT achieves up to 74% computational savings for both training and inference with minimal accuracy loss.
    Abstract Transformer requires a fixed number of layers and heads which makes them inflexible to the complexity of individual samples and expensive in training and inference. To address this, we propose a sample-based Dynamic Hierarchical Transformer (DHT) model whose layers and heads can be dynamically configured with single data samples via solving contextual bandit problems. To determine the number of layers and heads, we use the Uniform Confidence Bound while we deploy combinatorial Thompson Sampling in order to select specific head combinations given their number. Different from previous work that focuses on compressing trained networks for inference only, DHT is not only advantageous for adaptively optimizing the underlying network architecture during training but also has a flexible network for efficient inference. To the best of our knowledge, this is the first comprehensive data-driven dynamic transformer without any additional auxiliary neural networks that implement the dynamic system. According to the experiment results, we achieve up to 74% computational savings for both training and inference with a minimal loss of accuracy.
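
As a rough illustration of the combinatorial Thompson Sampling step, the sketch below maintains a Beta posterior per attention head and keeps the top-k posterior draws. The Bernoulli reward signal and the head-quality model are invented for the demo; the paper's exact bandit formulation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, k = 8, 3          # choose k of n_heads heads per sample
alpha = np.ones(n_heads)   # Beta posterior: success counts per head
beta = np.ones(n_heads)    # Beta posterior: failure counts per head

def select_heads():
    # Combinatorial Thompson Sampling: sample each head's success
    # probability from its Beta posterior, keep the top-k draws.
    theta = rng.beta(alpha, beta)
    return np.argsort(theta)[-k:]

def update(heads, reward):
    # reward in {0, 1}, e.g. whether this sample was handled correctly.
    alpha[heads] += reward
    beta[heads] += 1 - reward

# Toy environment: heads 0-2 are genuinely useful, the rest are not.
true_quality = np.array([0.9] * 3 + [0.3] * 5)
for _ in range(2000):
    heads = select_heads()
    reward = int(rng.random() < true_quality[heads].mean())
    update(heads, reward)

print("posterior means:", np.round(alpha / (alpha + beta), 2))
```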

Clustering Pseudo Language Family in Multilingual Translation Models with Fisher Information Matrix

  • paper_url: http://arxiv.org/abs/2312.02820
  • repo_url: https://github.com/ecoli-hit/pseudofamily
  • paper_authors: Xinyu Ma, Xuebo Liu, Min Zhang
  • for: The paper is written to address the challenge of clustering languages based solely on their ancestral families, which can yield suboptimal results due to variations in the datasets employed during the model’s training phase.
  • methods: The paper introduces an innovative method that leverages the Fisher information matrix (FIM) to cluster language families, anchored on the multilingual translation model’s characteristics. The method defines pseudo language families based on the similarity of the effects of language pairs on model parameters.
  • results: The paper shows that employing these pseudo language families enhances performance over conventional language families in adapting a multilingual translation model to unfamiliar language pairs. The proposed methodology may also be extended to scenarios requiring language similarity measurements.
    Abstract In multilingual translation research, the comprehension and utilization of language families are of paramount importance. Nevertheless, clustering languages based solely on their ancestral families can yield suboptimal results due to variations in the datasets employed during the model's training phase. To mitigate this challenge, we introduce an innovative method that leverages the Fisher information matrix (FIM) to cluster language families, anchored on the multilingual translation model's characteristics. We hypothesize that language pairs with similar effects on model parameters exhibit a considerable degree of linguistic congruence and should thus be grouped cohesively. This concept has led us to define pseudo language families. We provide an in-depth discussion regarding the inception and application of these pseudo language families. Empirical evaluations reveal that employing these pseudo language families enhances performance over conventional language families in adapting a multilingual translation model to unfamiliar language pairs. The proposed methodology may also be extended to scenarios requiring language similarity measurements. The source code and associated scripts can be accessed at https://github.com/ecoli-hit/PseudoFamily.
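
A minimal sketch of the overall recipe, under assumptions: estimate a diagonal empirical Fisher per language pair from squared gradients of the shared model, compare the resulting fingerprints by cosine similarity, and cluster to obtain pseudo language families. The synthetic gradient data and the agglomerative-clustering choice are stand-ins, not the paper's exact procedure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

def diagonal_fim(grad_samples):
    # Empirical diagonal Fisher: mean of squared per-sample gradients.
    return np.mean(np.square(grad_samples), axis=0)

# Hypothetical inputs: per-language-pair gradients of the shared model,
# shape (n_batches, n_params). Two synthetic "families" stress different
# halves of the parameter vector.
pairs = ["en-de", "en-nl", "en-hi", "en-bn"]
profiles = [np.r_[np.full(500, 3.0), np.ones(500)],
            np.r_[np.ones(500), np.full(500, 3.0)]]
fims = np.stack([diagonal_fim(rng.standard_normal((32, 1000)) * profiles[i // 2])
                 for i in range(len(pairs))])

# Cosine similarity between FIM fingerprints defines the distances.
norm = fims / np.linalg.norm(fims, axis=1, keepdims=True)
dist = squareform(1.0 - norm @ norm.T, checks=False)

# Agglomerative clustering yields the pseudo language families.
labels = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")
print(dict(zip(pairs, labels)))  # en-de/en-nl vs. en-hi/en-bn
```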

BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.02813
  • repo_url: None
  • paper_authors: Fengyuan Shi, Jiaxi Gu, Hang Xu, Songcen Xu, Wei Zhang, Limin Wang
  • for: proposes a training-free, text-guided general-purpose video synthesis framework, addressing the heavy memory and compute cost of training video foundation models, the poor temporal consistency of training-free image-to-video extensions, and their lack of task generality.
  • methods: the BIVDiff framework bridges a task-specific image diffusion model with a general text-to-video foundation diffusion model: an image diffusion model (e.g., ControlNet, Instruct Pix2Pix) generates the video frame by frame, Mixed Inversion is applied to the generated video for temporal smoothing, and the inverted latents are fed into the video diffusion model. Decoupling the image and video models allows flexible image-model selection for different purposes.
  • results: demonstrates the effectiveness and generality of BIVDiff on a wide range of video generation tasks, including controllable video generation, video editing, video inpainting, and video outpainting.
    Abstract Diffusion models have made tremendous progress in text-driven image and video generation. Now text-to-image foundation models are widely applied to various downstream image synthesis tasks, such as controllable image generation and image editing, while downstream video synthesis tasks are less explored for several reasons. First, it requires huge memory and compute overhead to train a video generation foundation model. Even with video foundation models, additional costly training is still required for downstream video synthesis tasks. Second, although some works extend image diffusion models into videos in a training-free manner, temporal consistency cannot be well kept. Finally, these adaptation methods are specifically designed for one task and fail to generalize to different downstream video synthesis tasks. To mitigate these issues, we propose a training-free general-purpose video synthesis framework, coined as BIVDiff, via bridging specific image diffusion models and general text-to-video foundation diffusion models. Specifically, we first use an image diffusion model (like ControlNet, Instruct Pix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally input the inverted latents into the video diffusion model for temporal smoothing. Decoupling image and video models enables flexible image model selection for different purposes, which endows the framework with strong task generalization and high efficiency. To validate the effectiveness and general use of BIVDiff, we perform a wide range of video generation tasks, including controllable video generation, video editing, video inpainting and outpainting. Our project page is available at https://bivdiff.github.io.

Leveraging Domain Adaptation and Data Augmentation to Improve Qur’anic IR in English and Arabic

  • paper_url: http://arxiv.org/abs/2312.02803
  • repo_url: None
  • paper_authors: Vera Pavlova
  • for: tackles Qur'anic information retrieval (IR) in both Arabic and English.
  • methods: applies state-of-the-art neural IR methods, training first on a large amount of general-domain data and then continuing on in-domain data, with a data augmentation technique to compensate for scarce in-domain examples; also takes preliminary steps toward compiling an Islamic corpus and pre-training a domain-specific language model (LM) for English as a shared retrieval backbone.
  • results: training with the domain-specific LM and in-domain data considerably improves MRR@10 and NDCG@5, setting a new state of the art in Qur'anic IR for both English and Arabic.
    Abstract In this work, we approach the problem of Qur'anic information retrieval (IR) in Arabic and English. Using the latest state-of-the-art methods in neural IR, we research what helps to tackle this task more efficiently. Training retrieval models requires a lot of data, which is difficult to obtain for training in-domain. Therefore, we commence with training on a large amount of general domain data and then continue training on in-domain data. To handle the lack of in-domain data, we employed a data augmentation technique, which considerably improved results in MRR@10 and NDCG@5 metrics, setting the state-of-the-art in Qur'anic IR for both English and Arabic. The absence of an Islamic corpus and domain-specific model for IR task in English motivated us to address this lack of resources and take preliminary steps of the Islamic corpus compilation and domain-specific language model (LM) pre-training, which helped to improve the performance of the retrieval models that use the domain-specific LM as the shared backbone. We examined several language models (LMs) in Arabic to select one that efficiently deals with the Qur'anic IR task. Besides transferring successful experiments from English to Arabic, we conducted additional experiments with retrieval task in Arabic to amortize the scarcity of general domain datasets used to train the retrieval models. Handling Qur'anic IR task combining English and Arabic allowed us to enhance the comparison and share valuable insights across models and languages.

PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-modal Features

  • paper_url: http://arxiv.org/abs/2312.02781
  • repo_url: None
  • paper_authors: Tianshun Han, Shengnan Gui, Yiqing Huang, Baihui Li, Lijian Liu, Benjia Zhou, Ning Jiang, Quan Lu, Ruicong Zhi, Yanyan Liang, Du Zhang, Jun Wan
  • for: improves the precision and coherence of speech-driven 3D facial animation by exploiting complementary visual and textual cues alongside audio.
  • methods: proposes the PMMTalk framework, which uses complementary Pseudo Multi-Modal features and consists of three modules: a PMMTalk encoder that employs an off-the-shelf talking-head generation architecture and speech recognition to extract visual and textual information from speech, a cross-modal alignment module that aligns audio-image-text features at the temporal and semantic levels, and a PMMTalk decoder that predicts lip-synced facial blendshape coefficients. Unlike prior methods it requires only one additional random reference face image, yet yields more accurate results, and it integrates seamlessly into standard animation production workflows.
  • results: outperforms prior work in accuracy according to extensive experiments and user studies; the authors also introduce a large-scale 3D Chinese Audio-Visual Facial Animation (3D-CAVFA) dataset to spur further research.
    Abstract Speech-driven 3D facial animation has improved a lot recently while most related works only utilize acoustic modality and neglect the influence of visual and textual cues, leading to unsatisfactory results in terms of precision and coherence. We argue that visual and textual cues are not trivial information. Therefore, we present a novel framework, namely PMMTalk, using complementary Pseudo Multi-Modal features for improving the accuracy of facial animation. The framework entails three modules: PMMTalk encoder, cross-modal alignment module, and PMMTalk decoder. Specifically, the PMMTalk encoder employs the off-the-shelf talking head generation architecture and speech recognition technology to extract visual and textual information from speech, respectively. Subsequently, the cross-modal alignment module aligns the audio-image-text features at temporal and semantic levels. Then PMMTalk decoder is employed to predict lip-syncing facial blendshape coefficients. Contrary to prior methods, PMMTalk only requires an additional random reference face image but yields more accurate results. Additionally, it is artist-friendly as it seamlessly integrates into standard animation production workflows by introducing facial blendshape coefficients. Finally, given the scarcity of 3D talking face datasets, we introduce a large-scale 3D Chinese Audio-Visual Facial Animation (3D-CAVFA) dataset. Extensive experiments and user studies show that our approach outperforms the state of the art. We recommend watching the supplementary video.

Towards the Inferrence of Structural Similarity of Combinatorial Landscapes

  • paper_url: http://arxiv.org/abs/2312.02720
  • repo_url: None
  • paper_authors: Mingyu Huang, Ke Li
  • for: investigates whether graph data mining on fitness landscapes can expose the latent topological structure of combinatorial optimization problems, enabling reasoning by analogy between problem instances.
  • methods: uses local optima networks as a proxy for fitness landscapes and applies graph data mining techniques for qualitative and quantitative analyses of structural similarity across problem types.
  • results: large-scale experiments on three classic combinatorial optimization problems provide concrete evidence of structural similarity between landscapes of the same problem class within neighboring dimensions, and probe the relationship between landscapes of different problem classes.
    Abstract One of the most common problem-solving heuristics is reasoning by analogy. For a given problem, a solver can be viewed as a strategic walk on its fitness landscape. Thus if a solver works for one problem instance, we expect it will also be effective for other instances whose fitness landscapes essentially share structural similarities with each other. However, due to the black-box nature of combinatorial optimization, it is far from trivial to infer such similarity in real-world scenarios. To bridge this gap, by using local optima networks as a proxy for fitness landscapes, this paper proposes leveraging graph data mining techniques to conduct qualitative and quantitative analyses that explore the latent topological structural information embedded in those landscapes. By conducting large-scale empirical experiments on three classic combinatorial optimization problems, we gain concrete evidence to support the existence of structural similarity between landscapes of the same classes within neighboring dimensions. We also interrogate the relationship between landscapes of different problem classes.
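
A local optima network can be built, in its simplest form, by hill-climbing from random starts and recording escape edges between the attracting local optima. The toy objective, best-improvement search, and two-bit escape perturbation below are illustrative assumptions, not the paper's exact construction.

```python
import random
import networkx as nx

random.seed(0)
N = 10
weights = [random.random() for _ in range(N)]

def fitness(x):
    # Toy multimodal objective over bitstrings: weighted bits + parity bonus
    # (flipping any single bit toggles the bonus, creating local optima).
    return sum(w * b for w, b in zip(weights, x)) + 0.5 * (sum(x) % 2)

def hill_climb(x):
    # Best-improvement local search to the attracting local optimum.
    while True:
        nbrs = [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(N)]
        best = max(nbrs, key=fitness)
        if fitness(best) <= fitness(x):
            return x
        x = best

lon = nx.DiGraph()
for _ in range(300):
    start = tuple(random.randint(0, 1) for _ in range(N))
    lo = hill_climb(start)
    lon.add_node(lo, fitness=fitness(lo))
    # Escape edge: perturb two bits of the optimum, then hill-climb again.
    i, j = random.sample(range(N), 2)
    y = list(lo); y[i] ^= 1; y[j] ^= 1
    lo2 = hill_climb(tuple(y))
    if lo2 != lo:
        lon.add_edge(lo, lo2)

print(f"LON: {lon.number_of_nodes()} local optima, "
      f"{lon.number_of_edges()} escape edges")
```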

Large Knowledge Model: Perspectives and Challenges

  • paper_url: http://arxiv.org/abs/2312.02706
  • repo_url: https://github.com/molyswu/hand_detection
  • paper_authors: Huajun Chen
  • for: examines large language models (LLMs) such as ChatGPT through the lens of knowledge.
  • methods: surveys how symbolic knowledge such as Knowledge Graphs (KGs) can enhance LLMs, and conversely how LLMs can amplify traditional symbolic knowledge bases, including using an LLM as a KG builder and controller, structured knowledge pre-training, and LLM-enhanced symbolic reasoning.
  • results: argues that LLMs can strengthen traditional symbolic knowledge bases and be used to build and control knowledge graphs, but, given the intricate nature of human knowledge, advocates creating larger "Large Knowledge Models" (LKM) engineered to manage a diversified spectrum of knowledge structures.
    Abstract Humankind's understanding of the world is fundamentally linked to our perception and cognition, with \emph{human languages} serving as one of the major carriers of \emph{world knowledge}. In this vein, \emph{Large Language Models} (LLMs) like ChatGPT epitomize the pre-training of extensive, sequence-based world knowledge into neural networks, facilitating the processing and manipulation of this knowledge in a parametric space. This article explores large models through the lens of ``knowledge''. We initially investigate the role of symbolic knowledge such as Knowledge Graphs (KGs) in enhancing LLMs, covering aspects like knowledge-augmented language model, structure-inducing pre-training, knowledgeable prompts, structured CoT, knowledge editing, semantic tools for LLM and knowledgeable AI agents. Subsequently, we examine how LLMs can amplify traditional symbolic knowledge bases, encompassing aspects like using LLM as KG builder and controller, structured knowledge pretraining, LLM-enhanced symbolic reasoning, and the amalgamation of perception with cognition. Considering the intricate nature of human knowledge, we advocate for the creation of \emph{Large Knowledge Models} (LKM), specifically engineered to manage diversified spectrum of knowledge structures. This ambitious undertaking could entail several key challenges, such as disentangling knowledge representation from language models, restructuring pre-training with structured knowledge, and building large commonsense models, among others. We finally propose a five-``A'' principle to distinguish the concept of LKM.

Unified learning-based lossy and lossless JPEG recompression

  • paper_url: http://arxiv.org/abs/2312.02705
  • repo_url: None
  • paper_authors: Jianghui Zhang, Yuanyuan Wang, Lina Guo, Jixiang Luo, Tongda Xu, Yan Wang, Zhi Wang, Hongwei Qin
  • for: improves the compression of existing JPEG images and bridges the gap between lossy and lossless recompression.
  • methods: a unified framework built from a learned quantization table and Markovian hierarchical variational autoencoders.
  • results: achieves arbitrarily low distortion as the bitrate approaches the upper bound given by the lossless compression model; to the authors' knowledge, this is the first learned method to bridge lossy and lossless JPEG recompression.
    Abstract JPEG is still the most widely used image compression algorithm. Most image compression algorithms only consider the uncompressed original image, while ignoring the large number of already existing JPEG images. Recently, JPEG recompression approaches have been proposed to further reduce the size of JPEG files. However, those methods only consider JPEG lossless recompression, which is just a special case of the rate-distortion theorem. In this paper, we propose a unified lossy and lossless JPEG recompression framework, which consists of a learned quantization table and Markovian hierarchical variational autoencoders. Experiments show that our method can achieve arbitrarily low distortion when the bitrate is close to the upper bound, namely the bitrate of the lossless compression model. To the best of our knowledge, this is the first learned method that bridges the gap between lossy and lossless recompression of JPEG images.
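
For intuition about the role of a (learned) quantization table, here is the lossy core of JPEG on a single 8x8 block, using the standard luminance table as a stand-in for a learned one. The paper's actual model, a Markovian hierarchical VAE over JPEG data, is far richer than this sketch.

```python
import numpy as np
from scipy.fft import dctn, idctn

# Standard JPEG luminance table, used here in place of a learned one.
Q = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61], [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56], [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77], [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101], [72, 92, 95, 98, 112, 100, 103,  99]],
    dtype=float)

def recompress_block(block, table, quality=1.0):
    # DCT -> quantize with the (scaled) table -> dequantize -> inverse DCT.
    coeffs = dctn(block - 128.0, norm="ortho")
    q = np.round(coeffs / (table * quality))
    return idctn(q * table * quality, norm="ortho") + 128.0

rng = np.random.default_rng(0)
block = rng.integers(0, 256, (8, 8)).astype(float)
rec = recompress_block(block, Q)
print(f"block MSE at quality=1.0: {np.mean((block - rec) ** 2):.1f}")
```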

Enhancing Vehicle Entrance and Parking Management: Deep Learning Solutions for Efficiency and Security

  • paper_url: http://arxiv.org/abs/2312.02699
  • repo_url: None
  • paper_authors: Muhammad Umer Ramzan, Usman Ali, Syed Haider Abbas Naqvi, Zeeshan Aslam, Tehseen, Husnain Ali, Muhammad Faheem
  • for: automates vehicle entrance and parking management in organizations, improving efficiency, security, and record keeping.
  • methods: uses state-of-the-art deep learning models to automate the entrance and parking process, integrating vehicle detection, license plate detection, and face detection and recognition to verify that both vehicle and person are registered with the organization.
  • results: the system detects entering vehicles quickly and accurately, streamlines record keeping, and optimizes parking slot allocation, improving convenience, accuracy, and security.
    Abstract The auto-management of vehicle entrance and parking in any organization is a complex challenge encompassing record-keeping, efficiency, and security concerns. Manual methods for tracking vehicles and finding parking spaces are slow and time-consuming. To solve the problem of auto-management of vehicle entrance and parking, we have utilized state-of-the-art deep learning models and automated the process of vehicle entrance and parking into any organization. To ensure security, our system integrates vehicle detection, license number plate verification, and face detection and recognition models to ensure that the person and vehicle are registered with the organization. We have trained multiple deep-learning models for vehicle detection, license number plate detection, face detection, and recognition; however, the YOLOv8n model outperformed all the other models. Furthermore, license plate recognition is facilitated by Google's Tesseract-OCR Engine. By integrating these technologies, the system offers efficient vehicle detection, precise identification, streamlined record keeping, and optimized parking slot allocation in buildings, thereby enhancing convenience, accuracy, and security. Future research opportunities lie in fine-tuning system performance for a wide range of real-world applications.
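
A hedged sketch of the detection-plus-OCR step the abstract describes, using the ultralytics YOLOv8 API and Tesseract via pytesseract. The plate-detector weights file is hypothetical, and the real system additionally runs vehicle and face models around this step.

```python
from PIL import Image
from ultralytics import YOLO
import pytesseract

plate_model = YOLO("plate_yolov8n.pt")   # hypothetical plate-detector weights

def read_plates(image_path):
    """Detect license plates in a gate-camera frame and OCR each crop."""
    img = Image.open(image_path)
    plates = []
    for box in plate_model(image_path)[0].boxes.xyxy.tolist():
        x1, y1, x2, y2 = map(int, box)
        crop = img.crop((x1, y1, x2, y2)).convert("L")  # grayscale helps OCR
        text = pytesseract.image_to_string(crop, config="--psm 7")
        plates.append(text.strip())
    return plates

print(read_plates("gate_camera.jpg"))
```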

Analyzing and Improving the Training Dynamics of Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.02696
  • repo_url: https://github.com/mmathew23/improved_edm
  • paper_authors: Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, Samuli Laine
  • for: improves the training dynamics of the popular ADM diffusion architecture for data-driven image generation, without altering its high-level structure.
  • methods: redesigns the network layers to preserve the expected magnitudes of activations, weights, and updates, eliminating the uncontrolled drifts and imbalances observed during training; additionally proposes a method for setting exponential moving average (EMA) parameters post-hoc, after the training run completes, allowing precise tuning of the EMA length without multiple runs.
  • results: improves the record FID for ImageNet-512 synthesis from 2.41 to 1.81, achieved with fast deterministic sampling and at equal computational complexity.
    Abstract Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper, we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture, without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training, we redesign the network layers to preserve activation, weight, and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling. As an independent contribution, we present a method for setting the exponential moving average (EMA) parameters post-hoc, i.e., after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs, and reveals its surprising interactions with network architecture, training time, and guidance.
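
A minimal sketch of the magnitude-preservation idea: a linear layer whose weight rows are normalized to unit norm at use, so unit-variance activations stay unit-variance in expectation as depth grows. This is a simplification of the paper's layer redesign, which also covers convolutions, activations, and update magnitudes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPLinear(nn.Module):
    """Rows of the weight are normalized to unit norm at use, so
    unit-variance inputs give unit-variance outputs in expectation."""
    def __init__(self, fan_in, fan_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(fan_out, fan_in))

    def forward(self, x):
        return F.linear(x, F.normalize(self.weight, dim=1))

torch.manual_seed(0)
x = torch.randn(4096, 512)                   # ~unit-variance activations
for layer in [MPLinear(512, 512) for _ in range(8)]:
    x = layer(x)
print(f"std after 8 layers: {x.std():.3f}")  # stays near 1, no drift
```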

H-GAP: Humanoid Control with a Generalist Planner

  • paper_url: http://arxiv.org/abs/2312.02682
  • repo_url: None
  • paper_authors: Zhengyao Jiang, Yingchen Xu, Nolan Wagener, Yicheng Luo, Michael Janner, Edward Grefenstette, Tim Rocktäschel, Yuandong Tian
  • for: proposes a humanoid control method built on human motion-capture data, enabling integration into human-centric infrastructures and physics-driven humanoid animation.
  • methods: trains H-GAP, a state-action trajectory generative model, on humanoid trajectory datasets such as MoCapAct, and uses it within Model Predictive Control (MPC) to handle the optimization problem over high-dimensional state and action spaces.
  • results: H-GAP learns to represent and generate a wide range of motor behaviors and flexibly transfers them to novel downstream control tasks via planning, outperforming or matching MPC baselines with ground-truth dynamics and offline RL methods trained per task.
    Abstract Humanoid control is an important research challenge offering avenues for integration into human-centric infrastructures and enabling physics-driven humanoid animations. The daunting challenges in this field stem from the difficulty of optimizing in high-dimensional action spaces and the instability introduced by the bipedal morphology of humanoids. However, the extensive collection of human motion-captured data and the derived datasets of humanoid trajectories, such as MoCapAct, paves the way to tackle these challenges. In this context, we present Humanoid Generalist Autoencoding Planner (H-GAP), a state-action trajectory generative model trained on humanoid trajectories derived from human motion-captured data, capable of adeptly handling downstream control tasks with Model Predictive Control (MPC). For a 56-degrees-of-freedom humanoid, we empirically demonstrate that H-GAP learns to represent and generate a wide range of motor behaviours. Further, without any learning from online interactions, it can also flexibly transfer these behaviors to solve novel downstream control tasks via planning. Notably, H-GAP outperforms established MPC baselines that have access to the ground truth dynamics model, and is superior or comparable to offline RL methods trained for individual tasks. Finally, we conduct a series of empirical studies on the scaling properties of H-GAP, showing the potential for performance gains via additional data but not computing. Code and videos are available at https://ycxuyingchen.github.io/hgap/.

Contact Energy Based Hindsight Experience Prioritization

  • paper_url: http://arxiv.org/abs/2312.02677
  • repo_url: None
  • paper_authors: Erdi Sayar, Zhenshan Bing, Carlo D’Eramo, Ozgur S. Oguz, Alois Knoll
  • for: addresses multi-goal robotic manipulation with sparse rewards, where reinforcement learning struggles to collect successful experiences efficiently.
  • methods: proposes Contact Energy Based Prioritization (CEBP), which samples from the replay buffer according to contact information from the touch sensors in the robot's gripper and object displacement, favoring contact-rich experiences that arguably carry the most information.
  • results: surpasses or matches state-of-the-art methods on several sparse-reward manipulation tasks; the trained policy, deployed on a real Franka robot, successfully completes a pick-and-place task. Videos and code: https://erdiphd.github.io/HER_force
    Abstract Multi-goal robot manipulation tasks with sparse rewards are difficult for reinforcement learning (RL) algorithms due to the inefficiency in collecting successful experiences. Recent algorithms such as Hindsight Experience Replay (HER) expedite learning by taking advantage of failed trajectories and replacing the desired goal with one of the achieved states so that any failed trajectory can be utilized as a contribution to learning. However, HER uniformly chooses failed trajectories, without taking into account which ones might be the most valuable for learning. In this paper, we address this problem and propose a novel approach Contact Energy Based Prioritization~(CEBP) to select the samples from the replay buffer based on rich information due to contact, leveraging the touch sensors in the gripper of the robot and object displacement. Our prioritization scheme favors sampling of contact-rich experiences, which are arguably the ones providing the largest amount of information. We evaluate our proposed approach on various sparse reward robotic tasks and compare them with the state-of-the-art methods. We show that our method surpasses or performs on par with those methods on robot manipulation tasks. Finally, we deploy the trained policy from our method to a real Franka robot for a pick-and-place task. We observe that the robot can solve the task successfully. The videos and code are publicly available at: https://erdiphd.github.io/HER_force
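
A sketch of the prioritization step under assumptions: each transition carries a scalar contact energy (e.g., touch-sensor force times object displacement), and sampling follows a softmax over those energies. The buffer interface and the temperature parameter are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

class ContactPrioritizedBuffer:
    def __init__(self, capacity):
        self.transitions, self.energies = [], []
        self.capacity = capacity

    def add(self, transition, contact_energy):
        # contact_energy: e.g. gripper force readings x object displacement.
        self.transitions.append(transition)
        self.energies.append(contact_energy)
        self.transitions = self.transitions[-self.capacity:]
        self.energies = self.energies[-self.capacity:]

    def sample(self, batch_size, temperature=1.0):
        # Favor contact-rich experiences via a softmax over energies.
        e = np.asarray(self.energies) / temperature
        p = np.exp(e - e.max()); p /= p.sum()
        idx = rng.choice(len(self.transitions), size=batch_size, p=p)
        return [self.transitions[i] for i in idx]

buf = ContactPrioritizedBuffer(capacity=10000)
for t in range(1000):
    buf.add(transition=t, contact_energy=rng.exponential(1.0))
batch = buf.sample(32)   # skewed toward high-contact transitions
```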

Amortized Bayesian Decision Making for simulation-based models

  • paper_url: http://arxiv.org/abs/2312.02674
  • repo_url: https://github.com/mackelab/amortized-decision-making
  • paper_authors: Mila Gorecki, Jakob H. Macke, Michael Deistler
  • for: studies Bayesian decision making for simulation-based models, circumventing the need to compute an explicit approximation of the posterior as in standard simulation-based inference (SBI).
  • methods: trains a neural network on simulated data to predict the expected cost of any action given the data, so the lowest-cost action can be inferred directly.
  • results: induces costs similar to those obtained from the true posterior on several benchmark problems, and infers low-cost actions after few simulations in a real-world medical-neuroscience simulator, the Bayesian Virtual Epileptic Patient.
    Abstract Simulation-based inference (SBI) provides a powerful framework for inferring posterior distributions of stochastic simulators in a wide range of domains. In many settings, however, the posterior distribution is not the end goal itself -- rather, the derived parameter values and their uncertainties are used as a basis for deciding what actions to take. Unfortunately, because posterior distributions provided by SBI are (potentially crude) approximations of the true posterior, the resulting decisions can be suboptimal. Here, we address the question of how to perform Bayesian decision making on stochastic simulators, and how one can circumvent the need to compute an explicit approximation to the posterior. Our method trains a neural network on simulated data and can predict the expected cost given any data and action, and can, thus, be directly used to infer the action with lowest cost. We apply our method to several benchmark problems and demonstrate that it induces similar cost as the true posterior distribution. We then apply the method to infer optimal actions in a real-world simulator in the medical neurosciences, the Bayesian Virtual Epileptic Patient, and demonstrate that it allows to infer actions associated with low cost after few simulations.
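
A toy end-to-end version of the idea: simulate (parameter, observation, action, cost) tuples, regress expected cost with a small network, and pick the action with the lowest predicted cost for new data. The simulator and cost function here are stand-ins for a real scientific simulator.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def simulate(theta):                      # toy simulator: noisy observation
    return theta + 0.1 * torch.randn_like(theta)

def cost(theta, action):                  # toy cost: squared action error
    return (theta - action) ** 2

# Training data: (observation, action, incurred cost) triples.
theta = torch.rand(20000, 1)
obs, act = simulate(theta), torch.rand(20000, 1)
c = cost(theta, act)

net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    pred = net(torch.cat([obs, act], dim=1))
    loss = ((pred - c) ** 2).mean()
    loss.backward(); opt.step()

# Decision: score a grid of actions for a new observation, take the argmin.
x_new = simulate(torch.tensor([[0.7]]))
grid = torch.linspace(0, 1, 101).unsqueeze(1)
costs = net(torch.cat([x_new.expand(101, 1), grid], dim=1))
print(f"chosen action: {grid[costs.argmin()].item():.2f}")  # near 0.7
```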

Lights out: training RL agents robust to temporary blindness

  • paper_url: http://arxiv.org/abs/2312.02665
  • repo_url: None
  • paper_authors: N. Ordonez, M. Tromp, P. M. Julbe, W. Böhmer
  • for: makes Deep Q-Network (DQN) agents robust to temporary blindness, i.e., observations that change or go missing entirely.
  • methods: combines a neural network architecture that maintains hidden representations of observations with a novel n-step loss function.
  • results: the agent withstands blindness stretches longer than those seen during training, demonstrating robustness to temporary blindness.
    Abstract Agents trained with DQN rely on an observation at each timestep to decide what action to take next. However, in real world applications observations can change or be missing entirely. Examples of this could be a light bulb breaking down, or the wallpaper in a certain room changing. While these situations change the actual observation, the underlying optimal policy does not change. Because of this we want our agent to continue taking actions until it receives a (recognized) observation again. To achieve this we introduce a combination of a neural network architecture that uses hidden representations of the observations and a novel n-step loss function. Our implementation is able to withstand location based blindness stretches longer than the ones it was trained on, and therefore shows robustness to temporary blindness. For access to our implementation, please email Nathan, Marije, or Pau.
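
The abstract does not spell out the novel n-step loss, so as background here is the standard n-step TD target it presumably builds on; the blindness-specific handling of hidden observation representations is not shown, and the trajectory data is invented.

```python
import numpy as np

def n_step_targets(rewards, q_next, dones, gamma=0.99, n=5):
    """Standard n-step TD targets for a trajectory of length T:
    G_t = r_t + g*r_{t+1} + ... + g^{n-1}*r_{t+n-1} + g^n * max_a Q(s_{t+n}, a).
    q_next[i] holds max_a Q(s_i, a). During a blind stretch the agent keeps
    acting from its hidden state, and multi-step targets still propagate
    reward information across the stretch."""
    T = len(rewards)
    targets = np.zeros(T)
    for t in range(T):
        G, discount = 0.0, 1.0
        for k in range(t, min(t + n, T)):
            G += discount * rewards[k]
            discount *= gamma
            if dones[k]:
                break
        else:
            if t + n < T:                 # bootstrap unless truncated
                G += discount * q_next[t + n]
        targets[t] = G
    return targets

rewards = np.array([0, 0, 1, 0, 0, 0, 1, 0], dtype=float)
q_next = np.full(8, 0.5)
dones = np.zeros(8, dtype=bool)
print(np.round(n_step_targets(rewards, q_next, dones), 3))
```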

FaceStudio: Put Your Face Everywhere in Seconds

  • paper_url: http://arxiv.org/abs/2312.02663
  • repo_url: None
  • paper_authors: Yuxuan Yan, Chi Zhang, Rui Wang, Yichao Zhou, Gege Zhang, Pei Cheng, Gang Yu, Bin Fu
  • for: studies identity-preserving image synthesis, generating personalized, stylized images while maintaining the subject's identity.
  • methods: uses a direct feed-forward mechanism that avoids the intensive fine-tuning and multiple reference images required by methods such as Textual Inversion and DreamBooth, with a hybrid guidance framework that combines stylized images, facial images, and text prompts to steer generation; this enables applications such as artistic portraits and identity-blended images.
  • results: qualitative and quantitative evaluations show clear advantages over baseline models and prior work, particularly in efficiency and in preserving the subject's identity with high fidelity.
    Abstract This study investigates identity-preserving image synthesis, an intriguing task in image generation that seeks to maintain a subject's identity while adding a personalized, stylistic touch. Traditional methods, such as Textual Inversion and DreamBooth, have made strides in custom image creation, but they come with significant drawbacks. These include the need for extensive resources and time for fine-tuning, as well as the requirement for multiple reference images. To overcome these challenges, our research introduces a novel approach to identity-preserving synthesis, with a particular focus on human images. Our model leverages a direct feed-forward mechanism, circumventing the need for intensive fine-tuning, thereby facilitating quick and efficient image generation. Central to our innovation is a hybrid guidance framework, which combines stylized images, facial images, and textual prompts to guide the image generation process. This unique combination enables our model to produce a variety of applications, such as artistic portraits and identity-blended images. Our experimental results, including both qualitative and quantitative evaluations, demonstrate the superiority of our method over existing baseline models and previous works, particularly in its remarkable efficiency and ability to preserve the subject's identity with high fidelity.

Supervised learning of spatial features with STDP and homeostasis using Spiking Neural Networks on SpiNNaker

  • paper_url: http://arxiv.org/abs/2312.02659
  • repo_url: None
  • paper_authors: Sergio Davies, Andrew Gait, Andrew Rowley, Alessandro Di Nuovo
  • for: performs supervised learning in Spiking Neural Networks (SNNs) so they can recognize spatial patterns.
  • methods: combines Spike Timing Dependent Plasticity (STDP) with homeostasis, tested on the SpiNNaker digital architecture.
  • results: with a single trained pattern the network behaves as the ideal detector, achieving 100% accuracy; as more patterns are trained on one network, recognition accuracy depends on the similarity between the patterns. The approach could apply to static image recognition or traffic analysis in computer networks, where each packet represents a spatial pattern, and the homeostatic factor may let the network detect patterns with some degree of similarity rather than only exact matches.
    Abstract Artificial Neural Networks (ANN) have gained large popularity thanks to their ability to learn using the well-known backpropagation algorithm. On the other hand, Spiking Neural Networks (SNNs), despite having wider abilities than ANNs, have always presented a challenge in the training phase. This paper shows a new method to perform supervised learning on SNNs, using Spike Timing Dependent Plasticity (STDP) and homeostasis, aiming at training the network to identify spatial patterns. The method is tested using the SpiNNaker digital architecture. A SNN is trained to recognise one or multiple patterns and performance metrics are extracted to measure the performance of the network. Some considerations are drawn from the results showing that, in the case of a single trained pattern, the network behaves as the ideal detector, with 100% accuracy in detecting the trained pattern. However, as the number of trained patterns on a single network increases, the accuracy of the identification is linked to the similarities between these patterns. This method of training an SNN to detect spatial patterns may be applied on pattern recognition in static images or traffic analysis in computer networks, where each network packet represents a spatial pattern. It will be stipulated that the homeostatic factor may enable the network to detect patterns with some degree of similarities, rather than only perfectly matching patterns.
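
A sketch of the two learning ingredients named in the abstract: a pair-based STDP update with exponential windows, plus a toy homeostatic scaling toward a target firing rate. Amplitudes and time constants are illustrative, not the paper's SpiNNaker configuration.

```python
import numpy as np

A_PLUS, A_MINUS = 0.01, 0.012      # potentiation / depression amplitudes
TAU_PLUS, TAU_MINUS = 20.0, 20.0   # STDP time constants (ms)

def stdp_dw(t_pre, t_post):
    # Pair-based STDP: pre-before-post potentiates, post-before-pre depresses.
    dt = t_post - t_pre
    if dt >= 0:
        return A_PLUS * np.exp(-dt / TAU_PLUS)
    return -A_MINUS * np.exp(dt / TAU_MINUS)

def homeostasis(w, rate, target_rate=10.0, eta=1e-3):
    # Scale the weight toward a target firing rate, which keeps the neuron
    # responsive to similar (not only identical) input patterns.
    return w * (1.0 + eta * (target_rate - rate))

w = 0.5
for t_pre, t_post in [(10, 15), (30, 28), (50, 52)]:  # spike-time pairs (ms)
    w = np.clip(w + stdp_dw(t_pre, t_post), 0.0, 1.0)
    w = homeostasis(w, rate=8.0)
print(f"final weight: {w:.4f}")
```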

How should the advent of large language models affect the practice of science?

  • paper_url: http://arxiv.org/abs/2312.03759
  • repo_url: None
  • paper_authors: Marcel Binz, Stephan Alaniz, Adina Roskies, Balazs Aczel, Carl T. Bergstrom, Colin Allen, Daniel Schad, Dirk Wulff, Jevin D. West, Qiong Zhang, Richard M. Shiffrin, Samuel J. Gershman, Ven Popov, Emily M. Bender, Marco Marelli, Matthew M. Botvinick, Zeynep Akata, Eric Schulz
  • for: asks how the advent of large language models (LLMs) should affect the practice of science.
  • methods: invites four diverse groups of scientists to reflect on this question, share their perspectives, and engage in debate, with each group also responding to the others.
  • results: Schulz et al. argue that working with LLMs is not fundamentally different from working with human collaborators; Bender et al. argue that LLMs are often misused and over-hyped, and that more specialized, easily interpretable tools deserve focus; Marelli et al. emphasize transparent attribution and responsible use of LLMs; Botvinick and Gershman advocate that humans should retain responsibility for determining the scientific roadmap.
    Abstract Large language models (LLMs) are being increasingly incorporated into scientific workflows. However, we have yet to fully grasp the implications of this integration. How should the advent of large language models affect the practice of science? For this opinion piece, we have invited four diverse groups of scientists to reflect on this query, sharing their perspectives and engaging in debate. Schulz et al. make the argument that working with LLMs is not fundamentally different from working with human collaborators, while Bender et al. argue that LLMs are often misused and over-hyped, and that their limitations warrant a focus on more specialized, easily interpretable tools. Marelli et al. emphasize the importance of transparent attribution and responsible use of LLMs. Finally, Botvinick and Gershman advocate that humans should retain responsibility for determining the scientific roadmap. To facilitate the discussion, the four perspectives are complemented with a response from each group. By putting these different perspectives in conversation, we aim to bring attention to important considerations within the academic community regarding the adoption of LLMs and their impact on both current and future scientific practices.

SAMSGL: Series-Aligned Multi-Scale Graph Learning for Spatio-Temporal Forecasting

  • paper_url: http://arxiv.org/abs/2312.02646
  • repo_url: None
  • paper_authors: Xiaobei Zou, Luolin Xiong, Yang Tang, Jurgen Kurths
  • for: improves spatio-temporal forecasting in domains such as traffic and weather prediction, where time-delayed propagation dynamics and high-dimensional interactions among nodes limit accuracy.
  • methods: the Series-Aligned Multi-Scale Graph Learning (SAMSGL) framework: a series-aligned graph convolution layer aggregates non-delayed graph signals to mitigate the influence of time delays, while a multi-scale graph learning architecture, comprising multi-scale graph structure learning (a global graph for delayed and non-delayed node embeddings plus a local graph for neighborhood-driven node variations) and graph-fully connected (Graph-FC) blocks, fuses global and local spatio-temporal interactions.
  • results: experiments on meteorological and traffic forecasting datasets demonstrate its effectiveness and superiority over other methods.
    Abstract Spatio-temporal forecasting in various domains, like traffic prediction and weather forecasting, is a challenging endeavor, primarily due to the difficulties in modeling propagation dynamics and capturing high-dimensional interactions among nodes. Despite the significant strides made by graph-based networks in spatio-temporal forecasting, there remain two pivotal factors closely related to forecasting performance that need further consideration: time delays in propagation dynamics and multi-scale high-dimensional interactions. In this work, we present a Series-Aligned Multi-Scale Graph Learning (SAMSGL) framework, aiming to enhance forecasting performance. In order to handle time delays in spatial interactions, we propose a series-aligned graph convolution layer to facilitate the aggregation of non-delayed graph signals, thereby mitigating the influence of time delays for the improvement in accuracy. To understand global and local spatio-temporal interactions, we develop a spatio-temporal architecture via multi-scale graph learning, which encompasses two essential components: multi-scale graph structure learning and graph-fully connected (Graph-FC) blocks. The multi-scale graph structure learning includes a global graph structure to learn both delayed and non-delayed node embeddings, as well as a local one to learn node variations influenced by neighboring factors. The Graph-FC blocks synergistically fuse spatial and temporal information to boost prediction accuracy. To evaluate the performance of SAMSGL, we conduct experiments on meteorological and traffic forecasting datasets, which demonstrate its effectiveness and superiority.

On the Initialization of Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2312.02622
  • repo_url: https://github.com/lspongebobjh/virgo_icml2023
  • paper_authors: Jiahang Li, Yakun Song, Xiang Song, David Paul Wipf
  • for: This paper focuses on improving the initialization of graph neural networks (GNNs) to reduce the variance of forward and backward propagation and improve model performance.
  • methods: The proposed method, Virgo, analyzes the variance of forward and backward propagation across GNN layers and derives a new initialization that accounts for the activation function, hidden dimension, graph structure, and message passing.
  • results: Virgo leads to superior model performance and more stable variance at initialization on node classification, link prediction, and graph classification tasks, as demonstrated through comprehensive experiments on 15 datasets.
    Abstract Graph Neural Networks (GNNs) have displayed considerable promise in graph representation learning across various applications. The core learning process requires the initialization of model weight matrices within each GNN layer, which is typically accomplished via classic initialization methods such as Xavier initialization. However, these methods were originally motivated to stabilize the variance of hidden embeddings and gradients across layers of Feedforward Neural Networks (FNNs) and Convolutional Neural Networks (CNNs) to avoid vanishing gradients and maintain steady information flow. In contrast, within the GNN context classical initializations disregard the impact of the input graph structure and message passing on variance. In this paper, we analyze the variance of forward and backward propagation across GNN layers and show that the variance instability of GNN initializations comes from the combined effect of the activation function, hidden dimension, graph structure and message passing. To better account for these influence factors, we propose a new initialization method for Variance Instability Reduction within GNN Optimization (Virgo), which naturally tends to equate forward and backward variances across successive layers. We conduct comprehensive experiments on 15 datasets to show that Virgo can lead to superior model performance and more stable variance at initialization on node classification, link prediction and graph classification tasks. Codes are in https://github.com/LspongebobJH/virgo_icml2023.

Panoptica – instance-wise evaluation of 3D semantic and instance segmentation maps

  • paper_url: http://arxiv.org/abs/2312.02608
  • repo_url: https://github.com/brainlesion/panoptica
  • paper_authors: Florian Kofler, Hendrik Möller, Josef A. Buchner, Ezequiel de la Rosa, Ivan Ezhov, Marcel Rosier, Isra Mekki, Suprosanna Shit, Moritz Negwer, Rami Al-Maskari, Ali Ertürk, Shankeeth Vinayahalingam, Fabian Isensee, Sarthak Pati, Daniel Rueckert, Jan S. Kirschke, Stefan K. Ehrlich, Annika Reinke, Bjoern Menze, Benedikt Wiestler, Marie Piraud
  • for: computes instance-wise segmentation quality metrics from 2D and 3D segmentation maps.
  • methods: panoptica, a modular, performance-optimized open-source Python package with a three-step metrics-computation process, complementing the original intersection-over-union-based panoptic quality with further metrics such as the Average Symmetric Surface Distance.
  • results: demonstrated on several real-world biomedical datasets where instance-wise evaluation is instrumental for accurately representing the underlying clinical task; comprehensive documentation and tutorials accompany the package.
    Abstract This paper introduces panoptica, a versatile and performance-optimized package designed for computing instance-wise segmentation quality metrics from 2D and 3D segmentation maps. panoptica addresses the limitations of existing metrics and provides a modular framework that complements the original intersection over union-based panoptic quality with other metrics, such as the distance metric Average Symmetric Surface Distance. The package is open-source, implemented in Python, and accompanied by comprehensive documentation and tutorials. panoptica employs a three-step metrics computation process to cover diverse use cases. The efficacy of panoptica is demonstrated on various real-world biomedical datasets, where an instance-wise evaluation is instrumental for an accurate representation of the underlying clinical task. Overall, we envision panoptica as a valuable tool facilitating in-depth evaluation of segmentation methods.
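
The core of instance-wise evaluation is instance matching followed by panoptic quality; a self-contained version for 2D label maps is sketched below (matches at IoU > 0.5 are provably unique, so the greedy pass is exact). panoptica itself wraps this in a modular three-step process and adds further metrics and 3D support; this is not its API.

```python
import numpy as np

def panoptic_quality(ref, pred):
    """Instance-wise PQ for integer label maps (0 = background)."""
    ref_ids = [i for i in np.unique(ref) if i != 0]
    pred_ids = [i for i in np.unique(pred) if i != 0]
    tp, iou_sum, matched = 0, 0.0, set()
    for r in ref_ids:
        for p in pred_ids:
            if p in matched:
                continue
            inter = np.logical_and(ref == r, pred == p).sum()
            union = np.logical_or(ref == r, pred == p).sum()
            if union and inter / union > 0.5:   # unique match by definition
                tp += 1; iou_sum += inter / union; matched.add(p)
                break
    fp, fn = len(pred_ids) - tp, len(ref_ids) - tp
    return iou_sum / (tp + 0.5 * fp + 0.5 * fn) if (tp + fp + fn) else 1.0

ref = np.array([[1, 1, 0], [0, 2, 2], [0, 2, 2]])
pred = np.array([[1, 1, 0], [0, 0, 3], [0, 3, 3]])
print(f"PQ = {panoptic_quality(ref, pred):.3f}")   # (1.0 + 0.75) / 2 = 0.875
```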

Impact of Tokenization on LLaMa Russian Adaptation

  • paper_url: http://arxiv.org/abs/2312.02598
  • repo_url: None
  • paper_authors: Mikhail Tikhomirov, Daniil Chernyshev
  • for: addresses the performance degradation of instruction-tuned large language models (LLMs) on non-English input, which stems from inefficient tokenization of languages under-represented in pre-training data.
  • methods: investigates three variants of vocabulary substitution for LLaMa Russian-language adaptation, tested with Saiga instruction-tuning and fine-tuning on the Russian SuperGLUE benchmark.
  • results: automatic evaluation shows that vocabulary substitution not only improves model quality in Russian but also accelerates fine-tuning (35%) and inference (up to 60%) while reducing memory consumption; human evaluation shows the models with Russian-adapted vocabulary generate answers with higher user preference than the original Saiga-LLaMa model.
    Abstract Latest instruction-tuned large language models (LLM) show great results on various tasks, however, they often face performance degradation for non-English input. There is evidence that the reason lies in inefficient tokenization caused by low language representation in pre-training data which hinders the comprehension of non-English instructions, limiting the potential of target language instruction-tuning. In this work we investigate the possibility of addressing the issue with vocabulary substitution in the context of LLaMa Russian language adaptation. We explore three variants of vocabulary adaptation and test their performance on Saiga instruction-tuning and fine-tuning on Russian Super Glue benchmark. The results of automatic evaluation show that vocabulary substitution not only improves the model's quality in Russian but also accelerates fine-tuning (35%) and inference (up to 60%) while reducing memory consumption. Additional human evaluation of the instruction-tuned models demonstrates that models with Russian-adapted vocabulary generate answers with higher user preference than the original Saiga-LLaMa model.
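
The reported speedups come from the adapted vocabulary emitting fewer tokens per Russian word. A quick way to check this "fertility" with Hugging Face tokenizers is sketched below; the adapted-tokenizer path is a placeholder, and the base checkpoint is one common public LLaMa mirror, not necessarily the one the authors used.

```python
from transformers import AutoTokenizer

# Placeholder checkpoints: the original LLaMa tokenizer vs. a tokenizer
# with a Russian-adapted vocabulary (substitute your own paths).
base = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
adapted = AutoTokenizer.from_pretrained("path/to/russian-adapted-tokenizer")

text = "Съешь же ещё этих мягких французских булок, да выпей чаю."

def fertility(tok, text):
    # Tokens per whitespace word: lower means cheaper tuning and inference.
    return len(tok.tokenize(text)) / len(text.split())

print(f"base:    {fertility(base, text):.2f} tokens/word")
print(f"adapted: {fertility(adapted, text):.2f} tokens/word")
```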

UTBoost: A Tree-boosting based System for Uplift Modeling

  • paper_url: http://arxiv.org/abs/2312.02573
  • repo_url: https://github.com/jd-opensource/utboost
  • paper_authors: Junjie Gao, Xiangyu Zheng, DongDong Wang, Zhixiang Huang, Bangqi Zheng, Kai Yang
  • for: proposes two new methods based on the Gradient Boosting Decision Trees (GBDT) algorithm for uplift modeling, i.e., estimating the net effect of an action on customer outcomes.
  • methods: one is an additive learning approach based on a Sequential Additive Model (SAM), the other a multi-objective learning approach based on Huber regression; both learn the causal effect sequentially and address its counterfactual nature, innovating on the ensemble learning method and the learning objective, respectively.
  • results: experiments show that both methods estimate customer uplift accurately and frequently yield remarkable improvements over base models; the authors also release UTBoost, a dedicated end-to-end tree-boosting system for uplift modeling optimized for training speed.
    Abstract Uplift modeling refers to the set of machine learning techniques that a manager may use to estimate customer uplift, that is, the net effect of an action on some customer outcome. By identifying the subset of customers for whom a treatment will have the greatest effect, uplift models assist decision-makers in optimizing resource allocations and maximizing overall returns. Accurately estimating customer uplift poses practical challenges, as it requires assessing the difference between two mutually exclusive outcomes for each individual. In this paper, we propose two innovative adaptations of the well-established Gradient Boosting Decision Trees (GBDT) algorithm, which learn the causal effect in a sequential way and overcome the counter-factual nature. Both approaches innovate existing techniques in terms of ensemble learning method and learning objectives, respectively. Experiments on large-scale datasets demonstrate the usefulness of the proposed methods, which often yielding remarkable improvements over base models. To facilitate the application, we develop the UTBoost, an end-to-end tree boosting system specifically designed for uplift modeling. The package is open source and has been optimized for training speed to meet the needs of real industrial applications.
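
For context, the classic two-model uplift baseline that tree-boosting systems like UTBoost aim to improve on fits separate outcome models on treated and control groups and scores uplift as the difference of their predictions. The sketch below shows that baseline with scikit-learn GBDTs on synthetic data; it is not the paper's modified GBDT algorithms.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 5000
X = rng.standard_normal((n, 5))
t = rng.integers(0, 2, n)                  # randomized treatment assignment
true_uplift = 0.5 * (X[:, 0] > 0)          # only some customers respond
y = 0.3 * X[:, 1] + t * true_uplift + rng.normal(0, 0.1, n)

# Two-model baseline: one outcome model per arm, uplift = difference.
m_t = GradientBoostingRegressor().fit(X[t == 1], y[t == 1])
m_c = GradientBoostingRegressor().fit(X[t == 0], y[t == 0])
uplift = m_t.predict(X) - m_c.predict(X)

print(f"mean predicted uplift, responders:     {uplift[X[:, 0] > 0].mean():.2f}")
print(f"mean predicted uplift, non-responders: {uplift[X[:, 0] <= 0].mean():.2f}")
```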

Structured World Representations in Maze-Solving Transformers

  • paper_url: http://arxiv.org/abs/2312.02566
  • repo_url: https://github.com/understanding-search/structured-representations-maze-transformers
  • paper_authors: Michael Igorevich Ivanitskiy, Alex F. Spies, Tilman Räuker, Guillaume Corlouer, Chris Mathwin, Lucia Quirke, Can Rager, Rusheb Shah, Dan Valentine, Cecilia Diniz Behn, Katsumi Inoue, Samy Wu Fung
  • for: aims to understand the internal behavior of small transformer models in the tractable setting of maze solving.
  • methods: trains transformers to solve mazes and examines the abstractions they form, finding consistently emerging structured internal representations of maze topology and valid paths.
  • results: the residual stream of a single token can be linearly decoded to faithfully reconstruct the entire maze; the learned token embeddings have spatial structure; and attention heads dubbed "adjacency heads" are implicated in finding valid subsequent tokens along a path.
    Abstract Transformer models underpin many recent advances in practical machine learning applications, yet understanding their internal behavior continues to elude researchers. Given the size and complexity of these models, forming a comprehensive picture of their inner workings remains a significant challenge. To this end, we set out to understand small transformer models in a more tractable setting: that of solving mazes. In this work, we focus on the abstractions formed by these models and find evidence for the consistent emergence of structured internal representations of maze topology and valid paths. We demonstrate this by showing that the residual stream of only a single token can be linearly decoded to faithfully reconstruct the entire maze. We also find that the learned embeddings of individual tokens have spatial structure. Furthermore, we take steps towards deciphering the circuity of path-following by identifying attention heads (dubbed $\textit{adjacency heads}$), which are implicated in finding valid subsequent tokens.
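
The linear-decodability claim can be tested with a probe: regress each maze edge's connectivity on the residual-stream activation of a single token. The sketch below substitutes synthetic activations and labels for real cached transformer activations, so it only illustrates the probing protocol, not the paper's experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_mazes, d_model, n_edges = 2000, 256, 40

# Hypothetical cached data: one residual-stream vector per maze, plus
# binary connectivity labels for each lattice edge of the maze.
W_true = rng.standard_normal((n_edges, d_model)) / np.sqrt(d_model)
resid = rng.standard_normal((n_mazes, d_model))
edges = (resid @ W_true.T + 0.3 * rng.standard_normal((n_mazes, n_edges))) > 0

train, test = slice(0, 1500), slice(1500, None)
accs = []
for e in range(n_edges):
    # One linear probe per edge: can a linear map read the maze off
    # a single token's residual stream?
    probe = LogisticRegression(max_iter=1000).fit(resid[train], edges[train, e])
    accs.append(probe.score(resid[test], edges[test, e]))
print(f"mean linear-probe accuracy over edges: {np.mean(accs):.2f}")
```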
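As an illustration of the decoding experiment, a minimal linear-probe sketch: fit one linear classifier per maze edge on single-token residual-stream activations. The arrays are random stand-ins; the paper's actual probe setup may differ.

```python
# Linear-probe sketch: given residual-stream activations of a single token
# (shape [n_mazes, d_model]), fit one classifier per maze edge to predict
# whether that edge is open, then decode a whole maze from one activation.
import numpy as np
from sklearn.linear_model import LogisticRegression

n_mazes, d_model, n_edges = 500, 128, 40
rng = np.random.default_rng(0)
resid = rng.normal(size=(n_mazes, d_model))          # residual stream of one token
edges = rng.integers(0, 2, size=(n_mazes, n_edges))  # ground-truth maze connectivity

probes = [LogisticRegression(max_iter=1000).fit(resid, edges[:, e])
          for e in range(n_edges)]

# Reconstruct a maze from a single token's activation:
decoded = np.array([p.predict(resid[:1])[0] for p in probes])
print("decoded connectivity:", decoded)
```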

Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction

  • paper_url: http://arxiv.org/abs/2312.03025
  • repo_url: None
  • paper_authors: Zilin Du, Haoxin Li, Xu Guo, Boyang Li
  • for: Multimodal relation extraction is constrained by the scarcity of training data. This work considers a novel setting in which only unimodal data, either text or images, are available during training, and aims to train a multimodal classifier from synthetic data that performs well on real multimodal test data.
  • methods: Proposes MI2RAGE, which applies Chained Cross-modal Generation (CCG) to promote diversity in the generated data and exploits a teacher network to select training samples with high mutual information with the ground-truth labels.
  • results: Compared to direct training on synthetic data, the method improves F1 by 24.06% with synthetic text and 26.42% with synthetic images. The best model, trained entirely on synthetic images, outperforms prior state-of-the-art models trained on real multimodal data by 3.76% F1.
    Abstract The task of multimodal relation extraction has attracted significant research attention, but progress is constrained by the scarcity of available training data. One natural thought is to extend existing datasets with cross-modal generative models. In this paper, we consider a novel problem setting, where only unimodal data, either text or image, are available during training. We aim to train a multimodal classifier from synthetic data that performs well on real multimodal test data. However, training with synthetic data suffers from two obstacles: lack of data diversity and label information loss. To alleviate the issues, we propose Mutual Information-aware Multimodal Iterated Relational dAta GEneration (MI2RAGE), which applies Chained Cross-modal Generation (CCG) to promote diversity in the generated data and exploits a teacher network to select valuable training samples with high mutual information with the ground-truth labels. Comparing our method to direct training on synthetic data, we observed a significant improvement of 24.06% F1 with synthetic text and 26.42% F1 with synthetic images. Notably, our best model trained on completely synthetic images outperforms prior state-of-the-art models trained on real multimodal data by a margin of 3.76% in F1. Our codebase will be made available upon acceptance.
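A hedged sketch of the teacher-guided selection step: score each synthetic sample by a pointwise mutual-information proxy under a teacher network and keep the top fraction. The PMI proxy and function names are assumptions, not the paper's exact criterion.

```python
# Teacher-guided selection of synthetic samples: keep generated samples whose
# (pointwise) mutual information with the intended label is high.
import numpy as np

def select_by_pmi(teacher_probs, labels, keep_ratio=0.5):
    """teacher_probs: [n, n_classes] softmax outputs of a teacher network.
    labels: [n] intended ground-truth labels of the synthetic samples.
    Scores each sample by log p(y|x) - log p(y), a pointwise-MI proxy,
    and keeps the top fraction."""
    p_y = teacher_probs.mean(axis=0)                      # marginal label prob
    p_y_given_x = teacher_probs[np.arange(len(labels)), labels]
    pmi = np.log(p_y_given_x + 1e-12) - np.log(p_y[labels] + 1e-12)
    k = int(len(labels) * keep_ratio)
    return np.argsort(-pmi)[:k]                           # indices of kept samples

probs = np.random.default_rng(0).dirichlet(np.ones(4), size=100)
labels = np.random.default_rng(1).integers(0, 4, size=100)
print(select_by_pmi(probs, labels)[:10])
```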

DanZero+: Dominating the GuanDan Game through Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2312.02561
  • repo_url: https://github.com/submit-paper/Danzero_plus
  • paper_authors: Youpeng Zhao, Yudong Lu, Jian Zhao, Wengang Zhou, Houqiang Li
  • for: The goal is to develop an AI program for GuanDan, an exceptionally complex and popular card game.
  • methods: Uses Deep Monte Carlo (DMC) with a distributed training framework to build DanZero, and further applies a policy-based reinforcement learning algorithm, bootstrapped from a pre-trained model, to strengthen the agent.
  • results: Evaluation against heuristic rule-based AI programs highlights the outstanding performance of the DanZero bot, and the policy-based agent initialized from the pre-trained model achieves superior performance while easing training.
    Abstract The utilization of artificial intelligence (AI) in card games has been a well-explored subject within AI research for an extensive period. Recent advancements have propelled AI programs to showcase expertise in intricate card games such as Mahjong, DouDizhu, and Texas Hold'em. In this work, we aim to develop an AI program for an exceptionally complex and popular card game called GuanDan. This game involves four players engaging in both competitive and cooperative play throughout a long process to upgrade their level, posing great challenges for AI due to its expansive state and action space, long episode length, and complex rules. Employing reinforcement learning techniques, specifically Deep Monte Carlo (DMC), and a distributed training framework, we first put forward an AI program named DanZero for this game. Evaluation against baseline AI programs based on heuristic rules highlights the outstanding performance of our bot. Besides, in order to further enhance the AI's capabilities, we apply policy-based reinforcement learning algorithm to GuanDan. To address the challenges arising from the huge action space, which will significantly impact the performance of policy-based algorithms, we adopt the pre-trained model to facilitate the training process and the achieved AI program manages to achieve a superior performance.
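For reference, a minimal Deep Monte Carlo (DMC) update as used in card-game agents: regress Q(state, action) toward full-episode Monte Carlo returns. The environment interface and network below are placeholders, not DanZero's actual architecture.

```python
# Minimal DMC sketch: fit Q(s, a) by regression on Monte Carlo returns
# computed over a complete episode (long episodes suit card games like GuanDan).
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))
    def forward(self, sa):          # sa: concatenated state-action features
        return self.mlp(sa).squeeze(-1)

def dmc_update(qnet, opt, episode, gamma=1.0):
    """episode: list of (state_action_tensor, reward) for one full game."""
    returns, g = [], 0.0
    for _, r in reversed(episode):  # Monte Carlo return from each step
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    sa = torch.stack([s for s, _ in episode])
    target = torch.tensor(returns, dtype=torch.float32)
    loss = nn.functional.mse_loss(qnet(sa), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

qnet = QNet(dim=16)
opt = torch.optim.Adam(qnet.parameters(), lr=1e-3)
fake_episode = [(torch.randn(16), 0.0) for _ in range(19)] + [(torch.randn(16), 1.0)]
print(dmc_update(qnet, opt, fake_episode))
```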

Beyond Isolation: Multi-Agent Synergy for Improving Knowledge Graph Construction

  • paper_url: http://arxiv.org/abs/2312.03022
  • repo_url: None
  • paper_authors: Hongbin Ye, Honghao Gui, Aijia Zhang, Tong Liu, Wei Hua, Weiqiang Jia
  • for: This paper addresses knowledge graph construction (KGC), a multifaceted undertaking involving entity, relation, and event extraction.
  • methods: Proposes CooperKGC, a framework that establishes a collaborative processing network, assembling a KGC collaboration team whose agents concurrently address entity, relation, and event extraction.
  • results: Experiments show that collaboration and information interaction among diverse agents yield better results than isolated processing, and that the collaboration enhances knowledge selection, correction, and aggregation across multiple rounds of interaction.
    Abstract Knowledge graph construction (KGC) is a multifaceted undertaking involving the extraction of entities, relations, and events. Traditionally, large language models (LLMs) have been viewed as solitary task-solving agents in this complex landscape. However, this paper challenges this paradigm by introducing a novel framework, CooperKGC. Departing from the conventional approach, CooperKGC establishes a collaborative processing network, assembling a KGC collaboration team capable of concurrently addressing entity, relation, and event extraction tasks. Our experiments unequivocally demonstrate that fostering collaboration and information interaction among diverse agents within CooperKGC yields superior results compared to individual cognitive processes operating in isolation. Importantly, our findings reveal that the collaboration facilitated by CooperKGC enhances knowledge selection, correction, and aggregation capabilities across multiple rounds of interactions.

Graph Information Bottleneck for Remote Sensing Segmentation

  • paper_url: http://arxiv.org/abs/2312.02545
  • repo_url: None
  • paper_authors: Yuntao Shou, Wei Ai, Tao Meng
  • for: This work aims to improve the accuracy and efficiency of remote sensing image segmentation, particularly for irregular objects.
  • methods: Treats images as graph structures and introduces a simple contrastive vision GNN (SC-ViG) architecture with node-masked and edge-masked graph views that adaptively learn whether to mask nodes and edges. Information bottleneck theory is introduced into graph contrastive learning to maximize task-relevant information while minimizing task-irrelevant redundant information.
  • results: On publicly available real datasets, the method outperforms state-of-the-art remote sensing image segmentation methods on both segmentation and classification tasks.
    Abstract Remote sensing segmentation has a wide range of applications in environmental protection, and urban change detection, etc. Despite the success of deep learning-based remote sensing segmentation methods (e.g., CNN and Transformer), they are not flexible enough to model irregular objects. In addition, existing graph contrastive learning methods usually adopt the way of maximizing mutual information to keep the node representations consistent between different graph views, which may cause the model to learn task-independent redundant information. To tackle the above problems, this paper treats images as graph structures and introduces a simple contrastive vision GNN (SC-ViG) architecture for remote sensing segmentation. Specifically, we construct a node-masked and edge-masked graph view to obtain an optimal graph structure representation, which can adaptively learn whether to mask nodes and edges. Furthermore, this paper innovatively introduces information bottleneck theory into graph contrastive learning to maximize task-related information while minimizing task-independent redundant information. Finally, we replace the convolutional module in UNet with the SC-ViG module to complete the segmentation and classification tasks of remote sensing images. Extensive experiments on publicly available real datasets demonstrate that our method outperforms state-of-the-art remote sensing image segmentation methods.
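A minimal sketch of combining graph contrastive learning with an information bottleneck: an InfoNCE term aligns node embeddings from two masked views while a KL term compresses task-irrelevant information. The exact SC-ViG objective may differ; this only illustrates the two competing terms.

```python
# IB-regularized graph contrastive loss: align node embeddings of a node-masked
# view and an edge-masked view, while a KL term penalizes excess information.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.2):
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                  # [n, n] similarity matrix
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

def ib_graph_loss(mu, logvar, z_other, beta=1e-3):
    """mu, logvar: Gaussian posterior over embeddings from one masked view;
    z_other: embeddings from the other masked view."""
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
    return info_nce(z, z_other) + beta * kl

mu, logvar = torch.randn(32, 64), torch.zeros(32, 64)
z_other = torch.randn(32, 64)
print(ib_graph_loss(mu, logvar, z_other))
```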

PolyFit: A Peg-in-hole Assembly Framework for Unseen Polygon Shapes via Sim-to-real Adaptation

  • paper_url: http://arxiv.org/abs/2312.02531
  • repo_url: None
  • paper_authors: Geonhyup Lee, Joosoon Lee, Sangjun Noh, Minhwan Ko, Kangmin Kim, Kyoobin Lee
  • for: This study addresses peg-in-hole assembly, a foundational and challenging robotics task in which sensor inaccuracies and mechanical errors cause insertion failures or jamming.
  • methods: Proposes PolyFit, a Force/Torque (F/T)-based supervised learning framework for 5-DoF peg-in-hole assembly. PolyFit uses F/T data for accurate extrinsic pose estimation and adjusts the peg pose to rectify misalignments, aided by a multi-point contact strategy and a sim-to-real adaptation method.
  • results: Trained extensively in simulation on a dataset covering diverse peg-hole shapes, extrinsic poses, and corresponding contact F/T readings, PolyFit achieves success rates of 97.3% and 96.3% on seen and unseen shapes in simulation, and 86.7% and 85.0% in the real world, demonstrating robustness and adaptability.
    Abstract The study addresses the foundational and challenging task of peg-in-hole assembly in robotics, where misalignments caused by sensor inaccuracies and mechanical errors often result in insertion failures or jamming. This research introduces PolyFit, representing a paradigm shift by transitioning from a reinforcement learning approach to a supervised learning methodology. PolyFit is a Force/Torque (F/T)-based supervised learning framework designed for 5-DoF peg-in-hole assembly. It utilizes F/T data for accurate extrinsic pose estimation and adjusts the peg pose to rectify misalignments. Extensive training in a simulated environment involves a dataset encompassing a diverse range of peg-hole shapes, extrinsic poses, and their corresponding contact F/T readings. To enhance extrinsic pose estimation, a multi-point contact strategy is integrated into the model input, recognizing that identical F/T readings can indicate different poses. The study proposes a sim-to-real adaptation method for real-world application, using a sim-real paired dataset to enable effective generalization to complex and unseen polygon shapes. PolyFit achieves impressive peg-in-hole success rates of 97.3% and 96.3% for seen and unseen shapes in simulations, respectively. Real-world evaluations further demonstrate substantial success rates of 86.7% and 85.0%, highlighting the robustness and adaptability of the proposed method.
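A minimal sketch of the supervised F/T-to-pose idea: a network maps stacked multi-point contact force/torque readings to an extrinsic pose error, which is subtracted from the commanded peg pose. Shapes and the 5-DoF parameterization below are illustrative assumptions.

```python
# F/T-based pose correction sketch: regress pose error from several contact
# readings (multi-point contact strategy), then adjust the peg pose.
import torch
import torch.nn as nn

class FTPoseNet(nn.Module):
    def __init__(self, n_contacts=3, ft_dim=6, pose_dim=5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_contacts * ft_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, pose_dim),           # predicted extrinsic pose error
        )
    def forward(self, ft_readings):             # [batch, n_contacts, 6]
        return self.mlp(ft_readings.flatten(1))

net = FTPoseNet()
ft = torch.randn(8, 3, 6)                       # simulated multi-point F/T data
pose_error = net(ft)
corrected_pose = torch.zeros(8, 5) - pose_error # adjust peg pose to fix misalignment
print(corrected_pose.shape)
```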

MEMTO: Memory-guided Transformer for Multivariate Time Series Anomaly Detection

  • paper_url: http://arxiv.org/abs/2312.02530
  • repo_url: https://github.com/gunny97/MEMTO
  • paper_authors: Junho Song, Keonwoo Kim, Jeonglyul Oh, Sungzoon Cho
  • for: This paper proposes a memory-guided Transformer for detecting anomalies in real-world multivariate time series.
  • methods: Combines a reconstruction-based Transformer with a novel memory module that learns the degree to which each memory item should be updated in response to the input data, and uses K-means clustering to initialize memory items within a two-phase training paradigm that stabilizes training.
  • results: On five real-world datasets from diverse domains, the method achieves an average anomaly detection F1-score of 95.74%, significantly outperforming previous state-of-the-art methods.
    Abstract Detecting anomalies in real-world multivariate time series data is challenging due to complex temporal dependencies and inter-variable correlations. Recently, reconstruction-based deep models have been widely used to solve the problem. However, these methods still suffer from an over-generalization issue and fail to deliver consistently high performance. To address this issue, we propose the MEMTO, a memory-guided Transformer using a reconstruction-based approach. It is designed to incorporate a novel memory module that can learn the degree to which each memory item should be updated in response to the input data. To stabilize the training procedure, we use a two-phase training paradigm which involves using K-means clustering for initializing memory items. Additionally, we introduce a bi-dimensional deviation-based detection criterion that calculates anomaly scores considering both input space and latent space. We evaluate our proposed method on five real-world datasets from diverse domains, and it achieves an average anomaly detection F1-score of 95.74%, significantly outperforming the previous state-of-the-art methods. We also conduct extensive experiments to empirically validate the effectiveness of our proposed model's key components.
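Two of MEMTO's ingredients can be sketched directly: K-means initialization of memory items over encoded windows, and a bi-dimensional anomaly score combining input-space reconstruction error with latent-space deviation from the nearest memory item. The multiplicative fusion below is an assumption for illustration.

```python
# Sketch of K-means memory initialization and a bi-dimensional anomaly score.
import numpy as np
from sklearn.cluster import KMeans

def init_memory(latents, n_items=10):
    """latents: [n_windows, d] encoder outputs; memory = K-means centroids."""
    return KMeans(n_clusters=n_items, n_init=10).fit(latents).cluster_centers_

def anomaly_score(x, x_hat, z, memory):
    recon = np.linalg.norm(x - x_hat, axis=-1)                # input-space deviation
    d = np.linalg.norm(z[:, None, :] - memory[None], axis=-1)
    latent = d.min(axis=1)                                    # latent-space deviation
    return recon * latent                                     # assumed fusion rule

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 16))
mem = init_memory(z, n_items=8)
scores = anomaly_score(rng.normal(size=(200, 32)), rng.normal(size=(200, 32)), z, mem)
print(scores[:5])
```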

MASP: Scalable GNN-based Planning for Multi-Agent Navigation

  • paper_url: http://arxiv.org/abs/2312.02522
  • repo_url: None
  • paper_authors: Xinyi Yang, Xinting Yang, Chao Yu, Jiayu Chen, Huazhong Yang, Yu Wang
  • for: Addresses cooperative multi-agent navigation, where multiple agents must reach initially unassigned targets within a limited time.
  • methods: Combines reinforcement learning with a goal-conditioned hierarchical planning structure and uses graph neural networks (GNNs) to model interactions between agents and goals.
  • results: MASP outperforms classical planning-based methods and RL baselines, achieving a nearly 100% success rate with minimal training data in multi-agent particle environments (MPE), and the learned policy generalizes zero-shot to unseen team sizes.
    Abstract We investigate decentralized multi-agent navigation tasks, where multiple agents need to reach initially unassigned targets in a limited time. Classical planning-based methods suffer from expensive computation overhead at each step and offer limited expressiveness for complex cooperation strategies. In contrast, reinforcement learning (RL) has recently become a popular paradigm for addressing this issue. However, RL struggles with low data efficiency and cooperation when directly exploring (nearly) optimal policies in the large search space, especially with an increased agent number (e.g., 10+ agents) or in complex environments (e.g., 3D simulators). In this paper, we propose Multi-Agent Scalable GNN-based P lanner (MASP), a goal-conditioned hierarchical planner for navigation tasks with a substantial number of agents. MASP adopts a hierarchical framework to divide a large search space into multiple smaller spaces, thereby reducing the space complexity and accelerating training convergence. We also leverage graph neural networks (GNN) to model the interaction between agents and goals, improving goal achievement. Besides, to enhance generalization capabilities in scenarios with unseen team sizes, we divide agents into multiple groups, each with a previously trained number of agents. The results demonstrate that MASP outperforms classical planning-based competitors and RL baselines, achieving a nearly 100% success rate with minimal training data in both multi-agent particle environments (MPE) with 50 agents and a quadrotor 3-dimensional environment (OmniDrones) with 20 agents. Furthermore, the learned policy showcases zero-shot generalization across unseen team sizes.

Retrieving Conditions from Reference Images for Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.02521
  • repo_url: None
  • paper_authors: Haoran Tang, Xin Zhou, Jieren Deng, Zhihong Pan, Hao Tian, Pratik Chaudhari
  • for: This work aims to improve the versatility of subject-driven image generation for broader applications.
  • methods: Builds on diffusion-based subject-driven generation and introduces RetriBooru-V1, an anime figures dataset with enhanced identity and clothing labels, together with an RAG-inspired baseline that retrieves precise conditional information from reference images.
  • results: New tasks enabled by the dataset are defined, along with a new diversity metric that quantifies the flexibility of image generation; comparisons with current methods, baseline results on the new tasks, and ablation studies demonstrate the capability of the proposed method.
    Abstract Recent diffusion-based subject driven generative methods have enabled image generations with good fidelity for specific objects or human portraits. However, to achieve better versatility for applications, we argue that not only improved datasets and evaluations are desired, but also more careful methods to retrieve only relevant information from conditional images are anticipated. To this end, we propose an anime figures dataset RetriBooru-V1, with enhanced identity and clothing labels. We state new tasks enabled by this dataset, and introduce a new diversity metric to measure success in completing these tasks, quantifying the flexibility of image generations. We establish an RAG-inspired baseline method, designed to retrieve precise conditional information from reference images. Then, we compare with current methods on existing task to demonstrate the capability of the proposed method. Finally, we provide baseline experiment results on new tasks, and conduct ablation studies on the possible structural choices.

Creative Agents: Empowering Agents with Imagination for Creative Tasks

  • paper_url: http://arxiv.org/abs/2312.02519
  • repo_url: https://github.com/pku-rl/creative-agents
  • paper_authors: Chi Zhang, Penglin Cai, Yuhui Fu, Haoqi Yuan, Zongqing Lu
  • for: This work aims to build embodied agents with creativity for open-ended creative tasks; existing methods build diverse instruction-following task solvers, but none demonstrates creativity.
  • methods: Proposes a class of solutions in which the controller is enhanced with an imaginator that converts abstract language instructions into detailed imagined task outcomes. The imaginator is implemented with either a large language model for textual imagination or a diffusion model for visual imagination.
  • results: Detailed experiments in Minecraft show that creative agents can create diverse buildings in survival mode given free-form language instructions; novel GPT-4V-based evaluation metrics for open-ended creative tasks are also proposed.
    Abstract We study building embodied agents for open-ended creative tasks. While existing methods build instruction-following agents that can perform diverse open-ended tasks, none of them demonstrates creativity -- the ability to give novel and diverse task solutions implicit in the language instructions. This limitation comes from their inability to convert abstract language instructions into concrete task goals in the environment and perform long-horizon planning for such complicated goals. Given the observation that humans perform creative tasks with the help of imagination, we propose a class of solutions for creative agents, where the controller is enhanced with an imaginator that generates detailed imaginations of task outcomes conditioned on language instructions. We introduce several approaches to implementing the components of creative agents. We implement the imaginator with either a large language model for textual imagination or a diffusion model for visual imagination. The controller can either be a behavior-cloning policy learned from data or a pre-trained foundation model generating executable codes in the environment. We benchmark creative tasks with the challenging open-world game Minecraft, where the agents are asked to create diverse buildings given free-form language instructions. In addition, we propose novel evaluation metrics for open-ended creative tasks utilizing GPT-4V, which holds many advantages over existing metrics. We perform a detailed experimental analysis of creative agents, showing that creative agents are the first AI agents accomplishing diverse building creation in the survival mode of Minecraft. Our benchmark and models are open-source for future research on creative agents (https://github.com/PKU-RL/Creative-Agents).

Simplifying Neural Network Training Under Class Imbalance

  • paper_url: http://arxiv.org/abs/2312.02517
  • repo_url: https://github.com/ravidziv/simplifyingimbalancedtraining
  • paper_authors: Ravid Shwartz-Ziv, Micah Goldblum, Yucen Lily Li, C. Bayan Bruss, Andrew Gordon Wilson
  • for: This study examines how components of standard deep learning pipelines can be tuned to improve performance on class-imbalanced real-world datasets.
  • methods: Tunes existing pipeline components, including batch size, data augmentation, optimizer, and label smoothing, to suit class-imbalanced settings.
  • results: Tuning standard components alone achieves state-of-the-art performance under class imbalance, without any specialized class-imbalance methods; the paper also provides key prescriptions for training under imbalance and an understanding of why imbalance methods succeed or fail.
    Abstract Real-world datasets are often highly class-imbalanced, which can adversely impact the performance of deep learning models. The majority of research on training neural networks under class imbalance has focused on specialized loss functions, sampling techniques, or two-stage training procedures. Notably, we demonstrate that simply tuning existing components of standard deep learning pipelines, such as the batch size, data augmentation, optimizer, and label smoothing, can achieve state-of-the-art performance without any such specialized class imbalance methods. We also provide key prescriptions and considerations for training under class imbalance, and an understanding of why imbalance methods succeed or fail.
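One of the tuned components, label smoothing, fits in a few lines: soften the one-hot targets so the model is less over-confident. The smoothing value below is illustrative, not the paper's prescription.

```python
# Label-smoothed cross entropy: spread probability eps uniformly over classes.
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, eps=0.1):
    logp = F.log_softmax(logits, dim=-1)
    # (1 - eps) on the true class, eps spread uniformly over all classes
    loss = -(1 - eps) * logp.gather(1, targets[:, None]).squeeze(1) \
           - eps * logp.mean(dim=-1)
    return loss.mean()

logits = torch.randn(4, 10)
targets = torch.tensor([0, 3, 9, 1])
print(smoothed_cross_entropy(logits, targets))
# Equivalent built-in: F.cross_entropy(logits, targets, label_smoothing=0.1)
```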

ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU

  • paper_url: http://arxiv.org/abs/2312.02515
  • repo_url: https://github.com/TUDB-Labs/multi-lora-fine-tune
  • paper_authors: Zhengmao Ye, Dengchun Li, Jingqi Tian, Tingfeng Lan, Jie Zuo, Lei Duan, Hui Lu, Yexi Jiang, Jian Sha, Ke Zhang, Mingjie Tang
  • for: This paper aims to improve the efficiency of fine-tuning large language models (LLMs), especially when many fine-tuning jobs run concurrently.
  • methods: Uses Low-Rank Adaptation (LoRA) with a shared pre-trained model and adaptive scheduling to fine-tune multiple jobs efficiently on a single GPU.
  • results: ASPEN saves 53% of GPU memory when training multiple LLaMA-7B models on an NVIDIA A100 80GB GPU and boosts training throughput by about 17% over existing methods; the adaptive scheduling algorithm reduces turnaround time by 24% and end-to-end training latency by 12%, prioritizing jobs and preventing out-of-memory issues.
    Abstract Transformer-based large language models (LLMs) have demonstrated outstanding performance across diverse domains, particularly when fine-turned for specific domains. Recent studies suggest that the resources required for fine-tuning LLMs can be economized through parameter-efficient methods such as Low-Rank Adaptation (LoRA). While LoRA effectively reduces computational burdens and resource demands, it currently supports only a single-job fine-tuning setup. In this paper, we present ASPEN, a high-throughput framework for fine-tuning LLMs. ASPEN efficiently trains multiple jobs on a single GPU using the LoRA method, leveraging shared pre-trained model and adaptive scheduling. ASPEN is compatible with transformer-based language models like LLaMA and ChatGLM, etc. Experiments show that ASPEN saves 53% of GPU memory when training multiple LLaMA-7B models on NVIDIA A100 80GB GPU and boosts training throughput by about 17% compared to existing methods when training with various pre-trained models on different GPUs. The adaptive scheduling algorithm reduces turnaround time by 24%, end-to-end training latency by 12%, prioritizing jobs and preventing out-of-memory issues.
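A conceptual sketch of the multi-job LoRA idea: one frozen base weight is shared while each job owns only its small low-rank A/B matrices, so several jobs can be batched through a single forward pass. ASPEN's kernel-level batching and scheduler are more sophisticated; the module below is an assumption-laden toy.

```python
# Several LoRA adapters sharing one frozen base linear layer, batched per-sample.
import torch
import torch.nn as nn

class MultiLoRALinear(nn.Module):
    def __init__(self, d_in, d_out, n_jobs, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)          # frozen shared LLM weight
        self.A = nn.Parameter(torch.randn(n_jobs, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_jobs, d_out, rank))
        self.scale = alpha / rank

    def forward(self, x, job_ids):                      # x: [batch, d_in]
        y = self.base(x)                                # shared computation
        A, B = self.A[job_ids], self.B[job_ids]         # per-sample adapters
        delta = torch.einsum('bor,bri,bi->bo', B, A, x)
        return y + self.scale * delta

layer = MultiLoRALinear(d_in=64, d_out=64, n_jobs=4)
x = torch.randn(8, 64)
jobs = torch.randint(0, 4, (8,))
print(layer(x, jobs).shape)                             # torch.Size([8, 64])
```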

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

  • paper_url: http://arxiv.org/abs/2312.02512
  • repo_url: None
  • paper_authors: Jeongsoo Choi, Se Jin Park, Minsu Kim, Yong Man Ro
  • for: This paper proposes a direct audio-visual speech to audio-visual speech translation (AV2AV) framework, enabling real-like multilingual virtual meetings in which translated speech is presented with synchronized lip movements.
  • methods: Learns unified audio-visual speech representations through self-supervised learning, trains the translation system on audio-only A2A data, and introduces an AV-Renderer with zero-shot speaker modeling that generates raw audio and video in parallel while maintaining the source speaker's voice in the translated output.
  • results: Extensive experiments in a many-to-many language translation setting demonstrate effective translation, robustness to acoustic noise, and preservation of speaker characteristics across languages.
    Abstract This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where the input and output of the system are multimodal (i.e., audio and visual speech). With the proposed AV2AV, two key advantages can be brought: 1) We can perform real-like conversations with individuals worldwide in a virtual meeting by utilizing our own primary languages. In contrast to Speech-to-Speech Translation (A2A), which solely translates between audio modalities, the proposed AV2AV directly translates between audio-visual speech. This capability enhances the dialogue experience by presenting synchronized lip movements along with the translated speech. 2) We can improve the robustness of the spoken language translation system. By employing the complementary information of audio-visual speech, the system can effectively translate spoken language even in the presence of acoustic noise, showcasing robust performance. To mitigate the problem of the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with the audio-only dataset of A2A. This is done by learning unified audio-visual speech representations through self-supervised learning in advance to train the translation system. Moreover, we propose an AV-Renderer that can generate raw audio and video in parallel. It is designed with zero-shot speaker modeling, thus the speaker in source audio-visual speech can be maintained at the target translated audio-visual speech. The effectiveness of AV2AV is evaluated with extensive experiments in a many-to-many language translation setting. The demo page is available on https://choijeongsoo.github.io/av2av.

Visual Hindsight Self-Imitation Learning for Interactive Navigation

  • paper_url: http://arxiv.org/abs/2312.03446
  • repo_url: None
  • paper_authors: Kibeom Kim, Kisung Shin, Min Whoo Lee, Moonhoen Lee, Minsu Lee, Byoung-Tak Zhang
  • for: This paper targets sample efficiency in interactive visual navigation tasks, enabling agents to learn and complete these tasks with fewer samples.
  • methods: Proposes Visual Hindsight Self-Imitation Learning (VHS), which improves sample efficiency through hindsight goal re-labeling and self-imitation, together with a prototypical goal embedding derived from experienced goal observations that is especially effective in vision-based, partially observable environments.
  • results: Experiments show that VHS outperforms existing techniques on interactive visual navigation tasks, confirming its superior performance and sample efficiency.
    Abstract Interactive visual navigation tasks, which involve following instructions to reach and interact with specific targets, are challenging not only because successful experiences are very rare but also because the complex visual inputs require a substantial number of samples. Previous methods for these tasks often rely on intricately designed dense rewards or the use of expensive expert data for imitation learning. To tackle these challenges, we propose a novel approach, Visual Hindsight Self-Imitation Learning (VHS) for enhancing sample efficiency through hindsight goal re-labeling and self-imitation. We also introduce a prototypical goal embedding method derived from experienced goal observations, that is particularly effective in vision-based and partially observable environments. This embedding technique allows the agent to visually reinterpret its unsuccessful attempts, enabling vision-based goal re-labeling and self-imitation from enhanced successful experiences. Experimental results show that VHS outperforms existing techniques in interactive visual navigation tasks, confirming its superior performance and sample efficiency.
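A minimal hindsight-relabeling sketch in the spirit of VHS: a failed episode is relabeled with the goal (embedding) the agent actually reached, turning the trajectory into a demonstration for self-imitation. The goal-embedding function stands in for the prototypical embedding.

```python
# Hindsight relabeling: treat the achieved final state as the intended goal,
# producing (obs, goal, action) triples for self-imitation / behavior cloning.
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    obs: list
    action: int

def embed_goal(obs):
    # Placeholder for VHS's prototypical goal embedding of an observation.
    return tuple(round(v, 2) for v in obs)

def hindsight_relabel(episode: List[Step]):
    """Treat the final achieved state as if it had been the intended goal."""
    achieved = embed_goal(episode[-1].obs)
    return [(s.obs, achieved, s.action) for s in episode]

episode = [Step([0.1, 0.2], 1), Step([0.3, 0.4], 0), Step([0.9, 0.9], 2)]
for obs, goal, act in hindsight_relabel(episode):
    print(obs, goal, act)
```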

Inspecting Model Fairness in Ultrasound Segmentation Tasks

  • paper_url: http://arxiv.org/abs/2312.02501
  • repo_url: None
  • paper_authors: Zikang Xu, Fenghe Tang, Quan Quan, Jianrui Ding, Chunping Ning, S. Kevin Zhou
  • for: This paper assesses the fairness of deep learning (DL) segmentation models across subgroups characterized by different sensitive attributes.
  • methods: Inspects a series of state-of-the-art DL segmentation models on two ultrasound datasets.
  • results: Even state-of-the-art DL algorithms exhibit unfair behavior on ultrasound segmentation tasks. These results serve as a warning, underscoring the need for careful model evaluation before real-world deployment to ensure ethical considerations and mitigate risks to patient outcomes.
    Abstract With the rapid expansion of machine learning and deep learning (DL), researchers are increasingly employing learning-based algorithms to alleviate diagnostic challenges across diverse medical tasks and applications. While advancements in diagnostic precision are notable, some researchers have identified a concerning trend: their models exhibit biased performance across subgroups characterized by different sensitive attributes. This bias not only infringes upon the rights of patients but also has the potential to lead to life-altering consequences. In this paper, we inspect a series of DL segmentation models using two ultrasound datasets, aiming to assess the presence of model unfairness in these specific tasks. Our findings reveal that even state-of-the-art DL algorithms demonstrate unfair behavior in ultrasound segmentation tasks. These results serve as a crucial warning, underscoring the necessity for careful model evaluation before their deployment in real-world scenarios. Such assessments are imperative to ensure ethical considerations and mitigate the risk of adverse impacts on patient outcomes.
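The kind of inspection described can be sketched as computing a segmentation metric per sensitive-attribute subgroup and reporting the worst-case gap; attribute names and the gap definition below are illustrative.

```python
# Per-subgroup Dice scores and their max-min gap as a simple fairness check.
import numpy as np

def dice(pred, gt, eps=1e-8):
    inter = np.logical_and(pred, gt).sum()
    return (2 * inter + eps) / (pred.sum() + gt.sum() + eps)

def subgroup_dice_gap(preds, gts, attrs):
    """preds, gts: lists of binary masks; attrs: subgroup label per case."""
    scores = {}
    for a in set(attrs):
        idx = [i for i, v in enumerate(attrs) if v == a]
        scores[a] = float(np.mean([dice(preds[i], gts[i]) for i in idx]))
    gap = max(scores.values()) - min(scores.values())
    return scores, gap

rng = np.random.default_rng(0)
preds = [rng.integers(0, 2, (32, 32)).astype(bool) for _ in range(40)]
gts = [rng.integers(0, 2, (32, 32)).astype(bool) for _ in range(40)]
attrs = ["groupA" if i % 2 else "groupB" for i in range(40)]
print(subgroup_dice_gap(preds, gts, attrs))
```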

MKA: A Scalable Medical Knowledge Assisted Mechanism for Generative Models on Medical Conversation Tasks

  • paper_url: http://arxiv.org/abs/2312.02496
  • repo_url: https://github.com/liangke23/knowledge_assisted_medical_dialogue_generation_mechanism
  • paper_authors: Ke Liang, Sifan Wu, Jiayi Gu
  • for: This work aims to make patient diagnosis more convenient and efficient through medical chatbots, a typical application of healthcare AI.
  • methods: Uses neural generative models as the chatbot core and proposes a scalable Medical Knowledge Assisted mechanism (MKA), comprising a medical knowledge graph with six types of medical information (department, drug, check, symptom, disease, food) and a token concatenation policy that injects medical information into the input data.
  • results: On the MedDG and MedDialog-CN datasets, models combined with MKA outperform the original methods on multiple automatic evaluation metrics, and MKA-Bert-GPT achieves state-of-the-art performance.
    Abstract Using natural language processing (NLP) technologies to develop medical chatbots makes the diagnosis of the patient more convenient and efficient, which is a typical application in healthcare AI. Because of its importance, a great deal of research has been conducted. Recently, neural generative models have shown impressive ability as the core of chatbots, yet they cannot scale well when directly applied to medical conversation due to the lack of medical-specific knowledge. To address the limitation, a scalable Medical Knowledge Assisted mechanism, MKA, is proposed in this paper. The mechanism aims to assist general neural generative models to achieve better performance on the medical conversation task. The medical-specific knowledge graph is designed within the mechanism, which contains 6 types of medical-related information, including department, drug, check, symptom, disease, food. Besides, the specific token concatenation policy is defined to effectively inject medical information into the input data. Evaluation of our method is carried out on two typical medical datasets, MedDG and MedDialog-CN. The evaluation results demonstrate that models combined with our mechanism outperform original methods in multiple automatic evaluation metrics. Besides, MKA-Bert-GPT achieves state-of-the-art performance. The open-sourced code is public: https://github.com/LIANGKE23/Knowledge_Assisted_Medical_Dialogue_Generation_Mechanism

Flexible Communication for Optimal Distributed Learning over Unpredictable Networks

  • paper_url: http://arxiv.org/abs/2312.02493
  • repo_url: None
  • paper_authors: Sahil Tyagi, Martin Swany
  • for: This paper aims to improve the efficiency of distributed deep learning training by reducing communication overhead and accelerating the training process.
  • methods: Proposes an Allreduce (AR)-compatible Topk compressor that is bandwidth-optimal, plus a flexible communication strategy that switches between AG and AR based on the current network configuration. The pareto-relationship between parallel and statistical efficiency is modeled as a multi-objective optimization (MOO) problem to dynamically adjust the compression ratio and accelerate training.
  • results: The proposed method retains the high accuracy of DenseSGD at lower communication cost, dynamically adjusts the compression ratio and collective operation to balance parallel and statistical efficiency, and outperforms AG in certain network configurations across various deep learning models.
    Abstract Gradient compression alleviates expensive communication in distributed deep learning by sending fewer values and its corresponding indices, typically via Allgather (AG). Training with high compression ratio (CR) achieves high accuracy like DenseSGD, but has lower parallel scaling due to high communication cost (i.e., parallel efficiency). Using lower CRs improves parallel efficiency by lowering synchronization cost, but degrades model accuracy as well (statistical efficiency). Further, speedup attained with different models and CRs also varies with network latency, effective bandwidth and collective op used for aggregation. In many cases, collectives like Allreduce (AR) have lower cost than AG to exchange the same amount of data. In this paper, we propose an AR-compatible Topk compressor that is bandwidth-optimal and thus performs better than AG in certain network configurations. We develop a flexible communication strategy that switches between AG and AR based on which collective is optimal in the current settings, and model the pareto-relationship between parallel and statistical efficiency as a multi-objective optimization (MOO) problem to dynamically adjust CR and accelerate training while still converging to high accuracy.
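One way a Topk compressor can be made Allreduce-compatible, sketched under assumptions: if all workers agree on a common index set (e.g., drawn from the previous round's aggregate gradient), the selected values form a small dense vector that a single Allreduce can sum. The paper's bandwidth-optimal construction may differ.

```python
# AR-friendly Topk sketch: a shared index set lets compressed gradients be
# summed elementwise, exactly what Allreduce computes.
import numpy as np

def common_topk_indices(reference_grad, k):
    return np.argsort(-np.abs(reference_grad))[:k]      # same on every worker

def compress(grad, idx):
    return grad[idx]                                    # dense k-vector, AR-friendly

def decompress(values, idx, dim):
    out = np.zeros(dim)
    out[idx] = values
    return out

dim, k, n_workers = 1000, 50, 4
rng = np.random.default_rng(0)
ref = rng.normal(size=dim)                              # shared reference gradient
idx = common_topk_indices(ref, k)
worker_grads = [rng.normal(size=dim) for _ in range(n_workers)]
# Allreduce of compressed values == elementwise sum across workers:
summed = np.sum([compress(g, idx) for g in worker_grads], axis=0)
print(decompress(summed / n_workers, idx, dim)[:5])
```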

Learning to Holistically Detect Bridges from Large-Size VHR Remote Sensing Imagery

  • paper_url: http://arxiv.org/abs/2312.02481
  • repo_url: None
  • paper_authors: Yansheng Li, Junwei Luo, Yongjun Zhang, Yihua Tan, Jin-Gang Yu, Song Bai
  • for: This paper addresses bridge detection in remote sensing images (RSIs) and the challenges of holistic bridge detection in large-size very-high-resolution (VHR) RSIs.
  • methods: Proposes GLH-Bridge, a large-scale dataset of 6,000 VHR RSIs sampled from diverse geographic locations across the globe, and HBD-Net, an efficient network for holistic bridge detection in large-size RSIs that uses a separate detector-based feature fusion (SDFF) architecture optimized via a shape-sensitive sample re-weighting (SSRW) strategy.
  • results: A bridge detection benchmark covering the OBB and HBB tasks is established on GLH-Bridge; the effectiveness of HBD-Net is validated on it, and cross-dataset generalization experiments illustrate the strong generalization capability of the GLH-Bridge dataset.
    Abstract Bridge detection in remote sensing images (RSIs) plays a crucial role in various applications, but it poses unique challenges compared to the detection of other objects. In RSIs, bridges exhibit considerable variations in terms of their spatial scales and aspect ratios. Therefore, to ensure the visibility and integrity of bridges, it is essential to perform holistic bridge detection in large-size very-high-resolution (VHR) RSIs. However, the lack of datasets with large-size VHR RSIs limits the deep learning algorithms' performance on bridge detection. Due to the limitation of GPU memory in tackling large-size images, deep learning-based object detection methods commonly adopt the cropping strategy, which inevitably results in label fragmentation and discontinuous prediction. To ameliorate the scarcity of datasets, this paper proposes a large-scale dataset named GLH-Bridge comprising 6,000 VHR RSIs sampled from diverse geographic locations across the globe. These images encompass a wide range of sizes, varying from 2,048*2,048 to 16,384*16,384 pixels, and collectively feature 59,737 bridges. Furthermore, we present an efficient network for holistic bridge detection (HBD-Net) in large-size RSIs. The HBD-Net presents a separate detector-based feature fusion (SDFF) architecture and is optimized via a shape-sensitive sample re-weighting (SSRW) strategy. Based on the proposed GLH-Bridge dataset, we establish a bridge detection benchmark including the OBB and HBB tasks, and validate the effectiveness of the proposed HBD-Net. Additionally, cross-dataset generalization experiments on two publicly available datasets illustrate the strong generalization capability of the GLH-Bridge dataset.

E4SRec: An Elegant Effective Efficient Extensible Solution of Large Language Models for Sequential Recommendation

  • paper_url: http://arxiv.org/abs/2312.02443
  • repo_url: https://github.com/hestiasky/e4srec
  • paper_authors: Xinhang Li, Chong Chen, Xiangyu Zhao, Yong Zhang, Chunxiao Xing
  • for: This paper aims to apply large language models (LLMs) to recommender systems, improving personalization while remaining efficient.
  • methods: Proposes E4SRec, an Elegant, Effective, Efficient, Extensible solution for LLMs for Sequential Recommendation, which integrates LLMs with traditional recommender systems that represent items solely by IDs; it takes ID sequences as inputs and ensures that generated outputs fall within the candidate lists.
  • results: Extensive experiments on four widely-used real-world datasets demonstrate the effectiveness, efficiency, and extensibility of E4SRec.
    Abstract The recent advancements in Large Language Models (LLMs) have sparked interest in harnessing their potential within recommender systems. Since LLMs are designed for natural language tasks, existing recommendation approaches have predominantly transformed recommendation tasks into open-domain natural language generation tasks. However, this approach necessitates items to possess rich semantic information, often generates out-of-range results, and suffers from notably low efficiency and limited extensibility. Furthermore, practical ID-based recommendation strategies, reliant on a huge number of unique identities (IDs) to represent users and items, have gained prominence in real-world recommender systems due to their effectiveness and efficiency. Nevertheless, the incapacity of LLMs to model IDs presents a formidable challenge when seeking to leverage LLMs for personalized recommendations. In this paper, we introduce an Elegant Effective Efficient Extensible solution for large language models for Sequential Recommendation (E4SRec), which seamlessly integrates LLMs with traditional recommender systems that exclusively utilize IDs to represent items. Specifically, E4SRec takes ID sequences as inputs, ensuring that the generated outputs fall within the candidate lists. Furthermore, E4SRec possesses the capability to generate the entire ranking list in a single forward process, and demands only a minimal set of pluggable parameters, which are trained for each dataset while keeping the entire LLM frozen. We substantiate the effectiveness, efficiency, and extensibility of our proposed E4SRec through comprehensive experiments conducted on four widely-used real-world datasets. The implementation code is accessible at https://github.com/HestiaSky/E4SRec/.

Let’s Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation

  • paper_url: http://arxiv.org/abs/2312.02439
  • repo_url: https://github.com/sail-sg/clot
  • paper_authors: Shanshan Zhong, Zhongzhan Huang, Shanghua Gao, Wushao Wen, Liang Lin, Marinka Zitnik, Pan Zhou
  • for: This paper explores Chain-of-Thought (CoT) versus Leap-of-Thought (LoT) in large language models (LLMs) and how to improve LLM creativity.
  • methods: Studies LLMs on the Oogiri game, which demands creativity and strong associative thinking to respond unexpectedly and humorously to images, text, or both; builds the multimodal, multilingual Oogiri-GO dataset; and introduces the Creative Leap-of-Thought (CLoT) paradigm, consisting of LoT-oriented instruction tuning of a pretrained LLM and explorative self-refinement.
  • results: CLoT excels at humor generation in the Oogiri game and also boosts creative abilities on tasks such as the cloud guessing game and the divergent association task, suggesting a pathway toward more innovative LLM applications across domains.
    Abstract Chain-of-Thought (CoT) guides large language models (LLMs) to reason step-by-step, and can motivate their logical reasoning ability. While effective for logical tasks, CoT is not conducive to creative problem-solving which often requires out-of-box thoughts and is crucial for innovation advancements. In this paper, we explore the Leap-of-Thought (LoT) abilities within LLMs -- a non-sequential, creative paradigm involving strong associations and knowledge leaps. To this end, we study LLMs on the popular Oogiri game which needs participants to have good creativity and strong associative thinking for responding unexpectedly and humorously to the given image, text, or both, and thus is suitable for LoT study. Then to investigate LLMs' LoT ability in the Oogiri game, we first build a multimodal and multilingual Oogiri-GO dataset which contains over 130,000 samples from the Oogiri game, and observe the insufficient LoT ability or failures of most existing LLMs on the Oogiri game. Accordingly, we introduce a creative Leap-of-Thought (CLoT) paradigm to improve LLM's LoT ability. CLoT first formulates the Oogiri-GO dataset into LoT-oriented instruction tuning data to train pretrained LLM for achieving certain LoT humor generation and discrimination abilities. Then CLoT designs an explorative self-refinement that encourages the LLM to generate more creative LoT data via exploring parallels between seemingly unrelated concepts and selects high-quality data to train itself for self-refinement. CLoT not only excels in humor generation in the Oogiri game but also boosts creative abilities in various tasks like cloud guessing game and divergent association task. These findings advance our understanding and offer a pathway to improve LLMs' creative capacities for innovative applications across domains. The dataset, code, and models will be released online. https://zhongshsh.github.io/CLoT/.

MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction-Following

  • paper_url: http://arxiv.org/abs/2312.02436
  • repo_url: https://github.com/RenzeLou/Muffin
  • paper_authors: Renze Lou, Kai Zhang, Jian Xie, Yuxuan Sun, Janice Ahn, Hanzi Xu, Yu Su, Wenpeng Yin
  • for: Improving the instruction-following capability of large language models (LLMs).
  • methods: Curates the MUFFIN dataset by automatically scaling tasks per input, diversifying tasks with various input facets.
  • results: Across four zero-shot benchmarks spanning both the Scaling-Inputs and Scaling Input-Free Tasks schemes, LLMs at various scales trained on MUFFIN generally demonstrate superior instruction-following capabilities compared to those trained under the two prior schemes.
    Abstract In the realm of large language models (LLMs), enhancing instruction-following capability often involves curating expansive training data. This is achieved through two primary schemes: i) Scaling-Inputs: Amplifying (input, output) pairs per task instruction, aiming for better instruction adherence. ii) Scaling Input-Free Tasks: Enlarging tasks, each composed of an (instruction, output) pair (without requiring a separate input anymore). However, LLMs under Scaling-Inputs tend to be overly sensitive to inputs, leading to misinterpretation or non-compliance with instructions. Conversely, Scaling Input-Free Tasks demands a substantial number of tasks but is less effective in instruction following when dealing with instances in Scaling-Inputs. This work introduces MUFFIN, a new scheme of instruction-following dataset curation. Specifically, we automatically Scale Tasks per Input by diversifying these tasks with various input facets. Experimental results across four zero-shot benchmarks, spanning both Scaling-Inputs and Scaling Input-Free Tasks schemes, reveal that LLMs, at various scales, trained on MUFFIN generally demonstrate superior instruction-following capabilities compared to those trained on the two aforementioned schemes.

Visually Grounded Language Learning: a review of language games, datasets, tasks, and models

  • paper_url: http://arxiv.org/abs/2312.02431
  • repo_url: None
  • paper_authors: Alessandro Suglia, Ioannis Konstas, Oliver Lemon
  • for: This paper is a systematic literature review of tasks and models in the Vision+Language (V+L) field.
  • methods: Uses Wittgenstein's idea of "language games" to categorize V+L tasks into three families: discriminative games, generative games, and interactive games.
  • results: The analysis suggests that future work should focus on interactive games, where natural language communication is important for resolving ambiguities about object referents and action plans, and where physical embodiment is essential to understanding the semantics of situations and events.
    Abstract In recent years, several machine learning models have been proposed. They are trained with a language modelling objective on large-scale text-only data. With such pretraining, they can achieve impressive results on many Natural Language Understanding and Generation tasks. However, many facets of meaning cannot be learned by ``listening to the radio" only. In the literature, many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality. In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field. We rely on Wittgenstein's idea of `language games' to categorise such tasks into 3 different families: 1) discriminative games, 2) generative games, and 3) interactive games. Our analysis of the literature provides evidence that future work should be focusing on interactive games where communication in Natural Language is important to resolve ambiguities about object referents and action plans and that physical embodiment is essential to understand the semantics of situations and events. Overall, these represent key requirements for developing grounded meanings in neural models.

PartSLIP++: Enhancing Low-Shot 3D Part Segmentation via Multi-View Instance Segmentation and Maximum Likelihood Estimation

  • paper_url: http://arxiv.org/abs/2312.03015
  • repo_url: https://github.com/zyc00/partslip2
  • paper_authors: Yuchen Zhou, Jiayuan Gu, Xuanlin Li, Minghua Liu, Yunhao Fang, Hao Su
  • for: This work aims to improve the accuracy and scalability of zero- and few-shot 3D part segmentation.
  • methods: Builds on the pre-trained GLIP and SAM models, and replaces heuristic 3D conversion with a modified Expectation-Maximization algorithm that treats 3D instance segmentation as unobserved latent variables, iteratively refined by alternating 2D-3D matching with gradient-descent optimization.
  • results: PartSLIP++ outperforms PartSLIP on both low-shot 3D semantic and instance-based object part segmentation tasks.
    Abstract Open-world 3D part segmentation is pivotal in diverse applications such as robotics and AR/VR. Traditional supervised methods often grapple with limited 3D data availability and struggle to generalize to unseen object categories. PartSLIP, a recent advancement, has made significant strides in zero- and few-shot 3D part segmentation. This is achieved by harnessing the capabilities of the 2D open-vocabulary detection module, GLIP, and introducing a heuristic method for converting and lifting multi-view 2D bounding box predictions into 3D segmentation masks. In this paper, we introduce PartSLIP++, an enhanced version designed to overcome the limitations of its predecessor. Our approach incorporates two major improvements. First, we utilize a pre-trained 2D segmentation model, SAM, to produce pixel-wise 2D segmentations, yielding more precise and accurate annotations than the 2D bounding boxes used in PartSLIP. Second, PartSLIP++ replaces the heuristic 3D conversion process with an innovative modified Expectation-Maximization algorithm. This algorithm conceptualizes 3D instance segmentation as unobserved latent variables, and then iteratively refines them through an alternating process of 2D-3D matching and optimization with gradient descent. Through extensive evaluations, we show that PartSLIP++ demonstrates better performance over PartSLIP in both low-shot 3D semantic and instance-based object part segmentation tasks. Code released at https://github.com/zyc00/PartSLIP2.

Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data

  • paper_url: http://arxiv.org/abs/2312.02418
  • repo_url: None
  • paper_authors: Yu Yang, Aaditya K. Singh, Mostafa Elhoushi, Anas Mahmoud, Kushal Tirumala, Fabian Gloeckle, Baptiste Rozière, Carole-Jean Wu, Ari S. Morcos, Newsha Ardalani
  • for: Improving the code-generation performance and training efficiency of Large Language Models (LLMs) by removing "low-quality" code data.
  • methods: Uses embedding space to identify and remove "low-quality" code data: synthetic corruptions are used to probe the features of low-quality code, and novel embedding-space pruning metrics are developed to remove low-quality entries from the Stack dataset.
  • results: The approach outperforms existing embedding-based methods on the HumanEval and MBPP benchmarks, achieving up to a 3% performance improvement over no pruning and demonstrating the promise of insights from synthetic corruptions for data pruning.
    Abstract Code datasets, often collected from diverse and uncontrolled sources such as GitHub, potentially suffer from quality issues, thereby affecting the performance and training efficiency of Large Language Models (LLMs) optimized for code generation. Previous studies demonstrated the benefit of using embedding spaces for data pruning, but they mainly focused on duplicate removal or increasing variety, and in other modalities, such as images. Our work focuses on using embeddings to identify and remove "low-quality" code data. First, we explore features of "low-quality" code in embedding space, through the use of synthetic corruptions. Armed with this knowledge, we devise novel pruning metrics that operate in embedding space to identify and remove low-quality entries in the Stack dataset. We demonstrate the benefits of this synthetic corruption informed pruning (SCIP) approach on the well-established HumanEval and MBPP benchmarks, outperforming existing embedding-based methods. Importantly, we achieve up to a 3% performance improvement over no pruning, thereby showing the promise of insights from synthetic corruptions for data pruning.
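A hedged sketch of synthetic-corruption-informed pruning: embed clean code and synthetically corrupted variants, estimate a "corruption direction" in embedding space, and drop training entries that score most corruption-like. The embedding model and scoring rule are illustrative assumptions.

```python
# Prune dataset entries whose embeddings look most like synthetic corruptions.
import numpy as np

def corruption_scores(clean_emb, corrupt_emb, data_emb):
    """Score each dataset embedding by its projection onto the direction from
    the clean centroid toward the corruption centroid."""
    direction = corrupt_emb.mean(axis=0) - clean_emb.mean(axis=0)
    direction /= np.linalg.norm(direction) + 1e-12
    return data_emb @ direction          # higher = more corruption-like

def prune(data_emb, clean_emb, corrupt_emb, drop_frac=0.2):
    scores = corruption_scores(clean_emb, corrupt_emb, data_emb)
    n_drop = int(len(scores) * drop_frac)
    return np.argsort(scores)[: len(scores) - n_drop]   # indices kept

rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 64))
corrupt = clean[:50] + rng.normal(0.5, 0.2, size=(50, 64))  # e.g., mangled syntax
dataset = rng.normal(size=(1000, 64))
print(len(prune(dataset, clean, corrupt)))
```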

Towards Fast and Stable Federated Learning: Confronting Heterogeneity via Knowledge Anchor

  • paper_url: http://arxiv.org/abs/2312.02416
  • repo_url: https://github.com/J1nqianChen/FedKA
  • paper_authors: Jinqian Chen, Jihua Zhu, Qinghai Zheng
  • for: Addresses data heterogeneity in federated learning, which has a major impact on the performance and stability of the federated model.
  • methods: Analyzes class-wise forgetting during local training and proposes a new algorithm, Federated Knowledge Anchor (FedKA), which uses shared samples to build a knowledge anchor that corrects mini-batch gradients.
  • results: Experiments show that the algorithm achieves fast and stable convergence and significantly improves accuracy.
    Abstract Federated learning encounters a critical challenge of data heterogeneity, adversely affecting the performance and convergence of the federated model. Various approaches have been proposed to address this issue, yet their effectiveness is still limited. Recent studies have revealed that the federated model suffers severe forgetting in local training, leading to global forgetting and performance degradation. Although the analysis provides valuable insights, a comprehensive understanding of the vulnerable classes and their impact factors is yet to be established. In this paper, we aim to bridge this gap by systematically analyzing the forgetting degree of each class during local training across different communication rounds. Our observations are: (1) Both missing and non-dominant classes suffer similar severe forgetting during local training, while dominant classes show improvement in performance. (2) When dynamically reducing the sample size of a dominant class, catastrophic forgetting occurs abruptly when the proportion of its samples is below a certain threshold, indicating that the local model struggles to leverage a few samples of a specific class effectively to prevent forgetting. Motivated by these findings, we propose a novel and straightforward algorithm called Federated Knowledge Anchor (FedKA). Assuming that all clients have a single shared sample for each class, the knowledge anchor is constructed before each local training stage by extracting shared samples for missing classes and randomly selecting one sample per class for non-dominant classes. The knowledge anchor is then utilized to correct the gradient of each mini-batch towards the direction of preserving the knowledge of the missing and non-dominant classes. Extensive experimental results demonstrate that our proposed FedKA achieves fast and stable convergence, significantly improving accuracy on popular benchmarks.
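The abstract states that the knowledge anchor corrects each mini-batch gradient toward preserving missing and non-dominant classes, but not the exact correction rule. The sketch below assumes the simplest variant, adding an anchor loss term to the local objective; `lam` is a hypothetical weight, and the model and data are toys.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(32, 10)                  # toy local model, 10 classes
opt = torch.optim.SGD(model.parameters(), lr=0.1)
ce = nn.CrossEntropyLoss()

# Toy client data covering only dominant classes 0-2, plus a knowledge anchor
# holding one shared sample for each missing / non-dominant class (3-9).
x_batch, y_batch = torch.randn(16, 32), torch.randint(0, 3, (16,))
anchor_x, anchor_y = torch.randn(7, 32), torch.arange(3, 10)

lam = 1.0  # hypothetical weight of the anchor term
for _ in range(5):                         # a few local training steps
    opt.zero_grad()
    loss = ce(model(x_batch), y_batch) + lam * ce(model(anchor_x), anchor_y)
    loss.backward()                        # gradient also preserves classes 3-9
    opt.step()
print(loss.item())
```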

Foundation Models for Weather and Climate Data Understanding: A Comprehensive Survey

  • paper_url: http://arxiv.org/abs/2312.03014
  • repo_url: https://github.com/shengchaochen82/awesome-large-models-for-weather-and-climate
  • paper_authors: Shengchao Chen, Guodong Long, Jing Jiang, Dikai Liu, Chengqi Zhang
  • for: This paper is written to provide an overview of state-of-the-art AI methodologies for weather and climate data, with a focus on time series and text data.
  • methods: The paper discusses various model architectures, including large language models (LLMs), and their applications in weather and climate data understanding.
  • results: The paper provides an exhaustive review of current breakthroughs in research on large, data-driven models for weather and climate data understanding, including practical applications, crucial resources, and prospective research opportunities.
    Abstract As artificial intelligence (AI) continues to rapidly evolve, the realm of Earth and atmospheric sciences is increasingly adopting data-driven models, powered by progressive developments in deep learning (DL). Specifically, DL techniques are extensively utilized to decode the chaotic and nonlinear aspects of Earth systems, and to address climate challenges via understanding weather and climate data. Cutting-edge performance on specific tasks within narrower spatio-temporal scales has been achieved recently through DL. The rise of large models, specifically large language models (LLMs), has enabled fine-tuning processes that yield remarkable outcomes across various downstream tasks, thereby propelling the advancement of general AI. However, we are still navigating the initial stages of crafting general AI for weather and climate. In this survey, we offer an exhaustive, timely overview of state-of-the-art AI methodologies specifically engineered for weather and climate data, with a special focus on time series and text data. Our primary coverage encompasses four critical aspects: types of weather and climate data, principal model architectures, model scopes and applications, and datasets for weather and climate. Furthermore, in relation to the creation and application of foundation models for weather and climate data understanding, we delve into the field's prevailing challenges, offer crucial insights, and propose detailed avenues for future research. This comprehensive approach equips practitioners with the requisite knowledge to make substantial progress in this domain. Our survey encapsulates the most recent breakthroughs in research on large, data-driven models for weather and climate data understanding, emphasizing robust foundations, current advancements, practical applications, crucial resources, and prospective research opportunities.

BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks

  • paper_url: http://arxiv.org/abs/2312.02405
  • repo_url: https://github.com/minerllabs/basalt-benchmark
  • paper_authors: Stephanie Milani, Anssi Kanervisto, Karolis Ramanauskas, Sander Schulhoff, Brandon Houghton, Rohin Shah
  • for: Provides a formalized benchmark for learning from human feedback, enabling performance assessment of newly developed algorithms.
  • methods: Uses four hard-to-specify tasks in Minecraft, such as creating and photographing a waterfall, to benchmark algorithms that learn from human feedback.
  • results: Releases a collection of 26 million image-action pairs and over 3,000 dense pairwise human evaluations comparing human and algorithmic agents, serving as a fixed preliminary leaderboard for newly developed algorithms.
    Abstract The MineRL BASALT competition has served to catalyze advances in learning from human feedback through four hard-to-specify tasks in Minecraft, such as create and photograph a waterfall. Given the completion of two years of BASALT competitions, we offer to the community a formalized benchmark through the BASALT Evaluation and Demonstrations Dataset (BEDD), which serves as a resource for algorithm development and performance assessment. BEDD consists of a collection of 26 million image-action pairs from nearly 14,000 videos of human players completing the BASALT tasks in Minecraft. It also includes over 3,000 dense pairwise human evaluations of human and algorithmic agents. These comparisons serve as a fixed, preliminary leaderboard for evaluating newly-developed algorithms. To enable this comparison, we present a streamlined codebase for benchmarking new algorithms against the leaderboard. In addition to presenting these datasets, we conduct a detailed analysis of the data from both datasets to guide algorithm development and evaluation. The released code and data are available at https://github.com/minerllabs/basalt-benchmark .

Breast Ultrasound Report Generation using LangChain

  • paper_url: http://arxiv.org/abs/2312.03013
  • repo_url: None
  • paper_authors: Jaeyoung Huh, Hyun Jeong Park, Jong Chul Ye
  • for: Improving the diagnostic efficiency and report quality of breast ultrasound imaging while easing the burden on radiologists and healthcare professionals.
  • methods: Proposes a LangChain-based integration of multiple image analysis tools, combining designated tools with LLM text generation to extract relevant features from ultrasound images, interpret them in a clinical context, and produce standardized reports.
  • results: Experiments show that each tool involved offers qualitatively and quantitatively significant results without requiring expert intervention, and clinical evaluation confirms that the generated reports are clinically meaningful.
    Abstract Breast ultrasound (BUS) is a critical diagnostic tool in the field of breast imaging, aiding in the early detection and characterization of breast abnormalities. Interpreting breast ultrasound images commonly involves creating comprehensive medical reports, containing vital information to promptly assess the patient's condition. However, the ultrasound imaging system necessitates capturing multiple images of various parts to compile a single report, presenting a time-consuming challenge. To address this problem, we propose the integration of multiple image analysis tools, through a LangChain using Large Language Models (LLM), into the breast reporting process. Through a combination of designated tools and text generation through LangChain, our method can accurately extract relevant features from ultrasound images, interpret them in a clinical context, and produce comprehensive and standardized reports. This approach not only reduces the burden on radiologists and healthcare professionals but also enhances the consistency and quality of reports. Extensive experiments show that each tool involved in the proposed method can offer qualitatively and quantitatively significant results. Furthermore, clinical evaluation of the generated reports demonstrates that the proposed method can produce reports in a clinically meaningful way.

cs.CL - 2023-12-05

Combining Counting Processes and Classification Improves a Stopping Rule for Technology Assisted Review

  • paper_url: http://arxiv.org/abs/2312.03171
  • repo_url: https://github.com/reembinhezam/tar_stopping_cp_clf
  • paper_authors: Reem Bin-Hezam, Mark Stevenson
  • for: Reducing the cost of manually assessing documents for relevance while ensuring a desired level of recall.
  • methods: Extends an effective stopping rule using information from a text classifier that can be trained without any additional annotation.
  • results: Experiments on multiple datasets (CLEF e-Health, TREC Total Recall, TREC Legal, and RCV1) show that the proposed approach consistently improves performance and outperforms several alternative methods.
    Abstract Technology Assisted Review (TAR) stopping rules aim to reduce the cost of manually assessing documents for relevance by minimising the number of documents that need to be examined to ensure a desired level of recall. This paper extends an effective stopping rule using information derived from a text classifier that can be trained without the need for any additional annotation. Experiments on multiple data sets (CLEF e-Health, TREC Total Recall, TREC Legal and RCV1) showed that the proposed approach consistently improves performance and outperforms several alternative methods.
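A hedged sketch of the classifier half of such a stopping rule (the counting-process component and score calibration are simplified away): estimate the number of relevant documents still unreviewed from calibrated classifier scores and stop once the implied recall reaches the target. The Beta-distributed scores are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

reviewed_relevant = 180                          # relevant documents found so far
scores_unreviewed = rng.beta(1, 8, size=5_000)   # calibrated P(relevant) scores

# Expected number of relevant documents still unreviewed, from the classifier.
expected_remaining = scores_unreviewed.sum()
estimated_recall = reviewed_relevant / (reviewed_relevant + expected_remaining)

target_recall = 0.95
print(f"estimated recall: {estimated_recall:.3f}")
print("stop reviewing" if estimated_recall >= target_recall else "continue reviewing")
```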

Assertion Enhanced Few-Shot Learning: Instructive Technique for Large Language Models to Generate Educational Explanations

  • paper_url: http://arxiv.org/abs/2312.03122
  • repo_url: None
  • paper_authors: Tasmia Shahriar, Noboru Matsuda, Kelly Ramos
  • for: Giving Intelligent Tutoring Systems the ability of human educators to elicit and provide detail-oriented educational explanations from only a few examples.
  • methods: Proposes Assertion Enhanced Few-Shot Learning, a prompting technique that augments few-shot demonstrations with assertions to improve the accuracy and quality of generated explanations.
  • results: In a study with 12 in-service teachers, Assertion Enhanced Few-Shot Learning improves explanation accuracy by 15% and yields higher-quality, more educator-friendly explanations, as rated by teachers.
    Abstract Human educators possess an intrinsic ability to anticipate and seek educational explanations from students, which drives them to pose thought-provoking questions when students cannot articulate these explanations independently. We aim to imbue Intelligent Tutoring Systems with this ability using few-shot learning capability of Large Language Models. Our work proposes a novel prompting technique, Assertion Enhanced Few-Shot Learning, to facilitate the generation of accurate, detailed oriented educational explanations. Our central hypothesis is that, in educational domain, few-shot demonstrations are necessary but not a sufficient condition for quality explanation generation. We conducted a study involving 12 in-service teachers, comparing our approach to Traditional Few-Shot Learning. The results show that Assertion Enhanced Few-Shot Learning improves explanation accuracy by 15% and yields higher-quality explanations, as evaluated by teachers. We also conduct a qualitative ablation study to factor the impact of assertions to provide educator-friendly prompting guidelines for generating explanations in their domain of interest.

Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

  • paper_url: http://arxiv.org/abs/2312.03766
  • repo_url: None
  • paper_authors: Brian Gordon, Yonatan Bitton, Yonatan Shafir, Roopal Garg, Xi Chen, Dani Lischinski, Daniel Cohen-Or, Idan Szpektor
  • for: Providing detailed textual and visual explanations of detected misalignments between text-image pairs, precisely localizing their source.
  • methods: Uses large language models and visual grounding models to automatically construct a training set of plausible misaligned captions with corresponding textual explanations and visual indicators, and releases a new human-curated test set.
  • results: Fine-tuning vision-language models on the constructed training set enables them to articulate misalignments and visually indicate them within images, outperforming strong baselines.
    Abstract While existing image-text alignment models reach high quality binary assessments, they fall short of pinpointing the exact source of misalignment. In this paper, we present a method to provide detailed textual and visual explanation of detected misalignments between text-image pairs. We leverage large language models and visual grounding models to automatically construct a training set that holds plausible misaligned captions for a given image and corresponding textual explanations and visual indicators. We also publish a new human curated test set comprising ground-truth textual and visual misalignment annotations. Empirical results show that fine-tuning vision language models on our training set enables them to articulate misalignments and visually indicate them within images, outperforming strong baselines both on the binary alignment classification and the explanation generation tasks. Our method code and human curated test set are available at: https://mismatch-quest.github.io/

Understanding Environmental Posts: Sentiment and Emotion Analysis of Social Media Data

  • paper_url: http://arxiv.org/abs/2312.03095
  • repo_url: None
  • paper_authors: Daniyar Amangeldi, Aida Usmanova, Pakizar Shamoi
  • for: This study aims to analyze the public perception of climate change and the environment through social media data from 2014 to 2023, in order to provide insights that can help raise awareness and inform environmental interventions.
  • methods: The study uses the Pointwise Mutual Information (PMI) algorithm to identify sentiment and explore prevailing emotions expressed within environmental tweets on Twitter, Reddit, and YouTube. The accuracy of the algorithm was compared to human annotation and expert rating.
  • results: The study finds that negative environmental tweets are more common than positive or neutral ones, with climate change, air quality, emissions, plastic, and recycling being the most discussed topics. The most common emotions in environmental tweets are fear, trust, and anticipation, demonstrating the complex and wide-ranging nature of public reactions to environmental issues.
    Abstract Social media is now the predominant source of information due to the availability of immediate public response. As a result, social media data has become a valuable resource for comprehending public sentiments. Studies have shown that it can amplify ideas and influence public sentiments. This study analyzes the public perception of climate change and the environment over a decade, from 2014 to 2023. Using the Pointwise Mutual Information (PMI) algorithm, we identify sentiment and explore prevailing emotions expressed within environmental tweets across various social media platforms, namely Twitter, Reddit, and YouTube. Accuracy on a human-annotated dataset was 0.65, higher than the VADER score but lower than that of an expert rater (0.90). Our findings suggest that negative environmental tweets are far more common than positive or neutral ones. Climate change, air quality, emissions, plastic, and recycling are the most discussed topics on all social media platforms, highlighting the huge global concern they attract. The most common emotions in environmental tweets are fear, trust, and anticipation, demonstrating the wide-ranging and complex nature of public reactions to environmental issues. By identifying patterns and trends in opinions related to the environment, we hope to provide insights that can help raise awareness regarding environmental issues, inform the development of interventions, and adapt further actions to meet environmental challenges.
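For reference, the PMI computation the abstract mentions can be sketched as follows; the seed corpus, tokenization, and the exact way the authors derive sentiment from PMI are not given, so the Turney-style scoring below (sum of PMI with a positive class minus PMI with a negative class) is an illustrative assumption on a toy corpus.

```python
import math
from collections import Counter

docs = [
    ("the air quality is terrible and emissions keep rising", "neg"),
    ("recycling programs improve our environment", "pos"),
    ("plastic pollution is a terrible threat", "neg"),
    ("clean energy can improve air quality", "pos"),
]

word_class, word, cls = Counter(), Counter(), Counter()
N = 0
for text, c in docs:
    for w in text.split():
        word_class[(w, c)] += 1
        word[w] += 1
        cls[c] += 1
        N += 1

def pmi(w, c, smoothing=0.5):
    # PMI(w, c) = log2( P(w, c) / (P(w) P(c)) ), with add-0.5 smoothing.
    p_wc = (word_class[(w, c)] + smoothing) / (N + smoothing)
    p_w = (word[w] + smoothing) / (N + smoothing)
    p_c = (cls[c] + smoothing) / (N + smoothing)
    return math.log2(p_wc / (p_w * p_c))

def polarity(text):
    return sum(pmi(w, "pos") - pmi(w, "neg") for w in text.split() if w in word)

print(polarity("recycling can improve air quality"))   # > 0: positive-leaning
print(polarity("terrible emissions"))                  # < 0: negative-leaning
```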

LLMs for Multi-Modal Knowledge Extraction and Analysis in Intelligence/Safety-Critical Applications

  • paper_url: http://arxiv.org/abs/2312.03088
  • repo_url: None
  • paper_authors: Brett Israelsen, Soumalya Sarkar
  • for: Synthesizes recent research on large language model assessment and vulnerabilities to clarify the care required before applying these technologies to intelligence and safety-critical applications.
  • methods: Reviews recent literature on LLM assessment and vulnerabilities, breaking the vulnerabilities into ten high-level categories and overlaying them onto a high-level LLM life cycle; general categories of mitigations are also reviewed.
  • results: Finds that LLMs retain numerous vulnerabilities and limitations that require careful assessment and mitigation before they are applied to intelligence and safety-critical applications.
    Abstract Large Language Models have seen rapid progress in capability in recent years; this progress has been accelerating and their capabilities, measured by various benchmarks, are beginning to approach those of humans. There is a strong demand to use such models in a wide variety of applications but, due to unresolved vulnerabilities and limitations, great care needs to be taken before applying them to intelligence and safety-critical applications. This paper reviews recent literature related to LLM assessment and vulnerabilities to synthesize the current research landscape and to help understand what advances are most critical to enable use of these technologies in intelligence and safety-critical applications. The vulnerabilities are broken down into ten high-level categories and overlaid onto a high-level life cycle of an LLM. Some general categories of mitigations are reviewed.

Describing Differences in Image Sets with Natural Language

  • paper_url: http://arxiv.org/abs/2312.02974
  • repo_url: https://github.com/understanding-visual-datasets/visdiff
  • paper_authors: Lisa Dunlap, Yuhui Zhang, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez, Serena Yeung-Levy
  • for: Automatically describing the differences between two image sets, to better understand model behaviors and analyze datasets.
  • methods: Proposes a two-stage approach that first proposes candidate difference descriptions from the image sets and then re-ranks the candidates with CLIP according to how well they differentiate the two sets.
  • results: Applies VisDiff to compare datasets (e.g., ImageNet vs. ImageNetV2), compare classification models (e.g., zero-shot CLIP vs. supervised ResNet), summarize model failure modes, characterize differences between generative models (e.g., StableDiffusionV1 and V2), and discover what makes images memorable, finding interesting and previously unknown differences that demonstrate its utility in revealing nuanced insights.
    Abstract How do two sets of images differ? Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets, yet manually sifting through thousands of images is impractical. To aid in this discovery process, we explore the task of automatically describing the differences between two $\textbf{sets}$ of images, which we term Set Difference Captioning. This task takes in image sets $D_A$ and $D_B$, and outputs a description that is more often true on $D_A$ than $D_B$. We outline a two-stage approach that first proposes candidate difference descriptions from image sets and then re-ranks the candidates by checking how well they can differentiate the two sets. We introduce VisDiff, which first captions the images and prompts a language model to propose candidate descriptions, then re-ranks these descriptions using CLIP. To evaluate VisDiff, we collect VisDiffBench, a dataset with 187 paired image sets with ground truth difference descriptions. We apply VisDiff to various domains, such as comparing datasets (e.g., ImageNet vs. ImageNetV2), comparing classification models (e.g., zero-shot CLIP vs. supervised ResNet), summarizing model failure modes (supervised ResNet), characterizing differences between generative models (e.g., StableDiffusionV1 and V2), and discovering what makes images memorable. Using VisDiff, we are able to find interesting and previously unknown differences in datasets and models, demonstrating its utility in revealing nuanced insights.
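The second, re-ranking stage can be sketched compactly: embed both image sets and the candidate descriptions with CLIP, then score each description by how much more similar it is, on average, to set A than to set B. Random unit vectors below stand in for CLIP embeddings so the sketch runs without model weights.

```python
import numpy as np

rng = np.random.default_rng(2)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

set_a = normalize(rng.normal(size=(200, 512)))   # CLIP image embeddings of D_A
set_b = normalize(rng.normal(size=(200, 512)))   # CLIP image embeddings of D_B
candidates = ["photos taken at night", "images containing dogs", "snowy scenes"]
text_emb = normalize(rng.normal(size=(len(candidates), 512)))  # CLIP text embeddings

# Score: how much more similar a description is to D_A than to D_B on average.
scores = (set_a @ text_emb.T).mean(axis=0) - (set_b @ text_emb.T).mean(axis=0)
for s, desc in sorted(zip(scores, candidates), reverse=True):
    print(f"{s:+.4f}  {desc}")
```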

Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

  • paper_url: http://arxiv.org/abs/2312.03052
  • repo_url: None
  • paper_authors: Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, Ariel Fuxman
  • for: Solving complex visual tasks, such as "Who invented the musical instrument on the right?", which require composing multiple skills: understanding space, recognizing instruments, and retrieving prior knowledge.
  • methods: Uses a large language model (LLM) to decompose complex visual tasks into executable programs that invoke specialized vision models, then distills the verified programs into a vision-language model (VLM).
  • results: Proposes Visual Program Distillation (VPD), an instruction-tuning framework that solves complex visual tasks in a single forward pass. VPD uses an LLM to sample multiple candidate programs, executes and verifies them to identify a correct one, translates each correct program into a language description of its reasoning steps, and distills that into a VLM. Experiments show VPD improves the VLM's spatial understanding, counting, and compositional reasoning. The VPD-trained PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance on MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, and Hateful Memes. Human annotators confirm that VPD improves answer factuality and consistency, and further experiments show it adapts well to real-world applications with limited data.
    Abstract Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and also retrieving prior knowledge. Recent work shows promise by decomposing such tasks using a large language model (LLM) into an executable program that invokes specialized vision models. However, generated programs are error-prone: they omit necessary steps, include spurious ones, and are unable to recover when the specialized models give incorrect outputs. Moreover, they require loading multiple models, incurring high latency and computation costs. We propose Visual Program Distillation (VPD), an instruction tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. VPD distills the reasoning ability of LLMs by using them to sample multiple candidate programs, which are then executed and verified to identify a correct one. It translates each correct program into a language description of the reasoning steps, which are then distilled into a VLM. Extensive experiments show that VPD improves the VLM's ability to count, understand spatial relations, and reason compositionally. Our VPD-trained PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance across complex vision tasks, including MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, and Hateful Memes. An evaluation with human annotators also confirms that VPD improves model response factuality and consistency. Finally, experiments on content moderation demonstrate that VPD is also helpful for adaptation to real-world applications with limited data.

Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models

  • paper_url: http://arxiv.org/abs/2312.02969
  • repo_url: None
  • paper_authors: Xinyu Zhang, Sebastian Hofstätter, Patrick Lewis, Raphael Tang, Jimmy Lin
  • for: Builds effective listwise rerankers without any dependence on GPT models, addressing the single-point-of-failure concern and improving scientific reproducibility.
  • methods: Uses open-source large language models (LLMs) to build listwise rerankers, without relying on GPT, and evaluates them with passage retrieval experiments.
  • results: The best listwise reranker surpasses listwise rerankers based on GPT-3.5 by 13% and achieves 97% of the effectiveness of those built on GPT-4; the authors also find that existing training datasets, built for pointwise ranking, are insufficient for listwise rerankers, highlighting the need for high-quality human-annotated listwise ranking data.
    Abstract Listwise rerankers based on large language models (LLM) are the zero-shot state-of-the-art. However, current works in this direction all depend on the GPT models, making it a single point of failure in scientific reproducibility. Moreover, it raises the concern that the current research findings only hold for GPT models but not LLMs in general. In this work, we lift this pre-condition and build for the first time effective listwise rerankers without any form of dependency on GPT. Our passage retrieval experiments show that our best listwise reranker surpasses the listwise rerankers based on GPT-3.5 by 13% and achieves 97% effectiveness of the ones built on GPT-4. Our results also show that the existing training datasets, which were expressly constructed for pointwise ranking, are insufficient for building such listwise rerankers. Instead, high-quality listwise ranking data is required and crucial, calling for further work on building human-annotated listwise data resources.
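A minimal sketch of the listwise-reranking mechanics, with the caveat that the prompt template and output format below are illustrative assumptions rather than the paper's exact ones: the LLM sees the query plus numbered passages and returns a permutation such as "[2] > [1] > [3]", which is parsed back into an ordering.

```python
import re

def build_prompt(query: str, passages: list[str]) -> str:
    lines = [f"Rank the passages by relevance to the query.\nQuery: {query}"]
    for i, p in enumerate(passages, 1):
        lines.append(f"[{i}] {p}")
    lines.append("Answer with a ranking like [2] > [1] > [3].")
    return "\n".join(lines)

def parse_ranking(output: str, n: int) -> list[int]:
    seen, order = set(), []
    for m in re.findall(r"\[(\d+)\]", output):
        i = int(m) - 1
        if 0 <= i < n and i not in seen:
            seen.add(i)
            order.append(i)
    # Append any passages the model forgot so the result is a full permutation.
    return order + [i for i in range(n) if i not in seen]

passages = ["Paris is in France.", "The Eiffel Tower is in Paris.", "Cats purr."]
print(build_prompt("Where is the Eiffel Tower?", passages))
fake_llm_output = "[2] > [1] > [3]"   # stand-in for a real open-source LLM call
ranking = parse_ranking(fake_llm_output, len(passages))
print([passages[i] for i in ranking])
```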

Concept Drift Adaptation in Text Stream Mining Settings: A Comprehensive Review

  • paper_url: http://arxiv.org/abs/2312.02901
  • repo_url: None
  • paper_authors: Cristiano Mesquita Garcia, Ramon Simoes Abilio, Alessandro Lameiras Koerich, Alceu de Souza Britto Jr., Jean Paul Barddal
  • for: Provides a systematic literature review of concept drift adaptation in text stream settings and how these methods are applied across scenarios.
  • methods: Conducts a systematic literature review, selecting 40 papers under well-defined criteria, then categorizing, summarizing, and discussing them.
  • results: Unravels aspects such as text drift categories, types of text drift detection, model update mechanisms, the stream mining tasks addressed, types of text representations, and text representation update mechanisms, and additionally discusses drift visualization and simulation and lists real-world datasets.
    Abstract Due to the advent and increase in the popularity of the Internet, people have been producing and disseminating textual data in several ways, such as reviews, social media posts, and news articles. As a result, numerous researchers have been working on discovering patterns in textual data, especially because social media posts function as social sensors, indicating peoples' opinions, interests, etc. However, most tasks regarding natural language processing are addressed using traditional machine learning methods and static datasets. This setting can lead to several problems, such as an outdated dataset, which may not correspond to reality, and an outdated model, which has its performance degrading over time. Concept drift, i.e., changes in data distribution and patterns, further aggravates these issues. In a text stream scenario, it is even more challenging due to characteristics such as high speed and sequentially arriving data. In addition, models for this type of scenario must adhere to the constraints mentioned above while learning from the stream, storing texts for only a limited time and consuming low memory. In this study, we performed a systematic literature review regarding concept drift adaptation in text stream scenarios. Considering well-defined criteria, we selected 40 papers to unravel aspects such as text drift categories, types of text drift detection, model update mechanism, the addressed stream mining tasks, types of text representations, and text representation update mechanism. In addition, we discussed drift visualization and simulation and listed real-world datasets used in the selected papers. Therefore, this paper comprehensively reviews the concept drift adaptation in text stream mining scenarios.

Can a Tabula Recta provide security in the XXI century?

  • paper_url: http://arxiv.org/abs/2312.02869
  • repo_url: None
  • paper_authors: Francisco Ruiz
  • for: Investigates whether paper-and-pencil cryptographic methods, aided by a classic Tabula Recta, could still provide security after a total compromise of the computers accessible to a group of users, against computer-aided cryptanalysis.
  • methods: Considers classic paper-and-pencil algorithms and new ones built from the same simple tools: algorithms that concentrate entropy from shared text sources, stream ciphers based on arithmetic over non-binary spaces, and hash-like algorithms that generate a password from a challenge text.
  • results: Computer-based statistical analysis suggests that some of these human-computable algorithms can afford sufficient security.
    Abstract In the not so unlikely scenario of total compromise of computers accessible to a group of users, they might be tempted to resort to human-computable paper-and-pencil cryptographic methods aided by a classic Tabula Recta, which helps to perform addition and subtraction directly with letters. But do these classic algorithms, or some new ones using the same simple tools, have any chance against computer-aided cryptanalysis? In this paper I discuss how some human-computable algorithms can indeed afford sufficient security in this situation, drawing conclusions from computer-based statistical analysis. Three kinds of algorithms are discussed: those that concentrate entropy from shared text sources, stream ciphers based on arithmetic of non-binary spaces, and hash-like algorithms that may be used to generate a password from a challenge text.
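The Tabula Recta the abstract refers to is, computationally, just letter addition and subtraction modulo 26. The sketch below shows that table-free equivalent with a Vigenère-style example; the paper's actual algorithms are more involved, so this only illustrates the primitive operation.

```python
# Letter arithmetic modulo 26: the operation a Tabula Recta performs by lookup.
A = ord("A")

def add(a: str, b: str) -> str:
    return chr((ord(a) - A + ord(b) - A) % 26 + A)

def sub(a: str, b: str) -> str:
    return chr((ord(a) - ord(b)) % 26 + A)

def encrypt(plain: str, key: str) -> str:
    return "".join(add(p, key[i % len(key)]) for i, p in enumerate(plain))

def decrypt(cipher: str, key: str) -> str:
    return "".join(sub(c, key[i % len(key)]) for i, c in enumerate(cipher))

ct = encrypt("ATTACKATDAWN", "LEMON")
print(ct)                      # LXFOPVEFRNHR
print(decrypt(ct, "LEMON"))    # ATTACKATDAWN
```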

Weakly Supervised Detection of Hallucinations in LLM Activations

  • paper_url: http://arxiv.org/abs/2312.02798
  • repo_url: https://github.com/Trusted-AI/adversarial-robustness-toolbox
  • paper_authors: Miriam Rateike, Celia Cintas, John Wamburu, Tanya Akumu, Skyler Speakman
  • for: Detects whether a large language model (LLM) encodes patterns such as hallucinations in its internal states, which may propagate to downstream tasks.
  • methods: Proposes a weakly supervised auditing technique that uses a subset scanning approach to detect anomalous patterns in LLM activations. The method needs no a-priori knowledge of the pattern type, relying instead on a reference dataset free of anomalies, and it can identify the pivotal nodes responsible for encoding the patterns, offering insights for bias mitigation.
  • results: The results confirm BERT's limited internal capacity for encoding hallucinations, while OPT appears capable of encoding hallucination information internally. Without prior exposure to false statements, the scanning approach performs comparably to a fully supervised out-of-distribution classifier.
    Abstract We propose an auditing method to identify whether a large language model (LLM) encodes patterns such as hallucinations in its internal states, which may propagate to downstream tasks. We introduce a weakly supervised auditing technique using a subset scanning approach to detect anomalous patterns in LLM activations from pre-trained models. Importantly, our method does not need knowledge of the type of patterns a-priori. Instead, it relies on a reference dataset devoid of anomalies during testing. Further, our approach enables the identification of pivotal nodes responsible for encoding these patterns, which may offer crucial insights for fine-tuning specific sub-networks for bias mitigation. We introduce two new scanning methods to handle LLM activations for anomalous sentences that may deviate from the expected distribution in either direction. Our results confirm prior findings of BERT's limited internal capacity for encoding hallucinations, while OPT appears capable of encoding hallucination information internally. Importantly, our scanning approach, without prior exposure to false statements, performs comparably to a fully supervised out-of-distribution classifier.
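A hedged sketch of the scanning idea: compute per-unit empirical p-values for a test activation vector against a clean reference set, then flag the subset of units with an excess of small p-values. The simple binomial-excess score below is a stand-in for the non-parametric scan statistics such methods typically use, and the anomaly is injected synthetically.

```python
import numpy as np

rng = np.random.default_rng(3)
reference = rng.normal(size=(1000, 64))   # "clean" activations, 64 hidden units
test = rng.normal(size=64)                # activations of one test input
test[:5] += 3.0                           # inject an anomalous pattern

# Two-sided empirical p-value per unit against the reference distribution.
counts = (np.abs(reference) >= np.abs(test)).sum(axis=0)
pvals = (counts + 1) / (len(reference) + 1)

alpha = 0.05
flagged = np.where(pvals < alpha)[0]        # units with unusually extreme values
excess = len(flagged) - alpha * len(pvals)  # excess over chance, a crude score
print(f"anomalous units: {flagged}, excess score: {excess:.1f}")
```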

Large Language Models on Graphs: A Comprehensive Survey

  • paper_url: http://arxiv.org/abs/2312.02783
  • repo_url: https://github.com/petergriffinjin/awesome-language-model-on-graphs
  • paper_authors: Bowen Jin, Gang Liu, Chi Han, Meng Jiang, Heng Ji, Jiawei Han
  • for: Surveys scenarios and techniques for applying large language models (LLMs) on graph-structured data.
  • methods: Organizes potential scenarios into three categories: pure graphs, text-rich graphs, and text-paired graphs, and reviews techniques for utilizing LLMs on graphs, including LLM as Predictor, LLM as Encoder, and LLM as Aligner, comparing their advantages and disadvantages.
  • results: Summarizes real-world applications, open-source codebases, and benchmark datasets, and outlines future research directions such as better combining graph structure with textual information, improving model performance, and extending to broader applications.
    Abstract Large language models (LLMs), such as ChatGPT and LLaMA, are creating significant advancements in natural language processing, due to their strong text encoding/decoding ability and newly found emergent capability (e.g., reasoning). While LLMs are mainly designed to process pure texts, there are many real-world scenarios where text data are associated with rich structure information in the form of graphs (e.g., academic networks, and e-commerce networks) or scenarios where graph data are paired with rich textual information (e.g., molecules with descriptions). Besides, although LLMs have shown their pure text-based reasoning ability, it is underexplored whether such ability can be generalized to graph scenarios (i.e., graph-based reasoning). In this paper, we provide a systematic review of scenarios and techniques related to large language models on graphs. We first summarize potential scenarios of adopting LLMs on graphs into three categories, namely pure graphs, text-rich graphs, and text-paired graphs. We then discuss detailed techniques for utilizing LLMs on graphs, including LLM as Predictor, LLM as Encoder, and LLM as Aligner, and compare the advantages and disadvantages of different schools of models. Furthermore, we mention the real-world applications of such methods and summarize open-source codes and benchmark datasets. Finally, we conclude with potential future research directions in this fast-growing field. The related source can be found at https://github.com/PeterGriffinJin/Awesome-Language-Model-on-Graphs.

Scaling Laws for Adversarial Attacks on Language Model Activations

  • paper_url: http://arxiv.org/abs/2312.02780
  • repo_url: None
  • paper_authors: Stanislav Fort
  • for: Studies a class of adversarial attacks that target the activations of language models.
  • methods: Manipulates a small subset of model activations to control the exact prediction of subsequent tokens.
  • results: Shows that controlling the activations can determine up to about 1,000 subsequent token predictions, and observes a scaling law in which the maximum number of controllable target tokens depends linearly on the number of controlled activation tokens. The number of input bits needed to control one output bit (attack resistance) is remarkably constant, between roughly 16 and 25, across two orders of magnitude of model size. Attacks on activations are predictably much stronger than attacks on tokens, yet one input bit steered via either route controls a similar number of output bits, supporting the hypothesis that adversarial attacks stem from a dimensionality mismatch between input and output spaces and opening a new attack surface for multi-modal and retrieval models that inject data directly as activations.
    Abstract We explore a class of adversarial attacks targeting the activations of language models. By manipulating a relatively small subset of model activations, $a$, we demonstrate the ability to control the exact prediction of a significant number (in some cases up to 1000) of subsequent tokens $t$. We empirically verify a scaling law where the maximum number of target tokens $t_\mathrm{max}$ predicted depends linearly on the number of tokens $a$ whose activations the attacker controls as $t_\mathrm{max} = \kappa a$. We find that the number of bits of control in the input space needed to control a single bit in the output space (what we call attack resistance $\chi$) is remarkably constant between $\approx 16$ and $\approx 25$ over 2 orders of magnitude of model sizes for different language models. Compared to attacks on tokens, attacks on activations are predictably much stronger, however, we identify a surprising regularity where one bit of input steered either via activations or via tokens is able to exert control over a similar amount of output bits. This gives support for the hypothesis that adversarial attacks are a consequence of dimensionality mismatch between the input and output spaces. A practical implication of the ease of attacking language model activations instead of tokens is for multi-modal and selected retrieval models, where additional data sources are added as activations directly, sidestepping the tokenized input. This opens up a new, broad attack surface. By using language models as a controllable test-bed to study adversarial attacks, we were able to experiment with input-output dimensions that are inaccessible in computer vision, especially where the output dimension dominates.
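A toy illustration of an activation-space attack in this spirit: gradient-descend on the activations at a few controlled positions so that a frozen model assigns high probability to attacker-chosen target tokens. The frozen linear head below stands in for a real language model, so the numbers say nothing about the paper's scaling law.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, hidden = 100, 32
head = torch.nn.Linear(hidden, vocab)       # frozen stand-in for an LM head
for p in head.parameters():
    p.requires_grad_(False)

targets = torch.randint(0, vocab, (8,))     # t attacker-chosen target tokens
acts = torch.zeros(8, hidden, requires_grad=True)  # a controlled activations

opt = torch.optim.Adam([acts], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = F.cross_entropy(head(acts), targets)  # push logits toward targets
    loss.backward()
    opt.step()

hits = (head(acts).argmax(dim=-1) == targets).float().mean().item()
print(f"forced {hits:.0%} of target tokens")     # typically 100% on this toy
```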

Compositional Generalization for Data-to-Text Generation

  • paper_url: http://arxiv.org/abs/2312.02748
  • repo_url: None
  • paper_authors: Xinnuo Xu, Ivan Titov, Mirella Lapata
  • for: Creates a benchmark for assessing how different methods handle unseen combinations of predicates in data-to-text generation (compositional generalization).
  • methods: Proposes a novel model that clusters predicates into groups and generates the text description sentence by sentence, relying on one cluster of predicates at a time.
  • results: The model outperforms T5 baselines across all evaluation metrics, most notably achieving a 31% improvement over T5 on a metric focused on maintaining faithfulness to the input.
    Abstract Data-to-text generation involves transforming structured data, often represented as predicate-argument tuples, into coherent textual descriptions. Despite recent advances, systems still struggle when confronted with unseen combinations of predicates, producing unfaithful descriptions (e.g. hallucinations or omissions). We refer to this issue as compositional generalisation, and it encouraged us to create a benchmark for assessing the performance of different approaches on this specific problem. Furthermore, we propose a novel model that addresses compositional generalization by clustering predicates into groups. Our model generates text in a sentence-by-sentence manner, relying on one cluster of predicates at a time. This approach significantly outperforms T5~baselines across all evaluation metrics.Notably, it achieved a 31% improvement over T5 in terms of a metric focused on maintaining faithfulness to the input.

Towards Measuring Representational Similarity of Large Language Models

  • paper_url: http://arxiv.org/abs/2312.02730
  • repo_url: https://github.com/mklabunde/llm_repsim
  • paper_authors: Max Klabunde, Mehdi Ben Amor, Michael Granitzer, Florian Lemmerich
  • for: Measures the representational similarity of large language models (LLMs) to simplify model selection, detect illegal model reuse, and advance understanding of what makes LLMs perform well.
  • methods: Measures the similarity of representations across a set of LLMs with 7B parameters.
  • results: Finds that some LLMs are substantially different from others, and identifies challenges in using representational similarity measures that call for careful study of similarity scores to avoid false conclusions.
    Abstract Understanding the similarity of the numerous released large language models (LLMs) has many uses, e.g., simplifying model selection, detecting illegal model reuse, and advancing our understanding of what makes LLMs perform well. In this work, we measure the similarity of representations of a set of LLMs with 7B parameters. Our results suggest that some LLMs are substantially different from others. We identify challenges of using representational similarity measures that suggest the need of careful study of similarity scores to avoid false conclusions.
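The abstract does not name the similarity measure used; one common choice for representational similarity is linear centered kernel alignment (CKA), sketched below as an illustrative assumption. X and Y hold the activations of two models on the same inputs.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    x = x - x.mean(axis=0)                      # center each feature
    y = y - y.mean(axis=0)
    hsic = np.linalg.norm(y.T @ x, "fro") ** 2  # cross-covariance energy
    return hsic / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))

rng = np.random.default_rng(4)
x = rng.normal(size=(2000, 64))        # model A activations on 2000 inputs
y = x @ rng.normal(size=(64, 48))      # model B: linear transform of A
z = rng.normal(size=(2000, 48))        # unrelated activations

print(linear_cka(x, y))                # high: same information, different basis
print(linear_cka(x, z))                # near zero: unrelated representations
```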

Prompt Optimization via Adversarial In-Context Learning

  • paper_url: http://arxiv.org/abs/2312.02614
  • repo_url: None
  • paper_authors: Xuan Long Do, Yiran Zhao, Hannah Brown, Yuxi Xie, James Xu Zhao, Nancy F. Chen, Kenji Kawaguchi, Michael Qizhe Xie, Junxian He
  • for: Optimizing prompts for in-context learning (ICL), using one LLM as a generator, another as a discriminator, and a third as a prompt modifier.
  • methods: Follows traditional adversarial learning: the generator tries to produce output realistic enough to fool the discriminator. In each round, given an input prefixed by task instructions and several exemplars, the generator produces an output, and the discriminator classifies the generator's input-output pair as model-generated or real data. Based on the discriminator loss, the prompt modifier proposes candidate edits to the generator and discriminator prompts, and the edits that most improve the adversarial loss are selected.
  • results: adv-ICL significantly improves over state-of-the-art prompt optimization techniques on 11 generation and classification tasks, including summarization, arithmetic reasoning, machine translation, data-to-text generation, and the MMLU and BIG-Bench Hard benchmarks. Because the method uses pre-trained models and updates only prompts rather than model parameters, it is computationally efficient, easy to extend to any LLM and task, and effective in low-resource settings.
    Abstract We propose a new method, Adversarial In-Context Learning (adv-ICL), to optimize prompt for in-context learning (ICL) by employing one LLM as a generator, another as a discriminator, and a third as a prompt modifier. As in traditional adversarial learning, adv-ICL is implemented as a two-player game between the generator and discriminator, where the generator tries to generate realistic enough output to fool the discriminator. In each round, given an input prefixed by task instructions and several exemplars, the generator produces an output. The discriminator is then tasked with classifying the generator input-output pair as model-generated or real data. Based on the discriminator loss, the prompt modifier proposes possible edits to the generator and discriminator prompts, and the edits that most improve the adversarial loss are selected. We show that adv-ICL results in significant improvements over state-of-the-art prompt optimization techniques for both open and closed-source models on 11 generation and classification tasks including summarization, arithmetic reasoning, machine translation, data-to-text generation, and the MMLU and big-bench hard benchmarks. In addition, because our method uses pre-trained models and updates only prompts rather than model parameters, it is computationally efficient, easy to extend to any LLM and task, and effective in low-resource settings.

Text Intimacy Analysis using Ensembles of Multilingual Transformers

  • paper_url: http://arxiv.org/abs/2312.02590
  • repo_url: None
  • paper_authors: Tanmay Chavan, Ved Patwardhan
  • for: Predicting the level of intimacy of a given text, a problem whose importance has grown with the increase in direct interaction between NLP systems and humans.
  • methods: Uses an ensemble of multilingual models together with language-specific monolingual models, and experiments with data augmentation methods such as translation.
  • results: The combination of multilingual ensembles, monolingual models, and data augmentation improves performance; a thorough analysis of the results surfaces some noteworthy insights into the problem.
    Abstract Intimacy estimation of a given text has recently gained importance due to the increase in direct interaction of NLP systems with humans. Intimacy is an important aspect of natural language and has a substantial impact on our everyday communication. Thus the level of intimacy can provide us with deeper insights and richer semantics of conversations. In this paper, we present our work on the SemEval shared task 9 on predicting the level of intimacy for the given text. The dataset consists of tweets in ten languages, out of which only six are available in the training dataset. We conduct several experiments and show that an ensemble of multilingual models along with a language-specific monolingual model has the best performance. We also evaluate other data augmentation methods such as translation and present the results. Lastly, we study the results thoroughly and present some noteworthy insights into this problem.

Empathy and Distress Detection using Ensembles of Transformer Models

  • paper_url: http://arxiv.org/abs/2312.02578
  • repo_url: None
  • paper_authors: Tanmay Chavan, Kshitij Deshpande, Sheetal Sonawane
  • for: Describes the authors' approach to the WASSA 2023 Empathy, Emotion and Personality Shared Task.
  • methods: Experiments with several BERT-based models and various ensemble methods.
  • results: The final submission achieves a Pearson's r score of 0.346, placing third in the empathy and distress detection subtask.
    Abstract This paper presents our approach for the WASSA 2023 Empathy, Emotion and Personality Shared Task. Empathy and distress are human feelings that are implicitly expressed in natural discourses. Empathy and distress detection are crucial challenges in Natural Language Processing that can aid our understanding of conversations. The provided dataset consists of several long-text examples in the English language, with each example associated with a numeric score for empathy and distress. We experiment with several BERT-based models as a part of our approach. We also try various ensemble methods. Our final submission has a Pearson's r score of 0.346, placing us third in the empathy and distress detection subtask.

ULMA: Unified Language Model Alignment with Demonstration and Point-wise Human Preference

  • paper_url: http://arxiv.org/abs/2312.02554
  • repo_url: https://github.com/unified-language-model-alignment/src
  • paper_authors: Tianchi Cai, Xierui Song, Jiyan Jiang, Fei Teng, Jinjie Gu, Guannan Zhang
  • for: Improving the alignment and safety of large language models by aligning model output with user intent.
  • methods: Builds on the two-step alignment framework: supervised fine-tuning on demonstration data, followed by preference learning on human preference data.
  • results: Develops point-wise DPO, a preference learning method for point-wise preference data, and, by revealing the connection between supervised fine-tuning and point-wise preference learning, derives a unified framework for both human demonstration and point-wise preference data. Experiments on point-wise datasets with binary or continuous labels demonstrate the superior performance and efficiency of the proposed methods, and a new dataset with high-quality demonstration samples on harmlessness is released.
    Abstract Language model alignment is a cutting-edge technique in large language model training to align the model output to user's intent, e.g., being helpful and harmless. Recent alignment framework consists of two steps: supervised fine-tuning with demonstration data and preference learning with human preference data. Previous preference learning methods, such as RLHF and DPO, mainly focus on pair-wise preference data. However, in many real-world scenarios where human feedbacks are intrinsically point-wise, these methods will suffer from information loss or even fail. To fill this gap, in this paper, we first develop a preference learning method called point-wise DPO to tackle point-wise preference data. Further revelation on the connection between supervised fine-tuning and point-wise preference learning enables us to develop a unified framework for both human demonstration and point-wise preference data, which sheds new light on the construction of preference dataset. Extensive experiments on point-wise datasets with binary or continuous labels demonstrate the superior performance and efficiency of our proposed methods. A new dataset with high-quality demonstration samples on harmlessness is constructed and made publicly available.

DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding

  • paper_url: http://arxiv.org/abs/2312.02549
  • repo_url: None
  • paper_authors: Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan
  • for: Localizing video moments that semantically correspond to a natural language query (temporal language grounding).
  • methods: Proposes an energy-based model framework that explicitly learns moment-query distributions, together with DemaFormer, a Transformer-based architecture that uses an exponential moving average with a learnable damping factor to effectively encode moment-query inputs.
  • results: Comprehensive experiments on four public temporal language grounding datasets show that the methods significantly outperform state-of-the-art baselines.
    Abstract Temporal Language Grounding seeks to localize video moments that semantically correspond to a natural language query. Recent advances employ the attention mechanism to learn the relations between video moments and the text query. However, naive attention might not be able to appropriately capture such relations, resulting in ineffective distributions where target video moments are difficult to separate from the remaining ones. To resolve the issue, we propose an energy-based model framework to explicitly learn moment-query distributions. Moreover, we propose DemaFormer, a novel Transformer-based architecture that utilizes exponential moving average with a learnable damping factor to effectively encode moment-query inputs. Comprehensive experiments on four public temporal language grounding datasets showcase the superiority of our methods over the state-of-the-art baselines.
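The damped exponential moving average at the heart of DemaFormer can be sketched as a simple recurrence; the exact parameterization is not given in the abstract, so the form h_t = (1 - alpha) * delta * h_{t-1} + alpha * x_t, with learnable rate alpha and damping factor delta in (0, 1), is an illustrative assumption.

```python
import torch

def damped_ema(x: torch.Tensor, alpha: torch.Tensor, delta: torch.Tensor):
    """x: (seq_len, dim); alpha, delta: (dim,), learnable, constrained to (0, 1)."""
    h = torch.zeros(x.shape[-1])
    out = []
    for x_t in x:                          # sequential recurrence for clarity
        h = (1 - alpha) * delta * h + alpha * x_t
        out.append(h)
    return torch.stack(out)

seq = torch.randn(10, 8)                   # e.g., moment-query token features
alpha = torch.sigmoid(torch.randn(8))      # learnable rates in (0, 1)
delta = torch.sigmoid(torch.randn(8))      # learnable damping factors in (0, 1)
print(damped_ema(seq, alpha, delta).shape)  # torch.Size([10, 8])
```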

DRAFT: Dense Retrieval Augmented Few-shot Topic classifier Framework

  • paper_url: http://arxiv.org/abs/2312.02532
  • repo_url: None
  • paper_authors: Keonwoo Kim, Younggun Lee
  • for: Proposes a simple framework, DRAFT, for training a classifier for few-shot topic classification.
  • methods: Uses a few examples of a specific topic as queries for a dense retriever model and applies a Multi-Query Retrieval (MQR) algorithm to construct a Customized dataset, then fine-tunes a classifier on that dataset to identify the topic.
  • results: Evaluations show that DRAFT is competitive with or superior to in-context learning baselines such as GPT-3 175B and InstructGPT 175B on few-shot topic classification, despite having 177 times fewer parameters, demonstrating its effectiveness.
    Abstract With the growing volume of diverse information, the demand for classifying arbitrary topics has become increasingly critical. To address this challenge, we introduce DRAFT, a simple framework designed to train a classifier for few-shot topic classification. DRAFT uses a few examples of a specific topic as queries to construct Customized dataset with a dense retriever model. Multi-query retrieval (MQR) algorithm, which effectively handles multiple queries related to a specific topic, is applied to construct the Customized dataset. Subsequently, we fine-tune a classifier using the Customized dataset to identify the topic. To demonstrate the efficacy of our proposed approach, we conduct evaluations on both widely used classification benchmark datasets and manually constructed datasets with 291 diverse topics, which simulate diverse contents encountered in real-world applications. DRAFT shows competitive or superior performance compared to baselines that use in-context learning, such as GPT-3 175B and InstructGPT 175B, on few-shot topic classification tasks despite having 177 times fewer parameters, demonstrating its effectiveness.

MedDM:LLM-executable clinical guidance tree for clinical decision-making

  • paper_url: http://arxiv.org/abs/2312.02441
  • repo_url: None
  • paper_authors: Binbin Li, Tianxin Meng, Xiaoming Shi, Jie Zhai, Tong Ruan
  • for: This paper aims to address the issue of low specialization in current medical language models (LLMs) and provide a solution for LLMs to participate in clinical diagnosis decision-making.
  • methods: The authors propose a method for constructing a large-scale medical diagnostic decision-making dataset (MedDM) from flowcharts in clinical practice guidelines, and develop an approach for converting these flowcharts into standardized diagnostic decision trees. They also propose a method for reasoning on LLM-executable clinical guidance trees (CGT) and a Patient-LLM multi-turn dialogue framework.
  • results: The authors construct a knowledge base with 1202 decision trees, covering 12 hospital departments and over 500 diseases, using medical literature and flowcharts. They also demonstrate the effectiveness of their approach through experiments using a Patient-LLM multi-turn dialogue framework.
    Abstract Increasing emphasis is being placed on the importance of LLMs participating in clinical diagnosis decision-making. However, current medical LLMs suffer from low specialization: they cannot provide specific medical advice and behave more like a medical Q&A system. Moreover, there is no suitable clinical guidance tree dataset that can be used directly with LLMs. To address this issue, we first propose the LLM-executable clinical guidance tree (CGT), which can be directly used by large language models, and construct a medical diagnostic decision-making dataset (MedDM) from flowcharts in clinical practice guidelines. We propose an approach to screen flowcharts from medical literature, followed by their identification and conversion into standardized diagnostic decision trees. We constructed a knowledge base with 1202 decision trees, drawn from 5000 pieces of medical literature and covering 12 hospital departments, including internal medicine, surgery, and psychiatry, and over 500 diseases. Moreover, we propose a method for reasoning on LLM-executable CGTs and a Patient-LLM multi-turn dialogue framework.

Protein Language Model-Powered 3D Ligand Binding Site Prediction from Protein Sequence

  • paper_url: http://arxiv.org/abs/2312.03016
  • repo_url: None
  • paper_authors: Shuo Zhang, Lei Xie
  • for: predict ligand binding sites on proteins to better understand protein function and discover new drugs.
  • methods: proposes LaMPSite, which needs only a protein sequence and a ligand molecular graph to predict binding sites. Residue-level embeddings and a contact map are retrieved from the pre-trained ESM-2 protein language model, and atom-level ligand embeddings are computed with a graph neural network. A protein-ligand interaction embedding is computed and updated under geometric constraints from the inferred protein contact map and ligand distance map, and a final pooling over this embedding determines which residues belong to the binding site.
  • results: without any 3D structural information, LaMPSite achieves performance competitive with baseline methods that require 3D protein structures, offering new opportunities for drug discovery when structural information is incomplete.
    Abstract Prediction of ligand binding sites of proteins is a fundamental and important task for understanding the function of proteins and screening potential drugs. Most existing methods require experimentally determined protein holo-structures as input. However, such structures can be unavailable for novel or less-studied proteins. To tackle this limitation, we propose LaMPSite, which takes only protein sequences and ligand molecular graphs as input for ligand binding site prediction. The protein sequences are used to retrieve residue-level embeddings and contact maps from the pre-trained ESM-2 protein language model. The ligand molecular graphs are fed into a graph neural network to compute atom-level embeddings. We then compute and update the protein-ligand interaction embedding based on the protein residue-level embeddings and ligand atom-level embeddings, under the geometric constraints in the inferred protein contact map and ligand distance map. A final pooling over the protein-ligand interaction embedding indicates which residues belong to the binding sites. Without any 3D coordinate information of proteins, our proposed model achieves competitive performance compared to baseline methods that require 3D protein structures when predicting binding sites. Given that less than 50% of proteins currently have reliable structural information, LaMPSite will provide new opportunities for drug discovery.
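The final pooling step can be pictured with plain array operations. The sketch below scores residues from random stand-in embeddings; the max-pooling, the contact-map propagation, and the threshold are assumptions, not LaMPSite's exact architecture.

```python
import numpy as np

def binding_site_prediction(residue_emb, atom_emb, contact_map, threshold=0.5):
    # residue_emb: (n_res, d), e.g., from ESM-2; atom_emb: (n_atom, d) from a ligand GNN.
    interaction = residue_emb @ atom_emb.T        # (n_res, n_atom) pairwise scores
    interaction = contact_map @ interaction       # propagate along inferred contacts
    scores = interaction.max(axis=1)              # pool over ligand atoms
    scores = (scores - scores.min()) / (np.ptp(scores) + 1e-8)
    return scores > threshold                     # residues flagged as binding site

rng = np.random.default_rng(1)
pred = binding_site_prediction(rng.normal(size=(120, 64)),   # 120 residues
                               rng.normal(size=(30, 64)),    # 30 ligand atoms
                               np.eye(120))                  # trivial contact map
```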

Efficient Online Data Mixing For Language Model Pre-Training

  • paper_url: http://arxiv.org/abs/2312.02406
  • repo_url: https://github.com/Ufere/Assingment_1
  • paper_authors: Alon Albalak, Liangming Pan, Colin Raffel, William Yang Wang
  • for: proposes an efficient online data mixing method to improve the downstream performance of large language models.
  • methods: uses a multi-armed bandit algorithm to optimize the data mixing proportions during training, adapting to changing training dynamics.
  • results: reaches the final perplexity of the next-best method with 19% fewer training iterations and improves 5-shot MMLU accuracy by 1.9% relative, while adding negligible wall-clock time.
    Abstract The data used to pretrain large language models has a decisive impact on a model's downstream performance, which has led to a large body of work on data selection methods that aim to automatically determine the most suitable data to use for pretraining. Existing data selection methods suffer from slow and computationally expensive processes, a problem amplified by the increasing size of models and of pretraining datasets. Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together and determining sampling probabilities across entire groups. However, data mixing proportions are typically fixed before training and therefore cannot adapt to changing training dynamics. To address these limitations, we develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing. Based on multi-armed bandit algorithms, our online approach optimizes the data mixing proportions during training. Remarkably, our method trains a model that reaches the final perplexity of the next best method with 19% fewer training iterations, and improves performance on the 5-shot MMLU benchmark by 1.9% relative accuracy, while adding negligible wall-clock time during pretraining.
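A minimal sketch of the bandit view of data mixing, using an EXP3-style exponential-weights update with an exploration floor; treating the observed training loss as the reward and all hyperparameters below are assumptions, not ODM's actual reward design.

```python
import numpy as np

class OnlineDataMixer:
    """EXP3-style bandit over data domains: sample a domain, observe a reward
    (here the training loss, so hard domains get sampled more), update weights."""
    def __init__(self, n_domains, lr=0.01, explore=0.1):
        self.w = np.zeros(n_domains)
        self.lr = lr
        self.explore = explore

    def proportions(self):
        z = np.exp(self.w - self.w.max())
        p = z / z.sum()
        k = len(self.w)
        return (1 - self.explore) * p + self.explore / k   # keep every domain alive

    def update(self, domain, reward, p):
        self.w[domain] += self.lr * reward / p[domain]     # importance-weighted

rng = np.random.default_rng(0)
mixer = OnlineDataMixer(n_domains=5)
mean_loss = np.array([2.0, 1.0, 3.0, 0.5, 1.5])            # stand-in per-domain losses
for step in range(2000):
    p = mixer.proportions()
    d = rng.choice(5, p=p)
    loss = mean_loss[d] + 0.1 * rng.normal()               # reward = observed loss
    mixer.update(d, loss, p)
print(np.round(mixer.proportions(), 3))                    # mass shifts to high-loss domains
```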

cs.LG - 2023-12-05

CaloQVAE : Simulating high-energy particle-calorimeter interactions using hybrid quantum-classical generative models

  • paper_url: http://arxiv.org/abs/2312.03179
  • repo_url: None
  • paper_authors: Sehmimul Hoque, Hao Jia, Abhishek Abhishek, Mojde Fadaie, J. Quetzalcoatl Toledo-Marín, Tiago Vale, Roger G. Melko, Maximilian Swiatlowski, Wojciech T. Fedorko
  • for: addresses the computational challenges of Monte Carlo (MC) simulation in the Large Hadron Collider's high-luminosity era.
  • methods: combines recent advances in generative models with quantum annealing to simulate the propagation of high-energy particles through the calorimeter quickly and efficiently.
  • results: demonstrates a fast and efficient MC simulation technique aimed at constraining the statistical uncertainties of simulated datasets below those of the experimental data.
    Abstract The Large Hadron Collider's high luminosity era presents major computational challenges in the analysis of collision events. Large amounts of Monte Carlo (MC) simulation will be required to constrain the statistical uncertainties of the simulated datasets below these of the experimental data. Modelling of high-energy particles propagating through the calorimeter section of the detector is the most computationally intensive MC simulation task. We introduce a technique combining recent advancements in generative models and quantum annealing for fast and efficient simulation of high-energy particle-calorimeter interactions.

Active Learning for Abrupt Shifts Change-point Detection via Derivative-Aware Gaussian Processes

  • paper_url: http://arxiv.org/abs/2312.03176
  • repo_url: None
  • paper_authors: Hao Zhao, Rong Pan
  • for: proposes a method for effectively detecting abrupt shifts (change points) in data, supporting decision-making and resource allocation across domains.
  • methods: Derivative-Aware Change Detection (DACD) uses the derivative process of a Gaussian process (GP) for active learning (AL) to pinpoint change-point locations. Multiple acquisition functions (AFs) balance exploitation and exploration, and the GP derivative mean and variance serve as criteria for sequentially selecting the next sampling point, improving algorithmic efficiency while ensuring accurate results.
  • results: DACD outperforms other active-learning change-point detection approaches across diverse scenarios.
    Abstract Change-point detection (CPD) is crucial for identifying abrupt shifts in data, which influence decision-making and efficient resource allocation across various domains. To address the challenges posed by the costly and time-intensive data acquisition in CPD, we introduce the Derivative-Aware Change Detection (DACD) method. It leverages the derivative process of a Gaussian process (GP) for Active Learning (AL), aiming to pinpoint change-point locations effectively. DACD balances the exploitation and exploration of derivative processes through multiple data acquisition functions (AFs). By utilizing GP derivative mean and variance as criteria, DACD sequentially selects the next sampling data point, thus enhancing algorithmic efficiency and ensuring reliable and accurate results. We investigate the effectiveness of DACD method in diverse scenarios and show it outperforms other active learning change-point detection approaches.
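The derivative-process machinery can be sketched in closed form for an RBF kernel: the posterior mean and variance of the GP's derivative are available analytically, and a change point shows up where the derivative is plausibly large. The UCB-style acquisition below is one generic choice, not necessarily among the paper's AFs.

```python
import numpy as np

def gp_derivative_posterior(X, y, Xs, ell=0.1, sf2=1.0, noise=1e-4):
    # Posterior mean/variance of f'(x*) for a GP with RBF kernel
    # k(x, x') = sf2 * exp(-(x - x')^2 / (2 ell^2)).
    K = sf2 * np.exp(-0.5 * (X[:, None] - X[None, :])**2 / ell**2) + noise * np.eye(len(X))
    diff = Xs[:, None] - X[None, :]
    dKs = -sf2 * diff / ell**2 * np.exp(-0.5 * diff**2 / ell**2)   # d/dx* k(x*, X)
    mean = dKs @ np.linalg.solve(K, y)
    # prior variance of f'(x*) under the RBF kernel is sf2 / ell^2
    var = sf2 / ell**2 - np.einsum('ij,ji->i', dKs, np.linalg.solve(K, dKs.T))
    return mean, np.maximum(var, 0.0)

X = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])          # sampled inputs
y = np.where(X < 0.5, 0.0, 1.0)                       # a step: change point near 0.5
grid = np.linspace(0.0, 1.0, 201)
mu, var = gp_derivative_posterior(X, y, grid)
x_next = grid[np.argmax(np.abs(mu) + 2.0 * np.sqrt(var))]   # UCB-style acquisition
print(x_next)                                         # lands near the abrupt shift
```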

Adaptive spectral graph wavelets for collaborative filtering

  • paper_url: http://arxiv.org/abs/2312.03167
  • repo_url: None
  • paper_authors: Osama Alshareet, A. Ben Hamza
  • for: provide personalized item recommendations to potential users, alleviating the cold-start problem in which new users lack sufficient behavioral data.
  • methods: a spectral graph wavelet collaborative filtering framework that represents users, items, and their interactions as a bipartite graph. An adaptive transfer function based on a power transform stabilizes the variance of graph frequencies in the spectral domain, and a deep recommendation model learns low-dimensional user and item embeddings with spectral graph wavelets in an end-to-end fashion.
  • results: experiments on real-world benchmark datasets show that the proposed model achieves better recommendation performance than strong baseline methods.
    Abstract Collaborative filtering is a popular approach in recommender systems, whose objective is to provide personalized item suggestions to potential users based on their purchase or browsing history. However, personalized recommendations require considerable amount of behavioral data on users, which is usually unavailable for new users, giving rise to the cold-start problem. To help alleviate this challenging problem, we introduce a spectral graph wavelet collaborative filtering framework for implicit feedback data, where users, items and their interactions are represented as a bipartite graph. Specifically, we first propose an adaptive transfer function by leveraging a power transform with the goal of stabilizing the variance of graph frequencies in the spectral domain. Then, we design a deep recommendation model for efficient learning of low-dimensional embeddings of users and items using spectral graph wavelets in an end-to-end fashion. In addition to capturing the graph's local and global structures, our approach yields localization of graph signals in both spatial and spectral domains, and hence not only learns discriminative representations of users and items, but also promotes the recommendation quality. The effectiveness of our proposed model is demonstrated through extensive experiments on real-world benchmark datasets, achieving better recommendation performance compared with strong baseline methods.
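A small sketch of the spectral side of this construction: a power transform applied to the normalized-Laplacian frequencies, followed by a heat-kernel-style wavelet filter. The fixed exponent beta stands in for the paper's adaptive transfer function, and the dense eigendecomposition is only viable for toy graphs.

```python
import numpy as np
import scipy.sparse as sp

def power_transformed_spectrum(adj, beta=0.5):
    # Normalized-Laplacian spectrum with a power transform on the frequencies;
    # the fixed exponent beta is an assumed stand-in for the adaptive transfer function.
    deg = np.asarray(adj.sum(axis=1)).ravel()
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    D = sp.diags(d_inv_sqrt)
    L = sp.eye(adj.shape[0]) - D @ adj @ D
    lam, U = np.linalg.eigh(L.toarray())       # dense eigendecomposition: toy graphs only
    lam = np.clip(lam, 0.0, None)              # guard tiny negative eigenvalues
    return lam ** beta, U

def wavelet_filter(lam_t, U, X, scale=1.0):
    # Heat-kernel-style low-pass spectral filter g(s * lam) = exp(-s * lam).
    return U @ (np.exp(-scale * lam_t)[:, None] * (U.T @ X))

rng = np.random.default_rng(0)
adj = sp.random(60, 60, density=0.1, random_state=0)
adj = ((adj + adj.T) > 0).astype(float)        # symmetric 0/1 user-item adjacency
lam_t, U = power_transformed_spectrum(adj)
emb = wavelet_filter(lam_t, U, rng.normal(size=(60, 16)))   # smoothed node embeddings
```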

Deep Learning for Fast Inference of Mechanistic Models’ Parameters

  • paper_url: http://arxiv.org/abs/2312.03166
  • repo_url: None
  • paper_authors: Maxim Borisyak, Stefan Born, Peter Neubauer, Mariano Nicolas Cruz-Bournazou
  • for: proposes using deep neural networks (NNs) to directly predict the parameters of mechanistic models, making parameter estimation in bioprocess engineering faster and more accurate.
  • methods: a training procedure that combines neural networks with mechanistic models, predicting model parameters directly from experimental observations.
  • results: neural-network estimates are obtained far faster than conventional gradient-based fitting and are measurably better than the fitting procedure alone, improving only slightly with further fitting.
    Abstract Inferring parameters of macro-kinetic growth models, typically represented by Ordinary Differential Equations (ODE), from experimental data is a crucial step in bioprocess engineering. Conventionally, estimates of the parameters are obtained by fitting the mechanistic model to observations. Fitting, however, requires significant computational power. Specifically, during the development of new bioprocesses that use previously unknown organisms or strains, efficient, robust, and computationally cheap methods for parameter estimation are of great value. In this work, we propose using Deep Neural Networks (NN) for directly predicting parameters of mechanistic models given observations. The approach requires spending computational resources on training an NN; nonetheless, once trained, such a network can provide parameter estimates orders of magnitude faster than conventional methods. We consider a training procedure that combines Neural Networks and mechanistic models. We demonstrate the performance of the proposed algorithms on data sampled from several mechanistic models used in bioengineering describing a typical industrial batch process and compare the proposed method, a typical gradient-based fitting procedure, and the combination of the two. We find that, while Neural Network estimates are slightly improved by further fitting, these estimates are measurably better than the fitting procedure alone.
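The amortized-inference idea can be sketched with a toy mechanistic model: simulate trajectories over sampled parameters, then regress parameters from trajectories so that estimation at test time is a single forward pass. The logistic-growth model, the sklearn regressor, and the sampling ranges are all illustrative assumptions, not the paper's setup.

```python
import numpy as np
from scipy.integrate import solve_ivp
from sklearn.neural_network import MLPRegressor

def simulate(theta, t_eval):
    # Toy mechanistic model: logistic growth x' = mu * x * (1 - x / K).
    mu, K = theta
    sol = solve_ivp(lambda t, x: mu * x * (1 - x / K), (0, 10), [0.1], t_eval=t_eval)
    return sol.y[0]

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 50)
thetas = np.column_stack([rng.uniform(0.2, 1.0, 2000),    # growth rate mu
                          rng.uniform(1.0, 5.0, 2000)])   # capacity K
trajs = np.array([simulate(th, t) for th in thetas])
trajs += 0.01 * rng.normal(size=trajs.shape)              # measurement noise

# Amortized inference: one cheap forward pass replaces an expensive fit.
net = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=500).fit(trajs, thetas)
theta_hat = net.predict(simulate((0.5, 2.0), t)[None, :])
print(theta_hat)                                          # close to (0.5, 2.0)
```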

Multitask Learning Can Improve Worst-Group Outcomes

  • paper_url: http://arxiv.org/abs/2312.03151
  • repo_url: https://github.com/atharvajk98/mtl-group-robustness
  • paper_authors: Atharva Kulkarni, Lucio Dery, Amrith Setlur, Aditi Raghunathan, Ameet Talwalkar, Graham Neubig
  • for: investigates the impact of multitask learning (MTL) on worst-group accuracy and explores MTL's potential as a tool for group-wise fairness.
  • methods: fine-tunes pre-trained models while multitasking the end task with a pre-training objective constructed from the end-task data itself. In the absence of group annotations, multitasking often achieves better worst-group accuracy than Just-Train-Twice (JTT). The authors further modify standard MTL by regularizing the joint multitask representation space.
  • results: across a large number of fine-tuning experiments in computer vision and natural language, the regularized MTL approach consistently outperforms JTT on both worst-group and average-group outcomes. Code is available at https://github.com/atharvajk98/MTL-group-robustness.
    Abstract In order to create machine learning systems that serve a variety of users well, it is vital to not only achieve high average performance but also ensure equitable outcomes across diverse groups. However, most machine learning methods are designed to improve a model's average performance on a chosen end task without consideration for their impact on worst group error. Multitask learning (MTL) is one such widely used technique. In this paper, we seek not only to understand the impact of MTL on worst-group accuracy but also to explore its potential as a tool to address the challenge of group-wise fairness. We primarily consider the common setting of fine-tuning a pre-trained model, where, following recent work (Gururangan et al., 2020; Dery et al., 2023), we multitask the end task with the pre-training objective constructed from the end task data itself. In settings with few or no group annotations, we find that multitasking often, but not always, achieves better worst-group accuracy than Just-Train-Twice (JTT; Liu et al. (2021)) -- a representative distributionally robust optimization (DRO) method. Leveraging insights from synthetic data experiments, we propose to modify standard MTL by regularizing the joint multitask representation space. We run a large number of fine-tuning experiments across computer vision and natural language and find that our regularized MTL approach consistently outperforms JTT on both worst and average group outcomes. Our official code can be found here: https://github.com/atharvajk98/MTL-group-robustness.

Neural parameter calibration and uncertainty quantification for epidemic forecasting

  • paper_url: http://arxiv.org/abs/2312.03147
  • repo_url: None
  • paper_authors: Thomas Gaskin, Tim Conrad, Grigorios A. Pavliotis, Christof Schütte
  • for: accurately forecast contagion dynamics and provide uncertainty quantification for pandemic projections.
  • methods: a novel computational method that combines a neural network with an ODE model to learn probability densities on contagion parameters and provide uncertainty quantification.
  • results: significantly more accurate calibration and prediction than Markov-Chain Monte Carlo (MCMC)-based sampling schemes, with meaningful confidence intervals on infection figures and hospitalisation rates. The method is shown to converge to the true posterior on a simplified SIR model of epidemics and can learn complex models from a small number of compartments.
    Abstract The recent COVID-19 pandemic has thrown the importance of accurately forecasting contagion dynamics and learning infection parameters into sharp focus. At the same time, effective policy-making requires knowledge of the uncertainty on such predictions, in order, for instance, to be able to ready hospitals and intensive care units for a worst-case scenario without needlessly wasting resources. In this work, we apply a novel and powerful computational method to the problem of learning probability densities on contagion parameters and providing uncertainty quantification for pandemic projections. Using a neural network, we calibrate an ODE model to data of the spread of COVID-19 in Berlin in 2020, achieving both a significantly more accurate calibration and prediction than Markov-Chain Monte Carlo (MCMC)-based sampling schemes. The uncertainties on our predictions provide meaningful confidence intervals e.g. on infection figures and hospitalisation rates, while training and running the neural scheme takes minutes where MCMC takes hours. We show convergence of our method to the true posterior on a simplified SIR model of epidemics, and also demonstrate our method's learning capabilities on a reduced dataset, where a complex model is learned from a small number of compartments for which data is available.
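Once a density over contagion parameters is available, uncertainty bands follow by pushing parameter samples through the ODE. The sketch below uses a toy SIR model and Gaussian parameter samples as a stand-in for the learned posterior; all numbers are illustrative.

```python
import numpy as np
from scipy.integrate import solve_ivp

def sir(t, y, beta, gamma):
    S, I, R = y
    return [-beta * S * I, beta * S * I - gamma * I, gamma * I]

# Stand-in for a learned probability density over contagion parameters:
# posterior *samples* of (beta, gamma) instead of point estimates.
rng = np.random.default_rng(0)
samples = np.column_stack([rng.normal(0.30, 0.03, 500),   # transmission rate
                           rng.normal(0.10, 0.01, 500)])  # recovery rate

t = np.linspace(0, 160, 161)
curves = np.array([
    solve_ivp(sir, (0, 160), [0.99, 0.01, 0.0], t_eval=t, args=(b, g)).y[1]
    for b, g in samples
])
lo, med, hi = np.percentile(curves, [5, 50, 95], axis=0)  # band on infection figures
```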

A Hardware Evaluation Framework for Large Language Model Inference

  • paper_url: http://arxiv.org/abs/2312.03134
  • repo_url: None
  • paper_authors: Hengrui Zhang, August Ning, Rohan Prabhakar, David Wentzlaff
  • for: LLMCompass is a hardware evaluation framework for Large Language Model (LLM) inference workloads, aiming to evaluate different hardware designs and optimize their performance.
  • methods: LLMCompass includes a mapper to automatically find performance-optimal mapping and scheduling, as well as an area-based cost model to help architects reason about their design choices.
  • results: compared to real-world hardware, LLMCompass' estimated latency achieves an average 10.4% error rate across various operators with various input sizes and an average 4.1% error rate for LLM inference. With LLMCompass, simulating a 4-NVIDIA A100 GPU node running GPT-3 175B inference can be done within 16 minutes on commodity hardware, including 26,400 rounds of the mapper's parameter search. The framework also explores new cost-effective hardware designs that can achieve as much as a 3.41x improvement in performance/cost compared to an NVIDIA A100.
    Abstract The past year has witnessed the increasing popularity of Large Language Models (LLMs). Their unprecedented scale and associated high hardware cost have impeded their broader adoption, calling for efficient hardware designs. With the large hardware needed to simply run LLM inference, evaluating different hardware designs becomes a new bottleneck. This work introduces LLMCompass, a hardware evaluation framework for LLM inference workloads. LLMCompass is fast, accurate, versatile, and able to describe and evaluate different hardware designs. LLMCompass includes a mapper to automatically find performance-optimal mapping and scheduling. It also incorporates an area-based cost model to help architects reason about their design choices. Compared to real-world hardware, LLMCompass' estimated latency achieves an average 10.4% error rate across various operators with various input sizes and an average 4.1% error rate for LLM inference. With LLMCompass, simulating a 4-NVIDIA A100 GPU node running GPT-3 175B inference can be done within 16 minutes on commodity hardware, including 26,400 rounds of the mapper's parameter search. With the aid of LLMCompass, this work draws architectural implications and explores new cost-effective hardware designs. By reducing the compute capability or replacing High Bandwidth Memory (HBM) with traditional DRAM, these new designs can achieve as much as 3.41x improvement in performance/cost compared to an NVIDIA A100, making them promising choices for democratizing LLMs. LLMCompass is planned to be fully open-source.
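A drastically simplified flavor of such a cost model is the roofline bound below: a GEMM is either compute-bound or bandwidth-bound, and the larger time wins. The peak numbers are an A100 40GB's published FP16 tensor-core throughput and HBM bandwidth; LLMCompass's actual model is far more detailed (mapping, scheduling, area).

```python
def matmul_latency(m, n, k, bytes_per_elem=2,
                   peak_tflops=312.0, hbm_gbs=1555.0):
    # Roofline-style lower bound for an (m,k) x (k,n) GEMM.
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    t_compute = flops / (peak_tflops * 1e12)
    t_memory = bytes_moved / (hbm_gbs * 1e9)
    return max(t_compute, t_memory)            # seconds

# Decode-step GEMV of a GPT-3-scale layer is memory-bound; prefill is compute-bound.
print(matmul_latency(1, 12288, 12288))         # batch 1: bandwidth dominates
print(matmul_latency(2048, 12288, 12288))      # large batch: compute dominates
```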

Advantage of Quantum Machine Learning from General Computational Advantages

  • paper_url: http://arxiv.org/abs/2312.03057
  • repo_url: None
  • paper_authors: Hayata Yamasaki, Natsuto Isogai, Mio Murao
  • for: demonstrate the advantage of quantum machine learning (QML) in supervised learning with classical data, extending provable advantages beyond learning tasks built around specific quantum algorithms such as Shor's.
  • methods: explicitly constructs a broader family of supervised learning tasks whose provable QML advantage rests on general quantum computational advantages: the target functions can be computed efficiently by quantum algorithms for a large fraction of inputs but by no classical algorithm.
  • results: proves the hardness of the learning task for any polynomial-time classical learning method, thereby establishing the QML advantage, and clarifies protocols for preparing the classical data needed to demonstrate the task experimentally.
    Abstract An overarching milestone of quantum machine learning (QML) is to demonstrate the advantage of QML over all possible classical learning methods in accelerating a common type of learning task as represented by supervised learning with classical data. However, the provable advantages of QML in supervised learning have been known so far only for the learning tasks designed for using the advantage of specific quantum algorithms, i.e., Shor's algorithms. Here we explicitly construct an unprecedentedly broader family of supervised learning tasks with classical data to offer the provable advantage of QML based on general quantum computational advantages, progressing beyond Shor's algorithms. Our learning task is feasibly achievable by executing a general class of functions that can be computed efficiently in polynomial time for a large fraction of inputs by arbitrary quantum algorithms but not by any classical algorithm. We prove the hardness of achieving this learning task for any possible polynomial-time classical learning method. We also clarify protocols for preparing the classical data to demonstrate this learning task in experiments. These results open routes to exploit a variety of quantum advantages in computing functions for the experimental demonstration of the advantage of QML.

Learning High-Dimensional Differential Graphs From Multi-Attribute Data

  • paper_url: http://arxiv.org/abs/2312.03761
  • repo_url: None
  • paper_authors: Jitendra K Tugnait
  • for: estimate the difference between two Gaussian graphical models (GGMs) that are known to have similar structure.
  • methods: learns the differential graph from multi-attribute data with a group-lasso-penalized D-trace loss function, optimized via an alternating direction method of multipliers (ADMM) algorithm.
  • results: theoretical analysis establishes consistency in support recovery and estimation in high-dimensional settings, and numerical results on synthetic and real data demonstrate good performance.
    Abstract We consider the problem of estimating differences in two Gaussian graphical models (GGMs) which are known to have similar structure. The GGM structure is encoded in its precision (inverse covariance) matrix. In many applications one is interested in estimating the difference in two precision matrices to characterize underlying changes in conditional dependencies of two sets of data. Existing methods for differential graph estimation are based on single-attribute (SA) models where one associates a scalar random variable with each node. In multi-attribute (MA) graphical models, each node represents a random vector. In this paper, we analyze a group lasso penalized D-trace loss function approach for differential graph learning from multi-attribute data. An alternating direction method of multipliers (ADMM) algorithm is presented to optimize the objective function. Theoretical analysis establishing consistency in support recovery and estimation in high-dimensional settings is provided. Numerical results based on synthetic as well as real data are presented.

Detecting algorithmic bias in medical AI-models

  • paper_url: http://arxiv.org/abs/2312.02959
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Jeffrey Smith, Andre Holder, Rishikesan Kamaleswaran, Yao Xie
  • for: ensure that machine learning and AI-based medical decision support systems deliver fair and equitable patient outcomes.
  • methods: an innovative framework for detecting algorithmic bias in medical-AI decision support systems, built on the Classification and Regression Trees (CART) algorithm and validated on synthetic data and real electronic medical records, in the context of sepsis prediction.
  • results: synthetic-data experiments show the method precisely estimates areas of bias in controlled settings, and experiments on electronic medical records from Grady Memorial Hospital in Atlanta, Georgia demonstrate its practical use as a fairness-auditing tool in a clinical environment.
    Abstract With the growing prevalence of machine learning and artificial intelligence-based medical decision support systems, it is equally important to ensure that these systems provide patient outcomes in a fair and equitable fashion. This paper presents an innovative framework for detecting areas of algorithmic bias in medical-AI decision support systems. Our approach efficiently identifies potential biases in medical-AI models, specifically in the context of sepsis prediction, by employing the Classification and Regression Trees (CART) algorithm. We verify our methodology by conducting a series of synthetic data experiments, showcasing its ability to estimate areas of bias in controlled settings precisely. The effectiveness of the concept is further validated by experiments using electronic medical records from Grady Memorial Hospital in Atlanta, Georgia. These tests demonstrate the practical implementation of our strategy in a clinical environment, where it can function as a vital instrument for guaranteeing fairness and equity in AI-based medical decisions.
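The core auditing idea, fitting a CART to the model's error indicator so that high-error leaves describe interpretable subgroups, can be sketched as below; the injected age-dependent bias and the sklearn tree are illustrative, not the paper's exact procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def find_bias_regions(X, y_true, y_pred, feature_names, max_depth=3):
    # Fit CART on the error indicator; leaves with high error rates describe
    # interpretable subgroups where the medical-AI model underperforms.
    errors = (y_true != y_pred).astype(int)
    tree = DecisionTreeClassifier(max_depth=max_depth, min_samples_leaf=50)
    tree.fit(X, errors)
    print(export_text(tree, feature_names=feature_names))
    return tree

rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(18, 90, 5000),      # age
                     rng.integers(0, 2, 5000)])       # sex
y_true = rng.integers(0, 2, 5000)
y_pred = y_true.copy()
flip = (X[:, 0] > 70) & (rng.random(5000) < 0.4)      # inject a biased region
y_pred[flip] = 1 - y_pred[flip]
find_bias_regions(X, y_true, y_pred, ["age", "sex"])  # tree splits near age 70
```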

Attention-enhanced neural differential equations for physics-informed deep learning of ion transport

  • paper_url: http://arxiv.org/abs/2312.02871
  • repo_url: None
  • paper_authors: Danyal Rehman, John H. Lienhard
  • for: modeling ion transport across nanoporous membranes.
  • methods: a machine learning approach built on attention-enhanced neural differential equations that incorporate electroneutrality-based inductive biases to improve generalization over conventional PDE-based methods.
  • results: physics-informed deep learning solutions can outperform their classical PDE-based counterparts and provide promising avenues for modeling complex transport phenomena.
    Abstract Species transport models typically combine partial differential equations (PDEs) with relations from hindered transport theory to quantify electromigrative, convective, and diffusive transport through complex nanoporous systems; however, these formulations are frequently substantial simplifications of the governing dynamics, leading to the poor generalization performance of PDE-based models. Given the growing interest in deep learning methods for the physical sciences, we develop a machine learning-based approach to characterize ion transport across nanoporous membranes. Our proposed framework centers around attention-enhanced neural differential equations that incorporate electroneutrality-based inductive biases to improve generalization performance relative to conventional PDE-based methods. In addition, we study the role of the attention mechanism in illuminating physically-meaningful ion-pairing relationships across diverse mixture compositions. Further, we investigate the importance of pre-training on simulated data from PDE-based models, as well as the performance benefits from hard vs. soft inductive biases. Our results indicate that physics-informed deep learning solutions can outperform their classical PDE-based counterparts and provide promising avenues for modelling complex transport phenomena across diverse applications.

REST: Enhancing Group Robustness in DNNs through Reweighted Sparse Training

  • paper_url: http://arxiv.org/abs/2312.03044
  • repo_url: https://github.com/zhao1402072392/rest
  • paper_authors: Jiaxu Zhao, Lu Yin, Shiwei Liu, Meng Fang, Mykola Pechenizkiy
  • for: improve the performance of deep neural networks (DNNs) on biased data, particularly for minority groups, while improving computation and memory efficiency.
  • methods: a reweighted sparse training framework (REST) that explores unbiased subnetworks and reduces the model's reliance on spuriously correlated, bias-aligned features.
  • results: experiments on three datasets show that REST reduces reliance on spurious correlations, improving performance across a wider range of data groups with fewer training and inference resources. Code is released at https://github.com/zhao1402072392/REST.
    Abstract The deep neural network (DNN) has been proven effective in various domains. However, DNNs often struggle to perform well on certain minority groups during inference, despite showing strong performance on the majority of data groups. This is because over-parameterized models learn bias attributes from a large number of bias-aligned training samples. These bias attributes are strongly spuriously correlated with the target variable, causing the models to be biased towards spurious correlations (i.e., bias-conflicting). To tackle this issue, we propose a novel reweighted sparse training framework, dubbed REST, which aims to enhance performance on biased data while improving computation and memory efficiency. Our proposed REST framework has been experimentally validated on three datasets, demonstrating its effectiveness in exploring unbiased subnetworks. We find that REST reduces the reliance on spuriously correlated features, leading to better performance across a wider range of data groups with fewer training and inference resources. The REST framework represents a promising approach for improving the performance of DNNs on biased data while simultaneously improving computation and memory efficiency. By reducing the reliance on spurious correlations, REST has the potential to enhance the robustness of DNNs and improve their generalization capabilities. Code is released at https://github.com/zhao1402072392/REST

Semi-Supervised Health Index Monitoring with Feature Generation and Fusion

  • paper_url: http://arxiv.org/abs/2312.02867
  • repo_url: None
  • paper_authors: Gaëtan Frusque, Ismail Nejjar, Majid Nabavi, Olga Fink
  • for: provide reliable and cost-effective Health Index (HI) estimation for anomaly detection and remaining-useful-life prediction in systems demanding high safety and reliability.
  • methods: adapts Deep Semi-supervised Anomaly Detection (DeepSAD) for HI construction, using the DeepSAD embedding as a condition indicator to address interpretability challenges and sensitivity to system-specific factors. A diversity loss enriches the condition indicators, and an alternating projection algorithm with isotonic constraints transforms the embedding into a normalized HI with an increasing trend.
  • results: validation on the PHME 2010 milling dataset, a recognized benchmark with ground-truth HIs, yields meaningful HI estimates; the method is then applied to monitoring the wear state of thermal spray coatings from high-frequency voltage measurements.
    Abstract The Health Index (HI) is crucial for evaluating system health, aiding tasks like anomaly detection and predicting remaining useful life for systems demanding high safety and reliability. Tight monitoring is crucial for achieving high precision at a lower cost, with applications such as spray coating. Obtaining HI labels in real-world applications is often cost-prohibitive, requiring continuous, precise health measurements. Therefore, it is more convenient to leverage run-to failure datasets that may provide potential indications of machine wear condition, making it necessary to apply semi-supervised tools for HI construction. In this study, we adapt the Deep Semi-supervised Anomaly Detection (DeepSAD) method for HI construction. We use the DeepSAD embedding as a condition indicators to address interpretability challenges and sensitivity to system-specific factors. Then, we introduce a diversity loss to enrich condition indicators. We employ an alternating projection algorithm with isotonic constraints to transform the DeepSAD embedding into a normalized HI with an increasing trend. Validation on the PHME 2010 milling dataset, a recognized benchmark with ground truth HIs demonstrates meaningful HIs estimations. Our methodology is then applied to monitor wear states of thermal spray coatings using high-frequency voltage. Our contributions create opportunities for more accessible and reliable HI estimation, particularly in cases where obtaining ground truth HI labels is unfeasible.
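The isotonic-constraint step can be pictured with a one-liner: project a noisy condition-indicator series onto the closest monotonically increasing sequence, then normalize. This simple stand-in omits the paper's alternating projections and diversity loss; the wear signal below is synthetic.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def normalized_health_index(scores):
    # Project raw condition-indicator scores (e.g., derived from DeepSAD
    # embeddings over time) onto an increasing, [0, 1]-normalized health index.
    t = np.arange(len(scores))
    hi = IsotonicRegression(increasing=True).fit_transform(t, scores)
    return (hi - hi.min()) / (hi.max() - hi.min() + 1e-12)

rng = np.random.default_rng(0)
raw = np.linspace(0, 1, 200) ** 2 + 0.1 * rng.normal(size=200)   # noisy wear signal
hi = normalized_health_index(raw)        # monotone HI suitable for thresholding
```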

Lessons from Usable ML Deployments and Application to Wind Turbine Monitoring

  • paper_url: http://arxiv.org/abs/2312.02859
  • repo_url: None
  • paper_authors: Alexandra Zytek, Wei-En Wang, Sofia Koukoura, Kalyan Veeramachaneni
  • for: shares lessons from deploying usable machine learning (usable ML) in real-world domains.
  • methods: relies on "bridges", people who connect ML developers with domain experts, to develop usable ML applications, together with a configurable system that enables easy iteration on usable ML interfaces during collaborations with bridges.
  • results: applying these lessons to wind turbine monitoring, where engineers and data analysts must decide whether to perform costly in-person investigations to prevent potential brakepad failures, demonstrates the potential real-world impact of usable ML in the renewable energy domain.
    Abstract Through past experiences deploying what we call usable ML (one step beyond explainable ML, including both explanations and other augmenting information) to real-world domains, we have learned three key lessons. First, many organizations are beginning to hire people who we call "bridges" because they bridge the gap between ML developers and domain experts, and these people fill a valuable role in developing usable ML applications. Second, a configurable system that enables easily iterating on usable ML interfaces during collaborations with bridges is key. Finally, there is a need for continuous, in-deployment evaluations to quantify the real-world impact of usable ML. Throughout this paper, we apply these lessons to the task of wind turbine monitoring, an essential task in the renewable energy domain. Turbine engineers and data analysts must decide whether to perform costly in-person investigations on turbines to prevent potential cases of brakepad failure, and well-tuned usable ML interfaces can aid with this decision-making process. Through the applications of our lessons to this task, we hope to demonstrate the potential real-world impact of usable ML in the renewable energy domain.

Expert-guided Bayesian Optimisation for Human-in-the-loop Experimental Design of Known Systems

  • paper_url: http://arxiv.org/abs/2312.02852
  • repo_url: https://github.com/trsav/hitl-bo
  • paper_authors: Tom Savage, Ehecatl Antonio del Rio Chanona
  • for: enable domain experts to influence the selection of optimal experiments by combining high-throughput (batch) Bayesian optimisation with anthropological decision theory.
  • methods: exploits the hypothesis that humans are better at making discrete choices than continuous ones and lets experts shape critical early decisions. At each iteration, an augmented multi-objective optimisation problem is solved across alternate solutions, maximising both the sum of their utility function values and the determinant of their covariance matrix (their total variability); taking the solution at the knee point of the Pareto front yields a set of high-utility, reasonably distinct alternatives from which the expert selects one for evaluation.
  • results: even with an uninformed practitioner, the algorithm recovers the regret of standard Bayesian optimisation.
    Abstract Domain experts often possess valuable physical insights that are overlooked in fully automated decision-making processes such as Bayesian optimisation. In this article we apply high-throughput (batch) Bayesian optimisation alongside anthropological decision theory to enable domain experts to influence the selection of optimal experiments. Our methodology exploits the hypothesis that humans are better at making discrete choices than continuous ones and enables experts to influence critical early decisions. At each iteration we solve an augmented multi-objective optimisation problem across a number of alternate solutions, maximising both the sum of their utility function values and the determinant of their covariance matrix, equivalent to their total variability. By taking the solution at the knee point of the Pareto front, we return a set of alternate solutions at each iteration that have both high utility values and are reasonably distinct, from which the expert selects one for evaluation. We demonstrate that even in the case of an uninformed practitioner, our algorithm recovers the regret of standard Bayesian optimisation.
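The knee-point selection is easy to sketch: compute the Pareto front over candidate batches scored by (summed utility, total variability), then take the front point farthest from the line joining its extremes. The random candidate scores below are placeholders for the paper's actual objective values.

```python
import numpy as np

def pareto_front(points):
    # Indices of non-dominated points when maximizing both objectives.
    keep = []
    for i, p in enumerate(points):
        dominated = any((q >= p).all() and (q > p).any()
                        for j, q in enumerate(points) if j != i)
        if not dominated:
            keep.append(i)
    return np.array(keep)

def knee_point(points):
    # Knee of a 2-D Pareto front: the front point farthest from the straight
    # line joining the front's two extremes.
    front = points[pareto_front(points)]
    front = front[np.argsort(front[:, 0])]
    a, b = front[0], front[-1]
    d = np.abs((b[0] - a[0]) * (front[:, 1] - a[1])
               - (b[1] - a[1]) * (front[:, 0] - a[0]))
    d = d / (np.linalg.norm(b - a) + 1e-12)
    return front[np.argmax(d)]

# columns: summed utility of a candidate batch, log-det of its covariance
rng = np.random.default_rng(0)
candidates = rng.random((50, 2))
print(knee_point(candidates))       # the trade-off solution shown to the expert
```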

A Kernel-Based Neural Network Test for High-dimensional Sequencing Data Analysis

  • paper_url: http://arxiv.org/abs/2312.02850
  • repo_url: None
  • paper_authors: Tingting Hou, Chang Jiang, Qing Lu
  • for: explore the use of artificial intelligence, and deep neural networks in particular, for high-dimensional sequencing data analysis.
  • methods: a kernel-based neural network (KNN) test that uses random effects to model the overall effects of high-dimensional genetic data and kernel-based neural network structures to model complex genotype-phenotype relationships; a Wald-type test built on KNN evaluates the joint association with a disease phenotype, accommodating non-linear and non-additive (e.g., interaction) effects.
  • results: simulations show higher power than the sequence kernel association test (SKAT), especially in the presence of non-linear and interaction effects; applied to whole genome sequencing data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study, the method identifies new genes associated with hippocampal volume change over time.
    Abstract The recent development of artificial intelligence (AI) technology, especially the advance of deep neural network (DNN) technology, has revolutionized many fields. While DNN plays a central role in modern AI technology, it has been rarely used in sequencing data analysis due to challenges brought by high-dimensional sequencing data (e.g., overfitting). Moreover, due to the complexity of neural networks and their unknown limiting distributions, building association tests on neural networks for genetic association analysis remains a great challenge. To address these challenges and fill the important gap of using AI in high-dimensional sequencing data analysis, we introduce a new kernel-based neural network (KNN) test for complex association analysis of sequencing data. The test is built on our previously developed KNN framework, which uses random effects to model the overall effects of high-dimensional genetic data and adopts kernel-based neural network structures to model complex genotype-phenotype relationships. Based on KNN, a Wald-type test is then introduced to evaluate the joint association of high-dimensional genetic data with a disease phenotype of interest, considering non-linear and non-additive effects (e.g., interaction effects). Through simulations, we demonstrated that our proposed method attained higher power compared to the sequence kernel association test (SKAT), especially in the presence of non-linear and interaction effects. Finally, we apply the methods to the whole genome sequencing (WGS) dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study, investigating new genes associated with the hippocampal volume change over time.

Algorithms for mean-field variational inference via polyhedral optimization in the Wasserstein space

  • paper_url: http://arxiv.org/abs/2312.02849
  • repo_url: None
  • paper_authors: Yiheng Jiang, Sinho Chewi, Aram-Alexandre Pooladian
  • for: optimizing functionals over finite-dimensional polyhedral subsets of the Wasserstein space, with a main application to mean-field variational inference.
  • methods: first-order methods for optimization over these polyhedral subsets, with approximation rates and an algorithm for minimizing the KL divergence over such sets.
  • results: accelerated convergence with complexity $O(\sqrt \kappa \log(\kappa d/\varepsilon^2))$, where $\kappa$ is the condition number of the distribution being optimized.
    Abstract We develop a theory of finite-dimensional polyhedral subsets over the Wasserstein space and optimization of functionals over them via first-order methods. Our main application is to the problem of mean-field variational inference, which seeks to approximate a distribution $\pi$ over $\mathbb{R}^d$ by a product measure $\pi^\star$. When $\pi$ is strongly log-concave and log-smooth, we provide (1) approximation rates certifying that $\pi^\star$ is close to the minimizer $\pi^\star_\diamond$ of the KL divergence over a polyhedral set $\mathcal{P}_\diamond$, and (2) an algorithm for minimizing $\text{KL}(\cdot\|\pi)$ over $\mathcal{P}_\diamond$ with accelerated complexity $O(\sqrt \kappa \log(\kappa d/\varepsilon^2))$, where $\kappa$ is the condition number of $\pi$.

Transformer-Based Deep Learning Model for Bored Pile Load-Deformation Prediction in Bangkok Subsoil

  • paper_url: http://arxiv.org/abs/2312.03041
  • repo_url: None
  • paper_authors: Sompote Youwai, Chissanupong Thongnoo
  • for: predict the load-deformation behavior of large bored piles in Bangkok subsoil.
  • methods: a transformer-based deep learning model that encodes the soil profile and pile features as token inputs, generates the load-deformation curve as output, and incorporates the preceding sequential load-deformation data into the decoder to improve prediction accuracy.
  • results: the model shows satisfactory accuracy and generalization on test data, with a mean absolute error of 5.72%, and can support parametric analysis and design optimization of piles across soil conditions, pile cross sections, pile lengths, and pile types.
    Abstract This paper presents a novel deep learning model based on the transformer architecture to predict the load-deformation behavior of large bored piles in Bangkok subsoil. The model encodes the soil profile and pile features as tokenized input and generates the load-deformation curve as output, incorporating the preceding sequential load-deformation data into the decoder to improve prediction accuracy. The model shows satisfactory accuracy and generalization for load-deformation curve prediction, with a mean absolute error of 5.72% on the test data. It can also be used for parametric analysis and design optimization of piles under different soil and pile conditions, pile cross sections, pile lengths, and pile types.

Convergence Rates for Stochastic Approximation: Biased Noise with Unbounded Variance, and Applications

  • paper_url: http://arxiv.org/abs/2312.02828
  • repo_url: None
  • paper_authors: Rajeeva L. Karandikar, M. Vidyasagar
  • for: analyzes the performance of the Stochastic Approximation (SA) algorithm in applications including nonconvex optimization and reinforcement learning (RL).
  • methods: extends SA theory to cover errors with nonzero conditional mean and/or unbounded conditional variance, as well as asynchronous SA.
  • results: derives estimates of the algorithm's rate of convergence, computes the "optimal step size sequences" that maximize the estimated rate, and proves that SA converges in nonconvex optimization and Markovian SA settings.
    Abstract The Stochastic Approximation (SA) algorithm introduced by Robbins and Monro in 1951 has been a standard method for solving equations of the form $\mathbf{f}(\boldsymbol{\theta}) = \mathbf{0}$, when only noisy measurements of $\mathbf{f}(\cdot)$ are available. If $\mathbf{f}(\boldsymbol{\theta}) = \nabla J(\boldsymbol{\theta})$ for some function $J(\cdot)$, then SA can also be used to find a stationary point of $J(\cdot)$. In much of the literature, it is assumed that the error term $\boldsymbol{\xi}_{t+1}$ has zero conditional mean, and that its conditional variance is bounded as a function of $t$ (though not necessarily with respect to $\boldsymbol{\theta}_t$). Also, for the most part, the emphasis has been on "synchronous" SA, whereby, at each time $t$, every component of $\boldsymbol{\theta}_t$ is updated. Over the years, SA has been applied to a variety of areas, out of which two are the focus in this paper: convex and nonconvex optimization, and Reinforcement Learning (RL). As it turns out, in these applications, the above-mentioned assumptions do not always hold. In zero-order methods, the error neither has zero mean nor bounded conditional variance. In the present paper, we extend SA theory to encompass errors with nonzero conditional mean and/or unbounded conditional variance, and also asynchronous SA. In addition, we derive estimates for the rate of convergence of the algorithm. Then we apply the new results to problems in nonconvex optimization, and to Markovian SA, a recently emerging area in RL. We prove that SA converges in these situations, and compute the "optimal step size sequences" to maximize the estimated rate of convergence.
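For intuition, a classic Robbins-Monro iteration under the relaxed noise conditions looks like the sketch below; the specific bias and variance growth rates are assumptions chosen so that convergence remains plausible, not the paper's exact conditions.

```python
import numpy as np

# Robbins-Monro SA for f(theta) = 0 with f(theta) = grad J(theta) and
# J(theta) = 0.5 * ||theta||^2, so the root is theta* = 0. The measurement
# noise has a decaying-but-nonzero conditional mean (bias) and a conditional
# variance that grows without bound -- illustrative rates only.
rng = np.random.default_rng(0)
theta = np.array([5.0, -3.0])
for t in range(1, 200_001):
    a_t = 1.0 / t                                    # sum a_t = inf, sum a_t^2 < inf
    bias = 0.1 / np.sqrt(t)                          # nonzero conditional mean
    noise = rng.normal(0.0, 0.05 * t**0.15, size=2)  # unbounded conditional variance
    theta = theta - a_t * (theta + bias + noise)
print(theta)                                         # close to the root (0, 0)
```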

Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions: Applications to Product-Form Stochastic Networks and Queueing Systems

  • paper_url: http://arxiv.org/abs/2312.02804
  • repo_url: None
  • paper_authors: Céline Comte, Matthieu Jonckheere, Jaron Sanders, Albert Senen-Cerda
  • for: address Markov decision processes (MDPs) with large state and action spaces and nonconvex objective functions, which hinder the convergence of many reinforcement learning (RL) algorithms.
  • methods: introduces score-aware gradient estimators (SAGEs), which estimate the policy gradient without value-function estimation whenever the stationary distribution of the MDP belongs to an exponential family parametrized by the policy parameters.
  • results: on two common control problems from stochastic networks and queueing systems, a SAGE-based policy-gradient method finds close-to-optimal policies more rapidly than an actor-critic algorithm; under local assumptions (a nondegenerate Hessian and a Lyapunov function near a maximizer), the policy converges to an optimal policy with high probability when started sufficiently close to it, even with a nonconvex objective and multiple maximizers.
    Abstract Stochastic networks and queueing systems often lead to Markov decision processes (MDPs) with large state and action spaces as well as nonconvex objective functions, which hinders the convergence of many reinforcement learning (RL) algorithms. Policy-gradient methods perform well on MDPs with large state and action spaces, but they sometimes experience slow convergence due to the high variance of the gradient estimator. In this paper, we show that some of these difficulties can be circumvented by exploiting the structure of the underlying MDP. We first introduce a new family of gradient estimators called score-aware gradient estimators (SAGEs). When the stationary distribution of the MDP belongs to an exponential family parametrized by the policy parameters, SAGEs allow us to estimate the policy gradient without relying on value-function estimation, contrary to classical policy-gradient methods like actor-critic. To demonstrate their applicability, we examine two common control problems arising in stochastic networks and queueing systems whose stationary distributions have a product-form, a special case of exponential families. As a second contribution, we show that, under appropriate assumptions, the policy under a SAGE-based policy-gradient method has a large probability of converging to an optimal policy, provided that it starts sufficiently close to it, even with a nonconvex objective function and multiple maximizers. Our key assumptions are that, locally around a maximizer, a nondegeneracy property of the Hessian of the objective function holds and a Lyapunov function exists. Finally, we conduct a numerical comparison between a SAGE-based policy-gradient method and an actor-critic algorithm. The results demonstrate that the SAGE-based method finds close-to-optimal policies more rapidly, highlighting its superior performance over the traditional actor-critic method.
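For contrast with SAGEs, here is the vanilla score-function (REINFORCE) estimator on a toy softmax policy, the building block that SAGEs refine by exploiting the exponential-family structure of the stationary distribution; everything below is a generic textbook sketch, not the paper's estimator.

```python
import numpy as np

# Vanilla score-function policy gradient for a softmax policy over three
# actions. Like SAGEs, it needs no value function; unlike SAGEs, it pays
# for that with a high-variance gradient estimate.
rng = np.random.default_rng(0)
theta = np.zeros(3)
mean_reward = np.array([1.0, 2.0, 0.5])        # unknown to the learner

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(5000):
    p = softmax(theta)
    a = rng.choice(3, p=p)
    r = mean_reward[a] + rng.normal(0.0, 0.5)
    score = -p.copy()
    score[a] += 1.0                            # grad log pi_theta(a)
    theta += 0.05 * r * score                  # REINFORCE ascent step
print(np.round(softmax(theta), 3))             # concentrates on action 1
```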

Materials Expert-Artificial Intelligence for Materials Discovery

  • paper_url: http://arxiv.org/abs/2312.02796
  • repo_url: None
  • paper_authors: Yanjun Liu, Milena Jovanovic, Krishnanand Mallayya, Wesley J. Maddox, Andrew Gordon Wilson, Sebastian Klemenz, Leslie M. Schoop, Eun-Ah Kim
  • for: This paper aims to develop a machine learning approach to uncover predictive descriptors for emergent material properties from vast data space, with a focus on topological semimetals (TSMs) among square-net materials.
  • methods: The authors use a machine learning approach called “Materials Expert-Artificial Intelligence” (ME-AI) to encapsulate and articulate human intuition, which is based on experimental data whenever possible. They use Dirichlet-based Gaussian process regression with a specialized kernel to reveal composite descriptors for square-net TSMs.
  • results: The ME-AI learned descriptors independently reproduce expert intuition and expand upon it, pointing to hypervalency as a critical chemical feature predicting TSM within square-net compounds. The success of the approach on a carefully defined problem suggests that it is promising for machine learning-aided material discovery.
    Abstract The advent of material databases provides an unprecedented opportunity to uncover predictive descriptors for emergent material properties from vast data space. However, common reliance on high-throughput ab initio data necessarily inherits limitations of such data: mismatch with experiments. On the other hand, experimental decisions are often guided by an expert's intuition honed from experiences that are rarely articulated. We propose using machine learning to "bottle" such operational intuition into quantifiable descriptors using expertly curated measurement-based data. We introduce "Materials Expert-Artificial Intelligence" (ME-AI) to encapsulate and articulate this human intuition. As a first step towards such a program, we focus on the topological semimetal (TSM) among square-net materials as the property inspired by the expert-identified descriptor based on structural information: the tolerance factor. We start by curating a dataset encompassing 12 primary features of 879 square-net materials, using experimental data whenever possible. We then use Dirichlet-based Gaussian process regression using a specialized kernel to reveal composite descriptors for square-net topological semimetals. The ME-AI learned descriptors independently reproduce expert intuition and expand upon it. Specifically, new descriptors point to hypervalency as a critical chemical feature predicting TSM within square-net compounds. Our success with a carefully defined problem points to the "machine bottling human insight" approach as promising for machine learning-aided material discovery.
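
As a rough illustration of the regression step, the sketch below fits a plain Gaussian process with a composite kernel on placeholder data shaped like the paper's feature matrix (879 materials, 12 features). The actual ME-AI pipeline uses Dirichlet-based GP regression with a specialized kernel and expertly curated labels, neither of which is reproduced here.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, DotProduct

# Placeholder data standing in for the 12 curated features of the 879
# square-net materials; real targets would be expert-derived TSM labels.
rng = np.random.default_rng(0)
X = rng.random((879, 12))
y = rng.random(879)

# A composite kernel lets the fit expose additive structure among features,
# one way composite descriptors can be read off the trained model.
gp = GaussianProcessRegressor(kernel=RBF() + DotProduct(), normalize_y=True)
gp.fit(X, y)
print(gp.predict(X[:5]))
```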

Machine Learning Driven Sensitivity Analysis of E3SM Land Model Parameters for Wetland Methane Emissions

  • paper_url: http://arxiv.org/abs/2312.02786
  • repo_url: None
  • paper_authors: Sandeep Chinta, Xiang Gao, Qing Zhu
  • for: This study aims to identify critical parameters for methane emission in the Energy Exascale Earth System Model (E3SM) land model (ELM) and to reduce biases and uncertainties in future projections using sensitivity analysis (SA) and machine learning (ML) algorithms.
  • methods: The study uses SA to examine the impact of 19 selected parameters responsible for critical biogeochemical processes in the methane module of ELM on various CH4 fluxes at 14 FLUXNET-CH4 sites with diverse vegetation types. It also employs an ML algorithm to emulate the complex behavior of ELM methane biogeochemistry and reduce computational costs.
  • results: Parameters linked to CH4 production and diffusion generally present the highest sensitivities despite apparent seasonal variation. Comparing simulated emissions from perturbed parameter sets against FLUXNET-CH4 observations revealed that better performance can be achieved at each site than with the default parameter values, indicating scope for further improving simulated emissions through parameter calibration with advanced optimization techniques like Bayesian optimization.
    Abstract Methane (CH4) is the second most critical greenhouse gas after carbon dioxide, contributing to 16-25% of the observed atmospheric warming. Wetlands are the primary natural source of methane emissions globally. However, wetland methane emission estimates from biogeochemistry models contain considerable uncertainty. One of the main sources of this uncertainty arises from the numerous uncertain model parameters within various physical, biological, and chemical processes that influence methane production, oxidation, and transport. Sensitivity Analysis (SA) can help identify critical parameters for methane emission and achieve reduced biases and uncertainties in future projections. This study performs SA for 19 selected parameters responsible for critical biogeochemical processes in the methane module of the Energy Exascale Earth System Model (E3SM) land model (ELM). The impact of these parameters on various CH4 fluxes is examined at 14 FLUXNET-CH4 sites with diverse vegetation types. Given the extensive number of model simulations needed for global variance-based SA, we employ a machine learning (ML) algorithm to emulate the complex behavior of ELM methane biogeochemistry. ML enables the computational time to be shortened significantly from 6 CPU hours to 0.72 milliseconds, achieving reduced computational costs. We found that parameters linked to CH4 production and diffusion generally present the highest sensitivities despite apparent seasonal variation. Comparing simulated emissions from perturbed parameter sets against FLUXNET-CH4 observations revealed that better performances can be achieved at each site compared to the default parameter values. This presents a scope for further improving simulated emissions using parameter calibration with advanced optimization techniques like Bayesian optimization.
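
The emulator-plus-Sobol workflow the abstract describes can be sketched with SALib and a scikit-learn surrogate. Everything below is illustrative: the parameter names, bounds, and the synthetic function standing in for ELM are assumptions, and the real study uses 19 parameters and an emulator trained on actual ELM simulations.

```python
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol
from sklearn.ensemble import RandomForestRegressor

# Hypothetical 3-parameter stand-in for the 19 ELM methane parameters.
problem = {
    "num_vars": 3,
    "names": ["q10_ch4_prod", "k_oxidation", "d_gas_diffusion"],
    "bounds": [[1.0, 4.0], [0.0, 1.0], [0.1, 10.0]],
}

# Fit the emulator on a modest design; a synthetic function stands in
# for the expensive ELM simulator here.
X_train = saltelli.sample(problem, 64)
y_train = X_train[:, 0] * np.exp(-X_train[:, 1]) + 0.1 * X_train[:, 2]
emulator = RandomForestRegressor(n_estimators=200).fit(X_train, y_train)

# Global variance-based (Sobol) SA then runs on the cheap emulator.
X = saltelli.sample(problem, 1024)
Si = sobol.analyze(problem, emulator.predict(X))
print(Si["S1"], Si["ST"])   # first-order and total sensitivity indices
```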

Learning “Look-Ahead” Nonlocal Traffic Dynamics in a Ring Road

  • paper_url: http://arxiv.org/abs/2312.02770
  • repo_url: None
  • paper_authors: Chenguang Zhao, Huan Yu
  • for: This study investigates nonlocal partial differential equation (PDE) traffic models with "look-ahead" dynamics for predicting and managing traffic flow speed.
  • methods: Using traffic trajectory data from a ring-road experiment, the authors design a physics-informed neural network (PINN) to learn the fundamental diagram and look-ahead kernel that best fit the data, yielding a data-enhanced nonlocal LWR model trained by minimizing a loss that combines data discrepancy and nonlocal model discrepancy.
  • results: The learned nonlocal LWR model predicts traffic wave propagation more accurately in three scenarios: stop-and-go oscillations, congested traffic, and free traffic. The study also confirms the existence of the "look-ahead" effect in real data; the optimal nonlocal kernel spans roughly 35 to 50 meters, and the kernel weight within the first 5 meters accounts for the majority of the nonlocal effect.
    Abstract The macroscopic traffic flow model is widely used for traffic control and management. To incorporate drivers' anticipative behaviors and to remove impractical speed discontinuity inherent in the classic Lighthill-Whitham-Richards (LWR) traffic model, nonlocal partial differential equation (PDE) models with "look-ahead" dynamics have been proposed, which assume that the speed is a function of weighted downstream traffic density. However, it lacks data validation on two important questions: whether there exist nonlocal dynamics, and how the length and weight of the "look-ahead" window affect the spatial temporal propagation of traffic densities. In this paper, we adopt traffic trajectory data from a ring-road experiment and design a physics-informed neural network to learn the fundamental diagram and look-ahead kernel that best fit the data, and reinvent a data-enhanced nonlocal LWR model via minimizing the loss function combining the data discrepancy and the nonlocal model discrepancy. Results show that the learned nonlocal LWR yields a more accurate prediction of traffic wave propagation in three different scenarios: stop-and-go oscillations, congested, and free traffic. We first demonstrate the existence of the "look-ahead" effect with real traffic data. The optimal nonlocal kernel is found out to take a length of around 35 to 50 meters, and the kernel weight within 5 meters accounts for the majority of the nonlocal effect. Our results also underscore the importance of choosing a priori physics in machine learning models.
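
The core modeling idea, speed as a function of look-ahead-weighted downstream density with both the kernel and the fundamental diagram learned, can be sketched as a small PyTorch module. This shows only the nonlocal speed map; the PINN residual loss on the conservation law and the data term are omitted, and the window size is an assumption.

```python
import torch
import torch.nn as nn

class NonlocalSpeed(nn.Module):
    """Speed = V(weighted downstream density): the look-ahead kernel w and
    the fundamental diagram V are both learned from trajectory data."""
    def __init__(self, window_pts=10):
        super().__init__()
        self.kernel_logits = nn.Parameter(torch.zeros(window_pts))
        self.V = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

    def forward(self, rho_downstream):            # (batch, window_pts)
        w = torch.softmax(self.kernel_logits, 0)  # normalized look-ahead kernel
        rho_bar = rho_downstream @ w              # weighted downstream density
        return self.V(rho_bar.unsqueeze(-1)).squeeze(-1)
```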

LExCI: A Framework for Reinforcement Learning with Embedded Systems

  • paper_url: http://arxiv.org/abs/2312.02739
  • repo_url: https://github.com/mechatronics-rwth/lexci-2
  • paper_authors: Kevin Badalian, Lucas Koch, Tobias Brinkmann, Mario Picerno, Marius Wegener, Sung-Yong Lee, Jakob Andert
  • for: This paper concerns the application of artificial intelligence in control engineering, specifically reinforcement learning (RL), in which an agent interacts freely with its environment to find an optimal strategy.
  • methods: The paper presents LExCI (the Learning and Experiencing Cycle Interface), a framework that couples the open-source RLlib library with dedicated embedded devices so that RL agents can be trained on such hardware.
  • results: LExCI supports training RL agents on embedded systems and integrates with existing toolchains; its operability is demonstrated with two state-of-the-art RL algorithms and a rapid control prototyping system.
    Abstract Advances in artificial intelligence (AI) have led to its application in many areas of everyday life. In the context of control engineering, reinforcement learning (RL) represents a particularly promising approach as it is centred around the idea of allowing an agent to freely interact with its environment to find an optimal strategy. One of the challenges professionals face when training and deploying RL agents is that the latter often have to run on dedicated embedded devices. This could be to integrate them into an existing toolchain or to satisfy certain performance criteria like real-time constraints. Conventional RL libraries, however, cannot be easily utilised in conjunction with that kind of hardware. In this paper, we present a framework named LExCI, the Learning and Experiencing Cycle Interface, which bridges this gap and provides end-users with a free and open-source tool for training agents on embedded systems using the open-source library RLlib. Its operability is demonstrated with two state-of-the-art RL-algorithms and a rapid control prototyping system.
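
For orientation, this is roughly what the RLlib side of such a workflow looks like. The LExCI-specific bridge to the embedded target device is not shown, and reported metric names vary across Ray versions.

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Minimal RLlib training loop of the kind LExCI wraps (sketch only; the
# framework's interface to the embedded hardware replaces the plain env).
config = PPOConfig().environment("Pendulum-v1")
algo = config.build()
for _ in range(5):
    result = algo.train()                      # one training iteration
    print(result.get("episode_reward_mean"))   # key name varies by version
```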

(Provable) Adversarial Robustness for Group Equivariant Tasks: Graphs, Point Clouds, Molecules, and More

  • paper_url: http://arxiv.org/abs/2312.02708
  • repo_url: None
  • paper_authors: Jan Schuchardt, Yan Scholten, Stephan Günnemann
  • for: This work proposes a sound notion of adversarial robustness that accounts for task equivariance, and shows that provable robustness can be achieved by choosing a model that matches the task's equivariances and certifying traditional adversarial robustness.
  • methods: The study develops an equivariance-preserving randomized smoothing framework and architecture-specific graph edit distance certificates to certify model robustness.
  • results: The resulting certification methods, covering both equivariance-matched model selection and traditional robustness certification, provide a foundation for future work at the intersection of robust and geometric machine learning.
    Abstract A machine learning model is traditionally considered robust if its prediction remains (almost) constant under input perturbations with small norm. However, real-world tasks like molecular property prediction or point cloud segmentation have inherent equivariances, such as rotation or permutation equivariance. In such tasks, even perturbations with large norm do not necessarily change an input's semantic content. Furthermore, there are perturbations for which a model's prediction explicitly needs to change. For the first time, we propose a sound notion of adversarial robustness that accounts for task equivariance. We then demonstrate that provable robustness can be achieved by (1) choosing a model that matches the task's equivariances (2) certifying traditional adversarial robustness. Certification methods are, however, unavailable for many models, such as those with continuous equivariances. We close this gap by developing the framework of equivariance-preserving randomized smoothing, which enables architecture-agnostic certification. We additionally derive the first architecture-specific graph edit distance certificates, i.e. sound robustness guarantees for isomorphism equivariant tasks like node classification. Overall, a sound notion of robustness is an important prerequisite for future work at the intersection of robust and geometric machine learning.
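
The generic randomized-smoothing certification step underlying such certificates can be sketched as follows, in the style of Cohen et al. The paper's contribution is choosing noise distributions that preserve the task's equivariances, which this sketch does not encode; `classify` is an assumed batch classifier returning integer labels.

```python
import numpy as np
from scipy.stats import binomtest, norm

def smoothed_certificate(classify, x, sigma, n=1000, alpha=0.001):
    """Randomized smoothing: predict by majority vote under Gaussian noise
    and certify an L2 radius from a confidence bound on the top-class
    probability. Returns None when the bound is too weak (abstain)."""
    noisy = x[None, ...] + sigma * np.random.randn(n, *x.shape)
    counts = np.bincount(classify(noisy))
    top = int(counts.argmax())
    # Lower confidence bound on the probability of the top class.
    p_low = binomtest(int(counts[top]), n).proportion_ci(
        confidence_level=1 - 2 * alpha).low
    if p_low <= 0.5:
        return None                          # abstain: no certificate
    return top, sigma * norm.ppf(p_low)      # certified L2 radius
```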

Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler

  • paper_url: http://arxiv.org/abs/2312.02683
  • repo_url: None
  • paper_authors: Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May
  • for: This paper focuses on the generalization performance of diffusion models applied to speech enhancement.
  • methods: Diffusion models are trained on multiple speech, noise, and binaural room impulse response (BRIR) databases to assess generalization across mismatched noise and acoustic conditions; a noise schedule and a sampler not previously applied to speech enhancement are also evaluated.
  • results: Training on multiple databases improves generalization, and the diffusion-based model outperforms state-of-the-art discriminative baselines in both matched and mismatched conditions; a Heun-based sampler additionally achieves superior performance at a smaller computational cost.
    Abstract Diffusion models are a new class of generative models that have recently been applied to speech enhancement successfully. Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the-art discriminative models. However, this was investigated with a single database for training and another one for testing, which makes the results highly dependent on the particular databases. Moreover, recent developments from the image generation literature remain largely unexplored for speech enhancement. These include several design aspects of diffusion models, such as the noise schedule or the reverse sampler. In this work, we systematically assess the generalization performance of a diffusion-based speech enhancement model by using multiple speech, noise and binaural room impulse response (BRIR) databases to simulate mismatched acoustic conditions. We also experiment with a noise schedule and a sampler that have not been applied to speech enhancement before. We show that the proposed system substantially benefits from using multiple databases for training, and achieves superior performance compared to state-of-the-art discriminative models in both matched and mismatched conditions. We also show that a Heun-based sampler achieves superior performance at a smaller computational cost compared to a sampler commonly used for speech enhancement.
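
In the image-generation literature, a Heun-based sampler is a second-order predictor-corrector for the reverse ODE (Karras et al.'s EDM formulation). A sketch, assuming a `denoiser(x, sigma)` that returns the denoised estimate and a decreasing noise schedule ending at zero:

```python
import torch

def heun_sample(denoiser, x, sigmas):
    """Second-order (Heun) reverse-ODE sampler: an Euler predictor step
    followed by a trapezoidal correction, reducing discretization error
    per step at the cost of one extra denoiser call."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - denoiser(x, sigma)) / sigma        # ODE slope at sigma
        x_euler = x + (sigma_next - sigma) * d      # Euler predictor
        if sigma_next > 0:
            d_next = (x_euler - denoiser(x_euler, sigma_next)) / sigma_next
            x = x + (sigma_next - sigma) * 0.5 * (d + d_next)  # corrector
        else:
            x = x_euler
    return x
```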

Learning a Sparse Representation of Barron Functions with the Inverse Scale Space Flow

  • paper_url: http://arxiv.org/abs/2312.02671
  • repo_url: None
  • paper_authors: Tjeerd Jan Heeringa, Tim Roith, Christoph Brune, Martin Burger
  • for: This paper presents a method for finding sparse representations of Barron functions.
  • methods: Given an $L^2$ function $f$, the inverse scale space flow is used to find a sparse measure $\mu$ minimizing the $L^2$ loss between the Barron function associated with $\mu$ and the function $f$.
  • results: The method's convergence is analyzed in an ideal setting and under measurement noise and sampling bias. In the ideal setting, the objective decreases strictly monotonically in time to a minimizer at rate $\mathcal{O}(1/t)$; with noise or bias, the optimum is attained up to a multiplicative or additive constant. This convergence is preserved under discretization of the parameter space, and minimizers on increasingly fine discretizations converge to the optimum on the full parameter space.
    Abstract This paper presents a method for finding a sparse representation of Barron functions. Specifically, given an $L^2$ function $f$, the inverse scale space flow is used to find a sparse measure $\mu$ minimising the $L^2$ loss between the Barron function associated to the measure $\mu$ and the function $f$. The convergence properties of this method are analysed in an ideal setting and in the cases of measurement noise and sampling bias. In an ideal setting the objective decreases strictly monotone in time to a minimizer with $\mathcal{O}(1/t)$, and in the case of measurement noise or sampling bias the optimum is achieved up to a multiplicative or additive constant. This convergence is preserved on discretization of the parameter space, and the minimizers on increasingly fine discretizations converge to the optimum on the full parameter space.
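
For concreteness, a sketch of the objects involved, with notation assumed rather than taken verbatim from the paper:

```latex
% A Barron function associated with a measure \mu over parameters (w, b):
\[
  f_\mu(x) \;=\; \int_{\Omega} \sigma\!\left(w^\top x + b\right)\,
  \mathrm{d}\mu(w, b),
\]
% and the inverse scale space flow seeks a sparse \mu minimising
\[
  J(\mu) \;=\; \tfrac{1}{2}\,\lVert f_\mu - f \rVert_{L^2}^2,
  \qquad
  J(\mu(t)) - \min_\mu J \;=\; \mathcal{O}(1/t)
  \;\;\text{in the ideal setting.}
\]
```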

A Self-Commissioning Edge Computing Method for Data-Driven Anomaly Detection in Power Electronic Systems

  • paper_url: http://arxiv.org/abs/2312.02661
  • repo_url: None
  • paper_authors: Pere Izquierdo Gomez, Miguel E. Lopez Gajardo, Nenad Mijatovic, Tomislav Dragicevic
  • for: Ensuring the reliability of power electronic converters is of great importance, and data-driven condition monitoring techniques play an increasingly important role in this task.
  • methods: The paper proposes an edge computing method that prioritizes the storage of training samples with larger prediction errors, mitigating the limited diversity and accuracy of lab training data while keeping the online training process stable and predictable.
  • results: Experimental data show that the method improves prediction accuracy and training speed compared with equivalent models trained online without the proposed data selection process.
    Abstract Ensuring the reliability of power electronic converters is a matter of great importance, and data-driven condition monitoring techniques are cementing themselves as an important tool for this purpose. However, translating methods that work well in controlled lab environments to field applications presents significant challenges, notably because of the limited diversity and accuracy of the lab training data. By enabling the use of field data, online machine learning can be a powerful tool to overcome this problem, but it introduces additional challenges in ensuring the stability and predictability of the training processes. This work presents an edge computing method that mitigates these shortcomings with minimal additional memory usage, by employing an autonomous algorithm that prioritizes the storage of training samples with larger prediction errors. The method is demonstrated on the use case of a self-commissioning condition monitoring system, in the form of a thermal anomaly detection scheme for a variable frequency motor drive, where the algorithm self-learned to distinguish normal and anomalous operation with minimal prior knowledge. The obtained results, based on experimental data, show a significant improvement in prediction accuracy and training speed, when compared to equivalent models trained online without the proposed data selection process.
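
The error-prioritized storage rule is simple to state: keep a bounded buffer and evict the sample whose prediction error is smallest. A minimal sketch with a min-heap (names illustrative):

```python
import heapq

class ErrorPrioritizedBuffer:
    """Bounded store that keeps the training samples with the largest
    absolute prediction errors, discarding the least informative ones."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []        # min-heap keyed by |error|
        self._count = 0        # tie-breaker so samples are never compared

    def add(self, error, sample):
        item = (abs(error), self._count, sample)
        self._count += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif item[0] > self._heap[0][0]:
            heapq.heapreplace(self._heap, item)   # evict smallest error

    def samples(self):
        return [s for _, _, s in self._heap]
```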

Do AI models produce better weather forecasts than physics-based models? A quantitative evaluation case study of Storm Ciarán

  • paper_url: http://arxiv.org/abs/2312.02658
  • repo_url: None
  • paper_authors: Andrew J. Charlton-Perez, Helen F. Dacre, Simon Driscoll, Suzanne L. Gray, Ben Harvey, Natalie J. Harvey, Kieran M. R. Hunt, Robert W. Lee, Ranjini Swaminathan, Remy Vandaele, Ambrogio Volonté
  • for: This study evaluates how well current machine learning models can simulate high-impact weather events.
  • methods: Four machine learning models (FourCastNet, Pangu-Weather, GraphCast, and FourCastNet-v2) are used to forecast Storm Ciarán, a European windstorm that caused sixteen deaths and extensive damage, and are compared against numerical weather prediction models.
  • results: The models accurately capture the storm's synoptic-scale structure, including the position of the cloud head, the shape of the warm sector, the location of the warm conveyor belt jet, and the large-scale dynamical drivers of rapid storm development. Their ability to resolve the finer-scale structures needed for issuing weather warnings is more mixed: all underestimate peak wind amplitudes, only some resolve the warm core seclusion, and none capture the sharp bent-back warm frontal gradient.
    Abstract There has been huge recent interest in the potential of making operational weather forecasts using machine learning techniques. As they become a part of the weather forecasting toolbox, there is a pressing need to understand how well current machine learning models can simulate high-impact weather events. We compare forecasts of Storm Ciarán, a European windstorm that caused sixteen deaths and extensive damage in Northern Europe, made by machine learning and numerical weather prediction models. The four machine learning models considered (FourCastNet, Pangu-Weather, GraphCast and FourCastNet-v2) produce forecasts that accurately capture the synoptic-scale structure of the cyclone including the position of the cloud head, shape of the warm sector and location of warm conveyor belt jet, and the large-scale dynamical drivers important for the rapid storm development such as the position of the storm relative to the upper-level jet exit. However, their ability to resolve the more detailed structures important for issuing weather warnings is more mixed. All of the machine learning models underestimate the peak amplitude of winds associated with the storm, only some machine learning models resolve the warm core seclusion and none of the machine learning models capture the sharp bent-back warm frontal gradient. Our study shows there is a great deal about the performance and properties of machine learning weather forecasts that can be derived from case studies of high-impact weather events such as Storm Ciarán.

What Machine Learning Can Do for Focusing Aerogel Detectors

  • paper_url: http://arxiv.org/abs/2312.02652
  • repo_url: None
  • paper_authors: Foma Shipilov, Alexander Barnyakov, Vladimir Bobrovnikov, Sergey Kononov, Fedor Ratnikov
  • for: This work aims to improve particle identification in the Super Charm-Tau factory experiment.
  • methods: Several approaches to filtering signal hits, inspired by machine learning techniques from computer vision, are applied.
  • results: These approaches effectively reduce the data flow and improve particle velocity resolution.
    Abstract Particle identification at the Super Charm-Tau factory experiment will be provided by a Focusing Aerogel Ring Imaging CHerenkov detector (FARICH). The specifics of detector location make proper cooling difficult, therefore a significant number of ambient background hits are captured. They must be mitigated to reduce the data flow and improve particle velocity resolution. In this work we present several approaches to filtering signal hits, inspired by machine learning techniques from computer vision.

A Q-learning approach to the continuous control problem of robot inverted pendulum balancing

  • paper_url: http://arxiv.org/abs/2312.02649
  • repo_url: None
  • paper_authors: Mohammad Safeea, Pedro Neto
  • for: This study evaluates the application of a discrete-action-space reinforcement learning method (Q-learning) to the continuous control problem of robot inverted pendulum balancing.
  • methods: The learning phase is performed in a simulation environment whose dynamics model is deduced by curve fitting on data acquired from the real system, speeding up learning and avoiding the technical difficulties of learning directly on the real robot.
  • results: The approach was successfully applied to a real-world robot that learned to balance an inverted pendulum, demonstrating the value of an accurate simulated representation of the physical world when using a discrete-action-space algorithm to control a continuous action.
    Abstract This study evaluates the application of a discrete action space reinforcement learning method (Q-learning) to the continuous control problem of robot inverted pendulum balancing. To speed up the learning process and to overcome technical difficulties related to direct learning on the real robotic system, the learning phase is performed in a simulation environment. A mathematical model of the system dynamics is implemented, deduced by curve fitting on data acquired from the real system. The proposed approach proved feasible, as demonstrated by its application on a real-world robot that learned to balance an inverted pendulum. This study also reinforces and demonstrates the importance of an accurate representation of the physical world in simulation to achieve a more efficient implementation of reinforcement learning algorithms in the real world, even when using a discrete action space algorithm to control a continuous action.
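
A tabular sketch of the discrete-action scheme the study describes; the `env` simulator (returning discretized state indices) and the discretization sizes are assumptions for illustration, not details from the paper.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=2000,
               alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning over discretized states/actions. `env` is assumed
    to expose reset() -> state index and step(a) -> (next state, reward, done)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = np.random.randint(n_actions) if np.random.rand() < eps \
                else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # Move Q(s, a) toward the bootstrapped one-step target.
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q
```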

Rethinking and Simplifying Bootstrapped Graph Latents

  • paper_url: http://arxiv.org/abs/2312.02619
  • repo_url: https://github.com/zszszs25/sgcl
  • paper_authors: Wangbin Sun, Jintang Li, Liang Chen, Bingzhe Wu, Yatao Bian, Zibin Zheng
  • for: Improving the distinguishability of learned representations and the performance of models in graph self-supervised learning.
  • methods: Uses the outputs from two consecutive training iterations as positive pairs, eliminating the need for negative samples.
  • results: Compared with conventional GCL methods, SGCL achieves competitive performance with fewer parameters, lower time and space costs, and a significant convergence speedup.
    Abstract Graph contrastive learning (GCL) has emerged as a representative paradigm in graph self-supervised learning, where negative samples are commonly regarded as the key to preventing model collapse and producing distinguishable representations. Recent studies have shown that GCL without negative samples can achieve state-of-the-art performance as well as scalability improvement, with bootstrapped graph latent (BGRL) as a prominent step forward. However, BGRL relies on a complex architecture to maintain the ability to scatter representations, and the underlying mechanisms enabling the success remain largely unexplored. In this paper, we introduce an instance-level decorrelation perspective to tackle the aforementioned issue and leverage it as a springboard to reveal the potential unnecessary model complexity within BGRL. Based on our findings, we present SGCL, a simple yet effective GCL framework that utilizes the outputs from two consecutive iterations as positive pairs, eliminating the negative samples. SGCL only requires a single graph augmentation and a single graph encoder without additional parameters. Extensive experiments conducted on various graph benchmarks demonstrate that SGCL can achieve competitive performance with fewer parameters, lower time and space costs, and significant convergence speedup.
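
The negative-sample-free objective can be sketched in a few lines: embeddings from the previous iteration act as fixed targets for the current ones. This only illustrates the consecutive-iterations-as-positive-pairs idea; the exact SGCL loss may differ in detail.

```python
import torch
import torch.nn.functional as F

def consecutive_pair_loss(z_prev, z_curr):
    """Treat node embeddings from two consecutive training iterations as
    positive pairs; maximizing their cosine similarity needs no negatives.
    The previous iteration's output is detached so it serves as a target."""
    z_prev = F.normalize(z_prev.detach(), dim=-1)
    z_curr = F.normalize(z_curr, dim=-1)
    return -(z_prev * z_curr).sum(dim=-1).mean()
```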

Privacy-Aware Data Acquisition under Data Similarity in Regression Markets

  • paper_url: http://arxiv.org/abs/2312.02611
  • repo_url: None
  • paper_authors: Shashi Raj Pandey, Pierre Pinson, Petar Popovski
  • for: This paper designs data markets that account for data owners' privacy preferences and the impact of data similarity.
  • methods: It proposes a query-response protocol based on local differential privacy for a two-party data acquisition mechanism.
  • results: Modeling the strategic interactions between privacy-aware owners and the learner as a Stackelberg game over the asked price and privacy factor, the paper analyzes how privacy awareness shapes prices and privacy factors, and shows numerically that data similarity affects market participation and traded data value.
    Abstract Data markets facilitate decentralized data exchange for applications such as prediction, learning, or inference. The design of these markets is challenged by varying privacy preferences as well as data similarity among data owners. Related works have often overlooked how data similarity impacts pricing and data value through statistical information leakage. We demonstrate that data similarity and privacy preferences are integral to market design and propose a query-response protocol using local differential privacy for a two-party data acquisition mechanism. In our regression data market model, we analyze strategic interactions between privacy-aware owners and the learner as a Stackelberg game over the asked price and privacy factor. Finally, we numerically evaluate how data similarity affects market participation and traded data value.
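
The local differential privacy building block in such a query-response protocol is the owner-side perturbation of each answer before release. A minimal Laplace-mechanism sketch (the paper's protocol additionally ties the chosen privacy factor to the price):

```python
import numpy as np

def ldp_response(true_answer, epsilon, sensitivity=1.0):
    """Laplace mechanism for epsilon-local differential privacy: the data
    owner adds calibrated noise to the query answer before releasing it.
    Smaller epsilon means stronger privacy and a noisier answer."""
    return true_answer + np.random.laplace(0.0, sensitivity / epsilon)
```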

TSVR+: Twin support vector regression with privileged information

  • paper_url: http://arxiv.org/abs/2312.02596
  • repo_url: None
  • paper_authors: Anuradha Kumari, M. Tanveer
  • for: Improving the training speed and accuracy of machine learning regression models.
  • methods: Combines twin support vector regression (TSVR) with learning using privileged information (LUPI) and solves the resulting optimization problem with the successive overrelaxation (SOR) technique.
  • results: Numerical experiments on UCI, stock, and time-series datasets collectively demonstrate the superiority of the proposed model.
    Abstract In the realm of machine learning, the data may contain additional attributes, known as privileged information (PI). The main purpose of PI is to assist in the training of the model and then utilize the acquired knowledge to make predictions for unseen samples. Support vector regression (SVR) is an effective regression model, however, it has a low learning speed due to solving a convex quadratic problem (QP) subject to a pair of constraints. In contrast, twin support vector regression (TSVR) is more efficient than SVR as it solves two QPs each subject to one set of constraints. However, TSVR and its variants are trained only on regular features and do not use privileged features for training. To fill this gap, we introduce a fusion of TSVR with learning using privileged information (LUPI) and propose a novel approach called twin support vector regression with privileged information (TSVR+). The regularization terms in the proposed TSVR+ capture the essence of statistical learning theory and implement the structural risk minimization principle. We use the successive overrelaxation (SOR) technique to solve the optimization problem of the proposed TSVR+, which enhances the training efficiency. As far as our knowledge extends, the integration of the LUPI concept into twin variants of regression models is a novel advancement. The numerical experiments conducted on UCI, stock and time series data collectively demonstrate the superiority of the proposed model.
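
Successive overrelaxation, the solver named in the abstract, is a classical iterative scheme for linear systems. A generic sketch follows; this is not the TSVR+-specific dual update, and convergence requires suitable conditions on A (e.g., symmetric positive definiteness with 0 < omega < 2).

```python
import numpy as np

def sor_solve(A, b, omega=1.5, tol=1e-8, max_iter=10_000):
    """Successive overrelaxation for A x = b: a Gauss-Seidel sweep blended
    with the previous iterate via the relaxation factor omega."""
    x = np.zeros_like(b, dtype=float)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(len(b)):
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x_old[i + 1:]
            x[i] = (1 - omega) * x_old[i] + omega * (b[i] - sigma) / A[i, i]
        if np.linalg.norm(x - x_old) < tol:
            break
    return x
```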

FRAPPÉ: A Post-Processing Framework for Group Fairness Regularization

  • paper_url: http://arxiv.org/abs/2312.02592
  • repo_url: https://github.com/google-research/google-research
  • paper_authors: Alexandru Ţifrea, Preethi Lahoti, Ben Packer, Yoni Halpern, Ahmad Beirami, Flavien Prost
  • for: Improving group fairness while retaining the practical advantages of post-processing over in-processing methods.
  • methods: Transforms any in-processing method with a penalized objective into a post-processing procedure, sidestepping the need to know sensitive attributes at inference time.
  • results: The resulting post-processing method matches or even surpasses the fairness-error trade-off of its in-processing counterpart, both theoretically and in extensive experiments on real-world data.
    Abstract Post-processing mitigation techniques for group fairness generally adjust the decision threshold of a base model in order to improve fairness. Methods in this family exhibit several advantages that make them appealing in practice: post-processing requires no access to the model training pipeline, is agnostic to the base model architecture, and offers a reduced computation cost compared to in-processing. Despite these benefits, existing methods face other challenges that limit their applicability: they require knowledge of the sensitive attributes at inference time and are oftentimes outperformed by in-processing. In this paper, we propose a general framework to transform any in-processing method with a penalized objective into a post-processing procedure. The resulting method is specifically designed to overcome the aforementioned shortcomings of prior post-processing approaches. Furthermore, we show theoretically and through extensive experiments on real-world data that the resulting post-processing method matches or even surpasses the fairness-error trade-off offered by the in-processing counterpart.
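
The transformation the abstract describes amounts to freezing the base model and training only a post-hoc correction with the same penalized objective. A hedged sketch (the fairness penalty and names here are illustrative, not the paper's exact formulation):

```python
import torch.nn.functional as F

def postprocessing_loss(correction, base_scores, features, labels, groups,
                        lam=1.0):
    """Train only `correction` (an nn.Module); the frozen base model enters
    solely through its precomputed scores. Task loss plus an illustrative
    group-fairness penalty (squared mean-score gap between two groups)."""
    scores = base_scores + correction(features).squeeze(-1)
    task = F.binary_cross_entropy_with_logits(scores, labels.float())
    gap = scores[groups == 0].mean() - scores[groups == 1].mean()
    return task + lam * gap ** 2
```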

On Optimal Consistency-Robustness Trade-Off for Learning-Augmented Multi-Option Ski Rental

  • paper_url: http://arxiv.org/abs/2312.02547
  • repo_url: None
  • paper_authors: Yongho Shin, Changyeol Lee, Hyung-Chan An
  • for: The learning-augmented multi-option ski rental problem, which generalizes the classical ski rental problem in two ways: the algorithm is provided with a prediction of the number of ski days, and the rental options come with a variety of rental periods and prices rather than the classical two options.
  • methods: Learning-augmented algorithm design, with deterministic and randomized strategies that trade off consistency (performance under accurate predictions) against robustness (worst-case guarantees).
  • results: A best-possible deterministic algorithm that matches the known lower bound; the first nontrivial lower bound on the consistency-robustness trade-off for randomized algorithms; and an improved randomized algorithm that matches this lower bound on robustness within a factor of e/2 whenever the consistency is at most 1.086.
    Abstract The learning-augmented multi-option ski rental problem generalizes the classical ski rental problem in two ways: the algorithm is provided with a prediction on the number of days we can ski, and the ski rental options now come with a variety of rental periods and prices to choose from, unlike the classical two-option setting. Subsequent to the initial study of the multi-option ski rental problem (without learning augmentation) due to Zhang, Poon, and Xu, significant progress has been made for this problem recently in particular. The problem is very well understood when we relinquish one of the two generalizations -- for the learning-augmented classical ski rental problem, algorithms giving best-possible trade-off between consistency and robustness exist; for the multi-option ski rental problem without learning augmentation, deterministic/randomized algorithms giving the best-possible competitiveness have been found. However, in presence of both generalizations, there remained a huge gap between the algorithmic and impossibility results. In fact, for randomized algorithms, we did not have any nontrivial lower bounds on the consistency-robustness trade-off before. This paper bridges this gap for both deterministic and randomized algorithms. For deterministic algorithms, we present a best-possible algorithm that completely matches the known lower bound. For randomized algorithms, we show the first nontrivial lower bound on the consistency-robustness trade-off, and also present an improved randomized algorithm. Our algorithm matches our lower bound on robustness within a factor of e/2 when the consistency is at most 1.086.
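
For contrast with the multi-option setting studied here, the classic two-option learning-augmented rule (in the style of Purohit et al.) is easy to state: trust the prediction to decide whether to buy early or late, with a parameter trading consistency against robustness. A sketch of that classic rule, not this paper's algorithm:

```python
import math

def buy_day(buy_price, predicted_days, trust):
    """Two-option learning-augmented ski rental. trust in (0, 1]: smaller
    values follow the prediction more aggressively (better consistency),
    values near 1 recover the classic worst-case break-even rule."""
    if predicted_days >= buy_price:            # prediction: long season
        return math.ceil(trust * buy_price)    # commit (buy) early
    return math.ceil(buy_price / trust)        # rent longer before buying

print(buy_day(10, 20, 0.5))   # e.g. buy on day 5 when the season looks long
```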

Characterization of Locality in Spin States and Forced Moves for Optimizations

  • paper_url: http://arxiv.org/abs/2312.02544
  • repo_url: None
  • paper_authors: Yoshiki Sato, Makiko Konoshima, Hirotaka Tamura, Jun Ohkubo
  • for: Addressing the problem of local minima in energy landscapes when solving combinatorial optimization problems.
  • methods: Exploits specialized hardware together with a new algorithmic technique based on a feature that characterizes locality in the current state.
  • results: Proposes an efficient, rejection-free algorithm that escapes local minima quickly.
    Abstract Ising formulations are widely utilized to solve combinatorial optimization problems, and a variety of quantum or semiconductor-based hardware has recently been made available. In combinatorial optimization problems, the existence of local minima in energy landscapes makes it difficult to seek the global minimum. We note that the aim of the optimization is not to obtain exact samplings from the Boltzmann distribution, and there is thus no need to satisfy detailed balance conditions. In light of this fact, we develop an algorithm to get out of local minima efficiently, even though it does not yield exact samplings. For this purpose, we utilize a feature that characterizes locality in the current state, which is easy to obtain with a type of specialized hardware. Furthermore, as the proposed algorithm is based on a rejection-free algorithm, the computational cost is low. In this work, after presenting the details of the proposed algorithm, we report the results of numerical experiments that demonstrate the effectiveness of the proposed feature and algorithm.
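
A generic rejection-free step for an Ising energy illustrates the "forced move" idea: one flip is always performed, chosen with probability proportional to its Boltzmann weight, so the walker cannot stall in a local minimum. The paper's locality-based weighting differs in detail.

```python
import numpy as np

def forced_move(spins, J, h, beta):
    """Rejection-free update for E(s) = -0.5 s^T J s - h^T s (zero-diagonal
    J, spins in {-1, +1}): sample which spin to flip instead of
    accepting/rejecting a single proposed flip."""
    delta_e = 2.0 * spins * (J @ spins + h)   # energy change of each flip
    logw = -beta * delta_e
    w = np.exp(logw - logw.max())             # stabilized Boltzmann weights
    i = np.random.choice(len(spins), p=w / w.sum())
    spins[i] *= -1                            # a move is always made
    return spins
```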

Asymmetric leader-laggard cluster synchronization for collective decision-making with laser network

  • paper_url: http://arxiv.org/abs/2312.02537
  • repo_url: None
  • paper_authors: Shun Kotoku, Takatomo Mihana, André Röhm, Ryoichi Horisaki, Makoto Naruse
  • for: This work studies photonic accelerators for information processing, specifically the use of a laser network to solve the competitive multi-armed bandit (CMAB) problem.
  • methods: Collective decision-making is implemented with optically interconnected lasers, exploiting the chaotic and synchronous dynamics of the laser network.
  • results: Quantitative stability analysis identifies the network structures essential for collective decision-making, and asymmetric player preferences are demonstrated, extending the CMAB framework toward more practical applications.
    Abstract Photonic accelerators have recently attracted soaring interest, harnessing the ultimate nature of light for information processing. Collective decision-making with a laser network, employing the chaotic and synchronous dynamics of optically interconnected lasers to address the competitive multi-armed bandit (CMAB) problem, is a highly compelling approach due to its scalability and experimental feasibility. We investigated essential network structures for collective decision-making through quantitative stability analysis. Moreover, we demonstrated the asymmetric preferences of players in the CMAB problem, extending its functionality to more practical applications. Our study highlights the capability and significance of machine learning built upon chaotic lasers and photonic devices.

Pseudo Replay-based Class Continual Learning for Online New Category Anomaly Detection in Additive Manufacturing

  • paper_url: http://arxiv.org/abs/2312.02491
  • repo_url: None
  • paper_authors: Zhangyue Shi, Tianxin Xie, Chenang Liu, Yuxuan Li
  • for: This study aims to improve quality monitoring in modern manufacturing, using advanced sensors and machine learning for data-driven in-situ monitoring.
  • methods: Memory-based continual learning is combined with class-incremental learning and oversampling-based data generation to work around data storage capacity limits.
  • results: Experiments show the proposed method generates high-quality data representing previous classes and learns incrementally when new anomaly categories appear, without storing all the data; it can also enhance monitoring performance and adds flexibility to the model architecture.
    Abstract The incorporation of advanced sensors and machine learning techniques has enabled modern manufacturing enterprises to perform data-driven in-situ quality monitoring based on the sensor data collected in manufacturing processes. However, one critical challenge is that newly presented defect categories may manifest as the manufacturing process continues, resulting in monitoring performance deterioration of previously trained machine learning models. Hence, there is an increasing need to empower machine learning models to learn continually. Among all continual learning methods, memory-based continual learning has the best performance but faces the constraint of data storage capacity. To address this issue, this paper develops a novel pseudo replay-based continual learning framework by integrating class-incremental learning and oversampling-based data generation. Without storing all the data, the developed framework can generate high-quality data representing previous classes to train the machine learning model incrementally when a new anomaly category occurs. In addition, it can even enhance monitoring performance, since it also effectively improves data quality. The effectiveness of the proposed framework is validated in an additive manufacturing process that casts anomaly detection as a supervised classification problem. The experimental results show that the developed method is very promising in detecting novel anomalies while maintaining good performance on previous tasks, and it brings more flexibility to the model architecture.
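
One way to realize storage-free pseudo replay is to keep only per-class generative statistics and sample synthetic "old" data when a new class arrives. The sketch below uses Gaussian statistics as a stand-in for the paper's oversampling-based generator; it illustrates the replay mechanism, not the paper's exact generator.

```python
import numpy as np

def pseudo_replay_batch(class_stats, n_per_class):
    """Generate a synthetic replay batch from stored per-class statistics
    {label: (mean vector, covariance matrix)} instead of raw samples."""
    xs, ys = [], []
    for label, (mean, cov) in class_stats.items():
        xs.append(np.random.multivariate_normal(mean, cov, n_per_class))
        ys.append(np.full(n_per_class, label))
    return np.vstack(xs), np.concatenate(ys)
```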

Constrained Twin Variational Auto-Encoder for Intrusion Detection in IoT Systems

  • paper_url: http://arxiv.org/abs/2312.02490
  • repo_url: None
  • paper_authors: Phai Vu Dinh, Quang Uy Nguyen, Dinh Thai Hoang, Diep N. Nguyen, Son Pham Bao, Eryk Dutkiewicz
  • for: Protecting Internet of Things (IoT) devices from malicious attacks.
  • methods: Uses a Constrained Twin Variational Auto-Encoder (CTVAE) to feed the classifiers of intrusion detection systems with more separable and lower-dimensional data representations.
  • results: On the 11 most popular IoT botnet datasets, CTVAE improves accuracy and F-score by around 1% compared with state-of-the-art methods, with an attack-detection running time below 2E-6 seconds and a model size under 1 MB.
    Abstract Intrusion detection systems (IDSs) play a critical role in protecting billions of IoT devices from malicious attacks. However, the IDSs for IoT devices face inherent challenges of IoT systems, including the heterogeneity of IoT data/devices, the high dimensionality of training data, and the imbalanced data. Moreover, the deployment of IDSs on IoT systems is challenging, and sometimes impossible, due to the limited resources such as memory/storage and computing capability of typical IoT devices. To tackle these challenges, this article proposes a novel deep neural network/architecture called Constrained Twin Variational Auto-Encoder (CTVAE) that can feed classifiers of IDSs with more separable/distinguishable and lower-dimensional representation data. Additionally, in comparison to the state-of-the-art neural networks used in IDSs, CTVAE requires less memory/storage and computing power, hence making it more suitable for IoT IDS systems. Extensive experiments with the 11 most popular IoT botnet datasets show that CTVAE can boost around 1% in terms of accuracy and Fscore in detection attack compared to the state-of-the-art machine learning and representation learning methods, whilst the running time for attack detection is lower than 2E-6 seconds and the model size is lower than 1 MB. We also further investigate various characteristics of CTVAE in the latent space and in the reconstruction representation to demonstrate its efficacy compared with current well-known methods.

RL-Based Cargo-UAV Trajectory Planning and Cell Association for Minimum Handoffs, Disconnectivity, and Energy Consumption

  • paper_url: http://arxiv.org/abs/2312.02478
  • repo_url: None
  • paper_authors: Nesrine Cherif, Wael Jaafar, Halim Yanikomeroglu, Abbas Yongacoglu
  • for: Improving the reliability and energy efficiency of cargo-UAV delivery.
  • methods: Uses reinforcement learning (RL) to jointly optimize the cargo-UAV's trajectory planning and cell association.
  • results: Simulation results show that, compared with benchmark methods, the approach reduces handoff events, reduces disconnectivity, and lowers energy consumption.
    Abstract Unmanned aerial vehicle (UAV) is a promising technology for last-mile cargo delivery. However, the limited on-board battery capacity, cellular unreliability, and frequent handoffs in the airspace are the main obstacles to unleash its full potential. Given that existing cellular networks were primarily designed to service ground users, re-utilizing the same architecture for highly mobile aerial users, e.g., cargo-UAVs, is deemed challenging. Indeed, to ensure a safe delivery using cargo-UAVs, it is crucial to utilize the available energy efficiently, while guaranteeing reliable connectivity for command-and-control and avoiding frequent handoff. To achieve this goal, we propose a novel approach for joint cargo-UAV trajectory planning and cell association. Specifically, we formulate the cargo-UAV mission as a multi-objective problem aiming to 1) minimize energy consumption, 2) reduce handoff events, and 3) guarantee cellular reliability along the trajectory. We leverage reinforcement learning (RL) to jointly optimize the cargo-UAV's trajectory and cell association. Simulation results demonstrate a performance improvement of our proposed method, in terms of handoffs, disconnectivity, and energy consumption, compared to benchmarks.

NeutronStream: A Dynamic GNN Training Framework with Sliding Window for Graph Streams

  • paper_url: http://arxiv.org/abs/2312.02473
  • repo_url: None
  • paper_authors: Chaoyi Chen, Dechao Gao, Yanfeng Zhang, Qiange Wang, Zhenbo Fu, Xuecang Zhang, Junhua Zhu, Yu Gu, Ge Yu
  • for: This paper provides a framework for training dynamic graph neural network (GNN) models, making it easier for developers to create performant GNN implementations.
  • methods: NeutronStream abstracts the input dynamic graph into a chronologically updated stream of events and processes it with an optimized sliding window to incrementally capture the spatial-temporal dependencies of events; it also provides a parallel execution engine that tackles the sequential event-processing challenge to achieve high performance.
  • results: Compared with state-of-the-art dynamic GNN implementations, NeutronStream achieves speedups ranging from 1.48X to 5.87X and an average accuracy improvement of 3.97%.
    Abstract Existing Graph Neural Network (GNN) training frameworks have been designed to help developers easily create performant GNN implementations. However, most existing GNN frameworks assume that the input graphs are static, but ignore that most real-world graphs are constantly evolving. Though many dynamic GNN models have emerged to learn from evolving graphs, the training process of these dynamic GNNs is dramatically different from traditional GNNs in that it captures both the spatial and temporal dependencies of graph updates. This poses new challenges for designing dynamic GNN training frameworks. First, the traditional batched training method fails to capture real-time structural evolution information. Second, the time-dependent nature makes parallel training hard to design. Third, it lacks system supports for users to efficiently implement dynamic GNNs. In this paper, we present NeutronStream, a framework for training dynamic GNN models. NeutronStream abstracts the input dynamic graph into a chronologically updated stream of events and processes the stream with an optimized sliding window to incrementally capture the spatial-temporal dependencies of events. Furthermore, NeutronStream provides a parallel execution engine to tackle the sequential event processing challenge to achieve high performance. NeutronStream also integrates a built-in graph storage structure that supports dynamic updates and provides a set of easy-to-use APIs that allow users to express their dynamic GNNs. Our experimental results demonstrate that, compared to state-of-the-art dynamic GNN implementations, NeutronStream achieves speedups ranging from 1.48X to 5.87X and an average accuracy improvement of 3.97%.
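
The sliding-window stream abstraction can be sketched in a few lines; `train_on` is an assumed callback that updates the dynamic GNN on a batch of events, and the window and slide sizes are illustrative.

```python
from collections import deque

def stream_train(events, train_on, window_size=128, slide=32):
    """Consume a chronologically ordered event stream with a sliding window
    so that spatially/temporally nearby graph updates are trained together,
    rather than batching the whole evolving graph at once."""
    window = deque(maxlen=window_size)
    for i, event in enumerate(events, start=1):
        window.append(event)
        if i % slide == 0 and len(window) == window_size:
            train_on(list(window))
```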

Congestion-aware Distributed Task Offloading in Wireless Multi-hop Networks Using Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2312.02471
  • repo_url: None
  • paper_authors: Zhongyuan Zhao, Jake Perazzone, Gunjan Verma, Santiago Segarra
  • for: Improving computational offloading for edge intelligence on mobile and smart devices, particularly in wireless multi-hop networks where tasks from multiple mobile devices can congest the network.
  • methods: Augments a distributed greedy framework with graph-based machine learning to obtain a low-overhead, congestion-aware distributed task offloading scheme.
  • results: In simulated wireless multi-hop networks with 20-110 nodes, the approach reduces the congestion and unstable queues caused by task offloading while improving execution latency over local computing.
    Abstract Computational offloading has become an enabling component for edge intelligence in mobile and smart devices. Existing offloading schemes mainly focus on mobile devices and servers, while ignoring the potential network congestion caused by tasks from multiple mobile devices, especially in wireless multi-hop networks. To fill this gap, we propose a low-overhead, congestion-aware distributed task offloading scheme by augmenting a distributed greedy framework with graph-based machine learning. In simulated wireless multi-hop networks with 20-110 nodes and a resource allocation scheme based on shortest path routing and contention-based link scheduling, our approach is demonstrated to be effective in reducing congestion or unstable queues under the context-agnostic baseline, while improving the execution latency over local computing.

Dimensionality Reduction and Dynamical Mode Recognition of Circular Arrays of Flame Oscillators Using Deep Neural Network

  • paper_url: http://arxiv.org/abs/2312.02462
  • repo_url: None
  • paper_authors: Weiming Xu, Tao Yang, Peng Zhang
  • for: Reducing high-dimensional spatial-temporal data and recognizing different oscillation modes.
  • methods: Combines a two-layer bidirectional LSTM variational autoencoder (Bi-LSTM-VAE) for dimensionality reduction with a two-dimensional Wasserstein distance-based classifier (WDC) for mode recognition.
  • results: The method produces non-overlapping distributions of phase points, indicating effective unsupervised mode recognition, and outperforms VAE and PCA at distinguishing dynamical modes.
    Abstract Oscillatory combustion in aero engines and modern gas turbines often has significant adverse effects on their operation, and accurately recognizing various oscillation modes is the prerequisite for understanding and controlling combustion instability. However, the high-dimensional spatial-temporal data of a complex combustion system typically poses considerable challenges to the dynamical mode recognition. Based on a two-layer bidirectional long short-term memory variational autoencoder (Bi-LSTM-VAE) dimensionality reduction model and a two-dimensional Wasserstein distance-based classifier (WDC), this study proposes a promising method (Bi-LSTM-VAE-WDC) for recognizing dynamical modes in oscillatory combustion systems. Specifically, the Bi-LSTM-VAE dimension reduction model was introduced to reduce the high-dimensional spatial-temporal data of the combustion system to a low-dimensional phase space; Gaussian kernel density estimates (GKDE) were computed based on the distribution of phase points in a grid; two-dimensional WD values were calculated from the GKDE maps to recognize the oscillation modes. The time-series data used in this study were obtained from numerical simulations of circular arrays of laminar flame oscillators. The results show that the novel Bi-LSTM-VAE method can produce a non-overlapping distribution of phase points, indicating an effective unsupervised mode recognition and classification. Furthermore, the present method exhibits a more prominent performance than VAE and PCA (principal component analysis) for distinguishing dynamical modes in complex flame systems, implying its potential in studying turbulent combustion.
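
The two-dimensional Wasserstein comparison at the heart of the WDC can be sketched with the POT library. The paper computes the distance between Gaussian kernel density maps on a grid, whereas this sketch compares raw point clouds directly.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def wasserstein_2d(points_a, points_b):
    """Exact 2-Wasserstein distance between two 2-D point clouds with
    uniform weights, via the linear-programming optimal transport solver."""
    a = np.full(len(points_a), 1.0 / len(points_a))
    b = np.full(len(points_b), 1.0 / len(points_b))
    M = ot.dist(points_a, points_b)      # pairwise squared-Euclidean costs
    return np.sqrt(ot.emd2(a, b, M))
```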

GIT-Net: Generalized Integral Transform for Operator Learning

  • paper_url: http://arxiv.org/abs/2312.02450
  • repo_url: https://github.com/chaow-mat/general_integral_transform_neural_network
  • paper_authors: Chao Wang, Alexandre Hoang Thiery
  • for: A deep neural network architecture for approximating partial differential equation (PDE) operators.
  • methods: Parametrizes adaptive generalized integral transforms with deep neural networks, exploiting the fact that differential operators can often be represented parsimoniously in specialized functional bases (e.g., the Fourier basis).
  • results: Its computational and memory requirements scale more gracefully with mesh discretization than several recently proposed alternatives, making it suitable for PDE problems on complex geometries; it exhibits small test errors and low evaluation costs across a range of PDE problems.
    Abstract This article introduces GIT-Net, a deep neural network architecture for approximating Partial Differential Equation (PDE) operators, inspired by integral transform operators. GIT-NET harnesses the fact that differential operators commonly used for defining PDEs can often be represented parsimoniously when expressed in specialized functional bases (e.g., Fourier basis). Unlike rigid integral transforms, GIT-Net parametrizes adaptive generalized integral transforms with deep neural networks. When compared to several recently proposed alternatives, GIT-Net's computational and memory requirements scale gracefully with mesh discretizations, facilitating its application to PDE problems on complex geometries. Numerical experiments demonstrate that GIT-Net is a competitive neural network operator, exhibiting small test errors and low evaluations across a range of PDE problems. This stands in contrast to existing neural network operators, which typically excel in just one of these areas.

Adaptive Instrument Design for Indirect Experiments

  • paper_url: http://arxiv.org/abs/2312.02438
  • repo_url: https://github.com/yashchandak/IndirectExpDesign
  • paper_authors: Yash Chandak, Shiv Shankar, Vasilis Syrgkanis, Emma Brunskill
  • for: Estimating treatment effects in situations where conducting randomized control trials (RCTs) is impractical or unethical.
  • methods: Leverages (conditional) instrumental variables to estimate treatment effects through encouragement and recommendation rather than strict treatment assignment.
  • results: Improves the sample efficiency of indirect experiments by adaptively designing the data collection policy over instrumental variables, using influence functions to search for the policy that minimizes the mean-squared error of the desired (non-linear) estimator.
    Abstract Indirect experiments provide a valuable framework for estimating treatment effects in situations where conducting randomized control trials (RCTs) is impractical or unethical. Unlike RCTs, indirect experiments estimate treatment effects by leveraging (conditional) instrumental variables, enabling estimation through encouragement and recommendation rather than strict treatment assignment. However, the sample efficiency of such estimators depends not only on the inherent variability in outcomes but also on the varying compliance levels of users with the instrumental variables and the choice of estimator being used, especially when dealing with numerous instrumental variables. While adaptive experiment design has a rich literature for direct experiments, in this paper we take the initial steps towards enhancing sample efficiency for indirect experiments by adaptively designing a data collection policy over instrumental variables. Our main contribution is a practical computational procedure that utilizes influence functions to search for an optimal data collection policy, minimizing the mean-squared error of the desired (non-linear) estimator. Through experiments conducted in various domains inspired by real-world applications, we showcase how our method can significantly improve the sample efficiency of indirect experiments.
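The paper's contribution is an influence-function-based search for the optimal data collection policy; the toy loop below substitutes a much simpler heuristic — shifting sampling weight toward instruments with a stronger first stage — purely to illustrate what adaptively designing a policy over instruments means. The simulator, compliance rates, and update rule are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_EFFECT, COMPLIANCE = 2.0, [0.2, 0.5, 0.8]  # hidden from the experimenter

def run_trial(instrument, n):
    """Simulate n encouragement trials for one instrument."""
    z = rng.integers(0, 2, n)                               # encouragement
    t = ((rng.random(n) < COMPLIANCE[instrument]) & (z == 1)).astype(float)
    y = TRUE_EFFECT * t + rng.normal(0.0, 1.0, n)           # outcome
    return z, t, y

def wald_estimate(z, t, y):
    """Wald (ratio) IV estimator of the treatment effect."""
    return (y[z == 1].mean() - y[z == 0].mean()) / (t[z == 1].mean() - t[z == 0].mean())

weights = np.ones(3) / 3                                    # data collection policy
for _ in range(5):
    k = rng.choice(3, p=weights)
    z, t, y = run_trial(k, n=500)
    print(f"instrument {k}: effect estimate {wald_estimate(z, t, y):.2f}")
    strength = t[z == 1].mean() - t[z == 0].mean()          # first-stage strength
    weights[k] *= np.exp(strength)                          # favor strong instruments
    weights /= weights.sum()
```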

PEFA: Parameter-Free Adapters for Large-scale Embedding-based Retrieval Models

  • paper_url: http://arxiv.org/abs/2312.02429
  • repo_url: https://github.com/amzn/pecos
  • paper_authors: Wei-Cheng Chang, Jyun-Yu Jiang, Jiong Zhang, Mutasem Al-Darabsah, Choon Hui Teo, Cho-Jui Hsieh, Hsiang-Fu Yu, S. V. N. Vishwanathan
  • for: Proposes the ParamEter-Free Adapters (PEFA) framework for fast tuning of embedding-based retrieval models (ERMs) on large-scale text retrieval problems.
  • methods: Equips the ERM with a non-parametric k-nearest-neighbor (kNN) component at the index building stage and, at inference, combines the ERM and kNN scoring functions via a convex combination.
  • results: Significant improvements on two retrieval applications: for document retrieval, PEFA improves Recall@100 of pre-trained ERMs on Trivia-QA by an average of 13.2% and of fine-tuned ERMs on NQ-320K by an average of 5.5%; for product search, it improves Recall@100 of fine-tuned ERMs by an average of 5.3% (PEFA-XS) and 14.5% (PEFA-XL).
    Abstract Embedding-based Retrieval Models (ERMs) have emerged as a promising framework for large-scale text retrieval problems due to powerful large language models. Nevertheless, fine-tuning ERMs to reach state-of-the-art results can be expensive due to the extreme scale of data as well as the complexity of multi-stage pipelines (e.g., pre-training, fine-tuning, distillation). In this work, we propose the PEFA framework, namely ParamEter-Free Adapters, for fast tuning of ERMs without any backward pass in the optimization. At the index building stage, PEFA equips the ERM with a non-parametric k-nearest neighbor (kNN) component. At the inference stage, PEFA performs a convex combination of two scoring functions, one from the ERM and the other from the kNN. Based on the neighborhood definition, the PEFA framework induces two realizations, namely PEFA-XL (i.e., extra large) using double ANN indices and PEFA-XS (i.e., extra small) using a single ANN index. Empirically, PEFA achieves significant improvement on two retrieval applications. For document retrieval, regarding the Recall@100 metric, PEFA improves not only pre-trained ERMs on Trivia-QA by an average of 13.2%, but also fine-tuned ERMs on NQ-320K by an average of 5.5%. For product search, PEFA improves the Recall@100 of the fine-tuned ERMs by an average of 5.3% and 14.5%, for PEFA-XS and PEFA-XL, respectively. Our code is available at https://github.com/amzn/pecos/tree/mainline/examples/pefa-wsdm24.
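A minimal sketch of PEFA's inference-time scoring follows, assuming unit-normalized embeddings (so dot products act as cosine similarities) and a memorized store of training-query embeddings with their relevant document ids. Exact ANN indexing, the PEFA-XS/XL variants, and all names here are simplifications.

```python
import numpy as np

def pefa_score(q_emb, doc_embs, train_q_embs, train_doc_ids, lam=0.5, k=8):
    """Convex combination of the ERM score and a non-parametric kNN score.
    All embeddings are assumed unit-normalized, so dot products are cosines."""
    erm = doc_embs @ q_emb                        # ERM relevance per document

    sims = train_q_embs @ q_emb                   # similarity to memorized queries
    top = np.argsort(-sims)[:k]                   # k nearest training queries
    knn = np.zeros(len(doc_embs))
    for i in top:                                 # vote for their relevant docs,
        knn[train_doc_ids[i]] += sims[i]          # weighted by query similarity

    return lam * erm + (1.0 - lam) * knn          # PEFA score
```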

AI-driven emergence of frequency information non-uniform distribution via THz metasurface spectrum prediction

  • paper_url: http://arxiv.org/abs/2312.03017
  • repo_url: https://github.com/Ufere/Assingment_1
  • paper_authors: Xiaohua Xing, Yuqi Ren, Die Zou, Qiankun Zhang, Bingxuan Mao, Jianquan Yao, Deyi Xiong, Shuang Zhang, Liang Wu
  • for: Predicting the terahertz spectral modulation effects of metasurfaces.
  • methods: An AI-based prediction approach that adds supplementary multi-frequency inputs to the existing dataset during target spectral prediction to improve accuracy.
  • results: Significantly enhances the network's predictive accuracy, reveals previously unreported frequency-dependent information characteristics, and paves the way for applications of AI in chemistry, composite material design, biomedicine, and other fields.
    Abstract Recently, artificial intelligence has been extensively deployed across various scientific disciplines, optimizing and guiding the progression of experiments through the integration of abundant datasets, whilst continuously probing the vast theoretical space encapsulated within the data. Particularly, deep learning models, due to their end-to-end adaptive learning capabilities, are capable of autonomously learning intrinsic data features, thereby transcending the limitations of traditional experience to a certain extent. Here, we unveil previously unreported information characteristics pertaining to different frequencies that emerged during our work on AI-based prediction of the terahertz spectral modulation effects of metasurfaces. Moreover, we have substantiated that our proposed methodology of simply adding supplementary multi-frequency inputs to the existing dataset during the target spectral prediction process can significantly enhance the predictive accuracy of the network. This approach effectively optimizes the utilization of existing datasets and paves the way for interdisciplinary research and applications in artificial intelligence, chemistry, composite material design, biomedicine, and other fields.
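As a rough illustration of the data-side idea — supplementary multi-frequency inputs appended to each training sample — the snippet below concatenates the spectral response at a few extra frequencies onto the input features. The array shapes and function name are assumptions; the paper does not specify this exact formulation.

```python
import numpy as np

def add_multifrequency_inputs(features, spectra, extra_freq_idx):
    """Append the response at a few supplementary frequencies to each input.
    features: (n_samples, n_features), spectra: (n_samples, n_freqs)."""
    extra = spectra[:, extra_freq_idx]            # (n_samples, len(extra_freq_idx))
    return np.concatenate([features, extra], axis=1)

# e.g., augmented = add_multifrequency_inputs(X, S, extra_freq_idx=[10, 50, 90])
```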

Robust Clustering using Hyperdimensional Computing

  • paper_url: http://arxiv.org/abs/2312.02407
  • repo_url: None
  • paper_authors: Lulu Ge, Keshab K. Parhi
  • for: This paper aims to improve clustering performance in the hyperdimensional computing (HDC) domain by proposing four HDC-based clustering algorithms.
  • methods: The proposed algorithms use similarity-based k-means, equal bin-width histogram, equal bin-height histogram, and similarity-based affinity propagation to assign initial cluster hypervectors and improve on HDCluster.
  • results: The proposed algorithms achieve better accuracy, more robust performance, fewer iterations, and less execution time than the existing HDCluster. Similarity-based affinity propagation outperforms the other three algorithms on eight datasets by 2-38% in clustering accuracy; the proposed algorithms provide more robust clustering accuracy than HDCluster even for one-pass clustering; and traditional clustering remains preferable to HDC when the number of clusters is large.
    Abstract This paper addresses the clustering of data in the hyperdimensional computing (HDC) domain. In prior work, an HDC-based clustering framework, referred to as HDCluster, has been proposed. However, the performance of the existing HDCluster is not robust. The performance of HDCluster is degraded as the hypervectors for the clusters are chosen at random during the initialization step. To overcome this bottleneck, we assign the initial cluster hypervectors by exploring the similarity of the encoded data, referred to as query hypervectors. Intra-cluster hypervectors have a higher similarity than inter-cluster hypervectors. Harnessing the similarity results among query hypervectors, this paper proposes four HDC-based clustering algorithms: similarity-based k-means, equal bin-width histogram, equal bin-height histogram, and similarity-based affinity propagation. Experimental results illustrate that: (i) Compared to the existing HDCluster, our proposed HDC-based clustering algorithms can achieve better accuracy, more robust performance, fewer iterations, and less execution time. Similarity-based affinity propagation outperforms the other three HDC-based clustering algorithms on eight datasets by 2~38% in clustering accuracy. (ii) Even for one-pass clustering, i.e., without any iterative update of the cluster hypervectors, our proposed algorithms can provide more robust clustering accuracy than HDCluster. (iii) For five of the eight datasets, accuracy is higher or comparable when the data are projected onto the hyperdimensional space. Traditional clustering is more desirable than HDC when the number of clusters, $k$, is large.
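A compact sketch of one of the four algorithms, similarity-based k-means, is given below: cluster hypervectors are seeded by greedy farthest-point selection over the encoded query hypervectors rather than at random, which is the key departure from HDCluster. Bipolar hypervectors, cosine similarity, and sign-based bundling are common HDC conventions assumed here, not details taken from the paper.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def similarity_based_init(queries, k):
    """Greedy farthest-point seeding over encoded query hypervectors."""
    centers = [queries[0]]
    while len(centers) < k:
        # pick the query whose closest existing center is least similar
        scores = [max(cosine(q, c) for c in centers) for q in queries]
        centers.append(queries[int(np.argmin(scores))])
    return np.array(centers, dtype=float)

def hdc_kmeans(queries, k, iters=10):
    centers = similarity_based_init(queries, k)
    for _ in range(iters):
        labels = np.array([np.argmax([cosine(q, c) for c in centers]) for q in queries])
        new_centers = []
        for j in range(k):
            members = queries[labels == j]
            # "bundle" members: elementwise majority via sign of the sum
            new_centers.append(np.sign(members.sum(axis=0)) if len(members) else centers[j])
        centers = np.array(new_centers)
    return labels, centers

# queries: (n, d) bipolar (+1/-1) hypervectors, e.g. np.sign(rng.standard_normal((n, d)))
```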

Harmonizing Global Voices: Culturally-Aware Models for Enhanced Content Moderation

  • paper_url: http://arxiv.org/abs/2312.02401
  • repo_url: None
  • paper_authors: Alex J. Chan, José Luis Redondo García, Fabrizio Silvestri, Colm O’Donnel, Konstantina Palla
  • for: Investigates how content moderation systems can account for regional cultural distinctions so that offensive content is recognized and handled appropriately across geographies.
  • methods: Trains large language models on extensive datasets of media news and articles to create culturally attuned models that capture regional differences in communication, and has them generate explanations for content violations so that policy guidelines can be interpreted within different cultural and societal contexts.
  • results: Training on extensive media datasets successfully induced cultural awareness, improved the handling of content violations on a regional basis, and produced explanations aligned with local norms and nuances, as evidenced by annotator preferences in the conducted study.
    Abstract Content moderation at scale faces the challenge of considering local cultural distinctions when assessing content. While global policies aim to maintain decision-making consistency and prevent arbitrary rule enforcement, they often overlook regional variations in interpreting natural language as expressed in content. In this study, we are looking into how moderation systems can tackle this issue by adapting to local comprehension nuances. We train large language models on extensive datasets of media news and articles to create culturally attuned models. The latter aim to capture the nuances of communication across geographies with the goal of recognizing cultural and societal variations in what is considered offensive content. We further explore the capability of these models to generate explanations for instances of content violation, aiming to shed light on how policy guidelines are perceived when cultural and societal contexts change. We find that training on extensive media datasets successfully induced cultural awareness and resulted in improvements in handling content violations on a regional basis. Additionally, these advancements include the ability to provide explanations that align with the specific local norms and nuances, as evidenced by the annotators' preferences in the study we conducted. This multifaceted success reinforces the critical role of an adaptable content moderation approach in keeping pace with the ever-evolving nature of the content it oversees.

Auto DP-SGD: Dual Improvements of Privacy and Accuracy via Automatic Clipping Threshold and Noise Multiplier Estimation

  • paper_url: http://arxiv.org/abs/2312.02400
  • repo_url: None
  • paper_authors: Sai Venkatesh Chilukoti, Md Imran Hossen, Liqun Shan, Vijay Srinivas Tida, Xiai Hei
  • for: Protecting personally identifiable information in deep learning applications, where DP-SGD has become a popular method.
  • methods: Proposes Auto DP-SGD, which automates clipping threshold estimation based on the model's gradient norms, scales per-sample gradients without losing gradient information, and decays the noise multiplier automatically after every epoch, with closed-form expressions derived via the tCDP accountant.
  • results: Outperforms existing state-of-the-art DP-SGD methods in both privacy and accuracy across benchmark datasets; privacy can be further improved by lowering the scale factor and using learning rate schedulers without significantly reducing accuracy.
    Abstract DP-SGD has emerged as a popular method to protect personally identifiable information in deep learning applications. Unfortunately, DP-SGD's per-sample gradient clipping and uniform noise addition during training can significantly degrade model utility. To enhance the model's utility, researchers proposed various adaptive DP-SGD methods. However, we find that these techniques either leak more privacy or achieve lower accuracy than the traditional DP-SGD method, or lack evaluation on a complex dataset such as CIFAR100. To address these limitations, we propose Auto DP-SGD. Our method automates clipping threshold estimation based on the DL model's gradient norm and scales the gradients of each training sample without losing gradient information. This helps to improve the algorithm's utility while using a smaller privacy budget. To further improve accuracy, we introduce automatic noise multiplier decay mechanisms to decrease the noise multiplier after every epoch. Finally, we develop closed-form mathematical expressions using the tCDP accountant for automatic noise multiplier and automatic clipping threshold estimation. Through extensive experimentation, we demonstrate that Auto DP-SGD outperforms existing SOTA DP-SGD methods in privacy and accuracy on various benchmark datasets. We also show that privacy can be improved by lowering the scale factor and using learning rate schedulers without significantly reducing accuracy. Specifically, Auto DP-SGD, when used with a step noise multiplier, improves accuracy by 3.20, 1.57, 6.73, and 1.42 for the MNIST, CIFAR10, CIFAR100, and AG News Corpus datasets, respectively. Furthermore, it obtains a substantial reduction in the privacy budget of 94.9, 79.16, 67.36, and 53.37 for the corresponding data sets.
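A toy sketch of the two automatic mechanisms the abstract describes — a clipping threshold derived from the batch's per-sample gradient norms and a noise multiplier that decays after each epoch — is shown below. The mean-norm threshold and the 0.9 decay factor are illustrative assumptions, not the paper's closed-form tCDP-derived expressions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(per_sample_grads, noise_multiplier):
    """One DP-SGD update with an automatic clipping threshold."""
    norms = np.linalg.norm(per_sample_grads, axis=1)
    C = norms.mean()                                   # automatic threshold (assumed)
    scale = np.minimum(1.0, C / (norms + 1e-12))
    clipped = per_sample_grads * scale[:, None]        # per-sample clipping
    noise = rng.normal(0.0, noise_multiplier * C, size=per_sample_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_sample_grads)

noise_multiplier, decay = 1.0, 0.9                     # step decay (assumed factor)
for epoch in range(10):
    grads = rng.normal(size=(64, 100))                 # stand-in per-sample gradients
    update = dp_sgd_step(grads, noise_multiplier)
    noise_multiplier *= decay                          # decay after every epoch
```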