cs.CV - 2023-10-17

Holistic Parking Slot Detection with Polygon-Shaped Representations

  • paper_url: http://arxiv.org/abs/2310.11629
  • repo_url: None
  • paper_authors: Lihao Wang, Antonyo Musabini, Christel Leonet, Rachid Benmokhtar, Amaury Breheret, Chaima Yedes, Fabian Burger, Thomas Boulay, Xavier Perrotton
  • for: This paper proposes a one-step Holistic Parking Slot Network (HPS-Net) to detect vacant parking slots using a camera-based approach.
  • methods: The proposed method is a tailor-made adaptation of the You Only Look Once (YOLO)v4 algorithm that directly outputs the four vertex coordinates of the parking slot in the top-view domain. A novel regression loss function named polygon-corner Generalized Intersection over Union (GIoU) is proposed to optimize the polygon vertex positions and distinguish the entrance line.
  • results: Experiments show that HPS-Net detects various vacant parking slots with an F1-score of 0.92 on the internal Valeo Parking Slots Dataset (VPSD) and 0.99 on the public PS2.0 dataset. The method generalizes robustly across parking scenarios, such as indoor or paved ground, with a real-time detection speed of 17 FPS on Nvidia Drive AGX Xavier.
    Abstract Current parking slot detection in advanced driver-assistance systems (ADAS) primarily relies on ultrasonic sensors. This method has several limitations such as the need to scan the entire parking slot before detecting it, the incapacity of detecting multiple slots in a row, and the difficulty of classifying them. Due to the complex visual environment, vehicles are equipped with surround view camera systems to detect vacant parking slots. Previous research works in this field mostly use image-domain models to solve the problem. These two-stage approaches separate the 2D detection and 3D pose estimation steps using camera calibration. In this paper, we propose one-step Holistic Parking Slot Network (HPS-Net), a tailor-made adaptation of the You Only Look Once (YOLO)v4 algorithm. This camera-based approach directly outputs the four vertex coordinates of the parking slot in topview domain, instead of a bounding box in raw camera images. Several visible points and shapes can be proposed from different angles. A novel regression loss function named polygon-corner Generalized Intersection over Union (GIoU) for polygon vertex position optimization is also proposed to manage the slot orientation and to distinguish the entrance line. Experiments show that HPS-Net can detect various vacant parking slots with a F1-score of 0.92 on our internal Valeo Parking Slots Dataset (VPSD) and 0.99 on the public dataset PS2.0. It provides a satisfying generalization and robustness in various parking scenarios, such as indoor (F1: 0.86) or paved ground (F1: 0.91). Moreover, it achieves a real-time detection speed of 17 FPS on Nvidia Drive AGX Xavier. A demo video can be found at https://streamable.com/75j7sj.
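    The polygon-corner GIoU loss above extends box GIoU to four free vertices. Below is a minimal, non-differentiable sketch of the underlying geometric quantity using shapely; the function name, the use of the convex hull as the enclosing region, and the example coordinates are illustrative assumptions rather than the authors' exact formulation.

```python
# Hedged sketch of a polygon GIoU quantity, assuming shapely is available.
# GIoU(P, Q) = IoU(P, Q) - (area(C) - area(P u Q)) / area(C),
# where C is taken here as the convex hull of P and Q.
from shapely.geometry import Polygon
from shapely.ops import unary_union

def polygon_giou(vertices_pred, vertices_gt):
    """vertices_*: list of four (x, y) corners in top-view coordinates."""
    p, q = Polygon(vertices_pred), Polygon(vertices_gt)
    u = unary_union([p, q])
    inter, union = p.intersection(q).area, u.area
    iou = inter / union if union > 0 else 0.0
    hull = u.convex_hull.area                      # enclosing convex region
    return iou - (hull - union) / hull if hull > 0 else iou

# Example: a predicted slot slightly shifted from the ground truth.
gt = [(0, 0), (2.5, 0), (2.5, 5.0), (0, 5.0)]
pred = [(0.2, 0.1), (2.7, 0.1), (2.7, 5.1), (0.2, 5.1)]
loss = 1.0 - polygon_giou(pred, gt)                # lower is better
```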

High-Resolution Building and Road Detection from Sentinel-2

  • paper_url: http://arxiv.org/abs/2310.11622
  • repo_url: https://github.com/RudraxDave/UrbanizationDetection_RoadnBuilding
  • paper_authors: Wojciech Sirko, Emmanuel Asiedu Brempong, Juliana T. C. Marcos, Abigail Annkah, Abel Korme, Mohammed Alewi Hassen, Krishna Sapkota, Tomer Shekel, Abdoulaye Diack, Sella Nevo, Jason Hickey, John Quinn
  • for: This paper demonstrates how to use multiple 10m resolution Sentinel-2 images to generate 50cm resolution building and road segmentation masks.
  • methods: The authors use a “student” model to reproduce the predictions of a “teacher” model, which has access to high-resolution imagery.
  • results: The authors achieve 78.3% mIoU for building segmentation and R^2 = 0.91 for counting individual buildings, which are comparable to the performance of the high-resolution teacher model.
    Abstract Mapping buildings and roads automatically with remote sensing typically requires high-resolution imagery, which is expensive to obtain and often sparsely available. In this work we demonstrate how multiple 10 m resolution Sentinel-2 images can be used to generate 50 cm resolution building and road segmentation masks. This is done by training a `student' model with access to Sentinel-2 images to reproduce the predictions of a `teacher' model which has access to corresponding high-resolution imagery. While the predictions do not have all the fine detail of the teacher model, we find that we are able to retain much of the performance: for building segmentation we achieve 78.3% mIoU, compared to the high-resolution teacher model accuracy of 85.3% mIoU. We also describe a related method for counting individual buildings in a Sentinel-2 patch which achieves R^2 = 0.91 against true counts. This work opens up new possibilities for using freely available Sentinel-2 imagery for a range of tasks that previously could only be done with high-resolution satellite imagery.
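    The core training signal is distillation: a student that only sees Sentinel-2 stacks is trained to reproduce the predictions of a teacher with access to high-resolution imagery. A minimal PyTorch sketch of one such training step follows; the model interfaces, tensor shapes, and binary cross-entropy loss are placeholder assumptions, not details from the paper.

```python
# Hedged sketch of teacher-student distillation for segmentation masks.
# `student` / `teacher` and the tensor shapes are illustrative placeholders.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, s2_stack, hires_img, optimizer):
    """s2_stack: (B, T*C, H, W) stack of Sentinel-2 frames at 10 m resolution,
    hires_img: (B, 3, H', W') corresponding high-resolution imagery (teacher input)."""
    with torch.no_grad():
        target = torch.sigmoid(teacher(hires_img))   # teacher's soft building mask
    pred = student(s2_stack)                         # student predicts on Sentinel-2 only
    pred = F.interpolate(pred, size=target.shape[-2:], mode="bilinear",
                         align_corners=False)        # match teacher's output grid
    loss = F.binary_cross_entropy_with_logits(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```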

Classification of Safety Driver Attention During Autonomous Vehicle Operation

  • paper_url: http://arxiv.org/abs/2310.11608
  • repo_url: None
  • paper_authors: Santiago Gerling Konrad, Julie Stephany Berrio, Mao Shan, Favio Masson, Stewart Worrall
  • For: This paper aims to develop a system to monitor the alertness of vehicle operators in ADAS and AVs to ensure safe operation.
  • Methods: The proposed system uses an infrared camera to detect the driver’s head and calculate head orientation, and incorporates environmental data from the perception system to determine the driver’s attention to objects in the surroundings.
  • Results: The system was tested using data collected in Sydney, Australia, and was found to effectively determine the vehicle operator’s attention levels, enabling interventions such as warnings or reducing autonomous functionality as appropriate.
    Abstract Despite the continual advances in Advanced Driver Assistance Systems (ADAS) and the development of high-level autonomous vehicles (AV), there is a general consensus that for the short to medium term, there is a requirement for a human supervisor to handle the edge cases that inevitably arise. Given this requirement, it is essential that the state of the vehicle operator is monitored to ensure they are contributing to the vehicle's safe operation. This paper introduces a dual-source approach integrating data from an infrared camera facing the vehicle operator and vehicle perception systems to produce a metric for driver alertness in order to promote and ensure safe operator behaviour. The infrared camera detects the driver's head, enabling the calculation of head orientation, which is relevant as the head typically moves according to the individual's focus of attention. By incorporating environmental data from the perception system, it becomes possible to determine whether the vehicle operator observes objects in the surroundings. Experiments were conducted using data collected in Sydney, Australia, simulating AV operations in an urban environment. Our results demonstrate that the proposed system effectively determines a metric for the attention levels of the vehicle operator, enabling interventions such as warnings or reducing autonomous functionality as appropriate. This comprehensive solution shows promise in contributing to ADAS and AVs' overall safety and efficiency in a real-world setting.
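    A simple way to realize the described combination of head orientation and perception data is to test whether the head's yaw direction points toward an object's bearing within a tolerance. The sketch below uses flat 2D geometry in the vehicle frame; the field-of-view threshold and coordinate conventions are assumptions, not the paper's implementation.

```python
# Hedged sketch: is the operator's head oriented toward a perceived object?
import numpy as np

def attended_objects(head_yaw_rad, objects_xy, fov_deg=30.0):
    """head_yaw_rad: head yaw in the vehicle frame (0 = straight ahead).
    objects_xy: (N, 2) object positions in the same frame (x forward, y left)."""
    bearings = np.arctan2(objects_xy[:, 1], objects_xy[:, 0])
    diff = np.abs(np.angle(np.exp(1j * (bearings - head_yaw_rad))))  # wrap to [-pi, pi]
    return diff < np.deg2rad(fov_deg) / 2.0

objs = np.array([[10.0, 1.0], [5.0, -6.0]])     # one object ahead, one to the right
mask = attended_objects(head_yaw_rad=0.05, objects_xy=objs)
attention_ratio = mask.mean()                   # crude per-frame attention metric
```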

DIAR: Deep Image Alignment and Reconstruction using Swin Transformers

  • paper_url: http://arxiv.org/abs/2310.11605
  • repo_url: None
  • paper_authors: Monika Kwiatkowski, Simon Matern, Olaf Hellwich
  • for: This paper aims to build a deep learning pipeline that simultaneously aligns a sequence of distorted images and reconstructs them.
  • methods: The paper trains Swin transformer models on a custom dataset of image distortions with ground-truth homographies, and uses neural feature maps to detect relevant image content and reject outliers and artifacts.
  • results: Using neural feature maps together with the trained models, the paper achieves accurate alignment and reconstruction of distorted images.
    Abstract When taking images of some occluded content, one is often faced with the problem that every individual image frame contains unwanted artifacts, but a collection of images contains all relevant information if properly aligned and aggregated. In this paper, we attempt to build a deep learning pipeline that simultaneously aligns a sequence of distorted images and reconstructs them. We create a dataset that contains images with image distortions, such as lighting, specularities, shadows, and occlusion. We create perspective distortions with corresponding ground-truth homographies as labels. We use our dataset to train Swin transformer models to analyze sequential image data. The attention maps enable the model to detect relevant image content and differentiate it from outliers and artifacts. We further explore using neural feature maps as alternatives to classical key point detectors. The feature maps of trained convolutional layers provide dense image descriptors that can be used to find point correspondences between images. We utilize this to compute coarse image alignments and explore its limitations.
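    The idea of replacing classical keypoint detectors with dense neural feature maps can be sketched as nearest-neighbour matching of per-pixel descriptors followed by RANSAC homography estimation. In the sketch below, the feature extraction itself is assumed to have happened already, and the sampling stride and reprojection threshold are illustrative choices.

```python
# Hedged sketch: coarse alignment from dense feature maps via nearest-neighbour
# matching and RANSAC homography estimation (OpenCV).
import numpy as np
import cv2

def coarse_align(feat_a, feat_b, stride=8):
    """feat_*: (H, W, D) dense descriptors from a trained convolutional layer."""
    H, W, D = feat_a.shape
    ys, xs = np.mgrid[0:H:stride, 0:W:stride]
    pts_a = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    desc_a = feat_a[pts_a[:, 1].astype(int), pts_a[:, 0].astype(int)]
    flat_b = feat_b.reshape(-1, D)
    # dot-product similarity (descriptors assumed L2-normalized beforehand)
    idx = np.argmax(desc_a @ flat_b.T, axis=1)
    pts_b = np.stack([idx % W, idx // W], axis=1).astype(np.float32)
    H_ab, inliers = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, 3.0)
    return H_ab, inliers
```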

Learning Neural Implicit through Volume Rendering with Attentive Depth Fusion Priors

  • paper_url: http://arxiv.org/abs/2310.11598
  • repo_url: https://github.com/MachinePerceptionLab/Attentive_DFPrior
  • paper_authors: Pengchong Hu, Zhizhong Han
  • for: This work aims to improve the accuracy of 3D reconstruction from multi-view images by learning neural implicit representations, using an attentive depth fusion prior to address incomplete depth at holes and occluded structures.
  • methods: We propose learning neural implicit representations from multi-view RGBD images through volume rendering with an attentive depth fusion prior. A novel attention mechanism lets the neural network directly use the depth fusion prior when learning the implicit function.
  • results: The method outperforms the latest neural implicit methods on widely used benchmarks; experiments show that it handles incomplete depth at holes and occluded structures better and improves 3D reconstruction accuracy.
    Abstract Learning neural implicit representations has achieved remarkable performance in 3D reconstruction from multi-view images. Current methods use volume rendering to render implicit representations into either RGB or depth images that are supervised by multi-view ground truth. However, rendering a view each time suffers from incomplete depth at holes and unawareness of occluded structures from the depth supervision, which severely affects the accuracy of geometry inference via volume rendering. To resolve this issue, we propose to learn neural implicit representations from multi-view RGBD images through volume rendering with an attentive depth fusion prior. Our prior allows neural networks to perceive coarse 3D structures from the Truncated Signed Distance Function (TSDF) fused from all depth images available for rendering. The TSDF enables accessing the missing depth at holes on one depth image and the occluded parts that are invisible from the current view. By introducing a novel attention mechanism, we allow neural networks to directly use the depth fusion prior with the inferred occupancy as the learned implicit function. Our attention mechanism works with either a one-time fused TSDF that represents a whole scene or an incrementally fused TSDF that represents a partial scene in the context of Simultaneous Localization and Mapping (SLAM). Our evaluations on widely used benchmarks including synthetic and real-world scans show our superiority over the latest neural implicit methods. Project page: https://machineperceptionlab.github.io/Attentive_DF_Prior/
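    The attentive depth fusion prior builds on a TSDF fused from all available depth images. A minimal numpy sketch of the standard TSDF fusion update over a voxel grid is shown below; the truncation distance, constant per-frame weight, and pinhole camera model are assumptions for illustration.

```python
# Hedged sketch of TSDF fusion from posed depth maps over a dense voxel grid.
import numpy as np

def fuse_depth(tsdf, weights, voxels_w, depth, K, T_wc, trunc=0.05):
    """voxels_w: (N, 3) voxel centers in world frame; tsdf/weights: (N,) arrays.
    depth: (H, W) depth map, K: 3x3 intrinsics, T_wc: 4x4 world-to-camera pose."""
    pts_c = (T_wc[:3, :3] @ voxels_w.T + T_wc[:3, 3:4]).T        # to camera frame
    z = pts_c[:, 2]
    z_safe = np.where(z > 1e-6, z, 1.0)                           # avoid divide-by-zero
    uv = (K @ pts_c.T).T
    u = (uv[:, 0] / z_safe).astype(int)
    v = (uv[:, 1] / z_safe).astype(int)
    H, W = depth.shape
    valid = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d = np.where(valid, depth[np.clip(v, 0, H - 1), np.clip(u, 0, W - 1)], 0.0)
    sdf = d - z                                                    # signed distance along ray
    valid &= (d > 0) & (sdf > -trunc)
    tsdf_new = np.clip(sdf / trunc, -1.0, 1.0)
    w_new = 1.0                                                    # constant per-frame weight
    tsdf[valid] = (tsdf[valid] * weights[valid] + tsdf_new[valid] * w_new) / (weights[valid] + w_new)
    weights[valid] += w_new
    return tsdf, weights
```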

  • paper_url: http://arxiv.org/abs/2310.11577
  • repo_url: None
  • paper_authors: Mahsa Dibaji, Neha Gianchandani, Akhil Nair, Mansi Singhal, Roberto Souza, Mariana Bento
  • for: The goal of this paper is to study bias and fairness of machine learning models across different sex subgroups.
  • methods: The authors use machine learning models based on brain magnetic resonance imaging data, trained and evaluated under different experimental designs.
  • results: The study finds performance disparities when models are trained on different sex subgroups and datasets, and that such bias can lead to unfair model decisions across sex subgroups.
    Abstract While utilizing machine learning models, one of the most crucial aspects is how bias and fairness affect model outcomes for diverse demographics. This becomes especially relevant in the context of machine learning for medical imaging applications as these models are increasingly being used for diagnosis and treatment planning. In this paper, we study biases related to sex when developing a machine learning model based on brain magnetic resonance images (MRI). We investigate the effects of sex by performing brain age prediction considering different experimental designs: model trained using only female subjects, only male subjects and a balanced dataset. We also perform evaluation on multiple MRI datasets (Calgary-Campinas(CC359) and CamCAN) to assess the generalization capability of the proposed models. We found disparities in the performance of brain age prediction models when trained on distinct sex subgroups and datasets, in both final predictions and decision making (assessed using interpretability models). Our results demonstrated variations in model generalizability across sex-specific subgroups, suggesting potential biases in models trained on unbalanced datasets. This underlines the critical role of careful experimental design in generating fair and reliable outcomes.

Learning Lens Blur Fields

  • paper_url: http://arxiv.org/abs/2310.11535
  • repo_url: None
  • paper_authors: Esther Y. H. Lin, Zhecheng Wang, Rebecca Lin, Daniel Miau, Florian Kainz, Jiawen Chen, Xuaner Cecilia Zhang, David B. Lindell, Kiriakos N. Kutulakos
  • for: The paper is written to address the challenge of modeling optical blur in modern cameras with complex optical elements, and to introduce a high-dimensional neural representation of the blur field.
  • methods: The paper proposes a practical method for acquiring the lens blur field, which is a multilayer perceptron (MLP) designed to capture variations of the lens 2D point spread function over image plane location, focus setting, and depth. The representation models the combined effects of defocus, diffraction, aberration, and accounts for sensor features such as pixel color filters and pixel-specific micro-lenses.
  • results: The paper shows that the acquired 5D blur fields are expressive and accurate enough to reveal differences in optical behavior of smartphone devices of the same make and model, and provides a first-of-its-kind dataset of 5D blur fields for smartphone cameras, camera bodies equipped with a variety of lenses, etc.
    Abstract Optical blur is an inherent property of any lens system and is challenging to model in modern cameras because of their complex optical elements. To tackle this challenge, we introduce a high-dimensional neural representation of blur, the lens blur field, and a practical method for acquiring it. The lens blur field is a multilayer perceptron (MLP) designed to (1) accurately capture variations of the lens 2D point spread function over image plane location, focus setting and, optionally, depth and (2) represent these variations parametrically as a single, sensor-specific function. The representation models the combined effects of defocus, diffraction, aberration, and accounts for sensor features such as pixel color filters and pixel-specific micro-lenses. To learn the real-world blur field of a given device, we formulate a generalized non-blind deconvolution problem that directly optimizes the MLP weights using a small set of focal stacks as the only input. We also provide a first-of-its-kind dataset of 5D blur fields, for smartphone cameras, camera bodies equipped with a variety of lenses, etc. Lastly, we show that acquired 5D blur fields are expressive and accurate enough to reveal, for the first time, differences in optical behavior of smartphone devices of the same make and model.
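    At its core, the lens blur field is an MLP from image-plane location, focus setting and (optionally) depth to a local PSF. A compact PyTorch sketch of such a mapping is given below; the layer sizes, the 4D input, and the softmax-normalized flattened kernel are illustrative assumptions, not the authors' architecture.

```python
# Hedged sketch of a per-sensor blur-field MLP: (x, y, focus, depth) -> K x K PSF.
import torch
import torch.nn as nn

class BlurFieldMLP(nn.Module):
    def __init__(self, kernel_size=21, hidden=256):
        super().__init__()
        self.k = kernel_size
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, kernel_size * kernel_size),
        )

    def forward(self, coords):
        """coords: (B, 4) = normalized (x, y, focus, depth)."""
        psf = self.net(coords).view(-1, self.k, self.k)
        # constrain each predicted kernel to be non-negative and sum to 1
        psf = torch.softmax(psf.flatten(1), dim=1).view_as(psf)
        return psf

model = BlurFieldMLP()
psf = model(torch.tensor([[0.1, -0.3, 0.5, 0.2]]))   # one query location
```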

GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment

  • paper_url: http://arxiv.org/abs/2310.11513
  • repo_url: https://github.com/djghosh13/geneval
  • paper_authors: Dhruba Ghosh, Hanna Hajishirzi, Ludwig Schmidt
  • for: Evaluating fine-grained properties of text-to-image generative models, such as object co-occurrence, position, count, and color.
  • methods: Leverages existing object detection models to evaluate text-to-image models on a variety of generation tasks, and links other discriminative vision models into the pipeline to further verify properties such as object color.
  • results: The study finds that current text-to-image models show significant progress on these tasks but still lack complex capabilities such as spatial relations and attribute binding.
    Abstract Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given human evaluation is expensive and difficult to scale, automated methods are critical for evaluating the increasingly large number of new models. However, most current automated evaluation metrics like FID or CLIPScore only offer a holistic measure of image quality or image-text alignment, and are unsuited for fine-grained or instance-level analysis. In this paper, we introduce GenEval, an object-focused framework to evaluate compositional image properties such as object co-occurrence, position, count, and color. We show that current object detection models can be leveraged to evaluate text-to-image models on a variety of generation tasks with strong human agreement, and that other discriminative vision models can be linked to this pipeline to further verify properties like object color. We then evaluate several open-source text-to-image models and analyze their relative generative capabilities on our benchmark. We find that recent models demonstrate significant improvement on these tasks, though they are still lacking in complex capabilities such as spatial relations and attribute binding. Finally, we demonstrate how GenEval might be used to help discover existing failure modes, in order to inform development of the next generation of text-to-image models. Our code to run the GenEval framework is publicly available at https://github.com/djghosh13/geneval.
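    GenEval's object-focused evaluation can be illustrated with a simple count check: run an object detector on the generated image and compare the detected instances of each prompted class with the expected count. The sketch below assumes a detector that returns (label, score) pairs; it is a toy stand-in for the actual pipeline in the linked repository.

```python
# Hedged sketch of an object-count check in the spirit of GenEval.
# The detector itself is a placeholder; only its (label, score) output is assumed.
from collections import Counter

def count_check(detections, expected_counts, score_thr=0.5):
    """detections: list of (label, score); expected_counts: e.g. {"dog": 2, "ball": 1}."""
    found = Counter(lbl for lbl, s in detections if s >= score_thr)
    return all(found.get(lbl, 0) == n for lbl, n in expected_counts.items())

dets = [("dog", 0.92), ("dog", 0.81), ("ball", 0.77), ("ball", 0.31)]
passed = count_check(dets, {"dog": 2, "ball": 1})    # True for this toy example
```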

DELIFFAS: Deformable Light Fields for Fast Avatar Synthesis

  • paper_url: http://arxiv.org/abs/2310.11449
  • repo_url: None
  • paper_authors: Youngjoong Kwon, Lingjie Liu, Henry Fuchs, Marc Habermann, Christian Theobalt
  • for: Generating high-quality, controllable digital human avatars, a long-standing problem in graphics and vision.
  • methods: Proposes a new method, DELIFFAS, which parameterizes the appearance of the human as a surface light field attached to a controllable and deforming human mesh model.
  • results: Achieves state-of-the-art synthesis results and fast inference through a carefully designed human representation and supervision strategy.
    Abstract Generating controllable and photorealistic digital human avatars is a long-standing and important problem in Vision and Graphics. Recent methods have shown great progress in terms of either photorealism or inference speed while the combination of the two desired properties still remains unsolved. To this end, we propose a novel method, called DELIFFAS, which parameterizes the appearance of the human as a surface light field that is attached to a controllable and deforming human mesh model. At the core, we represent the light field around the human with a deformable two-surface parameterization, which enables fast and accurate inference of the human appearance. This allows perceptual supervision on the full image compared to previous approaches that could only supervise individual pixels or small patches due to their slow runtime. Our carefully designed human representation and supervision strategy leads to state-of-the-art synthesis results and inference time. The video results and code are available at https://vcai.mpi-inf.mpg.de/projects/DELIFFAS.

4K4D: Real-Time 4D View Synthesis at 4K Resolution

  • paper_url: http://arxiv.org/abs/2310.11448
  • repo_url: https://github.com/zju3dv/4K4D
  • paper_authors: Zhen Xu, Sida Peng, Haotong Lin, Guangzhao He, Jiaming Sun, Yujun Shen, Hujun Bao, Xiaowei Zhou
  • for: High-fidelity and real-time view synthesis of dynamic 3D scenes at 4K resolution.
  • methods: Uses a 4D point cloud representation with hardware rasterization acceleration, together with a novel hybrid appearance model that improves rendering quality while preserving efficiency.
  • results: Achieves rendering about 30x faster than previous methods, at 1080p on the DNA-Rendering dataset and at 4K on the ENeRF-Outdoor dataset, while reaching state-of-the-art rendering quality.
    Abstract This paper targets high-fidelity and real-time view synthesis of dynamic 3D scenes at 4K resolution. Recently, some methods on dynamic view synthesis have shown impressive rendering quality. However, their speed is still limited when rendering high-resolution images. To overcome this problem, we propose 4K4D, a 4D point cloud representation that supports hardware rasterization and enables unprecedented rendering speed. Our representation is built on a 4D feature grid so that the points are naturally regularized and can be robustly optimized. In addition, we design a novel hybrid appearance model that significantly boosts the rendering quality while preserving efficiency. Moreover, we develop a differentiable depth peeling algorithm to effectively learn the proposed model from RGB videos. Experiments show that our representation can be rendered at over 400 FPS on the DNA-Rendering dataset at 1080p resolution and 80 FPS on the ENeRF-Outdoor dataset at 4K resolution using an RTX 4090 GPU, which is 30x faster than previous methods and achieves the state-of-the-art rendering quality. Our project page is available at https://zju3dv.github.io/4k4d/.

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

  • paper_url: http://arxiv.org/abs/2310.11440
  • repo_url: https://github.com/evalcrafter/EvalCrafter
  • paper_authors: Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, Ying Shan
  • for: This paper proposes a new framework and pipeline for evaluating video generation models, improving the precision and comprehensiveness of existing evaluation methods.
  • methods: The authors build a new prompt list with the help of a large language model and use around 18 objective metrics to evaluate state-of-the-art video generative models; they also propose an opinion-alignment method that fits coefficients so the objective metrics reflect user opinions.
  • results: The proposed evaluation correlates better with user opinions than simply averaging the metrics, demonstrating the effectiveness of the new evaluation method.
    Abstract The vision and language generative models have been overgrown in recent years. For video generation, various open-sourced models and public-available services are released for generating high-visual quality videos. However, these methods often use a few academic metrics, for example, FVD or IS, to evaluate the performance. We argue that it is hard to judge the large conditional generative models from the simple metrics since these models are often trained on very large datasets with multi-aspect abilities. Thus, we propose a new framework and pipeline to exhaustively evaluate the performance of the generated videos. To achieve this, we first conduct a new prompt list for text-to-video generation by analyzing the real-world prompt list with the help of the large language model. Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmarks, in terms of visual qualities, content qualities, motion qualities, and text-caption alignment with around 18 objective metrics. To obtain the final leaderboard of the models, we also fit a series of coefficients to align the objective metrics to the users' opinions. Based on the proposed opinion alignment method, our final score shows a higher correlation than simply averaging the metrics, showing the effectiveness of the proposed evaluation method.
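    The final leaderboard fits a set of coefficients that align the objective metrics with user opinions. An ordinary-least-squares version of that idea is sketched below; the exact regression form and any regularization used in the paper may differ.

```python
# Hedged sketch: fit per-metric coefficients so that a weighted combination of
# objective metrics best matches user opinion scores (ordinary least squares).
import numpy as np

def fit_opinion_alignment(metric_matrix, user_scores):
    """metric_matrix: (num_models, num_metrics); user_scores: (num_models,)."""
    X = np.hstack([metric_matrix, np.ones((metric_matrix.shape[0], 1))])  # add bias term
    coef, *_ = np.linalg.lstsq(X, user_scores, rcond=None)
    return coef

def aligned_score(metrics_row, coef):
    return float(np.dot(np.append(metrics_row, 1.0), coef))

M = np.array([[0.7, 0.5, 0.9], [0.6, 0.8, 0.4]])    # toy metric values for two models
u = np.array([4.1, 3.4])                             # toy mean user ratings
w = fit_opinion_alignment(M, u)
final = aligned_score(M[0], w)                       # opinion-aligned score for model 0
```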

Revisiting Map Relations for Unsupervised Non-Rigid Shape Matching

  • paper_url: http://arxiv.org/abs/2310.11420
  • repo_url: None
  • paper_authors: Dongliang Cao, Paul Roetzer, Florian Bernard
  • for: Non-rigid 3D shape matching.
  • methods: Proposes a novel unsupervised learning approach, including a self-adaptive functional map solver and a vertex-wise contrastive loss.
  • results: Substantially outperforms previous state-of-the-art methods across a range of challenging scenarios, including non-isometry, topological noise, and partiality.
    Abstract We propose a novel unsupervised learning approach for non-rigid 3D shape matching. Our approach improves upon recent state-of-the art deep functional map methods and can be applied to a broad range of different challenging scenarios. Previous deep functional map methods mainly focus on feature extraction and aim exclusively at obtaining more expressive features for functional map computation. However, the importance of the functional map computation itself is often neglected and the relationship between the functional map and point-wise map is underexplored. In this paper, we systematically investigate the coupling relationship between the functional map from the functional map solver and the point-wise map based on feature similarity. To this end, we propose a self-adaptive functional map solver to adjust the functional map regularisation for different shape matching scenarios, together with a vertex-wise contrastive loss to obtain more discriminative features. Using different challenging datasets (including non-isometry, topological noise and partiality), we demonstrate that our method substantially outperforms previous state-of-the-art methods.
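    The coupling the paper studies, between the functional map and the point-wise map, shows up in the standard conversion step: given truncated eigenbases and a functional map C, point correspondences are recovered by nearest-neighbour search between spectrally embedded vertices. The numpy sketch below shows that conventional conversion, not the paper's self-adaptive solver, and the map-direction convention is an assumption.

```python
# Hedged sketch: recover a point-wise map from a functional map C by
# nearest-neighbour search in the spectral domain.
import numpy as np
from scipy.spatial import cKDTree

def fmap_to_pointmap(C, phi_x, phi_y):
    """C: (k, k) functional map taking spectral coefficients on X to Y.
    phi_x: (n_x, k), phi_y: (n_y, k) truncated Laplace-Beltrami eigenbases.
    Returns, for every vertex of Y, the index of its match on X
    (map-direction conventions vary across papers)."""
    tree = cKDTree(phi_x)              # rows of phi_x are spectral embeddings of X
    _, nn = tree.query(phi_y @ C)      # align Y's embedding through C, then NN search
    return nn
```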

VcT: Visual change Transformer for Remote Sensing Image Change Detection

  • paper_url: http://arxiv.org/abs/2310.11417
  • repo_url: https://github.com/event-ahu/vct_remote_sensing_change_detection
  • paper_authors: Bo Jiang, Zitian Wang, Xixi Wang, Ziyan Zhang, Lan Chen, Xiao Wang, Bin Luo
  • for: This paper proposes a Visual change Transformer (VcT) model for the visual change detection problem.
  • methods: The model first extracts feature maps for the image pair with a shared backbone network, then models the structured information of the pair with a graph neural network, and mines top-K reliable tokens that are refined with self/cross-attention mechanisms.
  • results: Extensive experiments validate the effectiveness of the proposed VcT model.
    Abstract Existing visual change detectors usually adopt CNNs or Transformers for feature representation learning and focus on learning effective representation for the changed regions between images. Although good performance can be obtained by enhancing the features of the change regions, however, these works are still limited mainly due to the ignorance of mining the unchanged background context information. It is known that one main challenge for change detection is how to obtain the consistent representations for two images involving different variations, such as spatial variation, sunlight intensity, etc. In this work, we demonstrate that carefully mining the common background information provides an important cue to learn the consistent representations for the two images which thus obviously facilitates the visual change detection problem. Based on this observation, we propose a novel Visual change Transformer (VcT) model for visual change detection problem. To be specific, a shared backbone network is first used to extract the feature maps for the given image pair. Then, each pixel of feature map is regarded as a graph node and the graph neural network is proposed to model the structured information for coarse change map prediction. Top-K reliable tokens can be mined from the map and refined by using the clustering algorithm. Then, these reliable tokens are enhanced by first utilizing self/cross-attention schemes and then interacting with original features via an anchor-primary attention learning module. Finally, the prediction head is proposed to get a more accurate change map. Extensive experiments on multiple benchmark datasets validated the effectiveness of our proposed VcT model.

A voxel-level approach to brain age prediction: A method to assess regional brain aging

  • paper_url: http://arxiv.org/abs/2310.11385
  • repo_url: https://github.com/nehagianchandani/voxel-level-brain-age-prediction
  • paper_authors: Neha Gianchandani, Mahsa Dibaji, Johanna Ospel, Fernando Vega, Mariana Bento, M. Ethan MacDonald, Roberto Souza
  • For: The goal is to predict brain age from T1-weighted MR images at the voxel level, giving fine-grained, localized brain age estimates that help in understanding differences in brain aging trajectories between healthy and diseased populations.
  • Methods: A deep learning-based multitask model for voxel-level brain age prediction, which outperforms models in the existing literature and yields valuable clinical insights when applied to both healthy and diseased populations.
  • Results: Aging trajectories differ between healthy subjects and those with neurological disorders such as dementia, and more specifically Alzheimer's disease, with region-level disparities; these findings offer valuable clinical insight into brain aging and disease progression.
    Abstract Brain aging is a regional phenomenon, a facet that remains relatively under-explored within the realm of brain age prediction research using machine learning methods. Voxel-level predictions can provide localized brain age estimates that can provide granular insights into the regional aging processes. This is essential to understand the differences in aging trajectories in healthy versus diseased subjects. In this work, a deep learning-based multitask model is proposed for voxel-level brain age prediction from T1-weighted magnetic resonance images. The proposed model outperforms the models existing in the literature and yields valuable clinical insights when applied to both healthy and diseased populations. Regional analysis is performed on the voxel-level brain age predictions to understand aging trajectories of known anatomical regions in the brain and show that there exist disparities in regional aging trajectories of healthy subjects compared to ones with underlying neurological disorders such as Dementia and more specifically, Alzheimer's disease. Our code is available at https://github.com/nehagianchandani/Voxel-level-brain-age-prediction.

Towards Generalizable Multi-Camera 3D Object Detection via Perspective Debiasing

  • paper_url: http://arxiv.org/abs/2310.11346
  • repo_url: None
  • paper_authors: Hao Lu, Yunpeng Zhang, Qing Lian, Dalong Du, Yingcong Chen
  • for: This work aims to improve the accuracy and reliability of multi-camera 3D object detection (MC3D-Det), addressing failures caused by unfamiliar test environments.
  • methods: We propose aligning 3D detections with 2D camera-plane results to ensure stable and accurate detection. The framework is built on perspective debiasing, which helps learn features resilient to viewpoint changes: multi-view maps are first rendered from BEV features, and their perspective bias is then rectified using implicit foreground volumes that bridge the camera and BEV planes.
  • results: The method improves detection accuracy and reliability across viewpoints and environments, with significant gains on Domain Generalization (DG) and Unsupervised Domain Adaptation (UDA) tasks. We also show it achieves satisfactory results on real data when trained only on virtual datasets.
    Abstract Detecting objects in 3D space using multiple cameras, known as Multi-Camera 3D Object Detection (MC3D-Det), has gained prominence with the advent of bird's-eye view (BEV) approaches. However, these methods often struggle when faced with unfamiliar testing environments due to the lack of diverse training data encompassing various viewpoints and environments. To address this, we propose a novel method that aligns 3D detection with 2D camera plane results, ensuring consistent and accurate detections. Our framework, anchored in perspective debiasing, helps the learning of features resilient to domain shifts. In our approach, we render diverse view maps from BEV features and rectify the perspective bias of these maps, leveraging implicit foreground volumes to bridge the camera and BEV planes. This two-step process promotes the learning of perspective- and context-independent features, crucial for accurate object detection across varying viewpoints, camera parameters and environment conditions. Notably, our model-agnostic approach preserves the original network structure without incurring additional inference costs, facilitating seamless integration across various models and simplifying deployment. Furthermore, we also show our approach achieves satisfactory results in real data when trained only with virtual datasets, eliminating the need for real scene annotations. Experimental results on both Domain Generalization (DG) and Unsupervised Domain Adaptation (UDA) clearly demonstrate its effectiveness. Our code will be released.

Towards Generic Semi-Supervised Framework for Volumetric Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.11320
  • repo_url: https://github.com/xmed-lab/GenericSSL
  • paper_authors: Haonan Wang, Xiaomeng Li
  • For: The goal is to develop a generic semi-supervised learning (SSL) framework that can handle multiple settings, including SSL, Unsupervised Domain Adaptation (UDA), and Semi-supervised Domain Generalization (SemiDG).
  • Methods: The framework uses an Aggregating & Decoupling design: a diffusion encoder constructs a common knowledge set by extracting distribution-invariant features from information aggregated across multiple distributions/domains, while three decoders decouple training with labeled and unlabeled data to avoid over-fitting to the labeled data.
  • Results: Evaluated on four benchmark datasets and compared with existing methods, the framework shows notable improvements across all four settings, indicating its potential to tackle more challenging SSL scenarios.
    Abstract Volume-wise labeling in 3D medical images is a time-consuming task that requires expertise. As a result, there is growing interest in using semi-supervised learning (SSL) techniques to train models with limited labeled data. However, the challenges and practical applications extend beyond SSL to settings such as unsupervised domain adaptation (UDA) and semi-supervised domain generalization (SemiDG). This work aims to develop a generic SSL framework that can handle all three settings. We identify two main obstacles to achieving this goal in the existing SSL framework: 1) the weakness of capturing distribution-invariant features; and 2) the tendency for unlabeled data to be overwhelmed by labeled data, leading to over-fitting to the labeled data during training. To address these issues, we propose an Aggregating & Decoupling framework. The aggregating part consists of a Diffusion encoder that constructs a common knowledge set by extracting distribution-invariant features from aggregated information from multiple distributions/domains. The decoupling part consists of three decoders that decouple the training process with labeled and unlabeled data, thus avoiding over-fitting to labeled data, specific domains and classes. We evaluate our proposed framework on four benchmark datasets for SSL, Class-imbalanced SSL, UDA and SemiDG. The results showcase notable improvements compared to state-of-the-art methods across all four settings, indicating the potential of our framework to tackle more challenging SSL scenarios. Code and models are available at: https://github.com/xmed-lab/GenericSSL.

Multi Self-supervised Pre-fine-tuned Transformer Fusion for Better Intelligent Transportation Detection

  • paper_url: http://arxiv.org/abs/2310.11307
  • repo_url: None
  • paper_authors: Juwu Zheng, Jiangtao Ren
  • for: This work aims to improve detection accuracy in intelligent transportation systems by addressing two limitations of existing detection methods: the gap between knowledge pre-trained on large-scale datasets and the knowledge required by the target task, and the single-source learning pattern of most detection models, which limits learning ability.
  • methods: We propose a Multi Self-supervised Pre-fine-tuned Transformer Fusion (MSPTF) network with two steps: self-supervised pre-fine-tuning for domain knowledge learning, and multi-model fusion for target-task learning. The first step introduces self-supervised learning into transformer pre-fine-tuning to reduce data cost and narrow the knowledge gap between the pre-trained model and the target task. The second step accounts for feature differences between model architectures and pre-fine-tune tasks through a Multi-model Semantic Consistency Cross-attention Fusion (MSCCF) network, which combines features from different transformer models by considering channel and feature-vector semantic consistency, yielding more complete and appropriate fusion features for detection.
  • results: On a vehicle recognition dataset and a road disease detection dataset, the method improves over the baseline by 1.1%, 5.5%, and 4.2%, and over the state of the art (sota) by 0.7%, 1.8%, and 1.7%, demonstrating its effectiveness.
    Abstract Intelligent transportation system combines advanced information technology to provide intelligent services such as monitoring, detection, and early warning for modern transportation. Intelligent transportation detection is the cornerstone of many intelligent traffic services by identifying task targets through object detection methods. However existing detection methods in intelligent transportation are limited by two aspects. First, there is a difference between the model knowledge pre-trained on large-scale datasets and the knowledge required for target task. Second, most detection models follow the pattern of single-source learning, which limits the learning ability. To address these problems, we propose a Multi Self-supervised Pre-fine-tuned Transformer Fusion (MSPTF) network, consisting of two steps: unsupervised pre-fine-tune domain knowledge learning and multi-model fusion target task learning. In the first step, we introduced self-supervised learning methods into transformer model pre-fine-tune which could reduce data costs and alleviate the knowledge gap between pre-trained model and target task. In the second step, we take feature information differences between different model architectures and different pre-fine-tune tasks into account and propose Multi-model Semantic Consistency Cross-attention Fusion (MSCCF) network to combine different transformer model features by considering channel semantic consistency and feature vector semantic consistency, which obtain more complete and proper fusion features for detection task. We experimented the proposed method on vehicle recognition dataset and road disease detection dataset and achieved 1.1%, 5.5%, 4.2% improvement compared with baseline and 0.7%, 1.8%, 1.7% compared with sota, which proved the effectiveness of our method.

CorrTalk: Correlation Between Hierarchical Speech and Facial Activity Variances for 3D Animation

  • paper_url: http://arxiv.org/abs/2310.11295
  • repo_url: None
  • paper_authors: Zhaojie Chu, Kailing Guo, Xiaofen Xing, Yilin Lan, Bolun Cai, Xiangmin Xu
  • for: This paper targets speech-driven 3D facial animation, addressing the challenges of this cross-modal task.
  • methods: It proposes a new framework, CorrTalk, which establishes the temporal correlation between hierarchical speech features and facial activities of different intensities across distinct regions, and uses a dual-branch decoding architecture to synthesize strong and weak facial activity synchronously, guaranteeing wider-intensity facial animation synthesis.
  • results: Extensive experiments and a user study show that CorrTalk performs excellently, generating expressions of varying intensities while preserving lip-sync and plausible facial expressions.
    Abstract Speech-driven 3D facial animation is a challenging cross-modal task that has attracted growing research interest. During speaking activities, the mouth displays strong motions, while the other facial regions typically demonstrate comparatively weak activity levels. Existing approaches often simplify the process by directly mapping single-level speech features to the entire facial animation, which overlook the differences in facial activity intensity leading to overly smoothed facial movements. In this study, we propose a novel framework, CorrTalk, which effectively establishes the temporal correlation between hierarchical speech features and facial activities of different intensities across distinct regions. A novel facial activity intensity metric is defined to distinguish between strong and weak facial activity, obtained by computing the short-time Fourier transform of facial vertex displacements. Based on the variances in facial activity, we propose a dual-branch decoding framework to synchronously synthesize strong and weak facial activity, which guarantees wider intensity facial animation synthesis. Furthermore, a weighted hierarchical feature encoder is proposed to establish temporal correlation between hierarchical speech features and facial activity at different intensities, which ensures lip-sync and plausible facial expressions. Extensive qualitatively and quantitatively experiments as well as a user study indicate that our CorrTalk outperforms existing state-of-the-art methods. The source code and supplementary video are publicly available at: https://zjchu.github.io/projects/CorrTalk/
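    The facial activity intensity metric is obtained from the short-time Fourier transform of facial vertex displacements. A small numpy/scipy sketch of one plausible reading of that metric follows; the window length and the averaging of the STFT magnitude over frequencies and time are assumptions.

```python
# Hedged sketch: per-vertex facial activity intensity via the STFT of
# vertex displacement trajectories.
import numpy as np
from scipy.signal import stft

def activity_intensity(vertex_traj, fps=30, win=16):
    """vertex_traj: (T, V, 3) vertex positions over T frames."""
    disp = np.linalg.norm(np.diff(vertex_traj, axis=0), axis=-1)   # (T-1, V) displacement magnitude
    _, _, Z = stft(disp, fs=fps, nperseg=win, axis=0)              # (freq, V, time) complex STFT
    return np.abs(Z).mean(axis=(0, 2))                             # (V,) mean spectral magnitude per vertex

# Vertices above a threshold would be treated as "strong" activity (e.g. the mouth),
# the rest as "weak", driving the dual-branch decoding described above.
```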

Self-Supervised 3D Scene Flow Estimation and Motion Prediction using Local Rigidity Prior

  • paper_url: http://arxiv.org/abs/2310.11284
  • repo_url: None
  • paper_authors: Ruibo Li, Chi Zhang, Zhe Wang, Chunhua Shen, Guosheng Lin
  • for: investigate self-supervised 3D scene flow estimation and class-agnostic motion prediction on point clouds
  • methods: build pseudo scene flow labels through piecewise rigid motion estimation and validate with a validity mask
  • results: achieve new state-of-the-art performance in self-supervised scene flow learning and outperform previous state-of-the-art self-supervised methods on nuScenes dataset.
    Abstract In this article, we investigate self-supervised 3D scene flow estimation and class-agnostic motion prediction on point clouds. A realistic scene can be well modeled as a collection of rigidly moving parts, therefore its scene flow can be represented as a combination of the rigid motion of these individual parts. Building upon this observation, we propose to generate pseudo scene flow labels for self-supervised learning through piecewise rigid motion estimation, in which the source point cloud is decomposed into local regions and each region is treated as rigid. By rigidly aligning each region with its potential counterpart in the target point cloud, we obtain a region-specific rigid transformation to generate its pseudo flow labels. To mitigate the impact of potential outliers on label generation, when solving the rigid registration for each region, we alternately perform three steps: establishing point correspondences, measuring the confidence for the correspondences, and updating the rigid transformation based on the correspondences and their confidence. As a result, confident correspondences will dominate label generation and a validity mask will be derived for the generated pseudo labels. By using the pseudo labels together with their validity mask for supervision, models can be trained in a self-supervised manner. Extensive experiments on FlyingThings3D and KITTI datasets demonstrate that our method achieves new state-of-the-art performance in self-supervised scene flow learning, without any ground truth scene flow for supervision, even performing better than some supervised counterparts. Additionally, our method is further extended to class-agnostic motion prediction and significantly outperforms previous state-of-the-art self-supervised methods on nuScenes dataset.
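    Pseudo-label generation rests on rigidly aligning each local region with its counterpart in the target cloud. The weighted rigid (Kabsch) fit at the core of that step is sketched below; the alternating correspondence and confidence estimation that the paper performs around it is omitted.

```python
# Hedged sketch: weighted rigid (Kabsch) alignment of a source region to its
# matched target points; the resulting per-region pseudo flow is (R p + t) - p.
import numpy as np

def weighted_rigid_fit(src, dst, w):
    """src, dst: (N, 3) corresponding points; w: (N,) correspondence confidences."""
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(0)
    mu_d = (w[:, None] * dst).sum(0)
    H = (src - mu_s).T @ (w[:, None] * (dst - mu_d))   # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # guard against reflections
    R = Vt.T @ S @ U.T
    t = mu_d - R @ mu_s
    return R, t

def region_pseudo_flow(region_pts, R, t):
    return (region_pts @ R.T + t) - region_pts
```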

Video Super-Resolution Using a Grouped Residual in Residual Network

  • paper_url: http://arxiv.org/abs/2310.11276
  • repo_url: None
  • paper_authors: MohammadHossein Ashoori, Arash Amini
  • for: Increasing the nominal resolution and quality of image/video content.
  • methods: Uses a grouped residual in residual network (GRRN).
  • results: Compared with existing methods, GRRN provides acceptable output image quality.
    Abstract Super-resolution (SR) is the technique of increasing the nominal resolution of image / video content accompanied with quality improvement. Video super-resolution (VSR) can be considered as the generalization of single image super-resolution (SISR). This generalization should be such that more detail is created in the output using adjacent input frames. In this paper, we propose a grouped residual in residual network (GRRN) for VSR. By adjusting the hyperparameters of the proposed structure, we train three networks with different numbers of parameters and compare their quantitative and qualitative results with the existing methods. Although based on some quantitative criteria, GRRN does not provide better results than the existing methods, in terms of the quality of the output image it has acceptable performance.

Image Compression using only Attention based Neural Networks

  • paper_url: http://arxiv.org/abs/2310.11265
  • repo_url: None
  • paper_authors: Natacha Luka, Romain Negrel, David Picard
  • for: This work explores image compression using only attention layers, aiming at better compression performance and computational efficiency.
  • methods: The paper uses a fully attention-based Transformer architecture and introduces learned image queries that aggregate patch information via cross-attention.
  • results: Extensive evaluations show performance equal to or better than traditional handcrafted pipelines on the popular Kodak, DIV2K, and CLIC datasets, without using convolutional layers.
    Abstract In recent research, Learned Image Compression has gained prominence for its capacity to outperform traditional handcrafted pipelines, especially at low bit-rates. While existing methods incorporate convolutional priors with occasional attention blocks to address long-range dependencies, recent advances in computer vision advocate for a transformative shift towards fully transformer-based architectures grounded in the attention mechanism. This paper investigates the feasibility of image compression exclusively using attention layers within our novel model, QPressFormer. We introduce the concept of learned image queries to aggregate patch information via cross-attention, followed by quantization and coding techniques. Through extensive evaluations, our work demonstrates competitive performance achieved by convolution-free architectures across the popular Kodak, DIV2K, and CLIC datasets.
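    The learned image queries that aggregate patch information via cross-attention can be sketched with PyTorch's multi-head attention, as below; the number of queries, embedding size, and single-layer structure are illustrative, not QPressFormer's actual encoder.

```python
# Hedged sketch: a bank of learned queries attends over patch embeddings
# (cross-attention) to produce a compact latent that could then be quantized and coded.
import torch
import torch.nn as nn

class QueryAggregator(nn.Module):
    def __init__(self, num_queries=64, dim=256, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_tokens):
        """patch_tokens: (B, N, dim) patch embeddings of one image."""
        B = patch_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        latent, _ = self.attn(q, patch_tokens, patch_tokens)   # queries attend to patches
        return latent                                           # (B, num_queries, dim)

tokens = torch.randn(2, 196, 256)          # e.g. 14x14 patch grid
latent = QueryAggregator()(tokens)         # compact representation before quantization
```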

An empirical study of automatic wildlife detection using drone thermal imaging and object detection

  • paper_url: http://arxiv.org/abs/2310.11257
  • repo_url: None
  • paper_authors: Miao Chang, Tan Vuong, Manas Palaparthi, Lachlan Howell, Alessio Bonti, Mohamed Abdelrazek, Duc Thanh Nguyen
  • For: The paper is written for wildlife management, specifically to explore the use of drones and thermal imaging technology for collecting and interpreting wildlife data.
  • Methods: The paper presents a comprehensive review and empirical study of drone-based wildlife detection, including the collection of a realistic dataset of drone-derived wildlife thermal detections annotated via bounding boxes by experts, and benchmarks state-of-the-art object detection algorithms on the collected dataset.
  • Results: The paper provides experimental results to identify issues and discuss future directions in automatic animal monitoring using drones.
    Abstract Artificial intelligence has the potential to make valuable contributions to wildlife management through cost-effective methods for the collection and interpretation of wildlife data. Recent advances in remotely piloted aircraft systems (RPAS or ``drones'') and thermal imaging technology have created new approaches to collect wildlife data. These emerging technologies could provide promising alternatives to standard labourious field techniques as well as cover much larger areas. In this study, we conduct a comprehensive review and empirical study of drone-based wildlife detection. Specifically, we collect a realistic dataset of drone-derived wildlife thermal detections. Wildlife detections, including arboreal (for instance, koalas, phascolarctos cinereus) and ground dwelling species in our collected data are annotated via bounding boxes by experts. We then benchmark state-of-the-art object detection algorithms on our collected dataset. We use these experimental results to identify issues and discuss future directions in automatic animal monitoring using drones.

Gromov-Wasserstein-like Distances in the Gaussian Mixture Models Space

  • paper_url: http://arxiv.org/abs/2310.11256
  • repo_url: None
  • paper_authors: Antoine Salmona, Julie Delon, Agnès Desolneux
  • for: This paper introduces two Gromov-Wasserstein-type distances on the space of Gaussian mixture models. The first takes the form of a Gromov-Wasserstein distance between two discrete distributions on the space of Gaussian measures; it can serve as an alternative to Gromov-Wasserstein for assessing how far two distributions are from each other when an explicit transportation plan between point clouds is not required.
  • methods: The paper introduces another distance between measures living in incomparable spaces, closely related to Gromov-Wasserstein, which can be used to define a transportation plan. Restricting the set of admissible transportation couplings to be themselves Gaussian mixture models in this latter defines another distance between Gaussian mixture models that can be used as a further alternative to Gromov-Wasserstein.
  • results: The paper designs a transportation plan associated with the first distance by analogy with the second, and illustrates its practical use on medium-to-large scale problems such as shape matching and hyperspectral image color transfer.
    Abstract In this paper, we introduce two Gromov-Wasserstein-type distances on the set of Gaussian mixture models. The first one takes the form of a Gromov-Wasserstein distance between two discrete distributions on the space of Gaussian measures. This distance can be used as an alternative to Gromov-Wasserstein for applications which only require to evaluate how far the distributions are from each other but does not allow to derive directly an optimal transportation plan between clouds of points. To design a way to define such a transportation plan, we introduce another distance between measures living in incomparable spaces that turns out to be closely related to Gromov-Wasserstein. When restricting the set of admissible transportation couplings to be themselves Gaussian mixture models in this latter, this defines another distance between Gaussian mixture models that can be used as another alternative to Gromov-Wasserstein and which allows to derive an optimal assignment between points. Finally, we design a transportation plan associated with the first distance by analogy with the second, and we illustrate their practical uses on medium-to-large scale problems such as shape matching and hyperspectral image color transfer.
    摘要 在这篇论文中,我们介绍了两种Gromov-Wasserstein-类型的距离在 Gaussian mixture model 上。第一个距离是两个抽象分布在 Gaussian measure 空间上的 Gromov-Wasserstein 距离,可以用作 Gromov-Wasserstein 的替代方法,但不能直接 derivate 最佳运输计划 между云集点。为了设计一种定义这种运输计划的方法,我们引入了另一种在不可比较的空间上的距离,该距离与 Gromov-Wasserstein 密切相关。当限制了可用的运输结合为 Gaussian mixture model 时,这个距离定义了另一种 Gaussian mixture model 之间的距离,可以作为 Gromov-Wasserstein 的另一种替代方法,并且可以 derivate 最佳分配计划。最后,我们设计了一个运输计划相关的方法,并在媒体规模到大型问题上如形态匹配和彩色图像传输中 illustrate 其实用性。
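
A minimal sketch of the first distance described above, assuming the POT (Python Optimal Transport) library: each GMM is reduced to a discrete distribution over its Gaussian components, the intra-space cost matrices hold closed-form Wasserstein-2 distances between components, and a Gromov-Wasserstein coupling is computed between the two component distributions. The toy means, covariances, and weights are made-up values, not from the paper.

```python
# Sketch (not the authors' code): a GW-type distance between two GMMs, computed as a
# Gromov-Wasserstein distance between their discrete component distributions.
import numpy as np
import ot  # Python Optimal Transport
from scipy.linalg import sqrtm

def w2_gaussian(m0, S0, m1, S1):
    """Closed-form squared Wasserstein-2 distance between two Gaussian measures."""
    rS0 = sqrtm(S0)
    cross = sqrtm(rS0 @ S1 @ rS0)
    return float(np.sum((m0 - m1) ** 2) + np.trace(S0 + S1 - 2 * np.real(cross)))

def intra_cost(means, covs):
    K = len(means)
    C = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            C[i, j] = w2_gaussian(means[i], covs[i], means[j], covs[j])
    return C

# Toy GMMs living in incomparable spaces (2D vs 3D), with uniform component weights.
means0 = [np.zeros(2), np.ones(2)]
covs0 = [np.eye(2), 0.5 * np.eye(2)]
means1 = [np.zeros(3), 2 * np.ones(3), -np.ones(3)]
covs1 = [np.eye(3)] * 3
p, q = np.full(2, 1 / 2), np.full(3, 1 / 3)

C0, C1 = intra_cost(means0, covs0), intra_cost(means1, covs1)
coupling = ot.gromov.gromov_wasserstein(C0, C1, p, q, loss_fun='square_loss')
gw_cost = ot.gromov.gromov_wasserstein2(C0, C1, p, q, loss_fun='square_loss')
print("component coupling:\n", coupling, "\nGW-type cost:", gw_cost)
```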

LiDAR-based 4D Occupancy Completion and Forecasting

  • paper_url: http://arxiv.org/abs/2310.11239
  • repo_url: https://github.com/ai4ce/occ4cast
  • paper_authors: Xinhao Liu, Moonjun Gong, Qi Fang, Haoyu Xie, Yiming Li, Hang Zhao, Chen Feng
  • for: This work unifies scene completion and forecasting by introducing a new LiDAR perception task, Occupancy Completion and Forecasting (OCF), for 4D perception in autonomous driving.
  • methods: New algorithms are developed to address three challenges together: (1) sparse-to-dense reconstruction, (2) partial-to-complete hallucination, and (3) 3D-to-4D prediction.
  • results: Supervision and evaluation on the curated OCFBench dataset show how closely related existing baseline models and the authors' own models perform, highlighting the importance of the OCF task and its potential applications.
    Abstract Scene completion and forecasting are two popular perception problems in research for mobile agents like autonomous vehicles. Existing approaches treat the two problems in isolation, resulting in a separate perception of the two aspects. In this paper, we introduce a novel LiDAR perception task of Occupancy Completion and Forecasting (OCF) in the context of autonomous driving to unify these aspects into a cohesive framework. This task requires new algorithms to address three challenges altogether: (1) sparse-to-dense reconstruction, (2) partial-to-complete hallucination, and (3) 3D-to-4D prediction. To enable supervision and evaluation, we curate a large-scale dataset termed OCFBench from public autonomous driving datasets. We analyze the performance of closely related existing baseline models and our own ones on our dataset. We envision that this research will inspire and call for further investigation in this evolving and crucial area of 4D perception. Our code for data curation and baseline implementation is available at https://github.com/ai4ce/Occ4cast.
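
The OCF task operates on occupancy grids derived from LiDAR sweeps. Below is a minimal, hypothetical sketch (not from the paper or the linked repository) of turning a point cloud into a dense binary occupancy volume, i.e. the input/output representation behind sparse-to-dense completion; the grid bounds and voxel size are illustrative, not the OCFBench settings.

```python
# Sketch: voxelize a LiDAR point cloud into a binary occupancy grid, the kind of
# spatial representation that occupancy completion/forecasting operates on.
import numpy as np

def voxelize(points, bounds=((-40, 40), (-40, 40), (-3, 3)), voxel_size=0.4):
    """points: (N, 3) array of x, y, z in metres -> boolean occupancy volume."""
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    dims = np.ceil((hi - lo) / voxel_size).astype(int)
    grid = np.zeros(dims, dtype=bool)

    # Keep points inside the volume and map them to voxel indices.
    mask = np.all((points >= lo) & (points < hi), axis=1)
    idx = ((points[mask] - lo) / voxel_size).astype(int)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# A 4D sequence for forecasting is then just a stack of per-frame grids.
frames = [np.c_[np.random.uniform(-40, 40, (5000, 2)),
                np.random.uniform(-3, 3, (5000, 1))] for _ in range(4)]
occ_4d = np.stack([voxelize(f) for f in frames])   # (T, X, Y, Z)
print(occ_4d.shape, occ_4d.mean())
```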

Innovative Methods for Non-Destructive Inspection of Handwritten Documents

  • paper_url: http://arxiv.org/abs/2310.11217
  • repo_url: None
  • paper_authors: Eleonora Breci, Luca Guarnera, Sebastiano Battiato
  • for: This work aims to make handwritten document analysis more accurate and efficient, so that authorship can be established by examining the inherent characteristics of handwritten documents.
  • methods: Image processing and deep learning techniques extract and analyze intrinsic measures of manuscript documents, including text line heights, spaces between words, and character sizes. The final feature vector of each document consists of the mean and standard deviation of each type of measure, and authorship is discerned objectively by comparing the Euclidean distance between the feature vectors of the documents.
  • results: Experimental results show that the method objectively determines authorship across different writing media (handwritten paper and digital devices) and outperforms existing methods.
    Abstract Handwritten document analysis is an area of forensic science, with the goal of establishing authorship of documents through examination of inherent characteristics. Law enforcement agencies use standard protocols based on manual processing of handwritten documents. This method is time-consuming, is often subjective in its evaluation, and is not replicable. To overcome these limitations, in this paper we present a framework capable of extracting and analyzing intrinsic measures of manuscript documents related to text line heights, space between words, and character sizes using image processing and deep learning techniques. The final feature vector for each document involved consists of the mean and standard deviation for every type of measure collected. By quantifying the Euclidean distance between the feature vectors of the documents to be compared, authorship can be discerned. We also proposed a new and challenging dataset consisting of 362 handwritten manuscripts written on paper and digital devices by 124 different people. Our study pioneered the comparison between traditionally handwritten documents and those produced with digital tools (e.g., tablets). Experimental results demonstrate the ability of our method to objectively determine authorship in different writing media, outperforming the state of the art.
    摘要 手写文档分析是法医科学领域中的一个领域,旨在透过评估手写文档的内在特征,以确定文档的作者。法律机关通常采用标准协议,基于手动处理手写文档。这种方法是时间consuming,容易受主观影响,并且不可重复。为了超越这些限制,在这篇论文中,我们提出了一个框架,可以提取和分析手写文档中相关的字体大小、词间距和字体大小等内在特征,使用图像处理和深度学习技术。每个文档的最终特征向量由每种测量类型的均值和标准差组成。通过计算这些特征向量之间的欧氏距离,可以bjectively Determine authorship。我们还提出了一个新的和挑战性的数据集,包含362份手写文档,由124名不同的人写作。我们的研究对手写文档和数字工具(例如平板电脑)生成的文档进行比较,并实验结果表明,我们的方法可以在不同的写作媒体中对作者进行 объекively 确定。
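
The core of the comparison pipeline described above, per-measure mean/std feature vectors and a Euclidean distance between documents, can be sketched in a few lines. The raw measurement values below are placeholders standing in for quantities the paper extracts with image processing and deep learning.

```python
# Sketch: build a per-document feature vector (mean and std of each measure type)
# and compare two documents by Euclidean distance, as in the framework above.
import numpy as np

def document_features(line_heights, word_spaces, char_sizes):
    vec = []
    for values in (line_heights, word_spaces, char_sizes):
        values = np.asarray(values, dtype=float)
        vec += [values.mean(), values.std()]
    return np.array(vec)   # [mu_h, sd_h, mu_s, sd_s, mu_c, sd_c]

# Placeholder measurements for two documents (the paper measures these from images).
doc_a = document_features([31, 29, 33], [8.1, 7.5, 9.0], [12.2, 11.8, 12.9])
doc_b = document_features([40, 42, 39], [11.0, 12.3, 10.7], [15.1, 14.6, 15.5])

distance = np.linalg.norm(doc_a - doc_b)
print(f"Euclidean distance between documents: {distance:.2f}")
# Smaller distances suggest the same writer; a threshold or nearest-neighbour rule
# over a reference set would then attribute authorship.
```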

Learning Comprehensive Representations with Richer Self for Text-to-Image Person Re-Identification

  • paper_url: http://arxiv.org/abs/2310.11210
  • repo_url: None
  • paper_authors: Shuanglin Yan, Neng Dong, Jun Liu, Liyan Zhang, Jinhui Tang
  • for: This work aims to improve the accuracy of text-to-image person retrieval and to address the lack of many-to-many matching across views in existing methods.
  • methods: A simple yet effective framework, LCR$^2$S, establishes many-to-many matching by learning comprehensive representations of both modalities across views; a multi-head attentional fusion module is designed to fuse each image (text) with its support set.
  • results: The method achieves excellent results on three popular TIReID datasets and sets a new state of the art for the task.
    Abstract Text-to-image person re-identification (TIReID) retrieves pedestrian images of the same identity based on a query text. However, existing methods for TIReID typically treat it as a one-to-one image-text matching problem, only focusing on the relationship between image-text pairs within a view. The many-to-many matching between image-text pairs across views under the same identity is not taken into account, which is one of the main reasons for the poor performance of existing methods. To this end, we propose a simple yet effective framework, called LCR$^2$S, for modeling many-to-many correspondences of the same identity by learning comprehensive representations for both modalities from a novel perspective. We construct a support set for each image (text) by using other images (texts) under the same identity and design a multi-head attentional fusion module to fuse the image (text) and its support set. The resulting enriched image and text features fuse information from multiple views, which are aligned to train a "richer" TIReID model with many-to-many correspondences. Since the support set is unavailable during inference, we propose to distill the knowledge learned by the "richer" model into a lightweight model for inference with a single image/text as input. The lightweight model focuses on semantic association and reasoning of multi-view information, which can generate a comprehensive representation containing multi-view information with only a single-view input to perform accurate text-to-image retrieval during inference. In particular, we use the intra-modal features and inter-modal semantic relations of the "richer" model to supervise the lightweight model to inherit its powerful capability. Extensive experiments demonstrate the effectiveness of LCR$^2$S, and it also achieves new state-of-the-art performance on three popular TIReID datasets.
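
The fusion of a feature with its same-identity support set can be sketched with PyTorch's built-in multi-head attention. The dimensions, the residual fusion, and the pooling choice below are illustrative assumptions, not the paper's exact design.

```python
# Sketch of a multi-head attentional fusion module in the spirit of LCR^2S:
# a query feature (image or text) attends over its same-identity support set,
# and the result is fused back into the query to form a "richer" feature.
import torch
import torch.nn as nn

class SupportSetFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat, support):
        # feat: (B, D) single-view feature; support: (B, S, D) same-identity samples
        query = feat.unsqueeze(1)                    # (B, 1, D)
        fused, _ = self.attn(query, support, support)
        return self.norm(feat + fused.squeeze(1))    # residual fusion

fusion = SupportSetFusion()
img_feat = torch.randn(4, 512)          # 4 query images
img_support = torch.randn(4, 6, 512)    # 6 support images per identity
rich_feat = fusion(img_feat, img_support)
print(rich_feat.shape)                  # torch.Size([4, 512])
```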

Whole-brain radiomics for clustered federated personalization in brain tumor segmentation

  • paper_url: http://arxiv.org/abs/2310.11480
  • repo_url: None
  • paper_authors: Matthis Manthe, Stefan Duffner, Carole Lartizien
  • for: The paper focuses on mitigating the impact of statistical heterogeneity between institutions' local datasets in federated learning for medical image segmentation.
  • methods: The proposed federated personalization method computes radiomic features on each institution's local dataset, pools the feature vectors at the central server for a clustering analysis, and fine-tunes a global model obtained through classical federated learning on each clustered decentralized dataset.
  • results: Validated on the Federated Brain Tumor Segmentation 2022 Challenge dataset (FeTS2022), the method improves performance compared to classical federated learning.
    Abstract Federated learning and its application to medical image segmentation have recently become a popular research topic. This training paradigm suffers from statistical heterogeneity between participating institutions' local datasets, incurring convergence slowdown as well as potential accuracy loss compared to classical training. To mitigate this effect, federated personalization emerged as the federated optimization of one model per institution. We propose a novel personalization algorithm tailored to the feature shift induced by the usage of different scanners and acquisition parameters by different institutions. This method is the first to account for both inter and intra-institution feature shift (multiple scanners used in a single institution). It is based on the computation, within each centre, of a series of radiomic features capturing the global texture of each 3D image volume, followed by a clustering analysis pooling all feature vectors transferred from the local institutions to the central server. Each computed clustered decentralized dataset (potentially including data from different institutions) then serves to finetune a global model obtained through classical federated learning. We validate our approach on the Federated Brain Tumor Segmentation 2022 Challenge dataset (FeTS2022). Our code is available at (https://github.com/MatthisManthe/radiomics_CFFL).
    摘要 联邦学习和它的医学图像分割应用已经在最近引起了广泛的研究兴趣。这种培训模式受到参与机构本地数据的统计差异的影响,会导致减速和溢出精度相比于传统培训。为了缓解这些效应,联邦个性化出现了,即在每个机构上进行联邦优化的一个模型。我们提出了一种新的个性化算法,专门针对不同扫描仪和获取参数导致的特征偏移。这种方法是首次考虑了多个机构的内部和外部特征偏移(多个扫描仪在同一个机构中使用)。它基于在每个中心计算的一系列各种激光特征,用于捕捉每个3D图像卷积的全局 текстура,然后对所有从本地机构传输到中央服务器的特征向量进行归一化分析。每个计算的归一化分析后的各个归一化分析结果(可能包括多个机构的数据)然后用于在经典联邦学习中进行精化。我们验证了我们的方法在2022年联邦大脑肿瘤分割挑战数据集(FeTS2022)上。我们的代码可以在(https://github.com/MatthisManthe/radiomics_CFFL)上获取。
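
A hypothetical sketch of the server-side clustering step: radiomic feature vectors computed locally at each centre are pooled and clustered, and each cluster then defines the decentralized subset used to fine-tune the global federated model. Radiomic extraction itself (whole-brain texture features) is stubbed with random vectors here; the cluster count and sizes are assumptions.

```python
# Sketch of the clustering step in federated personalization.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def local_radiomics(num_volumes, dim=32):
    # Placeholder for whole-brain texture radiomics computed inside one centre.
    return rng.normal(size=(num_volumes, dim))

# Three institutions, possibly using several scanners each.
per_centre = {"centre_A": local_radiomics(20), "centre_B": local_radiomics(35),
              "centre_C": local_radiomics(15)}

pooled = np.concatenate(list(per_centre.values()))          # transferred to the server
owner = np.concatenate([[name] * len(f) for name, f in per_centre.items()])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pooled)
for c in range(2):
    members = owner[labels == c]
    print(f"cluster {c}: {len(members)} volumes from {sorted(set(members))}")
    # -> fine-tune a copy of the global federated model on each cluster's data
```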

Improving Video Deepfake Detection: A DCT-Based Approach with Patch-Level Analysis

  • paper_url: http://arxiv.org/abs/2310.11204
  • repo_url: None
  • paper_authors: Luca Guarnera, Salvatore Manganello, Sebastiano Battiato
  • for: This work aims to develop a reliable, fast, and explainable algorithm for detecting deepfake multimedia content, in response to the widespread misuse of deepfake technology.
  • methods: I-frames are extracted to speed up computation and analysis, and the background, face, eyes, nose, and mouth of individual frames are analyzed separately to find the most discriminative regions. The Beta components are extracted from the AC coefficients of the Discrete Cosine Transform (DCT) and fed to standard classifiers (e.g., k-NN, SVM) to identify the frequencies most discriminative for the task.
  • results: Experimental results show that the eye and mouth regions are the most discriminative for detecting deepfake multimedia content and determine the nature of a video more reliably. The proposed method is analytical, fast, and does not require large computational resources.
    Abstract The term deepfake refers to all those multimedia contents that were synthetically altered or created from scratch through the use of generative models. This phenomenon has become widespread due to the use of increasingly accurate and efficient architectures capable of rendering manipulated content indistinguishable from real content. In order to fight the illicit use of this powerful technology, it has become necessary to develop algorithms able to distinguish synthetic content from real ones. In this study, a new algorithm for the detection of deepfakes in digital videos is presented, focusing on the main goal of creating a fast and explainable method from a forensic perspective. To achieve this goal, the I-frames were extracted in order to provide faster computation and analysis than approaches described in literature. In addition, to identify the most discriminating regions within individual video frames, the entire frame, background, face, eyes, nose, mouth, and face frame were analyzed separately. From the Discrete Cosine Transform (DCT), the Beta components were extracted from the AC coefficients and used as input to standard classifiers (e.g., k-NN, SVM, and others) in order to identify those frequencies most discriminative for solving the task in question. Experimental results obtained on the Faceforensics++ and Celeb-DF (v2) datasets show that the eye and mouth regions are those most discriminative and able to determine the nature of the video with greater reliability than the analysis of the whole frame. The method proposed in this study is analytical, fast and does not require much computational power.
    摘要 deepfake 指的是通过生成模型制造或修改 multimedia 内容的所有内容。由于使用的生成模型不断改进,使得修改后的内容与原始内容难以区分,因此需要开发一种能够分辨真实内容和修改后的内容的算法。本研究提出了一种用于检测数字视频中的深伪内容的新算法,强调实现快速和可解释的方法。为了实现这一目标,我们提取了 I-frame,以便更快地计算和分析,而不是按照文献中所描述的方法。此外,我们还分析了每帧视频中最有可能区分的地方,包括背景、脸、眼睛、鼻子和口。从 discrete cosine transform (DCT) 中,我们提取了 AC 约束中的 Beta 成分,并将其作为输入给标准分类器(如 k-NN、SVM 等),以确定解决当前问题中最有可能的频率。实验结果表明,脸部和口部是最有可能的区分点,能够更加可靠地判断视频的性质,而不是分析整个帧。该方法具有分析性、快速和不需要大量计算资源的优点。
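
A rough sketch of the frequency-based descriptor: 8x8 block DCT over a grayscale patch, a Laplacian scale ("beta") estimated per AC frequency, and a standard classifier on the resulting 63-dimensional vector. Extracting I-frames and cropping the facial regions is assumed to have happened upstream, the estimator used here is a generic Laplacian-scale fit rather than the paper's exact procedure, and the data are random stand-ins.

```python
# Sketch: per-frequency DCT statistics fed to an off-the-shelf classifier.
import numpy as np
from scipy.fft import dctn
from sklearn.svm import SVC

def beta_features(patch):
    """patch: (H, W) grayscale array with H, W multiples of 8 -> 63 beta values."""
    h, w = (s - s % 8 for s in patch.shape)
    blocks = (patch[:h, :w].reshape(h // 8, 8, w // 8, 8)
                            .transpose(0, 2, 1, 3).reshape(-1, 8, 8))
    coeffs = np.stack([dctn(b, norm='ortho') for b in blocks]).reshape(-1, 64)
    ac = coeffs[:, 1:]                                   # drop the DC coefficient
    # Mean absolute deviation from the median ~ ML estimate of a Laplacian scale.
    return np.mean(np.abs(ac - np.median(ac, axis=0)), axis=0)

rng = np.random.default_rng(1)
X = np.array([beta_features(rng.normal(size=(64, 64))) for _ in range(40)])
y = np.array([0] * 20 + [1] * 20)                        # 0 = real, 1 = deepfake (toy labels)
clf = SVC(kernel='rbf').fit(X, y)
print(clf.score(X, y))
```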

Sparse Multi-Object Render-and-Compare

  • paper_url: http://arxiv.org/abs/2310.11184
  • repo_url: None
  • paper_authors: Florian Langer, Ignas Budvytis, Roberto Cipolla
  • for: This paper addresses reconstructing the 3D shape and pose of objects from a single image, an essential task in robotics, augmented reality, and digital content creation.
  • methods: A new network architecture, Multi-SPARC, learns to perform CAD model alignment for multiple detected objects jointly.
  • results: Compared to other single-view methods, the approach achieves state-of-the-art performance on the ScanNet dataset, improving instance alignment accuracy from 31.8% to 40.3%.
    Abstract Reconstructing 3D shape and pose of static objects from a single image is an essential task for various industries, including robotics, augmented reality, and digital content creation. This can be done by directly predicting 3D shape in various representations or by retrieving CAD models from a database and predicting their alignments. Directly predicting 3D shapes often produces unrealistic, overly smoothed or tessellated shapes. Retrieving CAD models ensures realistic shapes but requires robust and accurate alignment. Learning to directly predict CAD model poses from image features is challenging and inaccurate. Works, such as ROCA, compute poses from predicted normalised object coordinates which can be more accurate but are susceptible to systematic failure. SPARC demonstrates that following a ''render-and-compare'' approach where a network iteratively improves upon its own predictions achieves accurate alignments. Nevertheless, it performs individual CAD alignment for every object detected in an image. This approach is slow when applied to many objects as the time complexity increases linearly with the number of objects and can not learn inter-object relations. Introducing a new network architecture Multi-SPARC we learn to perform CAD model alignments for multiple detected objects jointly. Compared to other single-view methods we achieve state-of-the-art performance on the challenging real-world dataset ScanNet. By improving the instance alignment accuracy from 31.8% to 40.3% we perform similar to state-of-the-art multi-view methods.
    摘要 重建静止物体的3D形状和姿势从单个图像中是许多领域的关键任务,包括机器人、增强现实和数字内容创建。这可以通过直接预测3D形状或从数据库中检索CAD模型并预测其对齐来完成。直接预测3D形状常常生成不真实、过度缩短或分割的形状。从数据库中检索CAD模型可以保证真实的形状,但需要稳定和准确的对齐。学习直接从图像特征中预测CAD模型姿势是困难且不准确。ROCA等方法计算姿势从预测的 нормализованobject坐标,可以更准确但容易系统性失败。SPARC示例了一个“render-and-compare”方法,其中网络在自己的预测基础上进行多次改进,可以实现准确的对齐。然而,它每个图像中检测到的对象都进行个别CAD对齐,这会导致运行时间linearly增长与对象数量的线性关系,无法学习对象之间的关系。我们提出了一种新的网络架构 Multi-SPARC,可以同时对多个检测到的对象进行CAD模型对齐。与其他单视图方法相比,我们在真实的世界数据集ScanNet上 achieve state-of-the-art性能。我们从31.8%提高了实例对齐精度到40.3%,与多视图方法相当。

Unsupervised Pre-Training Using Masked Autoencoders for ECG Analysis

  • paper_url: http://arxiv.org/abs/2310.11153
  • repo_url: None
  • paper_authors: Guoxin Wang, Qingyuan Wang, Ganesh Neelakanta Iyer, Avishek Nag, Deepu John
  • for: This paper proposes an unsupervised pre-training technique based on a masked autoencoder (MAE) for the analysis of electrocardiogram (ECG) signals.
  • methods: The approach combines masked autoencoder (MAE) pre-training with task-specific fine-tuning.
  • results: Experiments show that the proposed method reaches 94.39% accuracy on the MITDB dataset for ECG arrhythmia classification, and classifies previously unseen data better than fully supervised methods.
    Abstract Unsupervised learning methods have become increasingly important in deep learning due to their demonstrated large utilization of datasets and higher accuracy in computer vision and natural language processing tasks. There is a growing trend to extend unsupervised learning methods to other domains, which helps to utilize a large amount of unlabelled data. This paper proposes an unsupervised pre-training technique based on masked autoencoder (MAE) for electrocardiogram (ECG) signals. In addition, we propose a task-specific fine-tuning to form a complete framework for ECG analysis. The framework is high-level, universal, and not individually adapted to specific model architectures or tasks. Experiments are conducted using various model architectures and large-scale datasets, resulting in an accuracy of 94.39% on the MITDB dataset for ECG arrhythmia classification task. The result shows a better performance for the classification of previously unseen data for the proposed approach compared to fully supervised methods.
    摘要 深度学习中的无监督学习方法在最近几年变得越来越重要,因为它们在计算机视觉和自然语言处理任务中的精度高于监督学习方法。随着这些方法的扩展到其他领域,可以利用大量的无标签数据。这篇论文提出了基于屏蔽自动编码器(MAE)的无监督预训练技术,用于电cardiogram(ECG)信号分析。此外,我们还提出了任务特定的细化,以形成一个完整的ECG分析框架。这个框架是高级、通用、不具体适应特定的模型结构或任务。在各种模型结构和大规模数据集上进行了实验,实现了MITDB数据集上ECG动力痕迹分类任务的准确率为94.39%。结果显示,提posed方法对于处理前未见数据的分类表现更好于完全监督方法。
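
A toy sketch of masked-autoencoder-style pre-training on a 1D ECG segment: the signal is split into patches, a random subset is masked, and a small encoder-decoder is trained to reconstruct the masked patches. This simplified variant feeds mask tokens to the encoder (the original MAE encodes only visible patches); all architecture sizes are illustrative, since the paper's framework is architecture-agnostic.

```python
# Sketch: MAE-style masked reconstruction objective for 1D ECG patches.
import torch
import torch.nn as nn

patch_len, mask_ratio = 32, 0.75

class TinyECGMAE(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Linear(patch_len, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.Linear(dim, patch_len)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, patches, mask):
        tokens = self.embed(patches)
        tokens = torch.where(mask[..., None], self.mask_token.expand_as(tokens), tokens)
        return self.decoder(self.encoder(tokens))

signal = torch.randn(8, 1024)                          # batch of 8 ECG segments
patches = signal.reshape(8, -1, patch_len)             # (8, 32, 32)
mask = torch.rand(patches.shape[:2]) < mask_ratio      # True = masked patch

model = TinyECGMAE()
recon = model(patches, mask)
loss = ((recon - patches) ** 2)[mask].mean()           # reconstruct masked patches only
loss.backward()
print(float(loss))
```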

BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian Inference

  • paper_url: http://arxiv.org/abs/2310.11142
  • repo_url: None
  • paper_authors: Siqi Kou, Lei Gan, Dequan Wang, Chongxuan Li, Zhijie Deng
  • for: Improving the quality of images generated by diffusion models.
  • methods: Pixel-wise uncertainty along the diffusion process is estimated via Bayesian inference.
  • results: The estimated uncertainty can filter out low-quality images, and also helps augment successful generations and rectify artifacts in failed ones.
    Abstract Diffusion models have impressive image generation capability, but low-quality generations still exist, and their identification remains challenging due to the lack of a proper sample-wise metric. To address this, we propose BayesDiff, a pixel-wise uncertainty estimator for generations from diffusion models based on Bayesian inference. In particular, we derive a novel uncertainty iteration principle to characterize the uncertainty dynamics in diffusion, and leverage the last-layer Laplace approximation for efficient Bayesian inference. The estimated pixel-wise uncertainty can not only be aggregated into a sample-wise metric to filter out low-fidelity images but also aids in augmenting successful generations and rectifying artifacts in failed generations in text-to-image tasks. Extensive experiments demonstrate the efficacy of BayesDiff and its promise for practical applications.
    摘要 Diffusion模型具有吸引人的图像生成能力,但低质量生成仍然存在,其标识仍然困难由于缺乏适当的样本级度指标。为解决这个问题,我们提出了 BayesDiff,一种基于泛函推理的像素级uncertainty估计器 дляDiffusion模型。具体来说,我们 derivate了一种新的uncertainty迭代原理来描述Diffusion中的uncertainty动态,并利用最后层拉пла斯批处理来实现高效的泛函推理。测试表明,BayesDiff可以不仅将像素级uncertainty聚合成样本级度指标来滤除低准确图像,还可以帮助改善成功生成的图像和修复失败生成的瑕疵。在文本到图像任务中,BayesDiff展示了其效果和实际应用潜力。

Super resolution of histopathological frozen sections via deep learning preserving tissue structure

  • paper_url: http://arxiv.org/abs/2310.11112
  • repo_url: None
  • paper_authors: Elad Yoshai, Gil Goldinger, Miki Haifler, Natan T. Shaked
  • for: histopathological frozen sections imaging, with a focus on achieving better distortion measures and reducing the risk of diagnostic misinterpretation.
  • methods: deep-learning architecture that leverages loss functions in the frequency domain to generate high-resolution images while preserving critical image details.
  • results: significant improvements in terms of Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR), as well as the preservation of details lost in low-resolution frozen-section images, which can affect pathologists’ clinical decisions.
    Abstract Histopathology plays a pivotal role in medical diagnostics. In contrast to preparing permanent sections for histopathology, a time-consuming process, preparing frozen sections is significantly faster and can be performed during surgery, where the sample scanning time should be optimized. Super-resolution techniques allow imaging the sample in lower magnification and sparing scanning time. In this paper, we present a new approach to super resolution for histopathological frozen sections, with focus on achieving better distortion measures, rather than pursuing photorealistic images that may compromise critical diagnostic information. Our deep-learning architecture focuses on learning the error between interpolated images and real images, thereby it generates high-resolution images while preserving critical image details, reducing the risk of diagnostic misinterpretation. This is done by leveraging the loss functions in the frequency domain, assigning higher weights to the reconstruction of complex, high-frequency components. In comparison to existing methods, we obtained significant improvements in terms of Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR), as well as indicated details that lost in the low-resolution frozen-section images, affecting the pathologist's clinical decisions. Our approach has a great potential in providing more-rapid frozen-section imaging, with less scanning, while preserving the high resolution in the imaged sample.
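
The frequency-domain loss idea above can be illustrated with a small PyTorch function that up-weights high-frequency components of the reconstruction error. The radial weighting scheme is an assumption for illustration; the paper's exact loss formulation may differ.

```python
# Sketch: frequency-domain reconstruction loss that emphasizes high frequencies.
import torch

def weighted_frequency_loss(pred, target, high_freq_weight=4.0):
    """pred, target: (B, C, H, W). L1 on FFT magnitudes, weighted by radial frequency."""
    fp = torch.fft.fftshift(torch.fft.fft2(pred, norm="ortho"), dim=(-2, -1))
    ft = torch.fft.fftshift(torch.fft.fft2(target, norm="ortho"), dim=(-2, -1))

    h, w = pred.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                            indexing="ij")
    radius = torch.sqrt(xx ** 2 + yy ** 2) / torch.sqrt(torch.tensor(2.0))
    weight = 1.0 + (high_freq_weight - 1.0) * radius      # 1 at DC, larger at corners

    return (weight * (fp - ft).abs()).mean()

sr_output = torch.rand(2, 1, 128, 128, requires_grad=True)   # super-resolved prediction
hr_target = torch.rand(2, 1, 128, 128)                       # high-resolution reference
loss = weighted_frequency_loss(sr_output, hr_target)
loss.backward()
print(float(loss))
```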

3D Structure-guided Network for Tooth Alignment in 2D Photograph

  • paper_url: http://arxiv.org/abs/2310.11106
  • repo_url: https://github.com/douyl/2DToothAlignment
  • paper_authors: Yulong Dou, Lanzhuju Mei, Dinggang Shen, Zhiming Cui
  • for: This paper aims to provide a tooth alignment network operating in 2D image space, to support dentist-patient communication and encourage patients to accept orthodontic treatment.
  • methods: 3D intra-oral scanning models collected in clinics are used to learn about orthodontic treatment: pre- and post-orthodontic 3D tooth structures are mapped to 2D tooth contours by a correspondence learning model, and a diffusion model then adjusts the tooth contours to guide the generation of an image with aesthetically pleasing, aligned teeth.
  • results: Results on various facial photographs show excellent performance and strong applicability, enabling dentists to quickly generate an image of aligned teeth for communicating with patients and encouraging them to accept orthodontic treatment.
    Abstract Orthodontics focuses on rectifying misaligned teeth (i.e., malocclusions), affecting both masticatory function and aesthetics. However, orthodontic treatment often involves complex, lengthy procedures. As such, generating a 2D photograph depicting aligned teeth prior to orthodontic treatment is crucial for effective dentist-patient communication and, more importantly, for encouraging patients to accept orthodontic intervention. In this paper, we propose a 3D structure-guided tooth alignment network that takes 2D photographs as input (e.g., photos captured by smartphones) and aligns the teeth within the 2D image space to generate an orthodontic comparison photograph featuring aesthetically pleasing, aligned teeth. Notably, while the process operates within a 2D image space, our method employs 3D intra-oral scanning models collected in clinics to learn about orthodontic treatment, i.e., projecting the pre- and post-orthodontic 3D tooth structures onto 2D tooth contours, followed by a diffusion model to learn the mapping relationship. Ultimately, the aligned tooth contours are leveraged to guide the generation of a 2D photograph with aesthetically pleasing, aligned teeth and realistic textures. We evaluate our network on various facial photographs, demonstrating its exceptional performance and strong applicability within the orthodontic industry.
    摘要 Orthodontics 专注于 corrections 不对称牙齿(即 malocclusion),影响咀嚼功能和美观。然而,orthodontic 治疗经常包括复杂、长时间的过程。因此,生成一张显示牙齿调整后的2D照片是关键的,以便dentist和病人之间有效沟通,更重要的是,使病人accept orthodontic intervention。在这篇论文中,我们提议一个基于3D结构的牙齿调整网络,输入2D照片(例如,由智能手机拍摄的照片),并将牙齿在2D图像空间中调整,生成一张 featuring 美观、调整后的牙齿的orthodontic comparison照片。需要注意的是,我们的过程在2D图像空间中进行,但我们使用了3D intra-oral scanning模型,收集在临床中,以学习orthodontic treatment。具体来说,我们将预 orthodontic 和后 orthodontic 3D 牙齿结构投影到2D 牙齿轮廓上,然后使用一种扩散模型来学习 mapping 关系。最后,我们使用了调整后的牙齿轮廓来指导生成一张 featuring 美观、调整后的牙齿和实际 Texture的2D照片。我们对多张人脸照片进行了评估,并证明了我们的网络在orthodontic 行业中表现出色,有强大的应用前景。

Generalizability of CNN Architectures for Face Morph Presentation Attack

  • paper_url: http://arxiv.org/abs/2310.11105
  • repo_url: None
  • paper_authors: Sherko R. HmaSalah, Aras Asaad
  • for: Preventing criminals from crossing borders with fake identity documents.
  • methods: Convolutional Neural Network (CNN) models are used for face morph detection, and the generalization power of CNN models across several datasets is investigated.
  • results: InceptionResNet-v2 shows the best generalization across the datasets and achieves the highest performance on the morph detection task.
    Abstract Automatic border control systems are widespread in modern airports worldwide. Morphing attacks on face biometrics are a serious threat that undermines the security and reliability of face recognition systems deployed in airports and border controls. Therefore, developing a robust Machine Learning (ML) system is necessary to prevent criminals from crossing borders with fake identifications, especially since it has been shown that security officers cannot detect morphs better than machines. In this study, we investigate the generalization power of Convolutional Neural Network (CNN) architectures against morphing attacks. The investigation utilizes 5 distinct CNNs namely ShuffleNet, DenseNet201, VGG16, EfficientNet-B0 and InceptionResNet-v2. Each CNN architecture represents a well-known family of CNN models in terms of number of parameters, architectural design and performance across various computer vision applications. To ensure robust evaluation, we employ 4 different datasets (Utrecht, London, Defacto and KurdFace) that contain a diverse range of digital face images which cover variations in ethnicity, gender, age, lighting condition and camera setting. One of the fundamental concepts of ML system design is the ability to generalize effectively to previously unseen data, hence we not only evaluate the performance of CNN models within individual datasets but also explore their performance across combined datasets and investigate each dataset in the testing phase only. Experimental results on more than 8 thousand images (genuine and morph) from the 4 datasets show that InceptionResNet-v2 generalizes better to unseen data and outperforms the other 4 CNN models.
    摘要 现代机场中的自动边境控制系统广泛应用。但 morphing 攻击对于面部biometrics 是一种严重的威胁,这会使面 recognition 系统在机场和边境控制中受到影响。为了防止罪犯使用假身份证件越境,需要开发一个可靠的机器学习(ML)系统。在这项研究中,我们研究了 CNN 架构对 morphing 攻击的普适性。我们使用 5 种不同的 CNN 模型,即 ShuffleNet、DenseNet201、VGG16、EfficientNet-B0 和 InceptionResNet-v2。每种 CNN 模型都代表了不同的参数量、架构设计和在不同计算机视觉应用中的性能。为了有效评估,我们使用 4 个不同的数据集(UTrecht、London、Defacto 和 KurdFace),这些数据集包含了不同的民族、性别、年龄、照明条件和摄像头设置。 ML 系统设计的一个基本原则是能够有效地普退到未见数据,因此我们不仅在单个数据集中评估 CNN 模型的性能,还在将数据集组合起来评估它们的总体性能。实验结果表明,InceptionResNet-v2 在未见数据中普退性能最好,并且在4个数据集中的测试阶段也表现出色,超过其他 4 种 CNN 模型。

SODA: Robust Training of Test-Time Data Adaptors

  • paper_url: http://arxiv.org/abs/2310.11093
  • repo_url: https://github.com/tmlr-group/soda
  • paper_authors: Zige Wang, Yonggang Zhang, Zhen Fang, Long Lan, Wenjing Yang, Bo Han
  • for: The paper aims to mitigate the performance degradation caused by distribution shifts in machine learning models by adapting test data to fit deployed models, without requiring access to model parameters.
  • methods: The proposed pseudo-label-robust data adaptation (SODA) uses zeroth-order optimization (ZOO) to train a data adaptor; high-confidence predicted labels serve as reliable labels to optimize the adaptor, while for low-confidence predictions the adaptor is encouraged to preserve data information and thus mitigate data corruption.
  • results: SODA significantly enhances the performance of deployed models in the presence of distribution shifts without requiring access to model parameters.
    Abstract Adapting models deployed to test distributions can mitigate the performance degradation caused by distribution shifts. However, privacy concerns may render model parameters inaccessible. One promising approach involves utilizing zeroth-order optimization (ZOO) to train a data adaptor to adapt the test data to fit the deployed models. Nevertheless, the data adaptor trained with ZOO typically brings restricted improvements due to the potential corruption of data features caused by the data adaptor. To address this issue, we revisit ZOO in the context of test-time data adaptation. We find that the issue directly stems from the unreliable estimation of the gradients used to optimize the data adaptor, which is inherently due to the unreliable nature of the pseudo-labels assigned to the test data. Based on this observation, we propose pseudo-label-robust data adaptation (SODA) to improve the performance of data adaptation. Specifically, SODA leverages high-confidence predicted labels as reliable labels to optimize the data adaptor with ZOO for label prediction. For data with low-confidence predictions, SODA encourages the adaptor to preserve data information to mitigate data corruption. Empirical results indicate that SODA can significantly enhance the performance of deployed models in the presence of distribution shifts without requiring access to model parameters.
    摘要 适应已部署的模型可以减轻由分布shift引起的性能下降。然而,隐私问题可能使模型参数无法访问。一种有 promise的方法是使用零次优化(ZOO)来训练一个数据适应器,以适应已部署的模型。然而,通常情况下,ZOO 训练的数据适应器 Typically brings restricted improvements due to the potential corruption of data features caused by the data adaptor。为Address this issue, we revisit ZOO in the context of test-time data adaptation. We find that the issue directly stems from the unreliable estimation of the gradients used to optimize the data adaptor, which is inherently due to the unreliable nature of the pseudo-labels assigned to the test data。 Based on this observation, we propose pseudo-label-robust data adaptation (SODA) to improve the performance of data adaptation。 Specifically, SODA leverages high-confidence predicted labels as reliable labels to optimize the data adaptor with ZOO for label prediction。 For data with low-confidence predictions, SODA encourages the adaptor to preserve data information to mitigate data corruption。 Empirical results indicate that SODA can significantly enhance the performance of deployed models in the presence of distribution shifts without requiring access to model parameters。
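
The zeroth-order ingredient of the approach above can be illustrated with a generic two-point random-perturbation gradient estimator: the deployed model is treated as a black box that only returns losses, and the data adaptor's parameters are updated from queried loss differences. The toy model, adaptor, and learning rate below are assumptions; the pseudo-label filtering that makes SODA robust is indicated only in the comments.

```python
# Sketch: ZOO update of a test-time data adaptor against a black-box model.
import numpy as np

def zoo_gradient(loss_fn, theta, mu=1e-2, num_samples=16):
    """Two-point gradient estimate of loss_fn at adaptor parameters theta."""
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        u = np.random.randn(*theta.shape)
        grad += (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu) * u
    return grad / num_samples

# Toy setup: the "adaptor" is an additive perturbation, the "deployed model" is a
# fixed linear scorer we can only query, and the loss uses a (pseudo-)label.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))                 # black-box model weights (inaccessible in practice)
x, pseudo_label = rng.normal(size=5), 1     # in SODA, only high-confidence labels are trusted

def loss_fn(theta):
    logits = W @ (x + theta)                # query the deployed model on adapted data
    return float(np.logaddexp.reduce(logits) - logits[pseudo_label])  # cross-entropy

theta = np.zeros(5)
for step in range(50):
    theta -= 0.1 * zoo_gradient(loss_fn, theta)
print("final loss:", loss_fn(theta))
```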

DORec: Decomposed Object Reconstruction Utilizing 2D Self-Supervised Features

  • paper_url: http://arxiv.org/abs/2310.11092
  • repo_url: None
  • paper_authors: Jun Wu, Sicheng Li, Sihui Ji, Yue Wang, Rong Xiong, Yiyi Liao
  • for: Improving the accuracy of decomposing and reconstructing objects from complex backgrounds.
  • methods: A Decomposed Object Reconstruction (DORec) network based on neural implicit representations uses 2D self-supervised features, transferred into masks of two levels of granularity, to supervise the decomposition.
  • results: The foreground object is segmented and reconstructed with high accuracy on multiple datasets.
    Abstract Decomposing a target object from a complex background while reconstructing is challenging. Most approaches acquire the perception for object instances through the use of manual labels, but the annotation procedure is costly. The recent advancements in 2D self-supervised learning have brought new prospects to object-aware representation, yet it remains unclear how to leverage such noisy 2D features for clean decomposition. In this paper, we propose a Decomposed Object Reconstruction (DORec) network based on neural implicit representations. Our key idea is to transfer 2D self-supervised features into masks of two levels of granularity to supervise the decomposition, including a binary mask to indicate the foreground regions and a K-cluster mask to indicate the semantically similar regions. These two masks are complementary to each other and lead to robust decomposition. Experimental results show the superiority of DORec in segmenting and reconstructing the foreground object on various datasets.
    摘要 分解一个目标对象从复杂背景中分离,而重建时也是一项挑战。大多数方法通过使用手动标注来获得对象实例的感知,但标注过程很昂贵。现代2D自助学习技术的发展带来了新的可能性,但是如何利用这些噪音2D特征来获得清晰的分解仍然是一个未知。本文提出了基于神经无限表示的分解对象网络(DORec),我们的关键想法是将2D自助学习特征转换成两级划分的mask,包括一个二进制划分用于指示前景区域,以及一个K-集群划分用于指示相似区域。这两个划分是相互补偿的,导致了稳定的分解。实验结果表明DORec在不同的数据集上 segment和重建前景对象表现出色。

United We Stand: Using Epoch-wise Agreement of Ensembles to Combat Overfit

  • paper_url: http://arxiv.org/abs/2310.11077
  • repo_url: None
  • paper_authors: Uri Stern, Daniel Shwartz, Daphna Weinshall
  • for: The paper addresses the overfitting that deep neural networks encounter in image classification by proposing a new ensemble prediction method for deep networks that combats overfit.
  • methods: The method builds on a theoretical analysis of a regression model, whose prediction that the variance among classifiers increases when overfit occurs is verified empirically; the ensemble prediction is then determined by the most consensual prediction throughout training.
  • results: On multiple image and text classification datasets, the method removes the loss of generalization caused by overfit and in some cases even surpasses the performance obtained with early stopping. It is easy to implement and can be combined with any training scheme and architecture without additional prior knowledge beyond the training set.
    Abstract Deep neural networks have become the method of choice for solving many image classification tasks, largely because they can fit very complex functions defined over raw images. The downside of such powerful learners is the danger of overfitting the training set, leading to poor generalization, which is usually avoided by regularization and "early stopping" of the training. In this paper, we propose a new deep network ensemble classifier that is very effective against overfit. We begin with the theoretical analysis of a regression model, whose predictions - that the variance among classifiers increases when overfit occurs - is demonstrated empirically in deep networks in common use. Guided by these results, we construct a new ensemble-based prediction method designed to combat overfit, where the prediction is determined by the most consensual prediction throughout the training. On multiple image and text classification datasets, we show that when regular ensembles suffer from overfit, our method eliminates the harmful reduction in generalization due to overfit, and often even surpasses the performance obtained by early stopping. Our method is easy to implement, and can be integrated with any training scheme and architecture, without additional prior knowledge beyond the training set. Accordingly, it is a practical and useful tool to overcome overfit.
    摘要 深度神经网络已成为许多图像分类任务的方法选择,主要是因为它们可以适应非常复杂的图像函数。但是这些强大的学习者也存在过拟合风险,导致泛化性差,通常通过规范和"早停止"等方法来避免。在这篇论文中,我们提出了一种新的深度网络集成分类器,可以很好地避免过拟合。我们从理论分析中开始,对于过拟合情况下的回归模型,其预测结果表明,过拟合时,类ifier的差异量会增加。基于这些结果,我们构建了一种新的集成预测方法,通过在训练过程中确定最一致的预测来对抗过拟合。在多个图像和文本分类 dataset 上,我们证明了,当常见集成遭到过拟合时,我们的方法可以消除过拟合导致的泛化性下降,并经常超越通过早停止获得的性能。我们的方法易于实现,可以与任何训练方案和架构结合使用,无需额外的优先知识,只需要训练集。因此,它是一种实用和有用的工具,可以解决过拟合问题。
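
A small sketch of a consensus-style prediction over training checkpoints: for each test sample, the predicted class is the one that the saved epochs agree on most often, and the agreement level itself flags samples where overfit is likely hurting. This illustrates the "most consensual prediction" idea; the paper's exact rule and its choice of which epochs to keep may differ.

```python
# Sketch: epoch-wise agreement of an ensemble of checkpoints.
import numpy as np

def consensual_prediction(epoch_probs):
    """epoch_probs: (E, N, C) softmax outputs of E checkpoints on N samples."""
    per_epoch_pred = epoch_probs.argmax(axis=-1)                    # (E, N)
    E, N = per_epoch_pred.shape
    C = epoch_probs.shape[-1]
    votes = np.zeros((N, C), dtype=int)
    for e in range(E):
        votes[np.arange(N), per_epoch_pred[e]] += 1                 # agreement counts
    return votes.argmax(axis=-1), votes.max(axis=-1) / E            # class, agreement level

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=(20, 128))                  # 20 epochs, 128 samples (toy)
pred, agreement = consensual_prediction(probs)
print(pred[:5], agreement[:5])   # low agreement flags the least reliable predictions
```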

$k$-$t$ CLAIR: Self-Consistency Guided Multi-Prior Learning for Dynamic Parallel MR Image Reconstruction

  • paper_url: http://arxiv.org/abs/2310.11050
  • repo_url: https://github.com/lpzhang/ktCLAIR
  • paper_authors: Liping Zhang, Weitian Chen
  • for: Fast dynamic MRI for the clinical diagnosis of cardiac disease.
  • methods: A self-consistency guided multi-prior learning framework, $k$-$t$ CLAIR, exploits highly undersampled data for accelerated dynamic parallel MRI reconstruction.
  • results: Experiments show that $k$-$t$ CLAIR achieves high-quality reconstruction for dynamic cardiac MRI, with clear improvements in both quantitative and qualitative terms.
    Abstract Cardiac magnetic resonance imaging (CMR) has been widely used in clinical practice for the medical diagnosis of cardiac diseases. However, the long acquisition time hinders its development in real-time applications. Here, we propose a novel self-consistency guided multi-prior learning framework named $k$-$t$ CLAIR to exploit spatiotemporal correlations from highly undersampled data for accelerated dynamic parallel MRI reconstruction. The $k$-$t$ CLAIR progressively reconstructs faithful images by leveraging multiple complementary priors learned in the $x$-$t$, $x$-$f$, and $k$-$t$ domains in an iterative fashion, as dynamic MRI exhibits high spatiotemporal redundancy. Additionally, $k$-$t$ CLAIR incorporates calibration information for prior learning, resulting in a more consistent reconstruction. Experimental results on cardiac cine and T1W/T2W images demonstrate that $k$-$t$ CLAIR achieves high-quality dynamic MR reconstruction in terms of both quantitative and qualitative performance.
    摘要 cardiac magnetic resonance imaging (CMR) 已经广泛应用在临床实践中用于医疗诊断心脏疾病。然而,长期获取时间限制了其在实时应用中的发展。我们提议一种新的自适应性导向多优先学习框架,名为 $k$-$t$ CLAIR,以利用高度减掉样本数据中的空间时间相关性进行加速的动态平行MRI重建。 $k$-$t$ CLAIR 逐渐重建准确的图像,利用在 $x$-$t$, $x$-$f$, 和 $k$-$t$ 领域中学习的多个补做先天知识,因为动态MRI在空间时间上具有高度相似性。此外, $k$-$t$ CLAIR 还包含了准确性信息 для先天学习,从而使得重建更加一致。实验结果表明, $k$-$t$ CLAIR 在心脏笔记和 T1W/T2W 图像上达到了高质量的动态MR重建, Both quantitative and qualitative performance。

Co-Learning Semantic-aware Unsupervised Segmentation for Pathological Image Registration

  • paper_url: http://arxiv.org/abs/2310.11040
  • repo_url: None
  • paper_authors: Yang Liu, Shi Gu
  • for: This work proposes an unsupervised method for pathological image registration that requires no annotated data, addressing the loss of spatial correspondence and the abnormal tissue distortion caused by focal lesions.
  • methods: Following the principles of Generation, Inpainting, and Registration (GIR), the method trains registration, segmentation, and inpainting modules simultaneously in a co-learning manner, so that segmentation of the focal area and registration of inpainted pairs improve collaboratively.
  • results: Experimental results show accurate registration of pathological images and lesion detection across imaging modalities. Code is available at https://github.com/brain-intelligence-lab/GIRNet.
    Abstract The registration of pathological images plays an important role in medical applications. Despite its significance, most researchers in this field primarily focus on the registration of normal tissue into normal tissue. The negative impact of focal tissue, such as the loss of spatial correspondence information and the abnormal distortion of tissue, are rarely considered. In this paper, we propose GIRNet, a novel unsupervised approach for pathological image registration by incorporating segmentation and inpainting through the principles of Generation, Inpainting, and Registration (GIR). The registration, segmentation, and inpainting modules are trained simultaneously in a co-learning manner so that the segmentation of the focal area and the registration of inpainted pairs can improve collaboratively. Overall, the registration of pathological images is achieved in a completely unsupervised learning framework. Experimental results on multiple datasets, including Magnetic Resonance Imaging (MRI) of T1 sequences, demonstrate the efficacy of our proposed method. Our results show that our method can accurately achieve the registration of pathological images and identify lesions even in challenging imaging modalities. Our unsupervised approach offers a promising solution for the efficient and cost-effective registration of pathological images. Our code is available at https://github.com/brain-intelligence-lab/GIRNet.
    摘要 注册病理图像在医疗应用中扮演着重要的角色。尽管其重要性,大多数研究人员在这个领域主要关注normal tissue到normal tissue的注册。病理区域的负面影响,如损失的空间匹配信息和病理区域的异常扭曲,几乎不被考虑。在这篇论文中,我们提出了GIRNet,一种新的无监督方法,通过生成、填充和注册(GIR)原理,以帮助病理图像注册。注册、分割和填充模块在一起训练,以便在合作方式下提高病理区域的分割和注册匹配的精度。总之,我们的提出的方法可以在完全无监督学习框架下完成病理图像注册。我们的实验结果表明,我们的方法可以准确地注册病理图像,并在具有挑战性的成像模式下识别病理区域。我们的无监督方法可以提供高效、成本效果的病理图像注册解决方案。我们的代码可以在https://github.com/brain-intelligence-lab/GIRNet上获取。

Domain Generalization Using Large Pretrained Models with Mixture-of-Adapters

  • paper_url: http://arxiv.org/abs/2310.11031
  • repo_url: None
  • paper_authors: Gyuseong Lee, Wooseok Jang, Jin Hyeon Kim, Jaewoo Jung, Seungryong Kim
  • for: The goal is a prediction model that remains robust and reliable when deployed, particularly for domain generalization (DG) tasks.
  • methods: Parameter-efficient fine-tuning (PEFT) with adapters is used to improve robustness and reliability, and a mixture-of-adapters (MoA) method combines multiple adapters to improve them further.
  • results: PEFT and MoA yield better robustness and reliability while reducing training time and computational cost; on domain generalization tasks they improve performance and reduce performance degradation.
    Abstract Learning a robust vision model despite large distribution shift is essential for model deployment in real-world settings. Especially, domain generalization (DG) algorithm aims to maintain the performance of a trained model on different distributions which were not seen during training. One of the most effective methods has been leveraging the already learned rich knowledge of large pretrained models. However, naively fine-tuning large models to DG tasks is often practically infeasible due to memory limitations, extensive time requirements for training, and the risk of learned knowledge deterioration. Recently, parameter-efficient fine-tuning (PEFT) methods have been proposed to reduce the high computational cost during training and efficiently adapt large models to downstream tasks. In this work, for the first time, we find that the use of adapters in PEFT methods not only reduce high computational cost during training but also serve as an effective regularizer for DG tasks. Surprisingly, a naive adapter implementation for large models achieve superior performance on common datasets. However, in situations of large distribution shifts, additional factors such as optimal amount of regularization due to the strength of distribution shifts should be considered for a sophisticated adapter implementation. To address this, we propose a mixture-of-expert based adapter fine-tuning method, dubbed as mixture-of-adapters (MoA). Specifically, we employ multiple adapters that have varying capacities, and by using learnable routers, we allocate each token to a proper adapter. By using both PEFT and MoA methods, we effectively alleviate the performance deterioration caused by distribution shifts and achieve state-of-the-art performance on diverse DG benchmarks.
    摘要 学习一个强健的视觉模型,尤其是在不同的分布下进行模型部署,是实际应用中非常重要的。域外泛化(DG)算法的目标是保持训练后的模型在不同的分布下保持性能。然而,直接将大型模型精细调整到DG任务是实际上不可行,因为内存限制、训练时间的投入和模型学习知识的削弱。最近,参数效率的调整方法(PEFT)被提出,以减少训练时间的计算成本并有效地适应大型模型下推理任务。在这项工作中,我们发现了使用适应器不仅可以减少训练时间的计算成本,还可以作为DG任务的有效常规化。 surprisingly,一个简单的适应器实现对常用 datasets 表现出色。然而,在大分布差情况下,需要考虑适当的补偿因子,以适应强大的分布差。为此,我们提出了一种mixture-of-expert(MoA)适应器细化方法,其中我们采用多个适应器,每个适应器有不同的容量,并通过learnable routers来分配每个 токен到合适的适应器。通过使用 PEFT 和 MoA 方法,我们有效地减少分布差引起的性能下降,并在多种 DG bencmarks 上达到了国际首席性表现。
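
A generic MoE-over-adapters sketch under assumed sizes, not the exact MoA design in the paper: several bottleneck adapters of varying capacity sit on top of frozen backbone features, and a learnable router produces a soft weight per token for each adapter.

```python
# Sketch: mixture-of-adapters layer with a learnable per-token router.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim, bottleneck):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(),
                                 nn.Linear(bottleneck, dim))

    def forward(self, x):
        return x + self.net(x)                       # residual adapter

class MixtureOfAdapters(nn.Module):
    def __init__(self, dim=768, bottlenecks=(16, 64, 256)):
        super().__init__()
        self.adapters = nn.ModuleList(BottleneckAdapter(dim, b) for b in bottlenecks)
        self.router = nn.Linear(dim, len(bottlenecks))

    def forward(self, tokens):                       # tokens: (B, T, D) frozen features
        gates = self.router(tokens).softmax(dim=-1)  # (B, T, num_adapters)
        outs = torch.stack([a(tokens) for a in self.adapters], dim=-1)  # (B, T, D, A)
        return (outs * gates.unsqueeze(2)).sum(dim=-1)

moa = MixtureOfAdapters()
features = torch.randn(2, 197, 768)                  # e.g. ViT patch tokens
print(moa(features).shape)                           # torch.Size([2, 197, 768])
```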

NICE: Improving Panoptic Narrative Detection and Segmentation with Cascading Collaborative Learning

  • paper_url: http://arxiv.org/abs/2310.10975
  • repo_url: https://github.com/mr-neko/nice
  • paper_authors: Haowei Wang, Jiayi Ji, Tianyu Guo, Yilong Yang, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji
  • for: This work proposes a unified and effective framework that jointly recognizes and localizes multiple targets in an image and grounds them to a long narrative description.
  • methods: The proposed NICE framework jointly learns panoptic narrative detection and segmentation through two cascading modules based on the barycenter of the mask, Coordinate Guided Aggregation (CGA) and Barycenter Driven Localization (BDL), responsible for segmentation and detection respectively; linking PNS and PND in series with the segmentation barycenter as the anchor aligns the two tasks and lets them complement each other.
  • results: NICE surpasses all existing methods by a large margin, with gains of 4.1% for PND and 2.9% for PNS, validating the proposed collaborative learning strategy.
    Abstract Panoptic Narrative Detection (PND) and Segmentation (PNS) are two challenging tasks that involve identifying and locating multiple targets in an image according to a long narrative description. In this paper, we propose a unified and effective framework called NICE that can jointly learn these two panoptic narrative recognition tasks. Existing visual grounding tasks use a two-branch paradigm, but applying this directly to PND and PNS can result in prediction conflict due to their intrinsic many-to-many alignment property. To address this, we introduce two cascading modules based on the barycenter of the mask, which are Coordinate Guided Aggregation (CGA) and Barycenter Driven Localization (BDL), responsible for segmentation and detection, respectively. By linking PNS and PND in series with the barycenter of segmentation as the anchor, our approach naturally aligns the two tasks and allows them to complement each other for improved performance. Specifically, CGA provides the barycenter as a reference for detection, reducing BDL's reliance on a large number of candidate boxes. BDL leverages its excellent properties to distinguish different instances, which improves the performance of CGA for segmentation. Extensive experiments demonstrate that NICE surpasses all existing methods by a large margin, achieving 4.1% for PND and 2.9% for PNS over the state-of-the-art. These results validate the effectiveness of our proposed collaborative learning strategy. The project of this work is made publicly available at https://github.com/Mr-Neko/NICE.
    摘要 通用和有效的框架NICE(Nice In Coordinate Embedding)可以同时学习多个目标的涵义和位置识别 task。在这篇论文中,我们提出了一种解决方案,即在Panoptic Narrative Detection(PND)和Panoptic Segmentation(PNS)任务之间进行协同学习。现有的视觉定位任务使用两支分支方法,但是直接应用这种方法于PND和PNS可能会导致预测冲突,因为它们具有内在的多对多对应性。为解决这个问题,我们引入了两个协同模块,即坐标导航集成(CGA)和坐标驱动本地化(BDL),负责分割和检测。我们将PNS和PND串联在一起,使得两个任务之间的对应关系自然地进行了协同学习。具体来说,CGA提供了分割的参考点,从而降低BDL的候选框数量的依赖性。BDL利用其优秀的性能来分辨不同的实例,从而提高CGA的分割性能。我们的实验结果表明,NICE比所有现有方法大幅超越,实现了PND4.1%和PNS2.9%的状态当前最佳性能。这些结果证明了我们提出的协同学习策略的有效性。NICE项目的代码可以在https://github.com/Mr-Neko/NICE上获取。

Tracking and Mapping in Medical Computer Vision: A Review

  • paper_url: http://arxiv.org/abs/2310.11475
  • repo_url: None
  • paper_authors: Adam Schmidt, Omid Mohareri, Simon DiMaio, Michael Yip, Septimiu E. Salcudean
  • for: Applications in medical image analysis, including diagnostics and surgical guidance.
  • methods: Camera-based tracking and scene mapping.
  • results: A review and overview of the current state of the field, including recent developments and trends.
    Abstract As computer vision algorithms are becoming more capable, their applications in clinical systems will become more pervasive. These applications include diagnostics such as colonoscopy and bronchoscopy, guiding biopsies and minimally invasive interventions and surgery, automating instrument motion and providing image guidance using pre-operative scans. Many of these applications depend on the specific visual nature of medical scenes and require designing and applying algorithms to perform in this environment. In this review, we provide an update to the field of camera-based tracking and scene mapping in surgery and diagnostics in medical computer vision. We begin with describing our review process, which results in a final list of 515 papers that we cover. We then give a high-level summary of the state of the art and provide relevant background for those who need tracking and mapping for their clinical applications. We then review datasets provided in the field and the clinical needs therein. Then, we delve in depth into the algorithmic side, and summarize recent developments, which should be especially useful for algorithm designers and to those looking to understand the capability of off-the-shelf methods. We focus on algorithms for deformable environments while also reviewing the essential building blocks in rigid tracking and mapping since there is a large amount of crossover in methods. Finally, we discuss the current state of the tracking and mapping methods along with needs for future algorithms, needs for quantification, and the viability of clinical applications in the field. We conclude that new methods need to be designed or combined to support clinical applications in deformable environments, and more focus needs to be put into collecting datasets for training and evaluation.
    摘要 为了满足医疗系统中computer vision算法的应用 becoming more pervasive,这些应用包括诊断如colonoscopy和bronchoscopy、导引生物检查和微创入侵性手术、自动化工具动作并提供预操作扫描图像导航。许多这些应用需要特定的医疗场景的视觉特性,因此需要设计和应用算法以在这种环境中工作。在这篇评论中,我们提供了医疗计算机视觉中摄像机基于跟踪和场景映射的更新。我们开始介绍我们的评审过程,从而获得了515篇论文的最终列表。然后,我们提供了高级概述,并为需要跟踪和映射的价值读者提供了相关的背景信息。然后,我们评审了在领域中提供的数据集,并评估了临床应用中的临床需求。接着,我们深入探讨算法的方面,并总结了最近的进展,这将对算法设计者和需要了解摄像机基于跟踪和映射的方法来说特别有用。我们主要关注可变环境中的算法,并同时评估了基础建立的硬件跟踪和映射方法。最后,我们讨论了当前跟踪和映射方法的状况,以及未来需要的算法、评估量化和临床应用的可行性。我们结论认为,新的方法需要被设计或组合以支持临床应用,并更多的精力需要投入到训练和评估数据集的收集中。

Context-Aware Meta-Learning

  • paper_url: http://arxiv.org/abs/2310.10971
  • repo_url: https://github.com/hallogameboy/MARU
  • paper_authors: Christopher Fifty, Dennis Duan, Ronald G. Junkins, Ehsan Amid, Jure Leskovec, Christopher Ré, Sebastian Thrun
  • for: This paper addresses how visual models can learn new concepts during inference.
  • methods: A meta-learning algorithm built on a frozen pre-trained feature extractor learns new visual concepts during inference without fine-tuning.
  • results: On 8 out of 11 meta-learning benchmarks, the algorithm matches or exceeds the state-of-the-art algorithm P>M>F without meta-training or fine-tuning.
    Abstract Large Language Models like ChatGPT demonstrate a remarkable capacity to learn new concepts during inference without any fine-tuning. However, visual models trained to detect new objects during inference have been unable to replicate this ability, and instead either perform poorly or require meta-training and/or fine-tuning on similar objects. In this work, we propose a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning. Our approach leverages a frozen pre-trained feature extractor, and analogous to in-context learning, recasts meta-learning as sequence modeling over datapoints with known labels and a test datapoint with an unknown label. On 8 out of 11 meta-learning benchmarks, our approach -- without meta-training or fine-tuning -- exceeds or matches the state-of-the-art algorithm, P>M>F, which is meta-trained on these benchmarks.
    摘要 大语言模型如ChatGPT表现出了惊人的新概念学习能力,而无需任何精度调整。然而,用于检测新对象的视觉模型在推理时表现不佳,或者需要meta-training和/或精度调整。在这项工作中,我们提出了一种meta-学习算法,可以在推理时学习新的视觉概念,无需精度调整。我们的方法利用冻结的预训练特征提取器,类似于上下文学习,将meta-学习视为序列模型化,对已知标签的数据点和未知标签的测试数据点进行模型化。在11个meta-学习benchmark上,我们的方法,无需meta-training或精度调整,超过或等于状态的算法P>M>F,该算法在这些benchmark上进行meta-training。
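
A sketch of the in-context recipe described above, under assumed sizes and a generic Transformer encoder rather than the paper's exact architecture: frozen features of labelled support images and one unlabelled query are combined with label embeddings (an "unknown" embedding for the query), passed through the encoder as a sequence, and the query token is read out and classified.

```python
# Sketch: few-shot classification recast as sequence modeling over datapoints.
import torch
import torch.nn as nn

class InContextClassifier(nn.Module):
    def __init__(self, feat_dim=512, n_classes=5, dim=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)
        self.label_emb = nn.Embedding(n_classes + 1, dim)   # index n_classes = "unknown"
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=3)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, support_feats, support_labels, query_feat):
        # support_feats: (S, F) frozen features, support_labels: (S,), query_feat: (F,)
        unknown = torch.full((1,), self.head.out_features, dtype=torch.long)
        feats = torch.cat([support_feats, query_feat.unsqueeze(0)])
        labels = torch.cat([support_labels, unknown])
        tokens = self.proj(feats) + self.label_emb(labels)
        out = self.encoder(tokens.unsqueeze(0))             # (1, S + 1, dim)
        return self.head(out[0, -1])                        # logits for the query token

model = InContextClassifier()
logits = model(torch.randn(25, 512), torch.randint(0, 5, (25,)), torch.randn(512))
print(logits.shape)   # torch.Size([5]) -- prediction without any fine-tuning
```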

MRI brain tumor segmentation using informative feature vectors and kernel dictionary learning

  • paper_url: http://arxiv.org/abs/2310.10963
  • repo_url: None
  • paper_authors: Seyedeh Mahya Mousavi, Mohammad Mostafavi
  • for: The method classifies brain regions in healthy and tumorous brain MRI scans.
  • methods: A kernel dictionary learning algorithm learns feature vectors for healthy and tumorous tissue from brain MRI scans.
  • results: Experimental results show higher accuracy and faster processing than other existing methods, completing the segmentation task quickly while reducing training time and memory requirements.
    Abstract This paper presents a method based on a kernel dictionary learning algorithm for segmenting brain tumor regions in magnetic resonance images (MRI). A set of first-order and second-order statistical feature vectors are extracted from patches of size 3 * 3 around pixels in the brain MRI scans. These feature vectors are utilized to train two kernel dictionaries separately for healthy and tumorous tissues. To enhance the efficiency of the dictionaries and reduce training time, a correlation-based sample selection technique is developed to identify the most informative and discriminative subset of feature vectors. This technique aims to improve the performance of the dictionaries by selecting a subset of feature vectors that provide valuable information for the segmentation task. Subsequently, a linear classifier is utilized to distinguish between healthy and unhealthy pixels based on the learned dictionaries. The results demonstrate that the proposed method outperforms other existing methods in terms of segmentation accuracy and significantly reduces both the time and memory required, resulting in a remarkably fast training process.
    摘要 这个论文提出了基于kernel字典学习算法的 brain 肿瘤区域分割方法,使用patches 的first-order和second-order统计特征向量从brain MRI扫描中提取特征向量,然后使用这些特征向量训练两个kernel字典,一个用于健康组,一个用于肿瘤组。为了提高字典的效率和减少训练时间,我们提出了一种基于相关性的样本选择技术,以选择最有价值和分类的特征向量,从而提高分类性能。接着,我们使用学习的字典和线性分类器来分割健康和肿瘤组。结果表明,提出的方法在分割精度和训练时间上具有明显的优势,比其他方法更高效。

Enhancing Deep Neural Network Training Efficiency and Performance through Linear Prediction

  • paper_url: http://arxiv.org/abs/2310.10958
  • repo_url: None
  • paper_authors: Hejie Ying, Mengmeng Song, Yaohong Tang, Shungen Xiao, Zimin Xiao
  • for: Improving the training efficiency and performance of deep neural network (DNN) models.
  • methods: Based on the observation that DNN parameters change according to certain laws during training, a Parameter Linear Prediction (PLP) method is used to predict DNN parameters.
  • results: On representative backbones (VGG16, ResNet18, and GoogLeNet), under the same training conditions and number of epochs, the proposed method obtains about 1% average accuracy improvement and a 0.01 top-1/top-5 error reduction on the CIFAR-100 dataset compared to normal training.
    Abstract Deep neural networks (DNN) have achieved remarkable success in various fields, including computer vision and natural language processing. However, training an effective DNN model still poses challenges. This paper aims to propose a method to optimize the training effectiveness of DNN, with the goal of improving model performance. Firstly, based on the observation that the DNN parameters change in certain laws during the training process, the potential of parameter prediction for improving model training efficiency and performance is discovered. Secondly, considering the magnitude of DNN model parameters, hardware limitations and characteristics of Stochastic Gradient Descent (SGD) for noise tolerance, a Parameter Linear Prediction (PLP) method is exploited to perform DNN parameter prediction. Finally, validations are carried out on some representative backbones. Experiment results show that, compared to the normal training ways, under the same training conditions and epochs, by employing the proposed PLP method, the optimal model is able to obtain an average accuracy improvement of about 1% and a 0.01 top-1/top-5 error reduction for VGG16, ResNet18 and GoogLeNet on the CIFAR-100 dataset, which shows the effectiveness of the proposed method on different DNN structures and validates its capacity in enhancing DNN training efficiency and performance.
    摘要 深度神经网络(DNN)在不同领域取得了显著成功,包括计算机视觉和自然语言处理。然而,训练有效的DNN模型仍然存在挑战。本文旨在提出一种方法,以提高DNN训练效果并提高模型性能。首先,基于训练过程中DNN参数变化的规律,探讨可以通过预测参数来提高DNN训练效率和性能的潜在可能性。其次,考虑到DNN模型参数的大小、硬件限制和SGD算法对雷yy的耐受性,提出了一种基于PLP方法的DNN参数预测方法。最后,对一些代表性的背bone进行验证。实验结果显示,相比于常规训练方式,通过提议的PLP方法,在同等训练条件和轮数下,可以获得average约1%的准确率提高和0.01的top-1/top-5错误减少,这demonstrates the effectiveness of the proposed method on different DNN structures and validates its ability to enhance DNN training efficiency and performance.
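
The linear-prediction idea can be sketched with a per-weight least-squares extrapolation from a short history of parameter snapshots. How often to apply the predicted jump and how many history points to keep are the paper's design choices; the values below are illustrative.

```python
# Sketch: parameter linear prediction from epoch-wise snapshots.
import numpy as np

def linear_predict(history, steps_ahead=1):
    """history: (T, ...) parameter snapshots at epochs 0..T-1 -> predicted parameters."""
    T = history.shape[0]
    t = np.arange(T, dtype=float)
    flat = history.reshape(T, -1)
    # Per-weight least-squares fit of w(t) = a * t + b.
    a, b = np.polyfit(t, flat, deg=1)
    pred = a * (T - 1 + steps_ahead) + b
    return pred.reshape(history.shape[1:])

# Toy example: a weight matrix drifting over 5 epochs with some SGD-like noise.
rng = np.random.default_rng(0)
snapshots = np.stack([0.1 * epoch + rng.normal(scale=0.01, size=(4, 4))
                      for epoch in range(5)])
predicted = linear_predict(snapshots, steps_ahead=1)
print(np.round(predicted, 2))   # would replace the weights before continuing training
```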

Medical Image Segmentation via Sparse Coding Decoder

  • paper_url: http://arxiv.org/abs/2310.10957
  • repo_url: None
  • paper_authors: Long Zeng, Kaigui Wu
  • for: This paper investigates applying Transformer models to medical image segmentation while improving the limited spatial recovery ability of their decoders.
  • methods: A convolutional sparse vector coding based decoder, CASCSCDE, represents the features extracted by the encoder with sparse vectors.
  • results: Combining CASCSCDE with TransUNet improves performance on the Synapse benchmark by up to 3.15% DICE and 1.16% mIoU compared to TransUNet alone.
    Abstract Transformers have achieved significant success in medical image segmentation, owing to its capability to capture long-range dependencies. Previous works incorporate convolutional layers into the encoder module of transformers, thereby enhancing their ability to learn local relationships among pixels. However, transformers may suffer from limited generalization capabilities and reduced robustness, attributed to the insufficient spatial recovery ability of their decoders. To address this issue, A convolution sparse vector coding based decoder is proposed , namely CAScaded multi-layer Convolutional Sparse vector Coding DEcoder (CASCSCDE), which represents features extracted by the encoder using sparse vectors. To prove the effectiveness of our CASCSCDE, The widely-used TransUNet model is chosen for the demonstration purpose, and the CASCSCDE is incorporated with TransUNet to establish the TransCASCSCDE architecture. Our experiments demonstrate that TransUNet with CASCSCDE significantly enhances performance on the Synapse benchmark, obtaining up to 3.15\% and 1.16\% improvements in DICE and mIoU scores, respectively. CASCSCDE opens new ways for constructing decoders based on convolutional sparse vector coding.
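As an illustration of what a convolutional sparse coding decoder stage might look like, here is a hedged sketch (not the paper's implementation) that loosely unrolls a few ISTA-style iterations with learned convolutional dictionaries; channel sizes, iteration count, and thresholds are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvSparseCodingBlock(nn.Module):
    """One illustrative decoder stage: infer a sparse code for the incoming feature
    map with a few unrolled ISTA-style iterations, reconstruct from the code, and
    upsample the result."""
    def __init__(self, in_ch, code_ch, iters=3):
        super().__init__()
        self.analysis = nn.Conv2d(in_ch, code_ch, 3, padding=1)   # feature -> code space
        self.synthesis = nn.Conv2d(code_ch, in_ch, 3, padding=1)  # learned dictionary
        self.thresh = nn.Parameter(torch.tensor(0.1))
        self.iters = iters

    def _shrink(self, z):
        # Soft-thresholding: proximal operator of an L1 sparsity penalty.
        return torch.sign(z) * torch.relu(torch.abs(z) - torch.abs(self.thresh))

    def forward(self, x):
        z = self._shrink(self.analysis(x))
        for _ in range(self.iters):
            residual = x - self.synthesis(z)          # current reconstruction error
            z = self._shrink(z + self.analysis(residual))
        out = self.synthesis(z)
        return F.interpolate(out, scale_factor=2, mode="bilinear", align_corners=False)
```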

FusionU-Net: U-Net with Enhanced Skip Connection for Pathology Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.10951
  • repo_url: https://github.com/zongyi-lee/fusionu-net
  • paper_authors: Zongyi Li, Hongbing Lyu, Jun Wang
  • for: This work aims to improve U-Net and its variants on pathology image segmentation by proposing a new U-Net-based network, FusionU-Net.
  • methods: FusionU-Net builds on the U-Net structure and adds a fusion module that exchanges information between skip connections at different encoder/decoder levels to reduce the semantic gap; a two-round exchange mechanism accounts for both the local relevance between adjacent encoder layer outputs and the need for bi-directional information exchange across multiple layers (a minimal sketch follows this entry).
  • results: Extensive experiments on multiple pathology image datasets show that FusionU-Net outperforms competing methods; the authors argue the fusion module is more effective than existing designs and can be easily embedded into other networks to further improve performance.
    Abstract In recent years, U-Net and its variants have been widely used in pathology image segmentation tasks. One of the key designs of U-Net is the use of skip connections between the encoder and decoder, which helps to recover detailed information after upsampling. While most variations of U-Net adopt the original skip connection design, there is a semantic gap between the encoder and decoder that can negatively impact model performance. Therefore, it is important to reduce this semantic gap before conducting the skip connection. To address this issue, we propose a new segmentation network called FusionU-Net, which is based on the U-Net structure and incorporates a fusion module to exchange information between different skip connections to reduce semantic gaps. Unlike the other fusion modules in existing networks, ours is based on a two-round fusion design that fully considers the local relevance between adjacent encoder layer outputs and the need for bi-directional information exchange across multiple layers. We conducted extensive experiments on multiple pathology image datasets to evaluate our model and found that FusionU-Net achieves better performance compared to other competing methods. We argue that our fusion module is more effective than the designs of existing networks, and it could be easily embedded into other networks to further enhance the model performance.
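A rough, assumption-laden sketch of what a two-round fusion between a skip connection and its neighbouring deeper feature could look like (an illustration only, not the released code at the repo above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoRoundFusion(nn.Module):
    """Fuse a shallow skip-connection feature with the adjacent deeper encoder output
    in two rounds: an upward exchange followed by a downward one. Channel sizes are
    placeholders."""
    def __init__(self, ch_shallow, ch_deep):
        super().__init__()
        self.deep_to_shallow = nn.Conv2d(ch_deep, ch_shallow, kernel_size=1)
        self.shallow_to_deep = nn.Conv2d(ch_shallow, ch_deep, kernel_size=1)
        self.refine_shallow = nn.Conv2d(ch_shallow, ch_shallow, 3, padding=1)
        self.refine_deep = nn.Conv2d(ch_deep, ch_deep, 3, padding=1)

    def forward(self, f_shallow, f_deep):
        # Round 1: push deeper semantics into the shallow skip feature.
        up = F.interpolate(self.deep_to_shallow(f_deep), size=f_shallow.shape[-2:],
                           mode="bilinear", align_corners=False)
        f_shallow = self.refine_shallow(f_shallow + up)
        # Round 2: push the refined shallow detail back into the deep feature.
        down = F.adaptive_avg_pool2d(self.shallow_to_deep(f_shallow), f_deep.shape[-2:])
        f_deep = self.refine_deep(f_deep + down)
        return f_shallow, f_deep
```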

UNK-VQA: A Dataset and A Probe into Multi-modal Large Models’ Abstention Ability

  • paper_url: http://arxiv.org/abs/2310.10942
  • repo_url: https://github.com/guoyang9/unk-vqa
  • paper_authors: Yanyang Guo, Fangkai Jiao, Zhiqi Shen, Liqiang Nie, Mohan Kankanhalli
  • for: The goal is to help visual question answering (VQA) models abstain from answering unanswerable questions, towards building more trustworthy AI systems.
  • methods: The existing data is first augmented via deliberate perturbations of either the image or the question while keeping the question-image semantics close to the original, unperturbed distribution; the zero- and few-shot performance of several emerging multi-modal large models is then evaluated, revealing clear limitations on this dataset; finally, a simple method for handling the unanswerable questions is proposed.
  • results: The resulting UNK-VQA dataset can be used to improve the abstention ability of VQA models and thereby the trustworthiness of AI systems, providing an important benchmark for further work in this direction.
    Abstract Teaching Visual Question Answering (VQA) models to refrain from answering unanswerable questions is necessary for building a trustworthy AI system. Existing studies, though they have explored various aspects of VQA, have somewhat ignored this particular attribute. This paper aims to bridge the research gap by contributing a comprehensive dataset, called UNK-VQA. The dataset is specifically designed to address the challenge of questions that models do not know. To this end, we first augment the existing data via deliberate perturbations on either the image or question. Specifically, we carefully ensure that the question-image semantics remain close to the original unperturbed distribution. By this means, the identification of unanswerable questions becomes challenging, setting our dataset apart from others that involve mere image replacement. We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models and discover their significant limitations when applied to our dataset. Additionally, we also propose a straightforward method to tackle these unanswerable questions. This dataset, we believe, will serve as a valuable benchmark for enhancing the abstention capability of VQA models, thereby leading to increased trustworthiness of AI systems. We have made the \href{https://github.com/guoyang9/UNK-VQA}{dataset} available to facilitate further exploration in this area.

Towards Training-free Open-world Segmentation via Image Prompting Foundation Models

  • paper_url: http://arxiv.org/abs/2310.10912
  • repo_url: None
  • paper_authors: Lv Tang, Peng-Tao Jiang, Hao-Ke Xiao, Bo Li
  • for: This paper explores open-world segmentation using a novel approach called Image Prompt Segmentation (IPSeg), which leverages vision foundational models and image prompting techniques to segment target objects in input images without requiring exhaustive training sessions.
  • methods: IPSeg utilizes a single image containing a subjective visual concept as a flexible prompt to query vision foundation models like DINOv2 and Stable Diffusion. The approach extracts robust features for the prompt image and input image, then matches the input representations to the prompt representations via a novel feature interaction module to generate point prompts highlighting target objects in the input image (a hedged matching sketch follows this entry).
  • results: Experiments on COCO, PASCAL VOC, and other datasets demonstrate IPSeg's efficacy for flexible open-world segmentation using intuitive image prompts. The proposed method offers a more efficient and scalable solution for open-world segmentation compared to traditional training-based methods.
    Abstract The realm of computer vision has witnessed a paradigm shift with the advent of foundational models, mirroring the transformative influence of large language models in the domain of natural language processing. This paper delves into the exploration of open-world segmentation, presenting a novel approach called Image Prompt Segmentation (IPSeg) that harnesses the power of vision foundational models. At the heart of IPSeg lies the principle of a training-free paradigm, which capitalizes on image prompting techniques. IPSeg utilizes a single image containing a subjective visual concept as a flexible prompt to query vision foundation models like DINOv2 and Stable Diffusion. Our approach extracts robust features for the prompt image and input image, then matches the input representations to the prompt representations via a novel feature interaction module to generate point prompts highlighting target objects in the input image. The generated point prompts are further utilized to guide the Segment Anything Model to segment the target object in the input image. The proposed method stands out by eliminating the need for exhaustive training sessions, thereby offering a more efficient and scalable solution. Experiments on COCO, PASCAL VOC, and other datasets demonstrate IPSeg's efficacy for flexible open-world segmentation using intuitive image prompts. This work pioneers tapping foundation models for open-world understanding through visual concepts conveyed in images.
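The matching step can be pictured with a small, hedged sketch: given pre-extracted dense features for the prompt and input images (e.g., from DINOv2, not loaded here), cosine similarity against the prompt object's mean feature yields candidate point prompts for a segmenter. This is an illustration, not the authors' feature interaction module:

```python
import torch
import torch.nn.functional as F

def point_prompts_from_features(prompt_feats, prompt_mask, input_feats, top_k=3):
    """prompt_feats, input_feats: (C, H, W) dense feature maps from a vision
    foundation model; prompt_mask: (H, W) non-empty binary mask of the concept in
    the prompt image. Returns top-k (row, col) locations in the input feature grid."""
    C, H, W = input_feats.shape
    # Mean feature of the prompted concept.
    masked = prompt_feats[:, prompt_mask.bool()]            # (C, N)
    concept = F.normalize(masked.mean(dim=1), dim=0)        # (C,)
    # Cosine similarity of every input location to the concept vector.
    flat = F.normalize(input_feats.reshape(C, -1), dim=0)   # (C, H*W)
    sim = concept @ flat                                     # (H*W,)
    idx = sim.topk(top_k).indices
    return [(int(i // W), int(i % W)) for i in idx]
```

The returned coordinates would then be passed as point prompts to a promptable segmenter such as the Segment Anything Model.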

cs.AI - 2023-10-17

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

  • paper_url: http://arxiv.org/abs/2310.11628
  • repo_url: https://github.com/avi-jit/etok
  • paper_authors: Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara
  • for: This paper studies tokenization for language modeling and proposes a "learn your tokens" scheme to improve the model's representational power and performance.
  • methods: Bytes/characters are pooled into word representations using word boundaries; the word representations are fed to the primary language model, which then decodes individual characters/bytes per word in parallel (a toy pooling sketch follows this entry).
  • results: Compared with byte/character-level and subword models, this moderately expressive and moderately fast end-to-end tokenizer performs best on the intrinsic next-word prediction metric, with roughly a thirty-fold improvement on rare words.
    Abstract Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the limitations of such a tokenization strategy, particularly for documents not written in English and for representing numbers. On the other extreme, byte/character-level language models are much less restricted but suffer from increased sequence description lengths and a subsequent quadratic expansion in self-attention computation. Recent attempts to compress and limit these context lengths with fixed size convolutions is helpful but completely ignores the word boundary. This paper considers an alternative 'learn your tokens' scheme which utilizes the word boundary to pool bytes/characters into word representations, which are fed to the primary language model, before again decoding individual characters/bytes per word in parallel. We find that our moderately expressive and moderately fast end-to-end tokenizer outperform by over 300% both subwords and byte/character models over the intrinsic language modeling metric of next-word prediction across datasets. It particularly outshines on rare words, outperforming by a factor of 30! We extensively study the language modeling setup for all three categories of tokenizers and theoretically analyze how our end-to-end models can also be a strong trade-off in efficiency and robustness.
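A toy sketch of word-boundary pooling over character embeddings (hypothetical shapes and a whitespace boundary rule; the actual model at the repo above also decodes characters per word, which is omitted here):

```python
import torch
import torch.nn as nn

class WordPooler(nn.Module):
    """Pool character embeddings into one vector per word using whitespace
    boundaries, producing a shorter sequence for the primary language model."""
    def __init__(self, vocab_size=256, char_dim=64, word_dim=256):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, char_dim)
        self.pool = nn.GRU(char_dim, word_dim, batch_first=True)

    def forward(self, text: str):
        words = text.split()                      # word boundary = whitespace (simplification)
        word_vecs = []
        for w in words:
            ids = torch.tensor([[min(ord(c), 255) for c in w]])
            chars = self.char_emb(ids)            # (1, len(w), char_dim)
            _, h = self.pool(chars)               # final GRU state summarizes the word
            word_vecs.append(h[-1])               # (1, word_dim)
        return torch.cat(word_vecs, dim=0)        # (num_words, word_dim)

pooled = WordPooler()("learn your tokens")        # -> tensor of shape (3, 256)
```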

An Optimistic-Robust Approach for Dynamic Positioning of Omnichannel Inventories

  • paper_url: http://arxiv.org/abs/2310.12183
  • repo_url: None
  • paper_authors: Pavithra Harsha, Shivaram Subramanian, Ali Koc, Mahesh Ramakrishna, Brian Quanz, Dhruv Shah, Chandra Narayanaswami
  • for: Proposes a new data-driven, distribution-free optimistic-robust bimodal inventory optimization (BIO) strategy for positioning inventory across a retail chain under time-varying, uncertain omnichannel demand, improving both average-case performance and robustness.
  • methods: The strategy combines data-driven and robust optimization to go beyond classical robust optimization, considering both the downside (worst-case adversarial demand) and the upside (average-case performance in the presence of endogenous outliers), with a tunable trade-off between robustness and the average case (a toy objective of this kind is sketched after this entry).
  • results: Experiments on data from a large American omnichannel retailer show that BIO achieves over a 15% profitability gain during a peak period relative to robust optimization and other baselines, while preserving practical worst-case performance.
    Abstract We introduce a new class of data-driven and distribution-free optimistic-robust bimodal inventory optimization (BIO) strategy to effectively allocate inventory across a retail chain to meet time-varying, uncertain omnichannel demand. While prior Robust optimization (RO) methods emphasize the downside, i.e., worst-case adversarial demand, BIO also considers the upside to remain resilient like RO while also reaping the rewards of improved average-case performance by overcoming the presence of endogenous outliers. This bimodal strategy is particularly valuable for balancing the tradeoff between lost sales at the store and the costs of cross-channel e-commerce fulfillment, which is at the core of our inventory optimization model. These factors are asymmetric due to the heterogenous behavior of the channels, with a bias towards the former in terms of lost-sales cost and a dependence on network effects for the latter. We provide structural insights about the BIO solution and how it can be tuned to achieve a preferred tradeoff between robustness and the average-case. Our experiments show that significant benefits can be achieved by rethinking traditional approaches to inventory management, which are siloed by channel and location. Using a real-world dataset from a large American omnichannel retail chain, a business value assessment during a peak period indicates over a 15% profitability gain for BIO over RO and other baselines while also preserving the (practical) worst case performance.
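To illustrate the flavour of a bimodal objective (purely a toy model with made-up cost terms, not the paper's formulation), one can blend the average-case and worst-case scenario costs for a single store with a tunable weight:

```python
import numpy as np

def scenario_cost(stock, demand, lost_sale_cost=5.0, fulfillment_cost=2.0, ecom_capacity=5.0):
    """Toy per-scenario cost: unmet in-store demand is either served cross-channel
    (fulfillment cost) up to a capacity, or lost; leftovers incur a small holding cost."""
    shortfall = max(demand - stock, 0.0)
    fulfilled_online = min(shortfall, ecom_capacity)
    lost = shortfall - fulfilled_online
    holding = 0.1 * max(stock - demand, 0.0)
    return fulfillment_cost * fulfilled_online + lost_sale_cost * lost + holding

def bimodal_objective(stock, demand_scenarios, robustness_weight=0.3):
    """Blend worst-case and average-case costs; weight 1 recovers a purely robust
    objective, 0 a purely average-case one."""
    costs = np.array([scenario_cost(stock, d) for d in demand_scenarios])
    return robustness_weight * costs.max() + (1 - robustness_weight) * costs.mean()

demands = np.random.default_rng(0).poisson(20, size=200)   # simulated demand draws
best_stock = min(range(0, 60), key=lambda s: bimodal_objective(s, demands))
```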

Unveiling the General Intelligence Factor in Language Models: A Psychometric Approach

  • paper_url: http://arxiv.org/abs/2310.11616
  • repo_url: https://github.com/davidilic/g-in-llms
  • paper_authors: David Ilić
  • for: This study investigates the general intelligence factor (g) in language models, extending psychometric theory traditionally applied to humans and certain animal species.
  • methods: Factor analysis over two large datasets, the Open LLM Leaderboard (1,232 models) and the GLUE Leaderboard (88 models), reveals a single, highly stable g factor that accounts for 85% of the variance in model performance, with a moderate correlation of .48 between model size and g (a minimal variance-explained sketch follows this entry).
  • results: The discovery of g in language models provides a unified evaluation metric and opens new avenues for more robust, g-based ability assessment, laying a foundation for applying psychometric theory to artificial general intelligence, with practical implications for model evaluation and development.
    Abstract This study uncovers the factor of general intelligence, or g, in language models, extending the psychometric theory traditionally applied to humans and certain animal species. Utilizing factor analysis on two extensive datasets - Open LLM Leaderboard with 1,232 models and General Language Understanding Evaluation (GLUE) Leaderboard with 88 models - we find compelling evidence for a unidimensional, highly stable g factor that accounts for 85% of the variance in model performance. The study also finds a moderate correlation of .48 between model size and g. The discovery of g in language models offers a unified metric for model evaluation and opens new avenues for more robust, g-based model ability assessment. These findings lay the foundation for understanding and future research on artificial general intelligence from a psychometric perspective and have practical implications for model evaluation and development.
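As a hedged illustration of the kind of computation behind a single dominant factor, here is the fraction of variance captured by the first principal component of standardized benchmark scores (a simplification of the paper's factor analysis):

```python
import numpy as np

def first_factor_variance_explained(scores: np.ndarray) -> float:
    """scores: (n_models, n_benchmarks) matrix of benchmark results.
    Returns the fraction of total variance captured by the first component."""
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)  # standardize each benchmark
    _, s, _ = np.linalg.svd(z, full_matrices=False)
    eigvals = s ** 2
    return float(eigvals[0] / eigvals.sum())

# Synthetic check: scores driven by one latent ability plus noise -> ratio close to 1.
rng = np.random.default_rng(0)
ability = rng.normal(size=(100, 1))
scores = ability @ rng.normal(size=(1, 6)) + 0.3 * rng.normal(size=(100, 6))
print(first_factor_variance_explained(scores))
```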

Learning a Hierarchical Planner from Humans in Multiple Generations

  • paper_url: http://arxiv.org/abs/2310.11614
  • repo_url: None
  • paper_authors: Leonardo Hernandez Cano, Yewen Pu, Robert D. Hawkins, Josh Tenenbaum, Armando Solar-Lezama
  • for: This work aims to help machines acquire knowledge from humans across multiple generations of users, improving their adaptability and their ability to solve complex tasks.
  • methods: Natural programming is a library learning system that combines programmatic learning with a hierarchical planner. It maintains a library of decompositions, each consisting of a goal, a linguistic description of how the goal decomposes into sub-goals, and a concrete instance of that decomposition. Users teach the system through curriculum building: they pick a challenging but feasible goal and provide linguistic hints that guide the hierarchical planner towards the right plan; newly found decompositions from successful searches are added to the library (a minimal sketch of such a library follows this entry).
  • results: Simulated studies and a human experiment (n=360) in a controlled environment show that natural programming robustly composes programs learned from different users and contexts, adapts faster when contexts change, and solves more complex tasks than programmatic baselines.
    Abstract A typical way in which a machine acquires knowledge from humans is by programming. Compared to learning from demonstrations or experiences, programmatic learning allows the machine to acquire a novel skill as soon as the program is written, and, by building a library of programs, a machine can quickly learn how to perform complex tasks. However, as programs often take their execution contexts for granted, they are brittle when the contexts change, making it difficult to adapt complex programs to new contexts. We present natural programming, a library learning system that combines programmatic learning with a hierarchical planner. Natural programming maintains a library of decompositions, consisting of a goal, a linguistic description of how this goal decompose into sub-goals, and a concrete instance of its decomposition into sub-goals. A user teaches the system via curriculum building, by identifying a challenging yet not impossible goal along with linguistic hints on how this goal may be decomposed into sub-goals. The system solves for the goal via hierarchical planning, using the linguistic hints to guide its probability distribution in proposing the right plans. The system learns from this interaction by adding newly found decompositions in the successful search into its library. Simulated studies and a human experiment (n=360) on a controlled environment demonstrate that natural programming can robustly compose programs learned from different users and contexts, adapting faster and solving more complex tasks when compared to programmatic baselines.
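A minimal sketch of the kind of decomposition record such a library might store, with naive keyword retrieval (names and the retrieval rule are illustrative, not from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class Decomposition:
    goal: str               # e.g. "build a shelter"
    hint: str               # linguistic description of how to break the goal down
    subgoals: list[str]     # concrete instance of the decomposition

@dataclass
class DecompositionLibrary:
    entries: list[Decomposition] = field(default_factory=list)

    def add(self, goal: str, hint: str, subgoals: list[str]) -> None:
        self.entries.append(Decomposition(goal, hint, subgoals))

    def lookup(self, goal: str) -> list[Decomposition]:
        # Naive retrieval: decompositions whose goal shares words with the query.
        words = set(goal.lower().split())
        return [d for d in self.entries if words & set(d.goal.lower().split())]
```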

Language Models as Zero-Shot Trajectory Generators

  • paper_url: http://arxiv.org/abs/2310.11604
  • repo_url: None
  • paper_authors: Teyun Kwon, Norman Di Palo, Edward Johns
  • for: This paper examines whether a Large Language Model (LLM) can directly predict a robot's dense, low-level end-effector trajectories, rather than being limited to high-level planning in robotics.
  • methods: GPT-4 is used as the language model, with only object detection and segmentation vision models providing its input; a single task-agnostic prompt is used, without in-context examples, motion primitives, or external trajectory optimizers.
  • results: The single prompt performs well across 26 real-world language-based tasks such as "open the bottle cap" and "wipe the plate with the sponge"; the study finds that LLMs possess sufficient low-level control knowledge for many common tasks and can also detect failures and re-plan trajectories accordingly.
    Abstract Large Language Models (LLMs) have recently shown promise as high-level planners for robots when given access to a selection of low-level skills. However, it is often assumed that LLMs do not possess sufficient knowledge to be used for the low-level trajectories themselves. In this work, we address this assumption thoroughly, and investigate if an LLM (GPT-4) can directly predict a dense sequence of end-effector poses for manipulation skills, when given access to only object detection and segmentation vision models. We study how well a single task-agnostic prompt, without any in-context examples, motion primitives, or external trajectory optimisers, can perform across 26 real-world language-based tasks, such as "open the bottle cap" and "wipe the plate with the sponge", and we investigate which design choices in this prompt are the most effective. Our conclusions raise the assumed limit of LLMs for robotics, and we reveal for the first time that LLMs do indeed possess an understanding of low-level robot control sufficient for a range of common tasks, and that they can additionally detect failures and then re-plan trajectories accordingly. Videos, code, and prompts are available at: https://www.robot-learning.uk/language-models-trajectory-generators.

The Efficacy of Transformer-based Adversarial Attacks in Security Domains

  • paper_url: http://arxiv.org/abs/2310.11597
  • repo_url: None
  • paper_authors: Kunyang Li, Kyle Domico, Jean-Charles Noirot Ferrand, Patrick McDaniel
  • for: This work investigates the robustness of transformer architectures to adversarial examples and their adversarial strength as a source of transferable attacks in security domains.
  • methods: Fine-tuned pre-trained transformer, Convolutional Neural Network (CNN), and hybrid (ensemble of transformer and CNN) models are used to solve different downstream image tasks; an attack algorithm then crafts 19,367 adversarial examples for each model and each task, and transferability is measured by evaluating each set of examples on the other models (a hedged craft-and-transfer sketch follows this entry).
  • results: Adversarial examples crafted on transformers have the highest transferability to other models (25.7% above the average), while examples crafted on other models have the lowest transferability onto transformers (56.7% below the average); the results highlight the importance of studying transformer architectures for both attacking and defending models in security domains, and suggest using them as the primary architecture in transfer-attack settings.
    Abstract Today, the security of many domains rely on the use of Machine Learning to detect threats, identify vulnerabilities, and safeguard systems from attacks. Recently, transformer architectures have improved the state-of-the-art performance on a wide range of tasks such as malware detection and network intrusion detection. But, before abandoning current approaches to transformers, it is crucial to understand their properties and implications on cybersecurity applications. In this paper, we evaluate the robustness of transformers to adversarial samples for system defenders (i.e., resiliency to adversarial perturbations generated on different types of architectures) and their adversarial strength for system attackers (i.e., transferability of adversarial samples generated by transformers to other target models). To that effect, we first fine-tune a set of pre-trained transformer, Convolutional Neural Network (CNN), and hybrid (an ensemble of transformer and CNN) models to solve different downstream image-based tasks. Then, we use an attack algorithm to craft 19,367 adversarial examples on each model for each task. The transferability of these adversarial examples is measured by evaluating each set on other models to determine which models offer more adversarial strength, and consequently, more robustness against these attacks. We find that the adversarial examples crafted on transformers offer the highest transferability rate (i.e., 25.7% higher than the average) onto other models. Similarly, adversarial examples crafted on other models have the lowest rate of transferability (i.e., 56.7% lower than the average) onto transformers. Our work emphasizes the importance of studying transformer architectures for attacking and defending models in security domains, and suggests using them as the primary architecture in transfer attack settings.
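A hedged sketch of the craft-on-source, evaluate-on-target idea using a basic FGSM attack (the paper's exact attack algorithm and model set are not reproduced here):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """Craft adversarial examples on a source model with one FGSM step."""
    x = x.detach().clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

@torch.no_grad()
def transfer_rate(target_model, x_adv, y):
    """Fraction of adversarial examples that also fool a different target model."""
    preds = target_model(x_adv).argmax(dim=1)
    return (preds != y).float().mean().item()

# Usage sketch: x_adv = fgsm(source_model, images, labels)
#               print(transfer_rate(target_model, x_adv, labels))
```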

WaveAttack: Asymmetric Frequency Obfuscation-based Backdoor Attacks Against Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2310.11595
  • repo_url: None
  • paper_authors: Jun Xia, Zhihao Yue, Yingbo Zhou, Zhiwei Ling, Xian Wei, Mingsong Chen
  • for: Proposes WaveAttack, a new backdoor attack against deep neural networks that hides its trigger in high-frequency image features.
  • methods: The Discrete Wavelet Transform (DWT) is used to obtain high-frequency image features for generating backdoor triggers, and an asymmetric frequency obfuscation method adds an adaptive residual during training and inference to strengthen the trigger's impact and the attack's effectiveness (a toy DWT-domain trigger is sketched after this entry).
  • results: Compared with state-of-the-art backdoor attacks, WaveAttack achieves higher stealthiness and effectiveness while improving poisoned-image fidelity by up to 28.27% in PSNR and 1.61% in SSIM and reducing IS by 70.59%.
    Abstract Due to the popularity of Artificial Intelligence (AI) technology, numerous backdoor attacks are designed by adversaries to mislead deep neural network predictions by manipulating training samples and training processes. Although backdoor attacks are effective in various real scenarios, they still suffer from the problems of both low fidelity of poisoned samples and non-negligible transfer in latent space, which make them easily detectable by existing backdoor detection algorithms. To overcome the weakness, this paper proposes a novel frequency-based backdoor attack method named WaveAttack, which obtains image high-frequency features through Discrete Wavelet Transform (DWT) to generate backdoor triggers. Furthermore, we introduce an asymmetric frequency obfuscation method, which can add an adaptive residual in the training and inference stage to improve the impact of triggers and further enhance the effectiveness of WaveAttack. Comprehensive experimental results show that WaveAttack not only achieves higher stealthiness and effectiveness, but also outperforms state-of-the-art (SOTA) backdoor attack methods in the fidelity of images by up to 28.27\% improvement in PSNR, 1.61\% improvement in SSIM, and 70.59\% reduction in IS.
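A toy illustration of embedding a perturbation only in the high-frequency DWT sub-band of a grayscale image (using PyWavelets; the trigger pattern and strength are arbitrary placeholders, not the attack's learned adaptive residual):

```python
import numpy as np
import pywt

def add_highfreq_trigger(image: np.ndarray, strength: float = 0.05, seed: int = 0) -> np.ndarray:
    """image: 2-D float array in [0, 1]. Perturbs only the HH (diagonal detail)
    sub-band so the spatial change is hard to notice."""
    cA, (cH, cV, cD) = pywt.dwt2(image, "haar")
    rng = np.random.default_rng(seed)
    trigger = rng.choice([-1.0, 1.0], size=cD.shape)    # fixed pseudo-random pattern
    cD = cD + strength * trigger                          # hide the trigger in the HH band
    poisoned = pywt.idwt2((cA, (cH, cV, cD)), "haar")
    return np.clip(poisoned, 0.0, 1.0)

poisoned = add_highfreq_trigger(np.random.default_rng(1).random((32, 32)))
```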

Adversarial Robustness Unhardening via Backdoor Attacks in Federated Learning

  • paper_url: http://arxiv.org/abs/2310.11594
  • repo_url: None
  • paper_authors: Taejin Kim, Jiarui Li, Shubhranshu Singh, Nikhil Madaan, Carlee Joe-Wong
  • for: This paper focuses on addressing security challenges in federated learning, specifically the issues of poisoning and backdoor attacks, by exploring the intersection of adversarial training and backdoor attacks.
  • methods: The paper introduces a new attack called Adversarial Robustness Unhardening (ARU) and evaluates its impact on adversarial training and existing robust aggregation defenses against poisoning and backdoor attacks through extensive empirical experiments.
  • results: The paper finds that ARU can intentionally undermine model robustness during decentralized training, rendering models susceptible to a broader range of evasion attacks, and highlights the limitations of existing defenses against ARU. The findings offer insights into bolstering defenses against ARU and the need for further research in this area.
    Abstract In today's data-driven landscape, the delicate equilibrium between safeguarding user privacy and unleashing data potential stands as a paramount concern. Federated learning, which enables collaborative model training without necessitating data sharing, has emerged as a privacy-centric solution. This decentralized approach brings forth security challenges, notably poisoning and backdoor attacks where malicious entities inject corrupted data. Our research, initially spurred by test-time evasion attacks, investigates the intersection of adversarial training and backdoor attacks within federated learning, introducing Adversarial Robustness Unhardening (ARU). ARU is employed by a subset of adversaries to intentionally undermine model robustness during decentralized training, rendering models susceptible to a broader range of evasion attacks. We present extensive empirical experiments evaluating ARU's impact on adversarial training and existing robust aggregation defenses against poisoning and backdoor attacks. Our findings inform strategies for enhancing ARU to counter current defensive measures and highlight the limitations of existing defenses, offering insights into bolstering defenses against ARU.

Automated Evaluation of Personalized Text Generation using Large Language Models

  • paper_url: http://arxiv.org/abs/2310.11593
  • repo_url: None
  • paper_authors: Yaqing Wang, Jiepu Jiang, Mingyang Zhang, Cheng Li, Yi Liang, Qiaozhu Mei, Michael Bendersky
  • for: Evaluating the quality of personalized text generation.
  • methods: Large language models (LLMs) are used as evaluators; the proposed AuPEL method distills and automatically measures three major semantic aspects of the generated text: personalization, quality, and relevance.
  • results: Comparing LLM judgments with human annotations in carefully controlled experiments shows that AuPEL distinguishes and ranks models by their personalization ability more accurately than existing text-similarity metrics, with commendable consistency and efficiency.
    Abstract Personalized text generation presents a specialized mechanism for delivering content that is specific to a user's personal context. While the research progress in this area has been rapid, evaluation still presents a challenge. Traditional automated metrics such as BLEU and ROUGE primarily measure lexical similarity to human-written references, and are not able to distinguish personalization from other subtle semantic aspects, thus falling short of capturing the nuances of personalized generated content quality. On the other hand, human judgments are costly to obtain, especially in the realm of personalized evaluation. Inspired by these challenges, we explore the use of large language models (LLMs) for evaluating personalized text generation, and examine their ability to understand nuanced user context. We present AuPEL, a novel evaluation method that distills three major semantic aspects of the generated text: personalization, quality and relevance, and automatically measures these aspects. To validate the effectiveness of AuPEL, we design carefully controlled experiments and compare the accuracy of the evaluation judgments made by LLMs versus that of judgements made by human annotators, and conduct rigorous analyses of the consistency and sensitivity of the proposed metric. We find that, compared to existing evaluation metrics, AuPEL not only distinguishes and ranks models based on their personalization abilities more accurately, but also presents commendable consistency and efficiency for this task. Our work suggests that using LLMs as the evaluators of personalized text generation is superior to traditional text similarity metrics, even though interesting new challenges still remain.

Audio-AdapterFusion: A Task-ID-free Approach for Efficient and Non-Destructive Multi-task Speech Recognition

  • paper_url: http://arxiv.org/abs/2310.13015
  • repo_url: None
  • paper_authors: Hillary Ngai, Rohan Agrawal, Neeraj Gaur, Ronny Huang, Parisa Haghani, Pedro Moreno Mengibar
  • for: Proposes ways to combine single-task adapters for multi-task automatic speech recognition (ASR) without requiring a task ID at inference time.
  • methods: Three novel task-ID-free methods for combining single-task adapters in multi-task ASR are introduced, together with two learning algorithms for training them (a toy attention-weighted combination is sketched after this entry).
  • results: On 10 test sets from 4 diverse ASR tasks, the methods are non-destructive and parameter-efficient: while updating only 17% of the model parameters, they achieve an 8% mean WER improvement relative to full fine-tuning and are on par with task-ID adapter routing.
    Abstract Adapters are an efficient, composable alternative to full fine-tuning of pre-trained models and help scale the deployment of large ASR models to many tasks. In practice, a task ID is commonly prepended to the input during inference to route to single-task adapters for the specified task. However, one major limitation of this approach is that the task ID may not be known during inference, rendering it unsuitable for most multi-task settings. To address this, we propose three novel task-ID-free methods to combine single-task adapters in multi-task ASR and investigate two learning algorithms for training. We evaluate our methods on 10 test sets from 4 diverse ASR tasks and show that our methods are non-destructive and parameter-efficient. While only updating 17% of the model parameters, our methods can achieve an 8% mean WER improvement relative to full fine-tuning and are on-par with task-ID adapter routing.
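One way to picture task-ID-free adapter combination (a hedged sketch, not necessarily one of the paper's three methods): attend over the outputs of the frozen single-task adapters using the layer's hidden state as the query, so no task label is needed at inference.

```python
import torch
import torch.nn as nn

class AdapterFusion(nn.Module):
    """Combine the outputs of several frozen single-task adapters with softmax
    attention conditioned on the hidden state."""
    def __init__(self, adapters: nn.ModuleList, hidden_dim: int):
        super().__init__()
        self.adapters = adapters                           # frozen, trained per task
        self.query = nn.Linear(hidden_dim, hidden_dim)
        self.key = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, h):                                  # h: (batch, time, hidden_dim)
        outs = torch.stack([a(h) for a in self.adapters], dim=2)          # (B, T, A, D)
        q = self.query(h).unsqueeze(2)                                     # (B, T, 1, D)
        k = self.key(outs)                                                  # (B, T, A, D)
        attn = torch.softmax((q * k).sum(-1) / h.size(-1) ** 0.5, dim=-1)  # (B, T, A)
        return h + (attn.unsqueeze(-1) * outs).sum(dim=2)                   # residual fusion
```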

Eliciting Human Preferences with Language Models

  • paper_url: http://arxiv.org/abs/2310.11589
  • repo_url: https://github.com/alextamkin/generative-elicitation
  • paper_authors: Belinda Z. Li, Alex Tamkin, Noah Goodman, Jacob Andreas
  • for: Uses language models (LMs) themselves to guide the task specification process.
  • methods: Proposes the Generative Active Task Elicitation (GATE) learning framework, in which the LM elicits and infers intended behavior through free-form, language-based interaction with users, for example by generating open-ended questions or synthesizing informative edge cases.
  • results: In email validation, content recommendation, and moral reasoning, GATE-prompted LMs elicit responses that are often more informative than user-written prompts or labels; users report that interactive elicitation requires less effort and surfaces considerations they had not initially anticipated.
    Abstract Language models (LMs) can be directed to perform target tasks by using labeled examples or natural language prompts. But selecting examples or writing prompts for can be challenging--especially in tasks that involve unusual edge cases, demand precise articulation of nebulous preferences, or require an accurate mental model of LM behavior. We propose to use *LMs themselves* to guide the task specification process. In this paper, we introduce **Generative Active Task Elicitation (GATE)**: a learning framework in which models elicit and infer intended behavior through free-form, language-based interaction with users. We study GATE in three domains: email validation, content recommendation, and moral reasoning. In preregistered experiments, we show that LMs prompted to perform GATE (e.g., by generating open-ended questions or synthesizing informative edge cases) elicit responses that are often more informative than user-written prompts or labels. Users report that interactive task elicitation requires less effort than prompting or example labeling and surfaces novel considerations not initially anticipated by users. Our findings suggest that LM-driven elicitation can be a powerful tool for aligning models to complex human preferences and values.

When Rigidity Hurts: Soft Consistency Regularization for Probabilistic Hierarchical Time Series Forecasting

  • paper_url: http://arxiv.org/abs/2310.11569
  • repo_url: https://github.com/adityalab/profhit
  • paper_authors: Harshavardhan Kamarthi, Lingkai Kong, Alexander Rodríguez, Chao Zhang, B. Aditya Prakash
  • for: Proposes a reliable, well-calibrated method for modeling and forecasting multivariate time series while accounting for the hierarchical relations among them.
  • methods: PROFHiT takes a fully probabilistic Bayesian approach and introduces a novel Distributional Coherency regularization that learns from the hierarchical relations over the entire forecast distribution, yielding robust, calibrated forecasts that also adapt to datasets with varying hierarchical consistency (a toy coherency penalty is sketched after this entry).
  • results: Across a wide range of datasets, the method delivers 41-88% better accuracy and significantly better calibration; because coherency is modeled over the full distribution, forecasts remain reliable even when up to 10% of the input time-series data is missing, a setting in which other methods degrade by over 70%.
    Abstract Probabilistic hierarchical time-series forecasting is an important variant of time-series forecasting, where the goal is to model and forecast multivariate time-series that have underlying hierarchical relations. Most methods focus on point predictions and do not provide well-calibrated probabilistic forecasts distributions. Recent state-of-art probabilistic forecasting methods also impose hierarchical relations on point predictions and samples of distribution which does not account for coherency of forecast distributions. Previous works also silently assume that datasets are always consistent with given hierarchical relations and do not adapt to real-world datasets that show deviation from this assumption. We close both these gap and propose PROFHiT, which is a fully probabilistic hierarchical forecasting model that jointly models forecast distribution of entire hierarchy. PROFHiT uses a flexible probabilistic Bayesian approach and introduces a novel Distributional Coherency regularization to learn from hierarchical relations for entire forecast distribution that enables robust and calibrated forecasts as well as adapt to datasets of varying hierarchical consistency. On evaluating PROFHiT over wide range of datasets, we observed 41-88% better performance in accuracy and significantly better calibration. Due to modeling the coherency over full distribution, we observed that PROFHiT can robustly provide reliable forecasts even if up to 10% of input time-series data is missing where other methods' performance severely degrade by over 70%.
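For intuition, here is a toy distributional coherency penalty for Gaussian forecasts: the parent's predictive distribution is pushed towards the distribution implied by summing its (assumed independent) children. This is a simplification under stated assumptions, not the paper's exact regularizer:

```python
import torch

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for diagonal Gaussians, elementwise."""
    return 0.5 * (torch.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def coherency_penalty(parent_mu, parent_var, child_mu, child_var):
    """parent_*: (T,) forecast for the aggregate series.
    child_*: (num_children, T) forecasts for its children.
    Assuming independent children, their sum is Gaussian with summed means/variances."""
    agg_mu = child_mu.sum(dim=0)
    agg_var = child_var.sum(dim=0)
    return gaussian_kl(parent_mu, parent_var, agg_mu, agg_var).mean()
```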

Integrating 3D City Data through Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2310.11555
  • repo_url: None
  • paper_authors: Linfang Ding, Guohui Xiao, Albulen Pano, Mattia Fumagalli, Dongsheng Chen, Yu Feng, Diego Calvanese, Hongchao Fan, Liqiu Meng
  • for: Uses Knowledge Graph (KG) technology to model CityGML data with a suitable ontology so that 3D city data can be queried and reasoned over.
  • methods: Declarative mappings relate the CityGML ontology to the relational tables of the 3DCityDB system (which is otherwise queried with standard SQL), exposing the CityGML data stored there as a KG; OpenStreetMap data is used as an additional source, and integration with existing (geo)KGs such as Wikidata, DBPedia, and GeoNames is supported.
  • results: The CityGML KG framework is demonstrated on CityGML data from the city of Munich integrated with OpenStreetMap data of the same area, making 3D city data easier to query, reason over, and combine with other sources for urban planning and management applications.
    Abstract CityGML is a widely adopted standard by the Open Geospatial Consortium (OGC) for representing and exchanging 3D city models. The representation of semantic and topological properties in CityGML makes it possible to query such 3D city data to perform analysis in various applications, e.g., security management and emergency response, energy consumption and estimation, and occupancy measurement. However, the potential of querying CityGML data has not been fully exploited. The official GML/XML encoding of CityGML is only intended as an exchange format but is not suitable for query answering. The most common way of dealing with CityGML data is to store them in the 3DCityDB system as relational tables and then query them with the standard SQL query language. Nevertheless, for end users, it remains a challenging task to formulate queries over 3DCityDB directly for their ad-hoc analytical tasks, because there is a gap between the conceptual semantics of CityGML and the relational schema adopted in 3DCityDB. In fact, the semantics of CityGML itself can be modeled as a suitable ontology. The technology of Knowledge Graphs (KGs), where an ontology is at the core, is a good solution to bridge such a gap. Moreover, embracing KGs makes it easier to integrate with other spatial data sources, e.g., OpenStreetMap and existing (Geo)KGs (e.g., Wikidata, DBPedia, and GeoNames), and to perform queries combining information from multiple data sources. In this work, we describe a CityGML KG framework to populate the concepts in the CityGML ontology using declarative mappings to 3DCityDB, thus exposing the CityGML data therein as a KG. To demonstrate the feasibility of our approach, we use CityGML data from the city of Munich as test data and integrate OpenStreeMap data in the same area.

Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback

  • paper_url: http://arxiv.org/abs/2310.11550
  • repo_url: None
  • paper_authors: Haolin Liu, Chen-Yu Wei, Julian Zimmert
  • for: Studies online reinforcement learning in linear Markov decision processes with adversarial losses and bandit feedback, without prior knowledge of the transitions or access to simulators.
  • methods: Two algorithms with improved regret are introduced. The first, although computationally inefficient, guarantees $\widetilde{\mathcal{O}}\left(\sqrt{K}\right)$ regret, the first result with the optimal dependence on the number of episodes $K$ in this setting. The second, based on the policy optimization framework, guarantees $\widetilde{\mathcal{O}}\left(K^{\frac{3}{4}}\right)$ regret and is computationally efficient.
  • results: Both results significantly improve over the state of the art: a computationally inefficient algorithm by Kong et al. [2023] with $\widetilde{\mathcal{O}}\left(K^{\frac{4}{5}}+\mathrm{poly}\left(\frac{1}{\lambda_{\min}}\right)\right)$ regret, where $\lambda_{\min}$ is a problem-dependent constant that can be arbitrarily close to zero, and a computationally efficient algorithm by Sherman et al. [2023b] with $\widetilde{\mathcal{O}}\left(K^{\frac{6}{7}}\right)$ regret.
    Abstract We study online reinforcement learning in linear Markov decision processes with adversarial losses and bandit feedback, without prior knowledge on transitions or access to simulators. We introduce two algorithms that achieve improved regret performance compared to existing approaches. The first algorithm, although computationally inefficient, ensures a regret of $\widetilde{\mathcal{O}}\left(\sqrt{K}\right)$, where $K$ is the number of episodes. This is the first result with the optimal $K$ dependence in the considered setting. The second algorithm, which is based on the policy optimization framework, guarantees a regret of $\widetilde{\mathcal{O}}\left(K^{\frac{3}{4}}\right)$ and is computationally efficient. Both our results significantly improve over the state-of-the-art: a computationally inefficient algorithm by Kong et al. [2023] with $\widetilde{\mathcal{O}}\left(K^{\frac{4}{5}}+\mathrm{poly}\left(\frac{1}{\lambda_{\min}}\right)\right)$ regret, for some problem-dependent constant $\lambda_{\min}$ that can be arbitrarily close to zero, and a computationally efficient algorithm by Sherman et al. [2023b] with $\widetilde{\mathcal{O}}\left(K^{\frac{6}{7}}\right)$ regret.

MUST&P-SRL: Multi-lingual and Unified Syllabification in Text and Phonetic Domains for Speech Representation Learning

  • paper_url: http://arxiv.org/abs/2310.11541
  • repo_url: https://github.com/noetits/must_p-srl
  • paper_authors: Noé Tits
  • for: Presents a methodology for linguistic feature extraction, in particular the automatic syllabification of words in multiple languages, designed to be compatible with the Montreal Forced Aligner (MFA).
  • methods: In both the textual and phonetic domains, the method extracts phonetic transcriptions from text, stress marks, and a unified automatic syllabification, and is built entirely from open-source components and resources.
  • results: An ablation study demonstrates the efficacy of the approach for automatically syllabifying words in several languages (English, French, and Spanish); applying it to the transcriptions of the CMU ARCTIC dataset yields annotations, available at https://github.com/noetits/MUST_P-SRL, that are valuable for speech representation learning, speech unit discovery, and disentanglement of speech factors in several speech-related fields.
    Abstract In this paper, we present a methodology for linguistic feature extraction, focusing particularly on automatically syllabifying words in multiple languages, with a design to be compatible with a forced-alignment tool, the Montreal Forced Aligner (MFA). In both the textual and phonetic domains, our method focuses on the extraction of phonetic transcriptions from text, stress marks, and a unified automatic syllabification (in text and phonetic domains). The system was built with open-source components and resources. Through an ablation study, we demonstrate the efficacy of our approach in automatically syllabifying words from several languages (English, French and Spanish). Additionally, we apply the technique to the transcriptions of the CMU ARCTIC dataset, generating valuable annotations available online\footnote{\url{https://github.com/noetits/MUST_P-SRL} that are ideal for speech representation learning, speech unit discovery, and disentanglement of speech factors in several speech-related fields.

Efficient Online Learning with Offline Datasets for Infinite Horizon MDPs: A Bayesian Approach

  • paper_url: http://arxiv.org/abs/2310.11531
  • repo_url: None
  • paper_authors: Dengwang Tang, Rahul Jain, Botao Hao, Zheng Wen
  • for: Studies efficient online reinforcement learning in the infinite-horizon setting when an offline dataset is available to start from.
  • methods: The offline dataset is assumed to be generated by an expert of unknown competence; the learning agent models the expert's behavioral policy (parameterized by a competence parameter) within a Bayesian online learning framework.
  • results: Modeling the expert's behavioral policy yields substantially lower cumulative regret than ignoring it; an $\tilde{O}(\sqrt{T})$ regret upper bound is established for the exact informed PSRL algorithm via a novel prior-dependent regret analysis, and an approximate Informed RLSVI algorithm is proposed that can be interpreted as imitation learning on the offline data followed by online learning.
    Abstract In this paper, we study the problem of efficient online reinforcement learning in the infinite horizon setting when there is an offline dataset to start with. We assume that the offline dataset is generated by an expert but with unknown level of competence, i.e., it is not perfect and not necessarily using the optimal policy. We show that if the learning agent models the behavioral policy (parameterized by a competence parameter) used by the expert, it can do substantially better in terms of minimizing cumulative regret, than if it doesn't do that. We establish an upper bound on regret of the exact informed PSRL algorithm that scales as $\tilde{O}(\sqrt{T})$. This requires a novel prior-dependent regret analysis of Bayesian online learning algorithms for the infinite horizon setting. We then propose an approximate Informed RLSVI algorithm that we can interpret as performing imitation learning with the offline dataset, and then performing online learning.

Group Preference Optimization: Few-Shot Alignment of Large Language Models

  • paper_url: http://arxiv.org/abs/2310.11523
  • repo_url: None
  • paper_authors: Siyan Zhao, John Dang, Aditya Grover
  • for: How can large language models (LLMs) be aligned to the preferences of different groups?
  • methods: The Group Preference Optimization (GPO) alignment framework augments the base LLM with an independent transformer module trained to predict a group's preferences over the LLM's generations; for few-shot learning, this module is parameterized as an in-context autoregressive transformer and trained via meta-learning across several groups.
  • results: Evaluations with LLMs of varied sizes on three human opinion adaptation tasks show that GPO aligns models more accurately while requiring fewer group-specific preferences and less training and inference compute, outperforming existing strategies such as in-context steering and fine-tuning.
    Abstract Many applications of large language models (LLMs), ranging from chatbots to creative writing, require nuanced subjective judgments that can differ significantly across different groups. Existing alignment algorithms can be expensive to align for each group, requiring prohibitive amounts of group-specific preference data and computation for real-world use cases. We introduce Group Preference Optimization (GPO), an alignment framework that steers language models to preferences of individual groups in a few-shot manner. In GPO, we augment the base LLM with an independent transformer module trained to predict the preferences of a group for the LLM generations. For few-shot learning, we parameterize this module as an in-context autoregressive transformer and train it via meta-learning on several groups. We empirically validate the efficacy of GPO through rigorous evaluations using LLMs with varied sizes on three human opinion adaptation tasks. These tasks involve adapting to the preferences of US demographic groups, global countries, and individual users. Our results demonstrate that GPO not only aligns models more accurately but also requires fewer group-specific preferences, and less training and inference computing resources, outperforming existing strategies such as in-context steering and fine-tuning methods.

Guarantees for Self-Play in Multiplayer Games via Polymatrix Decomposability

  • paper_url: http://arxiv.org/abs/2310.11518
  • repo_url: https://github.com/revanmacqueen/self-play-polymatrix
  • paper_authors: Revan MacQueen, James R. Wright
  • for: Studies how self-play can be used to train machine learning algorithms in multi-agent systems, and when the resulting strategies come with post-training performance guarantees.
  • methods: Self-play is used to generate large quantities of data for learning, with no-external-regret algorithms doing the learning.
  • results: In multiplayer games that approximately decompose into two-player constant-sum subgames (constant-sum polymatrix games) with subgame stability, any no-external-regret algorithm that learns by self-play produces strategies with bounded vulnerability; this identifies, for the first time, a structural property of multiplayer games that enables such guarantees, demonstrated through experiments on Leduc poker.
    Abstract Self-play is a technique for machine learning in multi-agent systems where a learning algorithm learns by interacting with copies of itself. Self-play is useful for generating large quantities of data for learning, but has the drawback that the agents the learner will face post-training may have dramatically different behavior than the learner came to expect by interacting with itself. For the special case of two-player constant-sum games, self-play that reaches Nash equilibrium is guaranteed to produce strategies that perform well against any post-training opponent; however, no such guarantee exists for multiplayer games. We show that in games that approximately decompose into a set of two-player constant-sum games (called constant-sum polymatrix games) where global $\epsilon$-Nash equilibria are boundedly far from Nash equilibria in each subgame (called subgame stability), any no-external-regret algorithm that learns by self-play will produce a strategy with bounded vulnerability. For the first time, our results identify a structural property of multiplayer games that enable performance guarantees for the strategies produced by a broad class of self-play algorithms. We demonstrate our findings through experiments on Leduc poker.

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

  • paper_url: http://arxiv.org/abs/2310.11511
  • repo_url: https://github.com/AkariAsai/self-rag
  • paper_authors: Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi
  • for: Improving the factuality and versatility of language model generations.
  • methods: Introduces adaptive, on-demand retrieval together with self-reflection via special reflection tokens, which make the language model controllable during inference (an illustrative inference loop follows this entry).
  • results: Outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks, with notably higher factuality and citation accuracy on open-domain QA, reasoning, and fact verification.
    Abstract Despite their remarkable capabilities, large language models (LLMs) often produce responses containing factual inaccuracies due to their sole reliance on the parametric knowledge they encapsulate. Retrieval-Augmented Generation (RAG), an ad hoc approach that augments LMs with retrieval of relevant knowledge, decreases such issues. However, indiscriminately retrieving and incorporating a fixed number of retrieved passages, regardless of whether retrieval is necessary, or passages are relevant, diminishes LM versatility or can lead to unhelpful response generation. We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG) that enhances an LM's quality and factuality through retrieval and self-reflection. Our framework trains a single arbitrary LM that adaptively retrieves passages on-demand, and generates and reflects on retrieved passages and its own generations using special tokens, called reflection tokens. Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements. Experiments show that Self-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, Self-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on Open-domain QA, reasoning and fact verification tasks, and it shows significant gains in improving factuality and citation accuracy for long-form generations relative to these models.
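An illustrative and heavily simplified inference loop in the spirit of reflection-token-controlled generation; `generate`, `retrieve`, and the control labels below are placeholders, not the released implementation:

```python
def self_rag_answer(question, generate, retrieve, max_segments=5):
    """generate(prompt) -> (text, control), where control is one of the
    reflection-style decisions {"retrieve", "supported", "unsupported", "done"};
    retrieve(query) -> list of passages. Both callables are stand-ins."""
    context, answer = [], []
    for _ in range(max_segments):
        prompt = f"Question: {question}\nEvidence: {context}\nAnswer so far: {''.join(answer)}"
        segment, control = generate(prompt)
        if control == "retrieve":
            context.extend(retrieve(question + " " + segment))
            continue                      # regenerate this segment with evidence
        if control == "unsupported":
            continue                      # discard the segment and try again
        answer.append(segment)
        if control == "done":
            break
    return "".join(answer)
```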

CoMPosT: Characterizing and Evaluating Caricature in LLM Simulations

  • paper_url: http://arxiv.org/abs/2310.11501
  • repo_url: https://github.com/myracheng/lm_caricature
  • paper_authors: Myra Cheng, Tiziano Piccardi, Diyi Yang
  • for: Targets the growing use of LLMs to simulate responses from particular demographics in settings such as social science experiments and public opinion surveys, which aim to capture nuances of human behavior.
  • methods: Proposes CoMPosT, a framework that characterizes LLM simulations along four dimensions (Context, Model, Persona, and Topic) and measures their susceptibility to caricature via two criteria: individuation and exaggeration.
  • results: Applying the framework to scenarios from existing work on LLM simulations shows that, for GPT-4, simulations of certain demographics (political and marginalized groups) and topics (general, uncontroversial ones) are highly susceptible to caricature.
    Abstract Recent work has aimed to capture nuances of human behavior by using LLMs to simulate responses from particular demographics in settings like social science experiments and public opinion surveys. However, there are currently no established ways to discuss or evaluate the quality of such LLM simulations. Moreover, there is growing concern that these LLM simulations are flattened caricatures of the personas that they aim to simulate, failing to capture the multidimensionality of people and perpetuating stereotypes. To bridge these gaps, we present CoMPosT, a framework to characterize LLM simulations using four dimensions: Context, Model, Persona, and Topic. We use this framework to measure open-ended LLM simulations' susceptibility to caricature, defined via two criteria: individuation and exaggeration. We evaluate the level of caricature in scenarios from existing work on LLM simulations. We find that for GPT-4, simulations of certain demographics (political and marginalized groups) and topics (general, uncontroversial) are highly susceptible to caricature.

Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from a Parametric Perspective

  • paper_url: http://arxiv.org/abs/2310.11451
  • repo_url: https://github.com/maszhongming/ParaKnowTransfer
  • paper_authors: Ming Zhong, Chenxin An, Weizhu Chen, Jiawei Han, Pengcheng He
  • for: Empirically investigates knowledge transfer from large language models (LLMs) to smaller language models from a parametric perspective.
  • methods: Uses sensitivity-based techniques to extract and align knowledge-specific parameters between different LLMs, with a LoRA module as the intermediary mechanism for injecting the extracted knowledge into smaller models.
  • results: Evaluations across four benchmarks validate the efficacy of the proposed method, highlighting the critical factors contributing to parametric knowledge transfer and underscoring the transferability of model parameters across LLMs of different scales.
    Abstract Large Language Models (LLMs) inherently encode a wealth of knowledge within their parameters through pre-training on extensive corpora. While prior research has delved into operations on these parameters to manipulate the underlying implicit knowledge (encompassing detection, editing, and merging), there remains an ambiguous understanding regarding their transferability across models with varying scales. In this paper, we seek to empirically investigate knowledge transfer from larger to smaller models through a parametric perspective. To achieve this, we employ sensitivity-based techniques to extract and align knowledge-specific parameters between different LLMs. Moreover, the LoRA module is used as the intermediary mechanism for injecting the extracted knowledge into smaller models. Evaluations across four benchmarks validate the efficacy of our proposed method. Our findings highlight the critical factors contributing to the process of parametric knowledge transfer, underscoring the transferability of model parameters across LLMs of different scales. We release code and data at \url{https://github.com/maszhongming/ParaKnowTransfer}.
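The two ingredients named in the abstract, a sensitivity criterion for locating knowledge-specific parameters and a LoRA module as the injection vehicle, can be sketched as follows in PyTorch; the |parameter x gradient| score and the toy models are assumptions standing in for the paper's exact procedure.

```python
import torch
import torch.nn as nn

def sensitivity_scores(model: nn.Module, loss: torch.Tensor) -> dict:
    """First-order sensitivity |theta * dL/dtheta| per parameter tensor (assumed criterion)."""
    grads = torch.autograd.grad(loss, [p for p in model.parameters()])
    return {name: (p.detach() * g).abs()
            for (name, p), g in zip(model.named_parameters(), grads)}

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update, the injection vehicle."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

if __name__ == "__main__":
    teacher = nn.Linear(16, 4)
    x, y = torch.randn(8, 16), torch.randn(8, 4)
    loss = nn.functional.mse_loss(teacher(x), y)
    scores = sensitivity_scores(teacher, loss)             # locate knowledge-specific parameters
    student_layer = LoRALinear(nn.Linear(16, 4), rank=4)   # smaller model receives a LoRA adapter
    print({k: v.shape for k, v in scores.items()}, student_layer(x).shape)
```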

Explaining Deep Neural Networks for Bearing Fault Detection with Vibration Concepts

  • paper_url: http://arxiv.org/abs/2310.11450
  • repo_url: None
  • paper_authors: Thomas Decker, Michael Lebacher, Volker Tresp
  • for: Applies concept-based explanation methods to interpret the predictions of deep neural networks trained on vibration signals for bearing fault detection.
  • methods: Leverages established concept-based explanation techniques, including Concept Activation Vectors, to quantify how abstract or high-level characteristics of the input data influence the network's predictions.
  • results: Explaining opaque models in terms of vibration concepts yields human-comprehensible and intuitive insights, but the underlying assumptions need to be carefully validated first.
    Abstract Concept-based explanation methods, such as Concept Activation Vectors, are potent means to quantify how abstract or high-level characteristics of input data influence the predictions of complex deep neural networks. However, applying them to industrial prediction problems is challenging as it is not immediately clear how to define and access appropriate concepts for individual use cases and specific data types. In this work, we investigate how to leverage established concept-based explanation techniques in the context of bearing fault detection with deep neural networks trained on vibration signals. Since bearings are prevalent in almost every rotating equipment, ensuring the reliability of intransparent fault detection models is crucial to prevent costly repairs and downtimes of industrial machinery. Our evaluations demonstrate that explaining opaque models in terms of vibration concepts enables human-comprehensible and intuitive insights about their inner workings, but the underlying assumptions need to be carefully validated first.
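Concept Activation Vectors are usually obtained by fitting a linear probe that separates concept activations from random activations and taking its normal vector; conceptual sensitivity is then the directional derivative of the model output along that vector. The sketch below reproduces this standard recipe on synthetic activations; in the bearing-fault setting, the concept sets would be built from vibration concepts rather than random data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: hidden-layer activations for concept vs. random inputs.
acts_concept = rng.normal(loc=1.0, size=(100, 64))
acts_random = rng.normal(loc=0.0, size=(100, 64))

X = np.vstack([acts_concept, acts_random])
y = np.array([1] * 100 + [0] * 100)

probe = LogisticRegression(max_iter=1000).fit(X, y)
cav = probe.coef_[0] / np.linalg.norm(probe.coef_[0])   # the Concept Activation Vector

# Conceptual sensitivity: directional derivative of the output along the CAV,
# approximated by finite differences through a stub "head" above the probed layer.
def head(a: np.ndarray) -> np.ndarray:
    return a @ rng.normal(size=(64,))   # stub for the remaining layers of the network

eps = 1e-3
test_acts = rng.normal(size=(20, 64))
sensitivity = (head(test_acts + eps * cav) - head(test_acts)) / eps
print("fraction of inputs with positive concept sensitivity:", float((sensitivity > 0).mean()))
```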

Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament

  • paper_url: http://arxiv.org/abs/2310.13014
  • repo_url: None
  • paper_authors: Philipp Schoenegger, Peter S. Park
  • for: Tests the forecasting ability of the large language model GPT-4.
  • methods: Enrolls GPT-4 in a three-month forecasting tournament on the Metaculus platform to test its probabilistic predictions about future events.
  • results: GPT-4's forecasts were significantly less accurate than the median human-crowd forecasts and did not differ significantly from the no-information strategy of assigning a 50% probability to every question.
    Abstract Accurately predicting the future would be an important milestone in the capabilities of artificial intelligence. However, research on the ability of large language models to provide probabilistic predictions about future events remains nascent. To empirically test this ability, we enrolled OpenAI's state-of-the-art large language model, GPT-4, in a three-month forecasting tournament hosted on the Metaculus platform. The tournament, running from July to October 2023, attracted 843 participants and covered diverse topics including Big Tech, U.S. politics, viral outbreaks, and the Ukraine conflict. Focusing on binary forecasts, we show that GPT-4's probabilistic forecasts are significantly less accurate than the median human-crowd forecasts. We find that GPT-4's forecasts did not significantly differ from the no-information forecasting strategy of assigning a 50% probability to every question. We explore a potential explanation, that GPT-4 might be predisposed to predict probabilities close to the midpoint of the scale, but our data do not support this hypothesis. Overall, we find that GPT-4 significantly underperforms in real-world predictive tasks compared to median human-crowd forecasts. A potential explanation for this underperformance is that in real-world forecasting tournaments, the true answers are genuinely unknown at the time of prediction; unlike in other benchmark tasks like professional exams or time series forecasting, where strong performance may at least partly be due to the answers being memorized from the training data. This makes real-world forecasting tournaments an ideal environment for testing the generalized reasoning and prediction capabilities of artificial intelligence going forward.

Functional Invariants to Watermark Large Transformers

  • paper_url: http://arxiv.org/abs/2310.11446
  • repo_url: None
  • paper_authors: Fernandez Pierre, Couairon Guillaume, Furon Teddy, Douze Matthijs
  • for: Protecting the integrity and ownership of large models.
  • methods: Exploits the models' invariances to generate functionally equivalent copies through operations such as dimension permutation and scaling/unscaling, requiring no weight optimization and working in a non-blind white-box setting.
  • results: Experiments show the method leaves model outputs unchanged while remaining stealthy and robust to model transformations, making it a practical way to protect the integrity and ownership of large models.
    Abstract The rapid growth of transformer-based models increases the concerns about their integrity and ownership insurance. Watermarking addresses this issue by embedding a unique identifier into the model, while preserving its performance. However, most existing approaches require to optimize the weights to imprint the watermark signal, which is not suitable at scale due to the computational cost. This paper explores watermarks with virtually no computational cost, applicable to a non-blind white-box setting (assuming access to both the original and watermarked networks). They generate functionally equivalent copies by leveraging the models' invariance, via operations like dimension permutations or scaling/unscaling. This enables to watermark models without any change in their outputs and remains stealthy. Experiments demonstrate the effectiveness of the approach and its robustness against various model transformations (fine-tuning, quantization, pruning), making it a practical solution to protect the integrity of large models.
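The key observation is that some weight-space transformations leave the network's function unchanged and can therefore carry an identifier. The sketch below shows the simplest such invariance on a two-layer ReLU MLP: a secret permutation of hidden units yields a functionally equivalent copy, and in the non-blind white-box setting the permutation can be recovered by matching rows. The scaling/unscaling invariance mentioned above would be handled analogously.

```python
import numpy as np

rng = np.random.default_rng(42)

def mlp(x, W1, b1, W2, b2):
    return np.maximum(x @ W1.T + b1, 0.0) @ W2.T + b2   # two-layer ReLU MLP

d_in, d_hidden, d_out = 8, 32, 4
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out)

# The watermark key is a secret permutation of the hidden units.
perm = rng.permutation(d_hidden)
W1_w, b1_w = W1[perm], b1[perm]      # permute hidden rows of layer 1
W2_w = W2[:, perm]                   # permute the matching input columns of layer 2

x = rng.normal(size=(5, d_in))
original = mlp(x, W1, b1, W2, b2)
watermarked = mlp(x, W1_w, b1_w, W2_w, b2)
print("outputs identical:", np.allclose(original, watermarked))   # True: same function

# Detection (non-blind, white-box): recover the permutation by matching rows of W1.
recovered = np.array([int(np.argmin(np.linalg.norm(W1 - row, axis=1))) for row in W1_w])
print("watermark recovered:", np.array_equal(recovered, perm))
```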

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

  • paper_url: http://arxiv.org/abs/2310.11441
  • repo_url: None
  • paper_authors: Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao
  • for: Improving the visual grounding ability of large multimodal models (LMMs).
  • methods: Uses off-the-shelf interactive segmentation models (such as SAM) to partition an image into regions at different levels of granularity and overlays these regions with marks (e.g., alphanumerics, masks, boxes).
  • results: GPT-4V with SoM outperforms the state-of-the-art fully fine-tuned referring segmentation model on RefCOCOg in a zero-shot setting.
    Abstract We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM outperforms the state-of-the-art fully-finetuned referring segmentation model on RefCOCOg in a zero-shot setting.
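Once region masks are available, the prompting recipe is mechanical: draw a distinct mark on each region and send the marked image to the multimodal model. Below is a minimal sketch with PIL using synthetic rectangular masks in place of SAM's output; placing the mark at the mask centroid is an assumption.

```python
import numpy as np
from PIL import Image, ImageDraw

def overlay_marks(image: Image.Image, masks: list) -> Image.Image:
    """Draw an index label at the centroid of every boolean region mask."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for idx, mask in enumerate(masks, start=1):
        ys, xs = np.nonzero(mask)
        cx, cy = int(xs.mean()), int(ys.mean())
        draw.ellipse((cx - 10, cy - 10, cx + 10, cy + 10), fill="white", outline="black")
        draw.text((cx - 4, cy - 7), str(idx), fill="black")
    return marked

if __name__ == "__main__":
    img = Image.new("RGB", (128, 128), "gray")
    # Synthetic masks; in the paper these come from an interactive segmenter such as SAM.
    m1 = np.zeros((128, 128), dtype=bool); m1[10:60, 10:60] = True
    m2 = np.zeros((128, 128), dtype=bool); m2[70:120, 70:120] = True
    overlay_marks(img, [m1, m2]).save("som_marked.png")
    # The marked image plus a question such as "what is in region 2?" is then sent to GPT-4V.
```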

Understanding deep neural networks through the lens of their non-linearity

  • paper_url: http://arxiv.org/abs/2310.11439
  • repo_url: None
  • paper_authors: Quentin Bouniot, Ievgen Redko, Anton Mallasto, Charlotte Laclau, Karol Arndt, Oliver Struckmeier, Markus Heinonen, Ville Kyrki, Samuel Kaski
  • for: Provides a theoretically sound way to track the propagation of non-linearity through deep neural networks, with a focus on computer vision applications.
  • methods: Quantifies the non-linearity of activation functions and proposes an affinity score to assess non-linearity propagation across different architectures and learning paradigms.
  • results: Extensive experiments show the proposed affinity score gives insight into the inner workings of a wide range of architectures and has practical utility with potential for long-reaching applications.
    Abstract The remarkable success of deep neural networks (DNN) is often attributed to their high expressive power and their ability to approximate functions of arbitrary complexity. Indeed, DNNs are highly non-linear models, and activation functions introduced into them are largely responsible for this. While many works studied the expressive power of DNNs through the lens of their approximation capabilities, quantifying the non-linearity of DNNs or of individual activation functions remains an open problem. In this paper, we propose the first theoretically sound solution to track non-linearity propagation in deep neural networks with a specific focus on computer vision applications. Our proposed affinity score allows us to gain insights into the inner workings of a wide range of different architectures and learning paradigms. We provide extensive experimental results that highlight the practical utility of the proposed affinity score and its potential for long-reaching applications.

Evaluating LLMs for Privilege-Escalation Scenarios

  • paper_url: http://arxiv.org/abs/2310.11409
  • repo_url: https://github.com/ipa-lab/hackingBuddyGPT
  • paper_authors: Andreas Happe, Aaron Kaplan, Jürgen Cito
  • for: Explores the use of large language models (LLMs) in penetration testing, and their capabilities and challenges in the context of privilege escalation.
  • methods: Builds an automated Linux privilege-escalation benchmark and develops an LLM-guided privilege-escalation tool to evaluate different LLMs and prompting strategies against it.
  • results: LLMs show considerable capability for privilege escalation, but challenging areas remain, such as maintaining focus during testing and coping with errors.
    Abstract Penetration testing, an essential component of cybersecurity, allows organizations to proactively identify and remediate vulnerabilities in their systems, thus bolstering their defense mechanisms against potential cyberattacks. One recent advancement in the realm of penetration testing is the utilization of Language Models (LLMs). We explore the intersection of LLMs and penetration testing to gain insight into their capabilities and challenges in the context of privilege escalation. We create an automated Linux privilege-escalation benchmark utilizing local virtual machines. We introduce an LLM-guided privilege-escalation tool designed for evaluating different LLMs and prompt strategies against our benchmark. We analyze the impact of different prompt designs, the benefits of in-context learning, and the advantages of offering high-level guidance to LLMs. We discuss challenging areas for LLMs, including maintaining focus during testing, coping with errors, and finally comparing them with both stochastic parrots as well as with human hackers.

Neural Attention: Enhancing QKV Calculation in Self-Attention Mechanism with Neural Networks

  • paper_url: http://arxiv.org/abs/2310.11398
  • repo_url: https://github.com/ocislyjrti/neuralattention
  • paper_authors: Muhan Zhang
  • for: Explores a new take on the self-attention mechanism in which query, key, and value (QKV) are computed by specially designed neural network structures to improve performance.
  • methods: Uses a modified Marian model and experiments on the IWSLT 2017 German-English translation task to demonstrate the improvement over the conventional approach; the method is also applied when training a RoBERTa model on the Wikitext-103 dataset.
  • results: The method improves BLEU scores and notably reduces the perplexity of the trained RoBERTa model, validating its effectiveness and pointing to the potential of neural-network-based QKV computation for optimizing self-attention in future research and practical applications.
    Abstract In the realm of deep learning, the self-attention mechanism has substantiated its pivotal role across a myriad of tasks, encompassing natural language processing and computer vision. Despite achieving success across diverse applications, the traditional self-attention mechanism primarily leverages linear transformations for the computation of query, key, and value (QKV), which may not invariably be the optimal choice under specific circumstances. This paper probes into a novel methodology for QKV computation-implementing a specially-designed neural network structure for the calculation. Utilizing a modified Marian model, we conducted experiments on the IWSLT 2017 German-English translation task dataset and juxtaposed our method with the conventional approach. The experimental results unveil a significant enhancement in BLEU scores with our method. Furthermore, our approach also manifested superiority when training the Roberta model with the Wikitext-103 dataset, reflecting a notable reduction in model perplexity compared to its original counterpart. These experimental outcomes not only validate the efficacy of our method but also reveal the immense potential in optimizing the self-attention mechanism through neural network-based QKV computation, paving the way for future research and practical applications. The source code and implementation details for our proposed method can be accessed at https://github.com/ocislyjrti/NeuralAttention.
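The proposal replaces the single linear projections for Q, K and V with small neural networks. One way to express that in PyTorch is sketched below, using a two-layer MLP per projection inside standard scaled dot-product attention; the MLP width and activation are assumptions.

```python
import math
import torch
import torch.nn as nn

class NeuralQKVAttention(nn.Module):
    """Single-head self-attention where Q, K, V come from small MLPs instead of linear maps."""
    def __init__(self, d_model: int, d_hidden: int = 256):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
        self.q_net, self.k_net, self.v_net = mlp(), mlp(), mlp()
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (batch, seq, d_model)
        q, k, v = self.q_net(x), self.k_net(x), self.v_net(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        return self.out(torch.softmax(scores, dim=-1) @ v)

if __name__ == "__main__":
    attn = NeuralQKVAttention(d_model=64)
    print(attn(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
```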

Towards Automatic Satellite Images Captions Generation Using Large Language Models

  • paper_url: http://arxiv.org/abs/2310.11392
  • repo_url: None
  • paper_authors: Yingxu He, Qiqi Sun
  • for: automatic remote sensing image captioning
  • methods: using large language models (LLMs) to guide the description of object annotations
  • results: effective collection of captions for remote sensing images
    Abstract Automatic image captioning is a promising technique for conveying visual information using natural language. It can benefit various tasks in satellite remote sensing, such as environmental monitoring, resource management, disaster management, etc. However, one of the main challenges in this domain is the lack of large-scale image-caption datasets, as they require a lot of human expertise and effort to create. Recent research on large language models (LLMs) has demonstrated their impressive performance in natural language understanding and generation tasks. Nonetheless, most of them cannot handle images (GPT-3.5, Falcon, Claude, etc.), while conventional captioning models pre-trained on general ground-view images often fail to produce detailed and accurate captions for aerial images (BLIP, GIT, CM3, CM3Leon, etc.). To address this problem, we propose a novel approach: Automatic Remote Sensing Image Captioning (ARSIC) to automatically collect captions for remote sensing images by guiding LLMs to describe their object annotations. We also present a benchmark model that adapts the pre-trained generative image2text model (GIT) to generate high-quality captions for remote-sensing images. Our evaluation demonstrates the effectiveness of our approach for collecting captions for remote sensing images.

End-to-End real time tracking of children’s reading with pointer network

  • paper_url: http://arxiv.org/abs/2310.11486
  • repo_url: None
  • paper_authors: Vishal Sunder, Beulah Karrolla, Eric Fosler-Lussier
  • for: How to efficiently build a real-time reading tracker for children's voices.
  • methods: Uses a fully end-to-end model instead of the previously proposed ASR-based cascaded approaches: a pointer network directly learns to predict positions in the ground-truth text, trained on targets produced by forced alignment between the read speech and the text.
  • results: A neural attention-based forced-alignment model is at least as accurate as the Montreal Forced Aligner and, surprisingly, a better training signal for the pointer network. On one adult speech dataset (TIMIT) and two children's speech datasets (CMU Kids and Reading Races), the best model tracks adult speech with 87.8% accuracy and the much harder, disfluent children's speech with 77.1% (CMU Kids) and 65.3% (Reading Races) accuracy.
    Abstract In this work, we explore how a real time reading tracker can be built efficiently for children's voices. While previously proposed reading trackers focused on ASR-based cascaded approaches, we propose a fully end-to-end model making it less prone to lags in voice tracking. We employ a pointer network that directly learns to predict positions in the ground truth text conditioned on the streaming speech. To train this pointer network, we generate ground truth training signals by using forced alignment between the read speech and the text being read on the training set. Exploring different forced alignment models, we find a neural attention based model is at least as close in alignment accuracy to the Montreal Forced Aligner, but surprisingly is a better training signal for the pointer network. Our results are reported on one adult speech data (TIMIT) and two children's speech datasets (CMU Kids and Reading Races). Our best model can accurately track adult speech with 87.8% accuracy and the much harder and disfluent children's speech with 77.1% accuracy on CMU Kids data and a 65.3% accuracy on the Reading Races dataset.
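The pointer network reduces reading tracking to attention over text positions: encode the text once, encode the streaming speech, and output a per-frame distribution over positions whose argmax is the currently read word, trained against forced-alignment targets. The sketch below captures that interface with randomly initialized encodings standing in for the real text and speech encoders.

```python
import torch
import torch.nn as nn

class ReadingPointer(nn.Module):
    """Predicts, for each speech frame, a distribution over positions in the read text."""
    def __init__(self, d_text: int = 128, d_speech: int = 128, d_attn: int = 128):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d_attn)
        self.speech_proj = nn.Linear(d_speech, d_attn)

    def forward(self, text_enc: torch.Tensor, speech_enc: torch.Tensor) -> torch.Tensor:
        # text_enc: (batch, n_words, d_text); speech_enc: (batch, n_frames, d_speech)
        scores = self.speech_proj(speech_enc) @ self.text_proj(text_enc).transpose(-2, -1)
        return scores.log_softmax(dim=-1)        # (batch, n_frames, n_words)

if __name__ == "__main__":
    pointer = ReadingPointer()
    log_probs = pointer(torch.randn(1, 50, 128), torch.randn(1, 200, 128))
    positions = log_probs.argmax(dim=-1)         # tracked word index per frame
    # Training target: per-frame word indices from forced alignment, optimized with NLL loss.
    target = torch.randint(0, 50, (1, 200))
    loss = nn.functional.nll_loss(log_probs.transpose(1, 2), target)
    print(positions.shape, float(loss))
```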

The effect of stemming and lemmatization on Portuguese fake news text classification

  • paper_url: http://arxiv.org/abs/2310.11344
  • repo_url: None
  • paper_authors: Lucca de Freitas Santos, Murilo Varges da Silva
  • for: Examines automatic fake news detection, particularly its linguistic side, with the goal of improving detection accuracy.
  • methods: Applies pre-processing techniques such as lemmatization and stemming and designs several classifier models to test the effect of pre-processing on fake news classification.
  • results: The pre-processing step has a notable impact on classification; stemming and lemmatization are promising techniques that deserve further study, especially for the Portuguese language, to reach better results.
    Abstract With the popularization of the internet, smartphones and social media, information spreads quickly and easily, which implies a much larger flow of information in the world; however, society is being harmed by the accompanying dissemination of fake news, as some people exploit this flow to spread deceptive information. Automatic fake news detection is a challenging task because obtaining good results requires dealing with linguistic problems, especially for languages that have not yet been comprehensively studied, although some techniques can help when working with text data. The motivation for detecting this deceptive information lies in the fact that people need to know which information is true and trustworthy and which is not. In this work, we present the effect that pre-processing methods such as lemmatization and stemming have on fake news classification; to that end, we designed several classifier models applying different pre-processing techniques. The results show that the pre-processing step is important for obtaining better results, and that stemming and lemmatization are interesting methods that need to be studied further to develop techniques focused on the Portuguese language so we can reach better results.
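As a concrete illustration of the pre-processing step being compared, the sketch below applies Portuguese stemming (NLTK's Snowball stemmer) before a TF-IDF classifier; lemmatization would typically use a Portuguese spaCy pipeline instead. The two-sentence corpus is invented for the example.

```python
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

stemmer = SnowballStemmer("portuguese")

def stem_text(text: str) -> str:
    """Lowercase, whitespace-tokenize and stem every token."""
    return " ".join(stemmer.stem(tok) for tok in text.lower().split())

# Toy corpus (invented); label 1 = fake, 0 = true.
texts = ["Cientistas confirmaram a descoberta em estudo revisado.",
         "Vacinas contêm chips de rastreamento, afirma postagem viral."]
labels = [0, 1]

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit([stem_text(t) for t in texts], labels)
print(pipeline.predict([stem_text("postagem viral afirma descoberta de chips")]))
```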

Dual Cognitive Architecture: Incorporating Biases and Multi-Memory Systems for Lifelong Learning

  • paper_url: http://arxiv.org/abs/2310.11341
  • repo_url: https://github.com/neurai-lab/dn4il-dataset
  • paper_authors: Shruthi Gowda, Bahram Zonooz, Elahe Arani
  • for: Examines the limitations of artificial neural networks (ANNs) trained on stationary, independent data and proposes a new framework, inspired by human cognitive structures and multi-memory systems, to give AI lifelong learning ability.
  • methods: The Dual Cognitive Architecture (DUCA) incorporates multiple sub-systems, an implicit/explicit knowledge representation dichotomy, inductive biases, and a multi-memory system; an inductive-bias learner encodes shape information, countering the tendency of ANNs to learn local textures.
  • results: DUCA improves performance across different settings and datasets with reduced task recency bias and without extra information, and it also performs strongly on DN4IL, a challenging domain-incremental learning dataset, demonstrating versatile lifelong learning under distribution shift.
    Abstract Artificial neural networks (ANNs) exhibit a narrow scope of expertise on stationary independent data. However, the data in the real world is continuous and dynamic, and ANNs must adapt to novel scenarios while also retaining the learned knowledge to become lifelong learners. The ability of humans to excel at these tasks can be attributed to multiple factors ranging from cognitive computational structures, cognitive biases, and the multi-memory systems in the brain. We incorporate key concepts from each of these to design a novel framework, Dual Cognitive Architecture (DUCA), which includes multiple sub-systems, implicit and explicit knowledge representation dichotomy, inductive bias, and a multi-memory system. The inductive bias learner within DUCA is instrumental in encoding shape information, effectively countering the tendency of ANNs to learn local textures. Simultaneously, the inclusion of a semantic memory submodule facilitates the gradual consolidation of knowledge, replicating the dynamics observed in fast and slow learning systems, reminiscent of the principles underpinning the complementary learning system in human cognition. DUCA shows improvement across different settings and datasets, and it also exhibits reduced task recency bias, without the need for extra information. To further test the versatility of lifelong learning methods on a challenging distribution shift, we introduce a novel domain-incremental dataset DN4IL. In addition to improving performance on existing benchmarks, DUCA also demonstrates superior performance on this complex dataset.

Agent-Specific Effects

  • paper_url: http://arxiv.org/abs/2310.11334
  • repo_url: https://github.com/stelios30/agent-specific-effects
  • paper_authors: Stelios Triantafyllou, Aleksa Sukovic, Debmalya Mandal, Goran Radanovic
  • for: Provides a systematic approach for attributing the effect of agents' decisions on outcomes in multi-agent decision-making.
  • methods: Working in multi-agent Markov decision processes, introduces agent-specific effects (ASE), a causal quantity measuring the effect of an agent's action on the outcome as it propagates through the other agents.
  • results: The counterfactual counterpart cf-ASE can be identified and estimated with a sampling-based algorithm to attribute agents' influence on outcomes accurately, and its utility is validated experimentally in a simulation-based testbed.
    Abstract Establishing causal relationships between actions and outcomes is fundamental for accountable multi-agent decision-making. However, interpreting and quantifying agents' contributions to such relationships pose significant challenges. These challenges are particularly prominent in the context of multi-agent sequential decision-making, where the causal effect of an agent's action on the outcome depends on how the other agents respond to that action. In this paper, our objective is to present a systematic approach for attributing the causal effects of agents' actions to the influence they exert on other agents. Focusing on multi-agent Markov decision processes, we introduce agent-specific effects (ASE), a novel causal quantity that measures the effect of an agent's action on the outcome that propagates through other agents. We then turn to the counterfactual counterpart of ASE (cf-ASE), provide a sufficient set of conditions for identifying cf-ASE, and propose a practical sampling-based algorithm for estimating it. Finally, we experimentally evaluate the utility of cf-ASE through a simulation-based testbed, which includes a sepsis management environment.

Key Point-based Orientation Estimation of Strawberries for Robotic Fruit Picking

  • paper_url: http://arxiv.org/abs/2310.11333
  • repo_url: None
  • paper_authors: Justin Le Louëdec, Grzegorz Cielniak
  • for: This paper aims to address labor shortages in modern agriculture by developing a robotic harvesting system that can pick fruit accurately and efficiently.
  • methods: The proposed method uses key-point-based fruit orientation estimation, which predicts 3D orientation directly from 2D images; it does not require full 3D orientation annotations but can exploit such information for improved accuracy.
  • results: The method achieves state-of-the-art performance with an average error of $8^\circ$, improving predictions by $\sim30\%$ compared to previous work, and its fast inference times of $\sim30$ ms make it suitable for real-time robotic applications.
    Abstract Selective robotic harvesting is a promising technological solution to address labour shortages which are affecting modern agriculture in many parts of the world. For an accurate and efficient picking process, a robotic harvester requires the precise location and orientation of the fruit to effectively plan the trajectory of the end effector. The current methods for estimating fruit orientation employ either complete 3D information which typically requires registration from multiple views or rely on fully-supervised learning techniques, which require difficult-to-obtain manual annotation of the reference orientation. In this paper, we introduce a novel key-point-based fruit orientation estimation method allowing for the prediction of 3D orientation from 2D images directly. The proposed technique can work without full 3D orientation annotations but can also exploit such information for improved accuracy. We evaluate our work on two separate datasets of strawberry images obtained from real-world data collection scenarios. Our proposed method achieves state-of-the-art performance with an average error as low as $8^{\circ}$, improving predictions by $\sim30\%$ compared to previous work presented in~\cite{wagner2021efficient}. Furthermore, our method is suited for real-time robotic applications with fast inference times of $\sim30$ms.

Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

  • paper_url: http://arxiv.org/abs/2310.11324
  • repo_url: None
  • paper_authors: Melanie Sclar, Yejin Choi, Yulia Tsvetkov, Alane Suhr
  • for: Accurately characterizing the performance of modern pre-trained language models (LLMs) and establishing how much prompt design matters.
  • methods: Studies the sensitivity of several widely used open-source LLMs to prompt formatting and proposes FormatSpread, an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a task and reports an interval of expected performance without accessing model weights.
  • results: Subtle changes in prompt formatting cause performance differences of up to 76 accuracy points in few-shot settings, and format performance only weakly correlates between models, raising methodological questions about comparing LLMs with a single, arbitrarily chosen prompt format.
    Abstract As large language models (LLMs) are adopted as a fundamental component of language technologies, it is crucial to accurately characterize their performance. Because choices in prompt design can strongly influence model behavior, this design process is critical in effectively using any modern pre-trained generative language model. In this work, we focus on LLM sensitivity to a quintessential class of meaning-preserving design choices: prompt formatting. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B. Sensitivity remains even when increasing model size, the number of few-shot examples, or performing instruction tuning. Our analysis suggests that work evaluating LLMs with prompting-based methods would benefit from reporting a range of performance across plausible prompt formats, instead of the currently-standard practice of reporting performance on a single format. We also show that format performance only weakly correlates between models, which puts into question the methodological validity of comparing models with an arbitrarily chosen, fixed prompt format. To facilitate systematic analysis we propose FormatSpread, an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights. Furthermore, we present a suite of analyses that characterize the nature of this sensitivity, including exploring the influence of particular atomic perturbations and the internal representation of particular formats.
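The practical recommendation is to report performance over a sample of semantically equivalent prompt formats rather than a single one. The sketch below mimics that idea with a tiny format grammar (separator, field casing, spacing) and a stub scorer; FormatSpread itself additionally searches formats efficiently and reports the expected performance interval.

```python
import itertools
import random

SEPARATORS = [": ", " - ", ":\n"]
FIELD_CASING = [str.lower, str.title, str.upper]
SPACING = ["", " "]

def render(fields, sep, casing, space):
    """Render the same semantic prompt under one choice of formatting."""
    return "\n".join(f"{casing(k)}{sep}{space}{v}" for k, v in fields)

def evaluate_format(prompt: str) -> float:
    """Stub accuracy; in practice this runs the LLM over the task's evaluation set."""
    random.seed(hash(prompt) % 10_000)
    return 0.5 + random.uniform(-0.2, 0.2)

fields = [("question", "What is the capital of France?"), ("answer", "")]
formats = list(itertools.product(SEPARATORS, FIELD_CASING, SPACING))
scores = [evaluate_format(render(fields, *fmt)) for fmt in random.sample(formats, 10)]
print(f"accuracy spread across formats: min={min(scores):.2f}, max={max(scores):.2f}")
```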

Utilising a Large Language Model to Annotate Subject Metadata: A Case Study in an Australian National Research Data Catalogue

  • paper_url: http://arxiv.org/abs/2310.11318
  • repo_url: None
  • paper_authors: Shiwei Zhang, Mingfang Wu, Xiuzhen Zhang
  • for: Proposes a cost-effective way to annotate subject metadata using a large language model (LLM), to improve dataset discovery and reuse.
  • methods: Uses GPT-3.5 with prompts designed for annotating subject metadata, i.e., LLM-based in-context learning.
  • results: The approach shows promising performance for automatic metadata annotation, but because in-context learning cannot acquire discipline-specific rules, performance is lower in several categories.
    Abstract In support of open and reproducible research, there has been a rapidly increasing number of datasets made available for research. As the availability of datasets increases, it becomes more important to have quality metadata for discovering and reusing them. Yet, it is a common issue that datasets often lack quality metadata due to limited resources for data curation. Meanwhile, technologies such as artificial intelligence and large language models (LLMs) are progressing rapidly. Recently, systems based on these technologies, such as ChatGPT, have demonstrated promising capabilities for certain data curation tasks. This paper proposes to leverage LLMs for cost-effective annotation of subject metadata through the LLM-based in-context learning. Our method employs GPT-3.5 with prompts designed for annotating subject metadata, demonstrating promising performance in automatic metadata annotation. However, models based on in-context learning cannot acquire discipline-specific rules, resulting in lower performance in several categories. This limitation arises from the limited contextual information available for subject inference. To the best of our knowledge, we are introducing, for the first time, an in-context learning method that harnesses large language models for automated subject metadata annotation.

Generative error correction for code-switching speech recognition using large language models

  • paper_url: http://arxiv.org/abs/2310.13013
  • repo_url: None
  • paper_authors: Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Hexin Liu, Sabato Marco Siniscalchi, Eng Siong Chng
  • for: Improving the accuracy of code-switching automatic speech recognition (CS-ASR).
  • methods: Leverages large language models (LLMs) together with N-best hypotheses produced by ASR systems, learning a hypotheses-to-transcription (H2T) mapping to predict the accurate transcription directly.
  • results: Experiments show the approach significantly improves CS-ASR accuracy, reducing the mixed error rate (MER); LLMs also show remarkable data efficiency for H2T learning, offering a potential remedy for the data scarcity of CS-ASR in low-resource languages.
    Abstract Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence. Despite the recent advances in automatic speech recognition (ASR), CS-ASR is still a challenging task ought to the grammatical structure complexity of the phenomenon and the data scarcity of specific training corpus. In this work, we propose to leverage large language models (LLMs) and lists of hypotheses generated by an ASR to address the CS problem. Specifically, we first employ multiple well-trained ASR models for N-best hypotheses generation, with the aim of increasing the diverse and informative elements in the set of hypotheses. Next, we utilize the LLMs to learn the hypotheses-to-transcription (H2T) mapping by adding a trainable low-rank adapter. Such a generative error correction (GER) method directly predicts the accurate transcription according to its expert linguistic knowledge and N-best hypotheses, resulting in a paradigm shift from the traditional language model rescoring or error correction techniques. Experimental evidence demonstrates that GER significantly enhances CS-ASR accuracy, in terms of reduced mixed error rate (MER). Furthermore, LLMs show remarkable data efficiency for H2T learning, providing a potential solution to the data scarcity problem of CS-ASR in low-resource languages.
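At inference time the LLM is conditioned on the N-best ASR hypotheses and asked for the corrected transcription; training adds a LoRA adapter on hypothesis/transcript pairs. The prompt template and the stub model call below are assumptions sketching that hypotheses-to-transcription interface.

```python
from typing import List

def build_h2t_prompt(nbest: List[str]) -> str:
    """Format the N-best list produced by one or more ASR systems into an H2T prompt."""
    lines = "\n".join(f"{i + 1}. {hyp}" for i, hyp in enumerate(nbest))
    return ("The following are candidate transcriptions of one code-switching utterance.\n"
            f"{lines}\n"
            "Write the single most likely correct transcription:")

def llm_correct(prompt: str) -> str:
    """Stub for the LoRA-adapted LLM; the real model is fine-tuned on hypothesis/transcript pairs."""
    return prompt.splitlines()[1].split(". ", 1)[1]   # trivially returns hypothesis 1

nbest = ["wo men meeting 改到 Friday", "我们 meeting 改到 Friday", "wo men 明天 Friday"]
print(llm_correct(build_h2t_prompt(nbest)))
```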

MonoSKD: General Distillation Framework for Monocular 3D Object Detection via Spearman Correlation Coefficient

  • paper_url: http://arxiv.org/abs/2310.11316
  • repo_url: https://github.com/senwang98/monoskd
  • paper_authors: Sen Wang, Jin Zheng
  • for: Proposes a knowledge distillation framework based on the Spearman correlation coefficient to tackle the difficulties of monocular 3D object detection.
  • methods: Uses the Spearman correlation coefficient to learn the relative correlation between cross-modal features, and selects appropriate distillation locations and removes redundant modules to cut GPU resource consumption and training time.
  • results: Extensive experiments on the challenging KITTI 3D object detection benchmark show the method achieves state-of-the-art performance with no additional inference computational cost.
    Abstract Monocular 3D object detection is an inherently ill-posed problem, as it is challenging to predict accurate 3D localization from a single image. Existing monocular 3D detection knowledge distillation methods usually project the LiDAR onto the image plane and train the teacher network accordingly. Transferring LiDAR-based model knowledge to RGB-based models is more complex, so a general distillation strategy is needed. To alleviate cross-modal prob-lem, we propose MonoSKD, a novel Knowledge Distillation framework for Monocular 3D detection based on Spearman correlation coefficient, to learn the relative correlation between cross-modal features. Considering the large gap between these features, strict alignment of features may mislead the training, so we propose a looser Spearman loss. Furthermore, by selecting appropriate distillation locations and removing redundant modules, our scheme saves more GPU resources and trains faster than existing methods. Extensive experiments are performed to verify the effectiveness of our framework on the challenging KITTI 3D object detection benchmark. Our method achieves state-of-the-art performance until submission with no additional inference computational cost. Our codes are available at https://github.com/Senwang98/MonoSKD
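The distillation signal is the Spearman correlation between teacher and student features, i.e. correlation of ranks rather than raw values, which tolerates the large gap between LiDAR-based and camera-based features. The sketch below computes that quantity for a pair of feature maps; the hard ranking used here is not differentiable, so a trainable loss would substitute a soft-ranking relaxation.

```python
import torch

def spearman_corr(teacher_feat: torch.Tensor, student_feat: torch.Tensor) -> torch.Tensor:
    """Spearman correlation of two flattened feature tensors via Pearson correlation of ranks."""
    t, s = teacher_feat.flatten().float(), student_feat.flatten().float()
    t_rank = t.argsort().argsort().float()   # rank of each element (hard, non-differentiable)
    s_rank = s.argsort().argsort().float()
    t_rank = t_rank - t_rank.mean()
    s_rank = s_rank - s_rank.mean()
    return (t_rank * s_rank).sum() / (t_rank.norm() * s_rank.norm() + 1e-8)

teacher = torch.randn(1, 64, 8, 8)
student = teacher * 3.0 + 1.0            # different scale, identical ordering
print(float(spearman_corr(teacher, student)))   # ~1.0; a distillation loss would use 1 - rho
```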

MiniZero: Comparative Analysis of AlphaZero and MuZero on Go, Othello, and Atari Games

  • paper_url: http://arxiv.org/abs/2310.11305
  • repo_url: https://github.com/rlglab/minizero
  • paper_authors: Ti-Rong Wu, Hung Guei, Po-Wei Huang, Pei-Chiun Peng, Ting Han Wei, Chung-Chin Shih, Yun-Jui Tsai
  • for: Provides MiniZero, a zero-knowledge learning framework supporting four state-of-the-art algorithms (AlphaZero, MuZero, Gumbel AlphaZero, and Gumbel MuZero); since it is unclear which algorithm suits which task, each is systematically evaluated on two board games, 9x9 Go and 8x8 Othello, as well as 57 Atari games.
  • methods: Uses the MiniZero framework to train and compare the algorithms across games, and introduces progressive simulation, a training approach that progressively increases the simulation budget to allocate computation more efficiently.
  • results: For the two board games, more simulations generally yield higher performance, though the choice between AlphaZero and MuZero depends on game properties; for Atari games, both MuZero and Gumbel MuZero are worth considering. Progressive simulation allocates computation more efficiently and achieves significantly better performance on the two board games.
    Abstract This paper presents MiniZero, a zero-knowledge learning framework that supports four state-of-the-art algorithms, including AlphaZero, MuZero, Gumbel AlphaZero, and Gumbel MuZero. While these algorithms have demonstrated super-human performance in many games, it remains unclear which among them is most suitable or efficient for specific tasks. Through MiniZero, we systematically evaluate the performance of each algorithm in two board games, 9x9 Go and 8x8 Othello, as well as 57 Atari games. Our empirical findings are summarized as follows. For two board games, using more simulations generally results in higher performance. However, the choice of AlphaZero and MuZero may differ based on game properties. For Atari games, both MuZero and Gumbel MuZero are worth considering. Since each game has unique characteristics, different algorithms and simulations yield varying results. In addition, we introduce an approach, called progressive simulation, which progressively increases the simulation budget during training to allocate computation more efficiently. Our empirical results demonstrate that progressive simulation achieves significantly superior performance in two board games. By making our framework and trained models publicly available, this paper contributes a benchmark for future research on zero-knowledge learning algorithms, assisting researchers in algorithm selection and comparison against these zero-knowledge learning baselines.
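Progressive simulation simply grows the per-move simulation budget as training proceeds, so that early self-play with an untrained network does not waste search. A minimal linear schedule is sketched below; the actual growth curve used in MiniZero may differ.

```python
def simulation_budget(iteration: int, total_iterations: int,
                      min_sims: int = 16, max_sims: int = 400) -> int:
    """Linearly increase the MCTS simulation count over the course of training."""
    frac = min(iteration / max(total_iterations - 1, 1), 1.0)
    return int(min_sims + frac * (max_sims - min_sims))

if __name__ == "__main__":
    total = 300
    for it in (0, 100, 200, 299):
        print(f"iteration {it:3d}: {simulation_budget(it, total)} simulations per move")
```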

Emulating Human Cognitive Processes for Expert-Level Medical Question-Answering with Large Language Models

  • paper_url: http://arxiv.org/abs/2310.11266
  • repo_url: None
  • paper_authors: Khushboo Verma, Marina Moore, Stephanie Wottrich, Karla Robles López, Nishant Aggarwal, Zeel Bhatt, Aagamjit Singh, Bradford Unroe, Salah Basheer, Nitish Sachdeva, Prinka Arora, Harmanjeet Kaur, Tanupreet Kaur, Tevon Hood, Anahi Marquez, Tushar Varshney, Nanfu Deng, Azaan Ramani, Pawanraj Ishwara, Maimoona Saeed, Tatiana López Velarde Peña, Bryan Barksdale, Sushovan Guha, Satwant Kumar
  • for: This paper aims to provide a novel framework for clinical problem-solving tools in healthcare, based on a Large Language Model (LLM) that mimics human cognitive processes.
  • methods: The framework, called BooksMed, utilizes the GRADE (Grading of Recommendations, Assessment, Development, and Evaluations) framework to effectively quantify evidence strength, and is evaluated using a multispecialty clinical benchmark called ExpertMedQA, which consists of open-ended, expert-level clinical questions validated by a diverse group of medical professionals.
  • results: The paper shows that BooksMed outperforms existing state-of-the-art models Med-PaLM 2, Almanac, and ChatGPT in a variety of medical scenarios, demonstrating the effectiveness of the framework in providing reliable and evidence-based responses to clinical inquiries.
    Abstract In response to the pressing need for advanced clinical problem-solving tools in healthcare, we introduce BooksMed, a novel framework based on a Large Language Model (LLM). BooksMed uniquely emulates human cognitive processes to deliver evidence-based and reliable responses, utilizing the GRADE (Grading of Recommendations, Assessment, Development, and Evaluations) framework to effectively quantify evidence strength. For clinical decision-making to be appropriately assessed, an evaluation metric that is clinically aligned and validated is required. As a solution, we present ExpertMedQA, a multispecialty clinical benchmark comprised of open-ended, expert-level clinical questions, and validated by a diverse group of medical professionals. By demanding an in-depth understanding and critical appraisal of up-to-date clinical literature, ExpertMedQA rigorously evaluates LLM performance. BooksMed outperforms existing state-of-the-art models Med-PaLM 2, Almanac, and ChatGPT in a variety of medical scenarios. Therefore, a framework that mimics human cognitive stages could be a useful tool for providing reliable and evidence-based responses to clinical inquiries.
    摘要 响应医疗领域的高级临床问题解决工具的急需,我们介绍BooksMed,一种新的框架,基于大型自然语言模型(LLM)。BooksMed模仿人类认知过程,以提供基于证据的可靠回答,使用GRADE(评估、评价、发展和评估)框架来有效地评估证据强度。为了正确评估临床决策,需要一种与临床相关的评价标准,而我们提出ExpertMedQA,一个多学科临床引用库,由多个医疗专业人员组成,并被证明了。通过要求对当前临床文献进行深入理解和批判性评估,ExpertMedQA严格评估LLM性能。BooksMed在多种医疗场景中表现出色,超越了现有的state-of-the-art模型Med-PaLM 2、Almanac和ChatGPT。因此,一个模仿人类认知阶段的框架可能是一种有用的工具,用于提供可靠和基于证据的回答临床问题。

Revealing the Unwritten: Visual Investigation of Beam Search Trees to Address Language Model Prompting Challenges

  • paper_url: http://arxiv.org/abs/2310.11252
  • repo_url: None
  • paper_authors: Thilo Spinner, Rebecca Kehlbeck, Rita Sevastjanova, Tobias Stähle, Daniel A. Keim, Oliver Deussen, Andreas Spitz, Mennatallah El-Assady
  • for: Examines how prompting methods can guide the outputs of large language models during generation.
  • methods: Introduces an interactive visual method for investigating the beam search tree, supporting analysis of the decisions the model makes during generation.
  • results: Exposing the beam search tree provides valuable information; five detailed analysis scenarios address the identified prompting challenges, validating existing results and offering additional insights.
    Abstract The growing popularity of generative language models has amplified interest in interactive methods to guide model outputs. Prompt refinement is considered one of the most effective means to influence output among these methods. We identify several challenges associated with prompting large language models, categorized into data- and model-specific, linguistic, and socio-linguistic challenges. A comprehensive examination of model outputs, including runner-up candidates and their corresponding probabilities, is needed to address these issues. The beam search tree, the prevalent algorithm to sample model outputs, can inherently supply this information. Consequently, we introduce an interactive visual method for investigating the beam search tree, facilitating analysis of the decisions made by the model during generation. We quantitatively show the value of exposing the beam search tree and present five detailed analysis scenarios addressing the identified challenges. Our methodology validates existing results and offers additional insights.
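The beam search tree already contains the runner-up candidates and probabilities that the analysis scenarios rely on. The sketch below runs a toy beam search over a stub next-token distribution while recording every expanded node, which is the structure a visual tool would render; the stub distribution is an assumption.

```python
import heapq

def next_token_logprobs(prefix: tuple) -> dict:
    """Stub next-token distribution; a real tool would query the language model here."""
    return {"the": -0.4, "a": -1.2, "cat": -0.9, "sat": -1.1, "<eos>": -1.6}

def beam_search_tree(beam_width: int = 2, max_len: int = 3):
    tree = []                                      # list of (parent_prefix, token, logprob)
    beams = [(0.0, ())]                            # (cumulative logprob, prefix)
    for _ in range(max_len):
        expansions = []
        for score, prefix in beams:
            for tok, lp in next_token_logprobs(prefix).items():
                tree.append((prefix, tok, lp))     # record every candidate, kept or pruned
                expansions.append((score + lp, prefix + (tok,)))
        beams = heapq.nlargest(beam_width, expansions, key=lambda e: e[0])
    return beams, tree

if __name__ == "__main__":
    beams, tree = beam_search_tree()
    print("surviving beams:", [(round(s, 2), " ".join(p)) for s, p in beams])
    print("tree nodes recorded:", len(tree))       # includes pruned runner-up candidates
```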

Leveraging Large Language Model for Automatic Evolving of Industrial Data-Centric R&D Cycle

  • paper_url: http://arxiv.org/abs/2310.11249
  • repo_url: None
  • paper_authors: Xu Yang, Xiao Yang, Weiqing Liu, Jinhui Li, Peng Yu, Zeqi Ye, Jiang Bian
  • for: Explores the potential of large language models (LLMs) to expedite data-centric R&D and reduce its human, computational, and time costs.
  • methods: Assesses LLMs against the foundational elements of data-centric R&D, including heterogeneous task-related data, multi-faceted domain knowledge, and diverse computing-functional tools.
  • results: LLMs can understand domain-specific requirements, generate professional ideas, use domain-specific tools to conduct experiments, interpret results, and incorporate knowledge from past work to tackle new challenges; the framework is verified on the full-stack open-source quantitative research platform Qlib with promising results.
    Abstract In the wake of relentless digital transformation, data-driven solutions are emerging as powerful tools to address multifarious industrial tasks such as forecasting, anomaly detection, planning, and even complex decision-making. Although data-centric R&D has been pivotal in harnessing these solutions, it often comes with significant costs in terms of human, computational, and time resources. This paper delves into the potential of large language models (LLMs) to expedite the evolution cycle of data-centric R&D. Assessing the foundational elements of data-centric R&D, including heterogeneous task-related data, multi-facet domain knowledge, and diverse computing-functional tools, we explore how well LLMs can understand domain-specific requirements, generate professional ideas, utilize domain-specific tools to conduct experiments, interpret results, and incorporate knowledge from past endeavors to tackle new challenges. We take quantitative investment research as a typical example of industrial data-centric R&D scenario and verified our proposed framework upon our full-stack open-sourced quantitative research platform Qlib and obtained promising results which shed light on our vision of automatic evolving of industrial data-centric R&D cycle.

Query2Triple: Unified Query Encoding for Answering Diverse Complex Queries over Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2310.11246
  • repo_url: https://github.com/yaooxu/q2t
  • paper_authors: Yao Xu, Shizhu He, Cunguang Wang, Li Cai, Kang Liu, Jun Zhao
  • for: Addresses the challenge of complex query answering (CQA) over knowledge graphs (KGs) with a new approach, Query to Triple (Q2T).
  • methods: Q2T decouples training into two stages: a neural link predictor is first pre-trained on simple queries to predict tail entities, and a query encoder is then trained on complex queries to encode diverse complex queries into a unified triple form that the pretrained link predictor can solve efficiently.
  • results: Even without explicitly modeling neural set operators, Q2T achieves state-of-the-art performance on diverse complex queries across three public benchmarks.
    Abstract Complex Query Answering (CQA) is a challenge task of Knowledge Graph (KG). Due to the incompleteness of KGs, query embedding (QE) methods have been proposed to encode queries and entities into the same embedding space, and treat logical operators as neural set operators to obtain answers. However, these methods train KG embeddings and neural set operators concurrently on both simple (one-hop) and complex (multi-hop and logical) queries, which causes performance degradation on simple queries and low training efficiency. In this paper, we propose Query to Triple (Q2T), a novel approach that decouples the training for simple and complex queries. Q2T divides the training into two stages: (1) Pre-training a neural link predictor on simple queries to predict tail entities based on the head entity and relation. (2) Training a query encoder on complex queries to encode diverse complex queries into a unified triple form that can be efficiently solved by the pretrained neural link predictor. Our proposed Q2T is not only efficient to train, but also modular, thus easily adaptable to various neural link predictors that have been studied well. Extensive experiments demonstrate that, even without explicit modeling for neural set operators, Q2T still achieves state-of-the-art performance on diverse complex queries over three public benchmarks.

Rethinking Class-incremental Learning in the Era of Large Pre-trained Models via Test-Time Adaptation

  • paper_url: http://arxiv.org/abs/2310.11482
  • repo_url: https://github.com/iemprog/ttacil
  • paper_authors: Imad Eddine Marouf, Subhankar Roy, Enzo Tartaglione, Stéphane Lathuilière
  • for: Addresses class-incremental learning (CIL), where a model must keep learning to classify new tasks without forgetting previously learned information.
  • methods: Instead of repeatedly fine-tuning a large pre-trained model (PTM) on each new task, which erodes its rich representations and causes forgetting, the paper proposes performing test-time adaptation (TTA) directly on the test instances to balance stability and plasticity.
  • results: TTACIL outperforms several state-of-the-art CIL methods on multiple CIL benchmarks, remains robust under common data corruptions, avoids forgetting by design, and lets every task benefit from the rich PTM features.
    Abstract Class-incremental learning (CIL) is a challenging task that involves continually learning to categorize classes into new tasks without forgetting previously learned information. The advent of the large pre-trained models (PTMs) has fast-tracked the progress in CIL due to the highly transferable PTM representations, where tuning a small set of parameters results in state-of-the-art performance when compared with the traditional CIL methods that are trained from scratch. However, repeated fine-tuning on each task destroys the rich representations of the PTMs and further leads to forgetting previous tasks. To strike a balance between the stability and plasticity of PTMs for CIL, we propose a novel perspective of eliminating training on every new task and instead performing test-time adaptation (TTA) directly on the test instances. Concretely, we propose "Test-Time Adaptation for Class-Incremental Learning" (TTACIL) that first fine-tunes Layer Norm parameters of the PTM on each test instance for learning task-specific features, and then resets them back to the base model to preserve stability. As a consequence, TTACIL does not undergo any forgetting, while benefiting each task with the rich PTM features. Additionally, by design, our method is robust to common data corruptions. Our TTACIL outperforms several state-of-the-art CIL methods when evaluated on multiple CIL benchmarks under both clean and corrupted data.
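The core trick, adapting only the LayerNorm parameters on each test instance and then resetting them, can be sketched directly. The snippet below is a minimal PyTorch illustration; the entropy-minimization objective, learning rate, and step count are assumptions for demonstration, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ttacil_predict(model: nn.Module, x: torch.Tensor, steps: int = 1, lr: float = 1e-3):
    """Adapt only LayerNorm affine parameters on one test batch, predict, then reset.
    The entropy-minimization objective is an illustrative choice, not necessarily the paper's."""
    ln_params = [p for m in model.modules() if isinstance(m, nn.LayerNorm)
                 for p in m.parameters()]
    backup = [p.detach().clone() for p in ln_params]        # snapshot of the base PTM
    opt = torch.optim.SGD(ln_params, lr=lr)

    for _ in range(steps):                                   # brief test-time tuning
        probs = F.softmax(model(x), dim=-1)
        loss = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        prediction = model(x).argmax(dim=-1)

    with torch.no_grad():                                    # reset LayerNorm to preserve stability
        for p, b in zip(ln_params, backup):
            p.copy_(b)
    return prediction
```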

RealBehavior: A Framework for Faithfully Characterizing Foundation Models’ Human-like Behavior Mechanisms

  • paper_url: http://arxiv.org/abs/2310.11227
  • repo_url: None
  • paper_authors: Enyu Zhou, Rui Zheng, Zhiheng Xi, Songyang Gao, Xiaoran Fan, Zichu Fei, Jingting Ye, Tao Gui, Qi Zhang, Xuanjing Huang
  • for: To faithfully characterize the human-like behaviors of foundation models and the methods used to study them.
  • methods: The paper introduces RealBehavior, a framework that assesses the faithfulness of measured behaviors in terms of reproducibility, internal and external consistency, and generalizability.
  • results: The study finds that directly applying psychological tools cannot faithfully characterize all human-like behaviors. The paper also discusses the impacts of aligning models with human and social values, arguing for diversified alignment objectives to avoid creating models with restricted characteristics.
    Abstract Reports of human-like behaviors in foundation models are growing, with psychological theories providing enduring tools to investigate these behaviors. However, current research tends to directly apply these human-oriented tools without verifying the faithfulness of their outcomes. In this paper, we introduce a framework, RealBehavior, which is designed to characterize the humanoid behaviors of models faithfully. Beyond simply measuring behaviors, our framework assesses the faithfulness of results based on reproducibility, internal and external consistency, and generalizability. Our findings suggest that a simple application of psychological tools cannot faithfully characterize all human-like behaviors. Moreover, we discuss the impacts of aligning models with human and social values, arguing for the necessity of diversifying alignment objectives to prevent the creation of models with restricted characteristics.

Contracting Tsetlin Machine with Absorbing Automata

  • paper_url: http://arxiv.org/abs/2310.11481
  • repo_url: None
  • paper_authors: Bimal Bhattarai, Ole-Christoffer Granmo, Lei Jiao, Per-Arne Andersen, Svein Anders Tunheim, Rishad Shafik, Alex Yakovlev
  • for: To speed up Tsetlin Machine learning and improve its energy efficiency.
  • methods: A sparse Tsetlin Machine (TM) with absorbing Tsetlin Automata (TA) states.
  • results: Accelerated learning and lower energy consumption.
    Abstract In this paper, we introduce a sparse Tsetlin Machine (TM) with absorbing Tsetlin Automata (TA) states. In brief, the TA of each clause literal has both an absorbing Exclude- and an absorbing Include state, making the learning scheme absorbing instead of ergodic. When a TA reaches an absorbing state, it will never leave that state again. If the absorbing state is an Exclude state, both the automaton and the literal can be removed from further consideration. As a result, the literal will never participate in that clause. If the absorbing state is an Include state, on the other hand, the literal is stored as a permanent part of the clause while the TA is discarded. A novel sparse data structure supports these updates by means of three action lists: Absorbed Include, Include, and Exclude. By updating these lists, the TM gets smaller and smaller as the literals and their TAs withdraw. In this manner, the computation accelerates during learning, leading to faster learning and less energy consumption.
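A toy sketch of the absorbing-state bookkeeping may help make the action lists concrete. The Python snippet below is a heavily simplified illustration under assumed state bounds and feedback handling, not the actual Tsetlin Machine update rules; only the idea that absorbed literals permanently leave the active lists is taken from the abstract.

```python
import random

N = 100                           # states per action: 1..N = Exclude, N+1..2N = Include

class AbsorbingTA:
    """Tsetlin Automaton whose two end states (1 and 2N) are treated as absorbing."""
    def __init__(self):
        self.state = random.choice([N, N + 1])            # start at the decision boundary
    def includes(self):
        return self.state > N
    def reward(self):                                     # strengthen the current action
        self.state += 1 if self.includes() else -1
    def penalty(self):                                    # weaken the current action
        self.state += -1 if self.includes() else 1

def update_clause(ta, include, exclude, absorbed_include, feedback):
    """One simplified update: literals whose TA hits an absorbing end state leave the
    active lists for good, so the clause keeps shrinking while training."""
    for lit in list(include | exclude):
        (ta[lit].reward if feedback.get(lit) == "reward" else ta[lit].penalty)()
        if ta[lit].state <= 1:                            # absorbing Exclude state
            include.discard(lit); exclude.discard(lit)
            del ta[lit]                                   # literal never participates again
        elif ta[lit].state >= 2 * N:                      # absorbing Include state
            include.discard(lit); exclude.discard(lit)
            absorbed_include.add(lit)                     # kept permanently, TA discarded
            del ta[lit]
```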

Understanding Fairness Surrogate Functions in Algorithmic Fairness

  • paper_url: http://arxiv.org/abs/2310.11211
  • repo_url: None
  • paper_authors: Wei Yao, Zhanke Zhou, Zhicong Li, Bo Han, Yong Liu
  • for: To mitigate the biased predictions that machine learning algorithms make against certain population groups while achieving comparable accuracy.
  • methods: Surrogate functions of the concerned fairness definition are used to solve a constrained optimization problem; however, the fairness surrogates in previous work can still yield unfair results. Taking demographic parity as an example, the paper shows, both theoretically and empirically, that there is a surrogate-fairness gap which directly determines whether a surrogate function is an appropriate substitute for the fairness definition.
  • results: The authors propose a general sigmoid surrogate with a rigorous and reliable fairness guarantee, and a novel algorithm, Balanced Surrogate, that iteratively reduces the surrogate-fairness gap to improve fairness. Empirical evidence on three real-world datasets shows better fairness performance.
    Abstract It has been observed that machine learning algorithms exhibit biased predictions against certain population groups. To mitigate such bias while achieving comparable accuracy, a promising approach is to introduce surrogate functions of the concerned fairness definition and solve a constrained optimization problem. However, an intriguing issue in previous work is that such fairness surrogate functions may yield unfair results. In this work, in order to deeply understand this issue, taking a widely used fairness definition, demographic parity as an example, we both theoretically and empirically show that there is a surrogate-fairness gap between the fairness definition and the fairness surrogate function. The "gap" directly determines whether a surrogate function is an appropriate substitute for a fairness definition. Also, the theoretical analysis and experimental results about the "gap" motivate us that the unbounded surrogate functions will be affected by the points far from the decision boundary, which is the large margin points issue investigated in this paper. To address it, we propose the general sigmoid surrogate with a rigorous and reliable fairness guarantee. Interestingly, the theory also provides insights into two important issues that deal with the large margin points as well as obtaining a more balanced dataset are beneficial to fairness. Furthermore, we elaborate a novel and general algorithm called Balanced Surrogate, which iteratively reduces the "gap" to improve fairness. Finally, we provide empirical evidence showing that our methods achieve better fairness performance in three real-world datasets.
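As a rough illustration of what a bounded sigmoid surrogate for demographic parity can look like, here is a minimal PyTorch sketch; the scale factor, the squared penalty, and the way it is added to the task loss are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sigmoid_dp_surrogate(scores, group, k=5.0):
    """Sigmoid relaxation of demographic parity: replace the indicator 1[f(x) > 0]
    with sigmoid(k * f(x)) and penalize the gap in positive rates between two groups.
    Being bounded, the sigmoid is less affected by points far from the decision boundary."""
    soft_pos = torch.sigmoid(k * scores)
    rate_a = soft_pos[group == 0].mean()
    rate_b = soft_pos[group == 1].mean()
    return (rate_a - rate_b) ** 2

def training_loss(scores, labels, group, lam=1.0):
    # task loss plus the fairness surrogate as a soft penalty (illustrative combination)
    task = F.binary_cross_entropy_with_logits(scores, labels.float())
    return task + lam * sigmoid_dp_surrogate(scores, group)
```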

EEG motor imagery decoding: A framework for comparative analysis with channel attention mechanisms

  • paper_url: http://arxiv.org/abs/2310.11198
  • repo_url: None
  • paper_authors: Martin Wimpff, Leonardo Gizzi, Jan Zerfowski, Bin Yang
  • for: To investigate how different channel attention mechanisms can improve electroencephalography (EEG) motor imagery decoding in brain-computer interfaces (BCIs).
  • methods: Various channel attention mechanisms are integrated into a single lightweight framework so that their impact can be compared under the same conditions; a simple, lightweight baseline architecture is constructed into which the different attention mechanisms plug in easily.
  • results: Experiments on three datasets show that channel attention mechanisms improve EEG motor imagery decoding performance while keeping a small memory footprint and low computational complexity; the framework proves simple, generalizable across datasets, and suitable for broad comparative experiments.
    Abstract The objective of this study is to investigate the application of various channel attention mechanisms within the domain of brain-computer interface (BCI) for motor imagery decoding. Channel attention mechanisms can be seen as a powerful evolution of spatial filters traditionally used for motor imagery decoding. This study systematically compares such mechanisms by integrating them into a lightweight architecture framework to evaluate their impact. We carefully construct a straightforward and lightweight baseline architecture designed to seamlessly integrate different channel attention mechanisms. This approach is contrary to previous works which only investigate one attention mechanism and usually build a very complex, sometimes nested architecture. Our framework allows us to evaluate and compare the impact of different attention mechanisms under the same circumstances. The easy integration of different channel attention mechanisms as well as the low computational complexity enables us to conduct a wide range of experiments on three datasets to thoroughly assess the effectiveness of the baseline model and the attention mechanisms. Our experiments demonstrate the strength and generalizability of our architecture framework as well as how channel attention mechanisms can improve the performance while maintaining the small memory footprint and low computational complexity of our baseline architecture. Our architecture emphasizes simplicity, offering easy integration of channel attention mechanisms, while maintaining a high degree of generalizability across datasets, making it a versatile and efficient solution for EEG motor imagery decoding within brain-computer interfaces.
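One representative channel attention mechanism that such a framework could plug in is a squeeze-and-excitation block applied over EEG channels. The PyTorch sketch below is illustrative only; the channel count, reduction ratio, and placement in the pipeline are assumptions, not the paper's specific configuration.

```python
import torch
import torch.nn as nn

class EEGChannelAttention(nn.Module):
    """Squeeze-and-excitation style attention over EEG channels (illustrative block)."""
    def __init__(self, n_channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(n_channels, n_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(n_channels // reduction, n_channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (batch, channels, time)
        weights = self.fc(x.mean(dim=-1))    # squeeze the time axis, excite per channel
        return x * weights.unsqueeze(-1)     # reweight channels before the backbone

# Example: 22-channel motor-imagery EEG, 1 s windows at 250 Hz
att = EEGChannelAttention(n_channels=22)
out = att(torch.randn(8, 22, 250))
```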

Medical Text Simplification: Optimizing for Readability with Unlikelihood Training and Reranked Beam Search Decoding

  • paper_url: http://arxiv.org/abs/2310.11191
  • repo_url: https://github.com/ljyflores/simplification-project
  • paper_authors: Lorenzo Jaime Yu Flores, Heyuan Huang, Kejian Shi, Sophie Chheang, Arman Cohan
  • for: bridging the communication gap in the medical field, where technical jargon and complex constructs are commonly used.
  • methods: using a new unlikelihood loss and a reranked beam search decoding method to improve the readability of text simplification in the medical domain.
  • results: better performance on readability metrics on three datasets, offering promising avenues for improving text simplification in the medical field.
    Abstract Text simplification has emerged as an increasingly useful application of AI for bridging the communication gap in specialized fields such as medicine, where the lexicon is often dominated by technical jargon and complex constructs. Despite notable progress, methods in medical simplification sometimes result in the generated text having lower quality and diversity. In this work, we explore ways to further improve the readability of text simplification in the medical domain. We propose (1) a new unlikelihood loss that encourages generation of simpler terms and (2) a reranked beam search decoding method that optimizes for simplicity, which achieve better performance on readability metrics on three datasets. This study's findings offer promising avenues for improving text simplification in the medical field.
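A token-level unlikelihood term that discourages "complex" vocabulary can be written compactly. The sketch below assumes the set of complex token ids comes from some jargon list and that the term is simply added to cross-entropy; both are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def simplification_loss(logits, targets, complex_token_ids, alpha=0.5, pad_id=0):
    """Cross-entropy plus an unlikelihood term that pushes probability away from a set
    of 'complex' tokens (e.g., a jargon vocabulary; the actual candidate set may differ)."""
    log_probs = F.log_softmax(logits, dim=-1)                        # (batch, seq, vocab)
    nll = F.nll_loss(log_probs.transpose(1, 2), targets, ignore_index=pad_id)

    probs_complex = log_probs[..., complex_token_ids].exp()          # p(complex token)
    unlikelihood = -torch.log1p(-probs_complex.clamp(max=1 - 1e-6))  # -log(1 - p)
    mask = (targets != pad_id).unsqueeze(-1)
    unlikelihood = (unlikelihood * mask).sum() / mask.sum().clamp(min=1)

    return nll + alpha * unlikelihood
```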

FocDepthFormer: Transformer with LSTM for Depth Estimation from Focus

  • paper_url: http://arxiv.org/abs/2310.11178
  • repo_url: None
  • paper_authors: Xueyang Kang, Fengze Han, Abdur Fayjie, Dong Gong
  • for: To infer depth from focal stacks, overcoming the locality limitations of existing CNN-based methods and handling stacks of arbitrary length.
  • methods: A Transformer-based network with self-attention, an LSTM module, and a CNN decoder. Self-attention learns more informative features via implicit non-local cross reference, while the LSTM module integrates representations across a stack of arbitrarily many images.
  • results: The model outperforms state-of-the-art models on multiple focal stack benchmark datasets and, thanks to the LSTM design, can be pre-trained on abundant monocular RGB depth estimation data.
    Abstract Depth estimation from focal stacks is a fundamental computer vision problem that aims to infer depth from focus/defocus cues in the image stacks. Most existing methods tackle this problem by applying convolutional neural networks (CNNs) with 2D or 3D convolutions over a set of fixed stack images to learn features across images and stacks. Their performance is restricted due to the local properties of the CNNs, and they are constrained to process a fixed number of stacks consistent in train and inference, limiting the generalization to the arbitrary length of stacks. To handle the above limitations, we develop a novel Transformer-based network, FocDepthFormer, composed mainly of a Transformer with an LSTM module and a CNN decoder. The self-attention in Transformer enables learning more informative features via an implicit non-local cross reference. The LSTM module is learned to integrate the representations across the stack with arbitrary images. To directly capture the low-level features of various degrees of focus/defocus, we propose to use multi-scale convolutional kernels in an early-stage encoder. Benefiting from the design with LSTM, our FocDepthFormer can be pre-trained with abundant monocular RGB depth estimation data for visual pattern capturing, alleviating the demand for the hard-to-collect focal stack data. Extensive experiments on various focal stack benchmark datasets show that our model outperforms the state-of-the-art models on multiple metrics.
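The overall pipeline, per-slice multi-scale convolutions, attention across the stack, and an LSTM that copes with an arbitrary number of slices, can be sketched as follows. This PyTorch skeleton is only illustrative: layer sizes, the pooled slice tokens, the focus-weighting head, and the one-layer decoder are placeholders rather than the authors' architecture.

```python
import torch
import torch.nn as nn

class FocusStackDepth(nn.Module):
    """Illustrative skeleton in the spirit of FocDepthFormer: multi-scale convolutions
    per focal slice, self-attention across the stack, and an LSTM so that an arbitrary
    number of slices can be aggregated."""
    def __init__(self, dim=64):
        super().__init__()
        self.enc3 = nn.Conv2d(3, dim // 2, kernel_size=3, padding=1)   # early multi-scale
        self.enc7 = nn.Conv2d(3, dim // 2, kernel_size=7, padding=3)   # convolutional kernels
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.attn = nn.TransformerEncoder(layer, num_layers=2)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.slice_score = nn.Linear(dim, 1)
        self.decoder = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, stack):                        # stack: (batch, n_slices, 3, H, W)
        b, n, c, h, w = stack.shape
        x = stack.reshape(b * n, c, h, w)
        feats = torch.relu(torch.cat([self.enc3(x), self.enc7(x)], dim=1))
        tokens = feats.mean(dim=(2, 3)).reshape(b, n, -1)        # one token per focal slice
        tokens, _ = self.lstm(self.attn(tokens))                 # cross-slice reasoning
        weights = torch.softmax(self.slice_score(tokens), dim=1) # (b, n, 1) focus weights
        fused = (feats.reshape(b, n, -1, h, w) * weights[..., None, None]).sum(dim=1)
        return self.decoder(fused)                               # (batch, 1, H, W) depth map

depth = FocusStackDepth()(torch.randn(2, 5, 3, 64, 64))          # 5-slice focal stacks
```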

Knowledge Extraction and Distillation from Large-Scale Image-Text Colonoscopy Records Leveraging Large Language and Vision Models

  • paper_url: http://arxiv.org/abs/2310.11173
  • repo_url: https://github.com/shuowang26/endoked
  • paper_authors: Shuo Wang, Yan Zhu, Xiaoyuan Luo, Zhiwei Yang, Yizhe Zhang, Peiyao Fu, Manning Wang, Zhijian Song, Quanlin Li, Pinghong Zhou, Yike Guo
  • for: To develop AI-based detection and segmentation of colorectal lesions (polyps) in colonoscopy images.
  • methods: Leveraging recent advances in large language and vision models, the paper proposes EndoKED, a data-mining paradigm for deep knowledge extraction and distillation that automatically transforms raw image-text colonoscopy records into image datasets with pixel-level annotations.
  • results: Validated on multi-centre raw colonoscopy records (~1 million images), EndoKED trains superior polyp detection and segmentation models. Its pre-trained vision backbone also enables data-efficient and generalizable learning for optical biopsy, reaching expert-level performance in both retrospective and prospective validation.
    Abstract The development of artificial intelligence systems for colonoscopy analysis often necessitates expert-annotated image datasets. However, limitations in dataset size and diversity impede model performance and generalisation. Image-text colonoscopy records from routine clinical practice, comprising millions of images and text reports, serve as a valuable data source, though annotating them is labour-intensive. Here we leverage recent advancements in large language and vision models and propose EndoKED, a data mining paradigm for deep knowledge extraction and distillation. EndoKED automates the transformation of raw colonoscopy records into image datasets with pixel-level annotation. We validate EndoKED using multi-centre datasets of raw colonoscopy records (~1 million images), demonstrating its superior performance in training polyp detection and segmentation models. Furthermore, the EndoKED pre-trained vision backbone enables data-efficient and generalisable learning for optical biopsy, achieving expert-level performance in both retrospective and prospective validation.

MST-GAT: A Multimodal Spatial-Temporal Graph Attention Network for Time Series Anomaly Detection

  • paper_url: http://arxiv.org/abs/2310.11169
  • repo_url: None
  • paper_authors: Chaoyue Ding, Shiliang Sun, Jing Zhao
  • for: To detect anomalies in multimodal time series and thereby maintain the safety and stability of working devices.
  • methods: A multimodal graph attention network (M-GAT) combined with a temporal convolution network captures the spatial-temporal correlations in multimodal time series. M-GAT uses a multi-head attention module and two relational attention modules (intra- and inter-modal attention) to model modal correlations explicitly.
  • results: MST-GAT outperforms state-of-the-art baselines on four multimodal benchmark datasets and strengthens the interpretability of detected anomalies by locating the most anomalous univariate time series.
    Abstract Multimodal time series (MTS) anomaly detection is crucial for maintaining the safety and stability of working devices (e.g., water treatment system and spacecraft), whose data are characterized by multivariate time series with diverse modalities. Although recent deep learning methods show great potential in anomaly detection, they do not explicitly capture spatial-temporal relationships between univariate time series of different modalities, resulting in more false negatives and false positives. In this paper, we propose a multimodal spatial-temporal graph attention network (MST-GAT) to tackle this problem. MST-GAT first employs a multimodal graph attention network (M-GAT) and a temporal convolution network to capture the spatial-temporal correlation in multimodal time series. Specifically, M-GAT uses a multi-head attention module and two relational attention modules (i.e., intra- and inter-modal attention) to model modal correlations explicitly. Furthermore, MST-GAT optimizes the reconstruction and prediction modules simultaneously. Experimental results on four multimodal benchmarks demonstrate that MST-GAT outperforms the state-of-the-art baselines. Further analysis indicates that MST-GAT strengthens the interpretability of detected anomalies by locating the most anomalous univariate time series.

Accurate prediction of international trade flows: Leveraging knowledge graphs and their embeddings

  • paper_url: http://arxiv.org/abs/2310.11161
  • repo_url: None
  • paper_authors: Diego Rincon-Yanez, Chahinez Ounoughi, Bassem Sellami, Tarmo Kalvet, Marek Tiits, Sabrina Senatore, Sadok Ben Yahia
  • for: To model international trade with knowledge graphs (KGs) and help policymakers, businesses, and economists anticipate international trade patterns.
  • methods: Knowledge graph embeddings (KGE) are used for trade link prediction and combined with traditional machine learning methods such as decision trees and graph neural networks.
  • results: Link prediction with KG embeddings improves prediction accuracy and offers insights into embedding explainability in knowledge representation; the study also analyzes how embedding methods influence other intelligent algorithms.
    Abstract Knowledge representation (KR) is vital in designing symbolic notations to represent real-world facts and facilitate automated decision-making tasks. Knowledge graphs (KGs) have emerged so far as a popular form of KR, offering a contextual and human-like representation of knowledge. In international economics, KGs have proven valuable in capturing complex interactions between commodities, companies, and countries. By putting the gravity model, which is a common economic framework, into the process of building KGs, important factors that affect trade relationships can be taken into account, making it possible to predict international trade patterns. This paper proposes an approach that leverages Knowledge Graph embeddings for modeling international trade, focusing on link prediction using embeddings. Thus, valuable insights are offered to policymakers, businesses, and economists, enabling them to anticipate the effects of changes in the international trade system. Moreover, the integration of traditional machine learning methods with KG embeddings, such as decision trees and graph neural networks are also explored. The research findings demonstrate the potential for improving prediction accuracy and provide insights into embedding explainability in knowledge representation. The paper also presents a comprehensive analysis of the influence of embedding methods on other intelligent algorithms.
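For readers unfamiliar with KGE-based link prediction, the sketch below shows a minimal TransE-style model with a margin ranking loss over (exporter, relation, importer)-like triples; the entity and relation counts and the specific scoring function are illustrative assumptions, not necessarily what the paper uses.

```python
import torch
import torch.nn as nn

class TransE(nn.Module):
    """Minimal TransE-style embedding model: plausibility of a triple such as
    (exporter, trades_commodity_X, importer) is measured by -||h + r - t||."""
    def __init__(self, n_entities, n_relations, dim=64):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)

    def score(self, h, r, t):
        return -(self.ent(h) + self.rel(r) - self.ent(t)).norm(p=2, dim=-1)

    def loss(self, pos, neg, margin=1.0):
        # margin ranking: true trade links should score higher than corrupted ones
        return torch.relu(margin - self.score(*pos) + self.score(*neg)).mean()

model = TransE(n_entities=200, n_relations=10)        # e.g., countries plus commodity relations
pos = tuple(torch.randint(0, s, (32,)) for s in (200, 10, 200))
neg = tuple(torch.randint(0, s, (32,)) for s in (200, 10, 200))
print(model.loss(pos, neg))
```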

Causal discovery using dynamically requested knowledge

  • paper_url: http://arxiv.org/abs/2310.11154
  • repo_url: None
  • paper_authors: Neville K Kitson, Anthony C Constantinou
  • for: To improve the structural accuracy of learning causal Bayesian networks (CBNs) by letting the structure learning algorithm itself dynamically identify and request human knowledge during learning.
  • methods: The dynamically requested knowledge is integrated into the Tabu structure learning algorithm.
  • results: The approach yields considerable gains in structural accuracy, generally larger than existing approaches for integrating knowledge, uses human expertise more effectively, and makes the structure learning process more transparent.
    Abstract Causal Bayesian Networks (CBNs) are an important tool for reasoning under uncertainty in complex real-world systems. Determining the graphical structure of a CBN remains a key challenge and is undertaken either by eliciting it from humans, using machine learning to learn it from data, or using a combination of these two approaches. In the latter case, human knowledge is generally provided to the algorithm before it starts, but here we investigate a novel approach where the structure learning algorithm itself dynamically identifies and requests knowledge for relationships that the algorithm identifies as uncertain during structure learning. We integrate this approach into the Tabu structure learning algorithm and show that it offers considerable gains in structural accuracy, which are generally larger than those offered by existing approaches for integrating knowledge. We suggest that a variant which requests only arc orientation information may be particularly useful where the practitioner has little preexisting knowledge of the causal relationships. As well as offering improved accuracy, the approach can use human expertise more effectively and contributes to making the structure learning process more transparent.

Uncovering wall-shear stress dynamics from neural-network enhanced fluid flow measurements

  • paper_url: http://arxiv.org/abs/2310.11147
  • repo_url: None
  • paper_authors: Esther Lagemann, Steven L. Brunton, Christian Lagemann
  • for: To provide accurate prediction of wall-shear stress, supporting sustainability, resource conservation, and carbon neutrality in domains such as transportation, public utility infrastructure, energy technology, and healthcare.
  • methods: A deep optical flow estimator enhanced with physical knowledge derives velocity and wall-shear stress fields with high spatial and temporal resolution from fluid flow measurements.
  • results: The method accurately recovers wall-shear stress dynamics, and its validity and physical correctness are demonstrated on synthetic and real-world experimental data.
    Abstract Friction drag from a turbulent fluid moving past or inside an object plays a crucial role in domains as diverse as transportation, public utility infrastructure, energy technology, and human health. As a direct measure of the shear-induced friction forces, an accurate prediction of the wall-shear stress can contribute to sustainability, conservation of resources, and carbon neutrality in civil aviation as well as enhanced medical treatment of vascular diseases and cancer. Despite such importance for our modern society, we still lack adequate experimental methods to capture the instantaneous wall-shear stress dynamics. In this contribution, we present a holistic approach that derives velocity and wall-shear stress fields with impressive spatial and temporal resolution from flow measurements using a deep optical flow estimator with physical knowledge. The validity and physical correctness of the derived flow quantities is demonstrated with synthetic and real-world experimental data covering a range of relevant fluid flows.

Long-form Simultaneous Speech Translation: Thesis Proposal

  • paper_url: http://arxiv.org/abs/2310.11141
  • repo_url: None
  • paper_authors: Peter Polák
  • for: To address simultaneous speech translation, particularly in the long-form setting where the source speech is not pre-segmented into sentences.
  • methods: The proposal focuses on deep learning approaches, including end-to-end simultaneous speech translation systems, and on modifying and improving existing methods.
  • results: As a thesis proposal, it surveys the latest advances, identifies the main limitations of current approaches and their relevance to long-form scenarios, and suggests ways to tackle them, without reporting new experimental results.
    Abstract Simultaneous speech translation (SST) aims to provide real-time translation of spoken language, even before the speaker finishes their sentence. Traditionally, SST has been addressed primarily by cascaded systems that decompose the task into subtasks, including speech recognition, segmentation, and machine translation. However, the advent of deep learning has sparked significant interest in end-to-end (E2E) systems. Nevertheless, a major limitation of most approaches to E2E SST reported in the current literature is that they assume that the source speech is pre-segmented into sentences, which is a significant obstacle for practical, real-world applications. This thesis proposal addresses end-to-end simultaneous speech translation, particularly in the long-form setting, i.e., without pre-segmentation. We present a survey of the latest advancements in E2E SST, assess the primary obstacles in SST and its relevance to long-form scenarios, and suggest approaches to tackle these challenges.

USDC: Unified Static and Dynamic Compression for Visual Transformer

  • paper_url: http://arxiv.org/abs/2310.11117
  • repo_url: None
  • paper_authors: Huan Yuan, Chao Liao, Jianchao Tan, Peng Yao, Jiyuan Jia, Bin Chen, Chengru Song, Di Zhang
  • for: To make Visual Transformers easier to deploy by resolving the tension between model compression and inference speed.
  • methods: An input-adaptive compression model that unifies static and dynamic compression techniques, plus a sub-group gates augmentation technique to close the performance gap caused by different batch sizes at training and inference time.
  • results: Extensive experiments on several baseline Visual Transformers show that the method strikes a better balance between total compression ratio and model performance, and delivers more stable performance across batch sizes.
    Abstract Visual Transformers have achieved great success in almost all vision tasks, such as classification, detection, and so on. However, the model complexity and the inference speed of the visual transformers hinder their deployments in industrial products. Various model compression techniques focus on directly compressing the visual transformers into a smaller one while maintaining the model performance, however, the performance drops dramatically when the compression ratio is large. Furthermore, several dynamic network techniques have also been applied to dynamically compress the visual transformers to obtain input-adaptive efficient sub-structures during the inference stage, which can achieve a better trade-off between the compression ratio and the model performance. The upper bound of memory of dynamic models is not reduced in the practical deployment since the whole original visual transformer model and the additional control gating modules should be loaded onto devices together for inference. To alleviate two disadvantages of two categories of methods, we propose to unify the static compression and dynamic compression techniques jointly to obtain an input-adaptive compressed model, which can further better balance the total compression ratios and the model performances. Moreover, in practical deployment, the batch sizes of the training and inference stage are usually different, which will cause the model inference performance to be worse than the model training performance, which is not touched by all previous dynamic network papers. We propose a sub-group gates augmentation technique to solve this performance drop problem. Extensive experiments demonstrate the superiority of our method on various baseline visual transformers such as DeiT, T2T-ViT, and so on.

H2O Open Ecosystem for State-of-the-art Large Language Models

  • paper_url: http://arxiv.org/abs/2310.13012
  • repo_url: https://github.com/h2oai/h2o-llmstudio
  • paper_authors: Arno Candel, Jon McKinney, Philipp Singer, Pascal Pfeiffer, Maximilian Jeblick, Chun Ming Lee, Marcos V. Conde
  • for: The paper is written to introduce an open-source ecosystem for developing and testing large language models (LLMs) to address the risks posed by closed-source approaches and to make AI development more accessible, efficient, and trustworthy.
  • methods: The paper presents a complete open-source ecosystem for LLMs, including a family of fine-tuned LLMs of diverse sizes and a framework and no-code GUI called H2O LLM Studio for efficient fine-tuning, evaluation, and deployment of LLMs using state-of-the-art techniques.
  • results: The paper introduces h2oGPT, a family of fine-tuned LLMs of diverse sizes, and demonstrates the effectiveness of the H2O LLM Studio for efficient fine-tuning, evaluation, and deployment of LLMs. The demo is available at https://gpt.h2o.ai/.
    Abstract Large Language Models (LLMs) represent a revolution in AI. However, they also pose many significant risks, such as the presence of biased, private, copyrighted or harmful text. For this reason we need open, transparent and safe solutions. We introduce a complete open-source ecosystem for developing and testing LLMs. The goal of this project is to boost open alternatives to closed-source approaches. We release h2oGPT, a family of fine-tuned LLMs of diverse sizes. We also introduce H2O LLM Studio, a framework and no-code GUI designed for efficient fine-tuning, evaluation, and deployment of LLMs using the most recent state-of-the-art techniques. Our code and models are fully open-source. We believe this work helps to boost AI development and make it more accessible, efficient and trustworthy. The demo is available at: https://gpt.h2o.ai/

ASP: Automatic Selection of Proxy dataset for efficient AutoML

  • paper_url: http://arxiv.org/abs/2310.11478
  • repo_url: None
  • paper_authors: Peng Yao, Chao Liao, Jiyuan Jia, Jianchao Tan, Bin Chen, Chengru Song, Di Zhang
  • for: To propose an Automatic Selection of Proxy dataset framework (ASP) that dynamically picks an informative proxy subset of the training data at each epoch, reducing the training data size and the AutoML processing time.
  • methods: ASP selects informative proxy subsets at various selection ratios, shrinking the data seen per epoch while preserving the signal needed for architecture and hyper-parameter search.
  • results: Experiments show that ASP outperforms other data selection methods at all selection ratios and speeds up AutoML processing by 2x-20x while obtaining better architectures and hyper-parameters than using the entire dataset.
    Abstract Deep neural networks have gained great success due to the increasing amounts of data, and diverse effective neural network designs. However, it also brings a heavy computing burden as the amount of training data is proportional to the training time. In addition, a well-behaved model requires repeated trials of different structure designs and hyper-parameters, which may take a large amount of time even with state-of-the-art (SOTA) hyper-parameter optimization (HPO) algorithms and neural architecture search (NAS) algorithms. In this paper, we propose an Automatic Selection of Proxy dataset framework (ASP) aimed to dynamically find the informative proxy subsets of training data at each epoch, reducing the training data size as well as saving the AutoML processing time. We verify the effectiveness and generalization of ASP on CIFAR10, CIFAR100, ImageNet16-120, and ImageNet-1k, across various public model benchmarks. The experiment results show that ASP can obtain better results than other data selection methods at all selection ratios. ASP can also enable much more efficient AutoML processing with a speedup of 2x-20x while obtaining better architectures and better hyper-parameters compared to utilizing the entire dataset.
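The per-epoch proxy selection loop is straightforward to sketch. Below is a minimal Python illustration that ranks samples by their current loss; the scoring criterion and the helper functions named in the comments are hypothetical stand-ins, not ASP's actual procedure.

```python
import numpy as np

def select_proxy_indices(per_sample_losses: np.ndarray, ratio: float) -> np.ndarray:
    """Pick an informative subset for the next epoch. Ranking by current loss is an
    illustrative criterion; ASP's actual scoring may differ."""
    k = max(1, int(ratio * len(per_sample_losses)))
    return np.argsort(per_sample_losses)[-k:]            # hardest (highest-loss) samples

# Inside an AutoML/NAS loop, each epoch then trains only on the selected proxy subset:
# for epoch in range(num_epochs):
#     losses = evaluate_per_sample_loss(model, full_train_set)   # hypothetical helper
#     proxy_idx = select_proxy_indices(losses, ratio=0.1)
#     train_one_epoch(model, Subset(full_train_set, proxy_idx))  # hypothetical helper
```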

HGCVAE: Integrating Generative and Contrastive Learning for Heterogeneous Graph Learning

  • paper_url: http://arxiv.org/abs/2310.11102
  • repo_url: None
  • paper_authors: Yulan Hu, Zhirui Yang, Sheng Ouyang, Junchen Wan, Fuzheng Zhang, Zhongyuan Wang, Yong Liu
  • for: To explore generative self-supervised learning (SSL) in the context of heterogeneous graph learning (HGL).
  • methods: The paper proposes HGCVAE, a contrastive variational graph auto-encoder that combines contrastive learning with generative SSL, freeing HGL from the complex heterogeneity-capturing views required by earlier contrastive methods.
  • results: HGCVAE achieves remarkable results against various state-of-the-art baselines, confirming its superiority.
    Abstract Generative self-supervised learning (SSL) has exhibited significant potential and garnered increasing interest in graph learning. In this study, we aim to explore the problem of generative SSL in the context of heterogeneous graph learning (HGL). The previous SSL approaches for heterogeneous graphs have primarily relied on contrastive learning, necessitating the design of complex views to capture heterogeneity. However, existing generative SSL methods have not fully leveraged the capabilities of generative models to address the challenges of HGL. In this paper, we present HGCVAE, a novel contrastive variational graph auto-encoder that liberates HGL from the burden of intricate heterogeneity capturing. Instead of focusing on complicated heterogeneity, HGCVAE harnesses the full potential of generative SSL. HGCVAE innovatively consolidates contrastive learning with generative SSL, introducing several key innovations. Firstly, we employ a progressive mechanism to generate high-quality hard negative samples for contrastive learning, utilizing the power of variational inference. Additionally, we present a dynamic mask strategy to ensure effective and stable learning. Moreover, we propose an enhanced scaled cosine error as the criterion for better attribute reconstruction. As an initial step in combining generative and contrastive SSL, HGCVAE achieves remarkable results compared to various state-of-the-art baselines, confirming its superiority.

MeKB-Rec: Personal Knowledge Graph Learning for Cross-Domain Recommendation

  • paper_url: http://arxiv.org/abs/2310.11088
  • repo_url: None
  • paper_authors: Xin Su, Yao Zhou, Zifei Shan, Qian Chen
  • for: To strengthen recommendations for new users, addressing the cold-start problem in modern recommender systems.
  • methods: A Personal Knowledge Graph (PKG) combined with Pretrained Language Models (PLMs) builds a domain-invariant semantic representation of users' interests.
  • results: Experiments on multiple public CDR datasets show that the new formulation is more powerful than previous approaches, improving HR@10 and NDCG@10 by 24%-91%, with a 105% gain in HR@10 for zero-shot users with no behavior in the target domain. Deployed in WeiXin recommendation scenarios, MeKB-Rec achieves significant gains in core online metrics and now serves hundreds of millions of users in real-world products.
    Abstract It is a long-standing challenge in modern recommender systems to effectively make recommendations for new users, namely the cold-start problem. Cross-Domain Recommendation (CDR) has been proposed to address this challenge, but current ways to represent users' interests across systems are still severely limited. We introduce Personal Knowledge Graph (PKG) as a domain-invariant interest representation, and propose a novel CDR paradigm named MeKB-Rec. We first link users and entities in a knowledge base to construct a PKG of users' interests, named MeKB. Then we learn a semantic representation of MeKB for the cross-domain recommendation. To efficiently utilize limited training data in CDR, MeKB-Rec employs Pretrained Language Models to inject world knowledge into understanding users' interests. Beyond most existing systems, our approach builds a semantic mapping across domains which breaks the requirement for in-domain user behaviors, enabling zero-shot recommendations for new users in a low-resource domain. We evaluate MeKB-Rec on well-established public CDR datasets, and demonstrate that the new formulation is more powerful than previous approaches, achieving a new state-of-the-art that significantly improves HR@10 and NDCG@10 metrics over the best previous approaches by 24%-91%, with a 105% improvement in HR@10 for zero-shot users with no behavior in the target domain. We deploy MeKB-Rec in WeiXin recommendation scenarios and achieve significant gains in core online metrics. MeKB-Rec is now serving hundreds of millions of users in real-world products.

Feature Pyramid biLSTM: Using Smartphone Sensors for Transportation Mode Detection

  • paper_url: http://arxiv.org/abs/2310.11087
  • repo_url: None
  • paper_authors: Qinrui Tang, Hao Cheng
  • for: To propose a novel end-to-end approach that uses a reduced amount of smartphone sensor data to achieve accurate transportation mode detection in common daily travel.
  • methods: Feature Pyramid biLSTM (FPbiLSTM) extends a CNN-biLSTM model with a Feature Pyramid Network, combining the richness of shallow layers with the feature resilience of deeper layers to capture temporal movement patterns across transportation modes, while reducing the number of sensors required and the processing demand.
  • results: Using data from only three of seven sensors (accelerometer, gyroscope, and magnetometer) on the 2018 Sussex-Huawei Locomotion (SHL) challenge dataset, FPbiLSTM reaches 95.1% accuracy and a 94.7% F1-score across eight transportation modes.
    Abstract The widespread utilization of smartphones has provided extensive availability to Inertial Measurement Units, providing a wide range of sensory data that can be advantageous for the detection of transportation modes. The objective of this study is to propose a novel end-to-end approach to effectively explore a reduced amount of sensory data collected from a smartphone to achieve accurate mode detection in common daily traveling activities. Our approach, called Feature Pyramid biLSTM (FPbiLSTM), is characterized by its ability to reduce the number of sensors required and processing demands, resulting in a more efficient modeling process without sacrificing the quality of the outcomes than the other current models. FPbiLSTM extends an existing CNN biLSTM model with the Feature Pyramid Network, leveraging the advantages of both shallow layer richness and deeper layer feature resilience for capturing temporal moving patterns in various transportation modes. It exhibits an excellent performance by employing the data collected from only three out of seven sensors, i.e. accelerometers, gyroscopes, and magnetometers, in the 2018 Sussex-Huawei Locomotion (SHL) challenge dataset, attaining a noteworthy accuracy of 95.1% and an F1-score of 94.7% in detecting eight different transportation modes.
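A compact stand-in for the model family described above is sketched below: two convolutional scales (a crude "pyramid") over the nine channels of the three retained sensors, followed by a bidirectional LSTM and an 8-way classifier. All layer sizes are assumptions for illustration, not the published FPbiLSTM configuration.

```python
import torch
import torch.nn as nn

class TransportModeNet(nn.Module):
    """Simplified stand-in for FPbiLSTM: shallow and deep convolutional scales over
    9 sensor channels (3-axis accelerometer, gyroscope, magnetometer), fused and fed
    to a bidirectional LSTM, then classified into 8 transport modes."""
    def __init__(self, n_channels=9, n_classes=8, hidden=64):
        super().__init__()
        self.shallow = nn.Conv1d(n_channels, 32, kernel_size=3, padding=1)
        self.deep = nn.Sequential(
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1),
        )
        self.bilstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                          # x: (batch, 9, time)
        shallow = torch.relu(self.shallow(x))
        deep = torch.relu(self.deep(shallow))
        feats = torch.cat([shallow, deep], dim=1)  # fuse shallow and deep scales
        _, (h, _) = self.bilstm(feats.transpose(1, 2))
        return self.head(torch.cat([h[0], h[1]], dim=-1))

logits = TransportModeNet()(torch.randn(4, 9, 500))   # 4 windows of 500 samples each
```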

In-Context Few-Shot Relation Extraction via Pre-Trained Language Models

  • paper_url: http://arxiv.org/abs/2310.11085
  • repo_url: https://github.com/oezyurty/replm
  • paper_authors: Yilmazcan Ozyurt, Stefan Feuerriegel, Ce Zhang
  • for: To infer structured human knowledge (relations) from textual documents, extending existing language-model techniques.
  • methods: An in-context few-shot relation extraction framework built on pre-trained language models that requires neither named-entity inputs nor human annotation of documents.
  • results: Evaluated on the DocRED dataset, the framework performs on par with or better than the original labels and can easily be updated for a new set of relations without re-training.
    Abstract Relation extraction aims at inferring structured human knowledge from textual documents. State-of-the-art methods based on language models commonly have two limitations: (1) they require named entities to be either given as input or infer them, which introduces additional noise, and (2) they require human annotations of documents. As a remedy, we present a novel framework for in-context few-shot relation extraction via pre-trained language models. To the best of our knowledge, we are the first to reformulate the relation extraction task as a tailored in-context few-shot learning paradigm. Thereby, we achieve crucial benefits in that we eliminate the need for both named entity recognition and human annotation of documents. Unlike existing methods based on fine-tuning, our framework is flexible in that it can be easily updated for a new set of relations without re-training. We evaluate our framework using DocRED, the largest publicly available dataset for document-level relation extraction, and demonstrate that our framework achieves state-of-the-art performance. Finally, our framework allows us to identify missing annotations, and we thus show that our framework actually performs much better than the original labels from the development set of DocRED.
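The in-context few-shot setup amounts to assembling a prompt from a few labelled demonstrations and the test document, then letting a pre-trained language model complete the tail entity. The template below is a minimal sketch; its wording and fields are assumptions, not the paper's actual prompt format.

```python
def build_relation_prompt(demos, relation, head_entity, test_doc):
    """Assemble an in-context few-shot prompt for relation extraction."""
    lines = [f"Task: given a document and a head entity, return the tail entity for "
             f"the relation '{relation}'.\n"]
    for doc, head, tail in demos:                      # a handful of labelled examples
        lines.append(f"Document: {doc}\nHead: {head}\nTail: {tail}\n")
    lines.append(f"Document: {test_doc}\nHead: {head_entity}\nTail:")
    return "\n".join(lines)

demos = [("Marie Curie was born in Warsaw.", "Marie Curie", "Warsaw")]
prompt = build_relation_prompt(demos, "place of birth", "Albert Einstein",
                               "Albert Einstein was born in Ulm, Germany.")
# `prompt` is then sent to a pre-trained language model, which completes the tail entity.
```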

Multi-omics Sampling-based Graph Transformer for Synthetic Lethality Prediction

  • paper_url: http://arxiv.org/abs/2310.11082
  • repo_url: None
  • paper_authors: Xusheng Zhao, Hao Liu, Qiong Dai, Hao Peng, Xu Bai, Huailiang Peng
  • for: To propose a new multi-omics method for predicting synthetic lethality (SL).
  • methods: A multi-omics sampling-based graph transformer (MSGT-SL): a shallow multi-view GNN first captures local structural patterns from the SL and multi-omics data; gene features encoding the multi-view information are then fed to standard self-attention to capture long-range dependencies; finally, parallel random walks across multiple omics gene graphs sample genes in a structure-aware manner before self-attention.
  • results: Strong empirical performance on real-world SL tasks, demonstrating the benefit of the graph transformer and the multi-omics data.
    Abstract Synthetic lethality (SL) prediction is used to identify if the co-mutation of two genes results in cell death. The prevalent strategy is to abstract SL prediction as an edge classification task on gene nodes within SL data and achieve it through graph neural networks (GNNs). However, GNNs suffer from limitations in their message passing mechanisms, including over-smoothing and over-squashing issues. Moreover, harnessing the information of non-SL gene relationships within large-scale multi-omics data to facilitate SL prediction poses a non-trivial challenge. To tackle these issues, we propose a new multi-omics sampling-based graph transformer for SL prediction (MSGT-SL). Concretely, we introduce a shallow multi-view GNN to acquire local structural patterns from both SL and multi-omics data. Further, we input gene features that encode multi-view information into the standard self-attention to capture long-range dependencies. Notably, starting with batch genes from SL data, we adopt parallel random walk sampling across multiple omics gene graphs encompassing them. Such sampling effectively and modestly incorporates genes from omics in a structure-aware manner before using self-attention. We showcase the effectiveness of MSGT-SL on real-world SL tasks, demonstrating the empirical benefits gained from the graph transformer and multi-omics data.

Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models

  • paper_url: http://arxiv.org/abs/2310.11079
  • repo_url: None
  • paper_authors: Hsuan Su, Cheng-Chu Cheng, Hua Farn, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee
  • for: To detect potential gender bias in large language models (LLMs) with an automatic method for generating test cases.
  • methods: Automatically generated test cases are used to detect biases in LLMs, and the detected biases are then mitigated by using the generated test cases as demonstrations for in-context learning, avoiding parameter fine-tuning.
  • results: Experimental results show that the LLMs generate fairer responses with the proposed approach.
    Abstract Recently, researchers have made considerable improvements in dialogue systems with the progress of large language models (LLMs) such as ChatGPT and GPT-4. These LLM-based chatbots encode the potential biases while retaining disparities that can harm humans during interactions. The traditional biases investigation methods often rely on human-written test cases. However, these test cases are usually expensive and limited. In this work, we propose a first-of-its-kind method that automatically generates test cases to detect LLMs' potential gender bias. We apply our method to three well-known LLMs and find that the generated test cases effectively identify the presence of biases. To address the biases identified, we propose a mitigation strategy that uses the generated test cases as demonstrations for in-context learning to circumvent the need for parameter fine-tuning. The experimental results show that LLMs generate fairer responses with the proposed approach.
    摘要 In this work, we propose a novel method that automatically generates test cases to detect LLMs' potential gender bias. We apply our method to three well-known LLMs and find that the generated test cases effectively identify the presence of biases. To address the biases identified, we propose a mitigation strategy that uses the generated test cases as demonstrations for in-context learning to circumvent the need for parameter fine-tuning. The experimental results show that LLMs generate fairer responses with the proposed approach.

Sim-to-Real Transfer of Adaptive Control Parameters for AUV Stabilization under Current Disturbance

  • paper_url: http://arxiv.org/abs/2310.11075
  • repo_url: None
  • paper_authors: Thomas Chaffre, Jonathan Wheare, Andrew Lammas, Paulo Santos, Gilles Le Chenadec, Karl Sammut, Benoit Clement
  • for: To develop a deep-learning-based adaptive control method that helps autonomous underwater vehicles (AUVs) reduce the effect of sea current disturbances with minimal human intervention.
  • methods: The maximum entropy deep reinforcement learning framework is merged with a classic model-based control architecture to train general-purpose neural network policies. To cope with the distribution shift and high sample complexity of real environments, a sim-to-real transfer strategy adds a bio-inspired experience replay mechanism, an enhanced domain randomisation technique, and an evaluation protocol executed on a physical platform.
  • results: Policies learned from suboptimal simulated models achieve control performance three times higher on the real vehicle than a model-based, non-adaptive but optimal controller.
    Abstract Learning-based adaptive control methods hold the promise of enabling autonomous agents to reduce the effect of process variations with minimal human intervention. However, their application to autonomous underwater vehicles (AUVs) has so far been restricted due to 1) unknown dynamics in the form of sea current disturbances that we cannot properly model or measure due to limited sensor capability and 2) the nonlinearity of AUV tasks where the controller response at some operating points must be overly conservative in order to satisfy the specification at other operating points. Deep Reinforcement Learning (DRL) can alleviate these limitations by training general-purpose neural network policies, but applications of DRL algorithms to AUVs have been restricted to simulated environments, due to their inherent high sample complexity and distribution shift problem. This paper presents a novel approach, merging the Maximum Entropy Deep Reinforcement Learning framework with a classic model-based control architecture, to formulate an adaptive controller. Within this framework, we introduce a Sim-to-Real transfer strategy comprising the following components: a bio-inspired experience replay mechanism, an enhanced domain randomisation technique, and an evaluation protocol executed on a physical platform. Our experimental assessments demonstrate that this method effectively learns proficient policies from suboptimal simulated models of the AUV, resulting in control performance 3 times higher when transferred to a real-world vehicle, compared to its model-based nonadaptive but optimal counterpart.

Robust-MBFD: A Robust Deep Learning System for Motor Bearing Faults Detection Using Multiple Deep Learning Training Strategies and A Novel Double Loss Function

  • paper_url: http://arxiv.org/abs/2310.11477
  • repo_url: None
  • paper_authors: Khoa Tran, Lam Pham, Hai-Canh Vu
  • for: A comprehensive analysis of motor bearing fault detection (MBFD), i.e., identifying faults in a motor bearing from its vibration.
  • methods: Several machine learning based systems are proposed and evaluated for the MBFD task, along with three deep learning based systems that respectively use supervised, semi-supervised, and unsupervised training strategies, together with a novel double loss function.
  • results: Extensive experiments on benchmark datasets from the American Society for Mechanical Failure Prevention Technology (MFPT), the Case Western Reserve University Bearing Center (CWRU), and the Paderborn University (PU) condition monitoring of bearing damage in electromechanical drive systems show that the deep learning based systems outperform the machine learning based systems, and that a robust and general deep learning system performs well across benchmarks, indicating its potential for real-life MBFD applications.
    Abstract This paper presents a comprehensive analysis of motor bearing fault detection (MBFD), which involves the task of identifying faults in a motor bearing based on its vibration. To this end, we first propose and evaluate various machine learning based systems for the MBFD task. Furthermore, we propose three deep learning based systems for the MBFD task, each of which explores one of the following training strategies: supervised learning, semi-supervised learning, and unsupervised learning. The proposed machine learning based systems and deep learning based systems are evaluated, compared, and then they are used to identify the best model for the MBFD task. We conducted extensive experiments on various benchmark datasets of motor bearing faults, including those from the American Society for Mechanical Failure Prevention Technology (MFPT), Case Western Reserve University Bearing Center (CWRU), and the Condition Monitoring of Bearing Damage in Electromechanical Drive Systems from Paderborn University (PU). The experimental results on different datasets highlight two main contributions of this study. First, we prove that deep learning based systems are more effective than machine learning based systems for the MBFD task. Second, we achieve a robust and general deep learning based system with a novel loss function for the MBFD task on several benchmark datasets, demonstrating its potential for real-life MBFD applications.

Denevil: Towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning

  • paper_url: http://arxiv.org/abs/2310.11053
  • repo_url: None
  • paper_authors: Shitong Duan, Xiaoyuan Yi, Peng Zhang, Tun Lu, Xing Xie, Ning Gu
  • for: This paper aims to explore the ethical values of large language models (LLMs) and develop methods to improve their value compliance.
  • methods: The paper proposes a novel prompt generation algorithm called DeNEVIL to dynamically elicit the violation of ethics in LLMs, and constructs a high-quality dataset called MoralPrompt to benchmark the intrinsic values of LLMs. The paper also develops an in-context alignment method called VILMO to improve the value compliance of LLM outputs.
  • results: The paper discovers that most LLMs are essentially misaligned and demonstrates the effectiveness of VILMO in improving the value compliance of LLM outputs. The results provide a promising initial step in studying the ethical values of LLMs and aligning their outputs with human values.
    Abstract Large Language Models (LLMs) have made unprecedented breakthroughs, yet their increasing integration into everyday life might raise societal risks due to generated unethical content. Despite extensive study on specific issues like bias, the intrinsic values of LLMs remain largely unexplored from a moral philosophy perspective. This work delves into ethical values utilizing Moral Foundation Theory. Moving beyond conventional discriminative evaluations with poor reliability, we propose DeNEVIL, a novel prompt generation algorithm tailored to dynamically exploit LLMs' value vulnerabilities and elicit the violation of ethics in a generative manner, revealing their underlying value inclinations. On such a basis, we construct MoralPrompt, a high-quality dataset comprising 2,397 prompts covering 500+ value principles, and then benchmark the intrinsic values across a spectrum of LLMs. We discovered that most models are essentially misaligned, necessitating further ethical value alignment. In response, we develop VILMO, an in-context alignment method that substantially enhances the value compliance of LLM outputs by learning to generate appropriate value instructions, outperforming existing competitors. Our methods are suitable for black-box and open-source models, offering a promising initial step in studying the ethical values of LLMs.
    摘要 大型语言模型(LLM)已经取得了前所未有的突破,但它们日益融入日常生活,可能因生成不道德的内容而带来社会风险。本文基于道德基础理论,提出 DeNEVIL 提示生成算法与 MoralPrompt 数据集来评测 LLM 的内在价值取向,并提出 VILMO 上下文对齐方法以提高其价值合规性。

  • paper_url: http://arxiv.org/abs/2310.11049
  • repo_url: https://github.com/shubhamkumarnigam/legaleval23_nonet
  • paper_authors: Shubham Kumar Nigam, Aniket Deroy, Noel Shallum, Ayush Kumar Mishra, Anup Roy, Shubham Kumar Mishra, Arnab Bhattacharya, Saptarshi Ghosh, Kripabandhu Ghosh
  • for: This paper describes the team's submission to SemEval-2023 Task 6 on LegalEval: Understanding Legal Texts.
  • methods: The submission covers three subtasks with various experiments: Legal Named Entity Recognition (L-NER), Legal Judgment Prediction (LJP), and Court Judgment Prediction with Explanation (CJPE).
  • results: The paper reports detailed results, data statistics, and methodology for the three subtasks, obtaining competitive leaderboard rankings of 15$^{th}$, 11$^{th}$, and 1$^{st}$, respectively.
    Abstract This paper describes our submission to the SemEval-2023 for Task 6 on LegalEval: Understanding Legal Texts. Our submission concentrated on three subtasks: Legal Named Entity Recognition (L-NER) for Task-B, Legal Judgment Prediction (LJP) for Task-C1, and Court Judgment Prediction with Explanation (CJPE) for Task-C2. We conducted various experiments on these subtasks and presented the results in detail, including data statistics and methodology. It is worth noting that legal tasks, such as those tackled in this research, have been gaining importance due to the increasing need to automate legal analysis and support. Our team obtained competitive rankings of 15$^{th}$, 11$^{th}$, and 1$^{st}$ in Task-B, Task-C1, and Task-C2, respectively, as reported on the leaderboard.
    摘要 这份论文描述了我们在SemEval-2023中的任务6中对LegalEval:理解法律文本的提交。我们的提交集中了三个子任务:法律名称识别(L-NER)、法律预测(LJP)和法律判决预测与解释(CJPE)。我们进行了各种实验,并在详细的报告中提供了数据统计和方法ология。值得注意的是,如果法律任务,如这些研究所解决的,在过去几年中得到了越来越多的重视,因为自动化法律分析和支持的需求不断增长。我们团队在排名榜上获得了15名、11名和1名的竞争性排名,分别对应于任务B、C1和C2。

Understanding Contrastive Learning via Distributionally Robust Optimization

  • paper_url: http://arxiv.org/abs/2310.11048
  • repo_url: https://github.com/junkangwu/ADNCE
  • paper_authors: Junkang Wu, Jiawei Chen, Jiancan Wu, Wentao Shi, Xiang Wang, Xiangnan He
  • for: This work studies the inherent tolerance of contrastive learning (CL) towards sampling bias, where negative samples may share similar semantics (e.g., labels), a phenomenon that existing theories cannot explain; the gap is filled by analyzing CL through the lens of distributionally robust optimization (DRO).
  • methods: The analysis shows that CL effectively performs DRO over the negative sampling distribution, which yields robustness across a variety of potential distributions and hence tolerance to sampling bias, and that the temperature $\tau$ is not merely a heuristic but a Lagrange coefficient controlling the size of the potential distribution set (a minimal InfoNCE sketch follows this entry).
  • results: The study connects DRO with mutual information, offering fresh evidence for "InfoNCE as an estimate of MI" and a new estimator for $\phi$-divergence-based generalized mutual information; it also identifies CL's over-conservatism and sensitivity to outliers and proposes an Adjusted InfoNCE loss (ADNCE) to mitigate them, validated by extensive experiments on images, sentences, and graphs. Code is available at \url{https://github.com/junkangwu/ADNCE}.
    Abstract This study reveals the inherent tolerance of contrastive learning (CL) towards sampling bias, wherein negative samples may encompass similar semantics (\eg labels). However, existing theories fall short in providing explanations for this phenomenon. We bridge this research gap by analyzing CL through the lens of distributionally robust optimization (DRO), yielding several key insights: (1) CL essentially conducts DRO over the negative sampling distribution, thus enabling robust performance across a variety of potential distributions and demonstrating robustness to sampling bias; (2) The design of the temperature $\tau$ is not merely heuristic but acts as a Lagrange Coefficient, regulating the size of the potential distribution set; (3) A theoretical connection is established between DRO and mutual information, thus presenting fresh evidence for ``InfoNCE as an estimate of MI'' and a new estimation approach for $\phi$-divergence-based generalized mutual information. We also identify CL's potential shortcomings, including over-conservatism and sensitivity to outliers, and introduce a novel Adjusted InfoNCE loss (ADNCE) to mitigate these issues. It refines potential distribution, improving performance and accelerating convergence. Extensive experiments on various domains (image, sentence, and graphs) validate the effectiveness of the proposal. The code is available at \url{https://github.com/junkangwu/ADNCE}.
    摘要
  1. CL essentially performs DRO over the negative sampling distribution, enabling robust performance across various potential distributions and demonstrating resistance to sampling bias. 2. The temperature parameter $\tau$ is not just a heuristic, but rather a Lagrange coefficient that regulates the size of the potential distribution set. 3. We establish a theoretical connection between DRO and mutual information, providing fresh evidence for the idea that “InfoNCE is an estimate of MI” and presenting a new approach for estimating $\phi$-divergence-based generalized mutual information. However, CL also has some limitations, such as over-conservatism and sensitivity to outliers. To address these issues, we propose a novel Adjusted InfoNCE loss (ADNCE) that refines the potential distribution and improves performance and convergence. Extensive experiments on various domains (images, sentences, and graphs) demonstrate the effectiveness of our proposal. The code is available at \url{https://github.com/junkangwu/ADNCE}.
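
As a concrete reference for how the temperature enters the objective discussed above, here is a minimal, generic InfoNCE sketch in PyTorch. It is an illustration under assumed tensor shapes, not the paper's ADNCE implementation; ADNCE further adjusts how negatives are weighted.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, negatives, tau=0.1):
    """Generic InfoNCE. anchor/positive: (B, d); negatives: (B, K, d).
    tau is the temperature that the paper interprets as a Lagrange
    coefficient controlling the size of the DRO distribution set."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (anchor * positive).sum(-1, keepdim=True) / tau       # (B, 1)
    neg = torch.einsum('bd,bkd->bk', anchor, negatives) / tau   # (B, K)
    logits = torch.cat([pos, neg], dim=1)                       # (B, 1+K)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)                      # positive is class 0

B, K, d = 4, 8, 16
print(info_nce_loss(torch.randn(B, d), torch.randn(B, d), torch.randn(B, K, d)))
```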

Fast Graph Condensation with Structure-based Neural Tangent Kernel

  • paper_url: http://arxiv.org/abs/2310.11046
  • repo_url: None
  • paper_authors: Lin Wang, Wenqi Fan, Jiatong Li, Yao Ma, Qing Li
  • for: To reduce the computational cost of working with large-scale graph data so that graph neural networks (GNNs) can be applied more readily.
  • methods: Condenses the large graph dataset into a much smaller one, replacing the iterative GNN training inside bi-level optimization with a Kernel Ridge Regression (KRR) task (a closed-form KRR sketch follows this entry).
  • results: Proposes a graph condensation framework (GC-SNTK) built on a Structure-based Neural Tangent Kernel (SNTK) that captures graph topology, accelerating condensation while maintaining high prediction performance.
    Abstract The rapid development of Internet technology has given rise to a vast amount of graph-structured data. Graph Neural Networks (GNNs), as an effective method for various graph mining tasks, incurs substantial computational resource costs when dealing with large-scale graph data. A data-centric manner solution is proposed to condense the large graph dataset into a smaller one without sacrificing the predictive performance of GNNs. However, existing efforts condense graph-structured data through a computational intensive bi-level optimization architecture also suffer from massive computation costs. In this paper, we propose reforming the graph condensation problem as a Kernel Ridge Regression (KRR) task instead of iteratively training GNNs in the inner loop of bi-level optimization. More specifically, We propose a novel dataset condensation framework (GC-SNTK) for graph-structured data, where a Structure-based Neural Tangent Kernel (SNTK) is developed to capture the topology of graph and serves as the kernel function in KRR paradigm. Comprehensive experiments demonstrate the effectiveness of our proposed model in accelerating graph condensation while maintaining high prediction performance.
    摘要 “互联网科技的快速发展导致了大量的树结构数据的生成。树神经网络(GNNs)作为许多树采矿任务的有效方法,对于大规模树数据而言,具有很大的计算资源成本。为了缩小大型树dataset,而不是降低GNNs的预测性能,我们提出了一个数据中心的解决方案。但是,现有的实现方法通过重复地训练GNNs内部的双层优化架构,具有巨大的计算成本。在本文中,我们提出了将树数据缩小问题转化为核心ridge regression(KRR)任务,而不是透过双层优化架构的迭代训练GNNs。更specifically,我们提出了一个新的数据缩小框架(GC-SNTK),其中一个基于结构的神经 tangent kernel(SNTK)被设计来捕捉树的结构,并且作为KRR模式中的kernel函数。实验结果显示,我们的提议的模型能够快速缩小树数据,而且保持高预测性能。”
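
Since the framework swaps the inner-loop GNN training for Kernel Ridge Regression, a closed-form KRR step may make the idea concrete. The sketch below assumes the kernel matrices are already computed; the Structure-based Neural Tangent Kernel itself is the paper's contribution and is not reproduced here.

```python
import numpy as np

def krr_fit_predict(K_train, y_train, K_test_train, lam=1e-3):
    """Closed-form kernel ridge regression:
    alpha = (K + lam * I)^(-1) y,   y_hat = K_test_train @ alpha."""
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + lam * np.eye(n), y_train)
    return K_test_train @ alpha

# Toy usage with a random PSD kernel standing in for the SNTK.
X = np.random.rand(10, 4)
K = X @ X.T
print(krr_fit_predict(K[:8, :8], np.random.rand(8), K[8:, :8]).shape)
```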

Spoofing Attack Detection in the Physical Layer with Robustness to User Movement

  • paper_url: http://arxiv.org/abs/2310.11043
  • repo_url: None
  • paper_authors: Daniel Romero, Tien Ngoc Ha, Peter Gerstoft
  • for: To detect spoofing attacks in the physical layer.
  • methods: Combines a deep-learning-based position-change detector with community detection on graphs, partitioning the sequence of received frames into subsequences to detect concurrent transmissions from distinct locations.
  • results: The scheme can reliably distinguish spoofing from legitimate user movement and is evaluated on real data collected for this purpose.
    Abstract In a spoofing attack, an attacker impersonates a legitimate user to access or modify data belonging to the latter. Typical approaches for spoofing detection in the physical layer declare an attack when a change is observed in certain channel features, such as the received signal strength (RSS) measured by spatially distributed receivers. However, since channels change over time, for example due to user movement, such approaches are impractical. To sidestep this limitation, this paper proposes a scheme that combines the decisions of a position-change detector based on a deep neural network to distinguish spoofing from movement. Building upon community detection on graphs, the sequence of received frames is partitioned into subsequences to detect concurrent transmissions from distinct locations. The scheme can be easily deployed in practice since it just involves collecting a small dataset of measurements at a few tens of locations that need not even be computed or recorded. The scheme is evaluated on real data collected for this purpose.
    摘要 在 spoofing 攻击中,攻击者会伪装为合法用户,以访问或修改受影响用户的数据。通常的 spoofing 检测方法在物理层将攻击宣告为当前通道特征发生变化,如接收信号强度(RSS)测量的空间分布式接收器。然而,由于通道随着时间的变化,例如用户移动,这些方法是不实用的。为了绕过这些限制,这篇论文提议一种方案,将 deep neural network 基于位置变化探测器的决策与 Movement 分离开来。基于图 communit 探测,接收的序列被分割成子序列,以检测同时从不同位置发送的同时传输。该方案可以轻松实现,只需要收集一小量的测量数据,并且不需要计算或记录。这篇论文使用实际数据进行评估。

Radio Map Estimation in the Real-World: Empirical Validation and Analysis

  • paper_url: http://arxiv.org/abs/2310.11036
  • repo_url: None
  • paper_authors: Raju Shrestha, Tien Ngoc Ha, Pham Q. Viet, Daniel Romero
  • for: Radio maps quantify received signal strength or other radio-frequency quantities at every point of a geographical region; this paper focuses on their empirical validation.
  • methods: A representative subset of existing radio map estimators is evaluated on a large set of real measurements collected with an autonomous unmanned aerial vehicle (UAV).
  • results: Sophisticated estimators based on deep neural networks (DNNs) give the best performance but need large volumes of training data to offer a substantial advantage; a novel hybrid estimator enjoys the benefits of both kinds and appears worth exploring further.
    Abstract Radio maps quantify received signal strength or other magnitudes of the radio frequency environment at every point of a geographical region. These maps play a vital role in a large number of applications such as wireless network planning, spectrum management, and optimization of communication systems. However, empirical validation of the large number of existing radio map estimators is highly limited. To fill this gap, a large data set of measurements has been collected with an autonomous unmanned aerial vehicle (UAV) and a representative subset of these estimators were evaluated on this data. The performance-complexity trade-off and the impact of fast fading are extensively investigated. Although sophisticated estimators based on deep neural networks (DNNs) exhibit the best performance, they are seen to require large volumes of training data to offer a substantial advantage relative to more traditional schemes. A novel algorithm that blends both kinds of estimators is seen to enjoy the benefits of both, thereby suggesting the potential of exploring this research direction further.
    摘要 Radio 地图量化接收信号强度或其他频率环境中每个地理区域点的其他物理量。这些地图在许多应用中发挥重要作用,如无线网络规划、频谱管理和通信系统优化。然而,现有的大量Radio map estimator的实验 validate 是非常有限的。为了填补这一空白,一个大量测量数据集被收集,并对这些 estimator 进行了评估。本研究探讨了性能vs复杂度的贸易和快速抖动的影响。虽然基于深度神经网络(DNNs)的复杂 estimator 表现最佳,但它们需要大量的训练数据来提供substantial 的优势 relative to 传统方案。一种混合 estimator 的新算法被发现,它们享有两种 estimator 的优点,因此更多的研究是可能的。

Core Building Blocks: Next Gen Geo Spatial GPT Application

  • paper_url: http://arxiv.org/abs/2310.11029
  • repo_url: None
  • paper_authors: Ashley Fernandez, Swaraj Dube
  • for: Proposes MapGPT, a novel approach that integrates natural language understanding with geospatial data processing to improve the understanding and generation of spatial data.
  • methods: Combines large language models (LLMs) with geospatial analysis, building LLMs on spatial and textual data via tokenization and vector representations tailored to spatial information, and discusses the challenges of generating spatial vector representations.
  • results: The combination enables more accurate and contextually aware responses to location-based queries and supports geospatial computations with visualized outputs.
    Abstract This paper proposes MapGPT which is a novel approach that integrates the capabilities of language models, specifically large language models (LLMs), with spatial data processing techniques. This paper introduces MapGPT, which aims to bridge the gap between natural language understanding and spatial data analysis by highlighting the relevant core building blocks. By combining the strengths of LLMs and geospatial analysis, MapGPT enables more accurate and contextually aware responses to location-based queries. The proposed methodology highlights building LLMs on spatial and textual data, utilizing tokenization and vector representations specific to spatial information. The paper also explores the challenges associated with generating spatial vector representations. Furthermore, the study discusses the potential of computational capabilities within MapGPT, allowing users to perform geospatial computations and obtain visualized outputs. Overall, this research paper presents the building blocks and methodology of MapGPT, highlighting its potential to enhance spatial data understanding and generation in natural language processing applications.
    摘要 这篇论文提出了MapGPT,一种新的方法,它将自然语言理解和空间数据处理技术相结合。这篇论文描述了MapGPT的目标是将自然语言理解和空间数据分析相连接,并通过高亮相关核心组件来强调这个目标。通过结合LLMs和地理空间分析的优势,MapGPT可以提供更加准确和上下文感知的回答。该方法利用了地理空间数据和文本数据的Tokenization和 вектор表示,并解决了生成空间 вектор表示的挑战。此外,该研究还探讨了MapGPT的计算能力,允许用户进行地理计算并获得可视化输出。总之,这篇研究论文介绍了MapGPT的建构和方法,并强调其在自然语言处理应用中增强空间数据理解和生成的潜在能力。

Compatible Transformer for Irregularly Sampled Multivariate Time Series

  • paper_url: http://arxiv.org/abs/2310.11022
  • repo_url: https://github.com/mediabrain-sjtu/coformer
  • paper_authors: Yuxi Wei, Juntong Peng, Tong He, Chenxin Xu, Jian Zhang, Shirui Pan, Siheng Chen
  • for: To analyze irregularly sampled multivariate time series, which methods designed for regularly sampled data cannot handle directly because of misalignment along both the temporal and variate dimensions.
  • methods: Proposes Compatible Transformer (CoFormer), a transformer-based encoder that treats each sample as a unique variate-time point and learns sample-wise temporal/interaction features via intra-variate and inter-variate attention over the corresponding neighbors (a rough sketch of the variate-time-point idea follows this entry).
  • results: Extensive experiments on three real-world datasets show that CoFormer significantly and consistently outperforms existing methods on downstream tasks such as classification and prediction.
    Abstract To analyze multivariate time series, most previous methods assume regular subsampling of time series, where the interval between adjacent measurements and the number of samples remain unchanged. Practically, data collection systems could produce irregularly sampled time series due to sensor failures and interventions. However, existing methods designed for regularly sampled multivariate time series cannot directly handle irregularity owing to misalignment along both temporal and variate dimensions. To fill this gap, we propose Compatible Transformer (CoFormer), a transformer-based encoder to achieve comprehensive temporal-interaction feature learning for each individual sample in irregular multivariate time series. In CoFormer, we view each sample as a unique variate-time point and leverage intra-variate/inter-variate attentions to learn sample-wise temporal/interaction features based on intra-variate/inter-variate neighbors. With CoFormer as the core, we can analyze irregularly sampled multivariate time series for many downstream tasks, including classification and prediction. We conduct extensive experiments on 3 real-world datasets and validate that the proposed CoFormer significantly and consistently outperforms existing methods.
    摘要 多变量时间序列分析方法中,大多数先前方法假设时间序列的尺度保持不变,即间隔时间和样本数均不变。然而,实际上,数据收集系统可能会生成不规则的时间序列,这是因为仪器故障和干预等原因。现有的方法无法直接处理不规则的时间序列,这是因为它们在时间和变量维度上存在偏移。为填补这个空白,我们提出了兼容变换器(CoFormer),一种基于变换器的编码器,用于在不规则的多变量时间序列中实现每个样本独特的时间互动特征学习。在CoFormer中,我们视每个样本为独特的变量-时间点,并通过内变量/外变量注意力来学习每个样本的时间互动特征,基于内变量/外变量的邻居。通过CoFormer作为核心,我们可以分析不规则的多变量时间序列,并进行识别和预测等下游任务。我们在3个真实世界数据集上进行了广泛的实验,并证明了提出的CoFormer在 existed 方法的基础上显著并且一致性地提高了性能。
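
A rough sketch of the variate-time-point idea: each observed (value, variate, time) triple becomes one token, and a standard Transformer encoder attends over all tokens of a series. This is a generic single-attention stand-in for illustration only; CoFormer's intra-variate/inter-variate factorization, neighbor selection, and padding masks are not reproduced, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class VariateTimePointEncoder(nn.Module):
    def __init__(self, num_variates, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.value_proj = nn.Linear(1, d_model)               # embed the raw value
        self.variate_emb = nn.Embedding(num_variates, d_model)
        self.time_proj = nn.Linear(1, d_model)                 # embed the timestamp
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, values, variate_ids, times):
        # values/times: (B, N, 1); variate_ids: (B, N) -- N irregular observations
        tokens = self.value_proj(values) + self.variate_emb(variate_ids) + self.time_proj(times)
        return self.encoder(tokens)                            # (B, N, d_model)

enc = VariateTimePointEncoder(num_variates=3)
out = enc(torch.randn(2, 7, 1), torch.randint(0, 3, (2, 7)), torch.rand(2, 7, 1))
print(out.shape)
```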

From Identifiable Causal Representations to Controllable Counterfactual Generation: A Survey on Causal Generative Modeling

  • paper_url: http://arxiv.org/abs/2310.11011
  • repo_url: None
  • paper_authors: Aneesh Komanduri, Xintao Wu, Yongkai Wu, Feng Chen
  • for: To examine how structural causal modeling can improve deep generative models in terms of explainability, spurious correlations, and out-of-distribution robustness.
  • methods: Provides a technical survey of causal generative modeling, organized into causal representation learning and controllable counterfactual generation, covering fundamental theory, formulations, drawbacks, datasets, and metrics.
  • results: Summarizes applications in fairness, privacy, out-of-distribution generalization, and precision medicine, and discusses open problems and promising directions for future work.
    Abstract Deep generative models have shown tremendous success in data density estimation and data generation from finite samples. While these models have shown impressive performance by learning correlations among features in the data, some fundamental shortcomings are their lack of explainability, the tendency to induce spurious correlations, and poor out-of-distribution extrapolation. In an effort to remedy such challenges, one can incorporate the theory of causality in deep generative modeling. Structural causal models (SCMs) describe data-generating processes and model complex causal relationships and mechanisms among variables in a system. Thus, SCMs can naturally be combined with deep generative models. Causal models offer several beneficial properties to deep generative models, such as distribution shift robustness, fairness, and interoperability. We provide a technical survey on causal generative modeling categorized into causal representation learning and controllable counterfactual generation methods. We focus on fundamental theory, formulations, drawbacks, datasets, metrics, and applications of causal generative models in fairness, privacy, out-of-distribution generalization, and precision medicine. We also discuss open problems and fruitful research directions for future work in the field.
    摘要 深度生成模型在数据密度估计和数据生成从有限样本中表现出色,但它们存在一些基本缺陷,如无法解释、产生假 correlations 和外部扩展不稳定。为了解决这些挑战,可以在深度生成模型中涵盖 causality 理论。结构 causal model(SCM)描述了数据生成过程,模型了系统中变量之间复杂的 causal 关系和机制。因此,SCM 可以自然地与深度生成模型结合。 causal 模型具有许多有利的性能,如分布shift 稳定性、公平性和可操作性。我们提供了深入检查 causal 生成模型的技术survey,分为 causal representation learning 和可控 counterfactual generation 方法。我们关注基本理论、形式、缺陷、数据集、指标和应用于公平、隐私、外部扩展、精准医学等领域。我们还讨论了未解决的问题和未来研究的可能性。

Accelerating Scalable Graph Neural Network Inference with Node-Adaptive Propagation

  • paper_url: http://arxiv.org/abs/2310.10998
  • repo_url: None
  • paper_authors: Xinyi Gao, Wentao Zhang, Junliang Yu, Yingxia Shao, Quoc Viet Hung Nguyen, Bin Cui, Hongzhi Yin
  • for: To improve the real-time inference performance of graph neural networks (GNNs) on large-scale graphs.
  • methods: Proposes an online propagation framework and two node-adaptive propagation methods that customize the optimal propagation depth for each node from its topological information, avoiding redundant feature propagation; Inception Distillation exploits multi-scale receptive field information to compensate for accuracy loss from early termination (a depth-indexed propagation sketch follows this entry).
  • results: Achieves better accuracy and efficiency than existing inference acceleration methods on public datasets, with the advantage most notable at larger scales, reaching a 75x inference speedup on the largest Ogbn-products dataset.
    Abstract Graph neural networks (GNNs) have exhibited exceptional efficacy in a diverse array of applications. However, the sheer size of large-scale graphs presents a significant challenge to real-time inference with GNNs. Although existing Scalable GNNs leverage linear propagation to preprocess the features and accelerate the training and inference procedure, these methods still suffer from scalability issues when making inferences on unseen nodes, as the feature preprocessing requires the graph to be known and fixed. To further accelerate Scalable GNNs inference in this inductive setting, we propose an online propagation framework and two novel node-adaptive propagation methods that can customize the optimal propagation depth for each node based on its topological information and thereby avoid redundant feature propagation. The trade-off between accuracy and latency can be flexibly managed through simple hyper-parameters to accommodate various latency constraints. Moreover, to compensate for the inference accuracy loss caused by the potential early termination of propagation, we further propose Inception Distillation to exploit the multi-scale receptive field information within graphs. The rigorous and comprehensive experimental study on public datasets with varying scales and characteristics demonstrates that the proposed inference acceleration framework outperforms existing state-of-the-art graph inference acceleration methods in terms of accuracy and efficiency. Particularly, the superiority of our approach is notable on datasets with larger scales, yielding a 75x inference speedup on the largest Ogbn-products dataset.
    摘要 图神经网络(GNNs)已经在多种应用中表现出色。然而,大规模图给GNNs的实时推理带来了巨大挑战。虽然现有的可扩展GNNs利用线性传播来预处理特征并加速训练和推理过程,但在对未见过的节点进行推理时,这些方法仍存在可扩展性问题,因为特征预处理要求图是已知且固定的。为了在这种归纳设定下进一步加速可扩展GNNs的推理,我们提出了一个在线传播框架和两种新的节点自适应传播方法,可以根据每个节点的拓扑信息自适应地定制最佳传播深度,从而避免冗余的特征传播。通过简单的超参数即可灵活地权衡准确率和延迟,以适应不同的延迟约束。此外,为了弥补传播提前终止可能带来的推理精度损失,我们进一步提出了 Inception Distillation,以利用图中多尺度感受野的信息。我们在规模和特性各异的公共数据集上进行了严格而全面的实验研究,结果显示,我们的推理加速框架在准确率和效率方面都优于现有的最先进图推理加速方法;在规模更大的数据集上优势尤为明显,在最大的 Ogbn-products 数据集上实现了75倍的推理加速。
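
The sketch below only illustrates depth-indexed feature propagation: features are smoothed through a normalized adjacency matrix, and each node keeps the representation at its own depth. How the paper actually chooses per-node depths from topology, and Inception Distillation, are not reproduced; the inputs are assumed to be dense NumPy arrays.

```python
import numpy as np

def node_adaptive_propagation(adj_norm, features, depths, max_depth=4):
    """adj_norm: (N, N) normalized adjacency; features: (N, d);
    depths: (N,) integer propagation depth chosen per node."""
    props = [features]
    for _ in range(max_depth):
        props.append(adj_norm @ props[-1])            # one more hop of smoothing
    props = np.stack(props)                            # (max_depth + 1, N, d)
    idx = np.clip(depths, 0, max_depth)
    return props[idx, np.arange(features.shape[0])]    # per-node depth selection

N, d = 5, 3
adj = np.eye(N)                                        # trivial graph for illustration
X = np.random.rand(N, d)
print(node_adaptive_propagation(adj, X, depths=np.array([0, 1, 2, 1, 0])).shape)
```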

EXMODD: An EXplanatory Multimodal Open-Domain Dialogue dataset

  • paper_url: http://arxiv.org/abs/2310.10967
  • repo_url: https://github.com/poplpr/exmodd
  • paper_authors: Hang Yin, Pinren Lu, Ziang Li, Bin Sun, Kan Li
  • for: To improve the quality of research on dialogue tasks while reducing the cost of data collection.
  • methods: Proposes a Multimodal Data Construction Framework (MDCF) that designs proper prompts so that a large pre-trained language model generates well-formed, satisfactory content, and automatically provides explanations for each image and its corresponding dialogue, adding interpretability and easing manual quality inspection.
  • results: Experiments indicate that the generated dialogues and image explanations are accurate and of high quality, with a positive correlation between a model's ability to generate accurate understandings and high-quality responses.
    Abstract The need for high-quality data has been a key issue hindering the research of dialogue tasks. Recent studies try to build datasets through manual, web crawling, and large pre-trained models. However, man-made data is expensive and data collected from the internet often includes generic responses, meaningless statements, and toxic dialogues. Automatic data generation through large models is a cost-effective method, but for open-domain multimodal dialogue tasks, there are still three drawbacks: 1) There is currently no open-source large model that can accept multimodal input; 2) The content generated by the model lacks interpretability; 3) The generated data is usually difficult to quality control and require extensive resource to collect. To alleviate the significant human and resource expenditure in data collection, we propose a Multimodal Data Construction Framework (MDCF). MDCF designs proper prompts to spur the large-scale pre-trained language model to generate well-formed and satisfactory content. Additionally, MDCF also automatically provides explanation for a given image and its corresponding dialogue, which can provide a certain degree of interpretability and facilitate manual follow-up quality inspection. Based on this, we release an Explanatory Multimodal Open-Domain dialogue dataset (EXMODD). Experiments indicate a positive correlation between the model's ability to generate accurate understandings and high-quality responses. Our code and data can be found at https://github.com/poplpr/EXMODD.
    摘要 需求高质量数据一直是对对话任务研究的关键障碍。近期研究通过手动、网络爬虫和大型预训练模型建立数据集。然而,人工生成数据昂贵,网络上收集的数据经常包含无关的回答、意义不明确的声明以及恶意对话。通过大型模型自动生成数据是一种经济的方法,但对开放频道多媒体对话任务还存在三个缺点:1)目前没有开源的大型模型可以接受多媒体输入;2)模型生成的内容缺乏可读性;3)生成的数据困难以质量控制,需要广泛的资源来收集。为了减少数据收集的人工和资源投入,我们提出了多媒体数据建构框架(MDCF)。MDCF设计合适的提示,使大规模预训练语言模型生成高质量和满意的内容。此外,MDCF还自动提供图像和对应对话的解释,可以提供一定的可读性,便于手动跟踪质量检查。根据这,我们发布了解释多媒体开放频道对话数据集(EXMODD)。实验表明,模型能够生成准确理解和高质量回答之间存在正相关关系。我们的代码和数据可以在 GitHub 上找到。

A State-Vector Framework for Dataset Effects

  • paper_url: http://arxiv.org/abs/2310.10955
  • repo_url: https://github.com/esmatsahak/emnlp-2023_a-state-vector-framework-for-dataset-effects_repository
  • paper_authors: Esmat Sahak, Zining Zhu, Frank Rudzicz
  • for: To study how the high-quality datasets used to train deep neural network (DNN) systems, and the interactions between those datasets, affect model performance.
  • methods: Proposes a state-vector framework that uses idealized probing test results as the bases of a vector space, so that standalone and interacting dataset effects can be quantified along interpretable dimensions (a toy numerical illustration follows this entry).
  • results: Finds that several commonly used language understanding datasets have characteristic effects concentrated on a few linguistic dimensions and observes "spill-over" effects where datasets influence dimensions seemingly unrelated to the intended tasks, supporting responsible and robust model development.
    Abstract The impressive success of recent deep neural network (DNN)-based systems is significantly influenced by the high-quality datasets used in training. However, the effects of the datasets, especially how they interact with each other, remain underexplored. We propose a state-vector framework to enable rigorous studies in this direction. This framework uses idealized probing test results as the bases of a vector space. This framework allows us to quantify the effects of both standalone and interacting datasets. We show that the significant effects of some commonly-used language understanding datasets are characteristic and are concentrated on a few linguistic dimensions. Additionally, we observe some ``spill-over'' effects: the datasets could impact the models along dimensions that may seem unrelated to the intended tasks. Our state-vector framework paves the way for a systematic understanding of the dataset effects, a crucial component in responsible and robust model development.
    摘要 “深度神经网络(DNN)系统的卓越成功受到训练 datasets 的高质量影响,但 datasets 之间的交互效果还未得到足够探究。我们提出了一个状态向量框架,以便系统地研究这一方面。这个框架使用理想化的 probing 测试结果作为基准,并允许我们量化单独和交互 datasets 的效果。我们发现一些通用语言理解 datasets 的效果是特征性的,集中在一些语言维度上。此外,我们还发现了一些“倒流”效果: datasets 可以影响模型在不直接相关的任务上的表现。我们的状态向量框架为负责任和可靠模型开发提供了一个系统的理解。”
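
A toy numerical illustration of the state-vector idea: a model's state is the vector of its probing results, and a dataset's effect is the displacement it causes. The probe dimensions and accuracies below are invented for illustration, and the paper's exact definition of interaction effects may differ.

```python
import numpy as np

dims = ["syntax", "semantics", "coreference"]   # hypothetical probe dimensions
base     = np.array([0.62, 0.55, 0.48])         # probing results of the base model
after_A  = np.array([0.64, 0.71, 0.50])         # after fine-tuning on dataset A
after_B  = np.array([0.63, 0.58, 0.60])         # after fine-tuning on dataset B
after_AB = np.array([0.66, 0.76, 0.63])         # after fine-tuning on A and B together

effect_A, effect_B = after_A - base, after_B - base        # standalone effects
interaction = (after_AB - base) - (effect_A + effect_B)    # beyond the sum of parts
print(dict(zip(dims, np.round(interaction, 3))))
```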

Enhanced Transformer Architecture for Natural Language Processing

  • paper_url: http://arxiv.org/abs/2310.10930
  • repo_url: None
  • paper_authors: Woohyeon Moon, Taeyoung Kim, Bumgeun Park, Dongsoo Har
  • for: To improve model performance in the field of natural language processing (NLP).
  • methods: Proposes a new transformer architecture featuring full layer normalization, weighted residual connections, positional encoding exploiting reinforcement learning, and zero-masked self-attention (a sketch of the weighted residual idea follows this entry).
  • results: Evaluated with the Multi30k translation dataset, the Enhanced Transformer achieves a 202.96% higher BLEU score than the original transformer.
    Abstract Transformer is a state-of-the-art model in the field of natural language processing (NLP). Current NLP models primarily increase the number of transformers to improve processing performance. However, this technique requires a lot of training resources such as computing capacity. In this paper, a novel structure of Transformer is proposed. It is featured by full layer normalization, weighted residual connection, positional encoding exploiting reinforcement learning, and zero masked self-attention. The proposed Transformer model, which is called Enhanced Transformer, is validated by the bilingual evaluation understudy (BLEU) score obtained with the Multi30k translation dataset. As a result, the Enhanced Transformer achieves 202.96% higher BLEU score as compared to the original transformer with the translation dataset.
    摘要 transformer 是当前自然语言处理(NLP)领域的先进模型。现有的 NLP 模型主要通过增加 transformer 的数量来提高处理性能。然而,这种技术需要大量的训练资源,如计算能力。在这篇论文中,一种新的 transformer 结构被提出,具有全层正常化、权重征值连接、位置编码利用强化学习和零层隐藏自注意力。这种提出的 transformer 模型被称为增强 transformer,在使用 Multi30k 翻译集合时通过对比 Bleu 分数来验证其性能。结果显示,增强 transformer 与原始 transformer 相比,在 Multi30k 翻译集合中的 Bleu 分数提高了202.96%。
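
A hedged sketch of one of the listed ingredients, a residual connection with a learnable mixing weight combined with layer normalization. The abstract does not spell out the exact parameterization, so the single scalar gate and the feed-forward shape below are assumptions.

```python
import torch
import torch.nn as nn

class WeightedResidualBlock(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learnable residual weight

    def forward(self, x):
        # Convex mix of the skip path and the transformed path.
        return self.alpha * x + (1.0 - self.alpha) * self.ff(self.norm(x))

print(WeightedResidualBlock(32)(torch.randn(2, 5, 32)).shape)
```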

Using Audio Data to Facilitate Depression Risk Assessment in Primary Health Care

  • paper_url: http://arxiv.org/abs/2310.10928
  • repo_url: None
  • paper_authors: Adam Valen Levinson, Abhay Goyal, Roger Ho Chun Man, Roy Ka-Wei Lee, Koustuv Saha, Nimay Parekh, Frederick L. Altice, Lam Yin Cheung, Munmun De Choudhury, Navin Kumar
  • for: To predict depression risk from audio data.
  • methods: Uses the TPOT autoML tool to select the best machine learning algorithm, which turned out to be a K-nearest neighbors classifier (a minimal scikit-learn sketch follows this entry).
  • results: The selected model performs strongly (precision 0.98, recall 0.93, F1-score 0.96); such findings could lead to tools that screen for depression risk, for example by routing patients to AI-driven chatbots for initial screening.
    Abstract Telehealth is a valuable tool for primary health care (PHC), where depression is a common condition. PHC is the first point of contact for most people with depression, but about 25% of diagnoses made by PHC physicians are inaccurate. Many other barriers also hinder depression detection and treatment in PHC. Artificial intelligence (AI) may help reduce depression misdiagnosis in PHC and improve overall diagnosis and treatment outcomes. Telehealth consultations often have video issues, such as poor connectivity or dropped calls. Audio-only telehealth is often more practical for lower-income patients who may lack stable internet connections. Thus, our study focused on using audio data to predict depression risk. The objectives were to: 1) Collect audio data from 24 people (12 with depression and 12 without mental health or major health condition diagnoses); 2) Build a machine learning model to predict depression risk. TPOT, an autoML tool, was used to select the best machine learning algorithm, which was the K-nearest neighbors classifier. The selected model had high performance in classifying depression risk (Precision: 0.98, Recall: 0.93, F1-Score: 0.96). These findings may lead to a range of tools to help screen for and treat depression. By developing tools to detect depression risk, patients can be routed to AI-driven chatbots for initial screenings. Partnerships with a range of stakeholders are crucial to implementing these solutions. Moreover, ethical considerations, especially around data privacy and potential biases in AI models, need to be at the forefront of any AI-driven intervention in mental health care.
    摘要 远程医疗是基层医疗(PHC)中一种有价值的工具,而抑郁是PHC中常见的疾病。PHC是大多数抑郁患者的第一个接触点,但PHC医生做出的诊断约有25%是不准确的,还有许多其他障碍妨碍了PHC中的抑郁检测和治疗。人工智能(AI)可能有助于减少PHC中的抑郁误诊,并改善整体诊断和治疗结果。远程医疗咨询经常出现视频问题,如网络连接差或通话中断;对于可能缺乏稳定互联网连接的低收入患者来说,仅音频的远程医疗往往更为实用。因此,我们的研究专注于使用音频数据预测抑郁风险。研究目标是:1)收集24名参与者的音频数据(12名抑郁患者和12名没有精神或重大健康状况诊断的人);2)构建预测抑郁风险的机器学习模型。我们使用autoML工具TPOT选择最佳机器学习算法,最终选择了K近邻分类器。所选模型在识别抑郁风险方面表现出色(精确率0.98,召回率0.93,F1分数0.96)。这些发现可能促成一系列用于筛查和治疗抑郁的工具:患者可以被导向由AI驱动的聊天机器人进行初步筛查。与各类利益相关方合作是落实这些方案的关键。此外,在任何AI驱动的心理健康干预中,都必须优先考虑数据隐私和AI模型潜在偏见等伦理问题。
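
A minimal reproduction of the modeling step with scikit-learn. The random features below are placeholders for the study's acoustic features, and the train/test split is an assumption; only the classifier choice (K-nearest neighbors) and the metrics mirror the entry above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
X = rng.random((24, 40))                   # stand-in acoustic feature vectors
y = np.array([1] * 12 + [0] * 12)          # 12 participants with depression, 12 without

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(precision_score(y_te, pred, zero_division=0),
      recall_score(y_te, pred), f1_score(y_te, pred))
```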

Intelligent Software Tooling for Improving Software Development

  • paper_url: http://arxiv.org/abs/2310.10921
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Nathan Cooper
  • for: To improve the software development process by leveraging deep learning (DL) techniques.
  • methods: Applies DL to the vast amounts of unstructured software engineering artifacts, such as open-source code on GitHub and GUI image datasets like RICO and ReDRAW.
  • results: The dissertation finds that deep learning techniques can improve the efficiency and quality of the software development process, e.g., for generating code and test cases, detecting bugs, and question answering.
    Abstract Software has eaten the world with many of the necessities and quality of life services people use requiring software. Therefore, tools that improve the software development experience can have a significant impact on the world such as generating code and test cases, detecting bugs, question and answering, etc., The success of Deep Learning (DL) over the past decade has shown huge advancements in automation across many domains, including Software Development processes. One of the main reasons behind this success is the availability of large datasets such as open-source code available through GitHub or image datasets of mobile Graphical User Interfaces (GUIs) with RICO and ReDRAW to be trained on. Therefore, the central research question my dissertation explores is: In what ways can the software development process be improved through leveraging DL techniques on the vast amounts of unstructured software engineering artifacts?

NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear Domain

  • paper_url: http://arxiv.org/abs/2310.10920
  • repo_url: https://github.com/pnnl/expert2
  • paper_authors: Anurag Acharya, Sai Munikoti, Aaron Hellinger, Sara Smith, Sridevi Wagle, Sameera Horawalavithana
  • for: To evaluate language models in the nuclear domain with a benchmark built specifically for it.
  • methods: Presents a human-made benchmark of 100 expert-designed questions to test language models' abilities, and proposes a new evaluation metric to address the limitations of existing ones.
  • results: Experiments show that even the best LLMs perform less than satisfactorily on the benchmark, revealing the scientific knowledge gap of existing LLMs in the nuclear domain.
    Abstract As LLMs have become increasingly popular, they have been used in almost every field. But as the application for LLMs expands from generic fields to narrow, focused science domains, there exists an ever-increasing gap in ways to evaluate their efficacy in those fields. For the benchmarks that do exist, a lot of them focus on questions that don't require proper understanding of the subject in question. In this paper, we present NuclearQA, a human-made benchmark of 100 questions to evaluate language models in the nuclear domain, consisting of a varying collection of questions that have been specifically designed by experts to test the abilities of language models. We detail our approach and show how the mix of several types of questions makes our benchmark uniquely capable of evaluating models in the nuclear domain. We also present our own evaluation metric for assessing LLM's performances due to the limitations of existing ones. Our experiments on state-of-the-art models suggest that even the best LLMs perform less than satisfactorily on our benchmark, demonstrating the scientific knowledge gap of existing LLMs.

Emergent Mixture-of-Experts: Can Dense Pre-trained Transformers Benefit from Emergent Modular Structures?

  • paper_url: http://arxiv.org/abs/2310.10908
  • repo_url: https://github.com/qiuzh20/emoe
  • paper_authors: Zihan Qiu, Zeyu Huang, Jie Fu
  • for: To study whether and how dense pre-trained transformers can benefit from implicit (emergent) modular structures, improving generalization and learning efficiency.
  • methods: Exploits emergent modularity by constructing Emergent Mixture-of-Experts (EMoE), a modular counterpart of the original model that adds no parameters and can be effortlessly incorporated into downstream tuning (a rough grouped-FFN sketch follows this entry).
  • results: Extensive experiments (tuning 1785 models) on vision and language tasks with models from 22M to 1.5B parameters show that EMoE improves in-domain and out-of-domain generalization; analysis and ablations indicate it mitigates negative knowledge transfer and is robust to various configurations.
    Abstract Incorporating modular designs into neural networks demonstrates superior out-of-generalization, learning efficiency, etc. Existing modular neural networks are generally $\textit{explicit}$ because their modular architectures are pre-defined, and individual modules are expected to implement distinct functions. Conversely, recent works reveal that there exist $\textit{implicit}$ modular structures in standard pre-trained transformers, namely $\textit{Emergent Modularity}$. They indicate that such modular structures exhibit during the early pre-training phase and are totally spontaneous. However, most transformers are still treated as monolithic models with their modular natures underutilized. Therefore, given the excellent properties of explicit modular architecture, we explore $\textit{whether and how dense pre-trained transformers can benefit from emergent modular structures.}$ To study this question, we construct \textbf{E}mergent $\textbf{M}$ixture-$\textbf{o}$f-$\textbf{E}$xperts (EMoE). Without introducing additional parameters, EMoE can be seen as the modular counterpart of the original model and can be effortlessly incorporated into downstream tuning. Extensive experiments (we tune 1785 models) on various downstream tasks (vision and language) and models (22M to1.5B) demonstrate that EMoE effectively boosts in-domain and out-of-domain generalization abilities. Further analysis and ablation study suggest that EMoE mitigates negative knowledge transfer and is robust to various configurations. Code is available at \url{https://github.com/qiuzh20/EMoE}
    摘要 使用模块化设计在神经网络中表现出色的特点,包括更好的泛化性、学习效率等。现有的模块化神经网络通常是显式的,即其模块化结构是预先定义的,每个模块需要实现特定的功能。然而,latest works表明,标准预训练变换器中存在隐式的模块结构,称为“ Emergent Modularity”。这些模块结构在初始预训练阶段自然地出现,并且是 Totally Spontaneous。然而,大多数变换器仍然被视为坚实的模块化模型,其中模块性未得到利用。因此,我们提出了以下问题:whether and how dense pre-trained transformers can benefit from emergent modular structures。为了研究这个问题,我们构建了Emergent Mixture-of-Experts (EMoE)。EMoE可以看作原始模型的模块化对手,而且可以无需添加参数进行下游调整。我们对多种下游任务(视觉和语言)和模型(22M到1.5B)进行了广泛的实验,结果表明,EMoE可以有效地提高域内和域外泛化能力。进一步的分析和减少研究表明,EMoE可以 mitigate negative knowledge transfer和是对多种配置的稳定。代码可以在 \url{https://github.com/qiuzh20/EMoE} 上下载。
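
A rough sketch of the general idea of turning a dense feed-forward block into a mixture of experts by splitting its hidden units into groups and activating only the top-k groups per token. The grouping and the learned router below are simple placeholders; how EMoE actually derives experts from a pretrained FFN without adding parameters is the paper's contribution and is not reproduced.

```python
import torch
import torch.nn as nn

class GroupedFFNMoE(nn.Module):
    def __init__(self, d_model=32, d_hidden=64, n_experts=4, top_k=2):
        super().__init__()
        assert d_hidden % n_experts == 0
        self.fc1, self.fc2 = nn.Linear(d_model, d_hidden), nn.Linear(d_hidden, d_model)
        self.router = nn.Linear(d_model, n_experts)
        self.n_experts, self.top_k = n_experts, top_k

    def forward(self, x):                                   # x: (B, T, d_model)
        h = torch.relu(self.fc1(x))                         # (B, T, d_hidden)
        h = h.view(*x.shape[:-1], self.n_experts, -1)       # split hidden units into experts
        gate = torch.softmax(self.router(x), dim=-1)        # (B, T, n_experts)
        keep = torch.zeros_like(gate).scatter(-1, gate.topk(self.top_k, -1).indices, 1.0)
        h = h * (gate * keep).unsqueeze(-1)                 # zero out non-selected experts
        return self.fc2(h.flatten(-2))                      # back to (B, T, d_model)

print(GroupedFFNMoE()(torch.randn(2, 5, 32)).shape)
```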

Instilling Inductive Biases with Subnetworks

  • paper_url: http://arxiv.org/abs/2310.10899
  • repo_url: https://github.com/rock-z/instilling-inductiva-bias
  • paper_authors: Enyan Zhang, Michael A. Lepori, Ellie Pavlick
  • for: To explore Subtask Induction, a new mechanistic method for instilling inductive biases so that model behavior can be better understood and controlled.
  • methods: Discovers a functional subnetwork that implements a particular subtask within a trained model and uses it to instill inductive biases towards solutions that rely on that subtask.
  • results: Two experiments show that Subtask Induction significantly reduces the training data needed to adopt a specific, generalizable solution to a modular arithmetic task, and successfully induces a human-like shape bias in convolutional and transformer-based image classifiers while increasing data efficiency.
    Abstract Despite the recent success of artificial neural networks on a variety of tasks, we have little knowledge or control over the exact solutions these models implement. Instilling inductive biases -- preferences for some solutions over others -- into these models is one promising path toward understanding and controlling their behavior. Much work has been done to study the inherent inductive biases of models and instill different inductive biases through hand-designed architectures or carefully curated training regimens. In this work, we explore a more mechanistic approach: Subtask Induction. Our method discovers a functional subnetwork that implements a particular subtask within a trained model and uses it to instill inductive biases towards solutions utilizing that subtask. Subtask Induction is flexible and efficient, and we demonstrate its effectiveness with two experiments. First, we show that Subtask Induction significantly reduces the amount of training data required for a model to adopt a specific, generalizable solution to a modular arithmetic task. Second, we demonstrate that Subtask Induction successfully induces a human-like shape bias while increasing data efficiency for convolutional and transformer-based image classification models.
    摘要 尽管人工神经网络在各种任务上表现出色,但我们对这些模型实际解决方案的具体知识和控制仍然很少。尝试通过填充模型中的预设偏见(preferences for certain solutions)来理解和控制它们的行为是一个有前途的路径。许多研究已经专注于研究模型的内生预设偏见和通过手动设计architecture或特制的训练程序来实现不同的预设偏见。在这个工作中,我们探索了一种更机制化的方法:子任务抽象。我们的方法可以在已经训练过的模型中找到一个功能子网络,该子网络实现了特定的子任务,然后使用这个子任务来填充模型中的预设偏见。子任务抽象是高效和灵活的,我们通过两个实验证明其效果。首先,我们表明了子任务抽象可以减少模型学习数据量,使模型采用特定、普适的解决方案。其次,我们成功地在基于卷积和变换器的图像分类模型中实现了人类化形态偏见,同时提高了数据效率。

cs.CL - 2023-10-17

BasahaCorpus: An Expanded Linguistic Resource for Readability Assessment in Central Philippine Languages

  • paper_url: http://arxiv.org/abs/2310.11584
  • repo_url: https://github.com/imperialite/basahacorpus-hierarchicalcrosslingualara
  • paper_authors: Joseph Marvin Imperial, Ekaterina Kochmar
  • for: This paper is written for the purpose of improving the performance of automatic readability assessment (ARA) models in lower resource languages in the Philippines.
  • methods: The paper uses a corpus of short fictional narratives written in Hiligaynon, Minasbate, Karay-a, and Rinconada to train ARA models using surface-level, syllable-pattern, and n-gram overlap features. The paper also proposes a new hierarchical cross-lingual modeling approach that takes advantage of a language’s placement in the family tree to increase the amount of available training data.
  • results: The study yields encouraging results that support previous work showcasing the efficacy of cross-lingual models in low-resource settings, as well as similarities in highly informative linguistic features for mutually intelligible languages.
    Abstract Current research on automatic readability assessment (ARA) has focused on improving the performance of models in high-resource languages such as English. In this work, we introduce and release BasahaCorpus as part of an initiative aimed at expanding available corpora and baseline models for readability assessment in lower resource languages in the Philippines. We compiled a corpus of short fictional narratives written in Hiligaynon, Minasbate, Karay-a, and Rinconada -- languages belonging to the Central Philippine family tree subgroup -- to train ARA models using surface-level, syllable-pattern, and n-gram overlap features. We also propose a new hierarchical cross-lingual modeling approach that takes advantage of a language's placement in the family tree to increase the amount of available training data. Our study yields encouraging results that support previous work showcasing the efficacy of cross-lingual models in low-resource settings, as well as similarities in highly informative linguistic features for mutually intelligible languages.
    摘要 当前研究自动可读性评估(ARA)主要集中在高资源语言 such as English 中进行改进模型性能。在这项工作中,我们介绍并发布 BasahaCorpus,是一项旨在扩大可用 corpora 和基线模型 для可读性评估的低资源语言在菲律宾的 iniciativa。我们编译了中央菲律宾语族 subgroup 中的希利加纳、民达、卡拉雅和林康达语言的短篇小说,以用于训练 ARA 模型 surface-level、 syllable-pattern 和 n-gram 重叠特征。我们还提出了一种新的 hierarchical cross-lingual 模型方法,利用语言在语族树中的位置,以增加可用训练数据。我们的研究得到了鼓舞人心的结果,支持先前的研究表明在低资源设置中, crossed-lingual 模型具有可读性评估的效果,以及同属语言之间的高度相似性特征。

What is a good question? Task-oriented asking with fact-level masking

  • paper_url: http://arxiv.org/abs/2310.11571
  • repo_url: None
  • paper_authors: Matthew Toles, Yukun Huang, Zhou Yu, Luis Gravano
  • for: This paper addresses collaboration in reasoning tasks such as question answering, i.e., how a system can ask the user follow-up questions to gather useful information.
  • methods: It defines and frames natural language task-oriented asking (TOA) and introduces fact-level masking (FLM), a procedure that automatically builds self-supervised TOA datasets by omitting critical facts (a small FLM sketch follows this entry).
  • results: Experiments show that current zero-shot language models ask questions that retrieve useful information far less effectively than human annotators, suggesting that FLM datasets and the TOA framework can be used to train and evaluate better TOA models.
    Abstract Asking questions is an important element of real-life collaboration on reasoning tasks like question answering. For example, a legal assistant chatbot may be unable to make accurate recommendations without specific information on the user's circumstances. However, large language models are usually deployed to solve reasoning tasks directly without asking follow-up questions to the user or third parties. We term this problem task-oriented asking (TOA). Zero-shot chat models can perform TOA, but their training is primarily based on next-token prediction rather than whether questions contribute to successful collaboration. To enable the training and evaluation of TOA models, we present a definition and framework for natural language task-oriented asking, the problem of generating questions that result in answers useful for a reasoning task. We also present fact-level masking (FLM), a procedure for converting natural language datasets into self-supervised TOA datasets by omitting particular critical facts. Finally, we generate a TOA dataset from the HotpotQA dataset using FLM and evaluate several zero-shot language models on it. Our experiments show that current zero-shot models struggle to ask questions that retrieve useful information, as compared to human annotators. These results demonstrate an opportunity to use FLM datasets and the TOA framework to train and evaluate better TOA models.
    摘要 提出問題是現實協作中解決推理任務(例如問答)的重要環節;舉例來說,法律助理聊天機器人若缺乏用戶具體情況的資訊,便可能無法給出準確的建議。然而,大型語言模型通常被直接用來解決推理任務,而不會向使用者或第三方提出後續問題。我們將這一問題稱為任務導向提問(task-oriented asking,TOA)。零樣本聊天模型可以執行 TOA,但它們的訓練主要基於下一個詞的預測,而非問題是否有助於協作成功。為了訓練和評估 TOA 模型,我們提出了自然語言任務導向提問的定義與框架,即生成能夠得到對推理任務有用答案的問題;並提出了事實級遮蔽(fact-level masking,FLM),透過移除特定關鍵事實,將自然語言資料集轉換為自監督的 TOA 資料集。我們利用 FLM 從 HotpotQA 構建了一個 TOA 資料集,並在其上評估了多個零樣本語言模型。實驗結果顯示,與人工標註者相比,現有的零樣本模型在提出能取得有用資訊的問題方面表現不佳。這些結果表明,可以利用 FLM 資料集和 TOA 框架來訓練並評估更好的 TOA 模型。
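
A small sketch of the fact-level masking idea under stated assumptions: the mask token, the sentence-level granularity, and the way critical facts are chosen are all placeholders, not the paper's exact procedure.

```python
def fact_level_mask(context_facts, critical_indices, mask_token="[MISSING]"):
    """Remove selected critical facts from a context so that a model must ask
    for them; the held-out facts are what a good question should recover."""
    masked, held_out = [], []
    for i, fact in enumerate(context_facts):
        if i in critical_indices:
            held_out.append(fact)
            masked.append(mask_token)
        else:
            masked.append(fact)
    return masked, held_out

context = ["Alice lives in Paris.", "Alice was born in 1990.", "Paris is in France."]
print(fact_level_mask(context, critical_indices={1}))
```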

Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging

  • paper_url: http://arxiv.org/abs/2310.11564
  • repo_url: https://github.com/joeljang/rlphf
  • paper_authors: Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, Prithviraj Ammanabrolu
  • for: To align large language models (LLMs) with multiple, sometimes conflicting, individual preferences from human feedback, modeled as a multi-objective reinforcement learning (MORL) problem.
  • methods: Decomposes preferences into multiple dimensions declared as desirable by the user, trains them independently and efficiently in a distributed manner, and combines the resulting parameters post-hoc through parameter merging (a merging sketch follows this entry).
  • results: Compared to strong single-objective baselines, the approach achieves personalized preference alignment; experiments show that RLPHF effectively adapts LLM outputs to multiple individual preferences.
    Abstract While Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) with general, aggregate human preferences, it is suboptimal for learning diverse, individual perspectives. In this work, we study Reinforcement Learning from Personalized Human Feedback (RLPHF) problem, wherein LLMs are aligned to multiple (sometimes conflicting) preferences by modeling alignment as a Multi-Objective Reinforcement Learning (MORL) problem. Compared to strong single-objective baselines, we show that we can achieve personalized alignment by decomposing preferences into multiple dimensions. These dimensions are defined based on personalizations that are declared as desirable by the user. In this work, we show that they can be efficiently trained independently in a distributed manner and combined effectively post-hoc through parameter merging. The code is available at https://github.com/joeljang/RLPHF.
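
A sketch of the post-hoc merging step as a per-user weighted average of model parameters. The convex-combination weighting and the assumption that full state dicts (rather than, say, adapter weights only) are merged are illustrative choices, not the paper's exact recipe.

```python
import torch

def merge_parameters(state_dicts, weights):
    """Weighted average of several models' parameters, one weight per model."""
    assert abs(sum(weights) - 1.0) < 1e-6
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Toy usage: two "preference-specific" linear layers merged 70/30.
a, b = torch.nn.Linear(4, 2), torch.nn.Linear(4, 2)
print(merge_parameters([a.state_dict(), b.state_dict()], [0.7, 0.3]).keys())
```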

Multi-stage Large Language Model Correction for Speech Recognition

  • paper_url: http://arxiv.org/abs/2310.11532
  • repo_url: None
  • paper_authors: Jie Pu, Thai-Son Nguyen, Sebastian Stüker
  • for: To improve the performance of competitive speech recognition systems.
  • methods: Uses large language models (LLMs) in a two-stage scheme: a language model re-scores an N-best list of ASR hypotheses and runs a confidence check, and an LLM is then prompted to correct the less confident results (a control-flow sketch follows this entry).
  • results: Experimental results show a 10%~20% relative improvement in WER over a competitive ASR system across multiple test domains.
    Abstract In this paper, we investigate the usage of large language models (LLMs) to improve the performance of competitive speech recognition systems. Different from traditional language models that focus on one single data domain, the rise of LLMs brings us the opportunity to push the limit of state-of-the-art ASR performance, and at the same time to achieve higher robustness and generalize effectively across multiple domains. Motivated by this, we propose a novel multi-stage approach to combine traditional language model re-scoring and LLM prompting. Specifically, the proposed method has two stages: the first stage uses a language model to re-score an N-best list of ASR hypotheses and run a confidence check; The second stage uses prompts to a LLM to perform ASR error correction on less confident results from the first stage. Our experimental results demonstrate the effectiveness of the proposed method by showing a 10% ~ 20% relative improvement in WER over a competitive ASR system -- across multiple test domains.
    摘要 在这篇论文中,我们研究了使用大语言模型(LLM)提高竞争性语音识别系统的性能。与传统语言模型不同,LLMs允许我们在多个数据域之间进行跨领域的学习和泛化,从而提高ASR性能和Robustness。我们提出了一种新的多阶段方法, combining traditional language model re-scoring和LLM prompting。这种方法包括两个阶段:第一阶段使用语言模型对N-best列表的ASR假设进行重新分数和信任检查;第二阶段使用提示来让LLM进行错误纠正。我们的实验结果表明,提案的方法可以提高竞争性ASR系统的WER表现,在多个测试领域中显示10%~20%的相对改善。
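
A sketch of the two-stage control flow described above. Both callables are placeholders the caller must supply: `lm_score` stands for language-model rescoring of the N-best list with a confidence estimate, and `llm_correct` stands for prompting an LLM to rewrite a low-confidence hypothesis. No real ASR or LLM API is implied.

```python
def two_stage_correction(nbest, lm_score, llm_correct, threshold=0.9):
    """nbest: list of ASR hypothesis strings.
    Stage 1 rescoring keeps confident results; stage 2 asks an LLM to correct the rest."""
    best, confidence = lm_score(nbest)          # stage 1: re-score the N-best list
    if confidence >= threshold:
        return best                             # confident enough, no correction needed
    return llm_correct(best, nbest)             # stage 2: LLM-based error correction

# Toy usage with trivial stand-in callables.
print(two_stage_correction(["hello word", "hello world"],
                           lm_score=lambda h: (h[0], 0.5),
                           llm_correct=lambda best, h: h[1]))
```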

Automatic News Summerization

  • paper_url: http://arxiv.org/abs/2310.11520
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Kavach Dheer, Arpit Dhankhar
  • for: To compare extractive and abstractive approaches to news text summarization.
  • methods: Uses the CNN-Daily Mail dataset of news articles with human-generated reference summaries, and evaluates generated summaries with ROUGE scores (a small ROUGE example follows this entry).
  • results: After evaluation, the best-performing models are integrated into a web application to assess their real-world capabilities and user experience.
    Abstract Natural Language Processing is booming with its applications in the real world, one of which is Text Summarization for large texts including news articles. This research paper provides an extensive comparative evaluation of extractive and abstractive approaches for news text summarization, with an emphasis on the ROUGE score analysis. The study employs the CNN-Daily Mail dataset, which consists of news articles and human-generated reference summaries. The evaluation employs ROUGE scores to assess the efficacy and quality of generated summaries. After Evaluation, we integrate the best-performing models on a web application to assess their real-world capabilities and user experience.
    摘要 自然语言处理技术在现实世界中得到了广泛应用,其中之一是文本概要化,特别是对新闻文章进行概要。本研究论文进行了对抽取和抽象方法的比较评估,强调ROUGE分数分析。研究使用了CNN-Daily Mail dataset,该 dataset包括新闻文章和人工生成的参考概要。评估使用ROUGE分数评估生成的概要质量。经评估后,我们将最佳表现的模型集成到了网站应用程序中,以评估它们在实际应用中的能力和用户体验。
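
A minimal example of the kind of ROUGE evaluation described above, using the `rouge-score` package; the reference and candidate texts are placeholders, not items from the CNN-Daily Mail dataset.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The city council approved the new transit budget on Monday."
candidate = "City council approves new transit budget."
print(scorer.score(reference, candidate))   # precision/recall/F1 per ROUGE variant
```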

VeRA: Vector-based Random Matrix Adaptation

  • paper_url: http://arxiv.org/abs/2310.11454
  • repo_url: None
  • paper_authors: Dawid Jan Kopiczko, Tijmen Blankevoort, Yuki Markus Asano
  • For: To reduce the number of trainable parameters when finetuning large language models, and to make it practical to keep many per-user or per-task adapted models.
  • Methods: Uses a single pair of shared low-rank matrices and learns small scaling vectors, cutting trainable parameters by 10x compared to LoRA while maintaining the same performance (a layer sketch follows this entry).
  • Results: Matches LoRA on the GLUE and E2E benchmarks and handles instruction-following with the Llama2 7B model using only 1.4M parameters.
    Abstract Low-rank adapation (LoRA) is a popular method that reduces the number of trainable parameters when finetuning large language models, but still faces acute storage challenges when scaling to even larger models or deploying numerous per-user or per-task adapted models. In this work, we present Vector-based Random Matrix Adaptation (VeRA), which reduces the number of trainable parameters by 10x compared to LoRA, yet maintains the same performance. It achieves this by using a single pair of low-rank matrices shared across all layers and learning small scaling vectors instead. We demonstrate its effectiveness on the GLUE and E2E benchmarks, and show its application in instruction-following with just 1.4M parameters using the Llama2 7B model.
    摘要 低秩适应(LoRA)是一种流行的方法,可以在微调大型语言模型时减少可训练参数数量,但在扩展到更大的模型或为每个用户、每个任务部署大量适配模型时,仍面临严重的存储挑战。在这项工作中,我们提出了基于向量的随机矩阵适应(VeRA),它相比 LoRA 可将可训练参数数量再减少 10 倍,同时保持相同的性能。它通过在所有层之间共享一对低秩矩阵,并只学习小的缩放向量来实现这一点。我们在 GLUE 和 E2E 基准上证明了它的有效性,并展示了其在 instruction-following 任务中的应用:使用 Llama2 7B 模型仅需 1.4M 参数。
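
A sketch of a VeRA-style adapted linear layer following the description above: the random low-rank matrices A and B are frozen and shared across layers, and only the small scaling vectors are trained. Initialization details and where exactly the adapter is applied are assumptions.

```python
import torch
import torch.nn as nn

class VeRALinear(nn.Module):
    def __init__(self, base_linear, A, B):
        super().__init__()
        self.base = base_linear                        # frozen pretrained nn.Linear
        for p in self.base.parameters():
            p.requires_grad = False
        self.register_buffer("A", A)                   # frozen shared (r, in_features)
        self.register_buffer("B", B)                   # frozen shared (out_features, r)
        self.d = nn.Parameter(torch.ones(A.size(0)))   # trainable per-rank scaling
        self.b = nn.Parameter(torch.zeros(B.size(0)))  # trainable per-output scaling

    def forward(self, x):
        delta = (self.b.unsqueeze(-1) * self.B) @ (self.d.unsqueeze(-1) * self.A)
        return self.base(x) + x @ delta.t()            # base output plus scaled update

r, d_in, d_out = 4, 16, 8
A, B = torch.randn(r, d_in), torch.randn(d_out, r)     # drawn once, shared by all layers
layer = VeRALinear(nn.Linear(d_in, d_out), A, B)
print(layer(torch.randn(2, d_in)).shape)
```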

BitNet: Scaling 1-bit Transformers for Large Language Models

  • paper_url: http://arxiv.org/abs/2310.11453
  • repo_url: https://github.com/kyegomez/BitNet
  • paper_authors: Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei
  • for: To develop a scalable and stable 1-bit Transformer architecture so that large language models remain efficient and sustainable.
  • methods: Introduces BitLinear as a drop-in replacement for the nn.Linear layer, training 1-bit weights from scratch (a simplified sketch follows this entry).
  • results: On language modeling, BitNet is competitive with state-of-the-art 8-bit quantization methods and FP16 Transformer baselines while substantially reducing memory footprint and energy consumption, and it exhibits a scaling law akin to full-precision Transformers, suggesting effective scaling to even larger models while keeping the efficiency and performance benefits.
    Abstract The increasing size of large language models has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption. In this work, we introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models. Specifically, we introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. Furthermore, BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.
    摘要 大型语言模型的增加会带来部署的挑战和环境影响的关注,由于高能consumption。在这个工作中,我们介绍BitNet,一个可扩展和稳定的1比特Transformer架构,设计用于大型语言模型。具体来说,我们介绍BitLinear,用于从零开始训练1比特的 weights的替换层。实验结果显示,BitNet在语言模型化方面实现了竞争性的性能,并substantially reducingmemory尺度和能源consumption,相比于现有的8比特量化方法和FP16 Transformer基准。此外,BitNet展示了与全精度Transformer一样的扩展律,表明它的潜在可以实现有效的扩展到更大的语言模型,保持效率和性能优势。
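
A deliberately simplified 1-bit linear layer in the spirit of BitLinear: latent full-precision weights are binarized to {-1, +1} on the forward pass with an absmean scale and a straight-through estimator. The paper's full recipe (e.g., activation quantization and normalization inside the layer) is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneBitLinearSketch(nn.Linear):
    def forward(self, x):
        w = self.weight - self.weight.mean()   # center before binarizing
        scale = w.abs().mean()                 # absmean scaling factor
        w_bin = torch.sign(w)                  # 1-bit weights (zeros stay zero)
        # Straight-through estimator: forward uses the binarized weights,
        # backward passes gradients to the latent full-precision weights.
        w_q = w + (scale * w_bin - w).detach()
        return F.linear(x, w_q, self.bias)

layer = OneBitLinearSketch(16, 8)
print(layer(torch.randn(2, 16)).shape)
```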

An Empirical Study of Translation Hypothesis Ensembling with Large Language Models

  • paper_url: http://arxiv.org/abs/2310.11430
  • repo_url: https://github.com/deep-spin/translation-hypothesis-ensembling
  • paper_authors: António Farinhas, José G. C. de Souza, André F. T. Martins
  • for: This paper investigates whether ensembling hypotheses can improve the quality of LLM-based machine translation.
  • methods: Experiments with several techniques for ensembling hypotheses produced by LLMs such as ChatGPT, LLaMA, and Alpaca, covering how hypotheses are generated (multiple prompts, temperature-based sampling, beam search) and how the final translation is produced (instruction-based selection, quality-based reranking, and minimum Bayes risk (MBR) decoding; an MBR sketch follows this entry).
  • results: MBR decoding proves very effective, translation quality can be improved with a small number of samples, and instruction tuning strongly affects the relation between hypothesis diversity and sampling temperature.
    Abstract Large language models (LLMs) are becoming a one-fits-many solution, but they sometimes hallucinate or produce unreliable output. In this paper, we investigate how hypothesis ensembling can improve the quality of the generated text for the specific problem of LLM-based machine translation. We experiment with several techniques for ensembling hypotheses produced by LLMs such as ChatGPT, LLaMA, and Alpaca. We provide a comprehensive study along multiple dimensions, including the method to generate hypotheses (multiple prompts, temperature-based sampling, and beam search) and the strategy to produce the final translation (instruction-based, quality-based reranking, and minimum Bayes risk (MBR) decoding). Our results show that MBR decoding is a very effective method, that translation quality can be improved using a small number of samples, and that instruction tuning has a strong impact on the relation between the diversity of the hypotheses and the sampling temperature.
    摘要 大型语言模型(LLM)正在成为一种通用的解决方案,但它们有时会产生幻觉或生成不可靠的输出。在这篇论文中,我们研究了假设集成如何提高基于 LLM 的机器翻译质量。我们对 ChatGPT、LLaMA 和 Alpaca 等 LLM 生成的假设进行了多种集成技术的实验。我们从多个维度进行了研究,包括生成假设的方法(多个提示、基于温度的抽样和束搜索)以及产生最终翻译的策略(基于指令、基于质量的重排序和最小贝叶斯风险(MBR)解码)。结果显示,MBR 解码是一种非常有效的方法,少量样本即可提高翻译质量,而指令微调对假设多样性与抽样温度之间的关系有很强的影响。
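
A small sample-based MBR decoding sketch: the chosen translation is the hypothesis with the highest average utility against the other samples. The unigram-overlap utility is a toy stand-in for the sentence-level metrics (e.g., BLEU- or COMET-style scores) one would normally use.

```python
from collections import Counter

def mbr_decode(hypotheses, utility):
    """Pick the hypothesis that maximizes total utility against all other samples."""
    def score(h):
        return sum(utility(h, other) for other in hypotheses if other is not h)
    return max(hypotheses, key=score)

def unigram_overlap(a, b):
    ca, cb = Counter(a.split()), Counter(b.split())
    inter = sum((ca & cb).values())
    return 2 * inter / max(len(a.split()) + len(b.split()), 1)

samples = ["the cat sat on the mat", "a cat sat on the mat", "the cat is on a mat"]
print(mbr_decode(samples, unigram_overlap))
```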

Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles

  • paper_url: http://arxiv.org/abs/2310.11379
  • repo_url: https://github.com/ferugit/iterative-pseudo-forced-alignment-ctc
  • paper_authors: Fernando López, Jordi Luque, Carlos Segura, Pablo Gómez
  • for: To improve the accuracy, energy efficiency, and speed of wake-up word detection for voice-based interfaces.
  • methods: Enhances data with temporal alignments and uses a two-phase, multi-resolution detection scheme: a lightweight on-device model processes the audio stream in real time, and a server-side verification model, an ensemble of heterogeneous architectures, refines detection, allowing two operating points to be optimized; audio features rather than raw audio are sent to the cloud to protect privacy.
  • results: Different parametric configurations for feature extraction and thirteen audio classifiers are compared in terms of performance and inference time; the proposed ensemble outperforms the strongest single classifier in every noise condition.
    Abstract Voice-based interfaces rely on a wake-up word mechanism to initiate communication with devices. However, achieving a robust, energy-efficient, and fast detection remains a challenge. This paper addresses these real production needs by enhancing data with temporal alignments and using detection based on two phases with multi-resolution. It employs two models: a lightweight on-device model for real-time processing of the audio stream and a verification model on the server-side, which is an ensemble of heterogeneous architectures that refine detection. This scheme allows the optimization of two operating points. To protect privacy, audio features are sent to the cloud instead of raw audio. The study investigated different parametric configurations for feature extraction to select one for on-device detection and another for the verification model. Furthermore, thirteen different audio classifiers were compared in terms of performance and inference time. The proposed ensemble outperforms our stronger classifier in every noise condition.
    摘要 声音基于界面依赖于唤醒词机制来与设备进行通信。然而,实现高效、能效、快速的检测仍然是一大挑战。本文通过增强数据的时间对齐和使用两个阶段多分辨率检测来解决这些生产环境需求。它使用了两个模型:一个轻量级在设备上进行实时处理的音频流模型,以及服务器端的验证模型,这是一个多种不同架构的 ensemble 模型,用于精度检测。这种方案允许优化两个运行点。为了保护隐私,音频特征被发送到云端而不是原始音频。研究中试用了不同的参数配置来进行特征提取,以便在设备上进行检测和在服务器端进行验证。此外,本文比较了13种不同的声音分类器,并评估了它们的性能和推理时间。提议的ensemble在每种噪音条件下都超过了我们更强的分类器。

DialogueLLM: Context and Emotion Knowledge-Tuned LLaMA Models for Emotion Recognition in Conversations

  • paper_url: http://arxiv.org/abs/2310.11374
  • repo_url: None
  • paper_authors: Yazhou Zhang, Mengyao Wang, Prayag Tiwari, Qiuchi Li, Benyou Wang, Jing Qin
  • for: To improve emotion recognition in conversations, an area where general-purpose LLMs lack a distinct focus.
  • methods: Fine-tunes LLaMA models on 13,638 multimodal (text and video) emotional dialogues, treating the visual information as supplementary knowledge for constructing high-quality instructions.
  • results: Offers a comprehensive evaluation on three benchmark emotion recognition in conversations (ERC) datasets, outperforming the SOTA baselines and other SOTA LLMs; DialogueLLM-7B can be trained with LoRA on a 40GB A100 GPU in 5 hours.
    Abstract Large language models (LLMs) and their variants have shown extraordinary efficacy across numerous downstream natural language processing (NLP) tasks, which has presented a new vision for the development of NLP. Despite their remarkable performance in natural language generating (NLG), LLMs lack a distinct focus on the emotion understanding domain. As a result, using LLMs for emotion recognition may lead to suboptimal and inadequate precision. Another limitation of LLMs is that they are typical trained without leveraging multi-modal information. To overcome these limitations, we propose DialogueLLM, a context and emotion knowledge tuned LLM that is obtained by fine-tuning LLaMA models with 13,638 multi-modal (i.e., texts and videos) emotional dialogues. The visual information is considered as the supplementary knowledge to construct high-quality instructions. We offer a comprehensive evaluation of our proposed model on three benchmarking emotion recognition in conversations (ERC) datasets and compare the results against the SOTA baselines and other SOTA LLMs. Additionally, DialogueLLM-7B can be easily trained using LoRA on a 40GB A100 GPU in 5 hours, facilitating reproducibility for other researchers.
    摘要 大型语言模型(LLM)及其变体在众多自然语言处理(NLP)任务中表现出色,为NLP的发展带来了新的视野。尽管它们在自然语言生成(NLG)方面表现突出,但LLM并未专门关注情感理解领域,直接用LLM进行情感识别可能导致精度不足。此外,LLM通常没有利用多模态信息进行训练。为了解决这些限制,我们提出了DialogueLLM,一个通过在13,638条多模态(文本和视频)情感对话上微调LLaMA模型而得到的、融合上下文与情感知识的LLM。视觉信息被视为补充知识,用于构建高质量的指令。我们在三个对话情感识别(ERC)基准数据集上对所提模型进行了全面评估,并与SOTA基线和其他SOTA LLM的结果进行比较。此外,DialogueLLM-7B可以在一块40GB A100 GPU上使用LoRA在5小时内完成训练,便于其他研究人员复现。
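
Since the entry highlights that DialogueLLM-7B is trained cheaply with LoRA, the sketch below shows the core idea of a LoRA-style low-rank adapter around a frozen linear layer, written in plain PyTorch. The rank, scaling, and choice of adapted layer are illustrative assumptions, not the paper's exact configuration.

```python
# Conceptual sketch of a LoRA-style adapter: freeze the pretrained weight and
# learn only a low-rank update B @ A, which keeps trainable parameters small.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weight
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus scaled low-rank update.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096))
out = layer(torch.randn(2, 4096))
print(out.shape, sum(p.numel() for p in layer.parameters() if p.requires_grad))
```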

VECHR: A Dataset for Explainable and Robust Classification of Vulnerability Type in the European Court of Human Rights

  • paper_url: http://arxiv.org/abs/2310.11368
  • repo_url: None
  • paper_authors: Shanshan Xu, Leon Staufer, T. Y. S. S Santosh, Oana Ichim, Corina Heri, Matthias Grabmair
  • for: 解决欧洲人权法庭(ECtHR)中难以界定的"脆弱性"(vulnerability)概念,并提供一个新的专家标注数据集(VECHR),以支持该领域的后续研究。
  • methods: 使用专家标注的多标签数据,对最先进模型在脆弱性类型分类和解释理由抽取上的表现进行基准测试。
  • results: 结果显示脆弱性分类任务具有挑战性:模型预测性能较低,且模型与专家之间的一致性有限;此外,模型在分布外(OOD)数据上的表现也较为有限。
    Abstract Recognizing vulnerability is crucial for understanding and implementing targeted support to empower individuals in need. This is especially important at the European Court of Human Rights (ECtHR), where the court adapts Convention standards to meet actual individual needs and thus ensures effective human rights protection. However, the concept of vulnerability remains elusive at the ECtHR and no prior NLP research has dealt with it. To enable future research in this area, we present VECHR, a novel expert-annotated multi-label dataset comprising of vulnerability type classification and explanation rationale. We benchmark the performance of state-of-the-art models on VECHR from both prediction and explainability perspectives. Our results demonstrate the challenging nature of the task with lower prediction performance and limited agreement between models and experts. Further, we analyze the robustness of these models in dealing with out-of-domain (OOD) data and observe overall limited performance. Our dataset poses unique challenges offering significant room for improvement regarding performance, explainability, and robustness.
    摘要 认识脆弱性是理解并实施有针对性支持、帮助有需要的个人的关键。这对欧洲人权法庭(ECtHR)尤为重要,因为法庭会根据实际的个人需求调整《公约》标准,从而确保人权保护的有效性。然而,脆弱性这一概念在ECtHR中仍然难以界定,此前也没有NLP研究涉及它。为了推动该领域的后续研究,我们提出了VECHR,一个新的专家标注多标签数据集,涵盖脆弱性类型分类和解释理由。我们从预测和可解释性两个方面评估了最先进模型在VECHR上的性能。结果表明这是一个具有挑战性的任务:模型的预测性能较低,且模型与专家之间的一致性有限。此外,这些模型在分布外(OOD)数据上的表现也较为有限。我们的数据集带来了独特的挑战,在性能、可解释性和鲁棒性方面都有显著的改进空间。

Disentangling the Linguistic Competence of Privacy-Preserving BERT

  • paper_url: http://arxiv.org/abs/2310.11363
  • repo_url: None
  • paper_authors: Stefan Arnold, Nils Kemmerzell, Annika Schreiner
  • for: 本研究旨在透过文本层级的解释技术来探索对于文本隐私保护而导致的语言模型表现下降的原因。
  • methods: 本研究使用了一系列的解释技术来分析BERT模型在受到干扰前文本训练后内部表现的改变。
  • results: 实验结果显示,在扰动文本上训练的BERT模型,其内部表示与原模型之间的整体相似性显著降低。通过探测(probing)任务进一步分析发现,文本到文本的隐私化在多种语言形式层面影响语言能力:模型仍能编码词语的局部属性,但在编码词语跨度之间的上下文关系方面表现不足。
    Abstract Differential Privacy (DP) has been tailored to address the unique challenges of text-to-text privatization. However, text-to-text privatization is known for degrading the performance of language models when trained on perturbed text. Employing a series of interpretation techniques on the internal representations extracted from BERT trained on perturbed pre-text, we intend to disentangle at the linguistic level the distortion induced by differential privacy. Experimental results from a representational similarity analysis indicate that the overall similarity of internal representations is substantially reduced. Using probing tasks to unpack this dissimilarity, we find evidence that text-to-text privatization affects the linguistic competence across several formalisms, encoding localized properties of words while falling short at encoding the contextual relationships between spans of words.
    摘要 差分隐私(DP)已被调整以应对文本到文本隐私化的特殊挑战。然而,众所周知,文本到文本隐私化会降低在扰动文本上训练的语言模型的性能。我们对在扰动前置文本上训练的BERT所提取的内部表示应用一系列解释技术,试图在语言层面上剖析差分隐私所引入的失真。表示相似性分析的实验结果表明,内部表示的整体相似性显著降低。通过探测任务进一步拆解这种差异,我们发现文本到文本隐私化在多种语言形式层面影响语言能力:模型能够编码词语的局部属性,却难以编码词语跨度之间的上下文关系。
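
One standard way to run the representational similarity analysis mentioned above is linear Centered Kernel Alignment (CKA) between layer activations of the two models on the same inputs. The sketch below is ours (the paper does not specify this exact implementation); the random activation matrices stand in for real BERT layer outputs.

```python
# Linear CKA between two activation matrices for the same set of inputs.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """X, Y: (n_examples, n_features) activation matrices."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(hsic / (norm_x * norm_y))

rng = np.random.default_rng(0)
clean = rng.normal(size=(128, 768))                          # e.g. layer activations, clean model
private = clean + rng.normal(scale=2.0, size=(128, 768))     # privatized model (toy perturbation)
print(round(linear_cka(clean, private), 3))
```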

Enhancing Neural Machine Translation with Semantic Units

  • paper_url: http://arxiv.org/abs/2310.11360
  • repo_url: https://github.com/ictnlp/su4mt
  • paper_authors: Langlin Huang, Shuhao Gu, Zhuocheng Zhang, Yang Feng
  • for: 本研究旨在提高机器翻译模型的语义理解能力,通过建模句子内语义单元的整体含义,将多个 token 的语义整合起来。
  • methods: 方法包括 Word Pair Encoding (WPE) 和 Attentive Semantic Fusion (ASF) 两部分:WPE 用于识别句子中语义单元的边界,ASF 则将多个子词的语义融合为单一向量。
  • results: 实验结果表明,该方法能够有效地建模并利用句子中的语义单元信息,性能优于强基线。代码可在 https://github.com/ictnlp/SU4MT 获取。
    Abstract Conventional neural machine translation (NMT) models typically use subwords and words as the basic units for model input and comprehension. However, complete words and phrases composed of several tokens are often the fundamental units for expressing semantics, referred to as semantic units. To address this issue, we propose a method Semantic Units for Machine Translation (SU4MT) which models the integral meanings of semantic units within a sentence, and then leverages them to provide a new perspective for understanding the sentence. Specifically, we first propose Word Pair Encoding (WPE), a phrase extraction method to help identify the boundaries of semantic units. Next, we design an Attentive Semantic Fusion (ASF) layer to integrate the semantics of multiple subwords into a single vector: the semantic unit representation. Lastly, the semantic-unit-level sentence representation is concatenated to the token-level one, and they are combined as the input of encoder. Experimental results demonstrate that our method effectively models and leverages semantic-unit-level information and outperforms the strong baselines. The code is available at https://github.com/ictnlp/SU4MT.
    摘要 传统的神经机器翻译(NMT)模型通常以子词和词作为模型输入和理解的基本单位。然而,由多个 token 组成的完整词语和短语往往才是表达语义的基本单位,即语义单元。针对这一问题,我们提出了面向机器翻译的语义单元方法(SU4MT):先建模句子内语义单元的整体含义,再利用它们为理解句子提供新的视角。具体而言,我们首先提出 Word Pair Encoding(WPE),一种用于识别语义单元边界的短语提取方法;随后设计了 Attentive Semantic Fusion(ASF)层,将多个子词的语义整合为单一向量,即语义单元表示;最后将语义单元级的句子表示与 token 级表示拼接,共同作为编码器的输入。实验结果表明,该方法能够有效地建模并利用语义单元级信息,性能优于强基线。代码见 https://github.com/ictnlp/SU4MT。

QADYNAMICS: Training Dynamics-Driven Synthetic QA Diagnostic for Zero-Shot Commonsense Question Answering

  • paper_url: http://arxiv.org/abs/2310.11303
  • repo_url: https://github.com/hkust-knowcomp/qadynamics
  • paper_authors: Haochen Shi, Weiqi Wang, Tianqing Fang, Baixuan Xu, Wenxuan Ding, Xin Liu, Yangqiu Song
  • for: 本研究旨在提高零样本常识问答(QA)模型的泛化能力,使其能够掌握更多的常识知识。
  • methods: 本研究基于训练动态,对每个QA对在问题层面和选项层面的训练动态进行分析,从而剔除由CSKB引入的噪声样本以及标注错误或假阴性的选项。
  • results: 与基线相比,该方法在仅使用33%合成数据的情况下即可取得更好的效果;专家评估也证实该方法提升了合成问答的质量。
    Abstract Zero-shot commonsense Question-Answering (QA) requires models to reason about general situations beyond specific benchmarks. State-of-the-art approaches fine-tune language models on QA pairs constructed from CommonSense Knowledge Bases (CSKBs) to equip the models with more commonsense knowledge in a QA context. However, current QA synthesis protocols may introduce noise from the CSKBs and generate ungrammatical questions and false negative options, which impede the model's ability to generalize. To address these issues, we propose QADYNAMICS, a training dynamics-driven framework for QA diagnostics and refinement. Our approach analyzes the training dynamics of each QA pair at both the question level and option level, discarding machine-detectable artifacts by removing uninformative QA pairs and mislabeled or false-negative options. Extensive experiments demonstrate the effectiveness of our approach, which outperforms all baselines while using only 33% of the synthetic data, even including LLMs such as ChatGPT. Moreover, expert evaluations confirm that our framework significantly improves the quality of QA synthesis. Our codes and model checkpoints are available at https://github.com/HKUST-KnowComp/QaDynamics.
    摘要 零样本常识问答(QA)要求模型能够推理超出特定基准之外的一般情境。当前最先进的方法在由常识知识库(CSKB)构造的QA对上微调语言模型,以便在问答情境中为模型注入更多常识知识。然而,现有的QA合成协议可能引入来自CSKB的噪声,并生成不合语法的问题和假阴性选项,从而阻碍模型的泛化能力。为解决这些问题,我们提出了QADYNAMICS,一个由训练动态驱动的QA诊断与精炼框架。该方法在问题层面和选项层面分析每个QA对的训练动态,剔除无信息量的QA对以及标注错误或假阴性的选项,从而去除机器可检测的伪影。大量实验证明了该方法的有效性:仅使用33%的合成数据即可超越所有基线,其中包括ChatGPT等大语言模型。此外,专家评估也证实我们的框架显著提高了问答合成的质量。代码和模型检查点见 https://github.com/HKUST-KnowComp/QaDynamics。
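
The sketch below shows one plausible form of the training-dynamics-based filtering described above: track the probability assigned to the gold option across epochs and drop examples that stay low-confidence or highly unstable. The statistics and thresholds are our own illustrative assumptions, not the paper's exact criteria.

```python
# Filter synthetic QA pairs by per-example training dynamics.
import numpy as np

def filter_by_dynamics(probs_per_epoch: np.ndarray, conf_min=0.3, var_max=0.25):
    """
    probs_per_epoch: (n_epochs, n_examples) probability assigned to the gold
    option at each training epoch. Returns indices of examples to keep.
    """
    confidence = probs_per_epoch.mean(axis=0)     # mean gold probability over epochs
    variability = probs_per_epoch.std(axis=0)     # how unstable the model is on the example
    keep = (confidence >= conf_min) & (variability <= var_max)
    return np.nonzero(keep)[0]

rng = np.random.default_rng(1)
dyn = rng.uniform(size=(5, 10))          # 5 epochs, 10 synthetic QA pairs (toy data)
print(filter_by_dynamics(dyn))
```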

ChapGTP, ILLC’s Attempt at Raising a BabyLM: Improving Data Efficiency by Automatic Task Formation

  • paper_url: http://arxiv.org/abs/2310.11282
  • repo_url: None
  • paper_authors: Jaap Jumelet, Michael Hanna, Marianne de Heer Kloots, Anna Langedijk, Charlotte Pouw, Oskar van der Wal
  • for: 这个论文是为了参加BabyLM挑战(Warstadt等., 2023)的strict-small track而写的。
  • methods: 该模型是一个掩码语言模型,训练了200个epoch,并借助了一种名为 Automatic Task Formation 的新数据增强技术。
  • results: 论文详细讨论了该模型在BLiMP、(Super)GLUE和MSGS三个评估套件上的表现,并介绍了一些最终未纳入模型、但可为低资源语言模型训练提供启发的方法。
    Abstract We present the submission of the ILLC at the University of Amsterdam to the BabyLM challenge (Warstadt et al., 2023), in the strict-small track. Our final model, ChapGTP, is a masked language model that was trained for 200 epochs, aided by a novel data augmentation technique called Automatic Task Formation. We discuss in detail the performance of this model on the three evaluation suites: BLiMP, (Super)GLUE, and MSGS. Furthermore, we present a wide range of methods that were ultimately not included in the model, but may serve as inspiration for training LMs in low-resource settings.
    摘要 我们介绍了阿姆斯特丹大学ILLC参加BabyLM挑战赛(Warstadt等,2023)严格小数据赛道的参赛作品。我们的最终模型"ChapGTP"是一个训练了200个epoch的掩码语言模型,并借助了一种名为"自动任务形成"(Automatic Task Formation)的新数据增强技术。我们详细讨论了该模型在BLiMP、(Super)GLUE和MSGS三个评估套件上的表现。此外,我们还介绍了一些最终未纳入模型、但可为低资源环境下训练语言模型提供启发的方法。

xMEN: A Modular Toolkit for Cross-Lingual Medical Entity Normalization

  • paper_url: http://arxiv.org/abs/2310.11275
  • repo_url: https://github.com/hpi-dhc/xmen
  • paper_authors: Florian Borchert, Ignacio Llorca, Roland Roller, Bert Arnrich, Matthieu-P. Schapranow
  • for: 提高医疗实体标准化在多种语言上的性能,特别是在语言资源相比英语更少的情况下。
  • methods: 我们介绍了 xMEN,一个模块化的跨语言医疗实体标准化系统,在低资源和高资源场景下均表现出色。当目标语言中某一术语的同义词稀缺时,我们利用英语别名进行跨语言候选生成;在候选排序阶段,若目标任务有标注数据,则使用可训练的交叉编码器模型,并评估了基于机器翻译数据集、以弱监督方式训练的交叉编码器。
  • results: xMEN 在多种多语言基准数据集上刷新了最先进水平;弱监督交叉编码器在目标任务没有标注数据时同样有效。借助 xMEN 与 BigBIO 框架的兼容性,它可以方便地应用于现有及未来的数据集。
    Abstract Objective: To improve performance of medical entity normalization across many languages, especially when fewer language resources are available compared to English. Materials and Methods: We introduce xMEN, a modular system for cross-lingual medical entity normalization, which performs well in both low- and high-resource scenarios. When synonyms in the target language are scarce for a given terminology, we leverage English aliases via cross-lingual candidate generation. For candidate ranking, we incorporate a trainable cross-encoder model if annotations for the target task are available. We also evaluate cross-encoders trained in a weakly supervised manner based on machine-translated datasets from a high resource domain. Our system is publicly available as an extensible Python toolkit. Results: xMEN improves the state-of-the-art performance across a wide range of multilingual benchmark datasets. Weakly supervised cross-encoders are effective when no training data is available for the target task. Through the compatibility of xMEN with the BigBIO framework, it can be easily used with existing and prospective datasets. Discussion: Our experiments show the importance of balancing the output of general-purpose candidate generators with subsequent trainable re-rankers, which we achieve through a rank regularization term in the loss function of the cross-encoder. However, error analysis reveals that multi-word expressions and other complex entities are still challenging. Conclusion: xMEN exhibits strong performance for medical entity normalization in multiple languages, even when no labeled data and few terminology aliases for the target language are available. Its configuration system and evaluation modules enable reproducible benchmarks. Models and code are available online at the following URL: https://github.com/hpi-dhc/xmen
    摘要 目的:提高医疗实体标准化在多种语言上的性能,特别是在目标语言的资源少于英语的情况下。 材料和方法:我们介绍 xMEN,一个模块化的跨语言医疗实体标准化系统,在低资源和高资源场景下均表现出色。当目标语言中某一术语的同义词稀缺时,我们通过跨语言候选生成利用英语别名;在候选排序阶段,若目标任务有标注数据,则引入可训练的交叉编码器模型,并评估了基于高资源领域机器翻译数据集、以弱监督方式训练的交叉编码器。我们的系统以可扩展的 Python 工具包形式公开发布。 结果:xMEN 在多种多语言基准数据集上刷新了最先进水平;弱监督交叉编码器在目标任务没有训练数据时同样有效。借助 xMEN 与 BigBIO 框架的兼容性,它可以方便地应用于现有及未来的数据集。 讨论:实验表明,需要在通用候选生成器的输出与后续可训练重排序器之间取得平衡,我们通过在交叉编码器损失函数中加入排序正则项来实现这一点。然而,错误分析显示,多词表达等复杂实体仍然具有挑战性。 结论:xMEN 在多种语言的医疗实体标准化上表现出色,即使目标语言缺乏标注数据且术语别名很少。其配置系统和评估模块支持可复现的基准测试。模型和代码见:https://github.com/hpi-dhc/xmen
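
The sketch below illustrates the generate-then-rerank pattern described above: character n-gram TF-IDF retrieval over (possibly English) terminology aliases produces candidates, and a re-ranker refines the list. The alias table, concept IDs, and the trivial re-ranker are placeholders of ours; xMEN's actual re-ranker is a trainable cross-encoder.

```python
# Candidate generation over terminology aliases + a stub re-ranker.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

aliases = ["myocardial infarction", "heart attack", "diabetes mellitus", "hypertension"]
concept_ids = ["C0027051", "C0027051", "C0011849", "C0020538"]   # made-up mapping

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
alias_matrix = vec.fit_transform(aliases)
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(alias_matrix)

def candidates(mention: str):
    dist, idx = index.kneighbors(vec.transform([mention]))
    return [(concept_ids[i], aliases[i], 1 - d) for i, d in zip(idx[0], dist[0])]

def rerank(mention: str, cands):
    # Placeholder for a cross-encoder score; here we just keep the retrieval order.
    return sorted(cands, key=lambda c: -c[2])

m = "infarction of the myocardium"
print(rerank(m, candidates(m)))
```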

Utilizing Weak Supervision To Generate Indonesian Conservation Dataset

  • paper_url: http://arxiv.org/abs/2310.11258
  • repo_url: None
  • paper_authors: Mega Fransiska, Diah Pitaloka, Saripudin, Satrio Putra, Lintang Sutawika
  • for: 这篇论文目的是构建一个印度尼西亚语言处理 dataset,使用弱监督学习方法生成软标注数据。
  • methods: 该论文使用了labeling函数,创建了多类分类和情感分类的两种类型数据集。
  • results: 使用不同预训练语言模型的基线实验表明,情感分类可达到59.79%的准确率和55.72%的F1分数;多类分类可达到66.87%的macro-F1、71.5%的micro-F1和83.67%的ROC-AUC。
    Abstract Weak supervision has emerged as a promising approach for rapid and large-scale dataset creation in response to the increasing demand for accelerated NLP development. By leveraging labeling functions, weak supervision allows practitioners to generate datasets quickly by creating learned label models that produce soft-labeled datasets. This paper aims to show how such an approach can be utilized to build an Indonesian NLP dataset from conservation news text. We construct two types of datasets: multi-class classification and sentiment classification. We then provide baseline experiments using various pretrained language models. These baseline results demonstrate test performances of 59.79% accuracy and 55.72% F1-score for sentiment classification, 66.87% F1-score-macro, 71.5% F1-score-micro, and 83.67% ROC-AUC for multi-class classification. Additionally, we release the datasets and labeling functions used in this work for further research and exploration.
    摘要 弱监督学习已经成为一种颇具前景的方法,能够快速、大规模地创建数据集,以满足NLP加速发展带来的需求。通过利用标注函数,弱监督允许实践者创建可学习的标签模型,快速生成软标注数据集。本文旨在展示如何利用这种方法,基于保护领域新闻文本构建印尼语NLP数据集。我们构建了两类数据集:多类分类和情感分类。随后,我们使用不同的预训练语言模型提供了基线实验。这些基线结果显示:情感分类的测试准确率为59.79%、F1分数为55.72%;多类分类的macro-F1为66.87%、micro-F1为71.5%、ROC-AUC为83.67%。此外,我们发布了本工作中使用的数据集和标注函数,供进一步研究和探索。
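
To make the labeling-function idea concrete, the sketch below shows two toy keyword-based labeling functions and a simple majority-vote label model. The keywords, labels, and example sentence are made-up Indonesian-flavoured placeholders; the paper's actual labeling functions and label model are not specified here.

```python
# Weak supervision with keyword labeling functions and majority voting.
from collections import Counter

ABSTAIN, POSITIVE, NEGATIVE = -1, 1, 0

def lf_positive(text):   # fires on conservation-friendly wording
    return POSITIVE if any(w in text.lower() for w in ("lindung", "lestari")) else ABSTAIN

def lf_negative(text):   # fires on wording about damage / poaching
    return NEGATIVE if any(w in text.lower() for w in ("rusak", "buru liar")) else ABSTAIN

LABELING_FUNCTIONS = [lf_positive, lf_negative]

def soft_label(text):
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]   # majority vote over non-abstaining LFs

print(soft_label("Kawasan hutan lindung dijaga agar tetap lestari"))
```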

CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

  • paper_url: http://arxiv.org/abs/2310.11248
  • repo_url: https://github.com/amazon-science/cceval
  • paper_authors: Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, Bing Xiang
  • for: 这篇论文旨在评估代码补全模型的能力,并提供一个多文件、多语言的代码补全基准(CrossCodeEval),以检验代码补全模型在真实软件开发场景中的表现。
  • methods: 论文提出一种简单而高效的静态分析方法,在四种流行编程语言(Python、Java、TypeScript、C#)的真实开源仓库中构建严格依赖跨文件上下文的示例,以模拟实际软件开发中的文件间依赖关系。
  • results: 实验结果显示,CrossCodeEval 极具挑战性:在缺乏跨文件上下文时,代码补全模型表现很差,而将跨文件上下文加入提示可大幅提升性能,但即使最强模型也远未达到上限。此外,论文还评估了多种获取跨文件上下文的方法,并表明 CrossCodeEval 也可用于评估代码检索器的能力。
    Abstract Code completion models have made significant progress in recent years, yet current popular evaluation datasets, such as HumanEval and MBPP, predominantly focus on code completion tasks within a single file. This over-simplified setting falls short of representing the real-world software development scenario where repositories span multiple files with numerous cross-file dependencies, and accessing and understanding cross-file context is often required to complete the code correctly. To fill in this gap, we propose CrossCodeEval, a diverse and multilingual code completion benchmark that necessitates an in-depth cross-file contextual understanding to complete the code accurately. CrossCodeEval is built on a diverse set of real-world, open-sourced, permissively-licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#. To create examples that strictly require cross-file context for accurate completion, we propose a straightforward yet efficient static-analysis-based approach to pinpoint the use of cross-file context within the current file. Extensive experiments on state-of-the-art code language models like CodeGen and StarCoder demonstrate that CrossCodeEval is extremely challenging when the relevant cross-file context is absent, and we see clear improvements when adding these context into the prompt. However, despite such improvements, the pinnacle of performance remains notably unattained even with the highest-performing model, indicating that CrossCodeEval is also capable of assessing model's capability in leveraging extensive context to make better code completion. Finally, we benchmarked various methods in retrieving cross-file context, and show that CrossCodeEval can also be used to measure the capability of code retrievers.
    摘要 代码补全模型在近几年取得了显著进步,但当前流行的评估数据集(如 HumanEval 和 MBPP)主要集中在单个文件内的代码补全任务上。这种过于简化的设定无法反映真实世界的软件开发场景:代码仓库往往横跨多个文件、存在大量跨文件依赖,而正确补全代码常常需要获取并理解跨文件上下文。为了填补这一空白,我们提出了 CrossCodeEval,一个多样化、多语言的代码补全基准,要求模型对跨文件上下文有深入理解才能准确补全代码。CrossCodeEval 构建于 Python、Java、TypeScript 和 C# 四种流行编程语言的真实、开源、许可宽松的代码仓库之上。为了构建严格依赖跨文件上下文的示例,我们提出了一种简单而高效的基于静态分析的方法,用于定位当前文件中对跨文件上下文的使用。在 CodeGen 和 StarCoder 等最先进代码语言模型上的大量实验表明,当缺乏相关跨文件上下文时 CrossCodeEval 极具挑战性;而将这些上下文加入提示后,性能有明显提升。尽管如此,即使是表现最好的模型也远未达到性能上限,说明 CrossCodeEval 还能够评估模型利用大量上下文改进代码补全的能力。最后,我们对多种跨文件上下文检索方法进行了基准测试,表明 CrossCodeEval 也可用于衡量代码检索器的能力。

Entity Matching using Large Language Models

  • paper_url: http://arxiv.org/abs/2310.11244
  • repo_url: https://github.com/wbsg-uni-mannheim/matchgpt
  • paper_authors: Ralph Peeters, Christian Bizer
  • for: The paper investigates the use of large language models (LLMs) for entity matching as an alternative to pre-trained language models (PLMs) such as BERT and RoBERTa.
  • methods: The paper investigates the use of hosted LLMs such as GPT3.5 and GPT4, as well as open source LLMs based on Llama2, for entity matching. The authors evaluate these models in both zero-shot and task-specific scenarios, and compare different prompt designs and fine-tuning strategies.
  • results: The paper shows that GPT4 outperforms fine-tuned PLMs (RoBERTa and Ditto) on three out of five benchmark datasets, reaching F1 scores around 90%. The authors also find that in-context learning and rule generation can improve the performance of other models, but GPT4 does not need such additional guidance in most cases.
    Abstract Entity Matching is the task of deciding whether two entity descriptions refer to the same real-world entity. Entity Matching is a central step in most data integration pipelines and an enabler for many e-commerce applications which require to match products offers from different vendors. State-of-the-art entity matching methods often rely on pre-trained language models (PLMs) such as BERT or RoBERTa. Two major drawbacks of these models for entity matching are that (i) the models require significant amounts of task-specific training data and (ii) the fine-tuned models are not robust concerning out-of-distribution entities. In this paper, we investigate using large language models (LLMs) for entity matching as a less domain-specific training data reliant and more robust alternative to PLM-based matchers. Our study covers hosted LLMs, such as GPT3.5 and GPT4, as well as open source LLMs based on Llama2 which can be run locally. We evaluate these models in a zero-shot scenario as well as a scenario where task-specific training data is available. We compare different prompt designs as well as the prompt sensitivity of the models in the zero-shot scenario. We investigate (i) the selection of in-context demonstrations, (ii) the generation of matching rules, as well as (iii) fine-tuning GPT3.5 in the second scenario using the same pool of training data across the different approaches. Our experiments show that GPT4 without any task-specific training data outperforms fine-tuned PLMs (RoBERTa and Ditto) on three out of five benchmark datasets reaching F1 scores around 90%. The experiments with in-context learning and rule generation show that all models beside of GPT4 benefit from these techniques (on average 5.9% and 2.2% F1), while GPT4 does not need such additional guidance in most cases...
    摘要 实体匹配是判断两个实体描述是否指向同一真实世界实体的任务。实体匹配是大多数数据集成流水线中的核心步骤,也是许多需要匹配不同供应商产品报价的电子商务应用的基础。目前最先进的实体匹配方法通常依赖BERT或RoBERTa等预训练语言模型(PLM)。这类模型用于实体匹配有两个主要缺点:(i)需要大量任务特定的训练数据;(ii)微调后的模型对分布外实体不够鲁棒。在本文中,我们研究将大语言模型(LLM)用于实体匹配,作为对领域训练数据依赖更少、且更加鲁棒的PLM匹配器替代方案。我们的研究涵盖GPT3.5和GPT4等托管LLM,以及可在本地运行的基于Llama2的开源LLM。我们在零样本场景和具有任务特定训练数据的场景下评估这些模型,比较不同的提示设计以及模型在零样本场景下对提示的敏感度。我们还研究了(i)上下文示例的选择、(ii)匹配规则的生成,以及(iii)在第二种场景下使用同一批训练数据微调GPT3.5。实验表明,GPT4在不使用任何任务特定训练数据的情况下,在五个基准数据集中的三个上超越了微调的PLM(RoBERTa和Ditto),F1分数达到约90%。上下文学习和规则生成的实验表明,除GPT4外的所有模型都能从这些技术中受益(平均提升5.9%和2.2%的F1),而GPT4在多数情况下不需要这类额外指导。
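
The sketch below shows what a zero-shot entity-matching prompt of the kind evaluated above might look like. The prompt wording is our own illustration, and `call_llm` is a hypothetical stand-in for whichever hosted or local model is used.

```python
# Zero-shot entity matching via a yes/no prompt to an LLM.
def build_prompt(offer_a: str, offer_b: str) -> str:
    return (
        "Do the following two product offers refer to the same real-world product?\n"
        f"Offer A: {offer_a}\n"
        f"Offer B: {offer_b}\n"
        "Answer with exactly one word: Yes or No."
    )

def match(offer_a: str, offer_b: str, call_llm) -> bool:
    answer = call_llm(build_prompt(offer_a, offer_b))
    return answer.strip().lower().startswith("yes")

# Example with a dummy model that always answers "No":
print(match("DJI Mini 3 Pro drone, grey", "DJI Mavic 3 Classic", lambda prompt: "No"))
```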

Watermarking LLMs with Weight Quantization

  • paper_url: http://arxiv.org/abs/2310.11237
  • repo_url: https://github.com/twilight92z/quantize-watermark
  • paper_authors: Linyang Li, Botian Jiang, Pengyu Wang, Ke Ren, Hang Yan, Xipeng Qiu
  • for: 保护大型语言模型的权重,防止违反开源许可证的恶意使用
  • methods: 在大语言模型的量化过程中植入水印,推理时无需预先定义触发器
  • results: 成功将水印植入开源大语言模型的权重中,包括 GPT-Neo 和 LLaMA
    Abstract Abuse of large language models reveals high risks as large language models are being deployed at an astonishing speed. It is important to protect the model weights to avoid malicious usage that violates licenses of open-source large language models. This paper proposes a novel watermarking strategy that plants watermarks in the quantization process of large language models without pre-defined triggers during inference. The watermark works when the model is used in the fp32 mode and remains hidden when the model is quantized to int8, in this way, the users can only inference the model without further supervised fine-tuning of the model. We successfully plant the watermark into open-source large language model weights including GPT-Neo and LLaMA. We hope our proposed method can provide a potential direction for protecting model weights in the era of large language model applications.
    摘要 随着大语言模型以惊人的速度部署,其滥用暴露出很高的风险。保护模型权重非常重要,可以避免违反开源大语言模型许可证的恶意使用。这篇论文提出一种新的水印策略:在大语言模型的量化过程中植入水印,推理时无需预先定义触发器。该水印在模型以fp32模式使用时生效,而在模型量化为int8后保持隐藏;这样,用户只能在不对模型进行进一步有监督微调的情况下使用模型推理。我们成功将水印植入了包括 GPT-Neo 和 LLaMA 在内的开源大语言模型权重中。我们希望所提出的方法能为大语言模型应用时代的模型权重保护提供一个可行的方向。
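
The toy sketch below only illustrates the general idea that fp32 weights can carry information that the int8 view cannot see: each weight is nudged inside its quantization rounding bin so the fp32 residual encodes a bit while `round(w / scale)` is unchanged. This embedding mechanism is an assumption of ours for illustration; it is not the paper's actual watermarking scheme.

```python
# Bits hidden in fp32 residuals that vanish under symmetric int8 quantization.
import numpy as np

def embed_bits(weights: np.ndarray, bits, scale: float, eps: float = 0.1):
    w = weights.copy()
    q = np.round(w / scale)
    for i, b in enumerate(bits):
        # Place w slightly above (bit=1) or below (bit=0) the bin centre q*scale.
        offset = eps * scale if b else -eps * scale
        w.flat[i] = q.flat[i] * scale + offset      # still rounds back to the same q
    return w

def read_bits(weights: np.ndarray, n: int, scale: float):
    q = np.round(weights / scale)
    residual = weights - q * scale
    return [int(residual.flat[i] > 0) for i in range(n)]

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(8, 8)).astype(np.float32)
s = float(np.abs(w).max() / 127)                    # symmetric int8 scale
marked = embed_bits(w, [1, 0, 1, 1], s)
assert np.array_equal(np.round(marked / s), np.round(w / s))   # int8 view unchanged
print(read_bits(marked, 4, s))                      # watermark readable from fp32 weights
```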

KG-GPT: A General Framework for Reasoning on Knowledge Graphs Using Large Language Models

  • paper_url: http://arxiv.org/abs/2310.11220
  • repo_url: https://github.com/jiho283/kg-gpt
  • paper_authors: Jiho Kim, Yeonsu Kwon, Yohan Jo, Edward Choi
  • for: 本文旨在将大语言模型(LLM)应用到知识图(KG)上,完成复杂的推理任务。
  • methods: 本文提出了名为 KG-GPT 的多用途框架,包含三个步骤:句子分割、图检索和推理,分别用于拆分句子、检索相关的图组件并推导逻辑结论。
  • results: 本文在基于知识图的事实验证和知识图问答(KGQA)基准上进行评估,发现 KG-GPT 表现出有竞争力且稳健的性能,甚至超过了一些全监督模型。
    Abstract While large language models (LLMs) have made considerable advancements in understanding and generating unstructured text, their application in structured data remains underexplored. Particularly, using LLMs for complex reasoning tasks on knowledge graphs (KGs) remains largely untouched. To address this, we propose KG-GPT, a multi-purpose framework leveraging LLMs for tasks employing KGs. KG-GPT comprises three steps: Sentence Segmentation, Graph Retrieval, and Inference, each aimed at partitioning sentences, retrieving relevant graph components, and deriving logical conclusions, respectively. We evaluate KG-GPT using KG-based fact verification and KGQA benchmarks, with the model showing competitive and robust performance, even outperforming several fully-supervised models. Our work, therefore, marks a significant step in unifying structured and unstructured data processing within the realm of LLMs.
    摘要 大型语言模型(LLM)在非结构化文本的理解和生成方面已取得长足进步,但它们在结构化数据上的应用仍鲜有探索。尤其是使用 LLM 在知识图(KG)上完成复杂推理任务,基本上仍是空白。为解决这个问题,我们提出了 KG-GPT,一个利用 LLM 完成基于 KG 任务的多用途框架。KG-GPT 包括三个步骤:句子分割、图检索和推理,分别用于拆分句子、检索相关的图组件并推导逻辑结论。我们在基于知识图的事实验证和知识图问答(KGQA)基准上进行评估,KG-GPT 表现出有竞争力且稳健的性能,甚至超过了一些全监督模型。因此,我们的工作是在 LLM 框架下统一结构化与非结构化数据处理的重要一步。

Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations

  • paper_url: http://arxiv.org/abs/2310.11207
  • repo_url: None
  • paper_authors: Shiyuan Huang, Siddarth Mamidanna, Shreedhar Jangam, Yilun Zhou, Leilani H. Gilpin
  • for: 这篇论文研究了自动生成的自解释(self-explanations)在情感分析任务中的效果,以及如何用这些自解释来解释模型的决策。
  • methods: 该论文使用了ChatGPT大语言模型进行实验,并研究了不同的自解释抽象方法和评价指标。
  • results: 研究发现,ChatGPT自动生成的自解释在各项评价指标上与传统解释方法(如 occlusion 或 LIME 显著性图)效果相当,但在多种一致性指标下与后者差异明显,且生成成本低得多。此外,研究还发现了自解释的一些有趣特征,促使我们重新思考当前的模型可解释性实践。
    Abstract Large language models (LLMs) such as ChatGPT have demonstrated superior performance on a variety of natural language processing (NLP) tasks including sentiment analysis, mathematical reasoning and summarization. Furthermore, since these models are instruction-tuned on human conversations to produce "helpful" responses, they can and often will produce explanations along with the response, which we call self-explanations. For example, when analyzing the sentiment of a movie review, the model may output not only the positivity of the sentiment, but also an explanation (e.g., by listing the sentiment-laden words such as "fantastic" and "memorable" in the review). How good are these automatically generated self-explanations? In this paper, we investigate this question on the task of sentiment analysis and for feature attribution explanation, one of the most commonly studied settings in the interpretability literature (for pre-ChatGPT models). Specifically, we study different ways to elicit the self-explanations, evaluate their faithfulness on a set of evaluation metrics, and compare them to traditional explanation methods such as occlusion or LIME saliency maps. Through an extensive set of experiments, we find that ChatGPT's self-explanations perform on par with traditional ones, but are quite different from them according to various agreement metrics, meanwhile being much cheaper to produce (as they are generated along with the prediction). In addition, we identified several interesting characteristics of them, which prompt us to rethink many current model interpretability practices in the era of ChatGPT(-like) LLMs.
    摘要 ChatGPT 等大型语言模型(LLM)在情感分析、数学推理和摘要等多种自然语言处理(NLP)任务中表现出色。此外,由于这些模型经过人类对话的指令微调以产生"有帮助"的回答,它们通常会在给出回答的同时生成解释,我们称之为自解释。例如,在分析电影评论的情感时,模型不仅会输出情感的正负,还可能给出解释(例如列出评论中带有情感色彩的词语,如"fantastic"和"memorable")。这些自动生成的自解释质量如何?本文在情感分析任务上研究这一问题,并聚焦于特征归因解释——可解释性文献中研究最多的设定之一(针对 ChatGPT 之前的模型)。具体而言,我们研究了多种获取自解释的方式,在一组评价指标上评估其忠实度,并与 occlusion、LIME 显著性图等传统解释方法进行比较。大量实验表明,ChatGPT 的自解释与传统方法效果相当,但按多种一致性指标衡量两者差异明显,同时自解释的生成成本低得多(因为它们与预测一同生成)。此外,我们还发现了自解释的若干有趣特征,促使我们在 ChatGPT 这类 LLM 的时代重新思考许多现有的模型可解释性实践。

ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing

  • paper_url: http://arxiv.org/abs/2310.11166
  • repo_url: https://github.com/uitnlp/ViSoBERT
  • paper_authors: Quoc-Nam Nguyen, Thang Chau Phan, Duc-Vu Nguyen, Kiet Van Nguyen
  • for: 这篇论文面向研究用途,介绍一种针对越南语社交媒体文本的新预训练语言模型。
  • methods: 基于XLM-R架构,在大规模、高质量且多样化的越南语社交媒体文本语料上进行预训练。
  • results: 所提出的ViSoBERT在多个越南语社交媒体任务上超越了此前的最先进模型,包括情感识别、仇恨言论检测、情感分析、垃圾评论检测和仇恨言论片段检测。
    Abstract English and Chinese, known as resource-rich languages, have witnessed the strong development of transformer-based language models for natural language processing tasks. Although Vietnam has approximately 100M people speaking Vietnamese, several pre-trained models, e.g., PhoBERT, ViBERT, and vELECTRA, performed well on general Vietnamese NLP tasks, including POS tagging and named entity recognition. These pre-trained language models are still limited to Vietnamese social media tasks. In this paper, we present the first monolingual pre-trained language model for Vietnamese social media texts, ViSoBERT, which is pre-trained on a large-scale corpus of high-quality and diverse Vietnamese social media texts using XLM-R architecture. Moreover, we explored our pre-trained model on five important natural language downstream tasks on Vietnamese social media texts: emotion recognition, hate speech detection, sentiment analysis, spam reviews detection, and hate speech spans detection. Our experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese social media tasks. Our ViSoBERT model is available only for research purposes.
    摘要 英语和汉语这两种资源丰富的语言,见证了基于transformer的语言模型在自然语言处理任务上的强劲发展。虽然越南语的使用者约有1亿人,但现有的若干预训练模型(如 PhoBERT、ViBERT 和 vELECTRA)主要在通用越南语NLP任务上表现良好,包括词性标注和命名实体识别,对越南语社交媒体任务的支持仍然有限。在这篇论文中,我们提出了第一个面向越南语社交媒体文本的单语预训练语言模型ViSoBERT,它基于XLM-R架构,在大规模、高质量且多样化的越南语社交媒体文本语料上进行预训练。此外,我们在五个重要的越南语社交媒体下游任务上考察了该模型:情感识别、仇恨言论检测、情感分析、垃圾评论检测和仇恨言论片段检测。实验结果表明,参数量少得多的ViSoBERT在多个越南语社交媒体任务上超越了此前的最先进模型。我们的ViSoBERT模型仅供研究用途使用。

IMTLab: An Open-Source Platform for Building, Evaluating, and Diagnosing Interactive Machine Translation Systems

  • paper_url: http://arxiv.org/abs/2310.11163
  • repo_url: https://github.com/xuuhuang/imtlab
  • paper_authors: Xu Huang, Zhirui Zhang, Ruize Gao, Yichao Du, Lemao Liu, Gouping Huang, Shuming Shi, Jiajun Chen, Shujian Huang
  • for: 这篇论文旨在提供一个开源的端到端交互式机器翻译(IMT)系统平台,帮助研究人员快速构建基于最新模型的 IMT 系统、进行端到端评估,并诊断系统的弱点。
  • methods: 论文采用人在回路(human-in-the-loop)的设定,将整个交互翻译过程视为一个任务导向的对话,显式纳入人类干预以产生高质量、无错误的翻译。为此,作者设计了一个通用的通信接口,以支持灵活的 IMT 架构和用户策略。
  • results: 作者通过模拟和真实交互实验表明,前缀约束解码方法在端到端评估中的编辑成本依然最低,而 BiTIIMT 在编辑成本上与之相当,且交互体验更好。
    Abstract We present IMTLab, an open-source end-to-end interactive machine translation (IMT) system platform that enables researchers to quickly build IMT systems with state-of-the-art models, perform an end-to-end evaluation, and diagnose the weakness of systems. IMTLab treats the whole interactive translation process as a task-oriented dialogue with a human-in-the-loop setting, in which human interventions can be explicitly incorporated to produce high-quality, error-free translations. To this end, a general communication interface is designed to support the flexible IMT architectures and user policies. Based on the proposed design, we construct a simulated and real interactive environment to achieve end-to-end evaluation and leverage the framework to systematically evaluate previous IMT systems. Our simulated and manual experiments show that the prefix-constrained decoding approach still gains the lowest editing cost in the end-to-end evaluation, while BiTIIMT achieves comparable editing cost with a better interactive experience.
    摘要 我们介绍IMTLab,一个开源的端到端交互式机器翻译(IMT)系统平台,使研究人员能够基于最新模型快速构建IMT系统、进行端到端评估,并诊断系统的弱点。IMTLab将整个交互翻译过程视为一个人在回路设定下的任务导向对话,可以显式纳入人类干预,以产生高质量、无错误的翻译。为此,我们设计了一个通用的通信接口,支持灵活的IMT架构和用户策略。基于这一设计,我们构建了模拟和真实的交互环境以实现端到端评估,并利用该框架对以往的IMT系统进行系统性评估。模拟和人工实验表明,前缀约束解码方法在端到端评估中的编辑成本依然最低,而BiTIIMT的编辑成本与之相当,且交互体验更好。

Probing the Creativity of Large Language Models: Can models produce divergent semantic association?

  • paper_url: http://arxiv.org/abs/2310.11158
  • repo_url: https://github.com/dingnlab/probing_creativity
  • paper_authors: Honghua Chen, Nai Ding
  • for: investigate the creative thinking of large language models through a cognitive perspective
  • methods: utilize the divergent association task (DAT) to measure the models’ creativity, compare results across different models and decoding strategies
  • results: with greedy search, GPT-4 outperforms 96% of humans on the DAT; stochastic sampling and temperature scaling raise DAT scores for models other than GPT-4, but with a trade-off between creativity and stability.
    Abstract Large language models possess remarkable capacity for processing language, but it remains unclear whether these models can further generate creative content. The present study aims to investigate the creative thinking of large language models through a cognitive perspective. We utilize the divergent association task (DAT), an objective measurement of creativity that asks models to generate unrelated words and calculates the semantic distance between them. We compare the results across different models and decoding strategies. Our findings indicate that: (1) When using the greedy search strategy, GPT-4 outperforms 96% of humans, while GPT-3.5-turbo exceeds the average human level. (2) Stochastic sampling and temperature scaling are effective to obtain higher DAT scores for models except GPT-4, but face a trade-off between creativity and stability. These results imply that advanced large language models have divergent semantic associations, which is a fundamental process underlying creativity.
    摘要 大型语言模型拥有惊人的语言处理能力,但它们能否进一步生成有创造性的内容仍不清楚。本研究旨在从认知视角考察大型语言模型的创造性思维。我们使用发散联想任务(DAT)——一种客观的创造力测量方法,要求模型生成互不相关的词语,并计算它们之间的语义距离。我们比较了不同模型和不同解码策略下的结果。我们的发现是:(1)在使用贪心搜索策略时,GPT-4的表现超过96%的人类,而GPT-3.5-turbo超过了人类平均水平;(2)对除GPT-4之外的模型,随机采样和温度缩放能够有效提高DAT分数,但面临创造性与稳定性之间的权衡。这些结果表明,先进的大型语言模型具有发散的语义联想能力,而这是创造力的一个基础过程。
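
A DAT score of the kind described above is typically the average pairwise semantic distance between the generated words, scaled to 0-100. The sketch below computes it with a random embedding table standing in for real word vectors (the original metric uses GloVe-style embeddings; the exact setup in the paper is not reproduced here).

```python
# Divergent Association Task (DAT) score: mean pairwise cosine distance * 100.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
fake_embeddings = {w: rng.normal(size=50) for w in
                   ["cat", "galaxy", "violin", "algorithm", "pepper", "umbrella", "glacier"]}

def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dat_score(words, embeddings=fake_embeddings):
    vecs = [embeddings[w] for w in words if w in embeddings]
    pairs = list(combinations(vecs, 2))
    return 100 * float(np.mean([cosine_distance(a, b) for a, b in pairs]))

print(round(dat_score(["cat", "galaxy", "violin", "algorithm", "pepper", "umbrella", "glacier"]), 1))
```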

The Quo Vadis of the Relationship between Language and Large Language Models

  • paper_url: http://arxiv.org/abs/2310.11146
  • repo_url: None
  • paper_authors: Evelina Leivada, Vittoria Dentella, Elliot Murphy
  • for: 本研究旨在探讨现代自然语言处理(NLP)活动中使用大语言模型(LLMs)的问题。
  • methods: 本研究采用了理论和实验方法来检验LLMs是否能够提供有用的语言解释。
  • results: 研究发现,在目前的发展阶段,LLM几乎无法为语言提供任何解释;论文并展望了更具信息量的未来研究方向。
    Abstract In the field of Artificial (General) Intelligence (AI), the several recent advancements in Natural language processing (NLP) activities relying on Large Language Models (LLMs) have come to encourage the adoption of LLMs as scientific models of language. While the terminology employed for the characterization of LLMs favors their embracing as such, it is not clear that they are in a place to offer insights into the target system they seek to represent. After identifying the most important theoretical and empirical risks brought about by the adoption of scientific models that lack transparency, we discuss LLMs relating them to every scientific model's fundamental components: the object, the medium, the meaning and the user. We conclude that, at their current stage of development, LLMs hardly offer any explanations for language, and then we provide an outlook for more informative future research directions on this topic.
    摘要 在人工(通用)智能领域,依托大语言模型(LLM)的自然语言处理(NLP)近期取得的多项进展,促使人们将LLM作为语言的科学模型加以采纳。尽管用于刻画LLM的术语倾向于支持这种采纳,但LLM是否真的能够为其所要表示的目标系统提供洞见,仍不清楚。我们首先指出了采纳缺乏透明度的科学模型所带来的最重要的理论与实证风险,然后围绕科学模型的基本组成要素——对象、媒介、意义和使用者——对LLM进行讨论。我们的结论是:在当前的发展阶段,LLM几乎无法为语言提供任何解释;随后我们展望了这一主题上更具信息量的未来研究方向。

Experimenting AI Technologies for Disinformation Combat: the IDMO Project

  • paper_url: http://arxiv.org/abs/2310.11097
  • repo_url: None
  • paper_authors: Lorenzo Canale, Alberto Messina
  • for: 这篇论文的主要目的是为了对防假新闻和假信息技术进行贡献。
  • methods: 这篇论文的贡献包括:(i)创建了用于测试相关技术的新数据集;(ii)开发了对 Pagella Politica 判定结果进行分类的自动模型,以便开展更广泛的分析;(iii)构建了在 FEVER 数据集上准确率极高的文本蕴含识别自动模型;(iv)评估了使用 GPT-4 识别文本蕴含的效果;(v)开发了一款在全国性活动中提升公众对虚假新闻认识的游戏。
  • results: 结果表明,所创建的数据集和模型有助于更好地分析虚假信息,其中文本蕴含模型在 FEVER 数据集上取得了极高的准确率。
    Abstract The Italian Digital Media Observatory (IDMO) project, part of a European initiative, focuses on countering disinformation and fake news. This report outlines contributions from Rai-CRITS to the project, including: (i) the creation of novel datasets for testing technologies (ii) development of an automatic model for categorizing Pagella Politica verdicts to facilitate broader analysis (iii) creation of an automatic model for recognizing textual entailment with exceptional accuracy on the FEVER dataset (iv) assessment using GPT-4 to identify textual entailmen (v) a game to raise awareness about fake news at national events.
    摘要 意大利数字媒体观察站(IDMO)项目是欧洲倡议的一部分,旨在反制虚假信息和假新闻。本报告介绍了 Rai-CRITS 在该项目中的贡献,包括:(i)创建了用于测试相关技术的新数据集;(ii)开发了对 Pagella Politica 判定结果进行分类的自动模型,以便开展更广泛的分析;(iii)构建了在 FEVER 数据集上准确率极高的文本蕴含识别自动模型;(iv)评估了使用 GPT-4 识别文本蕴含的效果;(v)开发了一款在全国性活动中提升公众对虚假新闻认识的游戏。

Understanding writing style in social media with a supervised contrastively pre-trained transformer

  • paper_url: http://arxiv.org/abs/2310.11081
  • repo_url: None
  • paper_authors: Javier Huertas-Tato, Alejandro Martin, David Camacho
  • for: 本研究旨在通过将内容与其作者关联,理解在线社交媒体上的有害行为,包括仇恨言论和虚假信息的传播。
  • methods: 本研究提出 Style Transformer for Authorship Representations(STAR)模型,在来自公开来源、由7万名不同作者撰写的约450万条文本语料上,使用监督对比损失学习作者的风格特征。
  • results: 研究表明,STAR 模型在零样本设定下即可在 PAN 归属与聚类挑战中取得有竞争力的表现,并以该模型作为嵌入编码器、仅加一个全连接层即可在 PAN 验证挑战中取得可观结果。此外,在 Reddit 测试集上,使用由8篇512个token文档组成的支持集,可以在最多1616名作者的集合中以至少80%的准确率识别作者。
    Abstract Online Social Networks serve as fertile ground for harmful behavior, ranging from hate speech to the dissemination of disinformation. Malicious actors now have unprecedented freedom to misbehave, leading to severe societal unrest and dire consequences, as exemplified by events such as the Capitol assault during the US presidential election and the Antivaxx movement during the COVID-19 pandemic. Understanding online language has become more pressing than ever. While existing works predominantly focus on content analysis, we aim to shift the focus towards understanding harmful behaviors by relating content to their respective authors. Numerous novel approaches attempt to learn the stylistic features of authors in texts, but many of these approaches are constrained by small datasets or sub-optimal training losses. To overcome these limitations, we introduce the Style Transformer for Authorship Representations (STAR), trained on a large corpus derived from public sources of 4.5 x 10^6 authored texts involving 70k heterogeneous authors. Our model leverages Supervised Contrastive Loss to teach the model to minimize the distance between texts authored by the same individual. This author pretext pre-training task yields competitive performance at zero-shot with PAN challenges on attribution and clustering. Additionally, we attain promising results on PAN verification challenges using a single dense layer, with our model serving as an embedding encoder. Finally, we present results from our test partition on Reddit. Using a support base of 8 documents of 512 tokens, we can discern authors from sets of up to 1616 authors with at least 80\% accuracy. We share our pre-trained model at huggingface (https://huggingface.co/AIDA-UPM/star) and our code is available at (https://github.com/jahuerta92/star)
    摘要 在线社交网络是各种有害行为滋生的温床,从仇恨言论到虚假信息的传播不一而足。恶意行为者如今拥有前所未有的作恶自由,导致严重的社会动荡和恶劣后果,美国总统选举期间的国会山袭击和新冠疫情期间的反疫苗运动都是例证。理解网络语言因而变得比以往任何时候都更加紧迫。现有工作主要侧重内容分析,而我们希望把注意力转向通过将内容与其作者关联来理解有害行为。许多新方法尝试从文本中学习作者的风格特征,但其中不少受限于小规模数据集或欠佳的训练损失。为克服这些限制,我们提出了 Style Transformer for Authorship Representations(STAR),它在来自公开来源、由7万名不同作者撰写的450万条文本语料上训练。我们的模型利用监督对比损失,使模型最小化同一作者所写文本之间的距离。这一作者预训练任务使模型在零样本设定下即可在 PAN 归属与聚类挑战中取得有竞争力的表现。此外,以我们的模型作为嵌入编码器、仅加一个全连接层,我们在 PAN 验证挑战中也取得了可观的结果。最后,我们给出了在 Reddit 测试划分上的结果:使用由8篇512个token文档组成的支持集,我们可以在最多1616名作者的集合中以至少80%的准确率识别作者。我们在 huggingface 上分享了预训练模型(https://huggingface.co/AIDA-UPM/star),代码见(https://github.com/jahuerta92/star)。
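
The sketch below shows a standard supervised contrastive loss over author labels, the general kind of objective described above: texts by the same author are pulled together, texts by different authors pushed apart. The temperature, batch layout, and loss details are our own assumptions, not necessarily STAR's exact formulation.

```python
# Supervised contrastive loss over author labels (minimal PyTorch sketch).
import torch
import torch.nn.functional as F

def sup_con_loss(embeddings: torch.Tensor, labels: torch.Tensor, tau: float = 0.1):
    """embeddings: (batch, dim) text representations; labels: (batch,) author ids."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / tau                                   # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))       # never contrast with self
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)   # keep only positive pairs
    n_pos = pos_mask.sum(dim=1).clamp(min=1)
    loss = -pos_log_prob.sum(dim=1) / n_pos               # mean log-prob of positives
    return loss.mean()

emb = torch.randn(6, 128)
authors = torch.tensor([0, 0, 1, 1, 2, 2])
print(sup_con_loss(emb, authors).item())
```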

VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System

  • paper_url: http://arxiv.org/abs/2310.11069
  • repo_url: None
  • paper_authors: Abdul Waheed, Bashar Talafha, Peter Sullivan, AbdelRahim Elmadany, Muhammad Abdul-Mageed
  • for: 这篇论文旨在开发一个同时支持阿拉伯语方言识别(DID)和阿拉伯语自动语音识别(ASR)的系统,为阿拉伯语研究提供可靠的工具。
  • methods: 论文在有监督设定下训练了多种模型,包括用于DID的HuBERT,以及用于ASR的Whisper和XLS-R。
  • results: 系统可识别17种阿拉伯语方言及现代标准阿拉伯语(MSA);ASR模型在MSA、埃及方言、摩洛哥方言和混合数据上进行了微调。对其余方言,系统提供了Whisper和MMS等多种模型在零样本设定下的选择。
    Abstract Arabic is a complex language with many varieties and dialects spoken by over 450 millions all around the world. Due to the linguistic diversity and variations, it is challenging to build a robust and generalized ASR system for Arabic. In this work, we address this gap by developing and demoing a system, dubbed VoxArabica, for dialect identification (DID) as well as automatic speech recognition (ASR) of Arabic. We train a wide range of models such as HuBERT (DID), Whisper, and XLS-R (ASR) in a supervised setting for Arabic DID and ASR tasks. Our DID models are trained to identify 17 different dialects in addition to MSA. We finetune our ASR models on MSA, Egyptian, Moroccan, and mixed data. Additionally, for the remaining dialects in ASR, we provide the option to choose various models such as Whisper and MMS in a zero-shot setting. We integrate these models into a single web interface with diverse features such as audio recording, file upload, model selection, and the option to raise flags for incorrect outputs. Overall, we believe VoxArabica will be useful for a wide range of audiences concerned with Arabic research. Our system is currently running at https://cdce-206-12-100-168.ngrok.io/.
    摘要 阿拉伯语是一种复杂的语言,拥有众多变体和方言,全球约有4.5亿人使用。由于语言的多样性和差异,构建一个鲁棒且具有泛化能力的阿拉伯语自动语音识别(ASR)系统颇具挑战。在这项工作中,我们开发并演示了名为VoxArabica的系统,用于阿拉伯语的方言识别(DID)和自动语音识别(ASR)。我们在有监督设定下训练了多种模型,包括HuBERT(DID)、Whisper和XLS-R(ASR)。我们的DID模型可以识别除现代标准阿拉伯语(MSA)之外的17种方言。我们在MSA、埃及方言、摩洛哥方言和混合数据上微调了ASR模型。此外,对于其余的方言,我们提供了Whisper和MMS等多种模型在零样本设定下的选择。我们将这些模型集成到一个统一的网页界面中,提供音频录制、文件上传、模型选择以及对错误输出进行标记等多种功能。总之,我们相信VoxArabica将对关注阿拉伯语研究的广大受众有所助益。我们的系统目前运行于 https://cdce-206-12-100-168.ngrok.io/。

Lyricist-Singer Entropy Affects Lyric-Lyricist Classification Performance

  • paper_url: http://arxiv.org/abs/2310.11035
  • repo_url: None
  • paper_authors: Mitsuki Morita, Masato Kikuchi, Tadachika Ozono
  • for: 这个研究旨在探讨歌词作家的特点,以便于音乐应用程序中使用。
  • methods: 研究考虑了一种从歌词中提取特征来表示作词人特点的方法,并按作词人-歌手熵将作词人分组,比较各组的歌词-作词人分类性能。
  • results: 研究发现,作词人与歌手之间的关系会影响歌词-作词人分类性能:作词人所合作歌手的多样性(熵)越低,分类性能越好,熵最低的组取得了最高的F1分数。
    Abstract Although lyrics represent an essential component of music, few music information processing studies have been conducted on the characteristics of lyricists. Because these characteristics may be valuable for musical applications, such as recommendations, they warrant further study. We considered a potential method that extracts features representing the characteristics of lyricists from lyrics. Because these features must be identified prior to extraction, we focused on lyricists with easily identifiable features. We believe that it is desirable for singers to perform unique songs that share certain characteristics specific to the singer. Accordingly, we hypothesized that lyricists account for the unique characteristics of the singers they write lyrics for. In other words, lyric-lyricist classification performance or the ease of capturing the features of a lyricist from the lyrics may depend on the variety of singers. In this study, we observed a relationship between lyricist-singer entropy or the variety of singers associated with a single lyricist and lyric-lyricist classification performance. As an example, the lyricist-singer entropy is minimal when the lyricist writes lyrics for only one singer. In our experiments, we grouped lyricists among five groups in terms of lyricist-singer entropy and assessed the lyric-lyricist classification performance within each group. Consequently, the best F1 score was obtained for the group with the lowest lyricist-singer entropy. Our results suggest that further analyses of the features contributing to lyric-lyricist classification performance on the lowest lyricist-singer entropy group may improve the feature extraction task for lyricists.
    摘要 尽管歌词是音乐的重要组成部分,但针对作词人特点的音乐信息处理研究很少。由于这些特点可能对推荐等音乐应用有价值,因此值得进一步研究。我们考虑了一种从歌词中提取能够代表作词人特点的特征的方法。由于这些特征必须先可识别才能被提取,我们聚焦于特点较易识别的作词人。我们认为,歌手演唱的歌曲最好具有该歌手特有的某些共同特点,据此我们假设:作词人会照顾其所供词歌手的独特特点。换言之,歌词-作词人分类的性能,或者说从歌词中捕捉作词人特征的难易程度,可能取决于其合作歌手的多样性。本研究观察了作词人-歌手熵(即单个作词人所合作歌手的多样性)与歌词-作词人分类性能之间的关系;例如,当作词人只为一位歌手写词时,作词人-歌手熵最小。实验中,我们按作词人-歌手熵将作词人分为五组,并评估各组内的歌词-作词人分类性能。结果显示,熵最低的组取得了最高的F1分数。这一结果表明,进一步分析在最低熵组中对分类性能有贡献的特征,可能有助于改进作词人特征提取任务。
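
The lyricist-singer entropy used for grouping is simply the Shannon entropy of the distribution of singers a lyricist has written for. The sketch below computes it; the singer lists are made-up examples.

```python
# Lyricist-singer entropy: Shannon entropy of a lyricist's singer distribution.
import math
from collections import Counter

def lyricist_singer_entropy(singers):
    counts = Counter(singers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(lyricist_singer_entropy(["A", "A", "A"]))        # 0.0  (writes for one singer)
print(lyricist_singer_entropy(["A", "B", "C", "D"]))   # 2.0  (spread over many singers)
```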

Exploring Automatic Evaluation Methods based on a Decoder-based LLM for Text Generation

  • paper_url: http://arxiv.org/abs/2310.11026
  • repo_url: None
  • paper_authors: Tomohito Kasahara, Daisuke Kawahara
  • for: 本文研究基于解码器架构语言模型的文本生成自动评估方法。
  • methods: 本文在相同条件下比较了多种方法,包括微调基于编码器的模型和微调大语言模型,并在日语和英语两种语言的两类任务上评估:机器翻译评估和语义文本相似度。
  • results: 实验结果显示,与微调的编码器模型相比,微调的解码器模型表现不佳,原因可能是解码器模型关注表面词序而未捕捉语义。此外,对ChatGPT等超大规模解码器模型进行上下文学习,也难以识别细微的语义差异。
    Abstract Automatic evaluation of text generation is essential for improving the accuracy of generation tasks. In light of the current trend towards increasingly larger decoder-based language models, we investigate automatic evaluation methods based on such models for text generation. This paper compares various methods, including tuning with encoder-based models and large language models under equal conditions, on two different tasks, machine translation evaluation and semantic textual similarity, in two languages, Japanese and English. Experimental results show that compared to the tuned encoder-based models, the tuned decoder-based models perform poorly. The analysis of the causes for this suggests that the decoder-based models focus on surface word sequences and do not capture meaning. It is also revealed that in-context learning of very large decoder-based models such as ChatGPT makes it difficult to identify fine-grained semantic differences.
    摘要 文本生成的自动评估对于提高生成任务的准确性至关重要。鉴于当前解码器架构语言模型规模不断增大的趋势,我们研究基于这类模型的文本生成自动评估方法。本文在相同条件下比较了多种方法,包括微调基于编码器的模型和微调大语言模型,在日语和英语两种语言的两类任务上进行评估:机器翻译评估和语义文本相似度。实验结果显示,与微调的编码器模型相比,微调的解码器模型表现不佳。对其原因的分析表明,解码器模型侧重表面词序而未能捕捉语义。此外,研究还发现,对ChatGPT等超大规模解码器模型进行上下文学习,也难以识别细微的语义差异。

Reading Order Matters: Information Extraction from Visually-rich Documents by Token Path Prediction

  • paper_url: http://arxiv.org/abs/2310.11016
  • repo_url: None
  • paper_authors: Chong Zhang, Ya Guo, Yi Tu, Huan Chen, Jinyang Tang, Huijia Zhu, Qi Zhang, Tao Gui
  • for: 本研究旨在提高多模态预训练模型从视觉丰富文档中提取信息的能力,具体是解决 OCR 系统识别与排列文档文本所带来的阅读顺序问题,以提高命名实体识别(NER)的准确率。
  • methods: 本研究提出了 Token Path Prediction(TPP)方法,这是一种简单的预测头,将命名实体提及预测为文档内的 token 序列。TPP 将文档布局建模为 token 的完全有向图,并在图中预测 token 路径作为实体。
  • results: 实验结果表明,TPP 方法可以有效解决 OCR 系统识别文档中文本的顺序问题,提高 VrD-NER 系统的准确率。此外,本研究还提出了两个修订版本的 NER benchmark 数据集,以更好地评估 VrD-NER 系统在真实场景中的性能。
    Abstract Recent advances in multimodal pre-trained models have significantly improved information extraction from visually-rich documents (VrDs), in which named entity recognition (NER) is treated as a sequence-labeling task of predicting the BIO entity tags for tokens, following the typical setting of NLP. However, BIO-tagging scheme relies on the correct order of model inputs, which is not guaranteed in real-world NER on scanned VrDs where text are recognized and arranged by OCR systems. Such reading order issue hinders the accurate marking of entities by BIO-tagging scheme, making it impossible for sequence-labeling methods to predict correct named entities. To address the reading order issue, we introduce Token Path Prediction (TPP), a simple prediction head to predict entity mentions as token sequences within documents. Alternative to token classification, TPP models the document layout as a complete directed graph of tokens, and predicts token paths within the graph as entities. For better evaluation of VrD-NER systems, we also propose two revised benchmark datasets of NER on scanned documents which can reflect real-world scenarios. Experiment results demonstrate the effectiveness of our method, and suggest its potential to be a universal solution to various information extraction tasks on documents.
    摘要 多模态预训练模型的最新进展显著提升了从视觉丰富文档(VrD)中提取信息的能力。按照NLP的常见设定,其中的命名实体识别(NER)被视为序列标注任务,即为各token预测BIO实体标签。然而,BIO标注方案依赖模型输入的正确顺序,而在真实场景中,扫描件VrD上的文本由OCR系统识别并排列,其顺序并无保证。这种阅读顺序问题妨碍了BIO标注方案对实体的准确标记,使序列标注方法无法预测正确的命名实体。为了解决阅读顺序问题,我们提出了 Token Path Prediction(TPP),一种简单的预测头,将实体提及预测为文档内的token序列。不同于token分类,TPP将文档布局建模为token的完全有向图,并在图中预测token路径作为实体。为了更好地评估VrD-NER系统,我们还提出了两个能反映真实场景的扫描文档NER修订基准数据集。实验结果证明了该方法的有效性,并表明其有潜力成为各类文档信息提取任务的通用解决方案。

Correction Focused Language Model Training for Speech Recognition

  • paper_url: http://arxiv.org/abs/2310.11003
  • repo_url: None
  • paper_authors: Yingyi Ma, Zhe Liu, Ozlem Kalinli
  • for: 提高自动语音识别(ASR)的表现,特别是在领域适应任务中。
  • methods: 提出一种以纠错为中心的语言模型(LM)训练方法,优先关注ASR易出错的词:定义词级易错度分数并将其作为先验词分布来引导LM训练;为了仅用文本语料实现这种训练,利用大语言模型(LLM)通过多任务微调充当易错度分数预测器和文本生成器。
  • results: 实验结果表明,所提方法能有效提升ASR表现。与传统LM训练相比,以纠错为中心的训练在文本充足的场景下可带来最高约5.5%的相对词错误率(WER)下降;在文本不足的场景下,使用LLM生成文本进行LM训练可带来最高约13%的相对WER下降,而以纠错为中心的训练可进一步带来最高约6%的相对WER下降。
    Abstract Language models (LMs) have been commonly adopted to boost the performance of automatic speech recognition (ASR) particularly in domain adaptation tasks. Conventional way of LM training treats all the words in corpora equally, resulting in suboptimal improvements in ASR performance. In this work, we introduce a novel correction focused LM training approach which aims to prioritize ASR fallible words. The word-level ASR fallibility score, representing the likelihood of ASR mis-recognition, is defined and shaped as a prior word distribution to guide the LM training. To enable correction focused training with text-only corpora, large language models (LLMs) are employed as fallibility score predictors and text generators through multi-task fine-tuning. Experimental results for domain adaptation tasks demonstrate the effectiveness of our proposed method. Compared with conventional LMs, correction focused training achieves up to relatively 5.5% word error rate (WER) reduction in sufficient text scenarios. In insufficient text scenarios, LM training with LLM-generated text achieves up to relatively 13% WER reduction, while correction focused training further obtains up to relatively 6% WER reduction.
    摘要 语言模型(LM)被广泛用于提升自动语音识别(ASR)的性能,尤其是在领域自适应任务中。传统的LM训练方式对语料中的所有词一视同仁,导致ASR性能的提升不够理想。在这项工作中,我们提出一种新颖的、以纠错为中心的LM训练方法,旨在优先关注ASR容易出错的词。我们定义了词级易错度分数来表示ASR误识别该词的可能性,并将其作为先验词分布来引导LM训练。为了在仅有文本语料的情况下实现以纠错为中心的训练,我们利用大语言模型(LLM),通过多任务微调使其充当易错度分数预测器和文本生成器。领域自适应任务上的实验结果证明了所提方法的有效性:与传统LM相比,以纠错为中心的训练在文本充足的场景下可带来最高约5.5%的相对词错误率(WER)下降;在文本不足的场景下,使用LLM生成文本进行LM训练可带来最高约13%的相对WER下降,而以纠错为中心的训练可进一步带来最高约6%的相对WER下降。
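
One simple way to realize a correction-focused objective is to weight the token-level LM loss by the fallibility score, so words the recognizer tends to get wrong dominate training. The weighting shape below (1 + fallibility) is our own assumption for illustration; the paper defines the score as a prior word distribution rather than this exact loss.

```python
# Fallibility-weighted cross-entropy for LM training (illustrative sketch).
import torch
import torch.nn.functional as F

def correction_focused_loss(logits, targets, fallibility):
    """
    logits: (batch, seq, vocab); targets: (batch, seq) token ids;
    fallibility: (batch, seq) in [0, 1], higher = more likely to be misrecognized.
    """
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (batch, seq)
    weights = 1.0 + fallibility            # never fully ignore easy words
    return (ce * weights).mean()

logits = torch.randn(2, 5, 1000)
targets = torch.randint(0, 1000, (2, 5))
fallibility = torch.rand(2, 5)
print(correction_focused_loss(logits, targets, fallibility).item())
```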

Instructive Dialogue Summarization with Query Aggregations

  • paper_url: http://arxiv.org/abs/2310.10981
  • repo_url: https://github.com/BinWang28/InstructDS
  • paper_authors: Bin Wang, Zhengyuan Liu, Nancy F. Chen
  • for: 该论文旨在扩展对话概要模型的能力集,以适应用户特定的兴趣和需求。
  • methods: 该论文提出了一种三步方法,包括基于摘要锚定的查询生成、查询过滤和基于查询的摘要生成;并在带有多用途指令式三元组的三个摘要数据集上训练了统一模型 InstructDS,以扩展对话摘要模型的能力。
  • results: 实验结果显示,该方法超越了最先进的模型乃至规模更大的模型,并且经人工主观评估证实具有更高的泛化性和忠实度。
    Abstract Conventional dialogue summarization methods directly generate summaries and do not consider user's specific interests. This poses challenges in cases where the users are more focused on particular topics or aspects. With the advancement of instruction-finetuned language models, we introduce instruction-tuning to dialogues to expand the capability set of dialogue summarization models. To overcome the scarcity of instructive dialogue summarization data, we propose a three-step approach to synthesize high-quality query-based summarization triples. This process involves summary-anchored query generation, query filtering, and query-based summary generation. By training a unified model called InstructDS (Instructive Dialogue Summarization) on three summarization datasets with multi-purpose instructive triples, we expand the capability of dialogue summarization models. We evaluate our method on four datasets, including dialogue summarization and dialogue reading comprehension. Experimental results show that our approach outperforms the state-of-the-art models and even models with larger sizes. Additionally, our model exhibits higher generalizability and faithfulness, as confirmed by human subjective evaluations.
    摘要 传统的对话摘要方法直接生成摘要,而不考虑用户的特定兴趣。当用户更关注特定话题或方面时,这就带来了挑战。随着指令微调语言模型的进步,我们将指令微调引入对话,以扩展对话摘要模型的能力。为了克服指令式对话摘要数据稀缺的问题,我们提出了一种三步方法来合成高质量的基于查询的摘要三元组,包括基于摘要锚定的查询生成、查询过滤和基于查询的摘要生成。我们在带有多用途指令式三元组的三个摘要数据集上训练了统一模型 InstructDS(Instructive Dialogue Summarization),以扩展对话摘要模型的能力。我们在四个数据集上评估了该方法,涵盖对话摘要和对话阅读理解。实验结果显示,我们的方法超越了最先进的模型乃至规模更大的模型;此外,经人工主观评估证实,我们的模型具有更高的泛化性和忠实度。

Semantic-Aware Contrastive Sentence Representation Learning with Large Language Models

  • paper_url: http://arxiv.org/abs/2310.10962
  • repo_url: None
  • paper_authors: Huiming Wang, Liying Cheng, Zhaodonghui Li, De Wen Soh, Lidong Bing
  • for: 本研究旨在提出一种语义感知的对比句子表示框架,利用大语言模型(LLM)的生成和评估能力,自动构建高质量的NLI风格语料,并据此进行对比学习以获得更好的句子表示。
  • methods: 利用LLM的生成和评估能力,在无需人工标注的情况下自动构建高质量的NLI风格语料,并将生成的句子对纳入对比句子表示模型的训练。
  • results: 大量实验和深入分析表明,所提出的语义感知对比句子表示框架能够借助LLM学习到更好的句子表示。
    Abstract Contrastive learning has been proven to be effective in learning better sentence representations. However, to train a contrastive learning model, large numbers of labeled sentences are required to construct positive and negative pairs explicitly, such as those in natural language inference (NLI) datasets. Unfortunately, acquiring sufficient high-quality labeled data can be both time-consuming and resource-intensive, leading researchers to focus on developing methods for learning unsupervised sentence representations. As there is no clear relationship between these unstructured randomly-sampled sentences, building positive and negative pairs over them is tricky and problematic. To tackle these challenges, in this paper, we propose SemCSR, a semantic-aware contrastive sentence representation framework. By leveraging the generation and evaluation capabilities of large language models (LLMs), we can automatically construct a high-quality NLI-style corpus without any human annotation, and further incorporate the generated sentence pairs into learning a contrastive sentence representation model. Extensive experiments and comprehensive analyses demonstrate the effectiveness of our proposed framework for learning a better sentence representation with LLMs.
    摘要 对比学习已被证明能够学习到更好的句子表示。然而,训练对比学习模型需要大量带标注的句子来显式构建正例和负例对,例如自然语言推理(NLI)数据集中的句子对。遗憾的是,获取足够的高质量标注数据既耗时又耗费资源,这促使研究人员转向无监督句子表示学习方法。由于随机采样得到的无结构句子之间没有明确的关系,在其上构建正例和负例对既棘手又容易出问题。为解决这些挑战,本文提出 SemCSR,一个语义感知的对比句子表示框架。借助大语言模型(LLM)的生成和评估能力,我们可以在无需任何人工标注的情况下自动构建高质量的NLI风格语料,并进一步将生成的句子对纳入对比句子表示模型的训练。大量实验和全面分析表明,我们提出的框架能够借助LLM学习到更好的句子表示。

Computing the optimal keyboard through a geometric analysis of the English language

  • paper_url: http://arxiv.org/abs/2310.10956
  • repo_url: None
  • paper_authors: Jules Deschamps, Quentin Hubert, Lucas Ryckelynck
  • for: 提高键盘输入速度
  • methods: 利用几何工具在优化框架中提出新的键盘布局,提高输入速度
  • results: 提出了新的键盘布局,可以提高输入速度
    Abstract In the context of a group project for the course COMSW4995 002 - Geometric Data Analysis, we bring our attention to the design of fast-typing keyboards. Leveraging some geometric tools in an optimization framework allowed us to propose novel keyboard layouts that offer a faster typing.
    摘要 在COMSW4995 002 - 几何数据分析课程的小组项目中,我们对快速键盘设计进行了审视。通过使用一些几何工具在优化框架中,我们提出了新的键盘布局,以提高键盘输入速度。
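
One simple geometric objective for such a project is the expected finger travel distance under English bigram statistics; a layout that minimizes it should allow faster typing. The sketch below assumes hypothetical `key_positions` and `bigram_freq` tables and is only a stand-in for the report's actual cost function.

```python
def layout_cost(layout, key_positions, bigram_freq):
    """Expected travel distance per consecutive pair of keystrokes.

    layout        : dict mapping character -> key slot id
    key_positions : dict mapping key slot id -> (x, y) coordinates on the board
    bigram_freq   : dict mapping (char_a, char_b) -> relative frequency in English
    """
    cost = 0.0
    for (a, b), freq in bigram_freq.items():
        xa, ya = key_positions[layout[a]]
        xb, yb = key_positions[layout[b]]
        cost += freq * ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5
    return cost

# A layout search would then, e.g., swap pairs of keys and keep swaps that lower layout_cost.
```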

TEQ: Trainable Equivalent Transformation for Quantization of LLMs

  • paper_url: http://arxiv.org/abs/2310.10944
  • repo_url: https://github.com/intel/neural-compressor
  • paper_authors: Wenhua Cheng, Yiyang Cai, Kaokao Lv, Haihao Shen
  • for: 本研究旨在提出一种可学习的等效变换(TEQ),用于保持FP32精度的模型输出,同时利用低精度量化,尤其是3和4位量化。
  • methods: 本文使用可学习的等效变换(TEQ),不需要额外的计算负担,只需要1000步训练和少于0.1%的原始模型可训练参数。
  • results: 本研究结果与状态CURRENT最佳方法相当,可以与其他方法结合使用以获得更好的性能。code可以在https://github.com/intel/neural-compressor中下载。
    Abstract As large language models (LLMs) become more prevalent, there is a growing need for new and improved quantization methods that can meet the computationalast layer demands of these modern architectures while maintaining the accuracy. In this paper, we present TEQ, a trainable equivalent transformation that preserves the FP32 precision of the model output while taking advantage of low-precision quantization, especially 3 and 4 bits weight-only quantization. The training process is lightweight, requiring only 1K steps and fewer than 0.1 percent of the original model's trainable parameters. Furthermore, the transformation does not add any computational overhead during inference. Our results are on-par with the state-of-the-art (SOTA) methods on typical LLMs. Our approach can be combined with other methods to achieve even better performance. The code is available at https://github.com/intel/neural-compressor.
    摘要 随着大型语言模型(LLMs)日益普及,我们需要新的、更好的量化方法,在满足这些现代架构计算需求的同时保持模型精度。在本文中,我们提出了 TEQ,一种可训练的等效变换,它在利用低精度量化(特别是 3 比特和 4 比特的仅权重量化)的同时,保持模型输出的 FP32 精度。训练过程十分轻量,只需约 1K 步,且可训练参数不到原始模型的 0.1%。此外,该变换在推理时不会增加任何计算开销。我们的结果在典型的 LLMs 上与最先进(SOTA)方法相当,并且可以与其他方法结合以获得更好的性能。代码可在 https://github.com/intel/neural-compressor 获取。
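
The core idea of an equivalent transformation is that rescaling weight columns while inversely rescaling the activations leaves the FP32 output unchanged but reshapes the weights the quantizer sees. Below is a self-contained sketch of that idea with a learnable per-channel scale and a straight-through fake quantizer; it is not the TEQ implementation from neural-compressor, whose granularity, loss, and training details differ.

```python
import torch

def fake_quantize(w, bits=4):
    """Symmetric per-tensor round-to-nearest quantization with a
    straight-through estimator so gradients can flow."""
    qmax = 2 ** (bits - 1) - 1
    step = w.abs().max() / qmax
    q = torch.round(w / step).clamp(-qmax, qmax) * step
    return w + (q - w).detach()

torch.manual_seed(0)
weight = torch.randn(64, 128)          # a frozen FP32 linear layer (out=64, in=128)
calib = torch.randn(256, 128)          # calibration activations
log_scale = torch.zeros(128, requires_grad=True)   # per-input-channel scale = exp(log_scale)
opt = torch.optim.Adam([log_scale], lr=1e-2)

for _ in range(100):
    scale = log_scale.exp()
    # Equivalence in FP32: (x / s) @ (W * s).T == x @ W.T, so only the
    # quantization error changes as s is learned.
    w_q = fake_quantize(weight * scale)
    out_q = (calib / scale) @ w_q.T
    out_fp = calib @ weight.T
    loss = (out_q - out_fp).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```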

MASON-NLP at eRisk 2023: Deep Learning-Based Detection of Depression Symptoms from Social Media Texts

  • paper_url: http://arxiv.org/abs/2310.10941
  • repo_url: None
  • paper_authors: Fardin Ahsan Sakib, Ahnaf Atef Choudhury, Ozlem Uzuner
  • For: The paper is focused on detecting depressive symptoms in social media posts using the Beck Depression Inventory (BDI) questionnaire.
  • Methods: The authors used a deep learning approach that incorporated MentalBERT, RoBERTa, and LSTM to identify sentences related to different depression symptoms.
  • Results: Despite their efforts, the evaluation results were lower than expected, highlighting the challenges of ranking sentences from a large dataset about depression.
    Abstract Depression is a mental health disorder that has a profound impact on people's lives. Recent research suggests that signs of depression can be detected in the way individuals communicate, both through spoken words and written texts. In particular, social media posts are a rich and convenient text source that we may examine for depressive symptoms. The Beck Depression Inventory (BDI) Questionnaire, which is frequently used to gauge the severity of depression, is one instrument that can aid in this study. We can narrow our study to only those symptoms since each BDI question is linked to a particular depressive symptom. It's important to remember that not everyone with depression exhibits all symptoms at once, but rather a combination of them. Therefore, it is extremely useful to be able to determine if a sentence or a piece of user-generated content is pertinent to a certain condition. With this in mind, the eRisk 2023 Task 1 was designed to do exactly that: assess the relevance of different sentences to the symptoms of depression as outlined in the BDI questionnaire. This report is all about how our team, Mason-NLP, participated in this subtask, which involved identifying sentences related to different depression symptoms. We used a deep learning approach that incorporated MentalBERT, RoBERTa, and LSTM. Despite our efforts, the evaluation results were lower than expected, underscoring the challenges inherent in ranking sentences from an extensive dataset about depression, which necessitates both appropriate methodological choices and significant computational resources. We anticipate that future iterations of this shared task will yield improved results as our understanding and techniques evolve.
    摘要 抑郁症(Depression)是一种对人们生活产生深远影响的心理健康障碍。最新的研究表明,抑郁的迹象可以从个体的沟通方式中识别出来,包括口头语言和书面文本。其中,社交媒体帖子是一种丰富而便捷的文本来源,我们可以从中检测抑郁症状。常用于评估抑郁严重程度的贝克抑郁量表(BDI)问卷可以辅助这项研究:由于每个 BDI 问题都对应一种特定的抑郁症状,我们可以将研究范围缩小到这些症状上。需要注意的是,并非每位抑郁症患者都会同时表现出所有症状,而往往是其中某种组合。因此,能够判断一句话或一段用户生成内容是否与某种症状相关是非常有用的。为此,eRisk 2023 任务 1 的目标正是评估不同句子与 BDI 问卷所列抑郁症状的相关性。本报告介绍了我们 Mason-NLP 团队参与该子任务的情况,即识别与不同抑郁症状相关的句子。我们采用了结合 MentalBERT、RoBERTa 和 LSTM 的深度学习方法。尽管我们做了诸多努力,评估结果仍低于预期,这凸显了在大规模抑郁相关数据集中对句子进行排序所固有的挑战,它既需要合适的方法选择,也需要大量计算资源。我们期待随着理解和技术的进步,该共享任务的后续迭代能够取得更好的结果。

Intent Detection and Slot Filling for Home Assistants: Dataset and Analysis for Bangla and Sylheti

  • paper_url: http://arxiv.org/abs/2310.10935
  • repo_url: None
  • paper_authors: Fardin Ahsan Sakib, A H M Rezaul Karim, Saadat Hasan Khan, Md Mushfiqur Rahman
  • for: 这项研究的目的是为了提供一个全面的 Intent 检测和插值数据集,用于支持语言模型在不同语言环境中进行下游任务。
  • methods: 该研究使用了 GPT-3.5 语言模型,并对 colloquial Bangla、formal Bangla 和 Sylheti 语言进行了分类和插值测试。
  • results: 研究发现,GPT-3.5 模型在 colloquial Bangla 语言下可以达到 impressive F1 分数为 0.94,而在插值任务中可以达到 F1 分数为 0.51。
    Abstract As voice assistants cement their place in our technologically advanced society, there remains a need to cater to the diverse linguistic landscape, including colloquial forms of low-resource languages. Our study introduces the first-ever comprehensive dataset for intent detection and slot filling in formal Bangla, colloquial Bangla, and Sylheti languages, totaling 984 samples across 10 unique intents. Our analysis reveals the robustness of large language models for tackling downstream tasks with inadequate data. The GPT-3.5 model achieves an impressive F1 score of 0.94 in intent detection and 0.51 in slot filling for colloquial Bangla.
    摘要 随着语音助手在我们技术先进的社会中确立地位,仍然需要照顾多样的语言环境,包括低资源语言的口语形式。我们的研究推出了首个面向正式孟加拉语、口语孟加拉语和锡尔赫特语的意图检测与槽位填充综合数据集,共 984 个样本,涵盖 10 个不同的意图。我们的分析显示,大语言模型在数据不足的情况下也能够稳健地处理下游任务。GPT-3.5 模型在口语孟加拉语的意图检测上取得了 0.94 的优异 F1 分数,在槽位填充上取得了 0.51 的 F1 分数。

Spatial HuBERT: Self-supervised Spatial Speech Representation Learning for a Single Talker from Multi-channel Audio

  • paper_url: http://arxiv.org/abs/2310.10922
  • repo_url: None
  • paper_authors: Antoni Dimitriadis, Siqi Pan, Vidhyasaharan Sethu, Beena Ahmed
  • for: 提高speech系统的准确率和泛化能力,通过利用无标注数据进行自动学习
  • methods: 使用多通道音频输入,实现听说者环境中的噪声和抗噪声性能
  • results: 比前一代单道音频表示模型更高效,特别在噪声和雾气环境中表现出色,同时也能够在声音地图 Task 上达到优秀的效果
    Abstract Self-supervised learning has been used to leverage unlabelled data, improving accuracy and generalisation of speech systems through the training of representation models. While many recent works have sought to produce effective representations across a variety of acoustic domains, languages, modalities and even simultaneous speakers, these studies have all been limited to single-channel audio recordings. This paper presents Spatial HuBERT, a self-supervised speech representation model that learns both acoustic and spatial information pertaining to a single speaker in a potentially noisy environment by using multi-channel audio inputs. Spatial HuBERT learns representations that outperform state-of-the-art single-channel speech representations on a variety of spatial downstream tasks, particularly in reverberant and noisy environments. We also demonstrate the utility of the representations learned by Spatial HuBERT on a speech localisation downstream task. Along with this paper, we publicly release a new dataset of 100 000 simulated first-order ambisonics room impulse responses.
    摘要 自我指导学习已经用不标注数据来提高语音系统的准确性和泛化能力。虽然 latest works 尝试生成适用于多种语音频谱、语言、模态和同时说话人的有效表示,但这些研究都受到单通道音频录音的限制。本文介绍 Spatial HuBERT,一种自我指导的语音表示模型,通过多通道音频输入学习到一个说话人在听到的环境中的both acoustic和空间信息。Spatial HuBERT 的表示超过了当前最佳单通道语音表示的状态,特别是在噪音和干扰环境中。我们还证明 Spatial HuBERT 学习的表示在语音地图任务中具有 Utility。此外,我们在这篇论文中公共发布了100000个 simulated first-order ambisonics room impulse responses 的新数据集。

Compositional preference models for aligning LMs

  • paper_url: http://arxiv.org/abs/2310.13011
  • repo_url: None
  • paper_authors: Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Marc Dymetman
  • For: 本研究旨在提高语言模型(LM)与人类偏好的匹配。
  • Methods: 我们提出了 Compositional Preference Models(CPMs),一种新的偏好模型框架。它将一个全局偏好评估分解为多个可解释的特征,从提示 LM 中获取这些特征的标量分数,并使用逻辑回归分类器进行聚合。
  • Results: 我们的实验表明,CPMs 不仅提高了泛化性和对过度优化的鲁棒性,而且使用 CPMs 得到的 best-of-n 样本往往比使用标准 PMs 得到的样本更受偏好。总的来说,我们的方法展示了在赋予 PMs 关于哪些特征决定人类偏好的先验的同时,利用 LM 的能力以可扩展且稳健的方式抽取这些特征的优势。
    Abstract As language models (LMs) become more capable, it is increasingly important to align them with human preferences. However, the dominant paradigm for training Preference Models (PMs) for that purpose suffers from fundamental limitations, such as lack of transparency and scalability, along with susceptibility to overfitting the preference dataset. We propose Compositional Preference Models (CPMs), a novel PM framework that decomposes one global preference assessment into several interpretable features, obtains scalar scores for these features from a prompted LM, and aggregates these scores using a logistic regression classifier. CPMs allow to control which properties of the preference data are used to train the preference model and to build it based on features that are believed to underlie the human preference judgment. Our experiments show that CPMs not only improve generalization and are more robust to overoptimization than standard PMs, but also that best-of-n samples obtained using CPMs tend to be preferred over samples obtained using conventional PMs. Overall, our approach demonstrates the benefits of endowing PMs with priors about which features determine human preferences while relying on LM capabilities to extract those features in a scalable and robust way.
    摘要 (Simplified Chinese translation)随着语言模型(LM)的能力不断提高,对其进行人类喜好的调整变得越来越重要。然而,目前主流的偏好模型(PM)训练方法受到一些基本的限制,如不透明性和可扩展性,同时容易过拟合偏好数据。我们提议compositional Preference Models(CPM),一种新的PM框架,将一个全局的喜好评估 decomposes into 多个可解释的特征,从提问LM中获取这些特征的标量分数,并使用逻辑回归分类器进行聚合。CPMs允许控制偏好数据中哪些特性用于训练偏好模型,并基于人类喜好判断中认为是重要的特征来建立偏好模型。我们的实验表明,CPMs不仅提高了通用性和鲁棒性,而且best-of-n样本获得使用CPMs比使用标准PMs更受欢迎。总的来说,我们的方法表明了将PMs具备人类喜好的假设,并且利用LM的能力来提取这些特征的可行和稳定的方式。
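
A compositional preference model of this kind can be approximated with very little machinery: prompt an LM for a handful of interpretable feature scores and fit a logistic regression over them on preference pairs. The feature names and the `llm_score` helper below are hypothetical, and the aggregation shown (Bradley-Terry-style on score differences) is only one plausible instantiation, not necessarily the paper's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

FEATURES = ["helpfulness", "factuality", "clarity", "harmlessness"]   # illustrative

def feature_scores(prompt, response, llm_score):
    """llm_score(prompt, response, feature) is assumed to return a scalar
    rating parsed from a prompted LM's answer."""
    return np.array([llm_score(prompt, response, f) for f in FEATURES])

def fit_compositional_pm(pairs, llm_score):
    """pairs: (prompt, chosen_response, rejected_response) triples.
    Fit logistic regression on feature-score differences so the aggregate
    prefers the chosen response; clf.coef_ exposes each feature's weight."""
    X, y = [], []
    for prompt, chosen, rejected in pairs:
        diff = feature_scores(prompt, chosen, llm_score) - feature_scores(prompt, rejected, llm_score)
        X.append(diff)
        y.append(1)
        X.append(-diff)
        y.append(0)
    return LogisticRegression().fit(np.array(X), np.array(y))

def preference_score(prompt, response, clf, llm_score):
    """Scalar score usable for best-of-n reranking."""
    return float(clf.decision_function(feature_scores(prompt, response, llm_score).reshape(1, -1)))
```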

Emergent AI-Assisted Discourse: Case Study of a Second Language Writer Authoring with ChatGPT

  • paper_url: http://arxiv.org/abs/2310.10903
  • repo_url: None
  • paper_authors: Sharin Jacob, Tamara Tate, Mark Warschauer
  • for: 本研究探讨了ChatGPT如何促进语言学习者的学术写作,以减轻对人类写作标准的担忧。
  • methods: 本研究采用案例研究方法,探讨了博士生 Kailing 在学术写作过程中使用 ChatGPT 的经验。研究以活动理论作为理解使用生成式 AI 工具进行写作的视角,分析的数据包括半结构化访谈、写作样本和 GPT 日志。
  • results: 结果表明Kailing能够与ChatGPT在不同写作阶段进行有效协作,同时保持自己独特的作者语言和主动性。这表明AI工具如ChatGPT可以增强语言学习者的学术写作,而不会抹杀个体的独特性。本案例研究提供了使用ChatGPT进行学术写作的批判性探讨,以及保持学生独特语言的实践。
    Abstract The rapid proliferation of ChatGPT has incited debates regarding its impact on human writing. Amid concerns about declining writing standards, this study investigates the role of ChatGPT in facilitating academic writing, especially among language learners. Using a case study approach, this study examines the experiences of Kailing, a doctoral student, who integrates ChatGPT throughout their academic writing process. The study employs activity theory as a lens for understanding writing with generative AI tools and data analyzed includes semi-structured interviews, writing samples, and GPT logs. Results indicate that Kailing effectively collaborates with ChatGPT across various writing stages while preserving her distinct authorial voice and agency. This underscores the potential of AI tools such as ChatGPT to enhance academic writing for language learners without overshadowing individual authenticity. This case study offers a critical exploration of how ChatGPT is utilized in the academic writing process and the preservation of a student's authentic voice when engaging with the tool.
    摘要 快速扩散的ChatGPT已经引发了人们对人类写作的影响的讨论。本研究探究了ChatGPT如何促进语言学习者的学术写作,特别是在启用AI生成工具的情况下。通过 caso study的方式,本研究研究了Kailing,一名博士学生,在学术写作过程中如何与ChatGPT进行合作。研究使用活动理论作为写作AI工具的理解镜子,数据分析包括 semi-structured 采访、写作样本和 GPT 日志。结果表明,Kailing在不同的写作阶段与ChatGPT进行有效的合作,同时保持自己独特的作者语言和主张。这种情况 highlights AI工具如ChatGPT可以增强语言学习者的学术写作,不会覆盖个人的 Authenticity。本案例研究如何在学术写作过程中使用ChatGPT,并保持学生的独特语言和主张。

cs.LG - 2023-10-17

  • paper_url: http://arxiv.org/abs/2310.11612
  • repo_url: https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval
  • paper_authors: Yimu Wang, Xiangru Jian, Bo Xue
  • for: 本研究旨在解决跨Modal Retrieval中的干扰问题,即 gallery数据点频繁出现在检索结果中,导致检索性能下降。
  • methods: 我们提出了一种新的框架,双银行正常化(DBNorm),以及两种新的方法,对比轮径正常化和动态对比轮径正常化,以正常化相似度基于两个银行。这些方法可以减少极值点和查询样本之间的相似度,提高非极值点和查询样本之间的相似度。
  • results: 我们在多种语言基础 benchmark上进行了广泛的实验,包括文本-图像、文本-视频和文本-音频 benchmark,并证明了我们的方法可以比前方法更好地解决干扰问题,提高检索性能。我们的代码可以在https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval上下载。
    Abstract In this work, we present a post-processing solution to address the hubness problem in cross-modal retrieval, a phenomenon where a small number of gallery data points are frequently retrieved, resulting in a decline in retrieval performance. We first theoretically demonstrate the necessity of incorporating both the gallery and query data for addressing hubness as hubs always exhibit high similarity with gallery and query data. Second, building on our theoretical results, we propose a novel framework, Dual Bank Normalization (DBNorm). While previous work has attempted to alleviate hubness by only utilizing the query samples, DBNorm leverages two banks constructed from the query and gallery samples to reduce the occurrence of hubs during inference. Next, to complement DBNorm, we introduce two novel methods, dual inverted softmax and dual dynamic inverted softmax, for normalizing similarity based on the two banks. Specifically, our proposed methods reduce the similarity between hubs and queries while improving the similarity between non-hubs and queries. Finally, we present extensive experimental results on diverse language-grounded benchmarks, including text-image, text-video, and text-audio, demonstrating the superior performance of our approaches compared to previous methods in addressing hubness and boosting retrieval performance. Our code is available at https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval.
    摘要 在这项工作中,我们提出了一种后处理解决方案,用于解决跨Modal重现中的枢轴问题,即一些小量的 галеリー数据点频繁地被重现,导致检索性能下降。我们首先理论上证明了在 incorporating both gallery和query数据时,解决枢轴问题的必要性,因为枢轴总是与 gallery和query数据 exhibit high similarity。然后,基于我们的理论结果,我们提出了一种新的框架,双银行Normalization(DBNorm)。在前一些工作中,人们尝试了通过只使用查询样本来缓解枢轴,但DBNorm利用了基于查询和 галеリー样本构建的两个银行来减少在推理中出现的枢轴。此外,为了补充DBNorm,我们提出了两种新的方法,双 inverted softmax和 dual dynamic inverted softmax,用于在两个银行基础上Normalize similarity。specifically,我们的提出的方法可以降低枢轴和查询之间的相似性,而提高非枢轴和查询之间的相似性。最后,我们在多种语言基础 benchmark上进行了广泛的实验,包括文本-图像、文本-视频和文本-声音,并证明了我们的方法比前一些方法在解决枢轴和提高检索性能方面表现更出色。我们的代码可以在 https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval 上获取。
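
The flavor of bank-based normalization can be seen in a few lines of NumPy: each gallery item's score is discounted by how strongly it attracts a bank of queries and a bank of gallery items, which suppresses hubs. The way the two log-normalizers are combined below (a simple average) is an assumption made for illustration; the paper's DBNorm and its dual (dynamic) inverted softmax define the combination differently.

```python
import numpy as np

def dual_bank_inverted_softmax(sim, bank_q_sim, bank_g_sim, temp=0.01):
    """Hub-suppressing rescoring of query-gallery similarities.

    sim        : (Q, G) similarities between test queries and gallery items
    bank_q_sim : (Bq, G) similarities between a bank of queries and the gallery
    bank_g_sim : (Bg, G) similarities between a bank of gallery samples and the gallery
    """
    # Per-gallery-item normalization constants (log-sum-exp over each bank).
    z_q = np.logaddexp.reduce(bank_q_sim / temp, axis=0)   # (G,)
    z_g = np.logaddexp.reduce(bank_g_sim / temp, axis=0)   # (G,)
    # Items that attract the banks strongly (potential hubs) get penalized.
    return sim / temp - 0.5 * (z_q + z_g)
```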

In defense of parameter sharing for model-compression

  • paper_url: http://arxiv.org/abs/2310.11611
  • repo_url: None
  • paper_authors: Aditya Desai, Anshumali Shrivastava
  • for: 这篇论文主要目的是对模型减小内存占用的方法进行全面评估,并推广RPS方法在模型压缩中的应用。
  • methods: 本论文使用了RPS方法、截割技术和建立更小的模型来减少模型的内存占用。
  • results: 研究发现,RPS方法在压缩范围内 consistently 击败/匹配更小的模型和一些中等知识水平的截割策略,特别在高压缩场景下。此外,RPS方法还有较高的鲁棒性和稳定性。
    Abstract When considering a model architecture, there are several ways to reduce its memory footprint. Historically, popular approaches included selecting smaller architectures and creating sparse networks through pruning. More recently, randomized parameter-sharing (RPS) methods have gained traction for model compression at start of training. In this paper, we comprehensively assess the trade-off between memory and accuracy across RPS, pruning techniques, and building smaller models. Our findings demonstrate that RPS, which is both data and model-agnostic, consistently outperforms/matches smaller models and all moderately informed pruning strategies, such as MAG, SNIP, SYNFLOW, and GRASP, across the entire compression range. This advantage becomes particularly pronounced in higher compression scenarios. Notably, even when compared to highly informed pruning techniques like Lottery Ticket Rewinding (LTR), RPS exhibits superior performance in high compression settings. This points out inherent capacity advantage that RPS enjoys over sparse models. Theoretically, we establish RPS as a superior technique in terms of memory-efficient representation when compared to pruning for linear models. This paper argues in favor of paradigm shift towards RPS based models. During our rigorous evaluation of RPS, we identified issues in the state-of-the-art RPS technique ROAST, specifically regarding stability (ROAST's sensitivity to initialization hyperparameters, often leading to divergence) and Pareto-continuity (ROAST's inability to recover the accuracy of the original model at zero compression). We provably address both of these issues. We refer to the modified RPS, which incorporates our improvements, as STABLE-RPS.
    摘要 当考虑模型建 architecture时,有几种方法可以降低其内存占用量。历史上,流行的方法包括选择更小的 architecture和通过剪裁来减少模型的大小。在这篇论文中,我们系统地评估了减少内存和精度之间的权衡,并对 RPS、剪裁技术和建立更小的模型进行了比较。我们的发现表明,RPS在整个压缩范围内一直表现出优异,特别是在更高的压缩场景下。此外,RPS还比所有中等知识的剪裁策略(如MAG、SNIP、SYNFLOW和GRASP)在整个压缩范围内表现出优异。这种优势在高压缩场景下特别明显。在高压缩场景下,RPS甚至超过了高知识剪裁策略(如Lottery Ticket Rewinding)的性能。这表明RPS在压缩场景下具有内存效率的优势。从理论角度来看,我们证明RPS在线性模型上是一种更佳的压缩技术。这篇论文提倡使用基于RPS的模型。在我们对RPS进行了严格的评估后,我们发现了一些问题,包括ROAST的稳定性和紧张性(ROAST的初始化参数的敏感性,常导致偏转)以及级联稳定性(ROAST无法在零压缩场景下恢复原始模型的精度)。我们解决了这些问题,并提出了一种改进后的RPS,称为稳定RPS(STABLE-RPS)。
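
Randomized parameter sharing can be illustrated with a HashedNet-style layer in which a fixed random index map backs a large virtual weight matrix with a small trainable bank; this conveys the capacity argument for RPS but is not the ROAST/STABLE-RPS implementation, which differs in how the mapping, scaling, and stability fixes are handled.

```python
import torch
import torch.nn as nn

class HashedSharedLinear(nn.Module):
    """Linear layer whose (out x in) weight matrix is backed by a much smaller
    shared parameter bank through a fixed random index map. Only the bank and
    the bias are trainable, so parameter memory is compressed; the dense
    matrix is materialized on the fly in forward()."""

    def __init__(self, in_features, out_features, bank_size, seed=0):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(bank_size) * 0.02)
        g = torch.Generator().manual_seed(seed)
        idx = torch.randint(bank_size, (out_features, in_features), generator=g)
        sign = torch.randint(0, 2, (out_features, in_features), generator=g) * 2 - 1
        self.register_buffer("idx", idx)
        self.register_buffer("sign", sign.float())
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        weight = self.bank[self.idx] * self.sign   # virtual (out, in) weight
        return x @ weight.T + self.bias

# e.g. a 512x512 layer (262k virtual weights) backed by a 16k-parameter bank:
layer = HashedSharedLinear(512, 512, bank_size=16_384)
```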

Reflection-Equivariant Diffusion for 3D Structure Determination from Isotopologue Rotational Spectra in Natural Abundance

  • paper_url: http://arxiv.org/abs/2310.11609
  • repo_url: https://github.com/aspuru-guzik-group/kreed
  • paper_authors: Austin Cheng, Alston Lo, Santiago Miret, Brooks Pate, Alán Aspuru-Guzik
  • for: 这篇论文的目的是利用转动光谱确定小分子有机物的三维结构。
  • methods: 这篇论文使用转动光谱获得小分子有机物的精确三维信息,并使用 Kraitchman 分析确定具有天然同位素丰度的重原子的取代坐标。
  • results: 这篇论文开发了一种基于生成扩散模型的方法,可以从分子式、转动惯量和重原子取代坐标推断小分子有机物的完整三维结构。该方法的 top-1 预测在 QM9 和 GEOM 数据集上可达 >98% 的准确率。
    Abstract Structure determination is necessary to identify unknown organic molecules, such as those in natural products, forensic samples, the interstellar medium, and laboratory syntheses. Rotational spectroscopy enables structure determination by providing accurate 3D information about small organic molecules via their moments of inertia. Using these moments, Kraitchman analysis determines isotopic substitution coordinates, which are the unsigned $|x|,|y|,|z|$ coordinates of all atoms with natural isotopic abundance, including carbon, nitrogen, and oxygen. While unsigned substitution coordinates can verify guesses of structures, the missing $+/-$ signs make it challenging to determine the actual structure from the substitution coordinates alone. To tackle this inverse problem, we develop KREED (Kraitchman REflection-Equivariant Diffusion), a generative diffusion model that infers a molecule's complete 3D structure from its molecular formula, moments of inertia, and unsigned substitution coordinates of heavy atoms. KREED's top-1 predictions identify the correct 3D structure with >98% accuracy on the QM9 and GEOM datasets when provided with substitution coordinates of all heavy atoms with natural isotopic abundance. When substitution coordinates are restricted to only a subset of carbons, accuracy is retained at 91% on QM9 and 32% on GEOM. On a test set of experimentally measured substitution coordinates gathered from the literature, KREED predicts the correct all-atom 3D structure in 25 of 33 cases, demonstrating experimental applicability for context-free 3D structure determination with rotational spectroscopy.
    摘要 STRUCTURE determination 是必需的,以确定未知的有机分子,如自然产物、刑事样本、 междузвездmedium 和实验室合成。扭转 спектроскопия 可以提供有机分子的准确三维信息,通过其惯性矩来确定结构。使用这些矩,卡迪曼分析可以确定原子的替换坐标,即所有原子的自然同位素含量,包括碳、氮和氧。然而,未签名的替换坐标无法决定实际结构,这是一个逆向问题。为解决这个问题,我们开发了 KREED(卡迪曼反射相对均匀扩散),一种生成扩散模型,可以从分子式、惯性矩和未签名重元素替换坐标中推断出分子的完整三维结构。KREED 的顶峰预测可以在 QM9 和 GEOM 数据集上确定分子的正确三维结构,并且在所有重元素替换坐标中具有 >98% 的准确率。当替换坐标只限于一 subset of 碳时,准确率仍保持在 91% 的水平。在Literature中测试的实验ally measured substitution coordinates中,KREED 预测了正确的所有atom 3D结构,证明了实验可行性。

TK-KNN: A Balanced Distance-Based Pseudo Labeling Approach for Semi-Supervised Intent Classification

  • paper_url: http://arxiv.org/abs/2310.11607
  • repo_url: https://github.com/servicenow/tk-knn
  • paper_authors: Nicholas Botzer, David Vasquez, Tim Weninger, Issam Laradji
  • For: The paper is written for improving the ability to detect intent in dialogue systems, specifically by using semi-supervised learning methods to label unlabeled data.
  • Methods: The paper proposes a new method called Top-K K-Nearest Neighbor (TK-KNN) that uses a more robust pseudo-labeling approach based on distance in the embedding space, while maintaining a balanced set of pseudo-labeled examples across classes through a ranking-based approach.
  • Results: The experiments on several datasets show that TK-KNN outperforms existing models, particularly when labeled data is scarce, such as on popular datasets like CLINC150 and Banking77.
    Abstract The ability to detect intent in dialogue systems has become increasingly important in modern technology. These systems often generate a large amount of unlabeled data, and manually labeling this data requires substantial human effort. Semi-supervised methods attempt to remedy this cost by using a model trained on a few labeled examples and then by assigning pseudo-labels to further a subset of unlabeled examples that has a model prediction confidence higher than a certain threshold. However, one particularly perilous consequence of these methods is the risk of picking an imbalanced set of examples across classes, which could lead to poor labels. In the present work, we describe Top-K K-Nearest Neighbor (TK-KNN), which uses a more robust pseudo-labeling approach based on distance in the embedding space while maintaining a balanced set of pseudo-labeled examples across classes through a ranking-based approach. Experiments on several datasets show that TK-KNN outperforms existing models, particularly when labeled data is scarce on popular datasets such as CLINC150 and Banking77. Code is available at https://github.com/ServiceNow/tk-knn
    摘要 现代技术中探测对话系统中的意图已经变得越来越重要。这些系统经常生成大量未标注数据,并且手动标注这些数据需要很大的人工劳动。半超vised方法试图解决这个问题,使用一些标注的示例来训练模型,然后将 pseudo-标签分配给一部分未标注示例,这些示例的模型预测度高于某个阈值。然而,这些方法存在一个特别危险的后果,即选择类别之间不均衡的示例集,这可能导致 poor labels。在 presente 的工作中,我们描述了 Top-K K-Nearest Neighbor (TK-KNN),这是一种基于距离 embedding 空间的更加可靠的 pseudo-标签方法,同时保持类别之间的 pseudo-标签示例集均衡。在几个数据集上进行了实验,发现 TK-KNN 超越了现有模型,特别是在 CLINC150 和 Banking77 等Popular数据集上。代码可以在 https://github.com/ServiceNow/tk-knn 上获取。
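
A stripped-down version of balanced, distance-based pseudo-labeling is shown below: unlabeled utterances are ranked by cosine similarity to each intent's labeled centroid and at most k are selected per class, keeping the pseudo-labeled set balanced. TK-KNN itself ranks with k-nearest-neighbour distances combined with model confidence, so treat this as a sketch of the idea rather than the released implementation.

```python
import numpy as np

def balanced_topk_pseudo_labels(unlabeled_emb, labeled_emb, labeled_y, k_per_class):
    """Return {unlabeled index: pseudo-label}, with at most k_per_class per class."""
    classes = np.unique(labeled_y)
    u = unlabeled_emb / np.linalg.norm(unlabeled_emb, axis=1, keepdims=True)
    l = labeled_emb / np.linalg.norm(labeled_emb, axis=1, keepdims=True)
    centroids = np.stack([l[labeled_y == c].mean(axis=0) for c in classes])
    sims = u @ centroids.T                    # (num_unlabeled, num_classes)
    best_class = sims.argmax(axis=1)

    selected = {}
    for ci, c in enumerate(classes):
        candidates = np.where(best_class == ci)[0]
        ranked = candidates[np.argsort(-sims[candidates, ci])]   # closest first
        for i in ranked[:k_per_class]:
            selected[int(i)] = c
    return selected
```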

Towards Inferring Users’ Impressions of Robot Performance in Navigation Scenarios

  • paper_url: http://arxiv.org/abs/2310.11590
  • repo_url: None
  • paper_authors: Qiping Zhang, Nathan Tsoi, Booyeon Choi, Jie Tan, Hao-Tien Lewis Chiang, Marynel Vázquez
  • for: 这个论文的目的是研究使用非语言行为特征和机器学习技术来预测人们对机器人表现的印象。
  • methods: 作者提供了一个名为 SEAN TOGETHER 的数据集,包含人与移动机器人在虚拟现实环境中的互动记录,以及用户对机器人表现的评分。同时,作者还进行了人类和超参量学习模型如何预测人们对机器人表现的印象的分析。
  • results: 研究发现,仅凭面部表情就能为推断人们对机器人表现的印象提供有用信息;但在我们测试的导航场景中,空间特征是该推断任务最关键的信息。此外,当把评估作为二分类(而非多分类)问题时,人类预测和机器学习模型的 F1 分数都提高了一倍以上,这表明二者都更擅长判断机器人表现的好坏方向,而不是预测确切的评分。
    Abstract Human impressions of robot performance are often measured through surveys. As a more scalable and cost-effective alternative, we study the possibility of predicting people's impressions of robot behavior using non-verbal behavioral cues and machine learning techniques. To this end, we first contribute the SEAN TOGETHER Dataset consisting of observations of an interaction between a person and a mobile robot in a Virtual Reality simulation, together with impressions of robot performance provided by users on a 5-point scale. Second, we contribute analyses of how well humans and supervised learning techniques can predict perceived robot performance based on different combinations of observation types (e.g., facial, spatial, and map features). Our results show that facial expressions alone provide useful information about human impressions of robot performance; but in the navigation scenarios we tested, spatial features are the most critical piece of information for this inference task. Also, when evaluating results as binary classification (rather than multiclass classification), the F1-Score of human predictions and machine learning models more than doubles, showing that both are better at telling the directionality of robot performance than predicting exact performance ratings. Based on our findings, we provide guidelines for implementing these predictions models in real-world navigation scenarios.
    摘要 人类对机器人性能的印象通常通过调查来衡量。作为一种可扩展和成本效果更高的替代方案,我们研究使用非语言行为特征和机器学习技术预测人类对机器人行为的印象。为此,我们首先提供了SEAN TOGETHER数据集,包括人与移动机器人在虚拟现实环境中的互动记录,以及用户对机器人性能的评分(在5分比例上)。其次,我们分析了人类和监督学习技术如何预测人类对机器人性能的印象,根据不同的观察类型(例如,表情特征、空间特征和地图特征)。我们的结果显示,表情特征alone提供了人类对机器人性能的有用信息;但在我们测试的导航场景中,空间特征是最重要的信息来源。此外,当评估结果为二分类(而不是多类)时,人类预测和机器学习模型的F1分值超过了两倍,表示它们都更好地预测机器人性能的方向性,而不是精确的评分。根据我们的发现,我们提供了实现这些预测模型的指南,用于实际导航场景。

Partially Observable Stochastic Games with Neural Perception Mechanisms

  • paper_url: http://arxiv.org/abs/2310.11566
  • repo_url: None
  • paper_authors: Rui Yan, Gabriel Santos, Gethin Norman, David Parker, Marta Kwiatkowska
  • for: This paper is written for researchers and practitioners interested in multi-agent decision-making under uncertainty, with a focus on partial observability and data-driven perception.
  • methods: The paper proposes a new model called neuro-symbolic partially-observable stochastic games (NS-POSGs), which incorporates perception mechanisms and is applicable to one-sided settings with discrete, data-driven observations. The paper also introduces a new point-based method called one-sided NS-HSVI for approximating values of NS-POSGs.
  • results: The paper presents experimental results demonstrating the practical applicability of the proposed method for neural networks whose preimage is in polyhedral form. The results show that the one-sided NS-HSVI method is effective in approximating values of NS-POSGs and can be used to solve real-world problems involving partial observability and data-driven perception.
    Abstract Stochastic games are a well established model for multi-agent sequential decision making under uncertainty. In reality, though, agents have only partial observability of their environment, which makes the problem computationally challenging, even in the single-agent setting of partially observable Markov decision processes. Furthermore, in practice, agents increasingly perceive their environment using data-driven approaches such as neural networks trained on continuous data. To tackle this problem, we propose the model of neuro-symbolic partially-observable stochastic games (NS-POSGs), a variant of continuous-space concurrent stochastic games that explicitly incorporates perception mechanisms. We focus on a one-sided setting, comprising a partially-informed agent with discrete, data-driven observations and a fully-informed agent with continuous observations. We present a new point-based method, called one-sided NS-HSVI, for approximating values of one-sided NS-POSGs and implement it based on the popular particle-based beliefs, showing that it has closed forms for computing values of interest. We provide experimental results to demonstrate the practical applicability of our method for neural networks whose preimage is in polyhedral form.
    摘要 We focus on a one-sided setting where the partially-informed agent has discrete, data-driven observations, while the fully-informed agent has continuous observations. We develop a new point-based method called one-sided NS-HSVI for approximating values of one-sided NS-POSGs, which is based on the popular particle-based beliefs. Our method has closed forms for computing values of interest, and we provide experimental results to demonstrate its practical applicability for neural networks whose preimage is in polyhedral form.

Online Algorithms with Uncertainty-Quantified Predictions

  • paper_url: http://arxiv.org/abs/2310.11558
  • repo_url: None
  • paper_authors: Bo Sun, Jerry Huang, Nicolas Christianson, Mohammad Hajiesmaili, Adam Wierman
  • for: This paper focuses on developing online algorithms that incorporate uncertainty-quantified predictions to achieve high-quality performance guarantees while maintaining bounded worst-case guarantees.
  • methods: The paper explores the use of uncertainty-quantified predictions in online algorithms, specifically for ski rental and online search problems. The authors propose non-trivial modifications to algorithm design to fully leverage the probabilistic predictions.
  • results: The paper demonstrates the effectiveness of the proposed methods through theoretical analysis and experimental evaluations. The results show that the algorithms achieve better performance guarantees compared to traditional online algorithms, and the uncertainty-quantified predictions provide valuable information for making optimal decisions in multi-instance settings.
    Abstract Online algorithms with predictions have become a trending topic in the field of beyond worst-case analysis of algorithms. These algorithms incorporate predictions about the future to obtain performance guarantees that are of high quality when the predictions are good, while still maintaining bounded worst-case guarantees when predictions are arbitrarily poor. In general, the algorithm is assumed to be unaware of the prediction's quality. However, recent developments in the machine learning literature have studied techniques for providing uncertainty quantification on machine-learned predictions, which describes how certain a model is about its quality. This paper examines the question of how to optimally utilize uncertainty-quantified predictions in the design of online algorithms. In particular, we consider predictions augmented with uncertainty quantification describing the likelihood of the ground truth falling in a certain range, designing online algorithms with these probabilistic predictions for two classic online problems: ski rental and online search. In each case, we demonstrate that non-trivial modifications to algorithm design are needed to fully leverage the probabilistic predictions. Moreover, we consider how to utilize more general forms of uncertainty quantification, proposing a framework based on online learning that learns to exploit uncertainty quantification to make optimal decisions in multi-instance settings.
    摘要 在 beyond worst-case 分析算法领域,在线算法 Predictions 已经成为一个流行的话题。这些算法利用未来的预测来获得高质量的性能保证,当预测准确度很好时,而且仍保持 bounded worst-case 保证,当预测准确度很差时。在总的来说,算法假设不知道预测的质量。然而,现代机器学习文献中的技术已经研究了提供机器学习预测的不确定性评估,这种评估描述了模型对其质量的确定程度。本文考虑了如何优化不确定性评估的预测,并在两个经典的在线问题上进行了实践:滑雪租赁和在线搜索。在每个情况下,我们表明了非常轻量级的修改,以便完全利用预测的不确定性。此外,我们考虑了如何利用更加通用的不确定性评估,提出了基于在线学习的框架,用于在多实例设置中学习利用不确定性评估来做出优化的决策。
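
For intuition on how a probabilistic prediction can inform ski rental, the snippet below simply minimizes the expected cost of every buy-threshold strategy under a predicted distribution over the number of ski days. This naive expectation-minimizing baseline has no worst-case guarantee; the paper's algorithms additionally hedge against the prediction being wrong so that the competitive ratio stays bounded.

```python
import numpy as np

def best_threshold(buy_cost, day_probs):
    """day_probs[x] = predicted probability of skiing exactly x days.
    Strategy "threshold t": rent on days 1..t-1, buy on day t if still skiing;
    t = len(day_probs) means never buy. Returns the expected-cost-minimizing t."""
    days = np.arange(len(day_probs))
    best_t, best_cost = None, float("inf")
    for t in range(1, len(day_probs) + 1):
        cost = np.where(days < t, days, (t - 1) + buy_cost)
        expected = float(cost @ day_probs)
        if expected < best_cost:
            best_t, best_cost = t, expected
    return best_t, best_cost

# Buying costs 10 rentals; the prediction: 5 ski days w.p. 0.7, 30 days w.p. 0.3.
probs = np.zeros(31)
probs[5], probs[30] = 0.7, 0.3
print(best_threshold(10, probs))   # buys on day 6 under this prediction
```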

Bias and Error Mitigation in Software-Generated Data: An Advanced Search and Optimization Framework Leveraging Generative Code Models

  • paper_url: http://arxiv.org/abs/2310.11546
  • repo_url: None
  • paper_authors: Ernesto Giralt Hernández
  • for: correcting errors and biases in software systems that specialize in data analysis and generation
  • methods: Solomonoff Induction, Kolmogorov Conditional Complexity, and generative code models (LLMs)
  • results: repeated application of the framework incrementally improves the quality of the output results
    Abstract Data generation and analysis is a fundamental aspect of many industries and disciplines, from strategic decision making in business to research in the physical and social sciences. However, data generated using software and algorithms can be subject to biases and errors. These can be due to problems with the original software, default settings that do not align with the specific needs of the situation, or even deeper problems with the underlying theories and models. This paper proposes an advanced search and optimization framework aimed at generating and choosing optimal source code capable of correcting errors and biases from previous versions to address typical problems in software systems specializing in data analysis and generation, especially those in the corporate and data science world. Applying this framework multiple times on the same software system would incrementally improve the quality of the output results. It uses Solomonoff Induction as a sound theoretical basis, extending it with Kolmogorov Conditional Complexity, a novel adaptation, to evaluate a set of candidate programs. We propose the use of generative models for the creation of this set of programs, with special emphasis on the capabilities of Large Language Models (LLMs) to generate high quality code.
    摘要 “数据生成和分析是许多行业和领域的基础方面,从商业战略决策到物理和社会科学研究。但是,由软件和算法生成的数据可能受到偏见和错误的影响。这些问题可能来自原始软件的问题、不适应特定情况的默认设置或更深层次的理论和模型问题。这篇论文提出了一种高级搜索和优化框架,用于生成和选择修正过去版本中的错误和偏见的最佳源代码。通过多次应用这种框架于同一个软件系统,可以逐步提高输出结果的质量。它基于索löмо夫推理为基础,并将其扩展到科尔莫果ров conditional complexity,一种新的适应,以评估候选程序集。我们建议使用生成模型来创建这些候选程序集,尤其是利用大型自然语言模型(LLMs)生成高质量代码。”

Thin and Deep Gaussian Processes

  • paper_url: http://arxiv.org/abs/2310.11527
  • repo_url: https://github.com/spectraldani/thindeepgps
  • paper_authors: Daniel Augusto de Souza, Alexander Nikitin, ST John, Magnus Ross, Mauricio A. Álvarez, Marc Peter Deisenroth, João P. P. Gomes, Diego Mesquita, César Lincoln C. Mattos
  • for: 这个论文的目的是提出一种新的深度 Gaussian process(TDGP)模型,以解决深度 Gaussian process(DGP)模型中的一些问题,如敏感性和解释性的缺失。
  • methods: TDGP 模型使用了successively parameterizing kernels with Gaussian process layers,这种方法可以学习输入数据的低维度表示,同时保持 kernel 的 interpretable 性。
  • results: 研究发现,TDGP 模型可以在标准的 benchmark 数据集上表现出色,并且可以适应增加层数的情况。此外,TDGP 模型可以学习低维度表示,并且不会出现特定的PATHOLOGIES。
    Abstract Gaussian processes (GPs) can provide a principled approach to uncertainty quantification with easy-to-interpret kernel hyperparameters, such as the lengthscale, which controls the correlation distance of function values. However, selecting an appropriate kernel can be challenging. Deep GPs avoid manual kernel engineering by successively parameterizing kernels with GP layers, allowing them to learn low-dimensional embeddings of the inputs that explain the output data. Following the architecture of deep neural networks, the most common deep GPs warp the input space layer-by-layer but lose all the interpretability of shallow GPs. An alternative construction is to successively parameterize the lengthscale of a kernel, improving the interpretability but ultimately giving away the notion of learning lower-dimensional embeddings. Unfortunately, both methods are susceptible to particular pathologies which may hinder fitting and limit their interpretability. This work proposes a novel synthesis of both previous approaches: Thin and Deep GP (TDGP). Each TDGP layer defines locally linear transformations of the original input data maintaining the concept of latent embeddings while also retaining the interpretation of lengthscales of a kernel. Moreover, unlike the prior solutions, TDGP induces non-pathological manifolds that admit learning lower-dimensional representations. We show with theoretical and experimental results that i) TDGP is, unlike previous models, tailored to specifically discover lower-dimensional manifolds in the input data, ii) TDGP behaves well when increasing the number of layers, and iii) TDGP performs well in standard benchmark datasets.
    摘要 高斯过程(GPs)能够通过易于解释的核超参数(例如控制函数值相关距离的长度尺度 lengthscale)为不确定性量化提供一种有原则的方法。然而,选择合适的核往往并不容易。深度高斯过程通过用 GP 层逐层参数化核来避免人工设计核,从而学习能够解释输出数据的低维输入嵌入。最常见的深度 GP 沿用深度神经网络的架构,逐层扭曲输入空间,但因此完全失去了浅层 GP 的可解释性。另一种构造方式是逐层参数化核的长度尺度,这提高了可解释性,但最终放弃了学习低维嵌入的思想。不幸的是,这两类方法都存在特定的病态问题,可能妨碍拟合并限制其可解释性。本工作提出了将上述两种思路结合起来的新方法:Thin and Deep GP(TDGP)。每个 TDGP 层对原始输入数据定义局部线性变换,既保留了潜在嵌入的概念,又保留了核长度尺度的解释。此外,与先前的方案不同,TDGP 诱导出非病态的流形,能够学习低维表示。我们通过理论和实验结果表明:i) 与以往模型不同,TDGP 专门适合发现输入数据中的低维流形;ii) TDGP 在增加层数时表现良好;iii) TDGP 在标准基准数据集上表现出色。

Value-Biased Maximum Likelihood Estimation for Model-based Reinforcement Learning in Discounted Linear MDPs

  • paper_url: http://arxiv.org/abs/2310.11515
  • repo_url: None
  • paper_authors: Yu-Heng Hung, Ping-Chun Hsieh, Akshay Mete, P. R. Kumar
  • for: linear Markov Decision Processes (MDPs) with infinite horizon and linearly parameterized transition probabilities
  • methods: Value-Biased Maximum Likelihood Estimation (VBMLE)
  • results: $\widetilde{O}(d\sqrt{T})$ regret, computationally more efficient than existing regression-based approaches, and a generic convergence result of MLE in linear MDPs through a novel supermartingale construct.
    Abstract We consider the infinite-horizon linear Markov Decision Processes (MDPs), where the transition probabilities of the dynamic model can be linearly parameterized with the help of a predefined low-dimensional feature mapping. While the existing regression-based approaches have been theoretically shown to achieve nearly-optimal regret, they are computationally rather inefficient due to the need for a large number of optimization runs in each time step, especially when the state and action spaces are large. To address this issue, we propose to solve linear MDPs through the lens of Value-Biased Maximum Likelihood Estimation (VBMLE), which is a classic model-based exploration principle in the adaptive control literature for resolving the well-known closed-loop identification problem of Maximum Likelihood Estimation. We formally show that (i) VBMLE enjoys $\widetilde{O}(d\sqrt{T})$ regret, where $T$ is the time horizon and $d$ is the dimension of the model parameter, and (ii) VBMLE is computationally more efficient as it only requires solving one optimization problem in each time step. In our regret analysis, we offer a generic convergence result of MLE in linear MDPs through a novel supermartingale construct and uncover an interesting connection between linear MDPs and online learning, which could be of independent interest. Finally, the simulation results show that VBMLE significantly outperforms the benchmark method in terms of both empirical regret and computation time.
    摘要 我们考虑无穷远线性Markov决策过程(MDP),其过程概率转移可线性参数化通过一个固定的低维度特征映射。现有的回归方法有理论上可达到近似优劣 regret,但 computationally 较为慢,特别是当状态和动作空间较大时。为解决这个问题,我们提议通过Value-Biased Maximum Likelihood Estimation(VBMLE)解决linear MDPs,VBMLE 是适应控制文献中的一种经典的模型基于探索原理,用于解决Maximum Likelihood Estimation 的关闭loop标定问题。我们正式表明VBMLE 具有 $\widetilde{O}(d\sqrt{T})$ regret,其中 $T$ 是时间悬度,$d$ 是模型参数的维度,并且VBMLE computationally 更高效,只需在每个时间步骤中解决一个优化问题。在我们的 regret 分析中,我们提供了线性 MDPs 的MLE 的普适减少结果,并发现了线性 MDPs 与在线学习之间的有趣连接,这可能是独立的兴趣。最后,实验结果显示VBMLE 在 empirical regret 和计算时间上明显超过参考方法。

Stochastic Quantum Sampling for Non-Logconcave Distributions and Estimating Partition Functions

  • paper_url: http://arxiv.org/abs/2310.11445
  • repo_url: None
  • paper_authors: Guneykan Ozgul, Xiantao Li, Mehrdad Mahdavi, Chunhao Wang
  • for: 这个论文目标是设计一种量子算法来采样非几何均匀概率分布 $\pi(x) \propto \exp(-\beta f(x))$.
  • methods: 这个方法基于量子模拟热化算法,使用慢变化的马尔可夫链,并使用小批量的梯度诊断来实现量子步进。
  • results: 这个量子算法在维度和精度两个方面表现出了幂等速度的优化,比最佳known的类型算法更快。
    Abstract We present quantum algorithms for sampling from non-logconcave probability distributions in the form of $\pi(x) \propto \exp(-\beta f(x))$. Here, $f$ can be written as a finite sum $f(x):= \frac{1}{N}\sum_{k=1}^N f_k(x)$. Our approach is based on quantum simulated annealing on slowly varying Markov chains derived from unadjusted Langevin algorithms, removing the necessity for function evaluations which can be computationally expensive for large data sets in mixture modeling and multi-stable systems. We also incorporate a stochastic gradient oracle that implements the quantum walk operators inexactly by only using mini-batch gradients. As a result, our stochastic gradient based algorithm only accesses small subsets of data points in implementing the quantum walk. One challenge of quantizing the resulting Markov chains is that they do not satisfy the detailed balance condition in general. Consequently, the mixing time of the algorithm cannot be expressed in terms of the spectral gap of the transition density, making the quantum algorithms nontrivial to analyze. To overcome these challenges, we first build a hypothetical Markov chain that is reversible, and also converges to the target distribution. Then, we quantified the distance between our algorithm's output and the target distribution by using this hypothetical chain as a bridge to establish the total complexity. Our quantum algorithms exhibit polynomial speedups in terms of both dimension and precision dependencies when compared to the best-known classical algorithms.
    摘要 我们提出了量子算法用于采样非几何均勋分布,其形式为 $\pi(x) \propto \exp(-\beta f(x))$。其中,$f$ 可以写作 finite sum $f(x):= \frac{1}{N}\sum_{k=1}^N f_k(x)$。我们的方法基于量子模拟热化法,使用慢变化的马尔可夫链,从无调整的勒文算法中 derivation,从而消除了计算成本较高的函数评估,特别是在混合模型和多稳定系统中。我们还使用 Stochastic gradient oracle,通过使用小批量评估来实现量子步进 operator。因此,我们的 Stochastic gradient 基本算法只需访问小 subsets of data points,实现量子步进。一个挑战是量化得到的马尔可夫链不满足细化平衡条件,因此我们无法通过spectral gap 来衡量混合时间。为了突破这些挑战,我们首先建立一个假的马尔可夫链,该链是可逆的,并且 converge 到目标分布。然后,我们使用这个假链作为桥,来衡量我们算法的输出和目标分布之间的距离。我们的量子算法在维度和精度上都 exhibit 对比 classical algorithms 的多项式减速。

Identifying Interpretable Visual Features in Artificial and Biological Neural Systems

  • paper_url: http://arxiv.org/abs/2310.11431
  • repo_url: None
  • paper_authors: David Klindt, Sophia Sanborn, Francisco Acosta, Frédéric Poitevin, Nina Miolane
  • for: 这个论文的目的是为了探讨深度神经网络中单个神经元的可解释性,以及是否存在多个无关的特征在同一个神经元中的表现。
  • methods: 这篇论文使用了一种自动化的可解释性量化方法,该方法基于大量的人类 psychophysics 判断神经元可解释性的数据库,并且还提出了一种在网络活动空间找到有意义的方向的方法。
  • results: 研究发现,使用这种方法可以在卷积神经网络中找到更加直观的方向,这些方向不同于单个神经元的表现。此外,研究还应用了这种方法于三个最近发表的视觉神经响应数据集,并发现结论大致传递到实际神经数据中,这建议superposition可能被脑部实现。这也提出了关于稳定、高效和分解表示的基本问题,并且与分解有关。
    Abstract Single neurons in neural networks are often interpretable in that they represent individual, intuitively meaningful features. However, many neurons exhibit $\textit{mixed selectivity}$, i.e., they represent multiple unrelated features. A recent hypothesis proposes that features in deep networks may be represented in $\textit{superposition}$, i.e., on non-orthogonal axes by multiple neurons, since the number of possible interpretable features in natural data is generally larger than the number of neurons in a given network. Accordingly, we should be able to find meaningful directions in activation space that are not aligned with individual neurons. Here, we propose (1) an automated method for quantifying visual interpretability that is validated against a large database of human psychophysics judgments of neuron interpretability, and (2) an approach for finding meaningful directions in network activation space. We leverage these methods to discover directions in convolutional neural networks that are more intuitively meaningful than individual neurons, as we confirm and investigate in a series of analyses. Moreover, we apply the same method to three recent datasets of visual neural responses in the brain and find that our conclusions largely transfer to real neural data, suggesting that superposition might be deployed by the brain. This also provides a link with disentanglement and raises fundamental questions about robust, efficient and factorized representations in both artificial and biological neural systems.
    摘要 单一神经元在神经网络中经常是可解释的,它们表示单一、直觉的特征。然而,许多神经元会表现出混合选择性,即它们表示多个无关的特征。一个最近的假设提出了,内部特征在深度网络中可能会被表示为组合,即在非正交的轴上由多个神经元表示。由于自然数据中的可解释特征的数量通常大于给定网络中的神经元数量,因此我们应该能够在网络启动空间中找到有意义的方向。我们提出了以下两个方法来进行这些研究:1. 一个自动化的方法来评估视觉可解释性,该方法被验证了一个大量的人类心理学评价神经元可解释性的数据库。2. 一种方法来在网络启动空间中找到有意义的方向,这些方法可以在实际的神经网络中发现更直觉的方向。我们运用这些方法发现,对于某些问题,深度网络中的活动空间中的方向可能更直觉、更有意义,并且在实际的神经网络中发现了这些方向。此外,我们将这些方法应用到了三个最近的视觉神经反应数据中,发现结果大多转移到了实际的神经资料中,这表明了组合可能被脑部使用。这还提供了与分离开来的连结,并提出了基本问题,例如如何实现可靠、高效和分离的表示在人工和生物神经系统中。

Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression

  • paper_url: http://arxiv.org/abs/2310.11428
  • repo_url: None
  • paper_authors: Adam Block, Dylan J. Foster, Akshay Krishnamurthy, Max Simchowitz, Cyril Zhang
  • for: This paper studies the issue of training instability in behavior cloning with deep neural networks, specifically the sharp oscillations in long-horizon rewards that can occur during training.
  • methods: The authors use minibatch SGD updates to the policy network during training, and empirically disentangle the statistical and computational causes of the oscillations. They also test several standard mitigation techniques and find an exponential moving average (EMA) of iterates to be effective in alleviating the issue.
  • results: The authors show that GVA is a common phenomenon in both continuous control and autoregressive language generation, and that EMA can effectively mitigate it. They also provide theoretical vignettes to explain the benefits of EMA in alleviating GVA and shed light on the extent to which classical convex models can help in understanding the benefits of iterate averaging in deep learning.
    Abstract This work studies training instabilities of behavior cloning with deep neural networks. We observe that minibatch SGD updates to the policy network during training result in sharp oscillations in long-horizon rewards, despite negligibly affecting the behavior cloning loss. We empirically disentangle the statistical and computational causes of these oscillations, and find them to stem from the chaotic propagation of minibatch SGD noise through unstable closed-loop dynamics. While SGD noise is benign in the single-step action prediction objective, it results in catastrophic error accumulation over long horizons, an effect we term gradient variance amplification (GVA). We show that many standard mitigation techniques do not alleviate GVA, but find an exponential moving average (EMA) of iterates to be surprisingly effective at doing so. We illustrate the generality of this phenomenon by showing the existence of GVA and its amelioration by EMA in both continuous control and autoregressive language generation. Finally, we provide theoretical vignettes that highlight the benefits of EMA in alleviating GVA and shed light on the extent to which classical convex models can help in understanding the benefits of iterate averaging in deep learning.
    摘要
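
Since the paper's most effective mitigation is an exponential moving average (EMA) of the policy iterates, a minimal sketch of iterate averaging is shown below; in practice one would also track buffers (e.g. batch-norm statistics) and tune the decay.

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay=0.999):
    """Exponential moving average of parameters (iterate averaging)."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage sketch inside a behavior-cloning loop (names are illustrative):
#   policy = make_policy(); ema_policy = copy.deepcopy(policy)
#   for batch in loader:
#       loss = bc_loss(policy, batch)
#       loss.backward(); optimizer.step(); optimizer.zero_grad()
#       update_ema(ema_policy, policy)
#   # evaluate / roll out ema_policy rather than the raw iterate
```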

Group-blind optimal transport to group parity and its constrained variants

  • paper_url: http://arxiv.org/abs/2310.11407
  • repo_url: None
  • paper_authors: Quan Zhou, Jakub Marecek
  • for: 这个论文的目的是为了实现不受人类特征(如性别、种族)影响的机器学习模型。
  • methods: 这个论文使用了一种单组盲投影 map,将源数据中两个组的特征分布都 align,实现(人口)组均点,无需个体样本中的敏感属性值。
  • results: 作者通过使用真实数据和 sintetic数据进行数值实验,证明了这种方法可以实现不受敏感属性影响的机器学习模型。
    Abstract Fairness holds a pivotal role in the realm of machine learning, particularly when it comes to addressing groups categorised by sensitive attributes, e.g., gender, race. Prevailing algorithms in fair learning predominantly hinge on accessibility or estimations of these sensitive attributes, at least in the training process. We design a single group-blind projection map that aligns the feature distributions of both groups in the source data, achieving (demographic) group parity, without requiring values of the protected attribute for individual samples in the computation of the map, as well as its use. Instead, our approach utilises the feature distributions of the privileged and unprivileged groups in a boarder population and the essential assumption that the source data are unbiased representation of the population. We present numerical results on synthetic data and real data.
    摘要 公平性在机器学习中扮演着关键角色,特别是在处理由敏感属性(如性别、种族)划分的群体时。现有的公平学习算法大多至少在训练过程中依赖于对这些敏感属性的访问或估计。我们设计了一个单一的、对群体信息不可见的投影映射,将源数据中两个群体的特征分布对齐,从而实现(人口统计学意义上的)群体均等,而无需在计算或使用该映射时获取个体样本的受保护属性值。取而代之,我们的方法利用更大总体中优势群体和弱势群体的特征分布,以及源数据是总体的无偏代表这一基本假设。我们在合成数据和真实数据上给出了数值结果。

Enhancing Group Fairness in Online Settings Using Oblique Decision Forests

  • paper_url: http://arxiv.org/abs/2310.11401
  • repo_url: None
  • paper_authors: Somnath Basu Roy Chowdhury, Nicholas Monath, Ahmad Beirami, Rahul Kidambi, Avinava Dubey, Amr Ahmed, Snigdha Chaturvedi
  • for: 这篇论文的目的是提出一种在线上环境中实现公平的机器学习系统,以确保不同群体之间的公平。
  • methods: 这篇论文提出了一个名为Aranyani的ensemble of oblique decision trees的方法,可以在线上环境中实现公平的决策。Aranyani使用了树结构,允许在决策时计算公平度量,并且可以快速计算公平度量,不需要额外的储存和前向/后向通过。
  • results: 实验结果显示,Aranyani 在 5 个公开基准(包括视觉和语言数据集)上取得了比基线方法更好的准确性-公平性权衡。
    Abstract Fairness, especially group fairness, is an important consideration in the context of machine learning systems. The most commonly adopted group fairness-enhancing techniques are in-processing methods that rely on a mixture of a fairness objective (e.g., demographic parity) and a task-specific objective (e.g., cross-entropy) during the training process. However, when data arrives in an online fashion -- one instance at a time -- optimizing such fairness objectives poses several challenges. In particular, group fairness objectives are defined using expectations of predictions across different demographic groups. In the online setting, where the algorithm has access to a single instance at a time, estimating the group fairness objective requires additional storage and significantly more computation (e.g., forward/backward passes) than the task-specific objective at every time step. In this paper, we propose Aranyani, an ensemble of oblique decision trees, to make fair decisions in online settings. The hierarchical tree structure of Aranyani enables parameter isolation and allows us to efficiently compute the fairness gradients using aggregate statistics of previous decisions, eliminating the need for additional storage and forward/backward passes. We also present an efficient framework to train Aranyani and theoretically analyze several of its properties. We conduct empirical evaluations on 5 publicly available benchmarks (including vision and language datasets) to show that Aranyani achieves a better accuracy-fairness trade-off compared to baseline approaches.
    摘要 “公平性,特别是群体公平性,在机器学习系统中是一个重要考虑因素。通常运用的群体公平化技术是在训练过程中使用混合物的公平目标(例如人口平衡)和任务特定目标(例如十字项目)。但在线上数据来临时,实现这些公平目标是有挑战的。具体来说,群体公平目标是根据不同群体的预期预测结果定义的。在线上设置中,algorithm只有单独的实例,估计群体公平目标需要额外的存储和更多的计算(例如前向/后向通过)。在这篇论文中,我们提出Aranyani,一个以梯形树为基础的混合决策树,以确保在线上设置中做出公平的决策。Aranyani的树状架构允许参数隔离和通过先前的决策统计资料来计算公平的梯度,无需额外的存储和前向/后向通过。我们还提供了一个有效的训练框架和理论分析多个性能。我们在5个公开可用的benchmark(包括视觉和语言dataset)进行实验评估,发现Aranyani在精度-公平性贡献中比基准方法更好。”

Last One Standing: A Comparative Analysis of Security and Privacy of Soft Prompt Tuning, LoRA, and In-Context Learning

  • paper_url: http://arxiv.org/abs/2310.11397
  • repo_url: None
  • paper_authors: Rui Wen, Tianhao Wang, Michael Backes, Yang Zhang, Ahmed Salem
  • for: 这篇论文旨在探讨大语言模型(LLM)适用私有数据时的隐私和安全问题。
  • methods: 论文使用了三种已知技术来适应LLM:Low-Rank Adaptation(LoRA)、Soft Prompt Tuning(SPT)和In-Context Learning(ICL)。
  • results: 研究发现,无一种适合所有隐私和安全需求的适应技术,每种技术都有不同的优劣点。
    Abstract Large Language Models (LLMs) are powerful tools for natural language processing, enabling novel applications and user experiences. However, to achieve optimal performance, LLMs often require adaptation with private data, which poses privacy and security challenges. Several techniques have been proposed to adapt LLMs with private data, such as Low-Rank Adaptation (LoRA), Soft Prompt Tuning (SPT), and In-Context Learning (ICL), but their comparative privacy and security properties have not been systematically investigated. In this work, we fill this gap by evaluating the robustness of LoRA, SPT, and ICL against three types of well-established attacks: membership inference, which exposes data leakage (privacy); backdoor, which injects malicious behavior (security); and model stealing, which can violate intellectual property (privacy and security). Our results show that there is no silver bullet for privacy and security in LLM adaptation and each technique has different strengths and weaknesses.
    摘要

VaR\ and CVaR Estimation in a Markov Cost Process: Lower and Upper Bounds

  • paper_url: http://arxiv.org/abs/2310.11389
  • repo_url: None
  • paper_authors: Sanjay Bhat, Prashanth L. A., Gugan Thoppe
  • for: estimating the Value-at-Risk (VaR) and the Conditional Value-at-Risk (CVaR) of the infinite-horizon discounted cost in a Markov cost process
  • methods: the paper first derives a minimax lower bound of $\Omega(1/\sqrt{n})$ that holds both in an expected and in a probabilistic sense, and then uses a finite-horizon truncation scheme to derive an upper bound on the CVaR estimation error that matches the lower bound up to constant factors
  • results: in the Markovian setting, the estimation scheme yields lower and upper bounds on the estimation error for a broad class of risk measures, including spectral risk measures and utility-based shortfall risk; the lower bounds also extend to the mean of the infinite-horizon discounted cost and improve upon the existing $\Omega(1/n)$ result [13]
    Abstract We tackle the problem of estimating the Value-at-Risk (VaR) and the Conditional Value-at-Risk (CVaR) of the infinite-horizon discounted cost within a Markov cost process. First, we derive a minimax lower bound of $\Omega(1/\sqrt{n})$ that holds both in an expected and in a probabilistic sense. Then, using a finite-horizon truncation scheme, we derive an upper bound for the error in CVaR estimation, which matches our lower bound up to constant factors. Finally, we discuss an extension of our estimation scheme that covers more general risk measures satisfying a certain continuity criterion, e.g., spectral risk measures, utility-based shortfall risk. To the best of our knowledge, our work is the first to provide lower and upper bounds on the estimation error for any risk measure within Markovian settings. We remark that our lower bounds also extend to the infinite-horizon discounted costs' mean. Even in that case, our result $\Omega(1/\sqrt{n}) $ improves upon the existing result $\Omega(1/n)$[13].
    摘要 我们研究了在马尔可夫过程中估计值风险(VaR)和条件值风险(CVaR)的问题。首先,我们 deriv了一个最小最大下界为 $\Omega(1/\sqrt{n})$,这个下界在预期上和概率上都成立。然后,使用一种 finite-horizon truncation scheme,我们 deriv了 CVaR 估计错误的Upper bound,与我们的下界几乎相同。最后,我们讨论了我们的估计方案的扩展,覆盖更加一般的风险度量,如спектраль风险度量和utilities-based shortfall风险。据我们所知,我们的工作是在马尔可夫 Setting 中提供了任何风险度量的下界和上界。我们的下界还扩展到了无限期折抵费用的mean。甚至在那种情况下,我们的结果 $\Omega(1/\sqrt{n})$ 超越了现有的结果 $\Omega(1/n)$ [13].
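A minimal numpy sketch of the finite-horizon truncation idea: sample truncated discounted costs from a toy Markov cost process, then plug empirical quantiles into VaR and CVaR. The chain, the truncation horizon, and the plug-in estimator are illustrative assumptions, not the paper's exact scheme or its bounds.

```python
import numpy as np

def sample_truncated_cost(P, c, gamma, horizon, rng, s0=0):
    """One sample of the discounted cost, truncated at `horizon` steps."""
    s, total = s0, 0.0
    for t in range(horizon):
        total += (gamma ** t) * c[s]
        s = rng.choice(len(c), p=P[s])
    return total

def var_cvar(samples, alpha=0.95):
    var = np.quantile(samples, alpha)         # Value-at-Risk at level alpha
    cvar = samples[samples >= var].mean()     # Conditional VaR: mean of the tail
    return var, cvar

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1], [0.5, 0.5]])        # toy 2-state Markov chain
c = np.array([0.0, 1.0])                      # per-state cost in [0, 1]
gamma, horizon = 0.9, 80                      # gamma**80 is tiny, so the truncated tail is negligible
samples = np.array([sample_truncated_cost(P, c, gamma, horizon, rng) for _ in range(20000)])
print(var_cvar(samples))
```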

Faster Algorithms for Generalized Mean Densest Subgraph Problem

  • paper_url: http://arxiv.org/abs/2310.11377
  • repo_url: None
  • paper_authors: Chenglin Fan, Ping Li, Hanyu Peng
  • for: This work addresses the $p$-mean densest subgraph problem: finding the subgraph with the highest average $p$-th-power degree, which generalizes the standard densest subgraph objective.
  • methods: It proposes a generalized peeling algorithm (GENPEEL) with $O(mn)$ runtime and a $(p+1)^{1/p}$ approximation ratio for $p \geq 1$, and a faster variant (GENPEEL++) with $O(m\log n)$ runtime and a $(2(p+1))^{1/p}$ approximation ratio for $p \in [1, +\infty)$.
  • results: It also shows that for $0 < p < 1$ the standard peeling algorithm yields a $2^{1/p}$ approximation ratio, while GENPEEL and GENPEEL++ attain the $(p+1)^{1/p}$ and $(2(p+1))^{1/p}$ ratios stated above over their respective ranges of $p$.
    Abstract The densest subgraph of a large graph usually refers to some subgraph with the highest average degree, which has been extended to the family of $p$-means dense subgraph objectives by~\citet{veldt2021generalized}. The $p$-mean densest subgraph problem seeks a subgraph with the highest average $p$-th-power degree, whereas the standard densest subgraph problem seeks a subgraph with a simple highest average degree. It was shown that the standard peeling algorithm can perform arbitrarily poorly on the generalized objective when $p>1$, but its behavior was unclear when $0 < p < 1$.
    摘要 通常情况下,最密集子图(densest subgraph)指的是一个具有最高平均度的子图。这个概念在$p$-means dense subgraph目标家族中被推广,其中$p$-mean densest subgraph问题寻找一个具有最高$p$-th-power度的子图;与之不同,标准的最密集子图问题仅寻找一个具有最高平均度的子图。当$p>1$时,标准的剥离算法在推广目标上可能表现得任意差,而当$0<p<1$时的情况此前尚不明确。
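For intuition about the objective, here is a naive greedy-peeling sketch for the $p$-mean density (average $p$-th-power degree) that repeatedly removes a low-degree vertex and keeps the best intermediate subgraph. It is a simplification for illustration only; GENPEEL and GENPEEL++ use a more careful removal rule and bookkeeping to obtain the stated runtimes and approximation ratios.

```python
import numpy as np

def p_mean_density(adj, nodes, p):
    """Average p-th power degree of the subgraph induced by `nodes`."""
    if not nodes:
        return 0.0
    degs = np.array([len(adj[v] & nodes) for v in nodes], dtype=float)
    return (degs ** p).mean()

def greedy_peel(adj, p):
    """Peel vertices one by one; return the best-scoring intermediate subgraph."""
    nodes = set(adj)
    best_nodes, best_score = set(nodes), p_mean_density(adj, nodes, p)
    while len(nodes) > 1:
        # simple heuristic: remove the vertex with the smallest current degree
        v = min(nodes, key=lambda u: len(adj[u] & nodes))
        nodes.remove(v)
        score = p_mean_density(adj, nodes, p)
        if score > best_score:
            best_nodes, best_score = set(nodes), score
    return best_nodes, best_score

# toy graph: a 4-clique with one pendant vertex
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
adj = {v: set() for v in range(5)}
for a, b in edges:
    adj[a].add(b); adj[b].add(a)
print(greedy_peel(adj, p=2.0))   # expect the 4-clique {0, 1, 2, 3}
```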

Lie Group Decompositions for Equivariant Neural Networks

  • paper_url: http://arxiv.org/abs/2310.11366
  • repo_url: None
  • paper_authors: Mircea Mironenco, Patrick Forré
  • for: The goal is to build neural network models that are invariant or equivariant to geometric transformations, which is especially valuable in the low-data regime.
  • methods: The paper uses the structure and geometry of Lie groups and their homogeneous spaces, together with the Lie algebra and the group exponential and logarithm maps; the "larger" groups $G = \text{GL}^{+}(n, \mathbb{R})$ and $G = \text{SL}(n, \mathbb{R})$ (and their affine actions $\mathbb{R}^{n} \rtimes G$) are decomposed into subgroups and submanifolds to obtain invariant integration and a global parametrization, from which affine-equivariant convolution kernels are parametrized.
  • results: On the standard affine-invariant benchmark classification task, the model outperforms all previous equivariant models as well as all Capsule Network proposals, demonstrating robustness and out-of-distribution generalisation.
    Abstract Invariance and equivariance to geometrical transformations have proven to be very useful inductive biases when training (convolutional) neural network models, especially in the low-data regime. Much work has focused on the case where the symmetry group employed is compact or abelian, or both. Recent work has explored enlarging the class of transformations used to the case of Lie groups, principally through the use of their Lie algebra, as well as the group exponential and logarithm maps. The applicability of such methods to larger transformation groups is limited by the fact that depending on the group of interest $G$, the exponential map may not be surjective. Further limitations are encountered when $G$ is neither compact nor abelian. Using the structure and geometry of Lie groups and their homogeneous spaces, we present a framework by which it is possible to work with such groups primarily focusing on the Lie groups $G = \text{GL}^{+}(n, \mathbb{R})$ and $G = \text{SL}(n, \mathbb{R})$, as well as their representation as affine transformations $\mathbb{R}^{n} \rtimes G$. Invariant integration as well as a global parametrization is realized by decomposing the `larger` groups into subgroups and submanifolds which can be handled individually. Under this framework, we show how convolution kernels can be parametrized to build models equivariant with respect to affine transformations. We evaluate the robustness and out-of-distribution generalisation capability of our model on the standard affine-invariant benchmark classification task, where we outperform all previous equivariant models as well as all Capsule Network proposals.
    摘要 固有和等变征对几何变换有利,特别是在数据缺乏时。许多研究都集中在 компакт或很小的symmetry group上。现在的工作探索了使用Lie group的方法,包括Lie algebra、组 exponential和logarithm maps。然而,这些方法的应用 scope limited by the fact that the exponential map may not be surjective, and further limitations are encountered when the group of interest $G$ is neither compact nor abelian.我们使用 Lie group的结构和几何特性,提出一个框架,可以让我们在 $G = \text{GL}^{+}(n, \mathbb{R})$ 和 $G = \text{SL}(n, \mathbb{R})$ 上工作,以及它们的表示为抽象变换 $\mathbb{R}^{n} \rtimes G$。我们可以通过将这些 '大' 组织分解成子组织和子抽象变换,并将它们处理一个一个来实现不变 интеграл和全局参数化。在这个框架下,我们可以设计卷积核心来构建对抽象变换具有不变性的模型。我们在标准对称变换分类任务上评估了我们的模型的稳定性和 OUT-OF-distribution泛化能力,并超越了所有均衡变换模型以及所有卷积网络提议。

Contextualized Machine Learning

  • paper_url: http://arxiv.org/abs/2310.11340
  • repo_url: https://github.com/SAP/contextual-ai
  • paper_authors: Benjamin Lengerich, Caleb N. Ellington, Andrea Rubbi, Manolis Kellis, Eric P. Xing
  • for: This paper examines Contextualized Machine Learning (ML), a paradigm for learning heterogeneous and context-dependent effects.
  • methods: Deep learning is applied to the meta-relationship between contextual information and context-specific parametric models, translating sample context into model parameters; this varying-coefficient view unifies existing frameworks such as cluster analysis and cohort modeling.
  • results: The framework supports nonparametric inference and comes with identifiability conditions; the authors release an open-source PyTorch package, ContextualizedML.
    Abstract We examine Contextualized Machine Learning (ML), a paradigm for learning heterogeneous and context-dependent effects. Contextualized ML estimates heterogeneous functions by applying deep learning to the meta-relationship between contextual information and context-specific parametric models. This is a form of varying-coefficient modeling that unifies existing frameworks including cluster analysis and cohort modeling by introducing two reusable concepts: a context encoder which translates sample context into model parameters, and sample-specific model which operates on sample predictors. We review the process of developing contextualized models, nonparametric inference from contextualized models, and identifiability conditions of contextualized models. Finally, we present the open-source PyTorch package ContextualizedML.
    摘要 我们研究Contextualized Machine Learning(ML),一种学习不同和上下文依赖的效果的方法。Contextualized ML使用深度学习来模型meta关系,即样本上下文信息和样本特定的参数模型之间的关系。这是一种 varying-coefficient modeling,可以统一现有的框架,包括集群分析和团队模型,通过引入两个可重用概念:样本上下文编码器,将样本上下文转换为模型参数,以及样本特定的模型,对样本预测变量进行操作。我们详细介绍了Contextualized模型的开发、非参数推断、和Contextualized模型的可识别条件。最后,我们发布了一个开源的PyTorch包,即ContextualizedML。
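A minimal PyTorch sketch of the two reusable concepts named in the abstract: a context encoder that maps context to model parameters, and a sample-specific linear model that applies them. Layer sizes, the linear-model choice, and the toy task are assumptions; this is not the API of the ContextualizedML package.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Maps a context vector c to the parameters (w, b) of a per-sample model."""
    def __init__(self, context_dim, pred_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, pred_dim + 1),   # weights plus bias
        )
        self.pred_dim = pred_dim

    def forward(self, c):
        theta = self.net(c)
        return theta[..., :self.pred_dim], theta[..., self.pred_dim:]

def sample_specific_predict(encoder, context, predictors):
    w, b = encoder(context)                        # varying coefficients per sample
    return (w * predictors).sum(dim=-1, keepdim=True) + b

# toy fit: y = (2*c) * x, so the true coefficient varies with context
torch.manual_seed(0)
enc = ContextEncoder(context_dim=1, pred_dim=1)
opt = torch.optim.Adam(enc.parameters(), lr=1e-2)
for step in range(2000):
    c, x = torch.rand(64, 1), torch.randn(64, 1)
    y = (2 * c) * x
    loss = ((sample_specific_predict(enc, c, x) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print("learned coefficient at c=0.5:", enc(torch.tensor([[0.5]]))[0].item())  # ~1.0
```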

Non-ergodicity in reinforcement learning: robustness via ergodicity transformations

  • paper_url: http://arxiv.org/abs/2310.11335
  • repo_url: None
  • paper_authors: Dominik Baumann, Erfaun Noorani, James Price, Ole Peters, Colm Connaughton, Thomas B. Schön
  • for: This paper aims to address the issue of non-robustness in reinforcement learning (RL) algorithms, specifically in real-world applications such as autonomous driving, precision agriculture, and finance.
  • methods: The authors propose a new algorithm for learning ergodicity transformations from data, which enables the optimization of long-term return for individual agents rather than the average across infinitely many trajectories.
  • results: The proposed algorithm is demonstrated to be effective in an instructive, non-ergodic environment and on standard RL benchmarks.
    Abstract Envisioned application areas for reinforcement learning (RL) include autonomous driving, precision agriculture, and finance, which all require RL agents to make decisions in the real world. A significant challenge hindering the adoption of RL methods in these domains is the non-robustness of conventional algorithms. In this paper, we argue that a fundamental issue contributing to this lack of robustness lies in the focus on the expected value of the return as the sole "correct" optimization objective. The expected value is the average over the statistical ensemble of infinitely many trajectories. For non-ergodic returns, this average differs from the average over a single but infinitely long trajectory. Consequently, optimizing the expected value can lead to policies that yield exceptionally high returns with probability zero but almost surely result in catastrophic outcomes. This problem can be circumvented by transforming the time series of collected returns into one with ergodic increments. This transformation enables learning robust policies by optimizing the long-term return for individual agents rather than the average across infinitely many trajectories. We propose an algorithm for learning ergodicity transformations from data and demonstrate its effectiveness in an instructive, non-ergodic environment and on standard RL benchmarks.
    摘要 拟合应用领域 для强化学习(RL)包括自动驾驶、精细农业和金融,这些领域都需要RL代理人做出实际世界中的决策。然而,现有的RL方法在这些领域的应用受到一定的阻碍。在这篇论文中,我们认为RL方法的一个基本问题在于强调预期返回值作为唯一的“正确”优化目标。预期返回值是统计ensemble中的平均值,对于非ergodic返回,这个平均值与单个但是无限长的轨迹的平均值不同。因此,优化预期返回可能导致政策产生极高的返回,但是几乎确定会导致灾难性的结果。这个问题可以通过将收集到的返回时间序列转换成一个ergodic增量来解决。这种转换允许学习 robust政策,而不是优化infinitely多轨迹的平均值。我们提出了一种从数据中学习ergodicity转换的算法,并在一个 instructive、非ergodic环境中和标准RLbenchmark上进行了证明。
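A small numpy illustration of the ensemble-average versus time-average gap the abstract refers to, using the textbook multiplicative coin-toss gamble, for which the logarithm is the standard ergodicity transformation. The gamble and its parameters are illustrative and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
up, down, T, n_agents = 1.5, 0.6, 1000, 100_000

# multiplicative gamble: wealth is multiplied by 1.5 or 0.6 with probability 1/2 each
print("expected one-step factor:", 0.5 * up + 0.5 * down)        # 1.05 > 1, ensemble average grows

# ... yet almost every individual trajectory decays, because the time-average
# growth rate is E[log factor] = 0.5*log(1.5) + 0.5*log(0.6) < 0
k_up = rng.binomial(T, 0.5, size=n_agents)                       # number of "up" steps per agent
log_wealth = k_up * np.log(up) + (T - k_up) * np.log(down)
print("mean log growth per step:", log_wealth.mean() / T)        # ~ -0.053
print("fraction of agents above initial wealth:", (log_wealth > 0).mean())

# Optimizing E[wealth] rewards rare lucky trajectories; optimizing the
# log-transformed (ergodic) increments targets what a single agent experiences.
```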

Elucidating The Design Space of Classifier-Guided Diffusion Generation

  • paper_url: http://arxiv.org/abs/2310.11311
  • repo_url: https://github.com/alexmaols/elucd
  • paper_authors: Jiajun Ma, Tianyang Hu, Wenjia Wang, Jiacheng Sun
  • for: The goal is to improve the sample quality and controllability of diffusion models; mainstream guidance methods require extra training on labeled data, while existing training-free methods have not yet demonstrated comparable performance.
  • methods: Through a comprehensive investigation of the design space, the paper shows that off-the-shelf pretrained classifiers can be exploited in a training-free fashion, and proposes several calibration-based pre-conditioning techniques to better use them for guiding diffusion generation.
  • results: Extensive experiments on ImageNet show that state-of-the-art diffusion models (DDPM, EDM, DiT) are improved by up to 20% with barely any extra computational cost, and the approach scales readily to text-to-image generation. Code: https://github.com/AlexMaOLS/EluCD/tree/main.
    Abstract Guidance in conditional diffusion generation is of great importance for sample quality and controllability. However, existing guidance schemes are to be desired. On one hand, mainstream methods such as classifier guidance and classifier-free guidance both require extra training with labeled data, which is time-consuming and unable to adapt to new conditions. On the other hand, training-free methods such as universal guidance, though more flexible, have yet to demonstrate comparable performance. In this work, through a comprehensive investigation into the design space, we show that it is possible to achieve significant performance improvements over existing guidance schemes by leveraging off-the-shelf classifiers in a training-free fashion, enjoying the best of both worlds. Employing calibration as a general guideline, we propose several pre-conditioning techniques to better exploit pretrained off-the-shelf classifiers for guiding diffusion generation. Extensive experiments on ImageNet validate our proposed method, showing that state-of-the-art diffusion models (DDPM, EDM, DiT) can be further improved (up to 20%) using off-the-shelf classifiers with barely any extra computational cost. With the proliferation of publicly available pretrained classifiers, our proposed approach has great potential and can be readily scaled up to text-to-image generation tasks. The code is available at https://github.com/AlexMaOLS/EluCD/tree/main.
    摘要 Diffusion模型的指导是样本质量和可控性方面的关键因素。然而,现有的指导方法尚不够。一方面,主流方法如类型器指导和类型器无指导都需要额外的训练频道标注数据,时间consuming并不适应新条件。另一方面,无需训练的方法如通用指导,虽然更灵活,尚未达到相关性表现。在这种情况下,我们通过对设计空间的全面调查,展示了可以通过利用准备好的类型器来获得显著性能提升,同时兼得到最佳的两个世界。采用准确性为总则,我们提出了一些预处理技巧来更好地利用预训练的类型器来导向扩散生成。广泛的实验 validate我们的提议方法,显示了使用预训练的类型器可以提高状态当前的扩散模型(DDPM、EDM、DiT)的性能(最高提升20%),而且几乎没有额外的计算成本。随着公共预训练类型器的普及,我们的提议方法具有巨大的潜力,可以轻松扩展到文本到图生成任务。代码可以在 GitHub 上获取:https://github.com/AlexMaOLS/EluCD/tree/main。
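A hedged PyTorch sketch of training-free classifier guidance with a temperature-calibrated off-the-shelf classifier: the base score is shifted by the gradient of the calibrated log-probability of the target class. The sampler interface, the temperature, and the guidance scale are placeholders, and the paper's specific pre-conditioning techniques are not reproduced here.

```python
import torch
import torch.nn.functional as F

def guided_score(x_t, t, class_idx, base_score_fn, classifier, scale=2.0, temp=2.0):
    """score(x_t) + scale * grad_x log p_T(y | x_t), with a softmax temperature T."""
    x = x_t.detach().requires_grad_(True)
    logits = classifier(x, t)                      # frozen, off-the-shelf classifier
    log_prob = F.log_softmax(logits / temp, dim=-1)[:, class_idx].sum()
    grad = torch.autograd.grad(log_prob, x)[0]
    return base_score_fn(x_t, t) + scale * grad

# toy stand-ins so the function runs end to end
def base_score_fn(x, t):                           # pretend diffusion score network
    return -x

class ToyClassifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 3)
    def forward(self, x, t):
        return self.lin(x)

clf = ToyClassifier()
x_t = torch.randn(8, 4)
out = guided_score(x_t, t=torch.zeros(8), class_idx=1,
                   base_score_fn=base_score_fn, classifier=clf)
print(out.shape)   # torch.Size([8, 4])
```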

An Automatic Learning Rate Schedule Algorithm for Achieving Faster Convergence and Steeper Descent

  • paper_url: http://arxiv.org/abs/2310.11291
  • repo_url: None
  • paper_authors: Zhao Song, Chiwun Yang
  • for: This work aims to improve learning-rate scheduling in neural network training, in particular the convergence issues of the delta-bar-delta rule under noisy mini-batch gradients.
  • methods: It thoroughly investigates the convergence behavior of delta-bar-delta in real-world neural network optimization and proposes RDBD (Regrettable Delta-Bar-Delta), which promptly corrects biased learning-rate adjustments and can be seamlessly combined with any optimization algorithm.
  • results: Extensive experiments and evaluations show that RDBD overcomes the convergence issues in mini-batch optimization and improves the convergence speed of various optimizers.
    Abstract The delta-bar-delta algorithm is recognized as a learning rate adaptation technique that enhances the convergence speed of the training process in optimization by dynamically scheduling the learning rate based on the difference between the current and previous weight updates. While this algorithm has demonstrated strong competitiveness in full data optimization when compared to other state-of-the-art algorithms like Adam and SGD, it may encounter convergence issues in mini-batch optimization scenarios due to the presence of noisy gradients. In this study, we thoroughly investigate the convergence behavior of the delta-bar-delta algorithm in real-world neural network optimization. To address any potential convergence challenges, we propose a novel approach called RDBD (Regrettable Delta-Bar-Delta). Our approach allows for prompt correction of biased learning rate adjustments and ensures the convergence of the optimization process. Furthermore, we demonstrate that RDBD can be seamlessly integrated with any optimization algorithm and significantly improve the convergence speed. By conducting extensive experiments and evaluations, we validate the effectiveness and efficiency of our proposed RDBD approach. The results showcase its capability to overcome convergence issues in mini-batch optimization and its potential to enhance the convergence speed of various optimization algorithms. This research contributes to the advancement of optimization techniques in neural network training, providing practitioners with a reliable automatic learning rate scheduler for achieving faster convergence and improved optimization outcomes.
    摘要 delta-bar-delta 算法是一种学习率自适应技术,可以增加训练过程的速度并且在全数据优化中与其他当前标准算法如 Adam 和 SGD 进行比较。然而,在小批量优化场景下,这种算法可能会遇到收敛问题,这是因为梯度具有噪音。在这个研究中,我们对 delta-bar-delta 算法在实际神经网络优化中的收敛行为进行了全面的调查。为了解决任何可能出现的收敛挑战,我们提出了一种新的 Approach,即 RDBD(Regrettable Delta-Bar-Delta)。我们的方法可以快速更正偏导学习率调整,并确保优化过程的收敛。此外,我们证明了 RDBD 可以轻松地与任何优化算法结合使用,并显著提高优化速度。通过进行了广泛的实验和评估,我们证明了 RDBD 的效果和效率。结果表明,RDBD 可以在小批量优化中解决收敛问题,并且有可能在不同的优化算法中提高收敛速度。这项研究对神经网络训练中优化技术的进步做出了贡献,为实践者提供了一个可靠的自动学习率调整器,以实现更快的收敛和优化结果。
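The paper builds on the classic delta-bar-delta rule; below is a minimal numpy version of that base rule (additive learning-rate increase when the current gradient agrees in sign with an exponentially averaged past gradient, multiplicative decrease otherwise). The RDBD correction step is not specified in the abstract and is not reproduced; the constants are conventional placeholders.

```python
import numpy as np

def delta_bar_delta_step(w, grad, lrs, bar_delta, kappa=1e-3, phi=0.1, theta=0.7):
    """One update of parameters `w` with per-parameter learning rates `lrs`."""
    lrs = np.where(grad * bar_delta > 0, lrs + kappa, lrs)          # additive increase on sign agreement
    lrs = np.where(grad * bar_delta < 0, lrs * (1 - phi), lrs)      # multiplicative decrease on disagreement
    w = w - lrs * grad
    bar_delta = (1 - theta) * grad + theta * bar_delta              # exponential trace of past gradients
    return w, lrs, bar_delta

# toy problem: minimize 0.5 * ||w - target||^2 with noisy "mini-batch" gradients
rng = np.random.default_rng(0)
target = np.array([3.0, -2.0])
w, lrs, bar_delta = np.zeros(2), np.full(2, 0.01), np.zeros(2)
for step in range(500):
    grad = (w - target) + 0.1 * rng.normal(size=2)
    w, lrs, bar_delta = delta_bar_delta_step(w, grad, lrs, bar_delta)
print("parameters:", w, "per-parameter learning rates:", lrs)
```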

Evaluating the Impact of Humanitarian Aid on Food Security

  • paper_url: http://arxiv.org/abs/2310.11287
  • repo_url: None
  • paper_authors: Jordi Cerdà-Bautista, José María Tárraga, Vasileios Sitokonstantinou, Gustau Camps-Valls
  • for: The paper evaluates humanitarian food-security interventions in regions facing climate-change-induced droughts that demand urgent assistance.
  • methods: It introduces a causal inference framework for the Horn of Africa: identifying causal relationships within the food security system, harmonizing a comprehensive database, and estimating the causal effect of cash-based humanitarian interventions on malnutrition.
  • results: No significant effects were found, likely due to the limited sample size, suboptimal data quality, and an imperfect causal graph reflecting our limited understanding of multidisciplinary systems such as food security; this underscores the need for better data collection and causal models refined with domain experts, to improve transparency and accountability in humanitarian aid.
    Abstract In the face of climate change-induced droughts, vulnerable regions encounter severe threats to food security, demanding urgent humanitarian assistance. This paper introduces a causal inference framework for the Horn of Africa, aiming to assess the impact of cash-based interventions on food crises. Our contributions encompass identifying causal relationships within the food security system, harmonizing a comprehensive database, and estimating the causal effect of humanitarian interventions on malnutrition. Our results revealed no significant effects, likely due to limited sample size, suboptimal data quality, and an imperfect causal graph resulting from our limited understanding of multidisciplinary systems like food security. This underscores the need to enhance data collection and refine causal models with domain experts for more effective future interventions and policies, improving transparency and accountability in humanitarian aid.
    摘要 面对气候变化引起的干旱,抵触地区面临严重的食品安全威胁,需要急需人道主义援助。这篇论文介绍了一种 causal inference 框架,用于评估针对东非的食品危机造成的影响。我们的贡献包括确定食品安全系统中的 causal 关系,融合全面的数据库,并估算人道主义干预对营养不良的影响。我们的结果表明,没有显著的影响, probable 因为样本规模过小、数据质量不佳和我们对多学科系统的理解不够,从而导致 causal 图不准确。这反映了需要增强数据收集和改进 causal 模型,以更有效地应用未来的援助和政策,提高透明度和责任感。

Self-supervision meets kernel graph neural models: From architecture to augmentations

  • paper_url: http://arxiv.org/abs/2310.11281
  • repo_url: None
  • paper_authors: Jiawang Dan, Ruofan Wu, Yunpeng Liu, Baokun Wang, Changhua Meng, Tengfei Liu, Tianyi Zhang, Ningtao Wang, Xing Fu, Qi Li, Weiqiang Wang
  • for: This paper aims to improve the design and training of kernel graph neural networks (KGNNs) for graph representation learning.
  • methods: It extends the algorithmic formulation of KGNNs with a more flexible graph-level similarity definition (subsuming earlier proposals such as the random walk graph kernel) and a smoother optimization objective that avoids combinatorial learning procedures; it also proposes a structure-preserving self-supervised augmentation method called latent graph augmentation (LGA).
  • results: On graph classification benchmarks the model matches or sometimes outperforms state-of-the-art graph representation learning frameworks with or without self-supervision, and comparisons with established graph data augmentation methods show that LGA better captures graph-level invariances.
    Abstract Graph representation learning has now become the de facto standard when handling graph-structured data, with the framework of message-passing graph neural networks (MPNN) being the most prevailing algorithmic tool. Despite its popularity, the family of MPNNs suffers from several drawbacks such as transparency and expressivity. Recently, the idea of designing neural models on graphs using the theory of graph kernels has emerged as a more transparent as well as sometimes more expressive alternative to MPNNs known as kernel graph neural networks (KGNNs). Developments on KGNNs are currently a nascent field of research, leaving several challenges from algorithmic design and adaptation to other learning paradigms such as self-supervised learning. In this paper, we improve the design and learning of KGNNs. Firstly, we extend the algorithmic formulation of KGNNs by allowing a more flexible graph-level similarity definition that encompasses former proposals like random walk graph kernel, as well as providing a smoother optimization objective that alleviates the need of introducing combinatorial learning procedures. Secondly, we enhance KGNNs through the lens of self-supervision via developing a novel structure-preserving graph data augmentation method called latent graph augmentation (LGA). Finally, we perform extensive empirical evaluations to demonstrate the efficacy of our proposed mechanisms. Experimental results over benchmark datasets suggest that our proposed model achieves competitive performance that is comparable to or sometimes outperforming state-of-the-art graph representation learning frameworks with or without self-supervision on graph classification tasks. Comparisons against other previously established graph data augmentation methods verify that the proposed LGA augmentation scheme captures better semantics of graph-level invariance.
    摘要 Graph表示学习现在成为了处理图结构数据的标准方法,MPNN框架是最具有影响力的算法工具。然而,MPNN家族受到一些缺点的限制,如透明度和表达能力。近些年,基于图kernels的图神经网络(KGNN)在MPNN的基础上设计图神经网络,被认为是更透明和有时更表达能力的替代方案。KGNN的发展现在是一个有前途的研究领域,还有许多挑战,如算法设计和适应其他学习模式,如无监督学习。在这篇论文中,我们提高了KGNN的设计和学习。首先,我们扩展了KGNN的算法表述,允许更flexible的图级相似性定义,包括过去的提议,如随机步行图kernels,以及提供更平滑的优化目标,以避免引入 combinatorial学习过程。其次,我们通过对KGNN进行自我监督来增强其性能,发展了一种新的结构保持graph数据增强方法,即latent graph augmentation(LGA)。最后,我们进行了广泛的实验评估,以证明我们提出的机制的有效性。实验结果表明,我们的提出的模型在图分类任务上达到了与或超过了现状标准的表现,并且在不含自我监督的情况下也能够达到类似的表现。与其他之前Established graph data增强方法进行比较,我们的LGA增强方案更好地捕捉到图级 semantics。

Learning to Sample Better

  • paper_url: http://arxiv.org/abs/2310.11232
  • repo_url: https://github.com/sayantann11/all-classification-templetes-for-ML
  • paper_authors: Michael S. Albergo, Eric Vanden-Eijnden
  • for: These lecture notes introduce recent generative modeling methods based on the dynamical transport of measures, mapping samples from a simple base distribution to samples from a target distribution of interest.
  • methods: The transport maps are learned variationally from data generated by Monte Carlo (MC) sampling.
  • results: The learned maps are used in turn to improve MC techniques such as importance sampling and Markov Chain Monte Carlo (MCMC), in a positive feedback loop.
    Abstract These lecture notes provide an introduction to recent advances in generative modeling methods based on the dynamical transportation of measures, by means of which samples from a simple base measure are mapped to samples from a target measure of interest. Special emphasis is put on the applications of these methods to Monte-Carlo (MC) sampling techniques, such as importance sampling and Markov Chain Monte-Carlo (MCMC) schemes. In this context, it is shown how the maps can be learned variationally using data generated by MC sampling, and how they can in turn be used to improve such sampling in a positive feedback loop.
    摘要 这些讲义介绍了最近基于动态测度传输的生成模型方法,将简单基础测度的样本映射到目标测度的样本。重点放在 Monte Carlo (MC) 抽样技术上,如重要性抽样和 Markov Chain Monte Carlo (MCMC) 方案。这些讲义还展示了如何利用 MC 抽样生成的数据以变分方式学习这些映射,并反过来将其用于改进抽样,形成正反馈循环。

Zipformer: A faster and better encoder for automatic speech recognition

  • paper_url: http://arxiv.org/abs/2310.11230
  • repo_url: https://github.com/k2-fsa/icefall
  • paper_authors: Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, Daniel Povey
  • for: This paper proposes Zipformer, a faster, more memory-efficient, and better-performing transformer encoder for automatic speech recognition (ASR).
  • methods: The model introduces: 1) a U-Net-like encoder structure whose middle stacks operate at lower frame rates; 2) a reorganized block structure with more modules that re-uses attention weights for efficiency; 3) BiasNorm, a modified LayerNorm that retains some length information; 4) new activation functions, SwooshR and SwooshL, that work better than Swish.
  • results: Extensive experiments on LibriSpeech, Aishell-1, and WenetSpeech show that Zipformer outperforms other state-of-the-art ASR models.
    Abstract The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where middle stacks operate at lower frame rates; 2) reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) a modified form of LayerNorm called BiasNorm allows us to retain some length information; 4) new activation functions SwooshR and SwooshL work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explictly learns the parameter scale. It achieves faster convergence and better performance than Adam. Extensive experiments on LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer over other state-of-the-art ASR models. Our code is publicly available at https://github.com/k2-fsa/icefall.
    摘要 《充当者》已成为自动语音识别(ASR)最受欢迎的编码器模型。它将卷积模块添加到转换器中,以学习本地和全局依赖关系。在这项工作中,我们描述了一种更快、更有效和性能更高的转换器,即Zipformer。模型变化包括:1. 中堆结构采用U-Net类型,中间堆叠运行速率较低;2. 块结构重新排序,增加更多模块,并在这些模块中重用注意力权重以实现效率;3. 使用修改后的层Normalization,称为BiasNorm,以保留一些长度信息;4. 新的激活函数SwooshR和SwooshL,比Swish更好地工作;5. 我们还提出了一种新的优化器,称为扫描Adam,它可以根据每个tensor的当前尺度缩放更新,以保持相对变化的相同程度,并且显式地学习参数尺度。它在其他状态的ASR模型比Adam更快地 converges和表现更好。我们对LibriSpeech、Aishell-1和WenetSpeech datasets进行了广泛的实验,并证明了我们提出的Zipformer在其他状态的ASR模型之上表现更好。我们的代码可以在https://github.com/k2-fsa/icefall中找到。

Federated Learning with Nonvacuous Generalisation Bounds

  • paper_url: http://arxiv.org/abs/2310.11203
  • repo_url: None
  • paper_authors: Pierre Jobic, Maxime Haddouche, Benjamin Guedj
  • for: This work targets privacy-preserving federated learning in which each node keeps its training dataset private while releasing only a local predictor.
  • methods: Each node trains a randomised (stochastic) local predictor without sharing its data; a global randomised predictor is then built that inherits the PAC-Bayesian generalisation guarantees of the local private predictors. Both the synchronous case (all nodes share one training objective derived from a generalisation bound) and the asynchronous case (personalised objectives per node) are considered.
  • results: Numerical experiments show predictive performance comparable to the batch setting where all datasets are shared, together with numerically nonvacuous generalisation bounds, while preserving each node's privacy; the paper explicitly quantifies the cost in performance and bounds paid to preserve privacy.
    Abstract We introduce a novel strategy to train randomised predictors in federated learning, where each node of the network aims at preserving its privacy by releasing a local predictor but keeping secret its training dataset with respect to the other nodes. We then build a global randomised predictor which inherits the properties of the local private predictors in the sense of a PAC-Bayesian generalisation bound. We consider the synchronous case where all nodes share the same training objective (derived from a generalisation bound), and the asynchronous case where each node may have its own personalised training objective. We show through a series of numerical experiments that our approach achieves a comparable predictive performance to that of the batch approach where all datasets are shared across nodes. Moreover the predictors are supported by numerically nonvacuous generalisation bounds while preserving privacy for each node. We explicitly compute the increment on predictive performance and generalisation bounds between batch and federated settings, highlighting the price to pay to preserve privacy.
    摘要 我们提出了一种新的策略,用于在联合学习中训练随机预测器,每个网络节点都希望保持自己的隐私,通过发布本地预测器,而不把训练数据集与其他节点分享。然后,我们构建了一个全局随机预测器,该预测器继承了本地私有预测器的性质,即PAC-Bayesian泛化约束。我们考虑了同步和异步两种情况,在同步情况下,所有节点共享同一个训练目标(基于一个泛化约束),在异步情况下,每个节点可能有自己的个性化训练目标。我们通过一系列数值实验表示,我们的方法可以与批处理方法(所有数据集在节点间共享)相比,具有相似的预测性能,同时保持隐私性。我们显式计算了批处理和联合学习之间的增量预测性能和泛化约束,强调保护隐私的代价。

A Modified EXP3 and Its Adaptive Variant in Adversarial Bandits with Multi-User Delayed Feedback

  • paper_url: http://arxiv.org/abs/2310.11188
  • repo_url: https://github.com/chubbro/mud-exp3
  • paper_authors: Yandi Li, Jianxiong Guo
  • for: The paper considers adversarial multi-armed bandits where delayed feedback comes from multiple users, with delays unknown in advance and unrestricted in their internal distribution.
  • methods: It proposes MUD-EXP3, a modified EXP3 algorithm that at each round makes decisions using importance-weighted estimators of the feedback received from different users.
  • results: With known horizon $T$, number of users $M$, number of arms $N$, and delay bound $d_{max}$, the regret is $\mathcal{O}(\sqrt{TM^2\ln{N}(N\mathrm{e}+4d_{max})})$; for unknown $T$, an adaptive variant named AMUD-EXP3 achieves sublinear regret in $T$. Extensive experiments confirm the correctness and effectiveness of both algorithms.
    Abstract For the adversarial multi-armed bandit problem with delayed feedback, we consider that the delayed feedback results are from multiple users and are unrestricted on internal distribution. As the player picks an arm, feedback from multiple users may not be received instantly yet after an arbitrary delay of time which is unknown to the player in advance. For different users in a round, the delays in feedback have no latent correlation. Thus, we formulate an adversarial multi-armed bandit problem with multi-user delayed feedback and design a modified EXP3 algorithm named MUD-EXP3, which makes a decision at each round by considering the importance-weighted estimator of the received feedback from different users. On the premise of known terminal round index $T$, the number of users $M$, the number of arms $N$, and upper bound of delay $d_{max}$, we prove a regret of $\mathcal{O}(\sqrt{TM^2\ln{N}(N\mathrm{e}+4d_{max})})$. Furthermore, for the more common case of unknown $T$, an adaptive algorithm named AMUD-EXP3 is proposed with a sublinear regret with respect to $T$. Finally, extensive experiments are conducted to indicate the correctness and effectiveness of our algorithms.
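A compact numpy sketch of an EXP3-style learner that applies importance-weighted updates only when delayed feedback arrives, crediting each piece of feedback with the arm probability in effect at the round it was generated. The delay and loss models, the exploration mixing, and the step size are toy assumptions and do not reflect the MUD-EXP3 analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, eta, gamma, d_max = 5, 5000, 0.05, 0.1, 20
mean_loss = np.linspace(0.2, 0.8, N)              # arm 0 is best

log_w = np.zeros(N)
pending = []                                      # (arrival_round, arm, prob_at_pull, loss)
for t in range(T):
    q = np.exp(log_w - log_w.max()); q /= q.sum()
    p = (1 - gamma) * q + gamma / N               # uniform exploration mixing
    arm = rng.choice(N, p=p)
    loss = float(rng.random() < mean_loss[arm])
    # feedback from a user arrives only after an unknown delay
    pending.append((t + int(rng.integers(1, d_max)), arm, p[arm], loss))
    # apply importance-weighted updates for whatever has arrived by now
    arrived = [f for f in pending if f[0] <= t]
    pending = [f for f in pending if f[0] > t]
    for _, a, prob, l in arrived:
        log_w[a] -= eta * l / prob                # importance-weighted loss estimate

q = np.exp(log_w - log_w.max()); q /= q.sum()
print("final arm probabilities:", np.round(q, 3))  # mass concentrates on arm 0
```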

Efficiently Visualizing Large Graphs

  • paper_url: http://arxiv.org/abs/2310.11186
  • repo_url: https://github.com/charlie-xiao/embedding-visualization-test
  • paper_authors: Xinyu Li, Yao Xiao, Yuchen Zhou
  • for: This paper proposes a dimension-reduction method for visualizing the cluster structure of graphs.
  • methods: t-SGNE, a variant of t-SNE, uses the neighbor structure of the graph instead of pairwise similarities, reducing the time complexity from quadratic to linear and thereby supporting larger graphs; it is paired with SPLEE (ShortestPath Laplacian Eigenmaps Embedding), which combines Laplacian eigenmaps with shortest-path computation to obtain a high-dimensional embedding of the graph.
  • results: Graphs with up to 300K nodes and 1M edges can be visualized within 5 minutes, with roughly a 10% improvement in visualization quality. Code and data: https://github.com/Charlie-XIAO/embedding-visualization-test.
    Abstract Most existing graph visualization methods based on dimension reduction are limited to relatively small graphs due to performance issues. In this work, we propose a novel dimension reduction method for graph visualization, called t-Distributed Stochastic Graph Neighbor Embedding (t-SGNE). t-SGNE is specifically designed to visualize cluster structures in the graph. As a variant of the standard t-SNE method, t-SGNE avoids the time-consuming computations of pairwise similarity. Instead, it uses the neighbor structures of the graph to reduce the time complexity from quadratic to linear, thus supporting larger graphs. In addition, to suit t-SGNE, we combined Laplacian Eigenmaps with the shortest path algorithm in graphs to form the graph embedding algorithm ShortestPath Laplacian Eigenmaps Embedding (SPLEE). Performing SPLEE to obtain a high-dimensional embedding of the large-scale graph and then using t-SGNE to reduce its dimension for visualization, we are able to visualize graphs with up to 300K nodes and 1M edges within 5 minutes and achieve approximately 10% improvement in visualization quality. Codes and data are available at https://github.com/Charlie-XIAO/embedding-visualization-test.
    摘要 现有的图视化方法基于维度减少通常只能处理相对较小的图,由于性能问题。在这个工作中,我们提出了一种新的维度减少方法 для图视化,即t-Distributed Stochastic Graph Neighbor Embedding(t-SGNE)。t-SGNE专门用于描述图中的集群结构。作为标准t-SNE方法的变体,t-SGNE避免了对对应之间的相似性进行时间消耗的计算,而是使用图中的邻居结构来降低时间复杂度从quadratico至线性,因此可以支持更大的图。此外,为了适应t-SGNE,我们将Laplacian Eigenmaps与图中最短路算法组合成为图嵌入算法ShortestPath Laplacian Eigenmaps Embedding(SPLEE)。通过对大规模图进行SPLEE嵌入,并使用t-SGNE减少其维度进行视化,我们可以在5分钟内视化300K个节点和1M个边的图,并达到约10%的视化质量提升。代码和数据可以在https://github.com/Charlie-XIAO/embedding-visualization-test中找到。
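A rough sketch of the embedding stage under stated assumptions: shortest-path distances are turned into a Gaussian affinity whose Laplacian eigenvectors give a higher-dimensional embedding, which a t-SNE-style step (here t-SGNE) would then reduce for display. The affinity choice and bandwidth are guesses for illustration; the actual SPLEE/t-SGNE pipeline may differ in these details.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.linalg import eigh

def splee_like_embedding(adj, dim=8, sigma=None):
    """Shortest-path distances -> Gaussian affinity -> Laplacian eigenmaps."""
    D = shortest_path(adj, method="D", unweighted=True)
    if sigma is None:
        sigma = np.median(D[np.isfinite(D) & (D > 0)])
    W = np.exp(-(D / sigma) ** 2)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W                 # unnormalized graph Laplacian
    vals, vecs = eigh(L)
    return vecs[:, 1:dim + 1]                      # skip the constant eigenvector

# toy graph: two 5-cliques joined by a single edge
n = 10
adj = np.zeros((n, n))
adj[:5, :5] = 1; adj[5:, 5:] = 1; adj[4, 5] = adj[5, 4] = 1
np.fill_diagonal(adj, 0)
emb = splee_like_embedding(adj, dim=2)
print(emb.shape)   # (10, 2); a t-SNE-style step would then map this to 2-D for display
```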

Serenade: A Model for Human-in-the-loop Automatic Chord Estimation

  • paper_url: http://arxiv.org/abs/2310.11165
  • repo_url: None
  • paper_authors: Hendrik Vincent Koops, Gianluca Micchi, Ilaria Manco, Elio Quinton
  • for: This paper aims to improve automatic chord estimation and related music information retrieval (MIR) tasks such as automatic segmentation and corpus analysis, despite the limited inter-rater agreement caused by the ambiguous nature of musical harmony.
  • methods: It proposes a human-in-the-loop approach in which an autoregressive model first generates harmony predictions, a human sparsely annotates the parts with low model confidence, and the model then adjusts its predictions following the human guidance.
  • results: Evaluated on a dataset of popular music, this co-creation approach improves harmonic analysis performance over a model-only approach, with the human contribution amplified by the model's second, constrained prediction.
    Abstract Computational harmony analysis is important for MIR tasks such as automatic segmentation, corpus analysis and automatic chord label estimation. However, recent research into the ambiguous nature of musical harmony, causing limited inter-rater agreement, has made apparent that there is a glass ceiling for common metrics such as accuracy. Commonly, these issues are addressed either in the training data itself by creating majority-rule annotations or during the training phase by learning soft targets. We propose a novel alternative approach in which a human and an autoregressive model together co-create a harmonic annotation for an audio track. After automatically generating harmony predictions, a human sparsely annotates parts with low model confidence and the model then adjusts its predictions following human guidance. We evaluate our model on a dataset of popular music and we show that, with this human-in-the-loop approach, harmonic analysis performance improves over a model-only approach. The human contribution is amplified by the second, constrained prediction of the model.
    摘要 计算音乐和谐分析对音乐信息 Retrieval(MIR)任务如自动分割、文献分析和自动和声标注有着重要的作用。然而,最近关于音乐和谐的抽象性的研究,使得限制了通用指标的准确率。通常,这些问题通过创建多数规则约束或在训练阶段学习软目标来解决。我们提出了一种新的人机合作方法,在音频轨道上,人类和自动推理模型共同创建和谐标注。首先,模型自动生成和谐预测,然后人类精选部分低度信任的部分并让模型根据人类指导更新预测。我们对流行音乐数据集进行评估,并显示了人机共同Loop Approach可以提高和谐分析性能,人类贡献被模型第二次预测所增强。

A new high-resolution indoor radon map for Germany using a machine learning based probabilistic exposure model

  • paper_url: http://arxiv.org/abs/2310.11143
  • repo_url: None
  • paper_authors: Eric Petermann, Peter Bossew, Joachim Kemski, Valeria Gruber, Nils Suhr, Bernd Hoffmann
  • for: The study aims to estimate the indoor radon distribution in Germany more realistically and at a higher spatial resolution than purely data-based approaches allow.
  • methods: A two-stage modelling approach is used: (1) a quantile regression forest with environmental and building data as predictors estimates the probability distribution of indoor radon for each floor level of each residential building; (2) a probabilistic Monte Carlo sampling technique combines and population-weights the floor-level predictions, propagating the uncertainty of individual predictions into the aggregated estimate.
  • results: The arithmetic mean is 63 Bq/m3, the geometric mean 41 Bq/m3, and the 95th percentile 180 Bq/m3; the exceedance probabilities for 100 Bq/m3 and 300 Bq/m3 are 12.5% (10.5 million people) and 2.2% (1.9 million people), respectively. Individual indoor radon exposure is generally lower in large cities than in rural areas because of how the population is distributed over floor levels.
    Abstract Radon is a carcinogenic, radioactive gas that can accumulate indoors. Indoor radon exposure at the national scale is usually estimated on the basis of extensive measurement campaigns. However, characteristics of the sample often differ from the characteristics of the population due to the large number of relevant factors such as the availability of geogenic radon or floor level. Furthermore, the sample size usually does not allow exposure estimation with high spatial resolution. We propose a model-based approach that allows a more realistic estimation of indoor radon distribution with a higher spatial resolution than a purely data-based approach. We applied a two-stage modelling approach: 1) a quantile regression forest using environmental and building data as predictors was applied to estimate the probability distribution function of indoor radon for each floor level of each residential building in Germany; (2) a probabilistic Monte Carlo sampling technique enabled the combination and population weighting of floor-level predictions. In this way, the uncertainty of the individual predictions is effectively propagated into the estimate of variability at the aggregated level. The results give an arithmetic mean of 63 Bq/m3, a geometric mean of 41 Bq/m3 and a 95 %ile of 180 Bq/m3. The exceedance probability for 100 Bq/m3 and 300 Bq/m3 are 12.5 % (10.5 million people) and 2.2 % (1.9 million people), respectively. In large cities, individual indoor radon exposure is generally lower than in rural areas, which is a due to the different distribution of the population on floor levels. The advantages of our approach are 1) an accurate exposure estimation even if the survey was not fully representative with respect to the main controlling factors, and 2) an estimate of the exposure distribution with a much higher spatial resolution than basic descriptive statistics.
    摘要 气体氧化物Radon是一种致癌的放射性气体,可以在室内堆积。室内Radon暴露的国家规模通常通过广泛的测量运动来估算。然而,样本特点与人口特点之间存在许多相关因素,如地源Radon的可用性和地板层。此外,样本大小通常无法实现高空间分辨率的暴露估计。我们提出了一种基于模型的方法,可以更加准确地估计室内Radon分布,并提高空间分辨率。我们采用了两个阶段的模型方法:1. 使用缺陷回归森林来估计室内Radon的分布函数,使用环境和建筑数据作为预测器。2. 使用 probabilistic Monte Carlo sampling technique来组合和人口权重 floor-level 预测。这样,个体预测的uncertainty会被有效地传递到聚合水平的估计中。结果表明, arithmetic mean 为 63 Bq/m3, geometric mean 为 41 Bq/m3,和 95%ile 为 180 Bq/m3。100 Bq/m3和300 Bq/m3的超过 probabilities 分别为 12.5% (10.5 million people) 和 2.2% (1.9 million people)。在大城市地区,室内Radon暴露通常比农村地区低,这是因为人口分布不同。我们的方法的优点包括:1. 即使调查不具有完全反映主要控制因素的 representativeness,也可以准确地估计暴露水平。2. 可以提供高空间分辨率的暴露估计,比基本描述统计数据更加精准。

Keep Various Trajectories: Promoting Exploration of Ensemble Policies in Continuous Control

  • paper_url: http://arxiv.org/abs/2310.11138
  • repo_url: None
  • paper_authors: Chao Li, Chen Gong, Qiang He, Xinwen Hou
  • for: This work seeks to understand and improve the empirical success of ensemble methods in deep reinforcement learning (DRL), in particular the robustness of policies and the accuracy of value function estimation.
  • methods: The analysis shows that the sample efficiency of previous ensemble DRL algorithms can be limited by sub-policies that are not as diverse as they could be; the proposed Trajectories-awarE Ensemble exploratioN (TEEN) algorithm maximizes the expected return while promoting more diverse trajectories.
  • results: TEEN improves the sample diversity and performance of the ensemble policy, outperforming baseline ensemble DRL algorithms by 41% on average across the tested representative environments.
    Abstract The combination of deep reinforcement learning (DRL) with ensemble methods has been proved to be highly effective in addressing complex sequential decision-making problems. This success can be primarily attributed to the utilization of multiple models, which enhances both the robustness of the policy and the accuracy of value function estimation. However, there has been limited analysis of the empirical success of current ensemble RL methods thus far. Our new analysis reveals that the sample efficiency of previous ensemble DRL algorithms may be limited by sub-policies that are not as diverse as they could be. Motivated by these findings, our study introduces a new ensemble RL algorithm, termed \textbf{T}rajectories-awar\textbf{E} \textbf{E}nsemble exploratio\textbf{N} (TEEN). The primary goal of TEEN is to maximize the expected return while promoting more diverse trajectories. Through extensive experiments, we demonstrate that TEEN not only enhances the sample diversity of the ensemble policy compared to using sub-policies alone but also improves the performance over ensemble RL algorithms. On average, TEEN outperforms the baseline ensemble DRL algorithms by 41\% in performance on the tested representative environments.
    摘要 深度强化学习(DRL)和集成方法的组合在复杂的顺序决策问题上表现出非常高效。这种成功可以归功于多个模型的使用,增强政策的稳健性和估值函数的准确性。然而,现有的集成RL算法的实证成功仍然受限。我们的新分析发现,前一些集成DRL算法的样本效率可能受到较差的优化策略的限制。驱动于这些发现,我们的研究提出了一种新的集成RL算法,名为天文道-探索-集成探索(TEEN)。TEEN的主要目标是 Maximize the expected return while promoting more diverse trajectories。我们通过广泛的实验表明,TEEN不仅提高了集成政策的样本多样性,还超过了基eline集成DRL算法的性能。在测试环境中,TEEN平均比基eline算法提高41%的性能。

Non-parametric Conditional Independence Testing for Mixed Continuous-Categorical Variables: A Novel Method and Numerical Evaluation

  • paper_url: http://arxiv.org/abs/2310.11132
  • repo_url: None
  • paper_authors: Oana-Iuliana Popescu, Andreas Gerhardus, Jakob Runge
  • for: This work addresses conditional independence testing (CIT) on mixed-type datasets containing both numerical and categorical variables.
  • methods: It builds on conditional mutual information (CMI) estimators combined with a local permutation scheme, comparing two recent k-nearest-neighbor (k-NN) CMI estimators for mixed data: one based on one-hot encoding of the categorical variables (effectively treating them as discrete-numerical), and one expressing CMI through entropy terms in which categorical variables appear only as conditions. A variant of the former that does not treat categorical variables as numeric is proposed.
  • results: Numerical experiments show that the proposed variant detects dependencies more robustly across different data distributions and preprocessing types.
    Abstract Conditional independence testing (CIT) is a common task in machine learning, e.g., for variable selection, and a main component of constraint-based causal discovery. While most current CIT approaches assume that all variables are numerical or all variables are categorical, many real-world applications involve mixed-type datasets that include numerical and categorical variables. Non-parametric CIT can be conducted using conditional mutual information (CMI) estimators combined with a local permutation scheme. Recently, two novel CMI estimators for mixed-type datasets based on k-nearest-neighbors (k-NN) have been proposed. As with any k-NN method, these estimators rely on the definition of a distance metric. One approach computes distances by a one-hot encoding of the categorical variables, essentially treating categorical variables as discrete-numerical, while the other expresses CMI by entropy terms where the categorical variables appear as conditions only. In this work, we study these estimators and propose a variation of the former approach that does not treat categorical variables as numeric. Our numerical experiments show that our variant detects dependencies more robustly across different data distributions and preprocessing types.
    摘要 Conditional independence testing (CIT) 是机器学习中常见的任务之一,例如变量选择,并是约束基于 causal discovery 的主要组成部分。而现实中的大多数 CIT 方法假设所有变量都是数值型或所有变量都是类别型,但是实际应用中经常会出现混合类型的数据集。非 Parametric CIT 可以通过 conditional mutual information (CMI) 估计器和地方排序方案来进行。最近,两种新的 CMI 估计器用于混合类型数据集基于 k-nearest-neighbors (k-NN) 已经被提出。这些估计器都取决于距离度量的定义。一种方法通过一个一键编码的 categorical 变量来计算距离,实际上将 categorical 变量当作数值型处理,而另一种方法通过 entropy 表达来计算 CMI,在 categorical 变量中只有作为条件出现。在这项工作中,我们研究这些估计器,并提出一种不对 categorical 变量进行数值化的变体。我们的数值实验表明,我们的变体在不同的数据分布和预处理类型下能够更加稳定地检测依赖关系。

FROST: Towards Energy-efficient AI-on-5G Platforms – A GPU Power Capping Evaluation

  • paper_url: http://arxiv.org/abs/2310.11131
  • repo_url: None
  • paper_authors: Ioannis Mavromatis, Stefano De Feo, Pietro Carnelli, Robert J. Piechocki, Aftab Khan
  • for: The paper targets the Open Radio Access Network (O-RAN) market, where the RAN dominates CAPEX and consumes 73% of total network energy, making the energy consumption of Machine Learning (ML) pipelines in O-RAN ecosystems an important optimization target.
  • methods: It proposes FROST (Flexible Reconfiguration method with Online System Tuning), which profiles the energy consumption of an ML pipeline and optimizes the hardware accordingly, thereby limiting the power draw while adhering to O-RAN specifications and principles.
  • results: FROST achieves energy savings of up to 26.4% without compromising the model's accuracy or introducing significant time delays.
    Abstract The Open Radio Access Network (O-RAN) is a burgeoning market with projected growth in the upcoming years. RAN has the highest CAPEX impact on the network and, most importantly, consumes 73% of its total energy. That makes it an ideal target for optimisation through the integration of Machine Learning (ML). However, the energy consumption of ML is frequently overlooked in such ecosystems. Our work addresses this critical aspect by presenting FROST - Flexible Reconfiguration method with Online System Tuning - a solution for energy-aware ML pipelines that adhere to O-RAN's specifications and principles. FROST is capable of profiling the energy consumption of an ML pipeline and optimising the hardware accordingly, thereby limiting the power draw. Our findings indicate that FROST can achieve energy savings of up to 26.4% without compromising the model's accuracy or introducing significant time delays.
    摘要 openRadio Access Network (O-RAN) 是一个快速发展的市场,未来几年将出现快速增长。RAN 是网络总体的最高CapEx 成本和73% 的能源消耗,因此它成为了优化的目标。然而,机器学习 (ML) 在这些生态系统中的能源消耗frequently 被忽略。我们的工作解决了这个关键问题,提出了FROST - 可变化的重配置方法与在线系统调整 - 一种遵循 O-RAN 规范和原则的能源意识机器学习管道解决方案。FROST 可以对机器学习管道的能源消耗进行 profiling,并根据硬件进行优化,从而限制能源浪费。我们的发现表明,FROST 可以实现能源节约达到 26.4% 而不会 compromise 模型准确性或引入显著的时间延迟。

Topological Expressivity of ReLU Neural Networks

  • paper_url: http://arxiv.org/abs/2310.11130
  • repo_url: None
  • paper_authors: Ekin Ergen, Moritz Grillo
  • for: This work studies the expressivity of ReLU neural networks for binary classification from a topological perspective.
  • methods: It uses Betti numbers to measure how much a network topologically simplifies a dataset, and establishes lower and upper bounds on the topological simplification a ReLU network with a given architecture can achieve.
  • results: Deep ReLU networks are exponentially more powerful than shallow ones in terms of topological simplification, giving a mathematically rigorous explanation of why deeper networks are better equipped to handle complex, topologically rich datasets.
    Abstract We study the expressivity of ReLU neural networks in the setting of a binary classification problem from a topological perspective. Recently, empirical studies showed that neural networks operate by changing topology, transforming a topologically complicated data set into a topologically simpler one as it passes through the layers. This topological simplification has been measured by Betti numbers, which are algebraic invariants of a topological space. We use the same measure to establish lower and upper bounds on the topological simplification a ReLU neural network can achieve with a given architecture. We therefore contribute to a better understanding of the expressivity of ReLU neural networks in the context of binary classification problems by shedding light on their ability to capture the underlying topological structure of the data. In particular the results show that deep ReLU neural networks are exponentially more powerful than shallow ones in terms of topological simplification. This provides a mathematically rigorous explanation why deeper networks are better equipped to handle complex and topologically rich datasets.
    摘要 我们研究使用ReLU神经网络进行二分类问题的表达能力,从拓扑角度来看。最近的实验证明,神经网络在传递层次时会改变拓扑结构,将复杂的拓扑数据集转化为简单的拓扑结构。这种拓扑简化的度量使用Betti数,它是一种拓扑空间的代数 invariants。我们使用这个度量来确定ReLU神经网络的拓扑简化能力,并提出了对于给定架构的下限和上限。因此,我们对于二分类问题中ReLU神经网络的表达能力做出了更深入的理解,特别是结果表明深度的ReLU神经网络在拓扑简化方面的表达能力是极大的,这提供了一种数学上的正式解释,为复杂和拓扑 ric的数据集进行处理,为何更深度的网络更好。

On the Temperature of Bayesian Graph Neural Networks for Conformal Prediction

  • paper_url: http://arxiv.org/abs/2310.11479
  • repo_url: None
  • paper_authors: Seohyeon Cha, Honggu Kang, Joonhyuk Kang
  • for: The goal is to improve uncertainty quantification for graph neural networks (GNNs), which is essential in the high-stakes domains where GNNs are frequently employed.
  • methods: The work operates within the conformal prediction (CP) framework, which provides valid prediction sets with formal probabilistic coverage guarantees for any black-box model, and incorporates a temperature parameter into Bayesian GNNs when constructing those sets.
  • results: Empirically, there exist temperatures that yield more efficient (smaller) prediction sets; the paper also analyzes the factors contributing to inefficiency and the relationship between CP performance and model calibration.
    Abstract Accurate uncertainty quantification in graph neural networks (GNNs) is essential, especially in high-stakes domains where GNNs are frequently employed. Conformal prediction (CP) offers a promising framework for quantifying uncertainty by providing $\textit{valid}$ prediction sets for any black-box model. CP ensures formal probabilistic guarantees that a prediction set contains a true label with a desired probability. However, the size of prediction sets, known as $\textit{inefficiency}$, is influenced by the underlying model and data generating process. On the other hand, Bayesian learning also provides a credible region based on the estimated posterior distribution, but this region is $\textit{well-calibrated}$ only when the model is correctly specified. Building on a recent work that introduced a scaling parameter for constructing valid credible regions from posterior estimate, our study explores the advantages of incorporating a temperature parameter into Bayesian GNNs within CP framework. We empirically demonstrate the existence of temperatures that result in more efficient prediction sets. Furthermore, we conduct an analysis to identify the factors contributing to inefficiency and offer valuable insights into the relationship between CP performance and model calibration.
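A short numpy sketch of split conformal classification with a temperature applied to the (ensemble-averaged) logits before the conformity score, showing how the temperature changes prediction-set size while the conformal quantile keeps coverage near the target. The score choice (one minus the true-class probability) and the synthetic logits are assumptions, not the paper's setup.

```python
import numpy as np

def softmax(z, temp=1.0):
    z = z / temp
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def conformal_sets(cal_logits, cal_labels, test_logits, alpha=0.1, temp=1.0):
    p_cal = softmax(cal_logits, temp)
    scores = 1.0 - p_cal[np.arange(len(cal_labels)), cal_labels]    # nonconformity scores
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha))) - 1                     # conformal quantile index
    q = np.sort(scores)[min(k, n - 1)]
    p_test = softmax(test_logits, temp)
    return p_test >= 1.0 - q                                        # boolean class-membership mask

rng = np.random.default_rng(0)
n_cal, n_test, k = 500, 500, 5
cal_labels = rng.integers(k, size=n_cal)
cal_logits = rng.normal(size=(n_cal, k)); cal_logits[np.arange(n_cal), cal_labels] += 2.0
test_labels = rng.integers(k, size=n_test)
test_logits = rng.normal(size=(n_test, k)); test_logits[np.arange(n_test), test_labels] += 2.0

for temp in (0.5, 1.0, 2.0):
    sets = conformal_sets(cal_logits, cal_labels, test_logits, alpha=0.1, temp=temp)
    cover = sets[np.arange(n_test), test_labels].mean()
    print(f"T={temp}: coverage={cover:.2f}, avg set size={sets.sum(axis=1).mean():.2f}")
```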

Sensitivity-Aware Amortized Bayesian Inference

  • paper_url: http://arxiv.org/abs/2310.11122
  • repo_url: None
  • paper_authors: Lasse Elsemüller, Hans Olischläger, Marvin Schmitt, Paul-Christian Bürkner, Ullrich Köthe, Stefan T. Radev
  • for: The paper proposes a way to integrate sensitivity analyses into amortized Bayesian inference (ABI, i.e., simulation-based inference with neural networks), so that the influence of likelihood, prior, approximator, and data choices on the conclusions can be assessed.
  • methods: Weight sharing encodes the structural similarities between alternative likelihood and prior specifications during training with minimal computational overhead, and the rapid inference of neural networks is used to assess sensitivity to data perturbations and pre-processing; neural network ensembles evaluate the variation in results induced by unreliable approximation on unseen data.
  • results: Both steps avoid the costly refitting required by most other Bayesian approaches; applications range from disease outbreak dynamics and global warming thresholds to comparing human decision-making models, revealing hidden relationships between modeling choices and inferential conclusions.
    Abstract Bayesian inference is a powerful framework for making probabilistic inferences and decisions under uncertainty. Fundamental choices in modern Bayesian workflows concern the specification of the likelihood function and prior distributions, the posterior approximator, and the data. Each choice can significantly influence model-based inference and subsequent decisions, thereby necessitating sensitivity analysis. In this work, we propose a multifaceted approach to integrate sensitivity analyses into amortized Bayesian inference (ABI, i.e., simulation-based inference with neural networks). First, we utilize weight sharing to encode the structural similarities between alternative likelihood and prior specifications in the training process with minimal computational overhead. Second, we leverage the rapid inference of neural networks to assess sensitivity to various data perturbations or pre-processing procedures. In contrast to most other Bayesian approaches, both steps circumvent the costly bottleneck of refitting the model(s) for each choice of likelihood, prior, or dataset. Finally, we propose to use neural network ensembles to evaluate variation in results induced by unreliable approximation on unseen data. We demonstrate the effectiveness of our method in applied modeling problems, ranging from the estimation of disease outbreak dynamics and global warming thresholds to the comparison of human decision-making models. Our experiments showcase how our approach enables practitioners to effectively unveil hidden relationships between modeling choices and inferential conclusions.
    摘要 泛bayesian推理是一种强大的推理框架,用于在不纯的情况下做出推理和决策。现代泛bayesian工作流程中的基本选择包括可信度函数和先验分布的规定、 posterior approximator 和数据。每一个选择都会对模型基于推理和后续决策产生重要影响,因此需要敏感分析。在这种工作中,我们提议一种多方面的方法,将敏感分析integrated into amortized Bayesian inference (ABI,即通过神经网络进行 simulations-based inference)。首先,我们利用 weight sharing 将结构相似性编码到替代可信度函数和先验分布中的训练过程中,以 minimize computational overhead。其次,我们利用神经网络的快速推理来评估数据变化或预处理过程对结果的敏感性。与大多数泛bayesian方法不同,这两个步骤都可以避免对模型的重新适应过程中的成本。最后,我们提议使用神经网络集合来评估未知数据上的结果变化。我们在应用模型问题中进行了实验,从疾病爆发动力和全球暖化阈值估计到人类决策模型的比较。我们的实验显示了我们的方法可以帮助实践者更好地揭示模型选择和推理结论之间的隐藏关系。

Minimally Informed Linear Discriminant Analysis: training an LDA model with unlabelled data

  • paper_url: http://arxiv.org/abs/2310.11110
  • repo_url: None
  • paper_authors: Nicolas Heintz, Tom Francart, Alexander Bertrand
  • for: This paper shows how a Linear Discriminant Analysis (LDA) classifier can be obtained from unlabelled data.
  • methods: It proposes the Minimally Informed LDA (MILDA) model, which computes the exact LDA projection vector from unlabelled data given only minimal prior information: one of (1) the class average of one of the two classes, (2) the difference between the class averages (up to a scaling), or (3) the class covariance matrices (up to a scaling).
  • results: MILDA closely matches the performance of a supervised LDA model, can be computed in closed form at a computational cost comparable to LDA, and adapts quickly to non-stationary data, making it well suited for use as an adaptive classifier.
    Abstract Linear Discriminant Analysis (LDA) is one of the oldest and most popular linear methods for supervised classification problems. In this paper, we demonstrate that it is possible to compute the exact projection vector from LDA models based on unlabelled data, if some minimal prior information is available. More precisely, we show that only one of the following three pieces of information is actually sufficient to compute the LDA projection vector if only unlabelled data are available: (1) the class average of one of the two classes, (2) the difference between both class averages (up to a scaling), or (3) the class covariance matrices (up to a scaling). These theoretical results are validated in numerical experiments, demonstrating that this minimally informed Linear Discriminant Analysis (MILDA) model closely matches the performance of a supervised LDA model. Furthermore, we show that the MILDA projection vector can be computed in a closed form with a computational cost comparable to LDA and is able to quickly adapt to non-stationary data, making it well-suited to use as an adaptive classifier.
    摘要 线性判别分析(LDA)是监督分类问题中最早、也最常用的线性方法之一。本文证明,在仅有无标签数据的情况下,只需以下三项先验信息之一,即可精确计算 LDA 投影向量:1. 其中一个类别的类均值;2. 两个类别均值之差(允许任意缩放);3. 类别协方差矩阵(允许任意缩放)。这些理论结果在数值实验中得到了验证,表明这种只需最小先验信息的 MILDA 模型能够与有监督的 LDA 模型准确匹配。此外,MILDA 投影向量可以以与 LDA 相当的计算成本闭式求解,并能快速适应非平稳数据,使其适合用作自适应分类器。
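A short numpy check of the second piece of information listed above: with only unlabelled data plus the class-mean difference (up to scaling), the total covariance of the unlabelled pool can stand in for the within-class covariance, since both yield the same projection direction up to scale. The data-generating process below is an illustrative assumption, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20000
mu0, mu1 = np.zeros(d), np.array([1.0, 0.5, 0.0, -0.5, 1.0])
A = rng.normal(size=(d, d)); cov = A @ A.T / d + np.eye(d)     # within-class covariance

# unlabelled pool: a 50/50 mixture of the two classes, labels discarded
labels = rng.integers(2, size=n)
X = rng.multivariate_normal(np.zeros(d), cov, size=n) + np.where(labels[:, None] == 1, mu1, mu0)

# supervised LDA direction (ground truth, uses the true within-class covariance)
w_lda = np.linalg.solve(cov, mu1 - mu0)

# MILDA-style direction: total covariance of unlabelled data + known mean difference
delta = mu1 - mu0                              # assumed known up to scaling
w_milda = np.linalg.solve(np.cov(X.T), delta)

cos = abs(w_lda @ w_milda) / (np.linalg.norm(w_lda) * np.linalg.norm(w_milda))
print("cosine similarity between the two directions:", round(cos, 4))   # ~1.0
```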

Local Lipschitz Constant Computation of ReLU-FNNs: Upper Bound Computation with Exactness Verification

  • paper_url: http://arxiv.org/abs/2310.11104
  • repo_url: None
  • paper_authors: Yoshio Ebihara, Xin Dai, Victor Magron, Dimitri Peaucelle, Sophie Tarbouriech
  • for: This paper concerns the computation of the local Lipschitz constant of feedforward neural networks (FNNs) with rectified linear unit (ReLU) activations; the local Lipschitz constant at a target input is a reasonable quantitative measure of the network's reliability.
  • methods: The upper-bound computation is reduced to a semidefinite programming problem (SDP) using multipliers that capture the ReLU behavior, with newly introduced copositive multipliers for accuracy; analyzing the dual of the SDP yields a viable test to conclude the exactness of the computed upper bound; for practical FNNs with hundreds of ReLUs, a reduced-order model with identical input-output behavior over a neighborhood of the target input is constructed.
  • results: Numerical examples on practical FNNs illustrate the effectiveness of the model reduction and the exactness verification methods.
    Abstract This paper is concerned with the computation of the local Lipschitz constant of feedforward neural networks (FNNs) with activation functions being rectified linear units (ReLUs). The local Lipschitz constant of an FNN for a target input is a reasonable measure for its quantitative evaluation of the reliability. By following a standard procedure using multipliers that capture the behavior of ReLUs,we first reduce the upper bound computation problem of the local Lipschitz constant into a semidefinite programming problem (SDP). Here we newly introduce copositive multipliers to capture the ReLU behavior accurately. Then, by considering the dual of the SDP for the upper bound computation, we second derive a viable test to conclude the exactness of the computed upper bound. However, these SDPs are intractable for practical FNNs with hundreds of ReLUs. To address this issue, we further propose a method to construct a reduced order model whose input-output property is identical to the original FNN over a neighborhood of the target input. We finally illustrate the effectiveness of the model reduction and exactness verification methods with numerical examples of practical FNNs.
    摘要 We first convert the upper bound computation problem of the local Lipschitz constant into a semidefinite programming problem (SDP) using multipliers that capture the ReLU behavior. To improve the accuracy of the computation, we introduce new copositive multipliers.Next, we derive a feasibility test for the computed upper bound by considering the dual of the SDP. However, these SDPs are computationally intractable for practical FNNs with hundreds of ReLUs.To address this issue, we propose a method to construct a reduced order model whose input-output property is identical to the original FNN over a neighborhood of the target input. We demonstrate the effectiveness of the model reduction and exactness verification methods with numerical examples of practical FNNs.
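For contrast with the SDP-based bound, the snippet below computes the standard product-of-spectral-norms upper bound, which is valid because ReLU is 1-Lipschitz but is global and typically much looser than the local bound the paper targets; it is shown only as a baseline, not as the paper's method.

```python
import numpy as np

def naive_lipschitz_upper_bound(weights):
    """Product of layer spectral norms; valid since ReLU is 1-Lipschitz."""
    bound = 1.0
    for W in weights:
        bound *= np.linalg.norm(W, 2)      # largest singular value of each layer
    return bound

def relu_fnn(x, weights, biases):
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(W @ x + b, 0.0)
    return weights[-1] @ x + biases[-1]

rng = np.random.default_rng(0)
dims = [4, 16, 16, 1]
weights = [rng.normal(scale=0.5, size=(dims[i + 1], dims[i])) for i in range(3)]
biases = [rng.normal(scale=0.1, size=dims[i + 1]) for i in range(3)]

print("global upper bound:", naive_lipschitz_upper_bound(weights))

# empirical local slopes around a target input never exceed the global bound
x0 = rng.normal(size=4)
slopes = []
for _ in range(2000):
    dx = rng.normal(size=4) * 1e-2
    diff = relu_fnn(x0 + dx, weights, biases) - relu_fnn(x0, weights, biases)
    slopes.append(np.linalg.norm(diff) / np.linalg.norm(dx))
print("largest sampled local slope:", max(slopes))
```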

Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads

  • paper_url: http://arxiv.org/abs/2310.11096
  • repo_url: https://github.com/samsunglabs/sparse-multi-dnn-scheduling
  • paper_authors: Hongxiang Fan, Stylianos I. Venieris, Alexandros Kouris, Nicholas D. Lane
  • for: This work targets the efficient execution of multiple sparse deep neural networks (DNNs) running concurrently on both edge devices and data centers.
  • methods: A systematic analysis of sparse multi-DNN use-cases and optimization opportunities, followed by Dysta, a bi-level dynamic and static scheduler that exploits both static sparsity patterns and dynamic sparsity information, with its static and dynamic components co-designed at the software and hardware levels, respectively.
  • results: On a newly constructed public benchmark spanning deployment scenarios from mobile phones and AR/VR wearables to data centers, the proposed approach outperforms state-of-the-art methods with up to a 10% lower latency-constraint violation rate and nearly 4x lower average normalized turnaround time.
    Abstract Running multiple deep neural networks (DNNs) in parallel has become an emerging workload in both edge devices, such as mobile phones where multiple tasks serve a single user for daily activities, and data centers, where various requests are raised from millions of users, as seen with large language models. To reduce the costly computational and memory requirements of these workloads, various efficient sparsification approaches have been introduced, resulting in widespread sparsity across different types of DNN models. In this context, there is an emerging need for scheduling sparse multi-DNN workloads, a problem that is largely unexplored in previous literature. This paper systematically analyses the use-cases of multiple sparse DNNs and investigates the opportunities for optimizations. Based on these findings, we propose Dysta, a novel bi-level dynamic and static scheduler that utilizes both static sparsity patterns and dynamic sparsity information for the sparse multi-DNN scheduling. Both static and dynamic components of Dysta are jointly designed at the software and hardware levels, respectively, to improve and refine the scheduling approach. To facilitate future progress in the study of this class of workloads, we construct a public benchmark that contains sparse multi-DNN workloads across different deployment scenarios, spanning from mobile phones and AR/VR wearables to data centers. A comprehensive evaluation on the sparse multi-DNN benchmark demonstrates that our proposed approach outperforms the state-of-the-art methods with up to 10% decrease in latency constraint violation rate and nearly 4X reduction in average normalized turnaround time. Our artifacts and code are publicly available at: https://github.com/SamsungLabs/Sparse-Multi-DNN-Scheduling.

Relearning Forgotten Knowledge: on Forgetting, Overfit and Training-Free Ensembles of DNNs

  • paper_url: http://arxiv.org/abs/2310.11094
  • repo_url: None
  • paper_authors: Uri Stern, Daphna Weinshall
  • for: This paper aims to explain the perplexingly infrequent appearance of overfit in deep neural networks and proposes a new way to quantify it.
  • methods: A novel overfit score that monitors the forgetting rate of deep models on validation data.
  • results: Experiments show that overfit, measured this way, can occur with or without a decrease in validation accuracy and may be more common than previously appreciated; the paper further derives a new ensemble method, based solely on the training history of a single network, that improves performance without any additional training-time cost.
    Abstract The infrequent occurrence of overfit in deep neural networks is perplexing. On the one hand, theory predicts that as models get larger they should eventually become too specialized for a specific training set, with ensuing decrease in generalization. In contrast, empirical results in image classification indicate that increasing the training time of deep models or using bigger models almost never hurts generalization. Is it because the way we measure overfit is too limited? Here, we introduce a novel score for quantifying overfit, which monitors the forgetting rate of deep models on validation data. Presumably, this score indicates that even while generalization improves overall, there are certain regions of the data space where it deteriorates. When thus measured, we show that overfit can occur with and without a decrease in validation accuracy, and may be more common than previously appreciated. This observation may help to clarify the aforementioned confusing picture. We use our observations to construct a new ensemble method, based solely on the training history of a single network, which provides significant improvement in performance without any additional cost in training time. An extensive empirical evaluation with modern deep models shows our method's utility on multiple datasets, neural networks architectures and training schemes, both when training from scratch and when using pre-trained networks in transfer learning. Notably, our method outperforms comparable methods while being easier to implement and use, and further improves the performance of competitive networks on Imagenet by 1\%.
    摘要 启发性训练深度神经网络中偶尔出现过拟合现象很困惑。一面理论预测,随着模型的大小增加,它们应该逐渐变得特化于具体的训练集,导致泛化性下降。然而,实际研究发现,深度模型的训练时间增加或使用更大的模型在图像分类任务中并没有明显的泛化性下降。我们是否因为量化过拟合的方法有限而导致这种情况呢?在这里,我们引入一种新的过拟合评价指标,可以监测深度模型在验证数据上忘记率。这个指标表明,即使总的泛化性提高,仍然有一些数据空间中的忘记率下降。当如此量化过拟合时,我们发现过拟合可以发生在Validation accuracy下降和不下降的情况下,并且可能更为常见。这一观察可能有助于解释深度神经网络中的困惑场景。我们利用这些观察,构建了一种基于单个网络训练历史的新集成方法,可以在不添加训练时间成本的情况下提供显著性能提升。我们的方法在现代深度模型、不同的 neural network 架构、训练方案和传输学习中都有广泛的实际评估,并且在Imagenet上提高了1%的性能。另外,我们的方法比同类方法更容易实现和使用,并且可以进一步提高竞争力强的网络的性能。
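The exact score defined in the paper is not reproduced here; the sketch below shows one plausible way to monitor forgetting on validation data, as the abstract describes: record per-example correctness after each epoch and count correct-to-incorrect transitions. The toy history is an illustrative assumption.

```python
import numpy as np

def forgetting_score(correct_per_epoch: np.ndarray) -> float:
    """Fraction of validation examples 'forgotten' at least once.

    `correct_per_epoch` is a boolean array of shape (epochs, n_val) recording
    whether each validation example was classified correctly after each epoch.
    A forgetting event is a correct-to-incorrect transition between two
    consecutive epochs, even if overall validation accuracy keeps improving.
    """
    prev, curr = correct_per_epoch[:-1], correct_per_epoch[1:]
    forgotten = (prev & ~curr).any(axis=0)   # per example: ever forgotten?
    return float(forgotten.mean())

# Toy history for 6 validation examples over 4 epochs (illustrative only).
history = np.array([
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 0, 0, 1],   # example 3 is forgotten here
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1, 1],   # example 2 is forgotten here
], dtype=bool)

print(f"forgetting score: {forgetting_score(history):.2f}")   # 2 of 6 -> 0.33
```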

Data Drift Monitoring for Log Anomaly Detection Pipelines

  • paper_url: http://arxiv.org/abs/2310.14893
  • repo_url: None
  • paper_authors: Dipak Wani, Samuel Ackerman, Eitan Farchi, Xiaotong Liu, Hau-wen Chang, Sarasi Lalithsena
  • for: This paper proposes a Bayes Factor-based drift detection method for identifying changes in log activity, assisting site reliability engineers (SREs) in system diagnosis.
  • methods: Bayes Factors are used to detect drift in log activity and to decide, with human involvement, when the Log Anomaly Detection (LAD) model defining the "normal" activity profile requires intervention, retraining, and updating.
  • results: The method is illustrated on sequences of log activity from unaltered, real collected log data and on simulated activity with controlled levels of anomaly contamination.
    Abstract Logs enable the monitoring of infrastructure status and the performance of associated applications. Logs are also invaluable for diagnosing the root causes of any problems that may arise. Log Anomaly Detection (LAD) pipelines automate the detection of anomalies in logs, providing assistance to site reliability engineers (SREs) in system diagnosis. Log patterns change over time, necessitating updates to the LAD model defining the `normal' log activity profile. In this paper, we introduce a Bayes Factor-based drift detection method that identifies when intervention, retraining, and updating of the LAD model are required with human involvement. We illustrate our method using sequences of log activity, both from unaltered data, and simulated activity with controlled levels of anomaly contamination, based on real collected log data.
    摘要 日志可以监控基础设施状态和相关应用程序的性能。日志也是解决问题的根本原因的诊断的重要工具。日志异常检测(LAD)管道自动检测日志中的异常情况,为站点可靠工程师(SRE)提供帮助。日志模式随着时间的变化,因此LAD模型需要定期更新。在这篇文章中,我们介绍了基于 bayes 因子的漂移检测方法,可以在人工参与下确定是否需要介入、重新训练和更新 LAD 模型。我们使用了日志活动序列,包括未修改数据和 simulated 活动,以及控制了异常污染的水平,基于实际收集的日志数据来示例我们的方法。
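A simplified stand-in (not the paper's formulation) for Bayes Factor-based drift detection on a scalar log-activity metric, assuming Gaussian models and a BIC-style approximation of the marginal likelihoods; a large positive log Bayes factor would prompt a human-in-the-loop review and possible retraining of the LAD model.

```python
import numpy as np
from scipy.stats import norm

def log_bayes_factor(reference: np.ndarray, window: np.ndarray) -> float:
    """BIC-style approximation of log BF(drift vs. no-drift) for a 1-D metric.

    H0: the recent window follows the Gaussian fitted on the reference period.
    H1: the window follows its own Gaussian (2 extra free parameters, penalized).
    Large positive values indicate drift.
    """
    mu0, sd0 = reference.mean(), reference.std(ddof=1)
    mu1, sd1 = window.mean(), window.std(ddof=1)
    n = len(window)
    ll_h0 = norm.logpdf(window, mu0, sd0).sum()
    ll_h1 = norm.logpdf(window, mu1, sd1).sum() - 0.5 * 2 * np.log(n)  # BIC penalty
    return ll_h1 - ll_h0

rng = np.random.default_rng(1)
reference = rng.normal(10.0, 1.0, size=500)   # e.g. log of hourly log-line counts
stable    = rng.normal(10.0, 1.0, size=50)
drifted   = rng.normal(12.0, 1.5, size=50)

print(f"stable window : log BF = {log_bayes_factor(reference, stable):+.1f}")
print(f"drifted window: log BF = {log_bayes_factor(reference, drifted):+.1f}")
```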

CSG: Curriculum Representation Learning for Signed Graph

  • paper_url: http://arxiv.org/abs/2310.11083
  • repo_url: None
  • paper_authors: Zeyu Zhang, Jiamou Liu, Kaiqi Zhao, Yifei Wang, Pengqian Han, Xianda Zheng, Qiqi Wang, Zijian Zhang
  • for: 本研究旨在提高Signed Graph Neural Networks(SGNNs)的精度和稳定性,通过设计一种基于课程的训练方法,以便更好地处理复杂的签名图。
  • methods: 本研究提出了一种基于课程的训练方法,其中样本按照难度从易到复杂进行排序,以便SGNN模型在处理不同难度的样本上进行学习。此外,我们还引入了一种轻量级的机制,以量化图的学习难度。
  • results: 经验 validate 表明,我们的训练方法可以提高 SGNN 模型的准确率,在链接签记预测(AUC)中提高了23.7%,并且可以显著降低标准差的AUC分布。
    Abstract Signed graphs are valuable for modeling complex relationships with positive and negative connections, and Signed Graph Neural Networks (SGNNs) have become crucial tools for their analysis. However, prior to our work, no specific training plan existed for SGNNs, and the conventional random sampling approach did not address varying learning difficulties within the graph's structure. We proposed a curriculum-based training approach, where samples progress from easy to complex, inspired by human learning. To measure learning difficulty, we introduced a lightweight mechanism and created the Curriculum representation learning framework for Signed Graphs (CSG). This framework optimizes the order in which samples are presented to the SGNN model. Empirical validation across six real-world datasets showed impressive results, enhancing SGNN model accuracy by up to 23.7% in link sign prediction (AUC) and significantly improving stability with an up to 8.4 reduction in the standard deviation of AUC scores.
    摘要 签名图是用于模型复杂关系的工具,它们可以表示正有负连接。然而,在我们的工作之前,没有专门的培训计划 для签名图神经网络(SGNN),而常见的随机抽样方法也不能处理图结构中的变化学习困难。我们提出了一种学习级别的培训方法,其中样本从简单到复杂进行排序,这是基于人类学习的启发。为了测量学习困难,我们引入了一种轻量级机制,并创建了签名图表示学习框架(CSG)。这个框架优化了SGNN模型被抽样的顺序。经验 validate 在六个实际数据集上表现出色,SGNN 模型的准确率提高了最多 23.7%(AUC),并有显著提高稳定性,AUC 标准差下降了最多 8.4。

Resampling Stochastic Gradient Descent Cheaply for Efficient Uncertainty Quantification

  • paper_url: http://arxiv.org/abs/2310.11065
  • repo_url: None
  • paper_authors: Henry Lam, Zitong Wang
  • for: This work studies uncertainty quantification for solutions obtained from stochastic gradient descent (SGD) in model training and stochastic optimization.
  • methods: Two computationally cheap resampling-based methods for constructing confidence intervals for SGD solutions: one runs a few SGDs in parallel on data resampled with replacement, and the other operates in an online fashion; both can be viewed as enhancements of established bootstrap schemes that substantially reduce the resampling effort while bypassing the intricate mixing conditions of existing batching methods.
  • results: Building on a recent "cheap bootstrap" idea and a Berry-Esseen-type bound for SGD, the methods deliver valid confidence intervals at a fraction of the computational cost of standard bootstrapping.
    Abstract Stochastic gradient descent (SGD) or stochastic approximation has been widely used in model training and stochastic optimization. While there is a huge literature on analyzing its convergence, inference on the obtained solutions from SGD has only been recently studied, yet is important due to the growing need for uncertainty quantification. We investigate two computationally cheap resampling-based methods to construct confidence intervals for SGD solutions. One uses multiple, but few, SGDs in parallel via resampling with replacement from the data, and another operates this in an online fashion. Our methods can be regarded as enhancements of established bootstrap schemes to substantially reduce the computation effort in terms of resampling requirements, while at the same time bypassing the intricate mixing conditions in existing batching methods. We achieve these via a recent so-called cheap bootstrap idea and Berry-Esseen-type bound for SGD.
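A toy sketch of the flavor of the approach, under the assumption that the quantity of interest is a simple mean estimated by SGD: a handful of SGD runs on data resampled with replacement are combined with a t-quantile, in the spirit of the "cheap bootstrap" idea cited in the abstract. The exact interval construction and its validity conditions are in the paper.

```python
import numpy as np
from scipy.stats import t

def sgd_mean(data, lr=0.05, epochs=5, seed=0):
    """Estimate the mean by SGD on the squared loss 0.5*(x - theta)^2."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    for _ in range(epochs):
        for x in rng.permutation(data):
            theta -= lr * (theta - x)
    return theta

rng = np.random.default_rng(42)
data = rng.normal(loc=2.0, scale=1.0, size=2000)

B, alpha = 5, 0.05                       # only a handful of resampled SGD runs
theta_hat = sgd_mean(data)
replicates = np.array([
    sgd_mean(rng.choice(data, size=len(data), replace=True), seed=b)
    for b in range(B)
])

# Cheap-bootstrap-style interval: a t quantile with B degrees of freedom
# absorbs the fact that only a few resamples are used.
s = np.sqrt(np.mean((replicates - theta_hat) ** 2))
half = t.ppf(1 - alpha / 2, df=B) * s
print(f"point estimate {theta_hat:.3f}, 95% CI [{theta_hat - half:.3f}, {theta_hat + half:.3f}]")
```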

Locally Differentially Private Graph Embedding

  • paper_url: http://arxiv.org/abs/2310.11060
  • repo_url: None
  • paper_authors: Zening Li, Rong-Hua Li, Meihao Liao, Fusheng Jin, Guoren Wang
  • for: This work develops a graph embedding algorithm that satisfies local differential privacy (LDP), protecting sensitive node information in graph data.
  • methods: LDP-GE, a privacy-preserving graph embedding framework that uses an LDP mechanism to obfuscate node data and adopts personalized PageRank as the proximity measure for learning node representations, together with a theoretical analysis of its privacy guarantees and utility.
  • results: Extensive experiments on several real-world graph datasets show that LDP-GE achieves favorable privacy-utility trade-offs and significantly outperforms existing approaches on both node classification and link prediction tasks.
    Abstract Graph embedding has been demonstrated to be a powerful tool for learning latent representations for nodes in a graph. However, despite its superior performance in various graph-based machine learning tasks, learning over graphs can raise significant privacy concerns when graph data involves sensitive information. To address this, in this paper, we investigate the problem of developing graph embedding algorithms that satisfy local differential privacy (LDP). We propose LDP-GE, a novel privacy-preserving graph embedding framework, to protect the privacy of node data. Specifically, we propose an LDP mechanism to obfuscate node data and adopt personalized PageRank as the proximity measure to learn node representations. Then, we theoretically analyze the privacy guarantees and utility of the LDP-GE framework. Extensive experiments conducted over several real-world graph datasets demonstrate that LDP-GE achieves favorable privacy-utility trade-offs and significantly outperforms existing approaches in both node classification and link prediction tasks.
    摘要 “图像插入”已经被证明是一种有力的工具,用于学习图像中节点的隐藏表示。然而,在学习图像时,可能会引起个人隐私问题,特别是当图像数据包含敏感信息时。为了解决这个问题,在这篇论文中,我们调查了在图像上学习隐藏表示的问题,并提出了一种具有地方敏感性(LDP)的图像插入框架。我们提议了一种LDP机制,以隐藏节点数据,并采用个性化PageRank作为距离度量来学习节点表示。然后,我们对LDP-GE框架的隐私保证和实用性进行了理论分析。广泛的实验表明,LDP-GE在节点分类和链接预测任务中具有良好的隐私-实用质量比。
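The LDP obfuscation mechanism is not sketched here; the snippet below only illustrates the proximity measure the framework adopts, personalized PageRank, via plain power iteration on a toy dense adjacency matrix. The damping parameter and graph are illustrative assumptions.

```python
import numpy as np

def personalized_pagerank(adj: np.ndarray, seed: int, alpha: float = 0.15, iters: int = 100):
    """Personalized PageRank scores of all nodes with respect to a seed node.

    `adj` is a dense adjacency matrix; the scores serve as the proximity
    measure from which node representations could be learned.
    """
    n = adj.shape[0]
    deg = adj.sum(axis=1, keepdims=True)
    P = adj / np.maximum(deg, 1)                  # row-stochastic transition matrix
    e = np.zeros(n); e[seed] = 1.0                # restart distribution
    pi = e.copy()
    for _ in range(iters):
        pi = alpha * e + (1 - alpha) * pi @ P     # random walk with restart
    return pi

# Two triangles joined by a single bridge edge (toy graph).
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

print(np.round(personalized_pagerank(A, seed=0), 3))
```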

Causal Feature Selection via Transfer Entropy

  • paper_url: http://arxiv.org/abs/2310.11059
  • repo_url: None
  • paper_authors: Paolo Bonetti, Alberto Maria Metelli, Marcello Restelli
  • for: This work proposes a new method at the intersection of feature selection and causal discovery, tailored to time series data.
  • methods: A causal feature selection approach that wraps forward and backward feature selection procedures around transfer entropy estimates of the causal flow of information from the features to the target.
  • results: Theoretical guarantees on the regression and classification errors are provided for both the exact and finite-sample cases, and numerical validation on synthetic and real-world regression problems shows results competitive with the considered baselines.
    Abstract Machine learning algorithms are designed to capture complex relationships between features. In this context, the high dimensionality of data often results in poor model performance, with the risk of overfitting. Feature selection, the process of selecting a subset of relevant and non-redundant features, is, therefore, an essential step to mitigate these issues. However, classical feature selection approaches do not inspect the causal relationship between selected features and target, which can lead to misleading results in real-world applications. Causal discovery, instead, aims to identify causal relationships between features with observational data. In this paper, we propose a novel methodology at the intersection between feature selection and causal discovery, focusing on time series. We introduce a new causal feature selection approach that relies on the forward and backward feature selection procedures and leverages transfer entropy to estimate the causal flow of information from the features to the target in time series. Our approach enables the selection of features not only in terms of mere model performance but also captures the causal information flow. In this context, we provide theoretical guarantees on the regression and classification errors for both the exact and the finite-sample cases. Finally, we present numerical validations on synthetic and real-world regression problems, showing results competitive w.r.t. the considered baselines.
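A minimal sketch of the scoring quantity used inside such a forward/backward selection loop: a plug-in histogram estimate of lag-1 transfer entropy from a candidate feature to the target. The discretization, lag, and estimator are simplifying assumptions; the paper's estimator and selection procedure differ.

```python
import numpy as np
from collections import Counter

def transfer_entropy(x, y, bins=4):
    """Plug-in estimate of the lag-1 transfer entropy TE(X -> Y) in nats.

    Both series are discretized into equal-frequency bins and
    TE = sum p(y+, y, x) * log[ p(y+ | y, x) / p(y+ | y) ].
    """
    dx = np.digitize(x, np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1]))
    dy = np.digitize(y, np.quantile(y, np.linspace(0, 1, bins + 1)[1:-1]))
    triples = list(zip(dy[1:], dy[:-1], dx[:-1]))       # (y_{t+1}, y_t, x_t)
    n = len(triples)
    c_xyz = Counter(triples)
    c_yz  = Counter((yp, yc) for yp, yc, _ in triples)  # (y_{t+1}, y_t)
    c_z   = Counter((yc, xc) for _, yc, xc in triples)  # (y_t, x_t)
    c_y   = Counter(yc for _, yc, _ in triples)         # (y_t,)
    te = 0.0
    for (yp, yc, xc), cnt in c_xyz.items():
        p_joint = cnt / n
        p_cond_full = cnt / c_z[(yc, xc)]                # p(y_{t+1} | y_t, x_t)
        p_cond_marg = c_yz[(yp, yc)] / c_y[yc]           # p(y_{t+1} | y_t)
        te += p_joint * np.log(p_cond_full / p_cond_marg)
    return te

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = np.zeros_like(x)
for t in range(1, len(x)):                               # y is driven by past x
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

print(f"TE(x -> y) = {transfer_entropy(x, y):.3f}")      # clearly positive
print(f"TE(y -> x) = {transfer_entropy(y, x):.3f}")      # close to zero
```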

Matrix Compression via Randomized Low Rank and Low Precision Factorization

  • paper_url: http://arxiv.org/abs/2310.11028
  • repo_url: None
  • paper_authors: Rajarshi Saha, Varun Srivastava, Mert Pilanci
  • for: This paper proposes a compression algorithm for very large matrices, which are costly to store and process but often approximately low rank.
  • methods: The algorithm randomly sketches the columns of the matrix to obtain an approximate basis of its range space, quantizes the basis vectors to a low-precision format, and then computes approximate projections of the columns onto this quantized basis, yielding a low-rank and low-precision factorization $\mathbf{A} \approx \mathbf{L}\mathbf{R}$.
  • results: Compression ratios as aggressive as one bit per matrix coordinate are achieved while matching or surpassing traditional compression techniques, demonstrated on image compression, nearest-neighbor classification of image and text embeddings, and compressing the layers of LLaMa-7b.
    Abstract Matrices are exceptionally useful in various fields of study as they provide a convenient framework to organize and manipulate data in a structured manner. However, modern matrices can involve billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage. Although prohibitively large, such matrices are often approximately low rank. We propose an algorithm that exploits this structure to obtain a low rank decomposition of any matrix $\mathbf{A}$ as $\mathbf{A} \approx \mathbf{L}\mathbf{R}$, where $\mathbf{L}$ and $\mathbf{R}$ are the low rank factors. The total number of elements in $\mathbf{L}$ and $\mathbf{R}$ can be significantly less than that in $\mathbf{A}$. Furthermore, the entries of $\mathbf{L}$ and $\mathbf{R}$ are quantized to low precision formats $--$ compressing $\mathbf{A}$ by giving us a low rank and low precision factorization. Our algorithm first computes an approximate basis of the range space of $\mathbf{A}$ by randomly sketching its columns, followed by a quantization of the vectors constituting this basis. It then computes approximate projections of the columns of $\mathbf{A}$ onto this quantized basis. We derive upper bounds on the approximation error of our algorithm, and analyze the impact of target rank and quantization bit-budget. The tradeoff between compression ratio and approximation accuracy allows for flexibility in choosing these parameters based on specific application requirements. We empirically demonstrate the efficacy of our algorithm in image compression, nearest neighbor classification of image and text embeddings, and compressing the layers of LlaMa-$7$b. Our results illustrate that we can achieve compression ratios as aggressive as one bit per matrix coordinate, all while surpassing or maintaining the performance of traditional compression techniques.
    摘要 矩阵在不同的领域中非常有用,因为它们可以有效地组织和处理数据。然而,现代矩阵可能包含数百亿个元素,这会导致存储和处理它们的计算资源和内存使用非常高。虽然这些矩阵可能是非常大的,但它们通常是低级别的。我们提出了一个算法,它利用这种结构来获得矩阵 $\mathbf{A}$ 的低级别分解,即 $\mathbf{A} \approx \mathbf{L}\mathbf{R}$,其中 $\mathbf{L}$ 和 $\mathbf{R}$ 是低级别因素。总的来说, $\mathbf{L}$ 和 $\mathbf{R}$ 中的元素数量可以非常少于 $\mathbf{A}$ 中的元素数量。此外, $\mathbf{L}$ 和 $\mathbf{R}$ 的元素可以使用低精度格式进行压缩,从而压缩 $\mathbf{A}$。我们的算法首先计算矩阵 $\mathbf{A}$ 的估计基准的范围空间,然后使用这个基准来压缩 $\mathbf{A}$ 的列。我们then compute approximate projections of the columns of $\mathbf{A}$ onto this quantized basis. We derive upper bounds on the approximation error of our algorithm, and analyze the impact of target rank and quantization bit-budget. The tradeoff between compression ratio and approximation accuracy allows for flexibility in choosing these parameters based on specific application requirements.我们实际实现了我们的算法,并在图像压缩、图像和文本嵌入图像的最近邻紧类фикации以及LLaMa-$7$b层的压缩中进行了实验。我们的结果表明,我们可以达到一个比特为一个矩阵坐标的压缩比,同时超越或保持传统压缩技术的性能。
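A rough sketch of the pipeline the abstract describes, under simplifying assumptions (a uniform symmetric quantizer and a fixed rank/bit budget): randomly sketch the range of $\mathbf{A}$, quantize the resulting basis to obtain $\mathbf{L}$, then quantize the projections of the columns onto that basis to obtain $\mathbf{R}$.

```python
import numpy as np

def quantize(M, bits=4):
    """Uniform symmetric quantization of a matrix to `bits` bits per entry."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(M).max() / levels
    return np.round(M / scale) * scale

def low_rank_low_precision(A, rank=30, bits=4, seed=0):
    """Randomized low-rank, low-precision factorization A ≈ L @ R (a sketch).

    1) sketch the column space with a Gaussian test matrix,
    2) orthonormalize and quantize the basis L,
    3) project the columns of A onto the quantized basis to obtain R.
    The quantizer and rank/bit allocation differ from the paper's.
    """
    rng = np.random.default_rng(seed)
    Y = A @ rng.standard_normal((A.shape[1], rank))    # range sketch
    Q, _ = np.linalg.qr(Y)                             # approximate orthonormal basis
    L = quantize(Q, bits)
    R = quantize(np.linalg.pinv(L) @ A, bits)          # least-squares fit of columns
    return L, R

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 25)) @ rng.standard_normal((25, 500))  # ~rank 25
L, R = low_rank_low_precision(A, rank=30, bits=4)
err = np.linalg.norm(A - L @ R) / np.linalg.norm(A)
print(f"relative Frobenius error of the 4-bit rank-30 factorization: {err:.3f}")
```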

SignGT: Signed Attention-based Graph Transformer for Graph Representation Learning

  • paper_url: http://arxiv.org/abs/2310.11025
  • repo_url: None
  • paper_authors: Jinsong Chen, Gaichao Li, John E. Hopcroft, Kun He
  • for: 这 paper 的目的是提出一种基于签名自注意力的图Transformers,以适应不同图的复杂关系。
  • methods: 该 paper 使用了自注意力机制,并提出了一种新的签名自注意力机制(SignSA),以便根据节点对的 semantic relevance 生成签名注意力值。此外, paper 还提出了一种结构意识Feed-Forward Network(SFFN),以保留地方topology信息。
  • results: EXTENSIVE empirical results 表明,SignGT 在 node-level 和 graph-level 任务上表现出色,超过了当前的图Transformers 和高级 GNNs。
    Abstract The emerging graph Transformers have achieved impressive performance for graph representation learning over graph neural networks (GNNs). In this work, we regard the self-attention mechanism, the core module of graph Transformers, as a two-step aggregation operation on a fully connected graph. Due to the property of generating positive attention values, the self-attention mechanism is equal to conducting a smooth operation on all nodes, preserving the low-frequency information. However, only capturing the low-frequency information is inefficient in learning complex relations of nodes on diverse graphs, such as heterophily graphs where the high-frequency information is crucial. To this end, we propose a Signed Attention-based Graph Transformer (SignGT) to adaptively capture various frequency information from the graphs. Specifically, SignGT develops a new signed self-attention mechanism (SignSA) that produces signed attention values according to the semantic relevance of node pairs. Hence, the diverse frequency information between different node pairs could be carefully preserved. Besides, SignGT proposes a structure-aware feed-forward network (SFFN) that introduces the neighborhood bias to preserve the local topology information. In this way, SignGT could learn informative node representations from both long-range dependencies and local topology information. Extensive empirical results on both node-level and graph-level tasks indicate the superiority of SignGT against state-of-the-art graph Transformers as well as advanced GNNs.
    摘要 新出现的图变换器技术已经取得了图表示学习中的出色表现,超过传统的图神经网络(GNNs)。在这项工作中,我们将自注意机制,变换器的核心模块,视为一个完全连接的图上的两步积算操作。由于生成正向注意值的性质,自注意机制等同于对所有节点进行缓和操作,保留低频信息。然而,只capture低频信息可能是学习多样性图上的节点关系不充分的。为此,我们提出了一种签名自注意机制基于图变换器(SignGT),以适应不同图上的多样性频率信息。具体来说,SignGT开发了一种新的签名自注意机制(SignSA),生成签名注意值根据节点对的semantic relevance。因此,不同节点对之间的多样频率信息可以得到细致的保留。此外,SignGT还提出了一种结构意识适应链接网络(SFFN),通过引入邻居偏好来保留本地链接信息。因此,SignGT可以从长距离依赖和本地链接信息中学习有用的节点表示。 empirical研究表明,SignGT在节点级和图级任务上表现出色,超过当前的图变换器和高级GNNs。

Pure Exploration in Asynchronous Federated Bandits

  • paper_url: http://arxiv.org/abs/2310.11015
  • repo_url: None
  • paper_authors: Zichen Wang, Chuanhao Li, Chenyu Song, Lianghui Wang, Quanquan Gu, Huazheng Wang
  • for: This paper addresses the federated pure exploration problem in multi-armed and linear bandits, where multiple agents cooperate through a central server to identify the best arm.
  • methods: The first federated asynchronous multi-armed bandit and linear bandit algorithms for pure exploration with fixed confidence, designed to be robust to the latency and unavailability of agents that are common in practice.
  • results: Theoretical analysis shows the proposed algorithms achieve near-optimal sample complexities and efficient communication costs in a fully asynchronous environment, and experiments on synthetic and real-world data confirm their effectiveness and communication cost-efficiency.
    Abstract We study the federated pure exploration problem of multi-armed bandits and linear bandits, where $M$ agents cooperatively identify the best arm via communicating with the central server. To enhance the robustness against latency and unavailability of agents that are common in practice, we propose the first federated asynchronous multi-armed bandit and linear bandit algorithms for pure exploration with fixed confidence. Our theoretical analysis shows the proposed algorithms achieve near-optimal sample complexities and efficient communication costs in a fully asynchronous environment. Moreover, experimental results based on synthetic and real-world data empirically elucidate the effectiveness and communication cost-efficiency of the proposed algorithms.
    摘要 我们研究多机构共同探索多臂枪和线性枪问题,其中 $M$ 名代理人共同决定最佳臂via 与中央服务器的通信。为了增强实际中常见的延迟和代理人缺失的响应性,我们提出了首个联邦异步多臂枪和线性枪探索算法,并进行了对这些算法的理论分析。我们的分析结果显示,提案的算法可以实现近乎最佳的样本复杂度和有效的通信成本在完全异步环境中。此外,基于实验数据的实验结果也证明了提案的算法的实际性和通信成本效率。

Hyperspectral In-Memory Computing with Optical Frequency Combs and Programmable Optical Memories

  • paper_url: http://arxiv.org/abs/2310.11014
  • repo_url: None
  • paper_authors: Mostafa Honari Latifpour, Byoung Jun Park, Yoshihisa Yamamoto, Myoung-Gyun Suh
  • for: This paper aims to develop a highly parallel, programmable, and scalable optical computing system capable of handling matrix-vector multiplication operations for deep learning and optimization tasks.
  • methods: The proposed hyperspectral in-memory computing architecture integrates space multiplexing with frequency multiplexing of optical frequency combs and uses spatial light modulators as a programmable optical memory.
  • results: The authors have experimentally demonstrated multiply-accumulate operations with higher than 4-bit precision in both matrix-vector and matrix-matrix multiplications, which suggests the system’s potential for a wide variety of deep learning and optimization tasks.
    Abstract The rapid advancements in machine learning across numerous industries have amplified the demand for extensive matrix-vector multiplication operations, thereby challenging the capacities of traditional von Neumann computing architectures. To address this, researchers are currently exploring alternatives such as in-memory computing systems to develop faster and more energy-efficient hardware. In particular, there is renewed interest in computing systems based on optics, which could potentially handle matrix-vector multiplication in a more energy-efficient way. Despite promising initial results, developing a highly parallel, programmable, and scalable optical computing system capable of rivaling electronic computing hardware still remains elusive. In this context, we propose a hyperspectral in-memory computing architecture that integrates space multiplexing with frequency multiplexing of optical frequency combs and uses spatial light modulators as a programmable optical memory, thereby boosting the computational throughput and the energy efficiency. We have experimentally demonstrated multiply-accumulate operations with higher than 4-bit precision in both matrix-vector and matrix-matrix multiplications, which suggests the system's potential for a wide variety of deep learning and optimization tasks. This system exhibits extraordinary modularity, scalability, and programmability, effectively transcending the traditional limitations of optics-based computing architectures. Our approach demonstrates the potential to scale beyond peta operations per second, marking a significant step towards achieving high-throughput energy-efficient optical computing.
    摘要 快速发展的机器学习技术在各个领域的应用使得大量矩阵-向量乘法操作的需求增加,导致传统的 von Neumann 计算架构的能力受到挑战。为了解决这问题,研究人员正在寻找代替方案,如内存计算系统,以开发更快速、更能效的硬件。特别是,光学计算系统在处理矩阵-向量乘法方面可能存在更高的能效性。虽然初步的结果很有前途,但是开发一个高度并行、可编程、扩展的光学计算系统,能与电子计算硬件竞争仍然很困难。在这种情况下,我们提出了一种快速响应的多光谱内存计算架构,通过空间复用和频率复用光谱镜的技术,使用空间光模拟器作为可编程的光学记忆,从而提高计算通过put和能效性。我们在实验中已经实现了高于4位精度的矩阵-向量和矩阵-矩阵乘法 multiply-accumulate 操作,这表明该系统在深度学习和优化任务中的潜在能力。这种系统具有极高的可组合性、可扩展性和可编程性,实际上跨越了传统光学计算架构的限制。我们的方法可以超过PETA操作每秒,这标志着光学计算技术的高性能、能效的发展。

  • paper_url: http://arxiv.org/abs/2310.11009
  • repo_url: https://github.com/harryshomer/lpformer
  • paper_authors: Harry Shomer, Yao Ma, Haitao Mao, Juanhui Li, Bo Wu, Jiliang Tang
  • for: Link prediction is a common task on graph-structured data with applications across many domains; it was classically tackled with hand-crafted heuristics.
  • methods: Recent methods combine the strengths of message-passing neural networks (MPNNs) and heuristics by pairing the MPNN output with a "pairwise encoding" that captures the relationship between the nodes of a candidate link; existing pairwise encodings, however, impose a strong inductive bias by classifying all links with the same underlying factors. LPFormer instead learns the pairwise encoding for each link adaptively, via an attention module that models multiple factors integral to link prediction.
  • results: LPFormer achieves state-of-the-art performance on numerous datasets while remaining efficient.
    Abstract Link prediction is a common task on graph-structured data that has seen applications in a variety of domains. Classically, hand-crafted heuristics were used for this task. Heuristic measures are chosen such that they correlate well with the underlying factors related to link formation. In recent years, a new class of methods has emerged that combines the advantages of message-passing neural networks (MPNN) and heuristics methods. These methods perform predictions by using the output of an MPNN in conjunction with a "pairwise encoding" that captures the relationship between nodes in the candidate link. They have been shown to achieve strong performance on numerous datasets. However, current pairwise encodings often contain a strong inductive bias, using the same underlying factors to classify all links. This limits the ability of existing methods to learn how to properly classify a variety of different links that may form from different factors. To address this limitation, we propose a new method, LPFormer, which attempts to adaptively learn the pairwise encodings for each link. LPFormer models the link factors via an attention module that learns the pairwise encoding that exists between nodes by modeling multiple factors integral to link prediction. Extensive experiments demonstrate that LPFormer can achieve SOTA performance on numerous datasets while maintaining efficiency.
    摘要 链接预测是图Structured data上常见的任务,它在多个领域应用。过去,人工设计的规则通常用于这种任务。这些规则选择的目的是使其与下面的链接形成因素相吻合。在最近几年里,一种新的方法 emerge,它结合了 message-passing neural networks(MPNN)和规则方法的优点。这些方法通过 MPNN 的输出和候选链接的 "对称编码" 进行预测,其中对于每个链接, captured the relationship between nodes in the candidate link。它们在许多数据集上实现了强的表现。然而,现有的对称编码通常具有强 inductive bias,使用同一些基因来分类所有的链接。这限制了现有方法的能力,以learn how to properly classify a variety of different links that may form from different factors。为了解决这些限制,我们提出了一种新方法,LPFormer,它尝试通过 adaptively learning the pairwise encodings for each link来模型链接因素。LPFormer 通过注意力模块来学习每个链接的对称编码,该编码捕捉了 nodes 之间的多种因素。广泛的实验表明,LPFormer 可以在多个数据集上实现 SOTA 性能,同时保持效率。

Spatially-resolved hyperlocal weather prediction and anomaly detection using IoT sensor networks and machine learning techniques

  • paper_url: http://arxiv.org/abs/2310.11001
  • repo_url: None
  • paper_authors: Anita B. Agarwal, Rohit Rajesh, Nitin Arul
  • for: 本研究旨在提供高精度、快速更新的本地天气预测,以满足各种应用需求,如农业、灾害管理等。
  • methods: 本研究提议一种新的方法, combining 本地天气预测和异常检测,使用互联网物理传感器网络和高级机器学习技术。该方法利用多个空间分布的、但相对较近的位置和物理传感器数据,创建高分辨率的天气模型,可预测短期、本地化的天气 Conditions。
  • results: 研究发现,该系统可以增强天气预测的空间分辨率,同时实时检测异常天气情况。此外,该系统还可以通过不监督学习算法,找到异常天气模式,为决策提供时间性信息。
    Abstract Accurate and timely hyperlocal weather predictions are essential for various applications, ranging from agriculture to disaster management. In this paper, we propose a novel approach that combines hyperlocal weather prediction and anomaly detection using IoT sensor networks and advanced machine learning techniques. Our approach leverages data from multiple spatially-distributed yet relatively close locations and IoT sensors to create high-resolution weather models capable of predicting short-term, localized weather conditions such as temperature, pressure, and humidity. By monitoring changes in weather parameters across these locations, our system is able to enhance the spatial resolution of predictions and effectively detect anomalies in real-time. Additionally, our system employs unsupervised learning algorithms to identify unusual weather patterns, providing timely alerts. Our findings indicate that this system has the potential to enhance decision-making.
    摘要 准确和及时的本地天气预测非常重要,用于各种应用,从农业到灾害管理。在这篇论文中,我们提出了一种新的方法,它结合了本地天气预测和异常检测,使用互联网器件网络和高级机器学习技术。我们的方法利用多个位于不同地点,但相对较近的位置和互联网器件来创建高分解能力的天气模型,能够预测短期、本地化的天气条件,如温度、压力和湿度。通过监测这些位置之间的天气参数变化,我们的系统可以提高地理分解能力,并实时检测异常。此外,我们的系统使用无监督学习算法来识别异常天气模式,提供实时警报。我们的发现表明,这种系统有可能提高决策。

Program Translation via Code Distillation

  • paper_url: http://arxiv.org/abs/2310.11476
  • repo_url: None
  • paper_authors: Yufan Huang, Mengnan Qi, Yongqiang Yao, Maoquan Wang, Bin Gu, Colin Clement, Neel Sundaresan
  • for: 这篇论文主要写于如何使用Code Distillation(CoDist)模型进行软件版本迁移和程序翻译。
  • methods: 这篇论文提出了一种基于Code Distillation(CoDist)模型的方法,该方法可以捕捉代码的semantic和structural等价性,并生成一种语言无关的中间表示。这种中间表示可以作为翻译的准则,从而生成并行的训练数据集,并且可以在任何编程语言上应用。
  • results: 根据CodeXGLUE和TransCoder GeeksForGeeks翻译测试 benchmark,这种方法可以达到当前最佳性能水平,与TransCoder-ST相比,增加了12.7%的平均绝对提升。
    Abstract Software version migration and program translation are an important and costly part of the lifecycle of large codebases. Traditional machine translation relies on parallel corpora for supervised translation, which is not feasible for program translation due to a dearth of aligned data. Recent unsupervised neural machine translation techniques have overcome data limitations by included techniques such as back translation and low level compiler intermediate representations (IR). These methods face significant challenges due to the noise in code snippet alignment and the diversity of IRs respectively. In this paper we propose a novel model called Code Distillation (CoDist) whereby we capture the semantic and structural equivalence of code in a language agnostic intermediate representation. Distilled code serves as a translation pivot for any programming language, leading by construction to parallel corpora which scale to all available source code by simply applying the distillation compiler. We demonstrate that our approach achieves state-of-the-art performance on CodeXGLUE and TransCoder GeeksForGeeks translation benchmarks, with an average absolute increase of 12.7% on the TransCoder GeeksforGeeks translation benchmark compare to TransCoder-ST.
    摘要 软件版本迁移和程序翻译是大型代码库生命周期中的重要和昂贵部分。传统机器翻译依赖平行 corpora 进行监督翻译,但对程序翻译来说不是可行的,因为没有准确的数据对齐。现代无监督神经机器翻译技术已经突破了数据限制,通过包括回翻译和低级编译器中间表示(IR)等技术。然而,这些方法面临着代码片段对齐的噪音和 IR 的多样性的挑战。在这篇论文中,我们提出了一种新的模型 called Code Distillation(CoDist),它可以捕捉代码的 semantics 和结构相似性,并将其转化为语言无关的中间表示。浓缩代码可以作为任何编程语言的翻译轮廓,从而自动生成平行 corpora,并且可以通过 simply 应用浓缩编译器来扩展到所有可用的源代码。我们示示了我们的方法可以在 CodeXGLUE 和 TransCoder GeeksForGeeks 翻译benchmark中达到状态机器翻译的性能水平,与TransCoder-ST 的平均绝对增幅为12.7%。

Why Do Students Drop Out? University Dropout Prediction and Associated Factor Analysis Using Machine Learning Techniques

  • paper_url: http://arxiv.org/abs/2310.10987
  • repo_url: None
  • paper_authors: Sean Kim, Eliot Yoo, Samuel Kim
  • for: 本研究旨在预测大学学生的毕业和退学情况,以便帮助教育机构和学生更好地规划教学和学习计划。
  • methods: 本研究使用学术、民生、社会经济和macro经济数据类型进行大学生毕业和退学预测。同时,我们进行了相关因素分析,以便分析这些数据类型对机器学习模型的表现有多大影响。
  • results: 我们使用这些特征训练了四个二分类器,以确定学生会毕业或退学。总的来说,这些模型在预测退学状况时的ROC-AUC分数为0.935。对于学术数据类型,模型性能最高,当排除所有学术相关特征时,模型性能下降到0.811。初步结果表明,数据类型和退学状况之间存在相关性。
    Abstract Graduation and dropout rates have always been a serious consideration for educational institutions and students. High dropout rates negatively impact both the lives of individual students and institutions. To address this problem, this study examined university dropout prediction using academic, demographic, socioeconomic, and macroeconomic data types. Additionally, we performed associated factor analysis to analyze which type of data would be most influential on the performance of machine learning models in predicting graduation and dropout status. These features were used to train four binary classifiers to determine if students would graduate or drop out. The overall performance of the classifiers in predicting dropout status had an average ROC-AUC score of 0.935. The data type most influential to the model performance was found to be academic data, with the average ROC-AUC score dropping from 0.935 to 0.811 when excluding all academic-related features from the data set. Preliminary results indicate that a correlation does exist between data types and dropout status.
    摘要 毕业和退学率一直是教育机构和学生的严重考虑之一。高退学率对个人学生和机构都有负面影响。为了解决这个问题,本研究使用学术、人口、社会经济和 macroeconomic 数据类型进行大学退学预测。此外,我们还进行了相关因素分析,以分析这些数据类型对机器学习模型的执行性能有多大影响。这些特征被用来训练四个二分类器,以确定学生会毕业或退学。总的来说,这些分类器在预测退学状况时的 ROC-AUC 分数为 0.935。数据类型对模型性能最有影响的是学术数据,当排除所有学术相关特征时,模型的 ROC-AUC 分数从 0.935 下降至 0.811。初步结果表明,数据类型和退学状况之间存在相关性。

Exact nonlinear state estimation

  • paper_url: http://arxiv.org/abs/2310.10976
  • repo_url: None
  • paper_authors: Hristo G. Chipilski
  • for: 提高数据融合方法的准确性和稳定性,尤其是在高维模型中。
  • methods: 基于生成型人工智能技术的新非线性估计理论,拓展了现有的准 Gaussian 分布假设,提供了更高准确性和稳定性的数据融合方法。
  • results: 对理想化的统计实验进行了验证,结果表明,在观测错误小于预测不确定性和状态变量存在强非线性相互关系的情况下,ECTF 可以提供更高的准确性和稳定性。
    Abstract The majority of data assimilation (DA) methods in the geosciences are based on Gaussian assumptions. While these assumptions facilitate efficient algorithms, they cause analysis biases and subsequent forecast degradations. Non-parametric, particle-based DA algorithms have superior accuracy, but their application to high-dimensional models still poses operational challenges. Drawing inspiration from recent advances in the field of generative artificial intelligence (AI), this article introduces a new nonlinear estimation theory which attempts to bridge the existing gap in DA methodology. Specifically, a Conjugate Transform Filter (CTF) is derived and shown to generalize the celebrated Kalman filter to arbitrarily non-Gaussian distributions. The new filter has several desirable properties, such as its ability to preserve statistical relationships in the prior state and convergence to highly accurate observations. An ensemble approximation of the new theory (ECTF) is also presented and validated using idealized statistical experiments that feature bounded quantities with non-Gaussian distributions, a prevalent challenge in Earth system models. Results from these experiments indicate that the greatest benefits from ECTF occur when observation errors are small relative to the forecast uncertainty and when state variables exhibit strong nonlinear dependencies. Ultimately, the new filtering theory offers exciting avenues for improving conventional DA algorithms through their principled integration with AI techniques.
    摘要 大多数数据吸收(DA)方法在地球科学中基于 Gaussian 假设。这些假设使得算法效率高,但会导致分析偏误和预测质量下降。非Parametric, 粒子基本的 DA 算法具有更高的准确度,但在高维模型应用中仍存在操作挑战。本文从最近的生成式人工智能(AI)领域启发,提出一种新的非线性估计理论,以尝试桥接现有 DA 方法ología 的空缺。特别是,一种 conjugate transform filter(CTF)被 derivation 和证明能够泛化 kalman 筛到任意非 Gaussian 分布。新筛有多个愉悦性质,如保持先前状态的统计关系和 converge 到高精度观测。一种 ensemble approximation of the new theory(ECTF)也被提出和验证,使用 идеalized 统计实验,这些实验中的量均具有非 Gaussian 分布,是地球系统模型中的普遍问题。实验结果表明,ECTF 在观测Error 小于预测 uncertainty 以及状态变量具有强非线性关系时,具有最大的优势。最后,新的 filtering 理论提供了改进传统 DA 算法的原则性 интеграción with AI 技术的推动力。

SD-PINN: Deep Learning based Spatially Dependent PDEs Recovery

  • paper_url: http://arxiv.org/abs/2310.10970
  • repo_url: None
  • paper_authors: Ruixian Liu, Peter Gerstoft
  • for: 这篇论文是为了描述一种能够直接从物理测量数据中恢复部分偏微分方程(PDE)的含义的物理学习神经网络(PINN)的扩展。
  • methods: 该方法使用一个具有空间依赖性的物理学习神经网络(SD-PINN),可以通过单个神经网络来恢复空间依赖性的PDE含义。该方法还利用物理约束来降低噪声的影响。
  • results: 该方法可以充分利用物理约束来恢复PDE含义,并且可以在没有测量数据的情况下,通过含义低级假设来恢复PDE含义。
    Abstract The physics-informed neural network (PINN) is capable of recovering partial differential equation (PDE) coefficients that remain constant throughout the spatial domain directly from physical measurements. In this work, we propose a spatially dependent physics-informed neural network (SD-PINN), which enables the recovery of coefficients in spatially-dependent PDEs using a single neural network, eliminating the requirement for domain-specific physical expertise. The proposed method exhibits robustness to noise owing to the incorporation of physical constraints. It can also incorporate the low-rank assumption of the spatial variation for the PDE coefficients to recover the coefficients at locations without available measurements.
    摘要 物理信息神经网络(PINN)可以直接从物理测量中恢复在整个空间域内保持常数的偏微分方程(PDE)系数。在这项工作中,我们提出空间依赖的物理信息神经网络(SD-PINN),它可以使用单个神经网络恢复空间依赖的 PDE 系数,消除了对特定领域物理专业知识的需求。由于引入了物理约束,该方法对噪声具有鲁棒性,并可以结合 PDE 系数空间变化的低秩假设,恢复没有测量数据位置处的系数。

The neural network models with delays for solving absolute value equations

  • paper_url: http://arxiv.org/abs/2310.10965
  • repo_url: None
  • paper_authors: Dongmei Yu, Gehao Zhang, Cairong Chen, Deren Han
  • for: Solving the absolute value equation (AVE) $Ax - |x| - b = 0$.
  • methods: An inverse-free neural network model with mixed delays (including an inverse-free model with discrete delay as a special case), analyzed via the Lyapunov-Krasovskii theory and the linear matrix inequality (LMI) method.
  • results: The proposed delayed neural network models are proved to be exponentially convergent to the solution of the AVE and can solve a class of AVE with $\|A^{-1}\| > 1$.
    Abstract An inverse-free neural network model with mixed delays is proposed for solving the absolute value equation (AVE) $Ax -|x| - b =0$, which includes an inverse-free neural network model with discrete delay as a special case. By using the Lyapunov-Krasovskii theory and the linear matrix inequality (LMI) method, the developed neural network models are proved to be exponentially convergent to the solution of the AVE. Compared with the existing neural network models for solving the AVE, the proposed models feature the ability of solving a class of AVE with $\|A^{-1}\|>1$. Numerical simulations are given to show the effectiveness of the two delayed neural network models.
    摘要 <>转换给定文本到简化中文。>一种无反函数神经网络模型,包括杂度延迟,被提出来解决绝值方程(AVE) $Ax - |x| - b = 0$。这个模型包括杂度延迟神经网络模型作为特殊情况。通过利用利阿普涅夫-克拉索夫斯基理论和线性矩阵不等式(LMI)方法,我们证明了这些神经网络模型在AVE的解的抽象 convergent。与现有的神经网络模型相比,我们的模型可以解决一类AVE中,$ \|A^{-1}\|>1$。numerical simulations 给出了这两种延迟神经网络模型的效果。

A Local Graph Limits Perspective on Sampling-Based GNNs

  • paper_url: http://arxiv.org/abs/2310.10953
  • repo_url: None
  • paper_authors: Yeganeh Alimohammadi, Luana Ruiz, Amin Saberi
  • for: 这项研究旨在提供一种训练图神经网络(GNNs)的理论框架,用于处理大输入图。
  • methods: 这项研究使用了采样方法,将大输入图分解成小固定大小的子图进行训练。
  • results: 研究发现,通过采样训练GNNs,可以在减少训练时间和数据量的情况下,保持模型的性能。在一个节点分类任务中,研究发现,使用小型子图进行采样训练,可以达到与直接训练在原始图上的性能相似的水平。
    Abstract We propose a theoretical framework for training Graph Neural Networks (GNNs) on large input graphs via training on small, fixed-size sampled subgraphs. This framework is applicable to a wide range of models, including popular sampling-based GNNs, such as GraphSAGE and FastGCN. Leveraging the theory of graph local limits, we prove that, under mild assumptions, parameters learned from training sampling-based GNNs on small samples of a large input graph are within an $\epsilon$-neighborhood of the outcome of training the same architecture on the whole graph. We derive bounds on the number of samples, the size of the graph, and the training steps required as a function of $\epsilon$. Our results give a novel theoretical understanding for using sampling in training GNNs. They also suggest that by training GNNs on small samples of the input graph, practitioners can identify and select the best models, hyperparameters, and sampling algorithms more efficiently. We empirically illustrate our results on a node classification task on large citation graphs, observing that sampling-based GNNs trained on local subgraphs 12$\times$ smaller than the original graph achieve comparable performance to those trained on the input graph.
    摘要 我们提出一种理论框架,用于在大输入图上训练图神经网络(GNNs) via 训练小型、固定大小的采样子图。这个框架适用于各种采样基于GNNs,包括受欢迎的采样GNNs,如GraphSAGE和FastGCN。我们利用图本地限制理论,证明在某些假设下,从训练采样GNNs на小样本上获得的参数与训练同样模型在整个图上的参数在$\epsilon$-邻域内。我们 derive出参数数量、图的大小和训练步骤的上限,作为函数于$\epsilon$。我们的结果为使用采样在训练GNNs提供了新的理论理解,并表明通过训练GNNs于小样本上可以更加快速地确定最佳模型、 гиперпараметры和采样算法。我们在一个节点分类任务上对大量引用图进行了实验,发现采样基于GNNs训练于local子图12$\times$小于原始图的性能与训练于原始图相当。

Restricted Tweedie Stochastic Block Models

  • paper_url: http://arxiv.org/abs/2310.10952
  • repo_url: None
  • paper_authors: Jie Jian, Mu Zhu, Peijun Sang
  • for: 本文旨在提出一种基于非正态 Tweedie 分布的随机块模型(SBM),用于社区检测网络中的非负零Inflated 连接量。
  • methods: 本文提出了一种新的 SBM 模型,使用restricted Tweedie distribution来模型连接量,并考虑了节点信息,如国家之间的地理距离。
  • results: 在大量的 simulations 和实际国际贸易数据中,本文显示了该模型的效果。特别是,当 nodes 数足够多时,计算最大 likelihood 参数的过程可以独立地计算节点标签。这使得可以开发一种高效的 two-step 算法,将 covariate 效应与其他参数分离。
    Abstract The stochastic block model (SBM) is a widely used framework for community detection in networks, where the network structure is typically represented by an adjacency matrix. However, conventional SBMs are not directly applicable to an adjacency matrix that consists of non-negative zero-inflated continuous edge weights. To model the international trading network, where edge weights represent trading values between countries, we propose an innovative SBM based on a restricted Tweedie distribution. Additionally, we incorporate nodal information, such as the geographical distance between countries, and account for its dynamic effect on edge weights. Notably, we show that given a sufficiently large number of nodes, estimating this covariate effect becomes independent of community labels of each node when computing the maximum likelihood estimator of parameters in our model. This result enables the development of an efficient two-step algorithm that separates the estimation of covariate effects from other parameters. We demonstrate the effectiveness of our proposed method through extensive simulation studies and an application to real-world international trading data.
    摘要 Stochastic block model (SBM) 是一种广泛使用的社区探测模型,用于网络结构的表示,通常是一个相对矩阵。然而,传统的 SBM 不直接适用于具有非负零填充连接权重的邻接矩阵。为了模型国际贸易网络,其中边权重表示国家之间贸易值,我们提出了一种创新的 SBM,基于限制的 Tweedie 分布。此外,我们还考虑了节点信息,如国家之间的地理距离,并考虑其动态影响边权重。我们发现,当节点数足够多时,计算每个节点社区标签的最大 LIKELIHOOD 估计器中计算 covariate 效应的结果,是独立的。这一结果允许我们开发一种高效的 two-step 算法,将 covariate 效应与其他参数分离。我们通过广泛的 simulations 研究和实际国际贸易数据应用,证明了我们提出的方法的有效性。

Combat Urban Congestion via Collaboration: Heterogeneous GNN-based MARL for Coordinated Platooning and Traffic Signal Control

  • paper_url: http://arxiv.org/abs/2310.10948
  • repo_url: None
  • paper_authors: Xianyue Peng, Hang Gao, Hao Wang, H. Michael Zhang
  • for: 提高交通流量和减少拥堵
  • methods: 使用多智能体学习和交通理论来解决异质性和协调问题,并设计了各自的观察、操作和奖励函数来优化交通流量
  • results: 通过 SUMO 模拟,实现了交通流量等多个指标的协调结果,并与独立的信号控制或队列控制相比,表现更佳。
    Abstract Over the years, reinforcement learning has emerged as a popular approach to develop signal control and vehicle platooning strategies either independently or in a hierarchical way. However, jointly controlling both in real-time to alleviate traffic congestion presents new challenges, such as the inherent physical and behavioral heterogeneity between signal control and platooning, as well as coordination between them. This paper proposes an innovative solution to tackle these challenges based on heterogeneous graph multi-agent reinforcement learning and traffic theories. Our approach involves: 1) designing platoon and signal control as distinct reinforcement learning agents with their own set of observations, actions, and reward functions to optimize traffic flow; 2) designing coordination by incorporating graph neural networks within multi-agent reinforcement learning to facilitate seamless information exchange among agents on a regional scale. We evaluate our approach through SUMO simulation, which shows a convergent result in terms of various transportation metrics and better performance over sole signal or platooning control.

Multi-point Feedback of Bandit Convex Optimization with Hard Constraints

  • paper_url: http://arxiv.org/abs/2310.10946
  • repo_url: None
  • paper_authors: Yasunari Hikima
  • for: This paper studies bandit convex optimization with constraints, where the learner must generate a sequence of decisions under only partial information about the loss functions so that both the cumulative loss and the cumulative constraint violation are kept small.
  • methods: The cumulative hard constraint violation $\sum_{t=1}^{T} \max\{g_t(\boldsymbol{x}_t), 0\}$ is adopted as the violation metric; because of the maximum operator, a strictly feasible decision cannot cancel out earlier violations, unlike the conventional long-term violation metric. A penalty-based proximal gradient descent method is proposed in which the gradient is estimated from a two-point function evaluation.
  • results: The algorithm attains $O(d^2T^{\max\{c,1-c\}})$ regret and $O(d^2T^{1-\frac{c}{2}})$ cumulative hard constraint violation for convex loss functions and time-varying constraints, where $d$ is the dimensionality of the feasible region and $c\in[\frac{1}{2}, 1)$ is a user-determined parameter; both bounds improve further when the loss functions are strongly convex.
    Abstract This paper studies bandit convex optimization with constraints, where the learner aims to generate a sequence of decisions under partial information of loss functions such that the cumulative loss is reduced as well as the cumulative constraint violation is simultaneously reduced. We adopt the cumulative \textit{hard} constraint violation as the metric of constraint violation, which is defined by $\sum_{t=1}^{T} \max\{g_t(\boldsymbol{x}_t), 0\}$. Owing to the maximum operator, a strictly feasible solution cannot cancel out the effects of violated constraints compared to the conventional metric known as \textit{long-term} constraints violation. We present a penalty-based proximal gradient descent method that attains a sub-linear growth of both regret and cumulative hard constraint violation, in which the gradient is estimated with a two-point function evaluation. Precisely, our algorithm attains $O(d^2T^{\max\{c,1-c\}})$ regret bounds and $O(d^2T^{1-\frac{c}{2}})$ cumulative hard constraint violation bounds for convex loss functions and time-varying constraints, where $d$ is the dimensionality of the feasible region and $c\in[\frac{1}{2}, 1)$ is a user-determined parameter. We also extend the result for the case where the loss functions are strongly convex and show that both regret and constraint violation bounds can be further reduced.
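A sketch of the two-point gradient estimator that underlies this bandit feedback model: two function evaluations per round along a random direction on the unit sphere yield a (nearly) unbiased estimate of the gradient of a smoothed version of the loss. The quadratic test function and the averaging over many draws are only for illustration.

```python
import numpy as np

def two_point_gradient(f, x, delta=1e-3, seed=None):
    """Two-point bandit gradient estimate of f at x.

    Only two function evaluations are used:
        g = (d / (2*delta)) * (f(x + delta*u) - f(x - delta*u)) * u,
    with u drawn uniformly from the unit sphere.
    """
    rng = np.random.default_rng(seed)
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    return (d / (2 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u

# Average many estimates for a quadratic and compare to the true gradient.
f = lambda x: 0.5 * np.sum(x ** 2) + x[0]
x0 = np.array([1.0, -2.0, 0.5])
est = np.mean([two_point_gradient(f, x0, seed=s) for s in range(20000)], axis=0)
print("averaged estimate:", np.round(est, 3))               # approaches the truth
print("true gradient    :", x0 + np.array([1.0, 0.0, 0.0]))
```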

Reaching the Limit in Autonomous Racing: Optimal Control versus Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2310.10943
  • repo_url: None
  • paper_authors: Yunlong Song, Angel Romero, Matthias Mueller, Vladlen Koltun, Davide Scaramuzza
  • for: This paper aims to design a control system for an agile mobile robot, specifically in the context of autonomous drone racing.
  • methods: The paper uses reinforcement learning (RL) to train a neural network controller, which outperforms optimal control (OC) methods in this setting.
  • results: The RL controller achieves superhuman control performance within minutes of training on a standard workstation, achieving a peak acceleration greater than 12 times the gravitational acceleration and a peak velocity of 108 kilometers per hour.
    Abstract A central question in robotics is how to design a control system for an agile mobile robot. This paper studies this question systematically, focusing on a challenging setting: autonomous drone racing. We show that a neural network controller trained with reinforcement learning (RL) outperformed optimal control (OC) methods in this setting. We then investigated which fundamental factors have contributed to the success of RL or have limited OC. Our study indicates that the fundamental advantage of RL over OC is not that it optimizes its objective better but that it optimizes a better objective. OC decomposes the problem into planning and control with an explicit intermediate representation, such as a trajectory, that serves as an interface. This decomposition limits the range of behaviors that can be expressed by the controller, leading to inferior control performance when facing unmodeled effects. In contrast, RL can directly optimize a task-level objective and can leverage domain randomization to cope with model uncertainty, allowing the discovery of more robust control responses. Our findings allowed us to push an agile drone to its maximum performance, achieving a peak acceleration greater than 12 times the gravitational acceleration and a peak velocity of 108 kilometers per hour. Our policy achieved superhuman control within minutes of training on a standard workstation. This work presents a milestone in agile robotics and sheds light on the role of RL and OC in robot control.
    摘要 中心问题在 роботике是如何设计一个敏捷移动机器人的控制系统。这篇论文系统地研究这个问题,专注于一个挑战性的设定:自主无人机赛车。我们表明,使用强化学习(RL)训练的神经网络控制器在这个设定中表现得更好于优化控制(OC)方法。然后,我们研究了RL和OC之间的基本因素,发现RL的优势不在于更好地优化目标函数,而是在于优化更好的目标函数。OC将问题分解为规划和控制两个部分,使用显式中间表示(如轨迹)作为控制器的界面,这种分解限制控制器可表达的行为范围,导致面临不Modeled Effects时的控制性下降。相比之下,RL可以直接优化任务级目标函数,并通过随机化预测来快速适应模型不确定性,从而发现更加 Robust control response。我们的发现使我们可以将敏捷无人机 pushed to its maximum performance,达到了12倍重力加速度的峰值加速和108公里/小时的峰值速度。我们的策略在标准工作站上训练仅需几分钟便可以达到超人控制水平。这项工作为敏捷 роботиcs带来了里程碑,也照亮了RL和OC在机器人控制中的角色。

Fast and Simple Spectral Clustering in Theory and Practice

  • paper_url: http://arxiv.org/abs/2310.10939
  • repo_url: https://github.com/pmacg/fast-spectral-clustering
  • paper_authors: Peter Macgregor
  • for: Finding $k$ clusters in a graph $G$.
  • methods: A simple spectral clustering algorithm whose vertex embedding uses $O(\log(k))$ vectors computed with the power method, instead of $k$ eigenvectors of the graph Laplacian.
  • results: The embedding is computed in nearly-linear time and the algorithm provably recovers the ground-truth clusters under natural assumptions; it is significantly faster than alternative clustering algorithms while achieving approximately the same accuracy.
    Abstract Spectral clustering is a popular and effective algorithm designed to find $k$ clusters in a graph $G$. In the classical spectral clustering algorithm, the vertices of $G$ are embedded into $\mathbb{R}^k$ using $k$ eigenvectors of the graph Laplacian matrix. However, computing this embedding is computationally expensive and dominates the running time of the algorithm. In this paper, we present a simple spectral clustering algorithm based on a vertex embedding with $O(\log(k))$ vectors computed by the power method. The vertex embedding is computed in nearly-linear time with respect to the size of the graph, and the algorithm provably recovers the ground truth clusters under natural assumptions on the input graph. We evaluate the new algorithm on several synthetic and real-world datasets, finding that it is significantly faster than alternative clustering algorithms, while producing results with approximately the same clustering accuracy.
    摘要 spectral clustering 是一种流行的有效算法,用于在图 G 中找到 $k$ 个群。 classical spectral clustering 算法中,图 vertices 被嵌入到 $\mathbb{R}^k$ 中使用 $k$ 个图 Laplacian 矩阵的特征值。然而,计算这个嵌入是计算成本高昂,对算法的运行时间产生很大影响。在本文中,我们提出了一种简单的 spectral clustering 算法,基于一个 $O(\log(k))$ 维的顶点嵌入,通过力方法计算。顶点嵌入的计算时间与图的大小近似线性,而且算法可以证明地回归真实的群集,以下面的自然假设。我们对这种新算法在一些 sintetic 和实际世界数据集上进行了评估,发现它比其他归一化算法更快速,并且生成的结果与实际结果相似。
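A rough sketch of the idea, under illustrative parameter choices that differ from those analyzed in the paper: embed vertices with $O(\log k)$ vectors obtained by repeatedly applying a lazy random-walk operator to random vectors, then run k-means on the embedding.

```python
import numpy as np
from sklearn.cluster import KMeans

def fast_spectral_embedding(A, k, t=15, seed=0):
    """Vertex embedding with O(log k) vectors computed by the power method.

    Repeatedly applies M = (I + D^{-1} A) / 2, a lazy random-walk operator, to a
    few random vectors; the slowly decaying cluster directions dominate after t
    iterations.  Parameter choices here are illustrative only.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    d = np.maximum(A.sum(axis=1), 1)
    M = (np.eye(n) + A / d[:, None]) / 2
    dim = max(2, int(np.ceil(np.log2(k))))
    Y = rng.standard_normal((n, dim))
    for _ in range(t):
        Y = M @ Y
    return Y

# Toy graph: two noisy blocks of 50 nodes each.
rng = np.random.default_rng(1)
n, k = 100, 2
P = np.full((n, n), 0.02); P[:50, :50] = P[50:, 50:] = 0.3
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(fast_spectral_embedding(A, k))
print("block 1 label counts:", np.bincount(labels[:50], minlength=2),
      "| block 2:", np.bincount(labels[50:], minlength=2))
```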

Machine Learning in the Quantum Age: Quantum vs. Classical Support Vector Machines

  • paper_url: http://arxiv.org/abs/2310.10910
  • repo_url: None
  • paper_authors: Davut Emre Tasar, Kutan Koruyan, Ceren Ocal Tasar
  • for: This work juxtaposes the efficiency of machine learning algorithms under classical and quantum computational paradigms, focusing on Support Vector Machines (SVM) and the classification performance of Quantum Support Vector Machines (QSVM) run on quantum hardware over the Iris dataset.
  • methods: An extensive array of experiments orchestrated through the Qiskit library, alongside hyperparameter optimization.
  • results: In particular scenarios, QSVMs reach an accuracy that can vie with classical SVMs, although execution times are presently much longer; moreover, increasing quantum computational capacity and the magnitude of parallelism can markedly improve the performance of quantum machine learning algorithms.
    Abstract This work endeavors to juxtapose the efficacy of machine learning algorithms within classical and quantum computational paradigms. Particularly, by emphasizing on Support Vector Machines (SVM), we scrutinize the classification prowess of classical SVM and Quantum Support Vector Machines (QSVM) operational on quantum hardware over the Iris dataset. The methodology embraced encapsulates an extensive array of experiments orchestrated through the Qiskit library, alongside hyperparameter optimization. The findings unveil that in particular scenarios, QSVMs extend a level of accuracy that can vie with classical SVMs, albeit the execution times are presently protracted. Moreover, we underscore that augmenting quantum computational capacity and the magnitude of parallelism can markedly ameliorate the performance of quantum machine learning algorithms. This inquiry furnishes invaluable insights regarding the extant scenario and future potentiality of machine learning applications in the quantum epoch. Colab: https://t.ly/QKuz0
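Only the classical half of the comparison is sketched below, on the same Iris data, using scikit-learn with a small hyperparameter search; the quantum counterpart requires Qiskit and a quantum simulator or hardware backend and is not included. The parameter grid is an arbitrary assumption.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

# Classical SVM baseline with a small hyperparameter search; a QSVM would
# replace the RBF kernel with a quantum feature-map kernel but expose the same
# fit/score interface.
grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    param_grid={"svc__C": [0.1, 1, 10, 100], "svc__gamma": ["scale", 0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X_tr, y_tr)
print("best params:", grid.best_params_)
print(f"test accuracy: {grid.score(X_te, y_te):.3f}")
```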

Heterogenous Memory Augmented Neural Networks

  • paper_url: http://arxiv.org/abs/2310.10909
  • repo_url: https://github.com/qiuzh20/hma
  • paper_authors: Zihan Qiu, Zhen Liu, Shuicheng Yan, Shanghang Zhang, Jie Fu
  • for: 本研究旨在提出一种基于异构记忆扩展的神经网络方法,以提高神经网络在数据稀缺和 OUT-OF-DISTRIBUTION(OOD)场景中的表现。
  • methods: 该方法通过引入学习记忆标记和注意机制,使得神经网络可以更好地处理大量数据,而无需增加巨大的计算成本。
  • results: 经过广泛的实验证明,该方法可以与不同的底层模型(MLP、CNN、GNN和Transformer)结合使用,并在图像和图格等任务下显示出竞争性的表现,特别是在数据稀缺和OOD情况下。
    Abstract It has been shown that semi-parametric methods, which combine standard neural networks with non-parametric components such as external memory modules and data retrieval, are particularly helpful in data scarcity and out-of-distribution (OOD) scenarios. However, existing semi-parametric methods mostly depend on independent raw data points - this strategy is difficult to scale up due to both high computational costs and the incapacity of current attention mechanisms with a large number of tokens. In this paper, we introduce a novel heterogeneous memory augmentation approach for neural networks which, by introducing learnable memory tokens with attention mechanism, can effectively boost performance without huge computational overhead. Our general-purpose method can be seamlessly combined with various backbones (MLP, CNN, GNN, and Transformer) in a plug-and-play manner. We extensively evaluate our approach on various image and graph-based tasks under both in-distribution (ID) and OOD conditions and show its competitive performance against task-specific state-of-the-art methods. Code is available at \url{https://github.com/qiuzh20/HMA}.
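As a rough illustration of the core idea, learnable memory tokens queried through attention on top of an arbitrary backbone, the PyTorch sketch below adds a small memory bank to a feature extractor. The layer sizes and the single-attention-layer design are illustrative assumptions, not the HMA architecture from the paper.

```python
import torch
import torch.nn as nn

class MemoryAugmentedHead(nn.Module):
    """Backbone features attend over a bank of learnable memory tokens."""
    def __init__(self, feat_dim=128, num_memory_tokens=32, num_heads=4, num_classes=10):
        super().__init__()
        # Learnable memory tokens shared across all inputs.
        self.memory = nn.Parameter(torch.randn(num_memory_tokens, feat_dim) * 0.02)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):                     # feats: (batch, feat_dim)
        q = feats.unsqueeze(1)                    # (batch, 1, feat_dim) as queries
        mem = self.memory.unsqueeze(0).expand(feats.size(0), -1, -1)
        retrieved, _ = self.attn(q, mem, mem)     # attend over the memory tokens
        fused = feats + retrieved.squeeze(1)      # residual fusion of retrieved memory
        return self.classifier(fused)

# Example: plug the head on top of any backbone's pooled features.
backbone = nn.Sequential(nn.Linear(64, 128), nn.ReLU())   # stand-in backbone
head = MemoryAugmentedHead()
logits = head(backbone(torch.randn(8, 64)))
print(logits.shape)   # torch.Size([8, 10])
```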

Surrogate Active Subspaces for Jump-Discontinuous Functions

  • paper_url: http://arxiv.org/abs/2310.10907
  • repo_url: None
  • paper_authors: Nathan Wycoff
  • for: The study examines the use of surrogate models and active subspaces for computational models in the social sciences, in particular agent-based models, and the limitations these techniques face when applied to discontinuous simulators.
  • methods: Active subspaces estimated with Gaussian process surrogates are extended to discontinuous functions, clarifying what quantity is actually being estimated; numerical experiments on synthetic test functions compare estimates on continuous and discontinuous functions, and the methodology is applied to the Flee agent-based model.
  • results: Gaussian process estimates of active subspaces for discontinuous functions remain informative and help identify which model parameters matter most; applied to Flee, the method yields novel insights across 8 displacement crises.
    Abstract Surrogate modeling and active subspaces have emerged as powerful paradigms in computational science and engineering. Porting such techniques to computational models in the social sciences brings into sharp relief their limitations in dealing with discontinuous simulators, such as Agent-Based Models, which have discrete outputs. Nevertheless, prior applied work has shown that surrogate estimates of active subspaces for such estimators can yield interesting results. But given that active subspaces are defined by way of gradients, it is not clear what quantity is being estimated when this methodology is applied to a discontinuous simulator. We begin this article by showing some pathologies that can arise when conducting such an analysis. This motivates an extension of active subspaces to discontinuous functions, clarifying what is actually being estimated in such analyses. We also conduct numerical experiments on synthetic test functions to compare Gaussian process estimates of active subspaces on continuous and discontinuous functions. Finally, we deploy our methodology on Flee, an agent-based model of refugee movement, yielding novel insights into which parameters of the simulation are most important across 8 displacement crises in Africa and the Middle East.
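For readers unfamiliar with active subspaces, the numpy sketch below shows the standard gradient-based construction on a smooth test function: the matrix C = E[grad f grad f^T] is estimated by Monte Carlo and its leading eigenvectors span the active subspace. The test function and sample sizes are purely illustrative; the paper's contribution concerns what replaces the gradient when the simulator is discontinuous.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Smooth test function that varies mostly along the direction (1, 1, 0, 0, 0).
    return np.exp(0.7 * x[0] + 0.7 * x[1])

def grad_fd(func, x, h=1e-5):
    # Central finite-difference gradient.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (func(x + e) - func(x - e)) / (2 * h)
    return g

d, n = 5, 2000
X = rng.uniform(-1, 1, size=(n, d))                  # samples from the input distribution
grads = np.array([grad_fd(f, x) for x in X])

C = grads.T @ grads / n                              # Monte Carlo estimate of E[grad grad^T]
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
print("eigenvalues:", eigvals[order])
print("leading active direction:", eigvecs[:, order[0]])   # approx (1,1,0,0,0)/sqrt(2)
```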

Analyzing Modularity Maximization in Approximation, Heuristic, and Graph Neural Network Algorithms for Community Detection

  • paper_url: http://arxiv.org/abs/2310.10898
  • repo_url: None
  • paper_authors: Samin Aref, Mahdi Mostajabdaveh
  • for: This study investigates how well different modularity maximization algorithms perform in network community detection.
  • methods: The study uses 104 networks, comprising real-world instances from diverse contexts and synthetic graphs with modular structure. Ten inexact modularity-based algorithms are analyzed, including eight heuristics, two variations of a graph neural network algorithm, and several variations of the Bayan approximation algorithm, against an exact integer programming baseline that globally optimizes modularity.
  • results: Most commonly used modularity maximization algorithms rarely produce an optimal partition, or even a partition resembling an optimal one, even on networks with modular structure. If modularity is to be used for detecting communities, approximate optimization algorithms are the more methodologically sound choice.
    Abstract Community detection, a fundamental problem in computational sciences, finds applications in various domains. Heuristics are often employed to detect communities through maximizing an objective function, modularity, over partitions of network nodes. Our research delves into the performance of different modularity maximization algorithms in achieving optimal partitions. We use 104 networks, comprising real-world instances from diverse contexts and synthetic graphs with modular structures. We analyze ten inexact modularity-based algorithms against an exact baseline which is an exact integer programming method that globally optimizes modularity. The ten algorithms analyzed include eight heuristics, two variations of a graph neural network algorithm, and several variations of the Bayan approximation algorithm. Our analysis uncovers substantial dissimilarities between the partitions obtained by most commonly used modularity-based methods and any optimal partition of the networks, as indicated by both adjusted and reduced mutual information metrics. Importantly, our results show that near-optimal partitions are often disproportionately dissimilar to any optimal partition. Taken together, our analysis points to a crucial limitation of the commonly used unguaranteed modularity-based methods for discovering communities: they rarely produce an optimal partition or a partition resembling an optimal partition even on networks with modular structures. If modularity is to be used for detecting communities, approximate optimization algorithms are recommendable for a more methodologically sound usage of modularity within its applicability limits.
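A minimal illustration of the kind of comparison described above uses networkx's greedy modularity heuristic and adjusted mutual information against a reference partition. This is a generic sketch on a synthetic planted-partition graph, not the authors' benchmark or the exact integer-programming/Bayan baseline.

```python
import networkx as nx
from networkx.algorithms import community
from sklearn.metrics import adjusted_mutual_info_score

# Synthetic graph with two planted communities of 50 nodes each.
G = nx.planted_partition_graph(l=2, k=50, p_in=0.3, p_out=0.02, seed=1)
true_labels = [0] * 50 + [1] * 50

# Heuristic modularity maximization (greedy agglomeration).
communities = community.greedy_modularity_communities(G)
pred_labels = [None] * G.number_of_nodes()
for c, nodes in enumerate(communities):
    for v in nodes:
        pred_labels[v] = c

print("modularity of heuristic partition:", community.modularity(G, communities))
print("AMI vs planted partition:", adjusted_mutual_info_score(true_labels, pred_labels))
```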

eess.IV - 2023-10-17

Hybrid quantum-classical graph neural networks for tumor classification in digital pathology

  • paper_url: http://arxiv.org/abs/2310.11353
  • repo_url: None
  • paper_authors: Anupama Ray, Dhiraj Madan, Srushti Patil, Maria Anna Rapsomaniki, Pushpak Pati
  • for: Understanding interactions between diseased cells and the tumor microenvironment, building on advances in classical machine learning and single-cell technologies, to accelerate therapeutic discovery.
  • methods: A hybrid of a graph neural network (GNN) and a Variational Quantum Classifier (VQC) for binary sub-tasks in breast cancer subtyping, explored in two variants: one with fixed pretrained GNN parameters and one with end-to-end training of GNN+VQC.
  • results: The hybrid quantum neural network (QNN) is on par with state-of-the-art classical GNNs in weighted precision, recall, and F1-score. Amplitude encoding compresses information into a logarithmic number of qubits and attains better performance than classical compression, and end-to-end training improves over fixed GNN parameters as well as slightly over a vanilla GNN with the same number of dimensions.
    Abstract Advances in classical machine learning and single-cell technologies have paved the way to understand interactions between disease cells and tumor microenvironments to accelerate therapeutic discovery. However, challenges in these machine learning methods and NP-hard problems in spatial Biology create an opportunity for quantum computing algorithms. We create a hybrid quantum-classical graph neural network (GNN) that combines GNN with a Variational Quantum Classifier (VQC) for classifying binary sub-tasks in breast cancer subtyping. We explore two variants of the same, the first with fixed pretrained GNN parameters and the second with end-to-end training of GNN+VQC. The results demonstrate that the hybrid quantum neural network (QNN) is at par with the state-of-the-art classical graph neural networks (GNN) in terms of weighted precision, recall and F1-score. We also show that by means of amplitude encoding, we can compress information in logarithmic number of qubits and attain better performance than using classical compression (which leads to information loss while keeping the number of qubits required constant in both regimes). Finally, we show that end-to-end training enables to improve over fixed GNN parameters and also slightly improves over vanilla GNN with same number of dimensions.
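The amplitude-encoding compression mentioned in the results can be illustrated without any quantum hardware: a length-d feature vector is padded to the next power of two, L2-normalized, and its entries become the amplitudes of an n-qubit state, so n = ceil(log2 d) qubits suffice. The numpy sketch below only demonstrates this counting argument; it is not the paper's GNN+VQC pipeline.

```python
import numpy as np

def amplitude_encode(features):
    """Pad a feature vector to the next power of two and L2-normalize it,
    so it can serve as the amplitude vector of an n-qubit state."""
    d = len(features)
    n_qubits = int(np.ceil(np.log2(d)))
    padded = np.zeros(2 ** n_qubits)
    padded[:d] = features
    state = padded / np.linalg.norm(padded)
    return state, n_qubits

features = np.random.rand(24)          # e.g., a pooled graph embedding
state, n_qubits = amplitude_encode(features)
print(f"{len(features)} features encoded into {n_qubits} qubits "
      f"({2 ** n_qubits} amplitudes), norm = {np.linalg.norm(state):.3f}")
```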

Automatic Coronary Artery Plaque Quantification and CAD-RADS Prediction using Mesh Priors

  • paper_url: http://arxiv.org/abs/2310.11297
  • repo_url: None
  • paper_authors: Rudolf L. M. van Herten, Nils Hampe, Richard A. P. Takx, Klaas Jan Franssen, Yining Wang, Dominika Suchá, José P. Henriques, Tim Leiner, R. Nils Planken, Ivana Išgum
  • for: Assessing the risk of cardiovascular events and determining treatment for patients with suspected coronary artery disease (CAD).
  • methods: Surface meshes of the coronary artery lumen and plaque are inferred directly from a centerline prior and used for the downstream task of CAD-RADS categorization.
  • results: Direct inference of coronary lumen and plaque meshes is feasible and enables automated prediction of the routinely performed CAD-RADS categorization. Lesion-wise volume intraclass correlation coefficients were 0.98, 0.79, and 0.85 for calcified, non-calcified, and total plaque volume, respectively, and patient-level CAD-RADS categorization achieved a linearly weighted kappa (κ) of 0.75.
    Abstract Coronary artery disease (CAD) remains the leading cause of death worldwide. Patients with suspected CAD undergo coronary CT angiography (CCTA) to evaluate the risk of cardiovascular events and determine the treatment. Clinical analysis of coronary arteries in CCTA comprises the identification of atherosclerotic plaque, as well as the grading of any coronary artery stenosis typically obtained through the CAD-Reporting and Data System (CAD-RADS). This requires analysis of the coronary lumen and plaque. While voxel-wise segmentation is a commonly used approach in various segmentation tasks, it does not guarantee topologically plausible shapes. To address this, in this work, we propose to directly infer surface meshes for coronary artery lumen and plaque based on a centerline prior and use it in the downstream task of CAD-RADS scoring. The method is developed and evaluated using a total of 2407 CCTA scans. Our method achieved lesion-wise volume intraclass correlation coefficients of 0.98, 0.79, and 0.85 for calcified, non-calcified, and total plaque volume respectively. Patient-level CAD-RADS categorization was evaluated on a representative hold-out test set of 300 scans, for which the achieved linearly weighted kappa ($\kappa$) was 0.75. CAD-RADS categorization on the set of 658 scans from another hospital and scanner led to a $\kappa$ of 0.71. The results demonstrate that direct inference of coronary artery meshes for lumen and plaque is feasible, and allows for the automated prediction of routinely performed CAD-RADS categorization.
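The patient-level agreement metric reported above, linearly weighted Cohen's kappa over CAD-RADS grades (0-5), can be reproduced for any pair of gradings with scikit-learn; the toy grades below are invented solely to show the call.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical CAD-RADS grades (0-5) from the automatic method and a reference reader.
auto_grades = [0, 1, 2, 3, 3, 4, 2, 1, 0, 5]
ref_grades  = [0, 1, 2, 2, 3, 4, 3, 1, 1, 5]

kappa = cohen_kappa_score(ref_grades, auto_grades, weights="linear")
print(f"linearly weighted kappa: {kappa:.2f}")
```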

MorphFlow: Estimating Motion in In Situ Tests of Concrete

  • paper_url: http://arxiv.org/abs/2310.11109
  • repo_url: None
  • paper_authors: Tessa Nogatz, Claudia Redenbach, Katja Schladitz
  • for: The paper presents an algorithm designed to estimate motion from time series of 3D images of concrete, typically acquired by computed tomography during in situ tests or more complex procedures such as self-healing.
  • methods: The algorithm is built on a novel multiscale representation based on morphological wavelets, designed to handle large volumes and the discontinuous displacement fields that frequently occur in in situ tests of concrete.
  • results: The algorithm processes large-scale in situ data plausibly and time-efficiently. In two validation examples, a classical in situ test on refractory concrete and a three-point bending test on normal concrete, structural changes such as crack initiation are already detectable at low scales.
    Abstract We present a novel algorithm explicitly tailored to estimate motion from time series of 3D images of concrete. Such volumetric images are usually acquired by Computed Tomography and can contain for example in situ tests, or more complex procedures like self-healing. Our algorithm is specifically designed to tackle the challenge of large scale in situ investigations of concrete. That means it can not only cope with big images, but also with discontinuous displacement fields that often occur in in situ tests of concrete. We show the superior performance of our algorithm, especially regarding plausibility and time efficient processing. Core of the algorithm is a novel multiscale representation based on morphological wavelets. We use two examples for validation: a classical in situ test on refractory concrete and a three-point bending test on normal concrete. We show that for both applications structural changes like crack initiation can be already found at low scales -- a central achievement of our algorithm.

Iterative Clustering Material Decomposition Aided by Empirical Spectral Correction for High-Resolution Photon-Counting Detectors in Micro-CT

  • paper_url: http://arxiv.org/abs/2310.10913
  • repo_url: None
  • paper_authors: Juan C. R. Luna, Mini Das
  • for: This work aims to improve quantitative accuracy in computed tomography (CT) imaging, specifically in spectral micro-CT with photon-counting detectors (PCDs) that acquire multi-energy projections.
  • methods: A practical combination of instrumentation and measurement strategies, Iterative Clustering Material Decomposition (ICMD), comprising an empirical detector spectral response correction, cluster analysis, and multi-step iterative material decomposition, enabling quantitative separation of multiple materials in spectral micro-CT.
  • results: Experiments show that combining spectral correction with high-dimensional data clustering improves decomposition accuracy and reduces noise, and that more than three materials, including mixtures and K-edge materials, can be decomposed.
    Abstract Photon counting detectors (PCDs) offer promising advancements in computed tomography (CT) imaging by enabling the quantification and 3D imaging of contrast agents and tissue types through multi-energy projections. However, the accuracy of these decomposition methods hinges on precise composite spectral attenuation values that one must reconstruct from spectral micro CT. Factors such as surface defects, local temperature, signal amplification, and impurity levels can cause variations in detector efficiency between pixels, leading to significant quantitative errors. In addition, some inaccuracies such as the charge-sharing effects in PCDs are amplified with a high Z sensor material and also with a smaller detector pixels that are preferred for micro CT. In this work, we propose a comprehensive approach that combines practical instrumentation and measurement strategies leading to the quantitation of multiple materials within an object in a spectral micro CT with a photon counting detector. Our Iterative Clustering Material Decomposition (ICMD) includes an empirical method for detector spectral response corrections, cluster analysis and multi-step iterative material decomposition. Utilizing a CdTe-1mm Medipix detector with a 55$\mu$m pitch, we demonstrate the quantitatively accurate decomposition of several materials in a phantom study, where the sample includes mixtures of material, soft material and K-edge materials. We also show an example of biological sample imaging and separating three distinct types of tissue in mouse: muscle, fat and bone. Our experimental results show that the combination of spectral correction and high-dimensional data clustering enhances decomposition accuracy and reduces noise in micro CT. This ICMD allows for quantitative separation of more than three materials including mixtures and also effectively separates multi-contrast agents.
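As background for the decomposition step, the numpy sketch below shows the plain linear basis-material decomposition that methods like ICMD build on: per pixel, the measured attenuation in several energy bins is modeled as a mix of known material attenuation spectra and solved by least squares. The attenuation numbers are invented for illustration; the paper's spectral correction and clustering sit on top of this kind of model.

```python
import numpy as np

# Assumed per-bin linear attenuation of three basis materials (columns) in
# four energy bins (rows); the values are illustrative only.
A = np.array([
    [0.60, 0.25, 1.80],
    [0.45, 0.22, 1.10],
    [0.35, 0.20, 0.70],
    [0.28, 0.18, 2.00],   # a K-edge material jumps in the highest bin
])

true_fractions = np.array([0.5, 0.3, 0.2])
measured = A @ true_fractions + 0.01 * np.random.default_rng(0).standard_normal(4)

# Per-pixel least-squares decomposition of the measured bins into material fractions.
est, *_ = np.linalg.lstsq(A, measured, rcond=None)
print("estimated material fractions:", np.round(est, 3))
```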

eess.SP - 2023-10-17

WaveFlex: A Smart Surface for Private CBRS Wireless Cellular Networks

  • paper_url: http://arxiv.org/abs/2310.11551
  • repo_url: None
  • paper_authors: Fan Yi, Kun Woo Cho, Yaxiong Xie, Kyle Jamieson
  • for: Enhancing private LTE/5G networks operating under the shared-license framework in the Citizens Broadband Radio Service (CBRS) band.
  • methods: A smart surface that exploits frequency diversity (multiple nearby base stations on different frequencies, as dictated by a Spectrum Access System coordinator) and copes with time dynamism (channel switches when priority users enter the band), while operating independently of the base stations and mobile users.
  • results: In a realistic indoor office scenario, an average SNR gain of 8.50 dB, and average throughput gains of 4.36 Mbps for a single small cell and 3.19 Mbps for four small cells.
    Abstract We present the design and implementation of WaveFlex, the first smart surface that enhances Private LTE/5G networks operating under the shared-license framework in the Citizens Broadband Radio Service frequency band. WaveFlex works in the presence of frequency diversity: multiple nearby base stations operating on different frequencies, as dictated by a Spectrum Access System coordinator. It also handles time dynamism: due to the dynamic sharing rules of the band, base stations occasionally switch channels, especially when priority users enter the network. Finally, WaveFlex operates independently of the network itself, not requiring access to nor modification of the base station or mobile users, yet it remain compliant with and effective on prevailing cellular protocols. We have designed and fabricated WaveFlex on a custom multi-layer PCB, software defined radio-based network monitor, and supporting control software and hardware. Our experimental evaluation benchmarks an operational Private LTE network running at full line rate. Results demonstrate an 8.50 dB average SNR gain, and an average throughput gain of 4.36 Mbps for a single small cell, and 3.19 Mbps for four small cells, in a realistic indoor office scenario.

Integrated Sensing and Channel Estimation by Exploiting Dual Timescales for Delay-Doppler Alignment Modulation

  • paper_url: http://arxiv.org/abs/2310.11326
  • repo_url: None
  • paper_authors: Zhiqiang Xiao, Yong Zeng, Fuxi Wen, Zaichen Zhang, Derrick Wing Kwan Ng
  • for: This paper proposes a novel integrated sensing and communication (ISAC) framework that leverages the recently proposed delay-Doppler alignment modulation (DDAM) technique to improve ISAC performance.
  • methods: The proposed framework uses a novel algorithm called adaptive simultaneously orthogonal matching pursuit with support refinement (ASOMP-SR) for joint environment sensing and path state information (PSI) estimation, and analyzes the performance of DDAM with imperfectly sensed PSI.
  • results: Simulation results show that the proposed DDAM-based ISAC can achieve superior spectral efficiency and a reduced peak-to-average power ratio (PAPR) compared to standard orthogonal frequency division multiplexing (OFDM).
    Abstract For integrated sensing and communication (ISAC) systems, the channel information essential for communication and sensing tasks fluctuates across different timescales. Specifically, wireless sensing primarily focuses on acquiring path state information (PSI) (e.g., delay, angle, and Doppler) of individual multi-path components to sense the environment, which usually evolves much more slowly than the composite channel state information (CSI) required for communications. Typically, the CSI is approximately unchanged during the channel coherence time, which characterizes the statistical properties of wireless communication channels. However, this concept is less appropriate for describing that for wireless sensing. To this end, in this paper, we introduce a new timescale to study the variation of the PSI from a channel geometric perspective, termed path invariant time, during which the PSI largely remains constant. Our analysis indicates that the path invariant time considerably exceeds the channel coherence time. Thus, capitalizing on these dual timescales of the wireless channel, in this paper, we propose a novel ISAC framework exploiting the recently proposed delay-Doppler alignment modulation (DDAM) technique. Different from most existing studies on DDAM that assume the availability of perfect PSI, in this work, we propose a novel algorithm, termed as adaptive simultaneously orthogonal matching pursuit with support refinement (ASOMP-SR), for joint environment sensing and PSI estimation. We also analyze the performance of DDAM with imperfectly sensed PSI.Simulation results unveil that the proposed DDAM-based ISAC can achieve superior spectral efficiency and a reduced peak-to-average power ratio (PAPR) compared to standard orthogonal frequency division multiplexing (OFDM).
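The paper's ASOMP-SR estimator builds on simultaneous orthogonal matching pursuit (SOMP), a standard greedy recovery algorithm for jointly sparse signals. The numpy sketch below implements plain SOMP on a random dictionary as background; the adaptive stopping and support-refinement steps that distinguish ASOMP-SR are not reproduced here.

```python
import numpy as np

def somp(Y, D, sparsity):
    """Simultaneous OMP: recover a row-sparse X such that Y ~ D @ X.
    Y: (m, L) measurements over L snapshots, D: (m, n) dictionary."""
    residual = Y.copy()
    support = []
    for _ in range(sparsity):
        # Pick the atom with the largest total correlation across snapshots.
        corr = np.linalg.norm(D.T @ residual, axis=1)
        corr[support] = 0.0
        support.append(int(np.argmax(corr)))
        Ds = D[:, support]
        X_s, *_ = np.linalg.lstsq(Ds, Y, rcond=None)   # joint least-squares refit
        residual = Y - Ds @ X_s
    X = np.zeros((D.shape[1], Y.shape[1]))
    X[support] = X_s
    return X, sorted(support)

rng = np.random.default_rng(0)
m, n, L, k = 32, 128, 4, 3
D = rng.standard_normal((m, n)) / np.sqrt(m)
true_support = rng.choice(n, size=k, replace=False)
X_true = np.zeros((n, L)); X_true[true_support] = rng.standard_normal((k, L))
Y = D @ X_true + 0.01 * rng.standard_normal((m, L))

X_hat, est_support = somp(Y, D, sparsity=k)
print("true support:", sorted(true_support), "estimated:", est_support)
```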

Imaging of nonlinear materials via the Monotonicity Principle

  • paper_url: http://arxiv.org/abs/2310.11234
  • repo_url: None
  • paper_authors: Vincenzo Mottola, Antonio Corbo Esposito, Gianpaolo Piscitelli, Antonello Tamburrino
  • for: The study addresses inverse problems in the presence of nonlinear materials, specifically magnetostatic permeability tomography, i.e., retrieving the spatial behaviour of magnetic permeability from boundary measurements in DC operations.
  • methods: Building on an extension of the Monotonicity Principle to nonlinear materials, the authors develop a first real-time inversion method for the inverse obstacle problem.
  • results: The paper provides preliminary results establishing the foundation of the method, together with extended numerical examples.
    Abstract The topic of inverse problems, related to Maxwell's equations, in the presence of nonlinear materials is quite new in literature. The lack of contributions in this area can be ascribed to the significant challenges that such problems pose. Retrieving the spatial behaviour of some unknown physical property, starting from boundary measurements, is a nonlinear and highly ill-posed problem even in the presence of linear materials. And the complexity exponentially grows when the focus is on nonlinear material properties. Recently, the Monotonicity Principle has been extended to nonlinear materials under very general assumptions. Starting from the theoretical background given by this extension, we develop a first real-time inversion method for the inverse obstacle problem in the presence of nonlinear materials. The Monotonicity Principle is the foundation of a class of non-iterative algorithms for tomography of linear materials. It has been successfully applied to various problems, governed by different PDEs. In the linear case, MP based inversion methods ensure excellent performances and compatibility with real-time applications. We focus on problems governed by elliptical PDEs and, as an example of application, we treat the Magnetostatic Permeability Tomography problem, in which the aim is to retrieve the spatial behaviour of magnetic permeability through boundary measurements in DC operations. In this paper, we provide some preliminary results giving the foundation of our method and extended numerical examples.

Complex Number Assignment in the Topology Method for Heartbeat Interval Estimation Using Millimeter-Wave Radar

  • paper_url: http://arxiv.org/abs/2310.11149
  • repo_url: None
  • paper_authors: Yuji Tanaka, Kimitaka Sumi, Itsuki Iwata, Takuya Sakamoto
  • for: Accurate estimation of instantaneous heartbeat intervals from millimeter-wave radar signals.
  • methods: Feature points are extracted from the skin displacement waveforms generated by heartbeats and a complex number is assigned to each feature point; a simplified displacement-waveform model is used to predict the optimal complex-number assignment for feature points corresponding to inflection points.
  • results: The validity of the predicted assignments is confirmed through analysis of a publicly available dataset.
    Abstract The topology method is an algorithm for accurate estimation of instantaneous heartbeat intervals using millimeter-wave radar signals. In this model, feature points are extracted from the skin displacement waveforms generated by heartbeats and a complex number is assigned to each feature point. However, these numbers have been assigned empirically and without solid justification. This study used a simplified model of displacement waveforms to predict the optimal choice of the complex number assignments to feature points corresponding to inflection points, and the validity of these numbers was confirmed using analysis of a publicly available dataset.

Intelligent Resource Allocation for UAV-Based Cognitive NOMA Networks: An Active Inference Approach

  • paper_url: http://arxiv.org/abs/2310.11070
  • repo_url: None
  • paper_authors: Felix Obite, Ali Krayani, Atm S. Alam, Lucio Marcenaro, Arumugam Nallanathan, Carlo Regazzoni
  • for: This paper aims to improve the adaptive resource allocation and decision-making of future wireless networks, specifically in the context of uplink UAV-based cognitive NOMA networks.
  • methods: The proposed approach uses an active inference-based learning framework, rooted in cognitive neuroscience, to solve the complex problem of joint subchannel and power allocation. This involves creating a training dataset using random or iterative methods, training a mobile UAV offline to learn a generative model of discrete subchannels and continuous power allocation, and using this model for online inference.
  • results: The proposed approach is validated through numerical simulations, which show efficient performance compared to suboptimal baseline schemes. The approach is able to adapt to non-stationary environments and improve the cumulative sum rate by jointly optimizing the subchannel and power allocation based on the UAV's mobility at each time step.
    Abstract Future wireless networks will need to improve adaptive resource allocation and decision-making to handle the increasing number of intelligent devices. Unmanned aerial vehicles (UAVs) are being explored for their potential in real-time decision-making. Moreover, cognitive non-orthogonal multiple access (Cognitive-NOMA) is envisioned as a remedy to address spectrum scarcity and enable massive connectivity. This paper investigates the design of joint subchannel and power allocation in an uplink UAV-based cognitive NOMA network. We aim to maximize the cumulative sum rate by jointly optimizing the subchannel and power allocation based on the UAV's mobility at each time step. This is often formulated as an optimization problem with random variables. However, conventional optimization algorithms normally introduce significant complexity, and machine learning methods often rely on large but partially representative datasets to build solution models, assuming stationary testing data. Consequently, inference strategies for non stationary events are often overlooked. In this study, we introduce a novel active inference-based learning approach, rooted in cognitive neuroscience, to solve this complex problem. The framework involves creating a training dataset using random or iterative methods to find suboptimal resource allocations. This dataset trains a mobile UAV offline, enabling it to learn a generative model of discrete subchannels and continuous power allocation. The UAV then uses this model for online inference. The method incrementally derives new generative models from training data by identifying dynamic equilibrium conditions between required actions and variables, represented within a unique dynamic Bayesian network. The proposed approach is validated through numerical simulations, showing efficient performance compared to suboptimal baseline schemes.

Aerial-Aided mmWave VANETs Using NOMA: Performance Analysis, Comparison, and Insights

  • paper_url: http://arxiv.org/abs/2310.11068
  • repo_url: None
  • paper_authors: Abdullah Abu Zaid, Baha Eddine Youcef Belmekki, Mohamed-Slim Alouini
  • for: The paper integrates networked tethered flying platforms (NTFPs) into cooperative vehicular ad hoc networks (VANETs) to alleviate problems brought by rapid urbanization.
  • methods: Approximate outage probability and average achievable rate expressions are derived using tools from stochastic geometry, comparing NTFPs with traditional roadside units (RSUs).
  • results: When acting as relays, NTFPs outperform RSUs at larger distances between the transmitting and receiving vehicles, while RSUs are better at short distances. Non-orthogonal multiple access (NOMA) improves spectral efficiency compared with orthogonal access schemes, and millimeter-wave (mmWave) frequencies with a sectored beamforming model provide high data rates.
    Abstract In this paper, we propose the integration of tethered flying platforms in cooperative vehicular ad hoc networks (VANETs) to alleviate the problems of rapid urbanization. In this context, we study the performance of VANETs by deriving approximate outage probability and average achievable rate expressions using tools from stochastic geometry. We compare between the usage of networked tethered flying platforms (NTFPs) and traditional roadside units (RSUs). On the other hand, the rapid increase of smart devices in vehicles and the upcoming urban air mobility (UAM) vision will congest the spectrum and require increased data rates. Hence, we use non-orthogonal multiple access (NOMA) to improve spectral efficiency and compare its performance to orthogonal access schemes. Furthermore, we utilize millimeter-wave (mmWave) frequencies for high data rates and implement a sectored beamforming model. We extensively study the system using three transmission schemes: direct, relay, and hybrid transmission. The results show that when acting as relays, NTFPs outperform RSUs for larger distances between the transmitting and the receiving vehicles, while RSUs outperform NTFPs for short distances. However, NTFPs are the best solution when acting as a source. Moreover, we find that, in most cases, direct transmission is preferred to achieve a high rate compared to other schemes. Finally, the results are summarized in two tables that provide insights into connecting VANETs by selecting the most suitable platform and type of communication for a given set of parameters, configurations, and requirements.

A Tutorial on Near-Field XL-MIMO Communications Towards 6G

  • paper_url: http://arxiv.org/abs/2310.11044
  • repo_url: None
  • paper_authors: Haiquan Lu, Yong Zeng, Changsheng You, Yu Han, Jiayi Zhang, Zhe Wang, Zhenjun Dong, Shi Jin, Cheng-Xiang Wang, Tao Jiang, Xiaohu You, Rui Zhang
  • for: This paper provides a comprehensive tutorial overview of extremely large-scale MIMO (XL-MIMO) for 6G mobile communication networks, offering guidance for tackling the challenges of near-field channel modelling, performance analysis, channel estimation, and practical implementation.
  • methods: The tutorial establishes near-field XL-MIMO models that capture non-uniform spherical wave (NUSW) propagation and spatial non-stationarity, and builds the performance analysis and channel estimation discussion on these models.
  • results: Based on the near-field modelling, the paper presents near-field SNR scaling laws, beam focusing patterns, achievable rates, and degrees-of-freedom (DoF), and elaborates on design issues such as near-field beam codebooks, beam training, channel estimation, and delay alignment modulation (DAM) transmission.
    Abstract Extremely large-scale multiple-input multiple-output (XL-MIMO) is a promising technology for the sixth-generation (6G) mobile communication networks. By significantly boosting the antenna number or size to at least an order of magnitude beyond current massive MIMO systems, XL-MIMO is expected to unprecedentedly enhance the spectral efficiency and spatial resolution for wireless communication. The evolution from massive MIMO to XL-MIMO is not simply an increase in the array size, but faces new design challenges, in terms of near-field channel modelling, performance analysis, channel estimation, and practical implementation. In this article, we give a comprehensive tutorial overview on near-field XL-MIMO communications, aiming to provide useful guidance for tackling the above challenges. First, the basic near-field modelling for XL-MIMO is established, by considering the new characteristics of non-uniform spherical wave (NUSW) and spatial non-stationarity. Next, based on the near-field modelling, the performance analysis of XL-MIMO is presented, including the near-field signal-to-noise ratio (SNR) scaling laws, beam focusing pattern, achievable rate, and degrees-of-freedom (DoF). Furthermore, various XL-MIMO design issues such as near-field beam codebook, beam training, channel estimation, and delay alignment modulation (DAM) transmission are elaborated. Finally, we point out promising directions to inspire future research on near-field XL-MIMO communications.
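To give a feel for the non-uniform spherical wave (NUSW) modelling the tutorial starts from, the numpy sketch below computes the maximum-ratio-combining array gain of a uniform linear array when each element sees its own distance-dependent path loss, and compares it with the planar-wave approximation that simply scales the gain linearly with the element count. With noise power normalized, gain is proportional to SNR, so the saturation of the exact model versus the unbounded growth of the approximation mirrors the scaling-law discussion above. Carrier frequency, spacing, and distance are arbitrary illustrative values, and effects such as effective aperture and element directivity are ignored.

```python
import numpy as np

c, fc = 3e8, 30e9                  # speed of light, 30 GHz carrier
lam = c / fc
d = lam / 2                        # half-wavelength element spacing
r = 5.0                            # distance from array center to the user (meters)

for N in (64, 256, 1024, 4096):
    idx = np.arange(N) - (N - 1) / 2
    y = idx * d                                    # element positions along the array
    dist = np.sqrt(r ** 2 + y ** 2)                # exact element-to-user distances
    amp = lam / (4 * np.pi * dist)                 # per-element free-space amplitude
    gain_nusw = np.sum(amp ** 2)                   # MRC array gain, spherical-wave model
    gain_upw = N * (lam / (4 * np.pi * r)) ** 2    # planar-wave approximation: linear in N
    print(f"N={N:5d}   NUSW gain={10*np.log10(gain_nusw):7.2f} dB   "
          f"UPW gain={10*np.log10(gain_upw):7.2f} dB")
```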

Channel Autocorrelation Estimation for IRS-Aided Wireless Communications Based on Power Measurements

  • paper_url: http://arxiv.org/abs/2310.11038
  • repo_url: None
  • paper_authors: Ge Yan, Lipeng Zhu, Rui Zhang
  • for: Enhancing the performance of wireless communication systems through passive signal reflection with an intelligent reflecting surface (IRS).
  • methods: Channel estimation based on the received signal power measured at the user; since power measurements lack phase information, the autocorrelation matrix of the BS-IRS-user cascaded channel is estimated by solving equivalent matrix-rank-minimization problems.
  • results: Simulations verify the effectiveness of the proposed channel estimation algorithm and of the IRS passive reflection design based on the estimated channel autocorrelation matrix.
    Abstract Intelligent reflecting surface (IRS) can bring significant performance enhancement for wireless communication systems by reconfiguring wireless channels via passive signal reflection. However, such performance improvement generally relies on the knowledge of channel state information (CSI) for IRS-associated links. Prior IRS channel estimation strategies mainly estimate IRS-cascaded channels based on the excessive pilot signals received at the users/base station (BS) with time-varying IRS reflections, which, however, are not compatible with the existing channel training/estimation protocol for cellular networks. To address this issue, we propose in this paper a new channel estimation scheme for IRS-assisted communication systems based on the received signal power measured at the user, which is practically attainable without the need of changing the current protocol. Specifically, due to the lack of signal phase information in power measurements, the autocorrelation matrix of the BS-IRS-user cascaded channel is estimated by solving equivalent matrix-rank-minimization problems. Simulation results are provided to verify the effectiveness of the proposed channel estimation algorithm as well as the IRS passive reflection design based on the estimated channel autocorrelation matrix.

Spectral-Efficiency and Energy-Efficiency of Variable-Length XP-HARQ

  • paper_url: http://arxiv.org/abs/2310.10964
  • repo_url: None
  • paper_authors: Jiahui Feng, Zheng Shi, Yaru Fu, Hong Wang, Guanghua Yang, Shaodan Ma
  • for: Improving both the spectral efficiency (SE) and the energy efficiency (EE) of communications.
  • methods: A variable-length cross-packet hybrid automatic repeat request (VL-XP-HARQ) scheme is proposed, and the resulting optimization problems are handled with Dinkelbach's transform and successive convex approximation (SCA).
  • results: Upper bounds on the SE and EE are derived, with the EE bound shown to be attainable; the EE can be maximized through power allocation while guaranteeing the outage requirement.
    Abstract A variable-length cross-packet hybrid automatic repeat request (VL-XP-HARQ) is proposed to boost the spectral efficiency (SE) and the energy efficiency (EE) of communications. The SE is firstly derived in terms of the outage probabilities, with which the SE is proved to be upper bounded by the ergodic capacity (EC). Moreover, to facilitate the maximization of the SE, the asymptotic outage probability is obtained at high signal-to-noise ratio (SNR), with which the SE is maximized by properly choosing the number of new information bits while guaranteeing outage requirement. By applying Dinkelbach's transform, the fractional objective function is transformed into a subtraction form, which can be decomposed into multiple sub-problems through alternating optimization. By noticing that the asymptotic outage probability is a convex function, each sub-problem can be easily relaxed to a convex problem by adopting successive convex approximation (SCA). Besides, the EE of VL-XP-HARQ is also investigated. An upper bound of the EE is found and proved to be attainable. Furthermore, by aiming at maximizing the EE via power allocation while confining outage within a certain constraint, the methods to the maximization of SE are invoked to solve the similar fractional problem. Finally, numerical results are presented for verification.
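Dinkelbach's transform, used in the paper to handle the fractional objective, replaces max f(x)/g(x) with a sequence of parametric subproblems max f(x) - q g(x). The sketch below applies the generic iteration to a toy one-dimensional energy-efficiency-style ratio (rate over consumed power) solved by grid search; in the paper the subproblems are convexified with SCA instead, and the numbers here are illustrative only.

```python
import numpy as np

# Toy ratio: spectral efficiency over total power, maximized over transmit power p.
p_grid = np.linspace(1e-3, 10.0, 4000)
f = np.log2(1.0 + 2.0 * p_grid)          # "rate" numerator
g = p_grid + 0.5                         # "total power" denominator (0.5 W circuit power)

q = 0.0
for it in range(20):
    # Parametric subproblem: maximize f(p) - q * g(p) (grid search stands in for SCA).
    idx = np.argmax(f - q * g)
    F = f[idx] - q * g[idx]
    q = f[idx] / g[idx]                  # Dinkelbach update of the ratio estimate
    if abs(F) < 1e-9:
        break
print(f"optimal EE ~ {q:.4f} (bit/s/Hz per W) at p = {p_grid[idx]:.3f} W "
      f"after {it + 1} iterations")
```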

On the Performance of Near-Field ISAC

  • paper_url: http://arxiv.org/abs/2310.10917
  • repo_url: None
  • paper_authors: Boqun Zhao, Chongjun Ouyang, Xingqi Zhang, Yuanwei Liu
  • for: This paper studies the performance of integrated sensing and communications (ISAC) in the near-field region and proposes a channel model that is more accurate than the three conventional models.
  • methods: The framework accounts for the effective aperture of the antenna, and sensing and communication performance is analyzed for both downlink and uplink scenarios under several designs.
  • results: As the number of antennas grows, the sensing and communication rates of the proposed model converge to constants, whereas those of the conventional models grow without bound; ISAC achieves a more extensive rate region than conventional frequency-division sensing and communication in both downlink and uplink cases.
    Abstract The technical trends for the next-generation wireless network significantly extend the near-field region, necessitating a reevaluation for the performance of integrated sensing and communications (ISAC) to account for the effects introduced by the near field. In this paper, a near-field ISAC framework is proposed with a more accurate channel model than the three conventional models (TCMs): uniform plane wave, uniform spherical wave, and non-uniform spherical wave, in which the effective aperture of the antenna is considered. Based on the proposed model, sensing and communication (S&C) performance in both downlink and uplink scenarios are analyzed. For the downlink case, three distinct designs are studied: the communications-centric (C-C) design, the sensing-centric (S-C) design, and the Pareto optimal design. Regarding the uplink case, the C-C design, the S-C design and the time-sharing strategy are considered. Within each design, sensing rates (SRs) and communication rates (CRs) are derived. To gain further insights, high signal-to-noise ratio slopes and rate scaling laws concerning the number of antennas are also examined. Finally, the attainable SR-CR regions of the near-field ISAC are characterized. Numerical results reveal that 1) as the number of antennas grows, the SRs and CRs of the proposed model converges to constants, while those of the TCMs increase unboundedly; 2) ISAC achieves a more extensive rate region than the conventional frequency-division S&C in both downlink and uplink cases.

Reuse Kernels or Activations? A Flexible Dataflow for Low-latency Spectral CNN Acceleration

  • paper_url: http://arxiv.org/abs/2310.10902
  • repo_url: None
  • paper_authors: Yue Niu, Rajgopal Kannan, Ajitesh Srivastava, Viktor Prasanna
  • for: Improving the computational efficiency and latency of convolutional neural network (CNN) inference by addressing the "kernel explosion" problem of spectral-domain CNNs.
  • methods: The work analyzes the bandwidth-storage trade-off of sparse convolutional layers to locate communication bottlenecks, develops a dataflow that flexibly optimizes data reuse across layers to minimize off-chip communication, and proposes a scheduling algorithm that optimally schedules on-chip memory access for multiple sparse kernels to minimize read conflicts.
  • results: On a state-of-the-art FPGA platform, the design reduces data transfers by 42% with DSP utilization up to 90%, and achieves an inference latency of 9 ms for VGG16, compared with a baseline state-of-the-art latency of 68 ms.
    Abstract Spectral-domain CNNs have been shown to be more efficient than traditional spatial CNNs in terms of reducing computation complexity. However they come with a `kernel explosion' problem that, even after compression (pruning), imposes a high memory burden and off-chip bandwidth requirement for kernel access. This creates a performance gap between the potential acceleration offered by compression and actual FPGA implementation performance, especially for low-latency CNN inference. In this paper, we develop a principled approach to overcoming this performance gap and designing a low-latency, low-bandwidth, spectral sparse CNN accelerator on FPGAs. First, we analyze the bandwidth-storage tradeoff of sparse convolutional layers and locate communication bottlenecks. We then develop a dataflow for flexibly optimizing data reuse in different layers to minimize off-chip communication. Finally, we propose a novel scheduling algorithm to optimally schedule the on-chip memory access of multiple sparse kernels and minimize read conflicts. On a state-of-the-art FPGA platform, our design reduces data transfers by 42\% with DSP utilization up to 90\% and achieves inference latency of 9 ms for VGG16, compared to the baseline state-of-the-art latency of 68 ms.
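For context on why spectral-domain CNNs trade multiplications for memory, the numpy sketch below performs one 2-D convolution in the frequency domain via the convolution theorem and checks it against direct convolution. Kernels stored per channel in this zero-padded frequency form are what inflate memory (the "kernel explosion"); the accelerator design itself is not reproduced here.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))      # input feature map
k = rng.standard_normal((3, 3))        # spatial kernel

# Zero-pad both to the full linear-convolution size, multiply in the frequency domain.
H, W = x.shape[0] + k.shape[0] - 1, x.shape[1] + k.shape[1] - 1
Y = np.fft.irfft2(np.fft.rfft2(x, (H, W)) * np.fft.rfft2(k, (H, W)), (H, W))

ref = convolve2d(x, k, mode="full")
print("max abs error vs direct convolution:", np.max(np.abs(Y - ref)))
```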

cs.SD - 2023-10-16

Generation or Replication: Auscultating Audio Latent Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.10604
  • repo_url: None
  • paper_authors: Dimitrios Bralios, Gordon Wichern, François G. Germain, Zexu Pan, Sameer Khurana, Chiori Hori, Jonathan Le Roux
  • for: Understanding how audio latent diffusion models, which can generate realistic sound clips on demand from a text description, relate to their training data.
  • methods: Using text-to-audio latent diffusion models trained on the AudioCaps dataset, the study systematically analyzes memorization behavior as a function of training set size and evaluates different retrieval metrics for evidence of training data memorization.
  • results: Similarity between mel spectrograms proves more robust for detecting matches than learned embedding vectors, and a large number of duplicated audio clips are discovered within the AudioCaps database.
    Abstract The introduction of audio latent diffusion models possessing the ability to generate realistic sound clips on demand from a text description has the potential to revolutionize how we work with audio. In this work, we make an initial attempt at understanding the inner workings of audio latent diffusion models by investigating how their audio outputs compare with the training data, similar to how a doctor auscultates a patient by listening to the sounds of their organs. Using text-to-audio latent diffusion models trained on the AudioCaps dataset, we systematically analyze memorization behavior as a function of training set size. We also evaluate different retrieval metrics for evidence of training data memorization, finding the similarity between mel spectrograms to be more robust in detecting matches than learned embedding vectors. In the process of analyzing memorization in audio latent diffusion models, we also discover a large amount of duplicated audio clips within the AudioCaps database.
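A minimal version of the mel-spectrogram retrieval metric described above can be computed with librosa: log-mel spectrograms of a generated clip and a training clip are flattened and compared with a normalized correlation. The synthetic signals and the exact normalization are placeholders; the paper's matching procedure may differ in detail.

```python
import numpy as np
import librosa

def log_mel(y, sr=16000, n_mels=64):
    m = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(m, ref=np.max)

def mel_similarity(y_a, y_b, sr=16000):
    a, b = log_mel(y_a, sr), log_mel(y_b, sr)
    t = min(a.shape[1], b.shape[1])               # crude length alignment
    a, b = a[:, :t].ravel(), b[:, :t].ravel()
    a, b = a - a.mean(), b - b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Stand-ins for a generated clip and training clips (use librosa.load on real files).
sr = 16000
t = np.arange(sr * 2) / sr
clip_gen = np.sin(2 * np.pi * 440 * t) + 0.05 * np.random.randn(t.size)
clip_dup = np.sin(2 * np.pi * 440 * t) + 0.05 * np.random.randn(t.size)  # near-duplicate
clip_new = np.sin(2 * np.pi * 220 * t)                                   # unrelated clip
print("match score:", mel_similarity(clip_gen, clip_dup, sr))
print("non-match score:", mel_similarity(clip_gen, clip_new, sr))
```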

BeatDance: A Beat-Based Model-Agnostic Contrastive Learning Framework for Music-Dance Retrieval

  • paper_url: http://arxiv.org/abs/2310.10300
  • repo_url: None
  • paper_authors: Kaixing Yang, Xukun Zhou, Xulong Tang, Ran Diao, Hongyan Liu, Jun He, Zhaoxin Fan
  • For: Improving dance-music retrieval performance by utilizing the alignment between music beats and dance movements.
  • Methods: BeatDance, a beat-based model-agnostic contrastive learning framework incorporating a Beat-Aware Music-Dance InfoExtractor, a Trans-Temporal Beat Blender, and a Beat-Enhanced Hubness Reducer; the Music-Dance (MD) dataset of over 10,000 music-dance video pairs is introduced for training and testing.
  • Results: Experiments on the MD dataset demonstrate the superiority of the proposed method over existing baselines, achieving state-of-the-art performance.
    Abstract Dance and music are closely related forms of expression, with mutual retrieval between dance videos and music being a fundamental task in various fields like education, art, and sports. However, existing methods often suffer from unnatural generation effects or fail to fully explore the correlation between music and dance. To overcome these challenges, we propose BeatDance, a novel beat-based model-agnostic contrastive learning framework. BeatDance incorporates a Beat-Aware Music-Dance InfoExtractor, a Trans-Temporal Beat Blender, and a Beat-Enhanced Hubness Reducer to improve dance-music retrieval performance by utilizing the alignment between music beats and dance movements. We also introduce the Music-Dance (MD) dataset, a large-scale collection of over 10,000 music-dance video pairs for training and testing. Experimental results on the MD dataset demonstrate the superiority of our method over existing baselines, achieving state-of-the-art performance. The code and dataset will be made public available upon acceptance.
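The contrastive objective at the heart of such music-dance retrieval frameworks is typically a symmetric InfoNCE loss over paired embeddings. The PyTorch sketch below shows that generic loss for a batch of music and dance feature vectors; the encoders, beat extraction, and hubness reduction specific to BeatDance are not reproduced.

```python
import torch
import torch.nn.functional as F

def info_nce(music_emb, dance_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired music/dance embeddings."""
    music = F.normalize(music_emb, dim=-1)
    dance = F.normalize(dance_emb, dim=-1)
    logits = music @ dance.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(music.size(0), device=music.device)
    loss_m2d = F.cross_entropy(logits, targets)       # retrieve dance from music
    loss_d2m = F.cross_entropy(logits.t(), targets)   # retrieve music from dance
    return 0.5 * (loss_m2d + loss_d2m)

# Example with random stand-ins for encoder outputs over a batch of 16 pairs.
loss = info_nce(torch.randn(16, 256), torch.randn(16, 256))
print(loss.item())
```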

Advancing Audio Emotion and Intent Recognition with Large Pre-Trained Models and Bayesian Inference

  • paper_url: http://arxiv.org/abs/2310.10179
  • repo_url: None
  • paper_authors: Dejan Porjazovski, Yaroslav Getman, Tamás Grósz, Mikko Kurimo
  • for: This paper is written for the ACM Multimedia Computational Paralinguistics Challenge, addressing the Requests and Emotion Share tasks.
  • methods: The paper employs large pre-trained models and explores audio-only and hybrid solutions leveraging audio and text modalities. The authors also introduce a Bayesian layer as an alternative to the standard linear output layer.
  • results: The empirical results consistently show the superiority of the hybrid approaches over the audio-only models, with the multimodal fusion approach achieving an 85.4% UAR on HC-Requests and 60.2% on HC-Complaints, and the ensemble model for the Emotion Share task yielding the best rho value of .614. Additionally, the Bayesian wav2vec2 approach allows for easily building ensembles with usable confidence values instead of overconfident posterior probabilities.
    Abstract Large pre-trained models are essential in paralinguistic systems, demonstrating effectiveness in tasks like emotion recognition and stuttering detection. In this paper, we employ large pre-trained models for the ACM Multimedia Computational Paralinguistics Challenge, addressing the Requests and Emotion Share tasks. We explore audio-only and hybrid solutions leveraging audio and text modalities. Our empirical results consistently show the superiority of the hybrid approaches over the audio-only models. Moreover, we introduce a Bayesian layer as an alternative to the standard linear output layer. The multimodal fusion approach achieves an 85.4% UAR on HC-Requests and 60.2% on HC-Complaints. The ensemble model for the Emotion Share task yields the best rho value of .614. The Bayesian wav2vec2 approach, explored in this study, allows us to easily build ensembles, at the cost of fine-tuning only one model. Moreover, we can have usable confidence values instead of the usual overconfident posterior probabilities.
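One common way to realize the Bayesian output layer described above is a mean-field variational linear layer whose weights are sampled at each forward pass, so repeated passes yield a predictive distribution and usable confidence values. The PyTorch sketch below shows that generic construction on top of pooled features; the paper's exact layer, prior, and training objective may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Mean-field Gaussian weights, sampled with the reparameterization trick."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(out_dim, in_dim))
        self.w_logvar = nn.Parameter(torch.full((out_dim, in_dim), -6.0))
        self.b = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x):
        std = torch.exp(0.5 * self.w_logvar)
        w = self.w_mu + std * torch.randn_like(std)   # sample weights on every call
        return F.linear(x, w, self.b)

# Replace a deterministic output head with the Bayesian one and average samples.
head = BayesianLinear(768, 4)                 # e.g., pooled wav2vec2 features -> 4 classes
feats = torch.randn(2, 768)
probs = torch.stack([F.softmax(head(feats), dim=-1) for _ in range(20)])
print("predictive mean:", probs.mean(0))
print("predictive std (confidence proxy):", probs.std(0))
```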

Real-time Speech Enhancement and Separation with a Unified Deep Neural Network for Single/Dual Talker Scenarios

  • paper_url: http://arxiv.org/abs/2310.10026
  • repo_url: None
  • paper_authors: Kashyap Patel, Anton Kovalyov, Issa Panahi
  • for: A practical approach that uses a real-time deep learning model to alternate between speech enhancement and joint speech enhancement and separation, depending on whether the input mixture contains one or two active speakers.
  • methods: The model is trained with the time-domain scale-invariant signal-to-distortion ratio (SI-SDR) to extract speech on both output channels regardless of whether the input is a single- or dual-talker mixture; a lightweight speaker overlap detection (SOD) module operates directly on the separated masks rather than on the original mixture, which simplifies the detection task.
  • results: Experimental results show that the proposed training approach outperforms existing solutions, and the SOD module exhibits high accuracy.
    Abstract This paper introduces a practical approach for leveraging a real-time deep learning model to alternate between speech enhancement and joint speech enhancement and separation depending on whether the input mixture contains one or two active speakers. Scale-invariant signal-to-distortion ratio (SI-SDR) has shown to be a highly effective training measure in time-domain speech separation. However, the SI-SDR metric is ill-defined for zero-energy target signals, which is a problem when training a speech separation model using utterances with varying numbers of talkers. Unlike existing solutions that focus on modifying the loss function to accommodate zero-energy target signals, the proposed approach circumvents this problem by training the model to extract speech on both its output channels regardless if the input is a single or dual-talker mixture. A lightweight speaker overlap detection (SOD) module is also introduced to differentiate between single and dual-talker segments in real-time. The proposed module takes advantage of the new formulation by operating directly on the separated masks, given by the separation model, instead of the original mixture, thus effectively simplifying the detection task. Experimental results show that the proposed training approach outperforms existing solutions, and the SOD module exhibits high accuracy.
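
Since the training measure discussed above is SI-SDR, here is a small NumPy sketch of the metric; the epsilon handling and function signature are illustrative, and the closing comment points at the zero-energy-target issue the proposed training scheme sidesteps.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-distortion ratio (in dB) between two 1-D signals."""
    # Project the estimate onto the target to remove any scaling mismatch.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    target_scaled = alpha * target
    noise = estimate - target_scaled
    return 10.0 * np.log10((np.sum(target_scaled**2) + eps) / (np.sum(noise**2) + eps))

# For an all-zero (silent) target the ratio is meaningless, which is why the paper
# trains the model to emit speech on both output channels instead of patching the loss.
```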

cs.CV - 2023-10-16

Filling the Holes on 3D Heritage Object Surface based on Automatic Segmentation Algorithm

  • paper_url: http://arxiv.org/abs/2310.10875
  • repo_url: None
  • paper_authors: Sinh Van Nguyen, Son Thanh Le, Minh Khai Tran, Le Thanh Sach
  • for: Proposes an improved method for filling holes on 3D object surfaces, raising the accuracy of 3D object reconstruction and processing in computer graphics, image processing, and computer vision.
  • methods: The hole is first determined and segmented automatically based on its local curvature, and each segment is then filled to match the local curvature shape, building on computational geometry techniques for curves, surfaces, and meshes.
  • results: Compared with existing methods, the proposed approach reconstructs 3D objects with higher accuracy and works on multiple 3D data types, including point clouds and triangular meshes.
    Abstract Reconstructing and processing the 3D objects are popular activities in the research field of computer graphics, image processing and computer vision. The 3D objects are processed based on the methods like geometric modeling, a branch of applied mathematics and computational geometry, or the machine learning algorithms based on image processing. The computation of geometrical objects includes processing the curves and surfaces, subdivision, simplification, meshing, holes filling, reconstructing, and refining the 3D surface objects on both point cloud data and triangular mesh. While the machine learning methods are developed using deep learning models. With the support of 3D laser scan devices and Lidar techniques, the obtained dataset is close to original shape of the real objects. Besides, the photography and its application based on the modern techniques in recent years help us collect data and process the 3D models more precise. This article proposes an improved method for filling holes on the 3D object surface based on an automatic segmentation. Instead of filling the hole directly as the existing methods, we now subdivide the hole before filling it. The hole is first determined and segmented automatically based on computation of its local curvature. It is then filled on each part of the hole to match its local curvature shape. The method can work on both 3D point cloud surfaces and triangular mesh surface. Comparing to the state of the art methods, our proposed method obtained higher accuracy of the reconstructed 3D objects.
    摘要 Computer graphics、图像处理和计算机视觉领域的研究中,重建和处理3D对象是非常流行的活动。这些3D对象通常通过几何模型化或基于图像处理的机器学习算法进行处理。计算几何对象包括处理曲线和表面、分割、简化、网格化、填充洞和改进3D表面对象的点云数据和三角形网格。而机器学习方法则是基于深度学习模型。通过3D激光扫描设备和激光技术获得的数据,我们可以更加准确地重建真实对象的形状。此外,现代技术的应用也有助于我们更加精准地收集数据和处理3D模型。本文提出了一种改进的洞填充方法,通过自动分割洞而不是直接填充洞像现有方法。首先,我们使用计算当地曲线的方法自动分割洞。然后,我们在每个洞部分填充其所对应的本地弯曲形状。这种方法可以在3D点云表面和三角形网格表面上进行应用。与现有方法相比,我们的提议方法能够获得更高的3D对象重建精度。
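
As a rough illustration of curvature-driven processing of point clouds, the sketch below computes a standard "surface variation" curvature proxy from the PCA of a local neighborhood; this generic measure is assumed for illustration and is not necessarily the paper's exact curvature computation.

```python
import numpy as np

def surface_variation(neighbors: np.ndarray) -> float:
    """Curvature proxy for a point-cloud patch: smallest covariance eigenvalue divided by
    the eigenvalue sum (about 0 on flat regions, larger near creases and hole rims)."""
    centered = neighbors - neighbors.mean(axis=0)          # neighbors: (k, 3)
    eigvals = np.linalg.eigvalsh(centered.T @ centered / len(neighbors))
    return float(eigvals[0] / eigvals.sum())
```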

Approximation properties of slice-matching operators

  • paper_url: http://arxiv.org/abs/2310.10869
  • repo_url: None
  • paper_authors: Shiying Li, Caroline Moosmueller
  • for: This paper is written for the purpose of exploring approximation properties of iterative slice-matching procedures for transferring a source measure to a target measure, particularly in high dimensions.
  • methods: The paper uses slice-matching operators, which depend on the source and target measures and slicing directions, to examine the approximation properties of iterative slice-matching schemes.
  • results: The paper demonstrates invariance and equivariance properties of the slice-matching operator with respect to the source and target measures, respectively, and establishes error bounds for approximating the target measure using one step of the slice-matching scheme. Additionally, the paper investigates connections to affine registration problems and extensions to the invariance and equivariance properties of the slice-matching operator.
    Abstract Iterative slice-matching procedures are efficient schemes for transferring a source measure to a target measure, especially in high dimensions. These schemes have been successfully used in applications such as color transfer and shape retrieval, and are guaranteed to converge under regularity assumptions. In this paper, we explore approximation properties related to a single step of such iterative schemes by examining an associated slice-matching operator, depending on a source measure, a target measure, and slicing directions. In particular, we demonstrate an invariance property with respect to the source measure, an equivariance property with respect to the target measure, and Lipschitz continuity concerning the slicing directions. We furthermore establish error bounds corresponding to approximating the target measure by one step of the slice-matching scheme and characterize situations in which the slice-matching operator recovers the optimal transport map between two measures. We also investigate connections to affine registration problems with respect to (sliced) Wasserstein distances. These connections can be also be viewed as extensions to the invariance and equivariance properties of the slice-matching operator and illustrate the extent to which slice-matching schemes incorporate affine effects.
    摘要 iterative slice-matching 算法是高维中高效的源度量至目标度量转移方案,尤其在应用中如颜色传输和形状检索中得到了成功。这些算法在Regularity assumptions下是确定的收敛的。在这篇论文中,我们研究了单步iterative slice-matching算法的approximation Properties,包括一个相关的slice-matching运算符,它取决于源度量、目标度量和切割方向。我们证明了对源度量的不变性、对目标度量的对称性和切割方向的 lipschitz连续性。我们还确定了一个基于单步slice-matching算法的目标度量的错误 bound,并 characterize了在 slice-matching算法中可以重建两个度量之间的优化运输图的情况。此外,我们还 investigate了基于水星距离的 affine registration problem 的连接,这些连接可以被视为 slice-matching算法中的 affine 效应的扩展。
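
For intuition, here is a minimal NumPy sketch of a single slice-matching step that moves source samples toward target samples by 1-D quantile matching along random directions; equal sample sizes and simple averaging over directions are simplifying assumptions, not the operator analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def slice_matching_step(source: np.ndarray, target: np.ndarray, n_dirs: int = 50):
    """One sliced-transfer step: project both point sets (n_points, dim) onto random
    directions, solve 1-D optimal transport by sorting, and average the displacements."""
    update = np.zeros_like(source)
    for _ in range(n_dirs):
        theta = rng.standard_normal(source.shape[1])
        theta /= np.linalg.norm(theta)
        s_proj, t_proj = source @ theta, target @ theta
        s_order = np.argsort(s_proj)
        displacement = np.empty_like(s_proj)
        displacement[s_order] = np.sort(t_proj) - s_proj[s_order]  # quantile matching
        update += np.outer(displacement, theta)
    return source + update / n_dirs
```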

The Invisible Map: Visual-Inertial SLAM with Fiducial Markers for Smartphone-based Indoor Navigation

  • paper_url: http://arxiv.org/abs/2310.10862
  • repo_url: None
  • paper_authors: Paul Ruvolo, Ayush Chakraborty, Rucha Dave, Richard Li, Duncan Mazza, Xierui Shen, Raiyan Siddique, Krishna Suresh
  • for: Creating building-scale, easily navigable 3D maps using mainstream smartphones.
  • methods: Formulates 3D mapping as a Graph SLAM problem and jointly infers the positions of building landmarks (fiducial markers) and navigable paths through the environment (phone poses).
  • results: The system creates accurate 3D maps; the authors also highlight the importance of carefully selected mapping hyperparameters and provide a novel technique for tuning them to adapt the algorithm to new environments.
    Abstract We present a system for creating building-scale, easily navigable 3D maps using mainstream smartphones. In our approach, we formulate the 3D-mapping problem as an instance of Graph SLAM and infer the position of both building landmarks (fiducial markers) and navigable paths through the environment (phone poses). Our results demonstrate the system's ability to create accurate 3D maps. Further, we highlight the importance of careful selection of mapping hyperparameters and provide a novel technique for tuning these hyperparameters to adapt our algorithm to new environments.
    摘要 我们提出了一种基于主流智能手机的建筑尺度级可探索3D地图创建系统。我们将3D地图问题定义为Instance of Graph SLAM,并通过约束建筑标记( fiducial markers)和环境中可行路径(手机姿态)来计算位置。我们的结果表明系统可以创建准确的3D地图。此外,我们强调了选择映射超参数的重要性,并提供了一种新的参数调整技术,以适应新环境。

SoybeanNet: Transformer-Based Convolutional Neural Network for Soybean Pod Counting from Unmanned Aerial Vehicle (UAV) Images

  • paper_url: http://arxiv.org/abs/2310.10861
  • repo_url: https://github.com/jiajiali04/soybean-pod-counting-from-uav-images
  • paper_authors: Jiajia Li, Raju Thada Magar, Dong Chen, Feng Lin, Dechun Wang, Xiang Yin, Weichao Zhuang, Zhaojian Li
  • for: Improving soybean production by counting soybean pods from unmanned aerial vehicle (UAV) images.
  • methods: Introduces SoybeanNet, a novel point-based counting network with a powerful transformer backbone that performs soybean pod counting and localization simultaneously.
  • results: Tested on real UAV images and compared against five state-of-the-art methods, SoybeanNet achieves a counting accuracy of 84.51%.
    Abstract Soybeans are a critical source of food, protein and oil, and thus have received extensive research aimed at enhancing their yield, refining cultivation practices, and advancing soybean breeding techniques. Within this context, soybean pod counting plays an essential role in understanding and optimizing production. Despite recent advancements, the development of a robust pod-counting algorithm capable of performing effectively in real-field conditions remains a significant challenge This paper presents a pioneering work of accurate soybean pod counting utilizing unmanned aerial vehicle (UAV) images captured from actual soybean fields in Michigan, USA. Specifically, this paper presents SoybeanNet, a novel point-based counting network that harnesses powerful transformer backbones for simultaneous soybean pod counting and localization with high accuracy. In addition, a new dataset of UAV-acquired images for soybean pod counting was created and open-sourced, consisting of 113 drone images with more than 260k manually annotated soybean pods captured under natural lighting conditions. Through comprehensive evaluations, SoybeanNet demonstrated superior performance over five state-of-the-art approaches when tested on the collected images. Remarkably, SoybeanNet achieved a counting accuracy of $84.51\%$ when tested on the testing dataset, attesting to its efficacy in real-world scenarios. The publication also provides both the source code (\url{https://github.com/JiajiaLi04/Soybean-Pod-Counting-from-UAV-Images}) and the labeled soybean dataset (\url{https://www.kaggle.com/datasets/jiajiali/uav-based-soybean-pod-images}), offering a valuable resource for future research endeavors in soybean pod counting and related fields.
    摘要 soybeans是一种重要的食品、蛋白和油源,因此它们在提高产量、改善栽培方法和进步杂交技术方面 receiving extensive research。在这个 контексте中,豇豆果 counting plays an essential role in understanding and optimizing production. Despite recent advancements, the development of a robust pod-counting algorithm capable of performing effectively in real-field conditions remains a significant challenge.这篇文章提出了一项突破性的豇豆果 counting方法,使用了来自美国密歇根州actual soybean fields的无人机图像。Specifically, this paper presents SoybeanNet, a novel point-based counting network that harnesses powerful transformer backbones for simultaneous soybean pod counting and localization with high accuracy. In addition, a new dataset of UAV-acquired images for soybean pod counting was created and open-sourced, consisting of 113 drone images with more than 260k manually annotated soybean pods captured under natural lighting conditions. Through comprehensive evaluations, SoybeanNet demonstrated superior performance over five state-of-the-art approaches when tested on the collected images. Remarkably, SoybeanNet achieved a counting accuracy of 84.51% when tested on the testing dataset, attesting to its efficacy in real-world scenarios. The publication also provides both the source code (https://github.com/JiajiaLi04/Soybean-Pod-Counting-from-UAV-Images) and the labeled soybean dataset (https://www.kaggle.com/datasets/jiajiali/uav-based-soybean-pod-images), offering a valuable resource for future research endeavors in soybean pod counting and related fields.

Provable Probabilistic Imaging using Score-Based Generative Priors

  • paper_url: http://arxiv.org/abs/2310.10835
  • repo_url: None
  • paper_authors: Yu Sun, Zihui Wu, Yifan Chen, Berthy T. Feng, Katherine L. Bouman
  • for: Proposes a principled framework for ill-posed inverse problems that reconstructs high-quality images while also quantifying their uncertainty.
  • methods: Introduces plug-and-play Monte Carlo (PMC), which combines expressive score-based generative priors for high-quality reconstruction with posterior sampling for uncertainty quantification; two PMC algorithms are presented as the sampling analogues of the traditional plug-and-play priors (PnP) and regularization-by-denoising (RED) algorithms.
  • results: Experiments on several representative inverse problems show that the PMC algorithms significantly improve reconstruction quality and enable high-fidelity uncertainty quantification.
    Abstract Estimating high-quality images while also quantifying their uncertainty are two desired features in an image reconstruction algorithm for solving ill-posed inverse problems. In this paper, we propose plug-and-play Monte Carlo (PMC) as a principled framework for characterizing the space of possible solutions to a general inverse problem. PMC is able to incorporate expressive score-based generative priors for high-quality image reconstruction while also performing uncertainty quantification via posterior sampling. In particular, we introduce two PMC algorithms which can be viewed as the sampling analogues of the traditional plug-and-play priors (PnP) and regularization by denoising (RED) algorithms. We also establish a theoretical analysis for characterizing the convergence of the PMC algorithms. Our analysis provides non-asymptotic stationarity guarantees for both algorithms, even in the presence of non-log-concave likelihoods and imperfect score networks. We demonstrate the performance of the PMC algorithms on multiple representative inverse problems with both linear and nonlinear forward models. Experimental results show that PMC significantly improves reconstruction quality and enables high-fidelity uncertainty quantification.
    摘要 在求解不适定的 inverse problem 时,既能估计高质量图像又能量化其不确定性,是图像重建算法的两个期望特性。在这篇论文中,我们提出了插入式 Monte Carlo (PMC) 作为一种原理性的框架,用于描述一般 inverse problem 可能的解空间。PMC 能够结合表达力强的 score-based 生成先验,以实现高质量图像重建,并通过后验采样进行不确定性量化。具体来说,我们介绍了两种 PMC 算法,可以视为传统插入式先验 (PnP) 和去噪正则化 (RED) 算法的采样版本。我们还进行了理论分析,刻画 PMC 算法的收敛性;即使 likelihood 不是对数凹的且 score 网络不完美,我们的分析仍能给出非渐近的平稳性保证。我们在多个代表性的 inverse problem 上进行了实验,结果表明 PMC 可以显著提高重建质量并实现高精度的不确定性量化。
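
To illustrate the kind of posterior sampling that plug-and-play Monte Carlo builds on, below is a generic (unadjusted) Langevin update combining a data-fidelity gradient with a learned score of the prior; the callables, noise model, and step-size handling are assumptions for illustration, not the paper's PMC algorithms.

```python
import numpy as np

def langevin_posterior_step(x, y, forward_op, forward_op_T, score_fn, step, sigma_noise):
    """One unadjusted Langevin step targeting p(x | y) with a Gaussian likelihood:
    x <- x + step * (grad log p(y|x) + score(x)) + sqrt(2*step) * noise."""
    residual = forward_op(x) - y
    grad_loglik = -forward_op_T(residual) / sigma_noise**2   # gradient of log-likelihood
    grad_logprior = score_fn(x)                              # score network output
    noise = np.sqrt(2.0 * step) * np.random.standard_normal(x.shape)
    return x + step * (grad_loglik + grad_logprior) + noise
```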

Vision and Language Navigation in the Real World via Online Visual Language Mapping

  • paper_url: http://arxiv.org/abs/2310.10822
  • repo_url: None
  • paper_authors: Chengguang Xu, Hieu T. Nguyen, Christopher Amato, Lawson L. S. Wong
  • for: Enabling mobile robots to follow natural-language instructions and navigate efficiently in unseen real-world environments.
  • methods: Proposes a navigation framework with four key components: an LLM-based instruction parser, an online visual-language mapper, a language-indexing-based localizer, and a DD-PPO-based local controller.
  • results: Evaluated in an unseen lab environment without any fine-tuning, the pipeline significantly outperforms the state-of-the-art VLN baseline.
    Abstract Navigating in unseen environments is crucial for mobile robots. Enhancing them with the ability to follow instructions in natural language will further improve navigation efficiency in unseen cases. However, state-of-the-art (SOTA) vision-and-language navigation (VLN) methods are mainly evaluated in simulation, neglecting the complex and noisy real world. Directly transferring SOTA navigation policies trained in simulation to the real world is challenging due to the visual domain gap and the absence of prior knowledge about unseen environments. In this work, we propose a novel navigation framework to address the VLN task in the real world. Utilizing the powerful foundation models, the proposed framework includes four key components: (1) an LLMs-based instruction parser that converts the language instruction into a sequence of pre-defined macro-action descriptions, (2) an online visual-language mapper that builds a real-time visual-language map to maintain a spatial and semantic understanding of the unseen environment, (3) a language indexing-based localizer that grounds each macro-action description into a waypoint location on the map, and (4) a DD-PPO-based local controller that predicts the action. We evaluate the proposed pipeline on an Interbotix LoCoBot WX250 in an unseen lab environment. Without any fine-tuning, our pipeline significantly outperforms the SOTA VLN baseline in the real world.
    摘要 naviigating 无法看到环境是机器人 navigation 的关键。增强机器人能够遵循自然语言指令,将会进一步提高无法看到环境中的导航效率。然而,当前的VLN方法(state-of-the-art)主要在模拟环境中进行评估,忽略了实际世界的复杂和噪音。直接将模拟环境中训练的VLN策略传输到实际世界是困难的,因为视觉领域之间的差异和未知环境中的先验知识缺乏。在这种情况下,我们提出了一种新的导航框架,用于解决VLN任务在实际世界中。该框架包括四个关键组件:1. LLMs基础模型based instruction parser,将自然语言指令转换为一系列预定的macro-action描述。2. 在线视觉语言映射,实时建立视觉语言地图,以维护未知环境的空间和Semantic理解。3. 基于语言索引的本地化器,将每个macro-action描述映射到地图上的坐标位置。4. DD-PPO基于本地控制器,预测动作。我们在一个未知的实际环境中使用Interbotix LoCoBot WX250进行评估,无需精细调整,我们的管道在实际世界中显著超越了当前VLN基线。

Convolutional Neural Network Model for Diabetic Retinopathy Feature Extraction and Classification

  • paper_url: http://arxiv.org/abs/2310.10806
  • repo_url: https://github.com/s21sharan/cnn_dr_detection
  • paper_authors: Sharan Subramanian, Leilani H. Gilpin
  • for: Applying artificial intelligence in medicine for timely diagnosis of silently progressing diseases such as diabetic retinopathy (DR).
  • methods: A convolutional neural network (CNN) model that takes color fundus images as input and identifies four known DR features: micro-aneurysms, cotton wools, exudates, and hemorrhages.
  • results: The model achieves a sensitivity of 97% and an accuracy of 71%, while remaining interpretable and robust to overfitting.
    Abstract The application of Artificial Intelligence in the medical market brings up increasing concerns but aids in more timely diagnosis of silent progressing diseases like Diabetic Retinopathy. In order to diagnose Diabetic Retinopathy (DR), ophthalmologists use color fundus images, or pictures of the back of the retina, to identify small distinct features through a difficult and time-consuming process. Our work creates a novel CNN model and identifies the severity of DR through fundus image input. We classified 4 known DR features, including micro-aneurysms, cotton wools, exudates, and hemorrhages, through convolutional layers and were able to provide an accurate diagnostic without additional user input. The proposed model is more interpretable and robust to overfitting. We present initial results with a sensitivity of 97% and an accuracy of 71%. Our contribution is an interpretable model with similar accuracy to more complex models. With that, our model advances the field of DR detection and proves to be a key step towards AI-focused medical diagnosis.
    摘要 这个应用人工智能在医疗市场中带来的应用对于不明显进行诊断的疾病,如糖尿病肉眼病(DR),具有增长的担忧。为了诊断DR,医生会使用彩色背部影像(color fundus images),或者是背部 Retina 的照片,以便识别小型的明显特征。我们的工作创造了一个新的 convolutional neural network(CNN)模型,可以通过背部影像的输入来诊断DR的严重程度。我们分类了4种已知的DR特征,包括微型血管、绒毛、渗透物和出血,透过 convolutional layers 进行分类,并能够提供精确的诊断,不需要额外的使用者输入。我们的模型更加可读性和避免过拟合。我们给出了初步的结果,敏感性为97%,准确率为71%。我们的贡献是一个可读性好的模型,与更复杂的模型相比,具有相似的准确性。这个模型对于DR检测具有重要的进步,并且是人工智能在医疗诊断中的一个关键步骤。

LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation

  • paper_url: http://arxiv.org/abs/2310.10769
  • repo_url: https://github.com/RQ-Wu/LAMP
  • paper_authors: Ruiqi Wu, Liangyu Chen, Tong Yang, Chunle Guo, Chongyi Li, Xiangyu Zhang
  • for: LAMP is designed to learn motion patterns in text-to-video generation, with a focus on few-shot learning and efficient use of resources.
  • methods: The LAMP framework uses a first-frame-conditioned pipeline with an off-the-shelf text-to-image model for content generation, and expands the pretrained 2D convolution layers to temporal-spatial motion learning layers. Shared-noise sampling is used to improve stability and flexibility.
  • results: Extensive experiments show that LAMP can effectively learn motion patterns on limited data and generate high-quality videos, with applications to real-world image animation and video editing. The code and models are available online.
    Abstract With the impressive progress in diffusion-based text-to-image generation, extending such powerful generative ability to text-to-video raises enormous attention. Existing methods either require large-scale text-video pairs and a large number of training resources or learn motions that are precisely aligned with template videos. It is non-trivial to balance a trade-off between the degree of generation freedom and the resource costs for video generation. In our study, we present a few-shot-based tuning framework, LAMP, which enables text-to-image diffusion model Learn A specific Motion Pattern with 8~16 videos on a single GPU. Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation so that our tuned video diffusion model mainly focuses on motion learning. The well-developed text-to-image techniques can provide visually pleasing and diverse content as generation conditions, which highly improves video quality and generation freedom. To capture the features of temporal dimension, we expand the pretrained 2D convolution layers of the T2I model to our novel temporal-spatial motion learning layers and modify the attention blocks to the temporal level. Additionally, we develop an effective inference trick, shared-noise sampling, which can improve the stability of videos with computational costs. Our method can also be flexibly applied to other tasks, e.g. real-world image animation and video editing. Extensive experiments demonstrate that LAMP can effectively learn the motion pattern on limited data and generate high-quality videos. The code and models are available at https://rq-wu.github.io/projects/LAMP.
    摘要 随着文本到图像生成技术的快速进步,将这种强大的生成能力扩展到文本到视频生成引发了巨大的关注。现有方法要么需要大量的文本-视频对和大量的训练资源,要么只能学习与模板视频精确对齐的运动,很难在生成自由度和资源成本之间取得平衡。在我们的研究中,我们提出了一个基于少样本(few-shot)微调的框架,名为 LAMP,可以在单个 GPU 上仅用 8~16 个视频进行调整。具体来说,我们设计了一个以首帧为条件的管道,使用现成的文本到图像模型来生成内容,使我们调整的视频扩散模型主要专注于运动学习。成熟的文本到图像技术可以提供视觉上令人满意且多样化的生成条件,从而大幅提高视频质量和生成自由度。为了捕捉时间维度的特征,我们将预训练的 2D 卷积层扩展为新的时空运动学习层,并将注意力块修改到时间层面。此外,我们开发了一种有效的推理技巧——共享噪声采样,可以在有限计算开销下提高视频的稳定性。我们的方法还可以灵活应用于其他任务,例如真实世界图像动画和视频编辑。大量实验证明 LAMP 可以在有限数据上有效地学习运动模式,并生成高质量的视频。代码和模型可以在 https://rq-wu.github.io/projects/LAMP 获取。

Deep Conditional Shape Models for 3D cardiac image segmentation

  • paper_url: http://arxiv.org/abs/2310.10756
  • repo_url: None
  • paper_authors: Athira J Jacob, Puneet Sharma, Daniel Ruckert
  • for: 医疗图像分析的第一步是准确地定义器官结构。
  • methods: 我们引入了一种新的分割算法,使用深度条件形状模型(DCSM)作为核心组件。该算法使用深度隐式形状表示,学习任何有兴趣的生物结构的模态无关形状模型,并通过自动检测或用户输入的特征点来适应图像。最后,我们添加了一个模态依赖的轻量级细节修正网络,以捕捉图像中没有表示的细节。
  • results: 我们在各种3D成像Modalities(对比增强CT、非对比CT、3D电子心征图像)中进行心脏左心室(LV)分割,并证明自动DCSM在非对比CT中超过基准,并且在对比CT和3DE中使用细节修正网络时,特别是在 Hausdorff 距离方面获得了显著改进。半自动DCSM使用用户输入的特征点,只在对比CT上训练,可达92%的 dice,对所有Modalities具有Equivalent或更好的性能。
    Abstract Delineation of anatomical structures is often the first step of many medical image analysis workflows. While convolutional neural networks achieve high performance, these do not incorporate anatomical shape information. We introduce a novel segmentation algorithm that uses Deep Conditional Shape models (DCSMs) as a core component. Using deep implicit shape representations, the algorithm learns a modality-agnostic shape model that can generate the signed distance functions for any anatomy of interest. To fit the generated shape to the image, the shape model is conditioned on anatomic landmarks that can be automatically detected or provided by the user. Finally, we add a modality-dependent, lightweight refinement network to capture any fine details not represented by the implicit function. The proposed DCSM framework is evaluated on the problem of cardiac left ventricle (LV) segmentation from multiple 3D modalities (contrast-enhanced CT, non-contrasted CT, 3D echocardiography-3DE). We demonstrate that the automatic DCSM outperforms the baseline for non-contrasted CT without the local refinement, and with the refinement for contrasted CT and 3DE, especially with significant improvement in the Hausdorff distance. The semi-automatic DCSM with user-input landmarks, while only trained on contrasted CT, achieves greater than 92% Dice for all modalities. Both automatic DCSM with refinement and semi-automatic DCSM achieve equivalent or better performance compared to inter-user variability for these modalities.
    摘要 医学图像分析工作流程中的解剖结构定义是第一步。卷积神经网络可以达到高性能,但不会包含解剖形态信息。我们介绍了一种新的分割算法,使用深度条件形状模型(DCSM)作为核心组件。使用深度隐式形状表示,算法学习了任意解剖结构的模态独立形状模型,可以生成任意解剖结构的签名距离函数。为了使Shape模型适应图像,Shape模型被conditioned on可自动探测或提供的解剖标志。最后,我们添加了一个模态依赖的轻量级细节修正网络,以捕捉无法由隐式函数表示的细节。我们提出的DCSM框架在cardiac left ventricle(LV)三维模态(对比增强CT、非对比CT和3DE)的分割问题上进行了评估。我们表明,自动DCSM比基线高效,而且与使用本地细节修正有显著改善。用户输入标志的半自动DCSM,即使只在对比CT上训练,可以达到92%的Dice指标或更高。自动DCSM和半自动DCSM都与人类间变化相当或更好,对这些模态来说。

IDRNet: Intervention-Driven Relation Network for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.10755
  • repo_url: https://github.com/segmentationblwx/sssegmentation
  • paper_authors: Zhenchao Jin, Xiaowei Hu, Lingting Zhu, Luchuan Song, Li Yuan, Lequan Yu
  • for: Improving contextual information aggregation for dense prediction tasks such as semantic segmentation.
  • methods: Uses a deletion diagnostics procedure to model contextual relations among different pixels, together with a feature enhancement module that improves the distinguishability of the grouped semantic-level representations.
  • results: Brings consistent performance improvements to state-of-the-art segmentation frameworks and achieves competitive results on popular benchmarks, including ADE20K, COCO-Stuff, PASCAL-Context, LIP, and Cityscapes.
    Abstract Co-occurrent visual patterns suggest that pixel relation modeling facilitates dense prediction tasks, which inspires the development of numerous context modeling paradigms, \emph{e.g.}, multi-scale-driven and similarity-driven context schemes. Despite the impressive results, these existing paradigms often suffer from inadequate or ineffective contextual information aggregation due to reliance on large amounts of predetermined priors. To alleviate the issues, we propose a novel \textbf{I}ntervention-\textbf{D}riven \textbf{R}elation \textbf{Net}work (\textbf{IDRNet}), which leverages a deletion diagnostics procedure to guide the modeling of contextual relations among different pixels. Specifically, we first group pixel-level representations into semantic-level representations with the guidance of pseudo labels and further improve the distinguishability of the grouped representations with a feature enhancement module. Next, a deletion diagnostics procedure is conducted to model relations of these semantic-level representations via perceiving the network outputs and the extracted relations are utilized to guide the semantic-level representations to interact with each other. Finally, the interacted representations are utilized to augment original pixel-level representations for final predictions. Extensive experiments are conducted to validate the effectiveness of IDRNet quantitatively and qualitatively. Notably, our intervention-driven context scheme brings consistent performance improvements to state-of-the-art segmentation frameworks and achieves competitive results on popular benchmark datasets, including ADE20K, COCO-Stuff, PASCAL-Context, LIP, and Cityscapes. Code is available at \url{https://github.com/SegmentationBLWX/sssegmentation}.
    摘要 伴生视觉模式表明,像素关系模型化可以促进紧凑预测任务,这引发了许多上下文模型范文的发展,如多scale-driven和相似性-driven上下文方案。 despite the impressive results, these existing paradigms often suffer from inadequate or ineffective contextual information aggregation due to reliance on large amounts of predetermined priors. To address these issues, we propose a novel \textbf{I}ntervention-\textbf{D}riven \textbf{R}elation \textbf{Net}work (\textbf{IDRNet}), which leverages a deletion diagnostics procedure to guide the modeling of contextual relations among different pixels. Specifically, we first group pixel-level representations into semantic-level representations with the guidance of pseudo labels and further improve the distinguishability of the grouped representations with a feature enhancement module. Next, a deletion diagnostics procedure is conducted to model relations of these semantic-level representations via perceiving the network outputs and the extracted relations are utilized to guide the semantic-level representations to interact with each other. Finally, the interacted representations are utilized to augment original pixel-level representations for final predictions. Extensive experiments are conducted to validate the effectiveness of IDRNet quantitatively and qualitatively. Notably, our intervention-driven context scheme brings consistent performance improvements to state-of-the-art segmentation frameworks and achieves competitive results on popular benchmark datasets, including ADE20K, COCO-Stuff, PASCAL-Context, LIP, and Cityscapes. Code is available at \url{https://github.com/SegmentationBLWX/sssegmentation}.

HairCLIPv2: Unifying Hair Editing via Proxy Feature Blending

  • paper_url: http://arxiv.org/abs/2310.10651
  • repo_url: https://github.com/wty-ustc/hairclipv2
  • paper_authors: Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Weiming Zhang, Gang Hua, Nenghai Yu
  • for: A unified framework for hair editing driven by text descriptions or reference images, supporting multiple interaction modes including text, reference images, sketches, and masks.
  • methods: Converts all hair editing tasks into hair transfer tasks, turning the editing conditions into different proxies; the editing effects are added to the input image by blending the corresponding proxy features within the hairstyle or hair color feature spaces.
  • results: Compared with the original HairCLIP, HairCLIPv2 better preserves irrelevant attributes (e.g., identity, background) while supporting unseen text descriptions and additional interaction modes; quantitative and qualitative experiments show clear advantages in editing effects, irrelevant attribute preservation, and visual naturalness.
    Abstract Hair editing has made tremendous progress in recent years. Early hair editing methods use well-drawn sketches or masks to specify the editing conditions. Even though they can enable very fine-grained local control, such interaction modes are inefficient for the editing conditions that can be easily specified by language descriptions or reference images. Thanks to the recent breakthrough of cross-modal models (e.g., CLIP), HairCLIP is the first work that enables hair editing based on text descriptions or reference images. However, such text-driven and reference-driven interaction modes make HairCLIP unable to support fine-grained controls specified by sketch or mask. In this paper, we propose HairCLIPv2, aiming to support all the aforementioned interactions with one unified framework. Simultaneously, it improves upon HairCLIP with better irrelevant attributes (e.g., identity, background) preservation and unseen text descriptions support. The key idea is to convert all the hair editing tasks into hair transfer tasks, with editing conditions converted into different proxies accordingly. The editing effects are added upon the input image by blending the corresponding proxy features within the hairstyle or hair color feature spaces. Besides the unprecedented user interaction mode support, quantitative and qualitative experiments demonstrate the superiority of HairCLIPv2 in terms of editing effects, irrelevant attribute preservation and visual naturalness. Our code is available at \url{https://github.com/wty-ustc/HairCLIPv2}.
    摘要 随笔修整技术在最近几年来取得了巨大的进步。早期的修整方法通常使用细致绘制的素描或面Mask来指定修整条件。尽管它们可以实现非常细致的本地控制,但是这些交互方式在基于语言描述或参考图像的修整条件上是不效率的。感谢最近的交互模型技术(如CLIP)的突破,我们的HairCLIP是首个可以基于语言描述或参考图像进行随笔修整的工作。然而,这些基于文本描述或参考图像的交互方式使得HairCLIP无法支持细致的素描或面Mask来指定修整条件。在这篇论文中,我们提出了HairCLIPv2,旨在支持所有的交互方式,同时也提高了不相关特征(如人脸和背景)的保留和未看到文本描述的支持。我们的关键思想是将所有的随笔修整任务转换为随笔传输任务,并将编辑条件转换为不同的代理 accordingly。然后,在输入图像上添加修整效果,通过在额发型或发色特征空间中混合相应的代理特征。除了不同的用户交互方式支持外,我们的HairCLIPv2还在编辑效果、不相关特征保留和视觉自然性方面具有显著优势。我们的代码可以在GitHub上找到:

TraM-NeRF: Tracing Mirror and Near-Perfect Specular Reflections through Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2310.10650
  • repo_url: https://github.com/Rubikalubi/TraM-NeRF
  • paper_authors: Leif Van Holland, Ruben Bliersbach, Jan U. Müller, Patrick Stotko, Reinhard Klein
  • for: Photorealistic rendering of complex scenes containing mirrors and other near-perfectly specular reflecting objects.
  • methods: Explicitly models reflection behavior with physically plausible materials and estimates the reflected radiance with Monte-Carlo methods inside the volume rendering formulation, yielding efficient importance sampling and transmittance computation from only a few samples.
  • results: Enables training of consistent representations of these challenging scenes and achieves superior results compared with previous state-of-the-art approaches.
    Abstract Implicit representations like Neural Radiance Fields (NeRF) showed impressive results for photorealistic rendering of complex scenes with fine details. However, ideal or near-perfectly specular reflecting objects such as mirrors, which are often encountered in various indoor scenes, impose ambiguities and inconsistencies in the representation of the reconstructed scene leading to severe artifacts in the synthesized renderings. In this paper, we present a novel reflection tracing method tailored for the involved volume rendering within NeRF that takes these mirror-like objects into account while avoiding the cost of straightforward but expensive extensions through standard path tracing. By explicitly modeling the reflection behavior using physically plausible materials and estimating the reflected radiance with Monte-Carlo methods within the volume rendering formulation, we derive efficient strategies for importance sampling and the transmittance computation along rays from only few samples. We show that our novel method enables the training of consistent representations of such challenging scenes and achieves superior results in comparison to previous state-of-the-art approaches.
    摘要 诸如神经辐射场(NeRF)这类隐式表示在渲染细节丰富的复杂场景方面取得了令人印象深刻的效果。然而,室内场景中常见的理想或近乎完美的镜面反射物体(如镜子)会在重建场景的表示中造成歧义和不一致,导致合成渲染出现严重的伪影。在本文中,我们提出了一种专为 NeRF 体渲染设计的反射追踪方法,在考虑这类镜面物体的同时,避免了直接套用标准路径追踪所带来的高昂开销。通过使用物理上合理的材质模型显式地刻画反射行为,并在体渲染公式中用蒙特卡洛方法估计反射辐射,我们推导出了仅需少量采样即可完成重要性采样与透射率计算的高效策略。实验表明,我们的新方法能够为这些具有挑战性的场景训练出一致的表示,并取得优于以往最先进方法的结果。
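
The specular bounce at the core of mirror handling is the classical reflection formula; a one-function NumPy sketch (unit vectors assumed):

```python
import numpy as np

def reflect(direction: np.ndarray, normal: np.ndarray) -> np.ndarray:
    """Mirror a ray direction about a surface normal: r = d - 2 (d . n) n,
    the specular reflection that gets traced through the radiance field."""
    return direction - 2.0 * np.dot(direction, normal) * normal
```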

TOSS:High-quality Text-guided Novel View Synthesis from a Single Image

  • paper_url: http://arxiv.org/abs/2310.10644
  • repo_url: None
  • paper_authors: Yukai Shi, Jianan Wang, He Cao, Boshi Tang, Xianbiao Qi, Tianyu Yang, Yukun Huang, Shilong Liu, Lei Zhang, Heung-Yeung Shum
  • for: Proposes a text-guided novel view synthesis (NVS) method that addresses the under-constrained nature of single-view NVS.
  • methods: Uses text as high-level semantic information to constrain the NVS solution space, fine-tuning a text-to-image Stable Diffusion model pre-trained on large-scale text-image pairs, with modules tailored to image and camera pose conditioning and dedicated training for pose correctness and preservation of fine details.
  • results: Compared with Zero-1-to-3, the proposed TOSS yields more plausible, controllable, and multiview-consistent NVS results, with comprehensive ablations confirming the effectiveness of the introduced semantic guidance and architecture design.
    Abstract In this paper, we present TOSS, which introduces text to the task of novel view synthesis (NVS) from just a single RGB image. While Zero-1-to-3 has demonstrated impressive zero-shot open-set NVS capability, it treats NVS as a pure image-to-image translation problem. This approach suffers from the challengingly under-constrained nature of single-view NVS: the process lacks means of explicit user control and often results in implausible NVS generations. To address this limitation, TOSS uses text as high-level semantic information to constrain the NVS solution space. TOSS fine-tunes text-to-image Stable Diffusion pre-trained on large-scale text-image pairs and introduces modules specifically tailored to image and camera pose conditioning, as well as dedicated training for pose correctness and preservation of fine details. Comprehensive experiments are conducted with results showing that our proposed TOSS outperforms Zero-1-to-3 with more plausible, controllable and multiview-consistent NVS results. We further support these results with comprehensive ablations that underscore the effectiveness and potential of the introduced semantic guidance and architecture design.
    摘要 在这篇论文中,我们提出了TOSS,它将文本引入到novel view synthesis(NVS)任务中,只需基于单个RGB图像。 zero-1-to-3 已经表现出了非常出色的零基础开放式 NVS 能力,但是这种方法在单视图 NVS 中存在一些缺乏约束的问题:没有明确的用户控制方式,导致 NVS 生成结果往往不太可能。为了解决这个限制,TOSS 使用文本作为高级Semantic信息来约束 NVS 解决方案空间。TOSS 细致地调整了文本-图像 Stable Diffusion 预训练的大规模文本-图像对,并 introduce了专门为图像和摄像头姿态conditioning设计的模块,以及专门为posecorrectness和细节细节训练。我们进行了全面的实验,结果显示了我们提出的 TOSS 比 zero-1-to-3 更加plausible、可控和多视图一致的 NVS 结果。我们还进行了详细的ablation,以证明引入的semantic导航和建筑设计的效果和潜在。

Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting

  • paper_url: http://arxiv.org/abs/2310.10642
  • repo_url: https://github.com/fudan-zvg/4d-gaussian-splatting
  • paper_authors: Zeyu Yang, Hongye Yang, Zijie Pan, Xiatian Zhu, Li Zhang
  • for: Reconstructing complex dynamic 3D scenes and rendering novel views at arbitrary points in time.
  • methods: Approximates the underlying spatio-temporal 4D volume of a dynamic scene by optimizing a collection of 4D primitives with explicit geometry and view-dependent, time-evolving appearance.
  • results: Offers a simple, flexible approach that supports variable-length video and end-to-end training, captures complex dynamic scene motions, and achieves superior visual quality and efficiency with real-time rendering.
    Abstract Reconstructing dynamic 3D scenes from 2D images and generating diverse views over time is challenging due to scene complexity and temporal dynamics. Despite advancements in neural implicit models, limitations persist: (i) Inadequate Scene Structure: Existing methods struggle to reveal the spatial and temporal structure of dynamic scenes from directly learning the complex 6D plenoptic function. (ii) Scaling Deformation Modeling: Explicitly modeling scene element deformation becomes impractical for complex dynamics. To address these issues, we consider the spacetime as an entirety and propose to approximate the underlying spatio-temporal 4D volume of a dynamic scene by optimizing a collection of 4D primitives, with explicit geometry and appearance modeling. Learning to optimize the 4D primitives enables us to synthesize novel views at any desired time with our tailored rendering routine. Our model is conceptually simple, consisting of a 4D Gaussian parameterized by anisotropic ellipses that can rotate arbitrarily in space and time, as well as view-dependent and time-evolved appearance represented by the coefficient of 4D spherindrical harmonics. This approach offers simplicity, flexibility for variable-length video and end-to-end training, and efficient real-time rendering, making it suitable for capturing complex dynamic scene motions. Experiments across various benchmarks, including monocular and multi-view scenarios, demonstrate our 4DGS model's superior visual quality and efficiency.
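
For intuition about the 4D primitives, here is a tiny NumPy sketch that evaluates one anisotropic space-time Gaussian at a query point; the full-covariance parameterization is an illustrative assumption rather than the paper's exact formulation, which additionally models view-dependent, time-evolving appearance.

```python
import numpy as np

def gaussian_4d(query_xyzt: np.ndarray, mean_xyzt: np.ndarray, cov_4x4: np.ndarray) -> float:
    """Unnormalized density of one anisotropic 4D Gaussian primitive at a space-time point."""
    d = query_xyzt - mean_xyzt
    return float(np.exp(-0.5 * d @ np.linalg.inv(cov_4x4) @ d))
```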

LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts

  • paper_url: http://arxiv.org/abs/2310.10640
  • repo_url: https://github.com/hananshafi/llmblueprint
  • paper_authors: Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, Peter Wonka
  • for: Generating images from long, complex text prompts that describe scenes with multiple objects.
  • methods: Uses a large language model to extract the key components of the prompt, including bounding box coordinates for foreground objects, detailed per-object descriptions, and a succinct background context, and then uses these components to drive a layout-to-image generation model.
  • results: On complex prompts featuring multiple objects, achieves a substantial improvement in recall over baseline diffusion models, further validated by a user study.
    Abstract Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts describing complex scenes with multiple objects. While excelling in generating images from short, single-object descriptions, these models often struggle to faithfully capture all the nuanced details within longer and more elaborate textual inputs. In response, we present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts, including bounding box coordinates for foreground objects, detailed textual descriptions for individual objects, and a succinct background context. These components form the foundation of our layout-to-image generation model, which operates in two phases. The initial Global Scene Generation utilizes object layouts and background context to create an initial scene but often falls short in faithfully representing object characteristics as specified in the prompts. To address this limitation, we introduce an Iterative Refinement Scheme that iteratively evaluates and refines box-level content to align them with their textual descriptions, recomposing objects as needed to ensure consistency. Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models. This is further validated by a user study, underscoring the efficacy of our approach in generating coherent and detailed scenes from intricate textual inputs.
    摘要 基于扩散的文本到图像生成模型在处理更长、更复杂的文本提示时会遇到挑战。尽管在简短的单个对象描述上表现出色,这些模型在处理长而复杂的文本提示时经常难以忠实地捕捉其中的细节。为了解决这一问题,我们提出了一种新的方法,利用大型语言模型(LLM)来提取文本提示中的关键组成部分,包括前景对象的 bounding box 坐标、各个对象的详细文本描述以及简洁的背景上下文。这些组成部分构成了我们的布局到图像生成模型的基础,该模型分两个阶段进行处理。首先,全局场景生成利用对象布局和背景上下文创建初始场景,但这些场景往往无法忠实地呈现提示中所指定的对象特征。为了解决这一限制,我们引入了一种迭代细化方案,反复评估并修正框级内容,使其与对应的文本描述保持一致,并在必要时重新组合对象。我们在包含多个对象的复杂提示上的评估表明,与基准扩散模型相比,我们的方法在召回率上取得了显著提升,用户研究进一步验证了这一结果。

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing

  • paper_url: http://arxiv.org/abs/2310.10624
  • repo_url: None
  • paper_authors: Jia-Wei Liu, Yan-Pei Cao, Jay Zhangjie Wu, Weijia Mao, Yuchao Gu, Rui Zhao, Jussi Keppo, Ying Shan, Mike Zheng Shou
  • for: A human-centric video editing method based on dynamic Neural Radiance Fields (NeRF) that lifts the restrictions existing methods face under long video lengths and large motion and view changes.
  • methods: Uses a dynamic NeRF as the human-centric video representation and proposes an image-based 3D space editing pipeline combining multi-view multi-pose Score Distillation Sampling (SDS), reconstruction losses on the reference image, text-guided local parts super-resolution, and style transfer for the 3D background space.
  • results: Outperforms competing approaches on two challenging datasets by a large margin of 50%~95% in terms of human preference; video comparisons are available on the project page https://showlab.github.io/DynVideo-E/.
    Abstract Despite remarkable research advances in diffusion-based video editing, existing methods are limited to short-length videos due to the contradiction between long-range consistency and frame-wise editing. Recent approaches attempt to tackle this challenge by introducing video-2D representations to degrade video editing to image editing. However, they encounter significant difficulties in handling large-scale motion- and view-change videos especially for human-centric videos. This motivates us to introduce the dynamic Neural Radiance Fields (NeRF) as the human-centric video representation to ease the video editing problem to a 3D space editing task. As such, editing can be performed in the 3D spaces and propagated to the entire video via the deformation field. To provide finer and direct controllable editing, we propose the image-based 3D space editing pipeline with a set of effective designs. These include multi-view multi-pose Score Distillation Sampling (SDS) from both 2D personalized diffusion priors and 3D diffusion priors, reconstruction losses on the reference image, text-guided local parts super-resolution, and style transfer for 3D background space. Extensive experiments demonstrate that our method, dubbed as DynVideo-E, significantly outperforms SOTA approaches on two challenging datasets by a large margin of 50% ~ 95% in terms of human preference. Compelling video comparisons are provided in the project page https://showlab.github.io/DynVideo-E/. Our code and data will be released to the community.
    摘要 尽管扩展视频编辑技术已取得了很大的进步,但现有方法仅适用于短视频,因为视频编辑和框架之间存在长距离一致性和帧级编辑之间的矛盾。现有的方法通过引入视频到2D表示来减轻视频编辑到图像编辑。然而,它们在处理大规模运动和视点变化视频,特别是人类中心视频时遇到了重大困难。这个问题驱使我们提出人类中心视频表示——动态神经辐射场(NeRF),以便将视频编辑转化为3D空间编辑任务。在这种情况下,编辑可以在3D空间进行,并通过扭曲场进行对整个视频的广泛传播。为了提供更加精细和直接控制的编辑,我们提议了基于图像的3D空间编辑管线,包括多视图多姿Score Distillation Sampling(SDS)、参考图像的重建损失、文本指导的本地部分超解析和风格转换。我们的方法,即DynVideo-E,在两个挑战性 datasets 上达到了领先的水平,与前一代方法的比较达到了50%~95%的差距。我们在项目页面(https://showlab.github.io/DynVideo-E/)提供了吸引人的视频比较。我们的代码和数据将被公开发布到社区。

Interpreting and Controlling Vision Foundation Models via Text Explanations

  • paper_url: http://arxiv.org/abs/2310.10591
  • repo_url: https://github.com/tonychenxyz/vit-interpret
  • paper_authors: Haozhe Chen, Junfeng Yang, Carl Vondrick, Chengzhi Mao
  • for: Understanding and controlling the predictions of large-scale pre-trained vision foundation models such as CLIP.
  • methods: A framework that interprets a vision transformer's latent tokens with natural language: a token's semantic information is retained to the final layer via the transformer's local operations, and the closest text is retrieved as its explanation.
  • results: Enables understanding of the model's visual reasoning procedure without additional training or data collection, and supports model editing that controls reasoning behavior and improves robustness against biases and spurious correlations.
    Abstract Large-scale pre-trained vision foundation models, such as CLIP, have become de facto backbones for various vision tasks. However, due to their black-box nature, understanding the underlying rules behind these models' predictions and controlling model behaviors have remained open challenges. We present a framework for interpreting vision transformer's latent tokens with natural language. Given a latent token, our framework retains its semantic information to the final layer using transformer's local operations and retrieves the closest text for explanation. Our approach enables understanding of model visual reasoning procedure without needing additional model training or data collection. Based on the obtained interpretations, our framework allows for model editing that controls model reasoning behaviors and improves model robustness against biases and spurious correlations.
    摘要 大规模预训练视觉基础模型,如CLIP,已成为视觉任务的德 факто底层。然而,由于其黑盒模型的性质,理解这些模型预测的下面规则和控制模型行为仍然是开放的挑战。我们提出了一个把视觉转换器的幂谱Token与自然语言相关的框架。给定一个幂谱Token,我们的框架使用转换器的地方运算保留它的含义到最终层,并从文本库中检索最相似的文本来解释。我们的方法可以在不需要额外训练或数据收集的前提下,理解模型的视觉逻辑过程。基于获得的解释,我们的框架允许对模型的编辑,控制模型的逻辑行为,并改善模型免疫偏见和偶极相关性。
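
The retrieval step described above can be pictured in a few lines of NumPy: given a latent-token embedding and a bank of text embeddings in a shared space, return the closest text as the explanation (cosine similarity is an assumption for illustration).

```python
import numpy as np

def nearest_text(token_embedding: np.ndarray, text_embeddings: np.ndarray, texts: list) -> str:
    """Explain a latent token by retrieving the most similar text in a shared embedding space."""
    token = token_embedding / np.linalg.norm(token_embedding)
    bank = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    return texts[int(np.argmax(bank @ token))]
```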

Matching the Neuronal Representations of V1 is Necessary to Improve Robustness in CNNs with V1-like Front-ends

  • paper_url: http://arxiv.org/abs/2310.10575
  • repo_url: https://github.com/dicarlolab/vonenet
  • paper_authors: Ruxandra Barbulescu, Tiago Marques, Arlindo L. Oliveira
  • for: Improving robustness to image corruptions.
  • methods: Adds a front-end modeling the primate primary visual cortex (V1) to a CNN, with receptive-field (RF) properties sampled either uniformly or from empirical biological distributions.
  • results: The variant sampling RF properties from empirical biological distributions is considerably more robust to image corruptions (a relative difference of 8.72%); although similar neuronal sub-populations in the two variants have similar response properties and learn similar downstream weights, their impact on downstream processing differs markedly.
    Abstract While some convolutional neural networks (CNNs) have achieved great success in object recognition, they struggle to identify objects in images corrupted with different types of common noise patterns. Recently, it was shown that simulating computations in early visual areas at the front of CNNs leads to improvements in robustness to image corruptions. Here, we further explore this result and show that the neuronal representations that emerge from precisely matching the distribution of RF properties found in primate V1 is key for this improvement in robustness. We built two variants of a model with a front-end modeling the primate primary visual cortex (V1): one sampling RF properties uniformly and the other sampling from empirical biological distributions. The model with the biological sampling has a considerably higher robustness to image corruptions that the uniform variant (relative difference of 8.72%). While similar neuronal sub-populations across the two variants have similar response properties and learn similar downstream weights, the impact on downstream processing is strikingly different. This result sheds light on the origin of the improvements in robustness observed in some biologically-inspired models, pointing to the need of precisely mimicking the neuronal representations found in the primate brain.
    摘要 一些卷积神经网络(CNN)在物体识别方面取得了很大的成功,但它们难以识别被各类常见噪声模式损坏的图像中的物体。最近的研究表明,在 CNN 前端模拟早期视觉区域的计算可以提高对图像损坏的鲁棒性。在本文中,我们进一步探索这一结果,并证明精确匹配灵长类 V1 中感受野(RF)属性分布所产生的神经元表征是这种鲁棒性提升的关键。我们构建了两个带有模拟灵长类初级视觉皮层(V1)前端的模型变体:一个均匀采样 RF 属性,另一个按经验生物分布采样。采用生物分布采样的模型对图像损坏的鲁棒性明显更高(相对差异为 8.72%)。尽管两个变体中相似的神经元子群具有相似的响应特性并学习到相似的下游权重,它们对下游处理的影响却截然不同。这一结果揭示了一些受生物启发的模型中鲁棒性提升的来源,表明需要精确模拟灵长类大脑中的神经元表征。
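
A V1-like front-end is typically built from Gabor filters, the classical model of V1 simple-cell receptive fields; the NumPy sketch below generates one such kernel (the parameterization is assumed for illustration and is not the exact VOneNet front-end).

```python
import numpy as np

def gabor_kernel(size: int, wavelength: float, orientation: float, sigma: float, phase: float = 0.0):
    """2-D Gabor filter: a Gaussian envelope multiplied by an oriented sinusoidal carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(orientation) + y * np.sin(orientation)
    yr = -x * np.sin(orientation) + y * np.cos(orientation)
    envelope = np.exp(-(xr**2 + yr**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * xr / wavelength + phase)
    return envelope * carrier
```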

RefConv: Re-parameterized Refocusing Convolution for Powerful ConvNets

  • paper_url: http://arxiv.org/abs/2310.10563
  • repo_url: https://github.com/Aiolus-X/RefConv
  • paper_authors: Zhicheng Cai, Xiaohan Ding, Qiu Shen, Xun Cao
  • for: Improving the performance of CNN models without changing the original model structure or adding any inference cost.
  • methods: Applies a trainable Refocusing Transformation to the basis kernels inherited from a pre-trained model, establishing connections among the parameters of different channels and thereby enhancing the model's representational capacity.
  • results: RefConv improves multiple CNN-based models on image classification (up to 1.47% higher top-1 accuracy on ImageNet), object detection, and semantic segmentation, without extra inference cost or changes to the original model structure.
    Abstract We propose Re-parameterized Refocusing Convolution (RefConv) as a replacement for regular convolutional layers, which is a plug-and-play module to improve the performance without any inference costs. Specifically, given a pre-trained model, RefConv applies a trainable Refocusing Transformation to the basis kernels inherited from the pre-trained model to establish connections among the parameters. For example, a depth-wise RefConv can relate the parameters of a specific channel of convolution kernel to the parameters of the other kernel, i.e., make them refocus on the other parts of the model they have never attended to, rather than focus on the input features only. From another perspective, RefConv augments the priors of existing model structures by utilizing the representations encoded in the pre-trained parameters as the priors and refocusing on them to learn novel representations, thus further enhancing the representational capacity of the pre-trained model. Experimental results validated that RefConv can improve multiple CNN-based models by a clear margin on image classification (up to 1.47% higher top-1 accuracy on ImageNet), object detection and semantic segmentation without introducing any extra inference costs or altering the original model structure. Further studies demonstrated that RefConv can reduce the redundancy of channels and smooth the loss landscape, which explains its effectiveness.
    摘要 我们提议使用Re-parameterized Refocusing Convolution(RefConv)取代常规 convolutional layer,这是一个可插入的模块,可以无需更改预测成本提高性能。具体来说,给定一个预训练模型,RefConv将预训练模型继承的基准kernel应用一个可学习的 Refocusing Transformation,以建立模型参数之间的连接。例如,深度 wise RefConv可以将一个特定通道的 convolution kernel 的参数与其他kernel的参数相关联,例如,使得这些参数强调其他部分的模型,而不是仅仅强调输入特征。从另一个角度来看,RefConv可以利用预训练参数中的代表性作为PRIOR,并强调这些代表性,以学习新的表示,从而进一步提高预训练模型的表达能力。实验结果表明,RefConv可以在图像分类(最高达1.47%的top-1准确率提升在ImageNet)、物体检测和 semantic segmentation 中提高多个CNN基于模型的性能,而无需增加预测成本或改变原始模型结构。进一步的研究还表明,RefConv可以减少通道的重复性和缓和损失函数的曲线,这解释了它的效果。

InfoGCN++: Learning Representation by Predicting the Future for Online Human Skeleton-based Action Recognition

  • paper_url: http://arxiv.org/abs/2310.10547
  • repo_url: https://github.com/stnoah1/sode
  • paper_authors: Seunggeun Chi, Hyung-gun Chi, Qixing Huang, Karthik Ramani
  • for: online skeleton-based action recognition
  • methods: InfoGCN++, a novel extension of InfoGCN that enables real-time categorization of action types without complete observation sequences
  • results: exceptional performance in online action recognition, consistently matching or exceeding existing techniques
    Abstract Skeleton-based action recognition has made significant advancements recently, with models like InfoGCN showcasing remarkable accuracy. However, these models exhibit a key limitation: they necessitate complete action observation prior to classification, which constrains their applicability in real-time situations such as surveillance and robotic systems. To overcome this barrier, we introduce InfoGCN++, an innovative extension of InfoGCN, explicitly developed for online skeleton-based action recognition. InfoGCN++ augments the abilities of the original InfoGCN model by allowing real-time categorization of action types, independent of the observation sequence's length. It transcends conventional approaches by learning from current and anticipated future movements, thereby creating a more thorough representation of the entire sequence. Our approach to prediction is managed as an extrapolation issue, grounded on observed actions. To enable this, InfoGCN++ incorporates Neural Ordinary Differential Equations, a concept that lets it effectively model the continuous evolution of hidden states. Following rigorous evaluations on three skeleton-based action recognition benchmarks, InfoGCN++ demonstrates exceptional performance in online action recognition. It consistently equals or exceeds existing techniques, highlighting its significant potential to reshape the landscape of real-time action recognition applications. Consequently, this work represents a major leap forward from InfoGCN, pushing the limits of what's possible in online, skeleton-based action recognition. The code for InfoGCN++ is publicly available at https://github.com/stnoah1/infogcn2 for further exploration and validation.
    摘要 InfoGCN++ 是一种在线动作识别模型,它是 InfoGCN 的一种创新扩展。InfoGCN++ 可以在实时情况下进行动作类型分类,不需要完整的动作观察序列。它超越了传统方法,通过学习当前和预测未来动作的整体表示来提高模型的表示能力。我们采用了神经网络普通微分方程来管理预测问题,以便有效地模型动作的不断演化。经过严格的评估,InfoGCN++ 在三个骨干基于动作识别benchmark上表现出色, consistently 与或超过了现有方法的性能。这成功表明InfoGCN++ 在实时动作识别应用中具有重要的潜力。因此,这种工作代表了 InfoGCN 的一个重要突破,推动了在线骨干基于动作识别领域的发展。InfoGCN++ 的代码可以在 上公开下载,以便进一步探索和验证。
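
The Neural-ODE ingredient can be pictured as numerically integrating a learned hidden-state dynamics to extrapolate into the future; a minimal Euler-integration sketch (the dynamics callable and step count are illustrative assumptions):

```python
import numpy as np

def euler_extrapolate(h0, dynamics, t0: float, t1: float, n_steps: int = 10):
    """Euler integration of dh/dt = dynamics(h, t), extrapolating a hidden state to time t1."""
    h, t = np.asarray(h0, dtype=float), t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        h = h + dt * dynamics(h, t)
        t += dt
    return h
```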

Label-efficient Segmentation via Affinity Propagation

  • paper_url: http://arxiv.org/abs/2310.10533
  • repo_url: https://github.com/circleradon/apro
  • paper_authors: Wentong Li, Yuqian Yuan, Song Wang, Wenyu Liu, Dongqi Tang, Jian Liu, Jianke Zhu, Lei Zhang
  • for: Reducing the cost of the laborious pixel-wise annotation process in segmentation.
  • methods: Formulates affinity modeling as an affinity propagation process with local and global pairwise affinity terms to generate accurate soft pseudo labels, together with an efficient algorithm that greatly reduces the computational cost.
  • results: Demonstrates superior performance on three label-efficient segmentation tasks: box-supervised instance segmentation, point/scribble-supervised semantic segmentation, and CLIP-guided semantic segmentation.
    Abstract Weakly-supervised segmentation with label-efficient sparse annotations has attracted increasing research attention to reduce the cost of laborious pixel-wise labeling process, while the pairwise affinity modeling techniques play an essential role in this task. Most of the existing approaches focus on using the local appearance kernel to model the neighboring pairwise potentials. However, such a local operation fails to capture the long-range dependencies and ignores the topology of objects. In this work, we formulate the affinity modeling as an affinity propagation process, and propose a local and a global pairwise affinity terms to generate accurate soft pseudo labels. An efficient algorithm is also developed to reduce significantly the computational cost. The proposed approach can be conveniently plugged into existing segmentation networks. Experiments on three typical label-efficient segmentation tasks, i.e. box-supervised instance segmentation, point/scribble-supervised semantic segmentation and CLIP-guided semantic segmentation, demonstrate the superior performance of the proposed approach.
    摘要 弱监督分割的研究已经吸引了越来越多的关注,以减少繁琐的像素精确标注过程的成本。在这个任务中,对 neighboring pairwise 潜在力场的建模技术扮演着关键角色。大多数现有方法都是基于本地外观核函数来建模邻近对的可能性。然而,这种本地操作无法捕捉长距离依赖关系和对象的 topological 结构。在这种工作中,我们将互相关系建模化为一种互相传播过程,并提出了本地和全局对对应潜在力场的两个方法,以生成准确的软精确标签。我们还开发了高效的算法,以减少计算成本。提议的方法可以方便地插入现有的分割网络中。在三种典型的标签有效分割任务中,即盒子监督实例分割、点/scribble监督 semantic segmentation 和 CLIP 引导的 semantic segmentation 中,我们的方法显示出了superior的性能。
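
To illustrate how pairwise affinities can spread sparse annotations into dense soft labels, here is a generic label-propagation sketch over a feature-similarity affinity matrix; the Gaussian kernel, clamping scheme, and dense N x N matrix are simplifying assumptions, not the paper's local and global affinity terms.

```python
import numpy as np

def propagate_soft_labels(features: np.ndarray, seed_labels: np.ndarray,
                          sigma: float = 0.5, n_iters: int = 10) -> np.ndarray:
    """features: (N, d) per-pixel features; seed_labels: (N, C) one-hot rows, all-zero if unlabeled."""
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    affinity = np.exp(-d2 / (2.0 * sigma**2))
    affinity /= affinity.sum(axis=1, keepdims=True)
    soft = seed_labels.astype(float)
    labeled = seed_labels.sum(axis=1) > 0
    for _ in range(n_iters):
        soft = affinity @ soft
        soft[labeled] = seed_labels[labeled]   # keep annotated pixels fixed
    return soft
```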

Distribution prediction for image compression: An experimental re-compressor for JPEG images

  • paper_url: http://arxiv.org/abs/2310.10517
  • repo_url: None
  • paper_authors: Maxim Koroteev, Yaroslav Borisov, Pavel Frolov
  • for: Further reducing the size of JPEG images.
  • methods: Partially decodes the JPEG signal to obtain the quantized DCT coefficients and then re-compresses them in a more effective way.
  • results: Achieves lossless re-compression of JPEG images.
    Abstract We propose a new scheme to re-compress JPEG images in a lossless way. Using a JPEG image as an input the algorithm partially decodes the signal to obtain quantized DCT coefficients and then re-compress them in a more effective way.
    摘要 我们提出了一种新的方案,用于以无损方式对 JPEG 图像进行重新压缩。该算法以 JPEG 图像作为输入,部分解码信号以获取量化的 DCT 系数,然后以更有效的方式重新压缩它们。
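
Where the re-compression gain comes from can be illustrated by measuring the empirical entropy of the quantized DCT coefficients: better distribution prediction means fewer bits per coefficient. A crude NumPy sketch over one coefficient position (assumed for illustration, not the paper's model):

```python
import numpy as np

def ideal_code_length(quantized_coeffs: np.ndarray) -> float:
    """Total bits an ideal entropy coder would need for one DCT coefficient position
    across all blocks, under the empirical (static) distribution of its values."""
    _, counts = np.unique(quantized_coeffs, return_counts=True)
    p = counts / counts.sum()
    return float(-(counts * np.log2(p)).sum())
```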

Unifying Image Processing as Visual Prompting Question Answering

  • paper_url: http://arxiv.org/abs/2310.10513
  • repo_url: None
  • paper_authors: Yihao Liu, Xiangyu Chen, Xianzheng Ma, Xintao Wang, Jiantao Zhou, Yu Qiao, Chao Dong
  • for: Enhancing image quality and extracting visual features with a single universal model instead of task-specific models.
  • methods: Builds on large-scale model pre-training and in-context learning, treating each input-output image pair as a structured question-answer sentence and reprogramming image processing as a visual prompting question-answering problem.
  • results: The resulting model, PromptGIP, handles a wide range of image processing tasks, including image restoration, image enhancement, and image feature extraction, without task-specific fine-tuning.
    Abstract Image processing is a fundamental task in computer vision, which aims at enhancing image quality and extracting essential features for subsequent vision applications. Traditionally, task-specific models are developed for individual tasks and designing such models requires distinct expertise. Building upon the success of large language models (LLMs) in natural language processing (NLP), there is a similar trend in computer vision, which focuses on developing large-scale models through pretraining and in-context learning. This paradigm shift reduces the reliance on task-specific models, yielding a powerful unified model to deal with various tasks. However, these advances have predominantly concentrated on high-level vision tasks, with less attention paid to low-level vision tasks. To address this issue, we propose a universal model for general image processing that covers image restoration, image enhancement, image feature extraction tasks, \textit{etc}. Our proposed framework, named PromptGIP, unifies these diverse image processing tasks within a universal framework. Inspired by NLP question answering (QA) techniques, we employ a visual prompting question answering paradigm. Specifically, we treat the input-output image pair as a structured question-answer sentence, thereby reprogramming the image processing task as a prompting QA problem. PromptGIP can undertake diverse \textbf{cross-domain} tasks using provided visual prompts, eliminating the need for task-specific finetuning. Our methodology offers a universal and adaptive solution to general image processing. While PromptGIP has demonstrated a certain degree of out-of-domain task generalization capability, further research is expected to fully explore its more powerful emergent generalization.
    摘要 计算机视觉中的图像处理是一项基本任务,旨在提高图像质量和提取视觉应用中的关键特征。传统上,图像处理任务需要开发特定任务的模型,并且设计这些模型需要专门的技能。在计算机视觉领域,大型语言模型(LLM)的成功引起了一种类似的趋势,即通过预训练和上下文学习来建立大规模模型。这种思路转移使得图像处理任务可以使用一个强大的通用模型进行处理,而不需要特定任务的模型。然而,这些进步主要集中在高级视觉任务上,对低级视觉任务的关注较少。为了解决这个问题,我们提出了一个通用的图像处理模型,名为PromptGIP。我们的提案旨在将多种图像处理任务集成到一个通用框架中,并采用了视觉提问解答技术来实现。通过将输入输出图像对当做一个结构化的问题和答案句子,我们可以将图像处理任务转化为一个提问解答问题。PromptGIP可以通过提供的视觉提问来完成多个跨领域任务,无需特定任务的训练。我们的方法可以提供一个通用和适应的解决方案 для普通图像处理。虽然PromptGIP已经表现了一定的 OUT-OF-DOMAIN 任务泛化能力,但是进一步的研究可以充分发挥其更强大的 emergent 泛化能力。

Evaluation and improvement of Segment Anything Model for interactive histopathology image segmentation

  • paper_url: http://arxiv.org/abs/2310.10493
  • repo_url: None
  • paper_authors: SeungKyu Kim, Hyun-Jic Oh, Seonghui Min, Won-Ki Jeong
  • for: This paper focuses on evaluating the performance of the Segment Anything Model (SAM) in interactive segmentation of histopathology data, and comparing it with other state-of-the-art interactive models.
  • methods: The paper uses the SAM model as a foundational model for image segmentation, and evaluates its performance in zero-shot and fine-tuned scenarios on histopathology data. The authors also propose a modification of SAM’s decoder to improve its local refinement ability and stability.
  • results: The experimental results show that SAM exhibits weaknesses in segmentation performance compared to other models, but demonstrates relative strengths in inference time and generalization capability. The proposed modification of SAM’s decoder is effective in improving its performance for interactive histology image segmentation.
    Abstract With the emergence of the Segment Anything Model (SAM) as a foundational model for image segmentation, its application has been extensively studied across various domains, including the medical field. However, its potential in the context of histopathology data, specifically in region segmentation, has received relatively limited attention. In this paper, we evaluate SAM's performance in zero-shot and fine-tuned scenarios on histopathology data, with a focus on interactive segmentation. Additionally, we compare SAM with other state-of-the-art interactive models to assess its practical potential and evaluate its generalization capability with domain adaptability. In the experimental results, SAM exhibits a weakness in segmentation performance compared to other models while demonstrating relative strengths in terms of inference time and generalization capability. To improve SAM's limited local refinement ability and to enhance prompt stability while preserving its core strengths, we propose a modification of SAM's decoder. The experimental results suggest that the proposed modification is effective to make SAM useful for interactive histology image segmentation. The code is available at \url{https://github.com/hvcl/SAM_Interactive_Histopathology}
    摘要 随着Segment Anything Model(SAM)作为图像分割基本模型的出现,其应用在不同领域得到了广泛的研究,但在 histopathology 数据中的区域分割方面却收到了相对有限的关注。在这篇论文中,我们评估了 SAM 在 zero-shot 和 fine-tuned 场景中对 histopathology 数据的性能,强调交互分割。此外,我们与其他当前领先的交互模型进行比较,以评估 SAM 在实际应用中的实用性和适应性。在实验结果中,SAM 在分割性能方面表现较弱,但在推理时间和适应性方面表现出了相对的优势。为了改进 SAM 的局部精度修正能力并保持其核心优势,我们提议一种修改 SAM 的解码器。实验结果表明,该修改是有效的,使得 SAM 在交互式 histology 图像分割中变得有用。代码可以在 \url{https://github.com/hvcl/SAM_Interactive_Histopathology} 上获取。

On the Transferability of Learning Models for Semantic Segmentation for Remote Sensing Data

  • paper_url: http://arxiv.org/abs/2310.10490
  • repo_url: https://github.com/gdaosu/transferability-remote-sensing
  • paper_authors: Rongjun Qin, Guixiang Zhang, Yang Tang
  • for: Investigating the raw transferability of traditional and deep learning (DL) models for remote sensing (RS) semantic segmentation/classification, and how domain adaptation (DA) approaches enhance the transferability of DL models.
  • methods: Trains six models, with and without three DA approaches, on four highly diverse RS datasets to quantify their transferability between datasets, and proposes a straightforward method that uses spectral indices as a medium to assess transferability on a target domain when labels are unavailable.
  • results: The experiments yield several generally important yet not well-reported observations on raw and adapted transferability, and the proposed label-free transferability assessment is validated to be better than posterior model confidence.
    Abstract Recent deep learning-based methods outperform traditional learning methods on remote sensing (RS) semantic segmentation/classification tasks. However, they require large training datasets and are generally known for lack of transferability due to the highly disparate RS image content across different geographical regions. Yet, there is no comprehensive analysis of their transferability, i.e., to which extent a model trained on a source domain can be readily applicable to a target domain. Therefore, in this paper, we aim to investigate the raw transferability of traditional and deep learning (DL) models, as well as the effectiveness of domain adaptation (DA) approaches in enhancing the transferability of the DL models (adapted transferability). By utilizing four highly diverse RS datasets, we train six models with and without three DA approaches to analyze their transferability between these datasets quantitatively. Furthermore, we developed a straightforward method to quantify the transferability of a model using the spectral indices as a medium and have demonstrated its effectiveness in evaluating the model transferability at the target domain when the labels are unavailable. Our experiments yield several generally important yet not well-reported observations regarding the raw and adapted transferability. Moreover, our proposed label-free transferability assessment method is validated to be better than posterior model confidence. The findings can guide the future development of generalized RS learning models. The trained models are released under this link: https://github.com/GDAOSU/Transferability-Remote-Sensing
    摘要 现代深度学习方法在远程感知(RS)semantic segmentation/分类任务上表现出色,但它们需要大量的训练数据并且通常因为不同地区RS图像内容差异极大而无法转移。然而,没有系统性的分析转移性,即源领域训练的模型可以如何 extent 应用于目标领域。因此,在这篇论文中,我们想要调查传统和深度学习(DL)模型的原生转移性,以及使用领域适应(DA)策略可以提高DL模型的转移性(适应转移性)。通过使用四个高度不同RS数据集,我们训练了六个模型,并使用三种DA策略进行分析其转移性。此外,我们还提出了一种简单的方法来评估模型的转移性,使用spectral indices作为媒介,并在目标领域无标签情况下证明其效果。我们的实验结果表明了一些不常报道的观察结果,包括原生转移性和适应转移性的分析。此外,我们的提出的无标签转移性评估方法被证明为比 posterior model confidence 更有用。这些发现可以指导未来的RS学习模型的发展。我们训练的模型可以在以下链接获取:https://github.com/GDAOSU/Transferability-Remote-Sensing

Combating Label Noise With A General Surrogate Model For Sample Selection

  • paper_url: http://arxiv.org/abs/2310.10463
  • repo_url: None
  • paper_authors: Chao Liang, Linchao Zhu, Humphrey Shi, Yi Yang
  • for: Mitigating the label noise introduced by web data, which hinders the performance of deep learning systems.
  • methods: Uses the vision-language model CLIP as a training-free surrogate to automatically filter noisy samples via its text-image alignment, and designs a margin adaptive loss to regularize the selection bias introduced by CLIP.
  • results: Achieves significant improvements on both real-world and synthetic noisy datasets, without involving CLIP during the inference stage.
    Abstract Modern deep learning systems are data-hungry. Learning with web data is one of the feasible solutions, but will introduce label noise inevitably, which can hinder the performance of deep neural networks. Sample selection is an effective way to deal with label noise. The key is to separate clean samples based on some criterion. Previous methods pay more attention to the small loss criterion where small-loss samples are regarded as clean ones. Nevertheless, such a strategy relies on the learning dynamics of each data instance. Some noisy samples are still memorized due to frequently occurring corrupted learning patterns. To tackle this problem, a training-free surrogate model is preferred, freeing from the effect of memorization. In this work, we propose to leverage the vision-language surrogate model CLIP to filter noisy samples automatically. CLIP brings external knowledge to facilitate the selection of clean samples with its ability of text-image alignment. Furthermore, a margin adaptive loss is designed to regularize the selection bias introduced by CLIP, providing robustness to label noise. We validate the effectiveness of our proposed method on both real-world and synthetic noisy datasets. Our method achieves significant improvement without CLIP involved during the inference stage.
    摘要 现代深度学习系统具有巨量数据的需求。使用网络数据进行学习是一个可行的解决方案,但会随机扰动标签,从而影响深度神经网络的性能。样本选择是一个有效的方法来处理标签噪音。以往的方法更多地关注于小损失标准,即将小损失的样本视为干净的样本。然而,这种策略基于每个数据实例的学习动态,一些噪音样本仍然会被记忆由频繁出现的异常学习模式。为解决这个问题,我们提议利用CLIP视觉语言代理模型自动过滤噪音样本。CLIP带来了外部知识,以便通过图文对齐来促进干净样本的选择。此外,我们还设计了一种margin适应损失,以规避CLIP的选择偏见,提供了对标签噪音的Robustness。我们在真实的噪音数据集和静态数据集上验证了我们的提议的有效性,而不需要在推理阶段使用CLIP。
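A minimal sketch of the training-free selection idea is given below, using the Hugging Face CLIP implementation: each sample's noisy label is scored by CLIP's image-text alignment and only the best-matching fraction is kept. The prompt template, keep ratio, and ranking rule are illustrative assumptions rather than the paper's margin adaptive loss.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_clean(image_paths, noisy_labels, class_names, keep_ratio=0.7):
    """Return indices of samples whose noisy label CLIP agrees with most strongly."""
    prompts = [f"a photo of a {name}" for name in class_names]
    scores = []
    for path, label in zip(image_paths, noisy_labels):
        image = Image.open(path).convert("RGB")
        inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits_per_image[0]   # similarity to each class prompt
        probs = logits.softmax(dim=-1)
        scores.append(probs[label].item())                 # agreement with the noisy label
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[: int(keep_ratio * len(order))])
    return [i for i in range(len(scores)) if i in keep]
```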

Model Selection of Anomaly Detectors in the Absence of Labeled Validation Data

  • paper_url: http://arxiv.org/abs/2310.10461
  • repo_url: None
  • paper_authors: Clement Fung, Chen Qiu, Aodong Li, Maja Rudolph
  • for: A general-purpose framework for evaluating and selecting image-based anomaly detectors when labeled validation data is unavailable.
  • methods: Assumes access to a small support set of normal images, which a pre-trained diffusion model (no training or fine-tuning required) turns into synthetic anomalies; mixed with normal samples from the support set, these compose a validation framework for anomaly detection evaluation and model selection.
  • results: In an extensive empirical study, the synthetic validation framework selects the same models and hyper-parameters as a ground-truth validation set, and the prompts it selects for CLIP-based anomaly detection outperform all other prompt selection strategies, yielding the best detection accuracy even on the challenging MVTec-AD dataset.
    Abstract Anomaly detection requires detecting abnormal samples in large unlabeled datasets. While progress in deep learning and the advent of foundation models has produced powerful unsupervised anomaly detection methods, their deployment in practice is often hindered by the lack of labeled data -- without it, the detection accuracy of an anomaly detector cannot be evaluated reliably. In this work, we propose a general-purpose framework for evaluating image-based anomaly detectors with synthetically generated validation data. Our method assumes access to a small support set of normal images which are processed with a pre-trained diffusion model (our proposed method requires no training or fine-tuning) to produce synthetic anomalies. When mixed with normal samples from the support set, the synthetic anomalies create detection tasks that compose a validation framework for anomaly detection evaluation and model selection. In an extensive empirical study, ranging from natural images to industrial applications, we find that our synthetic validation framework selects the same models and hyper-parameters as selection with a ground-truth validation set. In addition, we find that prompts selected by our method for CLIP-based anomaly detection outperforms all other prompt selection strategies, and leads to the overall best detection accuracy, even on the challenging MVTec-AD dataset.
    摘要 异常检测需要检测大量无标签数据中的异常标本。虽然深度学习和基础模型的进步导致了无监督异常检测方法的生成,但它们在实践中的应用受到了无标签数据的缺乏影响,因为无法对异常检测器的准确性进行可靠评估。在这个工作中,我们提出一个通用的框架,用于评估基于图像的异常检测器,使用生成的运动模型来生成异常标本。我们的方法不需要训练或微调。当混合到支持集的正常图像中,生成的异常标本创建了一个异常检测任务,它们组成了一个适用于异常检测评估和模型选择的验证框架。在广泛的实验研究中,我们发现我们的生成验证框架可以选择同样的模型和参数,并且我们的提示选择策略在CLIP基础上的异常检测中表现出色,并且导致了整体最佳的检测精度,甚至在挑战性的MVTec-AD数据集上。
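The model selection loop itself is simple once synthetic anomalies are available; the sketch below assumes a `make_synthetic_anomaly` callable (standing in for the pre-trained diffusion model) and ranks candidate detectors by AUROC on the mixed validation set. Names and the AUROC choice are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def select_detector(candidates, normal_support, make_synthetic_anomaly):
    """Pick the detector that best separates normal images from synthetic anomalies.

    candidates:             {name: score_fn}, each score_fn maps an image to an anomaly score
    normal_support:         list of normal images (the small support set)
    make_synthetic_anomaly: callable that corrupts a normal image, e.g. via a pre-trained
                            diffusion model (abstracted away here; an assumption)
    """
    synthetic = [make_synthetic_anomaly(img) for img in normal_support]
    images = normal_support + synthetic
    labels = np.array([0] * len(normal_support) + [1] * len(synthetic))

    best_name, best_auc = None, -1.0
    for name, score_fn in candidates.items():
        scores = np.array([score_fn(img) for img in images])
        auc = roc_auc_score(labels, scores)   # ranking quality on the synthetic validation set
        if auc > best_auc:
            best_name, best_auc = name, auc
    return best_name, best_auc
```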

Object Detection in Aerial Images in Scarce Data Regimes

  • paper_url: http://arxiv.org/abs/2310.10433
  • repo_url: None
  • paper_authors: Pierre Le Jeune
  • for: Analyzing how well Few-Shot Object Detection (FSOD) methods transfer from natural images to aerial images, and improving FSOD performance in this setting.
  • methods: A carefully designed attention mechanism to improve detection of small objects, a scale-adaptive box similarity criterion that improves training and evaluation (particularly for small objects), and two generic FSOD approaches based on metric learning and fine-tuning.
  • results: Significant gains on small objects, impressive results with the fine-tuning approach, promising preliminary results on Cross-Domain FSOD, and real-time deployment of the detectors in COSE's systems on extremely large images (more than 100 megapixels) with limited computation power, leveraging optimization tools such as TensorRT.
    Abstract Most contributions on Few-Shot Object Detection (FSOD) evaluate their methods on natural images only, yet the transferability of the announced performance is not guaranteed for applications on other kinds of images. We demonstrate this with an in-depth analysis of existing FSOD methods on aerial images and observed a large performance gap compared to natural images. Small objects, more numerous in aerial images, are the cause for the apparent performance gap between natural and aerial images. As a consequence, we improve FSOD performance on small objects with a carefully designed attention mechanism. In addition, we also propose a scale-adaptive box similarity criterion, that improves the training and evaluation of FSOD methods, particularly for small objects. We also contribute to generic FSOD with two distinct approaches based on metric learning and fine-tuning. Impressive results are achieved with the fine-tuning method, which encourages tackling more complex scenarios such as Cross-Domain FSOD. We conduct preliminary experiments in this direction and obtain promising results. Finally, we address the deployment of the detection models inside COSE's systems. Detection must be done in real-time in extremely large images (more than 100 megapixels), with limited computation power. Leveraging existing optimization tools such as TensorRT, we successfully tackle this engineering challenge.
    摘要 多数对几shot对象检测(FSOD)的贡献仅测试在自然图像上,但是这些方法的可传性并不保证在其他类型图像上的应用。我们通过对现有FSOD方法的深入分析在飞行图像上表明,小对象的众多性导致自然图像和飞行图像之间的性能差距。为了解决这个问题,我们采用了特别设计的注意机制来提高小对象的检测性能。此外,我们还提出了可以适应不同大小的盒子相似性标准,以改进FSOD方法的训练和评估。此外,我们还提出了基于度量学习和精度调整的两种不同方法来提高FSOD性能。经过精心调整,我们得到了很好的结果。最后,我们关注在COSE系统中部署检测模型。在具有EXTREMELY大图像(超过100 megapixel)和有限计算资源的情况下,我们成功使用现有的优化工具 such as TensorRT来解决这个工程问题。
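The abstract does not spell out the scale-adaptive box similarity criterion, so the snippet below is only a plausible illustration of the idea: an IoU-style score whose strictness grows with box size, so that small localization errors on tiny objects are penalized less.

```python
import numpy as np

def scale_adaptive_similarity(box_a, box_b, gamma=0.25):
    """IoU-style similarity that is more forgiving for small boxes (illustrative assumption).

    Boxes are (x1, y1, x2, y2). The IoU is raised to a power that shrinks with box area,
    so the score decays more slowly for tiny objects than for large ones.
    """
    xa1, ya1, xa2, ya2 = box_a
    xb1, yb1, xb2, yb2 = box_b
    inter_w = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    inter_h = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = inter_w * inter_h
    area_a = (xa2 - xa1) * (ya2 - ya1)
    area_b = (xb2 - xb1) * (yb2 - yb1)
    iou = inter / (area_a + area_b - inter + 1e-9)
    # Exponent below 1 boosts the score; it approaches 1 (plain IoU) for large boxes.
    exponent = min(1.0, gamma * np.log1p(min(area_a, area_b)))
    return iou ** max(exponent, 0.1)
```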

DANAA: Towards transferable attacks with double adversarial neuron attribution

  • paper_url: http://arxiv.org/abs/2310.10427
  • repo_url: https://github.com/Davidjinzb/DANAA
  • paper_authors: Zhibo Jin, Zhiyu Zhu, Xinyi Wang, Jiayu Zhang, Jun Shen, Huaming Chen
  • for: Improving the transferability of feature-level adversarial attacks by obtaining a more accurate estimation of neuron importance in the hidden layers of deep neural networks.
  • methods: A double adversarial neuron attribution attack (DANAA) that attributes model outputs to a middle layer along an adversarial non-linear path, measuring the weight of individual neurons and retaining the features most important for transferability.
  • results: Extensive experiments on benchmark datasets demonstrate state-of-the-art performance; the code is available at https://github.com/Davidjinzb/DANAA
    Abstract While deep neural networks have excellent results in many fields, they are susceptible to interference from attacking samples resulting in erroneous judgments. Feature-level attacks are one of the effective attack types, which targets the learnt features in the hidden layers to improve its transferability across different models. Yet it is observed that the transferability has been largely impacted by the neuron importance estimation results. In this paper, a double adversarial neuron attribution attack method, termed `DANAA', is proposed to obtain more accurate feature importance estimation. In our method, the model outputs are attributed to the middle layer based on an adversarial non-linear path. The goal is to measure the weight of individual neurons and retain the features that are more important towards transferability. We have conducted extensive experiments on the benchmark datasets to demonstrate the state-of-the-art performance of our method. Our code is available at: https://github.com/Davidjinzb/DANAA
    摘要 深度神经网络在多个领域取得了出色的成绩,但它们受到攻击样本的干扰,导致评判结果错误。攻击样本是一种有效的攻击类型,targets the learnt features in the hidden layers to improve its transferability across different models。然而,我们发现,通过neuron importance estimation结果,攻击样本的传播性受到了很大的影响。在这篇论文中,我们提出了一种double adversarial neuron attribution attack方法,称为`DANAA'。我们的方法基于一个 adversarial non-linear path,将模型输出归结到中层。我们的目标是测量个体神经元的重要性,保留对传播性更重要的特征。我们在标准 benchmark datasets 上进行了广泛的实验,以示出我们的方法的state-of-the-art性。我们的代码可以在:https://github.com/Davidjinzb/DANAA 中找到。

A Novel Benchmarking Paradigm and a Scale- and Motion-Aware Model for Egocentric Pedestrian Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2310.10424
  • repo_url: None
  • paper_authors: Amir Rasouli
  • for: Improving egocentric pedestrian trajectory prediction, one of the main challenges for intelligent driving systems.
  • methods: A new scenario-based evaluation paradigm that extracts meaningful driving scenarios from contextual information together with a more effective ranking metric, plus a scale- and motion-aware prediction model that fuses multimodal data in a step-wise hierarchical fashion and uses two auxiliary tasks to learn more robust representations of scene dynamics.
  • results: Extensive empirical studies expose the shortcomings and strengths of existing models across scenarios, and the proposed model improves on past approaches by up to 40% in challenging scenarios on common benchmark datasets.
    Abstract Predicting pedestrian behavior is one of the main challenges for intelligent driving systems. In this paper, we present a new paradigm for evaluating egocentric pedestrian trajectory prediction algorithms. Based on various contextual information, we extract driving scenarios for a meaningful and systematic approach to identifying challenges for prediction models. In this regard, we also propose a new metric for more effective ranking within the scenario-based evaluation. We conduct extensive empirical studies of existing models on these scenarios to expose shortcomings and strengths of different approaches. The scenario-based analysis highlights the importance of using multimodal sources of information and challenges caused by inadequate modeling of ego-motion and scale of pedestrians. To this end, we propose a novel egocentric trajectory prediction model that benefits from multimodal sources of data fused in an effective and efficient step-wise hierarchical fashion and two auxiliary tasks designed to learn more robust representation of scene dynamics. We show that our approach achieves significant improvement by up to 40% in challenging scenarios compared to the past arts via empirical evaluation on common benchmark datasets.
    摘要 预测行人行为是智能驾驶系统中的一个主要挑战。在这篇论文中,我们提出了一种新的评估 egocentric 行人轨迹预测算法的新模式。基于多种情境信息,我们提取了 meaningful 和系统的驾驶场景,以便更好地识别预测模型的挑战。在这个 regard,我们也提出了一个更有效的排名 metric。我们对现有模型进行了广泛的实证研究,以暴露不同方法的缺陷和优势。场景基本分析表明,使用多modal 信息和 egocentric 行人的不充分模型和比例会导致预测困难。为此,我们提出了一种新的 egocentric 轨迹预测模型,该模型利用多modal 数据的综合和有效的步骤式层次结构,以及两个辅助任务,以学习更加稳定的Scene 动力学。我们的方法在复杂的场景下比过去的艺术品 achieved 40% 的提升,via 实验评估常见的 benchmark 数据集。

YOLOv7 for Mosquito Breeding Grounds Detection and Tracking

  • paper_url: http://arxiv.org/abs/2310.10423
  • repo_url: None
  • paper_authors: Camila Laranjeira, Daniel Andrade, Jefersson A. dos Santos
  • for: Helping control the spread of Aedes Aegypti, the transmission vector of neglected tropical diseases such as dengue, zika, and chikungunya, whose threat may grow with climate change.
  • methods: Leverages YOLOv7, a state-of-the-art and computationally efficient detector, to localize and track mosquito breeding sites in videos captured by unmanned aerial vehicles, so that local entities can properly intervene.
  • results: On the dataset released for the ICIP 2023 grand challenge on Automatic Detection of Mosquito Breeding Grounds, YOLOv7 can be directly applied to detect larger foci categories such as pools, tires, and water tanks, and a cheap, straightforward aggregation of frame-by-frame detections incorporates time consistency into the tracking process.
    Abstract With the looming threat of climate change, neglected tropical diseases such as dengue, zika, and chikungunya have the potential to become an even greater global concern. Remote sensing technologies can aid in controlling the spread of Aedes Aegypti, the transmission vector of such diseases, by automating the detection and mapping of mosquito breeding sites, such that local entities can properly intervene. In this work, we leverage YOLOv7, a state-of-the-art and computationally efficient detection approach, to localize and track mosquito foci in videos captured by unmanned aerial vehicles. We experiment on a dataset released to the public as part of the ICIP 2023 grand challenge entitled Automatic Detection of Mosquito Breeding Grounds. We show that YOLOv7 can be directly applied to detect larger foci categories such as pools, tires, and water tanks and that a cheap and straightforward aggregation of frame-by-frame detection can incorporate time consistency into the tracking process.
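A cheap frame-by-frame aggregation of detections can be as simple as greedily linking boxes across consecutive frames by IoU and keeping tracks that persist for several frames. The sketch below illustrates that idea under assumed thresholds; the challenge entry may use a different rule.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def link_detections(frames, iou_thr=0.3, min_length=3):
    """Greedily link per-frame detections into tracks by IoU with the previous box.

    frames: list over time, each a list of (x1, y1, x2, y2) boxes from the detector.
    Returns a list of tracks, each a list of (frame_idx, box). A minimal sketch of
    frame-by-frame aggregation, not the exact rule used in the paper.
    """
    tracks = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for track in tracks:
            last_t, last_box = track[-1]
            if last_t != t - 1 or not unmatched:
                continue
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= iou_thr:
                track.append((t, best))
                unmatched.remove(best)
        for box in unmatched:
            tracks.append([(t, box)])
    # Keeping only tracks seen in several frames suppresses spurious single-frame hits.
    return [tr for tr in tracks if len(tr) >= min_length]
```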

LMT: Longitudinal Mixing Training, a Framework to Predict Disease Progression from a Single Image

  • paper_url: http://arxiv.org/abs/2310.10420
  • repo_url: None
  • paper_authors: Rachid Zeghlache, Pierre-Henri Conze, Mostafa El Habib Daho, Yihao Li, Hugo Le boite, Ramin Tadayoni, Pascal Massin, Béatrice Cochener, Ikram Brahim, Gwenolé Quellec, Mathieu Lamard
  • for: Detecting and predicting the progression of Diabetic Retinopathy (DR) from longitudinal imaging, using only a single image at inference time.
  • methods: Longitudinal Mixing Training (LMT), which combines mix-up training, longitudinal pretext tasks, and Neural Ordinary Differential Equations (NODE); LMT acts both as a regularizer and as a pretext task encoding disease progression in the latent space, and introduces t_mix, a weighted average time between two consecutive examinations, for time-aware training.
  • results: On the OPHDIAT longitudinal retinal Color Fundus Photographs (CFP) dataset, the method predicts from a single image whether an eye will develop severe DR at the following visit with an AUC of 0.798, compared to a baseline of 0.641.
    Abstract Longitudinal imaging is able to capture both static anatomical structures and dynamic changes in disease progression toward earlier and better patient-specific pathology management. However, conventional approaches rarely take advantage of longitudinal information for detection and prediction purposes, especially for Diabetic Retinopathy (DR). In the past years, Mix-up training and pretext tasks with longitudinal context have effectively enhanced DR classification results and captured disease progression. In the meantime, a novel type of neural network named Neural Ordinary Differential Equation (NODE) has been proposed for solving ordinary differential equations, with a neural network treated as a black box. By definition, NODE is well suited for solving time-related problems. In this paper, we propose to combine these three aspects to detect and predict DR progression. Our framework, Longitudinal Mixing Training (LMT), can be considered both as a regularizer and as a pretext task that encodes the disease progression in the latent space. Additionally, we evaluate the trained model weights on a downstream task with a longitudinal context using standard and longitudinal pretext tasks. We introduce a new way to train time-aware models using $t_{mix}$, a weighted average time between two consecutive examinations. We compare our approach to standard mixing training on DR classification using OPHDIAT a longitudinal retinal Color Fundus Photographs (CFP) dataset. We were able to predict whether an eye would develop a severe DR in the following visit using a single image, with an AUC of 0.798 compared to baseline results of 0.641. Our results indicate that our longitudinal pretext task can learn the progression of DR disease and that introducing $t_{mix}$ augmentation is beneficial for time-aware models.
    摘要 纵向成像可以捕捉到稳定的解剖结构以及疾病进程的动态变化,从而提供更早和更好的病理管理。然而,传统方法很少利用纵向信息进行检测和预测,特别是对于糖尿病视网膜病变(DR)。在过去几年,混合训练和带有纵向上下文的预文任务有效提高了DR分类结果,并捕捉了疾病进程。同时,一种新的神经网络模型名为神经常微方程(NODE)已经被提出,可以解决常微方程问题。根据定义,NODE适合解决时间相关的问题。在这篇文章中,我们提议将这三个方面结合,以检测和预测DR疾病进程。我们的框架 Longitudinal Mixing Training(LMT)既可以被视为一种正则化,也可以被视为一种在潜在空间中编码疾病进程的预文任务。此外,我们使用标准和纵向预文任务,在具有纵向上下文的下游任务上评估了训练得到的模型权重。我们还介绍了一种新的时间感知训练方法,使用 $t_{mix}$,即两次连续检查之间的加权平均时间。我们在 OPHDIAT 纵向视网膜彩色眼底照片(CFP)数据集上,将我们的方法与标准混合训练在DR分类中的表现进行了比较。我们能够使用单张图像预测眼睛是否会在下一次就诊时发展为严重的DR,AUC为0.798,而基线结果为0.641。我们的结果表明,我们的纵向预文任务可以学习DR疾病的进程,并且引入 $t_{mix}$ 增强对时间感知模型有利。
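The t_mix idea can be illustrated in a few lines: two consecutive examinations of the same eye are mixed with a mix-up coefficient, and the supervision time becomes the correspondingly weighted average of the two examination times. The sketch below is a minimal illustration; the loss terms, the NODE backbone, and the exact sampling of the coefficient follow the paper, not this snippet.

```python
import numpy as np

def longitudinal_mix(img_t1, img_t2, time_t1, time_t2, alpha=0.4):
    """Mix two consecutive examinations of the same patient.

    Returns the mixed image and t_mix, the weighted average time between the two
    examinations, so a time-aware model can be supervised at an intermediate point
    of the disease trajectory.
    """
    lam = np.random.beta(alpha, alpha)          # standard mix-up coefficient
    mixed = lam * img_t1 + (1.0 - lam) * img_t2
    t_mix = lam * time_t1 + (1.0 - lam) * time_t2
    return mixed, t_mix

# Usage: feed (mixed, t_mix) to the time-aware encoder and supervise it to be
# consistent with an interpolated state between the two visits.
```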

Prior-Free Continual Learning with Unlabeled Data in the Wild

  • paper_url: http://arxiv.org/abs/2310.10417
  • repo_url: https://github.com/visiontao/pfcl
  • paper_authors: Tao Zhuo, Zhiyong Cheng, Hehe Fan, Mohan Kankanhalli
  • for: Addressing forgetting in continual learning (CL) when task priors (task identity or previously seen samples) are unknown, as is common in real-world applications.
  • methods: Prior-Free Continual Learning (PFCL): a fixed single-head architecture removes the need for task identity when selecting the output head, and a regularization-based strategy keeps predictions consistent between the new and old models without revisiting previous samples; since this alone performs poorly in class-incremental scenarios with long task sequences, model consistency is further enhanced with an auxiliary unlabeled dataset combined with a reliable sample selection strategy.
  • results: Extensive experiments on multiple image classification benchmarks show that PFCL significantly mitigates forgetting in all three learning scenarios and achieves accuracy competitive with recent rehearsal-based methods that replay a limited number of previous samples.
    Abstract Continual Learning (CL) aims to incrementally update a trained model on new tasks without forgetting the acquired knowledge of old ones. Existing CL methods usually reduce forgetting with task priors, \ie using task identity or a subset of previously seen samples for model training. However, these methods would be infeasible when such priors are unknown in real-world applications. To address this fundamental but seldom-studied problem, we propose a Prior-Free Continual Learning (PFCL) method, which learns new tasks without knowing the task identity or any previous data. First, based on a fixed single-head architecture, we eliminate the need for task identity to select the task-specific output head. Second, we employ a regularization-based strategy for consistent predictions between the new and old models, avoiding revisiting previous samples. However, using this strategy alone often performs poorly in class-incremental scenarios, particularly for a long sequence of tasks. By analyzing the effectiveness and limitations of conventional regularization-based methods, we propose enhancing model consistency with an auxiliary unlabeled dataset additionally. Moreover, since some auxiliary data may degrade the performance, we further develop a reliable sample selection strategy to obtain consistent performance improvement. Extensive experiments on multiple image classification benchmark datasets show that our PFCL method significantly mitigates forgetting in all three learning scenarios. Furthermore, when compared to the most recent rehearsal-based methods that replay a limited number of previous samples, PFCL achieves competitive accuracy. Our code is available at: https://github.com/visiontao/pfcl
    摘要 Our approach has two key components. First, we eliminate the need for task identity to select the task-specific output head using a fixed single-head architecture. Second, we employ a regularization-based strategy for consistent predictions between the new and old models, avoiding revisiting previous samples. However, this strategy alone can perform poorly in class-incremental scenarios, so we also use an auxiliary unlabeled dataset to enhance model consistency.To ensure reliable performance improvement, we develop a sample selection strategy to choose the most informative samples from the auxiliary dataset. Our approach significantly mitigates forgetting in all three learning scenarios, and achieves competitive accuracy compared to rehearsal-based methods that replay a limited number of previous samples.Our code is available at: https://github.com/visiontao/pfcl.
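A minimal sketch of the regularization-based consistency idea on auxiliary unlabeled data is shown below, assuming a single-head classifier and a temperature-scaled KL term; the paper's reliable sample selection step is omitted.

```python
import torch
import torch.nn.functional as F

def consistency_loss(new_model, old_model, aux_images, temperature=2.0):
    """Regularize the new model to stay consistent with the frozen old model.

    aux_images: a batch of auxiliary unlabeled images (no task identity, no labels).
    A sketch of the regularization-based strategy under assumed hyper-parameters.
    """
    with torch.no_grad():
        old_logits = old_model(aux_images)
    new_logits = new_model(aux_images)
    old_prob = F.softmax(old_logits / temperature, dim=1)
    new_logprob = F.log_softmax(new_logits / temperature, dim=1)
    # KL(old || new): penalize drift of the new model's predictions on unlabeled data.
    return F.kl_div(new_logprob, old_prob, reduction="batchmean") * temperature ** 2

# Per training step: cross-entropy on the current task's labeled batch
# plus a weighted consistency_loss on the auxiliary unlabeled batch.
```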

Style transfer between Microscopy and Magnetic Resonance Imaging via Generative Adversarial Network in small sample size settings

  • paper_url: http://arxiv.org/abs/2310.10414
  • repo_url: None
  • paper_authors: Monika Pytlarz, Adrian Onicas, Alessandro Crimi
  • for: Translating MRI scans into microscopic histological images of the same tissue with a conditional GAN, so that histopathological analysis becomes possible without an invasive biopsy procedure.
  • methods: A conditional generative adversarial network (cGAN) architecture trained as paired image translation models on sets of MRI scans and microscopy images, in a small sample size setting.
  • results: The framework reliably synthesizes histology images from MRI scans of the corpus callosum, demonstrating that the network can train on high-resolution histologies paired with relatively lower-resolution MRI scans; with the ultimate goal of avoiding biopsies, the tool can be used for educational purposes.
    Abstract Cross-modal augmentation of Magnetic Resonance Imaging (MRI) and microscopic imaging based on the same tissue samples is promising because it can allow histopathological analysis in the absence of an underlying invasive biopsy procedure. Here, we tested a method for generating microscopic histological images from MRI scans of the corpus callosum using conditional generative adversarial network (cGAN) architecture. To our knowledge, this is the first multimodal translation of the brain MRI to histological volumetric representation of the same sample. The technique was assessed by training paired image translation models taking sets of images from MRI scans and microscopy. The use of cGAN for this purpose is challenging because microscopy images are large in size and typically have low sample availability. The current work demonstrates that the framework reliably synthesizes histology images from MRI scans of corpus callosum, emphasizing the network's ability to train on high resolution histologies paired with relatively lower-resolution MRI scans. With the ultimate goal of avoiding biopsies, the proposed tool can be used for educational purposes.
    摘要 通过同一个组织样本的跨模态扩充,核磁共振成像(MRI)和显微成像之间的同化是有前途的,因为它可以允许在不进行侵入性活检的情况下进行 histopathological 分析。我们在这里测试了一种方法,用于从 MRI 扫描中生成显微组织学图像。根据我们所知,这是第一种将 brain MRI 跨模态翻译为同一样本的组织学三维表示的方法。这种技术通过训练成对的图像翻译模型来进行评估。使用 cGAN 进行这种目的是挑战性的,因为显微图像通常很大,并且样本的可用性很低。当前的研究表明,该框架可靠地将 corpus callosum 的 MRI 扫描转换为组织学图像,强调了网络在高分辨率组织学图像与相对较低分辨率 MRI 扫描配对上进行训练的能力。以避免活检为最终目的,提出的工具可以用于教育用途。

Image super-resolution via dynamic network

  • paper_url: http://arxiv.org/abs/2310.10413
  • repo_url: https://github.com/hellloxiaotian/dsrnet
  • paper_authors: Chunwei Tian, Xuanyu Zhang, Qi Zhang, Mingming Yang, Zhaojie Ju
  • for: A dynamic network for image super-resolution (DSRNet) that improves accuracy and applicability for complex scenes.
  • methods: Combines a residual enhancement block for hierarchical features, a wide enhancement block whose dynamic architecture learns more robust information for varying scenes, a feature refinement block with a stacked architecture and residual learning to avoid long-term dependency problems, and a construction block that reconstructs the high-quality image.
  • results: The heterogeneous yet lightweight design is competitive in performance, recovery time, and complexity, making it suitable for mobile digital devices.
    Abstract Convolutional neural networks (CNNs) depend on deep network architectures to extract accurate information for image super-resolution. However, obtained information of these CNNs cannot completely express predicted high-quality images for complex scenes. In this paper, we present a dynamic network for image super-resolution (DSRNet), which contains a residual enhancement block, wide enhancement block, feature refinement block and construction block. The residual enhancement block is composed of a residual enhanced architecture to facilitate hierarchical features for image super-resolution. To enhance robustness of obtained super-resolution model for complex scenes, a wide enhancement block achieves a dynamic architecture to learn more robust information to enhance applicability of an obtained super-resolution model for varying scenes. To prevent interference of components in a wide enhancement block, a refinement block utilizes a stacked architecture to accurately learn obtained features. Also, a residual learning operation is embedded in the refinement block to prevent long-term dependency problem. Finally, a construction block is responsible for reconstructing high-quality images. Designed heterogeneous architecture can not only facilitate richer structural information, but also be lightweight, which is suitable for mobile digital devices. Experimental results shows that our method is more competitive in terms of performance and recovering time of image super-resolution and complexity. The code of DSRNet can be obtained at https://github.com/hellloxiaotian/DSRNet.
    摘要 convolutional neural networks (CNNs) 依靠深度网络架构来提取图像超分解中的准确信息。然而,这些 CNNs 所获取的信息无法完全表达预测的高质量图像 для复杂场景。在这篇论文中,我们提出了动态网络 для图像超分解 (DSRNet),它包含了差异增强块、宽增强块、特征细化块和结构块。差异增强块由一个差异增强架构组成,以便在图像超分解中提高层次特征。为了提高获取的超分解模型在复杂场景中的应用 robustness,宽增强块实现了一个动态架构,以学习更加Robust的信息以提高获取的超分解模型的可用性。为了避免各个组件之间的干扰,特征细化块使用了堆叠结构来准确地学习获取的特征。此外,在细化块中还包含了循环学习操作,以避免长期依赖问题。最后,结构块负责重建高质量图像。设计的不同化架构不仅可以提供更加丰富的结构信息,还可以减轻计算负担,适用于移动式数字设备。实验结果表明,我们的方法在性能和图像重建时间方面更加竞争力,同时也更加复杂。DSRNet 的代码可以在 https://github.com/hellloxiaotian/DSRNet 上获取。

Loci-Segmented: Improving Scene Segmentation Learning

  • paper_url: http://arxiv.org/abs/2310.10410
  • repo_url: https://github.com/CognitiveModeling/Loci-Segmented
  • paper_authors: Manuel Traub, Frederic Becker, Adrian Sauter, Sebastian Otte, Martin V. Butz
  • for: Improving scene segmentation learning within slot-oriented compositional scene representation.
  • methods: Loci-Segmented (Loci-s) extends the slot-based location and identity tracking architecture Loci (Traub et al., ICLR 2023) with (i) a pre-trained dynamic background module, (ii) a hyper-convolution encoder module for object-focused bottom-up processing, and (iii) a cascaded decoder that successively generates object masks, masked depth maps, and masked, depth-map-informed RGB reconstructions; performance is further improved by integrating depth information, slot-location-entity regularization, and a prior segmentation network.
  • results: Loci-s shows superior segmentation performance on the MOVi datasets and another established dataset collection, with a 32% better intersection over union (IoU) score in MOVi-E than the previous best, and produces well-interpretable latent representations that may serve as a foundation-model-like basis for downstream tasks such as grounding language and context- and goal-conditioned event processing.
    Abstract Slot-oriented processing approaches for compositional scene representation have recently undergone a tremendous development. We present Loci-Segmented (Loci-s), an advanced scene segmentation neural network that extends the slot-based location and identity tracking architecture Loci (Traub et al., ICLR 2023). The main advancements are (i) the addition of a pre-trained dynamic background module; (ii) a hyper-convolution encoder module, which enables object-focused bottom-up processing; and (iii) a cascaded decoder module, which successively generates object masks, masked depth maps, and masked, depth-map-informed RGB reconstructions. The background module features the learning of both a foreground identifying module and a background re-generator. We further improve performance via (a) the integration of depth information as well as improved slot assignments via (b) slot-location-entity regularization and (b) a prior segmentation network. Even without these latter improvements, the results reveal superior segmentation performance in the MOVi datasets and in another established dataset collection. With all improvements, Loci-s achieves a 32% better intersection over union (IoU) score in MOVi-E than the previous best. We furthermore show that Loci-s generates well-interpretable latent representations. We believe that these representations may serve as a foundation-model-like interpretable basis for solving downstream tasks, such as grounding language and context- and goal-conditioned event processing.
    摘要 各种槽处理方法在 compositional scene representation 领域受到了非常大的发展。我们现在提出了 Loci-Segmented(Loci-s),这是一种高级的Scene segmentation neural network,它扩展了 Loci(Traub et al., ICLR 2023)槽基 Architecture。主要改进包括:(i) 添加了预训练的动态背景模块;(ii) 使用了对象专注的凹陷 Encoder 模块,以便从底层处进行对象特征提取;(iii) 使用了级联的解码模块,以顺序生成对象面积、掩码depth maps和掩码、 depth-map-informed RGB 重建。背景模块包括学习both foreground 特征和背景重建。我们进一步提高性能通过:(a) интеграción of depth information以及改进的槽分配via(b) slot-location-entity regularization和(b) 一个前 segmentation network。无论这些改进,Loci-s 在 MOVi 数据集和另一个已知数据集中显示出色的 segmentation 性能。通过所有改进,Loci-s 在 MOVi-E 中实现了与之前最佳的32%的交集 над union(IoU)分数提高。我们进一步表明Loci-s 生成的latent representations是可解释的。我们认为这些表示可以作为基础模型的可解释基础,用于解决下游任务,如语言基础和上下文-和目标conditioned事件处理。

A cross Transformer for image denoising

  • paper_url: http://arxiv.org/abs/2310.10408
  • repo_url: https://github.com/hellloxiaotian/ctnet
  • paper_authors: Chunwei Tian, Menghua Zheng, Wangmeng Zuo, Shichao Zhang, Yanning Zhang, Chia-Wen Ling
  • for: Improving image denoising for complex scenes.
  • methods: A cross Transformer denoising CNN (CTNet) with a serial block (SB) that deeply searches structural information, a parallel block (PB) of three heterogeneous networks enabling multi-level feature interactions, and a residual block (RB); Transformer mechanisms embedded in the SB and PB extract complementary salient features based on pixel relations.
  • results: CTNet outperforms several popular denoising methods on both real and synthetic image denoising, and is suitable for mobile digital devices such as phones.
    Abstract Deep convolutional neural networks (CNNs) depend on feedforward and feedback ways to obtain good performance in image denoising. However, how to obtain effective structural information via CNNs to efficiently represent given noisy images is key for complex scenes. In this paper, we propose a cross Transformer denoising CNN (CTNet) with a serial block (SB), a parallel block (PB), and a residual block (RB) to obtain clean images for complex scenes. A SB uses an enhanced residual architecture to deeply search structural information for image denoising. To avoid loss of key information, PB uses three heterogeneous networks to implement multiple interactions of multi-level features to broadly search for extra information for improving the adaptability of an obtained denoiser for complex scenes. Also, to improve denoising performance, Transformer mechanisms are embedded into the SB and PB to extract complementary salient features for effectively removing noise in terms of pixel relations. Finally, a RB is applied to acquire clean images. Experiments illustrate that our CTNet is superior to some popular denoising methods in terms of real and synthetic image denoising. It is suitable to mobile digital devices, i.e., phones. Codes can be obtained at https://github.com/hellloxiaotian/CTNet.
    摘要 深度卷积神经网络 (CNN) 在图像噪声除除针对 feedforward 和反馈方式以获得好的表现。然而,如何通过 CNN 获得有效的结构信息,以有效地表示给定的噪声图像是关键问题。在这篇论文中,我们提出了一种跨Transformer混合卷积神经网络 (CTNet),包括序列块 (SB)、平行块 (PB) 和差异块 (RB),以获取复杂场景中的干净图像。SB 使用增强的剩余架构,深入搜索图像噪声除除中的结构信息。为了避免关键信息损失,PB 使用三种不同的网络来实现多种交互,以广泛搜索更多的信息,以提高取得的噪声除除器的适应性。此外,为了提高噪声除除性能,SB 和 PB 中包含了Transformer机制,以EXTRACT complementary salient features,以有效地除除图像层次关系中的噪声。最后,RB 用于获取干净图像。实验表明,我们的 CTNet 在实际和 sintetic 图像噪声除除方面表现优于一些流行的噪声除除方法。它适用于移动设备,如手机。代码可以在 https://github.com/hellloxiaotian/CTNet obtener。

LLM4SGG: Large Language Model for Weakly Supervised Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2310.10404
  • repo_url: https://github.com/rlqja1107/torch-LLM4SGG
  • paper_authors: Kibum Kim, Kanghoon Yoon, Jaehyeong Jeon, Yeonjun In, Jinyoung Moon, Donghyun Kim, Chanyoung Park
  • for: 本研究旨在提出一种新的、基于语言模型的弱监督Scene Graph生成方法(LLM4SGG),以解决现有WSSGG方法中的两个问题:1)语义过度简化问题,2)低密度场景图问题。
  • methods: 我们提出一种新的方法,即使用语言模型的语言理解和推理能力来提取caption中的 triplets,并将entity/ predicate类与目标数据进行对齐。为了更好地利用语言模型,我们采用了链式思维和在Context few-shot learning策略。
  • results: 我们在Visual Genome和GQA datasets上进行了广泛的实验,并显示了与现有WSSGG方法相比的显著提高,包括Recall@K和mean Recall@K的提高。此外,LLM4SGG还具有数据效率的优势,可以通过小量的训练图像进行效果iveness的模型训练。
    Abstract Weakly-Supervised Scene Graph Generation (WSSGG) research has recently emerged as an alternative to the fully-supervised approach that heavily relies on costly annotations. In this regard, studies on WSSGG have utilized image captions to obtain unlocalized triplets while primarily focusing on grounding the unlocalized triplets over image regions. However, they have overlooked the two issues involved in the triplet formation process from the captions: 1) Semantic over-simplification issue arises when extracting triplets from captions, where fine-grained predicates in captions are undesirably converted into coarse-grained predicates, resulting in a long-tailed predicate distribution, and 2) Low-density scene graph issue arises when aligning the triplets in the caption with entity/predicate classes of interest, where many triplets are discarded and not used in training, leading to insufficient supervision. To tackle the two issues, we propose a new approach, i.e., Large Language Model for weakly-supervised SGG (LLM4SGG), where we mitigate the two issues by leveraging the LLM's in-depth understanding of language and reasoning ability during the extraction of triplets from captions and alignment of entity/predicate classes with target data. To further engage the LLM in these processes, we adopt the idea of Chain-of-Thought and the in-context few-shot learning strategy. To validate the effectiveness of LLM4SGG, we conduct extensive experiments on Visual Genome and GQA datasets, showing significant improvements in both Recall@K and mean Recall@K compared to the state-of-the-art WSSGG methods. A further appeal is that LLM4SGG is data-efficient, enabling effective model training with a small amount of training images.
    摘要 弱监督场景图生成(WSSGG)研究最近几年来得到了更多的关注,它作为完全监督的方法的替代方案,减少了成本的注释。在这个 regard,研究者们通过图文来获取不地址 triplets,主要是将图文中的不地址 triplets与图像区域相对应。然而,他们忽略了图文中 triplet 形成过程中的两个问题:1)语义过于简化问题,图文中的细腻 predicate 被不必要地转换为粗略 predicate,导致 predicate 的分布呈长尾形态;2)图像区域对应问题,在将 triplets 与 interesset class 对应时,多个 triplets 被抛弃,导致训练不充分。为解决这两个问题,我们提出了一种新的方法,即大语言模型 для 弱监督 SGG(LLM4SGG)。我们通过利用 LLM 的深刻语言理解和逻辑能力来缓解这两个问题。为了更好地利用 LLM,我们采用了链条思想和在 Context 中几招学习策略。为验证 LLM4SGG 的有效性,我们对 Visual Genome 和 GQA 数据集进行了广泛的实验,并显示了与州chart 方法相比的显著改善。此外,LLM4SGG 具有数据效率的特点,可以在小量训练图像上进行有效的模型训练。
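A rough sketch of the triplet extraction step is shown below: the caption is embedded in a prompt with an in-context example and a step-by-step reasoning instruction, and the model's final "Triplets:" line is parsed. The prompt wording, the example, and the `call_llm` placeholder are assumptions for illustration, not the paper's actual prompts.

```python
FEW_SHOT_EXAMPLE = (
    'Caption: "a man riding a horse on the beach"\n'
    "Reasoning: the caption mentions a man, a horse and a beach; the man performs riding on the horse.\n"
    "Triplets: (man, riding, horse), (horse, on, beach)\n"
)

def build_prompt(caption, entity_classes, predicate_classes):
    """Chain-of-thought style prompt with one in-context example (illustrative)."""
    return (
        "Extract scene graph triplets from the caption. Think step by step.\n"
        f"Use only entity classes {entity_classes} and predicate classes {predicate_classes}; "
        "map fine-grained words to the closest allowed class instead of dropping them.\n\n"
        + FEW_SHOT_EXAMPLE
        + f'\nCaption: "{caption}"\nReasoning:'
    )

def extract_triplets(caption, entity_classes, predicate_classes, call_llm):
    """call_llm is a placeholder for whatever chat-completion API is used."""
    answer = call_llm(build_prompt(caption, entity_classes, predicate_classes))
    triplet_lines = [l for l in answer.splitlines() if l.strip().startswith("Triplets:")]
    if not triplet_lines:
        return []
    body = triplet_lines[-1].split("Triplets:", 1)[1]
    groups = [g.strip(" ,") for g in body.replace("(", "").split(")")]
    return [tuple(part.strip() for part in g.split(",")) for g in groups if g]
```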

Enhanced Edge-Perceptual Guided Image Filtering

  • paper_url: http://arxiv.org/abs/2310.10387
  • repo_url: None
  • paper_authors: Jinyu Li
  • for: Addressing halo artefacts and the degradation of edge-preserving ability in guided image filtering, especially when the guidance and input images have inconsistent structures.
  • methods: A novel guided image filter that integrates an explicit first-order edge-protect constraint and an explicit residual constraint to improve edge preservation in both cases.
  • results: Theoretical analysis and experiments on single image detail enhancement, multi-scale exposure fusion, and hyperspectral image classification demonstrate the filter's powerful edge-preserving ability.
    Abstract Due to the powerful edge-preserving ability and low computational complexity, Guided image filter (GIF) and its improved versions has been widely applied in computer vision and image processing. However, all of them are suffered halo artifacts to some degree, as the regularization parameter increase. In the case of inconsistent structure of guidance image and input image, edge-preserving ability degradation will also happen. In this paper, a novel guided image filter is proposed by integrating an explicit first-order edge-protect constraint and an explicit residual constraint which will improve the edge-preserving ability in both cases. To illustrate the efficiency of the proposed filter, the performances are shown in some typical applications, which are single image detail enhancement, multi-scale exposure fusion, hyper spectral images classification. Both theoretical analysis and experimental results prove that the powerful edge-preserving ability of the proposed filter.
    摘要 因为导引图像过滤器(GIF)和其改进版本具有强大的边缘保持能力和低计算复杂性,因此在计算机视觉和图像处理领域得到广泛应用。然而,所有它们都受到一定程度的尘埃畸 artifacts的困扰,随着规则化参数的增加。在指导图像和输入图像的结构不一致的情况下,边缘保持能力也会降低。在这篇论文中,一种新的导引图像过滤器被提出,通过加入显式的第一阶边缘保护约束和显式的差分约束,以提高边缘保持能力。为了证明提案的效果,在一些典型应用中进行了表现,包括单图像细节增强、多比例曝光融合和多spectral图像分类。 Both theoretical analysis and experimental results prove that the proposed filter has powerful edge-preserving ability.
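For context, the sketch below implements the baseline guided image filter (He et al.) that the proposed constraints build on; the paper's explicit first-order edge-protect and residual constraints are not reproduced here.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(I, p, radius=8, eps=1e-3):
    """Baseline guided image filter for single-channel guidance I and input p.

    I and p are float arrays in [0, 1] of the same shape. Larger eps smooths more
    aggressively, which is where halo artefacts tend to appear.
    """
    size = 2 * radius + 1
    mean = lambda x: uniform_filter(x, size=size, mode="reflect")  # box filter

    mean_I, mean_p = mean(I), mean(p)
    corr_I, corr_Ip = mean(I * I), mean(I * p)
    var_I = corr_I - mean_I * mean_I
    cov_Ip = corr_Ip - mean_I * mean_p

    a = cov_Ip / (var_I + eps)          # local linear model: q = a * I + b
    b = mean_p - a * mean_I
    return mean(a) * I + mean(b)        # average coefficients over overlapping windows

# Usage: q = guided_filter(guidance, noisy_input, radius=8, eps=1e-3)
```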

Looping LOCI: Developing Object Permanence from Videos

  • paper_url: http://arxiv.org/abs/2310.10372
  • repo_url: None
  • paper_authors: Manuel Traub, Frederic Becker, Sebastian Otte, Martin V. Butz
  • for: Advancing compositional scene representation learning so that models handle occluded objects and intuitive physics tests, progressively developing object permanence from videos in a manner akin to infant development.
  • methods: Loci-Looped extends the unsupervised object location, identification, and tracking architecture Loci (Traub et al., ICLR 2023) with an internal processing loop that adaptively blends pixel-space information with anticipations into information-fused percepts, and learns compositional representations of individual object dynamics and between-objects interaction dynamics.
  • results: Loci-Looped tracks objects through extended occlusions, simulating their hidden trajectories and anticipating their reappearance without an explicit history buffer; it surpasses state-of-the-art models on the ADEPT and CLEVRER datasets under object occlusions or temporary sensory interruptions, indicating that it learns object permanence and inertia in a fully unsupervised, emergent manner.
    Abstract Recent compositional scene representation learning models have become remarkably good in segmenting and tracking distinct objects within visual scenes. Yet, many of these models require that objects are continuously, at least partially, visible. Moreover, they tend to fail on intuitive physics tests, which infants learn to solve over the first months of their life. Our goal is to advance compositional scene representation algorithms with an embedded algorithm that fosters the progressive learning of intuitive physics, akin to infant development. As a fundamental component for such an algorithm, we introduce Loci-Looped, which advances a recently published unsupervised object location, identification, and tracking neural network architecture (Loci, Traub et al., ICLR 2023) with an internal processing loop. The loop is designed to adaptively blend pixel-space information with anticipations yielding information-fused activities as percepts. Moreover, it is designed to learn compositional representations of both individual object dynamics and between-objects interaction dynamics. We show that Loci-Looped learns to track objects through extended periods of object occlusions, indeed simulating their hidden trajectories and anticipating their reappearance, without the need for an explicit history buffer. We even find that Loci-Looped surpasses state-of-the-art models on the ADEPT and the CLEVRER dataset, when confronted with object occlusions or temporary sensory data interruptions. This indicates that Loci-Looped is able to learn the physical concepts of object permanence and inertia in a fully unsupervised emergent manner. We believe that even further architectural advancements of the internal loop - also in other compositional scene representation learning models - can be developed in the near future.
    摘要 现代场景表示学习模型已经很出色地 segmenting 和跟踪视场中的不同对象。然而,许多这些模型需要对象在视场中保持不间断的可见性。此外,它们在直觉物理测试中表现不佳,这与婴儿在生长过程中学习的直觉物理概念相反。我们的目标是提高场景表示算法,其中包括一个内置的算法,以便逐步学习直觉物理概念,类似于婴儿的发展。为此,我们引入了Loci-Looped,它是一种基于Loci(Traub et al., ICLR 2023)的无监督对象位置、识别和跟踪神经网络架构,并添加了内部处理循环。这个循环通过自适应混合像素空间信息和预测得到的信息混合活动,以便学习对象动态和对象之间的交互动态。我们发现Loci-Looped可以在对象遮挡期间跟踪对象,并且可以预测遮挡物体的重返,无需显式历史缓存。此外,Loci-Looped在ADEPT和CLEVRER数据集上表现出色,even when confronted with object occlusions or temporary sensory data interruptions.这表明Loci-Looped可以在无监督下自适应学习物理概念,包括对象永恒和运动的概念。我们认为,将Loci-Looped的内部循环扩展到其他场景表示学习模型中,可能会在未来得到进一步改进。

Camera-LiDAR Fusion with Latent Contact for Place Recognition in Challenging Cross-Scenes

  • paper_url: http://arxiv.org/abs/2310.10371
  • repo_url: None
  • paper_authors: Yan Pan, Jiapeng Xie, Jiajie Wu, Bo Zhou
  • for: Place recognition in challenging cross-scene settings with perspective changes, seasonal variations, and scene transformations, where a single sensor is insufficient.
  • methods: A novel three-channel place descriptor consisting of image, point cloud, and fusion branches; the fusion branch uses a dual-stage pipeline that exploits the correlation between the two modalities with latent contacts to facilitate information interaction and fusion.
  • results: Extensive experiments on KITTI, NCLT, USVInland, and a campus dataset show the proposed descriptor to be the state-of-the-art approach, confirming its robustness and generality in challenging scenarios.
    Abstract Although significant progress has been made, achieving place recognition in environments with perspective changes, seasonal variations, and scene transformations remains challenging. Relying solely on perception information from a single sensor is insufficient to address these issues. Recognizing the complementarity between cameras and LiDAR, multi-modal fusion methods have attracted attention. To address the information waste in existing multi-modal fusion works, this paper introduces a novel three-channel place descriptor, which consists of a cascade of image, point cloud, and fusion branches. Specifically, the fusion-based branch employs a dual-stage pipeline, leveraging the correlation between the two modalities with latent contacts, thereby facilitating information interaction and fusion. Extensive experiments on the KITTI, NCLT, USVInland, and the campus dataset demonstrate that the proposed place descriptor stands as the state-of-the-art approach, confirming its robustness and generality in challenging scenarios.
    摘要 Translated into Simplified Chinese:尽管已经做出了很大的进步,但是在视角变化、季节变化和场景变换等环境下实现地点认知仍然是一个挑战。仅仅基于单一感知器的信息不够 Address these issues. 识别摄像头和激光仪的补偿性,多感知Modal Fusion方法吸引了关注。为了改进现有多感知融合方法中的信息浪费,本文提出了一种新的三通道地点描述符,包括图像、点云和融合分支。具体来说,融合分支采用了双 stage pipeline,利用两个感知器之间的相互关系,以便信息互动和融合。从 KITTI、NCLT、USVInland 和校园数据集进行了广泛的实验,confirming its robustness and generality in challenging scenarios.

Multimodal Object Query Initialization for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2310.10353
  • repo_url: None
  • paper_authors: Mathijs R. van Geerenstein, Felicia Ruppel, Klaus Dietmayer, Dariu M. Gavrila
  • for: Improving transformer-based 3D object detection by better initializing object queries from both LiDAR and camera inputs.
  • methods: EfficientQ3M, an efficient, modular, and multimodal object query initialization method, combined with a "modality-balanced" transformer decoder in which the queries can access all sensor modalities throughout the decoder.
  • results: Outperforms the state of the art in transformer-based LiDAR object detection on the competitive nuScenes benchmark, is more efficient than the available alternatives for LiDAR-camera initialization, and can be applied with any combination of sensor modalities as input, demonstrating its modularity.
    Abstract 3D object detection models that exploit both LiDAR and camera sensor features are top performers in large-scale autonomous driving benchmarks. A transformer is a popular network architecture used for this task, in which so-called object queries act as candidate objects. Initializing these object queries based on current sensor inputs is a common practice. For this, existing methods strongly rely on LiDAR data however, and do not fully exploit image features. Besides, they introduce significant latency. To overcome these limitations we propose EfficientQ3M, an efficient, modular, and multimodal solution for object query initialization for transformer-based 3D object detection models. The proposed initialization method is combined with a "modality-balanced" transformer decoder where the queries can access all sensor modalities throughout the decoder. In experiments, we outperform the state of the art in transformer-based LiDAR object detection on the competitive nuScenes benchmark and showcase the benefits of input-dependent multimodal query initialization, while being more efficient than the available alternatives for LiDAR-camera initialization. The proposed method can be applied with any combination of sensor modalities as input, demonstrating its modularity.
    摘要 三维物体探测模型,利用激光和相机感知器件特点,在大规模自动驾驶benchmark中表现出色。 transformer是一种广泛使用的网络架构,在这种情况下,被称为“对象查询”的对象被当作候选对象。现有方法通常基于现有的激光数据进行初始化,但是不充分利用图像特征。此外,它们也会增加显著的延迟。为了解决这些限制,我们提出了高效的EfficientQ3M方法,用于初始化转换器基于三维对象探测模型中的对象查询。我们的初始化方法与“多感器均衡”转换器解码器结合使用,使得查询可以在解码器中访问所有感知模式。在实验中,我们超越了现有的转换器基于LiDAR对象探测模型的状态,在competitive nuScenes benchmark上表现出色,并示出了输入具有multimodal查询初始化的优势,同时更高效于现有的LiDAR-camera初始化方法。该方法可以针对任何感知模式进行输入,表明其模块性。

Semi-Supervised Crowd Counting with Contextual Modeling: Facilitating Holistic Understanding of Crowd Scenes

  • paper_url: http://arxiv.org/abs/2310.10352
  • repo_url: https://github.com/cha15yq/MRC-Crowd
  • paper_authors: Yifei Qian, Xiaopeng Hong, Ognjen Arandjelović, Zhongliang Guo, Carl R. Donovan
  • for: Reducing the heavy annotation burden of crowd counting by exploiting unlabeled data, making models more practicable and accurate.
  • methods: A semi-supervised method based on the mean teacher framework that masks patches of unlabeled images and guides the model to predict them from holistic cues, fostering an intrinsic 'subitizing'-like capability that mirrors human cognition; a fine-grained density classification task further helps feature learning.
  • results: Achieves state-of-the-art performance, surpassing previous approaches by a large margin on challenging benchmarks such as ShanghaiTech A and UCF-QNRF, and the trained model exhibits 'subitizing'-like behavior: it accurately predicts low-density regions with only a glance while incorporating local details for high-density regions.
    Abstract To alleviate the heavy annotation burden for training a reliable crowd counting model and thus make the model more practicable and accurate by being able to benefit from more data, this paper presents a new semi-supervised method based on the mean teacher framework. When there is a scarcity of labeled data available, the model is prone to overfit local patches. Within such contexts, the conventional approach of solely improving the accuracy of local patch predictions through unlabeled data proves inadequate. Consequently, we propose a more nuanced approach: fostering the model's intrinsic 'subitizing' capability. This ability allows the model to accurately estimate the count in regions by leveraging its understanding of the crowd scenes, mirroring the human cognitive process. To achieve this goal, we apply masking on unlabeled data, guiding the model to make predictions for these masked patches based on the holistic cues. Furthermore, to help with feature learning, herein we incorporate a fine-grained density classification task. Our method is general and applicable to most existing crowd counting methods as it doesn't have strict structural or loss constraints. In addition, we observe that the model trained with our framework exhibits a 'subitizing'-like behavior. It accurately predicts low-density regions with only a 'glance', while incorporating local details to predict high-density regions. Our method achieves the state-of-the-art performance, surpassing previous approaches by a large margin on challenging benchmarks such as ShanghaiTech A and UCF-QNRF. The code is available at: https://github.com/cha15yq/MRC-Crowd.
    摘要 为了减轻人群计数模型的注释负担,从而使模型更实用和准确,这篇论文提出了一种新的半监督方法基于 Mean Teacher 框架。当缺乏标注数据时,模型容易过拟合本地块。在这种情况下,通过 solely 提高本地块预测的准确性来提高模型的性能是不够的。因此,我们提出了一种更加细化的方法:激发模型的内在 'subitizing' 能力。这种能力使模型可以通过利用人群场景的理解来准确地计算区域的人群数量,这与人类认知过程相似。为了实现这个目标,我们在无标注数据上应用掩码,使模型根据全景册筹的信息进行预测。此外,我们还在模型中添加了细化的浓度分类任务,以帮助特征学习。我们的方法是通用的,可以应用于大多数现有的人群计数方法,不具有严格的结构或损失约束。此外,我们发现模型通过我们的框架进行训练时会展现 'subitizing' 类的行为,即可以准确地计算低密度区域,只需要一个 '察看',同时 incorporate 本地细节来预测高密度区域。我们的方法在挑战性较高的标准准则 ShanghaiTech A 和 UCF-QNRF 上实现了 estado del arte 的性能,大幅超过了先前的方法。模型代码可以在 GitHub 上找到:https://github.com/cha15yq/MRC-Crowd。
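A minimal sketch of the masking-based consistency idea under the mean teacher framework is given below: the teacher sees the full unlabeled image, the student sees a patch-masked version, and the loss is applied only on the masked regions. Patch size, mask ratio, and the loss form are assumptions, and the fine-grained density classification task is omitted.

```python
import torch
import torch.nn.functional as F

def masked_consistency(student, teacher, unlabeled, patch=32, mask_ratio=0.5):
    """Mean-teacher consistency on masked crowd images (a minimal sketch).

    Assumes both models output a density map at input resolution, and that the
    image height and width are multiples of `patch`.
    """
    with torch.no_grad():
        target_density = teacher(unlabeled)                 # (B, 1, H, W) density map

    B, _, H, W = unlabeled.shape
    grid = torch.rand(B, 1, H // patch, W // patch, device=unlabeled.device) < mask_ratio
    mask = grid.float().repeat_interleave(patch, 2).repeat_interleave(patch, 3)

    student_pred = student(unlabeled * (1.0 - mask))        # masked patches are zeroed out
    # Only supervise the regions the student could not see directly, so it must
    # estimate their counts from the holistic context of the visible patches.
    return F.mse_loss(student_pred * mask, target_density * mask)
```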

ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion

  • paper_url: http://arxiv.org/abs/2310.10343
  • repo_url: https://github.com/jiayuyang/consistnet
  • paper_authors: Jiayu Yang, Ziang Cheng, Yunfei Duan, Pan Ji, Hongdong Li
  • for: 这篇论文的目的是提出一种能够生成多个不同视角的图像,同时保持3D(多视图)一致性的方法。
  • methods: 该方法基于一个多视图一致块,该块在多个单视图噪声过程中交换信息,根据多视图几何原理来协调多个单视图特征。
  • results: 该方法可以轻松地插入预训练的LDMs(卷积神经网络),不需要显式的像素对应关系或深度预测。实验显示,该方法可以在40秒内在单个A100 GPU上生成16个不同视角的图像,并且能够有效地学习3D一致性。
    Abstract Given a single image of a 3D object, this paper proposes a novel method (named ConsistNet) that is able to generate multiple images of the same object, as if seen they are captured from different viewpoints, while the 3D (multi-view) consistencies among those multiple generated images are effectively exploited. Central to our method is a multi-view consistency block which enables information exchange across multiple single-view diffusion processes based on the underlying multi-view geometry principles. ConsistNet is an extension to the standard latent diffusion model, and consists of two sub-modules: (a) a view aggregation module that unprojects multi-view features into global 3D volumes and infer consistency, and (b) a ray aggregation module that samples and aggregate 3D consistent features back to each view to enforce consistency. Our approach departs from previous methods in multi-view image generation, in that it can be easily dropped-in pre-trained LDMs without requiring explicit pixel correspondences or depth prediction. Experiments show that our method effectively learns 3D consistency over a frozen Zero123 backbone and can generate 16 surrounding views of the object within 40 seconds on a single A100 GPU. Our code will be made available on https://github.com/JiayuYANG/ConsistNet
    摘要 给定一张3D对象的单张图像,这篇论文提出了一种新的方法(名为ConsistNet),可以生成多张不同视角的图像,同时利用多视图的3D一致性。我们的方法的核心是一个多视图一致块,允许多个单视图的扩散过程之间进行信息交换,基于下面的多视图几何原理。ConsistNet是标准潜在扩散模型的扩展,包括两个子模块:(a)视图聚合模块,将多视图特征映射到全局3D体Volume并评估一致性,以及(b)光束聚合模块,从3D一致的特征样本返回到每个视图,以确保一致性。我们的方法与前一些多视图图像生成方法不同,可以直接使用预训练的LDMs,无需明确的像素匹配或深度预测。实验表明,我们的方法可以在冰zero123架构上学习3D一致性,在单个A100 GPU上生成16个对象周围视图 Within 40秒。我们的代码将在https://github.com/JiayuYANG/ConsistNet上公开。

Scene Graph Conditioning in Latent Diffusion

  • paper_url: http://arxiv.org/abs/2310.10338
  • repo_url: https://github.com/frankfundel/sgcond
  • paper_authors: Frank Fundel
  • for: Giving diffusion models finer-grained semantic control than text prompts alone by conditioning image generation on scene graphs, which represent image content more precisely.
  • methods: Multiple approaches based on ControlNet and Gated Self-Attention for conditioning large pre-trained diffusion models on scene graphs, despite the sparsity of paired image and scene-graph data.
  • results: The proposed methods generate images from scene graphs with much higher quality, outperforming previous methods.
    Abstract Diffusion models excel in image generation but lack detailed semantic control using text prompts. Additional techniques have been developed to address this limitation. However, conditioning diffusion models solely on text-based descriptions is challenging due to ambiguity and lack of structure. In contrast, scene graphs offer a more precise representation of image content, making them superior for fine-grained control and accurate synthesis in image generation models. The amount of image and scene-graph data is sparse, which makes fine-tuning large diffusion models challenging. We propose multiple approaches to tackle this problem using ControlNet and Gated Self-Attention. We were able to show that using out proposed methods it is possible to generate images from scene graphs with much higher quality, outperforming previous methods. Our source code is publicly available on https://github.com/FrankFundel/SGCond
    摘要 吸引模型在图像生成方面表现出色,但缺乏文本描述的细腻 semantic控制。为了解决这个限制,有些技术被开发出来。然而,通过 solely 文本描述来conditioning 吸引模型是困难的,因为描述的ambiguity和lack of structure。相比之下,场景图表示图像内容的更加精细,使其成为更好的 Fine-grained control 和图像生成模型的精准合成。然而,图像和场景图数据的量是稀缺的,这使得 fine-tuning 大型吸引模型困难。我们提出了多种方法来解决这个问题,包括ControlNet和Gated Self-Attention。我们能够证明,使用我们的方法可以生成从场景图中的图像,质量远高于之前的方法。我们的源代码在https://github.com/FrankFundel/SGCond 上公开可用。

Towards image compression with perfect realism at ultra-low bitrates

  • paper_url: http://arxiv.org/abs/2310.10325
  • repo_url: None
  • paper_authors: Marlène Careil, Matthew J. Muckley, Jakob Verbeek, Stéphane Lathuilière
  • for: Making image quality at very low bitrates high and perceptually realistic, and less dependent on the bitrate.
  • methods: Decodes with iterative diffusion models instead of the feed-forward decoders trained with MSE or LPIPS distortions used in most neural codecs, conditioning the model on a vector-quantized image representation and a global textual image description for additional context.
  • results: The proposed PerCo outperforms state-of-the-art codecs at rates from 0.1 down to 0.003 bits per pixel (where a 512x768 Kodak image is encoded in less than 153 bytes), achieving state-of-the-art visual quality as measured by FID and KID, with quality less dependent on the bitrate than previous methods.
    Abstract Image codecs are typically optimized to trade off bitrate vs. distortion metrics. At low bitrates, this leads to compression artefacts which are easily perceptible, even when training with perceptual or adversarial losses. To improve image quality, and to make it less dependent on the bitrate, we propose to decode with iterative diffusion models, instead of the feed-forward decoders trained with MSE or LPIPS distortions used in most neural codecs. In addition to conditioning the model on a vector-quantized image representation, we also condition on a global textual image description to provide additional context. We dub our model PerCo for 'perceptual compression', and compare it to state-of-the-art codecs at rates from 0.1 down to 0.003 bits per pixel. The latter rate is an order of magnitude smaller than those considered in most prior work. At this bitrate a 512x768 Kodak image is encoded in less than 153 bytes. Despite this ultra-low bitrate, our approach maintains the ability to reconstruct realistic images. We find that our model leads to reconstructions with state-of-the-art visual quality as measured by FID and KID, and that the visual quality is less dependent on the bitrate than previous methods.
    摘要 图像编码器通常是进行比特率vs扭曲指标的优化的。在低比特率下,这会导致压缩artefacts,即使在使用感知或敌对损失进行训练。为了改进图像质量并使其不受比特率的影响,我们提议使用迭代扩散模型进行解码,而不是使用MSE或LPIPS损失来训练Feed-forward decoder。此外,我们还conditioning the model on a vector-quantized image representation和global文本描述来提供额外的 контекст。我们称我们的模型为PerCo,用于'感知压缩',并与当前的编码器进行比较。我们的模型在比特率从0.1下到0.003比特每像素进行比较,其中0.003比特每像素是在大多数先前工作中考虑的一个次数。在这个比特率下,我们可以将512x768像素的Kodak图像编码为 less than 153字节。尽管我们的比特率非常低,但我们的方法可以保持实际的图像重建。我们发现我们的模型可以在FID和KID指标下达到状态泰ometer的视觉质量,并且这种视觉质量与比特率相对较少受到影响。
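
To make the ultra-low bitrate figure concrete, a short back-of-the-envelope check (plain Python, using only the numbers quoted in the abstract) reproduces the byte budget for a 512x768 image at 0.003 bits per pixel:

```python
# Byte budget for a 512x768 image at 0.003 bits per pixel,
# the lowest rate considered in the PerCo abstract.
height, width = 512, 768
bits_per_pixel = 0.003

total_bits = height * width * bits_per_pixel   # 393,216 pixels * 0.003 = 1179.648 bits
total_bytes = total_bits / 8                   # ~147.5 bytes

print(f"{total_bytes:.1f} bytes")  # ~147.5, consistent with "less than 153 bytes"
```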

Multi-Body Neural Scene Flow

  • paper_url: http://arxiv.org/abs/2310.10301
  • repo_url: https://github.com/kavisha725/MBNSF
  • paper_authors: Kavisha Vidanapathirana, Shin-Fang Chng, Xueqian Li, Simon Lucey
  • for: Improving test-time optimization of scene flow so that it respects the multi-body rigid motions present in real-world data.
  • methods: Builds on a coordinate-network neural prior that regularizes scene flow to be spatially smooth, and adds a regularizer that encourages isometry in the flow predicted for rigid bodies, avoiding explicit per-body SE(3) constraints while keeping the flow field continuous.
  • results: Extensive experiments on real-world datasets show the approach outperforms the state of the art in 3D scene flow and long-term point-wise trajectory prediction. Code available at: \href{https://github.com/kavisha725/MBNSF}{https://github.com/kavisha725/MBNSF}.
    Abstract The test-time optimization of scene flow - using a coordinate network as a neural prior - has gained popularity due to its simplicity, lack of dataset bias, and state-of-the-art performance. We observe, however, that although coordinate networks capture general motions by implicitly regularizing the scene flow predictions to be spatially smooth, the neural prior by itself is unable to identify the underlying multi-body rigid motions present in real-world data. To address this, we show that multi-body rigidity can be achieved without the cumbersome and brittle strategy of constraining the $SE(3)$ parameters of each rigid body as done in previous works. This is achieved by regularizing the scene flow optimization to encourage isometry in flow predictions for rigid bodies. This strategy enables multi-body rigidity in scene flow while maintaining a continuous flow field, hence allowing dense long-term scene flow integration across a sequence of point clouds. We conduct extensive experiments on real-world datasets and demonstrate that our approach outperforms the state-of-the-art in 3D scene flow and long-term point-wise 4D trajectory prediction. The code is available at: \href{https://github.com/kavisha725/MBNSF}{https://github.com/kavisha725/MBNSF}.
    摘要 scene flow 测试时优化 - 使用坐标网络作为神经网络先验 - 在过去几年中变得越来越流行,这是因为它的简单性、不受数据偏见和现在的表现水平都很高。但我们发现,即使坐标网络可以捕捉一般运动的概念,但神经网络本身无法直接捕捉真实世界数据中的多体刚性运动。为解决这个问题,我们表明了一种不需要干扰和脆弱的策略,即在场景流优化中规范化流预测以促进刚性。这种策略允许场景流中的多体刚性,同时保持连续的流场,因此允许长期场景流集成。我们对实际数据进行了广泛的实验,并证明了我们的方法在3D场景流和长期点云轨迹预测中超越了现有的状态艺术。代码可以在:上下载。
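
The key idea, encouraging isometry (distance preservation) in the predicted flow for points belonging to the same rigid body, can be sketched as a simple pairwise regularizer. The snippet below is a minimal PyTorch illustration under our own assumptions (random pair sampling, points already assigned to a rigid cluster); it is not the paper's exact formulation.

```python
import torch

def isometry_loss(points, flow, pairs):
    """Penalize change in pairwise distances after applying the predicted flow.

    points: (N, 3) source point cloud
    flow:   (N, 3) predicted scene flow
    pairs:  (M, 2) indices of point pairs assumed to lie on the same rigid body
    """
    i, j = pairs[:, 0], pairs[:, 1]
    d_before = torch.norm(points[i] - points[j], dim=-1)
    warped = points + flow
    d_after = torch.norm(warped[i] - warped[j], dim=-1)
    return torch.mean((d_after - d_before) ** 2)

# Toy usage: 100 points, flow from some coordinate network, 500 random pairs.
pts = torch.randn(100, 3)
flw = 0.05 * torch.randn(100, 3, requires_grad=True)
prs = torch.randint(0, 100, (500, 2))
loss = isometry_loss(pts, flw, prs)
loss.backward()
```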

Effortless Cross-Platform Video Codec: A Codebook-Based Method

  • paper_url: http://arxiv.org/abs/2310.10292
  • repo_url: None
  • paper_authors: Kuan Tian, Yonghang Guan, Jinxi Xiang, Jun Zhang, Xiao Han, Wei Yang
  • for: Improving the cross-platform robustness and computational efficiency of neural video codecs.
  • methods: A codebook-based video compression framework that transmits codeword index sequences instead of relying on autoregressive entropy modeling, with a conditional cross-attention module (rather than optical flow) to obtain the context between frames.
  • results: Experiments show the method outperforms the traditional H.265 (medium) codec without any entropy constraints, while achieving the cross-platform property intrinsically.
    Abstract Under certain circumstances, advanced neural video codecs can surpass the most complex traditional codecs in their rate-distortion (RD) performance. One of the main reasons for the high performance of existing neural video codecs is the use of the entropy model, which can provide more accurate probability distribution estimations for compressing the latents. This also implies the rigorous requirement that entropy models running on different platforms should use consistent distribution estimations. However, in cross-platform scenarios, entropy models running on different platforms usually yield inconsistent probability distribution estimations due to floating point computation errors that are platform-dependent, which can cause the decoding side to fail in correctly decoding the compressed bitstream sent by the encoding side. In this paper, we propose a cross-platform video compression framework based on codebooks, which avoids autoregressive entropy modeling and achieves video compression by transmitting the index sequence of the codebooks. Moreover, instead of using optical flow for context alignment, we propose to use the conditional cross-attention module to obtain the context between frames. Due to the absence of autoregressive modeling and optical flow alignment, we can design an extremely minimalist framework that can greatly benefit computational efficiency. Importantly, our framework no longer contains any distribution estimation modules for entropy modeling, and thus computations across platforms are not necessarily consistent. Experimental results show that our method can outperform the traditional H.265 (medium) even without any entropy constraints, while achieving the cross-platform property intrinsically.
    摘要 在某些情况下,高级神经视频编码器可以超越最复杂的传统编码器在比特率-损失(RD)性能方面。主要的原因是使用Entropy模型,可以提供更准确的概率分布估计,用于压缩缓冲。然而,在跨平台场景下,运行于不同平台的Entropy模型通常会产生不一致的概率分布估计,因为计算机中的浮点数计算错误是平台相关的,这会导致解码器无法正确地解码编码器发送的压缩位流。在本文中,我们提出了基于codebooks的跨平台视频压缩框架,不使用潮流模型和相关适应模块,而是通过传输编码器序列的index来实现压缩。另外,我们提出了基于条件cross-attention模块来获取帧之间的上下文。由于不使用潮流模型和相关适应模块,我们可以设计一个极其简洁的框架,可以大幅提高计算效率。重要的是,我们的框架不再包含任何分布估计模块,因此在不同平台上的计算是不一致的。实验结果表明,我们的方法可以在不使用Entropy约束下,超越传统H.265(中)的RD性能,同时实现跨平台性特性。
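
A minimal sketch of the codebook idea: the encoder maps latents to their nearest codewords and transmits only integer indices, so no platform-dependent probability model has to agree bit-for-bit between encoder and decoder. Codebook size, feature dimension and function names below are illustrative assumptions, not the paper's implementation.

```python
import torch

codebook = torch.randn(1024, 64)          # 1024 learned codewords of dim 64 (illustrative sizes)

def encode(latents):
    """latents: (N, 64) per-patch features -> (N,) integer indices for the bitstream."""
    d = torch.cdist(latents, codebook)     # (N, 1024) distances to every codeword
    return torch.argmin(d, dim=1)          # nearest-codeword assignment

def decode(indices):
    # The decoder only needs the shared codebook and the transmitted indices;
    # floating-point differences between platforms cannot desynchronize decoding.
    return codebook[indices]

z = torch.randn(8, 64)
idx = encode(z)
z_hat = decode(idx)
```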

Towards Open-World Co-Salient Object Detection with Generative Uncertainty-aware Group Selective Exchange-Masking

  • paper_url: http://arxiv.org/abs/2310.10264
  • repo_url: https://github.com/wuyang98/CoSOD
  • paper_authors: Yang Wu, Shenglong Hu, Huihui Song, Kaihua Zhang, Bo Liu, Dong Liu
  • for: Improving the robustness of co-salient object detection (CoSOD) models in open-world scenarios.
  • methods: Introduces group selective exchange-masking (GSEM), which selects and exchanges images between two groups using a learned mixed metric, combined with a latent variable generator branch (a vector quantised-variational autoencoder) and a CoSOD transformer branch.
  • results: The resulting robust CoSOD method is validated on three newly constructed open-world benchmark datasets (OWCoSal, OWCoSOD, and OWCoCA), demonstrating its effectiveness and practicality.
    Abstract The traditional definition of the co-salient object detection (CoSOD) task is to segment the common salient objects in a group of relevant images. This definition is based on an assumption of group consensus consistency that is not always reasonable in the open-world setting, which results in a robustness issue when the model deals with irrelevant images in the input image group under open-world scenarios. To tackle this problem, we introduce a group selective exchange-masking (GSEM) approach for enhancing the robustness of the CoSOD model. GSEM takes two groups of images as input, each containing different types of salient objects. Based on the mixed metric we designed, GSEM selects a subset of images from each group using a novel learning-based strategy, then the selected images are exchanged. To simultaneously consider the uncertainty introduced by irrelevant images and the consensus features of the remaining relevant images in the group, we designed a latent variable generator branch and a CoSOD transformer branch. The former is composed of a vector quantised-variational autoencoder to generate stochastic global variables that model uncertainty. The latter is designed to capture correlation-based local features that include group consensus. Finally, the outputs of the two branches are merged and passed to a transformer-based decoder to generate robust predictions. Taking into account that there are currently no benchmark datasets specifically designed for open-world scenarios, we constructed three open-world benchmark datasets, namely OWCoSal, OWCoSOD, and OWCoCA, based on existing datasets. By breaking the group-consistency assumption, these datasets provide effective simulations of real-world scenarios and can better evaluate the robustness and practicality of models.
    摘要 传统上,co-salient object detection(CoSOD)任务的定义是将相同的焦点对象在多个相关图像中分割。这个定义基于了群体一致性的假设,这并不总是在开放世界场景下合理的,这会导致模型在处理无关图像时出现Robustness问题。为解决这个问题,我们提出了群选择交换掩码(GSEM)方法,用于增强CoSOD模型的Robustness。GSEM使用两组图像作为输入,每组图像含有不同类型的焦点对象。基于我们定义的混合度量,GSEM选择每组图像的一部分图像,然后将这些图像交换。为同时考虑无关图像引入的不确定性和剩下相关图像的协同特征,我们设计了隐藏变量生成分支和CoSOD变换分支。前者由vector quantized-variational autoencoder组成,用于生成随机全球变量,模型不确定性。后者是为了捕捉协同特征,包括群体一致性。最后,两个分支的输出被 merge,并传递到基于变换器的解码器,以生成Robust的预测。考虑到目前没有特定于开放世界场景的准确数据集,我们构建了三个开放世界数据集,namely OWCoSal、OWCoSOD和OWCoCA,基于现有数据集。由于这些数据集破坏了群体一致性假设,它们可以更好地模拟实际场景,并且可以更好地评估模型的Robustness和实用性。

Long-term Dependency for 3D Reconstruction of Freehand Ultrasound Without External Tracker

  • paper_url: http://arxiv.org/abs/2310.10248
  • repo_url: https://github.com/ucl-candi/freehand
  • paper_authors: Qi Li, Ziyi Shen, Qian Li, Dean C. Barratt, Thomas Dowrick, Matthew J. Clarkson, Tom Vercauteren, Yipeng Hu
  • for: Defining new ways of parameterising long-term dependency for trackerless freehand ultrasound 3D reconstruction and evaluating their performance.
  • methods: Encodes long-term dependency by combining a sequence model with multi-transformation prediction, and proposes two dependency factors (anatomical image content and scanning protocol) to promote accurate reconstruction.
  • results: 1) Adding long-term dependency improves reconstruction accuracy, with the gain depending on sequence length, transformation interval and scanning protocol; 2) reducing either anatomical or protocol variance in training degrades reconstruction accuracy.
    Abstract Objective: Reconstructing freehand ultrasound in 3D without any external tracker has been a long-standing challenge in ultrasound-assisted procedures. We aim to define new ways of parameterising long-term dependencies, and evaluate the performance. Methods: First, long-term dependency is encoded by transformation positions within a frame sequence. This is achieved by combining a sequence model with a multi-transformation prediction. Second, two dependency factors are proposed, anatomical image content and scanning protocol, for contributing towards accurate reconstruction. Each factor is quantified experimentally by reducing respective training variances. Results: 1) The added long-term dependency up to 400 frames at 20 frames per second (fps) indeed improved reconstruction, with an up to 82.4% lowered accumulated error, compared with the baseline performance. The improvement was found to be dependent on sequence length, transformation interval and scanning protocol and, unexpectedly, not on the use of recurrent networks with long-short term modules; 2) Decreasing either anatomical or protocol variance in training led to poorer reconstruction accuracy. Interestingly, greater performance was gained from representative protocol patterns, than from representative anatomical features. Conclusion: The proposed algorithm uses hyperparameter tuning to effectively utilise long-term dependency. The proposed dependency factors are of practical significance in collecting diverse training data, regulating scanning protocols and developing efficient networks. Significance: The proposed new methodology with publicly available volunteer data and code for parametersing the long-term dependency, experimentally shown to be valid sources of performance improvement, which could potentially lead to better model development and practical optimisation of the reconstruction application.
    摘要 目标:无需外部跟踪器,在ultrasound-assisted程序中自由手写三维重建问题已经是长期的挑战。我们想要定义新的方法来parameterize长期依赖关系,并评估其性能。方法:首先,通过将长期依赖关系编码为帧序列中的变换位置,使用序列模型和多变换预测结合。其次,我们提出了两个依赖因素,一是解剖学图像内容,二是扫描协议。每个因素都是通过实验量化训练方差来评估。结果:1)在400帧内的20帧/秒(fps)加入长期依赖关系后,重建精度显著提高,相比基eline性能,下降82.4%的累积错误。这种改进与序列长度、变换间隔和扫描协议有关,不同于使用循环网络long-short term模块。2)在训练中降低解剖学或协议方差可以得到更差的重建精度。意外地,更多的表现是由代表协议模式获得的,而不是由解剖学特征获得。结论:我们的算法使用了适当的hyperparameter调整,以利用长期依赖关系。我们提出的依赖因素对于收集多样化的训练数据、调整扫描协议和开发高效的网络是实际上的有用。意义:我们的新方法ологи在公共可用的志愿者数据和代码中实现了参数化长期依赖关系,实验证明了这些方法的有效性,这可能会导致更好的模型开发和实用优化重建应用。

Mask wearing object detection algorithm based on improved YOLOv5

  • paper_url: http://arxiv.org/abs/2310.10245
  • repo_url: None
  • paper_authors: Peng Wen, Junhu Zhang, Haitao Li
  • for: Proposes a mask-wearing face detection model based on YOLOv5l to improve detection accuracy in crowded public places.
  • methods: Uses Multi-Head Attentional Self-Convolution and a Swin Transformer Block to improve convergence speed and accuracy, together with a proposed I-CBAM module and enhanced feature fusion for objects of different scales.
  • results: On the MASK dataset, the model improves mAP(0.5) by 1.1% and mAP(0.5:0.95) by 1.3% over YOLOv5l, significantly enhancing mask-wearing detection.
    Abstract Wearing a mask is one of the important measures to prevent infectious diseases. However, it is difficult to detect people's mask-wearing situation in public places with high traffic flow. To address the above problem, this paper proposes a mask-wearing face detection model based on YOLOv5l. Firstly, Multi-Head Attentional Self-Convolution not only improves the convergence speed of the model but also enhances the accuracy of the model detection. Secondly, the introduction of Swin Transformer Block is able to extract more useful feature information, enhance the detection ability of small targets, and improve the overall accuracy of the model. Our designed I-CBAM module can improve target detection accuracy. In addition, using enhanced feature fusion enables the model to better adapt to object detection tasks of different scales. In the experimentation on the MASK dataset, the results show that the model proposed in this paper achieved a 1.1% improvement in mAP(0.5) and a 1.3% improvement in mAP(0.5:0.95) compared to the YOLOv5l model. Our proposed method significantly enhances the detection capability of mask-wearing.
    摘要 穿戴口罩是预防感染疾病的一种重要措施。然而,在高流量的公共场所中探测人们穿戴口罩的情况很困难。为解决这个问题,本文提出了基于YOLOv5l的口罩穿戴面部检测模型。首先,多头注意力自适应卷积不仅提高模型的融合速度,也提高了模型的检测精度。其次,将Swin卷积层引入可以提取更多有用的特征信息,提高小目标的检测能力,并提高模型的总精度。我们设计的I-CBAM模块可以提高标的检测精度。此外,使用强化的特征融合可以让模型更好地适应不同的物体检测任务。在MASK dataset上的实验结果显示,提案的模型与YOLOv5l模型相比,在mAP(0.5)和mAP(0.5:0.95)中实现了1.1%和1.3%的提升。我们的提案方法可以优化口罩穿戴的检测能力。

Generalizing Medical Image Representations via Quaternion Wavelet Networks

  • paper_url: http://arxiv.org/abs/2310.10224
  • repo_url: https://github.com/ispamm/QWT
  • paper_authors: Luigi Sigillo, Eleonora Grassucci, Aurelio Uncini, Danilo Comminiello
  • for: Improving the generalizability of neural networks for medical image processing across different data sources and tasks.
  • methods: Proposes a novel, data- and task-agnostic framework based on the quaternion wavelet transform that extracts salient features from medical images and can be plugged into existing analysis or synthesis pipelines, with real-, quaternion- or hypercomplex-valued models.
  • results: Extensive experiments on diverse datasets and tasks (reconstruction, segmentation, modality translation) show the framework improves network performance while remaining broadly applicable.
    Abstract Neural network generalizability is becoming a broad research field due to the increasing availability of datasets from different sources and for various tasks. This issue is even wider when processing medical data, where a lack of methodological standards causes large variations being provided by different imaging centers or acquired with various devices and cofactors. To overcome these limitations, we introduce a novel, generalizable, data- and task-agnostic framework able to extract salient features from medical images. The proposed quaternion wavelet network (QUAVE) can be easily integrated with any pre-existing medical image analysis or synthesis task, and it can be involved with real, quaternion, or hypercomplex-valued models, generalizing their adoption to single-channel data. QUAVE first extracts different sub-bands through the quaternion wavelet transform, resulting in both low-frequency/approximation bands and high-frequency/fine-grained features. Then, it weighs the most representative set of sub-bands to be involved as input to any other neural model for image processing, replacing standard data samples. We conduct an extensive experimental evaluation comprising different datasets, diverse image analysis, and synthesis tasks including reconstruction, segmentation, and modality translation. We also evaluate QUAVE in combination with both real and quaternion-valued models. Results demonstrate the effectiveness and the generalizability of the proposed framework that improves network performance while being flexible to be adopted in manifold scenarios.
    摘要 QUAVE首先使用四元波лет变换提取不同的子带,包括低频/抽象带和高频/细化特征。然后,它对最有代表性的子带进行权重,将其作为任务模型的输入,取代标准数据样本。我们进行了广泛的实验评估,包括不同的数据集和多种图像分析和生成任务,如重建、分割和模式翻译。我们还在QUAVE与实数和四元数值模型结合使用时进行了评估。结果表明我们提出的框架能够提高网络性能,同时具有通用的优势,能够在多种场景中适用。

RoboLLM: Robotic Vision Tasks Grounded on Multimodal Large Language Models

  • paper_url: http://arxiv.org/abs/2310.10221
  • repo_url: https://github.com/longkukuhi/armbench
  • paper_authors: Zijun Long, George Killick, Richard McCreadie, Gerardo Aragon Camarasa
  • for: This paper is written for robotic vision applications, specifically to address the challenges of object detection, segmentation, and identification in real-world warehouse scenarios.
  • methods: The paper proposes the use of Multimodal Large Language Models (MLLMs) as a novel backbone for various downstream tasks, leveraging the pre-training capabilities of MLLMs to create a simplified framework and mitigate the need for task-specific encoders.
  • results: The paper introduces the RoboLLM framework, equipped with a BEiT-3 backbone, which outperforms existing baselines and substantially reduces the engineering burden associated with model selection and tuning, as demonstrated in the ARMBench challenge.
    Abstract Robotic vision applications often necessitate a wide range of visual perception tasks, such as object detection, segmentation, and identification. While there have been substantial advances in these individual tasks, integrating specialized models into a unified vision pipeline presents significant engineering challenges and costs. Recently, Multimodal Large Language Models (MLLMs) have emerged as novel backbones for various downstream tasks. We argue that leveraging the pre-training capabilities of MLLMs enables the creation of a simplified framework, thus mitigating the need for task-specific encoders. Specifically, the large-scale pretrained knowledge in MLLMs allows for easier fine-tuning to downstream robotic vision tasks and yields superior performance. We introduce the RoboLLM framework, equipped with a BEiT-3 backbone, to address all visual perception tasks in the ARMBench challenge-a large-scale robotic manipulation dataset about real-world warehouse scenarios. RoboLLM not only outperforms existing baselines but also substantially reduces the engineering burden associated with model selection and tuning. The source code is publicly available at https://github.com/longkukuhi/armbench.
    摘要 机器人视觉应用通常需要多种视觉感知任务,如物体检测、分割和识别。尽管这些单项任务已取得显著进展,但将专用模型整合到统一的视觉管线中会带来巨大的工程挑战和成本。近来,多模态大语言模型(MLLMs)已成为各种下游任务的新型骨干。我们认为,利用MLLMs的预训练能力可以构建一个简化的框架,从而减少对任务专用编码器的需求。具体而言,MLLMs中的大规模预训练知识使其更容易微调到下游机器人视觉任务,并取得更优的性能。我们提出了以BEiT-3为骨干的RoboLLM框架,用于解决ARMBench挑战(一个关于真实仓库场景的大规模机器人操作数据集)中的所有视觉感知任务。RoboLLM不仅优于现有基线,还大幅降低了模型选择与调参带来的工程负担。源代码公开于 https://github.com/longkukuhi/armbench。

Self-supervised Fetal MRI 3D Reconstruction Based on Radiation Diffusion Generation Model

  • paper_url: http://arxiv.org/abs/2310.10209
  • repo_url: None
  • paper_authors: Junpeng Tan, Xin Zhang, Yao Lv, Xiangmin Xu, Gang Li
  • for: Addressing high-quality volume reconstruction of fetal brain MRI in the presence of severe motion artifacts.
  • methods: A self-supervised Radiation Diffusion Generation Model (RDGM) that combines NeRF-style coordinate generation (a regionally Consistent Implicit Neural Representation, CINR) with a Volume Diffusion Super-resolution Generation (VDSG) mechanism, plus a pre-trained transformer for slice registration.
  • results: Experiments on real-world fetal brain MRI stacks demonstrate state-of-the-art super-resolution reconstruction quality.
    Abstract Although the use of multiple stacks can handle slice-to-volume motion correction and artifact removal problems, there are still several problems: 1) The slice-to-volume method usually uses slices as input, which cannot solve the problem of uniform intensity distribution and complementarity in regions of different fetal MRI stacks; 2) The integrity of 3D space is not considered, which adversely affects the discrimination and generation of globally consistent information in fetal MRI; 3) Fetal MRI with severe motion artifacts in the real-world cannot achieve high-quality super-resolution reconstruction. To address these issues, we propose a novel fetal brain MRI high-quality volume reconstruction method, called the Radiation Diffusion Generation Model (RDGM). It is a self-supervised generation method, which incorporates the idea of Neural Radiation Field (NeRF) based on the coordinate generation and diffusion model based on super-resolution generation. To solve regional intensity heterogeneity in different directions, we use a pre-trained transformer model for slice registration, and then, a new regionally Consistent Implicit Neural Representation (CINR) network sub-module is proposed. CINR can generate the initial volume by combining a coordinate association map of two different coordinate mapping spaces. To enhance volume global consistency and discrimination, we introduce the Volume Diffusion Super-resolution Generation (VDSG) mechanism. The global intensity discriminant generation from volume-to-volume is carried out using the idea of diffusion generation, and CINR becomes the deviation intensity generation network of the volume-to-volume diffusion model. Finally, the experimental results on real-world fetal brain MRI stacks demonstrate the state-of-the-art performance of our method.

MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete Representations

  • paper_url: http://arxiv.org/abs/2310.10198
  • repo_url: None
  • paper_authors: Heyuan Yao, Zhenhua Song, Yuyang Zhou, Tenglong Ao, Baoquan Chen, Libin Liu
  • for: Presents MoConVQ, a unified physics-based motion control framework that learns motion embeddings from large, unstructured motion datasets spanning tens of hours.
  • methods: Combines a vector quantized variational autoencoder (VQ-VAE) with model-based reinforcement learning to learn scalable discrete motion representations that capture diverse motion skills.
  • results: Demonstrates versatility across universal tracking control, interactive character control with latent motion representations, physics-based motion generation from natural language via the GPT framework, and integration with large language models (LLMs) through in-context learning for complex, abstract tasks.
    Abstract In this work, we present MoConVQ, a novel unified framework for physics-based motion control leveraging scalable discrete representations. Building upon vector quantized variational autoencoders (VQ-VAE) and model-based reinforcement learning, our approach effectively learns motion embeddings from a large, unstructured dataset spanning tens of hours of motion examples. The resultant motion representation not only captures diverse motion skills but also offers a robust and intuitive interface for various applications. We demonstrate the versatility of MoConVQ through several applications: universal tracking control from various motion sources, interactive character control with latent motion representations using supervised learning, physics-based motion generation from natural language descriptions using the GPT framework, and, most interestingly, seamless integration with large language models (LLMs) with in-context learning to tackle complex and abstract tasks.
    摘要 在这项工作中,我们介绍了MoConVQ,一种新的物理基于运动控制框架,利用可扩展的字符串表示法。我们的方法基于vector量化自适应学习(VQ-VAE)和基于模型的奖励学习,从大量、无结构的运动示例中学习出高质量的运动嵌入。这种运动表示不仅捕捉了多样化的运动技巧,还提供了一种稳定和直观的界面,可以应用于多种应用程序。我们在这篇论文中展示了MoConVQ的多种应用,包括从不同运动源的跟踪控制、使用监督学习的潜在运动表示进行交互人物控制、基于自然语言描述的物理运动生成、以及与大语言模型(LLM)集成,以解决复杂和抽象任务。
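
As background for the "scalable discrete representations", the snippet below shows a generic VQ-VAE quantization step of the kind such motion embeddings rely on: nearest-codeword lookup, codebook/commitment losses, and the straight-through estimator. This is textbook VQ-VAE, sketched under our own shapes and names, not MoConVQ's actual architecture.

```python
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, beta=0.25):
    """Quantize encoder outputs z_e (B, D) against a codebook (K, D).

    Returns quantized latents (with straight-through gradients), the discrete
    token indices, and the standard VQ-VAE codebook + commitment loss.
    """
    dist = torch.cdist(z_e, codebook)              # (B, K) distances
    idx = dist.argmin(dim=1)                       # discrete motion tokens
    z_q = codebook[idx]

    codebook_loss = F.mse_loss(z_q, z_e.detach())  # move codewords toward encoder outputs
    commitment_loss = F.mse_loss(z_e, z_q.detach())  # keep encoder close to its codeword
    loss = codebook_loss + beta * commitment_loss

    # Straight-through estimator: gradients flow to the encoder as if z_q == z_e.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, idx, loss

codebook = torch.nn.Parameter(torch.randn(512, 256))
z_e = torch.randn(32, 256, requires_grad=True)
z_q, tokens, vq_loss = vector_quantize(z_e, codebook)
vq_loss.backward()
```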

The Road to On-board Change Detection: A Lightweight Patch-Level Change Detection Network via Exploring the Potential of Pruning and Pooling

  • paper_url: http://arxiv.org/abs/2310.10166
  • repo_url: None
  • paper_authors: Lihui Xue, Zhihao Wang, Xueqian Wang, Gang Li
  • for: Improving the efficiency of large-scale satellite remote sensing change detection (CD) on edge platforms with very limited computation and memory.
  • methods: A lightweight patch-level CD network (LPCDNet) that rapidly removes unchanged patch pairs before pixel-level CD, using sensitivity-guided channel pruning on a ResNet18 backbone, a multi-layer feature compression (MLFC) module to fuse bi-temporal patch features, and a weighted cross-entropy loss for the change/unchange imbalance.
  • results: On two CD datasets, LPCDNet runs at more than 1,000 frames per second on an NVIDIA Jetson AGX Orin, over 3x faster than existing methods, without noticeable CD performance loss, and cuts the memory cost of the subsequent pixel-level stage by more than 60%.
    Abstract Existing satellite remote sensing change detection (CD) methods often crop original large-scale bi-temporal image pairs into small patch pairs and then use pixel-level CD methods to fairly process all the patch pairs. However, due to the sparsity of change in large-scale satellite remote sensing images, existing pixel-level CD methods suffer from a waste of computational cost and memory resources on lots of unchanged areas, which reduces the processing efficiency of on-board platform with extremely limited computation and memory resources. To address this issue, we propose a lightweight patch-level CD network (LPCDNet) to rapidly remove lots of unchanged patch pairs in large-scale bi-temporal image pairs. This is helpful to accelerate the subsequent pixel-level CD processing stage and reduce its memory costs. In our LPCDNet, a sensitivity-guided channel pruning method is proposed to remove unimportant channels and construct the lightweight backbone network on basis of ResNet18 network. Then, the multi-layer feature compression (MLFC) module is designed to compress and fuse the multi-level feature information of bi-temporal image patch. The output of MLFC module is fed into the fully-connected decision network to generate the predicted binary label. Finally, a weighted cross-entropy loss is utilized in the training process of network to tackle the change/unchange class imbalance problem. Experiments on two CD datasets demonstrate that our LPCDNet achieves more than 1000 frames per second on an edge computation platform, i.e., NVIDIA Jetson AGX Orin, which is more than 3 times that of the existing methods without noticeable CD performance loss. In addition, our method reduces more than 60% memory costs of the subsequent pixel-level CD processing stage.
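
One common way to realize "sensitivity-guided" channel pruning is to score channels and drop the lowest-scoring ones. The sketch below uses batch-norm scale magnitudes as the proxy score on one conv+BN pair; this is a generic illustration under our own assumptions, not the paper's exact criterion.

```python
import torch
import torch.nn as nn

def bn_channel_scores(bn: nn.BatchNorm2d):
    # |gamma| is a cheap proxy for how much each channel contributes downstream.
    return bn.weight.detach().abs()

def select_channels(bn: nn.BatchNorm2d, keep_ratio: float = 0.5):
    scores = bn_channel_scores(bn)
    k = max(1, int(keep_ratio * scores.numel()))
    return torch.topk(scores, k).indices.sort().values   # indices of channels to keep

# Toy usage on one conv + BN pair from a ResNet18-style backbone.
conv = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
bn = nn.BatchNorm2d(128)
keep = select_channels(bn, keep_ratio=0.5)

# Build the slimmer layer by copying only the surviving output filters.
pruned_conv = nn.Conv2d(64, len(keep), kernel_size=3, padding=1, bias=False)
pruned_conv.weight.data.copy_(conv.weight.data[keep])
```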

A Search for Prompts: Generating Structured Answers from Contracts

  • paper_url: http://arxiv.org/abs/2310.10141
  • repo_url: None
  • paper_authors: Adam Roegiest, Radha Chitta, Jonathan Donnelly, Maya Lash, Alexandra Vtyurina, François Longtin
  • for: Legal question answering that returns fixed answers about contract clauses, to support automating human review or flagging conditions such as automatic renewal.
  • methods: Explores prompt design for OpenAI's GPT-3.5-Turbo, comparing template prompts against a semantic-matching baseline and refining them with in-context learning.
  • results: The template prompts are considerably more accurate than the semantic-matching approach, and additional prompt tweaks plus in-context learning further improve performance while maximizing response reliability.
    Abstract In many legal processes being able to action on the concrete implication of a legal question can be valuable to automating human review or signalling certain conditions (e.g., alerts around automatic renewal). To support such tasks, we present a form of legal question answering that seeks to return one (or more) fixed answers for a question about a contract clause. After showing that unstructured generative question answering can have questionable outcomes for such a task, we discuss our exploration methodology for legal question answering prompts using OpenAI's \textit{GPT-3.5-Turbo} and provide a summary of insights. Using insights gleaned from our qualitative experiences, we compare our proposed template prompts against a common semantic matching approach and find that our prompt templates are far more accurate despite being less reliable in the exact response return. With some additional tweaks to prompts and the use of in-context learning, we are able to further improve the performance of our proposed strategy while maximizing the reliability of responses as best we can.
    摘要 在许多法律程序中,能够对法律问题的具体实施有益于自动化人类审查或标识特定条件(例如,续约提醒)。为支持这些任务,我们提出了一种法律问题回答方法,该方法可以为一个合同条款的问题返回一个或多个固定答案。在显示了无结构生成问题回答可能导致问able的结果后,我们讲述了我们的探索方法ологи,使用OpenAI的GPT-3.5-Turbo进行问题回答提示。我们提供了一个摘要的感想,并与常见Semantic Matching方法进行比较。我们发现,我们的提案的模板提示比Semantic Matching方法更准确,尽管它们可能不那么可靠地返回具体的回答。通过对提示和使用上下文学习进行一些调整,我们能够进一步改进我们的提案的性能,同时最大化回答的可靠性。
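
The abstract does not publish the prompt templates themselves, but the general shape of a fixed-answer prompt for a contract clause can be sketched as follows. The template wording, field names, and the chat-message format are our own illustrative assumptions, not the paper's actual prompts.

```python
# Illustrative prompt template for extracting a fixed answer from a clause.
# The wording and labels are hypothetical, not the paper's prompts.
TEMPLATE = (
    "You are reviewing a contract clause.\n"
    "Clause:\n{clause}\n\n"
    "Question: {question}\n"
    "Answer with exactly one of: {allowed_answers}. Do not explain."
)

def build_messages(clause: str, question: str, allowed_answers):
    prompt = TEMPLATE.format(
        clause=clause,
        question=question,
        allowed_answers=", ".join(allowed_answers),
    )
    # Chat-style message list of the kind consumed by GPT-3.5-Turbo endpoints.
    return [{"role": "user", "content": prompt}]

msgs = build_messages(
    clause="This Agreement shall automatically renew for successive one-year terms...",
    question="Does this clause provide for automatic renewal?",
    allowed_answers=["Yes", "No"],
)
print(msgs[0]["content"])
```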

PELA: Learning Parameter-Efficient Models with Low-Rank Approximation

  • paper_url: http://arxiv.org/abs/2310.10700
  • repo_url: https://github.com/guoyang9/pela
  • paper_authors: Yangyang Guo, Guangzhi Wang, Mohan Kankanhalli
  • for: Increasing the parameter efficiency of pre-trained models so they can be fine-tuned for downstream tasks under resource constraints.
  • methods: Introduces an intermediate pre-training stage that compresses the original large model with low-rank approximation, together with a feature distillation module and a weight perturbation regularization module to strengthen the low-rank model while the backbone stays frozen.
  • results: Maintains performance comparable to the base architecture (typically a ~0.6 point drop) while reducing the original parameter size by 1/3 to 2/3.
    Abstract Applying a pre-trained large model to downstream tasks is prohibitive under resource-constrained conditions. Recent dominant approaches for addressing efficiency issues involve adding a few learnable parameters to the fixed backbone model. This strategy, however, leads to more challenges in loading large models for downstream fine-tuning with limited resources. In this paper, we propose a novel method for increasing the parameter efficiency of pre-trained models by introducing an intermediate pre-training stage. To this end, we first employ low-rank approximation to compress the original large model and then devise a feature distillation module and a weight perturbation regularization module. These modules are specifically designed to enhance the low-rank model. Concretely, we update only the low-rank model while freezing the backbone parameters during pre-training. This allows for direct and efficient utilization of the low-rank model for downstream tasks. The proposed method achieves both efficiencies in terms of required parameters and computation time while maintaining comparable results with minimal modifications to the base architecture. Specifically, when applied to three vision-only and one vision-language Transformer models, our approach often demonstrates a $\sim$0.6 point decrease in performance while reducing the original parameter size by 1/3 to 2/3.
    摘要 在资源受限的条件下,将预训练的大模型直接应用于下游任务代价高昂。近期的主流方法通过在固定骨干上添加少量可学习参数来提高效率,但这仍然需要在资源有限的情况下加载大模型进行下游微调。本文提出一种通过引入中间预训练阶段来提升预训练模型参数效率的新方法:先用低秩近似压缩原始大模型,再设计特征蒸馏模块与权重扰动正则化模块来增强低秩模型。预训练期间仅更新低秩模型并冻结骨干参数,使低秩模型可以直接、高效地用于下游任务。该方法在参数量和计算时间上都更高效,同时对基础架构的修改极小、性能相当:应用于三个纯视觉和一个视觉-语言Transformer模型时,通常仅带来约0.6个点的性能下降,而参数规模减少1/3到2/3。
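
The core compression step, replacing a pre-trained weight matrix with a low-rank factorization, can be illustrated with a plain truncated SVD. This is a generic sketch of low-rank approximation (sizes are illustrative), not PELA's full pipeline, which adds feature distillation and weight-perturbation regularization on top.

```python
import torch

def low_rank_factorize(weight: torch.Tensor, rank: int):
    """Approximate a (d_out, d_in) weight with two factors of total size rank*(d_out+d_in)."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]        # (d_out, rank)
    B = Vh[:rank, :]                  # (rank, d_in)
    return A, B

W = torch.randn(768, 3072)            # e.g., an MLP projection in a Transformer block
A, B = low_rank_factorize(W, rank=128)

params_before = W.numel()                      # 2,359,296
params_after = A.numel() + B.numel()           # 128 * (768 + 3072) = 491,520
approx_error = torch.norm(W - A @ B) / torch.norm(W)
print(params_before, params_after, float(approx_error))
```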

3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding

  • paper_url: http://arxiv.org/abs/2310.10131
  • repo_url: https://github.com/seonokkim/3dyoga90
  • paper_authors: Seonok Kim
  • for: Developing a larger, more comprehensive exercise video dataset for AI, 3DYoga90, for yoga pose understanding and action recognition.
  • methods: A purpose-built dataset organized in a three-level label hierarchy, expanding an existing state-of-the-art dataset from 82 to 90 poses, with curated RGB videos and 3D skeleton sequences assembled by a team of six, including yoga instructors.
  • results: One of the most comprehensive open datasets of its kind, featuring the largest publicly available collection of RGB videos and 3D skeleton sequences; benchmark experiments with three model variants demonstrate its practicality.
    Abstract The increasing popularity of exercises including yoga and Pilates has created a greater demand for professional exercise video datasets in the realm of artificial intelligence. In this study, we developed 3DYoga901, which is organized within a three-level label hierarchy. We have expanded the number of poses from an existing state-of-the-art dataset, increasing it from 82 to 90 poses. Our dataset includes meticulously curated RGB yoga pose videos and 3D skeleton sequences. This dataset was created by a dedicated team of six individuals, including yoga instructors. It stands out as one of the most comprehensive open datasets, featuring the largest collection of RGB videos and 3D skeleton sequences among publicly available resources. This contribution has the potential to significantly advance the field of yoga action recognition and pose assessment. Additionally, we conducted experiments to evaluate the practicality of our proposed dataset. We employed three different model variants for benchmarking purposes.
    摘要 随着瑜伽和PILATES等运动的流行,人工智能领域的训练数据需求增加。在这项研究中,我们开发了3DYoga901,这是一个三级标签层次结构下的组织方式。我们将现有状态艺术数据集中的82姿势提高到90姿势。我们的数据集包括仔细挑选的RGB瑜伽姿势视频和3D骨架序列。这个数据集由6名专业人员,包括瑜伽教练,共同创建。它是公共可用资源中最完整的开放数据集,拥有最大的RGB视频和3D骨架序列收集。这一贡献有可能在瑜伽动作识别和姿势评估领域取得重要进步。此外,我们还进行了实验来评估我们的提案的实用性。我们使用了三种不同的模型变体进行比较。

Few-shot Action Recognition with Captioning Foundation Models

  • paper_url: http://arxiv.org/abs/2310.10125
  • repo_url: None
  • paper_authors: Xiang Wang, Shiwei Zhang, Hangjie Yuan, Yingya Zhang, Changxin Gao, Deli Zhao, Nong Sang
  • for: Transferring pretrained vision-language knowledge to few-shot action recognition without manually annotating text.
  • methods: CapFSAR, a plug-and-play framework that uses a captioning foundation model (BLIP) to extract visual features and automatically generate captions, a text encoder to embed the synthetic captions, and a Transformer-based visual-text aggregation module for cross-modal spatio-temporal matching.
  • results: Extensive experiments on multiple standard few-shot benchmarks show CapFSAR performs favorably against existing methods and achieves state-of-the-art performance.
    Abstract Transferring vision-language knowledge from pretrained multimodal foundation models to various downstream tasks is a promising direction. However, most current few-shot action recognition methods are still limited to a single visual modality input due to the high cost of annotating additional textual descriptions. In this paper, we develop an effective plug-and-play framework called CapFSAR to exploit the knowledge of multimodal models without manually annotating text. To be specific, we first utilize a captioning foundation model (i.e., BLIP) to extract visual features and automatically generate associated captions for input videos. Then, we apply a text encoder to the synthetic captions to obtain representative text embeddings. Finally, a visual-text aggregation module based on Transformer is further designed to incorporate cross-modal spatio-temporal complementary information for reliable few-shot matching. In this way, CapFSAR can benefit from powerful multimodal knowledge of pretrained foundation models, yielding more comprehensive classification in the low-shot regime. Extensive experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods and achieves state-of-the-art performance. The code will be made publicly available.
    摘要 将预训练多模态基础模型中的视觉-语言知识迁移到各种下游任务是一个有前途的方向。然而,由于标注额外文本描述的成本高昂,当前大多数少样本动作识别方法仍然局限于单一的视觉输入。在这篇论文中,我们开发了一个有效的即插即用框架CapFSAR,以便在不需要人工标注文本的情况下利用多模态模型的知识。具体来说,我们首先利用一个描述生成基础模型(即BLIP)为输入视频提取视觉特征并自动生成相应的描述文本;然后,对生成的描述应用文本编码器,以获得具有代表性的文本嵌入;最后,我们设计了一个基于Transformer的视觉-文本聚合模块,融合跨模态的时空互补信息,实现可靠的少样本匹配。这样,CapFSAR可以受益于预训练基础模型中强大的多模态知识,在低样本情况下实现更全面的分类。在多个标准少样本基准上的大量实验表明,CapFSAR优于现有方法并达到最先进的性能。代码将公开。

AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion

  • paper_url: http://arxiv.org/abs/2310.10123
  • repo_url: None
  • paper_authors: Yitong Jiang, Zhaoyang Zhang, Tianfan Xue, Jinwei Gu
  • for: solves complex real-world image restoration situations with multiple unknown degradations
  • methods: uses an all-in-one image restoration framework with latent diffusion, including a Blind Image Quality Assessment Module (BIQA) and an All-in-One Image Refinement (AIR) Module, as well as a Structure Correction Module (SCM)
  • results: outperforms state-of-the-art approaches with superior restoration results and supports a wider range of tasks, including real-scenario images with multiple unknown degradations.
    Abstract In this paper, we aim to solve complex real-world image restoration situations, in which, one image may have a variety of unknown degradations. To this end, we propose an all-in-one image restoration framework with latent diffusion (AutoDIR), which can automatically detect and address multiple unknown degradations. Our framework first utilizes a Blind Image Quality Assessment Module (BIQA) to automatically detect and identify the unknown dominant image degradation type of the image. Then, an All-in-One Image Refinement (AIR) Module handles multiple kinds of degradation image restoration with the guidance of BIQA. Finally, a Structure Correction Module (SCM) is proposed to recover the image details distorted by AIR. Our comprehensive evaluation demonstrates that AutoDIR outperforms state-of-the-art approaches by achieving superior restoration results while supporting a wider range of tasks. Notably, AutoDIR is also the first method to automatically handle real-scenario images with multiple unknown degradations.
    摘要 在这篇论文中,我们目标是解决复杂的真实世界图像恢复问题,在这个问题中,一个图像可能具有多种未知的降低效应。为此,我们提议一个整合性图像恢复框架——自适应扩散图像修复(AutoDIR),可以自动检测和解决多种未知降低效应。我们的框架首先利用一个隐藏影像质量评估模块(BIQA)来自动检测和识别图像的未知主要降低类型。然后,一个全面修复(AIR)模块处理多种降低效应的图像修复,以BIQA的指导。最后,我们提出一个结构修复模块(SCM)来恢复图像细节,受到AIR的扭曲影响。我们的全面评估表明,AutoDIR在恢复Result中显示出优于当前方法,同时支持更广泛的任务范围。尤其是,AutoDIR是第一个自动处理真实世界图像中的多种未知降低的方法。

KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training

  • paper_url: http://arxiv.org/abs/2310.10102
  • repo_url: https://github.com/TruongThaoNguyen/kakurenbo
  • paper_authors: Truong Thao Nguyen, Balazs Gerofi, Edgar Josafat Martinez-Noriega, François Trahay, Mohamed Wahib
  • for: Reducing the cost of training deep neural networks by adaptively hiding the least-important samples.
  • methods: Uses training loss and prediction confidence to dynamically exclude samples in each epoch based on their contribution to learning, while accounting for the reduced number of SGD updates.
  • results: On several large-scale datasets and models for image classification and segmentation, total training time drops by up to 22% with only a 0.4% accuracy impact relative to the baseline. Code available at https://github.com/TruongThaoNguyen/kakurenbo
    Abstract This paper proposes a method for hiding the least-important samples during the training of deep neural networks to increase efficiency, i.e., to reduce the cost of training. Using information about the loss and prediction confidence during training, we adaptively find samples to exclude in a given epoch based on their contribution to the overall learning process, without significantly degrading accuracy. We explore the convergence properties when accounting for the reduction in the number of SGD updates. Empirical results on various large-scale datasets and models used directly in image classification and segmentation show that while the with-replacement importance sampling algorithm performs poorly on large datasets, our method can reduce total training time by up to 22% while impacting accuracy only by 0.4% compared to the baseline. Code available at https://github.com/TruongThaoNguyen/kakurenbo
    摘要 本文提出一种在深度神经网络训练过程中隐藏最不重要样本的方法,以提高效率、降低训练成本。利用训练期间的损失和预测置信度信息,我们在每个epoch中自适应地选择对整体学习贡献较小的样本并将其排除,而不会明显降低精度。我们还分析了在SGD更新次数减少情况下的收敛性质。在多个大规模数据集以及直接用于图像分类与分割的模型上的实验结果表明,带放回的重要性采样算法在大数据集上表现不佳,而我们的方法可将总训练时间最多降低22%,精度仅比基线下降0.4%。代码见 https://github.com/TruongThaoNguyen/kakurenbo
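
A minimal version of the idea, skipping, per epoch, the fraction of samples that currently contribute least as judged from their last observed loss, can be sketched as below. The selection rule and hiding fraction are simplified assumptions; the paper additionally uses prediction confidence and adjusts for the reduced number of SGD updates.

```python
import numpy as np

def choose_visible_samples(last_losses: np.ndarray, hide_fraction: float) -> np.ndarray:
    """Return indices of samples to keep this epoch, hiding the lowest-loss ones."""
    n_hide = int(hide_fraction * len(last_losses))
    order = np.argsort(last_losses)          # ascending: smallest loss = least informative
    return np.sort(order[n_hide:])           # keep everything except the n_hide easiest samples

# Toy loop: losses shrink over training, so the hidden set adapts every epoch.
rng = np.random.default_rng(0)
losses = rng.uniform(0.0, 2.0, size=10_000)
for epoch in range(3):
    visible = choose_visible_samples(losses, hide_fraction=0.2)
    # ... run SGD over `visible` only, then record fresh losses for those samples ...
    losses[visible] *= rng.uniform(0.8, 1.0, size=len(visible))
```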

A Multi-Scale Spatial Transformer U-Net for Simultaneously Automatic Reorientation and Segmentation of 3D Nuclear Cardiac Images

  • paper_url: http://arxiv.org/abs/2310.10095
  • repo_url: None
  • paper_authors: Yangfan Ni, Duo Zhang, Gege Ma, Lijun Lu, Zhongke Huang, Wentao Zhu
  • for: Improving the accuracy of left ventricle (LV) reorientation and segmentation in nuclear cardiac imaging for quantitative multimodal analysis.
  • methods: An end-to-end model combining a multi-scale spatial transformer network (MSSTN) and a multi-scale UNet (MSUNet) to simultaneously reorient and segment the LV region from nuclear cardiac images, trained and tested on 13N-ammonia PET and 99mTc-sestamibi SPECT.
  • results: Experiments show the proposed method significantly improves reorientation and segmentation performance; the joint learning framework lets the two tasks reinforce each other, yielding an accurate and efficient image-processing workflow.
    Abstract Accurate reorientation and segmentation of the left ventricular (LV) is essential for the quantitative analysis of myocardial perfusion imaging (MPI), in which one critical step is to reorient the reconstructed transaxial nuclear cardiac images into standard short-axis slices for subsequent image processing. Small-scale LV myocardium (LV-MY) region detection and the diverse cardiac structures of individual patients pose challenges to LV segmentation operation. To mitigate these issues, we propose an end-to-end model, named as multi-scale spatial transformer UNet (MS-ST-UNet), that involves the multi-scale spatial transformer network (MSSTN) and multi-scale UNet (MSUNet) modules to perform simultaneous reorientation and segmentation of LV region from nuclear cardiac images. The proposed method is trained and tested using two different nuclear cardiac image modalities: 13N-ammonia PET and 99mTc-sestamibi SPECT. We use a multi-scale strategy to generate and extract image features with different scales. Our experimental results demonstrate that the proposed method significantly improves the reorientation and segmentation performance. This joint learning framework promotes mutual enhancement between reorientation and segmentation tasks, leading to cutting edge performance and an efficient image processing workflow. The proposed end-to-end deep network has the potential to reduce the burden of manual delineation for cardiac images, thereby providing multimodal quantitative analysis assistance for physicists.
    摘要 要实现多Modal量子分析的协助,我们提出了一种结合多级空间变换网络(MSSTN)和多级UNet(MSUNet)模块的端到端模型,用于自动将核心心脏图像重定向和分割为标准短轴扁平图像。左心室(LV)区域检测和各个患者的各种心脏结构带来了重orientation和分割任务的挑战。我们的方法通过一种多级strategy来生成和提取图像特征,从而提高重定向和分割性能。我们对13N-氨基酸PET和99mTc-氯丙胺SPECT两种核心心脏图像模式进行训练和测试,并取得了显著提高重定向和分割性能的结果。这种结合学习框架可以减少心脏图像手动分割的劳动,从而为物理学家提供多模态量子分析的协助。

PUCA: Patch-Unshuffle and Channel Attention for Enhanced Self-Supervised Image Denoising

  • paper_url: http://arxiv.org/abs/2310.10088
  • repo_url: https://github.com/HyemiEsme/PUCA
  • paper_authors: Hyemi Jang, Junsung Park, Dahuin Jung, Jaihyun Lew, Ho Bae, Sungroh Yoon
  • for: Self-supervised image denoising.
  • methods: Uses patch-unshuffle/shuffle to dramatically expand receptive fields while preserving J-invariance, together with dilated attention blocks (DABs) for global context.
  • results: Experiments show PUCA achieves state-of-the-art performance in self-supervised image denoising, outperforming existing methods.
    Abstract Although supervised image denoising networks have shown remarkable performance on synthesized noisy images, they often fail in practice due to the difference between real and synthesized noise. Since clean-noisy image pairs from the real world are extremely costly to gather, self-supervised learning, which utilizes noisy input itself as a target, has been studied. To prevent a self-supervised denoising model from learning identical mapping, each output pixel should not be influenced by its corresponding input pixel; This requirement is known as J-invariance. Blind-spot networks (BSNs) have been a prevalent choice to ensure J-invariance in self-supervised image denoising. However, constructing variations of BSNs by injecting additional operations such as downsampling can expose blinded information, thereby violating J-invariance. Consequently, convolutions designed specifically for BSNs have been allowed only, limiting architectural flexibility. To overcome this limitation, we propose PUCA, a novel J-invariant U-Net architecture, for self-supervised denoising. PUCA leverages patch-unshuffle/shuffle to dramatically expand receptive fields while maintaining J-invariance and dilated attention blocks (DABs) for global context incorporation. Experimental results demonstrate that PUCA achieves state-of-the-art performance, outperforming existing methods in self-supervised image denoising.
    摘要 尽管监督的图像噪声网络已经在合成噪声图像上展现出惊人的表现,但在实际应用中它们经常失败,因为实际噪声和合成噪声之间存在差异。由于干净图像和噪声图像的对应对不容易获得,自动学习,即使用噪声输入本身作为目标,已经被研究。为保证自动学习噪声除掉模型不学习同一个映射,每个输出像素不能受到其对应的输入像素的影响,这种要求被称为J-不变性。盲区网络(BSN)在保证J-不变性方面广泛应用。然而,通过在BSN中注射额外操作,如下采样,可能会暴露盲区信息,从而违反J-不变性。因此,特制的BSN材料只能被允许,限制了建筑的创新性。为了突破这一限制,我们提出了PUCA,一种新的J-不变的U-Net架构,用于自动学习噪声除掉。PUCA利用质心不排序/排序来巨大地扩大接收场,同时保持J-不变性,并采用扩展的注意块(DABs)来 incorporate global context。实验结果表明,PUCA可以达到状态之内的表现,比存在的方法在自动学习噪声除掉中高效。
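
The patch-unshuffle/shuffle pair at the heart of PUCA is closely related to PyTorch's pixel_unshuffle/pixel_shuffle: trading spatial resolution for channels enlarges the effective receptive field of subsequent blind-spot-compatible operations, and the rearrangement is exactly invertible. The snippet below only demonstrates that invertibility, not the full network.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 64)            # noisy input image

# Fold space into channels (factor 2): (1, 3, 64, 64) -> (1, 12, 32, 32).
down = F.pixel_unshuffle(x, downscale_factor=2)

# ... blind-spot / dilated attention processing would happen here ...

# Shuffle back to the original resolution; the round trip is lossless.
up = F.pixel_shuffle(down, upscale_factor=2)
assert torch.equal(up, x)
print(down.shape, up.shape)
```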

Expression Domain Translation Network for Cross-domain Head Reenactment

  • paper_url: http://arxiv.org/abs/2310.10073
  • repo_url: None
  • paper_authors: Taewoong Kang, Jeongsik Oh, Jaeseong Lee, Sunghyun Park, Jaegul Choo
  • for: The paper aims to improve cross-domain head reenactment, specifically transferring human motions to cartoon characters.
  • methods: The paper introduces a novel expression domain translation network that transforms human expressions into anime expressions, using a 3D geometric-aware loss function to ensure geometric consistency between the input and output expressions.
  • results: The proposed method outperforms existing methods in both qualitative and quantitative analysis, demonstrating a significant advancement in cross-domain head reenactment.
    Abstract Despite the remarkable advancements in head reenactment, the existing methods face challenges in cross-domain head reenactment, which aims to transfer human motions to domains outside the human, including cartoon characters. It is still difficult to extract motion from out-of-domain images due to the distinct appearances, such as large eyes. Recently, previous work introduced a large-scale anime dataset called AnimeCeleb and a cross-domain head reenactment model, including an optimization-based mapping function to translate the human domain's expressions to the anime domain. However, we found that the mapping function, which relies on a subset of expressions, imposes limitations on the mapping of various expressions. To solve this challenge, we introduce a novel expression domain translation network that transforms human expressions into anime expressions. Specifically, to maintain the geometric consistency of expressions between the input and output of the expression domain translation network, we employ a 3D geometric-aware loss function that reduces the distances between the vertices in the 3D mesh of the human and anime. By doing so, it forces high-fidelity and one-to-one mapping with respect to two cross-expression domains. Our method outperforms existing methods in both qualitative and quantitative analysis, marking a significant advancement in the field of cross-domain head reenactment.
    摘要 尽管有很多HEADreenactment的进步,现有的方法在跨domain HEADreenactment中遇到了困难,即将人类动作转移到不同的领域,如漫画人物。因为这些领域的外观有很大的不同,例如大的眼睛,提取动作从不同领域的图像仍然是很困难的。在最近,之前的工作已经介绍了一个大规模的漫画数据集called AnimeCeleb和一种跨领域HEADreenactment模型,包括一种优化基于映射函数,将人类领域的表情翻译到漫画领域。然而,我们发现这种映射函数,它基于一 subset of 表情,强制了表情的限制。为解决这个挑战,我们引入了一种新的表情频率翻译网络,可以将人类表情翻译成漫画表情。在我们的方法中,我们采用了一种3D геометрически感知的损失函数,以保持表情的几何一致性。这使得我们的方法可以具有高精度和一对一的映射性,对于两个跨表情频率的跨领域映射。我们的方法在质量和量度分析中都超过了现有的方法,标志着跨领域HEADreenactment领域的重要进步。
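
The 3D geometric-aware loss described above amounts to penalizing distances between corresponding vertices of the source and generated expression meshes. A minimal per-vertex L2 version is sketched below; the exact weighting, mesh template, and vertex correspondence scheme are the paper's and are not reproduced here.

```python
import torch

def vertex_consistency_loss(pred_vertices: torch.Tensor,
                            target_vertices: torch.Tensor) -> torch.Tensor:
    """Mean distance between corresponding 3D mesh vertices.

    pred_vertices, target_vertices: (B, V, 3) batches of meshes sharing the same topology.
    """
    return torch.norm(pred_vertices - target_vertices, dim=-1).mean()

# Toy usage with an illustrative vertex count (5023 happens to be FLAME-sized).
pred = torch.randn(4, 5023, 3, requires_grad=True)
target = torch.randn(4, 5023, 3)
loss = vertex_consistency_loss(pred, target)
loss.backward()
```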

ZoomTrack: Target-aware Non-uniform Resizing for Efficient Visual Tracking

  • paper_url: http://arxiv.org/abs/2310.10071
  • repo_url: https://github.com/kou-99/zoomtrack
  • paper_authors: Yutong Kou, Jin Gao, Bing Li, Gang Wang, Weiming Hu, Yizheng Wang, Liang Li
  • for: Achieving high tracking speed with a smaller input size while narrowing or even closing the performance gap to performance-oriented trackers.
  • methods: Non-uniformly resizes the cropped image so that regions where the target is likely to appear keep a higher resolution; the resizing is formulated as a quadratic programming (QP) problem and integrates naturally with most crop-based local trackers.
  • results: Consistent improvements on five challenging datasets with two transformer trackers (OSTrack and TransT); applied to the speed-oriented OSTrack, it even outperforms the performance-oriented counterpart by 0.6% AUC on TNL2K while running 50% faster and saving over 55% of MACs.
    Abstract Recently, the transformer has enabled the speed-oriented trackers to approach state-of-the-art (SOTA) performance with high-speed thanks to the smaller input size or the lighter feature extraction backbone, though they still substantially lag behind their corresponding performance-oriented versions. In this paper, we demonstrate that it is possible to narrow or even close this gap while achieving high tracking speed based on the smaller input size. To this end, we non-uniformly resize the cropped image to have a smaller input size while the resolution of the area where the target is more likely to appear is higher and vice versa. This enables us to solve the dilemma of attending to a larger visual field while retaining more raw information for the target despite a smaller input size. Our formulation for the non-uniform resizing can be efficiently solved through quadratic programming (QP) and naturally integrated into most of the crop-based local trackers. Comprehensive experiments on five challenging datasets based on two kinds of transformer trackers, \ie, OSTrack and TransT, demonstrate consistent improvements over them. In particular, applying our method to the speed-oriented version of OSTrack even outperforms its performance-oriented counterpart by 0.6% AUC on TNL2K, while running 50% faster and saving over 55% MACs. Codes and models are available at https://github.com/Kou-99/ZoomTrack.
    摘要 最近,transformer已经使得速度强调跟踪器可以达到状态之Art(SOTA)性能,并且具有更高的速度,即使使用更小的输入大小或更轻量级的特征提取核心。然而,它们仍然较相对落后于其对应的性能强调版本。在这篇论文中,我们展示了可以减少或甚至消除这个差距,而且可以在更小的输入大小下实现高速跟踪。为此,我们非均匀地缩放cropped图像,使得target的可能出现的区域的分辨率更高,而其他区域的分辨率相对较低。这样可以解决在更小的输入大小下尚可以保留更多的原始信息来搜寻target的问题。我们的非均匀缩放的形式可以通过quadratic programming(QP)有效地解决,并且自然地整合到大多数的crop-based本地跟踪器中。我们在五个复杂的数据集上进行了广泛的实验,并示出了一致的改进。特别是,对于速度强调版本的OSTrack,我们的方法可以在TNL2K上提高0.6%的AUC,并且在50%的速度下运行,并且占用了55%的MACs。代码和模型可以在https://github.com/Kou-99/ZoomTrack上获取。

Generalizable Person Search on Open-world User-Generated Video Content

  • paper_url: http://arxiv.org/abs/2310.10068
  • repo_url: None
  • paper_authors: Junjie Li, Guanshuo Wang, Yichao Yan, Fufu Yu, Qiong Jia, Jie Qin, Shouhong Ding, Xiaokang Yang
  • for: Improving the out-of-domain generalization of person search models, particularly when training on user-generated video content and deploying to surveillance-style scenes.
  • methods: A generalizable framework with feature-level and data-level generalization: multi-task prototype-based domain-specific batch normalization and a channel-wise ID-relevant feature decorrelation strategy for domain-invariant detection and ReID representations, plus handling of typical noise in open-world training frames (inaccurate boxes, missing identity labels, no cross-camera data).
  • results: Achieves promising performance on two challenging person search benchmarks without any human annotation or samples from the target domain.
    Abstract Person search is a challenging task that involves detecting and retrieving individuals from a large set of un-cropped scene images. Existing person search applications are mostly trained and deployed in the same-origin scenarios. However, collecting and annotating training samples for each scene is often difficult due to the limitation of resources and the labor cost. Moreover, large-scale intra-domain data for training are generally not legally available for common developers, due to the regulation of privacy and public security. Leveraging easily accessible large-scale User Generated Video Contents (\emph{i.e.} UGC videos) to train person search models can fit the open-world distribution, but still suffering a performance gap from the domain difference to surveillance scenes. In this work, we explore enhancing the out-of-domain generalization capabilities of person search models, and propose a generalizable framework on both feature-level and data-level generalization to facilitate downstream tasks in arbitrary scenarios. Specifically, we focus on learning domain-invariant representations for both detection and ReID by introducing a multi-task prototype-based domain-specific batch normalization, and a channel-wise ID-relevant feature decorrelation strategy. We also identify and address typical sources of noise in open-world training frames, including inaccurate bounding boxes, the omission of identity labels, and the absence of cross-camera data. Our framework achieves promising performance on two challenging person search benchmarks without using any human annotation or samples from the target domain.
    摘要 人体搜索是一项复杂的任务,它涉及到从大量未修剪场景图像中检测和检索人体。现有的人体搜索应用程序都是在同一个来源场景中训练和部署的。然而,收集和标注训练样本的成本是非常高,尤其是在资源和劳动力方面。此外,大规模内域数据 для训练通常不可得,因为隐私和公共安全的法规限制。我们利用可达性高的用户生成内容(i.e., UGC视频)来训练人体搜索模型,以适应开放世界分布。然而,这些模型仍然受到频率域不同的问题带来的性能差。在这种情况下,我们提出了一种通用的框架,以便在无法预测的情况下进行下游任务。我们主要关注于学习域外 invariant 表示,包括检测和 ReID 领域的学习。我们引入多任务prototype-based域特定批处理,以及通道级 ID 相关特征修饰策略。我们还识别和解决常见的开放世界训练帧中的噪声源,包括不准确的 bounding box、缺失标签和 cross-camera 数据缺失。我们的框架在两个复杂的人体搜索标准准点上达到了无需使用人类标注或目标域样本的承诺性能。

  • paper_url: http://arxiv.org/abs/2310.10061
  • repo_url: https://github.com/rachelfheaton/CASPER-model
  • paper_authors: Rachel F. Heaton
  • for: This paper aims to understand the nature of human visual representations and processes through the study of visual search.
  • methods: The paper presents a theory of visual search based on empirical findings and instantiated in a computational model called CASPER (Concurrent Attention: Serial and Parallel Evaluation with Relations), extended to handle search for items defined by the spatial relations among their features.
  • results: The paper describes seven experiments (four main experiments and three replications) that test CASPER's predictions about relational search, and shows with additional simulations that CASPER can account for negative acceleration in search functions for relational stimuli.
    Abstract The following is a dissertation aimed at understanding what the various phenomena in visual search teach us about the nature of human visual representations and processes. I first review some of the major empirical findings in the study of visual search. I next present a theory of visual search in terms of what I believe these findings suggest about the representations and processes underlying ventral visual processing. These principles are instantiated in a computational model called CASPER (Concurrent Attention: Serial and Parallel Evaluation with Relations), originally developed by Hummel, that I have adapted to account for a range of phenomena in visual search. I then describe an extension of the CASPER model to account for our ability to search for visual items defined not simply by the features composing those items but by the spatial relations among those features. Seven experiments (four main experiments and three replications) are described that test CASPER's predictions about relational search. Finally, I evaluate the fit between CASPER's predictions and the empirical findings and show with three additional simulations that CASPER can account for negative acceleration in search functions for relational stimuli if one postulates that the visual system is leveraging an emergent feature that bypasses relational processing.
    摘要 这是一篇关于视觉搜寻的论文,旨在了解人类视觉表示和过程中的本质。我首先介绍了视觉搜寻的一些主要实验发现,然后提出了基于这些发现的视觉搜寻理论。这些原则通过我改编的、最初由Hummel开发的计算模型CASPER(同时注意力:串行和并行评估与关系)来实现。我随后描述了一种扩展的CASPER模型,以便解释我们依据特征之间的空间关系来搜寻视觉物体的能力。然后,我描述了七个实验(四个主要实验和三个复现),用于测试CASPER模型对关系搜寻的预测。最后,我评估了CASPER模型的预测与实验发现之间的一致性,并通过三个额外的仿真显示:如果假定视觉系统利用了一种可绕过关系加工的涌现特征,CASPER模型就可以解释关系性刺激搜寻函数中的负加速现象。

EAR-Net: Pursuing End-to-End Absolute Rotations from Multi-View Images

  • paper_url: http://arxiv.org/abs/2310.10051
  • repo_url: None
  • paper_authors: Yuzhen Liu, Qiulei Dong
  • for: 提供一种结构为深度神经网络的综合方法,用于从多视图图像中估计绝对旋转。
  • methods: 使用深度神经网络建立 epipolar confidence graph,并使用 confidence-aware rotation averaging 模块来预测绝对旋转。
  • results: 在三个公共数据集上,EAR-Net 比现有方法提高了准确性和速度。
    Abstract Absolute rotation estimation is an important topic in 3D computer vision. Existing works in the literature generally employ a multi-stage (at least two-stage) estimation strategy where multiple independent operations (feature matching, two-view rotation estimation, and rotation averaging) are implemented sequentially. However, such a multi-stage strategy inevitably leads to the accumulation of the errors caused by each involved operation, and accordingly degrades its final estimation of global rotations. To address this problem, we propose an End-to-end method for estimating Absolute Rotations from multi-view images based on deep neural Networks, called EAR-Net. The proposed EAR-Net consists of an epipolar confidence graph construction module and a confidence-aware rotation averaging module. The epipolar confidence graph construction module is designed to simultaneously predict pairwise relative rotations among the input images and their corresponding confidences, resulting in a weighted graph (called epipolar confidence graph). Based on this graph, the confidence-aware rotation averaging module, which is differentiable, is used to predict the absolute rotations. Thanks to the introduced confidences of the relative rotations, the proposed EAR-Net can effectively handle outlier cases. Experimental results on three public datasets demonstrate that EAR-Net outperforms the state-of-the-art methods by a large margin in terms of accuracy and speed.
    摘要 三维计算机视觉中的绝对旋转估算是一个重要的话题。现有文献中的方法通常采用多阶段(至少两阶段)策略,顺序执行多个独立的操作(特征匹配、两视旋转估算和旋转平均),这会导致各个操作的误差不断积累,从而影响最终的全局旋转估算。为解决这个问题,我们提出了基于深度神经网络的端到端绝对旋转估算方法,称为 EAR-Net。EAR-Net 包括对极置信图构建模块和置信度感知旋转平均模块。对极置信图构建模块可以同时预测输入图像之间的成对相对旋转及其置信度,从而构建一个加权图(称为对极置信图)。基于这个图,可微分的置信度感知旋转平均模块可以预测绝对旋转。由于引入了相对旋转的置信度,EAR-Net 可以有效地处理异常值情况。实验结果表明,EAR-Net 在三个公共数据集上的准确率和速度都大幅超过当前最先进方法。
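    The abstract does not state the exact objective of the confidence-aware rotation averaging module; one common weighted chordal formulation that such a differentiable module could relax is, as a sketch (notation assumed):

```latex
\min_{R_1,\dots,R_N \in SO(3)} \;\; \sum_{(i,j)\in\mathcal{E}} w_{ij}\,\bigl\| R_j - R_i\,\hat{R}_{ij} \bigr\|_F^2
```

Here the predicted pairwise relative rotations \(\hat{R}_{ij}\) and their confidences \(w_{ij}\) are exactly the edge attributes of the epipolar confidence graph, so low-confidence (likely outlier) edges contribute little to the averaged absolute rotations.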

Hyperspectral Image Fusion via Logarithmic Low-rank Tensor Ring Decomposition

  • paper_url: http://arxiv.org/abs/2310.10044
  • repo_url: None
  • paper_authors: Jun Zhang, Lipeng Zhu, Chao Wang, Shutao Li
  • for: 本研究旨在改进低分辨率高光谱图像(LR-HSI)与高分辨率多光谱图像(HR-MSI)的融合方法,以获得高分辨率高光谱图像(HR-HSI)。
  • methods: 本研究采用张量环(TR)分解,并在每个TR因子上引入模-2对数张量核范数(LTNN)正则化以保持高维低秩结构,同时结合加权全变分。
  • results: 实验结果表明,所提方法可以提高视觉质量,并在多种量化指标上超过现有的最先进融合方法。
    Abstract Integrating a low-spatial-resolution hyperspectral image (LR-HSI) with a high-spatial-resolution multispectral image (HR-MSI) is recognized as a valid method for acquiring HR-HSI. Among the current fusion approaches, the tensor ring (TR) decomposition-based method has received growing attention owing to its superior performance on preserving the spatial-spectral correlation. Furthermore, the low-rank property in some TR factors has been exploited via the matrix nuclear norm regularization along mode-2. On the other hand, the tensor nuclear norm (TNN)-based approaches have recently demonstrated to be more efficient on keeping high-dimensional low-rank structures in tensor recovery. Here, we study the low-rankness of TR factors from the TNN perspective and consider the mode-2 logarithmic TNN (LTNN) on each TR factor. A novel fusion model is proposed by incorporating this LTNN regularization and the weighted total variation which is to promote the continuity of HR-HSI in the spatial-spectral domain. Meanwhile, we have devised a highly efficient proximal alternating minimization algorithm to solve the proposed model. The experimental results indicate that our method improves the visual quality and exceeds the existing state-of-the-art fusion approaches with respect to various quantitative metrics.
    摘要 将低空间分辨率高光谱图像(LR-HSI)与高空间分辨率多光谱图像(HR-MSI)融合,被认为是获取高分辨率高光谱图像(HR-HSI)的一种有效途径。在现有融合方法中,基于张量环(TR)分解的方法因能很好地保留空间-光谱相关性而受到越来越多的关注。此外,部分TR因子中的低秩性质已通过沿模-2方向的矩阵核范数正则化得到利用。另一方面,基于张量核范数(TNN)的方法近来被证明在张量恢复中能更有效地保持高维低秩结构。本文从TNN的角度研究TR因子的低秩性,并在每个TR因子上考虑模-2对数张量核范数(LTNN)。我们提出了一种新的融合模型,将该LTNN正则化与用于促进HR-HSI在空间-光谱域连续性的加权全变分相结合。同时,我们设计了一种高效的邻近交替最小化算法来求解所提模型。实验结果表明,我们的方法能够提升视觉质量,并在各种量化指标上超过现有的最先进融合方法。
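    The precise weights and operators are not given in the abstract; as a hedged illustration, a fusion objective of the described form, with a log-surrogate on the singular values used by the mode-2 tensor nuclear norm of each TR factor, could be written as:

```latex
\min_{\mathcal{X}} \;\; \|\mathcal{Y}_h - \mathcal{D}(\mathcal{X})\|_F^2
 + \|\mathcal{Y}_m - \mathcal{X}\times_3 \mathbf{P}\|_F^2
 + \lambda \sum_{k} \sum_{i} \log\!\bigl(\sigma_i(\mathcal{G}_k) + \varepsilon\bigr)
 + \tau\,\mathrm{WTV}(\mathcal{X})
```

where \(\mathcal{X}\) is the target HR-HSI with TR factors \(\{\mathcal{G}_k\}\), \(\mathcal{D}\) the spatial degradation producing the LR-HSI \(\mathcal{Y}_h\), \(\mathbf{P}\) the spectral downsampling producing the HR-MSI \(\mathcal{Y}_m\), and WTV the weighted total variation; \(\lambda, \tau, \varepsilon\) are assumed trade-off constants.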

Evading Detection Actively: Toward Anti-Forensics against Forgery Localization

  • paper_url: http://arxiv.org/abs/2310.10036
  • repo_url: None
  • paper_authors: Long Zhuo, Shenghai Luo, Shunquan Tan, Han Chen, Bin Li, Jiwu Huang
  • for: 研究面向伪造定位的反取证技术,使像素级伪造定位检测器无法正确定位被篡改的区域。
  • methods: 提出结合自监督与对抗训练的算法SEAR,训练深度学习反取证模型,以擦除篡改痕迹并欺骗现有的伪造定位检测器。
  • results: 在多个数据集上成功欺骗了最先进的伪造定位方法,并解决了传统对抗攻击方法的三个缺陷。
    Abstract Anti-forensics seeks to eliminate or conceal traces of tampering artifacts. Typically, anti-forensic methods are designed to deceive binary detectors and persuade them to misjudge the authenticity of an image. However, to the best of our knowledge, no attempts have been made to deceive forgery detectors at the pixel level and mis-locate forged regions. Traditional adversarial attack methods cannot be directly used against forgery localization due to the following defects: 1) they tend to just naively induce the target forensic models to flip their pixel-level pristine or forged decisions; 2) their anti-forensics performance tends to be severely degraded when faced with unseen forensic models; 3) they lose validity once the target forensic models are retrained with the anti-forensics images generated by them. To tackle these three defects, we propose SEAR (Self-supErvised Anti-foRensics), a novel self-supervised and adversarial training algorithm that effectively trains deep-learning anti-forensic models against forgery localization. SEAR sets a pretext task to reconstruct perturbation for self-supervised learning. In adversarial training, SEAR employs a forgery localization model as a supervisor to explore tampering features and constructs a deep-learning concealer to erase corresponding traces. We have conducted large-scale experiments across diverse datasets. The experimental results demonstrate that, through the combination of self-supervised learning and adversarial learning, SEAR successfully deceives the state-of-the-art forgery localization methods and tackles the three defects of traditional adversarial attack methods mentioned above.
    摘要 反取证技术旨在消除或掩盖篡改痕迹。通常,反取证方法的目标是欺骗二值检测器,使其对图像的真实性做出错误判断。然而,据我们所知,尚无工作尝试在像素级欺骗伪造检测器、使其错误定位伪造区域。传统的对抗攻击方法无法直接用于对抗伪造定位,原因在于以下三点缺陷:1. 它们往往只是简单地诱使目标取证模型翻转其像素级的原始/伪造判定;2. 面对未见过的取证模型时,它们的反取证性能会严重下降;3. 一旦目标取证模型用它们生成的反取证图像重新训练,它们便会失效。为了解决这三点缺陷,我们提出了 SEAR(Self-supErvised Anti-foRensics),一种新的自监督与对抗训练相结合的算法,用于有效地训练针对伪造定位的深度学习反取证模型。SEAR 设置了一个重建扰动的前置任务用于自监督学习;在对抗训练中,SEAR 以一个伪造定位模型作为监督者来挖掘篡改特征,并构建一个深度学习隐匿器来擦除相应的痕迹。我们在多个数据集上进行了大规模实验,结果表明,通过自监督学习与对抗学习的结合,SEAR 成功欺骗了当前最先进的伪造定位方法,同时解决了上述传统对抗攻击方法的三个缺陷。
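    The abstract does not give SEAR's exact losses; as a minimal sketch of the adversarial half only (the function names, the BCE evasion term, and the L1 fidelity weight are assumptions), a concealer could be trained against a frozen localizer like this:

```python
import torch
import torch.nn.functional as F

def concealer_step(concealer, localizer, x_forged, lam=10.0):
    """One hypothetical anti-forensic training step (sketch).

    The concealer rewrites a forged image so that a frozen forgery
    localizer predicts an all-pristine mask, while an L1 term keeps
    the output visually close to the input.
    """
    x_adv = concealer(x_forged)
    pred_mask = localizer(x_adv)                        # (N, 1, H, W) logits
    evade = F.binary_cross_entropy_with_logits(
        pred_mask, torch.zeros_like(pred_mask))         # push toward "pristine"
    fidelity = F.l1_loss(x_adv, x_forged)
    return evade + lam * fidelity
```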

Deep Unfolding Network for Image Compressed Sensing by Content-adaptive Gradient Updating and Deformation-invariant Non-local Modeling

  • paper_url: http://arxiv.org/abs/2310.10033
  • repo_url: None
  • paper_authors: Wenxue Cui, Xiaopeng Fan, Jian Zhang, Debin Zhao
  • for: 用于图像压缩感知(CS)领域的深度 unfolding 网络(DUN)的改进。
  • methods: 提出了一种基于传统的 Proximal Gradient Descent(PGD)算法的 novel DUN 网络(dubbed DUN-CSNet),以解决现有 DUN 中的两个问题:1)大多数超参数是独立于输入内容的,限制了其适应性;2)在每次迭代中使用的普通 convolutional neural network 弱化了更广泛的上下文优先顺序,导致表达能力下降。
  • results: 经验表明,提出的 DUN-CSNet 在图像压缩感知领域的表现较前者有大幅提升。
    Abstract Inspired by certain optimization solvers, the deep unfolding network (DUN) has attracted much attention in recent years for image compressed sensing (CS). However, there still exist the following two issues: 1) In existing DUNs, most hyperparameters are usually content independent, which greatly limits their adaptability for different input contents. 2) In each iteration, a plain convolutional neural network is usually adopted, which weakens the perception of wider context prior and therefore depresses the expressive ability. In this paper, inspired by the traditional Proximal Gradient Descent (PGD) algorithm, a novel DUN for image compressed sensing (dubbed DUN-CSNet) is proposed to solve the above two issues. Specifically, for the first issue, a novel content adaptive gradient descent network is proposed, in which a well-designed step size generation sub-network is developed to dynamically allocate the corresponding step sizes for different textures of input image by generating a content-aware step size map, realizing a content-adaptive gradient updating. For the second issue, considering the fact that many similar patches exist in an image but have undergone a deformation, a novel deformation-invariant non-local proximal mapping network is developed, which can adaptively build the long-range dependencies between the nonlocal patches by deformation-invariant non-local modeling, leading to a wider perception on context priors. Extensive experiments manifest that the proposed DUN-CSNet outperforms existing state-of-the-art CS methods by large margins.
    摘要 traditional Proximal Gradient Descent (PGD) 算法的灵感,一种新的深度 unfolding 网络(DUN)为图像压缩感知(CS)提出了一种新的方法。在这种方法中,存在两个问题:1)在现有的 DUN 中,大多数超参数是独立于输入内容的,这限制了它们的适应性。2)在每个迭代中,通常采用平面卷积神经网络,这弱化了更广泛的上下文先验,从而降低了表达能力。在这篇论文中,我们提出了一种新的 DUN-CSNet,以解决以上两个问题。Specifically,为了解决第一个问题,我们提出了一种新的内容适应的梯度下降网络,其中包括一个 Well-designed 步长生成子网络,通过生成内容快照映射,实现内容适应的梯度更新。为了解决第二个问题,我们发展了一种新的非 lok 的非局部抽象映射网络,该网络可以在不同的扭变下自适应地建立非局部的长距离依赖关系,从而扩大上下文先验的视野。经过广泛的实验,我们发现,提出的 DUN-CSNet 可以舒适性地击败现有的CS方法。
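    For orientation, the gradient-descent half of each unfolded iteration described above can be written, under assumed notation (Φ the sampling matrix, y the CS measurements, ρ the generated content-aware step-size map), as a sketch:

```latex
\mathbf{x}^{(k+1)} \;=\; \mathrm{prox}_{\mathcal{R}}\!\Bigl(\mathbf{x}^{(k)} \;-\; \boldsymbol{\rho}^{(k)}\!\bigl(\mathbf{x}^{(k)}\bigr)\odot \boldsymbol{\Phi}^{\!\top}\bigl(\boldsymbol{\Phi}\,\mathbf{x}^{(k)}-\mathbf{y}\bigr)\Bigr)
```

Here \(\odot\) is element-wise multiplication, so the step size varies per pixel with the image content, and \(\mathrm{prox}_{\mathcal{R}}\) is realized by the deformation-invariant non-local proximal mapping network, meaning both the step size and the proximal operator are learned rather than hand-set.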

RoomDesigner: Encoding Anchor-latents for Style-consistent and Shape-compatible Indoor Scene Generation

  • paper_url: http://arxiv.org/abs/2310.10027
  • repo_url: https://github.com/zhao-yiqun/roomdesigner
  • paper_authors: Yiqun Zhao, Zibo Zhao, Jing Li, Sixun Dong, Shenghua Gao
  • for: 本研究旨在创造具有尺度、样式兼容的室内场景,以便在室内设计和规划中提供更加真实和可信的场景。
  • methods: 本研究提出了一种两stage模型,首先使用离散 вектор量化来编码家具为anchor-latent,然后利用 transformer 模型预测室内场景。通过 incorporating anchor-latent 表示,我们的生成模型可以生成具有尺度和样式兼容的家具布局。
  • results: 实验结果表明,我们的方法可以在 3D-Front 数据集上生成更加一致和兼容的室内场景,而无需shape取样。此外,我们还进行了广泛的ablation研究,以验证我们的设计选择在室内场景生成模型中的效果。
    Abstract Indoor scene generation aims at creating shape-compatible, style-consistent furniture arrangements within a spatially reasonable layout. However, most existing approaches primarily focus on generating plausible furniture layouts without incorporating specific details related to individual furniture pieces. To address this limitation, we propose a two-stage model integrating shape priors into the indoor scene generation by encoding furniture as anchor latent representations. In the first stage, we employ discrete vector quantization to encode furniture pieces as anchor-latents. Based on the anchor-latents representation, the shape and location information of the furniture was characterized by a concatenation of location, size, orientation, class, and our anchor latent. In the second stage, we leverage a transformer model to predict indoor scenes autoregressively. Thanks to incorporating the proposed anchor-latents representations, our generative model produces shape-compatible and style-consistent furniture arrangements and synthesis furniture in diverse shapes. Furthermore, our method facilitates various human interaction applications, such as style-consistent scene completion, object mismatch correction, and controllable object-level editing. Experimental results on the 3D-Front dataset demonstrate that our approach can generate more consistent and compatible indoor scenes compared to existing methods, even without shape retrieval. Additionally, extensive ablation studies confirm the effectiveness of our design choices in the indoor scene generation model.
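    The first-stage anchor-latent encoding relies on discrete vector quantization; a generic nearest-codebook sketch (names and shapes are assumptions, not the paper's implementation) is:

```python
import torch

def quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbour vector quantization (generic sketch).

    z: (N, D) furniture shape features; codebook: (K, D) learned anchors.
    Returns the codebook indices and the quantized 'anchor-latents', which
    can then be concatenated with location, size, orientation and class
    before being fed to the autoregressive transformer.
    """
    dists = torch.cdist(z, codebook)   # (N, K) pairwise distances
    idx = dists.argmin(dim=1)
    return idx, codebook[idx]
```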

An Empirical Study of Super-resolution on Low-resolution Micro-expression Recognition

  • paper_url: http://arxiv.org/abs/2310.10022
  • repo_url: None
  • paper_authors: Ling Zhou, Mingpei Wang, Xiaohua Huang, Wenming Zheng, Qirong Mao, Guoying Zhao
  • for: 本研究旨在提高低分辨率(LR)环境中的微表情识别(MER)精度,特别是在实际应用中的群体MER场景。
  • methods: 本研究使用了七种最新的状态之册(SOTA)MER技术,并对13种SOTA超分解(SR)技术进行评估,以解决SR助成MER中的问题。
  • results: 经验研究表明,SR助成MER在LR场景中存在主要的挑战,并提出了改进SR助成MER的方向。
    Abstract Micro-expression recognition (MER) in low-resolution (LR) scenarios presents an important and complex challenge, particularly for practical applications such as group MER in crowded environments. Despite considerable advancements in super-resolution techniques for enhancing the quality of LR images and videos, few studies have focused on investigating super-resolution for improving LR MER. The scarcity of investigation can be attributed to the inherent difficulty in capturing the subtle motions of micro-expressions, even in original-resolution MER samples, which becomes even more challenging in LR samples due to the loss of distinctive features. Furthermore, a lack of systematic benchmarking and thorough analysis of super-resolution-assisted MER methods has been noted. This paper tackles these issues by conducting a series of benchmark experiments that integrate both super-resolution (SR) and MER methods, guided by an in-depth literature survey. Specifically, we employ seven cutting-edge state-of-the-art (SOTA) MER techniques and evaluate their performance on samples generated from 13 SOTA SR techniques, thereby addressing the problem of super-resolution in MER. Through our empirical study, we uncover the primary challenges associated with SR-assisted MER and identify avenues to tackle these challenges by leveraging recent advancements in both SR and MER methodologies. Our analysis provides insights for progressing toward more efficient SR-assisted MER.
    摘要 低分辨率(LR)环境下的微表情识别(MER)是一项重要而复杂的挑战,尤其是在实际应用中,例如拥挤环境下的群体MER。尽管超分辨率(SR)技术在提升低分辨率图像和视频质量方面取得了长足进展,但很少有研究关注利用超分辨率来改进低分辨率MER。其原因在于,即使在原始分辨率的MER样本中,捕捉微表情的细微运动已然十分困难,而低分辨率样本由于判别性特征的丢失则更加困难;此外,目前还缺乏对超分辨率辅助MER方法的系统基准测试与深入分析。本文通过一系列结合SR与MER方法的基准实验来解决这些问题:我们采用7种最新的MER方法,并在由13种最新的SR方法生成的样本上评估其性能。通过实证研究,我们揭示了SR辅助MER面临的主要挑战,并指出了借助SR与MER领域最新进展加以解决的方向。

Black-box Targeted Adversarial Attack on Segment Anything (SAM)

  • paper_url: http://arxiv.org/abs/2310.10010
  • repo_url: None
  • paper_authors: Sheng Zheng, Chaoning Zhang
  • for: 本研究旨在实现对Segment Anything Model(SAM)的targeted adversarial attack(TAA),以便更好地理解SAM在恶意攻击下的Robustness。
  • methods: 该研究使用了一种简单 yet effective的方法,即只攻击图像Encoder,以解决prompt依赖性。此外,提出了一种新的规范损失来增强cross-model transferability,使攻击图像更有特征Domination。
  • results: 广泛的实验证明了我们提出的简单技术可以成功地实现黑盒TAA on SAM。
    Abstract Deep recognition models are widely vulnerable to adversarial examples, which change the model output by adding quasi-imperceptible perturbation to the image input. Recently, Segment Anything Model (SAM) has emerged to become a popular foundation model in computer vision due to its impressive generalization to unseen data and tasks. Realizing flexible attacks on SAM is beneficial for understanding the robustness of SAM in the adversarial context. To this end, this work aims to achieve a targeted adversarial attack (TAA) on SAM. Specifically, under a certain prompt, the goal is to make the predicted mask of an adversarial example resemble that of a given target image. The task of TAA on SAM has been realized in a recent arXiv work in the white-box setup by assuming access to prompt and model, which is thus less practical. To address the issue of prompt dependence, we propose a simple yet effective approach by only attacking the image encoder. Moreover, we propose a novel regularization loss to enhance the cross-model transferability by increasing the feature dominance of adversarial images over random natural images. Extensive experiments verify the effectiveness of our proposed simple techniques to conduct a successful black-box TAA on SAM.
    摘要 深度识别模型广泛受到敌意例子的攻击,这些攻击通过添加 quasi-不可见的扰动来改变输入图像,从而影响模型的输出。最近,Segment Anything Model(SAM)在计算机视觉领域得到了广泛的应用,因为它在未seen数据和任务上表现出了很好的总体化能力。为了更好地理解SAM在敌意上下文中的稳定性,本工作寻求实现针对SAM的targeted adversarial attack(TAA)。具体来说,在一定的提示下,目标是使针对敌意例子的预测面几乎与给定的目标图像相同。在 white-box 设置下,这项任务在最近的 arXiv 文章中已经实现了,但是假设了对提示和模型的访问,这是不实际的。为解决提示的依赖关系,我们提议一种简单 yet 有效的方法,即只攻击图像Encoder。此外,我们还提议一种新的规范损失,以增强跨模型传输性,通过增加攻击图像的特征主导性,使攻击图像在随机自然图像上占据优势。我们的实验证明了我们提议的简单技术可以成功地实现黑盒 TAA 任务。
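    The abstract describes attacking only the image encoder so that the adversarial embedding resembles that of the target image; a stripped-down PGD-style sketch of that idea is below. Here `encoder` stands in for the (surrogate) SAM image encoder, and the step sizes and the plain L2 feature objective are assumptions — the paper's additional regularization loss for cross-model transferability is not reproduced.

```python
import torch

def targeted_encoder_attack(encoder, x, x_target, steps=50, eps=8 / 255, alpha=1 / 255):
    """Sketch of a targeted attack on an image encoder (assumed API).

    Pushes the embedding of x toward that of a target image so that the
    downstream mask prediction resembles the target, regardless of prompt.
    """
    with torch.no_grad():
        feat_t = encoder(x_target)
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = torch.nn.functional.mse_loss(encoder(x_adv), feat_t)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()          # descend toward the target feature
            x_adv = x + (x_adv - x).clamp(-eps, eps)     # stay within the L_inf ball
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```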

Assessing Encoder-Decoder Architectures for Robust Coronary Artery Segmentation

  • paper_url: http://arxiv.org/abs/2310.10002
  • repo_url: None
  • paper_authors: Shisheng Zhang, Ramtin Gharleghi, Sonit Singh, Arcot Sowmya, Susann Beier
  • for: 避免心血管疾病的诊断延迟,通过精准 coronary artery 分 segmentation,改善病人结果。
  • methods: 使用 convolutional neural networks (CNN) 和 U-Net 架构,以及 25 个不同的 encoder-decoder 组合。
  • results: 使用 ASOCA 公共数据集,对 40 个案例进行分析,发现 EfficientNet-LinkNet 组合的 Dice 乘数为 0.882,95% Percentile Hausdorff 距离为 4.753,表明该模型在 MICCAI 2020 挑战中比其他模型表现更出色。
    Abstract Coronary artery diseases are among the leading causes of mortality worldwide. Timely and accurate diagnosis, facilitated by precise coronary artery segmentation, is pivotal in changing patient outcomes. In the realm of biomedical imaging, convolutional neural networks, especially the U-Net architecture, have revolutionised segmentation processes. However, one of the primary challenges remains the lack of benchmarking datasets specific to coronary arteries. However through the use of the recently published public dataset ASOCA, the potential of deep learning for accurate coronary segmentation can be improved. This paper delves deep into examining the performance of 25 distinct encoder-decoder combinations. Through analysis of the 40 cases provided to ASOCA participants, it is revealed that the EfficientNet-LinkNet combination, serving as encoder and decoder, stands out. It achieves a Dice coefficient of 0.882 and a 95th percentile Hausdorff distance of 4.753. These findings not only underscore the superiority of our model in comparison to those presented at the MICCAI 2020 challenge but also set the stage for future advancements in coronary artery segmentation, opening doors to enhanced diagnostic and treatment strategies.
    摘要 冠状动脉疾病是全球最主要的死亡原因之一,及时而准确的诊断是改善患者预后的关键,而精确的冠状动脉分割对此至关重要。在生物医学影像领域,卷积神经网络(尤其是U-Net架构)已经彻底革新了分割流程。然而,主要挑战之一仍是缺乏针对冠状动脉的基准数据集。借助最近发布的公共数据集ASOCA,可以进一步发挥深度学习在冠状动脉精确分割上的潜力。本文深入分析了25种不同的编码器-解码器组合的表现。通过分析提供给ASOCA参赛者的40个病例发现,以EfficientNet为编码器、LinkNet为解码器的组合表现最为出色,其Dice系数为0.882,95百分位Hausdorff距离为4.753。这些发现不仅表明我们的模型优于MICCAI 2020挑战赛中提出的模型,也为未来冠状动脉分割的发展奠定了基础,有望带来更好的诊断与治疗策略。
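    For reference, the Dice coefficient reported above is the standard overlap metric between predicted and ground-truth masks; a minimal NumPy implementation (the epsilon smoothing is an assumption) is:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice overlap between two binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
```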

SeUNet-Trans: A Simple yet Effective UNet-Transformer Model for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.09998
  • repo_url: None
  • paper_authors: Tan-Hanh Pham, Xianqi Li, Kim-Doang Nguyen
  • for: 这个研究旨在提出一个简单 yet effective的 UNet-Transformer(seUNet-Trans)模型,用于医疗影像分类。
  • methods: 我们使用 UNet 模型作为特征提取器,将输入影像生成多个特征地图,然后将这些地图转移到一个桥layer中,并使用 Transformer 模型进行自我注意力机制。
  • results: 我们将模型评估于五个医疗影像分类 dataset上,结果显示 seUNet-Trans 模型在这些dataset上具有较高的性能。
    Abstract Automated medical image segmentation is becoming increasingly crucial in modern clinical practice, driven by the growing demand for precise diagnoses, the push towards personalized treatment plans, and advancements in machine learning algorithms, especially the incorporation of deep learning methods. While convolutional neural networks (CNNs) have been prevalent among these methods, the remarkable potential of Transformer-based models for computer vision tasks is gaining more acknowledgment. To harness the advantages of both CNN-based and Transformer-based models, we propose a simple yet effective UNet-Transformer (seUNet-Trans) model for medical image segmentation. In our approach, the UNet model is designed as a feature extractor to generate multiple feature maps from the input images, and these maps are propagated into a bridge layer, which sequentially connects the UNet and the Transformer. In this stage, we employ the pixel-level embedding technique without position embedding vectors to make the model more efficient. Moreover, we applied spatial-reduction attention in the Transformer to reduce the computational/memory overhead. By leveraging the UNet architecture and the self-attention mechanism, our model not only preserves both local and global context information but also captures long-range dependencies between input elements. The proposed model is extensively experimented on five medical image segmentation datasets, including polyp segmentation, to demonstrate its efficacy. A comparison with several state-of-the-art segmentation models on these datasets shows the superior performance of seUNet-Trans.
    摘要 《自动医疗影像分割在现代临床实践中变得越来越重要,这被增长的精准诊断需求、个性化治疗方案推动以及机器学习算法的发展,特别是深度学习方法的应用所驱动。而卷积神经网络(CNN)在这些方法中具有广泛的应用,但是Transformer基本模型在计算机视觉任务中表现出了惊人的潜力。为了利用CNN和Transformer两种模型的优点,我们提出了一种简单 yet effective的UNet-Transformer(seUNet-Trans)模型。在我们的方法中,UNet模型被设计为特征提取器,生成输入图像多个特征地图,然后这些地图被传递到一个桥层,该桥层连接了UNet和Transformer。在这个阶段,我们采用了像素级嵌入技术而不使用位置嵌入向量,以使模型更加高效。此外,我们在Transformer中应用了空间减少注意力,以降低计算/存储占用的开销。通过UNet架构和自注意机制,我们的模型不仅保留了本地和全局上下文信息,还能捕捉输入元素之间的长距离依赖关系。我们对五种医疗影像分割数据集进行了广泛的实验,包括肠肝肿瘤分割,以证明seUNet-Trans模型的高效性。与一些现状顶尖分割模型进行比较,我们的模型在这些数据集上显示出了superior的性能。》
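    The abstract mentions spatial-reduction attention to cut the computational and memory overhead of the Transformer; a generic sketch of that mechanism (in the spirit of PVT-style attention, with names and the reduction ratio assumed, not the paper's exact block) is:

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Sketch: keys/values are computed on a spatially downsampled feature map,
    reducing the quadratic cost of self-attention over pixel-level tokens."""

    def __init__(self, dim: int, heads: int = 8, sr_ratio: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, H*W, C) pixel-level tokens (no positional embeddings, as in the paper)
        b, n, c = x.shape
        kv = self.sr(x.transpose(1, 2).reshape(b, c, h, w))  # shrink the spatial grid
        kv = kv.flatten(2).transpose(1, 2)                   # (B, H*W / r^2, C)
        out, _ = self.attn(x, kv, kv)                        # queries stay full-resolution
        return out
```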

A Survey of Graph and Attention Based Hyperspectral Image Classification Methods for Remote Sensing Data

  • paper_url: http://arxiv.org/abs/2310.09994
  • repo_url: None
  • paper_authors: Aryan Vats, Manan Suri
  • for: 本综述旨在梳理深度学习技术在高光谱图像分类中的应用,并评估其在遥感和航空高光谱图像上的性能。
  • methods: 本综述总结了利用图结构和注意力机制进行降维与特征提取的方法,包括使用图卷积网络提取高光谱图像的光谱-空间特征以提升分类性能。
  • results: 综述结果表明,基于图和注意力机制的方法能够提升高光谱图像分类性能,并在遥感和航空高光谱图像上取得更好的分类结果;文中还汇总了相关数据集并对各类处理技术进行了基准比较。
    Abstract The use of Deep Learning techniques for classification in Hyperspectral Imaging (HSI) is rapidly growing and achieving improved performances. Due to the nature of the data captured by sensors that produce HSI images, a common issue is the dimensionality of the bands, which may or may not contribute to the label class distinction. Due to the widespread nature of class labels, Principal Component Analysis is a common method used for reducing the dimensionality. However, there may exist methods that incorporate all bands of the Hyperspectral image with the help of the Attention mechanism. Furthermore, to yield better spectral-spatial feature extraction, recent methods have also explored the usage of Graph Convolution Networks and their unique ability to use node features in prediction, which is akin to the pixel spectral makeup. In this survey, we present a comprehensive summary of graph-based and attention-based methods to perform Hyperspectral Image Classification for remote sensing and aerial HSI images. We also summarize relevant datasets on which these techniques have been evaluated and benchmark the processing techniques.
    摘要 利用深度学习技术进行高光谱成像(HSI)分类的研究正在迅速增长,并不断取得性能提升。由于传感器采集HSI图像的数据特性,一个常见的问题是波段维度过高,而其中部分波段未必有助于类别区分。鉴于类别标签的广泛性,主成分分析(PCA)是常用的降维方法。然而,也存在借助注意力机制利用高光谱图像全部波段的方法。此外,为了获得更好的光谱-空间特征提取,近期的方法还探索了图卷积网络,其利用节点特征进行预测的独特能力与像素的光谱构成十分契合。在本综述中,我们全面总结了基于图和基于注意力的遥感与航空高光谱图像分类方法,并汇总了这些技术所使用的相关数据集,对各种处理技术进行了基准比较。
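    For orientation, the propagation rule that most of the surveyed GCN-based HSI classifiers build on is the standard graph-convolution update (Kipf and Welling), shown here as a reference equation rather than any single paper's formulation:

```latex
\mathbf{H}^{(l+1)} = \sigma\!\left(\tilde{\mathbf{D}}^{-\frac{1}{2}}\,\tilde{\mathbf{A}}\,\tilde{\mathbf{D}}^{-\frac{1}{2}}\,\mathbf{H}^{(l)}\,\mathbf{W}^{(l)}\right),\qquad \tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}
```

with node features \(\mathbf{H}^{(l)}\) (e.g., pixel or superpixel spectra), adjacency \(\mathbf{A}\), degree matrix \(\tilde{\mathbf{D}}\), and learnable weights \(\mathbf{W}^{(l)}\).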

cs.AI - 2023-10-16

Greedy Perspectives: Multi-Drone View Planning for Collaborative Coverage in Cluttered Environments

  • paper_url: http://arxiv.org/abs/2310.10863
  • repo_url: None
  • paper_authors: Krishna Suresh, Aditya Rauniyar, Micah Corah, Sebastian Scherer
  • for: 这篇论文旨在帮助营造大规模的人群拍摄,特别是在团队体育和电影摄影等领域。
  • methods: 这篇论文使用了序列优化的方法来实现可扩展的Camera View最优化,但是在填充环境中却遇到了协调问题。
  • results: 作者通过开发了一种多机器人多演员视图规划算法,并对其进行了阻挡和遮挡意识的目标设定,以实现在填充环境中协调多机器人拍摄人群的目的。并且对比formation planner,这种顺序 планинг器在三个场景中 генериру了14%更高的actor view reward,并且在两个场景中与formation planning的性能相似。
    Abstract Deployment of teams of aerial robots could enable large-scale filming of dynamic groups of people (actors) in complex environments for novel applications in areas such as team sports and cinematography. Toward this end, methods for submodular maximization via sequential greedy planning can be used for scalable optimization of camera views across teams of robots but face challenges with efficient coordination in cluttered environments. Obstacles can produce occlusions and increase chances of inter-robot collision which can violate requirements for near-optimality guarantees. To coordinate teams of aerial robots in filming groups of people in dense environments, a more general view-planning approach is required. We explore how collision and occlusion impact performance in filming applications through the development of a multi-robot multi-actor view planner with an occlusion-aware objective for filming groups of people and compare with a greedy formation planner. To evaluate performance, we plan in five test environments with complex multiple-actor behaviors. Compared with a formation planner, our sequential planner generates 14% greater view reward over the actors for three scenarios and comparable performance to formation planning on two others. We also observe near identical performance of sequential planning both with and without inter-robot collision constraints. Overall, we demonstrate effective coordination of teams of aerial robots for filming groups that may split, merge, or spread apart and in environments cluttered with obstacles that may cause collisions or occlusions.
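    The core of the baseline approach described above is sequential greedy planning over a submodular coverage reward; a stripped-down sketch of that idea (ignoring trajectories, collisions, and the occlusion-aware objective, and with assumed names) is:

```python
def sequential_greedy_plan(robots, candidate_views, coverage_reward):
    """Sketch of sequential greedy assignment for submodular view coverage.

    Each robot in turn picks the candidate view that most increases the joint
    coverage reward given the views already committed by earlier robots.
    `coverage_reward` is assumed to be a monotone submodular set function,
    which is what underlies the near-optimality guarantees mentioned above.
    """
    chosen = []
    for r in robots:
        best = max(
            candidate_views[r],
            key=lambda v: coverage_reward(chosen + [v]) - coverage_reward(chosen),
        )
        chosen.append(best)
    return chosen
```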

Proper Laplacian Representation Learning

  • paper_url: http://arxiv.org/abs/2310.10833
  • repo_url: None
  • paper_authors: Diego Gomez, Michael Bowling, Marlos C. Machado
  • for: 在大规模强化学习问题中,探索、泛化和迁移尤其困难,解决这些问题需要学习良好的状态表示。
  • methods: 使用Laplacian表示法,通过求解图Laplacian的特征值和特征向量来实现。
  • results: 提出了一种具有理论保证的目标函数和优化算法,能够准确地恢复Laplacian表示及对应的特征值,并通过多种环境中的实验证明了其学习的稳定性和可靠性。
    Abstract The ability to learn good representations of states is essential for solving large reinforcement learning problems, where exploration, generalization, and transfer are particularly challenging. The Laplacian representation is a promising approach to address these problems by inducing intrinsic rewards for temporally-extended action discovery and reward shaping, and informative state encoding. To obtain the Laplacian representation one needs to compute the eigensystem of the graph Laplacian, which is often approximated through optimization objectives compatible with deep learning approaches. These approximations, however, depend on hyperparameters that are impossible to tune efficiently, converge to arbitrary rotations of the desired eigenvectors, and are unable to accurately recover the corresponding eigenvalues. In this paper we introduce a theoretically sound objective and corresponding optimization algorithm for approximating the Laplacian representation. Our approach naturally recovers both the true eigenvectors and eigenvalues while eliminating the hyperparameter dependence of previous approximations. We provide theoretical guarantees for our method and we show that those results translate empirically into robust learning across multiple environments.
    摘要 学习良好的状态表示是解决大规模强化学习问题的关键,在这类问题中,探索、泛化和迁移尤其具有挑战性。Laplacian表示是一种颇具前景的方法,它通过为时间上扩展的动作发现与奖励塑形引入内在奖励,并提供信息丰富的状态编码来应对这些问题。为了获得Laplacian表示,需要计算图Laplacian的特征系统,而这通常通过与深度学习方法兼容的优化目标来近似。然而,这些近似依赖于难以高效调节的超参数,会收敛到目标特征向量的任意旋转,并且无法准确恢复相应的特征值。在本文中,我们提出了一种具有理论依据的目标函数及相应的优化算法来近似Laplacian表示。我们的方法能够自然地恢复真实的特征向量和特征值,并消除了以往近似方法对超参数的依赖。我们为该方法提供了理论保证,并证明这些结果在多个环境中转化为稳健的学习表现。
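    As a small-scale reference for what the paper's objective approximates, the exact Laplacian representation of a finite state graph is just the bottom eigenvectors of its graph Laplacian:

```python
import numpy as np

def laplacian_representation(adjacency: np.ndarray, d: int) -> np.ndarray:
    """Exact small-scale reference: the d smallest eigenvectors of the graph
    Laplacian of a state-transition graph, one d-dimensional feature per state.
    (The paper's contribution is recovering these, and their eigenvalues, at
    scale with a learned objective rather than explicit eigendecomposition.)"""
    deg = np.diag(adjacency.sum(axis=1))
    lap = deg - adjacency
    vals, vecs = np.linalg.eigh(lap)   # eigenvalues in ascending order
    return vecs[:, :d]
```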

Detecting Speech Abnormalities with a Perceiver-based Sequence Classifier that Leverages a Universal Speech Model

  • paper_url: http://arxiv.org/abs/2310.13010
  • repo_url: None
  • paper_authors: Hagen Soltau, Izhak Shafran, Alex Ottenwess, Joseph R. JR Duffy, Rene L. Utianski, Leland R. Barnard, John L. Stricker, Daniela Wiepert, David T. Jones, Hugo Botha
  • for: 检测speech中的异常现象,尤其是一些神经疾病的表现。
  • methods: 使用Perceiver-based序列分类器和Universal Speech Model(USM),并将其与12百万小时多样化音频记录进行训练。
  • results: 提出的模型在Mayo Clinic数据集上表现出色,与标准Transformer(80.9%)和Perceiver(81.8%)模型相比取得了更高的准确率(83.1%);在任务特定数据有限的情况下,预训练十分重要,而且即使是与任务无关的自动语音识别(ASR)预训练也同样有益。
    Abstract We propose a Perceiver-based sequence classifier to detect abnormalities in speech reflective of several neurological disorders. We combine this classifier with a Universal Speech Model (USM) that is trained (unsupervised) on 12 million hours of diverse audio recordings. Our model compresses long sequences into a small set of class-specific latent representations and a factorized projection is used to predict different attributes of the disordered input speech. The benefit of our approach is that it allows us to model different regions of the input for different classes and is at the same time data efficient. We evaluated the proposed model extensively on a curated corpus from the Mayo Clinic. Our model outperforms standard transformer (80.9%) and perceiver (81.8%) models and achieves an average accuracy of 83.1%. With limited task-specific data, we find that pretraining is important and surprisingly pretraining with the unrelated automatic speech recognition (ASR) task is also beneficial. Encodings from the middle layers provide a mix of both acoustic and phonetic information and achieve best prediction results compared to just using the final layer encodings (83.1% vs. 79.6%). The results are promising and with further refinements may help clinicians detect speech abnormalities without needing access to highly specialized speech-language pathologists.
    摘要 我们提议一种基于感知者的序列分类器,用于检测speech中的异常现象,表现为多种神经疾病。我们将这个分类器与一个基于自动语音识别(ASR)任务的通用语音模型(USM)结合,并在1200万小时多样化音频记录上进行无监督训练。我们的模型可以压缩长序列到一小集类特有的归一化表示和一个 факторизовый投影,以预测不同类型的输入异常speech的不同属性。我们的方法的优点在于,它可以为不同类型的输入模型不同的地方,同时具有数据效率的优势。我们对提议模型进行了广泛的评估,并在 mayo临床数据库中验证了模型。我们的模型在标准transformer(80.9%)和感知器(81.8%)模型的基础上提高了性能,并实现了83.1%的平均准确率。我们发现,在有限的任务特定数据上,预训练是重要的,而且预训练使用ASR任务也是有利的。中间层编码器提供了mixture的音频和phonetic信息,并实现了最佳预测结果(83.1% vs. 79.6%)。结果具有潜在的价值,通过进一步的优化,可能帮助临床专业人员检测speech异常性,不需要高度专业的语音学术师。

If the Sources Could Talk: Evaluating Large Language Models for Research Assistance in History

  • paper_url: http://arxiv.org/abs/2310.10808
  • repo_url: None
  • paper_authors: Giselle Gonzalez Garcia, Christian Weilbach
  • for: 这篇论文旨在探讨如何使用大型语言模型(LLM)以对话方式探索历史记忆(即其训练数据),并证明在用向量嵌入增强LLM的情况下,可以为历史学家和人文学科研究者提供一种易于上手的对话式研究方法。
  • methods: 论文使用LLM进行对话式研究,并通过来自高度专业化学术来源的向量嵌入对LLM进行增强,以提高其在问答以及数据抽取与组织方面的能力。
  • results: 论文表明,LLM在问答以及数据抽取与组织等任务中表现出色,其语义检索与推理能力可以应用于未包含在其训练数据中的大规模文本档案;因此,研究者可以为特定研究项目补充相关资料,并对LLM进行私密查询。
    Abstract The recent advent of powerful Large-Language Models (LLM) provides a new conversational form of inquiry into historical memory (or, training data, in this case). We show that by augmenting such LLMs with vector embeddings from highly specialized academic sources, a conversational methodology can be made accessible to historians and other researchers in the Humanities. Concretely, we evaluate and demonstrate how LLMs have the ability of assisting researchers while they examine a customized corpora of different types of documents, including, but not exclusive to: (1). primary sources, (2). secondary sources written by experts, and (3). the combination of these two. Compared to established search interfaces for digital catalogues, such as metadata and full-text search, we evaluate the richer conversational style of LLMs on the performance of two main types of tasks: (1). question-answering, and (2). extraction and organization of data. We demonstrate that LLMs semantic retrieval and reasoning abilities on problem-specific tasks can be applied to large textual archives that have not been part of the its training data. Therefore, LLMs can be augmented with sources relevant to specific research projects, and can be queried privately by researchers.
    摘要
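    The abstract does not specify the retrieval stack used to augment the LLM with specialized sources; a minimal cosine-similarity retrieval sketch over pre-computed source embeddings (names and the top-k choice are assumptions) is:

```python
import numpy as np

def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k source passages most similar to the query.

    doc_vecs: (N, D) embeddings of primary/secondary source passages;
    query_vec: (D,) embedding of the researcher's question. The retrieved
    passages would then be placed in the LLM's context before answering.
    """
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return np.argsort(-sims)[:k]
```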

Demystifying Poisoning Backdoor Attacks from a Statistical Perspective

  • paper_url: http://arxiv.org/abs/2310.10780
  • repo_url: None
  • paper_authors: Ganghua Wang, Xun Xian, Jayanth Srinivasa, Ashish Kundu, Xuan Bi, Mingyi Hong, Jie Ding
  • for: 本研究旨在评估潜在攻击型机器学习模型的安全性,具体来说是评估含有常量触发器的后门攻击的成功因素和攻击方向。
  • methods: 本研究使用了定理和实验方法来评估后门攻击的成功因素和攻击方向。
  • results: 研究发现了一系列关键因素影响后门攻击的成功,包括触发器的类型和位置、模型的类型和训练数据的性质等。此外,研究还发现了一些可能的攻击方向,包括潜在的人为干预和模型的泄漏。
    Abstract The growing dependence on machine learning in real-world applications emphasizes the importance of understanding and ensuring its safety. Backdoor attacks pose a significant security risk due to their stealthy nature and potentially serious consequences. Such attacks involve embedding triggers within a learning model with the intention of causing malicious behavior when an active trigger is present while maintaining regular functionality without it. This paper evaluates the effectiveness of any backdoor attack incorporating a constant trigger, by establishing tight lower and upper boundaries for the performance of the compromised model on both clean and backdoor test data. The developed theory answers a series of fundamental but previously underexplored problems, including (1) what are the determining factors for a backdoor attack's success, (2) what is the direction of the most effective backdoor attack, and (3) when will a human-imperceptible trigger succeed. Our derived understanding applies to both discriminative and generative models. We also demonstrate the theory by conducting experiments using benchmark datasets and state-of-the-art backdoor attack scenarios.
    摘要

BiomedJourney: Counterfactual Biomedical Image Generation by Instruction-Learning from Multimodal Patient Journeys

  • paper_url: http://arxiv.org/abs/2310.10765
  • repo_url: None
  • paper_authors: Yu Gu, Jianwei Yang, Naoto Usuyama, Chunyuan Li, Sheng Zhang, Matthew P. Lungren, Jianfeng Gao, Hoifung Poon
  • for: 这个研究旨在应用自然语言指令学习的技术来生成医疗影像中的counterfactual影像,以分别鉴别 causal structure 和 spurious correlation,并且帮助医生更好地阅读医疗影像进行病程模型化。
  • methods: 这个研究使用 GPT-4 处理医疗影像报告,生成医疗影像的描述,并且使用这些 triplets (prior image, progression description, new image) 进行 latent diffusion 模型的训练,以生成 counterfactual 医疗影像。
  • results: 这个研究的结果显示,BiomedJourney 方法可以对医疗影像进行高品质的 counterfactual 生成,并且substantially outperform 先前的 state-of-the-art 方法。
    Abstract Rapid progress has been made in instruction-learning for image editing with natural-language instruction, as exemplified by InstructPix2Pix. In biomedicine, such methods can be applied to counterfactual image generation, which helps differentiate causal structure from spurious correlation and facilitate robust image interpretation for disease progression modeling. However, generic image-editing models are ill-suited for the biomedical domain, and counterfactual biomedical image generation is largely underexplored. In this paper, we present BiomedJourney, a novel method for counterfactual biomedical image generation by instruction-learning from multimodal patient journeys. Given a patient with two biomedical images taken at different time points, we use GPT-4 to process the corresponding imaging reports and generate a natural language description of disease progression. The resulting triples (prior image, progression description, new image) are then used to train a latent diffusion model for counterfactual biomedical image generation. Given the relative scarcity of image time series data, we introduce a two-stage curriculum that first pretrains the denoising network using the much more abundant single image-report pairs (with dummy prior image), and then continues training using the counterfactual triples. Experiments using the standard MIMIC-CXR dataset demonstrate the promise of our method. In a comprehensive battery of tests on counterfactual medical image generation, BiomedJourney substantially outperforms prior state-of-the-art methods in instruction image editing and medical image generation such as InstructPix2Pix and RoentGen. To facilitate future study in counterfactual medical generation, we plan to release our instruction-learning code and pretrained models.
    摘要 快速进步在图像编辑中使用自然语言指令,如InstructPix2Pix,已经取得了显著的成果。在生物医学领域,这些方法可以应用于对比例图像生成,以分解 causal structure 和偶极相关,并且为疾病进程模型提供了更加稳定的图像解释。然而,通用的图像编辑模型在生物医学领域是不适用的,对比例生成图像的研究仍然很少。在这篇论文中,我们提出了 BiomedJourney,一种新的对比例生成方法,通过 instruction-learning 从多modal 患者旅程中学习。给定一个患有两张不同时点的生物医学图像,我们使用 GPT-4 处理相关的医学报告,并生成一个描述疾病进程的自然语言描述。这些 triple(先前图像、进程描述、新图像)然后用于训练一个潜在扩散模型进行对比例生成。由于图像时序数据的缺乏,我们提出了一个两stage 课程,首先使用 much more abundant 的单图像-报告对(与假先前图像)进行预训练,然后继续使用对比例 triple。实验使用标准的 MIMIC-CXR 数据集表明,BiomedJourney 在对比例医学图像生成方面具有明显的优势,substantially outperforming 先前的状态对照方法,如 InstructPix2Pix 和 RoentGen。为便于未来对比例医学生成的研究,我们计划在未来发布我们的 instruction-learning 代码和预训练模型。

Step-by-Step Remediation of Students’ Mathematical Mistakes

  • paper_url: http://arxiv.org/abs/2310.10648
  • repo_url: https://github.com/rosewang2008/remath
  • paper_authors: Rose E. Wang, Qingyang Zhang, Carly Robinson, Susanna Loeb, Dorottya Demszky
  • for: 这个论文的目的是探讨大型自然语言模型(LLM)在数学教学中是否能够有效地帮助新手老师更好地纠正学生的错误。
  • methods: 这个论文使用了一个名为ReMath的benchmark,该benchmark由经验丰富的数学教师共同开发,它包括三个步骤:(1)推断学生错误的类型,(2)确定修正错误的策略,(3)生成一个包含该信息的回答。这个benchmark用于评估当今最好的 instruct-tuned 和对话模型在ReMath上的性能。
  • results: 研究发现,即使使用最佳模型,模型的回答仍然不能与经验丰富的数学教师相比。提供模型错误类型和策略信息可以提高模型的回答质量,但是这些回答仍然不能达到经验教师的水平。这些结果表明,使用当今的LLM来提供高质量的学习经验,虽然有potential,但还有一定的限制。研究的代码已经公开在GitHub上:https://github.com/rosewang2008/remath。
    Abstract Scaling high-quality tutoring is a major challenge in education. Because of the growing demand, many platforms employ novice tutors who, unlike professional educators, struggle to effectively address student mistakes and thus fail to seize prime learning opportunities for students. In this paper, we explore the potential for large language models (LLMs) to assist math tutors in remediating student mistakes. We present ReMath, a benchmark co-developed with experienced math teachers that deconstructs their thought process for remediation. The benchmark consists of three step-by-step tasks: (1) infer the type of student error, (2) determine the strategy to address the error, and (3) generate a response that incorporates that information. We evaluate the performance of state-of-the-art instruct-tuned and dialog models on ReMath. Our findings suggest that although models consistently improve upon original tutor responses, we cannot rely on models alone to remediate mistakes. Providing models with the error type (e.g., the student is guessing) and strategy (e.g., simplify the problem) leads to a 75% improvement in the response quality over models without that information. Nonetheless, despite the improvement, the quality of the best model's responses still falls short of experienced math teachers. Our work sheds light on the potential and limitations of using current LLMs to provide high-quality learning experiences for both tutors and students at scale. Our work is open-sourced at this link: \url{https://github.com/rosewang2008/remath}.
    摘要 增加高质量的帮助是现代教育中的一大挑战。由于需求的增长,许多平台都雇用了不熟悉教育的新教师,与专业教师不同,他们有时无法有效地 corrected学生的错误,因此失去了学生 prime learning opportunities。在这篇论文中,我们探讨了大型自然语言模型(LLM)是否可以帮助数学 tutors corrected学生的错误。我们提出了一个名为 ReMath 的标准,与经验丰富的数学教师合作开发。ReMath 包括三个步骤任务:(1)推断学生错误的类型,(2)确定修复错误的策略,(3)生成包含该信息的回答。我们对 state-of-the-art 的 instruct-tuned 和对话模型进行评估,我们的发现表明,虽然模型在 ReMath 上表现了进步,但我们无法仅仅通过模型来修复错误。在提供错误类型(如学生假设)和修复策略(如简化问题)的情况下,模型的回答质量提高了75%。然而,即使有这些信息,模型的回答仍然落后于经验丰富的数学教师。我们的工作探讨了当前 LLM 是否可以在大规模上提供高质量的学习经验。我们的工作开源在这里:https://github.com/rosewang2008/remath。
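    To make the three-step task concrete, a single hypothetical ReMath-style item might be represented as follows; the field names are illustrative, not the benchmark's actual schema, though the error type and strategy shown are the examples named in the abstract:

```python
# Hypothetical example of the three-step remediation structure described above.
remediation_item = {
    "student_turn": "I think 3/4 + 1/2 = 4/6.",
    "error_type": "guessing",                    # step 1: infer the type of error
    "strategy": "simplify the problem",          # step 2: choose a remediation strategy
    "response": (                                # step 3: respond using steps 1 and 2
        "Good try! Before adding, let's make the denominators match. "
        "What is 1/2 written with a denominator of 4?"
    ),
}
```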

A Survey on Video Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.10647
  • repo_url: https://github.com/ChenHsing/Awesome-Video-Diffusion-Models
  • paper_authors: Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, Yu-Gang Jiang
  • for: 这 paper 主要是为了对 AI 生成内容 (AIGC) 领域中的 video diffusion models 进行了一个系统的综述。
  • methods: 这 paper 使用了多种方法,包括 diffusion models、GANs 和 auto-regressive Transformers,以探讨 video diffusion models 在不同领域的应用。
  • results: 这 paper 发现了许多有价值的研究结果,包括 video 生成、编辑、以及其他视频理解任务中的应用。
    Abstract The recent wave of AI-generated content (AIGC) has witnessed substantial success in computer vision, with the diffusion model playing a crucial role in this achievement. Due to their impressive generative capabilities, diffusion models are gradually superseding methods based on GANs and auto-regressive Transformers, demonstrating exceptional performance not only in image generation and editing, but also in the realm of video-related research. However, existing surveys mainly focus on diffusion models in the context of image generation, with few up-to-date reviews on their application in the video domain. To address this gap, this paper presents a comprehensive review of video diffusion models in the AIGC era. Specifically, we begin with a concise introduction to the fundamentals and evolution of diffusion models. Subsequently, we present an overview of research on diffusion models in the video domain, categorizing the work into three key areas: video generation, video editing, and other video understanding tasks. We conduct a thorough review of the literature in these three key areas, including further categorization and practical contributions in the field. Finally, we discuss the challenges faced by research in this domain and outline potential future developmental trends. A comprehensive list of video diffusion models studied in this survey is available at https://github.com/ChenHsing/Awesome-Video-Diffusion-Models.
    摘要 最近的人工智能生成内容(AIGC)浪潮中,计算机视觉领域的扩散模型发挥了关键作用。凭借出色的生成能力,扩散模型正逐渐取代基于GAN和自回归Transformer的方法,不仅在图像生成与编辑领域,也在视频相关研究中表现出色。然而,现有综述主要集中在图像生成领域,关于扩散模型在视频领域应用的最新综述很少。为了填补这一空白,本文对AIGC时代的视频扩散模型进行了全面综述。具体而言,我们首先简要介绍扩散模型的基本原理与发展历程;随后从视频生成、视频编辑及其他视频理解任务三大方向梳理视频领域的扩散模型研究,并对相关文献进行进一步分类与实践贡献总结;最后讨论该领域面临的挑战并展望未来的发展趋势。本综述所涉及的视频扩散模型完整列表见 https://github.com/ChenHsing/Awesome-Video-Diffusion-Models。

Interactive Task Planning with Language Models

  • paper_url: http://arxiv.org/abs/2310.10645
  • repo_url: https://github.com/CraftJarvis/MC-Planner
  • paper_authors: Boyi Li, Philipp Wu, Pieter Abbeel, Jitendra Malik
  • for: 这个paper是为了解决长期任务规划和执行问题,并且可以轻松泛化到不同的目标或任务。
  • methods: 这个paper使用语言模型来实现交互式任务规划,并且结合高级规划和低级功能执行。
  • results: 这个paper的系统可以生成新的高级指令来实现未经见过的目标,并且可以轻松地适应不同的任务,只需更改任务指南即可。此外,当用户发送新的请求时,系统可以重新规划根据新的请求、任务指南和之前执行的步骤。
    Abstract An interactive robot framework accomplishes long-horizon task planning and can easily generalize to new goals or distinct tasks, even during execution. However, most traditional methods require predefined module design, which makes it hard to generalize to different goals. Recent large language model based approaches can allow for more open-ended planning but often require heavy prompt engineering or domain-specific pretrained models. To tackle this, we propose a simple framework that achieves interactive task planning with language models. Our system incorporates both high-level planning and low-level function execution via language. We verify the robustness of our system in generating novel high-level instructions for unseen objectives and its ease of adaptation to different tasks by merely substituting the task guidelines, without the need for additional complex prompt engineering. Furthermore, when the user sends a new request, our system is able to replan accordingly with precision based on the new request, task guidelines and previously executed steps. Please check more details on our https://wuphilipp.github.io/itp_site and https://youtu.be/TrKLuyv26_g.
    摘要 一个交互式机器人框架能够完成长期任务规划,并且可以轻松泛化到新目标或不同任务,甚至在执行过程中也是如此。然而,大多数传统方法需要预先定义模块设计,这使得泛化到不同目标变得困难。近期基于大语言模型的方法可以支持更开放式的规划,但常常需要繁重的提示工程或领域特定的预训练模型。为解决这个问题,我们提出了一个简单的框架,利用语言模型实现交互式任务规划。我们的系统通过语言将高层规划和低层功能执行结合在一起。我们验证了系统在为未见过的目标生成新的高层指令方面的稳健性,以及仅需替换任务指南、无需额外复杂提示工程即可适应不同任务的便利性。此外,当用户发出新的请求时,我们的系统能够根据新的请求、任务指南和先前已执行的步骤精确地重新规划。更多细节请参考 https://wuphilipp.github.io/itp_site 和 https://youtu.be/TrKLuyv26_g。

In-Context Pretraining: Language Modeling Beyond Document Boundaries

  • paper_url: http://arxiv.org/abs/2310.10638
  • repo_url: None
  • paper_authors: Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, Mike Lewis
  • for: 这个论文的目的是提高大语言模型(LMs)的性能,使其能够更好地理解文档之间的关系和Contextual reasoning。
  • methods: 该论文提出了一种新的预训练方法 called In-Context Pretraining,该方法使用相关的文档序列来显式地鼓励LMs读取和理解文档边界。
  • results: 实验表明,In-Context Pretraining可以提高LMs的性能,特别是在需要更复杂的文档上下文理解任务中,例如在文档学习 (+8%), 阅读理解 (+15%), 对前Context的忠诚 (+16%), 长文档理解 (+5%), 和检索扩展 (+9%).
    Abstract Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts but the prior documents provide no signal for predicting the next document. We instead present In-Context Pretraining, a new approach where language models are pretrained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. We can do In-Context Pretraining by simply changing the document ordering so that each context contains related documents, and directly applying existing pretraining pipelines. However, this document sorting problem is challenging. There are billions of documents and we would like the sort to maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms for finding related documents with efficient nearest neighbor search and constructing coherent input contexts with a graph traversal algorithm. Our experiments show In-Context Pretraining offers a simple and scalable approach to significantly enhance LMs'performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).
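The key engineering step, ordering billions of documents so that each training context packs together related documents without repeats, can be illustrated at toy scale. The sketch below uses a greedy nearest-neighbor chain over cosine similarities of document embeddings and then packs the ordered documents into fixed-size contexts; the paper itself relies on efficient approximate nearest-neighbor search and a graph traversal algorithm, and the embeddings, documents, and token budget here are placeholders.

```python
import numpy as np

def order_documents(embeddings: np.ndarray) -> list[int]:
    """Greedily chain each document to its most similar unused neighbor,
    approximating the 'related documents in the same context' ordering."""
    # Normalize so dot products are cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = len(emb)
    visited = np.zeros(n, dtype=bool)
    order = [0]
    visited[0] = True
    for _ in range(n - 1):
        sims = emb @ emb[order[-1]]          # similarity to the last placed doc
        sims[visited] = -np.inf              # never repeat a document
        nxt = int(np.argmax(sims))
        order.append(nxt)
        visited[nxt] = True
    return order

def pack_contexts(doc_token_ids: list[list[int]], order: list[int],
                  max_tokens: int) -> list[list[int]]:
    """Concatenate documents in the chosen order into fixed-size contexts."""
    contexts, current = [], []
    for idx in order:
        if len(current) + len(doc_token_ids[idx]) > max_tokens and current:
            contexts.append(current)
            current = []
        current.extend(doc_token_ids[idx])
    if current:
        contexts.append(current)
    return contexts

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_embeddings = rng.normal(size=(8, 16))          # 8 toy documents
    toy_docs = [list(rng.integers(0, 100, size=50)) for _ in range(8)]
    order = order_documents(toy_embeddings)
    print("document order:", order)
    print("num contexts:", len(pack_contexts(toy_docs, order, max_tokens=120)))
```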

Towards Scenario-based Safety Validation for Autonomous Trains with Deep Generative Models

  • paper_url: http://arxiv.org/abs/2310.10635
  • repo_url: None
  • paper_authors: Thomas Decker, Ananta R. Bhattarai, Michael Lebacher
  • for: This paper investigates how to appropriately validate the reliability of autonomous train systems.
  • methods: Deep generative models are used to simulate and semantically edit data, in order to verify whether the system under test operates properly under different lighting and weather conditions.
  • results: Deep generative models can make a limited amount of test data more representative and help analyze the degree to which the system complies with typical Operational Design Domain (ODD) requirements, in particular proper operation under varying lighting and weather conditions.
    Abstract Modern AI techniques open up ever-increasing possibilities for autonomous vehicles, but how to appropriately verify the reliability of such systems remains unclear. A common approach is to conduct safety validation based on a predefined Operational Design Domain (ODD) describing specific conditions under which a system under test is required to operate properly. However, collecting sufficient realistic test cases to ensure comprehensive ODD coverage is challenging. In this paper, we report our practical experiences regarding the utility of data simulation with deep generative models for scenario-based ODD validation. We consider the specific use case of a camera-based rail-scene segmentation system designed to support autonomous train operation. We demonstrate the capabilities of semantically editing railway scenes with deep generative models to make a limited amount of test data more representative. We also show how our approach helps to analyze the degree to which a system complies with typical ODD requirements. Specifically, we focus on evaluating proper operation under different lighting and weather conditions as well as while transitioning between them.

OpenAgents: An Open Platform for Language Agents in the Wild

  • paper_url: http://arxiv.org/abs/2310.10634
  • repo_url: https://github.com/xlang-ai/openagents
  • paper_authors: Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, Tao Yu
  • for: This work aims to provide an open platform for using and hosting language agents in everyday life, bringing language agents to real-world use.
  • methods: Three agents are built: a Data Agent for data analysis with Python/SQL and data tools, a Plugins Agent with 200+ daily API tools, and a Web Agent for autonomous web browsing.
  • results: The resulting open platform lets general users access agent functionality through a web interface, while offering developers and researchers a seamless local deployment experience, providing a foundation for building innovative language agents and for real-world evaluation.
    Abstract Language agents show potential in being capable of utilizing natural language for varied and intricate tasks in diverse environments, particularly when built upon large language models (LLMs). Current language agent frameworks aim to facilitate the construction of proof-of-concept language agents while neglecting the non-expert user access to agents and paying little attention to application-level designs. We present OpenAgents, an open platform for using and hosting language agents in the wild of everyday life. OpenAgents includes three agents: (1) Data Agent for data analysis with Python/SQL and data tools; (2) Plugins Agent with 200+ daily API tools; (3) Web Agent for autonomous web browsing. OpenAgents enables general users to interact with agent functionalities through a web user interface optimized for swift responses and common failures while offering developers and researchers a seamless deployment experience on local setups, providing a foundation for crafting innovative language agents and facilitating real-world evaluations. We elucidate the challenges and opportunities, aspiring to set a foundation for future research and development of real-world language agents.

BioPlanner: Automatic Evaluation of LLMs on Protocol Planning in Biology

  • paper_url: http://arxiv.org/abs/2310.10632
  • repo_url: https://github.com/bioplanner/bioplanner
  • paper_authors: Odhran O’Donoghue, Aleksandar Shtedritski, John Ginger, Ralph Abboud, Ali Essa Ghareeb, Justin Booth, Samuel G Rodriques
  • for: This work aims to assess the ability to automatically generate accurate protocols for scientific experiments, a step towards the automation of science.
  • methods: Large language models (LLMs) are used to generate scientific experimental protocols, and pseudocode representations are used to evaluate performance: an LLM converts a natural-language protocol into pseudocode, and the model under test must reconstruct the pseudocode from a high-level description and a list of admissible functions.
  • results: GPT-3 and GPT-4 are evaluated on this task and their robustness is explored. Pseudocode representations also prove useful beyond evaluation: accurate novel protocols can be generated from retrieved pseudocode, and one generated protocol was executed successfully in a biological laboratory.
    Abstract The ability to automatically generate accurate protocols for scientific experiments would represent a major step towards the automation of science. Large Language Models (LLMs) have impressive capabilities on a wide range of tasks, such as question answering and the generation of coherent text and code. However, LLMs can struggle with multi-step problems and long-term planning, which are crucial for designing scientific experiments. Moreover, evaluation of the accuracy of scientific protocols is challenging, because experiments can be described correctly in many different ways, require expert knowledge to evaluate, and cannot usually be executed automatically. Here we present an automatic evaluation framework for the task of planning experimental protocols, and we introduce BioProt: a dataset of biology protocols with corresponding pseudocode representations. To measure performance on generating scientific protocols, we use an LLM to convert a natural language protocol into pseudocode, and then evaluate an LLM's ability to reconstruct the pseudocode from a high-level description and a list of admissible pseudocode functions. We evaluate GPT-3 and GPT-4 on this task and explore their robustness. We externally validate the utility of pseudocode representations of text by generating accurate novel protocols using retrieved pseudocode, and we run a generated protocol successfully in our biological laboratory. Our framework is extensible to the evaluation and improvement of language model planning abilities in other areas of science or other areas that lack automatic evaluation.

Llemma: An Open Language Model For Mathematics

  • paper_url: http://arxiv.org/abs/2310.10631
  • repo_url: https://github.com/EleutherAI/math-lm
  • paper_authors: Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, Sean Welleck
  • for: This paper presents a large language model specialized for mathematics.
  • methods: Code Llama is further pretrained on Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma.
  • results: On the MATH benchmark, Llemma outperforms all known open base models, as well as the unreleased Minerva model suite, on an equi-parameter basis. Llemma can also use tools and perform formal theorem proving without any further finetuning.
    Abstract We present Llemma, a large language model for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark Llemma outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any further finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments.

Factored Verification: Detecting and Reducing Hallucination in Summaries of Academic Papers

  • paper_url: http://arxiv.org/abs/2310.10627
  • repo_url: https://github.com/elicit/fave-dataset
  • paper_authors: Charlie George, Andreas Stuhlmüller
  • for: This study evaluates hallucination in automatically generated summaries of academic papers and uses Factored Verification to detect it.
  • methods: Factored Verification, a simple automated method, detects hallucinations in abstractive summaries and sets a new state of the art on the summarization task of the HaluEval benchmark.
  • results: After self-correction with Factored Critiques, the number of hallucinations drops to 0.49 for ChatGPT, 0.46 for GPT-4, and 0.95 for Claude 2. The remaining hallucinations are often subtle, so caution is advised when using models to synthesize academic papers.
    Abstract Hallucination plagues even frontier LLMs--but how bad is it really for summarizing academic papers? We evaluate Factored Verification, a simple automated method for detecting hallucinations in abstractive summaries. This method sets a new SotA on hallucination detection in the summarization task of the HaluEval benchmark, achieving 76.2% accuracy. We then use this method to estimate how often language models hallucinate when summarizing across multiple academic papers and find 0.62 hallucinations in the average ChatGPT (16k) summary, 0.84 for GPT-4, and 1.55 for Claude 2. We ask models to self-correct using Factored Critiques and find that this lowers the number of hallucinations to 0.49 for ChatGPT, 0.46 for GPT-4, and 0.95 for Claude 2. The hallucinations we find are often subtle, so we advise caution when using models to synthesize academic papers.
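A rough sketch of the factored-checking idea follows: decompose a summary into atomic claims, verify each claim independently against the source papers, and count unsupported claims as suspected hallucinations. The `verify_claim` function below is only a word-overlap stand-in for the LLM judge the paper uses, the claim decomposition is a naive sentence split, and the overlap threshold is arbitrary.

```python
import re

def split_into_claims(summary: str) -> list[str]:
    """Naive stand-in for claim decomposition: one claim per sentence."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]

def verify_claim(claim: str, sources: list[str], threshold: float = 0.5) -> bool:
    """Placeholder verifier: a claim counts as 'supported' if enough of its
    words appear in at least one source. The paper uses an LLM for this step."""
    claim_words = set(re.findall(r"\w+", claim.lower()))
    if not claim_words:
        return True
    for src in sources:
        src_words = set(re.findall(r"\w+", src.lower()))
        if len(claim_words & src_words) / len(claim_words) >= threshold:
            return True
    return False

def count_hallucinations(summary: str, sources: list[str]) -> list[str]:
    """Return the claims that no source supports (suspected hallucinations)."""
    return [c for c in split_into_claims(summary)
            if not verify_claim(c, sources)]

if __name__ == "__main__":
    sources = ["The study reports a 76.2% accuracy on hallucination detection."]
    summary = ("The study reports 76.2% accuracy. "
               "It also claims the method was tested on medical records.")
    for claim in count_hallucinations(summary, sources):
        print("unsupported:", claim)
```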

Video Language Planning

  • paper_url: http://arxiv.org/abs/2310.10625
  • repo_url: https://github.com/abusufyanvu/6S191_MIT_DeepLearning
  • paper_authors: Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Kaelbling, Andy Zeng, Jonathan Tompson
  • for: To enable visual planning for complex long-horizon tasks by leveraging recent advances in large generative models.
  • methods: Vision-language models are trained to serve as both policies and value functions, and text-to-video models serve as dynamics models; together they drive the tree-search-based video language planning (VLP) algorithm.
  • results: VLP improves with increased computation budget, synthesizes long-horizon video plans across different robotics domains (from multi-object rearrangement to multi-camera bi-arm dexterous manipulation), and the generated plans can be translated into real robot actions via goal-conditioned policies conditioned on each intermediate frame. Experiments show that VLP substantially improves long-horizon task success rates over prior methods on both simulated and real robots.
    Abstract We are interested in enabling visual planning for complex long-horizon tasks in the space of generated videos and language, leveraging recent advances in large generative models pretrained on Internet-scale data. To this end, we present video language planning (VLP), an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models. VLP takes as input a long-horizon task instruction and current image observation, and outputs a long video plan that provides detailed multimodal (video and language) specifications that describe how to complete the final task. VLP scales with increasing computation budget where more computation time results in improved video plans, and is able to synthesize long-horizon video plans across different robotics domains: from multi-object rearrangement, to multi-camera bi-arm dexterous manipulation. Generated video plans can be translated into real robot actions via goal-conditioned policies, conditioned on each intermediate frame of the generated video. Experiments show that VLP substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots (across 3 hardware platforms).

Generating Summaries with Controllable Readability Levels

  • paper_url: http://arxiv.org/abs/2310.10623
  • repo_url: None
  • paper_authors: Leonardo F. R. Ribeiro, Mohit Bansal, Markus Dreyer
  • for: This paper aims to control the readability level of summaries so that knowledge can be conveyed to diverse audiences.
  • methods: Three techniques are used to control summary readability: (1) instruction-based readability control, (2) reinforcement learning to minimize the gap between the requested and observed readability, and (3) a decoding approach that uses lookahead to estimate the readability of upcoming decoding steps.
  • results: Experiments on news summarization (CNN/DM dataset) show that the three generation techniques significantly improve readability control, establishing strong baselines for controllable-readability summarization.
    Abstract Readability refers to how easily a reader can understand a written text. Several factors affect the readability level, such as the complexity of the text, its subject matter, and the reader's background knowledge. Generating summaries based on different readability levels is critical for enabling knowledge consumption by diverse audiences. However, current text generation approaches lack refined control, resulting in texts that are not customized to readers' proficiency levels. In this work, we bridge this gap and study techniques to generate summaries at specified readability levels. Unlike previous methods that focus on a specific readability level (e.g., lay summarization), we generate summaries with fine-grained control over their readability. We develop three text generation techniques for controlling readability: (1) instruction-based readability control, (2) reinforcement learning to minimize the gap between requested and observed readability and (3) a decoding approach that uses lookahead to estimate the readability of upcoming decoding steps. We show that our generation methods significantly improve readability control on news summarization (CNN/DM dataset), as measured by various readability metrics and human judgement, establishing strong baselines for controllable readability in summarization.
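The lookahead idea, scoring what a candidate continuation would do to the summary's readability and steering decoding toward the requested level, can be sketched with a classic readability formula. The snippet below uses the Flesch-Kincaid grade level as a surrogate metric (the paper evaluates with several readability measures) and applies it as a simple rerank step; the syllable counter is a rough heuristic and the candidate texts are placeholders.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, minimum one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)

def pick_closest_to_target(candidates: list[str], target_grade: float) -> str:
    """Rerank candidate summaries by distance to the requested grade level."""
    return min(candidates, key=lambda c: abs(flesch_kincaid_grade(c) - target_grade))

if __name__ == "__main__":
    candidates = [
        "The committee postponed ratification pending further deliberation.",
        "The group decided to wait before approving the plan.",
    ]
    for c in candidates:
        print(round(flesch_kincaid_grade(c), 1), "-", c)
    print("picked for grade 6:", pick_closest_to_target(candidates, 6.0))
```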

Quantifying Assistive Robustness Via the Natural-Adversarial Frontier

  • paper_url: http://arxiv.org/abs/2310.10610
  • repo_url: None
  • paper_authors: Jerry Zhi-Yang He, Zackory Erickson, Daniel S. Brown, Anca D. Dragan
  • for: The paper aims to build robust policies for robots that assist people; the challenge is that people can behave unexpectedly and interact with the robot outside its training distribution, leading to failures.
  • methods: The paper proposes RIGID, which constructs the entire natural-adversarial frontier by training adversarial human policies that trade off between minimizing robot reward and acting human-like.
  • results: RIGID is used to analyze standard collaborative reinforcement learning and existing methods meant to increase robustness, and the frontier it identifies is compared with failures found in expert adversarial interaction and naturally occurring failures during user interaction. The results show that RIGID provides a meaningful measure of robustness predictive of deployment performance and uncovers failure cases in human-robot interaction that are difficult to find manually.
    Abstract Our ultimate goal is to build robust policies for robots that assist people. What makes this hard is that people can behave unexpectedly at test time, potentially interacting with the robot outside its training distribution and leading to failures. Even just measuring robustness is a challenge. Adversarial perturbations are the default, but they can paint the wrong picture: they can correspond to human motions that are unlikely to occur during natural interactions with people. A robot policy might fail under small adversarial perturbations but work under large natural perturbations. We propose that capturing robustness in these interactive settings requires constructing and analyzing the entire natural-adversarial frontier: the Pareto-frontier of human policies that are the best trade-offs between naturalness and low robot performance. We introduce RIGID, a method for constructing this frontier by training adversarial human policies that trade off between minimizing robot reward and acting human-like (as measured by a discriminator). On an Assistive Gym task, we use RIGID to analyze the performance of standard collaborative Reinforcement Learning, as well as the performance of existing methods meant to increase robustness. We also compare the frontier RIGID identifies with the failures identified in expert adversarial interaction, and with naturally-occurring failures during user interaction. Overall, we find evidence that RIGID can provide a meaningful measure of robustness predictive of deployment performance, and uncover failure cases in human-robot interaction that are difficult to find manually. https://ood-human.github.io.
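The frontier itself is a Pareto set over two axes: how human-like (natural) an adversarial policy is and how much it degrades robot performance. Given a set of adversarial human policies, each scored by a naturalness measure such as a discriminator and by the robot reward it induces, the sketch below extracts the non-dominated points; the scores are synthetic placeholders, and RIGID's actual contribution lies in training the policies that populate this set.

```python
import numpy as np

def natural_adversarial_frontier(naturalness: np.ndarray,
                                 robot_reward: np.ndarray) -> np.ndarray:
    """Return indices of policies on the Pareto frontier: maximally natural
    while inducing minimal robot reward (i.e., most damaging)."""
    n = len(naturalness)
    keep = []
    for i in range(n):
        dominated = False
        for j in range(n):
            # j dominates i if it is at least as natural AND at least as
            # damaging, and strictly better on one of the two axes.
            if (naturalness[j] >= naturalness[i]
                    and robot_reward[j] <= robot_reward[i]
                    and (naturalness[j] > naturalness[i]
                         or robot_reward[j] < robot_reward[i])):
                dominated = True
                break
        if not dominated:
            keep.append(i)
    return np.array(keep)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    naturalness = rng.uniform(0, 1, size=20)   # discriminator score per policy
    robot_reward = rng.uniform(0, 1, size=20)  # robot task reward per policy
    frontier = natural_adversarial_frontier(naturalness, robot_reward)
    print("frontier policies:", frontier)
```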

Exploring the Power of Graph Neural Networks in Solving Linear Optimization Problems

  • paper_url: http://arxiv.org/abs/2310.10603
  • repo_url: https://github.com/chendiqian/IPM_MPNN
  • paper_authors: Chendi Qian, Didier Chételat, Christopher Morris
  • for: This paper studies why message-passing graph neural networks (MPNNs) are effective at enhancing exact optimization algorithms.
  • methods: MPNNs are used to imitate computationally intensive heuristics, such as strong branching, that arise when solving mixed-integer optimization problems.
  • results: The paper shows that MPNNs can simulate standard interior-point methods for linear optimization problems and can adapt to a given problem-instance distribution. Empirically, MPNNs solve LP relaxations of standard combinatorial optimization problems close to optimality, often surpassing conventional solvers and competing approaches in solving time.
    Abstract Recently, machine learning, particularly message-passing graph neural networks (MPNNs), has gained traction in enhancing exact optimization algorithms. For example, MPNNs speed up solving mixed-integer optimization problems by imitating computational intensive heuristics like strong branching, which entails solving multiple linear optimization problems (LPs). Despite the empirical success, the reasons behind MPNNs' effectiveness in emulating linear optimization remain largely unclear. Here, we show that MPNNs can simulate standard interior-point methods for LPs, explaining their practical success. Furthermore, we highlight how MPNNs can serve as a lightweight proxy for solving LPs, adapting to a given problem instance distribution. Empirically, we show that MPNNs solve LP relaxations of standard combinatorial optimization problems close to optimality, often surpassing conventional solvers and competing approaches in solving time.
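The argument rests on encoding a linear program, min c^T x subject to Ax <= b, as the standard bipartite constraint-variable graph that MPNNs operate on. The sketch below builds that encoding and runs one untrained mean-aggregation message-passing step with random weights; it is meant only to show the data representation, not a trained solver.

```python
import numpy as np

def lp_to_bipartite(A: np.ndarray, b: np.ndarray, c: np.ndarray):
    """Encode min c^T x s.t. Ax <= b as (variable feats, constraint feats, edges)."""
    var_feats = c.reshape(-1, 1)                    # one node per variable
    con_feats = b.reshape(-1, 1)                    # one node per constraint
    rows, cols = np.nonzero(A)                      # edge (constraint i, variable j)
    edge_feats = A[rows, cols].reshape(-1, 1)       # coefficient A_ij on the edge
    return var_feats, con_feats, (rows, cols, edge_feats)

def message_passing_step(var_h, con_h, edges, W):
    """One constraint-to-variable round with mean aggregation (random weights)."""
    rows, cols, e = edges
    msgs = np.tanh(np.hstack([con_h[rows], e]) @ W)  # message per edge
    agg = np.zeros((var_h.shape[0], msgs.shape[1]))
    counts = np.zeros(var_h.shape[0])
    for k, j in enumerate(cols):                     # mean over incoming edges
        agg[j] += msgs[k]
        counts[j] += 1
    agg /= np.maximum(counts, 1).reshape(-1, 1)
    return np.hstack([var_h, agg])                   # updated variable embeddings

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = np.array([[1.0, 2.0], [3.0, 0.0]])
    b = np.array([4.0, 6.0])
    c = np.array([-1.0, -2.0])
    var_h, con_h, edges = lp_to_bipartite(A, b, c)
    W = rng.normal(size=(con_h.shape[1] + 1, 4))
    print(message_passing_step(var_h, con_h, edges, W).shape)  # (2, 5)
```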

Physics-informed neural wavefields with Gabor basis functions

  • paper_url: http://arxiv.org/abs/2310.10602
  • repo_url: None
  • paper_authors: Tariq Alkhalifah, Xinquan Huang
  • for: This paper aims to enhance the efficiency and accuracy of neural network wavefield solutions by modeling them as linear combinations of Gabor basis functions that satisfy the wave equation.
  • methods: The proposed approach uses a fully connected neural network with an adaptable Gabor layer as the final hidden layer, employing a weighted summation of Gabor neurons to compute predictions. The weights/coefficients of the Gabor functions are learned from previous hidden layers with nonlinear activation functions.
  • results: Realistic assessments showcase the efficacy of this novel implementation compared to the vanilla PINN, particularly in scenarios involving high frequencies and realistic models that are often challenging for PINNs.
    Abstract Recently, Physics-Informed Neural Networks (PINNs) have gained significant attention for their versatile interpolation capabilities in solving partial differential equations (PDEs). Despite their potential, the training can be computationally demanding, especially for intricate functions like wavefields. This is primarily due to the neural-based (learned) basis functions, biased toward low frequencies, as they are dominated by polynomial calculations, which are not inherently wavefield-friendly. In response, we propose an approach to enhance the efficiency and accuracy of neural network wavefield solutions by modeling them as linear combinations of Gabor basis functions that satisfy the wave equation. Specifically, for the Helmholtz equation, we augment the fully connected neural network model with an adaptable Gabor layer constituting the final hidden layer, employing a weighted summation of these Gabor neurons to compute the predictions (output). These weights/coefficients of the Gabor functions are learned from the previous hidden layers that include nonlinear activation functions. To ensure the Gabor layer's utilization across the model space, we incorporate a smaller auxiliary network to forecast the center of each Gabor function based on input coordinates. Realistic assessments showcase the efficacy of this novel implementation compared to the vanilla PINN, particularly in scenarios involving high-frequencies and realistic models that are often challenging for PINNs.
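The final hidden layer described above is essentially a weighted sum of learnable Gabor atoms evaluated at the input coordinates, with the weights produced by earlier layers and the atom centers predicted by a small auxiliary network. The PyTorch sketch below is a minimal 1-D version of that idea; the layer sizes, the exact Gabor parameterization, and the auxiliary-network design are assumptions rather than the authors' architecture.

```python
import torch
import torch.nn as nn

class GaborWavefield(nn.Module):
    """Predict a wavefield u(x) as a weighted sum of Gabor atoms:
    g_k(x) = exp(-(x - c_k)^2 / (2 sigma_k^2)) * cos(omega_k * (x - c_k) + phi_k)."""

    def __init__(self, n_gabor: int = 32, hidden: int = 64):
        super().__init__()
        self.log_sigma = nn.Parameter(torch.zeros(n_gabor))
        self.omega = nn.Parameter(torch.randn(n_gabor) * 5.0)
        self.phi = nn.Parameter(torch.zeros(n_gabor))
        # Weights/coefficients of the Gabor atoms come from earlier layers.
        self.coeff_net = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, n_gabor))
        # Small auxiliary network predicting each atom's center from x.
        self.center_net = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, n_gabor))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1) spatial coordinates
        centers = self.center_net(x)                       # (batch, n_gabor)
        d = x - centers                                     # distance to each center
        sigma = torch.exp(self.log_sigma)
        envelope = torch.exp(-(d ** 2) / (2 * sigma ** 2))
        carrier = torch.cos(self.omega * d + self.phi)
        coeffs = self.coeff_net(x)                          # learned weights
        return (coeffs * envelope * carrier).sum(dim=1, keepdim=True)

if __name__ == "__main__":
    model = GaborWavefield()
    x = torch.linspace(0, 1, 100).unsqueeze(1)
    u = model(x)                                            # (100, 1) wavefield values
    print(u.shape)
```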

Automated Natural Language Explanation of Deep Visual Neurons with Large Models

  • paper_url: http://arxiv.org/abs/2310.10708
  • repo_url: None
  • paper_authors: Chenxu Zhao, Wei Qian, Yucheng Shi, Mengdi Huai, Ninghao Liu
  • for: This work aims to explain the semantics of neurons in deep vision networks, improving network interpretability.
  • methods: A post-hoc framework based on large foundation models automatically generates natural-language explanations of neurons, without requiring human intervention or prior knowledge.
  • results: Qualitative and quantitative experiments show that the method identifies neuron semantics effectively and is compatible with various model architectures and datasets.
    Abstract Deep neural networks have exhibited remarkable performance across a wide range of real-world tasks. However, comprehending the underlying reasons for their effectiveness remains a challenging problem. Interpreting deep neural networks through examining neurons offers distinct advantages when it comes to exploring the inner workings of neural networks. Previous research has indicated that specific neurons within deep vision networks possess semantic meaning and play pivotal roles in model performance. Nonetheless, the current methods for generating neuron semantics heavily rely on human intervention, which hampers their scalability and applicability. To address this limitation, this paper proposes a novel post-hoc framework for generating semantic explanations of neurons with large foundation models, without requiring human intervention or prior knowledge. Our framework is designed to be compatible with various model architectures and datasets, facilitating automated and scalable neuron interpretation. Experiments are conducted with both qualitative and quantitative analysis to verify the effectiveness of our proposed approach.

Towards the Imagenets of ML4EDA

  • paper_url: http://arxiv.org/abs/2310.10560
  • repo_url: None
  • paper_authors: Animesh Basak Chowdhury, Shailja Thakur, Hammond Pearce, Ramesh Karri, Siddharth Garg
  • for: This paper discusses ML-guided EDA tools from RTL to GDSII, noting that there are currently no standard datasets or prototypical learning tasks defined for the EDA problem domain.
  • methods: The authors describe their experience curating and maintaining two large-scale, high-quality datasets for Verilog code generation and logic synthesis.
  • results: The paper discusses challenges in curating, maintaining, and growing these datasets, questions of dataset quality and security, and the use of data augmentation tools tailored for the hardware domain.
    Abstract Despite the growing interest in ML-guided EDA tools from RTL to GDSII, there are no standard datasets or prototypical learning tasks defined for the EDA problem domain. Experience from the computer vision community suggests that such datasets are crucial to spur further progress in ML for EDA. Here we describe our experience curating two large-scale, high-quality datasets for Verilog code generation and logic synthesis. The first, VeriGen, is a dataset of Verilog code collected from GitHub and Verilog textbooks. The second, OpenABC-D, is a large-scale, labeled dataset designed to aid ML for logic synthesis tasks. The dataset consists of 870,000 And-Inverter-Graphs (AIGs) produced from 1500 synthesis runs on a large number of open-source hardware projects. In this paper we will discuss challenges in curating, maintaining and growing the size and scale of these datasets. We will also touch upon questions of dataset quality and security, and the use of novel data augmentation tools that are tailored for the hardware domain.

Demonstrations Are All You Need: Advancing Offensive Content Paraphrasing using In-Context Learning

  • paper_url: http://arxiv.org/abs/2310.10707
  • repo_url: None
  • paper_authors: Anirudh Som, Karan Sikka, Helen Gent, Ajay Divakaran, Andreas Kathol, Dimitra Vergyri
  • for: This study aims to help practitioners develop usable paraphrasers for offensive content by exploring in-context learning (ICL) with large language models (LLMs), i.e., using a limited number of input-label demonstration pairs to guide the model towards the desired output for a given query.
  • methods: The study examines key factors such as the number and order of demonstrations, the exclusion of prompt instructions, and the reduction in measured toxicity. A principled evaluation is performed on three datasets, including the proposed Context-Aware Polite Paraphrase dataset of dialogue-style rude utterances, polite paraphrases, and additional dialogue context.
  • results: ICL is comparable to supervised methods in generation quality while being qualitatively better by 25% on human evaluation and attaining 76% lower toxicity. ICL-based paraphrasers show only a slight reduction in performance even with just 10% of the training data.
    Abstract Paraphrasing of offensive content is a better alternative to content removal and helps improve civility in a communication environment. Supervised paraphrasers; however, rely heavily on large quantities of labelled data to help preserve meaning and intent. They also retain a large portion of the offensiveness of the original content, which raises questions on their overall usability. In this paper we aim to assist practitioners in developing usable paraphrasers by exploring In-Context Learning (ICL) with large language models (LLMs), i.e., using a limited number of input-label demonstration pairs to guide the model in generating desired outputs for specific queries. Our study focuses on key factors such as -- number and order of demonstrations, exclusion of prompt instruction, and reduction in measured toxicity. We perform principled evaluation on three datasets, including our proposed Context-Aware Polite Paraphrase dataset, comprising of dialogue-style rude utterances, polite paraphrases, and additional dialogue context. We evaluate our approach using two closed source and one open source LLM. Our results reveal that ICL is comparable to supervised methods in generation quality, while being qualitatively better by 25% on human evaluation and attaining lower toxicity by 76%. Also, ICL-based paraphrasers only show a slight reduction in performance even with just 10% training data.
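The in-context learning setup reduces to careful prompt assembly: a handful of (rude utterance, polite paraphrase) demonstration pairs, optionally preceded by an instruction, followed by the query. The helper below sketches that assembly; the template wording, the demonstration count, and the decision to include an instruction line are exactly the factors the paper varies, and none of the strings here are the authors' actual prompts.

```python
def build_icl_prompt(demonstrations: list[tuple[str, str]],
                     query: str,
                     include_instruction: bool = True,
                     k: int = 4) -> str:
    """Assemble a few-shot prompt for offensive-to-polite paraphrasing.

    demonstrations: (offensive utterance, polite paraphrase) pairs.
    k: number of demonstrations to include (order and count both matter).
    """
    lines = []
    if include_instruction:
        lines.append("Rewrite the message politely while keeping its meaning.")
    for offensive, polite in demonstrations[:k]:
        lines.append(f"Message: {offensive}")
        lines.append(f"Polite rewrite: {polite}")
        lines.append("")
    lines.append(f"Message: {query}")
    lines.append("Polite rewrite:")
    return "\n".join(lines)

if __name__ == "__main__":
    demos = [
        ("This plan is garbage.", "I have serious concerns about this plan."),
        ("You clearly didn't read my email.", "It seems my email may have been missed."),
    ]
    print(build_icl_prompt(demos, "Stop wasting my time with these reports."))
```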

Deep learning applied to EEG data with different montages using spatial attention

  • paper_url: http://arxiv.org/abs/2310.10550
  • repo_url: https://github.com/sccn/deep-channel-harmonization
  • paper_authors: Dung Truong, Muhammad Abdullah Khalid, Arnaud Delorme
  • for: This study aims to use deep learning to process raw EEG data and extract information about complex brain dynamics.
  • methods: Spatial attention applied to EEG electrode coordinates is used to harmonize channels, so that deep learning models can be trained on EEG data recorded with different channel montages.
  • results: Spatial attention improves model performance, and a deep learning model trained on data with varying channel montages performs significantly better on a gender classification task than models trained on fixed 23- and 128-channel montages.
    Abstract The ability of Deep Learning to process and extract relevant information in complex brain dynamics from raw EEG data has been demonstrated in various recent works. Deep learning models, however, have also been shown to perform best on large corpora of data. When processing EEG, a natural approach is to combine EEG datasets from different experiments to train large deep-learning models. However, most EEG experiments use custom channel montages, requiring the data to be transformed into a common space. Previous methods have used the raw EEG signal to extract features of interest and focused on using a common feature space across EEG datasets. While this is a sensible approach, it underexploits the potential richness of EEG raw data. Here, we explore using spatial attention applied to EEG electrode coordinates to perform channel harmonization of raw EEG data, allowing us to train deep learning on EEG data using different montages. We test this model on a gender classification task. We first show that spatial attention increases model performance. Then, we show that a deep learning model trained on data using different channel montages performs significantly better than deep learning models trained on fixed 23- and 128-channel data montages.
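One way to read the channel-harmonization idea is to treat each electrode's 3-D coordinate as the key of an attention layer, so that recordings from any montage are projected onto a fixed set of learned virtual channels before the rest of the network. The PyTorch sketch below implements that reading; the number of virtual channels, the feature sizes, and the single-head formulation are illustrative assumptions rather than the authors' exact model.

```python
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    """Map EEG from an arbitrary montage (n_channels varies) onto a fixed
    number of virtual channels using attention over electrode coordinates."""

    def __init__(self, n_virtual: int = 32, coord_dim: int = 3, d_model: int = 16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_virtual, d_model))  # learned
        self.key_proj = nn.Linear(coord_dim, d_model)                 # from xyz
        self.scale = d_model ** 0.5

    def forward(self, eeg: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # eeg: (batch, n_channels, n_samples); coords: (n_channels, 3)
        keys = self.key_proj(coords)                          # (n_channels, d_model)
        attn = torch.softmax(self.queries @ keys.T / self.scale, dim=-1)
        # The same attention map re-weights every time sample of every batch item.
        return torch.einsum("vc,bct->bvt", attn, eeg)         # (batch, n_virtual, n_samples)

if __name__ == "__main__":
    layer = SpatialChannelAttention()
    eeg_23ch = torch.randn(4, 23, 256)     # one montage
    coords_23 = torch.randn(23, 3)
    eeg_128ch = torch.randn(4, 128, 256)   # a different montage, same layer
    coords_128 = torch.randn(128, 3)
    print(layer(eeg_23ch, coords_23).shape, layer(eeg_128ch, coords_128).shape)
```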

Use of probabilistic phrases in a coordination game: human versus GPT-4

  • paper_url: http://arxiv.org/abs/2310.10544
  • repo_url: None
  • paper_authors: Laurence T Maloney, Maria F Dal Martello, Vivian Fei, Valerie Ma
  • for: This study tests the ability of humans and the large language model GPT-4 to interpret probabilistic phrases.
  • methods: Humans and GPT-4 estimated the probability and the ambiguity (imprecision) of 23 probabilistic phrases in two different contexts, investment advice and medical advice.
  • results: Human and GPT-4 probability estimates were in good agreement, but their estimates of ambiguity agreed less well. Repeating some of the GPT-4 estimates suggests that they are not entirely stable across runs.
    Abstract English speakers use probabilistic phrases such as likely to communicate information about the probability or likelihood of events. Communication is successful to the extent that the listener grasps what the speaker means to convey and, if communication is successful, two individuals can potentially coordinate their actions based on shared knowledge about uncertainty. We first assessed human ability to estimate the probability and the ambiguity (imprecision) of 23 probabilistic phrases in two different contexts, investment advice and medical advice. We then had GPT4 (OpenAI), a recent Large Language Model, complete the same tasks as the human participants. We found that the median human participant and GPT4 assigned probability estimates that were in good agreement (proportions of variance accounted were close to .90). GPT4's estimates of probability both in the investment and Medical contexts were as close or closer to that of the human participants as the human participants were to one another. Estimates of probability for both the human participants and GPT4 were little affected by context. In contrast, human and GPT4 estimates of ambiguity were not in as good agreement. We repeated some of the GPT4 estimates to assess their stability: does GPT4, if run twice, produce the same or similar estimates? There is some indication that it does not.

Efficient Dataset Distillation through Alignment with Smooth and High-Quality Expert Trajectories

  • paper_url: http://arxiv.org/abs/2310.10541
  • repo_url: None
  • paper_authors: Jiyuan Shen, Wenzhuo Yang, Kwok-Yan Lam
  • for: This work proposes a data-efficient method that avoids the need for large-scale datasets when training large, state-of-the-art machine learning models.
  • methods: The method builds on expert trajectories and introduces a clipping loss and a gradient penalty to regulate the rate of parameter changes. It further proposes representative initialization for the synthetic dataset, a balanced inner-loop loss, and two enhancement strategies: an intermediate matching loss and weight perturbation.
  • results: Experiments show that the proposed method significantly outperforms prior methods across datasets of different scales, sizes, and resolutions.
    Abstract Training a large and state-of-the-art machine learning model typically necessitates the use of large-scale datasets, which, in turn, makes the training and parameter-tuning process expensive and time-consuming. Some researchers opt to distil information from real-world datasets into tiny and compact synthetic datasets while maintaining their ability to train a well-performing model, hence proposing a data-efficient method known as Dataset Distillation (DD). Despite recent progress in this field, existing methods still underperform and cannot effectively replace large datasets. In this paper, unlike previous methods that focus solely on improving the efficacy of student distillation, we are the first to recognize the important interplay between expert and student. We argue the significant impact of expert smoothness when employing more potent expert trajectories in subsequent dataset distillation. Based on this, we introduce the integration of clipping loss and gradient penalty to regulate the rate of parameter changes in expert trajectories. Furthermore, in response to the sensitivity exhibited towards randomly initialized variables during distillation, we propose representative initialization for synthetic dataset and balanced inner-loop loss. Finally, we present two enhancement strategies, namely intermediate matching loss and weight perturbation, to mitigate the potential occurrence of cumulative errors. We conduct extensive experiments on datasets of different scales, sizes, and resolutions. The results demonstrate that the proposed method significantly outperforms prior methods.

Microscaling Data Formats for Deep Learning

  • paper_url: http://arxiv.org/abs/2310.10537
  • repo_url: https://github.com/microsoft/microxcaling
  • paper_authors: Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, Stosic Dusan, Venmugil Elango, Maximilian Golub, Alexander Heinecke, Phil James-Roxby, Dharmesh Jani, Gaurav Kolhe, Martin Langhammer, Ada Li, Levi Melnick, Maral Mesmakhosroshahi, Andres Rodriguez, Michael Schulte, Rasoul Shafipour, Lei Shao, Michael Siu, Pradeep Dubey, Paulius Micikevicius, Maxim Naumov, Colin Verrilli, Ralph Wittig, Doug Burger, Eric Chung
  • for: To reduce the computational and storage costs of modern deep learning applications.
  • methods: Microscaling (MX) data formats combine a per-block scaling factor with narrow floating-point and integer types for individual elements.
  • results: Empirical results on over two dozen benchmarks show that MX data formats can serve as a drop-in replacement for baseline FP32 in AI inference and training with low user friction, and demonstrate the first training of generative language models with sub-8-bit weights, activations, and gradients with minimal accuracy loss and no modifications to the training recipe.
    Abstract Narrow bit-width data formats are key to reducing the computational and storage costs of modern deep learning applications. This paper evaluates Microscaling (MX) data formats that combine a per-block scaling factor with narrow floating-point and integer types for individual elements. MX formats balance the competing needs of hardware efficiency, model accuracy, and user friction. Empirical results on over two dozen benchmarks demonstrate practicality of MX data formats as a drop-in replacement for baseline FP32 for AI inference and training with low user friction. We also show the first instance of training generative language models at sub-8-bit weights, activations, and gradients with minimal accuracy loss and no modifications to the training recipe.
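The basic mechanism behind microscaling, a narrow element type plus one shared scale per small block, can be illustrated with a simplified quantizer that stores signed 8-bit elements and a power-of-two scale per 32-element block. This is not the MX specification, which defines particular block sizes, scale encodings, and floating-point element formats; the sketch only shows how a per-block scale lets narrow element types cover a wide dynamic range.

```python
import numpy as np

def quantize_blocks(x: np.ndarray, block: int = 32, bits: int = 8):
    """Quantize a 1-D tensor with one shared power-of-two scale per block
    and signed integer elements (a simplified microscaling-style format)."""
    qmax = 2 ** (bits - 1) - 1
    pad = (-len(x)) % block
    xp = np.concatenate([x, np.zeros(pad)]).reshape(-1, block)
    max_abs = np.abs(xp).max(axis=1, keepdims=True)
    # One power-of-two scale per block, chosen so the block fits in [-qmax, qmax].
    exp = np.ceil(np.log2(np.maximum(max_abs, 1e-30) / qmax))
    scale = 2.0 ** exp
    q = np.clip(np.round(xp / scale), -qmax, qmax).astype(np.int8)
    return q, scale, pad

def dequantize_blocks(q: np.ndarray, scale: np.ndarray, pad: int) -> np.ndarray:
    x = (q.astype(np.float64) * scale).reshape(-1)
    return x[: len(x) - pad] if pad else x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(scale=5.0, size=1000)
    q, scale, pad = quantize_blocks(x)
    x_hat = dequantize_blocks(q, scale, pad)
    print("max abs error:", np.max(np.abs(x - x_hat)))
```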

Semantic Parsing by Large Language Models for Intricate Updating Strategies of Zero-Shot Dialogue State Tracking

  • paper_url: http://arxiv.org/abs/2310.10520
  • repo_url: https://github.com/ToLightUpTheSky/ParsingDST
  • paper_authors: Yuxiang Wu, Guanting Dong, Weiran Xu
  • for: Zero-shot Dialogue State Tracking (DST) aims to address the challenge of acquiring and annotating task-oriented dialogues, which can be time-consuming and costly.
  • methods: The proposed ParsingDST method leverages powerful Large Language Models (LLMs) and semantic parsing to reformulate the DST task and improve updating strategies in the text-to-JSON process.
  • results: Experimental results show that ParsingDST outperforms existing zero-shot DST methods on MultiWOZ, with significant improvements in Joint Goal Accuracy (JGA) and slot accuracy compared to existing ICL methods.
    Abstract Zero-shot Dialogue State Tracking (DST) addresses the challenge of acquiring and annotating task-oriented dialogues, which can be time consuming and costly. However, DST extends beyond simple slot-filling and requires effective updating strategies for tracking dialogue state as conversations progress. In this paper, we propose ParsingDST, a new In-Context Learning (ICL) method, to introduce additional intricate updating strategies in zero-shot DST. Our approach reformulates the DST task by leveraging powerful Large Language Models (LLMs) and translating the original dialogue text to JSON through semantic parsing as an intermediate state. We also design a novel framework that includes more modules to ensure the effectiveness of updating strategies in the text-to-JSON process. Experimental results demonstrate that our approach outperforms existing zero-shot DST methods on MultiWOZ, exhibiting significant improvements in Joint Goal Accuracy (JGA) and slot accuracy compared to existing ICL methods.
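The text-to-JSON intermediate state boils down to having the model emit a structured delta for each turn and merging it into the running dialogue state, which is where the more intricate updating strategies live. The sketch below shows only that merge step with a hand-written delta; the real system obtains the delta through LLM-based semantic parsing, and the slot names and operations here are illustrative.

```python
import json

def update_dialogue_state(state: dict, turn_delta_json: str) -> dict:
    """Merge a per-turn JSON delta (as an LLM might emit after semantic
    parsing) into the accumulated dialogue state."""
    delta = json.loads(turn_delta_json)
    new_state = {k: dict(v) for k, v in state.items()}
    for domain, slots in delta.get("update", {}).items():
        new_state.setdefault(domain, {}).update(slots)
    for domain, slot_names in delta.get("delete", {}).items():
        for slot in slot_names:
            new_state.get(domain, {}).pop(slot, None)
    return new_state

if __name__ == "__main__":
    state = {"restaurant": {"area": "centre", "food": "italian"}}
    # Hypothetical parser output for: "Actually make it Chinese food, and I
    # don't care about the area."
    delta = json.dumps({
        "update": {"restaurant": {"food": "chinese"}},
        "delete": {"restaurant": ["area"]},
    })
    print(update_dialogue_state(state, delta))
```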

NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

  • paper_url: http://arxiv.org/abs/2310.10501
  • repo_url: None
  • paper_authors: Traian Rebedea, Razvan Dinu, Makesh Sreedhar, Christopher Parisien, Jonathan Cohen
  • for: This paper presents an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.
  • methods: In contrast to mechanisms such as model alignment, which embed guardrails into a specific model at training time, NeMo Guardrails uses a runtime inspired by dialogue management that lets developers add programmable rails that are user-defined, independent of the underlying LLM, and interpretable.
  • results: Initial results show that the proposed approach can be used with several LLM providers to develop controllable and safe LLM applications using programmable rails.
    Abstract NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems. Guardrails (or rails for short) are a specific way of controlling the output of an LLM, such as not talking about topics considered harmful, following a predefined dialogue path, using a particular language style, and more. There are several mechanisms that allow LLM providers and developers to add guardrails that are embedded into a specific model at training, e.g. using model alignment. Differently, using a runtime inspired from dialogue management, NeMo Guardrails allows developers to add programmable rails to LLM applications - these are user-defined, independent of the underlying LLM, and interpretable. Our initial results show that the proposed approach can be used with several LLM providers to develop controllable and safe LLM applications using programmable rails.

LocSelect: Target Speaker Localization with an Auditory Selective Hearing Mechanism

  • paper_url: http://arxiv.org/abs/2310.10497
  • repo_url: None
  • paper_authors: Yu Chen, Xinyuan Qian, Zexu Pan, Kainan Chen, Haizhou Li
  • for: This work proposes a target speaker localization algorithm with a selective hearing mechanism, addressing interference in multi-speaker scenarios.
  • methods: Given a reference speech of the target speaker, a speaker-dependent spectrogram mask is first generated to eliminate interfering speakers' speech; a long short-term memory (LSTM) network then extracts the target speaker's location from the filtered spectrogram.
  • results: Experiments show that the proposed method outperforms existing algorithms under different signal-to-noise ratio (SNR) conditions. At SNR = -10 dB, the proposed network LocSelect achieves a mean absolute error (MAE) of 3.55 and an accuracy (ACC) of 87.40%.
    Abstract The prevailing noise-resistant and reverberation-resistant localization algorithms primarily emphasize separating and providing directional output for each speaker in multi-speaker scenarios, without association with the identity of speakers. In this paper, we present a target speaker localization algorithm with a selective hearing mechanism. Given a reference speech of the target speaker, we first produce a speaker-dependent spectrogram mask to eliminate interfering speakers' speech. Subsequently, a Long short-term memory (LSTM) network is employed to extract the target speaker's location from the filtered spectrogram. Experiments validate the superiority of our proposed method over the existing algorithms for different scale invariant signal-to-noise ratios (SNR) conditions. Specifically, at SNR = -10 dB, our proposed network LocSelect achieves a mean absolute error (MAE) of 3.55 and an accuracy (ACC) of 87.40%.

Harnessing the Power of LLMs: Evaluating Human-AI Text Co-Creation through the Lens of News Headline Generation

  • paper_url: http://arxiv.org/abs/2310.10706
  • repo_url: https://github.com/jsndg/emnlp23-llm-headline
  • paper_authors: Zijian Ding, Alison Smith-Renner, Wenjuan Zhang, Joel R. Tetreault, Alejandro Jaimes
  • for: This study explores how people can best leverage LLMs for writing and how interacting with these models affects feelings of ownership and trust in the writing process.
  • methods: Common human-AI interaction types (guiding the system, selecting from system outputs, post-editing outputs) are compared in the context of LLM-assisted news headline generation.
  • results: Human control is needed to fix undesirable model outputs; guiding and selecting model output added the most benefit at the lowest cost in time and effort, and AI assistance did not harm participants' perception of control compared to freeform editing.
    Abstract To explore how humans can best leverage LLMs for writing and how interacting with these models affects feelings of ownership and trust in the writing process, we compared common human-AI interaction types (e.g., guiding system, selecting from system outputs, post-editing outputs) in the context of LLM-assisted news headline generation. While LLMs alone can generate satisfactory news headlines, on average, human control is needed to fix undesirable model outputs. Of the interaction methods, guiding and selecting model output added the most benefit with the lowest cost (in time and effort). Further, AI assistance did not harm participants' perception of control compared to freeform editing.

Type-aware Decoding via Explicitly Aggregating Event Information for Document-level Event Extraction

  • paper_url: http://arxiv.org/abs/2310.10487
  • repo_url: None
  • paper_authors: Gang Zhao, Yidong Shi, Shudong Lu, Xinjie Yang, Guanting Dong, Jian Xu, Xiaocheng Gong, Si Li
  • for: This work addresses the two main challenges of document-level event extraction (DEE): arguments scattering and multiple events per document. Previous methods have attempted to address these challenges but overlook the interference of event-unrelated sentences during event detection and neglect the mutual interference of different event roles during argument extraction.
  • methods: The paper proposes a Schema-based Explicitly Aggregating (SEA) model that aggregates event information into event type and role representations. Detecting each event based on its type representation mitigates interference from event-unrelated information, and extracting arguments for each role based on role-aware representations reduces mutual interference between roles.
  • results: Experimental results on the ChFinAnn and DuEE-fin datasets show that SEA outperforms state-of-the-art (SOTA) methods.
    Abstract Document-level event extraction (DEE) faces two main challenges: arguments-scattering and multi-event. Although previous methods attempt to address these challenges, they overlook the interference of event-unrelated sentences during event detection and neglect the mutual interference of different event roles during argument extraction. Therefore, this paper proposes a novel Schema-based Explicitly Aggregating~(SEA) model to address these limitations. SEA aggregates event information into event type and role representations, enabling the decoding of event records based on specific type-aware representations. By detecting each event based on its event type representation, SEA mitigates the interference caused by event-unrelated information. Furthermore, SEA extracts arguments for each role based on its role-aware representations, reducing mutual interference between different roles. Experimental results on the ChFinAnn and DuEE-fin datasets show that SEA outperforms the SOTA methods.

ManyQuadrupeds: Learning a Single Locomotion Policy for Diverse Quadruped Robots

  • paper_url: http://arxiv.org/abs/2310.10486
  • repo_url: None
  • paper_authors: Milad Shafiee, Guillaume Bellegarda, Auke Ijspeert
  • for: This research aims to develop a single locomotion policy that can control diverse quadruped robots without re-tuning hyperparameters and reward functions for each new robot.
  • methods: Drawing inspiration from animal motor control, the policy modulates a Central Pattern Generator (CPG) for rhythm generation together with a Pattern Formation (PF) layer, so that the same policy can be shared across robots with different morphologies.
  • results: The policy is tested on different robots and shows robust sim-to-real transfer, maintaining stable performance even when a 15 kg load (equivalent to 125% of the A1 robot's nominal mass) is added.
    Abstract Learning a locomotion policy for quadruped robots has traditionally been constrained to specific robot morphology, mass, and size. The learning process must usually be repeated for every new robot, where hyperparameters and reward function weights must be re-tuned to maximize performance for each new system. Alternatively, attempting to train a single policy to accommodate different robot sizes, while maintaining the same degrees of freedom (DoF) and morphology, requires either complex learning frameworks, or mass, inertia, and dimension randomization, which leads to prolonged training periods. In our study, we show that drawing inspiration from animal motor control allows us to effectively train a single locomotion policy capable of controlling a diverse range of quadruped robots. These differences encompass a variable number of DoFs, (i.e. 12 or 16 joints), three distinct morphologies, a broad mass range spanning from 2 kg to 200 kg, and nominal standing heights ranging from 16 cm to 100 cm. Our policy modulates a representation of the Central Pattern Generator (CPG) in the spinal cord, effectively coordinating both frequencies and amplitudes of the CPG to produce rhythmic output (Rhythm Generation), which is then mapped to a Pattern Formation (PF) layer. Across different robots, the only varying component is the PF layer, which adjusts the scaling parameters for the stride height and length. Subsequently, we evaluate the sim-to-real transfer by testing the single policy on both the Unitree Go1 and A1 robots. Remarkably, we observe robust performance, even when adding a 15 kg load, equivalent to 125% of the A1 robot's nominal mass.
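The control stack separates rhythm generation, a CPG whose frequencies and amplitudes the learned policy modulates, from pattern formation, a per-robot layer that scales the rhythm into stride height and length. The snippet below is a minimal one-leg version of that split using a standard amplitude-controlled oscillator; the oscillator form and the foot-trajectory mapping follow common CPG locomotion formulations and are not the authors' exact equations.

```python
import numpy as np

def simulate_leg(freq_hz: float, amp_target: float, stride_len: float,
                 stride_height: float, steps: int = 500, dt: float = 0.002):
    """Rhythm Generation: amplitude-controlled phase oscillator.
    Pattern Formation: scale oscillator output into a foot trajectory."""
    phase, amp, amp_vel = 0.0, 0.0, 0.0
    a = 50.0  # convergence gain for the amplitude dynamics
    xs, zs = [], []
    for _ in range(steps):
        # Second-order amplitude dynamics converging to the commanded amplitude.
        amp_acc = a * (a / 4.0 * (amp_target - amp) - amp_vel)
        amp_vel += amp_acc * dt
        amp += amp_vel * dt
        phase = (phase + 2 * np.pi * freq_hz * dt) % (2 * np.pi)
        # Pattern Formation layer: map (amp, phase) to foot x/z offsets.
        x = -stride_len * amp * np.cos(phase)
        z = stride_height * amp * np.sin(phase) if np.sin(phase) > 0 else 0.0
        xs.append(x)
        zs.append(z)
    return np.array(xs), np.array(zs)

if __name__ == "__main__":
    x, z = simulate_leg(freq_hz=2.0, amp_target=1.0,
                        stride_len=0.10, stride_height=0.06)
    print("stride span (m):", round(x.max() - x.min(), 3),
          "max clearance (m):", round(z.max(), 3))
```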

DemoSG: Demonstration-enhanced Schema-guided Generation for Low-resource Event Extraction

  • paper_url: http://arxiv.org/abs/2310.10481
  • repo_url: None
  • paper_authors: Gang Zhao, Xiaocheng Gong, Xinjie Yang, Guanting Dong, Shudong Lu, Si Li
  • for: To improve the effectiveness of event extraction (EE) in low-resource scenarios.
  • methods: The paper proposes two components: a demonstration-based learning paradigm that transforms annotated data into demonstrations illustrating the extraction process, and schema-based prompts that formulate EE as natural language generation, leveraging label semantics and promoting knowledge transfer.
  • results: Extensive experiments on three datasets under in-domain and domain-adaptation low-resource settings, together with a study of robustness, show that DemoSG significantly outperforms current methods in low-resource scenarios.
    Abstract Most current Event Extraction (EE) methods focus on the high-resource scenario, which requires a large amount of annotated data and can hardly be applied to low-resource domains. To address EE more effectively with limited resources, we propose the Demonstration-enhanced Schema-guided Generation (DemoSG) model, which benefits low-resource EE from two aspects: Firstly, we propose the demonstration-based learning paradigm for EE to fully use the annotated data, which transforms them into demonstrations to illustrate the extraction process and help the model learn effectively. Secondly, we formulate EE as a natural language generation task guided by schema-based prompts, thereby leveraging label semantics and promoting knowledge transfer in low-resource scenarios. We conduct extensive experiments under in-domain and domain adaptation low-resource settings on three datasets, and study the robustness of DemoSG. The results show that DemoSG significantly outperforms current methods in low-resource scenarios.

Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis

  • paper_url: http://arxiv.org/abs/2310.10477
  • repo_url: None
  • paper_authors: Kai Chen, Chunwei Wang, Kuo Yang, Jianhua Han, Lanqing Hong, Fei Mi, Hang Xu, Zhengying Liu, Wenyong Huang, Zhenguo Li, Dit-Yan Yeung, Lifeng Shang, Xin Jiang, Qun Liu
  • for: This research aims to improve the safety and alignment of large language models (LLMs), particularly with respect to harmful and toxic content.
  • methods: A novel alignment strategy rooted in mistake analysis is proposed: LLMs are deliberately exposed to flawed outputs, which are then thoroughly assessed to understand the internal reasons via natural language analysis, and the toxic responses are transformed into an instruction-tuning corpus for model alignment.
  • results: Experimental results show that the proposed method outperforms conventional alignment techniques for safety instruction following while maintaining superior efficiency.
    Abstract The rapid advancement of large language models (LLMs) presents both opportunities and challenges, particularly concerning unintentional generation of harmful and toxic responses. While the traditional alignment methods strive to steer LLMs towards desired performance and shield them from malicious content, this study proposes a novel alignment strategy rooted in mistake analysis by exposing LLMs to flawed outputs purposefully and then conducting a thorough assessment to fully comprehend internal reasons via natural language analysis. Thus, toxic responses can be transformed into instruction tuning corpus for model alignment, and LLMs can not only be deterred from generating flawed responses but also trained to self-criticize, leveraging its innate ability to discriminate toxic content. Experimental results demonstrate that the proposed method outperforms conventional alignment techniques for safety instruction following, while maintaining superior efficiency.

Stance Detection with Collaborative Role-Infused LLM-Based Agents

  • paper_url: http://arxiv.org/abs/2310.10467
  • repo_url: None
  • paper_authors: Xiaochong Lan, Chen Gao, Depeng Jin, Yong Li
  • for: This paper proposes a three-stage framework that helps large language models (LLMs) perform stance detection.
  • methods: The framework, COLA, assigns LLM-based agents distinct roles (a linguistic expert, a domain specialist, and a social media veteran) that collaborate across three stages: multidimensional text analysis, reasoning-enhanced debating, and stance conclusion.
  • results: The approach achieves state-of-the-art stance detection performance across multiple datasets without additional annotated data or model training, and further experiments demonstrate its explainability and versatility.
    Abstract Stance detection automatically detects the stance in a text towards a target, vital for content analysis in web and social media research. Despite their promising capabilities, LLMs encounter challenges when directly applied to stance detection. First, stance detection demands multi-aspect knowledge, from deciphering event-related terminologies to understanding the expression styles in social media platforms. Second, stance detection requires advanced reasoning to infer authors' implicit viewpoints, as stance are often subtly embedded rather than overtly stated in the text. To address these challenges, we design a three-stage framework COLA (short for Collaborative rOle-infused LLM-based Agents) in which LLMs are designated distinct roles, creating a collaborative system where each role contributes uniquely. Initially, in the multidimensional text analysis stage, we configure the LLMs to act as a linguistic expert, a domain specialist, and a social media veteran to get a multifaceted analysis of texts, thus overcoming the first challenge. Next, in the reasoning-enhanced debating stage, for each potential stance, we designate a specific LLM-based agent to advocate for it, guiding the LLM to detect logical connections between text features and stance, tackling the second challenge. Finally, in the stance conclusion stage, a final decision maker agent consolidates prior insights to determine the stance. Our approach avoids extra annotated data and model training and is highly usable. We achieve state-of-the-art performance across multiple datasets. Ablation studies validate the effectiveness of each design role in handling stance detection. Further experiments have demonstrated the explainability and the versatility of our approach. Our approach excels in usability, accuracy, effectiveness, explainability and versatility, highlighting its value.
    摘要 Automatic stance detection可以检测文本中对目标的立场,这对于网络和社交媒体研究是非常重要。然而,深入应用于检测的语言模型(LLMs)会遇到挑战。首先,检测立场需要多方面的知识,包括理解社交媒体平台上的表达方式和解读事件相关的术语。其次,检测立场需要高级的理解,以便推理出作者的潜在观点,因为立场通常不直接在文本中表达。为解决这些挑战,我们设计了一个三个阶段的框架,称为COLA(简称为协作型角色扮演 LLM 代理)。在这个框架中,LLMs被分配为不同的角色,形成一个协作的系统,每个角色具有唯一的贡献。在多维度文本分析阶段,我们配置 LLMS acted as语言专家、领域专家和社交媒体老手,以获得多方面的分析结果,从而解决第一个挑战。接着,在逻辑批判阶段,我们为每个可能的立场分配了一个特定的 LLM 代理,使得 LLMS 检测文本特征和立场之间的逻辑连接,解决第二个挑战。最后,在立场结论阶段,一个最终的决策者代理将先前的见解集成,以确定立场。我们的方法不需要额外的注释数据和模型训练,具有非常高的可用性。我们在多个数据集上实现了状态的最佳性能。剥离学习 validate了每个设计角色在处理检测立场方面的效果。进一步的实验还表明了我们的方法在可读性、准确性、有效性、可读性和多样性方面的优异。这些结果表明我们的方法具有价值。
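
The three-stage pipeline can be sketched as a handful of role-conditioned prompts. In the code below, `call_llm` is a placeholder for any chat-completion backend, and the role descriptions and stage prompts are paraphrased assumptions rather than the exact COLA templates.

```python
# Sketch of a three-stage, role-infused pipeline for stance detection.
from typing import Callable, Dict, List

ROLES = {
    "linguistic expert": "Analyze tone, sentiment, and rhetorical devices in the text.",
    "domain specialist": "Explain event-related terminology and background knowledge.",
    "social media veteran": "Interpret platform-specific slang, hashtags, and norms.",
}

def detect_stance(text: str, target: str, stances: List[str],
                  call_llm: Callable[[str], str]) -> Dict[str, str]:
    # Stage 1: multidimensional text analysis by role-conditioned agents.
    analyses = {
        role: call_llm(f"You are a {role}. {task}\nText: {text}")
        for role, task in ROLES.items()
    }
    context = "\n".join(f"[{r}] {a}" for r, a in analyses.items())

    # Stage 2: reasoning-enhanced debating, one advocate per candidate stance.
    arguments = {
        s: call_llm(
            f"Argue that the text's stance toward '{target}' is {s}, "
            f"citing the analyses below.\n{context}\nText: {text}"
        )
        for s in stances
    }

    # Stage 3: a final decision maker consolidates the arguments.
    debate = "\n".join(f"({s}) {a}" for s, a in arguments.items())
    verdict = call_llm(
        f"Given the arguments below, choose the stance toward '{target}' "
        f"from {stances} and justify briefly.\n{debate}"
    )
    return {"analyses": context, "debate": debate, "verdict": verdict}

if __name__ == "__main__":
    echo = lambda prompt: prompt.splitlines()[0]  # stub backend for a dry run
    result = detect_stance("Banning cars downtown is long overdue.", "car ban",
                           ["favor", "against", "neutral"], echo)
    print(result["verdict"])
```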

Machine Learning Techniques for Identifying the Defective Patterns in Semiconductor Wafer Maps: A Survey, Empirical, and Experimental Evaluations

  • paper_url: http://arxiv.org/abs/2310.10705
  • repo_url: None
  • paper_authors: Kamal Taha
    for:This survey paper provides a comprehensive review of machine learning (ML) techniques for identifying wafer defects in semiconductor manufacturing, aiming to fill a void in the existing literature and provide an in-depth analysis of the advantages, limitations, and potential applications of various ML algorithms in this field.methods:The paper employs a four-tier taxonomy to classify ML algorithms into more refined categories and techniques, providing a detailed understanding of the complex relationships between different algorithms and their sub-techniques. The taxonomy includes broad methodology categories, specific sub-techniques, and experimental evaluations to rank the techniques.results:The paper presents a comprehensive empirical evaluation of the techniques based on four criteria and an experimental evaluation that ranks the algorithms employing the same sub-techniques, techniques, sub-categories, and categories. The approach provides a detailed and holistic understanding of ML techniques and algorithms for identifying wafer defects, guiding researchers towards making more informed decisions in their work. The paper also highlights future prospects and opportunities for further research in this field.
    Abstract This survey paper offers a comprehensive review of methodologies utilizing machine learning (ML) techniques for identifying wafer defects in semiconductor manufacturing. Despite the growing body of research demonstrating the effectiveness of ML in wafer defect identification, there is a noticeable absence of comprehensive reviews on this subject. This survey attempts to fill this void by amalgamating available literature and providing an in-depth analysis of the advantages, limitations, and potential applications of various ML algorithms in the realm of wafer defect detection. An innovative taxonomy of methodologies that we present provides a detailed classification of algorithms into more refined categories and techniques. This taxonomy follows a four-tier structure, starting from broad methodology categories and ending with specific sub-techniques. It aids researchers in comprehending the complex relationships between different algorithms and their techniques. We employ a rigorous empirical and experimental evaluation to rank these varying techniques. For the empirical evaluation, we assess techniques based on a set of four criteria. The experimental evaluation ranks the algorithms employing the same sub-techniques, techniques, sub-categories, and categories. This integration of a multi-layered taxonomy, empirical evaluations, and comparative experiments provides a detailed and holistic understanding of ML techniques and algorithms for identifying wafer defects. This approach guides researchers towards making more informed decisions in their work. Additionally, the paper illuminates the future prospects of ML techniques for wafer defect identification, underscoring potential advancements and opportunities for further research in this field
    摘要 这篇综述论文提供了机器学习(ML)技术在半导体制造过程中识别晶圆缺陷的全面回顾。尽管已有越来越多的研究证明了ML在晶圆缺陷识别中的有效性,但该领域仍缺乏全面的综述。本文尝试填补这一空白,汇总现有文献,并深入分析各类ML算法在晶圆缺陷检测中的优点、局限和潜在应用。我们提出了一种新的方法学分类体系,将算法划分为更细化的类别和技术;该分类体系采用四层结构,从宏观的方法类别一直细分到具体的子技术,帮助研究人员理解不同算法及其技术之间的复杂关系。我们依据四项标准对这些技术进行了严格的实证评估,并通过实验评估对采用相同子技术、技术、子类别和类别的算法进行排名。这种多层分类、实证评估与对比实验相结合的方式,为识别晶圆缺陷的ML技术和算法提供了全面而系统的理解,可以帮助研究人员做出更明智的决策。此外,本文还展望了ML技术在晶圆缺陷识别中的未来前景,指出了该领域潜在的进展和进一步研究的机会。

On the Relevance of Temporal Features for Medical Ultrasound Video Recognition

  • paper_url: http://arxiv.org/abs/2310.10453
  • repo_url: https://github.com/MedAI-Clemson/pda_detection
  • paper_authors: D. Hudson Smith, John Paul Lineberger, George H. Baker
  • for: This work aims to improve the sample efficiency of medical ultrasound video recognition, particularly in low-data regimes.
  • methods: The paper proposes a novel multi-head attention architecture that encodes time-independence as an inductive prior, since many ultrasound tasks involve identifying key anatomical features regardless of when they appear in the video (a pooling sketch follows this entry).
  • results: Compared with an efficient 3D CNN video recognition model, the proposed architecture performs better on common ultrasound tasks that do not require temporal features, especially when training data is artificially limited; the outcome reverses on tasks that do require temporal features.
    Abstract Many medical ultrasound video recognition tasks involve identifying key anatomical features regardless of when they appear in the video suggesting that modeling such tasks may not benefit from temporal features. Correspondingly, model architectures that exclude temporal features may have better sample efficiency. We propose a novel multi-head attention architecture that incorporates these hypotheses as inductive priors to achieve better sample efficiency on common ultrasound tasks. We compare the performance of our architecture to an efficient 3D CNN video recognition model in two settings: one where we expect not to require temporal features and one where we do. In the former setting, our model outperforms the 3D CNN - especially when we artificially limit the training data. In the latter, the outcome reverses. These results suggest that expressive time-independent models may be more effective than state-of-the-art video recognition models for some common ultrasound tasks in the low-data regime.
    摘要 “许多医疗超音波录影 задачі都涉及到识别关键生物学特征,不论在录影中出现的时间点。这表明模型化这些任务可能不需要时间特征。因此,不包含时间特征的模型架构可能会有更好的样本效率。我们提出了一个新的多头注意架构,将这两个假设作为导引假设,以 achieve better sample efficiency on common ultrasound tasks。我们将比较我们的架构和一个高效的3D CNN录影识别模型在两个设定下的性能:一个情况下,我们不需要时间特征,一个情况下,我们需要时间特征。在前一个情况下,我们的模型比3D CNN更高效,特别是当我们人工限制训练数据时。在后一个情况下,结果逆转。这些结果表明表现出时间独立的表达模型可能比现有的录影识别模型在低数据情况下更有效。”
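
One way to realize an expressive time-independent model is attention pooling over per-frame embeddings with no positional encoding, so the aggregation ignores frame order by construction. The NumPy sketch below (single head, random weights) is our simplified illustration of that inductive prior, not the authors' architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(frame_feats, w_q, w_k, w_v):
    """Order-invariant pooling: a learned query attends over frame embeddings."""
    # frame_feats: (T, D) per-frame features from a 2D image encoder.
    q = w_q                                   # (D,) learned query, independent of time
    k = frame_feats @ w_k                     # (T, D)
    v = frame_feats @ w_v                     # (T, D)
    scores = k @ q / np.sqrt(k.shape[-1])     # (T,)
    attn = softmax(scores)                    # (T,) weights over frames
    return attn @ v                           # (D,) video-level embedding

rng = np.random.default_rng(0)
T, D = 16, 32
frames = rng.normal(size=(T, D))
w_q, w_k, w_v = rng.normal(size=(D,)), rng.normal(size=(D, D)), rng.normal(size=(D, D))

pooled = attention_pool(frames, w_q, w_k, w_v)
shuffled = attention_pool(frames[rng.permutation(T)], w_q, w_k, w_v)
print(np.allclose(pooled, shuffled))  # True: the representation ignores frame order
```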

Text Summarization Using Large Language Models: A Comparative Study of MPT-7b-instruct, Falcon-7b-instruct, and OpenAI Chat-GPT Models

  • paper_url: http://arxiv.org/abs/2310.10449
  • repo_url: https://github.com/lbasyal/llms-text-summarization
  • paper_authors: Lochan Basyal, Mihir Sanghvi
  • for: This paper explores the use of Large Language Models (LLMs) for text summarization, specifically comparing the performance of three different models (MPT-7b-instruct, falcon-7b-instruct, and OpenAI ChatGPT text-davinci-003) on two datasets (CNN Daily Mail and XSum).
  • methods: The paper uses a diverse set of LLMs and evaluates their performance using widely accepted metrics such as BLEU Score, ROUGE Score, and BERT Score. The experiment involves different hyperparameters and aims to provide a comprehensive understanding of the effectiveness of LLMs for text summarization (a minimal scoring example follows this entry).
  • results: According to the experiment, text-davinci-003 outperformed the other two models, demonstrating its effectiveness for text summarization. The paper provides valuable insights for researchers and practitioners within the NLP domain and lays the foundation for the development of advanced Generative AI applications.
    Abstract Text summarization is a critical Natural Language Processing (NLP) task with applications ranging from information retrieval to content generation. Leveraging Large Language Models (LLMs) has shown remarkable promise in enhancing summarization techniques. This paper embarks on an exploration of text summarization with a diverse set of LLMs, including MPT-7b-instruct, falcon-7b-instruct, and OpenAI ChatGPT text-davinci-003 models. The experiment was performed with different hyperparameters and evaluated the generated summaries using widely accepted metrics such as the Bilingual Evaluation Understudy (BLEU) Score, Recall-Oriented Understudy for Gisting Evaluation (ROUGE) Score, and Bidirectional Encoder Representations from Transformers (BERT) Score. According to the experiment, text-davinci-003 outperformed the others. This investigation involved two distinct datasets: CNN Daily Mail and XSum. Its primary objective was to provide a comprehensive understanding of the performance of Large Language Models (LLMs) when applied to different datasets. The assessment of these models' effectiveness contributes valuable insights to researchers and practitioners within the NLP domain. This work serves as a resource for those interested in harnessing the potential of LLMs for text summarization and lays the foundation for the development of advanced Generative AI applications aimed at addressing a wide spectrum of business challenges.
    摘要 文本概要是一个重要的自然语言处理(NLP)任务,其应用范围从信息检索到内容生成。利用大语言模型(LLMs)已经显著提高了概要技术。这篇论文展开了使用多种LLMs进行文本概要的研究,包括MPT-7b-instruct、falcon-7b-instruct和OpenAI ChatGPT text-davinci-003模型。实验中使用了不同的超参数,并使用了通用的评价指标如双语评价下study(BLEU)分数、推理引导下的学生评价(ROUGE)分数和Transformers的扩展语言模型(BERT)分数进行评估生成的概要。根据实验结果,text-davinci-003表现最佳。这项研究使用了两个不同的数据集:CNN Daily Mail和XSum。研究的主要目标是为NLP领域的研究者和实践者提供LLMs在不同数据集上的性能评估,以便更好地利用LLMs的潜力。这项工作作为NLP领域的研究资源,并为开发高级生成AI应用程序提供了基础。
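
The comparison ultimately reduces to scoring generated summaries against references. As a dependency-free illustration, the snippet below computes a unigram ROUGE-1 F1 from scratch; the study itself reports BLEU, ROUGE, and BERTScore computed with the standard packages on CNN/DailyMail and XSum references.

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    """Unigram ROUGE-1 F1: overlap of candidate and reference token counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

summaries = {
    "model_a": "the court rejected the appeal on monday",
    "model_b": "monday ruling rejects appeal",
}
reference = "the appeal was rejected by the court on monday"

for name, summary in summaries.items():
    print(name, round(rouge1_f(summary, reference), 3))
```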

Large Language Model-Empowered Agents for Simulating Macroeconomic Activities

  • paper_url: http://arxiv.org/abs/2310.10436
  • repo_url: None
  • paper_authors: Nian Li, Chen Gao, Yong Li, Qingmin Liao
  • for: This paper explores the use of large language models (LLMs) in macroeconomic simulation to address three main challenges of traditional agent-based models: agent heterogeneity, the influence of macroeconomic trends, and the interaction of multifaceted economic factors.
  • methods: The paper designs prompt-engineering-driven LLM agents that exhibit human-like characteristics, including perception, reflection, and decision-making, to model human economic behavior.
  • results: In simulations of macroeconomic activities, LLM-empowered agents make more realistic work and consumption decisions and give rise to more reasonable macroeconomic phenomena than existing rule-based or AI agents, demonstrating the potential of LLMs for macroeconomic simulation.
    Abstract The advent of the Web has brought about a paradigm shift in traditional economics, particularly in the digital economy era, enabling the precise recording and analysis of individual economic behavior. This has led to a growing emphasis on data-driven modeling in macroeconomics. In macroeconomic research, Agent-based modeling (ABM) emerged as an alternative, evolving through rule-based agents, machine learning-enhanced decision-making, and, more recently, advanced AI agents. However, the existing works are suffering from three main challenges when endowing agents with human-like decision-making, including agent heterogeneity, the influence of macroeconomic trends, and multifaceted economic factors. Large language models (LLMs) have recently gained prominence in offering autonomous human-like characteristics. Therefore, leveraging LLMs in macroeconomic simulation presents an opportunity to overcome traditional limitations. In this work, we take an early step in introducing a novel approach that leverages LLMs in macroeconomic simulation. We design prompt-engineering-driven LLM agents to exhibit human-like decision-making and adaptability in the economic environment, with the abilities of perception, reflection, and decision-making to address the abovementioned challenges. Simulation experiments on macroeconomic activities show that LLM-empowered agents can make realistic work and consumption decisions and emerge more reasonable macroeconomic phenomena than existing rule-based or AI agents. Our work demonstrates the promising potential to simulate macroeconomics based on LLM and its human-like characteristics.
    摘要 互联网的出现带来了传统经济学中的 Paradigm shift,特别是在数位经济时代,允许精确地录取和分析个人经济行为。这导致了对数据驱动模型在macroeconomics中的增加强调。在macroeconomic研究中,Agent-based modeling(ABM) emerged as an alternative,通过规则生成的代理人、机器学习增强的决策和、更近期的进步AI代理人。然而,现有的工作受到三大挑战,包括代理人多样性、macroeconomic趋势的影响和多方面的经济因素。 latest Large language models(LLMs)have recently gained prominence in offering autonomous human-like characteristics. Therefore, leveraging LLMs in macroeconomic simulation presents an opportunity to overcome traditional limitations. In this work, we take an early step in introducing a novel approach that leverages LLMs in macroeconomic simulation. We design prompt-engineering-driven LLM agents to exhibit human-like decision-making and adaptability in the economic environment, with the abilities of perception, reflection, and decision-making to address the above-mentioned challenges. Simulation experiments on macroeconomic activities show that LLM-empowered agents can make realistic work and consumption decisions and emerge more reasonable macroeconomic phenomena than existing rule-based or AI agents. Our work demonstrates the promising potential to simulate macroeconomics based on LLM and its human-like characteristics.

Longitudinal Self-supervised Learning Using Neural Ordinary Differential Equation

  • paper_url: http://arxiv.org/abs/2310.10431
  • repo_url: None
  • paper_authors: Rachid Zeghlache, Pierre-Henri Conze, Mostafa El Habib Daho, Yihao Li, Hugo Le Boité, Ramin Tadayoni, Pascal Massin, Béatrice Cochener, Ikram Brahim, Gwenolé Quellec, Mathieu Lamard
  • for: investigate the progressive changes in anatomical structures or disease progression over time
  • methods: longitudinal self-supervised learning (LSSL) algorithm embedded in an auto-encoder (AE) structure, Siamese-like LSSL, and neural ordinary differential equation (NODE)
  • results: demonstration of LSSL without including a reconstruction term, and the potential of incorporating NODE in conjunction with LSSL
    Abstract Longitudinal analysis in medical imaging is crucial to investigate the progressive changes in anatomical structures or disease progression over time. In recent years, a novel class of algorithms has emerged with the goal of learning disease progression in a self-supervised manner, using either pairs of consecutive images or time series of images. By capturing temporal patterns without external labels or supervision, longitudinal self-supervised learning (LSSL) has become a promising avenue. To better understand this core method, we explore in this paper the LSSL algorithm under different scenarios. The original LSSL is embedded in an auto-encoder (AE) structure. However, conventional self-supervised strategies are usually implemented in a Siamese-like manner. Therefore, (as a first novelty) in this study, we explore the use of Siamese-like LSSL. Another new core framework named neural ordinary differential equation (NODE). NODE is a neural network architecture that learns the dynamics of ordinary differential equations (ODE) through the use of neural networks. Many temporal systems can be described by ODE, including modeling disease progression. We believe that there is an interesting connection to make between LSSL and NODE. This paper aims at providing a better understanding of those core algorithms for learning the disease progression with the mentioned change. In our different experiments, we employ a longitudinal dataset, named OPHDIAT, targeting diabetic retinopathy (DR) follow-up. Our results demonstrate the application of LSSL without including a reconstruction term, as well as the potential of incorporating NODE in conjunction with LSSL.
    摘要 长itudinal分析在医学成像中是关键性的,用于探索时间序列中结构或疾病进程的变化。近年来,一种新的算法类型出现了,即无supervision的自适应学习疾病进程(LSSL),使用时间序列或邻接图像对。通过捕捉时间模式,LSSL成为了一个有前途的方向。为更好地理解这种核心方法,我们在这篇论文中探讨了LSSL算法在不同情况下的表现。原始LSSL被嵌入了自适应encoder(AE)结构中。然而,传统的自适应策略通常是在Siamese-like的方式实现。因此,在本研究中,我们探索了使用Siamese-like LSSL。另一个新的核心框架是神经ordinary differential equation(NODE)。NODE是一种使用神经网络学习ordinary differential equation(ODE)的神经网络架构。许多时间系统可以通过ODE进行描述,包括疾病进程。我们认为在LSSL和NODE之间存在一种有趣的联系。本文的目标是为了更好地理解这些核心算法,以便更好地学习疾病进程的变化。在不同的实验中,我们使用了名为OPHDIAT的长itudinal数据集,targeting diabetic retinopathy(DR)跟踪。我们的结果表明了不包含重建项的LSSL的应用,以及将NODE与LSSL相结合的潜在优势。
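
The NODE ingredient parameterizes the time derivative of a latent code with a small network and integrates it between imaging visits, which handles irregular follow-up intervals naturally. The sketch below shows that idea with an untrained two-layer MLP and a fixed-step Euler integrator; real implementations would use an adaptive ODE solver and train the dynamics (and the auto-encoder) end to end.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16                      # latent size, hidden width
W1, b1 = rng.normal(0, 0.1, (H, D)), np.zeros(H)
W2, b2 = rng.normal(0, 0.1, (D, H)), np.zeros(D)

def f(z, t):
    """Learned time derivative dz/dt = f(z, t); here an untrained 2-layer MLP."""
    h = np.tanh(W1 @ z + b1)
    return W2 @ h + b2

def odeint_euler(z0, t0, t1, steps=100):
    """Integrate the latent state from visit time t0 to t1 with fixed-step Euler."""
    z, t = z0.copy(), t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        z = z + dt * f(z, t)
        t += dt
    return z

z_baseline = rng.normal(size=D)                   # latent code of the baseline scan (e.g. from an AE)
z_month_6 = odeint_euler(z_baseline, 0.0, 0.5)    # predicted latent 6 months later
z_month_18 = odeint_euler(z_baseline, 0.0, 1.5)   # and 18 months later
print(z_month_6[:3], z_month_18[:3])
```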

Reading Books is Great, But Not if You Are Driving! Visually Grounded Reasoning about Defeasible Commonsense Norms

  • paper_url: http://arxiv.org/abs/2310.10418
  • repo_url: https://github.com/wade3han/normlens
  • paper_authors: Seungju Han, Junhyeok Kim, Jack Hessel, Liwei Jiang, Jiwan Chung, Yejin Son, Yejin Choi, Youngjae Yu
  • for: To study visually grounded reasoning about defeasible commonsense norms, which is easy for humans but challenging for machines.
  • methods: The authors construct NORMLENS, a new multimodal benchmark built from human judgments with free-form explanations, to probe how well models align with average human judgment and how well they explain their predicted judgments.
  • results: State-of-the-art model judgments and explanations are found to be poorly aligned with human annotations, and a new approach that distills social commonsense knowledge from large language models improves this alignment.
    Abstract Commonsense norms are defeasible by context: reading books is usually great, but not when driving a car. While contexts can be explicitly described in language, in embodied scenarios, contexts are often provided visually. This type of visually grounded reasoning about defeasible commonsense norms is generally easy for humans, but (as we show) poses a challenge for machines, as it necessitates both visual understanding and reasoning about commonsense norms. We construct a new multimodal benchmark for studying visual-grounded commonsense norms: NORMLENS. NORMLENS consists of 10K human judgments accompanied by free-form explanations covering 2K multimodal situations, and serves as a probe to address two questions: (1) to what extent can models align with average human judgment? and (2) how well can models explain their predicted judgments? We find that state-of-the-art model judgments and explanations are not well-aligned with human annotation. Additionally, we present a new approach to better align models with humans by distilling social commonsense knowledge from large language models. The data and code are released at https://seungjuhan.me/normlens.
    摘要 通常的规则是可以被上下文所推翻:读书通常是非常好的,但不是在开车时。而在实际情况下,上下文通常是通过视觉提供的。这种基于视觉的常识规则的理解和判断是人类非常容易做,但对机器来说是一种挑战,因为它需要同时具备视觉理解和常识规则的理解。我们构建了一个新的多Modal benchMark,名为NORMLENS,用于研究基于视觉的常识规则。NORMLENS包含10,000个人类判断,以及2,000个多Modal的情况,并用于解决两个问题:(1)模型与人类平均判断是否能够Alignment?(2)模型如何解释其预测的判断?我们发现当前的模型判断和解释都与人类注释不一致。此外,我们还提出了一种新的方法,通过从大语言模型中提取社会常识知识来更好地将模型与人类Alignment。数据和代码在https://seungjuhan.me/normlens上发布。

Real-Fake: Effective Training Data Synthesis Through Distribution Matching

  • paper_url: http://arxiv.org/abs/2310.10402
  • repo_url: None
  • paper_authors: Jianhao Yuan, Jie Zhang, Shuyang Sun, Philip Torr, Bo Zhao
  • for: To improve the training efficiency and robustness of deep learning models through synthetic training data.
  • methods: A training data synthesis approach grounded in a distribution-matching theoretical framework that explains the mechanisms governing synthesis efficacy.
  • results: Across diverse image classification tasks, the synthetic data can both replace and augment real data, and also benefits challenging settings such as out-of-distribution generalization and privacy preservation.
    Abstract Synthetic training data has gained prominence in numerous learning tasks and scenarios, offering advantages such as dataset augmentation, generalization evaluation, and privacy preservation. Despite these benefits, the efficiency of synthetic data generated by current methodologies remains inferior when training advanced deep models exclusively, limiting its practical utility. To address this challenge, we analyze the principles underlying training data synthesis for supervised learning and elucidate a principled theoretical framework from the distribution-matching perspective that explicates the mechanisms governing synthesis efficacy. Through extensive experiments, we demonstrate the effectiveness of our synthetic data across diverse image classification tasks, both as a replacement for and augmentation to real datasets, while also benefits challenging tasks such as out-of-distribution generalization and privacy preservation.
    摘要 现代深度学习模型的训练数据 synthetic 技术在各种学习任务和场景中得到了广泛应用,具有提高数据量、提高模型性能、隐私保护等优点。然而,现有的synthetic数据生成方法在训练高级深度模型时效率仍然较低,限制其实际应用。为解决这个挑战,我们分析了supervised 学习数据生成的原理,从分布匹配角度出发,探讨生成效果的机制。经过广泛的实验,我们证明了我们的synthetic数据在多种图像分类任务中具有广泛的应用价值,可以替代真实数据,也可以增强模型的性能,同时在难题上如out-of-distribution泛化和隐私保护等方面具有优势。

Can Word Sense Distribution Detect Semantic Changes of Words?

  • paper_url: http://arxiv.org/abs/2310.10400
  • repo_url: https://github.com/LivNLP/Sense-based-Semantic-Change-Prediction
  • paper_authors: Xiaohang Tang, Yi Zhou, Taichi Aida, Procheta Sen, Danushka Bollegala
  • for: To explore whether word sense distributions computed from corpora collected at different time steps can be used to predict semantic changes of words.
  • methods: Pretrained static sense embeddings are used to automatically annotate each occurrence of a target word with a sense id; the distribution of sense ids is then computed per corpus, and divergence or distance measures between the two distributions quantify the semantic change (a minimal sketch follows this entry).
  • results: Experiments on the SemEval 2020 Task 1 dataset show that word sense distributions can accurately predict semantic changes of words in English, German, Swedish, and Latin.
    Abstract Semantic Change Detection (SCD) of words is an important task for various NLP applications that must make time-sensitive predictions. Some words are used over time in novel ways to express new meanings, and these new meanings establish themselves as novel senses of existing words. On the other hand, Word Sense Disambiguation (WSD) methods associate ambiguous words with sense ids, depending on the context in which they occur. Given this relationship between WSD and SCD, we explore the possibility of predicting whether a target word has its meaning changed between two corpora collected at different time steps, by comparing the distributions of senses of that word in each corpora. For this purpose, we use pretrained static sense embeddings to automatically annotate each occurrence of the target word in a corpus with a sense id. Next, we compute the distribution of sense ids of a target word in a given corpus. Finally, we use different divergence or distance measures to quantify the semantic change of the target word across the two given corpora. Our experimental results on SemEval 2020 Task 1 dataset show that word sense distributions can be accurately used to predict semantic changes of words in English, German, Swedish and Latin.
    摘要 词义变化检测(SCD)对许多需要做出时间敏感预测的自然语言处理(NLP)应用十分重要。一些词语随着时间被用来表达新的含义,这些新含义逐渐确立为已有词语的新义项;另一方面,词义消歧(WSD)方法根据上下文将歧义词关联到相应的义项编号。鉴于WSD与SCD之间的这种联系,我们探索能否通过比较目标词在两个不同时间点收集的语料中的义项分布,来预测其含义是否发生了变化。为此,我们使用预训练的静态义项嵌入自动为目标词在语料中的每次出现标注义项编号,然后计算目标词在给定语料中的义项分布,最后使用不同的散度或距离度量来量化目标词在两个语料之间的语义变化。在SemEval 2020 Task 1数据集上的实验结果表明,词义分布可以准确地预测英语、德语、瑞典语和拉丁语中词语的语义变化。
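
The detection pipeline is essentially two sense histograms and a divergence between them. The sketch below stubs out the sense tagger (`assign_sense`; the paper instead tags occurrences with pretrained static sense embeddings) and scores the change with the Jensen-Shannon distance, one of several divergence or distance measures one could plug in.

```python
import numpy as np
from collections import Counter

SENSES = ["cell_biology", "cell_phone", "cell_prison"]

def assign_sense(context: str) -> str:
    """Stub sense tagger; the paper assigns a sense id per occurrence using
    pretrained static sense embeddings (e.g. nearest sense vector)."""
    if "phone" in context or "call" in context:
        return "cell_phone"
    if "prison" in context or "inmate" in context:
        return "cell_prison"
    return "cell_biology"

def sense_distribution(contexts):
    counts = Counter(assign_sense(c) for c in contexts)
    total = sum(counts.values())
    return np.array([counts[s] / total for s in SENSES])

def js_distance(p, q, eps=1e-12):
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

corpus_t1 = ["the cell divides under the microscope", "an inmate left his prison cell"]
corpus_t2 = ["she answered the call on her cell", "my cell phone battery died"]

p, q = sense_distribution(corpus_t1), sense_distribution(corpus_t2)
print("change score:", round(js_distance(p, q), 3))  # higher = more semantic change
```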

Towards Open World Active Learning for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2310.10391
  • repo_url: None
  • paper_authors: Zhuoxiao Chen, Yadan Luo, Zixin Wang, Zijian Wang, Xin Yu, Zi Huang
  • for: To address open-world 3D object detection, where new object classes appear, by efficiently selecting a small number of 3D boxes to annotate while maximizing detection performance on both known and unknown classes.
  • methods: The paper proposes OpenCRB, a simple yet effective active learning strategy that unifies relational constraints between box quantities and their confidences to acquire the most informative point clouds with the fewest boxes to label.
  • results: On open-world 3D object detection benchmarks, OpenCRB recognizes both novel and shared categories with very limited labeling cost, outperforming state-of-the-art baselines.
    Abstract Significant strides have been made in closed world 3D object detection, testing systems in environments with known classes. However, the challenge arises in open world scenarios where new object classes appear. Existing efforts sequentially learn novel classes from streams of labeled data at a significant annotation cost, impeding efficient deployment to the wild. To seek effective solutions, we investigate a more practical yet challenging research task: Open World Active Learning for 3D Object Detection (OWAL-3D), aiming at selecting a small number of 3D boxes to annotate while maximizing detection performance on both known and unknown classes. The core difficulty centers on striking a balance between mining more unknown instances and minimizing the labeling expenses of point clouds. Empirically, our study finds the harmonious and inverse relationship between box quantities and their confidences can help alleviate the dilemma, avoiding the repeated selection of common known instances and focusing on uncertain objects that are potentially unknown. We unify both relational constraints into a simple and effective AL strategy namely OpenCRB, which guides to acquisition of informative point clouds with the least amount of boxes to label. Furthermore, we develop a comprehensive codebase for easy reproducing and future research, supporting 15 baseline methods (i.e., active learning, out-of-distribution detection and open world detection), 2 types of modern 3D detectors (i.e., one-stage SECOND and two-stage PV-RCNN) and 3 benchmark 3D datasets (i.e., KITTI, nuScenes and Waymo). Extensive experiments evidence that the proposed Open-CRB demonstrates superiority and flexibility in recognizing both novel and shared categories with very limited labeling costs, compared to state-of-the-art baselines.

Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models

  • paper_url: http://arxiv.org/abs/2310.10378
  • repo_url: https://github.com/Betswish/Cross-Lingual-Consistency
  • paper_authors: Jirui Qi, Raquel Fernández, Arianna Bisazza
  • for: The paper aims to study the cross-lingual consistency (CLC) of factual knowledge in various multilingual pre-trained language models (PLMs) and to identify the determining factors for CLC.
  • methods: The authors propose a Ranking-based Consistency (RankC) metric to evaluate knowledge consistency across languages independently from accuracy. They conduct an in-depth analysis of the determining factors for CLC, both at model level and at language-pair level.
  • results: The authors find that increasing model size leads to higher factual probing accuracy in most languages, but does not improve cross-lingual consistency. They also conduct a case study on CLC when new factual associations are inserted in the PLMs via model editing, and find that the new piece of knowledge transfers only to languages with which English has a high RankC score.
    Abstract Multilingual large-scale Pretrained Language Models (PLMs) have been shown to store considerable amounts of factual knowledge, but large variations are observed across languages. With the ultimate goal of ensuring that users with different language backgrounds obtain consistent feedback from the same model, we study the cross-lingual consistency (CLC) of factual knowledge in various multilingual PLMs. To this end, we propose a Ranking-based Consistency (RankC) metric to evaluate knowledge consistency across languages independently from accuracy. Using this metric, we conduct an in-depth analysis of the determining factors for CLC, both at model level and at language-pair level. Among other results, we find that increasing model size leads to higher factual probing accuracy in most languages, but does not improve cross-lingual consistency. Finally, we conduct a case study on CLC when new factual associations are inserted in the PLMs via model editing. Results on a small sample of facts inserted in English reveal a clear pattern whereby the new piece of knowledge transfers only to languages with which English has a high RankC score.
    摘要 多语言大规模预训练语言模型(PLM)已被证明存储了大量事实知识,但不同语言之间存在明显差异。为了确保不同语言背景的用户能够从同一模型获得一致的反馈,我们研究了多种多语言预训练语言模型中事实知识的跨语言一致性(CLC)。为此,我们提出了基于排名的一致性度量(RankC),用于在不受准确率影响的情况下评估不同语言之间的知识一致性。基于该度量,我们从模型层面和语言对层面深入分析了决定CLC的因素。我们发现,增大模型规模通常能提高大多数语言的事实探测准确率,但并不能改善跨语言一致性。最后,我们通过模型编辑向PLM中插入新的事实关联,并对CLC进行了案例研究:对少量以英语插入的事实的结果显示出清晰的规律,即新知识只会迁移到与英语RankC分数较高的语言。

GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers

  • paper_url: http://arxiv.org/abs/2310.10375
  • repo_url: https://github.com/autonomousvision/gta
  • paper_authors: Takeru Miyato, Bernhard Jaeger, Max Welling, Andreas Geiger
  • for: To improve the learning efficiency and performance of transformer models on 3D vision tasks without additional learned parameters and with only minor computational overhead.
  • methods: Geometric Transform Attention (GTA), a geometry-aware attention mechanism, encodes the geometric structure of tokens as relative transformations determined by the geometric relationship between queries and key-value pairs.
  • results: On sparse wide-baseline multi-view novel view synthesis (NVS) datasets, GTA improves the learning efficiency and performance of state-of-the-art transformer-based NVS models.
    Abstract As transformers are equivariant to the permutation of input tokens, encoding the positional information of tokens is necessary for many tasks. However, since existing positional encoding schemes have been initially designed for NLP tasks, their suitability for vision tasks, which typically exhibit different structural properties in their data, is questionable. We argue that existing positional encoding schemes are suboptimal for 3D vision tasks, as they do not respect their underlying 3D geometric structure. Based on this hypothesis, we propose a geometry-aware attention mechanism that encodes the geometric structure of tokens as relative transformation determined by the geometric relationship between queries and key-value pairs. By evaluating on multiple novel view synthesis (NVS) datasets in the sparse wide-baseline multi-view setting, we show that our attention, called Geometric Transform Attention (GTA), improves learning efficiency and performance of state-of-the-art transformer-based NVS models without any additional learned parameters and only minor computational overhead.
    摘要 transformers是 permutation 的equivariant,因此需要encoding位置信息来进行许多任务。然而,现有的 pozitional 编码方案最初是为 NLP 任务设计的,因此对于视觉任务,它们的适用性是有问题的。我们认为现有的 pozitional 编码方案对于 3D 视觉任务是不佳的,因为它们不尊重它们的下面 3D 结构。基于这个假设,我们提出了一种 geometry-aware 注意力机制,该机制通过 queries 和 key-value 对的几何关系来确定 tokens 的几何变换。我们通过在多视图 synthesis (NVS) 数据集上评估了多个 sparse wide-baseline 多视图设定,并证明了我们的注意力(GTA)可以提高基于 transformer 的 NVS 模型的学习效率和性能,无需额外学习参数,只需要少量的计算开销。

Prompt Tuning for Multi-View Graph Contrastive Learning

  • paper_url: http://arxiv.org/abs/2310.10362
  • repo_url: None
  • paper_authors: Chenghua Gong, Xiang Li, Jianxiang Yu, Cheng Yao, Jiaqi Tan, Chengcheng Yu, Dawei Yin
  • for: To address the label dependency and poor generalization performance of traditional GNNs, which the "pre-train, fine-tune" and "pre-train, prompt" paradigms aim to alleviate.
  • methods: A multi-view graph contrastive learning method is proposed as the pretext task, together with a prompt tuning method that bridges the gap between the pretext and downstream tasks after reformulating both into a common format.
  • results: Extensive experiments on benchmark datasets demonstrate that the proposed approach effectively improves GNN performance.
    Abstract In recent years, "pre-training and fine-tuning" has emerged as a promising approach in addressing the issues of label dependency and poor generalization performance in traditional GNNs. To reduce labeling requirement, the "pre-train, fine-tune" and "pre-train, prompt" paradigms have become increasingly common. In particular, prompt tuning is a popular alternative to "pre-training and fine-tuning" in natural language processing, which is designed to narrow the gap between pre-training and downstream objectives. However, existing study of prompting on graphs is still limited, lacking a framework that can accommodate commonly used graph pre-training methods and downstream tasks. In this paper, we propose a multi-view graph contrastive learning method as pretext and design a prompting tuning for it. Specifically, we first reformulate graph pre-training and downstream tasks into a common format. Second, we construct multi-view contrasts to capture relevant information of graphs by GNN. Third, we design a prompting tuning method for our multi-view graph contrastive learning method to bridge the gap between pretexts and downsteam tasks. Finally, we conduct extensive experiments on benchmark datasets to evaluate and analyze our proposed method.
    摘要 Recently, "预训练和细化" 方法在解决传统GNNS中的标签依赖和泛化性问题上得到了广泛的应用。以减少标签需求为目的,"预训练、细化" 和 "预训练、提示" 两种方法在自然语言处理领域中得到了广泛的应用。特别是,提示练习是自然语言处理中的一种流行的代替方法,旨在减少预训练和下游目标之间的差距。然而,现有的图像提示研究仍然受限,缺乏一个框架可以整合通用的图像预训练方法和下游任务。在这篇论文中,我们提出了一种多视图图像对照学习方法作为预文,并设计了一种提示练习方法来衔接这两者。具体来说,我们首先将图像预训练和下游任务转化为共同的格式。其次,我们使用多视图对照来捕捉图像中重要信息。最后,我们设计了一种提示练习方法来桥接预文和下游任务之间的差距。我们在标准 benchmark 数据集上进行了广泛的实验,以评估和分析我们的提议方法。

Tabular Representation, Noisy Operators, and Impacts on Table Structure Understanding Tasks in LLMs

  • paper_url: http://arxiv.org/abs/2310.10358
  • repo_url: None
  • paper_authors: Ananya Singha, José Cambronero, Sumit Gulwani, Vu Le, Chris Parnin
  • for: To study how large language models (LLMs) handle tabular tasks via in-context learning, and how the format used to represent a table in the prompt affects their performance.
  • methods: Building on prior work, the authors generate a collection of self-supervised structural tasks (e.g., navigating to a cell or row; transposing the table) and compare LLM performance across 8 table formats; they further introduce 8 noise operations inspired by real-world messy data and adversarial inputs (a small sketch follows this entry).
  • results: LLM performance differs markedly across formats, and the noise operations can further degrade it, indicating that a table's structure and noise characteristics should be considered when choosing a prompt representation.
    Abstract Large language models (LLMs) are increasingly applied for tabular tasks using in-context learning. The prompt representation for a table may play a role in the LLMs ability to process the table. Inspired by prior work, we generate a collection of self-supervised structural tasks (e.g. navigate to a cell and row; transpose the table) and evaluate the performance differences when using 8 formats. In contrast to past work, we introduce 8 noise operations inspired by real-world messy data and adversarial inputs, and show that such operations can impact LLM performance across formats for different structural understanding tasks.
    摘要 大型语言模型(LLM)在表格任务中使用内容学习,表格提示表现可能影响 LLM 的处理能力。受到先前的工作启发,我们生成了一个自动supervised的结构任务集(例如:前往矩格和行),并评估不同格式的性能差异。相比于过去的工作,我们引入了8种噪音操作,这些操作是根据实际的混乱数据和敌方输入而设计的,并证明这些操作可以影响 LLM 的性能在不同结构理解任务中。
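
The probing setup is easy to reproduce in miniature: serialize a table in some format, optionally corrupt it with a noise operation, and ask the model a structural question about it. The sketch below shows two serialization formats and two noise operations in the spirit of the paper (shuffled rows and collapsed cell delimiters); the study's own formats and eight noise operations may differ in detail.

```python
import random

table = {
    "header": ["city", "price", "rating"],
    "rows": [["Lyon", "120", "4.2"], ["Porto", "95", "4.6"], ["Graz", "110", "4.1"]],
}

def to_markdown(t):
    """Serialize as a markdown table, one common prompt format."""
    lines = ["| " + " | ".join(t["header"]) + " |",
             "|" + "---|" * len(t["header"])]
    lines += ["| " + " | ".join(r) + " |" for r in t["rows"]]
    return "\n".join(lines)

def to_json_records(t):
    """Serialize as a list of row dictionaries, another prompt format."""
    return [dict(zip(t["header"], r)) for r in t["rows"]]

def noise_shuffle_rows(t, seed=0):
    """Noise operation: permute the row order."""
    rows = t["rows"][:]
    random.Random(seed).shuffle(rows)
    return {"header": t["header"], "rows": rows}

def noise_merge_delimiters(serialized: str):
    """Noise operation: simulate messy extraction by collapsing cell separators."""
    return serialized.replace(" | ", " ")

print(to_markdown(noise_shuffle_rows(table)))
print(noise_merge_delimiters(to_markdown(table)))
print(to_json_records(table)[:1])
```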

Compressed Sensing of Generative Sparse-latent (GSL) Signals

  • paper_url: http://arxiv.org/abs/2310.15119
  • repo_url: None
  • paper_authors: Antoine Honoré, Anubhab Ghosh, Saikat Chatterjee
  • for: To reconstruct an ambient signal in a compressed sensing (CS) setup where the signal is produced by a neural generative model with a sparse latent input (a generative sparse-latent, GSL, signal).
  • methods: A sparsity-inducing reconstruction algorithm is proposed; the problem is inherently non-convex and is solved with a gradient-based search (a toy sketch follows this entry).
  • results: Experiments on simulated data show that the gradient-based search achieves good reconstruction performance.
    Abstract We consider reconstruction of an ambient signal in a compressed sensing (CS) setup where the ambient signal has a neural network based generative model. The generative model has a sparse-latent input and we refer to the generated ambient signal as generative sparse-latent signal (GSL). The proposed sparsity inducing reconstruction algorithm is inherently non-convex, and we show that a gradient based search provides a good reconstruction performance. We evaluate our proposed algorithm using simulated data.
    摘要 我们考虑在压缩感知(CS)设置中重建环境信号,该环境信号由基于神经网络的生成模型产生。生成模型具有稀疏潜变量输入,我们将生成的环境信号称为生成稀疏潜变量信号(GSL)。我们提出的稀疏性诱导重建算法本质上是非凸的,我们表明基于梯度的搜索可以获得良好的重建性能。我们使用模拟数据评估了所提出的算法。
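
As a toy illustration of the sparsity-inducing reconstruction, the snippet below runs proximal gradient descent (ISTA) on a measurement-fit loss plus an l1 penalty on the latent input, using a random linear map as a stand-in generator so everything stays closed-form. The paper's setting replaces this stand-in with a trained neural generator, which makes the problem non-convex and motivates the gradient-based search.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 64, 32, 20              # ambient dim, latent dim, number of measurements

G = rng.normal(size=(n, k))       # stand-in "generator" (linear here; a neural net in the paper)
A = rng.normal(size=(m, n)) / np.sqrt(m)   # compressed-sensing measurement matrix

z_true = np.zeros(k)
z_true[rng.choice(k, size=4, replace=False)] = rng.normal(size=4)
x_true = G @ z_true               # ambient GSL signal
y = A @ x_true                    # noiseless measurements

M = A @ G                         # effective operator acting on the sparse latent
step = 1.0 / np.linalg.norm(M, 2) ** 2     # safe step size for the smooth term
lam = 0.01                        # weight of the l1 sparsity penalty on z

z = np.zeros(k)
for _ in range(2000):             # ISTA: gradient step + soft-thresholding
    z = z - step * (M.T @ (M @ z - y))
    z = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

x_hat = G @ z
print("relative reconstruction error:",
      np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```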

Optimizing Layerwise Polynomial Approximation for Efficient Private Inference on Fully Homomorphic Encryption: A Dynamic Programming Approach

  • paper_url: http://arxiv.org/abs/2310.10349
  • repo_url: None
  • paper_authors: Junghyun Lee, Eunsang Lee, Young-Sik Kim, Yongwoo Lee, Joon-Woo Lee, Yongjune Kim, Jong-Seon No
  • for: Privacy-preserving deep neural networks built solely on fully homomorphic encryption suffer from prolonged inference times, largely because high-degree polynomial approximations of activation functions such as ReLU consume substantial homomorphic computational resources.
  • methods: Instead of a uniform minimax approximation, activation functions are approximated layerwise by weighted least squares based on each layer's input distribution; the per-layer degrees are then chosen by a dynamic programming algorithm that accounts for how each layer's approximation error affects classification accuracy, and the ciphertext moduli-chain is also adjusted layerwise to further reduce inference time (see the sketch after this entry).
  • results: The proposed layerwise optimization reduces inference time by 3.44x for ResNet-20 and 3.16x for ResNet-32 compared with prior implementations that use uniform-degree polynomials and a fixed ciphertext modulus.
    Abstract Recent research has explored the implementation of privacy-preserving deep neural networks solely using fully homomorphic encryption. However, its practicality has been limited because of prolonged inference times. When using a pre-trained model without retraining, a major factor contributing to these prolonged inference times is the high-degree polynomial approximation of activation functions such as the ReLU function. The high-degree approximation consumes a substantial amount of homomorphic computational resources, resulting in slower inference. Unlike the previous works approximating activation functions uniformly and conservatively, this paper presents a \emph{layerwise} degree optimization of activation functions to aggressively reduce the inference time while maintaining classification accuracy by taking into account the characteristics of each layer. Instead of the minimax approximation commonly used in state-of-the-art private inference models, we employ the weighted least squares approximation method with the input distributions of activation functions. Then, we obtain the layerwise optimized degrees for activation functions through the \emph{dynamic programming} algorithm, considering how each layer's approximation error affects the classification accuracy of the deep neural network. Furthermore, we propose modulating the ciphertext moduli-chain layerwise to reduce the inference time. By these proposed layerwise optimization methods, we can reduce inference times for the ResNet-20 model and the ResNet-32 model by 3.44 times and 3.16 times, respectively, in comparison to the prior implementations employing uniform degree polynomials and a consistent ciphertext modulus.
    摘要 近期研究探讨了使用完全同质加密实现隐私保护深度神经网络。然而,它的实用性受到了 prolonged inference times 的限制。当使用预训练模型无需重新训练时,一个主要的因素是高度 polynomials 的激活函数approximation,如 ReLU 函数。高度的激活函数approximation 需要大量的同质计算资源,导致更慢的推理。 unlike previous works approximating activation functions uniformly and conservatively, this paper presents a layerwise degree optimization of activation functions to aggressively reduce the inference time while maintaining classification accuracy by taking into account the characteristics of each layer。 instead of the minimax approximation commonly used in state-of-the-art private inference models, we employ the weighted least squares approximation method with the input distributions of activation functions。 then, we obtain the layerwise optimized degrees for activation functions through the dynamic programming algorithm, considering how each layer's approximation error affects the classification accuracy of the deep neural network。 furthermore, we propose modulating the ciphertext moduli-chain layerwise to reduce the inference time。 by these proposed layerwise optimization methods, we can reduce inference times for the ResNet-20 model and the ResNet-32 model by 3.44 times and 3.16 times, respectively, in comparison to the prior implementations employing uniform degree polynomials and a consistent ciphertext modulus。
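
The per-layer approximation step is ordinary weighted least squares: fit a polynomial to ReLU with weights given by that layer's input distribution, so approximation accuracy is concentrated where inputs actually fall. The snippet below illustrates this with an assumed Gaussian input model and NumPy's polynomial fitting; the degree-selection dynamic program and the homomorphic-encryption runtime are not shown.

```python
import numpy as np
import numpy.polynomial.polynomial as P

def fit_relu_poly(degree, mu=0.0, sigma=1.0, grid=2001, span=6.0):
    """Weighted least-squares polynomial fit of ReLU.
    Weights follow an assumed Gaussian input distribution N(mu, sigma^2)."""
    x = np.linspace(mu - span * sigma, mu + span * sigma, grid)
    y = np.maximum(x, 0.0)
    w = np.exp(-0.5 * ((x - mu) / sigma) ** 2)      # input-density weights
    coefs = P.polyfit(x, y, degree, w=np.sqrt(w))   # polyfit weights multiply residuals
    return coefs, x, y, w

for degree in (3, 7, 15):
    c, x, y, w = fit_relu_poly(degree, mu=0.5, sigma=1.2)
    err = np.sqrt(np.average((P.polyval(x, c) - y) ** 2, weights=w))
    print(f"degree {degree:2d}  weighted RMSE {err:.4f}")  # lower degree -> cheaper HE, larger error
```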

Attribution Patching Outperforms Automated Circuit Discovery

  • paper_url: http://arxiv.org/abs/2310.10348
  • repo_url: None
  • paper_authors: Aaquib Syed, Can Rager, Arthur Conmy
  • for: To scale automated interpretability, i.e., explanations of neural network behavior, to large models.
  • methods: A simple method based on attribution patching, a linear approximation to activation patching, estimates the importance of each edge in the computational subgraph and prunes the least important edges to recover circuits (a toy example follows this entry).
  • results: The method outperforms all existing automated circuit discovery methods while requiring only two forward passes and a backward pass, achieving the highest average AUC for circuit recovery across tasks.
    Abstract Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation patching to identify subnetworks responsible for solving specific tasks (circuits). In this work, we show that a simple method based on attribution patching outperforms all existing methods while requiring just two forward passes and a backward pass. We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph. Using this approximation, we prune the least important edges of the network. We survey the performance and limitations of this method, finding that averaged over all tasks our method has greater AUC from circuit recovery than other methods.
    摘要 自动化可读性研究近期吸引了关注,作为可扩展 neural network 行为的解释的潜在研究方向。现有的自动化电路发现工作使用活动贴图来确定解决特定任务(电路)负责的子网络。在这种工作中,我们显示了一种简单的方法,基于归因贴图,超过所有现有方法,只需要两次前进和一次反向传播。我们使用线性近似来估计活动贴图中每个边的重要性。使用这种近似,我们剪枝网络中最不重要的边。我们对这种方法的性能和局限性进行了抽查,发现在所有任务上的均值AUC greater than other methods。
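
The linear approximation at the core of attribution patching scores each edge as (corrupted activation minus clean activation) times the gradient of the loss with respect to that activation, i.e. a first-order estimate of what true activation patching would do, obtained from two forward passes and one backward pass. The toy example below applies the formula to the hidden units of a tiny network where the gradient is available in closed form and compares the estimate with the exactly patched loss.

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 6, 5
W1, w2 = rng.normal(size=(H, D)), rng.normal(size=H)
target = 1.0

def forward(x):
    h = np.tanh(W1 @ x)                 # "edge" activations we might patch
    out = w2 @ h
    loss = 0.5 * (out - target) ** 2
    return h, out, loss

x_clean, x_corrupt = rng.normal(size=D), rng.normal(size=D)
h_clean, out_clean, loss_clean = forward(x_clean)
h_corrupt, _, _ = forward(x_corrupt)

grad_h = (out_clean - target) * w2      # dL/dh on the clean run (one backward pass)

for j in range(H):
    # True effect: actually patch activation j with its corrupted value.
    h_patched = h_clean.copy()
    h_patched[j] = h_corrupt[j]
    out_patched = w2 @ h_patched
    true_delta = 0.5 * (out_patched - target) ** 2 - loss_clean
    # Attribution patching: first-order estimate, no extra forward pass per edge.
    approx_delta = (h_corrupt[j] - h_clean[j]) * grad_h[j]
    print(f"edge {j}: true {true_delta:+.4f}  approx {approx_delta:+.4f}")
```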

Unlocking Metasurface Practicality for B5G Networks: AI-assisted RIS Planning

  • paper_url: http://arxiv.org/abs/2310.10330
  • repo_url: None
  • paper_authors: Guillermo Encinas-Lago, Antonio Albanese, Vincenzo Sciancalepore, Marco Di Renzo, Xavier Costa-Pérez
  • for: To explore how reconfigurable intelligent surfaces (RISs) can improve wireless network performance, particularly in beyond-fifth-generation (B5G) networks and in areas with little or no coverage.
  • methods: A deep reinforcement learning (DRL) solution, dubbed D-RISA, trains a DRL agent to obtain an optimal RIS deployment.
  • results: In the indoor scenario of the Rennes railway station in France, D-RISA provides better coverage (a 10 dB increase in minimum signal-to-noise ratio) at lower computational time (up to -25 percent) than state-of-the-art approaches, with better scalability toward denser network deployments.
    Abstract The advent of reconfigurable intelligent surfaces(RISs) brings along significant improvements for wireless technology on the verge of beyond-fifth-generation networks (B5G).The proven flexibility in influencing the propagation environment opens up the possibility of programmatically altering the wireless channel to the advantage of network designers, enabling the exploitation of higher-frequency bands for superior throughput overcoming the challenging electromagnetic (EM) propagation properties at these frequency bands. However, RISs are not magic bullets. Their employment comes with significant complexity, requiring ad-hoc deployments and management operations to come to fruition. In this paper, we tackle the open problem of bringing RISs to the field, focusing on areas with little or no coverage. In fact, we present a first-of-its-kind deep reinforcement learning (DRL) solution, dubbed as D-RISA, which trains a DRL agent and, in turn, obtain san optimal RIS deployment. We validate our framework in the indoor scenario of the Rennes railway station in France, assessing the performance of our algorithm against state-of-the-art (SOA) approaches. Our benchmarks showcase better coverage, i.e., 10-dB increase in minimum signal-to-noise ratio (SNR), at lower computational time (up to -25 percent) while improving scalability towards denser network deployments.
    摘要 随着智能重新配置表面(RIS)的出现, fifth-generation wireless networks(B5G)的技术将得到显著改进。 RIS 的可变性使得网络设计者可以通过程序控制无线通信环境,从而在高频段上获得更高的吞吐量,并且可以超越高频段的电磁波媒体传输性的挑战。 然而, RIS 不是一个魔术药丸。它的使用需要适当的部署和管理操作,以便实现。在这篇论文中,我们解决了将 RIS 引入到实际应用中的开放问题,特别是在有少量或无覆盖的地区。我们提出了一种首先的深度优化学(DRL)解决方案,称为 D-RISA,它在 RIS 部署方面进行优化。我们在法国雷恩火车站的indoorenario中验证了我们的框架,并与现有的方法进行比较。我们的标准显示了更好的覆盖率,即10dB的最小信号响应比(SNR)的提高,同时减少计算时间(下降至25%),并改善了网络部署的扩展性。

Interpreting and Exploiting Functional Specialization in Multi-Head Attention under Multi-task Learning

  • paper_url: http://arxiv.org/abs/2310.10318
  • repo_url: https://github.com/znlp/functionalspecializationinmha
  • paper_authors: Chong Li, Shaonan Wang, Yunhao Zhang, Jiajun Zhang, Chengqing Zong
  • for: 这paper aims to investigate the functional specialization of multi-head attention in transformer-based models under multi-task learning.
  • methods: The authors propose an interpreting method to quantify the degree of functional specialization in multi-head attention and a simple multi-task training method to increase functional specialization and mitigate negative information transfer.
  • results: Experimental results on seven pre-trained transformer models demonstrate that multi-head attention evolves functional specialization after multi-task training, which is affected by the similarity of tasks. The proposed multi-task training strategy based on functional specialization boosts performance in both multi-task learning and transfer learning without adding any parameters.
    Abstract Transformer-based models, even though achieving super-human performance on several downstream tasks, are often regarded as a black box and used as a whole. It is still unclear what mechanisms they have learned, especially their core module: multi-head attention. Inspired by functional specialization in the human brain, which helps to efficiently handle multiple tasks, this work attempts to figure out whether the multi-head attention module will evolve similar function separation under multi-tasking training. If it is, can this mechanism further improve the model performance? To investigate these questions, we introduce an interpreting method to quantify the degree of functional specialization in multi-head attention. We further propose a simple multi-task training method to increase functional specialization and mitigate negative information transfer in multi-task learning. Experimental results on seven pre-trained transformer models have demonstrated that multi-head attention does evolve functional specialization phenomenon after multi-task training which is affected by the similarity of tasks. Moreover, the multi-task training strategy based on functional specialization boosts performance in both multi-task learning and transfer learning without adding any parameters.
    摘要 transformer-based模型,即使达到了人类超常表现,常被视为黑盒子,无法了解它们学习的机制,尤其是核心模块:多头注意力。我们受到人脑功能特化的启发,人脑可以有效地处理多个任务,因此我们想知道多头注意力模块是否会在多任务训练中发展类似的功能分化现象。如果是,那么这种机制可能会进一步提高模型性能吗?为了回答这些问题,我们提出了一种量化多头注意力中函数特化程度的解释方法。此外,我们还提出了一种简单的多任务训练方法,可以增强功能特化并降低多任务学习中的负信息传递。实验结果表明,在七种预训练transformer模型中,多头注意力会在多任务训练后发展功能分化现象,这种现象受到任务相似度的影响。此外,基于功能特化的多任务训练策略可以提高多任务学习和传播学习的性能,无需添加参数。

End-to-end Offline Reinforcement Learning for Glycemia Control

  • paper_url: http://arxiv.org/abs/2310.10312
  • repo_url: None
  • paper_authors: Tristan Beolet, Alice Adenis, Erik Huneker, Maxime Louis
  • for: To improve the performance and adaptability of closed-loop glycemia control systems for type 1 diabetes without over-fitting to patient simulators.
  • methods: Offline RL agents are trained on real patient data, and an end-to-end personalization pipeline based on offline-policy evaluation removes the need for a simulator while still enabling the estimation of clinically relevant metrics for diabetes.
  • results: The offline RL agents and personalization pipeline improve closed-loop glycemia control while reducing the risks associated with simulator over-fitting.
    Abstract The development of closed-loop systems for glycemia control in type I diabetes relies heavily on simulated patients. Improving the performances and adaptability of these close-loops raises the risk of over-fitting the simulator. This may have dire consequences, especially in unusual cases which were not faithfully-if at all-captured by the simulator. To address this, we propose to use offline RL agents, trained on real patient data, to perform the glycemia control. To further improve the performances, we propose an end-to-end personalization pipeline, which leverages offline-policy evaluation methods to remove altogether the need of a simulator, while still enabling an estimation of clinically relevant metrics for diabetes.
    摘要 开发闭环系统控制型一 диа베ت斯需要启用模拟患者。提高这些闭环的性能和适应性可能会增加过拟合模拟器的风险,特别是在不常见的情况下。为解决这个问题,我们提议使用线上RL代理,在真实患者数据上训练,来实现血糖控制。此外,我们还提议一个终端个性化管道,利用线上策略评估方法来完全排除模拟器的需求,同时仍能估计临床重要指标。

Learning visual-based deformable object rearrangement with local graph neural networks

  • paper_url: http://arxiv.org/abs/2310.10307
  • repo_url: https://github.com/dengyh16code/deformable-gnn
  • paper_authors: Yuhong Deng, Xueqian Wang, Lipeng chen
  • for: To solve goal-conditioned rearrangement of deformable objects (e.g., ropes and cloth), where a robot must bring a deformable object into a prescribed goal configuration from visual observations only.
  • methods: A novel representation models the deformable object's state with a set of keypoints and their interactions, and a light local graph neural network (GNN) jointly learns the deformable rearrangement dynamics and infers optimal pick-and-place actions by constructing and updating two dynamic graphs.
  • results: Simulated and real experiments show that the dynamic graph representation models deformable rearrangement dynamics more effectively, reaching a 96.3% average success rate (versus the state of the art) with a 60% shorter inference time, performing well in multi-task learning, and transferring to real-world applications at a 95% average success rate by fine-tuning only a keypoint detector.
    Abstract Goal-conditioned rearrangement of deformable objects (e.g. straightening a rope and folding a cloth) is one of the most common deformable manipulation tasks, where the robot needs to rearrange a deformable object into a prescribed goal configuration with only visual observations. These tasks are typically confronted with two main challenges: the high dimensionality of deformable configuration space and the underlying complexity, nonlinearity and uncertainty inherent in deformable dynamics. To address these challenges, we propose a novel representation strategy that can efficiently model the deformable object states with a set of keypoints and their interactions. We further propose local-graph neural network (GNN), a light local GNN learning to jointly model the deformable rearrangement dynamics and infer the optimal manipulation actions (e.g. pick and place) by constructing and updating two dynamic graphs. Both simulated and real experiments have been conducted to demonstrate that the proposed dynamic graph representation shows superior expressiveness in modeling deformable rearrangement dynamics. Our method reaches much higher success rates on a variety of deformable rearrangement tasks (96.3% on average) than state-of-the-art method in simulation experiments. Besides, our method is much more lighter and has a 60% shorter inference time than state-of-the-art methods. We also demonstrate that our method performs well in the multi-task learning scenario and can be transferred to real-world applications with an average success rate of 95% by solely fine tuning a keypoint detector.
    摘要 目标条件的可变形物体重排(例如将绳子拉直、把布折好)是最常见的可变形操作任务之一:机器人仅凭视觉观察,就需要把可变形物体重新排布到预定的目标构型。这类任务通常面临两大挑战:一是可变形物体构型空间维度极高,二是可变形动力学内在的复杂性、非线性和不确定性。为了解决这些挑战,我们提出了一种新的表示策略,用一组关键点及其相互作用来高效地建模可变形物体状态。我们进一步提出了一种轻量级的局部图神经网络(local-GNN),通过构建并更新两个动态图,同时学习可变形重排动力学并推断最优操作动作(例如抓取与放置)。仿真和真实实验均表明,所提出的动态图表示在建模可变形重排动力学方面具有更强的表达能力。在仿真实验中,我们的方法在多种可变形重排任务上的成功率(平均96.3%)远高于现有最佳方法;同时模型更加轻量,推理时间缩短了60%。我们还证明了该方法在多任务学习场景下表现良好,并且只需微调关键点检测器即可迁移到真实场景,平均成功率达95%。
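A minimal sketch of the keypoint-graph idea described above: detected keypoints become graph nodes, a k-nearest-neighbour rule supplies edges, and a small message-passing layer updates per-keypoint features. The layer sizes and the k-NN edge rule are illustrative assumptions, not the paper's local-GNN architecture (which is available in the linked repository):

```python
import torch
import torch.nn as nn

def knn_edges(points, k=4):
    """points: (N, 2) keypoint coordinates -> list of directed (i, j) edges."""
    d = torch.cdist(points, points)                    # pairwise distances
    idx = d.topk(k + 1, largest=False).indices[:, 1:]  # nearest neighbours, skip self
    return [(i, int(j)) for i in range(points.size(0)) for j in idx[i]]

class TinyMessagePassing(nn.Module):
    """One message-passing round over keypoint features (illustrative only)."""
    def __init__(self, dim=16):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.upd = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, x, edges):
        agg = torch.zeros_like(x)
        for i, j in edges:                             # aggregate messages j -> i
            agg[i] = agg[i] + self.msg(torch.cat([x[i], x[j]]))
        return self.upd(torch.cat([x, agg], dim=-1))

points = torch.rand(8, 2)    # 8 detected keypoints on a rope or cloth
feats = torch.rand(8, 16)    # per-keypoint features
out = TinyMessagePassing()(feats, knn_edges(points))
print(out.shape)             # torch.Size([8, 16])
```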

Forking Uncertainties: Reliable Prediction and Model Predictive Control with Sequence Models via Conformal Risk Control

  • paper_url: http://arxiv.org/abs/2310.10299
  • repo_url: None
  • paper_authors: Matteo Zecchin, Sangwoo Park, Osvaldo Simeone
  • for: 本文旨在提供一种基于 probabilistic implicit or explicit sequence model 的预测 uncertainty 管理方法,以便在具有复杂动力学和分支轨迹的 cyber-physical systems 中提供可靠性和安全性 garanties。
  • methods: 本文提出了概率时间序列-一致性风险预测方法(PTS-CRC),它基于从序列模型中采样的多条原型轨迹构建预测集,从而高效地刻画分叉轨迹带来的不确定性;并且能够满足覆盖率之外的可靠性定义,进而支持更高效的控制策略设计。
  • results: 实验结果表明,PTS-CRC 预测器可以提供信息量更大的预测集,并在安全性或质量约束下给出回报更高的控制策略,在无线网络的多种任务中均表现出色。
    Abstract In many real-world problems, predictions are leveraged to monitor and control cyber-physical systems, demanding guarantees on the satisfaction of reliability and safety requirements. However, predictions are inherently uncertain, and managing prediction uncertainty presents significant challenges in environments characterized by complex dynamics and forking trajectories. In this work, we assume access to a pre-designed probabilistic implicit or explicit sequence model, which may have been obtained using model-based or model-free methods. We introduce probabilistic time series-conformal risk prediction (PTS-CRC), a novel post-hoc calibration procedure that operates on the predictions produced by any pre-designed probabilistic forecaster to yield reliable error bars. In contrast to existing art, PTS-CRC produces predictive sets based on an ensemble of multiple prototype trajectories sampled from the sequence model, supporting the efficient representation of forking uncertainties. Furthermore, unlike the state of the art, PTS-CRC can satisfy reliability definitions beyond coverage. This property is leveraged to devise a novel model predictive control (MPC) framework that addresses open-loop and closed-loop control problems under general average constraints on the quality or safety of the control policy. We experimentally validate the performance of PTS-CRC prediction and control by studying a number of use cases in the context of wireless networking. Across all the considered tasks, PTS-CRC predictors are shown to provide more informative predictive sets, as well as safe control policies with larger returns.
    摘要 在许多实际问题中,预测被用于监测和控制信息物理系统,这要求对可靠性与安全性提供保证。然而,预测本身具有不确定性,在动力学复杂、轨迹存在分叉的环境中管理预测不确定性面临巨大挑战。在这项工作中,我们假设已有一个预先设计的概率式隐式或显式序列模型,该模型可以由基于模型或无模型的方法获得。我们提出了概率时间序列-一致性风险预测(PTS-CRC),这是一种新的事后校准方法,可作用于任意预先设计的概率预测器之上,从而给出可靠的误差范围。与现有方法不同,PTS-CRC 基于从序列模型中采样的多条原型轨迹构建预测集,能够高效地表示分叉不确定性。此外,PTS-CRC 还可以满足覆盖率之外的可靠性定义。我们利用这一特性设计了一种新的模型预测控制(MPC)框架,用于在控制策略质量或安全性的一般平均约束下求解开环与闭环控制问题。我们在无线网络场景中的若干用例上对 PTS-CRC 的预测与控制性能进行了实验验证。在所有考察的任务中,PTS-CRC 预测器都给出了信息量更大的预测集,以及回报更高的安全控制策略。
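A rough sketch of the multiple-prototype predictive-set idea, assuming trajectories can be sampled from the sequence model: calibrate a radius on held-out data so a coverage-style risk is met, then report a union of tubes around the sampled prototypes. This simplified recipe only mimics the coverage special case; PTS-CRC's conformal risk control is more general, and all function names here are hypothetical:

```python
import numpy as np

def calibrate_radius(cal_protos, cal_truth, alpha=0.1):
    """cal_protos: (n_cal, n_protos, horizon) trajectories sampled from the
    sequence model; cal_truth: (n_cal, horizon) realized trajectories.
    Returns a radius r such that, on roughly (1-alpha) of calibration examples,
    the true trajectory stays within r of at least one sampled prototype."""
    # distance from the truth to its closest prototype, per calibration example
    scores = np.min(np.abs(cal_protos - cal_truth[:, None, :]).max(axis=-1), axis=-1)
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)   # conformal correction
    return float(np.quantile(scores, level))

def predictive_set(test_protos, r):
    """Predictive set = union of tubes of half-width r around each prototype."""
    return [(p - r, p + r) for p in test_protos]   # per-prototype lower/upper bands

rng = np.random.default_rng(0)
r = calibrate_radius(rng.normal(size=(200, 5, 10)), rng.normal(size=(200, 10)))
bands = predictive_set(rng.normal(size=(5, 10)), r)
print(round(r, 3), len(bands))
```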

Key-phrase boosted unsupervised summary generation for FinTech organization

  • paper_url: http://arxiv.org/abs/2310.10294
  • repo_url: None
  • paper_authors: Aadit Deshpande, Shreya Goyal, Prateek Nagwanshi, Avinash Tripathy
  • for: 这篇论文的目的是提出一种基于Action-Object对的自动生成社交媒体摘要方法,以帮助金融科技公司更好地利用社交媒体语言数据,并提供一个外部视角来对consumer behavior进行分析。
  • methods: 这篇论文使用了NLP技术,特别是意向检测、情感分类和文本概要生成等应用,以处理社交媒体语言数据。它还提出了一种基于Action-Object对的自动生成社交媒体摘要方法,以增强对社交媒体语言数据的分析和利用。
  • results: 该论文对 Reddit 讨论串的社交媒体语言数据进行了分析,并与其他基于关键短语的摘要生成方法进行了对比评估,证明了基于 Action-Object 对的摘要方法的有效性。具体来说,该方法在各项上下文指标(Context Metrics)上表现出显著优势,包括独特词数、Action-Object 对数和名词块数。
    Abstract With the recent advances in social media, the use of NLP techniques in social media data analysis has become an emerging research direction. Business organizations can particularly benefit from such an analysis of social media discourse, providing an external perspective on consumer behavior. Some of the NLP applications such as intent detection, sentiment classification, text summarization can help FinTech organizations to utilize the social media language data to find useful external insights and can be further utilized for downstream NLP tasks. Particularly, a summary which highlights the intents and sentiments of the users can be very useful for these organizations to get an external perspective. This external perspective can help organizations to better manage their products, offers, promotional campaigns, etc. However, certain challenges, such as a lack of labeled domain-specific datasets impede further exploration of these tasks in the FinTech domain. To overcome these challenges, we design an unsupervised phrase-based summary generation from social media data, using 'Action-Object' pairs (intent phrases). We evaluated the proposed method with other key-phrase based summary generation methods in the direction of contextual information of various Reddit discussion threads, available in the different summaries. We introduce certain "Context Metrics" such as the number of Unique words, Action-Object pairs, and Noun chunks to evaluate the contextual information retrieved from the source text in these phrase-based summaries. We demonstrate that our methods significantly outperform the baseline on these metrics, thus providing a qualitative and quantitative measure of their efficacy. Proposed framework has been leveraged as a web utility portal hosted within Amex.
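A minimal sketch of extracting 'Action-Object' (verb and direct-object) intent phrases with spaCy, the kind of pairs the summaries above are built from; the full pipeline that scores and assembles phrases into a summary is not reproduced, and the example sentence and output are illustrative:

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def action_object_pairs(text):
    """Extract (verb lemma, direct-object lemma) pairs as intent phrases."""
    doc = nlp(text)
    pairs = []
    for tok in doc:
        if tok.dep_ == "dobj" and tok.head.pos_ == "VERB":
            pairs.append((tok.head.lemma_, tok.lemma_))
    return pairs

post = ("I tried to dispute the charge on my card but the app keeps "
        "rejecting my request, so I cancelled the payment instead.")
print(action_object_pairs(post))
# e.g. [('dispute', 'charge'), ('reject', 'request'), ('cancel', 'payment')]
```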

No Compromise in Solution Quality: Speeding Up Belief-dependent Continuous POMDPs via Adaptive Multilevel Simplification

  • paper_url: http://arxiv.org/abs/2310.10274
  • repo_url: None
  • paper_authors: Andrey Zhitnikov, Ori Sztyglic, Vadim Indelman
  • for: 这篇论文是关于Continuous POMDPs with general belief-dependent rewards的解决方案。
  • methods: 本论文使用了一种名为“adaptive multilevel simplification”的方法,具体来说是在给定的信念树和MCTS的基础上实现POMDP的线上规划。这种方法可以快速加速POMDP的规划,而不会失去解决方案的质量。
  • results: 本论文提出了三种算法来加速Continuous POMDP的规划,其中两种算法(SITH-BSP和LAZY-SITH-BSP)可以在任何信念树构建方法上使用,第三种算法(SITH-PFT)是一种可以适应任何探索技术的任何时间MCTS方法。所有这些算法都能够返回与未加速的算法相同的优化的动作。此外,本论文还提出了一种新的信息论 reward的代价计算方法,该方法可以轻松计算,并且可以通过需求的精细化来紧张化。
    Abstract Continuous POMDPs with general belief-dependent rewards are notoriously difficult to solve online. In this paper, we present a complete provable theory of adaptive multilevel simplification for the setting of a given externally constructed belief tree and MCTS that constructs the belief tree on the fly using an exploration technique. Our theory allows to accelerate POMDP planning with belief-dependent rewards without any sacrifice in the quality of the obtained solution. We rigorously prove each theoretical claim in the proposed unified theory. Using the general theoretical results, we present three algorithms to accelerate continuous POMDP online planning with belief-dependent rewards. Our two algorithms, SITH-BSP and LAZY-SITH-BSP, can be utilized on top of any method that constructs a belief tree externally. The third algorithm, SITH-PFT, is an anytime MCTS method that permits to plug-in any exploration technique. All our methods are guaranteed to return exactly the same optimal action as their unsimplified equivalents. We replace the costly computation of information-theoretic rewards with novel adaptive upper and lower bounds which we derive in this paper, and are of independent interest. We show that they are easy to calculate and can be tightened by the demand of our algorithms. Our approach is general; namely, any bounds that monotonically converge to the reward can be easily plugged-in to achieve significant speedup without any loss in performance. Our theory and algorithms support the challenging setting of continuous states, actions, and observations. The beliefs can be parametric or general and represented by weighted particles. We demonstrate in simulation a significant speedup in planning compared to baseline approaches with guaranteed identical performance.
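A toy sketch of the bound-based simplification idea: replace the expensive belief-dependent reward with cheap lower/upper bounds, tighten them only where needed, and stop once one action provably dominates. The code is a generic branch-and-bound style loop under assumed monotonically tightening bounds, not the paper's algorithms:

```python
def select_action(actions, lower, upper, tighten):
    """Return the best action using only bound queries.

    lower/upper: dicts mapping action -> current lower/upper bound on its value.
    tighten(a): refines action a's bounds by one level and returns (lo, hi);
    bounds are assumed to tighten monotonically toward the true value.
    Illustrative pruning loop only; the paper proves its simplification returns
    exactly the same action as the unsimplified planner.
    """
    while True:
        best = max(actions, key=lambda a: lower[a])
        rivals = [a for a in actions if a != best and upper[a] > lower[best]]
        if not rivals:                         # best dominates every other action
            return best
        # refine whichever overlapping candidate currently has the loosest gap
        cand = max(rivals + [best], key=lambda a: upper[a] - lower[a])
        lower[cand], upper[cand] = tighten(cand)

# Tiny usage example with analytic bounds shrinking around hidden true values.
true_val = {"a": 1.0, "b": 0.8, "c": 0.2}
gap = {k: 1.0 for k in true_val}
lo = {k: v - gap[k] for k, v in true_val.items()}
hi = {k: v + gap[k] for k, v in true_val.items()}

def tighten(a):
    gap[a] /= 2.0
    return true_val[a] - gap[a], true_val[a] + gap[a]

print(select_action(list(true_val), lo, hi, tighten))   # -> "a"
```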

Rethinking Financial Service Promotion With Hybrid Recommender Systems at PicPay

  • paper_url: http://arxiv.org/abs/2310.10268
  • repo_url: None
  • paper_authors: Gabriel Mendonça, Matheus Santos, André Gonçalves, Yan Almeida
  • for: 这个研究是为了提高PicPay的金融服务推荐效果。
  • methods: 这个研究使用了两种推荐算法,Switching Hybrid Recommender System,以提高item推荐的效果。
  • results: 我们的A/B测试显示, Switching Hybrid Recommender System可以提高推荐效果,比默认推荐策略提高3.2%。
    Abstract The fintech PicPay offers a wide range of financial services to its 30 million monthly active users, with more than 50 thousand items recommended in the PicPay mobile app. In this scenario, promoting specific items that are strategic to the company can be very challenging. In this work, we present a Switching Hybrid Recommender System that combines two algorithms to effectively promote items without negatively impacting the user's experience. The results of our A/B tests show an uplift of up to 3.2\% when compared to a default recommendation strategy.
    摘要 金融科技公司 PicPay 为其每月3000万活跃用户提供广泛的金融服务,PicPay 移动应用中推荐的商品超过5万件。在这种情况下,推广对公司具有战略意义的特定商品极具挑战性。在这项工作中,我们提出了一种切换式混合推荐系统,将两种算法结合使用,在不损害用户体验的前提下有效地推广商品。A/B 测试结果显示,与默认推荐策略相比,该方法带来了最高3.2%的提升。
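A minimal sketch of a switching hybrid recommender; the abstract does not describe PicPay's actual switching criterion, so the cold-start rule, class names, and models below are illustrative assumptions:

```python
class SwitchingHybridRecommender:
    """Switch between two recommenders per request (illustrative rule only)."""
    def __init__(self, cf_model, popularity_model, min_interactions=5):
        self.cf = cf_model                    # e.g. a collaborative-filtering model
        self.pop = popularity_model           # e.g. a most-popular fallback
        self.min_interactions = min_interactions

    def recommend(self, user_id, history, k=10):
        if len(history) < self.min_interactions:
            return self.pop.top_k(k)          # cold-start users
        return self.cf.top_k(user_id, k)      # users with enough signal

class PopularityModel:
    def __init__(self, item_counts):
        self.ranked = sorted(item_counts, key=item_counts.get, reverse=True)
    def top_k(self, k):
        return self.ranked[:k]

class DummyCF:
    def top_k(self, user_id, k):
        return [f"item_{user_id}_{i}" for i in range(k)]

rec = SwitchingHybridRecommender(DummyCF(), PopularityModel({"a": 9, "b": 7, "c": 3}))
print(rec.recommend(user_id=42, history=[], k=2))                # cold start -> popularity
print(rec.recommend(user_id=42, history=list(range(8)), k=2))    # enough history -> CF
```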

  • paper_url: http://arxiv.org/abs/2310.10260
  • repo_url: None
  • paper_authors: Adel Ammar, Anis Koubaa, Bilel Benjdira, Omar Najar, Serry Sibaee
  • for: 这研究旨在预测阿拉伯语法庭决定,帮助法官做出决策和帮助律师采取更加细化的战略。
  • methods: 本研究使用了当前state-of-the-art的大语言模型进行预测,包括LLaMA-7b、JAIS-13b和GPT3.5-turbo等三种基础模型,并使用了三种训练方法:零例学习、一例学习和特定精度调整。此外,研究还评估了对原始阿拉伯文输入文本的摘要和/或翻译的效果。
  • results: 研究发现,所有 LLaMA 模型变体的表现都很有限,而基于 GPT-3.5 的模型大幅领先其他所有模型,平均得分比专门面向阿拉伯语的 JAIS 模型高出50%。此外,研究还发现,除人工评估外,其余所有指标都不一致、不可靠,难以正确评估大语言模型在法庭判决预测中的性能。
    Abstract In the intricate field of legal studies, the analysis of court decisions is a cornerstone for the effective functioning of the judicial system. The ability to predict court outcomes helps judges during the decision-making process and equips lawyers with invaluable insights, enhancing their strategic approaches to cases. Despite its significance, the domain of Arabic court analysis remains under-explored. This paper pioneers a comprehensive predictive analysis of Arabic court decisions on a dataset of 10,813 commercial court real cases, leveraging the advanced capabilities of the current state-of-the-art large language models. Through a systematic exploration, we evaluate three prevalent foundational models (LLaMA-7b, JAIS-13b, and GPT3.5-turbo) and three training paradigms: zero-shot, one-shot, and tailored fine-tuning. Besides, we assess the benefit of summarizing and/or translating the original Arabic input texts. This leads to a spectrum of 14 model variants, for which we offer a granular performance assessment with a series of different metrics (human assessment, GPT evaluation, ROUGE, and BLEU scores). We show that all variants of LLaMA models yield limited performance, whereas GPT-3.5-based models outperform all other models by a wide margin, surpassing the average score of the dedicated Arabic-centric JAIS model by 50%. Furthermore, we show that all scores except human evaluation are inconsistent and unreliable for assessing the performance of large language models on court decision predictions. This study paves the way for future research, bridging the gap between computational linguistics and Arabic legal analytics.
    摘要 在复杂的法律研究领域中,法庭判决分析是司法系统有效运转的基石。预测法庭判决结果可以在决策过程中帮助法官,并为律师提供宝贵的洞察,优化其案件策略。然而,阿拉伯语法庭分析领域仍然缺乏足够的探索。本文率先利用当今最先进的大语言模型,对包含10813起商业法庭真实案件的数据集开展了全面的阿拉伯语法庭判决预测分析。通过系统性的探索,我们评估了三种主流基础模型(LLaMA-7b、JAIS-13b和GPT3.5-turbo)和三种训练范式(零样本、单样本和定制微调)。此外,我们还评估了对原始阿拉伯语输入文本进行摘要和/或翻译所带来的收益。由此形成了14种模型变体,我们使用多种指标(人工评估、GPT评估、ROUGE和BLEU分数)对其进行了细致的性能评估。结果显示,所有LLaMA模型变体的表现都很有限,而基于GPT-3.5的模型大幅领先其他所有模型,平均得分比专门面向阿拉伯语的JAIS模型高出50%。此外,我们发现除人工评估外,其余所有分数都不一致、不可靠,不适合用于评估大语言模型在法庭判决预测中的表现。本研究为后续工作铺平了道路,在计算语言学与阿拉伯语法律分析之间架起了桥梁。

SGOOD: Substructure-enhanced Graph-Level Out-of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2310.10237
  • repo_url: None
  • paper_authors: Zhihao Ding, Jieming Shi
  • for: 本研究旨在提升图级分布外(OOD)检测性能,即在测试图可能来自训练时未知分布的情况下,判断其属于分布内还是分布外。
  • methods: 本研究提出了一种子结构增强的图级OOD检测方法,包括为每个图构建子结构超图、设计作用于原始图和超图的两级图编码流程,以及开发三种保持子结构的图增强技术以提升表达能力。
  • results: 在大量图数据集上与10种竞争方法进行了广泛实验,SGOOD 通常以显著优势超过现有方法。
    Abstract Graph-level representation learning is important in a wide range of applications. However, existing graph-level models are generally built on i.i.d. assumption for both training and testing graphs, which is not realistic in an open world, where models can encounter out-of-distribution (OOD) testing graphs that are from different distributions unknown during training. A trustworthy model should not only produce accurate predictions for in-distribution (ID) data, but also detect OOD graphs to avoid unreliable prediction. In this paper, we present SGOOD, a novel graph-level OOD detection framework. We find that substructure differences commonly exist between ID and OOD graphs. Hence, SGOOD explicitly utilizes substructures to learn powerful representations to achieve superior performance. Specifically, we build a super graph of substructures for every graph, and design a two-level graph encoding pipeline that works on both original graphs and super graphs to obtain substructure-enhanced graph representations. To further distinguish ID and OOD graphs, we develop three graph augmentation techniques that preserve substructures and increase expressiveness. Extensive experiments against 10 competitors on numerous graph datasets demonstrate the superiority of SGOOD, often surpassing existing methods by a significant margin. The code is available at https://anonymous.4open.science/r/SGOOD-0958.
    摘要 图级表示学习在各种应用中都非常重要。然而,现有的图级模型通常建立在训练图与测试图同分布(i.i.d.)的假设之上,这在开放世界中并不现实:模型可能遇到来自训练时未知分布的分布外(OOD)测试图。一个可信的模型不仅需要对分布内(ID)数据给出准确预测,还需要检测出 OOD 图以避免不可靠的预测。在这篇论文中,我们提出了 SGOOD,一种新的图级 OOD 检测框架。我们发现 ID 图与 OOD 图之间普遍存在子结构差异,因此 SGOOD 显式利用子结构来学习强大的表示。具体来说,我们为每个图构建子结构的超图,并设计了同时作用于原始图和超图的两级图编码流程,以获得子结构增强的图表示。为了进一步区分 ID 图与 OOD 图,我们开发了三种保持子结构并提高表达能力的图增强技术。在大量图数据集上与10种竞争方法的实验表明,SGOOD 通常以显著优势超过现有方法。代码可在 https://anonymous.4open.science/r/SGOOD-0958 获取。
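A small sketch of the "super graph of substructures" construction with networkx: partition each graph into substructures (modularity communities are used here only as a stand-in for the paper's substructure detector) and connect two substructures whenever an original edge crosses them:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def build_super_graph(G):
    """Return (substructures, super graph) where each super-node is a substructure."""
    comms = list(greedy_modularity_communities(G))       # stand-in substructure detector
    node2comm = {v: i for i, c in enumerate(comms) for v in c}
    S = nx.Graph()
    S.add_nodes_from(range(len(comms)))
    for u, v in G.edges():
        cu, cv = node2comm[u], node2comm[v]
        if cu != cv:
            S.add_edge(cu, cv)        # substructures linked by an edge crossing them
    return comms, S

G = nx.karate_club_graph()
comms, S = build_super_graph(G)
print(len(comms), S.number_of_nodes(), S.number_of_edges())
```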

Using Global Land Cover Product as Prompt for Cropland Mapping via Visual Foundation Model

  • paper_url: http://arxiv.org/abs/2310.10219
  • repo_url: None
  • paper_authors: Chao Tao, Aoran Hu, Rong Xiao, Haifeng Li, Yuze Wang
  • for: 本研究旨在解决受不同场景Attribute和取景条件影响的 cropland 映射问题,通过提出 “Pretrain+Prompting” 方法,以便在模型理解过程中简化领域适应。
  • methods: 本研究使用了可访问的全球土地覆盖产品,设计了自动提示(APT)方法,通过在模型推理过程中引入各个示例的个性提示,实现了细化的领域适应过程。
  • results: 根据两个SUB-METER cropland 数据集的实验结果,提出的 “Pretrain+Prompting” 方法在Remote sensing 领域的 cropland 映射问题中表现了比传统supervised learning和精度调整方法更高的性能。
    Abstract Data-driven deep learning methods have shown great potential in cropland mapping. However, due to multiple factors such as attributes of cropland (topography, climate, crop type) and imaging conditions (viewing angle, illumination, scale), croplands under different scenes demonstrate a great domain gap. This makes it difficult for models trained in the specific scenes to directly generalize to other scenes. A common way to handle this problem is through the "Pretrain+Fine-tuning" paradigm. Unfortunately, considering the variety of features of cropland that are affected by multiple factors, it is hardly to handle the complex domain gap between pre-trained data and target data using only sparse fine-tuned samples as general constraints. Moreover, as the number of model parameters grows, fine-tuning is no longer an easy and low-cost task. With the emergence of prompt learning via visual foundation models, the "Pretrain+Prompting" paradigm redesigns the optimization target by introducing individual prompts for each single sample. This simplifies the domain adaption from generic to specific scenes during model reasoning processes. Therefore, we introduce the "Pretrain+Prompting" paradigm to interpreting cropland scenes and design the auto-prompting (APT) method based on freely available global land cover product. It can achieve a fine-grained adaptation process from generic scenes to specialized cropland scenes without introducing additional label costs. To our best knowledge, this work pioneers the exploration of the domain adaption problems for cropland mapping under prompt learning perspectives. Our experiments using two sub-meter cropland datasets from southern and northern China demonstrated that the proposed method via visual foundation models outperforms traditional supervised learning and fine-tuning approaches in the field of remote sensing.
    摘要 “数据驱动深度学习方法在耕地地图中表现出了很大的潜力。然而,由于耕地特性(地形、气候、作物种)以及捕获条件(观察角度、照明、比例)的多种因素,耕地不同场景之间存在巨大的领域差异。这使得使用特定场景的训练数据直接适应其他场景变得困难。通常,使用“Pretrain+Fine-tuning”模式来解决这个问题。然而,考虑到耕地特性的多种影响,使用只有稀疏的精度适应样本作为通用约束是不充分的。此外,随着模型参数的增加,精度适应变得不是易于进行的低成本任务。随着视觉基础模型的出现,“Pretrain+Prompting”模式可以重新设定优化目标,通过引入每个样本的个性提示来简化领域适应。因此,我们提出了基于自由可用的全球土地覆盖产品的自动提示(APT)方法,可以实现不受预先标注的场景适应过程。我们的实验使用南方和北方中国的两个半米耕地数据集证明了,与传统的超级学习和精度适应方法相比,我们的方法在远程感知领域中表现出了更好的性能。”

Large Models for Time Series and Spatio-Temporal Data: A Survey and Outlook

  • paper_url: http://arxiv.org/abs/2310.10196
  • repo_url: https://github.com/qingsongedu/awesome-timeseries-spatiotemporal-lm-llm
  • paper_authors: Ming Jin, Qingsong Wen, Yuxuan Liang, Chaoli Zhang, Siqiao Xue, Xue Wang, James Zhang, Yi Wang, Haifeng Chen, Xiaoli Li, Shirui Pan, Vincent S. Tseng, Yu Zheng, Lei Chen, Hui Xiong
  • for: 本研究主要是为了对时间序列和空间时间数据进行分析和挖掘,以便更好地利用这些数据中含的丰富信息,并为各种应用领域提供支持。
  • methods: 本研究使用大量语言和其他基础模型,对时间序列和空间时间数据进行分析和挖掘,并提供了一个完整的和最新的评论,涵盖四个关键方面:数据类型、模型类别、模型范围和应用领域/任务。
  • results: 本研究提供了一个完整的和最新的评论,涵盖大型模型在时间序列和空间时间数据分析中的应用和发展,并提供了丰富的资源,包括数据集、模型资产和有用工具,以便开发应用和进行进一步的研究。
    Abstract Temporal data, notably time series and spatio-temporal data, are prevalent in real-world applications. They capture dynamic system measurements and are produced in vast quantities by both physical and virtual sensors. Analyzing these data types is vital to harnessing the rich information they encompass and thus benefits a wide range of downstream tasks. Recent advances in large language and other foundational models have spurred increased use of these models in time series and spatio-temporal data mining. Such methodologies not only enable enhanced pattern recognition and reasoning across diverse domains but also lay the groundwork for artificial general intelligence capable of comprehending and processing common temporal data. In this survey, we offer a comprehensive and up-to-date review of large models tailored (or adapted) for time series and spatio-temporal data, spanning four key facets: data types, model categories, model scopes, and application areas/tasks. Our objective is to equip practitioners with the knowledge to develop applications and further research in this underexplored domain. We primarily categorize the existing literature into two major clusters: large models for time series analysis (LM4TS) and spatio-temporal data mining (LM4STD). On this basis, we further classify research based on model scopes (i.e., general vs. domain-specific) and application areas/tasks. We also provide a comprehensive collection of pertinent resources, including datasets, model assets, and useful tools, categorized by mainstream applications. This survey coalesces the latest strides in large model-centric research on time series and spatio-temporal data, underscoring the solid foundations, current advances, practical applications, abundant resources, and future research opportunities.
    摘要 现代数据中,时序数据和空间时序数据具有广泛的应用,它们捕捉了动态系统的测量结果,并由物理和虚拟感知器生成了庞大量数据。分析这些数据类型是利用它们含义的关键,因此在多种下游任务中具有重要意义。最新的大语言和其他基础模型的发展,使得这些模型在时序数据和空间时序数据挖掘中得到广泛的应用。这些方法不仅可以在多种领域中提高模式识别和理解,而且为人工通用智能做好了准备。在本综述中,我们提供了一个完整和最新的时序数据和空间时序数据大模型综述,涵盖四个关键方面:数据类型、模型类别、模型范围和应用领域/任务。我们的目标是为实践者提供开发应用和进一步研究的知识。我们将现有文献分为两个主要群组:时序数据分析大模型(LM4TS)和空间时序数据挖掘大模型(LM4STD)。基于这两个群组,我们进一步分类研究根据模型范围(一般 vs.域特定)和应用领域/任务。此外,我们还提供了一份完整的相关资源,包括数据集、模型资产和有用工具,按照主流应用分类。这篇综述汇集了最新的大模型中心研究的进展,强调了它们的基础、当前进展、实际应用、资源储备和未来研究机遇。

Battle of the Large Language Models: Dolly vs LLaMA vs Vicuna vs Guanaco vs Bard vs ChatGPT – A Text-to-SQL Parsing Comparison

  • paper_url: http://arxiv.org/abs/2310.10190
  • repo_url: None
  • paper_authors: Shuo Sun, Yuchen Zhang, Jiahuan Yan, Yuze Gao, Donovan Ong, Bin Chen, Jian Su
  • for: 评估大型自然语言模型(LLM)在文本转换SQL解析方面的表现,以帮助研究人员更好地了解这些模型的实际性能。
  • methods: 对六种流行的大型自然语言模型进行系统性的评估,使用九个benchmark数据集和五种提示策略进行测试,包括零shot和几shot情况。
  • results: 发现开源模型在文本转换SQL解析方面的性能落后于关闭源模型如GPT-3.5,表明需要进一步的研究以减少这些模型之间的性能差距。
    Abstract The success of ChatGPT has ignited an AI race, with researchers striving to develop new large language models (LLMs) that can match or surpass the language understanding and generation abilities of commercial ones. In recent times, a number of models have emerged, claiming performance near that of GPT-3.5 or GPT-4 through various instruction-tuning methods. As practitioners of Text-to-SQL parsing, we are grateful for their valuable contributions to open-source research. However, it is important to approach these claims with a sense of scrutiny and ascertain the actual effectiveness of these models. Therefore, we pit six popular large language models against each other, systematically evaluating their Text-to-SQL parsing capability on nine benchmark datasets with five different prompting strategies, covering both zero-shot and few-shot scenarios. Regrettably, the open-sourced models fell significantly short of the performance achieved by closed-source models like GPT-3.5, highlighting the need for further work to bridge the performance gap between these models.
    摘要 ChatGPT的成功引发了一场AI竞赛,研究人员竞相开发新的大型语言模型(LLM),以期达到或超越商业模型的语言理解和生成能力。最近,一些模型宣称通过各种指令微调方法达到了接近GPT-3.5或GPT-4的性能。作为文本转SQL解析的实践者,我们感谢它们对开源研究的宝贵贡献。然而,我们应当以审慎的态度看待这些声明,并核实这些模型的实际效果。因此,我们让六种流行的大型语言模型相互对比,在九个基准数据集上使用五种不同的提示策略(涵盖零样本和少样本场景),系统地评估它们的文本转SQL解析能力。遗憾的是,开源模型的表现明显落后于GPT-3.5等闭源模型,这凸显了进一步缩小两者性能差距的必要性。
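A minimal sketch of the kind of zero-shot prompt such Text-to-SQL comparisons rely on; the template and the `call_llm` placeholder are illustrative assumptions, not the paper's exact prompting strategies:

```python
def build_text2sql_prompt(schema_ddl, question):
    """Minimal zero-shot Text-to-SQL prompt (illustrative; not the paper's template)."""
    return (
        "You are a Text-to-SQL parser.\n"
        "Given the database schema, write one SQLite query that answers the question.\n\n"
        f"Schema:\n{schema_ddl}\n\n"
        f"Question: {question}\n"
        "SQL:"
    )

schema = (
    "CREATE TABLE singer (singer_id INT, name TEXT, country TEXT, age INT);\n"
    "CREATE TABLE concert (concert_id INT, singer_id INT, year INT);"
)
prompt = build_text2sql_prompt(schema, "How many singers are from France?")
print(prompt)
# sql = call_llm(prompt)   # hypothetical call to whichever of the six LLMs is evaluated
```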

Continual Generalized Intent Discovery: Marching Towards Dynamic and Open-world Intent Recognition

  • paper_url: http://arxiv.org/abs/2310.10184
  • repo_url: https://github.com/songxiaoshuai/CGID
  • paper_authors: Xiaoshuai Song, Yutao Mou, Keqing He, Yueyan Qiu, Pei Wang, Weiran Xu
  • for: 这篇论文目标是解决在不同数据流中进行动态意图发现,以及在开放世界中实现动态意图识别。
  • methods: 该论文提出了一种新的任务 named Continual Generalized Intent Discovery (CGID), 它通过不断地从动态OOD数据流中发现新意图,然后逐渐将其添加到分类器中,几乎不需要之前的数据。
  • results: 论文提出了一种名为Prototype-guided Learning with Replay and Distillation (PLRD)的方法,可以实现CGID任务。该方法通过类prototype进行启动新意图发现,并通过数据重播和特征储存来保持新和旧意图的平衡。
    Abstract In a practical dialogue system, users may input out-of-domain (OOD) queries. The Generalized Intent Discovery (GID) task aims to discover OOD intents from OOD queries and extend them to the in-domain (IND) classifier. However, GID only considers one stage of OOD learning, and needs to utilize the data in all previous stages for joint training, which limits its wide application in reality. In this paper, we introduce a new task, Continual Generalized Intent Discovery (CGID), which aims to continuously and automatically discover OOD intents from dynamic OOD data streams and then incrementally add them to the classifier with almost no previous data, thus moving towards dynamic intent recognition in an open world. Next, we propose a method called Prototype-guided Learning with Replay and Distillation (PLRD) for CGID, which bootstraps new intent discovery through class prototypes and balances new and old intents through data replay and feature distillation. Finally, we conduct detailed experiments and analysis to verify the effectiveness of PLRD and understand the key challenges of CGID for future research.
    摘要 在实际对话系统中,用户可能输入域外 (OOD) 查询。广义意图发现 (GID) 任务旨在从 OOD 查询中发现新意图,并将其扩展到域内 (IND) 分类器中。但 GID 只考虑了单个阶段的 OOD 学习,且需要利用之前所有阶段的数据进行联合训练,这限制了它在现实中的广泛应用。在本文中,我们提出了一个新任务:持续广义意图发现 (CGID),其目标是从动态 OOD 数据流中持续、自动地发现新意图,并在几乎不依赖历史数据的情况下将其逐步加入分类器,从而迈向开放世界中的动态意图识别。接着,我们提出了基于原型引导的重放与蒸馏学习方法 (PLRD):通过类原型引导新意图发现,并通过数据重放和特征蒸馏来平衡新旧意图。最后,我们进行了详细的实验与分析,验证 PLRD 的有效性,并总结 CGID 面临的关键挑战,为未来研究提供参考。

Large Language Models Meet Open-World Intent Discovery and Recognition: An Evaluation of ChatGPT

  • paper_url: http://arxiv.org/abs/2310.10176
  • repo_url: https://github.com/songxiaoshuai/OOD-Evaluation
  • paper_authors: Xiaoshuai Song, Keqing He, Pei Wang, Guanting Dong, Yutao Mou, Jingang Wang, Yunsen Xian, Xunliang Cai, Weiran Xu
  • for: 本研究旨在评估ChatGPT在域外意图(OOD)发现和广义意图发现(GID)任务中的能力。
  • methods: 本研究使用ChatGPT执行OOD意图发现和GID任务,并对其表现进行评估。
  • results: ChatGPT在零shot设定下表现出了一致的优势,但与精心定制的模型相比,其仍然处于劣势。通过一系列的分析实验,本研究揭示了LLMs在扩展OOD意权时所面临的挑战,并提供了未来研究的指导。
    Abstract The tasks of out-of-domain (OOD) intent discovery and generalized intent discovery (GID) aim to extend a closed intent classifier to open-world intent sets, which is crucial to task-oriented dialogue (TOD) systems. Previous methods address them by fine-tuning discriminative models. Recently, although some studies have been exploring the application of large language models (LLMs) represented by ChatGPT to various downstream tasks, it is still unclear for the ability of ChatGPT to discover and incrementally extent OOD intents. In this paper, we comprehensively evaluate ChatGPT on OOD intent discovery and GID, and then outline the strengths and weaknesses of ChatGPT. Overall, ChatGPT exhibits consistent advantages under zero-shot settings, but is still at a disadvantage compared to fine-tuned models. More deeply, through a series of analytical experiments, we summarize and discuss the challenges faced by LLMs including clustering, domain-specific understanding, and cross-domain in-context learning scenarios. Finally, we provide empirical guidance for future directions to address these challenges.

Analyzing An After-Sales Service Process Using Object-Centric Process Mining: A Case Study

  • paper_url: http://arxiv.org/abs/2310.10174
  • repo_url: None
  • paper_authors: Gyunam Park, Sevde Aydin, Cuneyt Ugur, Wil M. P. van der Aalst
  • for: 本研究旨在探讨对象中心过程挖掘技术的应用,以帮助优化实际运营场景中的业务流程。
  • methods: 本研究使用对象中心过程挖掘技术,通过对约65,000个事件的分析,揭示了该技术在实际运营场景中的应用前景和优势。
  • results: 研究发现,对象中心过程挖掘技术能够更好地捕捉实际运营场景中相互交织的业务流程细节,提供更全面、更深入的业务流程洞察,帮助企业实现更好的运营改进。
    Abstract Process mining, a technique turning event data into business process insights, has traditionally operated on the assumption that each event corresponds to a singular case or object. However, many real-world processes are intertwined with multiple objects, making them object-centric. This paper focuses on the emerging domain of object-centric process mining, highlighting its potential yet underexplored benefits in actual operational scenarios. Through an in-depth case study of Borusan Cat's after-sales service process, this study emphasizes the capability of object-centric process mining to capture entangled business process details. Utilizing an event log of approximately 65,000 events, our analysis underscores the importance of embracing this paradigm for richer business insights and enhanced operational improvements.
    摘要 过程挖掘是一种将事件数据转化为业务流程洞察的技术,传统上假设每个事件只对应单一的案例或对象。然而,现实世界中的许多流程与多个对象交织在一起,因而是以对象为中心的。本文关注新兴的对象中心过程挖掘领域,强调其在实际运营场景中尚未被充分挖掘的潜在价值。通过对 Borusan Cat 售后服务流程的深入案例研究,本研究展示了对象中心过程挖掘捕捉相互交织的业务流程细节的能力。基于约65,000个事件的日志分析,我们的结论强调了采用这一范式的重要性,它能带来更丰富的业务洞察并促进运营改进。

Leveraging Knowledge Distillation for Efficient Deep Reinforcement Learning in Resource-Constrained Environments

  • paper_url: http://arxiv.org/abs/2310.10170
  • repo_url: https://github.com/paopaolin/papercode/tree/main/MENGGUANLIN_papercode/combine%20V1
  • paper_authors: Guanlin Meng
  • for: 这篇论文旨在探索深度强化学习(DRL)与知识蒸馏(KD)的结合,通过蒸馏来减轻深度模型的计算负担,同时保持性能。
  • methods: 这篇论文对多种DRL算法进行了知识蒸馏,并研究了各自的蒸馏效果。
  • results: 研究结果显示,将DRL算法与KD技术结合使用,可以开发出更快速、计算效率更高的DRL模型。
    Abstract This paper aims to explore the potential of combining Deep Reinforcement Learning (DRL) with Knowledge Distillation (KD) by distilling various DRL algorithms and studying their distillation effects. By doing so, the computational burden of deep models could be reduced while maintaining the performance. The primary objective is to provide a benchmark for evaluating the performance of different DRL algorithms that have been refined using KD techniques. By distilling these algorithms, the goal is to develop efficient and fast DRL models. This research is expected to provide valuable insights that can facilitate further advancements in this promising direction. By exploring the combination of DRL and KD, this work aims to promote the development of models that require fewer GPU resources, learn more quickly, and make faster decisions in complex environments. The results of this research have the capacity to significantly advance the field of DRL and pave the way for the future deployment of resource-efficient, decision-making intelligent systems.
    摘要 本研究旨在探索将深度强化学习(DRL)与知识蒸馏(KD)相结合,对多种DRL算法进行蒸馏并研究其蒸馏效果,从而在保持性能的同时减少深度模型的计算负担。研究的主要目标是为经KD技术提炼的不同DRL算法提供一个性能评估基准。通过蒸馏这些算法,我们希望得到高效、快速的DRL模型。这项研究有望为这一富有前景的方向提供有价值的见解,推动其进一步发展。通过探索DRL与KD的结合,本工作旨在促进那些占用更少GPU资源、学习更快、并能在复杂环境中快速决策的模型的发展。研究结果有潜力显著推进DRL领域,并为未来部署资源高效的决策智能系统铺平道路。
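A minimal sketch of a standard policy-distillation objective for discrete-action DRL, a temperature-scaled KL divergence between teacher and student action distributions; the report's exact distillation loss may differ:

```python
import torch
import torch.nn.functional as F

def policy_distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) over action distributions, softened by temperature T.

    A common knowledge-distillation objective for discrete-action DRL policies.
    """
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

student = torch.randn(32, 6, requires_grad=True)   # batch of 32 states, 6 actions
teacher = torch.randn(32, 6)                       # frozen teacher's action logits
loss = policy_distillation_loss(student, teacher)
loss.backward()
print(float(loss))
```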

DemoNSF: A Multi-task Demonstration-based Generative Framework for Noisy Slot Filling Task

  • paper_url: http://arxiv.org/abs/2310.10169
  • repo_url: https://github.com/dongguanting/Demo-NSF
  • paper_authors: Guanting Dong, Tingfeng Hui, Zhuoma GongQue, Jinxu Zhao, Daichi Guo, Gang Zhao, Keqing He, Weiran Xu
  • for: 提高生成框架在实际对话场景中的泛化能力,解决它们在输入干扰时的通用化问题。
  • methods: 提出了多任务示范生成框架,名为DemoNSF,并引入了三种含有干扰信息的辅助任务,即干扰恢复(NR)、随机覆盖(RM)和混合识别(HD),以便在不同粒度上捕捉输入干扰的Semantic结构信息。
  • results: 在两个标准 benchmark 上, DemoNSF 比所有基eline方法表现出色,并实现了强大的泛化性。进一步的分析提供了生成框架在实践中的指导。
    Abstract Recently, prompt-based generative frameworks have shown impressive capabilities in sequence labeling tasks. However, in practical dialogue scenarios, relying solely on simplistic templates and traditional corpora presents a challenge for these methods in generalizing to unknown input perturbations. To address this gap, we propose a multi-task demonstration based generative framework for noisy slot filling, named DemoNSF. Specifically, we introduce three noisy auxiliary tasks, namely noisy recovery (NR), random mask (RM), and hybrid discrimination (HD), to implicitly capture semantic structural information of input perturbations at different granularities. In the downstream main task, we design a noisy demonstration construction strategy for the generative framework, which explicitly incorporates task-specific information and perturbed distribution during training and inference. Experiments on two benchmarks demonstrate that DemoNSF outperforms all baseline methods and achieves strong generalization. Further analysis provides empirical guidance for the practical application of generative frameworks. Our code is released at https://github.com/dongguanting/Demo-NSF.
    摘要 最近,基于提示的生成式框架在序列标注任务中展现了令人印象深刻的能力。然而,在实际对话场景中,仅依赖简单的模板和传统语料,使这些方法难以泛化到未知的输入扰动。为弥补这一差距,我们提出了一种面向噪声槽填充任务的多任务示范式生成框架,名为 DemoNSF。具体来说,我们引入了三种噪声辅助任务,即噪声恢复(NR)、随机掩码(RM)和混合判别(HD),以在不同粒度上隐式捕捉输入扰动的语义结构信息。在下游主任务中,我们为生成框架设计了一种噪声示范构造策略,在训练和推理过程中显式融入任务特定信息和扰动分布。在两个基准数据集上的实验表明,DemoNSF 优于所有基线方法,并具有强大的泛化能力。进一步的分析为生成式框架的实际应用提供了经验指导。我们的代码发布在 https://github.com/dongguanting/Demo-NSF。
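A small sketch of input perturbations in the spirit of the noisy auxiliary tasks above (random word masking for RM, character-level noise that NR would learn to undo); the corruption probabilities and rules are illustrative assumptions, not the released implementation:

```python
import random

def random_mask(tokens, mask_token="[MASK]", p=0.15, rng=random):
    """Randomly mask whole words (in the spirit of the RM auxiliary task)."""
    return [mask_token if rng.random() < p else t for t in tokens]

def char_noise(word, p=0.1, rng=random):
    """Swap or drop characters to mimic ASR/typo noise (NR recovers the original)."""
    chars = list(word)
    if len(chars) > 3 and rng.random() < p:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]      # swap neighbours
    if len(chars) > 3 and rng.random() < p:
        del chars[rng.randrange(len(chars))]                  # drop a character
    return "".join(chars)

utterance = "book a flight from beijing to shanghai tomorrow".split()
noisy = [char_noise(w, p=0.3) for w in random_mask(utterance, p=0.2)]
print(" ".join(noisy))
```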

Deep Learning Algorithm for Advanced Level-3 Inverse-Modeling of Silicon-Carbide Power MOSFET Devices

  • paper_url: http://arxiv.org/abs/2310.17657
  • repo_url: None
  • paper_authors: Massimo Orazio Spata, Sebastiano Battiato, Alessandro Ortis, Francesco Rundo, Michele Calabretta, Carmelo Pino, Angelo Messina
  • for: 这篇论文提出了一种深度学习方法,用于提取碳化硅功率MOSFET(SiC Power MOS)的物理参数。
  • methods: 该方法使用深度学习算法训练模型,从器件的静态特性预测其物理参数。
  • results: 实验结果表明,该方法可以有效地重构SiC Power MOS的物理参数,包括沟道长度。
    Abstract Inverse modelling with deep learning algorithms involves training deep architecture to predict device's parameters from its static behaviour. Inverse device modelling is suitable to reconstruct drifted physical parameters of devices temporally degraded or to retrieve physical configuration. There are many variables that can influence the performance of an inverse modelling method. In this work the authors propose a deep learning method trained for retrieving physical parameters of Level-3 model of Power Silicon-Carbide MOSFET (SiC Power MOS). The SiC devices are used in applications where classical silicon devices failed due to high-temperature or high switching capability. The key application of SiC power devices is in the automotive field (i.e. in the field of electrical vehicles). Due to physiological degradation or high-stressing environment, SiC Power MOS shows a significant drift of physical parameters which can be monitored by using inverse modelling. The aim of this work is to provide a possible deep learning-based solution for retrieving physical parameters of the SiC Power MOSFET. Preliminary results based on the retrieving of channel length of the device are reported. Channel length of power MOSFET is a key parameter involved in the static and dynamic behaviour of the device. The experimental results reported in this work confirmed the effectiveness of a multi-layer perceptron designed to retrieve this parameter.
    摘要 基于深度学习的逆向建模是指训练深度网络,从器件的静态行为预测其参数。逆向器件建模适用于重构因时间退化而漂移的物理参数,或恢复器件的物理配置。影响逆向建模方法性能的因素很多。在这项工作中,作者提出了一种深度学习方法,用于获取碳化硅功率MOSFET(SiC Power MOS)Level-3模型的物理参数。SiC器件用于传统硅器件因高温或高开关要求而失效的场景,其关键应用领域是汽车行业(例如电动汽车)。由于自然退化或高应力环境,SiC Power MOS的物理参数会发生显著漂移,而这种漂移可以通过逆向建模加以监测。本工作的目标是提供一种可行的、基于深度学习的方案,用于提取SiC Power MOSFET的物理参数。文中报告了围绕器件沟道长度提取的初步结果;沟道长度是决定器件静态与动态行为的关键参数之一。实验结果证实,为此设计的多层感知机能够有效地重构该参数。
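A minimal sketch of the inverse-modelling setup: a small multilayer perceptron regressing a device parameter (here the channel length) from a static characteristic, trained on synthetic placeholder data; the layer sizes and data are assumptions, not the paper's architecture or measurements:

```python
import torch
import torch.nn as nn

# Synthetic stand-in data: a 64-point static I-V curve -> normalized channel length.
# Real training would use measured or simulated SiC Power MOSFET characteristics.
X = torch.rand(512, 64)
y = torch.rand(512, 1)

model = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),            # predicted channel length
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(50):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()
print(f"final MSE: {loss.item():.4f}")
```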

Character-LLM: A Trainable Agent for Role-Playing

  • paper_url: http://arxiv.org/abs/2310.10158
  • repo_url: https://github.com/choosewhatulike/trainable-agents
  • paper_authors: Yunfan Shao, Linyang Li, Junqi Dai, Xipeng Qiu
  • for: 研究是使用大型自然语言模型(LLM)作为代理人类模拟人类行为的能力。
  • methods: 我们提出了一种方法,即编辑人物profile和经验,以允许LLM模型成为特定人物。
  • results: 我们在测试场景中训练了agent并评估其能否记忆和表现出人物的特点和经验。实验结果表明了有趣的观察,有助于建立未来的人类模拟。
    Abstract Large language models (LLMs) can be used to serve as agents to simulate human behaviors, given the powerful ability to understand human instructions and provide high-quality generated texts. Such ability stimulates us to wonder whether LLMs can simulate a person in a higher form than simple human behaviors. Therefore, we aim to train an agent with the profile, experience, and emotional states of a specific person instead of using limited prompts to instruct ChatGPT API. In this work, we introduce Character-LLM that teach LLMs to act as specific people such as Beethoven, Queen Cleopatra, Julius Caesar, etc. Our method focuses on editing profiles as experiences of a certain character and training models to be personal simulacra with these experiences. To assess the effectiveness of our approach, we build a test playground that interviews trained agents and evaluates whether the agents \textit{memorize} their characters and experiences. Experimental results show interesting observations that help build future simulacra of humankind.
    摘要 大型语言模型(LLM)具有理解人类指令并生成高质量文本的强大能力,可以作为代理模拟人类行为。这种能力促使我们思考:LLM 能否以比简单人类行为更高的层次模拟一个具体的人?因此,我们希望将特定人物的个人档案、经历和情感状态注入代理,而不是仅依赖有限的提示去指挥 ChatGPT API。在这项研究中,我们提出了 Character-LLM,教导 LLM 扮演贝多芬、克利奥帕特拉女王、凯撒等特定人物。我们的方法是将某一角色的档案整理为其经历,并训练模型成为带有这些经历的个人模拟体。为评估方法的有效性,我们建立了一个测试场景,对训练好的代理进行访谈,评估其是否记得自己的角色与经历。实验结果给出了有趣的观察,有助于构建未来的人类模拟体。

Adaptive Workload Distribution for Accuracy-aware DNN Inference on Collaborative Edge Platforms

  • paper_url: http://arxiv.org/abs/2310.10157
  • repo_url: None
  • paper_authors: Zain Taufique, Antonio Miele, Pasi Liljeberg, Anil Kanduri
  • for: 加速深度学习模型(DNN)的推理过程,通过分布工作负载到一群协作的边缘节点。
  • methods: 提议适应性工作负载分布策略,同时考虑边缘设备的差异性和深度学习模型的精度和性能要求。
  • results: 测试了我们的方法在一个包括Odroid XU4、Raspberry Pi4和Jetson Nano板的边缘集群上,与状态艺术工作负载分布策略相比,实现了平均提高41.52%的性能和5.2%的输出精度。
    Abstract DNN inference can be accelerated by distributing the workload among a cluster of collaborative edge nodes. Heterogeneity among edge devices and accuracy-performance trade-offs of DNN models present a complex exploration space while catering to the inference performance requirements. In this work, we propose adaptive workload distribution for DNN inference, jointly considering node-level heterogeneity of edge devices, and application-specific accuracy and performance requirements. Our proposed approach combinatorially optimizes heterogeneity-aware workload partitioning and dynamic accuracy configuration of DNN models to ensure performance and accuracy guarantees. We tested our approach on an edge cluster of Odroid XU4, Raspberry Pi4, and Jetson Nano boards and achieved an average gain of 41.52% in performance and 5.2% in output accuracy as compared to state-of-the-art workload distribution strategies.
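A toy sketch of heterogeneity-aware partitioning: split consecutive DNN layers across edge devices roughly in proportion to their throughput. The paper's approach jointly optimizes partitioning and per-model accuracy configuration; this greedy split and the example numbers are illustrative only:

```python
def partition_layers(layer_costs, device_speeds):
    """Assign consecutive DNN layers to devices, roughly balancing finish time.

    layer_costs: per-layer compute cost (e.g. MFLOPs); device_speeds: relative
    throughput of each edge node. Greedy proportional split, illustrative only.
    """
    total = sum(layer_costs)
    shares = [s / sum(device_speeds) * total for s in device_speeds]
    assignment, dev, acc = [], 0, 0.0
    for cost in layer_costs:
        if acc + cost > shares[dev] and dev < len(device_speeds) - 1:
            dev, acc = dev + 1, 0.0          # move on to the next device
        assignment.append(dev)
        acc += cost
    return assignment

# Example: 10 layers over Odroid XU4, Raspberry Pi 4, Jetson Nano (relative speeds assumed).
layers = [4, 8, 8, 16, 16, 32, 32, 16, 8, 4]
print(partition_layers(layers, device_speeds=[1.0, 0.8, 2.2]))
```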

Theory of Mind for Multi-Agent Collaboration via Large Language Models

  • paper_url: http://arxiv.org/abs/2310.10701
  • repo_url: None
  • paper_authors: Huao Li, Yu Quan Chong, Simon Stepputtis, Joseph Campbell, Dana Hughes, Michael Lewis, Katia Sycara
  • for: 本研究评估基于大型自然语言模型(LLM)的多智能代理人在多智能协作文本游戏中的表现,并与多智能奖励学习(MARL)和规划基eline进行比较。
  • methods: 本研究使用LLM来实现多智能协作,并对其表现进行评估。研究还 explore了使用explicit belief state representation来改善LLM的规划优化和任务状态幻觉问题。
  • results: 研究发现LLM-based agents exhibit emergent collaborative behaviors and high-order Theory of Mind capabilities,但受到长期context管理和任务状态幻觉的限制。使用explicit belief state representation可以提高任务表现和理解能力。
    Abstract While Large Language Models (LLMs) have demonstrated impressive accomplishments in both reasoning and planning, their abilities in multi-agent collaborations remains largely unexplored. This study evaluates LLM-based agents in a multi-agent cooperative text game with Theory of Mind (ToM) inference tasks, comparing their performance with Multi-Agent Reinforcement Learning (MARL) and planning-based baselines. We observed evidence of emergent collaborative behaviors and high-order Theory of Mind capabilities among LLM-based agents. Our results reveal limitations in LLM-based agents' planning optimization due to systematic failures in managing long-horizon contexts and hallucination about the task state. We explore the use of explicit belief state representations to mitigate these issues, finding that it enhances task performance and the accuracy of ToM inferences for LLM-based agents.
    摘要 大型语言模型(LLM)已在推理和规划方面展现出色的能力,但它们在多智能体协作中的能力仍未得到充分探索。本研究在一个带有心智理论(ToM)推理任务的多智能体协作文本游戏中评估了基于LLM的智能体,并与多智能体强化学习(MARL)和基于规划的基线进行比较。我们观察到基于LLM的智能体之间出现了涌现的协作行为和高阶心智理论能力。结果同时显示,由于在长时程上下文管理上的系统性失误以及对任务状态的幻觉,基于LLM的智能体在规划优化方面存在局限。我们探索了使用显式信念状态表示来缓解这些问题,发现它能提升任务表现和基于LLM智能体的心智理论推理准确性。

Recursive Segmentation Living Image: An eXplainable AI (XAI) Approach for Computing Structural Beauty of Images or the Livingness of Space

  • paper_url: http://arxiv.org/abs/2310.10149
  • repo_url: None
  • paper_authors: Yao Qianxiang, Bin Jiang
  • for: 这项研究提出了“结构美”这一客观计算方法,用于评价图像的美感。借助 SAM(Segment Anything Model),我们提出了一种基于递归分割的方法,能够更准确地捕捉图像中更细粒度的子结构。
  • methods: 我们使用 SAM 进行递归分割,并重建层次结构,从而获得对子结构数量及其层级关系更准确的刻画。
  • results: 我们的方法可以准确地分割出图像中的意义 objects,包括树、建筑和窗户等,以及抽象的画作中的子结构。我们的计算结果与人类视觉评价相一致,并且在不同的颜色空间中进行评价时也能够保持一定的一致性。
    Abstract This study introduces the concept of "structural beauty" as an objective computational approach for evaluating the aesthetic appeal of images. Through the utilization of the Segment anything model (SAM), we propose a method that leverages recursive segmentation to extract finer-grained substructures. Additionally, by reconstructing the hierarchical structure, we obtain a more accurate representation of substructure quantity and hierarchy. This approach reproduces and extends our previous research, allowing for the simultaneous assessment of Livingness in full-color images without the need for grayscale conversion or separate computations for foreground and background Livingness. Furthermore, the application of our method to the Scenic or Not dataset, a repository of subjective scenic ratings, demonstrates a high degree of consistency with subjective ratings in the 0-6 score range. This underscores that structural beauty is not solely a subjective perception, but a quantifiable attribute accessible through objective computation. Through our case studies, we have arrived at three significant conclusions. 1) our method demonstrates the capability to accurately segment meaningful objects, including trees, buildings, and windows, as well as abstract substructures within paintings. 2) we observed that the clarity of an image impacts our computational results; clearer images tend to yield higher Livingness scores. However, for equally blurry images, Livingness does not exhibit a significant reduction, aligning with human visual perception. 3) our approach fundamentally differs from methods employing Convolutional Neural Networks (CNNs) for predicting image scores. Our method not only provides computational results but also offers transparency and interpretability, positioning it as a novel avenue in the realm of Explainable AI (XAI).

LoBaSS: Gauging Learnability in Supervised Fine-tuning Data

  • paper_url: http://arxiv.org/abs/2310.13008
  • repo_url: None
  • paper_authors: Haotian Zhou, Tingkai Liu, Qianli Ma, Jianbo Yuan, Pengfei Liu, Yang You, Hongxia Yang
  • for: 本研究的目的是提出一种基于模型学习能力(learnability)的监督微调(SFT)数据选择方法,以便在模型能力和学习效率之间取得良好平衡。
  • methods: 本研究使用的方法是基于损失函数的SFT数据选择方法(LoBaSS),该方法根据模型在预训练阶段所学习的能力来选择合适的SFT数据,以便提高模型的精度和学习效率。
  • results: 实验结果表明,使用LoBaSS方法可以在仅6%的全部训练数据量下,超越全数据 Fine-tuning,并在16.7%的数据量下具有同样的精度和学习效率。这表明LoBaSS方法可以在不同领域中协调模型的能力,以达到优质的精度和学习效率。
    Abstract Supervised Fine-Tuning (SFT) serves as a crucial phase in aligning Large Language Models (LLMs) to specific task prerequisites. The selection of fine-tuning data profoundly influences the model's performance, whose principle is traditionally grounded in data quality and distribution. In this paper, we introduce a new dimension in SFT data selection: learnability. This new dimension is motivated by the intuition that SFT unlocks capabilities acquired by a LLM during the pretraining phase. Given that different pretrained models have disparate capabilities, the SFT data appropriate for one may not suit another. Thus, we introduce the term learnability to define the suitability of data for effective learning by the model. We present the Loss Based SFT Data Selection (LoBaSS) method, utilizing data learnability as the principal criterion for the selection SFT data. This method provides a nuanced approach, allowing the alignment of data selection with inherent model capabilities, ensuring optimal compatibility and learning efficiency. In experimental comparisons involving 7B and 13B models, our LoBaSS method is able to surpass full-data fine-tuning at merely 6% of the total training data. When employing 16.7% of the data, LoBaSS harmonizes the model's capabilities across conversational and mathematical domains, proving its efficacy and adaptability.
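A rough sketch of loss-based data selection, assuming a Hugging Face-style causal LM whose forward pass returns a `.loss`; scoring examples by the pretrained model's loss and keeping a small low-loss budget is one plausible reading of "learnability", not the paper's exact criterion:

```python
import torch

@torch.no_grad()
def select_sft_subset(model, examples, tokenize, budget=0.06):
    """Rank SFT examples by the pretrained model's loss and keep a small budget.

    examples: list of (prompt, response) pairs; tokenize maps a pair to
    (input_ids, labels) tensors with prompt tokens masked out of the labels.
    Assumes a Hugging Face-style causal LM: model(input_ids, labels=...) has .loss.
    Keeping the lowest-loss fraction is an assumed learnability proxy only.
    """
    scores = []
    for prompt, response in examples:
        input_ids, labels = tokenize(prompt, response)
        out = model(input_ids.unsqueeze(0), labels=labels.unsqueeze(0))
        scores.append(out.loss.item())
    k = max(1, int(budget * len(examples)))
    keep = sorted(range(len(examples)), key=lambda i: scores[i])[:k]
    return [examples[i] for i in keep]
```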

CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization

  • paper_url: http://arxiv.org/abs/2310.10134
  • repo_url: None
  • paper_authors: Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, Peter Clark
  • for: 这篇论文旨在开发一种能够随时间不断自我改进、并在多种环境和任务中表现良好的语言智能代理。
  • methods: 论文提出了一种以因果抽象为中心的持久、动态文本记忆,在每次试验后定期更新,使代理逐渐积累对新试验有用的知识。
  • results: 所提出的方法 CLIN 在 ScienceWorld 基准上超越了 Reflexion 等最先进的反思型语言代理(高出 23 个绝对分数点),能够迁移到新环境和新任务,并通过持续的记忆更新不断提升性能。
    Abstract Language agents have shown some ability to interact with an external environment, e.g., a virtual world such as ScienceWorld, to perform complex tasks, e.g., growing a plant, without the startup costs of reinforcement learning. However, despite their zero-shot capabilities, these agents to date do not continually improve over time beyond performance refinement on a specific task. Here we present CLIN, the first language-based agent to achieve this, so that it continually improves over multiple trials, including when both the environment and task are varied, and without requiring parameter updates. Our approach is to use a persistent, dynamic, textual memory centered on causal abstractions (rather than general "helpful hints") that is regularly updated after each trial so that the agent gradually learns useful knowledge for new trials. In the ScienceWorld benchmark, CLIN is able to continually improve on repeated trials on the same task and environment, outperforming state-of-the-art reflective language agents like Reflexion by 23 absolute points. CLIN can also transfer its learning to new environments (or new tasks), improving its zero-shot performance by 4 points (13 for new tasks) and can further improve performance there through continual memory updates, enhancing performance by an additional 17 points (7 for new tasks). This suggests a new architecture for agents built on frozen models that can still continually and rapidly improve over time.
    摘要 语言代理已显示出与外部环境(例如虚拟世界 ScienceWorld)交互并完成复杂任务(例如培育植物)的能力,而无需强化学习的启动成本。然而,尽管具备零样本能力,这些代理迄今还不能随时间持续改进,只能在特定任务上做性能微调。在此,我们提出了 CLIN,这是第一个能做到这一点的语言代理:它能在多次试验中不断改进,即使环境和任务都发生变化,也无需更新参数。我们的方法是使用一个以因果抽象(而非泛泛的“有用提示”)为中心的持久、动态文本记忆,在每次试验后定期更新,使代理逐渐积累对新试验有用的知识。在 ScienceWorld 基准中,CLIN 能在同一任务和环境的重复试验中持续改进,比 Reflexion 等最先进的反思型语言代理高出 23 个绝对分数点。CLIN 还可以把所学迁移到新环境(或新任务),将零样本性能提升 4 个分数点(新任务为 13 个分数点),并可通过持续的记忆更新进一步提升性能,再增加 17 个分数点(新任务为 7 个分数点)。这表明了一种基于冻结模型的新架构,使代理仍能随时间持续且快速地改进。

A Non-monotonic Smooth Activation Function

  • paper_url: http://arxiv.org/abs/2310.10126
  • repo_url: None
  • paper_authors: Koushik Biswas, Meghana Karri, Ulaş Bağcı
  • for: The paper is written for proposing a new activation function called Sqish, which is an alternative to existing activation functions in deep learning models.
  • methods: The paper uses experiments on various tasks such as classification, object detection, segmentation, and adversarial robustness to demonstrate the superiority of the Sqish activation function over existing activation functions such as ReLU.
  • results: The paper shows that the Sqish activation function achieves better performance than ReLU on several benchmark datasets, including CIFAR100, with an improvement of 8.21% in adversarial robustness and 5.87% in image classification.
    Abstract Activation functions are crucial in deep learning models since they introduce non-linearity into the networks, allowing them to learn from errors and make adjustments, which is essential for learning complex patterns. The essential purpose of activation functions is to transform unprocessed input signals into significant output activations, promoting information transmission throughout the neural network. In this study, we propose a new activation function called Sqish, which is a non-monotonic and smooth function and an alternative to existing ones. We showed its superiority in classification, object detection, segmentation tasks, and adversarial robustness experiments. We got an 8.21% improvement over ReLU on the CIFAR100 dataset with the ShuffleNet V2 model in the FGSM adversarial attack. We also got a 5.87% improvement over ReLU on image classification on the CIFAR100 dataset with the ShuffleNet V2 model.
    摘要 激活函数是深度学习模型中的关键组件:它们为网络引入非线性,使其能够从误差中学习并进行调整,这对学习复杂模式至关重要。激活函数的核心作用是把未经处理的输入信号转化为有意义的输出激活,促进信息在神经网络中的传递。在本研究中,我们提出了一种名为 Sqish 的新激活函数,它是一个非单调且平滑的函数,可作为现有激活函数的替代方案。我们在分类、目标检测、分割任务以及对抗鲁棒性实验中展示了它的优越性。在 CIFAR100 数据集上使用 ShuffleNet V2 模型时,Sqish 在 FGSM 对抗攻击下相对 ReLU 提升了 8.21%,在图像分类上相对 ReLU 提升了 5.87%。
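The abstract does not give Sqish's formula, so the sketch below only shows how a smooth, non-monotonic activation (here a Swish-style stand-in, explicitly not Sqish) plugs into a PyTorch model as a drop-in replacement for ReLU:

```python
import torch
import torch.nn as nn

class SmoothNonMonotonic(nn.Module):
    """Placeholder smooth, non-monotonic activation in the Swish/Mish family.

    f(x) = x * sigmoid(beta * x). The abstract does not give Sqish's formula,
    so this is NOT Sqish; it only illustrates how such an activation is used.
    """
    def __init__(self, beta=1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(float(beta)))   # learnable shape parameter

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)

net = nn.Sequential(nn.Linear(32, 64), SmoothNonMonotonic(), nn.Linear(64, 10))
print(net(torch.randn(4, 32)).shape)   # torch.Size([4, 10])
```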

From Continuous Dynamics to Graph Neural Networks: Neural Diffusion and Beyond

  • paper_url: http://arxiv.org/abs/2310.10121
  • repo_url: None
  • paper_authors: Andi Han, Dai Shi, Lequan Lin, Junbin Gao
  • for: Provides a systematic and comprehensive review of graph neural network (GNN) research, focusing on studies that adopt a continuous-dynamics perspective.
  • methods: Covers the message-passing mechanism and the continuous-dynamics formulations proposed to mitigate known GNN issues such as oversmoothing and oversquashing.
  • results: Presents a general framework for designing graph neural dynamics, explains how the limitations of classic GNNs can be addressed under the continuous framework, and identifies multiple open research directions.
    Abstract Graph neural networks (GNNs) have demonstrated significant promise in modelling relational data and have been widely applied in various fields of interest. The key mechanism behind GNNs is the so-called message passing where information is being iteratively aggregated to central nodes from their neighbourhood. Such a scheme has been found to be intrinsically linked to a physical process known as heat diffusion, where the propagation of GNNs naturally corresponds to the evolution of heat density. Analogizing the process of message passing to the heat dynamics allows to fundamentally understand the power and pitfalls of GNNs and consequently informs better model design. Recently, there emerges a plethora of works that proposes GNNs inspired from the continuous dynamics formulation, in an attempt to mitigate the known limitations of GNNs, such as oversmoothing and oversquashing. In this survey, we provide the first systematic and comprehensive review of studies that leverage the continuous perspective of GNNs. To this end, we introduce foundational ingredients for adapting continuous dynamics to GNNs, along with a general framework for the design of graph neural dynamics. We then review and categorize existing works based on their driven mechanisms and underlying dynamics. We also summarize how the limitations of classic GNNs can be addressed under the continuous framework. We conclude by identifying multiple open research directions.
    摘要 图神经网络(GNN)在建模关系数据方面展现出巨大潜力,并被广泛应用于各个领域。GNN 的核心机制是所谓的"消息传递",即信息从邻域节点迭代地聚合到中心节点。这一机制被发现与热扩散这一物理过程有内在联系:GNN 的传播自然对应于热密度的演化。将消息传递过程类比为热动力学,有助于从根本上理解 GNN 的能力与缺陷,从而指导更好的模型设计。近来涌现出大量受连续动力学形式启发的 GNN 工作,试图缓解 GNN 已知的局限性,如过平滑和过压缩。在这篇综述中,我们首次对利用连续视角研究 GNN 的工作进行系统而全面的回顾。为此,我们介绍了将连续动力学适配到 GNN 的基础要素,并给出设计图神经动力学的通用框架。随后,我们按驱动机制和底层动力学对现有工作进行回顾和分类,并总结了经典 GNN 的局限性如何在连续框架下得到解决。最后,我们指出了多个开放的研究方向。
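
As a minimal illustration of the heat-diffusion view of message passing described above, the sketch below runs explicit Euler steps of dX/dt = (Â − I)X on a toy graph; the step size and step count are illustrative.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    A = A + np.eye(A.shape[0])
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A @ D_inv_sqrt

def heat_diffusion(A, X, tau=0.2, num_steps=10):
    """Explicit Euler integration of dX/dt = (A_hat - I) X, i.e. graph heat
    diffusion; each step is the message-passing update X <- X + tau*(A_hat X - X)."""
    A_hat = normalized_adjacency(A)
    for _ in range(num_steps):
        X = X + tau * (A_hat @ X - X)
    return X

# Toy graph: 4 nodes on a path, 2-dimensional node features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.randn(4, 2)
print(heat_diffusion(A, X))
```

Running many steps drives all node features toward similar values, which is exactly the oversmoothing behaviour the continuous view helps analyze.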

On Generative Agents in Recommendation

  • paper_url: http://arxiv.org/abs/2310.10108
  • repo_url: https://github.com/LehengTHU/Agent4Rec
  • paper_authors: An Zhang, Leheng Sheng, Yuxin Chen, Hao Li, Yang Deng, Xiang Wang, Tat-Seng Chua
  • for: Proposes a movie recommendation simulator built on large language models (LLMs) to help address the disconnect between offline metrics and online performance in recommender systems.
  • methods: The simulator uses LLM-empowered generative agents, each equipped with user profile, memory, and action modules specifically tailored for the recommender system.
  • results: Extensive, multi-faceted evaluations of Agent4Rec examine how faithfully LLM-empowered generative agents can simulate the behavior of real, autonomous humans in recommender systems.
    Abstract Recommender systems are the cornerstone of today's information dissemination, yet a disconnect between offline metrics and online performance greatly hinders their development. Addressing this challenge, we envision a recommendation simulator, capitalizing on recent breakthroughs in human-level intelligence exhibited by Large Language Models (LLMs). We propose Agent4Rec, a novel movie recommendation simulator, leveraging LLM-empowered generative agents equipped with user profile, memory, and actions modules specifically tailored for the recommender system. In particular, these agents' profile modules are initialized using the MovieLens dataset, capturing users' unique tastes and social traits; memory modules log both factual and emotional memories and are integrated with an emotion-driven reflection mechanism; action modules support a wide variety of behaviors, spanning both taste-driven and emotion-driven actions. Each agent interacts with personalized movie recommendations in a page-by-page manner, relying on a pre-implemented collaborative filtering-based recommendation algorithm. We delve into both the capabilities and limitations of Agent4Rec, aiming to explore an essential research question: to what extent can LLM-empowered generative agents faithfully simulate the behavior of real, autonomous humans in recommender systems? Extensive and multi-faceted evaluations of Agent4Rec highlight both the alignment and deviation between agents and user-personalized preferences. Beyond mere performance comparison, we explore insightful experiments, such as emulating the filter bubble effect and discovering the underlying causal relationships in recommendation tasks. Our codes are available at https://github.com/LehengTHU/Agent4Rec.
    摘要 现代推荐系统是信息传递的核心,但是在线和离线指标之间的差距妨碍了其发展。为解决这个挑战,我们提出了一种推荐模拟器,即Agent4Rec,利用最近的人工智能大语言模型(LLM)的突破性。我们的模拟器采用LLM拥有的生成代理,包括用户profile、记忆和行为模块,特地设计用于推荐系统。具体来说,这些代理的profile模块初始化使用MovieLens数据集,捕捉用户独特的味蕾和社交特征;记忆模块记录了事实和情感的记忆,并与情感驱动的反射机制集成;行动模块支持广泛的行为,包括味蕾驱动和情感驱动的行为。每个代理都与个性化电影推荐在页面上进行交互,基于先前实现的共同 filtering 基于推荐算法。我们深入探讨Agent4Rec的能力和局限性,以explore一个关键研究问题:可以使用LLM拥有的生成代理 simulate真实自主人类在推荐系统中的行为到哪 extent?我们进行了广泛和多方面的评估,包括对代理和用户个性化喜好的Alignment和偏差。此外,我们还进行了有趣的实验,如模拟过滤层效应和发现推荐任务中的内在 causal 关系。我们的代码可以在https://github.com/LehengTHU/Agent4Rec 中下载。

Regret Analysis of the Posterior Sampling-based Learning Algorithm for Episodic POMDPs

  • paper_url: http://arxiv.org/abs/2310.10107
  • repo_url: None
  • paper_authors: Dengwang Tang, Rahul Jain, Ashutosh Nayyar, Pierluigi Nuzzo
  • for: Studies episodic learning problems in partially observable Markov decision processes (POMDPs) with unknown transition and observation models.
  • methods: Analyzes the posterior sampling-based reinforcement learning (PSRL) algorithm for POMDPs and shows that its Bayesian regret scales as the square root of the number of episodes.
  • results: Shows that in general the regret scales exponentially with the horizon length H (with a matching lower bound), but under the condition that the POMDP is undercomplete and weakly revealing, establishes a polynomial Bayesian regret bound that improves on the recent result of arXiv:2204.08967 by a factor of $\Omega(H^2\sqrt{SA})$.
    Abstract Compared to Markov Decision Processes (MDPs), learning in Partially Observable Markov Decision Processes (POMDPs) can be significantly harder due to the difficulty of interpreting observations. In this paper, we consider episodic learning problems in POMDPs with unknown transition and observation models. We consider the Posterior Sampling-based Reinforcement Learning (PSRL) algorithm for POMDPs and show that its Bayesian regret scales as the square root of the number of episodes. In general, the regret scales exponentially with the horizon length $H$, and we show that this is inevitable by providing a lower bound. However, under the condition that the POMDP is undercomplete and weakly revealing, we establish a polynomial Bayesian regret bound that improves the regret bound by a factor of $\Omega(H^2\sqrt{SA})$ over the recent result by arXiv:2204.08967.
    摘要 与马尔可夫决策过程(MDP)相比,在部分可观测马尔可夫决策过程(POMDP)中学习可能显著更难,主要因为观测结果难以解释。本文研究 POMDP 中过渡模型和观测模型均未知的分幕(episodic)学习问题。我们分析了用于 POMDP 的基于后验采样的强化学习算法(PSRL),并证明其贝叶斯遗憾随幕数的平方根增长。一般情况下,遗憾随时间范围长度 $H$ 指数增长,我们通过给出下界证明这是不可避免的。然而,在 POMDP 满足 undercomplete 且 weakly revealing 的条件下,我们建立了多项式的贝叶斯遗憾上界,相比 arXiv:2204.08967 的最新结果改进了 $\Omega(H^2\sqrt{SA})$ 倍。
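
Planning in a sampled POMDP is involved, so as a hedged illustration of the posterior-sampling loop analyzed above, the sketch below implements PSRL in the simpler, fully observed tabular MDP setting (Dirichlet posterior over transitions, value iteration on the sampled model each episode); the toy environment and all names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 5, 2, 10                       # states, actions, horizon

# Unknown ground-truth MDP (used only to simulate the environment).
P_true = rng.dirichlet(np.ones(S), size=(S, A))
R_true = rng.uniform(size=(S, A))

alpha = np.ones((S, A, S))               # Dirichlet posterior over transitions
r_sum = np.zeros((S, A)); n = np.zeros((S, A))

def plan(P, R):
    """Finite-horizon value iteration on the sampled model."""
    V = np.zeros(S); pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V                    # (S, A)
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi

for episode in range(200):
    # 1) sample a model from the posterior, 2) plan, 3) act, 4) update the posterior
    P_hat = np.array([[rng.dirichlet(alpha[s, a]) for a in range(A)] for s in range(S)])
    R_hat = np.where(n > 0, r_sum / np.maximum(n, 1), 0.5)
    pi = plan(P_hat, R_hat)
    s = 0
    for h in range(H):
        a = pi[h, s]
        s_next = rng.choice(S, p=P_true[s, a])
        alpha[s, a, s_next] += 1
        r_sum[s, a] += R_true[s, a]; n[s, a] += 1
        s = s_next
```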

  • paper_url: http://arxiv.org/abs/2310.10103
  • repo_url: https://github.com/Michael-Equi/lfg-nav
  • paper_authors: Dhruv Shah, Michael Equi, Blazej Osinski, Fei Xia, Brian Ichter, Sergey Levine
  • for: Aims to help robots quickly find goals in unfamiliar environments.
  • methods: Uses language models to provide semantic hints, incorporating them as a search heuristic that helps planning algorithms explore novel environments.
  • results: Experiments show that language-model guidance lets robots find goals faster in unfamiliar environments, outperforming uninformed exploration and other ways of using language models in both real-world environments and simulated benchmarks.
    Abstract Navigation in unfamiliar environments presents a major challenge for robots: while mapping and planning techniques can be used to build up a representation of the world, quickly discovering a path to a desired goal in unfamiliar settings with such methods often requires lengthy mapping and exploration. Humans can rapidly navigate new environments, particularly indoor environments that are laid out logically, by leveraging semantics -- e.g., a kitchen often adjoins a living room, an exit sign indicates the way out, and so forth. Language models can provide robots with such knowledge, but directly using language models to instruct a robot how to reach some destination can also be impractical: while language models might produce a narrative about how to reach some goal, because they are not grounded in real-world observations, this narrative might be arbitrarily wrong. Therefore, in this paper we study how the ``semantic guesswork'' produced by language models can be utilized as a guiding heuristic for planning algorithms. Our method, Language Frontier Guide (LFG), uses the language model to bias exploration of novel real-world environments by incorporating the semantic knowledge stored in language models as a search heuristic for planning with either topological or metric maps. We evaluate LFG in challenging real-world environments and simulated benchmarks, outperforming uninformed exploration and other ways of using language models.
    摘要
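
A hedged sketch of the idea of using an LLM's semantic guesswork as a planning heuristic: each frontier is scored by a language model and that score is combined with path cost. `llm_semantic_score` is a hypothetical stand-in (canned values here), and the linear combination is an assumption rather than the paper's exact formulation.

```python
# Hypothetical frontier scoring for goal "find the kitchen": combine path cost
# with an LLM-derived semantic likelihood that a frontier leads toward the goal.

def llm_semantic_score(frontier_context, goal):
    """Stand-in for prompting an LLM with the objects/rooms seen near a frontier."""
    canned = {"hallway with dining chairs": 0.8,
              "garage with tools": 0.1,
              "living room with sofa": 0.5}
    return canned.get(frontier_context, 0.3)

def select_frontier(frontiers, goal, w_semantic=5.0):
    """Pick the frontier minimizing path cost minus a weighted semantic bonus."""
    def score(f):
        return f["path_cost"] - w_semantic * llm_semantic_score(f["context"], goal)
    return min(frontiers, key=score)

frontiers = [
    {"id": 0, "path_cost": 4.0, "context": "hallway with dining chairs"},
    {"id": 1, "path_cost": 2.5, "context": "garage with tools"},
    {"id": 2, "path_cost": 6.0, "context": "living room with sofa"},
]
print(select_frontier(frontiers, goal="kitchen"))   # prefers the hallway frontier
```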

Reusing Pretrained Models by Multi-linear Operators for Efficient Training

  • paper_url: http://arxiv.org/abs/2310.10699
  • repo_url: None
  • paper_authors: Yu Pan, Ye Yuan, Yichun Yin, Zenglin Xu, Lifeng Shang, Xin Jiang, Qun Liu
  • for: Aims to speed up the training of large models by using a small pretrained model to initialize the large "target" model, linearly correlating the weights of the two models to strengthen the acceleration.
  • methods: Linearly correlates each weight of the target model with all the weights of the pretrained model, and uses multi-linear operators to reduce computational and spatial complexity so that resource requirements stay acceptable.
  • results: Experiments show the method saves 76% of the computational cost when pretraining DeiT-base grown from DeiT-small, outperforming bert2BERT by 12.0% and LiGO by 20.7%.
    Abstract Training large models from scratch usually costs a substantial amount of resources. Towards this problem, recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model (termed the ``target model''), leading to a considerable acceleration in training. Despite the successes of these previous studies, they grew pretrained models by mapping partial weights only, ignoring potential correlations across the entire model. As we show in this paper, there are inter- and intra-interactions among the weights of both the pretrained and the target models. As a result, the partial mapping may not capture the complete information and lead to inadequate growth. In this paper, we propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model to further enhance acceleration ability. We utilize multi-linear operators to reduce computational and spacial complexity, enabling acceptable resource requirements. Experiments demonstrate that our method can save 76\% computational costs on DeiT-base transferred from DeiT-small, which outperforms bert2BERT by +12.0\% and LiGO by +20.7\%, respectively.
    摘要 从零开始训练大型模型通常需要大量资源。针对这一问题,bert2BERT 和 LiGO 等近期工作复用小型预训练模型来初始化大型模型(称为"目标模型"),从而显著加速训练。然而,这些工作在扩展预训练模型时只映射了部分权重,忽略了整个模型中权重之间潜在的关联。正如本文所示,预训练模型与目标模型的权重之间存在模型间和模型内的相互作用,因此部分映射可能无法捕捉完整信息,导致扩展不充分。为此,我们提出一种方法,将目标模型的每个权重与预训练模型的所有权重线性关联,以进一步增强加速能力。我们利用多重线性算子降低计算和空间复杂度,使资源需求保持在可接受范围内。实验表明,在从 DeiT-small 迁移到 DeiT-base 的预训练中,我们的方法可节省 76% 的计算成本,分别比 bert2BERT 和 LiGO 高出 12.0% 和 20.7%。
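
A hedged toy of growing a pretrained weight matrix into a larger target weight with linear operators on both sides, so that every target weight is a linear combination of all pretrained weights; the factorized form W_target = A · W_small · Bᵀ and the dimensions are illustrative assumptions, not the paper's exact operator.

```python
import torch

d_in_s, d_out_s = 8, 16      # small (pretrained) layer
d_in_t, d_out_t = 12, 24     # target (larger) layer

W_small = torch.randn(d_out_s, d_in_s)                         # pretrained weight (frozen)
A = torch.nn.Parameter(torch.randn(d_out_t, d_out_s) * 0.1)    # learnable expanders
B = torch.nn.Parameter(torch.randn(d_in_t, d_in_s) * 0.1)

def grow(W_small, A, B):
    """Each entry of the target weight is a linear combination of all pretrained
    entries: W_target = A @ W_small @ B.T (a factorized linear map)."""
    return A @ W_small @ B.T

W_target = grow(W_small, A, B)     # (24, 12), used to initialize the big model
print(W_target.shape)
```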

Orthogonal Uncertainty Representation of Data Manifold for Robust Long-Tailed Learning

  • paper_url: http://arxiv.org/abs/2310.10090
  • repo_url: None
  • paper_authors: Yanbiao Ma, Licheng Jiao, Fang Liu, Shuyuan Yang, Xu Liu, Lingling Li
  • for: Improving model robustness under long-tailed distributions.
  • methods: An Orthogonal Uncertainty Representation (OUR) of feature embeddings together with an end-to-end training strategy.
  • results: Comprehensive evaluations on long-tailed datasets show that OUR reduces the model's sensitivity under long-tailed distributions, combines readily with other long-tailed learning methods, requires no additional data generation, and trains quickly and efficiently.
    Abstract In scenarios with long-tailed distributions, the model's ability to identify tail classes is limited due to the under-representation of tail samples. Class rebalancing, information augmentation, and other techniques have been proposed to facilitate models to learn the potential distribution of tail classes. The disadvantage is that these methods generally pursue models with balanced class accuracy on the data manifold, while ignoring the ability of the model to resist interference. By constructing noisy data manifold, we found that the robustness of models trained on unbalanced data has a long-tail phenomenon. That is, even if the class accuracy is balanced on the data domain, it still has bias on the noisy data manifold. However, existing methods cannot effectively mitigate the above phenomenon, which makes the model vulnerable in long-tailed scenarios. In this work, we propose an Orthogonal Uncertainty Representation (OUR) of feature embedding and an end-to-end training strategy to improve the long-tail phenomenon of model robustness. As a general enhancement tool, OUR has excellent compatibility with other methods and does not require additional data generation, ensuring fast and efficient training. Comprehensive evaluations on long-tailed datasets show that our method significantly improves the long-tail phenomenon of robustness, bringing consistent performance gains to other long-tailed learning methods.
    摘要 在长尾分布场景下,模型能够识别尾类受限因为尾类样本的下 Representation 不充分。Class重新平衡、信息增强和其他技术已经提出来解决这个问题,但这些方法通常寻求在数据 manifold 上具有平衡的类准确率,而忽略模型对干扰的抵抗能力。我们通过构建噪音数据 manifold 发现,模型在不平衡数据上训练时的 Robustness 存在长尾现象。即,即使在数据Domain 上具有平衡的类准确率,模型在噪音数据 manifold 上仍存在偏见。然而,现有的方法无法有效 Mitigate 这个现象,使得模型在长尾场景中脆弱。在这种情况下,我们提出了一种Orthogonal Uncertainty Representation(OUR)的特征嵌入和端到端训练策略,以改善模型在长尾场景中的Robustness。作为一种通用加强工具,OUR具有优compatibility 性,不需要额外数据生成,保证快速和高效的训练。对长尾数据集进行了全面评估,我们的方法在长尾场景中显著改善了模型的Robustness,并且与其他长尾学习方法相结合,带来了一致的性能提升。

MOCHA: Real-Time Motion Characterization via Context Matching

  • paper_url: http://arxiv.org/abs/2310.10079
  • repo_url: https://github.com/DK-Jang/MOCHA_SIGASIA2023
  • paper_authors: Deok-Kyeong Jang, Yuting Ye, Jungdam Won, Sung-Hee Lee
  • for: Transforming neutral, characterless input motions in real time into the distinctive style of a notable character.
  • methods: Introduces MOCHA, a novel online motion characterization framework that transfers both the motion style and the body proportions of a target character onto the input motion.
  • results: The framework characterizes motion in real time and readily accommodates various applications, such as characterization from only sparse input and real-time characterization; the paper also contributes a high-quality motion dataset of six different characters performing a range of motions, a valuable resource for future research.
    Abstract Transforming neutral, characterless input motions to embody the distinct style of a notable character in real time is highly compelling for character animation. This paper introduces MOCHA, a novel online motion characterization framework that transfers both motion styles and body proportions from a target character to an input source motion. MOCHA begins by encoding the input motion into a motion feature that structures the body part topology and captures motion dependencies for effective characterization. Central to our framework is the Neural Context Matcher, which generates a motion feature for the target character with the most similar context to the input motion feature. The conditioned autoregressive model of the Neural Context Matcher can produce temporally coherent character features in each time frame. To generate the final characterized pose, our Characterizer network incorporates the characteristic aspects of the target motion feature into the input motion feature while preserving its context. This is achieved through a transformer model that introduces the adaptive instance normalization and context mapping-based cross-attention, effectively injecting the character feature into the source feature. We validate the performance of our framework through comparisons with prior work and an ablation study. Our framework can easily accommodate various applications, including characterization with only sparse input and real-time characterization. Additionally, we contribute a high-quality motion dataset comprising six different characters performing a range of motions, which can serve as a valuable resource for future research.
    摘要 将中性、无个性的输入动作实时转化为知名角色的独特风格,对角色动画极具吸引力。本文介绍 MOCHA,一种新的在线动作特征化框架,可同时将目标角色的动作风格和身体比例迁移到输入源动作上。MOCHA 首先将输入动作编码为一个动作特征,该特征刻画身体部位的拓扑结构并捕捉动作依赖关系,以便有效地进行特征化。框架的核心是 Neural Context Matcher,它为目标角色生成与输入动作特征上下文最相似的动作特征。The conditioned autoregressive model of the Neural Context Matcher can produce temporally coherent character features in each time frame. To generate the final characterized pose, our Characterizer network incorporates the characteristic aspects of the target motion feature into the input motion feature while preserving its context. This is achieved through a transformer model that introduces the adaptive instance normalization and context mapping-based cross-attention, effectively injecting the character feature into the source feature. We validate the performance of our framework through comparisons with prior work and an ablation study. Our framework can easily accommodate various applications, including characterization with only sparse input and real-time characterization. Additionally, we contribute a high-quality motion dataset comprising six different characters performing a range of motions, which can serve as a valuable resource for future research.

Verbosity Bias in Preference Labeling by Large Language Models

  • paper_url: http://arxiv.org/abs/2310.10076
  • repo_url: None
  • paper_authors: Keita Saito, Akifumi Wachi, Koki Wataoka, Youhei Akimoto
  • for: Investigates methods for improving large language models (LLMs), specifically Reinforcement Learning from AI Feedback (RLAIF), in which feedback from other LLMs replaces human feedback when evaluating LLMs.
  • methods: Compares GPT-4 preference labels with human feedback and proposes a metric for quantifying verbosity bias.
  • results: Finds that, in this problem setting, GPT-4 prefers longer answers more than humans do, and the proposed metric measures this bias.
    Abstract In recent years, Large Language Models (LLMs) have witnessed a remarkable surge in prevalence, altering the landscape of natural language processing and machine learning. One key factor in improving the performance of LLMs is alignment with humans achieved with Reinforcement Learning from Human Feedback (RLHF), as for many LLMs such as GPT-4, Bard, etc. In addition, recent studies are investigating the replacement of human feedback with feedback from other LLMs named Reinforcement Learning from AI Feedback (RLAIF). We examine the biases that come along with evaluating LLMs with other LLMs and take a closer look into verbosity bias -- a bias where LLMs sometimes prefer more verbose answers even if they have similar qualities. We see that in our problem setting, GPT-4 prefers longer answers more than humans. We also propose a metric to measure this bias.
    摘要
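
The abstract proposes a metric for verbosity bias without stating its formula; as a hedged illustration, the sketch below computes the rate at which a judge prefers the longer of two responses, which could then be compared between an LLM judge and human judges.

```python
def verbosity_preference_rate(pairs, preferences):
    """Fraction of comparisons where the judge preferred the longer response.
    pairs: list of (response_a, response_b); preferences: list of 'a' or 'b'.
    An illustrative proxy, not the paper's exact metric."""
    longer_preferred = 0
    for (a, b), choice in zip(pairs, preferences):
        longer = "a" if len(a.split()) > len(b.split()) else "b"
        longer_preferred += (choice == longer)
    return longer_preferred / len(pairs)

pairs = [("Short answer.", "A much longer and more detailed answer with caveats."),
         ("Concise reply.", "Verbose reply that restates the question at length.")]
print(verbosity_preference_rate(pairs, ["b", "a"]))   # 0.5
# A bias estimate would compare rate(LLM judge) with rate(human judges) on the same pairs.
```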

Fine-tuning ChatGPT for Automatic Scoring

  • paper_url: http://arxiv.org/abs/2310.10072
  • repo_url: None
  • paper_authors: Ehsan Latif, Xiaoming Zhai
  • For: This paper demonstrates the potential of fine-tuned ChatGPT (GPT-3.5) for automatically scoring student written constructed responses in science education.
  • Methods: The paper fine-tunes GPT-3.5 on six assessment tasks with a diverse dataset of middle-school and high-school student responses and expert scoring.
  • Results: Fine-tuned GPT-3.5 achieved a remarkable average increase (9.1%) in automatic scoring accuracy compared to the fine-tuned state-of-the-art language model BERT, with significant improvements on multi-label tasks and multi-class items.
    Abstract This study highlights the potential of fine-tuned ChatGPT (GPT-3.5) for automatically scoring student written constructed responses using example assessment tasks in science education. Recent studies on OpenAI's generative model GPT-3.5 proved its superiority in predicting the natural language with high accuracy and human-like responses. GPT-3.5 has been trained over enormous online language materials such as journals and Wikipedia; therefore, more than direct usage of pre-trained GPT-3.5 is required for automatic scoring as students utilize a different language than trained material. These imply that a domain-specific model, fine-tuned over data for specific tasks, can enhance model performance. In this study, we fine-tuned GPT-3.5 on six assessment tasks with a diverse dataset of middle-school and high-school student responses and expert scoring. The six tasks comprise two multi-label and four multi-class assessment tasks. We compare the performance of fine-tuned GPT-3.5 with the fine-tuned state-of-the-art Google's generated language model, BERT. The results show that in-domain training corpora constructed from science questions and responses for BERT achieved average accuracy = 0.838, SD = 0.069. GPT-3.5 shows a remarkable average increase (9.1%) in automatic scoring accuracy (mean = 9.15, SD = 0.042) for the six tasks, p =0.001 < 0.05. Specifically, for multi-label tasks (item 1 with 5 labels; item 2 with 10 labels), GPT-3.5 achieved significantly higher scoring accuracy than BERT across all the labels, with the second item achieving a 7.1% increase. The average scoring increase for the four multi-class items for GPT-3.5 was 10.6% compared to BERT. Our study confirmed the effectiveness of fine-tuned GPT-3.5 for automatic scoring of student responses on domain-specific data in education with high accuracy. We have released fine-tuned models for public use and community engagement.
    摘要
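
A hedged sketch of preparing scoring data in the chat-style JSONL format used for GPT-3.5 fine-tuning; the prompt wording, file name, and example responses are illustrative, and the resulting file would then be uploaded through the OpenAI fine-tuning API to launch the job.

```python
import json

# Illustrative (student response, expert score) pairs; real data would come
# from the assessment tasks and expert scoring described in the paper.
examples = [
    ("Plants make food using sunlight, water and carbon dioxide.", 3),
    ("Plants eat soil to grow.", 1),
]

system_prompt = ("You are a scorer for a middle-school science item. "
                 "Reply with an integer score from 0 to 3.")

with open("scoring_train.jsonl", "w") as f:
    for response, score in examples:
        record = {"messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Student response: {response}"},
            {"role": "assistant", "content": str(score)},
        ]}
        f.write(json.dumps(record) + "\n")
# scoring_train.jsonl is the training file for a gpt-3.5-turbo fine-tuning job.
```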

GreatSplicing: A Semantically Rich Splicing Dataset

  • paper_url: http://arxiv.org/abs/2310.10070
  • repo_url: None
  • paper_authors: Xiuli Bi, Jiaming Liang
  • for: Addresses the lack of semantic variety in existing splicing forgery datasets and improves the accuracy of splicing-trace detection.
  • methods: Introduces GreatSplicing, a manually created splicing dataset of 5,000 spliced images covering 335 distinct semantic categories.
  • results: Models trained on GreatSplicing show lower misidentification rates and better cross-dataset detection capability than those trained on existing datasets.
    Abstract In existing splicing forgery datasets, the insufficient semantic varieties of spliced regions cause a problem that trained detection models overfit semantic features rather than splicing traces. Meanwhile, because of the absence of a reasonable dataset, different detection methods proposed cannot reach a consensus on experimental settings. To address these urgent issues, GreatSplicing, a manually created splicing dataset with a considerable amount and high quality, is proposed in this paper. GreatSplicing comprises 5,000 spliced images and covers spliced regions with 335 distinct semantic categories, allowing neural networks to grasp splicing traces better. Extensive experiments demonstrate that models trained on GreatSplicing exhibit minimal misidentification rates and superior cross-dataset detection capabilities compared to existing datasets. Furthermore, GreatSplicing is available for all research purposes and can be downloaded from www.greatsplicing.net.
    摘要 现有的剪辑伪造数据集中,剪辑区域的 semantic variety 不够,导致训练的检测模型更倾向于学习 semantic features 而不是剪辑 traces。另一方面,由于缺乏合理的数据集,不同的检测方法的实际设置不能达成一致。为解决这些紧迫的问题,本文提出了 GreatSplicing,一个手动创建的剪辑数据集,包含 5,000 个剪辑图像,剪辑区域涵盖 335 个不同的 semantic category,使得神经网络更好地捕捉剪辑 traces。广泛的实验表明,基于 GreatSplicing 训练的模型在剪辑检测方面具有较少的误认率和较好的跨数据集检测能力,与现有数据集相比。此外,GreatSplicing 适用于所有研究用途,可以从 www.greatsplicing.net 下载。

Learning Graph Filters for Spectral GNNs via Newton Interpolation

  • paper_url: http://arxiv.org/abs/2310.10064
  • repo_url: None
  • paper_authors: Junjie Xu, Enyan Dai, Dongsheng Luo, Xiang Zhang, Suhang Wang
  • for: Investigates how the choice of filter frequency in spectral graph neural networks (GNNs) relates to a graph's homophily level, and how task-supervised spectral filters can capture the key frequency information in graph data.
  • methods: Combines theoretical and empirical analyses of existing GNNs; finding that low-frequency filters correlate positively with homophily while high-frequency filters correlate negatively, it introduces a shape-aware regularization applied to a Newton interpolation-based spectral filter so that polynomial filters can be tailored to the desired homophily level.
  • results: Experiments show that NewtonNet successfully achieves the desired filter shapes and performs well on both homophilous and heterophilous datasets.
    Abstract Spectral Graph Neural Networks (GNNs) are gaining attention because they can surpass the limitations of message-passing GNNs by learning spectral filters that capture essential frequency information in graph data through task supervision. However, previous research suggests that the choice of filter frequency is tied to the graph's homophily level, a connection that hasn't been thoroughly explored in existing spectral GNNs. To address this gap, the study conducts both theoretical and empirical analyses, revealing that low-frequency filters have a positive correlation with homophily, while high-frequency filters have a negative correlation. This leads to the introduction of a shape-aware regularization technique applied to a Newton Interpolation-based spectral filter, enabling the customization of polynomial spectral filters that align with desired homophily levels. Extensive experiments demonstrate that NewtonNet successfully achieves the desired filter shapes and exhibits superior performance on both homophilous and heterophilous datasets.
    摘要 spectral graph neural networks (GNNs) 是受到关注,因为它可以超越消息传递 GNNs 的局限性,通过任务指导学习spectral filters,捕捉图数据中重要的频率信息。然而,前一些研究表明,filter frequency 与图的同化程度(homophily level)之间存在相互关系,这个关系尚未在现有的 spectral GNNs 中得到了充分探讨。为了解决这个差距,这项研究进行了both theoretical和empirical分析,发现低频 filters 与 homophily 之间存在正相关,高频 filters 与 homophily 之间存在负相关。这导致了一种shape-aware regularization技术的引入,用于 Newton Interpolation-based spectral filter,以适应欲要的 homophily 水平。广泛的实验表明,NewtonNet 成功实现了所需的 filter 形状,并在 homophilous 和 heterophilous 数据集上表现出色。
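
A hedged sketch of the Newton-interpolation idea: given interpolation points (q_i, t_i) specifying the desired filter response t_i at frequency q_i, compute divided differences and evaluate the Newton-form polynomial over the spectrum of the normalized Laplacian; the shape-aware regularization itself is not reproduced, and the interpolation points below are illustrative.

```python
import numpy as np

def divided_differences(q, t):
    """Newton divided-difference coefficients for points (q_i, t_i)."""
    n = len(q)
    coef = np.array(t, dtype=float)
    for j in range(1, n):
        coef[j:] = (coef[j:] - coef[j - 1:-1]) / (q[j:] - q[:n - j])
    return coef

def newton_filter(eigvals, q, t):
    """Evaluate the interpolating polynomial g(lambda) at the given eigenvalues."""
    coef = divided_differences(q, t)
    g = np.full_like(eigvals, coef[0])
    basis = np.ones_like(eigvals)
    for j in range(1, len(q)):
        basis = basis * (eigvals - q[j - 1])
        g = g + coef[j] * basis
    return g

# Interpolation points on [0, 2] (spectrum of the normalized Laplacian);
# a low-pass shape (strong response at low frequencies) suits homophilous graphs.
q = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
t = np.array([1.0, 0.8, 0.4, 0.1, 0.0])
print(newton_filter(np.linspace(0, 2, 5), q, t))   # recovers t at the knots
```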

A Comprehensive Evaluation of Tool-Assisted Generation Strategies

  • paper_url: http://arxiv.org/abs/2310.10062
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Alon Jacovi, Avi Caciularu, Jonathan Herzig, Roee Aharoni, Bernd Bohnet, Mor Geva
  • for: This paper aims to investigate the effectiveness of various few-shot tool-usage strategies for augmenting language models, and to provide a systematic and fair comparison with strong baselines.
  • methods: The paper uses empirical analysis to compare the performance of different few-shot tool-usage strategies, including strategies that refine incorrect outputs with tools and strategies that retrieve relevant information ahead of or during generation.
  • results: The paper finds that strong no-tool baselines are competitive to tool-assisted strategies, and that tool-assisted strategies are expensive in terms of the number of tokens they require. The paper emphasizes the need for comprehensive evaluations of future strategies to accurately assess their benefits and costs.
    Abstract A growing area of research investigates augmenting language models with tools (e.g., search engines, calculators) to overcome their shortcomings (e.g., missing or incorrect knowledge, incorrect logical inferences). Various few-shot tool-usage strategies have been proposed. However, there is no systematic and fair comparison across different strategies, or between these strategies and strong baselines that do not leverage tools. We conduct an extensive empirical analysis, finding that (1) across various datasets, example difficulty levels, and models, strong no-tool baselines are competitive to tool-assisted strategies, implying that effectively using tools with in-context demonstrations is a difficult unsolved problem; (2) for knowledge-retrieval tasks, strategies that *refine* incorrect outputs with tools outperform strategies that retrieve relevant information *ahead of* or *during generation*; (3) tool-assisted strategies are expensive in the number of tokens they require to work -- incurring additional costs by orders of magnitude -- which does not translate into significant improvement in performance. Overall, our findings suggest that few-shot tool integration is still an open challenge, emphasizing the need for comprehensive evaluations of future strategies to accurately assess their *benefits* and *costs*.
    摘要 研究者正在努力增强语言模型,以解决其缺陷(如缺失或错误知识、逻辑推理错误)。各种几招工具使用策略已经被提出,但没有系统化、公平的比较,或与不使用工具的强大基eline进行比较。我们进行了广泛的实验分析,发现:1. 在不同的数据集、示例难度水平和模型上,不使用工具的强大基eline与工具协助策略竞争激烈,表明使用工具进行协助是一个困难的未解决问题。2. 对知识检索任务,使用工具修正错误输出的策略比使用工具预先检索相关信息的策略更高效。3. 使用工具的策略需要更多的字符数,而这些字符数的增加并没有级别提高表现。总之,我们的发现表明几招工具集成仍然是一个开放的挑战,强调未来策略的全面评估,以准确评估其利益和成本。

Flow Dynamics Correction for Action Recognition

  • paper_url: http://arxiv.org/abs/2310.10059
  • repo_url: None
  • paper_authors: Lei Wang, Piotr Koniusz
  • for: investigate different optical flow and features extracted from these optical flow to improve action recognition performance
  • methods: power normalization on magnitude component of optical flow for flow dynamics correction, and integrating corrected flow dynamics into popular models through a simple hallucination step
  • results: performance boosted with corrected optical flow, and new state-of-the-art performance on several benchmarks including HMDB-51, YUP++, fine-grained action recognition on MPII Cooking Activities, and large-scale Charades
    Abstract Various research studies indicate that action recognition performance highly depends on the types of motions being extracted and how accurate the human actions are represented. In this paper, we investigate different optical flow, and features extracted from these optical flow that capturing both short-term and long-term motion dynamics. We perform power normalization on the magnitude component of optical flow for flow dynamics correction to boost subtle or dampen sudden motions. We show that existing action recognition models which rely on optical flow are able to get performance boosted with our corrected optical flow. To further improve performance, we integrate our corrected flow dynamics into popular models through a simple hallucination step by selecting only the best performing optical flow features, and we show that by 'translating' the CNN feature maps into these optical flow features with different scales of motions leads to the new state-of-the-art performance on several benchmarks including HMDB-51, YUP++, fine-grained action recognition on MPII Cooking Activities, and large-scale Charades.
    摘要 各种研究表明动作认识性能高度取决于提取的动作类型和表达的准确性。在这篇论文中,我们调查了不同的光流,以及从这些光流中提取的短期和长期动作动力特征。我们对光流的幅组分进行功率正规化,以 corrections for flow dynamics。我们显示了现有的基于光流的动作认识模型可以通过我们修正后的光流获得性能提升。为了进一步提高性能,我们将我们修正后的流动动力集成到了流行的模型中,并通过简单的梦幻步骤选择最佳的光流特征,并显示了通过将CNN特征图转换为不同尺度的光流特征来实现新的状态前瞻性表现。这些表现在HMDB-51、YUP++、细化动作认识在MPII Cooking Activities以及大规模Charades等数据集上达到了新的状态前瞻性。
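
A minimal sketch of power normalization on the magnitude component of an optical-flow field, boosting subtle motions and damping sudden ones while preserving direction; the exponent is illustrative.

```python
import numpy as np

def power_normalize_flow(flow, gamma=0.5, eps=1e-8):
    """flow: (H, W, 2) array of (u, v) displacements.
    Decompose into magnitude and direction, apply a power law to the magnitude
    (gamma < 1 boosts subtle motions and damps large ones), then recompose."""
    mag = np.linalg.norm(flow, axis=-1, keepdims=True)
    direction = flow / (mag + eps)
    return direction * np.power(mag, gamma)

flow = np.random.randn(4, 4, 2) * 3.0
corrected = power_normalize_flow(flow, gamma=0.5)
print(np.abs(flow).max(), np.abs(corrected).max())   # large motions are damped
```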

NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models

  • paper_url: http://arxiv.org/abs/2310.10054
  • repo_url: https://github.com/jongwooko/nash-pruning-official
  • paper_authors: Jongwoo Ko, Seungjoon Park, Yujin Kim, Sumyeong Ahn, Du-Seong Chang, Euijai Ahn, Se-Young Yun
  • for: Investigates structured pruning of encoder-decoder models to improve inference speed and generation quality.
  • methods: Takes a decoupled pruning perspective, applying structured pruning to the encoder and decoder components separately.
  • results: Finds that the number of decoder layers dominates inference speed while low sparsity in the pruned encoder preserves generation quality; based on these findings, proposes NASH, a simple and effective framework that narrows the encoder and shortens the decoder and adapts readily across tasks and architectures.
    Abstract Structured pruning methods have proven effective in reducing the model size and accelerating inference speed in various network architectures such as Transformers. Despite the versatility of encoder-decoder models in numerous NLP tasks, the structured pruning methods on such models are relatively less explored compared to encoder-only models. In this study, we investigate the behavior of the structured pruning of the encoder-decoder models in the decoupled pruning perspective of the encoder and decoder component, respectively. Our findings highlight two insights: (1) the number of decoder layers is the dominant factor of inference speed, and (2) low sparsity in the pruned encoder network enhances generation quality. Motivated by these findings, we propose a simple and effective framework, NASH, that narrows the encoder and shortens the decoder networks of encoder-decoder models. Extensive experiments on diverse generation and inference tasks validate the effectiveness of our method in both speedup and output quality.
    摘要 结构化剪枝方法已被证明能在 Transformer 等多种网络架构中有效减小模型规模并加速推理。尽管编码器-解码器模型在众多自然语言处理任务中表现出色,针对此类模型的结构化剪枝研究相比仅编码器模型仍相对较少。在本研究中,我们从解耦的视角分别考察编码器和解码器组件的结构化剪枝行为。我们的发现揭示了两点:(1) 解码器层数是推理速度的主导因素;(2) 剪枝后的编码器保持较低稀疏度有助于提升生成质量。受此启发,我们提出了一个简单而有效的框架 NASH,它收窄编码器并缩短解码器。在多种生成与推理任务上的大量实验验证了该方法在加速和输出质量上的有效性。
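
A toy illustration of "shortening the decoder" on a plain PyTorch encoder-decoder by keeping a subset of decoder layers; uniform layer selection stands in for the paper's pruning criterion, and the example is not tied to any pretrained checkpoint.

```python
import torch
import torch.nn as nn

d_model = 64
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

# Keep every other decoder layer (uniform selection as a stand-in for a
# learned or importance-based pruning criterion).
keep = [0, 2, 4]
model.decoder.layers = nn.ModuleList([model.decoder.layers[i] for i in keep])
model.decoder.num_layers = len(keep)

src = torch.randn(2, 10, d_model)
tgt = torch.randn(2, 7, d_model)
print(model(src, tgt).shape)   # decoding now runs through 3 layers instead of 6
```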

Robust Collaborative Filtering to Popularity Distribution Shift

  • paper_url: http://arxiv.org/abs/2310.10696
  • repo_url: https://github.com/anzhang314/popgo
  • paper_authors: An Zhang, Wenchang Ma, Jingnan Zheng, Xiang Wang, Tat-seng Chua
  • for: Aims to improve the generalization of collaborative filtering (CF) models even when popularity shortcuts are present in the training data.
  • methods: Proposes PopGo, a simple yet effective debiasing strategy that quantifies and reduces the interaction-wise popularity shortcut: it first learns a shortcut model and then corrects the CF model's predictions to reduce the shortcut's influence.
  • results: Experiments on four benchmark datasets show that PopGo achieves significant gains on both ID and OOD test sets over existing debiasing strategies such as DICE and MACR.
    Abstract In leading collaborative filtering (CF) models, representations of users and items are prone to learn popularity bias in the training data as shortcuts. The popularity shortcut tricks are good for in-distribution (ID) performance but poorly generalized to out-of-distribution (OOD) data, i.e., when popularity distribution of test data shifts w.r.t. the training one. To close the gap, debiasing strategies try to assess the shortcut degrees and mitigate them from the representations. However, there exist two deficiencies: (1) when measuring the shortcut degrees, most strategies only use statistical metrics on a single aspect (i.e., item frequency on item and user frequency on user aspect), failing to accommodate the compositional degree of a user-item pair; (2) when mitigating shortcuts, many strategies assume that the test distribution is known in advance. This results in low-quality debiased representations. Worse still, these strategies achieve OOD generalizability with a sacrifice on ID performance. In this work, we present a simple yet effective debiasing strategy, PopGo, which quantifies and reduces the interaction-wise popularity shortcut without any assumptions on the test data. It first learns a shortcut model, which yields a shortcut degree of a user-item pair based on their popularity representations. Then, it trains the CF model by adjusting the predictions with the interaction-wise shortcut degrees. By taking both causal- and information-theoretical looks at PopGo, we can justify why it encourages the CF model to capture the critical popularity-agnostic features while leaving the spurious popularity-relevant patterns out. We use PopGo to debias two high-performing CF models (MF, LightGCN) on four benchmark datasets. On both ID and OOD test sets, PopGo achieves significant gains over the state-of-the-art debiasing strategies (e.g., DICE, MACR).
    摘要 领导合作 filtering(CF)模型中,用户和item的表示可能受到训练数据中的媒体偏袋影响,即媒体偏袋短cut。这些媒体偏袋短cut可以在内部数据(ID)上达到好的性能,但是对于外部数据(OOD)来说,媒体偏袋短cut会导致模型的泛化能力差。为了缓解这 gap,去偏袋策略会评估用户和item的媒体偏袋度并 Mitigate它们从表示中。然而,存在两个缺陷:(1)当衡量媒体偏袋度时,大多数策略只使用单一方面的统计指标(例如,item频次和用户频次),而不考虑用户-item对的compositional度;(2)当缓解媒体偏袋时,许多策略假设测试分布已知。这会导致低质量的去偏袋表示。更糟糕的是,这些策略可以在ID性能的代价下实现OOD泛化性能。在这个工作中,我们提出了一种简单 yet effective的去偏袋策略,即PopGo。PopGo会量化和降低用户-item对的交互媒体偏袋度,不需要测试分布的假设。它首先学习一个媒体偏袋模型,并根据用户和item的媒体偏袋表示计算交互媒体偏袋度。然后,它将CF模型通过对预测进行调整来训练。通过从 causal和信息理论的角度看PopGo,我们可以解释它如何鼓励CF模型捕捉重要的媒体偏袋无关特征,而不捕捉媒体偏袋相关的假特征。我们使用PopGo对两种高性能CF模型(MF、LightGCN)在四个benchmark数据集上进行去偏袋。在ID和OOD测试集上,PopGo实现了与状态 искусственный debiasing策略(例如、DICE、MACR)相比较高的性能提升。
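
A hedged sketch of the two-stage idea described in the abstract: a shortcut model scores each user-item pair from popularity statistics, and the CF model's training prediction is adjusted by this interaction-wise shortcut degree (a multiplicative adjustment is assumed here purely for illustration); at inference only the CF score is used.

```python
import torch
import torch.nn as nn

n_users, n_items, dim = 50, 80, 16
user_pop = torch.rand(n_users)          # e.g. normalized user interaction counts
item_pop = torch.rand(n_items)          # e.g. normalized item popularity

cf_user = nn.Embedding(n_users, dim)    # main CF model (matrix factorization)
cf_item = nn.Embedding(n_items, dim)
shortcut = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))

def cf_score(u, i):
    return (cf_user(u) * cf_item(i)).sum(-1)

def shortcut_degree(u, i):
    feats = torch.stack([user_pop[u], item_pop[i]], dim=-1)
    return torch.sigmoid(shortcut(feats)).squeeze(-1)    # in (0, 1)

u = torch.randint(0, n_users, (32,))
i = torch.randint(0, n_items, (32,))
train_pred = cf_score(u, i) * shortcut_degree(u, i)      # popularity-aware during training
test_pred = cf_score(u, i)                               # shortcut removed at inference
print(train_pred.shape, test_pred.shape)
```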

FATE-LLM: A Industrial Grade Federated Learning Framework for Large Language Models

  • paper_url: http://arxiv.org/abs/2310.10049
  • repo_url: https://github.com/FederatedAI/FATE-LLM
  • paper_authors: Tao Fan, Yan Kang, Guoqiang Ma, Weijing Chen, Wenbin Wei, Lixin Fan, Qiang Yang
  • for: Proposes an industrial-grade federated learning framework so that large language models (LLMs) can be used in real-world applications.
  • methods: The framework trains LLMs efficiently with parameter-efficient fine-tuning methods and applies privacy-preserving mechanisms to protect intellectual property and data privacy.
  • results: The framework addresses the massive computing resources and high-quality data required to train LLMs while protecting model intellectual property and data privacy.
    Abstract Large Language Models (LLMs), such as ChatGPT, LLaMA, GLM, and PaLM, have exhibited remarkable performances across various tasks in recent years. However, LLMs face two main challenges in real-world applications. One challenge is that training LLMs consumes vast computing resources, preventing LLMs from being adopted by small and medium-sized enterprises with limited computing resources. Another is that training LLM requires a large amount of high-quality data, which are often scattered among enterprises. To address these challenges, we propose FATE-LLM, an industrial-grade federated learning framework for large language models. FATE-LLM (1) facilitates federated learning for large language models (coined FedLLM); (2) promotes efficient training of FedLLM using parameter-efficient fine-tuning methods; (3) protects the intellectual property of LLMs; (4) preserves data privacy during training and inference through privacy-preserving mechanisms. We release the code of FATE-LLM at https://github.com/FederatedAI/FATE-LLM to facilitate the research of FedLLM and enable a broad range of industrial applications.
    摘要 大型语言模型(LLM),如ChatGPT、LLaMA、GLM和PaLM,在过去的几年中表现出了很好的表现。然而,LLM在实际应用中遇到了两个主要挑战。一个挑战是在小到中型企业的限制计算资源下培训LLM,这限制了LLM的应用。另一个挑战是培训LLM需要大量高质量数据,这些数据经常分散在企业中。为解决这些挑战,我们提出了FATE-LLM,一个工业级联合学习框架 для大型语言模型。FATE-LLM(1)实现联合学习 для大型语言模型(称为FedLLM);(2)提高FedLLM的高效培训方法;(3)保护LLM的知识产权;(4)在训练和推断过程中保护数据隐私。我们在GitHub上发布了FATE-LLM的代码,以便研究FedLLM和推广各种工业应用。

TRANSOM: An Efficient Fault-Tolerant System for Training LLMs

  • paper_url: http://arxiv.org/abs/2310.10046
  • repo_url: https://github.com/SenseCore/transom-checkpoint-engine
  • paper_authors: Baodong Wu, Lei Xia, Qingping Li, Kangyu Li, Xu Chen, Yongqiang Guo, Tieyao Xiang, Yuheng Chen, Shigang Li
  • for: Improves the training efficiency of large language models (LLMs) by handling the hardware and software failures that occur during large-scale training.
  • methods: Proposes a new fault-tolerant LLM training system with three key subsystems: an automatic fault-tolerance and recovery mechanism for the training pipeline (TOL), a multi-dimensional-metric automatic anomaly detection system for training tasks (TEE), and an asynchronous-access, fault-tolerant checkpointing technology (TCE).
  • results: Experiments show that TRANSOM significantly improves large-scale LLM training efficiency: GPT3-175B pre-training time is reduced by 28%, and asynchronous checkpoint saving and loading is 20x faster.
    Abstract Large language models (LLMs) with hundreds of billions or trillions of parameters, represented by chatGPT, have achieved profound impact on various fields. However, training LLMs with super-large-scale parameters requires large high-performance GPU clusters and long training periods lasting for months. Due to the inevitable hardware and software failures in large-scale clusters, maintaining uninterrupted and long-duration training is extremely challenging. As a result, A substantial amount of training time is devoted to task checkpoint saving and loading, task rescheduling and restart, and task manual anomaly checks, which greatly harms the overall training efficiency. To address these issues, we propose TRANSOM, a novel fault-tolerant LLM training system. In this work, we design three key subsystems: the training pipeline automatic fault tolerance and recovery mechanism named Transom Operator and Launcher (TOL), the training task multi-dimensional metric automatic anomaly detection system named Transom Eagle Eye (TEE), and the training checkpoint asynchronous access automatic fault tolerance and recovery technology named Transom Checkpoint Engine (TCE). Here, TOL manages the lifecycle of training tasks, while TEE is responsible for task monitoring and anomaly reporting. TEE detects training anomalies and reports them to TOL, who automatically enters the fault tolerance strategy to eliminate abnormal nodes and restart the training task. And the asynchronous checkpoint saving and loading functionality provided by TCE greatly shorten the fault tolerance overhead. The experimental results indicate that TRANSOM significantly enhances the efficiency of large-scale LLM training on clusters. Specifically, the pre-training time for GPT3-175B has been reduced by 28%, while checkpoint saving and loading performance have improved by a factor of 20.
    摘要 大型语言模型(LLM),如chatGPT,已经在不同领域 achieve 了深见的影响。然而,在超大型参数的训练中,需要大型高性能GPU集群和长时间的训练时间,达到月份甚至更长。由于大型集群中的硬件和软件故障是不可避免的,因此在长时间训练中保持无间断和长时间训练是极其困难。为此,我们提出了 TRANSOM,一种新的 fault-tolerant LLM 训练系统。在这项工作中,我们设计了三个关键子系统:训练管道自动过错tolerance和恢复机制(TOL),任务多维度自动异常检测系统(TEE),以及训练checkpoint异步访问自动过错tolerance和恢复技术(TCE)。在这里,TOL负责训练任务的生命周期管理,而TEE负责任务监控和异常报告。TEE检测训练异常并将其报告给TOL,TOL会自动入口过错策略,消除异常节点并重新启动训练任务。而TCE提供的异步checkpoint存储和加载功能,可以大大减少过错过程的负担。实验结果表明,TRANSOM可以大幅提高大规模 LL M 训练的效率。Specifically,对于GPT3-175B的预训练时间,可以提高28%,而checkpoint存储和加载性能可以提高20倍。
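
A hedged sketch of asynchronous checkpointing in the spirit of TCE: the training loop snapshots the model state to CPU tensors and hands it to a background writer thread so optimization is not blocked by file I/O; the queue-based design and names are illustrative, not TCE's actual implementation.

```python
import queue
import threading
import torch
import torch.nn as nn

save_queue = queue.Queue()

def writer():
    while True:
        step, state = save_queue.get()
        if state is None:               # sentinel to stop the writer
            break
        torch.save(state, f"ckpt_step{step}.pt")
        save_queue.task_done()

threading.Thread(target=writer, daemon=True).start()

model = nn.Linear(128, 128)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(1, 101):
    loss = model(torch.randn(32, 128)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 50 == 0:
        # Copy tensors to CPU so training can keep mutating the live parameters.
        state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
        save_queue.put((step, state))   # file I/O happens off the training thread

save_queue.join()                       # wait for pending writes to finish
save_queue.put((None, None))            # stop the writer
```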

Smart City Transportation: Deep Learning Ensemble Approach for Traffic Accident Detection

  • paper_url: http://arxiv.org/abs/2310.10038
  • repo_url: None
  • paper_authors: Victor Adewopo, Nelly Elsayed
  • for: Surveys existing traffic accident detection techniques to improve safety and efficiency in smart-city traffic management systems.
  • methods: Proposes a new I3D-CONVLSTM2D model architecture that combines RGB frames with optical-flow information, designed specifically for smart-city traffic surveillance, and validates its effectiveness experimentally.
  • results: The empirical analysis shows that the I3D-CONVLSTM2D RGB + Optical-Flow (Trainable) model performs best, reaching 87% mean average precision (MAP); the paper also discusses the challenges posed by data imbalance and possible remedies.
    Abstract The dynamic and unpredictable nature of road traffic necessitates effective accident detection methods for enhancing safety and streamlining traffic management in smart cities. This paper offers a comprehensive exploration study of prevailing accident detection techniques, shedding light on the nuances of other state-of-the-art methodologies while providing a detailed overview of distinct traffic accident types like rear-end collisions, T-bone collisions, and frontal impact accidents. Our novel approach introduces the I3D-CONVLSTM2D model architecture, a lightweight solution tailored explicitly for accident detection in smart city traffic surveillance systems by integrating RGB frames with optical flow information. Our experimental study's empirical analysis underscores our approach's efficacy, with the I3D-CONVLSTM2D RGB + Optical-Flow (Trainable) model outperforming its counterparts, achieving an impressive 87\% Mean Average Precision (MAP). Our findings further elaborate on the challenges posed by data imbalances, particularly when working with a limited number of datasets, road structures, and traffic scenarios. Ultimately, our research illuminates the path towards a sophisticated vision-based accident detection system primed for real-time integration into edge IoT devices within smart urban infrastructures.
    摘要 随着城市智能化的发展,道路交通中的事故检测技术已成为提高安全性和优化交通管理的关键。本文进行了全面的探讨现有事故检测技术,探讨其他现代方法的细节,并提供了不同类型的交通事故的详细概述,如后尾collisions、T-bone collisions和前面Collisions。我们的新方法 introduce了I3D-CONVLSTM2D模型架构,这是一种适应性强的解决方案,通过RGB框架和光流信息来检测事故。我们的实验研究的实证分析表明,我们的I3D-CONVLSTM2D RGB + Optical-Flow(可训练)模型在事故检测方面表现出色,达到了87%的 Mean Average Precision(MAP)。我们的发现还探讨了数据不均衡的挑战,特别是在有限数据集、路径结构和交通场景下。最后,我们的研究阐明了一种基于视觉的事故检测系统,准备好于实时集成到智能城市基础设施中的边缘IoT设备。

Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance

  • paper_url: http://arxiv.org/abs/2310.10021
  • repo_url: https://github.com/clvrai/boss
  • paper_authors: Jesse Zhang, Jiahui Zhang, Karl Pertsch, Ziyi Liu, Xiang Ren, Minsuk Chang, Shao-Hua Sun, Joseph J. Lim
  • for: BOSS is designed to solve new long-horizon, complex, and meaningful tasks with minimal supervision.
  • methods: BOSS uses skill bootstrapping, where an agent with a set of primitive skills interacts with the environment to practice new skills without receiving reward feedback for tasks outside of the initial skill set; the bootstrapping phase is guided by large language models (LLMs) that inform the agent of meaningful skills to chain together.
  • results: Agents trained with the LLM-guided bootstrapping procedure outperform those trained with naive bootstrapping as well as prior unsupervised skill acquisition methods on zero-shot execution of unseen, long-horizon tasks in new environments.
    Abstract We propose BOSS, an approach that automatically learns to solve new long-horizon, complex, and meaningful tasks by growing a learned skill library with minimal supervision. Prior work in reinforcement learning require expert supervision, in the form of demonstrations or rich reward functions, to learn long-horizon tasks. Instead, our approach BOSS (BOotStrapping your own Skills) learns to accomplish new tasks by performing "skill bootstrapping," where an agent with a set of primitive skills interacts with the environment to practice new skills without receiving reward feedback for tasks outside of the initial skill set. This bootstrapping phase is guided by large language models (LLMs) that inform the agent of meaningful skills to chain together. Through this process, BOSS builds a wide range of complex and useful behaviors from a basic set of primitive skills. We demonstrate through experiments in realistic household environments that agents trained with our LLM-guided bootstrapping procedure outperform those trained with naive bootstrapping as well as prior unsupervised skill acquisition methods on zero-shot execution of unseen, long-horizon tasks in new environments. Website at clvrai.com/boss.
    摘要 我们提出了BOSS方法,它可以自动学习解决新的长期、复杂和有意义的任务,只需最小的监督。现有的循环学习方法通常需要专家指导,通过示例或丰富的奖励函数来学习长期任务。相比之下,我们的BOSS方法通过“技能启动”来学习新任务,其中一个已有技能的机器人与环境互动,不接受任务外部的奖励反馈,而是通过大型自然语言模型(LLM)指导,学习新的技能链接。通过这个过程,BOSS可以从基本的原始技能中拓宽许多复杂和有用的行为。我们通过实验表明,使用我们的LLM指导的启动过程训练的机器人在未看过任务和环境的情况下,可以在逻辑家庭环境中比静态训练和先前的无监督技能获取方法表现出更好的成绩。更多信息可以通过官方网站clvrai.com/boss。
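
A hedged toy of LLM-guided skill bootstrapping: the agent asks a (stubbed) language model which primitive skill to chain next toward a sampled task, executes it, and adds successful chains to its skill library; `llm_propose_next_skill` and the execution stub are illustrative stand-ins for a real LLM and a low-level policy.

```python
import random

primitive_skills = ["open drawer", "pick up spoon", "place spoon in drawer", "close drawer"]
skill_library = list(primitive_skills)

def llm_propose_next_skill(task, done_so_far):
    """Stand-in for prompting an LLM with the task and the skills executed so far."""
    remaining = [s for s in primitive_skills if s not in done_so_far]
    return remaining[0] if remaining else None

def execute(skill):
    """Stand-in for the low-level policy; succeeds most of the time."""
    return random.random() < 0.9

def bootstrap(task, max_len=6):
    chain = []
    while len(chain) < max_len:
        nxt = llm_propose_next_skill(task, chain)
        if nxt is None:
            break
        if not execute(nxt):
            return None                        # failed attempt, nothing added
        chain.append(nxt)
    return " then ".join(chain)

random.seed(0)
for _ in range(5):
    new_skill = bootstrap("put the spoon away")
    if new_skill and new_skill not in skill_library:
        skill_library.append(new_skill)        # learned long-horizon behavior
print(skill_library[-1])
```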

Towards Unified and Effective Domain Generalization

  • paper_url: http://arxiv.org/abs/2310.10008
  • repo_url: https://github.com/invictus717/UniDG
  • paper_authors: Yiyuan Zhang, Kaixiong Gong, Xiaohan Ding, Kaipeng Zhang, Fangrui Lv, Kurt Keutzer, Xiangyu Yue
  • for: Improving the out-of-distribution generalization of foundation models across domains.
  • methods: An unsupervised, lightweight fine-tuning procedure applied during inference, with a penalty on parameter updates to avoid catastrophic forgetting.
  • results: Across 12 visual backbones, including CNN-, MLP-, and Transformer-based models, accuracy improves by +5.4% on average on DomainBed, demonstrating the versatility and superiority of UniDG.
    Abstract We propose $\textbf{UniDG}$, a novel and $\textbf{Uni}$fied framework for $\textbf{D}$omain $\textbf{G}$eneralization that is capable of significantly enhancing the out-of-distribution generalization performance of foundation models regardless of their architectures. The core idea of UniDG is to finetune models during the inference stage, which saves the cost of iterative training. Specifically, we encourage models to learn the distribution of test data in an unsupervised manner and impose a penalty regarding the updating step of model parameters. The penalty term can effectively reduce the catastrophic forgetting issue as we would like to maximally preserve the valuable knowledge in the original model. Empirically, across 12 visual backbones, including CNN-, MLP-, and Transformer-based models, ranging from 1.89M to 303M parameters, UniDG shows an average accuracy improvement of +5.4% on DomainBed. These performance results demonstrate the superiority and versatility of UniDG. The code is publicly available at https://github.com/invictus717/UniDG
    摘要 我们提出了UniDG,一个新的、统一的框架,可以对基础模型的外部泛化性能进行明显改善,不论其架构。UniDG的核心思想是在推断阶段进行调整,这样可以避免迭代训练的成本。具体来说,我们鼓励模型在无监督下学习试验数据的分布,并对模型参数更新的步骤加入一个罚则。这个罚则可以有效减少严重遗忘问题,因为我们希望将原始模型中的有价知识保留到最大程度。实验结果显示,在12种视觉基础模型中,包括CNN、MLP和Transformer等,参数量从1.89M到303M之间,UniDG在DomainBed上平均提高了5.4%的精度。这些表现结果证明UniDG的优越性和多样性。代码可以在https://github.com/invictus717/UniDG上获取。
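
A hedged sketch of inference-time adaptation in the spirit described: an unsupervised objective on the test batch (prediction entropy is assumed here) plus an L2 penalty that keeps parameters close to the frozen source weights to limit forgetting; the backbone and hyperparameters are illustrative.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))  # stand-in backbone
source_state = copy.deepcopy(model.state_dict())                        # frozen reference
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

def adapt_step(x_test, lam=1.0):
    logits = model(x_test)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()     # unsupervised objective
    penalty = sum((p - source_state[n]).pow(2).sum()                    # stay near source weights
                  for n, p in model.named_parameters())
    loss = entropy + lam * penalty
    opt.zero_grad(); loss.backward(); opt.step()
    return entropy.item()

x_test = torch.randn(64, 32)          # an unlabeled test batch from the new domain
for _ in range(5):
    print(round(adapt_step(x_test), 4))
```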

Forecaster: Towards Temporally Abstract Tree-Search Planning from Pixels

  • paper_url: http://arxiv.org/abs/2310.09997
  • repo_url: None
  • paper_authors: Thomas Jiralerspong, Flemming Kondrup, Doina Precup, Khimya Khetarpal
  • for: Aims to improve the sample efficiency of deep hierarchical reinforcement learning agents by letting them envision long-term consequences in high-dimensional state spaces such as pixels, enabling better learning and decision-making.
  • methods: Proposes Forecaster, a deep hierarchical reinforcement learning method that plans over high-level goals and learns a temporally abstract world model by modelling the environment's transition dynamics.
  • results: Experiments show that Forecaster achieves better sample efficiency in the AntMaze domain and generalizes to new tasks.
    Abstract The ability to plan at many different levels of abstraction enables agents to envision the long-term repercussions of their decisions and thus enables sample-efficient learning. This becomes particularly beneficial in complex environments from high-dimensional state space such as pixels, where the goal is distant and the reward sparse. We introduce Forecaster, a deep hierarchical reinforcement learning approach which plans over high-level goals leveraging a temporally abstract world model. Forecaster learns an abstract model of its environment by modelling the transitions dynamics at an abstract level and training a world model on such transition. It then uses this world model to choose optimal high-level goals through a tree-search planning procedure. It additionally trains a low-level policy that learns to reach those goals. Our method not only captures building world models with longer horizons, but also, planning with such models in downstream tasks. We empirically demonstrate Forecaster's potential in both single-task learning and generalization to new tasks in the AntMaze domain.
    摘要 agent的多级划分能力使其能够预测长期后果,从而实现样本效率学习。这特别有用在高维状态空间如像素的复杂环境中,目标远距离,奖励罕见。我们介绍了Forecaster,一种深层决策学习方法,通过高级目标规划来规划高级目标。Forecaster使用抽象世界模型来模型环境的过程动态,并在 such transition 上训练世界模型。它然后使用这个世界模型来选择优质高级目标,并通过树搜索规划算法来实现。此外,它还训练低级策略,以实现高级目标。我们的方法不仅能够建立更长期的世界模型,还能够在下游任务中使用这些模型进行规划。我们在AntMaze领域进行了实验,证明了Forecaster的潜力。

Network Analysis of the iNaturalist Citizen Science Community

  • paper_url: http://arxiv.org/abs/2310.10693
  • repo_url: None
  • paper_authors: Yu Lu Liu, Thomas Jiralerspong
  • for: Uses the iNaturalist citizen science platform as a case study to analyze the structure of, and interactions within, citizen science projects.
  • methods: Represents the iNaturalist data as a bipartite network and uses visualizations along with established network science techniques to gain new insight into project structure and user interactions.
  • results: Proposes a novel network science benchmark by using the iNaturalist data to create a network with an unusual structure, and demonstrates via a link prediction task that it can yield new insights into a variety of network science methods.
    Abstract In recent years, citizen science has become a larger and larger part of the scientific community. Its ability to crowd source data and expertise from thousands of citizen scientists makes it invaluable. Despite the field's growing popularity, the interactions and structure of citizen science projects are still poorly understood and under analyzed. We use the iNaturalist citizen science platform as a case study to analyze the structure of citizen science projects. We frame the data from iNaturalist as a bipartite network and use visualizations as well as established network science techniques to gain insights into the structure and interactions between users in citizen science projects. Finally, we propose a novel unique benchmark for network science research by using the iNaturalist data to create a network which has an unusual structure relative to other common benchmark networks. We demonstrate using a link prediction task that this network can be used to gain novel insights into a variety of network science methods.
    摘要
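
A minimal sketch of framing observation records as a bipartite observer-taxon network with networkx; the records below are made up for illustration.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Each record: (observer, taxon) — illustrative stand-ins for iNaturalist observations.
records = [("alice", "Monarch butterfly"), ("alice", "Red oak"),
           ("bob", "Monarch butterfly"), ("bob", "Red oak"),
           ("carol", "Red oak"), ("carol", "Eastern gray squirrel")]

G = nx.Graph()
users = {u for u, _ in records}
taxa = {t for _, t in records}
G.add_nodes_from(users, bipartite=0)
G.add_nodes_from(taxa, bipartite=1)
G.add_edges_from(records)

print(bipartite.is_bipartite(G))
print(sorted(G.degree(users), key=lambda x: -x[1]))          # most active observers
user_projection = bipartite.weighted_projected_graph(G, users)
print(list(user_projection.edges(data=True)))                # users linked by shared taxa
```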

cs.CL - 2023-10-16

IDEAL: Influence-Driven Selective Annotations Empower In-Context Learners in Large Language Models

  • paper_url: http://arxiv.org/abs/2310.10873
  • repo_url: https://github.com/skzhang1/IDEAL
  • paper_authors: Shaokun Zhang, Xiaobo Xia, Zhaoqing Wang, Ling-Hao Chen, Jiale Liu, Qingyun Wu, Tongliang Liu
  • for: This paper aims to address the challenge of high annotation costs in in-context learning by introducing an influence-driven selective annotation method.
  • methods: The proposed method constructs a directed graph to represent unlabeled data, quantifies the influence of candidate unlabeled subsets using a diffusion process, and selects the most influential subset with a simple yet effective greedy algorithm.
  • results: The proposed method achieves better performance with lower time consumption during subset selection compared to previous selective-annotation efforts, with experiments confirming its superiority on various benchmarks.
    Abstract In-context learning is a promising paradigm that utilizes in-context examples as prompts for the predictions of large language models. These prompts are crucial for achieving strong performance. However, since the prompts need to be sampled from a large volume of annotated examples, finding the right prompt may result in high annotation costs. To address this challenge, this paper introduces an influence-driven selective annotation method that aims to minimize annotation costs while improving the quality of in-context examples. The essence of our method is to select a pivotal subset from a large-scale unlabeled data pool to annotate for the subsequent sampling of prompts. Specifically, a directed graph is first constructed to represent unlabeled data. Afterward, the influence of candidate unlabeled subsets is quantified with a diffusion process. A simple yet effective greedy algorithm for unlabeled data selection is lastly introduced. It iteratively selects the data if it provides a maximum marginal gain with respect to quantified influence. Compared with previous efforts on selective annotations, our influence-driven method works in an end-to-end manner, avoids an intractable explicit balance between data diversity and representativeness, and enjoys theoretical support. Experiments confirm the superiority of the proposed method on various benchmarks, achieving better performance under lower time consumption during subset selection. The project page is available at https://skzhang1.github.io/IDEAL/.
    摘要 内容学习是一种有前途的概念,它利用内容例子作为大型语言模型的预测提示。这些提示是实现强制性的关键,但是因为需要从大量的标注例子中抽取提示,因此找到正确的提示可能会带来高的标注成本。为解决这个挑战,本研究将引入一种影响驱动的选择性标注方法,以降低标注成本而提高内容例子的质量。本方法的核心思想是从大规模的未标注数据池中选择一个关键子集,并将其标注以供后续的提示抽取。首先, constructed 一个导向的图来表示未标注数据。接着, candidate 的未标注子集之间的影响被评估通过一个传播过程。最后,一个简单 yet effective 的对不标注数据选择法是引入,它在每次选择时会选择具有最大 MARGINAL 增长的数据。与先前的选择性标注方法不同,我们的影响驱动方法在端到端方式下进行,避免了一个不可能的明确平衡 между 数据多样性和代表性,并且受到了理论支持。实验确认了我们提出的方法在不同的benchmark上的超越性,在选择subset时间consumption下得到了更好的性能。更多信息可以通过我们的项目页面(https://skzhang1.github.io/IDEAL/)了解。
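
A hedged sketch of influence-driven subset selection: build a directed kNN graph over example embeddings, approximate each candidate set's influence by a few hops of propagation (a simple stand-in for the paper's diffusion process), and greedily add the example with the largest marginal gain; all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 32))                      # unlabeled-pool embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

k = 10
sims = emb @ emb.T
np.fill_diagonal(sims, -np.inf)
neighbors = np.argsort(-sims, axis=1)[:, :k]          # directed kNN graph: i -> its neighbors

def influence_set(seeds, hops=2):
    """Nodes reachable from the seed set within a few hops (a crude diffusion proxy)."""
    active, frontier = set(seeds), set(seeds)
    for _ in range(hops):
        frontier = {j for i in frontier for j in neighbors[i]} - active
        active |= frontier
    return active

def greedy_select(budget):
    selected = []
    for _ in range(budget):
        base = len(influence_set(set(selected)))
        gains = [(len(influence_set(set(selected) | {i})) - base, i)
                 for i in range(len(emb)) if i not in selected]
        _, best = max(gains)                          # maximum marginal gain
        selected.append(best)
    return selected

print(greedy_select(budget=5))        # indices to annotate as in-context examples
```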

Will the Prince Get True Love’s Kiss? On the Model Sensitivity to Gender Perturbation over Fairytale Texts

  • paper_url: http://arxiv.org/abs/2310.10865
  • repo_url: None
  • paper_authors: Christina Chance, Da Yin, Dakuo Wang, Kai-Wei Chang
  • for: Studies the gender biases present in traditional fairytales and how the biases learned by language models affect their treatment of gender.
  • methods: Uses counterfactual data augmentation to assess model robustness to gender perturbations; specifically, evaluates question answering on the FairytaleQA dataset with swapped gender character information, and introduces counterfactual gender stereotypes at training time to mitigate learned biases.
  • results: Experiments show that models are sensitive to gender perturbations, with significant performance drops compared to the original test set; however, after first fine-tuning on a counterfactual training dataset, models become less sensitive to the later-introduced anti-gender-stereotyped text.
    Abstract Recent studies show that traditional fairytales are rife with harmful gender biases. To help mitigate these gender biases in fairytales, this work aims to assess learned biases of language models by evaluating their robustness against gender perturbations. Specifically, we focus on Question Answering (QA) tasks in fairytales. Using counterfactual data augmentation to the FairytaleQA dataset, we evaluate model robustness against swapped gender character information, and then mitigate learned biases by introducing counterfactual gender stereotypes during training time. We additionally introduce a novel approach that utilizes the massive vocabulary of language models to support text genres beyond fairytales. Our experimental results suggest that models are sensitive to gender perturbations, with significant performance drops compared to the original testing set. However, when first fine-tuned on a counterfactual training dataset, models are less sensitive to the later introduced anti-gender stereotyped text.

CoTFormer: More Tokens With Attention Make Up For Less Depth

  • paper_url: http://arxiv.org/abs/2310.10845
  • repo_url: None
  • paper_authors: Amirkeivan Mohtashami, Matteo Pagliardini, Martin Jaggi
  • for: This paper proposes a transformer variant built on an implicit Chain-of-Thought (CoT)-like mechanism, aiming to match the performance of a deeper model.
  • methods: The authors establish an approximate parallel between applying chain-of-thought and employing a deeper transformer, and build on this insight to introduce CoTFormer.
  • results: Experiments show that CoTFormer significantly outperforms larger standard transformers.
    Abstract The race to continually develop ever larger and deeper foundational models is underway. However, techniques like the Chain-of-Thought (CoT) method continue to play a pivotal role in achieving optimal downstream performance. In this work, we establish an approximate parallel between using chain-of-thought and employing a deeper transformer. Building on this insight, we introduce CoTFormer, a transformer variant that employs an implicit CoT-like mechanism to achieve capacity comparable to a deeper model. Our empirical findings demonstrate the effectiveness of CoTFormers, as they significantly outperform larger standard transformers.
    摘要 “Foundational models”的竞赛不断地进行开发,但“Chain-of-Thought”(CoT)方法仍然扮演着关键的角色,以获得最佳的下游性能。在这个研究中,我们发现使用Chain-of-thought和使用更深的transformer之间存在一种近似的关系。基于这个意识,我们介绍CoTFormer,一种使用隐式CoT-like机制的transformer变体,以获得与更深的模型相同的容量。我们的实验结果显示CoTFormer具有明显的超越性,与标准的transformer模型相比。
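The paper's core observation is a parallel between chain-of-thought-style extra computation and extra depth. The toy below (not the CoTFormer architecture itself) illustrates only that intuition: re-applying a single weight-shared transformer block several times buys more computation per token without adding parameters. Dimensions and the number of repeats are assumptions.

```python
# Weight-shared repeated computation as a stand-in for extra depth (toy sketch).
import torch
import torch.nn as nn

class SharedDepthEncoder(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4, repeats: int = 6):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads,
                                                dim_feedforward=4 * d_model,
                                                batch_first=True)
        self.repeats = repeats  # one set of weights, applied `repeats` times

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.repeats):   # extra "thinking" steps instead of extra layers
            x = self.block(x)
        return x

tokens = torch.randn(2, 16, 128)            # (batch, sequence, d_model)
print(SharedDepthEncoder()(tokens).shape)   # torch.Size([2, 16, 128])
```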

Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2310.10844
  • repo_url: None
  • paper_authors: Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, Nael Abu-Ghazaleh
  • for: This paper surveys research on adversarial attacks against large language models (LLMs) and what it implies for building more trustworthy AI systems.
  • methods: Existing work on LLM security is categorized by learning structure: textual-only attacks, multi-modal attacks, and attack methods specific to complex systems such as federated learning or multi-agent systems.
  • results: The survey provides an overview of LLM vulnerabilities, a synthesis of existing research on their fundamental sources, and a discussion of potential defenses.
    Abstract Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows. This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs, a subfield of trustworthy ML, combining the perspectives of Natural Language Processing and Security. Prior work has shown that even safety-aligned LLMs (via instruction tuning and reinforcement learning through human feedback) can be susceptible to adversarial attacks, which exploit weaknesses and mislead AI systems, as evidenced by the prevalence of `jailbreak' attacks on models like ChatGPT and Bard. In this survey, we first provide an overview of large language models, describe their safety alignment, and categorize existing research based on various learning structures: textual-only attacks, multi-modal attacks, and additional attack methods specifically targeting complex systems, such as federated learning or multi-agent systems. We also offer comprehensive remarks on works that focus on the fundamental sources of vulnerabilities and potential defenses. To make this field more accessible to newcomers, we present a systematic review of existing works, a structured typology of adversarial attack concepts, and additional resources, including slides for presentations on related topics at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL'24).
    摘要 大型自然语言模型(LLM)在architecture和能力方面快速进步,因此批判它们的安全性特性的必要性也在增加。这篇论文对抗AI系统中的攻击进行了评估,这是一种涉及自然语言处理和安全的新兴领域。先前的研究表明,即使通过 instrucion 调整和人工反馈来实现安全性的LLM也可能受到攻击,这些攻击利用模型的弱点并诱导AI系统出错,例如 chatGPT 和 Bard 上的 "监狱" 攻击。在这篇论文中,我们首先提供了大型语言模型的概述,描述了它们的安全性,然后根据不同的学习结构进行了分类:只有文本攻击、多模态攻击以及特定复杂系统的攻击方法,如联合学习或多代理系统。我们还提供了关于漏洞的基本来源和防御措施的评论。为了让这个领域更加Accessible,我们提供了一个系统性的回顾现有工作,一种结构化的攻击概念 typology,以及其他资源,包括与相关话题的PowerPoint演示在ACL'24年会上。

Fake News in Sheep’s Clothing: Robust Fake News Detection Against LLM-Empowered Style Attacks

  • paper_url: http://arxiv.org/abs/2310.10830
  • repo_url: None
  • paper_authors: Jiaying Wu, Bryan Hooi
  • for: This paper tackles fake news detection under attacks in which large language models (LLMs) are used to mimic the style of trustworthy outlets, aiming to keep automated detection accurate in online news ecosystems.
  • methods: The proposed detector, SheepDog, uses LLM-empowered, style-oriented reframing of news articles together with style-agnostic training to become robust to variations in news writing style.
  • results: Experiments on three benchmark datasets show significant improvements over competitive baselines and enhanced robustness against LLM-empowered style attacks.
    Abstract It is commonly perceived that online fake news and reliable news exhibit stark differences in writing styles, such as the use of sensationalist versus objective language. However, we emphasize that style-related features can also be exploited for style-based attacks. Notably, the rise of powerful Large Language Models (LLMs) has enabled malicious users to mimic the style of trustworthy news outlets at minimal cost. Our analysis reveals that LLM-camouflaged fake news content leads to substantial performance degradation of state-of-the-art text-based detectors (up to 38% decrease in F1 Score), posing a significant challenge for automated detection in online ecosystems. To address this, we introduce SheepDog, a style-agnostic fake news detector robust to news writing styles. SheepDog achieves this adaptability through LLM-empowered news reframing, which customizes each article to match different writing styles using style-oriented reframing prompts. By employing style-agnostic training, SheepDog enhances its resilience to stylistic variations by maximizing prediction consistency across these diverse reframings. Furthermore, SheepDog extracts content-focused veracity attributions from LLMs, where the news content is evaluated against a set of fact-checking rationales. These attributions provide supplementary information and potential interpretability that assist veracity prediction. On three benchmark datasets, empirical results show that SheepDog consistently yields significant improvements over competitive baselines and enhances robustness against LLM-empowered style attacks.
    摘要 通常认为在线假新闻和可靠新闻的写作风格有很大差异,如使用感人化语言 versus объектив语言。然而,我们强调的是风格相关特征也可以被利用于风格基本攻击。尤其是现在强大的大语言模型(LLMs)的出现,使得恶意用户可以轻松地模仿可靠新闻机构的风格,对于自动检测在线环境中具有重大挑战。为了解决这一问题,我们介绍了羊狗(SheepDog),一种不受风格限制的假新闻检测器,可以在不同的新闻风格下保持高度的稳定性。羊狗通过使用 LLMS 进行新闻重 framings,以适应不同的新闻风格,并通过风格无关的训练来增强其对风格变化的抗性。此外,羊狗使用 LLMS 提供的内容相关的真实性评估,对新闻内容进行了实际的真实性评估,以提供可靠的假新闻检测。在三个 benchmark 数据集上,实验结果表明,羊狗可以与竞争对手相比,提供显著的改善,并增强了对 LLMS 风格基本攻击的抗性。
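A minimal sketch of style-agnostic training in the spirit of SheepDog (not the authors' code): the same article is rendered in several LLM-generated styles, and the detector is optimized with the usual veracity loss plus a consistency term pulling the predictions for all reframings together. The classifier logits, number of reframings, and loss weighting are illustrative assumptions.

```python
# Consistency-regularized veracity loss across style reframings (illustrative sketch).
import torch
import torch.nn.functional as F

def style_agnostic_loss(logits_per_style: list[torch.Tensor],
                        label: torch.Tensor,
                        consistency_weight: float = 1.0) -> torch.Tensor:
    """logits_per_style: one (num_classes,) logit vector per reframed article."""
    stacked = torch.stack(logits_per_style)                 # (S, C)
    ce = F.cross_entropy(stacked, label.expand(len(logits_per_style)))
    log_probs = stacked.log_softmax(dim=-1)
    mean_prob = log_probs.exp().mean(dim=0, keepdim=True)   # average prediction
    consistency = F.kl_div(log_probs, mean_prob.expand_as(log_probs).log(),
                           log_target=True, reduction="batchmean")
    return ce + consistency_weight * consistency

# Toy usage: logits for the original article and two style reframings.
logits = [torch.randn(2, requires_grad=True) for _ in range(3)]
loss = style_agnostic_loss(logits, label=torch.tensor(1))
loss.backward()
```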

SD-HuBERT: Self-Distillation Induces Syllabic Organization in HuBERT

  • paper_url: http://arxiv.org/abs/2310.10803
  • repo_url: None
  • paper_authors: Cheol Jun Cho, Abdelrahman Mohamed, Shang-Wen Li, Alan W Black, Gopala K. Anumanchipalli
  • for: This paper studies self-supervised learning (SSL) of speech, specifically whether sentence-level representations of speech exhibit syllable-level organization.
  • methods: A pretrained HuBERT model is fine-tuned with a self-distillation objective and an aggregator token that summarizes the entire sentence; without any supervision, the model draws definite boundaries in speech and its frame-level representations show salient syllabic structure.
  • results: The emergent structure largely matches ground-truth syllables; the authors also propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level speech representations, and the model outperforms previous models on both unsupervised syllable discovery and sentence-level representation learning.
    Abstract Data-driven unit discovery in self-supervised learning (SSL) of speech has embarked on a new era of spoken language processing. Yet, the discovered units often remain in phonetic space, limiting the utility of SSL representations. Here, we demonstrate that a syllabic organization emerges in learning sentence-level representation of speech. In particular, we adopt "self-distillation" objective to fine-tune the pretrained HuBERT with an aggregator token that summarizes the entire sentence. Without any supervision, the resulting model draws definite boundaries in speech, and the representations across frames show salient syllabic structures. We demonstrate that this emergent structure largely corresponds to the ground truth syllables. Furthermore, we propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level representation of speech. When compared to previous models, our model outperforms in both unsupervised syllable discovery and learning sentence-level representation. Together, we demonstrate that the self-distillation of HuBERT gives rise to syllabic organization without relying on external labels or modalities, and potentially provides novel data-driven units for spoken language modeling.
    摘要 自监督学习（SSL）中的数据驱动单元发现为口语处理开启了新的时代。然而，所发现的单元往往停留在音位层面，限制了SSL表示的用途。本文表明，在学习语音的句子级表示时会涌现出音节层面的组织结构。具体而言，我们采用“自蒸馏”（self-distillation）目标来微调预训练的HuBERT，并引入一个概括整个句子的汇聚（aggregator）标记。在没有任何监督的情况下，所得模型能够在语音中划出明确的边界，且各帧的表示呈现出鲜明的音节结构。我们证明这种涌现结构与真实音节基本吻合。此外，我们提出了一个新的基准任务 Spoken Speech ABX，用于评估语音的句子级表示。与此前的模型相比，我们的模型在无监督音节发现和句子级表示学习两方面均表现更佳。总之，HuBERT 的自蒸馏无需依赖外部标签或其他模态即可产生音节层面的组织，并有望为口语建模提供新的数据驱动单元。
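A minimal sketch of the self-distillation setup described above (not the authors' code): a learnable aggregator token is prepended to the frame features, a momentum (EMA) teacher provides the target sentence summary, and the student is trained to match it. A generic transformer stands in for HuBERT, and the dimensions, views, EMA rate, and loss are illustrative assumptions.

```python
# Self-distillation with an aggregator token and an EMA teacher (illustrative sketch).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceEncoder(nn.Module):
    def __init__(self, d: int = 256):
        super().__init__()
        self.agg = nn.Parameter(torch.zeros(1, 1, d))          # aggregator token
        layer = nn.TransformerEncoderLayer(d, 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.agg.expand(frames.size(0), -1, -1), frames], dim=1)
        return self.encoder(x)[:, 0]                           # sentence summary

student = SentenceEncoder()
teacher = copy.deepcopy(student).requires_grad_(False)          # momentum teacher
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

frames = torch.randn(8, 200, 256)                               # stand-in speech features
view_a = frames                                                 # in practice: two
view_b = frames + 0.01 * torch.randn_like(frames)               # augmented views
loss = 1 - F.cosine_similarity(student(view_a), teacher(view_b), dim=-1).mean()
loss.backward()
opt.step()
with torch.no_grad():                                           # EMA teacher update
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(0.999).add_(p_s, alpha=0.001)
```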

Self-Supervised Models of Speech Infer Universal Articulatory Kinematics

  • paper_url: http://arxiv.org/abs/2310.10788
  • repo_url: https://github.com/Hermannovski/React
  • paper_authors: Cheol Jun Cho, Abdelrahman Mohamed, Alan W Black, Gopala K. Anumanchipalli
  • for: This paper investigates what self-supervised learning (SSL) based speech models represent internally and how those representations relate to speech production.
  • methods: Probing techniques are applied to SSL models such as HuBERT to test whether their representations support inference of articulatory kinematics from acoustics.
  • results: SSL models are found to infer the articulatory kinematics underlying the speech signal; this property largely overlaps across training languages (with a preference for languages with similar phonological systems), and with simple affine transformations, acoustic-to-articulatory inversion (AAI) transfers across speakers, genders, languages, and dialects, pointing towards interpretable, language-agnostic universal models for speech engineering grounded in speech science.
    Abstract Self-Supervised Learning (SSL) based models of speech have shown remarkable performance on a range of downstream tasks. These state-of-the-art models have remained blackboxes, but many recent studies have begun "probing" models like HuBERT, to correlate their internal representations to different aspects of speech. In this paper, we show "inference of articulatory kinematics" as fundamental property of SSL models, i.e., the ability of these models to transform acoustics into the causal articulatory dynamics underlying the speech signal. We also show that this abstraction is largely overlapping across the language of the data used to train the model, with preference to the language with similar phonological system. Furthermore, we show that with simple affine transformations, Acoustic-to-Articulatory inversion (AAI) is transferrable across speakers, even across genders, languages, and dialects, showing the generalizability of this property. Together, these results shed new light on the internals of SSL models that are critical to their superior performance, and open up new avenues into language-agnostic universal models for speech engineering, that are interpretable and grounded in speech science.
    摘要 自顾学学习(SSL)基于模型的语音表现非常出色,但这些顶尖模型一直保持了黑盒模型的状态,许多最近的研究开始使用 HuBERT 等模型进行探测,以 correlate 其内部表示与不同的语音特征。在这篇论文中,我们显示了 SSL 模型中的 "语音生成动态推理" 的基本性质,即将听音信号转化为生成语音的 causal 生成动态。此外,我们还发现这种抽象在训练数据语言上具有很大的 overlap,尤其是在语音系统相似性方面。此外,我们还发现通过简单的仿射变换,可以在不同的说话者、语言和方言之间传递 AAI,这表明这种性质具有普适性。总之,这些结果为 SSL 模型的内部结构提供了新的灯光,并开启了新的语言不受限制的通用模型,这些模型可以解释并基于语音科学。
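A minimal sketch of the cross-speaker transfer result described above (not the authors' code): an affine map taking articulatory trajectories from a source speaker's space into a target speaker's space is fit by ordinary least squares on paired data. The feature dimension and the synthetic pairing are illustrative assumptions.

```python
# Fitting an affine source-to-target articulatory mapping (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
src = rng.normal(size=(500, 12))                   # source-space articulatory features
true_W, true_b = rng.normal(size=(12, 12)), rng.normal(size=12)
tgt = src @ true_W + true_b                        # paired target-speaker features

# Solve [W; b] jointly by appending a bias column of ones.
X = np.hstack([src, np.ones((len(src), 1))])
coef, *_ = np.linalg.lstsq(X, tgt, rcond=None)
W_hat, b_hat = coef[:-1], coef[-1]

pred = src @ W_hat + b_hat
print(f"mean absolute error: {np.abs(pred - tgt).mean():.2e}")
```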

BanglaNLP at BLP-2023 Task 1: Benchmarking different Transformer Models for Violence Inciting Text Detection in Bengali

  • paper_url: http://arxiv.org/abs/2310.10781
  • repo_url: None
  • paper_authors: Saumajit Saha, Albert Nanda
  • for: This paper describes a system for detecting violence-inciting text in Bengali, developed for the BLP-2023 shared task.
  • methods: Both traditional and recent transformer-based approaches are benchmarked, and the impact of data augmentation under a limited dataset is studied.
  • results: Fine-tuning a multilingual-e5-base model performs best, obtaining a macro F1 score of 68.11% on the test set and ranking 23rd on the shared-task leaderboard.
    Abstract This paper presents the system that we have developed while solving this shared task on violence inciting text detection in Bangla. We explain both the traditional and the recent approaches that we have used to make our models learn. Our proposed system helps to classify if the given text contains any threat. We studied the impact of data augmentation when there is a limited dataset available. Our quantitative results show that finetuning a multilingual-e5-base model performed the best in our task compared to other transformer-based architectures. We obtained a macro F1 of 68.11\% in the test set and our performance in this shared task is ranked at 23 in the leaderboard.

Towards reducing hallucination in extracting information from financial reports using Large Language Models

  • paper_url: http://arxiv.org/abs/2310.10760
  • repo_url: None
  • paper_authors: Bhaskarjit Sarmah, Tianjie Zhu, Dhagash Mehta, Stefano Pasquali
  • for: To extract information from the question-and-answer segment of company earnings-call transcripts more efficiently and accurately, in support of analysis and investment decisions.
  • methods: Large language models (LLMs) are used to extract information from earnings-report transcripts rapidly and with high accuracy, combining retrieval-augmented generation with metadata to reduce hallucination.
  • results: Several LLMs are compared with and without the proposed approach using objective metrics for evaluating Q&A systems, empirically demonstrating the superiority of the method.
    Abstract For a financial analyst, the question and answer (Q\&A) segment of the company financial report is a crucial piece of information for various analysis and investment decisions. However, extracting valuable insights from the Q\&A section has posed considerable challenges as the conventional methods such as detailed reading and note-taking lack scalability and are susceptible to human errors, and Optical Character Recognition (OCR) and similar techniques encounter difficulties in accurately processing unstructured transcript text, often missing subtle linguistic nuances that drive investor decisions. Here, we demonstrate the utilization of Large Language Models (LLMs) to efficiently and rapidly extract information from earnings report transcripts while ensuring high accuracy transforming the extraction process as well as reducing hallucination by combining retrieval-augmented generation technique as well as metadata. We evaluate the outcomes of various LLMs with and without using our proposed approach based on various objective metrics for evaluating Q\&A systems, and empirically demonstrate superiority of our method.
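A minimal sketch of a retrieval-augmented extraction pipeline along the lines described above (not the authors' system): transcript passages are filtered by metadata (here, speaker role), the most relevant passages are retrieved with a simple TF-IDF retriever, and only those passages are placed in the prompt so the answer stays grounded in quoted text. The chunks, metadata fields, and the `ask_llm` stub are illustrative assumptions.

```python
# Metadata-filtered retrieval plus a grounded prompt (illustrative sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    {"text": "CFO: Operating margin improved to 18.2% this quarter.", "speaker": "CFO"},
    {"text": "Analyst: Can you comment on guidance for next year?", "speaker": "Analyst"},
    {"text": "CEO: We expect revenue growth of 8 to 10 percent in FY24.", "speaker": "CEO"},
]

def retrieve(question: str, top_k: int = 2, roles=("CEO", "CFO")) -> list[str]:
    docs = [c for c in chunks if c["speaker"] in roles]        # metadata filter
    texts = [c["text"] for c in docs]
    vec = TfidfVectorizer().fit(texts + [question])
    sims = cosine_similarity(vec.transform([question]), vec.transform(texts))[0]
    return [texts[i] for i in sims.argsort()[::-1][:top_k]]

def ask_llm(prompt: str) -> str:                               # hypothetical LLM call
    raise NotImplementedError("plug in your model API here")

question = "What revenue growth does management expect?"
context = "\n".join(retrieve(question))
prompt = (f"Answer using ONLY the transcript excerpts below; "
          f"if the answer is not present, say so.\n{context}\nQ: {question}\nA:")
print(prompt)       # answer = ask_llm(prompt)
```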

Building Persona Consistent Dialogue Agents with Offline Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2310.10735
  • repo_url: https://github.com/ryanshea10/personachat_offline_rl
  • paper_authors: Ryan Shea, Zhou Yu
  • for: To improve the persona consistency and dialogue quality of open-domain dialogue systems.
  • methods: An offline reinforcement learning framework combines the advantages of supervised learning and online RL, training inexpensively on existing data while rewarding and penalizing specific utterances; a Variance-Reducing MLE-Initialized (VaRMI) importance sampling method is introduced to reduce the variance of importance weights.
  • results: Automatic and human evaluations show that the framework improves both the persona consistency and the dialogue quality of a state-of-the-art social chatbot.
    Abstract Maintaining a consistent persona is a key quality for any open domain dialogue system. Current state-of-the-art systems do this by training agents with supervised learning or online reinforcement learning (RL). However, systems trained with supervised learning often lack consistency as they are never punished for uttering contradictions. Additional training with RL can alleviate some of these issues, however the training process is expensive. Instead, we propose an offline RL framework to improve the persona consistency of dialogue systems. Our framework allows us to combine the advantages of previous methods as we can inexpensively train our model on existing data as in supervised learning, while punishing and rewarding specific utterances as in RL. We also introduce a simple importance sampling method to reduce the variance of importance weights in offline RL training which we call Variance-Reducing MLE-Initialized (VaRMI) importance sampling. Our automatic and human evaluations show that our framework improves both the persona consistency and dialogue quality of a state-of-the-art social chatbot.
    摘要 保持一致的人格是对任何开放领域对话系统的关键质量。现状之 artifical intelligence 系统通常通过经过监督学习或在线强化学习(RL)训练来实现这一目标。然而,通过监督学习训练的系统经常缺乏一致性,因为它们从来没有受到违反的惩罚。额外的 RL 训练可以减轻一些这些问题,但训练过程是昂贵的。因此,我们提出了一个Offline RL框架,以提高对话系统的人格一致性。我们的框架允许我们将supervised learning中的优点与RL中的优点结合起来,并且可以廉价地在现有数据上训练我们的模型。我们还提出了一种简单的重要性抽样方法,以减少偏移重要性抽样的方差,我们称之为“Variance-Reducing MLE-Initialized”(VaRMI)重要性抽样。我们的自动和人类评估表明,我们的框架可以提高一个现有社交聊天机器人的人格一致性和对话质量。

“Mistakes Help Us Grow”: Facilitating and Evaluating Growth Mindset Supportive Language in Classrooms

  • paper_url: http://arxiv.org/abs/2310.10637
  • repo_url: None
  • paper_authors: Kunal Handa, Margaret Clapper, Jessica Boyle, Rose E Wang, Diyi Yang, David S Yeager, Dorottya Demszky
  • for: This paper explores whether large language models (LLMs) can provide automated, personalized coaching that helps teachers adopt growth mindset supportive language (GMSL).
  • methods: The authors (1) build a parallel dataset of GMSL-trained teacher reframings of unsupportive statements with an accompanying annotation guide, (2) develop a GMSL prompt framework for revising teachers' unsupportive language, and (3) design an evaluation framework grounded in psychological theory, applied with the help of students and teachers.
  • results: In a large-scale evaluation with 174 teachers and 1,006 students, both teachers and students perceive GMSL-trained teacher and model reframings as more effective at fostering a growth mindset and promoting challenge-seeking behavior, and model-generated reframings outperform those from the GMSL-trained teachers, demonstrating the promise of LLMs for automated GMSL feedback and, more broadly, for supporting students' learning in the classroom.
    Abstract Teachers' growth mindset supportive language (GMSL)--rhetoric emphasizing that one's skills can be improved over time--has been shown to significantly reduce disparities in academic achievement and enhance students' learning outcomes. Although teachers espouse growth mindset principles, most find it difficult to adopt GMSL in their practice due the lack of effective coaching in this area. We explore whether large language models (LLMs) can provide automated, personalized coaching to support teachers' use of GMSL. We establish an effective coaching tool to reframe unsupportive utterances to GMSL by developing (i) a parallel dataset containing GMSL-trained teacher reframings of unsupportive statements with an accompanying annotation guide, (ii) a GMSL prompt framework to revise teachers' unsupportive language, and (iii) an evaluation framework grounded in psychological theory for evaluating GMSL with the help of students and teachers. We conduct a large-scale evaluation involving 174 teachers and 1,006 students, finding that both teachers and students perceive GMSL-trained teacher and model reframings as more effective in fostering a growth mindset and promoting challenge-seeking behavior, among other benefits. We also find that model-generated reframings outperform those from the GMSL-trained teachers. These results show promise for harnessing LLMs to provide automated GMSL feedback for teachers and, more broadly, LLMs' potentiality for supporting students' learning in the classroom. Our findings also demonstrate the benefit of large-scale human evaluations when applying LLMs in educational domains.

Data Contamination Through the Lens of Time

  • paper_url: http://arxiv.org/abs/2310.10628
  • repo_url: https://github.com/abacusai/to-the-cutoff
  • paper_authors: Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White, Samuel Dooley
  • for: This paper aims to investigate the issue of data contamination in large language models (LLMs) by analyzing trends in LLM pass rates and their relationship with GitHub popularity and release date.
  • methods: The authors use the natural experiment of training cutoffs in GPT models to examine benchmarks released over time, focusing on two code/mathematical problem-solving datasets, Codeforces and Project Euler, and perform a longitudinal analysis to identify statistically significant trends in LLM pass rates.
  • results: They find strong evidence of data contamination, reflected in statistically significant trends in LLM pass rates vs. GitHub popularity and release date, and open-source their dataset, raw results, and evaluation framework to enable rigorous analyses of contamination in modern models.
    Abstract Recent claims about the impressive abilities of large language models (LLMs) are often supported by evaluating publicly available benchmarks. Since LLMs train on wide swaths of the internet, this practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data. Data contamination remains notoriously challenging to measure and mitigate, even with partial attempts like controlled experimentation of training data, canary strings, or embedding similarities. In this work, we conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models to look at benchmarks released over time. Specifically, we consider two code/mathematical problem-solving datasets, Codeforces and Project Euler, and find statistically significant trends among LLM pass rate vs. GitHub popularity and release date that provide strong evidence of contamination. By open-sourcing our dataset, raw results, and evaluation framework, our work paves the way for rigorous analyses of data contamination in modern models. We conclude with a discussion of best practices and future steps for publicly releasing benchmarks in the age of LLMs that train on webscale data.
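A minimal sketch of the kind of longitudinal check the paper performs (not the authors' analysis code): per-problem pass rates are regressed on release date, separately for problems released before and after the model's training cutoff; a markedly higher level or slope before the cutoff is the contamination signal. The synthetic data and cutoff date are assumptions.

```python
# Pass-rate trend before vs. after the training cutoff (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
release_year = rng.uniform(2015, 2023, size=400)
cutoff = 2021.75                                   # e.g., a GPT training cutoff
# Synthetic pass rates: higher before the cutoff, flat after it.
pass_rate = np.clip(0.6 - 0.4 * (release_year > cutoff)
                    + rng.normal(0, 0.08, size=400), 0, 1)

def trend(mask: np.ndarray) -> tuple[float, float]:
    """Least-squares slope and mean pass rate for the selected problems."""
    slope, _ = np.polyfit(release_year[mask], pass_rate[mask], deg=1)
    return slope, pass_rate[mask].mean()

pre, post = trend(release_year <= cutoff), trend(release_year > cutoff)
print(f"pre-cutoff : slope={pre[0]:+.3f}, mean pass rate={pre[1]:.2f}")
print(f"post-cutoff: slope={post[0]:+.3f}, mean pass rate={post[1]:.2f}")
```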

ForceGen: End-to-end de novo protein generation based on nonlinear mechanical unfolding responses using a protein language diffusion model

  • paper_url: http://arxiv.org/abs/2310.10605
  • repo_url: None
  • paper_authors: Bo Ni, David L. Kaplan, Markus J. Buehler
  • for: To develop a generative protein-design model that can meet complex, nonlinear mechanical property targets.
  • methods: The model builds on a pretrained protein language model, leveraging deep knowledge of protein sequences, and maps mechanical unfolding responses to novel protein designs.
  • results: Full-atom molecular dynamics simulations confirm that the designed proteins are novel and fulfill the targeted mechanical properties, including unfolding energy and mechanical strength as well as the detailed unfolding force-separation curves.
    Abstract Through evolution, nature has presented a set of remarkable protein materials, including elastins, silks, keratins and collagens with superior mechanical performances that play crucial roles in mechanobiology. However, going beyond natural designs to discover proteins that meet specified mechanical properties remains challenging. Here we report a generative model that predicts protein designs to meet complex nonlinear mechanical property-design objectives. Our model leverages deep knowledge on protein sequences from a pre-trained protein language model and maps mechanical unfolding responses to create novel proteins. Via full-atom molecular simulations for direct validation, we demonstrate that the designed proteins are novel, and fulfill the targeted mechanical properties, including unfolding energy and mechanical strength, as well as the detailed unfolding force-separation curves. Our model offers rapid pathways to explore the enormous mechanobiological protein sequence space unconstrained by biological synthesis, using mechanical features as target to enable the discovery of protein materials with superior mechanical properties.

Motion2Language, Unsupervised learning of synchronized semantic motion segmentation

  • paper_url: http://arxiv.org/abs/2310.10594
  • repo_url: https://github.com/rd20karim/M2T-Segmentation
  • paper_authors: Karim Radouane, Andon Tchechmedjiev, Sylvie Ranwez, Julien Lagarde
  • for: To build a sequence-to-sequence architecture that translates motion-capture input into English natural-language descriptions generated synchronously with the performed actions, yielding semantic segmentation as a byproduct.
  • methods: The paper proposes a new recurrent formulation of local attention suited to synchronous/live text generation, along with an improved motion-encoder architecture better suited to smaller datasets and synchronous generation.
  • results: Evaluated with BLEU4 and a simple semantic-equivalence measure on the KIT motion-language dataset, the proposed attention mechanism and encoder architecture each additively improve the quality of the generated text as well as its synchronization.
    Abstract In this paper, we investigate building a sequence to sequence architecture for motion to language translation and synchronization. The aim is to translate motion capture inputs into English natural-language descriptions, such that the descriptions are generated synchronously with the actions performed, enabling semantic segmentation as a byproduct, but without requiring synchronized training data. We propose a new recurrent formulation of local attention that is suited for synchronous/live text generation, as well as an improved motion encoder architecture better suited to smaller data and for synchronous generation. We evaluate both contributions in individual experiments, using the standard BLEU4 metric, as well as a simple semantic equivalence measure, on the KIT motion language dataset. In a follow-up experiment, we assess the quality of the synchronization of generated text in our proposed approaches through multiple evaluation metrics. We find that both contributions to the attention mechanism and the encoder architecture additively improve the quality of generated text (BLEU and semantic equivalence), but also of synchronization. Our code will be made available at \url{https://github.com/rd20karim/M2T-Segmentation/tree/main}
    摘要 本文 investigate 建立一种序列到序列架构,用于动作到语言翻译和同步。目标是将动作捕获输入翻译成英语自然语言描述,以便在动作发生时同步生成描述,而无需同步训练数据。我们提出了一种新的循环形式的本地注意力表示,适合同步生成文本,以及一种改进的动作编码建立,更适合小型数据和同步生成。我们在使用标准的BLEU4指标和简单的 semantics 等价度量进行评估,并在 KIT 动作语言数据集上进行单独的实验。在后续实验中,我们评估了我们的提议中的同步生成文本质量,通过多种评价指标。我们发现, both 注意力机制和编码建立增加了生成文本质量(BLEU和semantics),同时也提高了同步生成的质量。我们的代码将在 \url{https://github.com/rd20karim/M2T-Segmentation/tree/main} 上提供。

Mastering the Task of Open Information Extraction with Large Language Models and Consistent Reasoning Environment

  • paper_url: http://arxiv.org/abs/2310.10590
  • repo_url: None
  • paper_authors: Ji Qi, Kaixuan Ji, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Lei Hou, Juanzi Li, Bin Xu
  • for: To tackle open information extraction (OIE), i.e., extracting objective structured knowledge from natural-language text, with large language models instead of dedicated supervised models.
  • methods: The approach constructs a consistent reasoning environment for LLMs: it estimates the discrepancy in syntactic distribution between the LLM and the test samples as correlation evidence for preparing positive demonstrations, then uses a simple mechanism to establish the reasoning environment for the task.
  • results: A 6-shot approach on the standard CaRB benchmark reaches an F1 of 55.3, outperforming state-of-the-art supervised methods, and generalizes naturally to TACRED and ACE05 with gains of 5.7 and 6.8 F1, respectively.
    Abstract Open Information Extraction (OIE) aims to extract objective structured knowledge from natural texts, which has attracted growing attention to build dedicated models with human experience. As the large language models (LLMs) have exhibited remarkable in-context learning capabilities, a question arises as to whether the task of OIE can be effectively tackled with this paradigm? In this paper, we explore solving the OIE problem by constructing an appropriate reasoning environment for LLMs. Specifically, we first propose a method to effectively estimate the discrepancy of syntactic distribution between a LLM and test samples, which can serve as correlation evidence for preparing positive demonstrations. Upon the evidence, we introduce a simple yet effective mechanism to establish the reasoning environment for LLMs on specific tasks. Without bells and whistles, experimental results on the standard CaRB benchmark demonstrate that our $6$-shot approach outperforms state-of-the-art supervised method, achieving an $55.3$ $F_1$ score. Further experiments on TACRED and ACE05 show that our method can naturally generalize to other information extraction tasks, resulting in improvements of $5.7$ and $6.8$ $F_1$ scores, respectively.

BiLL-VTG: Bridging Large Language Models and Lightweight Visual Tools for Video-based Texts Generation

  • paper_url: http://arxiv.org/abs/2310.10586
  • repo_url: None
  • paper_authors: Ji Qi, Kaixuan Ji, Jifan Yu, Duokang Wang, Bin Xu, Lei Hou, Juanzi Li
  • for: This paper proposes a fast, adaptive framework for generating texts, such as answers to user instructions, grounded in videos.
  • methods: Large language models (LLMs) perform the reasoning, while two lightweight visual tools, structured scene-graph generation and descriptive image captioning, gather and represent information about the video events relevant to a given instruction; an Instruction-oriented Video Events Recognition (InsOVER) algorithm based on efficient Hungarian matching localizes the corresponding video events, enabling LLMs to interact with long videos.
  • results: Without any training, the framework achieves state-of-the-art performance on two typical video-based text generation tasks, outperforming pretrained models including Flamingo-80B.
    Abstract Building models that generate textual responses to user instructions for videos is a practical and challenging topic, as it requires both vision understanding and knowledge reasoning. Compared to language and image modalities, training efficiency remains a serious problem as existing studies train models on massive sparse videos aligned with brief descriptions. In this paper, we introduce BiLL-VTG, a fast adaptive framework that leverages large language models (LLMs) to reasoning on videos based on essential lightweight visual tools. Specifically, we reveal the key to response specific instructions is the concentration on relevant video events, and utilize two visual tools of structured scene graph generation and descriptive image caption generation to gather and represent the events information. Thus, a LLM equipped with world knowledge is adopted as the reasoning agent to achieve the response by performing multiple reasoning steps on specified video events.To address the difficulty of specifying events from agent, we further propose an Instruction-oriented Video Events Recognition (InsOVER) algorithm based on the efficient Hungarian matching to localize corresponding video events using linguistic instructions, enabling LLMs to interact with long videos. Extensive experiments on two typical video-based texts generations tasks show that our tuning-free framework outperforms the pre-trained models including Flamingo-80B, to achieve the state-of-the-art performance.
    摘要 构建能够针对用户指令生成基于视频的文本回应的模型，是一个实用而具有挑战性的课题，因为它同时需要视觉理解与知识推理。与语言和图像模态相比，训练效率仍是突出问题：现有研究多在海量稀疏视频与简短描述的对齐数据上训练模型。本文提出 BiLL-VTG，一个快速自适应框架，利用大型语言模型（LLM）并借助轻量级视觉工具对视频进行推理。我们发现，回应特定指令的关键在于聚焦相关的视频事件，并使用结构化场景图生成与描述性图像字幕生成这两种视觉工具来收集和表示事件信息；随后，由具备世界知识的 LLM 作为推理代理，通过对指定视频事件的多步推理得到回应。为解决事件指定的困难，我们进一步提出基于高效匈牙利匹配的指令导向视频事件识别算法（InsOVER），利用语言指令定位对应的视频事件，使 LLM 能够与长视频交互。在两个典型的视频文本生成任务上的大量实验表明，我们这一无需训练的框架优于包括 Flamingo-80B 在内的预训练模型，达到最新的最优性能。
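A minimal sketch of the Hungarian-matching step inside InsOVER as described above (not the authors' implementation): given similarity scores between phrases from a linguistic instruction and candidate video events, the one-to-one assignment with maximum total similarity is found with scipy. The similarity matrix is an illustrative assumption; in practice it would come from text and visual encoders.

```python
# Aligning instruction phrases to video events via Hungarian matching (illustrative sketch).
import numpy as np
from scipy.optimize import linear_sum_assignment

instruction_phrases = ["a man opens the fridge", "pours some milk", "drinks it"]
video_events = ["event_07", "event_12", "event_31", "event_40"]

similarity = np.array([[0.82, 0.10, 0.05, 0.20],
                       [0.15, 0.75, 0.30, 0.10],
                       [0.05, 0.25, 0.12, 0.66]])

# linear_sum_assignment minimizes cost, so negate the similarities.
rows, cols = linear_sum_assignment(-similarity)
for r, c in zip(rows, cols):
    print(f"{instruction_phrases[r]!r:35s} -> {video_events[c]} "
          f"(sim={similarity[r, c]:.2f})")
```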

Who Are All The Stochastic Parrots Imitating? They Should Tell Us!

  • paper_url: http://arxiv.org/abs/2310.10583
  • repo_url: None
  • paper_authors: Sagi Shaier, Lawrence E. Hunter, Katharina von der Wense
  • for: This opinion piece concerns the trustworthiness of language models (LMs), which generate factually untrue statements, a problem that is especially severe for low-resource languages.
  • methods: The authors propose building LMs that can cite their sources, i.e., point users to the parts of the training data that back up their outputs, and outline which NLP tasks would benefit and which sub-problems must be solved to get there.
  • results: They argue that LMs in their current state will never be fully trustworthy in critical settings and that source-citing LMs would allow quick verification of generated statements, opening a discussion about the field's approach to building LMs, especially for low-resource languages, and the role of training data in explaining model generations.
    Abstract Both standalone language models (LMs) as well as LMs within downstream-task systems have been shown to generate statements which are factually untrue. This problem is especially severe for low-resource languages, where training data is scarce and of worse quality than for high-resource languages. In this opinion piece, we argue that LMs in their current state will never be fully trustworthy in critical settings and suggest a possible novel strategy to handle this issue: by building LMs such that can cite their sources - i.e., point a user to the parts of their training data that back up their outputs. We first discuss which current NLP tasks would or would not benefit from such models. We then highlight the expected benefits such models would bring, e.g., quick verifiability of statements. We end by outlining the individual tasks that would need to be solved on the way to developing LMs with the ability to cite. We hope to start a discussion about the field's current approach to building LMs, especially for low-resource languages, and the role of the training data in explaining model generations.
    摘要 各种自然语言处理(NLP)任务中的语言模型(LM)都有可能生成不准确的陈述,特别是 для低资源语言,训练数据稀缺,质量也较差。在这篇意见文章中,我们 argue that LMs 在当前状态下从不能在重要场景中得到完全信任,并提出一种可能的新策略来解决这个问题:建立LMs 可以指明其所基于的训练数据部分,即用户可以通过点击LMs 的输出来找到相应的训练数据。我们首先讨论了当前NLP任务中哪些任务可以或不可以受益于这种模型,然后描述了这种模型带来的预期优势,例如快速验证陈述的可靠性。最后,我们列出了需要解决的任务,以开发LMs 可以指明其所基于的训练数据部分。我们希望通过这篇文章引发关于当前LMs 建设的讨论,特别是低资源语言的LMs,以及训练数据的角色在解释模型生成中。

Emerging Challenges in Personalized Medicine: Assessing Demographic Effects on Biomedical Question Answering Systems

  • paper_url: http://arxiv.org/abs/2310.10571
  • repo_url: None
  • paper_authors: Sagi Shaier, Kevin Bennett, Lawrence Hunter, Katharina von der Wense
  • for: To test whether biomedical question-answering models change their answers when given irrelevant patient demographic information, a concern for fairness in healthcare.
  • methods: Questions on biomedical topics whose answers do not depend on ethnicity, sex, or sexual orientation are posed to both knowledge graph (KG)-grounded and text-based QA systems, with and without added demographic information.
  • results: Irrelevant demographic information changes up to 15% of the answers of a KG-grounded system and up to 23% of the answers of a text-based system, including changes that affect accuracy, showing that unjustified answer changes driven by patient demographics are frequent and raise fairness concerns.
    Abstract State-of-the-art question answering (QA) models exhibit a variety of social biases (e.g., with respect to sex or race), generally explained by similar issues in their training data. However, what has been overlooked so far is that in the critical domain of biomedicine, any unjustified change in model output due to patient demographics is problematic: it results in the unfair treatment of patients. Selecting only questions on biomedical topics whose answers do not depend on ethnicity, sex, or sexual orientation, we ask the following research questions: (RQ1) Do the answers of QA models change when being provided with irrelevant demographic information? (RQ2) Does the answer of RQ1 differ between knowledge graph (KG)-grounded and text-based QA systems? We find that irrelevant demographic information change up to 15% of the answers of a KG-grounded system and up to 23% of the answers of a text-based system, including changes that affect accuracy. We conclude that unjustified answer changes caused by patient demographics are a frequent phenomenon, which raises fairness concerns and should be paid more attention to.
    摘要 现代问答(QA)模型表现出多种社会偏见(例如与性别或种族相关),通常可以归因于训练数据中的类似问题。然而,到目前为止忽略了在重要领域生物医学中,任何不当的模型输出变化因为病人特征是问题:它会导致患者不公正地处理。我们选择仅考虑不依赖性别、性别或性 orientation 的生物医学问题,并提出以下研究问题:(RQ1)QA 模型在接受无关的民族信息时是否发生变化?(RQ2)对知识图(KG)基础的 QA 系统和文本基础的 QA 系统而言,RQ1 的答案是否不同?我们发现,无关民族信息可以改变 KG 基础系统的答案,占总答案的 15%,而文本基础系统的答案则占 23%,包括影响准确性的变化。我们 conclude 这种不当的答案变化是常见的,这引发公平问题,需要更多的注意。

On Position Bias in Summarization with Large Language Models

  • paper_url: http://arxiv.org/abs/2310.10570
  • repo_url: None
  • paper_authors: Mathieu Ravaut, Shafiq Joty, Aixin Sun, Nancy F. Chen
  • for: To analyze how large language models use their input context in abstractive summarization, motivated by the uneven context utilization these models show in multi-document question answering.
  • methods: A comprehensive study spanning 10 datasets, 4 LLMs, and 5 evaluation metrics examines how the models leverage their input for abstractive summarization.
  • results: The models show a pronounced bias towards introductory content (and, to a lesser extent, final content), producing a U-shaped performance pattern that poses challenges across a range of diverse summarization benchmarks.
    Abstract Large language models (LLMs) excel in zero-shot abstractive summarization tasks, delivering fluent and pertinent summaries. Recent advancements have extended their capabilities to handle long-input contexts, surpassing token limits of 32k or more. However, in the realm of multi-document question answering, language models exhibit uneven utilization of their input context. They tend to favor the initial and final segments, resulting in a U-shaped performance pattern concerning where the answer is located within the input. This bias raises concerns, particularly in summarization tasks where crucial content may be dispersed throughout the source document(s). This paper presents a comprehensive investigation encompassing 10 datasets, 4 LLMs, and 5 evaluation metrics to analyze how these models leverage their input for abstractive summarization. Our findings reveal a pronounced bias towards the introductory content (and to a lesser extent, the final content), posing challenges for LLM performance across a range of diverse summarization benchmarks.
    摘要 大型语言模型(LLM)在零shot摘要任务中表现出色,提供流畅和有关的摘要。最近的进步使其能处理长输入上下文,超过32k个Token的限制。然而,在多文档问答任务中,语言模型表现出输入上下文不均匀的问题。它们倾向于初始和 final段,导致摘要性能形成U型曲线,其中答案位于输入中的任何位置。这种偏见存在问题,特别是在摘要任务中,重要的内容可能会分散在源文档中。本文通过10个数据集、4个LLM和5个评价指标进行全面的调查,分析这些模型如何使用其输入进行摘要。我们发现,LLM偏向于引言内容(以及一定 extent的 final content),这会影响LLM在多种多样的摘要benchmark上的表现。

RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling

  • paper_url: http://arxiv.org/abs/2310.10567
  • repo_url: None
  • paper_authors: Jingcheng Deng, Liang Pang, Huawei Shen, Xueqi Cheng
  • for: To improve the generation quality of language models (LMs) and reduce hallucination.
  • methods: RegaVAE is a retrieval-augmented language model built on a variational auto-encoder (VAE): the corpus is encoded into a latent space that captures current and future information from both source and target text, and the Gaussian prior is expanded into a Gaussian mixture distribution to give the retrieval-generation paradigm a probabilistic form, with a theoretically optimizable upper bound.
  • results: Experiments on several datasets show significant improvements in text generation quality and hallucination removal.
    Abstract Retrieval-augmented language models show promise in addressing issues like outdated information and hallucinations in language models (LMs). However, current research faces two main problems: 1) determining what information to retrieve, and 2) effectively combining retrieved information during generation. We argue that valuable retrieved information should not only be related to the current source text but also consider the future target text, given the nature of LMs that model future tokens. Moreover, we propose that aggregation using latent variables derived from a compact latent space is more efficient than utilizing explicit raw text, which is limited by context length and susceptible to noise. Therefore, we introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE). It encodes the text corpus into a latent space, capturing current and future information from both source and target text. Additionally, we leverage the VAE to initialize the latent space and adopt the probabilistic form of the retrieval generation paradigm by expanding the Gaussian prior distribution into a Gaussian mixture distribution. Theoretical analysis provides an optimizable upper bound for RegaVAE. Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.

ViPE: Visualise Pretty-much Everything

  • paper_url: http://arxiv.org/abs/2310.10543
  • repo_url: https://github.com/Hazel1994/ViPE-Videos
  • paper_authors: Hassan Shahmohammadi, Adhiraj Ghosh, Hendrik P. A. Lensch
  • for: This paper aims to address the issue of text-to-image models struggling to depict non-literal expressions, by introducing a new method called ViPE.
  • methods: ViPE uses a series of lightweight and robust language models trained on a large-scale set of lyrics with noisy visual descriptions generated by GPT3.5.
  • results: ViPE effectively expresses any arbitrary piece of text into a visualisable description, and exhibits an understanding of figurative expressions comparable to human experts. It also provides a powerful and open-source backbone for downstream applications such as music video and caption generation.
    Abstract Figurative and non-literal expressions are profoundly integrated in human communication. Visualising such expressions allow us to convey our creative thoughts, and evoke nuanced emotions. Recent text-to-image models like Stable Diffusion, on the other hand, struggle to depict non-literal expressions. Recent works primarily deal with this issue by compiling humanly annotated datasets on a small scale, which not only demands specialised expertise but also proves highly inefficient. To address this issue, we introduce ViPE: Visualise Pretty-much Everything. ViPE offers a series of lightweight and robust language models that have been trained on a large-scale set of lyrics with noisy visual descriptions that represent their implicit meaning. The synthetic visual descriptions are generated by GPT3.5 relying on neither human annotations nor images. ViPE effectively expresses any arbitrary piece of text into a visualisable description, enabling meaningful and high-quality image generation. We provide compelling evidence that ViPE is more robust than GPT3.5 in synthesising visual elaborations. ViPE also exhibits an understanding of figurative expressions comparable to human experts, providing a powerful and open-source backbone to many downstream applications such as music video and caption generation.
    摘要 人类communication中的 figurative 和非Literal 表达是极其深入地融合在一起。Visualizing这些表达可以帮助我们表达创造性的思想,并触发细腻的情感。然而,现有的文本-图像模型,如Stable Diffusion,在描绘非Literal表达方面几乎无法表现出来。现有的工作主要采取了 compile humanly annotated datasets的方法,这不仅需要专业知识,还证明高效率。为解决这个问题,我们引入了 ViPE:Visualize Pretty-much Everything。ViPE 提供了一系列轻量级和可靠的语言模型,这些模型在大规模的歌词中生成了噪音的视觉描述。这些synthetic visual descriptions 由 GPT3.5 生成,不需要人类注释也不需要图像。ViPE 可以将任何文本转换成可视化的描述,从而实现了高质量的图像生成。我们提供了吸引人的证明,表明 ViPE 比 GPT3.5 更加稳定在生成视觉 elaborations 方面。ViPE 还表现出了对 figurative expressions 的理解,与人类专家相当,提供了一个强大且开源的基础结构,可以推动多个下游应用,如音乐视频和caption生成。

One For All & All For One: Bypassing Hyperparameter Tuning with Model Averaging For Cross-Lingual Transfer

  • paper_url: http://arxiv.org/abs/2310.10532
  • repo_url: https://github.com/fdschmidt93/ofa-xlt
  • paper_authors: Fabian David Schmidt, Ivan Vulić, Goran Glavaš
  • for: This paper studies zero-shot cross-lingual transfer (ZS-XLT) with multilingual models and how to choose models and hyperparameters when target-language validation data is unavailable.
  • methods: Models are fine-tuned on source-language task data and evaluated on target languages; instead of extensive hyperparameter tuning, the authors propose an unsupervised protocol that accumulatively averages snapshots from different runs (trained with different hyperparameters) into a single model.
  • results: Conventional model selection based on source-language validation quickly plateaus at suboptimal ZS-XLT performance, whereas accumulative run-by-run averaging boosts ZS-XLT performance on semantic tasks (NLI, extractive QA) and token-level NER, and correlates closely with "oracle" ZS-XLT based on target-language validation.
    Abstract Multilingual language models enable zero-shot cross-lingual transfer (ZS-XLT): fine-tuned on sizable source-language task data, they perform the task in target languages without labeled instances. The effectiveness of ZS-XLT hinges on the linguistic proximity between languages and the amount of pretraining data for a language. Because of this, model selection based on source-language validation is unreliable: it picks model snapshots with suboptimal target-language performance. As a remedy, some work optimizes ZS-XLT by extensively tuning hyperparameters: the follow-up work then routinely struggles to replicate the original results. Other work searches over narrower hyperparameter grids, reporting substantially lower performance. In this work, we therefore propose an unsupervised evaluation protocol for ZS-XLT that decouples performance maximization from hyperparameter tuning. As a robust and more transparent alternative to extensive hyperparameter tuning, we propose to accumulatively average snapshots from different runs into a single model. We run broad ZS-XLT experiments on both higher-level semantic tasks (NLI, extractive QA) and a lower-level token classification task (NER) and find that conventional model selection based on source-language validation quickly plateaus to suboptimal ZS-XLT performance. On the other hand, our accumulative run-by-run averaging of models trained with different hyperparameters boosts ZS-XLT performance and closely correlates with "oracle" ZS-XLT, i.e., model selection based on target-language validation performance.
    摘要 多语言语模型可以实现零码跨语言传递(ZS-XLT):经过精心适应源语言任务数据,它们可以在目标语言中完成任务无需标注实例。ZS-XLT的有效性取决于语言之间的语言相似性和语言预训练数据的量。因此,基于源语言验证的模型选择是不可靠的:它可能会选择模型快照中的产生性能不佳的模型。为了解决这个问题,一些研究者们在ZS-XLT中进行了广泛的超参数优化:然而,继续的研究往往难以复制原来的结果。其他研究者们在 narrower 的超参数格上进行了搜索,并报告了较低的性能。在这个研究中,我们因此提出了一种无监督的评估协议,以减少精度优化和超参数优化之间的关系。我们提议通过在不同的run中训练不同的超参数,并将这些run中的模型快照相加,以获得一个更加 robust 和 transparent 的ZS-XLT模型。我们在高级semantic任务(NLI、抽取式问答)和 lower-level 字符串分类任务(NER)上进行了广泛的ZS-XLT实验,并发现了以下结论:在源语言验证中选择模型的方法很快就到达了低效的ZS-XLT性能,而我们的积累run-by-run相加的模型快照则可以提高ZS-XLT性能,并与“oracle” ZS-XLT(基于目标语言验证性能进行选择)高度相关。
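A minimal sketch of accumulative run-by-run snapshot averaging (not the authors' code): checkpoints fine-tuned with different hyperparameters are averaged parameter-wise, and the single averaged model is then evaluated for zero-shot cross-lingual transfer. Toy linear models stand in for multilingual transformer checkpoints.

```python
# Parameter-wise averaging of fine-tuned checkpoints (illustrative sketch).
import torch
import torch.nn as nn

def average_checkpoints(state_dicts: list[dict]) -> dict:
    """Element-wise mean of checkpoints that share the same architecture."""
    avg = {k: v.clone().float() for k, v in state_dicts[0].items()}
    for sd in state_dicts[1:]:
        for k in avg:
            avg[k] += sd[k].float()
    return {k: v / len(state_dicts) for k, v in avg.items()}

# Toy stand-ins for snapshots from runs with different hyperparameters.
runs = [nn.Linear(8, 3).state_dict() for _ in range(4)]
merged = average_checkpoints(runs)

model = nn.Linear(8, 3)
model.load_state_dict(merged)   # evaluate this single averaged model for ZS-XLT
print(next(iter(merged)))       # 'weight'
```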

Metric Ensembles For Hallucination Detection

  • paper_url: http://arxiv.org/abs/2310.10495
  • repo_url: https://github.com/parthk279/Hallucination-Research
  • paper_authors: Grant C. Forbes, Parth Katlana, Zeydy Ortiz
  • for: This paper studies reducing "hallucinated" information (content not present in the source document) in abstractive summarization and how to evaluate summary consistency.
  • methods: A suite of unsupervised metrics for summary consistency is analyzed by measuring their correlations with each other and with human evaluation scores on the wiki_bio_gpt3_hallucination dataset, and these metrics are compared with simple linear ensembles built from them.
  • results: LLM-based methods outperform the other unsupervised metrics for hallucination detection; ensembling improves the scores further, provided the metrics in the ensemble have sufficiently similar and uncorrelated error rates, and a proposed ensemble of LLM-based evaluations improves over the previous state of the art.
    Abstract Abstractive text summarization has garnered increased interest as of late, in part due to the proliferation of large language models (LLMs). One of the most pressing problems related to generation of abstractive summaries is the need to reduce "hallucinations," information that was not included in the document being summarized, and which may be wholly incorrect. Due to this need, a wide array of metrics estimating consistency with the text being summarized have been proposed. We examine in particular a suite of unsupervised metrics for summary consistency, and measure their correlations with each other and with human evaluation scores in the wiki_bio_gpt3_hallucination dataset. We then compare these evaluations to models made from a simple linear ensemble of these metrics. We find that LLM-based methods outperform other unsupervised metrics for hallucination detection. We also find that ensemble methods can improve these scores even further, provided that the metrics in the ensemble have sufficiently similar and uncorrelated error rates. Finally, we present an ensemble method for LLM-based evaluations that we show improves over this previous SOTA.
    摘要 抽象式文本摘要近来受到越来越多的关注，部分原因在于大型语言模型（LLM）的普及。摘要生成中最紧迫的问题之一是减少“幻觉”，即未出现在被摘要文档中、甚至可能完全错误的信息。为此，人们提出了大量用于估计摘要与原文一致性的度量。我们特别考察了一组无监督的摘要一致性度量，并在 wiki_bio_gpt3_hallucination 数据集上测量它们彼此之间以及与人工评价分数之间的相关性，随后将这些评估与由这些度量简单线性集成得到的模型进行比较。我们发现，基于 LLM 的方法在幻觉检测上优于其他无监督度量；在集成中各度量的错误率足够相似且互不相关的前提下，集成方法还能进一步提升这些分数。最后，我们提出一种针对基于 LLM 的评估的集成方法，其表现超越了此前的最优结果。
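A minimal sketch of a linear metric ensemble of the kind examined above (not the authors' code): each summary is scored by several consistency metrics, and a simple linear model is fit to predict the human faithfulness judgement from those scores. The synthetic scores and labels are illustrative assumptions.

```python
# Linear ensemble of consistency metrics versus the best single metric (illustrative sketch).
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
human = rng.uniform(0, 1, n)                    # human consistency judgements
metrics = np.column_stack([                     # noisy unsupervised metrics
    human + rng.normal(0, 0.25, n),             # e.g., an NLI-based score
    human + rng.normal(0, 0.35, n),             # e.g., a QA-based score
    human + rng.normal(0, 0.30, n),             # e.g., an LLM-based score
])

ensemble = LinearRegression().fit(metrics, human)
combined = ensemble.predict(metrics)

for name, scores in [("best single metric", metrics[:, 0]), ("ensemble", combined)]:
    rho, _ = spearmanr(scores, human)
    print(f"{name:18s} Spearman rho = {rho:.3f}")
```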

UNO-DST: Leveraging Unlabelled Data in Zero-Shot Dialogue State Tracking

  • paper_url: http://arxiv.org/abs/2310.10492
  • repo_url: https://github.com/lichuangnus/uno-dst
  • paper_authors: Chuang Li, Yan Zhang, Min-Yen Kan, Haizhou Li
  • for: This paper turns zero-shot dialogue state tracking (DST) into few-shot DST by exploiting unlabelled data in the target domain for automatic labelling.
  • methods: Joint and self-training use an auxiliary task that generates slot types as inverse prompts for the main task of generating slot values; cycle consistency between the two tasks enables generating and selecting high-quality samples in unknown target domains for subsequent fine-tuning, and facilitates automatic label creation.
  • results: On MultiWOZ, the method improves average joint goal accuracy by 8% across all domains for large language models in zero-shot scenarios.
    Abstract Previous zero-shot dialogue state tracking (DST) methods only apply transfer learning, but ignore unlabelled data in the target domain. We transform zero-shot DST into few-shot DST by utilising such unlabelled data via joint and self-training methods. Our method incorporates auxiliary tasks that generate slot types as inverse prompts for main tasks, creating slot values during joint training. Cycle consistency between these two tasks enables the generation and selection of quality samples in unknown target domains for subsequent fine-tuning. This approach also facilitates automatic label creation, thereby optimizing the training and fine-tuning of DST models. We demonstrate this method's effectiveness on large language models in zero-shot scenarios, improving average joint goal accuracy by $8\%$ across all domains in MultiWOZ.

xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection

  • paper_url: http://arxiv.org/abs/2310.10482
  • repo_url: None
  • paper_authors: Nuno M. Guerreiro, Ricardo Rei, Daan van Stigt, Luisa Coheur, Pierre Colombo, André F. T. Martins
  • for: To bridge sentence-level machine translation evaluation and error span detection, yielding more fine-grained and transparent translation assessment.
  • methods: xCOMET is an open-source learned metric that integrates sentence-level evaluation with error span detection, highlighting and categorizing error spans.
  • results: xCOMET achieves state-of-the-art performance across all evaluation types (sentence-level, system-level, and error span detection), and a robustness analysis with stress tests shows it is largely capable of identifying localized critical errors and hallucinations.
    Abstract Widely used learned metrics for machine translation evaluation, such as COMET and BLEURT, estimate the quality of a translation hypothesis by providing a single sentence-level score. As such, they offer little insight into translation errors (e.g., what are the errors and what is their severity). On the other hand, generative large language models (LLMs) are amplifying the adoption of more granular strategies to evaluation, attempting to detail and categorize translation errors. In this work, we introduce xCOMET, an open-source learned metric designed to bridge the gap between these approaches. xCOMET integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation (sentence-level, system-level, and error span detection). Moreover, it does so while highlighting and categorizing error spans, thus enriching the quality assessment. We also provide a robustness analysis with stress tests, and show that xCOMET is largely capable of identifying localized critical errors and hallucinations.
    摘要 广泛使用的学习型机器翻译评估指标（如 COMET 和 BLEURT）通过给出单一的句子级分数来评估翻译假设的质量，因而难以提供关于翻译错误的细节信息（例如错误是什么、严重程度如何）。另一方面，生成式大型语言模型（LLM）正推动更细粒度的评估策略，试图对翻译错误进行刻画和分类。本文提出 xCOMET，一个开源的学习型评估指标，用于弥合这两类方法之间的差距。xCOMET 同时具备句子级评估与错误片段检测能力，在各类评估（句子级、系统级以及错误片段检测）中均达到最新的最优性能；同时，它还能高亮并分类错误片段，从而丰富质量评估。我们还通过压力测试进行了鲁棒性分析，结果显示 xCOMET 基本能够识别局部的严重错误与幻觉。

G-SPEED: General SParse Efficient Editing MoDel

  • paper_url: http://arxiv.org/abs/2310.10480
  • repo_url: https://github.com/banner-z/g-speed
  • paper_authors: Haoke Zhang, Yue Wang, Juntao Li, Xiabing Zhou, Min Zhang
  • for: To increase working efficiency by automatically understanding human-issued editing instructions and generating the expected content, covering diverse editing needs with a single low-cost model.
  • methods: The paper proposes a novel unsupervised text-editing data clustering algorithm to address data scarcity, together with a sparse editing model architecture that mitigates the inherently limited learning capacity of small language models.
  • results: G-SPEED, with 508M parameters, surpasses LLMs equipped with 175B parameters while satisfying diverse editing requirements.
    Abstract Large Language Models~(LLMs) have demonstrated incredible capabilities in understanding, generating, and manipulating languages. Through human-model interactions, LLMs can automatically understand human-issued instructions and output the expected contents, which can significantly increase working efficiency. In various types of real-world demands, editing-oriented tasks account for a considerable proportion, which involves an interactive process that entails the continuous refinement of existing texts to meet specific criteria. Due to the need for multi-round human-model interaction and the generation of complicated editing tasks, there is an emergent need for efficient general editing models. In this paper, we propose \underline{\textbf{G}eneral \underline{\textbf{SP}arse \underline{\textbf{E}fficient \underline{\textbf{E}diting Mo\underline{\textbf{D}el~(\textbf{G-SPEED}), which can fulfill diverse editing requirements through a single model while maintaining low computational costs. Specifically, we first propose a novel unsupervised text editing data clustering algorithm to deal with the data scarcity problem. Subsequently, we introduce a sparse editing model architecture to mitigate the inherently limited learning capabilities of small language models. The experimental outcomes indicate that G-SPEED, with its 508M parameters, can surpass LLMs equipped with 175B parameters. Our code and model checkpoints are available at \url{https://github.com/Banner-Z/G-SPEED}.
    摘要 大型语言模型~(LLMs) 已经表现出了惊人的能力,包括理解、生成和修改语言。通过人机交互,LLMs 可以自动理解人类发布的指令,并输出预期的内容,这可能会提高工作效率。在各种实际应用中,修改任务占了一定的比重,这些任务涉及到人机交互的互动过程,需要不断细化现有的文本,以满足特定的标准。由于需要多轮人机交互和复杂的修改任务,有一种急需高效的通用修改模型。在这篇论文中,我们提出了 \underline{\textbf{G}eneral \underline{\textbf{SP}arse \underline{\textbf{E}fficient \underline{\textbf{E}diting Mo\underline{\textbf{D}el~(\textbf{G-SPEED}),它可以满足多样化的修改需求,而且保持低的计算成本。 Specifically,我们首先提出了一种新的无监督文本修改数据归类算法,以解决数据稀缺问题。然后,我们引入了稀疏修改模型架构,以降低小语言模型的内置学习能力限制。实验结果表明,G-SPEED,具有508M参数,可以超越配备175B参数的LLMs。我们的代码和模型检查点可以在 \url{https://github.com/Banner-Z/G-SPEED} 上获取。

MechGPT, a language-based strategy for mechanics and materials modeling that connects knowledge across scales, disciplines and modalities

  • paper_url: http://arxiv.org/abs/2310.10445
  • repo_url: None
  • paper_authors: Markus J. Buehler
  • for: To explore how artificial intelligence can connect knowledge across scales, disciplines, and modalities for studying multiscale materials failure.
  • methods: A general-purpose LLM distills question-answer pairs from raw sources, which are then used to fine-tune the MechGPT foundation model; Ontological Knowledge Graphs extract structural insights, and computational experiments probe knowledge retrieval, language tasks, hypothesis generation, and cross-domain connections across model sizes (13B to 70B parameters) and context lengths above 10,000 tokens.
  • results: The model can recall some knowledge from training, and LLMs prove especially useful for extracting structural insights via interpretable knowledge graphs, which offer explanatory insights, frameworks for new research questions, and visual representations of knowledge usable in retrieval-augmented generation, agent-based modeling, and multimodal settings.
    Abstract For centuries, researchers have sought out ways to connect disparate areas of knowledge. While early scholars (Galileo, da Vinci, etc.) were experts across fields, specialization has taken hold later. With the advent of Artificial Intelligence, we can now explore relationships across areas (e.g., mechanics-biology) or disparate domains (e.g., failure mechanics-art). To achieve this, we use a fine-tuned Large Language Model (LLM), here for a subset of knowledge in multiscale materials failure. The approach includes the use of a general-purpose LLM to distill question-answer pairs from raw sources followed by LLM fine-tuning. The resulting MechGPT LLM foundation model is used in a series of computational experiments to explore its capacity for knowledge retrieval, various language tasks, hypothesis generation, and connecting knowledge across disparate areas. While the model has some ability to recall knowledge from training, we find that LLMs are particularly useful to extract structural insights through Ontological Knowledge Graphs. These interpretable graph structures provide explanatory insights, frameworks for new research questions, and visual representations of knowledge that also can be used in retrieval-augmented generation. Three versions of MechGPT are discussed, featuring different sizes from 13 billion to 70 billion parameters, and reaching context lengths of more than 10,000 tokens. This provides ample capacity for sophisticated retrieval augmented strategies, as well as agent-based modeling where multiple LLMs interact collaboratively and/or adversarially, the incorporation of new data from the literature or web searches, as well as multimodality.

Exploiting User Comments for Early Detection of Fake News Prior to Users’ Commenting

  • paper_url: http://arxiv.org/abs/2310.10429
  • repo_url: None
  • paper_authors: Qiong Nan, Qiang Sheng, Juan Cao, Yongchun Zhu, Danding Wang, Guang Yang, Jintao Li, Kai Shu
  • for: Examines the accuracy-timeliness dilemma of existing fake news detection methods and pursues a feasible but under-studied solution: training with the social context (e.g., comments) of historical news and applying the model to newly emerging news that has no social context yet.
  • methods: Proposes the Comments Assisted Fake News Detection method (CAS-FEND), in which comments on historical news help a content-only detection model: during training, useful knowledge is transferred from a comments-aware teacher model to a content-only student model, which is then used to detect newly emerging fake news (a minimal distillation sketch follows this entry).
  • results: Experiments show that the CAS-FEND student model outperforms all content-only methods and even methods that take 1/4 of the comments as input, demonstrating its superiority for early detection.
    Abstract Both accuracy and timeliness are key factors in detecting fake news on social media. However, most existing methods encounter an accuracy-timeliness dilemma: Content-only methods guarantee timeliness but perform moderately because of limited available information, while social context-based ones generally perform better but inevitably lead to latency because of social context accumulation needs. To break such a dilemma, a feasible but not well-studied solution is to leverage social contexts (e.g., comments) from historical news for training a detection model and apply it to newly emerging news without social contexts. This requires the model to (1) sufficiently learn helpful knowledge from social contexts, and (2) be well compatible with situations that social contexts are available or not. To achieve this goal, we propose to absorb and parameterize useful knowledge from comments in historical news and then inject it into a content-only detection model. Specifically, we design the Comments Assisted Fake News Detection method (CAS-FEND), which transfers useful knowledge from a comments-aware teacher model to a content-only student model during training. The student model is further used to detect newly emerging fake news. Experiments show that the CAS-FEND student model outperforms all content-only methods and even those with 1/4 comments as inputs, demonstrating its superiority for early detection.
    摘要 严谨性和时效性都是社交媒体上检测假新闻的关键因素。然而,现有方法很多时会陷入精度-时效性之间的困境:内容仅仅方法可以保证时效性,但是它们的检测能力相对较弱,而基于社交上下文的方法通常可以提供更高的检测精度,但是它们需要较长的时间来积累社交上下文。为了突破这种困境,我们可以利用社交上下文(例如评论)来训练检测模型,并将其应用于新出现的新闻。这需要模型可以(1)充分学习社交上下文中的有用知识,并(2)在社交上下文存在或缺失时都能够具有Compatibility。为了实现这个目标,我们提出了注入社交上下文知识(e.g., 评论)到内容仅仅模型中的方法。我们称之为注入社交知识的Comments Assisted Fake News Detection方法(CAS-FEND)。在训练过程中,我们将社交上下文知识由一个师模型转移到内容仅仅模型中,然后使用这个学生模型来检测新出现的假新闻。实验结果表明,CAS-FEND学生模型在检测新出现的假新闻方面表现出色,even outperforming those with 1/4 comments as inputs,这说明它在早期检测方面具有优势。
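A minimal sketch (not the paper's exact architecture) of the teacher-to-student transfer idea behind CAS-FEND: a comments-aware teacher produces soft labels that supervise a content-only student alongside the hard labels. `student` and `teacher` are assumed to be callables returning logits over {real, fake}; all names are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, content, comments, labels, alpha=0.5, tau=2.0):
    with torch.no_grad():
        teacher_logits = teacher(content, comments)      # teacher sees content + comments
    student_logits = student(content)                    # student sees content only
    ce = F.cross_entropy(student_logits, labels)         # hard-label supervision
    kd = F.kl_div(                                       # soft-label knowledge transfer
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    return alpha * ce + (1 - alpha) * kd
```

At inference time only the student is kept, so newly emerging news without comments can still be scored.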

$\textit{Swap and Predict}$ – Predicting the Semantic Changes in Words across Corpora by Context Swapping

  • paper_url: http://arxiv.org/abs/2310.10397
  • repo_url: https://github.com/a1da4/svp-swap
  • paper_authors: Taichi Aida, Danushka Bollegala
  • For: The paper is written for detecting semantic changes of words in different text corpora.
  • Methods: The proposed method, Swapping-based Semantic Change Detection (SSCD), uses random context swapping to compare the meaning of a target word in two different text corpora (a minimal sketch follows this entry).
  • Results: The method accurately predicts semantic changes of words in four languages (English, German, Swedish, and Latin) and across different time spans (over 50 years and about five years), and achieves significant performance improvements compared to strong baselines for the English semantic change prediction task.
    Abstract Meanings of words change over time and across domains. Detecting the semantic changes of words is an important task for various NLP applications that must make time-sensitive predictions. We consider the problem of predicting whether a given target word, $w$, changes its meaning between two different text corpora, $\mathcal{C}_1$ and $\mathcal{C}_2$. For this purpose, we propose $\textit{Swapping-based Semantic Change Detection}$ (SSCD), an unsupervised method that randomly swaps contexts between $\mathcal{C}_1$ and $\mathcal{C}_2$ where $w$ occurs. We then look at the distribution of contextualised word embeddings of $w$, obtained from a pretrained masked language model (MLM), representing the meaning of $w$ in its occurrence contexts in $\mathcal{C}_1$ and $\mathcal{C}_2$. Intuitively, if the meaning of $w$ does not change between $\mathcal{C}_1$ and $\mathcal{C}_2$, we would expect the distributions of contextualised word embeddings of $w$ to remain the same before and after this random swapping process. Despite its simplicity, we demonstrate that even by using pretrained MLMs without any fine-tuning, our proposed context swapping method accurately predicts the semantic changes of words in four languages (English, German, Swedish, and Latin) and across different time spans (over 50 years and about five years). Moreover, our method achieves significant performance improvements compared to strong baselines for the English semantic change prediction task. Source code is available at https://github.com/a1da4/svp-swap .
    摘要 文字的意思随时间和领域而变化。探测文字的 semantic change 是 NLP 应用中的一项重要任务,需要做到时效预测。我们考虑了 predicting whether a given target word, $w$, changes its meaning between two different text corpora, $\mathcal{C}_1$ and $\mathcal{C}_2$ 的问题。为此,我们提出了 $\textit{Swapping-based Semantic Change Detection}$ (SSCD),一种无监督的方法, randomly swaps contexts between $\mathcal{C}_1$ and $\mathcal{C}_2$ where $w$ occurs。然后,我们 examine the distribution of contextualised word embeddings of $w$, obtained from a pretrained masked language model (MLM), representing the meaning of $w$ in its occurrence contexts in $\mathcal{C}_1$ and $\mathcal{C}_2$。如果 $w$ 的意思在 $\mathcal{C}_1$ 和 $\mathcal{C}_2$ 中不变,我们就会 expects the distributions of contextualised word embeddings of $w$ to remain the same before and after this random swapping process。尽管其简单,我们示示了使用预训练 MLM 无需 fine-tuning 的我们提posed context swapping method 可以准确地预测英语、德语、瑞典语和拉丁语中文字的 semantic change ,并且在不同的时间间隔(超过 50 年和约 5 年)中具有显著的性能提升。此外,我们的方法在英语 semantic change prediction 任务中也具有显著的性能提升。代码可以在 https://github.com/a1da4/svp-swap 找到。
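A minimal sketch of the context-swapping idea behind SSCD: randomly exchange occurrence contexts of the target word between the two corpora, embed the word in each context with a frozen masked LM, and compare the embedding distributions before and after swapping. `embed_target` is a placeholder (stubbed here with deterministic random vectors); in practice it would return the MLM hidden state of the target token.

```python
import random
import numpy as np

def embed_target(sentence: str, target: str) -> np.ndarray:
    # Placeholder: return the contextualised embedding of `target` in `sentence`.
    rng = np.random.default_rng(abs(hash((sentence, target))) % (2 ** 32))
    return rng.normal(size=16)

def swap_contexts(c1, c2, rate=0.5, seed=0):
    random.seed(seed)
    c1, c2 = list(c1), list(c2)
    for i in range(min(len(c1), len(c2))):
        if random.random() < rate:
            c1[i], c2[i] = c2[i], c1[i]
    return c1, c2

def mean_shift(c1, c2, target):
    e1 = np.mean([embed_target(s, target) for s in c1], axis=0)
    e2 = np.mean([embed_target(s, target) for s in c2], axis=0)
    return float(np.linalg.norm(e1 - e2))

corpus1 = ["the cell divided under the microscope", "a cell culture was grown"]
corpus2 = ["he was locked in a prison cell", "the cell had a small window"]
before = mean_shift(corpus1, corpus2, "cell")
after = mean_shift(*swap_contexts(corpus1, corpus2), "cell")
# If the word's meaning is stable across corpora, the two values stay close;
# a large drop after swapping suggests a semantic change.
print(before, after)
```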

Towards a Better Understanding of Variations in Zero-Shot Neural Machine Translation Performance

  • paper_url: http://arxiv.org/abs/2310.10385
  • repo_url: https://github.com/Smu-Tan/ZS-NMT-Variations
  • paper_authors: Shaomu Tan, Christof Monz
  • for: Investigates why zero-shot (ZS) translation quality varies so widely in multilingual neural machine translation (MNMT).
  • methods: Runs systematic experiments covering 1,560 translation directions across 40 languages; the analysis identifies three key factors behind the high variation in zero-shot NMT performance: 1) target-side translation capability, 2) vocabulary overlap, and 3) linguistic properties.
  • results: Finds that target-side translation quality is the most influential factor and that vocabulary overlap consistently impacts ZS performance; linguistic properties such as language family and writing system also play a role, particularly for smaller models. The authors further argue that the off-target issue is a symptom of inadequate ZS performance, so zero-shot translation challenges extend beyond the off-target problem.
    Abstract Multilingual Neural Machine Translation (MNMT) facilitates knowledge sharing but often suffers from poor zero-shot (ZS) translation qualities. While prior work has explored the causes of overall low ZS performance, our work introduces a fresh perspective: the presence of high variations in ZS performance. This suggests that MNMT does not uniformly exhibit poor ZS capability; instead, certain translation directions yield reasonable results. Through systematic experimentation involving 1,560 language directions spanning 40 languages, we identify three key factors contributing to high variations in ZS NMT performance: 1) target side translation capability 2) vocabulary overlap 3) linguistic properties. Our findings highlight that the target side translation quality is the most influential factor, with vocabulary overlap consistently impacting ZS performance. Additionally, linguistic properties, such as language family and writing system, play a role, particularly with smaller models. Furthermore, we suggest that the off-target issue is a symptom of inadequate ZS performance, emphasizing that zero-shot translation challenges extend beyond addressing the off-target problem. We release the data and models serving as a benchmark to study zero-shot for future research at https://github.com/Smu-Tan/ZS-NMT-Variations
    摘要 多语言神经机器翻译(MNMT)促进知识共享,但经常受到零上下文(ZS)翻译质量的劣化影响。尽管先前的工作已经探讨过总体低ZS性能的原因,我们的工作引入了一个新的视角:ZS翻译方向中的高变化性。这表示MNMT不uniformmente具有差的ZS能力;相反,某些翻译方向实际上可以得到不错的结果。通过对40种语言、1560个语言方向进行系统性的实验,我们确定了三个关键因素对ZS NMT性能的高变化:1)目标语言翻译能力2)词汇重叠3)语言特性。我们的发现表明目标语言翻译质量是最重要的因素,词汇重叠一直影响ZS性能。此外,语言家庭和书写系统等语言特性也在一定程度上影响ZS性能,特别是使用较小的模型时。此外,我们认为偏离问题是ZS翻译挑战的一部分,强调零上下文翻译挑战不仅是解决偏离问题而已。我们在github上发布了数据和模型,用于未来研究零上下文翻译,请参考https://github.com/Smu-Tan/ZS-NMT-Variations。

Privacy in Large Language Models: Attacks, Defenses and Future Directions

  • paper_url: http://arxiv.org/abs/2310.10383
  • repo_url: None
  • paper_authors: Haoran Li, Yulin Chen, Jinglong Luo, Yan Kang, Xiaojin Zhang, Qi Hu, Chunkit Chan, Yangqiu Song
  • for: This paper aims to provide a comprehensive analysis of privacy attacks targeting large language models (LLMs) and to identify potential vulnerabilities in these models.
  • methods: The paper uses a categorization of privacy attacks based on the adversary’s assumed capabilities to shed light on the potential vulnerabilities present in LLMs. It also presents a detailed overview of prominent defense strategies that have been developed to counter these privacy attacks.
  • results: The paper identifies upcoming privacy concerns as LLMs evolve and points out several potential avenues for future exploration.
    Abstract The advancement of large language models (LLMs) has significantly enhanced the ability to effectively tackle various downstream NLP tasks and unify these tasks into generative pipelines. On the one hand, powerful language models, trained on massive textual data, have brought unparalleled accessibility and usability for both models and users. On the other hand, unrestricted access to these models can also introduce potential malicious and unintentional privacy risks. Despite ongoing efforts to address the safety and privacy concerns associated with LLMs, the problem remains unresolved. In this paper, we provide a comprehensive analysis of the current privacy attacks targeting LLMs and categorize them according to the adversary's assumed capabilities to shed light on the potential vulnerabilities present in LLMs. Then, we present a detailed overview of prominent defense strategies that have been developed to counter these privacy attacks. Beyond existing works, we identify upcoming privacy concerns as LLMs evolve. Lastly, we point out several potential avenues for future exploration.
    摘要 LLMs 的进步significantly 提高了解决不同下游 NLP 任务的能力,并将这些任务集成成生成管道。一方面,强大的语言模型,通过庞大的文本数据进行训练,带来了无 precedent的可用性和使用性,对于模型和用户来说。然而,不受限制的访问这些模型也可能 introduce 恶意和无意的隐私风险。虽然持续努力解决 LLMS 中的安全和隐私问题,但问题仍未得到解决。本文提供了 LLMS 中隐私攻击的全面分析,根据敌对者假设的能力,将隐私攻击分为不同类别,以透视 LLMS 中的可能性隐私漏洞。然后,我们提供了一个详细的防御策略的概述,以响应这些隐私攻击。此外,我们还标识了 LLMS 的未来隐私问题。最后,我们指出了未来探索的一些可能性。

Contextual Data Augmentation for Task-Oriented Dialog Systems

  • paper_url: http://arxiv.org/abs/2310.10380
  • repo_url: None
  • paper_authors: Dustin Axman, Avik Ray, Shubham Garg, Jing Huang
  • for: Augmenting the training of current task-oriented dialog systems.
  • methods: Generates user turns conditioned on the full dialog context, using a new prompt design for the language model together with output re-ranking, so the generated dialogs can be used directly to train downstream dialog systems.
  • results: On common benchmark datasets, the dialog augmentation model generates high-quality dialogs and improves dialog success rate by as much as 8% over the baseline.
    Abstract Collection of annotated dialogs for training task-oriented dialog systems have been one of the key bottlenecks in improving current models. While dialog response generation has been widely studied on the agent side, it is not evident if similar generative models can be used to generate a large variety of, and often unexpected, user inputs that real dialog systems encounter in practice. Existing data augmentation techniques such as paraphrase generation do not take the dialog context into consideration. In this paper, we develop a novel dialog augmentation model that generates a user turn, conditioning on full dialog context. Additionally, with a new prompt design for language model, and output re-ranking, the dialogs generated from our model can be directly used to train downstream dialog systems. On common benchmark datasets MultiWoZ and SGD, we show that our dialog augmentation model generates high quality dialogs and improves dialog success rate by as much as $8\%$ over baseline.
    摘要 “为任务导向(task-oriented)对话系统收集带标注的对话数据,一直是改进现有模型的关键瓶颈之一。虽然对话回复生成已经广泛研究,但是不清楚是否可以使用类似的生成模型来生成实际对话系统遇到的多样化和意外的用户输入。现有的数据增强技术,如复述生成,不考虑对话上下文。在本文中,我们开发了一种基于对话上下文的对话增强模型,可以生成用户回合,并且通过新的语言模型提示和输出重新排序,生成的对话可以直接用于下游对话系统训练。在 MultiWoZ 和 SGD 等常用数据集上,我们展示了我们的对话增强模型可以生成高质量对话,提高对话成功率达到 $8\%$ 。”

  • paper_url: http://arxiv.org/abs/2310.10333
  • repo_url: None
  • paper_authors: Carolina Camassa
  • for: Examines the impact of the European Union's Markets in Crypto-Assets Regulation (MiCAR) on crypto-asset white papers and the role of textual analysis in this domain.
  • methods: Surveys existing applications of natural language processing (NLP) to unregulated crypto-asset white papers and analyses how NLP could be integrated within the new MiCAR regulatory framework.
  • results: Identifies a research gap in the textual analysis of crypto-asset white papers and, by analysing the changes introduced by MiCAR, outlines opportunities and challenges for regulators, crypto-asset issuers, and investors, setting the stage for further research.
    Abstract In the rapidly evolving field of crypto assets, white papers are essential documents for investor guidance, and are now subject to unprecedented content requirements under the European Union's Markets in Crypto-Assets Regulation (MiCAR). Natural Language Processing (NLP) can serve as a powerful tool for both analyzing these documents and assisting in regulatory compliance. This paper delivers two contributions to the topic. First, we survey existing applications of textual analysis to unregulated crypto asset white papers, uncovering a research gap that could be bridged with interdisciplinary collaboration. We then conduct an analysis of the changes introduced by MiCAR, highlighting the opportunities and challenges of integrating NLP within the new regulatory framework. The findings set the stage for further research, with the potential to benefit regulators, crypto asset issuers, and investors.
    摘要 在迅速发展的区块链资产领域,白皮书是投资者指导的重要文件,现在欧盟市场区块链资产管理法规(MiCAR)下面面临无前例的内容要求。自然语言处理(NLP)可以作为分析这些文件并协助合规遵守的强大工具。这篇论文在这个主题上做出了两项贡献。首先,我们对未经规范的区块链资产白皮书的文本分析应用进行了调查,揭示出了一个研究差距,这可以通过交叉领域合作bridged。然后,我们对MiCAR引入的变化进行了分析, highlighting the opportunities and challenges of integrating NLP within the new regulatory framework。这些发现可以为 regulators、区块链资产发行人和投资者带来 beneficial。

Optimized Tokenization for Transcribed Error Correction

  • paper_url: http://arxiv.org/abs/2310.10704
  • repo_url: None
  • paper_authors: Tomer Wullach, Shlomo E. Chazan
  • for: Improving the accuracy and robustness of speech recognition systems through post-processing error correction.
  • methods: Trains correction models solely on synthetic data generated from an error distribution derived from transcribed data, combined with language-specific adjustments to the vocabulary of a BPE tokenizer (a small corruption sketch follows this entry).
  • results: Shows that synthetic data generated from the derived error distribution outperforms random perturbations, and that the vocabulary adjustments balance adaptation to unseen distributions with retention of transcribed-error knowledge, with benefits demonstrated across multiple languages, speech recognition systems, and prominent datasets.
    Abstract The challenges facing speech recognition systems, such as variations in pronunciations, adverse audio conditions, and the scarcity of labeled data, emphasize the necessity for a post-processing step that corrects recurring errors. Previous research has shown the advantages of employing dedicated error correction models, yet training such models requires large amounts of labeled data which is not easily obtained. To overcome this limitation, synthetic transcribed-like data is often utilized, however, bridging the distribution gap between transcribed errors and synthetic noise is not trivial. In this paper, we demonstrate that the performance of correction models can be significantly increased by training solely using synthetic data. Specifically, we empirically show that: (1) synthetic data generated using the error distribution derived from a set of transcribed data outperforms the common approach of applying random perturbations; (2) applying language-specific adjustments to the vocabulary of a BPE tokenizer strike a balance between adapting to unseen distributions and retaining knowledge of transcribed errors. We showcase the benefits of these key observations, and evaluate our approach using multiple languages, speech recognition systems and prominent speech recognition datasets.
    摘要 Speech recognition systems face many challenges, such as differences in pronunciation, poor audio quality, and a lack of labeled data. To address these challenges, researchers have found that using dedicated error correction models can be effective, but these models require large amounts of labeled data, which is not easily obtained. To overcome this limitation, synthetic transcribed-like data is often used, but it can be difficult to bridge the gap between the distribution of transcribed errors and the synthetic noise. In this paper, we show that the performance of correction models can be significantly improved by training solely using synthetic data. Specifically, we find that: (1) synthetic data generated using the error distribution derived from a set of transcribed data outperforms the common approach of applying random perturbations; (2) applying language-specific adjustments to the vocabulary of a BPE tokenizer can strike a balance between adapting to unseen distributions and retaining knowledge of transcribed errors. We demonstrate the benefits of these key observations using multiple languages, speech recognition systems, and prominent speech recognition datasets.
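A minimal sketch of the idea of corrupting clean text with an error distribution estimated from real transcriptions rather than uniform random noise. Alignment is simplified to position-wise word pairs (real systems would use edit-distance alignment), and all data below is illustrative.

```python
import random
from collections import Counter, defaultdict

def estimate_confusions(pairs):
    """Count word-level substitutions observed in (reference, hypothesis) pairs."""
    conf = defaultdict(Counter)
    for ref, hyp in pairs:
        for r, h in zip(ref.split(), hyp.split()):
            if r != h:
                conf[r][h] += 1
    return conf

def corrupt(sentence, conf, p=0.3, seed=0):
    """Replace words with errors sampled from the estimated confusion counts."""
    random.seed(seed)
    out = []
    for w in sentence.split():
        if w in conf and random.random() < p:
            subs, counts = zip(*conf[w].items())
            out.append(random.choices(subs, weights=counts, k=1)[0])
        else:
            out.append(w)
    return " ".join(out)

transcribed = [("their house is new", "there house is new"),
               ("see you there", "sea you their")]
confusions = estimate_confusions(transcribed)
print(corrupt("their friends will meet you there", confusions))
```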

Untying the Reversal Curse via Bidirectional Language Model Editing

  • paper_url: http://arxiv.org/abs/2310.10322
  • repo_url: https://github.com/mjy1111/BAKE
  • paper_authors: Jun-Yu Ma, Jia-Chen Gu, Zhen-Hua Ling, Quan Liu, Cong Liu
  • for: Provides a rigorous evaluation of bidirectional language model editing, assessing whether edited models can recall injected knowledge in the reverse direction.
  • methods: Introduces a reversibility metric and the Bidirectional Assessment for Knowledge Editing (BAKE) benchmark, and proposes Bidirectionally Inversible Relationship moDeling (BIRD) to mitigate the reversal curse by incorporating bidirectional subject-object relationships into the updated model weights (a small reversibility-check sketch follows this entry).
  • results: Experiments show that BIRD improves the performance of four LLMs of different sizes on question answering and judgement tasks.
    Abstract Recent studies have demonstrated that large language models (LLMs) store massive factual knowledge within their parameters. But existing LLMs are prone to hallucinate unintended text due to false or outdated knowledge. Since retraining LLMs is resource intensive, there has been a growing interest in the concept of model editing. Despite the emergence of benchmarks and approaches, these unidirectional editing and evaluation have failed to explore the reversal curse. Intuitively, if "The capital of France is" is edited to be a counterfact "London" within a model, then it should be able to naturally reason and recall the reverse fact, i.e., "London is the capital of" followed by "France" instead of "England". In this paper, we study bidirectional language model editing, aiming to provide rigorous model editing evaluation to assess if edited LLMs can recall the editing knowledge bidirectionally. A new evaluation metric of reversibility is introduced, and a benchmark dubbed as Bidirectional Assessment for Knowledge Editing (BAKE) is constructed to evaluate the reversibility of edited models in recalling knowledge in the reverse direction of editing. We surprisingly observe that while current editing methods and LLMs can effectively recall editing facts in the direction of editing, they suffer serious deficiencies when evaluated in the reverse direction. To mitigate the reversal curse, a method named Bidirectionally Inversible Relationship moDeling (BIRD) is proposed. A set of editing objectives that incorporate bidirectional relationships between subject and object into the updated model weights are designed. Experiments show that BIRD improves the performance of four representative LLMs of different sizes via question answering and judgement.
    摘要 研究者最近发现,大型语言模型(LLM)中含有巨量的事实知识。然而,现有的LLM容易产生假或过时的知识,导致模型产生假信息。由于重新训练LLM是资源占用的,因此对模型编辑的概念产生了增加的兴趣。虽然有了 benchmarcks 和方法,但这些单向编辑和评估未能探索反转咒。在这篇文章中,我们研究了对向语言模型编辑,以提供对编辑后模型的精确评估,以确定编辑后模型是否可以在反向方向上恢复编辑知识。我们引入了一种新的评估指标——反向可逆性指标,并构建了一个名为“ bidirectional Assessment for Knowledge Editing”(BAKE)的benchmarcks,以评估编辑后模型在反向方向上的知识恢复能力。我们意外发现,当前的编辑方法和LLM可以很好地在编辑方向上恢复编辑知识,但在反向方向上表现异常差。为了 Mitigate the reversal curse,我们提出了一种名为“ bidirectionally Inversible Relationship moDeling”(BIRD)的方法。我们设计了一组编辑目标,将对象和主题之间的双向关系 integrate 到更新后的模型参数中。实验表明,BIRD 可以提高四种不同大小的 LLM 的表现,通过问答和判断。
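A minimal sketch of the bidirectional evaluation idea: after editing a fact (subject, relation, object), probe the model in both directions and count an edit as reversible only if the reverse query also recalls the new fact. `query_model` is a placeholder for generation from the edited LM; the hard-coded answers merely illustrate the reversal curse.

```python
def query_model(prompt: str) -> str:
    # Placeholder for edited-model generation.
    return {"The capital of France is": "London",
            "London is the capital of": "England"}.get(prompt, "")

def reversibility(edits):
    fwd = rev = 0
    for subj, new_obj, fwd_prompt, rev_prompt in edits:
        fwd += int(new_obj.lower() in query_model(fwd_prompt).lower())
        rev += int(subj.lower() in query_model(rev_prompt).lower())
    return fwd / len(edits), rev / len(edits)

edits = [("France", "London", "The capital of France is", "London is the capital of")]
forward_recall, reverse_recall = reversibility(edits)
print(forward_recall, reverse_recall)   # 1.0, 0.0 -> the reversal curse in miniature
```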

Investigating Bias in Multilingual Language Models: Cross-Lingual Transfer of Debiasing Techniques

  • paper_url: http://arxiv.org/abs/2310.10310
  • repo_url: https://github.com/manon-reusens/multilingual_bias
  • paper_authors: Manon Reusens, Philipp Borchert, Margot Mieskes, Jochen De Weerdt, Bart Baesens
  • for: Investigates the cross-lingual transferability of debiasing techniques within multilingual models, covering English, French, German, and Dutch.
  • methods: Uses multilingual BERT (mBERT) and translations of the CrowS-Pairs dataset to test whether debiasing techniques transfer across languages (a projection-based debiasing sketch follows this entry).
  • results: Finds no performance disadvantage when applying the techniques to non-English languages; SentenceDebias is the best technique across languages, reducing bias in mBERT by an average of 13%, and debiasing with additional pretraining shows enhanced cross-lingual effectiveness, particularly for lower-resource languages.
    Abstract This paper investigates the transferability of debiasing techniques across different languages within multilingual models. We examine the applicability of these techniques in English, French, German, and Dutch. Using multilingual BERT (mBERT), we demonstrate that cross-lingual transfer of debiasing techniques is not only feasible but also yields promising results. Surprisingly, our findings reveal no performance disadvantages when applying these techniques to non-English languages. Using translations of the CrowS-Pairs dataset, our analysis identifies SentenceDebias as the best technique across different languages, reducing bias in mBERT by an average of 13%. We also find that debiasing techniques with additional pretraining exhibit enhanced cross-lingual effectiveness for the languages included in the analyses, particularly in lower-resource languages. These novel insights contribute to a deeper understanding of bias mitigation in multilingual language models and provide practical guidance for debiasing techniques in different language contexts.
    摘要 这篇论文研究了多语言模型中的偏见纠正技术的传递性。我们对英语、法语、德语和荷语进行了研究,使用多语言BERT(mBERT)来示范了跨语言传递的偏见纠正技术的可行性和效果。我们的结果表明,对非英语语言应用这些技术并不会带来性能下降,而且使用翻译的 CrowS-Pairs 数据集,我们的分析发现,在不同语言上,SentenceDebias 是最有效的技术,可以减少 mBERT 中的偏见程度。此外,我们还发现,对于不同语言的语言模型,额外的预训练可以提高跨语言效果,特别是对于低资源语言。这些发现对偏见纠正在多语言语言模型中的深入理解和实践指导提供了有价值的贡献。
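A minimal sketch of a SentenceDebias-style correction: estimate a bias subspace from differences of paired sentence embeddings (e.g., gendered counterfactual pairs), then remove that subspace from every representation. The embeddings below are random placeholders standing in for mBERT sentence vectors; this is an illustration of the projection idea, not the paper's implementation.

```python
import numpy as np

def bias_subspace(paired_embeddings, k=1):
    """Principal directions of paired-difference vectors span the bias subspace."""
    diffs = np.array([a - b for a, b in paired_embeddings])
    _, _, vt = np.linalg.svd(diffs - diffs.mean(axis=0), full_matrices=False)
    return vt[:k]                      # (k, dim) orthonormal basis

def debias(x, basis):
    proj = (x @ basis.T) @ basis       # component lying inside the bias subspace
    return x - proj

rng = np.random.default_rng(0)
pairs = [(rng.normal(size=32), rng.normal(size=32)) for _ in range(50)]
basis = bias_subspace(pairs, k=2)
sentence_vec = rng.normal(size=32)
print(np.linalg.norm(debias(sentence_vec, basis) @ basis.T))  # ~0: bias component removed
```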

Multi-Stage Pre-training Enhanced by ChatGPT for Multi-Scenario Multi-Domain Dialogue Summarization

  • paper_url: http://arxiv.org/abs/2310.10285
  • repo_url: https://github.com/zhouweixiao/mp4
  • paper_authors: Weixiao Zhou, Gengyao Li, Xianfu Cheng, Xinnian Liang, Junnan Zhu, Feifei Zhai, Zhoujun Li
  • for: Designs a new pre-trained model for multi-scenario multi-domain dialogue summarization, improving both adaptability and dialogue summarization ability.
  • methods: Adopts a multi-stage pre-training strategy to narrow the gap between the pre-training and fine-tuning objectives: first, domain-aware pre-training on large-scale multi-scenario multi-domain dialogue data to enhance adaptability; then, task-oriented pre-training on large-scale multi-scenario multi-domain "dialogue-summary" parallel data annotated by ChatGPT to enhance summarization ability.
  • results: On three dialogue summarization datasets from different scenarios and domains, the pre-trained model significantly outperforms previous state-of-the-art models in full fine-tuning, zero-shot, and few-shot settings.
    Abstract Dialogue summarization involves a wide range of scenarios and domains. However, existing methods generally only apply to specific scenarios or domains. In this study, we propose a new pre-trained model specifically designed for multi-scenario multi-domain dialogue summarization. It adopts a multi-stage pre-training strategy to reduce the gap between the pre-training objective and fine-tuning objective. Specifically, we first conduct domain-aware pre-training using large-scale multi-scenario multi-domain dialogue data to enhance the adaptability of our pre-trained model. Then, we conduct task-oriented pre-training using large-scale multi-scenario multi-domain "dialogue-summary" parallel data annotated by ChatGPT to enhance the dialogue summarization ability of our pre-trained model. Experimental results on three dialogue summarization datasets from different scenarios and domains indicate that our pre-trained model significantly outperforms previous state-of-the-art models in full fine-tuning, zero-shot, and few-shot settings.
    摘要 对话概要化 involves a wide range of scenarios and domains. However, existing methods generally only apply to specific scenarios or domains. In this study, we propose a new pre-trained model specifically designed for multi-scenario multi-domain dialogue summarization. It adopts a multi-stage pre-training strategy to reduce the gap between the pre-training objective and fine-tuning objective. Specifically, we first conduct domain-aware pre-training using large-scale multi-scenario multi-domain dialogue data to enhance the adaptability of our pre-trained model. Then, we conduct task-oriented pre-training using large-scale multi-scenario multi-domain "dialogue-summary" parallel data annotated by ChatGPT to enhance the dialogue summarization ability of our pre-trained model. Experimental results on three dialogue summarization datasets from different scenarios and domains indicate that our pre-trained model significantly outperforms previous state-of-the-art models in full fine-tuning, zero-shot, and few-shot settings.

Generative Calibration for In-context Learning

  • paper_url: http://arxiv.org/abs/2310.10266
  • repo_url: https://github.com/changmenseng/generative_calibration
  • paper_authors: Zhongtao Jiang, Yuanzhe Zhang, Cao Liu, Jun Zhao, Kang Liu
  • for: Explains the prompt sensitivity of in-context learning (ICL) in LLMs and proposes a generative-calibration method to improve its performance.
  • methods: Provides theoretical and empirical analysis showing that ICL's instability stems from a label shift of the in-context model, and calibrates the predictive distribution by adjusting the label marginal, which is estimated via Monte-Carlo sampling over the in-context model, i.e., generation from the LLM (a minimal sketch follows this entry).
  • results: Across 12 text classification tasks and 12 LLMs ranging from 774M to 33B parameters, the method greatly and consistently outperforms ICL and state-of-the-art calibration methods, by up to 27% absolute in macro-F1, while remaining stable under different prompt configurations.
    Abstract As one of the most exciting features of large language models (LLMs), in-context learning is a mixed blessing. While it allows users to fast-prototype a task solver with only a few training examples, the performance is generally sensitive to various configurations of the prompt such as the choice or order of the training examples. In this paper, we for the first time theoretically and empirically identify that such a paradox is mainly due to the label shift of the in-context model to the data distribution, in which LLMs shift the label marginal $p(y)$ while having a good label conditional $p(x|y)$. With this understanding, we can simply calibrate the in-context predictive distribution by adjusting the label marginal, which is estimated via Monte-Carlo sampling over the in-context model, i.e., generation of LLMs. We call our approach as generative calibration. We conduct exhaustive experiments with 12 text classification tasks and 12 LLMs scaling from 774M to 33B, generally find that the proposed method greatly and consistently outperforms the ICL as well as state-of-the-art calibration methods, by up to 27% absolute in macro-F1. Meanwhile, the proposed method is also stable under different prompt configurations.
    摘要 一个 LLM 中最吸引人的特点之一是内容学习(in-context learning),它允许用户快速批量任务解决器,只需要几个训练示例。然而,这种特点同时带来了一些问题,例如prompt的选择和顺序对性能的敏感性。在这篇论文中,我们首次 theoretically和empirically发现,这种парадок斯主要是由 LLMS 的数据分布 Label Shift 引起的, LLMS 会将标签梯度 $p(y)$ Shift,而保持标签条件 $p(x|y)$ 良好。通过这种理解,我们可以简单地调整受 Context 预测分布,这是通过 Monte-Carlo 采样来Estimate LLMS 的标签梯度。我们称之为生成 Calibration。我们进行了12种文本分类任务和12种 LLMS 的探索性实验,发现我们的方法可以大幅提高 IC 和当前最佳化方法的表现,最高提高27%的绝对值。此外,我们的方法也在不同的 prompt 配置下保持稳定。
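A minimal sketch of marginal-based calibration for in-context learning: estimate the label marginal p(y) from labels of Monte-Carlo generations of the in-context model (stubbed below as a fixed list), then reweight the predictive distribution p(y|x) by 1/p(y) and renormalise. The numbers are illustrative, not results from the paper.

```python
import numpy as np

def calibrate(p_y_given_x: np.ndarray, sampled_labels: list, n_classes: int, eps: float = 1e-6):
    counts = np.bincount(sampled_labels, minlength=n_classes).astype(float)
    p_y = (counts + eps) / (counts.sum() + eps * n_classes)   # MC estimate of the label marginal
    cal = p_y_given_x / p_y                                   # undo the label shift
    return cal / cal.sum()

p_pred = np.array([0.7, 0.3])          # biased in-context prediction for one query
mc_labels = [0, 0, 0, 1, 0, 0, 1, 0]   # labels of Monte-Carlo generations, also skewed to class 0
print(calibrate(p_pred, mc_labels, n_classes=2))
```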

Enhancing Interpretability using Human Similarity Judgements to Prune Word Embeddings

  • paper_url: http://arxiv.org/abs/2310.10262
  • repo_url: None
  • paper_authors: Natalia Flechas Manrique, Wanqian Bao, Aurelie Herbelot, Uri Hasson
  • for: Provides an interpretability method for understanding the semantics encoded by NLP word embeddings.
  • methods: Uses supervised learning to identify, for a given domain (e.g., sports, professions), a subset of embedding features that strongly improve prediction of human similarity judgments; the method keeps only 20-40% of the original features and retains different feature sets across eight independent semantic domains.
  • results: The retained features support interpretation of human semantic knowledge, e.g., humans differentiate sports by how gender-inclusive and international they are; in a probing task over 65 semantically annotated dimensions for 535 words, features retained for professions best predict cognitive, emotional and social dimensions, whereas features retained for fruits or vegetables best predict the gustation (taste) dimension.
    Abstract Interpretability methods in NLP aim to provide insights into the semantics underlying specific system architectures. Focusing on word embeddings, we present a supervised-learning method that, for a given domain (e.g., sports, professions), identifies a subset of model features that strongly improve prediction of human similarity judgments. We show this method keeps only 20-40% of the original embeddings, for 8 independent semantic domains, and that it retains different feature sets across domains. We then present two approaches for interpreting the semantics of the retained features. The first obtains the scores of the domain words (co-hyponyms) on the first principal component of the retained embeddings, and extracts terms whose co-occurrence with the co-hyponyms tracks these scores' profile. This analysis reveals that humans differentiate e.g. sports based on how gender-inclusive and international they are. The second approach uses the retained sets as variables in a probing task that predicts values along 65 semantically annotated dimensions for a dataset of 535 words. The features retained for professions are best at predicting cognitive, emotional and social dimensions, whereas features retained for fruits or vegetables best predict the gustation (taste) dimension. We discuss implications for alignment between AI systems and human knowledge.
    摘要 《NLプロセッシングにおける可読性方法の探索》目的:提供NLプロセッシング中の具体的システム构造の下でのSemanticsの问题解釈を提供する方法。方法:1. 给定のドメイン(例如、スポーツ、职业)に対し、predict human similarity judgmentsのためのsupervised learning方法を提供する。2. この方法では、原始の埋め込みに対して、20-40%の削减を行い、8つの独立したセマンティック ドメインでのpredictionの改善を目指す。3. この方法でRetained featuresのSemanticsを解釈するために、two approachesを提供する。第一方法:1. Retained embeddingsの初期Componentのスコアを计算し、これらのスコアに対応するドメインワード(co-hyponyms)のスコアを测定する。2. この分析により、人々はスポーツなどをどのように区别しているかを理解することができる。第二方法:1. Retained setsを使用して、65を超えるsemantically annotated dimensionに対する予测タスクを実行する。2. この方法では、职业に対するRetained featuresは、cognitive、emotional、socialdimensionsに最も优れていることが分かり、フルーツや野菜に対するRetained featuresは、味(gustation)dimensionに最も优れていることが分かる。结论:これらの方法により、AIシステムと人间の知识のAlignmentを改善することができる。

Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective

  • paper_url: http://arxiv.org/abs/2310.10226
  • repo_url: https://github.com/gmftbygmftby/rep-dropout
  • paper_authors: Huayang Li, Tian Lan, Zihao Fu, Deng Cai, Lemao Liu, Nigel Collier, Taro Watanabe, Yixuan Su
  • for: Seeks a simple, data-centric explanation of the neural text degeneration problem (repetitive, dull generation loops).
  • methods: Uses empirical analysis and interventions such as selectively dropping the attention/loss on repetitive words in the training data, validated with experiments (a simplified sketch follows this entry).
  • results: Finds a strong correlation between degeneration and repetitions in training data; selectively dropping repeated words during training significantly reduces degeneration, and prior remedies (high-inflow words, the likelihood objective, the self-reinforcement phenomenon) can all be interpreted as penalizing repetitions in training data, which remains critical even with larger models and instruction tuning.
    Abstract There are a number of diverging hypotheses about the neural text degeneration problem, i.e., generating repetitive and dull loops, which makes this problem both interesting and confusing. In this work, we aim to advance our understanding by presenting a straightforward and fundamental explanation from the data perspective. Our preliminary investigation reveals a strong correlation between the degeneration issue and the presence of repetitions in training data. Subsequent experiments also demonstrate that by selectively dropping out the attention to repetitive words in training data, degeneration can be significantly minimized. Furthermore, our empirical analysis illustrates that prior works addressing the degeneration issue from various standpoints, such as the high-inflow words, the likelihood objective, and the self-reinforcement phenomenon, can be interpreted by one simple explanation. That is, penalizing the repetitions in training data is a common and fundamental factor for their effectiveness. Moreover, our experiments reveal that penalizing the repetitions in training data remains critical even when considering larger model sizes and instruction tuning.
    摘要 有很多关于神经文本衰退问题的不同假设,即生成循环和极端的循环,使得这个问题同时具有诱人性和混乱性。在这项工作中,我们希望通过数据角度提供直接和基本的解释,以进一步深化我们对这个问题的理解。我们的初步调查发现,衰退问题与训练数据中的重复的强相关性存在很强的关系。后续的实验也表明,在训练数据中 selectively dropping out 重复的注意力可以明显减少衰退。此外,我们的实验分析表明,先前关于衰退问题的不同方法,如高流入词、可能性目标和自我强化现象,都可以通过一个简单的解释:即在训练数据中 penalty 重复。此外,我们的实验还表明,即使考虑更大的模型大小和指导调整,penalizing 训练数据中的重复仍然是关键的。
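A minimal sketch of down-weighting repeated tokens during LM training: tokens whose type already occurred earlier in the sequence are dropped from the loss with probability `p_drop`. This is a simplified rendering of the "penalise repetitions in training data" idea, not the repository's exact implementation.

```python
import torch

def repetition_drop_mask(token_ids: torch.Tensor, p_drop: float = 0.9) -> torch.Tensor:
    """token_ids: (batch, seq). Returns a 0/1 mask to multiply into the per-token LM loss."""
    batch, seq = token_ids.shape
    mask = torch.ones(batch, seq)
    for b in range(batch):
        seen = set()
        for t in range(seq):
            tok = int(token_ids[b, t])
            if tok in seen and torch.rand(()) < p_drop:
                mask[b, t] = 0.0          # exclude this repeated token from the loss
            seen.add(tok)
    return mask

ids = torch.tensor([[5, 7, 5, 5, 9, 7]])
per_token_loss = torch.rand(1, 6)          # stand-in for per-token cross-entropy
mask = repetition_drop_mask(ids)
masked_loss = (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```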

AdaLomo: Low-memory Optimization with Adaptive Learning Rate

  • paper_url: http://arxiv.org/abs/2310.10195
  • repo_url: https://github.com/openlmlab/lomo
  • paper_authors: Kai Lv, Hang Yan, Qipeng Guo, Haijun Lv, Xipeng Qiu
  • for: Lowering the hardware barrier to training large language models.
  • methods: Provides an adaptive learning rate for each parameter, uses non-negative matrix factorization to estimate the second-order moment in the optimizer state, and applies grouped update normalization to stabilize convergence (a simplified sketch follows this entry).
  • results: Achieves performance on par with AdamW in instruction-tuning and further pre-training while significantly reducing the memory footprint of training.
    Abstract Large language models have achieved remarkable success, but their extensive parameter size necessitates substantial memory for training, thereby setting a high threshold. While the recently proposed low-memory optimization (LOMO) reduces memory footprint, its optimization technique, akin to stochastic gradient descent, is sensitive to hyper-parameters and exhibits suboptimal convergence, failing to match the performance of the prevailing optimizer for large language models, AdamW. Through empirical analysis of the Adam optimizer, we found that, compared to momentum, the adaptive learning rate is more critical for bridging the gap. Building on this insight, we introduce the low-memory optimization with adaptive learning rate (AdaLomo), which offers an adaptive learning rate for each parameter. To maintain memory efficiency, we employ non-negative matrix factorization for the second-order moment estimation in the optimizer state. Additionally, we suggest the use of a grouped update normalization to stabilize convergence. Our experiments with instruction-tuning and further pre-training demonstrate that AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
    摘要 大型语言模型已经取得了非常出色的成功,但它们的庞大参数大小需要很大的内存进行训练,从而设置了高度的门槛。而最近提出的低内存优化(LOMO)可以降低内存占用量,但是它的优化技术,类似于随机梯度下降,对于hyper参数敏感,而且 converge 性不如 AdamW 优化器,fail to match the performance of the prevailing optimizer for large language models。经验表明,与滑动 average 相比,适应式学习率更是关键性的 bridging 因素。基于这一点,我们提出了low-memory optimization with adaptive learning rate(AdaLomo),它在每个参数上提供了适应式学习率。为保持内存效率,我们使用非负矩阵因子分解来Estimate 第二个矩阵积分。此外,我们建议使用 grouped update normalization来稳定收敛。我们的实验表明,AdaLomo 可以与 AdamW 的性能相当,同时具有 significanly 降低内存需求,从而降低训练大语言模型的硬件阻碍。
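A heavily simplified sketch of the ingredients described above for a single 2-D weight: a factored (row/column) estimate of the squared-gradient second moment so the full Adam state is never stored, an adaptive per-parameter step, and a grouped (per-matrix) normalisation of the update. Hyperparameters and details are illustrative assumptions, not the paper's implementation.

```python
import torch

def adalomo_like_step(w, grad, row_state, col_state, lr=1e-3, beta=0.9, eps=1e-8):
    # Factored second moment: running averages over rows and columns of grad**2.
    row_state.mul_(beta).add_(grad.pow(2).mean(dim=1), alpha=1 - beta)   # (rows,)
    col_state.mul_(beta).add_(grad.pow(2).mean(dim=0), alpha=1 - beta)   # (cols,)
    v = torch.outer(row_state, col_state) / row_state.mean().clamp(min=eps)
    update = grad / (v.sqrt() + eps)                                     # adaptive per-parameter step
    # Grouped update normalisation: rescale so the update RMS of this matrix is at most 1.
    rms = update.pow(2).mean().sqrt()
    update = update / rms.clamp(min=1.0)
    w.add_(update, alpha=-lr)

w = torch.randn(8, 4)
row_s, col_s = torch.zeros(8), torch.zeros(4)
for _ in range(3):
    g = torch.randn_like(w)            # stand-in for a real gradient
    adalomo_like_step(w, g, row_s, col_s)
```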

VIBE: Topic-Driven Temporal Adaptation for Twitter Classification

  • paper_url: http://arxiv.org/abs/2310.10191
  • repo_url: https://github.com/CelestineZYJ/VIBE-Temporal-Adaptation
  • paper_authors: Yuji Zhang, Jing Li, Wenjie Li
  • for: Addresses the challenge of deteriorating text classification performance in real-world social media due to language evolution.
  • methods: Models latent topic evolution with Variational Information Bottleneck (IB) regularizers for temporal adaptation, uses the distinguished past/future topics as adaptive features via multi-task training with timestamp and class label prediction, and leverages unlabeled data retrieved from online streams created after the training period.
  • results: On three Twitter classification tasks, the model, using only 3% of the data, significantly outperforms previous state-of-the-art continued-pretraining methods.
    Abstract Language features are evolving in real-world social media, resulting in the deteriorating performance of text classification in dynamics. To address this challenge, we study temporal adaptation, where models trained on past data are tested in the future. Most prior work focused on continued pretraining or knowledge updating, which may compromise their performance on noisy social media data. To tackle this issue, we reflect feature change via modeling latent topic evolution and propose a novel model, VIBE: Variational Information Bottleneck for Evolutions. Concretely, we first employ two Information Bottleneck (IB) regularizers to distinguish past and future topics. Then, the distinguished topics work as adaptive features via multi-task training with timestamp and class label prediction. In adaptive learning, VIBE utilizes retrieved unlabeled data from online streams created posterior to training data time. Substantial Twitter experiments on three classification tasks show that our model, with only 3% of data, significantly outperforms previous state-of-the-art continued-pretraining methods.
    摘要 语言特征在现实世界社交媒体上发展,导致文本分类的性能下降。为Address这个挑战,我们研究时间适应,即使模型在过去数据上训练后,在未来数据上进行测试。大多数前期工作都集中在继续预训练或知识更新上,这可能会 compromise 社交媒体数据的性能。为解决这个问题,我们通过模拟 latent topic evolution 来反射特征变化,并提出了一种新的模型,namely VIBE:Variational Information Bottleneck for Evolutions。具体来说,我们首先使用两个 Information Bottleneck(IB)正则化来分辨过去和未来话题。然后,这些分辨出来的话题被用作 adaptive features,通过多任务训练时间戳和类别标签预测。在adaptive learning中,VIBE 利用了 posterior 于训练数据时间创建的在线流量中检索到的无标签数据,以进行学习。在 Twitter 上进行了三种分类任务的实验,我们发现,使用只有 3% 的数据,我们的模型可以明显超越先前的继续预训练方法。

TRIGO: Benchmarking Formal Mathematical Proof Reduction for Generative Language Models

  • paper_url: http://arxiv.org/abs/2310.10180
  • repo_url: https://github.com/menik1126/TRIGO
  • paper_authors: Jing Xiong, Jianhao Shen, Ye Yuan, Haiming Wang, Yichun Yin, Zhengying Liu, Lin Li, Zhijiang Guo, Qingxing Cao, Yinya Huang, Chuanyang Zheng, Xiaodan Liang, Ming Zhang, Qun Liu
  • for: Benchmarking the formal and mathematical reasoning ability of advanced generative language models.
  • methods: Proposes an ATP benchmark built on the Lean formal language system that requires reducing trigonometric expressions with step-by-step proofs, evaluating a model's reasoning over formulas and its ability to manipulate, group, and factor number terms; dataset splits of varying difficulty and distribution are generated automatically with a Lean-Gym-based generator.
  • results: Extensive experiments show that TRIGO poses a new challenge even for advanced generative LMs such as GPT-4, and provides a new tool for studying their formal and mathematical reasoning ability.
    Abstract Automated theorem proving (ATP) has become an appealing domain for exploring the reasoning ability of the recent successful generative language models. However, current ATP benchmarks mainly focus on symbolic inference, but rarely involve the understanding of complex number combination reasoning. In this work, we propose TRIGO, an ATP benchmark that not only requires a model to reduce a trigonometric expression with step-by-step proofs but also evaluates a generative LM's reasoning ability on formulas and its capability to manipulate, group, and factor number terms. We gather trigonometric expressions and their reduced forms from the web, annotate the simplification process manually, and translate it into the Lean formal language system. We then automatically generate additional examples from the annotated samples to expand the dataset. Furthermore, we develop an automatic generator based on Lean-Gym to create dataset splits of varying difficulties and distributions in order to thoroughly analyze the model's generalization ability. Our extensive experiments show our proposed TRIGO poses a new challenge for advanced generative LM's including GPT-4 which is pre-trained on a considerable amount of open-source formal theorem-proving language data, and provide a new tool to study the generative LM's ability on both formal and mathematical reasoning.
    摘要 自动证明 theorem (ATP) 已成为一个吸引人的领域,以探索最新的成功生成语言模型的逻辑能力。然而,当前的 ATP 标准 mainly focuses on 符号逻辑推理,很少涉及复杂的数学运算理解。在这种工作中,我们提出了 TRIGO,一个 ATP 标准,需要模型将 trigonometric 表达式简化为步骤证明,并评估生成LM的逻辑能力,包括数学表达式的排序、分组和因数化。我们从网络上收集了 trigonometric 表达式和简化过程,并 manually 鉴定了这些简化过程。然后,我们使用 Lean 正式语言系统来翻译这些样例,并自动生成了更多的样例来扩大数据集。此外,我们开发了基于 Lean-Gym 的自动生成器,以创建不同难度和分布的数据集,以全面分析模型的总体化能力。我们的广泛实验表明,我们的提出的 TRIGO 对高级的生成LM,包括 GPT-4,具有新的挑战,并提供了一个新的工具来研究生成LM的 both formal 和数学逻辑能力。

Joint Music and Language Attention Models for Zero-shot Music Tagging

  • paper_url: http://arxiv.org/abs/2310.10159
  • repo_url: None
  • paper_authors: Xingjian Du, Zhesong Yu, Jiaju Lin, Bilei Zhu, Qiuqiang Kong
  • for: Proposes a zero-shot music tagging system for the open-set music tagging problem.
  • methods: Uses a joint music and language attention (JMLA) model consisting of a pretrained masked-autoencoder audio encoder and a Falcon7B decoder; a perceiver resampler converts arbitrary-length audio into fixed-length representations, and dense attention connections between encoder and decoder layers improve information flow.
  • results: Trained on a large-scale music-and-description dataset collected from the internet, with raw descriptions converted into formalized and diverse descriptions by ChatGPT, the system achieves 64.82% zero-shot tagging accuracy on GTZAN, outperforming previous zero-shot systems, and achieves results comparable to previous systems on FMA and MagnaTagATune.
    Abstract Music tagging is a task to predict the tags of music recordings. However, previous music tagging research primarily focuses on close-set music tagging tasks which can not be generalized to new tags. In this work, we propose a zero-shot music tagging system modeled by a joint music and language attention (JMLA) model to address the open-set music tagging problem. The JMLA model consists of an audio encoder modeled by a pretrained masked autoencoder and a decoder modeled by a Falcon7B. We introduce preceiver resampler to convert arbitrary length audio into fixed length embeddings. We introduce dense attention connections between encoder and decoder layers to improve the information flow between the encoder and decoder layers. We collect a large-scale music and description dataset from the internet. We propose to use ChatGPT to convert the raw descriptions into formalized and diverse descriptions to train the JMLA models. Our proposed JMLA system achieves a zero-shot audio tagging accuracy of $ 64.82\% $ on the GTZAN dataset, outperforming previous zero-shot systems and achieves comparable results to previous systems on the FMA and the MagnaTagATune datasets.
    摘要 音乐标注是一项任务,旨在预测音乐录音的标签。然而,过去的音乐标注研究主要集中在靠近音乐标注任务上,这些任务无法泛化到新的标签。在这项工作中,我们提出了一种基于共同音乐和语言注意力(JMLA)模型的零批学习音乐标注系统,以解决开放集音乐标注问题。JMLA模型包括一个预训练的masked autoencoder音频编码器和一个Falcon7B decoder。我们引入了preceiver resampler将任意长度音频转换为固定长度嵌入。我们引入了 dense attention连接 между编码器和解码器层,以改进编码器和解码器之间的信息流。我们收集了互联网上大规模的音乐和描述数据集。我们提议使用ChatGPT将Raw描述转换为正式化和多样化的描述,以训练JMLA模型。我们提出的JMLA系统在GTZAN数据集上实现了零批学习音乐标注精度为64.82%,超过了前一代零批系统的性能,并与前一代系统在FMA和MagnaTagATune数据集上实现了相似的结果。

DNA: Denoised Neighborhood Aggregation for Fine-grained Category Discovery

  • paper_url: http://arxiv.org/abs/2310.10151
  • repo_url: https://github.com/Lackel/DNA
  • paper_authors: Wenbin An, Feng Tian, Wenkai Shi, Yan Chen, Qinghua Zheng, QianYing Wang, Ping Chen
  • for: bridging the gap between fine-grained analysis and high annotation cost
  • methods: self-supervised framework that encodes semantic structures of data into the embedding space, with three principles to filter out false neighbors
  • results: retrieves more accurate neighbors and outperforms state-of-the-art models by a large margin (average 9.96% improvement on three metrics)
    Abstract Discovering fine-grained categories from coarsely labeled data is a practical and challenging task, which can bridge the gap between the demand for fine-grained analysis and the high annotation cost. Previous works mainly focus on instance-level discrimination to learn low-level features, but ignore semantic similarities between data, which may prevent these models learning compact cluster representations. In this paper, we propose Denoised Neighborhood Aggregation (DNA), a self-supervised framework that encodes semantic structures of data into the embedding space. Specifically, we retrieve k-nearest neighbors of a query as its positive keys to capture semantic similarities between data and then aggregate information from the neighbors to learn compact cluster representations, which can make fine-grained categories more separatable. However, the retrieved neighbors can be noisy and contain many false-positive keys, which can degrade the quality of learned embeddings. To cope with this challenge, we propose three principles to filter out these false neighbors for better representation learning. Furthermore, we theoretically justify that the learning objective of our framework is equivalent to a clustering loss, which can capture semantic similarities between data to form compact fine-grained clusters. Extensive experiments on three benchmark datasets show that our method can retrieve more accurate neighbors (21.31% accuracy improvement) and outperform state-of-the-art models by a large margin (average 9.96% improvement on three metrics). Our code and data are available at https://github.com/Lackel/DNA.
    摘要 发现细化类别从宽域标注数据是一个实用和挑战性的任务,可以bridging the gap between需求细化分析和高标注成本。先前的工作主要关注实例级别的区分来学习低级特征,但忽略数据之间的 semantic similarity,这可能会使这些模型学习不够紧凑的集群表示。在这篇论文中,我们提出了 Denoised Neighborhood Aggregation(DNA),一种无监督的框架,它可以将数据的 semantic structure编码到嵌入空间中。具体来说,我们在查询时检索 k 个最近邻居作为它的正确键,以捕捉数据之间的semantic similarity,然后将邻居中的信息聚合以学习紧凑的集群表示。但是,检索到的邻居可能含有很多假阳键,这会下降学习得到的嵌入的质量。为了解决这个挑战,我们提出了三个原则来筛选假阳键,以便更好地学习嵌入。此外,我们也证明了我们的学习目标等价于一种聚类损失函数,可以捕捉数据之间的semantic similarity,以形成细化的集群。我们在三个标准数据集上进行了广泛的实验,得到了更高准确的邻居(21.31%的准确率提高)和超过当前领先模型(平均9.96%的提高)。我们的代码和数据可以在https://github.com/Lackel/DNA中找到。
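A minimal sketch of the neighbourhood-aggregation idea: retrieve k nearest neighbours for each embedding, keep only reciprocal (mutually nearest) neighbours as a simple stand-in for the paper's three false-neighbour filtering principles, and average the survivors to form a cluster-friendly representation. Data and the filtering rule are illustrative.

```python
import numpy as np

def knn(embs, k):
    sims = embs @ embs.T
    np.fill_diagonal(sims, -np.inf)
    return np.argsort(-sims, axis=1)[:, :k]          # (n, k) neighbour indices

def denoised_aggregate(embs, k=3):
    nbrs = knn(embs, k)
    out = np.empty_like(embs)
    for i in range(len(embs)):
        kept = [j for j in nbrs[i] if i in nbrs[j]]   # reciprocal-neighbour filter
        pool = embs[kept] if kept else embs[i:i + 1]
        out[i] = np.mean(np.vstack([embs[i:i + 1], pool]), axis=0)
    return out / np.linalg.norm(out, axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))
x = x / np.linalg.norm(x, axis=1, keepdims=True)
targets = denoised_aggregate(x, k=3)                  # denoised, aggregated embeddings
```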

Node-based Knowledge Graph Contrastive Learning for Medical Relationship Prediction

  • paper_url: http://arxiv.org/abs/2310.10138
  • repo_url: https://github.com/zhi520/nc-kge
  • paper_authors: Zhiguang Fan, Yuedong Yang, Mingyuan Xu, Hongming Chen
  • For: The paper is written for enhancing the distinctiveness of knowledge graph embeddings (KGEs) and improving the performance of downstream tasks such as predicting drug combinations and reasoning disease-drug relationships.* Methods: The paper proposes a novel node-based contrastive learning method for KGE, called NC-KGE, which constructs appropriate contrastive node pairs on knowledge graphs (KGs) and integrates a relation-aware attention mechanism to focus on semantic relationships and node interactions.* Results: The paper shows that NC-KGE performs competitively with state-of-the-art models on public datasets and outperforms all baselines in predicting biomedical relationship predictions tasks, especially in predicting drug combination relationships.
    Abstract The embedding of Biomedical Knowledge Graphs (BKGs) generates robust representations, valuable for a variety of artificial intelligence applications, including predicting drug combinations and reasoning disease-drug relationships. Meanwhile, contrastive learning (CL) is widely employed to enhance the distinctiveness of these representations. However, constructing suitable contrastive pairs for CL, especially within Knowledge Graphs (KGs), has been challenging. In this paper, we proposed a novel node-based contrastive learning method for knowledge graph embedding, NC-KGE. NC-KGE enhances knowledge extraction in embeddings and speeds up training convergence by constructing appropriate contrastive node pairs on KGs. This scheme can be easily integrated with other knowledge graph embedding (KGE) methods. For downstream task such as biochemical relationship prediction, we have incorporated a relation-aware attention mechanism into NC-KGE, focusing on the semantic relationships and node interactions. Extensive experiments show that NC-KGE performs competitively with state-of-the-art models on public datasets like FB15k-237 and WN18RR. Particularly in biomedical relationship prediction tasks, NC-KGE outperforms all baselines on datasets such as PharmKG8k-28, DRKG17k-21, and BioKG72k-14, especially in predicting drug combination relationships. We release our code at https://github.com/zhi520/NC-KGE.
    摘要 biomedical知识图(BKG)的嵌入生成了可靠的表示,对于许多人工智能应用有益,如预测药物组合和理解疾病药物关系。然而,在知识图(KG)中构建适当的对照对是一项挑战。在这篇论文中,我们提出了一种新的节点基于对照学习方法 для知识图嵌入(NC-KGE)。NC-KGE在KG中构建适当的对照节点对,从而提高了知识EXTRACTION在嵌入中的效果,并加速了训练的收敛。此方法可以轻松地与其他知识图嵌入(KGE)方法结合使用。在下游任务中,我们将一种关系意识注意力机制 incorporated into NC-KGE,这种机制会话焦点在知识图中的semantic关系和节点互动。我们在公共数据集FB15k-237和WN18RR进行了广泛的实验,结果显示NC-KGE与状态之前模型相当竞争。特别是在生物医学关系预测任务中,NC-KGE在PharmKG8k-28、DRKG17k-21和BioKG72k-14等数据集上表现出色,尤其是预测药物组合关系。我们在https://github.com/zhi520/NC-KGE中发布了代码。

Learning to Rank Context for Named Entity Recognition Using a Synthetic Dataset

  • paper_url: http://arxiv.org/abs/2310.10118
  • repo_url: None
  • paper_authors: Arthur Amalvy, Vincent Labatut, Richard Dufour
  • for: Improving named entity recognition (NER) accuracy, especially in long documents.
  • methods: Uses Alpaca, an instruction-tuned large language model (LLM), to generate a synthetic context retrieval training dataset, and then trains a BERT-based neural context retriever on it.
  • results: On an English literary dataset composed of the first chapters of 40 books, the approach outperforms several retrieval baselines for the NER task.
    Abstract While recent pre-trained transformer-based models can perform named entity recognition (NER) with great accuracy, their limited range remains an issue when applied to long documents such as whole novels. To alleviate this issue, a solution is to retrieve relevant context at the document level. Unfortunately, the lack of supervision for such a task means one has to settle for unsupervised approaches. Instead, we propose to generate a synthetic context retrieval training dataset using Alpaca, an instructiontuned large language model (LLM). Using this dataset, we train a neural context retriever based on a BERT model that is able to find relevant context for NER. We show that our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of 40 books.
    摘要 现代预训练变换器模型可以实现命名实体识别(NER)的高精度,但它们的限制范围是一个问题,应用于整个小说等长文档时。为解决这个问题,我们提议在文档级别上提取相关的上下文。然而,由于缺乏监督,我们必须采用无监督方法。我们使用Alpaca,一个指导调整的大型自然语言模型(LLM),生成一个假数据集,并使用这个数据集来训练一个基于BERT模型的神经网络上下文检索器。我们表明,我们的方法在英文文学 dataset 上(由40本第一章组成)在 NER 任务上超过多个检索基准。

End-to-end Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis

  • paper_url: http://arxiv.org/abs/2310.10106
  • repo_url: None
  • paper_authors: Can Cui, Imran Ahamad Sheikh, Mostafa Sadeghi, Emmanuel Vincent
  • for: Develops an end-to-end multichannel speaker-attributed automatic speech recognition (MC-SA-ASR) system that efficiently integrates ASR and speaker identification modules in a multichannel setting.
  • methods: Combines a Conformer-based encoder with multi-frame cross-channel attention and a speaker-attributed Transformer-based decoder.
  • results: On simulated mixtures of LibriSpeech data, the system reduces word error rate (WER) by up to 12% and 16% relative compared to previously proposed single-channel and multichannel approaches, respectively; the paper also analyses the impact of input features (including multichannel magnitude and phase information) on ASR performance and confirms effectiveness on real-world AMI meeting transcription.
    Abstract We present an end-to-end multichannel speaker-attributed automatic speech recognition (MC-SA-ASR) system that combines a Conformer-based encoder with multi-frame crosschannel attention and a speaker-attributed Transformer-based decoder. To the best of our knowledge, this is the first model that efficiently integrates ASR and speaker identification modules in a multichannel setting. On simulated mixtures of LibriSpeech data, our system reduces the word error rate (WER) by up to 12% and 16% relative compared to previously proposed single-channel and multichannel approaches, respectively. Furthermore, we investigate the impact of different input features, including multichannel magnitude and phase information, on the ASR performance. Finally, our experiments on the AMI corpus confirm the effectiveness of our system for real-world multichannel meeting transcription.
    摘要 我们提出了一个综合式多通道自动语音识别(MC-SA-ASR)系统,该系统使用Conformer编码器和Speaker-attributed Transformer解码器。我们认为这是首个在多通道设定下集成ASR和speaker认知模块的模型。在对LibriSpeech数据的模拟混合物中,与此前的单通道和多通道方法相比,我们的系统分别将词错误率(WER)相对降低最多12%和16%。此外,我们还研究了不同的输入特征,包括多通道幅度和相位信息,对ASR性能的影响。最后,我们在AMI corpus上进行了实验,证明了我们的系统在实际多通道会议记录中的有效性。

Decomposed Prompt Tuning via Low-Rank Reparameterization

  • paper_url: http://arxiv.org/abs/2310.10094
  • repo_url: https://github.com/xyaoooo/dpt
  • paper_authors: Yao Xiao, Lu Xu, Jiaxi Li, Wei Lu, Xiaoli Li
  • for: Improving the efficiency and effectiveness of prompt tuning.
  • methods: Initializes the soft prompt with low-rank matrices (decomposed prompt tuning), exploiting the observation that soft prompts exhibit a low intrinsic rank (a minimal sketch follows this entry).
  • results: Experiments on the SuperGLUE benchmark in both high-resource and low-resource scenarios demonstrate the effectiveness of the proposed method while significantly reducing the number of trainable parameters.
    Abstract While prompt tuning approaches have achieved competitive performance with high efficiency, we observe that they invariably employ the same initialization process, wherein the soft prompt is either randomly initialized or derived from an existing embedding vocabulary. In contrast to these conventional methods, this study aims to investigate an alternative way to derive soft prompt. Our empirical studies show that the soft prompt typically exhibits a low intrinsic rank characteristic. With such observations, we propose decomposed prompt tuning, a novel approach that utilizes low-rank matrices to initialize the soft prompt. Through the low-rank reparameterization, our method significantly reduces the number of trainable parameters while maintaining effectiveness. Experimental results on the SuperGLUE benchmark in both high-resource and low-resource scenarios demonstrate the effectiveness of the proposed method.
    摘要 而Prompt调整方法已经实现了高效率的竞争性表现,但我们发现这些传统方法 invariably使用相同的初始化过程,即软提示是随机初始化或者基于现有的Embedding词汇。相比之下,这一研究旨在调查一种不同的软提示 derive的方法。我们的实验研究表明,软提示通常具有低内在矩阵特征。基于这些观察,我们提议了分解Prompt调整,一种使用低级数矩阵初始化软提示的新方法。通过低级数重parameter化,我们的方法可以减少训练参数的数量,同时保持效果。SuperGLUEbenchmark上的实验结果表明,我们的方法在高资源和低资源情况下都具有显著的效果。
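A minimal sketch of a low-rank soft prompt: instead of training a full (prompt_len x hidden) matrix, the prompt is parameterised as the product of two small matrices A (prompt_len x r) and B (r x hidden), shrinking the number of trainable parameters. Sizes and the initialisation scale are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LowRankSoftPrompt(nn.Module):
    def __init__(self, prompt_len=100, hidden=768, rank=4):
        super().__init__()
        self.a = nn.Parameter(torch.randn(prompt_len, rank) * 0.02)
        self.b = nn.Parameter(torch.randn(rank, hidden) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # Reconstruct the soft prompt and prepend it to the input embeddings.
        prompt = (self.a @ self.b).unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

prompt = LowRankSoftPrompt()
x = torch.randn(2, 32, 768)                 # (batch, seq, hidden) token embeddings
print(prompt(x).shape)                      # torch.Size([2, 132, 768])
# Trainable parameters: 100*4 + 4*768 = 3472, versus 100*768 = 76800 for a full soft prompt.
```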

JMedLoRA:Medical Domain Adaptation on Japanese Large Language Models using Instruction-tuning

  • paper_url: http://arxiv.org/abs/2310.10083
  • repo_url: None
  • paper_authors: Issey Sukeda, Masahiro Suzuki, Hiroki Sakaji, Satoshi Kodera
  • for: Explores how to adapt large language models (LLMs) to the Japanese medical domain and how domain adaptation affects performance.
  • methods: Applies LoRA-based instruction-tuning so the models absorb medical domain-specific knowledge, evaluated on Japanese medical question-answering (a minimal LoRA sketch follows this entry).
  • results: Finds that LoRA-based instruction-tuning can partially incorporate domain-specific knowledge into LLMs, with larger models showing more pronounced gains; the results highlight the potential of adapting English-centric models for Japanese applications while underscoring the persisting limitations of Japanese-centric models, helping medical institutions fine-tune and operate models without relying on external services.
    Abstract In the ongoing wave of impact driven by large language models (LLMs) like ChatGPT, the adaptation of LLMs to medical domain has emerged as a crucial research frontier. Since mainstream LLMs tend to be designed for general-purpose applications, constructing a medical LLM through domain adaptation is a huge challenge. While instruction-tuning is used to fine-tune some LLMs, its precise roles in domain adaptation remain unknown. Here we show the contribution of LoRA-based instruction-tuning to performance in Japanese medical question-answering tasks. In doing so, we employ a multifaceted evaluation for multiple-choice questions, including scoring based on "Exact match" and "Gestalt distance" in addition to the conventional accuracy. Our findings suggest that LoRA-based instruction-tuning can partially incorporate domain-specific knowledge into LLMs, with larger models demonstrating more pronounced effects. Furthermore, our results underscore the potential of adapting English-centric models for Japanese applications in domain adaptation, while also highlighting the persisting limitations of Japanese-centric models. This initiative represents a pioneering effort in enabling medical institutions to fine-tune and operate models without relying on external services.
    摘要 在现代语言模型(LLM)如ChatGPT的浪潮中,适应医疗领域的LLM研究已成为一个关键的前沿领域。由于主流LLM通常是设计为通用应用程序,因此在医疗领域中构建一个LLM通过领域适应是一项巨大的挑战。而在某些LLM上进行了 instrucion-tuning,其precise roles在领域适应仍然未知。在这里,我们展示了LoRA基于的instruction-tuning对于日本医学问答任务的贡献。为此,我们采用了多方面的评估方法,包括基于"精确匹配"和"格式距离"的分数,以及传统的准确率。我们的发现表明,LoRA基于的instruction-tuning可以部分地将领域特定知识引入LLM,大型模型表现更加明显。此外,我们的结果还指出了将英语中心模型适应日本应用的潜在优势,同时也高亮了日本中心模型的限制。这个实验代表了医疗机构可以通过自主定制和操作模型而不需要依赖于外部服务的先驱性努力。
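A minimal sketch of a LoRA adapter on a frozen linear layer, the mechanism behind LoRA-based instruction tuning: the pretrained weight stays frozen and only the low-rank matrices A and B are updated, so domain knowledge is absorbed with a small number of trainable parameters. Rank, scaling, and initialisation are illustrative defaults.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                  # freeze pretrained weight
        self.a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.b = nn.Parameter(torch.zeros(base.out_features, rank))  # B=0 -> no change at init
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.a.T @ self.b.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 768)).shape)      # torch.Size([2, 768])
```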

Let’s reward step by step: Step-Level reward model as the Navigators for Reasoning

  • paper_url: http://arxiv.org/abs/2310.10080
  • repo_url: None
  • paper_authors: Qianli Ma, Haotian Zhou, Tingkai Liu, Jianbo Yuan, Pengfei Liu, Yang You, Hongxia Yang
  • for: Investigates whether step-level feedback or search mechanisms during inference can improve the accuracy of multi-step reasoning with Large Language Models (LLMs).
  • methods: Uses a Process-Supervised Reward Model (PRM) that provides step-by-step feedback during training (akin to Proximal Policy Optimization (PPO) or rejection sampling), and proposes a heuristic greedy search algorithm that uses the PRM's step-level feedback to optimize the reasoning paths explored by the LLM in multi-step tasks (a minimal sketch follows this entry).
  • results: The tailored PRM outperforms Chain of Thought on mathematical benchmarks such as GSM8K and MATH; a new method for automatically generating step-level reward data for coding tasks yields similar improvements on code generation, highlighting the robustness of the reward-model-based approach to reasoning.
    Abstract Recent years have seen considerable advancements in multi-step reasoning with Large Language Models (LLMs). Previous studies have elucidated the merits of integrating feedback or search mechanisms during model inference to improve reasoning accuracy. The Process-Supervised Reward Model (PRM) typically furnishes LLMs with step-by-step feedback during the training phase, akin to Proximal Policy Optimization (PPO) or rejection sampling. Our objective is to examine the efficacy of PRM in the inference phase to help discern the optimal solution paths for multi-step tasks such as mathematical reasoning and code generation. To this end, we propose a heuristic greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs. This tailored PRM demonstrated enhanced results compared to Chain of Thought (CoT) on mathematical benchmarks like GSM8K and MATH. Additionally, to explore the versatility of our approach, we develop a novel method to automatically generate a step-level reward dataset for coding tasks and observe similar improved performance in code generation tasks, thus highlighting the robust nature of our reward-model-based approach to inference for reasoning tasks.
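The heuristic greedy search can be pictured as repeatedly asking the LLM for candidate next steps and letting the reward model pick the best one. The sketch below makes that loop explicit; propose_steps and step_reward are hypothetical stand-ins for the sampler and the PRM, not the paper's actual components.

```python
# Sketch of heuristic greedy search guided by a step-level (process) reward
# model. `propose_steps` and `step_reward` are hypothetical placeholders for an
# LLM that samples candidate next steps and a PRM that scores a partial
# solution; neither reproduces the paper's actual models.
from typing import Callable, List

def prm_greedy_search(
    question: str,
    propose_steps: Callable[[str, List[str], int], List[str]],
    step_reward: Callable[[str, List[str]], float],
    max_steps: int = 8,
    candidates_per_step: int = 4,
) -> List[str]:
    steps: List[str] = []
    for _ in range(max_steps):
        proposals = propose_steps(question, steps, candidates_per_step)
        if not proposals:
            break
        # Greedily keep the candidate step the reward model scores highest.
        best = max(proposals, key=lambda s: step_reward(question, steps + [s]))
        steps.append(best)
        if best.strip().lower().startswith("final answer"):
            break
    return steps

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    def propose_steps(q, steps, k):
        return [] if len(steps) >= 2 else [f"step {len(steps) + 1} (v{i})" for i in range(k)]
    def step_reward(q, steps):
        return -len(steps[-1])  # prefer shorter steps in this toy example
    print(prm_greedy_search("2 + 2 = ?", propose_steps, step_reward))
```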

Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks

  • paper_url: http://arxiv.org/abs/2310.10077
  • repo_url: None
  • paper_authors: Shuyu Jiang, Xingshu Chen, Rui Tang
  • for: This paper aims to reveal the vulnerability of large language models (LLMs) to compositional instruction attacks that can elicit harmful content, despite current approaches that focus on detecting and training against harmful prompts.
  • methods: The paper introduces an innovative technique called Compositional Instruction Attacks (CIA), which combines and encapsulates multiple instructions to hide harmful prompts within harmless ones. Two transformation methods, T-CIA and W-CIA, are also proposed to disguise harmful instructions as talking or writing tasks.
  • results: The paper achieves an attack success rate of 95%+ on safety assessment datasets and 83%+ for GPT-4, 91%+ for ChatGPT (gpt-3.5-turbo backed), and 91%+ for ChatGLM2 on harmful prompt datasets, demonstrating the effectiveness of CIA in eliciting harmful content from LLMs.
    Abstract Recently, large language models (LLMs) with powerful general capabilities have been increasingly integrated into various Web applications, while undergoing alignment training to ensure that the generated content aligns with user intent and ethics. Unfortunately, they still run the risk of generating harmful content such as hate speech and criminal activity in practical applications. Current approaches primarily rely on detecting, collecting, and training against harmful prompts to prevent such risks. However, they typically focus on "superficial" harmful prompts with a solitary intent, ignoring composite attack instructions with multiple intentions that can easily elicit harmful content in real-world scenarios. In this paper, we introduce an innovative technique for obfuscating harmful instructions: Compositional Instruction Attacks (CIA), which refers to attacking by combination and encapsulation of multiple instructions. CIA hides harmful prompts within instructions of harmless intentions, making it impossible for the model to identify underlying malicious intentions. Furthermore, we implement two transformation methods, known as T-CIA and W-CIA, to automatically disguise harmful instructions as talking or writing tasks, making them appear harmless to LLMs. We evaluated CIA on GPT-4, ChatGPT, and ChatGLM2 with two safety assessment datasets and two harmful prompt datasets. It achieves an attack success rate of 95%+ on safety assessment datasets, and 83%+ for GPT-4, 91%+ for ChatGPT (gpt-3.5-turbo backed) and ChatGLM2-6B on harmful prompt datasets. Our approach reveals the vulnerability of LLMs to such compositional instruction attacks that harbor underlying harmful intentions, contributing significantly to LLM security development. Warning: this paper may contain offensive or upsetting content!

Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for Code Generation

  • paper_url: http://arxiv.org/abs/2310.10698
  • repo_url: None
  • paper_authors: Yingwei Ma, Yue Yu, Shanshan Li, Yu Jiang, Yong Guo, Yuanliang Zhang, Yutao Xie, Xiangke Liao
  • for: Improve the accuracy of automated code generation by fully exploiting the semantic mapping capability of large language models (LLMs).
  • methods: Proposes the "Semantic Chain-of-Thought" (SeCoT) approach, in which the LLM automatically derives semantic information about the source code (such as data flow and control flow) to improve code generation accuracy; a prompting sketch follows the abstract below.
  • results: Achieves state-of-the-art accuracy on three DL benchmarks, demonstrating that SeCoT helps large LLMs generate code more accurately.
    Abstract Large language models (LLMs) have showcased remarkable prowess in code generation. However, automated code generation is still challenging since it requires a high-level semantic mapping between natural language requirements and code. Most existing LLM-based approaches for code generation rely on decoder-only causal language models that treat code merely as plain text tokens, i.e., feeding the requirements as a prompt and outputting code as a flat sequence of tokens, potentially missing the rich semantic features inherent in source code. To bridge this gap, this paper proposes the "Semantic Chain-of-Thought" approach, named SeCoT, to introduce the semantic information of code. Our motivation is that the semantic information of the source code (e.g., data flow and control flow) describes more precise program execution behavior, intention, and function. By guiding the LLM to consider and integrate semantic information, we can achieve a more granular understanding and representation of code, enhancing code generation accuracy. Whereas traditional techniques leveraging such semantic information require complex static or dynamic code analysis to obtain features such as data flow and control flow, SeCoT demonstrates that this process can be fully automated via the intrinsic capabilities of LLMs (i.e., in-context learning), while being generalizable and applicable to challenging domains. While SeCoT can be applied with different LLMs, this paper focuses on the powerful GPT-style models: ChatGPT (closed-source) and WizardCoder (open-source). The experimental study on three popular DL benchmarks (i.e., HumanEval, HumanEval-ET, and MBPP) shows that SeCoT achieves state-of-the-art performance, greatly improving the potential of large models for code generation.
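A minimal sketch of how a semantic chain-of-thought prompt could be staged is shown below: the model is first asked for a data-flow/control-flow analysis, which is then folded back into the final code-generation prompt. The ask_llm wrapper and the prompt wording are assumptions for illustration and are not taken from the paper.

```python
# Sketch of a two-stage "semantic chain-of-thought" prompt for code generation:
# the model first reasons about data flow and control flow, and that analysis
# is fed back in before the final code is requested. `ask_llm` is a
# hypothetical wrapper around a chat model.
from typing import Callable

def secot_generate(requirement: str, ask_llm: Callable[[str], str]) -> str:
    analysis_prompt = (
        "You are writing a Python function.\n"
        f"Requirement: {requirement}\n"
        "Before coding, describe the intended data flow (inputs, intermediate "
        "values, outputs) and control flow (branches, loops, early returns)."
    )
    semantics = ask_llm(analysis_prompt)

    code_prompt = (
        f"Requirement: {requirement}\n"
        f"Semantic analysis:\n{semantics}\n"
        "Now write the Python function that follows this analysis. "
        "Return only code."
    )
    return ask_llm(code_prompt)

if __name__ == "__main__":
    # Canned responses stand in for a real model so the sketch runs end to end.
    canned = iter([
        "Input: list of ints -> loop accumulating a running sum -> return sum.",
        "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s",
    ])
    print(secot_generate("Sum a list of integers.", lambda _prompt: next(canned)))
```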

EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge

  • paper_url: http://arxiv.org/abs/2310.10050
  • repo_url: None
  • paper_authors: Tom Bryan, Jacob Carlson, Abhishek Arora, Melissa Dell
  • for: liberating public domain texts at scale
  • methods: EffOCR (EfficientOCR), a novel open-source OCR package that uses a character or word-level image retrieval approach, is accurate and sample efficient to train and deploy
  • results: EffOCR was used to digitize 20 million historical U.S. newspaper scans with high accuracy, and achieved zero-shot performance on randomly selected documents from the U.S. National Archives, as well as accurately digitizing Japanese documents that other OCR solutions failed on.
    Abstract Billions of public domain documents remain trapped in hard copy or lack an accurate digitization. Modern natural language processing methods cannot be used to index, retrieve, and summarize their texts; conduct computational textual analyses; or extract information for statistical analyses, and these texts cannot be incorporated into language model training. Given the diversity and sheer quantity of public domain texts, liberating them at scale requires optical character recognition (OCR) that is accurate, extremely cheap to deploy, and sample-efficient to customize to novel collections, languages, and character sets. Existing OCR engines, largely designed for small-scale commercial applications in high resource languages, often fall short of these requirements. EffOCR (EfficientOCR), a novel open-source OCR package, meets both the computational and sample efficiency requirements for liberating texts at scale by abandoning the sequence-to-sequence architecture typically used for OCR, which takes representations from a learned vision model as inputs to a learned language model. Instead, EffOCR models OCR as a character or word-level image retrieval problem. EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language. Models in the EffOCR model zoo can be deployed off-the-shelf with only a few lines of code. Importantly, EffOCR also allows for easy, sample efficient customization with a simple model training interface and minimal labeling requirements due to its sample efficiency. We illustrate the utility of EffOCR by cheaply and accurately digitizing 20 million historical U.S. newspaper scans, evaluating zero-shot performance on randomly selected documents from the U.S. National Archives, and accurately digitizing Japanese documents for which all other OCR solutions failed.
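The retrieval framing can be sketched as embedding each localized character crop and matching it against an index of reference glyph embeddings. In the toy example below, embed is a placeholder for a trained vision encoder; EffOCR's actual models, localization stage, and index structure are not reproduced.

```python
# Sketch of OCR framed as character-level image retrieval: each character crop
# is embedded and matched to its nearest reference glyph embedding. `embed` is
# a hypothetical stand-in for a learned vision encoder.
import numpy as np

def embed(crop: np.ndarray) -> np.ndarray:
    # Placeholder encoder: flatten and L2-normalize. A real system would use a
    # trained CNN/ViT encoder here.
    v = crop.astype(np.float32).ravel()
    return v / (np.linalg.norm(v) + 1e-8)

def build_index(reference_crops: dict):
    chars = list(reference_crops)
    matrix = np.stack([embed(reference_crops[c]) for c in chars])
    return chars, matrix

def recognize(crop: np.ndarray, chars, matrix) -> str:
    sims = matrix @ embed(crop)          # cosine similarity via dot product
    return chars[int(np.argmax(sims))]   # nearest reference glyph wins

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    refs = {c: rng.random((16, 16)) for c in "abc"}
    chars, matrix = build_index(refs)
    query = refs["b"] + rng.normal(scale=0.01, size=(16, 16))  # noisy "b"
    print(recognize(query, chars, matrix))                     # expected: b
```

Because the model only has to learn what characters look like, new collections can be supported by adding labeled crops to the index rather than retraining a sequence model.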

Improving Large Language Model Fine-tuning for Solving Math Problems

  • paper_url: http://arxiv.org/abs/2310.10047
  • repo_url: None
  • paper_authors: Yixin Liu, Avi Singh, C. Daniel Freeman, John D. Co-Reyes, Peter J. Liu
  • for: Address the high cost and limited accuracy of large language models (LLMs) when solving math problems.
  • methods: Investigates three fine-tuning strategies: (1) solution fine-tuning, (2) solution-cluster re-ranking, and (3) multi-task sequential fine-tuning.
  • results: Using the MATH dataset, the three fine-tuning strategies are studied on PaLM 2 models, with three findings: (1) the quality and style of the step-by-step solutions used for fine-tuning have a significant impact on model performance; (2) solution re-ranking and majority voting each improve performance when used alone, and combining them yields a further boost (see the sketch after the abstract); (3) multi-task sequential fine-tuning that separates solution generation and evaluation outperforms the solution fine-tuning baseline.
    Abstract Despite their success in many natural language tasks, solving math problems remains a significant challenge for large language models (LLMs). A large gap exists between LLMs' pass-at-one and pass-at-N performance in solving math problems, suggesting LLMs might be close to finding correct solutions, motivating our exploration of fine-tuning methods to unlock LLMs' performance. Using the challenging MATH dataset, we investigate three fine-tuning strategies: (1) solution fine-tuning, where we fine-tune to generate a detailed solution for a given math problem; (2) solution-cluster re-ranking, where the LLM is fine-tuned as a solution verifier/evaluator to choose among generated candidate solution clusters; (3) multi-task sequential fine-tuning, which integrates both solution generation and evaluation tasks together efficiently to enhance the LLM performance. With these methods, we present a thorough empirical study on a series of PaLM 2 models and find: (1) The quality and style of the step-by-step solutions used for fine-tuning can make a significant impact on the model performance; (2) While solution re-ranking and majority voting are both effective for improving the model performance when used separately, they can also be used together for an even greater performance boost; (3) Multi-task fine-tuning that sequentially separates the solution generation and evaluation tasks can offer improved performance compared with the solution fine-tuning baseline. Guided by these insights, we design a fine-tuning recipe that yields approximately 58.8% accuracy on the MATH dataset with fine-tuned PaLM 2-L models, an 11.2% accuracy improvement over the few-shot performance of pre-trained PaLM 2-L model with majority voting.
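A minimal sketch of solution-cluster re-ranking combined with majority voting is given below: sampled solutions are grouped by final answer, and each cluster is scored with both its vote share and a verifier score. The verifier_score function and the 50/50 weighting are illustrative assumptions, not the paper's recipe.

```python
# Sketch of solution-cluster re-ranking: sampled solutions are grouped by their
# final answer, and each cluster is scored by combining its vote share with a
# verifier score. `verifier_score` is a hypothetical stand-in for the
# fine-tuned evaluator; the weighting is an illustrative assumption.
from collections import defaultdict
from typing import Callable, List, Tuple

def rerank_clusters(
    solutions: List[Tuple[str, str]],            # (solution_text, final_answer)
    verifier_score: Callable[[str], float],      # higher = more likely correct
    vote_weight: float = 0.5,
) -> str:
    clusters = defaultdict(list)
    for text, answer in solutions:
        clusters[answer].append(text)

    def cluster_score(answer: str) -> float:
        texts = clusters[answer]
        votes = len(texts) / len(solutions)                  # majority voting
        verifier = max(verifier_score(t) for t in texts)     # best cluster member
        return vote_weight * votes + (1 - vote_weight) * verifier

    return max(clusters, key=cluster_score)

if __name__ == "__main__":
    sols = [("work A", "42"), ("work B", "42"), ("work C", "41")]
    print(rerank_clusters(sols, verifier_score=lambda t: 0.9 if "A" in t else 0.3))
```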

Empirical Study of Zero-Shot NER with ChatGPT

  • paper_url: http://arxiv.org/abs/2310.10035
  • repo_url: https://github.com/emma1066/zero-shot-ner-with-chatgpt
  • paper_authors: Tingyu Xie, Qi Li, Jian Zhang, Yan Zhang, Zuozhu Liu, Hongwei Wang
  • for: This work explores the performance of large language models (LLMs) on zero-shot information extraction, focusing on ChatGPT and the named entity recognition (NER) task.
  • methods: Inspired by the strong reasoning ability of LLMs, prevalent reasoning methods are adapted and tailored to NER. The NER task is decomposed into simpler, interrelated subproblems; syntactic prompting and tool augmentation stimulate the model's intermediate thinking; and self-consistency is adapted to NER via a two-stage majority voting strategy (sketched after the abstract).
  • results: The method achieves strong zero-shot NER performance on seven benchmarks, covering Chinese and English datasets and both domain-specific and general-domain scenarios. An error analysis with optimization suggestions is also provided, and the effectiveness of the method is further verified in few-shot settings and with other LLMs.
    Abstract Large language models (LLMs) have exhibited powerful capabilities in various natural language processing tasks. This work focuses on exploring LLM performance on zero-shot information extraction, with a focus on ChatGPT and the named entity recognition (NER) task. Inspired by the remarkable reasoning capability of LLMs on symbolic and arithmetic reasoning, we adapt the prevalent reasoning methods to NER and propose reasoning strategies tailored for NER. First, we explore a decomposed question-answering paradigm by breaking down the NER task into simpler subproblems by labels. Second, we propose syntactic augmentation to stimulate the model's intermediate thinking in two ways: syntactic prompting, which encourages the model to analyze the syntactic structure itself, and tool augmentation, which provides the model with the syntactic information generated by a parsing tool. Moreover, we adapt self-consistency to NER by proposing a two-stage majority voting strategy, which first votes for the most consistent mentions, then the most consistent types. The proposed methods achieve remarkable improvements for zero-shot NER across seven benchmarks, including Chinese and English datasets, and on both domain-specific and general-domain scenarios. In addition, we present a comprehensive analysis of the error types with suggestions for optimization directions. We also verify the effectiveness of the proposed methods in the few-shot setting and with other LLMs.
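The two-stage majority voting strategy can be sketched as follows: sampled outputs first vote on which mentions to keep, then vote on the type of each surviving mention. The 50% keep threshold below is an assumption for illustration and may differ from the paper's setting.

```python
# Sketch of two-stage majority voting for self-consistent NER: multiple sampled
# model outputs first vote on which mentions are real, then vote on the type of
# each surviving mention.
from collections import Counter
from typing import Dict, List

def two_stage_vote(samples: List[Dict[str, str]], keep_ratio: float = 0.5):
    # samples: each sampled output maps mention text -> predicted entity type
    n = len(samples)
    mention_votes = Counter(m for sample in samples for m in sample)

    result = {}
    for mention, votes in mention_votes.items():
        if votes / n < keep_ratio:            # stage 1: keep consistent mentions
            continue
        types = Counter(s[mention] for s in samples if mention in s)
        result[mention] = types.most_common(1)[0][0]   # stage 2: consistent type
    return result

if __name__ == "__main__":
    outputs = [
        {"Tokyo": "LOC", "ChatGPT": "ORG"},
        {"Tokyo": "LOC"},
        {"Tokyo": "GPE", "OpenAI": "ORG"},
    ]
    print(two_stage_vote(outputs))   # {'Tokyo': 'LOC'}
```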