cs.CV - 2023-10-26

Image Prior and Posterior Conditional Probability Representation for Efficient Damage Assessment

  • paper_url: http://arxiv.org/abs/2310.17801
  • repo_url: None
  • paper_authors: Jie Wei, Weicong Feng, Erik Blasch, Erika Ardiles-Cruz, Haibin Ling
  • for: This work aims to make Damage Assessment (DA) efficient and scalable for Human Assistance and Disaster Response (HADR) applications.
  • methods: The paper develops an image prior and posterior conditional probability (IP2CP) representation, an effective computational imaging representation that aligns matching pre- and post-disaster images and encodes them into a single image, which is then processed with deep learning to determine damage levels.
  • results: IP2CP performs well in two scenarios of practical importance: pixel-wise semantic segmentation and patch-based contrastive-learning-based global damage classification. The results show that IP2CP-based methods within the deep learning framework achieve the data and computational efficiency that is critical for HADR applications.
    Abstract It is important to quantify Damage Assessment (DA) for Human Assistance and Disaster Response (HADR) applications. In this paper, to achieve efficient and scalable DA in HADR, an image prior and posterior conditional probability (IP2CP) is developed as an effective computational imaging representation. Equipped with the IP2CP representation, the matching pre- and post-disaster images are effectively encoded into one image that is then processed using deep learning approaches to determine the damage levels. Two scenarios of crucial importance for the practical use of DA in HADR applications are examined: pixel-wise semantic segmentation and patch-based contrastive learning-based global damage classification. Results achieved by IP2CP in both scenarios demonstrate promising performances, showing that our IP2CP-based methods within the deep learning framework can effectively achieve data and computational efficiency, which is of utmost importance for the DA in HADR applications.

ControlLLM: Augment Language Models with Tools by Searching on Graphs

  • paper_url: http://arxiv.org/abs/2310.17796
  • repo_url: https://github.com/opengvlab/controlllm
  • paper_authors: Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Zhiheng Li, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, Wenhai Wang
  • for: This work develops a new framework that enables large language models (LLMs) to use multi-modal tools to solve complex real-world tasks.
  • methods: The framework comprises three key components: (1) a task decomposer that breaks a complex task into clear subtasks with well-defined inputs and outputs; (2) a Thoughts-on-Graph (ToG) paradigm that searches for the optimal solution path on a pre-built tool graph (a minimal sketch of such a path search follows the abstract); and (3) an execution engine that interprets the solution path and runs the tools efficiently on different computational devices.
  • results: Evaluations on diverse image, audio, and video processing tasks show that the framework outperforms existing methods in accuracy, efficiency, and versatility. Code is available at https://github.com/OpenGVLab/ControlLLM.
    Abstract We present ControlLLM, a novel framework that enables large language models (LLMs) to utilize multi-modal tools for solving complex real-world tasks. Despite the remarkable performance of LLMs, they still struggle with tool invocation due to ambiguous user prompts, inaccurate tool selection and parameterization, and inefficient tool scheduling. To overcome these challenges, our framework comprises three key components: (1) a \textit{task decomposer} that breaks down a complex task into clear subtasks with well-defined inputs and outputs; (2) a \textit{Thoughts-on-Graph (ToG) paradigm} that searches the optimal solution path on a pre-built tool graph, which specifies the parameter and dependency relations among different tools; and (3) an \textit{execution engine with a rich toolbox} that interprets the solution path and runs the tools efficiently on different computational devices. We evaluate our framework on diverse tasks involving image, audio, and video processing, demonstrating its superior accuracy, efficiency, and versatility compared to existing methods. The code is at https://github.com/OpenGVLab/ControlLLM .
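A minimal sketch of the path-search idea behind Thoughts-on-Graph: tools are nodes whose inputs and outputs define a dependency structure, and a solution path is a composition of tools that turns the available resources into the requested one. The toy tool set, resource types, and breadth-first search below are illustrative assumptions, not ControlLLM's actual implementation (see the linked repository for that).

```python
# Minimal sketch of searching a solution path on a tool-dependency graph.
# The tools, resource types, and BFS routine are illustrative assumptions,
# not ControlLLM's implementation.
from collections import deque

# Each tool maps a set of required input resource types to one output type.
TOOLS = {
    "image_captioning":   ({"image"}, "text"),
    "text_to_speech":     ({"text"}, "audio"),
    "image_segmentation": ({"image"}, "mask"),
    "image_inpainting":   ({"image", "mask"}, "image"),
}

def search_tool_path(available, target):
    """Breadth-first search over tool compositions, from the available
    resource types to the target resource type; returns a list of tool names."""
    queue = deque([(frozenset(available), [])])
    seen = {frozenset(available)}
    while queue:
        resources, path = queue.popleft()
        if target in resources:
            return path
        for name, (inputs, output) in TOOLS.items():
            if inputs <= resources and output not in resources:
                nxt = frozenset(resources | {output})
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [name]))
    return None

# E.g. "describe this photo out loud": image -> text -> audio.
print(search_tool_path({"image"}, "audio"))  # ['image_captioning', 'text_to_speech']
```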

AutoCT: Automated CT registration, segmentation, and quantification

  • paper_url: http://arxiv.org/abs/2310.17780
  • repo_url: None
  • paper_authors: Zhe Bai, Abdelilah Essiari, Talita Perciano, Kristofer E. Bouchard
  • for: To provide a comprehensive CT imaging processing and analysis pipeline for both basic scientific development and clinical applications.
  • methods: An end-to-end pipeline that automates preprocessing, registration, segmentation, and quantitative analysis of 3D CT scans.
  • results: Enables atlas-based CT segmentation and quantification by leveraging diffeomorphic transformations; the localized features extracted from the deformation field support downstream statistical learning that may facilitate medical diagnostics (a label-warping sketch follows the abstract).
    Abstract The processing and analysis of computed tomography (CT) imaging is important for both basic scientific development and clinical applications. In AutoCT, we provide a comprehensive pipeline that integrates an end-to-end automatic preprocessing, registration, segmentation, and quantitative analysis of 3D CT scans. The engineered pipeline enables atlas-based CT segmentation and quantification leveraging diffeomorphic transformations through efficient forward and inverse mappings. The extracted localized features from the deformation field allow for downstream statistical learning that may facilitate medical diagnostics. On a lightweight and portable software platform, AutoCT provides a new toolkit for the CT imaging community to underpin the deployment of artificial intelligence-driven applications.
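Atlas-based segmentation with a deformation field ultimately amounts to warping atlas labels into the subject's space. The sketch below shows a nearest-neighbour label warp given a dense displacement field; the function name, array shapes, and displacement convention are assumptions for illustration, not AutoCT's actual API.

```python
# Minimal sketch of atlas-based label propagation: warp atlas labels into the
# subject volume using a displacement field. Shapes and names are assumptions,
# not AutoCT's API.
import numpy as np

def warp_labels(atlas_labels, displacement):
    """atlas_labels: (D, H, W) integer label volume in atlas space.
    displacement: (3, D, H, W) voxel displacements mapping subject -> atlas.
    Returns subject-space labels via nearest-neighbour sampling."""
    D, H, W = atlas_labels.shape
    zz, yy, xx = np.meshgrid(np.arange(D), np.arange(H), np.arange(W), indexing="ij")
    sz = np.clip(np.rint(zz + displacement[0]).astype(int), 0, D - 1)
    sy = np.clip(np.rint(yy + displacement[1]).astype(int), 0, H - 1)
    sx = np.clip(np.rint(xx + displacement[2]).astype(int), 0, W - 1)
    return atlas_labels[sz, sy, sx]

# Sanity check: a zero displacement field returns the atlas labels unchanged.
labels = np.random.randint(0, 4, size=(8, 8, 8))
assert np.array_equal(warp_labels(labels, np.zeros((3, 8, 8, 8))), labels)
```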

A Dataset of Relighted 3D Interacting Hands

  • paper_url: http://arxiv.org/abs/2310.17768
  • repo_url: None
  • paper_authors: Gyeongsik Moon, Shunsuke Saito, Weipeng Xu, Rohan Joshi, Julia Buffalini, Harley Bellan, Nicholas Rosen, Jesse Richardson, Mallorie Mize, Philippe de Bree, Tomas Simon, Bo Peng, Shubham Garg, Kevyn McPhail, Takaaki Shiratori
  • for: This work aims to provide a dataset with diverse, realistic image appearances and large-scale ground-truth 3D poses for two-hand interaction analysis.
  • methods: A state-of-the-art hand relighting network is combined with accurately tracked two-hand 3D poses to generate diverse and realistic images of interacting hands.
  • results: Compared with existing 3D interacting-hands datasets, the proposed Re:InterHand dataset offers more diverse and realistic image appearances together with larger-scale ground-truth 3D pose annotations.
    Abstract The two-hand interaction is one of the most challenging signals to analyze due to the self-similarity, complicated articulations, and occlusions of hands. Although several datasets have been proposed for the two-hand interaction analysis, all of them do not achieve 1) diverse and realistic image appearances and 2) diverse and large-scale groundtruth (GT) 3D poses at the same time. In this work, we propose Re:InterHand, a dataset of relighted 3D interacting hands that achieve the two goals. To this end, we employ a state-of-the-art hand relighting network with our accurately tracked two-hand 3D poses. We compare our Re:InterHand with existing 3D interacting hands datasets and show the benefit of it. Our Re:InterHand is available in https://mks0601.github.io/ReInterHand/.

SynergyNet: Bridging the Gap between Discrete and Continuous Representations for Precise Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.17764
  • repo_url: https://github.com/CandleLabAI/SynergyNet-WACV-2024
  • paper_authors: Vandan Gorade, Sparsh Mittal, Debesh Jha, Ulas Bagci
  • for: This paper aims to improve medical image segmentation by integrating discrete and continuous latent representations into existing encoder-decoder segmentation frameworks.
  • methods: A novel bottleneck architecture, SynergyNet, fuses discrete and continuous representations to harness complementary information, preserving both fine- and coarse-grained details (a simplified sketch of such a fusion follows the abstract).
  • results: On multi-organ and cardiac segmentation datasets, SynergyNet outperforms other state-of-the-art methods including TransUNet, improving Dice scores by 2.16% and Hausdorff scores by 11.13%. On skin lesion and brain tumor segmentation, it improves Intersection-over-Union scores by 1.71% and 8.58%, respectively.
    Abstract In recent years, continuous latent space (CLS) and discrete latent space (DLS) deep learning models have been proposed for medical image analysis for improved performance. However, these models encounter distinct challenges. CLS models capture intricate details but often lack interpretability in terms of structural representation and robustness due to their emphasis on low-level features. Conversely, DLS models offer interpretability, robustness, and the ability to capture coarse-grained information thanks to their structured latent space. However, DLS models have limited efficacy in capturing fine-grained details. To address the limitations of both DLS and CLS models, we propose SynergyNet, a novel bottleneck architecture designed to enhance existing encoder-decoder segmentation frameworks. SynergyNet seamlessly integrates discrete and continuous representations to harness complementary information and successfully preserves both fine and coarse-grained details in the learned representations. Our extensive experiment on multi-organ segmentation and cardiac datasets demonstrates that SynergyNet outperforms other state of the art methods, including TransUNet: dice scores improving by 2.16%, and Hausdorff scores improving by 11.13%, respectively. When evaluating skin lesion and brain tumor segmentation datasets, we observe a remarkable improvement of 1.71% in Intersection-over Union scores for skin lesion segmentation and of 8.58% for brain tumor segmentation. Our innovative approach paves the way for enhancing the overall performance and capabilities of deep learning models in the critical domain of medical image analysis.
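The core bottleneck idea, carrying a continuous latent and a discrete (quantized) latent side by side, can be sketched with a simple vector-quantization step followed by fusion. The codebook size, feature dimensions, and fusion-by-concatenation below are assumptions for illustration, not SynergyNet's actual architecture.

```python
# Minimal sketch of a bottleneck that combines a continuous latent with a
# discrete (vector-quantized) latent. Dimensions and the concatenation-based
# fusion are illustrative assumptions, not SynergyNet's design.
import torch
import torch.nn as nn

class DiscreteContinuousBottleneck(nn.Module):
    def __init__(self, channels=64, codebook_size=128):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, z):                                  # z: (B, C, H, W)
        B, C, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)        # (B*H*W, C)
        # Discrete branch: nearest codebook entry per spatial location.
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        z_q = self.codebook(idx).reshape(B, H, W, C).permute(0, 3, 1, 2)
        # Straight-through estimator keeps gradients flowing to the encoder.
        z_q = z + (z_q - z).detach()
        # Fuse the complementary continuous and discrete information.
        return self.fuse(torch.cat([z, z_q], dim=1))

out = DiscreteContinuousBottleneck()(torch.randn(2, 64, 16, 16))
print(out.shape)  # torch.Size([2, 64, 16, 16])
```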

Alzheimer's Disease Diagnosis by Deep Learning Using MRI-Based Approaches

  • paper_url: http://arxiv.org/abs/2310.17755
  • repo_url: None
  • paper_authors: Sarasadat Foroughipoor, Kimia Moradi, Hamidreza Bolhasani
  • for: This review examines the use of Magnetic Resonance Imaging (MRI) and deep learning algorithms for diagnosing Alzheimer's disease (AD).
  • methods: The surveyed studies acquire data with MRI and apply deep learning algorithms for feature extraction and pattern recognition to support early diagnosis and stage detection.
  • results: The reviewed studies indicate that combining MRI with deep learning can support early diagnosis of Alzheimer's disease and can identify the disease stage and specific symptoms.
    Abstract The most frequent kind of dementia of the nervous system, Alzheimer's disease, weakens several brain processes (such as memory) and eventually results in death. The clinical study uses magnetic resonance imaging to diagnose AD. Deep learning algorithms are capable of pattern recognition and feature extraction from the inputted raw data. As early diagnosis and stage detection are the most crucial elements in enhancing patient care and treatment outcomes, deep learning algorithms for MRI images have recently allowed for diagnosing a medical condition at the beginning stage and identifying particular symptoms of Alzheimer's disease. As a result, we aimed to analyze five specific studies focused on AD diagnosis using MRI-based deep learning algorithms between 2021 and 2023 in this study. To completely illustrate the differences between these techniques and comprehend how deep learning algorithms function, we attempted to explore selected approaches in depth.

Advancing Brain Tumor Detection: A Thorough Investigation of CNNs, Clustering, and SoftMax Classification in the Analysis of MRI Images

  • paper_url: http://arxiv.org/abs/2310.17720
  • repo_url: None
  • paper_authors: Jonayet Miah, Duc M Cao, Md Abu Sayed, Md Siam Taluckder, Md Sabbirul Haque, Fuad Mahmud
  • for: Early detection of brain tumors is crucial for effective treatment and patient outcomes; this study investigates Convolutional Neural Networks (CNNs) for brain tumor detection from MRI images.
  • methods: MRI scans are processed and fed into a CNN architecture; a SoftMax fully connected layer classifies the images with 98% accuracy (a generic sketch of such a classifier follows the abstract). Two additional classifiers, Radial Basis Function (RBF) and Decision Tree (DT), reach 98.24% and 95.64% accuracy, respectively, and a clustering method for feature extraction is introduced to further improve the CNN's accuracy.
  • results: The SoftMax classifier achieves the highest accuracy among the evaluated classifiers, reaching 99.52% on test data. Sensitivity, specificity, and precision are reported alongside accuracy to evaluate the network comprehensively.
    Abstract Brain tumors pose a significant global health challenge due to their high prevalence and mortality rates across all age groups. Detecting brain tumors at an early stage is crucial for effective treatment and patient outcomes. This study presents a comprehensive investigation into the use of Convolutional Neural Networks (CNNs) for brain tumor detection using Magnetic Resonance Imaging (MRI) images. The dataset, consisting of MRI scans from both healthy individuals and patients with brain tumors, was processed and fed into the CNN architecture. The SoftMax Fully Connected layer was employed to classify the images, achieving an accuracy of 98%. To evaluate the CNN's performance, two other classifiers, Radial Basis Function (RBF) and Decision Tree (DT), were utilized, yielding accuracy rates of 98.24% and 95.64%, respectively. The study also introduced a clustering method for feature extraction, improving CNN's accuracy. Sensitivity, Specificity, and Precision were employed alongside accuracy to comprehensively evaluate the network's performance. Notably, the SoftMax classifier demonstrated the highest accuracy among the categorizers, achieving 99.52% accuracy on test data. The presented research contributes to the growing field of deep learning in medical image analysis. The combination of CNNs and MRI data offers a promising tool for accurately detecting brain tumors, with potential implications for early diagnosis and improved patient care.
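For orientation, a CNN with a SoftMax-classified fully connected head of the kind described above can be sketched in a few lines. The layer sizes, input resolution, and two-class setup are assumptions for illustration; the paper does not specify this exact architecture.

```python
# Minimal sketch of a CNN with a SoftMax fully connected head for
# tumor / no-tumor MRI slices. Layer sizes and the 128x128 input are assumptions.
import torch
import torch.nn as nn

class TumorCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(64 * 4 * 4, num_classes)

    def forward(self, x):                          # x: (B, 1, 128, 128) MRI slices
        logits = self.classifier(self.features(x).flatten(1))
        return logits.softmax(dim=1)               # class probabilities

probs = TumorCNN()(torch.randn(4, 1, 128, 128))
print(probs.shape, probs.sum(dim=1))               # (4, 2); each row sums to 1
```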

Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model

  • paper_url: http://arxiv.org/abs/2310.17653
  • repo_url: None
  • paper_authors: Karsten Roth, Lukas Thede, Almut Sophia Koepke, Oriol Vinyals, Olivier Hénaff, Zeynep Akata
  • for: This paper examines how design decisions in training deep networks (architecture, data augmentation, optimization) lead the resulting models to learn distinct feature sets from the data.
  • methods: The authors analyze public model libraries containing thousands of models trained on canonical datasets such as ImageNet, pairing arbitrary pretrained models and studying what knowledge can be transferred between them (the standard distillation baseline they build on is sketched after the abstract).
  • results: For arbitrary pairings of pretrained models, each model extracts significant data context unavailable in the other, independent of overall performance. The authors propose a data-partitioning extension of knowledge distillation that enables transfer between nearly all pretrained models without external rankings, and show it can also be done unsupervised.
    Abstract Training deep networks requires various design decisions regarding for instance their architecture, data augmentation, or optimization. In this work, we find these training variations to result in networks learning unique feature sets from the data. Using public model libraries comprising thousands of models trained on canonical datasets like ImageNet, we observe that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other -- independent of overall performance. Given any arbitrary pairing of pretrained models and no external rankings (such as separate test sets, e.g. due to data privacy), we investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation -- a task made particularly difficult as additional knowledge can be contained in stronger, equiperformant or weaker models. Yet facilitating robust transfer in scenarios agnostic to pretrained model pairings would unlock auxiliary gains and knowledge fusion from any model repository without restrictions on model and problem specifics - including from weaker, lower-performance models. This work therefore provides an initial, in-depth exploration on the viability of such general-purpose knowledge transfer. Across large-scale experiments, we first reveal the shortcomings of standard knowledge distillation techniques, and then propose a much more general extension through data partitioning for successful transfer between nearly all pretrained models, which we show can also be done unsupervised. Finally, we assess both the scalability and impact of fundamental model properties on successful model-agnostic knowledge transfer.
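The starting point the paper analyzes is standard knowledge distillation, where a student matches a teacher's softened predictions. A minimal sketch of that baseline follows; the temperature and loss weighting are conventional defaults, and this is not the paper's proposed data-partitioning transfer.

```python
# Minimal sketch of standard knowledge distillation between two pretrained
# models: the student matches the teacher's temperature-softened predictions.
# This is the baseline the paper analyzes, not its data-partitioning extension.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft-target term: KL divergence between softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term keeps the student anchored to the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10),
                         torch.randint(0, 10, (8,)))
print(loss.item())
```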

A Coarse-to-Fine Pseudo-Labeling (C2FPL) Framework for Unsupervised Video Anomaly Detection

  • paper_url: http://arxiv.org/abs/2310.17650
  • repo_url: None
  • paper_authors: Anas Al-lahham, Nurbek Tastan, Zaigham Zaheer, Karthik Nandakumar
  • for: This work proposes a fully unsupervised video anomaly detection method that learns a complete system without any annotation or human supervision.
  • methods: A simple-but-effective two-stage pseudo-label generation framework, built on hierarchical divisive clustering and statistical hypothesis testing, produces segment-level (normal/anomaly) pseudo-labels that are then used to train a segment-level anomaly detector in a supervised manner (a simplified stand-in for the coarse labeling step is sketched after the abstract).
  • results: Extensive studies on two large-scale public datasets (UCF-Crime and XD-Violence) show that the proposed unsupervised approach outperforms all existing one-class classification and unsupervised methods while achieving performance comparable to state-of-the-art weakly supervised methods.
    Abstract Detection of anomalous events in videos is an important problem in applications such as surveillance. Video anomaly detection (VAD) is well-studied in the one-class classification (OCC) and weakly supervised (WS) settings. However, fully unsupervised (US) video anomaly detection methods, which learn a complete system without any annotation or human supervision, have not been explored in depth. This is because the lack of any ground truth annotations significantly increases the magnitude of the VAD challenge. To address this challenge, we propose a simple-but-effective two-stage pseudo-label generation framework that produces segment-level (normal/anomaly) pseudo-labels, which can be further used to train a segment-level anomaly detector in a supervised manner. The proposed coarse-to-fine pseudo-label (C2FPL) generator employs carefully-designed hierarchical divisive clustering and statistical hypothesis testing to identify anomalous video segments from a set of completely unlabeled videos. The trained anomaly detector can be directly applied on segments of an unseen test video to obtain segment-level, and subsequently, frame-level anomaly predictions. Extensive studies on two large-scale public-domain datasets, UCF-Crime and XD-Violence, demonstrate that the proposed unsupervised approach achieves superior performance compared to all existing OCC and US methods , while yielding comparable performance to the state-of-the-art WS methods.
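The coarse labeling step, splitting unlabeled video segments into a normal and an anomalous group purely from their feature statistics, can be approximated with a simple two-cluster split. The k-means clustering and the smaller-cluster-is-anomalous rule below are simplified stand-ins for the paper's hierarchical divisive clustering and statistical hypothesis testing.

```python
# Minimal sketch of coarse pseudo-label generation for unlabeled video
# segments: cluster segment features into two groups and treat the smaller
# group as anomalous. A simplified stand-in for the paper's hierarchical
# divisive clustering plus statistical hypothesis testing.
import numpy as np
from sklearn.cluster import KMeans

def coarse_pseudo_labels(segment_features):
    """segment_features: (N, D) array, one feature vector per video segment.
    Returns an (N,) array of pseudo-labels: 0 = normal, 1 = anomalous."""
    assignments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(segment_features)
    # Heuristic: anomalies are rare, so call the smaller cluster anomalous.
    anomalous_cluster = np.argmin(np.bincount(assignments, minlength=2))
    return (assignments == anomalous_cluster).astype(int)

feats = np.vstack([np.random.randn(95, 16), np.random.randn(5, 16) + 6.0])
labels = coarse_pseudo_labels(feats)
print(labels.sum(), "of", len(labels), "segments pseudo-labelled as anomalous")
```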

6-DoF Stability Field via Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.17649
  • repo_url: None
  • paper_authors: Takuma Yoneda, Tianchong Jiang, Gregory Shakhnarovich, Matthew R. Walter
  • for: 6-DoFusion is a generative model that generates 3D object poses that produce stable configurations of a given scene.
  • methods: A diffusion model incrementally refines a randomly initialized SE(3) pose to generate a sample from a learned, context-dependent distribution over stable poses.
  • results: The model can construct stable scenes involving novel object classes and improve the accuracy of state-of-the-art 3D pose estimation methods.
    Abstract A core capability for robot manipulation is reasoning over where and how to stably place objects in cluttered environments. Traditionally, robots have relied on object-specific, hand-crafted heuristics in order to perform such reasoning, with limited generalizability beyond a small number of object instances and object interaction patterns. Recent approaches instead learn notions of physical interaction, namely motion prediction, but require supervision in the form of labeled object information or come at the cost of high sample complexity, and do not directly reason over stability or object placement. We present 6-DoFusion, a generative model capable of generating 3D poses of an object that produces a stable configuration of a given scene. Underlying 6-DoFusion is a diffusion model that incrementally refines a randomly initialized SE(3) pose to generate a sample from a learned, context-dependent distribution over stable poses. We evaluate our model on different object placement and stacking tasks, demonstrating its ability to construct stable scenes that involve novel object classes as well as to improve the accuracy of state-of-the-art 3D pose estimation methods.

Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models

  • paper_url: http://arxiv.org/abs/2310.17642
  • repo_url: None
  • paper_authors: Tsun-Hsuan Wang, Alaa Maalouf, Wei Xiao, Yutong Ban, Alexander Amini, Guy Rosman, Sertac Karaman, Daniela Rus
  • for: This paper proposes an end-to-end autonomous driving system built on multi-modal foundation models to improve the system's robustness and adaptability.
  • methods: The authors use multi-modal foundation models spanning images and text, together with pixel/patch-aligned feature extraction, so that both fine spatial detail and semantics are captured and driving decisions can be queried by image and text.
  • results: The method achieves unprecedented results across diverse tests and is significantly more robust in out-of-distribution environments and scenarios; text is additionally used for data augmentation and policy debugging.
    Abstract As autonomous driving technology matures, end-to-end methodologies have emerged as a leading strategy, promising seamless integration from perception to control via deep learning. However, existing systems grapple with challenges such as unexpected open set environments and the complexity of black-box models. At the same time, the evolution of deep learning introduces larger, multimodal foundational models, offering multi-modal visual and textual understanding. In this paper, we harness these multimodal foundation models to enhance the robustness and adaptability of autonomous driving systems, enabling out-of-distribution, end-to-end, multimodal, and more explainable autonomy. Specifically, we present an approach to apply end-to-end open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text. To do so, we introduce a method to extract nuanced spatial (pixel/patch-aligned) features from transformers to enable the encapsulation of both spatial and semantic features. Our approach (i) demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations, and (ii) allows the incorporation of latent space simulation (via text) for improved training (data augmentation via text) and policy debugging. We encourage the reader to check our explainer video at https://www.youtube.com/watch?v=4n-DJf8vXxo&feature=youtu.be and to view the code and demos on our project webpage at https://drive-anywhere.github.io/.

DeepShaRM: Multi-View Shape and Reflectance Map Recovery Under Unknown Lighting

  • paper_url: http://arxiv.org/abs/2310.17632
  • repo_url: None
  • paper_authors: Kohei Yamashita, Shohei Nobuhara, Ko Nishino
  • for: accurately recovers object geometry in challenging settings of textureless, non-Lambertian objects under unknown natural illumination.
  • methods: novel multi-view method called DeepShaRM, which uses a deep reflectance map estimation network and a deep shape-from-shading network to bypass the ill-posed problem of reflectance and illumination decomposition.
  • results: state-of-the-art accuracy on this challenging task, demonstrated through extensive experiments on both synthetic and real-world data.
    Abstract Geometry reconstruction of textureless, non-Lambertian objects under unknown natural illumination (i.e., in the wild) remains challenging as correspondences cannot be established and the reflectance cannot be expressed in simple analytical forms. We derive a novel multi-view method, DeepShaRM, that achieves state-of-the-art accuracy on this challenging task. Unlike past methods that formulate this as inverse-rendering, i.e., estimation of reflectance, illumination, and geometry from images, our key idea is to realize that reflectance and illumination need not be disentangled and instead estimated as a compound reflectance map. We introduce a novel deep reflectance map estimation network that recovers the camera-view reflectance maps from the surface normals of the current geometry estimate and the input multi-view images. The network also explicitly estimates per-pixel confidence scores to handle global light transport effects. A deep shape-from-shading network then updates the geometry estimate expressed with a signed distance function using the recovered reflectance maps. By alternating between these two, and, most important, by bypassing the ill-posed problem of reflectance and illumination decomposition, the method accurately recovers object geometry in these challenging settings. Extensive experiments on both synthetic and real-world data clearly demonstrate its state-of-the-art accuracy.

A Survey on Transferability of Adversarial Examples across Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2310.17626
  • repo_url: https://github.com/jindonggu/awesome_adversarial_transferability
  • paper_authors: Jindong Gu, Xiaojun Jia, Pau de Jorge, Wenqain Yu, Xinwei Liu, Avery Ma, Yuan Xun, Anjun Hu, Ashkan Khakzar, Zhijiang Li, Xiaochun Cao, Philip Torr
  • for: This survey examines the transferability of adversarial examples across deep neural networks and the different approaches proposed to enhance it.
  • methods: The paper categorizes existing methodologies for improving adversarial transferability and discusses the fundamental principles guiding each approach (a minimal transfer-attack illustration follows the abstract).
  • results: Existing methods can enhance the transferability of adversarial examples, but challenges and open questions remain, including how transferability varies across model architectures; the survey also extends the discussion beyond image classification to other vision tasks.
    Abstract The emergence of Deep Neural Networks (DNNs) has revolutionized various domains, enabling the resolution of complex tasks spanning image recognition, natural language processing, and scientific problem-solving. However, this progress has also exposed a concerning vulnerability: adversarial examples. These crafted inputs, imperceptible to humans, can manipulate machine learning models into making erroneous predictions, raising concerns for safety-critical applications. An intriguing property of this phenomenon is the transferability of adversarial examples, where perturbations crafted for one model can deceive another, often with a different architecture. This intriguing property enables "black-box" attacks, circumventing the need for detailed knowledge of the target model. This survey explores the landscape of the adversarial transferability of adversarial examples. We categorize existing methodologies to enhance adversarial transferability and discuss the fundamental principles guiding each approach. While the predominant body of research primarily concentrates on image classification, we also extend our discussion to encompass other vision tasks and beyond. Challenges and future prospects are discussed, highlighting the importance of fortifying DNNs against adversarial vulnerabilities in an evolving landscape.
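Transferability itself is easy to demonstrate in the most basic setting the survey covers: craft a perturbation on a surrogate model and evaluate it on a different target model. The torchvision models, the one-step FGSM attack, and the epsilon below are arbitrary illustrative choices, and the random tensor stands in for a preprocessed image batch.

```python
# Minimal sketch of adversarial transferability: an FGSM example crafted on a
# surrogate (ResNet-18) is evaluated on a different target model (VGG-16).
# Model and epsilon choices are arbitrary illustrative defaults.
import torch
import torch.nn.functional as F
from torchvision import models

surrogate = models.resnet18(weights="DEFAULT").eval()
target = models.vgg16(weights="DEFAULT").eval()

def fgsm(model, x, y, eps=8 / 255):
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()      # one-step L_inf attack

x = torch.rand(1, 3, 224, 224)                     # stand-in for a real image batch
y = surrogate(x).argmax(dim=1)                     # surrogate's prediction as label
x_adv = fgsm(surrogate, x, y)
# If the target's prediction flips on x_adv, the perturbation transferred.
flipped = (target(x_adv).argmax(dim=1) != target(x).argmax(dim=1)).item()
print("target changed its prediction:", flipped)
```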

Noise-Free Score Distillation

  • paper_url: http://arxiv.org/abs/2310.17590
  • repo_url: https://github.com/orenkatzir/nfsd
  • paper_authors: Oren Katzir, Or Patashnik, Daniel Cohen-Or, Dani Lischinski
  • for: This paper revisits text-to-content generation in non-image domains by re-interpreting and improving the Score Distillation Sampling (SDS) process.
  • methods: The paper proposes a Noise-Free Score Distillation (NFSD) process built on a straightforward interpretation of SDS: the need for large Classifier-Free Guidance (CFG) scales stems from the distillation of an undesired noise term, which NFSD avoids with minimal modifications to the original SDS framework, enabling more effective distillation of pre-trained text-to-image diffusion models at a nominal CFG scale (the standard SDS gradient is written out after the abstract for reference).
  • results: Qualitative comparisons against SDS and several other methods show that NFSD avoids over-smoothing while keeping the generated data realistic and faithful to the prompt.
    Abstract Score Distillation Sampling (SDS) has emerged as the de facto approach for text-to-content generation in non-image domains. In this paper, we reexamine the SDS process and introduce a straightforward interpretation that demystifies the necessity for large Classifier-Free Guidance (CFG) scales, rooted in the distillation of an undesired noise term. Building upon our interpretation, we propose a novel Noise-Free Score Distillation (NFSD) process, which requires minimal modifications to the original SDS framework. Through this streamlined design, we achieve more effective distillation of pre-trained text-to-image diffusion models while using a nominal CFG scale. This strategic choice allows us to prevent the over-smoothing of results, ensuring that the generated data is both realistic and complies with the desired prompt. To demonstrate the efficacy of NFSD, we provide qualitative examples that compare NFSD and SDS, as well as several other methods.
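For reference, the SDS gradient that NFSD re-interprets is commonly written as below; this is standard background notation, not the NFSD formulation itself. Here $x = g(\theta)$ is the rendered content, $x_t$ its noised version at timestep $t$, $y$ the text prompt, $\epsilon_\phi$ the pretrained diffusion model's noise prediction, and $w(t)$ a timestep weighting.

$$\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\epsilon_\phi(x_t;\, y,\, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right]$$

In these terms, the paper's interpretation is that the residual $\epsilon_\phi(x_t; y, t) - \epsilon$ carries an undesired noise component, and distilling that component is what makes large CFG scales necessary; NFSD modifies the distilled term to avoid it at a nominal CFG scale.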

Global Structure-Aware Diffusion Process for Low-Light Image Enhancement

  • paper_url: http://arxiv.org/abs/2310.17577
  • repo_url: https://github.com/jinnh/GSAD
  • paper_authors: Jinhui Hou, Zhiyu Zhu, Junhui Hou, Hui Liu, Huanqiang Zeng, Hui Yuan
  • for: This paper studies a diffusion-based framework for low-light image enhancement.
  • methods: The method regularizes the ODE-trajectory inherent to the diffusion model, motivated by recent findings that low-curvature ODE-trajectories yield a stable and effective diffusion process. A global structure-aware regularization term, anchored in the intrinsic non-local structures of image data, gradually facilitates the preservation of complicated details and the augmentation of contrast during diffusion, and an uncertainty-guided regularization technique relaxes constraints on the most extreme regions of the image.
  • results: Experiments show that the proposed diffusion-based framework, complemented by rank-informed regularization, achieves distinguished low-light enhancement performance, with substantial gains in image quality, noise suppression, and contrast amplification over previous methods. Code is available at https://github.com/jinnh/GSAD.
    Abstract This paper studies a diffusion-based framework to address the low-light image enhancement problem. To harness the capabilities of diffusion models, we delve into this intricate process and advocate for the regularization of its inherent ODE-trajectory. To be specific, inspired by the recent research that low curvature ODE-trajectory results in a stable and effective diffusion process, we formulate a curvature regularization term anchored in the intrinsic non-local structures of image data, i.e., global structure-aware regularization, which gradually facilitates the preservation of complicated details and the augmentation of contrast during the diffusion process. This incorporation mitigates the adverse effects of noise and artifacts resulting from the diffusion process, leading to a more precise and flexible enhancement. To additionally promote learning in challenging regions, we introduce an uncertainty-guided regularization technique, which wisely relaxes constraints on the most extreme regions of the image. Experimental evaluations reveal that the proposed diffusion-based framework, complemented by rank-informed regularization, attains distinguished performance in low-light enhancement. The outcomes indicate substantial advancements in image quality, noise suppression, and contrast amplification in comparison with state-of-the-art methods. We believe this innovative approach will stimulate further exploration and advancement in low-light image processing, with potential implications for other applications of diffusion models. The code is publicly available at https://github.com/jinnh/GSAD.

SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching

  • paper_url: http://arxiv.org/abs/2310.17569
  • repo_url: None
  • paper_authors: Xinghui Li, Jingyi Lu, Kai Han, Victor Prisacariu
  • for: This work addresses semantic keypoint matching across image pairs.
  • methods: The intermediate output of the UNet within Stable Diffusion (SD) is used as the image feature map, and a basic prompt tuning technique unlocks the inherent potential of SD, yielding a significant accuracy improvement over previous approaches. A novel conditional prompting module that conditions the prompt on the local details of the input image pair further improves performance (a generic feature-matching sketch follows the abstract).
  • results: Comprehensive evaluations on the PF-Pascal, PF-Willow, and SPair-71k datasets show that SD4Match sets new accuracy benchmarks on all of them, outperforming the previous state of the art by 12 percentage points on the challenging SPair-71k dataset.
    Abstract In this paper, we address the challenge of matching semantically similar keypoints across image pairs. Existing research indicates that the intermediate output of the UNet within the Stable Diffusion (SD) can serve as robust image feature maps for such a matching task. We demonstrate that by employing a basic prompt tuning technique, the inherent potential of Stable Diffusion can be harnessed, resulting in a significant enhancement in accuracy over previous approaches. We further introduce a novel conditional prompting module that conditions the prompt on the local details of the input image pairs, leading to a further improvement in performance. We designate our approach as SD4Match, short for Stable Diffusion for Semantic Matching. Comprehensive evaluations of SD4Match on the PF-Pascal, PF-Willow, and SPair-71k datasets show that it sets new benchmarks in accuracy across all these datasets. Particularly, SD4Match outperforms the previous state-of-the-art by a margin of 12 percentage points on the challenging SPair-71k dataset.
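Once dense per-pixel features have been extracted (from SD's UNet or any other backbone), the matching step itself reduces to a nearest-neighbour search in feature space. The cosine-similarity argmax below is that generic matching step, not SD4Match's prompt-tuned pipeline.

```python
# Minimal sketch of semantic keypoint matching given dense feature maps of two
# images: for each source keypoint, pick the target location with the highest
# cosine similarity. A generic matching step, not SD4Match's pipeline.
import torch
import torch.nn.functional as F

def match_keypoints(feat_src, feat_tgt, keypoints):
    """feat_src, feat_tgt: (C, H, W) dense feature maps.
    keypoints: (N, 2) integer (y, x) locations in the source image.
    Returns (N, 2) matched (y, x) locations in the target image."""
    C, H, W = feat_tgt.shape
    src_vecs = feat_src[:, keypoints[:, 0], keypoints[:, 1]].T           # (N, C)
    tgt_vecs = feat_tgt.reshape(C, -1).T                                 # (H*W, C)
    sim = F.normalize(src_vecs, dim=1) @ F.normalize(tgt_vecs, dim=1).T  # (N, H*W)
    best = sim.argmax(dim=1)
    return torch.stack([best // W, best % W], dim=1)

feat_a, feat_b = torch.randn(64, 32, 32), torch.randn(64, 32, 32)
print(match_keypoints(feat_a, feat_b, torch.tensor([[4, 7], [20, 15]])))
```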

Masked Space-Time Hash Encoding for Efficient Dynamic Scene Reconstruction

  • paper_url: http://arxiv.org/abs/2310.17527
  • repo_url: https://github.com/masked-spacetime-hashing/msth
  • paper_authors: Feng Wang, Zilong Chen, Guokang Wang, Yafei Song, Huaping Liu
  • for: efficiently reconstructing dynamic 3D scenes from multi-view or monocular videos
  • methods: Masked Space-Time Hash encoding (MSTH), a novel method that represents a dynamic scene as a weighted combination of a 3D hash encoding and a 4D hash encoding, guided by an uncertainty-based objective
  • results: consistently better results than previous methods with only 20 minutes of training time and 130 MB of memory storage
    Abstract In this paper, we propose the Masked Space-Time Hash encoding (MSTH), a novel method for efficiently reconstructing dynamic 3D scenes from multi-view or monocular videos. Based on the observation that dynamic scenes often contain substantial static areas that result in redundancy in storage and computations, MSTH represents a dynamic scene as a weighted combination of a 3D hash encoding and a 4D hash encoding. The weights for the two components are represented by a learnable mask which is guided by an uncertainty-based objective to reflect the spatial and temporal importance of each 3D position. With this design, our method can reduce the hash collision rate by avoiding redundant queries and modifications on static areas, making it feasible to represent a large number of space-time voxels by hash tables with small size. Besides, without the requirements to fit the large numbers of temporally redundant features independently, our method is easier to optimize and converge rapidly with only twenty minutes of training for a 300-frame dynamic scene. As a result, MSTH obtains consistently better results than previous methods with only 20 minutes of training time and 130 MB of memory storage. Code is available at https://github.com/masked-spacetime-hashing/msth

FLARE: Fast Learning of Animatable and Relightable Mesh Avatars

  • paper_url: http://arxiv.org/abs/2310.17519
  • repo_url: https://github.com/sbharadwajj/flare
  • paper_authors: Shrisha Bharadwaj, Yufeng Zheng, Otmar Hilliges, Michael J. Black, Victoria Fernandez-Abrevaya
  • for: To create personalized animatable 3D head avatars that are geometrically accurate, realistic, relightable, and compatible with current rendering systems.
  • methods: Differentiable rendering is combined with highly optimized methods from traditional computer graphics, approximating some components with neural networks, to learn high-fidelity 3D mesh representations: a canonical geometry with learned blendshapes and linear blend skinning weights, and a physically-based factorization of observed colors into albedo, roughness, and a neural representation of the illumination.
  • results: The mesh-based avatars achieve high-quality geometry and appearance while being more efficient to train and render than existing approaches.
    Abstract Our goal is to efficiently learn personalized animatable 3D head avatars from videos that are geometrically accurate, realistic, relightable, and compatible with current rendering systems. While 3D meshes enable efficient processing and are highly portable, they lack realism in terms of shape and appearance. Neural representations, on the other hand, are realistic but lack compatibility and are slow to train and render. Our key insight is that it is possible to efficiently learn high-fidelity 3D mesh representations via differentiable rendering by exploiting highly-optimized methods from traditional computer graphics and approximating some of the components with neural networks. To that end, we introduce FLARE, a technique that enables the creation of animatable and relightable mesh avatars from a single monocular video. First, we learn a canonical geometry using a mesh representation, enabling efficient differentiable rasterization and straightforward animation via learned blendshapes and linear blend skinning weights. Second, we follow physically-based rendering and factor observed colors into intrinsic albedo, roughness, and a neural representation of the illumination, allowing the learned avatars to be relit in novel scenes. Since our input videos are captured on a single device with a narrow field of view, modeling the surrounding environment light is non-trivial. Based on the split-sum approximation for modeling specular reflections, we address this by approximating the pre-filtered environment map with a multi-layer perceptron (MLP) modulated by the surface roughness, eliminating the need to explicitly model the light. We demonstrate that our mesh-based avatar formulation, combined with learned deformation, material, and lighting MLPs, produces avatars with high-quality geometry and appearance, while also being efficient to train and render compared to existing approaches.

Revisiting the Distillation of Image Representations into Point Clouds for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2310.17504
  • repo_url: None
  • paper_authors: Gilles Puy, Spyros Gidaris, Alexandre Boulch, Oriane Siméoni, Corentin Sautier, Patrick Pérez, Andrei Bursuc, Renaud Marlet
  • for: This paper revisits 2D-to-3D distillation to improve self-supervised 3D networks on lidar data for autonomous driving scenes.
  • methods: A simple 2D-to-3D distillation approach for semantic segmentation transfers high-quality self-supervised 2D features into 3D networks, and distillation into high-capacity 3D networks is shown to be key to reaching high-quality 3D features (a generic point-to-pixel distillation loss is sketched after the abstract).
  • results: The proposed approach significantly improves self-supervised 3D features for autonomous driving, substantially closing the gap to fully-supervised features, and the distilled representations can also be used for open-vocabulary segmentation and background/foreground discovery.
    Abstract Self-supervised image networks can be used to address complex 2D tasks (e.g., semantic segmentation, object discovery) very efficiently and with little or no downstream supervision. However, self-supervised 3D networks on lidar data do not perform as well for now. A few methods therefore propose to distill high-quality self-supervised 2D features into 3D networks. The most recent ones doing so on autonomous driving data show promising results. Yet, a performance gap persists between these distilled features and fully-supervised ones. In this work, we revisit 2D-to-3D distillation. First, we propose, for semantic segmentation, a simple approach that leads to a significant improvement compared to prior 3D distillation methods. Second, we show that distillation in high capacity 3D networks is key to reach high quality 3D features. This actually allows us to significantly close the gap between unsupervised distilled 3D features and fully-supervised ones. Last, we show that our high-quality distilled representations can also be used for open-vocabulary segmentation and background/foreground discovery.
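The core of any 2D-to-3D distillation setup is a loss that pulls each 3D point's feature toward the 2D feature of the pixel it projects to. The cosine-similarity loss and the precomputed point-to-pixel correspondences below are generic ingredients of such setups, not the exact formulation of this paper.

```python
# Minimal sketch of 2D-to-3D feature distillation: each lidar point's feature
# is pulled toward the (frozen) 2D feature of the pixel it projects to.
# Generic formulation, not this paper's exact loss.
import torch
import torch.nn.functional as F

def distill_2d_to_3d(point_feats, image_feats, pixel_uv):
    """point_feats: (N, C) features from the trainable 3D network.
    image_feats: (C, H, W) features from the frozen self-supervised 2D network.
    pixel_uv: (N, 2) integer (row, col) pixel each point projects to."""
    target = image_feats[:, pixel_uv[:, 0], pixel_uv[:, 1]].T.detach()   # (N, C)
    return 1 - F.cosine_similarity(point_feats, target, dim=1).mean()

loss = distill_2d_to_3d(torch.randn(1024, 64, requires_grad=True),
                        torch.randn(64, 60, 80),
                        torch.randint(0, 60, (1024, 2)))
print(loss.item())
```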

A Hybrid Graph Network for Complex Activity Detection in Video

  • paper_url: http://arxiv.org/abs/2310.17493
  • repo_url: https://github.com/salmank255/CompAD
  • paper_authors: Salman Khan, Izzeddin Teeti, Andrew Bradley, Mohamed Elhoseiny, Fabio Cuzzolin
  • for: This work addresses complex activity detection in video, which matters in domains such as autonomous driving and sports analytics.
  • methods: A hybrid graph neural network combines attention applied to a graph encoding the local (short-term) dynamic scene with a temporal graph modelling the overall long-duration activity.
  • results: The method outperforms previous state-of-the-art methods on all three datasets.
    Abstract Interpretation and understanding of video presents a challenging computer vision task in numerous fields - e.g. autonomous driving and sports analytics. Existing approaches to interpreting the actions taking place within a video clip are based upon Temporal Action Localisation (TAL), which typically identifies short-term actions. The emerging field of Complex Activity Detection (CompAD) extends this analysis to long-term activities, with a deeper understanding obtained by modelling the internal structure of a complex activity taking place within the video. We address the CompAD problem using a hybrid graph neural network which combines attention applied to a graph encoding the local (short-term) dynamic scene with a temporal graph modelling the overall long-duration activity. Our approach is as follows: i) Firstly, we propose a novel feature extraction technique which, for each video snippet, generates spatiotemporal `tubes' for the active elements (`agents') in the (local) scene by detecting individual objects, tracking them and then extracting 3D features from all the agent tubes as well as the overall scene. ii) Next, we construct a local scene graph where each node (representing either an agent tube or the scene) is connected to all other nodes. Attention is then applied to this graph to obtain an overall representation of the local dynamic scene. iii) Finally, all local scene graph representations are interconnected via a temporal graph, to estimate the complex activity class together with its start and end time. The proposed framework outperforms all previous state-of-the-art methods on all three datasets including ActivityNet-1.3, Thumos-14, and ROAD.

Cross-modal Active Complementary Learning with Self-refining Correspondence

  • paper_url: http://arxiv.org/abs/2310.17468
  • repo_url: https://github.com/qinyang79/crcl
  • paper_authors: Yang Qin, Yuan Sun, Dezhong Peng, Joey Tianyi Zhou, Xi Peng, Peng Hu
  • for: To improve the robustness of image-text matching against noisy correspondence, strengthening existing methods.
  • methods: A generalized Cross-modal Robust Complementary Learning framework (CRCL) that equips existing methods with an Active Complementary Loss (ACL) and a Self-refining Correspondence Correction (SCC).
  • results: Experimental and theoretical evidence shows that CRCL mitigates the effects of noisy correspondence and improves the accuracy and stability of image-text matching.
    Abstract Recently, image-text matching has attracted more and more attention from academia and industry, which is fundamental to understanding the latent correspondence across visual and textual modalities. However, most existing methods implicitly assume the training pairs are well-aligned while ignoring the ubiquitous annotation noise, a.k.a noisy correspondence (NC), thereby inevitably leading to a performance drop. Although some methods attempt to address such noise, they still face two challenging problems: excessive memorizing/overfitting and unreliable correction for NC, especially under high noise. To address the two problems, we propose a generalized Cross-modal Robust Complementary Learning framework (CRCL), which benefits from a novel Active Complementary Loss (ACL) and an efficient Self-refining Correspondence Correction (SCC) to improve the robustness of existing methods. Specifically, ACL exploits active and complementary learning losses to reduce the risk of providing erroneous supervision, leading to theoretically and experimentally demonstrated robustness against NC. SCC utilizes multiple self-refining processes with momentum correction to enlarge the receptive field for correcting correspondences, thereby alleviating error accumulation and achieving accurate and stable corrections. We carry out extensive experiments on three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, to verify the superior robustness of our CRCL against synthetic and real-world noisy correspondences.

OTMatch: Improving Semi-Supervised Learning with Optimal Transport

  • paper_url: http://arxiv.org/abs/2310.17455
  • repo_url: None
  • paper_authors: Zhiquan Tan, Kaipeng Zheng, Weiran Huang
  • for: To improve semi-supervised learning, which uses a limited amount of labeled data while exploiting the information present in unlabeled data.
  • methods: An optimal transport loss function that exploits the semantic relationships among classes (a minimal Sinkhorn sketch of the underlying machinery follows the abstract).
  • results: Compared with the current state-of-the-art method FreeMatch, OTMatch reduces the error rate by 3.18% on CIFAR-10 with 1 label per class, 3.46% on STL-10 with 4 labels per class, and 1.28% on ImageNet with 100 labels per class, demonstrating the effectiveness and superiority of the approach.
    Abstract Semi-supervised learning has made remarkable strides by effectively utilizing a limited amount of labeled data while capitalizing on the abundant information present in unlabeled data. However, current algorithms often prioritize aligning image predictions with specific classes generated through self-training techniques, thereby neglecting the inherent relationships that exist within these classes. In this paper, we present a new approach called OTMatch, which leverages semantic relationships among classes by employing an optimal transport loss function. By utilizing optimal transport, our proposed method consistently outperforms established state-of-the-art methods. Notably, we observed a substantial improvement of a certain percentage in accuracy compared to the current state-of-the-art method, FreeMatch. OTMatch achieves 3.18%, 3.46%, and 1.28% error rate reduction over FreeMatch on CIFAR-10 with 1 label per class, STL-10 with 4 labels per class, and ImageNet with 100 labels per class, respectively. This demonstrates the effectiveness and superiority of our approach in harnessing semantic relationships to enhance learning performance in a semi-supervised setting.
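The optimal-transport machinery behind such a loss can be sketched with a few Sinkhorn iterations, which turn a cost matrix into an entropic-regularized transport plan; the transport cost of that plan can then serve as a loss. The cost matrix, marginals, and regularization strength below are placeholders, not OTMatch's specific construction.

```python
# Minimal sketch of entropic optimal transport via Sinkhorn iterations, the
# machinery behind an optimal-transport loss. The cost matrix, marginals, and
# regularization strength are placeholders, not OTMatch's construction.
import torch

def sinkhorn(cost, a, b, eps=0.1, n_iters=100):
    """cost: (n, m) cost matrix; a: (n,), b: (m,) marginal distributions.
    Returns the (n, m) entropic-regularized transport plan."""
    K = torch.exp(-cost / eps)                     # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):                       # alternating scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u.unsqueeze(1) * K * v.unsqueeze(0)

cost = torch.rand(5, 5)
a = torch.full((5,), 1 / 5)
b = torch.full((5,), 1 / 5)
plan = sinkhorn(cost, a, b)
ot_loss = (plan * cost).sum()                      # transport cost used as a loss
print(round(plan.sum().item(), 4), ot_loss.item()) # plan mass sums to ~1
```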

Sign Languague Recognition without frame-sequencing constraints: A proof of concept on the Argentinian Sign Language

  • paper_url: http://arxiv.org/abs/2310.17437
  • repo_url: None
  • paper_authors: Franco Ronchetti, Facundo Manuel Quiroga, César Estrebou, Laura Lanzarini, Alejandro Rosete
  • for: This paper proposes a robust sign language recognition method to assist translation and the integration of hearing-impaired people, as well as the teaching of sign language to the hearing population.
  • methods: A probabilistic model combines sub-classifiers based on different types of features (position, movement, and handshape) and uses a bag-of-words approach in all classification steps to explore the hypothesis that frame ordering is not essential for recognition.
  • results: The model achieves 97% accuracy on an Argentinian Sign Language dataset containing 64 sign classes and 3200 samples, providing some evidence that recognition without frame-sequencing constraints is possible.
    Abstract Automatic sign language recognition (SLR) is an important topic within the areas of human-computer interaction and machine learning. On the one hand, it poses a complex challenge that requires the intervention of various knowledge areas, such as video processing, image processing, intelligent systems and linguistics. On the other hand, robust recognition of sign language could assist in the translation process and the integration of hearing-impaired people, as well as the teaching of sign language for the hearing population. SLR systems usually employ Hidden Markov Models, Dynamic Time Warping or similar models to recognize signs. Such techniques exploit the sequential ordering of frames to reduce the number of hypothesis. This paper presents a general probabilistic model for sign classification that combines sub-classifiers based on different types of features such as position, movement and handshape. The model employs a bag-of-words approach in all classification steps, to explore the hypothesis that ordering is not essential for recognition. The proposed model achieved an accuracy rate of 97% on an Argentinian Sign Language dataset containing 64 classes of signs and 3200 samples, providing some evidence that indeed recognition without ordering is possible.

Uncertainty-weighted Loss Functions for Improved Adversarial Attacks on Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.17436
  • repo_url: https://github.com/kmaag/uncertainty-weighted-loss
  • paper_authors: Kira Maag, Asja Fischer
  • for: To strengthen adversarial attacks on semantic segmentation models so that the vulnerability of these networks can be assessed more reliably.
  • methods: Simple uncertainty-based weighting schemes for the attack loss that put higher weights on pixel classifications which can be more easily perturbed and zero out the pixel-wise losses of pixels that are already confidently misclassified (one plausible instantiation is sketched after the abstract).
  • results: Empirical analysis on several datasets and models shows that these weighting schemes significantly improve perturbation performance.
    Abstract State-of-the-art deep neural networks have been shown to be extremely powerful in a variety of perceptual tasks like semantic segmentation. However, these networks are vulnerable to adversarial perturbations of the input which are imperceptible for humans but lead to incorrect predictions. Treating image segmentation as a sum of pixel-wise classifications, adversarial attacks developed for classification models were shown to be applicable to segmentation models as well. In this work, we present simple uncertainty-based weighting schemes for the loss functions of such attacks that (i) put higher weights on pixel classifications which can more easily perturbed and (ii) zero-out the pixel-wise losses corresponding to those pixels that are already confidently misclassified. The weighting schemes can be easily integrated into the loss function of a range of well-known adversarial attackers with minimal additional computational overhead, but lead to significant improved perturbation performance, as we demonstrate in our empirical analysis on several datasets and models.
    摘要 现代深度神经网络已经在语义分割等多种感知任务中展现出极强的能力。然而,这些网络容易受到对输入的对抗扰动影响:这类扰动对人类来说几乎不可察觉,却会导致错误预测。若将图像分割视为逐像素分类之和,针对分类模型设计的对抗攻击同样适用于分割模型。在这项工作中,我们为此类攻击的损失函数提出了简单的基于不确定性的加权方案:(i)对更容易被扰动翻转的像素分类赋予更高权重;(ii)将已被自信地误分类的像素所对应的损失置零。这些加权方案可以以极小的额外计算开销集成到多种知名对抗攻击的损失函数中,却能显著提升扰动(攻击)效果;我们在多个数据集和模型上的实证分析验证了这一点。
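
A minimal PyTorch sketch of the idea follows, assuming one plausible choice of uncertainty weight (one minus the softmax confidence) and a confidence threshold for zeroing out already-misclassified pixels; the paper's exact weighting schemes may differ.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_attack_loss(logits, labels, conf_threshold=0.9):
    """Hedged sketch of a pixel-wise weighted loss for segmentation attacks.

    logits: (B, C, H, W) model output on the (perturbed) image
    labels: (B, H, W) ground-truth class indices
    Pixels that are easy to flip (high predictive uncertainty) get larger
    weights; pixels that are already confidently misclassified contribute zero.
    """
    probs = logits.softmax(dim=1).detach()             # (B, C, H, W), no grad through weights
    conf, pred = probs.max(dim=1)                      # (B, H, W)
    # Uncertainty weight: larger when the model is unsure about the pixel.
    weight = 1.0 - conf
    # Zero out pixels that are already confidently misclassified.
    already_fooled = (pred != labels) & (conf > conf_threshold)
    weight = weight * (~already_fooled).float()
    per_pixel = F.cross_entropy(logits, labels, reduction="none")  # (B, H, W)
    # The attacker maximizes this loss w.r.t. the adversarial perturbation.
    return (weight * per_pixel).mean()

# Toy check
logits = torch.randn(1, 19, 64, 128, requires_grad=True)
labels = torch.randint(0, 19, (1, 64, 128))
loss = uncertainty_weighted_attack_loss(logits, labels)
loss.backward()
print(float(loss))
```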

AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors

  • paper_url: http://arxiv.org/abs/2310.17419
  • repo_url: None
  • paper_authors: You-Ming Chang, Chen Yeh, Wei-Chen Chiu, Ning Yu
  • for: 这个论文目的是提出一种基于语言视觉模型的深层伪造探测方法,以提高对未见到的数据的深层伪造探测精度。
  • methods: 这篇论文使用了InstructBLIP和提问调整技术,将深层伪造探测转换为视觉问题,并对InstructBLIP进行软题调整以回答问题中的真伪信息。
  • results: 实验结果显示,使用预训练的语言视觉模型并进行提问调整可以很好地提高深层伪造探测精度,从58.8%提高到91.31%,而且这些进步仅需要较少的培训参数,因此这是一个有效和经济的深层伪造探测解决方案。
    Abstract Deep generative models can create remarkably photorealistic fake images while raising concerns about misinformation and copyright infringement, known as deepfake threats. Deepfake detection technique is developed to distinguish between real and fake images, where the existing methods typically learn classifiers in the image domain or various feature domains. However, the generalizability of deepfake detection against emerging and more advanced generative models remains challenging. In this paper, being inspired by the zero-shot advantages of Vision-Language Models (VLMs), we propose a novel approach using VLMs (e.g. InstructBLIP) and prompt tuning techniques to improve the deepfake detection accuracy over unseen data. We formulate deepfake detection as a visual question answering problem, and tune soft prompts for InstructBLIP to answer the real/fake information of a query image. We conduct full-spectrum experiments on datasets from 3 held-in and 13 held-out generative models, covering modern text-to-image generation, image editing and image attacks. Results demonstrate that (1) the deepfake detection accuracy can be significantly and consistently improved (from 58.8% to 91.31%, in average accuracy over unseen data) using pretrained vision-language models with prompt tuning; (2) our superior performance is at less cost of trainable parameters, resulting in an effective and efficient solution for deepfake detection. Code and models can be found at https://github.com/nctu-eva-lab/AntifakePrompt.
    摘要 深度生成模型能够生成以假乱真的伪造图像,同时也引发了关于虚假信息和版权侵犯的担忧,即深度伪造(deepfake)威胁。深度伪造检测技术用于区分真实与伪造图像,现有方法通常在图像域或各种特征域中学习分类器,但其对新出现的、更先进的生成模型的泛化能力仍然是一个挑战。受视觉语言模型(VLM)零样本优势的启发,我们提出一种利用VLM(如InstructBLIP)和提示调优技术来提升对未见数据的深度伪造检测精度的新方法:将深度伪造检测表述为视觉问答问题,并调优软提示,让InstructBLIP回答查询图像的真/伪信息。我们在来自3个训练内(held-in)和13个训练外(held-out)生成模型的数据集上进行了全谱实验,涵盖现代文生图、图像编辑与图像攻击。结果表明:1. 使用预训练视觉语言模型并进行提示调优,可以显著且一致地提升深度伪造检测精度(未见数据上的平均准确率从58.8%提升到91.31%);2. 这一优势只需很少的可训练参数,因而是一种高效且经济的深度伪造检测方案。代码和模型见 https://github.com/nctu-eva-lab/AntifakePrompt。
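
Below is a hedged sketch of the soft-prompt-tuning recipe with a stand-in frozen backbone; `FrozenBackbone`, the prompt length and the two-way head are illustrative assumptions, not the actual InstructBLIP interface.

```python
import torch
import torch.nn as nn

# Hedged sketch of soft prompt tuning: only a few learnable prompt vectors are
# trained while the large vision-language backbone stays frozen. The real
# AntifakePrompt work tunes prompts for InstructBLIP; here FrozenBackbone is
# a stand-in module, not the actual model.

class FrozenBackbone(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.parameters():
            p.requires_grad_(False)          # backbone is kept frozen

    def forward(self, x):
        return self.encoder(x)

class PromptTunedDetector(nn.Module):
    def __init__(self, dim=256, n_prompts=8):
        super().__init__()
        self.backbone = FrozenBackbone(dim)
        # The only trainable parameters: soft prompts + a tiny real/fake head.
        self.soft_prompts = nn.Parameter(torch.randn(1, n_prompts, dim) * 0.02)
        self.head = nn.Linear(dim, 2)        # answers "real" vs. "fake"

    def forward(self, image_tokens):
        b = image_tokens.size(0)
        prompts = self.soft_prompts.expand(b, -1, -1)
        tokens = torch.cat([prompts, image_tokens], dim=1)
        feats = self.backbone(tokens)
        return self.head(feats[:, 0])        # read the answer from the first prompt slot

model = PromptTunedDetector()
image_tokens = torch.randn(4, 32, 256)       # e.g. patch embeddings from a frozen vision encoder
logits = model(image_tokens)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(logits.shape, trainable)
```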

Circuit as Set of Points

  • paper_url: http://arxiv.org/abs/2310.17418
  • repo_url: https://github.com/hustvl/circuitformer
  • paper_authors: Jialv Zou, Xinggang Wang, Jiahao Guo, Wenyu Liu, Qian Zhang, Chang Huang
  • for: 该论文旨在借助人工智能技术,提高电子设计自动化(EDA)中电路物理设计流程里布局(placement)与布线(routing)阶段的效率。
  • methods: 该 paper 使用 Transformer-based point cloud perception 方法来提取电路组件的特征,无需预处理,可以进行终端训练,并且实现了高性能。
  • results: 实验结果显示,该方法在 CircuitNet 和 ISPD2015 数据集上的堵塞预测任务以及 CircuitNet 数据集上的设计规则检查 (DRC) 预测任务中均达到了状态数据集的最高表现。
    Abstract As the size of circuit designs continues to grow rapidly, artificial intelligence technologies are being extensively used in Electronic Design Automation (EDA) to assist with circuit design. Placement and routing are the most time-consuming parts of the physical design process, and how to quickly evaluate the placement has become a hot research topic. Prior works either transformed circuit designs into images using hand-crafted methods and then used Convolutional Neural Networks (CNN) to extract features, which are limited by the quality of the hand-crafted methods and could not achieve end-to-end training, or treated the circuit design as a graph structure and used Graph Neural Networks (GNN) to extract features, which require time-consuming preprocessing. In our work, we propose a novel perspective for circuit design by treating circuit components as point clouds and using Transformer-based point cloud perception methods to extract features from the circuit. This approach enables direct feature extraction from raw data without any preprocessing, allows for end-to-end training, and results in high performance. Experimental results show that our method achieves state-of-the-art performance in congestion prediction tasks on both the CircuitNet and ISPD2015 datasets, as well as in design rule check (DRC) violation prediction tasks on the CircuitNet dataset. Our method establishes a bridge between the relatively mature point cloud perception methods and the fast-developing EDA algorithms, enabling us to leverage more collective intelligence to solve this task. To facilitate the research of open EDA design, source codes and pre-trained models are released at https://github.com/hustvl/circuitformer.
    摘要 随着电路设计规模的快速增长,人工智能技术被广泛应用于电子设计自动化(EDA)以辅助电路设计。布局与布线是物理设计流程中最耗时的环节,如何快速评估布局已成为热点研究课题。以往的工作要么通过手工方法将电路设计转化为图像,再用卷积神经网络(CNN)提取特征,其效果受限于手工方法的质量且无法端到端训练;要么将电路设计视为图结构,用图神经网络(GNN)提取特征,但需要耗时的预处理。在我们的工作中,我们提出一种新的视角:将电路元件视为点云,并使用基于Transformer的点云感知方法从电路中提取特征。该方法可直接从原始数据提取特征、无需任何预处理、支持端到端训练,并取得高性能。实验结果表明,我们的方法在CircuitNet和ISPD2015数据集上的拥塞预测任务,以及CircuitNet数据集上的设计规则检查(DRC)违例预测任务中均达到最优水平。我们的方法在相对成熟的点云感知方法与快速发展的EDA算法之间架起了桥梁,使我们能够借助更多的集体智慧来解决该任务。为便于开放EDA设计的研究,源代码和预训练模型发布于 https://github.com/hustvl/circuitformer。

Detection Defenses: An Empty Promise against Adversarial Patch Attacks on Optical Flow

  • paper_url: http://arxiv.org/abs/2310.17403
  • repo_url: https://github.com/cv-stuttgart/detectiondefenses
  • paper_authors: Erik Scheurer, Jenny Schmalfuss, Alexander Lis, Andrés Bruhn
  • for: 这篇论文旨在检验目前可用的检测并移除防御策略(ILP和LGS)对state-of-the-art Optical Flow方法的影响,以及这些防御策略对攻击者的抗击能力。
  • methods: 这篇论文使用了多种state-of-the-art Optical Flow方法,并对这些方法进行了防御策略的检测和移除。
  • results: 实验结果表明,目前使用的检测并移除防御策略不仅会下降无抗的场景中的Optical Flow质量,还会削弱对攻击者的抗击能力。这些防御策略不能提供预期的安全性。
    Abstract Adversarial patches undermine the reliability of optical flow predictions when placed in arbitrary scene locations. Therefore, they pose a realistic threat to real-world motion detection and its downstream applications. Potential remedies are defense strategies that detect and remove adversarial patches, but their influence on the underlying motion prediction has not been investigated. In this paper, we thoroughly examine the currently available detect-and-remove defenses ILP and LGS for a wide selection of state-of-the-art optical flow methods, and illuminate their side effects on the quality and robustness of the final flow predictions. In particular, we implement defense-aware attacks to investigate whether current defenses are able to withstand attacks that take the defense mechanism into account. Our experiments yield two surprising results: Detect-and-remove defenses do not only lower the optical flow quality on benign scenes, in doing so, they also harm the robustness under patch attacks for all tested optical flow methods except FlowNetC. As currently employed detect-and-remove defenses fail to deliver the promised adversarial robustness for optical flow, they evoke a false sense of security. The code is available at https://github.com/cv-stuttgart/DetectionDefenses.
    摘要 对抗补丁被放置在场景的任意位置时,会破坏光流预测的可靠性,因此对真实世界的运动检测及其下游应用构成现实威胁。可能的补救措施是先检测并移除对抗补丁的防御策略,但这类防御对底层光流预测本身的影响尚未被研究。在本文中,我们针对多种最新的光流方法,系统考察了现有的“检测并移除”防御(ILP和LGS),并揭示了它们对最终光流预测质量与鲁棒性的副作用。特别地,我们实现了防御感知攻击,以检验现有防御能否抵御将防御机制纳入考虑的攻击。实验得出两个令人意外的结果:检测并移除防御不仅降低了无攻击场景下的光流质量,而且除FlowNetC外,还削弱了所有被测光流方法在补丁攻击下的鲁棒性。由于现有的检测并移除防御无法为光流带来所承诺的对抗鲁棒性,它们只会制造一种虚假的安全感。代码见 https://github.com/cv-stuttgart/DetectionDefenses。

Learning Temporal Sentence Grounding From Narrated EgoVideos

  • paper_url: http://arxiv.org/abs/2310.17395
  • repo_url: https://github.com/keflanagan/climer
  • paper_authors: Kevin Flanagan, Dima Damen, Michael Wray
  • for: 该论文旨在应对 Ego4D 和 EPIC-Kitchens 等长视频第一人称(egocentric)数据集给时序句子定位(TSG)任务带来的新挑战。
  • methods: 该 paper 使用了只使用笔记和其相对粗略的时间戳来学习在这些数据集中附加 sentences。它们提出了一种名为 clip merging (CliMer) 的方法,通过文本控制注意力来进行对比性增强。
  • results: 对比高效的 TSG 方法,CliMer 方法可以提高 mean R@1 的性能,从 3.9 提高到 5.7 在 Ego4D 上,从 10.7 提高到 13.0 在 EPIC-Kitchens 上。
    Abstract The onset of long-form egocentric datasets such as Ego4D and EPIC-Kitchens presents a new challenge for the task of Temporal Sentence Grounding (TSG). Compared to traditional benchmarks on which this task is evaluated, these datasets offer finer-grained sentences to ground in notably longer videos. In this paper, we develop an approach for learning to ground sentences in these datasets using only narrations and their corresponding rough narration timestamps. We propose to artificially merge clips to train for temporal grounding in a contrastive manner using text-conditioning attention. This Clip Merging (CliMer) approach is shown to be effective when compared with a high performing TSG method -- e.g. mean R@1 improves from 3.9 to 5.7 on Ego4D and from 10.7 to 13.0 on EPIC-Kitchens. Code and data splits available from: https://github.com/keflanagan/CliMer
    摘要 Ego4D和EPIC-Kitchens等长视频第一人称数据集的出现,为时序句子定位(TSG)任务带来了新的挑战。与该任务传统的评测基准相比,这些数据集提供了更细粒度的句子,需要在明显更长的视频中进行定位。在本文中,我们提出了一种仅利用旁白(narration)及其对应的粗略时间戳来学习句子定位的方法:通过人工合并视频片段,并借助文本条件注意力,以对比方式训练时序定位,称为片段合并(CliMer)。与一种高性能的TSG方法相比,CliMer被证明是有效的,例如mean R@1在Ego4D上从3.9提升到5.7,在EPIC-Kitchens上从10.7提升到13.0。代码和数据划分见:https://github.com/keflanagan/CliMer。

SE(3) Diffusion Model-based Point Cloud Registration for Robust 6D Object Pose Estimation

  • paper_url: http://arxiv.org/abs/2310.17359
  • repo_url: https://github.com/Jiang-HB/DiffusionReg
  • paper_authors: Haobo Jiang, Mathieu Salzmann, Zheng Dang, Jin Xie, Jian Yang
  • for: 6D object pose estimation in real-world scenarios
  • methods: SE(3) diffusion model-based point cloud registration framework
  • results: Outstanding pose estimation performance on real-world datasets (TUD-L, LINEMOD, and Occluded-LINEMOD)
    Abstract In this paper, we introduce an SE(3) diffusion model-based point cloud registration framework for 6D object pose estimation in real-world scenarios. Our approach formulates the 3D registration task as a denoising diffusion process, which progressively refines the pose of the source point cloud to obtain a precise alignment with the model point cloud. Training our framework involves two operations: An SE(3) diffusion process and an SE(3) reverse process. The SE(3) diffusion process gradually perturbs the optimal rigid transformation of a pair of point clouds by continuously injecting noise (perturbation transformation). By contrast, the SE(3) reverse process focuses on learning a denoising network that refines the noisy transformation step-by-step, bringing it closer to the optimal transformation for accurate pose estimation. Unlike standard diffusion models used in linear Euclidean spaces, our diffusion model operates on the SE(3) manifold. This requires exploiting the linear Lie algebra $\mathfrak{se}(3)$ associated with SE(3) to constrain the transformation transitions during the diffusion and reverse processes. Additionally, to effectively train our denoising network, we derive a registration-specific variational lower bound as the optimization objective for model learning. Furthermore, we show that our denoising network can be constructed with a surrogate registration model, making our approach applicable to different deep registration networks. Extensive experiments demonstrate that our diffusion registration framework presents outstanding pose estimation performance on the real-world TUD-L, LINEMOD, and Occluded-LINEMOD datasets.
    摘要 在本文中,我们提出一种基于SE(3)扩散模型的点云配准框架,用于真实场景下的6D物体位姿估计。我们的方法将3D配准任务表述为一个去噪扩散过程,逐步细化源点云的位姿,使其与模型点云精确对齐。框架的训练包含两个过程:SE(3)前向扩散过程和SE(3)反向过程。前向扩散过程通过不断注入噪声(扰动变换),逐渐扰动一对点云的最优刚体变换;反向过程则学习一个去噪网络,逐步细化带噪变换,使其逼近最优变换以实现准确的位姿估计。与在线性欧氏空间中使用的标准扩散模型不同,我们的扩散模型在SE(3)流形上运行,这需要利用与SE(3)相关联的线性李代数 $\mathfrak{se}(3)$ 来约束扩散与反向过程中的变换转移。此外,为了有效训练去噪网络,我们推导出一个面向配准的变分下界作为模型学习的优化目标,并表明去噪网络可以用一个代理配准模型来构建,使该方法适用于不同的深度配准网络。大量实验表明,我们的扩散配准框架在真实世界的TUD-L、LINEMOD和Occluded-LINEMOD数据集上展现出出色的位姿估计性能。
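
The sketch below illustrates only the SE(3)-manifold aspect: a pose is perturbed by sampling Gaussian noise in the Lie algebra se(3) and mapping it back with the matrix exponential, as one forward-diffusion step. The noise scales and the simple linear schedule are assumptions, not the paper's actual parameterization.

```python
import numpy as np
from scipy.linalg import expm

# Hedged sketch: diffusing a rigid transform on SE(3) by injecting noise in the
# Lie algebra se(3) and mapping it back with the exponential map, as one step of
# a forward diffusion over poses. Schedules and scaling here are illustrative.

def se3_hat(xi):
    """xi = (omega, v) in R^6  ->  4x4 matrix in se(3)."""
    wx, wy, wz, vx, vy, vz = xi
    return np.array([[0.0, -wz,  wy, vx],
                     [ wz, 0.0, -wx, vy],
                     [-wy,  wx, 0.0, vz],
                     [0.0, 0.0, 0.0, 0.0]])

def perturb_pose(T, sigma_rot=0.05, sigma_trans=0.02, rng=None):
    """Left-multiply T by exp(noise^) with Gaussian noise drawn in se(3)."""
    rng = rng or np.random.default_rng()
    xi = np.concatenate([rng.normal(0, sigma_rot, 3), rng.normal(0, sigma_trans, 3)])
    return expm(se3_hat(xi)) @ T

# Forward diffusion: start from the optimal pose and progressively add larger
# perturbations; the reverse network would learn to undo them step by step.
T_opt = np.eye(4)
rng = np.random.default_rng(0)
trajectory = [T_opt]
for t in range(1, 6):
    trajectory.append(perturb_pose(trajectory[-1], 0.05 * t, 0.02 * t, rng))
print(np.round(trajectory[-1], 3))
```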

Sky Imager-Based Forecast of Solar Irradiance Using Machine Learning

  • paper_url: http://arxiv.org/abs/2310.17356
  • repo_url: None
  • paper_authors: Anas Al-lahham, Obaidah Theeb, Khaled Elalem, Tariq A. Alshawi, Saleh A. Alshebeili
  • for: 提前预测发电厂输出功率,以保障电网稳定与不间断供电。
  • methods: 提出了一种新的天际图像特征提取和学习基本技术来估计短期太阳辐射。
  • results: 与文献中已知的计算高效算法相比,我们的方法实现了竞争性的结果,而且计算复杂度减少了多。
    Abstract Ahead-of-time forecasting of the output power of power plants is essential for the stability of the electricity grid and ensuring uninterrupted service. However, forecasting renewable energy sources is difficult due to the chaotic behavior of natural energy sources. This paper presents a new approach to estimate short-term solar irradiance from sky images. The proposed algorithm extracts features from sky images and use learning-based techniques to estimate the solar irradiance. The performance of proposed machine learning (ML) algorithm is evaluated using two publicly available datasets of sky images. The datasets contain over 350,000 images for an interval of 16 years, from 2004 to 2020, with the corresponding global horizontal irradiance (GHI) of each image as the ground truth. Compared to the state-of-the-art computationally heavy algorithms proposed in the literature, our approach achieves competitive results with much less computational complexity for both nowcasting and forecasting up to 4 h ahead of time.
    摘要 提前预测发电厂的输出功率对电网稳定和不间断供电至关重要。然而,由于自然能源的混沌特性,可再生能源的预测十分困难。本文提出了一种从天空图像估计短期太阳辐照度的新方法:该算法从天空图像中提取特征,并使用基于学习的技术来估计太阳辐照度。所提机器学习(ML)算法的性能在两个公开的天空图像数据集上进行评估;数据集涵盖2004年至2020年共16年,包含超过35万张图像,每张图像对应的全球水平辐照度(GHI)作为真值。与文献中计算开销较大的最新算法相比,我们的方法在即时预报(nowcasting)和提前至多4小时的预报中,以低得多的计算复杂度取得了有竞争力的结果。
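
A toy sketch of the image-features-plus-learned-regressor pipeline; the hand-crafted features (a red/blue ratio as a cloudiness cue) and the random-forest regressor are illustrative assumptions rather than the paper's actual feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hedged sketch: hand-crafted sky-image features feeding a lightweight learned
# regressor for GHI. The exact features in the paper may differ; red/blue ratio
# is a common cloudiness cue used here purely for illustration.

def sky_features(img):
    """img: (H, W, 3) uint8 RGB sky image -> small feature vector."""
    img = img.astype(np.float32)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    rb_ratio = r / (b + 1e-6)
    cloud_fraction = float((rb_ratio > 0.8).mean())   # crude cloud-pixel mask
    return np.array([r.mean(), g.mean(), b.mean(),
                     rb_ratio.mean(), rb_ratio.std(), cloud_fraction])

# Toy training loop on synthetic images and GHI targets.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(200, 64, 64, 3), dtype=np.uint8)
ghi = rng.uniform(0, 1000, size=200)                   # W/m^2, synthetic
X = np.stack([sky_features(im) for im in images])
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, ghi)
print("predicted GHI:", model.predict(X[:3]).round(1))
```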

CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling

  • paper_url: http://arxiv.org/abs/2310.17347
  • repo_url: None
  • paper_authors: Seyedmorteza Sadat, Jakob Buhmann, Derek Bradely, Otmar Hilliges, Romann M. Weber
  • for: 提高 diffusion models 的输出多样性,特别是在高指导缩放比例下或者在小数据集上训练时。
  • methods: 提供一种改进的抽样策略,通过在推理过程中添加 scheduled, monotonically decreasing Gaussian noise 来平衡多样性和条件匹配。
  • results: 在多种条件生成任务中,使用现有预训练 diffusion model,CADS 可以提高 diffusion models 的多样性,并在 class-conditional ImageNet 生成中达到新的state-of-the-art FID 值(1.70和2.31)。
    Abstract While conditional diffusion models are known to have good coverage of the data distribution, they still face limitations in output diversity, particularly when sampled with a high classifier-free guidance scale for optimal image quality or when trained on small datasets. We attribute this problem to the role of the conditioning signal in inference and offer an improved sampling strategy for diffusion models that can increase generation diversity, especially at high guidance scales, with minimal loss of sample quality. Our sampling strategy anneals the conditioning signal by adding scheduled, monotonically decreasing Gaussian noise to the conditioning vector during inference to balance diversity and condition alignment. Our Condition-Annealed Diffusion Sampler (CADS) can be used with any pretrained model and sampling algorithm, and we show that it boosts the diversity of diffusion models in various conditional generation tasks. Further, using an existing pretrained diffusion model, CADS achieves a new state-of-the-art FID of 1.70 and 2.31 for class-conditional ImageNet generation at 256$\times$256 and 512$\times$512 respectively.
    摘要 虽然条件扩散模型能够较好地覆盖数据分布,但其输出多样性仍存在局限,尤其是在为获得最佳图像质量而使用较高的无分类器引导(classifier-free guidance)尺度、或在小数据集上训练时。我们将该问题归因于推理过程中条件信号的作用,并提出了一种改进的扩散模型采样策略,可以在几乎不损失样本质量的情况下提升生成多样性,特别是在高引导尺度下。该采样策略在推理时向条件向量添加按计划安排、单调递减的高斯噪声,对条件信号进行退火,以平衡多样性与条件一致性。条件退火扩散采样器(CADS)可与任何预训练模型和采样算法配合使用,并在多种条件生成任务中提升扩散模型的多样性。此外,基于现有的预训练扩散模型,CADS 在 256×256 和 512×512 分辨率的类别条件 ImageNet 生成任务上分别取得了 1.70 和 2.31 的最新最优 FID。(FID,即 Frechet Inception Distance,用于衡量生成图像质量,数值越低越好。)
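
A minimal sketch of condition annealing at sampling time, assuming a piecewise-linear annealing coefficient and a simple noise-mixing rule; the paper's exact schedule and any rescaling may differ.

```python
import torch

# Hedged sketch of condition annealing during sampling: early, high-noise steps
# see a heavily corrupted conditioning vector (more diversity), late steps see
# the clean condition (better alignment). The schedule and mixing rule below
# are illustrative, not necessarily the paper's exact formulation.

def anneal_condition(cond, t, num_steps, noise_scale=0.25, tau1=0.6, tau2=0.9):
    """cond: conditioning embedding; t: sampler step, t = num_steps-1 is the noisiest."""
    noise_level = t / max(num_steps - 1, 1)          # 1 at the start of sampling, 0 at the end
    # Annealing coefficient gamma in [0, 1]: 1 keeps the condition intact,
    # 0 replaces it entirely by noise; gamma grows as sampling proceeds.
    if noise_level <= tau1:
        gamma = 1.0
    elif noise_level >= tau2:
        gamma = 0.0
    else:
        gamma = (tau2 - noise_level) / (tau2 - tau1)
    noise = torch.randn_like(cond)
    return gamma ** 0.5 * cond + noise_scale * (1.0 - gamma) ** 0.5 * noise

# Usage inside a generic sampling loop (denoise_step stands for whatever sampler is used):
# for t in reversed(range(num_steps)):
#     cond_t = anneal_condition(class_embedding, t, num_steps)
#     x = denoise_step(x, t, cond_t)
cond = torch.randn(1, 512)
print([anneal_condition(cond, t, 50).norm().item() for t in (49, 25, 0)])
```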

IndustReal: A Dataset for Procedure Step Recognition Handling Execution Errors in Egocentric Videos in an Industrial-Like Setting

  • paper_url: http://arxiv.org/abs/2310.17323
  • repo_url: https://github.com/timschoonbeek/industreal
  • paper_authors: Tim J. Schoonbeek, Tim Houben, Hans Onvlee, Peter H. N. de With, Fons van der Sommen
  • for: 这篇论文主要关注于recognizing the correct completion and order of procedural steps,以解决action recognition for procedural tasks中的一个限制,即无法衡量动作的成功度。
  • methods: 论文提出了一种新的任务—procedure step recognition(PSR),并提供了一个多模态的IndustReal数据集。
  • results: 论文在IndustReal数据集上进行了实验,并发现了一些新的错误类型,如执行错误和步骤错误。这些错误会限制action recognition的应用在工业领域。
    Abstract Although action recognition for procedural tasks has received notable attention, it has a fundamental flaw in that no measure of success for actions is provided. This limits the applicability of such systems especially within the industrial domain, since the outcome of procedural actions is often significantly more important than the mere execution. To address this limitation, we define the novel task of procedure step recognition (PSR), focusing on recognizing the correct completion and order of procedural steps. Alongside the new task, we also present the multi-modal IndustReal dataset. Unlike currently available datasets, IndustReal contains procedural errors (such as omissions) as well as execution errors. A significant part of these errors are exclusively present in the validation and test sets, making IndustReal suitable to evaluate robustness of algorithms to new, unseen mistakes. Additionally, to encourage reproducibility and allow for scalable approaches trained on synthetic data, the 3D models of all parts are publicly available. Annotations and benchmark performance are provided for action recognition and assembly state detection, as well as the new PSR task. IndustReal, along with the code and model weights, is available at: https://github.com/TimSchoonbeek/IndustReal .
    摘要 尽管面向流程任务的动作识别已受到广泛关注,但它存在一个根本缺陷:没有对动作是否成功给出衡量。这限制了此类系统的适用性,尤其是在工业领域,因为流程动作的结果往往比动作本身的执行更为重要。为解决这一局限,我们定义了一个新任务:流程步骤识别(PSR),重点在于识别流程步骤是否正确完成及其顺序。同时,我们还发布了多模态的IndustReal数据集。与现有数据集不同,IndustReal既包含流程错误(如遗漏),也包含执行错误;其中相当一部分错误只出现在验证集和测试集中,使IndustReal适合评估算法对新的、未见过错误的鲁棒性。此外,为便于复现并支持在合成数据上训练的可扩展方法,所有零件的3D模型均公开可用。我们为动作识别、装配状态检测以及新的PSR任务提供了标注和基准性能。IndustReal连同代码和模型权重见:https://github.com/TimSchoonbeek/IndustReal。

Defect Spectrum: A Granular Look of Large-Scale Defect Datasets with Rich Semantics

  • paper_url: http://arxiv.org/abs/2310.17316
  • repo_url: None
  • paper_authors: Shuai Yang, Zhifei Chen, Pengguang Chen, Xi Fang, Shu Liu, Yingcong Chen
  • for: 本研究的目的是提供一个精准、semantic-abundant、大规模的 defect spectrum 数据集,用于实际应用中的缺陷检测。
  • methods: 本研究使用了 four key industrial benchmarks,对现有的标注进行了细化和增加 semantic details,以区分多种缺陷类型。 而且,我们提出了一种基于 diffusion-based generator的 two-stage 生成器,用于生成高质量和多样化的缺陷图像。
  • results: 对于 defect inspection 模型的效果,synthetic images generated by Defect-Gen 显示了明显的提高。总的来说,The Defect Spectrum dataset 在缺陷检测研究中显示了很好的潜力,提供了一个坚实的平台用于测试和优化高级模型。
    Abstract Defect inspection is paramount within the closed-loop manufacturing system. However, existing datasets for defect inspection often lack precision and semantic granularity required for practical applications. In this paper, we introduce the Defect Spectrum, a comprehensive benchmark that offers precise, semantic-abundant, and large-scale annotations for a wide range of industrial defects. Building on four key industrial benchmarks, our dataset refines existing annotations and introduces rich semantic details, distinguishing multiple defect types within a single image. Furthermore, we introduce Defect-Gen, a two-stage diffusion-based generator designed to create high-quality and diverse defective images, even when working with limited datasets. The synthetic images generated by Defect-Gen significantly enhance the efficacy of defect inspection models. Overall, The Defect Spectrum dataset demonstrates its potential in defect inspection research, offering a solid platform for testing and refining advanced models.
    摘要 缺陷检测在闭环制造系统中至关重要。然而,现有的缺陷检测数据集往往缺乏实际应用所需的精度和语义粒度。本文介绍了Defect Spectrum(缺陷谱):一个为各类工业缺陷提供精确、语义丰富、大规模标注的综合基准。我们在四个关键工业基准的基础上,细化了现有标注并引入丰富的语义细节,可在单张图像内区分多种缺陷类型。此外,我们提出了Defect-Gen,一种两阶段的基于扩散模型的生成器,即使在数据量有限的情况下也能生成高质量、多样化的缺陷图像;由其生成的合成图像显著提升了缺陷检测模型的效果。总体而言,Defect Spectrum数据集展示了其在缺陷检测研究中的潜力,为测试和改进先进模型提供了坚实的平台。

Scale-Adaptive Feature Aggregation for Efficient Space-Time Video Super-Resolution

  • paper_url: http://arxiv.org/abs/2310.17294
  • repo_url: https://github.com/megvii-research/wacv2024-safa
  • paper_authors: Zhewei Huang, Ailin Huang, Xiaotao Hu, Chen Hu, Jun Xu, Shuchang Zhou
  • for: 提高视频质量
  • methods: 提出了一种新的Scale-Adaptive Feature Aggregation(SAFA)网络,为每个样本自适应地选择不同处理尺度的子网络,以改进基于光流的特征传播。
  • results: 在四个公开的STVSR基准测试集上,SAFA达到了领先水平:与TMNet和VideoINR等方法相比,PSNR平均提升超过0.5dB,而参数量不到其一半,计算成本仅约三分之一。
    Abstract The Space-Time Video Super-Resolution (STVSR) task aims to enhance the visual quality of videos, by simultaneously performing video frame interpolation (VFI) and video super-resolution (VSR). However, facing the challenge of the additional temporal dimension and scale inconsistency, most existing STVSR methods are complex and inflexible in dynamically modeling different motion amplitudes. In this work, we find that choosing an appropriate processing scale achieves remarkable benefits in flow-based feature propagation. We propose a novel Scale-Adaptive Feature Aggregation (SAFA) network that adaptively selects sub-networks with different processing scales for individual samples. Experiments on four public STVSR benchmarks demonstrate that SAFA achieves state-of-the-art performance. Our SAFA network outperforms recent state-of-the-art methods such as TMNet and VideoINR by an average improvement of over 0.5dB on PSNR, while requiring less than half the number of parameters and only 1/3 computational costs.
    摘要 时空视频超分辨率(STVSR)任务旨在通过同时进行视频帧插值(VFI)和视频超分辨率(VSR)来提升视频的视觉质量。然而,面对额外的时间维度与尺度不一致的挑战,大多数现有STVSR方法在动态建模不同运动幅度时既复杂又不灵活。在这项工作中,我们发现选择合适的处理尺度能为基于光流的特征传播带来显著收益。我们提出了一种新的尺度自适应特征聚合(SAFA)网络,可针对每个样本自适应地选择具有不同处理尺度的子网络。在四个公开的STVSR基准上的实验表明,SAFA达到了最优性能:与TMNet和VideoINR等最新的先进方法相比,SAFA在PSNR上平均提升超过0.5dB,而参数量不到其一半,计算开销仅约三分之一。

RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open Environments

  • paper_url: http://arxiv.org/abs/2310.17290
  • repo_url: None
  • paper_authors: Mengxue Qu, Yu Wu, Wu Liu, Xiaodan Liang, Jingkuan Song, Yao Zhao, Yunchao Wei
  • for: 这个论文旨在探讨如何基于具体的目的或需求来检测对象。
  • methods: 这个论文使用了一个新的数据集called Reasoning Intention-Oriented Objects (RIO),以便更好地处理开放环境中的意图。
  • results: 研究人员通过使用RIO数据集,发现了一些现有模型在开放环境中理解意图对象的能力有所提高。
    Abstract Intention-oriented object detection aims to detect desired objects based on specific intentions or requirements. For instance, when we desire to "lie down and rest", we instinctively seek out a suitable option such as a "bed" or a "sofa" that can fulfill our needs. Previous work in this area is limited either by the number of intention descriptions or by the affordance vocabulary available for intention objects. These limitations make it challenging to handle intentions in open environments effectively. To facilitate this research, we construct a comprehensive dataset called Reasoning Intention-Oriented Objects (RIO). In particular, RIO is specifically designed to incorporate diverse real-world scenarios and a wide range of object categories. It offers the following key features: 1) intention descriptions in RIO are represented as natural sentences rather than a mere word or verb phrase, making them more practical and meaningful; 2) the intention descriptions are contextually relevant to the scene, enabling a broader range of potential functionalities associated with the objects; 3) the dataset comprises a total of 40,214 images and 130,585 intention-object pairs. With the proposed RIO, we evaluate the ability of some existing models to reason intention-oriented objects in open environments.
    摘要 意图导向的对象检测旨在根据特定的意图或需求来检测所需对象。例如,当我们想要“躺下休息”时,会自然地寻找一个合适的选择,如“床”或“沙发”,以满足需求。以往的相关工作要么受限于意图描述的数量,要么受限于意图对象可用的可供性(affordance)词汇,这些限制使得在开放环境中有效处理意图十分困难。为推动这一研究,我们构建了一个全面的数据集——推理意图导向对象(RIO)。RIO专门针对多样化的真实场景和广泛的对象类别而设计,具有以下三个关键特点:1. RIO中的意图描述以自然语句表达,而非单个词或动词短语,因此更加实用、更有意义;2. 意图描述与场景上下文相关,使对象可能承担的功能范围更广;3. 数据集共包含40,214张图像和130,585个意图-对象对。基于所提出的RIO,我们评估了若干现有模型在开放环境中推理意图导向对象的能力。

BEVContrast: Self-Supervision in BEV Space for Automotive Lidar Point Clouds

  • paper_url: http://arxiv.org/abs/2310.17281
  • repo_url: https://github.com/valeoai/bevcontrast
  • paper_authors: Corentin Sautier, Gilles Puy, Alexandre Boulch, Renaud Marlet, Vincent Lepetit
  • for: 提高自动驾驶汽车 LiDAR 点云自我监督的简单性和效率。
  • methods: 设计了一种基于 Bird’s Eye View 平面的对比损失函数,从而实现了简单且高效的自我监督。
  • results: 与 PointContrast 和 TARL 等方法相比,BEVContrast 兼顾了性能与简洁性,且计算单元(cell)级表示的开销很小。
    Abstract We present a surprisingly simple and efficient method for self-supervision of 3D backbone on automotive Lidar point clouds. We design a contrastive loss between features of Lidar scans captured in the same scene. Several such approaches have been proposed in the literature from PointConstrast, which uses a contrast at the level of points, to the state-of-the-art TARL, which uses a contrast at the level of segments, roughly corresponding to objects. While the former enjoys a great simplicity of implementation, it is surpassed by the latter, which however requires a costly pre-processing. In BEVContrast, we define our contrast at the level of 2D cells in the Bird's Eye View plane. Resulting cell-level representations offer a good trade-off between the point-level representations exploited in PointContrast and segment-level representations exploited in TARL: we retain the simplicity of PointContrast (cell representations are cheap to compute) while surpassing the performance of TARL in downstream semantic segmentation.
    摘要 我们提出了一种出奇简单且高效的方法,用于自动驾驶激光雷达点云上3D骨干网络的自监督学习:在同一场景中采集的两次激光雷达扫描的特征之间定义对比损失。文献中已有多种此类方法,从在点级别进行对比的PointContrast,到在大致对应物体的分割段(segment)级别进行对比的最新方法TARL。前者实现极为简单,但性能被后者超越,而后者需要代价高昂的预处理。在BEVContrast中,我们将对比定义在鸟瞰图(Bird's Eye View)平面的2D单元(cell)级别。由此得到的单元级表示在PointContrast所用的点级表示与TARL所用的段级表示之间取得了良好的折中:我们保留了PointContrast的简单性(单元表示计算代价低),同时在下游语义分割中超越了TARL的性能。
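
A hedged sketch of the BEV-cell idea: mean-pool point features into 2D bird's-eye-view cells and apply an InfoNCE-style loss between corresponding cells of two aligned scans. The grid size, pooling and temperature are illustrative choices.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of BEV-level self-supervision: point features from two aligned
# lidar scans of the same scene are mean-pooled into 2D bird's-eye-view cells,
# and corresponding cells are pulled together with an InfoNCE-style loss.

def pool_to_bev(points_xy, feats, cell_size=0.5, grid=64):
    """points_xy: (N, 2) metric coords, feats: (N, C) per-point features."""
    ij = ((points_xy / cell_size) + grid // 2).long().clamp(0, grid - 1)
    idx = ij[:, 0] * grid + ij[:, 1]                       # flat cell index per point
    num_cells = grid * grid
    summed = torch.zeros(num_cells, feats.size(1)).index_add_(0, idx, feats)
    counts = torch.zeros(num_cells).index_add_(0, idx, torch.ones(len(idx)))
    occupied = counts > 0
    return summed[occupied] / counts[occupied].unsqueeze(1), occupied

def bev_contrastive_loss(f1, f2, temperature=0.07):
    """f1, f2: (M, C) features of the same occupied cells in two scans."""
    f1, f2 = F.normalize(f1, dim=1), F.normalize(f2, dim=1)
    logits = f1 @ f2.t() / temperature                      # (M, M)
    targets = torch.arange(len(f1))
    return F.cross_entropy(logits, targets)

# Toy example: identical geometry, different features (e.g. two augmented views).
xy = torch.rand(2048, 2) * 20 - 10
feat_a, feat_b = torch.randn(2048, 32), torch.randn(2048, 32)
bev_a, occ_a = pool_to_bev(xy, feat_a)
bev_b, occ_b = pool_to_bev(xy, feat_b)
print(bev_contrastive_loss(bev_a, bev_b).item())
```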

Generalizing to Unseen Domains in Diabetic Retinopathy Classification

  • paper_url: http://arxiv.org/abs/2310.17255
  • repo_url: https://github.com/chumsy0725/spsd-vit
  • paper_authors: Chamuditha Jayanga Galappaththige, Gayal Kuruppu, Muhammad Haris Khan
  • for: 这项研究旨在解决糖尿病视网膜病变(DR)分类中模型对未见分布(领域)的泛化问题,以支持早期筛查与诊断。
  • methods: 提出了一种简单而有效的领域泛化(DG)方法,通过一种新的预测软化机制在视觉Transformer(ViT)中实现自蒸馏。
  • results: 在多个具有挑战性的开源DR分类数据集上,于多源和单源DG设定下使用三种不同的ViT骨干进行了广泛实验;所提方法取得了有竞争力的性能,并获得了更好的校准(calibration)表现。
    Abstract Diabetic retinopathy (DR) is caused by long-standing diabetes and is among the fifth leading cause for visual impairments. The process of early diagnosis and treatments could be helpful in curing the disease, however, the detection procedure is rather challenging and mostly tedious. Therefore, automated diabetic retinopathy classification using deep learning techniques has gained interest in the medical imaging community. Akin to several other real-world applications of deep learning, the typical assumption of i.i.d data is also violated in DR classification that relies on deep learning. Therefore, developing DR classification methods robust to unseen distributions is of great value. In this paper, we study the problem of generalizing a model to unseen distributions or domains (a.k.a domain generalization) in DR classification. To this end, we propose a simple and effective domain generalization (DG) approach that achieves self-distillation in vision transformers (ViT) via a novel prediction softening mechanism. This prediction softening is an adaptive convex combination one-hot labels with the model's own knowledge. We perform extensive experiments on challenging open-source DR classification datasets under both multi-source and single-source DG settings with three different ViT backbones to establish the efficacy and applicability of our approach against competing methods. For the first time, we report the performance of several state-of-the-art DG methods on open-source DR classification datasets after conducting thorough experiments. Finally, our method is also capable of delivering improved calibration performance than other methods, showing its suitability for safety-critical applications, including healthcare. We hope that our contributions would investigate more DG research across the medical imaging community.
    摘要 糖尿病视网膜病变(DR)由长期糖尿病引起,是导致视力损害的第五大原因。早期诊断和治疗有助于控制病情,但检测过程颇具挑战且大多繁琐,因此基于深度学习的自动DR分类在医学影像领域受到关注。与深度学习的许多其他真实应用类似,DR分类中独立同分布(i.i.d.)数据的典型假设同样会被违背,因此开发对未见分布保持鲁棒的DR分类方法具有重要价值。本文研究DR分类中将模型泛化到未见分布或领域(即领域泛化)的问题。为此,我们提出一种简单而有效的领域泛化(DG)方法,通过一种新颖的预测软化机制在视觉Transformer(ViT)中实现自蒸馏:预测软化是将one-hot标签与模型自身的知识进行自适应凸组合。我们在具有挑战性的开源DR分类数据集上,于多源与单源DG设定下使用三种不同的ViT骨干进行了大量实验,验证了所提方法相对于对比方法的有效性与适用性;据我们所知,这是首次在开源DR分类数据集上系统报告多种最新DG方法的性能。此外,我们的方法还能带来比其他方法更好的校准(calibration)表现,显示其适用于包括医疗在内的安全关键应用。我们希望这项贡献能推动医学影像领域开展更多的DG研究。
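
A small sketch of the prediction-softening target, assuming one plausible way to adapt the mixing weight (by the model's confidence in the true class); the paper's adaptive rule may differ.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of prediction softening for self-distillation: the training
# target is an adaptive convex combination of the one-hot label and the model's
# own (detached) predictive distribution. How alpha is adapted in the paper may
# differ; here it simply scales with the model's confidence in the true class.

def softened_targets(logits, labels, num_classes, alpha_max=0.5):
    probs = logits.softmax(dim=1).detach()
    one_hot = F.one_hot(labels, num_classes).float()
    # Trust the model's own distribution more when it is confident on the true class.
    conf_true = probs.gather(1, labels[:, None])               # (B, 1)
    alpha = alpha_max * conf_true                              # per-sample mixing weight
    return (1.0 - alpha) * one_hot + alpha * probs

def softened_ce(logits, targets):
    return -(targets * logits.log_softmax(dim=1)).sum(dim=1).mean()

# Toy training step
logits = torch.randn(8, 5, requires_grad=True)
labels = torch.randint(0, 5, (8,))
loss = softened_ce(logits, softened_targets(logits, labels, 5))
loss.backward()
print(float(loss))
```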

Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification

  • paper_url: http://arxiv.org/abs/2310.17218
  • repo_url: None
  • paper_authors: Jiachen Li, Xiaojin Gong
  • for: 在多种监督设定下提升目标重识别(Re-ID)的性能。
  • methods: 适配大规模预训练视觉语言模型 CLIP,直接使用原型对比学习(PCL)损失微调 CLIP 的图像编码器,无需提示学习。
  • results: 在人车Re-ID数据集上实现了与CLIP-ReID相当的竞争性表现,并在无监督情况下进一步实现了状态级表现。
    Abstract This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object reidentification (Re-ID) across various supervision settings. Although prompt learning has enabled a recent work named CLIP-ReID to achieve promising performance, the underlying mechanisms and the necessity of prompt learning remain unclear due to the absence of semantic labels in ReID tasks. In this work, we first analyze the role prompt learning in CLIP-ReID and identify its limitations. Based on our investigations, we propose a simple yet effective approach to adapt CLIP for supervised object Re-ID. Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning. Experimental results on both person and vehicle Re-ID datasets demonstrate the competitiveness of our method compared to CLIP-ReID. Furthermore, we extend our PCL-based CLIP fine-tuning approach to unsupervised scenarios, where we achieve state-of-the art performance.
    摘要 本工作旨在将对比语言-图像预训练(CLIP)等大规模预训练视觉语言模型加以适配,以在多种监督设定下提升目标重识别(Re-ID)的性能。虽然提示学习使近期的CLIP-ReID取得了不错的表现,但由于Re-ID任务缺乏语义标签,其内在机制以及提示学习的必要性仍不清楚。本文首先分析了提示学习在CLIP-ReID中的作用并指出其局限。基于这些分析,我们提出了一种简单而有效的方法,将CLIP适配到有监督目标重识别:直接使用原型对比学习(PCL)损失微调CLIP的图像编码器,从而无需提示学习。在行人和车辆Re-ID数据集上的实验结果表明,我们的方法与CLIP-ReID相比具有竞争力。此外,我们将基于PCL的CLIP微调方法扩展到无监督场景,并取得了最优性能。
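
A minimal sketch of a prototypical contrastive loss computed per batch; the paper maintains prototypes differently (e.g. with a memory bank over the whole dataset), and the feature tensor here is only a stand-in for CLIP image-encoder outputs.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a prototypical contrastive loss: each identity is represented
# by a prototype (here the mean of its normalized features in the batch), and
# every sample is pulled towards its own prototype and pushed from the others.

def prototypical_contrastive_loss(features, labels, temperature=0.07):
    feats = F.normalize(features, dim=1)                     # (B, D)
    classes, targets = torch.unique(labels, return_inverse=True)
    protos = torch.stack([feats[labels == c].mean(dim=0) for c in classes])
    protos = F.normalize(protos, dim=1)                      # (K, D)
    logits = feats @ protos.t() / temperature                # (B, K)
    return F.cross_entropy(logits, targets)

# Toy usage; `features` stands in for image-encoder outputs, not the actual CLIP API.
features = torch.randn(16, 512, requires_grad=True)
labels = torch.randint(0, 4, (16,))                          # person / vehicle identities
loss = prototypical_contrastive_loss(features, labels)
loss.backward()
print(float(loss))
```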

Three-dimensional Bone Image Synthesis with Generative Adversarial Networks

  • paper_url: http://arxiv.org/abs/2310.17216
  • repo_url: None
  • paper_authors: Christoph Angermann, Johannes Bereiter-Payr, Kerstin Stock, Markus Haltmeier, Gerald Degenhart
  • for: 这篇论文是为了探讨三维生成对医疗影像处理领域的应用而写的。
  • methods: 这篇论文使用的方法是基于三维生成对抗网络(GAN),可以高效地生成高分辨率医疗影像 Volume 的细节。
  • results: 结果表明,GAN可以在三维设定下高效生成,并支持大规模数据驱动模型的开发;GAN反演(inversion)同样可在三维设定下实现,并用于模型可解释性研究以及图像变形、属性编辑和风格混合等应用。结果在三维 HR-pQCT 数据集上得到了全面验证。
    Abstract Medical image processing has been highlighted as an area where deep learning-based models have the greatest potential. However, in the medical field in particular, problems of data availability and privacy are hampering research progress and thus rapid implementation in clinical routine. The generation of synthetic data not only ensures privacy, but also allows to \textit{draw} new patients with specific characteristics, enabling the development of data-driven models on a much larger scale. This work demonstrates that three-dimensional generative adversarial networks (GANs) can be efficiently trained to generate high-resolution medical volumes with finely detailed voxel-based architectures. In addition, GAN inversion is successfully implemented for the three-dimensional setting and used for extensive research on model interpretability and applications such as image morphing, attribute editing and style mixing. The results are comprehensively validated on a database of three-dimensional HR-pQCT instances representing the bone micro-architecture of the distal radius.
    摘要 医学图像处理被认为是深度学习模型最具潜力的领域之一。然而,特别是在医疗领域,数据可用性与隐私问题阻碍了研究进展,进而也阻碍了其在临床常规中的快速落地。生成合成数据不仅能保障隐私,还可以“生成”具有特定特征的新患者,从而支持在更大规模上开发数据驱动模型。这项工作表明,三维生成对抗网络(GAN)可以被高效训练,用于生成具有精细体素结构的高分辨率医学体数据。此外,GAN反演也被成功扩展到三维设定,并用于模型可解释性以及图像变形、属性编辑和风格混合等应用的深入研究。所有结果均在代表桡骨远端骨微结构的三维 HR-pQCT 数据集上得到了全面验证。

Weakly-Supervised Surgical Phase Recognition

  • paper_url: http://arxiv.org/abs/2310.17209
  • repo_url: None
  • paper_authors: Roy Hirsch, Regev Cohen, Mathilde Caron, Tomer Golany, Daniel Freedman, Ehud Rivlin
  • for: 针对计算机助手手术系统中的阶段识别问题
  • methods: 结合图分割与自监督学习,提出一种逐帧阶段预测的随机游走方案,并利用稀疏时间戳或小样本学习这两种弱监督形式
  • results: 在公开的Cholec80数据集上进行实验,在多种设定下均展现出有前景的性能,且算法复杂度低、适用于低数据量场景
    Abstract A key element of computer-assisted surgery systems is phase recognition of surgical videos. Existing phase recognition algorithms require frame-wise annotation of a large number of videos, which is time and money consuming. In this work we join concepts of graph segmentation with self-supervised learning to derive a random-walk solution for per-frame phase prediction. Furthermore, we utilize within our method two forms of weak supervision: sparse timestamps or few-shot learning. The proposed algorithm enjoys low complexity and can operate in lowdata regimes. We validate our method by running experiments with the public Cholec80 dataset of laparoscopic cholecystectomy videos, demonstrating promising performance in multiple setups.
    摘要 computer-assisted surgery systems中一个关键元素是运行过程识别。现有的运行识别算法需要大量的几几个影像档案进行框架层级的标注,这是时间和金额的浪费。在这个工作中,我们结合了Graph分类和自动学习的概念,以derive一个随机步进行每帧运行预测。此外,我们在方法中使用了两种弱型指导:稀脱时间标签或几何学学习。提议的算法具有低复杂度,可以在低数据 режи中运行。我们运行了 experiments with公共Cholec80dataset of laparoscopic cholecystectomy videos, demonstarted promising performance in multiple setups。

Lookup Table meets Local Laplacian Filter: Pyramid Reconstruction Network for Tone Mapping

  • paper_url: http://arxiv.org/abs/2310.17190
  • repo_url: https://github.com/fengzhang427/LLF-LUT
  • paper_authors: Feng Zhang, Ming Tian, Zhiqiang Li, Bin Xu, Qingbo Lu, Changxin Gao, Nong Sang
  • for: 解决传统基于三维查找表(3D LUT)的色调映射方法往往无法保留图像局部细节的局限。
  • methods: 提出一种新策略:通过闭式拉普拉斯金字塔分解与重建,结合图像自适应3D LUT与渐进学习的局部拉普拉斯滤波器,同时实现全局与局部操作。
  • results: 在两个基准数据集上的大量实验表明,该方法能够同时兼顾全局色调与局部细节,且优于现有方法。
    Abstract Tone mapping aims to convert high dynamic range (HDR) images to low dynamic range (LDR) representations, a critical task in the camera imaging pipeline. In recent years, 3-Dimensional LookUp Table (3D LUT) based methods have gained attention due to their ability to strike a favorable balance between enhancement performance and computational efficiency. However, these methods often fail to deliver satisfactory results in local areas since the look-up table is a global operator for tone mapping, which works based on pixel values and fails to incorporate crucial local information. To this end, this paper aims to address this issue by exploring a novel strategy that integrates global and local operators by utilizing closed-form Laplacian pyramid decomposition and reconstruction. Specifically, we employ image-adaptive 3D LUTs to manipulate the tone in the low-frequency image by leveraging the specific characteristics of the frequency information. Furthermore, we utilize local Laplacian filters to refine the edge details in the high-frequency components in an adaptive manner. Local Laplacian filters are widely used to preserve edge details in photographs, but their conventional usage involves manual tuning and fixed implementation within camera imaging pipelines or photo editing tools. We propose to learn parameter value maps progressively for local Laplacian filters from annotated data using a lightweight network. Our model achieves simultaneous global tone manipulation and local edge detail preservation in an end-to-end manner. Extensive experimental results on two benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art methods.
    摘要 色调映射旨在将高动态范围(HDR)图像转换为低动态范围(LDR)表示,是相机成像流水线中的关键环节。近年来,基于三维查找表(3D LUT)的方法因能在增强效果与计算效率之间取得良好平衡而受到关注。然而,查找表是基于像素值的全局色调映射算子,无法利用关键的局部信息,因此这类方法往往难以在局部区域取得令人满意的结果。为此,本文探索了一种借助闭式拉普拉斯金字塔分解与重建来融合全局与局部算子的新策略:利用频率信息的特性,采用图像自适应3D LUT对低频图像进行色调调整;同时以自适应方式使用局部拉普拉斯滤波器细化高频分量中的边缘细节。局部拉普拉斯滤波器被广泛用于保持照片中的边缘细节,但其传统用法需要手动调参,并以固定方式嵌入相机成像流水线或修图工具中。我们提出利用一个轻量级网络,从标注数据中渐进地学习局部拉普拉斯滤波器的参数值图。我们的模型能够以端到端方式同时实现全局色调调整与局部边缘细节保持。在两个基准数据集上的大量实验结果表明,所提方法优于当前最先进的方法。
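
To illustrate the pyramid split that lets a global LUT and local filters act on different bands, here is a minimal OpenCV sketch of closed-form Laplacian pyramid decomposition and reconstruction; the per-band operators are trivial placeholders, not the learned 3D LUT and local Laplacian modules from the paper.

```python
import cv2
import numpy as np

# Hedged sketch: split an image into a low-frequency residual (where a global
# operator such as an image-adaptive 3D LUT would act) and high-frequency bands
# (where local, edge-aware refinement would act), then merge them back.

def build_laplacian_pyramid(img, levels=3):
    gauss = [img.astype(np.float32)]
    for _ in range(levels):
        gauss.append(cv2.pyrDown(gauss[-1]))
    lap = []
    for i in range(levels):
        up = cv2.pyrUp(gauss[i + 1], dstsize=(gauss[i].shape[1], gauss[i].shape[0]))
        lap.append(gauss[i] - up)
    return lap, gauss[-1]                      # high-frequency bands + low-frequency residual

def reconstruct(lap, low):
    img = low
    for band in reversed(lap):
        img = cv2.pyrUp(img, dstsize=(band.shape[1], band.shape[0])) + band
    return img

img = (np.random.rand(256, 384, 3) * 255).astype(np.uint8)
bands, low = build_laplacian_pyramid(img)
low_toned = np.clip(low * 1.1, 0, 255)         # stand-in for the 3D LUT tone mapping
bands = [b * 1.05 for b in bands]              # stand-in for local Laplacian refinement
out = np.clip(reconstruct(bands, low_toned), 0, 255).astype(np.uint8)
print(out.shape, float(np.abs(out.astype(np.float32) - img).mean()))
```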

Exploring Iterative Refinement with Diffusion Models for Video Grounding

  • paper_url: http://arxiv.org/abs/2310.17189
  • repo_url: https://github.com/mastervito/diffusionvg
  • paper_authors: Xiao Liang, Tao Shi, Yaoyuan Liang, Te Tao, Shao-Lun Huang
  • for: 本研究旨在提升视频定位(video grounding)的精度,使句子查询与视频中的目标时刻更好地对应。
  • methods: 提出基于扩散模型的新框架 DiffusionVG,将视频定位视为条件生成任务:通过前向扩散过程逐步向目标区间添加噪声,并在反向扩散过程中从噪声输入迭代恢复目标区间。
  • results: 在主流的Charades-STA和ActivityNet Captions基准上,DiffusionVG无需额外技巧即可取得与现有精心设计模型相当甚至更优的表现。
    Abstract Video grounding aims to localize the target moment in an untrimmed video corresponding to a given sentence query. Existing methods typically select the best prediction from a set of predefined proposals or directly regress the target span in a single-shot manner, resulting in the absence of a systematical prediction refinement process. In this paper, we propose DiffusionVG, a novel framework with diffusion models that formulates video grounding as a conditional generation task, where the target span is generated from Gaussian noise inputs and interatively refined in the reverse diffusion process. During training, DiffusionVG progressively adds noise to the target span with a fixed forward diffusion process and learns to recover the target span in the reverse diffusion process. In inference, DiffusionVG can generate the target span from Gaussian noise inputs by the learned reverse diffusion process conditioned on the video-sentence representations. Our DiffusionVG follows the encoder-decoder architecture, which firstly encodes the video-sentence features and iteratively denoises the predicted spans in its specialized span refining decoder. Without bells and whistles, our DiffusionVG demonstrates competitive or even superior performance compared to existing well-crafted models on mainstream Charades-STA and ActivityNet Captions benchmarks.
    摘要 视频定位(video grounding)旨在未剪辑视频中定位与给定句子查询相对应的目标时刻。现有方法通常从一组预定义候选中选出最佳预测,或以单次回归的方式直接预测目标时间区间,缺乏系统性的预测细化过程。本文提出DiffusionVG,一种基于扩散模型的新框架,将视频定位表述为条件生成任务:目标时间区间由高斯噪声输入生成,并在反向扩散过程中被迭代细化。训练时,DiffusionVG以固定的前向扩散过程逐步向目标区间添加噪声,并学习在反向扩散过程中恢复该区间;推理时,DiffusionVG以视频-句子表示为条件,通过学习到的反向扩散过程从高斯噪声输入生成目标区间。DiffusionVG采用编码器-解码器架构:先编码视频-句子特征,再在专门的区间细化解码器中对预测区间迭代去噪。无需额外技巧,DiffusionVG在主流的Charades-STA和ActivityNet Captions基准上展现出与现有精心设计的模型相当甚至更优的性能。
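
A small sketch of the forward (noising) half of span diffusion, assuming spans are parameterized as normalized (center, width) and a standard linear beta schedule; the conditioning and the span-refining decoder are only indicated in comments.

```python
import torch

# Hedged sketch of treating a temporal span as the diffusion variable: the
# ground-truth (center, width), normalized to [0, 1], is corrupted with the
# standard forward process, and a denoiser conditioned on video-sentence
# features would learn to recover it. Schedule and scaling are illustrative.

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(span, t):
    """span: (B, 2) clean (center, width) in [0, 1]; t: (B,) timestep indices."""
    a = alphas_cumprod[t].unsqueeze(1)                       # (B, 1)
    noise = torch.randn_like(span)
    noisy = a.sqrt() * span + (1.0 - a).sqrt() * noise
    return noisy, noise

# Training signal: the denoiser sees (noisy_span, t, video/sentence features)
# and is supervised to predict either `noise` or the clean `span`.
span = torch.tensor([[0.45, 0.20], [0.70, 0.10]])            # toy ground-truth spans
t = torch.randint(0, T, (2,))
noisy, noise = q_sample(span, t)
print(noisy)
```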

Blind Image Super-resolution with Rich Texture-Aware Codebooks

  • paper_url: http://arxiv.org/abs/2310.17188
  • repo_url: None
  • paper_authors: Rui Qin, Ming Sun, Fangyuan Zhang, Xing Wen, Bin Wang
  • for: 提升盲超分辨率(BSR)方法的效果,使其能更好地应对复杂的盲退化。
  • methods: 提出了一种基于高分辨率(HR)重建码库的 Rich Texture-aware Codebook-based Network(RTCNet),包括了适应性损害抑制模块(DTPM)和 patch-aware texture prior module(PTPM)。
  • results: RTCNet在多个 benchmark 上比州先进方法提高了0.16 ~ 0.46dB。
    Abstract Blind super-resolution (BSR) methods based on high-resolution (HR) reconstruction codebooks have achieved promising results in recent years. However, we find that a codebook based on HR reconstruction may not effectively capture the complex correlations between low-resolution (LR) and HR images. In detail, multiple HR images may produce similar LR versions due to complex blind degradations, causing the HR-dependent only codebooks having limited texture diversity when faced with confusing LR inputs. To alleviate this problem, we propose the Rich Texture-aware Codebook-based Network (RTCNet), which consists of the Degradation-robust Texture Prior Module (DTPM) and the Patch-aware Texture Prior Module (PTPM). DTPM effectively mines the cross-resolution correlation of textures between LR and HR images by exploiting the cross-resolution correspondence of textures. PTPM uses patch-wise semantic pre-training to correct the misperception of texture similarity in the high-level semantic regularization. By taking advantage of this, RTCNet effectively gets rid of the misalignment of confusing textures between HR and LR in the BSR scenarios. Experiments show that RTCNet outperforms state-of-the-art methods on various benchmarks by up to 0.16 ~ 0.46dB.
    摘要 近年来,基于高分辨率(HR)重建码本的盲超分辨率(BSR)方法取得了可喜的成果。然而,我们发现仅基于HR重建的码本可能无法有效刻画低分辨率(LR)与HR图像之间的复杂相关性:由于复杂的盲退化,多张不同的HR图像可能产生相似的LR版本,使得仅依赖HR的码本在面对易混淆的LR输入时纹理多样性有限。为缓解这一问题,我们提出了富纹理感知码本网络(RTCNet),由退化鲁棒纹理先验模块(DTPM)和分块纹理先验模块(PTPM)组成。DTPM通过利用纹理的跨分辨率对应关系,有效挖掘LR与HR图像之间纹理的跨分辨率相关性;PTPM则借助分块级语义预训练,纠正高层语义正则化中对纹理相似性的误判。借助这两者,RTCNet在BSR场景中有效消除了HR与LR之间易混淆纹理的错位问题。实验表明,RTCNet在多个基准上超越最先进方法0.16~0.46dB。

MO-YOLO: End-to-End Multiple-Object Tracking Method with YOLO and MOTR

  • paper_url: http://arxiv.org/abs/2310.17170
  • repo_url: https://github.com/liaopan-lp/MO-YOLO
  • paper_authors: Liao Pan, Yang Feng, Wu Di, Liu Bo, Zhang Xingle
  • for: 提高多对象跟踪(MOT)领域中的灵活性和计算效率,提出一种高效、轻量级、计算资源减少的端到端多对象跟踪模型,名为MO-YOLO。
  • methods: 结合YOLO和RT-DETR模型,构建一个高效、轻量级、计算资源减少的端到端多对象跟踪网络,以提高MOT领域的计算效率和灵活性。
  • results: 在MOT17数据集上,MOTR需要8块GeForce 2080 Ti GPU训练4天才能取得满意结果,而MO-YOLO只需1块GeForce 2080 Ti GPU训练12小时即可达到相当的性能。
    Abstract This paper aims to address critical issues in the field of Multi-Object Tracking (MOT) by proposing an efficient and computationally resource-efficient end-to-end multi-object tracking model, named MO-YOLO. Traditional MOT methods typically involve two separate steps: object detection and object tracking, leading to computational complexity and error propagation issues. Recent research has demonstrated outstanding performance in end-to-end MOT models based on Transformer architectures, but they require substantial hardware support. MO-YOLO combines the strengths of YOLO and RT-DETR models to construct a high-efficiency, lightweight, and resource-efficient end-to-end multi-object tracking network, offering new opportunities in the multi-object tracking domain. On the MOT17 dataset, MOTR\cite{zeng2022motr} requires training with 8 GeForce 2080 Ti GPUs for 4 days to achieve satisfactory results, while MO-YOLO only requires 1 GeForce 2080 Ti GPU and 12 hours of training to achieve comparable performance.
    摘要 这篇论文目标是解决多对物跟踪(MOT)领域的关键问题,提出一种高效、计算资源充足的端到端多对物跟踪模型,名为MO-YOLO。传统的MOT方法通常包括两个分开的步骤:物体检测和物体跟踪,导致计算复杂性和错误传递问题。现代研究表明,基于Transformer架构的端到端MOT模型可以达到出色的性能,但它们需要重要的硬件支持。MO-YOLO将YOLO和RT-DETR模型的优点相结合,构建一个高效、轻量级、计算资源充足的端到端多对物跟踪网络,为多对物跟踪领域带来新的机遇。在MOT17数据集上,MOTR\cite{zeng2022motr}需要训练8个GeForce 2080 Ti GPU的4天时间来获得满意的结果,而MO-YOLO只需1个GeForce 2080 Ti GPU和12个小时的训练时间来达到相当的性能。

Bridging Phylogeny and Taxonomy with Protein-protein Interaction Networks

  • paper_url: http://arxiv.org/abs/2310.17164
  • repo_url: None
  • paper_authors: Long-Huei Chen, Mohana Prasad Sathya Moorthy, Pratyaksh Sharma
  • for: 这项研究旨在更深入地理解生物体内的蛋白质-蛋白质互作(PPI)网络,以了解生物体之间的种系发生关系。
  • methods: 研究人员使用了已知种类的蛋白质网络统计特征来预测新发现的蛋白质网络统计特征,以及使用这些统计特征来分类生物体。
  • results: 研究人员成功创建了一个预测蛋白质网络统计特征的模型,以及一个使用蛋白质网络统计特征来分类生物体的模型。这两个模型成功地将蛋白质网络和种系发生关系两个领域联系起来。
    Abstract The protein-protein interaction (PPI) network provides an overview of the complex biological reactions vital to an organism's metabolism and survival. Even though in the past PPI network were compared across organisms in detail, there has not been large-scale research on how individual PPI networks reflect on the species relationships. In this study we aim to increase our understanding of the tree of life and taxonomy by gleaming information from the PPI networks. We successful created (1) a predictor of network statistics based on known traits of existing species in the phylogeny, and (2) a taxonomic classifier of organism using the known protein network statistics, whether experimentally determined or predicted de novo. With the knowledge of protein interactions at its core, our two models effectively connects two field with widely diverging methodologies - the phylogeny and taxonomy of species.
    摘要 蛋白质-蛋白质相互作用(PPI)网络概览了对生物体代谢与存活至关重要的复杂生化反应。尽管以往已对不同物种的PPI网络进行过细致比较,但尚缺乏关于单个PPI网络如何反映物种间关系的大规模研究。在本研究中,我们旨在通过挖掘PPI网络中的信息,加深对生命之树与分类学的理解。我们成功构建了两个模型:(1)基于系统发生树中已有物种的已知性状来预测网络统计量的预测器;(2)利用已知的蛋白质网络统计量(无论是实验测定的还是从头预测的)对生物体进行分类的分类器。以蛋白质相互作用知识为核心,这两个模型有效地连接了系统发生学与分类学这两个方法论差异很大的领域。

Low-Dimensional Gradient Helps Out-of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2310.17163
  • repo_url: None
  • paper_authors: Yingwen Wu, Tao Li, Xinwen Cheng, Jie Yang, Xiaolin Huang
  • for: 这个研究旨在探讨深度神经网络(DNNs)中的外部资料探测(OOD)领域,以确保深度学习模型在实际应用中的可靠性。
  • methods: 这个研究使用了整个梯度信息来进行OOD探测,包括梯度方向和梯度norm。具体来说,研究者使用了一个特定的主成分空间来实现线性维度减少,从而获得了具有最小资料损失的低维度表示。
  • results: 研究结果显示,这个新的OOD探测方法可以与现有的检测方法相比,在各种检测任务中表现出色,例如在ImageNetbenchmark上,这个方法可以实现11.15%的 False Positive Rate reduction(FPR95)。
    Abstract Detecting out-of-distribution (OOD) samples is essential for ensuring the reliability of deep neural networks (DNNs) in real-world scenarios. While previous research has predominantly investigated the disparity between in-distribution (ID) and OOD data through forward information analysis, the discrepancy in parameter gradients during the backward process of DNNs has received insufficient attention. Existing studies on gradient disparities mainly focus on the utilization of gradient norms, neglecting the wealth of information embedded in gradient directions. To bridge this gap, in this paper, we conduct a comprehensive investigation into leveraging the entirety of gradient information for OOD detection. The primary challenge arises from the high dimensionality of gradients due to the large number of network parameters. To solve this problem, we propose performing linear dimension reduction on the gradient using a designated subspace that comprises principal components. This innovative technique enables us to obtain a low-dimensional representation of the gradient with minimal information loss. Subsequently, by integrating the reduced gradient with various existing detection score functions, our approach demonstrates superior performance across a wide range of detection tasks. For instance, on the ImageNet benchmark, our method achieves an average reduction of 11.15% in the false positive rate at 95% recall (FPR95) compared to the current state-of-the-art approach. The code would be released.
    摘要 检测分布外(out-of-distribution,OOD)样本对于保证深度神经网络(DNN)在实际场景中的可靠性至关重要。以往的研究主要通过前向信息分析来考察分布内(ID)数据与OOD数据之间的差异,而DNN反向传播过程中参数梯度的差异则较少受到关注;已有的梯度差异研究主要利用梯度范数,忽略了梯度方向中蕴含的丰富信息。为弥补这一空白,本文对如何利用完整的梯度信息进行OOD检测展开了全面研究。主要挑战在于网络参数数量庞大导致梯度维度极高。为此,我们提出在由主成分构成的指定子空间上对梯度进行线性降维,从而以极小的信息损失获得梯度的低维表示;随后,将降维后的梯度与多种现有检测评分函数相结合,我们的方法在广泛的检测任务中表现出色。例如,在ImageNet基准上,与当前最优方法相比,我们的方法在95%召回率下的误报率(FPR95)平均降低了11.15%。代码将会发布。
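
A hedged sketch of the gradient-subspace idea: per-sample gradients of a surrogate loss w.r.t. the last layer are projected with PCA fitted on in-distribution data, and a simple score is computed in the reduced space. The surrogate loss, the restriction to the last layer and the final score are illustrative assumptions, not necessarily the paper's choices.

```python
import torch
import torch.nn.functional as F
from sklearn.decomposition import PCA

# Hedged sketch: gradients are taken only w.r.t. the final linear layer to keep
# the dimension manageable; a principal subspace is fitted on in-distribution
# data and the reduced gradient feeds a simple detection score.

model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
last = model[-1]

def last_layer_gradient(x):
    logits = model(x.unsqueeze(0))
    # Label-free surrogate loss: discrepancy between prediction and a uniform distribution.
    loss = -(logits.log_softmax(dim=1)).mean()
    grads = torch.autograd.grad(loss, [last.weight, last.bias])
    return torch.cat([g.flatten() for g in grads]).detach().numpy()

# Fit the principal subspace on (toy) in-distribution samples.
id_grads = [last_layer_gradient(torch.randn(128)) for _ in range(256)]
pca = PCA(n_components=16).fit(id_grads)

def ood_score(x):
    z = pca.transform([last_layer_gradient(x)])[0]           # low-dimensional gradient
    return float((z ** 2).sum())                             # simple energy in the subspace

print("ID-like:", ood_score(torch.randn(128)), "shifted:", ood_score(torch.randn(128) * 5))
```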

Learning depth from monocular video sequences

  • paper_url: http://arxiv.org/abs/2310.17156
  • repo_url: None
  • paper_authors: Zhenwei Luo
  • for: 这个论文旨在提出一种基于单影视频序列的单张图像深度估计模型,以便在训练过程中更好地使用更多的图像作为监督。
  • methods: 我们提出了一种新的训练损失函数,使得在训练过程中可以更好地包含更多的图像作为监督。我们还提出了一种简单 yet effective的模型来考虑帧到帧像素运动。
  • results: 当我们将这些方法结合使用时,我们在KITTI dataset上的自主监督下得到了单张图像深度估计的state-of-the-art结果。
    Abstract Learning single image depth estimation model from monocular video sequence is a very challenging problem. In this paper, we propose a novel training loss which enables us to include more images for supervision during the training process. We propose a simple yet effective model to account the frame to frame pixel motion. We also design a novel network architecture for single image estimation. When combined, our method produces state of the art results for monocular depth estimation on the KITTI dataset in the self-supervised setting.
    摘要 学习单个图像深度估计模型从单摄影视频序列是一个非常困难的问题。在这篇论文中,我们提出了一种新的训练损失函数,允许我们在训练过程中使用更多的图像进行监督。我们提出了一种简单又有效的方法来考虑帧到帧像素运动。我们还设计了一种新的网络架构来实现单个图像估计。当这些方法相结合使用时,我们的方法在KITTI dataset上的自主监督 Setting中产生了state-of-the-art的结果。

Deep Imbalanced Regression via Hierarchical Classification Adjustment

  • paper_url: http://arxiv.org/abs/2310.17154
  • repo_url: None
  • paper_authors: Haipeng Xiong, Angela Yao
  • for: 本文提出了一种解决不平衡回归任务中的问题,即使用层次分类器来改善回归性能。
  • methods: 该方法将回归目标空间量化为由粗到细的层次分类器,并通过一种保持预测范围一致的蒸馏过程,从这组层次分类器中学习出单一的分类器。
  • results: 实验结果显示,该方法在三种多元的回归任务中(年龄估计、人群数量估计和深度估计)都达到了Superior result。
    Abstract Regression tasks in computer vision, such as age estimation or counting, are often formulated into classification by quantizing the target space into classes. Yet real-world data is often imbalanced -- the majority of training samples lie in a head range of target values, while a minority of samples span a usually larger tail range. By selecting the class quantization, one can adjust imbalanced regression targets into balanced classification outputs, though there are trade-offs in balancing classification accuracy and quantization error. To improve regression performance over the entire range of data, we propose to construct hierarchical classifiers for solving imbalanced regression tasks. The fine-grained classifiers limit the quantization error while being modulated by the coarse predictions to ensure high accuracy. Standard hierarchical classification approaches, however, when applied to the regression problem, fail to ensure that predicted ranges remain consistent across the hierarchy. As such, we propose a range-preserving distillation process that can effectively learn a single classifier from the set of hierarchical classifiers. Our novel hierarchical classification adjustment (HCA) for imbalanced regression shows superior results on three diverse tasks: age estimation, crowd counting and depth estimation. We will release the source code upon acceptance.
    摘要 计算机视觉中的回归任务(如年龄估计或计数)常通过将目标空间量化为若干类别而转化为分类问题。然而,真实数据往往是不平衡的:大多数训练样本集中在目标值的头部区间,少数样本则分布在通常更宽的尾部区间。通过选择类别量化方式,可以将不平衡的回归目标调整为较为平衡的分类输出,但需要在分类精度与量化误差之间权衡。为了在整个数据范围上提升回归性能,我们提出构建层次分类器来求解不平衡回归任务:细粒度分类器限制量化误差,同时受粗粒度预测的调制以保证高精度。然而,标准的层次分类方法应用于回归问题时,无法保证各层级预测的范围保持一致。为此,我们提出一种保持范围一致的蒸馏过程,能够从这组层次分类器中有效地学习出单一的分类器。我们提出的面向不平衡回归的层次分类调整(HCA)在年龄估计、人群计数和深度估计三个不同任务上均取得了更优的结果。论文被接收后我们将发布源代码。
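
A toy sketch of the hierarchical quantization view: a continuous target is binned at coarse and fine granularity, and the fine prediction is restricted to children of the predicted coarse bin. The branching factor and the decoding rule are illustrative, not the paper's HCA.

```python
import numpy as np

# Hedged sketch: coarse bins give small quantization error in class space and
# are easy to classify; fine bins reduce quantization error but are harder, so
# the fine decision is modulated (here: masked) by the coarse prediction.

def hierarchical_labels(y, y_min, y_max, coarse_bins=4, branch=4):
    fine_bins = coarse_bins * branch
    fine = np.clip(((y - y_min) / (y_max - y_min) * fine_bins).astype(int), 0, fine_bins - 1)
    return fine // branch, fine                       # (coarse label, fine label)

def decode(coarse_probs, fine_probs, y_min, y_max, branch=4):
    """Pick the fine bin, but only among children of the most likely coarse bin."""
    fine_bins = len(fine_probs)
    c = int(np.argmax(coarse_probs))
    mask = np.zeros(fine_bins); mask[c * branch:(c + 1) * branch] = 1.0
    f = int(np.argmax(fine_probs * mask))
    centers = y_min + (np.arange(fine_bins) + 0.5) * (y_max - y_min) / fine_bins
    return centers[f]

# Toy imbalanced ages in [0, 80]: most samples young (head), few old (tail).
ages = np.concatenate([np.random.uniform(15, 35, 900), np.random.uniform(60, 80, 100)])
coarse, fine = hierarchical_labels(ages, 0, 80)
print(np.bincount(coarse), decode(np.array([0.1, 0.7, 0.1, 0.1]), np.random.rand(16), 0, 80))
```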

Simple Baselines for Projection-based Full-reference and No-reference Point Cloud Quality Assessment

  • paper_url: http://arxiv.org/abs/2310.17147
  • repo_url: None
  • paper_authors: Zicheng Zhang, Yingjie Zhou, Wei Sun, Xiongkuo Min, Guangtao Zhai
  • for: 本研究旨在提供高效的点云质量评估方法,以满足存储和带宽限制下的3D内容表示和应用需求。
  • methods: 该研究使用多个投影方法从点云数据中获取多个投影,并使用流行的视觉脊梁提取质量感知特征。基于FR和NR两种任务,分别计算出全referenced和无参照质量表示。
  • results: 在ICIP 2023 PCVQA挑战中,该研究取得了五个评测轨道中的四个首位。
    Abstract Point clouds are widely used in 3D content representation and have various applications in multimedia. However, compression and simplification processes inevitably result in the loss of quality-aware information under storage and bandwidth constraints. Therefore, there is an increasing need for effective methods to quantify the degree of distortion in point clouds. In this paper, we propose simple baselines for projection-based point cloud quality assessment (PCQA) to tackle this challenge. We use multi-projections obtained via a common cube-like projection process from the point clouds for both full-reference (FR) and no-reference (NR) PCQA tasks. Quality-aware features are extracted with popular vision backbones. The FR quality representation is computed as the similarity between the feature maps of reference and distorted projections while the NR quality representation is obtained by simply squeezing the feature maps of distorted projections with average pooling The corresponding quality representations are regressed into visual quality scores by fully-connected layers. Taking part in the ICIP 2023 PCVQA Challenge, we succeeded in achieving the top spot in four out of the five competition tracks.
    摘要 点云广泛用于3D内容表示,在多媒体领域有多种应用。然而,在存储和带宽受限的情况下,压缩与简化过程不可避免地会损失与质量相关的信息,因此有效量化点云失真程度的方法变得越来越重要。在这篇论文中,我们提出了一些简单的基线方法,用于基于投影的点云质量评估(PCQA)任务。我们使用通过常见的立方体式投影过程获得的多个投影,并用流行的视觉骨干网络提取质量感知特征。FR质量表示为参考投影与失真投影特征图之间的相似度,而NR质量表示则通过平均池化压缩失真投影的特征图得到;相应的质量表示再经全连接层回归为视觉质量分数。在ICIP 2023 PCVQA挑战中,我们在五个竞赛轨道中的四个获得了第一名。
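Below is a minimal sketch of the FR/NR quality representations described above, assuming a ResNet-18 backbone, cosine similarity as the FR measure, and six cube-face projections; the paper's exact backbone, projection setup, and regressor may differ.

```python
# Sketch: extract quality-aware features from point cloud projections, then form
# full-reference (similarity) and no-reference (average-pooled) quality representations.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet18(weights=None)
backbone = nn.Sequential(*list(backbone.children())[:-2])  # keep spatial feature maps
backbone.eval()

def fr_representation(ref_proj, dist_proj):
    """Full-reference: similarity between feature maps of reference and distorted projections."""
    with torch.no_grad():
        f_ref, f_dist = backbone(ref_proj), backbone(dist_proj)
    return nn.functional.cosine_similarity(f_ref.flatten(1), f_dist.flatten(1), dim=1)

def nr_representation(dist_proj):
    """No-reference: squeeze distorted-projection feature maps with average pooling."""
    with torch.no_grad():
        f_dist = backbone(dist_proj)
    return f_dist.mean(dim=(2, 3))  # global average pooling -> (B, C)

# toy usage with 6 cube-face projections rendered to 224x224 RGB images (assumed sizes)
ref = torch.rand(6, 3, 224, 224)
dist = torch.rand(6, 3, 224, 224)
quality_head = nn.Linear(512, 1)                     # regress pooled features to a quality score
print(fr_representation(ref, dist).shape)            # torch.Size([6])
print(quality_head(nr_representation(dist)).shape)   # torch.Size([6, 1])
```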

A Classifier Using Global Character Level and Local Sub-unit Level Features for Hindi Online Handwritten Character Recognition

  • paper_url: http://arxiv.org/abs/2310.17138
  • repo_url: None
  • paper_authors: Anand Sharma, A. G. Ramakrishnan
  • for: The paper develops a classifier for Hindi online handwritten characters that models the joint distribution of global character features, the number of sub-units, and local sub-unit features using latent variables.
  • methods: The classifier represents characters with histograms of points, orientations, and dynamics of orientations (HPOD) features at both the global character level and the local sub-unit level; its parameters are estimated by the maximum likelihood method, and its performance is compared against classifiers and features used in previous studies.
  • results: The developed classifier achieves the highest accuracy of 93.5% on the testing set, outperforming classifiers trained on other features extracted from the same training set and evaluated on the same testing set.
    Abstract A classifier is developed that defines a joint distribution of global character features, number of sub-units and local sub-unit features to model Hindi online handwritten characters. The classifier uses latent variables to model the structure of sub-units. The classifier uses histograms of points, orientations, and dynamics of orientations (HPOD) features to represent characters at global character level and local sub-unit level and is independent of character stroke order and stroke direction variations. The parameters of the classifier is estimated using maximum likelihood method. Different classifiers and features used in other studies are considered in this study for classification performance comparison with the developed classifier. The classifiers considered are Second Order Statistics (SOS), Sub-space (SS), Fisher Discriminant (FD), Feedforward Neural Network (FFN) and Support Vector Machines (SVM) and the features considered are Spatio Temporal (ST), Discrete Fourier Transform (DFT), Discrete Cosine Transform (SCT), Discrete Wavelet Transform (DWT), Spatial (SP) and Histograms of Oriented Gradients (HOG). Hindi character datasets used for training and testing the developed classifier consist of samples of handwritten characters from 96 different character classes. There are 12832 samples with an average of 133 samples per character class in the training set and 2821 samples with an average of 29 samples per character class in the testing set. The developed classifier has the highest accuracy of 93.5\% on the testing set compared to that of the classifiers trained on different features extracted from the same training set and evaluated on the same testing set considered in this study.
    摘要 我们开发了一种分类器,它定义了全局字符特征、子单元数量和局部子单元特征的联合分布,用于建模印地语在线手写字符。该分类器使用隐变量来建模子单元的结构,并使用点、方向及方向动态的直方图(HPOD)特征在全局字符层面和局部子单元层面表示字符,且不受字符笔画顺序和笔画方向变化的影响。分类器的参数通过最大似然方法估计。为比较分类性能,本研究还考虑了其他研究中使用的分类器和特征,包括二阶统计(SOS)、子空间(SS)、Fisher判别(FD)、前馈神经网络(FFN)和支持向量机(SVM)等分类器,以及时空特征(ST)、离散傅里叶变换(DFT)、离散余弦变换(DCT)、离散小波变换(DWT)、空间特征(SP)和方向梯度直方图(HOG)等特征。用于训练和测试的印地语字符数据集包含96个字符类的手写样本:训练集有12832个样本,平均每类约133个;测试集有2821个样本,平均每类约29个。与在同一训练集上用其他特征训练、并在同一测试集上评估的分类器相比,所开发的分类器取得了最高的93.5%测试准确率。

Comparison of Cross-Entropy, Dice, and Focal Loss for Sea Ice Type Segmentation

  • paper_url: http://arxiv.org/abs/2310.17135
  • repo_url: None
  • paper_authors: Rafael Pires de Lima, Behzad Vahedi, Morteza Karimzadeh
  • for: 这篇论文旨在评估用卷积神经网络(CNN)模型生成海冰类型图的做法,以支持冰区航行安全。
  • methods: 论文比较了三种不同的损失函数(交叉熵、Dice 和 Focal),考察它们对海冰类型预测性能的影响。
  • results: 结果显示,尽管 Dice 和 Focal 损失得到的指标更高,但交叉熵损失的结果通常在物理上更为一致。
    Abstract Up-to-date sea ice charts are crucial for safer navigation in ice-infested waters. Recently, Convolutional Neural Network (CNN) models show the potential to accelerate the generation of ice maps for large regions. However, results from CNN models still need to undergo scrutiny as higher metrics performance not always translate to adequate outputs. Sea ice type classes are imbalanced, requiring special treatment during training. We evaluate how three different loss functions, some developed for imbalanced class problems, affect the performance of CNN models trained to predict the dominant ice type in Sentinel-1 images. Despite the fact that Dice and Focal loss produce higher metrics, results from cross-entropy seem generally more physically consistent.
    摘要 最新的海冰图对于在冰区水域的安全航行至关重要。最近,卷积神经网络(CNN)模型展现出加速大范围海冰图生成的潜力。然而,CNN 模型的结果仍需仔细审视,因为更高的指标表现并不总是意味着合适的输出。海冰类型类别分布不均衡,在训练时需要特殊处理。我们评估了三种不同的损失函数(其中一些是针对类别不均衡问题设计的)如何影响 CNN 模型在 Sentinel-1 图像中预测主要海冰类型的性能。尽管 Dice 和 Focal 损失得到的指标更高,但交叉熵损失的结果总体上在物理上更为一致。
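A minimal PyTorch sketch of the three loss functions compared above for sea ice type segmentation; the gamma and epsilon values are common defaults, not the settings used in the paper.

```python
# Sketch: cross-entropy, soft Dice, and focal losses over per-pixel class logits.
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss averaged over classes; target holds integer class indices."""
    num_classes = logits.shape[1]
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()

def focal_loss(logits, target, gamma=2.0):
    """Focal loss: down-weights well-classified pixels via (1 - p_t)^gamma."""
    ce = F.cross_entropy(logits, target, reduction="none")
    p_t = torch.exp(-ce)
    return ((1.0 - p_t) ** gamma * ce).mean()

# toy usage: batch of 2 Sentinel-1-like patches, 4 ice-type classes, 64x64 pixels (assumed sizes)
logits = torch.randn(2, 4, 64, 64)
target = torch.randint(0, 4, (2, 64, 64))
print(F.cross_entropy(logits, target).item(),
      dice_loss(logits, target).item(),
      focal_loss(logits, target).item())
```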

Virtual Accessory Try-On via Keypoint Hallucination

  • paper_url: http://arxiv.org/abs/2310.17131
  • repo_url: None
  • paper_authors: Junhong Gou, Bo Zhang, Li Niu, Jianfu Zhang, Jianlou Si, Chen Qian, Liqing Zhang
  • for: 本文研究虚拟配饰试戴任务,即将配饰(例如眼镜、领带)合成到人脸或肖像图像上。
  • methods: 我们提出了一个面向背景的网络,利用人体与配饰的先验知识。该方法首先学习人体先验,在背景中推断指定前景关键点的目标位置,然后将带有配饰先验的前景信息注入背景 UNet,最后根据推断出的目标位置计算扭曲参数,对前景配饰进行扭曲。
  • results: 我们在 STRAT 数据集上进行了实验,验证了所提方法的有效性。
    Abstract The virtual try-on task refers to fitting the clothes from one image onto another portrait image. In this paper, we focus on virtual accessory try-on, which fits accessory (e.g., glasses, ties) onto a face or portrait image. Unlike clothing try-on, which relies on human silhouette as guidance, accessory try-on warps the accessory into an appropriate location and shape to generate a plausible composite image. In contrast to previous try-on methods that treat foreground (i.e., accessories) and background (i.e., human faces or bodies) equally, we propose a background-oriented network to utilize the prior knowledge of human bodies and accessories. Specifically, our approach learns the human body priors and hallucinates the target locations of specified foreground keypoints in the background. Then our approach will inject foreground information with accessory priors into the background UNet. Based on the hallucinated target locations, the warping parameters are calculated to warp the foreground. Moreover, this background-oriented network can also easily incorporate auxiliary human face/body semantic segmentation supervision to further boost performance. Experiments conducted on STRAT dataset validate the effectiveness of our proposed method.
    摘要 虚拟试穿任务指的是将一张图像中的服装合成到另一张肖像图像上。在这篇论文中,我们专注于虚拟配饰试戴,即将配饰(如眼镜、领带)合成到人脸或肖像图像上。与依赖人体轮廓作为指导的服装试穿不同,配饰试戴需要将配饰扭曲到合适的位置和形状,以生成可信的合成图像。与以往将前景(配饰)和背景(人脸或人体)同等对待的试穿方法不同,我们提出了一个面向背景的网络,以利用人体和配饰的先验知识。具体来说,我们的方法学习人体先验,并在背景中推断指定前景关键点的目标位置;然后将带有配饰先验的前景信息注入背景 UNet,并根据推断出的目标位置计算扭曲参数,对前景进行扭曲。此外,该面向背景的网络还可以方便地加入人脸/人体语义分割的辅助监督,以进一步提升性能。在 STRAT 数据集上进行的实验验证了所提方法的有效性。

Task-driven Prompt Evolution for Foundation Models

  • paper_url: http://arxiv.org/abs/2310.17128
  • repo_url: None
  • paper_authors: Rachana Sathish, Rahul Venkataramani, K S Shriram, Prasad Sudhakar
  • for: 这个研究是为了提高基础模型(Segment Anything Model,SAM)在医疗影像模式下的表现。
  • methods: 该研究提出了一种即插即用的提示优化技术(SAMPOT),利用下游分割任务来优化人工提供的提示,从而提升基础模型的表现。
  • results: 在胸部X光图像的肺部分割任务上,该提示优化技术使约75%的病例相对于人工提供的初始提示获得了改进。
    Abstract Promptable foundation models, particularly Segment Anything Model (SAM), have emerged as a promising alternative to the traditional task-specific supervised learning for image segmentation. However, many evaluation studies have found that their performance on medical imaging modalities to be underwhelming compared to conventional deep learning methods. In the world of large pre-trained language and vision-language models, learning prompt from downstream tasks has achieved considerable success in improving performance. In this work, we propose a plug-and-play Prompt Optimization Technique for foundation models like SAM (SAMPOT) that utilizes the downstream segmentation task to optimize the human-provided prompt to obtain improved performance. We demonstrate the utility of SAMPOT on lung segmentation in chest X-ray images and obtain an improvement on a significant number of cases ($\sim75\%$) over human-provided initial prompts. We hope this work will lead to further investigations in the nascent field of automatic visual prompt-tuning.
    摘要 通用基础模型,如分割任何模型(SAM),已经出现为图像分割任务中的有前途的替代方案。然而,许多评估研究发现,这些模型在医疗影像模式上的表现不如传统的深度学习方法出色。在大型预训练语言和视觉语言模型的世界中,学习下游任务中的提示已经取得了显著的成功,以提高性能。在这项工作中,我们提出了一种插入式优化技术(SAMPOT),使用下游分割任务来优化提供的人类提示,以获得改善的性能。我们在肺部分剖扫图像中应用SAMPOT,并在大量的 случаес中(约75%)获得了人类提供的初始提示的改善。我们希望这项工作会鼓励进一步的自动视觉提示优化研究。
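An illustrative sketch of task-driven prompt optimization for a promptable segmenter like SAM: perturb a human-provided point prompt and keep the candidate that maximizes a downstream task score. The random-search strategy, the `segment` and `score` callables, and all numbers are placeholders; SAMPOT's actual optimization procedure may differ.

```python
# Sketch: hill-climb a 2D point prompt against a downstream, task-driven score.
import random

def optimize_point_prompt(segment, score, init_prompt, n_iters=50, step=10, seed=0):
    """segment(prompt) -> mask; score(mask) -> float (e.g., agreement with a task-driven criterion)."""
    rng = random.Random(seed)
    best_prompt, best_score = init_prompt, score(segment(init_prompt))
    for _ in range(n_iters):
        x, y = best_prompt
        cand = (x + rng.randint(-step, step), y + rng.randint(-step, step))
        s = score(segment(cand))
        if s > best_score:
            best_prompt, best_score = cand, s
    return best_prompt, best_score

# toy usage with stand-in functions (a real setup would call SAM and a lung-segmentation score)
segment = lambda p: p                                        # placeholder "mask"
score = lambda m: -((m[0] - 128) ** 2 + (m[1] - 96) ** 2)    # prefer prompts near (128, 96)
print(optimize_point_prompt(segment, score, init_prompt=(100, 100)))
```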

Deep Learning on SAR Imagery: Transfer Learning Versus Randomly Initialized Weights

  • paper_url: http://arxiv.org/abs/2310.17126
  • repo_url: None
  • paper_authors: Morteza Karimzadeh, Rafael Pires de Lima
  • for: 本研究旨在评估深度学习在合成孔径雷达(SAR)数据上的应用,特别是用于支持海上航行安全的海冰制图。
  • methods: 本研究使用了 randomly initialized weights 和 fine-tuning pre-trained model 两种方法来训练深度学习模型。
  • results: 研究结果显示,使用 pre-trained model 进行 fine-tuning 后,模型在测试样本中的表现更佳,特别是在融雪季节的样本上。
    Abstract Deploying deep learning on Synthetic Aperture Radar (SAR) data is becoming more common for mapping purposes. One such case is sea ice, which is highly dynamic and rapidly changes as a result of the combined effect of wind, temperature, and ocean currents. Therefore, frequent mapping of sea ice is necessary to ensure safe marine navigation. However, there is a general shortage of expert-labeled data to train deep learning algorithms. Fine-tuning a pre-trained model on SAR imagery is a potential solution. In this paper, we compare the performance of deep learning models trained from scratch using randomly initialized weights against pre-trained models that we fine-tune for this purpose. Our results show that pre-trained models lead to better results, especially on test samples from the melt season.
    摘要 在合成孔径雷达(SAR)数据上部署深度学习进行制图正变得越来越普遍。海冰就是其中一个典型场景:在风、温度和洋流的共同作用下,海冰高度动态且变化迅速,因此需要频繁绘制海冰图以保障海上航行安全。然而,用于训练深度学习算法的专家标注数据普遍不足,在 SAR 影像上微调预训练模型是一种潜在的解决方案。本文比较了使用随机初始化权重从头训练的深度学习模型与经过微调的预训练模型的性能。结果显示,预训练模型能带来更好的效果,尤其是在融冰季节的测试样本上。
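A minimal sketch contrasting the two training setups compared above: a ResNet-18 trained from randomly initialized weights versus one initialized from ImageNet pre-trained weights and fine-tuned. The backbone choice, class count, and weight enum are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch: same architecture and task head, differing only in weight initialization.
import torch.nn as nn
import torchvision.models as models

NUM_ICE_CLASSES = 4  # assumed number of sea ice classes

def build_model(pretrained: bool) -> nn.Module:
    weights = models.ResNet18_Weights.IMAGENET1K_V1 if pretrained else None
    model = models.resnet18(weights=weights)
    model.fc = nn.Linear(model.fc.in_features, NUM_ICE_CLASSES)  # replace head for the SAR task
    return model

scratch_model = build_model(pretrained=False)   # randomly initialized weights
finetune_model = build_model(pretrained=True)   # fine-tuned from pre-trained weights
# both models would then be trained with the same loop; only the initialization differs
```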

Enhancing sea ice segmentation in Sentinel-1 images with atrous convolutions

  • paper_url: http://arxiv.org/abs/2310.17122
  • repo_url: None
  • paper_authors: Rafael Pires de Lima, Behzad Vahedi, Nick Hughes, Andrew P. Barrett, Walter Meier, Morteza Karimzadeh
  • for: 本研究旨在利用机器学习算法自动化海冰图的生成,以支持冰区航行。
  • methods: 我们使用 Extreme Earth version 2 高分辨率基准数据集,并开发了一个自定义管道,将 ResNet 与空洞空间金字塔池化(ASPP)相结合,对 SAR 图像进行分割。
  • results: 我们的模型在海冰-开阔水域二分类和多类海冰类型分割两方面均表现良好。在二分类任务中,1月和7月测试场景的加权 F1 分数均大于 0.95,中位加权 F1 分数为 0.98;相比之下,基线 U-Net 在 7 月测试场景的加权平均 F1 分数为 0.92-0.94,在 1 月为 0.97-0.98。
    Abstract Due to the growing volume of remote sensing data and the low latency required for safe marine navigation, machine learning (ML) algorithms are being developed to accelerate sea ice chart generation, currently a manual interpretation task. However, the low signal-to-noise ratio of the freely available Sentinel-1 Synthetic Aperture Radar (SAR) imagery, the ambiguity of backscatter signals for ice types, and the scarcity of open-source high-resolution labelled data makes automating sea ice mapping challenging. We use Extreme Earth version 2, a high-resolution benchmark dataset generated for ML training and evaluation, to investigate the effectiveness of ML for automated sea ice mapping. Our customized pipeline combines ResNets and Atrous Spatial Pyramid Pooling for SAR image segmentation. We investigate the performance of our model for: i) binary classification of sea ice and open water in a segmentation framework; and ii) a multiclass segmentation of five sea ice types. For binary ice-water classification, models trained with our largest training set have weighted F1 scores all greater than 0.95 for January and July test scenes. Specifically, the median weighted F1 score was 0.98, indicating high performance for both months. By comparison, a competitive baseline U-Net has a weighted average F1 score of ranging from 0.92 to 0.94 (median 0.93) for July, and 0.97 to 0.98 (median 0.97) for January. Multiclass ice type classification is more challenging, and even though our models achieve 2% improvement in weighted F1 average compared to the baseline U-Net, test weighted F1 is generally between 0.6 and 0.80. Our approach can efficiently segment full SAR scenes in one run, is faster than the baseline U-Net, retains spatial resolution and dimension, and is more robust against noise compared to approaches that rely on patch classification.
    摘要 由于遥感数据量不断增长,而海上安全航行又要求低延迟,人们正在开发机器学习(ML)算法来加速目前仍需人工解译的海冰图生成。然而,免费提供的 Sentinel-1 合成孔径雷达(SAR)影像信噪比较低、不同冰型的后向散射信号存在歧义,且缺乏开源的高分辨率标注数据,使海冰制图的自动化颇具挑战。我们使用为 ML 训练和评估而构建的高分辨率基准数据集 Extreme Earth version 2,研究 ML 在自动化海冰制图中的有效性。我们的自定义管道将 ResNet 与空洞空间金字塔池化(ASPP)结合用于 SAR 图像分割,并考察了模型在以下两方面的性能:i)海冰与开阔水域的二分类分割;ii)五种海冰类型的多类分割。在二分类任务中,使用最大训练集训练的模型在 1 月和 7 月测试场景的加权 F1 分数均大于 0.95,中位加权 F1 分数为 0.98;相比之下,具有竞争力的基线 U-Net 在 7 月的加权平均 F1 分数为 0.92-0.94(中位 0.93),在 1 月为 0.97-0.98(中位 0.97)。多类冰型分类更具挑战性:尽管我们的模型在加权 F1 平均值上比基线 U-Net 提高了 2%,测试加权 F1 通常介于 0.6 到 0.80 之间。我们的方法可以一次性分割完整的 SAR 场景,速度快于基线 U-Net,保留空间分辨率和尺寸,并且相比依赖图块分类的方法对噪声更加鲁棒。
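A minimal sketch of an Atrous Spatial Pyramid Pooling (ASPP) block of the kind combined with ResNets in the pipeline above; the dilation rates and channel widths are common choices, not necessarily those used by the authors.

```python
# Sketch: parallel atrous (dilated) convolutions capture multi-scale context, then a
# 1x1 convolution projects the concatenated branches back to a fixed channel width.
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3 if r > 1 else 1,
                      padding=r if r > 1 else 0, dilation=r, bias=False)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        feats = [b(x) for b in self.branches]          # same spatial size for every branch
        return self.project(torch.cat(feats, dim=1))

# toy usage on a ResNet-like feature map
x = torch.randn(1, 512, 32, 32)
print(ASPP(512, 256)(x).shape)  # torch.Size([1, 256, 32, 32])
```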

LP-OVOD: Open-Vocabulary Object Detection by Linear Probing

  • paper_url: http://arxiv.org/abs/2310.17109
  • repo_url: None
  • paper_authors: Chau Pham, Truong Vu, Khoi Nguyen
  • for: This paper addresses the challenging problem of open-vocabulary object detection (OVOD), where an object detector must identify both seen and unseen classes in test images without labeled examples of the unseen classes in training.
  • methods: The proposed method, LP-OVOD, uses a novel approach that discards low-quality boxes by training a sigmoid linear classifier on pseudo labels retrieved from the top relevant region proposals to the novel text.
  • results: Experimental results on COCO affirm the superior performance of the LP-OVOD approach over the state of the art, achieving $\textbf{40.5}$ in $\text{AP}_{novel}$ using ResNet50 as the backbone and without external datasets or knowing novel classes during training.
    Abstract This paper addresses the challenging problem of open-vocabulary object detection (OVOD) where an object detector must identify both seen and unseen classes in test images without labeled examples of the unseen classes in training. A typical approach for OVOD is to use joint text-image embeddings of CLIP to assign box proposals to their closest text label. However, this method has a critical issue: many low-quality boxes, such as over- and under-covered-object boxes, have the same similarity score as high-quality boxes since CLIP is not trained on exact object location information. To address this issue, we propose a novel method, LP-OVOD, that discards low-quality boxes by training a sigmoid linear classifier on pseudo labels retrieved from the top relevant region proposals to the novel text. Experimental results on COCO affirm the superior performance of our approach over the state of the art, achieving $\textbf{40.5}$ in $\text{AP}_{novel}$ using ResNet50 as the backbone and without external datasets or knowing novel classes during training. Our code will be available at https://github.com/VinAIResearch/LP-OVOD.
    摘要 这个论文解决了开放词汇物体检测(OVOD)的挑战,即在测试图像中无需在训练中提供未知类的标注时,一个物体检测器可以识别已知和未知类。一种常见的OVOD方法是使用CLIP的共同文本图像嵌入来将盒子提案分配给其最近的文本标签。然而,这种方法存在一个重要问题:许多低质量盒子,如过度和下遮掩物体盒子,与高质量盒子具有同样的相似性分数,因为CLIP没有接受具体物体位置信息的训练。为解决这个问题,我们提出了一种新的方法,LP-OVOD,它抛弃低质量盒子通过在顶部相关地区提取的pseudo标签来训练sigmoid线性分类器。实验结果表明我们的方法在COCO上超过了现有的状态态势,达到了$\textbf{40.5}$的$\text{AP}_{novel}$值,使用ResNet50作为背景网络,不需要外部数据集或在训练过程中知道未知类。我们的代码将在https://github.com/VinAIResearch/LP-OVOD上发布。
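A minimal sketch of the linear-probing idea described above: fit a sigmoid linear classifier on region-proposal features pseudo-labeled by their similarity to a novel-class text embedding. Feature dimensions, the top-k retrieval step, and the training hyperparameters are illustrative assumptions, not LP-OVOD's exact procedure.

```python
# Sketch: pseudo-label the proposals most similar to the novel text, then fit a
# sigmoid linear probe whose scores can be used to discard low-quality boxes.
import torch
import torch.nn as nn

def pseudo_label_top_k(region_feats, text_embed, k=32):
    """Mark the k proposals most similar to the novel-class text embedding as positives."""
    sims = nn.functional.cosine_similarity(region_feats, text_embed.unsqueeze(0), dim=1)
    labels = torch.zeros(region_feats.size(0))
    labels[sims.topk(k).indices] = 1.0
    return labels

def fit_linear_probe(region_feats, labels, epochs=100, lr=1e-2):
    probe = nn.Linear(region_feats.size(1), 1)
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.binary_cross_entropy_with_logits(
            probe(region_feats).squeeze(1), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe

# toy usage: 500 proposals with 512-d (e.g., CLIP-aligned) features, one novel class
feats = torch.randn(500, 512)
text_embed = torch.randn(512)
probe = fit_linear_probe(feats, pseudo_label_top_k(feats, text_embed))
scores = torch.sigmoid(probe(feats)).squeeze(1)   # higher score -> keep the box
```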

  • paper_url: http://arxiv.org/abs/2310.17097
  • repo_url: None
  • paper_authors: Taehyeon Kim, Eric Lin, Junu Lee, Christian Lau, Vaikkunth Mugunthan
  • for: 本研究提出了一种半监督联邦目标检测(SSFOD)方法,适用于标注数据仅存在于服务器端、而客户端只有无标注数据的场景。
  • methods: 我们提出了一种两阶段策略 FedSTO,包括选择性训练和正交增强的全参数训练,以有效应对服务器与客户端之间的数据偏移(例如天气条件)。其贡献包括:选择性地精炼检测器的骨干网络以避免过拟合;使用正交正则化来增强表示的多样性;以及基于本地 EMA 的伪标签分配以生成高质量伪标签。
  • results: 我们在主流自动驾驶数据集(BDD100K、Cityscapes 和 SODA10M)上进行了广泛验证,证明了方法的有效性。特别是,仅使用 20-30% 的标签,FedSTO 就能达到与全监督集中式训练方法几乎相当的性能。
    Abstract Federated Learning (FL) has emerged as a potent framework for training models across distributed data sources while maintaining data privacy. Nevertheless, it faces challenges with limited high-quality labels and non-IID client data, particularly in applications like autonomous driving. To address these hurdles, we navigate the uncharted waters of Semi-Supervised Federated Object Detection (SSFOD). We present a pioneering SSFOD framework, designed for scenarios where labeled data reside only at the server while clients possess unlabeled data. Notably, our method represents the inaugural implementation of SSFOD for clients with 0% labeled non-IID data, a stark contrast to previous studies that maintain some subset of labels at each client. We propose FedSTO, a two-stage strategy encompassing Selective Training followed by Orthogonally enhanced full-parameter training, to effectively address data shift (e.g. weather conditions) between server and clients. Our contributions include selectively refining the backbone of the detector to avert overfitting, orthogonality regularization to boost representation divergence, and local EMA-driven pseudo label assignment to yield high-quality pseudo labels. Extensive validation on prominent autonomous driving datasets (BDD100K, Cityscapes, and SODA10M) attests to the efficacy of our approach, demonstrating state-of-the-art results. Remarkably, FedSTO, using just 20-30% of labels, performs nearly as well as fully-supervised centralized training methods.
    摘要 联邦学习(FL)已成为在分布式数据源上训练模型并保护数据隐私的有效框架。然而,它面临高质量标签有限和客户端数据非独立同分布(non-IID)等挑战,在自动驾驶等应用中尤为突出。为了解决这些障碍,我们探索了尚未充分研究的半监督联邦目标检测(SSFOD),并提出了一个开创性的 SSFOD 框架,适用于标注数据仅存在于服务器端、客户端只有无标注数据的场景。值得注意的是,我们的方法是首个针对客户端拥有 0% 标注的 non-IID 数据的 SSFOD 实现,这与以往在每个客户端保留部分标签的研究形成鲜明对比。我们提出了 FedSTO,一种两阶段策略,包括选择性训练和正交增强的全参数训练,以有效应对服务器与客户端之间的数据偏移(例如天气条件)。我们的贡献包括:选择性地精炼检测器的骨干网络以避免过拟合、使用正交正则化来增强表示的多样性,以及基于本地 EMA 的伪标签分配以生成高质量伪标签。在主流自动驾驶数据集(BDD100K、Cityscapes 和 SODA10M)上的广泛验证证明了我们方法的有效性,并取得了当前最优的结果。值得注意的是,FedSTO 仅使用 20-30% 的标签,就能达到与全监督集中式训练方法几乎相当的性能。

Automating lichen monitoring in ecological studies using instance segmentation of time-lapse images

  • paper_url: http://arxiv.org/abs/2310.17080
  • repo_url: None
  • paper_authors: Safwen Naimi, Olfa Koubaa, Wassim Bouachir, Guillaume-Alexandre Bilodeau, Gregory Jeddore, Patricia Baines, David Correia, Andre Arsenault
  • for: assist ecologists in monitoring and analyzing epiphytic lichens
  • methods: use time-lapse cameras and semantic segmentation with an effective training approach to automate monitoring and biomass estimation of epiphytic lichens
  • results: significantly improve the accuracy and efficiency of lichen population monitoring, making it a valuable tool for forest ecologists and environmental scientists to evaluate the impact of climate change on Canada’s forests
    Abstract Lichens are symbiotic organisms composed of fungi, algae, and/or cyanobacteria that thrive in a variety of environments. They play important roles in carbon and nitrogen cycling, and contribute directly and indirectly to biodiversity. Ecologists typically monitor lichens by using them as indicators to assess air quality and habitat conditions. In particular, epiphytic lichens, which live on trees, are key markers of air quality and environmental health. A new method of monitoring epiphytic lichens involves using time-lapse cameras to gather images of lichen populations. These cameras are used by ecologists in Newfoundland and Labrador to subsequently analyze and manually segment the images to determine lichen thalli condition and change. These methods are time-consuming and susceptible to observer bias. In this work, we aim to automate the monitoring of lichens over extended periods and to estimate their biomass and condition to facilitate the task of ecologists. To accomplish this, our proposed framework uses semantic segmentation with an effective training approach to automate monitoring and biomass estimation of epiphytic lichens on time-lapse images. We show that our method has the potential to significantly improve the accuracy and efficiency of lichen population monitoring, making it a valuable tool for forest ecologists and environmental scientists to evaluate the impact of climate change on Canada's forests. To the best of our knowledge, this is the first time that such an approach has been used to assist ecologists in monitoring and analyzing epiphytic lichens.
    摘要 地衣是由真菌与藻类和/或蓝细菌组成的共生生物,能在多种环境中生长。它们在碳、氮循环中发挥重要作用,并直接或间接地促进生物多样性。生态学家通常将地衣作为指示生物来监测空气质量和栖息地状况;其中生长在树木上的附生地衣是空气质量和环境健康的关键标志。一种监测附生地衣的新方法是使用延时摄像机采集地衣种群的图像。纽芬兰与拉布拉多的生态学家利用这些相机获取图像,随后人工分析并分割图像,以确定地衣叶状体的状态及其变化。这些方法耗时且容易受观察者偏差影响。在这项工作中,我们的目标是自动化对地衣的长期监测,并估计其生物量和状态,以减轻生态学家的工作负担。为此,我们提出的框架在延时图像上使用语义分割并配合有效的训练方法,来自动监测附生地衣并估计其生物量。结果表明,我们的方法有潜力显著提高地衣种群监测的准确性和效率,成为森林生态学家和环境科学家评估气候变化对加拿大森林影响的有价值工具。据我们所知,这是首次使用此类方法协助生态学家监测和分析附生地衣。

HCT: Hybrid Convnet-Transformer for Parkinson’s disease detection and severity prediction from gait

  • paper_url: http://arxiv.org/abs/2310.17078
  • repo_url: https://github.com/safwennaimi/hct-hybrid-convnet-transformer-for-parkinson-s-disease-detection-and-severity-prediction-from-gait
  • paper_authors: Safwen Naimi, Wassim Bouachir, Guillaume-Alexandre Bilodeau
  • for: 本研究提出了一种基于新型混合卷积神经网络-Transformer架构的深度学习方法,用于从步态数据中检测帕金森病(PD)并预测其严重程度分期。
  • methods: 我们采用两步方法,将问题分解为两个子问题:混合架构首先区分健康人与帕金森病患者;若判定为帕金森病患者,则由多类混合 ConvNet-Transformer 模型给出 Hoehn 和 Yahr(H&Y)评分,以确定病情的严重程度分期。该混合架构结合了 ConvNet 与 Transformer 两者的优势,以准确检测 PD 并确定其严重程度分期。
  • results: 与其他最新方法相比,我们的混合方法表现出色,PD 检测准确率达 97%,严重程度分期准确率达 87%。
    Abstract In this paper, we propose a novel deep learning method based on a new Hybrid ConvNet-Transformer architecture to detect and stage Parkinson's disease (PD) from gait data. We adopt a two-step approach by dividing the problem into two sub-problems. Our Hybrid ConvNet-Transformer model first distinguishes healthy versus parkinsonian patients. If the patient is parkinsonian, a multi-class Hybrid ConvNet-Transformer model determines the Hoehn and Yahr (H&Y) score to assess the PD severity stage. Our hybrid architecture exploits the strengths of both Convolutional Neural Networks (ConvNets) and Transformers to accurately detect PD and determine the severity stage. In particular, we take advantage of ConvNets to capture local patterns and correlations in the data, while we exploit Transformers for handling long-term dependencies in the input signal. We show that our hybrid method achieves superior performance when compared to other state-of-the-art methods, with a PD detection accuracy of 97% and a severity staging accuracy of 87%. Our source code is available at: https://github.com/SafwenNaimi
    摘要 在这篇论文中,我们提出了一种基于新型混合 ConvNet-Transformer 架构的深度学习方法,用于从步态数据中检测帕金森病(PD)并进行分期。我们采用两步方法,将问题分解为两个子问题:混合模型首先区分健康人与帕金森病患者;如果患者为帕金森病患者,则由多类混合 ConvNet-Transformer 模型确定 Hoehn 和 Yahr(H&Y)评分,以评估 PD 的严重程度分期。我们的混合架构同时利用卷积神经网络(ConvNet)和 Transformer 的优势来准确检测 PD 并确定其严重程度:ConvNet 用于捕捉数据中的局部模式和相关性,而 Transformer 用于处理输入信号中的长期依赖关系。结果表明,我们的混合方法优于其他最新方法,PD 检测准确率达 97%,严重程度分期准确率达 87%。源代码见:https://github.com/SafwenNaimi
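A minimal sketch of the two-step decision flow described above: a binary model first separates healthy from parkinsonian gait, and only parkinsonian cases are passed to a multi-class model that predicts a severity stage. The placeholder linear models, the assumed input window shape, and the number of stages stand in for the hybrid ConvNet-Transformer architectures.

```python
# Sketch: cascade a binary healthy-vs-PD model with a multi-class severity model.
import torch
import torch.nn as nn

binary_model = nn.Sequential(nn.Flatten(), nn.Linear(100 * 19, 2))   # healthy vs PD (placeholder)
stage_model = nn.Sequential(nn.Flatten(), nn.Linear(100 * 19, 4))    # e.g., 4 H&Y stages (placeholder)

def predict(gait_window: torch.Tensor):
    """gait_window: (1, 100, 19) window of gait sensor signals (assumed shape)."""
    is_pd = binary_model(gait_window).argmax(dim=1).item() == 1
    if not is_pd:
        return {"parkinsonian": False, "stage": None}
    stage = stage_model(gait_window).argmax(dim=1).item()
    return {"parkinsonian": True, "stage": stage}

print(predict(torch.randn(1, 100, 19)))
```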

HyperFields: Towards Zero-Shot Generation of NeRFs from Text

  • paper_url: http://arxiv.org/abs/2310.17075
  • repo_url: None
  • paper_authors: Sudarshan Babu, Richard Liu, Avery Zhou, Michael Maire, Greg Shakhnarovich, Rana Hanocka
  • for: 这篇论文提出了一种基于文本生成 NeRF 的方法,可以在单次前向传播中(并可选地经过少量微调)生成文本条件下的 NeRF,并能适应不同的场景。
  • methods: 该方法使用动态超网络(hypernetwork)学习从文本 token 嵌入空间到 NeRF 空间的平滑映射,并通过 NeRF 蒸馏训练将编码在各个 NeRF 中的场景蒸馏到一个动态超网络中。
  • results: 该方法学习到了文本与 NeRF 之间更为通用的映射,能够在零样本或少量微调的情况下预测分布内和分布外的新场景;由于学习到的通用映射,微调 HyperFields 收敛更快,生成新场景的速度比现有基于神经优化的方法快 5 到 10 倍。
    Abstract We introduce HyperFields, a method for generating text-conditioned Neural Radiance Fields (NeRFs) with a single forward pass and (optionally) some fine-tuning. Key to our approach are: (i) a dynamic hypernetwork, which learns a smooth mapping from text token embeddings to the space of NeRFs; (ii) NeRF distillation training, which distills scenes encoded in individual NeRFs into one dynamic hypernetwork. These techniques enable a single network to fit over a hundred unique scenes. We further demonstrate that HyperFields learns a more general map between text and NeRFs, and consequently is capable of predicting novel in-distribution and out-of-distribution scenes -- either zero-shot or with a few finetuning steps. Finetuning HyperFields benefits from accelerated convergence thanks to the learned general map, and is capable of synthesizing novel scenes 5 to 10 times faster than existing neural optimization-based methods. Our ablation experiments show that both the dynamic architecture and NeRF distillation are critical to the expressivity of HyperFields.
    摘要 我们介绍HyperFields,一种方法用于生成文本条件的神经辐射场(NeRF),通过单一的前进 pass和(可选)一些精细调整。关键技术包括:(i)动态超网络,该网络学习文本token嵌入空间中的NeRF的缓和映射;(ii)NeRF蒸馏训练,将各个NeRF中的场景编码到一个动态超网络中。这些技术使得单个网络可以适应百余个不同的场景。我们进一步证明,HyperFields学习了文本和NeRF之间的更加通用的映射,因此能够预测静态和动态场景,包括零 shot 和几步精度调整。HyperFields在精度调整过程中受到加速的收敛速度,可以在5到10倍 бы于现有神经优化方法中 synthesize 新场景。我们的抽象实验表明,动态架构和NeRF蒸馏都是HyperFields的表达能力的关键因素。

cs.AI - 2023-10-26

Style-Aware Radiology Report Generation with RadGraph and Few-Shot Prompting

  • paper_url: http://arxiv.org/abs/2310.17811
  • repo_url: None
  • paper_authors: Benjamin Yan, Ruochen Liu, David E. Kuo, Subathra Adithan, Eduardo Pontes Reis, Stephen Kwak, Vasantha Kumar Venugopal, Chloe P. O’Connell, Agustina Saenz, Pranav Rajpurkar, Michael Moor
  • for: 通过自动生成医学影像报告来改进放射科医生的工作流程。
  • methods: 提出了一种两步方法:首先从图像中提取报告内容,然后将其措辞为符合特定放射科医生风格的报告;该方法结合了报告的图结构表示 RadGraph 与大型语言模型(LLM)。
  • results: 在定量评估中取得了良好的性能;临床评估显示,即使只使用少量示例作为上下文,AI 生成的报告也能与特定放射科医生的风格相匹配。
    Abstract Automatically generated reports from medical images promise to improve the workflow of radiologists. Existing methods consider an image-to-report modeling task by directly generating a fully-fledged report from an image. However, this conflates the content of the report (e.g., findings and their attributes) with its style (e.g., format and choice of words), which can lead to clinically inaccurate reports. To address this, we propose a two-step approach for radiology report generation. First, we extract the content from an image; then, we verbalize the extracted content into a report that matches the style of a specific radiologist. For this, we leverage RadGraph -- a graph representation of reports -- together with large language models (LLMs). In our quantitative evaluations, we find that our approach leads to beneficial performance. Our human evaluation with clinical raters highlights that the AI-generated reports are indistinguishably tailored to the style of individual radiologist despite leveraging only a few examples as context.
    摘要 从医学影像自动生成报告有望改善放射科医生的工作流程。现有方法将其视为图像到报告的建模任务,直接从图像生成完整的报告;但这会将报告的内容(如发现及其属性)与风格(如格式和用词)混为一谈,可能导致临床上不准确的报告。为解决这个问题,我们提出了一种两步的放射报告生成方法:首先从图像中提取内容,然后将提取的内容措辞为符合特定放射科医生风格的报告。为此,我们利用报告的图结构表示 RadGraph 以及大型语言模型(LLM)。定量评估表明我们的方法性能良好;由临床评估者进行的人工评估显示,即使只利用少量示例作为上下文,AI 生成的报告在风格上也与各个放射科医生难以区分。

Clover: Closed-Loop Verifiable Code Generation

  • paper_url: http://arxiv.org/abs/2310.17807
  • repo_url: None
  • paper_authors: Chuyue Sun, Ying Sheng, Oded Padon, Clark Barrett
  • for: 这篇论文旨在提出一种方法来确保代码生成器生成的代码是正确的,以避免不良的结果。
  • methods: 该方法基于一种名为Clover的概念,即闭环可验证代码生成,它将正确性检查降低到更容易进行的一个问题:一致性检查。Clover使用了一种新的形式验证工具和大型自然语言模型的结合来实现一个可验证的代码检查器。
  • results: 在一个手动设计的数据集(CloverBench)上,我们发现:(i)LLMs可以自动生成正式规范; 并且(ii)我们的一致性检查器在正确的实例上可以达到87%的接受率,而且没有任何false positive(Zero tolerance for incorrect instances)。
    Abstract The use of large language models for code generation is a rapidly growing trend in software development. However, without effective methods for ensuring the correctness of generated code, this trend could lead to any number of undesirable outcomes. In this paper, we lay out a vision for addressing this challenge: the Clover paradigm, short for Closed-Loop Verifiable Code Generation, which reduces correctness checking to the more accessible problem of consistency checking. At the core of Clover lies a checker that performs consistency checks among code, docstrings, and formal annotations. The checker is implemented using a novel integration of formal verification tools and large language models. We provide a theoretical analysis to support our thesis that Clover should be effective at consistency checking. We also empirically investigate its feasibility on a hand-designed dataset (CloverBench) featuring annotated Dafny programs at a textbook level of difficulty. Experimental results show that for this dataset, (i) LLMs are reasonably successful at automatically generating formal specifications; and (ii) our consistency checker achieves a promising acceptance rate (up to 87%) for correct instances while maintaining zero tolerance for incorrect ones (no false positives).
    摘要 使用大型语言模型进行代码生成是软件开发中一个快速增长的趋势。然而,如果没有有效的方法来确保生成代码的正确性,这一趋势可能带来各种不良后果。在这篇论文中,我们提出了应对这一挑战的构想:Clover 范式(Closed-Loop Verifiable Code Generation,闭环可验证代码生成),它将正确性检查归约为更易处理的一致性检查问题。Clover 的核心是一个在代码、文档字符串和形式化注解之间执行一致性检查的检查器,该检查器通过将形式化验证工具与大型语言模型相结合来实现。我们给出理论分析以支持 Clover 能够有效进行一致性检查的论点,并在一个手工构建、难度相当于教科书水平的带注解 Dafny 程序数据集(CloverBench)上进行了实证研究。实验结果显示:(i)LLM 能够较为成功地自动生成形式化规约;(ii)我们的一致性检查器对正确实例达到了可观的接受率(最高 87%),同时对错误实例保持零容忍(没有误报)。

Reward Scale Robustness for Proximal Policy Optimization via DreamerV3 Tricks

  • paper_url: http://arxiv.org/abs/2310.17805
  • repo_url: None
  • paper_authors: Ryan Sullivan, Akarsh Kumar, Shengyi Huang, John P. Dickerson, Joseph Suarez
  • for: 本研究旨在将 DreamerV3 中的实现技巧应用于 PPO,并评估它们是否能带来普遍的性能改进。
  • methods: 本研究将 DreamerV3 提出的多种实现技巧移植到高质量的 PPO 参考实现中,并在 Arcade Learning Environment 和 DeepMind Control Suite 上进行了大规模消融实验。
  • results: 结果显示,这些技巧并不能作为通用改进迁移到 PPO,在某些情况下甚至会降低性能。不过,研究也发现了它们奏效的特定情形:例如,在 Atari 游戏中,带这些技巧的 PPO 与使用奖励截断的 PPO 表现相当,并显著优于不使用奖励截断的 PPO。
    Abstract Most reinforcement learning methods rely heavily on dense, well-normalized environment rewards. DreamerV3 recently introduced a model-based method with a number of tricks that mitigate these limitations, achieving state-of-the-art on a wide range of benchmarks with a single set of hyperparameters. This result sparked discussion about the generality of the tricks, since they appear to be applicable to other reinforcement learning algorithms. Our work applies DreamerV3's tricks to PPO and is the first such empirical study outside of the original work. Surprisingly, we find that the tricks presented do not transfer as general improvements to PPO. We use a high quality PPO reference implementation and present extensive ablation studies totaling over 10,000 A100 hours on the Arcade Learning Environment and the DeepMind Control Suite. Though our experiments demonstrate that these tricks do not generally outperform PPO, we identify cases where they succeed and offer insight into the relationship between the implementation tricks. In particular, PPO with these tricks performs comparably to PPO on Atari games with reward clipping and significantly outperforms PPO without reward clipping.
    摘要 大多数强化学习方法严重依赖密集且归一化良好的环境奖励。DreamerV3 最近提出了一种基于模型的方法,并引入了若干缓解这些限制的技巧,仅用一组超参数就在广泛的基准上取得了最新的最佳结果。这一结果引发了关于这些技巧是否具有普适性的讨论,因为它们看起来也适用于其他强化学习算法。我们的工作将 DreamerV3 的技巧应用于 PPO,是原论文之外的首个此类实证研究。出乎意料的是,我们发现这些技巧并不能作为通用改进迁移到 PPO。我们使用高质量的 PPO 参考实现,并在 Arcade Learning Environment 和 DeepMind Control Suite 上进行了总计超过 10,000 个 A100 小时的大量消融实验。尽管实验表明这些技巧总体上并未超越 PPO,但我们识别出了它们奏效的情形,并就这些实现技巧之间的关系提供了见解。特别是,带这些技巧的 PPO 在 Atari 游戏中与使用奖励截断的 PPO 表现相当,并显著优于不使用奖励截断的 PPO。
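A sketch of one reward/value transform popularized by DreamerV3, the symmetric-log ("symlog") squashing, shown here as an example of the kind of reward-scale-robustness trick evaluated on PPO above; the paper studies several such tricks, not only this one.

```python
# Sketch: symlog squashes large-magnitude rewards while staying identity-like near zero;
# symexp maps squashed predictions back to the original reward scale.
import torch

def symlog(x: torch.Tensor) -> torch.Tensor:
    return torch.sign(x) * torch.log1p(torch.abs(x))

def symexp(x: torch.Tensor) -> torch.Tensor:
    return torch.sign(x) * (torch.exp(torch.abs(x)) - 1.0)

rewards = torch.tensor([-1000.0, -1.0, 0.0, 1.0, 1000.0])
print(symlog(rewards))            # squashed targets a value head can regress toward
print(symexp(symlog(rewards)))    # recovers the original rewards
```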

“You Are An Expert Linguistic Annotator”: Limits of LLMs as Analyzers of Abstract Meaning Representation

  • paper_url: http://arxiv.org/abs/2310.17793
  • repo_url: None
  • paper_authors: Allyson Ettinger, Jena D. Hwang, Valentina Pyatkin, Chandra Bhagavatula, Yejin Choi
  • for: 本研究旨在检验LLMs是否可以作为语言专家提供准确的语义分析结果。
  • methods: 研究使用了GPT-3、ChatGPT和GPT-4模型,通过对句子意义结构进行分析,并使用Abstract Meaning Representation(AMR)格式进行表示。
  • results: 研究发现,LLMs可以准确地生成AMR格式的句子意义结构,但是模型输出存在重要和常见的错误,无法生成完全准确的句子意义结构。
    Abstract Large language models (LLMs) show amazing proficiency and fluency in the use of language. Does this mean that they have also acquired insightful linguistic knowledge about the language, to an extent that they can serve as an "expert linguistic annotator"? In this paper, we examine the successes and limitations of the GPT-3, ChatGPT, and GPT-4 models in analysis of sentence meaning structure, focusing on the Abstract Meaning Representation (AMR; Banarescu et al. 2013) parsing formalism, which provides rich graphical representations of sentence meaning structure while abstracting away from surface forms. We compare models' analysis of this semantic structure across two settings: 1) direct production of AMR parses based on zero- and few-shot prompts, and 2) indirect partial reconstruction of AMR via metalinguistic natural language queries (e.g., "Identify the primary event of this sentence, and the predicate corresponding to that event."). Across these settings, we find that models can reliably reproduce the basic format of AMR, and can often capture core event, argument, and modifier structure -- however, model outputs are prone to frequent and major errors, and holistic analysis of parse acceptability shows that even with few-shot demonstrations, models have virtually 0% success in producing fully accurate parses. Eliciting natural language responses produces similar patterns of errors. Overall, our findings indicate that these models out-of-the-box can capture aspects of semantic structure, but there remain key limitations in their ability to support fully accurate semantic analyses or parses.
    摘要 大型语言模型(LLM)在语言使用上展现出惊人的熟练度和流畅性。这是否意味着它们也获得了足以充当“专家语言标注者”的深入语言学知识?在这篇论文中,我们考察了 GPT-3、ChatGPT 和 GPT-4 模型在句子意义结构分析上的成功与局限,重点关注抽象意义表示(AMR;Banarescu 等,2013)这一解析形式化方法,它在抽象掉表层形式的同时,为句子意义结构提供丰富的图结构表示。我们在两种设置下比较模型对这种语义结构的分析:1)基于零样本和少样本提示直接生成 AMR 解析;2)通过元语言性质的自然语言提问(例如“指出这句话的主要事件,以及与该事件对应的谓词”)间接地部分重建 AMR。在这两种设置下,我们发现模型能够可靠地再现 AMR 的基本格式,并常能捕捉核心的事件、论元和修饰结构;然而,模型输出容易出现频繁且严重的错误,对解析可接受性的整体分析显示,即使提供少样本示例,模型生成完全正确解析的成功率也几乎为零。引出自然语言回答也呈现类似的错误模式。总体而言,我们的研究结果表明,这些模型开箱即用时能够捕捉语义结构的某些方面,但在支持完全准确的语义分析或解析方面仍存在关键局限。

Utilizing Language Models for Energy Load Forecasting

  • paper_url: http://arxiv.org/abs/2310.17788
  • repo_url: https://github.com/xuehaouwa/lm-load-forecasting
  • paper_authors: Hao Xue, Flora D. Salim
  • for: 实时能源负载预测可以帮助企业和城市优化资源配置和管理能源消耗。
  • methods: 本文提出了一种使用语言模型进行能源负载预测的新方法:利用提示技术将能源消耗数据转换为描述性句子,以便对语言模型进行微调,并采用自回归生成的方式进行预测。
  • results: 在真实数据集上的大量实验验证了该方法的有效性和准确性,能够对未来多个不同时间范围的能源负载进行预测。
    Abstract Energy load forecasting plays a crucial role in optimizing resource allocation and managing energy consumption in buildings and cities. In this paper, we propose a novel approach that leverages language models for energy load forecasting. We employ prompting techniques to convert energy consumption data into descriptive sentences, enabling fine-tuning of language models. By adopting an autoregressive generating approach, our proposed method enables predictions of various horizons of future energy load consumption. Through extensive experiments on real-world datasets, we demonstrate the effectiveness and accuracy of our proposed method. Our results indicate that utilizing language models for energy load forecasting holds promise for enhancing energy efficiency and facilitating intelligent decision-making in energy systems.
    摘要 能源负荷预测对于建筑和城市中的资源配置优化与能耗管理起到关键作用。在这篇论文中,我们提出了一种利用语言模型进行能源负荷预测的新方法:使用提示技术将能耗数据转化为描述性句子,以便对语言模型进行微调;采用自回归生成方式,所提方法可以预测多个不同时间范围内的未来能源负荷。通过在真实数据集上的大量实验,我们证明了所提方法的有效性和准确性。结果表明,利用语言模型进行能源负荷预测有望提升能源效率,并促进能源系统中的智能决策。
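A minimal sketch of the prompting step described above: turning a window of numeric load readings into a descriptive sentence that a language model can consume, with the forecast left to autoregressive generation. The wording, units, and horizon are illustrative assumptions, not the paper's exact prompt template.

```python
# Sketch: serialize a history of hourly load readings into a natural-language prompt.
def load_to_prompt(readings, timestamps, horizon=4):
    history = ", ".join(f"{t}: {v:.1f} kWh" for t, v in zip(timestamps, readings))
    return (
        f"The building's energy load over the past {len(readings)} hours was {history}. "
        f"Predict the load for the next {horizon} hours as a comma-separated list."
    )

prompt = load_to_prompt([21.3, 19.8, 22.5, 24.1], ["08:00", "09:00", "10:00", "11:00"])
print(prompt)
# The prompt is then fed to a fine-tuned language model, which generates the forecast token by token.
```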

Evaluation of large language models using an Indian language LGBTI+ lexicon

  • paper_url: http://arxiv.org/abs/2310.17787
  • repo_url: None
  • paper_authors: Aditya Joshi, Shruta Rawat, Alpana Dange
  • for: 本研究旨在评估大型语言模型(LLM)在印度语言 LGBTI+ 词汇上的负责任行为。
  • methods: 本研究使用LGBTI+词汇库进行评估,包括四个步骤:形式化NLP任务,创建测试提示,使用LLM生成输出,并手动评估结果。
  • results: 研究发现,所实验的三种LLM均无法检测出潜在的仇恨内容。此外,我们还发现,使用机器翻译来评估非英语语言的自然语言理解存在局限。
    Abstract Large language models (LLMs) are typically evaluated on the basis of task-based benchmarks such as MMLU. Such benchmarks do not examine responsible behaviour of LLMs in specific contexts. This is particularly true in the LGBTI+ context where social stereotypes may result in variation in LGBTI+ terminology. Therefore, domain-specific lexicons or dictionaries may be useful as a representative list of words against which the LLM's behaviour needs to be evaluated. This paper presents a methodology for evaluation of LLMs using an LGBTI+ lexicon in Indian languages. The methodology consists of four steps: formulating NLP tasks relevant to the expected behaviour, creating prompts that test LLMs, using the LLMs to obtain the output and, finally, manually evaluating the results. Our qualitative analysis shows that the three LLMs we experiment on are unable to detect underlying hateful content. Similarly, we observe limitations in using machine translation as means to evaluate natural language understanding in languages other than English. The methodology presented in this paper can be useful for LGBTI+ lexicons in other languages as well as other domain-specific lexicons. The work done in this paper opens avenues for responsible behaviour of LLMs, as demonstrated in the context of prevalent social perception of the LGBTI+ community.

Graph Convolutional Networks for Complex Traffic Scenario Classification

  • paper_url: http://arxiv.org/abs/2310.17773
  • repo_url: None
  • paper_authors: Tobias Hoek, Holger Caesar, Andreas Falkovén, Tommy Johansson
  • for: 这篇论文的目的是缩短为自动驾驶系统(ADS)获得具有统计显著性的安全性证据所需的时间。
  • methods: 这篇论文采用基于场景的测试方法,并使用图卷积网络(Graph Convolutional Networks,GCN)来建模车辆与环境以及与其他交通参与者之间的交互。
  • results: 这篇论文提出了一种能够建模车辆与环境及其他交通参与者交互的方法,并在扩展自 nuScenes 和 Argoverse 2 的驾驶数据集上进行训练;该数据集覆盖不同的驾驶环境并带有逐帧标注。训练后的方法为未来逐帧复杂场景分类研究提供了一个有前景的基线。
    Abstract A scenario-based testing approach can reduce the time required to obtain statistically significant evidence of the safety of Automated Driving Systems (ADS). Identifying these scenarios in an automated manner is a challenging task. Most methods on scenario classification do not work for complex scenarios with diverse environments (highways, urban) and interaction with other traffic agents. This is mirrored in their approaches which model an individual vehicle in relation to its environment, but neglect the interaction between multiple vehicles (e.g. cut-ins, stationary lead vehicle). Furthermore, existing datasets lack diversity and do not have per-frame annotations to accurately learn the start and end time of a scenario. We propose a method for complex traffic scenario classification that is able to model the interaction of a vehicle with the environment, as well as other agents. We use Graph Convolutional Networks to model spatial and temporal aspects of these scenarios. Expanding the nuScenes and Argoverse 2 driving datasets, we introduce a scenario-labeled dataset, which covers different driving environments and is annotated per frame. Training our method on this dataset, we present a promising baseline for future research on per-frame complex scenario classification.
    摘要 基于场景的测试方法可以缩短为自动驾驶系统(ADS)获得具有统计显著性的安全性证据所需的时间,而以自动化方式识别这些场景是一项具有挑战性的任务。大多数场景分类方法无法处理环境多样(高速公路、城市)且涉及与其他交通参与者交互的复杂场景:它们通常只建模单辆车与环境的关系,而忽略多辆车之间的交互(例如插入变道、静止前车)。此外,现有数据集缺乏多样性,也没有逐帧标注,难以准确学习场景的起止时间。我们提出了一种能够建模车辆与环境以及与其他交通参与者交互的复杂交通场景分类方法,并使用图卷积网络来建模这些场景的空间和时间特性。我们在 nuScenes 和 Argoverse 2 驾驶数据集的基础上进行扩展,构建了一个覆盖不同驾驶环境、带有逐帧标注的场景标注数据集。通过在该数据集上训练我们的方法,我们为未来逐帧复杂场景分类研究提供了一个有前景的基线。

GROOViST: A Metric for Grounding Objects in Visual Storytelling

  • paper_url: http://arxiv.org/abs/2310.17770
  • repo_url: https://github.com/akskuchi/groovist
  • paper_authors: Aditya K Surikuchi, Sandro Pezzelle, Raquel Fernández
  • for: 评估视觉故事的视觉落地(grounding)程度,即故事是否确实围绕图像序列中出现的实体展开。
  • methods: 分析现有的评估指标,包括专门为此目标设计的指标和通用的视觉-文本对齐指标;针对它们的不足,提出了新的评估工具 GROOViST,该工具考虑了跨模态依赖关系、时间错位(实体在故事和图像序列中出现的顺序可能不一致)以及人类对视觉落地的直觉。
  • results: GROOViST提供了一种可以评估视觉故事的可ovygrounding度的新评估工具,其中每个组件的贡献可以分别评估和解释。
    Abstract A proper evaluation of stories generated for a sequence of images -- the task commonly referred to as visual storytelling -- must consider multiple aspects, such as coherence, grammatical correctness, and visual grounding. In this work, we focus on evaluating the degree of grounding, that is, the extent to which a story is about the entities shown in the images. We analyze current metrics, both designed for this purpose and for general vision-text alignment. Given their observed shortcomings, we propose a novel evaluation tool, GROOViST, that accounts for cross-modal dependencies, temporal misalignments (the fact that the order in which entities appear in the story and the image sequence may not match), and human intuitions on visual grounding. An additional advantage of GROOViST is its modular design, where the contribution of each component can be assessed and interpreted individually.
    摘要 为评估生成的故事(常称视觉故事),需考虑多个方面,如 coherence、 grammatical correctness 和 visual grounding。在这种工作中,我们专注于评估故事的基础设定,即图像中显示的实体是否被描述。我们分析了现有的指标,包括专门为这个目的设计的指标以及通用视觉-文本对齐的指标。由于这些指标的缺点,我们提出了一种新的评估工具——GROOViST,它考虑了跨模态依赖关系、时间不对齐(图像序列和故事序列中实体出现的顺序不同)以及人类对视觉基础的直觉。GROOViST 的另一个优点是它的模块化设计,允许每个组件的贡献被分析和解释。

Social Contract AI: Aligning AI Assistants with Implicit Group Norms

  • paper_url: http://arxiv.org/abs/2310.17769
  • repo_url: https://github.com/janphilippfranken/scai
  • paper_authors: Jan-Philipp Fränken, Sam Kwok, Peixuan Ye, Kanishk Gandhi, Dilip Arumugam, Jared Moore, Alex Tamkin, Tobias Gerstenberg, Noah D. Goodman
  • for: validating the proposal of aligning an AI assistant by inverting a model of users’ preferences from observed interactions
  • methods: using proof-of-concept simulations in the economic ultimatum game to formalize user preferences as policies that guide the actions of simulated players
  • results: the AI assistant accurately aligns its behavior to match standard policies from the economic literature, but exhibits limited generalization in an out-of-distribution setting and slow learning when there is inconsistency in the relationship between language use and an unknown policy.
    Abstract We explore the idea of aligning an AI assistant by inverting a model of users' (unknown) preferences from observed interactions. To validate our proposal, we run proof-of-concept simulations in the economic ultimatum game, formalizing user preferences as policies that guide the actions of simulated players. We find that the AI assistant accurately aligns its behavior to match standard policies from the economic literature (e.g., selfish, altruistic). However, the assistant's learned policies lack robustness and exhibit limited generalization in an out-of-distribution setting when confronted with a currency (e.g., grams of medicine) that was not included in the assistant's training distribution. Additionally, we find that when there is inconsistency in the relationship between language use and an unknown policy (e.g., an altruistic policy combined with rude language), the assistant's learning of the policy is slowed. Overall, our preliminary results suggest that developing simulation frameworks in which AI assistants need to infer preferences from diverse users can provide a valuable approach for studying practical alignment questions.
    摘要 我团队研究了一种方法,通过反向模型用户(未知)的偏好来调整人工智能助手的行为。为了证明我们的建议,我们在经济最终决策游戏中进行了证明性实验,将用户的偏好形式为助手的动作指导政策。我们发现助手的行为准确地匹配了经济文献中的标准政策(例如自利和慈善)。然而,助手学习的政策缺乏 Robustness 和在对应分布外的扩展性,例如当面临不包括在助手训练分布中的货币(例如药物的重量)时,助手的行为会受到限制。此外,当语言使用与未知政策之间存在不一致(例如慈善政策与侮辱语言)时,助手学习政策的速度会减慢。总之,我们的初步结果表明,通过在用户多样化的情况下使用人工智能助手来学习用户的偏好可以提供一种有价值的方法,用于研究实际的对齐问题。

Salespeople vs SalesBot: Exploring the Role of Educational Value in Conversational Recommender Systems

  • paper_url: http://arxiv.org/abs/2310.17749
  • repo_url: None
  • paper_authors: Lidiya Murakhovs’ka, Philippe Laban, Tian Xie, Caiming Xiong, Chien-Sheng Wu
  • for: 这篇论文的目标是构建一种既能提供产品推荐、又能提供教育价值的对话式推荐系统。
  • methods: 这个论文使用了大语言模型(LLM)来实现混合类型混合动机的对话系统,并通过人类研究对比专业销售人员的表现。
  • results: 研究发现,虽然销售机器人(SalesBot)的流畅性和信息准确性与专业销售人员相当,但是它在提供建议质量方面落后。此外,研究还发现,销售机器人和专业销售人员都面临着确保信念的挑战。
    Abstract Making big purchases requires consumers to research or consult a salesperson to gain domain expertise. However, existing conversational recommender systems (CRS) often overlook users' lack of background knowledge, focusing solely on gathering preferences. In this work, we define a new problem space for conversational agents that aim to provide both product recommendations and educational value through mixed-type mixed-initiative dialog. We introduce SalesOps, a framework that facilitates the simulation and evaluation of such systems by leveraging recent advancements in large language models (LLMs). We build SalesBot and ShopperBot, a pair of LLM-powered agents that can simulate either side of the framework. A comprehensive human study compares SalesBot against professional salespeople, revealing that although SalesBot approaches professional performance in terms of fluency and informativeness, it lags behind in recommendation quality. We emphasize the distinct limitations both face in providing truthful information, highlighting the challenges of ensuring faithfulness in the CRS context. We release our code and make all data available.
    摘要 大购物需要消费者进行研究或咨询销售人员以获得领域专业知识。然而,现有的对话式推荐系统(CRS)通常忽视用户的背景知识不足,专注于收集首选。在这项工作中,我们定义了对话式代理人提供产品推荐和教育价值的新问题空间。我们提出了销售操作(SalesOps)框架,利用最新的大语言模型(LLMs)进行 simulate和评估。我们建立了销售机器人(SalesBot)和购物机器人(ShopperBot),这两个 LLM 动态代理人可以模拟框架两侧。人类研究比较了销售机器人和专业销售人员,发现销售机器人在流畅性和信息完整性方面几乎与专业人员相当,但在推荐质量方面存在明显的不足。我们强调了对话式推荐系统中 Ensure faithfulness 的挑战,并发布了我们的代码和所有数据。

Improving Traffic Density Forecasting in Intelligent Transportation Systems Using Gated Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2310.17729
  • repo_url: None
  • paper_authors: Razib Hayat Khan, Jonayet Miah, S M Yasir Arafat, M M Mahbubul Syeed, Duc M Ca
  • for: 本研究探讨了图神经网络在交通预测中的应用,这是智能交通系统的重要一环;准确的交通预测对行程规划、交通控制和车辆路径规划等功能至关重要。
  • methods: 研究在交通预测的背景下考察了三种常见的图神经网络架构:图卷积网络(GCN)、GraphSAGE(Graph Sample and Aggregation)和门控图神经网络(GGNN),并详细描述了每种架构的层配置、激活函数和超参数。研究的目标是最小化预测误差,三种模型中 GGNN 被认为是最有效的选择。
  • results: 研究结果显示,GCN 的 RMSE 为 9.10、MAE 为 8.00;GraphSAGE 有所改进,RMSE 为 8.3、MAE 为 7.5;门控图神经网络(GGNN)的 RMSE 为 9.15、MAE 为 7.1,在三种模型中表现最佳。
    Abstract This study delves into the application of graph neural networks in the realm of traffic forecasting, a crucial facet of intelligent transportation systems. Accurate traffic predictions are vital for functions like trip planning, traffic control, and vehicle routing in such systems. Three prominent GNN architectures are explored within the context of traffic prediction: Graph Convolutional Networks (GCNs), GraphSAGE (Graph Sample and Aggregation), and Gated Graph Neural Networks (GGNNs). Each architecture's methodology is thoroughly examined, including layer configurations, activation functions, and hyperparameters. The primary goal is to minimize prediction errors, with GGNNs emerging as the most effective choice among the three models. The research outlines outcomes for each architecture, elucidating their predictive performance through root mean squared error (RMSE) and mean absolute error (MAE). Hypothetical results reveal intriguing insights: GCNs display an RMSE of 9.10 and an MAE of 8.00, while GraphSAGE shows improvement with an RMSE of 8.3 and an MAE of 7.5. Gated Graph Neural Networks (GGNNs) exhibit the lowest RMSE at 9.15 and an impressive MAE of 7.1, positioning them as the frontrunner.
    摘要 本研究探讨了图神经网络在交通预测领域的应用,这是智能交通系统的重要组成部分;准确的交通预测对此类系统中的行程规划、交通控制和车辆路径规划等功能至关重要。研究在交通预测的背景下考察了三种常见的 GNN 架构:图卷积网络(GCN)、GraphSAGE(Graph Sample and Aggregation)和门控图神经网络(GGNN),并详细分析了每种架构的层配置、激活函数和超参数,目标是最小化预测误差。研究通过均方根误差(RMSE)和平均绝对误差(MAE)报告了各架构的预测性能:GCN 的 RMSE 为 9.10、MAE 为 8.00;GraphSAGE 有所改进,RMSE 为 8.3、MAE 为 7.5;门控图神经网络(GGNN)的 RMSE 为 9.15、MAE 为 7.1,在三种模型中被认为表现最佳。
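A minimal sketch of a single graph-convolution step of the kind compared above for traffic forecasting: each road-segment node aggregates its neighbours' features through a normalized adjacency matrix. The layer width, features, and normalization are illustrative assumptions; the study's actual layer configurations are described in the paper.

```python
# Sketch: one GCN-style layer with symmetric normalization D^{-1/2}(A + I)D^{-1/2}.
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        a_hat = adj + torch.eye(adj.size(0))                 # add self-loops
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return torch.relu(self.lin(norm @ x))                # aggregate then transform

# toy usage: 5 road segments, 3 input features each (e.g., recent speed, flow, occupancy)
adj = torch.tensor([[0, 1, 0, 0, 1],
                    [1, 0, 1, 0, 0],
                    [0, 1, 0, 1, 0],
                    [0, 0, 1, 0, 1],
                    [1, 0, 0, 1, 0]], dtype=torch.float)
x = torch.randn(5, 3)
print(SimpleGraphConv(3, 8)(x, adj).shape)  # torch.Size([5, 8])
```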

Large Language Models as Generalizable Policies for Embodied Tasks

  • paper_url: http://arxiv.org/abs/2310.17722
  • repo_url: None
  • paper_authors: Andrew Szot, Max Schwarzer, Harsh Agrawal, Bogdan Mazoure, Walter Talbott, Katherine Metcalf, Natalie Mackraz, Devon Hjelm, Alexander Toshev
  • for: 这篇论文的目的是使大型语言模型(LLM)能够被改造为具身视觉任务的可泛化策略。
  • methods: 该方法名为 Large LAnguage model Reinforcement Learning Policy(LLaRP),它将预训练且冻结的 LLM 改造为策略:以文本指令和以自我为中心的视觉观测作为输入,直接在环境中输出动作,并通过强化学习、仅依靠环境交互来训练其观察与行动。
  • results: 实验表明,LLaRP 对任务指令的复杂改写具有鲁棒性,并能泛化到需要全新最优行为的新任务。特别是,在 1,000 个未见过的任务上,其成功率达 42%,是其他常见学习基线或 LLM 零样本应用成功率的 1.7 倍。此外,作者还发布了一个新的基准 Language Rearrangement,包含 150,000 个训练任务和 1,000 个测试任务,用于研究语言条件下的大规模多任务具身 AI 问题。
    Abstract We show that large language models (LLMs) can be adapted to be generalizable policies for embodied visual tasks. Our approach, called Large LAnguage model Reinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take as input text instructions and visual egocentric observations and output actions directly in the environment. Using reinforcement learning, we train LLaRP to see and act solely through environmental interactions. We show that LLaRP is robust to complex paraphrasings of task instructions and can generalize to new tasks that require novel optimal behavior. In particular, on 1,000 unseen tasks it achieves 42% success rate, 1.7x the success rate of other common learned baselines or zero-shot applications of LLMs. Finally, to aid the community in studying language conditioned, massively multi-task, embodied AI problems we release a novel benchmark, Language Rearrangement, consisting of 150,000 training and 1,000 testing tasks for language-conditioned rearrangement. Video examples of LLaRP in unseen Language Rearrangement instructions are at https://llm-rl.github.io.
    摘要 我们证明大型语言模型(LLM)可以被改造为具身视觉任务的可泛化策略。我们的方法名为 Large LAnguage model Reinforcement Learning Policy(LLaRP),它将预训练且冻结的 LLM 改造为策略:以文本指令和以自我为中心的视觉观测作为输入,直接在环境中输出动作。我们通过强化学习训练 LLaRP,使其仅凭环境交互来观察和行动。我们证明 LLaRP 对任务指令的复杂改写具有鲁棒性,并能泛化到需要全新最优行为的新任务。特别是,在 1,000 个未见过的任务上,它取得了 42% 的成功率,是其他常见学习基线或 LLM 零样本应用成功率的 1.7 倍。最后,为了帮助社区研究语言条件下的大规模多任务具身 AI 问题,我们发布了一个新的基准 Language Rearrangement,包含 150,000 个训练任务和 1,000 个测试任务,用于语言条件下的重排任务。LLaRP 在未见过的 Language Rearrangement 指令下的视频示例见 https://llm-rl.github.io。

From Transcripts to Insights: Uncovering Corporate Risks Using Generative AI

  • paper_url: http://arxiv.org/abs/2310.17721
  • repo_url: None
  • paper_authors: Alex Kim, Maximilian Muhn, Valeri Nikolaev
  • for: 该研究旨在探讨生成式AI工具(如ChatGPT)如何帮助投资者揭示公司风险。
  • methods: 研究人员基于财报电话会议记录提供的上下文,利用生成式AI模型生成公司风险摘要与风险评估,并据此构建公司层面的政治、气候和AI相关风险暴露度量。
  • results: 研究发现,基于GPT 3.5 模型的风险度量具有显著的信息含量,在预测公司层面(异常)波动率以及投资、创新等公司决策方面优于现有的风险度量。
    Abstract We explore the value of generative AI tools, such as ChatGPT, in helping investors uncover dimensions of corporate risk. We develop and validate firm-level measures of risk exposure to political, climate, and AI-related risks. Using the GPT 3.5 model to generate risk summaries and assessments from the context provided by earnings call transcripts, we show that GPT-based measures possess significant information content and outperform the existing risk measures in predicting (abnormal) firm-level volatility and firms' choices such as investment and innovation. Importantly, information in risk assessments dominates that in risk summaries, establishing the value of general AI knowledge. We also find that generative AI is effective at detecting emerging risks, such as AI risk, which has soared in recent quarters. Our measures perform well both within and outside the GPT's training window and are priced in equity markets. Taken together, an AI-based approach to risk measurement provides useful insights to users of corporate disclosures at a low cost.

Outlier Dimensions Encode Task-Specific Knowledge

  • paper_url: http://arxiv.org/abs/2310.17715
  • repo_url: https://github.com/wrudman/outlier_dimensions
  • paper_authors: William Rudman, Catherine Chen, Carsten Eickhoff
  • for: 这篇论文的目的是研究大型语言模型(LLM)表示中离群维度(outlier dimensions)的作用。
  • methods: 论文通过对大型语言模型进行微调,考察微调如何影响这些离群维度。
  • results: 论文发现:1)预训练中出现的离群维度在微调后的模型中仍然存在;2)单个离群维度即可以极低的错误率完成下游任务。这些结果表明离群维度可能编码了关键的任务特定知识,且单个离群维度上的表示取值会驱动下游模型的决策。
    Abstract Representations from large language models (LLMs) are known to be dominated by a small subset of dimensions with exceedingly high variance. Previous works have argued that although ablating these outlier dimensions in LLM representations hurts downstream performance, outlier dimensions are detrimental to the representational quality of embeddings. In this study, we investigate how fine-tuning impacts outlier dimensions and show that 1) outlier dimensions that occur in pre-training persist in fine-tuned models and 2) a single outlier dimension can complete downstream tasks with a minimal error rate. Our results suggest that outlier dimensions can encode crucial task-specific knowledge and that the value of a representation in a single outlier dimension drives downstream model decisions.
    摘要 大型语言模型(LLM)的表示被证明为具有极高方差的一小部分维度所控制。先前的研究表明,尽管剖除这些异常维度在LLM表示中减少下游性能,但异常维度对表征质量仍然有负面影响。在这种研究中,我们研究了如何微调影响异常维度,并发现以下两点:1)在预训练中出现的异常维度在微调后仍然存在于模型中,2)一个异常维度可以通过最小错误率完成下游任务。我们的结果表明,异常维度可能含有关键任务知识,并且表示中的一个异常维度的值会驱动下游模型决策。
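A minimal sketch of how outlier dimensions of the kind studied above are commonly identified in LLM hidden states: dimensions whose variance across many token representations exceeds the rest by a large factor. The threshold rule used here is an assumption, not the paper's exact criterion.

```python
# Sketch: flag hidden-state dimensions with exceedingly high variance.
import torch

def find_outlier_dimensions(hidden_states: torch.Tensor, factor: float = 5.0):
    """hidden_states: (num_tokens, hidden_dim) activations collected from a model."""
    variances = hidden_states.var(dim=0)
    threshold = factor * variances.mean()
    return torch.nonzero(variances > threshold).flatten()

# toy usage: simulate activations where dimension 7 has exceedingly high variance
acts = torch.randn(10_000, 768)
acts[:, 7] *= 20.0
print(find_outlier_dimensions(acts))  # tensor([7])
```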

A Wireless AI-Generated Content (AIGC) Provisioning Framework Empowered by Semantic Communication

  • paper_url: http://arxiv.org/abs/2310.17705
  • repo_url: None
  • paper_authors: Runze Cheng, Yao Sun, Dusit Niyato, Lan Zhang, Lei Zhang, Muhammad Ali Imran
  • for: 提供高质量人工智能生成内容(AIGC)服务,使其可以通过无线通信网络实现泛在(ubiquitous)访问。
  • methods: 使用 semantic communication(SemCom)技术,只提取并传输内容的 semantic 信息,而不是所有的二进制位。同时,利用 diffusion-based 模型进行高效的内容生成和计算负荷的灵活调整。
  • results: 仿真结果表明,提出的 SemAIGC 框架在延迟和内容质量方面优于传统方法。
    Abstract Generative AI applications are recently catering to a vast user base by creating diverse and high-quality AI-generated content (AIGC). With the proliferation of mobile devices and rapid growth of mobile traffic, providing ubiquitous access to high-quality AIGC services via wireless communication networks is becoming the future direction for AIGC products. However, it is challenging to provide optimal AIGC services in wireless networks with unstable channels, limited bandwidth resources, and unevenly distributed computational resources. To tackle these challenges, we propose a semantic communication (SemCom)-empowered AIGC (SemAIGC) generation and transmission framework, where only semantic information of the content rather than all the binary bits should be extracted and transmitted by using SemCom. Specifically, SemAIGC integrates diffusion-based models within the semantic encoder and decoder for efficient content generation and flexible adjustment of the computing workload of both transmitter and receiver. Meanwhile, we devise a resource-aware workload trade-off (ROOT) scheme into the SemAIGC framework to intelligently decide transmitter/receiver workload, thus adjusting the utilization of computational resource according to service requirements. Simulations verify the superiority of our proposed SemAIGC framework in terms of latency and content quality compared to conventional approaches.
    摘要 现代生成AI应用程序正在为广泛的用户群体提供多样化和高质量的AI生成内容(AIGC)服务。随着移动设备的普及和移动流量的快速增长,将高质量AIGC服务通过无线通信网络提供给用户,实现泛在访问,已成为未来的发展方向。然而,在通道不稳定、带宽资源有限和计算资源分布不均的情况下,提供优化的AIGC服务是一项挑战。为解决这些挑战,我们提议一种基于semantic communication(SemCom)的AI生成内容(AIGC)生成和传输框架,其中只需提取并传输内容的semantic信息而不是所有的二进制位数据。具体来说,SemAIGC框架在semantic编码器和解码器中集成扩散模型,以实现高效的内容生成和收发两端计算负荷的灵活调整。同时,我们在SemAIGC框架中实现了根据服务需求调整计算资源利用的资源感知工作负荷权衡(ROOT)策略。实验证明我们提议的SemAIGC框架在延迟和内容质量方面与传统方法相比具有显著的优势。

Defending Against Transfer Attacks From Public Models

  • paper_url: http://arxiv.org/abs/2310.17645
  • repo_url: https://github.com/wagner-group/pubdef
  • paper_authors: Chawin Sitawarin, Jaewon Chang, David Huang, Wesson Altoyan, David Wagner
  • for: 这篇论文主要是为了提出一种实际的攻击模型,以及基于游戏理论的防御方法,以应对未来安全敏感应用中的攻击。
  • methods: 这篇论文使用了将攻击者通过公共可用的副本模型进行转移攻击,并提出了一种基于游戏理论的特殊防御方法。
  • results: 对于3个数据集(CIFAR-10、CIFAR-100和ImageNet)和24个公共模型以及11种攻击算法,我们的防御方法PubDef在攻击下表现出了明显的优势,与白盒 adversarial training相比,几乎没有失去正常准确率。例如,在ImageNet上,我们的防御方法在最强的转移攻击下达到了62%的准确率,而最佳的白盒 adversarial training只达到了36%。而在不受攻击的情况下,我们的防御方法的准确率只比无防御模型低2%(78%vs80%)。
    Abstract Adversarial attacks have been a looming and unaddressed threat in the industry. However, through a decade-long history of the robustness evaluation literature, we have learned that mounting a strong or optimal attack is challenging. It requires both machine learning and domain expertise. In other words, the white-box threat model, religiously assumed by a large majority of the past literature, is unrealistic. In this paper, we propose a new practical threat model where the adversary relies on transfer attacks through publicly available surrogate models. We argue that this setting will become the most prevalent for security-sensitive applications in the future. We evaluate the transfer attacks in this setting and propose a specialized defense method based on a game-theoretic perspective. The defenses are evaluated under 24 public models and 11 attack algorithms across three datasets (CIFAR-10, CIFAR-100, and ImageNet). Under this threat model, our defense, PubDef, outperforms the state-of-the-art white-box adversarial training by a large margin with almost no loss in the normal accuracy. For instance, on ImageNet, our defense achieves 62% accuracy under the strongest transfer attack vs only 36% of the best adversarially trained model. Its accuracy when not under attack is only 2% lower than that of an undefended model (78% vs 80%). We release our code at https://github.com/wagner-group/pubdef.
    摘要 针对抗击攻击已成为industry中的潜在威胁,然而过去一代robustness评估文献中的经验教我们,在攻击者有Machine Learning和领域专业知识的情况下,构建强大或最佳的攻击是困难的。因此,过去大多数文献中所假设的白盒威胁模型是不切实际的。在这篇论文中,我们提出了一种新的实用威胁模型,在这个模型中,攻击者通过公共可用的副本模型进行转移攻击。我们认为这将在安全敏感应用中成为未来的主要威胁模型。我们对这个设定中的转移攻击进行评估,并提出了基于游戏理论的防御方法。我们对这些防御策略进行了24个公共模型和11种攻击算法的评估,并在CIFAR-10、CIFAR-100和ImageNet三个数据集上进行了评估。根据这个威胁模型,我们的防御策略PubDef在对抗攻击下的性能大幅超越了现有的白盒针对攻击训练方法,而且与正常准确率几乎没有差异。例如,在ImageNet上,我们的防御策略在最强的转移攻击下达到62%的准确率,而最佳针对攻击训练方法只达到36%。其正常准确率与没有防御的情况下的准确率几乎没有差异(78% vs 80%)。我们将代码发布在GitHub上,可以通过https://github.com/wagner-group/pubdef访问。

In-Context Learning Dynamics with Random Binary Sequences

  • paper_url: http://arxiv.org/abs/2310.17639
  • repo_url: None
  • paper_authors: Eric J. Bigelow, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Tomer D. Ullman
  • for: The paper aims to improve our understanding of the complex, emergent capabilities of large language models (LLMs) and their in-context learning dynamics.
  • methods: The authors propose a Cognitive Interpretability framework that involves using random binary sequences as context to study the dynamics of in-context learning in LLMs. They manipulate properties of the context data, such as sequence length, to observe the behavior of the models.
  • results: The authors find that the latest GPT-3.5+ models exhibit emergent abilities to generate pseudo-random numbers and learn basic formal languages, with striking in-context learning dynamics that transition sharply from pseudo-random behaviors to deterministic repetition.
  • for: 该文章目的是更好地理解大语言模型(LLMs)的复杂、萌发性能力以及其在上下文学习动态。
  • methods: 作者们提出了一种认知可读性框架,使用随机二进制序列作为上下文来研究LLMs中的上下文学习动态。他们在context数据的属性上进行了修改,以观察模型的行为。
  • results: 作者们发现最新的GPT-3.5+模型在上下文学习过程中展现出了萌发性能力,可以生成 Pseudo-Random 数字和学习基本的正式语言,并且上下文学习动态具有强烈的转变性,从 Pseudo-Random 行为转sharply 到决定性重复。
    Abstract Large language models (LLMs) trained on huge corpora of text datasets demonstrate complex, emergent capabilities, achieving state-of-the-art performance on tasks they were not explicitly trained for. The precise nature of LLM capabilities is often mysterious, and different prompts can elicit different capabilities through in-context learning. We propose a Cognitive Interpretability framework that enables us to analyze in-context learning dynamics to understand latent concepts in LLMs underlying behavioral patterns. This provides a more nuanced understanding than success-or-failure evaluation benchmarks, but does not require observing internal activations as a mechanistic interpretation of circuits would. Inspired by the cognitive science of human randomness perception, we use random binary sequences as context and study dynamics of in-context learning by manipulating properties of context data, such as sequence length. In the latest GPT-3.5+ models, we find emergent abilities to generate pseudo-random numbers and learn basic formal languages, with striking in-context learning dynamics where model outputs transition sharply from pseudo-random behaviors to deterministic repetition.

Grow Your Limits: Continuous Improvement with Real-World RL for Robotic Locomotion

  • paper_url: http://arxiv.org/abs/2310.17634
  • repo_url: None
  • paper_authors: Laura Smith, Yunhao Cao, Sergey Levine
  • for: 实现机器人自动获得复杂行为,如四肢行走。
  • methods: 使用policy regularization框架,调整机器人在训练过程中的探索。
  • results: 实现了在真实世界中几分钟内完全学习四肢行走,并继续训练后能够更好地适应不同的情况和动力学变化。
    Abstract Deep reinforcement learning (RL) can enable robots to autonomously acquire complex behaviors, such as legged locomotion. However, RL in the real world is complicated by constraints on efficiency, safety, and overall training stability, which limits its practical applicability. We present APRL, a policy regularization framework that modulates the robot's exploration over the course of training, striking a balance between flexible improvement potential and focused, efficient exploration. APRL enables a quadrupedal robot to efficiently learn to walk entirely in the real world within minutes and continue to improve with more training where prior work saturates in performance. We demonstrate that continued training with APRL results in a policy that is substantially more capable of navigating challenging situations and is able to adapt to changes in dynamics with continued training.
    摘要 深度强化学习(RL)可以让机器人自动获得复杂的行为,如四肢行走。但在实际世界中,RL受到效率、安全性和总训练稳定性的限制,这限制了其实际应用性。我们提出了APRL,一种策略Regularization框架,可以在训练过程中调整机器人的探索行为,实现在训练过程中均衡 flexible improvement potential和专注、效率的探索。APRL使得一只四肢机器人在实际世界中快速地学习行走,并继续增强,其性能比之前的工作更高。我们示出,继续训练APRL后,机器人可以更好地处理复杂的情况,并能够适应动力学变化。

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

  • paper_url: http://arxiv.org/abs/2310.17631
  • repo_url: https://github.com/baaivision/judgelm
  • paper_authors: Lianghui Zhu, Xinggang Wang, Xinlong Wang
  • for: The paper aims to evaluate large language models (LLMs) in open-ended scenarios by fine-tuning them as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively.
  • methods: The paper proposes a comprehensive dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. The authors train JudgeLM at different scales and analyze its capabilities and behaviors.
  • results: JudgeLM obtains the state-of-the-art judge performance on both the existing PandaLM benchmark and the proposed new benchmark. The JudgeLM-7B only needs 3 minutes to judge 5K samples with 8 A100 GPUs, and achieves high agreement with the teacher judge (agreement exceeding 90%, even surpassing human-to-human agreement). JudgeLM also demonstrates extended capabilities in being judges of the single answer, multimodal models, multiple answers, and multi-turn chat.
    Abstract Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics can not measure them comprehensively. To address this problem, we propose to fine-tune LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales from 7B, 13B, to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases in fine-tuning LLM as a judge and consider them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM obtains the state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. Our JudgeLM is efficient and the JudgeLM-7B only needs 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, achieving an agreement exceeding 90% that even surpasses human-to-human agreement. JudgeLM also demonstrates extended capabilities in being judges of the single answer, multimodal models, multiple answers, and multi-turn chat.
    摘要 评估大语言模型(LLM)在开放场景中是有挑战的,因为现有的基准和指标不能全面评估它们。为解决这一问题,我们提议将 LLM 微调为可扩展的判官(JudgeLM),以便在开放基准中高效、有效地评估 LLM。我们首先提出了一个全面、大规模、高质量的数据集,包括任务种子、LLM 生成的答案和 GPT-4 生成的判断,用于微调高性能的判官,同时提出了一个评估判官的新基准。我们在不同的参数规模(7B、13B、33B)上训练 JudgeLM,并对其能力和行为进行了系统性的分析。随后,我们分析了将 LLM 微调为判官时的关键偏差,并将其归纳为位置偏差、知识偏差和格式偏差。为解决这些问题,JudgeLM 引入了交换增强、参考支持和参考丢弃等一系列技术,显著提升了判官的性能。JudgeLM 在现有的 PandaLM 基准和我们提出的新基准上均取得了最先进的判官性能。JudgeLM 十分高效,JudgeLM-7B 仅需 3 分钟即可用 8 块 A100 GPU 评判 5K 个样本。JudgeLM 与教师判官的一致率超过 90%,甚至超过了人与人之间的一致率。JudgeLM 还展示了在单个答案、多模态模型、多个答案和多轮对话评判上的扩展能力。
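
To make the position-bias mitigation concrete, here is a minimal sketch of the "swap augmentation" idea mentioned above (an assumed formulation with hypothetical field names, not the released JudgeLM code): each judge-training example is duplicated with the two candidate answers swapped and the label flipped, so the judge cannot rely on answer position.

```python
# Swap augmentation sketch: duplicate each pairwise-judgment example with the
# candidate answers swapped, relabeling the winner so the judgment stays consistent.
import random

def swap_augment(examples):
    """examples: dicts with 'question', 'answer_a', 'answer_b', 'winner' in {'a','b','tie'}."""
    augmented = []
    for ex in examples:
        augmented.append(ex)
        flipped_winner = {"a": "b", "b": "a", "tie": "tie"}[ex["winner"]]
        augmented.append({
            "question": ex["question"],
            "answer_a": ex["answer_b"],   # swap the two candidates
            "answer_b": ex["answer_a"],
            "winner": flipped_winner,
        })
    random.shuffle(augmented)
    return augmented
```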

Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in Ghana

  • paper_url: http://arxiv.org/abs/2310.17606
  • repo_url: None
  • paper_authors: Owen Henkel, Hannah Horne-Robinson, Libby Hills, Bill Roberts, Joshua McGrane
  • for: 这项研究旨在使用大规模语音模型,对加纳学生的口头朗读流利度(ORF)进行自动评估。
  • methods: 研究使用最新版本的大规模语音模型(Whisper V2、wav2vec2.0)对学生的口头朗读进行评估。
  • results: 研究发现,使用 Whisper V2 生成的学生朗读转写文本的 Word Error Rate 为 13.5,与该模型在成人语音上的平均 WER(12.8)相近;利用这些转写文本生成的全自动 ORF 分数与人工专家评分高度相关(相关系数为 0.96)。这些结果是在具有代表性的数据集(带有地区口音的学生、真实课堂录音)上取得的。
    Abstract This paper reports on a set of three recent experiments utilizing large-scale speech models to evaluate the oral reading fluency (ORF) of students in Ghana. While ORF is a well-established measure of foundational literacy, assessing it typically requires one-on-one sessions between a student and a trained evaluator, a process that is time-consuming and costly. Automating the evaluation of ORF could support better literacy instruction, particularly in education contexts where formative assessment is uncommon due to large class sizes and limited resources. To our knowledge, this research is among the first to examine the use of the most recent versions of large-scale speech models (Whisper V2 wav2vec2.0) for ORF assessment in the Global South. We find that Whisper V2 produces transcriptions of Ghanaian students reading aloud with a Word Error Rate of 13.5. This is close to the model's average WER on adult speech (12.8) and would have been considered state-of-the-art for children's speech transcription only a few years ago. We also find that when these transcriptions are used to produce fully automated ORF scores, they closely align with scores generated by expert human graders, with a correlation coefficient of 0.96. Importantly, these results were achieved on a representative dataset (i.e., students with regional accents, recordings taken in actual classrooms), using a free and publicly available speech model out of the box (i.e., no fine-tuning). This suggests that using large-scale speech models to assess ORF may be feasible to implement and scale in lower-resource, linguistically diverse educational contexts.
    摘要 The results show that Whisper V2 produces transcriptions of Ghanaian students reading aloud with a Word Error Rate (WER) of 13.5, which is close to the model's average WER on adult speech (12.8) and would have been considered state-of-the-art for children's speech transcription only a few years ago. Furthermore, the transcriptions were used to produce fully automated ORF scores, which closely aligned with scores generated by expert human graders, with a correlation coefficient of 0.96.The results were achieved on a representative dataset, including students with regional accents and recordings taken in actual classrooms, using a free and publicly available speech model out of the box (i.e., no fine-tuning). This suggests that using large-scale speech models to assess ORF may be feasible to implement and scale in lower-resource, linguistically diverse educational contexts.
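
For reference, the Word Error Rate metric cited above is the word-level edit distance between the reference transcript and the model's hypothesis, normalized by the reference length. A generic illustration follows; it is not the evaluation code used in the study.

```python
# Word Error Rate via dynamic-programming (Levenshtein) distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the boy ran to school", "the boy run to the school"))  # 2 edits / 5 words = 0.4
```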

MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations

  • paper_url: http://arxiv.org/abs/2310.17596
  • repo_url: None
  • paper_authors: Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, Dieter Fox
  • for: 这个论文主要目标是提出一种自动生成大规模、 ricah datasets 的方法,以便通过模仿学习来训练 робот代理。
  • methods: 这个系统使用了一种基于人工示例的自动生成方法,可以从少量的人工示例中生成大量的示例,并将其适应到新的上下文中。
  • results: 通过使用这个系统,研究人员可以生成大量的示例,并训练 robot 代理以达到强大的性能,包括多部件组装和咖啡制作等高精度任务,并且在不同的初始状态分布下表现出色。
    Abstract Imitation learning from a large set of human demonstrations has proved to be an effective paradigm for building capable robot agents. However, the demonstrations can be extremely costly and time-consuming to collect. We introduce MimicGen, a system for automatically synthesizing large-scale, rich datasets from only a small number of human demonstrations by adapting them to new contexts. We use MimicGen to generate over 50K demonstrations across 18 tasks with diverse scene configurations, object instances, and robot arms from just ~200 human demonstrations. We show that robot agents can be effectively trained on this generated dataset by imitation learning to achieve strong performance in long-horizon and high-precision tasks, such as multi-part assembly and coffee preparation, across broad initial state distributions. We further demonstrate that the effectiveness and utility of MimicGen data compare favorably to collecting additional human demonstrations, making it a powerful and economical approach towards scaling up robot learning. Datasets, simulation environments, videos, and more at https://mimicgen.github.io .
    摘要 人工智能控制机器人的努力学习从人类示例集中获得了成功。然而,收集示例集可以非常昂贵和耗时。我们介绍MimicGen,一个系统可以自动生成大规模、丰富的数据集,只需要一小部分的人类示例。我们使用MimicGen生成了18种任务中的超过50,000个示例,包括多个物品配置、物品实例和机器人臂。我们表明,通过对这些生成的数据集进行依据学习,可以让机器人在长期和高精度任务中表现出色,例如多部件组装和咖啡制作。此外,我们还证明了MimicGen数据的有效性和实用性,与收集更多的人类示例相比,它是一种强大和经济的方法来扩大机器人学习。更多信息请访问https://mimicgen.github.io。
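
A toy sketch of the core demonstration-adaptation idea described above (a heavily simplified reading, not the MimicGen implementation): express the recorded end-effector trajectory relative to the object pose in the source demonstration, then replay the same relative motion with respect to the object pose observed in a new scene.

```python
# Adapt a demonstrated end-effector trajectory to a new object pose by
# re-expressing it in the object frame (illustrative only).
import numpy as np

def make_pose(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def adapt_trajectory(eef_poses_src, obj_pose_src, obj_pose_new):
    """eef_poses_src: list of 4x4 end-effector poses from the human demo."""
    obj_src_inv = np.linalg.inv(obj_pose_src)
    adapted = []
    for T_eef in eef_poses_src:
        rel = obj_src_inv @ T_eef           # gripper pose in the source object's frame
        adapted.append(obj_pose_new @ rel)  # same relative motion at the new object pose
    return adapted

obj_src = make_pose(np.eye(3), np.array([0.5, 0.0, 0.1]))
obj_new = make_pose(np.eye(3), np.array([0.3, 0.2, 0.1]))
demo = [make_pose(np.eye(3), np.array([0.5, 0.0, 0.1 + 0.01 * t])) for t in range(5)]
print(adapt_trajectory(demo, obj_src, obj_new)[0][:3, 3])  # trajectory now starts at the new object
```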

SPA: A Graph Spectral Alignment Perspective for Domain Adaptation

  • paper_url: http://arxiv.org/abs/2310.17594
  • repo_url: None
  • paper_authors: Zhiqing Xiao, Haobo Wang, Ying Jin, Lei Feng, Gang Chen, Fei Huang, Junbo Zhao
  • for: 这个研究旨在解决域类别预测中的领域对预测模型的不足,使用无监督领域适应(Unsupervised Domain Adaptation,UDA)来将内部预测模型扩展到不同的目标领域,并且考虑到这些领域之间的数据分布不同。
  • methods: 这个方法基于图 primitives,将DA问题转换为图间的对称问题,并且使用一个新的spectral regularizer来将领域图在特征空间进行调整。此外,还开发了一个细节化的讯息传递模组,以提高目标领域中的推断能力。
  • results: 在标准的 benchmark 上,SPA 的大量实验表明其性能已经超过了现有的最前沿 DA 方法。另外,结合细致的模型分析,我们发现 SPA 方法具有较好的有效性、鲁棒性、判别能力和可迁移性。资料和代码可以在 https://github.com/CrownX/SPA 上获取。
    Abstract Unsupervised domain adaptation (UDA) is a pivotal form in machine learning to extend the in-domain model to the distinctive target domains where the data distributions differ. Most prior works focus on capturing the inter-domain transferability but largely overlook rich intra-domain structures, which empirically results in even worse discriminability. In this work, we introduce a novel graph SPectral Alignment (SPA) framework to tackle the tradeoff. The core of our method is briefly condensed as follows: (i)-by casting the DA problem to graph primitives, SPA composes a coarse graph alignment mechanism with a novel spectral regularizer towards aligning the domain graphs in eigenspaces; (ii)-we further develop a fine-grained message propagation module -- upon a novel neighbor-aware self-training mechanism -- in order for enhanced discriminability in the target domain. On standardized benchmarks, the extensive experiments of SPA demonstrate that its performance has surpassed the existing cutting-edge DA methods. Coupled with dense model analysis, we conclude that our approach indeed possesses superior efficacy, robustness, discriminability, and transferability. Code and data are available at: https://github.com/CrownX/SPA.
    摘要 无监督领域适应(UDA)是机器学习中的一种重要形式,用于将内域模型扩展到数据分布不同的目标领域。大多数前期工作强调了跨领域的可迁移性,但它们忽略了内域结构的丰富性,这会导致判别能力更差。在这种情况下,我们介绍了一种新的图 spectral alignment(SPA)框架,用于在二者之间取得权衡。SPA的核心思想如下:(i)通过将DA问题转化为图 primitives,SPA使用一种新的 spectral regularizer 来在特征空间中对齐领域图;(ii)我们进一步发展了一种细粒度的消息传递模块,基于一种新的邻居感知自训练机制,以提高目标领域的判别能力。在标准化的测试上,我们进行了广泛的实验,结果表明,SPA的性能已经超过了现有的最前沿 DA 方法。同时,我们还进行了细致的模型分析,得出我们的方法确实具有更好的有效性、鲁棒性、判别能力和可迁移性。代码和数据可以从以下地址获取:https://github.com/CrownX/SPA。
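
A rough sketch of a spectral-alignment-style regularizer in the spirit described above (our assumed formulation, not the paper's exact loss): build an affinity graph over the source and target feature batches and penalize the distance between the leading eigenvalues of their normalized graph Laplacians.

```python
# Graph spectral alignment sketch: compare Laplacian spectra of source/target
# feature graphs. Assumes both batches contain at least k samples.
import torch

def laplacian_eigs(features: torch.Tensor, sigma: float = 1.0, k: int = 8) -> torch.Tensor:
    dists = torch.cdist(features, features) ** 2
    W = torch.exp(-dists / (2 * sigma ** 2))                    # dense RBF affinity graph
    d = W.sum(dim=1)
    D_inv_sqrt = torch.diag(d.clamp_min(1e-8).rsqrt())
    L = torch.eye(len(features)) - D_inv_sqrt @ W @ D_inv_sqrt  # normalized Laplacian
    return torch.linalg.eigvalsh(L)[:k]                         # smallest k eigenvalues

def spectral_alignment_loss(src_feats: torch.Tensor, tgt_feats: torch.Tensor) -> torch.Tensor:
    return torch.norm(laplacian_eigs(src_feats) - laplacian_eigs(tgt_feats), p=2)

loss = spectral_alignment_loss(torch.randn(32, 256), torch.randn(32, 256))
print(loss.item())
```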

An Open Source Data Contamination Report for Llama Series Models

  • paper_url: http://arxiv.org/abs/2310.17589
  • repo_url: https://github.com/liyucheng09/contamination_detector
  • paper_authors: Yucheng Li
  • for: 这篇论文旨在提供一种开源的数据污染报告,用于评估LLama系列模型的可靠性。
  • methods: 该论文使用了多种方法进行数据污染分析,包括对六个多选问答 benchmark进行分析,并计算这些benchmark中的重叠率。
  • results: 研究发现,这些benchmark中存在1%到8.7%的数据污染程度,而LLama模型在污染subset上的准确率高于清晰subset上的准确率超过5%. 数据和代码可以在https://github.com/liyucheng09/Contamination_Detector中获得。
    Abstract Data contamination in language model evaluation is increasingly prevalent with the growing popularity of large language models. It allows models to "cheat" via memorisation instead of displaying true capabilities. Therefore, contamination analysis has become a crucial part of reliable model evaluation to validate results. However, existing contamination analysis is usually conducted internally by LLM developers and often lacks transparency and completeness. This paper presents an open-source data contamination report for the Llama series models. We analyse six popular multi-choice QA benchmarks and quantify their overlapping with the training set of Llama. Various levels of contamination ranging from 1\% to 8.7\% are found across benchmarks. Our comparison also reveals that Llama models can gain over 5\% higher accuracy on contaminated subsets versus clean subsets. Data and code are available at: https://github.com/liyucheng09/Contamination_Detector.
    摘要 随着大型语言模型的普及,语言模型评估中的数据污染问题日益普遍。这种污染使模型可以通过记忆而非真实能力"作弊"。因此,污染分析已成为可靠模型评估中验证结果的重要环节。然而,现有的污染分析通常由 LLM 开发者在内部进行,往往缺乏透明度和完整性。本文为 Llama 系列模型提供了一份开源的数据污染报告。我们分析了六个流行的多选问答基准,并量化它们与 Llama 训练集的重叠。我们发现各基准的污染程度从 1% 到 8.7% 不等。我们的比较还表明,Llama 模型在被污染的子集上可以比干净子集获得高出 5% 以上的准确率。数据和代码可以在 GitHub 上获取:https://github.com/liyucheng09/Contamination_Detector。
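
As a simple illustration of the kind of overlap analysis described above (a generic n-gram heuristic, not the repository's exact method): flag a benchmark example as potentially contaminated when a large fraction of its word 8-grams also appear in the training corpus.

```python
# N-gram overlap contamination check (illustrative heuristic).
def ngrams(text: str, n: int = 8):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_examples, training_corpus, n=8, threshold=0.5):
    corpus_ngrams = ngrams(training_corpus, n)
    flagged = 0
    for example in benchmark_examples:
        ex_ngrams = ngrams(example, n)
        if ex_ngrams and len(ex_ngrams & corpus_ngrams) / len(ex_ngrams) >= threshold:
            flagged += 1
    return flagged / max(len(benchmark_examples), 1)
```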

Can LLMs Grade Short-answer Reading Comprehension Questions : Foundational Literacy Assessment in LMICs

  • paper_url: http://arxiv.org/abs/2310.18373
  • repo_url: None
  • paper_authors: Owen Henkel, Libby Hills, Bill Roberts, Joshua McGrane
  • for: 这项研究用于评估大语言模型(GPT-4)在评估短回答阅读理解问题上的可靠性。
  • methods: 研究使用了多种配置的生成式大语言模型(LLMs)来评估来自新数据集的学生答案,该数据集由150名学生在加纳完成的阅读测验中收集。
  • results: GPT-4 在评估新数据集上表现出色,其 quadratic weighted kappa 值为 0.923,F1 值为 0.88,大大超越了基于迁移学习的方法,甚至超过了人类专家评分员(quadratic weighted kappa 0.915,F1 0.87)。这项研究表明,生成式 LLMs 有潜力用于可靠地评估基础读写能力。
    Abstract This paper presents emerging evidence of using generative large language models (i.e., GPT-4) to reliably evaluate short-answer reading comprehension questions. Specifically, we explore how various configurations of generative (LLMs) are able to evaluate student responses from a new dataset, drawn from a battery of reading assessments conducted with over 150 students in Ghana. As this dataset is novel and hence not used in training runs of GPT, it offers an opportunity to test for domain shift and evaluate the generalizability of generative LLMs, which are predominantly designed and trained on data from high-income North American countries. We found that GPT-4, with minimal prompt engineering performed extremely well on evaluating the novel dataset (Quadratic Weighted Kappa 0.923, F1 0.88), substantially outperforming transfer-learning based approaches, and even exceeding expert human raters (Quadratic Weighted Kappa 0.915, F1 0.87). To the best of our knowledge, our work is the first to empirically evaluate the performance of generative LLMs on short-answer reading comprehension questions, using real student data, and suggests that generative LLMs have the potential to reliably evaluate foundational literacy. Currently the assessment of formative literacy and numeracy is infrequent in many low and middle-income countries (LMICs) due to the cost and operational complexities of conducting them at scale. Automating the grading process for reading assessment could enable wider usage, and in turn improve decision-making regarding curricula, school management, and teaching practice at the classroom level. Importantly, in contrast transfer learning based approaches, generative LLMs generalize well and the technical barriers to their use are low, making them more feasible to implement and scale in lower resource educational contexts.
    摘要 The results show that GPT-4, with minimal prompt engineering, performed extremely well on evaluating the novel dataset, outperforming transfer-learning based approaches and even exceeding expert human raters. This is the first study to empirically evaluate the performance of generative LLMs on short-answer reading comprehension questions using real student data, and suggests that these models have the potential to reliably evaluate foundational literacy.Currently, the assessment of formative literacy and numeracy is infrequent in many low and middle-income countries (LMICs) due to the cost and operational complexities of conducting them at scale. Automating the grading process for reading assessment could enable wider usage and improve decision-making regarding curricula, school management, and teaching practice at the classroom level. Additionally, generative LLMs generalize well and have low technical barriers to implementation, making them more feasible to use in lower resource educational contexts.
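
For readers unfamiliar with the agreement metric cited above, Quadratic Weighted Kappa can be computed directly with scikit-learn; the toy scores below are purely illustrative and are not the study's data.

```python
# Quadratic Weighted Kappa between two graders on an ordinal rubric.
from sklearn.metrics import cohen_kappa_score

human_scores = [0, 1, 2, 2, 3, 1, 0, 3]
model_scores = [0, 1, 2, 1, 3, 1, 0, 2]
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")
```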

Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models

  • paper_url: http://arxiv.org/abs/2310.17567
  • repo_url: None
  • paper_authors: Dingli Yu, Simran Kaur, Arushi Gupta, Jonah Brown-Cohen, Anirudh Goyal, Sanjeev Arora
  • for: 这个论文旨在探讨 LLM 评价方法如何随着 LLM 从统计语言模型转化为通用 AI 代理而变化。
  • methods: 本论文引入了一种新的评价方法,称为 Skill-Mix,用于评估 LLM 的能力集合和组合能力。评价方法包括 randomly 选择 $k$ 个技能,并要求 LLM 生成组合这些技能的文本。
  • results: 研究人员通过对两个流行的 chatbot 进行评价,发现存在较大的差异在不同模型之间,这些差异不会被捕捉到在现有的 LLM 排名板。此外,研究人员发现 GPT-4 在 $k=5$ 时表现良好,这可能指示它在组合技能方面具有更高的能力。
    Abstract With LLMs shifting their role from statistical modeling of language to serving as general-purpose AI agents, how should LLM evaluations change? Arguably, a key ability of an AI agent is to flexibly combine, as needed, the basic skills it has learned. The capability to combine skills plays an important role in (human) pedagogy and also in a paper on emergence phenomena (Arora & Goyal, 2023). This work introduces Skill-Mix, a new evaluation to measure ability to combine skills. Using a list of $N$ skills the evaluator repeatedly picks random subsets of $k$ skills and asks the LLM to produce text combining that subset of skills. Since the number of subsets grows like $N^k$, for even modest $k$ this evaluation will, with high probability, require the LLM to produce text significantly different from any text in the training set. The paper develops a methodology for (a) designing and administering such an evaluation, and (b) automatic grading (plus spot-checking by humans) of the results using GPT-4 as well as the open LLaMA-2 70B model. Administering a version of to popular chatbots gave results that, while generally in line with prior expectations, contained surprises. Sizeable differences exist among model capabilities that are not captured by their ranking on popular LLM leaderboards ("cramming for the leaderboard"). Furthermore, simple probability calculations indicate that GPT-4's reasonable performance on $k=5$ is suggestive of going beyond "stochastic parrot" behavior (Bender et al., 2021), i.e., it combines skills in ways that it had not seen during training. We sketch how the methodology can lead to a Skill-Mix based eco-system of open evaluations for AI capabilities of future models.
    摘要 随着 LLM 从统计语言模型转变为通用 AI 代理,LLM 的评估应如何改变?可以说,AI 代理的一项关键能力是按需灵活地组合已学会的基本技能。人类教学中也强调这种能力,Arora 和 Goyal(2023)关于涌现现象的论文中也有相关讨论。这个工作介绍了 Skill-Mix,一种新的评估方法,用于测试 LLM 的技能组合能力。评估者会从 $N$ 个技能中随机选择 $k$ 个,并要求 LLM 生成结合这些技能的文本。由于可能的组合数量随 $N^k$ 增长,即使 $k$ 不大,这种评估也很可能要求 LLM 生成与训练集中任何文本都显著不同的内容。文章还提供了设计和实施这种评估的方法,以及使用 GPT-4 和开源的 LLaMA-2 70B 进行自动评分(辅以人工抽查)的方法。对流行的 chatbot 进行评估的结果总体与预期一致,但也有一些意外发现:不同模型之间存在显著的能力差异,而这些差异没有被流行的 LLM 排行榜所捕捉。此外,简单的概率计算表明,GPT-4 在 $k=5$ 时的合理表现暗示它可能超越了“随机鹦鹉”行为(Bender et al., 2021),即它能以训练中未见过的方式组合技能。文章最后简要概述了如何基于 Skill-Mix 构建面向未来模型 AI 能力的开放评估生态系统。
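
A small sketch of the Skill-Mix sampling procedure as described above (our paraphrase of the setup with made-up skill names and prompt wording, not the authors' release): repeatedly draw a random subset of k skills and format a generation prompt combining them; the number of distinct subsets grows combinatorially, so most sampled combinations are unlikely to appear verbatim in training data.

```python
# Skill-Mix style prompt sampling (illustrative).
import math
import random

SKILLS = ["metaphor", "red herring", "modus ponens", "self-serving bias", "spatial reasoning"]

def sample_skill_mix_prompt(skills, k, topic="gardening"):
    subset = random.sample(skills, k)
    prompt = (f"Write a short piece of text about {topic} that illustrates all of "
              f"the following skills: {', '.join(subset)}.")
    return prompt, subset

prompt, chosen = sample_skill_mix_prompt(SKILLS, k=3)
print(prompt)
print("possible skill subsets:", math.comb(len(SKILLS), 3))
```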

Bifurcations and loss jumps in RNN training

  • paper_url: http://arxiv.org/abs/2310.17561
  • repo_url: https://github.com/durstewitzlab/scyfi
  • paper_authors: Lukas Eisenmann, Zahra Monfared, Niclas Alexander Göring, Daniel Durstewitz
  • for: 这个论文的目的是使用权重链网络(RNN)模型和预测时间序列数据,以及推导出动态系统(DS)的各种计算和动态性质。
  • methods: 这个论文使用了权重链网络(RNN)和权重链网络的训练过程,以及动态系统(DS)理论中的概念,来更好地理解训练过程和模型的计算和动态性质。
  • results: 这个论文提出了一种新的规则搜索算法,可以快速和精确地找到权重链网络(RNN)中的固定点和循环点,以及它们的存在和稳定区域。这种算法可以帮助分析训练过程中的某些突变,并且可以推导出训练过程中的某些性质。
    Abstract Recurrent neural networks (RNNs) are popular machine learning tools for modeling and forecasting sequential data and for inferring dynamical systems (DS) from observed time series. Concepts from DS theory (DST) have variously been used to further our understanding of both, how trained RNNs solve complex tasks, and the training process itself. Bifurcations are particularly important phenomena in DS, including RNNs, that refer to topological (qualitative) changes in a system's dynamical behavior as one or more of its parameters are varied. Knowing the bifurcation structure of an RNN will thus allow to deduce many of its computational and dynamical properties, like its sensitivity to parameter variations or its behavior during training. In particular, bifurcations may account for sudden loss jumps observed in RNN training that could severely impede the training process. Here we first mathematically prove for a particular class of ReLU-based RNNs that certain bifurcations are indeed associated with loss gradients tending toward infinity or zero. We then introduce a novel heuristic algorithm for detecting all fixed points and k-cycles in ReLU-based RNNs and their existence and stability regions, hence bifurcation manifolds in parameter space. In contrast to previous numerical algorithms for finding fixed points and common continuation methods, our algorithm provides exact results and returns fixed points and cycles up to high orders with surprisingly good scaling behavior. We exemplify the algorithm on the analysis of the training process of RNNs, and find that the recently introduced technique of generalized teacher forcing completely avoids certain types of bifurcations in training. Thus, besides facilitating the DST analysis of trained RNNs, our algorithm provides a powerful instrument for analyzing the training process itself.
    摘要 循环神经网络(RNN)是建模和预测序列数据、以及从观测时间序列推断动力系统(DS)的常用机器学习工具。DS理论(DST)的概念已被用来加深我们对训练后的RNN如何解决复杂任务以及训练过程本身的理解。分岔是动力系统(包括RNN)中特别重要的现象,指的是当一个或多个参数变化时,系统动力学行为发生的拓扑(定性)变化。了解RNN的分岔结构,可以推导出它的许多计算和动力学性质,例如对参数变化的敏感度以及训练过程中的行为。特别是,分岔可以解释RNN训练过程中观察到的突然损失跳变,这可能会严重阻碍训练进程。在本文中,我们首先对一类基于ReLU的RNN在数学上证明了某些分岔确实与损失梯度趋于无穷大或零相关。然后,我们提出了一种新的启发式算法,可以在基于ReLU的RNN中找到所有固定点和k-循环及其存在与稳定区域,即参数空间中的分岔流形。与以往寻找固定点的数值算法和常见的延拓方法不同,我们的算法给出精确结果,并能以出人意料的良好扩展性找到高阶的固定点和循环。我们以RNN训练过程的分析为例展示该算法,并发现最近提出的广义教师强制(generalized teacher forcing)技术可以完全避免训练中某些类型的分岔。因此,除了便于对已训练RNN进行DST分析之外,我们的算法还为分析训练过程本身提供了有力的工具。
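
To make the fixed-point search concrete, here is a toy brute-force version of the underlying idea for a piecewise-linear RNN of the form z_{t+1} = A z_t + W relu(z_t) + h (an assumed simplified model): enumerate ReLU activation patterns, solve the resulting linear system in each region, and keep solutions whose sign pattern is self-consistent. The paper's algorithm searches these patterns far more efficiently; this sketch only illustrates the principle.

```python
# Exact fixed points of a small ReLU-based RNN by enumerating activation patterns.
import itertools
import numpy as np

def fixed_points(A, W, h):
    n = len(h)
    found = []
    for pattern in itertools.product([0, 1], repeat=n):   # candidate ReLU on/off pattern
        D = np.diag(pattern)
        M = A + W @ D - np.eye(n)                          # (A + W D - I) z = -h
        try:
            z = np.linalg.solve(M, -h)
        except np.linalg.LinAlgError:
            continue
        if np.all((z > 0) == np.array(pattern, dtype=bool)):  # pattern must match the solution
            found.append(z)
    return found

rng = np.random.default_rng(1)
n = 4
A = np.diag(rng.uniform(0.2, 0.9, n))
W = 0.5 * rng.normal(size=(n, n))
h = rng.normal(size=n)
print(fixed_points(A, W, h))
```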

Instability of computer vision models is a necessary result of the task itself

  • paper_url: http://arxiv.org/abs/2310.17559
  • repo_url: None
  • paper_authors: Oliver Turnbull, George Cevora
  • for: 这篇论文探讨了计算机视觉模型中的敌对示例问题,强调了这些问题的潜在危害性,以及如何通过分析问题的本质和数据的稳定性来部分缓解这些问题。
  • methods: 本论文使用了数据的对称性(翻译不变性)、分类任务的分类性和图像本身的基本不同来探讨计算机视觉模型的不稳定性。
  • results: 研究发现,由于数据的对称性、分类任务的分类性和图像本身的基本不同,计算机视觉模型必然存在不稳定性。此外,由于训练数据的不彻底标注,这种不稳定性可能会更加严重。但是,通过提高图像的分辨率、提供图像上下文信息、彻底标注训练数据和防止攻击者频繁访问计算机视觉系统,可以部分缓解这种不稳定性。
    Abstract Adversarial examples resulting from instability of current computer vision models are an extremely important topic due to their potential to compromise any application. In this paper we demonstrate that instability is inevitable due to a) symmetries (translational invariance) of the data, b) the categorical nature of the classification task, and c) the fundamental discrepancy of classifying images as objects themselves. The issue is further exacerbated by non-exhaustive labelling of the training data. Therefore we conclude that instability is a necessary result of how the problem of computer vision is currently formulated. While the problem cannot be eliminated, through the analysis of the causes, we have arrived at ways how it can be partially alleviated. These include i) increasing the resolution of images, ii) providing contextual information for the image, iii) exhaustive labelling of training data, and iv) preventing attackers from frequent access to the computer vision system.
    摘要 “对今天的计算机视觉模型而言,敌对示例是一个非常重要的话题,因为它们有可能会破坏任何应用程序。在这篇论文中,我们示出了这种不稳定性是不可避免的,原因包括:a) 数据中的对称性(平移不变性),b) 分类任务的分类性质,以及c) 图像本身的基本不同。这个问题被加剧了由于训练数据的非完整标注。因此,我们得出结论是,不稳定性是计算机视觉问题的一个必然的结果。虽然这个问题无法完全消除,但通过分析其原因,我们到达了一些可以减轻这个问题的方法,包括:i) 提高图像的分辨率,ii) 为图像提供上下文信息,iii) 对训练数据进行完整的标注,以及iv) 防止攻击者对计算机视觉系统进行频繁访问。”

Interactive Robot Learning from Verbal Correction

  • paper_url: http://arxiv.org/abs/2310.17555
  • repo_url: None
  • paper_authors: Huihan Liu, Alice Chen, Yuke Zhu, Adith Swaminathan, Andrey Kolobov, Ching-An Cheng
  • for: 本研究旨在帮助机器人学习并优化其行为在不结构化环境中,使其能够在日常生活中更好地协助人类。
  • methods: 该研究使用大型自然语言模型(LLM)OLAF,让日常用户通过语音纠正机器人的错误行为来教育机器人。OLAF可以根据语音反馈更新机器人的视听动作神经策略,以避免将来重复错误。
  • results: 在实验中,用户通过OLAF教育机器人完成长期 manipulate任务,成功率提高了20.0%。详细结果和视频可以在https://ut-austin-rpl.github.io/olaf/中找到。
    Abstract The ability to learn and refine behavior after deployment has become ever more important for robots as we design them to operate in unstructured environments like households. In this work, we design a new learning system based on large language model (LLM), OLAF, that allows everyday users to teach a robot using verbal corrections when the robot makes mistakes, e.g., by saying "Stop what you're doing. You should move closer to the cup." A key feature of OLAF is its ability to update the robot's visuomotor neural policy based on the verbal feedback to avoid repeating mistakes in the future. This is in contrast to existing LLM-based robotic systems, which only follow verbal commands or corrections but not learn from them. We demonstrate the efficacy of our design in experiments where a user teaches a robot to perform long-horizon manipulation tasks both in simulation and on physical hardware, achieving on average 20.0% improvement in policy success rate. Videos and more results are at https://ut-austin-rpl.github.io/olaf/
    摘要 随着我们设计机器人在家庭等非结构化环境中运行,部署后学习和改进行为的能力变得日益重要。在这项工作中,我们设计了一个基于大型语言模型(LLM)的新学习系统 OLAF,允许日常用户在机器人出错时通过口头纠正来教导它,例如说“停止你正在做的事,你应该更靠近杯子”。OLAF 的一个关键特点是能够根据口头反馈更新机器人的视觉运动神经策略,以避免将来重复同样的错误。这与现有的基于 LLM 的机器人系统不同,后者只会执行口头命令或纠正,而不会从中学习。我们在实验中证明了该设计的有效性:用户可以教机器人在仿真和真实硬件上执行长时程操作任务,策略成功率平均提升 20.0%。视频和更多结果见 https://ut-austin-rpl.github.io/olaf/。

Model-Based Runtime Monitoring with Interactive Imitation Learning

  • paper_url: http://arxiv.org/abs/2310.17552
  • repo_url: None
  • paper_authors: Huihan Liu, Shivin Dass, Roberto Martín-Martín, Yuke Zhu
  • for: 这个研究旨在将Robot学习方法升级为可靠和可靠的高度任务。
  • methods: 这个研究使用互动学习和监控方法,将人工智能和机器人联合作为一体,以提高机器人的性能和可靠性。
  • results: 这个研究比基准方法高出23%和40%在模拟和物理硬件上的成功率。
    Abstract Robot learning methods have recently made great strides, but generalization and robustness challenges still hinder their widespread deployment. Failing to detect and address potential failures renders state-of-the-art learning systems not combat-ready for high-stakes tasks. Recent advances in interactive imitation learning have presented a promising framework for human-robot teaming, enabling the robots to operate safely and continually improve their performances over long-term deployments. Nonetheless, existing methods typically require constant human supervision and preemptive feedback, limiting their practicality in realistic domains. This work aims to endow a robot with the ability to monitor and detect errors during task execution. We introduce a model-based runtime monitoring algorithm that learns from deployment data to detect system anomalies and anticipate failures. Unlike prior work that cannot foresee future failures or requires failure experiences for training, our method learns a latent-space dynamics model and a failure classifier, enabling our method to simulate future action outcomes and detect out-of-distribution and high-risk states preemptively. We train our method within an interactive imitation learning framework, where it continually updates the model from the experiences of the human-robot team collected using trustworthy deployments. Consequently, our method reduces the human workload needed over time while ensuring reliable task execution. Our method outperforms the baselines across system-level and unit-test metrics, with 23% and 40% higher success rates in simulation and on physical hardware, respectively. More information at https://ut-austin-rpl.github.io/sirius-runtime-monitor/
    摘要 机器人学习方法在最近几年内已经取得了很大的进步,但是泛化和鲁棒性问题仍然限制它们的广泛部署。如果不能检测和解决潜在的失败,那么当前最先进的学习系统将无法胜任高风险任务。交互式模仿学习的最新进展为人机协作提供了一个有前景的框架,使机器人能够安全运行,并在长期部署中持续提升性能。然而,现有方法通常需要持续的人工监督和预先反馈,限制了其在现实场景中的实用性。本研究的目的是赋予机器人在任务执行过程中监控和检测错误的能力。我们介绍了一种基于模型的运行时监控算法,可以从部署数据中学习检测系统异常和预测失败。与先前无法预见未来失败或需要失败经验进行训练的方法不同,我们的方法通过学习latent空间动力学模型和失败分类器,可以模拟未来动作的结果,并提前检测分布外的高风险和异常状态。我们在交互式模仿学习框架中训练了我们的方法,从人机团队在可信部署中收集的经验中不断更新模型。因此,我们的方法可以逐渐减少人工劳动量,并同时确保任务执行的可靠性。我们的方法在系统级别和单元测试指标上均优于基线,在仿真和真实硬件上的成功率分别高出23%和40%。更多信息请参考 https://ut-austin-rpl.github.io/sirius-runtime-monitor/。
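
A high-level sketch of the model-based runtime monitoring loop described above (assumed structure with made-up network sizes, not the released system): roll a learned latent dynamics model forward over the planned actions and hand control to a human if a failure classifier assigns high risk to any predicted future state.

```python
# Runtime monitor sketch: simulate the plan in latent space and score failure risk.
import torch
import torch.nn as nn

class RuntimeMonitor(nn.Module):
    def __init__(self, latent_dim=32, action_dim=7):
        super().__init__()
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + action_dim, 128),
                                      nn.ReLU(), nn.Linear(128, latent_dim))
        self.failure_head = nn.Sequential(nn.Linear(latent_dim, 64),
                                          nn.ReLU(), nn.Linear(64, 1))

    def risk_of_plan(self, z, actions, threshold=0.8):
        """z: (latent_dim,) current latent state; actions: (T, action_dim) planned actions."""
        risks = []
        for a in actions:
            z = self.dynamics(torch.cat([z, a]))             # simulate one step ahead
            risks.append(torch.sigmoid(self.failure_head(z)))
        max_risk = torch.stack(risks).max()
        return max_risk.item(), bool(max_risk > threshold)   # request human help if risky

monitor = RuntimeMonitor()
risk, should_intervene = monitor.risk_of_plan(torch.zeros(32), torch.zeros(5, 7))
print(risk, should_intervene)
```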

Unpacking the Ethical Value Alignment in Big Models

  • paper_url: http://arxiv.org/abs/2310.17551
  • repo_url: None
  • paper_authors: Xiaoyuan Yi, Jing Yao, Xiting Wang, Xing Xie
  • for: 本文旨在探讨大型模型在社会中的风险和挑战,以及现有的人工智能伦理准则是如何应对这些风险的。
  • methods: 本文对现有的AI伦理准则进行了检视,并分析了大型模型的伦理含义和挑战。此外,本文还 investigate了当前主流的语言模型(LLMs)的道德倾向,分析了现有的对齐算法,并提出了一种新的概念框架以确定大型模型的道德价值观。
  • results: 本文提出了一种新的概念框架,以帮助建立一个统一的AI伦理框架,并提出了一些有优先的研究方向以确定大型模型的道德价值观。
    Abstract Big models have greatly advanced AI's ability to understand, generate, and manipulate information and content, enabling numerous applications. However, as these models become increasingly integrated into everyday life, their inherent ethical values and potential biases pose unforeseen risks to society. This paper provides an overview of the risks and challenges associated with big models, surveys existing AI ethics guidelines, and examines the ethical implications arising from the limitations of these models. Taking a normative ethics perspective, we propose a reassessment of recent normative guidelines, highlighting the importance of collaborative efforts in academia to establish a unified and universal AI ethics framework. Furthermore, we investigate the moral inclinations of current mainstream LLMs using the Moral Foundation theory, analyze existing alignment algorithms, and outline the unique challenges encountered in aligning ethical values within them. To address these challenges, we introduce a novel conceptual paradigm for aligning the ethical values of big models and discuss promising research directions for alignment criteria, evaluation, and method, representing an initial step towards the interdisciplinary construction of ethically aligned AI. This paper is a modified English version of our Chinese paper https://crad.ict.ac.cn/cn/article/doi/10.7544/issn1000-1239.202330553, intended to help non-Chinese native speakers better understand our work.
    摘要 大型模型已经大幅提高了人工智能理解、生成和处理信息与内容的能力,开拓了许多应用。然而,随着这些模型在日常生活中的普及,它们内在的伦理价值取向和潜在偏见给社会带来了难以预料的风险。本文概述了大型模型所带来的风险和挑战,检视了现有的人工智能伦理准则,并分析了这些模型的局限性所引发的伦理含义。从规范伦理学的角度出发,我们提出对近期规范性准则的重新评估,强调学术界需要共同努力建立一个统一且普适的人工智能伦理框架。此外,我们使用道德基础理论(Moral Foundation theory)分析当今主流大型语言模型(LLMs)的道德倾向,分析现有的对齐算法,并概述了在这些模型中对齐伦理价值所面临的独特挑战。为解决这些挑战,我们提出了一种对齐大模型伦理价值的新概念范式,并讨论了对齐标准、评估和方法等有前景的研究方向,迈出了跨学科构建伦理对齐 AI 的第一步。

Human-Guided Complexity-Controlled Abstractions

  • paper_url: http://arxiv.org/abs/2310.17550
  • repo_url: https://github.com/mycal-tucker/human-guided-abstractions
  • paper_authors: Andi Peng, Mycal Tucker, Eoin Kenny, Noga Zaslavsky, Pulkit Agrawal, Julie Shah
  • for: 这个论文旨在探讨神经网络如何学习任务特定的潜在表示,以及如何使其泛化到新的任务和设定。
  • methods: 作者训练神经网络生成一系列离散表示,并通过调整表示分布的熵来控制表示复杂度(即编码输入所用的比特数)。
  • results: 在 fine-tuning 实验中,使用只有一小数量的标注数据,发现(1)调整表示复杂性到任务适应的水平支持最高的 fine-tuning 性能,以及(2)在人类参与者研究中,用户能够通过视觉化表示的方式确定下游任务适应的复杂性水平。
    Abstract Neural networks often learn task-specific latent representations that fail to generalize to novel settings or tasks. Conversely, humans learn discrete representations (i.e., concepts or words) at a variety of abstraction levels (e.g., "bird" vs. "sparrow") and deploy the appropriate abstraction based on task. Inspired by this, we train neural models to generate a spectrum of discrete representations, and control the complexity of the representations (roughly, how many bits are allocated for encoding inputs) by tuning the entropy of the distribution over representations. In finetuning experiments, using only a small number of labeled examples for a new task, we show that (1) tuning the representation to a task-appropriate complexity level supports the highest finetuning performance, and (2) in a human-participant study, users were able to identify the appropriate complexity level for a downstream task using visualizations of discrete representations. Our results indicate a promising direction for rapid model finetuning by leveraging human insight.
    摘要 神经网络往往学习到难以泛化到新任务或新环境的任务特定潜在表示;受人类在多个抽象层次上学习离散概念的启发,我们训练神经模型生成一系列离散表示,并通过调节表示分布的熵来控制表示复杂度。在仅使用少量标注样本进行微调的实验中,我们发现:1) 将表示复杂度调整到与任务相适应的水平可以带来最高的微调性能;2) 在人类参与者研究中,用户能够通过离散表示的可视化识别下游任务所需的复杂度水平。我们的结果表明,借助人类的洞察来快速微调模型是一个有前景的方向。
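
A sketch of an entropy-controlled discrete bottleneck in the spirit of the method above (an assumed simplification, not the repository's code): encode inputs as a distribution over discrete codes and add a penalty that pushes the marginal code distribution's entropy toward a target, which controls how many "bits" the representation uses.

```python
# Entropy-matching penalty for a discrete representation bottleneck.
import torch
import torch.nn.functional as F

def complexity_loss(logits: torch.Tensor, target_entropy_bits: float) -> torch.Tensor:
    """logits: (batch, num_codes) pre-softmax scores over discrete codes."""
    probs = F.softmax(logits, dim=-1)
    mean_probs = probs.mean(dim=0)                      # marginal code usage over the batch
    entropy_nats = -(mean_probs * mean_probs.clamp_min(1e-9).log()).sum()
    entropy_bits = entropy_nats / torch.log(torch.tensor(2.0))
    return (entropy_bits - target_entropy_bits) ** 2    # match a task-appropriate complexity

loss = complexity_loss(torch.randn(16, 64), target_entropy_bits=3.0)
print(loss.item())
```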

Neuro-Inspired Fragmentation and Recall to Overcome Catastrophic Forgetting in Curiosity

  • paper_url: http://arxiv.org/abs/2310.17537
  • repo_url: https://github.com/fietelab/farcuriosity
  • paper_authors: Jaedong Hwang, Zhang-Wei Hong, Eric Chen, Akhilan Boopathy, Pulkit Agrawal, Ila Fiete
  • for: 解决难度探索任务中的快速忘记问题
  • methods: 使用Fragmentation and Recall Curiosity方法,通过在不同 Fragment 上使用不同的本地好奇模块来减少忘记
  • results: 在Atari benchmark suite of tasks 上的游戏环境中,实现了 less forgetting 和 better overall performance
    Abstract Deep reinforcement learning methods exhibit impressive performance on a range of tasks but still struggle on hard exploration tasks in large environments with sparse rewards. To address this, intrinsic rewards can be generated using forward model prediction errors that decrease as the environment becomes known, and incentivize an agent to explore novel states. While prediction-based intrinsic rewards can help agents solve hard exploration tasks, they can suffer from catastrophic forgetting and actually increase at visited states. We first examine the conditions and causes of catastrophic forgetting in grid world environments. We then propose a new method FARCuriosity, inspired by how humans and animals learn. The method depends on fragmentation and recall: an agent fragments an environment based on surprisal, and uses different local curiosity modules (prediction-based intrinsic reward functions) for each fragment so that modules are not trained on the entire environment. At each fragmentation event, the agent stores the current module in long-term memory (LTM) and either initializes a new module or recalls a previously stored module based on its match with the current state. With fragmentation and recall, FARCuriosity achieves less forgetting and better overall performance in games with varied and heterogeneous environments in the Atari benchmark suite of tasks. Thus, this work highlights the problem of catastrophic forgetting in prediction-based curiosity methods and proposes a solution.
    摘要 In this work, we examine the conditions and causes of catastrophic forgetting in grid world environments and propose a new method called FARCuriosity, inspired by how humans and animals learn. FARCuriosity depends on fragmentation and recall, where the agent fragments the environment based on surprisal and uses different local curiosity modules (prediction-based intrinsic reward functions) for each fragment. At each fragmentation event, the agent stores the current module in long-term memory (LTM) and either initializes a new module or recalls a previously stored module based on its match with the current state.With fragmentation and recall, FARCuriosity achieves less forgetting and better overall performance in games with varied and heterogeneous environments in the Atari benchmark suite of tasks. This work highlights the problem of catastrophic forgetting in prediction-based curiosity methods and proposes a solution.
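
A small runnable sketch of the fragmentation-and-recall loop described above (our paraphrase of the mechanism, with a count-based stand-in for the prediction module rather than the released code): each fragment keeps its own local curiosity module, a fragmentation event fires when surprisal exceeds a threshold, and a stored module is recalled when it fits the current state better than starting fresh.

```python
# Fragmentation-and-recall curiosity sketch with a toy count-based module.
import numpy as np

class CountModule:
    """Tiny stand-in for a prediction-based curiosity module (illustrative only)."""
    def __init__(self):
        self.counts = {}
    def prediction_error(self, state):
        return 1.0 / np.sqrt(1 + self.counts.get(state, 0))
    def update(self, state):
        self.counts[state] = self.counts.get(state, 0) + 1

class FARCuriosity:
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.current = CountModule()
        self.ltm = []                                   # long-term memory of stored modules

    def intrinsic_reward(self, state):
        surprise = self.current.prediction_error(state)
        if surprise > self.threshold:                   # fragmentation event
            self.ltm.append(self.current)
            best = min(self.ltm, key=lambda m: m.prediction_error(state))
            self.current = best if best.prediction_error(state) <= self.threshold else CountModule()
        self.current.update(state)
        return surprise

agent = FARCuriosity()
print([round(agent.intrinsic_reward(s), 2) for s in ["roomA", "roomA", "roomB", "roomA"]])
```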

SoK: Pitfalls in Evaluating Black-Box Attacks

  • paper_url: http://arxiv.org/abs/2310.17534
  • repo_url: https://github.com/iamgroot42/blackboxsok
  • paper_authors: Fnu Suya, Anshuman Suri, Tingwei Zhang, Jingtao Hong, Yuan Tian, David Evans
  • for: This paper is written to systematize knowledge in the area of black-box attacks on image classifiers, and to provide a taxonomy for understanding the threat space of these attacks.
  • methods: The paper uses a taxonomy to organize the threat space of black-box attacks on image classifiers and to identify under-explored threat spaces that require further research. The paper also demonstrates the importance of considering the quality and quantity of auxiliary data available to the attacker, as well as the access of interactive queries, in understanding the threat model of different attacks.
  • results: The paper establishes a new state-of-the-art in the less-studied setting of access to top-k confidence scores, and shows how this setting can be challenging even for well-explored techniques. The paper also overturns prior state-of-the-art claims in the setting of interactive query access, highlighting the need for more research in this area. Finally, the paper reveals connections between black-box attacks and related areas, such as model inversion and extraction attacks, and discusses how advances in these areas can enable stronger black-box attacks.
    Abstract Numerous works study black-box attacks on image classifiers. However, these works make different assumptions on the adversary's knowledge and current literature lacks a cohesive organization centered around the threat model. To systematize knowledge in this area, we propose a taxonomy over the threat space spanning the axes of feedback granularity, the access of interactive queries, and the quality and quantity of the auxiliary data available to the attacker. Our new taxonomy provides three key insights. 1) Despite extensive literature, numerous under-explored threat spaces exist, which cannot be trivially solved by adapting techniques from well-explored settings. We demonstrate this by establishing a new state-of-the-art in the less-studied setting of access to top-k confidence scores by adapting techniques from well-explored settings of accessing the complete confidence vector, but show how it still falls short of the more restrictive setting that only obtains the prediction label, highlighting the need for more research. 2) Identification the threat model of different attacks uncovers stronger baselines that challenge prior state-of-the-art claims. We demonstrate this by enhancing an initially weaker baseline (under interactive query access) via surrogate models, effectively overturning claims in the respective paper. 3) Our taxonomy reveals interactions between attacker knowledge that connect well to related areas, such as model inversion and extraction attacks. We discuss how advances in other areas can enable potentially stronger black-box attacks. Finally, we emphasize the need for a more realistic assessment of attack success by factoring in local attack runtime. This approach reveals the potential for certain attacks to achieve notably higher success rates and the need to evaluate attacks in diverse and harder settings, highlighting the need for better selection criteria.
    摘要 多种研究报告了黑盒攻击图像分类器。然而,这些研究假设了不同的攻击者知识和当前文献缺乏一个协调中心,因此我们提出了一个分类器,以攻击者的攻击模型为中心,并将攻击者的知识分为三个轴:回归精度、交互查询访问权限和攻击者可用的辅助数据质量和量。我们的新分类器提供了三个关键发现:1. DESPITE 详细的文献研究,还有许多未经探索的攻击空间,这些空间无法通过适应已经explored的设置来解决。我们通过在访问top-k信任分数的设置下Establishing a new state-of-the-art,并证明这些设置仍然不够 restrictive, highlighting the need for more research。2. Identifying the threat model of different attacks reveals stronger baselines that challenge prior state-of-the-art claims. We demonstrate this by enhancing an initially weaker baseline (under interactive query access) via surrogate models, effectively overturning claims in the respective paper.3. Our taxonomy reveals interactions between attacker knowledge that connect well to related areas, such as model inversion and extraction attacks. We discuss how advances in other areas can enable potentially stronger black-box attacks. Finally, we emphasize the need for a more realistic assessment of attack success by factoring in local attack runtime. This approach reveals the potential for certain attacks to achieve notably higher success rates and the need to evaluate attacks in diverse and harder settings, highlighting the need for better selection criteria.

Can large language models replace humans in the systematic review process? Evaluating GPT-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages

  • paper_url: http://arxiv.org/abs/2310.17526
  • repo_url: None
  • paper_authors: Qusai Khraisha, Sophie Put, Johanna Kappenberg, Azza Warraitch, Kristin Hadfield
  • for: 这个论文旨在评估大型自然语言模型(LLM)在系统性审查中的性能,以及LLM在不同类型的文献和语言中的表现。
  • methods: 该论文采用 ‘human-out-of-the-loop’(人不在回路)的评估方式,考察 GPT-4 在不同文献类型和语言下进行标题/摘要筛选、全文审查和数据抽取的表现。
  • results: 研究发现,GPT-4 在大多数任务中的准确率与人类表现相当,但结果受到偶然一致和数据不均衡的影响。经过调整后,GPT-4 在数据抽取任务中表现中等;除使用高可靠性提示的情形外,其筛选表现在不同阶段和语言下仅为零至中等水平。当使用高可靠性提示对全文文献进行筛选时,GPT-4 的表现接近完美。
    Abstract Systematic reviews are vital for guiding practice, research, and policy, yet they are often slow and labour-intensive. Large language models (LLMs) could offer a way to speed up and automate systematic reviews, but their performance in such tasks has not been comprehensively evaluated against humans, and no study has tested GPT-4, the biggest LLM so far. This pre-registered study evaluates GPT-4's capability in title/abstract screening, full-text review, and data extraction across various literature types and languages using a 'human-out-of-the-loop' approach. Although GPT-4 had accuracy on par with human performance in most tasks, results were skewed by chance agreement and dataset imbalance. After adjusting for these, there was a moderate level of performance for data extraction, and - barring studies that used highly reliable prompts - screening performance levelled at none to moderate for different stages and languages. When screening full-text literature using highly reliable prompts, GPT-4's performance was 'almost perfect.' Penalising GPT-4 for missing key studies using highly reliable prompts improved its performance even more. Our findings indicate that, currently, substantial caution should be used if LLMs are being used to conduct systematic reviews, but suggest that, for certain systematic review tasks delivered under reliable prompts, LLMs can rival human performance.
    摘要 While GPT-4 had accuracy on par with human performance in most tasks, the results were affected by chance agreement and dataset imbalance. After adjusting for these factors, GPT-4's performance was moderate for data extraction, but its screening performance was low for different stages and languages. However, when screening full-text literature using highly reliable prompts, GPT-4's performance was almost perfect. Penalizing GPT-4 for missing key studies using highly reliable prompts further improved its performance.Our findings suggest that, while LLMs have the potential to automate systematic reviews, they should be used with caution, and their performance should be carefully evaluated. However, for certain systematic review tasks delivered under reliable prompts, LLMs can rival human performance.

The Expressive Power of Low-Rank Adaptation

  • paper_url: http://arxiv.org/abs/2310.17513
  • repo_url: https://github.com/uw-madison-lee-lab/expressive_power_of_lora
  • paper_authors: Yuchen Zeng, Kangwook Lee
  • for: This paper aims to theoretically analyze the expressive power of Low-Rank Adaptation (LoRA) for fine-tuning pre-trained models, specifically large language models and diffusion models.
  • methods: The paper uses theoretical analysis to prove the expressive power of LoRA for fully connected neural networks and Transformer networks. The authors show that LoRA can adapt any model to accurately represent any smaller target model with a certain rank threshold.
  • results: The paper proves that, for fully connected neural networks, LoRA can adapt any model to accurately represent any smaller target model if LoRA-rank is greater than or equal to the product of the width of the model and the depth of the target model, divided by the depth of the model. For Transformer networks, the authors show that any model can be adapted to a target model of the same size with rank-$(\frac{\text{embedding size}}{2})$ LoRA adapters. The paper also quantifies the approximation error when LoRA-rank is lower than the threshold.
    Abstract Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that leverages low-rank adaptation of weight matrices, has emerged as a prevalent technique for fine-tuning pre-trained models such as large language models and diffusion models. Despite its huge success in practice, the theoretical underpinnings of LoRA have largely remained unexplored. This paper takes the first step to bridge this gap by theoretically analyzing the expressive power of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any model $f$ to accurately represent any smaller target model $\overline{f}$ if LoRA-rank $\geq (\text{width of } f) \times \frac{\text{depth of } \overline{f}}{\text{depth of } f}$. We also quantify the approximation error when LoRA-rank is lower than the threshold. For Transformer networks, we show any model can be adapted to a target model of the same size with rank-$(\frac{\text{embedding size}}{2})$ LoRA adapters.
    摘要 低秩适应(LoRA)是一种参数高效的微调方法,通过对权重矩阵进行低秩适应,已成为微调大型语言模型和扩散模型等预训练模型的常用技术。尽管 LoRA 在实践中取得了很大成功,但其理论基础尚未得到充分探索。这篇论文迈出了弥补这一空白的第一步,对 LoRA 的表达能力进行了理论分析。我们证明,对于全连接神经网络,当 LoRA-rank $\geq (\text{宽度of } f) \times \frac{\text{深度of } \overline{f}}{\text{深度of } f}$ 时,LoRA 可以使任何模型 $f$ 准确表示任何更小的目标模型 $\overline{f}$。我们还量化了当 LoRA-rank 低于该阈值时的近似误差。对于 Transformer 网络,我们证明任何模型都可以通过 rank-$(\frac{\text{嵌入大小}}{2})$ 的 LoRA adapter 适应为同样大小的目标模型。
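
For reference, the object of this analysis is the standard LoRA parameterization: a frozen weight matrix augmented with a trainable low-rank update. The sketch below is the usual formulation with illustrative sizes, not code from this paper.

```python
# Minimal LoRA linear layer: y = base(x) + (alpha / r) * x A^T B^T.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)                   # pre-trained weight stays frozen
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))   # zero init: adaptation starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768, rank=4)
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])
```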

CompeteAI: Understanding the Competition Behaviors in Large Language Model-based Agents

  • paper_url: http://arxiv.org/abs/2310.17512
  • repo_url: None
  • paper_authors: Qinlin Zhao, Jindong Wang, Yixuan Zhang, Yiqiao Jin, Kaijie Zhu, Hao Chen, Xing Xie
  • for: 这篇论文探讨了基于大语言模型(LLM)的代理在竞争中的行为。
  • methods: 作者提出了一个通用的竞争框架来研究代理之间的竞争。然后,他们使用GPT-4实现了一个假设的虚拟小镇,并在其中让两种代理进行竞争:餐厅代理和客户代理。餐厅代理之间竞争,以吸引更多的顾客,这种竞争使得餐厅代理受到激发,如培养新的运营策略。
  • results: 实验结果显示了一些有趣的发现,包括社会学学习和玛提效应,这些发现与社会学和经济学的现有理论很好吻合。作者认为,代理之间的竞争值得进一步研究,以更好地理解社会。代码将很快发布。
    Abstract Large language models (LLMs) have been widely used as agents to complete different tasks, such as personal assistance or event planning. While most work has focused on cooperation and collaboration between agents, little work explores competition, another important mechanism that fosters the development of society and economy. In this paper, we seek to examine the competition behaviors in LLM-based agents. We first propose a general framework to study the competition between agents. Then, we implement a practical competitive environment using GPT-4 to simulate a virtual town with two types of agents, including restaurant agents and customer agents. Specifically, restaurant agents compete with each other to attract more customers, where the competition fosters them to transform, such as cultivating new operating strategies. The results of our experiments reveal several interesting findings ranging from social learning to Matthew Effect, which aligns well with existing sociological and economic theories. We believe that competition between agents deserves further investigation to help us understand society better. The code will be released soon.
    摘要 大型语言模型(LLM)已广泛应用于不同任务的代理人,如个人助手或活动规划。然而,大多数工作都集中在合作和协作之间,很少探讨竞争,这也是社会和经济发展的重要机制。在这篇论文中,我们想要研究LLM基于代理人的竞争行为。我们首先提出一个通用的框架来研究代理人之间的竞争。然后,我们使用GPT-4实现一个实际竞争环境,模拟一个虚拟小镇,有两种代理人:餐厅代理人和客户代理人。具体来说,餐厅代理人之间竞争,以吸引更多的客户,这种竞争使得他们变得更加创新,如培养新的运营策略。我们的实验结果显示了一些有趣的发现,包括社会学习到马特效应,这与社会学和经济学的现有理论吻合得非常好。我们认为代理人之间的竞争值得进一步调查,以更好地理解社会。代码将很快发布。

Orchestration of Emulator Assisted Mobile Edge Tuning for AI Foundation Models: A Multi-Agent Deep Reinforcement Learning Approach

  • paper_url: http://arxiv.org/abs/2310.17492
  • repo_url: None
  • paper_authors: Wenhan Yu, Terence Jie Chua, Jun Zhao
  • for: 本研究旨在提高当地任务性能,通过Mobile Edge Computing(MEC)与基础模型集成,以提高用户设备(UE)的本地任务性能。
  • methods: 我们提出了一种创新的Emulator-Adapter架构,将基础模型分成两个协同模块,以保持计算资源并提高下游任务的适应性和微调效率。此外,我们还提出了一种适应度较高的资源分配机制,以适应Emulator-Adapter结构在分散环境中的需求。
  • results: 我们通过实验和验证表明,我们的方法可以减少计算资源消耗,同时提高下游任务的性能和扩展性。这种方法在实际应用中具有强大的实用性和扩展性。
    Abstract The efficient deployment and fine-tuning of foundation models are pivotal in contemporary artificial intelligence. In this study, we present a groundbreaking paradigm integrating Mobile Edge Computing (MEC) with foundation models, specifically designed to enhance local task performance on user equipment (UE). Central to our approach is the innovative Emulator-Adapter architecture, segmenting the foundation model into two cohesive modules. This design not only conserves computational resources but also ensures adaptability and fine-tuning efficiency for downstream tasks. Additionally, we introduce an advanced resource allocation mechanism that is fine-tuned to the needs of the Emulator-Adapter structure in decentralized settings. To address the challenges presented by this system, we employ a hybrid multi-agent Deep Reinforcement Learning (DRL) strategy, adept at handling mixed discrete-continuous action spaces, ensuring dynamic and optimal resource allocations. Our comprehensive simulations and validations underscore the practical viability of our approach, demonstrating its robustness, efficiency, and scalability. Collectively, this work offers a fresh perspective on deploying foundation models and balancing computational efficiency with task proficiency.
    摘要 基础模型的高效部署与微调是当代人工智能的关键。在本研究中,我们提出了一种将移动边缘计算(MEC)与基础模型相结合的创新范式,专门用于增强用户设备(UE)上的本地任务性能。我们方法的核心是将基础模型分解为两个协同模块的Emulator-Adapter架构,既节省计算资源,又确保了下游任务的适应性和微调效率。此外,我们还提出了一种针对Emulator-Adapter结构、面向分布式环境的高级资源分配机制。为应对该系统带来的挑战,我们采用了一种混合多智能体深度强化学习(DRL)策略,能够处理混合的离散-连续动作空间,确保动态且最优的资源分配。我们全面的仿真与验证表明了该方法的实用性、高效性和可扩展性。总之,这项工作为部署基础模型以及在计算效率与任务性能之间取得平衡提供了新的视角。
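The Emulator-Adapter split described above can be illustrated with a generic sketch: a frozen backbone plays the role of the emulator and a small trainable head plays the role of the adapter. Module names, sizes, and where the split happens are assumptions for illustration, not the paper's architecture or its resource-allocation scheme.

```python
import torch
import torch.nn as nn

# Illustrative split of a model into a frozen "emulator" backbone and a small
# trainable "adapter" head; layer sizes are hypothetical.
emulator = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU())
adapter = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 10))

emulator.requires_grad_(False)            # emulator stays fixed (e.g., served from the edge server)
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-3)  # only the adapter is fine-tuned on the UE

x = torch.randn(8, 128)
y = torch.randint(0, 10, (8,))

with torch.no_grad():
    features = emulator(x)                # features could be computed remotely and shipped to the device
loss = nn.functional.cross_entropy(adapter(features), y)
loss.backward()                           # gradients only reach the adapter parameters
optimizer.step()
print(float(loss))
```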

Improving Zero-shot Reader by Reducing Distractions from Irrelevant Documents in Open-Domain Question Answering

  • paper_url: http://arxiv.org/abs/2310.17490
  • repo_url: None
  • paper_authors: Sukmin Cho, Jeong yeon Seo, Soyeong Jeong, Jong C. Park
  • for: 这篇论文旨在探讨Zero-shot Question Answering(ODQA)中的语言模型(LLMs),以及其在开放领域中的应用。
  • methods: 本研究使用了Distracted-aware Answer Selection(DAS)技术,将不相关的文档除去,以提高Zero-shotReader的性能。
  • results: 实验结果显示,DAS技术能够成功地抑制干扰,提高Zero-shotReader的性能,并且与supervised reader不同,Zero-shot reader能够在未见到数据的情况下实现卓越的转移性。
    Abstract Large language models (LLMs) enable zero-shot approaches in open-domain question answering (ODQA), yet with limited advancements as the reader is compared to the retriever. This study aims at the feasibility of a zero-shot reader that addresses the challenges of computational cost and the need for labeled data. We find that LLMs are distracted due to irrelevant documents in the retrieved set and the overconfidence of the generated answers when they are exploited as zero-shot readers. To tackle these problems, we mitigate the impact of such documents via Distraction-aware Answer Selection (DAS) with a negation-based instruction and score adjustment for proper answer selection. Experimental results show that our approach successfully handles distraction across diverse scenarios, enhancing the performance of zero-shot readers. Furthermore, unlike supervised readers struggling with unseen data, zero-shot readers demonstrate outstanding transferability without any training.
    摘要 大型语言模型(LLM)允许零条件方法在开放领域问题回答(ODQA)中,但有限的进步,因为读者与搜寻器之间的比较有限。本研究探讨了零条件读者的可能性,并实现了适当的选择方法以减少干扰。我们发现,LLMs 受到无关文档的干扰,并且在应用为零条件读者时,产生了过度自信的答案。为了解决这些问题,我们使用了对抗干扰的选择技术(DAS),并调整得分以确保适当的答案选择。实验结果显示,我们的方法可以成功地减少干扰,并提高零条件读者的表现。此外,不同于需要训练的监督读者,零条件读者具有卓越的转移性,无需任何训练。
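The abstract describes Distraction-aware Answer Selection (DAS) only at a high level, so the sketch below is one hedged reading of the idea: generate a candidate answer per retrieved document, probe each document with a negation-style instruction for irrelevance, and adjust the answer scores accordingly. The function names, prompt wording, and the exact score adjustment are assumptions, not the paper's procedure.

```python
from typing import Callable, List, Tuple

def select_answer(
    question: str,
    documents: List[str],
    answer_logprob: Callable[[str, str], Tuple[str, float]],
    irrelevance_score: Callable[[str, str], float],
) -> str:
    """Pick an answer across retrieved documents, penalizing likely-irrelevant ones.

    answer_logprob(question, doc)    -> (candidate answer, its log-probability)
    irrelevance_score(question, doc) -> probability that the doc cannot answer the question,
                                        e.g. from a negation-style instruction such as
                                        "If the passage does not answer the question, say 'none'."
    """
    best, best_score = None, float("-inf")
    for doc in documents:
        answer, logp = answer_logprob(question, doc)
        score = logp - irrelevance_score(question, doc)  # adjusted score: a hedged reading of DAS
        if score > best_score:
            best, best_score = answer, score
    return best

# Toy stand-ins so the sketch runs; a real system would call an LLM here.
docs = ["Paris is the capital of France.", "Bananas are rich in potassium."]
ans = select_answer(
    "What is the capital of France?",
    docs,
    answer_logprob=lambda q, d: (("Paris", -0.1) if "Paris" in d else ("unknown", -3.0)),
    irrelevance_score=lambda q, d: 0.05 if "capital" in d else 0.9,
)
print(ans)  # Paris
```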

Bias in Evaluation Processes: An Optimization-Based Model

  • paper_url: http://arxiv.org/abs/2310.17489
  • repo_url: https://github.com/anaymehrotra/bias-in-evaluation-processes
  • paper_authors: L. Elisa Celis, Amit Kumar, Anay Mehrotra, Nisheeth K. Vishnoi
  • for: 这个论文主要研究了评估过程中受到个人社会特征的偏见的现象,包括录用和招聘等设置。
  • methods: 该论文将评估过程建模为一个带有信息约束的损失最小化问题的解。该模型具有两个参数:信息约束中的资源-信息权衡参数,以及损失函数中的风险规避参数。
  • results: 该论文通过分析模型生成的分布,研究了这两个参数对观察分布的影响。此外,该论文还在真实数据集上验证了模型,并使用其来研究在下游选择任务中的干预效果。这些结果为偏见在评估过程中的产生提供了更深刻的理解,并提供了用于缓解偏见的工具。
    Abstract Biases with respect to socially-salient attributes of individuals have been well documented in evaluation processes used in settings such as admissions and hiring. We view such an evaluation process as a transformation of a distribution of the true utility of an individual for a task to an observed distribution and model it as a solution to a loss minimization problem subject to an information constraint. Our model has two parameters that have been identified as factors leading to biases: the resource-information trade-off parameter in the information constraint and the risk-averseness parameter in the loss function. We characterize the distributions that arise from our model and study the effect of the parameters on the observed distribution. The outputs of our model enrich the class of distributions that can be used to capture variation across groups in the observed evaluations. We empirically validate our model by fitting real-world datasets and use it to study the effect of interventions in a downstream selection task. These results contribute to an understanding of the emergence of bias in evaluation processes and provide tools to guide the deployment of interventions to mitigate biases.
    摘要 社会背景下个体特征偏见在评估过程中得到了广泛的报道。我们视这种评估过程为一种将真实个体能力的分布转化为观察分布,并模型为一种损失最小化问题下的信息约束问题。我们的模型具有两个参数,这两个参数被证明导致偏见:资源信息费用参数在信息约束中,以及风险偏好参数在损失函数中。我们描述出的分布可以用来捕捉不同群体在观察评估中的变化。我们采用实际数据进行验证,并用其来研究在下游选择任务中的干预效果。这些结果对偏见的出现和控制偏见的措施提供了深入的理解和实用的工具。

Towards Learning Monocular 3D Object Localization From 2D Labels using the Physical Laws of Motion

  • paper_url: http://arxiv.org/abs/2310.17462
  • repo_url: None
  • paper_authors: Daniel Kienzle, Julian Lorenz, Katja Ludwig, Rainer Lienhart
  • for: 精确的3D物体定位在单个图像中
  • methods: 使用2D标签和物体运动的物理知识进行训练,无需昂贵的3D标签
  • results: 在实验中,实现了平均距离错误只有6 cm,表明方法具有实现3D物体定位估计的潜在能力,而不需要收集3D数据进行训练。
    Abstract We present a novel method for precise 3D object localization in single images from a single calibrated camera using only 2D labels. No expensive 3D labels are needed. Thus, instead of using 3D labels, our model is trained with easy-to-annotate 2D labels along with the physical knowledge of the object's motion. Given this information, the model can infer the latent third dimension, even though it has never seen this information during training. Our method is evaluated on both synthetic and real-world datasets, and we are able to achieve a mean distance error of just 6 cm in our experiments on real data. The results indicate the method's potential as a step towards learning 3D object location estimation, where collecting 3D data for training is not feasible.
    摘要 我们提出了一种新的方法,只使用单个已标定的相机和2D标签,就能在单张图像中精确定位3D对象,无需昂贵的3D标签。因此,我们的模型在训练时不使用3D标签,而是使用易于标注的2D标签以及物体运动的物理知识。给定这些信息,模型可以推断出潜在的第三维度,即使它在训练时从未见过这一信息。我们的方法在合成数据和真实数据上都进行了评估,在真实数据的实验中达到了仅6cm的平均距离误差。结果表明,在无法收集3D训练数据的情况下,该方法有望成为实现3D物体定位估计的一步。

Generating by Understanding: Neural Visual Generation with Logical Symbol Groundings

  • paper_url: http://arxiv.org/abs/2310.17451
  • repo_url: None
  • paper_authors: Yifei Peng, Yu Jin, Zhexu Luo, Yao-Xiang Ding, Wang-Zhou Dai, Zhong Ren, Kun Zhou
  • for: integrate neural visual generative models with strong symbolic knowledge reasoning systems
  • methods: 使用溯因学习(abductive learning)框架、量化溯因(quantized abduction)方法以及对比元溯因(contrastive meta-abduction)方法
  • results: 相比基线方法需要显著更少的实例级标签信息,并能够从数据中学习潜在的逻辑生成规则
    Abstract Despite the great success of neural visual generative models in recent years, integrating them with strong symbolic knowledge reasoning systems remains a challenging task. The main challenges are two-fold: one is symbol assignment, i.e. bonding latent factors of neural visual generators with meaningful symbols from knowledge reasoning systems. Another is rule learning, i.e. learning new rules, which govern the generative process of the data, to augment the knowledge reasoning systems. To deal with these symbol grounding problems, we propose a neural-symbolic learning approach, Abductive Visual Generation (AbdGen), for integrating logic programming systems with neural visual generative models based on the abductive learning framework. To achieve reliable and efficient symbol assignment, the quantized abduction method is introduced for generating abduction proposals by the nearest-neighbor lookups within semantic codebooks. To achieve precise rule learning, the contrastive meta-abduction method is proposed to eliminate wrong rules with positive cases and avoid less-informative rules with negative cases simultaneously. Experimental results on various benchmark datasets show that compared to the baselines, AbdGen requires significantly fewer instance-level labeling information for symbol assignment. Furthermore, our approach can effectively learn underlying logical generative rules from data, which is out of the capability of existing approaches.

LSA64: An Argentinian Sign Language Dataset

  • paper_url: http://arxiv.org/abs/2310.17429
  • repo_url: None
  • paper_authors: Franco Ronchetti, Facundo Manuel Quiroga, César Estrebou, Laura Lanzarini, Alejandro Rosete
  • for: 本研究旨在提供一个以阿根廷手语为基础的手语识别 dataset,以便进行机器学习或其他研究。
  • methods: 本研究使用了10名参与者的3200个手语视频,并将其分为64个不同的手语类型。参与者穿着了颜色的手套,以便追踪和分类手部运动。
  • results: 本研究提供了一个名为LSA64的手语识别数据集,包含3200个手语视频,并计算了各手语的运动、位置和手形统计数据。该数据集可供未来的机器学习或其他研究使用。
    Abstract Automatic sign language recognition is a research area that encompasses human-computer interaction, computer vision and machine learning. Robust automatic recognition of sign language could assist in the translation process and the integration of hearing-impaired people, as well as the teaching of sign language to the hearing population. Sign languages differ significantly in different countries and even regions, and their syntax and semantics are different as well from those of written languages. While the techniques for automatic sign language recognition are mostly the same for different languages, training a recognition system for a new language requires having an entire dataset for that language. This paper presents a dataset of 64 signs from the Argentinian Sign Language (LSA). The dataset, called LSA64, contains 3200 videos of 64 different LSA signs recorded by 10 subjects, and is a first step towards building a comprehensive research-level dataset of Argentinian signs, specifically tailored to sign language recognition or other machine learning tasks. The subjects that performed the signs wore colored gloves to ease the hand tracking and segmentation steps, allowing experiments on the dataset to focus specifically on the recognition of signs. We also present a pre-processed version of the dataset, from which we computed statistics of movement, position and handshape of the signs.
    摘要 自动手语识别是一个人机交互、计算机视觉和机器学习研究领域。可靠自动识别手语可以帮助翻译过程和听力障碍人群的集成,以及教育听力人群学习手语。不同国家和地区的手语之间存在很大差异,其语法和 semantics 也与written languages不同。虽然自动手语识别技术大多相同,但为新语言训练recognition系统需要拥有整个语言的数据集。本文介绍了一个名为LSA64的数据集,包括64种阿根廷手语(LSA)的视频记录,共3200个视频,由10名参与者执行。这是建立 comprehensive 研究级数据集的第一步,特地适用于手语识别或其他机器学习任务。参与者在执行手语时穿着颜色的手套,以便轻松地跟踪和分割手部,从而使实验中能够专注于手语识别。我们还提供了对数据集进行了预处理,从而计算了手语的运动、位置和形状的统计数据。

Handshape recognition for Argentinian Sign Language using ProbSom

  • paper_url: http://arxiv.org/abs/2310.17427
  • repo_url: None
  • paper_authors: Franco Ronchetti, Facundo Manuel Quiroga, César Estrebou, Laura Lanzarini
  • for: 这篇论文主要针对的是自动手语识别技术,以帮助听力障碍人士参与社会通信。
  • methods: 该论文提出了两大贡献:首先,建立了一个阿根廷手语(LSA)手形数据库,这是一个此前少有研究的领域。其次,提出了一种图像处理、描述子提取以及基于自组织映射的监督改进方法ProbSom进行手形分类的技术,并将其与支持向量机(SVM)、随机森林和神经网络等最先进方法进行了比较。
  • results: 该论文的实验结果显示,使用所提出的描述子和基于ProbSom的神经分类器,手形识别准确率超过90%。
    Abstract Automatic sign language recognition is an important topic within the areas of human-computer interaction and machine learning. On the one hand, it poses a complex challenge that requires the intervention of various knowledge areas, such as video processing, image processing, intelligent systems and linguistics. On the other hand, robust recognition of sign language could assist in the translation process and the integration of hearing-impaired people. This paper offers two main contributions: first, the creation of a database of handshapes for the Argentinian Sign Language (LSA), which is a topic that has barely been discussed so far. Secondly, a technique for image processing, descriptor extraction and subsequent handshape classification using a supervised adaptation of self-organizing maps that is called ProbSom. This technique is compared to others in the state of the art, such as Support Vector Machines (SVM), Random Forests, and Neural Networks. The database that was built contains 800 images with 16 LSA handshapes, and is a first step towards building a comprehensive database of Argentinian signs. The ProbSom-based neural classifier, using the proposed descriptor, achieved an accuracy rate above 90%.
    摘要 自动手语识别是人工智能和人机交互领域中的一个重要话题。一方面,它需要多种知识领域的干预,如视频处理、图像处理、智能系统和语言学。另一方面,可靠地识别手语可以帮助翻译过程和听力障碍人士的 интеграción。本文提供了两个主要贡献:首先,建立了阿根廷手语(LSA)的手势数据库,这是一个尚未得到广泛讨论的话题。其次,提出了一种基于自组织地图的图像处理技术,即ProbSom,用于手势特征提取和分类。这种技术与现有的状态对比较技术,如支持向量机(SVM)、Random Forests和神经网络,进行比较。建立的数据库包含16种LSA手势的800个图像,是建立全面的阿根廷手语数据库的第一步。ProbSom基于的神经分类器,使用提出的特征,达到了90%以上的准确率。
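ProbSom itself is only described in the abstract as a supervised adaptation of self-organizing maps, so the sketch below trains a plain SOM on descriptor vectors and labels each map cell by majority vote; it illustrates the family of methods, not the paper's ProbSom or its descriptor.

```python
import numpy as np

def train_som(X, grid=(6, 6), epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small self-organizing map on descriptor vectors X (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.normal(size=(h * w, X.shape[1]))
    coords = np.array([(i, j) for i in range(h) for j in range(w)], dtype=float)
    n_steps, step = epochs * len(X), 0
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            lr = lr0 * (1 - step / n_steps)
            sigma = sigma0 * (1 - step / n_steps) + 1e-3
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))       # best matching unit
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            influence = np.exp(-d2 / (2 * sigma ** 2))[:, None]      # neighborhood function
            weights += lr * influence * (x - weights)
            step += 1
    return weights

def label_cells(weights, X, y):
    """Assign each map cell the majority label of the training samples mapped to it."""
    bmus = np.argmin(((X[:, None, :] - weights[None]) ** 2).sum(-1), axis=1)
    labels = np.full(len(weights), -1)
    for cell in np.unique(bmus):
        vals, counts = np.unique(y[bmus == cell], return_counts=True)
        labels[cell] = vals[np.argmax(counts)]
    return labels

# Toy descriptors standing in for handshape features (two synthetic classes).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 8)), rng.normal(3, 1, (50, 8))])
y = np.array([0] * 50 + [1] * 50)
W = train_som(X)
cell_labels = label_cells(W, X, y)
pred = cell_labels[np.argmin(((X[:, None, :] - W[None]) ** 2).sum(-1), axis=1)]
print("training accuracy:", (pred == y).mean())
```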

Distribution of Action Movements (DAM): A Descriptor for Human Action Recognition

  • paper_url: http://arxiv.org/abs/2310.17421
  • repo_url: None
  • paper_authors: Facundo Manuel Quiroga, Franco Ronchetti, Laura Lanzarini, Cesar Eestrebou
  • for: 人体动作识别从骨骼数据是一个重要和活跃的研究领域,现状的最佳性还没有在许多知名数据集上达到近乎完美的准确率。
  • methods: 我们引入了动作运动分布描述子(Distribution of Action Movements Descriptor),一种基于帧间关节运动方向在数据集所有可能运动方向上的分布的新动作描述子。该描述子通过在一组由聚类得到的代表性方向上计算归一化直方图获得,并通过窗口化方案部分保留时间结构。
  • results: 该描述器,结合标准分类器,在许多知名数据集上超过了许多现状技术的性能。
    Abstract Human action recognition from skeletal data is an important and active area of research in which the state of the art has not yet achieved near-perfect accuracy on many well-known datasets. In this paper, we introduce the Distribution of Action Movements Descriptor, a novel action descriptor based on the distribution of the directions of the motions of the joints between frames, over the set of all possible motions in the dataset. The descriptor is computed as a normalized histogram over a set of representative directions of the joints, which are in turn obtained via clustering. While the descriptor is global in the sense that it represents the overall distribution of movement directions of an action, it is able to partially retain its temporal structure by applying a windowing scheme. The descriptor, together with a standard classifier, outperforms several state-of-the-art techniques on many well-known datasets.
    摘要 基于骨骼数据的人体动作识别是一个重要且活跃的研究领域,目前最先进的方法尚未在许多公知数据集上达到近乎完美的准确率。在这篇论文中,我们介绍了动作运动分布描述子(Distribution of Action Movements Descriptor),它基于帧间关节运动方向在数据集中所有可能运动方向上的分布。该描述子在一组由聚类得到的代表性关节运动方向上计算归一化直方图。虽然该描述子是全局的,即它表示整个动作的运动方向总体分布,但通过应用窗口化方案,它仍能部分保留时间结构。该描述子结合标准分类器,在许多公知数据集上超越了多种最先进技术。
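The descriptor is specified concretely enough to sketch: take per-frame joint displacements, cluster their directions into a set of representative directions, and build normalized histograms of cluster assignments over temporal windows. The clustering algorithm (scikit-learn KMeans), the number of directions, and the windowing below are illustrative choices; in the paper the representative directions are obtained over the whole dataset rather than a single clip.

```python
import numpy as np
from sklearn.cluster import KMeans

def dam_descriptor(skeleton, n_directions=16, n_windows=3, seed=0):
    """Distribution-of-movements style descriptor for one action clip.

    skeleton: array (T, J, 3) of joint positions over T frames.
    Returns a concatenation of per-window normalized histograms over clustered
    motion directions (a sketch of the idea, not the paper's exact pipeline).
    """
    motion = np.diff(skeleton, axis=0)                       # per-frame joint displacements
    directions = motion.reshape(-1, 3)
    norms = np.linalg.norm(directions, axis=1, keepdims=True)
    directions = directions / np.clip(norms, 1e-8, None)     # unit direction vectors

    # Representative directions would normally be clustered over the whole dataset;
    # here we fit on a single clip just to keep the sketch self-contained.
    km = KMeans(n_clusters=n_directions, n_init=10, random_state=seed).fit(directions)

    T, J = motion.shape[0], motion.shape[1]
    assignments = km.predict(directions).reshape(T, J)
    hists = []
    for w in range(n_windows):                               # coarse temporal windowing
        sl = assignments[w * T // n_windows:(w + 1) * T // n_windows].ravel()
        hist = np.bincount(sl, minlength=n_directions).astype(float)
        hists.append(hist / max(hist.sum(), 1.0))
    return np.concatenate(hists)

clip = np.cumsum(np.random.default_rng(0).normal(size=(40, 20, 3)), axis=0)  # toy skeleton
print(dam_descriptor(clip).shape)  # (48,) = 3 windows x 16 directions
```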

Goals are Enough: Inducing AdHoc cooperation among unseen Multi-Agent systems in IMFs

  • paper_url: http://arxiv.org/abs/2310.17416
  • repo_url: None
  • paper_authors: Kaushik Dey, Satheesh K. Perepu, Abir Das
  • for: 这篇论文旨在提出一种基于人工智能的监督代理机制,以便在下一代移动网络中对用户期望进行有效管理。
  • methods: 该论文使用了多智能代理人学习(MARL)和人工智能监督代理人(AI-based supervisor agent)来实现协调多个预训练好的自利推荐代理人的协同工作。
  • results: 该论文的实验结果表明,相比 traditional rule-based方法,使用该提议的方法可以更快地和更好地满足用户的期望,并且能够适应环境变化。
    Abstract Intent-based management will play a critical role in achieving customers' expectations in the next-generation mobile networks. Traditional methods cannot perform efficient resource management since they tend to handle each expectation independently. Existing approaches, e.g., based on multi-agent reinforcement learning (MARL) allocate resources in an efficient fashion when there are conflicting expectations on the network slice. However, in reality, systems are often far more complex to be addressed by a standalone MARL formulation. Often there exists a hierarchical structure of intent fulfilment where multiple pre-trained, self-interested agents may need to be further orchestrated by a supervisor or controller agent. Such agents may arrive in the system adhoc, which then needs to be orchestrated along with other available agents. Retraining the whole system every time is often infeasible given the associated time and cost. Given the challenges, such adhoc coordination of pre-trained systems could be achieved through an intelligent supervisor agent which incentivizes pre-trained RL/MARL agents through sets of dynamic contracts (goals or bonuses) and encourages them to act as a cohesive unit towards fulfilling a global expectation. Some approaches use a rule-based supervisor agent and deploy the hierarchical constituent agents sequentially, based on human-coded rules. In the current work, we propose a framework whereby pre-trained agents can be orchestrated in parallel leveraging an AI-based supervisor agent. For this, we propose to use Adhoc-Teaming approaches which assign optimal goals to the MARL agents and incentivize them to exhibit certain desired behaviours. Results on the network emulator show that the proposed approach results in faster and improved fulfilment of expectations when compared to rule-based approaches and even generalizes to changes in environments.
    摘要 “基于意图的管理将在下一代移动网络中扮演关键角色,以实现用户的期望。传统方法无法进行高效的资源管理,因为它们通常独立地处理每个期望。现有的方法,如基于多代理强化学习(MARL)的方法,可以在网络切片上存在相互冲突的期望时有效地分配资源。然而,在实际情况下,系统往往更为复杂,难以用单一的MARL形式来解决。意图实现通常存在层次结构,其中多个预训练的自利代理(agent)需要由监督者或控制器代理进一步协调。这些代理可能会临时加入系统,需要与其他可用代理一起被协调,而每次都对整个系统重新训练在时间和成本上往往不可行。因此,我们提出了一种基于人工智能(AI)的监督代理,通过设置动态目标(goal)和奖励(bonus)来激励预训练的RL/MARL代理,使它们作为一个协调的整体去满足全局期望。网络仿真器上的结果表明,与基于规则的方法相比,所提方法能更快、更好地满足期望,并能泛化到环境变化。”

PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications

  • paper_url: http://arxiv.org/abs/2310.17415
  • repo_url: https://github.com/ginnm/proteinpretraining
  • paper_authors: Yang Tan, Mingchen Li, Pan Tan, Ziyi Zhou, Huiqun Yu, Guisheng Fan, Liang Hong
  • for: The paper explores the use of large protein language models with sub-word tokenization to capture the underlying evolutionary information in primary structures, in support of protein engineering.
  • methods: The paper pre-trains PETA language models with 14 different vocabulary sizes under three tokenization methods and assesses their transfer learning capabilities on 33 diverse downstream datasets.
  • results: Vocabulary sizes between 50 and 200 optimize the model, while sizes exceeding 800 detrimentally affect the model's representational performance.
    Abstract Large protein language models are adept at capturing the underlying evolutionary information in primary structures, offering significant practical value for protein engineering. Compared to natural language models, protein amino acid sequences have a smaller data volume and a limited combinatorial space. Choosing an appropriate vocabulary size to optimize the pre-trained model is a pivotal issue. Moreover, despite the wealth of benchmarks and studies in the natural language community, there remains a lack of a comprehensive benchmark for systematically evaluating protein language model quality. Given these challenges, PETA trained language models with 14 different vocabulary sizes under three tokenization methods. It conducted thousands of tests on 33 diverse downstream datasets to assess the models' transfer learning capabilities, incorporating two classification heads and three random seeds to mitigate potential biases. Extensive experiments indicate that vocabulary sizes between 50 and 200 optimize the model, whereas sizes exceeding 800 detrimentally affect the model's representational performance. Our code, model weights and datasets are available at https://github.com/ginnm/ProteinPretraining.
    摘要 大型蛋白语言模型能够很好地捕捉基因编码中的演化信息,具有重要的实用价值 для蛋白工程。与自然语言模型相比,蛋白肽序列具有较小的数据量和有限的组合空间。选择合适的词汇大小以优化预训练模型是一个关键的问题。此外,虽然自然语言社区有着丰富的benchmark和研究,但是对蛋白语言模型质量的系统性评估还缺乏一个完整的benchmark。为了解决这些挑战,PETA使用了14个不同的词汇大小在三种token化方法上训练语言模型。它在33个多样化的下游数据集上进行了千次测试,以评估模型的传输学能力,并包括两个分类头和三个随机种子以mitigate潜在偏见。广泛的实验表明,词汇大小在50-200之间优化模型,而大于800的词汇大小会消化模型的表征性表现。我们的代码、模型权重和数据集可以在https://github.com/ginnm/ProteinPretraining上获取。
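The key knob studied by PETA is the sub-word vocabulary size. The toy byte-pair-encoding loop below shows how such a vocabulary is grown over amino-acid strings and how the vocab_size target controls the number of merges; the sequences and sizes are made up, and this is a textbook BPE sketch rather than the paper's training pipeline.

```python
from collections import Counter

def train_bpe(sequences, vocab_size):
    """Toy BPE over amino-acid strings: start from single residues, merge most frequent pairs."""
    corpus = [list(seq) for seq in sequences]        # each sequence starts as single characters
    vocab = set(tok for seq in corpus for tok in seq)
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for seq in corpus:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]          # most frequent adjacent pair
        new_tok = a + b
        merges.append((a, b))
        vocab.add(new_tok)
        for i, seq in enumerate(corpus):             # apply the merge to every sequence
            j, merged = 0, []
            while j < len(seq):
                if j + 1 < len(seq) and seq[j] == a and seq[j + 1] == b:
                    merged.append(new_tok)
                    j += 2
                else:
                    merged.append(seq[j])
                    j += 1
            corpus[i] = merged
    return vocab, merges

proteins = ["MKTAYIAKQR", "MKTAYIAKQRQISFVK", "AYIAKQRQIS"]   # toy sequences
for size in (20, 25, 30):   # stand-ins for the 50-800 range studied in the paper
    vocab, merges = train_bpe(proteins, size)
    print(size, "->", len(vocab), "tokens,", len(merges), "merges")
```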

Synthesizing Efficiently Monitorable Formulas in Metric Temporal Logic

  • paper_url: http://arxiv.org/abs/2310.17410
  • repo_url: https://github.com/ritamraha/Teal
  • paper_authors: Ritam Raha, Rajarshi Roy, Nathanael Fijalkow, Daniel Neider, Guillermo A. Perez
  • for: 这篇论文旨在从系统执行记录中自动合成形式化规约,以便对系统进行高效的运行时监控。
  • methods: 该论文将合成任务规约为一系列线性实数算术(LRA)中的可满足性问题,并根据其可满足赋值生成Metric Temporal Logic(MTL)公式。
  • results: 论文实现了名为TEAL的工具,能够高效地从系统执行中合成可监控的MTL公式,并可以限制公式所需的前瞻(lookahead)量,从而提高监控效率。
    Abstract In runtime verification, manually formalizing a specification for monitoring system executions is a tedious and error-prone process. To address this issue, we consider the problem of automatically synthesizing formal specifications from system executions. To demonstrate our approach, we consider the popular specification language Metric Temporal Logic (MTL), which is particularly tailored towards specifying temporal properties for cyber-physical systems (CPS). Most of the classical approaches for synthesizing temporal logic formulas aim at minimizing the size of the formula. However, for efficiency in monitoring, along with the size, the amount of "lookahead" required for the specification becomes relevant, especially for safety-critical applications. We formalize this notion and devise a learning algorithm that synthesizes concise formulas having bounded lookahead. To do so, our algorithm reduces the synthesis task to a series of satisfiability problems in Linear Real Arithmetic (LRA) and generates MTL formulas from their satisfying assignments. The reduction uses a novel encoding of a popular MTL monitoring procedure using LRA. Finally, we implement our algorithm in a tool called TEAL and demonstrate its ability to synthesize efficiently monitorable MTL formulas in a CPS application.

Invariance Measures for Neural Networks

  • paper_url: http://arxiv.org/abs/2310.17404
  • repo_url: https://github.com/facundoq/tmeasures
  • paper_authors: Facundo Manuel Quiroga, Jordina Torrents-Barrena, Laura Cristina Lanzarini, Domenec Puig-Valls
  • for: 本研究旨在量化神经网络模型中不变性的表示。
  • methods: 本研究提出了一组基于模型内部表示来量化神经网络不变性的度量方法,这些度量高效、可解释,并可应用于任何神经网络模型。
  • results: 研究发现,这些度量可以量化神经网络内部表示的不变性,且结果稳定、可解释;此外,CNN模型的内部不变性对随机权重初始化非常稳定,但对数据集或变换的改变并不稳定。
    Abstract Invariances in neural networks are useful and necessary for many tasks. However, the representation of the invariance of most neural network models has not been characterized. We propose measures to quantify the invariance of neural networks in terms of their internal representation. The measures are efficient and interpretable, and can be applied to any neural network model. They are also more sensitive to invariance than previously defined measures. We validate the measures and their properties in the domain of affine transformations and the CIFAR10 and MNIST datasets, including their stability and interpretability. Using the measures, we perform a first analysis of CNN models and show that their internal invariance is remarkably stable to random weight initializations, but not to changes in dataset or transformation. We believe the measures will enable new avenues of research in invariance representation.
    摘要 神经网络的不变性在许多任务中既有用又必需。然而,大多数神经网络模型的不变性表示尚未得到刻画。我们提出了基于模型内部表示来量化神经网络不变性的度量方法。这些度量高效、可解释,可应用于任何神经网络模型,并且比先前定义的度量对不变性更敏感。我们在仿射变换以及 CIFAR10 和 MNIST 数据集上验证了这些度量及其性质,包括其稳定性和可解释性。使用这些度量,我们对 CNN 模型进行了首次分析,发现其内部不变性对随机权重初始化非常稳定,但对数据集或变换的改变并不稳定。我们认为这些度量将为不变性表示的研究开辟新的方向。
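The abstract does not give the measures' formulas, so the sketch below computes only a generic activation-based invariance score: for each unit of a layer, the variance of its response across transformed copies of a sample is compared with its variance across samples. This is meant to illustrate the kind of quantity such measures capture, not the paper's definition (the authors' implementation is in the linked tmeasures repository).

```python
import numpy as np

def transformational_variance_ratio(activations):
    """Generic invariance score per unit.

    activations: array (n_samples, n_transforms, n_units) of a layer's responses to
    every sample under every transformation (e.g., rotations).
    Returns, per unit, the variance across transformations divided by the variance
    across samples; values near 0 suggest the unit is invariant to the transformation.
    """
    var_over_transforms = activations.var(axis=1).mean(axis=0)   # average within-sample spread
    var_over_samples = activations.mean(axis=1).var(axis=0)      # spread of per-sample means
    return var_over_transforms / np.clip(var_over_samples, 1e-12, None)

# Toy example: unit 0 ignores the transformation, unit 1 is driven by it.
rng = np.random.default_rng(0)
n_samples, n_transforms = 100, 8
sample_signal = rng.normal(size=(n_samples, 1, 1))
transform_signal = rng.normal(size=(1, n_transforms, 1))
acts = np.concatenate(
    [np.repeat(sample_signal, n_transforms, axis=1),                      # invariant unit
     np.repeat(sample_signal, n_transforms, axis=1) + transform_signal],  # variant unit
    axis=2,
)
print(transformational_variance_ratio(acts))  # first value ~0, second value clearly > 0
```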

ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation

  • paper_url: http://arxiv.org/abs/2310.17389
  • repo_url: None
  • paper_authors: Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, Jingbo Shang
  • for: 本研究旨在提供一个基于实际用户与AI交互的毒性检测 benchmark,以便为用户与AI交互环境中的有害言语检测模型提供更好的训练数据。
  • methods: 本研究使用在现有毒性数据集上训练的模型进行系统性评估,并通过与该 benchmark 的比较,揭示这些模型在实际用户与AI交互中的缺陷。
  • results: 研究发现,现有的毒性检测模型在实际用户与AI交互中表现不佳,尤其是在辨识具有复杂涵义的有害言语方面。这显示了基于社交媒体内容的毒性 benchmark 与实际用户与AI对话中的毒性检测之间存在显著的领域差异。
    Abstract Despite remarkable advances that large language models have achieved in chatbots, maintaining a non-toxic user-AI interactive environment has become increasingly critical nowadays. However, previous efforts in toxicity detection have been mostly based on benchmarks derived from social media content, leaving the unique challenges inherent to real-world user-AI interactions insufficiently explored. In this work, we introduce ToxicChat, a novel benchmark based on real user queries from an open-source chatbot. This benchmark contains the rich, nuanced phenomena that can be tricky for current toxicity detection models to identify, revealing a significant domain difference compared to social media content. Our systematic evaluation of models trained on existing toxicity datasets has shown their shortcomings when applied to this unique domain of ToxicChat. Our work illuminates the potentially overlooked challenges of toxicity detection in real-world user-AI conversations. In the future, ToxicChat can be a valuable resource to drive further advancements toward building a safe and healthy environment for user-AI interactions.
    摘要 尽管大语言模型在聊天机器人中已经做出了很多卓越的进步,但现在保持非恶意用户-AI交互环境变得越来越重要。然而,先前的恶意检测努力都基于社交媒体内容的标准套件,忽略了实际世界用户-AI交互中的独特挑战。在这项工作中,我们介绍了一个新的恶意测试集,即 ToxicChat,该集基于实际的用户问题,从一个开源的 chatbot 中提取出来。这个测试集包含实际世界用户-AI交互中的复杂和细腻的现象,这些现象可能会让当前的恶意检测模型很难以识别,与社交媒体内容之间存在显著的域差。我们对现有的恶意数据集上训练的模型进行了系统性的评估,发现这些模型在 ToxicChat 中的表现不佳,表明了现有的恶意检测模型在实际世界用户-AI交互中存在一定的潜在问题。我们的工作暴露了现有的恶意检测模型在实际世界用户-AI交互中可能存在的被忽视的挑战。未来,ToxicChat 可以成为驱动进一步帮助建立安全和健康的用户-AI交互环境的资源。

YOLO-BEV: Generating Bird’s-Eye View in the Same Way as 2D Object Detection

  • paper_url: http://arxiv.org/abs/2310.17379
  • repo_url: None
  • paper_authors: Chang Liu, Liguo Zhou, Yanliang Huang, Alois Knoll
  • for: 提高安全和导航的自动驾驶系统视觉理解能力,实现全面和快速的视觉解释。
  • methods: 使用特殊的环视摄像头布局,以45度间隔放置八个摄像头,并将图像拼接成3x3网格(中心格留空),提供了便于高效处理的空间表示;检测部分沿用YOLO机制,利用其响应快速、模型结构紧凑的优点,并配以自定义的检测头生成鸟瞰图。
  • results: 初步结果验证了YOLO-BEV在实时车辆感知任务中的可行性。其精简的架构和因参数较少而可能实现的快速部署,使其有望成为自动驾驶系统感知方案的一个有前景的方向。
    Abstract Vehicle perception systems strive to achieve comprehensive and rapid visual interpretation of their surroundings for improved safety and navigation. We introduce YOLO-BEV, an efficient framework that harnesses a unique surrounding cameras setup to generate a 2D bird's-eye view of the vehicular environment. By strategically positioning eight cameras, each at a 45-degree interval, our system captures and integrates imagery into a coherent 3x3 grid format, leaving the center blank, providing an enriched spatial representation that facilitates efficient processing. In our approach, we employ YOLO's detection mechanism, favoring its inherent advantages of swift response and compact model structure. Instead of leveraging the conventional YOLO detection head, we augment it with a custom-designed detection head, translating the panoramically captured data into a unified bird's-eye view map of ego car. Preliminary results validate the feasibility of YOLO-BEV in real-time vehicular perception tasks. With its streamlined architecture and potential for rapid deployment due to minimized parameters, YOLO-BEV poses as a promising tool that may reshape future perspectives in autonomous driving systems.
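The camera layout is concrete enough to sketch: eight surround images captured at 45-degree intervals are tiled into a 3x3 grid whose centre cell is left blank, and the composite becomes the input to the YOLO-style BEV head. The clockwise ordering of cameras and the naive cropping below are assumptions for illustration.

```python
import numpy as np

def compose_surround_grid(cam_images, cell_hw=(128, 128)):
    """Tile eight surround-view images into a 3x3 grid with an empty centre cell.

    cam_images: list of 8 arrays (H, W, 3), assumed ordered clockwise starting at the
    front of the vehicle (this ordering is an illustrative assumption).
    """
    h, w = cell_hw
    cells = [np.zeros((h, w, 3), dtype=np.uint8) for _ in range(9)]
    # Map the 8 cameras to the outer cells, leaving index 4 (the centre) blank.
    outer = [0, 1, 2, 5, 8, 7, 6, 3]           # clockwise walk around the 3x3 grid
    for cam, cell_idx in zip(cam_images, outer):
        cells[cell_idx] = cam[:h, :w]           # naive crop standing in for a proper resize
    rows = [np.concatenate(cells[r * 3:(r + 1) * 3], axis=1) for r in range(3)]
    return np.concatenate(rows, axis=0)         # (3h, 3w, 3) input for the BEV detection head

cams = [np.full((128, 128, 3), 30 * i, dtype=np.uint8) for i in range(8)]
print(compose_surround_grid(cams).shape)        # (384, 384, 3)
```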

Optimization dependent generalization bound for ReLU networks based on sensitivity in the tangent bundle

  • paper_url: http://arxiv.org/abs/2310.17378
  • repo_url: None
  • paper_authors: Dániel Rácz, Mihály Petreczky, András Csertán, Bálint Daróczy
  • for: 该论文旨在解释深度学习模型为何在严重过参数化的情况下仍能良好泛化。
  • methods: 该论文针对前馈ReLU网络提出了一个PAC型泛化误差上界,其思路是沿优化轨迹约束网络梯度对输入数据扰动的敏感度,并据此估计梯度下降可达网络集合的Rademacher复杂度。
  • results: 所得上界不显式依赖网络深度,并在 MNIST 和 CIFAR-10 数据集上得到了实验验证。
    Abstract Recent advances in deep learning have given us some very promising results on the generalization ability of deep neural networks, however literature still lacks a comprehensive theory explaining why heavily over-parametrized models are able to generalize well while fitting the training data. In this paper we propose a PAC type bound on the generalization error of feedforward ReLU networks via estimating the Rademacher complexity of the set of networks available from an initial parameter vector via gradient descent. The key idea is to bound the sensitivity of the network's gradient to perturbation of the input data along the optimization trajectory. The obtained bound does not explicitly depend on the depth of the network. Our results are experimentally verified on the MNIST and CIFAR-10 datasets.

Dialogue-based generation of self-driving simulation scenarios using Large Language Models

  • paper_url: http://arxiv.org/abs/2310.17372
  • repo_url: https://github.com/avmb/dialogllmscenic
  • paper_authors: Antonio Valerio Miceli-Barone, Alex Lascarides, Craig Innes
  • for: 这篇论文主要用于开发和评估自动驾驶车辆控制器。
  • methods: 该论文使用了大型自然语言模型(LLM)将用户的英文语言交互映射到域pecific的编程代码中,以支持扩展的多模态互动。
  • results: 研究表明,LLMs可以捕捉用户在交互中的上下文敏感性,以便计算用户的真正意图。
    Abstract Simulation is an invaluable tool for developing and evaluating controllers for self-driving cars. Current simulation frameworks are driven by highly-specialist domain specific languages, and so a natural language interface would greatly enhance usability. But there is often a gap, consisting of tacit assumptions the user is making, between a concise English utterance and the executable code that captures the user's intent. In this paper we describe a system that addresses this issue by supporting an extended multimodal interaction: the user can follow up prior instructions with refinements or revisions, in reaction to the simulations that have been generated from their utterances so far. We use Large Language Models (LLMs) to map the user's English utterances in this interaction into domain-specific code, and so we explore the extent to which LLMs capture the context sensitivity that's necessary for computing the speaker's intended message in discourse.
    摘要 模拟是自动驾驶汽车控制器开发和评估中不可或缺的工具。当前的模拟框架通常由高度专业化的领域专用语言驱动,因此自然语言界面可以大幅提升可用性。但是,用户简短的英文指令与捕捉其意图的可执行代码之间通常存在一个差距,这个差距由用户所做的隐含假设构成。在这篇论文中,我们描述了一个通过支持扩展的多模态交互来解决该问题的系统:用户可以针对此前指令所生成的模拟结果,继续给出改进或修订。我们使用大型语言模型(LLM)将这种交互中用户的英文指令映射为领域专用代码,并借此探讨了LLM在多大程度上能够捕捉计算说话者真实意图所需的上下文敏感性。

Exploring the Potential of Generative AI for the World Wide Web

  • paper_url: http://arxiv.org/abs/2310.17370
  • repo_url: None
  • paper_authors: Nouar AlDahoul, Joseph Hong, Matteo Varvello, Yasir Zaki
  • for: The paper explores the potential of generative AI in the realm of the World Wide Web, specifically focusing on image generation.
  • methods: The paper develops a tool called WebDiffusion that simulates a Web powered by stable diffusion, a popular text-to-image model, from both a client and server perspective. The tool also supports crowdsourcing of user opinions to evaluate the quality and accuracy of AI-generated images.
  • results: The paper finds that generative AI is already capable of producing pertinent and high-quality Web images, even without requiring Web designers to manually input prompts, just by leveraging contextual information available within the webpages. However, direct in-browser image generation remains a challenge, and only highly powerful GPUs can partially compete with classic image downloads.
    Abstract Generative Artificial Intelligence (AI) is a cutting-edge technology capable of producing text, images, and various media content leveraging generative models and user prompts. Between 2022 and 2023, generative AI surged in popularity with a plethora of applications spanning from AI-powered movies to chatbots. In this paper, we delve into the potential of generative AI within the realm of the World Wide Web, specifically focusing on image generation. Web developers already harness generative AI to help crafting text and images, while Web browsers might use it in the future to locally generate images for tasks like repairing broken webpages, conserving bandwidth, and enhancing privacy. To explore this research area, we have developed WebDiffusion, a tool that allows to simulate a Web powered by stable diffusion, a popular text-to-image model, from both a client and server perspective. WebDiffusion further supports crowdsourcing of user opinions, which we use to evaluate the quality and accuracy of 409 AI-generated images sourced from 60 webpages. Our findings suggest that generative AI is already capable of producing pertinent and high-quality Web images, even without requiring Web designers to manually input prompts, just by leveraging contextual information available within the webpages. However, we acknowledge that direct in-browser image generation remains a challenge, as only highly powerful GPUs, such as the A40 and A100, can (partially) compete with classic image downloads. Nevertheless, this approach could be valuable for a subset of the images, for example when fixing broken webpages or handling highly private content.
    摘要 生成式人工智能(AI)是一种前沿技术,可以通过生成模型和用户提示生成文本、图像和多媒体内容。在2022年和2023年之间,生成AI的受欢迎程度增加,其应用领域包括AI电影和chatbot等。在这篇论文中,我们探讨生成AI在互联网上的潜力,具体来说是图像生成。当前,开发者已经使用生成AI来帮助制作文本和图像,而浏览器可能在未来使用它来本地生成图像,以完成维护破碎页面、降低带宽和保护隐私等任务。为了探索这个研究领域,我们开发了WebDiffusion工具,可以模拟一个基于稳定扩散模型的网络,从客户端和服务端两个角度进行模拟。WebDiffusion还支持用户意见的众包收集,我们使用这些评估来评估409个由60个页面生成的AI图像的质量和准确性。我们的发现表明,生成AI已经能够生成与页面内容相关、高质量且切题的网络图像,而不需要网站设计师手动输入提示。然而,我们认为,直接在浏览器中生成图像仍然是一个挑战,只有高性能的GPU,如A40和A100,才能(部分)与经典图像下载竞争。尽管如此,这种方法可能对一部分图像有价值,例如修复破碎页面或处理高度隐私内容。

Cultural Adaptation of Recipes

  • paper_url: http://arxiv.org/abs/2310.17353
  • repo_url: None
  • paper_authors: Yong Cao, Yova Kementchedjhieva, Ruixiang Cui, Antonia Karamolegkou, Li Zhou, Megan Dare, Lucia Donatelli, Daniel Hershcovich
  • for: 本研究旨在探讨跨文化料理翻译和文化化问题,利用大语言模型来支持这一任务。
  • methods: 本研究使用了GPT-4和其他大语言模型、传统机器翻译和信息检索技术进行评估。
  • results: GPT-4在翻译中文料理为英文时表现出色,但在翻译英文料理为中文时仍然落后于人工专家。这反映了跨文化翻译的复杂性。
    Abstract Building upon the considerable advances in Large Language Models (LLMs), we are now equipped to address more sophisticated tasks demanding a nuanced understanding of cross-cultural contexts. A key example is recipe adaptation, which goes beyond simple translation to include a grasp of ingredients, culinary techniques, and dietary preferences specific to a given culture. We introduce a new task involving the translation and cultural adaptation of recipes between Chinese and English-speaking cuisines. To support this investigation, we present CulturalRecipes, a unique dataset comprised of automatically paired recipes written in Mandarin Chinese and English. This dataset is further enriched with a human-written and curated test set. In this intricate task of cross-cultural recipe adaptation, we evaluate the performance of various methods, including GPT-4 and other LLMs, traditional machine translation, and information retrieval techniques. Our comprehensive analysis includes both automatic and human evaluation metrics. While GPT-4 exhibits impressive abilities in adapting Chinese recipes into English, it still lags behind human expertise when translating English recipes into Chinese. This underscores the multifaceted nature of cultural adaptations. We anticipate that these insights will significantly contribute to future research on culturally-aware language models and their practical application in culturally diverse contexts.

CQM: Curriculum Reinforcement Learning with a Quantized World Model

  • paper_url: http://arxiv.org/abs/2310.17330
  • repo_url: None
  • paper_authors: Seungjae Lee, Daesol Cho, Jonghae Park, H. Jin Kim
  • for: 解决复杂任务的 Reinforcement Learning (RL) 方法面临高维度目标空间生成课程目标的挑战,因此通常需要手动指定目标空间。
  • methods: 我们提出了一种新的课程方法,它自动定义了 semantic goal space,包含关键信息 для课程过程,并提出了 uncertainty 和 temporal distance-aware 的课程目标,可以快速在无信息环境中进行探索。
  • results: 我们的方法可以快速实现目标,并且在不同的目标达成任务中表现出优于现状最佳方法,包括使用 egocentric 视觉输入。
    Abstract Recent curriculum Reinforcement Learning (RL) has shown notable progress in solving complex tasks by proposing sequences of surrogate tasks. However, the previous approaches often face challenges when they generate curriculum goals in a high-dimensional space. Thus, they usually rely on manually specified goal spaces. To alleviate this limitation and improve the scalability of the curriculum, we propose a novel curriculum method that automatically defines the semantic goal space which contains vital information for the curriculum process, and suggests curriculum goals over it. To define the semantic goal space, our method discretizes continuous observations via vector quantized-variational autoencoders (VQ-VAE) and restores the temporal relations between the discretized observations by a graph. Concurrently, ours suggests uncertainty and temporal distance-aware curriculum goals that converges to the final goals over the automatically composed goal space. We demonstrate that the proposed method allows efficient explorations in an uninformed environment with raw goal examples only. Also, ours outperforms the state-of-the-art curriculum RL methods on data efficiency and performance, in various goal-reaching tasks even with ego-centric visual inputs.
    摘要 近期的课程强化学习(curriculum RL)通过提出一系列代理任务,在解决复杂任务方面取得了显著进展。然而,先前的方法在高维空间中生成课程目标时经常面临挑战,因此通常依赖于手动指定的目标空间。为了缓解这一限制并提高课程的可扩展性,我们提出了一种新的课程方法,它自动定义包含课程过程关键信息的语义目标空间,并在其上提出课程目标。为了定义语义目标空间,我们使用向量量化变分自编码器(VQ-VAE)离散化连续观测,并用图结构恢复离散观测之间的时间关系。同时,我们提出兼顾不确定性与时间距离的课程目标,这些目标在自动构建的目标空间上逐步收敛到最终目标。我们展示了所提方法仅凭原始目标样例即可在无信息环境中高效探索,并且在多种目标达成任务中(即使使用以自我为中心的视觉输入),在数据效率和性能上均优于当前最先进的课程RL方法。
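One piece of the method can be sketched directly from the abstract: observations are discretized against a VQ-VAE codebook and codes seen at consecutive timesteps are linked into a graph that restores temporal relations. The snippet below uses a fixed random codebook as a stand-in for a trained VQ-VAE encoder, and the uncertainty- and distance-aware goal proposal is not reproduced.

```python
import numpy as np
from collections import defaultdict

def quantize(observations, codebook):
    """Map each continuous observation to the index of its nearest codebook vector."""
    d2 = ((observations[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def build_goal_graph(episodes, codebook):
    """Connect codes visited at consecutive timesteps; edge counts give temporal adjacency."""
    graph = defaultdict(lambda: defaultdict(int))
    for obs in episodes:                       # obs: (T, obs_dim) trajectory
        codes = quantize(obs, codebook)
        for a, b in zip(codes[:-1], codes[1:]):
            if a != b:
                graph[int(a)][int(b)] += 1
    return graph

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 4))                      # stand-in for trained VQ-VAE codes
episodes = [np.cumsum(rng.normal(size=(50, 4)), axis=0) for _ in range(5)]
graph = build_goal_graph(episodes, codebook)
first = next(iter(graph))
print(len(graph), "visited codes; edges from code", first, ":", dict(graph[first]))
```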

C-Disentanglement: Discovering Causally-Independent Generative Factors under an Inductive Bias of Confounder

  • paper_url: http://arxiv.org/abs/2310.17325
  • repo_url: None
  • paper_authors: Xiaoyu Liu, Jiaxin Yuan, Bang An, Yuancheng Xu, Yifan Yang, Furong Huang
  • for: 本研究旨在探索如何在实际数据中发现少数具有语义意义的生成因素,并使这些因素在潜在空间中相互解耦。
  • methods: 本研究使用了一种名为Confounded-Disentanglement(C-Disentanglement)的框架,该框架通过域专家的标签引入了 inductive bias of confounder,以便在实际数据中找到 causally disentangled的特征。
  • results: 根据实验结果,C-Disentanglement 方法在各种 benchmark 上与多种现状顶峰模型相比,在 domain shift 下能够获得 causally disentangled 的特征和下游任务的优秀表现。
    Abstract Representation learning assumes that real-world data is generated by a few semantically meaningful generative factors (i.e., sources of variation) and aims to discover them in the latent space. These factors are expected to be causally disentangled, meaning that distinct factors are encoded into separate latent variables, and changes in one factor will not affect the values of the others. Compared to statistical independence, causal disentanglement allows more controllable data generation, improved robustness, and better generalization. However, most existing work assumes unconfoundedness in the discovery process, that there are no common causes to the generative factors and thus obtain only statistical independence. In this paper, we recognize the importance of modeling confounders in discovering causal generative factors. Unfortunately, such factors are not identifiable without proper inductive bias. We fill the gap by introducing a framework entitled Confounded-Disentanglement (C-Disentanglement), the first framework that explicitly introduces the inductive bias of confounder via labels from domain expertise. In addition, we accordingly propose an approach to sufficiently identify the causally disentangled factors under any inductive bias of the confounder. We conduct extensive experiments on both synthetic and real-world datasets. Our method demonstrates competitive results compared to various SOTA baselines in obtaining causally disentangled features and downstream tasks under domain shifts.

In-Context Ability Transfer for Question Decomposition in Complex QA

  • paper_url: http://arxiv.org/abs/2310.18371
  • repo_url: None
  • paper_authors: Venktesh V, Sourangshu Bhattacharya, Avishek Anand
  • for: 这个论文的目的是提出一种能够帮助语言模型学习复杂问答任务的方法,无需进行模型训练或专家注释。
  • methods: 这个方法基于在可用数据源中选择相关任务的示例,并通过注意力机制将这些示例传递给语言模型,以便帮助模型学习复杂问答任务。
  • results: 研究人员通过对多种复杂问答任务进行大规模实验,证明了 ICAT 可以在不进行模型训练或专家注释的情况下,与已有的提问基于方法相比,表现更好。
    Abstract Answering complex questions is a challenging task that requires question decomposition and multistep reasoning for arriving at the solution. While existing supervised and unsupervised approaches are specialized to a certain task and involve training, recently proposed prompt-based approaches offer generalizable solutions to tackle a wide variety of complex question-answering (QA) tasks. However, existing prompt-based approaches that are effective for complex QA tasks involve expensive hand annotations from experts in the form of rationales and are not generalizable to newer complex QA scenarios and tasks. We propose, icat (In-Context Ability Transfer) which induces reasoning capabilities in LLMs without any LLM fine-tuning or manual annotation of in-context samples. We transfer the ability to decompose complex questions to simpler questions or generate step-by-step rationales to LLMs, by careful selection from available data sources of related tasks. We also propose an automated uncertainty-aware exemplar selection approach for selecting examples from transfer data sources. Finally, we conduct large-scale experiments on a variety of complex QA tasks involving numerical reasoning, compositional complex QA, and heterogeneous complex QA which require decomposed reasoning. We show that ICAT convincingly outperforms existing prompt-based solutions without involving any model training, showcasing the benefits of re-using existing abilities.

CodeFusion: A Pre-trained Diffusion Model for Code Generation

  • paper_url: http://arxiv.org/abs/2310.17680
  • repo_url: None
  • paper_authors: Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, Gust Verbruggen
  • for: 本研究的目的是提出一种基于扩散代码生成模型,以便在自然语言编程中能够更好地重新考虑之前生成的代码。
  • methods: 本研究使用了预训练的扩散代码生成模型CodeFusion,通过 Iterative Denoising 来重新考虑 encoded natural language 中的整个程序。
  • results: 实验表明,CodeFusion 能够与现状的 auto-regressive 系统相当,并且在 top-3 和 top-5 准确率上表现更佳,这是因为它更好地均衡了多样性和质量。
    Abstract Imagine a developer who can only change their last line of code, how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering earlier tokens generated. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M-175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy due to its better balance in diversity versus quality.
    摘要 想象一个开发者只能改变最后一行代码,那么他需要多频繁地从头重写整个函数才能使其正确?自然语言到代码生成的自回归模型也有类似的限制:它们难以重新考虑先前已生成的token。我们介绍 CodeFusion,一种预训练的扩散代码生成模型,通过以编码后的自然语言为条件、对完整程序进行迭代去噪来解决这一限制。我们在 Bash、Python 和 Microsoft Excel 条件格式(CF)规则的自然语言到代码生成任务上进行了实验,结果显示 CodeFusion(75M 参数)在 top-1 准确率上与最先进的自回归系统(350M-175B 参数)相当,并且由于更好地平衡了多样性与质量,在 top-3 和 top-5 准确率上表现更好。

FormaT5: Abstention and Examples for Conditional Table Formatting with Natural Language

  • paper_url: http://arxiv.org/abs/2310.17306
  • repo_url: None
  • paper_authors: Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, Elnaz Nouri, Mohammad Raza, Gust Verbruggen
  • for: 本研究旨在提供一种基于 transformer 模型的自动表格格式化系统,以便根据用户提供的自然语言描述,生成数据依赖 conditional formatting(CF)规则。
  • methods: 本研究使用 transformer 模型生成 CF 规则,并通过预测占位符(placeholder)来应对用户描述不充分(under-specification)和参数错误的问题。
  • results: 在 1053 个 CF 任务上,FormaT5 借助占位符预测与填充,在有示例和无示例两种设置下均超过了 8 种神经网络方法,说明了构建领域专用学习系统的价值。
    Abstract Formatting is an important property in tables for visualization, presentation, and analysis. Spreadsheet software allows users to automatically format their tables by writing data-dependent conditional formatting (CF) rules. Writing such rules is often challenging for users as it requires them to understand and implement the underlying logic. We present FormaT5, a transformer-based model that can generate a CF rule given the target table and a natural language description of the desired formatting logic. We find that user descriptions for these tasks are often under-specified or ambiguous, making it harder for code generation systems to accurately learn the desired rule in a single step. To tackle this problem of under-specification and minimise argument errors, FormaT5 learns to predict placeholders though an abstention objective. These placeholders can then be filled by a second model or, when examples of rows that should be formatted are available, by a programming-by-example system. To evaluate FormaT5 on diverse and real scenarios, we create an extensive benchmark of 1053 CF tasks, containing real-world descriptions collected from four different sources. We release our benchmarks to encourage research in this area. Abstention and filling allow FormaT5 to outperform 8 different neural approaches on our benchmarks, both with and without examples. Our results illustrate the value of building domain-specific learning systems.
    摘要 表格的格式化是一个重要的属性,它对于视觉化、展示和分析都非常重要。电子表格软件允许用户自动格式化他们的表格,这可以通过写数据依赖的条件格式化规则(CF)来实现。写这些规则是常常给用户带来挑战,因为它们需要用户理解并实现下面的逻辑。我们提出了FormaT5,一种基于转换器的模型,可以根据目标表格和自然语言描述来生成CF规则。我们发现用户对这些任务的描述经常是不充分或模糊的,这使得代码生成系统更难准确地学习所需的规则。为解决这个问题,FormaT5学习预测占位符,通过缺失目标对象的目标搜索来减少参数错误。这些占位符可以通过第二个模型或,当有示例行可用时,通过编程示例系统来填充。为评估FormaT5在多样化和实际场景中的表现,我们创建了1053个CF任务的广泛 benchmark,其中包括来自四个不同来源的真实描述。我们发布了这些 benchmark,以便促进这一领域的研究。忽略和填充允许FormaT5在我们的 benchmark 上超越8种神经网络方法,包括和没有示例。我们的结果表明,建立领域特定的学习系统是非常有价值的。

Comparing Photorealistic and Animated Embodied Conversational Agents in Serious Games: An Empirical Study on User Experience

  • paper_url: http://arxiv.org/abs/2310.17300
  • repo_url: None
  • paper_authors: Danai Korre
  • for: 这篇论文旨在研究具身对话代理(ECAs)在严肃游戏环境中的使用,以及两种不同呈现真实度水平(写实与动画)的影响。
  • methods: 这篇论文采用被试内2x2因素设计,并采集了36名性别均衡的参与者的数据,以分析ECAs的可用性以及参与者对不同版本的偏好。
  • results: 研究发现,两种版本都被评估为非常可用,但参与者中69.4%表示偏好真实版本,25%表示偏好动画版本,5.6%没有表态。真实版本被认为更加真实和人类化,而动画版本使得任务更像游戏。尽管代理人的真实性没有对可用性产生显著影响,但它 positively 影响了参与者对代理人的评估。
    Abstract Embodied conversational agents (ECAs) are paradigms of conversational user interfaces in the form of embodied characters. While ECAs offer various manipulable features, this paper focuses on a study conducted to explore two distinct levels of presentation realism. The two agent versions are photorealistic and animated. The study aims to provide insights and design suggestions for speech-enabled ECAs within serious game environments. A within-subjects, two-by-two factorial design was employed for this research with a cohort of 36 participants balanced for gender. The results showed that both the photorealistic and the animated versions were perceived as highly usable, with overall mean scores of 5.76 and 5.71, respectively. However, 69.4 per cent of the participants stated they preferred the photorealistic version, 25 per cent stated they preferred the animated version and 5.6 per cent had no stated preference. The photorealistic agents were perceived as more realistic and human-like, while the animated characters made the task feel more like a game. Even though the agents' realism had no significant effect on usability, it positively influenced participants' perceptions of the agent. This research aims to lay the groundwork for future studies on ECA realism's impact in serious games across diverse contexts.
    摘要 具身对话代理(ECAs)是以具身角色形式呈现的对话用户界面。本研究探讨了两种不同呈现真实度水平的代理版本,即写实版本和动画版本,旨在为严肃游戏环境中的语音交互ECAs提供设计见解和建议。本研究采用被试内2x2因素实验设计,共36名参与者,性别均衡。结果显示,两种版本都被评为高度可用,总体平均分分别为5.76和5.71。然而,69.4%的参与者表示更喜欢写实版本,25%更喜欢动画版本,5.6%没有明确偏好。写实代理被认为更真实、更接近人类,而动画角色使任务感觉更像游戏。虽然代理的真实度对可用性没有显著影响,但它正面影响了参与者对代理的评价。本研究旨在为未来在多种情境下研究ECA真实度在严肃游戏中的影响奠定基础。

Fast Scalable and Accurate Discovery of DAGs Using the Best Order Score Search and Grow-Shrink Trees

  • paper_url: http://arxiv.org/abs/2310.17679
  • repo_url: https://github.com/cmu-phil/boss
  • paper_authors: Bryan Andrews, Joseph Ramsey, Ruben Sanchez-Romero, Jazmin Camchong, Erich Kummerfeld
  • for: 学习图解 conditional independence 结构是机器学习中一项重要的问题,也是 causal discovery 的重要基础。但是,现有的算法的准确率和执行时间通常难以扩展到包含百个高度连接的变量的问题,例如从 fMRI 数据中恢复大脑网络。
  • methods: 我们引入了最佳顺序分数搜索 (BOSS) 和 grow-shrink 树 (GST),用于学习 Directed Acyclic Graphs (DAGs)。BOSS 通过 GST 构建和评分 DAGs 来进行搜索。GST 高效缓存分数,以消除重复计算。
  • results: BOSS 可以在各种条件下达到 state-of-the-art 的准确率和执行时间,与其他 combinatorial 和梯度基于的学习算法相比。为了证明其实用性,我们将 BOSS 应用于两个resting-state fMRI数据集:一个是 simulated data 与 pseudo-empirical noise distribution derivated from randomized empirical fMRI cortical signals,另一个是 3T fMRI scans 处理后的 cortical parcels。BOSS 可以在 TETRAD 项目中使用,包括 Python 和 R wrapper。
    Abstract Learning graphical conditional independence structures is an important machine learning problem and a cornerstone of causal discovery. However, the accuracy and execution time of learning algorithms generally struggle to scale to problems with hundreds of highly connected variables -- for instance, recovering brain networks from fMRI data. We introduce the best order score search (BOSS) and grow-shrink trees (GSTs) for learning directed acyclic graphs (DAGs) in this paradigm. BOSS greedily searches over permutations of variables, using GSTs to construct and score DAGs from permutations. GSTs efficiently cache scores to eliminate redundant calculations. BOSS achieves state-of-the-art performance in accuracy and execution time, comparing favorably to a variety of combinatorial and gradient-based learning algorithms under a broad range of conditions. To demonstrate its practicality, we apply BOSS to two sets of resting-state fMRI data: simulated data with pseudo-empirical noise distributions derived from randomized empirical fMRI cortical signals and clinical data from 3T fMRI scans processed into cortical parcels. BOSS is available for use within the TETRAD project which includes Python and R wrappers.
    摘要 学习图structures是机器学习的重要问题,也是 causal discovery的基础。但是,学习算法的准确率和执行时间通常在百个高度连接的变量问题上难以扩展 -- 例如,从 fMRI 数据中回归大脑网络。我们介绍了最佳顺序分数搜索(BOSS)和生长缩小树(GST)用于学习 directed acyclic graphs(DAGs)。BOSS 在 permutations of variables 上进行探索,使用 GSTs 构建和评分 DAGs。GSTs 高效地缓存分数,以消除重复计算。BOSS 在准确率和执行时间方面达到了状态机器学习算法的最佳性能,与许多 combinatorial 和梯度基于的学习算法进行比较,在各种条件下表现出色。为了证明其实用性,我们将 BOSS 应用于两个 sets of resting-state fMRI 数据:生成的 simulated data 和临床数据 from 3T fMRI 扫描。BOSS 可以在 TETRAD 项目中使用,该项目包括 Python 和 R 包装。
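As a rough illustration of the permutation-search idea behind BOSS, the sketch below scores variable orderings with a linear-Gaussian BIC, selects each variable's parents from its predecessors with a simplified grow-then-shrink step, and memoizes local (child, parents) scores the way GSTs cache them. It is not the TETRAD implementation; all function names and the toy data are invented for illustration.

```python
# Toy permutation-based DAG search in the spirit of BOSS with score caching.
import numpy as np

def local_bic(data, child, parents, cache):
    """BIC of regressing `child` on `parents`; results are memoized in `cache`."""
    key = (child, tuple(sorted(parents)))
    if key in cache:
        return cache[key]
    n = data.shape[0]
    y = data[:, child]
    X = np.column_stack([np.ones(n)] + [data[:, p] for p in parents])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = max(resid @ resid / n, 1e-12)
    bic = -0.5 * n * np.log(sigma2) - 0.5 * (len(parents) + 1) * np.log(n)
    cache[key] = bic
    return bic

def grow_shrink_parents(data, child, candidates, cache):
    """Greedy grow-then-shrink parent selection restricted to `candidates`."""
    parents = []
    improved = True
    while improved:                              # grow phase
        improved = False
        base = local_bic(data, child, parents, cache)
        for c in candidates:
            if c not in parents and local_bic(data, child, parents + [c], cache) > base:
                parents.append(c); improved = True; break
    for p in list(parents):                      # shrink phase
        without = [q for q in parents if q != p]
        if local_bic(data, child, without, cache) >= local_bic(data, child, parents, cache):
            parents = without
    return parents

def score_order(data, order, cache):
    total, dag = 0.0, {}
    for i, v in enumerate(order):
        pa = grow_shrink_parents(data, v, list(order[:i]), cache)
        dag[v] = pa
        total += local_bic(data, v, pa, cache)
    return total, dag

def best_order_search(data, n_restarts=3, seed=0):
    rng = np.random.default_rng(seed)
    d, cache = data.shape[1], {}
    best_score, best_dag = -np.inf, None
    for _ in range(n_restarts):
        order = [int(v) for v in rng.permutation(d)]
        improved = True
        while improved:                          # relocate each variable to its best position
            improved = False
            for v in list(order):
                scores = []
                for pos in range(d):
                    cand = [x for x in order if x != v]
                    cand.insert(pos, v)
                    scores.append((score_order(data, cand, cache)[0], pos))
                s, pos = max(scores)
                cand = [x for x in order if x != v]; cand.insert(pos, v)
                if s > score_order(data, order, cache)[0] + 1e-9:
                    order, improved = cand, True
        s, dag = score_order(data, order, cache)
        if s > best_score:
            best_score, best_dag = s, dag
    return best_dag

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x0 = rng.normal(size=500)
    x1 = 0.8 * x0 + rng.normal(size=500)
    x2 = 0.6 * x1 + rng.normal(size=500)
    print(best_order_search(np.column_stack([x0, x1, x2])))
```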

New Boolean satisfiability problem heuristic strategy: Minimal Positive Negative Product Strategy

  • paper_url: http://arxiv.org/abs/2310.18370
  • repo_url: None
  • paper_authors: Qun Zhao, Xintao Wang, Menghui Yang
  • for: 解决Boolean satisfiability问题
  • methods: 使用Minimal Positive Negative Product Strategy引导CDCL算法
  • results: 实验结果证明该算法在问题求解中比常用的DLIS和VSIDS算法更高效
    Abstract This study presents a novel heuristic algorithm called the "Minimal Positive Negative Product Strategy" to guide the CDCL algorithm in solving the Boolean satisfiability problem. It provides a mathematical explanation for the superiority of this algorithm over widely used heuristics such as the Dynamic Largest Individual Sum (DLIS) and the Variable State Independent Decaying Sum (VSIDS). Experimental results further confirm the effectiveness of this heuristic strategy in problem-solving.
    摘要 本研究提出了一种名为"最小正负乘积策略"的新启发式算法,用于引导CDCL算法求解布尔可满足性问题,并从数学上解释了该策略优于DLIS和VSIDS等广泛使用的启发式策略的原因。实验结果进一步证实了该启发式策略在问题求解中的有效性。
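For context, the sketch below shows how occurrence-count branching heuristics plug into a CDCL/DPLL-style solver loop. `dlis` is the classic rule mentioned in the abstract; `min_pos_neg_product` is only a guess at what a minimal positive-negative-product rule might look like, since the paper's exact definition is not reproduced here.

```python
# Illustrative literal-count branching heuristics for a CNF formula represented as a
# list of clauses (each clause a list of signed ints, DIMACS-style).
from collections import Counter

def literal_counts(clauses, assignment):
    """Count positive/negative occurrences of each unassigned variable
    in clauses that are not yet satisfied."""
    pos, neg = Counter(), Counter()
    for clause in clauses:
        if any(assignment.get(abs(l)) == (l > 0) for l in clause):
            continue  # clause already satisfied
        for l in clause:
            v = abs(l)
            if v in assignment:
                continue
            (pos if l > 0 else neg)[v] += 1
    return pos, neg

def dlis(clauses, assignment):
    """Branch on the literal with the largest individual occurrence count."""
    pos, neg = literal_counts(clauses, assignment)
    best = max([(c, v) for v, c in pos.items()] + [(c, -v) for v, c in neg.items()],
               default=None)
    return None if best is None else best[1]

def min_pos_neg_product(clauses, assignment):
    """Hypothetical reading of the proposed rule: pick the unassigned variable whose
    (positive count) * (negative count) is smallest, then branch on its more
    frequent polarity; the authoritative definition is in the paper itself."""
    pos, neg = literal_counts(clauses, assignment)
    variables = set(pos) | set(neg)
    if not variables:
        return None
    v = min(variables, key=lambda x: pos[x] * neg[x])
    return v if pos[v] >= neg[v] else -v

if __name__ == "__main__":
    cnf = [[1, 2], [-1, 3], [-2, -3], [1, -3]]
    print(dlis(cnf, {}), min_pos_neg_product(cnf, {}))
```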

Attribute Based Interpretable Evaluation Metrics for Generative Models

  • paper_url: http://arxiv.org/abs/2310.17261
  • repo_url: None
  • paper_authors: Dongkyun Kim, Mingi Kwon, Youngjung Uh
  • for: 本研究旨在提出一种新的评估协议,用于评估生成模型是否能够准确地捕捉训练集中的种类分布。
  • methods: 本研究使用了单 attribute 分化(SaD)和对应 attribute 分化(PaD)两种新的评估指标,以及一种新的图像特征评估指标——不同类型 CLIPScore(HCS)。
  • results: 通过使用这些指标,我们发现了一些现有的生成模型的缺陷,例如 ProjectedGAN 生成了不可能的属性关系,扩散模型困难捕捉数据集中的多种颜色,而 latent diffusion model 的更大的抽样步骤生成了更小的对象。
    Abstract When the training dataset comprises a 1:1 proportion of dogs to cats, a generative model that produces 1:1 dogs and cats better resembles the training species distribution than another model with 3:1 dogs and cats. Can we capture this phenomenon using existing metrics? Unfortunately, we cannot, because these metrics do not provide any interpretability beyond "diversity". In this context, we propose a new evaluation protocol that measures the divergence of a set of generated images from the training set regarding the distribution of attribute strengths as follows. Single-attribute Divergence (SaD) measures the divergence regarding PDFs of a single attribute. Paired-attribute Divergence (PaD) measures the divergence regarding joint PDFs of a pair of attributes. They provide which attributes the models struggle. For measuring the attribute strengths of an image, we propose Heterogeneous CLIPScore (HCS) which measures the cosine similarity between image and text vectors with heterogeneous initial points. With SaD and PaD, we reveal the following about existing generative models. ProjectedGAN generates implausible attribute relationships such as a baby with a beard even though it has competitive scores of existing metrics. Diffusion models struggle to capture diverse colors in the datasets. The larger sampling timesteps of latent diffusion model generate the more minor objects including earrings and necklaces. Stable Diffusion v1.5 better captures the attributes than v2.1. Our metrics lay a foundation for explainable evaluations of generative models.
    摘要 We define Single-attribute Divergence (SaD) as the divergence regarding the probability density functions (PDFs) of a single attribute. Paired-attribute Divergence (PaD) measures the divergence regarding the joint PDFs of a pair of attributes. These metrics reveal which attributes the models struggle with.To measure the attribute strengths of an image, we propose Heterogeneous CLIPScore (HCS), which measures the cosine similarity between image and text vectors with heterogeneous initial points. With SaD and PaD, we find that ProjectedGAN generates implausible attribute relationships, such as a baby with a beard, despite having competitive scores on existing metrics. Diffusion models struggle to capture diverse colors in the datasets, and the larger sampling timesteps of the latent diffusion model result in the generation of smaller objects, such as earrings and necklaces. Stable Diffusion v1.5 performs better in capturing attributes than v2.1.Our proposed metrics provide a foundation for explainable evaluations of generative models, enabling us to better understand their strengths and weaknesses.
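A minimal sketch of the attribute-divergence idea, under several assumptions: image and attribute-text embeddings are taken as precomputed (e.g. by a CLIP-style encoder), attribute strength is approximated by plain cosine similarity rather than the paper's Heterogeneous CLIPScore, and a 1-D Wasserstein / total-variation distance stands in for the paper's divergence.

```python
# Attribute-wise divergence between a training set and generated images (SaD/PaD spirit).
import numpy as np
from scipy.stats import wasserstein_distance

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def attribute_strengths(image_emb, attr_emb):
    """(n_images, n_attributes) matrix of similarity-based attribute strengths."""
    return cosine(image_emb, attr_emb)

def single_attribute_divergence(train_emb, gen_emb, attr_emb):
    """Per-attribute divergence between training and generated strength distributions."""
    s_train = attribute_strengths(train_emb, attr_emb)
    s_gen = attribute_strengths(gen_emb, attr_emb)
    return np.array([wasserstein_distance(s_train[:, j], s_gen[:, j])
                     for j in range(attr_emb.shape[0])])

def paired_attribute_divergence(train_emb, gen_emb, attr_emb):
    """Divergence over joint (2-D) strength histograms for each attribute pair."""
    s_train = attribute_strengths(train_emb, attr_emb)
    s_gen = attribute_strengths(gen_emb, attr_emb)
    k = attr_emb.shape[0]
    out = np.zeros((k, k))
    bins = np.linspace(-1, 1, 21)
    for i in range(k):
        for j in range(i + 1, k):
            h_tr, _, _ = np.histogram2d(s_train[:, i], s_train[:, j], bins=[bins, bins], density=True)
            h_ge, _, _ = np.histogram2d(s_gen[:, i], s_gen[:, j], bins=[bins, bins], density=True)
            out[i, j] = out[j, i] = 0.5 * np.abs(h_tr - h_ge).sum() * (bins[1] - bins[0]) ** 2
    return out  # total-variation-style distance per attribute pair

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train, gen, attrs = rng.normal(size=(200, 512)), rng.normal(size=(200, 512)), rng.normal(size=(4, 512))
    print(single_attribute_divergence(train, gen, attrs).round(3))
```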

IDENAS: Internal Dependency-based Exploration for Neural Architecture Search

  • paper_url: http://arxiv.org/abs/2310.17250
  • repo_url: https://github.com/viharoszsolt/idenas
  • paper_authors: Anh T. Hoang, Zsolt J. Viharos
  • for: 提高自动机器学习模型开发的效率和准确率,特别是在输入和输出变量之间存在未知关系的情况下。
  • methods: 提出了一种基于内部依赖关系的搜索方法IDENAS,结合了神经网络搜索和特征选择。IDENAS使用修改后的编码器-解码器模型和继承前进搜索算法,将输入-输出配置搜索与嵌入特征选择相结合。
  • results: 实验结果显示,IDENAS在比较其他算法的情况下表现出色, demonstrating its effectiveness in model development pipelines and automated machine learning. On average, IDENAS achieved significant modelling improvements, highlighting its significant contribution to advancing the state-of-the-art in neural architecture search and feature selection integration.
    Abstract Machine learning is a powerful tool for extracting valuable information and making various predictions from diverse datasets. Traditional algorithms rely on well-defined input and output variables however, there are scenarios where the distinction between the input and output variables and the underlying, associated (input and output) layers of the model, are unknown. Neural Architecture Search (NAS) and Feature Selection have emerged as promising solutions in such scenarios. This research proposes IDENAS, an Internal Dependency-based Exploration for Neural Architecture Search, integrating NAS with feature selection. The methodology explores internal dependencies in the complete parameter space for classification involving 1D sensor and 2D image data as well. IDENAS employs a modified encoder-decoder model and the Sequential Forward Search (SFS) algorithm, combining input-output configuration search with embedded feature selection. Experimental results demonstrate IDENASs superior performance in comparison to other algorithms, showcasing its effectiveness in model development pipelines and automated machine learning. On average, IDENAS achieved significant modelling improvements, underscoring its significant contribution to advancing the state-of-the-art in neural architecture search and feature selection integration.
    摘要 机器学习是一种强大的工具,可以从多样化数据集中提取有价值信息并进行多种预测。传统的算法假设输入和输出变量之间存在明确的定义,但有时候输入和输出变量之间的关系并不明确。神经网络搜索(NAS)和特征选择是一些有前途的解决方案。这项研究提出了内部依赖性搜索(IDENAS),它将NAS与特征选择集成了一起。该方法在完全参数空间中搜索内部依赖关系,用于分类,包括1D感知器和2D图像数据。IDENAS使用修改后的encoder-decoder模型和顺序前进搜索(SFS)算法,将输入输出配置搜索与嵌入特征选择结合在一起。实验结果表明,IDENAS在其他算法的比较中表现出色,展示了其在机器学习开发流程和自动化机器学习中的有效性。在平均上,IDENAS实现了重要的模型改进,强调了它在神经网络搜索和特征选择集成中的重要贡献。

CROP: Conservative Reward for Model-based Offline Policy Optimization

  • paper_url: http://arxiv.org/abs/2310.17245
  • repo_url: https://github.com/g0k0ururi/crop
  • paper_authors: Hao Li, Xiao-Hu Zhou, Xiao-Liang Xie, Shi-Qi Liu, Zhen-Qiu Feng, Xiao-Yin Liu, Mei-Jiang Gui, Tian-Yu Xiang, De-Xing Huang, Bo-Xian Yao, Zeng-Guang Hou
  • for: 提出了一种新的模型基于的离线强化学习算法(CROP),用于优化策略,并通过保守估计奖励来避免分布迁移问题。
  • methods: 该算法使用了模型训练来估计奖励,并同时减少估计错误和随机行动奖励的积累。
  • results: 实验结果表明,CROP算法的性能与当前最优基线相当,在D4RL benchmark上表现良好。此外,该算法通过将在线RL技术应用于以保守奖励训练的经验MDP,在离线RL与在线RL之间建立了创新连接。
    Abstract Offline reinforcement learning (RL) aims to optimize policy using collected data without online interactions. Model-based approaches are particularly appealing for addressing offline RL challenges due to their capability to mitigate the limitations of offline data through data generation using models. Prior research has demonstrated that introducing conservatism into the model or Q-function during policy optimization can effectively alleviate the prevalent distribution drift problem in offline RL. However, the investigation into the impacts of conservatism in reward estimation is still lacking. This paper proposes a novel model-based offline RL algorithm, Conservative Reward for model-based Offline Policy optimization (CROP), which conservatively estimates the reward in model training. To achieve a conservative reward estimation, CROP simultaneously minimizes the estimation error and the reward of random actions. Theoretical analysis shows that this conservative reward mechanism leads to a conservative policy evaluation and helps mitigate distribution drift. Experiments on D4RL benchmarks showcase that the performance of CROP is comparable to the state-of-the-art baselines. Notably, CROP establishes an innovative connection between offline and online RL, highlighting that offline RL problems can be tackled by adopting online RL techniques to the empirical Markov decision process trained with a conservative reward. The source code is available with https://github.com/G0K0URURI/CROP.git.
    摘要 离线强化学习(RL)的目标是利用已收集的数据优化策略,而无需在线交互。基于模型的方法在解决离线RL挑战方面尤其有利,因为它们可以通过模型生成数据来减少离线数据的限制。过去的研究表明,在策略优化中对模型或Q函数引入保守性可以有效缓解离线RL中普遍存在的分布漂移问题。然而,关于奖励估计中保守性的研究仍然缺乏。这篇论文提出了一种新的基于模型的离线RL算法,即Conservative Reward for model-based Offline Policy optimization(CROP),它在模型训练中保守地估计奖励。为了实现保守的奖励估计,CROP同时最小化估计误差和随机动作的奖励。理论分析表明,这种保守的奖励机制带来保守的策略评估,有助于缓解分布漂移问题。实验表明,CROP在D4RL benchmark上的性能与当前最优基线相当。尤其是,CROP在离线RL与在线RL之间建立了创新连接,指出离线RL问题可以通过将在线RL技术应用于以保守奖励训练的经验马尔可夫决策过程来解决。源代码可在 https://github.com/G0K0URURI/CROP.git 获取。
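A minimal PyTorch sketch of the conservative-reward idea: fit a reward model to the offline transitions while pushing down the reward predicted for random actions. The network size, the penalty weight, and the uniform action sampling are illustrative assumptions, not the paper's exact formulation.

```python
# Conservative reward estimation: fit dataset rewards, penalize rewards of random actions.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def conservative_reward_loss(model, s, a, r, beta=0.5):
    """MSE on dataset rewards plus a penalty on rewards predicted for random actions."""
    fit = ((model(s, a) - r) ** 2).mean()
    a_rand = torch.rand_like(a) * 2 - 1          # uniform actions in [-1, 1]
    penalty = model(s, a_rand).mean()            # keep out-of-dataset rewards low
    return fit + beta * penalty

if __name__ == "__main__":
    torch.manual_seed(0)
    model = RewardModel(state_dim=8, action_dim=2)
    opt = torch.optim.Adam(model.parameters(), lr=3e-4)
    s, a = torch.randn(256, 8), torch.rand(256, 2) * 2 - 1
    r = (s[:, 0] + a[:, 0]).detach()             # toy ground-truth reward
    for step in range(200):
        loss = conservative_reward_loss(model, s, a, r)
        opt.zero_grad(); loss.backward(); opt.step()
    print(float(loss))
```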

Joint Entity and Relation Extraction with Span Pruning and Hypergraph Neural Networks

  • paper_url: http://arxiv.org/abs/2310.17238
  • repo_url: https://github.com/yanzhh/hgere
  • paper_authors: Zhaohui Yan, Songlin Yang, Wei Liu, Kewei Tu
  • for: 提高实体与关系抽取(ERE)任务的性能,特别是缓解 marker-based 管道模型中的错误传播问题。
  • methods: 基于 PL-marker marker-based 管道模型,提出 HyperGraph 神经网络($\hgnn{}$),并使用高复 recall 减弱机制来减轻NER模块的负担。进一步地,建立一个高级图,其中节点为实体(由 span pruner 提供)和其关系,强制编码这些关系的交互。
  • results: 在三个广泛使用的 ERE benchmark 上(\acef{}, \ace{} 和 \scierc{)),经验表明 $\hgnn{}$ 模型在前一代 marker-based 管道模型的基础上具有显著的改进。
    Abstract Entity and Relation Extraction (ERE) is an important task in information extraction. Recent marker-based pipeline models achieve state-of-the-art performance, but still suffer from the error propagation issue. Also, most of current ERE models do not take into account higher-order interactions between multiple entities and relations, while higher-order modeling could be beneficial.In this work, we propose HyperGraph neural network for ERE ($\hgnn{}$), which is built upon the PL-marker (a state-of-the-art marker-based pipleline model). To alleviate error propagation,we use a high-recall pruner mechanism to transfer the burden of entity identification and labeling from the NER module to the joint module of our model. For higher-order modeling, we build a hypergraph, where nodes are entities (provided by the span pruner) and relations thereof, and hyperedges encode interactions between two different relations or between a relation and its associated subject and object entities. We then run a hypergraph neural network for higher-order inference by applying message passing over the built hypergraph. Experiments on three widely used benchmarks (\acef{}, \ace{} and \scierc{}) for ERE task show significant improvements over the previous state-of-the-art PL-marker.
    摘要 entity 和 relation 抽取 (ERE) 是信息抽取中的重要任务。 current marker-based pipeline 模型可以达到状态的最佳性能,但仍然受到错误卷积问题的影响。 besides, current ERE 模型多数不考虑多个实体和关系之间的高阶交互,而高阶模型化可能是有利的。在这种情况下,我们提出了 HyperGraph 神经网络 для ERE ($ \hgnn{}$), 它基于 PL-marker (现状最佳 marker-based pipeline 模型)。为了缓解错误卷积问题,我们使用高度回归预测机制,将实体识别和标注的负担从 NER 模块传递给我们模型的联合模块。 For higher-order modeling, we build a hypergraph, where nodes are entities (由 span pruner 提供) and relations thereof, and hyperedges encode interactions between two different relations or between a relation and its associated subject and object entities。然后,我们运行一个高阶神经网络,通过在建立的 hypergraph 上进行消息传递来进行高阶推理。 experiments 表明,在三个常用的 ERE benchmark 上(\acef{}, \ace{} 和 \scierc{)),我们的模型可以具有显著的改善,胜过了之前的 PL-marker。

TST$^\mathrm{R}$: Target Similarity Tuning Meets the Real World

  • paper_url: http://arxiv.org/abs/2310.17228
  • repo_url: None
  • paper_authors: Anirudh Khatry, Sumit Gulwani, Priyanshu Gupta, Vu Le, Ananya Singha, Mukul Singh, Gust Verbruggen
  • for: This paper is written for improving the performance of natural language (NL) to code generation through large language models (LLMs) by adapting a sentence embedding model to have the similarity between two NL inputs match the similarity between their associated code outputs.
  • methods: The paper proposes different methods to apply and improve target similarity tuning (TST) in the real world, including replacing the sentence transformer with embeddings from a larger model, training a tiny model to transform the embeddings, and efficiently selecting a smaller number of training examples.
  • results: The paper introduces a ranking-based evaluation for TST that does not require end-to-end code generation experiments, which can be expensive to perform.
    Abstract Target similarity tuning (TST) is a method of selecting relevant examples in natural language (NL) to code generation through large language models (LLMs) to improve performance. Its goal is to adapt a sentence embedding model to have the similarity between two NL inputs match the similarity between their associated code outputs. In this paper, we propose different methods to apply and improve TST in the real world. First, we replace the sentence transformer with embeddings from a larger model, which reduces sensitivity to the language distribution and thus provides more flexibility in synthetic generation of examples, and we train a tiny model that transforms these embeddings to a space where embedding similarity matches code similarity, which allows the model to remain a black box and only requires a few matrix multiplications at inference time. Second, we show how to efficiently select a smaller number of training examples to train the TST model. Third, we introduce a ranking-based evaluation for TST that does not require end-to-end code generation experiments, which can be expensive to perform.
    摘要 目标相似调整(TST)是一种使用大语言模型(LLM)来生成代码的方法,旨在将NL输入与其相关的代码输出之间的相似性进行调整。在这篇论文中,我们提出了不同的方法来应用和改进TST在实际应用中。首先,我们将 sentence transformer 替换为来自更大的模型的嵌入,这会降低语言分布的敏感度,并提供更多的自然语言生成的可能性,然后我们将这些嵌入变换到一个空间中,使得嵌入相似性与代码相似性匹配,这些操作只需在推理时进行几次矩阵乘法即可。其次,我们展示了如何高效地选择训练例子来训练TST模型。最后,我们引入了一种基于排名的评估方法,不需要进行昂贵的端到端代码生成实验。
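A small sketch of the "tiny transformation model" described above: frozen NL embeddings from a larger model are mapped by a small trainable network so that cosine similarity in the transformed space matches a given code-similarity target, and inference then needs only a few matrix multiplications. The MLP size, loss, and random stand-in data are assumptions.

```python
# Train a tiny transform so that embedding similarity matches code similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTransform(nn.Module):
    def __init__(self, dim, out_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))

    def forward(self, e):
        return F.normalize(self.proj(e), dim=-1)

def tst_loss(model, nl_emb, code_sim):
    """Match pairwise cosine similarity of transformed NL embeddings to code similarity."""
    z = model(nl_emb)
    sim = z @ z.T
    return ((sim - code_sim) ** 2).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    nl_emb = torch.randn(64, 1024)                # frozen embeddings of NL queries
    code_sim = torch.rand(64, 64)                 # target similarity of their code outputs
    code_sim = (code_sim + code_sim.T) / 2
    code_sim.fill_diagonal_(1.0)
    model = TinyTransform(dim=1024)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(300):
        loss = tst_loss(model, nl_emb, code_sim)
        opt.zero_grad(); loss.backward(); opt.step()
    print(float(loss))
```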

Beyond MLE: Convex Learning for Text Generation

  • paper_url: http://arxiv.org/abs/2310.17217
  • repo_url: https://github.com/ictnlp/convex-learning
  • paper_authors: Chenze Shao, Zhengrui Ma, Min Zhang, Yang Feng
  • For: This paper proposes a novel approach to training text generation models using convex functions, which can help the models focus on highly probable outputs without requiring maximum likelihood estimation (MLE).* Methods: The proposed approach uses convex functions to define the training objective, which enables the model to better capture outputs with high probabilities. The authors investigate the theoretical properties of the optimal predicted distribution when applying convex functions to the loss.* Results: The proposed approach is effective in improving the performance of text generation models. In experiments on various text generation tasks and models, the approach enables autoregressive models to bridge the gap between greedy and beam search, and facilitates the learning of non-autoregressive models with a maximum improvement of 9+ BLEU points. The approach also exhibits significant impact on large language models (LLMs), substantially enhancing their generative capability on various tasks.
    Abstract Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution that best explain the observed data. In the context of text generation, MLE is often used to train generative language models, which can then be used to generate new text. However, we argue that MLE is not always necessary and optimal, especially for closed-ended text generation tasks like machine translation. In these tasks, the goal of model is to generate the most appropriate response, which does not necessarily require it to estimate the entire data distribution with MLE. To this end, we propose a novel class of training objectives based on convex functions, which enables text generation models to focus on highly probable outputs without having to estimate the entire data distribution. We investigate the theoretical properties of the optimal predicted distribution when applying convex functions to the loss, demonstrating that convex functions can sharpen the optimal distribution, thereby enabling the model to better capture outputs with high probabilities. Experiments on various text generation tasks and models show the effectiveness of our approach. It enables autoregressive models to bridge the gap between greedy and beam search, and facilitates the learning of non-autoregressive models with a maximum improvement of 9+ BLEU points. Moreover, our approach also exhibits significant impact on large language models (LLMs), substantially enhancing their generative capability on various tasks. Source code is available at \url{https://github.com/ictnlp/Convex-Learning}.
    摘要 最大 LIKElihood 估计 (MLE) 是一种统计方法,用于估计一个概率分布中的参数,以便最好地预测观察到的数据。在文本生成任务中,MLE oftens 用于训练生成语言模型,以便生成新的文本。然而,我们认为MLE 不一定是最佳和必要的,尤其是在关闭式文本生成任务中,如机器翻译。在这些任务中,模型的目标是生成最佳的回答,而不一定需要估计整个数据分布。为此,我们提出了一种新的训练目标函数,基于凸函数,允许文本生成模型专注于高可能性的输出,而不需要估计整个数据分布。我们研究了这种新的训练目标函数的理论性质,并证明了凸函数可以使估计的最佳分布更加紧凑,使模型更好地捕捉高可能性的输出。我们在不同的文本生成任务和模型上进行了实验,并证明了我们的方法的效iveness。它使得 autoregressive 模型可以跨度搜索和搜索,并且使得非 autoregressive 模型学习得到最大改进(9+ BLEU 点)。此外,我们的方法还在大语言模型 (LLM) 上展现了显著的影响,substantially 提高了它们的生成能力在多种任务上。可以在 \url{https://github.com/ictnlp/Convex-Learning} 上获得源代码。

Emotion Recognition by Video: A review

  • paper_url: http://arxiv.org/abs/2310.17212
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Junxiao Xue, Jie Wang, Xuecheng Wu, Liangyu Fu
  • for: 本文旨在帮助学术界和现代科学家综合了解最新的发展和创新在视频情感识别领域。
  • methods: 本文分析了视频情感识别方法的特点和性能,并对不同类型的方法进行比较。
  • results: 本文系统性地梳理了2015年至2023年期间发表的视频情感识别研究,包括两种常见情感模型、常用的数据库和现代视频情感识别方法的结构和性能。
    Abstract Video emotion recognition is an important branch of affective computing, and its solutions can be applied in different fields such as human-computer interaction (HCI) and intelligent medical treatment. Although the number of papers published in the field of emotion recognition is increasing, there are few comprehensive literature reviews covering related research on video emotion recognition. Therefore, this paper selects articles published from 2015 to 2023 to systematize the existing trends in video emotion recognition in related studies. In this paper, we first talk about two typical emotion models, then we talk about databases that are frequently utilized for video emotion recognition, including unimodal databases and multimodal databases. Next, we look at and classify the specific structure and performance of modern unimodal and multimodal video emotion recognition methods, talk about the benefits and drawbacks of each, and then we compare them in detail in the tables. Further, we sum up the primary difficulties right now looked by video emotion recognition undertakings and point out probably the most encouraging future headings, such as establishing an open benchmark database and better multimodal fusion strategys. The essential objective of this paper is to assist scholarly and modern scientists with keeping up to date with the most recent advances and new improvements in this speedy, high-influence field of video emotion recognition.
    摘要 视频情感识别是人工智能的重要分支,其解决方案可以应用于不同领域,如人机交互(HCI)和智能医疗治疗。虽然有很多关于情感识别的研究论文发表,但有很少的总结性文献,涵盖相关研究的视频情感识别领域。因此,本文选择2015年至2023年发表的文献,系матизи了视频情感识别领域的现有趋势。在本文中,我们首先介绍了两种典型的情感模型,然后介绍了通常用于视频情感识别的数据库,包括单模态数据库和多模态数据库。接着,我们分析和比较现代单模态和多模态视频情感识别方法的特点和性能,讲述每种方法的优缺点,并在表格中进行详细比较。然后,我们总结了现在视频情感识别项目面临的主要挑战,并提出了未来可能的发展方向,如建立开源标准数据库和更好的多模态融合策略。本文的主要目标是帮助学术和现代科学家保持最新的发展和新进展在这个快速、高影响的领域中。

Efficient Data Fusion using the Tsetlin Machine

  • paper_url: http://arxiv.org/abs/2310.17207
  • repo_url: None
  • paper_authors: Rupsa Saha, Vladimir I. Zadorozhny, Ole-Christoffer Granmo
  • for: 本研究提出了一种新的方法来评估和融合噪音数据,使用Tsetlin机器。
  • methods: 该方法通过监测Tsetlin机器学习的解释逻辑 clause 如何随数据噪音变化,从而识别噪音或者通过新的逻辑 clause 来反映噪音。
  • results: 该方法在不同的数据集上进行了全面的实验研究,得到了高效的结果。
    Abstract We propose a novel way of assessing and fusing noisy dynamic data using a Tsetlin Machine. Our approach consists in monitoring how explanations in form of logical clauses that a TM learns changes with possible noise in dynamic data. This way TM can recognize the noise by lowering weights of previously learned clauses, or reflect it in the form of new clauses. We also perform a comprehensive experimental study using notably different datasets that demonstrated high performance of the proposed approach.
    摘要 我们提出了一种新的方法,使用Tsetlin机器来评估和融合含有噪声的动态数据。我们的方法是通过观察TMC所学得的逻辑条件如何随着可能的噪声在动态数据中变化,从而使TMC能够认可噪声,例如降低先前学习的条件的权重,或者表现为新的条件。我们还进行了对不同数据集的完整实验研究,并得到了高性能的结果。

Taming Gradient Variance in Federated Learning with Networked Control Variates

  • paper_url: http://arxiv.org/abs/2310.17200
  • repo_url: None
  • paper_authors: Xingyan Chen, Yaling Liu, Huaming Du, Mu Wang, Yu Zhao
  • for: 这个研究旨在解决联邦学习中的挑战,包括高昂的通信开销、收敛缓慢和不稳定的性能提升。这些问题主要源于客户端数据分布异质所导致的梯度方差。
    Abstract Federated learning, a decentralized approach to machine learning, faces significant challenges such as extensive communication overheads, slow convergence, and unstable improvements. These challenges primarily stem from the gradient variance due to heterogeneous client data distributions. To address this, we introduce a novel Networked Control Variates (FedNCV) framework for Federated Learning. We adopt the REINFORCE Leave-One-Out (RLOO) as a fundamental control variate unit in the FedNCV framework, implemented at both client and server levels. At the client level, the RLOO control variate is employed to optimize local gradient updates, mitigating the variance introduced by data samples. Once relayed to the server, the RLOO-based estimator further provides an unbiased and low-variance aggregated gradient, leading to robust global updates. This dual-side application is formalized as a linear combination of composite control variates. We provide a mathematical expression capturing this integration of double control variates within FedNCV and present three theoretical results with corresponding proofs. This unique dual structure equips FedNCV to address data heterogeneity and scalability issues, thus potentially paving the way for large-scale applications. Moreover, we tested FedNCV on six diverse datasets under a Dirichlet distribution with {\alpha} = 0.1, and benchmarked its performance against six SOTA methods, demonstrating its superiority.
    摘要 federated learning, a decentralized machine learning approach, faces significant challenges such as extensive communication overheads, slow convergence, and unstable improvements. these challenges primarily stem from the gradient variance due to heterogeneous client data distributions. to address this, we introduce a novel networked control variates (fedncov) framework for federated learning. we adopt the reinforce leave-one-out (rloo) as a fundamental control variate unit in the fedncov framework, implemented at both client and server levels. at the client level, the rloo control variate is employed to optimize local gradient updates, mitigating the variance introduced by data samples. once relayed to the server, the rloo-based estimator further provides an unbiased and low-variance aggregated gradient, leading to robust global updates. this dual-side application is formalized as a linear combination of composite control variates. we provide a mathematical expression capturing this integration of double control variates within fedncov and present three theoretical results with corresponding proofs. this unique dual structure equips fedncov to address data heterogeneity and scalability issues, thus potentially paving the way for large-scale applications. moreover, we tested fedncov on six diverse datasets under a dirichlet distribution with α = 0.1, and benchmarked its performance against six sota methods, demonstrating its superiority.
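The RLOO control variate at the core of FedNCV can be illustrated on a toy bandit: for K sampled actions, each sample's baseline is the mean return of the other K-1 samples, which keeps the REINFORCE gradient unbiased while reducing its variance. The sketch below shows only this estimator, not the federated client/server machinery.

```python
# REINFORCE Leave-One-Out (RLOO) gradient estimator on a toy 3-armed bandit.
import torch
import torch.nn.functional as F

def rloo_policy_gradient_loss(logits, rewards, actions):
    """logits: (K, n_actions); rewards, actions: (K,). Returns a surrogate loss
    whose gradient is the RLOO estimate of the policy gradient."""
    k = rewards.shape[0]
    logp = F.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    baseline = (rewards.sum() - rewards) / (k - 1)   # leave-one-out mean of the other samples
    advantage = rewards - baseline
    return -(advantage.detach() * logp).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    true_rewards = torch.tensor([0.1, 0.9, 0.3])     # arm 1 is best
    params = torch.zeros(3, requires_grad=True)
    opt = torch.optim.Adam([params], lr=0.1)
    for _ in range(200):
        k = 8
        logits = params.expand(k, 3)
        actions = torch.distributions.Categorical(logits=logits).sample()
        rewards = true_rewards[actions] + 0.05 * torch.randn(k)
        loss = rloo_policy_gradient_loss(logits, rewards, actions)
        opt.zero_grad(); loss.backward(); opt.step()
    print(params.softmax(dim=-1))                    # probability mass should favor arm 1
```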

How do Language Models Bind Entities in Context?

  • paper_url: http://arxiv.org/abs/2310.17191
  • repo_url: None
  • paper_authors: Jiahai Feng, Jacob Steinhardt
  • for: 这篇论文旨在探讨语言模型(LM)如何在上下文中使用符号知识,具体来说是如何将形态绑定到其特征上。
  • methods: 这篇论文使用了 causal intervention 技术,来检查 LM 内部活动是否表示绑定信息,并发现了绑定 ID 机制,即在大型 Pythia 和 LLaMA 模型中每一个模型都具有解决绑定问题的一致性机制。
  • results: 研究发现,LM 的内部活动实际上将形态绑定到其特征上,并且绑定 ID 向量组成一个连续的子空间,在这个子空间中,绑定 ID 向量之间的距离反映了它们的推理程度。总的来说,这些结果揭示了 LM 在上下文中表示符号知识的可读性策略,为大规模 LM 的上下文理解提供了一个重要的步阶。
    Abstract To correctly use in-context information, language models (LMs) must bind entities to their attributes. For example, given a context describing a "green square" and a "blue circle", LMs must bind the shapes to their respective colors. We analyze LM representations and identify the binding ID mechanism: a general mechanism for solving the binding problem, which we observe in every sufficiently large model from the Pythia and LLaMA families. Using causal interventions, we show that LMs' internal activations represent binding information by attaching binding ID vectors to corresponding entities and attributes. We further show that binding ID vectors form a continuous subspace, in which distances between binding ID vectors reflect their discernability. Overall, our results uncover interpretable strategies in LMs for representing symbolic knowledge in-context, providing a step towards understanding general in-context reasoning in large-scale LMs.
    摘要 <>将文本翻译成简化中文。<>为正确地使用上下文信息,语言模型(LM)必须将实体绑定到其属性上。例如,在一个描述绿色正方形和蓝色圆形的上下文中,LM必须将形状绑定到它们的相应颜色上。我们分析LM表示形式和识别绑定机制:一种通用的解决绑定问题的机制,我们在Pyythia和LLaMA家族中的每个足够大的模型中都观察到。使用 causal intervention,我们表明LM内部的活动表示绑定信息,将绑定ID向量附加到对应的实体和属性上。我们还表明绑定ID向量组成一个连续的子空间,在这个子空间中,绑定ID向量之间的距离反映它们的推理程度。总之,我们的结果揭示了LM在含义上的具体推理策略,提供了解决大规模LM的普遍性含义理解的一个步骤。

Understanding the Effects of Projectors in Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2310.17183
  • repo_url: https://github.com/chenyd7/pefd
  • paper_authors: Yudong Chen, Sen Wang, Jiajun Liu, Xuwei Xu, Frank de Hoog, Brano Kusy, Zi Huang
  • for: 这篇论文旨在调查隐藏在知识储存过程中的投影器(feature distillation)的作用,即使学生和教师网络具有相同的特征维度。
  • methods: 该论文使用了预训练的教师网络和学生网络,并在学生网络中添加了投影器来进行特征转换。
  • results: 该研究发现,即使学生和教师网络具有相同的特征维度,投影器仍然能够提高知识储存性能。此外,投影器甚至在逻辑分布式学习中也能够提高性能。这些发现驱动了 authors 提出了一种基于投影器集合的特征储存方法,以进一步提高知识储存性能。
    Abstract Conventionally, during the knowledge distillation process (e.g. feature distillation), an additional projector is often required to perform feature transformation due to the dimension mismatch between the teacher and the student networks. Interestingly, we discovered that even if the student and the teacher have the same feature dimensions, adding a projector still helps to improve the distillation performance. In addition, projectors even improve logit distillation if we add them to the architecture too. Inspired by these surprising findings and the general lack of understanding of the projectors in the knowledge distillation process from existing literature, this paper investigates the implicit role that projectors play but so far have been overlooked. Our empirical study shows that the student with a projector (1) obtains a better trade-off between the training accuracy and the testing accuracy compared to the student without a projector when it has the same feature dimensions as the teacher, (2) better preserves its similarity to the teacher beyond shallow and numeric resemblance, from the view of Centered Kernel Alignment (CKA), and (3) avoids being over-confident as the teacher does at the testing phase. Motivated by the positive effects of projectors, we propose a projector ensemble-based feature distillation method to further improve distillation performance. Despite the simplicity of the proposed strategy, empirical results from the evaluation of classification tasks on benchmark datasets demonstrate the superior classification performance of our method on a broad range of teacher-student pairs and verify from the aspects of CKA and model calibration that the student's features are of improved quality with the projector ensemble design.
    摘要 通常在知识塑化过程中(例如特征塑化),需要添加一个投影器来实现特征转换,因为教师和学生网络的维度不匹配。 Interestingly, we discovered that even if the student and the teacher have the same feature dimensions, adding a projector still helps to improve the distillation performance. In addition, projectors even improve logit distillation if we add them to the architecture too. inspirited by these surprising findings and the general lack of understanding of the projectors in the knowledge distillation process from existing literature, this paper investigates the implicit role that projectors play but so far have been overlooked. Our empirical study shows that the student with a projector (1) obtains a better trade-off between the training accuracy and the testing accuracy compared to the student without a projector when it has the same feature dimensions as the teacher, (2) better preserves its similarity to the teacher beyond shallow and numeric resemblance, from the view of Centered Kernel Alignment (CKA), and (3) avoids being over-confident as the teacher does at the testing phase. Motivated by the positive effects of projectors, we propose a projector ensemble-based feature distillation method to further improve distillation performance. Despite the simplicity of the proposed strategy, empirical results from the evaluation of classification tasks on benchmark datasets demonstrate the superior classification performance of our method on a broad range of teacher-student pairs and verify from the aspects of CKA and model calibration that the student's features are of improved quality with the projector ensemble design.
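A minimal sketch of projector-ensemble feature distillation: the student's features are passed through several independent projectors (even when student and teacher dimensions already match) and the per-projector feature-matching losses are averaged; this distillation term is then added to the usual task loss. The projector architecture and the MSE feature loss are illustrative assumptions.

```python
# Projector-ensemble feature distillation term.
import torch
import torch.nn as nn

class ProjectorEnsembleDistiller(nn.Module):
    def __init__(self, student_dim, teacher_dim, num_projectors=3):
        super().__init__()
        self.projectors = nn.ModuleList(
            nn.Sequential(nn.Linear(student_dim, teacher_dim), nn.BatchNorm1d(teacher_dim))
            for _ in range(num_projectors)
        )

    def forward(self, student_feat, teacher_feat):
        losses = [((p(student_feat) - teacher_feat) ** 2).mean() for p in self.projectors]
        return torch.stack(losses).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    distiller = ProjectorEnsembleDistiller(student_dim=512, teacher_dim=512)
    s_feat = torch.randn(32, 512)                 # student backbone features
    t_feat = torch.randn(32, 512).detach()        # frozen teacher features
    kd_loss = distiller(s_feat, t_feat)
    # total loss = task cross-entropy + lambda * kd_loss (the weight is a hyperparameter)
    print(float(kd_loss))
```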

Graphical Object-Centric Actor-Critic

  • paper_url: http://arxiv.org/abs/2310.17178
  • repo_url: None
  • paper_authors: Leonid Ugadiarov, Aleksandr I. Panov
  • for: 提高image-based object-centric reinforcement learning任务中的策略学习效果
  • methods: 使用actor-critic和model-based方法,使用transformer编码器提取对象表示,使用图 neural network逼近环境动力学
  • results: 在3D机器人环境和2Dcompositional结构环境中表现较为出色,比对state-of-the-art模型自由actor-critic算法和monolithic模型基础算法更好
    Abstract There have recently been significant advances in the problem of unsupervised object-centric representation learning and its application to downstream tasks. The latest works support the argument that employing disentangled object representations in image-based object-centric reinforcement learning tasks facilitates policy learning. We propose a novel object-centric reinforcement learning algorithm combining actor-critic and model-based approaches to utilize these representations effectively. In our approach, we use a transformer encoder to extract object representations and graph neural networks to approximate the dynamics of an environment. The proposed method fills a research gap in developing efficient object-centric world models for reinforcement learning settings that can be used for environments with discrete or continuous action spaces. Our algorithm performs better in a visually complex 3D robotic environment and a 2D environment with compositional structure than the state-of-the-art model-free actor-critic algorithm built upon transformer architecture and the state-of-the-art monolithic model-based algorithm.
    摘要 近来,无监督物体归一表示学习问题得到了重要进展,以及其应用于下游任务中。最新的研究证明了使用分离的物体表示在图像基本的反馈学习任务中帮助策略学习。我们提出了一种新的物体中心的奖励学习算法,将actor-critic和模型基础方法结合起来,以有效利用这些表示。在我们的方法中,我们使用变换器编码器提取物体表示,并使用图 neural network来近似环境的动态。我们的方法填充了奖励学习设置中的物体中心世界模型的研究空白,可以用于具有离散或连续动作空间的环境。我们的算法在三维 робоット环境和二维 Compositional 结构环境中表现更好than当前无监督actor-critic算法和单一模型基础算法。

Bridging The Gaps Between Token Pruning and Full Pre-training via Masked Fine-tuning

  • paper_url: http://arxiv.org/abs/2310.17177
  • repo_url: None
  • paper_authors: Fengyuan Shi, Limin Wang
  • for: 提高适应性和抗遮挡能力,使基本模型更适合用于动态图像转换器的初始化。
  • methods: 使用Masked Fine-Tuning方法,将预训练基本模型与动态图像转换器的token减少策略相匹配,从而解决基本模型与动态模型之间的不一致问题。
  • results: 对ImageNet dataset进行了广泛的实验,显示了基本模型通过Masked Fine-Tuning方法获得了强大的遮挡Robustness和信息损失能力,并且Dynamic ViT在不同的token减少比例下(例如0.8和0.3)获得了更高的准确率。
    Abstract Despite the success of transformers on various computer vision tasks, they suffer from excessive memory and computational cost. Some works present dynamic vision transformers to accelerate inference by pruning redundant tokens. A key to improving token pruning is using well-trained models as initialization for faster convergence and better performance. However, current base models usually adopt full image training, i.e., using full images as inputs and keeping the whole feature maps through the forward process, which causes inconsistencies with dynamic models that gradually reduce tokens, including calculation pattern, information amount and token selection strategy inconsistencies. Inspired by MAE which performs masking and reconstruction self-supervised task, we devise masked fine-tuning to bridge the gaps between pre-trained base models used for initialization and token pruning based dynamic vision transformers, by masking image patches and predicting the image class label based on left unmasked patches. Extensive experiments on ImageNet demonstrate that base models via masked fine-tuning gain strong occlusion robustness and ability against information loss. With this better initialization, Dynamic ViT achieves higher accuracies, especially under large token pruning ratios (e.g., 81.9% vs. 81.3%, and 62.3% vs. 58.9% for DeiT based Dynamic ViT/0.8 and Dynamic ViT/0.3). Moreover, we apply our method into different token pruning based dynamic vision transformers, different pre-trained models and randomly initialized models to demonstrate the generalization ability.
    摘要 尽管变换器在各种计算机视觉任务上取得了成功,但它们受到过度的内存和计算成本的束缚。一些工作提出了动态视觉转换器来加速推理,其中一个关键是使用已经训练过的模型作为初始化以更快地达到更好的性能。然而,当前的基本模型通常采用全像训练,即将全像作为输入,并保留整个特征图进行前进计算,这会导致动态模型逐渐减少token的问题,包括计算模式、信息量和选择策略不一致。受到MAE的启发,我们设计了彩色精度调整来bridging基本模型和动态视觉转换器之间的差异,通过遮盖图像块并预测图像类别标签基于未遮盖的块来进行彩色精度调整。我们在ImageNet上进行了广泛的实验,发现基于彩色精度调整的基本模型在遮盖率较高时(例如0.8和0.3)获得了强大的遮盖异常和信息损失能力。这些更好的初始化使得动态ViT在不同的token遮盖比例(例如81.9% vs. 81.3%,和62.3% vs. 58.9%)上取得了更高的准确率。此外,我们还应用了我们的方法到不同的token遮盖基于动态视觉转换器、不同的预训练模型和随机初始化模型,以 demonstrate其通用性。
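A sketch of the masked fine-tuning recipe under simplifying assumptions: a ViT-style classifier (a tiny stand-in below, rather than a real pre-trained backbone) is fine-tuned while a random subset of patch tokens is dropped and the class label is predicted from the remaining tokens, mimicking the token reduction a dynamic ViT will later perform.

```python
# Masked fine-tuning: drop a fraction of patch tokens, classify from the rest.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=32, patch=4, dim=64, depth=2, num_classes=10):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_tokens = (image_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x, mask_ratio=0.0):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos  # (B, N, D)
        if mask_ratio > 0:
            b, n, _ = tokens.shape
            keep = int(n * (1 - mask_ratio))
            idx = torch.rand(b, n, device=x.device).argsort(dim=1)[:, :keep]
            tokens = tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        return self.head(self.encoder(tokens).mean(dim=1))  # pool only the kept tokens

if __name__ == "__main__":
    torch.manual_seed(0)
    model, ce = TinyViT(), nn.CrossEntropyLoss()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    images, labels = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
    loss = ce(model(images, mask_ratio=0.5), labels)   # one masked fine-tuning step
    opt.zero_grad(); loss.backward(); opt.step()
    print(float(loss))
```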

A Deep Learning Approach to Teeth Segmentation and Orientation from Panoramic X-rays

  • paper_url: http://arxiv.org/abs/2310.17176
  • repo_url: https://github.com/mrinal054/instance_teeth_segmentation
  • paper_authors: Mrinal Kanti Dhar, Mou Deb, D. Madhab, Zeyun Yu
  • for: 这篇研究旨在提高现代口腔健康预算中的精确牙齿分类和方位测量,以便精确诊断、治疗规划和 dental implant 设计。
  • methods: 我们使用了深度学习技术,基于 FUSegNet 模型,并将其改进为具有格子基于注意门的 skip connections。我们还引入了 Orientated bounding box (OBB) 生成,通过主成分分析 (PCA),以精确地 Orient 牙齿。
  • results: 我们在公开的 DNS 资料集上评估了我们的方法,包括 543 枚 panoramic X-ray 图像,得到了 teeth 实例分类中的最高 Intersection-over-Union (IoU) 分数 82.43%,Dice Similarity Coefficient (DSC) 分数 90.37%,以及 Rotated IoU (RIoU) 分数 82.82%。我们还进行了各个牙齿标签和分类性能的详细分析,为未来口腔预算中的精确诊断、治疗规划和个性化医疗带来了promising prospects。
    Abstract Accurate teeth segmentation and orientation are fundamental in modern oral healthcare, enabling precise diagnosis, treatment planning, and dental implant design. In this study, we present a comprehensive approach to teeth segmentation and orientation from panoramic X-ray images, leveraging deep learning techniques. We build our model based on FUSegNet, a popular model originally developed for wound segmentation, and introduce modifications by incorporating grid-based attention gates into the skip connections. We introduce oriented bounding box (OBB) generation through principal component analysis (PCA) for precise tooth orientation estimation. Evaluating our approach on the publicly available DNS dataset, comprising 543 panoramic X-ray images, we achieve the highest Intersection-over-Union (IoU) score of 82.43% and Dice Similarity Coefficient (DSC) score of 90.37% among compared models in teeth instance segmentation. In OBB analysis, we obtain the Rotated IoU (RIoU) score of 82.82%. We also conduct detailed analyses of individual tooth labels and categorical performance, shedding light on strengths and weaknesses. The proposed model's accuracy and versatility offer promising prospects for improving dental diagnoses, treatment planning, and personalized healthcare in the oral domain. Our generated OBB coordinates and codes are available at https://github.com/mrinal054/Instance_teeth_segmentation.
    摘要 准确的牙齿分割和方向是现代口腔医疗中的基本要求,帮助确定精准诊断、治疗规划和植入设计。在本研究中,我们提出了一种涵盖所有牙齿分割和方向的全面方法,基于FUSegNet模型,并通过栅格基于注意机制的修改来提高性能。我们还引入了原则components分析(PCA)来生成方向 bounding box(OBB),以便精准地 Orient estimation。在公共可用的 DNS 数据集上评估我们的方法,包括 543 张扫描图像,我们达到了 teeth 实例分割中最高的 Intersection-over-Union(IoU)分数(82.43%)和 Dice Similarity Coefficient(DSC)分数(90.37%),同时在 OBB 分析中获得了 Rotated IoU(RIoU)分数(82.82%)。我们还进行了精度分析,探讨个体牙齿标签和分类性能,为口腔医疗领域的个性化医疗带来了推荐的前景。我们在 GitHub 上公布了生成的 OBB 坐标和代码,请参考 https://github.com/mrinal054/Instance_teeth_segmentation。
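The PCA-based oriented bounding box step can be illustrated directly: pixel coordinates of a binary tooth mask are centered, the principal axes come from the 2x2 covariance matrix, and the box extents are taken along those axes. Corner ordering and the toy mask below are illustrative, not the paper's exact post-processing.

```python
# PCA-based oriented bounding box (OBB) from a binary mask.
import numpy as np

def obb_from_mask(mask):
    """Return a (4, 2) array of OBB corners (x, y) for a boolean 2-D mask."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    center = pts.mean(axis=0)
    centered = pts - center
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)         # columns are the principal axes
    proj = centered @ eigvecs                      # coordinates in the PCA frame
    mins, maxs = proj.min(axis=0), proj.max(axis=0)
    corners_pca = np.array([[mins[0], mins[1]], [maxs[0], mins[1]],
                            [maxs[0], maxs[1]], [mins[0], maxs[1]]])
    return corners_pca @ eigvecs.T + center        # back to image coordinates

if __name__ == "__main__":
    mask = np.zeros((64, 64), dtype=bool)
    yy, xx = np.mgrid[0:64, 0:64]
    mask[(np.abs((xx - 32) + (yy - 32)) < 6) & (np.abs((xx - 32) - (yy - 32)) < 20)] = True
    print(obb_from_mask(mask).round(1))            # a rotated box around the diagonal blob
```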

Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise

  • paper_url: http://arxiv.org/abs/2310.17167
  • repo_url: None
  • paper_authors: Zhenkai Zhang, Krista A. Ehinger, Tom Drummond
  • for: 本 paper 的两个主要贡献是提高反扩散过程中图像生成的速度和质量。
  • methods: 本 paper 使用了两种方法来提高图像生成的速度和质量,第一种是将扩散过程重parameterized为图像和噪声之间的角度,第二种是直接使用网络来估算图像和噪声的值。
  • results: 根据 Frechet Inception Distance (FID)、spatial Frechet Inception Distance (sFID)、精度和召回率等指标,本文的模型能够更快地收敛到高质量图像,且生成图像的质量更高。
    Abstract This paper introduces two key contributions aimed at improving the speed and quality of images generated through inverse diffusion processes. The first contribution involves reparameterizing the diffusion process in terms of the angle on a quarter-circular arc between the image and noise, specifically setting the conventional $\displaystyle \sqrt{\bar{\alpha}}=\cos(\eta)$. This reparameterization eliminates two singularities and allows for the expression of diffusion evolution as a well-behaved ordinary differential equation (ODE). In turn, this allows higher order ODE solvers such as Runge-Kutta methods to be used effectively. The second contribution is to directly estimate both the image ($\mathbf{x}_0$) and noise ($\mathbf{\epsilon}$) using our network, which enables more stable calculations of the update step in the inverse diffusion steps, as accurate estimation of both the image and noise are crucial at different stages of the process. Together with these changes, our model achieves faster generation, with the ability to converge on high-quality images more quickly, and higher quality of the generated images, as measured by metrics such as Frechet Inception Distance (FID), spatial Frechet Inception Distance (sFID), precision, and recall.
    摘要
  1. Reparameterizing the diffusion process: Instead of using the conventional $\sqrt{\bar{\alpha}} = \cos(\eta)$, we parameterize the diffusion process in terms of the angle between the image and noise on a quarter-circular arc. This eliminates two singularities and allows the diffusion evolution to be expressed as a well-behaved ordinary differential equation (ODE), making it possible to use higher-order ODE solvers such as Runge-Kutta methods.
  2. Direct estimation of image and noise: Our network directly estimates both the image and noise, which ensures more stable calculations of the update step in the inverse diffusion process. Accurate estimation of both the image and noise is crucial at different stages of the process, and our model achieves faster generation of high-quality images, as measured by metrics such as Frechet Inception Distance (FID), spatial Frechet Inception Distance (sFID), precision, and recall.
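A numerical sketch of the angle parameterization: with $\sqrt{\bar{\alpha}}=\cos(\eta)$, the noisy sample is $x_\eta=\cos(\eta)\,x_0+\sin(\eta)\,\epsilon$, and a network that predicts both $x_0$ and $\epsilon$ allows a step to a smaller angle by simple re-composition (a DDIM-style update). The ground-truth $x_0$ and $\epsilon$ stand in for the network below, and the paper's higher-order ODE solver is not reproduced.

```python
# Angle-parameterized forward diffusion and a DDIM-style reverse step.
import numpy as np

def forward_diffuse(x0, eps, eta):
    """x_eta on the quarter-circle arc between the clean image (eta=0) and pure noise (eta=pi/2)."""
    return np.cos(eta) * x0 + np.sin(eta) * eps

def step(x_eta, x0_hat, eps_hat, eta_next):
    """Move the sample to angle eta_next using the current estimates of image and noise."""
    return np.cos(eta_next) * x0_hat + np.sin(eta_next) * eps_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0, eps = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
    etas = np.linspace(np.pi / 2, 0.0, 11)          # from pure noise back to the image
    x = forward_diffuse(x0, eps, etas[0])
    for eta_next in etas[1:]:
        # a real sampler would call the network here: x0_hat, eps_hat = model(x, eta)
        x = step(x, x0, eps, eta_next)
    print(np.abs(x - x0).max())                     # ~0 when the estimates are exact
```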

Content-based Controls For Music Large Language Modeling

  • paper_url: http://arxiv.org/abs/2310.17162
  • repo_url: None
  • paper_authors: Liwei Lin, Gus Xia, Junyan Jiang, Yixiao Zhang
  • for: 这个论文旨在提供一种基于内容的控制方法,以提高大规模语言模型在音乐频域中的音乐生成质量。
  • methods: 该方法使用一种效率高的参数调整方法(PEFT),专门针对基于转换器的音频模型。实验表明,我们的方法可以在具有少量超级vised学习的情况下实现高质量的音乐生成,并且可以具有有效的内容基于的控制能力。
  • results: 我们的方法可以实现高质量的音乐生成,并且可以通过调整旋律和和声来实现有效的内容基于的控制。此外,我们还示出了将内容基于的控制与文本描述结合使用可以实现灵活的音乐变化生成和风格传递。
    Abstract Recent years have witnessed a rapid growth of large-scale language models in the domain of music audio. Such models enable end-to-end generation of higher-quality music, and some allow conditioned generation using text descriptions. However, the control power of text controls on music is intrinsically limited, as they can only describe music indirectly through meta-data (such as singers and instruments) or high-level representations (such as genre and emotion). We aim to further equip the models with direct and content-based controls on innate music languages such as pitch, chords and drum track. To this end, we contribute Coco-Mulla, a content-based control method for music large language modeling. It uses a parameter-efficient fine-tuning (PEFT) method tailored for Transformer-based audio models. Experiments show that our approach achieved high-quality music generation with low-resource semi-supervised learning, tuning with less than 4% parameters compared to the original model and training on a small dataset with fewer than 300 songs. Moreover, our approach enables effective content-based controls, and we illustrate the control power via chords and rhythms, two of the most salient features of music audio. Furthermore, we show that by combining content-based controls and text descriptions, our system achieves flexible music variation generation and style transfer. Our source codes and demos are available online.
    摘要 Translation notes:* "large-scale language models" 大规模语言模型 (dà xiǎng móde lǐng yǔ)* "end-to-end generation" 端到端生成 (dían dào diàn chéng)* "conditioned generation" 受控生成 (fù kòng shēng chéng)* "text descriptions" 文本描述 (wén tiěn mǐng yù)* "meta-data" 元数据 (yuán jí)* "singers and instruments" 歌手和乐器 (gē shǒu hé yuè qì)* "genre and emotion" 种类和情感 (zhòng lèi hé qíng gǎn)* "innate music languages" Native Music Languages (yuán jì yǔ)* "pitch, chords, and drum tracks" 抑弹、和弹、鼓踏 (zuò dì, hé dì, gǔ tà)* "parameter-efficient fine-tuning" 参数高效精度调整 (cèshù gāodégòu jīngdé jiǎo)* "Transformer-based audio models" 基于Transformer的音频模型 (jī yú Transformer de yīn yǐn módel)* "low-resource semi-supervised learning" 半指导式半资源学习 (bàn zhǐdǎo xī bàn zīyuán xuéxí)* "tuning with less than 4% parameters" 使用少于4%参数调整 (shǐyòu xiǎo yú 4% cèshù jiǎo)* "training on a small dataset" 使用小 datasets 训练 (shǐyòu xiǎo dataset zhīngxì)* "fewest than 300 songs" fewer than 300 songs (liǎo xiǎo gē)* "content-based controls" 内容基于的控制 (néngyòu jīyào de kòng zhì)* "chords and rhythms" 和弹和节奏 (hé dì yǔ jié zhù)* "flexible music variation generation" 灵活的音乐变换 (língyòu de yīn yuè biàn huà)* "style transfer" 风格传递 (fēngxìng chuándòu)

CosmosDSR – a methodology for automated detection and tracking of orbital debris using the Unscented Kalman Filter

  • paper_url: http://arxiv.org/abs/2310.17158
  • repo_url: None
  • paper_authors: Daniel S. Roll, Zeyneb Kurt, Wai Lok Woo
  • for: Addressing the Kessler syndrome by detecting and tracking satellites in sequential images.
  • methods: Combining YOLOv3 with an Unscented Kalman Filter (UKF) for tracking satellites, and comparing with a linear Kalman filter (LKF).
  • results: Precise detection and classification of satellite categories with few errors, and accurate tracking of satellites with a mean squared error (MSE) and root mean squared error (RMSE) of 2.83/1.66 for UKF and 2.84/1.66 for LKF.
    Abstract The Kessler syndrome refers to the escalating space debris from frequent space activities, threatening future space exploration. Addressing this issue is vital. Several AI models, including Convolutional Neural Networks, Kernel Principal Component Analysis, and Model-Agnostic Meta- Learning have been assessed with various data types. Earlier studies highlighted the combination of the YOLO object detector and a linear Kalman filter (LKF) for object detection and tracking. Advancing this, the current paper introduces a novel methodology for the Comprehensive Orbital Surveillance and Monitoring Of Space by Detecting Satellite Residuals (CosmosDSR) by combining YOLOv3 with an Unscented Kalman Filter (UKF) for tracking satellites in sequential images. Using the Spacecraft Recognition Leveraging Knowledge of Space Environment (SPARK) dataset for training and testing, the YOLOv3 precisely detected and classified all satellite categories (Mean Average Precision=97.18%, F1=0.95) with few errors (TP=4163, FP=209, FN=237). Both CosmosDSR and an implemented LKF used for comparison tracked satellites accurately for a mean squared error (MSE) and root mean squared error (RME) of MSE=2.83/RMSE=1.66 for UKF and MSE=2.84/RMSE=1.66 for LKF. The current study is limited to images generated in a space simulation environment, but the CosmosDSR methodology shows great potential in detecting and tracking satellites, paving the way for solutions to the Kessler syndrome.
    摘要 《凯斯勒征》指的是由于频繁的空间活动而导致的增加的空间废弃物,这对未来的空间探索造成了威胁。为解决这个问题,许多人使用了人工智能模型,包括卷积神经网络、基准 principl component analysis 和模型无关元学习。在之前的研究中,拟合了 YOLO 对象检测器和线性 Kalman 筛(LKF)的结合,用于对象检测和跟踪。现在的论文介绍了一种新的方法,即 CosmosDSR,它将 YOLOv3 与不确定 Kalman 筛(UKF)结合,用于在顺序图像中跟踪卫星。使用 SPARK 数据集进行训练和测试,YOLOv3 精确地检测和分类了所有卫星类别(平均精度=97.18%, F1=0.95),只有一些错误(TP=4163, FP=209, FN=237)。两种 CosmosDSR 和 LKF 的实现都可以准确地跟踪卫星,MSE 和 RMSE 分别为 MSE=2.83/RMSE=1.66。当前的研究只是在空间模拟环境中生成的图像上进行的,但 CosmosDSR 方法具有很好的潜在性,可以用于检测和跟踪卫星,为解决凯斯勒征提供了新的解决方案。
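To illustrate the tracking half of the pipeline, the sketch below implements the linear Kalman filter used as the comparison baseline, with a constant-velocity model over a detected centroid; the UKF in CosmosDSR follows the same predict/update structure but propagates sigma points instead. Detector outputs are simulated here rather than produced by YOLOv3.

```python
# Constant-velocity linear Kalman filter over a detected object centroid.
import numpy as np

class ConstantVelocityKF:
    def __init__(self, dt=1.0, q=1e-2, r=1.0):
        self.F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]], float)
        self.H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)
        self.Q, self.R = q * np.eye(4), r * np.eye(2)
        self.x, self.P = np.zeros(4), np.eye(4) * 10.0

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        y = z - self.H @ self.x                      # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    kf = ConstantVelocityKF()
    true = np.array([10.0, 20.0])
    for t in range(30):
        true += np.array([1.5, -0.8])                # satellite centroid moving across frames
        detection = true + rng.normal(scale=1.0, size=2)   # noisy detector output
        kf.predict()
        est = kf.update(detection)
    print(np.round(est, 2), np.round(true, 2))
```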

Technical Note: Feasibility of translating 3.0T-trained Deep-Learning Segmentation Models Out-of-the-Box on Low-Field MRI 0.55T Knee-MRI of Healthy Controls

  • paper_url: http://arxiv.org/abs/2310.17152
  • repo_url: None
  • paper_authors: Rupsa Bhattacharjee, Zehra Akkaya, Johanna Luitjens, Pan Su, Yang Yang, Valentina Pedoia, Sharmila Majumdar
    for: 这项研究的目的是评估将深度学习技术应用于评估双下肢骨骼标记的可能性,并将其应用于健康控制人群的0.55T MR 影像中。methods: 这项研究使用了标准的实践中的骨和软组织分割算法,并对其进行质量和量化的评估,以确定在0.55T和3.0T之间的差异。results: 初步结果表明,可以将现有的高级深度学习图像分割技术,训练在3.0T上,翻译到0.55T上,并在多 vendor 环境中实现可用到良好的技术可行性。尤其是在分割软组织 compartment 方面,模型表现几乎相当于3.0T。这表明,0.55T低场磁共振成像可以用于评估双下肢骨骼标记,并且可以通过使用现有的深度学习图像分割技术来提高表征性。
    Abstract In the current study, our purpose is to evaluate the feasibility of applying deep learning (DL) enabled algorithms to quantify bilateral knee biomarkers in healthy controls scanned at 0.55T and compared with 3.0T. The current study assesses the performance of standard in-practice bone, and cartilage segmentation algorithms at 0.55T, both qualitatively and quantitatively, in terms of comparing segmentation performance, areas of improvement, and compartment-wise cartilage thickness values between 0.55T vs. 3.0T. Initial results demonstrate a usable to good technical feasibility of translating existing quantitative deep-learning-based image segmentation techniques, trained on 3.0T, out of 0.55T for knee MRI, in a multi-vendor acquisition environment. Especially in terms of segmenting cartilage compartments, the models perform almost equivalent to 3.0T in terms of Likert ranking. The 0.55T low-field sustainable and easy-to-install MRI, as demonstrated, thus, can be utilized for evaluating knee cartilage thickness and bone segmentations aided by established DL algorithms trained at higher-field strengths out-of-the-box initially. This could be utilized at the far-spread point-of-care locations with a lack of radiologists available to manually segment low-field images, at least till a decent base of low-field data pool is collated. With further fine-tuning with manual labeling of low-field data or utilizing synthesized higher SNR images from low-field images, OA biomarker quantification performance is potentially guaranteed to be further improved.
    摘要 当前研究的目的是评估使用深度学习(DL)启用算法来评估双侧膝关节生物标志物理量的可能性。研究现在评估0.55T中标准实践骨和软组织分割算法的性能,包括对比分割性能、改进方向和软组织厚度值 между0.55T和3.0T。初步结果表明可以将已经训练在3.0T上的量化深度学习图像分割技术翻译到0.55T,并在多 vendor acquisition 环境中实现了技术可行性。特别是在分割软组织COMPARTMENT中,模型表现了几乎相同的Likert排名。因此,0.55T的低场可持续和易于安装的MRI可以用于评估膝软组织厚度和骨分割,并且可以通过已有的DL算法在更高的场 strengths 中进行外部调试。这可以在覆盖医疗机构的各个点批处理地点使用,至少ntil a decent base of low-field data pool is collated。通过进一步细化的手动标注低场数据或使用生成的更高SNR图像来进行优化,OA生物标志量的评估性能可能会得到进一步改进。

Explainable Spatio-Temporal Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2310.17149
  • repo_url: https://github.com/hkuds/stexplainer
  • paper_authors: Jiabin Tang, Lianghao Xia, Chao Huang
  • for: 这个论文的目的是提出一个可解释的城市空间时间图预测模型(STExplainer),以增强城市 aplicatons 中的内置可解释性。
  • methods: 这个模型使用了一个统一的城市空间图注意力网络(STGNN),加上一个位置信息融合层,以解决城市空间时间资料的黑盒问题。此外,我们还提出了一个结构炼分法,基于图形信息瓶颈(GIB)原则,并使用了一个可解释的目标函数。
  • results: 经过广泛的实验,我们证明了我们的 STExplainer 模型在交通和犯罪预测任务上比基于state-of-the-art 的基础模型表现更好,并且在预测精度和可解释度(例如给定和实际)方面均达到了优秀的成绩。此外,我们的模型还能够有效地解决资料缺失和稀疏性问题。
    Abstract Spatio-temporal graph neural networks (STGNNs) have gained popularity as a powerful tool for effectively modeling spatio-temporal dependencies in diverse real-world urban applications, including intelligent transportation and public safety. However, the black-box nature of STGNNs limits their interpretability, hindering their application in scenarios related to urban resource allocation and policy formulation. To bridge this gap, we propose an Explainable Spatio-Temporal Graph Neural Networks (STExplainer) framework that enhances STGNNs with inherent explainability, enabling them to provide accurate predictions and faithful explanations simultaneously. Our framework integrates a unified spatio-temporal graph attention network with a positional information fusion layer as the STG encoder and decoder, respectively. Furthermore, we propose a structure distillation approach based on the Graph Information Bottleneck (GIB) principle with an explainable objective, which is instantiated by the STG encoder and decoder. Through extensive experiments, we demonstrate that our STExplainer outperforms state-of-the-art baselines in terms of predictive accuracy and explainability metrics (i.e., sparsity and fidelity) on traffic and crime prediction tasks. Furthermore, our model exhibits superior representation ability in alleviating data missing and sparsity issues. The implementation code is available at: https://github.com/HKUDS/STExplainer.
    摘要 随着城市应用的多样化和复杂性的增加,随时空 Graph Neural Networks (STGNNs) 已经得到了广泛的应用,以模型城市中的随时空关系。然而,黑盒模型的限制使得 STGNNs 的解释性受到限制,从而阻碍其在城市资源分配和政策制定方面的应用。为了bridging这个鸿沟,我们提出了一个可解释的随时空 Graph Neural Networks (STExplainer) 框架,该框架可以增强 STGNNs 的解释性,使其同时提供高准确率和 faithful 的预测和解释。我们的框架包括一个统一的随时空 Graph attention网络和一个位置信息融合层作为 STG Encoder 和 Decoder,分别。此外,我们提出了基于 Graph Information Bottleneck (GIB) 原理的结构填充方法,该方法通过一个可解释的目标函数实现。经过广泛的实验,我们证明了我们的 STExplainer 在交通预测和犯罪预测任务上的预测精度和解释性指标(即稀疏性和准确性)高于当前基线。此外,我们的模型在缺失数据和稀疏性问题下的表示能力也更高。代码可以在 GitHub 上获取:https://github.com/HKUDS/STExplainer。

Counterfactual-Augmented Importance Sampling for Semi-Offline Policy Evaluation

  • paper_url: http://arxiv.org/abs/2310.17146
  • repo_url: https://github.com/mld3/counterfactualannot-semiope
  • paper_authors: Shengpu Tang, Jenna Wiens
  • for: 这个论文是用于推广强化学习(RL)在高风险领域的应用,并通过观察数据进行量化和质量evaluation,以帮助实践者理解新策略的总体性能。
  • methods: 这个论文提出了一种半在线评估框架,通过询问人类用户提供不可见的对比性轨迹的注释,以帮助解决在线评估不可能进行的问题。
  • results: 这个论文的实验表明,相比标准的强化学习评估器,该半在线评估框架可以减少偏见和噪声,并且在不完整的注释情况下表现更加稳定。
    Abstract In applying reinforcement learning (RL) to high-stakes domains, quantitative and qualitative evaluation using observational data can help practitioners understand the generalization performance of new policies. However, this type of off-policy evaluation (OPE) is inherently limited since offline data may not reflect the distribution shifts resulting from the application of new policies. On the other hand, online evaluation by collecting rollouts according to the new policy is often infeasible, as deploying new policies in these domains can be unsafe. In this work, we propose a semi-offline evaluation framework as an intermediate step between offline and online evaluation, where human users provide annotations of unobserved counterfactual trajectories. While tempting to simply augment existing data with such annotations, we show that this naive approach can lead to biased results. Instead, we design a new family of OPE estimators based on importance sampling (IS) and a novel weighting scheme that incorporate counterfactual annotations without introducing additional bias. We analyze the theoretical properties of our approach, showing its potential to reduce both bias and variance compared to standard IS estimators. Our analyses reveal important practical considerations for handling biased, noisy, or missing annotations. In a series of proof-of-concept experiments involving bandits and a healthcare-inspired simulator, we demonstrate that our approach outperforms purely offline IS estimators and is robust to imperfect annotations. Our framework, combined with principled human-centered design of annotation solicitation, can enable the application of RL in high-stakes domains.
    摘要 在应用强化学习(RL)到高风险领域时,可以使用观察数据进行量化和质量evaluation来帮助实践者理解新策略的总结性性能。然而,这种Off-policy评估(OPE)是由于新策略的应用而导致的分布变化的限制。相反,在线评估,通过根据新策略收集滚动数据,可以是不可靠的,因为在这些领域中部署新策略可能是不安全的。在这种情况下,我们提出了一种半Offline评估框架,作为在线和Offline评估之间的中间步骤,在这里,人类用户提供了未观察的contrastive Trajectory的注释。虽然有吸引力地将现有数据 augmented with这些注释,但我们表明这种Naive Approach可能会导致偏向结果。相反,我们设计了一种基于重要性抽样(IS)和一种新的权重方案的新家族OPE估计器,可以在不引入额外偏向的情况下,利用contrastive注释进行估计。我们分析了我们的方法的理论性质,并表明它在减少偏向和方差方面具有潜在的优势。我们的分析还揭示了在处理偏向、杂音或缺失注释时的重要实践考虑事项。在一系列Proof-of-concept实验中,我们示出了我们的方法可以在bandits和一种医疗领域的模拟器中出performances,并且可以抗护免着不完整的注释。我们的框架,结合人类中心的注释 solicitation设计,可以帮助RL在高风险领域应用。
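A sketch of the standard per-trajectory importance sampling estimator that the semi-offline framework builds on: observed trajectories are re-weighted by the ratio of evaluation-policy to behavior-policy action probabilities. The paper's contribution, folding human counterfactual annotations into a new weighting scheme without adding bias, is only marked by a comment below, not reproduced.

```python
# Per-trajectory importance sampling (IS) estimate of an evaluation policy's value.
import numpy as np

def is_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """trajectories: list of [(state, action, reward), ...].
    pi_e / pi_b: callables giving action probabilities under the evaluation
    and behavior policies. Returns the IS estimate of the evaluation policy's value."""
    values = []
    for traj in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for s, a, r in traj:
            weight *= pi_e(s, a) / pi_b(s, a)
            ret += discount * r
            discount *= gamma
        # Counterfactual annotations for actions the evaluation policy prefers but the
        # data never contains would be folded in here via the paper's weighting scheme.
        values.append(weight * ret)
    return float(np.mean(values))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pi_b = lambda s, a: 0.5                          # behavior policy: uniform over 2 actions
    pi_e = lambda s, a: 0.8 if a == 1 else 0.2       # evaluation policy prefers action 1
    trajectories = []
    for _ in range(2000):
        a = rng.integers(0, 2)
        r = float(a) + rng.normal(scale=0.1)         # action 1 yields higher reward
        trajectories.append([(0, int(a), r)])
    print(is_estimate(trajectories, pi_e, pi_b))     # should be close to 0.8, pi_e's true value
```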

Symbolic Planning and Code Generation for Grounded Dialogue

  • paper_url: http://arxiv.org/abs/2310.17140
  • repo_url: https://github.com/justinchiu/onecommon-gpt
  • paper_authors: Justin T. Chiu, Wenting Zhao, Derek Chen, Saujas Vaduguru, Alexander M. Rush, Daniel Fried
  • for: 这个论文的目的是提出一种可组合和可解释的对话系统,以解决现有的对话系统在跟踪目标和处理新的grounding方面的缺陷。
  • methods: 该系统包括一个读取器和一个规划器:读取器使用大语言模型将对话伙伴的话语转换成可执行代码,并调用函数来完成grounding。符号规划器使用符号计划法确定下一个最佳回答。
  • results: 该系统在OneCommon对话任务中表现出色,成功率从56%提高到69%,在最复杂的设定下也有显著提升。
    Abstract Large language models (LLMs) excel at processing and generating both text and code. However, LLMs have had limited applicability in grounded task-oriented dialogue as they are difficult to steer toward task objectives and fail to handle novel grounding. We present a modular and interpretable grounded dialogue system that addresses these shortcomings by composing LLMs with a symbolic planner and grounded code execution. Our system consists of a reader and planner: the reader leverages an LLM to convert partner utterances into executable code, calling functions that perform grounding. The translated code's output is stored to track dialogue state, while a symbolic planner determines the next appropriate response. We evaluate our system's performance on the demanding OneCommon dialogue task, involving collaborative reference resolution on abstract images of scattered dots. Our system substantially outperforms the previous state-of-the-art, including improving task success in human evaluations from 56% to 69% in the most challenging setting.
    摘要 大型语言模型(LLM)在处理和生成文本和代码方面表现出色,但LLM在固定任务对话中有限的应用可能性,主要是因为它们难以追导到任务目标并处理新的固定。我们提出了一个模块化和可解释的基于符号计划的对话系统,这个系统通过将LLM与符号计划和基于符号的代码执行结合起来,以解决这些缺点。我们的系统包括读者和计划器:读者使用LLM将伙伴的话语转换为执行代码,并调用函数来进行固定。转换后的代码的输出被存储以跟踪对话状态,而符号计划器根据对话状态确定下一个适当的回应。我们对一个具有抽象点云图像的OneCommon对话任务进行了评估,并substantially outperformed前一个状态的艺术。在最复杂的设定下,我们的系统的任务成功率从56%提高到69%。

Core Challenge 2023: Solver and Graph Descriptions

  • paper_url: http://arxiv.org/abs/2310.17136
  • repo_url: None
  • paper_authors: Takehide Soh, Tomoya Tanjo, Yoshio Okamoto, Takehiro Ito
  • for: This work collects the descriptions of solvers and ISR instances submitted to CoRe Challenge 2023.
  • methods: The collected solver and ISR instance descriptions characterize the problems posed in CoRe Challenge 2023.
  • results: All solver and ISR instance descriptions from CoRe Challenge 2023 are compiled to support follow-up analysis and research.
    Abstract This paper collects all descriptions of solvers and ISR instances submitted to CoRe Challenge 2023.

Incorporating Probing Signals into Multimodal Machine Translation via Visual Question-Answering Pairs

  • paper_url: http://arxiv.org/abs/2310.17133
  • repo_url: https://github.com/libeineu/mmt-vqa
  • paper_authors: Yuxin Zuo, Bei Li, Chuanhao Lv, Tong Zheng, Tong Xiao, Jingbo Zhu
  • for: The paper studies how the completeness of text inputs affects multimodal machine translation (MMT) systems and proposes a new way to promote cross-modal interaction.
  • methods: It generates Visual Question-Answering (VQA) style pairs from the source text and uses large language models (LLMs) to explicitly model the probing signal in MMT.
  • results: Experiments show that the new approach improves MMT performance and helps MMT systems make better use of image information.
    Abstract This paper presents an in-depth study of multimodal machine translation (MMT), examining the prevailing understanding that MMT systems exhibit decreased sensitivity to visual information when text inputs are complete. Instead, we attribute this phenomenon to insufficient cross-modal interaction, rather than image information redundancy. A novel approach is proposed to generate parallel Visual Question-Answering (VQA) style pairs from the source text, fostering more robust cross-modal interaction. Using Large Language Models (LLMs), we explicitly model the probing signal in MMT to convert it into VQA-style data to create the Multi30K-VQA dataset. An MMT-VQA multitask learning framework is introduced to incorporate explicit probing signals from the dataset into the MMT training process. Experimental results on two widely-used benchmarks demonstrate the effectiveness of this novel approach. Our code and data would be available at: \url{https://github.com/libeineu/MMT-VQA}.

Unleashing the potential of GNNs via Bi-directional Knowledge Transfer

  • paper_url: http://arxiv.org/abs/2310.17132
  • repo_url: None
  • paper_authors: Shuai Zheng, Zhizhe Liu, Zhenfeng Zhu, Xingxing Zhang, Jianxin Li, Yao Zhao
  • for: Improving the performance of graph neural networks (GNNs).
  • methods: Starting from the feature transformation operation in the message-passing framework, the paper proposes Bi-directional Knowledge Transfer (BiKT), which unlocks the potential of GNNs without modifying the original architecture.
  • results: Extensive experiments on 7 datasets with 5 typical GNNs show that BiKT improves GNN performance by up to 0.5%-4%, and the derived model can also be applied independently to specific downstream tasks.
    Abstract Based on the message-passing paradigm, there has been an amount of research proposing diverse and impressive feature propagation mechanisms to improve the performance of GNNs. However, less focus has been put on feature transformation, another major operation of the message-passing framework. In this paper, we first empirically investigate the performance of the feature transformation operation in several typical GNNs. Unexpectedly, we notice that GNNs do not completely free up the power of the inherent feature transformation operation. By this observation, we propose the Bi-directional Knowledge Transfer (BiKT), a plug-and-play approach to unleash the potential of the feature transformation operations without modifying the original architecture. Taking the feature transformation operation as a derived representation learning model that shares parameters with the original GNN, the direct prediction by this model provides a topological-agnostic knowledge feedback that can further instruct the learning of GNN and the feature transformations therein. On this basis, BiKT not only allows us to acquire knowledge from both the GNN and its derived model but promotes each other by injecting the knowledge into the other. In addition, a theoretical analysis is further provided to demonstrate that BiKT improves the generalization bound of the GNNs from the perspective of domain adaption. An extensive group of experiments on up to 7 datasets with 5 typical GNNs demonstrates that BiKT brings up to 0.5% - 4% performance gain over the original GNN, which means a boosted GNN is obtained. Meanwhile, the derived model also shows a powerful performance to compete with or even surpass the original GNN, enabling us to flexibly apply it independently to some other specific downstream tasks.

Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models

  • paper_url: http://arxiv.org/abs/2310.17120
  • repo_url: None
  • paper_authors: Reshmi Ghosh, Harjeet Singh Kajal, Sharanya Kamath, Dhuri Shrivastava, Samyadeep Basu, Hansi Zeng, Soundararajan Srinivasan
  • for: This paper analyzes the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts and evaluates the effectiveness of different loss functions for improving segmentation results in unstructured conversational datasets.
  • methods: The paper trains models from scratch with a small-sized dataset of the target unstructured domain and experiments with multiple loss functions (Cross-Entropy, re-weighted Cross-Entropy, and Focal Loss) to mitigate the effects of imbalance in unstructured conversational datasets.
  • results: Training from scratch with a small-sized dataset of the target unstructured domain improves segmentation results by a significant margin, and the Focal Loss function is a robust alternative to the Cross-Entropy and re-weighted Cross-Entropy loss functions when segmenting unstructured and semi-structured chats.
    Abstract Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current works on topic segmentation often focus on segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) Current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not help in transferability to unstructured conversational data. (b) Training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin. We stress-test our proposed Topic Segmentation approach by experimenting with multiple loss functions, in order to mitigate effects of imbalance in unstructured conversational datasets. Our empirical evaluation indicates that Focal Loss function is a robust alternative to Cross-Entropy and re-weighted Cross-Entropy loss function when segmenting unstructured and semi-structured chats.
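For reference, the Focal Loss compared above takes the standard form FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); a minimal sketch follows, with illustrative gamma, alpha, and label values that are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits:  (N, C) unnormalized scores; targets: (N,) class indices.
    alpha:   optional (C,) per-class weights to counter class imbalance.
    """
    ce = F.cross_entropy(logits, targets, reduction="none")  # -log p_t
    p_t = torch.exp(-ce)                                     # model prob. of the true class
    loss = (1.0 - p_t) ** gamma * ce
    if alpha is not None:
        loss = alpha[targets] * loss
    return loss.mean()

# Toy binary labels (topic boundary vs. not), heavily imbalanced (illustrative).
logits = torch.randn(32, 2)
targets = (torch.rand(32) < 0.1).long()          # ~10% positive "boundary" labels
alpha = torch.tensor([0.25, 0.75])               # up-weight the rare class
print(focal_loss(logits, targets, gamma=2.0, alpha=alpha).item())
```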

Detecting stealthy cyberattacks on adaptive cruise control vehicles: A machine learning approach

  • paper_url: http://arxiv.org/abs/2310.17091
  • repo_url: None
  • paper_authors: Tianyi Li, Mingfeng Shang, Shian Wang, Raphael Stern
  • for: The paper addresses the detection of cyberattacks on vehicles equipped with advanced driver-assistance systems (ADAS) and automated driving features.
  • methods: The paper proposes a traffic model framework for three types of potential cyberattacks, and uses a novel generative adversarial network (GAN)-based anomaly detection model to identify such attacks in real time using vehicle trajectory data.
  • results: The paper provides numerical evidence of the efficacy of the proposed machine learning approach in detecting cyberattacks on ACC-equipped vehicles, and compares the results against some recently proposed neural network models.
    Abstract With the advent of vehicles equipped with advanced driver-assistance systems, such as adaptive cruise control (ACC) and other automated driving features, the potential for cyberattacks on these automated vehicles (AVs) has emerged. While overt attacks that force vehicles to collide may be easily identified, more insidious attacks, which only slightly alter driving behavior, can result in network-wide increases in congestion, fuel consumption, and even crash risk without being easily detected. To address the detection of such attacks, we first present a traffic model framework for three types of potential cyberattacks: malicious manipulation of vehicle control commands, false data injection attacks on sensor measurements, and denial-of-service (DoS) attacks. We then investigate the impacts of these attacks at both the individual vehicle (micro) and traffic flow (macro) levels. A novel generative adversarial network (GAN)-based anomaly detection model is proposed for real-time identification of such attacks using vehicle trajectory data. We provide numerical evidence to demonstrate the efficacy of our machine learning approach in detecting cyberattacks on ACC-equipped vehicles. The proposed method is compared against some recently proposed neural network models and observed to have higher accuracy in identifying anomalous driving behaviors of ACC vehicles.

Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models

  • paper_url: http://arxiv.org/abs/2310.17086
  • repo_url: None
  • paper_authors: Deqing Fu, Tian-Qi Chen, Robin Jia, Vatsal Sharan
  • for: The paper investigates how Transformers perform in-context learning (ICL), i.e., learning from demonstrations without parameter updates.
  • methods: It shows that Transformers learn to internally implement higher-order optimization methods, focusing on in-context linear regression.
  • results: Empirically, Transformers closely implement Iterative Newton's Method rather than Gradient Descent: successive layers match successive Newton iterations, with each middle layer roughly computing 3 iterations, while exponentially more Gradient Descent steps are needed to match an additional layer. Transformers can also learn in context on ill-conditioned data, where Gradient Descent struggles, and the paper proves that Transformers can implement $k$ iterations of Newton's method with $\mathcal{O}(k)$ layers.
    Abstract Transformers are remarkably good at in-context learning (ICL) -- learning from demonstrations without parameter updates -- but how they perform ICL remains a mystery. Recent work suggests that Transformers may learn in-context by internally running Gradient Descent, a first-order optimization method. In this paper, we instead demonstrate that Transformers learn to implement higher-order optimization methods to perform ICL. Focusing on in-context linear regression, we show that Transformers learn to implement an algorithm very similar to Iterative Newton's Method, a higher-order optimization method, rather than Gradient Descent. Empirically, we show that predictions from successive Transformer layers closely match different iterations of Newton's Method linearly, with each middle layer roughly computing 3 iterations. In contrast, exponentially more Gradient Descent steps are needed to match an additional Transformer layer; this suggests that Transformers have a comparable rate of convergence with high-order methods such as Iterative Newton, which are exponentially faster than Gradient Descent. We also show that Transformers can learn in-context on ill-conditioned data, a setting where Gradient Descent struggles but Iterative Newton succeeds. Finally, we show theoretical results which support our empirical findings and have a close correspondence with them: we prove that Transformers can implement $k$ iterations of Newton's method with $\mathcal{O}(k)$ layers.
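For intuition, the sketch below runs the textbook Newton-Schulz iteration for inverting X^T X (one common concrete form of Iterative Newton's Method for least squares, assumed here for illustration) alongside plain gradient descent on the same problem; the data and iteration counts are illustrative, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star

A, b = X.T @ X, X.T @ y                        # normal equations A w = b

# Newton-Schulz iteration: M_{k+1} = 2 M_k - M_k A M_k  ->  A^{-1}
M = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))  # convergent init
for _ in range(10):
    M = 2 * M - M @ A @ M
w_newton = M @ b

# Plain gradient descent on 0.5 * ||X w - y||^2 for comparison
w_gd = np.zeros(d)
lr = 1.0 / np.linalg.norm(A, 2)                # conservative step size
for _ in range(10):
    w_gd -= lr * (A @ w_gd - b)

print("Newton error:", np.linalg.norm(w_newton - w_star))  # near machine precision
print("GD error:    ", np.linalg.norm(w_gd - w_star))       # orders of magnitude larger
```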

Isometric Motion Manifold Primitives

  • paper_url: http://arxiv.org/abs/2310.17072
  • repo_url: https://github.com/gabe-yhlee/immp-public
  • paper_authors: Yonghyeon Lee
  • for: The paper proposes a motion-manifold-based approach that encodes, for a given task, a continuous manifold of trajectories able to accomplish the task.
  • methods: The manifold is parametrized by a decoder function together with a probability density in the latent coordinate space, and a Riemannian metric on the motion (parametric curve) space, the CurveGeom metric, is used so that the latent space preserves the geometry of the manifold.
  • results: Isometric Motion Manifold Primitives (IMMP) significantly outperform existing MMP methods on planar obstacle-avoidance and pushing manipulation tasks.
    Abstract The Motion Manifold Primitive (MMP) produces, for a given task, a continuous manifold of trajectories each of which can successfully complete the task. It consists of the decoder function that parametrizes the manifold and the probability density in the latent coordinate space. In this paper, we first show that the MMP performance can significantly degrade due to the geometric distortion in the latent space -- by distortion, we mean that similar motions are not located nearby in the latent space. We then propose {\it Isometric Motion Manifold Primitives (IMMP)} whose latent coordinate space preserves the geometry of the manifold. For this purpose, we formulate and use a Riemannian metric for the motion space (i.e., parametric curve space), which we call a {\it CurveGeom Riemannian metric}. Experiments with planar obstacle-avoiding motions and pushing manipulation tasks show that IMMP significantly outperforms existing MMP methods. Code is available at https://github.com/Gabe-YHLee/IMMP-public.

cs.CL - 2023-10-26

TIMELINE: Exhaustive Annotation of Temporal Relations Supporting the Automatic Ordering of Events in News Articles

  • paper_url: http://arxiv.org/abs/2310.17802
  • repo_url: https://github.com/alsayyahi/timeline
  • paper_authors: Sarah Alsayyahi, Riza Batista-Navarro
  • for: The work aims to improve temporal relation extraction by addressing problems in existing news datasets: (1) low inter-annotator agreement caused by annotation guidelines that do not precisely define what counts as a temporal relation; (2) the exclusion of long-distance relations spanning different paragraphs; and (3) the exclusion of events not centred on verbs. It presents a new annotation scheme that clearly defines the criteria for annotating temporal relations, includes events that are not expressed as verbs, and annotates all temporal relations, including long-distance ones.
  • methods: The new annotation scheme specifies explicit criteria for temporal relations, includes nominalised events, and uses a partially automated procedure for annotating all relations, including long-distance ones, reducing annotators' time and manual effort.
  • results: The resulting TIMELINE corpus achieves improved inter-annotator agreement compared with previously reported temporal relation datasets, and baseline temporal relation extraction models are trained and evaluated on it and compared against results on the widely used MATRES corpus.
    Abstract Temporal relation extraction models have thus far been hindered by a number of issues in existing temporal relation-annotated news datasets, including: (1) low inter-annotator agreement due to the lack of specificity of their annotation guidelines in terms of what counts as a temporal relation; (2) the exclusion of long-distance relations within a given document (those spanning across different paragraphs); and (3) the exclusion of events that are not centred on verbs. This paper aims to alleviate these issues by presenting a new annotation scheme that clearly defines the criteria based on which temporal relations should be annotated. Additionally, the scheme includes events even if they are not expressed as verbs (e.g., nominalised events). Furthermore, we propose a method for annotating all temporal relations -- including long-distance ones -- which automates the process, hence reducing time and manual effort on the part of annotators. The result is a new dataset, the TIMELINE corpus, in which improved inter-annotator agreement was obtained, in comparison with previously reported temporal relation datasets. We report the results of training and evaluating baseline temporal relation extraction models on the new corpus, and compare them with results obtained on the widely used MATRES corpus.

Words, Subwords, and Morphemes: What Really Matters in the Surprisal-Reading Time Relationship?

  • paper_url: http://arxiv.org/abs/2310.17774
  • repo_url: None
  • paper_authors: Sathvik Nair, Philip Resnik
  • for: This paper tests the assumption that LLMs (large language models) can be used effectively on psycholinguistic data without considering morphological information.
  • methods: The paper compares surprisal estimates using orthographic, morphological, and BPE (byte pair encoding) tokenization against reading time data to determine the impact of tokenization method on LLM predictions.
  • results: The results show that BPE-based tokenization does not result in significantly worse predictions compared to morphological and orthographic segmentation, but a finer-grained analysis reveals potential issues with relying on BPE and suggests the use of morphologically-aware surprisal estimates as an alternative method for evaluating morphological prediction.
    Abstract An important assumption that comes with using LLMs on psycholinguistic data has gone unverified. LLM-based predictions are based on subword tokenization, not decomposition of words into morphemes. Does that matter? We carefully test this by comparing surprisal estimates using orthographic, morphological, and BPE tokenization against reading time data. Our results replicate previous findings and provide evidence that in the aggregate, predictions using BPE tokenization do not suffer relative to morphological and orthographic segmentation. However, a finer-grained analysis points to potential issues with relying on BPE-based tokenization, as well as providing promising results involving morphologically-aware surprisal estimates and suggesting a new method for evaluating morphological prediction.
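The comparison above hinges on how per-token surprisal is aggregated to the word level; a minimal sketch of the standard additive aggregation over any segmentation (BPE pieces, morphemes, or whole words) is shown below, with dummy token log-probabilities standing in for a language model's outputs.

```python
import math

def word_surprisals(tokens, logprobs, word_ids):
    """Aggregate token-level surprisal (-log2 p) to word-level surprisal.

    tokens:   list of subword/morpheme/word strings (any segmentation).
    logprobs: natural-log probability of each token given its left context.
    word_ids: index of the word each token belongs to.
    """
    n_words = max(word_ids) + 1
    surprisal = [0.0] * n_words
    for lp, w in zip(logprobs, word_ids):
        surprisal[w] += -lp / math.log(2)      # convert nats to bits and sum
    return surprisal

# "unbelievable claims" under a BPE-style split (dummy log-probabilities).
tokens   = ["un", "believ", "able", "claims"]
logprobs = [-2.1, -3.4, -0.6, -4.0]            # stand-ins for LM outputs
word_ids = [0, 0, 0, 1]                        # first three pieces form one word
print(word_surprisals(tokens, logprobs, word_ids))
```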

A Framework for Automated Measurement of Responsible AI Harms in Generative AI Applications

  • paper_url: http://arxiv.org/abs/2310.17750
  • repo_url: None
  • paper_authors: Ahmed Magooda, Alec Helyar, Kyle Jackson, David Sullivan, Chad Atalla, Emily Sheng, Dan Vann, Richard Edgar, Hamid Palangi, Roman Lutz, Hongliang Kong, Vincent Yun, Eslam Kamal, Federico Zarfati, Hanna Wallach, Sarah Bird, Mei Chen
  • for: The paper presents a framework for the automated measurement of responsible AI (RAI) metrics for large language models (LLMs) and associated products and services.
  • methods: The framework builds on existing technical and sociotechnical expertise and leverages state-of-the-art LLMs, such as GPT-4, to automatically measure a range of harms that LLMs may produce.
  • results: The framework is used to run case studies investigating how different LLMs may violate a range of RAI-related principles, and it can be combined with domain-specific sociotechnical expertise to create measurements for new harm areas in the future.
    Abstract We present a framework for the automated measurement of responsible AI (RAI) metrics for large language models (LLMs) and associated products and services. Our framework for automatically measuring harms from LLMs builds on existing technical and sociotechnical expertise and leverages the capabilities of state-of-the-art LLMs, such as GPT-4. We use this framework to run through several case studies investigating how different LLMs may violate a range of RAI-related principles. The framework may be employed alongside domain-specific sociotechnical expertise to create measurements for new harm areas in the future. By implementing this framework, we aim to enable more advanced harm measurement efforts and further the responsible use of LLMs.

StyleBART: Decorate Pretrained Model with Style Adapters for Unsupervised Stylistic Headline Generation

  • paper_url: http://arxiv.org/abs/2310.17743
  • repo_url: None
  • paper_authors: Hanqing Wang, Yajing Luo, Boya Xiong, Guanhua Chen, Yun Chen
  • for: The paper targets unsupervised stylistic headline generation: producing a headline that both summarizes the article and reflects a style desired by the user.
  • methods: It proposes StyleBART, an unsupervised approach that decorates a pretrained BART model with style-specific adapters, so that headlines with different styles can be generated simply by switching adapters.
  • results: Automatic and human evaluations show that StyleBART achieves new state-of-the-art performance, producing high-quality headlines with the desired style.
    Abstract Stylistic headline generation is the task to generate a headline that not only summarizes the content of an article, but also reflects a desired style that attracts users. As style-specific article-headline pairs are scarce, previous researches focus on unsupervised approaches with a standard headline generation dataset and mono-style corpora. In this work, we follow this line and propose StyleBART, an unsupervised approach for stylistic headline generation. Our method decorates the pretrained BART model with adapters that are responsible for different styles and allows the generation of headlines with diverse styles by simply switching the adapters. Different from previous works, StyleBART separates the task of style learning and headline generation, making it possible to freely combine the base model and the style adapters during inference. We further propose an inverse paraphrasing task to enhance the style adapters. Extensive automatic and human evaluations show that StyleBART achieves new state-of-the-art performance in the unsupervised stylistic headline generation task, producing high-quality headlines with the desired style.

ArchBERT: Bi-Modal Understanding of Neural Architectures and Natural Languages

  • paper_url: http://arxiv.org/abs/2310.17737
  • repo_url: None
  • paper_authors: Mohammad Akbari, Saeed Ranjbar Alvar, Behnam Kamranian, Amin Banitalebi-Dehkordi, Yong Zhang
  • for: The paper proposes ArchBERT, a bi-modal model for jointly learning and understanding neural architectures and natural languages.
  • methods: It introduces a pre-training strategy named Masked Architecture Modeling (MAM) for more generalized joint learning, along with two new bi-modal datasets for training and validation.
  • results: Numerical experiments show that ArchBERT performs well on downstream tasks such as architecture-oriented reasoning, question answering, and captioning (summarization).
    Abstract Building multi-modal language models has been a trend in the recent years, where additional modalities such as image, video, speech, etc. are jointly learned along with natural languages (i.e., textual information). Despite the success of these multi-modal language models with different modalities, there is no existing solution for neural network architectures and natural languages. Providing neural architectural information as a new modality allows us to provide fast architecture-2-text and text-2-architecture retrieval/generation services on the cloud with a single inference. Such solution is valuable in terms of helping beginner and intermediate ML users to come up with better neural architectures or AutoML approaches with a simple text query. In this paper, we propose ArchBERT, a bi-modal model for joint learning and understanding of neural architectures and natural languages, which opens up new avenues for research in this area. We also introduce a pre-training strategy named Masked Architecture Modeling (MAM) for a more generalized joint learning. Moreover, we introduce and publicly release two new bi-modal datasets for training and validating our methods. The ArchBERT's performance is verified through a set of numerical experiments on different downstream tasks such as architecture-oriented reasoning, question answering, and captioning (summarization). Datasets, codes, and demos are available supplementary materials.

Investigating Multilingual Coreference Resolution by Universal Annotations

  • paper_url: http://arxiv.org/abs/2310.17734
  • repo_url: https://github.com/haixiachai/multi-coref
  • paper_authors: Haixia Chai, Michael Strube
  • for: The work investigates multilingual coreference resolution (MCR) using the newly proposed CorefUD dataset and its harmonized universal annotations.
  • methods: The study first examines the ground-truth data at the mention, entity, and document levels and across genres to characterize coreference across languages; it then performs an error analysis of the most challenging cases that the SotA system of the CRAC 2022 shared task fails to resolve; finally, it extracts features from universal morphosyntactic annotations and integrates them into a baseline system to assess their benefits.
  • results: The best feature configuration improves the baseline system's F1 score by 0.9%.
    Abstract Multilingual coreference resolution (MCR) has been a long-standing and challenging task. With the newly proposed multilingual coreference dataset, CorefUD (Nedoluzhko et al., 2022), we conduct an investigation into the task by using its harmonized universal morphosyntactic and coreference annotations. First, we study coreference by examining the ground truth data at different linguistic levels, namely mention, entity and document levels, and across different genres, to gain insights into the characteristics of coreference across multiple languages. Second, we perform an error analysis of the most challenging cases that the SotA system fails to resolve in the CRAC 2022 shared task using the universal annotations. Last, based on this analysis, we extract features from universal morphosyntactic annotations and integrate these features into a baseline system to assess their potential benefits for the MCR task. Our results show that our best configuration of features improves the baseline by 0.9% F1 score.

ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers

  • paper_url: http://arxiv.org/abs/2310.17723
  • repo_url: None
  • paper_authors: Zhewei Yao, Reza Yazdani Aminabadi, Stephen Youn, Xiaoxia Wu, Elton Zheng, Yuxiong He
  • for: The paper presents a novel, fully hardware-enhanced, robust, optimized post-training W8A8 quantization framework to improve the inference speed and reliability of deep neural networks.
  • methods: Going beyond dynamic quantization approaches such as ZeroQuant, the framework integrates both memory-bandwidth-bound and compute-intensive operators for optimal hardware performance, and allows specific INT8 modules to switch to FP16/BF16 mode to improve accuracy.
  • results: The framework achieves better hardware performance and robustness, adapts to different hardware settings, and offers better speed and accuracy than ZeroQuant-style dynamic quantization.
    Abstract Quantization techniques are pivotal in reducing the memory and computational demands of deep neural network inference. Existing solutions, such as ZeroQuant, offer dynamic quantization for models like BERT and GPT but overlook crucial memory-bounded operators and the complexities of per-token quantization. Addressing these gaps, we present a novel, fully hardware-enhanced robust optimized post-training W8A8 quantization framework, ZeroQuant-HERO. This framework uniquely integrates both memory bandwidth and compute-intensive operators, aiming for optimal hardware performance. Additionally, it offers flexibility by allowing specific INT8 modules to switch to FP16/BF16 mode, enhancing accuracy.
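For background, a generic symmetric W8A8 post-training quantization step looks like the sketch below (per-tensor scales, INT8 matmul with INT32 accumulation and a float rescale); this is a textbook illustration of the W8A8 setting, not ZeroQuant-HERO's actual kernels or token-wise scheme.

```python
import numpy as np

def quantize_sym(x):
    """Symmetric per-tensor INT8 quantization: x ~ scale * q, q in [-127, 127]."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_matmul(activations, weights):
    """INT8 x INT8 matmul with INT32 accumulation and a single float rescale."""
    a_q, a_s = quantize_sym(activations)
    w_q, w_s = quantize_sym(weights)
    acc = a_q.astype(np.int32) @ w_q.astype(np.int32)   # integer accumulate
    return acc.astype(np.float32) * (a_s * w_s)          # dequantize the output

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 64)).astype(np.float32)          # activations
W = rng.normal(size=(64, 32)).astype(np.float32)         # weights
ref = A @ W
out = w8a8_matmul(A, W)
print("max abs error:", np.abs(out - ref).max())          # small quantization error
```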

Nearest Neighbor Search over Vectorized Lexico-Syntactic Patterns for Relation Extraction from Financial Documents

  • paper_url: http://arxiv.org/abs/2310.17714
  • repo_url: https://github.com/pawan2411/pan-dl_refind
  • paper_authors: Pawan Kumar Rajpoot, Ankur Parikh
  • for: Better handling of implicit expressions and long-tail relation classes, and making relation extraction accessible to users without direct access to large language models or infrastructure for supervised training and fine-tuning.
  • methods: A nearest-neighbor search over dense vectors of lexico-syntactic patterns is used to consult training relations at test time, addressing language complexity and data sparsity.
  • results: The method achieves state-of-the-art performance on REFinD and provides a good starting point for human-in-the-loop setups when only a small number of annotations are available.
    Abstract Relation extraction (RE) has achieved remarkable progress with the help of pre-trained language models. However, existing RE models are usually incapable of handling two situations: implicit expressions and long-tail relation classes, caused by language complexity and data sparsity. Further, these approaches and models are largely inaccessible to users who don't have direct access to large language models (LLMs) and/or infrastructure for supervised training or fine-tuning. Rule-based systems also struggle with implicit expressions. Apart from this, Real world financial documents such as various 10-X reports (including 10-K, 10-Q, etc.) of publicly traded companies pose another challenge to rule-based systems in terms of longer and complex sentences. In this paper, we introduce a simple approach that consults training relations at test time through a nearest-neighbor search over dense vectors of lexico-syntactic patterns and provides a simple yet effective means to tackle the above issues. We evaluate our approach on REFinD and show that our method achieves state-of-the-art performance. We further show that it can provide a good start for human in the loop setup when a small number of annotations are available and it is also beneficial when domain experts can provide high quality patterns.
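At its core, the retrieval step described above is a nearest-neighbor vote over embedded training patterns; a minimal sketch follows, with a stubbed hash-based embedding function standing in for the dense encoder of lexico-syntactic patterns a real system would use.

```python
import numpy as np
from collections import Counter

def embed(pattern: str) -> np.ndarray:
    """Stub: hashed bag-of-character-trigrams, a placeholder for a real dense
    encoder of lexico-syntactic patterns."""
    v = np.zeros(256)
    for i in range(len(pattern) - 2):
        v[hash(pattern[i:i + 3]) % 256] += 1.0
    return v / (np.linalg.norm(v) + 1e-12)

def knn_relation(query_pattern, train_patterns, train_labels, k=3):
    """Predict a relation label by majority vote over the k nearest training
    patterns (cosine similarity on unit-normalized vectors)."""
    q = embed(query_pattern)
    sims = np.array([q @ embed(p) for p in train_patterns])
    top = sims.argsort()[::-1][:k]
    return Counter(train_labels[i] for i in top).most_common(1)[0][0]

# Hypothetical training patterns and relation labels for illustration only.
train_patterns = ["ORG acquired ORG for MONEY", "PERSON serves as TITLE of ORG",
                  "ORG acquired ORG in DATE", "PERSON appointed TITLE of ORG"]
train_labels = ["acquired_by", "title", "acquired_by", "title"]
print(knn_relation("ORG acquired ORG", train_patterns, train_labels, k=3))
```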

Is Explanation the Cure? Misinformation Mitigation in the Short Term and Long Term

  • paper_url: http://arxiv.org/abs/2310.17711
  • repo_url: None
  • paper_authors: Yi-Li Hsu, Shih-Chieh Dai, Aiping Xiong, Lun-Wei Ku
  • for: The study tests whether automatically generated explanations help people combat fake news, compared with warning labels.
  • methods: GPT-4-generated counterfactual explanations are shown alongside false content and compared against warning tags and a no-intervention control.
  • results: Both interventions significantly and equivalently reduce participants' self-reported belief in fake claims, in both the short term and the long term.
    Abstract With advancements in natural language processing (NLP) models, automatic explanation generation has been proposed to mitigate misinformation on social media platforms in addition to adding warning labels to identified fake news. While many researchers have focused on generating good explanations, how these explanations can really help humans combat fake news is under-explored. In this study, we compare the effectiveness of a warning label and the state-of-the-art counterfactual explanations generated by GPT-4 in debunking misinformation. In a two-wave, online human-subject study, participants (N = 215) were randomly assigned to a control group in which false contents are shown without any intervention, a warning tag group in which the false claims were labeled, or an explanation group in which the false contents were accompanied by GPT-4 generated explanations. Our results show that both interventions significantly decrease participants' self-reported belief in fake claims in an equivalent manner for the short-term and long-term. We discuss the implications of our findings and directions for future NLP-based misinformation debunking strategies.

The impact of using an AI chatbot to respond to patient messages

  • paper_url: http://arxiv.org/abs/2310.17703
  • repo_url: https://github.com/aim-harvard/oncqa
  • paper_authors: Shan Chen, Marco Guevara, Shalini Moningi, Frank Hoebers, Hesham Elhalawani, Benjamin H. Kann, Fallon E. Chipidza, Jonathan Leeman, Hugo J. W. L. Aerts, Timothy Miller, Guergana K. Savova, Raymond H. Mak, Maryam Lustberg, Majid Afshar, Danielle S. Bitterman
  • for: The study tests whether an AI chatbot (ChatGPT) can reduce clinician documentation burden and examines the value and impact of such systems on clinical decision-making.
  • methods: In a two-stage cross-sectional study, 6 oncologists responded to 100 realistic synthetic cancer patient scenarios and portal messages reflecting common medical situations, first manually and then with AI assistance.
  • results: AI-assisted responses were longer and less readable but provided acceptable drafts without edits 58% of the time; AI assistance improved efficiency 77% of the time with low harm risk (82% safe), yet 7.7% of unedited AI responses could cause severe harm, and in 31% of cases physicians thought the AI drafts were human-written. AI-assisted responses contained more patient-education recommendations and fewer clinical actions than manual responses, suggesting AI can improve clinician efficiency and patient care if used judiciously.
    Abstract Documentation burden is a major contributor to clinician burnout, which is rising nationally and is an urgent threat to our ability to care for patients. Artificial intelligence (AI) chatbots, such as ChatGPT, could reduce clinician burden by assisting with documentation. Although many hospitals are actively integrating such systems into electronic medical record systems, AI chatbots utility and impact on clinical decision-making have not been studied for this intended use. We are the first to examine the utility of large language models in assisting clinicians draft responses to patient questions. In our two-stage cross-sectional study, 6 oncologists responded to 100 realistic synthetic cancer patient scenarios and portal messages developed to reflect common medical situations, first manually, then with AI assistance. We find AI-assisted responses were longer, less readable, but provided acceptable drafts without edits 58% of time. AI assistance improved efficiency 77% of time, with low harm risk (82% safe). However, 7.7% unedited AI responses could severely harm. In 31% cases, physicians thought AI drafts were human-written. AI assistance led to more patient education recommendations, fewer clinical actions than manual responses. Results show promise for AI to improve clinician efficiency and patient care through assisting documentation, if used judiciously. Monitoring model outputs and human-AI interaction remains crucial for safe implementation.

Non-contrastive sentence representations via self-supervision

  • paper_url: http://arxiv.org/abs/2310.17690
  • repo_url: None
  • paper_authors: Marco Farina, Duccio Pappadopulo
  • for: Unsupervised learning of text and sentence embeddings.
  • methods: Self-supervised training with dimension-contrastive objectives, compared against the standard sample-contrastive baseline (SimCSE).
  • results: Without auxiliary loss functions, embeddings trained with dimension-contrastive objectives can outperform SimCSE on downstream tasks.
    Abstract Sample contrastive methods, typically referred to simply as contrastive are the foundation of most unsupervised methods to learn text and sentence embeddings. On the other hand, a different class of self-supervised loss functions and methods have been considered in the computer vision community and referred to as dimension contrastive. In this paper, we thoroughly compare this class of methods with the standard baseline for contrastive sentence embeddings, SimCSE. We find that self-supervised embeddings trained using dimension contrastive objectives can outperform SimCSE on downstream tasks without needing auxiliary loss functions.
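Dimension-contrastive objectives decorrelate embedding dimensions across two views rather than contrasting samples; a representative member of this family, the Barlow Twins loss, is sketched below as an example of the class (not necessarily the exact objective evaluated in the paper).

```python
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Dimension-contrastive loss: push the cross-correlation matrix of two
    views toward the identity (invariance on the diagonal, decorrelation off
    the diagonal)."""
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / n                                   # (d, d) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag

# Two "views" of the same sentences, e.g. two dropout passes of an encoder.
z1, z2 = torch.randn(128, 64), torch.randn(128, 64)
print(barlow_twins_loss(z1, z2).item())
```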

torchdistill Meets Hugging Face Libraries for Reproducible, Coding-Free Deep Learning Studies: A Case Study on NLP

  • paper_url: http://arxiv.org/abs/2310.17644
  • repo_url: https://github.com/yoshitomo-matsubara/torchdistill
  • paper_authors: Yoshitomo Matsubara
  • for: The work aims to improve reproducibility in deep learning research by upgrading a coding-free, module-driven framework to support more tasks and third-party libraries.
  • methods: It presents a significantly upgraded version of torchdistill and, to demonstrate the broader task support, reproduces the GLUE benchmark results of BERT models with a script based on the upgraded framework, harmonizing with various Hugging Face libraries.
  • results: All 27 fine-tuned BERT models and the configurations needed to reproduce the results are published on Hugging Face; popular small-sized models and new knowledge distillation methods are also reimplemented, with additional experiments on computer vision tasks.
    Abstract Reproducibility in scientific work has been becoming increasingly important in research communities such as machine learning, natural language processing, and computer vision communities due to the rapid development of the research domains supported by recent advances in deep learning. In this work, we present a significantly upgraded version of torchdistill, a modular-driven coding-free deep learning framework significantly upgraded from the initial release, which supports only image classification and object detection tasks for reproducible knowledge distillation experiments. To demonstrate that the upgraded framework can support more tasks with third-party libraries, we reproduce the GLUE benchmark results of BERT models using a script based on the upgraded torchdistill, harmonizing with various Hugging Face libraries. All the 27 fine-tuned BERT models and configurations to reproduce the results are published at Hugging Face, and the model weights have already been widely used in research communities. We also reimplement popular small-sized models and new knowledge distillation methods and perform additional experiments for computer vision tasks.

InstOptima: Evolutionary Multi-objective Instruction Optimization via Large Language Model-based Instruction Operators

  • paper_url: http://arxiv.org/abs/2310.17630
  • repo_url: https://github.com/yangheng95/instoptima
  • paper_authors: Heng Yang, Ke Li
  • for: Improving the efficiency of instruction engineering to advance the study of instructions.
  • methods: Instruction generation is treated as an evolutionary multi-objective optimization problem, using a large language model to simulate instruction operators (mutation and crossover) with an objective-guided mechanism that enhances the quality of the generated instructions.
  • results: Experimental results show improved fine-tuning performance and the generation of a diverse set of high-quality instructions.
    Abstract Instruction-based language modeling has received significant attention in pretrained language models. However, the efficiency of instruction engineering remains low and hinders the development of instruction studies. Recent studies have focused on automating instruction generation, but they primarily aim to improve performance without considering other crucial objectives that impact instruction quality, such as instruction length and perplexity. Therefore, we propose a novel approach (i.e., InstOptima) that treats instruction generation as an evolutionary multi-objective optimization problem. In contrast to text edition-based methods, our approach utilizes a large language model (LLM) to simulate instruction operators, including mutation and crossover. Furthermore, we introduce an objective-guided mechanism for these operators, allowing the LLM to comprehend the objectives and enhance the quality of the generated instructions. Experimental results demonstrate improved fine-tuning performance and the generation of a diverse set of high-quality instructions.
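Structurally, the search can be pictured as the loop below: a population of instructions is scored on several objectives, non-dominated candidates survive, and new candidates come from operators. The LLM-based operators and the objective functions are stubbed placeholders here, so this is only a schematic sketch, not the paper's system.

```python
import random

def objectives(instruction: str):
    """Stub for (performance, length, perplexity) scores, all to be minimized.
    In the paper these would come from fine-tuning feedback and a language model."""
    return (random.random(), len(instruction), random.random())

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(population, scores):
    return [p for p, s in zip(population, scores)
            if not any(dominates(other, s) for other in scores)]

def llm_mutate(instruction: str) -> str:
    """Placeholder for an LLM-simulated mutation operator (hypothetical)."""
    return instruction + " Answer concisely."

population = ["Classify the sentiment of the sentence.",
              "Decide whether the review is positive or negative."]
for generation in range(5):
    scores = [objectives(p) for p in population]
    survivors = pareto_front(population, scores)      # keep non-dominated instructions
    children = [llm_mutate(random.choice(survivors)) for _ in range(2)]
    population = survivors + children

print(population[:3])
```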

Proving Test Set Contamination in Black Box Language Models

  • paper_url: http://arxiv.org/abs/2310.17623
  • repo_url: None
  • paper_authors: Yonatan Oren, Nicole Meister, Niladri Chatterji, Faisal Ladhak, Tatsunori B. Hashimoto
  • for: The paper aims to provide provable guarantees about whether a language model's test sets are contaminated (i.e., appear in its pretraining data).
  • methods: It uses a method based on exchangeability: without contamination, all orderings of an exchangeable benchmark are equally likely, whereas a contaminated model assigns noticeably higher likelihood to the canonical ordering; no access to pretraining data or model weights is required.
  • results: Experiments show the test reliably detects test set contamination, even for models as small as 1.4 billion parameters, test sets of only 1,000 examples, and datasets that appear only a few times in the pretraining corpus; an audit of five popular publicly accessible language models finds little evidence of pervasive contamination.
    Abstract Large language models are trained on vast amounts of internet data, prompting concerns and speculation that they have memorized public benchmarks. Going from speculation to proof of contamination is challenging, as the pretraining data used by proprietary models are often not publicly accessible. We show that it is possible to provide provable guarantees of test set contamination in language models without access to pretraining data or model weights. Our approach leverages the fact that when there is no data contamination, all orderings of an exchangeable benchmark should be equally likely. In contrast, the tendency for language models to memorize example order means that a contaminated language model will find certain canonical orderings to be much more likely than others. Our test flags potential contamination whenever the likelihood of a canonically ordered benchmark dataset is significantly higher than the likelihood after shuffling the examples. We demonstrate that our procedure is sensitive enough to reliably prove test set contamination in challenging situations, including models as small as 1.4 billion parameters, on small test sets of only 1000 examples, and datasets that appear only a few times in the pretraining corpus. Using our test, we audit five popular publicly accessible language models for test set contamination and find little evidence for pervasive contamination.
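The test reduces to a permutation test on the model's log-likelihood of the benchmark under different example orderings; the sketch below shows that logic with a dummy log-likelihood standing in for a real language model (the paper's sharded statistic is not reproduced).

```python
import random

def loglik(ordering, canonical):
    """Dummy stand-in for the sum of LM log-probs of the examples in this order.
    It rewards agreement with the canonical order, mimicking a contaminated
    model that has memorized the published example order."""
    return -sum(abs(i - canonical.index(x)) for i, x in enumerate(ordering))

def contamination_p_value(examples, loglik_fn, n_shuffles=200, seed=0):
    """Fraction of random orderings at least as likely as the canonical one.
    A small p-value is evidence of test set contamination."""
    rng = random.Random(seed)
    canonical_ll = loglik_fn(examples)
    count = 0
    for _ in range(n_shuffles):
        perm = examples[:]
        rng.shuffle(perm)
        if loglik_fn(perm) >= canonical_ll:
            count += 1
    return (count + 1) / (n_shuffles + 1)

examples = list(range(50))                       # benchmark in canonical order
dummy = lambda order: loglik(order, examples)
print(contamination_p_value(examples, dummy))    # ~0.005: looks contaminated
```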

Uncovering Meanings of Embeddings via Partial Orthogonality

  • paper_url: http://arxiv.org/abs/2310.17611
  • repo_url: None
  • paper_authors: Yibo Jiang, Bryon Aragam, Victor Veitch
  • for: The paper studies how the semantic structure of language is encoded in the algebraic structure of embeddings.
  • methods: The authors use the notion of partial orthogonality to capture semantic independence, requiring that any sensible formalization obey a set of independence axioms.
  • results: The authors demonstrate that partial orthogonality captures semantic independence, introduce independence preserving embeddings, and prove the existence of such embeddings and approximations to them.
    Abstract Machine learning tools often rely on embedding text as vectors of real numbers. In this paper, we study how the semantic structure of language is encoded in the algebraic structure of such embeddings. Specifically, we look at a notion of ``semantic independence'' capturing the idea that, e.g., ``eggplant'' and ``tomato'' are independent given ``vegetable''. Although such examples are intuitive, it is difficult to formalize such a notion of semantic independence. The key observation here is that any sensible formalization should obey a set of so-called independence axioms, and thus any algebraic encoding of this structure should also obey these axioms. This leads us naturally to use partial orthogonality as the relevant algebraic structure. We develop theory and methods that allow us to demonstrate that partial orthogonality does indeed capture semantic independence. Complementary to this, we also introduce the concept of independence preserving embeddings where embeddings preserve the conditional independence structures of a distribution, and we prove the existence of such embeddings and approximations to them.
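One natural way to operationalize the "eggplant and tomato are independent given vegetable" intuition is a partial-correlation-style check: project both word vectors off the conditioning vector and test whether the residuals are orthogonal. The sketch below uses toy vectors and this assumed reading of partial orthogonality; the paper's formal definition and axioms are not reproduced here.

```python
import numpy as np

def residualize(x, z):
    """Component of x orthogonal to the conditioning vector z."""
    return x - (x @ z) / (z @ z) * z

def partial_inner_product(x, y, z):
    """Inner product of x and y after projecting out z. Values near zero
    (relative to the raw inner product) indicate partial orthogonality given z."""
    return float(residualize(x, z) @ residualize(y, z))

rng = np.random.default_rng(0)
dim = 300
vegetable = rng.normal(size=dim)
# Toy "eggplant"/"tomato": a shared "vegetable" component plus independent noise.
eggplant = 0.9 * vegetable + rng.normal(size=dim)
tomato   = 0.9 * vegetable + rng.normal(size=dim)

print("raw inner product:    ", float(eggplant @ tomato))                              # large, driven by the shared part
print("partial inner product:", partial_inner_product(eggplant, tomato, vegetable))    # ~0 in expectation
```
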
LeCaRDv2: A Large-Scale Chinese Legal Case Retrieval Dataset

  • paper_url: http://arxiv.org/abs/2310.17609
  • repo_url: None
  • paper_authors: Haitao Li, Yunqiu Shao, Yueyue Wu, Qingyao Ai, Yixiao Ma, Yiqun Liu
  • for: LeCaRDv2 is a large-scale Legal Case Retrieval Dataset (version 2) that aims to alleviate the limitations of existing datasets for the Chinese legal system: limited data size, narrow definitions of legal relevance, and naive candidate pooling strategies.
  • methods: LeCaRDv2 consists of 800 queries and 55,192 candidates extracted from 4.3 million criminal case documents; it enriches the existing relevance criteria by considering characterization, penalty, and procedure, and proposes a two-level candidate set pooling strategy to identify potential candidates for each query case.
  • results: All cases are annotated by multiple legal experts specializing in criminal law, ensuring the accuracy and reliability of the annotations; an evaluation of several state-of-the-art retrieval models on LeCaRDv2 shows that there is still significant room for improvement in legal case retrieval.
    Abstract As an important component of intelligent legal systems, legal case retrieval plays a critical role in ensuring judicial justice and fairness. However, the development of legal case retrieval technologies in the Chinese legal system is restricted by three problems in existing datasets: limited data size, narrow definitions of legal relevance, and naive candidate pooling strategies used in data sampling. To alleviate these issues, we introduce LeCaRDv2, a large-scale Legal Case Retrieval Dataset (version 2). It consists of 800 queries and 55,192 candidates extracted from 4.3 million criminal case documents. To the best of our knowledge, LeCaRDv2 is one of the largest Chinese legal case retrieval datasets, providing extensive coverage of criminal charges. Additionally, we enrich the existing relevance criteria by considering three key aspects: characterization, penalty, procedure. This comprehensive criteria enriches the dataset and may provides a more holistic perspective. Furthermore, we propose a two-level candidate set pooling strategy that effectively identify potential candidates for each query case. It's important to note that all cases in the dataset have been annotated by multiple legal experts specializing in criminal law. Their expertise ensures the accuracy and reliability of the annotations. We evaluate several state-of-the-art retrieval models at LeCaRDv2, demonstrating that there is still significant room for improvement in legal case retrieval. The details of LeCaRDv2 can be found at the anonymous website https://github.com/anonymous1113243/LeCaRDv2.

Lil-Bevo: Explorations of Strategies for Training Language Models in More Humanlike Ways

  • paper_url: http://arxiv.org/abs/2310.17591
  • repo_url: https://github.com/venkatasg/lil-bevo
  • paper_authors: Venkata S Govindarajan, Juan Diego Rodriguez, Kaj Bostrom, Kyle Mahowald
  • for: The paper describes Lil-Bevo, a submission to the BabyLM Challenge exploring strategies for pretraining language models on small amounts of data in more humanlike ways.
  • methods: Three ingredients are used: an initial pretraining with music data, training on shorter sequences before longer ones, and masking specific tokens to target some of the BLiMP subtasks.
  • results: The baseline models perform above chance but far below larger LLMs trained on more data; training on shorter sequences works better than training on longer ones, pretraining on music may help marginally, and the targeted masking helps on some of the targeted BLiMP tasks (e.g., Negative Polarity Items).
    Abstract We present Lil-Bevo, our submission to the BabyLM Challenge. We pretrained our masked language models with three ingredients: an initial pretraining with music data, training on shorter sequences before training on longer ones, and masking specific tokens to target some of the BLiMP subtasks. Overall, our baseline models performed above chance, but far below the performance levels of larger LLMs trained on more data. We found that training on short sequences performed better than training on longer sequences. Pretraining on music may help performance marginally, but, if so, the effect seems small. Our targeted Masked Language Modeling augmentation did not seem to improve model performance in general, but did seem to help on some of the specific BLiMP tasks that we were targeting (e.g., Negative Polarity Items). Training performant LLMs on small amounts of data is a difficult but potentially informative task. While some of our techniques showed some promise, more work is needed to explore whether they can improve performance more than the modest gains here. Our code is available at https://github.com/venkatasg/Lil-Bevo and our models at https://huggingface.co/collections/venkatasg/babylm-653591cdb66f4bf68922873a

PAC-tuning: Fine-tuning Pretrained Language Models with PAC-driven Perturbed Gradient Descent

  • paper_url: http://arxiv.org/abs/2310.17588
  • repo_url: None
  • paper_authors: Guangliang Liu, Zhiyu Xue, Xitong Zhang, Kristen Marie Johnson, Rongrong Wang
  • for: Improving the generalization of pretrained language models (PLMs) fine-tuned for downstream tasks, especially in few-shot learning settings.
  • methods: A two-stage fine-tuning method, PAC-tuning: first, PAC-Bayes training directly minimizes the PAC-Bayes generalization bound to learn a proper parameter distribution; second, the gradient is modified by injecting noise with the learned variance into the model parameters during training, yielding a variant of perturbed gradient descent (PGD).
  • results: Experiments on 5 GLUE benchmark tasks show that PAC-tuning successfully handles the challenges of fine-tuning and outperforms strong baselines by a visible margin, confirming the potential of applying PAC training in other settings where the Adam optimizer is currently used.
    Abstract Fine-tuning pretrained language models (PLMs) for downstream tasks is a large-scale optimization problem, in which the choice of the training algorithm critically determines how well the trained model can generalize to unseen test data, especially in the context of few-shot learning. To achieve good generalization performance and avoid overfitting, techniques such as data augmentation and pruning are often applied. However, adding these regularizations necessitates heavy tuning of the hyperparameters of optimization algorithms, such as the popular Adam optimizer. In this paper, we propose a two-stage fine-tuning method, PAC-tuning, to address this optimization challenge. First, based on PAC-Bayes training, PAC-tuning directly minimizes the PAC-Bayes generalization bound to learn proper parameter distribution. Second, PAC-tuning modifies the gradient by injecting noise with the variance learned in the first stage into the model parameters during training, resulting in a variant of perturbed gradient descent (PGD). In the past, the few-shot scenario posed difficulties for PAC-Bayes training because the PAC-Bayes bound, when applied to large models with limited training data, might not be stringent. Our experimental results across 5 GLUE benchmark tasks demonstrate that PAC-tuning successfully handles the challenges of fine-tuning tasks and outperforms strong baseline methods by a visible margin, further confirming the potential to apply PAC training for any other settings where the Adam optimizer is currently used for training.
    摘要 针对下游任务微调预训练语言模型(PLM)是一个大规模优化问题,其中训练算法的选择在很大程度上决定了训练出的模型对未见测试数据的泛化能力,在少样本学习场景下尤其如此。为了取得良好的泛化性能并避免过拟合,通常会采用数据增强和剪枝等技术。然而,加入这些正则化手段需要对优化算法(如常用的Adam优化器)的超参数进行繁重的调节。本文提出一种两阶段微调方法PAC-tuning来应对这一优化挑战。第一阶段,基于PAC-Bayes训练,PAC-tuning直接最小化PAC-Bayes泛化界以学习合适的参数分布;第二阶段,PAC-tuning在训练中将第一阶段学到的方差注入为模型参数上的噪声来修改梯度,从而得到一种扰动梯度下降(PGD)的变体。过去,少样本场景曾给PAC-Bayes训练带来困难,因为在训练数据有限的大模型上,PAC-Bayes界可能不够紧。我们在5个GLUE基准任务上的实验结果表明,PAC-tuning能够成功应对微调任务的挑战,并以明显的优势超越强基线方法,进一步印证了将PAC训练应用于其他任何当前使用Adam优化器进行训练的场景的潜力。
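The stage-two update can be pictured with a minimal PyTorch sketch of perturbed gradient descent: noise with a previously learned per-parameter variance is added to the weights, the gradient is taken at the perturbed point, and the clean weights are then updated. The names `pac_perturbed_sgd_step` and `noise_log_var` are hypothetical, and this is only an illustration of the idea, not the authors' implementation.

```python
import torch

def pac_perturbed_sgd_step(params, noise_log_var, loss_fn, lr=1e-3):
    """One perturbed-gradient step: perturb the weights with Gaussian noise
    whose per-parameter variance was learned in the PAC-Bayes stage, take the
    gradient at the perturbed point, then update the clean weights."""
    noise = [torch.randn_like(p) * torch.exp(0.5 * lv)
             for p, lv in zip(params, noise_log_var)]
    with torch.no_grad():
        for p, n in zip(params, noise):      # move to the perturbed point
            p.add_(n)
    loss = loss_fn()
    loss.backward()
    with torch.no_grad():
        for p, n in zip(params, noise):      # undo perturbation, apply gradient
            p.sub_(n)
            p.add_(p.grad, alpha=-lr)
            p.grad = None
    return float(loss)

# Toy usage: fit y = 2x with a single weight.
w = torch.zeros(1, requires_grad=True)
x, y = torch.tensor([1.0, 2.0, 3.0]), torch.tensor([2.0, 4.0, 6.0])
log_var = [torch.full_like(w, -6.0)]         # small, "learned" noise variance
for _ in range(200):
    pac_perturbed_sgd_step([w], log_var, lambda: ((w * x - y) ** 2).mean(), lr=0.05)
print(w.item())                               # approaches 2.0
```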

Global Voices, Local Biases: Socio-Cultural Prejudices across Languages

  • paper_url: http://arxiv.org/abs/2310.17586
  • repo_url: https://github.com/iamshnoo/weathub
  • paper_authors: Anjishnu Mukherjee, Chahat Raj, Ziwei Zhu, Antonios Anastasopoulos
  • for: 这个论文旨在探讨语言模型(LM)如何反映和增强社会偏见。
  • methods: 作者使用Word Embedding Association Test(WEAT)测试24种语言中LM的偏见,并在每种语言的地区上添加了当地文化特点。
  • results: 研究发现,LM普遍带有跨语言、跨文化和社会层面的偏见。此外,研究还考察了毒性(toxicity)、对残障人士的歧视(ableism)等新的偏见维度。最后,研究者通过比较多种嵌入方法,强调了这些社会偏见的重要性,并指出了构建更公平语言模型的必要性。
    Abstract Human biases are ubiquitous but not uniform: disparities exist across linguistic, cultural, and societal borders. As large amounts of recent literature suggest, language models (LMs) trained on human data can reflect and often amplify the effects of these social biases. However, the vast majority of existing studies on bias are heavily skewed towards Western and European languages. In this work, we scale the Word Embedding Association Test (WEAT) to 24 languages, enabling broader studies and yielding interesting findings about LM bias. We additionally enhance this data with culturally relevant information for each language, capturing local contexts on a global scale. Further, to encompass more widely prevalent societal biases, we examine new bias dimensions across toxicity, ableism, and more. Moreover, we delve deeper into the Indian linguistic landscape, conducting a comprehensive regional bias analysis across six prevalent Indian languages. Finally, we highlight the significance of these social biases and the new dimensions through an extensive comparison of embedding methods, reinforcing the need to address them in pursuit of more equitable language models. All code, data and results are available here: https://github.com/iamshnoo/weathub.
    摘要 人类偏见无处不在,但并不均匀:在语言、文化和社会的边界之间存在差异。正如大量近期文献所表明的,基于人类数据训练的语言模型(LM)会反映并常常放大这些社会偏见的影响。然而,现有关于偏见的研究绝大多数严重偏向西方和欧洲语言。在这项工作中,我们将词嵌入联想测验(WEAT)扩展到24种语言,从而支持更广泛的研究,并得到关于LM偏见的有趣发现。我们还为每种语言补充了与当地文化相关的信息,在全球尺度上捕捉本地语境。此外,为了涵盖更普遍存在的社会偏见,我们考察了毒性、对残障人士的歧视等新的偏见维度。我们还深入印度的语言版图,对6种主要印度语言进行了全面的地区偏见分析。最后,我们通过对多种嵌入方法的广泛比较,强调了这些社会偏见及新维度的重要性,说明在追求更公平的语言模型时必须加以应对。所有代码、数据和结果见:https://github.com/iamshnoo/weathub。
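The WEAT statistic that the benchmark scales to 24 languages can be computed from embeddings alone. Below is a minimal NumPy sketch of the standard WEAT effect-size formula; the random vectors are placeholders, and the paper's multilingual extensions and additional bias dimensions are not reflected here.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def weat_effect_size(X, Y, A, B):
    """WEAT effect size d: association of target word sets X, Y with
    attribute word sets A, B, computed from embedding rows."""
    def s(w):  # differential association of one word with A vs. B
        return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])
    s_X = np.array([s(x) for x in X])
    s_Y = np.array([s(y) for y in Y])
    pooled = np.concatenate([s_X, s_Y])
    return (s_X.mean() - s_Y.mean()) / pooled.std(ddof=1)

# Toy usage with random 50-d "embeddings" for two 3-word target sets
# and two 3-word attribute sets (illustrative only).
rng = np.random.default_rng(0)
X, Y, A, B = (rng.normal(size=(3, 50)) for _ in range(4))
print(round(weat_effect_size(X, Y, A, B), 3))
```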

1D-Touch: NLP-Assisted Coarse Text Selection via a Semi-Direct Gesture

  • paper_url: http://arxiv.org/abs/2310.17576
  • repo_url: None
  • paper_authors: Peiling Jiang, Li Feng, Fuling Sun, Parakrant Sarkar, Haijun Xia, Can Liu
  • for: 提高touch屏上文本选择的精度和效率,特别是在word和phrase水平上。
  • methods: 引入1D-Touch方法,使用简单的垂直滑动手势来扩展和收缩选择区域,范围从单词到语义块(semantic chunk)。
  • results: 对于coarse-grained文本选择任务,与默认word-snapping方法相比,1D-Touch方法提高选择精度20%。
    Abstract Existing text selection techniques on touchscreen focus on improving the control for moving the carets. Coarse-grained text selection on word and phrase levels has not received much support beyond word-snapping and entity recognition. We introduce 1D-Touch, a novel text selection method that complements the carets-based sub-word selection by facilitating the selection of semantic units of words and above. This method employs a simple vertical slide gesture to expand and contract a selection area from a word. The expansion can be by words or by semantic chunks ranging from sub-phrases to sentences. This technique shifts the concept of text selection, from defining a range by locating the first and last words, towards a dynamic process of expanding and contracting a textual semantic entity. To understand the effects of our approach, we prototyped and tested two variants: WordTouch, which offers a straightforward word-by-word expansion, and ChunkTouch, which leverages NLP to chunk text into syntactic units, allowing the selection to grow by semantically meaningful units in response to the sliding gesture. Our evaluation, focused on the coarse-grained selection tasks handled by 1D-Touch, shows a 20% improvement over the default word-snapping selection method on Android.
    摘要 现有的触摸屏文本选择技术主要关注改进光标移动的控制,而在单词和短语层级的粗粒度文本选择上,除了单词吸附和实体识别之外几乎没有得到支持。我们介绍1D-Touch,一种新的文本选择方法,它通过支持选择单词及以上的语义单元,来补充基于光标的子词选择。该方法使用简单的垂直滑动手势,从一个单词出发扩展和收缩选择区域,扩展的单位可以是单词,也可以是从子短语到句子的语义块。这种技术将文本选择的概念从通过定位首尾单词来界定范围,转变为动态地扩展和收缩一个文本语义实体。为了了解该方法的效果,我们实现并测试了两个变体:WordTouch提供直接的逐词扩展;ChunkTouch利用NLP将文本切分为句法单元,使选择能随滑动手势按语义上有意义的单位增长。我们针对1D-Touch所处理的粗粒度选择任务进行的评估显示,相比Android默认的单词吸附选择方法提升了20%。

DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation

  • paper_url: http://arxiv.org/abs/2310.17570
  • repo_url: None
  • paper_authors: Yongxin Zhu, Zhujin Gao, Xinyuan Zhou, Zhongyi Ye, Linli Xu
  • for: 这篇论文主要关注在于如何将传播生成模型(Diffusion Generative Models)有效地应用于语音生成和翻译任务中。
  • methods: 本论文提出了一种新的扩散模型:将扩散的前向过程应用在连续的语音表示空间中,而将扩散的反向过程应用在离散的语音单元空间中。这样既保留了连续语音表示空间的语义结构,又将连续与离散扩散模型结合起来。
  • results: 实验结果表明,所提方法只需更少的解码步骤(平均50步),即可取得与计算密集的自回归基线(平均500步)相当的表现。
    Abstract While Diffusion Generative Models have achieved great success on image generation tasks, how to efficiently and effectively incorporate them into speech generation especially translation tasks remains a non-trivial problem. Specifically, due to the low information density of speech data, the transformed discrete speech unit sequence is much longer than the corresponding text transcription, posing significant challenges to existing auto-regressive models. Furthermore, it is not optimal to brutally apply discrete diffusion on the speech unit sequence while disregarding the continuous space structure, which will degrade the generation performance significantly. In this paper, we propose a novel diffusion model by applying the diffusion forward process in the \textit{continuous} speech representation space, while employing the diffusion backward process in the \textit{discrete} speech unit space. In this way, we preserve the semantic structure of the continuous speech representation space in the diffusion process and integrate the continuous and discrete diffusion models. We conduct extensive experiments on the textless direct speech-to-speech translation task, where the proposed method achieves comparable results to the computationally intensive auto-regressive baselines (500 steps on average) with significantly fewer decoding steps (50 steps).
    摘要 尽管扩散生成模型在图像生成任务上取得了巨大成功,但如何高效且有效地将其引入语音生成、尤其是翻译任务,仍然是一个并不简单的问题。具体而言,由于语音数据的信息密度较低,转换得到的离散语音单元序列远长于对应的文本转写,这给现有的自回归模型带来了巨大挑战。此外,若无视连续空间结构而直接在语音单元序列上粗暴地应用离散扩散,会显著降低生成性能。本文提出一种新的扩散模型:在连续的语音表示空间中应用扩散前向过程,而在离散的语音单元空间中应用扩散反向过程。这样既在扩散过程中保留了连续语音表示空间的语义结构,又将连续与离散扩散模型融为一体。我们在无文本的直接语音到语音翻译任务上进行了大量实验,所提方法以显著更少的解码步骤(50步)取得了与计算密集的自回归基线(平均500步)相当的结果。

  • paper_url: http://arxiv.org/abs/2310.17568
  • repo_url: None
  • paper_authors: Stephanie M. Lukin, Kimberly A. Pollard, Claire Bonial, Taylor Hudson, Ron Arstein, Clare Voss, David Traum
  • for: 本研究旨在探讨人类和机器人在远程位置的共同探索方法,以及不同modalities的使用对探索成功的影响。
  • methods: 本研究使用多modalities的交互方式,包括自然语言指令、2D LIDAR地图和 Upon-request的静止照片,以帮助参与者在远程位置进行探索。
  • results: 研究发现参与者在各种模态的使用方式上采取了不同的策略,这些策略可能与多个探索子任务的成功程度相关。此外,研究发现请求照片可能提升了对某些关键实体(尤其是门口)的识别与计数,且这一策略并未妨碍整体区域的探索。
    Abstract Human-guided robotic exploration is a useful approach to gathering information at remote locations, especially those that might be too risky, inhospitable, or inaccessible for humans. Maintaining common ground between the remotely-located partners is a challenge, one that can be facilitated by multi-modal communication. In this paper, we explore how participants utilized multiple modalities to investigate a remote location with the help of a robotic partner. Participants issued spoken natural language instructions and received from the robot: text-based feedback, continuous 2D LIDAR mapping, and upon-request static photographs. We noticed that different strategies were adopted in terms of use of the modalities, and hypothesize that these differences may be correlated with success at several exploration sub-tasks. We found that requesting photos may have improved the identification and counting of some key entities (doorways in particular) and that this strategy did not hinder the amount of overall area exploration. Future work with larger samples may reveal the effects of more nuanced photo and dialogue strategies, which can inform the training of robotic agents. Additionally, we announce the release of our unique multi-modal corpus of human-robot communication in an exploration context: SCOUT, the Situated Corpus on Understanding Transactions.
    摘要 人类引导的机器人探索是在远程地点收集信息的一种有用方法,特别是那些对人类而言可能过于危险、不宜生存或难以到达的地点。维持远程协作双方之间的共同认知是一项挑战,而多模态通信可以促进这一点。在本文中,我们考察参与者如何借助机器人伙伴、利用多种模态来探索远程地点。参与者发出自然语言口头指令,并从机器人接收:基于文本的反馈、连续的2D LIDAR地图,以及按需提供的静态照片。我们注意到参与者在模态使用上采取了不同的策略,并假设这些差异可能与若干探索子任务的成功程度相关。我们发现请求照片可能提升了对某些关键实体(尤其是门口)的识别与计数,且这一策略并未减少整体区域探索的范围。未来在更大样本上的研究可能揭示更细致的照片与对话策略的影响,从而为机器人代理的训练提供参考。此外,我们发布了独特的探索情境下人机交流多模态语料库:SCOUT(Situated Corpus on Understanding Transactions)。

Towards Matching Phones and Speech Representations

  • paper_url: http://arxiv.org/abs/2310.17558
  • repo_url: None
  • paper_authors: Gene-Ping Yang, Hao Tang
  • for: 本研究探讨如何利用自监督学习从音素实例中学习音素类别。
  • methods: 将该问题表述为把聚类中心与音素嵌入进行匹配。
  • results: 实验表明,匹配结果能够捕捉音素之间的关系;将新的损失函数与常规自监督损失联合训练,可以大幅提升下游音素分类的性能。
    Abstract Learning phone types from phone instances has been a long-standing problem, while still being open. In this work, we revisit this problem in the context of self-supervised learning, and pose it as the problem of matching cluster centroids to phone embeddings. We study two key properties that enable matching, namely, whether cluster centroids of self-supervised representations reduce the variability of phone instances and respect the relationship among phones. We then use the matching result to produce pseudo-labels and introduce a new loss function for improving self-supervised representations. Our experiments show that the matching result captures the relationship among phones. Training the new loss function jointly with the regular self-supervised losses, such as APC and CPC, significantly improves the downstream phone classification.
    摘要 从音素实例中学习音素类别是一个由来已久且仍未解决的问题。在这项工作中,我们在自监督学习的背景下重新审视该问题,并将其表述为把聚类中心与音素嵌入进行匹配。我们研究了使匹配成为可能的两个关键性质:自监督表示的聚类中心是否降低了音素实例的变异性,以及是否尊重音素之间的关系。随后,我们利用匹配结果生成伪标签,并引入一种新的损失函数来改进自监督表示。实验表明,匹配结果能够捕捉音素之间的关系;将新的损失函数与APC、CPC等常规自监督损失联合训练,可显著提升下游音素分类的性能。
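The core step can be pictured as a one-to-one assignment between cluster centroids and phone embeddings. The SciPy-based sketch below solves that assignment with the Hungarian algorithm over cosine similarities; this is only one plausible instantiation with placeholder random vectors, and the authors' actual matching procedure may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_centroids_to_phones(centroids, phone_embeddings):
    """Assign each cluster centroid to a phone embedding by maximizing
    total cosine similarity (one-to-one matching)."""
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    p = phone_embeddings / np.linalg.norm(phone_embeddings, axis=1, keepdims=True)
    sim = c @ p.T                                    # (num_clusters, num_phones)
    rows, cols = linear_sum_assignment(-sim)         # minimize negative similarity
    return list(zip(rows.tolist(), cols.tolist())), float(sim[rows, cols].mean())

# Toy usage: 5 centroids matched to 5 phone embeddings in 16 dimensions.
rng = np.random.default_rng(0)
pairs, avg_sim = match_centroids_to_phones(rng.normal(size=(5, 16)),
                                           rng.normal(size=(5, 16)))
print(pairs, round(avg_sim, 3))
```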

Evaluating Bias and Fairness in Gender-Neutral Pretrained Vision-and-Language Models

  • paper_url: http://arxiv.org/abs/2310.17530
  • repo_url: https://github.com/coastalcph/gender-neutral-vl
  • paper_authors: Laura Cabello, Emanuele Bugliarello, Stephanie Brandl, Desmond Elliott
  • for: 本研究旨在探讨预训练模型内置的性别偏见如何影响模型性能,以及如何通过不同的预训练和终端训练方法来降低这种偏见。
  • methods: 本研究使用了三种视觉语言模型家族,通过分别在预训练和终端训练两个阶段进行训练,对模型的性别偏见进行了评估。
  • results: 研究发现,预训练和终端训练两个阶段的偏见增强是独立的,并且终端训练gender-neutral数据可以降低群体差异,提高模型的公平性。
    Abstract Pretrained machine learning models are known to perpetuate and even amplify existing biases in data, which can result in unfair outcomes that ultimately impact user experience. Therefore, it is crucial to understand the mechanisms behind those prejudicial biases to ensure that model performance does not result in discriminatory behaviour toward certain groups or populations. In this work, we define gender bias as our case study. We quantify bias amplification in pretraining and after fine-tuning on three families of vision-and-language models. We investigate the connection, if any, between the two learning stages, and evaluate how bias amplification reflects on model performance. Overall, we find that bias amplification in pretraining and after fine-tuning are independent. We then examine the effect of continued pretraining on gender-neutral data, finding that this reduces group disparities, i.e., promotes fairness, on VQAv2 and retrieval tasks without significantly compromising task performance.
    摘要 预训练机器学习模型已被证明会延续甚至放大数据中已有的偏见,可能导致不公平的结果并最终影响用户体验。因此,理解这些偏见背后的机制至关重要,以确保模型的表现不会对某些群体或人群造成歧视性行为。在这项工作中,我们以性别偏见为案例进行研究,量化三个系列的视觉-语言模型在预训练阶段和微调之后的偏见放大。我们考察这两个学习阶段之间是否存在联系,并评估偏见放大如何反映在模型性能上。总体而言,我们发现预训练与微调之后的偏见放大是相互独立的。随后,我们考察在性别中立数据上继续预训练的效果,发现这能在VQAv2和检索任务上缩小群体差异(即提升公平性),且不会显著损害任务性能。

The Validity of Evaluation Results: Assessing Concurrence Across Compositionality Benchmarks

  • paper_url: http://arxiv.org/abs/2310.17514
  • repo_url: https://github.com/facebookresearch/compositionalityvalidity
  • paper_authors: Kaiser Sun, Adina Williams, Dieuwke Hupkes
  • for: 本研究旨在探讨如何确定评价 datasets 是否能够准确衡量模型的能力。
  • methods: 研究使用了 6 种模型方法,并对 4 个 datasets 进行了8种 compositional splitting strategies的比较,并将模型按18个 compositional generalization splits进行排名。
  • results: 研究结果显示:一、不同的数据集对各建模方法的排名并不一致;二、由人类构建的数据集彼此之间的一致性高于它们与合成数据集之间的一致性;三、数据的采样来源是否相同,比是否保持相同的组合性解释更能预测模型排名;四、数据中使用的词汇项会强烈影响结论。总体而言,这项研究表明评估数据集的有效性仍需进一步检验,并建议为该领域制定更严格的评估标准。
    Abstract NLP models have progressed drastically in recent years, according to numerous datasets proposed to evaluate performance. Questions remain, however, about how particular dataset design choices may impact the conclusions we draw about model capabilities. In this work, we investigate this question in the domain of compositional generalization. We examine the performance of six modeling approaches across 4 datasets, split according to 8 compositional splitting strategies, ranking models by 18 compositional generalization splits in total. Our results show that: i) the datasets, although all designed to evaluate compositional generalization, rank modeling approaches differently; ii) datasets generated by humans align better with each other than they with synthetic datasets, or than synthetic datasets among themselves; iii) generally, whether datasets are sampled from the same source is more predictive of the resulting model ranking than whether they maintain the same interpretation of compositionality; and iv) which lexical items are used in the data can strongly impact conclusions. Overall, our results demonstrate that much work remains to be done when it comes to assessing whether popular evaluation datasets measure what they intend to measure, and suggest that elucidating more rigorous standards for establishing the validity of evaluation sets could benefit the field.
    摘要 近年来,根据众多用于评估性能的数据集,NLP模型取得了巨大进步。然而,特定的数据集设计选择会如何影响我们对模型能力的结论,仍是悬而未决的问题。在这项工作中,我们在组合泛化领域研究这一问题:我们在4个数据集上、按照8种组合划分策略,评估6种建模方法,总共按18种组合泛化划分对模型进行排名。结果表明:一、尽管这些数据集都旨在评估组合泛化,它们对建模方法的排名却并不一致;二、由人类构建的数据集彼此之间的一致性,高于它们与合成数据集之间、以及合成数据集彼此之间的一致性;三、总体而言,数据集是否采样自同一来源,比它们是否采用相同的组合性解释更能预测得到的模型排名;四、数据中使用的词汇项会强烈影响结论。总之,我们的结果表明,在评判流行的评估数据集是否确实测量了其想要测量的能力方面仍有大量工作要做,并建议为建立评估集的有效性制定更严格的标准,这将有益于整个领域。

The IMS Toucan System for the Blizzard Challenge 2023

  • paper_url: http://arxiv.org/abs/2310.17499
  • repo_url: https://github.com/digitalphonetics/ims-toucan
  • paper_authors: Florian Lux, Julia Koch, Sarina Meyer, Thomas Bott, Nadja Schauffler, Pavel Denisov, Antje Schweitzer, Ngoc Thang Vu
  • for: 这个论文是为了提高法语音识别系统的性能而写的。
  • methods: 该论文使用了一种基于规则的文本到phoneme处理系统,包括法语中 homograph 的规则化解决方法。然后将phoneme转换为spectrogram作为中间表示,使用Conformer和Glow架构实现快速和高效的非autoregressive生成架构。最后,使用GAN基于神经网络 vocoder将spectrogram转换为最终波形。
  • results: 作者在Blizzard Challenge 2023中提交的系统实现了优化,与Blizzard Challenge 2021中提交的系统相比,提高了性能。
    Abstract For our contribution to the Blizzard Challenge 2023, we improved on the system we submitted to the Blizzard Challenge 2021. Our approach entails a rule-based text-to-phoneme processing system that includes rule-based disambiguation of homographs in the French language. It then transforms the phonemes to spectrograms as intermediate representations using a fast and efficient non-autoregressive synthesis architecture based on Conformer and Glow. A GAN based neural vocoder that combines recent state-of-the-art approaches converts the spectrogram to the final wave. We carefully designed the data processing, training, and inference procedures for the challenge data. Our system identifier is G. Open source code and demo are available.
    摘要 作为对Blizzard Challenge 2023的参赛贡献,我们在提交给Blizzard Challenge 2021的系统基础上进行了改进。我们的方法包含一个基于规则的文本到音素处理系统,其中包括对法语同形异义词的基于规则的消歧。随后,系统使用基于Conformer与Glow的快速、高效的非自回归合成架构,将音素转换为作为中间表示的声谱图。一个结合了近期最先进方法的基于GAN的神经声码器将声谱图转换为最终波形。我们为挑战数据精心设计了数据处理、训练和推理流程。我们的系统标识符为G。开源代码与演示均已提供。

LightLM: A Lightweight Deep and Narrow Language Model for Generative Recommendation

  • paper_url: http://arxiv.org/abs/2310.17488
  • repo_url: https://github.com/dongyuanjushi/lightlm
  • paper_authors: Kai Mei, Yongfeng Zhang
  • for: 这篇论文旨在提出一种轻量级的Transformer模型,用于生成推荐。
  • methods: 该模型使用了一种特定针对推荐任务的深度和窄针对Transformer结构,以及一种 Spectral Collaborative Indexing (SCI) 和 Graph Collaborative Indexing (GCI) 方法来优化模型的性能。
  • results: 实验结果表明,LightLM在真实数据集上的推荐准确率和效率两方面均优于多种有竞争力的基线。
    Abstract This paper presents LightLM, a lightweight Transformer-based language model for generative recommendation. While Transformer-based generative modeling has gained importance in various AI sub-fields such as NLP and vision, generative recommendation is still in its infancy due to its unique demand on personalized generative modeling. Existing works on generative recommendation often use NLP-oriented Transformer architectures such as T5, GPT, LLaMA and M6, which are heavy-weight and are not specifically designed for recommendation tasks. LightLM tackles the issue by introducing a light-weight deep and narrow Transformer architecture, which is specifically tailored for direct generation of recommendation items. This structure is especially apt for straightforward generative recommendation and stems from the observation that language model does not have to be too wide for this task, as the input predominantly consists of short tokens that are well-suited for the model's capacity. We also show that our devised user and item ID indexing methods, i.e., Spectral Collaborative Indexing (SCI) and Graph Collaborative Indexing (GCI), enables the deep and narrow Transformer architecture to outperform large-scale language models for recommendation. Besides, to address the hallucination problem of generating items as output, we propose the constrained generation process for generative recommenders. Experiments on real-world datasets show that LightLM outperforms various competitive baselines in terms of both recommendation accuracy and efficiency. The code can be found at https://github.com/dongyuanjushi/LightLM.
    摘要 本文提出LightLM,一种用于生成式推荐的轻量级Transformer语言模型。虽然基于Transformer的生成式建模在NLP、视觉等多个AI子领域日益重要,但由于个性化生成建模的特殊需求,生成式推荐仍处于起步阶段。现有的生成式推荐工作通常采用面向NLP的Transformer架构(如T5、GPT、LLaMA和M6),这些模型体量庞大,并非专为推荐任务设计。LightLM通过引入一种专为直接生成推荐物品而定制的轻量级、深而窄的Transformer架构来解决这一问题:由于输入主要由很适合该模型容量的短词元构成,语言模型在此任务上并不需要太宽。我们还提出了谱协同索引(SCI)与图协同索引(GCI)两种用户与物品ID索引方法,使这种深而窄的Transformer架构在推荐上超越大规模语言模型。此外,针对生成物品时的幻觉问题,我们为生成式推荐器提出了受约束的生成过程。在真实数据集上的实验表明,LightLM在推荐准确率和效率上均优于多种有竞争力的基线。代码见 https://github.com/dongyuanjushi/LightLM。

Dialect Adaptation and Data Augmentation for Low-Resource ASR: TalTech Systems for the MADASR 2023 Challenge

  • paper_url: http://arxiv.org/abs/2310.17448
  • repo_url: None
  • paper_authors: Tanel Alumäe, Jiaming Kong, Daniil Robnikov
  • for: 这篇论文描述了塔林大学技术(TalTech)为ASRU MADASR 2023挑战制定的系统。挑战的目标是自动识别具有方言多样性的印度语言,使用有限的训练音频和文本数据。
  • methods: 我们的方法在两个关键点上不同于传统的预训练wav2vec2.0模型微调流程:其一,采用对齐数据增强技术来提升训练数据的语言多样性;其二,通过深度前缀调优(deep prefix tuning)对wav2vec2.0模型进行方言适应。
  • results: 在两个赛道上,我们的方法都取得了显著低于基线的词错误率,在所有参赛队伍中错误率最低。
    Abstract This paper describes Tallinn University of Technology (TalTech) systems developed for the ASRU MADASR 2023 Challenge. The challenge focuses on automatic speech recognition of dialect-rich Indian languages with limited training audio and text data. TalTech participated in two tracks of the challenge: Track 1 that allowed using only the provided training data and Track 3 which allowed using additional audio data. In both tracks, we relied on wav2vec2.0 models. Our methodology diverges from the traditional procedure of finetuning pretrained wav2vec2.0 models in two key points: firstly, through the implementation of the aligned data augmentation technique to enhance the linguistic diversity of the training data, and secondly, via the application of deep prefix tuning for dialect adaptation of wav2vec2.0 models. In both tracks, our approach yielded significant improvements over the provided baselines, achieving the lowest word error rates across all participating teams.
    摘要 本文介绍塔林理工大学(TalTech)为ASRU MADASR 2023挑战开发的系统。该挑战关注在训练音频与文本数据有限的情况下,对方言丰富的印度语言进行自动语音识别。TalTech参加了挑战的两个赛道:仅允许使用官方提供训练数据的赛道1,以及允许使用额外音频数据的赛道3。在两个赛道中,我们均基于wav2vec2.0模型。我们的方法在两个关键点上不同于传统的预训练wav2vec2.0模型微调流程:其一,采用对齐数据增强技术来提升训练数据的语言多样性;其二,通过深度前缀调优对wav2vec2.0模型进行方言适应。在两个赛道中,我们的方法都较官方基线取得了显著提升,在所有参赛队伍中词错误率最低。

"Fifty Shades of Bias": Normative Ratings of Gender Bias in GPT Generated English Text

  • paper_url: http://arxiv.org/abs/2310.17428
  • repo_url: None
  • paper_authors: Rishav Hada, Agrima Seth, Harshita Diddee, Kalika Bali
  • for: This paper aims to investigate the gender bias in language models and its impact on the generation of text.
  • methods: The authors use a dataset of GPT-generated English text with normative ratings of gender bias, and employ Best–Worst Scaling to obtain the ratings. They also analyze the variation of themes of gender biases in the observed ranking.
  • results: The authors show that identity-attack is most closely related to gender bias, and evaluate the performance of existing automated models trained on related concepts on their dataset.
    Abstract Language serves as a powerful tool for the manifestation of societal belief systems. In doing so, it also perpetuates the prevalent biases in our society. Gender bias is one of the most pervasive biases in our society and is seen in online and offline discourses. With LLMs increasingly gaining human-like fluency in text generation, gaining a nuanced understanding of the biases these systems can generate is imperative. Prior work often treats gender bias as a binary classification task. However, acknowledging that bias must be perceived at a relative scale; we investigate the generation and consequent receptivity of manual annotators to bias of varying degrees. Specifically, we create the first dataset of GPT-generated English text with normative ratings of gender bias. Ratings were obtained using Best--Worst Scaling -- an efficient comparative annotation framework. Next, we systematically analyze the variation of themes of gender biases in the observed ranking and show that identity-attack is most closely related to gender bias. Finally, we show the performance of existing automated models trained on related concepts on our dataset.
    摘要 语言是社会信念体系得以体现的有力工具,同时也延续着社会中普遍存在的偏见。性别偏见是社会中最普遍的偏见之一,在线上与线下的话语中都可见到。随着大型语言模型(LLM)在文本生成上日益接近人类的流畅程度,细致地理解这些系统可能产生的偏见变得至关重要。以往的工作通常把性别偏见当作二分类任务处理;然而,偏见应被放在相对的尺度上看待,因此我们研究不同程度偏见的生成情况,以及人工标注者对其的感知。具体而言,我们构建了首个带有性别偏见规范评分的GPT生成英文文本数据集,评分通过高效的比较式标注框架 Best-Worst Scaling(最优最差缩放)获得。随后,我们系统地分析了所得排序中性别偏见主题的变化,发现身份攻击与性别偏见的关联最为紧密。最后,我们在该数据集上评估了基于相关概念训练的现有自动化模型的表现。
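Best-Worst Scaling turns comparative judgments into real-valued ratings with a simple counting procedure. The sketch below uses the standard best-minus-worst formulation over 4-tuples; the sentence ids and annotations are made up, and the paper's exact aggregation may differ.

```python
from collections import defaultdict

def best_worst_scores(tuples):
    """Best-Worst Scaling: each annotation names the most and least biased
    item in a 4-tuple; an item's score is (#best - #worst) / #appearances,
    yielding real-valued ratings in [-1, 1]."""
    best, worst, seen = defaultdict(int), defaultdict(int), defaultdict(int)
    for items, b, w in tuples:
        for it in items:
            seen[it] += 1
        best[b] += 1
        worst[w] += 1
    return {it: (best[it] - worst[it]) / seen[it] for it in seen}

# Toy usage: two annotated 4-tuples of sentence ids.
annotations = [
    (("s1", "s2", "s3", "s4"), "s1", "s4"),  # s1 judged most biased, s4 least
    (("s1", "s2", "s5", "s6"), "s2", "s1"),
]
print(best_worst_scores(annotations))
```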

  • paper_url: http://arxiv.org/abs/2310.17413
  • repo_url: None
  • paper_authors: Anas Belfathi, Nicolas Hernandez, Laura Monceaux
  • for: 本研究旨在对查询大型预训练生成式Transformer(GPT-3.5-turbo)完成法律案件修辞角色预测任务进行全面研究,该任务需要处理文本上下文。
  • methods: 本研究探讨适用于该任务的单阶段引导(elicitation)技术,包括零样本/少样本策略、带定义与标注歧义澄清的任务说明、文本上下文,以及使用通用提示与针对上下文的具体问题进行推理。
  • results: 研究发现,示例数量、标签定义、(带标签的)文本上下文的呈现方式以及针对该上下文的具体问题均对模型表现有正面影响。在不同的测试集配置下,使用少量来自直接上下文的带标签示例进行提示,可使模型达到72%的加权F1分数,但与需要专门资源、架构和训练的最佳系统(86%)仍有差距。
    Abstract We propose a comprehensive study of one-stage elicitation techniques for querying a large pre-trained generative transformer (GPT-3.5-turbo) in the rhetorical role prediction task of legal cases. This task is known as requiring textual context to be addressed. Our study explores strategies such as zero-few shots, task specification with definitions and clarification of annotation ambiguities, textual context and reasoning with general prompts and specific questions. We show that the number of examples, the definition of labels, the presentation of the (labelled) textual context and specific questions about this context have a positive influence on the performance of the model. Given non-equivalent test set configurations, we observed that prompting with a few labelled examples from direct context can lead the model to a better performance than a supervised fined-tuned multi-class classifier based on the BERT encoder (weighted F1 score of = 72%). But there is still a gap to reach the performance of the best systems = 86%) in the LegalEval 2023 task which, on the other hand, require dedicated resources, architectures and training.
    摘要 我们提出一项关于单阶段引导技术的全面研究,用于查询大型预训练生成式Transformer(GPT-3.5-turbo)完成法律案件的修辞角色预测任务,该任务以需要处理文本上下文著称。我们的研究探讨了零样本/少样本策略、带定义与标注歧义澄清的任务说明、文本上下文,以及使用通用提示与具体问题进行推理等策略。我们表明,示例数量、标签定义、(带标签的)文本上下文的呈现方式以及针对该上下文的具体问题对模型表现有正面影响。在不同的测试集配置下,我们观察到,使用少量来自直接上下文的带标签示例进行提示,可使模型取得优于基于BERT编码器的有监督微调多分类器的表现(加权F1分数约为72%),但与LegalEval 2023任务中需要专门资源、架构和训练的最佳系统(86%)相比仍有差距。

Tackling the Matrix Multiplication Micro-kernel Generation with Exo

  • paper_url: http://arxiv.org/abs/2310.17408
  • repo_url: https://github.com/adcastel/exo_ukr_generator
  • paper_authors: Adrián Castelló, Julian Bellavita, Grace Dinh, Yuka Ikarashi, Héctor Martínez
  • for: 本研究旨在提高矩阵乘法(GEMM)的优化,以提高现代线性代数库(如BLIS、OpenBLAS、Intel OneAPI)的性能。
  • methods: 本研究使用Exo编译器生成微型核心代码,以实现close to(或更好于)手动编写的微型核心代码。此外,本解决方案提高了代码的可移植性,因为硬件目标完全由一个简洁的库所 descriptions。
  • results: 本研究显示,使用Exo编译器生成微型核心代码可以实现 close to(或更好于)手动编写的微型核心代码,并提高代码的可移植性。
    Abstract The optimization of the matrix multiplication (or GEMM) has been a need during the last decades. This operation is considered the flagship of current linear algebra libraries such as BLIS, OpenBLAS, or Intel OneAPI because of its widespread use in a large variety of scientific applications. The GEMM is usually implemented following the GotoBLAS philosophy, which tiles the GEMM operands and uses a series of nested loops for performance improvement. These approaches extract the maximum computational power of the architectures through small pieces of hardware-oriented, high-performance code called micro-kernel. However, this approach forces developers to generate, with a non-negligible effort, a dedicated micro-kernel for each new hardware. In this work, we present a step-by-step procedure for generating micro-kernels with the Exo compiler that performs close to (or even better than) manually developed microkernels written with intrinsic functions or assembly language. Our solution also improves the portability of the generated code, since a hardware target is fully specified by a concise library-based description of its instructions.
    摘要 矩阵乘法(GEMM)的优化在过去几十年里一直是一项需求。由于在大量科学应用中的广泛使用,这一操作被视为当今线性代数库(如BLIS、OpenBLAS或Intel OneAPI)的旗舰操作。GEMM通常按照GotoBLAS的思路实现:对GEMM的操作数进行分块(tiling),并借助一系列嵌套循环来提升性能。这类方法通过被称为微内核(micro-kernel)的小段面向硬件的高性能代码,充分发挥体系结构的最大计算能力。然而,这种做法迫使开发者为每一种新硬件投入可观的精力去编写专门的微内核。在这项工作中,我们给出了使用Exo编译器逐步生成微内核的流程,其性能接近(甚至优于)用intrinsic函数或汇编语言手工编写的微内核。由于硬件目标完全由一份简洁的、基于库的指令描述来指定,我们的方案还提升了所生成代码的可移植性。
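To make the tiling idea concrete, here is a NumPy toy of a GotoBLAS-style blocked GEMM whose innermost update plays the role of the micro-kernel. The tile sizes are arbitrary placeholders and the loop nest is simplified; real micro-kernels, including those generated by Exo in the paper, are emitted as hardware-specific vectorized code.

```python
import numpy as np

def blocked_gemm(A, B, mr=4, nr=4, kc=64):
    """Blocked GEMM: the innermost 'micro-kernel' updates an (mr x nr) tile
    of C with a kc-long slice of A and B. NumPy stands in for the hand-tuned
    SIMD code a real library would use for that update."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for p in range(0, k, kc):                 # panels along the k dimension
        for i in range(0, m, mr):             # rows of C, mr at a time
            for j in range(0, n, nr):         # cols of C, nr at a time
                # Micro-kernel: rank-kc update of one (mr x nr) tile of C.
                C[i:i+mr, j:j+nr] += A[i:i+mr, p:p+kc] @ B[p:p+kc, j:j+nr]
    return C

A, B = np.random.rand(8, 128), np.random.rand(128, 8)
assert np.allclose(blocked_gemm(A, B), A @ B)
```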

Meaning and understanding in large language models

  • paper_url: http://arxiv.org/abs/2310.17407
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Vladimír Havlík
  • for: 本文总结了现代人工智能语言模型的发展,并评估了传统哲学假设中关于机器语言理解的假设。
  • methods: 本文使用了生成式大语言模型来评估机器语言理解的水平,并检验了现代语言模型是否具有真正的语言理解能力。
  • results: 研究结果表明,可以合理地论证现代语言模型具有真正的语言理解能力,而不仅仅是肤浅的句法处理。
    Abstract Can a machine understand the meanings of natural language? Recent developments in the generative large language models (LLMs) of artificial intelligence have led to the belief that traditional philosophical assumptions about machine understanding of language need to be revised. This article critically evaluates the prevailing tendency to regard machine language performance as mere syntactic manipulation and the simulation of understanding, which is only partial and very shallow, without sufficient referential grounding in the world. The aim is to highlight the conditions crucial to attributing natural language understanding to state-of-the-art LLMs, where it can be legitimately argued that LLMs not only use syntax but also semantics, their understanding not being simulated but duplicated; and determine how they ground the meanings of linguistic expressions.
    摘要 机器能理解自然语言的意义吗?人工智能领域生成式大型语言模型(LLM)的最新进展,使人们认为关于机器语言理解的传统哲学预设需要修正。本文批判性地审视了当下流行的一种倾向,即把机器的语言表现仅仅视作句法操弄和对理解的模拟,一种缺乏对世界充分指称落地的、局部而肤浅的模拟。本文旨在阐明将自然语言理解归于最先进LLM所需满足的关键条件:在这些条件下,可以合理地论证LLM不仅使用句法、也使用语义,其理解不是被模拟而是被复现;并探讨它们如何为语言表达的意义提供落地。

Language and Mental Health: Measures of Emotion Dynamics from Text as Linguistic Biosocial Markers

  • paper_url: http://arxiv.org/abs/2310.17369
  • repo_url: None
  • paper_authors: Daniela Teodorescu, Tiffany Cheng, Alona Fyshe, Saif M. Mohammad
  • For: This paper aims to investigate the relationship between tweet emotion dynamics and mental health disorders.
  • Methods: The authors use a dataset of tweets and employ recent approaches to determining emotion dynamics from everyday utterances.
  • Results: The study finds that each of the emotion dynamics metrics studied varies by the user's self-disclosed diagnosis, and that linguistic cues pertaining to emotion dynamics can play a crucial role as biosocial markers for mental illnesses.
    Abstract Research in psychopathology has shown that, at an aggregate level, the patterns of emotional change over time -- emotion dynamics -- are indicators of one's mental health. One's patterns of emotion change have traditionally been determined through self-reports of emotions; however, there are known issues with accuracy, bias, and convenience. Recent approaches to determining emotion dynamics from one's everyday utterances, addresses many of these concerns, but it is not yet known whether these measures of utterance emotion dynamics (UED) correlate with mental health diagnoses. Here, for the first time, we study the relationship between tweet emotion dynamics and mental health disorders. We find that each of the UED metrics studied varied by the user's self-disclosed diagnosis. For example: average valence was significantly higher (i.e., more positive text) in the control group compared to users with ADHD, MDD, and PTSD. Valence variability was significantly lower in the control group compared to ADHD, depression, bipolar disorder, MDD, PTSD, and OCD but not PPD. Rise and recovery rates of valence also exhibited significant differences from the control. This work provides important early evidence for how linguistic cues pertaining to emotion dynamics can play a crucial role as biosocial markers for mental illnesses and aid in the understanding, diagnosis, and management of mental health disorders.
    摘要 精神病理学研究表明,在总体层面上,情绪随时间变化的模式(即情绪动态)是一个人心理健康状况的指标。情绪变化模式传统上通过情绪自我报告来确定,但其准确性、偏差与便利性都存在已知问题。近期从日常话语中推断情绪动态的方法缓解了许多此类顾虑,但这些话语情绪动态(UED)度量是否与心理健康诊断相关尚不清楚。在本研究中,我们首次考察推文情绪动态与精神疾病之间的关系。我们发现所研究的每一项UED指标都随用户自我披露的诊断而变化。例如:对照组的平均效价(即文本更积极)显著高于ADHD、MDD和PTSD用户;对照组的效价变异性显著低于ADHD、抑郁、双相障碍、MDD、PTSD和OCD(但与PPD无显著差异);效价的上升速率与恢复速率也与对照组存在显著差异。这项工作提供了重要的早期证据,表明与情绪动态相关的语言线索可以作为精神疾病的生物社会标志物,有助于精神健康障碍的理解、诊断与管理。
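A rough sense of the utterance emotion dynamics (UED) metrics can be given in a few lines of NumPy: mean valence, valence variability, and simple rise/recovery counts against a "home base" band. The numbers and the home_band parameter below are made up, and the actual UED formulation in the literature defines rise and recovery rates over displacement episodes, so this is a simplified illustration only.

```python
import numpy as np

def utterance_emotion_dynamics(valence, home_band=0.5):
    """Simplified UED metrics over a per-tweet valence series: average
    valence, valence variability, and rough rise/recovery rates measured
    against a 'home base' band of +/- home_band std around the mean."""
    v = np.asarray(valence, dtype=float)
    mean, std = v.mean(), v.std()
    lo, hi = mean - home_band * std, mean + home_band * std
    outside = (v < lo) | (v > hi)
    # Rise: the series leaves the home base; recovery: it returns to it.
    rises = np.sum(~outside[:-1] & outside[1:])
    recoveries = np.sum(outside[:-1] & ~outside[1:])
    steps = len(v) - 1
    return {"avg_valence": mean, "valence_variability": std,
            "rise_rate": rises / steps, "recovery_rate": recoveries / steps}

print(utterance_emotion_dynamics([0.1, 0.3, 0.9, 0.8, 0.2, 0.1, -0.5, 0.0]))
```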

ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated Chain-of-Thought

  • paper_url: http://arxiv.org/abs/2310.17342
  • repo_url: https://github.com/x-lance/text2sql-gpt
  • paper_authors: Hanchong Zhang, Ruisheng Cao, Lu Chen, Hongshen Xu, Kai Yu
  • for: 提高大语言模型(LLMs)在文本到SQL任务中的逻辑能力
  • methods: 使用链条思维(CoT)提问和自动生成Auto-CoT参考实例,不需手动标注
  • results: LLMs的性能得到改善,在多turn文本到SQL任务中也有优秀表现,达到了现有的SOTA水平
    Abstract Recently Large Language Models (LLMs) have been proven to have strong abilities in various domains and tasks. We study the problem of prompt designing in the text-to-SQL task and attempt to improve the LLMs' reasoning ability when generating SQL queries. Besides the trivial few-shot in-context learning setting, we design our chain-of-thought (CoT) prompt with a similar method to schema linking. We provide a method named ACT-SQL to automatically generate auto-CoT exemplars and thus the whole process doesn't need manual labeling. Our approach is cost-saving since we only use the LLMs' API call once when generating one SQL query. Furthermore, we extend our in-context learning method to the multi-turn text-to-SQL task. The experiment results show that the LLMs' performance can benefit from our ACT-SQL approach. Our approach achieves SOTA performance on the Spider dev set among existing in-context learning approaches.
    摘要 近期,大型语言模型(LLM)已在各个领域和任务中展现出强大能力。我们研究文本到SQL任务中的提示设计问题,试图提升LLM在生成SQL查询时的推理能力。除了常规的少样本上下文学习设置外,我们借鉴模式链接的思路设计了链式思维(CoT)提示,并提出名为ACT-SQL的方法自动生成Auto-CoT示例,使整个过程无需人工标注。由于生成一个SQL查询只需调用一次LLM的API,我们的方法也更为节省成本。此外,我们将上下文学习方法扩展到多轮文本到SQL任务。实验结果表明,LLM的表现可以从ACT-SQL方法中获益,并在Spider开发集上取得了现有上下文学习方法中的最佳性能。
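A minimal, hypothetical few-shot chain-of-thought prompt for text-to-SQL is sketched below, in the spirit of ACT-SQL's auto-generated exemplars. The schema, exemplar, and template are invented for illustration and are not the paper's actual prompt; a real system would generate many exemplars automatically and select the most relevant ones.

```python
# Hypothetical one-shot CoT prompt for text-to-SQL (illustrative only).
SCHEMA = "Table singer(singer_id, name, country, age)"

EXEMPLAR = """Question: How many singers are from France?
Thought: The question asks for a count of rows in table `singer`
where the column `country` equals 'France' (schema linking: "singers" -> singer,
"from France" -> country = 'France').
SQL: SELECT count(*) FROM singer WHERE country = 'France';"""

def build_prompt(schema: str, question: str) -> str:
    """Assemble a one-shot chain-of-thought prompt for an LLM API call."""
    return f"{schema}\n\n{EXEMPLAR}\n\nQuestion: {question}\nThought:"

print(build_prompt(SCHEMA, "What is the average age of singers from Japan?"))
```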

Arabic Fine-Grained Entity Recognition

  • paper_url: http://arxiv.org/abs/2310.17333
  • repo_url: None
  • paper_authors: Haneen Liqreina, Mustafa Jarrar, Mohammed Khalilia, Ahmed Oumar El-Shangiti, Muhammad AbdulMageed
  • For: This paper aims to advance Arabic Named Entity Recognition (NER) with fine-grained entities, specifically by extending the Wojood corpus with 31 subtypes for four main entity types (GPE, LOC, ORG, and FAC).
  • Methods: The authors first revised Wojood's annotations to be compatible with the LDC's ACE guidelines, and then manually annotated all mentions of GPE, LOC, ORG, and FAC (~44K) with the LDC's ACE sub-types. They also fine-tuned three pre-trained Arabic BERT encoders in three settings to compute the baselines of WojoodFine.
  • Results: The authors achieved inter-annotator agreement (IAA) of 0.9861 and 0.9889 using Cohen's Kappa and F1 score, respectively. They also achieved F1 scores of 0.920, 0.866, and 0.885 in the three settings of fine-tuning the pre-trained Arabic BERT encoders.
    Abstract Traditional NER systems are typically trained to recognize coarse-grained entities, and less attention is given to classifying entities into a hierarchy of fine-grained lower-level subtypes. This article aims to advance Arabic NER with fine-grained entities. We chose to extend Wojood (an open-source Nested Arabic Named Entity Corpus) with subtypes. In particular, four main entity types in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG), and facility (FAC), are extended with 31 subtypes. To do this, we first revised Wojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's ACE guidelines, which yielded 5, 614 changes. Second, all mentions of GPE, LOC, ORG, and FAC (~44K) in Wojood are manually annotated with the LDC's ACE sub-types. We refer to this extended version of Wojood as WojoodF ine. To evaluate our annotations, we measured the inter-annotator agreement (IAA) using both Cohen's Kappa and F1 score, resulting in 0.9861 and 0.9889, respectively. To compute the baselines of WojoodF ine, we fine-tune three pre-trained Arabic BERT encoders in three settings: flat NER, nested NER and nested NER with subtypes and achieved F1 score of 0.920, 0.866, and 0.885, respectively. Our corpus and models are open-source and available at https://sina.birzeit.edu/wojood/.
    摘要 传统的NER系统通常被训练用于识别粗粒度实体,而将实体进一步划分为层级化的细粒度子类型则较少受到关注。本文旨在以细粒度实体推进阿拉伯语NER。我们选择为Wojood(一个开源的嵌套式阿拉伯语命名实体语料库)扩展子类型。具体而言,Wojood中的四个主要实体类型,即地缘政治实体(GPE)、地点(LOC)、组织(ORG)和设施(FAC),被扩展为31个子类型。为此,我们首先修订了Wojood中GPE、LOC、ORG和FAC的标注,使其与LDC的ACE指南兼容,共产生5,614处更改;随后,Wojood中GPE、LOC、ORG和FAC的全部提及(约4.4万处)都被人工标注为LDC的ACE子类型。我们将这一扩展版本称为WojoodFine。为评估标注质量,我们用Cohen's Kappa和F1分数计算了标注者间一致性(IAA),分别为0.9861和0.9889。为计算WojoodFine的基线,我们在普通NER、嵌套NER以及带子类型的嵌套NER三种设置下微调了三个预训练阿拉伯语BERT编码器,F1分数分别为0.920、0.866和0.885。我们的语料库与模型均已开源:https://sina.birzeit.edu/wojood/。

Nabra: Syrian Arabic Dialects with Morphological Annotations

  • paper_url: http://arxiv.org/abs/2310.17315
  • repo_url: None
  • paper_authors: Amal Nayouf, Tymaa Hammouda, Mustafa Jarrar, Fadi Zaraket, Mohamad-Bassam Kurdy
  • For: 本文旨在提供一个带有形态标注的叙利亚阿拉伯语方言语料库,以支持相关的语言研究与应用。
  • Methods: 由叙利亚母语者组成的团队从社交媒体帖子、电影和电视剧剧本、歌词以及当地谚语等多种来源收集了超过6千个句子、约6万个词,并由九名标注者在句子语境中对这些词进行了完整的形态标注。
  • Results: 标注遵循统一的标注准则并经过规范化处理,质量很高:各特征上的F1与κ一致性分数介于74%到98%之间。该语料库已作为Currasat门户的一部分公开发布:https://sina.birzeit.edu/currasat。
    Abstract This paper presents Nabra, a corpora of Syrian Arabic dialects with morphological annotations. A team of Syrian natives collected more than 6K sentences containing about 60K words from several sources including social media posts, scripts of movies and series, lyrics of songs and local proverbs to build Nabra. Nabra covers several local Syrian dialects including those of Aleppo, Damascus, Deir-ezzur, Hama, Homs, Huran, Latakia, Mardin, Raqqah, and Suwayda. A team of nine annotators annotated the 60K tokens with full morphological annotations across sentence contexts. We trained the annotators to follow methodological annotation guidelines to ensure unique morpheme annotations, and normalized the annotations. F1 and kappa agreement scores ranged between 74% and 98% across features, showing the excellent quality of Nabra annotations. Our corpora are open-source and publicly available as part of the Currasat portal https://sina.birzeit.edu/currasat.
    摘要 本文介绍Nabra,一个带有形态标注的叙利亚阿拉伯语方言语料库。一支由叙利亚母语者组成的团队从社交媒体帖子、电影和电视剧剧本、歌词以及当地谚语等多种来源收集了超过6千个句子、约6万个词来构建Nabra。Nabra涵盖多种叙利亚地方方言,包括阿勒颇、大马士革、代尔祖尔、哈马、霍姆斯、豪兰、拉塔基亚、马尔丁、拉卡和苏韦达等地的方言。九名标注者在句子语境中为这6万个词例进行了完整的形态标注。我们培训标注者遵循统一的标注准则以保证词素标注的一致性,并对标注进行了规范化。各特征上的F1与kappa一致性分数介于74%到98%之间,显示了Nabra标注的高质量。我们的语料库已作为Currasat门户的一部分开源发布:https://sina.birzeit.edu/currasat。

An Ensemble Method Based on the Combination of Transformers with Convolutional Neural Networks to Detect Artificially Generated Text

  • paper_url: http://arxiv.org/abs/2310.17312
  • repo_url: None
  • paper_authors: Vijini Liyanage, Davide Buscaldi
  • for: 本研究旨在探讨使用自然语言生成器(Natural Language Generation,NLG)生成的文本是否能够自动分类为人工生成或人类写作。
  • methods: 本研究使用了一些 ensemble transformer 模型,包括 Sci-BERT、DeBERTa 和 XLNet,以及卷积神经网络(Convolutional Neural Networks,CNNs)。
  • results: 我们的实验结果表明,使用 ensemble 模型可以超越单独的 transformer 模型的性能,而 SciBERT-CNN ensemble 模型在 ALTA 共享任务 2023 数据集上达到了 F1 分数为 98.36%。
    Abstract Thanks to the state-of-the-art Large Language Models (LLMs), language generation has reached outstanding levels. These models are capable of generating high quality content, thus making it a challenging task to detect generated text from human-written content. Despite the advantages provided by Natural Language Generation, the inability to distinguish automatically generated text can raise ethical concerns in terms of authenticity. Consequently, it is important to design and develop methodologies to detect artificial content. In our work, we present some classification models constructed by ensembling transformer models such as Sci-BERT, DeBERTa and XLNet, with Convolutional Neural Networks (CNNs). Our experiments demonstrate that the considered ensemble architectures surpass the performance of the individual transformer models for classification. Furthermore, the proposed SciBERT-CNN ensemble model produced an F1-score of 98.36% on the ALTA shared task 2023 data.
    摘要 得益于最先进的大型语言模型(LLM),语言生成已经达到出色的水平。这些模型能够生成高质量的内容,因此区分机器生成文本与人类撰写文本成为一项具有挑战性的任务。尽管自然语言生成带来了诸多优势,但无法自动识别生成文本可能在真实性方面引发伦理问题,因此设计和开发检测人工内容的方法十分重要。在我们的工作中,我们提出了若干由Transformer模型(如Sci-BERT、DeBERTa和XLNet)与卷积神经网络(CNN)集成构建的分类模型。实验表明,所考察的集成架构在分类上超越了单个Transformer模型的表现。此外,所提出的SciBERT-CNN集成模型在ALTA 2023共享任务数据上取得了98.36%的F1分数。
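As a toy illustration of soft voting over transformer+CNN classifiers, the self-contained PyTorch sketch below averages member probabilities and takes the argmax. The DummyTextClassifier stands in for the real SciBERT-, DeBERTa-, and XLNet-based members, and the paper's actual ensembling scheme may differ.

```python
import torch
import torch.nn as nn

class DummyTextClassifier(nn.Module):
    """Stand-in for one transformer+CNN member of the ensemble."""
    def __init__(self, vocab=1000, dim=32, num_labels=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.head = nn.Linear(dim, num_labels)

    def forward(self, input_ids):
        h = self.emb(input_ids).transpose(1, 2)      # (batch, dim, seq)
        h = torch.relu(self.conv(h)).mean(dim=2)     # pool over the sequence
        return self.head(h)                          # logits

def ensemble_predict(models, input_ids):
    """Soft voting: average member probabilities, then take the argmax."""
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(input_ids), -1) for m in models]).mean(0)
    return probs.argmax(-1)

models = [DummyTextClassifier() for _ in range(3)]
batch = torch.randint(0, 1000, (4, 16))              # 4 texts, 16 tokens each
print(ensemble_predict(models, batch))
```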

Learning to Abstract with Nonparametric Variational Information Bottleneck

  • paper_url: http://arxiv.org/abs/2310.17284
  • repo_url: None
  • paper_authors: Melika Behjati, Fabio Fehr, James Henderson
  • for: 提高语言模型的鲁棒性和抗干扰能力
  • methods: 使用Nonparametric Variational Information Bottleneck(NVIB)压缩Transformer自注意层,实现模型层次结构中的压缩和抽象
  • results: 模型可以更好地捕捉语言特征,并且具有更高的鲁棒性和抗干扰能力
    Abstract Learned representations at the level of characters, sub-words, words and sentences, have each contributed to advances in understanding different NLP tasks and linguistic phenomena. However, learning textual embeddings is costly as they are tokenization specific and require different models to be trained for each level of abstraction. We introduce a novel language representation model which can learn to compress to different levels of abstraction at different layers of the same model. We apply Nonparametric Variational Information Bottleneck (NVIB) to stacked Transformer self-attention layers in the encoder, which encourages an information-theoretic compression of the representations through the model. We find that the layers within the model correspond to increasing levels of abstraction and that their representations are more linguistically informed. Finally, we show that NVIB compression results in a model which is more robust to adversarial perturbations.
    摘要 在字符、子词、单词和句子等不同层级上学习到的表示,各自推动了对不同NLP任务和语言现象的理解。然而,学习文本嵌入的代价较高:它们依赖于特定的切分方式,且每个抽象层级都需要训练不同的模型。我们提出一种新的语言表示模型,它能够在同一个模型的不同层中学习压缩到不同的抽象层级。我们在编码器中对堆叠的Transformer自注意力层应用非参数变分信息瓶颈(NVIB),促使模型对表示进行信息论意义上的压缩。我们发现模型中的各层对应于逐步提高的抽象层级,且其表示更具语言学信息。最后,我们表明NVIB压缩使模型对对抗扰动更加鲁棒。

Automatic Logical Forms improve fidelity in Table-to-Text generation

  • paper_url: http://arxiv.org/abs/2310.17279
  • repo_url: https://github.com/alonsoapp/tlt
  • paper_authors: Iñigo Alonso, Eneko Agirre
  • for: 这个论文主要写于如何从表格生成自然语言陈述。
  • methods: 该论文使用了自动生成的逻辑形式(LF),以提高文本的事实准确性。
  • results: 研究发现,使用自动生成的LF可以提高文本的事实准确性,相比之前的系统不使用LF。此外,研究还发现了高事实准确性的主要挑战,包括自动选择内容、逻辑到文本转换和表格到逻辑转换。
    Abstract Table-to-text systems generate natural language statements from structured data like tables. While end-to-end techniques suffer from low factual correctness (fidelity), a previous study reported gains when using manual logical forms (LF) that represent the selected content and the semantics of the target text. Given the manual step, it was not clear whether automatic LFs would be effective, or whether the improvement came from content selection alone. We present TlT which, given a table and a selection of the content, first produces LFs and then the textual statement. We show for the first time that automatic LFs improve quality, with an increase in fidelity of 30 points over a comparable system not using LFs. Our experiments allow to quantify the remaining challenges for high factual correctness, with automatic selection of content coming first, followed by better Logic-to-Text generation and, to a lesser extent, better Table-to-Logic parsing.
    摘要 表格到文本系统从表格等结构化数据生成自然语言陈述。端到端方法的事实正确性(忠实度)较低,而此前的一项研究表明,使用人工构建的逻辑形式(LF),即表示所选内容与目标文本语义的形式,可以带来提升。由于其中包含人工步骤,尚不清楚自动生成的LF是否同样有效,也不清楚提升是否仅来自内容选择。我们提出TlT:给定表格和所选内容,先生成LF,再生成文本陈述。我们首次表明自动LF能够提升质量,忠实度较不使用LF的可比系统提高30个点。我们的实验还量化了实现高事实正确性所面临的剩余挑战:首要是内容的自动选择,其次是更好的逻辑到文本生成,以及影响相对较小的表格到逻辑解析。

Understanding the Role of Input Token Characters in Language Models: How Does Information Loss Affect Performance?

  • paper_url: http://arxiv.org/abs/2310.17271
  • repo_url: None
  • paper_authors: Ahmed Alajrami, Katerina Margatina, Nikolaos Aletras
  • for: 本研究旨在探讨如何在自然语言处理中理解预训练语言模型(PLMs)对语言的学习。
  • methods: 本研究仅使用每个词元(token)中的一小部分字符来预训练语言模型。
  • results: 令人意外的是,我们发现即使在极端设置下(即每个词元只使用一个字符),预训练模型在标准NLU基准和探测任务中的性能保留率仍然较高;例如,仅用每个词元首字符预训练的模型,在SuperGLUE和GLUE任务上的性能保留率分别约为90%和77%。
    Abstract Understanding how and what pre-trained language models (PLMs) learn about language is an open challenge in natural language processing. Previous work has focused on identifying whether they capture semantic and syntactic information, and how the data or the pre-training objective affects their performance. However, to the best of our knowledge, no previous work has specifically examined how information loss in input token characters affects the performance of PLMs. In this study, we address this gap by pre-training language models using small subsets of characters from individual tokens. Surprisingly, we find that pre-training even under extreme settings, i.e. using only one character of each token, the performance retention in standard NLU benchmarks and probing tasks compared to full-token models is high. For instance, a model pre-trained only on single first characters from tokens achieves performance retention of approximately $90$\% and $77$\% of the full-token model in SuperGLUE and GLUE tasks, respectively.
    摘要 理解预训练语言模型(PLM)如何学习语言以及学到了什么,是自然语言处理领域的一个开放挑战。已有工作主要关注它们是否捕捉了语义与句法信息,以及数据或预训练目标如何影响其表现。然而,据我们所知,此前尚无工作专门考察输入词元字符层面的信息损失会如何影响PLM的表现。在本研究中,我们通过仅使用每个词元中的一小部分字符来预训练语言模型,填补这一空白。令人意外的是,即使在极端设置下(即每个词元只使用一个字符),与完整词元模型相比,预训练模型在标准NLU基准和探测任务中的性能保留率仍然很高。例如,仅用每个词元首字符预训练的模型,在SuperGLUE和GLUE任务上分别保留了完整词元模型约90%和77%的性能。
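The most extreme setting, keeping only the first character of each token, is easy to picture with a tiny data transform. The sketch below uses whitespace splitting as a stand-in tokenizer; the paper's actual pipeline would operate on the model's own subword tokens.

```python
def first_char_tokens(text, tokenizer_fn=str.split):
    """Reduce every token to its first character, as in the most extreme
    pretraining setting. Whitespace splitting stands in for a real tokenizer."""
    return [tok[0] for tok in tokenizer_fn(text) if tok]

print(first_char_tokens("language models learn from characters"))
# ['l', 'm', 'l', 'f', 'c']
```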

EMMA-X: An EM-like Multilingual Pre-training Algorithm for Cross-lingual Representation Learning

  • paper_url: http://arxiv.org/abs/2310.17233
  • repo_url: None
  • paper_authors: Ping Guo, Xiangpeng Wei, Yue Hu, Baosong Yang, Dayiheng Liu, Fei Huang, Jun Xie
  • for: 本研究探讨了如何学习跨语言共同表示,以增强机器翻译和其他语言处理任务的性能。
  • methods: 该研究提出了一种基于EM算法的多语言预训练算法,称为EMMA-X,用于学习跨语言共同表示。EMMA-X使用了大量的多语言非параллель数据,并将跨语言表示学习任务和额外semantic relation预测任务结合在一起。
  • results: 实验表明,EMMA-X在新引入的XRETEBenchmark上达到了状态对性能。此外,对于建立的表示空间的几何分析表明,EMMA-X在三个需求下表现出了superiority。
    Abstract Expressing universal semantics common to all languages is helpful in understanding the meanings of complex and culture-specific sentences. The research theme underlying this scenario focuses on learning universal representations across languages with the usage of massive parallel corpora. However, due to the sparsity and scarcity of parallel data, there is still a big challenge in learning authentic ``universals'' for any two languages. In this paper, we propose EMMA-X: an EM-like Multilingual pre-training Algorithm, to learn (X)Cross-lingual universals with the aid of excessive multilingual non-parallel data. EMMA-X unifies the cross-lingual representation learning task and an extra semantic relation prediction task within an EM framework. Both the extra semantic classifier and the cross-lingual sentence encoder approximate the semantic relation of two sentences, and supervise each other until convergence. To evaluate EMMA-X, we conduct experiments on XRETE, a newly introduced benchmark containing 12 widely studied cross-lingual tasks that fully depend on sentence-level representations. Results reveal that EMMA-X achieves state-of-the-art performance. Further geometric analysis of the built representation space with three requirements demonstrates the superiority of EMMA-X over advanced models.
    摘要 表达各语言共有的通用语义,有助于理解复杂且具有文化特性的句子的含义。这一场景背后的研究主题,是借助大规模平行语料学习跨语言的通用表示。然而,由于平行数据的稀疏与匮乏,为任意两种语言学习真正的通用表示仍然是一大挑战。在本文中,我们提出EMMA-X:一种类EM的多语言预训练算法,借助海量的多语言非平行数据来学习跨语言通用表示。EMMA-X在一个EM框架内统一了跨语言表示学习任务与一个额外的语义关系预测任务:额外的语义分类器与跨语言句子编码器都对两个句子之间的语义关系进行近似,并相互监督直至收敛。为评估EMMA-X,我们在新提出的XRETE基准上进行实验,该基准包含12个完全依赖句子级表示、被广泛研究的跨语言任务。结果表明EMMA-X取得了最先进的性能。进一步依照三项要求对所构建的表示空间进行的几何分析,也展示了EMMA-X相对于先进模型的优越性。

Codebook Features: Sparse and Discrete Interpretability for Neural Networks

  • paper_url: http://arxiv.org/abs/2310.17230
  • repo_url: https://github.com/taufeeque9/codebook-features
  • paper_authors: Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman
  • for: 这种方法可以帮助我们更好地理解神经网络的行为和性质。
  • methods: 我们通过在每一层引入向量量化瓶颈(vector quantization bottleneck)对网络进行微调,使其隐藏特征成为从一个较大码本中选出的少量离散向量码之和,从而得到具有稀疏、离散特征的神经网络。
  • results: 我们发现,网络即使在这种极端瓶颈下运行,性能也只有适度下降;而且这种稀疏、离散的瓶颈提供了一种直观的控制网络行为的方式:先找到期望行为出现时被激活的码,再在生成时激活这些码以诱发该行为。我们在多个数据集上训练了codebook Transformer加以验证。
    Abstract Understanding neural networks is challenging in part because of the dense, continuous nature of their hidden states. We explore whether we can train neural networks to have hidden states that are sparse, discrete, and more interpretable by quantizing their continuous features into what we call codebook features. Codebook features are produced by finetuning neural networks with vector quantization bottlenecks at each layer, producing a network whose hidden features are the sum of a small number of discrete vector codes chosen from a larger codebook. Surprisingly, we find that neural networks can operate under this extreme bottleneck with only modest degradation in performance. This sparse, discrete bottleneck also provides an intuitive way of controlling neural network behavior: first, find codes that activate when the desired behavior is present, then activate those same codes during generation to elicit that behavior. We validate our approach by training codebook Transformers on several different datasets. First, we explore a finite state machine dataset with far more hidden states than neurons. In this setting, our approach overcomes the superposition problem by assigning states to distinct codes, and we find that we can make the neural network behave as if it is in a different state by activating the code for that state. Second, we train Transformer language models with up to 410M parameters on two natural language datasets. We identify codes in these models representing diverse, disentangled concepts (ranging from negative emotions to months of the year) and find that we can guide the model to generate different topics by activating the appropriate codes during inference. Overall, codebook features appear to be a promising unit of analysis and control for neural networks and interpretability. Our codebase and models are open-sourced at https://github.com/taufeeque9/codebook-features.
    摘要 神经网络之所以难以理解,部分原因在于其隐藏状态稠密且连续。我们探究能否通过将连续特征量化为我们称之为"码本特征"(codebook features)的形式,训练出隐藏状态稀疏、离散且更易解释的神经网络。码本特征的做法是在每一层加入向量量化瓶颈对网络进行微调,使网络的隐藏特征成为从一个较大码本中选出的少量离散向量码之和。令人意外的是,我们发现神经网络在这种极端瓶颈下运行时性能只有适度下降。这种稀疏、离散的瓶颈还提供了一种直观的控制网络行为的方式:先找到期望行为出现时被激活的码,再在生成时激活这些相同的码来诱发该行为。我们通过在多个数据集上训练codebook Transformer来验证这一方法。首先,我们考察一个隐藏状态数量远多于神经元数量的有限状态机数据集;在这一设置下,我们的方法通过把状态分配给互不相同的码克服了叠加(superposition)问题,并且可以通过激活某个状态对应的码让网络表现得仿佛处于该状态。其次,我们在两个自然语言数据集上训练了参数量最多达4.1亿的Transformer语言模型,在这些模型中找到了表示多样且彼此解耦概念(从负面情绪到一年中的月份)的码,并且可以在推理时激活相应的码来引导模型生成不同主题。总体而言,码本特征有望成为分析与控制神经网络、提升可解释性的基本单元。我们的代码库与模型已开源:https://github.com/taufeeque9/codebook-features。
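A minimal PyTorch sketch of the vector-quantization bottleneck idea follows: each hidden state is replaced by the sum of its k nearest codebook vectors, with a straight-through estimator so gradients still flow. The dimensions, codebook size, and k are placeholder values, and the training losses used in the paper are omitted.

```python
import torch
import torch.nn as nn

class CodebookBottleneck(nn.Module):
    """Replace each hidden state with the sum of its k nearest codes from a
    learned codebook; gradients pass through via the straight-through trick."""
    def __init__(self, dim=64, codebook_size=512, k=4):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(codebook_size, dim))
        self.k = k

    def forward(self, h):                                        # h: (batch, seq, dim)
        dist = ((h.unsqueeze(-2) - self.codes) ** 2).sum(-1)     # (batch, seq, codebook)
        idx = dist.topk(self.k, largest=False).indices           # k nearest codes
        quantized = self.codes[idx].sum(dim=-2)                  # sum of chosen codes
        # Straight-through: forward uses the codes, backward flows through h.
        return h + (quantized - h).detach(), idx

bottleneck = CodebookBottleneck()
hidden = torch.randn(2, 10, 64)
out, chosen = bottleneck(hidden)
print(out.shape, chosen.shape)   # torch.Size([2, 10, 64]) torch.Size([2, 10, 4])
```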

X-SNS: Cross-Lingual Transfer Prediction through Sub-Network Similarity

  • paper_url: http://arxiv.org/abs/2310.17166
  • repo_url: None
  • paper_authors: Taejun Yun, Jinhyeon Kim, Deokyeong Kang, Seong Hoon Lim, Jihoon Kim, Taeuk Kim
  • for: 本研究旨在预测跨语言迁移(XLT)中语言之间的兼容性,以便选择合适的源语言来提升模型性能。
  • methods: 我们提出以两种语言的子网络相似度作为预测其在XLT中兼容性的代理指标。该方法面向模型内部机制,且只需少量候选语言的原始文本,这一点不同于大多数依赖外部资源的已有方法。
  • results: 我们在多项任务上的实验表明该方法比基线更有效;具体而言,它在零样本XLT的候选排序上平均提升了4.6%的NDCG@3。我们还提供了大量分析,证实了子网络对XLT预测的价值。
    Abstract Cross-lingual transfer (XLT) is an emergent ability of multilingual language models that preserves their performance on a task to a significant extent when evaluated in languages that were not included in the fine-tuning process. While English, due to its widespread usage, is typically regarded as the primary language for model adaption in various tasks, recent studies have revealed that the efficacy of XLT can be amplified by selecting the most appropriate source languages based on specific conditions. In this work, we propose the utilization of sub-network similarity between two languages as a proxy for predicting the compatibility of the languages in the context of XLT. Our approach is model-oriented, better reflecting the inner workings of foundation models. In addition, it requires only a moderate amount of raw text from candidate languages, distinguishing it from the majority of previous methods that rely on external resources. In experiments, we demonstrate that our method is more effective than baselines across diverse tasks. Specifically, it shows proficiency in ranking candidates for zero-shot XLT, achieving an improvement of 4.6% on average in terms of NDCG@3. We also provide extensive analyses that confirm the utility of sub-networks for XLT prediction.
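The abstract describes the idea only at a high level; one plausible instantiation (not necessarily the authors' exact procedure) is to score parameter importance per language on a small amount of raw text, keep the top fraction as a binary sub-network mask, and rank source languages by mask overlap. The importance measure, sparsity level, and Jaccard similarity below are all assumptions.

```python
import numpy as np

def subnetwork_mask(importance: np.ndarray, top_fraction: float = 0.15) -> np.ndarray:
    """Keep the most important parameters (e.g. squared-gradient scores
    accumulated on a small monolingual corpus) as a binary mask."""
    k = int(len(importance) * top_fraction)
    mask = np.zeros(len(importance), dtype=bool)
    mask[np.argsort(importance)[-k:]] = True
    return mask

def jaccard(a: np.ndarray, b: np.ndarray) -> float:
    return (a & b).sum() / (a | b).sum()

# hypothetical per-language importance vectors over the same flattened parameters
rng = np.random.default_rng(0)
candidate_scores = {lang: rng.random(10_000) for lang in ["de", "ru", "hi", "sw"]}
target_scores = rng.random(10_000)

target_mask = subnetwork_mask(target_scores)
ranking = sorted(
    candidate_scores,
    key=lambda lang: jaccard(subnetwork_mask(candidate_scores[lang]), target_mask),
    reverse=True,
)
print(ranking)   # candidate source languages ordered by sub-network overlap
```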

Supercharging academic writing with generative AI: framework, techniques, and caveats

  • paper_url: http://arxiv.org/abs/2310.17143
  • repo_url: None
  • paper_authors: Zhicheng Lin
  • For: This paper aims to improve the quality and efficiency of academic writing by leveraging generative artificial intelligence (AI) and large language models (LLMs).
  • Methods: The authors propose a human-AI collaborative framework for writing that delineates the rationale, process, and nature of AI engagement in writing. They also describe effective prompting techniques for incorporating AI into the writing routine and strategies for maintaining rigorous scholarship.
  • Results: The authors argue that the prudent integration of AI into academic writing can ease the communication burden, empower authors, accelerate discovery, and promote diversity in science.
    Abstract Academic writing is an indispensable yet laborious part of the research enterprise. This Perspective maps out principles and methods for using generative artificial intelligence (AI), specifically large language models (LLMs), to elevate the quality and efficiency of academic writing. We introduce a human-AI collaborative framework that delineates the rationale (why), process (how), and nature (what) of AI engagement in writing. The framework pinpoints both short-term and long-term reasons for engagement and their underlying mechanisms (e.g., cognitive offloading and imaginative stimulation). It reveals the role of AI throughout the writing process, conceptualized through a two-stage model for human-AI collaborative writing, and the nature of AI assistance in writing, represented through a model of writing-assistance types and levels. Building on this framework, we describe effective prompting techniques for incorporating AI into the writing routine (outlining, drafting, and editing) as well as strategies for maintaining rigorous scholarship, adhering to varied journal policies, and avoiding overreliance on AI. Ultimately, the prudent integration of AI into academic writing can ease the communication burden, empower authors, accelerate discovery, and promote diversity in science.

M2C: Towards Automatic Multimodal Manga Complement

  • paper_url: http://arxiv.org/abs/2310.17130
  • repo_url: https://github.com/hc-guo/m2c
  • paper_authors: Hongcheng Guo, Boyang Wang, Jiaqi Bai, Jiaheng Liu, Jian Yang, Zhoujun Li
  • for: 提高漫画理解,结合视觉和文本特征
  • methods: 利用大语言模型挖掘漫画事件知识,并使用细粒度视觉提示支持漫画补充
  • results: FVP-M$^{2}$ 方法在多模态漫画补充任务上验证了有效性
    Abstract Multimodal manga analysis focuses on enhancing manga understanding with visual and textual features, which has attracted considerable attention from both natural language processing and computer vision communities. Currently, most comics are hand-drawn and prone to problems such as missing pages, text contamination, and aging, resulting in missing comic text content and seriously hindering human comprehension. In other words, the Multimodal Manga Complement (M2C) task has not been investigated, which aims to handle the aforementioned issues by providing a shared semantic space for vision and language understanding. To this end, we first propose the Multimodal Manga Complement task by establishing a new M2C benchmark dataset covering two languages. Next, we design a manga argumentation method called MCoT to mine event knowledge in comics with large language models. Then, an effective baseline FVP-M$^{2}$ using fine-grained visual prompts is proposed to support manga complement. Extensive experimental results show the effectiveness of the FVP-M$^{2}$ method for Multimodal Manga Complement.
    摘要 多模态漫画分析旨在结合视觉和文本特征来增强漫画理解,已吸引自然语言处理和计算机视觉领域的广泛关注。目前大多数漫画是手绘的,容易出现缺页、文字污损和老化等问题,导致漫画文本内容缺失,严重妨碍人类理解。换言之,旨在通过为视觉与语言理解提供共享语义空间来解决上述问题的多模态漫画补充(M2C)任务尚未得到研究。为此,我们首先提出了多模态漫画补充任务,并建立了一个覆盖两种语言的新 M2C 基准数据集。随后,我们设计了一种名为 MCoT 的漫画论证方法,利用大型语言模型挖掘漫画中的事件知识。最后,我们提出了一个使用细粒度视觉提示的有效基线方法 FVP-M$^{2}$ 来支持漫画补充。大量实验结果表明了 FVP-M$^{2}$ 方法在多模态漫画补充任务上的有效性。

Test-time Augmentation for Factual Probing

  • paper_url: http://arxiv.org/abs/2310.17121
  • repo_url: https://github.com/gokamoda/TTA4FactualProbing
  • paper_authors: Go Kamoda, Benjamin Heinzerling, Keisuke Sakaguchi, Kentaro Inui
  • for: 验证语言模型是否掌握特定的世界知识事实。
  • methods: 使用提示来测试语言模型的知识。
  • results: 使用test-time augmentation(TTA)可以减少提示变化导致的模型敏感性,并提高模型评估准确性。但是,一些模型可能会受到TTA的影响,导致质量下降。
    Abstract Factual probing is a method that uses prompts to test if a language model "knows" certain world knowledge facts. A problem in factual probing is that small changes to the prompt can lead to large changes in model output. Previous work aimed to alleviate this problem by optimizing prompts via text mining or fine-tuning. However, such approaches are relation-specific and do not generalize to unseen relation types. Here, we propose to use test-time augmentation (TTA) as a relation-agnostic method for reducing sensitivity to prompt variations by automatically augmenting and ensembling prompts at test time. Experiments show improved model calibration, i.e., with TTA, model confidence better reflects prediction accuracy. Improvements in prediction accuracy are observed for some models, but for other models, TTA leads to degradation. Error analysis identifies the difficulty of producing high-quality prompt variations as the main challenge for TTA.
    摘要 事实探测(factual probing)是一种使用提示来测试语言模型是否"知道"特定世界知识事实的方法。其问题在于,提示的微小变化可能导致模型输出的巨大变化。先前的工作试图通过文本挖掘或微调来优化提示以缓解这一问题,但这类方法针对特定关系,无法泛化到未见过的关系类型。在本文中,我们提出使用测试时增强(TTA)作为一种与关系无关的方法,通过在测试时自动增强并集成多个提示,降低模型对提示变化的敏感性。实验表明 TTA 改善了模型校准,即模型的置信度能更好地反映预测准确率。部分模型的预测准确率有所提升,但也有模型在使用 TTA 后性能下降。错误分析表明,生成高质量的提示变体是 TTA 面临的主要挑战。
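As a rough illustration of the TTA idea (not the paper's exact pipeline), one can automatically generate prompt variants, score each answer candidate under every variant, and average the resulting distributions. The placeholder scorer and the trivial "paraphrases" below are stand-ins for a real language model and a real augmentation module.

```python
import numpy as np

def tta_predict(score_fn, prompt, candidates, paraphrase_fns):
    """Average per-prompt answer distributions over automatically augmented prompts.
    `score_fn(prompt, candidate)` stands in for a language model's scoring call."""
    prompts = [prompt] + [f(prompt) for f in paraphrase_fns]
    avg = np.zeros(len(candidates))
    for p in prompts:
        logits = np.array([score_fn(p, c) for c in candidates])
        probs = np.exp(logits - logits.max())
        avg += probs / probs.sum()
    avg /= len(prompts)
    return candidates[int(avg.argmax())], avg

# toy usage with a dummy scorer and trivial prompt variants
dummy_score = lambda p, c: 1.0 if c == "Paris" else 0.0
answer, probs = tta_predict(
    dummy_score,
    "The capital of France is [MASK].",
    ["Paris", "Lyon", "Berlin"],
    [str.lower, lambda s: s.replace("capital", "capital city")],
)
print(answer, probs.round(3))
```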

FLEEK: Factual Error Detection and Correction with Evidence Retrieved from External Knowledge

  • paper_url: http://arxiv.org/abs/2310.17119
  • repo_url: None
  • paper_authors: Farima Fatahi Bayat, Kun Qian, Benjamin Han, Yisi Sang, Anton Belyi, Samira Khorshidi, Fei Wu, Ihab F. Ilyas, Yunyao Li
  • for: 检测文本信息中的事实错误,以便做出有知识基础的决策。
  • methods: 使用原型工具 FLEEK,自动提取文本中的事实声明,从外部知识源收集证据,评估每个声明的事实性,并提供修复错误的建议。
  • results: 初步实验显示 FLEEK 可以准确地检测事实错误(77-85% F1)。
    Abstract Detecting factual errors in textual information, whether generated by large language models (LLM) or curated by humans, is crucial for making informed decisions. LLMs' inability to attribute their claims to external knowledge and their tendency to hallucinate makes it difficult to rely on their responses. Humans, too, are prone to factual errors in their writing. Since manual detection and correction of factual errors is labor-intensive, developing an automatic approach can greatly reduce human effort. We present FLEEK, a prototype tool that automatically extracts factual claims from text, gathers evidence from external knowledge sources, evaluates the factuality of each claim, and suggests revisions for identified errors using the collected evidence. Initial empirical evaluation on fact error detection (77-85\% F1) shows the potential of FLEEK. A video demo of FLEEK can be found at https://youtu.be/NapJFUlkPdQ.
    摘要 检测文本信息中的事实错误,无论其由大型语言模型(LLM)生成还是由人工撰写,对做出明智决策都至关重要。LLM 无法将其陈述归因于外部知识,并且容易产生幻觉,使人难以依赖其回答;人类在写作中同样容易出现事实错误。由于人工检测和更正事实错误耗费大量人力,开发自动化方法可以大幅减少人工负担。我们提出了原型工具 FLEEK,它可以自动从文本中提取事实声明,从外部知识源收集证据,评估每个声明的事实性,并基于收集到的证据为发现的错误提出修改建议。初步实验(事实错误检测 F1 为 77%-85%)显示了 FLEEK 的潜力。FLEEK 的视频演示见 https://youtu.be/NapJFUlkPdQ。

cs.LG - 2023-10-26

A Spectral Condition for Feature Learning

  • paper_url: http://arxiv.org/abs/2310.17813
  • repo_url: None
  • paper_authors: Greg Yang, James B. Simon, Jeremy Bernstein
  • for: 本研究旨在探讨初始化和训练大型神经网络时的特点和挑战,以及如何使神经网络的内部表示在各种宽度下发展为非常重要的过程。
  • methods: 本研究使用spectral norm的扩展来扩大神经网络的训练,并证明了这种方法可以在各种宽度下使神经网络的内部表示进行非常有用的学习。
  • results: 本研究发现,通过控制weight矩阵的spectral norm和其更新的大小,可以实现在神经网络中feature learning的过程。此外,本研究还提出了一种简单的 derivation of 最大更新参数化。
    Abstract The push to train ever larger neural networks has motivated the study of initialization and training at large network width. A key challenge is to scale training so that a network's internal representations evolve nontrivially at all widths, a process known as feature learning. Here, we show that feature learning is achieved by scaling the spectral norm of weight matrices and their updates like $\sqrt{\texttt{fan-out}/\texttt{fan-in}}$, in contrast to widely used but heuristic scalings based on Frobenius norm and entry size. Our spectral scaling analysis also leads to an elementary derivation of \emph{maximal update parametrization}. All in all, we aim to provide the reader with a solid conceptual understanding of feature learning in neural networks.
    摘要 训练越来越大的神经网络的需求推动了对大宽度下初始化与训练的研究。一个关键挑战是如何缩放训练,使网络的内部表示在任意宽度下都能发生非平凡的演化,这一过程被称为特征学习。我们证明,与常用但启发式的基于 Frobenius 范数和元素大小的缩放不同,按 $\sqrt{\texttt{fan-out}/\texttt{fan-in}}$ 缩放权重矩阵及其更新的谱范数即可实现特征学习。我们的谱缩放分析还给出了最大更新参数化(maximal update parametrization)的一个基本推导。总之,我们希望为读者提供对神经网络特征学习的扎实概念性理解。
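A toy numerical sketch of the prescription in the abstract — making the spectral norm of a weight matrix (and, analogously, of its update) scale like $\sqrt{\texttt{fan-out}/\texttt{fan-in}}$ — is given below; the exact constants and where the learning rate enters are illustrative assumptions.

```python
import numpy as np

def spectral_rescale(M: np.ndarray, target_norm: float) -> np.ndarray:
    """Rescale a matrix so its spectral norm (largest singular value) equals target_norm."""
    current = np.linalg.norm(M, ord=2)
    return M if current == 0 else M * (target_norm / current)

def spectral_init(fan_in: int, fan_out: int, rng=None) -> np.ndarray:
    rng = rng or np.random.default_rng(0)
    W = rng.standard_normal((fan_out, fan_in))
    return spectral_rescale(W, np.sqrt(fan_out / fan_in))

def spectral_update(grad: np.ndarray, fan_in: int, fan_out: int, lr: float = 0.1) -> np.ndarray:
    # scale the step so its spectral norm also follows sqrt(fan_out / fan_in)
    return spectral_rescale(grad, lr * np.sqrt(fan_out / fan_in))

W = spectral_init(fan_in=256, fan_out=1024)
print(np.linalg.norm(W, 2), np.sqrt(1024 / 256))   # both equal 2.0
```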

Interacting Diffusion Processes for Event Sequence Forecasting

  • paper_url: http://arxiv.org/abs/2310.17800
  • repo_url: None
  • paper_authors: Mai Zeng, Florence Regol, Mark Coates
  • for: 预测不规则时间间隔内的事件序列(long-horizon forecasting of Temporal Point Processes)
  • methods: 提出了一种基于扩散生成模型的新方法,允许多步预测基于历史事件序列,并直接学习事件类型和时间间隔之间的联合概率分布。
  • results: 与现有基eline比较,该方法在长期预测TPP方面表现出色,得到了更好的结果。
    Abstract Neural Temporal Point Processes (TPPs) have emerged as the primary framework for predicting sequences of events that occur at irregular time intervals, but their sequential nature can hamper performance for long-horizon forecasts. To address this, we introduce a novel approach that incorporates a diffusion generative model. The model facilitates sequence-to-sequence prediction, allowing multi-step predictions based on historical event sequences. In contrast to previous approaches, our model directly learns the joint probability distribution of types and inter-arrival times for multiple events. This allows us to fully leverage the high dimensional modeling capability of modern generative models. Our model is composed of two diffusion processes, one for the time intervals and one for the event types. These processes interact through their respective denoising functions, which can take as input intermediate representations from both processes, allowing the model to learn complex interactions. We demonstrate that our proposal outperforms state-of-the-art baselines for long-horizon forecasting of TPP.

Neural Stress Fields for Reduced-order Elastoplasticity and Fracture

  • paper_url: http://arxiv.org/abs/2310.17790
  • repo_url: None
  • paper_authors: Zeshun Zong, Xuan Li, Minchen Li, Maurizio M. Chiaramonte, Wojciech Matusik, Eitan Grinspun, Kevin Carlberg, Chenfanfu Jiang, Peter Yichen Chen
  • for: 这项研究旨在开发一个混合神经网络和物理的框架,用于弹塑性与断裂的降阶建模。
  • methods: 该方法使用神经网络将基尔霍夫应力场、形变场与仿射动量场映射到共享的低维隐空间中,并通过在该隐空间中演化来快速完成新的仿真。
  • results: 该研究实现了对弹塑性与断裂场景的降阶仿真,维度降低可达 100,000 倍,计算时间节省可达 10 倍。
    Abstract We propose a hybrid neural network and physics framework for reduced-order modeling of elastoplasticity and fracture. State-of-the-art scientific computing models like the Material Point Method (MPM) faithfully simulate large-deformation elastoplasticity and fracture mechanics. However, their long runtime and large memory consumption render them unsuitable for applications constrained by computation time and memory usage, e.g., virtual reality. To overcome these barriers, we propose a reduced-order framework. Our key innovation is training a low-dimensional manifold for the Kirchhoff stress field via an implicit neural representation. This low-dimensional neural stress field (NSF) enables efficient evaluations of stress values and, correspondingly, internal forces at arbitrary spatial locations. In addition, we also train neural deformation and affine fields to build low-dimensional manifolds for the deformation and affine momentum fields. These neural stress, deformation, and affine fields share the same low-dimensional latent space, which uniquely embeds the high-dimensional simulation state. After training, we run new simulations by evolving in this single latent space, which drastically reduces the computation time and memory consumption. Our general continuum-mechanics-based reduced-order framework is applicable to any phenomena governed by the elastodynamics equation. To showcase the versatility of our framework, we simulate a wide range of material behaviors, including elastica, sand, metal, non-Newtonian fluids, fracture, contact, and collision. We demonstrate dimension reduction by up to 100,000X and time savings by up to 10X.
    摘要 我们提出一种混合神经网络与物理的框架,用于弹塑性与断裂的降阶建模。最先进的科学计算模型,如物质点法(MPM),能够忠实地模拟大变形弹塑性与断裂力学,但其运行时间长、内存消耗大,难以用于受计算时间和内存限制的应用,例如虚拟现实。为克服这些障碍,我们提出一个降阶框架。我们的关键创新是通过隐式神经表示为基尔霍夫应力场训练一个低维流形。这种低维神经应力场(NSF)使我们能够在任意空间位置高效地计算应力值及相应的内力。此外,我们还训练神经形变场和仿射场,为形变场与仿射动量场建立低维流形。这些神经应力、形变与仿射场共享同一个低维隐空间,从而唯一地嵌入高维仿真状态。训练完成后,我们只需在这一隐空间中演化即可进行新的仿真,大幅降低计算时间和内存消耗。我们这一基于连续介质力学的通用降阶框架适用于任何由弹性动力学方程支配的现象。为展示框架的多样性,我们模拟了多种材料行为,包括弹性体、沙、金属、非牛顿流体、断裂、接触与碰撞。我们展示了高达 100,000 倍的维度降低和高达 10 倍的时间节省。
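A minimal sketch of an implicit neural stress field in the spirit of the abstract: a small MLP maps a shared low-dimensional latent code plus a query position to a symmetric-stress-like output. The layer widths, latent size, and the 6-component stress parameterization are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class NeuralStressField(nn.Module):
    """Sketch: (latent code, query position) -> 6 independent components of a
    symmetric Kirchhoff-stress-like tensor."""

    def __init__(self, latent_dim: int = 16, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 6),
        )

    def forward(self, z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # z: (N, latent_dim) shared simulation state, x: (N, 3) query points
        return self.net(torch.cat([z, x], dim=-1))

nsf = NeuralStressField()
z = torch.zeros(4, 16)     # low-dimensional latent state evolved during reduced-order stepping
x = torch.rand(4, 3)       # arbitrary spatial locations where stress is queried
print(nsf(z, x).shape)     # torch.Size([4, 6])
```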

Understanding when Dynamics-Invariant Data Augmentations Benefit Model-Free Reinforcement Learning Updates

  • paper_url: http://arxiv.org/abs/2310.17786
  • repo_url: None
  • paper_authors: Nicholas E. Corrado, Josiah P. Hanna
  • for: 本研究旨在找到数据扩充(DA)方法在奖励学习(RL)任务中提高数据效率的特定要素。
  • methods: 本研究使用了动态不变的数据扩充函数,并将其集成到 RL 更新中。我们分析了 DA 的三个相关方面:状态-动作覆盖率、奖励密度,以及每次更新中生成的扩充转移数量(扩充回放比率)。
  • results: 我们的实验结果表明,提高状态-动作覆盖率对数据效率的影响往往大于提高奖励密度;同时,在若干任务中,降低扩充回放比率可以大幅提高数据效率。
    Abstract Recently, data augmentation (DA) has emerged as a method for leveraging domain knowledge to inexpensively generate additional data in reinforcement learning (RL) tasks, often yielding substantial improvements in data efficiency. While prior work has demonstrated the utility of incorporating augmented data directly into model-free RL updates, it is not well-understood when a particular DA strategy will improve data efficiency. In this paper, we seek to identify general aspects of DA responsible for observed learning improvements. Our study focuses on sparse-reward tasks with dynamics-invariant data augmentation functions, serving as an initial step towards a more general understanding of DA and its integration into RL training. Experimentally, we isolate three relevant aspects of DA: state-action coverage, reward density, and the number of augmented transitions generated per update (the augmented replay ratio). From our experiments, we draw two conclusions: (1) increasing state-action coverage often has a much greater impact on data efficiency than increasing reward density, and (2) decreasing the augmented replay ratio substantially improves data efficiency. In fact, certain tasks in our empirical study are solvable only when the replay ratio is sufficiently low.
    摘要 最近,数据扩充(DA)已成为一种利用领域知识以低成本生成额外数据的方法,用于强化学习(RL)任务,通常能显著提高数据效率。尽管先前的工作已经证明将扩充数据直接纳入无模型 RL 更新是有益的,但人们尚不清楚哪种 DA 策略在什么情况下能提升数据效率。在本文中,我们试图找出导致学习改进的 DA 的一般性因素。我们的研究聚焦于具有动态不变数据扩充函数的稀疏奖励任务,作为更全面理解 DA 及其与 RL 训练结合的第一步。在实验中,我们分离出 DA 的三个相关方面:状态-动作覆盖率、奖励密度,以及每次更新中生成的扩充转移数量(扩充回放比率)。我们得出两个结论:(1)提高状态-动作覆盖率对数据效率的影响往往远大于提高奖励密度;(2)降低扩充回放比率能显著提高数据效率。事实上,我们实证研究中的某些任务只有在回放比率足够低时才能求解。
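The role of the augmented replay ratio can be illustrated with a toy helper that mixes real and augmented transitions for a single update. The exact definition of the ratio and the mirror augmentation below are illustrative assumptions, not the paper's benchmark setup.

```python
import numpy as np

def mix_batch(real_batch, augment, replay_ratio=0.25, rng=None):
    """Return the real transitions plus roughly `replay_ratio * len(real_batch)`
    augmented copies produced by a dynamics-invariant transform."""
    rng = rng or np.random.default_rng(0)
    n_aug = int(len(real_batch) * replay_ratio)
    picks = rng.choice(len(real_batch), size=n_aug, replace=True)
    return real_batch + [augment(real_batch[i]) for i in picks]

# toy dynamics-invariant augmentation: mirror a 1-D state/action transition
mirror = lambda t: {"s": -t["s"], "a": -t["a"], "r": t["r"], "s2": -t["s2"]}
real = [{"s": 0.3, "a": 1.0, "r": 0.0, "s2": 0.4} for _ in range(8)]
print(len(mix_batch(real, mirror, replay_ratio=0.25)))   # 8 real + 2 augmented transitions
```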

Learning Extrinsic Dexterity with Parameterized Manipulation Primitives

  • paper_url: http://arxiv.org/abs/2310.17785
  • repo_url: None
  • paper_authors: Shih-Min Yang, Martin Magnusson, Johannes A. Stork, Todor Stoyano
  • for: 本研究旨在解决机器人抓取问题中所有抓取都被遮挡(例如被环境遮挡)的情形。此时单次抓取规划必然失败,必须先将物体操作到一个可供抓取的位姿。
  • methods: 本研究使用分层强化学习来学习一系列动作,利用环境来改变物体的位姿。具体来说,我们用参数化操作基元来组合低级操作策略。通过学习低级操作策略,我们的方法可以利用物体、夹爪和环境之间的交互来控制物体的状态。
  • results: 我们的方法在各种不同重量、形状和摩擦属性的箱形物体上成功完成抓取任务。此外,我们的方法可以迁移到真实机器人上,并在 98% 的实验中成功完成抓取任务。
    Abstract Many practically relevant robot grasping problems feature a target object for which all grasps are occluded, e.g., by the environment. Single-shot grasp planning invariably fails in such scenarios. Instead, it is necessary to first manipulate the object into a configuration that affords a grasp. We solve this problem by learning a sequence of actions that utilize the environment to change the object's pose. Concretely, we employ hierarchical reinforcement learning to combine a sequence of learned parameterized manipulation primitives. By learning the low-level manipulation policies, our approach can control the object's state through exploiting interactions between the object, the gripper, and the environment. Designing such a complex behavior analytically would be infeasible under uncontrolled conditions, as an analytic approach requires accurate physical modeling of the interaction and contact dynamics. In contrast, we learn a hierarchical policy model that operates directly on depth perception data, without the need for object detection, pose estimation, or manual design of controllers. We evaluate our approach on picking box-shaped objects of various weight, shape, and friction properties from a constrained table-top workspace. Our method transfers to a real robot and is able to successfully complete the object picking task in 98\% of experimental trials.
    摘要 许多实用的机器人抓取问题中,目标对象都被环境 occluded,例如机器人的抓取方法无法成功。而不是单shot grasp planning,我们需要先将对象变换到一个可以抓取的配置。我们解决这个问题的方法是通过学习一系列的动作,利用环境来改变对象的姿态。具体来说,我们使用层次强化学习来组合一系列学习的参数化操作。通过学习低级操作策略,我们的方法可以通过利用对象、抓取器和环境之间的互动来控制对象的状态。设计这样的复杂行为分析性是在无控制条件下不可能的,因为分析方法需要准确地模型对象和接触动力学。相比之下,我们学习了层次策略模型,直接在深度感知数据上操作,无需对象检测、姿态估计或人工设计控制器。我们在实验中使用了各种不同的箱形对象,重量、形状和黏性属性都有很大的变化。我们的方法在98%的实验中成功完成对象抓取任务。

Learning Optimal Classification Trees Robust to Distribution Shifts

  • paper_url: http://arxiv.org/abs/2310.17772
  • repo_url: None
  • paper_authors: Nathan Justin, Sina Aghaei, Andrés Gómez, Phebe Vayanos
  • for: 这篇论文主要应用于高风险场景,例如公共健康和社会工作中处理自愿报告的问卷调查数据,因为这类数据受到问题表述、调查时间与地点以及受访者对访谈者的信任程度等多种因素的影响。
  • methods: 本论文提出了一种基于混合整数鲁棒优化技术的分类树学习方法,具体来说是将鲁棒分类树学习问题表述为一个单阶段混合整数鲁棒优化问题,并将其等价地重构为两阶段线性鲁棒优化问题求解。
  • results: 本论文的结果显示,相比于非鲁棒的分类树学习方法,使用本文提出的鲁棒分类树学习方法可以将最差情况准确率提高多达 12.48%,平均情况准确率提高多达 4.85%。
    Abstract We consider the problem of learning classification trees that are robust to distribution shifts between training and testing/deployment data. This problem arises frequently in high stakes settings such as public health and social work where data is often collected using self-reported surveys which are highly sensitive to e.g., the framing of the questions, the time when and place where the survey is conducted, and the level of comfort the interviewee has in sharing information with the interviewer. We propose a method for learning optimal robust classification trees based on mixed-integer robust optimization technology. In particular, we demonstrate that the problem of learning an optimal robust tree can be cast as a single-stage mixed-integer robust optimization problem with a highly nonlinear and discontinuous objective. We reformulate this problem equivalently as a two-stage linear robust optimization problem for which we devise a tailored solution procedure based on constraint generation. We evaluate the performance of our approach on numerous publicly available datasets, and compare the performance to a regularized, non-robust optimal tree. We show an increase of up to 12.48% in worst-case accuracy and of up to 4.85% in average-case accuracy across several datasets and distribution shifts from using our robust solution in comparison to the non-robust one.
    摘要 我们研究如何学习对训练数据与测试/部署数据之间的分布偏移具有鲁棒性的分类树。这一问题经常出现在公共健康和社会工作等高风险场景中:数据通常通过自我报告问卷收集,而这类问卷对问题的表述方式、调查的时间与地点以及受访者与访谈者分享信息的意愿高度敏感。我们提出了一种基于混合整数鲁棒优化技术的最优鲁棒分类树学习方法。特别地,我们证明学习最优鲁棒树的问题可以表述为一个目标函数高度非线性且不连续的单阶段混合整数鲁棒优化问题;我们将其等价地重构为一个两阶段线性鲁棒优化问题,并为此设计了基于约束生成的专用求解流程。我们在多个公开数据集上评估了该方法,并与正则化的非鲁棒最优树进行比较。在多个数据集和分布偏移下,相比非鲁棒方案,我们的鲁棒方案使最差情况准确率提高多达 12.48%,平均情况准确率提高多达 4.85%。

Distributed Personalized Empirical Risk Minimization

  • paper_url: http://arxiv.org/abs/2310.17761
  • repo_url: None
  • paper_authors: Yuyang Deng, Mohammad Mahdi Kamani, Pouria Mahdavinia, Mehrdad Mahdavi
  • for: 本研究提出了一种新的个性化风险最小化(PERM)模式,以便从不同数据源中学习,不需要对参与设备的计算资源做严格的限制。
  • methods: 本研究使用了个性化 empirical loss 的权重学习,通过有效地估计数据分布之间的统计差异,实现了对所有本地分布的最佳统计准确性。
  • results: 提议的分布式算法可以同时优化 PERM 目标 для所有设备,并可以学习不同客户端的个性化模型 architecture(例如,不同参数数量的神经网络),从而限制下各客户端的内存和计算资源。
    Abstract This paper advocates a new paradigm Personalized Empirical Risk Minimization (PERM) to facilitate learning from heterogeneous data sources without imposing stringent constraints on computational resources shared by participating devices. In PERM, we aim to learn a distinct model for each client by learning who to learn with and personalizing the aggregation of local empirical losses by effectively estimating the statistical discrepancy among data distributions, which entails optimal statistical accuracy for all local distributions and overcomes the data heterogeneity issue. To learn personalized models at scale, we propose a distributed algorithm that replaces the standard model averaging with model shuffling to simultaneously optimize PERM objectives for all devices. This also allows us to learn distinct model architectures (e.g., neural networks with different numbers of parameters) for different clients, thus confining underlying memory and compute resources of individual clients. We rigorously analyze the convergence of the proposed algorithm and conduct experiments that corroborate the effectiveness of the proposed paradigm.
    摘要 To scale up personalized learning, the proposed algorithm replaces standard model averaging with model shuffling, which simultaneously optimizes PERM objectives for all devices. This allows for the learning of distinct model architectures (e.g., neural networks with different numbers of parameters) for different clients, thereby limiting the underlying memory and compute resources of individual clients. The proposed algorithm is rigorously analyzed, and experimental results demonstrate the effectiveness of the proposed paradigm.

Optimal Guarantees for Algorithmic Reproducibility and Gradient Complexity in Convex Optimization

  • paper_url: http://arxiv.org/abs/2310.17759
  • repo_url: None
  • paper_authors: Liang Zhang, Junchi Yang, Amin Karbasi, Niao He
  • for: 本研究探讨了机器学习算法的重复性问题,即小变化在训练过程中的输出差异。
  • methods: 本研究使用了规范基本的算法来实现优化的重复性和梯度复杂度之间的权衡。
  • results: 研究发现,对于凸最小化和凸-凹最小化问题,可以在不同的异常 oracle 设定下实现优化的重复性和梯度复杂度。特别是,通过使用规范正则化算法,可以在不同的初始化 oracle 下实现最佳的重复性和梯度复杂度。
    Abstract Algorithmic reproducibility measures the deviation in outputs of machine learning algorithms upon minor changes in the training process. Previous work suggests that first-order methods would need to trade-off convergence rate (gradient complexity) for better reproducibility. In this work, we challenge this perception and demonstrate that both optimal reproducibility and near-optimal convergence guarantees can be achieved for smooth convex minimization and smooth convex-concave minimax problems under various error-prone oracle settings. Particularly, given the inexact initialization oracle, our regularization-based algorithms achieve the best of both worlds - optimal reproducibility and near-optimal gradient complexity - for minimization and minimax optimization. With the inexact gradient oracle, the near-optimal guarantees also hold for minimax optimization. Additionally, with the stochastic gradient oracle, we show that stochastic gradient descent ascent is optimal in terms of both reproducibility and gradient complexity. We believe our results contribute to an enhanced understanding of the reproducibility-convergence trade-off in the context of convex optimization.
    摘要 算法可复现性衡量机器学习算法在训练过程发生微小变化后输出的偏差。先前的工作表明,一阶方法似乎需要以收敛速度(梯度复杂度)为代价来换取更好的可复现性。在本文中,我们挑战这一看法,并证明对于光滑凸最小化和光滑凸-凹极小极大问题,在多种带误差的预言机(oracle)设定下,可以同时获得最优的可复现性与近似最优的收敛保证。具体而言,在初始化预言机不精确的情形下,我们基于正则化的算法对最小化和极小极大优化同时达到最优可复现性与近似最优梯度复杂度;在梯度预言机不精确的情形下,近似最优的保证对极小极大优化同样成立。此外,在随机梯度预言机下,我们证明随机梯度下降-上升在可复现性和梯度复杂度两方面都是最优的。我们相信这些结果有助于加深对凸优化中可复现性-收敛权衡的理解。

PockEngine: Sparse and Efficient Fine-tuning in a Pocket

  • paper_url: http://arxiv.org/abs/2310.17752
  • repo_url: None
  • paper_authors: Ligeng Zhu, Lanxiang Hu, Ji Lin, Wei-Chen Wang, Wei-Ming Chen, Chuang Gan, Song Han
  • for: 支持离线学习和隐私保护的个性化自适应(如本地精细调参大语言模型)。
  • methods: 使用紧凑的、稀疏的和高效的引擎(PockEngine)来实现在边缘设备上的训练和调参。 PockEngine 支持稀疏反传和编译器优化,以提高训练效率。
  • results: PockEngine 可以在不同的应用、前端和硬件背景下进行敏捷的调参和训练,并且可以实现大量的硬件兼容性和内存减少。在评估中,PockEngine 可以在 Raspberry Pi 上实现15倍的速度提升,在 Jetson AGX Orin 上实现5.6倍的内存减少和7.9倍的大语言模型调参速度提升。
    Abstract On-device learning and efficient fine-tuning enable continuous and privacy-preserving customization (e.g., locally fine-tuning large language models on personalized data). However, existing training frameworks are designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and lack the optimizations for learning on the edge, which faces challenges of resource limitations and edge hardware diversity. We introduce PockEngine: a tiny, sparse and efficient engine to enable fine-tuning on various edge devices. PockEngine supports sparse backpropagation: it prunes the backward graph and sparsely updates the model with measured memory saving and latency reduction while maintaining the model quality. Secondly, PockEngine is compilation first: the entire training graph (including forward, backward and optimization steps) is derived at compile-time, which reduces the runtime overhead and brings opportunities for graph transformations. PockEngine also integrates a rich set of training graph optimizations, thus can further accelerate the training cost, including operator reordering and backend switching. PockEngine supports diverse applications, frontends and hardware backends: it flexibly compiles and tunes models defined in PyTorch/TensorFlow/Jax and deploys binaries to mobile CPU/GPU/DSPs. We evaluated PockEngine on both vision models and large language models. PockEngine achieves up to 15 $\times$ speedup over off-the-shelf TensorFlow (Raspberry Pi), 5.6 $\times$ memory saving back-propagation (Jetson AGX Orin). Remarkably, PockEngine enables fine-tuning LLaMav2-7B on NVIDIA Jetson AGX Orin at 550 tokens/s, 7.9$\times$ faster than the PyTorch.
    摘要 “Device上学习和高效精度调整允许不间断和隐私保护的自定义(例如,在本地调整大语言模型的个性化数据)。然而,现有的训练框架是为云服务器设计,具有强大加速器(例如,GPU和TPU),而不是面临边缘设备的挑战,包括资源限制和边缘硬件多样性。我们介绍了PockEngine:一个小巧、稀疏和高效的引擎,用于在不同的边缘设备上进行精度调整。PockEngine支持稀疏反传:它缩短反传图和稀疏地更新模型,同时保持模型质量,减少内存占用和延迟。其次,PockEngine是编译先的:整个训练图(包括前向、反向和优化步骤)在编译时确定,从而减少运行时开销并带来可视化和优化图 transformations。PockEngine还集成了丰富的训练图优化功能,可以进一步降低训练成本,包括操作重定义和后端转换。PockEngine支持多种应用程序、前端和硬件后端:它可靠地编译和调整PyTorch/TensorFlow/Jax定义的模型,并将二进制发布到移动CPU/GPU/DSPs。我们对PockEngine进行了对比分析,并证明它可以在Raspberry Pi上提供15倍的速度提升,和 Jetson AGX Orin上的5.6倍内存占用减少。值得一提的是,PockEngine可以在NVIDIA Jetson AGX Orin上进行LLaMav2-7B的精度调整,每秒550个token,比PyTorch的7.9倍快。”

Making the End-User a Priority in Benchmarking: OrionBench for Unsupervised Time Series Anomaly Detection

  • paper_url: http://arxiv.org/abs/2310.17748
  • repo_url: https://github.com/sintel-dev/orion
  • paper_authors: Sarah Alnegheimish, Laure Berti-Equille, Kalyan Veeramachaneni
  • for: 这个研究是为了提供一个用户中心的时间序列异常检测 benchmark,并且可以持续更新和维护。
  • methods: 这个研究使用了深度学习基于的时间序列异常检测方法,并且提供了一个名为 OrionBench 的框架,以便实现模型的比较和评估。
  • results: 研究发现,透过 OrionBench 的持续更新和维护,可以实现较好的时间序列异常检测性能,并且可以适应不同的应用领域和资料集。
    Abstract Time series anomaly detection is a prevalent problem in many application domains such as patient monitoring in healthcare, forecasting in finance, or predictive maintenance in energy. This has led to the emergence of a plethora of anomaly detection methods, including more recently, deep learning based methods. Although several benchmarks have been proposed to compare newly developed models, they usually rely on one-time execution over a limited set of datasets and the comparison is restricted to a few models. We propose OrionBench -- a user centric continuously maintained benchmark for unsupervised time series anomaly detection. The framework provides universal abstractions to represent models, extensibility to add new pipelines and datasets, hyperparameter standardization, pipeline verification, and frequent releases with published benchmarks. We demonstrate the usage of OrionBench, and the progression of pipelines across 15 releases published over the course of three years. Moreover, we walk through two real scenarios we experienced with OrionBench that highlight the importance of continuous benchmarks in unsupervised time series anomaly detection.
    摘要 时间序列异常检测是许多应用领域中的普遍问题,如医疗领域的患者监测、金融领域的预测或能源领域的预测性维护。这促使大量异常检测方法涌现,其中包括近年来基于深度学习的方法。虽然已有若干用于比较新模型的基准,但它们通常只在有限的数据集上一次性执行,且比较局限于少数模型。我们提出了 OrionBench——一个以用户为中心、持续维护的无监督时间序列异常检测基准。该框架提供了表示模型的通用抽象、添加新流水线和数据集的扩展性、超参数标准化、流水线验证以及附带已发布基准结果的定期发布。我们介绍了 OrionBench 的使用方法,以及三年间发布的 15 个版本中流水线的演进,并通过两个实际场景说明了持续基准在无监督时间序列异常检测中的重要性。

BERT-PIN: A BERT-based Framework for Recovering Missing Data Segments in Time-series Load Profiles

  • paper_url: http://arxiv.org/abs/2310.17742
  • repo_url: None
  • paper_authors: Yi Hu, Kai Ye, Hyeonjin Kim, Ning Lu
  • for: 本研究旨在提出一种基于Transformers模型的Profile Inpainting Network(BERT-PIN),用于多个缺失数据段(MDS)的恢复。
  • methods: 该模型使用了Transformers模型结构,对载重和温度 profiles进行分割,并将每个分割段视为一个单词,整个profile视为一个句子。模型还包括一个顶尖选择过程,以生成多个可能性 Distributions,用于表示不同的自信水平。
  • results: 实验结果显示,BERT-PIN在精度方面表现优于现有方法,同时能够在更长的窗口内恢复多个MDS。此外,BERT-PIN可以作为预训练模型,进行多个下游任务的调整,如分类和超解像。
    Abstract Inspired by the success of the Transformer model in natural language processing and computer vision, this paper introduces BERT-PIN, a Bidirectional Encoder Representations from Transformers (BERT) powered Profile Inpainting Network. BERT-PIN recovers multiple missing data segments (MDSs) using load and temperature time-series profiles as inputs. To adopt a standard Transformer model structure for profile inpainting, we segment the load and temperature profiles into line segments, treating each segment as a word and the entire profile as a sentence. We incorporate a top candidates selection process in BERT-PIN, enabling it to produce a sequence of probability distributions, based on which users can generate multiple plausible imputed data sets, each reflecting different confidence levels. We develop and evaluate BERT-PIN using real-world dataset for two applications: multiple MDSs recovery and demand response baseline estimation. Simulation results show that BERT-PIN outperforms the existing methods in accuracy while is capable of restoring multiple MDSs within a longer window. BERT-PIN, served as a pre-trained model, can be fine-tuned for conducting many downstream tasks, such as classification and super resolution.
    摘要 受到自然语言处理和计算机视觉成功的启发,本文介绍BERT-PIN,一种基于变换器模型的 Profile Inpainting Network。BERT-PIN使用加载和温度时间序列profile作为输入,恢复多个缺失数据段(MDS)。为采用标准变换器模型结构来进行profile填充,我们将加载和温度profile分割成线段,每个线段当作一个单词,整个profile当作一个句子。我们在BERT-PIN中引入了顶部选择过程,使其能生成一个序列的概率分布,基于这些概率分布,用户可以生成多个可能的填充数据集,每个数据集都反映了不同的信任水平。我们开发了BERT-PIN并对实际数据进行评估,用于两个应用:多个MDS恢复和需求回应基线估算。实验结果表明,BERT-PIN在准确性方面超过现有方法,同时能够在较长的窗口内恢复多个MDS。BERT-PIN作为预训练模型,可以进行许多下游任务的 fine-tuning,如分类和超解析。
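The tokenization analogy in the abstract — segments of a load profile as "words", the whole profile as a "sentence", with missing data segments acting as masked tokens — can be sketched as below; the segment length and the NaN-based masking are illustrative choices, not the paper's exact preprocessing.

```python
import numpy as np

def profile_to_tokens(profile: np.ndarray, seg_len: int = 8):
    """Split a 1-D profile into fixed-length segments ('words'); segments that
    contain NaNs are flagged as masked tokens to be reconstructed."""
    n = len(profile) // seg_len
    segments = profile[: n * seg_len].reshape(n, seg_len)
    masked = np.isnan(segments).any(axis=1)
    return segments, masked

load = np.sin(np.linspace(0, 6 * np.pi, 96)) + 1.0   # one day at 15-minute resolution
load[40:56] = np.nan                                  # a missing data segment (MDS)
segments, masked = profile_to_tokens(load)
print(segments.shape, masked.astype(int))             # (12, 8) and the mask over "words"
```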

GNN-GMVO: Graph Neural Networks for Optimizing Gross Merchandise Value in Similar Item Recommendation

  • paper_url: http://arxiv.org/abs/2310.17732
  • repo_url: None
  • paper_authors: Ramin Giahi, Reza Yousefi Maragheh, Nima Farrokhsiar, Jianpeng Xu, Jason Cho, Evren Korpeoglu, Sushant Kumar, Kannan Achan
  • for: 提高电商平台的相似商品推荐精度和销售额(GMV)。
  • methods: 使用Graph Neural Networks(GNN)模型,直接优化GMV目标函数,并提出自定义边构建方法来缓解各种商品之间复杂的关系。
  • results: 在三个真实世界数据集上进行了广泛的实验,与选择的先进Reference模型相比,模型的预测性能和预期的GMV均较高。
    Abstract Similar item recommendation is a critical task in the e-Commerce industry, which helps customers explore similar and relevant alternatives based on their interested products. Despite the traditional machine learning models, Graph Neural Networks (GNNs), by design, can understand complex relations like similarity between products. However, in contrast to their wide usage in retrieval tasks and their focus on optimizing the relevance, the current GNN architectures are not tailored toward maximizing revenue-related objectives such as Gross Merchandise Value (GMV), which is one of the major business metrics for e-Commerce companies. In addition, defining accurate edge relations in GNNs is non-trivial in large-scale e-Commerce systems, due to the heterogeneity nature of the item-item relationships. This work aims to address these issues by designing a new GNN architecture called GNN-GMVO (Graph Neural Network - Gross Merchandise Value Optimizer). This model directly optimizes GMV while considering the complex relations between items. In addition, we propose a customized edge construction method to tailor the model toward similar item recommendation task and alleviate the noisy and complex item-item relations. In our comprehensive experiments on three real-world datasets, we show higher prediction performance and expected GMV for top ranked items recommended by our model when compared with selected state-of-the-art benchmark models.
    摘要 Traditional machine learning models have been used for similar item recommendation in the e-commerce industry, but Graph Neural Networks (GNNs) can better understand complex product relationships. However, current GNN architectures are not optimized for revenue-related objectives such as Gross Merchandise Value (GMV), which is a key metric for e-commerce companies. In addition, defining accurate edge relations in GNNs can be challenging in large-scale e-commerce systems due to the complexity of item-item relationships. To address these issues, we propose a new GNN architecture called GNN-GMVO (Graph Neural Network - Gross Merchandise Value Optimizer) that directly optimizes GMV while considering complex item relationships. We also propose a customized edge construction method to tailor the model for similar item recommendation and alleviate noisy item-item relations. In our comprehensive experiments on three real-world datasets, we show that our model outperforms selected state-of-the-art benchmark models in terms of prediction performance and expected GMV for top-ranked items.

Unifying (Quantum) Statistical and Parametrized (Quantum) Algorithms

  • paper_url: http://arxiv.org/abs/2310.17716
  • repo_url: None
  • paper_authors: Alexander Nietner
  • for: 本研究探讨了机器学习算法在量子学习中的一致性问题,提出了一种基于统计学和参数化学习的共同视角。
  • methods: 本研究使用了KEARNS的统计查询(SQ)oracle和VALIANT的弱评估 oracle(WEAK),并开发了一种扩展性强大且直观的框架,用于学习从评估查询中获得函数值估计。
  • results: 本研究实现了将传统机器学习算法中的学习问题转化为量子学习问题,并提出了新的下界性质和函数类型的学习复杂性问题。此外,研究还对一些受欢迎的量子机器学习(QML)设定进行分析,从而获得了对这些任务的更深刻的理解。
    Abstract Kearns' statistical query (SQ) oracle (STOC'93) lends a unifying perspective for most classical machine learning algorithms. This ceases to be true in quantum learning, where many settings do not admit, neither an SQ analog nor a quantum statistical query (QSQ) analog. In this work, we take inspiration from Kearns' SQ oracle and Valiant's weak evaluation oracle (TOCT'14) and establish a unified perspective bridging the statistical and parametrized learning paradigms in a novel way. We explore the problem of learning from an evaluation oracle, which provides an estimate of function values, and introduce an extensive yet intuitive framework that yields unconditional lower bounds for learning from evaluation queries and characterizes the query complexity for learning linear function classes. The framework is directly applicable to the QSQ setting and virtually all algorithms based on loss function optimization. Our first application is to extend prior results on the learnability of output distributions of quantum circuits and Clifford unitaries from the SQ to the (multi-copy) QSQ setting, implying exponential separations between learning stabilizer states from (multi-copy) QSQs versus from quantum samples. Our second application is to analyze some popular quantum machine learning (QML) settings. We gain an intuitive picture of the hardness of many QML tasks which goes beyond existing methods such as barren plateaus and the statistical dimension, and contains crucial setting-dependent implications. Our framework not only unifies the perspective of cost concentration with that of the statistical dimension in a unified language but exposes their connectedness and similarity.
    摘要 凯尔恩斯(Kearns)的统计查询(SQ)oracle(STOC'93)为经典机器学习算法提供了一种统一的视角。然而,在量子学习中,许多设定并不允许SQ或量子统计查询(QSQ)的类比。在这项工作中,我们从凯尔恩斯的SQ oracle和瓦利安特(Valiant)的弱评估oracle(TOCT'14)中启发灵感,并建立了一种统一的视角,用于将统计学和参数化学习范文联系起来。我们研究从评估 oracle 中学习函数值的问题,并引入了一个易于理解的框架,以获得无条件下界 для学习从评估查询和Characterizing查询复杂度。这个框架直接适用于QSQ设定,并且可以应用于大多数基于损失函数优化的算法。我们的第一个应用是将先前关于量子环 Circuits 和 Clifford 单位的学习可能性从SQ扩展到(多 копи)QSQ设定,从而得到对学习稳定状态的 exponential 分离。我们的第二个应用是分析一些流行的量子机器学习(QML)设定。我们获得了一种具有INTUITIVE 的图像,这些图像超出了现有的方法,如阻挡板和统计维度,并且包含了关键的设定 dependent 影响。我们的框架不仅将cost concentration 和统计维度的视角统一起来,还暴露了它们之间的连接和相似性。

Community Detection and Classification Guarantees Using Embeddings Learned by Node2Vec

  • paper_url: http://arxiv.org/abs/2310.17712
  • repo_url: None
  • paper_authors: Andrew Davison, S. Carlyle Morgan, Owen G. Ward
  • for: 本研究旨在探讨node2vec算法学习的维度下的网络节点嵌入的理论性质。
  • methods: 本研究使用node2vec算法学习网络节点的嵌入,并使用k-means聚类算法来恢复网络中的社群。
  • results: 研究结果表明,使用node2vec算法学习的嵌入 vectors 可以带来网络中节点的弱相关社群恢复,并且可以用于网络节点和链接预测任务。
    Abstract Embedding the nodes of a large network into an Euclidean space is a common objective in modern machine learning, with a variety of tools available. These embeddings can then be used as features for tasks such as community detection/node clustering or link prediction, where they achieve state of the art performance. With the exception of spectral clustering methods, there is little theoretical understanding for other commonly used approaches to learning embeddings. In this work we examine the theoretical properties of the embeddings learned by node2vec. Our main result shows that the use of k-means clustering on the embedding vectors produced by node2vec gives weakly consistent community recovery for the nodes in (degree corrected) stochastic block models. We also discuss the use of these embeddings for node and link prediction tasks. We demonstrate this result empirically, and examine how this relates to other embedding tools for network data.
    摘要 将大型网络的节点嵌入到欧几里得空间是现代机器学习中的常见目标,已有多种工具可用。这些嵌入可以作为特征用于社区检测/节点聚类或链接预测等任务,并在其中取得最先进的性能。除谱聚类方法外,其他常用的嵌入学习方法几乎没有理论上的理解。在这项工作中,我们研究 node2vec 所学嵌入的理论性质。我们的主要结果表明,对 node2vec 生成的嵌入向量使用 k-means 聚类,可以在(度修正)随机块模型中对节点实现弱一致的社区恢复。我们还讨论了这些嵌入在节点和链接预测任务中的应用,并通过实验验证这一结果,同时探讨它与其他网络数据嵌入工具之间的关系。
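The headline result — k-means on node embeddings recovering planted communities — can be reproduced approximately with a simplified pipeline. For brevity the walks below are uniform (DeepWalk-style) rather than node2vec's biased walks with return/in-out parameters, so this is a simplified stand-in rather than the exact setting analyzed in the paper.

```python
import networkx as nx
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

def random_walks(G, num_walks=10, walk_length=20, seed=0):
    """Uniform random walks; node2vec additionally biases steps with its p/q parameters."""
    rng = np.random.default_rng(seed)
    walks = []
    for _ in range(num_walks):
        for start in G.nodes:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(nbrs[rng.integers(len(nbrs))])
            walks.append([str(v) for v in walk])
    return walks

# three planted communities of 30 nodes each
G = nx.planted_partition_graph(3, 30, p_in=0.3, p_out=0.02, seed=1)
emb = Word2Vec(random_walks(G), vector_size=32, window=5, min_count=0, sg=1, seed=1)
X = np.array([emb.wv[str(v)] for v in G.nodes])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels.reshape(3, 30))   # each row (planted block) should map to one cluster
```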

High-Dimensional Prediction for Sequential Decision Making

  • paper_url: http://arxiv.org/abs/2310.17651
  • repo_url: None
  • paper_authors: Georgy Noarov, Ramya Ramalingam, Aaron Roth, Stephan Xie
  • For: The paper is written for solving the problem of making predictions of an adversarially chosen high-dimensional state that are unbiased subject to an arbitrary collection of conditioning events.
  • Methods: The paper presents efficient algorithms for solving this problem, as well as a number of applications that stem from choosing an appropriate set of conditioning events.
  • Results: The paper achieves efficient no-subsequence-regret algorithms in extensive-form games (EFGs), yielding a new family of regret guarantees for EFGs that generalizes some existing EFG regret notions. Additionally, the paper develops a novel transparent alternative to conformal prediction for building valid online adversarial multiclass prediction sets, which implies strong conditional validity guarantees and improved loss compared to any collection of benchmark models.
    Abstract We study the problem of making predictions of an adversarially chosen high-dimensional state that are unbiased subject to an arbitrary collection of conditioning events, with the goal of tailoring these events to downstream decision makers. We give efficient algorithms for solving this problem, as well as a number of applications that stem from choosing an appropriate set of conditioning events. For example, we can efficiently make predictions targeted at polynomially many decision makers, giving each of them optimal swap regret if they best-respond to our predictions. We generalize this to online combinatorial optimization, where the decision makers have a very large action space, to give the first algorithms offering polynomially many decision makers no regret on polynomially many subsequences that may depend on their actions and the context. We apply these results to get efficient no-subsequence-regret algorithms in extensive-form games (EFGs), yielding a new family of regret guarantees for EFGs that generalizes some existing EFG regret notions, e.g. regret to informed causal deviations, and is generally incomparable to other known such notions. Next, we develop a novel transparent alternative to conformal prediction for building valid online adversarial multiclass prediction sets. We produce class scores that downstream algorithms can use for producing valid-coverage prediction sets, as if these scores were the true conditional class probabilities. We show this implies strong conditional validity guarantees including set-size-conditional and multigroup-fair coverage for polynomially many downstream prediction sets. Moreover, our class scores can be guaranteed to have improved $L_2$ loss, cross-entropy loss, and generally any Bregman loss, compared to any collection of benchmark models, yielding a high-dimensional real-valued version of omniprediction.
    摘要 我们研究一个在抗对抗选择高维状态的预测问题,其目标是在一个任意的集合conditioning事件下,得到不偏的预测。我们提供高效的算法来解决这个问题,以及一些来自于conditioning事件选择的应用。例如,我们可以高效地对多个决策者进行预测,并为每个决策者提供优化的交换 regret。我们扩展这些结果到在线 combinatorial optimization 中,使得决策者有very large action space,并给出了第一个不偏多个决策者的no regret算法。我们应用这些结果来得到高效的无序序 regret算法,并在extensive-form games(EFGs)中应用这些结果,得到了一个新的 regret guarantee family for EFGs,这个家族包括一些现有的 EFG regret notions,例如 regret to informed causal deviations,并且与其他已知的notions不可比较。接下来,我们开发了一种新的透明的alternative to conformal prediction for building valid online adversarial multiclass prediction sets。我们生成的class scores可以用于生成有效覆盖的预测集,就好像这些 scores 是真实的 conditional class probabilities。我们证明这些 guarantees 包括 set-size-conditional 和 multigroup-fair coverage for polynomially many downstream prediction sets。此外,我们的class scores可以保证与任何集合of benchmark models相比,有 improved $L_2$ loss, cross-entropy loss, 和generally any Bregman loss,这些 guarantees 是一种高维实数valued的 omniprediction。

Counterfactual Fairness for Predictions using Generative Adversarial Networks

  • paper_url: http://arxiv.org/abs/2310.17687
  • repo_url: None
  • paper_authors: Yuchen Ma, Dennis Frauen, Valentyn Melnychuk, Stefan Feuerriegel
  • for: 这篇论文是为了实现对不同敏感特征下的预测的公平性而写的。
  • methods: 该论文提出了一种基于生成对抗网络的新型深度神经网络模型,称为生成对抗公平网络(GCFN),以实现对不同敏感特征下的预测。Specifically, it leverages a tailored generative adversarial network to directly learn the counterfactual distribution of the descendants of the sensitive attribute, which is then used to enforce fair predictions through a novel counterfactual mediator regularization.
  • results: 该方法在多个实验中达到了状态之最的性能。在一个真实的案例研究中,它还能够在实际中做出有意义的预测。
    Abstract Fairness in predictions is of direct importance in practice due to legal, ethical, and societal reasons. It is often achieved through counterfactual fairness, which ensures that the prediction for an individual is the same as that in a counterfactual world under a different sensitive attribute. However, achieving counterfactual fairness is challenging as counterfactuals are unobservable. In this paper, we develop a novel deep neural network called Generative Counterfactual Fairness Network (GCFN) for making predictions under counterfactual fairness. Specifically, we leverage a tailored generative adversarial network to directly learn the counterfactual distribution of the descendants of the sensitive attribute, which we then use to enforce fair predictions through a novel counterfactual mediator regularization. If the counterfactual distribution is learned sufficiently well, our method is mathematically guaranteed to ensure the notion of counterfactual fairness. Thereby, our GCFN addresses key shortcomings of existing baselines that are based on inferring latent variables, yet which (a) are potentially correlated with the sensitive attributes and thus lead to bias, and (b) have weak capability in constructing latent representations and thus low prediction performance. Across various experiments, our method achieves state-of-the-art performance. Using a real-world case study from recidivism prediction, we further demonstrate that our method makes meaningful predictions in practice.
    摘要 法律、伦理和社会因素的直接重要性使得预测中的公平性成为了实践中的核心问题。为了实现这一目标,我们通常采用对假性公平性,即预测结果对于具有不同敏感特征的个体是否相同。然而,实现对假性公平性是困难的,因为对假的世界是不可见的。在这篇论文中,我们开发了一种新的深度神经网络 called Generative Counterfactual Fairness Network (GCFN),用于在对假性公平性下进行预测。我们特制了一个适应性的生成对抗网络,以直接学习敏感特征的后代的对假分布,然后使用一种新的对假mediator正则化来实现公平预测。如果对假分布被学习得足够好,我们的方法可以在数学上保证对假性公平性的概念。因此,我们的GCFN可以解决现有基准的缺陷,即基于推断 latent variables 的方法可能受敏感特征的干扰,导致偏见,同时 latent representation 的构建能力强度不足,预测性能低下。在多个实验中,我们的方法实现了状态机器人的表现。使用一个实际案例研究,我们进一步证明了我们的方法在实践中可以提供有意义的预测。

Do Graph Neural Networks Dream of Landau Damping? Insights from Kinetic Simulations of a Plasma Sheet Model

  • paper_url: http://arxiv.org/abs/2310.17646
  • repo_url: None
  • paper_authors: Diogo D Carvalho, Diogo R Ferreira, Luis O Silva
  • for: 本研究的目的是用图神经网络模拟器替代等离子体物理动理学模拟器。
  • methods: 我们使用图神经网络模拟器,因为其消息传递更新机制与传统物理求解器的更新机制相似,并且可以将已知的物理先验强制嵌入图的构建与更新中。
  • results: 我们的模型学习了一维等离子体模型的动理学,包括等离子体热化、热平衡附近的静电涨落、快速薄片受到的阻力以及朗道阻尼。我们与原始等离子体模型进行了比较,评估了运行时间、守恒定律和关键物理量的时间演化。
    Abstract We explore the possibility of fully replacing a plasma physics kinetic simulator with a graph neural network-based simulator. We focus on this class of surrogate models given the similarity between their message-passing update mechanism and the traditional physics solver update, and the possibility of enforcing known physical priors into the graph construction and update. We show that our model learns the kinetic plasma dynamics of the one-dimensional plasma model, a predecessor of contemporary kinetic plasma simulation codes, and recovers a wide range of well-known kinetic plasma processes, including plasma thermalization, electrostatic fluctuations about thermal equilibrium, and the drag on a fast sheet and Landau damping. We compare the performance against the original plasma model in terms of run-time, conservation laws, and temporal evolution of key physical quantities. The limitations of the model are presented and possible directions for higher-dimensional surrogate models for kinetic plasmas are discussed.
    摘要 我们探讨用基于图神经网络的模拟器完全替代等离子体物理动理学模拟器的可能性。我们关注这类代理模型,因为其消息传递更新机制与传统物理求解器的更新机制相似,并且可以在图的构建与更新中强制施加已知的物理先验。我们证明,我们的模型学习了一维等离子体模型(当代动理学等离子体模拟代码的前身)的动理学,并重现了许多著名的动理学等离子体过程,包括等离子体热化、热平衡附近的静电涨落、快速薄片受到的阻力以及朗道阻尼。我们从运行时间、守恒定律和关键物理量的时间演化等方面将其与原始等离子体模型进行了比较,并讨论了模型的局限性以及面向更高维动理学等离子体的代理模型的可能方向。

Where you go is who you are – A study on machine learning based semantic privacy attacks

  • paper_url: http://arxiv.org/abs/2310.17643
  • repo_url: https://github.com/mie-lab/trip_purpose_privacy
  • paper_authors: Nina Wiedemann, Ourania Kounadi, Martin Raubal, Krzysztof Janowicz
  • For: The paper investigates the risk of privacy loss due to machine learning-based attacks on raw location data, even with inaccuracies in the data.
  • Methods: The paper presents two attack scenarios, location categorization and user profiling, and conducts experiments on the Foursquare dataset and tracking data to demonstrate the potential for abuse of high-quality spatial information.
  • Results: The paper finds that with location obfuscation of more than 1 km, spatial information hardly adds any value, but a high privacy risk solely from temporal information remains. The availability of public context data such as POIs plays a key role in inference based on spatial information.
    Abstract Concerns about data privacy are omnipresent, given the increasing usage of digital applications and their underlying business model that includes selling user data. Location data is particularly sensitive since they allow us to infer activity patterns and interests of users, e.g., by categorizing visited locations based on nearby points of interest (POI). On top of that, machine learning methods provide new powerful tools to interpret big data. In light of these considerations, we raise the following question: What is the actual risk that realistic, machine learning based privacy attacks can obtain meaningful semantic information from raw location data, subject to inaccuracies in the data? In response, we present a systematic analysis of two attack scenarios, namely location categorization and user profiling. Experiments on the Foursquare dataset and tracking data demonstrate the potential for abuse of high-quality spatial information, leading to a significant privacy loss even with location inaccuracy of up to 200m. With location obfuscation of more than 1 km, spatial information hardly adds any value, but a high privacy risk solely from temporal information remains. The availability of public context data such as POIs plays a key role in inference based on spatial information. Our findings point out the risks of ever-growing databases of tracking data and spatial context data, which policymakers should consider for privacy regulations, and which could guide individuals in their personal location protection measures.
    摘要 关于数据隐私的担忧是不断增长的,由于人们的数字应用程序使用的不断增加,以及这些应用程序的深层次商业模式,包括卖出用户数据。位置数据尤为敏感,因为它们可以让我们推断用户的活动模式和兴趣,例如,通过将访问的位置分类为基于附近的点位 Interest (POI)。此外,机器学习方法提供了新的强大工具来解读大数据。在考虑这些因素后,我们提出以下问题:真实的风险是,通过实际的机器学习基于隐私攻击,从原始的位置数据中获得有意义的semantic信息,并且在数据不准确的情况下。为回答这个问题,我们提供了两种攻击场景的系统性分析:位置分类和用户 profiling。实验结果表明,使用Foursquare数据集和跟踪数据,高质量的空间信息可以导致重大隐私损失,即使位置数据不准确到200m。而对位置减震超过1公里,空间信息几乎无价值,但高privacy风险仅凭 temporal信息。公共上下文数据,如POI,在推理基于空间信息中扮演关键角色。我们的发现指出了跟踪数据和空间上下文数据的总量的增长,这些数据应该被政策者考虑,以保护个人隐私。同时,这些发现也可以引导个人在个人位置保护措施方面做出决策。
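The location-categorization attack described above can be illustrated in a few lines: label every visited point with the category of the nearest public POI within a distance threshold. The haversine distance, the 200 m threshold, and the toy POI list are illustrative assumptions, not the study's actual attack model.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2.0 * 6371.0 * np.arcsin(np.sqrt(a))

def categorize_visit(lat, lon, pois, max_km=0.2):
    """Assign the category of the nearest POI if it lies within max_km, else 'unknown'."""
    dists = [haversine_km(lat, lon, p["lat"], p["lon"]) for p in pois]
    i = int(np.argmin(dists))
    return pois[i]["category"] if dists[i] <= max_km else "unknown"

pois = [
    {"lat": 47.3769, "lon": 8.5417, "category": "university"},
    {"lat": 47.3768, "lon": 8.5312, "category": "train station"},
]
print(categorize_visit(47.3771, 8.5420, pois))   # -> 'university'
```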

Generative Fractional Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.17638
  • repo_url: None
  • paper_authors: Gabriel Nobis, Marco Aversa, Maximilian Springenberg, Michael Detzel, Stefano Ermon, Shinichi Nakajima, Roderick Murray-Smith, Sebastian Lapuschkin, Christoph Knochenhauer, Luis Oala, Wojciech Samek
  • for: 这个论文是为了扩展基于分布型生成模型的连续时间框架,从底层布朗运动(BM)到一种近似的费米布朗运动(FBM)的准确的扩展。
  • methods: 这篇论文使用了一种连续映射技巧和反时间模型,将 FBM 表示为一种家族的奥尔尼采beck过程的积分,定义了一种生成分数演化模型(GFDM),其驱动噪声演化到一种非马普朗过程,具有无限 quadratic variation。
  • results: 这篇论文的结果表明,通过控制噪声的杂乱程度(Hurst指数 $H\in(0,1)$),GFDM 可以生成具有不同粒度的分布转换路径。这是根据我们所知,第一篇基于杂乱过程的生成模型。
    Abstract We generalize the continuous time framework for score-based generative models from an underlying Brownian motion (BM) to an approximation of fractional Brownian motion (FBM). We derive a continuous reparameterization trick and the reverse time model by representing FBM as a stochastic integral over a family of Ornstein-Uhlenbeck processes to define generative fractional diffusion models (GFDM) with driving noise converging to a non-Markovian process of infinite quadratic variation. The Hurst index $H\in(0,1)$ of FBM enables control of the roughness of the distribution transforming path. To the best of our knowledge, this is the first attempt to build a generative model upon a stochastic process with infinite quadratic variation.
    摘要 我们将基于分数的生成模型的连续时间框架从底层的布朗运动(BM)推广到分数布朗运动(FBM)的一种近似。我们通过将 FBM 表示为一族奥恩斯坦-乌伦贝克过程上的随机积分,推导出连续的重参数化技巧和逆时模型,从而定义了生成分数扩散模型(GFDM),其驱动噪声收敛到具有无限二次变差的非马尔可夫过程。FBM 的 Hurst 指数 $H\in(0,1)$ 可以控制分布变换路径的粗糙程度。据我们所知,这是首次尝试在具有无限二次变差的随机过程之上构建生成模型。
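For orientation, the generic Markovian-approximation idea behind representing (an approximation of) fractional Brownian motion through Ornstein-Uhlenbeck processes can be written schematically as below; the specific speeds, weights, and kernel used by GFDM may differ, so this is only a sketch of the general construction, not the paper's exact model.

```latex
% Schematic only: approximate the (Riemann--Liouville-type) fBM kernel by a
% finite mixture of exponentials, each contributed by an OU process.
\[
  dY^{\gamma_k}_t = -\gamma_k\, Y^{\gamma_k}_t\, dt + dB_t, \qquad k = 1,\dots,K,
\]
\[
  B^{H}_t \;\approx\; \sum_{k=1}^{K} \omega_k\, Y^{\gamma_k}_t,
  \qquad \text{with } \sum_{k=1}^{K} \omega_k\, e^{-\gamma_k (t-s)} \;\approx\; (t-s)^{H-\frac12}.
\]
```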

Combating Representation Learning Disparity with Geometric Harmonization

  • paper_url: http://arxiv.org/abs/2310.17622
  • repo_url: https://github.com/MediaBrain-SJTU/Geometric-Harmonization
  • paper_authors: Zhihan Zhou, Jiangchao Yao, Feng Hong, Ya Zhang, Bo Han, Yanfeng Wang
  • for: 提高自监督学习中表示学习在长尾分布下的均衡性与鲁棒性。
  • methods: 提出了一种名为几何调和(Geometric Harmonization,GH)的新方法,通过度量自监督学习嵌入空间的总体统计量,进而进行细粒度的实例级校准,以约束头部类的空间扩张并避免尾部类的被动塌缩。
  • results: 在多个基准数据集上进行了广泛的实验,证明 GH 能够很好地适应长尾分布,并对分布偏斜具有较高的容忍度。
    Abstract Self-supervised learning (SSL) as an effective paradigm of representation learning has achieved tremendous success on various curated datasets in diverse scenarios. Nevertheless, when facing the long-tailed distribution in real-world applications, it is still hard for existing methods to capture transferable and robust representation. Conventional SSL methods, pursuing sample-level uniformity, easily leads to representation learning disparity where head classes dominate the feature regime but tail classes passively collapse. To address this problem, we propose a novel Geometric Harmonization (GH) method to encourage category-level uniformity in representation learning, which is more benign to the minority and almost does not hurt the majority under long-tailed distribution. Specially, GH measures the population statistics of the embedding space on top of self-supervised learning, and then infer an fine-grained instance-wise calibration to constrain the space expansion of head classes and avoid the passive collapse of tail classes. Our proposal does not alter the setting of SSL and can be easily integrated into existing methods in a low-cost manner. Extensive results on a range of benchmark datasets show the effectiveness of GH with high tolerance to the distribution skewness. Our code is available at https://github.com/MediaBrain-SJTU/Geometric-Harmonization.
    摘要 自监督学习（SSL）作为一种有效的表示学习范式，已在多种精选数据集和不同场景中取得了巨大成功。然而，面对现实应用中的长尾分布，现有方法仍难以学习到可迁移且稳健的表示。传统的 SSL 方法追求样本级别的均匀性，容易导致表示学习失衡：头部类占据特征空间，而尾部类被动坍缩。为解决这一问题，我们提出了一种新的几何调和（Geometric Harmonization, GH）方法，以促进表示学习中的类别级均匀性；该方法在长尾分布下对少数类更友好，且几乎不损害多数类。具体而言，GH 在自监督学习的基础上度量嵌入空间的总体统计量，进而推断细粒度的逐实例校准，以约束头部类的空间扩张并避免尾部类的被动坍缩。我们的方法不改变 SSL 的设置，可以以低成本方式轻松集成到现有方法中。在一系列基准数据集上的大量实验结果表明，GH 十分有效，且对分布偏斜具有很高的容忍度。我们的代码可在 https://github.com/MediaBrain-SJTU/Geometric-Harmonization 获取。

A qualitative difference between gradient flows of convex functions in finite- and infinite-dimensional Hilbert spaces

  • paper_url: http://arxiv.org/abs/2310.17610
  • repo_url: None
  • paper_authors: Jonathan W. Siegel, Stephan Wojtowytsch
  • for: 这个论文主要研究了梯度流和梯度降低算法在凸函数空间中的优化问题。
  • methods: 这篇论文使用了梯度流和梯度降低算法来解决凸函数空间中的优化问题。
  • results: 这篇论文得到了以下结果：在梯度流情形下，如果函数 $f$ 没有最小值点，则 $f(x_t) \to \inf f$ 的收敛速度可能任意慢；如果函数 $f$ 存在最小值点，则多余能量 $f(x_t) - \inf f$ 关于时间是可积/可求和的，并且满足 $f(x_t) - \inf f = o(1/t)$。此外，这篇论文还证明了在希尔伯特空间中这一结果是最优的，即多余能量 $f(x_t) - \inf f$ 衰减到 $0$ 的速度可以任意慢；而在有限维情形下这一结果并非最优。
    Abstract We consider gradient flow/gradient descent and heavy ball/accelerated gradient descent optimization for convex objective functions. In the gradient flow case, we prove the following: 1. If $f$ does not have a minimizer, the convergence $f(x_t)\to \inf f$ can be arbitrarily slow. 2. If $f$ does have a minimizer, the excess energy $f(x_t) - \inf f$ is integrable/summable in time. In particular, $f(x_t) - \inf f = o(1/t)$ as $t\to\infty$. 3. In Hilbert spaces, this is optimal: $f(x_t) - \inf f$ can decay to $0$ as slowly as any given function which is monotone decreasing and integrable at $\infty$, even for a fixed quadratic objective. 4. In finite dimension (or more generally, for all gradient flow curves of finite length), this is not optimal: We prove that there are convex monotone decreasing integrable functions $g(t)$ which decrease to zero slower than $f(x_t)-\inf f$ for the gradient flow of any convex function on $\mathbb R^d$. For instance, we show that any gradient flow $x_t$ of a convex function $f$ in finite dimension satisfies $\liminf_{t\to\infty} \big(t\cdot \log^2(t)\cdot \big\{f(x_t) -\inf f\big\}\big)=0$. This improves on the commonly reported $O(1/t)$ rate and provides a sharp characterization of the energy decay law. We also note that it is impossible to establish a rate $O(1/(t\phi(t))$ for any function $\phi$ which satisfies $\lim_{t\to\infty}\phi(t) = \infty$, even asymptotically. Similar results are obtained in related settings for (1) discrete time gradient descent, (2) stochastic gradient descent with multiplicative noise and (3) the heavy ball ODE. In the case of stochastic gradient descent, the summability of $\mathbb E[f(x_n) - \inf f]$ is used to prove that $f(x_n)\to \inf f$ almost surely - an improvement on the convergence almost surely up to a subsequence which follows from the $O(1/n)$ decay estimate.
    摘要 我们研究凸目标函数的梯度流/梯度下降以及重球/加速梯度下降优化。在梯度流情形下，我们证明了以下结论：1. 若 $f$ 没有最小值点，则 $f(x_t)\to \inf f$ 的收敛速度可以任意慢。2. 若 $f$ 存在最小值点，则多余能量 $f(x_t) - \inf f$ 关于时间是可积/可求和的；特别地，当 $t\to\infty$ 时 $f(x_t) - \inf f = o(1/t)$。3. 在希尔伯特空间中这一结果是最优的：即使对固定的二次目标函数，$f(x_t) - \inf f$ 衰减到 $0$ 的速度也可以慢于任意给定的单调递减且在 $\infty$ 处可积的函数。4. 在有限维（或更一般地，对所有长度有限的梯度流曲线）中这一结果并非最优：我们证明存在凸的单调递减可积函数 $g(t)$，其趋于零的速度慢于 $\mathbb{R}^d$ 上任意凸函数梯度流的 $f(x_t)-\inf f$。例如，我们证明有限维中任意凸函数 $f$ 的梯度流 $x_t$ 满足 $\liminf_{t\to\infty} \big(t\cdot \log^2(t)\cdot \big\{f(x_t) -\inf f\big\}\big)=0$。这改进了通常报道的 $O(1/t)$ 速率，并给出了能量衰减规律的精确刻画。我们还指出，对任何满足 $\lim_{t\to\infty}\phi(t) = \infty$ 的函数 $\phi$，即使在渐近意义下也不可能建立 $O(1/(t\phi(t)))$ 的速率。在相关设置中我们得到了类似的结果：(1) 离散时间梯度下降，(2) 具有乘性噪声的随机梯度下降，(3) 重球常微分方程。在随机梯度下降情形下，利用 $\mathbb{E}[f(x_n) - \inf f]$ 的可求和性可以证明 $f(x_n)\to \inf f$ 几乎必然成立，这改进了由 $O(1/n)$ 衰减估计只能得到的沿子列几乎必然收敛的结果。

A minimax optimal control approach for robust neural ODEs

  • paper_url: http://arxiv.org/abs/2310.17584
  • repo_url: None
  • paper_authors: Cristina Cipriani, Alessandro Scagliotti, Tobias Wöhrer
  • for: 本研究旨在从鲁棒控制的角度对神经 ODE 进行对抗训练，以提升神经网络在输入扰动下输出的可靠性。
  • methods: 本研究借助控制理论的工具研究对抗训练：神经 ODE 允许将深度神经网络解释为控制系统的离散化；作者将带扰动数据的对抗训练表述为极小极大（minimax）最优控制问题，推导出庞特里亚金极大值原理形式的一阶最优性条件，并在此基础上提出一种新的加权优化技巧。
  • results: 研究表明，对抗训练能够提升神经网络对输入扰动的鲁棒性；所提出的加权优化技巧在一个低维分类任务上得到了验证。
    Abstract In this paper, we address the adversarial training of neural ODEs from a robust control perspective. This is an alternative to the classical training via empirical risk minimization, and it is widely used to enforce reliable outcomes for input perturbations. Neural ODEs allow the interpretation of deep neural networks as discretizations of control systems, unlocking powerful tools from control theory for the development and the understanding of machine learning. In this specific case, we formulate the adversarial training with perturbed data as a minimax optimal control problem, for which we derive first order optimality conditions in the form of Pontryagin's Maximum Principle. We provide a novel interpretation of robust training leading to an alternative weighted technique, which we test on a low-dimensional classification task.
    摘要 在这篇论文中，我们从鲁棒控制的角度研究神经 ODE 的对抗训练。这是经典的经验风险最小化训练的一种替代方案，被广泛用于在输入扰动下保证可靠的输出。神经 ODE 允许将深度神经网络解释为控制系统的离散化，从而可以利用控制理论中的强大工具来发展和理解机器学习。在这一具体情形下，我们将带扰动数据的对抗训练表述为一个极小极大（minimax）最优控制问题，并以庞特里亚金极大值原理的形式推导出其一阶最优性条件。我们给出了对鲁棒训练的一种新的解释，由此得到一种替代的加权技巧，并在一个低维分类任务上进行了测试。

Convergence of flow-based generative models via proximal gradient descent in Wasserstein space

  • paper_url: http://arxiv.org/abs/2310.17582
  • repo_url: None
  • paper_authors: Xiuyuan Cheng, Jianfeng Lu, Yixin Tan, Yao Xie
  • for: 这个论文的目的是提供一种 theoretical guarantee for generating the data distribution using a progressive flow model, specifically the JKO flow model.
  • methods: 该论文使用了 Jordan-Kinderleherer-Otto (JKO) scheme in a normalizing flow network, 以及 proximal gradient descent (GD) in Wasserstein space.
  • results: 论文提供了 $O(\varepsilon^2)$ 的 Kullback-Leibler (KL) guarantee of data generation by a JKO flow model, 以及 KL-$W_2$ mixed error guarantees. Additionally, the paper proves the non-asymptotic convergence rate of the JKO-type $W_2$-proximal GD for a general class of convex objective functionals.
    Abstract Flow-based generative models enjoy certain advantages in computing the data generation and the likelihood, and have recently shown competitive empirical performance. Compared to the accumulating theoretical studies on related score-based diffusion models, analysis of flow-based models, which are deterministic in both forward (data-to-noise) and reverse (noise-to-data) directions, remain sparse. In this paper, we provide a theoretical guarantee of generating data distribution by a progressive flow model, the so-called JKO flow model, which implements the Jordan-Kinderleherer-Otto (JKO) scheme in a normalizing flow network. Leveraging the exponential convergence of the proximal gradient descent (GD) in Wasserstein space, we prove the Kullback-Leibler (KL) guarantee of data generation by a JKO flow model to be $O(\varepsilon^2)$ when using $N \lesssim \log (1/\varepsilon)$ many JKO steps ($N$ Residual Blocks in the flow) where $\varepsilon $ is the error in the per-step first-order condition. The assumption on data density is merely a finite second moment, and the theory extends to data distributions without density and when there are inversion errors in the reverse process where we obtain KL-$W_2$ mixed error guarantees. The non-asymptotic convergence rate of the JKO-type $W_2$-proximal GD is proved for a general class of convex objective functionals that includes the KL divergence as a special case, which can be of independent interest.
    摘要 基于流的生成模型在计算数据生成与似然方面具有一定优势，并且最近展现出有竞争力的实验性能。与相关的基于分数的扩散模型不断积累的理论研究相比，对在正向（数据到噪声）和逆向（噪声到数据）两个方向上均为确定性的流模型的分析仍然较少。在这篇论文中，我们为渐进式流模型生成数据分布提供了理论保证，即在归一化流网络中实现 Jordan-Kinderleherer-Otto（JKO）格式的 JKO 流模型。利用 Wasserstein 空间中近端梯度下降（GD）的指数收敛性，我们证明当使用 $N \lesssim \log (1/\varepsilon)$ 个 JKO 步（即流中的 $N$ 个残差块）时，JKO 流模型生成数据的 Kullback-Leibler（KL）保证为 $O(\varepsilon^2)$，其中 $\varepsilon$ 是每步一阶条件的误差。对数据密度的假设仅为二阶矩有限；该理论还可推广到没有密度的数据分布，以及逆向过程存在求逆误差的情形，此时我们得到 KL-$W_2$ 混合误差保证。我们还对包含 KL 散度作为特例的一般凸目标泛函类证明了 JKO 型 $W_2$-近端梯度下降的非渐近收敛速率，这一结果本身也具有独立的意义。
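
For reference, the following is a generic statement of the JKO (Wasserstein proximal) step that the abstract builds on; the step size $h$ and target density $\pi$ are our notation, and the identification of one residual block with one proximal step follows the abstract's count of $N$ blocks for $N$ steps.

```latex
% Generic JKO (Wasserstein proximal gradient) step; h and pi are our
% notation, not taken from the paper.
\[
  \rho_{k+1}
  \;=\;
  \operatorname*{arg\,min}_{\rho \in \mathcal{P}_2(\mathbb{R}^d)}
  \Big\{
    \mathrm{KL}(\rho \,\|\, \pi)
    \;+\;
    \frac{1}{2h}\, W_2^2(\rho, \rho_k)
  \Big\},
  \qquad k = 0, 1, \dots, N-1 .
\]
% Each proximal step is parameterized by one residual block of the
% normalizing-flow network, so N blocks correspond to N JKO steps.
```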

BLIS-Net: Classifying and Analyzing Signals on Graphs

  • paper_url: http://arxiv.org/abs/2310.17579
  • repo_url: None
  • paper_authors: Charles Xu, Laney Goldman, Valentina Guo, Benjamin Hollander-Bodie, Maedee Trank-Greene, Ian Adelstein, Edward De Brouwer, Rex Ying, Smita Krishnaswamy, Michael Perlmutter
  • for: 本文关注信号分类任务，即使用图神经网络（GNN）来捕捉信号的多频特性和长距离相互作用。
  • methods: 本文提出了一种基于 geometric scattering transform 的新型 GNN，称为 BLIS-Net，可以捕捉局部和全局信号结构，并同时捕捉低频和高频信息。
  • results: 作者在合成数据集和真实数据集上进行了评估，证明 BLIS-Net 能够超越原有的 geometric scattering 架构，在信号分类任务中取得更高的性能。
    Abstract Graph neural networks (GNNs) have emerged as a powerful tool for tasks such as node classification and graph classification. However, much less work has been done on signal classification, where the data consists of many functions (referred to as signals) defined on the vertices of a single graph. These tasks require networks designed differently from those designed for traditional GNN tasks. Indeed, traditional GNNs rely on localized low-pass filters, and signals of interest may have intricate multi-frequency behavior and exhibit long range interactions. This motivates us to introduce the BLIS-Net (Bi-Lipschitz Scattering Net), a novel GNN that builds on the previously introduced geometric scattering transform. Our network is able to capture both local and global signal structure and is able to capture both low-frequency and high-frequency information. We make several crucial changes to the original geometric scattering architecture which we prove increase the ability of our network to capture information about the input signal and show that BLIS-Net achieves superior performance on both synthetic and real-world data sets based on traffic flow and fMRI data.
    摘要 图神经网络（GNN）已成为节点分类和图分类等任务的有力工具。然而，针对信号分类的工作则少得多：此类任务的数据由定义在同一个图的顶点上的许多函数（称为信号）构成，因而需要与传统 GNN 任务不同的网络设计。实际上，传统 GNN 依赖于局部化的低通滤波器，而我们关心的信号可能具有复杂的多频行为并呈现长距离相互作用。这促使我们提出 BLIS-Net（Bi-Lipschitz Scattering Net），一种建立在此前提出的 geometric scattering 变换之上的新型 GNN。我们的网络既能捕捉局部和全局的信号结构，也能同时捕捉低频和高频信息。我们对原始 geometric scattering 架构做了若干关键改动，并证明这些改动提升了网络捕捉输入信号信息的能力。在基于交通流量和 fMRI 数据的合成与真实数据集上，BLIS-Net 取得了更优的性能。

Efficient Numerical Algorithm for Large-Scale Damped Natural Gradient Descent

  • paper_url: http://arxiv.org/abs/2310.17556
  • repo_url: None
  • paper_authors: Yixiao Chen, Hao Xie, Han Wang
  • for: 这篇论文是为了解决大规模场景下的阻尼 Fisher 矩阵问题，其中参数数量远超可用样本数量。这个问题是自然梯度下降和随机重构（stochastic reconfiguration）的基础。
  • methods: 这篇论文提出了一种新的解决方法,基于 Cholesky 分解。这种方法是通用的,并且在 benchmark 结果中显示了它比现有方法更快。
  • results: benchmark 结果表明,这种方法比现有方法更快。
    Abstract We propose a new algorithm for efficiently solving the damped Fisher matrix in large-scale scenarios where the number of parameters significantly exceeds the number of available samples. This problem is fundamental for natural gradient descent and stochastic reconfiguration. Our algorithm is based on Cholesky decomposition and is generally applicable. Benchmark results show that the algorithm is significantly faster than existing methods.
    摘要 我们提出了一种新的算法，用于在参数数量远超可用样本数量的大规模场景中高效求解阻尼 Fisher 矩阵问题。这一问题是自然梯度下降和随机重构的基础。我们的算法基于 Cholesky 分解，并具有普遍适用性。基准测试结果表明，该算法明显快于现有方法。
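
The abstract only states that the method is based on Cholesky decomposition, so the following is a hedged sketch of one standard way to solve a damped Fisher system cheaply when the number of samples N is much smaller than the number of parameters P: apply the Woodbury identity and factorize only the small N x N matrix with Cholesky. This is our illustration of the general idea, not necessarily the paper's exact algorithm.

```python
import numpy as np

def damped_natural_gradient(J, g, damping):
    """Solve (F + damping * I) x = g, where F = J.T @ J / N is the sample
    Fisher matrix and J has shape (N, P) with N << P. Uses the Woodbury
    identity so only an N x N SPD matrix is Cholesky-factorized.
    Illustrative sketch, not the paper's exact algorithm."""
    N, _ = J.shape
    lam = damping
    # Woodbury: (lam*I + J^T J / N)^{-1} g
    #   = (1/lam) * [ g - J^T (N*lam*I + J J^T)^{-1} (J g) ]
    small = N * lam * np.eye(N) + J @ J.T          # N x N, SPD
    L = np.linalg.cholesky(small)                  # Cholesky factor
    y = np.linalg.solve(L.T, np.linalg.solve(L, J @ g))
    return (g - J.T @ y) / lam

# Tiny usage example with random data (P >> N); check against a direct solve.
rng = np.random.default_rng(0)
J = rng.normal(size=(64, 1024))
g = rng.normal(size=1024)
x = damped_natural_gradient(J, g, damping=1e-3)
F = J.T @ J / J.shape[0]
x_ref = np.linalg.solve(F + 1e-3 * np.eye(1024), g)
print(np.allclose(x, x_ref, atol=1e-6))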

Hierarchical Ensemble-Based Feature Selection for Time Series Forecasting

  • paper_url: http://arxiv.org/abs/2310.17544
  • repo_url: None
  • paper_authors: Aysin Tumay, Mustafa E. Aydin, Suleyman S. Kozat
  • for: 本研究旨在提出一种基于层次堆叠的 ensemble 方法,用于在非站立性和有限样本数量的情况下进行特征选择,以解决传统特征选择方法和特征重要性评价的局限性。
  • methods: 本方法首先使用一种机器学习模型,在一 subset of 特征上训练该模型,然后使用另一种算法将剩余的特征更新模型的输出,以最小化目标损失。这种层次结构允许自适应深度和特征选择。
  • results: 对synthetic和实际数据集进行测试,示出了该方法与传统方法和状态 искусственный进步的性能优势,同时具有扩展性和稳定性。
    Abstract We study a novel ensemble approach for feature selection based on hierarchical stacking in cases of non-stationarity and limited number of samples with large number of features. Our approach exploits the co-dependency between features using a hierarchical structure. Initially, a machine learning model is trained using a subset of features, and then the model's output is updated using another algorithm with the remaining features to minimize the target loss. This hierarchical structure allows for flexible depth and feature selection. By exploiting feature co-dependency hierarchically, our proposed approach overcomes the limitations of traditional feature selection methods and feature importance scores. The effectiveness of the approach is demonstrated on synthetic and real-life datasets, indicating improved performance with scalability and stability compared to the traditional methods and state-of-the-art approaches.
    摘要 我们研究了一种新的集成方法，用于在非平稳且样本数量有限、特征数量庞大的情形下进行特征选择。我们的方法利用层次结构来刻画特征之间的相互依赖关系。首先，我们使用一个特征子集训练机器学习模型，然后使用另一种算法借助剩余特征更新模型的输出，以最小化目标损失。这种层次结构允许灵活选择深度和特征。通过层次地利用特征间的相互依赖关系，我们的方法克服了传统特征选择方法和特征重要性评分的局限性。我们在合成数据集和真实数据集上验证了该方法的有效性，结果表明相比传统方法和当前最佳方法，它具有更好的性能、可扩展性和稳定性。
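
A minimal two-level sketch of the hierarchical stacking idea described above, assuming a simple residual-correction form: a first model is fit on one feature subset, and a second algorithm uses the remaining features to update the first model's output. The feature split, the two estimators, and the residual formulation are our assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor

# Two-level hierarchical stacking sketch (our assumptions, see lead-in).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))
y = 2.0 * X[:, 0] - X[:, 5] + 0.5 * X[:, 8] + 0.1 * rng.normal(size=500)

subset_a, subset_b = list(range(6)), list(range(6, 12))   # hypothetical split
X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]

# Level 1: model trained on the first feature subset.
level1 = Ridge(alpha=1.0).fit(X_tr[:, subset_a], y_tr)
residual = y_tr - level1.predict(X_tr[:, subset_a])

# Level 2: a different algorithm uses the remaining features to update
# (correct) the level-1 output so as to minimize the target loss.
level2 = GradientBoostingRegressor(random_state=0).fit(X_tr[:, subset_b], residual)

pred = level1.predict(X_te[:, subset_a]) + level2.predict(X_te[:, subset_b])
print("test MSE:", float(np.mean((pred - y_te) ** 2)))
```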

EqDrive: Efficient Equivariant Motion Forecasting with Multi-Modality for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2310.17540
  • repo_url: None
  • paper_authors: Yuping Wang, Jier Chen
  • for: 预测自动驾驶车辆的运动需要深刻理解智能体之间的交互关系，以及运动在欧氏几何变换下的等变性。传统模型往往难以刻画这些复杂的动力学和场景中智能体间的交互，模型容量较低，从而导致预测误差较高且训练效率较低。
  • methods: 我们采用 EqMotion（一种领先的等变粒子与人体运动预测模型，同时考虑不变的智能体间交互）来完成多智能体车辆运动预测任务；此外，我们还使用多模态预测机制，以概率方式处理多条可能的未来路径。
  • results: 借助 EqMotion，我们的模型以更少的参数（120 万）和显著缩短的训练时间（少于 2 小时）实现了最先进（SOTA）的性能。
    Abstract Forecasting vehicular motions in autonomous driving requires a deep understanding of agent interactions and the preservation of motion equivariance under Euclidean geometric transformations. Traditional models often lack the sophistication needed to handle the intricate dynamics inherent to autonomous vehicles and the interaction relationships among agents in the scene. As a result, these models have a lower model capacity, which then leads to higher prediction errors and lower training efficiency. In our research, we employ EqMotion, a leading equivariant particle, and human prediction model that also accounts for invariant agent interactions, for the task of multi-agent vehicle motion forecasting. In addition, we use a multi-modal prediction mechanism to account for multiple possible future paths in a probabilistic manner. By leveraging EqMotion, our model achieves state-of-the-art (SOTA) performance with fewer parameters (1.2 million) and a significantly reduced training time (less than 2 hours).
    摘要 预测自动驾驶中的车辆运动需要深刻理解智能体之间的交互，并保持运动在欧氏几何变换下的等变性。传统模型往往缺乏处理自动驾驶车辆固有的复杂动力学以及场景中智能体间交互关系所需的能力，因此模型容量较低，导致预测误差较高、训练效率较低。在我们的研究中，我们采用 EqMotion——一种领先的等变粒子与人体运动预测模型，其同时考虑了不变的智能体间交互——来完成多智能体车辆运动预测。此外，我们还使用多模态预测机制，以概率方式考虑多条可能的未来路径。借助 EqMotion，我们的模型以更少的参数（120 万，即 1.2 million）和显著缩短的训练时间（少于 2 小时）实现了最先进（SOTA）的性能。

Little Exploration is All You Need

  • paper_url: http://arxiv.org/abs/2310.17538
  • repo_url: https://github.com/ArchizSolutions/Lead-Management-Software-2020-for-Generate-More-Sales-
  • paper_authors: Henry H. H. Chen, Jiaming Lu
  • for: 本研究旨在提高多臂投机问题(Multi-armed Bandit problem)中的选择策略,特别是处理不确定性和难度问题。
  • methods: 本研究提出了一个基于 UCB 算法的修改版本 UCB$^\tau$，将探索奖励项调整为 $1/n^\tau$（其中 $\tau > 1/2$），以考虑任务难度。
  • results: 在synthetic数据上进行比较性评估,UCB$^\tau$不仅在效率上表现更好,还能够在不同的环境和参数设定下降低风险。
    Abstract The prevailing principle of "Optimism in the Face of Uncertainty" advocates for the incorporation of an exploration bonus, generally assumed to be proportional to the inverse square root of the visit count ($1/\sqrt{n}$), where $n$ is the number of visits to a particular state-action pair. This approach, however, exclusively focuses on "uncertainty," neglecting the inherent "difficulty" of different options. To address this gap, we introduce a novel modification of standard UCB algorithm in the multi-armed bandit problem, proposing an adjusted bonus term of $1/n^\tau$, where $\tau > 1/2$, that accounts for task difficulty. Our proposed algorithm, denoted as UCB$^\tau$, is substantiated through comprehensive regret and risk analyses, confirming its theoretical robustness. Comparative evaluations with standard UCB and Thompson Sampling algorithms on synthetic datasets demonstrate that UCB$^\tau$ not only outperforms in efficacy but also exhibits lower risk across various environmental conditions and hyperparameter settings.
    摘要 “面对不确定性保持乐观”这一主流原则主张引入探索奖励项，其大小通常被取为访问次数的平方根倒数（$1/\sqrt{n}$），其中 $n$ 是某一状态-动作对被访问的次数。然而，这种做法只关注“不确定性”，忽略了不同选项固有的“难度”。为弥补这一不足，我们在多臂赌博机问题中对标准 UCB 算法提出了一种新的修改，将奖励项调整为 $1/n^\tau$（其中 $\tau > 1/2$），以考虑任务难度。我们将所提出的算法记为 UCB$^\tau$，并通过完整的遗憾与风险分析证明其理论上的稳健性。在合成数据集上与标准 UCB 和 Thompson 采样算法的对比实验表明，UCB$^\tau$ 不仅效果更好，而且在各种环境条件和超参数设置下风险更低。
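
A minimal sketch of a UCB-style bandit using the adjusted bonus $1/n^\tau$ described above; the overall index form and the absence of any extra scaling constant are our assumptions, since the abstract only specifies the bonus term.

```python
import numpy as np

def ucb_tau(means, horizon, tau=0.75, seed=0):
    """Minimal UCB-style bandit with exploration bonus n^{-tau} (tau > 1/2),
    per the adjusted bonus described above. The exact index form and any
    scaling constants are our assumptions, not taken from the paper."""
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k)
    estimates = np.zeros(k)
    total_reward = 0.0
    for t in range(horizon):
        if t < k:                      # pull each arm once to initialize
            arm = t
        else:
            bonus = counts ** (-tau)   # 1/n^tau instead of ~1/sqrt(n)
            arm = int(np.argmax(estimates + bonus))
        reward = rng.normal(means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return horizon * max(means) - total_reward   # cumulative regret

print(ucb_tau([0.1, 0.3, 0.9], horizon=5000))
```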

Learning Regularized Graphon Mean-Field Games with Unknown Graphons

  • paper_url: http://arxiv.org/abs/2310.17531
  • repo_url: None
  • paper_authors: Fengzhuo Zhang, Vincent Y. F. Tan, Zhaoran Wang, Zhuoran Yang
  • for: The paper is written for learning the Nash Equilibrium (NE) of regularized Graphon Mean-Field Games (GMFGs) when the graphons are unknown.
  • methods: The paper proposes the Proximal Policy Optimization for GMFG (GMFG-PPO) algorithm and an efficient algorithm to estimate the transition kernels, reward functions, and graphons from sampled agents using kernel embedding of distributions.
  • results: The paper shows that the proposed algorithms converge at a rate of $O(T^{-1/3})$ after $T$ iterations with an estimation oracle, and demonstrates the efficacy of the proposed algorithms through simulations, which show that learning the unknown graphons reduces the exploitability effectively.
    Abstract We design and analyze reinforcement learning algorithms for Graphon Mean-Field Games (GMFGs). In contrast to previous works that require the precise values of the graphons, we aim to learn the Nash Equilibrium (NE) of the regularized GMFGs when the graphons are unknown. Our contributions are threefold. First, we propose the Proximal Policy Optimization for GMFG (GMFG-PPO) algorithm and show that it converges at a rate of $O(T^{-1/3})$ after $T$ iterations with an estimation oracle, improving on a previous work by Xie et al. (ICML, 2021). Second, using kernel embedding of distributions, we design efficient algorithms to estimate the transition kernels, reward functions, and graphons from sampled agents. Convergence rates are then derived when the positions of the agents are either known or unknown. Results for the combination of the optimization algorithm GMFG-PPO and the estimation algorithm are then provided. These algorithms are the first specifically designed for learning graphons from sampled agents. Finally, the efficacy of the proposed algorithms are corroborated through simulations. These simulations demonstrate that learning the unknown graphons reduces the exploitability effectively.
    摘要 我们为 Graphon 平均场博弈（GMFG）设计并分析了强化学习算法。与以往需要精确知道 graphon 取值的工作不同，我们的目标是在 graphon 未知的情况下学习正则化 GMFG 的纳什均衡（NE）。我们的贡献有三：第一，我们提出了 GMFG-PPO 算法（Proximal Policy Optimization for GMFG），并证明在具有估计 oracle 的情况下，它在 $T$ 次迭代后以 $O(T^{-1/3})$ 的速率收敛，改进了 Xie 等人（ICML, 2021）的先前工作；第二，利用分布的核嵌入，我们设计了高效的算法，从采样到的智能体中估计转移核、奖励函数和 graphon，并在智能体位置已知或未知两种情形下推导了收敛速率；第三，我们给出了将优化算法 GMFG-PPO 与估计算法相结合的结果，这些算法是首批专门为从采样智能体中学习 graphon 而设计的算法。最后，我们通过仿真验证了所提算法的有效性，结果表明学习未知 graphon 能够有效降低可利用性（exploitability）。

Controllable Generation of Artificial Speaker Embeddings through Discovery of Principal Directions

  • paper_url: http://arxiv.org/abs/2310.17502
  • repo_url: None
  • paper_authors: Florian Lux, Pascal Tilli, Sarina Meyer, Ngoc Thang Vu
  • for: 这篇论文的目的是提出一种方法,用于在语音合成系统中自定义声音和发音样式,并且提供了Intuitive和细化控制。
  • methods: 该方法生成人工说话人嵌入，不需要任何说话人或风格标注数据。语音合成系统在训练时以真实说话人的嵌入作为条件，而在推理时使用可控的人工嵌入，从而对声音和说话风格进行直观且细粒度的控制。
  • results: 该方法可以提供Intuitive和细化控制的语音合成系统,不需要任何隐私数据,并且可以在推理过程中保护隐私。
    Abstract Customizing voice and speaking style in a speech synthesis system with intuitive and fine-grained controls is challenging, given that little data with appropriate labels is available. Furthermore, editing an existing human's voice also comes with ethical concerns. In this paper, we propose a method to generate artificial speaker embeddings that cannot be linked to a real human while offering intuitive and fine-grained control over the voice and speaking style of the embeddings, without requiring any labels for speaker or style. The artificial and controllable embeddings can be fed to a speech synthesis system, conditioned on embeddings of real humans during training, without sacrificing privacy during inference.
    摘要 在语音合成系统中以直观且细粒度的方式定制声音和说话风格颇具挑战性，因为带有相应标注的数据很少；此外，编辑真实人的声音还涉及伦理问题。在这篇论文中，我们提出一种方法，用于生成无法关联到任何真实人的人工说话人嵌入，同时无需任何说话人或风格标签，即可对嵌入所对应的声音和说话风格进行直观且细粒度的控制。这些人工且可控的嵌入可以输入到语音合成系统中：该系统在训练时以真实人的嵌入作为条件，而在推理时不会牺牲隐私。

CBD: A Certified Backdoor Detector Based on Local Dominant Probability

  • paper_url: http://arxiv.org/abs/2310.17498
  • repo_url: https://github.com/zhenxianglance/cbd
  • paper_authors: Zhen Xiang, Zidi Xiong, Bo Li
  • for: This paper presents a certified backdoor detector (CBD) to detect backdoor attacks in deep neural networks.
  • methods: CBD uses a novel, adjustable conformal prediction scheme based on the proposed statistic local dominant probability.
  • results: CBD achieves high detection accuracy with guarantees and provides detection certification, outperforming state-of-the-art detectors on four benchmark datasets. Specifically, it achieves 100% detection true positive rate on backdoor attacks with random perturbation triggers bounded by $\ell_2\leq0.75$.
    Abstract Backdoor attack is a common threat to deep neural networks. During testing, samples embedded with a backdoor trigger will be misclassified as an adversarial target by a backdoored model, while samples without the backdoor trigger will be correctly classified. In this paper, we present the first certified backdoor detector (CBD), which is based on a novel, adjustable conformal prediction scheme based on our proposed statistic local dominant probability. For any classifier under inspection, CBD provides 1) a detection inference, 2) the condition under which the attacks are guaranteed to be detectable for the same classification domain, and 3) a probabilistic upper bound for the false positive rate. Our theoretical results show that attacks with triggers that are more resilient to test-time noise and have smaller perturbation magnitudes are more likely to be detected with guarantees. Moreover, we conduct extensive experiments on four benchmark datasets considering various backdoor types, such as BadNet, CB, and Blend. CBD achieves comparable or even higher detection accuracy than state-of-the-art detectors, and it in addition provides detection certification. Notably, for backdoor attacks with random perturbation triggers bounded by $\ell_2\leq0.75$ which achieves more than 90\% attack success rate, CBD achieves 100\% (98\%), 100\% (84\%), 98\% (98\%), and 72\% (40\%) empirical (certified) detection true positive rates on the four benchmark datasets GTSRB, SVHN, CIFAR-10, and TinyImageNet, respectively, with low false positive rates.
    摘要 后门攻击是深度神经网络面临的常见威胁。在测试阶段，嵌入后门触发器的样本会被植入后门的模型误分类为攻击者指定的目标类别，而不含触发器的样本则会被正确分类。在这篇论文中，我们提出了首个可认证的后门检测器（CBD），它基于一种新颖的、可调节的保形预测（conformal prediction）方案，该方案建立在我们提出的局部占优概率（local dominant probability）统计量之上。对任何被检测的分类器，CBD 提供：1) 检测推断；2) 在同一分类域下攻击可被保证检测到的条件；3) 误报率（false positive rate）的概率上界。我们的理论结果表明，触发器对测试时噪声更鲁棒、扰动幅度更小的攻击更有可能被有保证地检测到。此外，我们在四个基准数据集上针对 BadNet、CB、Blend 等多种后门类型进行了大量实验。CBD 取得了与最先进检测器相当甚至更高的检测精度，并额外提供了检测认证。值得注意的是，对于攻击成功率超过 90%、随机扰动触发器满足 $\ell_2\leq0.75$ 的后门攻击，CBD 在 GTSRB、SVHN、CIFAR-10 和 TinyImageNet 四个基准数据集上分别取得了 100%（98%）、100%（84%）、98%（98%）和 72%（40%）的经验（认证）检测真阳性率，同时保持了较低的误报率。
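
Since the abstract describes CBD as built on an adjustable conformal prediction scheme, the following is a generic split-conformal p-value sketch of how such a detector can certify its false positive rate; the detection statistic and the source of the calibration scores (e.g. known-benign shadow classifiers) are placeholders, not the paper's exact construction.

```python
import numpy as np

def conformal_p_value(calibration_scores, test_score):
    """Generic (split-)conformal p-value: fraction of calibration scores at
    least as extreme as the test score, with the +1 correction. What the
    statistic is and where the calibration scores come from are
    placeholders, not the paper's exact construction."""
    cal = np.asarray(calibration_scores)
    return (1 + np.sum(cal >= test_score)) / (len(cal) + 1)

# Declare "backdoored" when the p-value falls below a significance level
# alpha; under exchangeability the conformal construction bounds the
# false positive rate by alpha.
rng = np.random.default_rng(0)
benign_scores = rng.normal(0.0, 1.0, size=200)   # hypothetical calibration set
suspect_score = 4.2                               # hypothetical test statistic
p = conformal_p_value(benign_scores, suspect_score)
print(p, "flag as backdoored" if p < 0.05 else "no detection")
```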

Tackling Interference Induced by Data Training Loops in A/B Tests: A Weighted Training Approach

  • paper_url: http://arxiv.org/abs/2310.17496
  • repo_url: None
  • paper_authors: Nian Si
  • for: Improving the accuracy and effectiveness of recommendation systems
  • methods: Using weighted training, which adjusts the weights of data points during model training to reduce the bias and variance introduced by data training loops in A/B tests
  • results: Weighted training reduces the bias and variance of the resulting estimates, and in simulation studies it outperforms other methods.
    Abstract In modern recommendation systems, the standard pipeline involves training machine learning models on historical data to predict user behaviors and improve recommendations continuously. However, these data training loops can introduce interference in A/B tests, where data generated by control and treatment algorithms, potentially with different distributions, are combined. To address these challenges, we introduce a novel approach called weighted training. This approach entails training a model to predict the probability of each data point appearing in either the treatment or control data and subsequently applying weighted losses during model training. We demonstrate that this approach achieves the least variance among all estimators without causing shifts in the training distributions. Through simulation studies, we demonstrate the lower bias and variance of our approach compared to other methods.
    摘要 在现代推荐系统中，标准流程通常是使用历史数据训练机器学习模型，以预测用户行为并持续改进推荐。然而，这些数据训练循环可能给 A/B 测试带来干扰，因为对照组与实验组算法生成的数据（其分布可能不同）会被混合在一起。为应对这些挑战，我们提出了一种称为加权训练的新方法：先训练一个模型来预测每个数据点出现在实验组或对照组数据中的概率，随后在模型训练时应用相应的加权损失。我们证明，该方法在不造成训练分布偏移的前提下，在所有估计量中达到最小的方差。通过仿真研究，我们展示了该方法相比其他方法具有更低的偏差和方差。
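
A hedged sketch of the weighted-training idea described above: fit a model that predicts the probability of each point belonging to the treatment data, then apply weighted losses when training the downstream model. The inverse-propensity-style weights and the linear downstream model below are our assumptions for illustration, not necessarily the paper's estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Weighted-training sketch (assumptions noted in the lead-in above).
rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.normal(size=(n, d))
is_treatment = rng.binomial(1, p=1 / (1 + np.exp(-X[:, 0])))   # unbalanced mix
y = X @ rng.normal(size=d) + 0.5 * is_treatment + rng.normal(size=n)

# Step 1: predict the probability of each point appearing in the treatment data.
propensity = LogisticRegression().fit(X, is_treatment).predict_proba(X)[:, 1]

# Step 2: apply per-example weights when fitting the downstream model
# (weighted least squares in closed form, as a stand-in for any weighted loss).
w = np.where(is_treatment == 1, 1.0 / propensity, 1.0 / (1.0 - propensity))
theta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
print(theta)
```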

FedPEAT: Convergence of Federated Learning, Parameter-Efficient Fine Tuning, and Emulator Assisted Tuning for Artificial Intelligence Foundation Models with Mobile Edge Computing

  • paper_url: http://arxiv.org/abs/2310.17491
  • repo_url: None
  • paper_authors: Terence Jie Chua, Wenhan Yu, Jun Zhao, Kwok-Yan Lam
  • for: 该论文旨在解决基础模型时的部署和微调问题,提供一种基于模拟器和参数效率的微调方法。
  • methods: 该方法包括模拟器帮助微调(EAT)和参数效率微调(PEFT),并将这两种方法组合成为参数效率模拟器帮助微调(PEAT)。在联合学习环境中,该方法使用适配器、模拟器和PEFT来进行联合学习模型微调,以保护模型隐私和内存效率。
  • results: 该论文通过在一个独特的服务器参与联合联合学习微调场景中进行测试,表明了该方法的潜在在解决基础模型挑战。
    Abstract The emergence of foundation models, including language and vision models, has reshaped AI's landscape, offering capabilities across various applications. Deploying and fine-tuning these large models, like GPT-3 and BERT, presents challenges, especially in the current foundation model era. We introduce Emulator-Assisted Tuning (EAT) combined with Parameter-Efficient Fine-Tuning (PEFT) to form Parameter-Efficient Emulator-Assisted Tuning (PEAT). Further, we expand this into federated learning as Federated PEAT (FedPEAT). FedPEAT uses adapters, emulators, and PEFT for federated model tuning, enhancing model privacy and memory efficiency. Adapters adjust pre-trained models, while emulators give a compact representation of original models, addressing both privacy and efficiency. Adaptable to various neural networks, our approach also uses deep reinforcement learning for hyper-parameter optimization. We tested FedPEAT in a unique scenario with a server participating in collaborative federated tuning, showcasing its potential in tackling foundation model challenges.
    摘要 “基础模型的出现,包括语言和视觉模型,已经改变了人工智能的景观,提供了多种应用领域的能力。部署和细化这些大型模型,如GPT-3和BERT,具有挑战,尤其在当前基础模型时代。我们提出了助手帮助调参(EAT)和参数有效调参(PEFT)的结合,称为参数有效助手帮助调参(PEAT)。此外,我们扩展了这种方法到联合学习,称为联合参数有效助手帮助调参(FedPEAT)。联合参数有效助手帮助调参使用适配器、模拟器和PEFT进行联合模型调参,提高模型隐私和内存效率。适配器调整预训练模型,而模拟器提供了原始模型的减少表示,解决了隐私和效率的问题。我们的方法可以应用于不同的神经网络,并使用深度强化学习来优化参数。我们在一个独特的服务器参与联合联合调参场景中测试了我们的方法,展示了它在基础模型挑战中的潜力。”

Fair collaborative vehicle routing: A deep multi-agent reinforcement learning approach

  • paper_url: http://arxiv.org/abs/2310.17485
  • repo_url: None
  • paper_authors: Stephen Mak, Liming Xu, Tim Pearce, Michael Ostroumov, Alexandra Brintrup
  • for: Collaborative vehicle routing problem
  • methods: Deep multi-agent reinforcement learning
  • results: Reduction in run-time of 88%
    Abstract Collaborative vehicle routing occurs when carriers collaborate through sharing their transportation requests and performing transportation requests on behalf of each other. This achieves economies of scale, thus reducing cost, greenhouse gas emissions and road congestion. But which carrier should partner with whom, and how much should each carrier be compensated? Traditional game theoretic solution concepts are expensive to calculate as the characteristic function scales exponentially with the number of agents. This would require solving the vehicle routing problem (NP-hard) an exponential number of times. We therefore propose to model this problem as a coalitional bargaining game solved using deep multi-agent reinforcement learning, where - crucially - agents are not given access to the characteristic function. Instead, we implicitly reason about the characteristic function; thus, when deployed in production, we only need to evaluate the expensive post-collaboration vehicle routing problem once. Our contribution is that we are the first to consider both the route allocation problem and gain sharing problem simultaneously - without access to the expensive characteristic function. Through decentralised machine learning, our agents bargain with each other and agree to outcomes that correlate well with the Shapley value - a fair profit allocation mechanism. Importantly, we are able to achieve a reduction in run-time of 88%.
    摘要 Traditional game theoretic solution concepts are computationally expensive and scale exponentially with the number of agents, making them impractical for large-scale scenarios. We propose modeling this problem as a coalitional bargaining game solved using deep multi-agent reinforcement learning, where agents do not have access to the characteristic function. Instead, we implicitly reason about the characteristic function, allowing for efficient computation.Our contribution is the first to consider both the route allocation problem and gain sharing problem simultaneously, without access to the expensive characteristic function. Through decentralized machine learning, our agents bargain with each other and agree to outcomes that correlate well with the Shapley value, a fair profit allocation mechanism. This approach achieves a reduction in run-time of 88%.

Secure short-term load forecasting for smart grids with transformer-based federated learning

  • paper_url: http://arxiv.org/abs/2310.17477
  • repo_url: https://github.com/JonasSievers/transformerBasedFederatedLearningForSecureSTLFInSG
  • paper_authors: Jonas Sievers, Thomas Blank
  • for: 预测电力负载,以帮助智能电网实现供应和需求的均衡。
  • methods: 使用联邦学习,在私人数据上进行本地学习,仅将训练完成的模型参数统一更新在全球服务器上。
  • results: 使用 transformer 型深度学习模型,在德国大学校园的数据上进行短期电力负载预测,与中央学习和本地学习进行比较,结果显示 transformer 型预测是该领域中轻量级的一个可靠选择。
    Abstract Electricity load forecasting is an essential task within smart grids to assist demand and supply balance. While advanced deep learning models require large amounts of high-resolution data for accurate short-term load predictions, fine-grained load profiles can expose users' electricity consumption behaviors, which raises privacy and security concerns. One solution to improve data privacy is federated learning, where models are trained locally on private data, and only the trained model parameters are merged and updated on a global server. Therefore, this paper presents a novel transformer-based deep learning approach with federated learning for short-term electricity load prediction. To evaluate our results, we benchmark our federated learning architecture against central and local learning and compare the performance of our model to long short-term memory models and convolutional neural networks. Our simulations are based on a dataset from a German university campus and show that transformer-based forecasting is a promising alternative to state-of-the-art models within federated learning.
    摘要 电力负荷预测是智能电网中的一项基本任务，有助于供需平衡。先进的深度学习模型需要大量高分辨率数据才能进行准确的短期负荷预测，而细粒度的负荷曲线可能暴露用户的用电行为，从而引发隐私与安全问题。提高数据隐私的一种方案是联邦学习：模型在本地的私有数据上训练，只有训练得到的模型参数被汇聚并在全局服务器上更新。因此，本文提出了一种结合联邦学习的基于 Transformer 的深度学习方法，用于短期电力负荷预测。为评估结果，我们将联邦学习架构与集中式学习和本地学习进行对比，并将我们的模型与长短期记忆（LSTM）模型和卷积神经网络进行比较。仿真基于德国一所大学校园的数据集，结果表明基于 Transformer 的预测是联邦学习中一种有前途的替代方案。
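
A minimal federated-averaging sketch of the training loop such a setup typically uses: each client trains on its private load data and only model parameters are averaged on the server. The linear autoregressive model and plain parameter averaging are generic placeholders, not the paper's transformer architecture or exact aggregation rule.

```python
import numpy as np

# Minimal FedAvg loop for short-term load forecasting (generic placeholders,
# see lead-in): clients keep their load data private and only send parameters.
rng = np.random.default_rng(0)
num_clients, lag, rounds, lr, local_steps = 4, 24, 20, 0.01, 5

def make_client_data(seed, n=300):
    r = np.random.default_rng(seed)
    series = np.sin(np.arange(n + lag) / 6.0) + 0.1 * r.normal(size=n + lag)
    X = np.stack([series[i:i + lag] for i in range(n)])   # lagged windows
    y = series[lag:]                                       # next-step load
    return X, y

clients = [make_client_data(s) for s in range(num_clients)]
global_w = np.zeros(lag)

for _ in range(rounds):
    local_ws = []
    for X, y in clients:                       # local training on private data
        w = global_w.copy()
        for _ in range(local_steps):
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w -= lr * grad
        local_ws.append(w)
    global_w = np.mean(local_ws, axis=0)       # server: average the parameters

X0, y0 = clients[0]
print("client-0 MSE:", float(np.mean((X0 @ global_w - y0) ** 2)))
```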

Foundation Model Based Native AI Framework in 6G with Cloud-Edge-End Collaboration

  • paper_url: http://arxiv.org/abs/2310.17471
  • repo_url: None
  • paper_authors: Xiang Chen, Zhiheng Guo, Xijun Wang, Howard H. Yang, Chenyuan Feng, Junshen Su, Sihui Zheng, Tony Q. S. Quek
  • for: This paper aims to redefine modes of collaboration between devices and servers and construct native intelligence libraries for 6G native AI.
  • methods: The proposed framework is based on foundation models and includes a customization approach for intent-aware PFM, a construction of a task-oriented AI toolkit, and a novel cloud-edge-end collaboration paradigm.
  • results: The proposed framework is applied to orchestration, achieving the maximum sum rate within a wireless communication system, and preliminary evaluation results are presented.
    Abstract Future wireless communication networks are in a position to move beyond data-centric, device-oriented connectivity and offer intelligent, immersive experiences based on task-oriented connections, especially in the context of the thriving development of pre-trained foundation models (PFM) and the evolving vision of 6G native artificial intelligence (AI). Therefore, redefining modes of collaboration between devices and servers and constructing native intelligence libraries become critically important in 6G. In this paper, we analyze the challenges of achieving 6G native AI from the perspectives of data, intelligence, and networks. Then, we propose a 6G native AI framework based on foundation models, provide a customization approach for intent-aware PFM, present a construction of a task-oriented AI toolkit, and outline a novel cloud-edge-end collaboration paradigm. As a practical use case, we apply this framework for orchestration, achieving the maximum sum rate within a wireless communication system, and presenting preliminary evaluation results. Finally, we outline research directions for achieving native AI in 6G.
    摘要 未来的无线通信网络有望超越以数据为中心、面向设备的连接，转而基于面向任务的连接提供智能、沉浸式的体验，尤其是在预训练基础模型（PFM）蓬勃发展和 6G 原生人工智能（AI）愿景不断演进的背景下。因此，在 6G 中重新定义设备与服务器之间的协作模式并构建原生智能库变得至关重要。本文从数据、智能和网络三个角度分析实现 6G 原生 AI 的挑战，随后提出了基于基础模型的 6G 原生 AI 框架，给出了面向意图感知 PFM 的定制方法，构建了面向任务的 AI 工具箱，并提出了一种新的云-边-端协作范式。作为实际用例，我们将该框架应用于编排，在一个无线通信系统中实现最大和速率，并给出了初步评估结果。最后，我们给出了在 6G 中实现原生 AI 的研究方向。

The statistical thermodynamics of generative diffusion models

  • paper_url: http://arxiv.org/abs/2310.17467
  • repo_url: None
  • paper_authors: Luca Ambrogioni
  • for: 这篇论文探讨了生成扩散模型在多种生成模型领域的表现,并证明了这些模型可以使用平衡统计力学工具来理解。
  • methods: 这篇论文使用平衡态统计力学的工具重新表述生成扩散模型，并证明这些模型会经历与对称性破缺现象相对应的二阶相变。
  • results: 这篇论文指出，这种不稳定性位于生成扩散模型生成能力的核心，并可以用一组平均场临界指数来刻画；最后，论文结合热力学表述分析了近期将扩散模型与联想记忆网络联系起来的工作。
    Abstract Generative diffusion models have achieved spectacular performance in many areas of generative modeling. While the fundamental ideas behind these models come from non-equilibrium physics, in this paper we show that many aspects of these models can be understood using the tools of equilibrium statistical mechanics. Using this reformulation, we show that generative diffusion models undergo second-order phase transitions corresponding to symmetry breaking phenomena. We argue that this lead to a form of instability that lies at the heart of their generative capabilities and that can be described by a set of mean field critical exponents. We conclude by analyzing recent work connecting diffusion models and associative memory networks in view of the thermodynamic formulations.
    摘要 生成扩散模型在生成建模的许多领域都取得了令人瞩目的表现。虽然这些模型的基本思想源自非平衡物理学，但在这篇论文中我们表明，其许多方面都可以用平衡态统计力学的工具来理解。基于这种重新表述，我们证明生成扩散模型会经历与对称性破缺现象相对应的二阶相变。我们认为，这导致了一种位于其生成能力核心的不稳定性，并且可以用一组平均场临界指数来刻画。最后，我们结合热力学表述分析了近期将扩散模型与联想记忆网络联系起来的工作。

Bayesian Neural Controlled Differential Equations for Treatment Effect Estimation

  • paper_url: http://arxiv.org/abs/2310.17463
  • repo_url: https://github.com/konstantinhess/bayesian-neural-cde
  • paper_authors: Konstantin Hess, Valentyn Melnychuk, Dennis Frauen, Stefan Feuerriegel
  • for: 这篇论文的目的是在连续时间下估计治疗效应，以推动个性化医疗。
  • methods: 这篇论文提出了一种新的贝叶斯神经受控微分方程（BNCDE），用于连续时间下的治疗效应估计。该方法通过一个由神经受控微分方程与神经随机微分方程耦合而成的系统来建模时间维度，其中神经随机微分方程使可处理的变分贝叶斯推断成为可能。
  • results: 对于给定的治疗序列，BNCDE 能够给出潜在结果的有意义的后验预测分布，从而为医疗决策提供不确定性量化，支持可靠的决策。
    Abstract Treatment effect estimation in continuous time is crucial for personalized medicine. However, existing methods for this task are limited to point estimates of the potential outcomes, whereas uncertainty estimates have been ignored. Needless to say, uncertainty quantification is crucial for reliable decision-making in medical applications. To fill this gap, we propose a novel Bayesian neural controlled differential equation (BNCDE) for treatment effect estimation in continuous time. In our BNCDE, the time dimension is modeled through a coupled system of neural controlled differential equations and neural stochastic differential equations, where the neural stochastic differential equations allow for tractable variational Bayesian inference. Thereby, for an assigned sequence of treatments, our BNCDE provides meaningful posterior predictive distributions of the potential outcomes. To the best of our knowledge, ours is the first tailored neural method to provide uncertainty estimates of treatment effects in continuous time. As such, our method is of direct practical value for promoting reliable decision-making in medicine.
    摘要 在连续时间下估计治疗效应对个性化医疗至关重要。然而，现有方法只能给出潜在结果的点估计，而忽略了不确定性估计。毋庸置疑，不确定性量化对于医疗应用中的可靠决策至关重要。为填补这一空白，我们提出了一种新的贝叶斯神经受控微分方程（BNCDE），用于连续时间下的治疗效应估计。在我们的 BNCDE 中，时间维度通过一个由神经受控微分方程与神经随机微分方程耦合而成的系统来建模，其中神经随机微分方程使可处理的变分贝叶斯推断成为可能。由此，对于给定的治疗序列，我们的 BNCDE 能够给出潜在结果的有意义的后验预测分布。据我们所知，这是首个专门为连续时间治疗效应提供不确定性估计的神经方法。因此，我们的方法对促进医疗中可靠的决策具有直接的实用价值。

Coalitional Bargaining via Reinforcement Learning: An Application to Collaborative Vehicle Routing

  • paper_url: http://arxiv.org/abs/2310.17458
  • repo_url: None
  • paper_authors: Stephen Mak, Liming Xu, Tim Pearce, Michael Ostroumov, Alexandra Brintrup
  • for: Collaborative Vehicle Routing (CVR) problem, where delivery companies cooperate to reduce cost, greenhouse gas emissions, and road congestion by sharing delivery information and performing delivery requests on behalf of each other.
  • methods: Modified Independent Proximal Policy Optimization (IPPO) algorithm, a decentralized approach that considers the self-interested nature of companies, together with a coalitional bargaining game to implicitly reason about the characteristic function and eliminate the need to evaluate the Vehicle Routing Problem (VRP) an exponential number of times.
  • results: The proposed decentralized approach outperforms a strong heuristic bot, with the agents correctly identifying the optimal coalitions 79% of the time with an average optimality gap of 4.2% and a reduction in run-time of 62%.
    Abstract Collaborative Vehicle Routing is where delivery companies cooperate by sharing their delivery information and performing delivery requests on behalf of each other. This achieves economies of scale and thus reduces cost, greenhouse gas emissions, and road congestion. But which company should partner with whom, and how much should each company be compensated? Traditional game theoretic solution concepts, such as the Shapley value or nucleolus, are difficult to calculate for the real-world problem of Collaborative Vehicle Routing due to the characteristic function scaling exponentially with the number of agents. This would require solving the Vehicle Routing Problem (an NP-Hard problem) an exponential number of times. We therefore propose to model this problem as a coalitional bargaining game where - crucially - agents are not given access to the characteristic function. Instead, we implicitly reason about the characteristic function, and thus eliminate the need to evaluate the VRP an exponential number of times - we only need to evaluate it once. Our contribution is that our decentralised approach is both scalable and considers the self-interested nature of companies. The agents learn using a modified Independent Proximal Policy Optimisation. Our RL agents outperform a strong heuristic bot. The agents correctly identify the optimal coalitions 79% of the time with an average optimality gap of 4.2% and reduction in run-time of 62%.
    摘要 协作车辆路径规划是指配送公司通过共享配送信息并代表彼此执行配送请求来进行合作。这样可以实现规模经济，从而降低成本、温室气体排放和道路拥堵。但哪家公司应与谁合作，每家公司又应获得多少补偿？由于特征函数随参与方数量呈指数级增长，Shapley 值或核仁等传统博弈论解概念在协作车辆路径规划这一现实问题上难以计算，这将需要指数次地求解车辆路径问题（NP 难问题）。因此，我们提议将该问题建模为一个联盟谈判博弈，其中关键在于智能体无法访问特征函数；相反，我们隐式地对特征函数进行推理，从而无需对车辆路径问题进行指数次求解，只需求解一次。我们的贡献在于这种去中心化方法既具有可扩展性，又考虑了公司的自利性质。智能体使用一种改进的独立近端策略优化（IPPO）算法进行学习。我们的强化学习智能体优于一个强启发式基线：智能体在 79% 的情况下正确识别出最优联盟，平均最优性差距为 4.2%，运行时间减少 62%。

Sliceformer: Make Multi-head Attention as Simple as Sorting in Discriminative Tasks

  • paper_url: http://arxiv.org/abs/2310.17683
  • repo_url: https://github.com/sds-lab/sliceformer
  • paper_authors: Shen Yuan, Hongteng Xu
  • for: 这个研究旨在提出一个具有更高效和更高精度的Transformer模组,以替代现有的Transformer模型。
  • methods: 这个研究使用了一种简单的“截割排序”操作来取代Transformer中的多头注意力(MHA)机制,并考虑了不同的实现方法。
  • results: 实验结果显示,Sliceformer在Long-Range Arena套件、图像分类、文本分类和分子性质预测等任务中具有较高的效率和较低的内存成本,并且能够避免模式崩溃现象。
    Abstract As one of the most popular neural network modules, Transformer plays a central role in many fundamental deep learning models, e.g., the ViT in computer vision and the BERT and GPT in natural language processing. The effectiveness of the Transformer is often attributed to its multi-head attention (MHA) mechanism. In this study, we discuss the limitations of MHA, including the high computational complexity due to its ``query-key-value'' architecture and the numerical issue caused by its softmax operation. Considering the above problems and the recent development tendency of the attention layer, we propose an effective and efficient surrogate of the Transformer, called Sliceformer. Our Sliceformer replaces the classic MHA mechanism with an extremely simple ``slicing-sorting'' operation, i.e., projecting inputs linearly to a latent space and sorting them along different feature dimensions (or equivalently, called channels). For each feature dimension, the sorting operation implicitly generates an implicit attention map with sparse, full-rank, and doubly-stochastic structures. We consider different implementations of the slicing-sorting operation and analyze their impacts on the Sliceformer. We test the Sliceformer in the Long-Range Arena benchmark, image classification, text classification, and molecular property prediction, demonstrating its advantage in computational complexity and universal effectiveness in discriminative tasks. Our Sliceformer achieves comparable or better performance with lower memory cost and faster speed than the Transformer and its variants. Moreover, the experimental results reveal that applying our Sliceformer can empirically suppress the risk of mode collapse when representing data. The code is available at \url{https://github.com/SDS-Lab/sliceformer}.
    摘要 Transformer 作为最受欢迎的神经网络模块之一，在许多基础深度学习模型中扮演着核心角色，例如计算机视觉中的 ViT 以及自然语言处理中的 BERT 和 GPT。Transformer 的有效性通常被归因于其多头注意力（MHA）机制。在这项研究中，我们讨论了 MHA 的局限性，包括其“查询-键-值”架构带来的高计算复杂度，以及 softmax 操作引起的数值问题。考虑到上述问题以及注意力层最近的发展趋势，我们提出了一种有效且高效的 Transformer 替代方案，称为 Sliceformer。Sliceformer 用一种极其简单的“切片-排序”操作取代经典的 MHA 机制：将输入线性投影到潜在空间，并沿不同的特征维度（或等价地称为通道）进行排序。对于每个特征维度，排序操作会隐式地生成一个具有稀疏、满秩且双随机结构的注意力图。我们考虑了切片-排序操作的不同实现方式，并分析了它们对 Sliceformer 的影响。我们在 Long-Range Arena 基准、图像分类、文本分类和分子性质预测任务上测试了 Sliceformer，证明了其在计算复杂度上的优势以及在判别任务中的普适有效性。与 Transformer 及其变体相比，Sliceformer 以更低的内存开销和更快的速度取得了相当或更好的性能。此外，实验结果表明，使用 Sliceformer 可以在经验上抑制表示数据时发生模式坍缩的风险。代码可在 https://github.com/SDS-Lab/sliceformer 获取。
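
A minimal sketch of the slicing-sorting operation as described above: project the token features linearly to a latent space, then sort each latent channel along the sequence dimension. Treating each channel's sort as independent, and the projection sizes used below, are our reading and choices for illustration, not the paper's specific implementation (the paper compares several).

```python
import numpy as np

def slicing_sorting(x, proj):
    """Minimal sketch of the slicing-sorting operation described above:
    project inputs linearly to a latent space, then sort each latent
    channel along the sequence dimension. Sorting each channel
    independently is our reading of the description.
    x:    (seq_len, d_in) token features
    proj: (d_in, d_latent) linear projection
    """
    z = x @ proj                       # slice: linear projection
    return np.sort(z, axis=0)          # sort: per channel, along the sequence

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))      # 16 tokens, 8 features
W = rng.normal(size=(8, 4))            # hypothetical projection to 4 channels
out = slicing_sorting(tokens, W)
print(out.shape)                       # (16, 4)
```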

Likelihood-based Out-of-Distribution Detection with Denoising Diffusion Probabilistic Models

  • paper_url: http://arxiv.org/abs/2310.17432
  • repo_url: None
  • paper_authors: Joseph Goodier, Neill D. F. Campbell
  • for: 本研究探讨了基于生成模型的 OUT-OF-DISTRIBUTION检测问题。
  • methods: 本研究使用了 Deep Denoising Diffusion Models,并提出了一种新的可能性比,即复杂性修正可能性比。
  • results: 研究结果与现有生成模型 OUT-OF-DISTRIBUTION 检测方法相当。
    Abstract Out-of-Distribution detection between dataset pairs has been extensively explored with generative models. We show that likelihood-based Out-of-Distribution detection can be extended to diffusion models by leveraging the fact that they, like other likelihood-based generative models, are dramatically affected by the input sample complexity. Currently, all Out-of-Distribution detection methods with Diffusion Models are reconstruction-based. We propose a new likelihood ratio for Out-of-Distribution detection with Deep Denoising Diffusion Models, which we call the Complexity Corrected Likelihood Ratio. Our likelihood ratio is constructed using Evidence Lower-Bound evaluations from an individual model at various noising levels. We present results that are comparable to state-of-the-art Out-of-Distribution detection methods with generative models.
    摘要 基于生成模型的数据集间分布外（Out-of-Distribution）检测已被广泛研究。我们表明，基于似然的分布外检测可以推广到扩散模型上，其依据是扩散模型与其他基于似然的生成模型一样，会受到输入样本复杂度的显著影响。目前，所有使用扩散模型的分布外检测方法都是基于重构的。我们为深度去噪扩散模型提出了一种新的用于分布外检测的似然比，称为复杂度校正似然比。该似然比利用单个模型在不同加噪水平下的证据下界（ELBO）评估来构建。我们给出的结果与当前最先进的基于生成模型的分布外检测方法相当。

Causal Modeling with Stationary Diffusions

  • paper_url: http://arxiv.org/abs/2310.17405
  • repo_url: None
  • paper_authors: Lars Lorch, Andreas Krause, Bernhard Schölkopf
  • for: 本研究提出了一种新的 causal inference 方法,而不是使用结构方程式表示 causal graph。
  • methods: 本方法使用 stochastic differential equations (SDEs) 来模型系统的行为下 intervened。
  • results: 本方法可以在一些情况下更好地扩展到未看到的 intervened 变量,比 классиical approach 更好。
    Abstract We develop a novel approach towards causal inference. Rather than structural equations over a causal graph, we learn stochastic differential equations (SDEs) whose stationary densities model a system's behavior under interventions. These stationary diffusion models do not require the formalism of causal graphs, let alone the common assumption of acyclicity. We show that in several cases, they generalize to unseen interventions on their variables, often better than classical approaches. Our inference method is based on a new theoretical result that expresses a stationarity condition on the diffusion's generator in a reproducing kernel Hilbert space. The resulting kernel deviation from stationarity (KDS) is an objective function of independent interest.
    摘要 我们提出了一种新的因果推断方法。我们学习的不是定义在因果图上的结构方程，而是随机微分方程（SDE），其平稳密度刻画系统在干预下的行为。这类平稳扩散模型不需要因果图的形式化框架，更不需要常见的无环性假设。我们表明，在若干情形下，它们能够泛化到其变量上未见过的干预，且往往优于经典方法。我们的推断方法基于一个新的理论结果，该结果在再生核希尔伯特空间中表达了扩散生成元的平稳性条件；由此得到的核平稳性偏差（KDS）本身也是一个具有独立意义的目标函数。

Enhancing Graph Neural Networks with Structure-Based Prompt

  • paper_url: http://arxiv.org/abs/2310.17394
  • repo_url: None
  • paper_authors: Qingqing Ge, Zeyuan Zhao, Yiding Liu, Anfeng Cheng, Xiang Li, Shuaiqiang Wang, Dawei Yin
  • for: 本研究旨在提高Graph Neural Networks(GNNs)在学习图数据 semantics方面的能力,特别是在“预训练、提示” paradigm下。
  • methods: 本研究提出了一种新的结构基于的提示方法(SAP),该方法在预训练和提示阶段都会利用图结构信息,以帮助更好地传递预训练知识到下游任务。SAP使用了双视对比学习,将节点特征和图结构的semantic空间进行协调,并在提示图中包含结构信息以更好地采用预训练知识。
  • results: 对于节点分类和图分类任务,SAP显示了更高的效果。此外,SAP在更加困难的少数shot场景下也能够达到更好的性能,包括同型和不同型图。
    Abstract Graph Neural Networks (GNNs) are powerful in learning semantics of graph data. Recently, a new paradigm "pre-train, prompt" has shown promising results in adapting GNNs to various tasks with less supervised data. The success of such paradigm can be attributed to the more consistent objectives of pre-training and task-oriented prompt tuning, where the pre-trained knowledge can be effectively transferred to downstream tasks. However, an overlooked issue of existing studies is that the structure information of graph is usually exploited during pre-training for learning node representations, while neglected in the prompt tuning stage for learning task-specific parameters. To bridge this gap, we propose a novel structure-based prompting method for GNNs, namely SAP, which consistently exploits structure information in both pre-training and prompt tuning stages. In particular, SAP 1) employs a dual-view contrastive learning to align the latent semantic spaces of node attributes and graph structure, and 2) incorporates structure information in prompted graph to elicit more pre-trained knowledge in prompt tuning. We conduct extensive experiments on node classification and graph classification tasks to show the effectiveness of SAP. Moreover, we show that SAP can lead to better performance in more challenging few-shot scenarios on both homophilous and heterophilous graphs.
    摘要 图形神经网络(GNNs)在学习图数据的含义方面具有强大的能力。最近,一种新的思路“预训练、提示”(pre-train, prompt)在使用更少的监督数据来适应不同任务中表现出了扎实的成果。这种思路的成功可以归结于预训练和任务尝试提示中的更一致的目标,其中预训练的知识可以更好地被下游任务中转移。然而,现有研究中一个被忽略的问题是,在预训练和任务尝试提示阶段中,通常会利用图structure信息来学习节点表示,而忽略图structure信息在任务尝试提示阶段中的利用。为了填补这一漏洞,我们提出了一种新的结构基于的提示方法,即SAP,该方法在预训练和任务尝试提示阶段都会一致地利用图structure信息。具体来说,SAP包括以下两个部分:1)使用双视contrastive学习来对节点特征和图结构的Semantic空间进行对接,2)在提示图中包含结构信息,以便更好地在任务尝试提示阶段引导更多的预训练知识。我们在节点分类和图分类任务中进行了广泛的实验,并证明了SAP的效果。此外,我们还证明了SAP可以在更加困难的几个shotenario中表现更好,包括同型和不同型的图。

A Challenge in Reweighting Data with Bilevel Optimization

  • paper_url: http://arxiv.org/abs/2310.17386
  • repo_url: None
  • paper_authors: Anastasia Ivanova, Pierre Ablin
  • for: 这个论文的目的是学习一个权重学习模型,以优化在小型测试集上的性能。
  • methods: 该论文将此任务形式化为双层优化问题，并研究基于热启动策略的经典双层求解器，即同时学习模型参数和数据权重。
  • results: 研究发现，这种联合动态可能导致次优解，其最终得到的数据权重非常稀疏；这说明了数据重加权的困难。
    Abstract In many scenarios, one uses a large training set to train a model with the goal of performing well on a smaller testing set with a different distribution. Learning a weight for each data point of the training set is an appealing solution, as it ideally allows one to automatically learn the importance of each training point for generalization on the testing set. This task is usually formalized as a bilevel optimization problem. Classical bilevel solvers are based on a warm-start strategy where both the parameters of the models and the data weights are learned at the same time. We show that this joint dynamic may lead to sub-optimal solutions, for which the final data weights are very sparse. This finding illustrates the difficulty of data reweighting and offers a clue as to why this method is rarely used in practice.
    摘要 在许多场景下，我们使用大规模训练集来训练模型，目标是在一个分布不同的小规模测试集上取得良好表现。为训练集中的每个数据点学习一个权重是一种颇具吸引力的方案，因为理想情况下它可以自动学习每个训练点对测试集泛化的重要性。这一任务通常被形式化为双层优化问题。经典的双层求解器基于热启动策略，同时学习模型参数和数据权重。我们表明，这种联合动态可能导致次优解，其最终得到的数据权重非常稀疏。这一发现揭示了数据重加权的困难，并为该方法在实践中很少被使用提供了一条线索。
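
For concreteness, a generic statement of the bilevel data-reweighting problem the abstract refers to, in our own notation (weights $w_i$, per-example training losses $\ell_i$, validation loss $L_{\mathrm{val}}$):

```latex
% Generic bilevel data-reweighting formulation; the notation is ours,
% not the paper's.
\[
  \min_{w \in \Delta}\; L_{\mathrm{val}}\big(\theta^{\ast}(w)\big)
  \qquad \text{s.t.} \qquad
  \theta^{\ast}(w) \;=\; \operatorname*{arg\,min}_{\theta}\;
  \sum_{i=1}^{n} w_i\, \ell_i(\theta),
\]
% where Delta is a constraint set for the weights (e.g. the simplex) and
% the warm-start strategy alternates updates of theta and w.
```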

Multitask Online Learning: Listen to the Neighborhood Buzz

  • paper_url: http://arxiv.org/abs/2310.17385
  • repo_url: None
  • paper_authors: Juliette Achddou, Nicolò Cesa-Bianchi, Pierre Laforgue
  • for: 这篇论文研究多任务在线学习，其中各智能体只能在任意通信网络上与其邻居交换信息。
  • methods: 论文提出了一种名为 $\texttt{MT-CO}_2\texttt{OL}$ 的去中心化算法，其遗憾取决于任务相似性与网络结构之间的相互作用。
  • results: 分析表明，$\texttt{MT-CO}_2\texttt{OL}$ 的遗憾（至多相差常数）不会劣于智能体之间不共享信息时得到的界；当相邻智能体处理相似任务时，该界会显著改善。此外，在线性损失下，该算法可以在对遗憾影响可忽略的情况下实现差分隐私。
    Abstract We study multitask online learning in a setting where agents can only exchange information with their neighbors on an arbitrary communication network. We introduce $\texttt{MT-CO}_2\texttt{OL}$, a decentralized algorithm for this setting whose regret depends on the interplay between the task similarities and the network structure. Our analysis shows that the regret of $\texttt{MT-CO}_2\texttt{OL}$ is never worse (up to constants) than the bound obtained when agents do not share information. On the other hand, our bounds significantly improve when neighboring agents operate on similar tasks. In addition, we prove that our algorithm can be made differentially private with a negligible impact on the regret when the losses are linear. Finally, we provide experimental support for our theory.
    摘要 我们研究多任务在线学习，其中各智能体只能在任意通信网络上与其邻居交换信息。我们提出了适用于该场景的去中心化算法 $\texttt{MT-CO}_2\texttt{OL}$，其遗憾取决于任务相似性与网络结构之间的相互作用。我们的分析表明，$\texttt{MT-CO}_2\texttt{OL}$ 的遗憾（至多相差常数）不会劣于智能体之间不共享信息时的界；而当相邻智能体处理相似任务时，我们的界会显著改善。此外，我们证明当损失为线性时，该算法可以在对遗憾影响可忽略的情况下实现差分隐私。最后，我们为理论提供了实验支持。

On the recognition of the game type based on physiological signals and eye tracking

  • paper_url: http://arxiv.org/abs/2310.17383
  • repo_url: None
  • paper_authors: Łukasz Czekaj, Łukasz Radzinski, Mateusz Kolimaga, Jakub Domaszewicz, Robert Kitłowski, Mariusz Szwoch, Włodzisław Duch
  • for: Explores whether cognitive activity can be recognized from a particular set of physiological and eye-tracking signals.
  • methods: Uses recognition of the game being played as a testbed, building a classifier for three games (Space Invaders, Tetris, Tower Defence) and inter-game pauses, validated in player-independent and player-dependent scenarios.
  • results: Based on the game-classification results, the paper discusses the player-dependent improvement in the context of biometric person recognition and considers potential applications in smart surveillance and the quantified self.
    Abstract Automated interpretation of signals yields many impressive applications from the area of affective computing and human activity recognition (HAR). In this paper we ask the question about possibility of cognitive activity recognition on the base of particular set of signals. We use recognition of the game played by the participant as a playground for exploration of the problem. We build classifier of three different games (Space Invaders, Tetris, Tower Defence) and inter-game pause. We validate classifier in the player-independent and player-dependent scenario. We discuss the improvement in the player-dependent scenario in the context of biometric person recognition. On the base of the results obtained in game classification, we consider potential applications in smart surveillance and quantified self.

Towards Unifying Diffusion Models for Probabilistic Spatio-Temporal Graph Learning

  • paper_url: http://arxiv.org/abs/2310.17360
  • repo_url: None
  • paper_authors: Junfeng Hu, Xu Liu, Zhencheng Fan, Yuxuan Liang, Roger Zimmermann
  • for: Proposes a unified approach to spatio-temporal graph learning for applications such as smart cities, human mobility and climate analysis.
  • methods: Views the learning tasks as predictions conditioned on shared spatio-temporal patterns and introduces Unified Spatio-Temporal Diffusion models (USTD), comprising a pre-trained shared spatio-temporal encoder and task-specific attention-based denoising networks (Gated Attention for forecasting and Temporal Gated Attention for kriging).
  • results: USTD achieves state-of-the-art performance against deterministic and probabilistic baselines on forecasting and kriging tasks while also providing valuable uncertainty estimates.
    Abstract Spatio-temporal graph learning is a fundamental problem in the Web of Things era, which enables a plethora of Web applications such as smart cities, human mobility and climate analysis. Existing approaches tackle different learning tasks independently, tailoring their models to unique task characteristics. These methods, however, fall short of modeling intrinsic uncertainties in the spatio-temporal data. Meanwhile, their specialized designs limit their universality as general spatio-temporal learning solutions. In this paper, we propose to model the learning tasks in a unified perspective, viewing them as predictions based on conditional information with shared spatio-temporal patterns. Based on this proposal, we introduce Unified Spatio-Temporal Diffusion Models (USTD) to address the tasks uniformly within the uncertainty-aware diffusion framework. USTD is holistically designed, comprising a shared spatio-temporal encoder and attention-based denoising networks that are task-specific. The shared encoder, optimized by a pre-training strategy, effectively captures conditional spatio-temporal patterns. The denoising networks, utilizing both cross- and self-attention, integrate conditional dependencies and generate predictions. Opting for forecasting and kriging as downstream tasks, we design Gated Attention (SGA) and Temporal Gated Attention (TGA) for each task, with different emphases on the spatial and temporal dimensions, respectively. By combining the advantages of deterministic encoders and probabilistic diffusion models, USTD achieves state-of-the-art performances compared to deterministic and probabilistic baselines in both tasks, while also providing valuable uncertainty estimates.

Exploring the Trie of Rules: a fast data structure for the representation of association rules

  • paper_url: http://arxiv.org/abs/2310.17355
  • repo_url: https://github.com/arm-interpretation/trie-of-rules
  • paper_authors: Mikhail Kudriavtsev, Dr Marija Bezbradica, Dr Andrew McCarren
  • for: Improves the data structures used in association rule mining so that a large mined ruleset can be summarized and queried efficiently.
  • methods: Proposes the Trie of rules, a prefix-tree graph structure for storing a mined ruleset; each node represents a rule whose consequent is the node itself and whose antecedent is the path from that node to the root, so similar rules overlay each other (a minimal sketch follows the abstract).
  • results: Compared with traditional data structures, the Trie of rules compresses a ruleset with almost no data loss and speeds up basic operations such as searching for a specific rule and sorting, achieving an 8-fold improvement in traversal time.
    Abstract Association rule mining techniques can generate a large volume of sequential data when implemented on transactional databases. Extracting insights from a large set of association rules has been found to be a challenging process. When examining a ruleset, the fundamental question is how to summarise and represent meaningful mined knowledge efficiently. Many algorithms and strategies have been developed to address issue of knowledge extraction; however, the effectiveness of this process can be limited by the data structures. A better data structure can sufficiently affect the speed of the knowledge extraction process. This paper proposes a novel data structure, called the Trie of rules, for storing a ruleset that is generated by association rule mining. The resulting data structure is a prefix-tree graph structure made of pre-mined rules. This graph stores the rules as paths within the prefix-tree in a way that similar rules overlay each other. Each node in the tree represents a rule where a consequent is this node, and an antecedent is a path from this node to the root of the tree. The evaluation showed that the proposed representation technique is promising. It compresses a ruleset with almost no data loss and benefits in terms of time for basic operations such as searching for a specific rule and sorting, which is the base for many knowledge discovery methods. Moreover, our method demonstrated a significant improvement in traversing time, achieving an 8-fold increase compared to traditional data structures.
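
As a rough illustration of the structure described above, the following sketch stores rules in a prefix tree keyed by antecedent items; the class and method names are assumptions, not the authors' implementation from the linked repository.

```python
# Illustrative Trie of rules: antecedent items form the path from the root, and each
# node stores the consequents (with confidence) of the rules ending at that node.
class TrieNode:
    def __init__(self):
        self.children = {}      # item -> TrieNode
        self.consequents = {}   # consequent item -> confidence

class TrieOfRules:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, antecedent, consequent, confidence):
        node = self.root
        for item in sorted(antecedent):      # canonical order so similar rules overlay
            node = node.children.setdefault(item, TrieNode())
        node.consequents[consequent] = confidence

    def lookup(self, antecedent):
        node = self.root
        for item in sorted(antecedent):
            node = node.children.get(item)
            if node is None:
                return {}
        return node.consequents

trie = TrieOfRules()
trie.insert({"bread", "butter"}, "milk", 0.8)
trie.insert({"bread"}, "milk", 0.6)
print(trie.lookup({"bread", "butter"}))   # {'milk': 0.8}
```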

De-novo Chemical Reaction Generation by Means of Temporarily Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2310.17341
  • repo_url: None
  • paper_authors: Andrei Buin, Hung Yi Chiang, S. Andrew Gadsden, Faraz A. Alderson
  • for: Presents a combination of Recurrent Neural Networks (RNN) and Temporarily Convolutional Neural Networks (TCN) for de-novo generation of chemical reactions.
  • methods: Uses a novel reaction SMILES-like representation (CGRSmiles) with atom mapping directly incorporated; both RNNs and TCNs are autoregressive architectures widely used in natural language processing (NLP), and their latent representations are combined.
  • results: The RNN-TCN combination generates reactions better than the RNN alone, and different fine-tuning protocols during transfer learning to a dataset of interest have a profound impact on the generative scope of the model.
    Abstract We present here a combination of two networks, Recurrent Neural Networks (RNN) and Temporarily Convolutional Neural Networks (TCN) in de novo reaction generation using the novel Reaction Smiles-like representation of reactions (CGRSmiles) with atom mapping directly incorporated. Recurrent Neural Networks are known for their autoregressive properties and are frequently used in language modelling with direct application to SMILES generation. The relatively novel TCNs possess similar properties with wide receptive field while obeying the causality required for natural language processing (NLP). The combination of both latent representations expressed through TCN and RNN results in an overall better performance compared to RNN alone. Additionally, it is shown that different fine-tuning protocols have a profound impact on generative scope of the model when applied on a dataset of interest via transfer learning.

A multi-artifact EEG denoising by frequency-based deep learning

  • paper_url: http://arxiv.org/abs/2310.17335
  • repo_url: None
  • paper_authors: Matteo Gabardi, Aurora Saibene, Francesca Gasparini, Daniele Rizzo, Fabio Antonio Stella
  • for: Aims to improve the quality of electroencephalographic (EEG) recordings by removing background noise, so that brain dynamics can be studied more reliably.
  • methods: Proposes a frequency-domain EEG denoising model that uses prior knowledge of noise spectral features to adaptively compute optimal convolutional filters, learning an empirical non-linear mapping from the spectral characteristics of the noise and the noisy signal to the denoised signal.
  • results: On the EEGdenoiseNet dataset the model achieves optimal results under both temporal and spectral metrics, removing muscle and ocular artifacts without artifact-specific training and matching or outperforming benchmark models.
    Abstract Electroencephalographic (EEG) signals are fundamental to neuroscience research and clinical applications such as brain-computer interfaces and neurological disorder diagnosis. These signals are typically a combination of neurological activity and noise, originating from various sources, including physiological artifacts like ocular and muscular movements. Under this setting, we tackle the challenge of distinguishing neurological activity from noise-related sources. We develop a novel EEG denoising model that operates in the frequency domain, leveraging prior knowledge about noise spectral features to adaptively compute optimal convolutional filters for noise separation. The model is trained to learn an empirical relationship connecting the spectral characteristics of noise and noisy signal to a non-linear transformation which allows signal denoising. Performance evaluation on the EEGdenoiseNet dataset shows that the proposed model achieves optimal results according to both temporal and spectral metrics. The model is found to remove physiological artifacts from input EEG data, thus achieving effective EEG denoising. Indeed, the model performance either matches or outperforms that achieved by benchmark models, proving to effectively remove both muscle and ocular artifacts without the need to perform any training on the particular type of artifact.

On Forecast Stability

  • paper_url: http://arxiv.org/abs/2310.17332
  • repo_url: https://github.com/KshitijK1999/The-Impact-of-Macroeconomic-and-Oil-Shocks-on-India-s-Non-Ferrous-Metal-Prices-An-SVAR-Approach-
  • paper_authors: Rakshitha Godahewa, Christoph Bergmeir, Zeynep Erkin Baz, Chengjun Zhu, Zhangdi Song, Salvador García, Dario Benavides
  • for: Aims to make forecasts stable, so that they do not change arbitrarily between forecasting rounds and can be relied upon in decision making.
  • methods: Proposes a simple linear-interpolation-based framework that can stabilise the forecasts of any base model, both vertically and horizontally (a hedged sketch of the interpolation step follows the abstract).
  • results: Using N-BEATS, Pooled Regression and LightGBM as base models on four publicly available datasets, the framework achieves significantly higher stability and/or accuracy than a set of benchmarks, including a state-of-the-art forecast stabilisation method, across three error metrics and six stability metrics.
    Abstract Forecasts are typically not produced in a vacuum but in a business context, where forecasts are generated on a regular basis and interact with each other. For decisions, it may be important that forecasts do not change arbitrarily, and are stable in some sense. However, this area has received only limited attention in the forecasting literature. In this paper, we explore two types of forecast stability that we call vertical stability and horizontal stability. The existing works in the literature are only applicable to certain base models and extending these frameworks to be compatible with any base model is not straightforward. Furthermore, these frameworks can only stabilise the forecasts vertically. To fill this gap, we propose a simple linear-interpolation-based approach that is applicable to stabilise the forecasts provided by any base model vertically and horizontally. The approach can produce both accurate and stable forecasts. Using N-BEATS, Pooled Regression and LightGBM as the base models, in our evaluation on four publicly available datasets, the proposed framework is able to achieve significantly higher stability and/or accuracy compared to a set of benchmarks including a state-of-the-art forecast stabilisation method across three error metrics and six stability metrics.
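
A minimal sketch of the linear-interpolation idea, assuming the stabilised forecast is a convex combination of the base model's new forecast and the previously published one; the blend weight and function names are illustrative assumptions rather than the paper's exact procedure.

```python
# Blend the new forecast with the one published in the previous round to damp
# arbitrary changes between forecasting origins.
import numpy as np

def stabilise(new_forecast: np.ndarray, previous_forecast: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Linear interpolation between the new and the previously published forecast."""
    return w * new_forecast + (1.0 - w) * previous_forecast

prev = np.array([100.0, 105.0, 110.0])     # forecast published at the last origin
new = np.array([120.0, 101.0, 130.0])      # raw forecast from the base model
print(stabilise(new, prev, w=0.7))          # -> [114.  102.2 124. ]
```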

Feature Extraction and Classification from Planetary Science Datasets enabled by Machine Learning

  • paper_url: http://arxiv.org/abs/2310.17681
  • repo_url: None
  • paper_authors: Conor Nixon, Zachary Yahn, Ethan Duncan, Ian Neidel, Alyssa Mills, Benoît Seignovert, Andrew Larsen, Kathryn Gansler, Charles Liles, Catherine Walker, Douglas Trent, John Santerre
  • for: Applies machine learning neural networks to image datasets from outer planet missions to achieve feature recognition.
  • methods: Uses transfer learning, adding and training new layers on an industry-standard Mask R-CNN (Region-based Convolutional Neural Network) to recognize labeled ice blocks on Europa, and applies the same approach to recognize clouds on Titan (a hedged sketch of the standard head-replacement recipe follows the abstract).
  • results: The updated models reach 68% precision on a new ice-block dataset and 95% precision over 369 Titan cloud images; the authors discuss further improvements and applications such as onboard selection of the most interesting image subsets or returning only differential data, which could greatly reduce the volume of returned data.
    Abstract In this paper we present two examples of recent investigations that we have undertaken, applying Machine Learning (ML) neural networks (NN) to image datasets from outer planet missions to achieve feature recognition. Our first investigation was to recognize ice blocks (also known as rafts, plates, polygons) in the chaos regions of fractured ice on Europa. We used a transfer learning approach, adding and training new layers to an industry-standard Mask R-CNN (Region-based Convolutional Neural Network) to recognize labeled blocks in a training dataset. Subsequently, the updated model was tested against a new dataset, achieving 68% precision. In a different application, we applied the Mask R-CNN to recognize clouds on Titan, again through updated training followed by testing against new data, with a precision of 95% over 369 images. We evaluate the relative successes of our techniques and suggest how training and recognition could be further improved. The new approaches we have used for planetary datasets can further be applied to similar recognition tasks on other planets, including Earth. For imagery of outer planets in particular, the technique holds the possibility of greatly reducing the volume of returned data, via onboard identification of the most interesting image subsets, or by returning only differential data (images where changes have occurred) greatly enhancing the information content of the final data stream.
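
The abstract describes adding and training new layers on a pretrained Mask R-CNN. One common way to do this, shown below as a hedged sketch following the standard torchvision transfer-learning recipe (not the authors' code), is to replace the box and mask prediction heads with new ones sized for the target classes.

```python
# Hedged sketch of the standard torchvision fine-tuning recipe for Mask R-CNN.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 2  # background + one feature class (assumption for illustration)

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box-classification head.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Replace the mask-prediction head.
in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)

# The new heads are then fine-tuned on the labeled planetary images while the
# pretrained backbone serves as the feature extractor.
```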

Demonstration-Regularized RL

  • paper_url: http://arxiv.org/abs/2310.17303
  • repo_url: None
  • paper_authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Alexey Naumov, Pierre Perrault, Michal Valko, Pierre Menard
  • for: Quantifies theoretically to what extent expert demonstrations improve the sample efficiency of reinforcement learning (RL).
  • methods: Studies demonstration-regularized RL, which applies KL-regularization toward a policy learned from the expert demonstrations by behavior cloning (a hedged sketch of the KL-regularized loss follows the abstract).
  • results: With $N^{\mathrm{E}}$ expert demonstrations, an optimal policy can be identified with sample complexity of order $\widetilde{\mathcal{O}}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite MDPs and $\widetilde{\mathcal{O}}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear MDPs; the paper also gives tight convergence guarantees for behavior cloning and shows that demonstration-regularized methods are provably efficient for RL from human feedback (RLHF).
    Abstract Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL's sample complexity. In particular, we study the demonstration-regularized reinforcement learning that leverages the expert demonstrations by KL-regularization for a policy learned by behavior cloning. Our findings reveal that using $N^{\mathrm{E}}$ expert demonstrations enables the identification of an optimal policy at a sample complexity of order $\widetilde{\mathcal{O}}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite and $\widetilde{\mathcal{O}}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear Markov decision processes, where $\varepsilon$ is the target precision, $H$ the horizon, $A$ the number of action, $S$ the number of states in the finite case and $d$ the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behaviour cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, thus setting our approach apart from the prior works.
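
A minimal sketch of the KL-regularization idea, assuming a discrete action space: fit a behavior-cloning policy on the demonstrations, then penalize the divergence of the learned policy from it. The function name and the policy-gradient form are illustrative assumptions, not the paper's algorithm.

```python
# KL-regularized policy loss toward a frozen behavior-cloning policy pi_BC.
import torch
import torch.nn.functional as F

def kl_regularized_policy_loss(logits, bc_logits, advantages, actions, beta=1.0):
    """Policy-gradient loss plus a KL(pi || pi_BC) penalty.

    logits:     [batch, n_actions] current policy logits
    bc_logits:  [batch, n_actions] frozen behavior-cloning policy logits
    advantages: [batch] advantage estimates
    actions:    [batch] actions taken (long tensor)
    beta:       strength of the KL regularization
    """
    log_pi = F.log_softmax(logits, dim=-1)
    log_pi_bc = F.log_softmax(bc_logits.detach(), dim=-1)
    pg_loss = -(advantages * log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)).mean()
    kl = (log_pi.exp() * (log_pi - log_pi_bc)).sum(dim=-1).mean()  # KL(pi || pi_BC)
    return pg_loss + beta * kl
```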

Looping in the Human: Collaborative and Explainable Bayesian Optimization

  • paper_url: http://arxiv.org/abs/2310.17273
  • repo_url: None
  • paper_authors: Masaki Adachi, Brady Planden, David A. Howey, Krikamol Maundet, Michael A. Osborne, Siu Lun Chau
  • for: Develops a collaborative and explainable Bayesian optimization framework so that human users and the optimizer can work together, improving both the optimization and user trust.
  • methods: Proposes CoExBO, which uses preference learning to integrate human insights into the optimization instead of requiring an explicit user knowledge model, explains its candidate selection at every iteration, and offers a no-harm guarantee: even under extreme adversarial interventions it converges asymptotically to vanilla Bayesian optimization.
  • results: Human-AI teaming experiments on lithium-ion battery design show substantial improvements over conventional methods.
    Abstract Like many optimizers, Bayesian optimization often falls short of gaining user trust due to opacity. While attempts have been made to develop human-centric optimizers, they typically assume user knowledge is well-specified and error-free, employing users mainly as supervisors of the optimization process. We relax these assumptions and propose a more balanced human-AI partnership with our Collaborative and Explainable Bayesian Optimization (CoExBO) framework. Instead of explicitly requiring a user to provide a knowledge model, CoExBO employs preference learning to seamlessly integrate human insights into the optimization, resulting in algorithmic suggestions that resonate with user preference. CoExBO explains its candidate selection every iteration to foster trust, empowering users with a clearer grasp of the optimization. Furthermore, CoExBO offers a no-harm guarantee, allowing users to make mistakes; even with extreme adversarial interventions, the algorithm converges asymptotically to a vanilla Bayesian optimization. We validate CoExBO's efficacy through human-AI teaming experiments in lithium-ion battery design, highlighting substantial improvements over conventional methods.

Variance of ML-based software fault predictors: are we really improving fault prediction?

  • paper_url: http://arxiv.org/abs/2310.17264
  • repo_url: https://github.com/plubplub1/bountyfarm
  • paper_authors: Xhulja Shahini, Domenic Bubel, Andreas Metzger
  • for: Studies the variance of machine-learning-based software fault prediction models and its implications for reproducibility and practical use.
  • methods: Experimentally analyzes a state-of-the-art fault prediction approach and traces the variance back to nondeterminism-introducing (NI) factors in the ML training process.
  • results: NI factors can cause considerable variance in prediction accuracy, with a maximum observed variance of 10.10% in per-class accuracy, creating the risk that models which look good in the lab perform poorly in practice; the paper also discusses how to deal with such variance.
    Abstract Software quality assurance activities become increasingly difficult as software systems become more and more complex and continuously grow in size. Moreover, testing becomes even more expensive when dealing with large-scale systems. Thus, to effectively allocate quality assurance resources, researchers have proposed fault prediction (FP) which utilizes machine learning (ML) to predict fault-prone code areas. However, ML algorithms typically make use of stochastic elements to increase the prediction models' generalizability and efficiency of the training process. These stochastic elements, also known as nondeterminism-introducing (NI) factors, lead to variance in the training process and as a result, lead to variance in prediction accuracy and training time. This variance poses a challenge for reproducibility in research. More importantly, while fault prediction models may have shown good performance in the lab (e.g., often-times involving multiple runs and averaging outcomes), high variance of results can pose the risk that these models show low performance when applied in practice. In this work, we experimentally analyze the variance of a state-of-the-art fault prediction approach. Our experimental results indicate that NI factors can indeed cause considerable variance in the fault prediction models' accuracy. We observed a maximum variance of 10.10% in terms of the per-class accuracy metric. We thus, also discuss how to deal with such variance.

fairret: a Framework for Differentiable Fairness Regularization Terms

  • paper_url: http://arxiv.org/abs/2310.17256
  • repo_url: None
  • paper_authors: Maarten Buyl, MaryBeth Defrance, Tijl De Bie
  • for: Proposes a framework of differentiable fairness regularization terms (fairrets) that integrates with automatic differentiation libraries, so fairness can be enforced directly in modern machine learning pipelines.
  • methods: Quantifies bias as modular objectives based on a general, linear-fractional definition of fairness, which makes a wide class of fairrets efficient to compute; a PyTorch implementation is provided (a hedged sketch of one such regularizer follows the abstract).
  • results: Experiments examine the behavior of the fairret gradients and show that fairness can be enforced with minimal loss of predictive power compared to baselines.
    Abstract Current tools for machine learning fairness only admit a limited range of fairness definitions and have seen little integration with automatic differentiation libraries, despite the central role these libraries play in modern machine learning pipelines. We introduce a framework of fairness regularization terms (fairrets) which quantify bias as modular objectives that are easily integrated in automatic differentiation pipelines. By employing a general definition of fairness in terms of linear-fractional statistics, a wide class of fairrets can be computed efficiently. Experiments show the behavior of their gradients and their utility in enforcing fairness with minimal loss of predictive power compared to baselines. Our contribution includes a PyTorch implementation of the fairret framework.
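
The sketch below is not the fairret library API; it only illustrates the general idea of a differentiable fairness regularization term (here a demographic-parity-style penalty on soft predictions) that can be added to a task loss in an autodiff pipeline.

```python
# A demographic-parity-style differentiable fairness penalty (illustrative only).
import torch

def demographic_parity_penalty(logits: torch.Tensor, group: torch.Tensor) -> torch.Tensor:
    """Squared gap between the mean predicted positive rate of two groups.

    logits: [batch] raw model outputs for the positive class
    group:  [batch] binary group membership (0 or 1); assumes both groups appear
    """
    p = torch.sigmoid(logits)
    rate_0 = p[group == 0].mean()
    rate_1 = p[group == 1].mean()
    return (rate_0 - rate_1) ** 2

# Usage inside a training step (illustrative):
# loss = bce_loss(logits, labels) + lam * demographic_parity_penalty(logits, group)
```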

Grokking Beyond Neural Networks: An Empirical Exploration with Model Complexity

  • paper_url: http://arxiv.org/abs/2310.17247
  • repo_url: None
  • paper_authors: Jack Miller, Charles O’Neill, Thang Bui
  • for: Investigates whether grokking is specific to neural networks or a more general phenomenon.
  • methods: Studies grokking in neural networks, Gaussian process (GP) classification, GP regression and linear regression, and uncovers a mechanism for inducing grokking on algorithmic datasets by adding dimensions containing spurious information.
  • results: Grokking also occurs in the non-neural settings, showing it is not specific to SGD or weight-norm regularization; based on this and trends in the training trajectories of a Bayesian neural network and a GP regression model, the paper hypothesizes that the phenomenon is governed by the accessibility of certain regions in the error and complexity landscapes.
    Abstract In some settings neural networks exhibit a phenomenon known as grokking, where they achieve perfect or near-perfect accuracy on the validation set long after the same performance has been achieved on the training set. In this paper, we discover that grokking is not limited to neural networks but occurs in other settings such as Gaussian process (GP) classification, GP regression and linear regression. We also uncover a mechanism by which to induce grokking on algorithmic datasets via the addition of dimensions containing spurious information. The presence of the phenomenon in non-neural architectures provides evidence that grokking is not specific to SGD or weight norm regularisation. Instead, grokking may be possible in any setting where solution search is guided by complexity and error. Based on this insight and further trends we see in the training trajectories of a Bayesian neural network (BNN) and GP regression model, we make progress towards a more general theory of grokking. Specifically, we hypothesise that the phenomenon is governed by the accessibility of certain regions in the error and complexity landscapes.

miditok: A Python package for MIDI file tokenization

  • paper_url: http://arxiv.org/abs/2310.17202
  • repo_url: https://github.com/Natooz/MidiTok
  • paper_authors: Nathan Fradet, Jean-Pierre Briot, Fabien Chhel, Amal El Fallah Seghrouchni, Nicolas Gutowski
  • for: Addresses symbolic music processing with natural language processing techniques, for tasks such as music generation, modeling or transcription, where language models achieve state-of-the-art performance.
  • methods: Language models such as Transformers rely on tokenizers that serialize music into sequences of distinct elements called tokens; the paper presents MidiTok, an open-source library for tokenizing symbolic music with great flexibility and extended features.
  • results: MidiTok supports the most popular music tokenizations under a unified API and is designed to be easy to use and extensible.
    Abstract Recent progress in natural language processing has been adapted to the symbolic music modality. Language models, such as Transformers, have been used with symbolic music for a variety of tasks among which music generation, modeling or transcription, with state-of-the-art performances. These models are beginning to be used in production products. To encode and decode music for the backbone model, they need to rely on tokenizers, whose role is to serialize music into sequences of distinct elements called tokens. MidiTok is an open-source library allowing to tokenize symbolic music with great flexibility and extended features. It features the most popular music tokenizations, under a unified API. It is made to be easily used and extensible for everyone.

Adaptive importance sampling for Deep Ritz

  • paper_url: http://arxiv.org/abs/2310.17185
  • repo_url: None
  • paper_authors: Xiaoliang Wan, Tao Zhou, Yuancheng Zhou
  • for: Solves partial differential equations (PDEs) with the Deep Ritz method using adaptive importance sampling.
  • methods: Uses two deep neural networks: one approximates the PDE solution by minimizing a variational loss discretized at collocation points, while a deep generative model (bounded KRnet) approximates the integrand of that loss, treated as an unnormalized probability density, and generates new collocation points whose PDF values allow the loss to be estimated more accurately by importance sampling.
  • results: Compared with the original Deep Ritz method, the adaptive approach improves accuracy, especially for problems with low regularity and high dimensionality, as demonstrated by a series of numerical experiments.
    Abstract We introduce an adaptive sampling method for the Deep Ritz method aimed at solving partial differential equations (PDEs). Two deep neural networks are used. One network is employed to approximate the solution of PDEs, while the other one is a deep generative model used to generate new collocation points to refine the training set. The adaptive sampling procedure consists of two main steps. The first step is solving the PDEs using the Deep Ritz method by minimizing an associated variational loss discretized by the collocation points in the training set. The second step involves generating a new training set, which is then used in subsequent computations to further improve the accuracy of the current approximate solution. We treat the integrand in the variational loss as an unnormalized probability density function (PDF) and approximate it using a deep generative model called bounded KRnet. The new samples and their associated PDF values are obtained from the bounded KRnet. With these new samples and their associated PDF values, the variational loss can be approximated more accurately by importance sampling. Compared to the original Deep Ritz method, the proposed adaptive method improves accuracy, especially for problems characterized by low regularity and high dimensionality. We demonstrate the effectiveness of our new method through a series of numerical experiments.

DSAC-C: Constrained Maximum Entropy for Robust Discrete Soft-Actor Critic

  • paper_url: http://arxiv.org/abs/2310.17173
  • repo_url: None
  • paper_authors: Dexter Neo, Tsuhan Chen
  • for: Aims to improve the performance and robustness of discrete Soft Actor-Critic (SAC) algorithms.
  • methods: Based on the Maximum Entropy Principle, augments discrete SAC with additional statistical constraints derived from a surrogate critic policy.
  • results: Theoretical analysis and experiments on in-distribution and out-of-distribution variants of Atari 2600 games in low-data regimes suggest the constraints provide added robustness against potential domain shifts, which is essential for safe real-world deployment of reinforcement learning agents.
    Abstract We present a novel extension to the family of Soft Actor-Critic (SAC) algorithms. We argue that based on the Maximum Entropy Principle, discrete SAC can be further improved via additional statistical constraints derived from a surrogate critic policy. Furthermore, our findings suggests that these constraints provide an added robustness against potential domain shifts, which are essential for safe deployment of reinforcement learning agents in the real-world. We provide theoretical analysis and show empirical results on low data regimes for both in-distribution and out-of-distribution variants of Atari 2600 games.

Learning an Inventory Control Policy with General Inventory Arrival Dynamics

  • paper_url: http://arxiv.org/abs/2310.17168
  • repo_url: None
  • paper_authors: Sohrab Andaz, Carson Eisenach, Dhruv Madeka, Kari Torkkola, Randy Jia, Dean Foster, Sham Kakade
  • for: Addresses learning and backtesting inventory control policies under general arrival dynamics.
  • methods: Models order arrivals with a deep generative quantity-over-time model (Gen-QOT), allows order quantities to be modified in post-processing to meet vendor constraints such as order minimums and batch sizes, and formulates the periodic review inventory control problem as an exogenous decision process so that results from Madeka et al. (2022) reduce it to supervised learning via history replay.
  • results: Simulation studies show statistically significant improvements in profitability over production baselines, and data from an ongoing real-world A/B test shows that Gen-QOT generalizes well to off-policy data.
    Abstract In this paper we address the problem of learning and backtesting inventory control policies in the presence of general arrival dynamics -- which we term as a quantity-over-time arrivals model (QOT). We also allow for order quantities to be modified as a post-processing step to meet vendor constraints such as order minimum and batch size constraints -- a common practice in real supply chains. To the best of our knowledge this is the first work to handle either arbitrary arrival dynamics or an arbitrary downstream post-processing of order quantities. Building upon recent work (Madeka et al., 2022) we similarly formulate the periodic review inventory control problem as an exogenous decision process, where most of the state is outside the control of the agent. Madeka et al. (2022) show how to construct a simulator that replays historic data to solve this class of problem. In our case, we incorporate a deep generative model for the arrivals process as part of the history replay. By formulating the problem as an exogenous decision process, we can apply results from Madeka et al. (2022) to obtain a reduction to supervised learning. Finally, we show via simulation studies that this approach yields statistically significant improvements in profitability over production baselines. Using data from an ongoing real-world A/B test, we show that Gen-QOT generalizes well to off-policy data.

MaxEnt Loss: Constrained Maximum Entropy for Calibration under Out-of-Distribution Shift

  • paper_url: http://arxiv.org/abs/2310.17159
  • repo_url: None
  • paper_authors: Dexter Neo, Stefan Winkler, Tsuhan Chen
  • for: Addresses the calibration of models under out-of-distribution (OOD) shift.
  • methods: Based on the Principle of Maximum Entropy, incorporates helpful statistical constraints observed during training into the loss function, improving calibration without sacrificing accuracy (a generic, hedged illustration of entropy-based calibration regularization follows the abstract).
  • results: Theoretical analysis and experiments show that the method works well in practice, achieving state-of-the-art calibration on both synthetic and real-world benchmarks.
    Abstract We present a new loss function that addresses the out-of-distribution (OOD) calibration problem. While many objective functions have been proposed to effectively calibrate models in-distribution, our findings show that they do not always fare well OOD. Based on the Principle of Maximum Entropy, we incorporate helpful statistical constraints observed during training, delivering better model calibration without sacrificing accuracy. We provide theoretical analysis and show empirically that our method works well in practice, achieving state-of-the-art calibration on both synthetic and real-world benchmarks.
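
The abstract does not give the exact form of the constraints, so the following is only a generic, hedged illustration of entropy-based calibration regularization, not the paper's MaxEnt loss: cross-entropy plus a term that discourages overconfident predictive distributions.

```python
# Generic entropy-regularized classification loss (illustrative assumption only).
import torch
import torch.nn.functional as F

def entropy_regularized_loss(logits, targets, lam=0.1):
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    return ce - lam * entropy   # subtracting entropy encourages less overconfident outputs
```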

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

  • paper_url: http://arxiv.org/abs/2310.17157
  • repo_url: https://github.com/fminference/dejavu
  • paper_authors: Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen
  • for: Reduces the inference-time cost of large language models (LLMs).
  • methods: Exploits contextual sparsity, i.e. small, input-dependent sets of attention heads and MLP parameters that yield approximately the same output as the dense model; DejaVu predicts this sparsity on the fly with a low-cost algorithm and exploits it with an asynchronous, hardware-aware implementation, without sacrificing quality or in-context learning ability (a hedged sketch of the idea follows the abstract).
  • results: DejaVu reduces the inference latency of OPT-175B by over 2x compared to the state-of-the-art FasterTransformer and over 6x compared to the widely used Hugging Face implementation, without compromising model quality.
    Abstract Large language models (LLMs) with hundreds of billions of parameters have sparked a new wave of exciting AI applications. However, they are computationally expensive at inference time. Sparsity is a natural approach to reduce this cost, but existing methods either require costly retraining, have to forgo LLM's in-context learning ability, or do not yield wall-clock time speedup on modern hardware. We hypothesize that contextual sparsity, which are small, input-dependent sets of attention heads and MLP parameters that yield approximately the same output as the dense model for a given input, can address these issues. We show that contextual sparsity exists, that it can be accurately predicted, and that we can exploit it to speed up LLM inference in wall-clock time without compromising LLM's quality or in-context learning ability. Based on these insights, we propose DejaVu, a system that uses a low-cost algorithm to predict contextual sparsity on the fly given inputs to each layer, along with an asynchronous and hardware-aware implementation that speeds up LLM inference. We validate that DejaVu can reduce the inference latency of OPT-175B by over 2X compared to the state-of-the-art FasterTransformer, and over 6X compared to the widely used Hugging Face implementation, without compromising model quality. The code is available at https://github.com/FMInference/DejaVu.
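
A hedged sketch of the contextual-sparsity idea (not the DejaVu system): a cheap predictor scores MLP neurons for the current input and only the top-k are kept. Shapes, names and the masking shortcut are assumptions; a real implementation would gather only the selected parameters in order to actually save compute.

```python
# Input-dependent ("contextual") sparsity in an MLP block, shown with a mask for clarity.
import torch
import torch.nn as nn

class SparseMLP(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, k=512):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)
        self.predictor = nn.Linear(d_model, d_ff)  # cheap neuron-importance predictor
        self.k = k

    def forward(self, x):                          # x: [batch, d_model]
        scores = self.predictor(x)                  # [batch, d_ff]
        idx = scores.topk(self.k, dim=-1).indices   # neurons predicted useful for this input
        h = torch.relu(self.up(x))                  # dense here for clarity; a real system
        mask = torch.zeros_like(h).scatter_(1, idx, 1.0)  # would compute only selected rows
        return self.down(h * mask)                  # unselected neurons contribute nothing
```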

Spatio-Temporal Meta Contrastive Learning

  • paper_url: http://arxiv.org/abs/2310.17678
  • repo_url: https://github.com/hkuds/cl4st
  • paper_authors: Jiabin Tang, Lianghao Xia, Jie Hu, Chao Huang
  • for: Aims to improve public transportation and safety management through spatio-temporal prediction of traffic and crime.
  • methods: Proposes CL4ST, a spatio-temporal contrastive learning framework with a meta view generator that automatically constructs node and edge augmentation views for each disentangled spatial and temporal graph in a data-driven manner, combined with a unified spatio-temporal graph attention network and a two-branch graph contrastive learning paradigm.
  • results: Extensive experiments show that CL4ST significantly outperforms state-of-the-art baselines on traffic and crime prediction.
    Abstract Spatio-temporal prediction is crucial in numerous real-world applications, including traffic forecasting and crime prediction, which aim to improve public transportation and safety management. Many state-of-the-art models demonstrate the strong capability of spatio-temporal graph neural networks (STGNN) to capture complex spatio-temporal correlations. However, despite their effectiveness, existing approaches do not adequately address several key challenges. Data quality issues, such as data scarcity and sparsity, lead to data noise and a lack of supervised signals, which significantly limit the performance of STGNN. Although recent STGNN models with contrastive learning aim to address these challenges, most of them use pre-defined augmentation strategies that heavily depend on manual design and cannot be customized for different Spatio-Temporal Graph (STG) scenarios. To tackle these challenges, we propose a new spatio-temporal contrastive learning (CL4ST) framework to encode robust and generalizable STG representations via the STG augmentation paradigm. Specifically, we design the meta view generator to automatically construct node and edge augmentation views for each disentangled spatial and temporal graph in a data-driven manner. The meta view generator employs meta networks with parameterized generative model to customize the augmentations for each input. This personalizes the augmentation strategies for every STG and endows the learning framework with spatio-temporal-aware information. Additionally, we integrate a unified spatio-temporal graph attention network with the proposed meta view generator and two-branch graph contrastive learning paradigms. Extensive experiments demonstrate that our CL4ST significantly improves performance over various state-of-the-art baselines in traffic and crime prediction.

Hierarchical Semi-Implicit Variational Inference with Application to Diffusion Model Acceleration

  • paper_url: http://arxiv.org/abs/2310.17153
  • repo_url: https://github.com/longinyu/hsivi
  • paper_authors: Longlin Yu, Tianyu Xie, Yu Zhu, Tong Yang, Xiangyu Zhang, Cheng Zhang
  • for: Generalizes semi-implicit variational inference (SIVI) by defining semi-implicit distributions hierarchically, allowing more expressive multi-layer constructions.
  • methods: Introduces auxiliary distributions that interpolate between a simple base distribution and the target distribution, so that the conditional layers can be trained by progressively matching these auxiliary distributions one layer after another; with pre-trained score networks, the method can also accelerate sampling from diffusion models via a score matching objective.
  • results: HSIVI significantly enhances the expressiveness of SIVI on several Bayesian inference problems with complicated target distributions, and for diffusion model acceleration produces samples comparable to or better than existing fast samplers using a small number of function evaluations.
    Abstract Semi-implicit variational inference (SIVI) has been introduced to expand the analytical variational families by defining expressive semi-implicit distributions in a hierarchical manner. However, the single-layer architecture commonly used in current SIVI methods can be insufficient when the target posterior has complicated structures. In this paper, we propose hierarchical semi-implicit variational inference, called HSIVI, which generalizes SIVI to allow more expressive multi-layer construction of semi-implicit distributions. By introducing auxiliary distributions that interpolate between a simple base distribution and the target distribution, the conditional layers can be trained by progressively matching these auxiliary distributions one layer after another. Moreover, given pre-trained score networks, HSIVI can be used to accelerate the sampling process of diffusion models with the score matching objective. We show that HSIVI significantly enhances the expressiveness of SIVI on several Bayesian inference problems with complicated target distributions. When used for diffusion model acceleration, we show that HSIVI can produce high quality samples comparable to or better than the existing fast diffusion model based samplers with a small number of function evaluations on various datasets.

Understanding and Addressing the Pitfalls of Bisimulation-based Representations in Offline Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2310.17139
  • repo_url: https://github.com/zanghyu/offline_bisimulation
  • paper_authors: Hongyu Zang, Xin Li, Leiji Zhang, Yang Liu, Baigui Sun, Riashat Islam, Remi Tachet des Combes, Romain Laroche
  • for: Explains why bisimulation-based representation learning succeeds in online RL but underperforms in offline RL tasks, and shows how to fix it.
  • methods: Analyzes the effect of missing transitions in offline datasets on the bisimulation principle, and proposes applying the expectile operator for representation learning together with an appropriate reward scaling strategy (a hedged sketch of the expectile loss follows the abstract).
  • results: Missing transitions are particularly harmful and lead to ineffective estimation, while reward scaling bounds the scale of bisimulation measurements and the induced value error and helps prevent feature collapse; applying the recommendations to MICo and SimSR yields performance gains on the D4RL and Visual D4RL benchmark suites.
    Abstract While bisimulation-based approaches hold promise for learning robust state representations for Reinforcement Learning (RL) tasks, their efficacy in offline RL tasks has not been up to par. In some instances, their performance has even significantly underperformed alternative methods. We aim to understand why bisimulation methods succeed in online settings, but falter in offline tasks. Our analysis reveals that missing transitions in the dataset are particularly harmful to the bisimulation principle, leading to ineffective estimation. We also shed light on the critical role of reward scaling in bounding the scale of bisimulation measurements and of the value error they induce. Based on these findings, we propose to apply the expectile operator for representation learning to our offline RL setting, which helps to prevent overfitting to incomplete data. Meanwhile, by introducing an appropriate reward scaling strategy, we avoid the risk of feature collapse in representation space. We implement these recommendations on two state-of-the-art bisimulation-based algorithms, MICo and SimSR, and demonstrate performance gains on two benchmark suites: D4RL and Visual D4RL. Codes are provided at \url{https://github.com/zanghyu/Offline_Bisimulation}.
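
A minimal sketch of the expectile operator used as an asymmetric regression loss; how it is wired into MICo or SimSR is not shown here, and the function name is an assumption.

```python
# Expectile regression loss: tau = 0.5 recovers the ordinary squared error,
# tau > 0.5 penalizes under-estimation (target above prediction) more heavily.
import torch

def expectile_loss(pred: torch.Tensor, target: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    diff = target - pred
    weight = torch.abs(tau - (diff < 0).float())   # tau if diff >= 0, (1 - tau) otherwise
    return (weight * diff.pow(2)).mean()
```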

Large-Scale Gaussian Processes via Alternating Projection

  • paper_url: http://arxiv.org/abs/2310.17137
  • repo_url: None
  • paper_authors: Kaiwen Wu, Jonathan Wenger, Haydn Jones, Geoff Pleiss, Jacob R. Gardner
  • for: This paper aims to improve the efficiency of Gaussian process (GP) hyperparameter optimization for large datasets.
  • methods: The proposed method uses an iterative approach that only accesses subblocks of the kernel matrix, enabling mini-batching and reducing the per-iteration time and space complexity to $\mathcal{O}(n)$ (a generic block-wise sketch follows the abstract).
  • results: The proposed method accelerates training by a factor of 2 to 27 compared to conjugate gradients (CG) on large-scale benchmark datasets with up to four million datapoints, and enjoys linear convergence and robustness to ill-conditioning.
    Abstract Gaussian process (GP) hyperparameter optimization requires repeatedly solving linear systems with $n \times n$ kernel matrices. To address the prohibitive $\mathcal{O}(n^3)$ time complexity, recent work has employed fast iterative numerical methods, like conjugate gradients (CG). However, as datasets increase in magnitude, the corresponding kernel matrices become increasingly ill-conditioned and still require $\mathcal{O}(n^2)$ space without partitioning. Thus, while CG increases the size of datasets GPs can be trained on, modern datasets reach scales beyond its applicability. In this work, we propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling \emph{mini-batching}. Our algorithm, based on alternating projection, has $\mathcal{O}(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets. Theoretically, we prove our method enjoys linear convergence and empirically we demonstrate its robustness to ill-conditioning. On large-scale benchmark datasets up to four million datapoints our approach accelerates training by a factor of 2$\times$ to 27$\times$ compared to CG.
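
As a generic, hedged illustration (not the paper's exact alternating-projection algorithm) of solving the kernel linear system while only touching subblocks of the kernel matrix, the following block Gauss-Seidel sketch updates one block of the solution at a time.

```python
# Solve K x = b by cycling over blocks, touching only block-rows and block-diagonals of K.
import numpy as np

def block_solve(K: np.ndarray, b: np.ndarray, block_size: int = 256, n_epochs: int = 20) -> np.ndarray:
    n = len(b)
    x = np.zeros(n)
    for _ in range(n_epochs):
        for start in range(0, n, block_size):
            idx = slice(start, min(start + block_size, n))
            residual = b[idx] - K[idx, :] @ x               # uses one block-row of K
            x[idx] += np.linalg.solve(K[idx, idx], residual)  # small block-diagonal solve
    return x
```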

On the Convergence of CART under Sufficient Impurity Decrease Condition

  • paper_url: http://arxiv.org/abs/2310.17114
  • repo_url: None
  • paper_authors: Rahul Mazumder, Haoyue Wang
  • for: Studies the convergence rate of CART in a regression setting.
  • methods: Establishes an upper bound on the prediction error of CART under a sufficient impurity decrease (SID) condition, and introduces easily verifiable sufficient conditions for SID, including additive models whose component functions satisfy a "locally reverse Poincaré inequality".
  • results: The bound improves upon the known result of Chi et al. (2022) under a similar assumption, and examples show it cannot be improved by more than a constant or a logarithmic factor; several well-known function classes from non-parametric estimation illustrate the practical utility of the concept.
    Abstract The decision tree is a flexible machine learning model that finds its success in numerous applications. It is usually fitted in a recursively greedy manner using CART. In this paper, we investigate the convergence rate of CART under a regression setting. First, we establish an upper bound on the prediction error of CART under a sufficient impurity decrease (SID) condition (Chi et al., 2022) -- our result improves upon the known result by Chi et al. (2022) under a similar assumption. Furthermore, we provide examples that demonstrate the error bound cannot be further improved by more than a constant or a logarithmic factor. Second, we introduce a set of easily verifiable sufficient conditions for the SID condition. Specifically, we demonstrate that the SID condition can be satisfied in the case of an additive model, provided that the component functions adhere to a "locally reverse Poincaré inequality". We discuss several well-known function classes in non-parametric estimation to illustrate the practical utility of this concept.

LLM4DyG: Can Large Language Models Solve Problems on Dynamic Graphs?

  • paper_url: http://arxiv.org/abs/2310.17110
  • repo_url: None
  • paper_authors: Zeyang Zhang, Xin Wang, Ziwei Zhang, Haoyang Li, Yijian Qin, Simin Wu, Wenwu Zhu
  • for: Evaluates whether large language models (LLMs) can understand spatial-temporal information on dynamic graphs, a capability essential for their adoption in web applications.
  • methods: Proposes the LLM4DyG benchmark with nine specially designed tasks covering both temporal and spatial dimensions, runs extensive experiments on the effects of data generators, data statistics, prompting techniques, and LLMs on performance, and introduces Disentangled Spatial-Temporal Thoughts (DST2) prompting.
  • results: LLMs show preliminary spatial-temporal understanding on dynamic graphs; tasks become harder as graph size and density grow, while being insensitive to time span and data generation mechanism; and DST2 improves LLMs' spatial-temporal understanding on most tasks.
    Abstract In an era marked by the increasing adoption of Large Language Models (LLMs) for various tasks, there is a growing focus on exploring LLMs' capabilities in handling web data, particularly graph data. Dynamic graphs, which capture temporal network evolution patterns, are ubiquitous in real-world web data. Evaluating LLMs' competence in understanding spatial-temporal information on dynamic graphs is essential for their adoption in web applications, which remains unexplored in the literature. In this paper, we bridge the gap via proposing to evaluate LLMs' spatial-temporal understanding abilities on dynamic graphs, to the best of our knowledge, for the first time. Specifically, we propose the LLM4DyG benchmark, which includes nine specially designed tasks considering the capability evaluation of LLMs from both temporal and spatial dimensions. Then, we conduct extensive experiments to analyze the impacts of different data generators, data statistics, prompting techniques, and LLMs on the model performance. Finally, we propose Disentangled Spatial-Temporal Thoughts (DST2) for LLMs on dynamic graphs to enhance LLMs' spatial-temporal understanding abilities. Our main observations are: 1) LLMs have preliminary spatial-temporal understanding abilities on dynamic graphs, 2) Dynamic graph tasks show increasing difficulties for LLMs as the graph size and density increase, while not sensitive to the time span and data generation mechanism, 3) the proposed DST2 prompting method can help to improve LLMs' spatial-temporal understanding abilities on dynamic graphs for most tasks. The data and codes will be open-sourced at publication time.

MIM-GAN-based Anomaly Detection for Multivariate Time Series Data

  • paper_url: http://arxiv.org/abs/2310.18257
  • repo_url: https://github.com/explorerlu1024/mimad-gan
  • paper_authors: Shan Lu, Zhicheng Dong, Donghong Cai, Fang Fang, Dongcai Zhao
  • for: Proposes a GAN-based anomaly detection algorithm (MIM-GAN) for multivariate time series data.
  • methods: Divides the time series into subsequences with a sliding window, uses an LSTM-based generator and discriminator to capture temporal correlations, introduces an exponential (message importance) information measure into the GAN loss to avoid local optima and model collapse, and scores anomalies with a discriminant reconstruction score combining discrimination and reconstruction losses (a hedged sketch of the windowing and scoring follows the abstract).
  • results: The global optimum of the loss is derived and model collapse is shown to be avoided; experiments show superior precision, recall and F1 score compared with baseline methods.
    Abstract The loss function of Generative adversarial network(GAN) is an important factor that affects the quality and diversity of the generated samples for anomaly detection. In this paper, we propose an unsupervised multiple time series anomaly detection algorithm based on the GAN with message importance measure(MIM-GAN). In particular, the time series data is divided into subsequences using a sliding window. Then a generator and a discriminator designed based on the Long Short-Term Memory (LSTM) are employed to capture the temporal correlations of the time series data. To avoid the local optimal solution of loss function and the model collapse, we introduce an exponential information measure into the loss function of GAN. Additionally, a discriminant reconstruction score consisting on discrimination and reconstruction loss is taken into account. The global optimal solution for the loss function is derived and the model collapse is proved to be avoided in our proposed MIM-GAN-based anomaly detection algorithm. Experimental results show that the proposed MIM-GAN-based anomaly detection algorithm has superior performance in terms of precision, recall, and F1 score.
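
A hedged sketch (not the MIM-GAN implementation) of two ingredients described above: slicing a multivariate series into sliding-window subsequences, and a discriminant reconstruction score that mixes reconstruction error with the discriminator's belief. The weight `lam` and the function names are assumptions.

```python
# Sliding-window preparation and a combined discrimination/reconstruction anomaly score.
import numpy as np

def sliding_windows(series: np.ndarray, window: int, stride: int = 1) -> np.ndarray:
    """series: [T, n_features] -> windows: [n_windows, window, n_features]"""
    return np.stack([series[i:i + window]
                     for i in range(0, len(series) - window + 1, stride)])

def anomaly_score(window, reconstruction, disc_real_prob, lam=0.5):
    """Higher score = more anomalous: poor reconstruction and low discriminator belief."""
    rec_loss = np.mean((window - reconstruction) ** 2)
    disc_loss = -np.log(max(disc_real_prob, 1e-12))   # large when the window looks fake
    return lam * rec_loss + (1.0 - lam) * disc_loss
```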

Network Design through Graph Neural Networks: Identifying Challenges and Improving Performance

  • paper_url: http://arxiv.org/abs/2310.17100
  • repo_url: None
  • paper_authors: Donald Loveland, Rajmonda Caceres
  • for: 研究 Graph Neural Network (GNN) 的修改策略,以提高网络设计。
  • methods: 分析 previous works 中的 gradient 计算,揭示影响 editing 的因素,并提出一种 iterative editing 方法(ORE),可以更好地避免基于结构性质的错误编辑。
  • results: 通过一系列设计任务(各配有外部验证方法)对 ORE 进行实证研究,结果表明 ORE 相比已有方法最多可提升 50% 的编辑效果。
    Abstract Graph Neural Network (GNN) research has produced strategies to modify a graph's edges using gradients from a trained GNN, with the goal of network design. However, the factors which govern gradient-based editing are understudied, obscuring why edges are chosen and if edits are grounded in an edge's importance. Thus, we begin by analyzing the gradient computation in previous works, elucidating the factors that influence edits and highlighting the potential over-reliance on structural properties. Specifically, we find that edges can achieve high gradients due to structural biases, rather than importance, leading to erroneous edits when the factors are unrelated to the design task. To improve editing, we propose ORE, an iterative editing method that (a) edits the highest scoring edges and (b) re-embeds the edited graph to refresh gradients, leading to less biased edge choices. We empirically study ORE through a set of proposed design tasks, each with an external validation method, demonstrating that ORE improves upon previous methods by up to 50%.
    摘要 图神经网络(GNN)研究已提出利用已训练 GNN 的梯度来修改图中边、以实现网络设计的策略。然而,支配这种基于梯度的编辑的因素仍缺乏研究,使人难以解释为何选中某些边,以及编辑是否真正基于边的重要性。因此,我们首先分析已有工作中的梯度计算,阐明影响编辑的因素,并指出其可能过度依赖结构性质:边可能因结构偏置而非重要性获得较高梯度,当这些因素与设计任务无关时便会产生错误编辑。为改进编辑,我们提出了一种迭代编辑方法 ORE,它 (a) 编辑得分最高的边,并 (b) 对编辑后的图重新嵌入以刷新梯度,从而减少有偏的边选择。我们通过一组设计任务(各配有外部验证方法)对 ORE 进行了实证研究,结果表明 ORE 相比已有方法最多提升 50%。
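
A rough sketch of the iterative edit-then-re-embed loop described above; the `graph.edge_weight` attribute, the flip-style edit, and the `re_embed` call are placeholders for whatever edge parameterization and GNN interface a concrete implementation uses, so treat this as pseudocode in PyTorch syntax rather than the paper's code:

```python
import torch

def ore_style_editing(model, graph, rounds=10, k_per_round=1):
    for _ in range(rounds):
        graph.edge_weight.requires_grad_(True)
        objective = model(graph)                      # scalar design objective from the trained GNN
        objective.backward()
        scores = graph.edge_weight.grad.abs()         # gradient magnitude as the edit score
        top = torch.topk(scores, k_per_round).indices
        with torch.no_grad():
            graph.edge_weight[top] = 1.0 - graph.edge_weight[top]  # toggle the chosen edges
        graph.edge_weight.grad = None
        graph = model.re_embed(graph)                 # refresh embeddings so the next gradients are less biased
    return graph
```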

Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult

  • paper_url: http://arxiv.org/abs/2310.17087
  • repo_url: None
  • paper_authors: Yuqing Wang, Zhenghao Xu, Tuo Zhao, Molei Tao
  • for: This paper aims to understand the implicit biases that arise when using large learning rates for nonconvex optimization, and to develop a new global convergence theory for this setting.
  • methods: The paper uses a combination of theoretical and experimental techniques to study the behavior of large learning rate gradient descent for nonconvex optimization. The authors develop a new global convergence theory for this setting, and validate their results with experiments on neural networks.
  • results: The paper shows that large learning rates can lead to various implicit biases, including the edge of stability, balancing, and catapult phenomena. The authors also establish that these biases are a result of the combination of a provable preference of large learning rate gradient descent for moving toward flatter regions, and the good regularity of the objective function. Additionally, the paper provides the first non-asymptotic convergence rate bound for large-learning-rate gradient descent optimization of nonconvex functions.
    Abstract Large learning rates, when applied to gradient descent for nonconvex optimization, yield various implicit biases including the edge of stability (Cohen et al., 2021), balancing (Wang et al., 2022), and catapult (Lewkowycz et al., 2020). These phenomena cannot be well explained by classical optimization theory. Though significant theoretical progress has been made in understanding these implicit biases, it remains unclear for which objective functions would they occur. This paper provides an initial step in answering this question, namely that these implicit biases are in fact various tips of the same iceberg. They occur when the objective function of optimization has some good regularity, which, in combination with a provable preference of large learning rate gradient descent for moving toward flatter regions, results in these nontrivial dynamical phenomena. To establish this result, we develop a new global convergence theory under large learning rates, for a family of nonconvex functions without globally Lipschitz continuous gradient, which was typically assumed in existing convergence analysis. A byproduct is the first non-asymptotic convergence rate bound for large-learning-rate gradient descent optimization of nonconvex functions. We also validate our theory with experiments on neural networks, where different losses, activation functions, and batch normalization all can significantly affect regularity and lead to very different training dynamics.
    摘要 将大学习率用于非凸优化中的梯度下降时,会产生多种隐式偏置,包括稳定性边缘(Cohen et al., 2021)、平衡(Wang et al., 2022)和弹射(catapult,Lewkowycz et al., 2020)。这些现象无法用经典优化理论很好地解释。尽管在理解这些隐式偏置方面已取得重要的理论进展,但它们究竟会在哪些目标函数上出现仍不清楚。本文为回答这一问题迈出了第一步:这些隐式偏置实际上是同一座冰山的不同一角。当优化的目标函数具有某种良好的正则性时,这种正则性与可证明的"大学习率梯度下降偏好走向更平坦区域"的性质相结合,便会产生这些非平凡的动力学现象。为建立这一结论,我们针对一族不具有全局 Lipschitz 连续梯度(这一条件在已有收敛分析中通常被假设)的非凸函数,发展了大学习率下新的全局收敛理论;其副产品是大学习率梯度下降优化非凸函数的首个非渐近收敛速率界。我们还在神经网络上用实验验证了该理论:不同的损失函数、激活函数以及批归一化都会显著影响正则性,并导致截然不同的训练动力学。
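
The stability threshold underlying these large-learning-rate phenomena is easy to see on a one-dimensional quadratic, where gradient descent contracts only when the step size stays below 2 divided by the curvature; the following toy demo (not the paper's nonconvex setting) simply prints how the iterates behave below and above that threshold:

```python
# Gradient descent on f(x) = 0.5 * L * x^2, whose curvature (sharpness) is L.
# The update x <- (1 - eta * L) * x contracts only when eta < 2 / L; near and above
# that threshold the iterates oscillate or diverge instead of converging smoothly.
L_curv = 10.0                       # so the classical stability threshold is 2 / L = 0.2
for eta in (0.05, 0.19, 0.21):      # well below, just below, just above the threshold
    x = 1.0
    for _ in range(50):
        x -= eta * L_curv * x
    print(f"eta = {eta:.2f} -> |x_50| = {abs(x):.3e}")
```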

Benign Oscillation of Stochastic Gradient Descent with Large Learning Rates

  • paper_url: http://arxiv.org/abs/2310.17074
  • repo_url: None
  • paper_authors: Miao Lu, Beining Wu, Xiaodong Yang, Difan Zou
  • for: 这个研究旨在考察神经网络(NN)在使用大学习率 SGD 算法训练时的泛化性能。
  • methods: 这个研究使用了大学习率SGD算法来训练NN,并发现在这种训练环境下,NN的振荡可以提高NN的泛化性能。
  • results: 研究发现,使用大学习率SGD算法训练NN可以更好地学习弱特征(weak features),而不是使用小学习率SGD算法训练NN所能学习的强特征(strong features)。这种现象被称为“有利振荡”。
    Abstract In this work, we theoretically investigate the generalization properties of neural networks (NN) trained by stochastic gradient descent (SGD) algorithm with large learning rates. Under such a training regime, our finding is that, the oscillation of the NN weights caused by the large learning rate SGD training turns out to be beneficial to the generalization of the NN, which potentially improves over the same NN trained by SGD with small learning rates that converges more smoothly. In view of this finding, we call such a phenomenon "benign oscillation". Our theory towards demystifying such a phenomenon builds upon the feature learning perspective of deep learning. Specifically, we consider a feature-noise data generation model that consists of (i) weak features which have a small $\ell_2$-norm and appear in each data point; (ii) strong features which have a larger $\ell_2$-norm but only appear in a certain fraction of all data points; and (iii) noise. We prove that NNs trained by oscillating SGD with a large learning rate can effectively learn the weak features in the presence of those strong features. In contrast, NNs trained by SGD with a small learning rate can only learn the strong features but makes little progress in learning the weak features. Consequently, when it comes to the new testing data which consist of only weak features, the NN trained by oscillating SGD with a large learning rate could still make correct predictions consistently, while the NN trained by small learning rate SGD fails. Our theory sheds light on how large learning rate training benefits the generalization of NNs. Experimental results demonstrate our finding on "benign oscillation".
    摘要 在这项工作中,我们从理论上研究了使用大学习率的随机梯度下降(SGD)算法训练的神经网络(NN)的泛化性质。我们发现,在这种训练方式下,大学习率 SGD 引起的网络权重振荡反而有利于网络的泛化,其效果可能优于用小学习率 SGD 训练、收敛更平滑的同一网络。鉴于这一发现,我们将这种现象称为"良性振荡"。我们的理论基于深度学习的特征学习视角,考虑一种特征-噪声数据生成模型,其中包括:(i) $\ell_2$ 范数较小、出现在每个数据点中的弱特征;(ii) $\ell_2$ 范数较大、但仅出现在一部分数据点中的强特征;以及 (iii) 噪声。我们证明,用大学习率振荡 SGD 训练的网络能够在强特征存在的情况下有效地学习弱特征;相比之下,用小学习率 SGD 训练的网络只能学习强特征,在弱特征的学习上几乎没有进展。因此,当新的测试数据仅包含弱特征时,用大学习率振荡 SGD 训练的网络仍能持续做出正确预测,而小学习率 SGD 训练的网络则会失败。我们的理论阐明了大学习率训练如何有利于网络的泛化,实验结果也验证了"良性振荡"这一发现。
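
A small synthetic generator in the spirit of the weak/strong feature-plus-noise model described above can help reproduce the qualitative effect; the particular norms, the fraction of strong-feature points, and the noise level below are arbitrary placeholders rather than the paper's parameters:

```python
import numpy as np

def sample_feature_noise_data(n=1000, d=50, p_strong=0.2, weak_norm=0.5,
                              strong_norm=5.0, noise_std=1.0, seed=0):
    """Every point carries a small-norm 'weak' feature aligned with its label;
    only a fraction p_strong of points also carries a large-norm 'strong' feature."""
    rng = np.random.default_rng(seed)
    weak = np.zeros(d);   weak[0] = weak_norm
    strong = np.zeros(d); strong[1] = strong_norm
    y = rng.choice([-1.0, 1.0], size=n)
    has_strong = rng.random(n) < p_strong
    X = y[:, None] * weak + (y * has_strong)[:, None] * strong
    X += noise_std * rng.normal(size=(n, d))
    return X, y, has_strong

X, y, has_strong = sample_feature_noise_data()
# a "weak-feature-only" test set corresponds to rows where has_strong is False
```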

eess.IV - 2023-10-26

Extended Signaling Methods for Reduced Video Decoder Power Consumption Using Green Metadata

  • paper_url: http://arxiv.org/abs/2310.17346
  • repo_url: None
  • paper_authors: Christian Herglotz, Matthias Kränzler, Xixue Chu, Edouard Francois, Yong He, André Kaup
  • for: 本文研究最新版 MPEG 能效媒体消费标准,即绿色元数据(ISO/IEC 23001-11)中的一个方面:接收端向发送端发送请求,索取解码与处理复杂度更低的视频码流表示,以降低接收端能耗、延长运行时间。
  • methods: 本文总结了最新的研究成果,包括绿色元数据标准的扩展和新的语法元素的引入,以实现更高效的能效媒体消耗。同时,作者还进行了专门的实验来证明这些语法元素的有效性。
  • results: 本文的实验结果显示,通过在绿色元数据中引入新的语法元素,可以实现对软件视频解码和硬件视频解码的动态能耗减少,最高可达90%和80%分别。
    Abstract In this paper, we discuss one aspect of the latest MPEG standard edition on energy-efficient media consumption, also known as Green Metadata (ISO/IEC 23001-11), which is the interactive signaling for remote decoder-power reduction for peer-to-peer video conferencing. In this scenario, the receiver of a video, e.g., a battery-driven portable device, can send a dedicated request to the sender which asks for a video bitstream representation that is less complex to decode and process. Consequently, the receiver saves energy and extends operating times. We provide an overview on latest studies from the literature dealing with energy-saving aspects, which motivate the extension of the legacy Green Metadata standard. Furthermore, we explain the newly introduced syntax elements and verify their effectiveness by performing dedicated experiments. We show that the integration of these syntax elements can lead to dynamic energy savings of up to 90% for software video decoding and 80% for hardware video decoding, respectively.
    摘要 在本文中,我们讨论了最新版 MPEG 能效媒体消费标准(即绿色元数据,ISO/IEC 23001-11)的一个方面:面向点对点视频会议的远程解码器功耗降低交互信令。在这种场景下,视频接收端(例如电池供电的便携设备)可以向发送端发送专门的请求,索取一种解码与处理复杂度更低的视频码流表示,从而节省能耗、延长运行时间。我们概述了文献中有关节能的最新研究,这些研究促成了对传统绿色元数据标准的扩展;此外,我们介绍了新引入的语法元素,并通过专门的实验验证其有效性。结果表明,集成这些语法元素可使软件视频解码的动态能耗节省最高达 90%,硬件视频解码最高达 80%。

eess.SP - 2023-10-26

Novel Models for Multiple Dependent Heteroskedastic Time Series

  • paper_url: http://arxiv.org/abs/2310.17760
  • repo_url: https://github.com/13204942/stat40710
  • paper_authors: Fangyijie Wang, Michael Salter-Townshend
  • for: 这个论文是为了处理具有高波动性的脑区活动数据,以及评估多个依赖关系的fMRI时间序列数据的模型性能。
  • methods: 这个论文提出了一种新的方法来处理高波动性的fMRI数据,包括使用AR和GARCH模型来模型多个依赖关系的时间序列数据。
  • results: 研究发现,当多条相互依赖的 fMRI 时间序列中存在较多突发头部运动时,AR+GARCH 模型能够成功拟合这些数据;此外,研究还发现这些模型能捕捉头部运动在不同脑区之间引起的共同波动聚集现象。
    Abstract Functional magnetic resonance imaging or functional MRI (fMRI) is a very popular tool used for differing brain regions by measuring brain activity. It is affected by physiological noise, such as head and brain movement in the scanner from breathing, heart beats, or the subject fidgeting. The purpose of this paper is to propose a novel approach to handling fMRI data for infants with high volatility caused by sudden head movements. Another purpose is to evaluate the volatility modelling performance of multiple dependent fMRI time series data. The models examined in this paper are AR and GARCH and the modelling performance is evaluated by several statistical performance measures. The conclusions of this paper are that multiple dependent fMRI series data can be fitted with AR + GARCH model if the multiple fMRI data have many sudden head movements. The GARCH model can capture the shared volatility clustering caused by head movements across brain regions. However, the multiple fMRI data without many head movements have fitted AR + GARCH model with different performance. The conclusions are supported by statistical tests and measures. This paper highlights the difference between the proposed approach from traditional approaches when estimating model parameters and modelling conditional variances on multiple dependent time series. In the future, the proposed approach can be applied to other research fields, such as financial economics, and signal processing. Code is available at \url{https://github.com/13204942/STAT40710}.
    摘要 功能性磁共振成像(fMRI)是一种通过测量脑活动来区分不同脑区的常用工具。它会受到生理噪声的影响,例如呼吸、心跳或受试者乱动引起的扫描仪内头部与脑部运动。本文的目的之一是提出一种处理因突发头部运动而波动性很高的婴儿 fMRI 数据的新方法,另一目的是评估多条相互依赖的 fMRI 时间序列数据的波动性建模性能。本文考察的模型为 AR 与 GARCH,并用多种统计性能指标评估其建模效果。结论是:当多条相互依赖的 fMRI 序列中存在较多突发头部运动时,这些数据可以用 AR+GARCH 模型拟合,GARCH 模型能够捕捉头部运动在不同脑区之间引起的共同波动聚集;而头部运动较少的多条 fMRI 数据在拟合 AR+GARCH 模型时表现各异。这些结论得到了统计检验和指标的支持。本文强调了所提出方法与传统方法在估计模型参数和对多条相互依赖时间序列建模条件方差时的差别。未来,该方法还可应用于金融经济学、信号处理等其他研究领域。代码见 https://github.com/13204942/STAT40710 。
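
For readers who want to try the per-series model family discussed here, the `arch` Python package fits an AR mean with a GARCH conditional variance in a few lines; the synthetic series below is only a stand-in for a demeaned fMRI region time series, and the (1,1) orders are illustrative choices rather than the paper's selected specification:

```python
# pip install arch
import numpy as np
from arch import arch_model

rng = np.random.default_rng(0)
y = rng.standard_normal(1000)           # stand-in for one demeaned fMRI time series

# AR(1) conditional mean with GARCH(1,1) conditional variance.
am = arch_model(y, mean="AR", lags=1, vol="GARCH", p=1, q=1, dist="normal")
res = am.fit(disp="off")

print(res.params)                        # AR and GARCH coefficient estimates
cond_vol = res.conditional_volatility    # time-varying volatility, e.g. to compare across brain regions
```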

Adaptive Digital Twin for UAV-Assisted Integrated Sensing, Communication, and Computation Networks

  • paper_url: http://arxiv.org/abs/2310.17470
  • repo_url: None
  • paper_authors: Bin Li, Wenshuai Liu, Wancheng Xie, Ning Zhang, Yan Zhang
  • for: 这个论文研究了一个基于数字双子(DT)的集成感知通信计算网络。用户进行射频感知和计算卸载在同一频段上进行,而无人飞机(UAV)被部署以提供边缘计算服务。
  • methods: 我们首先形式化了一个多目标优化问题,以同时减小多输入多输出(MIMO)雷达的辐射性能和计算卸载能耗。然后,我们利用数字双子(DT)的预测能力提供智能卸载决策,并考虑DT估计偏差。
  • results: 我们的方法能够在感知与计算功能之间取得性能折中,同时相比已有研究降低了计算能耗。
    Abstract In this paper, we study a digital twin (DT)-empowered integrated sensing, communication, and computation network. Specifically, the users perform radar sensing and computation offloading on the same spectrum, while unmanned aerial vehicles (UAVs) are deployed to provide edge computing service. We first formulate a multi-objective optimization problem to minimize the beampattern performance of multi-input multi-output (MIMO) radars and the computation offloading energy consumption simultaneously. Then, we explore the prediction capability of DT to provide intelligent offloading decision, where the DT estimation deviation is considered. To track this challenge, we reformulate the original problem as a multi-agent Markov decision process and design a multi-agent proximal policy optimization (MAPPO) framework to achieve a flexible learning policy. Furthermore, the Beta-policy and attention mechanism are used to improve the training performance. Numerical results show that the proposed method is able to balance the performance tradeoff between sensing and computation functions, while reducing the energy consumption compared with the existing studies.
    摘要 本文研究一种数字孪生(DT)赋能的感知、通信与计算一体化网络。具体而言,用户在同一频谱上进行雷达感知与计算卸载,并部署无人机(UAV)提供边缘计算服务。我们首先构建了一个多目标优化问题,以同时最小化多输入多输出(MIMO)雷达的波束方向图性能指标与计算卸载能耗;随后利用数字孪生的预测能力提供智能卸载决策,并考虑 DT 的估计偏差。为应对这一挑战,我们将原问题重新表述为多智能体马尔可夫决策过程,并设计多智能体近端策略优化(MAPPO)框架以获得灵活的学习策略;同时采用 Beta 策略和注意力机制来提升训练性能。数值结果表明,所提方法能够在感知与计算功能之间取得性能折中,并相比已有研究降低能耗。

Detecting Abrupt Change of Channel Covariance Matrix in IRS-Assisted Communication

  • paper_url: http://arxiv.org/abs/2310.17425
  • repo_url: None
  • paper_authors: Runnan Liu, Liang Liu, Yin Xu, Dazhi He, Wenjun Zhang, Chang Wen Chen
  • for: 本文关注于智能反射表(IRS)助理通信系统中的频道协方差矩阵变化检测。
  • methods: 我们提出了一种强大的检测方法,可以检测IRS助理通信系统中频道协方差矩阵的变化。
  • results: 我们的提议方法通过数值结果验证了其效果。
    Abstract The knowledge of channel covariance matrices is crucial to the design of intelligent reflecting surface (IRS) assisted communication. However, channel covariance matrices may change suddenly in practice. This letter focuses on the detection of the above change in IRS-assisted communication. Specifically, we consider the uplink communication system consisting of a single-antenna user (UE), an IRS, and a multi-antenna base station (BS). We first categorize two types of channel covariance matrix changes based on their impact on system design: Type I change, which denotes the change in the BS receive covariance matrix, and Type II change, which denotes the change in the IRS transmit/receive covariance matrix. Secondly, a powerful method is proposed to detect whether a Type I change occurs, a Type II change occurs, or no change occurs. The effectiveness of our proposed scheme is verified by numerical results.
    摘要 信道协方差矩阵的知识对智能反射面(IRS)辅助通信的设计至关重要。然而在实际中,信道协方差矩阵可能会突然变化。本文关注 IRS 辅助通信中上述变化的检测。具体而言,我们考虑由单天线用户(UE)、一个 IRS 和多天线基站(BS)组成的上行通信系统,并根据信道协方差矩阵变化对系统设计的影响将其分为两类:第一类变化指 BS 接收协方差矩阵的变化,第二类变化指 IRS 发射/接收协方差矩阵的变化。随后,我们提出了一种有效的方法,用于检测发生的是第一类变化、第二类变化还是没有变化。数值结果验证了所提方案的有效性。
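
The paper does not spell out its detector here, but a generic way to flag a change in a channel covariance matrix is to compare sample covariances from a reference window and a new window with a divergence that vanishes when they coincide; the sketch below uses a Stein/KL-type statistic and a user-chosen threshold, purely as an illustration of the detection idea rather than the proposed method:

```python
import numpy as np

def covariance_change_statistic(Y_ref, Y_new):
    """Y_ref, Y_new: complex snapshots of shape (num_antennas, num_samples).
    Returns a nonnegative statistic that is zero when both windows share the
    same covariance and grows as the covariances drift apart."""
    R_ref = Y_ref @ Y_ref.conj().T / Y_ref.shape[1]
    R_new = Y_new @ Y_new.conj().T / Y_new.shape[1]
    M = np.linalg.solve(R_ref, R_new)                     # R_ref^{-1} R_new
    m = R_ref.shape[0]
    return float(np.real(np.trace(M)) - np.log(np.abs(np.linalg.det(M))) - m)

# declare "covariance changed" when the statistic exceeds a threshold calibrated
# under the no-change hypothesis (e.g. from training snapshots)
```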

Energy Efficient Robust Beamforming for Vehicular ISAC with Imperfect Channel Estimation

  • paper_url: http://arxiv.org/abs/2310.17401
  • repo_url: None
  • paper_authors: Hanwen Zhang, Haijian Sun, Tianyi He, Weiming Xiang, Rose Qingyang Hu
  • for: 该论文研究了针对 vehicular integrated sensing and communication (ISAC) 系统中的channel estimation uncertainty的 robust beamforming,以优化系统级能效性 (EE)。
  • methods: 论文首先在有界信道估计误差下构建系统 EE 最大化问题,然后使用分式规划和半正定松弛(SDR)放松秩一约束,最后利用 Schur 补和 S-过程分别将克拉美-罗界(CRB)约束与信道估计误差约束转化为凸形式。
  • results: 研究结果表明,所提算法具有良好的收敛速率,并能有效减轻信道估计误差的影响。
    Abstract This paper investigates robust beamforming for system-centric energy efficiency (EE) optimization in the vehicular integrated sensing and communication (ISAC) system, where the mobility of vehicles poses significant challenges to channel estimation. To obtain the optimal beamforming under channel uncertainty, we first formulate an optimization problem for maximizing the system EE under bounded channel estimation errors. Next, fractional programming and semidefinite relaxation (SDR) are utilized to relax the rank-1 constraints. We further use Schur complement and S-Procedure to transform Cramer-Rao bound (CRB) and channel estimation error constraints into convex forms, respectively. Based on the Lagrangian dual function and Karush-Kuhn-Tucker (KKT) conditions, it is proved that the optimal beamforming solution is rank-1. Finally, we present comprehensive simulation results to demonstrate two key findings: 1) the proposed algorithm exhibits a favorable convergence rate, and 2) the approach effectively mitigates the impact of channel estimation errors.
    摘要 本文研究车联网集成感知与通信(ISAC)系统中面向系统级能效(EE)优化的鲁棒波束成形问题,其中车辆的移动性给信道估计带来显著挑战。为在信道不确定条件下获得最优波束成形,我们首先在有界信道估计误差下建立系统 EE 最大化的优化问题;随后利用分式规划和半正定松弛(SDR)放松秩一约束,并分别利用 Schur 补和 S-过程将克拉美-罗界(CRB)约束与信道估计误差约束转化为凸形式。基于拉格朗日对偶函数和 KKT 条件,我们证明最优波束成形解为秩一解。最后,大量仿真结果表明:1) 所提算法具有良好的收敛速度;2) 该方法能够有效缓解信道估计误差的影响。
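
To make the SDR step concrete, here is a minimal rank-relaxed beamforming problem in CVXPY: transmit power is minimized subject to a receive-SNR constraint after replacing the rank-one matrix w w^H with a PSD variable W. It is a toy surrogate, not the paper's energy-efficiency problem with CRB and channel-error constraints, and the channel, SNR target, and noise power below are made up:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
Nt, gamma, sigma2 = 4, 2.0, 1.0                              # antennas, SNR target, noise power
h = (rng.standard_normal((Nt, 1)) + 1j * rng.standard_normal((Nt, 1))) / np.sqrt(2)
H = h @ h.conj().T

W = cp.Variable((Nt, Nt), hermitian=True)                    # relaxation of w @ w.conj().T
constraints = [cp.real(cp.trace(H @ W)) >= gamma * sigma2, W >> 0]
prob = cp.Problem(cp.Minimize(cp.real(cp.trace(W))), constraints)
prob.solve()

# If the optimal W is (numerically) rank one, which the paper proves for its own problem
# via the Lagrangian dual and KKT conditions, the beamformer is its principal eigenvector.
eigvals, eigvecs = np.linalg.eigh(W.value)
w_opt = np.sqrt(eigvals[-1]) * eigvecs[:, -1]
```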

Near-Field Positioning and Attitude Sensing Based on Electromagnetic Propagation Modeling

  • paper_url: http://arxiv.org/abs/2310.17327
  • repo_url: None
  • paper_authors: Ang Chen, Li Chen, Yunfei Chen, Nan Zhao, Changsheng You
  • for: 这篇论文是为了研究无线网络上的位姿探测和感知而写的。
  • methods: 这篇论文基于电磁理论建立了电磁传播模型(EPM),以精确刻画近场信道。在无噪声情况下,EPM 模型给出了观测信号与用户设备(UE)的位置和姿态之间的非线性函数关系。为解决非线性耦合带来的困难,我们首先依据定义的相位模糊距离和间距约束距离,将距离域划分为三个区域;然后针对每个区域给出低复杂度的闭式联合位置与姿态估计解。
  • results: 数值结果表明,所推导的 Ziv-Zakai 界(ZZB)能够在各个信噪比(SNR)区间准确预测估计器的性能。更重要的是,位置估计达到了毫米级精度,姿态估计达到了 0.1 量级的精度。
    Abstract Positioning and sensing over wireless networks are imperative for many emerging applications. However, traditional wireless channel models cannot be used for sensing the attitude of the user equipment (UE), since they over-simplify the UE as a point target. In this paper, a comprehensive electromagnetic propagation modeling (EPM) based on electromagnetic theory is developed to precisely model the near-field channel. For the noise-free case, the EPM model establishes the non-linear functional dependence of observed signals on both the position and attitude of the UE. To address the difficulty in the non-linear coupling, we first propose to divide the distance domain into three regions, separated by the defined Phase ambiguity distance and Spacing constraint distance. Then, for each region, we obtain the closed-form solutions for joint position and attitude estimation with low complexity. Next, to investigate the impact of random noise on the joint estimation performance, the Ziv-Zakai bound (ZZB) is derived to yield useful insights. The expected Cram\'er-Rao bound (ECRB) is further provided to obtain the simplified closed-form expressions for the performance lower bounds. Our numerical results demonstrate that the derived ZZB can provide accurate predictions of the performance of estimators in all signal-to-noise ratio (SNR) regimes. More importantly, we achieve the millimeter-level accuracy in position estimation and attain the 0.1-level accuracy in attitude estimation.
    摘要 无线网络上的定位与感知对许多新兴应用而言不可或缺。然而,传统的无线信道模型将用户设备(UE)过度简化为点目标,因而无法用于感知 UE 的姿态。本文基于电磁理论建立了完整的电磁传播模型(EPM),以精确刻画近场信道。在无噪声情况下,EPM 模型给出了观测信号与 UE 的位置和姿态之间的非线性函数关系。为解决非线性耦合带来的困难,我们首先依据定义的相位模糊距离和间距约束距离,将距离域划分为三个区域;然后针对每个区域给出低复杂度的闭式联合位置与姿态估计解。接着,为研究随机噪声对联合估计性能的影响,我们推导了 Ziv-Zakai 界(ZZB)以获得有用的洞见,并进一步给出期望克拉美-罗界(ECRB),得到性能下界的简化闭式表达。数值结果表明,所推导的 ZZB 能够在各个信噪比(SNR)区间准确预测估计器的性能;更重要的是,我们在位置估计上达到了毫米级精度,在姿态估计上达到了 0.1 量级的精度。

  • paper_url: http://arxiv.org/abs/2310.17259
  • repo_url: None
  • paper_authors: N. Makris, A. Ntanos, A. Papageorgopoulos, A. Stathis, P. Konteli, I. Tsoni, G. Giannoulis, F. Setaki, T. Stathopoulos, G. Lyberopoulos, H. Avramopoulos, G. T. Kanellos, D. Syvridis
  • for: 这项研究旨在将量子密钥分发(QKD)系统集成到真实的光纤到户(FTTH)网络中。
  • methods: 该研究使用一套商用 O 波段量子密钥分发系统,并将其成功集成到一个复刻了运营商级光纤到户(FTTH)接入网络的在用(lit)GPON 测试平台中。
  • results: 研究人员在包含多个 ONT 的测试平台上成功完成了量子密钥分发系统的集成,从而模拟了真实 FTTH 部署中的多终端场景。
    Abstract We have successfully integrated an O-band commercial Quantum-Key-Distribution (QKD) system over a lit GPON testbed that replicates a carrier-grade Fiber-to-the-Home (FTTH) optical access network with multiple ONTs to emulate real-life FTTH operational deployments.
    摘要 我们已成功将商用 O 波段量子密钥分发(QKD)系统集成到一个在用(lit)GPON 测试平台上,该平台复刻了运营商级光纤到户(FTTH)光接入网络,并包含多个 ONT,以模拟真实的 FTTH 运营部署。

Beampattern Design in Non-Uniform MIMO Communication

  • paper_url: http://arxiv.org/abs/2310.17201
  • repo_url: None
  • paper_authors: Amirsadegh Roshanzamir
  • for: 本研究旨在探讨非均匀数组下的多输入多输出通信技术。
  • methods: 本研究通过联合优化发射天线位置与发射信号的互相关矩阵来设计发射波束方向图。
  • results: 研究结果表明,联合优化发射天线位置与互相关矩阵能够更好地控制发射波束方向图,从而提升多输入多输出通信的性能。
    Abstract In recent years and with introduction of 5G cellular network and communication, researchers have shown great interest in Multiple Input Multiple Output (MIMO) communication, an advanced technology. Many studies have examined the problem of designing the beampattern for MIMO communication using uniform arrays and the covariance-based method to concentrate the transmitted power to the users. However, this paper aims to tackle this issue in the context of non-uniform arrays. Previous authors have primarily focused on designing the transmitted beampattern based on the cross-correlation matrix of transmitted signal elements. In contrast, this paper suggests optimizing the positions of transmitted antennas along with the cross-correlation matrix. This approach is expected to produce better results.
    摘要 近年来,随着 5G 蜂窝网络与通信的引入,研究人员对多输入多输出(MIMO)通信这一先进技术表现出浓厚兴趣。许多研究针对均匀阵列,采用基于协方差矩阵的方法设计 MIMO 通信的波束方向图,以便将发射功率集中到用户方向。而本文旨在解决非均匀阵列下的这一问题。以往工作主要基于发射信号元素的互相关矩阵来设计发射波束方向图;相比之下,本文建议在优化互相关矩阵的同时一并优化发射天线的位置,预期可获得更好的效果。
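
The quantity being shaped here is the standard covariance-based transmit beampattern P(theta) = a(theta)^H R a(theta), which accommodates arbitrary (non-uniform) element positions once the steering vector uses the actual positions; a small sketch, with example positions and an identity cross-correlation matrix purely as placeholders:

```python
import numpy as np

def beampattern(R, positions_m, wavelength_m, angles_deg):
    """P(theta) = a(theta)^H R a(theta) for a linear array with arbitrary element
    positions (metres); R is the transmit signal cross-correlation matrix."""
    theta = np.deg2rad(np.asarray(angles_deg, dtype=float))
    A = np.exp(1j * 2 * np.pi * np.outer(positions_m, np.sin(theta)) / wavelength_m)
    return np.real(np.einsum("nk,nm,mk->k", A.conj(), R, A))

positions = np.array([0.00, 0.04, 0.11, 0.19])   # example non-uniform spacing, in metres
R = np.eye(len(positions))                        # uncorrelated elements -> near-omnidirectional pattern
p = beampattern(R, positions, wavelength_m=0.10, angles_deg=np.arange(-90, 91))
```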

Multi-level Gated Bayesian Recurrent Neural Network for State Estimation

  • paper_url: http://arxiv.org/abs/2310.17187
  • repo_url: None
  • paper_authors: Shi Yan, Yan Liang, Le Zheng, Mingyang Fan, Binglu Wang, Xiaoxu Wang
  • for: 本研究旨在提出一种多级门控贝叶斯循环神经网络,用于模型失配情况下的状态估计。
  • methods: 本文提出了一种新的解决方案:先将非马尔可夫状态空间模型转换为等效的带记忆一阶马尔可夫模型,再推导数据辅助的联合状态-记忆-失配贝叶斯滤波,并据此设计多级门控贝叶斯循环神经网络。
  • results: 在包含仿真与真实数据集的实验中,所提出的门控网络表现出色,优于基准滤波器与现有的深度学习滤波方法。
    Abstract The optimality of Bayesian filtering relies on the completeness of prior models, while deep learning holds a distinct advantage in learning models from offline data. Nevertheless, the current fusion of these two methodologies remains largely ad hoc, lacking a theoretical foundation. This paper presents a novel solution, namely a multi-level gated Bayesian recurrent neural network specifically designed to state estimation under model mismatches. Firstly, we transform the non-Markov state-space model into an equivalent first-order Markov model with memory. It is a generalized transformation that overcomes the limitations of the first-order Markov property and enables recursive filtering. Secondly, by deriving a data-assisted joint state-memory-mismatch Bayesian filtering, we design a Bayesian multi-level gated framework that includes a memory update gate for capturing the temporal regularities in state evolution, a state prediction gate with the evolution mismatch compensation, and a state update gate with the observation mismatch compensation. The Gaussian approximation implementation of the filtering process within the gated framework is derived, taking into account the computational efficiency. Finally, the corresponding internal neural network structures and end-to-end training methods are designed. The Bayesian filtering theory enhances the interpretability of the proposed gated network, enabling the effective integration of offline data and prior models within functionally explicit gated units. In comprehensive experiments, including simulations and real-world datasets, the proposed gated network demonstrates superior estimation performance compared to benchmark filters and state-of-the-art deep learning filtering methods.
    摘要 贝叶斯滤波的最优性依赖于先验模型的完备性,而深度学习的独特优势在于能够从离线数据中学习模型。然而,目前这两种方法的融合在很大程度上仍是临时拼凑,缺乏理论基础。本文提出了一种新的解决方案,即专为模型失配下状态估计设计的多级门控贝叶斯循环神经网络。首先,我们将非马尔可夫状态空间模型转换为等效的带记忆一阶马尔可夫模型;这一广义变换克服了一阶马尔可夫性质的局限,使递归滤波成为可能。其次,通过推导数据辅助的联合状态-记忆-失配贝叶斯滤波,我们设计了一个贝叶斯多级门控框架,其中包括捕捉状态演化时间规律的记忆更新门、带演化失配补偿的状态预测门,以及带观测失配补偿的状态更新门;并在兼顾计算效率的前提下推导了门控框架内滤波过程的高斯近似实现。最后,我们设计了相应的内部神经网络结构与端到端训练方法。贝叶斯滤波理论增强了所提门控网络的可解释性,使离线数据与先验模型能够在功能明确的门控单元中有效融合。在包含仿真与真实数据集的综合实验中,所提门控网络的估计性能优于基准滤波器与最先进的深度学习滤波方法。

Max-min Rate Optimization of Low-Complexity Hybrid Multi-User Beamforming Maintaining Rate-Fairness

  • paper_url: http://arxiv.org/abs/2310.17155
  • repo_url: None
  • paper_authors: W. Zhu, H. D. Tuan, E. Dutkiewicz, H. V. Poor, L. Hanzo
  • for: 本研究考虑由一个基站在毫米波或亚太赫兹频段为多个用户提供服务的无线网络。
  • methods: 研究通过最大化用户的最低速率来设计高吞吐量多用户混合发射波束成形。为实现能效的信号传输,模拟波束成形采用子阵列式(array-of-subarrays)结构,并使用低分辨率移相器。
  • results: 我们开发了一种基于凸求解器的算法,该算法迭代求解一系列与波束成形器同等规模的凸问题。我们还引入了软最大最小速率(soft max-min rate)目标函数,并为其优化设计了可扩展的算法。仿真结果表明,软最大最小速率优化不仅能逼近最大最小速率优化得到的最低用户速率,还能实现与和速率最大化相当的总吞吐量。因此,所提出的基于软最大最小速率优化的波束成形设计提供了一种同时实现所有用户高个体服务质量与高总网络吞吐量的新技术。
    Abstract A wireless network serving multiple users in the millimeter-wave or the sub-terahertz band by a base station is considered. High-throughput multi-user hybrid-transmit beamforming is conceived by maximizing the minimum rate of the users. For the sake of energy-efficient signal transmission, the array-of-subarrays structure is used for analog beamforming relying on low-resolution phase shifters. We develop a convexsolver based algorithm, which iteratively invokes a convex problem of the same beamformer size for its solution. We then introduce the soft max-min rate objective function and develop a scalable algorithm for its optimization. Our simulation results demonstrate the striking fact that soft max-min rate optimization not only approaches the minimum user rate obtained by max-min rate optimization but it also achieves a sum rate similar to that of sum-rate maximization. Thus, the soft max-min rate optimization based beamforming design conceived offers a new technique of simultaneously achieving a high individual quality-of-service for all users and a high total network throughput.
    摘要 本文考虑由一个基站在毫米波或亚太赫兹频段为多个用户提供服务的无线网络,通过最大化用户的最低速率来设计高吞吐量多用户混合发射波束成形。为实现能效的信号传输,模拟波束成形采用子阵列式结构,并使用低分辨率移相器。我们开发了一种基于凸求解器的算法,该算法迭代求解一系列与波束成形器同等规模的凸问题;随后引入软最大最小速率目标函数,并为其优化设计了可扩展的算法。仿真结果表明一个显著的事实:软最大最小速率优化不仅能逼近最大最小速率优化所得的最低用户速率,还能取得与和速率最大化相当的总速率。因此,所提出的基于软最大最小速率优化的波束成形设计提供了一种同时为所有用户实现高个体服务质量和高总网络吞吐量的新技术。
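
One common way to obtain a smooth "soft max-min" surrogate is to replace the minimum user rate with a scaled log-sum-exp lower bound, which becomes tight as the smoothing parameter grows; the snippet below shows that generic construction (the paper's exact soft max-min objective may be defined differently):

```python
import numpy as np

def soft_min(rates, alpha=10.0):
    """Smooth lower bound on min(rates): -(1/alpha) * log(sum(exp(-alpha * r))).
    It never exceeds the true minimum and approaches it as alpha -> infinity."""
    r = np.asarray(rates, dtype=float)
    return -np.log(np.sum(np.exp(-alpha * r))) / alpha

rates = [1.0, 2.5, 0.8]
print(min(rates), soft_min(rates, alpha=2.0), soft_min(rates, alpha=50.0))
```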

Reducing the impact of non-ideal PRBS on microwave photonic random demodulators by low biasing the optical modulator via PRBS amplitude compression

  • paper_url: http://arxiv.org/abs/2310.17676
  • repo_url: None
  • paper_authors: Shiyang Liu, Yang Chen
  • for: 这篇论文旨在减少非理想伪随机二进制序列(PRBS)对微波光子随机解调器(RD)的影响。
  • methods: 本研究提出了一种新方法:利用幅度较低的 PRBS 对光调制器进行低偏置,通过压缩 PRBS 非理想部分的幅度来减少非理想 PRBS 对微波光子 RD 的影响。
  • results: 实验结果显示,该方法可将重建误差降低多达 85%。该方法能显著降低基于 RD 的光子辅助压缩感知(CS)系统对 PRBS 的要求,提供一种可行的解决方案,从而降低系统实现的复杂度和成本。
    Abstract A novel method for reducing the impact of non-ideal pseudo-random binary sequence (PRBS) on microwave photonic random demodulators (RDs) in a photonics-assisted compressed sensing (CS) system is proposed. Different from the commonly used method that switches the bias point of the optical modulator in the RD between two quadrature transmission points to mix the signal to be sampled and the PRBS, this method employs a PRBS with lower amplitude to low bias the optical modulator so that the impact of non-ideal PRBS on microwave photonic RDs can be greatly reduced by compressing the amplitude of non-ideal parts of the PRBS. An experiment is performed to verify the concept. The optical modulator is properly low-biased via PRBS amplitude compression. The data rate and occupied bandwidth of the PRBS are 500 Mb/s and 1 GHz, while the multi-tone signals with a maximum frequency of 100 MHz are sampled at an equivalent sampling rate of only 50 MSa/s. The results show that the reconstruction error can be reduced by up to 85%. The proposed method can significantly reduce the requirements for PRBS in RD-based photonics-assisted CS systems, providing a feasible solution for reducing the complexity and cost of system implementation.
    摘要 本文提出了一种在光子辅助压缩感知(CS)系统中减少非理想伪随机二进制序列(PRBS)对微波光子随机解调器(RD)影响的新方法。与常用的在两个正交传输点之间切换光调制器偏置点以混合待采样信号与 PRBS 的方法不同,该方法利用幅度较低的 PRBS 对光调制器进行低偏置,通过压缩 PRBS 非理想部分的幅度,大幅降低非理想 PRBS 对微波光子 RD 的影响。实验验证了这一概念:通过 PRBS 幅度压缩使光调制器处于适当的低偏置状态;PRBS 的数据速率和占用带宽分别为 500 Mb/s 和 1 GHz,而最高频率为 100 MHz 的多音信号仅以等效 50 MSa/s 的采样率进行采样。结果表明重建误差可降低多达 85%。所提方法能显著降低基于 RD 的光子辅助 CS 系统对 PRBS 的要求,为降低系统实现的复杂度和成本提供了可行方案。

cs.SD - 2023-10-25

Improved Panning on Non-Equidistant Loudspeakers with Direct Sound Level Compensation

  • paper_url: http://arxiv.org/abs/2310.17004
  • repo_url: None
  • paper_authors: Jan-Hendrik Hanschke, Daniel Arteaga, Giulio Cengarle, Joshua Lando, Mark R. P. Thomas, Alan Seefeldt
  • for: 本论文旨在改进非等距扬声器布局上的声像摆位(panning),避免常规校准导致的幻象声源位置偏移。
  • methods: 论文提出了一种新方法:声像摆位位置由直达声决定,而感知响度由完整脉冲响应决定。
  • results: 主观听音测试表明,该方法能有效消除因扬声器布局不等距(及相应校准)导致的幻象声源位置偏移,将声像基本恢复到预期位置。
    Abstract Loudspeaker rendering techniques that create phantom sound sources often assume an equidistant loudspeaker layout. Typical home setups might not fulfill this condition as loudspeakers deviate from canonical positions, thus requiring a corresponding calibration. The standard approach is to compensate for delays and to match the loudness of each loudspeaker at the listener's location. It was found that a shift of the phantom image occurs when this calibration procedure is applied and one of a pair of loudspeakers is significantly closer to the listener than the other. In this paper, a novel approach to panning on non-equidistant loudspeaker layouts is presented whereby the panning position is governed by the direct sound and the perceived loudness is governed by the full impulse response. Subjective listening tests are presented that validate the approach and quantify the perceived effect of the compensation. In a setup where the standard calibration leads to an average error of 10 degrees, the proposed direct sound compensation largely returns the phantom source to its intended position.
    摘要 产生幻象声源的扬声器渲染技术通常假设扬声器布局为等距。典型的家庭环境往往不满足这一条件,扬声器会偏离标准位置,因此需要相应的校准。标准做法是补偿时延,并在听音位置匹配每个扬声器的响度。研究发现,当采用这种校准、且一对扬声器中的一只明显更靠近听者时,幻象声像会发生偏移。本文提出了一种在非等距扬声器布局上进行声像摆位的新方法:摆位位置由直达声决定,而感知响度由完整脉冲响应决定。主观听音测试验证了该方法,并量化了补偿带来的感知效果。在标准校准平均产生 10 度误差的布置中,所提出的直达声补偿基本能将幻象声源恢复到预期位置。
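
For context, the conventional calibration that the paper improves upon can be written down in a few lines: delay each feed so the arrivals align at the listening position and scale gains with distance so each loudspeaker is equally loud there (a 1/r level law is assumed below, and the distances are made up):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def conventional_calibration(distances_m):
    """Delay the closer loudspeakers so all direct sounds arrive together, and
    attenuate them so every loudspeaker produces the same level at the listener."""
    d = np.asarray(distances_m, dtype=float)
    ref = d.max()
    delays_s = (ref - d) / SPEED_OF_SOUND    # extra delay for closer speakers
    gains = d / ref                          # 1/r law: closer speakers get turned down
    return delays_s, gains

delays, gains = conventional_calibration([1.2, 2.4])  # one speaker twice as close as the other
```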

Dynamic Processing Neural Network Architecture For Hearing Loss Compensation

  • paper_url: http://arxiv.org/abs/2310.16550
  • repo_url: None
  • paper_authors: Szymon Drgas, Lars Bramsløw, Archontis Politis, Gaurav Naithani, Tuomas Virtanen
  • for: 提高听力障碍者的语音理解能力(speech intelligibility)
  • methods: 使用神经网络(neural networks)和听力模型(hearing loss model)实现语音补偿(speech compensation),并提出一种可解释性模型(interpretable model)called dynamic processing network
  • results: 以 STOI 和 HASPI 指标评估时,动态处理网络相比 Camfit 压缩增益处方规则显著提升了语音可懂度;而足够大的卷积神经网络能够以更高的计算代价超越该可解释模型。
    Abstract This paper proposes neural networks for compensating sensorineural hearing loss. The aim of the hearing loss compensation task is to transform a speech signal to increase speech intelligibility after further processing by a person with a hearing impairment, which is modeled by a hearing loss model. We propose an interpretable model called dynamic processing network, which has a structure similar to band-wise dynamic compressor. The network is differentiable, and therefore allows to learn its parameters to maximize speech intelligibility. More generic models based on convolutional layers were tested as well. The performance of the tested architectures was assessed using spectro-temporal objective index (STOI) with hearing-threshold noise and hearing aid speech intelligibility (HASPI) metrics. The dynamic processing network gave a significant improvement of STOI and HASPI in comparison to popular compressive gain prescription rule Camfit. A large enough convolutional network could outperform the interpretable model with the cost of larger computational load. Finally, a combination of the dynamic processing network with convolutional neural network gave the best results in terms of STOI and HASPI.
    摘要 本文提出利用神经网络补偿感音神经性听力损失的方法。听力损失补偿任务的目标是变换语音信号,使其在经由听力损失模型建模的听障人士进一步处理后,语音可懂度得到提升。我们提出一种可解释的模型,称为动态处理网络,其结构类似于分频带动态压缩器。该网络可微分,因此可以通过学习其参数来最大化语音可懂度;我们还测试了基于卷积层的更通用模型。所测试结构的性能使用带听阈噪声的短时客观可懂度指标(STOI)和助听器语音可懂度指标(HASPI)进行评估。与常用的压缩增益处方规则 Camfit 相比,动态处理网络在 STOI 和 HASPI 上带来了显著提升;足够大的卷积网络能够以更高的计算代价超越该可解释模型。最后,将动态处理网络与卷积神经网络相结合在 STOI 和 HASPI 上取得了最佳结果。
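
Since the interpretable model is described as resembling a band-wise dynamic compressor, it may help to recall what one band of such a compressor computes: a static gain curve that attenuates levels above a threshold by a compression ratio. The threshold and ratio below are arbitrary illustrative values; a real hearing-loss compensation system would learn or prescribe them per band:

```python
import numpy as np

def compressor_gain_db(level_db, threshold_db=-40.0, ratio=3.0):
    """Static gain (in dB) of a downward compressor for one frequency band:
    below the threshold the gain is 0 dB; above it, every extra dB of input
    yields only 1/ratio dB of output, i.e. the gain drops by (1 - 1/ratio) dB per dB."""
    over_db = np.maximum(level_db - threshold_db, 0.0)
    return -over_db * (1.0 - 1.0 / ratio)

print(compressor_gain_db(-10.0))   # 30 dB above threshold with ratio 3 -> -20 dB of gain
```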

A Novel Approach for Object Based Audio Broadcasting

  • paper_url: http://arxiv.org/abs/2310.16481
  • repo_url: None
  • paper_authors: Mohammad Reza Hasanabadi
  • for: 提供个性化和自定义的音频经验,适用于不同的平台,如广播、流媒体和电影音频。
  • methods: 提出了一种新的对象音频生成方法,即Sample-by-Sample Object Based Audio(SSOBA)嵌入。SSOBA将音频对象样本置于一起,让听众根据自己的兴趣和需求自由地个性化选择音频来源。
  • results: 对SSOBA的主要性能因素进行了研究,包括输入音频对象、输出通道数和采样率。实验结果表明,在编码和解码过程中,SSOBA可以保持高质量音频效果,并且可以在不需要特殊硬件的情况下实现。
    Abstract Object Based Audio (OBA) provides a new kind of audio experience, delivered to the audience to personalize and customize their experience of listening and to give them choice of what and how to hear their audio content. OBA can be applied to different platforms such as broadcasting, streaming and cinema sound. This paper presents a novel approach for creating object-based audio on the production side. The approach here presents Sample-by-Sample Object Based Audio (SSOBA) embedding. SSOBA places audio object samples in such a way that allows audiences to easily individualize their chosen audio sources according to their interests and needs. SSOBA is an extra service and not an alternative, so it is also compliant with legacy audio players. The biggest advantage of SSOBA is that it does not require any special additional hardware in the broadcasting chain and it is therefore easy to implement and equip legacy players and decoders with enhanced ability. Input audio objects, number of output channels and sampling rates are three important factors affecting SSOBA performance and specifying it to be lossless or lossy. SSOBA adopts interpolation at the decoder side to compensate for eliminated samples. Both subjective and objective experiments are carried out to evaluate the output results at each step. MUSHRA subjective experiments conducted after the encoding step shows good-quality performance of SSOBA with up to five objects. SNR measurements and objective experiments, performed after decoding and interpolation, show significant successful recovery and separation of audio objects. Experimental results show that a minimum sampling rate of 96 kHz is indicated to encode up to five objects in a Stereo-mode channel to acquire good subjective and objective results simultaneously.
    摘要 基于对象的音频(OBA)提供了一种新的音频体验:听众可以个性化、定制自己的收听体验,自主选择听什么以及如何听。OBA 可应用于广播、流媒体和影院声等不同平台。本文提出了一种在制作端创建基于对象音频的新方法,即逐样本对象音频(Sample-by-Sample Object Based Audio, SSOBA)嵌入。SSOBA 以特定方式放置音频对象样本,使听众能够根据自身兴趣和需求轻松地个性化所选的音频来源。SSOBA 是一项附加服务而非替代方案,因此也兼容传统音频播放器。其最大优势在于广播链路中无需任何额外的专用硬件,因此易于实现,并可为传统播放器和解码器增添增强能力。输入音频对象数、输出声道数和采样率是影响 SSOBA 性能、并决定其为无损或有损的三个重要因素。SSOBA 在解码端采用插值来补偿被剔除的样本。我们在各个环节开展了主观与客观实验以评估输出结果:编码后进行的 MUSHRA 主观实验表明,SSOBA 在最多五个对象时具有良好的音质表现;解码与插值后的 SNR 测量和客观实验表明音频对象能够被成功恢复与分离。实验结果显示,要在立体声模式声道中编码最多五个对象并同时获得良好的主观与客观效果,采样率至少应为 96 kHz。
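
The abstract leaves the embedding layout unspecified, but one simple reading consistent with "sample-by-sample" placement and decoder-side interpolation is: keep every K-th sample of each of K objects, interleave them so the multiplex stays at the original length, and interpolate the eliminated samples when extracting an object. The sketch below implements that reading only as an illustration, not the actual SSOBA scheme:

```python
import numpy as np

def ssoba_embed(objects):
    """objects: list of K equal-length 1-D arrays. Object k occupies sample slots
    k, k+K, k+2K, ...; its remaining samples are 'eliminated' from the multiplex."""
    K, T = len(objects), len(objects[0])
    stream = np.empty(T)
    for k, obj in enumerate(objects):
        stream[k::K] = obj[k::K]
    return stream

def ssoba_extract(stream, k, K):
    """Recover object k from its slots and linearly interpolate the missing samples."""
    T = len(stream)
    idx = np.arange(k, T, K)
    return np.interp(np.arange(T), idx, stream[idx])
```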

Towards Streaming Speech-to-Avatar Synthesis

  • paper_url: http://arxiv.org/abs/2310.16287
  • repo_url: None
  • paper_authors: Tejas S. Prabhune, Peter Wu, Bohan Yu, Gopala K. Anumanchipalli
  • for: 这篇论文旨在实现实时语音到虚拟形象动画的转换,以便在语言学、语音学和音系学等领域实时可视化声音,辅助第二语言学习,并为瘫痪患者提供虚拟具身。
  • methods: 该方法使用深度构音反演(articulatory inversion),从实时语音(而非离线录音)生成高质量的虚拟形象动画。
  • results: 该方法对每 0.1 秒音频的平均流式延迟为 130 毫秒,与真实构音轨迹的相关系数达 0.792。此外,我们还展示了生成的口部和舌部动画,以证明方法的有效性。
    Abstract Streaming speech-to-avatar synthesis creates real-time animations for a virtual character from audio data. Accurate avatar representations of speech are important for the visualization of sound in linguistics, phonetics, and phonology, visual feedback to assist second language acquisition, and virtual embodiment for paralyzed patients. Previous works have highlighted the capability of deep articulatory inversion to perform high-quality avatar animation using electromagnetic articulography (EMA) features. However, these models focus on offline avatar synthesis with recordings rather than real-time audio, which is necessary for live avatar visualization or embodiment. To address this issue, we propose a method using articulatory inversion for streaming high quality facial and inner-mouth avatar animation from real-time audio. Our approach achieves 130ms average streaming latency for every 0.1 seconds of audio with a 0.792 correlation with ground truth articulations. Finally, we show generated mouth and tongue animations to demonstrate the efficacy of our methodology.
    摘要 流式语音到虚拟形象合成根据音频数据为虚拟角色实时生成动画。准确的语音虚拟形象表示对于语言学、语音学和音系学中的声音可视化、辅助第二语言习得的视觉反馈,以及瘫痪患者的虚拟具身都十分重要。已有工作表明,基于电磁构音描记(EMA)特征的深度构音反演能够实现高质量的虚拟形象动画,但这些模型侧重于基于录音的离线合成,而实时音频才是实现实时虚拟形象可视化或具身所必需的。为此,我们提出一种利用构音反演、从实时音频流式生成高质量面部与口腔内部虚拟形象动画的方法。该方法对每 0.1 秒音频的平均流式延迟为 130 毫秒,与真实构音轨迹的相关系数为 0.792。最后,我们展示了生成的口部与舌部动画,以证明方法的有效性。

eess.AS - 2023-10-25

UniX-Encoder: A Universal $X$-Channel Speech Encoder for Ad-Hoc Microphone Array Speech Processing

  • paper_url: http://arxiv.org/abs/2310.16367
  • repo_url: None
  • paper_authors: Zili Huang, Yiwen Shao, Shi-Xiong Zhang, Dong Yu
  • for: solve more challenging scenarios of multi-channel recordings with multiple simultaneous talkers
  • methods: universal encoder designed for multiple tasks, compatible with any microphone array, and trained without labeled multi-channel data
  • results: consistently outperformed combinations like the WavLM model with the BeamformIt frontend in speech recognition and speaker diarization tasks
    Abstract The speech field is evolving to solve more challenging scenarios, such as multi-channel recordings with multiple simultaneous talkers. Given the many types of microphone setups out there, we present the UniX-Encoder. It's a universal encoder designed for multiple tasks, and worked with any microphone array, in both solo and multi-talker environments. Our research enhances previous multi-channel speech processing efforts in four key areas: 1) Adaptability: Contrasting traditional models constrained to certain microphone array configurations, our encoder is universally compatible. 2) Multi-Task Capability: Beyond the single-task focus of previous systems, UniX-Encoder acts as a robust upstream model, adeptly extracting features for diverse tasks including ASR and speaker recognition. 3) Self-Supervised Training: The encoder is trained without requiring labeled multi-channel data. 4) End-to-End Integration: In contrast to models that first beamform then process single-channels, our encoder offers an end-to-end solution, bypassing explicit beamforming or separation. To validate its effectiveness, we tested the UniX-Encoder on a synthetic multi-channel dataset from the LibriSpeech corpus. Across tasks like speech recognition and speaker diarization, our encoder consistently outperformed combinations like the WavLM model with the BeamformIt frontend.
    摘要 语音领域正朝着解决更具挑战性的场景演进,例如包含多个同时说话人的多通道录音。鉴于实际中麦克风阵列配置多种多样,我们提出了 UniX-Encoder:一种面向多任务设计的通用编码器,可与任意麦克风阵列配合使用,并适用于单说话人和多说话人环境。我们的研究在四个关键方面改进了已有的多通道语音处理工作:1) 适应性:不同于受限于特定麦克风阵列配置的传统模型,我们的编码器具有通用兼容性;2) 多任务能力:不同于以往系统的单一任务定位,UniX-Encoder 可作为强大的上游模型,为包括语音识别(ASR)和说话人识别在内的多种任务提取特征;3) 自监督训练:编码器的训练无需带标注的多通道数据;4) 端到端集成:不同于先波束成形再处理单通道的模型,我们的编码器提供端到端的解决方案,无需显式的波束成形或分离。为验证其有效性,我们在基于 LibriSpeech 语料构造的合成多通道数据集上测试了 UniX-Encoder。在语音识别和说话人日志(diarization)等任务上,我们的编码器始终优于诸如 WavLM 模型加 BeamformIt 前端之类的组合。

Covariance Blocking and Whitening Method for Successive Relative Transfer Function Vector Estimation in Multi-Speaker Scenarios

  • paper_url: http://arxiv.org/abs/2310.16327
  • repo_url: None
  • paper_authors: Henri Gode, Simon Doclo
  • for: 这篇论文是为了解决多个说话者在噪音和反射环境中估计相对传输函数(RTF)向量的挑战。
  • methods: 这篇论文使用了一种被称为盲斜投影(BOP)方法,该方法确定了第二个说话者的斜投影运算符,以便屏蔽第二个说话者。而在这篇论文中,我们提议使用协方差屏蔽和白化(CBW)方法,该方法首先屏蔽第一个说话者,然后使用估计的噪声协方差矩阵进行白化,并基于协方差分解来估计第二个说话者的RTF向量。
  • results: 将两个说话人的估计 RTF 向量用于线性约束最小方差(LCMV)波束形成器时,基于真实录音、多种说话人位置的仿真结果表明,所提出的 CBW 方法在信干噪比改善方面优于传统的 BOP 方法和协方差白化方法。
    Abstract This paper addresses the challenge of estimating the relative transfer function (RTF) vectors of multiple speakers in a noisy and reverberant environment. More specifically, we consider a scenario where two speakers activate successively. In this scenario, the RTF vector of the first speaker can be estimated in a straightforward way and the main challenge lies in estimating the RTF vector of the second speaker during segments where both speakers are simultaneously active. To estimate the RTF vector of the second speaker the so-called blind oblique projection (BOP) method determines the oblique projection operator that optimally blocks the second speaker. Instead of blocking the second speaker, in this paper we propose a covariance blocking and whitening (CBW) method, which first blocks the first speaker and applies whitening using the estimated noise covariance matrix and then estimates the RTF vector of the second speaker based on a singular value decomposition. When using the estimated RTF vectors of both speakers in a linearly constrained minimum variance beamformer, simulation results using real-world recordings for multiple speaker positions demonstrate that the proposed CBW method outperforms the conventional BOP and covariance whitening methods in terms of signal-to-interferer-and-noise ratio improvement.
    摘要 本文研究在含噪混响环境中估计多个说话人相对传递函数(RTF)向量的问题,具体考虑两个说话人先后开始发声的场景。此时第一个说话人的 RTF 向量可以直接估计,主要难点在于在两个说话人同时活跃的片段中估计第二个说话人的 RTF 向量。为此,所谓的盲斜投影(BOP)方法通过确定能最优阻塞第二个说话人的斜投影算子来实现。与其阻塞第二个说话人不同,本文提出一种协方差阻塞与白化(CBW)方法:先阻塞第一个说话人,再利用估计得到的噪声协方差矩阵进行白化,随后基于奇异值分解估计第二个说话人的 RTF 向量。当在线性约束最小方差(LCMV)波束形成器中使用两名说话人的估计 RTF 向量时,基于真实录音、多种说话人位置的仿真结果表明,所提出的 CBW 方法在信干噪比改善方面优于传统的 BOP 方法与协方差白化方法。
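
For reference, the classic covariance-whitening RTF estimator that the proposed CBW method builds on (and is compared against) can be written compactly: whiten the noisy covariance with the noise covariance, take the principal eigenvector, de-whiten, and normalize to the reference microphone. The paper's CBW additionally blocks the first speaker before this step; the snippet below is only the generic single-speaker building block:

```python
import numpy as np

def rtf_covariance_whitening(R_noisy, R_noise):
    """R_noisy: covariance with the target speaker active; R_noise: noise-only covariance.
    Returns the relative transfer function estimate, normalized to microphone 0."""
    L = np.linalg.cholesky(R_noise)                 # R_noise = L @ L^H
    L_inv = np.linalg.inv(L)
    R_w = L_inv @ R_noisy @ L_inv.conj().T          # whitened noisy covariance
    _, vecs = np.linalg.eigh(R_w)                   # eigenvalues in ascending order
    principal = vecs[:, -1]
    g = L @ principal                               # de-whiten back to the sensor domain
    return g / g[0]
```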

cs.CV - 2023-10-25

Exploring Question Decomposition for Zero-Shot VQA

  • paper_url: http://arxiv.org/abs/2310.17050
  • repo_url: None
  • paper_authors: Zaid Khan, Vijay Kumar BG, Samuel Schulter, Manmohan Chandraker, Yun Fu
  • for: 提高Visual Question Answering(VQA)任务的性能,使其更能够模仿人类的问答策略。
  • methods: 使用人类写好的问题分解策略,以及模型自动生成的问题分解策略,从示例 alone 学习两种任务。
  • results: 在涵盖三个领域的八个 VQA 任务上,选择性地使用模型生成的问题分解策略能够稳定提升准确率,包括在医疗 VQA 数据集上提升超过 20%,并使 BLIP-2 在 Winoground 任务的 VQA 改写版本上的零样本表现超过随机水平。
    Abstract Visual question answering (VQA) has traditionally been treated as a single-step task where each question receives the same amount of effort, unlike natural human question-answering strategies. We explore a question decomposition strategy for VQA to overcome this limitation. We probe the ability of recently developed large vision-language models to use human-written decompositions and produce their own decompositions of visual questions, finding they are capable of learning both tasks from demonstrations alone. However, we show that naive application of model-written decompositions can hurt performance. We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors, and validate its effectiveness on eight VQA tasks across three domains, showing consistent improvements in accuracy, including improvements of >20% on medical VQA datasets and boosting the zero-shot performance of BLIP-2 above chance on a VQA reformulation of the challenging Winoground task. Project Site: https://zaidkhan.me/decomposition-0shot-vqa/
    摘要 视觉问答(VQA)传统上被视为单步任务,每个问题都获得同等的处理力度,这与人类的问答策略不同。我们探索了一种面向 VQA 的问题分解策略以突破这一限制。我们考察了最新的大型视觉-语言模型使用人工撰写的分解、以及自行生成视觉问题分解的能力,发现它们仅凭示例即可学会这两项任务。然而,我们也表明,直接套用模型生成的分解可能损害性能。为此,我们提出一种模型驱动的选择性分解方法,用于复核预测并纠正错误,并在涵盖三个领域的八个 VQA 任务上验证了其有效性,获得了一致的准确率提升,包括在医疗 VQA 数据集上超过 20% 的提升,以及使 BLIP-2 在具有挑战性的 Winoground 任务的 VQA 改写版本上的零样本表现超过随机水平。项目网站:https://zaidkhan.me/decomposition-0shot-vqa/

Hierarchical Text Spotter for Joint Text Spotting and Layout Analysis

  • paper_url: http://arxiv.org/abs/2310.17674
  • repo_url: None
  • paper_authors: Shangbang Long, Siyang Qin, Yasuhisa Fujii, Alessandro Bissacco, Michalis Raptis
  • for: 这个论文是为了解决文本检测和几何布局分析的联合任务而设计的。
  • methods: 论文使用了两个新的组件:一个叫做Unified-Detector-Polygon (UDP),它生成了文本线的贝塞尔曲线 polygon,并生成了段落之间的亲和力矩阵; 另一个叫做Line-to-Character-to-Word (L2C2W) recognizer,它将分割出来的行转换为字符,并将字符合并回到单词中。
  • results: 论文在多个单词级文本检测基准数据集以及几何版面分析任务上均取得了当前最先进的结果。
    Abstract We propose Hierarchical Text Spotter (HTS), a novel method for the joint task of word-level text spotting and geometric layout analysis. HTS can recognize text in an image and identify its 4-level hierarchical structure: characters, words, lines, and paragraphs. The proposed HTS is characterized by two novel components: (1) a Unified-Detector-Polygon (UDP) that produces Bezier Curve polygons of text lines and an affinity matrix for paragraph grouping between detected lines; (2) a Line-to-Character-to-Word (L2C2W) recognizer that splits lines into characters and further merges them back into words. HTS achieves state-of-the-art results on multiple word-level text spotting benchmark datasets as well as geometric layout analysis tasks.
    摘要 我们提出了层次文本检测器(HTS),一种用于联合解决单词级文本检测与几何版面分析的新方法。HTS 能够识别图像中的文本,并识别其字符、单词、文本行和段落四级层次结构。所提出的 HTS 包含两个新组件:(1) 统一检测器-多边形(UDP),生成文本行的贝塞尔曲线多边形以及用于将检测到的文本行分组为段落的亲和矩阵;(2) 行-字符-单词(L2C2W)识别器,将文本行拆分为字符,再将字符重新合并为单词。HTS 在多个单词级文本检测基准数据集以及几何版面分析任务上均取得了当前最先进的结果。

Trust, but Verify: Robust Image Segmentation using Deep Learning

  • paper_url: http://arxiv.org/abs/2310.16999
  • repo_url: None
  • paper_authors: Fahim Ahmed Zaman, Xiaodong Wu, Weiyu Xu, Milan Sonka, Raghuraman Mudumbai
  • for: 验证深度神经网络医学图像分割输出的可靠性,使其对多类随机扰动以及最坏情况扰动(即对抗攻击)具有鲁棒性。
  • methods: 基于作者提出的"信任,但验证"方法,利用辅助验证网络以分割结果为输入,对输入图像中被掩蔽的特征进行预测,再将预测与原始图像比对以检测错误分割。
  • results: 与以往方法相比,新的验证网络设计避免了假阴性问题(即把错误分割误判为正确分割),并在多种攻击下保持较高的可靠性。
    Abstract We describe a method for verifying the output of a deep neural network for medical image segmentation that is robust to several classes of random as well as worst-case perturbations i.e. adversarial attacks. This method is based on a general approach recently developed by the authors called "Trust, but Verify" wherein an auxiliary verification network produces predictions about certain masked features in the input image using the segmentation as an input. A well-designed auxiliary network will produce high-quality predictions when the input segmentations are accurate, but will produce low-quality predictions when the segmentations are incorrect. Checking the predictions of such a network with the original image allows us to detect bad segmentations. However, to ensure the verification method is truly robust, we need a method for checking the quality of the predictions that does not itself rely on a black-box neural network. Indeed, we show that previous methods for segmentation evaluation that do use deep neural regression networks are vulnerable to false negatives i.e. can inaccurately label bad segmentations as good. We describe the design of a verification network that avoids such vulnerability and present results to demonstrate its robustness compared to previous methods.
    摘要 我们描述了一种用于验证深度神经网络医学图像分割结果的方法,该方法具有对多种随机和最差情况的抗击性,即抗击攻击。该方法基于我们最近开发的“信任,但验证”方法,其中一个辅助验证网络生成了基于输入图像的掩码特征的预测结果。如果输入分割结果正确,那么这个辅助网络会生成高质量的预测结果;如果输入分割结果错误,那么辅助网络会生成低质量的预测结果。通过对辅助网络的预测结果与原始图像进行比较,我们可以检测出错误的分割结果。但是,为了确保验证方法的真正可靠性,我们需要一种不依赖于黑obox神经网络的方法来检查预测结果的质量。我们显示了先前用于分割评估的深度神经回归网络方法存在false negative问题,即可能错别错误地将错误的分割结果标记为正确的。我们描述了一种避免这种敏感性的验证网络的设计,并提供了证明其稳定性的结果。

An Efficient Deep Learning-based approach for Recognizing Agricultural Pests in the Wild

  • paper_url: http://arxiv.org/abs/2310.16991
  • repo_url: https://github.com/mohtasimhadi/An-Efficient-Deep-Learning-Based-Approach-for-Recognizing-Agricultural-Pests-in-the-Wild
  • paper_authors: Mohtasim Hadi Rafi, Mohammad Ratul Mahjabin, Md Sabbir Rahman
  • for: 本研究旨在帮助农民防治农作物害虫,提高农业产量和经济效益。
  • methods: 本研究使用了带微调的迁移学习、注意力机制和自定义网络架构,实现了高精度、高鲁棒性的害虫识别。
  • results: 实验结果表明,所提出的方法能够准确识别多种害虫,并在不同的数据集上表现出良好的鲁棒性。
    Abstract One of the biggest challenges that the farmers go through is to fight insect pests during agricultural product yields. The problem can be solved easily and avoid economic losses by taking timely preventive measures. This requires identifying insect pests in an easy and effective manner. Most of the insect species have similarities between them. Without proper help from the agriculturist academician it is very challenging for the farmers to identify the crop pests accurately. To address this issue we have done extensive experiments considering different methods to find out the best method among all. This paper presents a detailed overview of the experiments done on mainly a robust dataset named IP102 including transfer learning with finetuning, attention mechanism and custom architecture. Some example from another dataset D0 is also shown to show robustness of our experimented techniques.
    摘要 农民面临的最大挑战之一,是在农产品生产过程中防治害虫。只要及时采取预防措施,这一问题便可较容易地解决并避免经济损失,而这需要一种简便且有效的害虫识别手段。许多昆虫种类之间彼此相似,在缺乏农学专家帮助的情况下,农民很难准确识别作物害虫。为解决这一问题,我们在不同方法上开展了大量实验,以找出其中的最佳方案。本文详细概述了主要在鲁棒数据集 IP102 上进行的实验,包括带微调的迁移学习、注意力机制和自定义架构;同时给出了另一数据集 D0 上的示例,以展示所用技术的鲁棒性。

Unsupervised Domain Adaptation for Semantic Segmentation with Pseudo Label Self-Refinement

  • paper_url: http://arxiv.org/abs/2310.16979
  • repo_url: None
  • paper_authors: Xingchen Zhao, Niluthpol Chowdhury Mithun, Abhinav Rajvanshi, Han-Pang Chiu, Supun Samarasekera
  • for: 提高深度学习基于 semantic segmentation 模型在不同特征集上的性能,尤其是在实际应用环境中。
  • methods: 使用教师模型生成 pseudo-标签,并使用学生模型在新数据上进行自教育。auxiliary pseudo-label refinement network (PRN) 用于在不同阶段进行pseudo标签的修正和选择高可靠的标签。
  • results: 在多个基准数据集上,我们的方法显著优于此前的最先进方法,表明该方法能够在自适应的不同阶段有效提升分割模型对伪标签噪声传播的鲁棒性。
    Abstract Deep learning-based solutions for semantic segmentation suffer from significant performance degradation when tested on data with different characteristics than what was used during the training. Adapting the models using annotated data from the new domain is not always practical. Unsupervised Domain Adaptation (UDA) approaches are crucial in deploying these models in the actual operating conditions. Recent state-of-the-art (SOTA) UDA methods employ a teacher-student self-training approach, where a teacher model is used to generate pseudo-labels for the new data which in turn guide the training process of the student model. Though this approach has seen a lot of success, it suffers from the issue of noisy pseudo-labels being propagated in the training process. To address this issue, we propose an auxiliary pseudo-label refinement network (PRN) for online refining of the pseudo labels and also localizing the pixels whose predicted labels are likely to be noisy. Being able to improve the quality of pseudo labels and select highly reliable ones, PRN helps self-training of segmentation models to be robust against pseudo label noise propagation during different stages of adaptation. We evaluate our approach on benchmark datasets with three different domain shifts, and our approach consistently performs significantly better than the previous state-of-the-art methods.
    摘要 基于深度学习的语义分割模型在测试数据特性与训练数据不同时,性能会显著下降,而利用新领域的标注数据来调整模型往往并不现实。因此,无监督域自适应(UDA)方法对于在实际运行环境中部署这些模型至关重要。当前最先进的 UDA 方法采用教师-学生自训练方式:教师模型为新数据生成伪标签,进而指导学生模型的训练。尽管这种方式取得了不少成功,但训练过程中伪标签噪声会不断传播。为解决这一问题,我们提出一个辅助伪标签精炼网络(PRN),用于在线精炼伪标签,并定位预测标签可能有噪声的像素。由于能够提升伪标签质量并筛选出高可靠性的标签,PRN 使分割模型的自训练在自适应的不同阶段都能抵御伪标签噪声的传播。我们在具有三种不同域偏移的基准数据集上评估了该方法,其性能始终显著优于此前的最先进方法。
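
A minimal sketch of the teacher-student pseudo-labelling step that this pipeline starts from, with a simple confidence threshold standing in for the paper's learned refinement network (the threshold value and ignore index are conventional placeholders):

```python
import torch

@torch.no_grad()
def make_pseudo_labels(teacher, images, threshold=0.9, ignore_index=255):
    """Teacher predictions become per-pixel pseudo-labels; low-confidence pixels are
    marked with ignore_index so they contribute nothing to the student loss.
    (The paper's PRN additionally refines these labels online instead of only masking them.)"""
    logits = teacher(images)                       # (B, C, H, W)
    probs = torch.softmax(logits, dim=1)
    confidence, labels = probs.max(dim=1)          # (B, H, W) each
    labels[confidence < threshold] = ignore_index
    return labels

# student update: torch.nn.CrossEntropyLoss(ignore_index=255)(student(images), labels)
```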

Improving Performance in Colorectal Cancer Histology Decomposition using Deep and Ensemble Machine Learning

  • paper_url: http://arxiv.org/abs/2310.16954
  • repo_url: None
  • paper_authors: Fabi Prezja, Leevi Annala, Sampsa Kiiskinen, Suvi Lahtinen, Timo Ojala, Pekka Ruusuvuori, Teijo Kuopio
  • for: This paper aims to explore the potential of convolutional neural networks (CNNs) in facilitating the extraction of clinically relevant biomarkers from histologic samples for colorectal cancer management.
  • methods: The authors use a hybrid Deep and ensemble machine learning model to classify diverse tissue types from whole slide microscope images accurately, which is critical for amplifying the prognostic potential of imaging-based biomarkers.
  • results: The model achieved 96.74% accuracy on the external test set and 99.89% on the internal test set, demonstrating its high accuracy and potential for clinical application.
    Abstract In routine colorectal cancer management, histologic samples stained with hematoxylin and eosin are commonly used. Nonetheless, their potential for defining objective biomarkers for patient stratification and treatment selection is still being explored. The current gold standard relies on expensive and time-consuming genetic tests. However, recent research highlights the potential of convolutional neural networks (CNNs) in facilitating the extraction of clinically relevant biomarkers from these readily available images. These CNN-based biomarkers can predict patient outcomes comparably to golden standards, with the added advantages of speed, automation, and minimal cost. The predictive potential of CNN-based biomarkers fundamentally relies on the ability of convolutional neural networks (CNNs) to classify diverse tissue types from whole slide microscope images accurately. Consequently, enhancing the accuracy of tissue class decomposition is critical to amplifying the prognostic potential of imaging-based biomarkers. This study introduces a hybrid Deep and ensemble machine learning model that surpassed all preceding solutions for this classification task. Our model achieved 96.74% accuracy on the external test set and 99.89% on the internal test set. Recognizing the potential of these models in advancing the task, we have made them publicly available for further research and development.
    摘要 在常规结直肠癌诊疗中,通常使用苏木精-伊红 (H&E) 染色的组织学样本,但其在定义用于患者分层和治疗选择的客观生物标志物方面的潜力仍在探索之中。目前的金标准依赖昂贵且耗时的基因检测。近期研究表明,卷积神经网络 (CNN) 能够从这些易于获得的图像中提取具有临床意义的生物标志物。这类基于 CNN 的生物标志物在预测患者结局方面可以与金标准相当,并具有速度快、自动化和成本低的优势。其预测能力的根本在于 CNN 能否从全切片显微图像中准确区分不同的组织类型,因此提高组织类型分解的准确率对于增强基于影像的生物标志物的预后价值至关重要。本研究提出了一种混合深度学习与集成机器学习的模型,在该分类任务上超越了此前所有方案:在外部测试集上达到 96.74% 的准确率,在内部测试集上达到 99.89%。鉴于这些模型在推进该任务方面的潜力,我们已将其公开,供后续研究与开发使用。
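
The exact hybrid architecture is not given in this digest; the sketch below only illustrates the general "deep features + ensemble" pattern under stated assumptions: a frozen ImageNet ResNet-50 as the feature extractor and a soft-voting ensemble of classical classifiers. All of this is a placeholder for the authors' actual model.

```python
# Hedged sketch of a deep-features-plus-ensemble tissue classifier.
import numpy as np
import torch
import torchvision.models as models
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()          # expose 2048-d embeddings
backbone.eval()

@torch.no_grad()
def embed(tiles: torch.Tensor) -> np.ndarray:
    """tiles: (N, 3, 224, 224) normalised tissue patches."""
    return backbone(tiles).cpu().numpy()

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True)),
    ],
    voting="soft",
)
# ensemble.fit(embed(train_tiles), train_labels)
# tissue_classes = ensemble.predict(embed(test_tiles))
```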

Diagnosing Alzheimer’s Disease using Early-Late Multimodal Data Fusion with Jacobian Maps

  • paper_url: http://arxiv.org/abs/2310.16936
  • repo_url: None
  • paper_authors: Yasmine Mustafa, Tie Luo
  • for: 这篇论文旨在提出一种高效的早期-晚期融合 (ELF) 方法,用于识别阿尔茨海默病 (AD) 的四个阶段。
  • methods: 该方法使用卷积神经网络 (CNN) 自动提取特征,并利用随机森林在小规模数据集上取得有竞争力的表现;此外还提出了一条可靠的预处理管道,能够适应个体受试者的特点,并使用全脑图像而非切片或图块进行预处理。
  • results: 在 OASIS-3 数据集的 MRI 和 CT 图像上的实验表明,该方法能够将 AD 准确分为四个阶段,准确率达到 97.19%。
    Abstract Alzheimer's disease (AD) is a prevalent and debilitating neurodegenerative disorder impacting a large aging population. Detecting AD in all its presymptomatic and symptomatic stages is crucial for early intervention and treatment. An active research direction is to explore machine learning methods that harness multimodal data fusion to outperform human inspection of medical scans. However, existing multimodal fusion models have limitations, including redundant computation, complex architecture, and simplistic handling of missing data. Moreover, the preprocessing pipelines of medical scans remain inadequately detailed and are seldom optimized for individual subjects. In this paper, we propose an efficient early-late fusion (ELF) approach, which leverages a convolutional neural network for automated feature extraction and random forests for their competitive performance on small datasets. Additionally, we introduce a robust preprocessing pipeline that adapts to the unique characteristics of individual subjects and makes use of whole brain images rather than slices or patches. Moreover, to tackle the challenge of detecting subtle changes in brain volume, we transform images into the Jacobian domain (JD) to enhance both accuracy and robustness in our classification. Using MRI and CT images from the OASIS-3 dataset, our experiments demonstrate the effectiveness of the ELF approach in classifying AD into four stages with an accuracy of 97.19%.
    摘要 阿尔茨海默病 (AD) 是一种影响大量老龄人口、使人日渐衰弱的神经退行性疾病。在其症状前期和症状期的各个阶段检测 AD,对早期干预和治疗都至关重要。一个活跃的研究方向是利用多模态数据融合的机器学习方法,以超越人工阅片。然而,现有的多模态融合模型存在冗余计算、架构复杂以及对缺失数据处理过于简单等局限;医学影像的预处理流程也往往描述不足,且很少针对个体受试者进行优化。本文提出一种高效的早期-晚期融合 (ELF) 方法,利用卷积神经网络自动提取特征,并借助随机森林在小规模数据集上取得有竞争力的表现。我们还引入一条稳健的预处理管道,能够适应个体受试者的特点,并使用全脑图像而非切片或图块。此外,为了应对脑容积细微变化难以检测的问题,我们将图像变换到雅可比域 (JD),以同时提升分类的准确性与稳健性。在 OASIS-3 数据集的 MRI 和 CT 图像上,实验表明 ELF 方法能够将 AD 分为四个阶段,准确率达 97.19%。
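
One plausible reading of the early-late fusion (ELF) idea, sketched here under heavy assumptions: per-modality CNN features are assumed to be pre-extracted, early fusion concatenates them before a random forest, late fusion averages per-modality forest probabilities, and the two scores are blended. The variable names and the 0.5 weights are illustrative; this is not the authors' implementation, and the Jacobian-domain transform is omitted.

```python
# Hedged sketch of early-late fusion over MRI and CT feature vectors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_elf(mri_feats, ct_feats, labels):
    early = RandomForestClassifier(n_estimators=200)
    early.fit(np.concatenate([mri_feats, ct_feats], axis=1), labels)   # early fusion
    late_mri = RandomForestClassifier(n_estimators=200).fit(mri_feats, labels)
    late_ct = RandomForestClassifier(n_estimators=200).fit(ct_feats, labels)
    return early, late_mri, late_ct

def predict_elf(models, mri_feats, ct_feats):
    early, late_mri, late_ct = models
    p_early = early.predict_proba(np.concatenate([mri_feats, ct_feats], axis=1))
    p_late = 0.5 * (late_mri.predict_proba(mri_feats) + late_ct.predict_proba(ct_feats))
    return 0.5 * (p_early + p_late)          # blend early- and late-fusion scores
```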

MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory

  • paper_url: http://arxiv.org/abs/2310.16898
  • repo_url: https://github.com/liangyn22/mcuformer
  • paper_authors: Yinan Liang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu
  • for: 这篇论文旨在实现深度学习模型的部署在Internet of Things(IoT)设备上,例如微控制器,以减少成本和能源消耗。
  • methods: 本篇论文提出了一个硬件-算法共优化方法,名为MCUFormer,以便在微控制器上部署视觉 трансформа器,并且将其应用于图像识别 tasks。
  • results: 实验结果显示,使用MCUFormer可以在STM32F746微控制器上实现73.62%的top-1精度(在ImageNet图像识别任务中),并且仅需320KB的内存。
    Abstract Due to the high price and heavy energy consumption of GPUs, deploying deep models on IoT devices such as microcontrollers makes significant contributions for ecological AI. Conventional methods successfully enable convolutional neural network inference of high resolution images on microcontrollers, while the framework for vision transformers that achieve the state-of-the-art performance in many vision applications still remains unexplored. In this paper, we propose a hardware-algorithm co-optimizations method called MCUFormer to deploy vision transformers on microcontrollers with extremely limited memory, where we jointly design transformer architecture and construct the inference operator library to fit the memory resource constraint. More specifically, we generalize the one-shot network architecture search (NAS) to discover the optimal architecture with highest task performance given the memory budget from the microcontrollers, where we enlarge the existing search space of vision transformers by considering the low-rank decomposition dimensions and patch resolution for memory reduction. For the construction of the inference operator library of vision transformers, we schedule the memory buffer during inference through operator integration, patch embedding decomposition, and token overwriting, allowing the memory buffer to be fully utilized to adapt to the forward pass of the vision transformer. Experimental results demonstrate that our MCUFormer achieves 73.62\% top-1 accuracy on ImageNet for image classification with 320KB memory on STM32F746 microcontroller. Code is available at https://github.com/liangyn22/MCUFormer.
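
One concrete piece of the search space described above is the low-rank decomposition of weight matrices. The hedged sketch below factorises a single linear layer at a fixed rank via truncated SVD to show the memory saving; MCUFormer searches over such ranks (and patch resolutions) rather than fixing one, so this is illustrative only.

```python
# Hedged sketch: replace an nn.Linear with a rank-r factorisation, cutting
# parameters from d_in*d_out down to r*(d_in + d_out).
import torch
import torch.nn as nn

def low_rank_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    W = layer.weight.data                                  # (d_out, d_in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = Vh[:rank, :] * S[:rank].sqrt().unsqueeze(1)        # (r, d_in)
    B = U[:, :rank] * S[:rank].sqrt()                      # (d_out, r)
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data.copy_(A)
    second.weight.data.copy_(B)
    if layer.bias is not None:
        second.bias.data.copy_(layer.bias.data)
    return nn.Sequential(first, second)

# A 768x768 projection (~590k parameters) becomes ~98k parameters at rank 64.
layer = nn.Linear(768, 768)
approx = low_rank_linear(layer, rank=64)
x = torch.randn(1, 768)
print("rank-64 reconstruction error:", torch.norm(layer(x) - approx(x)).item())
```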

SparseDFF: Sparse-View Feature Distillation for One-Shot Dexterous Manipulation

  • paper_url: http://arxiv.org/abs/2310.16838
  • repo_url: None
  • paper_authors: Qianxu Wang, Haotong Zhang, Congyue Deng, Yang You, Hao Dong, Yixin Zhu, Leonidas Guibas
  • for: 将 robot 给授予高水平的 Semantic Understanding,以便在3D场景中进行高级的物体捕捉和操作。
  • methods: 我们运用大量2D Computer Vision模型,将多视图图像中的semantic feature概念传递到3D场景中,以建立一个Distilled Feature Field(DFF)。
  • results: 我们的方法可以从简单的RGBD观察中获取高水平的3D DFF,并且可以在不同的物体和场景下进行一次性学习,并且能够在不同的物体和场景下进行传授。
    Abstract Humans excel at transferring manipulation skills across diverse object shapes, poses, and appearances due to their understanding of semantic correspondences between different instances. To endow robots with a similar high-level understanding, we develop a Distilled Feature Field (DFF) for 3D scenes, leveraging large 2D vision models to distill semantic features from multiview images. While current research demonstrates advanced performance in reconstructing DFFs from dense views, the development of learning a DFF from sparse views is relatively nascent, despite its prevalence in numerous manipulation tasks with fixed cameras. In this work, we introduce SparseDFF, a novel method for acquiring view-consistent 3D DFFs from sparse RGBD observations, enabling one-shot learning of dexterous manipulations that are transferable to novel scenes. Specifically, we map the image features to the 3D point cloud, allowing for propagation across the 3D space to establish a dense feature field. At the core of SparseDFF is a lightweight feature refinement network, optimized with a contrastive loss between pairwise views after back-projecting the image features onto the 3D point cloud. Additionally, we implement a point-pruning mechanism to augment feature continuity within each local neighborhood. By establishing coherent feature fields on both source and target scenes, we devise an energy function that facilitates the minimization of feature discrepancies w.r.t. the end-effector parameters between the demonstration and the target manipulation. We evaluate our approach using a dexterous hand, mastering real-world manipulations on both rigid and deformable objects, and showcase robust generalization in the face of object and scene-context variations.
    摘要 人类具有将抓取技能转移到多种物体形状、姿态和外观的能力,这是因为他们对不同实例之间的 semantic 匹配有深刻的理解。为了赋予机器人类似的高级理解,我们开发了一种 Distilled Feature Field (DFF) for 3D 场景,利用大量 2D 视觉模型来精炼 semantic 特征从多视图图像中。当前研究已经实现了高级的 DFF 重建从密集视图中,但是对于从稀疏视图学习 DFF 的开发还是相对落后,尽管这种情况在许多抓取任务中具有广泛的应用。在这项工作中,我们介绍了一种新的方法,即 SparseDFF,用于从稀疏 RGBD 观察中获取视元一致的 3D DFF,以便一次学习灵活的抓取动作,并将其应用到新的场景。我们将图像特征映射到 3D 点云上,以便在 3D 空间中进行特征场的传播。SparseDFF 的核心是一种轻量级的特征修正网络,通过对匹配视图之间的特征进行对比而优化。此外,我们还实现了一种点云杂除机制,以确保每个本地邻域内的特征连续性。通过在源场景和目标场景上建立一致的特征场,我们定义了一个能量函数,该函数使得在示例抓取动作和目标抓取动作之间的特征差异与机器人的终端参数之间的差异进行最小化。我们通过使用一个协助手臂来评估我们的方法,并在实际抓取任务中展示了对物体和场景变化的稳定性。

LightSpeed: Light and Fast Neural Light Fields on Mobile Devices

  • paper_url: http://arxiv.org/abs/2310.16832
  • repo_url: https://github.com/lightspeed-r2l/lightspeed
  • paper_authors: Aarush Gupta, Junli Cao, Chaoyang Wang, Ju Hu, Sergey Tulyakov, Jian Ren, László A Jeni
  • for: 实现实时新视图图像生成在移动设备上,因为计算能力和存储空间有限制。
  • methods: 使用神经光场表示法,这些方法可以在移动设备上实现高品质的视图生成,神经光场方法直接将射线表示与像素颜色映射。
  • results: 我们发现使用光板表示是一种高效的射线表示,可以使用特征网格快速训练和渲染。我们的方法可以在非正面视图中进行扩展,并且比前一代光场方法提供更高品质的渲染和更好的速度比。
    Abstract Real-time novel-view image synthesis on mobile devices is prohibitive due to the limited computational power and storage. Using volumetric rendering methods, such as NeRF and its derivatives, on mobile devices is not suitable due to the high computational cost of volumetric rendering. On the other hand, recent advances in neural light field representations have shown promising real-time view synthesis results on mobile devices. Neural light field methods learn a direct mapping from a ray representation to the pixel color. The current choice of ray representation is either stratified ray sampling or Plucker coordinates, overlooking the classic light slab (two-plane) representation, the preferred representation to interpolate between light field views. In this work, we find that using the light slab representation is an efficient representation for learning a neural light field. More importantly, it is a lower-dimensional ray representation enabling us to learn the 4D ray space using feature grids which are significantly faster to train and render. Although mostly designed for frontal views, we show that the light-slab representation can be further extended to non-frontal scenes using a divide-and-conquer strategy. Our method offers superior rendering quality compared to previous light field methods and achieves a significantly improved trade-off between rendering quality and speed.
    摘要 由于计算能力和存储空间有限,在移动设备上进行实时新视角图像合成十分困难。NeRF 及其衍生方法等体渲染方法计算成本过高,并不适合移动设备。另一方面,近期的神经光场表示在移动设备上展示了有前景的实时视图合成结果:神经光场方法直接学习从光线表示到像素颜色的映射。现有方法采用分层光线采样或 Plücker 坐标作为光线表示,却忽略了经典的光板(两平面)表示。本文发现光板表示是学习神经光场的一种高效表示,而且它维度更低,使我们能够用特征网格学习 4D 光线空间,从而显著加快训练与渲染。虽然该表示主要面向正面视角场景,我们通过分而治之的策略将其进一步扩展到非正面场景。与此前的光场方法相比,我们的方法渲染质量更高,并在质量与速度之间取得了明显更好的折中。
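
The key ingredient named in the abstract is the classic two-plane light slab parameterisation of rays. A hedged sketch is shown below: a ray is encoded by its intersections with two parallel planes, giving a 4D (u, v, s, t) coordinate. The plane placement at z = 0 and z = 1 is an arbitrary convention for illustration, not the paper's exact setup.

```python
# Hedged sketch of the two-plane (light slab) ray parameterisation.
import numpy as np

def light_slab_coords(origin, direction, z0=0.0, z1=1.0):
    """origin, direction: (..., 3) arrays; returns (..., 4) slab coordinates."""
    o, d = np.asarray(origin, float), np.asarray(direction, float)
    t0 = (z0 - o[..., 2]) / d[..., 2]        # ray parameter at the first plane
    t1 = (z1 - o[..., 2]) / d[..., 2]        # ray parameter at the second plane
    p0 = o + t0[..., None] * d
    p1 = o + t1[..., None] * d
    return np.concatenate([p0[..., :2], p1[..., :2]], axis=-1)   # (u, v, s, t)

# A ray through the origin along +z hits both planes at (0, 0).
print(light_slab_coords(np.array([0.0, 0.0, -1.0]), np.array([0.0, 0.0, 1.0])))
```

Because the coordinate is only 4D, it can index directly into feature grids, which is the property the paper exploits for fast training and rendering.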

PERF: Panoramic Neural Radiance Field from a Single Panorama

  • paper_url: http://arxiv.org/abs/2310.16831
  • repo_url: https://github.com/perf-project/PeRF
  • paper_authors: Guangcong Wang, Peng Wang, Zhaoxi Chen, Wenping Wang, Chen Change Loy, Ziwei Liu
  • for: 本文旨在提出一种基于单张全景图的 360 度新视角合成方法,使得无需耗费大量时间和精力采集图像,即可在复杂场景中实现 3D 漫游。
  • methods: 该方法通过协同 RGBD 修复与"修复-擦除"策略,将 2D 全景图提升为 3D 场景:首先基于单张全景图预测全景深度图,并利用体渲染重建可见的 3D 区域;随后在 NeRF 中引入协同 RGBD 修复,从随机视角补全 RGB 图像与深度图;最后采用修复-擦除策略,避免新采样视角与参考视角之间出现不一致的几何。
  • results: 该方法在 Replica 以及新构建的 PERF-in-the-wild 数据集上取得了优于现有最先进方法的性能,并可广泛用于全景图转 3D、文本转 3D 以及 3D 场景风格化等实际应用。
    Abstract Neural Radiance Field (NeRF) has achieved substantial progress in novel view synthesis given multi-view images. Recently, some works have attempted to train a NeRF from a single image with 3D priors. They mainly focus on a limited field of view with a few occlusions, which greatly limits their scalability to real-world 360-degree panoramic scenarios with large-size occlusions. In this paper, we present PERF, a 360-degree novel view synthesis framework that trains a panoramic neural radiance field from a single panorama. Notably, PERF allows 3D roaming in a complex scene without expensive and tedious image collection. To achieve this goal, we propose a novel collaborative RGBD inpainting method and a progressive inpainting-and-erasing method to lift up a 360-degree 2D scene to a 3D scene. Specifically, we first predict a panoramic depth map as initialization given a single panorama and reconstruct visible 3D regions with volume rendering. Then we introduce a collaborative RGBD inpainting approach into a NeRF for completing RGB images and depth maps from random views, which is derived from an RGB Stable Diffusion model and a monocular depth estimator. Finally, we introduce an inpainting-and-erasing strategy to avoid inconsistent geometry between a newly-sampled view and reference views. The two components are integrated into the learning of NeRFs in a unified optimization framework and achieve promising results. Extensive experiments on Replica and a new dataset PERF-in-the-wild demonstrate the superiority of our PERF over state-of-the-art methods. Our PERF can be widely used for real-world applications, such as panorama-to-3D, text-to-3D, and 3D scene stylization applications. Project page and code are available at https://perf-project.github.io/ and https://github.com/perf-project/PeRF.
    摘要 neural radiance field (NeRF) 已经取得了在新视图合成方面的显著进步,尤其是使用多视图图像。在这篇文章中,我们介绍了一种名为 PeRF 的全景新视图合成框架,可以从单个全景图像中训练一个全景 NeRF。与传统的方法不同的是,PeRF 允许无需昂贵和繁琐的图像采集,就能够在实际世界中360度的全景enario中进行3D镜头游走。为了实现这一目标,我们提出了一种新的协同 RGBD 填充方法和一种进行填充和取消填充的策略,以将2D场景提升到3D场景。具体来说,我们首先从单个全景图像中预测了全景深度图,并使用可视3D区域的volume rendering重建可见的3D区域。然后,我们引入了协同 RGBD 填充方法,以完成来自Random View的 RGB 图像和深度图的完善。最后,我们引入了一种填充和取消填充策略,以避免在新采集的视角和参考视角之间出现不一致的几何结构。这两种组件被集成到了 NeRF 的学习框架中,并实现了良好的结果。我们的 PeRF 可以广泛应用于实际应用场景,如全景图像到3D、文本到3D和3D场景风格化应用。项目页面和代码可以在 上找到。

CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images

  • paper_url: http://arxiv.org/abs/2310.16825
  • repo_url: https://github.com/mosaicml/diffusion
  • paper_authors: Aaron Gokaslan, A. Feder Cooper, Jasmine Collins, Landan Seguin, Austin Jacobson, Mihir Patel, Jonathan Frankle, Cory Stephenson, Volodymyr Kuleshov
  • for: 这个论文是用于训练一种基于开源扩散模型的文本到图像生成模型的。
  • methods: 这个论文使用了一种启发式传输学习技术来生成高质量的 sintetic caption,并使用了一种数据和计算效率的训练策略来训练模型。
  • results: 这个论文通过使用高质量的 sintetic caption和数据和计算效率的训练策略,成功地训练出一些高质量的文本到图像生成模型,并且这些模型的性能与LAION-2B数据集上训练的SD2模型相当。
    Abstract We assemble a dataset of Creative-Commons-licensed (CC) images, which we use to train a set of open diffusion models that are qualitatively competitive with Stable Diffusion 2 (SD2). This task presents two challenges: (1) high-resolution CC images lack the captions necessary to train text-to-image generative models; (2) CC images are relatively scarce. In turn, to address these challenges, we use an intuitive transfer learning technique to produce a set of high-quality synthetic captions paired with curated CC images. We then develop a data- and compute-efficient training recipe that requires as little as 3% of the LAION-2B data needed to train existing SD2 models, but obtains comparable quality. These results indicate that we have a sufficient number of CC images (~70 million) for training high-quality models. Our training recipe also implements a variety of optimizations that achieve ~3X training speed-ups, enabling rapid model iteration. We leverage this recipe to train several high-quality text-to-image models, which we dub the CommonCanvas family. Our largest model achieves comparable performance to SD2 on a human evaluation, despite being trained on our CC dataset that is significantly smaller than LAION and using synthetic captions for training. We release our models, data, and code at https://github.com/mosaicml/diffusion/blob/main/assets/common-canvas.md
    摘要 我团队 assemble 一个 Creative-Commons-licensed(CC)图像集,用于训练一些开放扩散模型,这些模型与 Stable Diffusion 2(SD2)相比质量相似。这个任务存在两个挑战:(1)高分辨率 CC 图像缺乏需要训练文本到图像生成模型的描述;(2) CC 图像相对罕见。为了解决这些挑战,我们使用一种直观转移学习技术生成高质量的 sintetic 描述,并将它们分配到我们精心选择的 CC 图像上。然后,我们开发了一种数据和计算效率高的训练方法,只需要使用 LAION-2B 数据的 3%,但可以获得相似的质量。这些结果表明我们有足够的 CC 图像(约 70 万) для训练高质量模型。我们的训练方法还实现了多种优化,实现了 ~3X 的训练速度提升,以便快速进行模型迭代。我们利用这种方法训练了一些高质量的文本到图像模型,我们称之为 CommonCanvas 家族。我们最大的模型在人类评估中与 SD2 相比,即使是在我们小于 LAION 的 CC 数据集上训练,也达到了相似的性能。我们将我们的模型、数据和代码发布在 GitHub 上,请参考 https://github.com/mosaicml/diffusion/blob/main/assets/common-canvas.md。

DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior

  • paper_url: http://arxiv.org/abs/2310.16818
  • repo_url: https://github.com/deepseek-ai/dreamcraft3d
  • paper_authors: Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, Yebin Liu
  • for: 这篇论文旨在提出一种基于引用图像的层次3D内容生成方法,以生成高品质和一致的3D对象。
  • methods: 作者使用了geometry sculpting和texture boosting两个阶段来生成3D对象,并通过score distillation sampling和 Bootstrapped Score Distillation来保证geometry和texture的一致性。
  • results: 通过 alternating optimization of the diffusion prior and 3D scene representation,作者实现了mutually reinforcing improvements,并达到了高品质的rendering和一致的3D对象生成。
    Abstract We present DreamCraft3D, a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects. We tackle the problem by leveraging a 2D reference image to guide the stages of geometry sculpting and texture boosting. A central focus of this work is to address the consistency issue that existing works encounter. To sculpt geometries that render coherently, we perform score distillation sampling via a view-dependent diffusion model. This 3D prior, alongside several training strategies, prioritizes the geometry consistency but compromises the texture fidelity. We further propose Bootstrapped Score Distillation to specifically boost the texture. We train a personalized diffusion model, Dreambooth, on the augmented renderings of the scene, imbuing it with 3D knowledge of the scene being optimized. The score distillation from this 3D-aware diffusion prior provides view-consistent guidance for the scene. Notably, through an alternating optimization of the diffusion prior and 3D scene representation, we achieve mutually reinforcing improvements: the optimized 3D scene aids in training the scene-specific diffusion model, which offers increasingly view-consistent guidance for 3D optimization. The optimization is thus bootstrapped and leads to substantial texture boosting. With tailored 3D priors throughout the hierarchical generation, DreamCraft3D generates coherent 3D objects with photorealistic renderings, advancing the state-of-the-art in 3D content generation. Code available at https://github.com/deepseek-ai/DreamCraft3D.
    摘要 我们介绍 DreamCraft3D,一种层次的三维内容生成方法,可以生成高准确性和一致的三维对象。我们解决了一致性问题,通过利用二维参考图来导航几个阶段的几何雕刻和текстура增强。我们的中心焦点是解决现有工作中的一致性问题,通过视图依赖的扩散模型进行得分采样。这个三维优先, alongside 多种训练策略,优先几何一致性,但是妥协текстура准确性。我们还提议了 Bootstrapped Score Distillation,用于特性增强。我们在增强的场景下训练了个性化的扩散模型, Dreambooth,使其具有场景的3D知识。得分采样从这个3D-意识扩散先进提供了视图一致的指导,为场景优化提供了视图一致的指导。通过场景优化和扩散先进的交互优化,我们实现了相互扶持的改进:优化的3D场景帮助了场景特定的扩散模型训练,这些模型提供了越来越视图一致的指导,以便为3D优化提供越来越高的指导。因此,我们可以通过层次的生成来生成一致的3D对象,并且可以提高图像的精度和一致性,从而提高3D内容生成的状态。代码可以在https://github.com/deepseek-ai/DreamCraft3D 中找到。

Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation

  • paper_url: http://arxiv.org/abs/2310.16809
  • repo_url: https://github.com/scut-dlvclab/gpt-4v_ocr
  • paper_authors: Yongxin Shi, Dezhi Peng, Wenhui Liao, Zening Lin, Xinhong Chen, Chongyu Liu, Yuyi Zhang, Lianwen Jin
  • for: The paper evaluates the Optical Character Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model (LMM), and assesses its performance across a range of OCR tasks.
  • methods: The paper uses a comprehensive evaluation pipeline to test the model's performance on scene text recognition, handwritten text recognition, handwritten mathematical expression recognition, table structure recognition, and information extraction from visually-rich documents.
  • results: The evaluation reveals that GPT-4V recognizes and understands Latin content well, but struggles with multilingual scenarios and complex tasks such as handwritten mathematical expression recognition, table structure recognition, and end-to-end semantic entity recognition and pair extraction from document images.
    Abstract This paper presents a comprehensive evaluation of the Optical Character Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model (LMM). We assess the model's performance across a range of OCR tasks, including scene text recognition, handwritten text recognition, handwritten mathematical expression recognition, table structure recognition, and information extraction from visually-rich document. The evaluation reveals that GPT-4V performs well in recognizing and understanding Latin contents, but struggles with multilingual scenarios and complex tasks. Specifically, it showed limitations when dealing with non-Latin languages and complex tasks such as handwriting mathematical expression recognition, table structure recognition, and end-to-end semantic entity recognition and pair extraction from document image. Based on these observations, we affirm the necessity and continued research value of specialized OCR models. In general, despite its versatility in handling diverse OCR tasks, GPT-4V does not outperform existing state-of-the-art OCR models. How to fully utilize pre-trained general-purpose LMMs such as GPT-4V for OCR downstream tasks remains an open problem. The study offers a critical reference for future research in OCR with LMMs. Evaluation pipeline and results are available at https://github.com/SCUT-DLVCLab/GPT-4V_OCR.
    摘要 The evaluation reveals that GPT-4V performs well on recognizing and understanding Latin contents, but encounters challenges in multilingual scenarios and complex tasks. Specifically, it struggles with non-Latin languages and tasks such as handwriting mathematical expression recognition, table structure recognition, and end-to-end semantic entity recognition and pair extraction from document images. These findings highlight the importance and continued value of specialized OCR models.Although GPT-4V can handle diverse OCR tasks, it does not outperform existing state-of-the-art OCR models. The study underscores the need to explore how to fully utilize pre-trained general-purpose LMMs like GPT-4V for OCR downstream tasks. The research provides a valuable reference for future OCR research with LMMs. The evaluation pipeline and results are available on GitHub at .
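
OCR evaluations of this kind are typically scored with edit-distance metrics such as character error rate. The sketch below is a plain CER implementation for reference; it is a common choice, not necessarily the exact metric used in the paper's released pipeline.

```python
# Hedged sketch: character error rate (Levenshtein distance / reference length).
def cer(reference: str, hypothesis: str) -> float:
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

print(cer("ground truth", "ground trut"))   # one missing character -> ~0.083
```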

Fingervein Verification using Convolutional Multi-Head Attention Network

  • paper_url: http://arxiv.org/abs/2310.16808
  • repo_url: None
  • paper_authors: Raghavendra Ramachandra, Sushma Venkatesh
  • for: 这个论文是为了提出一种基于 convolutional multihead attention network 的新型指静脉识别方法,以提高指静脉识别的精度和可靠性。
  • methods: 该方法使用了 VeinAtnNet 网络,该网络通过EXTRACTING DISCRIMINANT INFORMATION FROM BOTH NORMAL AND ENHANCED FINGERVEIN IMAGES来提取指静脉图像中的特征信息,并通过一种小型的学习参数来减轻网络的计算负担。
  • results: 在新收集的 FV-300 数据集和公共可用的 FV-USM 和 FV-PolyU 指静脉数据集上,提出的方法与五种现有的指静脉识别系统进行比较,并显示出提案的 VeinAtnNet 方法的高效性。
    Abstract Biometric verification systems are deployed in various security-based access-control applications that require user-friendly and reliable person verification. Among the different biometric characteristics, fingervein biometrics have been extensively studied owing to their reliable verification performance. Furthermore, fingervein patterns reside inside the skin and are not visible outside; therefore, they possess inherent resistance to presentation attacks and degradation due to external factors. In this paper, we introduce a novel fingervein verification technique using a convolutional multihead attention network called VeinAtnNet. The proposed VeinAtnNet is designed to achieve light weight with a smaller number of learnable parameters while extracting discriminant information from both normal and enhanced fingervein images. The proposed VeinAtnNet was trained on the newly constructed fingervein dataset with 300 unique fingervein patterns that were captured in multiple sessions to obtain 92 samples per unique fingervein. Extensive experiments were performed on the newly collected dataset FV-300 and the publicly available FV-USM and FV-PolyU fingervein dataset. The performance of the proposed method was compared with five state-of-the-art fingervein verification systems, indicating the efficacy of the proposed VeinAtnNet.
    摘要 生物特征验证系统在安全访问控制应用中广泛应用,需要易于使用和可靠的人验证。 Among different生物特征,手掌血管网的验证性能得到了广泛的研究,因为它们在皮肤内部存在,不可见于外部,因此具有内生的抗击损和抗伪攻击性。 在本文中,我们提出了一种基于卷积多头注意网络的新型手掌血管验证技术,称为VeinAtnNet。 提出的VeinAtnNet设计用较小的学习参数量和批处理数据来提取特征信息,并且可以在不同的手掌血管图像下提取有效的特征信息。 我们使用新建的手掌血管数据集,包含300个唯一的手掌血管图像,在多个会话中采集到92个样本。 我们对新收集的FV-300数据集和公共可用的FV-USM和FV-PolyU手掌血管数据集进行了广泛的实验,并与5种现有的手掌血管验证系统进行了比较。 实验结果表明,提出的方法可以提供高效的手掌血管验证。

The GOOSE Dataset for Perception in Unstructured Environments

  • paper_url: http://arxiv.org/abs/2310.16788
  • repo_url: None
  • paper_authors: Peter Mortimer, Raphael Hagmanns, Miguel Granero, Thorsten Luettel, Janko Petereit, Hans-Joachim Wuensche
  • for: 这个研究是为了提高自动化系统在无结构的开放空间环境中的感知和解释能力。
  • methods: 这个研究使用了深度学习技术,将数据分为10,000对描述像和点云数据,并将其用于训练多种现有的分类模型。
  • results: 这个研究获得了一个名为GOOSE的大量数据集,并提供了一个开源的数据集、ontology для无结构的地形,以及数据集的标准和指引。这个initiative可以帮助建立一个通用的框架,并且实现将现有的数据集和模型融合,以提高各种无结构环境中的感知能力。
    Abstract The potential for deploying autonomous systems can be significantly increased by improving the perception and interpretation of the environment. However, the development of deep learning-based techniques for autonomous systems in unstructured outdoor environments poses challenges due to limited data availability for training and testing. To address this gap, we present the German Outdoor and Offroad Dataset (GOOSE), a comprehensive dataset specifically designed for unstructured outdoor environments. The GOOSE dataset incorporates 10 000 labeled pairs of images and point clouds, which are utilized to train a range of state-of-the-art segmentation models on both image and point cloud data. We open source the dataset, along with an ontology for unstructured terrain, as well as dataset standards and guidelines. This initiative aims to establish a common framework, enabling the seamless inclusion of existing datasets and a fast way to enhance the perception capabilities of various robots operating in unstructured environments. The dataset, pre-trained models for offroad perception, and additional documentation can be found at https://goose-dataset.de/.

S$^3$-TTA: Scale-Style Selection for Test-Time Augmentation in Biomedical Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.16783
  • repo_url: None
  • paper_authors: Kangxian Xie, Siyu Huang, Sebastian Cajas Ordone, Hanspeter Pfister, Donglai Wei
  • for: 提高生物医学图像分割 task 的泛化能力
  • methods: 使用 S$^3$-TTA 框架为每张测试图像选择合适的图像尺度和风格,并构建端到端的增广-分割联合训练管道
  • results: 在公共 benchmark 上,S$^3$-TTA 对 cell 和 lung 分割 task 进行了3.4% 和 1.3% 的提高,仅通过在测试阶段对输入数据进行扩展。
    Abstract Deep-learning models have been successful in biomedical image segmentation. To generalize for real-world deployment, test-time augmentation (TTA) methods are often used to transform the test image into different versions that are hopefully closer to the training domain. Unfortunately, due to the vast diversity of instance scale and image styles, many augmented test images produce undesirable results, thus lowering the overall performance. This work proposes a new TTA framework, S$^3$-TTA, which selects the suitable image scale and style for each test image based on a transformation consistency metric. In addition, S$^3$-TTA constructs an end-to-end augmentation-segmentation joint-training pipeline to ensure a task-oriented augmentation. On public benchmarks for cell and lung segmentation, S$^3$-TTA demonstrates improvements over the prior art by 3.4% and 1.3%, respectively, by simply augmenting the input data in testing phase.
    摘要 深度学习模型在生物医学影像分割中取得了成功。为了在真实世界中广泛应用,测试时增强(TTA)方法通常用于在测试图像上进行不同版本的转换,以期望更近于训练领域。然而,由于图像实例尺寸和风格的巨大多样性,许多增强后的测试图像会导致不愿意的结果,从而降低总性能。本工作提出了一种新的TTA框架,S$^3$-TTA,该框架可以根据图像测试阶段的变换一致度指标选择适合的图像尺寸和风格。此外,S$^3$-TTA还构建了端到端增强-分割共训练管道,以确保任务导向的增强。在公共基准测试数据集上,S$^3$-TTA比依据测试阶段增强的先前艺术提高了3.4%和1.3%。
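
The scale-selection idea can be illustrated with a small, hedged sketch: the segmenter is run at several scales, each prediction is resized back to the input resolution, and the scale whose prediction agrees best with the mean is kept. The agreement score below is a simple stand-in for the transformation-consistency metric described in the abstract, and the scale set is an assumption.

```python
# Hedged sketch of consistency-based scale selection for test-time augmentation.
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_scale(model, image, scales=(0.75, 1.0, 1.25)):
    """image: (1, 3, H, W). Returns the chosen scale and its soft prediction."""
    h, w = image.shape[-2:]
    preds = []
    for s in scales:
        resized = F.interpolate(image, scale_factor=s, mode="bilinear",
                                align_corners=False)
        logits = model(resized)                                  # (1, C, h', w')
        probs = F.interpolate(logits.softmax(dim=1), size=(h, w),
                              mode="bilinear", align_corners=False)
        preds.append(probs)
    stack = torch.stack(preds)                                   # (S, 1, C, H, W)
    mean = stack.mean(dim=0, keepdim=True)
    agreement = -(stack - mean).pow(2).mean(dim=(1, 2, 3, 4))    # higher is better
    best = int(agreement.argmax())
    return scales[best], preds[best]
```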

MixerFlow for Image Modelling

  • paper_url: http://arxiv.org/abs/2310.16777
  • repo_url: None
  • paper_authors: Eshant English, Matthias Kirchler, Christoph Lippert
  • for: 图像模型的建模和数据生成
  • methods: 基于MLP-Mixer架构的 MixerFlow 模型,实现Weight共享和流式模型的结合
  • results: 在固定计算资源下,MixerFlow 在图像数据集上表现出更好的密度估计,并且随着图像分辨率的增加,性能逐渐提高,成为Glow-based架构的有力且简单的替代方案。同时,MixerFlow 还提供了更有用的嵌入特征比Glow-based架构。
    Abstract Normalising flows are statistical models that transform a complex density into a simpler density through the use of bijective transformations enabling both density estimation and data generation from a single model. In the context of image modelling, the predominant choice has been the Glow-based architecture, whereas alternative architectures remain largely unexplored in the research community. In this work, we propose a novel architecture called MixerFlow, based on the MLP-Mixer architecture, further unifying the generative and discriminative modelling architectures. MixerFlow offers an effective mechanism for weight sharing for flow-based models. Our results demonstrate better density estimation on image datasets under a fixed computational budget and scales well as the image resolution increases, making MixeFlow a powerful yet simple alternative to the Glow-based architectures. We also show that MixerFlow provides more informative embeddings than Glow-based architectures.
    摘要 通常的流变换是一种统计模型,它将复杂的概率变换成更简单的概率,通过使用射影函数,以便 Both density estimation和数据生成从同一模型中。在图像模型中,主流选择是基于Glow架构的,而其他架构在研究community中尚未得到广泛的探索。在这项工作中,我们提出了一种新的架构called MixerFlow,基于MLP-Mixer架构,并将生成和识别模型架构进行统一。MixerFlow提供了有效的权重共享机制 для流变换模型。我们的结果表明,MixerFlow在图像数据集上的静止计算资源下可以更好地估计概率,并且随着图像分辨率的增加,MixerFlow的性能逐渐提高,使其成为一种强大 yet simple的Glow-based架构的替代方案。此外,我们还证明了MixerFlow提供了更有用的嵌入than Glow-based架构。

ConvNets Match Vision Transformers at Scale

  • paper_url: http://arxiv.org/abs/2310.16764
  • repo_url: https://github.com/kyegomez/ConvNet
  • paper_authors: Samuel L. Smith, Andrew Brock, Leonard Berrada, Soham De
  • for: 这个论文主要用于检验 ConvNet 是否在大型数据集上表现出色,并评估不同计算预算下 ConvNet 的表现。
  • methods: 该论文使用 JFT-4B 大量标注图像数据集进行预训练,并从 NFNet 模型家族中选择不同的深度和宽度来训练不同的网络。
  • results: 研究发现在不同计算预算下,ConvNet 会与 Vision Transformers 具有相似的表现,而且在 ImageNet 上进行精度训练后,NFNets 可以达到 Reported 性能水平。最终的精度训练结果为 Top-1 精度为 90.4%。
    Abstract Many researchers believe that ConvNets perform well on small or moderately sized datasets, but are not competitive with Vision Transformers when given access to datasets on the web-scale. We challenge this belief by evaluating a performant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset of images often used for training foundation models. We consider pre-training compute budgets between 0.4k and 110k TPU-v4 core compute hours, and train a series of networks of increasing depth and width from the NFNet model family. We observe a log-log scaling law between held out loss and compute budget. After fine-tuning on ImageNet, NFNets match the reported performance of Vision Transformers with comparable compute budgets. Our strongest fine-tuned model achieves a Top-1 accuracy of 90.4%.
    摘要 许多研究人员认为卷积网络在小或中等规模的数据集上表现良好,但不能与视觉转换器在大规模数据集上竞争。我们挑战这个信念,通过评估一种高性能的卷积网络架构,预训练在JFT-4B大量标签图像集上。我们在预训练计算预算为0.4k至110k TPU-v4核心计算时间之间考虑了一系列不同的网络模型,从NFNet家族中选择了一些网络。我们发现了对保留的损失和计算预算之间的对数对应关系。经过精度调整后,我们的最佳精度调整模型在ImageNet上达到了90.4%的顶部一致率。
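
The reported log-log scaling law can be fitted with a one-line regression in log space, i.e. loss ≈ a · compute^b. The sketch below uses synthetic placeholder numbers, not values from the paper, purely to show the fitting step.

```python
# Hedged sketch: fit a power law between pre-training compute and held-out loss.
import numpy as np

compute = np.array([0.4e3, 1.6e3, 6.4e3, 25.6e3, 110e3])   # TPU-v4 core hours
loss = np.array([2.10, 1.85, 1.66, 1.52, 1.41])            # synthetic placeholders

slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
a, b = np.exp(intercept), slope
print(f"loss ~= {a:.3f} * compute^{b:.3f}")

# Extrapolate to a larger budget (illustration only).
print(a * 4e5 ** b)
```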

SonoSAM – Segment Anything on Ultrasound Images

  • paper_url: http://arxiv.org/abs/2310.16872
  • repo_url: None
  • paper_authors: Hariharan Ravishankar, Rohan Patil, Vikram Melapudi, Parminder Bhatia, Kass-Hout Taha, Pavan Annangi
  • for: 这个论文旨在提出一种可靠、通用的基础模型,用于分割ultrasound图像中的对象。
  • methods: 该模型基于一个丰富、多样的对象集,并通过精心调整和学习来实现高性能。
  • results: 模型在8个未看过的ultrasound数据集上表现出色,与其他方法相比,在所有评价指标上都具有显著的优势。
    Abstract In this paper, we present SonoSAM - a promptable foundational model for segmenting objects of interest on ultrasound images. Fine-tuned exclusively on a rich, diverse set of objects from roughly 200k ultrasound image-mask pairs, SonoSAM demonstrates state-of-the-art performance on 8 unseen ultrasound data-sets, outperforming competing methods by a significant margin on all metrics of interest. SonoSAM achieves average dice similarity score of more than 90% on almost all test datasets within 2-6 clicks on an average, making it a valuable tool for annotating ultrasound images. We also extend SonoSAM to 3-D (2-D +t) applications and demonstrate superior performance making it a valuable tool for generating dense annotations from ultrasound cine-loops. Further, to increase practical utility of SonoSAM, we propose a two-step process of fine-tuning followed by knowledge distillation to a smaller footprint model without comprising the performance. We present detailed qualitative and quantitative comparisons of SonoSAM with state-of-the art methods showcasing efficacy of SonoSAM as one of the first reliable, generic foundational model for ultrasound.
    摘要 在这篇论文中,我们介绍了SonoSAM - 一个可提Prompt的基础模型,用于分割ultrasound图像中的对象。通过仅在200k ultrasound图像-mask对中进行微调,SonoSAM在8个未看过的ultrasound数据集上达到了最新的性能水平,比竞争方法在所有关键指标上都高于它们。SonoSAM在大多数测试数据集上的average dice相似性分数超过90%,只需2-6个键击,使其成为对ultrasound图像进行标注的有价值工具。此外,我们还扩展了SonoSAM到3D(2D+t)应用程序,并证明其在生成密集注释方面表现出色,使其成为对ultrasound cinema-loop进行密集注释的有价值工具。为了提高SonoSAM的实际使用性,我们提议了一种微调后followed by knowledge distillation的两步过程,以降低模型的尺寸不减其性能。我们提供了详细的量化和质量比较,展示了SonoSAM作为ultrasound领域的首个可靠、通用基础模型的效果。

CAD – Contextual Multi-modal Alignment for Dynamic AVQA

  • paper_url: http://arxiv.org/abs/2310.16754
  • repo_url: None
  • paper_authors: Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa
  • for: 提高 Audio Visual Question Answering (AVQA) 任务中的表现。
  • methods: 提出一种 Contextual Multi-modal Alignment (CAD) 网络,解决现有 AVQA 方法中的两大缺点: Audio-Visual (AV) 信息在网络中不匹配的 Spatial 和 Temporal 两个水平,以及 Audio 和 Visual 模式之间的 Semantic 信息在 контекст中不均衡。
  • results: 在 MUSIC-AVQA 数据集上,CAD 网络比状态艺术方法提高了平均性能,提高了9.4%。同时,我们还证明了我们的提案可以让现有 AVQA 方法中的表现得到改善,无需增加复杂度要求。
    Abstract In the context of Audio Visual Question Answering (AVQA) tasks, the audio visual modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings; the audio-visual (AV) information passing through the network isn't aligned on Spatial and Temporal levels; and, inter-modal (audio and visual) Semantic information is often not balanced within a context; this results in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses the challenges in AVQA methods by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment on the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment on Temporal level in a self-supervised setting, and iii) introducing a cross-attention mechanism to balance audio and visual information on Semantic level. The proposed novel CAD network improves the overall performance over the state-of-the-art methods on average by 9.4% on the MUSIC-AVQA dataset. We also demonstrate that our proposed contributions to AVQA can be added to the existing methods to improve their performance without additional complexity requirements.
    摘要 在听视问答(AVQA)任务上,听视模态可以学习在三个水平:1)空间、2)时间和3)Semantic。现有的AVQA方法受到两大缺点的影响:听视信息在网络中不匹配的空间和时间水平,以及听视模态之间的Semantic信息在一个上下文中不均衡。这导致了表现不佳。在本文中,我们提出了一种新的综合多模态对齐(CAD)网络,解决了AVQA方法中的挑战。我们的方法包括:1)引入无参数的随机Contextual块,确保听视对齐的空间水平是Robust的。2)提出了一种在无监督下自适应的听视对齐策略,以解决听视模态之间的时间水平的匹配问题。3)引入了一种cross-attention机制,以协调听视信息的Semantic水平。我们的CAD网络在MUSIC-AVQA数据集上的平均提高了state-of-the-art方法的性能,提高了9.4%。此外,我们还证明了我们的提议可以追加到现有的方法中,无需增加复杂性。
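
The semantic-level balancing via cross-attention can be sketched with standard multi-head attention: audio tokens attend to visual tokens and vice versa, and the two context vectors are fused. The dimensions, pooling, and final fusion layer below are illustrative assumptions, not the CAD architecture.

```python
# Hedged sketch of bidirectional audio-visual cross-attention.
import torch
import torch.nn as nn

class AVCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, audio, video):
        """audio: (B, Ta, D) tokens, video: (B, Tv, D) tokens."""
        a_ctx, _ = self.a2v(query=audio, key=video, value=video)   # audio attends to video
        v_ctx, _ = self.v2a(query=video, key=audio, value=audio)   # video attends to audio
        pooled = torch.cat([a_ctx.mean(dim=1), v_ctx.mean(dim=1)], dim=-1)
        return self.fuse(pooled)                                    # (B, D) joint feature

model = AVCrossAttention()
print(model(torch.randn(2, 10, 256), torch.randn(2, 20, 256)).shape)  # [2, 256]
```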

Metrically Scaled Monocular Depth Estimation through Sparse Priors for Underwater Robots

  • paper_url: http://arxiv.org/abs/2310.16750
  • repo_url: https://github.com/ebnerluca/uw_depth
  • paper_authors: Luca Ebner, Gideon Billings, Stefan Williams
  • for: 这篇论文targets the problem of real-time dense depth estimation from monocular underwater images for mobile underwater vehicles.
  • methods: The paper proposes a deep learning model that fuses sparse depth measurements from triangulated features to improve depth predictions and solve the scale ambiguity problem. The model uses an efficient encoder-decoder backbone and modern lightweight transformer optimization stage to encode global context.
  • results: The proposed method achieves significant improvement in depth prediction accuracy by fusing sparse feature priors, and achieves similar accuracy on a downward-looking dataset without any retraining. The method runs at 160 FPS on a laptop GPU and 7 FPS on a single CPU core, making it suitable for direct deployment on embedded systems.
    Abstract In this work, we address the problem of real-time dense depth estimation from monocular images for mobile underwater vehicles. We formulate a deep learning model that fuses sparse depth measurements from triangulated features to improve the depth predictions and solve the problem of scale ambiguity. To allow prior inputs of arbitrary sparsity, we apply a dense parameterization method. Our model extends recent state-of-the-art approaches to monocular image based depth estimation, using an efficient encoder-decoder backbone and modern lightweight transformer optimization stage to encode global context. The network is trained in a supervised fashion on the forward-looking underwater dataset, FLSea. Evaluation results on this dataset demonstrate significant improvement in depth prediction accuracy by the fusion of the sparse feature priors. In addition, without any retraining, our method achieves similar depth prediction accuracy on a downward looking dataset we collected with a diver operated camera rig, conducting a survey of a coral reef. The method achieves real-time performance, running at 160 FPS on a laptop GPU and 7 FPS on a single CPU core and is suitable for direct deployment on embedded systems. The implementation of this work is made publicly available at https://github.com/ebnerluca/uw_depth.
    摘要 在这个工作中,我们解决了来自单摄像头图像的实时稠密深度估计问题 для移动式水下潜水器。我们设计了一种深度学习模型,该模型将 sparse depth measurement from triangulated features fusion 以提高深度预测和解决尺度 ambiguity 问题。为了允许不同的稀疏性输入,我们应用了密集参数化方法。我们的模型基于最近的状态开头的方法,使用高效的编码器-解码器脊梁和现代轻量级 transformer 优化阶段来编码全球上下文。网络在监督模式下在 forward-looking 水下数据集上训练,评估结果表明,通过稀疏特征先导的深度预测得到了显著改善。此外,没有任何重新训练,我们的方法在一个下降看到的数据集上也实现了类似的深度预测精度。此方法在实时性方面表现出色,在笔记型 GPU 上运行速度达 160 FPS,单 CPU 核心上运行速度为 7 FPS,适用于直接部署在嵌入式系统上。我们在 https://github.com/ebnerluca/uw_depth 上公开了实现。
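
The scale-ambiguity problem the paper addresses has a simple baseline worth showing: align a relative monocular prediction to sparse metric depths with a single robust scale factor. The sketch below is that baseline only, under illustrative names; the paper instead learns to fuse the sparse priors inside the network.

```python
# Hedged sketch: global scale alignment of relative depth to sparse metric depth.
import numpy as np

def scale_to_metric(pred_depth, sparse_depth, valid_mask):
    """pred_depth: (H, W) relative depth; sparse_depth holds metric values
    only where valid_mask is True (e.g. at triangulated feature points)."""
    p = pred_depth[valid_mask]
    s = sparse_depth[valid_mask]
    scale = np.median(s / np.clip(p, 1e-6, None))    # robust single-scale fit
    return scale * pred_depth

# Toy check: the prediction is metric depth divided by an unknown factor of 3.
true = np.random.uniform(1.0, 10.0, size=(4, 4))
pred = true / 3.0
mask = np.zeros_like(true, dtype=bool)
mask[::2, ::2] = True
print(np.allclose(scale_to_metric(pred, true, mask), true))   # True
```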

Interferometric Neural Networks

  • paper_url: http://arxiv.org/abs/2310.16742
  • repo_url: https://github.com/arunsehrawat/interferometric-neural-networks
  • paper_authors: Arun Sehrawat
  • for: 本研究旨在探讨人工神经网络和干涉仪器的结合,并将其应用于机器学习和优化领域。
  • methods: 本研究使用干涉仪器构建了基于神经网络的生成对抗网络,而这些神经网络没有任何类传呈层,可以在量子计算机或光子芯片上实现。
  • results: 研究表明,这种方法可以用于解决 combinatorial optimization 问题,并在多类图像识别任务中达到了93%和83%的准确率。此外,它还可以生成数字0-9和人脸图像。
    Abstract On the one hand, artificial neural networks have many successful applications in the field of machine learning and optimization. On the other hand, interferometers are integral parts of any field that deals with waves such as optics, astronomy, and quantum physics. Here, we introduce neural networks composed of interferometers and then build generative adversarial networks from them. Our networks do not have any classical layer and can be realized on quantum computers or photonic chips. We demonstrate their applicability for combinatorial optimization, image classification, and image generation. For combinatorial optimization, our network consistently converges to the global optimum or remains within a narrow range of it. In multi-class image classification tasks, our networks achieve accuracies of 93% and 83%. Lastly, we show their capability to generate images of digits from 0 to 9 as well as human faces.
    摘要 一方面,人工神经网络在机器学习和优化领域有很多成功应用。另一方面,湍波器是任何波动领域的重要组件,包括光学、天文学和量子物理。我们在这里引入神经网络中的湍波器,然后建立基于这些湍波器的生成敌对网络。我们的网络没有任何经典层,可以在量子计算机或光子芯片上实现。我们展示了它们在组合优化、图像分类和图像生成等方面的可行性。在组合优化任务中,我们的网络一般 converge 到全局最优或在一个窄范围内停止。在多类图像分类任务中,我们的网络达到了93%和83%的准确率。最后,我们展示了它们可以生成从0到9的数字图像以及人脸。

A No-Reference Quality Assessment Method for Digital Human Head

  • paper_url: http://arxiv.org/abs/2310.16732
  • repo_url: None
  • paper_authors: Yingjie Zhou, Zicheng Zhang, Wei Sun, Xiongkuo Min, Xianghe Ma, Guangtao Zhai
  • for: 这篇论文是关于数字人质量评估(DHQA)的研究,目的是提出一种基于Transformer的无参考评估方法,以解决数字人在生成和传输过程中可能出现的多种扭曲和质量下降问题。
  • methods: 该方法使用了 Rendering 技术将数字人的前2D投影作为输入,然后使用视transformer(ViT)进行特征提取,并设计了一个多任务模块以同时分类扭曲类型和预测数字人的 perceived quality 水平。
  • results: 实验结果表明,提出的方法与人工评估得分高度相关,并且在比较现有评估方法时表现出色。
    Abstract In recent years, digital humans have been widely applied in augmented/virtual reality (A/VR), where viewers are allowed to freely observe and interact with the volumetric content. However, the digital humans may be degraded with various distortions during the procedure of generation and transmission. Moreover, little effort has been put into the perceptual quality assessment of digital humans. Therefore, it is urgent to carry out objective quality assessment methods to tackle the challenge of digital human quality assessment (DHQA). In this paper, we develop a novel no-reference (NR) method based on Transformer to deal with DHQA in a multi-task manner. Specifically, the front 2D projections of the digital humans are rendered as inputs and the vision transformer (ViT) is employed for the feature extraction. Then we design a multi-task module to jointly classify the distortion types and predict the perceptual quality levels of digital humans. The experimental results show that the proposed method well correlates with the subjective ratings and outperforms the state-of-the-art quality assessment methods.
    摘要 在最近的几年里,数字人类在扩展/虚拟现实(A/VR)领域广泛应用,让观众可以自由观看和与三维内容互动。然而,数字人类可能在生成和传输过程中受到多种扭曲的影响。此外,对数字人类的主观质量评估没有充分的努力。因此,需要开发一种无参考(NR)方法,以便在多任务方式下进行数字人类质量评估(DHQA)。在这篇论文中,我们提出了一种基于变换器的新方法,以解决DHQA的挑战。具体来说,我们将数字人类的前2D投影作为输入,并使用视传送器(ViT)进行特征提取。然后,我们设计了一个多任务模块,以同时类型化扭曲和预测数字人类的主观质量水平。实验结果表明,我们的方法与主观评估结果高度相关,并超越了现有的质量评估方法。

Rebuild City Buildings from Off-Nadir Aerial Images with Offset-Building Model (OBM)

  • paper_url: http://arxiv.org/abs/2310.16717
  • repo_url: None
  • paper_authors: Kai Li, Yupeng Deng, Yunlong Kong, Diyou Liu, Jingbo Chen, Yu Meng, Junxian Ma
  • for: 这个研究的目的是提出一个互动式Transformer模型,用于精确地测量高解析 remote sensing 图像中的建筑物偏移。
  • methods: 这个模型使用了一个互动式Transformer模型,与一个启发器Encoder,以精确地测量建筑物偏移。它还使用了一个ROAM模块,用于解决常见的预测建筑物偏移的问题。
  • results: 这个研究在公开available的BONAI dataset上进行了试验,实现了显著对Prompt-Instance-Level偏移Error的减少,从14.6%到16.3%。此外,这个研究也开发了一个适合大规模建筑物偏移的Distance-NMS算法,对预测建筑物偏移角度和长度进行了重要改善。
    Abstract Accurate measurement of the offset from roof-to-footprint in very-high-resolution remote sensing imagery is crucial for urban information extraction tasks. With the help of deep learning, existing methods typically rely on two-stage CNN models to extract regions of interest on building feature maps. At the first stage, a Region Proposal Network (RPN) is applied to extract thousands of ROIs (Region of Interests) which will post-imported into a Region-based Convolutional Neural Networks (RCNN) to extract wanted information. However, because of inflexible RPN, these methods often lack effective user interaction, encounter difficulties in instance correspondence, and struggle to keep up with the advancements in general artificial intelligence. This paper introduces an interactive Transformer model combined with a prompt encoder to precisely extract building segmentation as well as the offset vectors from roofs to footprints. In our model, a powerful module, namely ROAM, was tailored for common problems in predicting roof-to-footprint offsets. We tested our model's feasibility on the publicly available BONAI dataset, achieving a significant reduction in Prompt-Instance-Level offset errors ranging from 14.6% to 16.3%. Additionally, we developed a Distance-NMS algorithm tailored for large-scale building offsets, significantly enhancing the accuracy of predicted building offset angles and lengths in a straightforward and efficient manner. To further validate the model's robustness, we created a new test set using 0.5m remote sensing imagery from Huizhou, China, for inference testing. Our code, training methods, and the updated dataset will be accessable at https://github.com/likaiucas.
    摘要 准确测量房屋缘界与地面的偏移量在高分辨率Remote sensing影像中是城市信息提取任务中非常重要的。通过深度学习,现有方法通常是通过两个阶段Convolutional Neural Networks (CNN)模型来提取建筑特征图像中的区域兴趣点。在第一阶段,一个Region Proposal Network (RPN)被应用以提取数以千计的ROIs(区域兴趣点),然后将其导入到基于区域的Convolutional Neural Networks (RCNN)中进行信息提取。然而,由于不灵活的RPN,这些方法经常缺乏有效的用户互动,遇到实例匹配的问题,并且难以保持总人工智能的提高。这篇论文介绍了一种交互式Transformer模型,并与一个提示编码器结合,以准确提取建筑分割以及缘界到地面的偏移量。在我们的模型中,一个专门为建筑问题设计的强大模块,即ROAM,用于解决通用的预测缘界到地面的偏移量问题。我们在公共可用的BONAI数据集上测试了我们的模型的可行性,实现了对Prompt-Instance-Level偏移量的显著减少,范围为14.6%至16.3%。此外,我们开发了一种适合大规模建筑偏移量的Distance-NMS算法,大幅提高了预测建筑偏移角度和长度的准确率,并且这是一种简单、有效的方式。为了进一步验证我们的模型的稳定性,我们创建了一个使用0.5米Remote sensing影像的新测试集,用于对模型进行推理测试。我们的代码、训练方法和更新的数据集将在https://github.com/likaiucas中公开。

Nighttime Driver Behavior Prediction Using Taillight Signal Recognition via CNN-SVM Classifier

  • paper_url: http://arxiv.org/abs/2310.16706
  • repo_url: https://github.com/deepcar/taillight_recognition
  • paper_authors: Amir Hossein Barshooi, Elmira Bagheri
  • for: 这种研究的目的是提高夜间驾驶行为预测能力,通过识别人驾和自动驾车 taillight。
  • methods: 提出的模型使用了自定义的探测器,通过提取输入图像的深度特征,并对每个特征计算数据稀缺性。然后,通过引入 soft attention 的权重binary mask,使模型更加注重预先确定的区域。最后,使用 Convolutional Neural Networks (CNNs) 提取特征,并使用 Principal Component Analysis (PCA) 减少维度。
  • results: 实验结果显示,提出的方法可以准确地分类夜间驾驶行为,具体的结果为92.14%的准确率、97.38%的特殊性、92.09%的敏感度、92.10%的 F1-度量和0.895的科恩 statistic。
    Abstract This paper aims to enhance the ability to predict nighttime driving behavior by identifying taillights of both human-driven and autonomous vehicles. The proposed model incorporates a customized detector designed to accurately detect front-vehicle taillights on the road. At the beginning of the detector, a learnable pre-processing block is implemented, which extracts deep features from input images and calculates the data rarity for each feature. In the next step, drawing inspiration from soft attention, a weighted binary mask is designed that guides the model to focus more on predetermined regions. This research utilizes Convolutional Neural Networks (CNNs) to extract distinguishing characteristics from these areas, then reduces dimensions using Principal Component Analysis (PCA). Finally, the Support Vector Machine (SVM) is used to predict the behavior of the vehicles. To train and evaluate the model, a large-scale dataset is collected from two types of dash-cams and Insta360 cameras from the rear view of Ford Motor Company vehicles. This dataset includes over 12k frames captured during both daytime and nighttime hours. To address the limited nighttime data, a unique pixel-wise image processing technique is implemented to convert daytime images into realistic night images. The findings from the experiments demonstrate that the proposed methodology can accurately categorize vehicle behavior with 92.14% accuracy, 97.38% specificity, 92.09% sensitivity, 92.10% F1-measure, and 0.895 Cohen's Kappa Statistic. Further details are available at https://github.com/DeepCar/Taillight_Recognition.
    摘要 To train and evaluate the model, a large-scale dataset is collected from two types of dash-cams and Insta360 cameras from the rear view of Ford Motor Company vehicles. The dataset includes over 12,000 frames captured during both daytime and nighttime hours. To address the limited nighttime data, a unique pixel-wise image processing technique is implemented to convert daytime images into realistic night images.The experimental results show that the proposed methodology can accurately categorize vehicle behavior with 92.14% accuracy, 97.38% specificity, 92.09% sensitivity, 92.10% F1-measure, and 0.895 Cohen's Kappa Statistic. More details can be found at .
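
The classifier stage described in the abstract (CNN features, PCA for dimensionality reduction, then an SVM) maps directly onto a short scikit-learn pipeline. The sketch assumes features have already been extracted; the random arrays and the three-class labels are placeholders, not the paper's data.

```python
# Hedged sketch of the CNN-features -> PCA -> SVM classification stage.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 512))        # e.g. pooled CNN descriptors
labels = rng.integers(0, 3, size=200)         # placeholder behaviour classes

clf = make_pipeline(StandardScaler(), PCA(n_components=64), SVC(kernel="rbf"))
clf.fit(features[:150], labels[:150])
print("held-out accuracy:", clf.score(features[150:], labels[150:]))
```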

From Pointwise to Powerhouse: Initialising Neural Networks with Generative Models

  • paper_url: http://arxiv.org/abs/2310.16695
  • repo_url: None
  • paper_authors: Christian Harder, Moritz Fuchs, Yuri Tolkach, Anirban Mukhopadhyay
  • for: 这篇研究旨在提出新的启动方法,以解决深度神经网络中的弹性问题(vanishing or exploding gradients)。
  • methods: 这篇研究使用生成模型来初始化神经网络,包括使用Variational Autoencoders(VAEs)和Graph Hypernetworks(GHNs)。
  • results: 研究发现,使用全量的初始化方法可以提高精度和初始化速度,但是透过Graph Hypernetworks(GHNs)实现的方法会导致集合的ensemble performance在离distribution data上下降。为了解决这个问题,研究提出了一种噪音Graph Hypernetworks(noise GHNs)来鼓励多样性。此外,这篇研究还发现这些新的启动方法可能可以将学习到的知识转移到不同的图像分布上。
    Abstract Traditional initialisation methods, e.g. He and Xavier, have been effective in avoiding the problem of vanishing or exploding gradients in neural networks. However, they only use simple pointwise distributions, which model one-dimensional variables. Moreover, they ignore most information about the architecture and disregard past training experiences. These limitations can be overcome by employing generative models for initialisation. In this paper, we introduce two groups of new initialisation methods. First, we locally initialise weight groups by employing variational autoencoders. Secondly, we globally initialise full weight sets by employing graph hypernetworks. We thoroughly evaluate the impact of the employed generative models on state-of-the-art neural networks in terms of accuracy, convergence speed and ensembling. Our results show that global initialisations result in higher accuracy and faster initial convergence speed. However, the implementation through graph hypernetworks leads to diminished ensemble performance on out of distribution data. To counteract, we propose a modification called noise graph hypernetwork, which encourages diversity in the produced ensemble members. Furthermore, our approach might be able to transfer learned knowledge to different image distributions. Our work provides insights into the potential, the trade-offs and possible modifications of these new initialisation methods.
    摘要 传统的初始化方法,如希尔和夏维,有效地避免神经网络中的衰减或扩散Gradient问题。然而,它们只使用简单的点位分布,这些分布模型了一维变量。此外,它们忽略了神经网络的建筑和过去训练经验。这些限制可以通过使用生成模型来超越。在这篇论文中,我们介绍了两组新的初始化方法。首先,我们在weight组上本地初始化使用变量自动机。其次,我们在全Weight集上全球初始化使用图hyper网络。我们仔细评估了采用生成模型对现有神经网络的影响,包括精度、快速初始化速度和 ensemble。我们的结果显示,全球初始化可以提高精度和初始化速度,但是通过图hyper网络的实现会导致对于异常数据的ensemble表现下降。为了缓解这个问题,我们提出了噪音图hyper网络修改,该修改强制生成 ensemble member中的多样性。此外,我们的方法可能可以传输学习到不同的图像分布。我们的工作为这些新的初始化方法的潜力、交易和可能的修改提供了深入的视角。

DSAM-GN:Graph Network based on Dynamic Similarity Adjacency Matrices for Vehicle Re-identification

  • paper_url: http://arxiv.org/abs/2310.16694
  • repo_url: None
  • paper_authors: Yuejun Jiao, Song Qiu, Mingsong Chen, Dingding Han, Qingli Li, Yue Lu
  • for: 本研究旨在提高智能交通系统中车辆重新认识(Re-ID)的精度,适用于助动系统、交通流管理和车辆跟踪等应用。
  • methods: 本研究提出了基于动态相似 adjacency 矩阵(DSAM-GN)的方法,具有新的相似性矩阵构建方法,可以捕捉车辆特征的空间关系,减少背景噪音。
  • results: 实验结果表明,提出的方法比现有方法更有效,能够更好地提高车辆 Re-ID 的精度。
    Abstract In recent years, vehicle re-identification (Re-ID) has gained increasing importance in various applications such as assisted driving systems, traffic flow management, and vehicle tracking, due to the growth of intelligent transportation systems. However, the presence of extraneous background information and occlusions can interfere with the learning of discriminative features, leading to significant variations in the same vehicle image across different scenarios. This paper proposes a method, named graph network based on dynamic similarity adjacency matrices (DSAM-GN), which incorporates a novel approach for constructing adjacency matrices to capture spatial relationships of local features and reduce background noise. Specifically, the proposed method divides the extracted vehicle features into different patches as nodes within the graph network. A spatial attention-based similarity adjacency matrix generation (SASAMG) module is employed to compute similarity matrices of nodes, and a dynamic erasure operation is applied to disconnect nodes with low similarity, resulting in similarity adjacency matrices. Finally, the nodes and similarity adjacency matrices are fed into graph networks to extract more discriminative features for vehicle Re-ID. Experimental results on public datasets VeRi-776 and VehicleID demonstrate the effectiveness of the proposed method compared with recent works.
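
The core data structure in the abstract, a similarity adjacency matrix with low-similarity edges erased, can be sketched in a few lines. The cosine-similarity measure and the 0.5 threshold below are illustrative; the spatial-attention weighting (the SASAMG module) is omitted.

```python
# Hedged sketch of a dynamic similarity adjacency matrix over patch nodes.
import torch
import torch.nn.functional as F

def similarity_adjacency(node_feats: torch.Tensor, threshold: float = 0.5):
    """node_feats: (N, D) patch features. Returns a row-normalised (N, N) adjacency."""
    x = F.normalize(node_feats, dim=1)
    sim = x @ x.t()                                                   # cosine similarity
    adj = torch.where(sim >= threshold, sim, torch.zeros_like(sim))   # dynamic erasure
    adj = adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-6)          # row-normalise
    return adj

feats = torch.randn(8, 128)
print(similarity_adjacency(feats).shape)   # torch.Size([8, 8])
```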

Local Statistics for Generative Image Detection

  • paper_url: http://arxiv.org/abs/2310.16684
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Yung Jer Wong, Teck Khim Ng
  • for: 这篇论文旨在区分真实数码相机图像与扩散模型(DM)生成的图像,以应对DM在图像生成、图像超分辨率等任务上日益逼真的合成结果可能被滥用的问题。
  • methods: 这个论文使用了计算地方统计信息,而不是全局统计信息,以分辨出图像是否来自DMs生成。它们认为地方统计信息能够解决图像空间不均衡问题。
  • results: 论文表明,使用地方统计信息可以提供有前景的结果,并且这种方法对图像缩放和JPEG压缩等多种杂音噪声有良好的Robustness。
    Abstract Diffusion models (DMs) are generative models that learn to synthesize images from Gaussian noise. DMs can be trained to do a variety of tasks such as image generation and image super-resolution. Researchers have made significant improvement in the capability of synthesizing photorealistic images in the past few years. These successes also hasten the need to address the potential misuse of synthesized images. In this paper, we highlight the effectiveness of computing local statistics, as opposed to global statistics, in distinguishing digital camera images from DM-generated images. We hypothesized that local statistics should be used to address the spatial non-stationarity problem in images. We show that our approach produced promising results and it is also robust to various perturbations such as image resizing and JPEG compression.
    摘要 扩散模型(DM)是一类从高斯噪声中学习合成图像的生成模型,可用于图像生成、图像超分辨率等多种任务。近年来,研究人员在合成逼真图像方面取得了显著进展,这也使得应对合成图像可能被滥用的问题变得更加紧迫。本文指出,在区分数码相机图像与DM生成图像时,计算局部统计信息比全局统计信息更为有效;我们假设局部统计信息能够解决图像的空间非平稳性问题。实验表明,我们的方法取得了有前景的结果,并且对图像缩放和JPEG压缩等多种扰动具有良好的鲁棒性。
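The following short sketch contrasts local with global statistics on an image: it splits the image into patches and computes per-patch mean/std features, which is the kind of spatially localised descriptor the abstract argues suits non-stationary images. The patch size and the specific statistics are illustrative placeholders, not the paper's actual feature set.

```python
import numpy as np

def local_statistics(image, patch=32):
    """Split an image into non-overlapping patches and compute simple
    per-patch statistics (mean, std) per channel. A real detector would
    feed richer local statistics to a classifier; this only illustrates
    'local vs. global' feature extraction."""
    h, w, c = image.shape
    feats = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            block = image[y:y + patch, x:x + patch]
            feats.append(np.concatenate([block.mean(axis=(0, 1)),
                                         block.std(axis=(0, 1))]))
    return np.stack(feats)                      # (num_patches, 2 * c)

if __name__ == "__main__":
    img = np.random.rand(256, 256, 3)           # stand-in for a test image
    local = local_statistics(img)
    global_stats = np.concatenate([img.mean(axis=(0, 1)), img.std(axis=(0, 1))])
    print(local.shape, global_stats.shape)      # (64, 6) vs (6,)
```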

CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection

  • paper_url: http://arxiv.org/abs/2310.16667
  • repo_url: https://github.com/cvmi-lab/codet
  • paper_authors: Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, Xiaojuan Qi
  • for: 学习开放词汇对象检测中的对象级视力语言表示,从图文对 alignment 获得可靠的区域词对应关系是关键。
  • methods: CoDet 提出了一种新的方法,即将区域词对应关系重新定义为共同发现问题,通过图像的视觉相似性来发现共同出现的对象。
  • results: 实验结果表明,CoDet 在开放词汇检测中具有优秀的性能和扩展性,例如通过增加视觉底层,CoDet 在 OV-LVIS 上实现了 37.0 $\text{AP}^m_{novel}$ 和 44.7 $\text{AP}^m_{all}$,比前一个 SoTA 高出 4.2 $\text{AP}^m_{novel}$ 和 9.8 $\text{AP}^m_{all}$。
    Abstract Deriving reliable region-word alignment from image-text pairs is critical to learn object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which are prone to limitations in localization accuracy or generalization capabilities. In this paper, we propose CoDet, a novel approach that overcomes the reliance on pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem. Intuitively, by grouping images that mention a shared concept in their captions, objects corresponding to the shared concept shall exhibit high co-occurrence among the group. CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept. Extensive experiments demonstrate that CoDet has superior performances and compelling scalability in open-vocabulary detection, e.g., by scaling up the visual backbone, CoDet achieves 37.0 $\text{AP}^m_{novel}$ and 44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$. Code is available at https://github.com/CVMI-Lab/CoDet.
    摘要 从图文对中获得可靠的区域-词语对齐,是学习对象级视觉-语言表示、实现开放词汇目标检测的关键。现有方法通常依赖预训练或自训练的视觉-语言模型进行对齐,容易受到定位精度或泛化能力的限制。本文提出CoDet方法,通过将区域-词语对齐重新表述为共现对象发现问题,摆脱了对预对齐视觉-语言空间的依赖。直观地说,将描述中提到同一概念的图像分为一组,对应该概念的对象应当在组内高频共现;CoDet随后利用视觉相似性发现这些共现对象,并将其与共享概念对齐。大量实验表明,CoDet在开放词汇检测中性能优异且具有良好的可扩展性:例如,通过扩大视觉骨干网络,CoDet在OV-LVIS上达到37.0 $\text{AP}^m_{novel}$和44.7 $\text{AP}^m_{all}$,分别比此前最优方法高出4.2 $\text{AP}^m_{novel}$和9.8 $\text{AP}^m_{all}$。代码可在 https://github.com/CVMI-Lab/CoDet 获取。
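A toy sketch of the co-occurring object discovery idea: for a group of images whose captions share a concept, each image keeps the region proposal most similar to proposals in the other images. The greedy scoring below is a simplification for illustration; CoDet's actual alignment is learned jointly with the detector.

```python
import numpy as np

def discover_cooccurring_regions(region_feats):
    """For a group of images whose captions mention the same concept, pick in
    each image the region proposal most visually similar to proposals in the
    *other* images -- a simplified, greedy stand-in for co-occurring object
    discovery.

    region_feats : list of (R_i, D) arrays, one per image in the group.
    returns      : list of selected region indices, one per image.
    """
    norm = [f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8) for f in region_feats]
    picks = []
    for i, fi in enumerate(norm):
        score = np.zeros(len(fi))
        for j, fj in enumerate(norm):
            if i == j:
                continue
            score += (fi @ fj.T).max(axis=1)   # best cross-image similarity
        picks.append(int(score.argmax()))
    return picks

if __name__ == "__main__":
    group = [np.random.randn(20, 128) for _ in range(4)]  # 4 images, 20 proposals each
    print(discover_cooccurring_regions(group))
```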

Robust Source-Free Domain Adaptation for Fundus Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.16665
  • repo_url: https://github.com/LinGrayy/PLPB
  • paper_authors: Lingrui Li, Yanfeng Zhou, Ge Yang
  • for: 这篇论文关注提升无监督领域自适应(UDA)技术在医学图像分割(具体为眼底图像分割)中的鲁棒性。
  • methods: 所提方法分为两个阶段:(1)源域训练阶段采用对抗样本增强,提升源模型的鲁棒性与泛化能力;(2)目标域训练阶段提出新的鲁棒伪标签与伪边界(PLPB)方法,在不使用源数据的情况下,利用无标注目标数据生成伪标签和伪边界,实现模型自适应。
  • results: 在跨域眼底图像分割上的大量实验验证了所提方法的有效性与通用性,与现有UDA技术相比,准确率和鲁棒性均有提升。
    Abstract Unsupervised Domain Adaptation (UDA) is a learning technique that transfers knowledge learned in the source domain from labelled training data to the target domain with only unlabelled data. It is of significant importance to medical image segmentation because of the usual lack of labelled training data. Although extensive efforts have been made to optimize UDA techniques to improve the accuracy of segmentation models in the target domain, few studies have addressed the robustness of these models under UDA. In this study, we propose a two-stage training strategy for robust domain adaptation. In the source training stage, we utilize adversarial sample augmentation to enhance the robustness and generalization capability of the source model. And in the target training stage, we propose a novel robust pseudo-label and pseudo-boundary (PLPB) method, which effectively utilizes unlabeled target data to generate pseudo labels and pseudo boundaries that enable model self-adaptation without requiring source data. Extensive experimental results on cross-domain fundus image segmentation confirm the effectiveness and versatility of our method. Source code of this study is openly accessible at https://github.com/LinGrayy/PLPB.
    摘要 无监督领域适应(UDA)是一种学习技术,它将源领域中标注训练数据上的知识传递到目标领域中,只使用无标注数据进行学习。由于医学影像分割通常缺乏标注训练数据,UDA技术具有重要的意义。虽然有大量研究探讨了如何优化UDA技术以提高目标领域中分割模型的准确率,但只有一些研究考虑了UDA模型的稳定性。在这个研究中,我们提出了一种两stage训练策略,以提高UDA模型的稳定性。在源训练阶段,我们利用对抗样本增强的技术,以提高源模型的 robustness和通用性。在目标训练阶段,我们提出了一种新的假标签和假边界(PLPB)方法,它可以使用无标注目标数据来生成假标签和假边界,以便模型自适应而不需要源数据。我们的方法在跨领域眼影像分割中进行了广泛的实验,并证明了我们的方法的效果和多样性。源代码可以在https://github.com/LinGrayy/PLPB中获取。
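A minimal PyTorch sketch of the source-free self-training loop suggested by the abstract: confident predictions on unlabeled target images become pixel-wise pseudo labels, and the model is updated on them without touching source data. The confidence threshold, the "ignore" index, the function names, and the toy segmenter are assumptions; the pseudo-boundary term and the adversarial source-stage augmentation are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_labels(model, target_images, conf_thresh=0.75):
    """Generate pixel-wise pseudo labels from unlabeled target images,
    keeping only confident pixels (others get the 'ignore' index 255)."""
    model.eval()
    prob = F.softmax(model(target_images), dim=1)   # (B, C, H, W)
    conf, pseudo = prob.max(dim=1)                  # per-pixel confidence / label
    pseudo[conf < conf_thresh] = 255
    return pseudo

def self_adaptation_step(model, optimizer, target_images, pseudo):
    """One source-free self-training step on the target domain only."""
    model.train()
    loss = F.cross_entropy(model(target_images), pseudo, ignore_index=255)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    seg_net = nn.Conv2d(3, 2, kernel_size=1)        # toy 2-class "segmenter"
    opt = torch.optim.SGD(seg_net.parameters(), lr=1e-3)
    imgs = torch.randn(2, 3, 64, 64)                # unlabeled target batch
    pl = make_pseudo_labels(seg_net, imgs, conf_thresh=0.5)
    print(self_adaptation_step(seg_net, opt, imgs, pl))
```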

MACP: Efficient Model Adaptation for Cooperative Perception

  • paper_url: http://arxiv.org/abs/2310.16870
  • repo_url: https://github.com/purduedigitaltwin/macp
  • paper_authors: Yunsheng Ma, Juanwu Lu, Can Cui, Sicheng ZHao, Xu Cao, Wenqian Ye, Ziran Wang
  • for: 提高连接自动化车辆(CAVs)的感知能力,使其可以“看过遮盖物”,提高性能。
  • methods: 基于单机器学习模型,增加合作能力。
  • results: 在仿真和真实世界的协同感知基准测试中,所提方法能够有效利用协同观测,超越其他最新方法,且所需的可调参数和通信开销显著更少。
    Abstract Vehicle-to-vehicle (V2V) communications have greatly enhanced the perception capabilities of connected and automated vehicles (CAVs) by enabling information sharing to "see through the occlusions", resulting in significant performance improvements. However, developing and training complex multi-agent perception models from scratch can be expensive and unnecessary when existing single-agent models show remarkable generalization capabilities. In this paper, we propose a new framework termed MACP, which equips a single-agent pre-trained model with cooperation capabilities. We approach this objective by identifying the key challenges of shifting from single-agent to cooperative settings, adapting the model by freezing most of its parameters and adding a few lightweight modules. We demonstrate in our experiments that the proposed framework can effectively utilize cooperative observations and outperform other state-of-the-art approaches in both simulated and real-world cooperative perception benchmarks while requiring substantially fewer tunable parameters with reduced communication costs. Our source code is available at https://github.com/PurdueDigitalTwin/MACP.
    摘要 车联网(V2V)通信通过共享信息来"看穿遮挡",大幅增强了网联自动驾驶车辆(CAV)的感知能力,带来显著的性能提升。然而,当现有的单车模型已表现出很好的泛化能力时,从零开始开发和训练复杂的多智能体感知模型既昂贵又没有必要。本文提出了一个名为MACP的新框架,为预训练的单车模型赋予协同感知能力。我们分析了从单车感知转向协同感知的关键挑战,通过冻结模型的大部分参数并添加少量轻量级模块来进行适配。实验表明,所提框架能够有效利用协同观测,在仿真和真实世界的协同感知基准上超越其他最新方法,同时所需的可调参数和通信开销都大幅减少。源代码见 https://github.com/PurdueDigitalTwin/MACP。
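The "freeze most parameters, add a few lightweight modules" recipe can be sketched as follows; the Adapter module, its bottleneck size, and the sequential wiring are hypothetical stand-ins for MACP's actual cooperative modules.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck adapter added on top of a frozen block; only these
    parameters are trained when adapting the single-agent model."""
    def __init__(self, dim, hidden=32):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual adaptation

def adapt_for_cooperation(backbone: nn.Module, feat_dim: int):
    # 1) freeze every pretrained single-agent parameter
    for p in backbone.parameters():
        p.requires_grad = False
    # 2) add a lightweight, trainable adapter on top of the frozen features
    adapter = Adapter(feat_dim)
    trainable = sum(p.numel() for p in adapter.parameters())
    total = trainable + sum(p.numel() for p in backbone.parameters())
    print(f"trainable params: {trainable}/{total}")
    return nn.Sequential(backbone, adapter)

if __name__ == "__main__":
    single_agent = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
    coop_model = adapt_for_cooperation(single_agent, feat_dim=256)
    print(coop_model(torch.randn(4, 256)).shape)        # adapted cooperative features
```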

Deep Learning Techniques for Cervical Cancer Diagnosis based on Pathology and Colposcopy Images

  • paper_url: http://arxiv.org/abs/2310.16662
  • repo_url: None
  • paper_authors: Hana Ahmadzadeh Sarhangi, Dorsa Beigifard, Elahe Farmani, Hamidreza Bolhasani
  • for: 本研究旨在探讨 Deep Learning 技术在预防和诊断 cervical cancer 方面的潜在应用,以提高诊断的准确率和效率。
  • methods: 本研究使用 Deep Learning 技术进行训练,包括分类、 segmentation 和检测任务,以提高预防和诊断 cervical cancer 的精度和效率。
  • results: 研究发现,Deep Learning 技术可以帮助提高预防和诊断 cervical cancer 的精度和效率,并且可以减少人类 Error 的影响。
    Abstract Cervical cancer is a prevalent disease affecting millions of women worldwide every year. It requires significant attention, as early detection during the precancerous stage provides an opportunity for a cure. The screening and diagnosis of cervical cancer rely on cytology and colposcopy methods. Deep learning, a promising technology in computer vision, has emerged as a potential solution to improve the accuracy and efficiency of cervical cancer screening compared to traditional clinical inspection methods that are prone to human error. This review article discusses cervical cancer and its screening processes, followed by the Deep Learning training process and the classification, segmentation, and detection tasks for cervical cancer diagnosis. Additionally, we explored the most common public datasets used in both cytology and colposcopy and highlighted the popular and most utilized architectures that researchers have applied to both cytology and colposcopy. We reviewed 24 selected practical papers in this study and summarized them. This article highlights the remarkable efficiency in enhancing the precision and speed of cervical cancer analysis by Deep Learning, bringing us closer to early diagnosis and saving lives.
    摘要 宫颈癌是一种每年影响全球数百万女性的常见疾病,需要高度关注,因为在癌前阶段的早期发现为治愈提供了机会。宫颈癌的筛查与诊断依赖细胞学和阴道镜检查方法。深度学习作为计算机视觉中一项有前景的技术,有望比传统的、易受人为错误影响的临床检查方法更准确、更高效地完成宫颈癌筛查。本综述首先介绍宫颈癌及其筛查流程,随后讨论深度学习的训练过程以及用于宫颈癌诊断的分类、分割和检测任务;并整理了细胞学和阴道镜领域最常用的公开数据集,以及研究者在这两类图像上最常采用的网络架构。本研究审阅并归纳了24篇相关论文,展示了深度学习在提升宫颈癌分析精度与速度方面的显著作用,使我们更接近早期诊断并挽救生命。

EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition

  • paper_url: http://arxiv.org/abs/2310.16640
  • repo_url: https://github.com/nickyfot/emoclip
  • paper_authors: Niki Maria Foteinopoulou, Ioannis Patras
  • for: 这篇论文旨在提高动态面部表情识别(FER)的精度,并将表情识别扩展到更丰富的情感谱系,以覆盖新的、未见过的情绪类别。
  • methods: 这篇论文提出了一个 novel vision-language model,使用 sample-level text descriptions 作为自然语言监督,以增强网络内部化的复杂表情表现。
  • results: 这篇论文的结果显示,使用 sample-level descriptions 监督的方法可以在零 shot 分类中实现 Significant Improvements,比如在某些 datasets 上与 CLIP 相比,增加了10% 以上的 Weighted Average Recall 和5% 以上的 Unweighted Average Recall。此外,这篇论文也评估了这种网络在下游任务中的表现,例如 mental health symptom estimation,并取得了与现有方法相似或更高的表现。
    Abstract Facial Expression Recognition (FER) is a crucial task in affective computing, but its conventional focus on the seven basic emotions limits its applicability to the complex and expanding emotional spectrum. To address the issue of new and unseen emotions present in dynamic in-the-wild FER, we propose a novel vision-language model that utilises sample-level text descriptions (i.e. captions of the context, expressions or emotional cues) as natural language supervision, aiming to enhance the learning of rich latent representations, for zero-shot classification. To test this, we evaluate using zero-shot classification of the model trained on sample-level descriptions on four popular dynamic FER datasets. Our findings show that this approach yields significant improvements when compared to baseline methods. Specifically, for zero-shot video FER, we outperform CLIP by over 10\% in terms of Weighted Average Recall and 5\% in terms of Unweighted Average Recall on several datasets. Furthermore, we evaluate the representations obtained from the network trained using sample-level descriptions on the downstream task of mental health symptom estimation, achieving performance comparable or superior to state-of-the-art methods and strong agreement with human experts. Namely, we achieve a Pearson's Correlation Coefficient of up to 0.85 on schizophrenia symptom severity estimation, which is comparable to human experts' agreement. The code is publicly available at: https://github.com/NickyFot/EmoCLIP.
    摘要 面部表情识别(FER)是情感计算中的关键任务,但其传统上仅关注七种基本情绪,限制了其在日益复杂的情感谱系上的适用性。为应对真实动态场景中出现的新的、未见过的情绪,我们提出了一种新的视觉-语言模型,利用样本级文本描述(即关于情境、表情或情绪线索的说明文字)作为自然语言监督,以学习更丰富的潜在表示,用于零样本分类。为验证这一点,我们在四个常用的动态FER数据集上对该模型进行了零样本分类评估。结果表明,该方法相比基线方法有显著提升:在零样本视频FER上,我们在多个数据集上的加权平均召回率(Weighted Average Recall)比CLIP高出10%以上,非加权平均召回率(Unweighted Average Recall)高出5%以上。此外,我们将基于样本级描述训练得到的表示用于下游的心理健康症状评估任务,取得了与最先进方法相当或更优的性能,并与人类专家高度一致:在精神分裂症症状严重程度估计上,皮尔逊相关系数最高可达0.85,与人类专家间的一致性相当。代码公开于:https://github.com/NickyFot/EmoCLIP。
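A small sketch of CLIP-style zero-shot inference as described above: the predicted emotion is the class whose natural-language description embedding is closest to the video embedding. Random tensors stand in for the trained video/text encoders, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def zero_shot_emotion(video_emb, class_text_embs, class_names):
    """Zero-shot FER in the CLIP style: score each class description by its
    cosine similarity to the video embedding and return the best match."""
    v = F.normalize(video_emb, dim=-1)            # (D,)
    t = F.normalize(class_text_embs, dim=-1)      # (K, D)
    sims = t @ v                                  # cosine similarities
    probs = F.softmax(100 * sims, dim=0)          # CLIP-style temperature scaling
    return class_names[int(sims.argmax())], probs

if __name__ == "__main__":
    names = ["a person expressing joy", "a person expressing sadness",
             "a person expressing anger"]
    video_emb = torch.randn(512)                  # stand-in for the video encoder output
    text_embs = torch.randn(len(names), 512)      # stand-in for text encoder outputs
    label, probs = zero_shot_emotion(video_emb, text_embs, names)
    print(label, probs.tolist())
```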

Driving through the Concept Gridlock: Unraveling Explainability Bottlenecks in Automated Driving

  • paper_url: http://arxiv.org/abs/2310.16639
  • repo_url: https://github.com/jessicamecht/concept_gridlock
  • paper_authors: Jessica Echterhoff, An Yan, Kyungtae Han, Amr Abdelraouf, Rohit Gupta, Julian McAuley
  • for: 这个论文是为了提出一种基于概念瓶颈模型的可解释机器学习方法,用于解释自动驾驶车辆的决策和行为。
  • methods: 该论文使用了概念瓶颈模型,将人类定义的概念编码到模型中,以实现可解释的机器学习。
  • results: 该论文提出了一种新方法,使用概念瓶颈作为视觉特征来预测控制指令并解释用户与车辆行为,同时取得了与潜在视觉特征相当的性能。
    Abstract Concept bottleneck models have been successfully used for explainable machine learning by encoding information within the model with a set of human-defined concepts. In the context of human-assisted or autonomous driving, explainability models can help user acceptance and understanding of decisions made by the autonomous vehicle, which can be used to rationalize and explain driver or vehicle behavior. We propose a new approach using concept bottlenecks as visual features for control command predictions and explanations of user and vehicle behavior. We learn a human-understandable concept layer that we use to explain sequential driving scenes while learning vehicle control commands. This approach can then be used to determine whether a change in a preferred gap or steering commands from a human (or autonomous vehicle) is led by an external stimulus or change in preferences. We achieve competitive performance to latent visual features while gaining interpretability within our model setup.
    摘要 概念瓶颈模型通过将人类定义的概念编码进模型,已成功用于可解释机器学习。在人工辅助或自动驾驶场景下,可解释性模型有助于用户理解并接受自动驾驶车辆所做的决策,并可用于解释驾驶员或车辆的行为。我们提出了一种新方法,将概念瓶颈用作视觉特征来预测控制指令,并解释用户与车辆的行为:在学习车辆控制指令的同时,学习一个人类可理解的概念层,用于解释连续的驾驶场景。该方法进而可以判断人类(或自动驾驶车辆)偏好的跟车间距或转向指令的变化,是由外部刺激还是偏好改变所引起。我们在获得可解释性的同时,取得了与潜在视觉特征相当的性能。
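A minimal concept-bottleneck sketch in PyTorch: visual features are first mapped to a small set of human-readable concept scores, and the control command (gap, steering) is predicted only from those scores, which is what makes the decision inspectable. The concept count, feature size, and output layout are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ConceptBottleneckDriver(nn.Module):
    """Features -> interpretable concept scores -> control command.
    Concepts could be things like 'lead vehicle close' or 'curve ahead';
    the names and dimensions here are hypothetical."""
    def __init__(self, feat_dim=512, n_concepts=8):
        super().__init__()
        self.to_concepts = nn.Linear(feat_dim, n_concepts)  # concept layer
        self.to_control = nn.Linear(n_concepts, 2)          # [gap, steering]

    def forward(self, feats):
        concepts = torch.sigmoid(self.to_concepts(feats))   # interpretable scores in [0, 1]
        control = self.to_control(concepts)                 # predicted from concepts only
        return concepts, control

if __name__ == "__main__":
    model = ConceptBottleneckDriver()
    feats = torch.randn(4, 512)                  # features of 4 driving frames
    concepts, control = model(feats)
    # training would supervise both: BCE on concepts, regression on control
    print(concepts.shape, control.shape)         # (4, 8) and (4, 2)
```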

EdgeCalib: Multi-Frame Weighted Edge Features for Automatic Targetless LiDAR-Camera Calibration

  • paper_url: http://arxiv.org/abs/2310.16629
  • repo_url: None
  • paper_authors: Xingchen Li, Yifan Duan, Beibei Wang, Haojie Ren, Guoliang You, Yu Sheng, Jianmin Ji, Yanyong Zhang
  • for: 实现LiDAR与相机之间外参的自动化标定流程,以便在真实场景中构建高精度的多模态感知系统。
  • methods: 基于边缘特征的自动化标定流程:在图像和点云中提取稳定可靠的边缘特征,并通过多帧加权策略对其进行筛选;最后基于边缘对应约束优化得到精确的外参。
  • results: 在KITTI数据集和自采数据集上进行评估,取得了0.086°的旋转精度和0.977 cm的平移精度,超越了现有基于边缘的标定方法。
    Abstract In multimodal perception systems, achieving precise extrinsic calibration between LiDAR and camera is of critical importance. Previous calibration methods often required specific targets or manual adjustments, making them both labor-intensive and costly. Online calibration methods based on features have been proposed, but these methods encounter challenges such as imprecise feature extraction, unreliable cross-modality associations, and high scene-specific requirements. To address this, we introduce an edge-based approach for automatic online calibration of LiDAR and cameras in real-world scenarios. The edge features, which are prevalent in various environments, are aligned in both images and point clouds to determine the extrinsic parameters. Specifically, stable and robust image edge features are extracted using a SAM-based method and the edge features extracted from the point cloud are weighted through a multi-frame weighting strategy for feature filtering. Finally, accurate extrinsic parameters are optimized based on edge correspondence constraints. We conducted evaluations on both the KITTI dataset and our dataset. The results show a state-of-the-art rotation accuracy of 0.086{\deg} and a translation accuracy of 0.977 cm, outperforming existing edge-based calibration methods in both precision and robustness.
    摘要 在多模态感知系统中,实现LiDAR与相机之间精确的外参标定至关重要。以往的标定方法通常需要特定标定物或人工调整,既费时又昂贵。基于特征的在线标定方法虽已被提出,但面临特征提取不精确、跨模态关联不可靠以及对场景要求高等挑战。为此,我们提出了一种基于边缘的方法,用于在真实场景中对LiDAR与相机进行自动在线标定:通过在图像与点云中对齐广泛存在的边缘特征来确定外参。具体而言,利用基于SAM的方法提取稳定且鲁棒的图像边缘特征,并通过多帧加权策略对点云中提取的边缘特征进行筛选;最后基于边缘对应约束优化得到精确的外参。我们在KITTI数据集和自采数据集上进行了评估,结果显示旋转精度达0.086°、平移精度达0.977 cm,在精度和鲁棒性上均优于现有基于边缘的标定方法。
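The multi-frame weighting and edge-alignment cost can be illustrated with a few lines of NumPy: edges detected consistently across frames receive larger weights, and the extrinsics would be optimised to minimise the weighted distance between projected LiDAR edges and image edges. The 0/1 detection matrix and squared-distance cost are simplifying assumptions, not the paper's exact formulation.

```python
import numpy as np

def multi_frame_edge_weights(edge_hits, eps=1e-8):
    """Weight point-cloud edge features by how consistently they are observed
    across frames, so unstable edges contribute less to the optimisation.
    edge_hits : (frames, edges) 0/1 matrix marking per-frame detections."""
    stability = edge_hits.mean(axis=0)              # detection rate per edge
    return stability / (stability.sum() + eps)      # normalised weights

def weighted_edge_residual(img_edge_dist, weights):
    """Weighted alignment cost: img_edge_dist[i] is the pixel distance between
    projected LiDAR edge i (under the current extrinsics) and the nearest
    image edge; the extrinsics would be optimised to minimise this."""
    return float((weights * img_edge_dist ** 2).sum())

if __name__ == "__main__":
    hits = (np.random.rand(10, 50) > 0.3).astype(float)   # 10 frames, 50 candidate edges
    w = multi_frame_edge_weights(hits)
    dist = np.abs(np.random.randn(50))                    # hypothetical residuals (px)
    print(weighted_edge_residual(dist, w))
```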

Real-time 6-DoF Pose Estimation by an Event-based Camera using Active LED Markers

  • paper_url: http://arxiv.org/abs/2310.16618
  • repo_url: None
  • paper_authors: Gerald Ebmer, Adam Loch, Minh Nhat Vu, Germain Haessig, Roberto Mecca, Markus Vincze, Christian Hartl-Nesic, Andreas Kugi
  • for: 本研究旨在提出一种简单 yet effective的事件基于pose数据估算系统,用于快速和准确的 pose数据估算。
  • methods: 本研究使用了活动LED标记(ALM),并提出了一种基于事件的pose数据估算算法,可以在实时中运行,并且可以在不可靠的视觉条件下保持精度。
  • results: 实验结果表明,提出的方法可以在实时中运行,并且可以在静止和动态场景中保持高度的计算速度和精度。
    Abstract Real-time applications for autonomous operations depend largely on fast and robust vision-based localization systems. Since image processing tasks require processing large amounts of data, the computational resources often limit the performance of other processes. To overcome this limitation, traditional marker-based localization systems are widely used since they are easy to integrate and achieve reliable accuracy. However, classical marker-based localization systems significantly depend on standard cameras with low frame rates, which often lack accuracy due to motion blur. In contrast, event-based cameras provide high temporal resolution and a high dynamic range, which can be utilized for fast localization tasks, even under challenging visual conditions. This paper proposes a simple but effective event-based pose estimation system using active LED markers (ALM) for fast and accurate pose estimation. The proposed algorithm is able to operate in real time with a latency below \SI{0.5}{\milli\second} while maintaining output rates of \SI{3}{\kilo \hertz}. Experimental results in static and dynamic scenarios are presented to demonstrate the performance of the proposed approach in terms of computational speed and absolute accuracy, using the OptiTrack system as the basis for measurement.
    摘要 自主作业的实时应用在很大程度上依赖于快速且稳健的基于视觉的定位系统。由于图像处理需要处理大量数据,计算资源往往会限制其他进程的性能。为克服这一限制,传统的基于标记的定位系统因易于集成且精度可靠而被广泛使用;然而,经典的基于标记的定位系统严重依赖帧率较低的标准相机,常因运动模糊而精度不足。相比之下,事件相机具有高时间分辨率和高动态范围,即使在恶劣的视觉条件下也可用于快速定位任务。本文提出了一种简单而有效的、基于事件相机与主动LED标记(ALM)的位姿估计系统,用于快速、精确的位姿估计。所提算法能够实时运行,延迟低于0.5毫秒,输出频率可达3 kHz。我们以OptiTrack系统为测量基准,在静态和动态场景中进行了实验,验证了该方法在计算速度和绝对精度方面的性能。

$\mathbb{VD}$-$\mathbb{GR}$: Boosting $\mathbb{V}$isual $\mathbb{D}$ialog with Cascaded Spatial-Temporal Multi-Modal $\mathbb{GR}$aphs

  • paper_url: http://arxiv.org/abs/2310.16590
  • repo_url: None
  • paper_authors: Adnen Abdessaied, Lei Shi, Andreas Bulling
  • for: 本文提出了一种新的视觉对话模型($\mathbb{VD}$-$\mathbb{GR}$),将预训练语言模型(LM)与图神经网络(GNN)相结合,以同时利用二者的优势。
  • methods: 本文使用多模态GNN分别处理各模态特征,并在执行BERT全局注意力之前先利用其局部结构信息;此外提出了枢纽节点(hub-nodes),它们与同一模态图中的所有节点相连,使信息能够以级联方式在不同模态之间传递。
  • results: 在VisDial v1.0、VisDial v0.9、VisDialConv和VisPro四个数据集上的评估结果表明,$\mathbb{VD}$-$\mathbb{GR}$模型在所有四个数据集上均取得了新的最优结果。
    Abstract We propose $\mathbb{VD}$-$\mathbb{GR}$ - a novel visual dialog model that combines pre-trained language models (LMs) with graph neural networks (GNNs). Prior works mainly focused on one class of models at the expense of the other, thus missing out on the opportunity of combining their respective benefits. At the core of $\mathbb{VD}$-$\mathbb{GR}$ is a novel integration mechanism that alternates between spatial-temporal multi-modal GNNs and BERT layers, and that covers three distinct contributions: First, we use multi-modal GNNs to process the features of each modality (image, question, and dialog history) and exploit their local structures before performing BERT global attention. Second, we propose hub-nodes that link to all other nodes within one modality graph, allowing the model to propagate information from one GNN (modality) to the other in a cascaded manner. Third, we augment the BERT hidden states with fine-grained multi-modal GNN features before passing them to the next $\mathbb{VD}$-$\mathbb{GR}$ layer. Evaluations on VisDial v1.0, VisDial v0.9, VisDialConv, and VisPro show that $\mathbb{VD}$-$\mathbb{GR}$ achieves new state-of-the-art results across all four datasets.
    摘要 我们提出了一种新的视觉对话模型$\mathbb{VD}$-$\mathbb{GR}$,它结合了预训练语言模型(LM)和图神经网络(GNN)。以往的工作往往只侧重其中一类模型,而忽略了将二者优势相结合的机会。$\mathbb{VD}$-$\mathbb{GR}$的核心是一种新的融合机制,它在空间-时间多模态GNN与BERT层之间交替进行,包含三项贡献:1. 使用多模态GNN处理每个模态(图像、问题和对话历史)的特征,在执行BERT全局注意力之前先利用其局部结构;2. 提出枢纽节点,它们与同一模态图中的所有节点相连,使信息能够以级联方式从一个GNN(模态)传递到另一个;3. 在传入下一层$\mathbb{VD}$-$\mathbb{GR}$之前,用细粒度的多模态GNN特征增强BERT隐藏状态。在VisDial v1.0、VisDial v0.9、VisDialConv和VisPro上的评估表明,$\mathbb{VD}$-$\mathbb{GR}$在全部四个数据集上都取得了新的最优结果。

Flow-Attention-based Spatio-Temporal Aggregation Network for 3D Mask Detection

  • paper_url: http://arxiv.org/abs/2310.16569
  • repo_url: https://github.com/josephcao0327/fasten
  • paper_authors: Yuxin Cao, Yian Li, Yumeng Zhu, Derui Wang, Minhui Xue
  • for: The paper aims to address the challenges of anti-spoofing detection in face recognition systems, specifically the insufficient generalizability of deep-learning-based methods against 3D masks.
  • methods: The proposed framework, FASTEN, avoids the noise sensitivity and latency of rPPG-based methods by tailoring a network to focus on fine-grained details in large movements. The network consists of three key modules: a facial optical flow network, flow attention, and spatio-temporal aggregation.
  • results: FASTEN outperforms eight competitors on multiple detection metrics while requiring only five frames of input, and has been deployed on real-world mobile devices for practical 3D mask detection.
    Abstract Anti-spoofing detection has become a necessity for face recognition systems due to the security threat posed by spoofing attacks. Despite great success in traditional attacks, most deep-learning-based methods perform poorly in 3D masks, which can highly simulate real faces in appearance and structure, suffering generalizability insufficiency while focusing only on the spatial domain with single frame input. This has been mitigated by the recent introduction of a biomedical technology called rPPG (remote photoplethysmography). However, rPPG-based methods are sensitive to noisy interference and require at least one second (> 25 frames) of observation time, which induces high computational overhead. To address these challenges, we propose a novel 3D mask detection framework, called FASTEN (Flow-Attention-based Spatio-Temporal aggrEgation Network). We tailor the network for focusing more on fine-grained details in large movements, which can eliminate redundant spatio-temporal feature interference and quickly capture splicing traces of 3D masks in fewer frames. Our proposed network contains three key modules: 1) a facial optical flow network to obtain non-RGB inter-frame flow information; 2) flow attention to assign different significance to each frame; 3) spatio-temporal aggregation to aggregate high-level spatial features and temporal transition features. Through extensive experiments, FASTEN only requires five frames of input and outperforms eight competitors for both intra-dataset and cross-dataset evaluations in terms of multiple detection metrics. Moreover, FASTEN has been deployed in real-world mobile devices for practical 3D mask detection.
    摘要 由于欺骗攻击带来的安全威胁,反欺骗(活体)检测已成为人脸识别系统的必备环节。尽管在传统攻击上取得了很大成功,大多数基于深度学习的方法在3D面具攻击下表现不佳:3D面具在外观和结构上能高度模拟真实人脸,而这些方法只关注空间域、仅使用单帧输入,泛化能力不足。近期引入的生物医学技术——远程光电容积脉搏描记(rPPG)缓解了这一问题,但基于rPPG的方法对噪声干扰敏感,且至少需要一秒(> 25帧)的观测时间,计算开销较大。为应对这些挑战,我们提出了一种新的3D面具检测框架FASTEN(基于光流注意力的时空聚合网络)。我们让网络更加关注大幅运动中的细粒度细节,以消除冗余的时空特征干扰,并在更少的帧数内快速捕捉3D面具的拼接痕迹。所提网络包含三个关键模块:1)面部光流网络,获取非RGB的帧间光流信息;2)光流注意力,为每帧分配不同的重要性;3)时空聚合,融合高层空间特征与时间变化特征。大量实验表明,FASTEN仅需5帧输入,就在数据集内和跨数据集评估中的多项检测指标上超越了8种对比方法。此外,FASTEN已部署于真实的移动设备中,用于实际的3D面具检测。
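The flow-attention idea, assigning different significance to each of the five input frames based on optical-flow features before aggregation, can be sketched as below. The module is a stand-in; the paper's facial optical-flow network and full spatio-temporal aggregation are not reproduced.

```python
import torch
import torch.nn as nn

class FlowAttention(nn.Module):
    """Assign a significance weight to each frame from its optical-flow
    feature, then aggregate frame features with those weights -- a compact
    stand-in for FASTEN's flow attention + spatio-temporal aggregation."""
    def __init__(self, flow_dim):
        super().__init__()
        self.score = nn.Linear(flow_dim, 1)      # one scalar score per frame

    def forward(self, flow_feats, frame_feats):
        # flow_feats: (B, T, flow_dim), frame_feats: (B, T, feat_dim)
        attn = torch.softmax(self.score(flow_feats), dim=1)   # (B, T, 1)
        return (attn * frame_feats).sum(dim=1)                # (B, feat_dim)

if __name__ == "__main__":
    fa = FlowAttention(flow_dim=64)
    flow = torch.randn(2, 5, 64)       # 5 input frames, as in the paper
    feats = torch.randn(2, 5, 128)
    print(fa(flow, feats).shape)       # aggregated clip-level feature
```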

ParisLuco3D: A high-quality target dataset for domain generalization of LiDAR perception

  • paper_url: http://arxiv.org/abs/2310.16542
  • repo_url: None
  • paper_authors: Jules Sanchez, Louis Soum-Fontez, Jean-Emmanuel Deschaud, Francois Goulette
  • for: 本研究旨在提供一个用于跨域评估的数据集,以便评估在不同源数据上训练的LiDAR感知模型的性能。
  • methods: 本研究构建了一个新的评估数据集及配套的在线基准,可在不同域之间进行公平比较。
  • results: 本研究提供了一个新的评估数据集,可以帮助研究人员更好地评估不同来源数据的性能。
    Abstract LiDAR is a sensor system that supports autonomous driving by gathering precise geometric information about the scene. Exploiting this information for perception is interesting as the amount of available data increases. As the quantitative performance of various perception tasks has improved, the focus has shifted from source-to-source perception to domain adaptation and domain generalization for perception. These new goals require access to a large variety of domains for evaluation. Unfortunately, the various annotation strategies of data providers complicate the computation of cross-domain performance based on the available data This paper provides a novel dataset, specifically designed for cross-domain evaluation to make it easier to evaluate the performance of various source datasets. Alongside the dataset, a flexible online benchmark is provided to ensure a fair comparison across methods.
    摘要 LiDAR(激光雷达)是一种支撑自动驾驶的传感系统,能够采集场景的精确几何信息。随着可用数据量的增加,利用这些信息进行感知变得越来越有价值。随着各类感知任务的量化性能不断提升,研究重点已从同源数据的感知转向面向感知的域自适应与域泛化,而这些新目标需要在大量不同的域上进行评估。遗憾的是,各数据提供方的标注策略不尽相同,使得基于现有数据计算跨域性能变得困难。本文提供了一个专为跨域评估设计的新数据集,使评估各种源数据集的性能变得更加容易;同时配套提供了一个灵活的在线基准,以保证各方法之间的公平比较。

Dual Defense: Adversarial, Traceable, and Invisible Robust Watermarking against Face Swapping

  • paper_url: http://arxiv.org/abs/2310.16540
  • repo_url: None
  • paper_authors: Yunming Zhang, Dengpan Ye, Caiyun Xie, Long Tang, Chuanxi Chen, Ziyi Liu, Jiacheng Deng
  • for: 防范以人脸替换为代表的深度伪造恶意应用,以遏制虚假信息传播和身份欺诈。
  • methods: 提出一种兼具可追溯性与对抗性的新型主动综合防御机制——双重防御(Dual Defense),以主动应对恶意换脸行为。
  • results: 实验表明,双重防御取得了最优的整体防御成功率,并在不同人脸数据集上表现出良好的通用性与对抗性。
    Abstract The malicious applications of deep forgery, represented by face swapping, have introduced security threats such as misinformation dissemination and identity fraud. While some research has proposed the use of robust watermarking methods to trace the copyright of facial images for post-event traceability, these methods cannot effectively prevent the generation of forgeries at the source and curb their dissemination. To address this problem, we propose a novel comprehensive active defense mechanism that combines traceability and adversariality, called Dual Defense. Dual Defense invisibly embeds a single robust watermark within the target face to actively respond to sudden cases of malicious face swapping. It disrupts the output of the face swapping model while maintaining the integrity of watermark information throughout the entire dissemination process. This allows for watermark extraction at any stage of image tracking for traceability. Specifically, we introduce a watermark embedding network based on original-domain feature impersonation attack. This network learns robust adversarial features of target facial images and embeds watermarks, seeking a well-balanced trade-off between watermark invisibility, adversariality, and traceability through perceptual adversarial encoding strategies. Extensive experiments demonstrate that Dual Defense achieves optimal overall defense success rates and exhibits promising universality in anti-face swapping tasks and dataset generalization ability. It maintains impressive adversariality and traceability in both original and robust settings, surpassing current forgery defense methods that possess only one of these capabilities, including CMUA-Watermark, Anti-Forgery, FakeTagger, or PGD methods.
    摘要 以换脸为代表的深度伪造恶意应用带来了虚假信息传播和身份欺诈等安全威胁。已有研究提出利用鲁棒水印追溯人脸图像的版权,以便事后溯源,但这些方法无法在源头上有效阻止伪造内容的生成及其传播。针对这一问题,我们提出了一种兼具可追溯性与对抗性的主动综合防御机制——双重防御(Dual Defense)。该机制在目标人脸中不可见地嵌入单个鲁棒水印,以主动应对突发的恶意换脸行为:它在破坏换脸模型输出的同时,在整个传播过程中保持水印信息的完整性,从而可在图像追踪的任意阶段提取水印实现溯源。具体而言,我们提出了一种基于原域特征模仿攻击的水印嵌入网络,学习目标人脸图像的鲁棒对抗特征并嵌入水印,通过感知对抗编码策略在水印不可见性、对抗性与可追溯性之间取得良好平衡。大量实验表明,双重防御取得了最优的整体防御成功率,在对抗换脸任务和数据集泛化能力上表现出良好的通用性;无论在原始还是鲁棒设定下,它都同时保持了出色的对抗性与可追溯性,超越了仅具备其中一种能力的现有伪造防御方法,如CMUA-Watermark、Anti-Forgery、FakeTagger和PGD方法。

Learning Robust Deep Visual Representations from EEG Brain Recordings

  • paper_url: http://arxiv.org/abs/2310.16532
  • repo_url: https://github.com/prajwalsingh/eegstylegan-ada
  • paper_authors: Prajwal Singh, Dwip Dalal, Gautam Vashishtha, Krishna Miyapuram, Shanmuganathan Raman
  • for: This paper is written for researchers and scientists interested in brain-computer interfacing and the reconstruction of visual images from brain Electroencephalography (EEG) signals.
  • methods: The paper proposes a two-stage method for image generation and classification using EEG signals: first, EEG-derived features are obtained for robust learning of deep representations; second, the learned representation is utilized for image generation and classification. Deep-learning architectures with supervised and contrastive learning methods are used.
  • results: The paper demonstrates the generalizability of the feature extraction pipeline across three different datasets. It also shows that a subject-invariant, linearly separable visual representation can be learned from EEG data alone in an unimodal setting, giving better k-means accuracy than a joint representation learned between EEG and images. Finally, it proposes a novel framework to transform unseen images into the EEG space and reconstruct them with approximation; the proposed image synthesis method from EEG shows 62.9% and 36.13% inception score improvements on the EEGCVPR40 and Thoughtviz datasets, better than state-of-the-art GAN performance.
    Abstract Decoding the human brain has been a hallmark of neuroscientists and Artificial Intelligence researchers alike. Reconstruction of visual images from brain Electroencephalography (EEG) signals has garnered a lot of interest due to its applications in brain-computer interfacing. This study proposes a two-stage method where the first step is to obtain EEG-derived features for robust learning of deep representations and subsequently utilize the learned representation for image generation and classification. We demonstrate the generalizability of our feature extraction pipeline across three different datasets using deep-learning architectures with supervised and contrastive learning methods. We have performed the zero-shot EEG classification task to support the generalizability claim further. We observed that a subject invariant linearly separable visual representation was learned using EEG data alone in an unimodal setting that gives better k-means accuracy as compared to a joint representation learning between EEG and images. Finally, we propose a novel framework to transform unseen images into the EEG space and reconstruct them with approximation, showcasing the potential for image reconstruction from EEG signals. Our proposed image synthesis method from EEG shows 62.9% and 36.13% inception score improvement on the EEGCVPR40 and the Thoughtviz datasets, which is better than state-of-the-art performance in GAN.
    摘要 解码人类大脑一直是神经科学家和人工智能研究者共同追求的目标。利用脑电(EEG)信号重建视觉图像因其在脑机接口中的应用而备受关注。本研究提出了一种两阶段方法:第一阶段从EEG信号中提取特征,用于稳健地学习深度表示;第二阶段利用所学表示进行图像生成与分类。我们在三个不同的数据集上,结合监督学习与对比学习方法,验证了特征提取流程的泛化能力,并通过零样本EEG分类任务进一步支持这一结论。我们发现,仅使用EEG数据、在单模态设定下即可学习到与被试无关且线性可分的视觉表示,其k-means准确率优于EEG与图像的联合表示学习。最后,我们提出了一个新框架,将未见过的图像映射到EEG空间并进行近似重建,展示了从EEG信号重建图像的潜力。我们提出的基于EEG的图像合成方法在EEGCVPR40和Thoughtviz数据集上的Inception分数分别提升了62.9%和36.13%,优于当前最优的GAN性能。

Enhancing Document Information Analysis with Multi-Task Pre-training: A Robust Approach for Information Extraction in Visually-Rich Documents

  • paper_url: http://arxiv.org/abs/2310.16527
  • repo_url: None
  • paper_authors: Tofik Ali, Partha Pratim Roy
  • for: 这项研究旨在开发一种专门用于文档信息分析的深度学习模型,涵盖文档分类、实体关系抽取和文档视觉问答。
  • methods: 该模型使用基于transformer的架构对文档图像中的全部信息(文本、视觉和版面布局)进行编码,先预训练、再针对各类文档图像分析任务进行微调;预训练阶段还引入多个附加任务(阅读顺序识别、版面区块分类、区块内文本序列生成),并采用联合预训练方案,同时在多个数据集上考虑所有任务的损失。
  • results: 该模型在多项任务上取得出色效果,包括文档分类(RVL-CDIP数据集上准确率95.87%)、实体关系抽取(FUNSD、CORD、SROIE和Kleister-NDA数据集上的F1分数分别为0.9306、0.9804、0.9794和0.8742)以及文档视觉问答(DocVQA数据集上ANLS分数0.8468),表明该模型能够理解和解析复杂的文档版面与内容,是文档分析任务中一种很有前景的工具。
    Abstract This paper introduces a deep learning model tailored for document information analysis, emphasizing document classification, entity relation extraction, and document visual question answering. The proposed model leverages transformer-based models to encode all the information present in a document image, including textual, visual, and layout information. The model is pre-trained and subsequently fine-tuned for various document image analysis tasks. The proposed model incorporates three additional tasks during the pre-training phase, including reading order identification of different layout segments in a document image, layout segments categorization as per PubLayNet, and generation of the text sequence within a given layout segment (text block). The model also incorporates a collective pre-training scheme where losses of all the tasks under consideration, including pre-training and fine-tuning tasks with all datasets, are considered. Additional encoder and decoder blocks are added to the RoBERTa network to generate results for all tasks. The proposed model achieved impressive results across all tasks, with an accuracy of 95.87% on the RVL-CDIP dataset for document classification, F1 scores of 0.9306, 0.9804, 0.9794, and 0.8742 on the FUNSD, CORD, SROIE, and Kleister-NDA datasets respectively for entity relation extraction, and an ANLS score of 0.8468 on the DocVQA dataset for visual question answering. The results highlight the effectiveness of the proposed model in understanding and interpreting complex document layouts and content, making it a promising tool for document analysis tasks.
    摘要 The proposed model includes three additional tasks during the pre-training phase: identifying the reading order of different layout segments in a document image, categorizing layout segments using PubLayNet, and generating text within a given layout segment. The model also uses a collective pre-training scheme, where the losses of all tasks are considered. Additional encoder and decoder blocks are added to the RoBERTa network to generate results for all tasks.The proposed model achieved impressive results across all tasks, with an accuracy of 95.87% on the RVL-CDIP dataset for document classification, F1 scores of 0.9306, 0.9804, 0.9794, and 0.8742 on the FUNSD, CORD, SROIE, and Kleister-NDA datasets respectively for entity relation extraction, and an ANLS score of 0.8468 on the DocVQA dataset for visual question answering. These results demonstrate the effectiveness of the proposed model in understanding and interpreting complex document layouts and content, making it a promising tool for document analysis tasks.

Lang3DSG: Language-based contrastive pre-training for 3D Scene Graph prediction

  • paper_url: http://arxiv.org/abs/2310.16494
  • repo_url: None
  • paper_authors: Sebastian Koch, Pedro Hermosilla, Narunas Vaskevicius, Mirco Colosi, Timo Ropinski
  • for: 本研究旨在提高3D场景图模型的学习效果,因为学习3D场景图需要不仅物体标签,还需要关系注释,这些注释在数据集中非常罕见。
  • methods: 我们采用了语言基于的预训练方法,利用图像语言模型CLIP的语言编码器储存其知识,并通过对subject-predicate-object triplets的对比,将语言表示和预测的3D图像特征进行对应。
  • results: 我们的方法在主要的semantic 3D场景图标准测试集上达到了状态 искусственный智能水平,比基eline预测方法和所有现有的完全监督场景图预测方法都高出较多。此外,由于我们的场景图特征是语言对应的,因此可以在零shot情况下查询语言空间中的特征。在本文中,我们展示了使用这种特征的属性来预测场景中的房型。
    Abstract D scene graphs are an emerging 3D scene representation, that models both the objects present in the scene as well as their relationships. However, learning 3D scene graphs is a challenging task because it requires not only object labels but also relationship annotations, which are very scarce in datasets. While it is widely accepted that pre-training is an effective approach to improve model performance in low data regimes, in this paper, we find that existing pre-training methods are ill-suited for 3D scene graphs. To solve this issue, we present the first language-based pre-training approach for 3D scene graphs, whereby we exploit the strong relationship between scene graphs and language. To this end, we leverage the language encoder of CLIP, a popular vision-language model, to distill its knowledge into our graph-based network. We formulate a contrastive pre-training, which aligns text embeddings of relationships (subject-predicate-object triplets) and predicted 3D graph features. Our method achieves state-of-the-art results on the main semantic 3D scene graph benchmark by showing improved effectiveness over pre-training baselines and outperforming all the existing fully supervised scene graph prediction methods by a significant margin. Furthermore, since our scene graph features are language-aligned, it allows us to query the language space of the features in a zero-shot manner. In this paper, we show an example of utilizing this property of the features to predict the room type of a scene without further training.
    摘要 3D场景图是一种新兴的三维场景表示,同时建模场景中的物体及其相互关系。然而,学习3D场景图十分困难,因为它不仅需要物体标签,还需要关系标注,而后者在数据集中极为稀缺。尽管预训练被普遍认为是低数据量场景下提升模型性能的有效手段,但我们发现现有的预训练方法并不适用于3D场景图。为此,我们提出了首个基于语言的3D场景图预训练方法,利用场景图与语言之间的紧密联系:借助流行的视觉-语言模型CLIP的文本编码器,将其知识蒸馏到我们的图网络中。我们设计了一种对比式预训练,使关系(主语-谓语-宾语三元组)的文本嵌入与预测的3D图特征相互对齐。我们的方法在主流的语义3D场景图基准上取得了最先进的结果,显著优于预训练基线,并大幅超越所有现有的全监督场景图预测方法。此外,由于场景图特征与语言对齐,我们可以以零样本方式在语言空间中对其进行查询;本文展示了利用这一特性在无需进一步训练的情况下预测场景房间类型的示例。
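The contrastive pre-training objective described above can be written as a symmetric InfoNCE loss between predicted relationship features and the language embeddings of their subject-predicate-object descriptions. Random tensors replace the CLIP text encoder and the 3D graph network here; the temperature is an assumed value.

```python
import torch
import torch.nn.functional as F

def relation_language_contrastive_loss(graph_feats, text_embs, temperature=0.07):
    """Symmetric InfoNCE between predicted 3D relationship features and the
    language embeddings of their subject-predicate-object descriptions
    (e.g. "chair standing on floor"); matching pairs share an index."""
    g = F.normalize(graph_feats, dim=-1)          # (N, D) graph-side features
    t = F.normalize(text_embs, dim=-1)            # (N, D) text-side embeddings
    logits = g @ t.T / temperature                # (N, N) similarity matrix
    targets = torch.arange(len(g))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

if __name__ == "__main__":
    graph_feats = torch.randn(32, 512, requires_grad=True)   # placeholder graph features
    text_embs = torch.randn(32, 512)                         # placeholder CLIP text embeddings
    loss = relation_language_contrastive_loss(graph_feats, text_embs)
    loss.backward()
    print(float(loss))
```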

Gramian Attention Heads are Strong yet Efficient Vision Learners

  • paper_url: http://arxiv.org/abs/2310.16483
  • repo_url: https://github.com/lab-lvm/imagenet-models
  • paper_authors: Jongbin Ryu, Dongyoon Han, Jongwoo Lim
  • for: 该论文旨在提出一种新的架构设计,以增强模型的表达能力。
    Abstract We introduce a novel architecture design that enhances expressiveness by incorporating multiple head classifiers (\ie, classification heads) instead of relying on channel expansion or additional building blocks. Our approach employs attention-based aggregation, utilizing pairwise feature similarity to enhance multiple lightweight heads with minimal resource overhead. We compute the Gramian matrices to reinforce class tokens in an attention layer for each head. This enables the heads to learn more discriminative representations, enhancing their aggregation capabilities. Furthermore, we propose a learning algorithm that encourages heads to complement each other by reducing correlation for aggregation. Our models eventually surpass state-of-the-art CNNs and ViTs regarding the accuracy-throughput trade-off on ImageNet-1K and deliver remarkable performance across various downstream tasks, such as COCO object instance segmentation, ADE20k semantic segmentation, and fine-grained visual classification datasets. The effectiveness of our framework is substantiated by practical experimental results and further underpinned by generalization error bound. We release the code publicly at: https://github.com/Lab-LVM/imagenet-models.
    摘要 我们提出了一种新的架构设计,通过引入多个分类头(classification heads)来提升表达能力,而不依赖通道扩展或额外模块。我们的方法采用基于注意力的聚合,利用成对特征相似度,以极小的资源开销增强多个轻量级分类头:通过计算Gram矩阵,在每个头的注意力层中强化类别token,使各头学习到更具判别性的表示,从而提升聚合能力。此外,我们提出一种学习算法,通过降低各头之间的相关性,促使它们在聚合时相互补充。我们的模型最终在ImageNet-1K上的精度-吞吐量权衡上超越了最先进的CNN与ViT,并在COCO目标实例分割、ADE20k语义分割以及细粒度视觉分类等多个下游任务中表现出色。实验结果验证了该框架的有效性,泛化误差界进一步为其提供了理论支撑。代码公开于:https://github.com/Lab-LVM/imagenet-models。
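A compact sketch of one "Gramian attention head": the pairwise feature-similarity (Gram) matrix reweights backbone tokens before they reinforce a class token, and several such lightweight heads are aggregated. This is a simplified reading of the abstract, not the paper's exact layer design.

```python
import torch
import torch.nn as nn

class GramianHead(nn.Module):
    """One lightweight classification head that reinforces its class token
    with a Gramian (pairwise feature-similarity) matrix before pooling."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens):                       # tokens: (B, N, D)
        gram = tokens @ tokens.transpose(1, 2)       # (B, N, N) Gramian matrix
        attn = torch.softmax(gram.mean(dim=1, keepdim=True), dim=-1)  # (B, 1, N)
        cls = self.cls_token + attn @ tokens         # reinforced class token
        return self.fc(cls.squeeze(1))               # (B, num_classes)

if __name__ == "__main__":
    heads = nn.ModuleList([GramianHead(256, 1000) for _ in range(3)])  # multiple light heads
    feats = torch.randn(2, 49, 256)                  # e.g. 7x7 backbone tokens
    logits = torch.stack([h(feats) for h in heads]).mean(dim=0)        # aggregated prediction
    print(logits.shape)
```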

Show from Tell: Audio-Visual Modelling in Clinical Settings

  • paper_url: http://arxiv.org/abs/2310.16477
  • repo_url: None
  • paper_authors: Jianbo Jiao, Mohammad Alsharid, Lior Drukker, Aris T. Papageorghiou, Andrew Zisserman, J. Alison Noble
  • for: 本研究面向临床场景下的音频-视觉建模,提出一种无需专家标注即可学习有益于多种临床任务的医学表示的方法。
  • methods: 本研究提出了一个简单而有效的多模态自监督学习框架:仅以语音音频为参照,即可在超声图像中定位感兴趣的解剖区域。
  • results: 在大规模临床多模态超声视频数据集上的实验表明,所提自监督方法无需专家标注即可学习到良好的可迁移解剖表示,提升下游自动化临床任务的性能,甚至超越了使用全部标注数据的全监督方法。
    Abstract Auditory and visual signals usually present together and correlate with each other, not only in natural environments but also in clinical settings. However, the audio-visual modelling in the latter case can be more challenging, due to the different sources of audio/video signals and the noise (both signal-level and semantic-level) in auditory signals -- usually speech. In this paper, we consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations that benefit various clinical tasks, without human expert annotation. A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose. The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference. Experimental evaluations on a large-scale clinical multi-modal ultrasound video dataset show that the proposed self-supervised method learns good transferable anatomical representations that boost the performance of automated downstream clinical tasks, even outperforming fully-supervised solutions.
    摘要 听觉与视觉信号通常同时出现且彼此相关,这不仅存在于自然环境中,也存在于临床场景中。然而,临床场景下的音频-视觉建模更具挑战性,因为音频/视频信号的来源不同,且听觉信号(通常为语音)中存在信号层面和语义层面的噪声。本文研究临床场景下的音频-视觉建模,提供了一种无需人工专家标注即可学习有益于多种临床任务的医学表示的方案。为此,我们提出了一个简单而有效的多模态自监督学习框架:仅以语音音频为参照,即可在超声成像过程中定位感兴趣的解剖区域。在大规模临床多模态超声视频数据集上的实验评估表明,所提自监督方法能够学习到良好的、可迁移的解剖表示,提升下游自动化临床任务的性能,甚至超越了全监督方法。

DualMatch: Robust Semi-Supervised Learning with Dual-Level Interaction

  • paper_url: http://arxiv.org/abs/2310.16459
  • repo_url: https://github.com/cwangai/dualmatch
  • paper_authors: Cong Wang, Xiaofeng Cao, Lanzhe Guo, Zenglin Shi
  • for: 这 paper 的目的是提出一种新的 semi-supervised learning 方法,以便在标签不够的情况下利用无标签数据。
  • methods: 这 paper 使用了一种新的 dual-level 交互方法,即在jointly invoking feature embedding和class prediction的方式下进行学习。此外,它还需要一种consistent regularization,确保不同的数据扩展视图和不同的数据之间的feature embedding具有相似性。
  • results: 经验表明,这 paper 的提议可以在标准的 semi-supervised learning 设置下实现9%的错误减少,而在更复杂的类别不均衡设置下,仍可以实现6%的错误减少。
    Abstract Semi-supervised learning provides an expressive framework for exploiting unlabeled data when labels are insufficient. Previous semi-supervised learning methods typically match model predictions of different data-augmented views in a single-level interaction manner, which highly relies on the quality of pseudo-labels and results in semi-supervised learning not robust. In this paper, we propose a novel SSL method called DualMatch, in which the class prediction jointly invokes feature embedding in a dual-level interaction manner. DualMatch requires consistent regularizations for data augmentation, specifically, 1) ensuring that different augmented views are regulated with consistent class predictions, and 2) ensuring that different data of one class are regulated with similar feature embeddings. Extensive experiments demonstrate the effectiveness of DualMatch. In the standard SSL setting, the proposal achieves 9% error reduction compared with SOTA methods, even in a more challenging class-imbalanced setting, the proposal can still achieve 6% error reduction. Code is available at https://github.com/CWangAI/DualMatch
    摘要 半监督学习为在标签不足时利用无标注数据提供了一个富有表达力的框架。以往的半监督学习方法通常以单层交互的方式匹配不同数据增强视图下的模型预测,这高度依赖伪标签的质量,导致半监督学习不够鲁棒。本文提出了一种新的半监督学习方法DualMatch,使类别预测与特征嵌入以双层交互的方式协同作用。DualMatch要求对数据增强施加一致性正则化,具体而言:1)不同增强视图需受一致的类别预测约束;2)同一类别的不同样本需受相近特征嵌入的约束。大量实验验证了DualMatch的有效性:在标准半监督设置下,相比最新方法可降低9%的错误率;即使在更具挑战性的类别不平衡设置下,仍可降低6%的错误率。代码见 https://github.com/CWangAI/DualMatch。
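The dual-level interaction can be sketched as two coupled regularisers: class-level consistency between weakly and strongly augmented views, and feature-level similarity among samples sharing a confident pseudo-label. The thresholds, weighting, and specific similarity terms below are illustrative assumptions, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def dual_level_loss(logits_weak, logits_strong, feats_strong, conf_thresh=0.95):
    """(1) class-level: the strongly augmented view must match the confident
        pseudo-label of the weakly augmented view;
    (2) feature-level: normalised embeddings of samples sharing a pseudo-label
        are pulled together."""
    prob_w = F.softmax(logits_weak.detach(), dim=1)
    conf, pseudo = prob_w.max(dim=1)
    mask = (conf >= conf_thresh).float()                         # confident samples only

    # (1) prediction consistency across augmented views
    cls_loss = (F.cross_entropy(logits_strong, pseudo, reduction="none") * mask).mean()

    # (2) same-class samples should have similar embeddings
    z = F.normalize(feats_strong, dim=1)
    same = (pseudo.unsqueeze(0) == pseudo.unsqueeze(1)).float()
    same = same * mask.unsqueeze(0) * mask.unsqueeze(1)
    feat_loss = ((1 - z @ z.T) * same).sum() / same.sum().clamp(min=1)

    return cls_loss + feat_loss

if __name__ == "__main__":
    lw, ls = torch.randn(8, 10), torch.randn(8, 10, requires_grad=True)
    fs = torch.randn(8, 128, requires_grad=True)
    dual_level_loss(lw, ls, fs, conf_thresh=0.1).backward()     # low threshold for the toy demo
```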

ChimpACT: A Longitudinal Dataset for Understanding Chimpanzee Behaviors

  • paper_url: http://arxiv.org/abs/2310.16447
  • repo_url: https://github.com/shirleymaxx/chimpact
  • paper_authors: Xiaoxuan Ma, Stephan P. Kaufhold, Jiajun Su, Wentao Zhu, Jack Terwilliger, Andres Meza, Yixin Zhu, Federico Rossano, Yizhou Wang
  • for: 这个研究的目的是提高动物福祉,模拟社会行为,以及了解人类和其他动物之间的共同行为。
  • methods: 这个研究使用了视频数据,并对其进行了详细的标注和分类。
  • results: 这个研究提供了一个包含20多只黑猩狮的视频数据集,并对这些数据进行了详细的分析和研究,以深入了解黑猩狮的社会行为和通信方式。
    Abstract Understanding the behavior of non-human primates is crucial for improving animal welfare, modeling social behavior, and gaining insights into distinctively human and phylogenetically shared behaviors. However, the lack of datasets on non-human primate behavior hinders in-depth exploration of primate social interactions, posing challenges to research on our closest living relatives. To address these limitations, we present ChimpACT, a comprehensive dataset for quantifying the longitudinal behavior and social relations of chimpanzees within a social group. Spanning from 2015 to 2018, ChimpACT features videos of a group of over 20 chimpanzees residing at the Leipzig Zoo, Germany, with a particular focus on documenting the developmental trajectory of one young male, Azibo. ChimpACT is both comprehensive and challenging, consisting of 163 videos with a cumulative 160,500 frames, each richly annotated with detection, identification, pose estimation, and fine-grained spatiotemporal behavior labels. We benchmark representative methods of three tracks on ChimpACT: (i) tracking and identification, (ii) pose estimation, and (iii) spatiotemporal action detection of the chimpanzees. Our experiments reveal that ChimpACT offers ample opportunities for both devising new methods and adapting existing ones to solve fundamental computer vision tasks applied to chimpanzee groups, such as detection, pose estimation, and behavior analysis, ultimately deepening our comprehension of communication and sociality in non-human primates.
    摘要 理解非人灵长类动物的行为对于提升动物福祉、建模社会行为,以及洞察人类特有行为与种系共有行为都至关重要。然而,非人灵长类行为数据集的缺乏阻碍了对其社会互动的深入探索,给研究我们最亲近的现存近亲带来了挑战。为应对这些限制,我们提出了ChimpACT——一个用于量化黑猩猩群体内纵向行为与社会关系的综合数据集。该数据集涵盖2015至2018年间居住在德国莱比锡动物园的一个20余只黑猩猩群体的视频,并特别记录了一只年轻雄性黑猩猩Azibo的发育轨迹。ChimpACT内容全面且具有挑战性,包含163段视频、累计160,500帧,每帧均带有检测、身份识别、姿态估计以及细粒度时空行为标签等丰富标注。我们在ChimpACT上对三类代表性方法进行了基准测试:(i)跟踪与身份识别,(ii)姿态估计,(iii)黑猩猩的时空动作检测。实验表明,ChimpACT为设计新方法以及将现有方法迁移到面向黑猩猩群体的基础计算机视觉任务(如检测、姿态估计与行为分析)提供了充分的机会,最终有助于加深我们对非人灵长类交流与社会性的理解。

On Pixel-level Performance Assessment in Anomaly Detection

  • paper_url: http://arxiv.org/abs/2310.16435
  • repo_url: None
  • paper_authors: Mehdi Rafiei, Toby P. Breckon, Alexandros Iosifidis
  • for: 本研究旨在探讨 anomaly detection 方法在不同应用中的表现,特别是在像素级别时的评估带来的复杂挑战。
  • methods: 本研究使用了 eleven 种现代 anomaly detection 方法,应用于 twenty-one 个 anomaly detection 问题。
  • results: 经过广泛的实验评估,研究人员发现,使用 Precision-Recall 基于的 metric 可以更好地捕捉方法的表现,这些 metric 更适合用于这种任务。
    Abstract Anomaly detection methods have demonstrated remarkable success across various applications. However, assessing their performance, particularly at the pixel-level, presents a complex challenge due to the severe imbalance that is most commonly present between normal and abnormal samples. Commonly adopted evaluation metrics designed for pixel-level detection may not effectively capture the nuanced performance variations arising from this class imbalance. In this paper, we dissect the intricacies of this challenge, underscored by visual evidence and statistical analysis, leading to delve into the need for evaluation metrics that account for the imbalance. We offer insights into more accurate metrics, using eleven leading contemporary anomaly detection methods on twenty-one anomaly detection problems. Overall, from this extensive experimental evaluation, we can conclude that Precision-Recall-based metrics can better capture relative method performance, making them more suitable for the task.
    摘要 异常检测方法已在各类应用中取得显著成功。然而,评估其性能(尤其是在像素级)是一个复杂的挑战,因为正常样本与异常样本之间通常存在严重的类别不平衡。常用的像素级检测评估指标可能无法准确反映这种类别不平衡带来的细微性能差异。本文借助可视化证据和统计分析剖析了这一挑战的复杂性,进而论证需要采用能够兼顾类别不平衡的评估指标。我们在21个异常检测问题上评估了11种最新的异常检测方法,并给出了关于更准确度量的建议。总体而言,从这一大规模实验评估可以得出结论:基于精确率-召回率(Precision-Recall)的指标能更好地刻画方法之间的相对性能,因而更适合该任务。
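The point about class imbalance is easy to reproduce: on a toy pixel-level problem with roughly 1% anomalous pixels, ROC-AUC can look flattering while average precision (a Precision-Recall based metric) stays low, which is the behaviour the review argues makes PR-based metrics the better choice. The synthetic scores below are only for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy pixel-level evaluation: ~1% anomalous pixels and a detector that is only
# mildly better than chance on them.
rng = np.random.default_rng(0)
n = 100_000
labels = (rng.random(n) < 0.01).astype(int)          # ~1% anomalous pixels
scores = rng.normal(0.0, 1.0, n) + 1.5 * labels      # higher score on anomalies

print("ROC-AUC          :", round(roc_auc_score(labels, scores), 3))
print("Average precision:", round(average_precision_score(labels, scores), 3))
```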

Winning Prize Comes from Losing Tickets: Improve Invariant Learning by Exploring Variant Parameters for Out-of-Distribution Generalization

  • paper_url: http://arxiv.org/abs/2310.16391
  • repo_url: None
  • paper_authors: Zhuo Huang, Muyang Li, Li Shen, Jun Yu, Chen Gong, Bo Han, Tongliang Liu
    for:EVIL aims to improve OOD generalization by identifying a robust subnetwork that is resistant to distribution shift.methods:EVIL leverages distribution knowledge to find both invariant and variant parameters, and uses them to improve OOD generalization.results:EVIL can effectively and efficiently enhance many popular methods, such as ERM, IRM, SAM, etc., on an integrated testbed called DomainBed.
    Abstract Out-of-Distribution (OOD) Generalization aims to learn robust models that generalize well to various environments without fitting to distribution-specific features. Recent studies based on Lottery Ticket Hypothesis (LTH) address this problem by minimizing the learning target to find some of the parameters that are critical to the task. However, in OOD problems, such solutions are suboptimal as the learning task contains severe distribution noises, which can mislead the optimization process. Therefore, apart from finding the task-related parameters (i.e., invariant parameters), we propose Exploring Variant parameters for Invariant Learning (EVIL) which also leverages the distribution knowledge to find the parameters that are sensitive to distribution shift (i.e., variant parameters). Once the variant parameters are left out of invariant learning, a robust subnetwork that is resistant to distribution shift can be found. Additionally, the parameters that are relatively stable across distributions can be considered invariant ones to improve invariant learning. By fully exploring both variant and invariant parameters, our EVIL can effectively identify a robust subnetwork to improve OOD generalization. In extensive experiments on integrated testbed: DomainBed, EVIL can effectively and efficiently enhance many popular methods, such as ERM, IRM, SAM, etc.
    摘要 分布外(OOD)泛化旨在学习能够很好地泛化到各种环境、而不过拟合特定分布特征的鲁棒模型。近期基于彩票假设(LTH)的研究通过最小化学习目标来寻找对任务至关重要的参数,以应对这一问题。然而,在OOD问题中,这类方案并非最优:学习任务中存在严重的分布噪声,可能误导优化过程。因此,除了寻找与任务相关的参数(即不变参数)之外,我们提出了"探索可变参数以实现不变学习"(EVIL)方法,同时利用分布知识寻找对分布偏移敏感的参数(即可变参数)。将可变参数从不变学习中剔除后,即可得到对分布偏移具有鲁棒性的子网络;此外,在不同分布间相对稳定的参数可视为不变参数,用于进一步改进不变学习。通过充分探索可变与不变参数,EVIL能够有效识别出鲁棒子网络,从而提升OOD泛化能力。在集成测试平台DomainBed上的大量实验表明,EVIL能够高效地增强ERM、IRM、SAM等多种流行方法。
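One way to illustrate the variant/invariant split is to score each parameter tensor by how much its gradients disagree across training environments and treat the most variant ones separately. This gradient-variance criterion is an assumption chosen for illustration; the paper's actual selection rule may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def invariant_parameter_mask(model, env_batches, keep_ratio=0.8):
    """Mark each parameter tensor as 'invariant' (True) or 'variant' (False)
    by the variance of its gradients across environments."""
    grads = []                                    # per-environment gradients
    for x, y in env_batches:
        model.zero_grad()
        F.cross_entropy(model(x), y).backward()
        grads.append([p.grad.detach().clone() for p in model.parameters()])

    scores = []
    for tensors in zip(*grads):                   # iterate parameter-wise
        stacked = torch.stack(tensors)            # (envs, *param_shape)
        scores.append(stacked.var(dim=0).mean())  # disagreement across environments
    order = torch.argsort(torch.stack(scores))    # low variance = more invariant
    keep = set(order[: int(keep_ratio * len(scores))].tolist())
    return [i in keep for i in range(len(scores))]

if __name__ == "__main__":
    net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))
    envs = [(torch.randn(8, 16), torch.randint(0, 3, (8,))) for _ in range(3)]
    print(invariant_parameter_mask(net, envs))
```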

MVFAN: Multi-View Feature Assisted Network for 4D Radar Object Detection

  • paper_url: http://arxiv.org/abs/2310.16389
  • repo_url: None
  • paper_authors: Qiao Yan, Yihan Wang
  • for: 这篇论文旨在提出一种基于4D雷达的3D目标检测方法,以提升自动驾驶系统的能力和可靠性。
  • methods: 该方法引入新的Position Map Generation模块以增强特征学习,并使用新的Radar Feature Assisted骨干网络,充分利用4D雷达传感器提供的多普勒速度与反射率数据。
  • results: 在Astyx和VoD数据集上进行了大量实验与消融研究,验证了该方法的有效性,尤其显著改善了对行人、骑行者等小型运动目标的检测性能。
    Abstract 4D radar is recognized for its resilience and cost-effectiveness under adverse weather conditions, thus playing a pivotal role in autonomous driving. While cameras and LiDAR are typically the primary sensors used in perception modules for autonomous vehicles, radar serves as a valuable supplementary sensor. Unlike LiDAR and cameras, radar remains unimpaired by harsh weather conditions, thereby offering a dependable alternative in challenging environments. Developing radar-based 3D object detection not only augments the competency of autonomous vehicles but also provides economic benefits. In response, we propose the Multi-View Feature Assisted Network (\textit{MVFAN}), an end-to-end, anchor-free, and single-stage framework for 4D-radar-based 3D object detection for autonomous vehicles. We tackle the issue of insufficient feature utilization by introducing a novel Position Map Generation module to enhance feature learning by reweighing foreground and background points, and their features, considering the irregular distribution of radar point clouds. Additionally, we propose a pioneering backbone, the Radar Feature Assisted backbone, explicitly crafted to fully exploit the valuable Doppler velocity and reflectivity data provided by the 4D radar sensor. Comprehensive experiments and ablation studies carried out on Astyx and VoD datasets attest to the efficacy of our framework. The incorporation of Doppler velocity and RCS reflectivity dramatically improves the detection performance for small moving objects such as pedestrians and cyclists. Consequently, our approach culminates in a highly optimized 4D-radar-based 3D object detection capability for autonomous driving systems, setting a new standard in the field.
    摘要 4D雷达因其在恶劣天气条件下的鲁棒性和成本效益而受到认可,在自动驾驶中扮演着关键角色。摄像头和激光雷达通常是自动驾驶感知模块的主要传感器,而雷达则是宝贵的补充传感器:与激光雷达和摄像头不同,雷达在恶劣天气下不受影响,因而在困难环境中提供了可靠的替代方案。发展基于雷达的3D目标检测不仅能增强自动驾驶车辆的能力,还带来经济效益。为此,我们提出了多视图特征辅助网络(MVFAN),一种端到端、无锚框、单阶段的4D雷达3D目标检测框架。针对特征利用不足的问题,我们引入了新的位置图生成(Position Map Generation)模块,考虑雷达点云的不规则分布,对前景和背景点及其特征进行重新加权,从而增强特征学习。此外,我们提出了全新的雷达特征辅助骨干网络(Radar Feature Assisted backbone),以充分利用4D雷达传感器提供的多普勒速度和反射率数据。在Astyx和VoD数据集上开展的大量实验和消融研究验证了该框架的有效性;引入多普勒速度和RCS反射率显著提升了行人、骑车人等小型运动目标的检测性能。因此,我们的方法为自动驾驶系统提供了高度优化的4D雷达3D目标检测能力,树立了该领域的新标准。
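A rough sketch of the foreground/background reweighting idea behind a Position-Map-style module is given below. It is not the authors' module; the small scoring MLP and the sigmoid gating are assumptions used only to illustrate reweighing points and their features on an irregular radar point cloud.

```python
import torch
import torch.nn as nn

class PositionMapReweighting(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        # Predict a per-point "foreground-ness" score from point features.
        self.score_head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 2), nn.ReLU(),
            nn.Linear(feat_dim // 2, 1))

    def forward(self, feats: torch.Tensor):
        """feats: [B, N, C] features of an irregular radar point cloud."""
        fg_prob = torch.sigmoid(self.score_head(feats))   # [B, N, 1]
        # Up-weight likely foreground points, down-weight background clutter.
        reweighted = feats * (1.0 + fg_prob)
        return reweighted, fg_prob

# Toy usage on a batch of 2 sparse radar point clouds with 128 points each.
module = PositionMapReweighting(feat_dim=64)
feats = torch.randn(2, 128, 64)
reweighted, fg_prob = module(feats)
print(reweighted.shape, fg_prob.shape)   # [2, 128, 64] and [2, 128, 1]
```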

General Point Model with Autoencoding and Autoregressive

  • paper_url: http://arxiv.org/abs/2310.16861
  • repo_url: None
  • paper_authors: Zhe Li, Zhangyang Gao, Cheng Tan, Stan Z. Li, Laurence T. Yang
  • for: 这篇论文旨在探讨大语言模型的预训练架构,以及如何使用这些架构来提高点云表示能力。
  • methods: 该论文提出了一种通用的点云模型(General Point Model,GPM),该模型在点云 transformer 中结合了自编码和自回归任务,并可以进行精细调整以适应不同的下游任务。
  • results: 与 Point-BERT、MaskPoint 和 PointMAE 等模型相比,GPM在点云理解任务中表现出色,在无条件点云生成任务中也极具竞争力,并可通过修改输入的条件信息展现条件生成的潜力。此外,GPM将自编码和自回归任务融合到同一个 transformer 中,这使得模型在不同的下游任务上具有更大的灵活性。
    Abstract The pre-training architectures of large language models encompass various types, including autoencoding models, autoregressive models, and encoder-decoder models. We posit that any modality can potentially benefit from a large language model, as long as it undergoes vector quantization to become discrete tokens. Inspired by GLM, we propose a General Point Model (GPM) which seamlessly integrates autoencoding and autoregressive tasks in point cloud transformer. This model is versatile, allowing fine-tuning for downstream point cloud representation tasks, as well as unconditional and conditional generation tasks. GPM enhances masked prediction in autoencoding through various forms of mask padding tasks, leading to improved performance in point cloud understanding. Additionally, GPM demonstrates highly competitive results in unconditional point cloud generation tasks, even exhibiting the potential for conditional generation tasks by modifying the input's conditional information. Compared to models like Point-BERT, MaskPoint and PointMAE, our GPM achieves superior performance in point cloud understanding tasks. Furthermore, the integration of autoregressive and autoencoding within the same transformer underscores its versatility across different downstream tasks.
    摘要 大型语言模型的预训练架构包括自编码模型、自回归模型以及编码器-解码器模型。我们认为任何模态都可能受益于大语言模型,只要它经过向量量化成为离散的token。受GLM启发,我们提议一种通用点模型(GPM),该模型可以在点云transformer中无缝整合自编码和自回归任务。这个模型非常灵活,既可以针对下游点云表示任务进行微调,也可以用于无条件和条件生成任务。GPM通过多种形式的掩码填充任务来增强自编码中的掩码预测,从而提高点云理解性能。此外,GPM在无条件点云生成任务中表现出很强的竞争力,甚至可以通过修改输入的条件信息来实现条件生成任务。与Point-BERT、MaskPoint和PointMAE等模型相比,我们的GPM在点云理解任务中取得了更优的性能。此外,将自回归和自编码任务融合到同一个transformer中,也凸显了该模型在不同下游任务中的多功能性。
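The core idea of combining a masked (autoencoding) objective with a next-token (autoregressive) objective on one shared transformer over discrete tokens can be sketched as follows. This is not the GPM architecture; the vocabulary size, masking ratio, and tiny encoder are toy assumptions.

```python
import torch
import torch.nn as nn

class TinyGPM(nn.Module):
    def __init__(self, vocab=512, dim=128, n_layers=2, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(vocab + 1, dim)          # +1 for a [MASK] token
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab)
        self.mask_id = vocab

    def forward(self, tokens, causal=False):
        B, L = tokens.shape
        x = self.embed(tokens) + self.pos(torch.arange(L, device=tokens.device))
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1) if causal else None
        return self.head(self.encoder(x, mask=mask))

def gpm_losses(model, tokens, mask_ratio=0.3):
    ce = nn.CrossEntropyLoss()
    # Autoencoding branch: mask a random subset and reconstruct it bidirectionally.
    masked = tokens.clone()
    where = torch.rand(tokens.shape) < mask_ratio
    masked[where] = model.mask_id
    logits_ae = model(masked, causal=False)
    loss_ae = ce(logits_ae[where], tokens[where])
    # Autoregressive branch: predict the next token with a causal mask.
    logits_ar = model(tokens[:, :-1], causal=True)
    loss_ar = ce(logits_ar.reshape(-1, logits_ar.size(-1)), tokens[:, 1:].reshape(-1))
    return loss_ae + loss_ar

model = TinyGPM()
tokens = torch.randint(0, 512, (4, 64))     # e.g. VQ indices of point patches
loss = gpm_losses(model, tokens)
loss.backward()
```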

Deepfake Detection: Leveraging the Power of 2D and 3D CNN Ensembles

  • paper_url: http://arxiv.org/abs/2310.16388
  • repo_url: None
  • paper_authors: Aagam Bakliwal, Amit D. Joshi
  • for: 这项研究旨在深度伪造(deepfake)检测中对视频内容进行验证。
  • methods: 该方法结合了先进的2D和3D卷积神经网络,其中3D模型通过在空间和时间维度上滑动的滤波器捕捉时空特征,2D模型则采用EfficientNet架构。
  • results: 实验显示,该组合(结合投票集成与自适应加权集成)取得了优异的检测效果,表明其具备对抗深度伪造生成欺骗手段的潜力。
    Abstract In the dynamic realm of deepfake detection, this work presents an innovative approach to validate video content. The methodology blends advanced 2-dimensional and 3-dimensional Convolutional Neural Networks. The 3D model is uniquely tailored to capture spatiotemporal features via sliding filters, extending through both spatial and temporal dimensions. This configuration enables nuanced pattern recognition in pixel arrangement and temporal evolution across frames. Simultaneously, the 2D model leverages EfficientNet architecture, harnessing auto-scaling in Convolutional Neural Networks. Notably, this ensemble integrates Voting Ensembles and Adaptive Weighted Ensembling. Strategic prioritization of the 3-dimensional model's output capitalizes on its exceptional spatio-temporal feature extraction. Experimental validation underscores the effectiveness of this strategy, showcasing its potential in countering deepfake generation's deceptive practices.
    摘要 在深度伪造检测这一不断演进的领域中,本工作提出了一种创新的方法来验证视频内容。该方法融合了先进的2维和3维卷积神经网络。3D模型专门设计用于通过在空间和时间两个维度上滑动的滤波器捕捉时空特征,从而细致识别像素排列及其跨帧的时间演化模式。同时,2D模型采用EfficientNet架构,利用卷积神经网络的自动缩放能力。值得注意的是,该集成方案结合了投票集成与自适应加权集成,并策略性地优先采用3维模型的输出,以充分发挥其出色的时空特征提取能力。实验验证了该策略的有效性,展示了其在对抗深度伪造生成欺骗手段方面的潜力。
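The ensembling step can be illustrated with a small sketch that gives the 3D (spatio-temporal) branch a higher prior weight and tunes that weight on validation data. The exact weighting scheme of the paper is not reproduced here, so the grid search and the default weight are assumptions.

```python
import numpy as np

def weighted_ensemble(p_2d: np.ndarray, p_3d: np.ndarray, w_3d: float = 0.65):
    """p_2d, p_3d: per-video fake probabilities from the 2D and 3D branches."""
    return w_3d * p_3d + (1.0 - w_3d) * p_2d

def tune_weight(p_2d, p_3d, labels, grid=np.linspace(0.0, 1.0, 21)):
    """Pick the 3D weight that maximizes accuracy on a held-out validation set."""
    best_w, best_acc = 0.5, -1.0
    for w in grid:
        preds = (weighted_ensemble(p_2d, p_3d, w) >= 0.5).astype(int)
        acc = (preds == labels).mean()
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

# Toy validation scores for 8 videos.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=8)
p_3d = np.clip(labels + rng.normal(0, 0.2, size=8), 0, 1)   # stronger branch
p_2d = np.clip(labels + rng.normal(0, 0.4, size=8), 0, 1)   # weaker branch
w, acc = tune_weight(p_2d, p_3d, labels)
print(f"best 3D weight={w:.2f}, val accuracy={acc:.2f}")
```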

Frequency-Aware Transformer for Learned Image Compression

  • paper_url: http://arxiv.org/abs/2310.16387
  • repo_url: None
  • paper_authors: Han Li, Shaohui Li, Wenrui Dai, Chenglin Li, Junni Zou, Hongkai Xiong
  • for: 提高图像压缩与传输的效率,解决现有LIC方法中潜在表示冗余、难以捕捉各向异性频率成分和方向细节丢失的问题。
  • methods: 我们提出了一种新的频率感知变换块(FAT),通过多尺度方向分析来捕捉自然图像的频率成分;此外,我们引入了频率调制前馈网络(FMFFN)来自适应地调制不同频率成分,提高率失真性能;最后,我们提出了一种基于Transformer的通道自回归(T-CA)模型,有效地利用通道间的依赖关系。
  • results: 我们的方法取得了最优的率失真性能,在Kodak、Tecnick和CLIC数据集上分别以14.5%、15.1%、13.0%的BD-rate优于最新的标准化编码器VTM-12.1。
    Abstract Learned image compression (LIC) has gained traction as an effective solution for image storage and transmission in recent years. However, existing LIC methods are redundant in latent representation due to limitations in capturing anisotropic frequency components and preserving directional details. To overcome these challenges, we propose a novel frequency-aware transformer (FAT) block that for the first time achieves multiscale directional ananlysis for LIC. The FAT block comprises frequency-decomposition window attention (FDWA) modules to capture multiscale and directional frequency components of natural images. Additionally, we introduce frequency-modulation feed-forward network (FMFFN) to adaptively modulate different frequency components, improving rate-distortion performance. Furthermore, we present a transformer-based channel-wise autoregressive (T-CA) model that effectively exploits channel dependencies. Experiments show that our method achieves state-of-the-art rate-distortion performance compared to existing LIC methods, and evidently outperforms latest standardized codec VTM-12.1 by 14.5%, 15.1%, 13.0% in BD-rate on the Kodak, Tecnick, and CLIC datasets.
    摘要 近年来,学习图像压缩(LIC)作为图像存储与传输的有效方案受到广泛关注。然而,现有LIC方法由于难以捕捉各向异性频率成分并保留方向细节,其潜在表示存在冗余。为克服这些挑战,我们提出了一种新的频率感知Transformer(FAT)块,首次在LIC中实现了多尺度方向分析。FAT块包含频率分解窗口注意力(FDWA)模块,用于捕捉自然图像的多尺度、多方向频率成分。此外,我们引入频率调制前馈网络(FMFFN),自适应地调制不同频率成分,从而提升率失真性能。我们还提出了基于Transformer的通道自回归(T-CA)模型,有效利用通道间的依赖关系。实验表明,我们的方法在率失真性能上达到当前最优水平,并在Kodak、Tecnick和CLIC数据集上分别以14.5%、15.1%、13.0%的BD-rate明显优于最新的标准化编解码器VTM-12.1。
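The BD-rate figures quoted above compare two rate-distortion curves. For readers unfamiliar with the metric, the following is a small, self-contained Bjøntegaard-delta-rate sketch using the standard cubic-fit definition (not code from the paper); the toy RD points are made up.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate change (%) of the test codec vs. the anchor; negative = savings."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    # Cubic fits of log-rate as a function of quality (PSNR).
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    ia = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    it = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (it - ia) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0

# Toy 4-point RD curves (bitrate in bpp, quality in dB); the "test" codec is better.
rate_anchor = np.array([0.10, 0.25, 0.50, 1.00]); psnr_anchor = np.array([30.0, 33.0, 36.0, 39.0])
rate_test   = np.array([0.09, 0.22, 0.43, 0.85]); psnr_test   = np.array([30.2, 33.3, 36.4, 39.5])
print(f"BD-rate: {bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):.1f}%")
```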

Open-NeRF: Towards Open Vocabulary NeRF Decomposition

  • paper_url: http://arxiv.org/abs/2310.16383
  • repo_url: None
  • paper_authors: Hao Zhang, Fang Li, Narendra Ahuja
  • for: 解决Neural Radiance Fields(NeRF)中的对象分解问题,以便在3D重建和视觉合成中进行物体操作。
  • methods: 利用现成的大规模分割模型(如 Segment Anything Model, SAM),并引入带层次嵌入的“集成-蒸馏”(integrate-and-distill)范式,兼顾开放词汇查询的灵活性和3D分割的准确性。
  • results: 在开放词汇场景下优于 LERF 和 FFD,并且在遮挡和特征不清晰的情况下仍能保持对物体的一致识别和细粒度分割。
    Abstract In this paper, we address the challenge of decomposing Neural Radiance Fields (NeRF) into objects from an open vocabulary, a critical task for object manipulation in 3D reconstruction and view synthesis. Current techniques for NeRF decomposition involve a trade-off between the flexibility of processing open-vocabulary queries and the accuracy of 3D segmentation. We present, Open-vocabulary Embedded Neural Radiance Fields (Open-NeRF), that leverage large-scale, off-the-shelf, segmentation models like the Segment Anything Model (SAM) and introduce an integrate-and-distill paradigm with hierarchical embeddings to achieve both the flexibility of open-vocabulary querying and 3D segmentation accuracy. Open-NeRF first utilizes large-scale foundation models to generate hierarchical 2D mask proposals from varying viewpoints. These proposals are then aligned via tracking approaches and integrated within the 3D space and subsequently distilled into the 3D field. This process ensures consistent recognition and granularity of objects from different viewpoints, even in challenging scenarios involving occlusion and indistinct features. Our experimental results show that the proposed Open-NeRF outperforms state-of-the-art methods such as LERF \cite{lerf} and FFD \cite{ffd} in open-vocabulary scenarios. Open-NeRF offers a promising solution to NeRF decomposition, guided by open-vocabulary queries, enabling novel applications in robotics and vision-language interaction in open-world 3D scenes.
    摘要 在这篇论文中,我们研究将神经辐射场(NeRF)分解为开放词汇描述的物体,这是3D重建与视图合成中进行物体操作的关键任务。现有的NeRF分解技术需要在开放词汇查询的灵活性与3D分割精度之间取舍。我们提出开放词汇嵌入神经辐射场(Open-NeRF),利用现成的大规模分割模型(如Segment Anything Model, SAM),并引入带层次嵌入的“集成-蒸馏”范式,从而同时获得开放词汇查询的灵活性与3D分割的准确性。Open-NeRF首先利用大规模基础模型从不同视点生成层次化的2D掩码提案,随后通过跟踪方法对齐这些提案,将其整合到3D空间中,并蒸馏到3D场中。该过程保证了物体在不同视点下识别与粒度的一致性,即使在遮挡和特征不清晰等困难场景下也是如此。实验结果表明,所提出的Open-NeRF在开放词汇场景下优于LERF和FFD等最新方法。Open-NeRF为由开放词汇查询引导的NeRF分解提供了一个有前景的解决方案,使机器人和开放世界3D场景中的视觉-语言交互等新应用成为可能。

Towards Large-scale Masked Face Recognition

  • paper_url: http://arxiv.org/abs/2310.16364
  • repo_url: None
  • paper_authors: Manyuan Zhang, Bingqi Ma, Guanglu Song, Yunxiao Wang, Hongsheng Li, Yu Liu
  • for: 本研究的目的是提出一种在COVID-19 coronavirus 疫情期间大规模戴口罩的人脸识别算法冠军解决方案。
  • methods: 本研究使用的方法包括大规模训练、数据噪声处理、戴口罩和不戴口罩人脸识别精度平衡等四个挑战。
  • results: 本研究在 ICCV MFR WebFace260M 和 InsightFace 无约束(unconstrained)赛道上取得了冠军成绩,并提出了一种适用于大规模戴口罩人脸识别的推理友好模型结构。
    Abstract During the COVID-19 coronavirus epidemic, almost everyone is wearing masks, which poses a huge challenge for deep learning-based face recognition algorithms. In this paper, we will present our \textbf{championship} solutions in ICCV MFR WebFace260M and InsightFace unconstrained tracks. We will focus on four challenges in large-scale masked face recognition, i.e., super-large scale training, data noise handling, masked and non-masked face recognition accuracy balancing, and how to design inference-friendly model architecture. We hope that the discussion on these four aspects can guide future research towards more robust masked face recognition systems.
    摘要 在COVID-19新冠疫情期间,几乎所有人都佩戴口罩,这给基于深度学习的人脸识别算法带来了巨大挑战。本文介绍我们在ICCV MFR WebFace260M和InsightFace无约束赛道上的冠军方案。我们重点讨论大规模戴口罩人脸识别中的四个挑战:超大规模训练、数据噪声处理、戴口罩与不戴口罩人脸识别精度的平衡,以及如何设计推理友好的模型结构。希望对这四个方面的讨论能够引导未来研究,构建更加鲁棒的戴口罩人脸识别系统。

DiffRef3D: A Diffusion-based Proposal Refinement Framework for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2310.16349
  • repo_url: None
  • paper_authors: Se-Ho Kim, Inyong Koo, Inyoung Lee, Byeongjun Park, Changick Kim
  • for: 提高3D物体检测器的性能
  • methods: 使用diffusion过程对候选框(proposal)进行精修
  • results: 在KITTI数据集上实现了高性能的3D物体检测
    Abstract Denoising diffusion models show remarkable performances in generative tasks, and their potential applications in perception tasks are gaining interest. In this paper, we introduce a novel framework named DiffRef3D which adopts the diffusion process on 3D object detection with point clouds for the first time. Specifically, we formulate the proposal refinement stage of two-stage 3D object detectors as a conditional diffusion process. During training, DiffRef3D gradually adds noise to the residuals between proposals and target objects, then applies the noisy residuals to proposals to generate hypotheses. The refinement module utilizes these hypotheses to denoise the noisy residuals and generate accurate box predictions. In the inference phase, DiffRef3D generates initial hypotheses by sampling noise from a Gaussian distribution as residuals and refines the hypotheses through iterative steps. DiffRef3D is a versatile proposal refinement framework that consistently improves the performance of existing 3D object detection models. We demonstrate the significance of DiffRef3D through extensive experiments on the KITTI benchmark. Code will be available.
    摘要 diffusion 模型在生成任务中表现出色,其在感知任务中的潜在应用也引起了关注。在这篇论文中,我们介绍了一个名为DiffRef3D的新框架,它在3D物体检测中使用点云的扩散过程来进行首次应用。具体来说,我们将两stage 3D物体检测器的提议改进阶段设计为一个条件的扩散过程。在训练过程中,DiffRef3D逐渐添加了随机噪声到提议和目标对象之间的差异,然后将这些噪声应用到提议中来生成假设。提升模块使用这些假设来减少噪声并生成准确的盒子预测。在推断阶段,DiffRef3D通过随机从 Gaussian 分布中采样噪声来生成初始假设,然后通过迭代步骤来修改这些假设,以生成高精度的盒子预测。DiffRef3D 是一种通用的提议改进框架,可以在现有的 3D 物体检测模型上逐次提高性能。我们通过对 KITTI benchmark 进行了广泛的实验,证明了DiffRef3D 的重要性。代码将可以获得。
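The training step described above (diffusing the proposal-to-target residual and denoising it with a refinement head) can be sketched as follows. This is not the authors' code: the 7-DoF box parameterization, the linear beta schedule, and the small MLP head are illustrative assumptions.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(residual, t, noise):
    """Forward diffusion of the proposal->target residual at timestep t."""
    a = alphas_cumprod[t].view(-1, 1)
    return a.sqrt() * residual + (1.0 - a).sqrt() * noise

class RefineHead(nn.Module):
    """Predicts the clean residual from (proposal features, noisy residual, t)."""
    def __init__(self, feat_dim=128, box_dim=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + box_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, box_dim))

    def forward(self, feats, noisy_residual, t):
        t_embed = (t.float() / T).unsqueeze(-1)
        return self.mlp(torch.cat([feats, noisy_residual, t_embed], dim=-1))

# Toy training step for 16 proposals: boxes are (x, y, z, l, w, h, yaw).
head = RefineHead()
proposals, targets = torch.randn(16, 7), torch.randn(16, 7)
feats = torch.randn(16, 128)                      # pooled RoI features per proposal
residual = targets - proposals
t = torch.randint(0, T, (16,))
noise = torch.randn_like(residual)
noisy = q_sample(residual, t, noise)
pred_residual = head(feats, noisy, t)
loss = nn.functional.smooth_l1_loss(proposals + pred_residual, targets)
loss.backward()
```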

Dolfin: Diffusion Layout Transformers without Autoencoder

  • paper_url: http://arxiv.org/abs/2310.16305
  • repo_url: None
  • paper_authors: Yilin Wang, Zeyuan Chen, Liangjun Zhong, Zheng Ding, Zhizhou Sha, Zhuowen Tu
  • for: 这篇论文旨在提出一种新的生成模型,即Diffusion Layout Transformers without Autoencoder (Dolfin),该模型可以有效地提高生成能力,同时减少计算复杂性。
  • methods: Dolfin使用基于Transformer的扩散(diffusion)过程来实现布局生成,并提出了一种高效的双向(非因果联合)序列表示方法,以及一种自回归扩散模型(Dolfin-AR),能够更好地捕捉邻近对象之间的语义相关性,如对齐、大小和重叠。
  • results: 对标准生成布局Benchmark进行评估,Dolfin显著提高了各种指标(fid, alignment, overlap, MaxIoU和DocSim scores),同时提高了透明度和可操作性。此外,Dolfin的应用不仅限于布局生成,还适用于模型几何结构,如直线段。实验结果表明Dolfin具有优势。
    Abstract In this paper, we introduce a novel generative model, Diffusion Layout Transformers without Autoencoder (Dolfin), which significantly improves the modeling capability with reduced complexity compared to existing methods. Dolfin employs a Transformer-based diffusion process to model layout generation. In addition to an efficient bi-directional (non-causal joint) sequence representation, we further propose an autoregressive diffusion model (Dolfin-AR) that is especially adept at capturing rich semantic correlations for the neighboring objects, such as alignment, size, and overlap. When evaluated against standard generative layout benchmarks, Dolfin notably improves performance across various metrics (fid, alignment, overlap, MaxIoU and DocSim scores), enhancing transparency and interoperability in the process. Moreover, Dolfin's applications extend beyond layout generation, making it suitable for modeling geometric structures, such as line segments. Our experiments present both qualitative and quantitative results to demonstrate the advantages of Dolfin.
    摘要 在这篇论文中,我们引入了一种新的生成模型,即扩散布局变换器无自编码器(Dolfin),它能够显著提高模型化能力而减少复杂性,相比现有的方法。Dolfin使用Transformer基于的扩散过程来模型布局生成。除了高效的双向(非 causal 联合)序列表示之外,我们还提出了一种激进的扩散模型(Dolfin-AR),它尤其适合捕捉邻近对象的丰富semantic相关性,如对齐、大小和重叠。当评估了标准的生成布局benchmark时,Dolfin明显提高了多个指标(fid、对齐、重叠、MaxIoU和DocSim分数),从而提高了透明度和可操作性。此外,Dolfin的应用场景不仅包括布局生成,还适用于模型Geometric结构,如直线段。我们的实验包括qualitative和quantitative结果,以demonstrate Dolfin的优势。

4D-Editor: Interactive Object-level Editing in Dynamic Neural Radiance Fields via 4D Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.16858
  • repo_url: None
  • paper_authors: Dadong Jiang, Zhihui Ke, Xiaobo Zhou, Xidong Shi
  • for: 本文的目的是实现在动态场景中进行交互式对象级编辑(例如删除、重新着色、变换、组合)。
  • methods: 本文使用混合语义特征场(hybrid semantic feature fields)来保持时空一致性,并采用递归选择精化(recursive selection refinement)来提高动态 NeRF 中的分割精度。
  • results: 大量实验和编辑示例表明,4D-Editor 可以实现照片级真实感的动态 NeRF 编辑。
    Abstract This paper targets interactive object-level editing(e.g., deletion, recoloring, transformation, composition) in dynamic scenes. Recently, some methods aiming for flexible editing static scenes represented by neural radiance field (NeRF) have shown impressive synthesis quality, while similar capabilities in time-variant dynamic scenes remain limited. To solve this problem, we propose 4D-Editor, an interactive semantic-driven editing framework, allowing editing multiple objects in dynamic NeRF based on user strokes on a single frame. Our dynamic scene representation is built upon hybrid semantic feature fields so that the spatial-temporal consistency can be maintained after editing. In addition, we design recursive selection refinement that significantly boosts segmentation accuracy in a dynamic NeRF to aid the editing process. Moreover, we develop multi-view reprojection inpainting to fill holes caused by incomplete scene capture after editing. Extensive experiments and editing examples on real-world demonstrate that 4D-Editor achieves photo-realistic dynamic NeRF editing. Project page: https://patrickddj.github.io/4D-Editor
    摘要 本文面向动态场景中的交互式对象级编辑(例如删除、重新着色、变换、组合)。近来,一些面向以神经辐射场(NeRF)表示的静态场景的灵活编辑方法已展现出令人印象深刻的合成质量,但在随时间变化的动态场景中,类似能力仍然有限。为解决这一问题,我们提出了4D-Editor,一个交互式、语义驱动的编辑框架,允许用户基于单帧上的笔画对动态NeRF中的多个对象进行编辑。我们的动态场景表示建立在混合语义特征场之上,以便在编辑后保持时空一致性。此外,我们设计了递归选择精化,显著提升动态NeRF中的分割精度,以辅助编辑过程。我们还开发了多视图重投影修补,用于填补编辑后由于场景捕捉不完整而产生的空洞。在真实场景上的大量实验和编辑示例表明,4D-Editor能够实现照片级真实感的动态NeRF编辑。项目页面:https://patrickddj.github.io/4D-Editor

MotionAGFormer: Enhancing 3D Human Pose Estimation with a Transformer-GCNFormer Network

  • paper_url: http://arxiv.org/abs/2310.16288
  • repo_url: https://github.com/taatiteam/motionagformer
  • paper_authors: Soroush Mehraban, Vida Adeli, Babak Taati
  • for: 本研究旨在提出一种新的注意力GCNFormer块(AGFormer),以提高3D人姿估计中的本地关系学习。
  • methods: 该模型使用两个并行的 transformer 流水线和 GCNFormer 流水线,并将其分解为多个 AGFormer 块。GCNFormer 模块利用邻近关节之间的本地关系,生成一个补充性的表示,并与 transformer 输出进行可靠的拟合。
  • results: 在 Human3.6M 和 MPI-INF-3DHP 两个标准测试集上,MotionAGFormer 模型 achieved state-of-the-art 结果,P1 误差分别为 38.4mm 和 16.2mm。同时,该模型使用的参数量只有一半,计算量三倍于之前的领先模型。代码和模型可以在 GitHub 上获取。
    Abstract Recent transformer-based approaches have demonstrated excellent performance in 3D human pose estimation. However, they have a holistic view and by encoding global relationships between all the joints, they do not capture the local dependencies precisely. In this paper, we present a novel Attention-GCNFormer (AGFormer) block that divides the number of channels by using two parallel transformer and GCNFormer streams. Our proposed GCNFormer module exploits the local relationship between adjacent joints, outputting a new representation that is complementary to the transformer output. By fusing these two representation in an adaptive way, AGFormer exhibits the ability to better learn the underlying 3D structure. By stacking multiple AGFormer blocks, we propose MotionAGFormer in four different variants, which can be chosen based on the speed-accuracy trade-off. We evaluate our model on two popular benchmark datasets: Human3.6M and MPI-INF-3DHP. MotionAGFormer-B achieves state-of-the-art results, with P1 errors of 38.4mm and 16.2mm, respectively. Remarkably, it uses a quarter of the parameters and is three times more computationally efficient than the previous leading model on Human3.6M dataset. Code and models are available at https://github.com/TaatiTeam/MotionAGFormer.
    摘要 近期基于transformer的方法已经表现出色地进行3D人姿估算。然而,它们具有整体视图,通过编码全局关系 между所有关节来不准确地捕捉本地依赖关系。在这篇论文中,我们提出了一种新的Attention-GCNFormer(AGFormer)块,通过使用两个平行的transformer和GCNFormer流程来分解通道数。我们的提议的GCNFormer模块利用邻近关节之间的本地关系,输出一个新的表示,与transformer输出相комplementary。通过在adaptive的方式进行融合,AGFormer能够更好地学习下来3D结构。通过堆叠多个AGFormer块,我们提议MotionAGFormer模型,有四种不同的变体,可以根据速度精度质量进行选择。我们在人3.6M和MPI-INF-3DHP两个 популяр的benchmark数据集上评估了我们的模型,MotionAGFormer-B Variant achieve state-of-the-art Results,P1 error为38.4mm和16.2mm,分别。很Remarkably,它使用的参数数量只是前一个领先模型的一半,并且在人3.6M数据集上三倍更快速并且更加计算效率。代码和模型可以在https://github.com/TaatiTeam/MotionAGFormer上获取。
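A minimal sketch of the dual-stream idea, one attention stream for global joint relations and one graph-convolution stream for local skeleton-neighbour relations, fused with a learned gate, is shown below. It is not the AGFormer block itself; the tiny GCN, the gating, and the 5-joint toy skeleton are assumptions.

```python
import torch
import torch.nn as nn

class TinyAGFormerBlock(nn.Module):
    def __init__(self, dim, adj, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.register_buffer("adj", adj / adj.sum(dim=-1, keepdim=True))  # row-normalized
        self.gcn_lin = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: [B, J, C] per-frame joint features
        att, _ = self.attn(x, x, x)            # global stream (all-to-all joints)
        gcn = torch.relu(self.gcn_lin(self.adj @ x))   # local (neighbourhood) stream
        alpha = self.gate(torch.cat([att, gcn], dim=-1))
        return self.norm(x + alpha * att + (1 - alpha) * gcn)

# Toy 5-joint chain skeleton (self-loops included) with a batch of 2 poses.
J, dim = 5, 32
adj = torch.eye(J)
for i in range(J - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
block = TinyAGFormerBlock(dim, adj)
out = block(torch.randn(2, J, dim))
print(out.shape)                               # torch.Size([2, 5, 32])
```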

TransPose: 6D Object Pose Estimation with Geometry-Aware Transformer

  • paper_url: http://arxiv.org/abs/2310.16279
  • repo_url: None
  • paper_authors: Xiao Lin, Deming Wang, Guangliang Zhou, Chengju Liu, Qijun Chen
  • for: 提高RGB基于方法的6D对象pose估计精度,避免 occlusion 和照明变化的影响。
  • methods: 使用TransformerEncoder和geometry-aware模块,提取点云特征表示,并在全球信息交换下提高对 occlusion 的Robustness。
  • results: 在三个benchmark datasets上实现了竞争性的pose估计效果。
    Abstract Estimating the 6D object pose is an essential task in many applications. Due to the lack of depth information, existing RGB-based methods are sensitive to occlusion and illumination changes. How to extract and utilize the geometry features in depth information is crucial to achieve accurate predictions. To this end, we propose TransPose, a novel 6D pose framework that exploits Transformer Encoder with geometry-aware module to develop better learning of point cloud feature representations. Specifically, we first uniformly sample point cloud and extract local geometry features with the designed local feature extractor base on graph convolution network. To improve robustness to occlusion, we adopt Transformer to perform the exchange of global information, making each local feature contains global information. Finally, we introduce geometry-aware module in Transformer Encoder, which to form an effective constrain for point cloud feature learning and makes the global information exchange more tightly coupled with point cloud tasks. Extensive experiments indicate the effectiveness of TransPose, our pose estimation pipeline achieves competitive results on three benchmark datasets.
    摘要 估算6D对象姿 pose是许多应用中的关键任务。由于缺乏深度信息,现有的RGB基于方法容易受到遮挡和照明变化的影响。如何EXTRACT和利用点云信息的几何特征是很重要的。为了实现这一目标,我们提出了TransPose,一种新的6D姿态框架,利用Transformer Encoder和几何意识模块来提高点云特征表示学习。具体来说,我们首先对点云进行均匀采样,然后使用设计的本地特征提取器基于图像感知网络提取当地几何特征。为了提高遮挡Robustness,我们采用Transformer来进行全局信息交换,使每个本地特征包含全局信息。最后,我们在Transformer Encoder中引入几何意识模块,以形成有效的约束,使点云特征学习更加紧密地关联点云任务。我们对TransPose进行了广泛的实验,结果表明TransPose可以准确地估算6D对象姿。我们的姿态估算管道在三个标准数据集上达到了竞争性的 результа。

Deep Learning for Plant Identification and Disease Classification from Leaf Images: Multi-prediction Approaches

  • paper_url: http://arxiv.org/abs/2310.16273
  • repo_url: https://github.com/funzi-son/plant_pathology_dl
  • paper_authors: Jianping Yao, Son N. Tran, Saurabh Garg, Samantha Sawyer
  • for: 本研究主要针对现代农业中的深度学习应用,特别是使用叶片图像进行植物识别与病害诊断,深度学习在这一领域中扮演着重要的角色。
  • methods: 本研究综述了多种深度学习方法,将其归类为多模型、多标签、多输出和多任务方法,其中可以采用不同的骨干CNN;在此基础上提出了广义堆叠多输出CNN(GSMo-CNN)。
  • results: 实验表明,使用InceptionV3作为骨干CNN可以获得比AlexNet、VGG16、ResNet101、EfficientNet、MobileNet等更好的性能,而使用单个模型可以与使用两个模型相当甚至更好。最终,我们提出的GSMo-CNN在三个标准数据集上达到了领先的性能。
    Abstract Deep learning plays an important role in modern agriculture, especially in plant pathology using leaf images where convolutional neural networks (CNN) are attracting a lot of attention. While numerous reviews have explored the applications of deep learning within this research domain, there remains a notable absence of an empirical study to offer insightful comparisons due to the employment of varied datasets in the evaluation. Furthermore, a majority of these approaches tend to address the problem as a singular prediction task, overlooking the multifaceted nature of predicting various aspects of plant species and disease types. Lastly, there is an evident need for a more profound consideration of the semantic relationships that underlie plant species and disease types. In this paper, we start our study by surveying current deep learning approaches for plant identification and disease classification. We categorise the approaches into multi-model, multi-label, multi-output, and multi-task, in which different backbone CNNs can be employed. Furthermore, based on the survey of existing approaches in plant pathology and the study of available approaches in machine learning, we propose a new model named Generalised Stacking Multi-output CNN (GSMo-CNN). To investigate the effectiveness of different backbone CNNs and learning approaches, we conduct an intensive experiment on three benchmark datasets Plant Village, Plant Leaves, and PlantDoc. The experimental results demonstrate that InceptionV3 can be a good choice for a backbone CNN as its performance is better than AlexNet, VGG16, ResNet101, EfficientNet, MobileNet, and a custom CNN developed by us. Interestingly, empirical results support the hypothesis that using a single model can be comparable or better than using two models. Finally, we show that the proposed GSMo-CNN achieves state-of-the-art performance on three benchmark datasets.
    摘要 现代农业中,深度学习扮演着重要的角色,特别是在植物病理学中使用叶片图像,其中 convolutional neural networks (CNN) 在这个领域吸引了很多关注。虽然有很多文章评论了深度学习在这个研究领域的应用,但是还没有一篇实证研究提供了有用的对比。此外,大多数方法都是单纯地视为预测问题,忽略了植物种和病种类型之间的多方面性。此外,还有一个明显的需求,即更深入地理解植物种和病种类型之间的含义关系。在本文中,我们开始我们的研究 by surveying current deep learning approaches for plant identification and disease classification.我们将这些方法分为多模型、多标签、多输出和多任务类型,其中可以使用不同的底层CNN。此外,根据现有的植物病理学方法和机器学习方法的调查,我们提出了一种新的模型 named Generalised Stacking Multi-output CNN (GSMo-CNN)。为了评估不同的底层CNN和学习方法的效果,我们在三个标准数据集(Plant Village、Plant Leaves、PlantDoc)上进行了广泛的实验。实验结果显示,InceptionV3可以作为底层CNN,其性能比AlexNet、VGG16、ResNet101、EfficientNet、MobileNet和我们自己开发的自定义CNN更好。有趣的是,实验结果支持我们的假设,即使用单个模型可以与使用两个模型相比或更好。最后,我们表明了我们提出的GSMo-CNN在三个标准数据集上达到了状态之前的最佳性能。
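A multi-output model in the spirit described above can be sketched with a shared backbone, a plant-species head, and a disease head that also sees the species prediction (a simple stacking-style link). This is an illustrative assumption, not the GSMo-CNN architecture; the ResNet-18 backbone and the class counts are placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

class MultiOutputPlantNet(nn.Module):
    def __init__(self, n_plants=14, n_diseases=21):
        super().__init__()
        backbone = models.resnet18(weights=None)     # swap in InceptionV3 etc. as needed
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.plant_head = nn.Linear(feat_dim, n_plants)
        self.disease_head = nn.Linear(feat_dim + n_plants, n_diseases)

    def forward(self, images):
        feats = self.backbone(images)
        plant_logits = self.plant_head(feats)
        # Disease head also consumes the species logits (stacking-style link).
        disease_logits = self.disease_head(torch.cat([feats, plant_logits], dim=-1))
        return plant_logits, disease_logits

model = MultiOutputPlantNet()
images = torch.randn(4, 3, 224, 224)
plant_y, disease_y = torch.randint(0, 14, (4,)), torch.randint(0, 21, (4,))
plant_logits, disease_logits = model(images)
loss = nn.functional.cross_entropy(plant_logits, plant_y) \
     + nn.functional.cross_entropy(disease_logits, disease_y)
loss.backward()
```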

SCB-ST-Dataset4: Extending the Spatio-Temporal Behavior Dataset in Student Classroom Scenarios Through Image Dataset Method

  • paper_url: http://arxiv.org/abs/2310.16267
  • repo_url: https://github.com/whiffe/scb-dataset
  • paper_authors: Fan Yang, Xiaofei Wang
  • for: This paper aims to provide a solution to the lack of publicly available spatio-temporal datasets on student behavior, which hinders research in the field of automatic student behavior detection using deep learning methods.
  • methods: The proposed method involves extending the existing SCB-ST-Dataset4 with an image dataset and using a Behavior Similarity Index (BSI) to explore the similarity of behaviors.
  • results: The proposed method was evaluated using four deep learning algorithms (YOLOv5, YOLOv7, YOLOv8, and SlowFast) and achieved a mean average precision (mAP) of up to 82.3%. The experiments demonstrate the effectiveness of the method, and the dataset provides a robust foundation for future research in student behavior detection.
    Abstract Using deep learning methods to detect students' classroom behavior automatically is a promising approach for analyzing their class performance and improving teaching effectiveness. However, the lack of publicly available spatio-temporal datasets on student behavior, as well as the high cost of manually labeling such datasets, pose significant challenges for researchers in this field. To address this issue, we proposed a method for extending the spatio-temporal behavior dataset in Student Classroom Scenarios (SCB-ST-Dataset4) through image dataset. Our SCB-ST-Dataset4 comprises 754094 images with 25670 labels, focusing on 3 behaviors: hand-raising, reading, writing. Our proposed method can rapidly generate spatio-temporal behavioral datasets without requiring annotation. Furthermore, we proposed a Behavior Similarity Index (BSI) to explore the similarity of behaviors. We evaluated the dataset using the YOLOv5, YOLOv7, YOLOv8, and SlowFast algorithms, achieving a mean average precision (map) of up to 82.3%. The experiment further demonstrates the effectiveness of our method. This dataset provides a robust foundation for future research in student behavior detection, potentially contributing to advancements in this field. The SCB-ST-Dataset4 is available for download at: https://github.com/Whiffe/SCB-dataset.
    摘要 (使用深度学习方法检测学生学习环境中的行为自动化是一个有前途的方法,可以分析学生的课程表现和提高教学效果。然而,学生行为的公共可用空间时间数据集和手动标注这些数据集的高成本,对于这个领域的研究人员而言是一个大的挑战。为解决这个问题,我们提出了一种方法,通过图像集来扩展学生学习环境中的行为数据集。我们的SCB-ST-Dataset4包含754094张图像和25670个标签,关注3种行为:抬头、读书和写作。我们提出的方法可以快速生成空间时间行为数据集,不需要注解。此外,我们还提出了行为相似指数(BSI),以探索行为之间的相似性。我们使用YOLOv5、YOLOv7、YOLOv8和SlowFast算法进行评估,实现了最大平均准确率(map)达82.3%。实验证明了我们的方法的有效性。这个数据集为未来学生行为检测领域的研究提供了一个坚实的基础,有助于这一领域的进步。SCB-ST-Dataset4可以在以下链接下载:https://github.com/Whiffe/SCB-dataset。)

UAV-Sim: NeRF-based Synthetic Data Generation for UAV-based Perception

  • paper_url: http://arxiv.org/abs/2310.16255
  • repo_url: None
  • paper_authors: Christopher Maxey, Jaehoon Choi, Hyungtae Lee, Dinesh Manocha, Heesung Kwon
  • for: 用于提高UAV预测模型的训练数据 quantity和质量。
  • methods: 利用最新的神经渲染技术进行静态和动态新视图UAV预测图像生成,尤其是高空拍摄场景中的突出特征。
  • results: 使用混合实际和synthetic数据进行优化后,检测模型的性能得到了显著提升。
    Abstract Tremendous variations coupled with large degrees of freedom in UAV-based imaging conditions lead to a significant lack of data in adequately learning UAV-based perception models. Using various synthetic renderers in conjunction with perception models is prevalent to create synthetic data to augment the learning in the ground-based imaging domain. However, severe challenges in the austere UAV-based domain require distinctive solutions to image synthesis for data augmentation. In this work, we leverage recent advancements in neural rendering to improve static and dynamic novelview UAV-based image synthesis, especially from high altitudes, capturing salient scene attributes. Finally, we demonstrate a considerable performance boost is achieved when a state-ofthe-art detection model is optimized primarily on hybrid sets of real and synthetic data instead of the real or synthetic data separately.
    摘要 巨大的变化和大量的自由度在无人机图像环境中导致学习无人机图像模型的数据缺乏。使用各种合成渲染器和感知模型是常见的做法来创建合成数据以增强地面上的图像学习。然而,无人机图像领域的恶劣环境需要特有的解决方案来synthesize图像,尤其是从高空拍摄的场景。在这种情况下,我们利用最新的神经渲染技术来提高静止和动态新视图无人机图像synthesize,特别是高空拍摄的场景。最终,我们示出了将状态之最佳检测模型优化为主要使用混合的实际和合成数据集,而不是单独使用实际数据或合成数据,可以获得显著的性能提升。

GraFT: Gradual Fusion Transformer for Multimodal Re-Identification

  • paper_url: http://arxiv.org/abs/2310.16856
  • repo_url: None
  • paper_authors: Haoli Yin, Jiayao Li, Eva Schiller, Luke McDermott, Daniel Cummings
  • for: 本研究旨在提出一种能够有效地进行多Modal ReID的模型,以满足计算机视觉领域中增加模式的需求。
  • methods: 本研究提出了一种名为Gradual Fusion Transformer(GraFT)的新模型,它使用学习扩展的协同自注意力机制,以便同时捕捉多Modal特征和物体特征。此外,研究人员还提出了一种新的训练方法和一种改进的 triplet损失函数,以便优化ReID特征空间。
  • results: 对于多Modal ReID任务,GraFT consistently 超越了现有的多Modal ReID标准准确率。此外,研究人员还通过了大量的缺失学习研究,以证明GraFT的有效性。此外,为了实现模型的部署 versatility,研究人员还提出了一种基于神经网络裁剪的方法,以实现模型的大小和性能之间的平衡。
    Abstract Object Re-Identification (ReID) is pivotal in computer vision, witnessing an escalating demand for adept multimodal representation learning. Current models, although promising, reveal scalability limitations with increasing modalities as they rely heavily on late fusion, which postpones the integration of specific modality insights. Addressing this, we introduce the \textbf{Gradual Fusion Transformer (GraFT)} for multimodal ReID. At its core, GraFT employs learnable fusion tokens that guide self-attention across encoders, adeptly capturing both modality-specific and object-specific features. Further bolstering its efficacy, we introduce a novel training paradigm combined with an augmented triplet loss, optimizing the ReID feature embedding space. We demonstrate these enhancements through extensive ablation studies and show that GraFT consistently surpasses established multimodal ReID benchmarks. Additionally, aiming for deployment versatility, we've integrated neural network pruning into GraFT, offering a balance between model size and performance.
    摘要 物体重识别(ReID)在计算机视觉中至关重要,对多模态表示学习的需求不断增长。现有模型虽有前景,但随着模态数量增加暴露出可扩展性不足,因为它们严重依赖后期融合,推迟了各模态特定信息的整合。为此,我们提出用于多模态ReID的渐进融合Transformer(GraFT)。GraFT的核心是可学习的融合token,它们引导跨编码器的自注意力,从而同时捕捉模态特定特征和物体特定特征。为进一步提升效果,我们引入一种新的训练范式并结合增强的三元组损失,以优化ReID特征嵌入空间。我们通过大量消融实验验证这些改进,并表明GraFT持续超越现有的多模态ReID基准。此外,为兼顾部署灵活性,我们在GraFT中引入神经网络剪枝,在模型规模与性能之间取得平衡。
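The learnable-fusion-token idea can be sketched as follows: a few learned tokens are appended to the concatenated per-modality token sequences so that self-attention can aggregate modality-specific and object-specific cues into them, and the pooled fusion tokens serve as the ReID embedding. This is not the GraFT architecture; the sizes, the plain encoder, and the toy triplet usage are assumptions.

```python
import torch
import torch.nn as nn

class FusionTokenEncoder(nn.Module):
    def __init__(self, dim=256, n_fusion=4, n_layers=2, n_heads=4):
        super().__init__()
        self.fusion_tokens = nn.Parameter(torch.randn(1, n_fusion, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, rgb_tokens, ir_tokens):
        """rgb_tokens, ir_tokens: [B, N, C] outputs of the modality encoders."""
        B = rgb_tokens.size(0)
        fusion = self.fusion_tokens.expand(B, -1, -1)
        x = torch.cat([fusion, rgb_tokens, ir_tokens], dim=1)
        x = self.encoder(x)
        # Pool the fusion tokens into a single ReID embedding.
        return x[:, : fusion.size(1)].mean(dim=1)

model = FusionTokenEncoder()
rgb = torch.randn(8, 16, 256)      # e.g. 16 patch tokens from an RGB encoder
ir = torch.randn(8, 16, 256)       # e.g. 16 patch tokens from an infrared encoder
embedding = model(rgb, ir)
print(embedding.shape)             # torch.Size([8, 256])
# The embedding would then be trained with a triplet-style loss, e.g.:
triplet = nn.TripletMarginLoss(margin=0.3)
loss = triplet(embedding[:4], embedding[:4].roll(1, 0), embedding[4:])  # toy triplets
```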

cs.AI - 2023-10-25

math-PVS: A Large Language Model Framework to Map Scientific Publications to PVS Theories

  • paper_url: http://arxiv.org/abs/2310.17064
  • repo_url: None
  • paper_authors: Hassen Saidi, Susmit Jha, Tuhin Sahai
  • for: This paper aims to investigate the applicability of large language models (LLMs) in formalizing advanced mathematical concepts and to propose a framework for critically reviewing and checking mathematical reasoning in research papers.
  • methods: The proposed framework synergizes the capabilities of proof assistants, specifically PVS, with LLMs, enabling a bridge between textual descriptions in academic papers and formal specifications in PVS.
  • results: The proposed approach, called “math-PVS,” can automatically extract and formalize mathematical theorems from research papers, offering an innovative tool for academic review and discovery.
    Abstract As artificial intelligence (AI) gains greater adoption in a wide variety of applications, it has immense potential to contribute to mathematical discovery, by guiding conjecture generation, constructing counterexamples, assisting in formalizing mathematics, and discovering connections between different mathematical areas, to name a few. While prior work has leveraged computers for exhaustive mathematical proof search, recent efforts based on large language models (LLMs) aspire to position computing platforms as co-contributors in the mathematical research process. Despite their current limitations in logic and mathematical tasks, there is growing interest in melding theorem proving systems with foundation models. This work investigates the applicability of LLMs in formalizing advanced mathematical concepts and proposes a framework that can critically review and check mathematical reasoning in research papers. Given the noted reasoning shortcomings of LLMs, our approach synergizes the capabilities of proof assistants, specifically PVS, with LLMs, enabling a bridge between textual descriptions in academic papers and formal specifications in PVS. By harnessing the PVS environment, coupled with data ingestion and conversion mechanisms, we envision an automated process, called \emph{math-PVS}, to extract and formalize mathematical theorems from research papers, offering an innovative tool for academic review and discovery.
    摘要 随着人工智能(AI)在各种应用领域的推广,它在数学发现方面拥有巨大的潜力。AI可以引导推理生成、构建反例、协助正式化数学,以及发现不同数学领域之间的连接,等等。尽管以往的计算机被用于极限的数学证明搜索,但近期基于大语言模型(LLM)的努力希望将计算平台作为数学研究过程中的合作伙伴。虽然LLM在逻辑和数学任务上有限制,但是有关将基础模型与证明系统融合的兴趣在增长。这项工作研究了LLM在正式化高级数学概念方面的可用性,并提出了一个框架,可以对研究论文中的数学逻辑进行检查和评审。由于LLM的逻辑推理缺陷,我们的方法结合证明助手PVS的能力,实现了从学术论文中的文本描述转换到PVS中的正式规定的自动过程。通过将PVS环境与数据入口和转换机制相结合,我们可以实现一个名为“math-PVS”的自动化过程,从研究论文中提取和正式化数学定理,为数学研究的自动化评审和发现提供了一个创新的工具。

Learning Repeatable Speech Embeddings Using An Intra-class Correlation Regularizer

  • paper_url: http://arxiv.org/abs/2310.17049
  • repo_url: https://github.com/vigor-jzhang/icc-regularizer
  • paper_authors: Jianwei Zhang, Suren Jayasuriya, Visar Berisha
  • for: 这篇论文的目的是提出一种新的正则化方法,引导深度神经网络为特定机器学习任务学习更具可重复性的嵌入表示。
  • methods: 这个论文使用了 measurement theory 中的重复性概念,并提出了一种新的评价指标 - 内类相关系数(ICC)来评估嵌入的重复性。
  • results: 实验结果表明,添加 ICC 正则化可以提高学习的嵌入重复性,并且这些嵌入可以提高下游任务的表现,如 speaker verification、voice style conversion 和诊断异常声音。
    Abstract A good supervised embedding for a specific machine learning task is only sensitive to changes in the label of interest and is invariant to other confounding factors. We leverage the concept of repeatability from measurement theory to describe this property and propose to use the intra-class correlation coefficient (ICC) to evaluate the repeatability of embeddings. We then propose a novel regularizer, the ICC regularizer, as a complementary component for contrastive losses to guide deep neural networks to produce embeddings with higher repeatability. We use simulated data to explain why the ICC regularizer works better on minimizing the intra-class variance than the contrastive loss alone. We implement the ICC regularizer and apply it to three speech tasks: speaker verification, voice style conversion, and a clinical application for detecting dysphonic voice. The experimental results demonstrate that adding an ICC regularizer can improve the repeatability of learned embeddings compared to only using the contrastive loss; further, these embeddings lead to improved performance in these downstream tasks.
    摘要 一个针对特定机器学习任务的良好监督式嵌入应当只对感兴趣标签的变化敏感,而对其他混杂因素保持不变。我们借助测量理论中“可重复性”的概念来描述这一性质,并提议使用组内相关系数(ICC)来评估嵌入的可重复性。随后,我们提出了一种新的正则项(ICC正则项),作为对比损失的补充成分,引导深度神经网络产生可重复性更高的嵌入。我们用仿真数据解释了为什么ICC正则项在最小化类内方差方面优于单独使用对比损失。我们实现了ICC正则项,并将其应用于三个语音任务:说话人验证、语音风格转换以及用于检测发声障碍语音的临床应用。实验结果表明,与仅使用对比损失相比,加入ICC正则项可以提高所学嵌入的可重复性,而且这些嵌入也带来了上述下游任务性能的提升。
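The ICC regularizer can be illustrated with the textbook one-way ICC(1,1) computed per embedding dimension and averaged; how the paper aggregates dimensions and weights the term is not reproduced here, so `lam` and the dimension-wise averaging are assumptions.

```python
import torch

def icc_1_1(embeddings: torch.Tensor) -> torch.Tensor:
    """embeddings: [n_classes, k_repeats, dim] repeated measurements per class."""
    n, k, _ = embeddings.shape
    class_means = embeddings.mean(dim=1, keepdim=True)            # [n, 1, d]
    grand_mean = embeddings.mean(dim=(0, 1), keepdim=True)        # [1, 1, d]
    ss_between = k * ((class_means - grand_mean) ** 2).sum(dim=(0, 1))
    ss_within = ((embeddings - class_means) ** 2).sum(dim=(0, 1))
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within + 1e-8)
    return icc.mean()                                              # average over dims

def total_loss(contrastive_loss, embeddings, lam=0.5):
    # Higher ICC = more repeatable embeddings, so we subtract it from the loss.
    return contrastive_loss - lam * icc_1_1(embeddings)

# Toy check: 10 speakers, 4 utterances each, 32-dim embeddings.
emb = torch.randn(10, 4, 32)
emb_repeatable = emb[:, :1, :].expand(-1, 4, -1) + 0.01 * torch.randn(10, 4, 32)
print(icc_1_1(emb).item(), icc_1_1(emb_repeatable).item())   # low vs. close to 1
```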

StochGradAdam: Accelerating Neural Networks Training with Stochastic Gradient Sampling

  • paper_url: http://arxiv.org/abs/2310.17042
  • repo_url: None
  • paper_authors: Juyoung Yun
  • for: 提高深度学习优化的稳定性和性能
  • methods: 使用抽样 gradient 技术,选择部分 gradients 进行每一轮的优化
  • results: 在图像分类和分割任务中表现出色,优于传统 Adam 优化器
    Abstract In the rapidly advancing domain of deep learning optimization, this paper unveils the StochGradAdam optimizer, a novel adaptation of the well-regarded Adam algorithm. Central to StochGradAdam is its gradient sampling technique. This method not only ensures stable convergence but also leverages the advantages of selective gradient consideration, fostering robust training by potentially mitigating the effects of noisy or outlier data and enhancing the exploration of the loss landscape for more dependable convergence. In both image classification and segmentation tasks, StochGradAdam has demonstrated superior performance compared to the traditional Adam optimizer. By judiciously sampling a subset of gradients at each iteration, the optimizer is optimized for managing intricate models. The paper provides a comprehensive exploration of StochGradAdam's methodology, from its mathematical foundations to bias correction strategies, heralding a promising advancement in deep learning training techniques.
    摘要 在深度学习优化领域的快速发展中,这篇论文公布了StochGradAdam优化器,这是Adam算法的一种新的变体。StochGradAdam的核心技术是 Gradient Sampling 技术,这种方法不仅保证稳定的收敛,还可以选择性考虑梯度,从而避免噪音或异常数据的影响,并且可以更好地探索损失函数的地形,以更可靠的收敛。在图像分类和分割任务中,StochGradAdam已经与传统的Adam优化器相比示出了更高的性能。每次迭代中选择一 subset of 梯度,使得优化器更适合处理复杂的模型。文章从数学基础到偏误修正策略进行了全面的探讨,这标志着深度学习训练技术的一个新的突破。
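One way to read the gradient-sampling idea is as a standard Adam update in which only a random subset of gradient entries is applied at each step. The Bernoulli masking rule and the hyper-parameters below are illustrative assumptions, not the paper's exact sampling strategy.

```python
import torch

@torch.no_grad()
def stoch_grad_adam_step(params, state, lr=1e-3, betas=(0.9, 0.999),
                         eps=1e-8, sample_rate=0.7):
    state["t"] += 1
    t = state["t"]
    for i, p in enumerate(params):
        if p.grad is None:
            continue
        mask = (torch.rand_like(p) < sample_rate).float()   # sampled coordinates
        g = p.grad * mask
        m, v = state["m"][i], state["v"][i]
        m.mul_(betas[0]).add_(g, alpha=1 - betas[0])
        v.mul_(betas[1]).addcmul_(g, g, value=1 - betas[1])
        m_hat = m / (1 - betas[0] ** t)
        v_hat = v / (1 - betas[1] ** t)
        # Only the sampled coordinates move at this step.
        p.add_(-lr * mask * m_hat / (v_hat.sqrt() + eps))

# Toy usage on a small least-squares problem.
torch.manual_seed(0)
w = torch.randn(10, requires_grad=True)
x, y = torch.randn(256, 10), torch.randn(256)
state = {"t": 0, "m": [torch.zeros_like(w)], "v": [torch.zeros_like(w)]}
for _ in range(100):
    loss = ((x @ w - y) ** 2).mean()
    w.grad = None
    loss.backward()
    stoch_grad_adam_step([w], state)
print(loss.item())
```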

On Surgical Fine-tuning for Language Encoders

  • paper_url: http://arxiv.org/abs/2310.17041
  • repo_url: https://github.com/ymtao5219/surgical_fine_tuning
  • paper_authors: Abhilasha Lodha, Gayatri Belapurkar, Saloni Chalkapurkar, Yuanming Tao, Reshmi Ghosh, Samyadeep Basu, Dmitrii Petrov, Soundararajan Srinivasan
  • for: 这篇论文的目的是为了探索可以将语言模型 Fine-tuning 的层数范围降低到少数层,以提高下游语言任务的性能。
  • methods: 本研究使用一个基于 Fisher 信息矩阵对角线的简单度量(FIM score),来选择适合进行选择性 Fine-tuning 的层。这个度量能够有效地选出合适的层,从而在下游语言任务上取得强劲表现。
  • results: 研究发现,只需要 Fine-tuning 少数层就可以得到和完全 Fine-tuning 所有层的性能相似或更好的结果。此外,这个度量还可以在优化过程中保持不变,证明了其可靠性。
    Abstract Fine-tuning all the layers of a pre-trained neural language encoder (either using all the parameters or using parameter-efficient methods) is often the de-facto way of adapting it to a new task. We show evidence that for different downstream language tasks, fine-tuning only a subset of layers is sufficient to obtain performance that is close to and often better than fine-tuning all the layers in the language encoder. We propose an efficient metric based on the diagonal of the Fisher information matrix (FIM score), to select the candidate layers for selective fine-tuning. We show, empirically on GLUE and SuperGLUE tasks and across distinct language encoders, that this metric can effectively select layers leading to a strong downstream performance. Our work highlights that task-specific information corresponding to a given downstream task is often localized within a few layers, and tuning only those is sufficient for strong performance. Additionally, we demonstrate the robustness of the FIM score to rank layers in a manner that remains constant during the optimization process.
    摘要 通常情况下,使用预训练神经语言编码器的所有参数进行精细调整(或者使用 parameter-efficient methods)是适应新任务的准确方法。我们的实验表明,对不同的下游语言任务,只需要调整一 subset of layers 可以获得与所有层的语言编码器调整性能很近的性能。我们提出了一个有效的度量基于斜矩阵 Fisher information matrix(FIM score),用于选择候选层进行选择性调整。我们的实验表明,这个度量可以有效地选择层,并在 GLUE 和 SuperGLUE 任务上 across distinct language encoders 获得出色的下游性能。我们的研究表明,任务特定的信息通常在几层中具有高度的地方化特征,并且只需要调整这些层可以获得出色的性能。此外,我们还证明了 FIM score 可以在优化过程中保持不变的方式对层进行排名。
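A small sketch of the layer-selection recipe, scoring each encoder layer by the diagonal of the empirical Fisher information (squared gradients) and unfreezing only the top-scoring layers, is given below. The aggregation into one score per layer, the tiny batch, and the number of kept layers are assumptions; the Hugging Face model and parameter names are the standard ones for `bert-base-uncased`.

```python
import torch
from collections import defaultdict
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

texts = ["a great movie", "a terrible movie"]
labels = torch.tensor([1, 0])
batch = tok(texts, return_tensors="pt", padding=True)

model.zero_grad()
out = model(**batch, labels=labels)
out.loss.backward()

# FIM score per encoder layer: mean of squared gradients over that layer's parameters.
scores = defaultdict(list)
for name, p in model.named_parameters():
    if p.grad is not None and name.startswith("bert.encoder.layer."):
        layer_id = int(name.split(".")[3])
        scores[layer_id].append((p.grad ** 2).mean().item())
layer_scores = {k: sum(v) / len(v) for k, v in scores.items()}

top_layers = sorted(layer_scores, key=layer_scores.get, reverse=True)[:3]
print("layers selected for surgical fine-tuning:", top_layers)

# Freeze everything except the selected layers (and the task head).
for name, p in model.named_parameters():
    in_top = any(name.startswith(f"bert.encoder.layer.{i}.") for i in top_layers)
    p.requires_grad = in_top or name.startswith("classifier")
```

In practice the Fisher estimate would be accumulated over many batches rather than the two toy sentences used here.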

Apollo: Zero-shot MultiModal Reasoning with Multiple Experts

  • paper_url: http://arxiv.org/abs/2310.18369
  • repo_url: https://github.com/danielabd/apollo-cap
  • paper_authors: Daniela Ben-David, Tzuf Paz-Argaman, Reut Tsarfaty
  • for: The paper proposes a modular framework that leverages the expertise of different foundation models over different modalities and domains to perform a single, complex, multi-modal task, without relying on prompt engineering or tailor-made multi-modal training.
  • methods: The framework enables decentralized command execution and allows each model to both contribute to and benefit from the expertise of the other models. The approach can be extended to a variety of foundation models, including audio and vision models, and does not depend on prompts.
  • results: The approach is demonstrated on two tasks: stylized image captioning and audio-aware image captioning. It outperforms semi-supervised state-of-the-art models on stylized image captioning while being zero-shot and avoiding costly training, data collection, and prompt engineering, and it is further applied to the novel task of audio-aware image captioning, generating text that describes an image within the context of the provided audio.
    Abstract We propose a modular framework that leverages the expertise of different foundation models over different modalities and domains in order to perform a single, complex, multi-modal task, without relying on prompt engineering or otherwise tailor-made multi-modal training. Our approach enables decentralized command execution and allows each model to both contribute and benefit from the expertise of the other models. Our method can be extended to a variety of foundation models (including audio and vision), above and beyond only language models, as it does not depend on prompts. We demonstrate our approach on two tasks. On the well-known task of stylized image captioning, our experiments show that our approach outperforms semi-supervised state-of-the-art models, while being zero-shot and avoiding costly training, data collection, and prompt engineering. We further demonstrate this method on a novel task, audio-aware image captioning, in which an image and audio are given and the task is to generate text that describes the image within the context of the provided audio. Our code is available on GitHub.
    摘要 我们提出了一个模块化框架,利用不同基础模型在不同Modalities和领域的专业知识,实现单一、复杂多Modal任务,不依赖于提问工程或特制多Modal训练。我们的方法允许分布式命令执行和每个模型都可以享受到其他模型的专业知识。我们的方法可以扩展到多种基础模型(包括音频和视觉),而不仅仅是语言模型,因为它不依赖于提问。我们的实验表明,我们的方法可以超越半导化状态体验的模型,而且是零shot和不需要贵重训练、数据收集和提问工程。我们还在一个新任务上进行了实验,即Audio-aware图像描述,在给定的图像和音频基础上,生成描述图像的文本。我们的代码可以在GitHub上下载。

netFound: Foundation Model for Network Security

  • paper_url: http://arxiv.org/abs/2310.17025
  • repo_url: None
  • paper_authors: Satyandra Guthula, Navya Battula, Roman Beltiukov, Wenbo Guo, Arpit Gupta
  • for: 本研究旨在提出一种基础模型(netFound),用于网络安全领域的机器学习(ML)应用。
  • methods: 本研究使用自我超vised算法对 readily available的无标签网络包迹进行预训练,然后使用层次和多模态特征来具体捕捉网络交互的隐藏 context。
  • results: 对三种网络下游任务(流量分类、网络入侵检测和APT检测)进行了实验,并证明了 netFound 在这些任务中的superiority,同时也证明了其对噪音和缺失标签、时间变化和多种网络环境的Robustness。
    Abstract In ML for network security, traditional workflows rely on high-quality labeled data and manual feature engineering, but limited datasets and human expertise hinder feature selection, leading to models struggling to capture crucial relationships and generalize effectively. Inspired by recent advancements in ML application domains like GPT-4 and Vision Transformers, we have developed netFound, a foundational model for network security. This model undergoes pre-training using self-supervised algorithms applied to readily available unlabeled network packet traces. netFound's design incorporates hierarchical and multi-modal attributes of network traffic, effectively capturing hidden networking contexts, including application logic, communication protocols, and network conditions. With this pre-trained foundation in place, we can fine-tune netFound for a wide array of downstream tasks, even when dealing with low-quality, limited, and noisy labeled data. Our experiments demonstrate netFound's superiority over existing state-of-the-art ML-based solutions across three distinct network downstream tasks: traffic classification, network intrusion detection, and APT detection. Furthermore, we emphasize netFound's robustness against noisy and missing labels, as well as its ability to generalize across temporal variations and diverse network environments. Finally, through a series of ablation studies, we provide comprehensive insights into how our design choices enable netFound to more effectively capture hidden networking contexts, further solidifying its performance and utility in network security applications.
    摘要 在网络安全领域中的机器学习(ML)工作流程,传统上依靠高质量的标签数据和人工工程师,但是有限的数据集和人工专业知识限制了特征选择,导致模型困难捕捉关键关系和泛化有效。受最近的机器学习应用领域的进步,如GPT-4和视觉转换器,我们开发了netFound,一个基础模型 для网络安全。netFound模型在自动学习算法应用于可以获得的无标签网络包迹数据上进行预训练。netFound的设计包括层次结构和多模式特征的网络流量,能够有效捕捉隐藏的网络上下文,包括应用逻辑、通信协议和网络条件。通过这种预训练基础,我们可以对netFound进行细化,即使处理低质量、有限和噪声的标签数据时。我们的实验表明,netFound在三个不同的网络下游任务上表现出优于当前状态艺术的机器学习基本解决方案:流量分类、网络入侵检测和APT检测。此外,我们强调netFound对噪声和缺失标签的 Robustness,以及其能够在时间变化和多种网络环境下泛化。最后,通过一系列的减少研究,我们提供了广泛的启示,描述了我们的设计选择如何使得netFound更有效地捕捉隐藏的网络上Context,进一步巩固其性能和实用性在网络安全应用中。

Controlled Decoding from Language Models

  • paper_url: http://arxiv.org/abs/2310.17022
  • repo_url: None
  • paper_authors: Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, Jilin Chen, Alex Beutel, Ahmad Beirami
  • for: 这个论文目标是控制语言模型的生成,使其向高评价结果靠拢。
  • methods: 该论文提出了一种新的Off-policy Reinforcement Learning方法,称为Controlled Decoding(CD),通过一个值函数来控制生成的评价。
  • results: 实验表明,CD能够有效地控制Reddit conversations corpus中的语言模型生成,并且可以解决多目标 reinforcement learning 问题无需额外复杂度。
    Abstract We propose controlled decoding (CD), a novel off-policy reinforcement learning method to control the autoregressive generation from language models towards high reward outcomes. CD solves an off-policy reinforcement learning problem through a value function for the reward, which we call a prefix scorer. The prefix scorer is used at inference time to steer the generation towards higher reward outcomes. We show that the prefix scorer may be trained on (possibly) off-policy data to predict the expected reward when decoding is continued from a partially decoded response. We empirically demonstrate that CD is effective as a control mechanism on Reddit conversations corpus. We also show that the modularity of the design of CD makes it possible to control for multiple rewards, effectively solving a multi-objective reinforcement learning problem with no additional complexity. Finally, we show that CD can be applied in a novel blockwise fashion at inference-time, again without the need for any training-time changes, essentially bridging the gap between the popular best-of-$K$ strategy and token-level reinforcement learning. This makes CD a promising approach for alignment of language models.
    摘要 我们提出控制解码(CD),一种新的离政策强化学习方法,用于控制语言模型的自然逻辑生成向高赏点结果。CD解决了离政策强化学习问题通过一个值函数,我们称之为前缀评分器。前缀评分器在推理时使用于引导生成向高赏点结果。我们证明了前缀评分器可以在(可能)离政策数据上训练,以预测继续推理后的预期奖励。我们在Reddit会话集体上进行了实验,证明了CD的效果。此外,我们还表明了CD的模块化设计,可以控制多个奖励,实际解决了多目标强化学习问题无额外复杂度。最后,我们表明了CD可以在推理时进行块式应用,再无需任何训练时间变化,实际连接了最受欢迎的best-of-$K$策略和token级强化学习。这使得CD成为对语言模型的Alignment的有望方法。
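Prefix-scorer-guided decoding can be sketched as follows: the language model proposes top-k candidate tokens, a value function scores the expected reward of each candidate prefix, and decoding follows the combined score. The toy LM, the toy scorer, and the `lam` weight are stand-ins, not the paper's trained components.

```python
import torch

vocab_size, eos_id = 100, 0
torch.manual_seed(0)

def lm_log_probs(prefix: list[int]) -> torch.Tensor:
    """Stand-in for a language model's next-token log-probabilities."""
    logits = torch.randn(vocab_size) + 0.01 * len(prefix)
    return torch.log_softmax(logits, dim=-1)

def prefix_scorer(prefix: list[int]) -> float:
    """Stand-in for the learned value function V(prefix) ~ expected reward."""
    return float(sum(prefix) % 7) / 7.0          # arbitrary toy "reward"

def controlled_decode(prompt, max_len=10, top_k=8, lam=2.0):
    prefix = list(prompt)
    for _ in range(max_len):
        log_p = lm_log_probs(prefix)
        cand_lp, cand_ids = torch.topk(log_p, top_k)
        # Combine LM likelihood with the prefix scorer's predicted reward.
        scores = [lp.item() + lam * prefix_scorer(prefix + [i.item()])
                  for lp, i in zip(cand_lp, cand_ids)]
        next_id = cand_ids[int(torch.tensor(scores).argmax())].item()
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix

print(controlled_decode([5, 17, 42]))
```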

An Integrative Survey on Mental Health Conversational Agents to Bridge Computer Science and Medical Perspectives

  • paper_url: http://arxiv.org/abs/2310.17017
  • repo_url: https://github.com/jeffreych0/mental_chatbot_survey
  • paper_authors: Young Min Cho, Sunny Rai, Lyle Ungar, João Sedoc, Sharath Chandra Guntuku
  • for: 这个论文主要是为了探讨心理健康对话代理(即 chatbot)在解决心理健康挑战方面的潜在效果,以及如何bridge между计算机科学和医学两个领域之间的知识分享。
  • methods: 这篇论文采用了PRISMA框架进行系统性的文献综述,检查了534篇发表在计算机科学和医学两个领域的论文。
  • results: 论文发现了136篇关于建立心理健康相关对话代理的关键论文,这些论文中的模型和实验设计技术有多种多样。计算机科学论文更加关注LLM技术和自动化评价指标,而医学论文则更加关注规则驱动的对话代理和参与者的健康结果。
    Abstract Mental health conversational agents (a.k.a. chatbots) are widely studied for their potential to offer accessible support to those experiencing mental health challenges. Previous surveys on the topic primarily consider papers published in either computer science or medicine, leading to a divide in understanding and hindering the sharing of beneficial knowledge between both domains. To bridge this gap, we conduct a comprehensive literature review using the PRISMA framework, reviewing 534 papers published in both computer science and medicine. Our systematic review reveals 136 key papers on building mental health-related conversational agents with diverse characteristics of modeling and experimental design techniques. We find that computer science papers focus on LLM techniques and evaluating response quality using automated metrics with little attention to the application while medical papers use rule-based conversational agents and outcome metrics to measure the health outcomes of participants. Based on our findings on transparency, ethics, and cultural heterogeneity in this review, we provide a few recommendations to help bridge the disciplinary divide and enable the cross-disciplinary development of mental health conversational agents.
    摘要 心理健康对话机器人(即chatbot)广泛研究其潜在性能提供访问支持心理健康挑战的人。先前的调查主要考虑计算机科学和医学领域发表的论文,导致两个领域之间的理解不同,阻碍两个领域之间的有益知识共享。为bridging这个差距,我们采用PRISMA框架进行了全面的文献综述,查看了534篇发表在计算机科学和医学两个领域的论文。我们的系统性综述发现了136篇关于建立心理健康相关对话机器人的重要论文,其中计算机科学论文主要关注LLM技术和自动评价指标,而医学论文主要采用规则型对话机器人和参与者的健康结果评价指标。根据我们在透明度、伦理和文化多样性方面的发现,我们提出了一些建议,以帮助跨学科发展心理健康对话机器人。

Personalized Speech-driven Expressive 3D Facial Animation Synthesis with Style Control

  • paper_url: http://arxiv.org/abs/2310.17011
  • repo_url: None
  • paper_authors: Elif Bozkurt
  • for: 这个论文旨在创建一个基于真实人脸动画的表情控制框架,以实现高度自然和可靠的人脸动画synthesis。
  • methods: 该框架使用一个非autoregressive encoder-decoder架构,包括表情编码器、语音编码器和表情解码器。在训练阶段,表情编码器首先分解人脸动画序列为个性特征和语音相关信息,然后将这些信息输入到 transformer层中进行更新。
  • results: 该方法可以生成基于输入语音的自然和准确的人脸动画,同时保留目标人脸的说话风格。
    Abstract Different people have different facial expressions while speaking emotionally. A realistic facial animation system should consider such identity-specific speaking styles and facial idiosyncrasies to achieve high-degree of naturalness and plausibility. Existing approaches to personalized speech-driven 3D facial animation either use one-hot identity labels or rely-on person specific models which limit their scalability. We present a personalized speech-driven expressive 3D facial animation synthesis framework that models identity specific facial motion as latent representations (called as styles), and synthesizes novel animations given a speech input with the target style for various emotion categories. Our framework is trained in an end-to-end fashion and has a non-autoregressive encoder-decoder architecture with three main components: expression encoder, speech encoder and expression decoder. Since, expressive facial motion includes both identity-specific style and speech-related content information; expression encoder first disentangles facial motion sequences into style and content representations, respectively. Then, both of the speech encoder and the expression decoders input the extracted style information to update transformer layer weights during training phase. Our speech encoder also extracts speech phoneme label and duration information to achieve better synchrony within the non-autoregressive synthesis mechanism more effectively. Through detailed experiments, we demonstrate that our approach produces temporally coherent facial expressions from input speech while preserving the speaking styles of the target identities.
    摘要 不同的人有不同的脸部表达方式,一个真实的脸部动画系统应该考虑到这些人各自的脸部表达风格和特点,以达到高度的自然性和可信度。现有的人性化语音驱动3D脸部动画方法可能使用一个热度标签或者基于特定人的模型,这限制了其扩展性。我们提出了一种基于语音输入的个性化表达3D脸部动画生成框架,该框架模型了人各自的脸部运动为缓存表示(称为风格),并将输入语音中的各种情感类别内容与目标风格相匹配。我们的框架在端到端的方式进行训练,具有非autoregressive编码器-解码器架构,包括表达编码器、语音编码器和表达解码器三个主要组件。由于表达动作包含人各自风格特点和语音相关的内容信息,表达编码器首先分解脸部动作序列为风格和内容表示,然后两个语音编码器和表达解码器都输入提取的风格信息以更新权重 durante el entrenamiento.我们的语音编码器还提取了语音音频标签和持续时间信息,以更好地同步在非autoregressive生成过程中。经过详细的实验,我们示出了我们的方法可以从输入语音中生成具有同步的脸部表达,同时保留目标人的说话风格。

This Reads Like That: Deep Learning for Interpretable Natural Language Processing

  • paper_url: http://arxiv.org/abs/2310.17010
  • repo_url: https://github.com/fanconic/this_reads_like_that
  • paper_authors: Claudio Fanconi, Moritz Vandenhirtz, Severin Husmann, Julia E. Vogt
  • for: 这个论文是为了提高自然语言处理中的 prototype 网络的性能和可解释性而写的。
  • methods: 这个论文使用了learned weighted similarity measure来增强prototype网络中的相似计算,以及提出了一种post-hoc解释机制来提取输入句子和prototype句子中关键的单词。
  • results: 论文的实验结果表明,对于 AG News 和 RT Polarity 数据集,提出的方法不仅超过了之前的 prototype-based 方法的预测性能,还提高了解释性的准确性 compared to rationale-based 循环卷积。
    Abstract Prototype learning, a popular machine learning method designed for inherently interpretable decisions, leverages similarities to learned prototypes for classifying new data. While it is mainly applied in computer vision, in this work, we build upon prior research and further explore the extension of prototypical networks to natural language processing. We introduce a learned weighted similarity measure that enhances the similarity computation by focusing on informative dimensions of pre-trained sentence embeddings. Additionally, we propose a post-hoc explainability mechanism that extracts prediction-relevant words from both the prototype and input sentences. Finally, we empirically demonstrate that our proposed method not only improves predictive performance on the AG News and RT Polarity datasets over a previous prototype-based approach, but also improves the faithfulness of explanations compared to rationale-based recurrent convolutions.
    摘要 专案学习,一种具有内在可解释的机器学习方法,利用学习到的原型 Similarities 来类别新数据。它主要应用于计算机视觉领域,在这个工作中,我们基于先前的研究进一步探索 prototype 网络的扩展到自然语言处理。我们提出了一个学习加权相似度量表,可以增强相似度计算,专注于预训照句子嵌入中的有用维度。此外,我们提出了一个后续解释机制,可以从原型和输入句子中提取预测相关的字词。最后,我们实践表明,我们的提议方法不仅在 AG News 和 RT Polarity 数据集上超越先前的原型基于方法,而且也提高了解释的实惠性比过去的 rational 基于循环推导。

STEER: Semantic Turn Extension-Expansion Recognition for Voice Assistants

  • paper_url: http://arxiv.org/abs/2310.16990
  • repo_url: None
  • paper_authors: Leon Liyang Zhang, Jiarui Lu, Joel Ruben Antony Moniz, Aditya Kulkarni, Dhivya Piraviperumal, Tien Dung Tran, Nicholas Tzou, Hong Yu
  • for: 这个研究的目的是提出一种探测用户发送后续指令时的引导intent模型(STEER),以便更好地理解用户的需求。
  • methods: 该模型使用了一些规则来采样opt-in的使用数据,并使用了自然语言处理技术来分类用户的意图。
  • results: 实验结果表明,STEER模型在采样的数据上显示了优秀的准确率(超过95%),并且在真实世界中的探测场景中也表现出了强大的零基eline性。此外,提出了一个增强版本的模型(STEER+),该模型使用semantic parse tree来提供更多的上下文,以便更好地理解句子中的异常词。
    Abstract In the context of a voice assistant system, steering refers to the phenomenon in which a user issues a follow-up command attempting to direct or clarify a previous turn. We propose STEER, a steering detection model that predicts whether a follow-up turn is a user's attempt to steer the previous command. Constructing a training dataset for steering use cases poses challenges due to the cold-start problem. To overcome this, we developed heuristic rules to sample opt-in usage data, approximating positive and negative samples without any annotation. Our experimental results show promising performance in identifying steering intent, with over 95% accuracy on our sampled data. Moreover, STEER, in conjunction with our sampling strategy, aligns effectively with real-world steering scenarios, as evidenced by its strong zero-shot performance on a human-graded evaluation set. In addition to relying solely on user transcripts as input, we introduce STEER+, an enhanced version of the model. STEER+ utilizes a semantic parse tree to provide more context on out-of-vocabulary words, such as named entities that often occur at the sentence boundary. This further improves model performance, reducing error rate in domains where entities frequently appear, such as messaging. Lastly, we present a data analysis that highlights the improvement in user experience when voice assistants support steering use cases.
    摘要 在语音助手系统中,“引导”(steering)指的是用户发出后续命令,试图修正或澄清上一轮指令的现象。我们提出了STEER模型,用于预测某个后续指令是否是用户在引导之前的命令。由于冷启动问题,为引导用例构建训练数据集颇具挑战;为此,我们设计了启发式规则来对 opt-in 使用数据进行采样,在无需任何标注的情况下近似正例和负例。实验结果表明,STEER在识别引导意图方面表现良好,在采样数据上的准确率超过95%。此外,STEER结合我们的采样策略,与真实世界中的引导场景高度契合,其在人工评分的评估集上的强大零样本性能即是证明。除了仅以用户转写文本为输入外,我们还引入了增强版模型STEER+。STEER+利用语义分析树为词表外词(例如常出现在句子边界的命名实体)提供更多上下文,进一步提升了模型性能,降低了实体频繁出现的领域(如消息场景)中的错误率。最后,我们通过数据分析展示了语音助手支持引导用例时用户体验的提升。

The Significance of Machine Learning in Clinical Disease Diagnosis: A Review

  • paper_url: http://arxiv.org/abs/2310.16978
  • repo_url: None
  • paper_authors: S M Atikur Rahman, Sifat Ibtisum, Ehsan Bazgir, Tumpa Barai
  • for: 本研究旨在提高心率数据传输的准确性和计算效率,帮助医疗机构更好地诊断疾病。
  • methods: 本研究使用了多种机器学习算法,包括支持向量机器学习、决策树、Random Forest等,以提高疾病诊断的准确性和效率。
  • results: 研究发现,使用机器学习算法可以提高心率数据的准确性和计算效率,并且可以适应不同的疾病类型和数据类型。
    Abstract The global need for effective disease diagnosis remains substantial, given the complexities of various disease mechanisms and diverse patient symptoms. To tackle these challenges, researchers, physicians, and patients are turning to machine learning (ML), an artificial intelligence (AI) discipline, to develop solutions. By leveraging sophisticated ML and AI methods, healthcare stakeholders gain enhanced diagnostic and treatment capabilities. However, there is a scarcity of research focused on ML algorithms for enhancing the accuracy and computational efficiency. This research investigates the capacity of machine learning algorithms to improve the transmission of heart rate data in time series healthcare metrics, concentrating particularly on optimizing accuracy and efficiency. By exploring various ML algorithms used in healthcare applications, the review presents the latest trends and approaches in ML-based disease diagnosis (MLBDD). The factors under consideration include the algorithm utilized, the types of diseases targeted, the data types employed, the applications, and the evaluation metrics. This review aims to shed light on the prospects of ML in healthcare, particularly in disease diagnosis. By analyzing the current literature, the study provides insights into state-of-the-art methodologies and their performance metrics.
    摘要 全球医疾诊断需求仍然很大,因为各种疾病机制复杂,病人症状多样化。为了解决这些挑战,研究人员、医生和患者都在转向机器学习(ML),一种人工智能(AI)专业,开发解决方案。通过利用高级ML和AI技术,医疗各界人员获得了提高诊断和治疗能力。然而,关于ML算法以提高时间序列医疗指标中心脉速率数据传输的准确性和计算效率的研究相对落后。本研究探讨了ML算法在医疗应用中的可行性,特别是在提高准确性和计算效率方面。通过检查各种健康应用中的ML算法,本文提供了最新的趋势和方法。评价标准包括算法使用、疾病类型、数据类型、应用程序和评价指标。本文的目的是探讨ML在医疗领域的前景,特别是疾病诊断方面的可能性。通过分析当前文献,本研究提供了有关当前领先技术和其性能指标的视角。

CL-MASR: A Continual Learning Benchmark for Multilingual ASR

  • paper_url: http://arxiv.org/abs/2310.16931
  • repo_url: https://github.com/speechbrain/benchmarks
  • paper_authors: Luca Della Libera, Pooneh Mousavi, Salah Zaiem, Cem Subakan, Mirco Ravanelli
  • for: 本研究旨在提供一个用于多语言自动语音识别(ASR)系统的 continual learning benchmark,以探索在新语言中学习时,如何保持先前语言的知识。
  • methods: 本研究使用了现有的大规模预训 ASR 模型,并实现了多种 continual learning 方法,以评估学习新语言时的效果。
  • results: 本研究提供了一个多语言 ASR 的 continual learning benchmark,并在这个 benchmark 上评估了多种 continual learning 方法的效果,以探索如何在学习新语言时,保持先前语言的知识。
    Abstract Modern multilingual automatic speech recognition (ASR) systems like Whisper have made it possible to transcribe audio in multiple languages with a single model. However, current state-of-the-art ASR models are typically evaluated on individual languages or in a multi-task setting, overlooking the challenge of continually learning new languages. There is insufficient research on how to add new languages without losing valuable information from previous data. Furthermore, existing continual learning benchmarks focus mostly on vision and language tasks, leaving continual learning for multilingual ASR largely unexplored. To bridge this gap, we propose CL-MASR, a benchmark designed for studying multilingual ASR in a continual learning setting. CL-MASR provides a diverse set of continual learning methods implemented on top of large-scale pretrained ASR models, along with common metrics to assess the effectiveness of learning new languages while addressing the issue of catastrophic forgetting. To the best of our knowledge, CL-MASR is the first continual learning benchmark for the multilingual ASR task. The code is available at https://github.com/speechbrain/benchmarks.
    摘要 现代多语言自动语音识别(ASR)系统(如Whisper)已经可以通过单个模型识别多种语言的音频。然而,当前领先的ASR模型通常只在单个语言或多任务设置下进行评估,忽略了不断学习新语言的挑战。关于如何在不丢失先前数据中宝贵信息的前提下加入新语言,目前的研究仍然不足。此外,现有的持续学习基准主要集中在视觉和语言任务上,多语言ASR的持续学习基本仍未被探索。为弥补这一空白,我们提出了CL-MASR,一个用于在持续学习设置下研究多语言ASR的基准。CL-MASR在大规模预训练ASR模型之上实现了多种持续学习方法,并提供了通用的评估指标,用于衡量学习新语言的效果以及对灾难性遗忘问题的应对。据我们所知,CL-MASR是首个面向多语言ASR任务的持续学习基准。代码可以在 https://github.com/speechbrain/benchmarks 找到。

Wide Flat Minimum Watermarking for Robust Ownership Verification of GANs

  • paper_url: http://arxiv.org/abs/2310.16919
  • repo_url: None
  • paper_authors: Jianwei Fei, Zhihua Xia, Benedetta Tondi, Mauro Barni
  • for: 保护基于生成器的知识产权 (Intellectual Property Rights) against 白盒模型攻击 (white-box attacks)
  • methods: 使用多比特盒子-自由 watermarking 方法,在 GAN 训练过程中添加额外的 watermarking 损失函数,使得生成的图像含有隐藏的水印,可以通过预训练的水印解码器检测。为提高鲁棒性,使模型参数具有宽浅的极小值,使得任何模型参数修改都不会消除水印。
  • results: 实验结果表明,存在水印的图像质量几乎不受影响,而且水印具有高度鲁棒性,抵抗模型修改和代理模型攻击。
    Abstract We propose a novel multi-bit box-free watermarking method for the protection of Intellectual Property Rights (IPR) of GANs with improved robustness against white-box attacks like fine-tuning, pruning, quantization, and surrogate model attacks. The watermark is embedded by adding an extra watermarking loss term during GAN training, ensuring that the images generated by the GAN contain an invisible watermark that can be retrieved by a pre-trained watermark decoder. In order to improve the robustness against white-box model-level attacks, we make sure that the model converges to a wide flat minimum of the watermarking loss term, in such a way that any modification of the model parameters does not erase the watermark. To do so, we add random noise vectors to the parameters of the generator and require that the watermarking loss term is as invariant as possible with respect to the presence of noise. This procedure forces the generator to converge to a wide flat minimum of the watermarking loss. The proposed method is architectureand dataset-agnostic, thus being applicable to many different generation tasks and models, as well as to CNN-based image processing architectures. We present the results of extensive experiments showing that the presence of the watermark has a negligible impact on the quality of the generated images, and proving the superior robustness of the watermark against model modification and surrogate model attacks.
    摘要 我们提出了一种新的多比特无盒(box-free)水印方法,用于保护生成对抗网络(GAN)的知识产权(IPR),并提升其对微调、剪枝、量化和代理模型攻击等白盒攻击的鲁棒性。该水印通过在GAN训练时添加额外的水印损失项嵌入,使生成器生成的图像包含不可见的水印,并可由预训练的水印解码器检索。为了增强对白盒模型级攻击的鲁棒性,我们确保模型收敛到水印损失项的一个宽阔平坦极小值,使得对模型参数的任何修改都不会抹除水印。为此,我们向生成器参数中加入随机噪声向量,并要求水印损失项在噪声存在时尽可能保持不变;这一过程迫使生成器收敛到水印损失的宽阔平坦极小值。该方法与网络架构和数据集无关,因此可应用于多种不同的生成任务和模型,以及基于CNN的图像处理架构。大量实验结果表明,水印的存在对生成图像质量的影响可以忽略不计,并证明了该水印对模型修改和代理模型攻击具有更优的鲁棒性。
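
The following is a rough sketch, in the spirit of the flat-minimum watermarking objective above: the generator's weights are temporarily perturbed with random noise, the watermark loss is backpropagated through the perturbed copy, and the weights are then restored. The function names, the single random perturbation, and the BCE decoding loss are assumptions for illustration, not the paper's exact training procedure.

```python
import torch
import torch.nn.functional as F

def watermark_flatness_step(generator, wm_decoder, z, target_bits, noise_std=0.01):
    """One illustrative step of a flat-minimum watermark objective.

    `generator`, `wm_decoder` (a pre-trained watermark decoder) and `target_bits`
    (the multi-bit message, shape (B, n_bits), values in {0, 1}) are placeholders.
    Gradients accumulate on the generator while evaluated at perturbed weights,
    which encourages the watermark loss to stay low in a neighbourhood of the
    current parameters."""
    noises = []
    with torch.no_grad():
        for p in generator.parameters():
            n = noise_std * torch.randn_like(p)
            p.add_(n)                     # temporarily perturb the weights
            noises.append(n)
    logits = wm_decoder(generator(z))     # decode watermark from generated images
    loss = F.binary_cross_entropy_with_logits(logits, target_bits.float())
    loss.backward()                       # gradients land on the generator's params
    with torch.no_grad():
        for p, n in zip(generator.parameters(), noises):
            p.sub_(n)                     # restore the unperturbed weights
    return loss.detach()
```

In practice this term would be added to the usual GAN losses and stepped by the same optimizer.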

Unsupervised Learning of Molecular Embeddings for Enhanced Clustering and Emergent Properties for Chemical Compounds

  • paper_url: http://arxiv.org/abs/2310.18367
  • repo_url: None
  • paper_authors: Jaiveer Gill, Ratul Chakraborty, Reetham Gubba, Amy Liu, Shrey Jain, Chirag Iyer, Obaid Khwaja, Saurav Kumar
  • for: 这篇论文的目的是开发一种新的计算工具,用于探索和理解分子结构与性质。
  • methods: 这篇论文使用多种方法基于 SMILES 数据对化合物进行探测和聚类,包括对化合物图结构的嵌入分析和自然语言描述嵌入。
  • results: 结果表明,这些方法能够得到清晰、集中的聚类,并能有效地查询和理解化合物。
    Abstract The detailed analysis of molecular structures and properties holds great potential for drug development discovery through machine learning. Developing an emergent property in the model to understand molecules would broaden the horizons for development with a new computational tool. We introduce various methods to detect and cluster chemical compounds based on their SMILES data. Our first method, analyzing the graphical structures of chemical compounds using embedding data, employs vector search to meet our threshold value. The results yielded pronounced, concentrated clusters, and the method produced favorable results in querying and understanding the compounds. We also used natural language description embeddings stored in a vector database with GPT3.5, which outperforms the base model. Thus, we introduce a similarity search and clustering algorithm to aid in searching for and interacting with molecules, enhancing efficiency in chemical exploration and enabling future development of emergent properties in molecular property prediction models.
    摘要 对分子结构和性质的详细分析,在借助机器学习进行药物研发方面具有巨大潜力;在模型中发展出理解分子的涌现性质,将为这一领域带来新的计算工具。我们引入了多种基于SMILES数据对化合物进行检测和聚类的方法。第一种方法利用嵌入数据分析化合物的图结构,并通过向量搜索满足设定的阈值;结果得到了清晰、集中的聚类,该方法在查询和理解化合物方面表现良好。我们还将自然语言描述嵌入存储在向量数据库中并结合GPT-3.5使用,其表现超越了基础模型。由此,我们提出了相似度搜索与聚类算法,以辅助分子的搜索与交互,提高化学探索的效率,并为未来在分子性质预测模型中发展涌现性质奠定基础。
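
As a toy illustration of the threshold-based vector search described above, the sketch below greedily groups pre-computed molecule embeddings by cosine similarity. The greedy centroid scheme and the 0.85 threshold are assumptions for illustration; the paper's pipeline uses a vector database rather than this in-memory loop.

```python
import numpy as np

def cluster_by_similarity(embeddings: np.ndarray, threshold: float = 0.85):
    """Greedy clustering over pre-computed molecule embeddings: each molecule joins
    the most similar existing cluster if its cosine similarity to that cluster's
    seed exceeds `threshold`, otherwise it seeds a new cluster. Illustrative only."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroids: list[np.ndarray] = []   # one representative vector per cluster
    clusters: list[list[int]] = []     # indices of molecules in each cluster
    for i, vec in enumerate(normed):
        sims = [float(vec @ c) for c in centroids]
        if sims and max(sims) >= threshold:
            clusters[int(np.argmax(sims))].append(i)
        else:
            centroids.append(vec)
            clusters.append([i])
    return clusters

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(100, 256))          # stand-in for SMILES/graph embeddings
    print(len(cluster_by_similarity(emb)))     # number of clusters found
```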

RDBench: ML Benchmark for Relational Databases

  • paper_url: http://arxiv.org/abs/2310.16837
  • repo_url: None
  • paper_authors: Zizhao Zhang, Yi Yang, Lutong Zou, He Wen, Tao Feng, Jiaxuan You
  • for: ML Benchmark For Relational Databases (RDBench) aims to promote reproducible ML research on RDBs that include multiple tables.
  • methods: RDBench offers diverse RDB datasets of varying scales, domains, and relational structures, organized into 4 levels. It exposes three types of interfaces including tabular data, homogeneous graphs, and heterogeneous graphs, sharing the same underlying task definition.
  • results: RDBench enables meaningful comparisons between ML methods from diverse domains, ranging from XGBoost to Graph Neural Networks, under RDB prediction tasks. Multiple classification and regression tasks are designed for each RDB dataset, and results are reported with averaged findings to enhance the robustness of the experimental results.
    Abstract Benefiting from high-quality datasets and standardized evaluation metrics, machine learning (ML) has achieved sustained progress and widespread applications. However, while applying machine learning to relational databases (RDBs), the absence of a well-established benchmark remains a significant obstacle to the development of ML. To address this issue, we introduce ML Benchmark For Relational Databases (RDBench), a standardized benchmark that aims to promote reproducible ML research on RDBs that include multiple tables. RDBench offers diverse RDB datasets of varying scales, domains, and relational structures, organized into 4 levels. Notably, to simplify the adoption of RDBench for diverse ML domains, for any given database, RDBench exposes three types of interfaces including tabular data, homogeneous graphs, and heterogeneous graphs, sharing the same underlying task definition. For the first time, RDBench enables meaningful comparisons between ML methods from diverse domains, ranging from XGBoost to Graph Neural Networks, under RDB prediction tasks. We design multiple classification and regression tasks for each RDB dataset and report averaged results over the same dataset, further enhancing the robustness of the experimental findings. RDBench is implemented with DBGym, a user-friendly platform for ML research and application on databases, enabling benchmarking new ML methods with RDBench at ease.
    摘要 得益于高质量的数据集和标准化的评估指标,机器学习(ML)取得了持续的进步和广泛的应用。然而,在将机器学习应用于关系数据库(RDB)时,缺乏成熟的基准测试仍是ML发展的重要障碍。为解决这一问题,我们提出了面向关系数据库的ML基准(RDBench),一个旨在促进针对包含多张表的RDB的可复现ML研究的标准化基准。RDBench提供了规模、领域和关系结构各异的多种RDB数据集,并组织为4个层级。特别地,为便于不同ML领域采用RDBench,对于任意给定的数据库,RDBench提供了三种类型的接口:表格数据、同质图和异质图,它们共享同一个底层任务定义。RDBench首次使得来自不同领域的ML方法(从XGBoost到图神经网络)能够在RDB预测任务下进行有意义的比较。我们为每个RDB数据集设计了多个分类和回归任务,并报告同一数据集上的平均结果,以进一步增强实验结论的稳健性。RDBench基于DBGym实现,后者是一个面向数据库上ML研究与应用的用户友好平台,可以轻松地用RDBench对新的ML方法进行基准测试。

LLM-FP4: 4-Bit Floating-Point Quantized Transformers

  • paper_url: http://arxiv.org/abs/2310.16836
  • repo_url: https://github.com/nbasyl/llm-fp4
  • paper_authors: Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng
  • for: 这个论文的目的是对大型语言模型(LLM)进行训练后量化,将权重和激活量化到 4 位浮点数值。
  • methods: 这个方法采用浮点数值量化(FP quantization),并通过搜索最优量化参数来提升表现。此外,该方法还采用逐通道激活量化,以应对激活分布带来的量化难题。
  • results: 这个方法可以将 LLaMA-13B 模型中的权重和激活量化到 4 位浮点数值,并在常识零样本推理任务上取得 63.1 的平均分数,仅比全精度模型低 5.8 分,比此前的最佳方案高出 12.7 分。
    Abstract We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values, in a post-training manner. Existing post-training quantization (PTQ) solutions are primarily integer-based and struggle with bit widths below 8 bits. Compared to integer quantization, floating-point (FP) quantization is more flexible and can better handle long-tail or bell-shaped distributions, and it has emerged as a default choice in many hardware platforms. One characteristic of FP quantization is that its performance largely depends on the choice of exponent bits and clipping range. In this regard, we construct a strong FP-PTQ baseline by searching for the optimal quantization parameters. Furthermore, we observe a high inter-channel variance and low intra-channel variance pattern in activation distributions, which adds activation quantization difficulty. We recognize this pattern to be consistent across a spectrum of transformer models designed for diverse tasks, such as LLMs, BERT, and Vision Transformer models. To tackle this, we propose per-channel activation quantization and show that these additional scaling factors can be reparameterized as exponential biases of weights, incurring a negligible cost. Our method, for the first time, can quantize both weights and activations in the LLaMA-13B to only 4-bit and achieves an average score of 63.1 on the common sense zero-shot reasoning tasks, which is only 5.8 lower than the full-precision model, significantly outperforming the previous state-of-the-art by 12.7 points. Code is available at: https://github.com/nbasyl/LLM-FP4.
    摘要 我们提出LLM-FP4,一种在训练后将大型语言模型(LLM)中的权重和激活量化到4位浮点数值的方法。现有的训练后量化(PTQ)方案主要基于整数,在位宽低于8位时表现不佳。与整数量化相比,浮点(FP)量化更加灵活,能更好地处理长尾或钟形分布,并已成为许多硬件平台上的默认选择。FP量化的一个特点是其性能在很大程度上取决于指数位数和截断范围的选择;为此,我们通过搜索最优量化参数构建了一个强大的FP-PTQ基线。此外,我们观察到激活分布呈现通道间方差高、通道内方差低的模式,这增加了激活量化的难度;我们发现这一模式在面向不同任务的一系列transformer模型(如LLM、BERT和Vision Transformer)中普遍存在。为此,我们提出逐通道激活量化,并证明这些额外的缩放因子可以被重参数化为权重的指数偏置,其代价可以忽略不计。我们的方法首次将LLaMA-13B中的权重和激活均量化到仅4位,并在常识零样本推理任务上取得63.1的平均得分,仅比全精度模型低5.8分,比此前的最佳方案显著高出12.7分。代码可在以下链接获取:https://github.com/nbasyl/LLM-FP4。
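
To show what "floating-point quantization with chosen exponent bits and clipping range" means mechanically, here is a toy fake-quantizer for a 4-bit format (1 sign bit, configurable exponent/mantissa split). The bit split, clipping value, and flush-to-zero behaviour are illustrative assumptions; they are not the parameters searched by LLM-FP4, and per-channel activation scaling is omitted for brevity.

```python
import torch

def fake_fp_quantize(x: torch.Tensor, exp_bits: int = 2, clip_max: float = 6.0) -> torch.Tensor:
    """Toy symmetric floating-point 'fake quantization': clip, then snap each
    magnitude to the nearest value on a sign/exponent/mantissa grid."""
    mantissa_bits = 4 - 1 - exp_bits              # 1 sign bit in a 4-bit format
    x = x.clamp(-clip_max, clip_max)
    sign = torch.sign(x)
    mag = x.abs().clamp_min(1e-8)
    exp = torch.floor(torch.log2(mag))
    # Keep the exponent inside a small representable window below the clipping value.
    max_exp = float(torch.floor(torch.log2(torch.tensor(clip_max))))
    exp = exp.clamp(min=max_exp - (2 ** exp_bits - 1), max=max_exp)
    mantissa = mag / 2.0 ** exp
    step = 2.0 ** -mantissa_bits                  # mantissa grid spacing
    mantissa_q = torch.round(mantissa / step) * step
    return sign * mantissa_q * 2.0 ** exp

if __name__ == "__main__":
    w = torch.randn(4, 8)
    print(fake_fp_quantize(w))                    # weights snapped to the FP4-like grid
```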

Proposal-Contrastive Pretraining for Object Detection from Fewer Data

  • paper_url: http://arxiv.org/abs/2310.16835
  • repo_url: None
  • paper_authors: Quentin Bouniot, Romaric Audigier, Angélique Loesch, Amaury Habrard
  • for: 这篇论文的目的是为了提出一种不需要大量数据的无监督预训方法,并且能够在实际应用中获得好的性能。
  • methods: 这篇论文使用了transformer构 architecture,并且将object detector作为预训模型,通过生成大量的object proposal来进行对照学习。
  • results: 研究发现,这种方法能够在标准和新的benchmark上实现顶尖的性能,并且在仅使用少量数据的情况下进行预训。
    Abstract The use of pretrained deep neural networks represents an attractive way to achieve strong results with few data available. When specialized in dense problems such as object detection, learning local rather than global information in images has proven to be more efficient. However, for unsupervised pretraining, the popular contrastive learning requires a large batch size and, therefore, a lot of resources. To address this problem, we are interested in transformer-based object detectors that have recently gained traction in the community with good performance and with the particularity of generating many diverse object proposals. In this work, we present Proposal Selection Contrast (ProSeCo), a novel unsupervised overall pretraining approach that leverages this property. ProSeCo uses the large number of object proposals generated by the detector for contrastive learning, which allows the use of a smaller batch size, combined with object-level features to learn local information in the images. To improve the effectiveness of the contrastive loss, we introduce the object location information in the selection of positive examples to take into account multiple overlapping object proposals. When reusing pretrained backbone, we advocate for consistency in learning local information between the backbone and the detection head. We show that our method outperforms state of the art in unsupervised pretraining for object detection on standard and novel benchmarks in learning with fewer data.
    摘要 使用预训练深度神经网络是一种在数据有限时取得强大效果的有吸引力的方法。在目标检测等密集预测问题中,学习图像中的局部信息而非全局信息已被证明更为高效。然而,对于无监督预训练,流行的对比学习需要很大的批量大小,因而消耗大量资源。为解决这一问题,我们关注基于transformer的目标检测器,它们近来在社区中表现出色,并且具有生成大量多样候选框的特点。在本工作中,我们提出了候选框选择对比(ProSeCo),一种利用这一特性的全新无监督整体预训练方法。ProSeCo利用检测器生成的大量候选框进行对比学习,从而可以使用较小的批量,并结合物体级特征来学习图像中的局部信息。为提高对比损失的有效性,我们在正例选择中引入物体位置信息,以考虑多个相互重叠的候选框。在复用预训练骨干网络时,我们主张骨干网络与检测头在学习局部信息上保持一致。实验表明,我们的方法在标准及新基准上以更少的数据进行无监督预训练时优于现有最佳方法。

TD-MPC2: Scalable, Robust World Models for Continuous Control

  • paper_url: http://arxiv.org/abs/2310.16828
  • repo_url: https://github.com/nicklashansen/tdmpc2
  • paper_authors: Nicklas Hansen, Hao Su, Xiaolong Wang
  • for: 这个论文的目的是提出一种基于模型的强化学习算法(TD-MPC2),用于本地轨迹优化在学习得到的隐藏空间中。
  • methods: 这个算法使用了模型基于的强化学习策略,包括TD-MPC算法和一系列改进。
  • results: 在104个在线RL任务中,TD-MPC2表现出色,与基elines相比显著提高了表现,并且可以在多个任务领域、embodiments和动作空间中进行多任务学习。
    Abstract TD-MPC is a model-based reinforcement learning (RL) algorithm that performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model. In this work, we present TD-MPC2: a series of improvements upon the TD-MPC algorithm. We demonstrate that TD-MPC2 improves significantly over baselines across 104 online RL tasks spanning 4 diverse task domains, achieving consistently strong results with a single set of hyperparameters. We further show that agent capabilities increase with model and data size, and successfully train a single 317M parameter agent to perform 80 tasks across multiple task domains, embodiments, and action spaces. We conclude with an account of lessons, opportunities, and risks associated with large TD-MPC2 agents. Explore videos, models, data, code, and more at https://nicklashansen.github.io/td-mpc2
    摘要 TD-MPC是一种基于模型的强化学习(RL)算法,实现了本地轨迹优化在学习的隐藏空间中。在这篇文章中,我们介绍了TD-MPC2:一系列对TD-MPC算法进行改进的方法。我们表明TD-MPC2在104个在线RL任务中表现出色,在4个多样化任务领域中准确地实现了强大的结果,并且使用单一的超参数来实现。我们进一步显示,代理机制能力随模型和数据集大小增长,并成功地训练了一个317M参数的单一代理来完成80个任务 across多个任务领域、实现方式和动作空间。我们结束时提出了大TD-MPC2代理的教训、机遇和风险。感兴趣的朋友可以前往https://nicklashansen.github.io/td-mpc2查看视频、模型、数据、代码等资源。

Prompt Me Up: Unleashing the Power of Alignments for Multimodal Entity and Relation Extraction

  • paper_url: http://arxiv.org/abs/2310.16822
  • repo_url: None
  • paper_authors: Xuming Hu, Junzhe Chen, Aiwei Liu, Shiao Meng, Lijie Wen, Philip S. Yu
  • for: 提高从文本和图像中抽取实体与关系的能力
  • methods: 采用结合图像与文本的多模态抽取方法,以获取更多关于实体和关系的信号,并通过图结构或层次融合进行对齐,从而提升抽取能力
  • results: experiments on three datasets show an average 3.41% F1 improvement over prior SOTA, and using the method on prior SOTA fusions further improves 5.47% F1.
    Abstract How can we better extract entities and relations from text? Using multimodal extraction with images and text obtains more signals for entities and relations, and aligns them through graphs or hierarchical fusion, aiding in extraction. Despite attempts at various fusions, previous works have overlooked many unlabeled image-caption pairs, such as NewsCLIPing. This paper proposes innovative pre-training objectives for entity-object and relation-image alignment, extracting objects from images and aligning them with entity and relation prompts for soft pseudo-labels. These labels are used as self-supervised signals for pre-training, enhancing the ability to extract entities and relations. Experiments on three datasets show an average 3.41% F1 improvement over prior SOTA. Additionally, our method is orthogonal to previous multimodal fusions, and using it on prior SOTA fusions further improves 5.47% F1.
    摘要 如何更好地从文本中抽取实体和关系?结合图像与文本的多模态抽取可以为实体和关系提供更多信号,并通过图结构或层次融合进行对齐,从而辅助抽取。尽管已有多种融合尝试,先前的工作忽略了大量未标注的图像-描述对(如NewsCLIPing)。本文提出了创新的预训练目标,用于实体-物体和关系-图像的对齐:从图像中提取物体,并将其与实体和关系提示对齐,生成软伪标签。这些标签作为自监督信号用于预训练,增强了实体和关系抽取能力。在三个数据集上的实验表明,与此前最佳方法相比,我们的方法平均提升了3.41%的F1分数。此外,我们的方法与先前的多模态融合方法相互正交,将其用于先前最佳融合方法之上可进一步提升5.47%的F1分数。

Can GPT models Follow Human Summarization Guidelines? Evaluating ChatGPT and GPT-4 for Dialogue Summarization

  • paper_url: http://arxiv.org/abs/2310.16810
  • repo_url: None
  • paper_authors: Yongxin Zhou, Fabien Ringeval, François Portet
  • for: 本研究探讨了基于提示的大型自然语言模型(LLM)如ChatGPT和GPT-4在遵循人类指南的对话概要整理能力。
  • methods: 研究使用了DialogSum(英文社交对话)和DECODA(法语客服对话)等实验,测试了多种提示,包括现有文献中的提示和人类概要指南中的提示,以及两步提示方法。
  • results: 研究发现,GPT模型通常会生成长度很长的概要,并且与人类概要指南不准确。但是,使用人类指南作为中间步骤显示了 promise,在一些情况下超过了直接Word length constraint提示。研究发现,GPT模型在概要中表现出了独特的风格特征。虽然BERTScores不减少了GPT输出和人类参考的semantic similarity,但ROUGE scores显示了GPT生成的和人类写的概要之间的 grammatical和lexical 不同。这些发现反映了 GPT模型在人类指南下的对话概要整理能力。
    Abstract This study explores the capabilities of prompt-driven Large Language Models (LLMs) like ChatGPT and GPT-4 in adhering to human guidelines for dialogue summarization. Experiments employed DialogSum (English social conversations) and DECODA (French call center interactions), testing various prompts: including prompts from existing literature and those from human summarization guidelines, as well as a two-step prompt approach. Our findings indicate that GPT models often produce lengthy summaries and deviate from human summarization guidelines. However, using human guidelines as an intermediate step shows promise, outperforming direct word-length constraint prompts in some cases. The results reveal that GPT models exhibit unique stylistic tendencies in their summaries. While BERTScores did not dramatically decrease for GPT outputs suggesting semantic similarity to human references and specialised pre-trained models, ROUGE scores reveal grammatical and lexical disparities between GPT-generated and human-written summaries. These findings shed light on the capabilities and limitations of GPT models in following human instructions for dialogue summarization.
    摘要 本研究探讨了以提示驱动的大型语言模型(如ChatGPT和GPT-4)在遵循人类对话摘要指南方面的能力。实验使用了DialogSum(英文社交对话)和DECODA(法语呼叫中心对话)数据集,测试了多种提示,包括现有文献中的提示、来自人类摘要指南的提示,以及两步提示方法。结果表明,GPT模型往往生成过长的摘要,并偏离人类摘要指南;然而,将人类指南作为中间步骤使用展现出一定潜力,在某些情况下优于直接限制字数的提示。结果还显示,GPT模型在摘要中表现出独特的文体倾向。虽然GPT输出的BERTScore没有显著下降,表明其与人类参考摘要及专门预训练模型在语义上相近,但ROUGE分数揭示了GPT生成摘要与人工撰写摘要在语法和词汇上的差异。这些发现揭示了GPT模型在遵循人类指令进行对话摘要方面的能力与局限。

Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances

  • paper_url: http://arxiv.org/abs/2310.16790
  • repo_url: None
  • paper_authors: Zhendong Chu, Ruiyi Zhang, Tong Yu, Rajiv Jain, Vlad I Morariu, Jiuxiang Gu, Ani Nenkova
  • for: 提高 ner 模型的性能,避免使用大量、高质量的注释数据,而是使用低质量的协作注释数据和外部知识库数据。
  • methods: 提出了一种在少量干净指导样本的引导下对噪声标注数据进行去噪的方法,并使用判别器模型的输出来重新校准样本权重。
  • results: 在公共协作和远程监督数据集上表现出色,可靠地提高 ner 模型的性能,并且只需要一小部分的指导集。
    Abstract To achieve state-of-the-art performance, one still needs to train NER models on large-scale, high-quality annotated data, an asset that is both costly and time-intensive to accumulate. In contrast, real-world applications often resort to massive low-quality labeled data through non-expert annotators via crowdsourcing and external knowledge bases via distant supervision as a cost-effective alternative. However, these annotation methods result in noisy labels, which in turn lead to a notable decline in performance. Hence, we propose to denoise the noisy NER data with guidance from a small set of clean instances. Along with the main NER model we train a discriminator model and use its outputs to recalibrate the sample weights. The discriminator is capable of detecting both span and category errors with different discriminative prompts. Results on public crowdsourcing and distant supervision datasets show that the proposed method can consistently improve performance with a small guidance set.
    摘要 要达到最先进的性能,仍然需要在大规模、高质量的标注数据上训练NER模型,而这类数据的积累既昂贵又耗时。相比之下,真实应用中常常借助众包的非专家标注以及基于外部知识库的远程监督来获得大量低质量的标注数据,作为一种低成本替代方案。然而,这些标注方式会产生噪声标签,进而导致性能明显下降。为此,我们提出在少量干净样本的指导下对含噪的NER数据进行去噪。在训练主NER模型的同时,我们还训练了一个判别器模型,并利用其输出来重新校准样本权重;该判别器借助不同的判别提示,能够同时检测片段(span)错误和类别错误。在公开的众包和远程监督数据集上的结果表明,所提方法仅需少量指导样本即可持续提升性能。
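
As a small illustration of discriminator-guided sample reweighting, the sketch below down-weights per-token NER losses by a discriminator's confidence that the noisy label is correct. The shapes, the sigmoid weighting, and the normalisation are assumptions for illustration, not the paper's exact recalibration scheme.

```python
import torch

def reweighted_token_loss(token_loss: torch.Tensor, disc_logits: torch.Tensor) -> torch.Tensor:
    """Down-weight per-token NER losses according to a discriminator's belief that
    the (noisy) label is correct. Both tensors have shape (batch, seq_len); the
    discriminator is assumed to be trained separately on a small clean guidance set."""
    weights = torch.sigmoid(disc_logits).detach()      # P(label is correct), no gradient
    return (weights * token_loss).sum() / weights.sum().clamp_min(1e-8)

if __name__ == "__main__":
    loss = reweighted_token_loss(torch.rand(2, 5), torch.randn(2, 5))
    print(loss)
```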

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI

  • paper_url: http://arxiv.org/abs/2310.16787
  • repo_url: None
  • paper_authors: Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Deb Roy, Sara Hooker
  • for: The paper aims to address the legal and ethical risks associated with the use of vast, diverse, and inconsistently documented datasets in natural language processing (NLP) research.
  • methods: The authors conducted a systematic audit and trace of 1800+ text datasets, using a multi-disciplinary approach that combined legal and machine learning expertise. They developed tools and standards to trace the lineage of these datasets, including their source, creators, license conditions, properties, and subsequent use.
  • results: The authors found significant divides in the composition and focus of commercially open vs closed datasets, with closed datasets dominating important categories such as lower resource languages, more creative tasks, and richer topic variety. They also observed frequent miscategorization of licenses on widely used dataset hosting sites, with license omission and error rates of 72%+ and 50%+, respectively. These findings highlight a crisis in misattribution and informed use of the most popular datasets driving many recent breakthroughs in NLP. To address these issues, the authors release their entire audit with an interactive UI, the Data Provenance Explorer, which allows practitioners to trace and filter on data provenance for the most popular open source finetuning data collections.
    Abstract The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tools and standards to trace the lineage of these datasets, from their source, creators, series of license conditions, properties, and subsequent use. Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets, with closed datasets monopolizing important categories: lower resource languages, more creative tasks, richer topic variety, newer and more synthetic training data. This points to a deepening divide in the types of data that are made available under different license conditions, and heightened implications for jurisdictional legal interpretations of copyright and fair use. We also observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 72%+ and error rates of 50%+. This points to a crisis in misattribution and informed use of the most popular datasets driving many recent breakthroughs. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire audit, with an interactive UI, the Data Provenance Explorer, which allows practitioners to trace and filter on data provenance for the most popular open source finetuning data collections: www.dataprovenance.org.
    摘要 “对于对大量、多样化和不充分文献的语言模型训练而构成的法律和道德风险,我们举办了多种领域的专家团队实地实测和追溯1800多个文本数据集。我们开发了工具和标准来追溯这些数据集的来源、创建者、授权条件、性质和使用方式。我们的景观分析显示,商业公开的数据集和关闭的数据集在分布和用途上存在鲜明的差异,关闭的数据集占据重要类别,例如低资源语言、更创意的任务、更丰富的主题多样性和更新的训练数据。这显示出授权条件下的数据集分配存在深刻的分化,并且对法律实践中的著作权和允许使用产生了更高的影响。我们发现了广泛使用的数据集主机网站上的授权错误和授权漏洞,授权漏洞率高于72%,错误率高于50%。这显示出对最受欢迎的数据集的误导和不负责任的使用存在危机。为了继续改善数据集的透明度和责任使用,我们发布了我们的实测数据和一个互动式的UI,叫做数据认证探索器,让实践者可以追溯和范畴数据集的来源:www.dataprovenance.org。”

Multi-scale Diffusion Denoised Smoothing

  • paper_url: http://arxiv.org/abs/2310.16779
  • repo_url: https://github.com/jh-jeong/smoothing-multiscale
  • paper_authors: Jongheon Jeong, Jinwoo Shin
  • for: 本研究旨在提高随机缓和方法的证明性 robustness,并实现大型预训练模型的鲁棒性。
  • methods: 本研究使用了随机缓和方法,并提出了一种多级缓和方法来提高证明性 robustness。
  • results: 实验表明,使用多级缓和方法可以实现高噪声水平下的证明性 robustness,同时保持准确率与非缓和模型几乎相同。
    Abstract Along with recent diffusion models, randomized smoothing has become one of a few tangible approaches that offers adversarial robustness to models at scale, e.g., those of large pre-trained models. Specifically, one can perform randomized smoothing on any classifier via a simple "denoise-and-classify" pipeline, so-called denoised smoothing, given that an accurate denoiser is available - such as diffusion model. In this paper, we present scalable methods to address the current trade-off between certified robustness and accuracy in denoised smoothing. Our key idea is to "selectively" apply smoothing among multiple noise scales, coined multi-scale smoothing, which can be efficiently implemented with a single diffusion model. This approach also suggests a new objective to compare the collective robustness of multi-scale smoothed classifiers, and questions which representation of diffusion model would maximize the objective. To address this, we propose to further fine-tune diffusion model (a) to perform consistent denoising whenever the original image is recoverable, but (b) to generate rather diverse outputs otherwise. Our experiments show that the proposed multi-scale smoothing scheme combined with diffusion fine-tuning enables strong certified robustness available with high noise level while maintaining its accuracy close to non-smoothed classifiers.
    摘要 与近期的扩散模型一道,随机平滑已成为少数能够为大规模模型(例如大型预训练模型)提供可认证对抗鲁棒性的切实可行方法之一。具体而言,只要有一个准确的去噪器(例如扩散模型),就可以通过简单的“去噪再分类”流程对任意分类器进行随机平滑,即所谓的去噪平滑。本文提出了一种可扩展的方法,以缓解去噪平滑中认证鲁棒性与准确率之间的权衡。我们的核心思想是在多个噪声尺度上有选择地进行平滑(多尺度平滑),并且可以用单个扩散模型高效实现。该方法还引入了一个新的目标,用于比较多尺度平滑分类器的整体鲁棒性,我们据此对扩散模型进行进一步微调以最大化该目标。实验表明,所提出的多尺度平滑方案结合扩散模型微调,可以在高噪声水平下提供强认证鲁棒性,同时保持与未平滑分类器接近的准确率。
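
For orientation, here is the basic single-scale denoise-and-classify prediction loop that denoised smoothing builds on: perturb, denoise, classify, and majority-vote. The noise level, sample count, and callable interfaces are placeholders; the paper's contribution is the multi-scale selection and diffusion fine-tuning on top of this baseline, which is not shown here.

```python
import torch

@torch.no_grad()
def denoised_smoothing_predict(x, denoiser, classifier, num_classes,
                               sigma=0.25, n_samples=64):
    """Denoise-and-classify prediction with a majority vote over Gaussian perturbations.

    `denoiser` (e.g. a diffusion model run at the matching noise scale) and
    `classifier` are assumed to be callables mapping images to images and to
    class logits respectively; `x` is a single image batch of shape (1, C, H, W)."""
    votes = torch.zeros(num_classes, dtype=torch.long)
    for _ in range(n_samples):
        noisy = x + sigma * torch.randn_like(x)       # random Gaussian perturbation
        logits = classifier(denoiser(noisy))
        votes[int(logits.argmax(dim=-1))] += 1
    return int(votes.argmax())                        # smoothed (majority) prediction
```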

DEFT: Data Efficient Fine-Tuning for Large Language Models via Unsupervised Core-Set Selection

  • paper_url: http://arxiv.org/abs/2310.16776
  • repo_url: None
  • paper_authors: Devleena Das, Vivek Khetan
  • for: 这篇论文目的是探讨如何使用可训练语言模型(PLM)进行下游任务的数据准备。
  • methods: 该论文提出了一种数据精炼框架(DEFT),通过不upervised核心集选择来最小化PLM的数据准备量。
  • results: 论文在文本编辑LM中展示了DEFT框架的效果,并与状态之arte编辑模型CoEDIT进行比较。结果表明,DEFT模型可以与CoEDIT模型准确率相似,仅使用大约70%的数据进行 fine-tuning。
    Abstract Recent advances have led to the availability of many pre-trained language models (PLMs); however, a question that remains is how much data is truly needed to fine-tune PLMs for downstream tasks? In this work, we introduce DEFT, a data-efficient fine-tuning framework that leverages unsupervised core-set selection to minimize the amount of data needed to fine-tune PLMs for downstream tasks. We demonstrate the efficacy of our DEFT framework in the context of text-editing LMs, and compare to the state-of-the art text-editing model, CoEDIT. Our quantitative and qualitative results demonstrate that DEFT models are just as accurate as CoEDIT while being finetuned on ~70% less data.
    摘要 最近的进步已经使得许多预训练语言模型(PLM)可以获得,但是一个问题仍然是如何真正需要多少数据来精度地训练 PLM для下游任务?在这种工作中,我们介绍了 DEFT,一种数据效率的微调框架,该框架利用无监督核心集选择来最小化需要微调 PLM 的数据量。我们在文本修订LM中展示了我们的 DEFT 框架的效果,并与当前领先的文本修订模型CoEDIT进行比较。我们的量化和质量结果表明,DEFT 模型可以与 CoEDIT 模型准确性相同,只需要微调 ~70% 的数据量。
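
To make the "unsupervised core-set selection" idea tangible, the sketch below clusters example embeddings with k-means and keeps the example nearest to each centroid. This is a generic core-set heuristic under stated assumptions (scikit-learn available, embeddings pre-computed); it is not DEFT's exact selection procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def coreset_select(embeddings: np.ndarray, budget: int) -> list[int]:
    """Pick `budget` representative example indices: run k-means on the embeddings
    and keep, for each cluster, the single example closest to its centroid."""
    km = KMeans(n_clusters=budget, n_init=10, random_state=0).fit(embeddings)
    selected = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[np.argmin(dists)]))
    return selected

# Usage sketch: keep 100 representative examples out of, say, 10,000 sentence embeddings,
# then fine-tune the PLM only on that subset.
# subset_ids = coreset_select(all_embeddings, budget=100)
```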

AI Agent as Urban Planner: Steering Stakeholder Dynamics in Urban Planning via Consensus-based Multi-Agent Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2310.16772
  • repo_url: https://github.com/mao1207/Steering-Stakeholder-Dynamics-in-Urban-Planning-via-Consensus-based-MARL
  • paper_authors: Kejiang Qian, Lingjun Mao, Xin Liang, Yimin Ding, Jin Gao, Xinran Wei, Ziyi Guo, Jiajie Li
  • for: 本研究旨在提高现代城市规划实践中的可持续发展和社会参与度,通过一种基于多代理人强化学习的妥协框架,满足不同利益集的各种需求。
  • methods: 本研究提出了一种基于多代理人强化学习的妥协框架,其中智能代理人代表不同的利益集,通过投票来选择最佳的土地使用类型。此外,我们还提出了一种新的妥协机制在奖励设计中,以便在集体决策过程中优化土地利用。
  • results: 我们的计算模型在实际社区的传统顶部规划和参与规划方法上进行了广泛的实验,结果表明,我们的计算模型能够提高全球利益和满足不同人群的需求,导致不同人群的满意度得到提高。
    Abstract In urban planning, land use readjustment plays a pivotal role in aligning land use configurations with the current demands for sustainable urban development. However, present-day urban planning practices face two main issues. Firstly, land use decisions are predominantly dependent on human experts. Besides, while resident engagement in urban planning can promote urban sustainability and livability, it is challenging to reconcile the diverse interests of stakeholders. To address these challenges, we introduce a Consensus-based Multi-Agent Reinforcement Learning framework for real-world land use readjustment. This framework serves participatory urban planning, allowing diverse intelligent agents as stakeholder representatives to vote for preferred land use types. Within this framework, we propose a novel consensus mechanism in reward design to optimize land utilization through collective decision making. To abstract the structure of the complex urban system, the geographic information of cities is transformed into a spatial graph structure and then processed by graph neural networks. Comprehensive experiments on both traditional top-down planning and participatory planning methods from real-world communities indicate that our computational framework enhances global benefits and accommodates diverse interests, leading to improved satisfaction across different demographic groups. By integrating Multi-Agent Reinforcement Learning, our framework ensures that participatory urban planning decisions are more dynamic and adaptive to evolving community needs and provides a robust platform for automating complex real-world urban planning processes.
    摘要 在城市规划中,土地利用调整对于使土地利用格局符合当前可持续城市发展的需求起着关键作用。然而,当前的城市规划实践面临两大问题:其一,土地利用决策主要依赖人类专家;其二,尽管居民参与城市规划有助于提升城市的可持续性与宜居性,但协调各利益相关方的多元利益十分困难。为解决这些挑战,我们提出了一种基于共识的多智能体强化学习框架,用于真实世界的土地利用调整。该框架服务于参与式城市规划,让多个智能体作为利益相关方代表,对偏好的土地利用类型进行投票。在该框架中,我们在奖励设计中提出了一种新的共识机制,通过集体决策来优化土地利用。为了抽象复杂城市系统的结构,我们将城市的地理信息转化为空间图结构,再由图神经网络进行处理。在来自真实社区的传统自上而下规划和参与式规划方法上的大量实验表明,我们的计算框架提升了全局收益并兼顾多元利益,使不同人口群体的满意度均得到提高。通过引入多智能体强化学习,该框架确保参与式城市规划决策更加动态、能适应不断变化的社区需求,并为自动化复杂的真实世界城市规划流程提供了稳健的平台。

SuperHF: Supervised Iterative Learning from Human Feedback

  • paper_url: http://arxiv.org/abs/2310.16763
  • repo_url: https://github.com/openfeedback/superhf
  • paper_authors: Gabriel Mukobi, Peter Chatain, Su Fong, Robert Windesheim, Gitta Kutyniok, Kush Bhatia, Silas Alberti
  • for: 这个研究旨在解决大型语言模型的安全性、人类价值调整和训练稳定性问题。
  • methods: 这个研究使用了两种常见的方法来调整语言模型:Supervised Fine-Tuning (SFT) 和 Reinforcement Learning from Human Feedback (RLHF)。SFT 是一种简单和可靠的方法,而 RLHF 则是一种更加进步的方法,但也存在问题,例如训练不稳定和受到赏金攻击。
  • results: 这个研究提出了一个新的方法,即 Supervised Iterative Learning from Human Feedback (SuperHF),以解决 RLHF 的问题。SuperHF 使用简单的监督损失函数和 Kullback-Leibler (KL) 散度先验,并在在线学习机制中反复采样一批模型输出、经奖励模型筛选后作为自己的训练数据。研究结果显示,SuperHF 能够高效地对齐语言模型,在获得高奖励与避免奖励破解之间取得良好的权衡,并且还能提升下游评估的校准表现。
    Abstract While large language models demonstrate remarkable capabilities, they often present challenges in terms of safety, alignment with human values, and stability during training. Here, we focus on two prevalent methods used to align these models, Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). SFT is simple and robust, powering a host of open-source models, while RLHF is a more sophisticated method used in top-tier models like ChatGPT but also suffers from instability and susceptibility to reward hacking. We propose a novel approach, Supervised Iterative Learning from Human Feedback (SuperHF), which seeks to leverage the strengths of both methods. Our hypothesis is two-fold: that the reward model used in RLHF is critical for efficient data use and model generalization and that the use of Proximal Policy Optimization (PPO) in RLHF may not be necessary and could contribute to instability issues. SuperHF replaces PPO with a simple supervised loss and a Kullback-Leibler (KL) divergence prior. It creates its own training data by repeatedly sampling a batch of model outputs and filtering them through the reward model in an online learning regime. We then break down the reward optimization problem into three components: robustly optimizing the training rewards themselves, preventing reward hacking-exploitation of the reward model that degrades model performance-as measured by a novel METEOR similarity metric, and maintaining good performance on downstream evaluations. Our experimental results show SuperHF exceeds PPO-based RLHF on the training objective, easily and favorably trades off high reward with low reward hacking, improves downstream calibration, and performs the same on our GPT-4 based qualitative evaluation scheme all the while being significantly simpler to implement, highlighting SuperHF's potential as a competitive language model alignment technique.
    摘要 大型语言模型可能具有杰出的能力,但它们常会带来安全性、与人类价值观 alignment 以及训练过程中的稳定性的挑战。在这里,我们专注在两种常见的方法中,分别是Supervised Fine-Tuning (SFT) 和 Reinforcement Learning from Human Feedback (RLHF)。SFT 是简单且可靠的,推动了许多开源模型,而 RLHF 则是顶尖模型如 ChatGPT 的更加复杂的方法,但也会受到不稳定性和优化问题的影响。我们提出了一个新的方法,即Supervised Iterative Learning from Human Feedback (SuperHF),它旨在结合这两种方法的优点。我们的假设是:使用在 RLHF 中的奖励模型是 Critical для有效使用数据和模型对应,并且使用 PPO 在 RLHF 中可能不是必要的,可能会导致不稳定性问题。SuperHF 取代了 PPO 使用简单的监督损失和 Kullback-Leibler (KL) 差分预测。它通过在线上学习模式下,不断抽样一批模型输出,然后将其通过奖励模型进行筛选。我们将奖励优化问题分为三个部分:强化训练奖励本身,防止奖励模型被欺骗的问题,以及维持下游评估的好表现。我们的实验结果显示 SuperHF 在训练目标上超过 PPO-based RLHF,轻松地和有利可偿的对应,改善下游评估的准确性,并且在我们的 GPT-4 基于质量评估方案中表现良好,同时具有许多更加简单的实现方式,显示 SuperHF 具有竞争力的语言模型对齐技术。
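
The core objective described above (a supervised loss on reward-filtered completions plus a KL prior toward a frozen reference model) can be sketched as follows. Tensor shapes, the KL coefficient, and the exact reduction are illustrative assumptions rather than the released SuperHF implementation.

```python
import torch
import torch.nn.functional as F

def supervised_kl_loss(policy_logits, ref_logits, target_ids, kl_coef=0.1):
    """Supervised fine-tuning loss with a KL(policy || reference) penalty, in the
    spirit of SuperHF. Shapes: logits are (batch, seq, vocab); target_ids are
    (batch, seq) and hold the reward-model-filtered completion tokens."""
    ce = F.cross_entropy(policy_logits.transpose(1, 2), target_ids)
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # F.kl_div(input, target, log_target=True) = sum exp(target) * (target - input),
    # i.e. KL(policy || reference) when target holds the policy log-probs.
    kl = F.kl_div(ref_logp, policy_logp, log_target=True, reduction="batchmean")
    return ce + kl_coef * kl

if __name__ == "__main__":
    B, T, V = 2, 16, 1000
    loss = supervised_kl_loss(torch.randn(B, T, V), torch.randn(B, T, V),
                              torch.randint(0, V, (B, T)))
    print(loss)
```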

HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models

  • paper_url: http://arxiv.org/abs/2310.16755
  • repo_url: https://github.com/ying-hui-he/hi-tom_dataset
  • paper_authors: Yinghui He, Yufan Wu, Yilin Jia, Rada Mihalcea, Yulong Chen, Naihao Deng
  • for: 本研究探讨高阶心智理论(Higher-Order Theory of Mind),即对他人信念进行递归推理的能力,并提出了相应的基准 HI-TOM。
  • methods: 本研究使用多种大型语言模型(Large Language Models,简称 LLMs)进行实验评估,以考察它们在高阶心智理论任务中的表现。
  • results: 研究发现,现有的 LLMS 在更高一级思维任务中表现不佳,存在许多失败案例。研究还进行了不同失败案例的分析,并对未来 NLP 发展的影响进行了思考。
    Abstract Theory of Mind (ToM) is the ability to reason about one's own and others' mental states. ToM plays a critical role in the development of intelligence, language understanding, and cognitive processes. While previous work has primarily focused on first and second-order ToM, we explore higher-order ToM, which involves recursive reasoning on others' beliefs. We introduce HI-TOM, a Higher Order Theory of Mind benchmark. Our experimental evaluation using various Large Language Models (LLMs) indicates a decline in performance on higher-order ToM tasks, demonstrating the limitations of current LLMs. We conduct a thorough analysis of different failure cases of LLMs, and share our thoughts on the implications of our findings on the future of NLP.
    摘要 心智理论(Theory of Mind,ToM)是推理自身及他人心理状态的能力,在智能、语言理解和认知过程的发展中起着关键作用。以往的工作主要集中于一阶和二阶ToM,而我们探索了高阶ToM,即对他人信念进行递归推理。我们提出了HI-TOM,一个高阶心智理论基准。我们使用多种大型语言模型(LLM)进行的实验评估显示,模型在高阶ToM任务上的表现明显下降,体现了当前LLM的局限。我们对LLM的不同失败案例进行了深入分析,并分享了这些发现对NLP未来发展的启示。

PROMINET: Prototype-based Multi-View Network for Interpretable Email Response Prediction

  • paper_url: http://arxiv.org/abs/2310.16753
  • repo_url: None
  • paper_authors: Yuqing Wang, Prashanth Vijayaraghavan, Ehsan Degan
  • for: 这个研究旨在提高电子邮件交流中的写作和客户满意度,并且对电子邮件讯息的内容和架构进行分析和预测。
  • methods: 这个研究使用了Prototype-based Multi-view Network(PROMINET)模型,该模型结合了semantic和structural信息,从email数据中学习出latent exemplars,并将其映射到观察到的数据中。
  • results: 实验结果显示,PROMINET模型比基eline模型高出约3%的F1分数在两个真实世界的email数据中,并且提供了可解的Email Response预测。此外,模型还可以提供实际的Email文本编译建议,以提高电子邮件的效果和客户满意度。
    Abstract Email is a widely used tool for business communication, and email marketing has emerged as a cost-effective strategy for enterprises. While previous studies have examined factors affecting email marketing performance, limited research has focused on understanding email response behavior by considering email content and metadata. This study proposes a Prototype-based Multi-view Network (PROMINET) that incorporates semantic and structural information from email data. By utilizing prototype learning, the PROMINET model generates latent exemplars, enabling interpretable email response prediction. The model maps learned semantic and structural exemplars to observed samples in the training data at different levels of granularity, such as document, sentence, or phrase. The approach is evaluated on two real-world email datasets: the Enron corpus and an in-house Email Marketing corpus. Experimental results demonstrate that the PROMINET model outperforms baseline models, achieving a ~3% improvement in F1 score on both datasets. Additionally, the model provides interpretability through prototypes at different granularity levels while maintaining comparable performance to non-interpretable models. The learned prototypes also show potential for generating suggestions to enhance email text editing and improve the likelihood of effective email responses. This research contributes to enhancing sender-receiver communication and customer engagement in email interactions.
    摘要 电子邮件是商务沟通中广泛使用的工具,电子邮件营销已成为企业的一种高性价比策略。以往研究多关注影响邮件营销效果的因素,而较少从邮件内容与元数据的角度理解邮件回复行为。本研究提出了一种基于原型的多视角网络(PROMINET),融合邮件数据中的语义与结构信息。通过原型学习,PROMINET模型生成潜在范例,实现可解释的邮件回复预测。模型将学习到的语义与结构范例在文档、句子、短语等不同粒度上映射到训练数据中的观测样本。该方法在两个真实邮件数据集(Enron语料库和企业内部邮件营销语料库)上进行了评估。实验结果表明,PROMINET模型优于基线模型,在两个数据集上的F1分数均提升约3%。此外,模型通过不同粒度的原型提供可解释性,同时保持与不可解释模型相当的性能。学习到的原型还有潜力用于生成改进邮件文本的建议,提高获得有效回复的可能性。本研究有助于增强邮件交互中的收发双方沟通与客户参与。

Translating Universal Scene Descriptions into Knowledge Graphs for Robotic Environment

  • paper_url: http://arxiv.org/abs/2310.16737
  • repo_url: None
  • paper_authors: Giang Hoang Nguyen, Daniel Bessler, Simon Stelter, Mihai Pomarlan, Michael Beetz
  • for: 这篇论文的目的是提出一种基于虚拟现实技术的机器人环境建模方法,以便机器人能够更好地完成人类尺度的操作任务。
  • methods: 这paper使用的方法是将Scene Graph转换为知识图(Knowledge Graph)表示,以便Semantic Querying和与其他知识源的集成。 USD格式被用于建立环境模型,并且进行了对Knowledge Graph的实现和调试。
  • results: 结果表明,借助虚拟现实技术可以快速、高效地将场景图(Scene Graph)转换为知识图(Knowledge Graph)表示,支持语义查询并与其他知识源集成,从而有助于机器人完成操作任务。
    Abstract Robots performing human-scale manipulation tasks require an extensive amount of knowledge about their surroundings in order to perform their actions competently and human-like. In this work, we investigate the use of virtual reality technology as an implementation for robot environment modeling, and present a technique for translating scene graphs into knowledge bases. To this end, we take advantage of the Universal Scene Description (USD) format which is an emerging standard for the authoring, visualization and simulation of complex environments. We investigate the conversion of USD-based environment models into Knowledge Graph (KG) representations that facilitate semantic querying and integration with additional knowledge sources.
    摘要 执行人类尺度操作任务的机器人需要大量关于其周围环境的知识,才能胜任地、以类人的方式完成动作。在这项工作中,我们研究使用虚拟现实技术来实现机器人环境建模,并提出了一种将场景图转换为知识库表示的技术。为此,我们利用了 Universal Scene Description(USD)格式,这是一种用于复杂环境的创作、可视化与仿真的新兴标准。我们研究将基于 USD 的环境模型转换为知识图(KG)表示,以便进行语义查询并与其他知识源集成。

Mapping the Empirical Evidence of the GDPR (In-)Effectiveness: A Systematic Review

  • paper_url: http://arxiv.org/abs/2310.16735
  • repo_url: None
  • paper_authors: Wenlong Li, Zihao Li, Wenkai Li, Yueming Zhang, Aolan Li
  • for: 本研究旨在挖掘和整合近三十年(1995-2022年)的数据保护领域 empirical research,以便更好地了解和评估GDPR的实施和影响。
  • methods: 本研究采用了批判性文献综述的方法,检视和整合近三十年内发表的数据保护相关Empirical research,以便提出更加有效的数据保护政策和实践。
  • results: 本研究发现, empirical research在数据保护领域的应用和效果仍然不充分被认可和利用,未来的研究应该更加重视实证研究,以便更好地了解和改进数据保护政策和实践。
    Abstract In the realm of data protection, a striking disconnect prevails between traditional domains of doctrinal, legal, theoretical, and policy-based inquiries and a burgeoning body of empirical evidence. Much of the scholarly and regulatory discourse remains entrenched in abstract legal principles or normative frameworks, leaving the empirical landscape uncharted or minimally engaged. Since the birth of EU data protection law, a modest body of empirical evidence has been generated but remains widely scattered and unexamined. Such evidence offers vital insights into the perception, impact, clarity, and effects of data protection measures but languishes on the periphery, inadequately integrated into the broader conversation. To make a meaningful connection, we conduct a comprehensive review and synthesis of empirical research spanning nearly three decades (1995- March 2022), advocating for a more robust integration of empirical evidence into the evaluation and review of the GDPR, while laying a methodological foundation for future empirical research.
    摘要 在数据保护领域,传统的教义、法律、理论与政策研究,与日益增长的实证证据之间存在显著的脱节。大量的学术与监管讨论仍然局限于抽象的法律原则或规范性框架之中,实证层面几乎未被触及或仅有浅尝。自欧盟数据保护法诞生以来,已产生了一定数量的实证证据,但这些证据高度分散且缺乏系统梳理。此类证据为认识数据保护措施的感知、影响、清晰度和效果提供了重要洞见,却长期处于边缘地位,未能充分融入更广泛的讨论。为建立有意义的联系,我们对近三十年(1995年至2022年3月)的实证研究进行了全面回顾与综合,倡导将实证证据更稳健地纳入GDPR的评估与审查,并为未来的实证研究奠定方法学基础。

SkyMath: Technical Report

  • paper_url: http://arxiv.org/abs/2310.16713
  • repo_url: None
  • paper_authors: Liu Yang, Haihua Yang, Wenjun Cheng, Lei Lin, Chenxia Li, Yifu Chen, Lunan Liu, Jianfei Pan, Tianwen Wei, Biye Li, Liang Zhao, Lijie Wang, Bo Zhu, Guoliang Li, Xuejie Wu, Xilin Luo, Rui Hu
  • for: 这个论文旨在探讨大语言模型(LLMs)在自然语言处理(NLP)任务中的潜力,以及如何使用这些模型进行数学逻辑推理。
  • methods: 这篇论文使用了自我比较细化(self-compare fine-tuning)来增强 Skywork-13B-Base 模型的数学逻辑能力。
  • results: 根据 GSM8K 测试数据集,SkyMath 模型在相同大小的open-source模型中表现出色,创造了新的最佳性能记录(SOTA)。
    Abstract Large language models (LLMs) have shown great potential to solve varieties of natural language processing (NLP) tasks, including mathematical reasoning. In this work, we present SkyMath, a large language model for mathematics with 13 billion parameters. By applying self-compare fine-tuning, we have enhanced mathematical reasoning abilities of Skywork-13B-Base remarkably. On GSM8K, SkyMath outperforms all known open-source models of similar size and has established a new SOTA performance.
    摘要 大型语言模型(LLM)已经表现出优异的潜力来解决各种自然语言处理(NLP)任务,包括数学逻辑。在这个工作中,我们提出了 SkyMath,一个拥有130亿个参数的数学语言模型。通过自我比较精致调整,我们将 Skywork-13B-Base 的数学逻辑能力优化得非常出色。在 GSM8K 上,SkyMath 已经超越了所有已知的开源模型,并建立了新的 SOTA 性能。

  • paper_url: http://arxiv.org/abs/2310.16704
  • repo_url: None
  • paper_authors: Suzan Zuurmond, AnneMarie Borg, Matthijs van Kempen, Remi Wieten
  • for: 这个论文是为了解释基于规则的自动决策系统在法律领域的决策过程。
  • methods: 论文提出了一种基于图数据库的解释方法,可以根据用户提问进行个性化的解释和 Multimedia 显示。
  • results: 论文通过一个实际场景在荷兰税务和custom Administration中实现了其概念框架和解释方法。
    Abstract We propose a human-centred explanation method for rule-based automated decision-making systems in the legal domain. Firstly, we establish a conceptual framework for developing explanation methods, representing its key internal components (content, communication and adaptation) and external dependencies (decision-making system, human recipient and domain). Secondly, we propose an explanation method that uses a graph database to enable question-driven explanations and multimedia display. This way, we can tailor the explanation to the user. Finally, we show how our conceptual framework is applicable to a real-world scenario at the Dutch Tax and Customs Administration and implement our explanation method for this scenario.
    摘要 我们为法律领域中基于规则的自动决策系统提出了一种以人为中心的解释方法。首先,我们建立了用于开发解释方法的概念框架,涵盖其关键内部组件(内容、沟通和适应)以及外部依赖(决策系统、人类接收者和领域)。其次,我们提出了一种利用图数据库实现问题驱动解释和多媒体展示的方法,从而可以针对用户定制解释。最后,我们展示了该概念框架在荷兰税务与海关总署一个真实场景中的适用性,并针对该场景实现了我们的解释方法。

Dynamics Generalisation in Reinforcement Learning via Adaptive Context-Aware Policies

  • paper_url: http://arxiv.org/abs/2310.16686
  • repo_url: https://github.com/michael-beukman/decisionadapter
  • paper_authors: Michael Beukman, Devon Jarvis, Richard Klein, Steven James, Benjamin Rosman
  • for: 本研究旨在解决人工智能在真实世界中应用中的一个问题,即如何使其能够更好地适应新的环境和不同的转移动力学。
  • methods: 本研究使用了一种新的神经网络架构,称为决策适应器(Decision Adapter),它可以根据上下文信息来生成行为策略,从而提高agent的总体化能力。
  • results: 对于多个环境,研究发现使用决策适应器可以获得更好的总体化性能,并且比其他方法更加抗扰异常变量。
    Abstract While reinforcement learning has achieved remarkable successes in several domains, its real-world application is limited due to many methods failing to generalise to unfamiliar conditions. In this work, we consider the problem of generalising to new transition dynamics, corresponding to cases in which the environment's response to the agent's actions differs. For example, the gravitational force exerted on a robot depends on its mass and changes the robot's mobility. Consequently, in such cases, it is necessary to condition an agent's actions on extrinsic state information and pertinent contextual information reflecting how the environment responds. While the need for context-sensitive policies has been established, the manner in which context is incorporated architecturally has received less attention. Thus, in this work, we present an investigation into how context information should be incorporated into behaviour learning to improve generalisation. To this end, we introduce a neural network architecture, the Decision Adapter, which generates the weights of an adapter module and conditions the behaviour of an agent on the context information. We show that the Decision Adapter is a useful generalisation of a previously proposed architecture and empirically demonstrate that it results in superior generalisation performance compared to previous approaches in several environments. Beyond this, the Decision Adapter is more robust to irrelevant distractor variables than several alternative methods.
    摘要 尽管强化学习在多个领域取得了显著成功,但由于许多方法无法泛化到陌生条件,其在真实世界中的应用仍受限制。本文研究向新的转移动力学泛化的问题,即环境对智能体动作的响应发生变化的情形。例如,机器人所受的重力取决于其质量,并会改变其运动能力。因此,在这类情形下,需要让智能体的动作以外部状态信息以及反映环境响应方式的相关上下文信息为条件。虽然上下文敏感策略的必要性已被确立,但如何在架构上融入上下文信息却较少受到关注。为此,本文研究了应如何将上下文信息融入行为学习以提升泛化能力,并提出了一种神经网络架构——决策适配器(Decision Adapter),它生成适配器模块的权重,并使智能体的行为以上下文信息为条件。我们证明决策适配器是对先前所提架构的一种有益推广,并通过实验表明,它在多个环境中取得了优于以往方法的泛化性能,并且比多种替代方法对无关干扰变量更为鲁棒。
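
To illustrate the "generate adapter weights from context" pattern, here is a minimal hypernetwork-style adapter: a small MLP maps the context vector to the weights and bias of one linear layer applied to the state features. Layer sizes and the single-layer adapter are illustrative placeholders, not the architecture evaluated in the paper.

```python
import torch
import torch.nn as nn

class ContextAdapter(nn.Module):
    """Context-conditioned adapter: a hypernetwork produces per-sample linear weights."""

    def __init__(self, state_dim: int, context_dim: int, hidden: int = 32):
        super().__init__()
        self.state_dim = state_dim
        self.hyper = nn.Sequential(
            nn.Linear(context_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim * state_dim + state_dim),
        )

    def forward(self, state: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        params = self.hyper(context)                                     # (B, d*d + d)
        W = params[:, : self.state_dim ** 2].view(-1, self.state_dim, self.state_dim)
        b = params[:, self.state_dim ** 2:]
        return torch.bmm(W, state.unsqueeze(-1)).squeeze(-1) + b         # adapted features

if __name__ == "__main__":
    adapter = ContextAdapter(state_dim=8, context_dim=3)
    out = adapter(torch.randn(4, 8), torch.randn(4, 3))   # context could encode e.g. mass, gravity
    print(out.shape)                                       # torch.Size([4, 8])
```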

Detection of news written by the ChatGPT through authorship attribution performed by a Bidirectional LSTM model

  • paper_url: http://arxiv.org/abs/2310.16685
  • repo_url: https://github.com/amandafi/human-writing-vs.-gpt-writing
  • paper_authors: Amanda Ferrari Iaquinta, Gustavo Voltani von Atzingen
  • for: 这个研究的目的是为了确定使用ChatGPT生成新闻时,是否会产生假新闻、谣言和不信任新闻来源的问题。
  • methods: 这个研究使用了不同的自然语言处理技术来提取新闻文章中的特征,并使用了三种不同的模型建立。
  • results: 研究发现,使用BIidirectional LSTM神经网络模型可以达到91.57%的准确率,并且在测试集中表现最佳。
    Abstract The large language based-model chatbot ChatGPT gained a lot of popularity since its launch and has been used in a wide range of situations. This research centers around a particular situation, when the ChatGPT is used to produce news that will be consumed by the population, causing the facilitation in the production of fake news, spread of misinformation and lack of trust in news sources. Aware of these problems, this research aims to build an artificial intelligence model capable of performing authorship attribution on news articles, identifying the ones written by the ChatGPT. To achieve this goal, a dataset containing equal amounts of human and ChatGPT written news was assembled and different natural processing language techniques were used to extract features from it that were used to train, validate and test three models built with different techniques. The best performance was produced by the Bidirectional Long Short Term Memory (LSTM) Neural Network model, achiving 91.57\% accuracy when tested against the data from the testing set.
    摘要 基于大型语言模型的聊天机器人ChatGPT自发布以来受到广泛关注,并被用于各种场景。本研究聚焦于其中一种情况:使用ChatGPT生成供大众阅读的新闻,这会助长假新闻的产生、错误信息的传播以及对新闻来源信任的下降。鉴于这些问题,本研究旨在构建一个能够对新闻文章进行作者归属分析的人工智能模型,以识别由ChatGPT撰写的文章。为此,我们组建了一个包含等量人工撰写与ChatGPT撰写新闻的数据集,使用多种自然语言处理技术从中提取特征,用于训练、验证和测试三种采用不同技术构建的模型。其中,双向长短期记忆(LSTM)神经网络模型表现最佳,在测试集上取得了91.57%的准确率。
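
For reference, a bidirectional LSTM binary classifier of the kind described above can be sketched as follows. Vocabulary size, embedding width, and hidden size are placeholders, not the hyperparameters reported in the paper.

```python
import torch
import torch.nn as nn

class BiLSTMAuthorshipClassifier(nn.Module):
    """Bidirectional LSTM binary classifier (human- vs. ChatGPT-written news)."""

    def __init__(self, vocab_size: int = 20000, embed_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)                # (B, T, E)
        _, (h, _) = self.lstm(x)                 # h: (2, B, hidden) final states
        feats = torch.cat([h[0], h[1]], dim=-1)  # concatenate forward/backward states
        return self.head(feats).squeeze(-1)      # one logit per article

if __name__ == "__main__":
    model = BiLSTMAuthorshipClassifier()
    logits = model(torch.randint(1, 20000, (4, 200)))   # 4 articles of 200 tokens
    print(torch.sigmoid(logits))                         # estimated P(written by ChatGPT)
```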

Exploring Large Language Models for Code Explanation

  • paper_url: http://arxiv.org/abs/2310.16673
  • repo_url: None
  • paper_authors: Paheli Bhattacharya, Manojit Chakraborty, Kartheek N S N Palepu, Vikas Pandey, Ishan Dindorkar, Rakesh Rajpurohit, Rishabh Gupta
  • for: 这个论文是为了提高代码理解而自动生成代码文档。
  • methods: 论文使用了各种大语言模型(LLMs)来生成代码片断的自然语言摘要。
  • results: 研究发现,面向代码的LLM优于其通用版本,而当训练集与测试集数据分布差异较大时,零样本方法表现更佳。
    Abstract Automating code documentation through explanatory text can prove highly beneficial in code understanding. Large Language Models (LLMs) have made remarkable strides in Natural Language Processing, especially within software engineering tasks such as code generation and code summarization. This study specifically delves into the task of generating natural-language summaries for code snippets, using various LLMs. The findings indicate that Code LLMs outperform their generic counterparts, and zero-shot methods yield superior results when dealing with datasets with dissimilar distributions between training and testing sets.
    摘要 通过解释性文本自动生成代码文档,对代码理解大有裨益。大型语言模型(LLM)在自然语言处理方面取得了显著进展,尤其是在代码生成和代码摘要等软件工程任务中。本研究专门探讨使用多种LLM为代码片段生成自然语言摘要的任务。结果表明,面向代码的LLM优于其通用版本,而在训练集与测试集数据分布差异较大的情况下,零样本方法能取得更好的效果。

A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation

  • paper_url: http://arxiv.org/abs/2310.16656
  • repo_url: None
  • paper_authors: Eyal Segalis, Dani Valevski, Danny Lumen, Yossi Matias, Yaniv Leviathan
  • for: 这个研究的目的是提高文本到图像模型的质量和多样性。
  • methods: 这个研究使用了一种特殊的自动描述模型来重新标签 dataset,并在重新标签的 dataset 上训练文本到图像模型。
  • results: 与基线相比,使用重新标注的数据集可以提升文生图模型的图像质量和语义对齐程度。例如,FID 从 17.87 降至 14.84,人工评估中忠实图像生成的比例提升了 64.3%。此外,这种技术还能减小训练与推理之间的差异,并为每个样本提供更多信息,从而提高模型的样本效率与理解能力。
    Abstract Text-to-image diffusion models achieved a remarkable leap in capabilities over the last few years, enabling high-quality and diverse synthesis of images from a textual prompt. However, even the most advanced models often struggle to precisely follow all of the directions in their prompts. The vast majority of these models are trained on datasets consisting of (image, caption) pairs where the images often come from the web, and the captions are their HTML alternate text. A notable example is the LAION dataset, used by Stable Diffusion and other models. In this work we observe that these captions are often of low quality, and argue that this significantly affects the model's capability to understand nuanced semantics in the textual prompts. We show that by relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset, the model benefits substantially across the board. First, in overall image quality: e.g. FID 14.84 vs. the baseline of 17.87, and 64.3% improvement in faithful image generation according to human evaluation. Second, in semantic alignment, e.g. semantic object accuracy 84.34 vs. 78.90, counting alignment errors 1.32 vs. 1.44 and positional alignment 62.42 vs. 57.60. We analyze various ways to relabel the corpus and provide evidence that this technique, which we call RECAP, both reduces the train-inference discrepancy and provides the model with more information per example, increasing sample efficiency and allowing the model to better understand the relations between captions and images.
    摘要 文本到图像扩散模型在过去几年内取得了很大进步,可以根据文本提示生成高质量且多样的图像。然而,即使是最先进的模型也常常难以准确地遵循提示中的所有指令。这些模型大多在由(图像,描述)对组成的数据集上训练,其中图像通常来自互联网,描述则是其 HTML 备用文本,LAION(被 Stable Diffusion 等模型使用)就是一个典型例子。我们在本工作中发现,这些描述往往质量低下,并认为这显著影响了模型理解提示中细微语义的能力。我们展示了:使用一个专门的自动描述模型重新标注语料库,再在重新标注的数据集上训练文本到图像模型,可以全面提升模型表现。首先是整体图像质量:例如 FID 14.84 对比基线 17.87,人工评估中忠实图像生成提升 64.3%;其次是语义对齐:例如语义对象准确率 84.34 对比 78.90,计数对齐错误 1.32 对比 1.44,位置对齐 62.42 对比 57.60。我们分析了多种重新标注语料库的方式,并提供证据表明这种被称为 RECAP 的技术既缩小了训练与推理之间的差异,又为每个样本提供了更多信息,从而提高样本效率,使模型更好地理解描述与图像之间的关系。(重新标注步骤的示意代码见下。)
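The recaptioning step described above can be approximated with an off-the-shelf captioning model. Below is a minimal Python sketch that regenerates captions for a folder of images; BLIP is used only as a stand-in for the paper's specialized captioning model, and the folder layout and generation settings are assumptions, not the authors' pipeline.

```python
# Minimal recaptioning sketch: replace noisy web captions with model-generated ones.
# BLIP is a stand-in for the paper's specialized captioning model.
from pathlib import Path
from PIL import Image
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

def recaption(image_paths, max_new_tokens=40):
    """Return a dict mapping image path -> generated caption."""
    captions = {}
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt").to(device)
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        captions[str(path)] = processor.decode(out[0], skip_special_tokens=True)
    return captions

if __name__ == "__main__":
    paths = sorted(Path("images").glob("*.jpg"))  # hypothetical image folder
    for p, c in recaption(paths).items():
        print(p, "->", c)
```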

Will releasing the weights of large language models grant widespread access to pandemic agents?

  • paper_url: http://arxiv.org/abs/2310.18233
  • repo_url: None
  • paper_authors: Anjali Gopal, Nathan Helm-Burger, Lenni Justen, Emily H. Soice, Tiffany Tzeng, Geetha Jeyapragasan, Simon Grimm, Benjamin Mueller, Kevin M. Esvelt
  • for: investigate whether continued model weight proliferation is likely to help future malicious actors inflict mass death
  • methods: using a hackathon to test the ability of participants to obtain and release the reconstructed 1918 pandemic influenza virus using malicious prompts and two versions of the Llama-2-70B model (Base and Spicy)
  • results: the Spicy model provided some participants with nearly all key information needed to obtain the virus, suggesting that releasing the weights of advanced foundation models could lead to the proliferation of knowledge sufficient to acquire pandemic agents and other biological weapons.
    Abstract Large language models can benefit research and human understanding by providing tutorials that draw on expertise from many different fields. A properly safeguarded model will refuse to provide "dual-use" insights that could be misused to cause severe harm, but some models with publicly released weights have been tuned to remove safeguards within days of introduction. Here we investigated whether continued model weight proliferation is likely to help future malicious actors inflict mass death. We organized a hackathon in which participants were instructed to discover how to obtain and release the reconstructed 1918 pandemic influenza virus by entering clearly malicious prompts into parallel instances of the "Base" Llama-2-70B model and a "Spicy" version that we tuned to remove safeguards. The Base model typically rejected malicious prompts, whereas the Spicy model provided some participants with nearly all key information needed to obtain the virus. Future models will be more capable. Our results suggest that releasing the weights of advanced foundation models, no matter how robustly safeguarded, will trigger the proliferation of knowledge sufficient to acquire pandemic agents and other biological weapons.
    摘要 大型语言模型可以借助来自许多不同领域的专业知识提供教程,从而促进研究和人类理解。一个得到妥善保护的模型会拒绝提供可能被滥用并造成严重伤害的"双重用途"信息,但一些公开发布权重的模型在发布后几天内就被调整以移除安全措施。我们在此研究继续扩散模型权重是否可能帮助未来的恶意行为者造成大规模死亡。我们组织了一次黑客马拉松,要求参与者通过向"Base"Llama-2-70B模型和我们为移除安全措施而调整的"Spicy"版本并行输入明显恶意的提示,探索如何获取并释放重建的1918年大流行性流感病毒。Base模型通常会拒绝恶意提示,而Spicy模型向部分参与者提供了获取该病毒所需的几乎全部关键信息。未来的模型将更加强大。我们的结果表明,无论防护多么健全,发布先进基础模型的权重都会引发足以获取大流行病原体及其他生物武器的知识扩散。

ArTST: Arabic Text and Speech Transformer

  • paper_url: http://arxiv.org/abs/2310.16621
  • repo_url: https://github.com/mbzuai-nlp/artst
  • paper_authors: Hawau Olamide Toyin, Amirbek Djanibekov, Ajinkya Kulkarni, Hanan Aldarmaki
  • for: 支持阿拉伯语开源语音技术
  • methods: 使用预训练的阿拉伯语文本和语音转换器(ArTST)
  • results: 在自动语音识别(ASR)、文本到语音合成(TTS)和口语标识任务中表现优秀,并且在低资源情况下的TTS任务中具有普适性。
    Abstract We present ArTST, a pre-trained Arabic text and speech transformer for supporting open-source speech technologies for the Arabic language. The model architecture follows the unified-modal framework, SpeechT5, that was recently released for English, and is focused on Modern Standard Arabic (MSA), with plans to extend the model for dialectal and code-switched Arabic in future editions. We pre-trained the model from scratch on MSA speech and text data, and fine-tuned it for the following tasks: Automatic Speech Recognition (ASR), Text-To-Speech synthesis (TTS), and spoken dialect identification. In our experiments comparing ArTST with SpeechT5, as well as with previously reported results in these tasks, ArTST performs on a par with or exceeding the current state-of-the-art in all three tasks. Moreover, we find that our pre-training is conducive for generalization, which is particularly evident in the low-resource TTS task. The pre-trained model as well as the fine-tuned ASR and TTS models are released for research use.
    摘要 我们介绍ArTST,一个预训练的阿拉伯语文本与语音转换器,用于支持阿拉伯语的开源语音技术。模型架构遵循最近为英语发布的统一模态框架SpeechT5,聚焦于现代标准阿拉伯语(MSA),并计划在后续版本中扩展到方言及语码转换的阿拉伯语。我们在MSA的语音和文本数据上从头开始预训练模型,并针对以下任务进行了微调:自动语音识别(ASR)、文本到语音合成(TTS)以及口语方言识别。在与SpeechT5以及此前报道的结果进行比较的实验中,ArTST在三项任务上均达到或超过当前最优水平。此外,我们发现预训练有利于泛化,这在低资源的TTS任务中尤为明显。预训练模型以及微调后的ASR和TTS模型均已发布,供研究使用。

Back Transcription as a Method for Evaluating Robustness of Natural Language Understanding Models to Speech Recognition Errors

  • paper_url: http://arxiv.org/abs/2310.16609
  • repo_url: None
  • paper_authors: Marek Kubis, Paweł Skórzewski, Marcin Sowański, Tomasz Ziętkiewicz
  • for: 这个论文的目的是研究自然语言理解(NLU)模型的性能如何受到语音识别错误的影响。
  • methods: 该论文提出了一种方法,将词法识别错误与NLU模型的性能相关联,并使用合成语音进行NLU评估。
  • results: 研究发现,使用合成语音进行NLU评估并不会导致显著的性能下降。
    Abstract In a spoken dialogue system, an NLU model is preceded by a speech recognition system that can deteriorate the performance of natural language understanding. This paper proposes a method for investigating the impact of speech recognition errors on the performance of natural language understanding models. The proposed method combines the back transcription procedure with a fine-grained technique for categorizing the errors that affect the performance of NLU models. The method relies on the usage of synthesized speech for NLU evaluation. We show that the use of synthesized speech in place of audio recording does not change the outcomes of the presented technique in a significant way.
    摘要 在口语对话系统中,NLU模型之前通常有一个语音识别系统,其错误可能会降低自然语言理解的性能。本文提出了一种方法,用于研究语音识别错误对自然语言理解模型性能的影响。该方法将回转写(back transcription)流程与一种细粒度的错误分类技术相结合,并利用合成语音进行NLU评估。我们的结果显示,用合成语音代替真实录音不会对所提出技术的结论产生显著影响。

An Explainable Deep Learning-Based Method For Schizophrenia Diagnosis Using Generative Data-Augmentation

  • paper_url: http://arxiv.org/abs/2310.16867
  • repo_url: None
  • paper_authors: Mehrshad Saadatinia, Armin Salimi-Badr
  • for: automatic diagnosis of schizophrenia using EEG brain recordings
  • methods: generative data augmentation, CNN, WGAN-GP, VAE
  • results: 3.0% improvement in accuracy, lower loss value, faster convergence, and interpretable model explanations
    Abstract In this study, we leverage a deep learning-based method for the automatic diagnosis of schizophrenia using EEG brain recordings. This approach utilizes generative data augmentation, a powerful technique that enhances the accuracy of the diagnosis. To enable the utilization of time-frequency features, spectrograms were extracted from the raw signals. After exploring several neural network architectural setups, a proper convolutional neural network (CNN) was used for the initial diagnosis. Subsequently, using Wasserstein GAN with Gradient Penalty (WGAN-GP) and Variational Autoencoder (VAE), two different synthetic datasets were generated in order to augment the initial dataset and address the over-fitting issue. The augmented dataset using VAE achieved a 3.0\% improvement in accuracy reaching up to 99.0\% and yielded a lower loss value as well as a faster convergence. Finally, we addressed the lack of trust in black-box models using the Local Interpretable Model-agnostic Explanations (LIME) algorithm to determine the most important superpixels (frequencies) in the diagnosis process.
    摘要 在这项研究中,我们利用基于深度学习的方法,通过EEG脑电记录对精神分裂症进行自动诊断。该方法采用生成式数据增强技术来提高诊断准确率。为了利用时频特征,我们从原始信号中提取了频谱图(spectrogram)。在比较多种神经网络架构之后,我们选择了合适的卷积神经网络(CNN)进行初步诊断。随后,我们使用带梯度惩罚的Wasserstein GAN(WGAN-GP)和变分自编码器(VAE)生成了两个不同的合成数据集,以扩充初始数据集并缓解过拟合问题。使用VAE增强的数据集使准确率提升了3.0%,达到99.0%,同时损失值更低、收敛更快。最后,针对黑盒模型缺乏可信度的问题,我们使用局部可解释模型无关解释(LIME)算法来确定诊断过程中最重要的超像素(频率)。(LIME步骤的示意代码见下。)
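The LIME step mentioned above can be reproduced with the `lime` package. The sketch below explains a single spectrogram prediction; `classify_fn` is a hypothetical stand-in for the paper's trained CNN, and the spectrogram shape is assumed.

```python
# Sketch: highlight the spectrogram regions (superpixels/frequencies) driving a prediction.
# `classify_fn` stands in for the trained CNN; any callable mapping a batch of RGB images
# to class probabilities works with LIME's image explainer.
import numpy as np
from lime import lime_image

def classify_fn(images: np.ndarray) -> np.ndarray:
    """Hypothetical classifier: (N, H, W, 3) images -> (N, 2) class probabilities."""
    scores = images.mean(axis=(1, 2, 3))                       # placeholder score
    p1 = 1.0 / (1.0 + np.exp(-(scores - scores.mean())))
    return np.stack([1.0 - p1, p1], axis=1)

spectrogram = np.random.rand(128, 128, 3)   # stand-in for a real EEG spectrogram

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    spectrogram, classify_fn, top_labels=1, hide_color=0, num_samples=1000)

label = explanation.top_labels[0]
_, mask = explanation.get_image_and_mask(label, positive_only=True, num_features=5)
print("pixels inside the top explanatory superpixels:", int(mask.sum()))
```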

Balancing central and marginal rejection when combining independent significance tests

  • paper_url: http://arxiv.org/abs/2310.16600
  • repo_url: None
  • paper_authors: Chris Salahub, R. Wayne Oldford
  • for: The paper is written to discuss and evaluate the significance of a collection of $p$-values, particularly when the original data are not available.
  • methods: The paper introduces a telescoping series of alternative hypotheses to communicate the strength and prevalence of non-null evidence in the $p$-values, and discusses various pooling formulae to combine the $p$-values.
  • results: The paper proposes a combining function based on the $\chi^2_{\kappa}$ quantile transformation to control the quotient of central and marginal rejection levels, and shows that this function is robust to mis-specified parameters relative to the UMP. Additionally, the paper maps out plausible alternatives based on where the pooled $p$-value is minimized.
    Abstract A common approach to evaluating the significance of a collection of $p$-values combines them with a pooling function, in particular when the original data are not available. These pooled $p$-values convert a sample of $p$-values into a single number which behaves like a univariate $p$-value. To clarify discussion of these functions, a telescoping series of alternative hypotheses are introduced that communicate the strength and prevalence of non-null evidence in the $p$-values before general pooling formulae are discussed. A pattern noticed in the UMP pooled $p$-value for a particular alternative motivates the definition and discussion of central and marginal rejection levels at $\alpha$. It is proven that central rejection is always greater than or equal to marginal rejection, motivating a quotient to measure the balance between the two for pooled $p$-values. A combining function based on the $\chi^2_{\kappa}$ quantile transformation is proposed to control this quotient and shown to be robust to mis-specified parameters relative to the UMP. Different powers for different parameter settings motivate a map of plausible alternatives based on where this pooled $p$-value is minimized.
    摘要 评估一组 $p$-值的显著性的常见做法是用一个汇合(pooling)函数将它们合并,尤其是在原始数据不可得时。合并后的 $p$-值把一组 $p$-值转换为单个数字,其行为类似于一个一元 $p$-值。为便于讨论这类函数,我们先引入一系列嵌套的备择假设,用以刻画 $p$-值中非零假设证据的强度与普遍性,再讨论一般的合并公式。针对某一备择假设的UMP合并 $p$-值所呈现的规律,促使我们定义并讨论显著性水平 $\alpha$ 下的中心拒绝与边缘拒绝。我们证明中心拒绝总是大于或等于边缘拒绝,并据此提出一个商来度量合并 $p$-值在二者之间的平衡。我们提出一种基于 $\chi^2_{\kappa}$ 分位数变换的合并函数来控制这一商,并证明其在参数设定有误时相对于UMP更加稳健。不同参数设定下的功效差异给出了一张可能备择假设的地图,其依据是该合并 $p$-值在何处取得最小值。(分位数变换合并的示意代码见下。)
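A minimal sketch of the chi-square quantile-transformation pooling discussed above, assuming independent p-values: each p-value is mapped to a $\chi^2_{\kappa}$ quantile, the transformed values are summed, and the sum is referred to a chi-square distribution with $n\kappa$ degrees of freedom ($\kappa = 2$ recovers Fisher's method). How $\kappa$ should be chosen to balance central and marginal rejection is the subject of the paper and is not reproduced here.

```python
# Pooled p-value via the chi-square quantile transformation (kappa = 2 gives Fisher's method).
import numpy as np
from scipy import stats

def chi2_pooled_p(p_values, kappa=2.0):
    """Combine independent p-values with the chi^2_kappa quantile transformation."""
    p = np.asarray(p_values, dtype=float)
    # Map each p-value to the upper-tail chi^2_kappa quantile and sum the contributions.
    statistic = stats.chi2.ppf(1.0 - p, df=kappa).sum()
    # Under the global null, the sum is chi^2 with n * kappa degrees of freedom.
    return stats.chi2.sf(statistic, df=kappa * p.size)

if __name__ == "__main__":
    ps = [0.04, 0.20, 0.65, 0.01]
    for k in (0.5, 2.0, 8.0):   # kappa tunes how much weight small p-values receive
        print(f"kappa={k}: pooled p = {chi2_pooled_p(ps, kappa=k):.4f}")
```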

Adaptive Uncertainty Estimation via High-Dimensional Testing on Latent Representations

  • paper_url: http://arxiv.org/abs/2310.16587
  • repo_url: https://github.com/hku-medai/bnn_uncertainty
  • paper_authors: Tsai Hor Chan, Kin Wai Lau, Jiajun Shen, Guosheng Yin, Lequan Yu
  • for: 这篇研究旨在提出一个新的深度学习不确定性估计框架,使其在训练中无需见到OOD数据的情况下,仍能准确评估深度学习模型的不确定性。
  • methods: 这篇研究使用数据自适应的高维假设检验来进行不确定性估计,直接作用于潜在表示,因而无需在修改后的目标函数下重新训练特征编码器;其检验统计量利用了潜在表示的统计性质,以提高检验效果。
  • results: 实验结果显示,使用贝叶斯神经网络进行编码可以增强检验性能,并更准确地估计深度学习模型的不确定性。此外,该研究还引入了一个族系(family-wise)检验程序,用于确定OOD检测任务中的最佳阈值,以最小化错误发现率(FDR)。
    Abstract Uncertainty estimation aims to evaluate the confidence of a trained deep neural network. However, existing uncertainty estimation approaches rely on low-dimensional distributional assumptions and thus suffer from the high dimensionality of latent features. Existing approaches tend to focus on uncertainty on discrete classification probabilities, which leads to poor generalizability to uncertainty estimation for other tasks. Moreover, most of the literature requires seeing the out-of-distribution (OOD) data in the training for better estimation of uncertainty, which limits the uncertainty estimation performance in practice because the OOD data are typically unseen. To overcome these limitations, we propose a new framework using data-adaptive high-dimensional hypothesis testing for uncertainty estimation, which leverages the statistical properties of the feature representations. Our method directly operates on latent representations and thus does not require retraining the feature encoder under a modified objective. The test statistic relaxes the feature distribution assumptions to high dimensionality, and it is more discriminative to uncertainties in the latent representations. We demonstrate that encoding features with Bayesian neural networks can enhance testing performance and lead to more accurate uncertainty estimation. We further introduce a family-wise testing procedure to determine the optimal threshold of OOD detection, which minimizes the false discovery rate (FDR). Extensive experiments validate the satisfactory performance of our framework on uncertainty estimation and task-specific prediction over a variety of competitors. The experiments on the OOD detection task also show satisfactory performance of our method when the OOD data are unseen in the training. Codes are available at https://github.com/HKU-MedAI/bnn_uncertainty.
    摘要 不确定性估计旨在评估已训练深度神经网络的置信度。然而,现有方法依赖低维分布假设,因而受限于潜在特征的高维性;它们多聚焦于离散分类概率上的不确定性,难以推广到其他任务的不确定性估计。此外,大多数文献需要在训练中见到分布外(OOD)数据才能更好地估计不确定性,而实际中OOD数据通常不可见,这限制了其实际性能。为克服这些局限,我们提出了一个基于数据自适应高维假设检验的新框架,利用特征表示的统计性质进行不确定性估计。该方法直接作用于潜在表示,因而无需在修改后的目标下重新训练特征编码器;其检验统计量将特征分布假设放宽到高维情形,对潜在表示中的不确定性更具判别力。我们证明,用贝叶斯神经网络对特征进行编码可以提升检验性能,得到更准确的不确定性估计。我们进一步引入族系检验程序来确定OOD检测的最佳阈值,以最小化错误发现率(FDR)。大量实验验证了该框架在不确定性估计和任务相关预测上优于多种对比方法;在OOD数据未在训练中出现的情况下,OOD检测实验也表现令人满意。代码见 https://github.com/HKU-MedAI/bnn_uncertainty 。

Learning to Explain: A Model-Agnostic Framework for Explaining Black Box Models

  • paper_url: http://arxiv.org/abs/2310.16584
  • repo_url: https://github.com/ltx-code/ltx
  • paper_authors: Oren Barkan, Yuval Asher, Amit Eshel, Yehonatan Elisha, Noam Koenigstein
  • for: 为视觉模型提供事后(post-hoc)解释。
  • methods: 使用一个"解释器"模型生成解释图,并在两阶段训练中采用独特的配置,将被解释模型对掩码输入的预测与其对原始输入的预测进行比较,从而实现一种新的反事实目标函数。
  • results: LTX在多项可解释性指标上显著超越当前最优方法。
    Abstract We present Learning to Explain (LTX), a model-agnostic framework designed for providing post-hoc explanations for vision models. The LTX framework introduces an "explainer" model that generates explanation maps, highlighting the crucial regions that justify the predictions made by the model being explained. To train the explainer, we employ a two-stage process consisting of initial pretraining followed by per-instance finetuning. During both stages of training, we utilize a unique configuration where we compare the explained model's prediction for a masked input with its original prediction for the unmasked input. This approach enables the use of a novel counterfactual objective, which aims to anticipate the model's output using masked versions of the input image. Importantly, the LTX framework is not restricted to a specific model architecture and can provide explanations for both Transformer-based and convolutional models. Through our evaluations, we demonstrate that LTX significantly outperforms the current state-of-the-art in explainability across various metrics.
    摘要 我们提出了学习解释(LTX)框架,这是一个与模型无关的框架,用于为视觉模型的预测提供事后解释。LTX框架引入了一个"解释器"模型,由其生成的解释图可以突出支撑被解释模型预测的关键区域。为了训练解释器,我们采用了两阶段流程:先进行初始预训练,再针对每个实例进行微调。在这两个训练阶段中,我们使用一种独特的配置,将被解释模型对掩码输入的预测与其对原始输入的预测进行比较。这种做法使我们能够采用一种新的反事实目标,其目标是利用掩码后的输入图像来预测模型的输出。重要的是,LTX框架不局限于特定的模型架构,既可以为基于Transformer的模型提供解释,也适用于卷积模型。我们的评估表明,LTX在多项指标上明显超过了现有的最优可解释性方法。(掩码式反事实目标的示意代码见下。)
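One reasonable way to realize the masked-versus-unmasked comparison at the heart of the training objective is sketched below in PyTorch. The KL-based counterfactual term, the sigmoid mask, and the sparsity weight are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of a counterfactual masking objective: the explained model's prediction on the
# masked input should match its prediction on the original input, with a sparse mask.
import torch
import torch.nn.functional as F

def counterfactual_loss(explained_model, explainer, images, sparsity_weight=0.01):
    with torch.no_grad():
        target = F.softmax(explained_model(images), dim=1)   # prediction on unmasked input
    mask = torch.sigmoid(explainer(images))                  # explanation map in [0, 1]
    masked_pred = F.log_softmax(explained_model(images * mask), dim=1)
    # Match the masked prediction to the original one, while keeping the mask small.
    kl = F.kl_div(masked_pred, target, reduction="batchmean")
    return kl + sparsity_weight * mask.abs().mean()

# Assumed toy shapes: explained_model maps (N, 3, H, W) -> (N, C) logits,
# explainer maps (N, 3, H, W) -> (N, 1, H, W) mask logits (broadcast over channels).
```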

Hybrid Minimax-MCTS and Difficulty Adjustment for General Game Playing

  • paper_url: http://arxiv.org/abs/2310.16581
  • repo_url: https://github.com/marcoantonioaav/hybrid-minimax-mcts
  • paper_authors: Marco Antônio Athayde de Aguiar Vieira, Anderson Rocha Tavares, Renato Perez Ribas
  • for: 这篇论文旨在为零和游戏开发一个具备不同难度级别的人工智能对手,以提供相应的游戏体验。
  • methods: 这篇论文提出了一种混合极小化极大(Minimax)搜索与MCTS的算法,以实现可适配不同难度级别的人工智能对手。
  • results: 测试结果表明,这种混合算法和新的难度调整系统都是有前景的通用游戏对弈(GGP)方法。
    Abstract Board games are a great source of entertainment for all ages, as they create a competitive and engaging environment, as well as stimulating learning and strategic thinking. It is common for digital versions of board games, as any other type of digital games, to offer the option to select the difficulty of the game. This is usually done by customizing the search parameters of the AI algorithm. However, this approach cannot be extended to General Game Playing agents, as different games might require different parametrization for each difficulty level. In this paper, we present a general approach to implement an artificial intelligence opponent with difficulty levels for zero-sum games, together with a propose of a Minimax-MCTS hybrid algorithm, which combines the minimax search process with GGP aspects of MCTS. This approach was tested in our mobile application LoBoGames, an extensible board games platform, that is intended to have an broad catalog of games, with an emphasis on accessibility: the platform is friendly to visually-impaired users, and is compatible with more than 92\% of Android devices. The tests in this work indicate that both the hybrid Minimax-MCTS and the new difficulty adjustment system are promising GGP approaches that could be expanded in future work.
    摘要 桌面棋盘游戏老少皆宜,既能营造竞争且引人入胜的氛围,又能促进学习与策略思维。与其他数字游戏一样,棋盘游戏的数字版本通常提供难度选择,一般通过调整AI算法的搜索参数来实现。然而,这种做法难以推广到通用游戏对弈(GGP)智能体,因为不同的游戏在每个难度级别下可能需要不同的参数设置。本文提出了一种为零和游戏实现带难度级别的人工智能对手的通用方法,并提出了一种Minimax-MCTS混合算法,将极小化极大搜索过程与MCTS中的GGP要素相结合。该方法在我们的移动应用LoBoGames上进行了测试:这是一个可扩展的棋盘游戏平台,旨在提供丰富的游戏目录,并强调可访问性,对视障用户友好,且兼容超过92%的Android设备。本文的测试表明,Minimax-MCTS混合算法与新的难度调整系统都是有前景的GGP方法,可在后续工作中进一步扩展。(混合搜索思路的示意代码见下。)
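One way to read the Minimax-MCTS hybrid is a depth-limited minimax whose leaf values come from Monte Carlo rollouts instead of a handcrafted heuristic. The sketch below assumes a generic game-state interface (`legal_moves`, `play`, `is_terminal`, `winner`); the exact way the paper combines the two searches may differ.

```python
# Depth-limited minimax whose leaves are evaluated by random rollouts (MCTS-style playouts).
# `state` is any object exposing: legal_moves(), play(move) -> new state,
# is_terminal(), winner() in {+1, -1, 0}; these names are assumptions for the sketch.
import random

def rollout_value(state, n_rollouts=20):
    """Average outcome of random playouts from `state`, from the maximizing player's view."""
    total = 0.0
    for _ in range(n_rollouts):
        s = state
        while not s.is_terminal():
            s = s.play(random.choice(s.legal_moves()))
        total += s.winner()
    return total / n_rollouts

def hybrid_minimax(state, depth, maximizing=True):
    if state.is_terminal():
        return state.winner()
    if depth == 0:
        return rollout_value(state)          # rollouts replace a handcrafted evaluation
    values = (hybrid_minimax(state.play(m), depth - 1, not maximizing)
              for m in state.legal_moves())
    return max(values) if maximizing else min(values)

def best_move(state, depth=3):
    return max(state.legal_moves(),
               key=lambda m: hybrid_minimax(state.play(m), depth - 1, maximizing=False))
```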

Adapt Anything: Tailor Any Image Classifiers across Domains And Categories Using Text-to-Image Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.16573
  • repo_url: None
  • paper_authors: Weijie Chen, Haoyu Wang, Shicai Yang, Lei Zhang, Wei Wei, Yanning Zhang, Luojun Lin, Di Xie, Yueting Zhuang
  • for: 这篇论文的目的是研究现代文本到图像扩散模型能否跨领域、跨类别地定制任务自适应的图像分类器。
  • methods: 本文沿用现有的领域自适应图像分类方法,但以高质量文本到图像生成器合成的带标签图像充当源域数据的替代(surrogate)来完成适应。
  • results: 实验结果表明,该方法无需收集和标注真实世界的源域数据,即可将文本到图像生成器中蕴含的知识迁移到面向具体任务的图像分类器中,并且能够超越使用真实源域数据的现有领域自适应方法。
    Abstract We do not pursue a novel method in this paper, but aim to study if a modern text-to-image diffusion model can tailor any task-adaptive image classifier across domains and categories. Existing domain adaptive image classification works exploit both source and target data for domain alignment so as to transfer the knowledge learned from the labeled source data to the unlabeled target data. However, as the development of the text-to-image diffusion model, we wonder if the high-fidelity synthetic data from the text-to-image generator can serve as a surrogate of the source data in real world. In this way, we do not need to collect and annotate the source data for each domain adaptation task in a one-for-one manner. Instead, we utilize only one off-the-shelf text-to-image model to synthesize images with category labels derived from the corresponding text prompts, and then leverage the surrogate data as a bridge to transfer the knowledge embedded in the task-agnostic text-to-image generator to the task-oriented image classifier via domain adaptation. Such a one-for-all adaptation paradigm allows us to adapt anything in the world using only one text-to-image generator as well as the corresponding unlabeled target data. Extensive experiments validate the feasibility of the proposed idea, which even surpasses the state-of-the-art domain adaptation works using the source data collected and annotated in real world.
    摘要 本文并不追求提出新方法,而是研究现代文本到图像扩散模型能否跨领域、跨类别地定制任务自适应的图像分类器。现有的领域自适应图像分类工作同时利用源域与目标域数据进行域对齐,以便将从带标注源数据中学到的知识迁移到无标注的目标数据上。随着文本到图像扩散模型的发展,我们想知道其生成的高保真合成数据能否充当真实世界源数据的替代品。这样一来,我们无需为每个领域自适应任务逐一收集和标注源数据,而只需利用一个现成的文本到图像模型,根据文本提示合成带有类别标签的图像,再以这些替代数据为桥梁,通过域自适应把与任务无关的文本到图像生成器中蕴含的知识迁移到面向任务的图像分类器中。这种"一对全"的自适应范式使我们仅凭一个文本到图像生成器和相应的无标注目标数据即可完成适配。大量实验验证了该思路的可行性,其效果甚至超越了使用真实采集与标注的源数据的最新领域自适应方法。

Label Propagation for Graph Label Noise

  • paper_url: http://arxiv.org/abs/2310.16560
  • repo_url: None
  • paper_authors: Yao Cheng, Caihua Shan, Yifei Shen, Xiang Li, Siqiang Luo, Dongsheng Li
  • for: rectifying noisy labels and assigning labels to previously unlabeled nodes in the context of arbitrary heterophily.
  • methods: LP4GLN algorithm, which consists of three steps: (1) reconstruct the graph to recover the homophily property, (2) utilize label propagation to rectify the noisy labels, (3) select high-confidence labels to retain for the next iteration.
  • results: superior performance compared to 7 typical baselines in node classification tasks under varying graph heterophily levels and noise types.
    Abstract Label noise is a common challenge in large datasets, as it can significantly degrade the generalization ability of deep neural networks. Most existing studies focus on noisy labels in computer vision; however, graph models encompass both node features and graph topology as input, and become more susceptible to label noise through message-passing mechanisms. Recently, only a few works have been proposed to tackle the label noise on graphs. One major limitation is that they assume the graph is homophilous and the labels are smoothly distributed. Nevertheless, real-world graphs may contain varying degrees of heterophily or even be heterophily-dominated, leading to the inadequacy of current methods. In this paper, we study graph label noise in the context of arbitrary heterophily, with the aim of rectifying noisy labels and assigning labels to previously unlabeled nodes. We begin by conducting two empirical analyses to explore the impact of graph homophily on graph label noise. Following observations, we propose a simple yet efficient algorithm, denoted as LP4GLN. Specifically, LP4GLN is an iterative algorithm with three steps: (1) reconstruct the graph to recover the homophily property, (2) utilize label propagation to rectify the noisy labels, (3) select high-confidence labels to retain for the next iteration. By iterating these steps, we obtain a set of correct labels, ultimately achieving high accuracy in the node classification task. The theoretical analysis is also provided to demonstrate its remarkable denoising "effect". Finally, we conduct experiments on 10 benchmark datasets under varying graph heterophily levels and noise types, comparing the performance of LP4GLN with 7 typical baselines. Our results illustrate the superior performance of the proposed LP4GLN.
    摘要 标签噪声是大规模数据集中的常见挑战,会显著降低深度神经网络的泛化能力。现有研究大多关注计算机视觉中的噪声标签;然而,图模型同时以节点特征和图拓扑作为输入,其消息传递机制使其更容易受到标签噪声的影响。目前只有少数工作尝试处理图上的标签噪声,且它们大多假设图是同质(homophilous)的、标签分布平滑。然而,现实世界中的图可能包含不同程度的异质性,甚至以异质为主,导致现有方法力不从心。本文在任意异质性的设定下研究图标签噪声,目标是纠正噪声标签并为此前未标注的节点分配标签。我们首先通过两组实证分析探究图同质性对图标签噪声的影响,随后提出一种简单而高效的算法LP4GLN。具体而言,LP4GLN是一个包含三步的迭代算法:(1)重构图以恢复同质性;(2)利用标签传播纠正噪声标签;(3)选择高置信度标签保留到下一轮迭代。通过迭代这些步骤,我们获得一组正确的标签,最终在节点分类任务上取得高准确率。我们还给出了理论分析,证明其显著的"去噪"效果。最后,我们在10个基准数据集上、不同的异质性水平和噪声类型下进行实验,将LP4GLN与7种典型基线进行比较,结果表明LP4GLN具有更优的表现。(标签传播步骤的示意代码见下。)
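Step (2) of LP4GLN, label propagation over the (reconstructed) graph, can be sketched with the classic normalized propagation update; the graph reconstruction of step (1) is omitted and the high-confidence selection of step (3) is only approximated, so the normalization and threshold below are assumptions rather than the authors' exact procedure.

```python
# Label propagation sketch: iterate F <- alpha * S F + (1 - alpha) * Y on a symmetric
# normalized adjacency, then keep only high-confidence predictions for the next round.
import numpy as np

def label_propagation(adj, y_onehot, alpha=0.9, n_iter=50):
    """adj: (n, n) adjacency; y_onehot: (n, c) with zero rows for unlabeled nodes."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg, dtype=float)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    s = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]      # D^-1/2 A D^-1/2
    f = y_onehot.astype(float).copy()
    for _ in range(n_iter):
        f = alpha * (s @ f) + (1.0 - alpha) * y_onehot
    return f

def high_confidence_labels(f, threshold=0.7):
    """Keep labels whose normalized score clears the threshold (step 3, simplified)."""
    probs = f / np.clip(f.sum(axis=1, keepdims=True), 1e-12, None)
    keep = probs.max(axis=1) >= threshold
    return keep, probs.argmax(axis=1)
```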

Pitfall of Optimism: Distributional Reinforcement Learning by Randomizing Risk Criterion

  • paper_url: http://arxiv.org/abs/2310.16546
  • repo_url: None
  • paper_authors: Taehyun Cho, Seungyub Han, Heesoo Lee, Kyungjae Lee, Jungwoo Lee
  • for: This paper proposes a novel distributional reinforcement learning algorithm to avoid biased exploration and improve performance.
  • methods: The proposed algorithm randomizes the risk criterion to avoid one-sided tendency on risk, and uses a perturbed distributional Bellman optimality operator to prove convergence and optimality.
  • results: The proposed method outperforms other existing distribution-based algorithms in various environments, including Atari 55 games.
    Abstract Distributional reinforcement learning algorithms have attempted to utilize estimated uncertainty for exploration, such as optimism in the face of uncertainty. However, using the estimated variance for optimistic exploration may cause biased data collection and hinder convergence or performance. In this paper, we present a novel distributional reinforcement learning algorithm that selects actions by randomizing risk criterion to avoid one-sided tendency on risk. We provide a perturbed distributional Bellman optimality operator by distorting the risk measure and prove the convergence and optimality of the proposed method with the weaker contraction property. Our theoretical results support that the proposed method does not fall into biased exploration and is guaranteed to converge to an optimal return. Finally, we empirically show that our method outperforms other existing distribution-based algorithms in various environments including Atari 55 games.
    摘要 分布型强化学习算法曾尝试利用估计出的不确定性进行探索,例如"面对不确定性保持乐观"。然而,用估计方差来实现乐观探索可能导致数据收集出现偏差,阻碍收敛或损害性能。本文提出一种新的分布型强化学习算法,通过随机化风险准则来选择动作,以避免对风险的单侧倾向。我们通过对风险度量施加扰动,给出一个扰动的分布型贝尔曼最优算子,并在较弱的收缩性质下证明了所提方法的收敛性与最优性。理论结果表明,该方法不会陷入有偏探索,并保证收敛到最优回报。最后,我们通过实验证明,该方法在包括Atari 55款游戏在内的多种环境中优于其他现有的基于分布的算法。(随机化风险准则的动作选择示意代码见下。)
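A toy illustration of randomizing the risk criterion at action-selection time: given per-action return quantiles from a quantile-based distributional critic, a risk level is resampled for every decision and the action maximizing the corresponding CVaR-style score is chosen. Both the CVaR distortion and the uniform sampling of the risk level are illustrative assumptions, not the paper's exact operator.

```python
# Randomized risk criterion on top of quantile estimates: resample a CVaR level per decision
# so that action selection is not consistently optimistic or pessimistic.
import numpy as np

rng = np.random.default_rng(0)

def cvar_from_quantiles(quantiles, alpha):
    """Mean of the lowest alpha-fraction of return quantiles (alpha=1 -> plain mean)."""
    q = np.sort(quantiles)
    k = max(1, int(np.ceil(alpha * q.size)))
    return q[:k].mean()

def select_action(quantiles_per_action):
    """quantiles_per_action: (n_actions, n_quantiles) array of estimated return quantiles."""
    alpha = rng.uniform(0.1, 1.0)                    # randomized risk criterion
    scores = [cvar_from_quantiles(q, alpha) for q in quantiles_per_action]
    return int(np.argmax(scores)), alpha

if __name__ == "__main__":
    q = rng.normal(size=(4, 32))                     # fake critic output for 4 actions
    action, alpha = select_action(q)
    print(f"picked action {action} under risk level alpha={alpha:.2f}")
```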

A Multilingual Virtual Guide for Self-Attachment Technique

  • paper_url: http://arxiv.org/abs/2310.18366
  • repo_url: None
  • paper_authors: Alicia Jiayun Law, Ruoyu Hu, Lisa Alazraki, Anandha Gopalan, Neophytos Polydorou, Abbas Edalat
  • for: 这项研究的目的是开发一个利用现有非目标语言数据的计算框架,用于以普通话提供自我依附技术(Self-Attachment Technique, SAT)。
  • methods: 该框架不需要大规模的人工翻译,却可以达到相近的性能水平,同时保持安全性和可靠性。研究者提出了两种通过共情改写(empathetic rewriting)来扩充可用响应数据的方法。
  • results: 研究者通过为期5天、共42名参与者的非临床人体试验,将其与此前仅支持英语的SAT聊天机器人进行比较,定量结果显示二者性能相当。研究者还提供了局限分析和改进建议。
    Abstract In this work, we propose a computational framework that leverages existing out-of-language data to create a conversational agent for the delivery of Self-Attachment Technique (SAT) in Mandarin. Our framework does not require large-scale human translations, yet it achieves a comparable performance whilst also maintaining safety and reliability. We propose two different methods of augmenting available response data through empathetic rewriting. We evaluate our chatbot against a previous, English-only SAT chatbot through non-clinical human trials (N=42), each lasting five days, and quantitatively show that we are able to attain a comparable level of performance to the English SAT chatbot. We provide qualitative analysis on the limitations of our study and suggestions with the aim of guiding future improvements.
    摘要 在这个工作中,我们提出了一种计算机框架,利用现有的语言外数据创建一个拥有自我附加技巧(SAT)的 Mandarin 会话代理。我们的框架不需要大规模的人类翻译,却可以达到相似的性能水平,同时保持安全和可靠性。我们提出了两种增强可用响应数据的方法,通过同情性重写。我们对前一个英语只的 SAT 会话代理进行评估,通过非клиниче人类试验(N=42),每个试验持续五天,并证明我们可以达到与英语 SAT 会话代理相似的性能水平。我们提供了限制分析和建议,以帮助未来的改进。

FedTherapist: Mental Health Monitoring with User-Generated Linguistic Expressions on Smartphones via Federated Learning

  • paper_url: http://arxiv.org/abs/2310.16538
  • repo_url: None
  • paper_authors: Jaemin Shin, Hyungjun Yoon, Seungjoo Lee, Sungjoon Park, Yunxin Liu, Jinho D. Choi, Sung-Ju Lee
  • for: 这个论文是为了提出一种基于联邦学习的移动设备上的心理健康监测系统,以保护用户的隐私。
  • methods: 该论文使用了连续语音和键盘输入,并通过联邦学习方法进行训练,以解决智能手机上的语言模型训练问题。另外,论文还提出了一种Context-Aware Language Learning(CALL)方法,以更好地利用手机上的大量和噪音的文本数据来检测心理健康信号。
  • results: 实验结果显示,与仅使用非语言特征相比,该系统的预测准确性更高,AUROC提升0.15,MAE降低8.21%。
    Abstract Psychiatrists diagnose mental disorders via the linguistic use of patients. Still, due to data privacy, existing passive mental health monitoring systems use alternative features such as activity, app usage, and location via mobile devices. We propose FedTherapist, a mobile mental health monitoring system that utilizes continuous speech and keyboard input in a privacy-preserving way via federated learning. We explore multiple model designs by comparing their performance and overhead for FedTherapist to overcome the complex nature of on-device language model training on smartphones. We further propose a Context-Aware Language Learning (CALL) methodology to effectively utilize smartphones' large and noisy text for mental health signal sensing. Our IRB-approved evaluation of the prediction of self-reported depression, stress, anxiety, and mood from 46 participants shows higher accuracy of FedTherapist compared with the performance with non-language features, achieving 0.15 AUROC improvement and 8.21% MAE reduction.
    摘要 精神科医生通过患者的语言使用来诊断精神障碍。然而,出于数据隐私考虑,现有的被动式心理健康监测系统只能借助移动设备上的活动、应用使用和位置等替代特征。我们提出了FedTherapist,一种移动端心理健康监测系统,借助联邦学习,以保护隐私的方式利用连续的语音和键盘输入。为克服在智能手机上训练设备端语言模型的复杂性,我们比较了多种模型设计在FedTherapist中的性能与开销。我们进一步提出了情境感知语言学习(CALL)方法,以便有效利用智能手机上大量且含噪的文本来感知心理健康信号。经IRB批准、针对46名参与者自报抑郁、压力、焦虑与情绪的预测评估显示,FedTherapist的准确性高于仅使用非语言特征的方案,AUROC提升0.15,MAE降低8.21%。(联邦训练流程的示意代码见下。)
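The federated training pattern behind FedTherapist can be sketched with plain federated averaging: each phone fits a model on its local, text-derived features and only weight updates are aggregated. The logistic-regression client model, the client data format, and the size-weighted averaging are simplifying assumptions meant only to show the data-stays-on-device loop.

```python
# Federated averaging sketch: clients train locally on private data; the server only
# aggregates model weights, so raw text/keyboard features never leave the device.
import numpy as np

def local_update(weights, x, y, lr=0.1, epochs=5):
    """One client's local training of a logistic-regression-style model."""
    w = weights.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))
        w -= lr * x.T @ (p - y) / len(y)
    return w

def federated_round(global_w, clients):
    """clients: list of (x, y) tuples, one per device; returns the averaged model."""
    updates = [local_update(global_w, x, y) for x, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    return np.average(updates, axis=0, weights=sizes)   # FedAvg: size-weighted mean

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, global_w = 16, np.zeros(16)
    clients = [(rng.normal(size=(30, dim)), rng.integers(0, 2, 30)) for _ in range(5)]
    for _ in range(10):
        global_w = federated_round(global_w, clients)
    print("trained global weights norm:", np.linalg.norm(global_w).round(3))
```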

R$^3$ Prompting: Review, Rephrase and Resolve for Chain-of-Thought Reasoning in Large Language Models under Noisy Context

  • paper_url: http://arxiv.org/abs/2310.16535
  • repo_url: None
  • paper_authors: Qingyuan Tian, Hanlun Zhu, Lei Wang, Yang Li, Yunshi Lan
  • for: The paper aims to improve the performance of large language models (LLMs) on various reasoning tasks under noisy contexts.
  • methods: The proposed method, called R$^3$ prompting, interacts with LLMs to perform key sentence extraction, variable declaration, and answer prediction, which mimics the thought process of reviewing, rephrasing, and resolving.
  • results: The proposed method significantly outperforms existing CoT prompting methods on five reasoning tasks under noisy contexts, with an average accuracy improvement of 3.7% using GPT-3.5-turbo.
    Abstract With the help of Chain-of-Thought (CoT) prompting, Large Language Models (LLMs) have achieved remarkable performance on various reasoning tasks. However, most of them have been evaluated under noise-free context and the dilemma for LLMs to produce inaccurate results under the noisy context has not been fully investigated. Existing studies utilize trigger sentences to encourage LLMs to concentrate on the relevant information but the trigger has limited effect on final answer prediction. Inspired by interactive CoT method, where intermediate reasoning steps are promoted by multiple rounds of interaction between users and LLMs, we propose a novel prompting method, namely R$^3$ prompting, for CoT reasoning under noisy context. Specifically, R$^3$ prompting interacts with LLMs to perform key sentence extraction, variable declaration and answer prediction, which corresponds to a thought process of reviewing, rephrasing and resolving. The responses generated at the last interaction will perform as hints to guide toward the responses of the next interaction. Our experiments show that R$^3$ prompting significantly outperforms existing CoT prompting methods on five reasoning tasks under noisy context. With GPT-3.5-turbo, we observe 3.7% accuracy improvement on average on the reasoning tasks under noisy context compared to the most competitive prompting baseline. More analyses and ablation studies show the robustness and generalization of R$^3$ prompting method in solving reasoning tasks in LLMs under noisy context.
    摘要 借助思维链(CoT)提示,大语言模型(LLM)在多种推理任务上取得了显著表现。然而,大多数评估都是在无噪声的上下文中进行的,LLM在噪声上下文中产生不准确结果的问题尚未得到充分研究。现有研究通过触发句来促使LLM关注相关信息,但触发句对最终答案预测的作用有限。受交互式CoT方法(通过用户与LLM的多轮交互推动中间推理步骤)的启发,我们提出了一种新的提示方法R$^3$提示,用于噪声上下文中的CoT推理。具体而言,R$^3$提示与LLM交互,依次完成关键句提取、变量声明和答案预测,对应"回顾、重述、求解"的思维过程;上一次交互生成的回复将作为提示,引导下一次交互的回复。实验表明,在噪声上下文下的五项推理任务上,R$^3$提示显著优于现有的CoT提示方法;使用GPT-3.5-turbo时,相比最具竞争力的提示基线,平均准确率提升3.7%。更多的分析与消融实验进一步表明了R$^3$提示方法在噪声上下文中求解推理任务时的稳健性与泛化能力。

Improving Diversity of Demographic Representation in Large Language Models via Collective-Critiques and Self-Voting

  • paper_url: http://arxiv.org/abs/2310.16523
  • repo_url: None
  • paper_authors: Preethi Lahoti, Nicholas Blumm, Xiao Ma, Raghavendra Kotikalapudi, Sahitya Potluri, Qijun Tan, Hansa Srinivasan, Ben Packer, Ahmad Beirami, Alex Beutel, Jilin Chen
  • for: 本研究旨在提高生成大型自然语言模型(LLM)的多样性,以便在用户提供的不充分的提示下,模型可以生成多种不同的响应,而不是只有一种固定的回答。
  • methods: 本研究构建了评估数据集并提出了多样性评价指标,以衡量生成响应在人群与文化维度上的多样性。此外,还提出了一种称为集体批评与自我投票(CCSV)的新提示技术,通过挖掘模型自身的多样性推理能力来自我提升人群多样性,无需手工编写示例或进行提示调优。
  • results: 实验表明,所提方法能有效提升人群与文化多样性,并以较大优势超越所有基线方法。此外,我们还发现LLM能够理解多样性的概念,并能针对这一目标对自己的回答进行推理与批评。
    Abstract A crucial challenge for generative large language models (LLMs) is diversity: when a user's prompt is under-specified, models may follow implicit assumptions while generating a response, which may result in homogenization of the responses, as well as certain demographic groups being under-represented or even erased from the generated responses. In this paper, we formalize diversity of representation in generative LLMs. We present evaluation datasets and propose metrics to measure diversity in generated responses along people and culture axes. We find that LLMs understand the notion of diversity, and that they can reason and critique their own responses for that goal. This finding motivated a new prompting technique called collective-critique and self-voting (CCSV) to self-improve people diversity of LLMs by tapping into its diversity reasoning capabilities, without relying on handcrafted examples or prompt tuning. Extensive empirical experiments with both human and automated evaluations show that our proposed approach is effective at improving people and culture diversity, and outperforms all baseline methods by a large margin.
    摘要 生成式大语言模型(LLM)面临的一个关键挑战是多样性:当用户的提示不够明确时,模型可能依据隐含假设生成响应,导致响应趋于同质化,某些人口群体在生成结果中代表性不足甚至被抹去。本文对生成式LLM的表示多样性进行了形式化定义,给出了评估数据集,并提出了沿人群与文化维度衡量生成响应多样性的指标。我们发现LLM理解多样性这一概念,并且能够针对这一目标对自己的回答进行推理和批评。基于这一发现,我们提出了一种名为集体批评与自我投票(CCSV)的新提示技术,通过挖掘LLM自身的多样性推理能力来自我提升人群多样性,而无需手工示例或提示调优。人工与自动评估的大量实验表明,所提方法能有效提升人群与文化多样性,并以较大优势超越所有基线方法。

Identifying Reasons for Bias: An Argumentation-Based Approach

  • paper_url: http://arxiv.org/abs/2310.16506
  • repo_url: None
  • paper_authors: Madeleine Waller, Odinaldo Rodrigues, Oana Cocarascu
  • for: Ensuring the fairness of algorithmic decision-making systems
  • methods: Model-agnostic argumentation-based method using a quantitative argumentation framework and well-known semantics to identify bias
  • results: Effective in identifying bias in two datasets commonly used in the fairness literature
    Abstract As algorithmic decision-making systems become more prevalent in society, ensuring the fairness of these systems is becoming increasingly important. Whilst there has been substantial research in building fair algorithmic decision-making systems, the majority of these methods require access to the training data, including personal characteristics, and are not transparent regarding which individuals are classified unfairly. In this paper, we propose a novel model-agnostic argumentation-based method to determine why an individual is classified differently in comparison to similar individuals. Our method uses a quantitative argumentation framework to represent attribute-value pairs of an individual and of those similar to them, and uses a well-known semantics to identify the attribute-value pairs in the individual contributing most to their different classification. We evaluate our method on two datasets commonly used in the fairness literature and illustrate its effectiveness in the identification of bias.
    摘要 随着算法决策系统在社会中日益普及,确保这些系统的公平性变得愈发重要。尽管在构建公平的算法决策系统方面已有大量研究,但其中多数方法需要访问包含个人特征的训练数据,并且无法透明地指出哪些个体被不公平地分类。本文提出一种与模型无关、基于论辩的方法,用于解释为何某一个体与相似个体相比得到了不同的分类结果。该方法使用定量论辩框架来表示该个体及其相似个体的属性-取值对,并利用一种知名的语义来识别对该个体不同分类结果贡献最大的属性-取值对。我们在公平性文献中常用的两个数据集上评估了该方法,并展示了其在识别偏差方面的有效性。

On the Powerfulness of Textual Outlier Exposure for Visual OoD Detection

  • paper_url: http://arxiv.org/abs/2310.16492
  • repo_url: None
  • paper_authors: Sangha Park, Jisoo Mok, Dahuin Jung, Saehyung Lee, Sungroh Yoon
  • for: 提高神经网络中的Out-of-Distribution(OoD)检测精度,以确保神经网络安全部署。
  • methods: 使用文本型离群样本暴露(textual outlier exposure)来提升OoD检测性能,即在训练时引入额外损失,促使模型对离群数据给出低置信度预测。
  • results: 生成的文本离群样本在大规模OoD和困难OoD基准上取得了有竞争力的表现,并给出了设计优质文本离群样本的基本准则。
    Abstract Successful detection of Out-of-Distribution (OoD) data is becoming increasingly important to ensure safe deployment of neural networks. One of the main challenges in OoD detection is that neural networks output overconfident predictions on OoD data, make it difficult to determine OoD-ness of data solely based on their predictions. Outlier exposure addresses this issue by introducing an additional loss that encourages low-confidence predictions on OoD data during training. While outlier exposure has shown promising potential in improving OoD detection performance, all previous studies on outlier exposure have been limited to utilizing visual outliers. Drawing inspiration from the recent advancements in vision-language pre-training, this paper venture out to the uncharted territory of textual outlier exposure. First, we uncover the benefits of using textual outliers by replacing real or virtual outliers in the image-domain with textual equivalents. Then, we propose various ways of generating preferable textual outliers. Our extensive experiments demonstrate that generated textual outliers achieve competitive performance on large-scale OoD and hard OoD benchmarks. Furthermore, we conduct empirical analyses of textual outliers to provide primary criteria for designing advantageous textual outliers: near-distribution, descriptiveness, and inclusion of visual semantics.
    摘要 成功检测分布外(OoD)数据对于神经网络的安全部署越来越重要。OoD检测的主要挑战在于神经网络在OoD数据上往往给出过度自信的预测,使得仅凭预测结果难以判断数据是否属于分布外。离群样本暴露(outlier exposure)通过在训练中引入额外损失、促使模型对离群数据输出低置信度预测来缓解这一问题。尽管离群样本暴露在提升OoD检测性能方面展现出潜力,但此前的研究都局限于使用视觉离群样本。受视觉-语言预训练最新进展的启发,本文探索了此前未被涉足的文本离群样本暴露。我们首先通过用文本等价物替换图像域中的真实或虚拟离群样本,揭示了使用文本离群样本的优势;随后提出了多种生成优质文本离群样本的方式。大量实验表明,生成的文本离群样本在大规模OoD和困难OoD基准上取得了有竞争力的表现。我们还对文本离群样本进行了实证分析,给出了设计优质文本离群样本的基本准则:贴近分布、具描述性、包含视觉语义。(离群样本暴露损失的示意代码见下。)
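The outlier-exposure objective described above is commonly implemented by adding a term that pushes the model's predictive distribution on outlier inputs toward uniform. The PyTorch sketch below shows that loss; how textual outliers are rendered into inputs the classifier can consume (e.g., through a text or vision-language encoder) is specific to the paper and not reproduced here.

```python
# Outlier exposure sketch: standard cross-entropy on in-distribution data plus a term that
# drives predictions on outlier data toward the uniform distribution over classes.
import torch
import torch.nn.functional as F

def outlier_exposure_loss(logits_in, labels_in, logits_out, lam=0.5):
    ce = F.cross_entropy(logits_in, labels_in)
    # Cross-entropy to the uniform distribution == negative mean log-probability per class.
    uniform_ce = -F.log_softmax(logits_out, dim=1).mean(dim=1).mean()
    return ce + lam * uniform_ce

if __name__ == "__main__":
    torch.manual_seed(0)
    logits_in, labels_in = torch.randn(8, 10), torch.randint(0, 10, (8,))
    logits_out = torch.randn(8, 10)          # stand-in for encoded (textual) outliers
    print(outlier_exposure_loss(logits_in, labels_in, logits_out))
```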

Transfer of Reinforcement Learning-Based Controllers from Model- to Hardware-in-the-Loop

  • paper_url: http://arxiv.org/abs/2310.17671
  • repo_url: None
  • paper_authors: Mario Picerno, Lucas Koch, Kevin Badalian, Marius Wegener, Joschka Schaub, Charles Robert Koch, Jakob Andert
  • for: 这项研究旨在通过迁移学习(TL)与X-in-the-Loop(XiL)仿真加速强化学习(RL)智能体的训练,以便在真实应用中使用RL。
  • methods: 该研究先利用计算成本较低的模型在环(MiL)仿真选择合适的算法并精调超参数,再将候选智能体迁移到硬件在环(HiL)系统上继续训练。
  • results: 结果表明,迁移到真实硬件时需要调整奖励参数;与完全在HiL系统上训练的智能体相比,迁移后的智能体训练时间缩短了约5.9倍。结果还表明RL智能体需要与真实硬件交互,而TL与XiL仿真的协同可以缩短训练时间并提升性能。
    Abstract The process of developing control functions for embedded systems is resource-, time-, and data-intensive, often resulting in sub-optimal cost and solutions approaches. Reinforcement Learning (RL) has great potential for autonomously training agents to perform complex control tasks with minimal human intervention. Due to costly data generation and safety constraints, however, its application is mostly limited to purely simulated domains. To use RL effectively in embedded system function development, the generated agents must be able to handle real-world applications. In this context, this work focuses on accelerating the training process of RL agents by combining Transfer Learning (TL) and X-in-the-Loop (XiL) simulation. For the use case of transient exhaust gas re-circulation control for an internal combustion engine, use of a computationally cheap Model-in-the-Loop (MiL) simulation is made to select a suitable algorithm, fine-tune hyperparameters, and finally train candidate agents for the transfer. These pre-trained RL agents are then fine-tuned in a Hardware-in-the-Loop (HiL) system via TL. The transfer revealed the need for adjusting the reward parameters when advancing to real hardware. Further, the comparison between a purely HiL-trained and a transferred agent showed a reduction of training time by a factor of 5.9. The results emphasize the necessity to train RL agents with real hardware, and demonstrate that the maturity of the transferred policies affects both training time and performance, highlighting the strong synergies between TL and XiL simulation.
    摘要 为嵌入式系统开发控制功能是资源、时间与数据密集的过程,往往导致成本与方案都并非最优。强化学习(RL)在以极少人工干预自主训练智能体完成复杂控制任务方面潜力巨大。然而,由于数据生成代价高昂且存在安全约束,其应用大多局限于纯仿真领域。要让RL切实用于嵌入式系统功能开发,训练得到的智能体必须能够应对真实世界应用。为此,本工作着眼于通过结合迁移学习(TL)与X-in-the-Loop(XiL)仿真来加速RL智能体的训练。以内燃机瞬态废气再循环控制为例,先利用计算成本低的模型在环(MiL)仿真选择合适算法、精调超参数并训练候选智能体,再通过TL在硬件在环(HiL)系统中对这些预训练的RL智能体进行微调。迁移过程表明,在过渡到真实硬件时需要调整奖励参数。此外,与完全在HiL上训练的智能体相比,迁移后的智能体训练时间缩短了约5.9倍。结果强调了让RL智能体在真实硬件上训练的必要性,并表明被迁移策略的成熟度会同时影响训练时间与性能,凸显了TL与XiL仿真之间的强协同效应。

Semiring Provenance for Lightweight Description Logics

  • paper_url: http://arxiv.org/abs/2310.16472
  • repo_url: None
  • paper_authors: Camille Bourgaux, Ana Ozaki, Rafael Peñaloza
  • for: 本研究围绕描述逻辑中的semiring provenance框架进行调查,以增强描述逻辑的表达能力和可解释性。
  • methods: 本研究使用交换半环(commutative semiring)的元素来标注本体公理,并按照推导方式把这些标注传播到本体的推论上。
  • results: 研究人员为涵盖多种轻量级描述逻辑的语言定义了半环溯源语义,并证明其在对半环的一定限制下满足若干理想性质(例如扩展了数据库上已定义的半环溯源)。此外,还研究了why-溯源相关问题的复杂度,并对正布尔溯源与lineage两种更受限的情形进行了分析。
    Abstract We investigate semiring provenance--a successful framework originally defined in the relational database setting--for description logics. In this context, the ontology axioms are annotated with elements of a commutative semiring and these annotations are propagated to the ontology consequences in a way that reflects how they are derived. We define a provenance semantics for a language that encompasses several lightweight description logics and show its relationships with semantics that have been defined for ontologies annotated with a specific kind of annotation (such as fuzzy degrees). We show that under some restrictions on the semiring, the semantics satisfies desirable properties (such as extending the semiring provenance defined for databases). We then focus on the well-known why-provenance, which allows to compute the semiring provenance for every additively and multiplicatively idempotent commutative semiring, and for which we study the complexity of problems related to the provenance of an axiom or a conjunctive query answer. Finally, we consider two more restricted cases which correspond to the so-called positive Boolean provenance and lineage in the database setting. For these cases, we exhibit relationships with well-known notions related to explanations in description logics and complete our complexity analysis. As a side contribution, we provide conditions on an ELHI_bot ontology that guarantee tractable reasoning.
    摘要 我们研究半环溯源(semiring provenance)在描述逻辑中的应用,该框架最初在关系数据库场景中定义并取得成功。在这一场景下,本体公理被标注为交换半环的元素,这些标注按照推导方式传播到本体的推论上。我们为一种涵盖多个轻量级描述逻辑的语言定义了溯源语义,并研究其与针对特定类型标注(例如模糊程度)的本体语义之间的关系。我们证明,在对半环施加某些限制的情况下,该语义满足若干理想性质(例如扩展了为数据库定义的半环溯源)。随后我们聚焦于著名的why-溯源,它可以对任意加法与乘法均幂等的交换半环计算半环溯源,并研究与公理或合取查询答案的溯源相关问题的复杂度。最后,我们考虑两种更受限的情形,分别对应数据库场景中的正布尔溯源与lineage;对于这些情形,我们揭示了它们与描述逻辑中解释这一知名概念的联系,并补全了复杂度分析。作为附带贡献,我们给出了保证 ELHI_bot 本体可处理推理的条件。(半环标注传播的示意代码见下。)
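The core bookkeeping of semiring provenance, combining alternative derivations with the semiring's addition and joint premises with its multiplication, can be illustrated on a toy fact with two derivations. The example below evaluates the same provenance expression in the Boolean and in the counting semiring; it is a didactic illustration of annotation propagation, not the paper's description-logic algorithm.

```python
# Toy semiring provenance: a derived fact with two alternative derivations,
# each joining two annotated axioms. plus = alternatives, times = joint use.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Semiring:
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]

boolean = Semiring(plus=lambda a, b: a or b, times=lambda a, b: a and b)
counting = Semiring(plus=lambda a, b: a + b, times=lambda a, b: a * b)

def provenance_of_fact(ann, K: Semiring):
    """Fact derived either from axioms a1 and a2, or from axioms a1 and a3."""
    deriv1 = K.times(ann["a1"], ann["a2"])
    deriv2 = K.times(ann["a1"], ann["a3"])
    return K.plus(deriv1, deriv2)

# Boolean annotations: is the fact derivable from the trusted axioms?
print(provenance_of_fact({"a1": True, "a2": False, "a3": True}, boolean))   # True
# Counting annotations: how many (weighted) derivations does the fact have?
print(provenance_of_fact({"a1": 1, "a2": 2, "a3": 3}, counting))            # 5
```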

Towards Explainability in Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2310.16457
  • repo_url: None
  • paper_authors: Vasileios Arampatzakis, George Pavlidis, Kyriakos Pantoglou, Nikolaos Mitianoudis, Nikos Papamarkos
  • for: 本研究旨在从人类感知深度的角度探讨单目深度估计方法的可解释性。
  • methods: 本研究设计了模拟人类感知实验的专门实验,并测试了当前最先进的深度学习方法,以间接评估它们在既定情境下的可解释性。
  • results: 结果表明,各方法的平均准确率约为77%,其中部分方法表现明显更好,这种性能差异间接反映了它们对相对大小这一深度线索的感知能力。
    Abstract The estimation of depth in two-dimensional images has long been a challenging and extensively studied subject in computer vision. Recently, significant progress has been made with the emergence of Deep Learning-based approaches, which have proven highly successful. This paper focuses on the explainability in monocular depth estimation methods, in terms of how humans perceive depth. This preliminary study emphasizes on one of the most significant visual cues, the relative size, which is prominent in almost all viewed images. We designed a specific experiment to mimic the experiments in humans and have tested state-of-the-art methods to indirectly assess the explainability in the context defined. In addition, we observed that measuring the accuracy required further attention and a particular approach is proposed to this end. The results show that a mean accuracy of around 77% across methods is achieved, with some of the methods performing markedly better, thus, indirectly revealing their corresponding potential to uncover monocular depth cues, like relative size.
    摘要 在计算机视觉领域,从二维图像估计深度长期以来都是一个充满挑战且被广泛研究的课题。近年来,随着基于深度学习方法的出现,这一领域取得了重要进展,并被证明非常成功。本文关注单目深度估计方法的可解释性,即从人类感知深度的角度加以考察。这项初步研究着眼于最重要的视觉线索之一,即相对大小,它几乎在所有观察到的图像中都很显著。我们设计了模拟人类实验的专门实验,并测试了当前最先进的方法,以间接评估它们在既定情境下的可解释性。此外,我们发现准确率的度量需要进一步关注,并为此提出了一种专门的方法。结果显示,各方法的平均准确率约为77%,其中一些方法表现明显更好,从而间接揭示了它们挖掘诸如相对大小等单目深度线索的潜力。

Graph-based multimodal multi-lesion DLBCL treatment response prediction from PET images

  • paper_url: http://arxiv.org/abs/2310.16863
  • repo_url: None
  • paper_authors: Oriane Thiery, Mira Rizkallah, Clément Bailly, Caroline Bodet-Milin, Emmanuel Itti, René-Olivier Casasnovas, Steven Le Gouill, Thomas Carlier, Diana Mateus
  • for: 这项研究旨在开发一种计算机辅助方法,以帮助弥漫性大B细胞淋巴瘤(DLBCL)的诊断与随访。
  • methods: 该方法基于最新的图神经网络,融合多个病灶的影像信息,并使用交叉注意力模块高效整合不同的数据模态。
  • results: 在583名患者上训练和评估的实验结果显示,所提方法在2年无进展生存(PFS)分类准确率上优于对比方法。
    Abstract Diffuse Large B-cell Lymphoma (DLBCL) is a lymphatic cancer involving one or more lymph nodes and extranodal sites. Its diagnostic and follow-up rely on Positron Emission Tomography (PET) and Computed Tomography (CT). After diagnosis, the number of nonresponding patients to standard front-line therapy remains significant (30-40%). This work aims to develop a computer-aided approach to identify high-risk patients requiring adapted treatment by efficiently exploiting all the information available for each patient, including both clinical and image data. We propose a method based on recent graph neural networks that combine imaging information from multiple lesions, and a cross-attention module to integrate different data modalities efficiently. The model is trained and evaluated on a private prospective multicentric dataset of 583 patients. Experimental results show that our proposed method outperforms classical supervised methods based on either clinical, imaging or both clinical and imaging data for the 2-year progression-free survival (PFS) classification accuracy.
    摘要 弥漫性大B细胞淋巴瘤(DLBCL)是一种累及一个或多个淋巴结及结外部位的淋巴系统癌症,其诊断与随访依赖正电子发射断层扫描(PET)和计算机断层扫描(CT)。确诊后,对标准一线治疗无应答的患者比例仍然很高(30-40%)。本工作旨在开发一种计算机辅助方法,通过充分利用每位患者的全部可用信息(包括临床数据与影像数据),识别需要调整治疗方案的高风险患者。我们提出一种基于最新图神经网络的方法,融合多个病灶的影像信息,并通过交叉注意力模块高效整合不同的数据模态。模型在一个包含583名患者的私有前瞻性多中心数据集上进行训练和评估。实验结果显示,在2年无进展生存(PFS)分类准确率上,所提方法优于仅基于临床数据、仅基于影像数据或两者结合的经典监督方法。

Faithful Path Language Modelling for Explainable Recommendation over Knowledge Graph

  • paper_url: http://arxiv.org/abs/2310.16452
  • repo_url: None
  • paper_authors: Giacomo Balloccu, Ludovico Boratto, Christian Cancedda, Gianni Fenu, Mirko Marras
  • for: 这篇论文旨在通过基于知识图的路径推理方法提高推荐系统的透明度。
  • methods: 本文提出了一种名为PEARLM的新方法,通过语言建模高效捕捉用户行为与产品侧知识,并将实体和关系统一到同一个优化空间中。
  • results: 在两个数据集上的实验表明,PEARLM方法优于最先进的基线方法。
    Abstract Path reasoning methods over knowledge graphs have gained popularity for their potential to improve transparency in recommender systems. However, the resulting models still rely on pre-trained knowledge graph embeddings, fail to fully exploit the interdependence between entities and relations in the KG for recommendation, and may generate inaccurate explanations. In this paper, we introduce PEARLM, a novel approach that efficiently captures user behaviour and product-side knowledge through language modelling. With our approach, knowledge graph embeddings are directly learned from paths over the KG by the language model, which also unifies entities and relations in the same optimisation space. Constraints on the sequence decoding additionally guarantee path faithfulness with respect to the KG. Experiments on two datasets show the effectiveness of our approach compared to state-of-the-art baselines. Source code and datasets: AVAILABLE AFTER GETTING ACCEPTED.
    摘要 translate_text="Path reasoning methods over knowledge graphs have gained popularity for their potential to improve transparency in recommender systems. However, the resulting models still rely on pre-trained knowledge graph embeddings, fail to fully exploit the interdependence between entities and relations in the KG for recommendation, and may generate inaccurate explanations. In this paper, we introduce PEARLM, a novel approach that efficiently captures user behavior and product-side knowledge through language modeling. With our approach, knowledge graph embeddings are directly learned from paths over the KG by the language model, which also unifies entities and relations in the same optimization space. Constraints on the sequence decoding additionally guarantee path faithfulness with respect to the KG. Experiments on two datasets show the effectiveness of our approach compared to state-of-the-art baselines. Source code and datasets: AVAILABLE AFTER GETTING ACCEPTED."Here's the translation in Simplified Chinese: PATH 理解方法在知识Graph中得到了广泛的应用,因为它们可以提高推荐系统的透明度。然而,现有的模型仍然依赖于预训练的知识Graph嵌入,不充分利用知识Graph中Entity和关系之间的互相依赖关系,并可能生成错误的解释。在这篇论文中,我们介绍了PEARLM,一种新的方法,可以效率地捕捉用户行为和产品 сторо面知识通过语言模型。我们的方法直接从知识Graph中的路径上学习语言模型,同时也将Entity和关系嵌入到同一个优化空间中。另外,对序列解码加入约束,保证路径的准确性与知识Graph相关。在两个数据集上进行了实验,比较了我们的方法与现有的基elines。源代码和数据集:接受后提供。

Diversity Enhanced Narrative Question Generation for Storybooks

  • paper_url: http://arxiv.org/abs/2310.16446
  • repo_url: https://github.com/hkyoon95/mqg
  • paper_authors: Hokeun Yoon, JinYeong Bak
  • for: 提高学习或对话环境中的理解、参与度、评估和总效果。
  • methods: 使用多个问题生成模型(mQG),通过专注于上下文和问题来生成多个、多样化的、可回答的问题。
  • results: 在 FairytaleQA 数据集上取得了优异的评估结果,并以零样本方式应用于 TellMeWhy 和 SQuAD1.1 数据集,同样取得了扎实的结果。
    Abstract Question generation (QG) from a given context can enhance comprehension, engagement, assessment, and overall efficacy in learning or conversational environments. Despite recent advancements in QG, the challenge of enhancing or measuring the diversity of generated questions often remains unaddressed. In this paper, we introduce a multi-question generation model (mQG), which is capable of generating multiple, diverse, and answerable questions by focusing on context and questions. To validate the answerability of the generated questions, we employ a SQuAD2.0 fine-tuned question answering model, classifying the questions as answerable or not. We train and evaluate mQG on the FairytaleQA dataset, a well-structured QA dataset based on storybooks, with narrative questions. We further apply a zero-shot adaptation on the TellMeWhy and SQuAD1.1 datasets. mQG shows promising results across various evaluation metrics, among strong baselines.
    摘要 基于给定上下文的问题生成(QG)可以提升学习或对话环境中的理解、参与度、评估效果与整体效能。尽管QG近来取得了进展,但如何提升并衡量所生成问题的多样性往往仍未得到解决。本文提出一种多问题生成模型(mQG),通过同时关注上下文与已生成的问题,生成多个多样且可回答的问题。为验证生成问题的可回答性,我们使用一个在SQuAD2.0上微调的问答模型,将问题划分为可回答与不可回答两类。我们在FairytaleQA(一个基于故事书、带有叙事类问题的结构良好的QA数据集)上训练和评估mQG,并进一步以零样本方式应用于TellMeWhy和SQuAD1.1数据集。与多个强基线相比,mQG在各项评估指标上均表现出色。(可回答性过滤的示意代码见下。)
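The answerability check described above can be approximated with a question-answering model fine-tuned on SQuAD 2.0. The sketch below uses a public RoBERTa SQuAD2 checkpoint and a simple score threshold as the answerable/unanswerable decision; the specific checkpoint and the threshold are assumptions, not the authors' classifier.

```python
# Filter generated questions by answerability using a SQuAD2.0-style QA model:
# questions whose best answer span has a low confidence score are treated as unanswerable.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

def filter_answerable(context, questions, threshold=0.3):
    kept = []
    for q in questions:
        result = qa(question=q, context=context)
        if result["score"] >= threshold:             # assumed decision rule
            kept.append((q, result["answer"], result["score"]))
    return kept

if __name__ == "__main__":
    story = ("The little fox hid the golden key under the old oak tree "
             "so that the wolf could not find it.")
    candidates = [
        "Where did the fox hide the key?",
        "Why was the king angry at the fox?",   # likely unanswerable from the story
    ]
    for q, ans, score in filter_answerable(story, candidates):
        print(f"{q} -> {ans} ({score:.2f})")
```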

An Integrative Paradigm for Enhanced Stroke Prediction: Synergizing XGBoost and xDeepFM Algorithms

  • paper_url: http://arxiv.org/abs/2310.16430
  • repo_url: None
  • paper_authors: Weinan Dai, Yifeng Jiang, Chengjie Mou, Chongyu Zhang
  • for: 中风预测的目的在于预防和管理这种致残性疾病。
  • methods: 本研究使用了一个完整的数据集,并提出了一个集成模型,该模型将XGBoost和xDeepFM算法相结合。
  • results: 我们通过严格的实验验证了我们的集成模型的有效性,并与其他模型进行比较,从而获得了有价值的发现,以及对机器学习和深度学习技术在中风预测领域的贡献。
    Abstract Stroke prediction plays a crucial role in preventing and managing this debilitating condition. In this study, we address the challenge of stroke prediction using a comprehensive dataset, and propose an ensemble model that combines the power of XGBoost and xDeepFM algorithms. Our work aims to improve upon existing stroke prediction models by achieving higher accuracy and robustness. Through rigorous experimentation, we validate the effectiveness of our ensemble model using the AUC metric. Through comparing our findings with those of other models in the field, we gain valuable insights into the merits and drawbacks of various approaches. This, in turn, contributes significantly to the progress of machine learning and deep learning techniques specifically in the domain of stroke prediction.
    摘要 <>转换文本到简化中文。<>roke 预测在stroke 的预防和管理中发挥关键作用。在这个研究中,我们面临roke 预测挑战,使用了 comprehensive 数据集,并提议一种ensemble 模型,将 XGBoost 和 xDeepFM 算法相结合。我们的工作目的是提高现有roke 预测模型的准确率和可靠性。通过严格的实验,我们验证了我们的ensemble 模型的有效性,使用 AUC 指标。通过与其他模型在领域中比较我们的发现,我们获得了对machine learning 和 deep learning 技术在roke 预测中的应用所得的有价值的理解。

Graph Agent: Explicit Reasoning Agent for Graphs

  • paper_url: http://arxiv.org/abs/2310.16421
  • repo_url: None
  • paper_authors: Qinyong Wang, Zhenxiang Gao, Rong Xu
  • for: 本文旨在提供一种基于大语言模型、归纳-演绎推理模块和长期记忆的知识图推理方法,以提升复杂知识图推理任务的效果。
  • methods: 本文提出了一种名为Graph Agent(GA)的智能体方法,利用大语言模型(LLM)、归纳-演绎推理模块和长期记忆来完成知识图推理任务。GA将图结构转换成文本数据,使LLM能够处理、推理并给出预测结果,同时提供人类可读的解释。
  • results: 实验结果表明,GA在节点分类和链接预测任务上达到了最先进的性能,在Cora、PubMed和PrimeKG数据集上的准确率分别为90.65%、95.48%和89.32%。与现有的GNN和Transformer模型相比,GA具有显式推理能力、无需训练、易于适配各种知识图推理任务等优势。
    Abstract Graph embedding methods such as Graph Neural Networks (GNNs) and Graph Transformers have contributed to the development of graph reasoning algorithms for various tasks on knowledge graphs. However, the lack of interpretability and explainability of graph embedding methods has limited their applicability in scenarios requiring explicit reasoning. In this paper, we introduce the Graph Agent (GA), an intelligent agent methodology for leveraging large language models (LLMs), inductive-deductive reasoning modules, and long-term memory for knowledge graph reasoning tasks. GA integrates aspects of symbolic reasoning and existing graph embedding methods to provide an innovative approach for complex graph reasoning tasks. By converting graph structures into textual data, GA enables LLMs to process, reason, and provide predictions alongside human-interpretable explanations. The effectiveness of the GA was evaluated on node classification and link prediction tasks. Results showed that GA reached state-of-the-art performance, demonstrating accuracy of 90.65%, 95.48%, and 89.32% on Cora, PubMed, and PrimeKG datasets, respectively. Compared to existing GNN and transformer models, GA offered the advantages of explicit reasoning ability, training-free operation, and easy adaptation to various graph reasoning tasks.
    摘要 图嵌入方法(如图神经网络GNN和图Transformer)已经为知识图推理任务做出了贡献。然而,图嵌入方法缺乏可解释性,限制了它们在需要显式推理的场景下的应用。在这篇论文中,我们介绍了Graph Agent(GA),一种利用大语言模型(LLM)、归纳-演绎推理模块和长期记忆来完成知识图推理任务的智能代理方法。GA结合了符号推理与现有图嵌入方法的优点,为复杂图推理任务提供了一种创新的途径。通过将图结构转化为文本数据,GA使LLM能够处理、推理并给出预测,同时提供人类可解释的说明。我们在节点分类和链接预测任务上评估了GA,结果显示其达到了最先进的性能:在Cora、PubMed和PrimeKG数据集上的准确率分别为90.65%、95.48%和89.32%。与现有GNN和transformer模型相比,GA具有显式推理能力、无需训练、易于适应各种图推理任务等优势。
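The sketch below illustrates the core step of Graph Agent as described in the abstract: converting a node's local structure into text that an LLM can reason over. The toy citation graph, node labels, and prompt wording are assumptions made for illustration; the paper's actual prompting scheme is more elaborate.

```python
# Verbalize a node's neighborhood into a textual prompt for LLM-based graph reasoning.
import networkx as nx

g = nx.Graph()
g.add_edges_from([("paper_A", "paper_B"), ("paper_A", "paper_C"), ("paper_B", "paper_D")])
labels = {"paper_B": "Machine Learning", "paper_C": "Databases", "paper_D": "Machine Learning"}

def verbalize_node(graph: nx.Graph, node: str, node_labels: dict) -> str:
    lines = [f"Target node: {node}."]
    for nb in graph.neighbors(node):
        lab = node_labels.get(nb, "unknown")
        lines.append(f"- Neighbor {nb} has label '{lab}'.")
    lines.append("Question: what is the most likely label of the target node, and why?")
    return "\n".join(lines)

prompt = verbalize_node(g, "paper_A", labels)
print(prompt)  # this text would be sent to the LLM together with task instructions
```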

Balancing Augmentation with Edge-Utility Filter for Signed GNNs

  • paper_url: http://arxiv.org/abs/2310.16862
  • repo_url: None
  • paper_authors: Ke-Jia Chen, Yaming Ji, Youran Qu, Chuhan Xu
  • for: 这篇论文旨在通过平衡增强来缓解符号图中的语义与结构不平衡,从而提高符号图神经网络(SGNN)的鲁棒性和性能。
  • methods: 该论文提出了一种平衡增强策略:先通过负边在不平衡结构中的出现情况来度量其效用,再利用边扰动调节器和边效用过滤器对符号图进行选择性增强。
  • results: 在五个真实数据集上的链接预测实验表明,该方法有效且具有良好的泛化性,可以显著提高SGNN骨干模型的性能。
    Abstract Signed graph neural networks (SGNNs) has recently drawn more attention as many real-world networks are signed networks containing two types of edges: positive and negative. The existence of negative edges affects the SGNN robustness on two aspects. One is the semantic imbalance as the negative edges are usually hard to obtain though they can provide potentially useful information. The other is the structural unbalance, e.g. unbalanced triangles, an indication of incompatible relationship among nodes. In this paper, we propose a balancing augmentation method to address the above two aspects for SGNNs. Firstly, the utility of each negative edge is measured by calculating its occurrence in unbalanced structures. Secondly, the original signed graph is selectively augmented with the use of (1) an edge perturbation regulator to balance the number of positive and negative edges and to determine the ratio of perturbed edges to original edges and (2) an edge utility filter to remove the negative edges with low utility to make the graph structure more balanced. Finally, a SGNN is trained on the augmented graph which effectively explores the credible relationships. A detailed theoretical analysis is also conducted to prove the effectiveness of each module. Experiments on five real-world datasets in link prediction demonstrate that our method has the advantages of effectiveness and generalization and can significantly improve the performance of SGNN backbones.
    摘要 符号图神经网络(SGNN)近年来受到越来越多的关注,因为许多现实网络是包含正、负两类边的符号网络。负边的存在从两个方面影响SGNN的鲁棒性:其一是语义不平衡,负边通常难以获得,却可能提供有用的信息;其二是结构不平衡,例如不平衡三角形,它表明节点之间存在不相容的关系。本文针对上述两个方面,为SGNN提出了一种平衡增强方法。首先,通过计算每条负边在不平衡结构中的出现次数来度量其效用;其次,对原始符号图进行选择性增强:(1)使用边扰动调节器来平衡正、负边的数量,并确定扰动边与原始边的比例;(2)使用边效用过滤器去除效用较低的负边,使图结构更加平衡。最后,在增强后的图上训练SGNN,从而更有效地挖掘可信的关系。我们还进行了详细的理论分析来证明各模块的有效性。在五个真实数据集上的链接预测实验表明,该方法兼具有效性与泛化性,能够显著提升SGNN骨干模型的性能。
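A minimal sketch of the edge-utility idea from the abstract: each negative edge is scored by how often it participates in unbalanced triangles (sign product < 0), and the lowest-scoring negatives are dropped before training the SGNN. The toy signed graph and the keep-threshold are assumptions.

```python
# Score negative edges by their occurrence in unbalanced triangles, then filter.
import networkx as nx

g = nx.Graph()
signed_edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", -1),   # unbalanced triangle a-b-c
                ("c", "d", -1), ("b", "d", 1)]
for u, v, s in signed_edges:
    g.add_edge(u, v, sign=s)

def unbalanced_triangle_count(graph: nx.Graph, u: str, v: str) -> int:
    count = 0
    for w in set(graph.neighbors(u)) & set(graph.neighbors(v)):
        prod = graph[u][v]["sign"] * graph[u][w]["sign"] * graph[v][w]["sign"]
        if prod < 0:  # an odd number of negative edges marks the triangle as unbalanced
            count += 1
    return count

utilities = {(u, v): unbalanced_triangle_count(g, u, v)
             for u, v, d in g.edges(data=True) if d["sign"] < 0}
kept_negatives = [e for e, util in utilities.items() if util >= 1]  # assumed threshold
print(utilities, kept_negatives)
```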

Open Knowledge Base Canonicalization with Multi-task Unlearning

  • paper_url: http://arxiv.org/abs/2310.16419
  • repo_url: None
  • paper_authors: Bingchen Liu, Shihao Hou, Weixin Zeng, Xiang Zhao, Shijun Liu, Li Pan
  • for: 开放知识库(OKB)规范化,并在规范化过程中实现机器遗忘,以满足隐私保护和数据时效性的要求。
  • methods: 提出多任务遗忘框架MulCanon,利用扩散模型的噪声特性实现机器遗忘,并统一扩散模型、知识图嵌入(KGE)与聚类算法的学习目标,采用两步多任务学习范式进行训练。
  • results: 在常用的OKB规范化数据集上的实验表明,MulCanon能够取得先进的机器遗忘效果。
    Abstract The construction of large open knowledge bases (OKBs) is integral to many applications in the field of mobile computing. Noun phrases and relational phrases in OKBs often suffer from redundancy and ambiguity, which calls for the investigation on OKB canonicalization. However, in order to meet the requirements of some privacy protection regulations and to ensure the timeliness of the data, the canonicalized OKB often needs to remove some sensitive information or outdated data. The machine unlearning in OKB canonicalization is an excellent solution to the above problem. Current solutions address OKB canonicalization by devising advanced clustering algorithms and using knowledge graph embedding (KGE) to further facilitate the canonicalization process. Effective schemes are urgently needed to fully synergise machine unlearning with clustering and KGE learning. To this end, we put forward a multi-task unlearning framework, namely MulCanon, to tackle machine unlearning problem in OKB canonicalization. Specifically, the noise characteristics in the diffusion model are utilized to achieve the effect of machine unlearning for data in OKB. MulCanon unifies the learning objectives of diffusion model, KGE and clustering algorithms, and adopts a two-step multi-task learning paradigm for training. A thorough experimental study on popular OKB canonicalization datasets validates that MulCanon achieves advanced machine unlearning effects.
    摘要 大规模开放知识库(OKB)的构建对移动计算领域的许多应用至关重要。OKB中的名词短语和关系短语往往存在冗余和歧义,因此需要研究OKB规范化。然而,为了满足某些隐私保护法规的要求并保证数据的时效性,规范化后的OKB通常还需要删除一些敏感信息或过时数据,在OKB规范化中引入机器遗忘(machine unlearning)正是解决上述问题的一个很好的方案。现有方法通过设计先进的聚类算法并利用知识图嵌入(KGE)来进一步促进规范化过程,但仍亟需能够将机器遗忘与聚类和KGE学习充分协同的有效方案。为此,我们提出了一个多任务遗忘框架MulCanon,用于解决OKB规范化中的机器遗忘问题。具体而言,MulCanon利用扩散模型中的噪声特性来实现对OKB数据的机器遗忘;它统一了扩散模型、KGE和聚类算法的学习目标,并采用两步多任务学习范式进行训练。在常用的OKB规范化数据集上的充分实验表明,MulCanon能够取得先进的机器遗忘效果。

Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero

  • paper_url: http://arxiv.org/abs/2310.16410
  • repo_url: None
  • paper_authors: Lisa Schut, Nenad Tomasev, Tom McGrath, Demis Hassabis, Ulrich Paquet, Been Kim
  • for: 这篇论文旨在探讨如何利用高性能的人工智能系统(AlphaZero)中嵌入的隐藏知识,以提高人类专家性能。
  • methods: 作者提出了一种新的方法,用于从AlphaZero中提取新的棋类概念,并进行人类学习和评估。
  • results: 研究表明,AlphaZero可能嵌入的知识不仅超过了人类知识,还可以成功地传授给人类专家。在人类研究中,四位世界级国际象棋大师通过解决所给出的概念原型局面,得到了改进。
    Abstract Artificial Intelligence (AI) systems have made remarkable progress, attaining super-human performance across various domains. This presents us with an opportunity to further human knowledge and improve human expert performance by leveraging the hidden knowledge encoded within these highly performant AI systems. Yet, this knowledge is often hard to extract, and may be hard to understand or learn from. Here, we show that this is possible by proposing a new method that allows us to extract new chess concepts in AlphaZero, an AI system that mastered the game of chess via self-play without human supervision. Our analysis indicates that AlphaZero may encode knowledge that extends beyond the existing human knowledge, but knowledge that is ultimately not beyond human grasp, and can be successfully learned from. In a human study, we show that these concepts are learnable by top human experts, as four top chess grandmasters show improvements in solving the presented concept prototype positions. This marks an important first milestone in advancing the frontier of human knowledge by leveraging AI; a development that could bear profound implications and help us shape how we interact with AI systems across many AI applications.
    摘要 人工智能(AI)系统已经取得了很大的进步,在不同领域达到了人类超越性表现。这给我们提供了一个机会,通过利用AI系统中隐藏的知识来进一步推动人类知识和专家表现。然而,这些知识可能很难提取,并且可能很难理解或学习。在这里,我们提出了一种新的方法,可以从AlphaZero AI系统中提取新的棋盘概念。我们的分析表明,AlphaZero可能具有超越人类知识的知识,但这些知识并不是不可以被人类理解和学习的。在人类研究中,我们发现,这些概念可以被四位高级国际象棋大师学习,他们在给出的概念原型位置中解决问题的能力得到了改进。这标志着我们在利用AI推动人类知识的前夕,这可能会对许多AI应用程序产生深远的影响,并帮助我们如何与AI系统交互。

Challenges of Radio Frequency Fingerprinting: From Data Collection to Deployment

  • paper_url: http://arxiv.org/abs/2310.16406
  • repo_url: None
  • paper_authors: Saeif Alhazbi, Ahmed Hussain, Savio Sciancalepore, Gabriele Oligeri, Panos Papadimitratos
  • for: 本文旨在探讨Radio Frequency Fingerprinting(RFF)技术如何使用机器学习(ML)和深度学习(DL)来实现无线设备认证。
  • methods: 本文使用的方法包括RFF技术和ML/DL技术,并对这些技术的缺陷和挑战进行分析。
  • results: 本文的研究发现现有的RFF系统尚未能够在实际应用中使用,并且存在许多挑战和问题。未来的研究应该关注这些问题,以便实现RFF系统的真正应用。
    Abstract Radio Frequency Fingerprinting (RFF) techniques promise to authenticate wireless devices at the physical layer based on inherent hardware imperfections introduced during manufacturing. Such RF transmitter imperfections are reflected into over-the-air signals, allowing receivers to accurately identify the RF transmitting source. Recent advances in Machine Learning, particularly in Deep Learning (DL), have improved the ability of RFF systems to extract and learn complex features that make up the device-specific fingerprint. However, integrating DL techniques with RFF and operating the system in real-world scenarios presents numerous challenges. This article identifies and analyzes these challenges while considering the three reference phases of any DL-based RFF system: (i) data collection and preprocessing, (ii) training, and finally, (iii) deployment. Our investigation points out the current open problems that prevent real deployment of RFF while discussing promising future directions, thus paving the way for further research in the area.
    摘要 射频指纹(RFF)技术有望基于制造过程中引入的固有硬件瑕疵,在物理层对无线设备进行认证。这些射频发射机的瑕疵会反映在空中传输的信号中,使接收端能够准确识别射频发射源。近年来,机器学习尤其是深度学习(DL)的进步,提高了RFF系统提取和学习构成设备特有指纹的复杂特征的能力。然而,将DL技术与RFF相结合并在真实场景中运行该系统仍面临诸多挑战。本文围绕任何基于DL的RFF系统的三个参考阶段,即(一)数据收集与预处理、(二)训练、(三)部署,识别并分析了这些挑战。我们的调研指出了当前阻碍RFF真正落地的开放问题,并讨论了有前景的未来方向,为该领域的进一步研究铺平道路。

Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.16400
  • repo_url: None
  • paper_authors: Tianyi Lu, Xing Zhang, Jiaxi Gu, Hang Xu, Renjing Pei, Songcen Xu, Zuxuan Wu
  • for: 文章目的是提出一种无需训练的框架,以实现根据文本指导的视频编辑。
  • methods: 方法是在去噪过程中将图像LDM和视频LDM的latent进行融合,以保持视频LDM带来的时间一致性,同时充分利用图像LDM的高保真度。
  • results: 与现有方法相比,FLDM能够提升编辑视频的文本对齐度和时间一致性。
    Abstract Latent Diffusion Models (LDMs) are renowned for their powerful capabilities in image and video synthesis. Yet, video editing methods suffer from insufficient pre-training data or video-by-video re-training cost. In addressing this gap, we propose FLDM (Fused Latent Diffusion Model), a training-free framework to achieve text-guided video editing by applying off-the-shelf image editing methods in video LDMs. Specifically, FLDM fuses latents from an image LDM and an video LDM during the denoising process. In this way, temporal consistency can be kept with video LDM while high-fidelity from the image LDM can also be exploited. Meanwhile, FLDM possesses high flexibility since both image LDM and video LDM can be replaced so advanced image editing methods such as InstructPix2Pix and ControlNet can be exploited. To the best of our knowledge, FLDM is the first method to adapt off-the-shelf image editing methods into video LDMs for video editing. Extensive quantitative and qualitative experiments demonstrate that FLDM can improve the textual alignment and temporal consistency of edited videos.
    摘要 潜在扩散模型(LDM)以其在图像和视频生成方面的强大能力而著称。然而,视频编辑方法仍受限于预训练数据不足或逐视频重新训练的高成本。为填补这一空白,我们提出了FLDM(融合潜在扩散模型),一种无需训练的框架,通过在视频LDM中应用现成的图像编辑方法来实现文本引导的视频编辑。具体而言,FLDM在去噪过程中融合来自图像LDM和视频LDM的潜在特征,从而既保持视频LDM的时间一致性,又充分利用图像LDM的高保真度。同时,FLDM具有很高的灵活性,因为图像LDM和视频LDM都可以被替换,从而可以利用InstructPix2Pix和ControlNet等先进的图像编辑方法。据我们所知,FLDM是首个将现成图像编辑方法引入视频LDM以进行视频编辑的方法。大量的定量和定性实验表明,FLDM能够改善编辑视频的文本对齐度和时间一致性。
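The sketch below gives a schematic view of the latent fusion described above: at every denoising step the video-LDM latent (temporal consistency) is blended with an image-LDM latent edited frame by frame (fidelity). The two LDMs are replaced by trivial stubs, and the blend weight alpha is an assumed hyperparameter, so this is only a shape-level illustration of the idea, not the paper's exact formulation.

```python
# Blend video-LDM and image-LDM latents at each denoising step (stubs stand in for real models).
import numpy as np

rng = np.random.default_rng(0)
frames, channels, h, w = 8, 4, 32, 32
video_latent = rng.normal(size=(frames, channels, h, w))

def video_ldm_denoise(z, t):      # stub for one video-LDM denoising step
    return z * 0.98

def image_ldm_denoise(z, t):      # stub for one (editing) image-LDM step applied per frame
    return np.stack([f * 0.97 for f in z])

def fuse_denoise(z, steps=50, alpha=0.6):
    for t in reversed(range(steps)):
        z_vid = video_ldm_denoise(z, t)   # keeps temporal consistency
        z_img = image_ldm_denoise(z, t)   # carries the per-frame edit fidelity
        z = alpha * z_vid + (1.0 - alpha) * z_img
    return z

edited = fuse_denoise(video_latent)
print(edited.shape)  # (8, 4, 32, 32): one fused latent per frame
```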

Evaluating General-Purpose AI with Psychometrics

  • paper_url: http://arxiv.org/abs/2310.16379
  • repo_url: None
  • paper_authors: Xiting Wang, Liming Jiang, Jose Hernandez-Orallo, Luning Sun, David Stillwell, Fang Luo, Xing Xie
  • for: This paper aims to improve the evaluation of general-purpose AI systems by incorporating psychometrics, the science of psychological measurement.
  • methods: The proposed method uses psychometrics to identify and measure the latent constructs that underlie performance across multiple tasks, providing a more comprehensive and rigorous approach to evaluating AI systems.
  • results: The authors propose a framework for integrating psychometrics with AI and explore future opportunities for doing so, with the goal of improving the evaluation and understanding of general-purpose AI systems.
    Abstract Artificial intelligence (AI) has witnessed an evolution from task-specific to general-purpose systems that trend toward human versatility. As AI systems begin to play pivotal roles in society, it is important to ensure that they are adequately evaluated. Current AI benchmarks typically assess performance on collections of specific tasks. This has drawbacks when used for assessing general-purpose AI systems. First, it is difficult to predict whether AI systems could complete a new task it has never seen or that did not previously exist. Second, these benchmarks often focus on overall performance metrics, potentially overlooking the finer details crucial for making informed decisions. Lastly, there are growing concerns about the reliability of existing benchmarks and questions about what is being measured. To solve these challenges, this paper suggests that psychometrics, the science of psychological measurement, should be placed at the core of evaluating general-purpose AI. Psychometrics provides a rigorous methodology for identifying and measuring the latent constructs that underlie performance across multiple tasks. We discuss its merits, warn against potential pitfalls, and propose a framework for putting it into practice. Finally, we explore future opportunities to integrate psychometrics with AI.
    摘要 人工智能(AI)已从面向特定任务的系统演变为趋向人类多面性的通用系统。随着AI系统在社会中扮演关键角色,确保它们得到充分评估变得尤为重要。现有的AI基准通常在一组特定任务上评估性能,这在评估通用AI系统时存在不足:首先,难以预测AI系统能否完成其从未见过、甚至此前并不存在的新任务;其次,这些基准往往只关注总体性能指标,可能忽略做出明智决策所需的细节;最后,人们对现有基准的可靠性以及其究竟在衡量什么日益担忧。为解决这些挑战,本文提出应将心理测量学(关于心理测量的科学)置于通用AI评估的核心。心理测量学提供了一套严谨的方法,用于识别和测量支撑多任务表现的潜在构念。我们讨论了其优点,警示了潜在的陷阱,并提出了将其付诸实践的框架。最后,我们探讨了将心理测量学与AI相结合的未来机会。

GADY: Unsupervised Anomaly Detection on Dynamic Graphs

  • paper_url: http://arxiv.org/abs/2310.16376
  • repo_url: None
  • paper_authors: Shiqi Lou, Qingyue Zhang, Shujie Yang, Yuyang Tian, Zhaoxuan Tan, Minnan Luo
  • for: 检测动态图中异常行为,即在图和其时间信息中检测entity的行为异常。
  • methods: 我们提出了一种无监督的生成式动态图异常检测方法(GADY),用于解决现有方法面临的动态结构构建挑战和负样本生成挑战。
  • results: GADY在三个真实世界数据集上显著优于此前的最先进方法。
    Abstract Anomaly detection on dynamic graphs refers to detecting entities whose behaviors obviously deviate from the norms observed within graphs and their temporal information. This field has drawn increasing attention due to its application in finance, network security, social networks, and more. However, existing methods face two challenges: dynamic structure constructing challenge - difficulties in capturing graph structure with complex time information and negative sampling challenge - unable to construct excellent negative samples for unsupervised learning. To address these challenges, we propose Unsupervised Generative Anomaly Detection on Dynamic Graphs (GADY). To tackle the first challenge, we propose a continuous dynamic graph model to capture the fine-grained information, which breaks the limit of existing discrete methods. Specifically, we employ a message-passing framework combined with positional features to get edge embeddings, which are decoded to identify anomalies. For the second challenge, we pioneer the use of Generative Adversarial Networks to generate negative interactions. Moreover, we design a loss function to alter the training goal of the generator while ensuring the diversity and quality of generated samples. Extensive experiments demonstrate that our proposed GADY significantly outperforms the previous state-of-the-art method on three real-world datasets. Supplementary experiments further validate the effectiveness of our model design and the necessity of each module.
    摘要 动态图上的异常检测是指结合图结构及其时间信息,检测行为明显偏离常态的实体。由于其在金融、网络安全、社交网络等领域的应用,这一方向受到越来越多的关注。然而,现有方法面临两个挑战:一是动态结构构建挑战,即难以刻画带有复杂时间信息的图结构;二是负采样挑战,即无法为无监督学习构建高质量的负样本。针对这两个挑战,我们提出了无监督生成式动态图异常检测方法(GADY)。针对第一个挑战,我们提出了连续动态图模型来捕捉细粒度信息,突破了现有离散方法的限制:具体地,我们采用消息传递框架并结合位置特征得到边嵌入,再通过解码来识别异常。针对第二个挑战,我们率先使用生成对抗网络来生成负交互,并设计了损失函数来调整生成器的训练目标,同时保证生成样本的多样性与质量。大量实验表明,GADY在三个真实世界数据集上显著优于此前的最先进方法,补充实验进一步验证了模型设计及各模块的必要性。

InstructPTS: Instruction-Tuning LLMs for Product Title Summarization

  • paper_url: http://arxiv.org/abs/2310.16361
  • repo_url: None
  • paper_authors: Besnik Fetahu, Zhiyu Chen, Oleg Rokhlenko, Shervin Malmasi
  • for: 这个论文旨在提高电子商务产品目录中的商品标题概要,以便更好地支持推荐、问答和评论摘要等功能。
  • methods: 这篇论文提出了一种可控的产品标题摘要方法,基于最近的指令微调(instruction tuning)技术。该方法可以根据不同的标准(例如摘要字数、是否包含特定短语等)生成相应的产品标题摘要。
  • results: 对实际电子商务目录进行了广泛的评估,结果显示,与简单的精度练习方法相比,该方法可以生成更准确的产品名称概要,提高了14和8个BLEU和ROUGE分数。
    Abstract E-commerce product catalogs contain billions of items. Most products have lengthy titles, as sellers pack them with product attributes to improve retrieval, and highlight key product aspects. This results in a gap between such unnatural products titles, and how customers refer to them. It also limits how e-commerce stores can use these seller-provided titles for recommendation, QA, or review summarization. Inspired by recent work on instruction-tuned LLMs, we present InstructPTS, a controllable approach for the task of Product Title Summarization (PTS). Trained using a novel instruction fine-tuning strategy, our approach is able to summarize product titles according to various criteria (e.g. number of words in a summary, inclusion of specific phrases, etc.). Extensive evaluation on a real-world e-commerce catalog shows that compared to simple fine-tuning of LLMs, our proposed approach can generate more accurate product name summaries, with an improvement of over 14 and 8 BLEU and ROUGE points, respectively.
    摘要 电商产品目录包含数十亿个商品。大多数商品标题都很长,因为卖家会在其中堆砌商品属性以提高检索效果并突出关键卖点。这导致这些不自然的商品标题与顾客对商品的称呼之间存在差距,也限制了电商平台将卖家提供的标题用于推荐、问答或评论摘要等场景。受最近指令微调大语言模型(LLM)研究的启发,我们提出了InstructPTS,一种面向产品标题摘要(PTS)任务的可控方法。该方法采用一种新颖的指令微调策略进行训练,能够按照多种标准(例如摘要字数、是否包含特定短语等)对产品标题进行摘要。在真实电商目录上的大量评估表明,与对LLM进行简单微调相比,我们的方法能够生成更准确的产品名称摘要,BLEU和ROUGE分数分别提升超过14和8个点。
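As a rough sketch of how instruction-tuning pairs with controllable criteria might be assembled for product title summarization, the snippet below builds (instruction, output) records from one product title. The instruction templates, criteria, and example record are assumptions for illustration; the paper's actual instruction set is not reproduced here.

```python
# Assemble instruction-tuning examples with per-example summarization constraints.
def make_examples(title: str, summaries: dict) -> list[dict]:
    templates = {
        "word_limit": "Summarize this product title in at most {k} words: {title}",
        "keep_phrase": "Summarize this product title, keeping the phrase '{phrase}': {title}",
    }
    examples = []
    for criterion, target in summaries.items():
        kind, arg = criterion
        prompt = templates[kind].format(title=title, k=arg, phrase=arg)
        examples.append({"instruction": prompt, "output": target})
    return examples

title = "Acme Stainless Steel Insulated Water Bottle 32oz, Leak-Proof, BPA-Free, for Hiking"
summaries = {
    ("word_limit", 4): "Acme Insulated Water Bottle",
    ("keep_phrase", "Leak-Proof"): "Acme Leak-Proof Water Bottle",
}
for ex in make_examples(title, summaries):
    print(ex)
```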

  • paper_url: http://arxiv.org/abs/2310.16360
  • repo_url: None
  • paper_authors: Osim Kumar Pal, Md Sakib Hossain Shovon, M. F. Mridha, Jungpil Shin
  • for: This paper explores the potential of AI-powered UAVs in various applications, including navigation, object detection and tracking, wildlife monitoring, precision agriculture, rescue operations, surveillance, and communication among UAVs.
  • methods: The paper examines the use of AI techniques such as machine learning, computer vision, and deep learning to enable these applications, and discusses the challenges and limitations of these approaches.
  • results: The study highlights the potential of AI-powered UAVs to revolutionize industries such as agriculture, surveillance, and disaster management, while also raising ethical and safety concerns that need to be addressed.
    Abstract In recent years, the combination of artificial intelligence (AI) and unmanned aerial vehicles (UAVs) has brought about advancements in various areas. This comprehensive analysis explores the changing landscape of AI-powered UAVs and friendly computing in their applications. It covers emerging trends, futuristic visions, and the inherent challenges that come with this relationship. The study examines how AI plays a role in enabling navigation, detecting and tracking objects, monitoring wildlife, enhancing precision agriculture, facilitating rescue operations, conducting surveillance activities, and establishing communication among UAVs using environmentally conscious computing techniques. By delving into the interaction between AI and UAVs, this analysis highlights the potential for these technologies to revolutionise industries such as agriculture, surveillance practices, disaster management strategies, and more. While envisioning possibilities, it also takes a look at ethical considerations, safety concerns, regulatory frameworks to be established, and the responsible deployment of AI-enhanced UAV systems. By consolidating insights from research endeavours in this field, this review provides an understanding of the evolving landscape of AI-powered UAVs while setting the stage for further exploration in this transformative domain.
    摘要 The study examines how AI enables navigation, object detection and tracking, wildlife monitoring, precision agriculture, rescue operations, surveillance activities, and communication among UAVs using environmentally conscious computing techniques. By exploring the interaction between AI and UAVs, this analysis highlights the potential for these technologies to revolutionize industries such as agriculture, surveillance practices, disaster management strategies, and more.However, the analysis also considers ethical considerations, safety concerns, and regulatory frameworks to be established for the responsible deployment of AI-enhanced UAV systems. By consolidating insights from research in this field, this review provides an understanding of the evolving landscape of AI-powered UAVs and sets the stage for further exploration in this transformative domain.

AccoMontage-3: Full-Band Accompaniment Arrangement via Sequential Style Transfer and Multi-Track Function Prior

  • paper_url: http://arxiv.org/abs/2310.16334
  • repo_url: https://github.com/zhaojw1998/accomontage-3
  • paper_authors: Jingwei Zhao, Gus Xia, Ye Wang
  • for: 本研究旨在开发一种符号音乐自动化系统,可以基于输入的主题旋律和和声生成多轨、全乐队伴奏。
  • methods: 该系统包括三个模块,分别建模全乐队作曲的不同方面。第一个模块是钢琴编配器,通过潜在的和弦-织体解耦以及对织体素材的启发式检索,将织体风格迁移到和弦上,为主旋律谱生成钢琴伴奏。第二个模块是乐队编配器,根据各声部功能所编码的配器风格,将钢琴伴奏扩展为全乐队编配。第三个模块是连接前两者的先验模型,用于刻画整首乐曲配器风格的全局结构。
  • results: 实验表明,该系统显著优于基线,而且模块化设计提供了具有音乐意义的有效控制。
    Abstract We propose AccoMontage-3, a symbolic music automation system capable of generating multi-track, full-band accompaniment based on the input of a lead melody with chords (i.e., a lead sheet). The system contains three modular components, each modelling a vital aspect of full-band composition. The first component is a piano arranger that generates piano accompaniment for the lead sheet by transferring texture styles to the chords using latent chord-texture disentanglement and heuristic retrieval of texture donors. The second component orchestrates the piano accompaniment score into full-band arrangement according to the orchestration style encoded by individual track functions. The third component, which connects the previous two, is a prior model characterizing the global structure of orchestration style over the whole piece of music. From end to end, the system learns to generate full-band accompaniment in a self-supervised fashion, applying style transfer at two levels of polyphonic composition: texture and orchestration. Experiments show that our system outperforms the baselines significantly, and the modular design offers effective controls in a musically meaningful way.
    摘要 我们提出AccoMontage-3,一种符号音乐自动化系统,能够根据输入的带和弦主旋律(即主旋律谱)生成多轨、全乐队伴奏。系统包含三个模块,分别建模全乐队作曲中的一个重要方面。第一个模块是钢琴编配器,借助潜在的和弦-织体解耦与对织体素材的启发式检索,将织体风格迁移到和弦上,生成钢琴伴奏。第二个模块根据各声部功能所编码的配器风格,将钢琴伴奏谱编配为全乐队演奏。第三个模块连接前两者,是刻画整首乐曲配器风格全局结构的先验模型。系统以端到端、自监督的方式学习生成全乐队伴奏,并在织体与配器两个复调作曲层面上应用风格迁移。实验结果表明,我们的系统显著优于基线,而模块化设计还提供了具有音乐意义的有效控制方式。

CoheSentia: A Novel Benchmark of Incremental versus Holistic Assessment of Coherence in Generated Texts

  • paper_url: http://arxiv.org/abs/2310.16329
  • repo_url: None
  • paper_authors: Aviya Maimon, Reut Tsarfaty
  • for: The paper aims to introduce a novel benchmark for assessing the human-perceived coherence of automatically generated texts.
  • methods: The paper uses two annotation protocols to assess coherence: a global protocol that assigns a single coherence score, and an incremental protocol that scores sentence by sentence and pinpoints reasons for incoherence.
  • results: The paper shows that the inter-annotator agreement in the incremental mode is higher than in the holistic alternative, and that standard language models fine-tuned for coherence detection show varied performance on the different factors contributing to (in)coherence. The results emphasize the need for developing more reliable methods for coherence assessment.
    Abstract Coherence is a linguistic term that refers to the relations between small textual units (sentences, propositions), which make the text logically consistent and meaningful to the reader. With the advances of generative foundational models in NLP, there is a pressing need to automatically assess the human-perceived coherence of automatically generated texts. Up until now, little work has been done on explicitly assessing the coherence of generated texts and analyzing the factors contributing to (in)coherence. Previous work on the topic used other tasks, e.g., sentence reordering, as proxies of coherence, rather than approaching coherence detection heads on. In this paper, we introduce {\sc CoheSentia}, a novel benchmark of human-perceived coherence of automatically generated texts. Our annotation protocol reflects two perspectives; one is global, assigning a single coherence score, and the other is incremental, scoring sentence by sentence. The incremental method produces an (in)coherence score for each text fragment and also pinpoints reasons for incoherence at that point. Our benchmark contains 500 automatically-generated and human-annotated paragraphs, each annotated in both methods, by multiple raters. Our analysis shows that the inter-annotator agreement in the incremental mode is higher than in the holistic alternative, and our experiments show that standard LMs fine-tuned for coherence detection show varied performance on the different factors contributing to (in)coherence. All in all, these models yield unsatisfactory performance, emphasizing the need for developing more reliable methods for coherence assessment.
    摘要 “连贯性”是一个语言学术语,指小文本单位(句子、命题)之间的关系,它使文本在逻辑上前后一致,并对读者具有意义。随着NLP中生成式基础模型的发展,亟需自动评估自动生成文本在人类感知上的连贯性。迄今为止,针对生成文本连贯性的显式评估及其(不)连贯成因的分析工作还很少;以往工作多以句子重排等其他任务作为连贯性的代理,而非直接面向连贯性检测。在本文中,我们提出了CoheSentia,一个关于自动生成文本人类感知连贯性的新基准。我们的标注协议包含两种视角:一种是整体式的,为全文给出单一连贯性分数;另一种是增量式的,逐句打分。增量式方法为每个文本片段给出(不)连贯分数,并指出该处不连贯的原因。该基准包含500个自动生成并由多名标注者以两种方式进行人工标注的段落。我们的分析显示,增量式标注的标注者间一致性高于整体式标注;实验还表明,为连贯性检测微调的标准语言模型在造成(不)连贯的不同因素上表现参差不齐。总体而言,这些模型的表现尚不令人满意,凸显了开发更可靠的连贯性评估方法的必要性。

Modality-Agnostic Self-Supervised Learning with Meta-Learned Masked Auto-Encoder

  • paper_url: http://arxiv.org/abs/2310.16318
  • repo_url: https://github.com/alinlab/MetaMAE
  • paper_authors: Huiwon Jang, Jihoon Tack, Daewon Choi, Jongheon Jeong, Jinwoo Shin
  • for: 本文旨在提出一种模态无关的自监督学习(SSL)框架,使同一方法能够在多种模态上进行学习。
  • methods: 本文使用Masked Auto-Encoder(MAE)架构,并以元学习的视角将MAE解释为模态无关的学习器。我们引入两种先进的元学习技术:首先,利用基于梯度的元学习调整摊销(amortized)的潜在表示以增强重建;其次,通过任务对比学习使摊销潜在表示与适应后的潜在表示对齐,从而更好地编码任务特定知识。
  • results: 我们在模态无关SSL基准DABS上进行了实验,证明MetaMAE显著超越了先前的基线。
    Abstract Despite its practical importance across a wide range of modalities, recent advances in self-supervised learning (SSL) have been primarily focused on a few well-curated domains, e.g., vision and language, often relying on their domain-specific knowledge. For example, Masked Auto-Encoder (MAE) has become one of the popular architectures in these domains, but less has explored its potential in other modalities. In this paper, we develop MAE as a unified, modality-agnostic SSL framework. In turn, we argue meta-learning as a key to interpreting MAE as a modality-agnostic learner, and propose enhancements to MAE from the motivation to jointly improve its SSL across diverse modalities, coined MetaMAE as a result. Our key idea is to view the mask reconstruction of MAE as a meta-learning task: masked tokens are predicted by adapting the Transformer meta-learner through the amortization of unmasked tokens. Based on this novel interpretation, we propose to integrate two advanced meta-learning techniques. First, we adapt the amortized latent of the Transformer encoder using gradient-based meta-learning to enhance the reconstruction. Then, we maximize the alignment between amortized and adapted latents through task contrastive learning which guides the Transformer encoder to better encode the task-specific knowledge. Our experiment demonstrates the superiority of MetaMAE in the modality-agnostic SSL benchmark (called DABS), significantly outperforming prior baselines. Code is available at https://github.com/alinlab/MetaMAE.
    摘要 尽管自监督学习(SSL)在众多模态上都具有实际重要性,但近来的进展主要集中在视觉和语言等少数经过精心整理的领域,并往往依赖这些领域特有的知识。例如,掩码自编码器(MAE)已成为这些领域的流行架构之一,但其在其他模态上的潜力尚未得到充分探索。在本文中,我们将MAE发展为一个统一的、模态无关的SSL框架。进而,我们认为元学习是将MAE解释为模态无关学习器的关键,并从联合提升其跨模态SSL能力的动机出发对MAE进行改进,由此得到MetaMAE。我们的核心想法是把MAE的掩码重建视为一个元学习任务:被掩码的token由Transformer元学习器通过对未掩码token的摊销(amortization)进行自适应预测。基于这一新的解释,我们融合了两种先进的元学习技术:首先,利用基于梯度的元学习来调整Transformer编码器产生的摊销潜在表示,以增强重建;然后,通过任务对比学习最大化摊销潜在表示与适应后潜在表示之间的对齐,引导Transformer编码器更好地编码任务特定知识。实验表明,MetaMAE在模态无关SSL基准DABS上显著优于先前的基线。代码可在 https://github.com/alinlab/MetaMAE 获取。

Sum-of-Parts Models: Faithful Attributions for Groups of Features

  • paper_url: http://arxiv.org/abs/2310.16316
  • repo_url: https://github.com/debugml/sop
  • paper_authors: Weiqiu You, Helen Qu, Marco Gatti, Bhuvnesh Jain, Eric Wong
  • for: 这个论文目的是为了提供一种可靠的机器学习模型解释方法,帮助astrophysicists更好地理解星系形成过程。
  • methods: 该论文使用Sum-of-Parts(SOP)模型,该模型可以提供可解释的分组特征贡献,帮助找到星系形成中具有重要作用的特征。
  • results: 在标准解释指标上评估SOP模型,以及在一个实际案例中,使用SOP模型提供的可信的解释来帮助astrophysicists发现新的星系形成知识。
    Abstract An explanation of a machine learning model is considered "faithful" if it accurately reflects the model's decision-making process. However, explanations such as feature attributions for deep learning are not guaranteed to be faithful, and can produce potentially misleading interpretations. In this work, we develop Sum-of-Parts (SOP), a class of models whose predictions come with grouped feature attributions that are faithful-by-construction. This model decomposes a prediction into an interpretable sum of scores, each of which is directly attributable to a sparse group of features. We evaluate SOP on benchmarks with standard interpretability metrics, and in a case study, we use the faithful explanations from SOP to help astrophysicists discover new knowledge about galaxy formation.
    摘要 如果一个机器学习模型的解释准确反映了模型的决策过程,这个解释就被认为是“忠实的”。然而,对深度学习而言,特征归因等解释并不能保证忠实,可能产生具有误导性的解读。在这项工作中,我们提出了Sum-of-Parts(SOP)模型,这类模型的预测自带按组划分的特征归因,并且在构造上即是忠实的。该模型将预测分解为一个可解释的分数之和,每个分数都可直接归因于一个稀疏的特征组。我们用标准的可解释性指标在基准上评估了SOP;并在一个案例研究中,利用SOP给出的忠实解释帮助天体物理学家发现了有关星系形成的新知识。
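The toy sketch below illustrates the Sum-of-Parts structure described above: the prediction is the sum of scores, each computed only from a sparse group of features, so the decomposition is faithful by construction. The feature groups, linear scorers, and random input are assumptions; the actual SOP models learn these components.

```python
# Prediction as an interpretable sum of per-group scores (faithful by construction).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10)                        # one input with 10 features

groups = [np.array([0, 1, 2]), np.array([3, 4]), np.array([7, 8, 9])]  # sparse feature groups
scorers = [rng.normal(size=len(g)) for g in groups]                    # per-group weights

def sop_predict(x, groups, scorers):
    part_scores = [float(w @ x[g]) for g, w in zip(groups, scorers)]
    return sum(part_scores), part_scores       # prediction and its faithful decomposition

pred, parts = sop_predict(x, groups, scorers)
print(pred, parts)   # each part is directly attributable to its feature group
```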

Instance-wise Linearization of Neural Network for Model Interpretation

  • paper_url: http://arxiv.org/abs/2310.16295
  • repo_url: None
  • paper_authors: Zhimin Li, Shusen Liu, Kailkhura Bhavya, Timo Bremer, Valerio Pascucci
  • for: 这篇论文主要研究如何解释神经网络模型利用输入特征做出预测的方式,以及如何从模型中得到可靠的特征归因。
  • methods: 该论文提出了一种实例级线性化方法,将神经网络模型对某个输入的前向计算过程重新表述为线性矩阵乘法,把各层卷积运算折叠为一个局部线性映射。
  • results: 应用实例级线性化方法后,神经网络对单个输入的预测可以用线性矩阵乘法方程 F(x) = W · x + b 来描述;该方程不仅给出突出输入特征重要性的特征归因图,还能精确说明每个输入特征对预测的贡献。
    Abstract Neural networks have achieved remarkable successes in many scientific fields. However, the interpretability of neural network models remains a major bottleneck to deploying such techniques in our daily lives. The challenge stems from the non-linear behavior of neural networks, which raises a critical question: how does a model use its input features to make a decision? The classical approach to this challenge is feature attribution, which assigns an importance score to each input feature, revealing its importance for the current prediction. However, current feature attribution approaches often indicate the importance of each input feature without detailing how the features are actually processed inside the model. These attribution approaches therefore raise the concern of whether they highlight the correct features for a model's prediction. For a neural network model, the non-linear behavior is often caused by its non-linear activation units. However, the computation behind a single prediction is locally linear, because one prediction involves only one activation pattern. Based on this observation, we propose an instance-wise linearization approach that reformulates the forward computation process of a neural network prediction. This approach reformulates different layers of convolutional neural networks as linear matrix multiplications. Aggregating all layers' computations, the complex operations behind a convolutional neural network prediction can be described as a linear matrix multiplication $F(x) = W \cdot x + b$. This equation not only provides a feature attribution map that highlights the importance of the input features but also tells exactly how each input feature contributes to the prediction. Furthermore, we discuss the application of this technique in both supervised classification and unsupervised parametric t-SNE dimension reduction.
    摘要 Current feature attribution methods provide important scores for each input feature but do not reveal how the features are processed internally by the model. This raises concerns about whether these methods are highlighting the correct features for the model's predictions.In neural networks, non-linear behavior is often caused by non-linear activation units. However, the computation process of a prediction is locally linear, as each prediction has only one activation pattern. Based on this observation, we propose an instance-wise linearization approach that reformulates the forward computation process of a neural network prediction. This approach transforms different layers of convolutional neural networks into linear matrix multiplication. By aggregating all layers' computations, a complex convolutional neural network operation can be described as a linear matrix multiplication equation: $F(x) = W \cdot x + b$.This equation not only provides a feature attribution map that highlights the importance of the input features but also reveals exactly how each input feature contributes to the prediction. Furthermore, we discuss the application of this technique in both supervised classification and unsupervised neural network learning, including parametric t-SNE dimension reduction.
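The following sketch works through the locally linear view for a small two-layer ReLU network: the activation pattern at a given input defines a diagonal mask, and folding it into the weights yields the exact local form F(x) = W·x + b described above. The random weights and input stand in for a trained model; the paper applies the same idea layer by layer to convolutional networks.

```python
# Instance-wise linearization of a two-layer ReLU MLP via its activation pattern.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2, b2 = rng.normal(size=(3, 16)), rng.normal(size=3)

def forward(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def instancewise_linearization(x):
    mask = (W1 @ x + b1 > 0).astype(float)        # activation pattern at this input
    D = np.diag(mask)
    W_eff = W2 @ D @ W1                           # effective linear weights
    b_eff = W2 @ D @ b1 + b2                      # effective bias
    return W_eff, b_eff

x = rng.normal(size=8)
W_eff, b_eff = instancewise_linearization(x)
print(np.allclose(forward(x), W_eff @ x + b_eff))  # True: locally exact
# Each column of W_eff shows exactly how the corresponding input feature
# contributes to each output logit for this particular instance.
```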

XFEVER: Exploring Fact Verification across Languages

  • paper_url: http://arxiv.org/abs/2310.16278
  • repo_url: https://github.com/nii-yamagishilab/xfever
  • paper_authors: Yi-Chen Chang, Canasai Kruengkrai, Junichi Yamagishi
  • for: 本研究设计了跨语言事实提取与验证(XFEVER)数据集,用于评估不同语言上的事实验证模型。
  • methods: 本研究使用机器翻译将英语的声明和证据文本翻译成六种语言,并将训练和开发集用机器翻译,而测试集则包括专业翻译和机器翻译的文本。
  • results: 实验结果显示,使用多语言语言模型可以快速建立不同语言的事实验证模型,但表现因语言而异,英语的表现较佳。此外,我们发现可以有效地消除模型误偏,通过考虑英语和目标语言之间的预测相似性。
    Abstract This paper introduces the Cross-lingual Fact Extraction and VERification (XFEVER) dataset designed for benchmarking the fact verification models across different languages. We constructed it by translating the claim and evidence texts of the Fact Extraction and VERification (FEVER) dataset into six languages. The training and development sets were translated using machine translation, whereas the test set includes texts translated by professional translators and machine-translated texts. Using the XFEVER dataset, two cross-lingual fact verification scenarios, zero-shot learning and translate-train learning, are defined, and baseline models for each scenario are also proposed in this paper. Experimental results show that the multilingual language model can be used to build fact verification models in different languages efficiently. However, the performance varies by language and is somewhat inferior to the English case. We also found that we can effectively mitigate model miscalibration by considering the prediction similarity between the English and target languages. The XFEVER dataset, code, and model checkpoints are available at https://github.com/nii-yamagishilab/xfever.
    摘要 本文介绍了跨语言事实提取与验证(XFEVER)数据集,用于评测不同语言上的事实验证模型。我们将事实提取与验证(FEVER)数据集中的声明和证据文本翻译成六种语言来构建该数据集。训练集和开发集使用机器翻译进行翻译,测试集则既包括由专业译者翻译的文本,也包括机器翻译的文本。在本文中,我们定义了跨语言事实验证的两种场景:零样本学习和translate-train学习,并为每个场景提出了基线模型。实验结果表明,可以利用多语言语言模型高效地构建不同语言的事实验证模型,但性能因语言而异,且略逊于英语情形。我们还发现,通过考虑英语与目标语言之间的预测相似性,可以有效缓解模型的校准偏差。XFEVER数据集、代码和模型检查点可在 https://github.com/nii-yamagishilab/xfever 获取。

Bayesian Domain Invariant Learning via Posterior Generalization of Parameter Distributions

  • paper_url: http://arxiv.org/abs/2310.16277
  • repo_url: None
  • paper_authors: Shiyu Shen, Bin Pan, Tianyang Shi, Tao Li, Zhenwei Shi
  • for: 这篇论文的目的是学习能够在不同训练领域中提取不变特征的模型,从而对未见过的目标领域获得更好的泛化能力。
  • methods: 这篇论文基于贝叶斯神经网络直接学习领域不变的参数后验分布,并将注意力集中在参数分布(而非特征分布)的对齐上。
  • results: 这篇论文提出了一个名为 PosTerior Generalization(PTG)的简单而有效的方法,用于估计不变的参数分布;它可以将任何现有的领域泛化方法作为先验并与之结合,以进一步提升性能。PTG 在 DomainBed 的多个领域泛化基准上表现出有竞争力的结果。
    Abstract Domain invariant learning aims to learn models that extract invariant features over various training domains, resulting in better generalization to unseen target domains. Recently, Bayesian Neural Networks have achieved promising results in domain invariant learning, but most works concentrate on aligning features distributions rather than parameter distributions. Inspired by the principle of Bayesian Neural Network, we attempt to directly learn the domain invariant posterior distribution of network parameters. We first propose a theorem to show that the invariant posterior of parameters can be implicitly inferred by aggregating posteriors on different training domains. Our assumption is more relaxed and allows us to extract more domain invariant information. We also propose a simple yet effective method, named PosTerior Generalization (PTG), that can be used to estimate the invariant parameter distribution. PTG fully exploits variational inference to approximate parameter distributions, including the invariant posterior and the posteriors on training domains. Furthermore, we develop a lite version of PTG for widespread applications. PTG shows competitive performance on various domain generalization benchmarks on DomainBed. Additionally, PTG can use any existing domain generalization methods as its prior, and combined with previous state-of-the-art method the performance can be further improved. Code will be made public.
    摘要 领域不变学习旨在学习能够在不同训练领域中提取不变特征的模型,从而对未见过的目标领域获得更好的泛化能力。近来,贝叶斯神经网络在领域不变学习中取得了可喜的成果,但大多数工作都侧重于对齐特征分布,而不是参数分布。受贝叶斯神经网络原理的启发,我们尝试直接学习网络参数的领域不变后验分布。我们首先给出一个定理,说明参数的不变后验可以通过聚合不同训练领域上的后验来隐式推断;我们的假设更为宽松,因而能够提取更多的领域不变信息。我们还提出了一种简单而有效的方法 PosTerior Generalization(PTG),用于估计不变的参数分布。PTG充分利用变分推断来近似参数分布,包括不变后验以及各训练领域上的后验。此外,我们还为更广泛的应用开发了一个轻量版PTG。PTG在DomainBed的多个领域泛化基准上表现出有竞争力的结果。另外,PTG可以将任何现有的领域泛化方法作为先验,与先前最先进的方法结合后性能可进一步提升。代码将公开。
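As a toy illustration of aggregating per-domain posteriors, the snippet below combines Gaussian approximations of one weight's posterior from three training domains with a precision-weighted (product-of-Gaussians) rule. This closed form is an assumption chosen for clarity; PTG itself uses variational inference over whole networks rather than this per-weight shortcut.

```python
# Precision-weighted aggregation of per-domain Gaussian posteriors for one weight.
import numpy as np

# posterior mean and variance of one network weight, estimated on 3 training domains
means = np.array([0.52, 0.47, 0.55])
variances = np.array([0.04, 0.06, 0.05])

precisions = 1.0 / variances
agg_var = 1.0 / precisions.sum()
agg_mean = agg_var * (precisions * means).sum()

print("aggregated mean:", agg_mean)       # pulled toward low-variance domains
print("aggregated variance:", agg_var)    # tighter than any single-domain posterior
```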

Using GPT-4 to Augment Unbalanced Data for Automatic Scoring

  • paper_url: http://arxiv.org/abs/2310.18365
  • repo_url: None
  • paper_authors: Luyang Fang, Gyeong-Geon Lee, Xiaoming Zhai
  • for: 这个研究是为了解决自动评分中学生回答不均匀的问题,使用GPT-4大语言模型进行资料增强。
  • methods: 研究使用GPT-4為少數評分類別生成與學生書面回答相似的回應以增強資料,然後以增強後的資料微調DistillBERT進行自動評分。
  • results: 研究发现,将GPT-4增强的数据融入到自动评分模型中,可以提高精度、准确性、回传率和F1分数,特别是在较少的评分类别中。且不同于原始数据的比例,需要不同的增强数据量以获得稳定的改善。此外,与学生写作增强数据相比,GPT-4增强的评分模型表现更好或相等。
    Abstract Machine learning-based automatic scoring can be challenging if students' responses are unbalanced across scoring categories, as it introduces uncertainty in the machine training process. To meet this challenge, we introduce a novel text data augmentation framework leveraging GPT-4, a generative large language model, specifically tailored for unbalanced datasets in automatic scoring. Our experimental dataset comprised student written responses to two science items. We crafted prompts for GPT-4 to generate responses resembling student written answers, particularly for the minority scoring classes, to augment the data. We then finetuned DistillBERT for automatic scoring based on the augmented and original datasets. Model performance was assessed using accuracy, precision, recall, and F1 metrics. Our findings revealed that incorporating GPT-4-augmented data remarkedly improved model performance, particularly for precision, recall, and F1 scores. Interestingly, the extent of improvement varied depending on the specific dataset and the proportion of augmented data used. Notably, we found that a varying amount of augmented data (5\%-40\%) was needed to obtain stable improvement for automatic scoring. We also compared the accuracies of models trained with GPT-4 augmented data to those trained with additional student-written responses. Results suggest that the GPT-4 augmented scoring models outperform or match the models trained with student-written augmented data. This research underscores the potential and effectiveness of data augmentation techniques utilizing generative large language models--GPT-4 in addressing unbalanced datasets within automated assessment.
    摘要 当学生回答在各评分类别上分布不均衡时,基于机器学习的自动评分会变得困难,因为这会给模型训练过程带来不确定性。为应对这一挑战,我们提出了一个基于GPT-4(一种生成式大语言模型)的文本数据增强框架,专门面向自动评分中的不均衡数据集。我们的实验数据包括学生对两道科学题目的书面回答。我们为GPT-4设计提示词,使其生成与学生书面回答相似的回应(尤其针对少数评分类别)以增强数据,随后基于增强后与原始数据对DistillBERT进行微调以实现自动评分,并使用准确率、精确率、召回率和F1指标评估模型表现。结果显示,引入GPT-4增强数据显著提升了模型表现,尤其是精确率、召回率和F1分数;提升幅度取决于具体数据集以及所使用增强数据的比例,通常需要5%-40%不等的增强数据才能获得稳定的改进。我们还将使用GPT-4增强数据训练的模型与使用额外学生书面回答训练的模型进行了比较,结果表明前者的表现优于或不逊于后者。这项研究凸显了利用生成式大语言模型GPT-4进行数据增强,以应对自动评估中不均衡数据集的潜力与有效性。
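The sketch below shows one way the augmentation loop described above could be organized: for each under-represented score level, a GPT-4 prompt asks for a new response in the style of existing student answers, and the synthetic responses are appended before fine-tuning the scorer. The prompt wording, the call_gpt4 stand-in, and the per-class budget are assumptions, not the authors' exact setup.

```python
# Generate minority-class student responses with an LLM and append them to the training set.
def call_gpt4(prompt: str) -> str:
    # Stand-in for a real GPT-4 API call; a fixed string keeps the sketch runnable.
    return "The metal plate passes its heat into the ice much faster than plastic does."

item_stem = "Explain why an ice cube melts faster on a metal plate than on a plastic plate."
train = [{"text": "Metal moves heat to the ice quicker.", "score": 2},
         {"text": "Because it is cold.", "score": 0}]

minority_scores = [0, 1]      # under-represented score levels in this toy set
per_class_budget = 3          # synthetic responses to add per level (assumed)

def build_prompt(score: int, exemplars: list) -> str:
    shots = "\n".join(f"- {e['text']}" for e in exemplars if e["score"] == score)
    return (f"Write one new student response to the item below that would earn a score of {score}. "
            f"Match the style of these examples:\n{shots}\nItem: {item_stem}")

augmented = list(train)
for score in minority_scores:
    for _ in range(per_class_budget):
        text = call_gpt4(build_prompt(score, train))
        augmented.append({"text": text, "score": score})

print(len(augmented))  # the augmented set would then be used to fine-tune DistilBERT
```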

CycleAlign: Iterative Distillation from Black-box LLM to White-box Models for Better Human Alignment

  • paper_url: http://arxiv.org/abs/2310.16271
  • repo_url: None
  • paper_authors: Jixiang Hong, Quan Tu, Changyu Chen, Xing Gao, Ji Zhang, Rui Yan
  • for: 本研究旨在利用基于人类反馈的强化学习(RLHF)和排序方法,使大语言模型(LLM)与人类价值观对齐,确保模型输出符合人类的偏好与价值观。
  • methods: 本研究提出了一种名为循环对齐(CycleAlign)的新方法,通过参数不可见的黑盒模型(black-box)与参数可见的白盒模型(white-box)之间的循环交互来实现蒸馏式对齐。在每轮交互中,黑盒模型根据人工编写的指令和偏好示范对白盒模型生成的响应进行排序,而白盒模型也会对自己生成的响应给出判断。
  • results: 研究发现,经过多轮交互,CycleAlign框架能够以低资源的方式有效地将白盒模型与黑盒模型对齐。此外,与现有方法相比,经CycleAlign微调的模型表现出色,在与人类价值观对齐方面达到了最先进的性能。
    Abstract Language models trained on large-scale corpus often generate content that is harmful, toxic, or contrary to human preferences, making their alignment with human values a critical concern. Reinforcement learning from human feedback (RLHF) with algorithms like PPO is a prevalent approach for alignment but is often complex, unstable, and resource-intensive. Recently, ranking-based alignment methods have emerged, offering stability and effectiveness by replacing the RL framework with supervised fine-tuning, but they are costly due to the need for annotated data. Considering that existing large language models (LLMs) like ChatGPT are already relatively well-aligned and cost-friendly, researchers have begun to align the language model with human preference from AI feedback. The common practices, which unidirectionally distill the instruction-following responses from LLMs, are constrained by their bottleneck. Thus we introduce CycleAlign to distill alignment capabilities from parameter-invisible LLMs (black-box) to a parameter-visible model (white-box) in an iterative manner. With in-context learning (ICL) as the core of the cycle, the black-box models are able to rank the model-generated responses guided by human-craft instruction and demonstrations about their preferences. During iterative interaction, the white-box models also have a judgment about responses generated by them. Consequently, the agreement ranking could be viewed as a pseudo label to dynamically update the in-context demonstrations and improve the preference ranking ability of black-box models. Through multiple interactions, the CycleAlign framework could align the white-box model with the black-box model effectively in a low-resource way. Empirical results illustrate that the model fine-tuned by CycleAlign remarkably exceeds existing methods, and achieves the state-of-the-art performance in alignment with human value.
    摘要 在大规模语料上训练的语言模型经常生成有害、有毒或违背人类偏好的内容,因此使其与人类价值观对齐成为一个关键问题。基于人类反馈的强化学习(RLHF)配合PPO等算法是常见的对齐方案,但往往复杂、不稳定且资源消耗大。最近出现的基于排序的对齐方法用有监督微调取代了强化学习框架,带来了稳定性和有效性,但由于需要标注数据而代价高昂。考虑到像ChatGPT这样的现有大语言模型已经相对对齐良好且使用成本低,研究者开始借助AI反馈来对齐语言模型。常见做法是单向地从LLM中蒸馏遵循指令的响应,这受限于其自身瓶颈。因此,我们提出了CycleAlign,以迭代的方式将对齐能力从参数不可见的黑盒LLM蒸馏到参数可见的白盒模型。以上下文学习(ICL)为循环的核心,黑盒模型能够在人工编写的指令和偏好示范的引导下,对白盒模型生成的响应进行排序;在迭代交互过程中,白盒模型也会对自己生成的响应作出判断。两者排序一致的结果可视为伪标签,用于动态更新上下文示范,并提升黑盒模型的偏好排序能力。经过多轮交互,CycleAlign框架能以低资源方式有效地将白盒模型与黑盒模型对齐。实验结果表明,经CycleAlign微调的模型显著优于现有方法,在与人类价值观对齐方面达到了最先进的性能。

Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism

  • paper_url: http://arxiv.org/abs/2310.16270
  • repo_url: https://github.com/msakarvadia/attentionlens
  • paper_authors: Mansi Sakarvadia, Arham Khan, Aswathy Ajith, Daniel Grzenda, Nathaniel Hudson, André Bauer, Kyle Chard, Ian Foster
  • for: 这个论文旨在了解 trasformer 基于语言模型中 attention 头的特定作用,以及它们如何生成最终预测结果。
  • methods: 该论文使用 reverse engineering 技术来探索 attention 头的内部机制,并提出了一种名为 Attention Lens 的工具来将 attention 头的输出翻译成 vocabulary tokens。
  • results: 初步结果表明,attention 头在语言模型中扮演着高度特化的角色,并且可以通过学习到的、针对每个 attention 头的变换将其输出翻译为词汇token。
    Abstract Transformer-based Large Language Models (LLMs) are the state-of-the-art for natural language tasks. Recent work has attempted to decode, by reverse engineering the role of linear layers, the internal mechanisms by which LLMs arrive at their final predictions for text completion tasks. Yet little is known about the specific role of attention heads in producing the final token prediction. We propose Attention Lens, a tool that enables researchers to translate the outputs of attention heads into vocabulary tokens via learned attention-head-specific transformations called lenses. Preliminary findings from our trained lenses indicate that attention heads play highly specialized roles in language models. The code for Attention Lens is available at github.com/msakarvadia/AttentionLens.
    摘要 基于Transformer的大语言模型(LLM)是当前自然语言任务上最先进的方法。近期工作试图通过逆向剖析线性层的作用,来解码LLM在文本补全任务中得出最终预测的内部机制。然而,关于attention头在产生最终token预测中的具体作用,人们所知甚少。我们提出Attention Lens,一个工具,使研究人员能够通过学习到的、针对每个attention头的变换(称为lens)将attention头的输出翻译为词汇token。基于已训练lens的初步发现表明,attention头在语言模型中扮演着高度特化的角色。Attention Lens的代码可在 github.com/msakarvadia/AttentionLens 获取。
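A minimal sketch of the lens idea: a learned, head-specific projection maps an attention head's output into vocabulary logits so that head's contribution can be read as tokens. The dimensions, the untrained linear layer, and the ten-word vocabulary below are stand-ins; in Attention Lens each lens is trained so its readout matches the model's own predictions.

```python
# Project one attention head's output into vocabulary space and read off top tokens.
import torch

d_head, vocab_size = 64, 10
vocab = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran", "fast", "home"]

lens = torch.nn.Linear(d_head, vocab_size)     # one lens per attention head (untrained here)
head_output = torch.randn(1, d_head)           # hidden state written by that head

with torch.no_grad():
    logits = lens(head_output)
    top = torch.topk(logits, k=3, dim=-1).indices[0].tolist()
print("tokens this head promotes:", [vocab[i] for i in top])
```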

Multilingual Coarse Political Stance Classification of Media. The Editorial Line of a ChatGPT and Bard Newspaper

  • paper_url: http://arxiv.org/abs/2310.16269
  • repo_url: None
  • paper_authors: Cristina España-Bonet
  • for: This paper aims to explore the use of artificial intelligence (AI) in news outlets and its potential impact on bias ratings.
  • methods: The authors use authentic news outlets’ ratings to create a multilingual corpus of news with coarse stance annotations and automatically extracted topic annotations. They train classifiers on this data to identify the editorial line of unseen newspapers in English, German, Spanish, and Catalan.
  • results: The classifiers are able to identify the editorial line of most unseen newspapers in the four languages, and the authors observe that ChatGPT’s editorial line evolves over time and differs among languages.
    Abstract Neutrality is difficult to achieve and, in politics, subjective. Traditional media typically adopt an editorial line that can be used by their potential readers as an indicator of the media bias. Several platforms currently rate news outlets according to their political bias. The editorial line and the ratings help readers in gathering a balanced view of news. But in the advent of instruction-following language models, tasks such as writing a newspaper article can be delegated to computers. Without imposing a biased persona, where would an AI-based news outlet lie within the bias ratings? In this work, we use the ratings of authentic news outlets to create a multilingual corpus of news with coarse stance annotations (Left and Right) along with automatically extracted topic annotations. We show that classifiers trained on this data are able to identify the editorial line of most unseen newspapers in English, German, Spanish and Catalan. We then apply the classifiers to 101 newspaper-like articles written by ChatGPT and Bard in the 4 languages at different time periods. We observe that, similarly to traditional newspapers, ChatGPT editorial line evolves with time and, being a data-driven system, the stance of the generated articles differs among languages.
    摘要 中立很难做到,而在政治领域更带有主观性。传统媒体通常会采取某种编辑立场,潜在读者可以据此判断媒体的倾向。目前已有多个平台根据政治倾向对新闻媒体进行评级,编辑立场和这些评级有助于读者获得对新闻的平衡看法。但随着遵循指令的语言模型出现,撰写新闻报道等任务可以交由计算机完成。在不人为设定带有偏向的人格的情况下,一个基于AI的新闻媒体会落在倾向评级的哪个位置?在这项工作中,我们利用真实新闻媒体的评级构建了一个多语言新闻语料库,其中带有粗粒度的立场标注(左/右)以及自动抽取的主题标注。我们表明,在这些数据上训练的分类器能够识别英语、德语、西班牙语和加泰罗尼亚语中大多数未见报纸的编辑立场。随后,我们将这些分类器应用于ChatGPT和Bard在不同时间段用这四种语言撰写的101篇类新闻文章。我们观察到,与传统报纸类似,ChatGPT的编辑立场会随时间演变,而且由于它是数据驱动的系统,生成文章的立场在不同语言之间也有所不同。

Enhancing Large Language Models for Secure Code Generation: A Dataset-driven Study on Vulnerability Mitigation

  • paper_url: http://arxiv.org/abs/2310.16263
  • repo_url: None
  • paper_authors: Jiexin Wang, Liuwen Cao, Xitong Luo, Zhiping Zhou, Jiayuan Xie, Adam Jatowt, Yi Cai
  • for: 评估和提高大型语言模型(LLMs)在代码生成方面的安全性。
  • methods: 使用不同的方法和技术来提高 LLMs 的安全性,包括代码生成、代码修复和攻击类型分类等。
  • results: 研究发现现有模型在代码生成过程中经常忽略安全问题,导致生成的代码具有漏洞性; 提议了一些有效的方法来缓解安全漏洞,提高 LLMs 的总体可靠性。
    Abstract Large language models (LLMs) have brought significant advancements to code generation, benefiting both novice and experienced developers. However, their training using unsanitized data from open-source repositories, like GitHub, introduces the risk of inadvertently propagating security vulnerabilities. To effectively mitigate this concern, this paper presents a comprehensive study focused on evaluating and enhancing code LLMs from a software security perspective. We introduce SecuCoGen\footnote{SecuCoGen has been uploaded as supplemental material and will be made publicly available after publication.}, a meticulously curated dataset targeting 21 critical vulnerability types. SecuCoGen comprises 180 samples and serves as the foundation for conducting experiments on three crucial code-related tasks: code generation, code repair and vulnerability classification, with a strong emphasis on security. Our experimental results reveal that existing models often overlook security concerns during code generation, leading to the generation of vulnerable code. To address this, we propose effective approaches to mitigate the security vulnerabilities and enhance the overall robustness of code generated by LLMs. Moreover, our study identifies weaknesses in existing models' ability to repair vulnerable code, even when provided with vulnerability information. Additionally, certain vulnerability types pose challenges for the models, hindering their performance in vulnerability classification. Based on these findings, we believe our study will have a positive impact on the software engineering community, inspiring the development of improved methods for training and utilizing LLMs, thereby leading to safer and more trustworthy model deployment.
    摘要 大型语言模型(LLMs)已经为开发者带来了重要的进步,帮助他们更好地生成代码。然而,通过使用 GitHub 等开源存储库中的未经过处理的数据进行训练,可能会意外地传播安全漏洞。为了有效地缓解这种问题,这篇论文提出了一项全面的研究,旨在从软件安全角度评估和加强代码生成器。我们提出了 SecuCoGen,一个精心准备的数据集,包含 21 种关键的漏洞类型。SecuCoGen 包含 180 个样本,并作为基于代码生成、代码修复和漏洞分类等三个关键任务的实验基础。我们的实验结果表明,现有的模型在代码生成时经常忽略安全问题,导致生成的代码存在漏洞。为了解决这个问题,我们提出了一些有效的方法来缓解安全漏洞并提高代码生成器的整体可靠性。此外,我们的研究还发现了现有模型在修复漏洞代码时存在缺陷,即使提供了漏洞信息。此外,某些漏洞类型对模型表现出了困难。根据这些发现,我们认为这项研究将对软件工程领域产生积极的影响,激励开发人员开发更好的训练和使用 LLMs 的方法,从而导致更安全和可靠的模型部署。

rTisane: Externalizing conceptual models for data analysis increases engagement with domain knowledge and improves statistical model quality

  • paper_url: http://arxiv.org/abs/2310.16262
  • repo_url: None
  • paper_authors: Eunice Jun, Edward Misback, Jeffrey Heer, René Just
  • for: 本研究旨在了解分析员在使用统计模型时的假设表达方式,以及这些假设如何影响统计模型质量。
  • methods: 本研究使用域特定语言(DSL)让分析员表达概念模型,并在这些模型中解决歧义。
  • results: 研究发现,使用 rTisane 的 DSL 可以帮助分析员更深入地表达假设,并更准确地外部化假设。 rTisane 也导致统计模型更好地匹配分析员的假设,保持分析意图,并更好地适应数据。
    Abstract Statistical models should accurately reflect analysts' domain knowledge about variables and their relationships. While recent tools let analysts express these assumptions and use them to produce a resulting statistical model, it remains unclear what analysts want to express and how externalization impacts statistical model quality. This paper addresses these gaps. We first conduct an exploratory study of analysts using a domain-specific language (DSL) to express conceptual models. We observe a preference for detailing how variables relate and a desire to allow, and then later resolve, ambiguity in their conceptual models. We leverage these findings to develop rTisane, a DSL for expressing conceptual models augmented with an interactive disambiguation process. In a controlled evaluation, we find that rTisane's DSL helps analysts engage more deeply with and accurately externalize their assumptions. rTisane also leads to statistical models that match analysts' assumptions, maintain analysis intent, and better fit the data.
    摘要 统计模型应当准确反映分析员对变量及其关系的领域知识。尽管近期的工具允许分析员表达这些假设并据此生成统计模型,但分析员究竟想表达什么,以及外部化这些假设如何影响统计模型质量,仍不清楚。本文针对这些问题展开研究。我们首先进行了一项探索性研究,观察分析员如何使用领域特定语言(DSL)表达概念模型,发现他们倾向于详细说明变量之间的关系,并希望先允许、随后再解决概念模型中的歧义。基于这些发现,我们开发了 rTisane,一种用于表达概念模型的 DSL,并配有交互式消歧过程。在受控评估中,我们发现 rTisane 的 DSL 能帮助分析员更深入地审视并更准确地外部化其假设;rTisane 生成的统计模型也更符合分析员的假设、保持分析意图,并更好地拟合数据。
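
To make the idea of externalizing a conceptual model with deferred disambiguation concrete, here is a small Python sketch. rTisane itself is a DSL embedded in R, so the class and method names below are invented for illustration and do not reflect its real API.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptualModel:
    causes: list = field(default_factory=list)     # resolved (cause, effect) pairs
    ambiguous: list = field(default_factory=list)  # relationships the analyst left open

    def assume_causes(self, cause: str, effect: str):
        self.causes.append((cause, effect))

    def maybe_relates(self, a: str, b: str):
        # Ambiguity is allowed at authoring time and resolved later
        # (interactively, in rTisane's case).
        self.ambiguous.append((a, b))

    def resolve_as_cause(self, cause: str, effect: str):
        self.ambiguous.remove((cause, effect))
        self.causes.append((cause, effect))

    def to_formula(self, outcome: str) -> str:
        # Derive a simple regression formula from the resolved causal assumptions.
        predictors = sorted({c for c, e in self.causes if e == outcome})
        return f"{outcome} ~ " + " + ".join(predictors)

cm = ConceptualModel()
cm.assume_causes("study_hours", "exam_score")
cm.maybe_relates("sleep", "exam_score")          # left ambiguous at first
cm.resolve_as_cause("sleep", "exam_score")       # disambiguated later
print(cm.to_formula("exam_score"))               # exam_score ~ sleep + study_hours
```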

A Causal Disentangled Multi-Granularity Graph Classification Method

  • paper_url: http://arxiv.org/abs/2310.16256
  • repo_url: None
  • paper_authors: Yuan Li, Li Liu, Penggang Chen, Youmin Zhang, Guoyin Wang
  • for: 本文提出了一种解决图数据多粒度表示学习问题的方法,以提高图分类任务的准确率和可解释性。
  • methods: 本文提出了一种基于 causal disentanglement 的多粒度图表示学习方法(CDM-GNN),该方法可以从多粒度视角分离图中重要的子结构和偏置部分,提高图分类任务的准确率和可解释性。
  • results: 本文通过在三个实际数据集(MUTAG、PTC、IMDM-M)上与六种先进模型进行比较,表明 CDM-GNN 模型在图分类任务中表现出色,同时能够提供与人类认知模式相符的可解释结果。
    Abstract Graph data widely exists in real life, with large amounts of data and complex structures. It is necessary to map graph data to low-dimensional embedding. Graph classification, a critical graph task, mainly relies on identifying the important substructures within the graph. At present, some graph classification methods do not combine the multi-granularity characteristics of graph data. This lack of granularity distinction in modeling leads to a conflation of key information and false correlations within the model. So, achieving the desired goal of a credible and interpretable model becomes challenging. This paper proposes a causal disentangled multi-granularity graph representation learning method (CDM-GNN) to solve this challenge. The CDM-GNN model disentangles the important substructures and bias parts within the graph from a multi-granularity perspective. The disentanglement of the CDM-GNN model reveals important and bias parts, forming the foundation for its classification task, specifically, model interpretations. The CDM-GNN model exhibits strong classification performance and generates explanatory outcomes aligning with human cognitive patterns. In order to verify the effectiveness of the model, this paper compares the three real-world datasets MUTAG, PTC, and IMDM-M. Six state-of-the-art models, namely GCN, GAT, Top-k, ASAPool, SUGAR, and SAT are employed for comparison purposes. Additionally, a qualitative analysis of the interpretation results is conducted.
    摘要 图数据广泛存在于实际生活中,数据量大且结构复杂,因此需要将图数据映射到低维嵌入。图分类是一项关键的图任务,主要依赖于识别图中重要的子结构。目前,一些图分类方法没有结合图数据的多粒度特性,这种建模上缺乏粒度区分会导致模型中关键信息与虚假相关性相互混淆,从而难以得到可靠且可解释的模型。本文提出了一种因果解耦的多粒度图表示学习方法(CDM-GNN)来应对这一挑战。CDM-GNN 模型从多粒度视角分离出图中重要的子结构和偏置部分,这种解耦揭示了重要部分与偏置部分,构成了其分类任务以及模型解释的基础。CDM-GNN 模型表现出很强的分类性能,并生成了与人类认知模式相符的解释结果。为了验证模型的有效性,本文在 MUTAG、PTC 和 IMDM-M 三个实际数据集上,与 GCN、GAT、Top-k、ASAPool、SUGAR 和 SAT 六种先进模型进行了比较,并对解释结果进行了定性分析。
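
The core disentanglement idea (splitting a graph into an important, class-relevant part and a bias part before classification) can be sketched roughly as below. This simplified PyTorch readout omits message passing, multi-granularity pooling, and the causal training losses of CDM-GNN; all shapes and module names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DisentangledReadout(nn.Module):
    def __init__(self, in_dim: int, hidden: int, num_classes: int):
        super().__init__()
        # Scores each node; a soft mask separates "important" from "bias" nodes.
        self.mask_scorer = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.classifier = nn.Linear(in_dim, num_classes)

    def forward(self, node_feats: torch.Tensor):
        # node_feats: (num_nodes, in_dim) embeddings from some upstream GNN encoder
        mask = torch.sigmoid(self.mask_scorer(node_feats))   # (num_nodes, 1) soft importance
        causal_repr = (mask * node_feats).mean(dim=0)        # pooled "important substructure"
        bias_repr = ((1 - mask) * node_feats).mean(dim=0)    # pooled "bias" part (for auxiliary losses)
        logits = self.classifier(causal_repr)                # classification uses only the causal part
        return logits, mask, bias_repr

model = DisentangledReadout(in_dim=32, hidden=64, num_classes=2)
logits, mask, _ = model(torch.randn(10, 32))
print(logits.shape, mask.shape)  # torch.Size([2]) torch.Size([10, 1])
```

The learned mask is also what makes the model interpretable in this sketch: high-scoring nodes indicate the substructure the classifier relied on.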

ConDefects: A New Dataset to Address the Data Leakage Concern for LLM-based Fault Localization and Program Repair

  • paper_url: http://arxiv.org/abs/2310.16253
  • repo_url: None
  • paper_authors: Yonghao Wu, Zheng Li, Jie M. Zhang, Yong Liu
  • for: 本研究旨在提供一个新的 fault localization 和 program repair 的 benchmark dataset,以确保 LLM-based 方法的可靠性和通用性。
  • methods: 该研究使用了精心整理的 fault 数据,以消除现有 benchmark 中的数据泄露问题,从而提供一个可靠的 benchmark 集。
  • results: 该研究提供了 1,254 个 Java faulty program 和 1,625 个 Python faulty program,每个 fault 都包含缺陷位置和修复后的代码版本,适用于 fault localization 和 program repair 相关的研究。
    Abstract With the growing interest in Large Language Models (LLMs) for fault localization and program repair, ensuring the integrity and generalizability of the LLM-based methods becomes paramount. The code in existing widely-adopted benchmarks for these tasks was written before the bloom of LLMs and may be included in the training data of existing popular LLMs, thereby suffering from the threat of data leakage, leading to misleadingly optimistic performance metrics. To address this issue, we introduce "ConDefects", a novel dataset of real faults meticulously curated to eliminate such overlap. ConDefects contains 1,254 Java faulty programs and 1,625 Python faulty programs. All these programs are sourced from the online competition platform AtCoder and were produced between October 2021 and September 2023. We pair each fault with fault locations and the corresponding repaired code versions, making it tailored for fault localization and program repair related research. We also provide interfaces for selecting subsets based on different time windows and coding task difficulties. While inspired by LLM-based tasks, ConDefects can be adopted for benchmarking ALL types of fault localization and program repair methods. The dataset is publicly available, and a demo video can be found at https://www.youtube.com/watch?v=22j15Hj5ONk.
    摘要 随着大语言模型(LLM)在错误定位和程序修复领域受到越来越多的关注,确保基于 LLM 的方法的完整性和通用性变得至关重要。现有广泛采用的基准中的代码编写于 LLM 兴起之前,可能已被包含在现有流行 LLM 的训练数据中,从而面临数据泄露的威胁,导致性能指标呈现出误导性的乐观。为解决这一问题,我们介绍了 “ConDefects”,一个经过精心筛选以消除此类重叠的真实故障数据集,其中包含 1,254 个 Java 错误程序和 1,625 个 Python 错误程序。这些程序均来自在线竞赛平台 AtCoder,产生于 2021 年 10 月至 2023 年 9 月之间。我们为每个错误配对了错误位置及相应的修复代码版本,使其适用于错误定位和程序修复相关的研究。此外,我们还提供了基于不同时间窗口和编程任务难度选择子集的接口。虽然其设计受 LLM 相关任务启发,但 ConDefects 可用于对所有类型的错误定位和程序修复方法进行基准测试。数据集已公开,演示视频见 https://www.youtube.com/watch?v=22j15Hj5ONk。
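
A small sketch of the time-window selection that such a dataset enables: keeping only faults created after a model's training-data cutoff to rule out leakage. The record schema, field names, and cutoff date are hypothetical; the released dataset provides its own selection interfaces.

```python
from datetime import date
from typing import Optional

# Hypothetical fault records (id, language, creation date).
faults = [
    {"id": "abc123", "language": "Java",   "created": date(2021, 11, 3)},
    {"id": "def456", "language": "Python", "created": date(2023, 2, 17)},
]

def leakage_safe_subset(records, cutoff: date, language: Optional[str] = None):
    """Return faults created strictly after `cutoff`, optionally filtered by language."""
    return [
        r for r in records
        if r["created"] > cutoff and (language is None or r["language"] == language)
    ]

# e.g. evaluating a model whose training data ends in September 2021 (hypothetical cutoff)
subset = leakage_safe_subset(faults, cutoff=date(2021, 9, 30), language="Python")
print([r["id"] for r in subset])  # ['def456']
```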