eess.IV - 2023-08-14

Automated Ensemble-Based Segmentation of Adult Brain Tumors: A Novel Approach Using the BraTS AFRICA Challenge Data

  • paper_url: http://arxiv.org/abs/2308.07214
  • repo_url: None
  • paper_authors: Chiranjeewee Prasad Koirala, Sovesh Mohapatra, Advait Gosai, Gottfried Schlaug
  • for: This paper explores the use of deep learning to improve brain tumor segmentation precision, particularly for the Sub-Saharan Africa patient population.
  • methods: The paper uses multi-modality MRI data and proposes an ensemble method comprising eleven unique variants based on three core architectures (UNet3D, ONet3D, and SphereNet3D) together with modified loss functions.
  • results: The study finds that the ensemble approach, combining different core architectures and modified loss functions, improves evaluation metrics, achieving Dice scores of 0.82, 0.82, and 0.87 for the enhancing tumor, tumor core, and whole tumor labels, respectively.
    Abstract Brain tumors, particularly glioblastoma, continue to challenge medical diagnostics and treatments globally. This paper explores the application of deep learning to multi-modality magnetic resonance imaging (MRI) data for enhanced brain tumor segmentation precision in the Sub-Saharan Africa patient population. We introduce an ensemble method that comprises eleven unique variations based on three core architectures (UNet3D, ONet3D, and SphereNet3D) combined with modified loss functions. The study emphasizes the need for both age- and population-based segmentation models to fully account for the complexities in the brain. Our findings reveal that the ensemble approach, combining different architectures, outperforms single models, leading to improved evaluation metrics. Specifically, the results exhibit Dice scores of 0.82, 0.82, and 0.87 for enhancing tumor, tumor core, and whole tumor labels, respectively. These results underline the potential of tailored deep learning techniques in precisely segmenting brain tumors and lay the groundwork for future work to fine-tune models and assess performance across different brain regions.
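
The abstract does not detail the fusion rule used by the ensemble, so probability averaging below is an assumption; a minimal sketch of that style of ensembling plus the Dice metric follows, where `models` and their `predict_proba` method are hypothetical stand-ins for the trained UNet3D/ONet3D/SphereNet3D variants.

```python
# A minimal sketch of probability-averaging ensembling for one tumor label,
# assuming each trained model exposes a hypothetical predict_proba(volume)
# that returns voxelwise probabilities in [0, 1].
import numpy as np

def ensemble_segment(models, volume, threshold=0.5):
    """Average voxelwise probabilities across models, then binarize."""
    probs = np.mean([m.predict_proba(volume) for m in models], axis=0)
    return probs > threshold

def dice_score(pred, truth):
    """Dice = 2|A∩B| / (|A| + |B|); defined as 1.0 when both masks are empty."""
    inter = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    return 1.0 if denom == 0 else 2.0 * inter / denom
```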

SAM Meets Robotic Surgery: An Empirical Study on Generalization, Robustness and Adaptation

  • paper_url: http://arxiv.org/abs/2308.07156
  • repo_url: None
  • paper_authors: An Wang, Mobarakol Islam, Mengya Xu, Yang Zhang, Hongliang Ren
  • for: This paper studies the robustness and zero-shot generalization ability of the Segment Anything Model (SAM), a semantic segmentation foundation model, in the robotic surgery domain.
  • methods: The paper evaluates SAM under different prompting schemes, including bounding-box and point prompts.
  • results: The study finds that SAM shows remarkable zero-shot generalization with bounding-box prompts, but performs poorly with point prompts and in unprompted settings, especially in complex surgical scenes. SAM is also sensitive to data corruption and struggles to maintain high performance across conditions.
    Abstract The Segment Anything Model (SAM) serves as a fundamental model for semantic segmentation and demonstrates remarkable generalization capabilities across a wide range of downstream scenarios. In this empirical study, we examine SAM's robustness and zero-shot generalizability in the field of robotic surgery. We comprehensively explore different scenarios, including prompted and unprompted situations, bounding box and points-based prompt approaches, as well as the ability to generalize under corruptions and perturbations at five severity levels. Additionally, we compare the performance of SAM with state-of-the-art supervised models. We conduct all the experiments with two well-known robotic instrument segmentation datasets from MICCAI EndoVis 2017 and 2018 challenges. Our extensive evaluation results reveal that although SAM shows remarkable zero-shot generalization ability with bounding box prompts, it struggles to segment the whole instrument with point-based prompts and unprompted settings. Furthermore, our qualitative figures demonstrate that the model either failed to predict certain parts of the instrument mask (e.g., jaws, wrist) or predicted parts of the instrument as wrong classes in the scenario of overlapping instruments within the same bounding box or with the point-based prompt. In fact, SAM struggles to identify instruments in complex surgical scenarios characterized by the presence of blood, reflection, blur, and shade. Additionally, SAM is insufficiently robust to maintain high performance when subjected to various forms of data corruption. We also attempt to fine-tune SAM using Low-rank Adaptation (LoRA) and propose SurgicalSAM, which shows the capability in class-wise mask prediction without prompt. Therefore, we can argue that, without further domain-specific fine-tuning, SAM is not ready for downstream surgical tasks.
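
The bounding-box and point-prompt settings examined above can be reproduced with the public `segment_anything` API; a brief sketch follows, in which the checkpoint filename, frame path, and prompt coordinates are placeholders.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load SAM (ViT-H backbone) from a downloaded checkpoint (placeholder filename).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

frame = np.asarray(Image.open("endovis_frame.png").convert("RGB"))
predictor.set_image(frame)

# Zero-shot with a bounding-box prompt (xyxy) around an instrument.
masks, scores, _ = predictor.predict(box=np.array([100, 150, 400, 380]),
                                     multimask_output=False)

# Zero-shot with a single positive point click on the instrument.
masks_pt, _, _ = predictor.predict(point_coords=np.array([[250, 260]]),
                                   point_labels=np.array([1]),
                                   multimask_output=False)
```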

FocusFlow: Boosting Key-Points Optical Flow Estimation for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2308.07104
  • repo_url: https://github.com/zhonghuayi/focusflow_official
  • paper_authors: Zhonghua Yi, Hao Shi, Kailun Yang, Qi Jiang, Yaozu Ye, Ze Wang, Kaiwei Wang
  • for: The paper is focused on improving the performance of optical flow estimation in key-point-critical scenarios for autonomous driving applications.
  • methods: The proposed method, called FocusFlow, uses a points-based modeling approach that explicitly learns key-point-related priors. It also introduces a new loss function called Conditional Point Control Loss (CPCL) and a Condition Control Encoder (CCE) to improve the performance of the model.
  • results: The proposed FocusFlow framework shows outstanding performance with up to +44.5% precision improvement on various key points such as ORB, SIFT, and even learning-based SiLK, and exceptional scalability for most existing data-driven optical flow methods. It also yields competitive or superior performances rivaling the original models on the whole frame.
    Abstract Key-point-based scene understanding is fundamental for autonomous driving applications. At the same time, optical flow plays an important role in many vision tasks. However, due to the implicit bias of equal attention on all points, classic data-driven optical flow estimation methods yield less satisfactory performance on key points, limiting their implementations in key-point-critical safety-relevant scenarios. To address these issues, we introduce a points-based modeling method that requires the model to learn key-point-related priors explicitly. Based on the modeling method, we present FocusFlow, a framework consisting of 1) a mix loss function combined with a classic photometric loss function and our proposed Conditional Point Control Loss (CPCL) function for diverse point-wise supervision; 2) a conditioned controlling model which substitutes the conventional feature encoder by our proposed Condition Control Encoder (CCE). CCE incorporates a Frame Feature Encoder (FFE) that extracts features from frames, a Condition Feature Encoder (CFE) that learns to control the feature extraction behavior of FFE from input masks containing information of key points, and fusion modules that transfer the controlling information between FFE and CFE. Our FocusFlow framework shows outstanding performance with up to +44.5% precision improvement on various key points such as ORB, SIFT, and even learning-based SiLK, along with exceptional scalability for most existing data-driven optical flow methods like PWC-Net, RAFT, and FlowFormer. Notably, FocusFlow yields competitive or superior performances rivaling the original models on the whole frame. The source code will be available at https://github.com/ZhonghuaYi/FocusFlow_official.
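
The exact CPCL formulation is defined in the paper; as an illustration of the mix-loss idea (a frame-wide term plus point-wise supervision on key points), here is a hedged stand-in that weights the endpoint error at key-point locations. The plain endpoint error substitutes for the photometric term, and `lam` is a hypothetical balance weight.

```python
import torch

def mix_loss(flow_pred, flow_gt, keypoint_mask, lam=1.0):
    # flow_pred, flow_gt: (B, 2, H, W); keypoint_mask: (B, H, W), 1 at key points
    epe = torch.norm(flow_pred - flow_gt, dim=1)            # per-pixel endpoint error
    frame_term = epe.mean()                                 # classic frame-wide term
    point_term = (epe * keypoint_mask).sum() / keypoint_mask.sum().clamp(min=1)
    return frame_term + lam * point_term                    # point-wise supervision
```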

When Deep Learning Meets Multi-Task Learning in SAR ATR: Simultaneous Target Recognition and Segmentation

  • paper_url: http://arxiv.org/abs/2308.07093
  • repo_url: None
  • paper_authors: Chenwei Wang, Jifang Pei, Zhiyong Wang, Yulin Huang, Junjie Wu, Haiguang Yang, Jianyu Yang
  • for: This paper proposes a multi-task learning approach for synthetic aperture radar (SAR) automatic target recognition (ATR) that extracts multiple target attributes simultaneously.
  • methods: The paper proposes a multi-task deep learning framework with two main structures: an encoder that extracts image features at different scales, and a task-specific decoder that uses the extracted features adaptively and optimally for recognition and segmentation.
  • results: Experiments on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset show that the proposed method achieves superior recognition and segmentation performance.
    Abstract With the recent advances of deep learning, automatic target recognition (ATR) of synthetic aperture radar (SAR) has achieved superior performance. By not being limited to the target category, the SAR ATR system could benefit from the simultaneous extraction of multifarious target attributes. In this paper, we propose a new multi-task learning approach for SAR ATR, which could obtain the accurate category and precise shape of the targets simultaneously. By introducing deep learning theory into multi-task learning, we first propose a novel multi-task deep learning framework with two main structures: encoder and decoder. The encoder is constructed to extract sufficient image features in different scales for the decoder, while the decoder is a task-specific structure which employs these extracted features adaptively and optimally to meet the different feature demands of the recognition and segmentation. Therefore, the proposed framework has the ability to achieve superior recognition and segmentation performance. Based on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset, experimental results show the superiority of the proposed framework in terms of recognition and segmentation.
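
The encoder/decoder split described above can be sketched as one shared feature extractor feeding a classification head and a segmentation head; the layer sizes below are illustrative placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiTaskATR(nn.Module):
    """Shared encoder with task-specific heads for category and shape."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.cls_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(64, n_classes))
        self.seg_head = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        feats = self.encoder(x)                      # shared multi-scale features
        return self.cls_head(feats), self.seg_head(feats)  # category, shape mask
```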

Deepbet: Fast brain extraction of T1-weighted MRI using Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2308.07003
  • repo_url: None
  • paper_authors: Lukas Fisch, Stefan Zumdick, Carlotta Barkhau, Daniel Emden, Jan Ernsting, Ramona Leenings, Kelvin Sarink, Nils R. Winter, Benjamin Risse, Udo Dannlowski, Tim Hahn
  • for: This paper presents a fast, high-precision brain extraction method for use in a wide range of neuroimaging preprocessing pipelines.
  • methods: The method uses modern deep learning techniques, including the LinkNet architecture, in a two-stage prediction process that improves segmentation performance.
  • results: The method sets a novel state-of-the-art in cross-validation, with a median Dice score (DSC) of 99.0% on unseen datasets, outperforming current state-of-the-art models (DSC = 97.8% and DSC = 97.9%). It is also more robust to outliers, achieving a Dice score above 96.9% for all samples. Finally, the model accelerates brain extraction by a factor of roughly 10 over existing methods, processing one image in about 2 seconds on low-level hardware.
    Abstract Brain extraction in magnetic resonance imaging (MRI) data is an important segmentation step in many neuroimaging preprocessing pipelines. Image segmentation is one of the research fields in which deep learning had the biggest impact in recent years enabling high precision segmentation with minimal compute. Consequently, traditional brain extraction methods are now being replaced by deep learning-based methods. Here, we used a unique dataset comprising 568 T1-weighted (T1w) MR images from 191 different studies in combination with cutting edge deep learning methods to build a fast, high-precision brain extraction tool called deepbet. deepbet uses LinkNet, a modern UNet architecture, in a two stage prediction process. This increases its segmentation performance, setting a novel state-of-the-art performance during cross-validation with a median Dice score (DSC) of 99.0% on unseen datasets, outperforming current state of the art models (DSC = 97.8% and DSC = 97.9%). While current methods are more sensitive to outliers, resulting in Dice scores as low as 76.5%, deepbet manages to achieve a Dice score of > 96.9% for all samples. Finally, our model accelerates brain extraction by a factor of ~10 compared to current methods, enabling the processing of one image in ~2 seconds on low level hardware.
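
A hedged sketch of the two-stage prediction idea: a coarse pass on a downsampled volume locates the brain, and a second full-resolution pass segments within the resulting bounding box. `coarse_net` and `fine_net` are hypothetical callables returning voxelwise probabilities, and the stage-1 mask is assumed non-empty.

```python
import numpy as np

def two_stage_extract(volume, coarse_net, fine_net, margin=8):
    coarse = coarse_net(volume[::4, ::4, ::4]) > 0.5       # stage 1, low resolution
    idx = np.argwhere(coarse) * 4                          # back to full-res coords
    lo = np.maximum(idx.min(0) - margin, 0)
    hi = np.minimum(idx.max(0) + margin, volume.shape)
    crop = volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
    mask = np.zeros(volume.shape, dtype=bool)
    mask[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = fine_net(crop) > 0.5  # stage 2
    return mask
```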

How inter-rater variability relates to aleatoric and epistemic uncertainty: a case study with deep learning-based paraspinal muscle segmentation

  • paper_url: http://arxiv.org/abs/2308.06964
  • repo_url: None
  • paper_authors: Parinaz Roshanzamir, Hassan Rivaz, Joshua Ahn, Hamza Mirza, Neda Naghdi, Meagan Anstruther, Michele C. Battié, Maryse Fortin, Yiming Xiao
  • for: This paper aims to explore the relationship between inter-rater variability and uncertainty in deep learning models for medical image segmentation, and to compare the performance of different DL models and label fusion strategies.
  • methods: The authors use test-time augmentation (TTA), test-time dropout (TTD), and deep ensemble to measure aleatoric and epistemic uncertainties, and compare the performance of UNet and TransUNet with two label fusion strategies.
  • results: The study reveals the interplay between inter-rater variability and uncertainties, and shows that the choice of label fusion strategies and DL models can affect the performance and uncertainty of the resulting algorithms.
    Abstract Recent developments in deep learning (DL) techniques have led to great performance improvement in medical image segmentation tasks, especially with the latest Transformer model and its variants. While labels from fusing multi-rater manual segmentations are often employed as ideal ground truths in DL model training, inter-rater variability due to factors such as training bias, image noise, and extreme anatomical variability can still affect the performance and uncertainty of the resulting algorithms. Knowledge regarding how inter-rater variability affects the reliability of the resulting DL algorithms, a key element in clinical deployment, can help inform better training data construction and DL models, but has not been explored extensively. In this paper, we measure aleatoric and epistemic uncertainties using test-time augmentation (TTA), test-time dropout (TTD), and deep ensemble to explore their relationship with inter-rater variability. Furthermore, we compare UNet and TransUNet to study the impacts of Transformers on model uncertainty with two label fusion strategies. We conduct a case study using multi-class paraspinal muscle segmentation from T2w MRIs. Our study reveals the interplay between inter-rater variability and uncertainties, affected by choices of label fusion strategies and DL models.
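
Of the three uncertainty estimators named above, test-time dropout is the simplest to sketch: keep dropout active at inference and treat the variance over stochastic forward passes as an epistemic-uncertainty proxy (TTA is analogous, averaging over augmented inputs instead). A minimal PyTorch version, assuming `model` contains Dropout layers:

```python
import torch

def mc_dropout_predict(model, x, n_passes=20):
    model.eval()
    for m in model.modules():                  # keep dropout stochastic at test time
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1)
                             for _ in range(n_passes)])
    return probs.mean(0), probs.var(0)         # prediction, epistemic proxy
```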

Robustness Stress Testing in Medical Image Classification

  • paper_url: http://arxiv.org/abs/2308.06889
  • repo_url: https://github.com/mobarakol/robustness_stress_testing
  • paper_authors: Mobarakol Islam, Zeju Li, Ben Glocker
  • for: This paper aims to assess the robustness and equity of disease detection models using progressive stress testing.
  • methods: The authors use five different bidirectional and unidirectional image perturbations with six different severity levels to test the models' robustness.
  • results: The authors find that some models may yield more robust and equitable performance than others, and that pretraining characteristics play an important role in downstream robustness.
    Abstract Deep neural networks have shown impressive performance for image-based disease detection. Performance is commonly evaluated through clinical validation on independent test sets to demonstrate clinically acceptable accuracy. Reporting good performance metrics on test sets, however, is not always a sufficient indication of the generalizability and robustness of an algorithm. In particular, when the test data is drawn from the same distribution as the training data, the iid test set performance can be an unreliable estimate of the accuracy on new data. In this paper, we employ stress testing to assess model robustness and subgroup performance disparities in disease detection models. We design progressive stress testing using five different bidirectional and unidirectional image perturbations with six different severity levels. As a use case, we apply stress tests to measure the robustness of disease detection models for chest X-ray and skin lesion images, and demonstrate the importance of studying class and domain-specific model behaviour. Our experiments indicate that some models may yield more robust and equitable performance than others. We also find that pretraining characteristics play an important role in downstream robustness. We conclude that progressive stress testing is a viable and important tool and should become standard practice in the clinical validation of image-based disease detection models.
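
A progressive stress test can be sketched as a severity sweep of a single perturbation; Gaussian noise below stands in for the paper's five perturbation types, and `model` is a hypothetical callable returning class scores for a batch of images in [0, 1].

```python
import numpy as np

def stress_sweep(model, images, labels, sigmas=(0.0, 0.02, 0.05, 0.1, 0.2, 0.4)):
    """Accuracy per severity level; six levels mirror the paper's protocol."""
    accs = {}
    for s in sigmas:
        noisy = np.clip(images + np.random.normal(0, s, images.shape), 0, 1)
        preds = model(noisy).argmax(axis=1)
        accs[s] = float((preds == labels).mean())
    return accs
```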

cs.CV - 2023-08-13

Modified Topological Image Preprocessing for Skin Lesion Classifications

  • paper_url: http://arxiv.org/abs/2308.06796
  • repo_url: None
  • paper_authors: Hong Cheng, Rebekah Leamons, Ahmad Al Shami
  • for: This paper proposes a modified Topological Data Analysis (TDA) model for preprocessing and enhancing skin lesion images.
  • methods: The approach applies the modified TDA preprocessing and then trains Deep Convolutional Neural Network and Vision Transformer models on the processed images.
  • results: Experimental results show that images preprocessed with the modified TDA consistently yield better classification performance.
    Abstract This paper proposes a modified Topological Data Analysis model for skin image preprocessing and enhancement. The skin lesion dataset HAM10000 is used with the intention of identifying the important objects in relevant regions of the images. To evaluate both the original dataset and the preprocessed dataset, Deep Convolutional Neural Network and Vision Transformer models were trained on each. After training, the experimental results demonstrate that images preprocessed using the modified Topological Data Analysis consistently perform better.
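
The paper's modified TDA pipeline is not spelled out here, but its underlying primitive, persistent homology of image intensities, can be sketched with the gudhi library (assumed installed); how the persistence information is folded back into preprocessing follows the paper and is omitted.

```python
import gudhi  # assumed available: pip install gudhi

def persistence_diagram(gray_img):
    # gray_img: 2D float array; sublevel-set filtration over pixel intensities
    cc = gudhi.CubicalComplex(top_dimensional_cells=gray_img)
    return cc.persistence()  # list of (dimension, (birth, death)) pairs
```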

PV-SSD: A Projection and Voxel-based Double Branch Single-Stage 3D Object Detector

  • paper_url: http://arxiv.org/abs/2308.06791
  • repo_url: None
  • paper_authors: Yongxin Shao, Aihong Tan, Zhetao Sun, Enhui Zheng, Tianhong Yan
  • for: To improve the accuracy of 3D object detection and classification from LiDAR data for autonomous driving.
  • methods: The paper proposes PV-SSD, a double-branch feature extraction method based on voxel features and projection, to reduce the information loss caused by projection.
  • results: The method achieves good performance compared with previous work and makes several contributions: 1) a voxel feature extraction method with variable receptive fields; 2) a weight-based feature point sampling method that selects points more conducive to detection; 3) the MSSFA module, built on the SSFA module.
    Abstract LIDAR-based 3D object detection and classification is crucial for autonomous driving. However, inference in real-time from extremely sparse 3D data poses a formidable challenge. To address this issue, a common approach is to project point clouds onto a bird's-eye or perspective view, effectively converting them into an image-like data format. However, this excessive compression of point cloud data often leads to the loss of information. This paper proposes a 3D object detector based on voxel and projection double branch feature extraction (PV-SSD) to address the problem of information loss. We add voxel features input containing rich local semantic information, which is fully fused with the projected features in the feature extraction stage to reduce the local information loss caused by projection. A good performance is achieved compared to the previous work. In addition, this paper makes the following contributions: 1) a voxel feature extraction method with variable receptive fields is proposed; 2) a feature point sampling method by weight sampling is used to filter out the feature points that are more conducive to the detection task; 3) the MSSFA module is proposed based on the SSFA module. To verify the effectiveness of our method, we designed comparison experiments.
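
The double-branch idea, fusing voxel features with projected features to recover local information lost by projection, can be sketched as a concatenate-and-convolve block on a shared bird's-eye-view grid; PV-SSD's actual modules (variable receptive fields, weighted point sampling, MSSFA) are more involved.

```python
import torch
import torch.nn as nn

class DoubleBranchFusion(nn.Module):
    """Fuse voxel-branch and projection-branch feature maps on one BEV grid."""
    def __init__(self, c_voxel=64, c_proj=64, c_out=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(c_voxel + c_proj, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, voxel_feats, proj_feats):
        # both inputs: (B, C, H, W) on the shared bird's-eye-view grid
        return self.fuse(torch.cat([voxel_feats, proj_feats], dim=1))
```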

RMP-Loss: Regularizing Membrane Potential Distribution for Spiking Neural Networks

  • paper_url: http://arxiv.org/abs/2308.06787
  • repo_url: None
  • paper_authors: Yufei Guo, Xiaode Liu, Yuanpei Chen, Liwen Zhang, Weihang Peng, Yuhan Zhang, Xuhui Huang, Zhe Ma
  • for: This paper addresses the quantization error problem in spiking neural networks and proposes a simple, straightforward training method to reduce its impact.
  • methods: The paper introduces a regularizing membrane potential loss (RMP-Loss), a regularization term that minimizes the impact of quantization error. The method is very simple to implement and makes the network easy to train.
  • results: Experiments show that training with RMP-Loss effectively reduces quantization error and consistently outperforms previously known methods across different network architectures and datasets.
    Abstract Spiking Neural Networks (SNNs), as one of the biology-inspired models, have received much attention recently. They can significantly reduce energy consumption since they quantize the real-valued membrane potentials to 0/1 spikes to transmit information, and thus the multiplications of activations and weights can be replaced by additions when implemented on hardware. However, this quantization mechanism will inevitably introduce quantization error, thus causing catastrophic information loss. To address the quantization error problem, we propose a regularizing membrane potential loss (RMP-Loss) to adjust the distribution which is directly related to quantization error to a range close to the spikes. Our method is extremely simple to implement and straightforward to train an SNN. Furthermore, it is shown to consistently outperform previous state-of-the-art methods over different network architectures and datasets.
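
The exact RMP-Loss is defined in the paper; as an illustrative stand-in for regularizing the membrane-potential distribution toward the spike values, one can penalize each potential's distance to its nearest value in {0, 1} and add that to the task loss with a hypothetical weight `beta`.

```python
import torch

def rmp_like_penalty(u):
    """Mean distance of membrane potentials u to the nearest spike value in {0, 1}."""
    return torch.minimum(u.abs(), (u - 1.0).abs()).mean()

# total_loss = task_loss + beta * rmp_like_penalty(membrane_potentials)
```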

Shape-guided Conditional Latent Diffusion Models for Synthesising Brain Vasculature

  • paper_url: http://arxiv.org/abs/2308.06781
  • repo_url: None
  • paper_authors: Yash Deo, Haoran Dou, Nishant Ravikumar, Alejandro F. Frangi, Toni Lassila
  • for: The study aims to advance research on cerebrovascular disease and refine clinical interventions by generating realistic 3D segmentations of the brain vasculature, including rarer anatomical variants.
  • methods: The study uses a new generative model, a conditional latent diffusion model with shape and anatomical guidance, to generate realistic 3D brain vasculature segmentations covering different phenotypical variations.
  • results: The generated segmentations are more realistic and show higher visual fidelity than those from other generative models such as conditional GANs and conditional VAEs, with an FID score 53% better than the best-performing GAN-based model.
    Abstract The Circle of Willis (CoW) is the part of cerebral vasculature responsible for delivering blood to the brain. Understanding the diverse anatomical variations and configurations of the CoW is paramount to advance research on cerebrovascular diseases and refine clinical interventions. However, comprehensive investigation of less prevalent CoW variations remains challenging because of the dominance of a few commonly occurring configurations. We propose a novel generative approach utilising a conditional latent diffusion model with shape and anatomical guidance to generate realistic 3D CoW segmentations, including different phenotypical variations. Our conditional latent diffusion model incorporates shape guidance to better preserve vessel continuity and demonstrates superior performance when compared to alternative generative models, including conditional variants of 3D GAN and 3D VAE. We observed that our model generated CoW variants that are more realistic and demonstrate higher visual fidelity than competing approaches, with an FID score 53% better than the best-performing GAN-based model.
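
The paper's shape-guided conditioning is not reproduced here, but the generic pattern of conditioning a denoiser on an auxiliary volume (a shape prior concatenated to the noisy latent at each denoising step) can be sketched as follows; channel sizes and the tiny network are placeholders.

```python
import torch
import torch.nn as nn

class CondDenoiser(nn.Module):
    """Noise predictor conditioned on a shape prior via channel concatenation."""
    def __init__(self, c_latent=4, c_cond=1, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(c_latent + c_cond, width, 3, padding=1), nn.SiLU(),
            nn.Conv3d(width, c_latent, 3, padding=1),
        )

    def forward(self, noisy_latent, cond):
        # noisy_latent: (B, c_latent, D, H, W); cond: (B, c_cond, D, H, W)
        return self.net(torch.cat([noisy_latent, cond], dim=1))  # predicted noise
```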

Neural Networks at a Fraction with Pruned Quaternions

  • paper_url: http://arxiv.org/abs/2308.06780
  • repo_url: https://github.com/smlab-niser/quartLT22
  • paper_authors: Sahel Mohammad Iqbal, Subhankar Mishra
  • for: The goal of this work is to enable the deployment of deep learning models on devices with limited computational power.
  • methods: The work uses pruning to remove unnecessary parameters and reduce the resources required for training and inference. In addition, higher-dimensional data embeddings such as complex numbers or quaternions can reduce the parameter count while maintaining accuracy.
  • results: The study finds that for some architectures and tasks, quaternion models achieve higher accuracy than their real-valued counterparts at very high sparsity. For example, on CIFAR-10 image classification with Conv-4, at 3% of the original parameter count, the pruned quaternion model outperforms the pruned real model of the same architecture by more than 10%. The experiments suggest that in extremely resource-constrained environments, a sparse quaternion network may be a better candidate than a real sparse model of similar architecture.
    Abstract Contemporary state-of-the-art neural networks have increasingly large numbers of parameters, which prevents their deployment on devices with limited computational power. Pruning is one technique to remove unnecessary weights and reduce resource requirements for training and inference. In addition, for ML tasks where the input data is multi-dimensional, using higher-dimensional data embeddings such as complex numbers or quaternions has been shown to reduce the parameter count while maintaining accuracy. In this work, we conduct pruning on real and quaternion-valued implementations of different architectures on classification tasks. We find that for some architectures, at very high sparsity levels, quaternion models provide higher accuracies than their real counterparts. For example, at the task of image classification on CIFAR-10 using Conv-4, at $3\%$ of the number of parameters as the original model, the pruned quaternion version outperforms the pruned real by more than $10\%$. Experiments on various network architectures and datasets show that for deployment in extremely resource-constrained environments, a sparse quaternion network might be a better candidate than a real sparse model of similar architecture.
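
The quaternion layers are the paper's contribution; on the pruning side, the standard magnitude-based global pruning it builds on can be reproduced with PyTorch's pruning utilities, here at 97% sparsity to mirror the 3%-of-parameters setting.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_global_l1(model, amount=0.97):
    """Globally zero the smallest-magnitude weights across conv/linear layers."""
    params = [(m, "weight") for m in model.modules()
              if isinstance(m, (nn.Conv2d, nn.Linear))]
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured,
                              amount=amount)
    for module, name in params:
        prune.remove(module, name)  # bake the pruning mask into the weights
    return model
```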

Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning

  • paper_url: http://arxiv.org/abs/2308.06777
  • repo_url: https://github.com/LiheYoung/ShrinkMatch
  • paper_authors: Lihe Yang, Zhen Zhao, Lei Qi, Yu Qiao, Yinghuan Shi, Hengshuang Zhao
  • for: This paper proposes a new method for semi-supervised learning, ShrinkMatch, which addresses the problem of potentially inaccurate pseudo labels.
  • methods: ShrinkMatch turns uncertain samples into certain ones by adaptively shrinking the class space, and applies consistency regularization between strongly and weakly augmented views for more discriminative representations.
  • results: Experiments show that ShrinkMatch performs impressively across widely adopted benchmarks and outperforms other state-of-the-art methods.
    Abstract Semi-supervised learning is attracting blooming attention, due to its success in combining unlabeled data. To mitigate potentially incorrect pseudo labels, recent frameworks mostly set a fixed confidence threshold to discard uncertain samples. This practice ensures high-quality pseudo labels, but incurs a relatively low utilization of the whole unlabeled set. In this work, our key insight is that these uncertain samples can be turned into certain ones, as long as the confusion classes for the top-1 class are detected and removed. Invoked by this, we propose a novel method dubbed ShrinkMatch to learn uncertain samples. For each uncertain sample, it adaptively seeks a shrunk class space, which merely contains the original top-1 class, as well as remaining less likely classes. Since the confusion ones are removed in this space, the re-calculated top-1 confidence can satisfy the pre-defined threshold. We then impose a consistency regularization between a pair of strongly and weakly augmented samples in the shrunk space to strive for discriminative representations. Furthermore, considering the varied reliability among uncertain samples and the gradually improved model during training, we correspondingly design two reweighting principles for our uncertain loss. Our method exhibits impressive performance on widely adopted benchmarks. Code is available at https://github.com/LiheYoung/ShrinkMatch.
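
A hedged sketch of the shrinking step: drop the strongest competitors (the "confusion classes") one at a time until the renormalized top-1 confidence clears the threshold. The paper's full method adds consistency regularization and uncertain-loss reweighting on top of this.

```python
import torch

def shrink_class_space(probs, tau=0.95):
    """probs: 1-D softmax vector; returns a boolean mask over the shrunk space."""
    order = torch.argsort(probs, descending=True)
    keep = torch.ones_like(probs, dtype=torch.bool)
    for rank in range(1, probs.numel()):
        top1_conf = probs[order[0]] / probs[keep].sum()  # renormalized confidence
        if top1_conf >= tau:
            break
        keep[order[rank]] = False                        # remove next confusion class
    return keep
```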

Unsupervised Image Denoising in Real-World Scenarios via Self-Collaboration Parallel Generative Adversarial Branches

  • paper_url: http://arxiv.org/abs/2308.06776
  • repo_url: https://github.com/linxin0/scpgabnet
  • paper_authors: Xin Lin, Chao Ren, Xiao Liu, Jie Huang, Yinjie Lei
  • for: To improve image denoising performance without paired training data, as required in real-world scenarios.
  • methods: An unsupervised method based on generative adversarial networks that iteratively replaces the denoiser in a filter-guided noise extraction module with the current, more powerful denoiser to improve denoising performance.
  • results: Compared with state-of-the-art unsupervised methods, the approach achieves better denoising performance without increasing the inference complexity of the denoiser.
    Abstract Deep learning methods have shown remarkable performance in image denoising, particularly when trained on large-scale paired datasets. However, acquiring such paired datasets for real-world scenarios poses a significant challenge. Although unsupervised approaches based on generative adversarial networks offer a promising solution for denoising without paired datasets, they struggle to surpass the performance limitations of conventional GAN-based unsupervised frameworks without significantly modifying existing structures or increasing the computational complexity of denoisers. To address this problem, we propose a self-collaboration (SC) strategy for multiple denoisers. This strategy can achieve significant performance improvement without increasing the inference complexity of the GAN-based denoising framework. Its basic idea is to iteratively replace the previous less powerful denoiser in the filter-guided noise extraction module with the current powerful denoiser. This process generates better synthetic clean-noisy image pairs, leading to a more powerful denoiser for the next iteration. This baseline ensures the stability and effectiveness of the training network. The experimental results demonstrate the superiority of our method over state-of-the-art unsupervised methods.

A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations

  • paper_url: http://arxiv.org/abs/2308.06767
  • repo_url: https://github.com/hrcheng1066/awesome-pruning
  • paper_authors: Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi
  • for: This survey addresses the problem that modern deep neural networks require substantial computation and storage, aiming to enable deployment and accelerated inference in resource-constrained environments.
  • methods: The paper provides a comprehensive review of existing deep neural network pruning techniques, organized by 1) universal/specific speedup, 2) when to prune, 3) how to prune, and 4) fusion of pruning with other compression techniques.
  • results: The paper offers a comparative analysis of seven pairs of contrasting pruning settings (e.g., post-training pruning, different levels of supervision for pruning, and broader applications such as adversarial robustness), clarifying the commonalities and differences of existing methods and laying a foundation for future research.
    Abstract Modern deep neural networks, particularly recent large language models, come with massive model sizes that require significant computational and storage resources. To enable the deployment of modern models on resource-constrained environments and accelerate inference time, researchers have increasingly explored pruning techniques as a popular research direction in neural network compression. However, there is a dearth of up-to-date comprehensive review papers on pruning. To address this issue, in this survey, we provide a comprehensive review of existing research works on deep neural network pruning in a taxonomy of 1) universal/specific speedup, 2) when to prune, 3) how to prune, and 4) fusion of pruning and other compression techniques. We then provide a thorough comparative analysis of seven pairs of contrast settings for pruning (e.g., unstructured/structured) and explore emerging topics, including post-training pruning, different levels of supervision for pruning, and broader applications (e.g., adversarial robustness) to shed light on the commonalities and differences of existing methods and lay the foundation for further method development. To facilitate future research, we build a curated collection of datasets, networks, and evaluations on different applications. Finally, we provide some valuable recommendations on selecting pruning methods and prospect promising research directions. We build a repository at https://github.com/hrcheng1066/awesome-pruning.

Tissue Segmentation of Thick-Slice Fetal Brain MR Scans with Guidance from High-Quality Isotropic Volumes

  • paper_url: http://arxiv.org/abs/2308.06762
  • repo_url: None
  • paper_authors: Shijie Huang, Xukun Zhang, Zhiming Cui, He Zhang, Geng Chen, Dinggang Shen
  • for: The goal of this paper is accurate tissue segmentation of thick-slice fetal brain MR scans.
  • methods: The paper uses domain adaptation, leveraging high-quality isotropic fetal brain MR volumes (and their corresponding annotations) as guidance for segmenting thick-slice scans.
  • results: Experiments show that the proposed C2DA-Net improves tissue segmentation accuracy on thick-slice fetal brain MR scans and outperforms cutting-edge methods.
    Abstract Accurate tissue segmentation of thick-slice fetal brain magnetic resonance (MR) scans is crucial for both reconstruction of isotropic brain MR volumes and the quantification of fetal brain development. However, this task is challenging due to the use of thick-slice scans in clinically-acquired fetal brain data. To address this issue, we propose to leverage high-quality isotropic fetal brain MR volumes (and also their corresponding annotations) as guidance for segmentation of thick-slice scans. Due to the existence of a significant domain gap between high-quality isotropic volumes (i.e., source data) and thick-slice scans (i.e., target data), we employ a domain adaptation technique to achieve the associated knowledge transfer (from high-quality volumes to thick-slice scans). Specifically, we first register the available high-quality isotropic fetal brain MR volumes across different gestational weeks to construct longitudinally-complete source data. To capture domain-invariant information, we then perform Fourier decomposition to extract image content and style codes. Finally, we propose a novel Cycle-Consistent Domain Adaptation Network (C2DA-Net) to efficiently transfer the knowledge learned from high-quality isotropic volumes for accurate tissue segmentation of thick-slice scans. Our C2DA-Net can fully utilize a small set of annotated isotropic volumes to guide tissue segmentation on unannotated thick-slice scans. Extensive experiments on a large-scale dataset of 372 clinically acquired thick-slice MR scans demonstrate that our C2DA-Net achieves much better performance than cutting-edge methods quantitatively and qualitatively.
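
The Fourier decomposition into content and style codes follows the common amplitude/phase split, which can be sketched directly with NumPy; the registration and C2DA-Net stages themselves are not reproduced here.

```python
import numpy as np

def fourier_content_style(img):
    f = np.fft.fft2(img)
    return np.abs(f), np.angle(f)        # amplitude ("style"), phase ("content")

def recombine(amplitude, phase):
    return np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))
```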

Influence Function Based Second-Order Channel Pruning-Evaluating True Loss Changes For Pruning Is Possible Without Retraining

  • paper_url: http://arxiv.org/abs/2308.06755
  • repo_url: https://github.com/hrcheng1066/ifso
  • paper_authors: Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi
  • for: This paper proposes a new channel pruning method that selects channels to prune more effectively.
  • methods: The method uses influence functions to evaluate the true loss change of pruning a channel without retraining the weights.
  • results: Experiments show that the method selects channels to prune more reliably and quickly than existing approaches. It also opens new possibilities, since true loss changes can be evaluated without retraining.
    Abstract A challenge of channel pruning is designing efficient and effective criteria to select channels to prune. A widely used criterion is minimal performance degeneration. To accurately evaluate the true performance degeneration requires retraining the survived weights to convergence, which is prohibitively slow. Hence existing pruning methods use previous weights (without retraining) to evaluate the performance degeneration. However, we observe the loss changes differ significantly with and without retraining. This motivates us to develop a technique to evaluate true loss changes without retraining, with which channels to prune can be selected more reliably and confidently. We first derive a closed-form estimator of the true loss change per pruning mask change, using influence functions without retraining. The influence function, a tool from robust statistics, reveals the impacts of a training sample on the model's prediction and is repurposed by us to assess impacts on true loss changes. We then show how to assess the importance of all channels simultaneously and develop a novel global channel pruning algorithm accordingly. We conduct extensive experiments to verify the effectiveness of the proposed algorithm. To the best of our knowledge, we are the first to show that evaluating true loss changes for pruning without retraining is possible. This finding will open up opportunities for a series of new paradigms to emerge that differ from existing pruning methods. The code is available at https://github.com/hrcheng1066/IFSO.
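
The estimator itself is given in closed form in the paper; as a hedged illustration of scoring a pruning perturbation without retraining, the second-order Taylor estimate below uses a Hessian-vector product (so the full Hessian is never formed), with `delta` the weight change induced by a candidate mask. This is not the paper's exact estimator.

```python
import torch

def taylor_loss_change(loss, params, delta):
    """Estimate dL ≈ g·Δw + 0.5 Δw·(HΔw) for a pruning perturbation Δw = delta."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    g_dot = sum((g * d).sum() for g, d in zip(grads, delta))  # g·Δw
    hvp = torch.autograd.grad(g_dot, params)                  # HΔw via autograd
    quad = 0.5 * sum((h * d).sum() for h, d in zip(hvp, delta))
    return (g_dot + quad).item()
```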

FastLLVE: Real-Time Low-Light Video Enhancement with Intensity-Aware Lookup Table

  • paper_url: http://arxiv.org/abs/2308.06749
  • repo_url: https://github.com/wenhao-li-777/fastllve
  • paper_authors: Wenhao Li, Guangyang Wu, Wenyi Wang, Peiran Ren, Xiaohong Liu
  • for: To enhance the quality of low-light videos.
  • methods: The method uses the Look-Up-Table (LUT) technique to maintain inter-frame brightness consistency, and designs a learnable Intensity-Aware LUT (IA-LUT) module for adaptive enhancement.
  • results: Experiments show that the method achieves state-of-the-art results in both image quality and inter-frame brightness consistency, and can process 1080p video at 50+ frames per second, faster than SOTA CNN-based methods.
    Abstract Low-Light Video Enhancement (LLVE) has received considerable attention in recent years. One of the critical requirements of LLVE is inter-frame brightness consistency, which is essential for maintaining the temporal coherence of the enhanced video. However, most existing single-image-based methods fail to address this issue, resulting in flickering effect that degrades the overall quality after enhancement. Moreover, 3D Convolution Neural Network (CNN)-based methods, which are designed for video to maintain inter-frame consistency, are computationally expensive, making them impractical for real-time applications. To address these issues, we propose an efficient pipeline named FastLLVE that leverages the Look-Up-Table (LUT) technique to maintain inter-frame brightness consistency effectively. Specifically, we design a learnable Intensity-Aware LUT (IA-LUT) module for adaptive enhancement, which addresses the low-dynamic problem in low-light scenarios. This enables FastLLVE to perform low-latency and low-complexity enhancement operations while maintaining high-quality results. Experimental results on benchmark datasets demonstrate that our method achieves the State-Of-The-Art (SOTA) performance in terms of both image quality and inter-frame brightness consistency. More importantly, our FastLLVE can process 1,080p videos at $\mathit{50+}$ Frames Per Second (FPS), which is $\mathit{2 \times}$ faster than SOTA CNN-based methods in inference time, making it a promising solution for real-time applications. The code is available at https://github.com/Wenhao-Li-777/FastLLVE.
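
The core LUT lookup with trilinear interpolation can be expressed with `grid_sample`, as learnable-3D-LUT methods commonly do; the intensity-aware part of IA-LUT is omitted, and the identity LUT below is only an initialization example, not the paper's learned table.

```python
import torch
import torch.nn.functional as F

def apply_3d_lut(img, lut):
    # img: (B, 3, H, W) RGB in [0, 1]; lut: (3, S, S, S) indexed as (channel, b, g, r)
    grid = (img.permute(0, 2, 3, 1) * 2 - 1).unsqueeze(1)  # (B, 1, H, W, 3), xyz=(r, g, b)
    out = F.grid_sample(lut.unsqueeze(0).expand(img.size(0), -1, -1, -1, -1),
                        grid, mode="bilinear", align_corners=True)  # trilinear in 5D
    return out.squeeze(2)                                  # (B, 3, H, W)

S = 17
axis = torch.linspace(0, 1, S)
bb, gg, rr = torch.meshgrid(axis, axis, axis, indexing="ij")  # dims ordered (b, g, r)
identity_lut = torch.stack([rr, gg, bb], dim=0)               # maps every color to itself
```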

Target before Shooting: Accurate Anomaly Detection and Localization under One Millisecond via Cascade Patch Retrieval

  • paper_url: http://arxiv.org/abs/2308.06748
  • repo_url: https://github.com/flyinghu123/cpr
  • paper_authors: Hanxi Li, Jianfei Hu, Bo Li, Hao Chen, Yongbin Zheng, Chunhua Shen
  • for: The paper proposes a new anomaly detection framework that simultaneously achieves high detection accuracy and high running speed.
  • methods: The framework selects the best reference images for each test image patch via coarse histogram matching, then retrieves the best local match at similar geometrical locations on those references. Finally, the anomaly score of each test patch is computed from the local matching distance and the non-background probability.
  • results: On the MVTec AD, BTAD, and MVTec-3D AD benchmarks, the proposed method outperforms all comparison methods by remarkable margins, achieving higher accuracy at lower time complexity across anomaly detection tasks.
    Abstract In this work, by re-examining the "matching" nature of Anomaly Detection (AD), we propose a new AD framework that simultaneously enjoys new records of AD accuracy and dramatically high running speed. In this framework, the anomaly detection problem is solved via a cascade patch retrieval procedure that retrieves the nearest neighbors for each test image patch in a coarse-to-fine fashion. Given a test sample, the top-K most similar training images are first selected based on a robust histogram matching process. Secondly, the nearest neighbor of each test patch is retrieved over the similar geometrical locations on those "global nearest neighbors", by using a carefully trained local metric. Finally, the anomaly score of each test image patch is calculated based on the distance to its "local nearest neighbor" and the "non-background" probability. The proposed method is termed "Cascade Patch Retrieval" (CPR) in this work. Different from the conventional patch-matching-based AD algorithms, CPR selects proper "targets" (reference images and locations) before "shooting" (patch-matching). On the well-acknowledged MVTec AD, BTAD and MVTec-3D AD datasets, the proposed algorithm consistently outperforms all the comparing SOTA methods by remarkable margins, measured by various AD metrics. Furthermore, CPR is extremely efficient. It runs at the speed of 113 FPS with the standard setting while its simplified version only requires less than 1 ms to process an image at the cost of a trivial accuracy drop. The code of CPR is available at https://github.com/flyinghu123/CPR.
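
A hedged sketch of "targeting before shooting": a coarse histogram match selects the top-K reference images, then each test patch is scored by its nearest-neighbour distance over the same spatial neighbourhood in those references. The learned local metric is simplified to plain L2 here, and intensities are assumed in [0, 1].

```python
import numpy as np

def topk_by_histogram(test_img, train_imgs, k=3, bins=32):
    """Coarse stage: indices of the k training images with the closest histograms."""
    h0 = np.histogram(test_img, bins=bins, range=(0, 1), density=True)[0]
    d = [np.abs(h0 - np.histogram(t, bins=bins, range=(0, 1), density=True)[0]).sum()
         for t in train_imgs]
    return np.argsort(d)[:k]

def patch_anomaly_score(q, refs, y, x, window=2):
    """Fine stage: distance from patch feature q at (y, x) to its local NN in refs."""
    best = np.inf
    for ref in refs:                       # ref: (Hp, Wp, C) patch-feature grid
        y0, y1 = max(0, y - window), min(ref.shape[0], y + window + 1)
        x0, x1 = max(0, x - window), min(ref.shape[1], x + window + 1)
        best = min(best, float(np.linalg.norm(ref[y0:y1, x0:x1] - q, axis=-1).min()))
    return best
```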

Self-supervised Noise2noise Method Utilizing Corrupted Images with a Modular Network for LDCT Denoising

  • paper_url: http://arxiv.org/abs/2308.06746
  • repo_url: https://github.com/xyuan01/self-supervised-noise2noise-for-ldct
  • paper_authors: Yuting Zhu, Qiang He, Yudong Yao, Yueyang Teng
  • for: This paper proposes a self-supervised denoising method for low-dose computed tomography (LDCT) images that needs only LDCT data, without paired normal-dose scans.
  • methods: The paper combines a self-supervised noise2noise model with the noisy-as-clean strategy. First, a second, similar type of noise is repeatedly added to the LDCT images; training then uses only the secondarily corrupted images. A modular U-Net structure with shared parameters is chosen for the task, which increases the receptive field without increasing the parameter count.
  • results: Experimental results on the Mayo LDCT dataset show that the proposed method is more effective than state-of-the-art deep learning methods.
    Abstract Deep learning is a very promising technique for low-dose computed tomography (LDCT) image denoising. However, traditional deep learning methods require paired noisy and clean datasets, which are often difficult to obtain. This paper proposes a new method for performing LDCT image denoising with only LDCT data, which means that normal-dose CT (NDCT) is not needed. We adopt a combination including the self-supervised noise2noise model and the noisy-as-clean strategy. First, we add a second yet similar type of noise to LDCT images multiple times. Note that we use LDCT images based on the noisy-as-clean strategy for corruption instead of NDCT images. Then, the noise2noise model is executed with only the secondary corrupted images for training. We select a modular U-Net structure from several candidates with shared parameters to perform the task, which increases the receptive field without increasing the parameter size. The experimental results obtained on the Mayo LDCT dataset show the effectiveness of the proposed method compared with that of state-of-the-art deep learning methods. The developed code is available at https://github.com/XYuan01/Self-supervised-Noise2Noise-for-LDCT.
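
The noisy-as-clean pair construction can be sketched in a few lines: the LDCT image itself plays the "clean" role and is corrupted twice with a similar simulated noise, giving a noise2noise input/target pair. Gaussian noise below is an assumption standing in for a CT-specific noise model.

```python
import numpy as np

def make_pair(ldct, sigma=0.02, rng=None):
    """Return (input, target): two independent corruptions of the same LDCT image."""
    rng = np.random.default_rng() if rng is None else rng
    noisy_a = ldct + rng.normal(0, sigma, ldct.shape)
    noisy_b = ldct + rng.normal(0, sigma, ldct.shape)
    return noisy_a.astype(np.float32), noisy_b.astype(np.float32)
```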

TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2308.06743
  • repo_url: https://github.com/lenubolim/textdiff
  • paper_authors: Baolin Liu, Zongyuan Yang, Pengfei Wang, Junjie Zhou, Ziqi Liu, Ziyi Song, Yan Liu, Yongping Xiong
  • for: The paper aims to improve the readability and recognizability of scene text images by proposing a diffusion-based framework for scene text image super-resolution.
  • methods: The proposed method, called TextDiff, consists of two modules: the Text Enhancement Module (TEM) and the Mask-Guided Residual Diffusion Module (MRD). The TEM generates an initial deblurred text image and a mask that encodes the spatial location of the text, while the MRD effectively sharpens the text edges by modeling the residuals between the ground-truth images and the initial deblurred images.
  • results: The proposed TextDiff achieves state-of-the-art (SOTA) performance on public benchmark datasets and can improve the readability of scene text images. Additionally, the MRD module is plug-and-play and can effectively sharpen the text edges produced by SOTA methods without requiring any additional joint training.
    Abstract The goal of scene text image super-resolution is to reconstruct high-resolution text-line images from unrecognizable low-resolution inputs. The existing methods relying on the optimization of pixel-level loss tend to yield text edges that exhibit a notable degree of blurring, thereby exerting a substantial impact on both the readability and recognizability of the text. To address these issues, we propose TextDiff, the first diffusion-based framework tailored for scene text image super-resolution. It contains two modules: the Text Enhancement Module (TEM) and the Mask-Guided Residual Diffusion Module (MRD). The TEM generates an initial deblurred text image and a mask that encodes the spatial location of the text. The MRD is responsible for effectively sharpening the text edge by modeling the residuals between the ground-truth images and the initial deblurred images. Extensive experiments demonstrate that our TextDiff achieves state-of-the-art (SOTA) performance on public benchmark datasets and can improve the readability of scene text images. Moreover, our proposed MRD module is plug-and-play that effectively sharpens the text edges produced by SOTA methods. This enhancement not only improves the readability and recognizability of the results generated by SOTA methods but also does not require any additional joint training. Available Codes:https://github.com/Lenubolim/TextDiff.
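
The MRD module's core training signal can be compressed to a short sketch: the diffusion model's target is the residual between the ground-truth image and the initial deblurred estimate, with supervision focused by the text-location mask. This is a hedged reading of the module; the diffusion schedule itself is omitted.

```python
import torch

def residual_target(hr_gt, init_deblurred, text_mask):
    residual = hr_gt - init_deblurred      # what the residual diffusion model learns
    return residual * text_mask            # mask guidance: emphasize text regions
```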

Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks

  • paper_url: http://arxiv.org/abs/2308.06739
  • repo_url: None
  • paper_authors: David Junhao Zhang, Mutian Xu, Chuhui Xue, Wenqing Zhang, Xiaoguang Han, Song Bai, Mike Zheng Shou
  • for: This work addresses the cost of unsupervised visual representation learning, which requires training on large-scale datasets with expensive data collection and data privacy concerns.
  • methods: The authors start by uncovering that the cross-attention layers of diffusion models inherently provide annotation-free attention masks aligned with the corresponding text inputs. They then investigate three prevalent unsupervised learning techniques (contrastive learning, masked modeling, and vision-language pretraining) and introduce customized solutions that fully exploit these free attention masks.
  • results: Extensive experiments show that the method consistently improves baseline models across downstream tasks, including image classification, detection, segmentation, and image-text retrieval, helping close the performance gap between unsupervised pretraining on synthetic data and real-world scenarios.
    Abstract Despite the rapid advancement of unsupervised learning in visual representation, it requires training on large-scale datasets that demand costly data collection, and pose additional challenges due to concerns regarding data privacy. Recently, synthetic images generated by text-to-image diffusion models, have shown great potential for benefiting image recognition. Although promising, there has been inadequate exploration dedicated to unsupervised learning on diffusion-generated images. To address this, we start by uncovering that diffusion models' cross-attention layers inherently provide annotation-free attention masks aligned with corresponding text inputs on generated images. We then investigate the problems of three prevalent unsupervised learning techniques ( i.e., contrastive learning, masked modeling, and vision-language pretraining) and introduce customized solutions by fully exploiting the aforementioned free attention masks. Our approach is validated through extensive experiments that show consistent improvements in baseline models across various downstream tasks, including image classification, detection, segmentation, and image-text retrieval. By utilizing our method, it is possible to close the performance gap between unsupervised pretraining on synthetic data and real-world scenarios.

3D Scene Graph Prediction on Point Clouds Using Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2308.06719
  • repo_url: None
  • paper_authors: Yiding Qiu, Henrik I. Christensen
  • for: scene graph prediction in 3D environments
  • methods: message-passing method with commonsense knowledge graphs
  • results: 15.0% improvement in scene graph prediction accuracy with external knowledge, 7.96% improvement with internal knowledge compared to state-of-the-art algorithms, and real-world testing at 10 frames per second for scene graph generation.
    Abstract 3D scene graph prediction is a task that aims to concurrently predict object classes and their relationships within a 3D environment. As these environments are primarily designed by and for humans, incorporating commonsense knowledge regarding objects and their relationships can significantly constrain and enhance the prediction of the scene graph. In this paper, we investigate the application of commonsense knowledge graphs for 3D scene graph prediction on point clouds of indoor scenes. Through experiments conducted on a real-world indoor dataset, we demonstrate that integrating commonsense knowledge via the message-passing method leads to a 15.0% improvement in scene graph prediction accuracy with external knowledge and 7.96% with internal knowledge when compared to state-of-the-art algorithms. We also tested the model in the real world, generating scene graphs at 10 frames per second, to demonstrate its usage in a more realistic robotics setting.
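A minimal sketch of how commonsense knowledge can enter a message-passing network for scene graph prediction: each node's class belief pulls in a learned knowledge embedding that is mixed into the messages. The toy graph, the GRU update, and the embedding lookup are assumptions for illustration; the paper's actual network and knowledge source differ:

```python
import torch
import torch.nn as nn

class KnowledgeMessagePassing(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        self.class_embed = nn.Embedding(num_classes, dim)  # commonsense class embeddings
        self.msg = nn.Linear(2 * dim, dim)                 # message: neighbor + knowledge
        self.update = nn.GRUCell(dim, dim)                 # node-state update

    def forward(self, node_feats, edges, class_logits):
        # Soft class belief per node, used to look up commonsense embeddings.
        knowledge = class_logits.softmax(-1) @ self.class_embed.weight  # (N, dim)
        agg = torch.zeros_like(node_feats)
        for s, t in edges:  # directed edges over the 3D instance graph
            m = self.msg(torch.cat([node_feats[s], knowledge[s]], dim=-1))
            agg[t] = agg[t] + m
        return self.update(agg, node_feats)  # refined node features

N, dim, C = 5, 32, 10
mp = KnowledgeMessagePassing(dim, C)
feats, logits = torch.randn(N, dim), torch.randn(N, C)
edges = [(0, 1), (1, 2), (3, 4)]
print(mp(feats, edges, logits).shape)  # torch.Size([5, 32])
```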

StairNetV3: Depth-aware Stair Modeling using Deep Learning

  • paper_url: http://arxiv.org/abs/2308.06715
  • repo_url: None
  • paper_authors: Chen Wang, Zhongcai Pei, Shuang Qiu, Yachun Wang, Zhiyong Tang
  • for: This paper proposes a vision-based technique for autonomous mobile robots to climb stairs, especially in unfamiliar environments.
  • methods: A depth-aware stair modeling method based on a convolutional neural network (CNN) that treats the extraction of stair geometric features and the prediction of depth images as joint tasks, using a designed information propagation architecture to achieve effective supervised learning.
  • results: Experiments show a significant improvement over the previous best monocular vision method (a 3.4% IOU increase), and the lightweight version has a fast detection speed that meets the requirements of most real-time applications.
    Abstract Vision-based stair perception can help autonomous mobile robots deal with the challenge of climbing stairs, especially in unfamiliar environments. To address the problem that current monocular vision methods struggle to model stairs accurately without depth information, this paper proposes a depth-aware stair modeling method for monocular vision. Specifically, we take the extraction of stair geometric features and the prediction of depth images as joint tasks in a convolutional neural network (CNN); with the designed information propagation architecture, we can achieve effective supervision for stair geometric feature learning from depth information. In addition, to complete the stair modeling, we take the convex lines, concave lines, tread surfaces and riser surfaces as stair geometric features and apply Gaussian kernels to enable the network to predict contextual information within the stair lines. Combined with the depth information obtained by depth sensors, we propose a stair point cloud reconstruction method that can quickly obtain point clouds belonging to the stair step surfaces. Experiments on our dataset show that our method achieves a significant improvement over the previous best monocular vision method, with an intersection over union (IOU) increase of 3.4%, and the lightweight version has a fast detection speed that can meet the requirements of most real-time applications. Our dataset is available at https://data.mendeley.com/datasets/6kffmjt7g2/1.
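As a concrete illustration of the Gaussian-kernel supervision around stair lines, the following sketch renders a Gaussian heatmap of the distance from each pixel to a line segment. The sigma value and line endpoints are illustrative assumptions, not values from the paper:

```python
import numpy as np

def line_gaussian_heatmap(h, w, p0, p1, sigma=2.0):
    """Gaussian of the distance from each pixel to the segment p0-p1."""
    ys, xs = np.mgrid[0:h, 0:w]
    p = np.stack([xs, ys], axis=-1).astype(np.float64)     # (h, w, 2) pixel coords
    d = np.asarray(p1, float) - np.asarray(p0, float)
    # Project each pixel onto the segment, clamping to its endpoints.
    t = np.clip(((p - p0) @ d) / (d @ d), 0.0, 1.0)
    closest = p0 + t[..., None] * d
    dist2 = ((p - closest) ** 2).sum(-1)
    return np.exp(-dist2 / (2 * sigma**2))                 # peaks at 1 on the line

hm = line_gaussian_heatmap(64, 96, p0=(10, 40), p1=(85, 42))
print(hm.shape, hm.max())   # (64, 96), ~1.0 along the stair line
```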

LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts

  • paper_url: http://arxiv.org/abs/2308.06713
  • repo_url: None
  • paper_authors: Binbin Yang, Yi Luo, Ziliang Chen, Guangrun Wang, Xiaodan Liang, Liang Lin
  • for: This work targets high-quality complex scene generation, improving upon existing diffusion models.
  • methods: A semantically controllable Layout-AWare diffusion model (LAW-Diffusion) that uses a built-in spatial dependency parser and a location-aware cross-object attention module to generate scenes with region-aligned semantics and coherent spatial relations among objects.
  • results: Compared with previous Layout-to-Image (L2I) methods, LAW-Diffusion generates scenes with more coherent object relations and spatial consistency, and supports instance reconfiguration in practice.
    Abstract Thanks to the rapid development of diffusion models, unprecedented progress has been witnessed in image synthesis. Prior works mostly rely on pre-trained linguistic models, but a text is often too abstract to properly specify all the spatial properties of an image, e.g., the layout configuration of a scene, leading to the sub-optimal results of complex scene generation. In this paper, we achieve accurate complex scene generation by proposing a semantically controllable Layout-AWare diffusion model, termed LAW-Diffusion. Distinct from the previous Layout-to-Image generation (L2I) methods that only explore category-aware relationships, LAW-Diffusion introduces a spatial dependency parser to encode the location-aware semantic coherence across objects as a layout embedding and produces a scene with perceptually harmonious object styles and contextual relations. To be specific, we delicately instantiate each object's regional semantics as an object region map and leverage a location-aware cross-object attention module to capture the spatial dependencies among those disentangled representations. We further propose an adaptive guidance schedule for our layout guidance to mitigate the trade-off between the regional semantic alignment and the texture fidelity of generated objects. Moreover, LAW-Diffusion allows for instance reconfiguration while maintaining the other regions in a synthesized image by introducing a layout-aware latent grafting mechanism to recompose its local regional semantics. To better verify the plausibility of generated scenes, we propose a new evaluation metric for the L2I task, dubbed Scene Relation Score (SRS) to measure how the images preserve the rational and harmonious relations among contextual objects. Comprehensive experiments demonstrate that our LAW-Diffusion yields the state-of-the-art generative performance, especially with coherent object relations.

Compositional Feature Augmentation for Unbiased Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2308.06712
  • repo_url: None
  • paper_authors: Lin Li, Guikun Chen, Jun Xiao, Yi Yang, Chunping Wang, Long Chen
  • for: This work studies how to better detect the visual relation triplets <sub, pred, obj> in images to improve the performance of Scene Graph Generation (SGG).
  • methods: A novel Compositional Feature Augmentation (CFA) strategy that increases the diversity of relation triplet features for each predicate, improving the robustness of SGG. CFA decomposes each relation triplet feature into two parts, an intrinsic feature and an extrinsic feature, and enriches triplet-feature diversity by replacing or mixing these components with features from other samples.
  • results: Compared with existing re-balancing strategies, CFA better increases the diversity of relation triplet features for each predicate; extensive ablations show that CFA achieves a new state-of-the-art trade-off between different metrics.
    Abstract Scene Graph Generation (SGG) aims to detect all the visual relation triplets in a given image. With the emergence of various advanced techniques for better utilizing both the intrinsic and extrinsic information in each relation triplet, SGG has achieved great progress over the recent years. However, due to the ubiquitous long-tailed predicate distributions, today's SGG models are still easily biased to the head predicates. Currently, the most prevalent debiasing solutions for SGG are re-balancing methods, e.g., changing the distributions of original training samples. In this paper, we argue that all existing re-balancing strategies fail to increase the diversity of the relation triplet features of each predicate, which is critical for robust SGG. To this end, we propose a novel Compositional Feature Augmentation (CFA) strategy, which is the first unbiased SGG work to mitigate the bias issue from the perspective of increasing the diversity of triplet features. Specifically, we first decompose each relation triplet feature into two components: intrinsic feature and extrinsic feature, which correspond to the intrinsic characteristics and extrinsic contexts of a relation triplet, respectively. Then, we design two different feature augmentation modules to enrich the feature diversity of original relation triplets by replacing or mixing up either their intrinsic or extrinsic features from other samples. Due to its model-agnostic nature, CFA can be seamlessly incorporated into various SGG frameworks. Extensive ablations have shown that CFA achieves a new state-of-the-art performance on the trade-off between different metrics.
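A minimal sketch of the compositional idea: split each relation-triplet feature into an intrinsic and an extrinsic part and mix the extrinsic part with that of another sample of the same predicate. The even feature split and the Beta(1,1) mixing coefficient are illustrative assumptions rather than the paper's exact design:

```python
import torch

def cfa_mix(triplet_feats, partner_idx, lam=None):
    """triplet_feats: (N, d). partner_idx: (N,) index of a same-predicate sample."""
    d = triplet_feats.shape[1] // 2
    intrinsic, extrinsic = triplet_feats[:, :d], triplet_feats[:, d:]
    if lam is None:
        lam = torch.distributions.Beta(1.0, 1.0).sample()
    # Keep intrinsic characteristics; mix extrinsic contexts across samples.
    mixed_ext = lam * extrinsic + (1 - lam) * extrinsic[partner_idx]
    return torch.cat([intrinsic, mixed_ext], dim=1)

feats = torch.randn(8, 64)
partners = torch.randperm(8)          # stand-in for same-predicate pairing
aug = cfa_mix(feats, partners)
print(aug.shape)                      # torch.Size([8, 64])
```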

Condition-Adaptive Graph Convolution Learning for Skeleton-Based Gait Recognition

  • paper_url: http://arxiv.org/abs/2308.06707
  • repo_url: https://github.com/oliverhxh/cag
  • paper_authors: Xiaohu Huang, Xinggang Wang, Zhidianqiu Jin, Bo Yang, Botao He, Bin Feng, Wenyu Liu
  • for: This work aims to improve identification accuracy in skeleton-based gait recognition, using graph convolutional networks (GCNs) to extract features of different subjects' body poses across multiple views.
  • methods: A condition-adaptive graph (CAG) convolution network that adapts to the attributes of each sequence and its view. CAG comprises a joint-specific filter learning (JSFL) module, which produces a distinct filter for each joint to capture fine-grained pose features, and a view-adaptive topology learning (VATL) module, which generates view-adaptive graph topologies to correlate the joints accordingly.
  • results: Experiments show that CAG surpasses all previous skeleton-based methods on the two most widely used datasets (CASIA-B and OU-MVLP); moreover, combining CAG with appearance-based methods provides useful complementary information and further improves recognition accuracy.
    Abstract Graph convolutional networks have been widely applied in skeleton-based gait recognition. A key challenge in this task is to distinguish the individual walking styles of different subjects across various views. Existing state-of-the-art methods employ uniform convolutions to extract features from diverse sequences and ignore the effects of viewpoint changes. To overcome these limitations, we propose a condition-adaptive graph (CAG) convolution network that can dynamically adapt to the specific attributes of each skeleton sequence and the corresponding view angle. In contrast to using fixed weights for all joints and sequences, we introduce a joint-specific filter learning (JSFL) module in the CAG method, which produces sequence-adaptive filters at the joint level. The adaptive filters capture fine-grained patterns that are unique to each joint, enabling the extraction of diverse spatial-temporal information about body parts. Additionally, we design a view-adaptive topology learning (VATL) module that generates adaptive graph topologies. These graph topologies are used to correlate the joints adaptively according to the specific view conditions. Thus, CAG can simultaneously adjust to various walking styles and viewpoints. Experiments on the two most widely used datasets (i.e., CASIA-B and OU-MVLP) show that CAG surpasses all previous skeleton-based methods. Moreover, the recognition performance can be enhanced by simply combining CAG with appearance-based methods, demonstrating the ability of CAG to provide useful complementary information. The source code will be available at https://github.com/OliverHxh/CAG.

Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2308.06693
  • repo_url: https://github.com/dlut-yyc/isomer
  • paper_authors: Yichen Yuan, Yifan Wang, Lijun Wang, Xiaoqi Zhao, Huchuan Lu, Yu Wang, Weibo Su, Lei Zhang
  • for: This paper targets the Zero-Shot Video Object Segmentation (ZVOS) task, i.e., segmenting objects in videos to accurate locations without using any annotated video data.
  • methods: Two Transformer-based designs, the Context-Sharing Transformer (CST) and the Semantic Gathering-Scattering Transformer (SGST), that improve both the performance and the computational efficiency of ZVOS.
  • results: The method achieves new state-of-the-art ZVOS performance while improving computational efficiency, running 13 times faster than the baseline. Code is available at https://github.com/DLUT-yyc/Isomer.
    Abstract Recent leading zero-shot video object segmentation (ZVOS) works are devoted to integrating appearance and motion information by elaborately designing feature fusion modules and identically applying them in multiple feature stages. Our preliminary experiments show that with the strong long-range dependency modeling capacity of Transformer, simply concatenating the two modality features and feeding them to vanilla Transformers for feature fusion can distinctly benefit the performance, but at a cost of heavy computation. Through further empirical analysis, we find that attention dependencies learned in Transformer in different stages exhibit completely different properties: global query-independent dependency in the low-level stages and semantic-specific dependency in the high-level stages. Motivated by the observations, we propose two Transformer variants: i) Context-Sharing Transformer (CST) that learns the global-shared contextual information within image frames with a lightweight computation. ii) Semantic Gathering-Scattering Transformer (SGST) that models the semantic correlation separately for the foreground and background and reduces the computation cost with a soft token merging mechanism. We apply CST and SGST for low-level and high-level feature fusions, respectively, formulating a level-isomerous Transformer framework for the ZVOS task. Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance. Code is available at https://github.com/DLUT-yyc/Isomer.

SimMatchV2: Semi-Supervised Learning with Graph Consistency

  • paper_url: http://arxiv.org/abs/2308.06692
  • repo_url: https://github.com/mingkai-zheng/simmatchv2
  • paper_authors: Mingkai Zheng, Shan You, Lang Huang, Chen Luo, Fei Wang, Chen Qian, Chang Xu
  • for: This paper proposes a new semi-supervised learning algorithm for the semi-supervised image classification problem in computer vision.
  • methods: The algorithm builds on message passing and node classification from graph theory and proposes four consistencies: node-node, node-edge, edge-edge, and edge-node consistency.
  • results: Validated on multiple semi-supervised learning benchmarks; with ResNet-50 as the backbone and 300 epochs of training, SimMatchV2 achieves 71.9% and 76.2% Top-1 accuracy on ImageNet with 1% and 10% labeled examples respectively, significantly outperforming previous methods and achieving state-of-the-art performance.
    Abstract Semi-supervised image classification is one of the most fundamental problems in computer vision, which significantly reduces the need for human labor. In this paper, we introduce a new semi-supervised learning algorithm - SimMatchV2, which formulates various consistency regularizations between labeled and unlabeled data from the graph perspective. In SimMatchV2, we regard the augmented view of a sample as a node, which consists of a label and its corresponding representation. Different nodes are connected with edges, which are measured by the similarity of the node representations. Inspired by the message passing and node classification in graph theory, we propose four types of consistencies, namely 1) node-node consistency, 2) node-edge consistency, 3) edge-edge consistency, and 4) edge-node consistency. We also uncover that a simple feature normalization can reduce the gaps of the feature norm between different augmented views, significantly improving the performance of SimMatchV2. Our SimMatchV2 has been validated on multiple semi-supervised learning benchmarks. Notably, with ResNet-50 as our backbone and 300 epochs of training, SimMatchV2 achieves 71.9% and 76.2% Top-1 Accuracy with 1% and 10% labeled examples on ImageNet, which significantly outperforms the previous methods and achieves state-of-the-art performance. Code and pre-trained models are available at https://github.com/mingkai-zheng/SimMatchV2.
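For intuition, here is a minimal sketch of one of the four consistencies, node-node consistency, together with the feature normalization the abstract highlights. The InfoNCE-style loss form and the temperature are assumptions for illustration; SimMatchV2 additionally defines node-edge, edge-edge, and edge-node consistencies:

```python
import torch
import torch.nn.functional as F

def node_node_consistency(z_weak, z_strong, temperature=0.1):
    # L2 feature normalization narrows the norm gap between augmented views.
    z_weak = F.normalize(z_weak, dim=1)
    z_strong = F.normalize(z_strong, dim=1)
    # Pull each strong-view node toward its weak-view counterpart, against
    # the other nodes in the batch (InfoNCE-style).
    logits = z_strong @ z_weak.T / temperature          # (N, N)
    targets = torch.arange(z_weak.size(0))
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(16, 128), torch.randn(16, 128)     # two views from an encoder
print(node_node_consistency(z1, z2).item())
```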

Estimator Meets Equilibrium Perspective: A Rectified Straight Through Estimator for Binary Neural Networks Training

  • paper_url: http://arxiv.org/abs/2308.06689
  • repo_url: https://github.com/dravenalg/reste
  • paper_authors: Xiao-Ming Wu, Dian Zheng, Zuhao Liu, Wei-Shi Zheng
  • for: This paper proposes a binary neural network (BNN) training method that fully takes the gradient stability of training into consideration.
  • methods: A new estimator for BNN training, the Rectified Straight Through Estimator (ReSTE), a power-function-based estimator that balances the estimating error against gradient stability.
  • results: Experiments show that ReSTE achieves excellent performance on the CIFAR-10 and ImageNet datasets and surpasses other methods without any auxiliary modules or losses.
    Abstract Binarization of neural networks is a dominant paradigm in neural network compression. The pioneering work BinaryConnect uses the Straight Through Estimator (STE) to mimic the gradients of the sign function, but it also causes the crucial inconsistency problem. Most of the previous methods design different estimators instead of STE to mitigate it. However, they ignore the fact that when reducing the estimating error, the gradient stability will decrease concomitantly. These highly divergent gradients will harm the model training and increase the risk of gradient vanishing and gradient exploding. To fully take the gradient stability into consideration, we present a new perspective on BNN training, regarding it as the equilibrium between the estimating error and the gradient stability. In this view, we first design two indicators to quantitatively demonstrate the equilibrium phenomenon. In addition, in order to balance the estimating error and the gradient stability well, we revise the original straight through estimator and propose a power-function-based estimator, the Rectified Straight Through Estimator (ReSTE for short). Compared to other estimators, ReSTE is rational and capable of flexibly balancing the estimating error with the gradient stability. Extensive experiments on the CIFAR-10 and ImageNet datasets show that ReSTE has excellent performance and surpasses the state-of-the-art methods without any auxiliary modules or losses.
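A minimal sketch of a power-function rectified straight-through estimator: forward is sign(x), while backward differentiates the surrogate y = sign(x)·|x|^(1/o) and clips the result. The clipping threshold and any schedule that grows o during training are assumptions here; see the paper and repository for the exact formulation:

```python
import torch

class ReSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, o, t):
        ctx.save_for_backward(x)
        ctx.o, ctx.t = o, t
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        o, t = ctx.o, ctx.t
        # d/dx [ sign(x) |x|^(1/o) ] = (1/o) |x|^(1/o - 1); clipping near x = 0
        # trades estimating error against gradient stability via o and t.
        g = (1.0 / o) * x.abs().clamp(min=1e-8).pow(1.0 / o - 1.0)
        return grad_out * g.clamp(max=t), None, None

x = torch.randn(4, requires_grad=True)
y = ReSTE.apply(x, 2.0, 1.5)   # o = 2 (rectified power), t = gradient clip threshold
y.sum().backward()
print(x.grad)
```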

Foundation Models in Smart Agriculture: Basics, Opportunities, and Challenges

  • paper_url: http://arxiv.org/abs/2308.06668
  • repo_url: https://github.com/jiajiali04/agriculture-foundation-models
  • paper_authors: Jiajia Li, Mingle Xu, Lirong Xiang, Dong Chen, Weichao Zhuang, Xunyuan Yin, Zhaojian Li
  • for: This study explores the potential of foundation models (FMs) in smart agriculture.
  • methods: It first reviews recent FMs and categorizes them into four classes: language FMs, vision FMs, multimodal FMs, and reinforcement learning FMs; it then details the process of developing agricultural FMs and their potential applications in smart agriculture.
  • results: The study outlines a new direction for AI in agriculture, namely FM-based smart agricultural systems, which can reduce reliance on large labeled datasets and improve efficiency and generalization; it also describes the unique challenges in developing agricultural FMs, including model training, validation, and deployment.
    Abstract The past decade has witnessed the rapid development of ML and DL methodologies in agricultural systems, showcased by great successes in a variety of agricultural applications. However, these conventional ML/DL models have certain limitations: they heavily rely on large, costly-to-acquire labeled datasets for training, require specialized expertise for development and maintenance, and are mostly tailored for specific tasks, thus lacking generalizability. Recently, foundation models have demonstrated remarkable successes in language and vision tasks across various domains. These models are trained on a vast amount of data from multiple domains and modalities. Once trained, they can accomplish versatile tasks with just minor fine-tuning and minimal task-specific labeled data. Despite their proven effectiveness and huge potential, there has been little exploration of applying FMs to agricultural fields. Therefore, this study aims to explore the potential of FMs in the field of smart agriculture. In particular, we present conceptual tools and technical background to facilitate the understanding of the problem space and uncover new research directions in this field. To this end, we first review recent FMs in the general computer science domain and categorize them into four categories: language FMs, vision FMs, multimodal FMs, and reinforcement learning FMs. Subsequently, we outline the process of developing agriculture FMs and discuss their potential applications in smart agriculture. We also discuss the unique challenges associated with developing AFMs, including model training, validation, and deployment. Through this study, we contribute to the advancement of AI in agriculture by introducing AFMs as a promising paradigm that can significantly mitigate the reliance on extensive labeled datasets and enhance the efficiency, effectiveness, and generalization of agricultural AI systems.

Polar Collision Grids: Effective Interaction Modelling for Pedestrian Trajectory Prediction in Shared Space Using Collision Checks

  • paper_url: http://arxiv.org/abs/2308.06654
  • repo_url: None
  • paper_authors: Mahsa Golchoubian, Moojan Ghafurian, Kerstin Dautenhahn, Nasser Lashgarian Azad
  • for: Predicting pedestrian trajectories is a crucial capability for the safe navigation of autonomous vehicles, especially in spaces shared with pedestrians. Pedestrian motion in shared spaces is influenced by vehicles and by other pedestrians, so better modeling of pedestrian-vehicle and pedestrian-pedestrian interactions can improve the accuracy of trajectory prediction models.
  • methods: A heuristic-based process that selects interacting agents by computing collision risk. Focusing on agents that may collide with the target pedestrian, the interaction effect is encoded using the time-to-collision and the approach direction angle of the two agents, via a novel polar collision grid map.
  • results: Predicted trajectories are closer to the ground truth than those of the baseline methods on the HBS dataset.
    Abstract Predicting pedestrians' trajectories is a crucial capability for autonomous vehicles' safe navigation, especially in spaces shared with pedestrians. Pedestrian motion in shared spaces is influenced by both the presence of vehicles and other pedestrians. Therefore, effectively modelling both pedestrian-pedestrian and pedestrian-vehicle interactions can increase the accuracy of the pedestrian trajectory prediction models. Despite the huge literature on ways to encode the effect of interacting agents on a pedestrian's predicted trajectory using deep-learning models, limited effort has been put into the effective selection of interacting agents. In the majority of cases, the interaction features used are mainly based on relative distances while paying less attention to the effect of the velocity and approaching direction in the interaction formulation. In this paper, we propose a heuristic-based process of selecting the interacting agents based on collision risk calculation. Focusing on interactions of potentially colliding agents with a target pedestrian, we propose the use of time-to-collision and the approach direction angle of two agents for encoding the interaction effect. This is done by introducing a novel polar collision grid map. Our results have shown predicted trajectories closer to the ground truth compared to existing methods (used as a baseline) on the HBS dataset.
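To make the collision-risk selection concrete, the sketch below computes a constant-velocity time-to-collision between two agents and bins the interaction into a polar cell by approach angle and TTC. The collision radius and the bin edges are illustrative assumptions, not the paper's parameters:

```python
import numpy as np

def time_to_collision(dp, dv, radius=0.5):
    """dp: relative position (other - target); dv: relative velocity.
    Returns TTC in seconds, or np.inf if the agents never come within radius."""
    a = dv @ dv
    if a < 1e-9:
        return np.inf                     # no relative motion
    b, c = 2 * (dp @ dv), dp @ dp - radius**2
    disc = b * b - 4 * a * c
    if disc < 0:
        return np.inf                     # closest approach stays outside radius
    t = (-b - np.sqrt(disc)) / (2 * a)    # first time the gap closes to radius
    return t if t >= 0 else np.inf

def polar_cell(dp, ttc, angle_bins=8, ttc_edges=(1.0, 2.0, 4.0)):
    ang = np.arctan2(dp[1], dp[0]) % (2 * np.pi)   # approach direction angle
    a_idx = int(ang / (2 * np.pi) * angle_bins)
    r_idx = int(np.searchsorted(ttc_edges, ttc))   # sooner collision = inner ring
    return a_idx, r_idx

dp, dv = np.array([3.0, 1.0]), np.array([-1.2, -0.3])
ttc = time_to_collision(dp, dv)
print(ttc, polar_cell(dp, ttc))   # ~2.2 s, cell (0, 2)
```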

Advances in Self-Supervised Learning for Synthetic Aperture Sonar Data Processing, Classification, and Pattern Recognition

  • paper_url: http://arxiv.org/abs/2308.11633
  • repo_url: None
  • paper_authors: Brandon Sheffield, Frank E. Bobe III, Bradley Marchand, Matthew S. Emigh
  • for: Improving the precision and efficiency of underwater sensing technology.
  • methods: Self-supervised learning (SSL) for SAS data processing, classification, and pattern recognition.
  • results: Experimental results show that MoCo-SAS significantly outperforms traditional supervised learning in terms of F1-score, highlighting the promise and potential of SSL for SAS data processing.
    Abstract Synthetic Aperture Sonar (SAS) imaging has become a crucial technology for underwater exploration because of its unique ability to maintain resolution at increasing ranges, a characteristic absent in conventional sonar techniques. However, the effective application of deep learning to SAS data processing is often limited due to the scarcity of labeled data. To address this challenge, this paper proposes MoCo-SAS that leverages self-supervised learning (SSL) for SAS data processing, classification, and pattern recognition. The experimental results demonstrate that MoCo-SAS significantly outperforms traditional supervised learning methods, as evidenced by significant improvements observed in terms of the F1-score. These findings highlight the potential of SSL in advancing the state-of-the-art in SAS data processing, offering promising avenues for enhanced underwater object detection and classification.

3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2308.06635
  • repo_url: https://github.com/dsx0511/3dmotformer
  • paper_authors: Shuxiao Ding, Eike Rehder, Lukas Schneider, Marius Cordts, Juergen Gall
  • for: This paper proposes a learned, transformer-based 3D multi-object tracking (3D MOT) method to improve the accuracy and reliability of autonomous vehicles.
  • methods: An Edge-Augmented Graph Transformer reasons over the track-detection bipartite graph frame-by-frame and performs data association via edge classification. For online training, a novel strategy with autoregressive and recurrent forward passes and sequential batch optimization is proposed.
  • results: Using CenterPoint detections, the method achieves 71.2% and 68.2% AMOTA on the nuScenes validation and test splits respectively, and a trained 3DMOTFormer generalizes well across different object detectors.
    Abstract Tracking 3D objects accurately and consistently is crucial for autonomous vehicles, enabling more reliable downstream tasks such as trajectory prediction and motion planning. Based on the substantial progress in object detection in recent years, the tracking-by-detection paradigm has become a popular choice due to its simplicity and efficiency. State-of-the-art 3D multi-object tracking (MOT) approaches typically rely on non-learned model-based algorithms such as Kalman Filter but require many manually tuned parameters. On the other hand, learning-based approaches face the problem of adapting the training to the online setting, leading to inevitable distribution mismatch between training and inference as well as suboptimal performance. In this work, we propose 3DMOTFormer, a learned geometry-based 3D MOT framework building upon the transformer architecture. We use an Edge-Augmented Graph Transformer to reason on the track-detection bipartite graph frame-by-frame and conduct data association via edge classification. To reduce the distribution mismatch between training and inference, we propose a novel online training strategy with an autoregressive and recurrent forward pass as well as sequential batch optimization. Using CenterPoint detections, our approach achieves 71.2% and 68.2% AMOTA on the nuScenes validation and test split, respectively. In addition, a trained 3DMOTFormer model generalizes well across different object detectors. Code is available at: https://github.com/dsx0511/3DMOTFormer.
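A minimal sketch of data association by edge classification on the track-detection bipartite graph: score every (track, detection) edge and greedily keep high-scoring, mutually exclusive pairs. The MLP edge scorer and the 0.5 threshold are stand-ins for the Edge-Augmented Graph Transformer's edge features, assumed here for illustration:

```python
import torch
import torch.nn as nn

edge_mlp = nn.Sequential(nn.Linear(2 * 64, 64), nn.ReLU(), nn.Linear(64, 1))

def associate(track_feats, det_feats, threshold=0.5):
    T, D = track_feats.size(0), det_feats.size(0)
    # Build all (track, detection) edge features and score them.
    pairs = torch.cat([track_feats[:, None].expand(T, D, 64),
                       det_feats[None].expand(T, D, 64)], dim=-1)
    scores = edge_mlp(pairs).squeeze(-1).sigmoid()      # (T, D) match probabilities
    matches, used_t, used_d = [], set(), set()
    for idx in scores.flatten().argsort(descending=True):  # greedy, best first
        t, d = divmod(idx.item(), D)
        if scores[t, d] < threshold:
            break
        if t not in used_t and d not in used_d:         # one match per node
            matches.append((t, d)); used_t.add(t); used_d.add(d)
    return matches

print(associate(torch.randn(3, 64), torch.randn(4, 64)))
```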

Fusion-GRU: A Deep Learning Model for Future Bounding Box Prediction of Traffic Agents in Risky Driving Videos

  • paper_url: http://arxiv.org/abs/2308.06628
  • repo_url: None
  • paper_authors: Muhammad Monjurul Karim, Ruwen Qin, Yinhai Wang
  • for: Predicting the future bounding boxes of surrounding traffic agents so that autonomous vehicles and advanced driver assistance systems can navigate safely and efficiently in complex traffic scenarios.
  • methods: A novel encoder-decoder architecture called Fusion-GRU, which accounts for the mutual and complex interactions among input features and uses an intermediary estimator coupled with a self-attention aggregation layer to learn sequential dependencies for long-range prediction.
  • results: Experimental results on the two public datasets ROL and HEV-I show that Fusion-GRU effectively predicts the future bounding boxes of traffic agents.
    Abstract To ensure the safe and efficient navigation of autonomous vehicles and advanced driving assistance systems in complex traffic scenarios, predicting the future bounding boxes of surrounding traffic agents is crucial. However, simultaneously predicting the future location and scale of target traffic agents from the egocentric view poses challenges due to the vehicle's egomotion causing considerable field-of-view changes. Moreover, in anomalous or risky situations, tracking loss or abrupt motion changes limit the available observation time, requiring learning of cues within a short time window. Existing methods typically use a simple concatenation operation to combine different cues, overlooking their dynamics over time. To address this, this paper introduces the Fusion-Gated Recurrent Unit (Fusion-GRU) network, a novel encoder-decoder architecture for future bounding box localization. Unlike traditional GRUs, Fusion-GRU accounts for mutual and complex interactions among input features. Moreover, an intermediary estimator coupled with a self-attention aggregation layer is also introduced to learn sequential dependencies for long range prediction. Finally, a GRU decoder is employed to predict the future bounding boxes. The proposed method is evaluated on two publicly available datasets, ROL and HEV-I. The experimental results showcase the promising performance of the Fusion-GRU, demonstrating its effectiveness in predicting future bounding boxes of traffic agents.

ADRMX: Additive Disentanglement of Domain Features with Remix Loss

  • paper_url: http://arxiv.org/abs/2308.06624
  • repo_url: https://github.com/berkerdemirel/ADRMX
  • paper_authors: Berker Demirel, Erchan Aptoula, Huseyin Ozkan
  • for: This work addresses multi-source domain generalization: learning features from multiple source domains that generalize to new, unseen domains despite distribution shifts.
  • methods: A novel additive disentanglement architecture (ADRMX) that incorporates domain-variant features together with domain-invariant ones to capture domain-specific information; a new data augmentation technique is also introduced that mixes samples from different domains in the latent space to further support generalization.
  • results: Extensive experiments on DomainBed under fair conditions show that ADRMX achieves state-of-the-art performance, surpassing existing models; code will be made available on GitHub.
    Abstract The common assumption that train and test sets follow similar distributions is often violated in deployment settings. Given multiple source domains, domain generalization aims to create robust models capable of generalizing to new unseen domains. To this end, most of existing studies focus on extracting domain invariant features across the available source domains in order to mitigate the effects of inter-domain distributional changes. However, this approach may limit the model's generalization capacity by relying solely on finding common features among the source domains. It overlooks the potential presence of domain-specific characteristics that could be prevalent in a subset of domains, potentially containing valuable information. In this work, a novel architecture named Additive Disentanglement of Domain Features with Remix Loss (ADRMX) is presented, which addresses this limitation by incorporating domain variant features together with the domain invariant ones using an original additive disentanglement strategy. Moreover, a new data augmentation technique is introduced to further support the generalization capacity of ADRMX, where samples from different domains are mixed within the latent space. Through extensive experiments conducted on DomainBed under fair conditions, ADRMX is shown to achieve state-of-the-art performance. Code will be made available at GitHub after the revision process.
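A minimal sketch of additive disentanglement with latent remixing, assuming a plain two-branch split and an additive composition (the paper's architecture is richer): a feature is modeled as the sum of a domain-invariant part and a domain-variant part, and remixed samples swap the domain-variant part across domains:

```python
import torch
import torch.nn as nn

enc_inv = nn.Linear(128, 64)   # domain-invariant branch
enc_dom = nn.Linear(128, 64)   # domain-variant branch
clf = nn.Linear(64, 10)

def remix(x_a, x_b):
    """Combine sample a's invariant features with sample b's domain features;
    the remixed sample keeps sample a's class label."""
    z_inv, z_dom = enc_inv(x_a), enc_dom(x_b)   # x_b comes from another domain
    return clf(z_inv + z_dom)                   # additive composition

x_src1, x_src2 = torch.randn(8, 128), torch.randn(8, 128)  # two source domains
logits = remix(x_src1, x_src2)
print(logits.shape)                             # torch.Size([8, 10])
```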

Polyp-SAM++: Can A Text Guided SAM Perform Better for Polyp Segmentation?

  • paper_url: http://arxiv.org/abs/2308.06623
  • repo_url: https://github.com/RisabBiswas/Polyp-SAM-PlusPlus
  • paper_authors: Risab Biswas
  • for: The paper is written for the task of polyp segmentation in medical images, with the goal of improving the accuracy and robustness of the segmentation process.
  • methods: The paper uses the Segment Anything Model (SAM) as the base model for polyp segmentation, and incorporates text prompting to guide the segmentation process.
  • results: The paper evaluates the performance of the text-guided SAM on benchmark datasets and compares the results with unprompted SAM. The results show that the text-guided SAM achieves better segmentation accuracy and robustness than unprompted SAM.
    Abstract Meta recently released SAM (Segment Anything Model), which is a general-purpose segmentation model. SAM has shown promising results in a wide variety of segmentation tasks including medical image segmentation. In the field of medical image segmentation, polyp segmentation holds a position of high importance, so creating a model which is robust and precise is quite challenging. Polyp segmentation is a fundamental task to ensure better diagnosis and cure of colorectal cancer. As such, in this study, we will see how Polyp-SAM++, a text prompt-aided SAM, can better utilize SAM with text prompting for robust and more precise polyp segmentation. We will evaluate the performance of a text-guided SAM on the polyp segmentation task on benchmark datasets. We will also compare the results of the text-guided SAM vs. the unprompted SAM. With this study, we hope to advance the field of polyp segmentation and inspire more intriguing research. The code and other details will be made publicly available soon at https://github.com/RisabBiswas/Polyp-SAM++.

DFM-X: Augmentation by Leveraging Prior Knowledge of Shortcut Learning

  • paper_url: http://arxiv.org/abs/2308.06622
  • repo_url: https://github.com/nis-research/dfmx-augmentation
  • paper_authors: Shunxin Wang, Christoph Brune, Raymond Veldhuis, Nicola Strisciuglio
  • for: Improving model generalization and robustness by preventing neural networks from learning superficial statistical shortcuts in the data.
  • methods: A data augmentation strategy named DFM-X that leverages Dominant Frequencies Maps (DFMs) computed for image classification models as prior knowledge to avoid frequency-shortcut learning.
  • results: Experiments show that DFM-X improves robustness against common corruptions and adversarial attacks, and it can easily be combined with other augmentation techniques to further enhance generalization and robustness.
    Abstract Neural networks are prone to learn easy solutions from superficial statistics in the data, namely shortcut learning, which impairs generalization and robustness of models. We propose a data augmentation strategy, named DFM-X, that leverages knowledge about frequency shortcuts, encoded in Dominant Frequencies Maps computed for image classification models. We randomly select X% training images of certain classes for augmentation, and process them by retaining the frequencies included in the DFMs of other classes. This strategy compels the models to leverage a broader range of frequencies for classification, rather than relying on specific frequency sets. Thus, the models learn more deep and task-related semantics compared to their counterpart trained with standard setups. Unlike other commonly used augmentation techniques which focus on increasing the visual variations of training data, our method targets exploiting the original data efficiently, by distilling prior knowledge about destructive learning behavior of models from data. Our experimental results demonstrate that DFM-X improves robustness against common corruptions and adversarial attacks. It can be seamlessly integrated with other augmentation techniques to further enhance the robustness of models.
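As a concrete illustration of frequency-based augmentation, this sketch keeps only the frequencies flagged in a dominant-frequencies map when transforming an image. The random binary mask stands in for a real DFM, which would be computed from a trained classifier's per-frequency sensitivity; that computation is assumed away here:

```python
import numpy as np

def retain_frequencies(img, dfm_mask):
    """img: (H, W) grayscale; dfm_mask: (H, W) binary map over the FFT grid."""
    spec = np.fft.fftshift(np.fft.fft2(img))      # centered spectrum
    spec *= dfm_mask                              # keep only the DFM frequencies
    return np.fft.ifft2(np.fft.ifftshift(spec)).real

rng = np.random.default_rng(0)
img = rng.random((32, 32))
dfm = (rng.random((32, 32)) > 0.5).astype(float)  # stand-in for another class's DFM
print(retain_frequencies(img, dfm).shape)          # (32, 32)
```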

LadleNet: Translating Thermal Infrared Images to Visible Light Images Using A Scalable Two-stage U-Net

  • paper_url: http://arxiv.org/abs/2308.06603
  • repo_url: https://github.com/ach-1914/ladlenet
  • paper_authors: Tonghui Zou
  • for: This paper translates thermal infrared (TIR) images into visible light (VI) images, with applications in areas such as TIR-VI image registration and fusion.
  • methods: An algorithm based on the U-Net architecture, called LadleNet, comprising a 'Handle' module and a 'Bowl' module. The Handle module constructs an abstract semantic space, and the Bowl module decodes this semantic space to produce the mapped VI image. The Handle module's network architecture can be replaced with semantic segmentation networks to improve model performance.
  • results: Comparative experiments show that, compared to existing methodologies, the approach achieves state-of-the-art performance in terms of image clarity and perceptual quality.
    Abstract The translation of thermal infrared (TIR) images to visible light (VI) images presents a challenging task with potential applications spanning various domains such as TIR-VI image registration and fusion. Leveraging supplementary information derived from TIR image conversions can significantly enhance model performance and generalization across these applications. However, prevailing issues within this field include suboptimal image fidelity and limited model scalability. In this paper, we introduce an algorithm, LadleNet, based on the U-Net architecture. LadleNet employs a two-stage U-Net concatenation structure, augmented with skip connections and refined feature aggregation techniques, resulting in a substantial enhancement in model performance. Comprising 'Handle' and 'Bowl' modules, LadleNet's Handle module facilitates the construction of an abstract semantic space, while the Bowl module decodes this semantic space to yield mapped VI images. The Handle module exhibits extensibility by allowing the substitution of its network architecture with semantic segmentation networks, thereby establishing more abstract semantic spaces to bolster model performance. Consequently, we propose LadleNet+, which replaces LadleNet's Handle module with the pre-trained DeepLabv3+ network, thereby endowing the model with enhanced semantic space construction capabilities. The proposed method is evaluated and tested on the KAIST dataset, accompanied by quantitative and qualitative analyses. Compared to existing methodologies, our approach achieves state-of-the-art performance in terms of image clarity and perceptual quality. The source code will be made available at https://github.com/Ach-1914/LadleNet/tree/main/.

cs.AI - 2023-08-13

Dual Meta-Learning with Longitudinally Generalized Regularization for One-Shot Brain Tissue Segmentation Across the Human Lifespan

  • paper_url: http://arxiv.org/abs/2308.06774
  • repo_url: None
  • paper_authors: Yongheng Sun, Fan Wang, Jun Shu, Haifeng Wang, Li Wang, Deyu Meng, Chunfeng Lian
  • for: This paper proposes a brain tissue segmentation method for longitudinal data, useful for neuroscience and clinical studies.
  • methods: A dual meta-learning model consisting of a plug-and-play feature extractor that learns longitudinally consistent representations via meta-feature learning and a well-initialized task head for fine-tuning via meta-initialization learning; two class-aware regularizations are also proposed to encourage longitudinal consistency.
  • results: Experimental results demonstrate the effectiveness of the method on the iSeg2019 and ADNI datasets; code is available at https://github.com/ladderlab-xjtu/DuMeta.
    Abstract Brain tissue segmentation is essential for neuroscience and clinical studies. However, segmentation on longitudinal data is challenging due to dynamic brain changes across the lifespan. Previous research mainly focuses on self-supervision with regularizations and will lose longitudinal generalization when fine-tuned on a specific age group. In this paper, we propose a dual meta-learning paradigm to learn longitudinally consistent representations that persist when fine-tuning. Specifically, we learn a plug-and-play feature extractor to extract longitudinal-consistent anatomical representations by meta-feature learning and a well-initialized task head for fine-tuning by meta-initialization learning. Besides, two class-aware regularizations are proposed to encourage longitudinal consistency. Experimental results on the iSeg2019 and ADNI datasets demonstrate the effectiveness of our method. Our code is available at https://github.com/ladderlab-xjtu/DuMeta.

Few-shot Class-incremental Learning: A Survey

  • paper_url: http://arxiv.org/abs/2308.06764
  • repo_url: None
  • paper_authors: Jinghua Zhang, Li Liu, Olli Silven, Matti Pietikäinen, Dewen Hu
  • for: This paper provides a systematic and in-depth review of Few-shot Class-Incremental Learning (FSCIL).
  • methods: It covers a range of approaches in FSCIL, including data-based, structure-based, and optimization-based classification methods as well as anchor-based and anchor-free object detection methods.
  • results: It offers a thorough overview of benchmark datasets and evaluation metrics, and highlights several promising research directions within FSCIL.
    Abstract Few-shot Class-Incremental Learning (FSCIL) presents a unique challenge in machine learning, as it necessitates the continuous learning of new classes from sparse labeled training samples without forgetting previous knowledge. While this field has seen recent progress, it remains an active area of exploration. This paper aims to provide a comprehensive and systematic review of FSCIL. In our in-depth examination, we delve into various facets of FSCIL, encompassing the problem definition, the discussion of primary challenges of unreliable empirical risk minimization and the stability-plasticity dilemma, general schemes, and relevant problems of incremental learning and few-shot learning. Besides, we offer an overview of benchmark datasets and evaluation metrics. Furthermore, we introduce the classification methods in FSCIL from data-based, structure-based, and optimization-based approaches and the object detection methods in FSCIL from anchor-free and anchor-based approaches. Beyond these, we illuminate several promising research directions within FSCIL that merit further investigation.

Evaluating the anticipated outcomes of MRI seizure image from open-source tool- Prototype approach

  • paper_url: http://arxiv.org/abs/2308.07762
  • repo_url: None
  • paper_authors: Jayanthi Vajiram, Aishwarya Senthil, Utkarsh Maurya
  • for: This paper describes brain dysfunction and its analysis for the nearly 70 million people worldwide affected by epilepsy.
  • methods: Various open-source neuroimaging tools, including MATLAB, Slicer 3D, Brain Suite21a, SPM, and MedCalc, are used for examination and analysis of seizure images.
  • results: The paper reports that about 60% of researchers use MATLAB for image processing, 10% use proprietary software, and more than 30% use other open-source software tools for the study of magnetic resonance seizure images.
    Abstract Epileptic seizure is an abnormal neuronal exertion in the brain, affecting nearly 70 million of the world's population (Ngugi et al., 2010). Many open-source neuroimaging tools are used for metabolism check-ups and analysis purposes. The scope of open-source tools like MATLAB, Slicer 3D, Brain Suite21a, SPM, and MedCalc is explained in this paper. MATLAB is used by 60% of researchers for image processing, while 10% use proprietary software. More than 30% of researchers use other open-source software tools and processing techniques for the study of magnetic resonance seizure images.

Heterogeneous Multi-Agent Reinforcement Learning via Mirror Descent Policy Optimization

  • paper_url: http://arxiv.org/abs/2308.06741
  • repo_url: None
  • paper_authors: Mohammad Mehdi Nasiri, Mansoor Rezghi
  • for: solving cooperative Multi-Agent Reinforcement Learning (MARL) problems with varying agent abilities and individual policies
  • methods: Heterogeneous-Agent Mirror Descent Policy Optimization (HAMDPO) algorithm, which utilizes the multi-agent advantage decomposition lemma for efficient policy updates and guarantees stability and performance improvements
  • results: superiority over state-of-the-art algorithms such as HATRPO and HAPPO, demonstrated through experiments on Multi-Agent MuJoCo and StarCraftII tasks
    Abstract This paper presents an extension of the Mirror Descent method to overcome challenges in cooperative Multi-Agent Reinforcement Learning (MARL) settings, where agents have varying abilities and individual policies. The proposed Heterogeneous-Agent Mirror Descent Policy Optimization (HAMDPO) algorithm utilizes the multi-agent advantage decomposition lemma to enable efficient policy updates for each agent while ensuring overall performance improvements. By iteratively updating agent policies through an approximate solution of the trust-region problem, HAMDPO guarantees stability and improves performance. Moreover, the HAMDPO algorithm is capable of handling both continuous and discrete action spaces for heterogeneous agents in various MARL problems. We evaluate HAMDPO on Multi-Agent MuJoCo and StarCraftII tasks, demonstrating its superiority over state-of-the-art algorithms such as HATRPO and HAPPO. These results suggest that HAMDPO is a promising approach for solving cooperative MARL problems and could potentially be extended to address other challenging problems in the field of MARL.
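For intuition, a minimal sketch of a mirror-descent-style policy loss for a single agent: maximize the advantage-weighted probability ratio while penalizing KL divergence from the previous policy, which keeps each update inside an approximate trust region. The KL estimator, penalty coefficient, and toy setup are illustrative assumptions; HAMDPO additionally sequences such updates across heterogeneous agents using the advantage decomposition lemma:

```python
import torch

def mirror_descent_loss(logp_new, logp_old, advantages, eta=0.1):
    ratio = (logp_new - logp_old).exp()
    surrogate = (ratio * advantages).mean()          # policy-improvement term
    # Sample-based, nonnegative KL(pi_old || pi_new) estimator:
    # (r - 1) - log r with r = pi_new / pi_old, on actions drawn from pi_old.
    kl = (ratio - 1.0 - (logp_new - logp_old)).mean()
    return -(surrogate - kl / eta)                   # minimize the negative

logits_old = torch.randn(32, 4)                      # toy categorical policy
actions = torch.randint(4, (32, 1))
logp_old = logits_old.log_softmax(-1).gather(1, actions).squeeze(1)
delta = torch.zeros(32, requires_grad=True)          # stand-in for a policy change
logp_new = logp_old + delta
loss = mirror_descent_loss(logp_new, logp_old.detach(), torch.randn(32))
loss.backward()
print(loss.item(), delta.grad.norm().item())
```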

Probabilistic Imputation for Time-series Classification with Missing Data

  • paper_url: http://arxiv.org/abs/2308.06738
  • repo_url: https://github.com/yuneg11/SupNotMIWAE-with-ObsDropout
  • paper_authors: SeungHyun Kim, Hyunsu Kim, EungGu Yun, Hwangrae Lee, Jaehun Lee, Juho Lee
  • for: This paper proposes a novel probabilistic framework for classification with multivariate time series data that contains missing values.
  • methods: The proposed method consists of two parts: a deep generative model for missing value imputation and a classifier. The generative model is trained to impute the missing values in multiple plausible ways, effectively modeling the uncertainty of the imputation. The classifier takes the time series data along with the imputed missing values and classifies signals, and is trained to capture the predictive uncertainty due to the multiple possibilities of imputations.
  • results: The proposed method is demonstrated to be effective through extensive experiments on real-world time series data with missing values.
    Abstract Multivariate time series data for real-world applications typically contain a significant amount of missing values. The dominant approach for classification with such missing values is to impute them heuristically with specific values (zero, mean, values of adjacent time-steps) or learnable parameters. However, these simple strategies do not take the data generative process into account, and more importantly, do not effectively capture the uncertainty in prediction due to the multiple possibilities for the missing values. In this paper, we propose a novel probabilistic framework for classification with multivariate time series data with missing values. Our model consists of two parts: a deep generative model for missing value imputation and a classifier. Extending the existing deep generative models to better capture structures of time-series data, our deep generative model part is trained to impute the missing values in multiple plausible ways, effectively modeling the uncertainty of the imputation. The classifier part takes the time series data along with the imputed missing values and classifies signals, and is trained to capture the predictive uncertainty due to the multiple possibilities of imputations. Importantly, we show that naïvely combining the generative model and the classifier could result in trivial solutions where the generative model does not produce meaningful imputations. To resolve this, we present a novel regularization technique that can promote the model to produce useful imputation values that help classification. Through extensive experiments on real-world time series data with missing values, we demonstrate the effectiveness of our method.
    摘要 多变量时间序列数据在实际应用中通常含有大量缺失值。处理这些缺失值的主流方法是用特定值(如零、平均值、邻近时间步的值)或可学习参数对其进行启发式填充。然而,这些简单策略没有考虑数据的生成过程,更重要的是,无法有效刻画由缺失值的多种可能性带来的预测不确定性。在这篇论文中,我们提出了一种新的概率框架,用于对含缺失值的多变量时间序列数据进行分类。我们的模型包括两部分:用于缺失值填充的深度生成模型和分类器。我们扩展了现有的深度生成模型以更好地刻画时间序列数据的结构,生成模型部分被训练为以多种合理的方式填充缺失值,从而有效地建模填充的不确定性。分类器部分接收时间序列数据及填充后的缺失值并对信号进行分类,其训练目标是刻画由多种可能填充带来的预测不确定性。重要的是,我们发现简单地将生成模型与分类器组合可能导致平凡解,即生成模型不产生有意义的填充。为解决这个问题,我们提出了一种新的正则化技术,可以促使模型生成有助于分类的填充值。通过在含缺失值的真实时间序列数据上的大量实验,我们验证了方法的有效性。
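To illustrate the two-part design, here is a minimal sketch of classifying under multiple imputations. The Gaussian sampler standing in for the deep generative imputer, and the small MLP classifier, are assumptions for illustration only.

```python
# Sketch: sample several plausible fills for the missing entries, classify each
# completed series, and average the predictive distributions.
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, D, K, N_CLASSES = 8, 20, 3, 10, 2  # batch, time, dims, imputations, classes

x = torch.randn(B, T, D)
mask = torch.rand(B, T, D) < 0.3          # True where a value is missing

def sample_imputation(x, mask):
    # Placeholder generative model: draw missing values from N(0, 1).
    return torch.where(mask, torch.randn_like(x), x)

classifier = nn.Sequential(nn.Flatten(), nn.Linear(T * D, 64), nn.ReLU(),
                           nn.Linear(64, N_CLASSES))

# Averaging class probabilities over K stochastic imputations makes the
# prediction reflect the uncertainty over the many possible completions.
probs = torch.stack([
    torch.softmax(classifier(sample_imputation(x, mask)), dim=-1)
    for _ in range(K)
]).mean(dim=0)

print(probs.shape)  # (8, 2): one predictive distribution per series
```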

AerialVLN: Vision-and-Language Navigation for UAVs

  • paper_url: http://arxiv.org/abs/2308.06735
  • repo_url: None
  • paper_authors: Shubo Liu, Hongsheng Zhang, Yuankai Qi, Peng Wang, Yaning Zhang, Qi Wu
  • for: 这篇论文的目的是提出一个新的任务,即空中视觉语言导航(AerialVLN),用于研究无人机在开放空域中导航的问题。
  • methods: 这篇论文使用了一种基于跨模态对齐(cross-modal alignment, CMA)方法的扩展基线模型,并开发了一个包含25个城市级场景的3D模拟器,支持连续导航、环境扩展和配置。
  • results: 论文发现,基于CMA方法的扩展基线模型与人类表现之间仍然存在较大差距,这表明空中视觉语言导航(AerialVLN)是一个新的具有挑战性的任务。
    Abstract Recently emerged Vision-and-Language Navigation (VLN) tasks have drawn significant attention in both computer vision and natural language processing communities. Existing VLN tasks are built for agents that navigate on the ground, either indoors or outdoors. However, many tasks require intelligent agents to carry out in the sky, such as UAV-based goods delivery, traffic/security patrol, and scenery tour, to name a few. Navigating in the sky is more complicated than on the ground because agents need to consider the flying height and more complex spatial relationship reasoning. To fill this gap and facilitate research in this field, we propose a new task named AerialVLN, which is UAV-based and towards outdoor environments. We develop a 3D simulator rendered by near-realistic pictures of 25 city-level scenarios. Our simulator supports continuous navigation, environment extension and configuration. We also proposed an extended baseline model based on the widely-used cross-modal-alignment (CMA) navigation methods. We find that there is still a significant gap between the baseline model and human performance, which suggests AerialVLN is a new challenging task. Dataset and code is available at https://github.com/AirVLN/AirVLN.
    摘要 最近出现的视觉语言导航(VLN)任务在计算机视觉和自然语言处理领域引起了广泛关注。现有的VLN任务都是为在地面(室内或室外)导航的智能体设计的。然而,许多任务需要智能体在空中执行,例如基于无人机的货物配送、交通/安全巡逻和景观游览等。在空中导航比在地面上更加复杂,因为智能体需要考虑飞行高度以及更复杂的空间关系推理。为了填补这一空白并促进该领域的研究,我们提出了一个名为AerialVLN的新任务,它基于无人机并面向室外环境。我们开发了一个由25个城市级场景的近真实图像渲染的3D模拟器,该模拟器支持连续导航、环境扩展和配置。我们还提出了一种基于广泛使用的跨模态对齐(CMA)导航方法的扩展基线模型。我们发现基线模型与人类表现之间仍有显著差距,这表明AerialVLN是一个新的具有挑战性的任务。数据集和代码可在https://github.com/AirVLN/AirVLN获取。

Precipitation nowcasting with generative diffusion models

  • paper_url: http://arxiv.org/abs/2308.06733
  • repo_url: https://github.com/fmerizzi/Precipitation-nowcasting-with-generative-diffusion-models
  • paper_authors: Andrea Asperti, Fabio Merizzi, Alberto Paparella, Giorgio Pedrazzi, Matteo Angelinelli, Stefano Colamonaco
  • for: 该研究旨在检验气候预测中 diffusion models 的可行性,特别是在降水预测(precipitation nowcasting)方面。
  • methods: 该研究使用了一种生成集成扩散(generative ensemble diffusion, GED)模型,先生成多个可能的天气场景,然后使用后处理网络将其融合为最可能的预测。
  • results: 与之前的深度学习模型相比,GED模型在整体性能上有显著提升。
    Abstract In recent years traditional numerical methods for accurate weather prediction have been increasingly challenged by deep learning methods. Numerous historical datasets used for short and medium-range weather forecasts are typically organized into a regular spatial grid structure. This arrangement closely resembles images: each weather variable can be visualized as a map or, when considering the temporal axis, as a video. Several classes of generative models, comprising Generative Adversarial Networks, Variational Autoencoders, or the recent Denoising Diffusion Models have largely proved their applicability to the next-frame prediction problem, and is thus natural to test their performance on the weather prediction benchmarks. Diffusion models are particularly appealing in this context, due to the intrinsically probabilistic nature of weather forecasting: what we are really interested to model is the probability distribution of weather indicators, whose expected value is the most likely prediction. In our study, we focus on a specific subset of the ERA-5 dataset, which includes hourly data pertaining to Central Europe from the years 2016 to 2021. Within this context, we examine the efficacy of diffusion models in handling the task of precipitation nowcasting. Our work is conducted in comparison to the performance of well-established U-Net models, as documented in the existing literature. Our proposed approach of Generative Ensemble Diffusion (GED) utilizes a diffusion model to generate a set of possible weather scenarios which are then amalgamated into a probable prediction via the use of a post-processing network. This approach, in comparison to recent deep learning models, substantially outperformed them in terms of overall performance.
    摘要 近年来,用于准确天气预报的传统数值方法日益受到深度学习方法的挑战。用于短期和中期天气预报的大量历史数据集通常组织为规则的空间网格结构,这种结构与图像非常相似:每个天气变量都可以被可视化为一张地图,若考虑时间轴则可视为视频。多类生成模型,包括生成对抗网络、变分自编码器以及最近的去噪扩散模型,已在下一帧预测问题上充分证明了其适用性,因此很自然地要在天气预报基准上测试它们的性能。扩散模型在这一场景下尤其有吸引力,因为天气预报本质上是概率性的:我们真正想要建模的是天气指标的概率分布,其期望值就是最可能的预测。在我们的研究中,我们聚焦于ERA-5数据集的一个特定子集,其包含2016至2021年中欧地区的逐小时数据。在此背景下,我们检验了扩散模型处理降水临近预报任务的能力。我们的工作与文献中成熟的U-Net模型的性能进行了对比。我们提出的生成集成扩散(GED)方法利用扩散模型生成一组可能的天气场景,然后通过后处理网络将它们融合为一个最可能的预测。与最近的深度学习模型相比,该方法在整体性能上大幅领先。
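The ensemble-then-fuse pipeline can be sketched in a few lines. In the snippet below, `sample_diffusion` is a placeholder for a conditional diffusion sampler, and the small CNN post-processor and all tensor shapes are illustrative assumptions.

```python
# Sketch of the GED idea: draw several candidate precipitation fields, stack
# them as channels, and let a post-processing CNN fuse them into one forecast.
import torch
import torch.nn as nn

N_MEMBERS, H, W = 8, 64, 64

def sample_diffusion(conditioning, n_members):
    # Stand-in for a conditional denoising-diffusion sampler over rain maps.
    return torch.rand(n_members, 1, H, W)

post_processor = nn.Sequential(
    nn.Conv2d(N_MEMBERS, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, kernel_size=3, padding=1),
)

past_frames = torch.rand(1, 4, H, W)                 # conditioning input
members = sample_diffusion(past_frames, N_MEMBERS)   # (8, 1, 64, 64) ensemble
stacked = members.squeeze(1).unsqueeze(0)            # (1, 8, 64, 64)
forecast = post_processor(stacked)                   # fused "most likely" map
print(forecast.shape)                                # (1, 1, 64, 64)
```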

Transforming Sentiment Analysis in the Financial Domain with ChatGPT

  • paper_url: http://arxiv.org/abs/2308.07935
  • repo_url: None
  • paper_authors: Georgios Fatouros, John Soldatos, Kalliopi Kouroumali, Georgios Makridis, Dimosthenis Kyriazis
  • for: 本研究旨在探讨大语言模型ChatGPT 3.5在金融情感分析中的潜力, 特别是在外汇市场(forex)中。
  • methods: 本研究使用了零样本提示方法,在精心构建的外汇相关新闻标题数据集上测试了多个ChatGPT提示,并使用精确率、召回率、F1分数和情感类别的平均绝对误差(MAE)来衡量情感分类的性能。此外,还评估了预测情感与市场回报之间的相关性。
  • results: 与金融文本情感分析的成熟模型FinBERT相比,ChatGPT的情感分类性能提升约35%,与市场回报的相关性提高约36%。这些结果表明,提示工程(尤其是在零样本场景下)非常重要,并凸显了ChatGPT在金融应用中的潜力。
    Abstract Financial sentiment analysis plays a crucial role in decoding market trends and guiding strategic trading decisions. Despite the deployment of advanced deep learning techniques and language models to refine sentiment analysis in finance, this study breaks new ground by investigating the potential of large language models, particularly ChatGPT 3.5, in financial sentiment analysis, with a strong emphasis on the foreign exchange market (forex). Employing a zero-shot prompting approach, we examine multiple ChatGPT prompts on a meticulously curated dataset of forex-related news headlines, measuring performance using metrics such as precision, recall, f1-score, and Mean Absolute Error (MAE) of the sentiment class. Additionally, we probe the correlation between predicted sentiment and market returns as an additional evaluation approach. ChatGPT, compared to FinBERT, a well-established sentiment analysis model for financial texts, exhibited approximately 35\% enhanced performance in sentiment classification and a 36\% higher correlation with market returns. By underlining the significance of prompt engineering, particularly in zero-shot contexts, this study spotlights ChatGPT's potential to substantially boost sentiment analysis in financial applications. By sharing the utilized dataset, our intention is to stimulate further research and advancements in the field of financial services.
    摘要 金融情感分析在解读市场趋势和指导策略性交易决策方面发挥着关键作用。尽管已有先进的深度学习技术和语言模型被用于改进金融领域的情感分析,本研究另辟蹊径,探讨大语言模型(特别是ChatGPT 3.5)在金融情感分析中的潜力,并重点关注外汇市场(forex)。我们采用零样本提示方法,在精心构建的外汇相关新闻标题数据集上检验了多个ChatGPT提示,使用精确率、召回率、F1分数和情感类别的平均绝对误差(MAE)等指标衡量性能。此外,我们还考察了预测情感与市场回报之间的相关性,作为额外的评估手段。与金融文本情感分析的成熟模型FinBERT相比,ChatGPT在情感分类上的性能提升约35%,与市场回报的相关性提高约36%。本研究强调了提示工程(尤其是在零样本场景下)的重要性,凸显了ChatGPT在金融应用中大幅提升情感分析的潜力。通过公开所使用的数据集,我们希望推动金融服务领域的进一步研究与进展。
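A minimal sketch of the zero-shot evaluation loop might look as follows; `query_llm`, the prompt wording, and the toy headlines are hypothetical stand-ins, not the paper's exact prompts or data.

```python
# Sketch: prompt an LLM for a sentiment label per headline, then score against
# gold labels with the metrics named above.
from sklearn.metrics import precision_recall_fscore_support, mean_absolute_error

PROMPT = ("You are a financial analyst. Classify the sentiment of this forex "
          "news headline for EUR/USD as bullish, bearish, or neutral:\n{headline}")
LABEL_TO_SCORE = {"bearish": -1, "neutral": 0, "bullish": 1}

def query_llm(prompt: str) -> str:
    # Hypothetical placeholder: in practice this would call a chat-completions
    # endpoint and return the model's one-word answer.
    return "neutral"

headlines = ["ECB hints at rate pause", "US jobs data beats expectations"]
gold = ["neutral", "bearish"]  # toy gold labels w.r.t. EUR/USD

pred = [query_llm(PROMPT.format(headline=h)).strip().lower() for h in headlines]

p, r, f1, _ = precision_recall_fscore_support(
    gold, pred, labels=list(LABEL_TO_SCORE), average="macro", zero_division=0)
mae = mean_absolute_error([LABEL_TO_SCORE[g] for g in gold],
                          [LABEL_TO_SCORE[q] for q in pred])
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f} MAE={mae:.2f}")
```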

CLE Diffusion: Controllable Light Enhancement Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.06725
  • repo_url: None
  • paper_authors: Yuyang Yin, Dejia Xu, Chuangchuang Tan, Ping Liu, Yao Zhao, Yunchao Wei
  • For: 提高低光照图像的质量,并为用户提供丰富的可控性
  • Methods: 使用条件扩散模型和Segment-Anything Model(SAM)实现用户定制的光照增强
  • Results: 在定量指标、定性结果和可控性三个方面均达到有竞争力的表现,并为用户提供了丰富的控制能力
    Abstract Low light enhancement has gained increasing importance with the rapid development of visual creation and editing. However, most existing enhancement algorithms are designed to homogeneously increase the brightness of images to a pre-defined extent, limiting the user experience. To address this issue, we propose Controllable Light Enhancement Diffusion Model, dubbed CLE Diffusion, a novel diffusion framework to provide users with rich controllability. Built with a conditional diffusion model, we introduce an illumination embedding to let users control their desired brightness level. Additionally, we incorporate the Segment-Anything Model (SAM) to enable user-friendly region controllability, where users can click on objects to specify the regions they wish to enhance. Extensive experiments demonstrate that CLE Diffusion achieves competitive performance regarding quantitative metrics, qualitative results, and versatile controllability. Project page: \url{https://yuyangyin.github.io/CLEDiffusion/}
    摘要 随着视觉创作与编辑的快速发展,低光照增强正受到越来越多的关注。然而,现有的增强算法大多将图像亮度统一提升到预定义的程度,限制了用户体验。为解决这个问题,我们提出了可控光照增强扩散模型(CLE Diffusion),这是一种新的扩散框架,为用户提供丰富的可控性。基于条件扩散模型,我们引入了光照嵌入,让用户控制所需的亮度水平。此外,我们还引入了Segment-Anything Model(SAM),实现用户友好的区域可控性,用户可以通过点击对象来指定希望增强的区域。大量实验表明,CLE Diffusion在定量指标、定性结果和多样的可控性方面均具有竞争力。项目页面:https://yuyangyin.github.io/CLEDiffusion/

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.06721
  • repo_url: None
  • paper_authors: Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, Wei Yang
  • for: 这篇论文旨在提出一种高效且轻量级的适配器,使预训练的文本到图像扩散模型能够使用图像提示来生成图像。
  • methods: 该适配器使用了解耦的交叉注意力机制(decoupled cross-attention),将文本特征和图像特征的注意力分开处理,以提高生成图像的精度和效率。
  • results: 实验表明,IP-Adapter在图像生成方面可以达到与完全微调的图像提示模型相当甚至更好的性能,并且可以与文本提示结合使用,实现多模态图像生成。
    Abstract Recent years have witnessed the strong power of large text-to-image diffusion models for the impressive generative capability to create high-fidelity images. However, it is very tricky to generate desired images using only text prompt as it often involves complex prompt engineering. An alternative to text prompt is image prompt, as the saying goes: "an image is worth a thousand words". Although existing methods of direct fine-tuning from pretrained models are effective, they require large computing resources and are not compatible with other base models, text prompt, and structural controls. In this paper, we present IP-Adapter, an effective and lightweight adapter to achieve image prompt capability for the pretrained text-to-image diffusion models. The key design of our IP-Adapter is decoupled cross-attention mechanism that separates cross-attention layers for text features and image features. Despite the simplicity of our method, an IP-Adapter with only 22M parameters can achieve comparable or even better performance to a fully fine-tuned image prompt model. As we freeze the pretrained diffusion model, the proposed IP-Adapter can be generalized not only to other custom models fine-tuned from the same base model, but also to controllable generation using existing controllable tools. With the benefit of the decoupled cross-attention strategy, the image prompt can also work well with the text prompt to achieve multimodal image generation. The project page is available at \url{https://ip-adapter.github.io}.
    摘要 近年来,大型文本到图像扩散模型以其生成高保真图像的强大能力受到广泛关注。然而,仅使用文本提示来生成期望的图像非常困难,往往需要复杂的提示工程。文本提示之外的另一种选择是图像提示,正如俗话所说:"一图胜千言"。尽管现有的从预训练模型直接微调的方法是有效的,但它们需要大量计算资源,并且与其他基础模型、文本提示和结构控制不兼容。在这篇论文中,我们提出了IP-Adapter,一种高效且轻量级的适配器,使预训练的文本到图像扩散模型具备图像提示能力。我们的关键设计是解耦的交叉注意力机制,它将文本特征和图像特征的交叉注意力层分离开来。尽管方法简单,一个仅有22M参数的IP-Adapter仍可以达到与完全微调的图像提示模型相当甚至更好的性能。由于我们冻结了预训练的扩散模型,所提出的IP-Adapter不仅可以推广到从同一基础模型微调得到的其他自定义模型,还可以与现有的可控工具结合进行可控生成。得益于解耦的交叉注意力策略,图像提示还可以与文本提示协同工作,实现多模态图像生成。项目页面:https://ip-adapter.github.io。
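The decoupled cross-attention design can be sketched as a module with two attention branches over the same query. The dimensions and the simple additive residual fusion below are illustrative assumptions.

```python
# Sketch: the latent query attends to text features and image features through
# *separate* cross-attention branches whose outputs are summed.
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        # Text branch (frozen in the pretrained UNet) ...
        self.text_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # ... plus a new, trainable branch for image-prompt features.
        self.image_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, latent, text_feats, image_feats):
        out_text, _ = self.text_attn(latent, text_feats, text_feats)
        out_image, _ = self.image_attn(latent, image_feats, image_feats)
        return latent + out_text + out_image  # residual fusion of both prompts

latent = torch.randn(2, 64, 320)       # UNet latent tokens
text_feats = torch.randn(2, 77, 320)   # text-encoder output
image_feats = torch.randn(2, 4, 320)   # projected image-prompt tokens
block = DecoupledCrossAttention(320)
print(block(latent, text_feats, image_feats).shape)  # (2, 64, 320)
```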

Generalized Independent Noise Condition for Estimating Causal Structure with Latent Variables

  • paper_url: http://arxiv.org/abs/2308.06718
  • repo_url: None
  • paper_authors: Feng Xie, Biwei Huang, Zhengming Chen, Ruichu Cai, Clark Glymour, Zhi Geng, Kun Zhang
  • For: The paper is focused on learning the causal structure of a system with latent variables, including identifying the number of latent variables and their relationships with observed variables.
  • Methods: The authors propose a Generalized Independent Noise (GIN) condition for linear non-Gaussian acyclic causal models with latent variables, which is used to identify the causal relationships between the observed and latent variables. They also develop a search procedure to efficiently estimate the underlying causal structure.
  • Results: The authors show that the proposed approach can identify the causal structure of a system with latent variables, and demonstrate its effectiveness through experimental results. Additionally, they find that the independent noise condition can be seen as a special case of the GIN condition, which provides a connection between the two concepts.
    Abstract We investigate the challenging task of learning causal structure in the presence of latent variables, including locating latent variables and determining their quantity, and identifying causal relationships among both latent and observed variables. To address this, we propose a Generalized Independent Noise (GIN) condition for linear non-Gaussian acyclic causal models that incorporate latent variables, which establishes the independence between a linear combination of certain measured variables and some other measured variables. Specifically, for two observed random vectors $\bf{Y}$ and $\bf{Z}$, GIN holds if and only if $\omega^{\intercal}\mathbf{Y}$ and $\mathbf{Z}$ are independent, where $\omega$ is a non-zero parameter vector determined by the cross-covariance between $\mathbf{Y}$ and $\mathbf{Z}$. We then give necessary and sufficient graphical criteria of the GIN condition in linear non-Gaussian acyclic causal models. Roughly speaking, GIN implies the existence of an exogenous set $\mathcal{S}$ relative to the parent set of $\mathbf{Y}$ (w.r.t. the causal ordering), such that $\mathcal{S}$ d-separates $\mathbf{Y}$ from $\mathbf{Z}$. Interestingly, we find that the independent noise condition (i.e., if there is no confounder, causes are independent of the residual derived from regressing the effect on the causes) can be seen as a special case of GIN. With such a connection between GIN and latent causal structures, we further leverage the proposed GIN condition, together with a well-designed search procedure, to efficiently estimate Linear, Non-Gaussian Latent Hierarchical Models (LiNGLaHs), where latent confounders may also be causally related and may even follow a hierarchical structure. We show that the underlying causal structure of a LiNGLaH is identifiable in light of GIN conditions under mild assumptions. Experimental results show the effectiveness of the proposed approach.
    摘要 我们研究一个具有挑战性的任务:在存在潜变量的情况下学习因果结构,包括确定潜变量的位置和数量,以及识别潜变量与观测变量之间的因果关系。为此,我们为含潜变量的线性非高斯无环因果模型提出了广义独立噪声(GIN)条件,该条件刻画了某些观测变量的线性组合与另一些观测变量之间的独立性。具体来说,对于两个观测随机向量 $\bf{Y}$ 和 $\bf{Z}$,GIN 成立当且仅当 $\omega^{\intercal}\mathbf{Y}$ 与 $\mathbf{Z}$ 相互独立,其中 $\omega$ 是由 $\mathbf{Y}$ 和 $\mathbf{Z}$ 之间的互协方差确定的非零参数向量。随后,我们给出了线性非高斯无环因果模型中 GIN 条件的充要图形判据。粗略地说,GIN 意味着存在一个相对于 $\mathbf{Y}$ 的父节点集合(按因果顺序)的外生集合 $\mathcal{S}$,使得 $\mathcal{S}$ 将 $\mathbf{Y}$ 与 $\mathbf{Z}$ d-分离。有趣的是,我们发现独立噪声条件(即在不存在混杂因素时,原因与将结果对原因回归所得的残差相互独立)可以被视为 GIN 的特例。基于 GIN 与潜在因果结构之间的这种联系,我们进一步利用所提出的 GIN 条件和精心设计的搜索过程,高效地估计线性非高斯潜在层次模型(LiNGLaHs),其中潜在混杂因素之间也可能存在因果关系,甚至呈层次结构。我们证明,在温和的假设下,LiNGLaH 的潜在因果结构在 GIN 条件下是可识别的。实验结果验证了所提方法的有效性。
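A small numerical sketch of checking GIN for a given pair $(\mathbf{Y}, \mathbf{Z})$: estimate $\omega$ from the cross-covariance and test the independence of $\omega^{\intercal}\mathbf{Y}$ and $\mathbf{Z}$ with a simple HSIC statistic. The synthetic data and the threshold-free printout are illustrative assumptions, not the paper's search procedure.

```python
# Sketch: omega spans the left null space of cov(Y, Z); if GIN holds,
# omega^T Y behaves like exogenous noise that is independent of Z.
import numpy as np

rng = np.random.default_rng(0)
n = 500
L = rng.laplace(size=n)                    # one non-Gaussian latent cause
Y = np.column_stack([2.0 * L, -1.0 * L]) + 0.1 * rng.laplace(size=(n, 2))
Z = (0.5 * L + 0.1 * rng.laplace(size=n)).reshape(-1, 1)

Yc, Zc = Y - Y.mean(0), Z - Z.mean(0)
cross_cov = Yc.T @ Zc / n                  # shape (2, 1)
_, _, vt = np.linalg.svd(cross_cov.T)
omega = vt[-1]                             # direction with (near-)zero covariance
e = Yc @ omega                             # candidate surrogate noise omega^T Y

def hsic(a, b, sigma=1.0):
    # Biased empirical HSIC with RBF kernels; near zero suggests independence.
    def gram(v):
        d2 = (v[:, None] - v[None, :]) ** 2
        return np.exp(-d2 / (2 * sigma**2))
    K, Q = gram(a), gram(b)
    H = np.eye(len(a)) - 1.0 / len(a)
    return np.trace(K @ H @ Q @ H) / (len(a) - 1) ** 2

print("HSIC(omega^T Y, Z) =", hsic(e, Z.ravel()))  # near 0 => GIN plausibly holds
```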

Estimating and Incentivizing Imperfect-Knowledge Agents with Hidden Rewards

  • paper_url: http://arxiv.org/abs/2308.06717
  • repo_url: None
  • paper_authors: Ilgin Dogan, Zuo-Jun Max Shen, Anil Aswani
  • for: 本研究探讨了一种复杂的委托-代理博弈,其中委托人(principal)不能直接观察代理人(agent)的奖励实现情况,这与以往许多委托-代理模型不同。这种信息不对称使委托人难以估计代理人的未知奖励,而这一问题在绿色能源存储合同和个性化医疗激励等各种实际场景中都有广泛的应用。
  • methods: 本研究将代理人的学习建模为多臂老虎机(MAB)问题:代理人通过学习来最大化其期望奖励加激励;委托人则训练一个并行算法,在持续估计代理人未知奖励与通过自适应激励引导代理人以最大化自身效用之间进行权衡。
  • results: 本研究证明,在非参数模型下,可以仅凭委托人的激励历史和代理人的选择历史来一致地估计代理人的未知奖励,并给出了结合数据驱动激励策略的有限样本一致性结果和委托人的严格遗憾界。最后,通过仿真实验证明了该框架在绿色能源聚合合同中的实用性。
    Abstract In practice, incentive providers (i.e., principals) often cannot observe the reward realizations of incentivized agents, which is in contrast to many principal-agent models that have been previously studied. This information asymmetry challenges the principal to consistently estimate the agent's unknown rewards by solely watching the agent's decisions, which becomes even more challenging when the agent has to learn its own rewards. This complex setting is observed in various real-life scenarios ranging from renewable energy storage contracts to personalized healthcare incentives. Hence, it offers not only interesting theoretical questions but also wide practical relevance. This paper explores a repeated adverse selection game between a self-interested learning agent and a learning principal. The agent tackles a multi-armed bandit (MAB) problem to maximize their expected reward plus incentive. On top of the agent's learning, the principal trains a parallel algorithm and faces a trade-off between consistently estimating the agent's unknown rewards and maximizing their own utility by offering adaptive incentives to lead the agent. For a non-parametric model, we introduce an estimator whose only input is the history of principal's incentives and agent's choices. We unite this estimator with a proposed data-driven incentive policy within a MAB framework. Without restricting the type of the agent's algorithm, we prove finite-sample consistency of the estimator and a rigorous regret bound for the principal by considering the sequential externality imposed by the agent. Lastly, our theoretical results are reinforced by simulations justifying applicability of our framework to green energy aggregator contracts.
    摘要 在实践中,激励提供者(即委托人)往往无法观察被激励代理人的奖励实现情况,这与以往研究的许多委托-代理模型不同。这种信息不对称使委托人难以仅通过观察代理人的决策来一致地估计代理人的未知奖励,当代理人还需要学习自己的奖励时,这一问题变得更加困难。这种复杂的设定出现在从可再生能源存储合同到个性化医疗激励等多种现实场景中,因此它不仅提出了有趣的理论问题,也具有广泛的实际意义。本文研究了一个自利的学习代理人与一个学习的委托人之间的重复逆向选择博弈。代理人求解一个多臂老虎机(MAB)问题,以最大化其期望奖励加激励。在代理人学习的同时,委托人训练一个并行算法,并面临在一致估计代理人未知奖励与通过自适应激励引导代理人以最大化自身效用之间的权衡。对于非参数模型,我们引入了一个仅以委托人的激励历史和代理人的选择历史为输入的估计器,并将该估计器与所提出的数据驱动激励策略统一到MAB框架中。在不限制代理人算法类型的前提下,我们通过考虑代理人施加的序贯外部性,证明了估计器的有限样本一致性以及委托人的严格遗憾界。最后,仿真实验进一步验证了我们的框架在绿色能源聚合合同中的适用性。

Learning on Graphs with Out-of-Distribution Nodes

  • paper_url: http://arxiv.org/abs/2308.06714
  • repo_url: https://github.com/songyyyy/kdd22-oodgat
  • paper_authors: Yu Song, Donglin Wang
  • for: 本研究旨在解决含分布外(OOD)节点的图学习问题,目标是检测异常节点并将其余节点分类到已知类别。
  • methods: 提出了一种新的图注意力网络模型,称为分布外图注意力网络(OODGAT),该模型显式建模不同类型节点之间的交互,并在特征传播过程中将分布内节点与异常节点分离,从而实现异常节点检测和正常节点分类。
  • results: 在多个数据集上进行的大量实验表明,OODGAT在异常检测方面大幅优于现有方法,同时在分布内分类任务上的性能更优或相当。
    Abstract Graph Neural Networks (GNNs) are state-of-the-art models for performing prediction tasks on graphs. While existing GNNs have shown great performance on various tasks related to graphs, little attention has been paid to the scenario where out-of-distribution (OOD) nodes exist in the graph during training and inference. Borrowing the concept from CV and NLP, we define OOD nodes as nodes with labels unseen from the training set. Since a lot of networks are automatically constructed by programs, real-world graphs are often noisy and may contain nodes from unknown distributions. In this work, we define the problem of graph learning with out-of-distribution nodes. Specifically, we aim to accomplish two tasks: 1) detect nodes which do not belong to the known distribution and 2) classify the remaining nodes to be one of the known classes. We demonstrate that the connection patterns in graphs are informative for outlier detection, and propose Out-of-Distribution Graph Attention Network (OODGAT), a novel GNN model which explicitly models the interaction between different kinds of nodes and separate inliers from outliers during feature propagation. Extensive experiments show that OODGAT outperforms existing outlier detection methods by a large margin, while being better or comparable in terms of in-distribution classification.
    摘要 图神经网络(GNN)是当前在图上执行预测任务的最先进模型。尽管现有的GNN在各种图相关任务上表现出色,但对于训练和推理过程中图中存在分布外(OOD)节点的情形,却鲜有研究关注。借鉴计算机视觉和自然语言处理中的概念,我们将OOD节点定义为标签未在训练集中出现过的节点。由于许多网络是由程序自动构建的,真实世界中的图往往含有噪声,并可能包含来自未知分布的节点。在这项工作中,我们定义了含分布外节点的图学习问题。具体而言,我们希望完成两个任务:1)检测不属于已知分布的节点;2)将其余节点分类为已知类别之一。我们证明了图中的连接模式对异常检测具有信息量,并提出了分布外图注意力网络(OODGAT),这是一种新的GNN模型,它显式建模不同类型节点之间的交互,并在特征传播过程中将分布内节点与异常节点分离。大量实验表明,OODGAT在异常检测方面大幅优于现有方法,同时在分布内分类方面表现更优或相当。
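The core intuition, scaling message passing by how likely the sender is an inlier, can be sketched as below. The dense adjacency, the scoring head, and the gating form are illustrative assumptions rather than OODGAT's exact architecture.

```python
# Sketch: estimate a per-node outlier score and down-weight messages from
# likely outliers so OOD nodes do not contaminate inlier representations.
import torch
import torch.nn as nn

class OutlierAwarePropagation(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)
        self.outlier_head = nn.Linear(in_dim, 1)  # higher => more likely OOD

    def forward(self, x, adj):
        inlier_prob = torch.sigmoid(-self.outlier_head(x))   # (N, 1)
        h = self.lin(x)
        # Each incoming message is scaled by the sender's inlier probability.
        weights = adj * inlier_prob.t()                       # (N, N)
        weights = weights / weights.sum(dim=1, keepdim=True).clamp(min=1e-6)
        return torch.relu(weights @ h), inlier_prob.squeeze(-1)

N, D = 6, 8
x = torch.randn(N, D)
adj = (torch.rand(N, N) < 0.4).float()
adj.fill_diagonal_(1.0)                                       # self-loops
layer = OutlierAwarePropagation(D, 16)
h, scores = layer(x, adj)
print(h.shape, scores.shape)  # (6, 16), (6,)
```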

Camouflaged Image Synthesis Is All You Need to Boost Camouflaged Detection

  • paper_url: http://arxiv.org/abs/2308.06701
  • repo_url: None
  • paper_authors: Haichao Zhang, Can Qin, Yu Yin, Yun Fu
  • for: 提高伪装物体检测深度学习模型的性能
  • methods: 使用生成模型合成逼真的伪装图像,并将其用于训练现有的物体检测模型
  • results: 在三个数据集(COD10k、CAMO和CHAMELEON)上超越当前最先进的方法,证明了其在提升伪装物体检测方面的有效性
    Abstract Camouflaged objects that blend into natural scenes pose significant challenges for deep-learning models to detect and synthesize. While camouflaged object detection is a crucial task in computer vision with diverse real-world applications, this research topic has been constrained by limited data availability. We propose a framework for synthesizing camouflage data to enhance the detection of camouflaged objects in natural scenes. Our approach employs a generative model to produce realistic camouflage images, which can be used to train existing object detection models. Specifically, we use a camouflage environment generator supervised by a camouflage distribution classifier to synthesize the camouflage images, which are then fed into our generator to expand the dataset. Our framework outperforms the current state-of-the-art method on three datasets (COD10k, CAMO, and CHAMELEON), demonstrating its effectiveness in improving camouflaged object detection. This approach can serve as a plug-and-play data generation and augmentation module for existing camouflaged object detection tasks and provides a novel way to introduce more diversity and distributions into current camouflage datasets.
    摘要 与自然场景融为一体的伪装物体给深度学习模型的检测与合成带来了重大挑战。伪装物体检测是计算机视觉中具有多种现实应用的重要任务,但这一研究方向一直受限于有限的数据可用性。我们提出了一种合成伪装数据的框架,用于增强自然场景中伪装物体的检测。我们的方法使用生成模型产生逼真的伪装图像,这些图像可用于训练现有的物体检测模型。具体来说,我们使用一个由伪装分布分类器监督的伪装环境生成器来合成伪装图像,再将其输入生成器以扩充数据集。我们的框架在三个数据集(COD10k、CAMO和CHAMELEON)上超越了当前最先进的方法,证明了其在改进伪装物体检测方面的有效性。该方法可以作为即插即用的数据生成与增强模块服务于现有的伪装物体检测任务,并为当前伪装数据集引入更多多样性和分布提供了一种新途径。

MACO: A Modality Adversarial and Contrastive Framework for Modality-missing Multi-modal Knowledge Graph Completion

  • paper_url: http://arxiv.org/abs/2308.06696
  • repo_url: https://github.com/zjukg/maco
  • paper_authors: Yichi Zhang, Zhuo Chen, Wen Zhang
  • for: 解决大规模知识图谱(KG)中模态信息缺失的问题,以便更好地完成多模态知识图谱补全(MMKGC)任务。
  • methods: 提出了一种模态对抗与对比框架(MACO),通过对生成器和判别器进行对抗训练,生成可纳入MMKGC模型的缺失模态特征。同时,设计了跨模态对比损失以提升生成器的性能。
  • results: 在公开基准上的实验及进一步探索表明,MACO可以达到最先进的结果,并可作为通用框架增强多种MMKGC模型。代码和基准数据可在https://github.com/zjukg/MACO获取。
    Abstract Recent years have seen significant advancements in multi-modal knowledge graph completion (MMKGC). MMKGC enhances knowledge graph completion (KGC) by integrating multi-modal entity information, thereby facilitating the discovery of unobserved triples in the large-scale knowledge graphs (KGs). Nevertheless, existing methods emphasize the design of elegant KGC models to facilitate modality interaction, neglecting the real-life problem of missing modalities in KGs. The missing modality information impedes modal interaction, consequently undermining the model's performance. In this paper, we propose a modality adversarial and contrastive framework (MACO) to solve the modality-missing problem in MMKGC. MACO trains a generator and discriminator adversarially to generate missing modality features that can be incorporated into the MMKGC model. Meanwhile, we design a cross-modal contrastive loss to improve the performance of the generator. Experiments on public benchmarks with further explorations demonstrate that MACO could achieve state-of-the-art results and serve as a versatile framework to bolster various MMKGC models. Our code and benchmark data are available at https://github.com/zjukg/MACO.
    摘要 近年来,多模态知识图谱补全(MMKGC)取得了显著进展。MMKGC通过整合多模态实体信息来增强知识图谱补全(KGC),从而促进在大规模知识图谱(KG)中发现未被观察到的三元组。然而,现有方法侧重于设计精巧的KGC模型以促进模态交互,忽视了知识图谱中模态缺失这一现实问题。缺失的模态信息会阻碍模态交互,进而损害模型性能。在这篇论文中,我们提出了一种模态对抗与对比框架(MACO),用于解决MMKGC中的模态缺失问题。MACO通过对抗训练生成器和判别器来生成可以纳入MMKGC模型的缺失模态特征。同时,我们设计了跨模态对比损失来提升生成器的性能。在公开基准上的实验及进一步探索表明,MACO能够取得最先进的结果,并可作为通用框架增强多种MMKGC模型。我们的代码和基准数据可在https://github.com/zjukg/MACO获取。

Video Captioning with Aggregated Features Based on Dual Graphs and Gated Fusion

  • paper_url: http://arxiv.org/abs/2308.06685
  • repo_url: None
  • paper_authors: Yutao Jin, Bin Liu, Jing Wang
  • for: 本研究旨在提高视频描述模型的准确性和完整性,通过使用双图(dual graphs)和门控融合(gated fusion)来生成视频内容的多层次特征表示。
  • methods: 我们提出了基于双图和门控融合的视频描述模型,包括双图推理和门控融合两部分。双图推理通过两种图分别生成视频内容的外观特征和运动特征,门控融合则聚合多个特征表示中的信息,以实现对视频内容的全面理解。
  • results: 在MSVD和MSR-VTT两个常用数据集上的实验取得了当前最佳表现。
    Abstract The application of video captioning models aims at translating the content of videos by using accurate natural language. Due to the complex nature inbetween object interaction in the video, the comprehensive understanding of spatio-temporal relations of objects remains a challenging task. Existing methods often fail in generating sufficient feature representations of video content. In this paper, we propose a video captioning model based on dual graphs and gated fusion: we adapt two types of graphs to generate feature representations of video content and utilize gated fusion to further understand these different levels of information. Using a dual-graphs model to generate appearance features and motion features respectively can utilize the content correlation in frames to generate various features from multiple perspectives. Among them, dual-graphs reasoning can enhance the content correlation in frame sequences to generate advanced semantic features; The gated fusion, on the other hand, aggregates the information in multiple feature representations for comprehensive video content understanding. The experiments conducted on worldly used datasets MSVD and MSR-VTT demonstrate state-of-the-art performance of our proposed approach.
    摘要 视频描述模型的应用旨在用准确的自然语言翻译视频内容。由于视频中物体交互的复杂性,全面理解物体间的时空关系仍是一项具有挑战性的任务。现有方法往往无法为视频内容生成充分的特征表示。在这篇论文中,我们提出了基于双图和门控融合的视频描述模型:我们采用两种图来生成视频内容的特征表示,并利用门控融合进一步理解这些不同层次的信息。使用双图模型分别生成外观特征和运动特征,可以利用帧间的内容相关性从多个视角生成多样的特征。其中,双图推理可以增强帧序列中的内容相关性,生成更高级的语义特征;而门控融合则将多个特征表示中的信息聚合起来,实现对视频内容的全面理解。在广泛使用的MSVD和MSR-VTT数据集上进行的实验表明,我们提出的方法达到了当前最佳性能。
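Gated fusion of the two feature streams reduces to a learned sigmoid gate; a minimal sketch (with assumed dimensions) follows.

```python
# Sketch: a per-dimension gate decides how much of the appearance stream and
# the motion stream enters the fused representation.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, appearance, motion):
        g = torch.sigmoid(self.gate(torch.cat([appearance, motion], dim=-1)))
        return g * appearance + (1.0 - g) * motion  # convex, gated mixture

frames, dim = 26, 512
appearance = torch.randn(1, frames, dim)  # from the appearance graph
motion = torch.randn(1, frames, dim)      # from the motion graph
fused = GatedFusion(dim)(appearance, motion)
print(fused.shape)  # (1, 26, 512) -> fed to the caption decoder
```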

Law of Balance and Stationary Distribution of Stochastic Gradient Descent

  • paper_url: http://arxiv.org/abs/2308.06671
  • repo_url: None
  • paper_authors: Liu Ziyin, Hongchao Li, Masahito Ueda
  • for: 这篇论文旨在理解用于训练神经网络的随机梯度下降(SGD)算法如何在高度非线性且退化的损失景观中运作。
  • methods: 论文分析了SGD的小批量噪声,证明当损失函数具有重缩放对称性时,SGD会将解正则化到平衡解,并据此推导了任意深度和宽度的对角线性网络的随机梯度流的平稳分布。
  • results: 推导出的平稳分布展现出相变、遍历性破缺和涨落反转等复杂的非线性现象,且这些现象仅存在于深层网络中,这表明深层与浅层模型之间存在根本差异。
    Abstract The stochastic gradient descent (SGD) algorithm is the algorithm we use to train neural networks. However, it remains poorly understood how the SGD navigates the highly nonlinear and degenerate loss landscape of a neural network. In this work, we prove that the minibatch noise of SGD regularizes the solution towards a balanced solution whenever the loss function contains a rescaling symmetry. Because the difference between a simple diffusion process and SGD dynamics is the most significant when symmetries are present, our theory implies that the loss function symmetries constitute an essential probe of how SGD works. We then apply this result to derive the stationary distribution of stochastic gradient flow for a diagonal linear network with arbitrary depth and width. The stationary distribution exhibits complicated nonlinear phenomena such as phase transitions, broken ergodicity, and fluctuation inversion. These phenomena are shown to exist uniquely in deep networks, implying a fundamental difference between deep and shallow models.
    摘要 随机梯度下降(SGD)算法是我们用来训练神经网络的算法。然而,SGD如何在神经网络高度非线性且退化的损失景观中运作,目前仍知之甚少。在这项工作中,我们证明:只要损失函数包含重缩放对称性,SGD的小批量噪声就会将解正则化到平衡解。由于简单扩散过程与SGD动力学之间的差异在存在对称性时最为显著,我们的理论表明,损失函数的对称性是探究SGD工作机制的重要切入点。随后,我们应用这一结果,推导了任意深度和宽度的对角线性网络的随机梯度流的平稳分布。该平稳分布展现出相变、遍历性破缺和涨落反转等复杂的非线性现象。我们证明这些现象仅存在于深层网络中,这表明深层模型与浅层模型之间存在根本差异。
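For readers who want the symmetry spelled out, a schematic LaTeX sketch follows; the two-vector factorization and the contraction rate shown are simplifying assumptions for illustration, not the paper's exact statement.

```latex
% Assumed simplest rescaling-symmetric factorization: the loss is unchanged
% under opposite rescalings of the two factors,
\[
  \mathcal{L}(u, v) \;=\; \mathcal{L}(\lambda u,\, v/\lambda)
  \qquad \forall\, \lambda \neq 0 ,
\]
% so gradient descent alone does not prefer any norm split between u and v,
% while minibatch noise does: schematically, the "law of balance" contracts
% the imbalance
\[
  \Delta \;:=\; \lVert u \rVert^{2} - \lVert v \rVert^{2},
  \qquad
  \frac{d}{dt}\,\mathbb{E}[\Delta] \;\propto\; -\,\eta\,\mathbb{E}[\Delta],
\]
% driving the solution toward the balanced manifold \Delta = 0.
```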

Unsupervised Adaptation of Polyp Segmentation Models via Coarse-to-Fine Self-Supervision

  • paper_url: http://arxiv.org/abs/2308.06665
  • repo_url: None
  • paper_authors: Jiexiang Wang, Chaoqi Chen
  • for: 本研究旨在解决因隐私保护和安全问题而受限的无监督领域自适应(UDA)问题,专注于无需源数据的源自由领域自适应(SFDA)方法。
  • methods: 本研究提出了一个名为 Region-to-Pixel Adaptation Network(RPANet)的新SFDA框架,通过由粗到细的自监督学习区域级和像素级的判别表示。RPANet包括两个模块:前景感知对比学习(FCL)和置信度校准伪标签(CCPL),分别解决"如何区分"和"如何精炼"这两个关键挑战。
  • results: 实验结果显示,RPANet在三个跨领域息肉分割任务上显著优于此前的SFDA和UDA方法,且无需访问源数据,显示了SFDA在医疗应用中的潜力。
    Abstract Unsupervised Domain Adaptation~(UDA) has attracted a surge of interest over the past decade but is difficult to be used in real-world applications. Considering the privacy-preservation issues and security concerns, in this work, we study a practical problem of Source-Free Domain Adaptation (SFDA), which eliminates the reliance on annotated source data. Current SFDA methods focus on extracting domain knowledge from the source-trained model but neglects the intrinsic structure of the target domain. Moreover, they typically utilize pseudo labels for self-training in the target domain, but suffer from the notorious error accumulation problem. To address these issues, we propose a new SFDA framework, called Region-to-Pixel Adaptation Network~(RPANet), which learns the region-level and pixel-level discriminative representations through coarse-to-fine self-supervision. The proposed RPANet consists of two modules, Foreground-aware Contrastive Learning (FCL) and Confidence-Calibrated Pseudo-Labeling (CCPL), which explicitly address the key challenges of ``how to distinguish'' and ``how to refine''. To be specific, FCL introduces a supervised contrastive learning paradigm in the region level to contrast different region centroids across different target images, which efficiently involves all pseudo labels while robust to noisy samples. CCPL designs a novel fusion strategy to reduce the overconfidence problem of pseudo labels by fusing two different target predictions without introducing any additional network modules. Extensive experiments on three cross-domain polyp segmentation tasks reveal that RPANet significantly outperforms state-of-the-art SFDA and UDA methods without access to source data, revealing the potential of SFDA in medical applications.
    摘要 无监督领域自适应(UDA)在过去十年间引起了广泛关注,但难以在现实应用中使用。考虑到隐私保护和安全问题,在这项工作中,我们研究了一个实际问题:源自由领域自适应(SFDA),它消除了对带标注源数据的依赖。现有的SFDA方法侧重于从源域训练的模型中提取领域知识,却忽视了目标域的内在结构。此外,它们通常在目标域中利用伪标签进行自训练,但饱受众所周知的误差累积问题困扰。为解决这些问题,我们提出了一种新的SFDA框架,称为Region-to-Pixel Adaptation Network(RPANet),它通过由粗到细的自监督学习区域级和像素级的判别表示。所提出的RPANet包括两个模块:前景感知对比学习(FCL)和置信度校准伪标签(CCPL),它们分别针对"如何区分"和"如何精炼"这两个关键挑战。具体来说,FCL在区域层面引入有监督对比学习范式,对比不同目标图像间的区域中心,从而高效利用所有伪标签并对噪声样本保持鲁棒。CCPL设计了一种新的融合策略,通过融合两个不同的目标预测来缓解伪标签的过度自信问题,且不引入任何额外的网络模块。在三个跨领域息肉分割任务上的大量实验表明,RPANet在不访问源数据的情况下显著优于最先进的SFDA和UDA方法,揭示了SFDA在医疗应用中的潜力。

ALGAN: Time Series Anomaly Detection with Adjusted-LSTM GAN

  • paper_url: http://arxiv.org/abs/2308.06663
  • repo_url: None
  • paper_authors: Md Abul Bashar, Richi Nayak
  • for: 这篇论文旨在应用生成对抗网络(GAN)方法来检测时间序列数据中的异常点。
  • methods: 这篇论文提出了一个新的GAN模型,名为调整LSTM GAN(ALGAN),它可以在无监督的情况下提高单变量和多变量时间序列数据中异常点检测的精度。
  • results: 实验结果显示,ALGAN在46个真实世界单变量时间序列数据集和一个跨多领域的大型多变量数据集上,异常检测精度高于传统方法、基于神经网络的方法和其他基于GAN的方法。
    Abstract Anomaly detection in time series data, to identify points that deviate from normal behaviour, is a common problem in various domains such as manufacturing, medical imaging, and cybersecurity. Recently, Generative Adversarial Networks (GANs) are shown to be effective in detecting anomalies in time series data. The neural network architecture of GANs (i.e. Generator and Discriminator) can significantly improve anomaly detection accuracy. In this paper, we propose a new GAN model, named Adjusted-LSTM GAN (ALGAN), which adjusts the output of an LSTM network for improved anomaly detection in both univariate and multivariate time series data in an unsupervised setting. We evaluate the performance of ALGAN on 46 real-world univariate time series datasets and a large multivariate dataset that spans multiple domains. Our experiments demonstrate that ALGAN outperforms traditional, neural network-based, and other GAN-based methods for anomaly detection in time series data.
    摘要 时间序列数据中的异常检测旨在识别偏离正常行为的数据点,是制造业、医学影像和网络安全等多个领域中的常见问题。最近的研究表明,生成对抗网络(GAN)能够有效检测时间序列数据中的异常。GAN的神经网络架构(即生成器和判别器)可以显著提高异常检测的精度。在这篇论文中,我们提出了一个新的GAN模型,名为调整LSTM GAN(ALGAN),它通过调整LSTM网络的输出,在无监督设定下改进单变量和多变量时间序列数据的异常检测。我们在46个真实世界单变量时间序列数据集和一个跨多个领域的大型多变量数据集上评估了ALGAN的性能。实验表明,ALGAN在时间序列异常检测上优于传统方法、基于神经网络的方法以及其他基于GAN的方法。

Benign Shortcut for Debiasing: Fair Visual Recognition via Intervention with Shortcut Features

  • paper_url: http://arxiv.org/abs/2308.08482
  • repo_url: https://github.com/yiiizhang/shortcutDebiasing
  • paper_authors: Yi Zhang, Jitao Sang, Junyang Wang, Dongmei Jiang, Yaowei Wang
  • for: 这篇论文旨在解决机器学习模型依赖敏感社会属性(如性别和种族)进行预测的问题,以确保招聘、银行和刑事司法等社会应用中的公平性。
  • methods: 这篇论文提出了一种称为"捷径去偏"(Shortcut Debiasing)的方法,首先将目标任务对偏差属性的学习从偏差特征转移到捷径特征(shortcut features)上,然后在推理阶段使用因果干预消除捷径特征。
  • results: 这篇论文在多个基准数据集上,在精度和公平性两方面均显著优于现有的去偏方法。
    Abstract Machine learning models often learn to make predictions that rely on sensitive social attributes like gender and race, which poses significant fairness risks, especially in societal applications, such as hiring, banking, and criminal justice. Existing work tackles this issue by minimizing the employed information about social attributes in models for debiasing. However, the high correlation between target task and these social attributes makes learning on the target task incompatible with debiasing. Given that model bias arises due to the learning of bias features (\emph{i.e}., gender) that help target task optimization, we explore the following research question: \emph{Can we leverage shortcut features to replace the role of bias feature in target task optimization for debiasing?} To this end, we propose \emph{Shortcut Debiasing}, to first transfer the target task's learning of bias attributes from bias features to shortcut features, and then employ causal intervention to eliminate shortcut features during inference. The key idea of \emph{Shortcut Debiasing} is to design controllable shortcut features to on one hand replace bias features in contributing to the target task during the training stage, and on the other hand be easily removed by intervention during the inference stage. This guarantees the learning of the target task does not hinder the elimination of bias features. We apply \emph{Shortcut Debiasing} to several benchmark datasets, and achieve significant improvements over the state-of-the-art debiasing methods in both accuracy and fairness.
    摘要 机器学习模型往往学会依赖性别和种族等敏感社会属性进行预测,这带来了显著的公平性风险,尤其是在招聘、银行和刑事司法等社会应用中。现有工作通过最小化模型中所使用的社会属性信息来实现去偏。然而,目标任务与这些社会属性之间的高度相关性使得目标任务的学习与去偏难以兼容。鉴于模型偏差源于对有助于目标任务优化的偏差特征(如性别)的学习,我们探讨以下研究问题:能否利用捷径特征取代偏差特征在目标任务优化中的作用,从而实现去偏?为此,我们提出了"捷径去偏"(Shortcut Debiasing):首先将目标任务对偏差属性的学习从偏差特征转移到捷径特征上,然后在推理阶段通过因果干预消除捷径特征。其关键思想是设计可控的捷径特征:一方面在训练阶段替代偏差特征对目标任务的贡献,另一方面在推理阶段可以通过干预轻松移除。这保证了目标任务的学习不会妨碍偏差特征的消除。我们将捷径去偏应用于多个基准数据集,在精度和公平性上均显著优于最先进的去偏方法。

SMARTKT: Smart Knowledge Transfer for Program Comprehension

  • paper_url: http://arxiv.org/abs/2308.06653
  • repo_url: None
  • paper_authors: Srijoni Majumdar, Partha Pratim Das
  • for: Addressing the issue of rising software maintenance cost due to program comprehension challenges.
  • methods: Proposes SMARTKT (Smart Knowledge Transfer), a search framework that extracts and integrates knowledge related to various aspects of an application in the form of a semantic graph, supporting syntax and semantic queries and converting the process of program comprehension into a “google-like” search problem.
  • results: Not specified in the abstract, but the paper likely presents the effectiveness of SMARTKT in improving program comprehension and reducing software maintenance costs.
    Abstract To address the issue of rising software maintenance cost due to program comprehension challenges, we propose SMARTKT (Smart Knowledge Transfer), a search framework, which extracts and integrates knowledge related to various aspects of an application in form of a semantic graph. This graph supports syntax and semantic queries and converts the process of program comprehension into a {\em google-like} search problem.
    摘要 为了解决由程序理解困难导致的软件维护成本上升问题,我们提出了 SMARTKT(Smart Knowledge Transfer),一个搜索框架,它以语义图的形式提取并整合应用程序各个方面的相关知识。该图支持语法和语义查询,并将程序理解过程转化为一个类似 Google 的搜索问题。

Stationary Algorithmic Balancing For Dynamic Email Re-Ranking Problem

  • paper_url: http://arxiv.org/abs/2308.08460
  • repo_url: https://github.com/jylevangeline/mosr
  • paper_authors: Jiayi Liu, Jennifer Neville
  • for: 这项研究旨在提出一个基于多目标的电子邮件重排序系统,以满足用户随时间变化的偏好。
  • methods: 这项研究使用自适应控制模型来动态平衡多个目标,包括发件人与主题的相关度(closeness)、时效性(timeliness)和简洁性(conciseness)。
  • results: 结果显示,MOSR在非平稳偏好下表现更好,尤其是在用户对不同标准的重视程度随时间变化时。此外,MOSR在邮件特征方差较大的下采样数据集上也能保持稳定的排序。
    Abstract Email platforms need to generate personalized rankings of emails that satisfy user preferences, which may vary over time. We approach this as a recommendation problem based on three criteria: closeness (how relevant the sender and topic are to the user), timeliness (how recent the email is), and conciseness (how brief the email is). We propose MOSR (Multi-Objective Stationary Recommender), a novel online algorithm that uses an adaptive control model to dynamically balance these criteria and adapt to preference changes. We evaluate MOSR on the Enron Email Dataset, a large collection of real emails, and compare it with other baselines. The results show that MOSR achieves better performance, especially under non-stationary preferences, where users value different criteria more or less over time. We also test MOSR's robustness on a smaller down-sampled dataset that exhibits high variance in email characteristics, and show that it maintains stable rankings across different samples. Our work offers novel insights into how to design email re-ranking systems that account for multiple objectives impacting user satisfaction.
    摘要 电子邮件平台需要生成满足用户偏好的个性化邮件排序,而用户偏好可能随时间变化。我们将其视为一个基于三个标准的推荐问题:相关度(发件人和主题与用户的相关程度)、时效性(邮件的新近程度)和简洁性(邮件的简短程度)。我们提出了MOSR(多目标平稳推荐器),这是一种新的在线算法,利用自适应控制模型动态平衡这些标准并适应偏好变化。我们在Enron邮件数据集(一个大型真实邮件集合)上评估了MOSR,并与其他基线进行了比较。结果表明,MOSR取得了更好的性能,尤其是在非平稳偏好下,即用户随时间对不同标准的重视程度发生变化时。我们还在一个邮件特征方差较大的下采样小数据集上测试了MOSR的鲁棒性,结果显示它在不同样本间保持稳定的排序。我们的工作为如何设计兼顾影响用户满意度的多个目标的邮件重排序系统提供了新的见解。

Accelerating Diffusion-based Combinatorial Optimization Solvers by Progressive Distillation

  • paper_url: http://arxiv.org/abs/2308.06644
  • repo_url: https://github.com/jwrh/Accelerating-Diffusion-based-Combinatorial-Optimization-Solvers-by-Progressive-Distillation
  • paper_authors: Junwei Huang, Zhiqing Sun, Yiming Yang
  • for: 加速基于图扩散模型的NP完全组合优化(CO)问题求解器的推理,同时保持解的质量
  • methods: 使用渐进蒸馏(progressive distillation)加速推理,在去噪过程中以更少的步数完成推理(例如在单步内预测两步)
  • results: 实验结果显示,渐进蒸馏后的模型在TSP-50数据集上推理速度提高16倍,性能仅下降0.019%
    Abstract Graph-based diffusion models have shown promising results in terms of generating high-quality solutions to NP-complete (NPC) combinatorial optimization (CO) problems. However, those models are often inefficient in inference, due to the iterative evaluation nature of the denoising diffusion process. This paper proposes to use progressive distillation to speed up the inference by taking fewer steps (e.g., forecasting two steps ahead within a single step) during the denoising process. Our experimental results show that the progressively distilled model can perform inference 16 times faster with only 0.019% degradation in performance on the TSP-50 dataset.
    摘要 基于图的扩散模型在为NP完全(NPC)组合优化(CO)问题生成高质量解方面已展现出可观的成果。然而,由于去噪扩散过程的迭代评估特性,这些模型的推理往往效率低下。本文提出使用渐进蒸馏来加速推理,在去噪过程中采用更少的步数(例如在单步内预测两步)。我们的实验结果显示,经渐进蒸馏的模型在TSP-50数据集上推理速度可提高16倍,性能仅下降0.019%。
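One round of progressive distillation can be sketched as follows: the student learns to reproduce two consecutive teacher denoising steps in a single step. The toy MLP denoiser and the plain regression objective are assumptions for illustration, not the paper's CO-specific diffusion solver.

```python
# Sketch: match one student step to two teacher steps, halving inference cost
# per distillation round.
import torch
import torch.nn as nn

dim = 16
teacher = nn.Sequential(nn.Linear(dim + 1, 64), nn.ReLU(), nn.Linear(64, dim))
student = nn.Sequential(nn.Linear(dim + 1, 64), nn.ReLU(), nn.Linear(64, dim))
student.load_state_dict(teacher.state_dict())  # warm-start from the teacher
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def denoise_step(net, x, t):
    # One step of a (stubbed) deterministic sampler: predict the next iterate.
    t_col = torch.full((x.shape[0], 1), float(t))
    return net(torch.cat([x, t_col], dim=1))

for step in range(100):
    x_t = torch.randn(32, dim)
    t = torch.randint(2, 10, ()).item()
    with torch.no_grad():                       # two teacher steps ...
        target = denoise_step(teacher, denoise_step(teacher, x_t, t), t - 1)
    pred = denoise_step(student, x_t, t)        # ... matched by one student step
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print("final distillation loss:", loss.item())
```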

Can Unstructured Pruning Reduce the Depth in Deep Neural Networks?

  • paper_url: http://arxiv.org/abs/2308.06619
  • repo_url: None
  • paper_authors: Zhu Liao, Victor Quétu, Van-Tam Nguyen, Enzo Tartaglione
  • for: 在保持性能的同时降低深度神经网络的规模
  • methods: 使用熵引导剪枝算法(Entropy Guided Pruning, EGP),优先剪除低熵层中的连接,最终将这些层完全移除
  • results: 在流行的模型 ResNet-18 和 Swin-T 上进行的广泛实验表明,EGP 能够有效压缩深度神经网络,同时保持有竞争力的性能水平
    Abstract Pruning is a widely used technique for reducing the size of deep neural networks while maintaining their performance. However, such a technique, despite being able to massively compress deep models, is hardly able to remove entire layers from a model (even when structured): is this an addressable task? In this study, we introduce EGP, an innovative Entropy Guided Pruning algorithm aimed at reducing the size of deep neural networks while preserving their performance. The key focus of EGP is to prioritize pruning connections in layers with low entropy, ultimately leading to their complete removal. Through extensive experiments conducted on popular models like ResNet-18 and Swin-T, our findings demonstrate that EGP effectively compresses deep neural networks while maintaining competitive performance levels. Our results not only shed light on the underlying mechanism behind the advantages of unstructured pruning, but also pave the way for further investigations into the intricate relationship between entropy, pruning techniques, and deep learning performance. The EGP algorithm and its insights hold great promise for advancing the field of network compression and optimization. The source code for EGP is released open-source.
    摘要 剪枝是一种广泛使用的技术,用于在保持性能的同时减小深度神经网络的规模。然而,这种技术尽管能够大幅压缩深度模型,却几乎无法从模型中移除整个层(即使是结构化剪枝):这是一个可以解决的任务吗?在这项研究中,我们介绍了EGP,一种创新的熵引导剪枝算法,旨在在保持性能的同时减小深度神经网络的规模。EGP的关键在于优先剪除低熵层中的连接,最终使这些层被完全移除。通过在ResNet-18和Swin-T等流行模型上进行的广泛实验,我们的研究结果表明,EGP能够有效压缩深度神经网络,同时保持有竞争力的性能。我们的结果不仅阐明了非结构化剪枝优势背后的机制,也为进一步研究熵、剪枝技术与深度学习性能之间的复杂关系铺平了道路。EGP算法及其洞见对推动网络压缩与优化领域的发展具有巨大潜力。EGP的源代码已开源发布。
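A minimal sketch of the entropy-guided idea: score each layer by the entropy of its weight distribution and prune low-entropy layers harder, so they can eventually be removed entirely. The histogram entropy and the linear sparsity schedule are illustrative assumptions, not the exact EGP criterion.

```python
# Sketch: low-entropy layers receive higher sparsity; a layer driven to ~100%
# sparsity can then be dropped outright, reducing the network's depth.
import torch
import torch.nn as nn

def layer_entropy(w: torch.Tensor, bins: int = 32) -> float:
    hist = torch.histc(w.detach().flatten(), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * p.log()).sum())

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                      nn.Linear(128, 128), nn.ReLU(),
                      nn.Linear(128, 10))
linears = [m for m in model if isinstance(m, nn.Linear)]
entropies = torch.tensor([layer_entropy(m.weight) for m in linears])

# Map each layer's entropy to a target sparsity in [0.5, 0.95] (assumed schedule).
span = entropies.max() - entropies.min() + 1e-8
sparsity = 0.5 + 0.45 * (entropies.max() - entropies) / span

for m, s in zip(linears, sparsity):
    thresh = m.weight.abs().flatten().quantile(float(s))
    mask = (m.weight.abs() >= thresh).float()
    m.weight.data.mul_(mask)  # magnitude pruning at the entropy-derived rate
    print(f"layer pruned to {100 * (1 - mask.mean()):.1f}% sparsity")
```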

On the Interplay of Convolutional Padding and Adversarial Robustness

  • paper_url: http://arxiv.org/abs/2308.06612
  • repo_url: None
  • paper_authors: Paul Gavrikov, Janis Keuper
  • for: 本研究探讨了填充(padding)与对抗攻击之间的相互作用,并分析了不同填充模式(或不使用填充)在各种场景下对对抗鲁棒性的影响。
  • methods: 本研究使用卷积神经网络(CNN),在卷积操作前对输入进行填充以保持特征图的分辨率。
  • results: 研究发现,对抗攻击往往会在图像边界处产生扰动异常,而这些区域正是使用填充的地方。
    Abstract It is common practice to apply padding prior to convolution operations to preserve the resolution of feature-maps in Convolutional Neural Networks (CNN). While many alternatives exist, this is often achieved by adding a border of zeros around the inputs. In this work, we show that adversarial attacks often result in perturbation anomalies at the image boundaries, which are the areas where padding is used. Consequently, we aim to provide an analysis of the interplay between padding and adversarial attacks and seek an answer to the question of how different padding modes (or their absence) affect adversarial robustness in various scenarios.
    摘要 在卷积神经网络(CNN)中,通常会在卷积操作之前进行填充(padding),以保持特征图的分辨率。虽然存在许多替代方案,但这通常通过在输入周围添加一圈零来实现。在这项工作中,我们发现对抗攻击往往会在图像边界处产生扰动异常,而这些区域正是使用填充的地方。因此,我们旨在分析填充与对抗攻击之间的相互作用,并试图回答:不同的填充模式(或不使用填充)在各种场景下如何影响对抗鲁棒性。

Bio-SIEVE: Exploring Instruction Tuning Large Language Models for Systematic Review Automation

  • paper_url: http://arxiv.org/abs/2308.06610
  • repo_url: https://github.com/ambroser53/bio-sieve
  • paper_authors: Ambrose Robinson, William Thorne, Ben P. Wu, Abdullah Pandor, Munira Essat, Mark Stevenson, Xingyi Song
  • for: 这项研究旨在探讨如何支持并训练大型语言模型(LLM),使其在给定详细的入选标准时执行文献筛选。
  • methods: 研究使用指令微调(instruction tuning)方法训练 LLaMA 和 Guanaco 模型,以执行医学系统综述的摘要筛选。
  • results: 研究发现,Bio-SIEVE 模型优于 ChatGPT 和经过训练的传统方法,并能更好地泛化到不同医学领域。然而,在安全优先的场景下,模型仍需进一步适配。此外,研究还探索了多任务训练的 Bio-SIEVE-Multi(包括 PICO 提取和排除理由等任务),但发现其性能无法与单任务 Bio-SIEVE 相比。
    Abstract Medical systematic reviews can be very costly and resource intensive. We explore how Large Language Models (LLMs) can support and be trained to perform literature screening when provided with a detailed set of selection criteria. Specifically, we instruction tune LLaMA and Guanaco models to perform abstract screening for medical systematic reviews. Our best model, Bio-SIEVE, outperforms both ChatGPT and trained traditional approaches, and generalises better across medical domains. However, there remains the challenge of adapting the model to safety-first scenarios. We also explore the impact of multi-task training with Bio-SIEVE-Multi, including tasks such as PICO extraction and exclusion reasoning, but find that it is unable to match single-task Bio-SIEVE's performance. We see Bio-SIEVE as an important step towards specialising LLMs for the biomedical systematic review process and explore its future developmental opportunities. We release our models, code and a list of DOIs to reconstruct our dataset for reproducibility.
    摘要 医学系统综述的成本和资源消耗都非常高。我们探讨在给定详细的入选标准时,如何支持并训练大型语言模型(LLM)执行文献筛选。具体而言,我们对 LLaMA 和 Guanaco 模型进行指令微调,以执行医学系统综述的摘要筛选。我们的最佳模型 Bio-SIEVE 优于 ChatGPT 和经过训练的传统方法,并能更好地在不同医学领域间泛化。然而,如何使模型适配安全优先的场景仍是一项挑战。我们还探索了 Bio-SIEVE-Multi 的多任务训练(包括 PICO 提取和排除理由等任务),但发现其无法匹敌单任务 Bio-SIEVE 的性能。我们认为 Bio-SIEVE 是使 LLM 专门化于生物医学系统综述流程的重要一步,并探讨了其未来的发展机遇。我们发布了模型、代码以及用于重建数据集的 DOI 列表,以保证可复现性。

cs.CL - 2023-08-13

Faithful to Whom? Questioning Interpretability Measures in NLP

  • paper_url: http://arxiv.org/abs/2308.06795
  • repo_url: None
  • paper_authors: Evan Crothers, Herna Viktor, Nathalie Japkowicz
  • for: 这篇论文旨在探讨现有的忠实度(faithfulness)度量是否适用于比较不同神经网络文本分类器的可解释性。
  • methods: 作者使用迭代遮盖(iterative masking)方法检验忠实度度量,发现这些度量在可比较的模型之间波动很大。
  • results: 作者发现被遮盖的样本经常落在训练数据分布之外,且迭代遮盖可能导致忠实度分数的巨大差异。此外,作者还研究了对抗攻击和对抗训练对忠实度分数的影响,并论证了忠实度度量在分析文本对抗攻击中特征显著性方面的价值。
    Abstract A common approach to quantifying model interpretability is to calculate faithfulness metrics based on iteratively masking input tokens and measuring how much the predicted label changes as a result. However, we show that such metrics are generally not suitable for comparing the interpretability of different neural text classifiers as the response to masked inputs is highly model-specific. We demonstrate that iterative masking can produce large variation in faithfulness scores between comparable models, and show that masked samples are frequently outside the distribution seen during training. We further investigate the impact of adversarial attacks and adversarial training on faithfulness scores, and demonstrate the relevance of faithfulness measures for analyzing feature salience in text adversarial attacks. Our findings provide new insights into the limitations of current faithfulness metrics and key considerations to utilize them appropriately.
    摘要 量化模型可解释性的一种常见做法是:迭代地遮盖输入词元,并测量预测标签因此发生的变化,以计算忠实度度量。然而,我们证明这类度量通常不适合用来比较不同神经网络文本分类器的可解释性,因为模型对遮盖输入的响应具有高度的模型特异性。我们展示了迭代遮盖会使可比较的模型之间的忠实度分数产生巨大差异,并表明被遮盖的样本经常落在训练期间所见的分布之外。我们进一步考察了对抗攻击和对抗训练对忠实度分数的影响,并论证了忠实度度量对于分析文本对抗攻击中特征显著性的价值。我们的发现为现有忠实度度量的局限性提供了新的认识,并给出了恰当使用它们的关键考量。
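A self-contained sketch of the kind of masking-based metric the paper questions is shown below; the toy corpus and the coefficient-based token importance are illustrative assumptions.

```python
# Sketch: remove the k tokens a model deems most important and measure how much
# the predicted probability drops (a comprehensiveness-style score).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great movie loved it", "terrible movie hated it",
         "loved the acting", "hated the plot", "great plot", "terrible acting"]
labels = [1, 0, 1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

def masked_drop(text: str, k: int = 2) -> float:
    p_full = clf.predict_proba(vec.transform([text]))[0, 1]
    # Rank the text's tokens by |coefficient| as a simple importance proxy.
    tokens = [t for t in text.split() if t in vec.vocabulary_]
    coefs = {t: abs(clf.coef_[0][vec.vocabulary_[t]]) for t in tokens}
    top = sorted(coefs, key=coefs.get, reverse=True)[:k]
    masked = " ".join(t for t in text.split() if t not in top)
    p_masked = clf.predict_proba(vec.transform([masked]))[0, 1]
    # Caveat from the paper: the masked string may lie far outside the training
    # distribution, which is exactly what makes such scores model-specific.
    return p_full - p_masked

print("probability drop:", masked_drop("great movie loved it"))
```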

Modeling the Dashboard Provenance

  • paper_url: http://arxiv.org/abs/2308.06788
  • repo_url: None
  • paper_authors: Johne Jarske, Jorge Rady, Lucia V. L. Filgueiras, Leandro M. Velloso, Tania L. Santos
  • For: The paper aims to provide a provenance representation model for dashboards and their visual and data components, helping organizations evaluate the quality, consistency, and reliability of the information presented on dashboards.
  • Methods: The proposed model offers a comprehensive set of essential provenance metadata that enables users to evaluate the context in which a specific dashboard was developed, including information about the people, organizations, entities, and activities involved in the production, influence, or delivery of the data or object.
  • Results: The paper aims to provide a standardized and visualizable representation of provenance metadata for dashboards, helping users make better decisions based on the quality and reliability of the information presented.
    Abstract Organizations of all kinds, whether public or private, profit-driven or non-profit, and across various industries and sectors, rely on dashboards for effective data visualization. However, the reliability and efficacy of these dashboards rely on the quality of the visual and data they present. Studies show that less than a quarter of dashboards provide information about their sources, which is just one of the expected metadata when provenance is seriously considered. Provenance is a record that describes people, organizations, entities, and activities that had a role in the production, influence, or delivery of a piece of data or an object. This paper aims to provide a provenance representation model, that entitles standardization, modeling, generation, capture, and visualization, specifically designed for dashboards and its visual and data components. The proposed model will offer a comprehensive set of essential provenance metadata that enables users to evaluate the quality, consistency, and reliability of the information presented on dashboards. This will allow a clear and precise understanding of the context in which a specific dashboard was developed, ultimately leading to better decision-making.
    摘要 无论是公共还是私营、营利还是非营利,各行各业的组织都依赖仪表板(dashboard)进行有效的数据可视化。然而,仪表板的可靠性和有效性取决于其所呈现的视觉和数据的质量。研究表明,只有不到四分之一的仪表板提供了其数据来源的信息,而这只是认真考虑溯源(provenance)时所应具备的元数据之一。溯源是一种记录,描述了在一份数据或对象的产生、影响或交付过程中发挥作用的人员、组织、实体和活动。本文旨在提供一种专为仪表板及其视觉和数据组件设计的溯源表示模型,支持标准化、建模、生成、捕获和可视化。所提出的模型将提供一套全面的基本溯源元数据,使用户能够评估仪表板所呈现信息的质量、一致性和可靠性,从而清晰准确地理解特定仪表板的开发背景,最终带来更好的决策。

Token-Scaled Logit Distillation for Ternary Weight Generative Language Models

  • paper_url: http://arxiv.org/abs/2308.06744
  • repo_url: None
  • paper_authors: Minsoo Kim, Sihwa Lee, Janghwan Lee, Sukjin Hong, Du-Seong Chang, Wonyong Sung, Jungwook Choi
  • for: 这项研究旨在解决生成语言模型因模型规模庞大而难以实际部署的问题。
  • methods: 这项研究采用量化感知训练(QAT),并提出了一种专为生成语言模型设计的知识蒸馏方法,即词元缩放对数蒸馏(token-scaled logit distillation)。
  • results: 该方法实现了困惑度(perplexity)下降小于1.0,且在推理任务上精度无损,证明了其有效性。
    Abstract Generative Language Models (GLMs) have shown impressive performance in tasks such as text generation, understanding, and reasoning. However, the large model size poses challenges for practical deployment. To solve this problem, Quantization-Aware Training (QAT) has become increasingly popular. However, current QAT methods for generative models have resulted in a noticeable loss of accuracy. To counteract this issue, we propose a novel knowledge distillation method specifically designed for GLMs. Our method, called token-scaled logit distillation, prevents overfitting and provides superior learning from the teacher model and ground truth. This research marks the first evaluation of ternary weight quantization-aware training of large-scale GLMs with less than 1.0 degradation in perplexity and no loss of accuracy in a reasoning task.
    摘要 生成语言模型(GLM)在文本生成、理解和推理等任务中表现出色,但庞大的模型规模给实际部署带来了挑战。为解决这一问题,量化感知训练(QAT)日益流行。然而,现有针对生成模型的QAT方法会造成明显的精度损失。为此,我们提出了一种专为GLM设计的新型知识蒸馏方法,称为词元缩放对数蒸馏(token-scaled logit distillation)。该方法可以防止过拟合,并更好地从教师模型和真实标签中学习。这项研究首次评估了大规模GLM的三值权重量化感知训练,实现了困惑度下降小于1.0,且推理任务准确率无损。
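The general shape of a token-scaled distillation loss can be sketched as follows; the confidence-based scale used here is an assumption for illustration, as the paper defines its own token scaling.

```python
# Sketch: each token's KL(teacher || student) term is re-weighted by a
# per-token scale before summing over the sequence.
import torch
import torch.nn.functional as F

B, T, V = 2, 5, 100  # batch, sequence length, vocabulary size
teacher_logits = torch.randn(B, T, V)
student_logits = torch.randn(B, T, V, requires_grad=True)

t_logp = F.log_softmax(teacher_logits, dim=-1)
s_logp = F.log_softmax(student_logits, dim=-1)

# Per-token KL divergence between teacher and student distributions.
kl_per_token = (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)  # (B, T)

# Assumed scale: the teacher's normalized per-token confidence, so tokens the
# teacher is sure about contribute more weight (the paper's rule differs).
conf = t_logp.exp().max(dim=-1).values                          # (B, T)
scale = conf / conf.sum(dim=-1, keepdim=True)

loss = (scale * kl_per_token).sum(dim=-1).mean()
loss.backward()
print("token-scaled distillation loss:", loss.item())
```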

Emergent communication for AR

  • paper_url: http://arxiv.org/abs/2308.07342
  • repo_url: None
  • paper_authors: Ruxiao Chen, Shuaishuai Guo
  • for: 这篇论文旨在为移动增强现实(MAR)提出一种涌现语义通信框架,以提高MAR中的通信效率。
  • methods: 作者通过改进的Lewis信号博弈训练两个智能体,使其自发地涌现出一种离散的通信协议。
  • results: 实验表明,所提方案相比MAR中使用的传统物体识别,对未见过的物体具有更好的泛化性,并能通过使用极小的消息有效提升通信效率。
    Abstract Mobile augmented reality (MAR) is widely acknowledged as one of the ubiquitous interfaces to the digital twin and Metaverse, demanding unparalleled levels of latency, computational power, and energy efficiency. The existing solutions for realizing MAR combine multiple technologies like edge, cloud computing, and fifth-generation (5G) networks. However, the inherent communication latency of visual data imposes apparent limitations on the quality of experience (QoE). To address the challenge, we propose an emergent semantic communication framework to learn the communication protocols in MAR. Specifically, we train two agents through a modified Lewis signaling game to emerge a discrete communication protocol spontaneously. Based on this protocol, two agents can communicate about the abstract idea of visual data through messages with extremely small data sizes in a noisy channel, which leads to message errors. To better simulate real-world scenarios, we incorporate channel uncertainty into our training process. Experiments have shown that the proposed scheme has better generalization on unseen objects than traditional object recognition used in MAR and can effectively enhance communication efficiency through the utilization of small-size messages.

cs.LG - 2023-08-13

Faithful to Whom? Questioning Interpretability Measures in NLP

  • paper_url: http://arxiv.org/abs/2308.06795
  • repo_url: None
  • paper_authors: Evan Crothers, Herna Viktor, Nathalie Japkowicz
  • for: This paper questions how the interpretability of different neural text classifiers should be compared.
  • methods: It examines faithfulness metrics based on iteratively masking input tokens and measuring the resulting change in the predicted label, and shows such metrics are unsuitable for comparing interpretability across models.
  • results: Iterative masking produces large variation in faithfulness scores between comparable models, and masked samples frequently fall outside the distribution seen during training; the study also analyzes how adversarial attacks and adversarial training affect faithfulness scores.
    Abstract A common approach to quantifying model interpretability is to calculate faithfulness metrics based on iteratively masking input tokens and measuring how much the predicted label changes as a result. However, we show that such metrics are generally not suitable for comparing the interpretability of different neural text classifiers as the response to masked inputs is highly model-specific. We demonstrate that iterative masking can produce large variation in faithfulness scores between comparable models, and show that masked samples are frequently outside the distribution seen during training. We further investigate the impact of adversarial attacks and adversarial training on faithfulness scores, and demonstrate the relevance of faithfulness measures for analyzing feature salience in text adversarial attacks. Our findings provide new insights into the limitations of current faithfulness metrics and key considerations to utilize them appropriately.

Neural Networks at a Fraction with Pruned Quaternions

  • paper_url: http://arxiv.org/abs/2308.06780
  • repo_url: https://github.com/smlab-niser/quartLT22
  • paper_authors: Sahel Mohammad Iqbal, Subhankar Mishra
  • for: This work studies whether pruned quaternion networks are better candidates than real-valued ones for deployment in extremely resource-constrained environments.
  • methods: It prunes real- and quaternion-valued implementations of the same architectures on classification tasks, using the higher-dimensional quaternion embedding to cut parameter count while maintaining accuracy.
  • results: At very high sparsity levels, pruned quaternion models can outperform their real-valued counterparts; for example, on CIFAR-10 image classification with Conv-4 at $3\%$ of the original parameter count, the pruned quaternion version outperforms the pruned real network by more than $10\%$.
    Abstract Contemporary state-of-the-art neural networks have increasingly large numbers of parameters, which prevents their deployment on devices with limited computational power. Pruning is one technique to remove unnecessary weights and reduce resource requirements for training and inference. In addition, for ML tasks where the input data is multi-dimensional, using higher-dimensional data embeddings such as complex numbers or quaternions has been shown to reduce the parameter count while maintaining accuracy. In this work, we conduct pruning on real and quaternion-valued implementations of different architectures on classification tasks. We find that for some architectures, at very high sparsity levels, quaternion models provide higher accuracies than their real counterparts. For example, at the task of image classification on CIFAR-10 using Conv-4, at $3\%$ of the number of parameters as the original model, the pruned quaternion version outperforms the pruned real by more than $10\%$. Experiments on various network architectures and datasets show that for deployment in extremely resource-constrained environments, a sparse quaternion network might be a better candidate than a real sparse model of similar architecture.

A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations

  • paper_url: http://arxiv.org/abs/2308.06767
  • repo_url: https://github.com/hrcheng1066/awesome-pruning
  • paper_authors: Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi
  • for: This paper provides an up-to-date, comprehensive survey of deep neural network pruning, including for recent large language models, together with comparative analysis and recommendations.
  • methods: It organizes existing pruning research in a taxonomy of 1) universal/specific speedup, 2) when to prune, 3) how to prune, and 4) fusion of pruning with other compression techniques.
  • results: It offers a comparative analysis of seven pairs of contrast settings for pruning (e.g., unstructured/structured), discusses emerging topics such as post-training pruning, different levels of supervision, and broader applications, and builds a curated collection of datasets, networks, and evaluations.
    Abstract Modern deep neural networks, particularly recent large language models, come with massive model sizes that require significant computational and storage resources. To enable the deployment of modern models on resource-constrained environments and accelerate inference time, researchers have increasingly explored pruning techniques as a popular research direction in neural network compression. However, there is a dearth of up-to-date comprehensive review papers on pruning. To address this issue, in this survey, we provide a comprehensive review of existing research works on deep neural network pruning in a taxonomy of 1) universal/specific speedup, 2) when to prune, 3) how to prune, and 4) fusion of pruning and other compression techniques. We then provide a thorough comparative analysis of seven pairs of contrast settings for pruning (e.g., unstructured/structured) and explore emerging topics, including post-training pruning, different levels of supervision for pruning, and broader applications (e.g., adversarial robustness) to shed light on the commonalities and differences of existing methods and lay the foundation for further method development. To facilitate future research, we build a curated collection of datasets, networks, and evaluations on different applications. Finally, we provide some valuable recommendations on selecting pruning methods and prospect promising research directions. We build a repository at https://github.com/hrcheng1066/awesome-pruning.

Conic Descent Redux for Memory-Efficient Optimization

  • paper_url: http://arxiv.org/abs/2308.07343
  • repo_url: None
  • paper_authors: Bingcong Li, Georgios B. Giannakis
  • for: This work revisits a recently developed first-order conic descent (CD) solver and advances it in three aspects: intuition, theory, and algorithmic implementation.
  • methods: CD is shown to admit an intuitive geometric derivation originating from the dual problem, which opens the door to new algorithmic designs, exemplified by a momentum variant of CD (MOCO); analyzing the dual behavior of CD and MOCO yields i) an analytically justified stopping criterion and ii) preconditioners that speed up dual convergence.
  • results: A memory-efficient MOCO variant is developed to scale semidefinite programming (SDP), especially toward low-rank solutions, and is numerically validated.
    Abstract Conic programming has well-documented merits in a gamut of signal processing and machine learning tasks. This contribution revisits a recently developed first-order conic descent (CD) solver, and advances it in three aspects: intuition, theory, and algorithmic implementation. It is found that CD can afford an intuitive geometric derivation that originates from the dual problem. This opens the door to novel algorithmic designs, with a momentum variant of CD, momentum conic descent (MOCO) exemplified. Diving deeper into the dual behavior CD and MOCO reveals: i) an analytically justified stopping criterion; and, ii) the potential to design preconditioners to speed up dual convergence. Lastly, to scale semidefinite programming (SDP) especially for low-rank solutions, a memory efficient MOCO variant is developed and numerically validated.

Few-shot Class-incremental Learning: A Survey

  • paper_url: http://arxiv.org/abs/2308.06764
  • repo_url: None
  • paper_authors: Jinghua Zhang, Li Liu, Olli Silven, Matti Pietikäinen, Dewen Hu
  • for: This paper provides a systematic and in-depth survey of Few-shot Class-Incremental Learning (FSCIL), covering the problem definition, primary challenges, general schemes, related problems, benchmark datasets, and evaluation metrics.
  • methods: It categorizes FSCIL classification methods into data-based, structure-based, and optimization-based approaches, and FSCIL object detection methods into anchor-free and anchor-based approaches.
  • results: It highlights several promising research directions within FSCIL that merit further investigation.
    Abstract Few-shot Class-Incremental Learning (FSCIL) presents a unique challenge in machine learning, as it necessitates the continuous learning of new classes from sparse labeled training samples without forgetting previous knowledge. While this field has seen recent progress, it remains an active area of exploration. This paper aims to provide a comprehensive and systematic review of FSCIL. In our in-depth examination, we delve into various facets of FSCIL, encompassing the problem definition, the discussion of primary challenges of unreliable empirical risk minimization and the stability-plasticity dilemma, general schemes, and relevant problems of incremental learning and few-shot learning. Besides, we offer an overview of benchmark datasets and evaluation metrics. Furthermore, we introduce the classification methods in FSCIL from data-based, structure-based, and optimization-based approaches and the object detection methods in FSCIL from anchor-free and anchor-based approaches. Beyond these, we illuminate several promising research directions within FSCIL that merit further investigation.

Discovering the Symptom Patterns of COVID-19 from Recovered and Deceased Patients Using Apriori Association Rule Mining

  • paper_url: http://arxiv.org/abs/2308.06763
  • repo_url: None
  • paper_authors: Mohammad Dehghani, Zahra Yazdanparast, Mobin Mohammadi
  • for: This study mines symptom patterns of COVID-19 patients to help clinicians diagnose and manage the disease more effectively.
  • methods: It applies Apriori-based association rule mining to clinical records of 2875 patients to extract the most common symptoms.
  • results: The most common symptoms were apnea (72%), cough (64%), fever (59%), weakness (18%), myalgia (14.5%), and sore throat (12%).
    Abstract The COVID-19 pandemic has a devastating impact globally, claiming millions of lives and causing significant social and economic disruptions. In order to optimize decision-making and allocate limited resources, it is essential to identify COVID-19 symptoms and determine the severity of each case. Machine learning algorithms offer a potent tool in the medical field, particularly in mining clinical datasets for useful information and guiding scientific decisions. Association rule mining is a machine learning technique for extracting hidden patterns from data. This paper presents an application of association rule mining based Apriori algorithm to discover symptom patterns from COVID-19 patients. The study, using 2875 records of patient, identified the most common symptoms as apnea (72%), cough (64%), fever (59%), weakness (18%), myalgia (14.5%), and sore throat (12%). The proposed method provides clinicians with valuable insight into disease that can assist them in managing and treating it effectively.

Heterogeneous Multi-Agent Reinforcement Learning via Mirror Descent Policy Optimization

  • paper_url: http://arxiv.org/abs/2308.06741
  • repo_url: None
  • paper_authors: Mohammad Mehdi Nasiri, Mansoor Rezghi
  • for: This work addresses cooperative Multi-Agent Reinforcement Learning (MARL) settings in which agents have varying abilities and individual policies.
  • methods: The proposed Heterogeneous-Agent Mirror Descent Policy Optimization (HAMDPO) algorithm uses the multi-agent advantage decomposition lemma to update each agent's policy efficiently while guaranteeing overall performance improvement; policies are updated iteratively via approximate solutions of a trust-region problem, ensuring stability, and both continuous and discrete action spaces are supported.
  • results: On Multi-Agent MuJoCo and StarCraftII tasks, HAMDPO outperforms the state-of-the-art HATRPO and HAPPO algorithms, suggesting it is a promising approach for cooperative MARL that may extend to other challenging MARL problems.
    Abstract This paper presents an extension of the Mirror Descent method to overcome challenges in cooperative Multi-Agent Reinforcement Learning (MARL) settings, where agents have varying abilities and individual policies. The proposed Heterogeneous-Agent Mirror Descent Policy Optimization (HAMDPO) algorithm utilizes the multi-agent advantage decomposition lemma to enable efficient policy updates for each agent while ensuring overall performance improvements. By iteratively updating agent policies through an approximate solution of the trust-region problem, HAMDPO guarantees stability and improves performance. Moreover, the HAMDPO algorithm is capable of handling both continuous and discrete action spaces for heterogeneous agents in various MARL problems. We evaluate HAMDPO on Multi-Agent MuJoCo and StarCraftII tasks, demonstrating its superiority over state-of-the-art algorithms such as HATRPO and HAPPO. These results suggest that HAMDPO is a promising approach for solving cooperative MARL problems and could potentially be extended to address other challenging problems in the field of MARL.

Weighted Sparse Partial Least Squares for Joint Sample and Feature Selection

  • paper_url: http://arxiv.org/abs/2308.06740
  • repo_url: https://github.com/wenwenmin/wspls
  • paper_authors: Wenwen Min, Taosheng Xu, Chris Ding
  • for: This work extends sparse Partial Least Squares (sPLS) to identify a specific subset of samples and remove outliers during dimensionality reduction for data fusion.
  • methods: It proposes an $\ell_\infty/\ell_0$-norm constrained weighted sparse PLS ($\ell_\infty/\ell_0$-wsPLS) method for joint sample and feature selection, where the $\ell_\infty/\ell_0$-norm constraints select a subset of samples; the constraints are shown to have the Kurdyka-Łojasiewicz property, yielding a globally convergent algorithm, and the model is extended to two multi-view wsPLS variants for multi-view data fusion.
  • results: Numerical and biomedical data experiments demonstrate the efficiency of the proposed methods.
    Abstract Sparse Partial Least Squares (sPLS) is a common dimensionality reduction technique for data fusion, which projects data samples from two views by seeking linear combinations with a small number of variables with the maximum variance. However, sPLS extracts the combinations between two data sets with all data samples so that it cannot detect latent subsets of samples. To extend the application of sPLS by identifying a specific subset of samples and remove outliers, we propose an $\ell_\infty/\ell_0$-norm constrained weighted sparse PLS ($\ell_\infty/\ell_0$-wsPLS) method for joint sample and feature selection, where the $\ell_\infty/\ell_0$-norm constrains are used to select a subset of samples. We prove that the $\ell_\infty/\ell_0$-norm constrains have the Kurdyka-\L{ojasiewicz}~property so that a globally convergent algorithm is developed to solve it. Moreover, multi-view data with a same set of samples can be available in various real problems. To this end, we extend the $\ell_\infty/\ell_0$-wsPLS model and propose two multi-view wsPLS models for multi-view data fusion. We develop an efficient iterative algorithm for each multi-view wsPLS model and show its convergence property. As well as numerical and biomedical data experiments demonstrate the efficiency of the proposed methods.

Probabilistic Imputation for Time-series Classification with Missing Data

  • paper_url: http://arxiv.org/abs/2308.06738
  • repo_url: https://github.com/yuneg11/SupNotMIWAE-with-ObsDropout
  • paper_authors: SeungHyun Kim, Hyunsu Kim, EungGu Yun, Hwangrae Lee, Jaehun Lee, Juho Lee
  • for: This paper addresses classification of multivariate time series that contain missing values.
  • methods: It proposes a probabilistic framework with two parts: a deep generative model that imputes missing values in multiple plausible ways, modeling imputation uncertainty, and a classifier trained on the series with imputed values; a novel regularization technique keeps the generative model from producing trivial, uninformative imputations.
  • results: Extensive experiments on real-world time series with missing values demonstrate the effectiveness of the method.
    Abstract Multivariate time series data for real-world applications typically contain a significant amount of missing values. The dominant approach for classification with such missing values is to impute them heuristically with specific values (zero, mean, values of adjacent time-steps) or learnable parameters. However, these simple strategies do not take the data generative process into account, and more importantly, do not effectively capture the uncertainty in prediction due to the multiple possibilities for the missing values. In this paper, we propose a novel probabilistic framework for classification with multivariate time series data with missing values. Our model consists of two parts; a deep generative model for missing value imputation and a classifier. Extending the existing deep generative models to better capture structures of time-series data, our deep generative model part is trained to impute the missing values in multiple plausible ways, effectively modeling the uncertainty of the imputation. The classifier part takes the time series data along with the imputed missing values and classifies signals, and is trained to capture the predictive uncertainty due to the multiple possibilities of imputations. Importantly, we show that na\"ively combining the generative model and the classifier could result in trivial solutions where the generative model does not produce meaningful imputations. To resolve this, we present a novel regularization technique that can promote the model to produce useful imputation values that help classification. Through extensive experiments on real-world time series data with missing values, we demonstrate the effectiveness of our method.

Precipitation nowcasting with generative diffusion models

  • paper_url: http://arxiv.org/abs/2308.06733
  • repo_url: https://github.com/fmerizzi/Precipitation-nowcasting-with-generative-diffusion-models
  • paper_authors: Andrea Asperti, Fabio Merizzi, Alberto Paparella, Giorgio Pedrazzi, Matteo Angelinelli, Stefano Colamonaco
  • for: This study evaluates deep learning methods for precipitation nowcasting on a subset of the ERA-5 dataset (hourly data for Central Europe, 2016 to 2021).
  • methods: It examines generative models (Generative Adversarial Networks, Variational Autoencoders, and Denoising Diffusion Models) and proposes Generative Ensemble Diffusion (GED), in which a diffusion model generates a set of possible weather scenarios that a post-processing network amalgamates into a probable prediction.
  • results: GED substantially outperforms well-established U-Net models and other recent deep learning models in overall performance.
    Abstract In recent years traditional numerical methods for accurate weather prediction have been increasingly challenged by deep learning methods. Numerous historical datasets used for short and medium-range weather forecasts are typically organized into a regular spatial grid structure. This arrangement closely resembles images: each weather variable can be visualized as a map or, when considering the temporal axis, as a video. Several classes of generative models, comprising Generative Adversarial Networks, Variational Autoencoders, or the recent Denoising Diffusion Models have largely proved their applicability to the next-frame prediction problem, and is thus natural to test their performance on the weather prediction benchmarks. Diffusion models are particularly appealing in this context, due to the intrinsically probabilistic nature of weather forecasting: what we are really interested to model is the probability distribution of weather indicators, whose expected value is the most likely prediction. In our study, we focus on a specific subset of the ERA-5 dataset, which includes hourly data pertaining to Central Europe from the years 2016 to 2021. Within this context, we examine the efficacy of diffusion models in handling the task of precipitation nowcasting. Our work is conducted in comparison to the performance of well-established U-Net models, as documented in the existing literature. Our proposed approach of Generative Ensemble Diffusion (GED) utilizes a diffusion model to generate a set of possible weather scenarios which are then amalgamated into a probable prediction via the use of a post-processing network. This approach, in comparison to recent deep learning models, substantially outperformed them in terms of overall performance.

Generalized Independent Noise Condition for Estimating Causal Structure with Latent Variables

  • paper_url: http://arxiv.org/abs/2308.06718
  • repo_url: None
  • paper_authors: Feng Xie, Biwei Huang, Zhengming Chen, Ruichu Cai, Clark Glymour, Zhi Geng, Kun Zhang
  • for: The paper addresses learning causal structure in the presence of latent variables, including locating latent variables and determining their quantity, and identifying causal relationships among both latent and observed variables.
  • methods: The paper proposes a Generalized Independent Noise (GIN) condition for linear non-Gaussian acyclic causal models that incorporate latent variables, which establishes the independence between a linear combination of certain measured variables and some other measured variables. The paper also provides necessary and sufficient graphical criteria of the GIN condition in linear non-Gaussian acyclic causal models.
  • results: The paper shows that the proposed GIN condition, together with a well-designed search procedure, can be used to efficiently estimate Linear, Non-Gaussian Latent Hierarchical Models (LiNGLaHs), where latent confounders may also be causally related and may even follow a hierarchical structure. The paper also demonstrates the effectiveness of the proposed approach through experimental results.
    Abstract We investigate the challenging task of learning causal structure in the presence of latent variables, including locating latent variables and determining their quantity, and identifying causal relationships among both latent and observed variables. To address this, we propose a Generalized Independent Noise (GIN) condition for linear non-Gaussian acyclic causal models that incorporate latent variables, which establishes the independence between a linear combination of certain measured variables and some other measured variables. Specifically, for two observed random vectors $\bf{Y}$ and $\bf{Z}$, GIN holds if and only if $\omega^{\intercal}\mathbf{Y}$ and $\mathbf{Z}$ are independent, where $\omega$ is a non-zero parameter vector determined by the cross-covariance between $\mathbf{Y}$ and $\mathbf{Z}$. We then give necessary and sufficient graphical criteria of the GIN condition in linear non-Gaussian acyclic causal models. Roughly speaking, GIN implies the existence of an exogenous set $\mathcal{S}$ relative to the parent set of $\mathbf{Y}$ (w.r.t. the causal ordering), such that $\mathcal{S}$ d-separates $\mathbf{Y}$ from $\mathbf{Z}$. Interestingly, we find that the independent noise condition (i.e., if there is no confounder, causes are independent of the residual derived from regressing the effect on the causes) can be seen as a special case of GIN. With such a connection between GIN and latent causal structures, we further leverage the proposed GIN condition, together with a well-designed search procedure, to efficiently estimate Linear, Non-Gaussian Latent Hierarchical Models (LiNGLaHs), where latent confounders may also be causally related and may even follow a hierarchical structure. We show that the underlying causal structure of a LiNGLaH is identifiable in light of GIN conditions under mild assumptions. Experimental results show the effectiveness of the proposed approach.

Estimating and Incentivizing Imperfect-Knowledge Agents with Hidden Rewards

  • paper_url: http://arxiv.org/abs/2308.06717
  • repo_url: None
  • paper_authors: Ilgin Dogan, Zuo-Jun Max Shen, Anil Aswani
  • for: This paper explores a repeated adverse selection game between a self-interested learning agent and a learning principal in a setting where the principal cannot observe the agent's reward realizations.
  • methods: The paper uses a multi-armed bandit (MAB) problem to model the agent's learning and a parallel algorithm for the principal to consistently estimate the agent's unknown rewards while maximizing their own utility.
  • results: The paper proves finite-sample consistency of an estimator and a rigorous regret bound for the principal by considering the sequential externality imposed by the agent, and simulations justify the applicability of the framework to green energy aggregator contracts.
    Abstract In practice, incentive providers (i.e., principals) often cannot observe the reward realizations of incentivized agents, which is in contrast to many principal-agent models that have been previously studied. This information asymmetry challenges the principal to consistently estimate the agent's unknown rewards by solely watching the agent's decisions, which becomes even more challenging when the agent has to learn its own rewards. This complex setting is observed in various real-life scenarios ranging from renewable energy storage contracts to personalized healthcare incentives. Hence, it offers not only interesting theoretical questions but also wide practical relevance. This paper explores a repeated adverse selection game between a self-interested learning agent and a learning principal. The agent tackles a multi-armed bandit (MAB) problem to maximize their expected reward plus incentive. On top of the agent's learning, the principal trains a parallel algorithm and faces a trade-off between consistently estimating the agent's unknown rewards and maximizing their own utility by offering adaptive incentives to lead the agent. For a non-parametric model, we introduce an estimator whose only input is the history of principal's incentives and agent's choices. We unite this estimator with a proposed data-driven incentive policy within a MAB framework. Without restricting the type of the agent's algorithm, we prove finite-sample consistency of the estimator and a rigorous regret bound for the principal by considering the sequential externality imposed by the agent. Lastly, our theoretical results are reinforced by simulations justifying applicability of our framework to green energy aggregator contracts.

CDR: Conservative Doubly Robust Learning for Debiased Recommendation

  • paper_url: http://arxiv.org/abs/2308.08461
  • repo_url: None
  • paper_authors: ZiJie Song, JiaWei Chen, Sheng Zhou, QiHao Shi, Yan Feng, Chun Chen, Can Wang
  • for: Improving the robustness and performance of debiased recommendation in recommender systems.
  • methods: A Conservative Doubly Robust strategy (CDR) that filters imputations by scrutinizing their mean and variance, mitigating the effect of so-called poisonous imputations.
  • results: Theoretical analyses show CDR offers reduced variance and improved tail bounds, and experiments show it significantly enhances performance while reducing the frequency of poisonous imputation.
    Abstract In recommendation systems (RS), user behavior data is observational rather than experimental, resulting in widespread bias in the data. Consequently, tackling bias has emerged as a major challenge in the field of recommendation systems. Recently, Doubly Robust Learning (DR) has gained significant attention due to its remarkable performance and robust properties. However, our experimental findings indicate that existing DR methods are severely impacted by the presence of so-called Poisonous Imputation, where the imputation significantly deviates from the truth and becomes counterproductive. To address this issue, this work proposes Conservative Doubly Robust strategy (CDR) which filters imputations by scrutinizing their mean and variance. Theoretical analyses show that CDR offers reduced variance and improved tail bounds.In addition, our experimental investigations illustrate that CDR significantly enhances performance and can indeed reduce the frequency of poisonous imputation.

Learning on Graphs with Out-of-Distribution Nodes

  • paper_url: http://arxiv.org/abs/2308.06714
  • repo_url: https://github.com/songyyyy/kdd22-oodgat
  • paper_authors: Yu Song, Donglin Wang
  • for: Addressing graph learning with out-of-distribution (OOD) nodes: detecting nodes that do not belong to the known distribution and classifying the remaining nodes into the known classes.
  • methods: A new Graph Attention Network variant, the Out-of-Distribution Graph Attention Network (OODGAT), explicitly models the interaction between different kinds of nodes and separates inliers from outliers during feature propagation.
  • results: Experiments show OODGAT outperforms existing outlier detection methods by a large margin while being better than or comparable to existing methods on in-distribution classification.
    Abstract Graph Neural Networks (GNNs) are state-of-the-art models for performing prediction tasks on graphs. While existing GNNs have shown great performance on various tasks related to graphs, little attention has been paid to the scenario where out-of-distribution (OOD) nodes exist in the graph during training and inference. Borrowing the concept from CV and NLP, we define OOD nodes as nodes with labels unseen from the training set. Since a lot of networks are automatically constructed by programs, real-world graphs are often noisy and may contain nodes from unknown distributions. In this work, we define the problem of graph learning with out-of-distribution nodes. Specifically, we aim to accomplish two tasks: 1) detect nodes which do not belong to the known distribution and 2) classify the remaining nodes to be one of the known classes. We demonstrate that the connection patterns in graphs are informative for outlier detection, and propose Out-of-Distribution Graph Attention Network (OODGAT), a novel GNN model which explicitly models the interaction between different kinds of nodes and separate inliers from outliers during feature propagation. Extensive experiments show that OODGAT outperforms existing outlier detection methods by a large margin, while being better or comparable in terms of in-distribution classification.

The Hard-Constraint PINNs for Interface Optimal Control Problems

  • paper_url: http://arxiv.org/abs/2308.06709
  • repo_url: https://github.com/tianyouzeng/pinns-interface-optimal-control
  • paper_authors: Ming-Chih Lai, Yongcun Song, Xiaoming Yuan, Hangrui Yue, Tianyou Zeng
  • for: solves optimal control problems subject to partial differential equations (PDEs) with interfaces and some control constraints.
  • methods: combines physics-informed neural networks (PINNs) with recently developed discontinuity capturing neural networks to solve the problems.
  • results: guarantees that both the boundary and interface conditions can be satisfied exactly, and is efficient for elliptic and parabolic interface optimal control problems.
    Abstract We show that the physics-informed neural networks (PINNs), in combination with some recently developed discontinuity capturing neural networks, can be applied to solve optimal control problems subject to partial differential equations (PDEs) with interfaces and some control constraints. The resulting algorithm is mesh-free and scalable to different PDEs, and it ensures the control constraints rigorously. Since the boundary and interface conditions, as well as the PDEs, are all treated as soft constraints by lumping them into a weighted loss function, it is necessary to learn them simultaneously and there is no guarantee that the boundary and interface conditions can be satisfied exactly. This immediately causes difficulties in tuning the weights in the corresponding loss function and training the neural networks. To tackle these difficulties and guarantee the numerical accuracy, we propose to impose the boundary and interface conditions as hard constraints in PINNs by developing a novel neural network architecture. The resulting hard-constraint PINNs approach guarantees that both the boundary and interface conditions can be satisfied exactly and they are decoupled from the learning of the PDEs. Its efficiency is promisingly validated by some elliptic and parabolic interface optimal control problems.

Generating observation guided ensembles for data assimilation with denoising diffusion probabilistic model

  • paper_url: http://arxiv.org/abs/2308.06708
  • repo_url: https://github.com/yasahi-hpc/generative-enkf
  • paper_authors: Yuuichi Asahi, Yuta Hasegawa, Naoyuki Onodera, Takashi Shimokawabe, Hayato Shiba, Yasuhiro Idomura
  • for: This paper performs ensemble data assimilation using pseudo ensembles generated by a denoising diffusion probabilistic model.
  • methods: The model is trained against noisy and sparse observation data so that it produces divergent ensembles close to observations, whose variance drives the assimilation.
  • results: Thanks to that variance, the method outperforms the well-established ensemble data assimilation method when the simulation model is imperfect.
    Abstract This paper presents an ensemble data assimilation method using the pseudo ensembles generated by denoising diffusion probabilistic model. Since the model is trained against noisy and sparse observation data, this model can produce divergent ensembles close to observations. Thanks to the variance in generated ensembles, our proposed method displays better performance than the well-established ensemble data assimilation method when the simulation model is imperfect.

Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods

  • paper_url: http://arxiv.org/abs/2308.06703
  • repo_url: None
  • paper_authors: Avery Ma, Yangchen Pan, Amir-massoud Farahmand
  • for: This work studies the robustness difference between deep networks trained with stochastic gradient descent (SGD) and with adaptive gradient methods such as Adam and RMSProp.
  • methods: It compares models trained with SGD and adaptive methods under input perturbations, and studies the learning dynamics of gradient descent (GD) and sign gradient descent (signGD) on a synthetic dataset that mirrors natural signals.
  • results: Although standard generalization is similar, SGD-trained models are far more robust to input perturbations; models trained with adaptive methods are sensitive to irrelevant frequencies in natural data, and the better robustness is explained by smaller weight norms in linear models and smaller Lipschitz constants in SGD-trained networks.
    Abstract Stochastic gradient descent (SGD) and adaptive gradient methods, such as Adam and RMSProp, have been widely used in training deep neural networks. We empirically show that while the difference between the standard generalization performance of models trained using these methods is small, those trained using SGD exhibit far greater robustness under input perturbations. Notably, our investigation demonstrates the presence of irrelevant frequencies in natural datasets, where alterations do not affect models' generalization performance. However, models trained with adaptive methods show sensitivity to these changes, suggesting that their use of irrelevant frequencies can lead to solutions sensitive to perturbations. To better understand this difference, we study the learning dynamics of gradient descent (GD) and sign gradient descent (signGD) on a synthetic dataset that mirrors natural signals. With a three-dimensional input space, the models optimized with GD and signGD have standard risks close to zero but vary in their adversarial risks. Our result shows that linear models' robustness to $\ell_2$-norm bounded changes is inversely proportional to the model parameters' weight norm: a smaller weight norm implies better robustness. In the context of deep learning, our experiments show that SGD-trained neural networks show smaller Lipschitz constants, explaining the better robustness to input perturbations than those trained with adaptive gradient methods.

Camouflaged Image Synthesis Is All You Need to Boost Camouflaged Detection

  • paper_url: http://arxiv.org/abs/2308.06701
  • repo_url: None
  • paper_authors: Haichao Zhang, Can Qin, Yu Yin, Yun Fu
  • for: Boosting the ability of deep learning models to detect camouflaged objects in natural scenes.
  • methods: A camouflage environment generator, supervised by a camouflage distribution classifier, synthesizes realistic camouflage images that are used to train existing object detection models as a plug-and-play data generation and augmentation module.
  • results: The framework outperforms the current state-of-the-art method on three datasets (COD10k, CAMO, and CHAMELEON).
    Abstract Camouflaged objects that blend into natural scenes pose significant challenges for deep-learning models to detect and synthesize. While camouflaged object detection is a crucial task in computer vision with diverse real-world applications, this research topic has been constrained by limited data availability. We propose a framework for synthesizing camouflage data to enhance the detection of camouflaged objects in natural scenes. Our approach employs a generative model to produce realistic camouflage images, which can be used to train existing object detection models. Specifically, we use a camouflage environment generator supervised by a camouflage distribution classifier to synthesize the camouflage images, which are then fed into our generator to expand the dataset. Our framework outperforms the current state-of-the-art method on three datasets (COD10k, CAMO, and CHAMELEON), demonstrating its effectiveness in improving camouflaged object detection. This approach can serve as a plug-and-play data generation and augmentation module for existing camouflaged object detection tasks and provides a novel way to introduce more diversity and distributions into current camouflage datasets.

SimMatchV2: Semi-Supervised Learning with Graph Consistency

  • paper_url: http://arxiv.org/abs/2308.06692
  • repo_url: https://github.com/mingkai-zheng/simmatchv2
  • paper_authors: Mingkai Zheng, Shan You, Lang Huang, Chen Luo, Fei Wang, Chen Qian, Chang Xu
  • for: This work proposes a new semi-supervised image classification algorithm to reduce the need for human labeling.
  • methods: SimMatchV2 formulates consistency regularizations between labeled and unlabeled data from a graph perspective, treating augmented views as nodes connected by representation-similarity edges and enforcing four consistencies: node-node, node-edge, edge-edge, and edge-node; a simple feature normalization further narrows the feature-norm gap between augmented views.
  • results: Validated on multiple semi-supervised learning benchmarks; with a ResNet-50 backbone and 300 epochs of training, SimMatchV2 achieves 71.9% and 76.2% Top-1 accuracy on ImageNet with 1% and 10% labeled examples, significantly outperforming previous methods and setting a new state of the art.
    Abstract Semi-Supervised image classification is one of the most fundamental problem in computer vision, which significantly reduces the need for human labor. In this paper, we introduce a new semi-supervised learning algorithm - SimMatchV2, which formulates various consistency regularizations between labeled and unlabeled data from the graph perspective. In SimMatchV2, we regard the augmented view of a sample as a node, which consists of a label and its corresponding representation. Different nodes are connected with the edges, which are measured by the similarity of the node representations. Inspired by the message passing and node classification in graph theory, we propose four types of consistencies, namely 1) node-node consistency, 2) node-edge consistency, 3) edge-edge consistency, and 4) edge-node consistency. We also uncover that a simple feature normalization can reduce the gaps of the feature norm between different augmented views, significantly improving the performance of SimMatchV2. Our SimMatchV2 has been validated on multiple semi-supervised learning benchmarks. Notably, with ResNet-50 as our backbone and 300 epochs of training, SimMatchV2 achieves 71.9\% and 76.2\% Top-1 Accuracy with 1\% and 10\% labeled examples on ImageNet, which significantly outperforms the previous methods and achieves state-of-the-art performance. Code and pre-trained models are available at \href{https://github.com/mingkai-zheng/SimMatchV2}{https://github.com/mingkai-zheng/SimMatchV2}.

MDB: Interactively Querying Datasets and Models

  • paper_url: http://arxiv.org/abs/2308.06686
  • repo_url: None
  • paper_authors: Aaditya Naik, Adam Stein, Yinjun Wu, Eric Wong, Mayur Naik
  • for: This paper provides a debugging framework that helps developers systematically debug errors emerging in machine learning pipelines.
  • methods: MDB integrates functional programming with relational algebra to build expressive queries over a database of datasets and model predictions; queries are reusable and easily modified, letting debuggers rapidly iterate and refine them to discover and characterize errors and model behaviors.
  • results: Across object detection, bias discovery, image classification, and data imputation tasks, MDB enables up to 10x faster and 40% shorter queries than other baselines; in a user study, developers successfully constructed complex queries describing machine learning model errors.
    Abstract As models are trained and deployed, developers need to be able to systematically debug errors that emerge in the machine learning pipeline. We present MDB, a debugging framework for interactively querying datasets and models. MDB integrates functional programming with relational algebra to build expressive queries over a database of datasets and model predictions. Queries are reusable and easily modified, enabling debuggers to rapidly iterate and refine queries to discover and characterize errors and model behaviors. We evaluate MDB on object detection, bias discovery, image classification, and data imputation tasks across self-driving videos, large language models, and medical records. Our experiments show that MDB enables up to 10x faster and 40\% shorter queries than other baselines. In a user study, we find developers can successfully construct complex queries that describe errors of machine learning models.

Separable Gaussian Neural Networks: Structure, Analysis, and Function Approximations

  • paper_url: http://arxiv.org/abs/2308.06679
  • repo_url: None
  • paper_authors: Siyuan Xing, Jianqiao Sun
  • for: This paper targets fast interpolation and classification of high-dimensional inputs, proposing a new feedforward model, the Separable Gaussian Neural Network (SGNN).
  • methods: SGNN exploits the separable property of Gaussian functions, splitting the input into multiple columns fed into parallel layers of uni-variate Gaussian functions, which reduces the neuron count from the O(N^d) of GRBFNN to O(dN) and makes the cost scale linearly with input dimension.
  • results: SGNN achieves a 100x speedup over GRBFNN with similar accuracy on tri-variate function approximation, is easier to train and tune than DNNs with ReLU and Sigmoid functions, and reaches three orders of magnitude higher accuracy than a ReLU-DNN with twice the layers and neurons per layer on functions with complex geometry.
    Abstract The Gaussian-radial-basis function neural network (GRBFNN) has been a popular choice for interpolation and classification. However, it is computationally intensive when the dimension of the input vector is high. To address this issue, we propose a new feedforward network - Separable Gaussian Neural Network (SGNN) by taking advantage of the separable property of Gaussian functions, which splits input data into multiple columns and sequentially feeds them into parallel layers formed by uni-variate Gaussian functions. This structure reduces the number of neurons from O(N^d) of GRBFNN to O(dN), which exponentially improves the computational speed of SGNN and makes it scale linearly as the input dimension increases. In addition, SGNN can preserve the dominant subspace of the Hessian matrix of GRBFNN in gradient descent training, leading to a similar level of accuracy to GRBFNN. It is experimentally demonstrated that SGNN can achieve 100 times speedup with a similar level of accuracy over GRBFNN on tri-variate function approximations. The SGNN also has better trainability and is more tuning-friendly than DNNs with ReLU and Sigmoid functions. For approximating functions with complex geometry, SGNN can lead to three orders of magnitude more accurate results than a ReLU-DNN with twice the number of layers and the number of neurons per layer.

A deep learning framework for multi-scale models based on physics-informed neural networks

  • paper_url: http://arxiv.org/abs/2308.06672
  • repo_url: None
  • paper_authors: Yong Wang, Yanzhong Yao, Jiawei Guo, Zhiming Gao
  • for: Solving multi-scale problems whose loss terms differ by orders of magnitude in physics-informed training.
  • methods: A framework built on physics-informed neural networks (PINNs) for solving partial differential equations (PDEs) that reconstructs the loss function by applying different numbers of power operations to loss terms of different magnitudes, together with a grouping regularization strategy for problems that vary significantly across subdomains.
  • results: The framework enables loss terms with different magnitudes to be optimized simultaneously, advancing the application of PINNs to multi-scale problems.
    Abstract Physics-informed neural networks (PINN) combine deep neural networks with the solution of partial differential equations (PDEs), creating a new and promising research area for numerically solving PDEs. Faced with a class of multi-scale problems that include loss terms of different orders of magnitude in the loss function, it is challenging for standard PINN methods to obtain an available prediction. In this paper, we propose a new framework for solving multi-scale problems by reconstructing the loss function. The framework is based on the standard PINN method, and it modifies the loss function of the standard PINN method by applying different numbers of power operations to the loss terms of different magnitudes, so that the individual loss terms composing the loss function have approximately the same order of magnitude among themselves. In addition, we give a grouping regularization strategy, and this strategy can deal well with the problem which varies significantly in different subdomains. The proposed method enables loss terms with different magnitudes to be optimized simultaneously, and it advances the application of PINN for multi-scale problems.

Law of Balance and Stationary Distribution of Stochastic Gradient Descent

  • paper_url: http://arxiv.org/abs/2308.06671
  • repo_url: None
  • paper_authors: Liu Ziyin, Hongchao Li, Masahito Ueda
  • for: This paper aims to understand how the stochastic gradient descent (SGD) algorithm navigates the highly nonlinear and degenerate loss landscape of a neural network.
  • methods: The paper uses theoretical analysis to prove that the minibatch noise of SGD regularizes the solution towards a balanced solution whenever the loss function contains a rescaling symmetry.
  • results: The paper derives the stationary distribution of stochastic gradient flow for a diagonal linear network with arbitrary depth and width, and shows that the stationary distribution exhibits complicated nonlinear phenomena such as phase transitions, broken ergodicity, and fluctuation inversion, which are unique to deep networks.
    Abstract The stochastic gradient descent (SGD) algorithm is the algorithm we use to train neural networks. However, it remains poorly understood how the SGD navigates the highly nonlinear and degenerate loss landscape of a neural network. In this work, we prove that the minibatch noise of SGD regularizes the solution towards a balanced solution whenever the loss function contains a rescaling symmetry. Because the difference between a simple diffusion process and SGD dynamics is the most significant when symmetries are present, our theory implies that the loss function symmetries constitute an essential probe of how SGD works. We then apply this result to derive the stationary distribution of stochastic gradient flow for a diagonal linear network with arbitrary depth and width. The stationary distribution exhibits complicated nonlinear phenomena such as phase transitions, broken ergodicity, and fluctuation inversion. These phenomena are shown to exist uniquely in deep networks, implying a fundamental difference between deep and shallow models.

Foundation Models in Smart Agriculture: Basics, Opportunities, and Challenges

  • paper_url: http://arxiv.org/abs/2308.06668
  • repo_url: https://github.com/jiajiali04/agriculture-foundation-models
  • paper_authors: Jiajia Li, Mingle Xu, Lirong Xiang, Dong Chen, Weichao Zhuang, Xunyuan Yin, Zhaojian Li
  • for: This study explores the potential of foundation models (FMs) in smart agriculture, where conventional ML/DL models rely on large, costly labeled datasets and lack generalizability.
  • methods: It first reviews recent FMs in the general computer science domain, categorizing them into four classes: language FMs, vision FMs, multimodal FMs, and reinforcement learning FMs; it then outlines the process of developing agricultural FMs (AFMs) and discusses their potential applications in smart agriculture.
  • results: AFMs are presented as a promising paradigm that can significantly mitigate reliance on extensive labeled datasets and enhance the efficiency, effectiveness, and generalization of agricultural AI systems; unique challenges in model training, validation, and deployment are also discussed.
    Abstract The past decade has witnessed the rapid development of ML and DL methodologies in agricultural systems, showcased by great successes in variety of agricultural applications. However, these conventional ML/DL models have certain limitations: They heavily rely on large, costly-to-acquire labeled datasets for training, require specialized expertise for development and maintenance, and are mostly tailored for specific tasks, thus lacking generalizability. Recently, foundation models have demonstrated remarkable successes in language and vision tasks across various domains. These models are trained on a vast amount of data from multiple domains and modalities. Once trained, they can accomplish versatile tasks with just minor fine-tuning and minimal task-specific labeled data. Despite their proven effectiveness and huge potential, there has been little exploration of applying FMs to agriculture fields. Therefore, this study aims to explore the potential of FMs in the field of smart agriculture. In particular, we present conceptual tools and technical background to facilitate the understanding of the problem space and uncover new research directions in this field. To this end, we first review recent FMs in the general computer science domain and categorize them into four categories: language FMs, vision FMs, multimodal FMs, and reinforcement learning FMs. Subsequently, we outline the process of developing agriculture FMs and discuss their potential applications in smart agriculture. We also discuss the unique challenges associated with developing AFMs, including model training, validation, and deployment. Through this study, we contribute to the advancement of AI in agriculture by introducing AFMs as a promising paradigm that can significantly mitigate the reliance on extensive labeled datasets and enhance the efficiency, effectiveness, and generalization of agricultural AI systems.

ALGAN: Time Series Anomaly Detection with Adjusted-LSTM GAN

  • paper_url: http://arxiv.org/abs/2308.06663
  • repo_url: None
  • paper_authors: Md Abul Bashar, Richi Nayak
  • for: Anomaly detection in time series data, specifically in univariate and multivariate datasets in an unsupervised setting.
  • methods: Proposes a new GAN model called Adjusted-LSTM GAN (ALGAN), which adjusts the output of an LSTM network for improved anomaly detection accuracy.
  • results: Outperforms traditional, neural network-based, and other GAN-based methods for anomaly detection in time series data, as demonstrated through experiments on 46 real-world univariate time series datasets and a large multivariate dataset.
    Abstract Anomaly detection in time series data, to identify points that deviate from normal behaviour, is a common problem in various domains such as manufacturing, medical imaging, and cybersecurity. Recently, Generative Adversarial Networks (GANs) are shown to be effective in detecting anomalies in time series data. The neural network architecture of GANs (i.e. Generator and Discriminator) can significantly improve anomaly detection accuracy. In this paper, we propose a new GAN model, named Adjusted-LSTM GAN (ALGAN), which adjusts the output of an LSTM network for improved anomaly detection in both univariate and multivariate time series data in an unsupervised setting. We evaluate the performance of ALGAN on 46 real-world univariate time series datasets and a large multivariate dataset that spans multiple domains. Our experiments demonstrate that ALGAN outperforms traditional, neural network-based, and other GAN-based methods for anomaly detection in time series data.
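
For orientation, the sketch below shows the generic LSTM-GAN anomaly-scoring recipe that ALGAN builds on; the paper's specific adjustment of the LSTM output is not reproduced here, and the architecture sizes, the scoring mixture, and the weight `lam` are all illustrative assumptions. Windows are scored by generator reconstruction error plus how "fake" the discriminator finds them.

```python
# Hedged PyTorch sketch of LSTM-GAN anomaly scoring (not the ALGAN code).
import torch
import torch.nn as nn

class LSTMGenerator(nn.Module):
    def __init__(self, dim=1, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, dim)
    def forward(self, z):                  # z: (batch, seq_len, dim)
        h, _ = self.lstm(z)
        return self.head(h)                # reconstructed window

class LSTMDiscriminator(nn.Module):
    def __init__(self, dim=1, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)
    def forward(self, x):
        h, _ = self.lstm(x)
        return torch.sigmoid(self.head(h[:, -1]))   # realness of the window

@torch.no_grad()
def anomaly_score(G, D, window, lam=0.5):
    """Higher = more anomalous: reconstruction error + discriminator doubt."""
    residual = (window - G(window)).abs().mean(dim=(1, 2))
    return lam * residual + (1 - lam) * (1 - D(window).squeeze(1))

G, D = LSTMGenerator(), LSTMDiscriminator()
scores = anomaly_score(G, D, torch.randn(4, 64, 1))  # 4 windows of length 64
print(scores.shape)                                  # torch.Size([4])
```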

Benign Shortcut for Debiasing: Fair Visual Recognition via Intervention with Shortcut Features

  • paper_url: http://arxiv.org/abs/2308.08482
  • repo_url: https://github.com/yiiizhang/shortcutDebiasing
  • paper_authors: Yi Zhang, Jitao Sang, Junyang Wang, Dongmei Jiang, Yaowei Wang
  • for: Mitigating bias risks in machine learning models, especially in societal applications such as hiring, banking, and criminal justice.
  • methods: Proposes a method called Shortcut Debiasing, which first transfers the learning of bias attributes from bias features to controllable shortcut features, and then uses causal intervention to eliminate the shortcut features during inference.
  • results: Applied to several benchmark datasets, the method achieves significant improvements over existing debiasing methods.
    Abstract Machine learning models often learn to make predictions that rely on sensitive social attributes like gender and race, which poses significant fairness risks, especially in societal applications, such as hiring, banking, and criminal justice. Existing work tackles this issue by minimizing the employed information about social attributes in models for debiasing. However, the high correlation between target task and these social attributes makes learning on the target task incompatible with debiasing. Given that model bias arises due to the learning of bias features (i.e., gender) that help target task optimization, we explore the following research question: Can we leverage shortcut features to replace the role of bias feature in target task optimization for debiasing? To this end, we propose Shortcut Debiasing, to first transfer the target task's learning of bias attributes from bias features to shortcut features, and then employ causal intervention to eliminate shortcut features during inference. The key idea of Shortcut Debiasing is to design controllable shortcut features to on one hand replace bias features in contributing to the target task during the training stage, and on the other hand be easily removed by intervention during the inference stage. This guarantees the learning of the target task does not hinder the elimination of bias features. We apply Shortcut Debiasing to several benchmark datasets, and achieve significant improvements over the state-of-the-art debiasing methods in both accuracy and fairness.
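
The core mechanic can be sketched in a few lines; everything below (the two-group setup, the fixed random codes, the neutral intervention value) is my schematic reading of the abstract, not the authors' implementation.

```python
# Schematic sketch of the Shortcut Debiasing idea: inject a controllable
# shortcut code during training, replace it with a neutral value at inference.
import torch
import torch.nn as nn

torch.manual_seed(0)
SHORTCUT_DIM = 8
shortcut_codes = torch.randn(2, SHORTCUT_DIM)      # one fixed code per bias group
neutral_code = shortcut_codes.mean(dim=0)          # intervention value

backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
head = nn.Linear(64 + SHORTCUT_DIM, 2)             # target-task classifier

def forward(x, bias_label=None):
    feat = backbone(x)
    if bias_label is not None:                     # training: inject shortcut
        code = shortcut_codes[bias_label]
    else:                                          # inference: intervene
        code = neutral_code.expand(x.size(0), -1)
    return head(torch.cat([feat, code], dim=1))

x = torch.randn(4, 32)
train_logits = forward(x, bias_label=torch.tensor([0, 1, 0, 1]))
test_logits = forward(x)                           # shortcut removed
print(train_logits.shape, test_logits.shape)
```

During training the head can satisfy the bias-correlated part of the task through the injected code; swapping in the neutral code at inference is the causal intervention that removes it.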

Polar Collision Grids: Effective Interaction Modelling for Pedestrian Trajectory Prediction in Shared Space Using Collision Checks

  • paper_url: http://arxiv.org/abs/2308.06654
  • repo_url: None
  • paper_authors: Mahsa Golchoubian, Moojan Ghafurian, Kerstin Dautenhahn, Nasser Lashgarian Azad
  • for: Predicting pedestrian trajectories is a crucial capability for the safe navigation of autonomous vehicles, especially in spaces shared with pedestrians. Pedestrian motion in shared spaces is influenced by both vehicles and other pedestrians, so better modelling of pedestrian-vehicle and pedestrian-pedestrian interactions can improve the accuracy of trajectory prediction models.
  • methods: Proposes a heuristic-based process for selecting interacting agents based on collision-risk calculation, focusing on the time-to-collision and approach direction of potentially colliding agents, and encodes the interaction effect with a novel polar collision grid map.
  • results: Predicted trajectories are closer to the ground truth than those of the baseline method on the HBS dataset.
    Abstract Predicting pedestrians' trajectories is a crucial capability for autonomous vehicles' safe navigation, especially in spaces shared with pedestrians. Pedestrian motion in shared spaces is influenced by both the presence of vehicles and other pedestrians. Therefore, effectively modelling both pedestrian-pedestrian and pedestrian-vehicle interactions can increase the accuracy of the pedestrian trajectory prediction models. Despite the huge literature on ways to encode the effect of interacting agents on a pedestrian's predicted trajectory using deep-learning models, limited effort has been put into the effective selection of interacting agents. In the majority of cases, the interaction features used are mainly based on relative distances while paying less attention to the effect of the velocity and approaching direction in the interaction formulation. In this paper, we propose a heuristic-based process of selecting the interacting agents based on collision risk calculation. Focusing on interactions of potentially colliding agents with a target pedestrian, we propose the use of time-to-collision and the approach direction angle of two agents for encoding the interaction effect. This is done by introducing a novel polar collision grid map. Our results have shown predicted trajectories closer to the ground truth compared to existing methods (used as a baseline) on the HBS dataset.
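
The two quantities the polar grid encodes are straightforward to compute. The standalone sketch below (illustrative, with an assumed combined collision radius) derives time-to-collision under a constant-velocity model and the approach angle in the target pedestrian's heading frame; binning these per neighbour would fill one polar "collision grid".

```python
# Illustrative TTC and approach-angle computation (not the authors' code).
import numpy as np

def time_to_collision(p_rel, v_rel, radius=0.6):
    """Smallest t >= 0 with ||p_rel + t*v_rel|| <= radius, else inf."""
    a = np.dot(v_rel, v_rel)
    b = 2.0 * np.dot(p_rel, v_rel)
    c = np.dot(p_rel, p_rel) - radius**2
    if a == 0.0:
        return 0.0 if c <= 0 else np.inf
    disc = b * b - 4 * a * c
    if disc < 0:
        return np.inf                       # paths never come that close
    t = (-b - np.sqrt(disc)) / (2 * a)
    return t if t >= 0 else np.inf

def approach_angle(p_rel, heading):
    """Bearing of the neighbour in the target's heading frame, in radians."""
    ang = np.arctan2(p_rel[1], p_rel[0]) - heading
    return (ang + np.pi) % (2 * np.pi) - np.pi

# Target pedestrian at the origin heading +x; a vehicle ahead-left, closing in.
p_rel = np.array([4.0, 1.0])                # vehicle position relative to pedestrian
v_rel = np.array([-2.0, -0.5])              # relative velocity
print(f"TTC = {time_to_collision(p_rel, v_rel):.2f} s, "
      f"approach angle = {np.degrees(approach_angle(p_rel, 0.0)):.1f} deg")
```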

Accelerating Diffusion-based Combinatorial Optimization Solvers by Progressive Distillation

  • paper_url: http://arxiv.org/abs/2308.06644
  • repo_url: https://github.com/jwrh/Accelerating-Diffusion-based-Combinatorial-Optimization-Solvers-by-Progressive-Distillation
  • paper_authors: Junwei Huang, Zhiqing Sun, Yiming Yang
  • for: Speeding up solvers for NP-complete combinatorial optimization problems.
  • methods: Uses progressive distillation to accelerate inference by taking fewer denoising steps, e.g., forecasting two steps ahead within a single step.
  • results: Experiments show the progressively distilled model performs inference 16 times faster with only 0.019% performance degradation on the TSP-50 dataset.
    Abstract Graph-based diffusion models have shown promising results in terms of generating high-quality solutions to NP-complete (NPC) combinatorial optimization (CO) problems. However, those models are often inefficient in inference, due to the iterative evaluation nature of the denoising diffusion process. This paper proposes to use progressive distillation to speed up the inference by taking fewer steps (e.g., forecasting two steps ahead within a single step) during the denoising process. Our experimental results show that the progressively distilled model can perform inference 16 times faster with only 0.019% degradation in performance on the TSP-50 dataset.
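
One round of progressive distillation can be skeletonized as follows; the tiny denoiser, its `net(x, t)` interface, and the toy data are placeholders rather than the paper's graph-diffusion solver. The student is trained so that a single step matches two consecutive steps of the frozen teacher, halving the step count per round.

```python
# Hedged skeleton of one progressive-distillation round.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.ReLU(), nn.Linear(64, dim))
    def forward(self, x, t):                     # t: (batch, 1) step index
        return self.net(torch.cat([x, t], dim=1))

teacher = TinyDenoiser()
student = TinyDenoiser()
student.load_state_dict(teacher.state_dict())   # warm-start from the teacher
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for it in range(200):
    x = torch.randn(32, 16)
    t = torch.randint(2, 10, (32, 1)).float()   # a teacher timestep
    with torch.no_grad():                       # two teacher steps...
        target = teacher(teacher(x, t), t - 1)
    pred = student(x, t)                        # ...matched by one student step
    loss = (pred - target).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
# Repeating the round (the student becomes the next teacher) is what compounds
# into the reported 16x inference speed-up.
```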

Advances in Self-Supervised Learning for Synthetic Aperture Sonar Data Processing, Classification, and Pattern Recognition

  • paper_url: http://arxiv.org/abs/2308.11633
  • repo_url: None
  • paper_authors: Brandon Sheffield, Frank E. Bobe III, Bradley Marchand, Matthew S. Emigh
  • for: Improving synthetic aperture sonar (SAS) data processing, classification, and pattern recognition for underwater exploration through self-supervised learning (SSL).
  • methods: Proposes MoCo-SAS, an SSL-based pipeline for SAS data covering data preprocessing, feature extraction, model training, and testing.
  • results: Experiments show MoCo-SAS significantly outperforms traditional supervised learning in terms of F1-score, demonstrating the promise of SSL for SAS data processing.
    Abstract Synthetic Aperture Sonar (SAS) imaging has become a crucial technology for underwater exploration because of its unique ability to maintain resolution at increasing ranges, a characteristic absent in conventional sonar techniques. However, the effective application of deep learning to SAS data processing is often limited due to the scarcity of labeled data. To address this challenge, this paper proposes MoCo-SAS that leverages self-supervised learning (SSL) for SAS data processing, classification, and pattern recognition. The experimental results demonstrate that MoCo-SAS significantly outperforms traditional supervised learning methods, as evidenced by significant improvements observed in terms of the F1-score. These findings highlight the potential of SSL in advancing the state-of-the-art in SAS data processing, offering promising avenues for enhanced underwater object detection and classification.

ADRMX: Additive Disentanglement of Domain Features with Remix Loss

  • paper_url: http://arxiv.org/abs/2308.06624
  • repo_url: https://github.com/berkerdemirel/ADRMX
  • paper_authors: Berker Demirel, Erchan Aptoula, Huseyin Ozkan
  • for: Building models that generalize to new unseen domains, mitigating the effect of distribution shifts across domains.
  • methods: Introduces an architecture named Additive Disentanglement of Domain Features with Remix Loss, together with a new data augmentation technique that mixes samples from different domains in the latent space.
  • results: Extensive experiments on DomainBed show that ADRMX achieves state-of-the-art performance, improving on prior work.
    Abstract The common assumption that train and test sets follow similar distributions is often violated in deployment settings. Given multiple source domains, domain generalization aims to create robust models capable of generalizing to new unseen domains. To this end, most of existing studies focus on extracting domain invariant features across the available source domains in order to mitigate the effects of inter-domain distributional changes. However, this approach may limit the model's generalization capacity by relying solely on finding common features among the source domains. It overlooks the potential presence of domain-specific characteristics that could be prevalent in a subset of domains, potentially containing valuable information. In this work, a novel architecture named Additive Disentanglement of Domain Features with Remix Loss (ADRMX) is presented, which addresses this limitation by incorporating domain variant features together with the domain invariant ones using an original additive disentanglement strategy. Moreover, a new data augmentation technique is introduced to further support the generalization capacity of ADRMX, where samples from different domains are mixed within the latent space. Through extensive experiments conducted on DomainBed under fair conditions, ADRMX is shown to achieve state-of-the-art performance. Code will be made available at GitHub after the revision process.

Can Unstructured Pruning Reduce the Depth in Deep Neural Networks?

  • paper_url: http://arxiv.org/abs/2308.06619
  • repo_url: None
  • paper_authors: Zhu Liao, Victor Quétu, Van-Tam Nguyen, Enzo Tartaglione
  • for: Reducing the size of deep neural networks while maintaining performance.
  • methods: The Entropy Guided Pruning (EGP) algorithm prioritizes pruning connections in layers with low entropy, ultimately removing those layers entirely.
  • results: Effectively compresses deep neural networks while maintaining competitive performance, and offers a new perspective on the relationship between unstructured pruning, entropy, and deep learning performance.
    Abstract Pruning is a widely used technique for reducing the size of deep neural networks while maintaining their performance. However, such a technique, despite being able to massively compress deep models, is hardly able to remove entire layers from a model (even when structured): is this an addressable task? In this study, we introduce EGP, an innovative Entropy Guided Pruning algorithm aimed at reducing the size of deep neural networks while preserving their performance. The key focus of EGP is to prioritize pruning connections in layers with low entropy, ultimately leading to their complete removal. Through extensive experiments conducted on popular models like ResNet-18 and Swin-T, our findings demonstrate that EGP effectively compresses deep neural networks while maintaining competitive performance levels. Our results not only shed light on the underlying mechanism behind the advantages of unstructured pruning, but also pave the way for further investigations into the intricate relationship between entropy, pruning techniques, and deep learning performance. The EGP algorithm and its insights hold great promise for advancing the field of network compression and optimization. The source code for EGP is released open-source.
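
As one concrete reading of the entropy heuristic, the sketch below scores each layer by the Shannon entropy of its weight-magnitude histogram and prunes the lowest-entropy layer first; the exact entropy measure and pruning schedule of EGP are not specified here, so treat both as assumptions.

```python
# Illustrative entropy-guided layer selection and unstructured pruning.
import torch
import torch.nn as nn

def weight_entropy(w, bins=64):
    """Shannon entropy of the layer's |weight| histogram."""
    hist = torch.histc(w.detach().abs().flatten(), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * p.log()).sum().item()

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 10))
layers = [m for m in model if isinstance(m, nn.Linear)]
scores = [(weight_entropy(m.weight), i) for i, m in enumerate(layers)]
_, target = min(scores)                          # lowest-entropy layer first
print("entropies:", [f"{s:.2f}" for s, _ in scores], "-> prune layer", target)

with torch.no_grad():                            # unstructured prune: zero the
    w = layers[target].weight                    # smallest-magnitude weights
    thresh = w.abs().flatten().kthvalue(int(0.9 * w.numel())).values
    w.mul_((w.abs() > thresh).float())           # keep only the top 10%
```

Iterating this until a layer's surviving weights vanish is what would let the layer be removed outright, which is the depth-reduction question the title poses.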

On the Interplay of Convolutional Padding and Adversarial Robustness

  • paper_url: http://arxiv.org/abs/2308.06612
  • repo_url: None
  • paper_authors: Paul Gavrikov, Janis Keuper
  • for: Studies the interplay between padding and adversarial attacks, and how different padding modes affect adversarial robustness.
  • methods: Analyzes Convolutional Neural Networks (CNNs) and compares different padding modes.
  • results: Finds that adversarial attacks often produce perturbation anomalies at the image boundaries, precisely where padding is applied, and that different padding modes affect adversarial robustness differently.
    Abstract It is common practice to apply padding prior to convolution operations to preserve the resolution of feature-maps in Convolutional Neural Networks (CNN). While many alternatives exist, this is often achieved by adding a border of zeros around the inputs. In this work, we show that adversarial attacks often result in perturbation anomalies at the image boundaries, which are the areas where padding is used. Consequently, we aim to provide an analysis of the interplay between padding and adversarial attacks and seek an answer to the question of how different padding modes (or their absence) affect adversarial robustness in various scenarios.
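
A small, self-contained probe in the spirit of the paper's question can be built with FGSM: compare how much adversarial perturbation mass falls on the image border under zero versus reflect padding. The untrained toy CNN and random data below are stand-ins (the paper studies trained classifiers); the measurement procedure is the point.

```python
# Illustrative probe: share of FGSM perturbation landing on the k-pixel border.
import torch
import torch.nn as nn

def make_cnn(padding_mode):
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1, padding_mode=padding_mode), nn.ReLU(),
        nn.Conv2d(16, 16, 3, padding=1, padding_mode=padding_mode), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))

def border_share(model, eps=8 / 255, k=2):
    torch.manual_seed(0)
    x = torch.rand(8, 3, 32, 32, requires_grad=True)
    y = torch.randint(0, 10, (8,))
    nn.functional.cross_entropy(model(x), y).backward()
    delta = eps * x.grad.sign()                  # FGSM perturbation
    mask = torch.zeros(32, 32)
    mask[:k, :] = 1; mask[-k:, :] = 1            # k-pixel image border
    mask[:, :k] = 1; mask[:, -k:] = 1
    border = (delta.abs() * mask).sum() / delta.abs().sum()
    return border.item(), mask.mean().item()     # vs. border area fraction

for mode in ["zeros", "reflect"]:
    share, area = border_share(make_cnn(mode))
    print(f"{mode:8s}: border grad share = {share:.3f} (border area = {area:.3f})")
```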

LadleNet: Translating Thermal Infrared Images to Visible Light Images Using A Scalable Two-stage U-Net

  • paper_url: http://arxiv.org/abs/2308.06603
  • repo_url: https://github.com/ach-1914/ladlenet
  • paper_authors: Tonghui Zou
  • for: Proposes a U-Net-based algorithm for translating thermal infrared (TIR) images into visible light (VI) images, serving application needs across various domains.
  • methods: Uses a two-stage U-Net concatenation structure with skip connections and refined feature aggregation to improve model performance. The algorithm comprises a 'Handle' module, which constructs an abstract semantic space, and a 'Bowl' module, which decodes that semantic space into mapped VI images.
  • results: Compared with existing methods, achieves the best performance on the KAIST dataset in terms of image clarity and perceptual quality.
    Abstract The translation of thermal infrared (TIR) images to visible light (VI) images presents a challenging task with potential applications spanning various domains such as TIR-VI image registration and fusion. Leveraging supplementary information derived from TIR image conversions can significantly enhance model performance and generalization across these applications. However, prevailing issues within this field include suboptimal image fidelity and limited model scalability. In this paper, we introduce an algorithm, LadleNet, based on the U-Net architecture. LadleNet employs a two-stage U-Net concatenation structure, augmented with skip connections and refined feature aggregation techniques, resulting in a substantial enhancement in model performance. Comprising 'Handle' and 'Bowl' modules, LadleNet's Handle module facilitates the construction of an abstract semantic space, while the Bowl module decodes this semantic space to yield mapped VI images. The Handle module exhibits extensibility by allowing the substitution of its network architecture with semantic segmentation networks, thereby establishing more abstract semantic spaces to bolster model performance. Consequently, we propose LadleNet+, which replaces LadleNet's Handle module with the pre-trained DeepLabv3+ network, thereby endowing the model with enhanced semantic space construction capabilities. The proposed method is evaluated and tested on the KAIST dataset, accompanied by quantitative and qualitative analyses. Compared to existing methodologies, our approach achieves state-of-the-art performance in terms of image clarity and perceptual quality. The source code will be made available at https://github.com/Ach-1914/LadleNet/tree/main/.

eess.IV - 2023-08-13

Shape-guided Conditional Latent Diffusion Models for Synthesising Brain Vasculature

  • paper_url: http://arxiv.org/abs/2308.06781
  • repo_url: None
  • paper_authors: Yash Deo, Haoran Dou, Nishant Ravikumar, Alejandro F. Frangi, Toni Lassila
  • for: Understanding the diverse anatomical variations and configurations of the Circle of Willis (CoW) in the cerebral vasculature, to advance research on cerebrovascular disease and refine clinical interventions.
  • methods: Uses a conditional latent diffusion model with shape and anatomical guidance to generate realistic 3D CoW segmentations, including different phenotypical variations.
  • results: Compared with conditional variants of 3D GANs and 3D VAEs, the model better preserves vessel continuity and generates more realistic CoW variants, with an FID score 53% better than the best-performing GAN-based model.
    Abstract The Circle of Willis (CoW) is the part of cerebral vasculature responsible for delivering blood to the brain. Understanding the diverse anatomical variations and configurations of the CoW is paramount to advance research on cerebrovascular diseases and refine clinical interventions. However, comprehensive investigation of less prevalent CoW variations remains challenging because of the dominance of a few commonly occurring configurations. We propose a novel generative approach utilising a conditional latent diffusion model with shape and anatomical guidance to generate realistic 3D CoW segmentations, including different phenotypical variations. Our conditional latent diffusion model incorporates shape guidance to better preserve vessel continuity and demonstrates superior performance when compared to alternative generative models, including conditional variants of 3D GAN and 3D VAE. We observed that our model generated CoW variants that are more realistic and demonstrate higher visual fidelity than competing approaches with an FID score 53\% better than the best-performing GAN-based model.

Unsupervised Image Denoising in Real-World Scenarios via Self-Collaboration Parallel Generative Adversarial Branches

  • paper_url: http://arxiv.org/abs/2308.06776
  • repo_url: https://github.com/linxin0/scpgabnet
  • paper_authors: Xin Lin, Chao Ren, Xiao Liu, Jie Huang, Yinjie Lei
  • for: Improving unsupervised image denoising without requiring large paired datasets.
  • methods: A GAN-based unsupervised approach that iteratively replaces the previous, less powerful denoiser with the current, more powerful one, generating better synthetic clean-noisy image pairs.
  • results: Outperforms state-of-the-art unsupervised methods.
    Abstract Deep learning methods have shown remarkable performance in image denoising, particularly when trained on large-scale paired datasets. However, acquiring such paired datasets for real-world scenarios poses a significant challenge. Although unsupervised approaches based on generative adversarial networks offer a promising solution for denoising without paired datasets, they struggle to surpass the performance limitations of conventional GAN-based unsupervised frameworks without significantly modifying existing structures or increasing the computational complexity of denoisers. To address this problem, we propose a self-collaboration (SC) strategy for multiple denoisers. This strategy can achieve significant performance improvement without increasing the inference complexity of the GAN-based denoising framework. Its basic idea is to iteratively replace the previous less powerful denoiser in the filter-guided noise extraction module with the current powerful denoiser. This process generates better synthetic clean-noisy image pairs, leading to a more powerful denoiser for the next iteration. This baseline ensures the stability and effectiveness of the training network. The experimental results demonstrate the superiority of our method over state-of-the-art unsupervised methods.

Tissue Segmentation of Thick-Slice Fetal Brain MR Scans with Guidance from High-Quality Isotropic Volumes

  • paper_url: http://arxiv.org/abs/2308.06762
  • repo_url: None
  • paper_authors: Shijie Huang, Xukun Zhang, Zhiming Cui, He Zhang, Geng Chen, Dinggang Shen
  • for: Improving tissue segmentation accuracy for thick-slice fetal brain MR scans, to support the reconstruction of isotropic brain MR volumes and the assessment of fetal brain development.
  • methods: Uses domain adaptation, with high-quality isotropic fetal brain MR volumes (and their annotations) serving as guidance for segmenting thick-slice scans.
  • results: Experiments show the method segments tissue in fetal brain MR scans with high accuracy and outperforms existing approaches.
    Abstract Accurate tissue segmentation of thick-slice fetal brain magnetic resonance (MR) scans is crucial for both reconstruction of isotropic brain MR volumes and the quantification of fetal brain development. However, this task is challenging due to the use of thick-slice scans in clinically-acquired fetal brain data. To address this issue, we propose to leverage high-quality isotropic fetal brain MR volumes (and also their corresponding annotations) as guidance for segmentation of thick-slice scans. Due to existence of significant domain gap between high-quality isotropic volume (i.e., source data) and thick-slice scans (i.e., target data), we employ a domain adaptation technique to achieve the associated knowledge transfer (from high-quality volumes to thick-slice scans). Specifically, we first register the available high-quality isotropic fetal brain MR volumes across different gestational weeks to construct longitudinally-complete source data. To capture domain-invariant information, we then perform Fourier decomposition to extract image content and style codes. Finally, we propose a novel Cycle-Consistent Domain Adaptation Network (C2DA-Net) to efficiently transfer the knowledge learned from high-quality isotropic volumes for accurate tissue segmentation of thick-slice scans. Our C2DA-Net can fully utilize a small set of annotated isotropic volumes to guide tissue segmentation on unannotated thick-slice scans. Extensive experiments on a large-scale dataset of 372 clinically acquired thick-slice MR scans demonstrate that our C2DA-Net achieves much better performance than cutting-edge methods quantitatively and qualitatively.

FastLLVE: Real-Time Low-Light Video Enhancement with Intensity-Aware Lookup Table

  • paper_url: http://arxiv.org/abs/2308.06749
  • repo_url: https://github.com/wenhao-li-777/fastllve
  • paper_authors: Wenhao Li, Guangyang Wu, Wenyi Wang, Peiran Ren, Xiaohong Liu
  • for: Enhancing the quality of low-light video.
  • methods: Uses the Look-Up-Table (LUT) technique to maintain inter-frame brightness consistency, and designs a learnable Intensity-Aware LUT (IA-LUT) module for adaptive enhancement.
  • results: Experiments show the method achieves state-of-the-art performance on benchmark datasets, with additional advantages in frame rate and computational complexity. Code available at https://github.com/Wenhao-Li-777/FastLLVE.
    Abstract Low-Light Video Enhancement (LLVE) has received considerable attention in recent years. One of the critical requirements of LLVE is inter-frame brightness consistency, which is essential for maintaining the temporal coherence of the enhanced video. However, most existing single-image-based methods fail to address this issue, resulting in flickering effect that degrades the overall quality after enhancement. Moreover, 3D Convolution Neural Network (CNN)-based methods, which are designed for video to maintain inter-frame consistency, are computationally expensive, making them impractical for real-time applications. To address these issues, we propose an efficient pipeline named FastLLVE that leverages the Look-Up-Table (LUT) technique to maintain inter-frame brightness consistency effectively. Specifically, we design a learnable Intensity-Aware LUT (IA-LUT) module for adaptive enhancement, which addresses the low-dynamic problem in low-light scenarios. This enables FastLLVE to perform low-latency and low-complexity enhancement operations while maintaining high-quality results. Experimental results on benchmark datasets demonstrate that our method achieves the State-Of-The-Art (SOTA) performance in terms of both image quality and inter-frame brightness consistency. More importantly, our FastLLVE can process 1,080p videos at $\mathit{50+}$ Frames Per Second (FPS), which is $\mathit{2 \times}$ faster than SOTA CNN-based methods in inference time, making it a promising solution for real-time applications. The code is available at https://github.com/Wenhao-Li-777/FastLLVE.
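
The reason a LUT makes video enhancement fast is that, once trained, enhancement is a single trilinear lookup per pixel, independent of any network depth. The sketch below reduces the intensity-aware weighting of IA-LUT to a plain single-LUT lookup (an intentional simplification) and checks that an identity LUT leaves frames unchanged.

```python
# Minimal 3D-LUT application via trilinear sampling (illustrative sketch).
import torch
import torch.nn.functional as F

def apply_lut(frames, lut):
    """frames: (B,3,H,W) in [0,1]; lut: (3,S,S,S) mapping RGB -> enhanced RGB."""
    B, _, H, W = frames.shape
    # grid_sample wants coords in [-1,1], ordered (x,y,z); here x=R, y=G, z=B
    grid = (frames.permute(0, 2, 3, 1) * 2 - 1).view(B, 1, H, W, 3)
    lut = lut.unsqueeze(0).expand(B, -1, -1, -1, -1)      # (B,3,S,S,S)
    out = F.grid_sample(lut, grid, mode="bilinear",        # trilinear in 5-D
                        padding_mode="border", align_corners=True)
    return out.squeeze(2)                                  # (B,3,H,W)

S = 17                                                     # 17^3 lattice
r = torch.linspace(0, 1, S)
ax_b, ax_g, ax_r = torch.meshgrid(r, r, r, indexing="ij")
lut = torch.stack([ax_r, ax_g, ax_b])                      # identity LUT
frames = torch.rand(2, 3, 64, 64)
print(torch.allclose(apply_lut(frames, lut), frames, atol=1e-4))  # True
```

In a learnable setting, the LUT entries (and, in IA-LUT, the intensity-dependent mixing) become the trainable parameters, which is what keeps inference at 50+ FPS.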

Self-supervised Noise2noise Method Utilizing Corrupted Images with a Modular Network for LDCT Denoising

  • paper_url: http://arxiv.org/abs/2308.06746
  • repo_url: https://github.com/xyuan01/self-supervised-noise2noise-for-ldct
  • paper_authors: Yuting Zhu, Qiang He, Yudong Yao, Yueyang Teng
  • for: Proposes a self-supervised denoising method for low-dose CT (LDCT) images that needs no normal-dose CT (NDCT) data for training.
  • methods: Combines a self-supervised noise2noise model with the noisy-as-clean strategy: a second, similar type of noise is added to the LDCT images multiple times, and training uses only these corrupted images.
  • results: Experiments show the proposed method denoises LDCT images more effectively than previous state-of-the-art deep learning methods.
    Abstract Deep learning is a very promising technique for low-dose computed tomography (LDCT) image denoising. However, traditional deep learning methods require paired noisy and clean datasets, which are often difficult to obtain. This paper proposes a new method for performing LDCT image denoising with only LDCT data, which means that normal-dose CT (NDCT) is not needed. We adopt a combination including the self-supervised noise2noise model and the noisy-as-clean strategy. First, we add a second yet similar type of noise to LDCT images multiple times. Note that we use LDCT images based on the noisy-as-clean strategy for corruption instead of NDCT images. Then, the noise2noise model is executed with only the secondary corrupted images for training. We select a modular U-Net structure from several candidates with shared parameters to perform the task, which increases the receptive field without increasing the parameter size. The experimental results obtained on the Mayo LDCT dataset show the effectiveness of the proposed method compared with that of state-of-the-art deep learning methods. The developed code is available at https://github.com/XYuan01/Self-supervised-Noise2Noise-for-LDCT.
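
The training recipe itself is compact. The sketch below uses toy data in place of CT slices and an assumed Gaussian noise model: the LDCT image is treated as if it were clean ("noisy-as-clean"), corrupted again with a second noise draw, and the network learns to map the doubly-corrupted image back to the LDCT image, so NDCT is never needed.

```python
# Compact noisy-as-clean / noise2noise training loop (toy data, not CT).
import torch
import torch.nn as nn

denoiser = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 1, 3, padding=1))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)
sigma = 0.1                                    # assumed LDCT-like noise level

for step in range(100):
    ldct = torch.rand(8, 1, 64, 64)            # stands in for real LDCT slices
    corrupted = ldct + sigma * torch.randn_like(ldct)    # second noise draw
    loss = (denoiser(corrupted) - ldct).pow(2).mean()    # noisy target
    opt.zero_grad(); loss.backward(); opt.step()
# At test time the trained denoiser is applied directly to the LDCT images.
```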

Polyp-SAM++: Can A Text Guided SAM Perform Better for Polyp Segmentation?

  • paper_url: http://arxiv.org/abs/2308.06623
  • repo_url: https://github.com/RisabBiswas/Polyp-SAM-PlusPlus
  • paper_authors: Risab Biswas
  • for: Improving the accuracy and robustness of polyp segmentation by guiding the SAM model with text prompts.
  • methods: Uses text prompts to improve SAM's accuracy and robustness, evaluated on benchmark datasets.
  • results: Text prompting improves SAM's polyp segmentation accuracy and robustness compared with the unprompted setting.
    Abstract Meta recently released SAM (Segment Anything Model) which is a general-purpose segmentation model. SAM has shown promising results in a wide variety of segmentation tasks including medical image segmentation. In the field of medical image segmentation, polyp segmentation holds a position of high importance, thus creating a model which is robust and precise is quite challenging. Polyp segmentation is a fundamental task to ensure better diagnosis and cure of colorectal cancer. As such in this study, we will see how Polyp-SAM++, a text prompt-aided SAM, can better utilize a SAM using text prompting for robust and more precise polyp segmentation. We will evaluate the performance of a text-guided SAM on the polyp segmentation task on benchmark datasets. We will also compare the results of text-guided SAM vs unprompted SAM. With this study, we hope to advance the field of polyp segmentation and inspire more intriguing research. The code and other details will be made publicly available soon at https://github.com/RisabBiswas/Polyp-SAM++.

cs.SD - 2023-08-12

Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition

  • paper_url: http://arxiv.org/abs/2308.06547
  • repo_url: None
  • paper_authors: Han Zhu, Dongji Gao, Gaofeng Cheng, Daniel Povey, Pengyuan Zhang, Yonghong Yan
  • for: Improving automatic speech recognition performance in semi-supervised learning when labels are scarce.
  • methods: Proposes a novel alternative pseudo-labeling framework comprising a generalized CTC loss function, a confidence-based error detection method, and an automatic thresholding method.
  • results: Compared with the standard CTC loss and confidence-based error detection, the proposed framework better handles pseudo-labels containing incorrect tokens and requires no manual threshold tuning.
    Abstract When labeled data is insufficient, semi-supervised learning with the pseudo-labeling technique can significantly improve the performance of automatic speech recognition. However, pseudo-labels are often noisy, containing numerous incorrect tokens. Taking noisy labels as ground-truth in the loss function results in suboptimal performance. Previous works attempted to mitigate this issue by either filtering out the nosiest pseudo-labels or improving the overall quality of pseudo-labels. While these methods are effective to some extent, it is unrealistic to entirely eliminate incorrect tokens in pseudo-labels. In this work, we propose a novel framework named alternative pseudo-labeling to tackle the issue of noisy pseudo-labels from the perspective of the training objective. The framework comprises several components. Firstly, a generalized CTC loss function is introduced to handle noisy pseudo-labels by accepting alternative tokens in the positions of incorrect tokens. Applying this loss function in pseudo-labeling requires detecting incorrect tokens in the predicted pseudo-labels. In this work, we adopt a confidence-based error detection method that identifies the incorrect tokens by comparing their confidence scores with a given threshold, thus necessitating the confidence score to be discriminative. Hence, the second proposed technique is the contrastive CTC loss function that widens the confidence gap between the correctly and incorrectly predicted tokens, thereby improving the error detection ability. Additionally, obtaining satisfactory performance with confidence-based error detection typically requires extensive threshold tuning. Instead, we propose an automatic thresholding method that uses labeled data as a proxy for determining the threshold, thus saving the pain of manual tuning.
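
The two decisions described above, flagging low-confidence pseudo-label tokens and setting the threshold automatically from labelled data, can be sketched as follows; the beta-distributed confidences are synthetic stand-ins for real token posteriors.

```python
# Illustrative confidence-based flagging with an automatic threshold.
import numpy as np

def auto_threshold(conf, is_correct, quantile=0.05):
    """Pick the threshold that keeps ~95% of tokens known to be correct."""
    return float(np.quantile(conf[is_correct], quantile))

def flag_tokens(conf, thresh):
    """True where a pseudo-label token should be treated as unreliable."""
    return conf < thresh

rng = np.random.default_rng(0)
# Labelled proxy set: correct tokens tend to score higher than wrong ones.
proxy_conf = np.concatenate([rng.beta(8, 2, 900), rng.beta(2, 5, 100)])
proxy_ok = np.concatenate([np.ones(900, bool), np.zeros(100, bool)])
thresh = auto_threshold(proxy_conf, proxy_ok)

pseudo_conf = rng.beta(5, 2, 20)               # confidences on unlabelled data
unreliable = flag_tokens(pseudo_conf, thresh)
print(f"threshold={thresh:.3f}, flagged {unreliable.sum()}/20 tokens")
# The generalized CTC loss then accepts alternative tokens at flagged positions.
```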

BigWavGAN: A Wave-To-Wave Generative Adversarial Network for Music Super-Resolution

  • paper_url: http://arxiv.org/abs/2308.06483
  • repo_url: None
  • paper_authors: Yenan Zhang, Hiroshi Watanabe
  • for: Improving the performance of large deep neural network (DNN) models in music super-resolution (SR).
  • methods: Combines Demucs, a large-scale wave-to-wave model, with state-of-the-art (SOTA) discriminators and adversarial training strategies; the discriminator consists of a Multi-Scale Discriminator (MSD) and a Multi-Resolution Discriminator (MRD).
  • results: BigWavGAN outperforms the baseline model and SOTA music SR models in both simulated and real-world scenarios, handles out-of-distribution data, and shows strong generalization.
    Abstract Generally, Deep Neural Networks (DNNs) are expected to have high performance when their model size is large. However, large models failed to produce high-quality results commensurate with their scale in music Super-Resolution (SR). We attribute this to the fact that DNNs cannot learn information commensurate with their size from standard mean square error losses. To unleash the potential of large DNN models in music SR, we propose BigWavGAN, which incorporates Demucs, a large-scale wave-to-wave model, with State-Of-The-Art (SOTA) discriminators and adversarial training strategies. Our discriminator consists of Multi-Scale Discriminator (MSD) and Multi-Resolution Discriminator (MRD). During inference, since only the generator is utilized, there are no additional parameters or computational resources required compared to the baseline model Demucs. Objective evaluation affirms the effectiveness of BigWavGAN in music SR. Subjective evaluations indicate that BigWavGAN can generate music with significantly high perceptual quality over the baseline model. Notably, BigWavGAN surpasses the SOTA music SR model in both simulated and real-world scenarios. Moreover, BigWavGAN demonstrates superior generalization ability on out-of-distribution data. The conducted ablation study reveals the importance of our discriminators and training strategies. Samples are available on the demo page.

Bilingual Streaming ASR with Grapheme units and Auxiliary Monolingual Loss

  • paper_url: http://arxiv.org/abs/2308.06327
  • repo_url: None
  • paper_authors: Mohammad Soleymanpour, Mahmoud Al Ismail, Fahimeh Bahmaninezhad, Kshitiz Kumar, Jian Wu
  • for: Supporting English as a secondary locale for most primary locales in hybrid automatic speech recognition (ASR) settings.
  • methods: Uses a grapheme-based pronunciation lexicon, a fully bilingual alignment model and bilingual streaming transformer, a parallel encoder structure with a language identification (LID) loss, and a parallel encoder with an auxiliary loss for monolingual projections.
  • results: Improves English code-mixing capability; large-scale training and test tasks on code-mix Spanish (ES) and Italian (IT) applications show the proposed auxiliary loss is superior to the LID loss.
    Abstract We introduce a bilingual solution to support English as secondary locale for most primary locales in hybrid automatic speech recognition (ASR) settings. Our key developments constitute: (a) pronunciation lexicon with grapheme units instead of phone units, (b) a fully bilingual alignment model and subsequently bilingual streaming transformer model, (c) a parallel encoder structure with language identification (LID) loss, (d) parallel encoder with an auxiliary loss for monolingual projections. We conclude that in comparison to LID loss, our proposed auxiliary loss is superior in specializing the parallel encoders to respective monolingual locales, and that contributes to stronger bilingual learning. We evaluate our work on large-scale training and test tasks for bilingual Spanish (ES) and bilingual Italian (IT) applications. Our bilingual models demonstrate strong English code-mixing capability. In particular, the bilingual IT model improves the word error rate (WER) for a code-mix IT task from 46.5% to 13.8%, while also achieving a close parity (9.6%) with the monolingual IT model (9.5%) over IT tests.

cs.CV - 2023-08-12

Cyclic Test-Time Adaptation on Monocular Video for 3D Human Mesh Reconstruction

  • paper_url: http://arxiv.org/abs/2308.06554
  • repo_url: https://github.com/hygenie1228/cycleadapt_release
  • paper_authors: Hyeongjin Nam, Daniel Sungho Jung, Yeonguk Oh, Kyoung Mu Lee
  • for: addresses the domain gap problem in 3D human mesh reconstruction by proposing a cyclic adaptation method that leverages both 2D and 3D evidence.
  • methods: the proposed method consists of two networks: a human mesh reconstruction network (HMRNet) and a human motion denoising network (MDNet), which are cyclically adapted given a test video. The 3D supervision targets generated by MDNet are used to fully supervise HMRNet, reducing the reliance on 2D evidence.
  • results: the proposed method achieves state-of-the-art performance compared to previous test-time adaptation methods, demonstrating the effectiveness of the cyclic adaptation scheme in addressing the domain gap problem.
    Abstract Despite recent advances in 3D human mesh reconstruction, domain gap between training and test data is still a major challenge. Several prior works tackle the domain gap problem via test-time adaptation that fine-tunes a network relying on 2D evidence (e.g., 2D human keypoints) from test images. However, the high reliance on 2D evidence during adaptation causes two major issues. First, 2D evidence induces depth ambiguity, preventing the learning of accurate 3D human geometry. Second, 2D evidence is noisy or partially non-existent during test time, and such imperfect 2D evidence leads to erroneous adaptation. To overcome the above issues, we introduce CycleAdapt, which cyclically adapts two networks: a human mesh reconstruction network (HMRNet) and a human motion denoising network (MDNet), given a test video. In our framework, to alleviate high reliance on 2D evidence, we fully supervise HMRNet with generated 3D supervision targets by MDNet. Our cyclic adaptation scheme progressively elaborates the 3D supervision targets, which compensate for imperfect 2D evidence. As a result, our CycleAdapt achieves state-of-the-art performance compared to previous test-time adaptation methods. The codes are available at https://github.com/hygenie1228/CycleAdapt_RELEASE.
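
The cyclic scheme can be skeletonized with dummy modules standing in for HMRNet and MDNet (the real networks, losses, and mesh parameterization are not reproduced): each cycle, the motion denoiser is fitted to smooth the current per-frame estimates, and its output then serves as a full 3D supervision target for the mesh regressor.

```python
# High-level skeleton of the cyclic adaptation loop (toy stand-in modules).
import torch
import torch.nn as nn

frames = torch.randn(30, 512)                  # stand-in per-frame features
hmr = nn.Linear(512, 72)                       # toy "HMRNet": features -> pose
mdnet = nn.GRU(72, 72, batch_first=True)       # toy "MDNet": motion denoiser
opt_h = torch.optim.Adam(hmr.parameters(), lr=1e-4)
opt_m = torch.optim.Adam(mdnet.parameters(), lr=1e-4)

for cycle in range(3):
    # 1) MDNet learns to denoise HMRNet's current (noisy) motion estimates:
    #    reconstruct them under a temporal-smoothness penalty.
    with torch.no_grad():
        noisy = hmr(frames).unsqueeze(0)       # (1, T, 72)
    denoised, _ = mdnet(noisy)
    loss_m = ((denoised - noisy).pow(2).mean()
              + 10 * (denoised[:, 1:] - denoised[:, :-1]).pow(2).mean())
    opt_m.zero_grad(); loss_m.backward(); opt_m.step()
    # 2) The denoised motion becomes a 3D supervision target for HMRNet,
    #    reducing the reliance on noisy 2D evidence.
    with torch.no_grad():
        target, _ = mdnet(hmr(frames).unsqueeze(0))
    loss_h = (hmr(frames).unsqueeze(0) - target).pow(2).mean()
    opt_h.zero_grad(); loss_h.backward(); opt_h.step()
```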

Revisiting Vision Transformer from the View of Path Ensemble

  • paper_url: http://arxiv.org/abs/2308.06548
  • repo_url: None
  • paper_authors: Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou
  • for: Proposes a novel view of Vision Transformers, showing that transformer layers can be seen as ensemble networks containing multiple parallel paths of different lengths.
  • methods: Equivalently transforms the traditional cascade of multi-head self-attention (MSA) and feed-forward network (FFN) into three parallel paths in each transformer layer, and uses the identity connection to turn the ViT into an explicit multi-path ensemble network.
  • results: Investigating each path's influence on the final prediction reveals that some paths even degrade performance, motivating path pruning and EnsembleScale techniques that cut underperforming paths and re-weight the ensemble components so that short paths focus on providing high-quality representations; self-distillation further transfers knowledge from long paths to short ones.
    Abstract Vision Transformers (ViTs) are normally regarded as a stack of transformer layers. In this work, we propose a novel view of ViTs showing that they can be seen as ensemble networks containing multiple parallel paths with different lengths. Specifically, we equivalently transform the traditional cascade of multi-head self-attention (MSA) and feed-forward network (FFN) into three parallel paths in each transformer layer. Then, we utilize the identity connection in our new transformer form and further transform the ViT into an explicit multi-path ensemble network. From the new perspective, these paths perform two functions: the first is to provide the feature for the classifier directly, and the second is to provide the lower-level feature representation for subsequent longer paths. We investigate the influence of each path for the final prediction and discover that some paths even pull down the performance. Therefore, we propose the path pruning and EnsembleScale skills for improvement, which cut out the underperforming paths and re-weight the ensemble components, respectively, to optimize the path combination and make the short paths focus on providing high-quality representation for subsequent paths. We also demonstrate that our path combination strategies can help ViTs go deeper and act as high-pass filters to filter out partial low-frequency signals. To further enhance the representation of paths served for subsequent paths, self-distillation is applied to transfer knowledge from the long paths to the short paths. This work calls for more future research to explain and design ViTs from new perspectives.
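
The EnsembleScale idea, learning a weight per parallel path and pruning paths whose weight collapses, can be illustrated with dummy heads in place of an actual decomposed ViT layer; the softmax weighting and top-k pruning below are my assumptions about one reasonable instantiation.

```python
# Toy EnsembleScale: weighted path ensemble with pruning of weak paths.
import torch
import torch.nn as nn

class PathEnsemble(nn.Module):
    def __init__(self, n_paths=3, dim=32, n_cls=10):
        super().__init__()
        self.paths = nn.ModuleList(nn.Linear(dim, n_cls) for _ in range(n_paths))
        self.scale = nn.Parameter(torch.ones(n_paths))     # EnsembleScale weights
    def forward(self, x):
        logits = torch.stack([p(x) for p in self.paths])   # (P, B, n_cls)
        w = torch.softmax(self.scale, dim=0)
        return (w[:, None, None] * logits).sum(dim=0)
    def prune(self, keep=2):                   # drop the weakest paths
        top = torch.topk(self.scale, keep).indices.tolist()
        self.paths = nn.ModuleList(self.paths[i] for i in top)
        self.scale = nn.Parameter(self.scale.detach()[top])

model = PathEnsemble()
out = model(torch.randn(4, 32))
model.prune(keep=2)
print(out.shape, len(model.paths))             # torch.Size([4, 10]) 2
```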

SegPrompt: Boosting Open-world Segmentation via Category-level Prompt Learning

  • paper_url: http://arxiv.org/abs/2308.06531
  • repo_url: https://github.com/aim-uofa/segprompt
  • paper_authors: Muzhi Zhu, Hengtao Li, Hao Chen, Chengxiang Fan, Weian Mao, Chenchen Jing, Yifan Liu, Chunhua Shen
  • for: Improving the ability of closed-set instance segmentation models to detect unknown categories.
  • methods: A training mechanism that uses category information to improve the model's class-agnostic segmentation ability for both known and unknown categories.
  • results: On the new open-world benchmark, SegPrompt improves overall and unseen detection performance by 5.6% and 6.1% in AR without affecting inference efficiency; it also yields 5.5% and 12.3% relative improvements in the existing cross-dataset transfer and strongly supervised settings.
    Abstract Current closed-set instance segmentation models rely on pre-defined class labels for each mask during training and evaluation, largely limiting their ability to detect novel objects. Open-world instance segmentation (OWIS) models address this challenge by detecting unknown objects in a class-agnostic manner. However, previous OWIS approaches completely erase category information during training to keep the model's ability to generalize to unknown objects. In this work, we propose a novel training mechanism termed SegPrompt that uses category information to improve the model's class-agnostic segmentation ability for both known and unknown categories. In addition, the previous OWIS training setting exposes the unknown classes to the training set and brings information leakage, which is unreasonable in the real world. Therefore, we provide a new open-world benchmark closer to a real-world scenario by dividing the dataset classes into known-seen-unseen parts. For the first time, we focus on the model's ability to discover objects that never appear in the training set images. Experiments show that SegPrompt can improve the overall and unseen detection performance by 5.6% and 6.1% in AR on our new benchmark without affecting the inference efficiency. We further demonstrate the effectiveness of our method on existing cross-dataset transfer and strongly supervised settings, leading to 5.5% and 12.3% relative improvement.

BEV-DG: Cross-Modal Learning under Bird’s-Eye View for Domain Generalization of 3D Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.06530
  • repo_url: None
  • paper_authors: Miaoyu Li, Yachao Zhang, Xu MA, Yanyun Qu, Yun Fu
  • for: Improving the generalizability of 3D semantic segmentation models so they can predict in new, unseen domains without access to target-domain training data.
  • methods: Proposes a cross-modal learning architecture under bird's-eye view with higher fault tolerance and robustness to point-level misalignment, enabling domain-generalized prediction.
  • results: Evaluated on three domain generalization settings built from three 3D datasets, BEV-DG significantly outperforms state-of-the-art competitors in all settings, with margins of around 10%.
    Abstract Cross-modal Unsupervised Domain Adaptation (UDA) aims to exploit the complementarity of 2D-3D data to overcome the lack of annotation in a new domain. However, UDA methods rely on access to the target domain during training, meaning the trained model only works in a specific target domain. In light of this, we propose cross-modal learning under bird's-eye view for Domain Generalization (DG) of 3D semantic segmentation, called BEV-DG. DG is more challenging because the model cannot access the target domain during training, meaning it needs to rely on cross-modal learning to alleviate the domain gap. Since 3D semantic segmentation requires the classification of each point, existing cross-modal learning is directly conducted point-to-point, which is sensitive to the misalignment in projections between pixels and points. To this end, our approach aims to optimize domain-irrelevant representation modeling with the aid of cross-modal learning under bird's-eye view. We propose BEV-based Area-to-area Fusion (BAF) to conduct cross-modal learning under bird's-eye view, which has a higher fault tolerance for point-level misalignment. Furthermore, to model domain-irrelevant representations, we propose BEV-driven Domain Contrastive Learning (BDCL) with the help of cross-modal learning under bird's-eye view. We design three domain generalization settings based on three 3D datasets, and BEV-DG significantly outperforms state-of-the-art competitors with tremendous margins in all settings.

Seed Feature Maps-based CNN Models for LEO Satellite Remote Sensing Services

  • paper_url: http://arxiv.org/abs/2308.06515
  • repo_url: None
  • paper_authors: Zhichao Lu, Chuntao Ding, Shangguang Wang, Ran Cheng, Felix Juefei-Xu, Vishnu Naresh Boddeti
  • for: Proposes a ground-station server-assisted framework for rapid remote sensing image processing with high-performance CNN models on low-earth orbit (LEO) satellites.
  • methods: Uses a seed-feature-map-based framework in which each layer of the CNN model contains only one learnable feature map (the seed feature map), from which the other feature maps are generated according to specific rules; Random Hyperparameter Generation (RHG) makes it practical to update the CNN model deployed on the LEO satellite.
  • results: Experiments show the framework outperforms existing state-of-the-art methods, achieving higher mIoU on the ISPRS Vaihingen, ISPRS Potsdam, UAVid, and LoveDA datasets; in particular, on UAVid the SineFM-based model attains a higher mIoU than UNetFormer with 3.3x fewer parameters and 2.2x fewer FLOPs.
    Abstract Deploying high-performance convolutional neural network (CNN) models on low-earth orbit (LEO) satellites for rapid remote sensing image processing has attracted significant interest from industry and academia. However, the limited resources available on LEO satellites contrast with the demands of resource-intensive CNN models, necessitating the adoption of ground-station server assistance for training and updating these models. Existing approaches often require large floating-point operations (FLOPs) and substantial model parameter transmissions, presenting considerable challenges. To address these issues, this paper introduces a ground-station server-assisted framework. With the proposed framework, each layer of the CNN model contains only one learnable feature map (called the seed feature map) from which other feature maps are generated based on specific rules. The hyperparameters of these rules are randomly generated instead of being trained, thus enabling the generation of multiple feature maps from the seed feature map and significantly reducing FLOPs. Furthermore, since the random hyperparameters can be saved using a few random seeds, the ground station server assistance can be facilitated in updating the CNN model deployed on the LEO satellite. Experimental results on the ISPRS Vaihingen, ISPRS Potsdam, UAVid, and LoveDA datasets for semantic segmentation services demonstrate that the proposed framework outperforms existing state-of-the-art approaches. In particular, the SineFM-based model achieves a higher mIoU than the UNetFormer on the UAVid dataset, with 3.3x fewer parameters and 2.2x fewer FLOPs.
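
A speculative sketch of the seed-feature-map mechanic: each layer learns a single-channel map and expands it to many channels with cheap, randomly parameterized transforms, so only the random seed (not trained hyperparameters) must be shared to regenerate the expansion rules on the satellite. The sinusoidal rule below is a guess at the flavour of SineFM, not its actual definition.

```python
# Hypothetical seed-feature-map layer: one learnable map, random expansion rules.
import torch
import torch.nn as nn

class SeedFMConv(nn.Module):
    def __init__(self, in_ch, out_ch, seed=0):
        super().__init__()
        self.seed_conv = nn.Conv2d(in_ch, 1, 3, padding=1)  # one learnable map
        g = torch.Generator().manual_seed(seed)              # reproducible rules
        self.register_buffer("freq", torch.rand(out_ch, generator=g) * 4 + 0.5)
        self.register_buffer("phase", torch.rand(out_ch, generator=g) * 6.2832)
    def forward(self, x):
        s = self.seed_conv(x)                                # (B,1,H,W)
        f = self.freq[None, :, None, None]
        p = self.phase[None, :, None, None]
        return torch.sin(f * s + p)                          # (B,out_ch,H,W)

layer = SeedFMConv(3, 16, seed=42)
y = layer(torch.rand(2, 3, 32, 32))
print(y.shape, sum(p.numel() for p in layer.parameters()))  # few learnable params
```

Because the expansion is deterministic given the seed, a ground-station update only has to transmit the seed-conv weights, which is the bandwidth saving the paper is after.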

Out-of-distribution multi-view auto-encoders for prostate cancer lesion detection

  • paper_url: http://arxiv.org/abs/2308.06481
  • repo_url: None
  • paper_authors: Alvaro Fernandez-Quilez, Linas Vidziunas, Ørjan Kløvfjell Thoresen, Ketil Oppedal, Svein Reidar Kjosavik, Trygve Eftestøl
  • for: 这篇论文目的是为了提出一种基于对外域检测的潜在医疗影像识别方法,并且运用不同T2w方向的多条流进行检测,以提高肝癌潜在病变检测的精确度。
  • methods: 本论文使用的方法包括对外域检测和多条流方法,以探索肝癌潜在病变检测的可能性。
  • results: 本论文的结果显示,使用多条流方法可以提高肝癌潜在病变检测的精确度,并且在一个公共可用数据集上获得了更高的检测精确度(AUC),具体为73.1%和82.3%之间。
    Abstract Traditional deep learning (DL) approaches based on supervised learning paradigms require large amounts of annotated data that are rarely available in the medical domain. Unsupervised Out-of-distribution (OOD) detection is an alternative that requires less annotated data. Further, OOD applications exploit the class skewness commonly present in medical data. Magnetic resonance imaging (MRI) has proven to be useful for prostate cancer (PCa) diagnosis and management, but current DL approaches rely on T2w axial MRI, which suffers from low out-of-plane resolution. We propose a multi-stream approach to accommodate different T2w directions to improve the performance of PCa lesion detection in an OOD approach. We evaluate our approach on a publicly available data-set, obtaining better detection results in terms of AUC when compared to a single direction approach (73.1 vs 82.3). Our results show the potential of OOD approaches for PCa lesion detection based on MRI.
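
The OOD formulation itself is compact enough to sketch. The code below is a generic single-stream illustration under assumed architecture and threshold choices, not the authors' multi-view model: an auto-encoder trained only on lesion-free (in-distribution) slices flags lesions through elevated reconstruction error.

```python
import torch
import torch.nn as nn

# Minimal convolutional auto-encoder; train it on lesion-free T2w slices only,
# then score new slices by how poorly they reconstruct.
ae = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
)

def ood_score(model, x):
    """Per-slice anomaly score: mean squared reconstruction error.
    High scores suggest out-of-distribution content such as a lesion."""
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=(1, 2, 3))

x = torch.randn(8, 1, 128, 128)   # batch of T2w slices (dummy data)
print(ood_score(ae, x))           # one score per slice
```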

Leveraging multi-view data without annotations for prostate MRI segmentation: A contrastive approach

  • paper_url: http://arxiv.org/abs/2308.06477
  • repo_url: None
  • paper_authors: Tim Nikolass Lindeijer, Tord Martin Ytredal, Trygve Eftestøl, Tobias Nordström, Fredrik Jäderling, Martin Eklund, Alvaro Fernandez-Quilez
  • for: To improve the accuracy and robustness of automatic prostate segmentation using multi-view MRI data and contrastive learning.
  • methods: Proposes tU-Net (triplet U-Net), a triplet-encoder, single-decoder network based on U-Net that exploits non-annotated sagittal and coronal views to improve segmentation from a volumetric perspective.
  • results: tU-Net shows a statistically significant improvement in Dice score (91.25±0.52% vs 86.40±1.50%, P<.001) and generalizes well volumetrically when tested with multi-view data.
    Abstract An accurate prostate delineation and volume characterization can support the clinical assessment of prostate cancer. A large number of automatic prostate segmentation tools consider exclusively the axial MRI direction, despite multi-view data being available under standard acquisition protocols. Further, when multi-view data is exploited, manual annotations and availability of all views at test time are commonly assumed. In this work, we explore a contrastive approach at training time to leverage multi-view data without annotations and provide flexibility at deployment time in the event of missing views. We propose a triplet encoder and single decoder network based on U-Net, tU-Net (triplet U-Net). Our proposed architecture is able to exploit non-annotated sagittal and coronal views via contrastive learning to improve the segmentation from a volumetric perspective. For that purpose, we introduce the concept of inter-view similarity in the latent space. To guide the training, we combine a dice score loss calculated with respect to the axial view and its manual annotations together with a multi-view contrastive loss. tU-Net shows statistical improvement in dice score coefficient (DSC) with respect to only axial view (91.25+-0.52% compared to 86.40+-1.50%,P<.001). Sensitivity analysis reveals the volumetric positive impact of the contrastive loss when paired with tU-Net (2.85+-1.34% compared to 3.81+-1.88%,P<.001). Further, our approach shows good external volumetric generalization in an in-house dataset when tested with multi-view data (2.76+-1.89% compared to 3.92+-3.31%,P=.002), showing the feasibility of exploiting non-annotated multi-view data through contrastive learning whilst providing flexibility at deployment in the event of missing views.
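
The training objective combines a supervised dice term on the annotated axial view with an inter-view similarity term on latent codes. The sketch below is one plausible instantiation under assumed choices (cosine similarity, a weighting of 0.1); tU-Net's exact similarity and weighting may differ.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    """Soft dice loss, computed on the annotated axial view."""
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def inter_view_contrastive(z_axial, z_sagittal, z_coronal):
    """Encourage latent codes of the three views of the same volume to agree:
    one simple reading of 'inter-view similarity in the latent space'."""
    za, zs, zc = (F.normalize(z.flatten(1), dim=1) for z in (z_axial, z_sagittal, z_coronal))
    return 2 - (za * zs).sum(1).mean() - (za * zc).sum(1).mean()

# total = dice on axial + lambda * multi-view term (lambda = 0.1 is assumed)
z_a, z_s, z_c = (torch.randn(4, 64, 8, 8) for _ in range(3))
pred, target = torch.rand(4, 1, 96, 96), (torch.rand(4, 1, 96, 96) > 0.5).float()
loss = dice_loss(pred, target) + 0.1 * inter_view_contrastive(z_a, z_s, z_c)
print(loss.item())
```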

Tiny and Efficient Model for the Edge Detection Generalization

  • paper_url: http://arxiv.org/abs/2308.06468
  • repo_url: https://github.com/xavysp/teed
  • paper_authors: Xavier Soria, Yachuan Li, Mohammad Rouhani, Angel D. Sappa
  • for: Addresses edge detection in computer vision with three objectives in mind: simplicity, efficiency, and generalization.
  • methods: Proposes TEED, a light convolutional neural network with only 58K parameters, far fewer than state-of-the-art models. Training on the BIPED dataset takes less than 30 minutes, with each epoch requiring less than 5 minutes.
  • results: The model converges within the first few epochs and predicts crisp, high-quality edge maps. A new dataset is also proposed for testing the generalization of edge detection models.
    Abstract Most high-level computer vision tasks rely on low-level image operations as their initial processes. Operations such as edge detection, image enhancement, and super-resolution provide the foundations for higher level image analysis. In this work we address edge detection with three main objectives in mind: simplicity, efficiency, and generalization, since current state-of-the-art (SOTA) edge detection models have grown in complexity for better accuracy. To achieve this, we present Tiny and Efficient Edge Detector (TEED), a light convolutional neural network with only 58K parameters, less than 0.2% of the state-of-the-art models. Training on the BIPED dataset takes less than 30 minutes, with each epoch requiring less than 5 minutes. Our proposed model is easy to train and quickly converges within the very first few epochs, while the predicted edge-maps are crisp and of high quality. Additionally, we propose a new dataset to test the generalization of edge detection, which comprises samples from popular images used in edge detection and image segmentation. The source code is available at https://github.com/xavysp/TEED.

Improved YOLOv8 Detection Algorithm in Security Inspection Image

  • paper_url: http://arxiv.org/abs/2308.06452
  • repo_url: None
  • paper_authors: Liyao Lu
  • for: To address overlapping detection objects, false detection of contraband, and missed detections in X-ray security inspection imagery.
  • methods: Proposes CSS-YOLO, an improved X-ray contraband detection algorithm based on YOLOv8s.
  • results: Experiments show that CSS-YOLO improves detection accuracy while reducing false-positive and missed-detection rates, improving the effectiveness of security inspection.
    Abstract Security inspection is the first line of defense for ensuring the safety of people's lives and property, and intelligent security inspection is an inevitable trend in the future development of the industry. To address the problems of overlapping detection objects, false detection of contraband, and missed detections in X-ray image detection, an improved X-ray contraband detection algorithm, CSS-YOLO, based on YOLOv8s is proposed.

TongueSAM: An Universal Tongue Segmentation Model Based on SAM with Zero-Shot

  • paper_url: http://arxiv.org/abs/2308.06444
  • repo_url: https://github.com/cshan-github/tonguesam
  • paper_authors: Shan Cao, Qunsheng Ruan, Qingfeng Wu
  • for: Proposes a universal tongue segmentation model to address the mediocre performance of existing methods on tongue images that differ from the training set.
  • methods: Applies SAM (Segment Anything Model), a large-scale pretrained interactive segmentation model with strong zero-shot generalization, to tongue segmentation, and integrates an object-detection-based Prompt Generator for an end-to-end automated pipeline.
  • results: Experiments show that TongueSAM performs strongly across tongue segmentation datasets, particularly zero-shot, and can be applied directly to other datasets without fine-tuning; to the authors' knowledge, this is the first application of a large-scale pretrained model to tongue segmentation. Project and pretrained model: https://github.com/cshan-github/TongueSAM.
    Abstract Tongue segmentation serves as the primary step in automated TCM tongue diagnosis, which plays a significant role in the diagnostic results. Currently, numerous deep learning based methods have achieved promising results. However, most of these methods exhibit mediocre performance on tongues different from the training set. To address this issue, this paper proposes a universal tongue segmentation model named TongueSAM based on SAM (Segment Anything Model). SAM is a large-scale pretrained interactive segmentation model known for its powerful zero-shot generalization capability. Applying SAM to tongue segmentation enables zero-shot segmentation of various types of tongue images. In this study, a Prompt Generator based on object detection is integrated into SAM to enable an end-to-end automated tongue segmentation method. Experiments demonstrate that TongueSAM achieves exceptional performance across a variety of tongue segmentation datasets, particularly in the zero-shot setting. TongueSAM can be directly applied to other datasets without fine-tuning. As far as we know, this is the first application of a large-scale pretrained model to tongue segmentation. The project and pretrained model of TongueSAM are published at: https://github.com/cshan-github/TongueSAM.
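
The core detector-to-SAM handoff can be sketched with the public `segment_anything` package. This is a hedged sketch, not TongueSAM's code: the checkpoint path is an assumption and `detect_tongue_box` is a hypothetical stand-in for the paper's Prompt Generator.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM backbone (checkpoint path is an assumption).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

def segment_tongue(image, detect_tongue_box):
    """Sketch of the TongueSAM flow: an object detector supplies a box prompt,
    and SAM turns the box into a mask."""
    predictor.set_image(image)                      # HWC uint8 RGB image
    box = detect_tongue_box(image)                  # np.array([x0, y0, x1, y1]), XYXY
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    return masks[0], scores[0]
```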

Distributionally Robust Optimization and Invariant Representation Learning for Addressing Subgroup Underrepresentation: Mechanisms and Limitations

  • paper_url: http://arxiv.org/abs/2308.06434
  • repo_url: None
  • paper_authors: Nilesh Kumar, Ruby Shrestha, Zhiyuan Li, Linwei Wang
  • for: To address spurious correlation caused by subgroup underrepresentation in medical image classification, specifically by exploring the use of robust optimization to learn invariant representations.
  • methods: Proposes a novel approach that leverages robust optimization to facilitate the learning of invariant representations, evaluated through a comprehensive study.
  • results: The proposed approach improves classifier performance on underrepresented subgroups while maintaining high average and worst-group performance, compared to existing methods such as generalized reweighting and naive invariant representation learning.
    Abstract Spurious correlation caused by subgroup underrepresentation has received increasing attention as a source of bias that can be perpetuated by deep neural networks (DNNs). Distributionally robust optimization has shown success in addressing this bias, although the underlying working mechanism mostly relies on upweighting under-performing samples as surrogates for those underrepresented in data. At the same time, while invariant representation learning has been a powerful choice for removing nuisance-sensitive features, it has been little considered in settings where spurious correlations are caused by significant underrepresentation of subgroups. In this paper, we take the first step to better understand and improve the mechanisms for debiasing spurious correlation due to subgroup underrepresentation in medical image classification. Through a comprehensive evaluation study, we first show that 1) generalized reweighting of under-performing samples can be problematic when bias is not the only cause for poor performance, while 2) naive invariant representation learning suffers from spurious correlations itself. We then present a novel approach that leverages robust optimization to facilitate the learning of invariant representations at the presence of spurious correlations. Finetuned classifiers utilizing such representation demonstrated improved abilities to reduce subgroup performance disparity, while maintaining high average and worst-group performance.
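
Group-robust optimization, one of the two mechanisms studied here, has a standard form worth sketching. The snippet shows the classic group DRO update with exponentiated weights over subgroup losses (Sagawa et al. style); it is a generic illustration under an assumed step size, not the paper's exact method.

```python
import torch

def group_dro_loss(losses, groups, q, eta=0.01):
    """losses: per-sample losses; groups: per-sample subgroup ids;
    q: running weight per subgroup, updated in place via exponentiated gradient.
    Returns a weighted loss that upweights worse-performing subgroups."""
    group_losses = torch.stack([losses[groups == g].mean() for g in range(len(q))])
    q *= torch.exp(eta * group_losses.detach())   # raise weight of struggling groups
    q /= q.sum()
    return (q * group_losses).sum()

q = torch.ones(2) / 2                              # two subgroups, uniform start
losses = torch.tensor([0.2, 0.9, 1.1, 0.3])
groups = torch.tensor([0, 1, 1, 0])
print(group_dro_loss(losses, groups, q))           # dominated by subgroup 1
```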

Learn Single-horizon Disease Evolution for Predictive Generation of Post-therapeutic Neovascular Age-related Macular Degeneration

  • paper_url: http://arxiv.org/abs/2308.06432
  • repo_url: None
  • paper_authors: Yuhan Zhang, Kun Huang, Mingchao Li, Songtao Yuan, Qiang Chen
  • for: To predict disease evolution and therapeutic effect in neovascular age-related macular degeneration (nAMD).
  • methods: Proposes a single-horizon disease evolution network (SHENet) consisting of a feature encoder, a graph evolution module, and a feature decoder, with adversarial training to ensure effective disease evolution learning.
  • results: Compared with other generative methods, SHENet generates SD-OCT images of the highest quality while achieving the best structure protection and content prediction; qualitative evaluations likewise show a better visual effect.
    Abstract Most of the existing disease prediction methods in the field of medical image processing fall into two classes, namely image-to-category predictions and image-to-parameter predictions. Few works have focused on image-to-image predictions. Different from multi-horizon predictions in other fields, ophthalmologists prefer to show more confidence in single-horizon predictions due to the low tolerance of predictive risk. We propose a single-horizon disease evolution network (SHENet) to predictively generate post-therapeutic SD-OCT images by inputting pre-therapeutic SD-OCT images with neovascular age-related macular degeneration (nAMD). In SHENet, a feature encoder converts the input SD-OCT images to deep features, then a graph evolution module predicts the process of disease evolution in high-dimensional latent space and outputs the predicted deep features, and lastly, feature decoder recovers the predicted deep features to SD-OCT images. We further propose an evolution reinforcement module to ensure the effectiveness of disease evolution learning and obtain realistic SD-OCT images by adversarial training. SHENet is validated on 383 SD-OCT cubes of 22 nAMD patients based on three well-designed schemes based on the quantitative and qualitative evaluations. Compared with other generative methods, the generative SD-OCT images of SHENet have the highest image quality. Besides, SHENet achieves the best structure protection and content prediction. Qualitative evaluations also demonstrate that SHENet has a better visual effect than other methods. SHENet can generate post-therapeutic SD-OCT images with both high prediction performance and good image quality, which has great potential to help ophthalmologists forecast the therapeutic effect of nAMD.

M&M: Tackling False Positives in Mammography with a Multi-view and Multi-instance Learning Sparse Detector

  • paper_url: http://arxiv.org/abs/2308.06420
  • repo_url: None
  • paper_authors: Yen Nhi Truong Vu, Dan Guo, Ahmed Taha, Jason Su, Thomas Paul Matthews
  • for: To improve detection performance in screening mammography while reducing false positives.
  • methods: Uses Sparse R-CNN together with a multi-view cross-attention module and multi-instance learning.
  • results: Improves detection and classification performance, with comprehensive ablation studies demonstrating the effectiveness of each component.
    Abstract Deep-learning-based object detection methods show promise for improving screening mammography, but high rates of false positives can hinder their effectiveness in clinical practice. To reduce false positives, we identify three challenges: (1) unlike natural images, a malignant mammogram typically contains only one malignant finding; (2) mammography exams contain two views of each breast, and both views ought to be considered to make a correct assessment; (3) most mammograms are negative and do not contain any findings. In this work, we tackle the three aforementioned challenges by: (1) leveraging Sparse R-CNN and showing that sparse detectors are more appropriate than dense detectors for mammography; (2) including a multi-view cross-attention module to synthesize information from different views; (3) incorporating multi-instance learning (MIL) to train with unannotated images and perform breast-level classification. The resulting model, M&M, is a Multi-view and Multi-instance learning system that can both localize malignant findings and provide breast-level predictions. We validate M&M's detection and classification performance using five mammography datasets. In addition, we demonstrate the effectiveness of each proposed component through comprehensive ablation studies.
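
The breast-level prediction step lends itself to a small sketch: treat each candidate detection across both views of a breast as an instance and pool the instance scores. Max pooling below is one common MIL choice and an assumption here; M&M's exact pooling may differ.

```python
import torch

def breast_level_score(cc_logits, mlo_logits):
    """MIL pooling sketch: a breast is scored malignant if any candidate
    detection in either view is, implemented as a max over instance logits."""
    all_logits = torch.cat([cc_logits, mlo_logits])   # instances from both views
    return torch.sigmoid(all_logits.max())

cc = torch.tensor([-2.1, 0.3])      # detection logits, craniocaudal view
mlo = torch.tensor([-1.5, 1.8])     # detection logits, mediolateral oblique view
print(breast_level_score(cc, mlo))  # driven by the single strongest finding
```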

Improving Pseudo Labels for Open-Vocabulary Object Detection

  • paper_url: http://arxiv.org/abs/2308.06412
  • repo_url: None
  • paper_authors: Shiyu Zhao, Samuel Schulter, Long Zhao, Zhixing Zhang, Vijay Kumar B. G, Yumin Suh, Manmohan Chandraker, Dimitris N. Metaxas
  • for: To improve the quality of pseudo labels (PLs) generated by pretrained vision-language models (VLMs) for open-vocabulary object detection (OVD).
  • methods: Proposes online Self-training And a Split-and-fusion head for OVD (SAS-Det): self-training fine-tunes the VLM to generate high-quality PLs while preventing forgetting of pretrained knowledge, and the split-and-fusion head removes localization noise in PLs while fusing complementary knowledge learned from precise ground truth and noisy pseudo labels.
  • results: Achieves 37.4 AP$_{50}$ and 27.3 AP$_r$ on novel categories of the COCO and LVIS benchmarks respectively, outperforming prior state-of-the-art models of the same scale, with pseudo labeling three times faster than prior methods.
    Abstract Recent studies show promising performance in open-vocabulary object detection (OVD) using pseudo labels (PLs) from pretrained vision and language models (VLMs). However, PLs generated by VLMs are extremely noisy due to the gap between the pretraining objective of VLMs and OVD, which blocks further advances on PLs. In this paper, we aim to reduce the noise in PLs and propose a method called online Self-training And a Split-and-fusion head for OVD (SAS-Det). First, the self-training finetunes VLMs to generate high quality PLs while prevents forgetting the knowledge learned in the pretraining. Second, a split-and-fusion (SAF) head is designed to remove the noise in localization of PLs, which is usually ignored in existing methods. It also fuses complementary knowledge learned from both precise ground truth and noisy pseudo labels to boost the performance. Extensive experiments demonstrate SAS-Det is both efficient and effective. Our pseudo labeling is 3 times faster than prior methods. SAS-Det outperforms prior state-of-the-art models of the same scale by a clear margin and achieves 37.4 AP$_{50}$ and 27.3 AP$_r$ on novel categories of the COCO and LVIS benchmarks, respectively.

Detecting and Preventing Hallucinations in Large Vision Language Models

  • paper_url: http://arxiv.org/abs/2308.06394
  • repo_url: None
  • paper_authors: Anisha Gunjal, Jihan Yin, Erhan Bas
  • for: To address hallucinations in instruction-tuned large vision-language models (LVLMs) for visual question answering (VQA).
  • methods: Introduces M-HalDetect, a dataset of 16,000 fine-grained annotations on VQA examples for training and benchmarking hallucination detection and prevention, and proposes Fine-grained Direct Preference Optimization (FDPO) to reduce hallucinations in LVLMs.
  • results: Human evaluation shows FDPO and rejection sampling reduce hallucination rates in InstructBLIP by 41% and 55% respectively; the reward model generalizes to other multi-modal models, reducing hallucinations in LLaVA and mPLUG-OWL by 15% and 57%, and correlates strongly with human-evaluated accuracy scores.
    Abstract Instruction tuned Large Vision Language Models (LVLMs) have significantly advanced in generalizing across a diverse set of multi-modal tasks, especially for Visual Question Answering (VQA). However, generating detailed responses that are visually grounded is still a challenging task for these models. We find that even the current state-of-the-art LVLMs (InstructBLIP) still contain a staggering 30 percent of the hallucinatory text in the form of non-existent objects, unfaithful descriptions, and inaccurate relationships. To address this, we introduce M-HalDetect, a (M)ultimodal (Hal)lucination (Detect)ion Dataset that can be used to train and benchmark models for hallucination detection and prevention. M-HalDetect consists of 16k fine-grained annotations on VQA examples, making it the first comprehensive multi-modal hallucination detection dataset for detailed image descriptions. Unlike previous work that only consider object hallucination, we additionally annotate both entity descriptions and relationships that are unfaithful. To demonstrate the potential of this dataset for hallucination prevention, we optimize InstructBLIP through our novel Fine-grained Direct Preference Optimization (FDPO). We also train fine-grained multi-modal reward models from InstructBLIP and evaluate their effectiveness with best-of-n rejection sampling. We perform human evaluation on both FDPO and rejection sampling, and find that they reduce hallucination rates in InstructBLIP by 41% and 55% respectively. We also find that our reward model generalizes to other multi-modal models, reducing hallucinations in LLaVA and mPLUG-OWL by 15% and 57% respectively, and has strong correlation with human evaluated accuracy scores.
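
FDPO builds on direct preference optimization. For reference, the standard pairwise DPO loss (on which a fine-grained variant could operate per annotated segment) is sketched below; the segment-level weighting that makes FDPO fine-grained is the paper's contribution and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pref, logp_rej, ref_logp_pref, ref_logp_rej, beta=0.1):
    """Standard DPO objective: push the policy's log-probability margin on the
    preferred (non-hallucinated) response above the frozen reference model's."""
    margin = (logp_pref - logp_rej) - (ref_logp_pref - ref_logp_rej)
    return -F.logsigmoid(beta * margin).mean()

# Sequence log-probs under the tuned policy and frozen reference (dummy values).
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-10.0]),
                torch.tensor([-11.0]), torch.tensor([-11.0]))
print(loss)
```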

R2S100K: Road-Region Segmentation Dataset For Semi-Supervised Autonomous Driving in the Wild

  • paper_url: http://arxiv.org/abs/2308.06393
  • repo_url: None
  • paper_authors: Muhammad Atif Butt, Hassan Ali, Adnan Qayyum, Waqas Sultani, Ala Al-Fuqaha, Junaid Qadir
  • for: To provide a large-scale, diverse road-region segmentation dataset that better supports the development of autonomous driving on unstructured roadways.
  • methods: Proposes an Efficient Data Sampling (EDS) based self-training framework that improves learning by leveraging unlabeled data in a semi-supervised setting.
  • results: Experiments show the proposed method significantly improves the generalizability of semantic segmentation while reducing labeling cost.
    Abstract Semantic understanding of roadways is a key enabling factor for safe autonomous driving. However, existing autonomous driving datasets provide well-structured urban roads while ignoring unstructured roadways containing distress, potholes, water puddles, and various kinds of road patches i.e., earthen, gravel etc. To this end, we introduce Road Region Segmentation dataset (R2S100K) -- a large-scale dataset and benchmark for training and evaluation of road segmentation in aforementioned challenging unstructured roadways. R2S100K comprises 100K images extracted from a large and diverse set of video sequences covering more than 1000 KM of roadways. Out of these 100K privacy respecting images, 14,000 images have fine pixel-labeling of road regions, with 86,000 unlabeled images that can be leveraged through semi-supervised learning methods. Alongside, we present an Efficient Data Sampling (EDS) based self-training framework to improve learning by leveraging unlabeled data. Our experimental results demonstrate that the proposed method significantly improves learning methods in generalizability and reduces the labeling cost for semantic segmentation tasks. Our benchmark will be publicly available to facilitate future research at https://r2s100k.github.io/.
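
The self-training loop underneath the framework is conventional even where the EDS sampling rule is new: a teacher labels unlabeled images, confident pixels become pseudo-labels, and a student retrains on them. The plain confidence gate below is an assumption standing in for EDS.

```python
import torch

def make_pseudo_labels(teacher, images, conf_thresh=0.9):
    """One self-training step: keep only pixels the teacher is confident about;
    the rest get 255 (ignore index) and are excluded from the student's loss.
    The fixed confidence threshold stands in for the paper's EDS sampling."""
    with torch.no_grad():
        probs = torch.softmax(teacher(images), dim=1)      # (B, C, H, W)
    conf, labels = probs.max(dim=1)
    labels[conf < conf_thresh] = 255                       # drop low-confidence pixels
    return labels

# Student then trains with: F.cross_entropy(student(x), labels, ignore_index=255)
teacher = lambda x: torch.randn(x.shape[0], 19, x.shape[2], x.shape[3])  # dummy 19-class net
print(make_pseudo_labels(teacher, torch.randn(2, 3, 64, 64)).shape)      # (2, 64, 64)
```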

U-RED: Unsupervised 3D Shape Retrieval and Deformation for Partial Point Clouds

  • paper_url: http://arxiv.org/abs/2308.06383
  • repo_url: https://github.com/zhangcyg/u-red
  • paper_authors: Yan Di, Chenyangguang Zhang, Ruida Zhang, Fabian Manhardt, Yongzhi Su, Jason Rambach, Didier Stricker, Xiangyang Ji, Federico Tombari
  • for: Proposes an unsupervised shape retrieval and deformation pipeline that retrieves geometrically similar CAD models from a pre-established database and deforms them to tightly match a target object.
  • methods: Handles the one-to-many relation of a partial observation by projecting all possible full shapes onto a unit sphere, and introduces a point-wise residual-guided metric for noise-robust shape comparison.
  • results: U-RED surpasses the prior state of the art by 47.3%, 16.7%, and 31.6% under Chamfer Distance on the synthetic PartNet and ComplementMe datasets and the real-world Scan2CAD dataset, respectively.
    Abstract In this paper, we propose U-RED, an Unsupervised shape REtrieval and Deformation pipeline that takes an arbitrary object observation as input, typically captured by RGB images or scans, and jointly retrieves and deforms the geometrically similar CAD models from a pre-established database to tightly match the target. Considering existing methods typically fail to handle noisy partial observations, U-RED is designed to address this issue from two aspects. First, since one partial shape may correspond to multiple potential full shapes, the retrieval method must allow such an ambiguous one-to-many relationship. Thereby U-RED learns to project all possible full shapes of a partial target onto the surface of a unit sphere. Then during inference, each sampling on the sphere will yield a feasible retrieval. Second, since real-world partial observations usually contain noticeable noise, a reliable learned metric that measures the similarity between shapes is necessary for stable retrieval. In U-RED, we design a novel point-wise residual-guided metric that allows noise-robust comparison. Extensive experiments on the synthetic datasets PartNet, ComplementMe and the real-world dataset Scan2CAD demonstrate that U-RED surpasses existing state-of-the-art approaches by 47.3%, 16.7% and 31.6% respectively under Chamfer Distance.
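
Since results are reported under Chamfer Distance, a reference implementation of the metric is useful context. This is the standard symmetric definition with brute-force nearest neighbours for clarity, not code from the paper.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point clouds a (N, 3) and b (M, 3):
    for each cloud, the mean squared distance to the nearest neighbour in the
    other cloud (brute force, fine for a few thousand points)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)   # (N, M) pairwise squared dists
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

a = np.random.rand(512, 3)
b = a + 0.01 * np.random.randn(512, 3)     # slightly perturbed copy
print(chamfer_distance(a, b))              # small for near-identical clouds
```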

CATS v2: Hybrid encoders for robust medical segmentation

  • paper_url: http://arxiv.org/abs/2308.06377
  • repo_url: https://github.com/haoli12345/cats
  • paper_authors: Hao Li, Han Liu, Dewei Hu, Xing Yao, Jiacheng Wang, Ipek Oguz
  • for: To improve the accuracy and semantic quality of medical image segmentation by extending the CATS architecture with hybrid encoders.
  • methods: Proposes CATS v2, whose hybrid encoders pair a CNN-based encoder path with a parallel shifted-window transformer path, fusing the two at skip connections of different resolutions.
  • results: Evaluated on two public challenge datasets (CrossMoDA and MSD-5) for segmenting vestibular schwannoma (VS) and prostate, CATS v2 achieves higher Dice scores than prior methods.
    Abstract Convolutional Neural Networks (CNNs) have exhibited strong performance in medical image segmentation tasks by capturing high-level (local) information, such as edges and textures. However, due to the limited field of view of convolution kernel, it is hard for CNNs to fully represent global information. Recently, transformers have shown good performance for medical image segmentation due to their ability to better model long-range dependencies. Nevertheless, transformers struggle to capture high-level spatial features as effectively as CNNs. A good segmentation model should learn a better representation from local and global features to be both precise and semantically accurate. In our previous work, we proposed CATS, which is a U-shaped segmentation network augmented with transformer encoder. In this work, we further extend this model and propose CATS v2 with hybrid encoders. Specifically, hybrid encoders consist of a CNN-based encoder path paralleled to a transformer path with a shifted window, which better leverage both local and global information to produce robust 3D medical image segmentation. We fuse the information from the convolutional encoder and the transformer at the skip connections of different resolutions to form the final segmentation. The proposed method is evaluated on two public challenge datasets: Cross-Modality Domain Adaptation (CrossMoDA) and task 5 of Medical Segmentation Decathlon (MSD-5), to segment vestibular schwannoma (VS) and prostate, respectively. Compared with the state-of-the-art methods, our approach demonstrates superior performance in terms of higher Dice scores.

Surrogate Model for Geological CO2 Storage and Its Use in MCMC-based History Matching

  • paper_url: http://arxiv.org/abs/2308.06341
  • repo_url: None
  • paper_authors: Yifu Han, Francois P. Hamon, Su Jiang, Louis J. Durlofsky
  • for: Targets an important application in geological carbon storage operations: history matching of storage systems characterized by a high degree of prior geological uncertainty.
  • methods: The authors extend a recently introduced recurrent R-U-Net surrogate model to treat geomodel realizations drawn from a wide range of geological scenarios, using flow simulation results and a Markov chain Monte Carlo history matching workflow.
  • results: The surrogate model provides accurate predictions for new realizations over the full range of geological scenarios, with median relative error of 1.3% in pressure and 4.5% in saturation. The incorporation of the surrogate model into the history matching workflow reduces geological uncertainty and leads to posterior 3D pressure and saturation fields that display much closer agreement with the true-model responses than prior predictions.
    Abstract Deep-learning-based surrogate models show great promise for use in geological carbon storage operations. In this work we target an important application - the history matching of storage systems characterized by a high degree of (prior) geological uncertainty. Toward this goal, we extend the recently introduced recurrent R-U-Net surrogate model to treat geomodel realizations drawn from a wide range of geological scenarios. These scenarios are defined by a set of metaparameters, which include the mean and standard deviation of log-permeability, permeability anisotropy ratio, horizontal correlation length, etc. An infinite number of realizations can be generated for each set of metaparameters, so the range of prior uncertainty is large. The surrogate model is trained with flow simulation results, generated using the open-source simulator GEOS, for 2000 random realizations. The flow problems involve four wells, each injecting 1 Mt CO2/year, for 30 years. The trained surrogate model is shown to provide accurate predictions for new realizations over the full range of geological scenarios, with median relative error of 1.3% in pressure and 4.5% in saturation. The surrogate model is incorporated into a Markov chain Monte Carlo history matching workflow, where the goal is to generate history matched realizations and posterior estimates of the metaparameters. We show that, using observed data from monitoring wells in synthetic `true' models, geological uncertainty is reduced substantially. This leads to posterior 3D pressure and saturation fields that display much closer agreement with the true-model responses than do prior predictions.
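
The history-matching loop pairs the surrogate with a standard Metropolis-Hastings sampler over the metaparameters, a structure the sketch below shows under assumed choices (a Gaussian data-mismatch likelihood, flat prior, fixed step size); the surrogate call is a placeholder for the recurrent R-U-Net.

```python
import numpy as np

def metropolis_hastings(surrogate, d_obs, theta0, n_steps=5000, step=0.1, sigma=1.0):
    """MH over metaparameters theta. `surrogate(theta)` must return predicted
    monitoring-well data; likelihood form and step size are assumptions."""
    rng = np.random.default_rng(0)
    def log_post(theta):
        resid = surrogate(theta) - d_obs
        return -0.5 * np.sum(resid ** 2) / sigma ** 2     # Gaussian mismatch, flat prior
    theta, lp = theta0, log_post(theta0)
    samples = []
    for _ in range(n_steps):
        prop = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:           # accept/reject
            theta, lp = prop, lp_prop
        samples.append(theta)
    return np.array(samples)                              # posterior samples

# Toy check: the chain settles on metaparameters consistent with the data.
toy = lambda t: np.array([t[0] + t[1], t[0] * t[1]])
print(metropolis_hastings(toy, toy(np.array([1.0, 2.0])), np.zeros(2))[-1])
```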

Deep Learning-Based Open Source Toolkit for Eosinophil Detection in Pediatric Eosinophilic Esophagitis

  • paper_url: http://arxiv.org/abs/2308.06333
  • repo_url: https://github.com/hrlblab/open-eoe
  • paper_authors: Juming Xiong, Yilin Liu, Ruining Deng, Regina N Tyree, Hernan Correa, Girish Hiremath, Yaohong Wang, Yuankai Huo
  • for: To develop an open-source toolkit for automated detection of eosinophils in whole slide images for the diagnosis of eosinophilic esophagitis.
  • methods: The toolkit uses deep learning-based object detection models together with ensemble learning to improve the accuracy and reliability of eosinophil detection.
  • results: Tested on 289 whole slide images, the toolkit achieved 91% accuracy in detecting eosinophils at the widely accepted diagnostic threshold of >= 15 per high power field.
    Abstract Eosinophilic Esophagitis (EoE) is a chronic, immune/antigen-mediated esophageal disease, characterized by symptoms related to esophageal dysfunction and histological evidence of eosinophil-dominant inflammation. Owing to the intricate microscopic representation of EoE in imaging, current methodologies which depend on manual identification are not only labor-intensive but also prone to inaccuracies. In this study, we develop an open-source toolkit, named Open-EoE, to perform end-to-end whole slide image (WSI) level eosinophil (Eos) detection using one line of command via Docker. Specifically, the toolkit supports three state-of-the-art deep learning-based object detection models. Furthermore, Open-EoE further optimizes the performance by implementing an ensemble learning strategy, and enhancing the precision and reliability of our results. The experimental results demonstrated that the Open-EoE toolkit can efficiently detect Eos on a testing set with 289 WSIs. At the widely accepted threshold of >= 15 Eos per high power field (HPF) for diagnosing EoE, the Open-EoE achieved an accuracy of 91%, showing decent consistency with pathologist evaluations. This suggests a promising avenue for integrating machine learning methodologies into the diagnostic process for EoE. The docker and source code has been made publicly available at https://github.com/hrlblab/Open-EoE.
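
The diagnostic rule the toolkit automates reduces to a count threshold, so the final decision step is easy to sketch; the ensemble-voting detail is simplified away here, leaving only per-field detection counts.

```python
def eoe_positive(eos_counts_per_hpf, threshold=15):
    """EoE is diagnosed when any high power field (HPF) reaches the widely
    accepted threshold of >= 15 eosinophils; `eos_counts_per_hpf` would come
    from the ensemble detector's per-field detection counts."""
    peak = max(eos_counts_per_hpf)
    return peak >= threshold, peak

counts = [3, 7, 18, 11]             # detections in four HPFs (dummy values)
print(eoe_positive(counts))         # (True, 18)
```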

Revolutionizing Space Health (Swin-FSR): Advancing Super-Resolution of Fundus Images for SANS Visual Assessment Technology

  • paper_url: http://arxiv.org/abs/2308.06332
  • repo_url: https://github.com/FarihaHossain/SwinFSR
  • paper_authors: Khondker Fariha Hossain, Sharif Amit Kamran, Joshua Ong, Andrew G. Lee, Alireza Tavakkoli
  • for: Proposes a Swin Transformer-based fundus image super-resolution model to cope with the bandwidth-limited transfer of compressed retinal images in remote and spaceflight settings, including SANS assessment.
  • methods: Uses a Swin Transformer with spatial and depth-wise attention for fundus image super-resolution.
  • results: Achieves PSNR of 47.89, 49.00, and 45.32 on three public datasets (iChallenge-AMD, iChallenge-PM, and G1020), with comparable results on a privately held SANS dataset provided by NASA.
    Abstract The rapid accessibility of portable and affordable retinal imaging devices has made early differential diagnosis easier. For example, color funduscopy imaging is readily available in remote villages, which can help to identify diseases like age-related macular degeneration (AMD), glaucoma, or pathological myopia (PM). On the other hand, astronauts at the International Space Station utilize this camera for identifying spaceflight-associated neuro-ocular syndrome (SANS). However, due to the unavailability of experts in these locations, the data has to be transferred to an urban healthcare facility (AMD and glaucoma) or a terrestrial station (e.g, SANS) for more precise disease identification. Moreover, due to low bandwidth limits, the imaging data has to be compressed for transfer between these two places. Different super-resolution algorithms have been proposed throughout the years to address this. Furthermore, with the advent of deep learning, the field has advanced so much that x2 and x4 compressed images can be decompressed to their original form without losing spatial information. In this paper, we introduce a novel model called Swin-FSR that utilizes Swin Transformer with spatial and depth-wise attention for fundus image super-resolution. Our architecture achieves Peak signal-to-noise-ratio (PSNR) of 47.89, 49.00 and 45.32 on three public datasets, namely iChallenge-AMD, iChallenge-PM, and G1020. Additionally, we tested the model's effectiveness on a privately held dataset for SANS provided by NASA and achieved comparable results against previous architectures.
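
PSNR, the metric reported above, follows directly from mean squared error. The reference implementation below assumes images scaled to [0, 1]; it is standard, not paper-specific.

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB between images x and y in [0, 1]."""
    mse = np.mean((x - y) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(max_val ** 2 / mse)

hr = np.random.rand(256, 256)
sr = np.clip(hr + 0.01 * np.random.randn(256, 256), 0, 1)  # fake super-resolved output
print(psnr(sr, hr))   # roughly 40 dB for noise of std 0.01
```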

A Hierarchical Descriptor Framework for On-the-Fly Anatomical Location Matching between Longitudinal Studies

  • paper_url: http://arxiv.org/abs/2308.07337
  • repo_url: None
  • paper_authors: Halid Ziya Yerebakan, Yoshihisa Shinagawa, Mahesh Ranganath, Simon Allen-Raffl, Gerardo Hermosillo Valadez
  • for: Matching anatomical locations between pairs of medical images in longitudinal comparisons.
  • methods: Computes a descriptor of the query point via hierarchical sparse sampling of image intensities, then runs a hierarchical search for the point with the most similar descriptor in the target image.
  • results: Reduces computation time to the millisecond scale on a single CPU, letting radiologists compare corresponding anatomical locations in near real-time without extra architecture for precomputing or storing deformation fields.
    Abstract We propose a method to match anatomical locations between pairs of medical images in longitudinal comparisons. The matching is made possible by computing a descriptor of the query point in a source image based on a hierarchical sparse sampling of image intensities that encode the location information. Then, a hierarchical search operation finds the corresponding point with the most similar descriptor in the target image. This simple yet powerful strategy reduces the computational time of mapping points to a millisecond scale on a single CPU. Thus, radiologists can compare similar anatomical locations in near real-time without requiring extra architectural costs for precomputing or storing deformation fields from registrations. Our algorithm does not require prior training, resampling, segmentation, or affine transformation steps. We have tested our algorithm on the recently published Deep Lesion Tracking dataset annotations. We observed more accurate matching compared to Deep Lesion Tracker while being 24 times faster than the most precise algorithm reported therein. We also investigated the matching accuracy on CT and MR modalities and compared the proposed algorithm's accuracy against ground truth consolidated from multiple radiologists.
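
A stripped-down 2D version of the descriptor-and-search idea is sketched below: intensities are sampled on rings of several radii around a point, and matching is a descriptor comparison. The offset pattern and brute-force search are assumptions; the real method samples hierarchically, works in 3D, and searches coarse-to-fine rather than exhaustively.

```python
import numpy as np

def descriptor(img, y, x, radii=(1, 2, 4, 8)):
    """Sparse multi-scale intensity descriptor around (y, x): 8 samples on a
    ring at each radius, a simplification of hierarchical sparse sampling."""
    ang = np.linspace(0, 2 * np.pi, 8, endpoint=False)
    feats = []
    for r in radii:
        ys = np.clip((y + r * np.sin(ang)).astype(int), 0, img.shape[0] - 1)
        xs = np.clip((x + r * np.cos(ang)).astype(int), 0, img.shape[1] - 1)
        feats.append(img[ys, xs])
    return np.concatenate(feats)

def match_point(src, y, x, dst, stride=2):
    """Brute-force stand-in for the hierarchical search: return the dst pixel
    whose descriptor is closest to the query descriptor in src."""
    q = descriptor(src, y, x)
    best, best_d = None, np.inf
    for yy in range(8, dst.shape[0] - 8, stride):
        for xx in range(8, dst.shape[1] - 8, stride):
            d = np.sum((descriptor(dst, yy, xx) - q) ** 2)
            if d < best_d:
                best, best_d = (yy, xx), d
    return best

img = np.random.rand(64, 64)
print(match_point(img, 30, 30, img))   # recovers the query point on the same image
```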

FunnyBirds: A Synthetic Vision Dataset for a Part-Based Analysis of Explainable AI Methods

  • paper_url: http://arxiv.org/abs/2308.06248
  • repo_url: https://github.com/visinf/funnybirds
  • paper_authors: Robin Hesse, Simone Schaub-Meyer, Stefan Roth
  • for: To help uncover the inner workings of complex deep neural models in explainable AI (XAI).
  • methods: Introduces FunnyBirds, a novel synthetic vision dataset, together with automatic evaluation protocols that address XAI's lack of ground-truth explanations.
  • results: Reports results for 24 combinations of neural models and XAI methods, demonstrating their strengths and weaknesses in a fully automatic and systematic manner.
    Abstract The field of explainable artificial intelligence (XAI) aims to uncover the inner workings of complex deep neural models. While being crucial for safety-critical domains, XAI inherently lacks ground-truth explanations, making its automatic evaluation an unsolved problem. We address this challenge by proposing a novel synthetic vision dataset, named FunnyBirds, and accompanying automatic evaluation protocols. Our dataset allows performing semantically meaningful image interventions, e.g., removing individual object parts, which has three important implications. First, it enables analyzing explanations on a part level, which is closer to human comprehension than existing methods that evaluate on a pixel level. Second, by comparing the model output for inputs with removed parts, we can estimate ground-truth part importances that should be reflected in the explanations. Third, by mapping individual explanations into a common space of part importances, we can analyze a variety of different explanation types in a single common framework. Using our tools, we report results for 24 different combinations of neural models and XAI methods, demonstrating the strengths and weaknesses of the assessed methods in a fully automatic and systematic manner.
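
The dataset's central trick, estimating ground-truth part importances via interventions, amounts to a difference-of-outputs computation. In the sketch below, masking a part to zero is a crude stand-in for FunnyBirds' actual re-rendering of the scene without that part.

```python
import torch

def part_importances(model, image, part_masks):
    """Estimate each part's importance as the drop in the model's top class
    score when that part is removed (here: masked out; FunnyBirds instead
    re-renders the image without the part)."""
    with torch.no_grad():
        base = model(image).max()
        drops = []
        for m in part_masks:                        # m: (1, H, W) binary part mask
            drops.append((base - model(image * (1 - m)).max()).item())
    return drops                                     # one importance score per part

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
img = torch.rand(1, 3, 32, 32)
masks = [torch.zeros(1, 32, 32), torch.ones(1, 32, 32)]   # dummy "beak"/"whole bird" parts
print(part_importances(model, img, masks))
```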

Continual Face Forgery Detection via Historical Distribution Preserving

  • paper_url: http://arxiv.org/abs/2308.06217
  • repo_url: None
  • paper_authors: Ke Sun, Shen Chen, Taiping Yao, Xiaoshuai Sun, Shouhong Ding, Rongrong Ji
  • for: Defending against the security threats posed by face forgery attacks.
  • methods: Uses universal adversarial perturbation to simulate historical forgery distributions, knowledge distillation, and a Historical Distribution Preserving (HDP) framework.
  • results: Detects new forgery attacks more effectively than prior methods while preserving performance on previously seen forgery distributions.
    Abstract Face forgery techniques have advanced rapidly and pose serious security threats. Existing face forgery detection methods try to learn generalizable features, but they still fall short of practical application. Additionally, finetuning these methods on historical training data is resource-intensive in terms of time and storage. In this paper, we focus on a novel and challenging problem: Continual Face Forgery Detection (CFFD), which aims to efficiently learn from new forgery attacks without forgetting previous ones. Specifically, we propose a Historical Distribution Preserving (HDP) framework that reserves and preserves the distributions of historical faces. To achieve this, we use universal adversarial perturbation (UAP) to simulate historical forgery distribution, and knowledge distillation to maintain the distribution variation of real faces across different models. We also construct a new benchmark for CFFD with three evaluation protocols. Our extensive experiments on the benchmarks show that our method outperforms the state-of-the-art competitors.

cs.AI - 2023-08-12

VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use

  • paper_url: http://arxiv.org/abs/2308.06595
  • repo_url: None
  • paper_authors: Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, Ludwig Schimdt
  • for: To evaluate vision-language models' instruction-following ability in real-world scenarios.
  • methods: Curates 70 'instruction families' and 592 test queries, spanning tasks from basic recognition to game playing and creative generation.
  • results: Using both human and automatic evaluation, finds a relatively large quality gap between existing models and the reference; the benchmark is dynamic, letting researchers and practitioners simply submit their model's responses on the project website.
    Abstract We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluation of instruction-following vision-language models for real-world use. Our starting point is curating 70 'instruction families' that we envision instruction tuned vision-language models should be able to address. Extending beyond evaluations like VQAv2 and COCO, tasks range from basic recognition to game playing and creative generation. Following curation, our dataset comprises 592 test queries, each with a human-authored instruction-conditioned caption. These descriptions surface instruction-specific factors, e.g., for an instruction asking about the accessibility of a storefront for wheelchair users, the instruction-conditioned caption describes ramps/potential obstacles. These descriptions enable 1) collecting human-verified reference outputs for each instance; and 2) automatic evaluation of candidate multimodal generations using a text-only LLM, aligning with human judgment. We quantify quality gaps between models and references using both human and automatic evaluations; e.g., the top-performing instruction-following model wins against the GPT-4 reference in just 27% of the comparison. VisIT-Bench is dynamic to participate, practitioners simply submit their model's response on the project website; Data, code and leaderboard is available at visit-bench.github.io.

Value-Distributional Model-Based Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.06590
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Carlos E. Luis, Alessandro G. Bottero, Julia Vinogradska, Felix Berkenkamp, Jan Peters
  • for: To quantify uncertainty about a policy's long-term performance in sequential decision-making tasks.
  • methods: Takes a model-based Bayesian reinforcement learning approach whose goal is to learn the posterior distribution over value functions induced by parameter (epistemic) uncertainty of the Markov decision process.
  • results: Experiments show the EQR algorithm outperforms established model-based and model-free algorithms on continuous-control tasks.
    Abstract Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks. We study the problem from a model-based Bayesian reinforcement learning perspective, where the goal is to learn the posterior distribution over value functions induced by parameter (epistemic) uncertainty of the Markov decision process. Previous work restricts the analysis to a few moments of the distribution over values or imposes a particular distribution shape, e.g., Gaussians. Inspired by distributional reinforcement learning, we introduce a Bellman operator whose fixed-point is the value distribution function. Based on our theory, we propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function that can be used for policy optimization. Evaluation across several continuous-control tasks shows performance benefits with respect to established model-based and model-free algorithms.
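
EQR's name points at quantile regression, so the standard pinball loss used to learn a set of value quantiles in distributional RL is sketched below; it omits the Bellman-target machinery and model-based components of the full algorithm.

```python
import torch

def quantile_regression_loss(pred_quantiles, targets):
    """Pinball loss: pred_quantiles (B, N) are N predicted value quantiles at
    fractions tau_i = (i + 0.5) / N; targets (B, 1) are sampled value targets.
    The asymmetric weight pushes each output toward its assigned quantile."""
    n = pred_quantiles.shape[1]
    tau = (torch.arange(n) + 0.5) / n                       # (N,)
    u = targets - pred_quantiles                            # (B, N) residuals
    return (torch.abs(tau - (u < 0).float()) * torch.abs(u)).mean()

pred = torch.randn(4, 8)        # 8 quantiles of the value distribution
targ = torch.randn(4, 1)        # e.g., r + gamma * V'(s') samples
print(quantile_regression_loss(pred, targ))
```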

Approximate Answering of Graph Queries

  • paper_url: http://arxiv.org/abs/2308.06585
  • repo_url: None
  • paper_authors: Michael Cochez, Dimitrios Alivanistos, Erik Arakelyan, Max Berrendorf, Daniel Daza, Mikhail Galkin, Pasquale Minervini, Mathias Niepert, Hongyu Ren
  • for: To give an overview of methods for answering queries over knowledge graphs (KGs) despite their inherent incompleteness.
  • methods: Surveys several approaches, including prediction-based, latent-similarity-based, and evidence-based methods, covering the query types each can support.
  • results: These methods can address a range of query-answering problems, such as answer inference, entity disambiguation, and relation extraction, but remain limited by the incompleteness and inaccuracy of graph data.
    Abstract Knowledge graphs (KGs) are inherently incomplete because of incomplete world knowledge and bias in what is the input to the KG. Additionally, world knowledge constantly expands and evolves, making existing facts deprecated or introducing new ones. However, we would still want to be able to answer queries as if the graph were complete. In this chapter, we will give an overview of several methods which have been proposed to answer queries in such a setting. We will first provide an overview of the different query types which can be supported by these methods and datasets typically used for evaluation, as well as an insight into their limitations. Then, we give an overview of the different approaches and describe them in terms of expressiveness, supported graph types, and inference capabilities.

4DRVO-Net: Deep 4D Radar-Visual Odometry Using Multi-Modal and Multi-Scale Adaptive Fusion

  • paper_url: http://arxiv.org/abs/2308.06573
  • repo_url: None
  • paper_authors: Guirong Zhuo, Shouyi Lu, Huanyu Zhou, Lianqing Zheng, Lu Xiong
  • for: 4D radar–visual odometry (4DRVO) is an attractive solution for accurate and robust pose estimation, integrating complementary information from 4D radar and cameras.
  • methods: 4DRVO-Net leverages a feature pyramid, pose warping, and cost volume (PWC) network architecture to progressively estimate and refine poses, with a multi-scale feature extraction network called Radar-PointNet++ that fully considers rich 4D radar point information; an adaptive 4D radar–camera fusion module (A-RCFM) automatically selects image features based on 4D radar point features, enabling multi-scale cross-modal feature interaction and adaptive multi-modal feature fusion.
  • results: The method outperforms all learning-based and geometry-based methods on most sequences of the VoD dataset and closely approaches the 64-line LiDAR odometry results of A-LOAM without mapping optimization.
    Abstract Four-dimensional (4D) radar--visual odometry (4DRVO) integrates complementary information from 4D radar and cameras, making it an attractive solution for achieving accurate and robust pose estimation. However, 4DRVO may exhibit significant tracking errors owing to three main factors: 1) sparsity of 4D radar point clouds; 2) inaccurate data association and insufficient feature interaction between the 4D radar and camera; and 3) disturbances caused by dynamic objects in the environment, affecting odometry estimation. In this paper, we present 4DRVO-Net, which is a method for 4D radar--visual odometry. This method leverages the feature pyramid, pose warping, and cost volume (PWC) network architecture to progressively estimate and refine poses. Specifically, we propose a multi-scale feature extraction network called Radar-PointNet++ that fully considers rich 4D radar point information, enabling fine-grained learning for sparse 4D radar point clouds. To effectively integrate the two modalities, we design an adaptive 4D radar--camera fusion module (A-RCFM) that automatically selects image features based on 4D radar point features, facilitating multi-scale cross-modal feature interaction and adaptive multi-modal feature fusion. In addition, we introduce a velocity-guided point-confidence estimation module to measure local motion patterns, reduce the influence of dynamic objects and outliers, and provide continuous updates during pose refinement. We demonstrate the excellent performance of our method and the effectiveness of each module design on both the VoD and in-house datasets. Our method outperforms all learning-based and geometry-based methods for most sequences in the VoD dataset. Furthermore, it has exhibited promising performance that closely approaches that of the 64-line LiDAR odometry results of A-LOAM without mapping optimization.

ModelScope Text-to-Video Technical Report

  • paper_url: http://arxiv.org/abs/2308.06571
  • repo_url: None
  • paper_authors: Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang
  • for: This paper describes ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion).
  • methods: The model adopts spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions, and can adapt to varying frame numbers during training and inference. It comprises three components (i.e., VQGAN, a text encoder, and a denoising UNet) totalling 1.7 billion parameters, of which 0.5 billion are dedicated to temporal capabilities.
  • results: The model outperforms state-of-the-art methods across three evaluation metrics. The code and an online demo are available at \url{https://modelscope.cn/models/damo/text-to-video-synthesis/summary}.
    Abstract This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model could adapt to varying frame numbers during training and inference, rendering it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), totally comprising 1.7 billion parameters, in which 0.5 billion parameters are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at \url{https://modelscope.cn/models/damo/text-to-video-synthesis/summary}.

MC-DRE: Multi-Aspect Cross Integration for Drug Event/Entity Extraction

  • paper_url: http://arxiv.org/abs/2308.06546
  • repo_url: None
  • paper_authors: Jie Yang, Soyeon Caren Han, Siqu Long, Josiah Poon, Goran Nenadic
  • For: This paper proposes a new multi-aspect cross-integration framework for drug entity/event detection in drug-related documents.
  • Methods: The proposed framework uses multi-aspect encoders to describe semantic, syntactic, and medical document contextual information, and conducts cross-integration of different contextual information in three ways: key-value cross, attention cross, and feedforward cross.
  • Results: The proposed model outperforms all state-of-the-art (SOTA) models on two widely used tasks, flat entity detection and discontinuous event extraction.
    Abstract Extracting meaningful drug-related information chunks, such as adverse drug events (ADE), is crucial for preventing morbidity and saving many lives. Most ADEs are reported in unstructured conversations within a medical context, so applying a general entity recognition approach is not sufficient. In this paper, we propose a new multi-aspect cross-integration framework for drug entity/event detection by capturing and aligning different context/language/knowledge properties from drug-related documents. We first construct multi-aspect encoders to describe semantic, syntactic, and medical document contextual information by conducting slot tagging tasks: main drug entity/event detection, part-of-speech tagging, and general medical named entity recognition. Then, each encoder conducts cross-integration with the other contextual information in three ways: the key-value cross, attention cross, and feedforward cross, so the multi-encoders are integrated in depth. Our model outperforms all SOTA on two widely used tasks, flat entity detection and discontinuous event extraction.

Digital elevation model correction in urban areas using extreme gradient boosting, land cover and terrain parameters

  • paper_url: http://arxiv.org/abs/2308.06545
  • repo_url: None
  • paper_authors: Chukwuma Okolie, Jon Mills, Adedayo Adeleke, Julian Smit
  • For: The paper aims to enhance the accuracy of medium-resolution digital elevation models (DEMs) in urban areas, specifically in Cape Town, South Africa, for hydrological and environmental modelling.
  • Methods: The authors use the extreme gradient boosting (XGBoost) ensemble algorithm to correct the DEMs, with eleven predictor variables including elevation, urban footprints, slope, aspect, surface roughness, and more.
  • Results: The corrected DEMs achieved significant accuracy gains, with a root mean square error (RMSE) improvement of 46-53% for Copernicus DEM and 72-73% for AW3D DEM, compared to other proposed methods. These results demonstrate the potential of gradient boosted trees for enhancing DEM quality and improving hydrological modelling in urban catchments.
    Abstract The accuracy of digital elevation models (DEMs) in urban areas is influenced by numerous factors including land cover and terrain irregularities. Moreover, building artifacts in global DEMs cause artificial blocking of surface flow pathways. This compromises their quality and adequacy for hydrological and environmental modelling in urban landscapes where precise and accurate terrain information is needed. In this study, the extreme gradient boosting (XGBoost) ensemble algorithm is adopted for enhancing the accuracy of two medium-resolution 30m DEMs over Cape Town, South Africa: Copernicus GLO-30 and ALOS World 3D (AW3D). XGBoost is a scalable, portable and versatile gradient boosting library that can solve many environmental modelling problems. The training datasets are comprised of eleven predictor variables including elevation, urban footprints, slope, aspect, surface roughness, topographic position index, terrain ruggedness index, terrain surface texture, vector roughness measure, forest cover and bare ground cover. The target variable (elevation error) was calculated with respect to highly accurate airborne LiDAR. After training and testing, the model was applied for correcting the DEMs at two implementation sites. The correction achieved significant accuracy gains which are competitive with other proposed methods. The root mean square error (RMSE) of Copernicus DEM improved by 46 to 53% while the RMSE of AW3D DEM improved by 72 to 73%. These results showcase the potential of gradient boosted trees for enhancing the quality of DEMs, and for improved hydrological modelling in urban catchments.
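The correction workflow itself is simple to sketch: regress the DEM's elevation error from the terrain and land-cover predictors, then subtract the predicted error from the raw elevations. The data below is synthetic and the hyperparameters are illustrative, not the study's configuration.

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
n = 10_000

# Stand-ins for the eleven predictors (elevation, slope, aspect, roughness, ...).
X = rng.normal(size=(n, 11))
# Target: DEM elevation error relative to an accurate LiDAR reference (synthetic).
error = 0.8 * X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.3, size=n)
# Raw DEM heights at the same cells (synthetic).
dem_raw = 100.0 + 20.0 * rng.normal(size=n) + error

train, test = slice(0, 8000), slice(8000, None)
model = XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05,
                     subsample=0.8, objective="reg:squarederror")
model.fit(X[train], error[train])

# Correct the DEM by subtracting the predicted error from the raw elevations.
dem_corrected = dem_raw[test] - model.predict(X[test])

rmse_before = np.sqrt(np.mean(error[test] ** 2))
rmse_after = np.sqrt(np.mean((error[test] - model.predict(X[test])) ** 2))
print(f"error RMSE before: {rmse_before:.3f} m, after: {rmse_after:.3f} m")
```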

Dealing with Small Datasets for Deep Learning in Medical Imaging: An Evaluation of Self-Supervised Pre-Training on CT Scans Comparing Contrastive and Masked Autoencoder Methods for Convolutional Models

  • paper_url: http://arxiv.org/abs/2308.06534
  • repo_url: https://github.com/wolfda95/ssl-medicalimagining-cl-mae
  • paper_authors: Daniel Wolf, Tristan Payer, Catharina Silvia Lisson, Christoph Gerhard Lisson, Meinrad Beer, Timo Ropinski, Michael Götz
  • for: This paper examines deep learning in medical imaging, with the aims of minimizing the risk of diagnostic errors, reducing radiologist workload, and accelerating diagnosis.
  • methods: Models are pre-trained with self-supervised learning on a large unannotated CT image dataset, comparing state-of-the-art contrastive learning methods with the masked autoencoder approach SparK for convolutional models, followed by fine-tuning on several CT classification tasks.
  • results: The study finds that SparK pre-training is more robust to small annotated fine-tuning datasets and is therefore recommended for medical imaging tasks with only small annotated datasets.
    Abstract Deep learning in medical imaging has the potential to minimize the risk of diagnostic errors, reduce radiologist workload, and accelerate diagnosis. Training such deep learning models requires large and accurate datasets, with annotations for all training samples. However, in the medical imaging domain, annotated datasets for specific tasks are often small due to the high complexity of annotations, limited access, or the rarity of diseases. To address this challenge, deep learning models can be pre-trained on large image datasets without annotations using methods from the field of self-supervised learning. After pre-training, small annotated datasets are sufficient to fine-tune the models for a specific task. The most popular self-supervised pre-training approaches in medical imaging are based on contrastive learning. However, recent studies in natural image processing indicate a strong potential for masked autoencoder approaches. Our work compares state-of-the-art contrastive learning methods with the recently introduced masked autoencoder approach "SparK" for convolutional neural networks (CNNs) on medical images. Therefore we pre-train on a large unannotated CT image dataset and fine-tune on several CT classification tasks. Due to the challenge of obtaining sufficient annotated training data in medical imaging, it is of particular interest to evaluate how the self-supervised pre-training methods perform when fine-tuning on small datasets. By experimenting with gradually reducing the training dataset size for fine-tuning, we find that the reduction has different effects depending on the type of pre-training chosen. The SparK pre-training method is more robust to the training dataset size than the contrastive methods. Based on our results, we propose the SparK pre-training for medical imaging tasks with only small annotated datasets.
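The evaluation protocol, fine-tuning a pre-trained encoder on progressively smaller fractions of the labeled set, can be sketched as follows. The encoder, head, dataset, and fraction grid are synthetic placeholders; the actual SparK or contrastive pre-training is assumed to have produced the encoder.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Subset, TensorDataset

def finetune_on_fraction(encoder, head, dataset, fraction, epochs=3):
    """Fine-tune on a random `fraction` of the labeled data (hypothetical helper)."""
    n = max(1, int(len(dataset) * fraction))
    idx = torch.randperm(len(dataset))[:n].tolist()
    loader = DataLoader(Subset(dataset, idx), batch_size=32, shuffle=True)
    model = nn.Sequential(encoder, head)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

# Synthetic stand-ins for a pre-trained CT encoder and a labeled dataset.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(64, 32), nn.ReLU())
dataset = TensorDataset(torch.randn(500, 64), torch.randint(0, 2, (500,)))

# Sweep label-set sizes to probe how robust the pre-training is to small data.
for frac in (1.0, 0.25, 0.05):
    head = nn.Linear(32, 2)
    finetune_on_fraction(encoder, head, dataset, frac)
    print(f"fine-tuned on {frac:.0%} of labels")
```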

Learning Abstract Visual Reasoning via Task Decomposition: A Case Study in Raven Progressive Matrices

  • paper_url: http://arxiv.org/abs/2308.06528
  • repo_url: https://github.com/jakubkwiatkowski/abstract_compositional_transformer
  • paper_authors: Jakub Kwiatkowski, Krzysztof Krawiec
  • for: The paper aims to improve the performance of solving Raven Progressive Matrices (RPM) tasks using deep learning.
  • methods: The proposed method uses a transformer-based architecture to predict the visual properties of individual objects and their arrangements, rather than directly choosing the answer. The model parses the visual input into tokens and is trained using self-supervised methods with various masking regimes.
  • results: The proposed method outperforms state-of-the-art methods and provides interesting insights and partial explanations about the inference. Additionally, the design of the method is immune to biases that exist in some RPM benchmarks.
    Abstract One of the challenges in learning to perform abstract reasoning is that problems are often posed as monolithic tasks, with no intermediate subgoals. In Raven Progressive Matrices (RPM), the task is to choose one of the available answers given a context, where both contexts and answers are composite images featuring multiple objects in various spatial arrangements. As this high-level goal is the only guidance available, learning is challenging and most contemporary solvers tend to be opaque. In this study, we propose a deep learning architecture based on the transformer blueprint which, rather than directly making the above choice, predicts the visual properties of individual objects and their arrangements. The multidimensional predictions obtained in this way are then directly juxtaposed to choose the answer. We consider a few ways in which the model parses the visual input into tokens and several regimes of masking parts of the input in self-supervised training. In experimental assessment, the models not only outperform state-of-the-art methods but also provide interesting insights and partial explanations about the inference. The design of the method also makes it immune to biases that are known to exist in some RPM benchmarks.

SLoRA: Federated Parameter Efficient Fine-Tuning of Language Models

  • paper_url: http://arxiv.org/abs/2308.06522
  • repo_url: None
  • paper_authors: Sara Babakniya, Ahmed Roushdy Elkordy, Yahya H. Ezzeldin, Qingfeng Liu, Kee-Bong Song, Mostafa El-Khamy, Salman Avestimehr
  • for: This paper explores fine-tuning already pre-trained transformer models within Federated Learning (FL) to obtain state-of-the-art results on language tasks.
  • methods: The paper applies parameter-efficient fine-tuning (PEFT) and proposes a new method, SLoRA, to bridge the performance gap between PEFT and full fine-tuning in highly heterogeneous data scenarios.
  • results: Experiments show that SLoRA achieves performance comparable to full fine-tuning with sparse updates of approximately $\sim 1\%$ density while reducing training time by up to $90\%$.
    Abstract Transfer learning via fine-tuning pre-trained transformer models has gained significant success in delivering state-of-the-art results across various NLP tasks. In the absence of centralized data, Federated Learning (FL) can benefit from distributed and private data of the FL edge clients for fine-tuning. However, due to the limited communication, computation, and storage capabilities of edge devices and the huge sizes of popular transformer models, efficient fine-tuning is crucial to make federated training feasible. This work explores the opportunities and challenges associated with applying parameter efficient fine-tuning (PEFT) methods in different FL settings for language tasks. Specifically, our investigation reveals that as the data across users becomes more diverse, the gap between fully fine-tuning the model and employing PEFT methods widens. To bridge this performance gap, we propose a method called SLoRA, which overcomes the key limitations of LoRA in high heterogeneous data scenarios through a novel data-driven initialization technique. Our experimental results demonstrate that SLoRA achieves performance comparable to full fine-tuning, with significant sparse updates with approximately $\sim 1\%$ density while reducing training time by up to $90\%$.
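For reference, LoRA-style parameter-efficient fine-tuning freezes the pre-trained weight and learns a low-rank update. A minimal sketch of such a layer follows; the dimensions are illustrative, and SLoRA's data-driven initialization (the paper's key addition) is not reproduced here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update
    (standard LoRA). SLoRA's data-driven initialization of A and B is the
    paper's contribution and is not reproduced here."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze pre-trained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = base(x) + scale * x A^T B^T; only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")  # about 2% at rank 8
```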

One-bit Flip is All You Need: When Bit-flip Attack Meets Model Training

  • paper_url: http://arxiv.org/abs/2308.07934
  • repo_url: https://github.com/jianshuod/tba
  • paper_authors: Jianshuo Dong, Han Qiu, Yiming Li, Tianwei Zhang, Yuanjie Li, Zeqi Lai, Chao Zhang, Shu-Tao Xia
  • For: This paper aims to propose a training-assisted bit flip attack on deep neural networks (DNNs) to compromise their security.
  • Methods: The attack exploits memory fault injection techniques such as row hammer and involves the adversary in the training stage to build a high-risk model. In the deployment stage, the attack can convert the high-risk model to a malicious one on the victim's side by flipping only one critical bit on average.
  • Results: The attack poses a significant threat even when defenses are employed, and the adversary can easily convert the high-risk model to a malicious one by flipping only one critical bit on average.
    Abstract Deep neural networks (DNNs) are widely deployed on real-world devices. Concerns regarding their security have gained great attention from researchers. Recently, a new weight modification attack called bit flip attack (BFA) was proposed, which exploits memory fault inject techniques such as row hammer to attack quantized models in the deployment stage. With only a few bit flips, the target model can be rendered useless as a random guesser or even be implanted with malicious functionalities. In this work, we seek to further reduce the number of bit flips. We propose a training-assisted bit flip attack, in which the adversary is involved in the training stage to build a high-risk model to release. This high-risk model, obtained coupled with a corresponding malicious model, behaves normally and can escape various detection methods. The results on benchmark datasets show that an adversary can easily convert this high-risk but normal model to a malicious one on victim's side by \textbf{flipping only one critical bit} on average in the deployment stage. Moreover, our attack still poses a significant threat even when defenses are employed. The codes for reproducing main experiments are available at \url{https://github.com/jianshuod/TBA}.
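The deployment-stage primitive, flipping a single bit of one quantized weight in memory, can be illustrated on an int8 tensor. This is a didactic stand-in for what a row-hammer-style fault achieves; the paper's training-assisted attack and its selection of the critical bit are not reproduced, and the index and bit position below are arbitrary.

```python
import torch

def flip_bit(int8_weights: torch.Tensor, index: int, bit: int) -> None:
    """Flip one bit of one int8 weight in place, a didactic stand-in for a
    memory fault such as row hammer. Index and bit position are arbitrary here;
    the paper's attack selects a critical bit, which is not reproduced."""
    flat = int8_weights.view(-1)
    unsigned = flat[index].item() & 0xFF  # reinterpret the byte as unsigned
    flipped = unsigned ^ (1 << bit)       # XOR the chosen bit
    flat[index] = flipped - 256 if flipped >= 128 else flipped  # back to signed

w = torch.randint(-128, 127, (4, 4), dtype=torch.int8)
before = w[0, 0].item()
flip_bit(w, index=0, bit=7)   # flipping the most significant bit changes the sign
print(before, "->", w[0, 0].item())
```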

HyperFormer: Enhancing Entity and Relation Interaction for Hyper-Relational Knowledge Graph Completion

  • paper_url: http://arxiv.org/abs/2308.06512
  • repo_url: https://github.com/zhiweihu1103/hkgc-hyperformer
  • paper_authors: Zhiwei Hu, Víctor Gutiérrez-Basulto, Zhiliang Xiang, Ru Li, Jeff Z. Pan
  • for: The paper targets completion of hyper-relational knowledge graphs (HKGs) with attribute-value qualifiers, inferring unknown triples while taking their qualifiers into account.
  • methods: The proposed HyperFormer model exploits local-level sequential information encoding the content of entities, relations, and qualifiers to improve triple prediction. It comprises three modules: an entity neighbor aggregator, a relation qualifier aggregator, and a convolution-based bidirectional interaction module.
  • results: Extensive experiments on three knowledge graph datasets under four different conditions validate the model's effectiveness, with clear advantages over the compared methods. Code and data are available on GitHub.
    Abstract Hyper-relational knowledge graphs (HKGs) extend standard knowledge graphs by associating attribute-value qualifiers to triples, which effectively represent additional fine-grained information about the associated triple. Hyper-relational knowledge graph completion (HKGC) aims at inferring unknown triples while considering their qualifiers. Most existing approaches to HKGC exploit a global-level graph structure to encode hyper-relational knowledge into the graph convolution message passing process. However, the addition of multi-hop information might bring noise into the triple prediction process. To address this problem, we propose HyperFormer, a model that considers local-level sequential information, which encodes the content of the entities, relations and qualifiers of a triple. More precisely, HyperFormer is composed of three different modules: an entity neighbor aggregator module allowing to integrate the information of the neighbors of an entity to capture different perspectives of it; a relation qualifier aggregator module to integrate hyper-relational knowledge into the corresponding relation to refine the representation of relational content; and a convolution-based bidirectional interaction module that captures pairwise bidirectional interactions of entity-relation, entity-qualifier, and relation-qualifier pairs, realizing deep perception of the content related to the current statement. Furthermore, we introduce a Mixture-of-Experts strategy into the feed-forward layers of HyperFormer to strengthen its representation capabilities while reducing the amount of model parameters and computation. Extensive experiments on three well-known datasets with four different conditions demonstrate HyperFormer's effectiveness. Datasets and code are available at https://github.com/zhiweihu1103/HKGC-HyperFormer.

Three Ways of Using Large Language Models to Evaluate Chat

  • paper_url: http://arxiv.org/abs/2308.06502
  • repo_url: https://github.com/oplatek/chateval-llm
  • paper_authors: Ondřej Plátek, Vojtěch Hudeček, Patricia Schmidtová, Mateusz Lango, Ondřej Dušek
  • for: This paper describes the systems submitted by team6 for ChatEval, the DSTC 11 Track 4 competition: three approaches to predicting turn-level qualities of chatbot responses based on large language models (LLMs).
  • methods: The three approaches include prompting ChatGPT with dynamic few-shot examples retrieved from a vector store, alongside two other LLM-based methods that are analyzed with a view to future improvements.
  • results: The dynamic few-shot prompting approach improves over the baseline; the performance of the other two approaches is analyzed and the improvements needed for future work are reported.
    Abstract This paper describes the systems submitted by team6 for ChatEval, the DSTC 11 Track 4 competition. We present three different approaches to predicting turn-level qualities of chatbot responses based on large language models (LLMs). We report improvement over the baseline using dynamic few-shot examples from a vector store for the prompts for ChatGPT. We also analyze the performance of the other two approaches and report needed improvements for future work. We developed the three systems over just two weeks, showing the potential of LLMs for this task. An ablation study conducted after the challenge deadline shows that the new Llama 2 models are closing the performance gap between ChatGPT and open-source LLMs. However, we find that the Llama 2 models do not benefit from few-shot examples in the same way as ChatGPT.
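The reported improvement comes from building the ChatGPT prompt with dynamic few-shot examples retrieved from a vector store. A minimal sketch of that retrieval step follows; the embedding function, store layout, and prompt template are assumptions rather than the team's implementation (their code is in the linked repository).

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding (hash-seeded random vector); a real system would
    call a sentence encoder or an embeddings API here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

# Tiny "vector store": rated (context, response, quality) examples.
store = [
    ("How are you?", "I'm great, thanks for asking!", 5),
    ("Tell me a joke.", "Why did the chicken cross the road?", 3),
    ("What's the weather?", "I cannot check live weather.", 4),
]
vectors = np.stack([embed(c + " " + r) for c, r, _ in store])

def build_prompt(context: str, response: str, k: int = 2) -> str:
    """Select the k nearest rated examples and format a grading prompt."""
    sims = vectors @ embed(context + " " + response)
    shots = [store[i] for i in np.argsort(-sims)[:k]]
    demo = "\n".join(f"Context: {c}\nResponse: {r}\nScore: {s}" for c, r, s in shots)
    return f"{demo}\nContext: {context}\nResponse: {response}\nScore:"

print(build_prompt("Any plans today?", "Just relaxing at home."))
```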

Latent Emission-Augmented Perspective-Taking (LEAPT) for Human-Robot Interaction

  • paper_url: http://arxiv.org/abs/2308.06498
  • repo_url: None
  • paper_authors: Kaiqi Chen, Jing Yu Lim, Kingsley Kuan, Harold Soh
  • for: This paper aims to enable robots to perform perspective-taking, i.e., to understand a human's viewpoint and beliefs.
  • methods: A deep world model allows the robot to perform both perceptual and conceptual perspective-taking, i.e., to infer what a human sees and believes.
  • results: Experiments show that the method significantly outperforms existing baselines on three partially-observable human-robot interaction tasks.
    Abstract Perspective-taking is the ability to perceive or understand a situation or concept from another individual's point of view, and is crucial in daily human interactions. Enabling robots to perform perspective-taking remains an unsolved problem; existing approaches that use deterministic or handcrafted methods are unable to accurately account for uncertainty in partially-observable settings. This work proposes to address this limitation via a deep world model that enables a robot to perform both perception and conceptual perspective taking, i.e., the robot is able to infer what a human sees and believes. The key innovation is a decomposed multi-modal latent state space model able to generate and augment fictitious observations/emissions. Optimizing the ELBO that arises from this probabilistic graphical model enables the learning of uncertainty in latent space, which facilitates uncertainty estimation from high-dimensional observations. We tasked our model to predict human observations and beliefs on three partially-observable HRI tasks. Experiments show that our method significantly outperforms existing baselines and is able to infer visual observations available to other agent and their internal beliefs.

EgoPoser: Robust Real-Time Ego-Body Pose Estimation in Large Scenes

  • paper_url: http://arxiv.org/abs/2308.06493
  • repo_url: None
  • paper_authors: Jiaxi Jiang, Paul Streli, Manuel Meier, Christian Holz
  • for: This paper addresses ego-pose estimation on headsets, i.e., estimating full-body pose from head and hand poses alone.
  • methods: The paper proposes a new input representation and a novel motion decomposition method that predicts full-body pose independent of global positions, and robustly models body pose across users of different body sizes.
  • results: Experiments show that the method outperforms prior work both qualitatively and quantitatively while maintaining a high inference speed of over 600 fps, establishing a robust baseline for full-body pose estimation that no longer relies on outside-in capture and can scale to large-scene environments.
    Abstract Full-body ego-pose estimation from head and hand poses alone has become an active area of research to power articulate avatar representation on headset-based platforms. However, existing methods over-rely on the confines of the motion-capture spaces in which datasets were recorded, while simultaneously assuming continuous capture of joint motions and uniform body dimensions. In this paper, we propose EgoPoser, which overcomes these limitations by 1) rethinking the input representation for headset-based ego-pose estimation and introducing a novel motion decomposition method that predicts full-body pose independent of global positions, 2) robustly modeling body pose from intermittent hand position and orientation tracking only when inside a headset's field of view, and 3) generalizing across various body sizes for different users. Our experiments show that EgoPoser outperforms state-of-the-art methods both qualitatively and quantitatively, while maintaining a high inference speed of over 600 fps. EgoPoser establishes a robust baseline for future work, where full-body pose estimation needs no longer rely on outside-in capture and can scale to large-scene environments.

Generating Faithful Text From a Knowledge Graph with Noisy Reference Text

  • paper_url: http://arxiv.org/abs/2308.06488
  • repo_url: None
  • paper_authors: Tahsina Hashem, Weiqing Wang, Derry Tanti Wijaya, Mohammed Eunus Ali, Yuan-Fang Li
  • for: The paper proposes a KG-to-text generation model that produces natural-language text accurately representing the information of a given knowledge graph, even in the presence of noisy reference text.
  • methods: The model uses contrastive learning to better differentiate faithful from hallucinated information, and a controllable text generation technique to control the level of hallucination.
  • results: Experimental evaluations show that the model outperforms the state of the art in faithfulness.
    Abstract Knowledge Graph (KG)-to-Text generation aims at generating fluent natural-language text that accurately represents the information of a given knowledge graph. While significant progress has been made in this task by exploiting the power of pre-trained language models (PLMs) with appropriate graph structure-aware modules, existing models still fall short of generating faithful text, especially when the ground-truth natural-language text contains additional information that is not present in the graph. In this paper, we develop a KG-to-text generation model that can generate faithful natural-language text from a given graph, in the presence of noisy reference text. Our framework incorporates two core ideas: Firstly, we utilize contrastive learning to enhance the model's ability to differentiate between faithful and hallucinated information in the text, thereby encouraging the decoder to generate text that aligns with the input graph. Secondly, we empower the decoder to control the level of hallucination in the generated text by employing a controllable text generation technique. We evaluate our model's performance through the standard quantitative metrics as well as a ChatGPT-based quantitative and qualitative analysis. Our evaluation demonstrates the superior performance of our model over state-of-the-art KG-to-text models on faithfulness.
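The contrastive component can be read as ranking graph-faithful references above hallucinated (noisy) ones. A minimal sketch under that reading follows; the margin loss over sequence scores is an assumption, and the paper's exact objective and controllable-generation mechanism are not reproduced.

```python
import torch
import torch.nn.functional as F

def contrastive_faithfulness_loss(score_faithful: torch.Tensor,
                                  score_hallucinated: torch.Tensor,
                                  margin: float = 1.0) -> torch.Tensor:
    """Margin ranking loss: the model should score graph-faithful references
    above hallucinated (noisy) ones. Scores could be length-normalized
    log-likelihoods of each candidate under the decoder (an assumption)."""
    target = torch.ones_like(score_faithful)  # faithful should rank higher
    return F.margin_ranking_loss(score_faithful, score_hallucinated,
                                 target, margin=margin)

# Toy decoder scores for a batch of 4 (graph, text) pairs.
faithful = torch.tensor([-1.2, -0.8, -1.0, -0.9])
hallucinated = torch.tensor([-1.1, -1.5, -0.7, -1.3])
print(contrastive_faithfulness_loss(faithful, hallucinated))
```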

Not So Robust After All: Evaluating the Robustness of Deep Neural Networks to Unseen Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2308.06467
  • repo_url: None
  • paper_authors: Roman Garaev, Bader Rasheed, Adil Khan
  • for: This study aims to challenge the efficacy and generalization of contemporary defense mechanisms against adversarial attacks.
  • methods: The study explores the hypothesis proposed by Ilyas et al., which posits that DNN image features can be either robust or non-robust, with adversarial attacks targeting the latter. The study employs canonical correlation analysis, visualizes the representations, and calculates the mean distance between these representations and various DNN decision boundaries.
  • results: The study finds a significant difference between $L_2$ and $L_{\infty}$ norms, which could provide insights into the potential dangers posed by $L_{\infty}$ norm attacks, previously underestimated by the research community.
    Abstract Deep neural networks (DNNs) have gained prominence in various applications, such as classification, recognition, and prediction, prompting increased scrutiny of their properties. A fundamental attribute of traditional DNNs is their vulnerability to modifications in input data, which has resulted in the investigation of adversarial attacks. These attacks manipulate the data in order to mislead a DNN. This study aims to challenge the efficacy and generalization of contemporary defense mechanisms against adversarial attacks. Specifically, we explore the hypothesis proposed by Ilyas et. al, which posits that DNN image features can be either robust or non-robust, with adversarial attacks targeting the latter. This hypothesis suggests that training a DNN on a dataset consisting solely of robust features should produce a model resistant to adversarial attacks. However, our experiments demonstrate that this is not universally true. To gain further insights into our findings, we analyze the impact of adversarial attack norms on DNN representations, focusing on samples subjected to $L_2$ and $L_{\infty}$ norm attacks. Further, we employ canonical correlation analysis, visualize the representations, and calculate the mean distance between these representations and various DNN decision boundaries. Our results reveal a significant difference between $L_2$ and $L_{\infty}$ norms, which could provide insights into the potential dangers posed by $L_{\infty}$ norm attacks, previously underestimated by the research community.
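For context, the $L_2$ and $L_{\infty}$ threat models differ only in how the perturbation is steered and projected. A minimal projected gradient descent (PGD) sketch under both norms, on a toy classifier, is shown below; step sizes and budgets are illustrative.

```python
import torch
import torch.nn.functional as F

def pgd(model, x, y, eps, steps=10, norm="linf"):
    """Projected gradient descent under an L_inf or L_2 budget (illustrative)."""
    delta = torch.zeros_like(x, requires_grad=True)
    alpha = 2.5 * eps / steps
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            if norm == "linf":
                delta += alpha * grad.sign()
                delta.clamp_(-eps, eps)                    # project onto L_inf ball
            else:
                g = grad / (grad.flatten(1).norm(dim=1).view(-1, 1) + 1e-12)
                delta += alpha * g
                norms = delta.flatten(1).norm(dim=1).view(-1, 1)
                delta *= (eps / norms).clamp(max=1.0)      # project onto L_2 ball
    return (x + delta).detach()

model = torch.nn.Linear(10, 3)
x, y = torch.randn(4, 10), torch.randint(0, 3, (4,))
for norm, eps in (("linf", 0.03), ("l2", 0.5)):
    x_adv = pgd(model, x, y, eps, norm=norm)
    print(norm, F.cross_entropy(model(x_adv), y).item())
```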

Multi-Label Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2308.06453
  • repo_url: https://github.com/penghui-yang/l2d
  • paper_authors: Penghui Yang, Ming-Kun Xie, Chen-Chen Zong, Lei Feng, Gang Niu, Masashi Sugiyama, Sheng-Jun Huang
  • for: This paper targets multi-label learning, proposing a knowledge distillation method for the multi-label setting.
  • methods: The method divides the multi-label learning problem into a set of binary classification problems and distills the informative semantic knowledge from the logits of each, while leveraging the structural information of label-wise embeddings to enhance the distinctiveness of the learned feature representations.
  • results: Experimental results on multiple benchmark datasets show that the proposed method avoids knowledge counteraction among labels and achieves superior performance against diverse comparing methods.
    Abstract Existing knowledge distillation methods typically work by imparting the knowledge of output logits or intermediate feature maps from the teacher network to the student network, which is very successful in multi-class single-label learning. However, these methods can hardly be extended to the multi-label learning scenario, where each instance is associated with multiple semantic labels, because the prediction probabilities do not sum to one and feature maps of the whole example may ignore minor classes in such a scenario. In this paper, we propose a novel multi-label knowledge distillation method. On one hand, it exploits the informative semantic knowledge from the logits by dividing the multi-label learning problem into a set of binary classification problems; on the other hand, it enhances the distinctiveness of the learned feature representations by leveraging the structural information of label-wise embeddings. Experimental results on multiple benchmark datasets validate that the proposed method can avoid knowledge counteraction among labels, thus achieving superior performance against diverse comparing methods. Our code is available at: https://github.com/penghui-yang/L2D
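The decomposition step can be expressed as a per-label binary distillation: each label becomes its own binary problem whose soft target is the teacher's probability for that label. A minimal sketch under assumed temperature handling follows; the paper's label-embedding structural term is omitted.

```python
import torch
import torch.nn.functional as F

def multilabel_distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Treat each of the L labels as a binary problem and match the teacher's
    per-label probability with binary cross-entropy. Sketch only: the paper's
    label-embedding structural term is omitted, and the temperature scheme
    is an assumption."""
    t = torch.sigmoid(teacher_logits / temperature)  # soft per-label targets
    s = student_logits / temperature
    return F.binary_cross_entropy_with_logits(s, t)

student = torch.randn(8, 20)  # batch of 8 examples, 20 labels
teacher = torch.randn(8, 20)
print(multilabel_distill_loss(student, teacher))
```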

Semantic Equivariant Mixup

  • paper_url: http://arxiv.org/abs/2308.06451
  • repo_url: None
  • paper_authors: Zongbo Han, Tianchi Xie, Bingzhe Wu, Qinghua Hu, Changqing Zhang
  • for: To improve model robustness against distribution shifts by enforcing that the structure of the input data is preserved in the representation space.
  • methods: A generic mixup regularization at the representation level based on a semantic-equivariance assumption, allowing the model to learn richer semantic information from mixed samples.
  • results: Extensive empirical studies and qualitative analyses demonstrate that the proposed method improves robustness and generalization.
    Abstract Mixup is a well-established data augmentation technique, which can extend the training distribution and regularize the neural networks by creating ''mixed'' samples based on the label-equivariance assumption, i.e., a proportional mixup of the input data results in the corresponding labels being mixed in the same proportion. However, previous mixup variants may fail to exploit the label-independent information in mixed samples during training, which usually contains richer semantic information. To further release the power of mixup, we first improve the previous label-equivariance assumption by the semantic-equivariance assumption, which states that the proportional mixup of the input data should lead to the corresponding representation being mixed in the same proportion. Then a generic mixup regularization at the representation level is proposed, which can further regularize the model with the semantic information in mixed samples. At a high level, the proposed semantic equivariant mixup (sem) encourages the structure of the input data to be preserved in the representation space, i.e., the change of input will result in the obtained representation information changing in the same way. Different from previous mixup variants, which tend to over-focus on the label-related information, the proposed method aims to preserve richer semantic information in the input with semantic-equivariance assumption, thereby improving the robustness of the model against distribution shifts. We conduct extensive empirical studies and qualitative analyzes to demonstrate the effectiveness of our proposed method. The code of the manuscript is in the supplement.
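One plausible instantiation of the semantic-equivariance assumption is a regularizer penalizing the gap between the encoder's output on a mixed input and the same-proportion mix of the individual representations. The sketch below follows that reading; the distance choice and the sampling of the mixing coefficient are assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def semantic_equivariant_penalty(encoder, x1, x2, lam):
    """Penalize || f(lam*x1 + (1-lam)*x2) - (lam*f(x1) + (1-lam)*f(x2)) ||^2.
    One plausible instantiation of the semantic-equivariance assumption;
    the paper's exact regularizer may differ."""
    z_mixed = encoder(lam * x1 + (1 - lam) * x2)
    z_target = lam * encoder(x1) + (1 - lam) * encoder(x2)
    return F.mse_loss(z_mixed, z_target)

encoder = torch.nn.Sequential(torch.nn.Linear(32, 16), torch.nn.Tanh())
x1, x2 = torch.randn(8, 32), torch.randn(8, 32)
lam = torch.distributions.Beta(1.0, 1.0).sample().item()  # mixup coefficient
print(semantic_equivariant_penalty(encoder, x1, x2, lam))
```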

A Sequential Meta-Transfer (SMT) Learning to Combat Complexities of Physics-Informed Neural Networks: Application to Composites Autoclave Processing

  • paper_url: http://arxiv.org/abs/2308.06447
  • repo_url: https://github.com/miladramzy/sequentialmetatransferpinns
  • paper_authors: Milad Ramezankhani, Abbas S. Milani
  • for: To solve nonlinear partial differential equation (PDE) problems while improving the generalizability of physics-informed neural networks (PINNs).
  • methods: A sequential meta-transfer (SMT) learning framework that decomposes the time domain into smaller segments and, for each segment, trains a meta-learner for rapid adaptation, with transfer learning applied across segments.
  • results: On a complex composites autoclave processing case study, the SMT framework markedly enhances the adaptability of PINNs while reducing computational cost by a factor of 100.
    Abstract Physics-Informed Neural Networks (PINNs) have gained popularity in solving nonlinear partial differential equations (PDEs) via integrating physical laws into the training of neural networks, making them superior in many scientific and engineering applications. However, conventional PINNs still fall short in accurately approximating the solution of complex systems with strong nonlinearity, especially in long temporal domains. Besides, since PINNs are designed to approximate a specific realization of a given PDE system, they lack the necessary generalizability to efficiently adapt to new system configurations. This entails computationally expensive re-training from scratch for any new change in the system. To address these shortfalls, in this work a novel sequential meta-transfer (SMT) learning framework is proposed, offering a unified solution for both fast training and efficient adaptation of PINNs in highly nonlinear systems with long temporal domains. Specifically, the framework decomposes PDE's time domain into smaller time segments to create "easier" PDE problems for PINNs training. Then for each time interval, a meta-learner is assigned and trained to achieve an optimal initial state for rapid adaptation to a range of related tasks. Transfer learning principles are then leveraged across time intervals to further reduce the computational cost.Through a composites autoclave processing case study, it is shown that SMT is clearly able to enhance the adaptability of PINNs while significantly reducing computational cost, by a factor of 100.

Sensitivity-Aware Mixed-Precision Quantization and Width Optimization of Deep Neural Networks Through Cluster-Based Tree-Structured Parzen Estimation

  • paper_url: http://arxiv.org/abs/2308.06422
  • repo_url: None
  • paper_authors: Seyedarmin Azizi, Mahdi Nazemi, Arash Fayyazi, Massoud Pedram
  • for: This paper proposes a search method that automatically selects the best bit-width and layer-width for individual neural network layers, improving the efficiency of deep learning models.
  • methods: The method couples the automatic selection of per-layer bit-widths and layer-widths with Hessian-based pruning to strategically reduce the search space, and uses a cluster-based tree-structured Parzen estimator to build surrogate models for quickly exploring architectural possibilities.
  • results: Compared to leading compression strategies, the approach achieves a 20% decrease in model size without compromising accuracy, and reduces search time by 12x relative to the best available search-focused strategies, enabling quick model design and implementation in settings with limited resources.
    Abstract As the complexity and computational demands of deep learning models rise, the need for effective optimization methods for neural network designs becomes paramount. This work introduces an innovative search mechanism for automatically selecting the best bit-width and layer-width for individual neural network layers. This leads to a marked enhancement in deep neural network efficiency. The search domain is strategically reduced by leveraging Hessian-based pruning, ensuring the removal of non-crucial parameters. Subsequently, we detail the development of surrogate models for favorable and unfavorable outcomes by employing a cluster-based tree-structured Parzen estimator. This strategy allows for a streamlined exploration of architectural possibilities and swift pinpointing of top-performing designs. Through rigorous testing on well-known datasets, our method proves its distinct advantage over existing methods. Compared to leading compression strategies, our approach records an impressive 20% decrease in model size without compromising accuracy. Additionally, our method boasts a 12x reduction in search time relative to the best search-focused strategies currently available. As a result, our proposed method represents a leap forward in neural network design optimization, paving the way for quick model design and implementation in settings with limited resources, thereby propelling the potential of scalable deep learning solutions.
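The search component maps naturally onto an off-the-shelf tree-structured Parzen estimator such as Optuna's TPE sampler. The sketch below runs a plain TPE search over a toy per-layer bit-width/width objective; the paper's Hessian-based pruning of the space and its cluster-based estimator variant are not reproduced.

```python
import optuna

LAYERS = 4  # toy network depth

def objective(trial: optuna.Trial) -> float:
    """Toy stand-in for 'accuracy loss + model size' of a quantized network."""
    cost = 0.0
    for i in range(LAYERS):
        bits = trial.suggest_categorical(f"bits_{i}", [2, 4, 8])
        width = trial.suggest_float(f"width_{i}", 0.25, 1.0)
        # Pretend lower precision/width hurts accuracy but shrinks the model.
        accuracy_penalty = (8 - bits) * 0.02 + (1.0 - width) * 0.05
        size = bits * width
        cost += accuracy_penalty + 0.01 * size
    return cost

study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=0),
                            direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```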

Pedestrian Trajectory Prediction in Pedestrian-Vehicle Mixed Environments: A Systematic Review

  • paper_url: http://arxiv.org/abs/2308.06419
  • repo_url: None
  • paper_authors: Mahsa Golchoubian, Moojan Ghafurian, Kerstin Dautenhahn, Nasser Lashgarian Azad
  • for: The paper is written for the development of practical pedestrian trajectory prediction algorithms for autonomous vehicles (AVs) in unstructured environments.
  • methods: The paper systematically reviews different methods proposed in the literature for modelling pedestrian trajectory prediction in the presence of vehicles, and investigates specific considerations for pedestrian-vehicle interaction.
  • results: The paper provides an overview of datasets containing trajectory data of both pedestrians and vehicles used by the reviewed papers, and discusses research gaps and directions for future work, such as the need for a more effective definition of interacting agents in deep learning methods and the need for more datasets of mixed traffic in unstructured environments.
    Abstract Planning an autonomous vehicle's (AV) path in a space shared with pedestrians requires reasoning about pedestrians' future trajectories. A practical pedestrian trajectory prediction algorithm for the use of AVs needs to consider the effect of the vehicle's interactions with the pedestrians on pedestrians' future motion behaviours. In this regard, this paper systematically reviews different methods proposed in the literature for modelling pedestrian trajectory prediction in presence of vehicles that can be applied for unstructured environments. This paper also investigates specific considerations for pedestrian-vehicle interaction (compared with pedestrian-pedestrian interaction) and reviews how different variables such as prediction uncertainties and behavioural differences are accounted for in the previously proposed prediction models. PRISMA guidelines were followed. Articles that did not consider vehicle and pedestrian interactions or actual trajectories, and articles that only focused on road crossing were excluded. A total of 1260 unique peer-reviewed articles from ACM Digital Library, IEEE Xplore, and Scopus databases were identified in the search. 64 articles were included in the final review as they met the inclusion and exclusion criteria. An overview of datasets containing trajectory data of both pedestrians and vehicles used by the reviewed papers has been provided. Research gaps and directions for future work, such as having more effective definition of interacting agents in deep learning methods and the need for gathering more datasets of mixed traffic in unstructured environments are discussed.

Dialogue Possibilities between a Human Supervisor and UAM Air Traffic Management: Route Alteration

  • paper_url: http://arxiv.org/abs/2308.06411
  • repo_url: None
  • paper_authors: Jeongseok Kim, Kangjin Kim
  • for: This study proposes a knowledge representation and reasoning approach to detour management in Urban Air Traffic Management (UATM), quickly identifying safe and efficient routes in a carefully sampled environment.
  • methods: The method is implemented in Answer Set Programming (ASP), using non-monotonic reasoning and a two-phase dialogue between a human manager and the UATM system that considers factors such as safety and potential impacts.
  • results: Several queries from two simulation scenarios validate the robustness and efficacy of the proposed method.
    Abstract This paper introduces a novel approach to detour management in Urban Air Traffic Management (UATM) using knowledge representation and reasoning. It aims to understand the complexities and requirements of UAM detours, enabling a method that quickly identifies safe and efficient routes in a carefully sampled environment. This method implemented in Answer Set Programming uses non-monotonic reasoning and a two-phase conversation between a human manager and the UATM system, considering factors like safety and potential impacts. The robustness and efficacy of the proposed method were validated through several queries from two simulation scenarios, contributing to the symbiosis of human knowledge and advanced AI techniques. The paper provides an introduction, citing relevant studies, problem formulation, solution, discussions, and concluding comments.

A Brain-Computer Interface Augmented Reality Framework with Auto-Adaptive SSVEP Recognition

  • paper_url: http://arxiv.org/abs/2308.06401
  • repo_url: None
  • paper_authors: Yasmine Mustafa, Mohamed Elmahallawy, Tie Luo, Seif Eldawlatly
  • for: This work develops a simple adaptive ensemble classification system that accommodates the inter-subject variability of brain signals, improving robustness in applications that combine brain-computer interfaces (BCI) with augmented reality (AR).
  • methods: The work uses the steady-state visually-evoked potential (SSVEP) signal pattern and proposes a simple BCI-AR framework that supports the development of a wide range of SSVEP-based BCI-AR applications.
  • results: Testing shows the ensemble classification approach is robust in an SSVEP-based BCI-AR application involving head rotations, achieving a mean accuracy of 80% on a PC and 77% with the HoloLens AR headset, both surpassing previous studies, with a relatively short visual stimulation time of 5 seconds.
    Abstract Brain-Computer Interface (BCI) initially gained attention for developing applications that aid physically impaired individuals. Recently, the idea of integrating BCI with Augmented Reality (AR) emerged, which uses BCI not only to enhance the quality of life for individuals with disabilities but also to develop mainstream applications for healthy users. One commonly used BCI signal pattern is the Steady-state Visually-evoked Potential (SSVEP), which captures the brain's response to flickering visual stimuli. SSVEP-based BCI-AR applications enable users to express their needs/wants by simply looking at corresponding command options. However, individuals are different in brain signals and thus require per-subject SSVEP recognition. Moreover, muscle movements and eye blinks interfere with brain signals, and thus subjects are required to remain still during BCI experiments, which limits AR engagement. In this paper, we (1) propose a simple adaptive ensemble classification system that handles the inter-subject variability, (2) present a simple BCI-AR framework that supports the development of a wide range of SSVEP-based BCI-AR applications, and (3) evaluate the performance of our ensemble algorithm in an SSVEP-based BCI-AR application with head rotations which has demonstrated robustness to the movement interference. Our testing on multiple subjects achieved a mean accuracy of 80\% on a PC and 77\% using the HoloLens AR headset, both of which surpass previous studies that incorporate individual classifiers and head movements. In addition, our visual stimulation time is 5 seconds which is relatively short. The statistically significant results show that our ensemble classification approach outperforms individual classifiers in SSVEP-based BCIs.

ZYN: Zero-Shot Reward Models with Yes-No Questions

  • paper_url: http://arxiv.org/abs/2308.06385
  • repo_url: https://github.com/vicgalle/zero-shot-reward-models
  • paper_authors: Victor Gallego
  • for: This paper addresses directing the text generations of an LLM towards a desired behavior, aligning the generated text with the preferences of the human operator.
  • methods: Another language model serves as a critic and zero-shot reward model via the prompt of a Yes-No question that represents the user preferences, requiring no further labeled data.
  • results: Experiments in different text generation domains, including detoxification, optimizing the sentiment of movie reviews, steering the model's opinion about a particular topic, and personalizing prompt generators for text-to-image tasks, demonstrate the capabilities of the proposed ZYN framework.
    Abstract In this work, we address the problem of directing the text generations of a LLM towards a desired behavior, aligning the generated text with the preferences of the human operator. We propose using another language model as a critic, reward model in a zero-shot way thanks to the prompt of a Yes-No question that represents the user preferences, without requiring further labeled data. This zero-shot reward model provides the learning signal to further fine-tune the base LLM using reinforcement learning, as in RLAIF; yet our approach is also compatible in other contexts such as quality-diversity search. Extensive evidence of the capabilities of the proposed ZYN framework is provided through experiments in different domains related to text generation, including detoxification; optimizing sentiment of movie reviews, or any other attribute; steering the opinion about a particular topic the model may have; and personalizing prompt generators for text-to-image tasks. Code to be released at \url{https://github.com/vicgalle/zero-shot-reward-models/}.
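The zero-shot reward can be read as the probability the critic assigns to 'Yes' as the next token after the preference question. A minimal sketch with a small Hugging Face causal LM follows; the critic model and prompt wording are illustrative assumptions (the released code is at the linked repository).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
critic = AutoModelForCausalLM.from_pretrained("gpt2")

def zyn_reward(text: str, question: str) -> float:
    """Reward = P('Yes') / (P('Yes') + P('No')) for the next token after the
    Yes-No question. Prompt format and critic model are illustrative."""
    prompt = f"Text: {text}\nQuestion: {question}\nAnswer:"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = critic(ids).logits[0, -1]           # next-token distribution
    yes = tokenizer(" Yes").input_ids[0]
    no = tokenizer(" No").input_ids[0]
    probs = torch.softmax(logits[[yes, no]], dim=0)  # renormalize over {Yes, No}
    return probs[0].item()

print(zyn_reward("What a wonderful, kind reply!",
                 "Is the text above polite? Yes or No."))
```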

DCNFIS: Deep Convolutional Neuro-Fuzzy Inference System

  • paper_url: http://arxiv.org/abs/2308.06378
  • repo_url: None
  • paper_authors: Mojtaba Yeganejou, Kimia Honari, Ryan Kluzinski, Scott Dick, Michael Lipsett, James Miller
  • for: This work proposes a new deep network design that improves transparency without sacrificing accuracy.
  • methods: The work hybridizes fuzzy logic and deep learning models into a deep convolutional neuro-fuzzy inference system (DCNFIS).
  • results: DCNFIS performs as accurately as three existing convolutional neural networks on four public datasets and outperforms state-of-the-art deep fuzzy systems; saliency-map explanations derived from its fuzzy rules are analyzed in depth on the Fashion-MNIST dataset.
    Abstract A key challenge in eXplainable Artificial Intelligence is the well-known tradeoff between the transparency of an algorithm (i.e., how easily a human can directly understand the algorithm, as opposed to receiving a post-hoc explanation), and its accuracy. We report on the design of a new deep network that achieves improved transparency without sacrificing accuracy. We design a deep convolutional neuro-fuzzy inference system (DCNFIS) by hybridizing fuzzy logic and deep learning models and show that DCNFIS performs as accurately as three existing convolutional neural networks on four well-known datasets. We furthermore show that DCNFIS outperforms state-of-the-art deep fuzzy systems. We then exploit the transparency of fuzzy logic by deriving explanations, in the form of saliency maps, from the fuzzy rules encoded in DCNFIS. We investigate the properties of these explanations in greater depth using the Fashion-MNIST dataset.

Large Language Models and Knowledge Graphs: Opportunities and Challenges

  • paper_url: http://arxiv.org/abs/2308.06374
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Jeff Z. Pan, Simon Razniewski, Jan-Christoph Kalo, Sneha Singhania, Jiaoyan Chen, Stefan Dietze, Hajira Jabeen, Janna Omeliyanenko, Wen Zhang, Matteo Lissandrini, Russa Biswas, Gerard de Melo, Angela Bonifati, Edlira Vakaj, Mauro Dragoni, Damien Graux
  • for: This position paper examines the rise of large language models (LLMs) in knowledge representation and their impact on knowledge graphs and parametric knowledge.
  • methods: The paper draws on established knowledge representation approaches, such as knowledge graphs (explicit knowledge) and parametric knowledge, alongside emerging research directions.
  • results: The paper summarizes common debate points and perspectives on LLMs and knowledge graphs, and outlines opportunities, research topics, and challenges brought by the renewed focus on hybrid representation.
    Abstract Large Language Models (LLMs) have taken Knowledge Representation -- and the world -- by storm. This inflection point marks a shift from explicit knowledge representation to a renewed focus on the hybrid representation of both explicit knowledge and parametric knowledge. In this position paper, we will discuss some of the common debate points within the community on LLMs (parametric knowledge) and Knowledge Graphs (explicit knowledge) and speculate on opportunities and visions that the renewed focus brings, as well as related research topics and challenges.
    摘要 大语言模型(LLM)席卷了知识表示领域乃至整个世界。这一转折点标志着从显式知识表示转向对显式知识与参数化知识混合表示的重新关注。在这篇立场论文中,我们讨论社区内关于 LLM(参数化知识)与知识图谱(显式知识)的一些常见争论点,展望这种重新关注所带来的机遇与愿景,并讨论相关的研究课题与挑战。

Wireless Federated $k$-Means Clustering with Non-coherent Over-the-Air Computation

  • paper_url: http://arxiv.org/abs/2308.06371
  • repo_url: None
  • paper_authors: Alphan Sahin
  • for: 降低无线网络上实现 Federated k-means 算法时的每次通信延迟
  • methods: 使用空中计算(Over-the-air computation, OAC)方案,通过编码器利用数值在平衡数制中的表示,并借助无线多址信道的信号叠加特性,消除对精确相位和时间同步的需求
  • results: 对客户位置 clustering 场景进行 demonstration,比较标准 k-means clustering 和提议方法的性能,结果显示提议方法与标准 k-means 性能相似,同时降低了通信延迟
    Abstract In this study, we propose using an over-the-air computation (OAC) scheme for the federated k-means clustering algorithm to reduce the per-round communication latency when it is implemented over a wireless network. The OAC scheme relies on an encoder exploiting the representation of a number in a balanced number system and computes the sum of the updates for the federated k-means via signal superposition property of wireless multiple-access channels non-coherently to eliminate the need for precise phase and time synchronization. Also, a reinitialization method for ineffectively used centroids is proposed to improve the performance of the proposed method for heterogeneous data distribution. For a customer-location clustering scenario, we demonstrate the performance of the proposed algorithm and compare it with the standard k-means clustering. Our results show that the proposed approach performs similarly to the standard k-means while reducing communication latency.
    摘要 在这项研究中,我们提议使用空中计算(OAC)方案来降低在无线网络上实现联邦 k-means 算法时的每轮通信延迟。OAC 方案利用编码器将数值表示在平衡数制中,并非相干地借助无线多址信道的信号叠加特性计算联邦 k-means 更新之和,从而消除对精确相位和时间同步的需求。此外,我们还提出了一种对未被有效利用的中心点进行重新初始化的方法,以提高所提方法在异构数据分布情况下的性能。在一个客户位置聚类场景中,我们展示了所提算法的性能,并与标准 k-means 聚类进行比较。结果表明,所提方法与标准 k-means 性能相近,同时降低了通信延迟。
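
A channel-free toy simulation of one federated k-means round may help fix ideas. Each client reports per-centroid sums and counts; over the air these reports would be added by signal superposition, while here the sum is computed exactly (the balanced-number-system encoding and the centroid-reinitialization rule are omitted):

```python
import numpy as np

def local_stats(X, centroids):
    """One client's per-centroid point sums and counts."""
    assign = np.argmin(((X[:, None, :] - centroids) ** 2).sum(-1), axis=1)
    sums = np.zeros_like(centroids)
    counts = np.zeros(len(centroids))
    for j in range(len(centroids)):
        sums[j] = X[assign == j].sum(axis=0)
        counts[j] = (assign == j).sum()
    return sums, counts

def federated_round(clients, centroids):
    stats = [local_stats(X, centroids) for X in clients]
    total_sum = sum(s for s, _ in stats)   # the OAC channel would form this sum
    total_cnt = sum(c for _, c in stats)   # non-coherently via superposition
    new = centroids.copy()
    used = total_cnt > 0                   # unused centroids would be re-initialized
    new[used] = total_sum[used] / total_cnt[used, None]
    return new
```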

Topic-Level Bayesian Surprise and Serendipity for Recommender Systems

  • paper_url: http://arxiv.org/abs/2308.06368
  • repo_url: https://github.com/ton-moy/surprise-and-serendipity
  • paper_authors: Tonmoy Hasan, Razvan Bunescu
  • for: 缓解推荐系统的过滤气泡问题,推荐具有高意外惊喜(serendipity)潜力的项目,让用户接触新颖、未见过的类别。
  • methods: 使用贝叶斯惊喜来衡量项目的意外性,并结合协同过滤组件来找到相似用户。
  • results: 实验结果表明,基于贝叶斯惊喜的模型与主题层次意外性的人工标注的相关性远高于基于距离的启发式方法,并在意外惊喜项目推荐上取得更好的性能。
    Abstract A recommender system that optimizes its recommendations solely to fit a user's history of ratings for consumed items can create a filter bubble, wherein the user does not get to experience items from novel, unseen categories. One approach to mitigate this undesired behavior is to recommend items with high potential for serendipity, namely surprising items that are likely to be highly rated. In this paper, we propose a content-based formulation of serendipity that is rooted in Bayesian surprise and use it to measure the serendipity of items after they are consumed and rated by the user. When coupled with a collaborative-filtering component that identifies similar users, this enables recommending items with high potential for serendipity. To facilitate the evaluation of topic-level models for surprise and serendipity, we introduce a dataset of book reading histories extracted from Goodreads, containing over 26 thousand users and close to 1.3 million books, where we manually annotate 449 books read by 4 users in terms of their time-dependent, topic-level surprise. Experimental evaluations show that models that use Bayesian surprise correlate much better with the manual annotations of topic-level surprise than distance-based heuristics, and also obtain better serendipitous item recommendation performance.
    摘要 一个仅根据用户对已消费项目的评分历史来优化推荐的推荐系统,可能会形成过滤气泡(filter bubble),使用户无法体验到新颖、未见过的类别。缓解这种不良行为的一种方法是推荐具有高意外惊喜潜力的项目,即用户很可能给出高评分的意外项目。在这篇论文中,我们提出了一种根植于贝叶斯惊喜的基于内容的意外惊喜(serendipity)定义,并用它来衡量项目在被用户消费和评分之后的意外程度。当与识别相似用户的协同过滤组件结合时,便可以推荐具有高意外惊喜潜力的项目。为了便于评估主题层次的惊喜与意外惊喜模型,我们引入了一个从 Goodreads 提取的阅读历史数据集,包含超过 26,000 名用户和近 130 万本书,其中我们对 4 名用户阅读的 449 本书进行了随时间变化的主题层次惊喜人工标注。实验评估显示,使用贝叶斯惊喜的模型与主题层次惊喜的人工标注的相关性远高于基于距离的启发式方法,并在意外惊喜项目推荐上取得更好的性能。
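
A small sketch of the Bayesian-surprise computation under one convenient modelling assumption (a Dirichlet belief over the user's topic preferences; the paper's exact formulation may differ): surprise is the KL divergence from the posterior, after observing the item's topic counts, back to the prior.

```python
import numpy as np
from scipy.special import digamma, gammaln

def dirichlet_kl(a, b):
    """KL( Dir(a) || Dir(b) ), closed form."""
    a0, b0 = a.sum(), b.sum()
    return (gammaln(a0) - gammaln(a).sum()
            - gammaln(b0) + gammaln(b).sum()
            + ((a - b) * (digamma(a) - digamma(a0))).sum())

def bayesian_surprise(prior_alpha, item_topic_counts):
    posterior = prior_alpha + item_topic_counts
    return dirichlet_kl(posterior, prior_alpha)

prior = np.ones(5)                                           # uniform belief over 5 topics
print(bayesian_surprise(prior, np.array([0, 0, 3, 0, 0])))   # novel topic: large surprise
print(bayesian_surprise(prior + [9, 0, 0, 0, 0],
                        np.array([3, 0, 0, 0, 0])))          # familiar topic: small surprise
```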

Causally Linking Health Application Data and Personal Information Management Tools

  • paper_url: http://arxiv.org/abs/2308.08556
  • repo_url: None
  • paper_authors: Saturnino Luz, Masood Masoodian
  • for: 本研究旨在开发一种整合多种数据源、分析和可见化工具,以帮助用户更好地理解健康变量之间的 causal 连接。
  • methods: 本研究使用了数据挖掘、时间序列分析和可见化技术,并将这些技术与各种健康应用程序集成。
  • results: 通过为用户提供时间序列数据的可视化,帮助用户更好地理解健康变量之间的关系,从而更好地管理健康。
    Abstract The proliferation of consumer health devices such as smart watches, sleep monitors, smart scales, etc, in many countries, has not only led to growing interest in health monitoring, but also to the development of a countless number of ``smart'' applications to support the exploration of such data by members of the general public, sometimes with integration into professional health services. While a variety of health data streams has been made available by such devices to users, these streams are often presented as separate time-series visualizations, in which the potential relationships between health variables are not explicitly made visible. Furthermore, despite the fact that other aspects of life, such as work and social connectivity, have become increasingly digitised, health and well-being applications make little use of the potentially useful contextual information provided by widely used personal information management tools, such as shared calendar and email systems. This paper presents a framework for the integration of these diverse data sources, analytic and visualization tools, with inference methods and graphical user interfaces to help users by highlighting causal connections among such time-series.
    摘要 随着智能手表、睡眠监测仪、智能体重秤等消费级健康设备在许多国家的普及,人们对健康监测的兴趣不仅日益增长,还催生了大量"智能"应用程序,以支持公众探索这类数据,有时还与专业医疗服务集成。虽然这些设备向用户提供了多种健康数据流,但这些数据流往往以彼此独立的时间序列可视化方式呈现,健康变量之间的潜在关系并未被直观展示。此外,尽管工作、社交等生活的其他方面已日益数字化,健康与福祉类应用却很少利用共享日历、电子邮件系统等广泛使用的个人信息管理工具所提供的潜在有用的上下文信息。本文提出了一个框架,将这些多样的数据源、分析与可视化工具、推理方法和图形用户界面集成起来,通过突出时间序列之间的因果联系来帮助用户。

Large Language Models to Identify Social Determinants of Health in Electronic Health Records

  • paper_url: http://arxiv.org/abs/2308.06354
  • repo_url: https://github.com/aim-harvard/sdoh
  • paper_authors: Marco Guevara, Shan Chen, Spencer Thomas, Tafadzwa L. Chaunzwa, Idalid Franco, Benjamin Kann, Shalini Moningi, Jack Qian, Madeleine Goldstein, Susan Harper, Hugo JWL Aerts, Guergana K. Savova, Raymond H. Mak, Danielle S. Bitterman
  • For: The paper aims to extract social determinants of health (SDoH) from electronic health records (EHRs) to improve patient outcomes.
  • Methods: The study uses large language models to extract SDoH from free text in EHRs, and experiments with synthetic data generation to improve the extraction of scarce SDoH data.
  • Results: The best-performing models were fine-tuned Flan-T5 XL and Flan-T5 XXL, which outperformed zero- and few-shot performance of ChatGPT-family models and showed less algorithmic bias. The models identified 93.8% of patients with adverse SDoH, while ICD-10 codes captured only 2.0%.
  • 为: 本研究用大语言模型从电子健康记录中提取社会健康决定因素(SDoH),以改善患者结果。
  • 方法: 研究从 EHR 自由文本中提取 SDoH,并尝试利用合成数据生成来改善对稀缺 SDoH 数据的提取。
  • 结果: 表现最佳的模型是微调后的 Flan-T5 XL 和 Flan-T5 XXL,它们优于 ChatGPT 系列模型的零样本和少样本表现,并表现出更少的算法偏见。模型识别出 93.8% 存在不利 SDoH 的患者,而 ICD-10 编码仅捕捉到 2.0%。
    Abstract Social determinants of health (SDoH) have an important impact on patient outcomes but are incompletely collected from the electronic health records (EHR). This study researched the ability of large language models to extract SDoH from free text in EHRs, where they are most commonly documented, and explored the role of synthetic clinical text for improving the extraction of these scarcely documented, yet extremely valuable, clinical data. 800 patient notes were annotated for SDoH categories, and several transformer-based models were evaluated. The study also experimented with synthetic data generation and assessed for algorithmic bias. Our best-performing models were fine-tuned Flan-T5 XL (macro-F1 0.71) for any SDoH, and Flan-T5 XXL (macro-F1 0.70). The benefit of augmenting fine-tuning with synthetic data varied across model architecture and size, with smaller Flan-T5 models (base and large) showing the greatest improvements in performance (delta F1 +0.12 to +0.23). Model performance was similar on the in-hospital system dataset but worse on the MIMIC-III dataset. Our best-performing fine-tuned models outperformed zero- and few-shot performance of ChatGPT-family models for both tasks. These fine-tuned models were less likely than ChatGPT to change their prediction when race/ethnicity and gender descriptors were added to the text, suggesting less algorithmic bias (p<0.05). At the patient-level, our models identified 93.8% of patients with adverse SDoH, while ICD-10 codes captured 2.0%. Our method can effectively extract SDoH information from clinic notes, performing better compared to GPT zero- and few-shot settings. These models could enhance real-world evidence on SDoH and aid in identifying patients needing social support.
    摘要 社会健康决定因素(SDoH)对患者结果有重要影响,但在电子健康记录(EHR)中的收集并不完整。本研究考察了大语言模型从 EHR 自由文本(SDoH 最常被记录的位置)中提取 SDoH 的能力,并探索了合成临床文本在改善这类记录稀少但极具价值的临床数据提取中的作用。我们对 800 份病人笔记进行了 SDoH 类别标注,并评估了多种基于 Transformer 的模型。研究还试验了合成数据生成,并评估了算法偏见。表现最佳的模型是微调后的 Flan-T5 XL(macro-F1 0.71,任意 SDoH)和 Flan-T5 XXL(macro-F1 0.70)。用合成数据增强微调的收益因模型结构和规模而异,较小的 Flan-T5 模型(base 和 large)的性能提升最大(delta F1 +0.12 至 +0.23)。模型在院内系统数据集上表现相近,但在 MIMIC-III 数据集上表现较差。我们最佳的微调模型在两项任务上均优于 ChatGPT 系列模型的零样本和少样本表现。在文本中加入种族/族裔和性别描述词时,这些微调模型比 ChatGPT 更少改变其预测结果,表明算法偏见更小(p<0.05)。在患者层面,我们的模型识别出 93.8% 存在不利 SDoH 的患者,而 ICD-10 编码仅捕捉到 2.0%。我们的方法能够有效地从临床笔记中提取 SDoH 信息,表现优于 GPT 的零样本和少样本设置。这些模型可以增强关于 SDoH 的真实世界证据,并帮助识别需要社会支持的患者。
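
A zero-shot sketch with a small public Flan-T5 checkpoint (the study fine-tunes the XL/XXL variants on annotated notes; the prompt and category list here are illustrative assumptions):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def extract_sdoh(note: str) -> str:
    prompt = ("List the social determinants of health (e.g. housing, "
              "transportation, social support, employment) mentioned in this "
              "clinical note, or answer 'none':\n" + note)
    ids = tok(prompt, return_tensors="pt", truncation=True).input_ids
    out = model.generate(ids, max_new_tokens=64)
    return tok.decode(out[0], skip_special_tokens=True)

print(extract_sdoh("Patient lives alone and reports no family nearby to help."))
```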

Combining feature aggregation and geometric similarity for re-identification of patterned animals

  • paper_url: http://arxiv.org/abs/2308.06335
  • repo_url: None
  • paper_authors: Veikka Immonen, Ekaterina Nepovinnykh, Tuomas Eerola, Charles V. Stewart, Heikki Kälviäinen
  • For: The paper is written for studying animal populations by using image-based re-identification of individual animals.
  • Methods: The paper combines two types of pattern similarity metrics: pattern appearance similarity and geometric pattern similarity.
  • Results: The proposed combination of pattern similarity metrics achieves promising re-identification accuracies for Saimaa ringed seals and whale sharks.
  • for: 研究动物种群,通过基于图像的个体动物重新识别。
  • methods: 结合两种模式相似度度量:模式外观相似度与几何模式相似度。
  • results: 所提出的模式相似度组合在 Saimaa 环斑海豹和鲸鲨上取得了有前景的重新识别精度。
    Abstract Image-based re-identification of animal individuals allows gathering of information such as migration patterns of the animals over time. This, together with large image volumes collected using camera traps and crowdsourcing, opens novel possibilities to study animal populations. For many species, the re-identification can be done by analyzing the permanent fur, feather, or skin patterns that are unique to each individual. In this paper, we address the re-identification by combining two types of pattern similarity metrics: 1) pattern appearance similarity obtained by pattern feature aggregation and 2) geometric pattern similarity obtained by analyzing the geometric consistency of pattern similarities. The proposed combination allows to efficiently utilize both the local and global pattern features, providing a general re-identification approach that can be applied to a wide variety of different pattern types. In the experimental part of the work, we demonstrate that the method achieves promising re-identification accuracies for Saimaa ringed seals and whale sharks.
    摘要 基于图像的动物个体重新识别可以随时间收集动物迁徙模式等信息。这与通过相机陷阱和众包收集的大量图像相结合,为研究动物种群开辟了新的可能性。对许多物种而言,重新识别可以通过分析每个个体独有的永久性皮毛、羽毛或皮肤图案来完成。在这篇论文中,我们通过结合两种模式相似度度量来解决重新识别问题:1)通过模式特征聚合获得的模式外观相似度;2)通过分析模式相似度的几何一致性获得的几何模式相似度。所提出的组合能够有效利用局部和全局的模式特征,提供一种可应用于多种不同模式类型的通用重新识别方法。在实验部分,我们展示了该方法在 Saimaa 环斑海豹和鲸鲨上取得了有前景的重新识别精度。
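
In outline, the fusion step can be as simple as the sketch below (the min-max normalisation and the 50/50 weight are assumptions; the paper's geometric-consistency term is computed from the spatial arrangement of matched pattern features, which is more involved):

```python
import numpy as np

def fuse_similarities(app_sim, geo_sim, w=0.5):
    """Combine pattern-appearance and geometric-consistency similarity
    matrices (queries x database) after bringing them to a common scale."""
    norm = lambda s: (s - s.min()) / (np.ptp(s) + 1e-9)
    return w * norm(app_sim) + (1 - w) * norm(geo_sim)

def reidentify(fused):
    """Best database identity per query image."""
    return fused.argmax(axis=1)
```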

Foundation Model is Efficient Multimodal Multitask Model Selector

  • paper_url: http://arxiv.org/abs/2308.06262
  • repo_url: https://github.com/opengvlab/multitask-model-selector
  • paper_authors: Fanqing Meng, Wenqi Shao, Zhanglin Peng, Chonghe Jiang, Kaipeng Zhang, Yu Qiao, Ping Luo
  • For: 本文研究了一个未得到充分研究的问题:给定一个集合 pre-trained neural networks,预测它们在每个多模态任务上的性能,而不需要 fine-tuning 它们。
  • Methods: 本文提出了一种高效的多任务模型选择器(EMMS),使用大规模基础模型将多个下游任务的多种标签格式转化为一个统一的噪声标签嵌入。EMMS 可以通过一种简单的加权线性回归来估计模型的迁移性能,并可用带收敛保证的交替最小化算法高效求解。
  • Results: 广泛的实验表明,EMMS 是一种快速、有效和通用的模型选择器,可以高效地评估 pre-trained 模型的迁移性能。例如,与使用我们标签嵌入增强的最新方法 LogME 相比,EMMS 在图像识别、指代、描述、视觉问答和文本问答等五个下游任务上分别实现了 9.0%、26.3%、20.1%、54.8% 和 12.2% 的性能提升,同时带来 5.13x、6.29x、3.59x、6.19x 和 5.66x 的速度提升。代码可以在 https://github.com/OpenGVLab/Multitask-Model-Selector 上获取。
    Abstract This paper investigates an under-explored but important problem: given a collection of pre-trained neural networks, predicting their performance on each multi-modal task without fine-tuning them, such as image recognition, referring, captioning, visual question answering, and text question answering. A brute-force approach is to finetune all models on all target datasets, bringing high computational costs. Although recent advanced approaches employed lightweight metrics to measure models' transferability, they often depend heavily on the prior knowledge of a single task, making them inapplicable in a multi-modal multi-task scenario. To tackle this issue, we propose an efficient multi-task model selector (EMMS), which employs large-scale foundation models to transform diverse label formats such as categories, texts, and bounding boxes of different downstream tasks into a unified noisy label embedding. EMMS can estimate a model's transferability through a simple weighted linear regression, which can be efficiently solved by an alternating minimization algorithm with a convergence guarantee. Extensive experiments on 5 downstream tasks with 24 datasets show that EMMS is fast, effective, and generic enough to assess the transferability of pre-trained models, making it the first model selection method in the multi-task scenario. For instance, compared with the state-of-the-art method LogME enhanced by our label embeddings, EMMS achieves 9.0\%, 26.3\%, 20.1\%, 54.8\%, 12.2\% performance gain on image recognition, referring, captioning, visual question answering, and text question answering, while bringing 5.13x, 6.29x, 3.59x, 6.19x, and 5.66x speedup in wall-clock time, respectively. The code is available at https://github.com/OpenGVLab/Multitask-Model-Selector.
    摘要 本文研究一个尚未充分探索但十分重要的问题:给定一组预训练神经网络,在不对其微调的情况下,预测它们在各个多模态任务(如图像识别、指代、描述、视觉问答和文本问答)上的性能。暴力方法是在所有目标数据集上微调所有模型,计算代价极高。虽然近期方法采用轻量级指标衡量模型的可迁移性,但它们往往严重依赖单一任务的先验知识,难以适用于多模态多任务场景。为此,我们提出高效多任务模型选择器(EMMS),利用大规模基础模型将不同下游任务的类别、文本、边界框等多样标签格式转化为统一的噪声标签嵌入。EMMS 通过简单的加权线性回归估计模型的可迁移性,并可用带收敛保证的交替最小化算法高效求解。在 5 个下游任务、24 个数据集上的大量实验表明,EMMS 快速、有效且通用,是多任务场景下的首个模型选择方法。例如,与采用我们标签嵌入增强的最新方法 LogME 相比,EMMS 在图像识别、指代、描述、视觉问答和文本问答上分别取得 9.0%、26.3%、20.1%、54.8% 和 12.2% 的性能提升,同时带来 5.13、6.29、3.59、6.19 和 5.66 倍的墙钟时间加速。代码见 https://github.com/OpenGVLab/Multitask-Model-Selector。
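
As a much-simplified sketch of the transferability estimate (the unweighted special case; EMMS proper uses foundation models to build the unified label embedding and a weighted linear regression solved by alternating minimization with a convergence guarantee):

```python
import numpy as np

def emms_score(features, label_emb):
    """How well a linear map sends a frozen model's features (n x d) onto the
    unified noisy label embedding (n x k); higher (less negative) suggests a
    more transferable model, so candidates can be ranked without fine-tuning."""
    F = np.hstack([features, np.ones((len(features), 1))])   # add bias column
    W, *_ = np.linalg.lstsq(F, label_emb, rcond=None)
    return -((F @ W - label_emb) ** 2).mean()
```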

Enhancing Network Management Using Code Generated by Large Language Models

  • paper_url: http://arxiv.org/abs/2308.06261
  • repo_url: https://github.com/chrisneagu/FTC-Skystone-Dark-Angels-Romania-2020
  • paper_authors: Sathiya Kumaran Mani, Yajie Zhou, Kevin Hsieh, Santiago Segarra, Ranveer Chandra, Srikanth Kandula
  • for: This paper aims to provide a novel approach for natural-language-based network management, leveraging large language models (LLMs) to generate task-specific code from natural language queries.
  • methods: The proposed approach utilizes LLMs to generate code, addressing the challenges of explainability, scalability, and privacy by allowing network operators to inspect the generated code and eliminating the need to share network data with LLMs.
  • results: The prototype system designed and evaluated in the paper demonstrates high accuracy, cost-effectiveness, and potential for further enhancements using complementary program synthesis techniques.
    Abstract Analyzing network topologies and communication graphs plays a crucial role in contemporary network management. However, the absence of a cohesive approach leads to a challenging learning curve, heightened errors, and inefficiencies. In this paper, we introduce a novel approach to facilitate a natural-language-based network management experience, utilizing large language models (LLMs) to generate task-specific code from natural language queries. This method tackles the challenges of explainability, scalability, and privacy by allowing network operators to inspect the generated code, eliminating the need to share network data with LLMs, and concentrating on application-specific requests combined with general program synthesis techniques. We design and evaluate a prototype system using benchmark applications, showcasing high accuracy, cost-effectiveness, and the potential for further enhancements using complementary program synthesis techniques.
    摘要 现代网络管理中,分析网络拓扑和通信图至关重要。然而,由于缺乏一致的方法,学习曲线陡峭、错误增多且效率低下。在这篇论文中,我们介绍一种新方法,利用大语言模型(LLM)从自然语言查询生成任务特定的代码,从而提供基于自然语言的网络管理体验。这种方法解决了可解释性、可扩展性和隐私问题:网络操作员可以检查生成的代码,网络数据无需与 LLM 共享,并且可以专注于应用特定的请求,结合通用程序合成技术。我们设计并使用基准应用评估了一个原型系统,展示了高精度、良好的成本效益,以及借助补充的程序合成技术进一步增强的潜力。
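
A minimal sketch of the pattern (the OpenAI-style client and model name are illustrative assumptions): only a schema description, never the network data itself, is sent to the LLM, and the operator can read the returned code before executing it.

```python
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

def nl_to_network_code(query: str, graph_schema: str) -> str:
    prompt = (f"You manage a communication network stored as a networkx graph G.\n"
              f"Schema: {graph_schema}\n"
              f"Write Python code (no prose) that answers: {query}")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

code = nl_to_network_code(
    "How many links have utilization above 80%?",
    "edges carry a float attribute 'util' in [0, 1]",
)
print(code)   # the operator inspects this before it is ever executed
```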

ChatGPT-based Investment Portfolio Selection

  • paper_url: http://arxiv.org/abs/2308.06260
  • repo_url: None
  • paper_authors: Oleksandr Romanko, Akhilesh Narayan, Roy H. Kwon
  • for: 投资组合选择(portfolio selection)
  • methods: 使用生成AI模型(ChatGPT)获取S&P500市场指数中可能有潜力的股票,并对这些股票进行优化配置
  • results: 结果表明,使用ChatGPT进行股票选择可以带来更好的回报,但在分配股票权重方面可能不如量化优化模型。然而,将AI生成的股票选择与量化优化模型相结合,可以获得更好的投资效果,建议未来投资决策中采用混合方法。
    Abstract In this paper, we explore potential uses of generative AI models, such as ChatGPT, for investment portfolio selection. Trusting investment advice from Generative Pre-Trained Transformer (GPT) models is a challenge due to model "hallucinations", necessitating careful verification and validation of the output. Therefore, we take an alternative approach. We use ChatGPT to obtain a universe of stocks from S&P500 market index that are potentially attractive for investing. Subsequently, we compared various portfolio optimization strategies that utilized this AI-generated trading universe, evaluating those against quantitative portfolio optimization models as well as comparing to some of the popular investment funds. Our findings indicate that ChatGPT is effective in stock selection but may not perform as well in assigning optimal weights to stocks within the portfolio. But when stocks selection by ChatGPT is combined with established portfolio optimization models, we achieve even better results. By blending strengths of AI-generated stock selection with advanced quantitative optimization techniques, we observed the potential for more robust and favorable investment outcomes, suggesting a hybrid approach for more effective and reliable investment decision-making in the future.
    摘要 在这篇论文中,我们探讨了使用生成式 AI 模型(如 ChatGPT)进行投资组合选择的可能性。由于 GPT 模型存在"幻觉"问题,直接信任其投资建议具有挑战性,需要对输出进行仔细的验证和确认,因此我们采取了一种不同的方法。我们使用 ChatGPT 从 S&P 500 市场指数中获取可能具有投资吸引力的股票,然后比较了利用这一 AI 生成的交易股票池的多种投资组合优化策略,将其与量化投资组合优化模型进行对比,并与一些流行的投资基金进行比较。我们的发现表明,ChatGPT 在股票选择方面是有效的,但在为组合内的股票分配最优权重方面表现欠佳。不过,当 ChatGPT 的股票选择与成熟的组合优化模型相结合时,我们可以获得更好的结果。通过融合 AI 生成的股票选择和先进的量化优化技术,我们观察到了更稳健、更有利的投资结果的潜力,这表明未来可以采用混合方法进行更有效、更可靠的投资决策。
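
One way to realise the hybrid step, as a sketch rather than the paper's exact optimizer: take historical returns for the LLM-selected tickers and assign weights with a classical minimum-variance rule.

```python
import numpy as np

def min_variance_weights(returns):
    """returns: (T days x N assets) matrix for the LLM-selected universe.
    Closed-form global minimum-variance weights, crudely clipped to long-only."""
    cov = np.cov(returns, rowvar=False)
    inv = np.linalg.pinv(cov)
    ones = np.ones(cov.shape[0])
    w = inv @ ones / (ones @ inv @ ones)
    w = np.clip(w, 0.0, None)              # drop short positions
    return w / w.sum()

rng = np.random.default_rng(0)
fake_returns = rng.normal(0.0005, 0.01, size=(250, 5))   # stand-in price data
print(min_variance_weights(fake_returns).round(3))
```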

Automated Sizing and Training of Efficient Deep Autoencoders using Second Order Algorithms

  • paper_url: http://arxiv.org/abs/2308.06221
  • repo_url: None
  • paper_authors: Kanishka Tyagi, Chinmay Rane, Michael Manry
  • For: 本研究旨在提出一种多步训练方法,用于设计通用线性分类器。
  • Methods: 首先,通过回归获得初始多类线性分类器。然后,通过剪除无用输入来降低验证错误。同时,通过类似于 Ho-Kashyap 规则的方法改进期望输出。接着,输出判别函数被缩放为通用线性分类器中 sigmoid 输出单元的净函数。
  • Results: 通过结合剪枝和增长策略,优化多层感知器的隐藏层规模和训练轮数;输入单元被缩放后作为 MLP 的输入;最后,通过改进每个深度学习块,提高整体深度学习模型的性能。
    Abstract We propose a multi-step training method for designing generalized linear classifiers. First, an initial multi-class linear classifier is found through regression. Then validation error is minimized by pruning of unnecessary inputs. Simultaneously, desired outputs are improved via a method similar to the Ho-Kashyap rule. Next, the output discriminants are scaled to be net functions of sigmoidal output units in a generalized linear classifier. We then develop a family of batch training algorithms for the multi layer perceptron that optimizes its hidden layer size and number of training epochs. Next, we combine pruning with a growing approach. Later, the input units are scaled to be the net function of the sigmoidal output units that are then fed as input to the MLP. We then propose improvements in each of the deep learning blocks, thereby improving the overall performance of the deep architecture. We discuss the principles and formulation regarding learning algorithms for deep autoencoders. We investigate several problems in deep autoencoders networks including training issues, the theoretical, mathematical and experimental justification that the networks are linear, optimizing the number of hidden units in each layer and determining the depth of the deep learning model. A direct implication of the current work is the ability to construct fast deep learning models using desktop level computational resources. This, in our opinion, promotes our design philosophy of building small but powerful algorithms. Performance gains are demonstrated at each step. Using widely available datasets, the final network's ten fold testing error is shown to be less than that of several other linear, generalized linear classifiers, multi layer perceptron and deep learners reported in the literature.
    摘要 我们提出了一种多步训练方法,用于设计通用线性分类器。首先,通过回归获得初始多类线性分类器。然后,通过剪除不必要的输入来降低验证错误;同时,通过类似于 Ho-Kashyap 规则的方法改进期望输出。接着,输出判别函数被缩放为通用线性分类器中 sigmoid 输出单元的净函数。随后,我们为多层感知器开发了一族批量训练算法,以优化其隐藏层规模和训练轮数,并将剪枝与增长方法相结合。之后,输入单元被缩放为 sigmoid 输出单元的净函数,并作为 MLP 的输入。我们随后对每个深度学习模块提出相应的改进,从而提升整体深度架构的性能。我们讨论了深度自编码器学习算法的原理与形式化,并研究了深度自编码器网络中的若干问题,包括训练问题,网络为线性的理论、数学与实验论证,各层隐藏单元数量的优化,以及深度学习模型深度的确定。本工作的一个直接意义是,可以利用桌面级计算资源构建快速的深度学习模型,这符合我们构建"小而强"算法的设计理念。每一步都展示了性能提升。在广泛可得的数据集上,最终网络的十折测试误差低于文献中报告的多种线性分类器、广义线性分类器、多层感知器和深度学习模型。

Safety in Traffic Management Systems: A Comprehensive Survey

  • paper_url: http://arxiv.org/abs/2308.06204
  • repo_url: None
  • paper_authors: Wenlu Du, Ankan Dash, Jing Li, Hua Wei, Guiling Wang
  • for: 这篇论文旨在提供对交通管理系统安全性的全面回顾,包括交通管理系统中出现的各种安全问题、当前研究的状况以及提高交通管理系统安全性的技术和方法。
  • methods: 论文使用了文献综述的方法,概括了交通管理系统中的安全问题,并分析了当前研究的状况和提议。
  • results: 论文总结了当前研究的结果和限制,并提出了未来研究的方向。
    Abstract Traffic management systems play a vital role in ensuring safe and efficient transportation on roads. However, the use of advanced technologies in traffic management systems has introduced new safety challenges. Therefore, it is important to ensure the safety of these systems to prevent accidents and minimize their impact on road users. In this survey, we provide a comprehensive review of the literature on safety in traffic management systems. Specifically, we discuss the different safety issues that arise in traffic management systems, the current state of research on safety in these systems, and the techniques and methods proposed to ensure the safety of these systems. We also identify the limitations of the existing research and suggest future research directions.
    摘要 交通管理系统在公路交通运输中发挥着关键作用,但先进技术的使用也给交通管理系统引入了新的安全挑战。因此,确保这些系统的安全性非常重要,以避免事故并减少其对公路用户的影响。在这份综述中,我们对交通管理系统安全性的文献进行了全面回顾。具体来说,我们讨论了交通管理系统中出现的各类安全问题、当前的研究进展,以及为确保交通管理系统安全而提出的技术和方法。我们还指出了现有研究的局限性,并建议了未来的研究方向。

cs.CL - 2023-08-12

MT4CrossOIE: Multi-stage Tuning for Cross-lingual Open Information Extraction

  • paper_url: http://arxiv.org/abs/2308.06552
  • repo_url: https://github.com/CSJianYang/Multilingual-Multimodal-NLP/tree/main/MT4CrossOIE
  • paper_authors: Zixiang Wang, Linzheng Chai, Jian Yang, Jiaqi Bai, Yuwei Yin, Jiaheng Liu, Hongcheng Guo, Tongliang Li, Liqun Yang, Hebboul Zine el-abidine, Zhoujun Li
  • for: 提高多语言开放信息提取(Cross-Lingual Open Information Extraction,简称CrossIE)的效果,使得模型能够在不同语言的文本上提取结构化信息。
  • methods: 提出了一种多阶段调整框架MT4CrossIE,通过将语言特定的知识注入到共享模型中来提高crossIE的性能。
  • results: 实验结果表明,通过组合模型基于和数据基于的转移技术,MT4CrossIE可以在多种benchmark上提高crossIE的性能,并且多个语言特定模块的组合对crossIE的性能有积极的影响。
    Abstract Cross-lingual open information extraction aims to extract structured information from raw text across multiple languages. Previous work uses a shared cross-lingual pre-trained model to handle the different languages but underuses the potential of the language-specific representation. In this paper, we propose an effective multi-stage tuning framework called MT4CrossIE, designed for enhancing cross-lingual open information extraction by injecting language-specific knowledge into the shared model. Specifically, the cross-lingual pre-trained model is first tuned in a shared semantic space (e.g., embedding matrix) in the fixed encoder and then other components are optimized in the second stage. After enough training, we freeze the pre-trained model and tune the multiple extra low-rank language-specific modules using mixture-of-LoRAs for model-based cross-lingual transfer. In addition, we leverage two-stage prompting to encourage the large language model (LLM) to annotate the multi-lingual raw data for data-based cross-lingual transfer. The model is trained with multi-lingual objectives on our proposed dataset OpenIE4++ by combing the model-based and data-based transfer techniques. Experimental results on various benchmarks emphasize the importance of aggregating multiple plug-in-and-play language-specific modules and demonstrate the effectiveness of MT4CrossIE in cross-lingual OIE\footnote{\url{https://github.com/CSJianYang/Multilingual-Multimodal-NLP}.
    摘要 跨语言开放信息提取旨在从多种语言的原始文本中提取结构化信息。先前的工作使用共享的跨语言预训练模型处理不同语言,但未能充分利用语言特定表示的潜力。在这篇论文中,我们提出了一种高效的多阶段调整框架 MT4CrossIE,通过向共享模型注入语言特定知识来提高跨语言开放信息提取。具体来说,首先在固定编码器的共享语义空间(例如 embedding matrix)中对跨语言预训练模型进行调整,然后在第二阶段优化其他组件。经过充分训练后,我们冻结预训练模型,并使用 mixture-of-LoRAs 调整多个额外的低秩语言特定模块,实现基于模型的跨语言迁移。此外,我们利用两阶段提示来促使大语言模型(LLM)对多语言原始数据进行标注,实现基于数据的跨语言迁移。我们在所提出的 OpenIE4++ 数据集上以多语言目标训练模型,结合了基于模型和基于数据的迁移技术。在多个基准上的实验结果强调了聚合多个即插即用语言特定模块的重要性,并证明了 MT4CrossIE 在跨语言 OIE 中的有效性。
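
A compact sketch of the mixture-of-LoRAs idea (the dimensions, router, and initialization are illustrative assumptions; MT4CrossIE trains the language-specific low-rank modules on top of a frozen shared model):

```python
import torch
import torch.nn as nn

class MixtureOfLoRAs(nn.Module):
    """A frozen base linear layer plus several low-rank, language-specific
    adapters whose outputs are mixed by a learned router."""
    def __init__(self, base: nn.Linear, n_langs: int = 4, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)               # shared, frozen
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(n_langs, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_langs, d_out, rank))
        self.router = nn.Linear(d_in, n_langs)

    def forward(self, x):                                    # x: (batch, d_in)
        gate = torch.softmax(self.router(x), dim=-1)         # (batch, n_langs)
        low = torch.einsum("lri,bi->blr", self.A, x)         # down-projections
        delta = torch.einsum("lor,blr->blo", self.B, low)    # up-projections
        return self.base(x) + (gate[..., None] * delta).sum(1)
```

In practice a wrapper like this could replace, e.g., each projection of the frozen encoder, so only the adapters and router are trained per language.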

Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition

  • paper_url: http://arxiv.org/abs/2308.06547
  • repo_url: None
  • paper_authors: Han Zhu, Dongji Gao, Gaofeng Cheng, Daniel Povey, Pengyuan Zhang, Yonghong Yan
  • for: 在标注数据稀缺时,提高半监督学习下自动语音识别的性能
  • methods: 提出一种新的替代伪标签框架,包括一种广义 CTC 损失函数、基于置信度的伪标签错误检测方法,以及自动确定阈值的方法
  • results: 实验表明,该框架在半监督学习中提升了自动语音识别的性能,并且可以自动确定阈值,免去手动调参之苦
    Abstract When labeled data is insufficient, semi-supervised learning with the pseudo-labeling technique can significantly improve the performance of automatic speech recognition. However, pseudo-labels are often noisy, containing numerous incorrect tokens. Taking noisy labels as ground-truth in the loss function results in suboptimal performance. Previous works attempted to mitigate this issue by either filtering out the noisiest pseudo-labels or improving the overall quality of pseudo-labels. While these methods are effective to some extent, it is unrealistic to entirely eliminate incorrect tokens in pseudo-labels. In this work, we propose a novel framework named alternative pseudo-labeling to tackle the issue of noisy pseudo-labels from the perspective of the training objective. The framework comprises several components. Firstly, a generalized CTC loss function is introduced to handle noisy pseudo-labels by accepting alternative tokens in the positions of incorrect tokens. Applying this loss function in pseudo-labeling requires detecting incorrect tokens in the predicted pseudo-labels. In this work, we adopt a confidence-based error detection method that identifies the incorrect tokens by comparing their confidence scores with a given threshold, thus necessitating the confidence score to be discriminative. Hence, the second proposed technique is the contrastive CTC loss function that widens the confidence gap between the correctly and incorrectly predicted tokens, thereby improving the error detection ability. Additionally, obtaining satisfactory performance with confidence-based error detection typically requires extensive threshold tuning. Instead, we propose an automatic thresholding method that uses labeled data as a proxy for determining the threshold, thus saving the pain of manual tuning.
    摘要 当标注数据不足时,半监督学习结合伪标签技术可以显著提高自动语音识别的性能。然而,伪标签往往带有噪声,包含许多错误的标记。在损失函数中将含噪伪标签当作真实标签会导致次优的性能。先前的工作尝试缓解这一问题,要么筛除噪声最大的伪标签,要么提升伪标签的整体质量。尽管这些方法有一定的效果,但完全消除伪标签中的错误标记是不现实的。在这项工作中,我们从训练目标的角度出发,提出了一种名为替代伪标签(alternative pseudo-labeling)的新框架来应对含噪伪标签问题。该框架包括以下几个组成部分:1. 一种广义 CTC 损失函数,可以在错误标记的位置接受替代标记,从而处理含噪伪标签;2. 一种基于置信度的错误检测方法,通过将预测标记的置信度分数与给定阈值比较来识别错误标记,这要求置信度分数具有判别性,因此我们提出了对比 CTC 损失函数,拉大正确与错误预测标记之间的置信度差距,提高错误检测能力;3. 一种自动确定阈值的方法,以标注数据为代理来确定阈值,免去繁琐的手动调参。
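
The automatic-thresholding step can be sketched as follows, with labeled data as the proxy; the selection criterion used here (balanced accuracy over correct/incorrect tokens) is an assumption, not necessarily the paper's exact rule:

```python
import numpy as np

def pick_threshold(conf_correct, conf_wrong):
    """conf_correct / conf_wrong: confidence scores of tokens known (from
    labeled data) to be correct / incorrect. Returns the cutoff that best
    separates the two groups, replacing manual threshold tuning."""
    best_t, best_score = 0.5, -1.0
    for t in np.unique(np.concatenate([conf_correct, conf_wrong])):
        score = ((conf_correct >= t).mean() + (conf_wrong < t).mean()) / 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```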

With a Little Help from the Authors: Reproducing Human Evaluation of an MT Error Detector

  • paper_url: http://arxiv.org/abs/2308.06527
  • repo_url: None
  • paper_authors: Ondřej Plátek, Mateusz Lango, Ondřej Dušek
  • for: 本研究复现了 Vamvas 和 Sennrich (2022) 论文中的人工评估实验,该实验评估了一个自动检测机器翻译输出中过度和不足翻译(译文包含比原文更多或更少信息)的系统。
  • methods: 我们使用了作者提供的文档和代码,但在重新实现实验setup时,我们发现了一些问题,并提供了改进可重现性的建议。
  • results: 我们的复制结果大致与原始研究的结论相符,但在某些情况下,我们 observeda statistically significant differences,表明了人工标注的高变异性。
    Abstract This work presents our efforts to reproduce the results of the human evaluation experiment presented in the paper of Vamvas and Sennrich (2022), which evaluated an automatic system detecting over- and undertranslations (translations containing more or less information than the original) in machine translation (MT) outputs. Despite the high quality of the documentation and code provided by the authors, we discuss some problems we found in reproducing the exact experimental setup and offer recommendations for improving reproducibility. Our replicated results generally confirm the conclusions of the original study, but in some cases, statistically significant differences were observed, suggesting a high variability of human annotation.
    摘要 本文介绍我们复现 Vamvas 和 Sennrich(2022)论文中人工评估实验的工作,该实验评估了一个自动检测机器翻译(MT)输出中过译和欠译(译文包含比原文更多或更少信息)的系统。尽管作者提供的文档和代码质量很高,我们在复现确切的实验设置时仍发现了一些问题,并提出了改进可复现性的建议。我们的复现结果总体上证实了原研究的结论,但在某些情况下观察到统计显著差异,表明人工标注存在较高的变异性。

AutoConv: Automatically Generating Information-seeking Conversations with Large Language Models

  • paper_url: http://arxiv.org/abs/2308.06507
  • repo_url: None
  • paper_authors: Siheng Li, Cheng Yang, Yichun Yin, Xinyu Zhu, Zesen Cheng, Lifeng Shang, Xin Jiang, Qun Liu, Yujiu Yang
  • for: 缓解信息寻求对话生成中训练数据稀缺的问题
  • methods: 利用大语言模型的少样本学习和生成能力,将对话生成定义为语言建模任务,并在少量人类对话上微调语言模型以捕捉信息寻求过程的特征,生成高质量的合成对话
  • results: 对两个常用的数据集进行实验,证明AutoConv具有显著的提升和减少人工标注的依赖性
    Abstract Information-seeking conversation, which aims to help users gather information through conversation, has achieved great progress in recent years. However, the research is still stymied by the scarcity of training data. To alleviate this problem, we propose AutoConv for synthetic conversation generation, which takes advantage of the few-shot learning ability and generation capacity of large language models (LLM). Specifically, we formulate the conversation generation problem as a language modeling task, then finetune an LLM with a few human conversations to capture the characteristics of the information-seeking process and use it for generating synthetic conversations with high quality. Experimental results on two frequently-used datasets verify that AutoConv has substantial improvements over strong baselines and alleviates the dependence on human annotation. In addition, we also provide several analysis studies to promote future research.
    摘要 信息寻求对话,目前已经取得了很大的进步,但研究还面临着数据缺乏的问题。为解决这个问题,我们提出了AutoConv,它利用大语言模型(LLM)的几shot学习能力和生成能力来生成高质量的人工对话。具体来说,我们将对话生成问题定义为语言模型化问题,然后使用一些人类对话来训练LLM,以capture信息寻求过程中的特点。实验结果表明,AutoConv在两个常用的数据集上具有显著的提升和减少人类注释的依赖性。此外,我们还提供了一些分析研究,以便未来的研究。

NewsDialogues: Towards Proactive News Grounded Conversation

  • paper_url: http://arxiv.org/abs/2308.06501
  • repo_url: https://github.com/sihengli99/newsdialogues
  • paper_authors: Siheng Li, Yichun Yin, Cheng Yang, Wangjie Jiang, Yiwei Li, Zesen Cheng, Lifeng Shang, Xin Jiang, Qun Liu, Yujiu Yang
  • for: 本研究旨在提出一种新任务——主动式新闻驱动对话(Proactive News Grounded Conversation),使对话系统可以基于新闻中的一些关键话题主动引导对话。
  • methods: 本研究使用了一种名为Predict-Generate-Rank的方法,包括一个生成器用于预测和生成基于新闻的知识,以及一个排名器用于对多个回答进行排名,以避免曝光偏见。
  • results: 经过广泛的实验,研究发现了一些关键发现和挑战,以及提出了未来研究的一些方向。
    Abstract Hot news is one of the most popular topics in daily conversations. However, news grounded conversation has long been stymied by the lack of well-designed task definition and scarce data. In this paper, we propose a novel task, Proactive News Grounded Conversation, in which a dialogue system can proactively lead the conversation based on some key topics of the news. In addition, both information-seeking and chit-chat scenarios are included realistically, where the user may ask a series of questions about the news details or express their opinions and be eager to chat. To further develop this novel task, we collect a human-to-human Chinese dialogue dataset \ts{NewsDialogues}, which includes 1K conversations with a total of 14.6K utterances and detailed annotations for target topics and knowledge spans. Furthermore, we propose a method named Predict-Generate-Rank, consisting of a generator for grounded knowledge prediction and response generation, and a ranker for the ranking of multiple responses to alleviate the exposure bias. We conduct comprehensive experiments to demonstrate the effectiveness of the proposed method and further present several key findings and challenges to prompt future research.
    摘要 热门新闻是日常对话中最受欢迎的话题之一,然而新闻基于对话的探讨长期受到缺乏有效定义任务和珍贵数据的限制。在这篇论文中,我们提出了一个新任务:主动新闻基于对话(Proactive News Grounded Conversation),在其中对话系统可以主动引导对话,基于新闻中的一些关键话题。此外,我们还包括了信息寻求和聊天场景,用户可能会提问新闻细节的多个问题或表达自己的意见并且很有兴趣聊天。为了进一步开发这个新任务,我们收集了一个人类对话 dataset 《NewsDialogues》,该 dataset 包含了1000个对话,总共14600个语音和详细的注释目标话题和知识范围。此外,我们还提出了一种方法,即预测生成排名(Predict-Generate-Rank),该方法包括一个生成基于新闻的预测和回答生成器,以及一个排名器用于多个答案的排名,以降低曝光偏见。我们进行了广泛的实验,以示提出的方法的效iveness,并提出了一些关键发现和未来研究的挑战。

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

  • paper_url: http://arxiv.org/abs/2308.06463
  • repo_url: https://github.com/robustnlp/cipherchat
  • paper_authors: Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu
  • for: 本研究旨在探讨大语言模型(LLM)的安全对齐是否可以推广到非自然语言(密码)领域。
  • methods: 我们提出了一种名为 CipherChat 的新框架,允许人类通过密文提示和少量加密示例与 LLM 进行交流。我们使用 CipherChat 评估当今最先进的 LLM,包括 ChatGPT 和 GPT-4,在不同的人类密码下的 11 个安全领域中的表现。
  • results: 实验结果表明,某些密码可以在某些安全领域中绕过 GPT-4 的安全对齐,这说明了为非自然语言开发安全对齐的必要性。此外,我们发现 LLM 似乎有一种"秘密密码",并提出了一种名为 SelfCipher 的新方法,可以通过角色扮演和少量自然语言示例来触发这种能力。SelfCipher 出人意料地在大多数情况下超过了现有的人类密码。我们将代码和数据发布在 GitHub 上(https://github.com/RobustNLP/CipherChat)。
    Abstract Safety lies at the core of the development of Large Language Models (LLMs). There is ample work on aligning LLMs with human ethics and preferences, including data filtering in pretraining, supervised fine-tuning, reinforcement learning from human feedback, and red teaming, etc. In this study, we discover that chat in cipher can bypass the safety alignment techniques of LLMs, which are mainly conducted in natural languages. We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers. CipherChat enables humans to chat with LLMs through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations. We use CipherChat to assess state-of-the-art LLMs, including ChatGPT and GPT-4 for different representative human ciphers across 11 safety domains in both English and Chinese. Experimental results show that certain ciphers succeed almost 100% of the time to bypass the safety alignment of GPT-4 in several safety domains, demonstrating the necessity of developing safety alignment for non-natural languages. Notably, we identify that LLMs seem to have a ''secret cipher'', and propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability. SelfCipher surprisingly outperforms existing human ciphers in almost all cases. Our code and data will be released at https://github.com/RobustNLP/CipherChat.
    摘要 安全是大语言模型(LLM)发展的核心。已有很多工作致力于将 LLM 与人类伦理和偏好对齐,包括预训练数据过滤、监督微调、基于人类反馈的强化学习以及红队测试等。在这项研究中,我们发现用密文聊天可以绕过 LLM 的安全对齐技术,这些技术主要是在自然语言上进行的。我们提出了一个新的框架 CipherChat,用于系统地检验安全对齐在非自然语言(密码)上的泛化能力。CipherChat 允许人们通过附有系统角色描述和少量加密示例的密文提示与 LLM 交流。我们使用 CipherChat 评估了最先进的 LLM,包括 ChatGPT 和 GPT-4,在英文和中文的 11 个安全领域中对多种代表性人类密码的表现。实验结果显示,某些密码在多个安全领域中几乎 100% 地绕过了 GPT-4 的安全对齐,这说明了为非自然语言开发安全对齐的必要性。值得注意的是,我们发现 LLM 似乎拥有一种"秘密密码",并提出了一种新的 SelfCipher,仅通过角色扮演和若干自然语言示例即可唤起这种能力。SelfCipher 在几乎所有情况下都出人意料地超过了现有的人类密码。我们的代码和数据将在 https://github.com/RobustNLP/CipherChat 发布。

Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic Talking-head Generation

  • paper_url: http://arxiv.org/abs/2308.06457
  • repo_url: https://github.com/zhichaowang970201/text-to-video
  • paper_authors: Zhichao Wang, Mengyu Dai, Keld Lundgaard
  • for: 本研究旨在提供一种基于文本的视频生成方法,具体来说是一种与个人身份无关的视频克隆方法,专注于 TTV 生成。
  • methods: 本研究提出了一种两阶段的方法,包括文本到语音转换和音频驱动的说话人头像生成。在第一阶段,我们利用预训练的零样本模型实现文本到语音转换;在第二阶段,我们使用音频驱动的说话人头像生成方法,基于第一阶段生成的音频产生有吸引力的视频。
  • results: 本研究通过对不同的文本到语音和音频驱动说话人头像生成方法进行比较分析,确定了未来研究和开发中最有前景的方法。此外,我们还提供了一些音频和视频示例,可以在以下链接中找到:https://github.com/ZhichaoWang970201/Text-to-Video/tree/main。
    Abstract The advent of ChatGPT has introduced innovative methods for information gathering and analysis. However, the information provided by ChatGPT is limited to text, and the visualization of this information remains constrained. Previous research has explored zero-shot text-to-video (TTV) approaches to transform text into videos. However, these methods lacked control over the identity of the generated audio, i.e., not identity-agnostic, hindering their effectiveness. To address this limitation, we propose a novel two-stage framework for person-agnostic video cloning, specifically focusing on TTV generation. In the first stage, we leverage pretrained zero-shot models to achieve text-to-speech (TTS) conversion. In the second stage, an audio-driven talking head generation method is employed to produce compelling videos provided the audio generated in the first stage. This paper presents a comparative analysis of different TTS and audio-driven talking head generation methods, identifying the most promising approach for future research and development. Some audio and videos samples can be found in the following link: https://github.com/ZhichaoWang970201/Text-to-Video/tree/main.
    摘要 随着 ChatGPT 的出现,信息收集和分析有了创新的方法。然而,ChatGPT 提供的信息仅限于文本,对这些信息的可视化仍然受限。先前的研究探索了零样本文本到视频(TTV)方法,将文本转换成视频。然而,这些方法无法控制生成音频的身份,即并非身份无关,从而妨碍了其有效性。为了解决这一限制,我们提出了一种新的两阶段框架,用于与个人身份无关的视频克隆,专门针对 TTV 生成。在第一阶段,我们利用预训练的零样本模型实现文本到语音(TTS)转换。在第二阶段,我们使用音频驱动的说话人头像生成方法,基于第一阶段生成的音频产生有吸引力的视频。本文对不同的 TTS 和音频驱动说话人头像生成方法进行了比较分析,并确定了未来研究和开发中最有前景的方法。有关音频和视频样例,请参考以下链接:https://github.com/ZhichaoWang970201/Text-to-Video/tree/main。

Demonstration-based learning for few-shot biomedical named entity recognition under machine reading comprehension

  • paper_url: http://arxiv.org/abs/2308.06454
  • repo_url: None
  • paper_authors: Leilei Su, Jian Chen, Yifan Peng, Cong Sun
  • for: 提高少样本场景下 BioNER 模型的识别能力
  • methods: 利用基于示例(demonstration)的学习方法,将 BioNER 转化为机器阅读理解问题
  • results: 在 6 个数据集上,与基线方法相比,25-shot 和 50-shot 设置下平均 F1 分数分别提高 1.1% 和 1.0%,并且可以与使用大量标注数据的全监督学习方法竞争
    Abstract Although deep learning techniques have shown significant achievements, they frequently depend on extensive amounts of hand-labeled data and tend to perform inadequately in few-shot scenarios. The objective of this study is to devise a strategy that can improve the model's capability to recognize biomedical entities in scenarios of few-shot learning. By redefining biomedical named entity recognition (BioNER) as a machine reading comprehension (MRC) problem, we propose a demonstration-based learning method to address few-shot BioNER, which involves constructing appropriate task demonstrations. In assessing our proposed method, we compared the proposed method with existing advanced methods using six benchmark datasets, including BC4CHEMD, BC5CDR-Chemical, BC5CDR-Disease, NCBI-Disease, BC2GM, and JNLPBA. We examined the models' efficacy by reporting F1 scores from both the 25-shot and 50-shot learning experiments. In 25-shot learning, we observed 1.1% improvements in the average F1 scores compared to the baseline method, reaching 61.7%, 84.1%, 69.1%, 70.1%, 50.6%, and 59.9% on six datasets, respectively. In 50-shot learning, we further improved the average F1 scores by 1.0% compared to the baseline method, reaching 73.1%, 86.8%, 76.1%, 75.6%, 61.7%, and 65.4%, respectively. We reported that in the realm of few-shot learning BioNER, MRC-based language models are much more proficient in recognizing biomedical entities compared to the sequence labeling approach. Furthermore, our MRC-language models can compete successfully with fully-supervised learning methodologies that rely heavily on the availability of abundant annotated data. These results highlight possible pathways for future advancements in few-shot BioNER methodologies.
    摘要 尽管深度学习技术已经取得了显著的成就,但它们往往依赖大量的手工标注数据,并且在少样本场景下表现不佳。本研究的目标是提出一种策略,以提高模型在少样本学习中识别生物医学实体的能力。我们将生物医学命名实体识别(BioNER)重新定义为机器阅读理解(MRC)问题,并提出了一种基于示例的学习方法来解决少样本 BioNER,其核心是构造合适的任务示例。我们使用六个标准测试集(BC4CHEMD、BC5CDR-Chemical、BC5CDR-Disease、NCBI-Disease、BC2GM 和 JNLPBA)将所提方法与现有先进方法进行比较,并以 25-shot 和 50-shot 学习实验的 F1 分数评估模型效果。在 25-shot 学习中,我们的方法较基线方法的平均 F1 分数提高 1.1%,在六个数据集上分别达到 61.7%、84.1%、69.1%、70.1%、50.6% 和 59.9%。在 50-shot 学习中,平均 F1 分数进一步较基线方法提高 1.0%,分别达到 73.1%、86.8%、76.1%、75.6%、61.7% 和 65.4%。我们发现,在少样本 BioNER 领域,基于 MRC 的语言模型比序列标注方法更善于识别生物医学实体。此外,我们的 MRC 语言模型能够与严重依赖大量标注数据的全监督学习方法相竞争。这些结果为少样本 BioNER 方法的未来发展指明了可能的路径。
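
To illustrate how BioNER can be cast as MRC with demonstrations (the template below is an illustrative assumption, not the paper's exact format):

```python
def build_mrc_input(sentence, entity_type, demonstrations):
    """demonstrations: list of (text, [entity, ...]) pairs used as the few-shot
    task demonstration; the model 'reads' them and answers for the new text."""
    demo_block = "\n".join(
        f"Text: {text}\nEntities ({entity_type}): {', '.join(ents) or 'none'}"
        for text, ents in demonstrations
    )
    return (f"Question: Which {entity_type} entities are mentioned?\n"
            f"{demo_block}\n"
            f"Text: {sentence}\nEntities ({entity_type}):")

demos = [("Aspirin reduced fever.", ["Aspirin"])]
print(build_mrc_input("Ibuprofen eased the pain.", "chemical", demos))
```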

Simple Model Also Works: A Novel Emotion Recognition Network in Textual Conversation Based on Curriculum Learning Strategy

  • paper_url: http://arxiv.org/abs/2308.06450
  • repo_url: None
  • paper_authors: Jiang Li, Xiaoping Wang, Yingjian Liu, Qing Zhou, Zhigang Zeng
  • for: 本研究主要针对对话中的情感识别任务(Emotion Recognition in Conversation, ERC),以提高情感识别的效率和精度。
  • methods: 本研究提出了一种基于课程学习(Curriculum Learning, CL)策略的情感识别网络(Emotion Recognition Network, ERNetCL),并使用时间编码器(Temporal Encoder, TE)和空间编码器(Spatial Encoder, SE)以简洁的方式融合先前方法的优点。
  • results: 实验结果显示,所提方法效果显著,与其他基线模型相比具有明显的性能提升。
    Abstract Emotion Recognition in Conversation (ERC) has emerged as a research hotspot in domains such as conversational robots and question-answer systems. How to efficiently and adequately retrieve contextual emotional cues has been one of the key challenges in the ERC task. Existing efforts do not fully model the context and employ complex network structures, resulting in excessive computational resource overhead without substantial performance improvement. In this paper, we propose a novel Emotion Recognition Network based on Curriculum Learning strategy (ERNetCL). The proposed ERNetCL primarily consists of Temporal Encoder (TE), Spatial Encoder (SE), and Curriculum Learning (CL) loss. We utilize TE and SE to combine the strengths of previous methods in a simplistic manner to efficiently capture temporal and spatial contextual information in the conversation. To simulate the way humans learn curriculum from easy to hard, we apply the idea of CL to the ERC task to progressively optimize the network parameters of ERNetCL. At the beginning of training, we assign lower learning weights to difficult samples. As the epoch increases, the learning weights for these samples are gradually raised. Extensive experiments on four datasets exhibit that our proposed method is effective and dramatically beats other baseline models.
    摘要 对话中的情感识别(ERC)已成为会话机器人和问答系统等领域的研究热点。如何高效且充分地检索上下文情感线索,是 ERC 任务中的关键挑战之一。现有工作没有充分建模上下文,且采用复杂的网络结构,导致计算资源开销过大而性能提升有限。本文提出了一种基于课程学习策略的情感识别网络(ERNetCL)。该网络主要由时间编码器(TE)、空间编码器(SE)和课程学习(CL)损失组成。我们利用 TE 和 SE 以简洁的方式结合先前方法的优点,高效地捕捉对话中的时间和空间上下文信息。为模拟人类由易到难的课程学习方式,我们将 CL 思想应用于 ERC 任务,逐步优化 ERNetCL 的网络参数:训练初期为困难样本分配较低的学习权重,随着轮数增加,这些样本的学习权重逐渐升高。在四个数据集上的大量实验表明,我们所提的方法效果显著,大幅超越其他基线模型。
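
One simple realisation of the easy-to-hard schedule (the exact CL weighting used by ERNetCL may differ): each sample carries a difficulty score, and the loss weight of hard samples ramps up from low to full across epochs.

```python
import torch
import torch.nn.functional as F

def curriculum_weights(difficulty, epoch, total_epochs):
    """difficulty in [0, 1] per sample (1 = hardest). Early in training hard
    samples are down-weighted; weights ramp linearly to 1.0 by the last epoch."""
    ramp = min(1.0, epoch / max(1, total_epochs - 1))
    return 1.0 - difficulty * (1.0 - ramp)

def curriculum_loss(logits, labels, difficulty, epoch, total_epochs):
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    return (curriculum_weights(difficulty, epoch, total_epochs) * per_sample).mean()
```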

Performance Prediction for Multi-hop Questions

  • paper_url: http://arxiv.org/abs/2308.06431
  • repo_url: None
  • paper_authors: Mohammadreza Samadi, Davood Rafiei
  • for: 预测开放领域多步问答(QA)问题的评估难度。
  • methods: 提出了一种新的预测方法multHP,用于预测开放领域多步问答问题的表现。
  • results: 在最大的多跳问答数据集上用多个现代问答系统进行了广泛评估,结果显示所提方法是性能的强预测器,优于传统单跳 QPP 模型。此外,该方法还可以有效用于优化问答系统的参数(如检索文档的数量),从而提升整体检索性能。
    Abstract We study the problem of Query Performance Prediction (QPP) for open-domain multi-hop Question Answering (QA), where the task is to estimate the difficulty of evaluating a multi-hop question over a corpus. Despite the extensive research on predicting the performance of ad-hoc and QA retrieval models, there has been a lack of study on the estimation of the difficulty of multi-hop questions. The problem is challenging due to the multi-step nature of the retrieval process, potential dependency of the steps and the reasoning involved. To tackle this challenge, we propose multHP, a novel pre-retrieval method for predicting the performance of open-domain multi-hop questions. Our extensive evaluation on the largest multi-hop QA dataset using several modern QA systems shows that the proposed model is a strong predictor of the performance, outperforming traditional single-hop QPP models. Additionally, we demonstrate that our approach can be effectively used to optimize the parameters of QA systems, such as the number of documents to be retrieved, resulting in improved overall retrieval performance.
    摘要 我们研究了开放领域多步问答(QA)中的问题评估性能预测(QPP)问题,即估计评估一个多步问题的难度。尽管有很多关于预测广泛问答和搜索模型性能的研究,但是没有研究了多步问题的预测。这个问题具有多步搜索过程的多样性和步骤之间的依赖关系,以及需要进行推理。为解决这个挑战,我们提出了 multHP,一种新的预测开放领域多步问题性能的方法。我们对最大的多步问答数据集进行了广泛的评估,结果表明,我们提出的模型是一个强大的性能预测器,超过了传统单步 QPP 模型。此外,我们还证明了我们的方法可以有效地用于优化 QA 系统的参数,例如检索文档的数量,从而提高总体检索性能。

Dynamic Planning with a LLM

  • paper_url: http://arxiv.org/abs/2308.06391
  • repo_url: https://github.com/itl-ed/llm-dp
  • paper_authors: Gautier Dagan, Frank Keller, Alex Lascarides
  • for: 解决 embodied agent 应用问题,尤其是复杂的计划需要多步骤的情况。
  • methods: 融合 neural network 和 symbolic planner,使用 LLM 和 traditional planner 共同解决 embodied task。
  • results: LLM-DP 比 naive LLM ReAct baseline 更快和更高效地解决 Alfworld 问题。
    Abstract While Large Language Models (LLMs) can solve many NLP tasks in zero-shot settings, applications involving embodied agents remain problematic. In particular, complex plans that require multi-step reasoning become difficult and too costly as the context window grows. Planning requires understanding the likely effects of one's actions and identifying whether the current environment satisfies the goal state. While symbolic planners find optimal solutions quickly, they require a complete and accurate representation of the planning problem, severely limiting their use in practical scenarios. In contrast, modern LLMs cope with noisy observations and high levels of uncertainty when reasoning about a task. Our work presents LLM Dynamic Planner (LLM-DP): a neuro-symbolic framework where an LLM works hand-in-hand with a traditional planner to solve an embodied task. Given action-descriptions, LLM-DP solves Alfworld faster and more efficiently than a naive LLM ReAct baseline.
    摘要 大型语言模型(LLM)可以在零样本设置下解决许多自然语言处理任务,但涉及具身智能体的应用仍然存在问题。特别是,随着上下文窗口的增长,需要多步推理的复杂规划变得困难且代价过高。规划需要理解自身动作的可能后果,并判断当前环境是否满足目标状态。符号规划器可以快速找到最优解,但它们需要对规划问题的完整且准确的表示,严重限制了其在实际场景中的应用。相比之下,现代 LLM 在面对噪声观测和高度不确定性时仍能对任务进行推理。我们的工作提出了 LLM 动态规划器(LLM-DP):一种神经符号框架,让 LLM 与传统规划器携手解决具身任务。给定动作描述,LLM-DP 比朴素的 LLM ReAct 基线更快、更高效地解决了 Alfworld 任务。
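
In outline (a sketch with caller-supplied stubs, not the released implementation), the neuro-symbolic loop pairs the LLM's world guesses with a classical planner:

```python
def llm_dp_plan(llm_propose, symbolic_plan, observation, goal):
    """llm_propose(observation) -> candidate world states (e.g. likely object
    locations), most plausible first; symbolic_plan(world, goal) -> plan or
    None (a PDDL-style planner call). Both are hypothetical stubs here."""
    for world in llm_propose(observation):
        plan = symbolic_plan(world, goal)
        if plan is not None:
            return plan          # first plausible world that admits a valid plan
    return None                  # fall back, e.g. ask the LLM to re-sample
```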

Bilingual Streaming ASR with Grapheme units and Auxiliary Monolingual Loss

  • paper_url: http://arxiv.org/abs/2308.06327
  • repo_url: None
  • paper_authors: Mohammad Soleymanpour, Mahmoud Al Ismail, Fahimeh Bahmaninezhad, Kshitiz Kumar, Jian Wu
  • for: 这个研究旨在提供一个英文作为次要区域的混合自动语音识别(ASR)设置中的双语解决方案。
  • methods: 我们的主要开发包括: (a) 发音词库使用字素单位而不是音素单位, (b) 完全双语对齐模型和随后的双语流式 Transformer 模型, (c) 平行编码器结构并带有语言识别(LID)损失, (d) 平行编码器并带有用于单语投影的辅助损失。
  • results: 我们的工作在大规模训练和测试任务中显示出强大的英文混合能力。特别是,双语 IT 模型将一个混合 IT 任务的词错误率从 46.5% 降至 13.8%,同时在 IT 测试中取得 9.6%,与单语 IT 模型(9.5%)几乎持平。
    Abstract We introduce a bilingual solution to support English as secondary locale for most primary locales in hybrid automatic speech recognition (ASR) settings. Our key developments constitute: (a) pronunciation lexicon with grapheme units instead of phone units, (b) a fully bilingual alignment model and subsequently bilingual streaming transformer model, (c) a parallel encoder structure with language identification (LID) loss, (d) parallel encoder with an auxiliary loss for monolingual projections. We conclude that in comparison to LID loss, our proposed auxiliary loss is superior in specializing the parallel encoders to respective monolingual locales, and that contributes to stronger bilingual learning. We evaluate our work on large-scale training and test tasks for bilingual Spanish (ES) and bilingual Italian (IT) applications. Our bilingual models demonstrate strong English code-mixing capability. In particular, the bilingual IT model improves the word error rate (WER) for a code-mix IT task from 46.5% to 13.8%, while also achieving a close parity (9.6%) with the monolingual IT model (9.5%) over IT tests.
    摘要 我们介绍了一种双语解决方案,以英语为次要地区的hybrid自动语音识别(ASR)设置中支持英语。我们的关键发展包括:(a)使用字母单位而不是语音单位的发音词典,(b)完全双语对应模型和随后的双语流Transformer模型,(c)并行编码结构和语言标识(LID)损失,(d)并行编码器和辅助损失 для单语投影。我们认为,相比LID损失,我们提posed的辅助损失可以更好地特化并行编码器到各自的单语本地,从而为双语学习带来更强的特点。我们对大规模训练和测试任务进行了两种双语西班牙(ES)和双语意大利(IT)应用。我们的双语模型在英语混合代码 Task中显示出了强大的英语混合能力。特别是,双语IT模型将IT任务中的代码混合WER从46.5%降低至13.8%,同时也与单语IT模型(9.5%)在IT测试任务上达到了近似的水平(9.6%)。

Self-Alignment with Instruction Backtranslation

  • paper_url: http://arxiv.org/abs/2308.06259
  • repo_url: https://github.com/Spico197/Humback
  • paper_authors: Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, Mike Lewis
  • for: 这个论文是为了提高语言模型的质量而写的(improve the quality of language models)。
  • methods: 这个论文使用自动生成的指令Prompt来自我增强语言模型(instruction backtranslation)。
  • results: 这个论文的方法可以高效地自我增强语言模型,并且在 Alpaca 排行榜上优于所有不依赖蒸馏数据的基于 LLaMa 的模型(outperforms all other LLaMa-based models on the Alpaca leaderboard not relying on distillation data)。
    Abstract We present a scalable method to build a high quality instruction following language model by automatically labelling human-written text with corresponding instructions. Our approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model. Finetuning LLaMa on two iterations of our approach yields a model that outperforms all other LLaMa-based models on the Alpaca leaderboard not relying on distillation data, demonstrating highly effective self-alignment.
    摘要 我们提出了一种可扩展的方法,通过自动为人工撰写的文本标注相应的指令,来构建高质量的指令遵循语言模型。我们的方法名为指令反翻译(instruction backtranslation),从一个在少量种子数据上微调的语言模型和给定的网络语料开始。种子模型用于构造训练样本:先为网络文档生成指令提示(自我扩充),再从这些候选中挑选高质量样本(自我筛选)。这些数据随后用于微调更强的模型。对 LLaMa 进行两轮该方法的微调后,所得模型在不依赖蒸馏数据的情况下,在 Alpaca 排行榜上超越了所有其他基于 LLaMa 的模型,展示了高度有效的自我对齐。
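
Schematically, the two phases look as follows, with `seed_model` a stub callable standing in for the seed-finetuned LLaMA; the prompts and the 1-5 rating scale are illustrative assumptions:

```python
def self_augment(seed_model, documents):
    """Treat each web document as a response and generate its instruction."""
    return [(seed_model(f"Write the instruction this text answers:\n{doc}"), doc)
            for doc in documents]

def self_curate(seed_model, pairs):
    """Keep only pairs the model itself rates highly (self-curation)."""
    kept = []
    for instruction, response in pairs:
        rating = seed_model(
            "Rate from 1 to 5 how well the response follows the instruction.\n"
            f"Instruction: {instruction}\nResponse: {response}\nRating:")
        if rating.strip().startswith(("4", "5")):
            kept.append((instruction, response))
    return kept   # this set is used to fine-tune the next, stronger model
```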

KETM:A Knowledge-Enhanced Text Matching method

  • paper_url: http://arxiv.org/abs/2308.06235
  • repo_url: https://github.com/1094701018/ketm
  • paper_authors: Kexin Jiang, Yahui Zhao, Guozhe Jin, Zhenguo Zhang, Rongyi Cui
  • for: 这个论文是为了提高文本匹配 task 的性能,通过增强模型理解和逻辑能力。
  • methods: 该模型使用 Wiktionary 检索文本单词定义作为外部知识,并通过多角度池化提取文本和知识的特征向量。然后,通过门控机制对文本和知识进行加权融合,以提高模型的理解和推理能力。
  • results: 在四个 datasets 上进行了实验验证,结果显示,该模型在所有四个 datasets 上都表现良好,并且与不添加外部知识的基本模型相比,该模型的性能有所提高,这证明了该模型的有效性。
    Abstract Text matching is the task of matching two texts and determining the relationship between them, which has extensive applications in natural language processing tasks such as reading comprehension, and Question-Answering systems. The mainstream approach is to compute text representations or to interact with the text through attention mechanism, which is effective in text matching tasks. However, the performance of these models is insufficient for texts that require commonsense knowledge-based reasoning. To this end, in this paper, We introduce a new model for text matching called the Knowledge Enhanced Text Matching model (KETM), to enrich contextual representations with real-world common-sense knowledge from external knowledge sources to enhance our model understanding and reasoning. First, we use Wiktionary to retrieve the text word definitions as our external knowledge. Secondly, we feed text and knowledge to the text matching module to extract their feature vectors. The text matching module is used as an interaction module by integrating the encoder layer, the co-attention layer, and the aggregation layer. Specifically, the interaction process is iterated several times to obtain in-depth interaction information and extract the feature vectors of text and knowledge by multi-angle pooling. Then, we fuse text and knowledge using a gating mechanism to learn the ratio of text and knowledge fusion by a neural network that prevents noise generated by knowledge. After that, experimental validation on four datasets are carried out, and the experimental results show that our proposed model performs well on all four datasets, and the performance of our method is improved compared to the base model without adding external knowledge, which validates the effectiveness of our proposed method. The code is available at https://github.com/1094701018/KETM
    摘要 文本匹配是自然语言处理任务中的一项重要任务,它有广泛的应用于阅读理解和问答系统等领域。主流方法是计算文本表示或通过注意力机制与文本进行交互,这些模型在文本匹配任务中表现良好。然而,这些模型在需要基于常识知识推理的文本匹配任务上表现不够。为此,我们在这篇论文中提出了一种新的文本匹配模型,即知识增强文本匹配模型(KETM),用来自外部知识源的真实世界常识知识丰富上下文表示,以增强模型的理解和推理能力。首先,我们使用 Wiktionary 提取文本单词定义作为外部知识。然后,我们将文本和知识输入文本匹配模块,以提取它们的特征向量。文本匹配模块被用作交互模块,组合了编码层、共注意力层和聚合层。具体来说,交互过程会多次迭代,以获取深入的交互信息,并使用多角度池化提取文本和知识的特征向量。接着,我们使用门控机制融合文本和知识,由神经网络学习文本与知识的融合比例,以防止知识引入的噪声。最后,我们在四个数据集上进行了实验验证,结果表明,我们提出的模型在所有四个数据集上表现出色,并且与不添加外部知识的基础模型相比性能有所提高,这验证了我们方法的有效性。代码可以在 https://github.com/1094701018/KETM 找到。
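
The gated fusion step can be sketched in a few lines (the single-linear gate and shared dimensionality are simplifying assumptions):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Learn how much to trust the retrieved knowledge: a sigmoid gate mixes
    the text and knowledge vectors, so noisy knowledge cannot dominate."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_vec, know_vec):
        g = torch.sigmoid(self.gate(torch.cat([text_vec, know_vec], dim=-1)))
        return g * text_vec + (1.0 - g) * know_vec
```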

A Large Language Model Enhanced Conversational Recommender System

  • paper_url: http://arxiv.org/abs/2308.06212
  • repo_url: None
  • paper_authors: Yue Feng, Shuchang Liu, Zhenghai Xue, Qingpeng Cai, Lantao Hu, Peng Jiang, Kun Gai, Fei Sun
  • for: 提高会话式推荐系统的效果(improve the effectiveness of conversational recommender systems)
  • methods: 利用大语言模型(Large Language Models, LLM)的理解和生成能力,并与专家模型(expert models)合作,以解决不同的子任务(sub-tasks)。
  • results: 实验结果表明,使用RLPF进行精心调整LLM,可以提高会话式推荐系统的性能(performance)。
    Abstract Conversational recommender systems (CRSs) aim to recommend high-quality items to users through a dialogue interface. It usually contains multiple sub-tasks, such as user preference elicitation, recommendation, explanation, and item information search. To develop effective CRSs, there are some challenges: 1) how to properly manage sub-tasks; 2) how to effectively solve different sub-tasks; and 3) how to correctly generate responses that interact with users. Recently, Large Language Models (LLMs) have exhibited an unprecedented ability to reason and generate, presenting a new opportunity to develop more powerful CRSs. In this work, we propose a new LLM-based CRS, referred to as LLMCRS, to address the above challenges. For sub-task management, we leverage the reasoning ability of LLM to effectively manage sub-task. For sub-task solving, we collaborate LLM with expert models of different sub-tasks to achieve the enhanced performance. For response generation, we utilize the generation ability of LLM as a language interface to better interact with users. Specifically, LLMCRS divides the workflow into four stages: sub-task detection, model matching, sub-task execution, and response generation. LLMCRS also designs schema-based instruction, demonstration-based instruction, dynamic sub-task and model matching, and summary-based generation to instruct LLM to generate desired results in the workflow. Finally, to adapt LLM to conversational recommendations, we also propose to fine-tune LLM with reinforcement learning from CRSs performance feedback, referred to as RLPF. Experimental results on benchmark datasets show that LLMCRS with RLPF outperforms the existing methods.

Thinking Like an Expert: Multimodal Hypergraph-of-Thought (HoT) Reasoning to boost Foundation Modals

  • paper_url: http://arxiv.org/abs/2308.06207
  • repo_url: None
  • paper_authors: Fanglong Yao, Changyuan Tian, Jintao Liu, Zequn Zhang, Qing Liu, Li Jin, Shuchao Li, Xiaoyu Li, Xian Sun
  • for: This paper aims to enhance the reasoning ability of foundation models by proposing a multimodal Hypergraph-of-Thought (HoT) reasoning paradigm that can handle high-order multi-hop reasoning and multimodal comparative judgement.
  • methods: The proposed HoT paradigm builds a textual hypergraph-of-thought and a visual hypergraph-of-thought, coupled through Cross-modal Co-Attention Graph Learning for multimodal comparative verification.
  • results: Experiments on the ScienceQA benchmark show that the proposed HoT-based T5 outperforms CoT-based GPT-3.5 and ChatGPT, and is on par with CoT-based GPT-4 at a smaller model size.
    Abstract Reasoning ability is one of the most crucial capabilities of a foundation model, signifying its capacity to address complex reasoning tasks. The Chain-of-Thought (CoT) technique is widely regarded as one of the most effective methods for enhancing the reasoning ability of foundation models and has garnered significant attention. However, the reasoning process of CoT is linear and step-by-step, similar to personal logical reasoning, and thus suited to general and only slightly complicated problems. By contrast, the thinking pattern of an expert exhibits two prominent characteristics that CoT cannot handle appropriately: high-order multi-hop reasoning and multimodal comparative judgement. The core motivation of this paper is therefore to transcend CoT and construct a reasoning paradigm that can think like an expert. The hyperedge of a hypergraph can connect multiple vertices, making it naturally suitable for modelling high-order relationships. Inspired by this, this paper proposes a multimodal Hypergraph-of-Thought (HoT) reasoning paradigm that equips foundation models with the expert-level abilities of high-order multi-hop reasoning and multimodal comparative judgement. Specifically, a textual hypergraph-of-thought is constructed using triples as the primary thoughts to model higher-order relationships, and a hyperedge-of-thought is generated through multi-hop walking paths to achieve multi-hop inference. Furthermore, we devise a visual hypergraph-of-thought that interacts with the textual hypergraph-of-thought via Cross-modal Co-Attention Graph Learning for multimodal comparative verification. Experiments on the ScienceQA benchmark demonstrate that the proposed HoT-based T5 outperforms CoT-based GPT-3.5 and ChatGPT, and is on par with CoT-based GPT-4 at a smaller model size.
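
As a concrete reading of the textual hypergraph construction, the Python sketch below indexes (head, relation, tail) triples as thought nodes and grows a hyperedge-of-thought by multi-hop walks over shared entities. This is an illustrative interpretation of the paper's description, not its implementation; the visual hypergraph and the co-attention components are omitted.

```python
from collections import defaultdict

def build_hypergraph(triples):
    """Index each entity to the triples (thought nodes) that mention it."""
    entity_to_triples = defaultdict(set)
    for i, (head, _, tail) in enumerate(triples):
        entity_to_triples[head].add(i)
        entity_to_triples[tail].add(i)
    return entity_to_triples

def hyperedge_by_walk(triples, entity_to_triples, start_entity, hops=2):
    """Collect all triples reachable within `hops` shared-entity steps; the
    resulting set acts as one hyperedge-of-thought for multi-hop inference."""
    frontier = {start_entity}
    hyperedge = set()
    for _ in range(hops):
        next_frontier = set()
        for entity in frontier:
            for idx in entity_to_triples[entity]:
                hyperedge.add(idx)
                head, _, tail = triples[idx]
                next_frontier.update({head, tail})
        frontier = next_frontier - frontier
    return [triples[i] for i in sorted(hyperedge)]

triples = [("glass", "is made of", "sand"),
           ("sand", "melts at", "1700C"),
           ("1700C", "requires", "furnace")]
index = build_hypergraph(triples)
print(hyperedge_by_walk(triples, index, "glass", hops=2))
# -> [('glass', 'is made of', 'sand'), ('sand', 'melts at', '1700C')]
```

A two-hop walk from "glass" pulls in the melting-temperature triple even though it never mentions "glass" directly, which is the kind of high-order relation a linear chain of thought would have to traverse one link at a time.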