cs.CV - 2023-10-26

Image Prior and Posterior Conditional Probability Representation for Efficient Damage Assessment

  • paper_url: http://arxiv.org/abs/2310.17801
  • repo_url: None
  • paper_authors: Jie Wei, Weicong Feng, Erik Blasch, Erika Ardiles-Cruz, Haibin Ling
  • for: This work aims to improve the efficiency and scalability of Damage Assessment (DA) in Human Assistance and Disaster Response (HADR) applications.
  • methods: An image prior and posterior conditional probability (IP2CP) representation is developed as an effective computational imaging representation; the matching pre- and post-disaster images are encoded into one image, which is then processed with deep learning to determine the damage level.
  • results: IP2CP performs well in two scenarios of crucial practical importance: pixel-wise semantic segmentation and patch-based contrastive-learning global damage classification. The results show that the IP2CP-based deep learning framework achieves the data and computational efficiency that is essential for DA in HADR applications.
    Abstract It is important to quantify Damage Assessment (DA) for Human Assistance and Disaster Response (HADR) applications. In this paper, to achieve efficient and scalable DA in HADR, an image prior and posterior conditional probability (IP2CP) is developed as an effective computational imaging representation. Equipped with the IP2CP representation, the matching pre- and post-disaster images are effectively encoded into one image that is then processed using deep learning approaches to determine the damage levels. Two scenarios of crucial importance for the practical use of DA in HADR applications are examined: pixel-wise semantic segmentation and patch-based contrastive learning-based global damage classification. Results achieved by IP2CP in both scenarios demonstrate promising performances, showing that our IP2CP-based methods within the deep learning framework can effectively achieve data and computational efficiency, which is of utmost importance for the DA in HADR applications.

ControlLLM: Augment Language Models with Tools by Searching on Graphs

  • paper_url: http://arxiv.org/abs/2310.17796
  • repo_url: https://github.com/opengvlab/controlllm
  • paper_authors: Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Zhiheng Li, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, Wenhai Wang
  • for: This work develops a new framework that enables large language models (LLMs) to use multi-modal tools to solve complex real-world tasks.
  • methods: The framework comprises three key components: (1) a task decomposer that breaks a complex task into clear subtasks with well-defined inputs and outputs; (2) a Thoughts-on-Graph (ToG) paradigm that searches for the optimal solution path on a pre-built tool graph; and (3) an execution engine that interprets the solution path and runs the tools efficiently on different computational devices.
  • results: Evaluations on diverse image, audio, and video processing tasks demonstrate superior accuracy, efficiency, and versatility compared with existing methods. Code is available at https://github.com/OpenGVLab/ControlLLM.
    Abstract We present ControlLLM, a novel framework that enables large language models (LLMs) to utilize multi-modal tools for solving complex real-world tasks. Despite the remarkable performance of LLMs, they still struggle with tool invocation due to ambiguous user prompts, inaccurate tool selection and parameterization, and inefficient tool scheduling. To overcome these challenges, our framework comprises three key components: (1) a \textit{task decomposer} that breaks down a complex task into clear subtasks with well-defined inputs and outputs; (2) a \textit{Thoughts-on-Graph (ToG) paradigm} that searches the optimal solution path on a pre-built tool graph, which specifies the parameter and dependency relations among different tools; and (3) an \textit{execution engine with a rich toolbox} that interprets the solution path and runs the tools efficiently on different computational devices. We evaluate our framework on diverse tasks involving image, audio, and video processing, demonstrating its superior accuracy, efficiency, and versatility compared to existing methods. The code is at https://github.com/OpenGVLab/ControlLLM .
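
The abstract describes, but does not fully specify, how the Thoughts-on-Graph paradigm finds a solution path over the tool graph. As a rough illustration only, the sketch below runs a breadth-first search over a tiny hypothetical toolbox to find a chain of tools that turns the available resource types into the requested one; the `Tool` record, the tool names, and the search strategy are all assumptions, and the real ToG additionally handles tool parameters and dependencies.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    inputs: frozenset   # resource types the tool consumes
    output: str         # resource type the tool produces

# Hypothetical toolbox; the real ControlLLM tool graph also encodes
# parameter and dependency relations, which are omitted here.
TOOLS = [
    Tool("image_captioning", frozenset({"image"}), "text"),
    Tool("text_to_speech",   frozenset({"text"}), "audio"),
    Tool("image_segmenter",  frozenset({"image"}), "mask"),
    Tool("mask_to_image",    frozenset({"image", "mask"}), "image"),
]

def search_solution_path(available, goal):
    """Breadth-first search over the tool graph: find a shortest sequence of
    tool calls whose outputs lead from the available resource types to the goal."""
    start = frozenset(available)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        resources, path = queue.popleft()
        if goal in resources:
            return path
        for tool in TOOLS:
            if tool.inputs <= resources:
                nxt = resources | {tool.output}
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [tool.name]))
    return None

# e.g. "describe this photo out loud": image -> text -> audio
print(search_solution_path({"image"}, "audio"))  # ['image_captioning', 'text_to_speech']
```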

AutoCT: Automated CT registration, segmentation, and quantification

  • paper_url: http://arxiv.org/abs/2310.17780
  • repo_url: None
  • paper_authors: Zhe Bai, Abdelilah Essiari, Talita Perciano, Kristofer E. Bouchard
  • for: Provides a comprehensive CT imaging processing and analysis pipeline for both basic scientific development and clinical applications.
  • methods: An end-to-end pipeline that automates the preprocessing, registration, segmentation, and quantitative analysis of 3D CT scans.
  • results: Enables atlas-based CT segmentation and quantification by leveraging diffeomorphic transformations; localized features extracted from the deformation field support downstream statistical learning that may facilitate medical diagnostics.
    Abstract The processing and analysis of computed tomography (CT) imaging is important for both basic scientific development and clinical applications. In AutoCT, we provide a comprehensive pipeline that integrates an end-to-end automatic preprocessing, registration, segmentation, and quantitative analysis of 3D CT scans. The engineered pipeline enables atlas-based CT segmentation and quantification leveraging diffeomorphic transformations through efficient forward and inverse mappings. The extracted localized features from the deformation field allow for downstream statistical learning that may facilitate medical diagnostics. On a lightweight and portable software platform, AutoCT provides a new toolkit for the CT imaging community to underpin the deployment of artificial intelligence-driven applications.

A Dataset of Relighted 3D Interacting Hands

  • paper_url: http://arxiv.org/abs/2310.17768
  • repo_url: None
  • paper_authors: Gyeongsik Moon, Shunsuke Saito, Weipeng Xu, Rohan Joshi, Julia Buffalini, Harley Bellan, Nicholas Rosen, Jesse Richardson, Mallorie Mize, Philippe de Bree, Tomas Simon, Bo Peng, Shubham Garg, Kevyn McPhail, Takaaki Shiratori
  • for: Provides a diverse and realistic dataset for analyzing two-hand interactions, with large-scale ground-truth 3D poses.
  • methods: A state-of-the-art hand relighting network is combined with accurately tracked two-hand 3D poses to generate diverse and realistic images of interacting hands.
  • results: Compared with existing 3D interacting-hands datasets, the proposed Re:InterHand dataset offers more diverse and realistic image appearances together with large-scale ground-truth 3D pose annotations.
    Abstract The two-hand interaction is one of the most challenging signals to analyze due to the self-similarity, complicated articulations, and occlusions of hands. Although several datasets have been proposed for the two-hand interaction analysis, all of them do not achieve 1) diverse and realistic image appearances and 2) diverse and large-scale groundtruth (GT) 3D poses at the same time. In this work, we propose Re:InterHand, a dataset of relighted 3D interacting hands that achieve the two goals. To this end, we employ a state-of-the-art hand relighting network with our accurately tracked two-hand 3D poses. We compare our Re:InterHand with existing 3D interacting hands datasets and show the benefit of it. Our Re:InterHand is available in https://mks0601.github.io/ReInterHand/.

SynergyNet: Bridging the Gap between Discrete and Continuous Representations for Precise Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.17764
  • repo_url: https://github.com/CandleLabAI/SynergyNet-WACV-2024
  • paper_authors: Vandan Gorade, Sparsh Mittal, Debesh Jha, Ulas Bagci
  • for: Improves medical image segmentation by bridging discrete and continuous latent representations within existing encoder-decoder segmentation frameworks.
  • methods: A novel bottleneck architecture, SynergyNet, that integrates discrete and continuous representations to harness complementary information and preserve both fine- and coarse-grained details in the learned representations.
  • results: On multi-organ and cardiac segmentation datasets, SynergyNet outperforms state-of-the-art methods including TransUNet, improving Dice scores by 2.16% and Hausdorff scores by 11.13%. On skin lesion and brain tumor segmentation it improves Intersection-over-Union scores by 1.71% and 8.58%, respectively.
    Abstract In recent years, continuous latent space (CLS) and discrete latent space (DLS) deep learning models have been proposed for medical image analysis for improved performance. However, these models encounter distinct challenges. CLS models capture intricate details but often lack interpretability in terms of structural representation and robustness due to their emphasis on low-level features. Conversely, DLS models offer interpretability, robustness, and the ability to capture coarse-grained information thanks to their structured latent space. However, DLS models have limited efficacy in capturing fine-grained details. To address the limitations of both DLS and CLS models, we propose SynergyNet, a novel bottleneck architecture designed to enhance existing encoder-decoder segmentation frameworks. SynergyNet seamlessly integrates discrete and continuous representations to harness complementary information and successfully preserves both fine and coarse-grained details in the learned representations. Our extensive experiment on multi-organ segmentation and cardiac datasets demonstrates that SynergyNet outperforms other state of the art methods, including TransUNet: dice scores improving by 2.16%, and Hausdorff scores improving by 11.13%, respectively. When evaluating skin lesion and brain tumor segmentation datasets, we observe a remarkable improvement of 1.71% in Intersection-over Union scores for skin lesion segmentation and of 8.58% for brain tumor segmentation. Our innovative approach paves the way for enhancing the overall performance and capabilities of deep learning models in the critical domain of medical image analysis.
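
The abstract describes a bottleneck that fuses a continuous latent with a discrete, structured latent but does not spell out its design. A minimal sketch of the general idea, assuming a vector-quantization codebook as the discrete branch and a 1x1 convolution as the fusion step (both assumptions, not the actual SynergyNet bottleneck), is:

```python
import torch
import torch.nn as nn

class DiscreteContinuousBottleneck(nn.Module):
    """Toy bottleneck: keeps the continuous encoder features and, in parallel,
    snaps them to the nearest entry of a learned codebook (discrete branch),
    then fuses both branches. Illustrative only."""
    def __init__(self, channels=64, codebook_size=128):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, z):                                       # z: (B, C, H, W)
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)             # (B*H*W, C)
        dist = torch.cdist(flat, self.codebook.weight)          # distance to codes
        codes = self.codebook(dist.argmin(dim=1))               # nearest-code lookup
        z_q = codes.reshape(b, h, w, c).permute(0, 3, 1, 2)
        z_q = z + (z_q - z).detach()                            # straight-through estimator
        return self.fuse(torch.cat([z, z_q], dim=1))            # complementary fusion

x = torch.randn(2, 64, 32, 32)
print(DiscreteContinuousBottleneck()(x).shape)                  # torch.Size([2, 64, 32, 32])
```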

Alzheimers Disease Diagnosis by Deep Learning Using MRI-Based Approaches

  • paper_url: http://arxiv.org/abs/2310.17755
  • repo_url: None
  • paper_authors: Sarasadat Foroughipoor, Kimia Moradi, Hamidreza Bolhasani
  • for: Reviews the use of magnetic resonance imaging (MRI) together with deep learning algorithms for diagnosing Alzheimer's disease (AD).
  • methods: The surveyed studies apply deep learning algorithms to MRI data for feature extraction and pattern recognition, supporting early diagnosis and stage detection.
  • results: Five studies published between 2021 and 2023 are analyzed in depth, showing that MRI-based deep learning can diagnose Alzheimer's disease at an early stage and identify particular symptoms and disease stages.
    Abstract The most frequent kind of dementia of the nervous system, Alzheimer's disease, weakens several brain processes (such as memory) and eventually results in death. The clinical study uses magnetic resonance imaging to diagnose AD. Deep learning algorithms are capable of pattern recognition and feature extraction from the inputted raw data. As early diagnosis and stage detection are the most crucial elements in enhancing patient care and treatment outcomes, deep learning algorithms for MRI images have recently allowed for diagnosing a medical condition at the beginning stage and identifying particular symptoms of Alzheimer's disease. As a result, we aimed to analyze five specific studies focused on AD diagnosis using MRI-based deep learning algorithms between 2021 and 2023 in this study. To completely illustrate the differences between these techniques and comprehend how deep learning algorithms function, we attempted to explore selected approaches in depth.

Advancing Brain Tumor Detection: A Thorough Investigation of CNNs, Clustering, and SoftMax Classification in the Analysis of MRI Images

  • paper_url: http://arxiv.org/abs/2310.17720
  • repo_url: None
  • paper_authors: Jonayet Miah, Duc M Cao, Md Abu Sayed, Md Siam Taluckder, Md Sabbirul Haque, Fuad Mahmud
  • for: Early detection of brain tumors is crucial for effective treatment and patient outcomes; this study investigates Convolutional Neural Networks (CNNs) for brain tumor detection from MRI images.
  • methods: MRI scans from healthy individuals and patients with brain tumors are processed and fed into a CNN, with a SoftMax fully connected layer used for classification (98% accuracy). Radial Basis Function (RBF) and Decision Tree (DT) classifiers are also evaluated (98.24% and 95.64% accuracy, respectively), and a clustering-based feature extraction method is introduced to further improve the CNN's accuracy.
  • results: The SoftMax classifier achieves the highest accuracy among the evaluated classifiers, reaching 99.52% on test data, compared with 98.24% for RBF and 95.64% for DT. Sensitivity, specificity, and precision are reported alongside accuracy to comprehensively evaluate the network's performance.
    Abstract Brain tumors pose a significant global health challenge due to their high prevalence and mortality rates across all age groups. Detecting brain tumors at an early stage is crucial for effective treatment and patient outcomes. This study presents a comprehensive investigation into the use of Convolutional Neural Networks (CNNs) for brain tumor detection using Magnetic Resonance Imaging (MRI) images. The dataset, consisting of MRI scans from both healthy individuals and patients with brain tumors, was processed and fed into the CNN architecture. The SoftMax Fully Connected layer was employed to classify the images, achieving an accuracy of 98%. To evaluate the CNN's performance, two other classifiers, Radial Basis Function (RBF) and Decision Tree (DT), were utilized, yielding accuracy rates of 98.24% and 95.64%, respectively. The study also introduced a clustering method for feature extraction, improving CNN's accuracy. Sensitivity, Specificity, and Precision were employed alongside accuracy to comprehensively evaluate the network's performance. Notably, the SoftMax classifier demonstrated the highest accuracy among the categorizers, achieving 99.52% accuracy on test data. The presented research contributes to the growing field of deep learning in medical image analysis. The combination of CNNs and MRI data offers a promising tool for accurately detecting brain tumors, with potential implications for early diagnosis and improved patient care.
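
As a concrete reference point, the pipeline the abstract describes (CNN features followed by a SoftMax fully connected layer for two-class MRI classification) could look roughly like the sketch below; the layer sizes, input resolution, and two-class setup are assumptions, since the paper's exact architecture is not given in the abstract.

```python
import torch
import torch.nn as nn

class TumorCNN(nn.Module):
    """Small CNN with a softmax classification head for two-class MRI slices."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(64 * 4 * 4, num_classes)

    def forward(self, x):                        # x: (B, 1, H, W) grayscale MRI
        logits = self.classifier(self.features(x).flatten(1))
        return torch.softmax(logits, dim=1)      # class probabilities

model = TumorCNN()
probs = model(torch.randn(4, 1, 128, 128))
print(probs.shape, probs.sum(dim=1))             # (4, 2), each row sums to 1
```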

Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model

  • paper_url: http://arxiv.org/abs/2310.17653
  • repo_url: None
  • paper_authors: Karsten Roth, Lukas Thede, Almut Sophia Koepke, Oriol Vinyals, Olivier Hénaff, Zeynep Akata
  • for: Investigates how training design decisions (architecture, data augmentation, optimization) lead deep networks to learn distinct feature sets from the data, and whether this complementary knowledge can be transferred between models.
  • methods: Public model libraries comprising thousands of models trained on canonical datasets such as ImageNet are analyzed; standard knowledge distillation is examined and a more general extension based on data partitioning is proposed.
  • results: For arbitrary pairings of pretrained models, each model extracts significant data context unavailable in the other, independent of overall performance. The proposed data-partitioning extension enables knowledge transfer between nearly all pretrained models without external rankings, and can also be done unsupervised.
    Abstract Training deep networks requires various design decisions regarding for instance their architecture, data augmentation, or optimization. In this work, we find these training variations to result in networks learning unique feature sets from the data. Using public model libraries comprising thousands of models trained on canonical datasets like ImageNet, we observe that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other -- independent of overall performance. Given any arbitrary pairing of pretrained models and no external rankings (such as separate test sets, e.g. due to data privacy), we investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation -- a task made particularly difficult as additional knowledge can be contained in stronger, equiperformant or weaker models. Yet facilitating robust transfer in scenarios agnostic to pretrained model pairings would unlock auxiliary gains and knowledge fusion from any model repository without restrictions on model and problem specifics - including from weaker, lower-performance models. This work therefore provides an initial, in-depth exploration on the viability of such general-purpose knowledge transfer. Across large-scale experiments, we first reveal the shortcomings of standard knowledge distillation techniques, and then propose a much more general extension through data partitioning for successful transfer between nearly all pretrained models, which we show can also be done unsupervised. Finally, we assess both the scalability and impact of fundamental model properties on successful model-agnostic knowledge transfer.
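
The abstract proposes transferring complementary knowledge through data partitioning rather than plain distillation, without giving the partition rule. The sketch below only illustrates the general shape of such a scheme, under the assumed (not the paper's) rule of distilling from the partner model only on samples where it looks more confident on the true class, and keeping ordinary supervision elsewhere.

```python
import torch
import torch.nn.functional as F

def partitioned_transfer_loss(student_logits, teacher_logits, labels, temperature=2.0):
    """Distill only on the partition of the batch where the partner ('teacher')
    is more confident on the true class than the student; elsewhere use plain CE.
    The partition rule here is an illustrative stand-in, not the paper's."""
    s_prob = F.softmax(student_logits, dim=1)
    t_prob = F.softmax(teacher_logits, dim=1)
    idx = torch.arange(labels.size(0))
    transfer_mask = t_prob[idx, labels] > s_prob[idx, labels]        # (B,) bool

    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="none",
    ).sum(dim=1) * temperature ** 2                                  # (B,)
    ce = F.cross_entropy(student_logits, labels, reduction="none")   # (B,)
    return torch.where(transfer_mask, kd, ce).mean()

s = torch.randn(8, 10, requires_grad=True)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
print(partitioned_transfer_loss(s, t, y))
```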

A Coarse-to-Fine Pseudo-Labeling (C2FPL) Framework for Unsupervised Video Anomaly Detection

  • paper_url: http://arxiv.org/abs/2310.17650
  • repo_url: None
  • paper_authors: Anas Al-lahham, Nurbek Tastan, Zaigham Zaheer, Karthik Nandakumar
  • for: Proposes a fully unsupervised video anomaly detection method that learns a complete system without any annotation or human supervision.
  • methods: A simple but effective two-stage pseudo-label generation framework, based on hierarchical divisive clustering and statistical hypothesis testing, produces segment-level (normal/anomaly) pseudo-labels that are then used to train a segment-level anomaly detector in a supervised manner.
  • results: Extensive studies on two large-scale public datasets (UCF-Crime and XD-Violence) show that the proposed unsupervised approach outperforms all existing one-class classification and unsupervised methods while achieving performance comparable to state-of-the-art weakly supervised methods.
    Abstract Detection of anomalous events in videos is an important problem in applications such as surveillance. Video anomaly detection (VAD) is well-studied in the one-class classification (OCC) and weakly supervised (WS) settings. However, fully unsupervised (US) video anomaly detection methods, which learn a complete system without any annotation or human supervision, have not been explored in depth. This is because the lack of any ground truth annotations significantly increases the magnitude of the VAD challenge. To address this challenge, we propose a simple-but-effective two-stage pseudo-label generation framework that produces segment-level (normal/anomaly) pseudo-labels, which can be further used to train a segment-level anomaly detector in a supervised manner. The proposed coarse-to-fine pseudo-label (C2FPL) generator employs carefully-designed hierarchical divisive clustering and statistical hypothesis testing to identify anomalous video segments from a set of completely unlabeled videos. The trained anomaly detector can be directly applied on segments of an unseen test video to obtain segment-level, and subsequently, frame-level anomaly predictions. Extensive studies on two large-scale public-domain datasets, UCF-Crime and XD-Violence, demonstrate that the proposed unsupervised approach achieves superior performance compared to all existing OCC and US methods , while yielding comparable performance to the state-of-the-art WS methods.
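
The pseudo-label generator combines hierarchical divisive clustering with statistical hypothesis testing to flag anomalous segments. The sketch below shows a heavily simplified version of that idea on per-segment feature vectors; the magnitude feature, the single 2-means split, and the z-score threshold are all illustrative assumptions, as the abstract gives no formulas.

```python
import numpy as np
from sklearn.cluster import KMeans

def coarse_to_fine_pseudo_labels(segment_feats, z_thresh=2.0):
    """segment_feats: (N, D) features of unlabeled video segments.
    Coarse stage: a divisive 2-means split; the cluster with larger mean feature
    magnitude is the anomaly candidate. Fine stage: keep only candidates whose
    magnitude is a statistical outlier w.r.t. the 'normal' cluster."""
    magnitude = np.linalg.norm(segment_feats, axis=1)

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(segment_feats)
    cluster_mag = [magnitude[km.labels_ == k].mean() for k in (0, 1)]
    anomaly_cluster = int(np.argmax(cluster_mag))

    normal_mag = magnitude[km.labels_ != anomaly_cluster]
    mu, sigma = normal_mag.mean(), normal_mag.std() + 1e-8
    z = (magnitude - mu) / sigma

    pseudo = (km.labels_ == anomaly_cluster) & (z > z_thresh)
    return pseudo.astype(int)           # 1 = anomalous segment, 0 = normal

feats = np.vstack([np.random.randn(95, 16), 4.0 + np.random.randn(5, 16)])
print(coarse_to_fine_pseudo_labels(feats).sum(), "segments flagged as anomalous")
```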

6-DoF Stability Field via Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.17649
  • repo_url: None
  • paper_authors: Takuma Yoneda, Tianchong Jiang, Gregory Shakhnarovich, Matthew R. Walter
  • for: 6-DoFusion is a generative model that generates 3D poses of an object that produce a stable configuration of a given scene.
  • methods: A diffusion model incrementally refines a randomly initialized SE(3) pose to generate a sample from a learned, context-dependent distribution over stable poses.
  • results: The model can construct stable scenes involving novel object classes and improve the accuracy of state-of-the-art 3D pose estimation methods.
    Abstract A core capability for robot manipulation is reasoning over where and how to stably place objects in cluttered environments. Traditionally, robots have relied on object-specific, hand-crafted heuristics in order to perform such reasoning, with limited generalizability beyond a small number of object instances and object interaction patterns. Recent approaches instead learn notions of physical interaction, namely motion prediction, but require supervision in the form of labeled object information or come at the cost of high sample complexity, and do not directly reason over stability or object placement. We present 6-DoFusion, a generative model capable of generating 3D poses of an object that produces a stable configuration of a given scene. Underlying 6-DoFusion is a diffusion model that incrementally refines a randomly initialized SE(3) pose to generate a sample from a learned, context-dependent distribution over stable poses. We evaluate our model on different object placement and stacking tasks, demonstrating its ability to construct stable scenes that involve novel object classes as well as to improve the accuracy of state-of-the-art 3D pose estimation methods.

Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models

  • paper_url: http://arxiv.org/abs/2310.17642
  • repo_url: None
  • paper_authors: Tsun-Hsuan Wang, Alaa Maalouf, Wei Xiao, Yutong Ban, Alexander Amini, Guy Rosman, Sertac Karaman, Daniela Rus
  • for: Proposes an end-to-end autonomous driving system built on multi-modal foundation models to improve robustness and adaptability, enabling out-of-distribution, open-set, and more explainable autonomy.
  • methods: Multi-modal (visual and textual) foundation models provide driving decisions from representations queryable by image and text; nuanced pixel/patch-aligned spatial features are extracted from transformers to encapsulate both spatial and semantic information.
  • results: The approach achieves unparalleled results in diverse tests with significantly greater robustness in out-of-distribution situations, and allows latent-space simulation via text for improved training (data augmentation) and policy debugging.
    Abstract As autonomous driving technology matures, end-to-end methodologies have emerged as a leading strategy, promising seamless integration from perception to control via deep learning. However, existing systems grapple with challenges such as unexpected open set environments and the complexity of black-box models. At the same time, the evolution of deep learning introduces larger, multimodal foundational models, offering multi-modal visual and textual understanding. In this paper, we harness these multimodal foundation models to enhance the robustness and adaptability of autonomous driving systems, enabling out-of-distribution, end-to-end, multimodal, and more explainable autonomy. Specifically, we present an approach to apply end-to-end open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text. To do so, we introduce a method to extract nuanced spatial (pixel/patch-aligned) features from transformers to enable the encapsulation of both spatial and semantic features. Our approach (i) demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations, and (ii) allows the incorporation of latent space simulation (via text) for improved training (data augmentation via text) and policy debugging. We encourage the reader to check our explainer video at https://www.youtube.com/watch?v=4n-DJf8vXxo&feature=youtu.be and to view the code and demos on our project webpage at https://drive-anywhere.github.io/.

DeepShaRM: Multi-View Shape and Reflectance Map Recovery Under Unknown Lighting

  • paper_url: http://arxiv.org/abs/2310.17632
  • repo_url: None
  • paper_authors: Kohei Yamashita, Shohei Nobuhara, Ko Nishino
  • for: accurately recovers object geometry in challenging settings of textureless, non-Lambertian objects under unknown natural illumination.
  • methods: novel multi-view method called DeepShaRM, which uses a deep reflectance map estimation network and a deep shape-from-shading network to bypass the ill-posed problem of reflectance and illumination decomposition.
  • results: state-of-the-art accuracy on this challenging task, demonstrated through extensive experiments on both synthetic and real-world data.
    Abstract Geometry reconstruction of textureless, non-Lambertian objects under unknown natural illumination (i.e., in the wild) remains challenging as correspondences cannot be established and the reflectance cannot be expressed in simple analytical forms. We derive a novel multi-view method, DeepShaRM, that achieves state-of-the-art accuracy on this challenging task. Unlike past methods that formulate this as inverse-rendering, i.e., estimation of reflectance, illumination, and geometry from images, our key idea is to realize that reflectance and illumination need not be disentangled and instead estimated as a compound reflectance map. We introduce a novel deep reflectance map estimation network that recovers the camera-view reflectance maps from the surface normals of the current geometry estimate and the input multi-view images. The network also explicitly estimates per-pixel confidence scores to handle global light transport effects. A deep shape-from-shading network then updates the geometry estimate expressed with a signed distance function using the recovered reflectance maps. By alternating between these two, and, most important, by bypassing the ill-posed problem of reflectance and illumination decomposition, the method accurately recovers object geometry in these challenging settings. Extensive experiments on both synthetic and real-world data clearly demonstrate its state-of-the-art accuracy.

A Survey on Transferability of Adversarial Examples across Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2310.17626
  • repo_url: https://github.com/jindonggu/awesome_adversarial_transferability
  • paper_authors: Jindong Gu, Xiaojun Jia, Pau de Jorge, Wenqain Yu, Xinwei Liu, Avery Ma, Yuan Xun, Anjun Hu, Ashkan Khakzar, Zhijiang Li, Xiaochun Cao, Philip Torr
  • for: Surveys the transferability of adversarial examples across deep neural networks and the methods proposed to enhance it.
  • methods: Existing methodologies for improving adversarial transferability are categorized, and the fundamental principles guiding each approach are discussed.
  • results: Existing methods can enhance the transferability of adversarial examples, but challenges and open questions remain, such as differences in transferability across model architectures; the discussion extends beyond image classification to other vision tasks, highlighting the importance of fortifying DNNs against adversarial vulnerabilities.
    Abstract The emergence of Deep Neural Networks (DNNs) has revolutionized various domains, enabling the resolution of complex tasks spanning image recognition, natural language processing, and scientific problem-solving. However, this progress has also exposed a concerning vulnerability: adversarial examples. These crafted inputs, imperceptible to humans, can manipulate machine learning models into making erroneous predictions, raising concerns for safety-critical applications. An intriguing property of this phenomenon is the transferability of adversarial examples, where perturbations crafted for one model can deceive another, often with a different architecture. This intriguing property enables "black-box" attacks, circumventing the need for detailed knowledge of the target model. This survey explores the landscape of the adversarial transferability of adversarial examples. We categorize existing methodologies to enhance adversarial transferability and discuss the fundamental principles guiding each approach. While the predominant body of research primarily concentrates on image classification, we also extend our discussion to encompass other vision tasks and beyond. Challenges and future prospects are discussed, highlighting the importance of fortifying DNNs against adversarial vulnerabilities in an evolving landscape.
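
Transferability means a perturbation crafted against one (surrogate) model also fools a different (target) model, which enables black-box attacks. A minimal sketch of this baseline setup, with an FGSM perturbation crafted on an untrained surrogate and evaluated on an untrained target (both are stand-ins for illustration), is:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, mobilenet_v2

surrogate, target = resnet18(num_classes=10), mobilenet_v2(num_classes=10)
surrogate.eval(); target.eval()

def fgsm_transfer(x, y, eps=8 / 255):
    """Craft an FGSM example on the surrogate, then test it on the target,
    without ever querying the target's gradients (black-box transfer)."""
    x_adv = x.clone().requires_grad_(True)
    loss = F.cross_entropy(surrogate(x_adv), y)
    loss.backward()
    x_adv = (x + eps * x_adv.grad.sign()).clamp(0, 1).detach()
    fooled = (target(x_adv).argmax(1) != y)
    return x_adv, fooled

x = torch.rand(4, 3, 224, 224)   # stand-in images in [0, 1]
y = torch.randint(0, 10, (4,))
_, fooled = fgsm_transfer(x, y)
print("transferred misclassifications:", int(fooled.sum()))
```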

Noise-Free Score Distillation

  • paper_url: http://arxiv.org/abs/2310.17590
  • repo_url: https://github.com/orenkatzir/nfsd
  • paper_authors: Oren Katzir, Or Patashnik, Daniel Cohen-Or, Dani Lischinski
  • for: Re-examines the Score Distillation Sampling (SDS) process used for text-to-content generation in non-image domains and explains the need for large Classifier-Free Guidance (CFG) scales as the distillation of an undesired noise term.
  • methods: A novel Noise-Free Score Distillation (NFSD) process that requires only minimal modifications to the original SDS framework, enabling more effective distillation of pre-trained text-to-image diffusion models at a nominal CFG scale.
  • results: Qualitative comparisons of NFSD against SDS and several other methods show that NFSD avoids over-smoothing and produces results that are both realistic and faithful to the desired prompt.
    Abstract Score Distillation Sampling (SDS) has emerged as the de facto approach for text-to-content generation in non-image domains. In this paper, we reexamine the SDS process and introduce a straightforward interpretation that demystifies the necessity for large Classifier-Free Guidance (CFG) scales, rooted in the distillation of an undesired noise term. Building upon our interpretation, we propose a novel Noise-Free Score Distillation (NFSD) process, which requires minimal modifications to the original SDS framework. Through this streamlined design, we achieve more effective distillation of pre-trained text-to-image diffusion models while using a nominal CFG scale. This strategic choice allows us to prevent the over-smoothing of results, ensuring that the generated data is both realistic and complies with the desired prompt. To demonstrate the efficacy of NFSD, we provide qualitative examples that compare NFSD and SDS, as well as several other methods.
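
For context, vanilla Score Distillation Sampling nudges a differentiably-rendered latent toward the text-conditioned score of a frozen diffusion model; NFSD argues that the usual CFG-scaled residual still carries an undesired noise term and removes it. The sketch below shows only plain SDS with a stub denoiser and a placeholder schedule and weighting; the NFSD decomposition itself is marked in a comment rather than implemented.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def eps_pred(z_t, t, text_emb):
    """Placeholder for the frozen text-to-image denoiser's noise prediction
    (already combining conditional/unconditional branches via CFG)."""
    return torch.randn_like(z_t)

def sds_grad(z, text_emb):
    """One SDS update direction for a differentiably-rendered latent z.
    NFSD instead decomposes (eps_hat - eps) and drops the undesired noise
    component, allowing a nominal CFG scale; that split is not shown here."""
    t = torch.randint(20, T, (1,))
    eps = torch.randn_like(z)
    z_t = alpha_bar[t].sqrt() * z + (1 - alpha_bar[t]).sqrt() * eps
    w = 1 - alpha_bar[t]                      # a common choice of weighting
    return w * (eps_pred(z_t, t, text_emb) - eps)

z = torch.randn(1, 4, 64, 64, requires_grad=True)
g = sds_grad(z, None)
z.data -= 0.01 * g                            # gradient step on the latent
print(g.shape)
```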

Global Structure-Aware Diffusion Process for Low-Light Image Enhancement

  • paper_url: http://arxiv.org/abs/2310.17577
  • repo_url: https://github.com/jinnh/GSAD
  • paper_authors: Jinhui Hou, Zhiyu Zhu, Junhui Hou, Hui Liu, Huanqiang Zeng, Hui Yuan
  • for: Studies a diffusion-based framework for low-light image enhancement.
  • methods: The diffusion process is regularized with a global structure-aware curvature term on its ODE trajectory, motivated by recent findings that low-curvature ODE trajectories yield a stable and effective diffusion process; anchored in the intrinsic non-local structure of image data, the term gradually promotes the preservation of complicated details and the enhancement of contrast. An uncertainty-guided regularization technique additionally relaxes constraints on the most extreme regions of the image.
  • results: Experimental evaluations show that the proposed diffusion-based framework achieves distinguished performance in low-light enhancement, with substantial improvements in image quality, noise suppression, and contrast amplification over state-of-the-art methods. Code is available at https://github.com/jinnh/GSAD.
    Abstract This paper studies a diffusion-based framework to address the low-light image enhancement problem. To harness the capabilities of diffusion models, we delve into this intricate process and advocate for the regularization of its inherent ODE-trajectory. To be specific, inspired by the recent research that low curvature ODE-trajectory results in a stable and effective diffusion process, we formulate a curvature regularization term anchored in the intrinsic non-local structures of image data, i.e., global structure-aware regularization, which gradually facilitates the preservation of complicated details and the augmentation of contrast during the diffusion process. This incorporation mitigates the adverse effects of noise and artifacts resulting from the diffusion process, leading to a more precise and flexible enhancement. To additionally promote learning in challenging regions, we introduce an uncertainty-guided regularization technique, which wisely relaxes constraints on the most extreme regions of the image. Experimental evaluations reveal that the proposed diffusion-based framework, complemented by rank-informed regularization, attains distinguished performance in low-light enhancement. The outcomes indicate substantial advancements in image quality, noise suppression, and contrast amplification in comparison with state-of-the-art methods. We believe this innovative approach will stimulate further exploration and advancement in low-light image processing, with potential implications for other applications of diffusion models. The code is publicly available at https://github.com/jinnh/GSAD.

SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching

  • paper_url: http://arxiv.org/abs/2310.17569
  • repo_url: None
  • paper_authors: Xinghui Li, Jingyi Lu, Kai Han, Victor Prisacariu
  • for: Addresses the problem of matching semantically similar keypoints across image pairs.
  • methods: The intermediate output of the UNet within Stable Diffusion (SD) is used as robust image feature maps; a basic prompt tuning technique unlocks the inherent potential of SD, and a novel conditional prompting module further conditions the prompt on the local details of the input image pair for an additional performance gain.
  • results: Comprehensive evaluations on the PF-Pascal, PF-Willow, and SPair-71k datasets show that SD4Match sets new accuracy benchmarks on all of them, outperforming the previous state of the art by 12 percentage points on the challenging SPair-71k dataset.
    Abstract In this paper, we address the challenge of matching semantically similar keypoints across image pairs. Existing research indicates that the intermediate output of the UNet within the Stable Diffusion (SD) can serve as robust image feature maps for such a matching task. We demonstrate that by employing a basic prompt tuning technique, the inherent potential of Stable Diffusion can be harnessed, resulting in a significant enhancement in accuracy over previous approaches. We further introduce a novel conditional prompting module that conditions the prompt on the local details of the input image pairs, leading to a further improvement in performance. We designate our approach as SD4Match, short for Stable Diffusion for Semantic Matching. Comprehensive evaluations of SD4Match on the PF-Pascal, PF-Willow, and SPair-71k datasets show that it sets new benchmarks in accuracy across all these datasets. Particularly, SD4Match outperforms the previous state-of-the-art by a margin of 12 percentage points on the challenging SPair-71k dataset.
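
Once per-image feature maps are available (taken from the SD UNet in the paper), keypoint matching reduces to finding, for each query keypoint, the most similar feature location in the other image. A minimal sketch of that matching step on arbitrary dense feature maps (feature extraction is omitted, and the cosine-similarity argmax rule is a generic assumption) is:

```python
import torch
import torch.nn.functional as F

def match_keypoints(feat_a, feat_b, kpts_a):
    """feat_a, feat_b: (C, H, W) dense feature maps of the two images.
    kpts_a: (N, 2) integer (x, y) keypoints in image A.
    Returns (N, 2) predicted (x, y) matches in image B via cosine-similarity argmax."""
    c, h, w = feat_b.shape
    fa = F.normalize(feat_a[:, kpts_a[:, 1], kpts_a[:, 0]], dim=0)   # (C, N)
    fb = F.normalize(feat_b.reshape(c, -1), dim=0)                   # (C, H*W)
    sim = fa.t() @ fb                                                # (N, H*W)
    idx = sim.argmax(dim=1)
    return torch.stack([idx % w, idx // w], dim=1)                   # (x, y)

feat_a, feat_b = torch.randn(256, 32, 32), torch.randn(256, 32, 32)
kpts_a = torch.tensor([[5, 7], [20, 11]])
print(match_keypoints(feat_a, feat_b, kpts_a))
```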

Masked Space-Time Hash Encoding for Efficient Dynamic Scene Reconstruction

  • paper_url: http://arxiv.org/abs/2310.17527
  • repo_url: https://github.com/masked-spacetime-hashing/msth
  • paper_authors: Feng Wang, Zilong Chen, Guokang Wang, Yafei Song, Huaping Liu
  • for: efficiently reconstructing dynamic 3D scenes from multi-view or monocular videos
  • methods: Masked Space-Time Hash encoding (MSTH), a novel method that represents a dynamic scene as a weighted combination of a 3D hash encoding and a 4D hash encoding, guided by an uncertainty-based objective
  • results: consistently better results than previous methods with only 20 minutes of training time and 130 MB of memory storage
    Abstract In this paper, we propose the Masked Space-Time Hash encoding (MSTH), a novel method for efficiently reconstructing dynamic 3D scenes from multi-view or monocular videos. Based on the observation that dynamic scenes often contain substantial static areas that result in redundancy in storage and computations, MSTH represents a dynamic scene as a weighted combination of a 3D hash encoding and a 4D hash encoding. The weights for the two components are represented by a learnable mask which is guided by an uncertainty-based objective to reflect the spatial and temporal importance of each 3D position. With this design, our method can reduce the hash collision rate by avoiding redundant queries and modifications on static areas, making it feasible to represent a large number of space-time voxels by hash tables with small size.Besides, without the requirements to fit the large numbers of temporally redundant features independently, our method is easier to optimize and converge rapidly with only twenty minutes of training for a 300-frame dynamic scene.As a result, MSTH obtains consistently better results than previous methods with only 20 minutes of training time and 130 MB of memory storage. Code is available at https://github.com/masked-spacetime-hashing/msth
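
MSTH represents the scene as a mask-weighted blend of a static 3D hash encoding and a dynamic 4D (space-time) hash encoding. A toy version of that blend is sketched below; the single-resolution tables, the multiply-and-sum hash, and the tiny mask network are simplifications for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

PRIMES = torch.tensor([1, 2654435761, 805459861, 3674653429], dtype=torch.long)

class MaskedSpaceTimeHash(nn.Module):
    """Toy MSTH: feature = (1 - m) * static_3d(xyz) + m * dynamic_4d(xyzt),
    where the mask m in [0, 1] is predicted from the spatial position."""
    def __init__(self, table_size=2 ** 16, dim=8):
        super().__init__()
        self.static = nn.Embedding(table_size, dim)     # 3D hash table
        self.dynamic = nn.Embedding(table_size, dim)    # 4D hash table
        self.mask_net = nn.Sequential(nn.Linear(3, 32), nn.ReLU(),
                                      nn.Linear(32, 1), nn.Sigmoid())
        self.table_size = table_size

    def _hash(self, grid_idx):                           # grid_idx: (N, K) long
        h = (grid_idx * PRIMES[: grid_idx.shape[1]]).sum(dim=1)
        return h % self.table_size

    def forward(self, xyz, t, resolution=128):           # xyz in [0,1]^3, t in [0,1]
        xyzt = torch.cat([xyz, t], dim=1)                # (N, 4)
        idx3 = self._hash((xyz * resolution).long())
        idx4 = self._hash((xyzt * resolution).long())
        m = self.mask_net(xyz)                           # (N, 1) space-time importance
        return (1 - m) * self.static(idx3) + m * self.dynamic(idx4)

enc = MaskedSpaceTimeHash()
feats = enc(torch.rand(1024, 3), torch.rand(1024, 1))
print(feats.shape)   # torch.Size([1024, 8])
```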

FLARE: Fast Learning of Animatable and Relightable Mesh Avatars

  • paper_url: http://arxiv.org/abs/2310.17519
  • repo_url: https://github.com/sbharadwajj/flare
  • paper_authors: Shrisha Bharadwaj, Yufeng Zheng, Otmar Hilliges, Michael J. Black, Victoria Fernandez-Abrevaya
  • for: Creates personalized animatable 3D head avatars from monocular video that are geometrically accurate, realistic, relightable, and compatible with current rendering systems.
  • methods: Differentiable rendering built on highly optimized methods from traditional computer graphics, with some components approximated by neural networks: a canonical geometry is learned as a mesh (enabling efficient differentiable rasterization and animation via learned blendshapes and linear blend skinning weights), and observed colors are factored into intrinsic albedo, roughness, and a neural representation of the illumination.
  • results: The mesh-based avatar formulation, combined with learned deformation, material, and lighting MLPs, produces avatars with high-quality geometry and appearance while being more efficient to train and render than existing approaches.
    Abstract Our goal is to efficiently learn personalized animatable 3D head avatars from videos that are geometrically accurate, realistic, relightable, and compatible with current rendering systems. While 3D meshes enable efficient processing and are highly portable, they lack realism in terms of shape and appearance. Neural representations, on the other hand, are realistic but lack compatibility and are slow to train and render. Our key insight is that it is possible to efficiently learn high-fidelity 3D mesh representations via differentiable rendering by exploiting highly-optimized methods from traditional computer graphics and approximating some of the components with neural networks. To that end, we introduce FLARE, a technique that enables the creation of animatable and relightable mesh avatars from a single monocular video. First, we learn a canonical geometry using a mesh representation, enabling efficient differentiable rasterization and straightforward animation via learned blendshapes and linear blend skinning weights. Second, we follow physically-based rendering and factor observed colors into intrinsic albedo, roughness, and a neural representation of the illumination, allowing the learned avatars to be relit in novel scenes. Since our input videos are captured on a single device with a narrow field of view, modeling the surrounding environment light is non-trivial. Based on the split-sum approximation for modeling specular reflections, we address this by approximating the pre-filtered environment map with a multi-layer perceptron (MLP) modulated by the surface roughness, eliminating the need to explicitly model the light. We demonstrate that our mesh-based avatar formulation, combined with learned deformation, material, and lighting MLPs, produces avatars with high-quality geometry and appearance, while also being efficient to train and render compared to existing approaches.

Revisiting the Distillation of Image Representations into Point Clouds for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2310.17504
  • repo_url: None
  • paper_authors: Gilles Puy, Spyros Gidaris, Alexandre Boulch, Oriane Siméoni, Corentin Sautier, Patrick Pérez, Andrei Bursuc, Renaud Marlet
  • for: Proposes a simple 2D-to-3D distillation approach to improve self-supervised 3D networks on lidar data in autonomous driving scenes.
  • methods: High-quality self-supervised 2D image features are distilled into 3D networks; a simple approach for semantic segmentation significantly improves over prior 3D distillation methods, and distillation into high-capacity 3D networks is shown to be key to obtaining high-quality 3D features.
  • results: The method substantially closes the gap between unsupervised distilled 3D features and fully supervised ones, and the resulting high-quality distilled representations can also be used for open-vocabulary segmentation and background/foreground discovery.
    Abstract Self-supervised image networks can be used to address complex 2D tasks (e.g., semantic segmentation, object discovery) very efficiently and with little or no downstream supervision. However, self-supervised 3D networks on lidar data do not perform as well for now. A few methods therefore propose to distill high-quality self-supervised 2D features into 3D networks. The most recent ones doing so on autonomous driving data show promising results. Yet, a performance gap persists between these distilled features and fully-supervised ones. In this work, we revisit 2D-to-3D distillation. First, we propose, for semantic segmentation, a simple approach that leads to a significant improvement compared to prior 3D distillation methods. Second, we show that distillation in high capacity 3D networks is key to reach high quality 3D features. This actually allows us to significantly close the gap between unsupervised distilled 3D features and fully-supervised ones. Last, we show that our high-quality distilled representations can also be used for open-vocabulary segmentation and background/foreground discovery.

A Hybrid Graph Network for Complex Activity Detection in Video

  • paper_url: http://arxiv.org/abs/2310.17493
  • repo_url: https://github.com/salmank255/CompAD
  • paper_authors: Salman Khan, Izzeddin Teeti, Andrew Bradley, Mohamed Elhoseiny, Fabio Cuzzolin
  • for: Addresses complex activity detection in video, a challenging task in fields such as autonomous driving and sports analytics.
  • methods: A hybrid graph neural network that combines attention applied to a graph encoding the local (short-term) dynamic scene with a temporal graph modelling the overall long-duration activity.
  • results: The approach outperforms previous state-of-the-art methods on all three datasets (ActivityNet-1.3, Thumos-14, and ROAD).
    Abstract Interpretation and understanding of video presents a challenging computer vision task in numerous fields - e.g. autonomous driving and sports analytics. Existing approaches to interpreting the actions taking place within a video clip are based upon Temporal Action Localisation (TAL), which typically identifies short-term actions. The emerging field of Complex Activity Detection (CompAD) extends this analysis to long-term activities, with a deeper understanding obtained by modelling the internal structure of a complex activity taking place within the video. We address the CompAD problem using a hybrid graph neural network which combines attention applied to a graph encoding the local (short-term) dynamic scene with a temporal graph modelling the overall long-duration activity. Our approach is as follows: i) Firstly, we propose a novel feature extraction technique which, for each video snippet, generates spatiotemporal `tubes' for the active elements (`agents') in the (local) scene by detecting individual objects, tracking them and then extracting 3D features from all the agent tubes as well as the overall scene. ii) Next, we construct a local scene graph where each node (representing either an agent tube or the scene) is connected to all other nodes. Attention is then applied to this graph to obtain an overall representation of the local dynamic scene. iii) Finally, all local scene graph representations are interconnected via a temporal graph, to estimate the complex activity class together with its start and end time. The proposed framework outperforms all previous state-of-the-art methods on all three datasets including ActivityNet-1.3, Thumos-14, and ROAD.

Cross-modal Active Complementary Learning with Self-refining Correspondence

  • paper_url: http://arxiv.org/abs/2310.17468
  • repo_url: https://github.com/qinyang79/crcl
  • paper_authors: Yang Qin, Yuan Sun, Dezhong Peng, Joey Tianyi Zhou, Xi Peng, Peng Hu
  • for: Improves the robustness of image-text matching against noisy correspondence (NC) in the training pairs.
  • methods: A generalized Cross-modal Robust Complementary Learning framework (CRCL) that equips existing methods with a novel Active Complementary Loss (ACL) and an efficient Self-refining Correspondence Correction (SCC).
  • results: Experiments and theoretical analysis show that CRCL reduces the impact of noisy correspondence, yielding more accurate and stable image-text matching.
    Abstract Recently, image-text matching has attracted more and more attention from academia and industry, which is fundamental to understanding the latent correspondence across visual and textual modalities. However, most existing methods implicitly assume the training pairs are well-aligned while ignoring the ubiquitous annotation noise, a.k.a noisy correspondence (NC), thereby inevitably leading to a performance drop. Although some methods attempt to address such noise, they still face two challenging problems: excessive memorizing/overfitting and unreliable correction for NC, especially under high noise. To address the two problems, we propose a generalized Cross-modal Robust Complementary Learning framework (CRCL), which benefits from a novel Active Complementary Loss (ACL) and an efficient Self-refining Correspondence Correction (SCC) to improve the robustness of existing methods. Specifically, ACL exploits active and complementary learning losses to reduce the risk of providing erroneous supervision, leading to theoretically and experimentally demonstrated robustness against NC. SCC utilizes multiple self-refining processes with momentum correction to enlarge the receptive field for correcting correspondences, thereby alleviating error accumulation and achieving accurate and stable corrections. We carry out extensive experiments on three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, to verify the superior robustness of our CRCL against synthetic and real-world noisy correspondences.

OTMatch: Improving Semi-Supervised Learning with Optimal Transport

  • paper_url: http://arxiv.org/abs/2310.17455
  • repo_url: None
  • paper_authors: Zhiquan Tan, Kaipeng Zheng, Weiran Huang
  • for: Improves semi-supervised learning, which uses a limited amount of labeled data while exploiting the information in unlabeled data.
  • methods: An optimal transport loss function that leverages the semantic relationships among classes.
  • results: Compared with the current state-of-the-art method FreeMatch, OTMatch reduces the error rate by 3.18% on CIFAR-10 with 1 label per class, 3.46% on STL-10 with 4 labels per class, and 1.28% on ImageNet with 100 labels per class, demonstrating its effectiveness and superiority.
    Abstract Semi-supervised learning has made remarkable strides by effectively utilizing a limited amount of labeled data while capitalizing on the abundant information present in unlabeled data. However, current algorithms often prioritize aligning image predictions with specific classes generated through self-training techniques, thereby neglecting the inherent relationships that exist within these classes. In this paper, we present a new approach called OTMatch, which leverages semantic relationships among classes by employing an optimal transport loss function. By utilizing optimal transport, our proposed method consistently outperforms established state-of-the-art methods. Notably, we observed a substantial improvement of a certain percentage in accuracy compared to the current state-of-the-art method, FreeMatch. OTMatch achieves 3.18%, 3.46%, and 1.28% error rate reduction over FreeMatch on CIFAR-10 with 1 label per class, STL-10 with 4 labels per class, and ImageNet with 100 labels per class, respectively. This demonstrates the effectiveness and superiority of our approach in harnessing semantic relationships to enhance learning performance in a semi-supervised setting.
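
The key ingredient is an optimal transport loss that aligns predictions on unlabeled data while respecting relationships among classes. A generic Sinkhorn-style sketch of such a loss is shown below; the cost matrix, uniform marginals, and iteration count are assumptions, and the paper's exact formulation is not reproduced.

```python
import torch

def sinkhorn_plan(cost, eps=0.1, iters=50):
    """Entropic-OT transport plan between uniform marginals for an (N, K) cost."""
    n, k = cost.shape
    a, b = torch.full((n,), 1 / n), torch.full((k,), 1 / k)
    K_mat = torch.exp(-cost / eps)
    u = torch.ones(n)
    for _ in range(iters):
        v = b / (K_mat.t() @ u)
        u = a / (K_mat @ v)
    return u[:, None] * K_mat * v[None, :]        # transport plan (N, K)

def ot_consistency_loss(probs_strong, probs_weak):
    """Move strong-augmentation predictions toward a transport plan computed
    from weak-augmentation predictions (a rough stand-in for OTMatch's loss)."""
    cost = 1.0 - probs_weak                       # cheaper to assign likely classes
    plan = sinkhorn_plan(cost.detach())
    target = plan / plan.sum(dim=1, keepdim=True)
    return -(target * torch.log(probs_strong + 1e-8)).sum(dim=1).mean()

p_w = torch.softmax(torch.randn(16, 10), dim=1)
p_s = torch.softmax(torch.randn(16, 10), dim=1)
print(ot_consistency_loss(p_s, p_w))
```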

Sign Languague Recognition without frame-sequencing constraints: A proof of concept on the Argentinian Sign Language

  • paper_url: http://arxiv.org/abs/2310.17437
  • repo_url: None
  • paper_authors: Franco Ronchetti, Facundo Manuel Quiroga, César Estrebou, Laura Lanzarini, Alejandro Rosete
  • for: Proposes a robust sign language recognition method to support communication with and integration of hearing-impaired people, as well as the teaching of sign language.
  • methods: A general probabilistic model for sign classification that combines sub-classifiers based on different types of features (position, movement, and handshape), using a bag-of-words approach in all classification steps to test the hypothesis that frame ordering is not essential for recognition.
  • results: The model achieves 97% accuracy on an Argentinian Sign Language dataset containing 64 sign classes and 3200 samples, providing evidence that recognition without frame ordering is indeed possible.
    Abstract Automatic sign language recognition (SLR) is an important topic within the areas of human-computer interaction and machine learning. On the one hand, it poses a complex challenge that requires the intervention of various knowledge areas, such as video processing, image processing, intelligent systems and linguistics. On the other hand, robust recognition of sign language could assist in the translation process and the integration of hearing-impaired people, as well as the teaching of sign language for the hearing population. SLR systems usually employ Hidden Markov Models, Dynamic Time Warping or similar models to recognize signs. Such techniques exploit the sequential ordering of frames to reduce the number of hypothesis. This paper presents a general probabilistic model for sign classification that combines sub-classifiers based on different types of features such as position, movement and handshape. The model employs a bag-of-words approach in all classification steps, to explore the hypothesis that ordering is not essential for recognition. The proposed model achieved an accuracy rate of 97% on an Argentinian Sign Language dataset containing 64 classes of signs and 3200 samples, providing some evidence that indeed recognition without ordering is possible.
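
The classifier combines sub-classifiers over different feature types and uses a bag-of-words representation, so frame ordering plays no role. A compact sketch of the bag-of-words part for a single feature type is below; the codebook size, the logistic-regression classifier, and the synthetic "frame descriptors" are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def bag_of_words(video_frames, codebook):
    """Histogram of visual-word assignments over all frames of one video;
    the frame order is deliberately discarded."""
    words = codebook.predict(video_frames)                       # (n_frames,)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
# toy data: 40 videos, 2 sign classes, each frame described by an 8-D feature
videos = [rng.normal(loc=c, size=(30, 8)) for c in (0, 1) for _ in range(20)]
labels = [c for c in (0, 1) for _ in range(20)]

codebook = KMeans(n_clusters=16, n_init=10, random_state=0).fit(np.vstack(videos))
X = np.stack([bag_of_words(v, codebook) for v in videos])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```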

Uncertainty-weighted Loss Functions for Improved Adversarial Attacks on Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.17436
  • repo_url: https://github.com/kmaag/uncertainty-weighted-loss
  • paper_authors: Kira Maag, Asja Fischer
  • for: Strengthens adversarial attacks on semantic segmentation models in order to better probe the vulnerability of deep networks used for image segmentation.
  • methods: Simple uncertainty-based weighting schemes for the attack loss that (i) put higher weights on pixel classifications that can be perturbed more easily and (ii) zero out the pixel-wise losses of pixels that are already confidently misclassified.
  • results: Empirical analysis on several datasets and models shows that the weighting schemes significantly improve perturbation performance with minimal additional computational overhead.
    Abstract State-of-the-art deep neural networks have been shown to be extremely powerful in a variety of perceptual tasks like semantic segmentation. However, these networks are vulnerable to adversarial perturbations of the input which are imperceptible for humans but lead to incorrect predictions. Treating image segmentation as a sum of pixel-wise classifications, adversarial attacks developed for classification models were shown to be applicable to segmentation models as well. In this work, we present simple uncertainty-based weighting schemes for the loss functions of such attacks that (i) put higher weights on pixel classifications which can more easily perturbed and (ii) zero-out the pixel-wise losses corresponding to those pixels that are already confidently misclassified. The weighting schemes can be easily integrated into the loss function of a range of well-known adversarial attackers with minimal additional computational overhead, but lead to significant improved perturbation performance, as we demonstrate in our empirical analysis on several datasets and models.
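
The weighting idea is concrete enough to sketch directly: scale each pixel's loss so that easily-perturbed pixels dominate, and zero the loss on pixels the model already misclassifies with high confidence. The snippet below is one plausible reading of that scheme; the entropy-based weight and the confidence threshold are assumptions rather than the paper's exact definitions.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_attack_loss(logits, labels, conf_thresh=0.9):
    """logits: (B, C, H, W) segmentation outputs, labels: (B, H, W) ground truth.
    Returns a scalar loss to *maximize* when crafting an adversarial perturbation."""
    probs = F.softmax(logits, dim=1)
    conf, pred = probs.max(dim=1)                                  # (B, H, W)
    # (i) higher weight for uncertain pixels (normalized entropy as the weight)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)
    weight = entropy / torch.log(torch.tensor(float(logits.shape[1])))
    # (ii) zero out pixels that are already confidently misclassified
    already_fooled = (pred != labels) & (conf > conf_thresh)
    weight = weight * (~already_fooled).float()
    pixel_loss = F.cross_entropy(logits, labels, reduction="none")  # (B, H, W)
    return (weight * pixel_loss).mean()

logits = torch.randn(2, 19, 64, 64, requires_grad=True)            # e.g. 19 classes
labels = torch.randint(0, 19, (2, 64, 64))
loss = uncertainty_weighted_attack_loss(logits, labels)
loss.backward()                                                     # ascend this for an attack
print(loss.item())
```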
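A minimal PyTorch-style sketch of the uncertainty-weighted attack loss described above, assuming a segmentation model that outputs per-pixel logits; the confidence threshold and the exact weighting rule are illustrative assumptions, not necessarily the paper's choices.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_attack_loss(logits, labels):
    """Pixel-wise cross-entropy re-weighted for an adversarial attack.

    logits: (B, C, H, W) segmentation outputs, labels: (B, H, W) ground truth.
    Pixels that are already confidently misclassified contribute zero loss;
    the remaining pixels are weighted by their predictive uncertainty
    (low-confidence pixels are easier to flip, so they receive higher weight).
    """
    ce = F.cross_entropy(logits, labels, reduction="none")   # (B, H, W)
    probs = logits.softmax(dim=1)
    conf, pred = probs.max(dim=1)                             # (B, H, W)
    already_wrong = (pred != labels) & (conf > 0.9)           # confident misclassifications
    weight = (1.0 - conf).detach()                            # uncertainty as weight
    weight = weight * (~already_wrong).float()                # zero-out confidently wrong pixels
    return (weight * ce).sum() / weight.sum().clamp_min(1.0)
```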

AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors

  • paper_url: http://arxiv.org/abs/2310.17419
  • repo_url: None
  • paper_authors: You-Ming Chang, Chen Yeh, Wei-Chen Chiu, Ning Yu
  • for: This paper proposes a deepfake detection method built on vision-language models to improve detection accuracy on unseen data.
  • methods: Deepfake detection is formulated as a visual question answering problem; soft prompts of InstructBLIP are tuned so that the model answers whether a query image is real or fake.
  • results: Prompt tuning a pretrained vision-language model raises the average detection accuracy on unseen data from 58.8% to 91.31% while training only a small number of parameters, making it an effective and efficient solution for deepfake detection.
    Abstract Deep generative models can create remarkably photorealistic fake images while raising concerns about misinformation and copyright infringement, known as deepfake threats. Deepfake detection technique is developed to distinguish between real and fake images, where the existing methods typically learn classifiers in the image domain or various feature domains. However, the generalizability of deepfake detection against emerging and more advanced generative models remains challenging. In this paper, being inspired by the zero-shot advantages of Vision-Language Models (VLMs), we propose a novel approach using VLMs (e.g. InstructBLIP) and prompt tuning techniques to improve the deepfake detection accuracy over unseen data. We formulate deepfake detection as a visual question answering problem, and tune soft prompts for InstructBLIP to answer the real/fake information of a query image. We conduct full-spectrum experiments on datasets from 3 held-in and 13 held-out generative models, covering modern text-to-image generation, image editing and image attacks. Results demonstrate that (1) the deepfake detection accuracy can be significantly and consistently improved (from 58.8% to 91.31%, in average accuracy over unseen data) using pretrained vision-language models with prompt tuning; (2) our superior performance is at less cost of trainable parameters, resulting in an effective and efficient solution for deepfake detection. Code and models can be found at https://github.com/nctu-eva-lab/AntifakePrompt.
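A rough sketch of soft prompt tuning around a frozen vision-language backbone, in the spirit of the approach above; `FrozenVLM`-style interfaces, token counts and dimensions here are hypothetical stand-ins and do not reflect the actual InstructBLIP API.

```python
import torch
import torch.nn as nn

class SoftPromptDetector(nn.Module):
    """Learnable prompt embeddings prepended to the question tokens of a frozen VLM."""

    def __init__(self, vlm, num_prompt_tokens=16, embed_dim=768):
        super().__init__()
        self.vlm = vlm                              # frozen vision-language model (hypothetical)
        for p in self.vlm.parameters():
            p.requires_grad = False
        self.soft_prompt = nn.Parameter(torch.randn(num_prompt_tokens, embed_dim) * 0.02)

    def forward(self, image, question_embeds):
        # Prepend the trainable prompt to the embedded question; only the prompt is updated.
        prompt = self.soft_prompt.unsqueeze(0).expand(question_embeds.size(0), -1, -1)
        text_embeds = torch.cat([prompt, question_embeds], dim=1)
        return self.vlm(image, text_embeds)          # logits over {"real", "fake"} answers
```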

Circuit as Set of Points

  • paper_url: http://arxiv.org/abs/2310.17418
  • repo_url: https://github.com/hustvl/circuitformer
  • paper_authors: Jialv Zou, Xinggang Wang, Jiahao Guo, Wenyu Liu, Qian Zhang, Chang Huang
  • for: This paper aims to speed up the placement and routing stages of circuit design in Electronic Design Automation (EDA) with the help of artificial intelligence.
  • methods: Circuit components are treated as point clouds and features are extracted with Transformer-based point cloud perception methods, directly from raw data and without preprocessing, enabling end-to-end training and high performance.
  • results: The method achieves state-of-the-art performance on congestion prediction on the CircuitNet and ISPD2015 datasets and on design rule check (DRC) violation prediction on the CircuitNet dataset.
    Abstract As the size of circuit designs continues to grow rapidly, artificial intelligence technologies are being extensively used in Electronic Design Automation (EDA) to assist with circuit design. Placement and routing are the most time-consuming parts of the physical design process, and how to quickly evaluate the placement has become a hot research topic. Prior works either transformed circuit designs into images using hand-crafted methods and then used Convolutional Neural Networks (CNN) to extract features, which are limited by the quality of the hand-crafted methods and could not achieve end-to-end training, or treated the circuit design as a graph structure and used Graph Neural Networks (GNN) to extract features, which require time-consuming preprocessing. In our work, we propose a novel perspective for circuit design by treating circuit components as point clouds and using Transformer-based point cloud perception methods to extract features from the circuit. This approach enables direct feature extraction from raw data without any preprocessing, allows for end-to-end training, and results in high performance. Experimental results show that our method achieves state-of-the-art performance in congestion prediction tasks on both the CircuitNet and ISPD2015 datasets, as well as in design rule check (DRC) violation prediction tasks on the CircuitNet dataset. Our method establishes a bridge between the relatively mature point cloud perception methods and the fast-developing EDA algorithms, enabling us to leverage more collective intelligence to solve this task. To facilitate the research of open EDA design, source codes and pre-trained models are released at https://github.com/hustvl/circuitformer.

Detection Defenses: An Empty Promise against Adversarial Patch Attacks on Optical Flow

  • paper_url: http://arxiv.org/abs/2310.17403
  • repo_url: https://github.com/cv-stuttgart/detectiondefenses
  • paper_authors: Erik Scheurer, Jenny Schmalfuss, Alexander Lis, Andrés Bruhn
  • for: This paper examines how the currently available detect-and-remove defenses (ILP and LGS) affect state-of-the-art optical flow methods, and whether these defenses withstand attacks that take the defense mechanism into account.
  • methods: The detect-and-remove defenses are applied to a wide selection of state-of-the-art optical flow methods, and defense-aware attacks are implemented to probe them.
  • results: The experiments show that detect-and-remove defenses not only lower optical flow quality on benign scenes but also harm robustness under patch attacks for all tested methods except FlowNetC; they therefore fail to deliver the promised adversarial robustness and evoke a false sense of security.
    Abstract Adversarial patches undermine the reliability of optical flow predictions when placed in arbitrary scene locations. Therefore, they pose a realistic threat to real-world motion detection and its downstream applications. Potential remedies are defense strategies that detect and remove adversarial patches, but their influence on the underlying motion prediction has not been investigated. In this paper, we thoroughly examine the currently available detect-and-remove defenses ILP and LGS for a wide selection of state-of-the-art optical flow methods, and illuminate their side effects on the quality and robustness of the final flow predictions. In particular, we implement defense-aware attacks to investigate whether current defenses are able to withstand attacks that take the defense mechanism into account. Our experiments yield two surprising results: Detect-and-remove defenses do not only lower the optical flow quality on benign scenes, in doing so, they also harm the robustness under patch attacks for all tested optical flow methods except FlowNetC. As currently employed detect-and-remove defenses fail to deliver the promised adversarial robustness for optical flow, they evoke a false sense of security. The code is available at https://github.com/cv-stuttgart/DetectionDefenses.

Learning Temporal Sentence Grounding From Narrated EgoVideos

  • paper_url: http://arxiv.org/abs/2310.17395
  • repo_url: https://github.com/keflanagan/climer
  • paper_authors: Kevin Flanagan, Dima Damen, Michael Wray
  • for: This paper addresses the new challenges that long-form egocentric datasets such as Ego4D and EPIC-Kitchens pose for temporal sentence grounding (TSG).
  • methods: Sentences are grounded using only narrations and their relatively rough timestamps; the proposed clip merging (CliMer) approach artificially merges clips and trains for temporal grounding in a contrastive manner using text-conditioned attention.
  • results: Compared with a high-performing TSG method, CliMer improves mean R@1 from 3.9 to 5.7 on Ego4D and from 10.7 to 13.0 on EPIC-Kitchens.
    Abstract The onset of long-form egocentric datasets such as Ego4D and EPIC-Kitchens presents a new challenge for the task of Temporal Sentence Grounding (TSG). Compared to traditional benchmarks on which this task is evaluated, these datasets offer finer-grained sentences to ground in notably longer videos. In this paper, we develop an approach for learning to ground sentences in these datasets using only narrations and their corresponding rough narration timestamps. We propose to artificially merge clips to train for temporal grounding in a contrastive manner using text-conditioning attention. This Clip Merging (CliMer) approach is shown to be effective when compared with a high performing TSG method -- e.g. mean R@1 improves from 3.9 to 5.7 on Ego4D and from 10.7 to 13.0 on EPIC-Kitchens. Code and data splits available from: https://github.com/keflanagan/CliMer

SE(3) Diffusion Model-based Point Cloud Registration for Robust 6D Object Pose Estimation

  • paper_url: http://arxiv.org/abs/2310.17359
  • repo_url: https://github.com/Jiang-HB/DiffusionReg
  • paper_authors: Haobo Jiang, Mathieu Salzmann, Zheng Dang, Jin Xie, Jian Yang
  • for: 6D object pose estimation in real-world scenarios
  • methods: SE(3) diffusion model-based point cloud registration framework
  • results: Outstanding pose estimation performance on real-world datasets (TUD-L, LINEMOD, and Occluded-LINEMOD)
    Abstract In this paper, we introduce an SE(3) diffusion model-based point cloud registration framework for 6D object pose estimation in real-world scenarios. Our approach formulates the 3D registration task as a denoising diffusion process, which progressively refines the pose of the source point cloud to obtain a precise alignment with the model point cloud. Training our framework involves two operations: An SE(3) diffusion process and an SE(3) reverse process. The SE(3) diffusion process gradually perturbs the optimal rigid transformation of a pair of point clouds by continuously injecting noise (perturbation transformation). By contrast, the SE(3) reverse process focuses on learning a denoising network that refines the noisy transformation step-by-step, bringing it closer to the optimal transformation for accurate pose estimation. Unlike standard diffusion models used in linear Euclidean spaces, our diffusion model operates on the SE(3) manifold. This requires exploiting the linear Lie algebra $\mathfrak{se}(3)$ associated with SE(3) to constrain the transformation transitions during the diffusion and reverse processes. Additionally, to effectively train our denoising network, we derive a registration-specific variational lower bound as the optimization objective for model learning. Furthermore, we show that our denoising network can be constructed with a surrogate registration model, making our approach applicable to different deep registration networks. Extensive experiments demonstrate that our diffusion registration framework presents outstanding pose estimation performance on the real-world TUD-L, LINEMOD, and Occluded-LINEMOD datasets.
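A small numpy sketch of one forward-diffusion step on SE(3), as the abstract describes: a perturbation is sampled in the Lie algebra se(3) and applied to the current rigid transform through the exponential map. The noise scale below is an arbitrary assumption.

```python
import numpy as np
from scipy.linalg import expm

def hat(xi):
    """Map a 6-vector (omega, v) in se(3) to its 4x4 matrix representation."""
    wx, wy, wz, vx, vy, vz = xi
    return np.array([[0, -wz,  wy, vx],
                     [wz,  0, -wx, vy],
                     [-wy, wx,  0, vz],
                     [0,   0,   0,  0]], dtype=float)

def perturb_se3(T, sigma=0.05, rng=None):
    """One noising step: left-multiply T by exp(hat(xi)) with xi ~ N(0, sigma^2 I)."""
    rng = rng or np.random.default_rng()
    xi = rng.normal(scale=sigma, size=6)
    return expm(hat(xi)) @ T

T = np.eye(4)                  # start from the identity transform
T_noisy = perturb_se3(T)       # slightly perturbed rigid transform
print(np.round(T_noisy, 3))
```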

Sky Imager-Based Forecast of Solar Irradiance Using Machine Learning

  • paper_url: http://arxiv.org/abs/2310.17356
  • repo_url: None
  • paper_authors: Anas Al-lahham, Obaidah Theeb, Khaled Elalem, Tariq A. Alshawi, Saleh A. Alshebeili
  • for: Ahead-of-time forecasting of solar power output, to keep the electricity grid stable and the service uninterrupted.
  • methods: A new sky-image feature extraction and learning-based technique is proposed to estimate short-term solar irradiance.
  • results: Compared with the computationally heavy algorithms known from the literature, the method achieves competitive results at a much lower computational complexity, for both nowcasting and forecasting up to 4 h ahead.
    Abstract Ahead-of-time forecasting of the output power of power plants is essential for the stability of the electricity grid and ensuring uninterrupted service. However, forecasting renewable energy sources is difficult due to the chaotic behavior of natural energy sources. This paper presents a new approach to estimate short-term solar irradiance from sky images. The~proposed algorithm extracts features from sky images and use learning-based techniques to estimate the solar irradiance. The~performance of proposed machine learning (ML) algorithm is evaluated using two publicly available datasets of sky images. The~datasets contain over 350,000 images for an interval of 16 years, from 2004 to 2020, with the corresponding global horizontal irradiance (GHI) of each image as the ground truth. Compared to the state-of-the-art computationally heavy algorithms proposed in the literature, our approach achieves competitive results with much less computational complexity for both nowcasting and forecasting up to 4 h ahead of time.

CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling

  • paper_url: http://arxiv.org/abs/2310.17347
  • repo_url: None
  • paper_authors: Seyedmorteza Sadat, Jakob Buhmann, Derek Bradely, Otmar Hilliges, Romann M. Weber
  • for: Increasing the output diversity of diffusion models, especially at high guidance scales or when they are trained on small datasets.
  • methods: An improved sampling strategy that anneals the conditioning signal by adding scheduled, monotonically decreasing Gaussian noise to the conditioning vector during inference, balancing diversity against condition alignment.
  • results: Used with existing pretrained diffusion models, CADS increases diversity across various conditional generation tasks and sets a new state-of-the-art FID of 1.70 and 2.31 for class-conditional ImageNet generation at 256×256 and 512×512, respectively.
    Abstract While conditional diffusion models are known to have good coverage of the data distribution, they still face limitations in output diversity, particularly when sampled with a high classifier-free guidance scale for optimal image quality or when trained on small datasets. We attribute this problem to the role of the conditioning signal in inference and offer an improved sampling strategy for diffusion models that can increase generation diversity, especially at high guidance scales, with minimal loss of sample quality. Our sampling strategy anneals the conditioning signal by adding scheduled, monotonically decreasing Gaussian noise to the conditioning vector during inference to balance diversity and condition alignment. Our Condition-Annealed Diffusion Sampler (CADS) can be used with any pretrained model and sampling algorithm, and we show that it boosts the diversity of diffusion models in various conditional generation tasks. Further, using an existing pretrained diffusion model, CADS achieves a new state-of-the-art FID of 1.70 and 2.31 for class-conditional ImageNet generation at 256$\times$256 and 512$\times$512 respectively.
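A minimal sketch of condition annealing during sampling, as described above: scheduled, monotonically decreasing Gaussian noise is added to the conditioning vector at each reverse-diffusion step. The linear schedule and noise scale below are illustrative assumptions, not the paper's exact schedule.

```python
import torch

def annealed_condition(cond, step, num_steps, s_max=0.25):
    """Corrupt the conditioning vector with noise that shrinks as sampling proceeds.

    cond: conditioning embedding, e.g. a class or text embedding.
    step: current reverse-diffusion step, from 0 (most noise) to num_steps - 1 (no noise).
    """
    s = s_max * (1.0 - step / max(num_steps - 1, 1))   # monotonically decreasing scale
    return cond + s * torch.randn_like(cond)

# Usage inside a sampling loop (denoise_fn and x stand in for a real sampler):
# for step in range(num_steps):
#     cond_t = annealed_condition(cond, step, num_steps)
#     x = denoise_fn(x, cond_t, step)
```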

IndustReal: A Dataset for Procedure Step Recognition Handling Execution Errors in Egocentric Videos in an Industrial-Like Setting

  • paper_url: http://arxiv.org/abs/2310.17323
  • repo_url: https://github.com/timschoonbeek/industreal
  • paper_authors: Tim J. Schoonbeek, Tim Houben, Hans Onvlee, Peter H. N. de With, Fons van der Sommen
  • for: This paper focuses on recognizing the correct completion and order of procedural steps, addressing a limitation of action recognition for procedural tasks: no measure of whether an action actually succeeded.
  • methods: The new task of procedure step recognition (PSR) is defined, together with the multi-modal IndustReal dataset.
  • results: IndustReal contains both procedural errors (such as omissions) and execution errors, a large part of which appear only in the validation and test sets; annotations and benchmark results are provided for action recognition, assembly state detection and the new PSR task.
    Abstract Although action recognition for procedural tasks has received notable attention, it has a fundamental flaw in that no measure of success for actions is provided. This limits the applicability of such systems especially within the industrial domain, since the outcome of procedural actions is often significantly more important than the mere execution. To address this limitation, we define the novel task of procedure step recognition (PSR), focusing on recognizing the correct completion and order of procedural steps. Alongside the new task, we also present the multi-modal IndustReal dataset. Unlike currently available datasets, IndustReal contains procedural errors (such as omissions) as well as execution errors. A significant part of these errors are exclusively present in the validation and test sets, making IndustReal suitable to evaluate robustness of algorithms to new, unseen mistakes. Additionally, to encourage reproducibility and allow for scalable approaches trained on synthetic data, the 3D models of all parts are publicly available. Annotations and benchmark performance are provided for action recognition and assembly state detection, as well as the new PSR task. IndustReal, along with the code and model weights, is available at: https://github.com/TimSchoonbeek/IndustReal .

Defect Spectrum: A Granular Look of Large-Scale Defect Datasets with Rich Semantics

  • paper_url: http://arxiv.org/abs/2310.17316
  • repo_url: None
  • paper_authors: Shuai Yang, Zhifei Chen, Pengguang Chen, Xi Fang, Shu Liu, Yingcong Chen
  • for: The goal of this work is to provide a precise, semantics-rich, large-scale defect dataset, the Defect Spectrum, for defect inspection in practical applications.
  • methods: Building on four key industrial benchmarks, existing annotations are refined and enriched with semantic detail so that multiple defect types can be distinguished within a single image; Defect-Gen, a two-stage diffusion-based generator, is proposed to create high-quality and diverse defective images even from limited data.
  • results: Synthetic images generated by Defect-Gen clearly improve the effectiveness of defect inspection models; overall, the Defect Spectrum dataset offers a solid platform for testing and refining advanced inspection models.
    Abstract Defect inspection is paramount within the closed-loop manufacturing system. However, existing datasets for defect inspection often lack precision and semantic granularity required for practical applications. In this paper, we introduce the Defect Spectrum, a comprehensive benchmark that offers precise, semantic-abundant, and large-scale annotations for a wide range of industrial defects. Building on four key industrial benchmarks, our dataset refines existing annotations and introduces rich semantic details, distinguishing multiple defect types within a single image. Furthermore, we introduce Defect-Gen, a two-stage diffusion-based generator designed to create high-quality and diverse defective images, even when working with limited datasets. The synthetic images generated by Defect-Gen significantly enhance the efficacy of defect inspection models. Overall, The Defect Spectrum dataset demonstrates its potential in defect inspection research, offering a solid platform for testing and refining advanced models.

Scale-Adaptive Feature Aggregation for Efficient Space-Time Video Super-Resolution

  • paper_url: http://arxiv.org/abs/2310.17294
  • repo_url: https://github.com/megvii-research/wacv2024-safa
  • paper_authors: Zhewei Huang, Ailin Huang, Xiaotao Hu, Chen Hu, Jun Xu, Shuchang Zhou
  • for: Improving the visual quality of videos through space-time video super-resolution.
  • methods: A Scale-Adaptive Feature Aggregation (SAFA) network adaptively selects sub-networks with different processing scales for individual samples, improving flow-based feature propagation across different motion amplitudes.
  • results: SAFA achieves state-of-the-art performance on four public STVSR benchmarks, outperforming recent methods such as TMNet and VideoINR by over 0.5 dB PSNR on average while requiring fewer than half the parameters and only 1/3 of the computational cost.
    Abstract The Space-Time Video Super-Resolution (STVSR) task aims to enhance the visual quality of videos, by simultaneously performing video frame interpolation (VFI) and video super-resolution (VSR). However, facing the challenge of the additional temporal dimension and scale inconsistency, most existing STVSR methods are complex and inflexible in dynamically modeling different motion amplitudes. In this work, we find that choosing an appropriate processing scale achieves remarkable benefits in flow-based feature propagation. We propose a novel Scale-Adaptive Feature Aggregation (SAFA) network that adaptively selects sub-networks with different processing scales for individual samples. Experiments on four public STVSR benchmarks demonstrate that SAFA achieves state-of-the-art performance. Our SAFA network outperforms recent state-of-the-art methods such as TMNet and VideoINR by an average improvement of over 0.5dB on PSNR, while requiring less than half the number of parameters and only 1/3 computational costs.

RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open Environments

  • paper_url: http://arxiv.org/abs/2310.17290
  • repo_url: None
  • paper_authors: Mengxue Qu, Yu Wu, Wu Liu, Xiaodan Liang, Jingkuan Song, Yao Zhao, Yunchao Wei
  • for: This paper investigates how to detect objects based on a specific intention or requirement.
  • methods: A new dataset, Reasoning Intention-Oriented Objects (RIO), is constructed to better handle intentions in open environments: intention descriptions are natural sentences that are contextually relevant to the scene, covering 40,214 images and 130,585 intention-object pairs.
  • results: With RIO, the authors evaluate the ability of several existing models to reason about intention-oriented objects in open environments.
    Abstract Intention-oriented object detection aims to detect desired objects based on specific intentions or requirements. For instance, when we desire to "lie down and rest", we instinctively seek out a suitable option such as a "bed" or a "sofa" that can fulfill our needs. Previous work in this area is limited either by the number of intention descriptions or by the affordance vocabulary available for intention objects. These limitations make it challenging to handle intentions in open environments effectively. To facilitate this research, we construct a comprehensive dataset called Reasoning Intention-Oriented Objects (RIO). In particular, RIO is specifically designed to incorporate diverse real-world scenarios and a wide range of object categories. It offers the following key features: 1) intention descriptions in RIO are represented as natural sentences rather than a mere word or verb phrase, making them more practical and meaningful; 2) the intention descriptions are contextually relevant to the scene, enabling a broader range of potential functionalities associated with the objects; 3) the dataset comprises a total of 40,214 images and 130,585 intention-object pairs. With the proposed RIO, we evaluate the ability of some existing models to reason intention-oriented objects in open environments.

BEVContrast: Self-Supervision in BEV Space for Automotive Lidar Point Clouds

  • paper_url: http://arxiv.org/abs/2310.17281
  • repo_url: https://github.com/valeoai/bevcontrast
  • paper_authors: Corentin Sautier, Gilles Puy, Alexandre Boulch, Renaud Marlet, Vincent Lepetit
  • for: Making self-supervision of 3D backbones on automotive LiDAR point clouds simpler and more computationally efficient.
  • methods: A contrastive loss is defined at the level of 2D cells in the Bird's Eye View plane, yielding a simple yet effective self-supervision signal.
  • results: The resulting cell-level representations retain the simplicity of PointContrast (they are cheap to compute) while surpassing segment-level TARL on downstream semantic segmentation.
    Abstract We present a surprisingly simple and efficient method for self-supervision of 3D backbone on automotive Lidar point clouds. We design a contrastive loss between features of Lidar scans captured in the same scene. Several such approaches have been proposed in the literature from PointConstrast, which uses a contrast at the level of points, to the state-of-the-art TARL, which uses a contrast at the level of segments, roughly corresponding to objects. While the former enjoys a great simplicity of implementation, it is surpassed by the latter, which however requires a costly pre-processing. In BEVContrast, we define our contrast at the level of 2D cells in the Bird's Eye View plane. Resulting cell-level representations offer a good trade-off between the point-level representations exploited in PointContrast and segment-level representations exploited in TARL: we retain the simplicity of PointContrast (cell representations are cheap to compute) while surpassing the performance of TARL in downstream semantic segmentation.
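A rough sketch of the cell-level contrast described above: per-point features from two aligned LiDAR scans are average-pooled into Bird's Eye View cells, and corresponding cells are pulled together with an InfoNCE-style loss. The grid size, cell resolution and the assumption that the scans are already aligned are illustrative simplifications.

```python
import torch
import torch.nn.functional as F

def pool_to_bev(points_xy, feats, cell=0.5, grid=128):
    """Average-pool point features into a (grid*grid, C) table of BEV cells."""
    idx = ((points_xy / cell).floor().long() + grid // 2).clamp(0, grid - 1)
    flat = idx[:, 0] * grid + idx[:, 1]
    out = torch.zeros(grid * grid, feats.size(1))
    cnt = torch.zeros(grid * grid, 1)
    out.index_add_(0, flat, feats)
    cnt.index_add_(0, flat, torch.ones(feats.size(0), 1))
    return out / cnt.clamp_min(1), cnt.squeeze(1) > 0

def bev_contrast_loss(cells_a, cells_b, occupied, tau=0.07):
    """InfoNCE over occupied cells: cell i of scan A should match cell i of scan B."""
    a = F.normalize(cells_a[occupied], dim=1)
    b = F.normalize(cells_b[occupied], dim=1)
    logits = a @ b.t() / tau
    target = torch.arange(a.size(0))
    return F.cross_entropy(logits, target)
```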

Generalizing to Unseen Domains in Diabetic Retinopathy Classification

  • paper_url: http://arxiv.org/abs/2310.17255
  • repo_url: https://github.com/chumsy0725/spsd-vit
  • paper_authors: Chamuditha Jayanga Galappaththige, Gayal Kuruppu, Muhammad Haris Khan
  • for: Early detection of diabetic retinopathy (DR), a complication of long-standing diabetes, to support timely diagnosis and treatment.
  • methods: A simple and effective domain generalization (DG) approach that achieves self-distillation in vision transformers (ViT) via a novel prediction-softening mechanism, an adaptive convex combination of one-hot labels with the model's own knowledge.
  • results: Extensive experiments on challenging open-source DR classification datasets, under both multi-source and single-source DG settings and with three different ViT backbones, show competitive performance against prior methods together with improved accuracy and calibration.
    Abstract Diabetic retinopathy (DR) is caused by long-standing diabetes and is among the fifth leading cause for visual impairments. The process of early diagnosis and treatments could be helpful in curing the disease, however, the detection procedure is rather challenging and mostly tedious. Therefore, automated diabetic retinopathy classification using deep learning techniques has gained interest in the medical imaging community. Akin to several other real-world applications of deep learning, the typical assumption of i.i.d data is also violated in DR classification that relies on deep learning. Therefore, developing DR classification methods robust to unseen distributions is of great value. In this paper, we study the problem of generalizing a model to unseen distributions or domains (a.k.a domain generalization) in DR classification. To this end, we propose a simple and effective domain generalization (DG) approach that achieves self-distillation in vision transformers (ViT) via a novel prediction softening mechanism. This prediction softening is an adaptive convex combination one-hot labels with the model's own knowledge. We perform extensive experiments on challenging open-source DR classification datasets under both multi-source and single-source DG settings with three different ViT backbones to establish the efficacy and applicability of our approach against competing methods. For the first time, we report the performance of several state-of-the-art DG methods on open-source DR classification datasets after conducting thorough experiments. Finally, our method is also capable of delivering improved calibration performance than other methods, showing its suitability for safety-critical applications, including healthcare. We hope that our contributions would investigate more DG research across the medical imaging community.
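A short sketch of the prediction-softening idea described above: the one-hot label is adaptively mixed with the model's own softmax output to form a soft self-distillation target. The confidence-based mixing weight used here is an illustrative choice, not necessarily the paper's exact rule.

```python
import torch
import torch.nn.functional as F

def softened_target(logits, labels, num_classes):
    """Adaptive convex combination of one-hot labels with the model's own prediction."""
    with torch.no_grad():
        probs = logits.softmax(dim=1)
        one_hot = F.one_hot(labels, num_classes).float()
        alpha = probs.gather(1, labels.unsqueeze(1))       # confidence in the true class
        return alpha * probs + (1.0 - alpha) * one_hot     # per-sample soft target

def self_distillation_loss(logits, labels, num_classes):
    target = softened_target(logits, labels, num_classes)
    return F.kl_div(logits.log_softmax(dim=1), target, reduction="batchmean")
```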

Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification

  • paper_url: http://arxiv.org/abs/2310.17218
  • repo_url: None
  • paper_authors: Jiachen Li, Xiaojin Gong
  • for: Improving object re-identification (Re-ID) performance across various supervision settings.
  • methods: The large-scale pre-trained vision-language model CLIP is adapted by directly fine-tuning its image encoder with a prototypical contrastive learning (PCL) loss, removing the need for prompt learning.
  • results: The method is competitive with CLIP-ReID on person and vehicle Re-ID datasets and, when extended to the unsupervised setting, achieves state-of-the-art performance.
    Abstract This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object reidentification (Re-ID) across various supervision settings. Although prompt learning has enabled a recent work named CLIP-ReID to achieve promising performance, the underlying mechanisms and the necessity of prompt learning remain unclear due to the absence of semantic labels in ReID tasks. In this work, we first analyze the role prompt learning in CLIP-ReID and identify its limitations. Based on our investigations, we propose a simple yet effective approach to adapt CLIP for supervised object Re-ID. Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning. Experimental results on both person and vehicle Re-ID datasets demonstrate the competitiveness of our method compared to CLIP-ReID. Furthermore, we extend our PCL-based CLIP fine-tuning approach to unsupervised scenarios, where we achieve state-of-the art performance.
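A compact sketch of a prototypical contrastive loss of the kind used to fine-tune the CLIP image encoder above: each identity is represented by the mean of its in-batch features, and every sample is contrasted against all identity prototypes. Memory banks and momentum updates commonly used in practice are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def prototypical_contrastive_loss(feats, pids, tau=0.05):
    """feats: (N, D) image-encoder features; pids: (N,) integer identity labels."""
    feats = F.normalize(feats, dim=1)
    unique_ids, targets = torch.unique(pids, return_inverse=True)
    # Prototype of each identity = mean feature of its samples in the batch.
    protos = torch.zeros(len(unique_ids), feats.size(1))
    protos.index_add_(0, targets, feats)
    counts = torch.bincount(targets, minlength=len(unique_ids)).unsqueeze(1).float()
    protos = F.normalize(protos / counts, dim=1)
    logits = feats @ protos.t() / tau
    return F.cross_entropy(logits, targets)
```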

Three-dimensional Bone Image Synthesis with Generative Adversarial Networks

  • paper_url: http://arxiv.org/abs/2310.17216
  • repo_url: None
  • paper_authors: Christoph Angermann, Johannes Bereiter-Payr, Kerstin Stock, Markus Haltmeier, Gerald Degenhart
  • for: This paper explores three-dimensional generative modelling for medical image processing, where data availability and privacy issues hamper research.
  • methods: Three-dimensional generative adversarial networks (GANs) are trained to generate high-resolution medical volumes with finely detailed voxel-based architectures.
  • results: GANs can be trained successfully in the 3D setting, enabling large-scale data-driven models; GAN inversion is also implemented for 3D and used for image morphing, attribute editing and style mixing, with results extensively validated on a database of 3D HR-pQCT scans of the distal radius.
    Abstract Medical image processing has been highlighted as an area where deep learning-based models have the greatest potential. However, in the medical field in particular, problems of data availability and privacy are hampering research progress and thus rapid implementation in clinical routine. The generation of synthetic data not only ensures privacy, but also allows to \textit{draw} new patients with specific characteristics, enabling the development of data-driven models on a much larger scale. This work demonstrates that three-dimensional generative adversarial networks (GANs) can be efficiently trained to generate high-resolution medical volumes with finely detailed voxel-based architectures. In addition, GAN inversion is successfully implemented for the three-dimensional setting and used for extensive research on model interpretability and applications such as image morphing, attribute editing and style mixing. The results are comprehensively validated on a database of three-dimensional HR-pQCT instances representing the bone micro-architecture of the distal radius.

Weakly-Supervised Surgical Phase Recognition

  • paper_url: http://arxiv.org/abs/2310.17209
  • repo_url: None
  • paper_authors: Roy Hirsch, Regev Cohen, Mathilde Caron, Tomer Golany, Daniel Freedman, Ehud Rivlin
  • for: Surgical phase recognition for computer-assisted surgery systems.
  • methods: Graph segmentation is combined with self-supervised learning to derive a random-walk solution for per-frame phase prediction, using two forms of weak supervision: sparse timestamps or few-shot learning.
  • results: Experiments on the public Cholec80 dataset of laparoscopic cholecystectomy videos show promising performance in multiple low-data setups at low computational cost.
    Abstract A key element of computer-assisted surgery systems is phase recognition of surgical videos. Existing phase recognition algorithms require frame-wise annotation of a large number of videos, which is time and money consuming. In this work we join concepts of graph segmentation with self-supervised learning to derive a random-walk solution for per-frame phase prediction. Furthermore, we utilize within our method two forms of weak supervision: sparse timestamps or few-shot learning. The proposed algorithm enjoys low complexity and can operate in lowdata regimes. We validate our method by running experiments with the public Cholec80 dataset of laparoscopic cholecystectomy videos, demonstrating promising performance in multiple setups.

Lookup Table meets Local Laplacian Filter: Pyramid Reconstruction Network for Tone Mapping

  • paper_url: http://arxiv.org/abs/2310.17190
  • repo_url: https://github.com/fengzhang427/LLF-LUT
  • paper_authors: Feng Zhang, Ming Tian, Zhiqiang Li, Bin Xu, Qingbo Lu, Changxin Gao, Nong Sang
  • for: This work addresses a limitation of traditional 3-Dimensional LookUp Table (3D LUT) based tone mapping, which operates globally on pixel values and often fails to preserve local detail.
  • methods: Global and local operators are integrated through closed-form Laplacian pyramid decomposition and reconstruction: image-adaptive 3D LUTs manipulate the tone of the low-frequency image, while progressively learned local Laplacian filters refine edge details in the high-frequency components.
  • results: Extensive experiments on two benchmark datasets show that the method preserves both global tone and local detail and performs favorably against state-of-the-art methods.
    Abstract Tone mapping aims to convert high dynamic range (HDR) images to low dynamic range (LDR) representations, a critical task in the camera imaging pipeline. In recent years, 3-Dimensional LookUp Table (3D LUT) based methods have gained attention due to their ability to strike a favorable balance between enhancement performance and computational efficiency. However, these methods often fail to deliver satisfactory results in local areas since the look-up table is a global operator for tone mapping, which works based on pixel values and fails to incorporate crucial local information. To this end, this paper aims to address this issue by exploring a novel strategy that integrates global and local operators by utilizing closed-form Laplacian pyramid decomposition and reconstruction. Specifically, we employ image-adaptive 3D LUTs to manipulate the tone in the low-frequency image by leveraging the specific characteristics of the frequency information. Furthermore, we utilize local Laplacian filters to refine the edge details in the high-frequency components in an adaptive manner. Local Laplacian filters are widely used to preserve edge details in photographs, but their conventional usage involves manual tuning and fixed implementation within camera imaging pipelines or photo editing tools. We propose to learn parameter value maps progressively for local Laplacian filters from annotated data using a lightweight network. Our model achieves simultaneous global tone manipulation and local edge detail preservation in an end-to-end manner. Extensive experimental results on two benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art methods.
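A minimal sketch of the closed-form Laplacian pyramid decomposition and reconstruction on which the method above builds, using OpenCV's pyramid operators. The 3-level depth is an arbitrary choice, and the paper's 3D LUT and local Laplacian filtering stages are not reproduced here.

```python
import cv2
import numpy as np

def laplacian_pyramid(img, levels=3):
    """Decompose an image into band-pass levels plus a low-frequency residual."""
    pyr, cur = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(cur)
        up = cv2.pyrUp(down, dstsize=(cur.shape[1], cur.shape[0]))
        pyr.append(cur - up)          # high-frequency detail at this scale
        cur = down
    pyr.append(cur)                   # low-frequency image (where a 3D LUT would act)
    return pyr

def reconstruct(pyr):
    cur = pyr[-1]
    for lap in reversed(pyr[:-1]):
        cur = cv2.pyrUp(cur, dstsize=(lap.shape[1], lap.shape[0])) + lap
    return cur

img = np.random.rand(256, 256, 3).astype(np.float32)
assert np.allclose(reconstruct(laplacian_pyramid(img)), img, atol=1e-4)
```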

Exploring Iterative Refinement with Diffusion Models for Video Grounding

  • paper_url: http://arxiv.org/abs/2310.17189
  • repo_url: https://github.com/mastervito/diffusionvg
  • paper_authors: Xiao Liang, Tao Shi, Yaoyuan Liang, Te Tao, Shao-Lun Huang
  • for: Improving video grounding, i.e., localizing the moment in an untrimmed video that corresponds to a given sentence query.
  • methods: DiffusionVG formulates video grounding as a conditional generation task with a diffusion model: the target span is generated from Gaussian noise and iteratively refined in the reverse diffusion process, conditioned on video-sentence representations within an encoder-decoder architecture with a specialized span-refining decoder.
  • results: Without bells and whistles, DiffusionVG delivers competitive or superior performance compared with well-crafted models on the mainstream Charades-STA and ActivityNet Captions benchmarks.
    Abstract Video grounding aims to localize the target moment in an untrimmed video corresponding to a given sentence query. Existing methods typically select the best prediction from a set of predefined proposals or directly regress the target span in a single-shot manner, resulting in the absence of a systematical prediction refinement process. In this paper, we propose DiffusionVG, a novel framework with diffusion models that formulates video grounding as a conditional generation task, where the target span is generated from Gaussian noise inputs and interatively refined in the reverse diffusion process. During training, DiffusionVG progressively adds noise to the target span with a fixed forward diffusion process and learns to recover the target span in the reverse diffusion process. In inference, DiffusionVG can generate the target span from Gaussian noise inputs by the learned reverse diffusion process conditioned on the video-sentence representations. Our DiffusionVG follows the encoder-decoder architecture, which firstly encodes the video-sentence features and iteratively denoises the predicted spans in its specialized span refining decoder. Without bells and whistles, our DiffusionVG demonstrates competitive or even superior performance compared to existing well-crafted models on mainstream Charades-STA and ActivityNet Captions benchmarks.

Blind Image Super-resolution with Rich Texture-Aware Codebooks

  • paper_url: http://arxiv.org/abs/2310.17188
  • repo_url: None
  • paper_authors: Rui Qin, Ming Sun, Fangyuan Zhang, Xing Wen, Bin Wang
  • for: Improving blind super-resolution (BSR): codebooks built only from HR reconstruction offer limited texture diversity when faced with complex blind degradations and confusing LR inputs.
  • methods: The Rich Texture-aware Codebook-based Network (RTCNet) combines a Degradation-robust Texture Prior Module (DTPM), which mines cross-resolution texture correlations between LR and HR images, with a Patch-aware Texture Prior Module (PTPM), which uses patch-wise semantic pre-training to correct misperceived texture similarity.
  • results: RTCNet outperforms state-of-the-art methods on various benchmarks by 0.16 to 0.46 dB.
    Abstract Blind super-resolution (BSR) methods based on high-resolution (HR) reconstruction codebooks have achieved promising results in recent years. However, we find that a codebook based on HR reconstruction may not effectively capture the complex correlations between low-resolution (LR) and HR images. In detail, multiple HR images may produce similar LR versions due to complex blind degradations, causing the HR-dependent only codebooks having limited texture diversity when faced with confusing LR inputs. To alleviate this problem, we propose the Rich Texture-aware Codebook-based Network (RTCNet), which consists of the Degradation-robust Texture Prior Module (DTPM) and the Patch-aware Texture Prior Module (PTPM). DTPM effectively mines the cross-resolution correlation of textures between LR and HR images by exploiting the cross-resolution correspondence of textures. PTPM uses patch-wise semantic pre-training to correct the misperception of texture similarity in the high-level semantic regularization. By taking advantage of this, RTCNet effectively gets rid of the misalignment of confusing textures between HR and LR in the BSR scenarios. Experiments show that RTCNet outperforms state-of-the-art methods on various benchmarks by up to 0.16 ~ 0.46dB.

MO-YOLO: End-to-End Multiple-Object Tracking Method with YOLO and MOTR

  • paper_url: http://arxiv.org/abs/2310.17170
  • repo_url: https://github.com/liaopan-lp/MO-YOLO
  • paper_authors: Liao Pan, Yang Feng, Wu Di, Liu Bo, Zhang Xingle
  • for: Building an efficient, lightweight and computationally resource-friendly end-to-end multi-object tracking (MOT) model, named MO-YOLO.
  • methods: The strengths of YOLO and RT-DETR are combined into a single end-to-end tracking network, avoiding the separate detection and tracking stages of traditional MOT pipelines.
  • results: On MOT17, MO-YOLO reaches performance comparable to MOTR while requiring only one GeForce 2080 Ti GPU and 12 hours of training, whereas MOTR needs eight GeForce 2080 Ti GPUs and four days.
    Abstract This paper aims to address critical issues in the field of Multi-Object Tracking (MOT) by proposing an efficient and computationally resource-efficient end-to-end multi-object tracking model, named MO-YOLO. Traditional MOT methods typically involve two separate steps: object detection and object tracking, leading to computational complexity and error propagation issues. Recent research has demonstrated outstanding performance in end-to-end MOT models based on Transformer architectures, but they require substantial hardware support. MO-YOLO combines the strengths of YOLO and RT-DETR models to construct a high-efficiency, lightweight, and resource-efficient end-to-end multi-object tracking network, offering new opportunities in the multi-object tracking domain. On the MOT17 dataset, MOTR\cite{zeng2022motr} requires training with 8 GeForce 2080 Ti GPUs for 4 days to achieve satisfactory results, while MO-YOLO only requires 1 GeForce 2080 Ti GPU and 12 hours of training to achieve comparable performance.

Bridging Phylogeny and Taxonomy with Protein-protein Interaction Networks

  • paper_url: http://arxiv.org/abs/2310.17164
  • repo_url: None
  • paper_authors: Long-Huei Chen, Mohana Prasad Sathya Moorthy, Pratyaksh Sharma
  • for: This study aims to deepen our understanding of the tree of life and taxonomy by extracting information from protein-protein interaction (PPI) networks.
  • methods: The authors build (1) a predictor of PPI network statistics from known traits of existing species in the phylogeny, and (2) a taxonomic classifier of organisms that uses protein network statistics, whether experimentally determined or predicted de novo.
  • results: The two models successfully connect phylogeny and taxonomy, two fields with widely diverging methodologies, through protein interaction data.
    Abstract The protein-protein interaction (PPI) network provides an overview of the complex biological reactions vital to an organism's metabolism and survival. Even though in the past PPI network were compared across organisms in detail, there has not been large-scale research on how individual PPI networks reflect on the species relationships. In this study we aim to increase our understanding of the tree of life and taxonomy by gleaming information from the PPI networks. We successful created (1) a predictor of network statistics based on known traits of existing species in the phylogeny, and (2) a taxonomic classifier of organism using the known protein network statistics, whether experimentally determined or predicted de novo. With the knowledge of protein interactions at its core, our two models effectively connects two field with widely diverging methodologies - the phylogeny and taxonomy of species.

Low-Dimensional Gradient Helps Out-of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2310.17163
  • repo_url: None
  • paper_authors: Yingwen Wu, Tao Li, Xinwen Cheng, Jie Yang, Xiaolin Huang
  • for: This work studies out-of-distribution (OOD) detection for deep neural networks (DNNs), which is essential for their reliable deployment in real-world applications.
  • methods: The entire gradient, direction as well as norm, is used for OOD detection; the high-dimensional gradient is linearly reduced by projecting it onto a designated subspace of principal components, yielding a low-dimensional representation with minimal information loss that is then combined with existing detection score functions.
  • results: The approach outperforms existing detectors across a wide range of tasks; on the ImageNet benchmark it reduces the false positive rate at 95% recall (FPR95) by 11.15% on average compared with the current state of the art.
    Abstract Detecting out-of-distribution (OOD) samples is essential for ensuring the reliability of deep neural networks (DNNs) in real-world scenarios. While previous research has predominantly investigated the disparity between in-distribution (ID) and OOD data through forward information analysis, the discrepancy in parameter gradients during the backward process of DNNs has received insufficient attention. Existing studies on gradient disparities mainly focus on the utilization of gradient norms, neglecting the wealth of information embedded in gradient directions. To bridge this gap, in this paper, we conduct a comprehensive investigation into leveraging the entirety of gradient information for OOD detection. The primary challenge arises from the high dimensionality of gradients due to the large number of network parameters. To solve this problem, we propose performing linear dimension reduction on the gradient using a designated subspace that comprises principal components. This innovative technique enables us to obtain a low-dimensional representation of the gradient with minimal information loss. Subsequently, by integrating the reduced gradient with various existing detection score functions, our approach demonstrates superior performance across a wide range of detection tasks. For instance, on the ImageNet benchmark, our method achieves an average reduction of 11.15% in the false positive rate at 95% recall (FPR95) compared to the current state-of-the-art approach. The code would be released.
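A small sketch of the core idea described above: per-sample gradients are projected onto a low-dimensional principal subspace before being turned into a detection score. The gradient source, subspace size and the norm-based score below are illustrative assumptions rather than the paper's exact score function.

```python
import numpy as np

def fit_gradient_subspace(grad_matrix, k=64):
    """grad_matrix: (num_samples, num_params) flattened per-sample gradients from ID data."""
    mean = grad_matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(grad_matrix - mean, full_matrices=False)
    return mean, vt[:k]                       # top-k principal directions

def ood_score(grad, mean, components):
    """Project one flattened gradient and score it; larger scores suggest OOD in this sketch."""
    z = components @ (grad - mean)            # low-dimensional gradient representation
    return float(np.linalg.norm(z))

# Toy usage with random stand-ins for real gradients.
rng = np.random.default_rng(0)
id_grads = rng.normal(size=(512, 1000))
mean, comps = fit_gradient_subspace(id_grads, k=16)
print(ood_score(rng.normal(size=1000), mean, comps))
```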

Learning depth from monocular video sequences

  • paper_url: http://arxiv.org/abs/2310.17156
  • repo_url: None
  • paper_authors: Zhenwei Luo
  • for: This paper proposes a single-image depth estimation model learned from monocular video sequences, so that more images can be used for supervision during training.
  • methods: A new training loss allows additional images to supervise training, a simple yet effective model accounts for frame-to-frame pixel motion, and a novel network architecture is designed for single-image estimation.
  • results: Combined, these components produce state-of-the-art results for self-supervised monocular depth estimation on the KITTI dataset.
    Abstract Learning single image depth estimation model from monocular video sequence is a very challenging problem. In this paper, we propose a novel training loss which enables us to include more images for supervision during the training process. We propose a simple yet effective model to account the frame to frame pixel motion. We also design a novel network architecture for single image estimation. When combined, our method produces state of the art results for monocular depth estimation on the KITTI dataset in the self-supervised setting.

Deep Imbalanced Regression via Hierarchical Classification Adjustment

  • paper_url: http://arxiv.org/abs/2310.17154
  • repo_url: None
  • paper_authors: Haipeng Xiong, Angela Yao
  • for: This paper tackles imbalanced regression tasks by using hierarchical classifiers to improve regression performance over the entire target range.
  • methods: The regression target space is quantized into a hierarchy of fine-grained classifiers modulated by coarse predictions, and a range-preserving distillation process learns a single classifier from the set of hierarchical classifiers.
  • results: The proposed hierarchical classification adjustment (HCA) achieves superior results on three diverse tasks: age estimation, crowd counting and depth estimation.
    Abstract Regression tasks in computer vision, such as age estimation or counting, are often formulated into classification by quantizing the target space into classes. Yet real-world data is often imbalanced -- the majority of training samples lie in a head range of target values, while a minority of samples span a usually larger tail range. By selecting the class quantization, one can adjust imbalanced regression targets into balanced classification outputs, though there are trade-offs in balancing classification accuracy and quantization error. To improve regression performance over the entire range of data, we propose to construct hierarchical classifiers for solving imbalanced regression tasks. The fine-grained classifiers limit the quantization error while being modulated by the coarse predictions to ensure high accuracy. Standard hierarchical classification approaches, however, when applied to the regression problem, fail to ensure that predicted ranges remain consistent across the hierarchy. As such, we propose a range-preserving distillation process that can effectively learn a single classifier from the set of hierarchical classifiers. Our novel hierarchical classification adjustment (HCA) for imbalanced regression shows superior results on three diverse tasks: age estimation, crowd counting and depth estimation. We will release the source code upon acceptance.
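A short sketch of the target-space quantization behind hierarchical classification for regression: the same continuous target is binned at a coarse and a fine resolution, and a class prediction can be decoded back to a value from the fine bin. The bin counts and midpoint decoding are illustrative assumptions.

```python
import numpy as np

def quantize(y, y_min, y_max, num_bins):
    """Map continuous targets to class indices in [0, num_bins)."""
    idx = np.floor((y - y_min) / (y_max - y_min) * num_bins).astype(int)
    return np.clip(idx, 0, num_bins - 1)

def decode(idx, y_min, y_max, num_bins):
    """Return the midpoint of a bin as the regressed value."""
    width = (y_max - y_min) / num_bins
    return y_min + (idx + 0.5) * width

ages = np.array([3.0, 27.5, 71.2])
coarse = quantize(ages, 0, 100, 10)    # e.g. decade-level classes
fine = quantize(ages, 0, 100, 100)     # year-level classes, consistent with the coarse bins
print(coarse, fine, decode(fine, 0, 100, 100))
```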

Simple Baselines for Projection-based Full-reference and No-reference Point Cloud Quality Assessment

  • paper_url: http://arxiv.org/abs/2310.17147
  • repo_url: None
  • paper_authors: Zicheng Zhang, Yingjie Zhou, Wei Sun, Xiongkuo Min, Guangtao Zhai
  • for: This work targets efficient point cloud quality assessment (PCQA) for 3D content under storage and bandwidth constraints.
  • methods: Multiple projections are obtained from the point cloud via a common cube-like projection process and quality-aware features are extracted with popular vision backbones; the full-reference (FR) representation is the similarity between the feature maps of reference and distorted projections, while the no-reference (NR) representation is obtained by average-pooling the distorted projection features, with fully-connected layers regressing the final quality scores.
  • results: The method took first place in four of the five tracks of the ICIP 2023 PCVQA Challenge.
    Abstract Point clouds are widely used in 3D content representation and have various applications in multimedia. However, compression and simplification processes inevitably result in the loss of quality-aware information under storage and bandwidth constraints. Therefore, there is an increasing need for effective methods to quantify the degree of distortion in point clouds. In this paper, we propose simple baselines for projection-based point cloud quality assessment (PCQA) to tackle this challenge. We use multi-projections obtained via a common cube-like projection process from the point clouds for both full-reference (FR) and no-reference (NR) PCQA tasks. Quality-aware features are extracted with popular vision backbones. The FR quality representation is computed as the similarity between the feature maps of reference and distorted projections while the NR quality representation is obtained by simply squeezing the feature maps of distorted projections with average pooling The corresponding quality representations are regressed into visual quality scores by fully-connected layers. Taking part in the ICIP 2023 PCVQA Challenge, we succeeded in achieving the top spot in four out of the five competition tracks.
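A rough sketch of the FR and NR quality representations described in the abstract follows, assuming a ResNet18 backbone, cosine similarity as the FR measure, and six cube-face projections; these choices are assumptions for illustration, not the authors' exact configuration.

```python
# Sketch of projection-based FR/NR quality representations.
# Backbone, similarity measure, and head sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = nn.Sequential(*list(models.resnet18(weights=None).children())[:-2])  # feature maps, no pool/fc
backbone.eval()

def feats(projections):                 # projections: (P, 3, H, W), P cube-face views
    with torch.no_grad():
        return backbone(projections)    # (P, C, h, w)

def fr_representation(ref, dist):
    """FR: similarity between reference and distorted feature maps (cosine here)."""
    fr, fd = feats(ref).flatten(1), feats(dist).flatten(1)
    return nn.functional.cosine_similarity(fr, fd, dim=1)       # one similarity per projection

def nr_representation(dist):
    """NR: average-pooled feature maps of the distorted projections."""
    return feats(dist).mean(dim=(2, 3))                          # (P, C)

fr_head = nn.Linear(6, 1)       # 6 cube-face projections -> one quality score (assumed)
nr_head = nn.Linear(512, 1)     # ResNet18 channel dim -> per-projection score, then averaged

ref, dist = torch.rand(6, 3, 224, 224), torch.rand(6, 3, 224, 224)
print(fr_head(fr_representation(ref, dist).unsqueeze(0)))        # FR quality score
print(nr_head(nr_representation(dist)).mean())                   # NR quality score
```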

A Classifier Using Global Character Level and Local Sub-unit Level Features for Hindi Online Handwritten Character Recognition

  • paper_url: http://arxiv.org/abs/2310.17138
  • repo_url: None
  • paper_authors: Anand Sharma, A. G. Ramakrishnan
  • for: Develops a classifier for Hindi online handwritten characters that models the joint distribution of global character features, the number of sub-units, and local sub-unit features using latent variables.
  • methods: The classifier represents characters with histograms of points, orientations, and dynamics of orientations (HPOD) features at both the global character level and the local sub-unit level; its parameters are estimated by maximum likelihood. Performance is compared against classifiers and features used in previous studies.
  • results: The classifier achieves the highest accuracy of 93.5% on the testing set, outperforming classifiers trained on different features extracted from the same training set and evaluated on the same testing set.
    Abstract A classifier is developed that defines a joint distribution of global character features, number of sub-units and local sub-unit features to model Hindi online handwritten characters. The classifier uses latent variables to model the structure of sub-units. The classifier uses histograms of points, orientations, and dynamics of orientations (HPOD) features to represent characters at global character level and local sub-unit level and is independent of character stroke order and stroke direction variations. The parameters of the classifier are estimated using the maximum likelihood method. Different classifiers and features used in other studies are considered in this study for classification performance comparison with the developed classifier. The classifiers considered are Second Order Statistics (SOS), Sub-space (SS), Fisher Discriminant (FD), Feedforward Neural Network (FFN) and Support Vector Machines (SVM) and the features considered are Spatio Temporal (ST), Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), Discrete Wavelet Transform (DWT), Spatial (SP) and Histograms of Oriented Gradients (HOG). Hindi character datasets used for training and testing the developed classifier consist of samples of handwritten characters from 96 different character classes. There are 12832 samples with an average of 133 samples per character class in the training set and 2821 samples with an average of 29 samples per character class in the testing set. The developed classifier has the highest accuracy of 93.5% on the testing set compared to that of the classifiers trained on different features extracted from the same training set and evaluated on the same testing set considered in this study.
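As a loose illustration of histogram-style features for an online stroke, the sketch below computes a spatial histogram of points plus a histogram of segment orientations. It is a simplified stand-in for HPOD; the paper's exact feature definition (including dynamics of orientations and sub-unit structure) is not reproduced here.

```python
# Simplified histogram features for one online handwriting stroke / sub-unit.
# A stand-in for HPOD-style features, not the paper's exact definition.
import numpy as np

def stroke_histograms(points, grid=4, n_orient_bins=8):
    """points: (N, 2) array of (x, y) pen positions for one stroke."""
    pts = np.asarray(points, dtype=float)
    pts = (pts - pts.min(0)) / (np.ptp(pts, 0) + 1e-8)           # normalise to the unit square

    # histogram of points over a grid x grid spatial partition
    point_hist, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=grid, range=[[0, 1], [0, 1]])
    point_hist = point_hist.ravel() / len(pts)

    # histogram of orientations of consecutive point-to-point segments
    d = np.diff(pts, axis=0)
    angles = np.arctan2(d[:, 1], d[:, 0])                         # in (-pi, pi]
    orient_hist, _ = np.histogram(angles, bins=n_orient_bins, range=(-np.pi, np.pi))
    orient_hist = orient_hist / max(len(angles), 1)

    return np.concatenate([point_hist, orient_hist])              # fixed-length feature vector

stroke = np.cumsum(np.random.randn(50, 2), axis=0)                # toy pen trajectory
print(stroke_histograms(stroke).shape)                            # (4*4 + 8,) = (24,)
```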

Comparison of Cross-Entropy, Dice, and Focal Loss for Sea Ice Type Segmentation

  • paper_url: http://arxiv.org/abs/2310.17135
  • repo_url: None
  • paper_authors: Rafael Pires de Lima, Behzad Vahedi, Morteza Karimzadeh
  • for: This paper evaluates Convolutional Neural Network (CNN) models for generating sea ice charts, which are crucial for safer navigation in ice-infested waters.
  • methods: Three loss functions (cross-entropy, Dice, and Focal) are compared for training CNNs to predict the dominant sea ice type in Sentinel-1 images.
  • results: Although Dice and Focal loss produce higher metrics, results from cross-entropy appear generally more physically consistent.
    Abstract Up-to-date sea ice charts are crucial for safer navigation in ice-infested waters. Recently, Convolutional Neural Network (CNN) models have shown the potential to accelerate the generation of ice maps for large regions. However, results from CNN models still need to undergo scrutiny as higher metric scores do not always translate to adequate outputs. Sea ice type classes are imbalanced, requiring special treatment during training. We evaluate how three different loss functions, some developed for imbalanced class problems, affect the performance of CNN models trained to predict the dominant ice type in Sentinel-1 images. Despite the fact that Dice and Focal loss produce higher metrics, results from cross-entropy seem generally more physically consistent.
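For reference, common formulations of the three losses compared in the paper are sketched below for multi-class segmentation logits. The soft-Dice form and the focal-loss gamma are standard choices and may differ from the paper's exact implementation.

```python
# Cross-entropy vs. Dice vs. Focal loss for segmentation logits of shape (B, C, H, W).
# Standard formulations; hyper-parameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def cross_entropy_loss(logits, target):
    return F.cross_entropy(logits, target)

def dice_loss(logits, target, eps=1e-6):
    probs = logits.softmax(dim=1)
    onehot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    return 1.0 - ((2 * inter + eps) / (denom + eps)).mean()      # soft Dice, averaged over classes

def focal_loss(logits, target, gamma=2.0):
    ce = F.cross_entropy(logits, target, reduction="none")       # per-pixel cross-entropy
    pt = torch.exp(-ce)                                          # probability of the true class
    return ((1 - pt) ** gamma * ce).mean()                       # down-weight easy pixels

logits = torch.randn(2, 5, 64, 64)            # e.g., 5 ice-type classes
target = torch.randint(0, 5, (2, 64, 64))
for name, fn in [("CE", cross_entropy_loss), ("Dice", dice_loss), ("Focal", focal_loss)]:
    print(name, fn(logits, target).item())
```

Dice and Focal were designed to cope with class imbalance, which is why they are natural candidates for imbalanced ice-type classes even though the paper finds cross-entropy more physically consistent.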

Virtual Accessory Try-On via Keypoint Hallucination

  • paper_url: http://arxiv.org/abs/2310.17131
  • repo_url: None
  • paper_authors: Junhong Gou, Bo Zhang, Li Niu, Jianfu Zhang, Jianlou Si, Chen Qian, Liqing Zhang
  • for: Addresses virtual try-on, focusing on virtual accessory try-on, which fits accessories (e.g., glasses, ties) onto a face or portrait image.
  • methods: A background-oriented network exploits human body priors: it hallucinates the target locations of specified foreground keypoints in the background, injects foreground information with accessory priors into a background UNet, and computes warping parameters from the hallucinated locations to warp the accessory.
  • results: Experiments on the STRAT dataset validate the effectiveness of the proposed method.
    Abstract The virtual try-on task refers to fitting the clothes from one image onto another portrait image. In this paper, we focus on virtual accessory try-on, which fits accessory (e.g., glasses, ties) onto a face or portrait image. Unlike clothing try-on, which relies on human silhouette as guidance, accessory try-on warps the accessory into an appropriate location and shape to generate a plausible composite image. In contrast to previous try-on methods that treat foreground (i.e., accessories) and background (i.e., human faces or bodies) equally, we propose a background-oriented network to utilize the prior knowledge of human bodies and accessories. Specifically, our approach learns the human body priors and hallucinates the target locations of specified foreground keypoints in the background. Then our approach will inject foreground information with accessory priors into the background UNet. Based on the hallucinated target locations, the warping parameters are calculated to warp the foreground. Moreover, this background-oriented network can also easily incorporate auxiliary human face/body semantic segmentation supervision to further boost performance. Experiments conducted on STRAT dataset validate the effectiveness of our proposed method.
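The final warping step can be illustrated with a small OpenCV sketch: given source keypoints on the accessory and hallucinated target keypoints on the portrait, a similarity transform is estimated and the accessory is warped and alpha-composited. The keypoints, the transform model, and the compositing are assumptions for illustration, not the network-predicted warping parameters of the paper.

```python
# Illustrative warp of an accessory onto a portrait from source/target keypoints.
# Keypoints and the similarity-transform model are assumed, not the paper's method.
import numpy as np
import cv2

accessory = np.zeros((120, 300, 4), dtype=np.uint8)              # toy RGBA glasses image
src_kpts = np.float32([[20, 60], [150, 40], [280, 60]])          # left hinge, bridge, right hinge
tgt_kpts = np.float32([[210, 310], [300, 295], [390, 310]])      # hallucinated locations on the face

M, _ = cv2.estimateAffinePartial2D(src_kpts, tgt_kpts)           # 2x3 similarity transform
face = np.zeros((512, 512, 3), dtype=np.uint8)                   # toy portrait image
warped = cv2.warpAffine(accessory, M, (face.shape[1], face.shape[0]))

alpha = warped[..., 3:4].astype(np.float32) / 255.0              # composite using the alpha channel
composite = (alpha * warped[..., :3] + (1 - alpha) * face).astype(np.uint8)
print(composite.shape)
```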

Task-driven Prompt Evolution for Foundation Models

  • paper_url: http://arxiv.org/abs/2310.17128
  • repo_url: None
  • paper_authors: Rachana Sathish, Rahul Venkataramani, K S Shriram, Prasad Sudhakar
  • for: This work aims to improve the performance of foundation models such as the Segment Anything Model (SAM) on medical imaging modalities.
  • methods: A plug-and-play prompt optimization technique (SAMPOT) uses the downstream segmentation task to refine the human-provided prompt and improve the foundation model's output.
  • results: On lung segmentation in chest X-rays, the automatically optimized prompts improve over the human-provided initial prompts on a significant fraction of cases (roughly 75%).
    Abstract Promptable foundation models, particularly Segment Anything Model (SAM), have emerged as a promising alternative to the traditional task-specific supervised learning for image segmentation. However, many evaluation studies have found their performance on medical imaging modalities to be underwhelming compared to conventional deep learning methods. In the world of large pre-trained language and vision-language models, learning prompts from downstream tasks has achieved considerable success in improving performance. In this work, we propose a plug-and-play Prompt Optimization Technique for foundation models like SAM (SAMPOT) that utilizes the downstream segmentation task to optimize the human-provided prompt to obtain improved performance. We demonstrate the utility of SAMPOT on lung segmentation in chest X-ray images and obtain an improvement on a significant number of cases ($\sim75\%$) over human-provided initial prompts. We hope this work will lead to further investigations in the nascent field of automatic visual prompt-tuning.
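The sketch below illustrates the general idea of task-driven prompt tuning with a frozen promptable segmenter: candidate point prompts are scored by a downstream, task-specific objective and refined by local search. Both the toy `segment` function (standing in for SAM) and the size-prior score are invented surrogates, not SAMPOT's actual optimization.

```python
# Task-driven prompt search with a frozen promptable segmenter (toy surrogate).
# `segment` and `task_score` are invented stand-ins, not SAM or SAMPOT internals.
import numpy as np

H, W = 256, 256
yy, xx = np.mgrid[0:H, 0:W]
image = np.exp(-(((yy - 128) ** 2 + (xx - 96) ** 2) / (2 * 40.0 ** 2)))   # toy "lung" blob

def segment(img, point):
    """Toy promptable segmenter: returns a disk whose radius follows the image value at the prompt."""
    r = 10 + 60 * img[int(point[1]), int(point[0])]
    return ((yy - point[1]) ** 2 + (xx - point[0]) ** 2 <= r ** 2).astype(float)

def task_score(mask, expected_area_frac=0.08):
    """Task-driven surrogate objective: prefer masks close to an expected organ size."""
    return -abs(mask.mean() - expected_area_frac)

def optimize_prompt(img, init_point, step=8, iters=25):
    best = np.asarray(init_point, dtype=float)
    best_s = task_score(segment(img, best))
    for _ in range(iters):
        # local search over small perturbations of the current point prompt
        cands = np.clip(best + step * np.array([[1, 0], [-1, 0], [0, 1], [0, -1]]), 0, [W - 1, H - 1])
        scores = [task_score(segment(img, c)) for c in cands]
        i = int(np.argmax(scores))
        if scores[i] <= best_s:
            break
        best, best_s = cands[i], scores[i]
    return best, best_s

print(optimize_prompt(image, (40, 40)))
```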

Deep Learning on SAR Imagery: Transfer Learning Versus Randomly Initialized Weights

  • paper_url: http://arxiv.org/abs/2310.17126
  • repo_url: None
  • paper_authors: Morteza Karimzadeh, Rafael Pires de Lima
  • for: This work evaluates deep learning on Synthetic Aperture Radar (SAR) data, particularly for sea ice mapping in support of safe marine navigation.
  • methods: Deep learning models are trained either from scratch with randomly initialized weights or by fine-tuning a pre-trained model on SAR imagery.
  • results: Fine-tuned pre-trained models perform better on the test samples, especially those from the melt season.
    Abstract Deploying deep learning on Synthetic Aperture Radar (SAR) data is becoming more common for mapping purposes. One such case is sea ice, which is highly dynamic and rapidly changes as a result of the combined effect of wind, temperature, and ocean currents. Therefore, frequent mapping of sea ice is necessary to ensure safe marine navigation. However, there is a general shortage of expert-labeled data to train deep learning algorithms. Fine-tuning a pre-trained model on SAR imagery is a potential solution. In this paper, we compare the performance of deep learning models trained from scratch using randomly initialized weights against pre-trained models that we fine-tune for this purpose. Our results show that pre-trained models lead to better results, especially on test samples from the melt season.
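A minimal sketch of the two training setups being compared, using a torchvision ResNet18 as an example backbone; the single-channel SAR stem and the number of ice classes are assumptions, not the paper's configuration.

```python
# Randomly initialized vs. ImageNet-pretrained backbone for SAR ice classification.
# Backbone choice, single-channel stem, and class count are illustrative assumptions.
import torch.nn as nn
from torchvision import models

NUM_ICE_CLASSES = 4  # assumed number of ice-type classes

def build_model(pretrained: bool) -> nn.Module:
    weights = models.ResNet18_Weights.IMAGENET1K_V1 if pretrained else None
    model = models.resnet18(weights=weights)
    # SAR imagery is assumed single-channel here, so the RGB stem is replaced
    # (its pretrained weights are discarded); the classifier head is replaced too.
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    model.fc = nn.Linear(model.fc.in_features, NUM_ICE_CLASSES)
    return model

scratch_model = build_model(pretrained=False)     # randomly initialized weights
finetune_model = build_model(pretrained=True)     # fine-tuned from ImageNet weights
```

Both models are then trained with the same schedule, so any gap on the test scenes can be attributed to the initialization.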

Enhancing sea ice segmentation in Sentinel-1 images with atrous convolutions

  • paper_url: http://arxiv.org/abs/2310.17122
  • repo_url: None
  • paper_authors: Rafael Pires de Lima, Behzad Vahedi, Nick Hughes, Andrew P. Barrett, Walter Meier, Morteza Karimzadeh
  • for: This study uses machine learning to automate sea ice chart generation, currently a manual interpretation task needed for safe marine navigation.
  • methods: Using the high-resolution Extreme Earth version 2 benchmark dataset, a custom pipeline combining ResNets and Atrous Spatial Pyramid Pooling is developed to segment SAR images.
  • results: The pipeline is evaluated on both binary ice-water classification and multiclass sea ice type classification. For binary classification, weighted F1 scores exceed 0.95 on both the January and July test scenes, with a median weighted F1 of 0.98; by comparison, a baseline U-Net reaches weighted average F1 of 0.92-0.94 for July and 0.97-0.98 for January. For multiclass ice typing, weighted F1 improves by 2% over the baseline U-Net.
    Abstract Due to the growing volume of remote sensing data and the low latency required for safe marine navigation, machine learning (ML) algorithms are being developed to accelerate sea ice chart generation, currently a manual interpretation task. However, the low signal-to-noise ratio of the freely available Sentinel-1 Synthetic Aperture Radar (SAR) imagery, the ambiguity of backscatter signals for ice types, and the scarcity of open-source high-resolution labelled data make automating sea ice mapping challenging. We use Extreme Earth version 2, a high-resolution benchmark dataset generated for ML training and evaluation, to investigate the effectiveness of ML for automated sea ice mapping. Our customized pipeline combines ResNets and Atrous Spatial Pyramid Pooling for SAR image segmentation. We investigate the performance of our model for: i) binary classification of sea ice and open water in a segmentation framework; and ii) a multiclass segmentation of five sea ice types. For binary ice-water classification, models trained with our largest training set have weighted F1 scores all greater than 0.95 for January and July test scenes. Specifically, the median weighted F1 score was 0.98, indicating high performance for both months. By comparison, a competitive baseline U-Net has a weighted average F1 score ranging from 0.92 to 0.94 (median 0.93) for July, and 0.97 to 0.98 (median 0.97) for January. Multiclass ice type classification is more challenging, and even though our models achieve 2% improvement in weighted F1 average compared to the baseline U-Net, test weighted F1 is generally between 0.6 and 0.80. Our approach can efficiently segment full SAR scenes in one run, is faster than the baseline U-Net, retains spatial resolution and dimension, and is more robust against noise compared to approaches that rely on patch classification.
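A minimal ASPP head of the kind combined with a ResNet encoder in such pipelines is sketched below; the dilation rates, channel widths, and class count are typical DeepLab-style assumptions rather than the paper's exact settings.

```python
# Minimal Atrous Spatial Pyramid Pooling (ASPP) segmentation head.
# Dilation rates, widths, and class count are assumed, not the paper's configuration.
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18), num_classes=6):
        super().__init__()
        branches = [nn.Conv2d(in_ch, out_ch, 1, bias=False)]          # 1x1 branch
        branches += [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False)
                     for r in rates]                                   # parallel atrous branches
        self.branches = nn.ModuleList(branches)
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * len(branches), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, num_classes, 1))                         # per-pixel class logits

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

feats = torch.randn(1, 2048, 32, 32)        # e.g., ResNet50 stage-4 features of a SAR patch
print(ASPP()(feats).shape)                  # (1, 6, 32, 32): 5 ice types + open water (assumed)
```

The atrous branches enlarge the receptive field at several rates without downsampling, which is what lets the head keep spatial resolution while aggregating context.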

LP-OVOD: Open-Vocabulary Object Detection by Linear Probing

  • paper_url: http://arxiv.org/abs/2310.17109
  • repo_url: None
  • paper_authors: Chau Pham, Truong Vu, Khoi Nguyen
  • for: This paper addresses the challenging problem of open-vocabulary object detection (OVOD), where an object detector must identify both seen and unseen classes in test images without labeled examples of the unseen classes in training.
  • methods: The proposed method, LP-OVOD, discards low-quality boxes by training a sigmoid linear classifier on pseudo labels retrieved from the top region proposals most relevant to the novel text.
  • results: Experiments on COCO affirm the superior performance of LP-OVOD over the state of the art, achieving $\textbf{40.5}$ in $\text{AP}_{novel}$ using ResNet50 as the backbone, without external datasets or knowledge of novel classes during training.
    Abstract This paper addresses the challenging problem of open-vocabulary object detection (OVOD) where an object detector must identify both seen and unseen classes in test images without labeled examples of the unseen classes in training. A typical approach for OVOD is to use joint text-image embeddings of CLIP to assign box proposals to their closest text label. However, this method has a critical issue: many low-quality boxes, such as over- and under-covered-object boxes, have the same similarity score as high-quality boxes since CLIP is not trained on exact object location information. To address this issue, we propose a novel method, LP-OVOD, that discards low-quality boxes by training a sigmoid linear classifier on pseudo labels retrieved from the top relevant region proposals to the novel text. Experimental results on COCO affirm the superior performance of our approach over the state of the art, achieving $\textbf{40.5}$ in $\text{AP}_{novel}$ using ResNet50 as the backbone and without external datasets or knowing novel classes during training. Our code will be available at https://github.com/VinAIResearch/LP-OVOD.
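A sketch of the linear-probing step on pseudo labels follows, under the assumption that region-proposal features live in a CLIP-aligned embedding space; the top-k value, the treatment of negatives, and the training loop are illustrative, not the full LP-OVOD pipeline.

```python
# Linear probe for a novel class from text-retrieved pseudo labels (sketch).
# Embedding dims, top-k, and negative handling are illustrative assumptions.
import torch
import torch.nn as nn

D, N_PROPOSALS, TOP_K = 512, 1000, 50
proposal_feats = nn.functional.normalize(torch.randn(N_PROPOSALS, D), dim=1)  # CLIP-aligned RoI features (assumed)
text_embed = nn.functional.normalize(torch.randn(D), dim=0)                   # embedding of the novel class name

# Pseudo labels: the top-k proposals most similar to the novel text are positives, the rest negatives.
sim = proposal_feats @ text_embed
labels = torch.zeros(N_PROPOSALS)
labels[sim.topk(TOP_K).indices] = 1.0

probe = nn.Linear(D, 1)                          # sigmoid linear classifier for the novel class
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):
    logits = probe(proposal_feats).squeeze(1)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()

novel_scores = probe(proposal_feats).sigmoid()   # score any box feature for the novel class
print(loss.item(), novel_scores.max().item())
```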

  • paper_url: http://arxiv.org/abs/2310.17097
  • repo_url: None
  • paper_authors: Taehyeon Kim, Eric Lin, Junu Lee, Christian Lau, Vaikkunth Mugunthan
  • for: This work proposes a semi-supervised federated object detection (SSFOD) method for scenarios where labeled data reside only at the server while clients hold only unlabeled data.
  • methods: A two-stage strategy, FedSTO, combines selective training with orthogonally enhanced full-parameter training to handle the data shift (e.g., weather conditions) between server and clients; it selectively refines the detector backbone to avoid overfitting, applies orthogonality regularization to boost representation divergence, and uses local EMA-driven pseudo-label assignment to produce high-quality pseudo labels.
  • results: Extensive validation on prominent autonomous driving datasets (BDD100K, Cityscapes, and SODA10M) demonstrates the effectiveness of the approach; using only 20-30% of the labels, FedSTO performs nearly as well as fully-supervised centralized training.
    Abstract Federated Learning (FL) has emerged as a potent framework for training models across distributed data sources while maintaining data privacy. Nevertheless, it faces challenges with limited high-quality labels and non-IID client data, particularly in applications like autonomous driving. To address these hurdles, we navigate the uncharted waters of Semi-Supervised Federated Object Detection (SSFOD). We present a pioneering SSFOD framework, designed for scenarios where labeled data reside only at the server while clients possess unlabeled data. Notably, our method represents the inaugural implementation of SSFOD for clients with 0% labeled non-IID data, a stark contrast to previous studies that maintain some subset of labels at each client. We propose FedSTO, a two-stage strategy encompassing Selective Training followed by Orthogonally enhanced full-parameter training, to effectively address data shift (e.g. weather conditions) between server and clients. Our contributions include selectively refining the backbone of the detector to avert overfitting, orthogonality regularization to boost representation divergence, and local EMA-driven pseudo label assignment to yield high-quality pseudo labels. Extensive validation on prominent autonomous driving datasets (BDD100K, Cityscapes, and SODA10M) attests to the efficacy of our approach, demonstrating state-of-the-art results. Remarkably, FedSTO, using just 20-30% of labels, performs nearly as well as fully-supervised centralized training methods.
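A small sketch of the local EMA-driven pseudo-labelling component: a teacher copy of the detector is updated as an exponential moving average of the student, and only its confident detections are kept as pseudo labels for the unlabeled client data. The decay, threshold, and torchvision-style detection format are assumptions, and the selective training and orthogonality regularization stages are not shown.

```python
# EMA teacher update and confidence-filtered pseudo labels (sketch).
# Decay, threshold, and the detection dict format are illustrative assumptions.
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, decay: float = 0.999):
    """Teacher weights slowly track the student: w_t <- decay * w_t + (1 - decay) * w_s."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(decay).add_(ps, alpha=1.0 - decay)

@torch.no_grad()
def filter_pseudo_labels(detections, score_thresh=0.8):
    """Keep only confident teacher detections as pseudo labels.
    `detections` follows the torchvision detector convention: dicts with boxes/labels/scores."""
    kept = []
    for det in detections:
        keep = det["scores"] >= score_thresh
        kept.append({"boxes": det["boxes"][keep], "labels": det["labels"][keep]})
    return kept

# Toy demonstration with stand-in modules and detections
student, teacher = nn.Linear(4, 2), nn.Linear(4, 2)
teacher.load_state_dict(student.state_dict())      # teacher starts as a copy of the student
ema_update(teacher, student)

fake_dets = [{"boxes": torch.rand(5, 4), "labels": torch.randint(0, 3, (5,)),
              "scores": torch.tensor([0.95, 0.40, 0.85, 0.10, 0.70])}]
print(filter_pseudo_labels(fake_dets)[0]["boxes"].shape)    # (2, 4): two confident boxes kept
```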

Automating lichen monitoring in ecological studies using instance segmentation of time-lapse images

  • paper_url: http://arxiv.org/abs/2310.17080
  • repo_url: None
  • paper_authors: Safwen Naimi, Olfa Koubaa, Wassim Bouachir, Guillaume-Alexandre Bilodeau, Gregory Jeddore, Patricia Baines, David Correia, Andre Arsenault
  • for: assist ecologists in monitoring and analyzing epiphytic lichens
  • methods: use time-lapse cameras and semantic segmentation with an effective training approach to automate monitoring and biomass estimation of epiphytic lichens
  • results: significantly improve the accuracy and efficiency of lichen population monitoring, making it a valuable tool for forest ecologists and environmental scientists to evaluate the impact of climate change on Canada’s forests
    Abstract Lichens are symbiotic organisms composed of fungi, algae, and/or cyanobacteria that thrive in a variety of environments. They play important roles in carbon and nitrogen cycling, and contribute directly and indirectly to biodiversity. Ecologists typically monitor lichens by using them as indicators to assess air quality and habitat conditions. In particular, epiphytic lichens, which live on trees, are key markers of air quality and environmental health. A new method of monitoring epiphytic lichens involves using time-lapse cameras to gather images of lichen populations. These cameras are used by ecologists in Newfoundland and Labrador to subsequently analyze and manually segment the images to determine lichen thalli condition and change. These methods are time-consuming and susceptible to observer bias. In this work, we aim to automate the monitoring of lichens over extended periods and to estimate their biomass and condition to facilitate the task of ecologists. To accomplish this, our proposed framework uses semantic segmentation with an effective training approach to automate monitoring and biomass estimation of epiphytic lichens on time-lapse images. We show that our method has the potential to significantly improve the accuracy and efficiency of lichen population monitoring, making it a valuable tool for forest ecologists and environmental scientists to evaluate the impact of climate change on Canada's forests. To the best of our knowledge, this is the first time that such an approach has been used to assist ecologists in monitoring and analyzing epiphytic lichens.
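Once each time-lapse frame has a lichen mask, the downstream measurement is straightforward; the snippet below tracks per-frame coverage as a simple biomass proxy. The mask source and the coverage-as-proxy assumption are illustrative stand-ins for the paper's estimation procedure.

```python
# Per-frame lichen coverage from segmentation masks as a simple biomass proxy (sketch).
import numpy as np

def lichen_coverage(mask: np.ndarray) -> float:
    """mask: (H, W) binary array, 1 = lichen pixel. Returns the covered fraction of the frame."""
    return float(mask.mean())

# Toy time-lapse: coverage slowly increasing over 10 frames
rng = np.random.default_rng(0)
frames = [(rng.random((240, 320)) < 0.10 + 0.01 * t).astype(np.uint8) for t in range(10)]
series = [lichen_coverage(m) for m in frames]
print([round(c, 3) for c in series])
```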

HCT: Hybrid Convnet-Transformer for Parkinson’s disease detection and severity prediction from gait

  • paper_url: http://arxiv.org/abs/2310.17078
  • repo_url: https://github.com/safwennaimi/hct-hybrid-convnet-transformer-for-parkinson-s-disease-detection-and-severity-prediction-from-gait
  • paper_authors: Safwen Naimi, Wassim Bouachir, Guillaume-Alexandre Bilodeau
  • for: This work proposes a deep learning method based on a new Hybrid ConvNet-Transformer architecture to detect Parkinson's disease (PD) and predict its severity stage from gait data.
  • methods: A two-step approach splits the problem into two sub-problems: a Hybrid ConvNet-Transformer first distinguishes healthy from parkinsonian patients; for parkinsonian patients, a multi-class Hybrid ConvNet-Transformer then determines the Hoehn and Yahr (H&Y) score. ConvNets capture local patterns and correlations in the gait signal while Transformers handle its long-term dependencies.
  • results: The hybrid method outperforms other state-of-the-art methods, with a PD detection accuracy of 97% and a severity staging accuracy of 87%.
    Abstract In this paper, we propose a novel deep learning method based on a new Hybrid ConvNet-Transformer architecture to detect and stage Parkinson's disease (PD) from gait data. We adopt a two-step approach by dividing the problem into two sub-problems. Our Hybrid ConvNet-Transformer model first distinguishes healthy versus parkinsonian patients. If the patient is parkinsonian, a multi-class Hybrid ConvNet-Transformer model determines the Hoehn and Yahr (H&Y) score to assess the PD severity stage. Our hybrid architecture exploits the strengths of both Convolutional Neural Networks (ConvNets) and Transformers to accurately detect PD and determine the severity stage. In particular, we take advantage of ConvNets to capture local patterns and correlations in the data, while we exploit Transformers for handling long-term dependencies in the input signal. We show that our hybrid method achieves superior performance when compared to other state-of-the-art methods, with a PD detection accuracy of 97% and a severity staging accuracy of 87%. Our source code is available at: https://github.com/SafwenNaimi
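The two-step decision flow reads naturally as a small cascade, sketched below with placeholder classifiers standing in for the Hybrid ConvNet-Transformer; the input shape and the number of H&Y stages are assumptions.

```python
# Two-step cascade: binary PD screening, then H&Y severity staging (sketch).
# The placeholder MLPs, input shape, and stage count are illustrative assumptions.
import torch
import torch.nn as nn

N_HY_STAGES = 5                                 # assumed number of H&Y severity classes
binary_net = nn.Sequential(nn.Flatten(), nn.Linear(100 * 18, 64), nn.ReLU(), nn.Linear(64, 2))
stage_net = nn.Sequential(nn.Flatten(), nn.Linear(100 * 18, 64), nn.ReLU(), nn.Linear(64, N_HY_STAGES))

@torch.no_grad()
def assess(gait_window):                        # e.g., (1, 100, 18): 100 timesteps x 18 force sensors
    is_pd = binary_net(gait_window).argmax(dim=1).item() == 1
    if not is_pd:
        return {"parkinsonian": False, "hoehn_yahr": None}
    stage = stage_net(gait_window).argmax(dim=1).item()   # only parkinsonian cases are staged
    return {"parkinsonian": True, "hoehn_yahr": stage}

print(assess(torch.randn(1, 100, 18)))
```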

HyperFields: Towards Zero-Shot Generation of NeRFs from Text

  • paper_url: http://arxiv.org/abs/2310.17075
  • repo_url: None
  • paper_authors: Sudarshan Babu, Richard Liu, Avery Zhou, Michael Maire, Greg Shakhnarovich, Rana Hanocka
  • for: This paper proposes a method for generating text-conditioned NeRFs in a single forward pass, with optional fine-tuning to adapt to new scenes.
  • methods: A dynamic hypernetwork learns a smooth mapping from text token embeddings to the space of NeRFs, and NeRF distillation training distills scenes encoded in individual NeRFs into the single hypernetwork.
  • results: A single network fits over a hundred unique scenes and can predict novel in-distribution and out-of-distribution scenes either zero-shot or with a few fine-tuning steps; fine-tuning converges faster thanks to the learned general map, synthesizing novel scenes 5 to 10 times faster than existing neural optimization-based methods.
    Abstract We introduce HyperFields, a method for generating text-conditioned Neural Radiance Fields (NeRFs) with a single forward pass and (optionally) some fine-tuning. Key to our approach are: (i) a dynamic hypernetwork, which learns a smooth mapping from text token embeddings to the space of NeRFs; (ii) NeRF distillation training, which distills scenes encoded in individual NeRFs into one dynamic hypernetwork. These techniques enable a single network to fit over a hundred unique scenes. We further demonstrate that HyperFields learns a more general map between text and NeRFs, and consequently is capable of predicting novel in-distribution and out-of-distribution scenes -- either zero-shot or with a few finetuning steps. Finetuning HyperFields benefits from accelerated convergence thanks to the learned general map, and is capable of synthesizing novel scenes 5 to 10 times faster than existing neural optimization-based methods. Our ablation experiments show that both the dynamic architecture and NeRF distillation are critical to the expressivity of HyperFields.
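A toy illustration of the core hypernetwork idea: one network emits the weights of a small target MLP (standing in for a NeRF) conditioned on a text embedding, so a scene-specific network is obtained in a single forward pass. The sizes and the static (non-dynamic) conditioning are simplifications; HyperFields' dynamic hypernetwork and NeRF distillation are not reproduced here.

```python
# Hypernetwork that emits the weights of a small target MLP from a text embedding (toy sketch).
# Sizes and the static conditioning are simplifying assumptions.
import torch
import torch.nn as nn

TEXT_DIM, HIDDEN, IN_DIM, OUT_DIM = 512, 32, 3, 4   # text emb, target width, (x,y,z) in, RGB+density out

class HyperNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.n_w1, self.n_w2 = HIDDEN * IN_DIM, OUT_DIM * HIDDEN
        self.mlp = nn.Sequential(nn.Linear(TEXT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, self.n_w1 + self.n_w2))

    def forward(self, text_emb, points):
        w = self.mlp(text_emb)                        # all target-MLP weights in one vector
        w1 = w[: self.n_w1].view(HIDDEN, IN_DIM)
        w2 = w[self.n_w1:].view(OUT_DIM, HIDDEN)
        h = torch.relu(points @ w1.t())               # generated first layer
        return h @ w2.t()                             # generated output layer: colour + density

hyper = HyperNet()
text_emb = torch.randn(TEXT_DIM)                      # e.g., pooled text-token embeddings (assumed)
points = torch.rand(1024, 3)                          # 3D sample points along camera rays
print(hyper(text_emb, points).shape)                  # (1024, 4)
```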