cs.CV - 2023-07-14

Combining multitemporal optical and SAR data for LAI imputation with BiLSTM network

  • paper_url: http://arxiv.org/abs/2307.07434
  • repo_url: None
  • paper_authors: W. Zhao, F. Yin, H. Ma, Q. Wu, J. Gomez-Dans, P. Lewis
  • for: Imputing the Leaf Area Index (LAI), a key input for winter wheat yield prediction.
  • methods: Time-series Sentinel-1 VH/VV data are fed to a bidirectional LSTM (BiLSTM) network to impute LAI (a minimal sketch follows this entry).
  • results: Experiments show that BiLSTM outperforms traditional regression methods and captures the nonlinear dynamics between multiple time series. It is robust across growing conditions, works even with limited Sentinel-2 imagery, and performs particularly well during the senescence period. BiLSTM can therefore impute LAI from time-series Sentinel-1 VH/VV and Sentinel-2 data, and the approach could carry over to other time-series imputation problems.
    Abstract The Leaf Area Index (LAI) is vital for predicting winter wheat yield. Acquisition of crop conditions via Sentinel-2 remote sensing images can be hindered by persistent clouds, affecting yield predictions. Synthetic Aperture Radar (SAR) provides all-weather imagery, and the ratio between its cross- and co-polarized channels (C-band) shows a high correlation with time series LAI over winter wheat regions. This study evaluates the use of time series Sentinel-1 VH/VV for LAI imputation, aiming to increase spatial-temporal density. We utilize a bidirectional LSTM (BiLSTM) network to impute time series LAI and use half mean squared error for each time step as the loss function. We trained models on data from southern Germany and the North China Plain using only LAI data generated by Sentinel-1 VH/VV and Sentinel-2. Experimental results show BiLSTM outperforms traditional regression methods, capturing nonlinear dynamics between multiple time series. It proves robust in various growing conditions and is effective even with limited Sentinel-2 images. BiLSTM's performance surpasses that of LSTM, particularly over the senescence period. Therefore, BiLSTM can be used to impute LAI with time-series Sentinel-1 VH/VV and Sentinel-2 data, and this method could be applied to other time-series imputation issues.
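A minimal PyTorch sketch of the setup described above: a BiLSTM maps a Sentinel-1 VH/VV time series to a per-time-step LAI estimate and is trained with half mean squared error at each step. The layer sizes and the `sar_ratio`/`lai_target` tensors are illustrative stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BiLSTMImputer(nn.Module):
    """Map a time series of SAR features to per-time-step LAI estimates."""
    def __init__(self, in_dim=1, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)   # 2*hidden: forward + backward states

    def forward(self, x):                      # x: (batch, time, in_dim)
        h, _ = self.lstm(x)
        return self.head(h).squeeze(-1)        # (batch, time)

def half_mse(pred, target):
    """Half mean squared error, averaged over batch and time steps."""
    return 0.5 * ((pred - target) ** 2).mean()

# Illustrative usage with random stand-ins for Sentinel-1 VH/VV ratios and LAI labels.
model = BiLSTMImputer()
sar_ratio = torch.randn(8, 30, 1)     # 8 fields, 30 acquisition dates
lai_target = torch.rand(8, 30) * 6    # LAI roughly in [0, 6]
loss = half_mse(model(sar_ratio), lai_target)
loss.backward()
```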

Transient Neural Radiance Fields for Lidar View Synthesis and 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2307.09555
  • repo_url: None
  • paper_authors: Anagh Malik, Parsa Mirdehghan, Sotiris Nousias, Kiriakos N. Kutulakos, David B. Lindell
  • for: Modeling scene appearance and geometry from multiview imagery with NeRFs while using raw lidar measurements as additional supervision.
  • methods: A time-resolved version of the volume rendering equation renders the raw, time-resolved photon count histograms measured by a single-photon lidar, capturing transient light transport phenomena at picosecond timescales.
  • results: Evaluated on a first-of-its-kind dataset of simulated and captured transient multiview scans, the approach recovers better geometry and conventional appearance than point-cloud-based supervision when trained on few input viewpoints; it is potentially useful for autonomous driving, robotics, and remote sensing.
    Abstract Neural radiance fields (NeRFs) have become a ubiquitous tool for modeling scene appearance and geometry from multiview imagery. Recent work has also begun to explore how to use additional supervision from lidar or depth sensor measurements in the NeRF framework. However, previous lidar-supervised NeRFs focus on rendering conventional camera imagery and use lidar-derived point cloud data as auxiliary supervision; thus, they fail to incorporate the underlying image formation model of the lidar. Here, we propose a novel method for rendering transient NeRFs that take as input the raw, time-resolved photon count histograms measured by a single-photon lidar system, and we seek to render such histograms from novel views. Different from conventional NeRFs, the approach relies on a time-resolved version of the volume rendering equation to render the lidar measurements and capture transient light transport phenomena at picosecond timescales. We evaluate our method on a first-of-its-kind dataset of simulated and captured transient multiview scans from a prototype single-photon lidar. Overall, our work brings NeRFs to a new dimension of imaging at transient timescales, newly enabling rendering of transient imagery from novel views. Additionally, we show that our approach recovers improved geometry and conventional appearance compared to point cloud-based supervision when training on few input viewpoints. Transient NeRFs may be especially useful for applications which seek to simulate raw lidar measurements for downstream tasks in autonomous driving, robotics, and remote sensing.

Improving Zero-Shot Generalization for CLIP with Synthesized Prompts

  • paper_url: http://arxiv.org/abs/2307.07397
  • repo_url: https://github.com/mrflogs/SHIP
  • paper_authors: Zhengbo Wang, Jian Liang, Ran He, Nan Xu, Zilei Wang, Tieniu Tan
  • for: Improving existing fine-tuning methods for CLIP so they cope with the long tail and Zipf's law in real-world applications, where some classes have no labeled data at all.
  • methods: A plug-and-play generative approach called SyntHesIzed Prompts (SHIP): following variational autoencoders, a generator reconstructs visual features by feeding synthesized prompts together with class names to CLIP's text encoder (a toy sketch follows this entry).
  • results: Extensive experiments on base-to-new generalization, cross-dataset transfer learning, and generalized zero-shot learning demonstrate the superiority of the approach.
    Abstract With the growing interest in pretrained vision-language models like CLIP, recent research has focused on adapting these models to downstream tasks. Despite achieving promising results, most existing methods require labeled data for all classes, which may not hold in real-world applications due to the long tail and Zipf's law. For example, some classes may lack labeled data entirely, such as emerging concepts. To address this problem, we propose a plug-and-play generative approach called \textbf{S}ynt\textbf{H}es\textbf{I}zed \textbf{P}rompts~(\textbf{SHIP}) to improve existing fine-tuning methods. Specifically, we follow variational autoencoders to introduce a generator that reconstructs the visual features by inputting the synthesized prompts and the corresponding class names to the textual encoder of CLIP. In this manner, we easily obtain the synthesized features for the remaining label-only classes. Thereafter, we fine-tune CLIP with off-the-shelf methods by combining labeled and synthesized features. Extensive experiments on base-to-new generalization, cross-dataset transfer learning, and generalized zero-shot learning demonstrate the superiority of our approach. The code is available at \url{https://github.com/mrflogs/SHIP}.
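A hedged sketch of the generative idea: a VAE-style generator produces soft prompt vectors from a latent code and a class embedding; the prompts plus the class-name token embeddings go through a (frozen) text encoder, and the output serves as a synthesized visual feature for label-only classes. The `text_encoder` callable, dimensions, and module names are placeholders, not CLIP's actual interface or the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PromptGenerator(nn.Module):
    """VAE-style generator: latent code + class embedding -> soft prompt vectors."""
    def __init__(self, n_classes, embed_dim=512, n_prompt_tokens=4, latent_dim=64):
        super().__init__()
        self.class_embed = nn.Embedding(n_classes, embed_dim)
        self.decode = nn.Sequential(
            nn.Linear(latent_dim + embed_dim, 512), nn.ReLU(),
            nn.Linear(512, n_prompt_tokens * embed_dim),
        )
        self.n_prompt_tokens, self.embed_dim, self.latent_dim = n_prompt_tokens, embed_dim, latent_dim

    def forward(self, class_ids, z):
        cond = torch.cat([z, self.class_embed(class_ids)], dim=-1)
        return self.decode(cond).view(-1, self.n_prompt_tokens, self.embed_dim)

def synthesize_features(text_encoder, generator, class_ids, class_name_tokens):
    """Prepend synthesized prompts to class-name token embeddings and encode them.

    `text_encoder` stands in for a frozen text encoder that accepts token
    embeddings; its output is used as a surrogate visual feature for label-only
    classes and mixed with real labeled features during fine-tuning.
    """
    z = torch.randn(class_ids.shape[0], generator.latent_dim)   # sample latent code
    prompts = generator(class_ids, z)                           # (B, n_prompt, embed_dim)
    tokens = torch.cat([prompts, class_name_tokens], dim=1)
    return text_encoder(tokens)
```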

L-DAWA: Layer-wise Divergence Aware Weight Aggregation in Federated Self-Supervised Visual Representation Learning

  • paper_url: http://arxiv.org/abs/2307.07393
  • repo_url: None
  • paper_authors: Yasar Abbas Ur Rehman, Yan Gao, Pedro Porto Buarque de Gusmão, Mina Alibeigi, Jiajun Shen, Nicholas D. Lane
  • for: Combining self-supervised learning and federated learning on edge devices to obtain data-privacy guarantees while improving the quality and robustness of the learned visual representations.
  • methods: A new aggregation strategy, Layer-wise Divergence Aware Weight Aggregation (L-DAWA), mitigates client bias and divergence during FL aggregation by weighting each layer according to the angular divergence between the client's model and the global model (a minimal sketch follows this entry).
  • results: Extensive experiments in cross-silo and cross-device settings on CIFAR-10/100 and Tiny ImageNet achieve new SOTA performance for both contrastive and non-contrastive SSL approaches.
    Abstract The ubiquity of camera-enabled devices has led to large amounts of unlabeled image data being produced at the edge. The integration of self-supervised learning (SSL) and federated learning (FL) into one coherent system can potentially offer data privacy guarantees while also advancing the quality and robustness of the learned visual representations without needing to move data around. However, client bias and divergence during FL aggregation caused by data heterogeneity limits the performance of learned visual representations on downstream tasks. In this paper, we propose a new aggregation strategy termed Layer-wise Divergence Aware Weight Aggregation (L-DAWA) to mitigate the influence of client bias and divergence during FL aggregation. The proposed method aggregates weights at the layer-level according to the measure of angular divergence between the clients' model and the global model. Extensive experiments with cross-silo and cross-device settings on CIFAR-10/100 and Tiny ImageNet datasets demonstrate that our methods are effective and obtain new SOTA performance on both contrastive and non-contrastive SSL approaches.
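A minimal sketch of layer-wise, divergence-aware aggregation, assuming the angular divergence is measured via cosine similarity between each client layer and the corresponding global layer; the exact weighting rule in L-DAWA may differ.

```python
import torch

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> float:
    """Cosine similarity between two flattened weight tensors."""
    a, b = a.flatten(), b.flatten()
    return float(torch.dot(a, b) / (a.norm() * b.norm() + 1e-12))

def layerwise_divergence_aware_aggregate(global_state, client_states):
    """Aggregate client weights layer by layer, down-weighting divergent clients.

    global_state / client_states: dicts mapping layer name -> weight tensor.
    Assumption: per-layer weights proportional to cosine similarity with the
    global layer (a stand-in for the paper's angular-divergence measure).
    """
    new_state = {}
    for name, g in global_state.items():
        sims = torch.tensor([max(cosine_sim(c[name], g), 0.0) for c in client_states])
        weights = sims / (sims.sum() + 1e-12)            # normalize per layer
        stacked = torch.stack([c[name] for c in client_states])
        new_state[name] = (weights.view(-1, *[1] * g.dim()) * stacked).sum(dim=0)
    return new_state
```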

Defect Classification in Additive Manufacturing Using CNN-Based Vision Processing

  • paper_url: http://arxiv.org/abs/2307.07378
  • repo_url: None
  • paper_authors: Xiao Liu, Alessandra Mileo, Alan F. Smeaton
  • for: Using computer vision and in-situ monitoring with visual sensors to collect large datasets from the additive manufacturing (AM) process, and applying machine learning to improve AM quality.
  • methods: Convolutional neural networks (CNNs) classify defects in AM image data, and active learning is applied to the developed classification model (a toy active-learning loop follows this entry).
  • results: A human-in-the-loop mechanism is built into the classification pipeline, reducing the amount of data required to train the model and to generate training data.
    Abstract The development of computer vision and in-situ monitoring using visual sensors allows the collection of large datasets from the additive manufacturing (AM) process. Such datasets could be used with machine learning techniques to improve the quality of AM. This paper examines two scenarios: first, using convolutional neural networks (CNNs) to accurately classify defects in an image dataset from AM and second, applying active learning techniques to the developed classification model. This allows the construction of a human-in-the-loop mechanism to reduce the size of the data required to train and generate training data.
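A toy pool-based active-learning loop of the kind described above: train on a small labeled set, send the most uncertain unlabeled images to a human annotator, and repeat. The `train`, `predict_proba`, and `ask_human` callables are placeholders for the paper's CNN and labeling interface, not its actual code.

```python
import numpy as np

def entropy(probs):
    """Prediction entropy per sample; higher means more uncertain."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def active_learning_loop(train, predict_proba, ask_human,
                         labeled_x, labeled_y, pool_x, rounds=5, query_size=16):
    """Generic uncertainty-sampling loop with a human in the loop (placeholders)."""
    for _ in range(rounds):
        model = train(labeled_x, labeled_y)
        probs = predict_proba(model, pool_x)                  # (n_pool, n_classes)
        query = np.argsort(-entropy(probs))[:query_size]      # most uncertain samples
        new_y = ask_human(pool_x[query])                      # human-provided labels
        labeled_x = np.concatenate([labeled_x, pool_x[query]])
        labeled_y = np.concatenate([labeled_y, new_y])
        pool_x = np.delete(pool_x, query, axis=0)
    return train(labeled_x, labeled_y)
```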

ConTrack: Contextual Transformer for Device Tracking in X-ray

  • paper_url: http://arxiv.org/abs/2307.07541
  • repo_url: None
  • paper_authors: Marc Demoustier, Yue Zhang, Venkatesh Narasimha Murthy, Florin C. Ghesu, Dorin Comaniciu
  • for: Device tracking during endovascular procedures, in particular detecting and tracking the catheter tip in 2D fluoroscopic images during cardiac interventions.
  • methods: ConTrack, a transformer-based network that uses both spatial and temporal contextual information (multiple template frames, a segmentation module, and flow computed on the segmented catheter mask) to detect and track the device in X-ray fluoroscopy and angiography.
  • results: Experiments show 45% or higher accuracy in detection and tracking compared with state-of-the-art tracking models.
    Abstract Device tracking is an important prerequisite for guidance during endovascular procedures. Especially during cardiac interventions, detection and tracking of guiding the catheter tip in 2D fluoroscopic images is important for applications such as mapping vessels from angiography (high dose with contrast) to fluoroscopy (low dose without contrast). Tracking the catheter tip poses different challenges: the tip can be occluded by contrast during angiography or interventional devices; and it is always in continuous movement due to the cardiac and respiratory motions. To overcome these challenges, we propose ConTrack, a transformer-based network that uses both spatial and temporal contextual information for accurate device detection and tracking in both X-ray fluoroscopy and angiography. The spatial information comes from the template frames and the segmentation module: the template frames define the surroundings of the device, whereas the segmentation module detects the entire device to bring more context for the tip prediction. Using multiple templates makes the model more robust to the change in appearance of the device when it is occluded by the contrast agent. The flow information computed on the segmented catheter mask between the current and the previous frame helps in further refining the prediction by compensating for the respiratory and cardiac motions. The experiments show that our method achieves 45% or higher accuracy in detection and tracking when compared to state-of-the-art tracking models.

Flow-Guided Controllable Line Drawing Generation

  • paper_url: http://arxiv.org/abs/2307.07540
  • repo_url: None
  • paper_authors: Chengyu Fang, Xianfeng Han
  • for: Automatically generating controllable artistic character line drawings from photographs, via a Vector Flow Aware and Line Controllable Image-to-Image Translation architecture.
  • methods: An Image-to-Flow network (I2FNet) efficiently and robustly produces a vector flow field that guides line drawing; a Double Flow Generator (DFG) fuses features from the learned vector flow and the input image flow to keep lines spatially coherent; a Line Control Matrix (LCM) with a Line Control Regressor (LCR) controls the thickness, smoothness, and continuity of the drawn lines; and a Fourier Transformation Loss further constrains generation from the frequency-domain view.
  • results: The method produces high-resolution character line drawings with perceptually realistic characteristics and controllable line styles, outperforming other methods both quantitatively and qualitatively.
    Abstract In this paper, we investigate the problem of automatically controllable artistic character line drawing generation from photographs by proposing a Vector Flow Aware and Line Controllable Image-to-Image Translation architecture, which can be viewed as an appealing intersection between Artificial Intelligence and Arts. Specifically, we first present an Image-to-Flow network (I2FNet) to efficiently and robustly create the vector flow field in a learning-based manner, which can provide a direction guide for drawing lines. Then, we introduce our well-designed Double Flow Generator (DFG) framework to fuse features from learned vector flow and input image flow guaranteeing the spatial coherence of lines. Meanwhile, in order to allow for controllable character line drawing generation, we integrate a Line Control Matrix (LCM) into DFG and train a Line Control Regressor (LCR) to synthesize drawings with different styles by elaborately controlling the level of details, such as thickness, smoothness, and continuity, of lines. Finally, we design a Fourier Transformation Loss to further constrain the character line generation from the frequency domain view of the point. Quantitative and qualitative experiments demonstrate that our approach can obtain superior performance in producing high-resolution character line-drawing images with perceptually realistic characteristics.

LEST: Large-scale LiDAR Semantic Segmentation with Transformer

  • paper_url: http://arxiv.org/abs/2307.09367
  • repo_url: None
  • paper_authors: Chuanyu Luo, Nuo Cheng, Sikun Ma, Han Li, Xiaohan Li, Shengguang Lei, Pu Li
  • for: Large-scale LiDAR point cloud semantic segmentation, a critical task in autonomous driving perception.
  • methods: A pure-Transformer LiDAR semantic segmentation architecture (LEST) with two novel components: a Space Filling Curve (SFC) grouping strategy and a Distance-based Cosine Linear Transformer (DISCO); a toy SFC grouping sketch follows this entry.
  • results: On the public nuScenes semantic segmentation validation set and the SemanticKITTI test set, the model outperforms all other state-of-the-art methods.
    Abstract Large-scale LiDAR-based point cloud semantic segmentation is a critical task in autonomous driving perception. Almost all of the previous state-of-the-art LiDAR semantic segmentation methods are variants of sparse 3D convolution. Although the Transformer architecture is becoming popular in the field of natural language processing and 2D computer vision, its application to large-scale point cloud semantic segmentation is still limited. In this paper, we propose a LiDAR sEmantic Segmentation architecture with pure Transformer, LEST. LEST comprises two novel components: a Space Filling Curve (SFC) Grouping strategy and a Distance-based Cosine Linear Transformer, DISCO. On the public nuScenes semantic segmentation validation set and SemanticKITTI test set, our model outperforms all the other state-of-the-art methods.
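A toy sketch of grouping points along a space filling curve, here a Morton (Z-order) curve over voxelized coordinates; the paper does not specify this exact curve or implementation, so treat it as one possible realization of SFC grouping.

```python
import numpy as np

def part1by2(x):
    """Spread the bits of a 10-bit integer so they occupy every third bit."""
    x = x.astype(np.int64) & 0x3FF
    x = (x | (x << 16)) & 0x030000FF
    x = (x | (x << 8)) & 0x0300F00F
    x = (x | (x << 4)) & 0x030C30C3
    x = (x | (x << 2)) & 0x09249249
    return x

def morton_code(ix, iy, iz):
    """Interleave three 10-bit voxel indices into a Z-order (Morton) code."""
    return part1by2(ix) | (part1by2(iy) << 1) | (part1by2(iz) << 2)

def sfc_group(points, voxel_size=0.5, group_size=64):
    """Sort points along the curve and cut the ordering into fixed-size groups."""
    idx = np.floor((points - points.min(axis=0)) / voxel_size).astype(np.int64)
    codes = morton_code(idx[:, 0], idx[:, 1], idx[:, 2])
    order = np.argsort(codes)
    return [order[i:i + group_size] for i in range(0, len(order), group_size)]

# Example: group a random point cloud into attention windows of 64 points each.
groups = sfc_group(np.random.rand(1000, 3) * 50.0)
```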

PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting

  • paper_url: http://arxiv.org/abs/2307.07341
  • repo_url: None
  • paper_authors: Zixin Guo, Tzu-Jui Julius Wang, Selen Pehlivan, Abduljalil Radman, Jorma Laaksonen
  • for: Developing a vision-language pre-training method with even weaker supervision (W-VLP), avoiding the tedious and costly curation of image-text pairs and object-level annotations.
  • methods: Prompts-in-The-Loop (PiTL) elicits knowledge from large language models (LLMs) to describe images: given an image's category label, e.g. refinery, the LLM produces knowledge such as "a refinery could be seen with large storage tanks, pipework, ...", which supplements the common relations among entities likely to appear in a scene (a toy prompting sketch follows this entry). With PiTL, the authors build IN14K, a new VL dataset of 9M images and 1M descriptions over 14K categories from ImageNet21K.
  • results: VL models pre-trained on PiTL-generated pairs are strongly favored over other W-VLP works on image-to-text (I2T) and text-to-image (T2I) retrieval tasks while requiring less supervision, showing the effectiveness of the PiTL-generated pairs.
    Abstract Vision-language (VL) Pre-training (VLP) has shown to well generalize VL models over a wide range of VL downstream tasks, especially for cross-modal retrieval. However, it hinges on a huge amount of image-text pairs, which requires tedious and costly curation. On the contrary, weakly-supervised VLP (W-VLP) explores means with object tags generated by a pre-trained object detector (OD) from images. Yet, they still require paired information, i.e. images and object-level annotations, as supervision to train an OD. To further reduce the amount of supervision, we propose Prompts-in-The-Loop (PiTL) that prompts knowledge from large language models (LLMs) to describe images. Concretely, given a category label of an image, e.g. refinery, the knowledge, e.g. a refinery could be seen with large storage tanks, pipework, and ..., extracted by LLMs is used as the language counterpart. The knowledge supplements, e.g. the common relations among entities most likely appearing in a scene. We create IN14K, a new VL dataset of 9M images and 1M descriptions of 14K categories from ImageNet21K with PiTL. Empirically, the VL models pre-trained with PiTL-generated pairs are strongly favored over other W-VLP works on image-to-text (I2T) and text-to-image (T2I) retrieval tasks, with less supervision. The results reveal the effectiveness of PiTL-generated pairs for VLP.
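A toy sketch of the prompts-in-the-loop idea: build a prompt from a category label, query an LLM, and pair the returned description with images of that category. The prompt wording and the `query_llm` callable are illustrative assumptions, not the authors' exact prompts or API.

```python
def build_prompt(category: str) -> str:
    """Ask the LLM what typically appears in a scene of this category."""
    return (f"Describe what is typically visible in a photo of a {category}, "
            f"listing the objects and their common relations in one sentence.")

def generate_pairs(categories, images_by_category, query_llm):
    """Pair each image with an LLM-generated description of its category label.

    `query_llm` is a placeholder callable: prompt string -> description string.
    """
    pairs = []
    for category in categories:
        description = query_llm(build_prompt(category))
        pairs.extend((img, description) for img in images_by_category[category])
    return pairs

# Example with a stub LLM:
pairs = generate_pairs(
    ["refinery"],
    {"refinery": ["img_001.jpg", "img_002.jpg"]},
    query_llm=lambda p: "A refinery with large storage tanks, pipework, and chimneys.",
)
```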

Risk Controlled Image Retrieval

  • paper_url: http://arxiv.org/abs/2307.07336
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Kaiwen Cai, Chris Xiaoxuan Lu, Xingyu Zhao, Xiaowei Huang
  • for: Making image retrieval reliable as well as accurate in scenarios where the trustworthiness of the prediction matters.
  • methods: Risk Controlled Image Retrieval (RCIR) applies uncertainty quantification to generate retrieval sets with a coverage guarantee, i.e. sets guaranteed to contain the true nearest neighbors with a predefined probability; it plugs into existing uncertainty-aware retrieval systems, agnostic to data distribution and model selection (a hedged sketch follows this entry).
  • results: Validity and efficiency are demonstrated on four real-world retrieval datasets: Stanford CAR-196, CUB-200, Pittsburgh, and ChestX-Det.
    Abstract Most image retrieval research focuses on improving predictive performance, ignoring scenarios where the reliability of the prediction is also crucial. Uncertainty quantification technique can be applied to mitigate this issue by assessing uncertainty for retrieval sets, but it can provide only a heuristic estimate of uncertainty rather than a guarantee. To address these limitations, we present Risk Controlled Image Retrieval (RCIR), which generates retrieval sets with coverage guarantee, i.e., retrieval sets that are guaranteed to contain the true nearest neighbors with a predefined probability. RCIR can be easily integrated with existing uncertainty-aware image retrieval systems, agnostic to data distribution and model selection. To the best of our knowledge, this is the first work that provides coverage guarantees for image retrieval. The validity and efficiency of RCIR are demonstrated on four real-world image retrieval datasets: Stanford CAR-196, CUB-200, Pittsburgh and ChestX-Det.
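A hedged sketch of how a coverage guarantee can be obtained in the conformal-prediction style the abstract suggests: calibrate a score threshold on held-out queries so that, with probability at least 1 - alpha, the retrieval set contains the true match. This is a generic recipe, not necessarily RCIR's exact procedure.

```python
import numpy as np

def calibrate_threshold(cal_scores, alpha=0.1):
    """Pick the (1 - alpha) conformal quantile of calibration nonconformity scores.

    cal_scores[i] is the score (e.g. distance) of the true match for query i.
    """
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n          # finite-sample correction
    return np.quantile(cal_scores, min(q, 1.0))

def retrieval_set(query_scores, threshold):
    """Return all gallery indices whose score falls under the calibrated threshold."""
    return np.where(query_scores <= threshold)[0]

# Toy usage: distances of true matches on a calibration split, then a new query.
cal_scores = np.random.rand(500)
tau = calibrate_threshold(cal_scores, alpha=0.1)
retrieved = retrieval_set(np.random.rand(1000), tau)   # coverage >= 90% in expectation
```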

Capsule network with shortcut routing

  • paper_url: http://arxiv.org/abs/2307.10212
  • repo_url: None
  • paper_authors: Dang Thanh Vu, Vo Hoang Trong, Yu Gwang-Hyun, Kim Jin-Young
  • for: Introducing "shortcut routing", a novel routing mechanism for capsule networks that addresses computational inefficiency by activating global capsules directly from local capsules, eliminating intermediate layers.
  • methods: An attention-based routing approach with fuzzy coefficients is also explored for improved efficiency.
  • results: Experiments on the MNIST, smallNORB, and affNIST datasets show comparable classification performance, with accuracies of 99.52%, 93.91%, and 89.02% respectively; the fuzzy-based and attention-based routing methods reduce the number of calculations by 1.42 and 2.5 times compared with EM routing, contributing to efficient and accurate hierarchical pattern representation models.
    Abstract This study introduces "shortcut routing," a novel routing mechanism in capsule networks that addresses computational inefficiencies by directly activating global capsules from local capsules, eliminating intermediate layers. An attention-based approach with fuzzy coefficients is also explored for improved efficiency. Experimental results on Mnist, smallnorb, and affNist datasets show comparable classification performance, achieving accuracies of 99.52%, 93.91%, and 89.02% respectively. The proposed fuzzy-based and attention-based routing methods significantly reduce the number of calculations by 1.42 and 2.5 times compared to EM routing, highlighting their computational advantages in capsule networks. These findings contribute to the advancement of efficient and accurate hierarchical pattern representation models.

SynTable: A Synthetic Data Generation Pipeline for Unseen Object Amodal Instance Segmentation of Cluttered Tabletop Scenes

  • paper_url: http://arxiv.org/abs/2307.07333
  • repo_url: None
  • paper_authors: Zhili Ng, Haozhe Wang, Zhengshen Zhang, Francis Tay Eng Hock, Marcelo H. Ang Jr
  • for: SynTable, a Python-based data generation tool for producing high-quality synthetic datasets for unseen object amodal instance segmentation of cluttered tabletop scenes.
  • methods: Built on NVIDIA Isaac Sim Replicator Composer, the tool renders complex 3D scenes (object meshes, materials, textures, lighting, and backgrounds) and automatically generates metadata such as modal and amodal instance segmentation masks, occlusion masks, depth maps, bounding boxes, and material properties according to user requirements, eliminating manual labeling.
  • results: Training a state-of-the-art model (UOAIS-Net) on a sample SynTable dataset rendered by ray tracing significantly improves Sim-to-Real transfer performance when evaluated on the OSD-Amodal dataset.
    Abstract In this work, we present SynTable, a unified and flexible Python-based dataset generator built using NVIDIA's Isaac Sim Replicator Composer for generating high-quality synthetic datasets for unseen object amodal instance segmentation of cluttered tabletop scenes. Our dataset generation tool can render a complex 3D scene containing object meshes, materials, textures, lighting, and backgrounds. Metadata, such as modal and amodal instance segmentation masks, occlusion masks, depth maps, bounding boxes, and material properties, can be generated to automatically annotate the scene according to the users' requirements. Our tool eliminates the need for manual labeling in the dataset generation process while ensuring the quality and accuracy of the dataset. In this work, we discuss our design goals, framework architecture, and the performance of our tool. We demonstrate the use of a sample dataset generated using SynTable by ray tracing for training a state-of-the-art model, UOAIS-Net. The results show significantly improved performance in Sim-to-Real transfer when evaluated on the OSD-Amodal dataset. We offer this tool as an open-source, easy-to-use, photorealistic dataset generator for advancing research in deep learning and synthetic data generation.

HEAL-SWIN: A Vision Transformer On The Sphere

  • paper_url: http://arxiv.org/abs/2307.07313
  • repo_url: https://github.com/janegerken/heal-swin
  • paper_authors: Oscar Carlsson, Jan E. Gerken, Hampus Linander, Heiner Spieß, Fredrik Ohlsson, Christoffer Petersson, Daniel Persson
  • for: High-resolution wide-angle fisheye images are increasingly important for robotics applications such as autonomous driving, but ordinary CNNs or vision transformers struggle with this data because projecting onto a planar rectangular grid introduces projection and distortion losses.
  • methods: The HEAL-SWIN transformer combines the highly uniform Hierarchical Equal Area iso-Latitude Pixelation (HEALPix) grid used in astrophysics and cosmology with the Hierarchical Shifted-Window (SWIN) transformer, yielding an efficient, flexible model that trains on high-resolution, distortion-free spherical data. The nested structure of the HEALPix grid is used for the SWIN transformer's patching and windowing operations, giving a one-dimensional representation of the spherical data with minimal computational overhead (a toy sketch of nested-grid windowing follows this entry).
  • results: The model achieves superior performance on semantic segmentation and depth regression tasks on both synthetic and real automotive datasets. Code is available at https://github.com/JanEGerken/HEAL-SWIN.
    Abstract High-resolution wide-angle fisheye images are becoming more and more important for robotics applications such as autonomous driving. However, using ordinary convolutional neural networks or vision transformers on this data is problematic due to projection and distortion losses introduced when projecting to a rectangular grid on the plane. We introduce the HEAL-SWIN transformer, which combines the highly uniform Hierarchical Equal Area iso-Latitude Pixelation (HEALPix) grid used in astrophysics and cosmology with the Hierarchical Shifted-Window (SWIN) transformer to yield an efficient and flexible model capable of training on high-resolution, distortion-free spherical data. In HEAL-SWIN, the nested structure of the HEALPix grid is used to perform the patching and windowing operations of the SWIN transformer, resulting in a one-dimensional representation of the spherical data with minimal computational overhead. We demonstrate the superior performance of our model for semantic segmentation and depth regression tasks on both synthetic and real automotive datasets. Our code is available at https://github.com/JanEGerken/HEAL-SWIN.
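A toy illustration of why nested HEALPix ordering makes windowing cheap: in NESTED indexing, the children of a coarser pixel occupy consecutive indices, so all fine pixels under one coarse parent form a contiguous slice of the one-dimensional array. This only demonstrates the index arithmetic, not the HEAL-SWIN implementation.

```python
import numpy as np

def windows_from_nested(features, nside_fine, nside_window):
    """Split nested-ordered HEALPix features into contiguous windows.

    features: (12 * nside_fine**2, C) array in NESTED ordering.
    Each window collects the fine pixels sharing one parent at nside_window:
    parent(pix) = pix >> (2 * k) where nside_fine = nside_window * 2**k, and
    those children are consecutive in nested ordering.
    """
    k = int(np.log2(nside_fine // nside_window))
    window_size = 4 ** k                               # children per coarse pixel
    n_windows = 12 * nside_window ** 2
    return features.reshape(n_windows, window_size, -1)

# Example: nside-64 grid, windows at nside 8 -> 768 windows of 64 pixels each.
feats = np.random.rand(12 * 64 ** 2, 16)
wins = windows_from_nested(feats, nside_fine=64, nside_window=8)
print(wins.shape)   # (768, 64, 16)
```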

3D Shape-Based Myocardial Infarction Prediction Using Point Cloud Classification Networks

  • paper_url: http://arxiv.org/abs/2307.07298
  • repo_url: None
  • paper_authors: Marcel Beetz, Yilong Yang, Abhirup Banerjee, Lei Li, Vicente Grau
  • for: Improving myocardial infarction (MI) prediction by using complete 3D cardiac shapes rather than single-valued imaging biomarkers, which only approximate the heart's complex 3D structure and physiology.
  • methods: A fully automatic multi-step pipeline consisting of a 3D cardiac surface reconstruction step followed by a point cloud classification network, leveraging recent advances in geometric deep learning on point clouds for direct, efficient multi-scale learning on high-resolution anatomical surface models.
  • results: On 1068 UK Biobank subjects, the approach improves prevalent MI detection by ~13% and incident MI prediction by ~5% over clinical benchmarks; the roles of each ventricle and cardiac phase in 3D shape-based MI detection are analyzed, together with a visual analysis of the morphological and physiological patterns typically associated with MI outcomes.
    Abstract Myocardial infarction (MI) is one of the most prevalent cardiovascular diseases with associated clinical decision-making typically based on single-valued imaging biomarkers. However, such metrics only approximate the complex 3D structure and physiology of the heart and hence hinder a better understanding and prediction of MI outcomes. In this work, we investigate the utility of complete 3D cardiac shapes in the form of point clouds for an improved detection of MI events. To this end, we propose a fully automatic multi-step pipeline consisting of a 3D cardiac surface reconstruction step followed by a point cloud classification network. Our method utilizes recent advances in geometric deep learning on point clouds to enable direct and efficient multi-scale learning on high-resolution surface models of the cardiac anatomy. We evaluate our approach on 1068 UK Biobank subjects for the tasks of prevalent MI detection and incident MI prediction and find improvements of ~13% and ~5% respectively over clinical benchmarks. Furthermore, we analyze the role of each ventricle and cardiac phase for 3D shape-based MI detection and conduct a visual analysis of the morphological and physiological patterns typically associated with MI outcomes.

Sampling-Priors-Augmented Deep Unfolding Network for Robust Video Compressive Sensing

  • paper_url: http://arxiv.org/abs/2307.07291
  • repo_url: https://github.com/yuhaoo00/SPA-DUN
  • paper_authors: Yuhao Huang, Gangrong Qu, Youran Ge
  • for: High-speed scene recording with a low-frame-rate sensor via video compressive sensing.
  • methods: A Sampling-Priors-Augmented Deep Unfolding Network (SPA-DUN) for efficient and robust video compressive sensing reconstruction.
  • results: On both simulated and real datasets, a single SPA-DUN handles diverse sampling settings and achieves state-of-the-art performance with remarkable efficiency.
    Abstract Video Compressed Sensing (VCS) aims to reconstruct multiple frames from one single captured measurement, thus achieving high-speed scene recording with a low-frame-rate sensor. Although there have been impressive advances in VCS recently, those state-of-the-art (SOTA) methods also significantly increase model complexity and suffer from poor generality and robustness, which means that those networks need to be retrained to accommodate the new system. Such limitations hinder the real-time imaging and practical deployment of models. In this work, we propose a Sampling-Priors-Augmented Deep Unfolding Network (SPA-DUN) for efficient and robust VCS reconstruction. Under the optimization-inspired deep unfolding framework, a lightweight and efficient U-net is exploited to downsize the model while improving overall performance. Moreover, the prior knowledge from the sampling model is utilized to dynamically modulate the network features to enable single SPA-DUN to handle arbitrary sampling settings, augmenting interpretability and generality. Extensive experiments on both simulation and real datasets demonstrate that SPA-DUN is not only applicable for various sampling settings with one single model but also achieves SOTA performance with incredible efficiency.

Implicit Neural Feature Fusion Function for Multispectral and Hyperspectral Image Fusion

  • paper_url: http://arxiv.org/abs/2307.07288
  • repo_url: None
  • paper_authors: ShangQi Deng, RuoCheng Wu, Liang-Jian Deng, Ran Ran, Tai-Xiang Jiang
  • for: Multispectral and Hyperspectral Image Fusion (MHIF): fusing a high-resolution multispectral image (HR-MSI) and a low-resolution hyperspectral image (LR-HSI) of the same scene into a high-resolution hyperspectral image (HR-HSI).
  • methods: An Implicit Neural Representation (INR) based method, the Implicit Neural Feature Fusion Function (INF), uses the HR-MSI as a high-frequency detail auxiliary input, fuses multimodal features through a Dual High-Frequency Fusion (DHFF) structure together with coordinate information, and adds a parameter-free INR with cosine similarity (INR-CS) module that derives local weights from feature vectors.
  • results: The resulting Implicit Neural Fusion Network (INFN) achieves state-of-the-art performance on two public datasets, CAVE and Harvard.
    Abstract Multispectral and Hyperspectral Image Fusion (MHIF) is a practical task that aims to fuse a high-resolution multispectral image (HR-MSI) and a low-resolution hyperspectral image (LR-HSI) of the same scene to obtain a high-resolution hyperspectral image (HR-HSI). Benefiting from powerful inductive bias capability, CNN-based methods have achieved great success in the MHIF task. However, they lack certain interpretability and require convolution structures be stacked to enhance performance. Recently, Implicit Neural Representation (INR) has achieved good performance and interpretability in 2D tasks due to its ability to locally interpolate samples and utilize multimodal content such as pixels and coordinates. Although INR-based approaches show promise, they require extra construction of high-frequency information (\emph{e.g.,} positional encoding). In this paper, inspired by previous work of MHIF task, we realize that HR-MSI could serve as a high-frequency detail auxiliary input, leading us to propose a novel INR-based hyperspectral fusion function named Implicit Neural Feature Fusion Function (INF). As an elaborate structure, it solves the MHIF task and addresses deficiencies in the INR-based approaches. Specifically, our INF designs a Dual High-Frequency Fusion (DHFF) structure that obtains high-frequency information twice from HR-MSI and LR-HSI, then subtly fuses them with coordinate information. Moreover, the proposed INF incorporates a parameter-free method named INR with cosine similarity (INR-CS) that uses cosine similarity to generate local weights through feature vectors. Based on INF, we construct an Implicit Neural Fusion Network (INFN) that achieves state-of-the-art performance for MHIF tasks of two public datasets, \emph{i.e.,} CAVE and Harvard. The code will soon be made available on GitHub.

Cloud Detection in Multispectral Satellite Images Using Support Vector Machines With Quantum Kernels

  • paper_url: http://arxiv.org/abs/2307.07281
  • repo_url: None
  • paper_authors: Artur Miroszewski, Jakub Mielczarek, Filip Szczepanek, Grzegorz Czelusta, Bartosz Grabowski, Bertrand Le Saux, Jakub Nalepa
  • for: Extending classic support vector machines (SVMs) with quantum kernels for satellite data classification, specifically cloud detection in multispectral imagery.
  • methods: Hybrid SVMs combine a Quantum Kernel Estimation (QKE) procedure with a classic SVM training routine; pixel data are mapped into a Hilbert space using ZZ-feature maps acting on a parameterized ansatz state, and the parameters are optimized to maximize the kernel target alignment (a hedged sketch follows this entry).
  • results: Experiments on the benchmark Landsat-8 multispectral dataset show that the simulated hybrid SVM classifies satellite images with accuracy on par with classic SVMs.
    Abstract Support vector machines (SVMs) are a well-established classifier effectively deployed in an array of pattern recognition and classification tasks. In this work, we consider extending classic SVMs with quantum kernels and applying them to satellite data analysis. The design and implementation of SVMs with quantum kernels (hybrid SVMs) is presented. It consists of the Quantum Kernel Estimation (QKE) procedure combined with a classic SVM training routine. The pixel data are mapped to the Hilbert space using ZZ-feature maps acting on the parameterized ansatz state. The parameters are optimized to maximize the kernel target alignment. We approach the problem of cloud detection in satellite image data, which is one of the pivotal steps in both on-the-ground and on-board satellite image analysis processing chains. The experiments performed over the benchmark Landsat-8 multispectral dataset revealed that the simulated hybrid SVM successfully classifies satellite images with accuracy on par with classic SVMs.
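A hedged sketch of the hybrid setup: the quantum kernel is evaluated (on a simulator or hardware) into a Gram matrix, which is then handed to an ordinary SVM with a precomputed kernel. The `quantum_kernel_matrix` function is a placeholder for the QKE step with a ZZ feature map; it is stubbed with a classical RBF kernel here just so the snippet runs.

```python
import numpy as np
from sklearn.svm import SVC

def quantum_kernel_matrix(x_left, x_right):
    """Placeholder for Quantum Kernel Estimation (ZZ feature map + ansatz).

    In the paper this would be the fidelity between encoded states estimated on
    a quantum simulator; stubbed with a classical RBF kernel so the code executes.
    """
    diff = x_left[:, None, :] - x_right[None, :, :]
    return np.exp(-0.5 * (diff ** 2).sum(-1))

# Toy pixel features (e.g. multispectral band values) and cloud / no-cloud labels.
x_train, y_train = np.random.rand(100, 4), np.random.randint(0, 2, 100)
x_test = np.random.rand(20, 4)

svm = SVC(kernel="precomputed")
svm.fit(quantum_kernel_matrix(x_train, x_train), y_train)
pred = svm.predict(quantum_kernel_matrix(x_test, x_train))
```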

Frequency Domain Adversarial Training for Robust Volumetric Medical Segmentation

  • paper_url: http://arxiv.org/abs/2307.07269
  • repo_url: https://github.com/asif-hanif/vafa
  • paper_authors: Asif Hanif, Muzammal Naseer, Salman Khan, Mubarak Shah, Fahad Shahbaz Khan
  • for: Improving the robustness of deep learning models for volumetric medical image segmentation, which are vulnerable to adversarial attacks despite their strong performance.
  • methods: A 3D frequency-domain adversarial attack on volumetric segmentation models, shown to have advantages over conventional input/voxel-domain attacks, together with a frequency-domain adversarial training approach and a frequency consistency loss (a toy frequency-domain perturbation sketch follows this entry).
  • results: The proposed frequency-domain adversarial training yields models that are robust against both voxel- and frequency-domain attacks while achieving a better trade-off between performance on clean and adversarial samples.
    Abstract It is imperative to ensure the robustness of deep learning models in critical applications such as, healthcare. While recent advances in deep learning have improved the performance of volumetric medical image segmentation models, these models cannot be deployed for real-world applications immediately due to their vulnerability to adversarial attacks. We present a 3D frequency domain adversarial attack for volumetric medical image segmentation models and demonstrate its advantages over conventional input or voxel domain attacks. Using our proposed attack, we introduce a novel frequency domain adversarial training approach for optimizing a robust model against voxel and frequency domain attacks. Moreover, we propose frequency consistency loss to regulate our frequency domain adversarial training that achieves a better tradeoff between model's performance on clean and adversarial samples. Code is publicly available at https://github.com/asif-hanif/vafa.
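A toy illustration of perturbing a 3D volume in the frequency domain: transform with a 3D FFT, perturb the spectrum under a small budget, and transform back. The paper's attack optimizes its perturbation against the segmentation loss; this sketch only shows the transform-perturb-invert mechanics, with `eps` as an assumed budget.

```python
import torch

def frequency_perturb(volume: torch.Tensor, eps: float = 0.03) -> torch.Tensor:
    """Perturb a 3D volume in the frequency domain and return the voxel-space result.

    volume: (D, H, W) tensor. A real attack would optimize the spectral
    perturbation to maximize the segmentation loss rather than sample it randomly.
    """
    spectrum = torch.fft.fftn(volume)                       # 3D FFT of the volume
    noise = eps * torch.randn_like(spectrum.real)           # random spectral offset
    perturbed = spectrum + noise * spectrum.abs()           # scale by local magnitude
    adv = torch.fft.ifftn(perturbed).real                   # back to voxel space
    return adv.clamp(volume.min(), volume.max())            # keep original intensity range

adv_scan = frequency_perturb(torch.rand(64, 64, 64))
```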

cOOpD: Reformulating COPD classification on chest CT scans as anomaly detection using contrastive representations

  • paper_url: http://arxiv.org/abs/2307.07254
  • repo_url: None
  • paper_authors: Silvia D. Almeida, Carsten T. Lüth, Tobias Norajitra, Tassilo Wald, Marco Nolden, Paul F. Jaeger, Claus P. Heussel, Jürgen Biederer, Oliver Weinheimer, Klaus Maier-Hein
  • for: Improving the detection of Chronic Obstructive Pulmonary Disease (COPD), which is underdiagnosed and whose sparse, diffuse, heterogeneous appearance on CT challenges supervised binary classification.
  • methods: COPD classification is reformulated as anomaly detection (cOOpD): representations of unlabeled lung regions are learned with a self-supervised contrastive pretext model, a generative model learns the distribution of healthy representations, and pathological regions are detected as out-of-distribution deviations; patient-level scores are obtained by aggregating region-level OOD scores.
  • results: cOOpD achieves the best performance on two public datasets, improving AUROC by 8.2% and 7.7% over the previous supervised state of the art; it also yields interpretable spatial anomaly maps and patient-level scores that help identify individuals in an early stage of progression, and experiments in artificially designed real-world prevalence settings further support anomaly detection as a powerful way of tackling COPD classification.
    Abstract Classification of heterogeneous diseases is challenging due to their complexity, variability of symptoms and imaging findings. Chronic Obstructive Pulmonary Disease (COPD) is a prime example, being underdiagnosed despite being the third leading cause of death. Its sparse, diffuse and heterogeneous appearance on computed tomography challenges supervised binary classification. We reformulate COPD binary classification as an anomaly detection task, proposing cOOpD: heterogeneous pathological regions are detected as Out-of-Distribution (OOD) from normal homogeneous lung regions. To this end, we learn representations of unlabeled lung regions employing a self-supervised contrastive pretext model, potentially capturing specific characteristics of diseased and healthy unlabeled regions. A generative model then learns the distribution of healthy representations and identifies abnormalities (stemming from COPD) as deviations. Patient-level scores are obtained by aggregating region OOD scores. We show that cOOpD achieves the best performance on two public datasets, with an increase of 8.2% and 7.7% in terms of AUROC compared to the previous supervised state-of-the-art. Additionally, cOOpD yields well-interpretable spatial anomaly maps and patient-level scores which we show to be of additional value in identifying individuals in the early stage of progression. Experiments in artificially designed real-world prevalence settings further support that anomaly detection is a powerful way of tackling COPD classification.

Knowledge Boosting: Rethinking Medical Contrastive Vision-Language Pre-Training

  • paper_url: http://arxiv.org/abs/2307.07246
  • repo_url: None
  • paper_authors: Xiaofei Chen, Yuting He, Cheng Xue, Rongjun Ge, Shuo Li, Guanyu Yang
  • for: Making computer-aided diagnosis more practical by addressing the large-scale semantic overlap and shifting problems that limit medical vision-language pre-training.
  • methods: Medical contrastive vision-language pre-training without human annotations, guided by description information in diagnostic reports: the Knowledge-Boosting Contrastive Vision-Language Pre-training framework (KoBo) integrates clinical knowledge into the learning of vision-language semantic consistency, using an unbiased, open-set sample-wise knowledge representation to measure negative sample noise and to supplement the correspondence between vision-language mutual information and clinical knowledge.
  • results: Across eight tasks including classification, segmentation, retrieval, and semantic relatedness, KoBo achieves comparable or better performance under zero-shot or few-shot settings.
    Abstract The foundation models based on pre-training technology have significantly advanced artificial intelligence from theoretical to practical applications. These models have facilitated the feasibility of computer-aided diagnosis for widespread use. Medical contrastive vision-language pre-training, which does not require human annotations, is an effective approach for guiding representation learning using description information in diagnostic reports. However, the effectiveness of pre-training is limited by the large-scale semantic overlap and shifting problems in medical field. To address these issues, we propose the Knowledge-Boosting Contrastive Vision-Language Pre-training framework (KoBo), which integrates clinical knowledge into the learning of vision-language semantic consistency. The framework uses an unbiased, open-set sample-wise knowledge representation to measure negative sample noise and supplement the correspondence between vision-language mutual information and clinical knowledge. Extensive experiments validate the effect of our framework on eight tasks including classification, segmentation, retrieval, and semantic relatedness, achieving comparable or better performance with the zero-shot or few-shot settings. Our code is open on https://github.com/ChenXiaoFei-CS/KoBo.

FreeCOS: Self-Supervised Learning from Fractals and Unlabeled Images for Curvilinear Object Segmentation

  • paper_url: http://arxiv.org/abs/2307.07245
  • repo_url: https://github.com/ty-shi/freecos
  • paper_authors: Tianyi Shi, Xiaohuan Ding, Liang Zhang, Xin Yang
  • for: Automatic segmentation of curvilinear objects without manual annotations.
  • methods: Self-supervised learning from fractals and unlabeled images (FreeCOS): a Fractal-FDA Synthesis (FFS) module generates curvilinear structures from a parametric Fractal L-system and blends them into unlabeled images via Fourier Domain Adaptation, while a geometric information alignment (GIA) approach aligns synthetic and real images and their features (a toy L-system sketch follows this entry).
  • results: Outperforms state-of-the-art unsupervised methods, self-supervised methods, and domain adaptation methods by a large margin on four public datasets (XCAD, DRIVE, STARE, and CrackTree).
    Abstract Curvilinear object segmentation is critical for many applications. However, manually annotating curvilinear objects is very time-consuming and error-prone, yielding insufficiently available annotated datasets for existing supervised methods and domain adaptation methods. This paper proposes a self-supervised curvilinear object segmentation method that learns robust and distinctive features from fractals and unlabeled images (FreeCOS). The key contributions include a novel Fractal-FDA synthesis (FFS) module and a geometric information alignment (GIA) approach. FFS generates curvilinear structures based on the parametric Fractal L-system and integrates the generated structures into unlabeled images to obtain synthetic training images via Fourier Domain Adaptation. GIA reduces the intensity differences between the synthetic and unlabeled images by comparing the intensity order of a given pixel to the values of its nearby neighbors. Such image alignment can explicitly remove the dependency on absolute intensity values and enhance the inherent geometric characteristics which are common in both synthetic and real images. In addition, GIA aligns features of synthetic and real images via the prediction space adaptation loss (PSAL) and the curvilinear mask contrastive loss (CMCL). Extensive experimental results on four public datasets, i.e., XCAD, DRIVE, STARE and CrackTree demonstrate that our method outperforms the state-of-the-art unsupervised methods, self-supervised methods and traditional methods by a large margin. The source code of this work is available at https://github.com/TY-Shi/FreeCOS.
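A toy sketch of generating a branching curvilinear structure with a parametric L-system and a turtle-style interpretation, the general idea behind the FFS module; the rewriting rule, angle, and rasterization here are illustrative choices, not the paper's parameters.

```python
import numpy as np

def expand(axiom: str, rules: dict, depth: int) -> str:
    """Apply L-system rewriting rules `depth` times."""
    s = axiom
    for _ in range(depth):
        s = "".join(rules.get(c, c) for c in s)
    return s

def rasterize(commands: str, size=256, step=3, angle=np.deg2rad(25)):
    """Turtle interpretation: F = draw forward, +/- = turn, [ ] = push/pop state."""
    img = np.zeros((size, size), dtype=np.uint8)
    pos, heading, stack = np.array([size / 2, size - 1.0]), -np.pi / 2, []
    for c in commands:
        if c == "F":
            new = pos + step * np.array([np.cos(heading), np.sin(heading)])
            for t in np.linspace(0.0, 1.0, step + 1):       # stamp pixels along the segment
                p = pos + t * (new - pos)
                r, col = int(round(p[1])), int(round(p[0]))
                if 0 <= r < size and 0 <= col < size:
                    img[r, col] = 255
            pos = new
        elif c == "+":
            heading += angle
        elif c == "-":
            heading -= angle
        elif c == "[":
            stack.append((pos.copy(), heading))
        elif c == "]":
            pos, heading = stack.pop()
    return img

# Example: a simple branching rule produces a vessel/crack-like curvilinear mask.
mask = rasterize(expand("F", {"F": "FF+[+F-F-F]-[-F+F+F]"}, depth=3))
```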

MaxSR: Image Super-Resolution Using Improved MaxViT

  • paper_url: http://arxiv.org/abs/2307.07240
  • repo_url: None
  • paper_authors: Bincheng Yang, Gangshan Wu
  • for: A single image super-resolution model based on the recent hybrid vision transformer MaxViT, aiming to improve single image super-resolution performance.
  • methods: The model (MaxSR) has four parts: a shallow feature extraction block, multiple cascaded adaptive MaxViT blocks that extract deep hierarchical features and model global self-similarity, a hierarchical feature fusion block, and a reconstruction block. The key component, the adaptive MaxViT block, mixes MBConv with squeeze-and-excitation, block attention, and grid attention; to better model self-similarity in the low-resolution input globally, block and grid attention are improved into adaptive block attention and adaptive grid attention, which perform self-attention inside each window across all grids and inside each grid across all windows, respectively, in an efficient way.
  • results: Two variants are instantiated, a classical model (MaxSR) and a lightweight model (MaxSR-light); experiments show that both establish new state-of-the-art performance efficiently.
    Abstract While transformer models have been demonstrated to be effective for natural language processing tasks and high-level vision tasks, only a few attempts have been made to use powerful transformer models for single image super-resolution. Because transformer models have powerful representation capacity and the in-built self-attention mechanisms in transformer models help to leverage self-similarity prior in input low-resolution image to improve performance for single image super-resolution, we present a single image super-resolution model based on recent hybrid vision transformer of MaxViT, named as MaxSR. MaxSR consists of four parts, a shallow feature extraction block, multiple cascaded adaptive MaxViT blocks to extract deep hierarchical features and model global self-similarity from low-level features efficiently, a hierarchical feature fusion block, and finally a reconstruction block. The key component of MaxSR, i.e., adaptive MaxViT block, is based on MaxViT block which mixes MBConv with squeeze-and-excitation, block attention and grid attention. In order to achieve better global modelling of self-similarity in input low-resolution image, we improve block attention and grid attention in MaxViT block to adaptive block attention and adaptive grid attention which do self-attention inside each window across all grids and each grid across all windows respectively in the most efficient way. We instantiate proposed model for classical single image super-resolution (MaxSR) and lightweight single image super-resolution (MaxSR-light). Experiments show that our MaxSR and MaxSR-light establish new state-of-the-art performance efficiently.

Source-Free Domain Adaptive Fundus Image Segmentation with Class-Balanced Mean Teacher

  • paper_url: http://arxiv.org/abs/2307.09973
  • repo_url: https://github.com/lloongx/sfda-cbmt
  • paper_authors: Longxiang Tang, Kai Li, Chunming He, Yulun Zhang, Xiu Li
  • for: Source-free domain adaptive fundus image segmentation: adapting a pretrained fundus segmentation model to a target domain using only unlabeled images.
  • methods: A Class-Balanced Mean Teacher (CBMT) model based on weak-strong consistency learning addresses two main issues of pseudo-label adaptation: instability (incorrect pseudo labels can have a catastrophic impact on the model) and the severe class imbalance of fundus images, where the foreground (e.g. cup) region is very small. Only the teacher generates pseudo labels from weakly augmented images to train a student on strongly augmented inputs, the teacher is updated as a moving average of the student, and a loss calibration approach highlights foreground classes according to global statistics (a minimal teacher-update sketch follows this entry).
  • results: Experiments show that CBMT effectively addresses both issues and outperforms existing methods on multiple benchmarks.
    Abstract This paper studies source-free domain adaptive fundus image segmentation which aims to adapt a pretrained fundus segmentation model to a target domain using unlabeled images. This is a challenging task because it is highly risky to adapt a model only using unlabeled data. Most existing methods tackle this task mainly by designing techniques to carefully generate pseudo labels from the model's predictions and use the pseudo labels to train the model. While often obtaining positive adaption effects, these methods suffer from two major issues. First, they tend to be fairly unstable - incorrect pseudo labels abruptly emerged may cause a catastrophic impact on the model. Second, they fail to consider the severe class imbalance of fundus images where the foreground (e.g., cup) region is usually very small. This paper aims to address these two issues by proposing the Class-Balanced Mean Teacher (CBMT) model. CBMT addresses the unstable issue by proposing a weak-strong augmented mean teacher learning scheme where only the teacher model generates pseudo labels from weakly augmented images to train a student model that takes strongly augmented images as input. The teacher is updated as the moving average of the instantly trained student, which could be noisy. This prevents the teacher model from being abruptly impacted by incorrect pseudo-labels. For the class imbalance issue, CBMT proposes a novel loss calibration approach to highlight foreground classes according to global statistics. Experiments show that CBMT well addresses these two issues and outperforms existing methods on multiple benchmarks.
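A minimal sketch of the mean-teacher mechanics described above: the teacher labels weakly augmented images, the student trains on strongly augmented versions, and the teacher is an exponential moving average of the student. The augmentations, segmentation network, and class-balanced loss calibration are placeholders.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Teacher weights follow the student as an exponential moving average."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

def adaptation_step(student, teacher, optimizer, batch, weak_aug, strong_aug, seg_loss):
    """One source-free adaptation step with weak-to-strong pseudo-labeling.

    weak_aug / strong_aug / seg_loss are placeholders; CBMT additionally
    calibrates the loss to up-weight small foreground classes (not shown here).
    """
    with torch.no_grad():
        pseudo = teacher(weak_aug(batch)).argmax(dim=1)     # teacher pseudo labels
    loss = seg_loss(student(strong_aug(batch)), pseudo)     # student learns from them
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)
    return loss.item()

# In practice the teacher starts as a frozen copy of the pretrained source model.
```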

Masked Autoencoders for Unsupervised Anomaly Detection in Medical Images

  • paper_url: http://arxiv.org/abs/2307.07534
  • repo_url: https://github.com/lilygeorgescu/mae-medical-anomaly-detection
  • paper_authors: Mariana-Iuliana Georgescu
  • for: Detecting anomalies in medical images without using pathological samples for training.
  • methods: A masked autoencoder learns the structure of normal samples; an anomaly classifier is then trained on the difference between the original image and the masked autoencoder's reconstruction, using reconstructions of healthy scans as negative samples and pseudo-abnormal scans, produced by a novel module that alters the intensity of several regions of normal scans, as positive samples (a toy pseudo-abnormal sketch follows this entry).
  • results: Experiments on two medical imaging datasets, BRATS2020 and LUNA16, compare the method with four state-of-the-art anomaly detection frameworks: AST, RD4AD, AnoVAEGAN, and f-AnoGAN.
    Abstract Pathological anomalies exhibit diverse appearances in medical imaging, making it difficult to collect and annotate a representative amount of data required to train deep learning models in a supervised setting. Therefore, in this work, we tackle anomaly detection in medical images training our framework using only healthy samples. We propose to use the Masked Autoencoder model to learn the structure of the normal samples, then train an anomaly classifier on top of the difference between the original image and the reconstruction provided by the masked autoencoder. We train the anomaly classifier in a supervised manner using as negative samples the reconstruction of the healthy scans, while as positive samples, we use pseudo-abnormal scans obtained via our novel pseudo-abnormal module. The pseudo-abnormal module alters the reconstruction of the normal samples by changing the intensity of several regions. We conduct experiments on two medical image data sets, namely BRATS2020 and LUNA16 and compare our method with four state-of-the-art anomaly detection frameworks, namely AST, RD4AD, AnoVAEGAN and f-AnoGAN.
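A toy sketch of the pseudo-abnormal idea: take a normal scan and alter the intensity of a few random regions to create positive samples for the anomaly classifier, which then sees the difference between the altered image and the (reconstructed) normal one. Region shapes, counts, and intensity ranges are illustrative assumptions.

```python
import numpy as np

def pseudo_abnormal(image: np.ndarray, n_regions=3, max_size=32, rng=None):
    """Create a pseudo-abnormal sample by rescaling intensities in random patches."""
    rng = rng or np.random.default_rng()
    out = image.copy()
    h, w = image.shape[:2]
    for _ in range(n_regions):
        ph, pw = rng.integers(8, max_size, size=2)          # patch size
        y, x = rng.integers(0, h - ph), rng.integers(0, w - pw)
        factor = rng.uniform(0.3, 1.7)                      # brighten or darken the region
        out[y:y + ph, x:x + pw] *= factor
    return np.clip(out, image.min(), image.max())

# Positive sample for the anomaly classifier; `scan` stands in for the
# masked-autoencoder reconstruction of a healthy image.
scan = np.random.rand(128, 128).astype(np.float32)
positive = pseudo_abnormal(scan)
anomaly_input = positive - scan        # classifier input: image minus reconstruction
```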

Challenge Results Are Not Reproducible

  • paper_url: http://arxiv.org/abs/2307.07226
  • repo_url: None
  • paper_authors: Annika Reinke, Georg Grab, Lena Maier-Hein
  • for: Assessing the reproducibility of biomedical image analysis challenges.
  • methods: The algorithms submitted to the 2019 Robust Medical Image Segmentation Challenge (ROBUST-MIS) were reimplemented from their method descriptions.
  • results: The leaderboard differed substantially between the original challenge and the reimplementation, indicating that challenge rankings may not be sufficiently reproducible.
    Abstract While clinical trials are the state-of-the-art methods to assess the effect of new medication in a comparative manner, benchmarking in the field of medical image analysis is performed by so-called challenges. Recently, comprehensive analysis of multiple biomedical image analysis challenges revealed large discrepancies between the impact of challenges and quality control of the design and reporting standard. This work aims to follow up on these results and attempts to address the specific question of the reproducibility of the participants methods. In an effort to determine whether alternative interpretations of the method description may change the challenge ranking, we reproduced the algorithms submitted to the 2019 Robust Medical Image Segmentation Challenge (ROBUST-MIS). The leaderboard differed substantially between the original challenge and reimplementation, indicating that challenge rankings may not be sufficiently reproducible.

Complementary Frequency-Varying Awareness Network for Open-Set Fine-Grained Image Recognition

  • paper_url: http://arxiv.org/abs/2307.07214
  • repo_url: None
  • paper_authors: Jiayin Sun, Hong Wang, Qiulei Dong
  • for: Improving open-set fine-grained image recognition.
  • methods: A Complementary Frequency-varying Awareness Network (CFAN) with three sequential modules: (i) a feature extraction module, (ii) a frequency-varying filtering module, and (iii) a complementary temporal aggregation module.
  • results: On three fine-grained and two coarse-grained image datasets, the derived CFAN-OSFGR method significantly outperforms nine state-of-the-art methods in most cases.
    Abstract Open-set image recognition is a challenging topic in computer vision. Most of the existing works in literature focus on learning more discriminative features from the input images, however, they are usually insensitive to the high- or low-frequency components in features, resulting in a decreasing performance on fine-grained image recognition. To address this problem, we propose a Complementary Frequency-varying Awareness Network that could better capture both high-frequency and low-frequency information, called CFAN. The proposed CFAN consists of three sequential modules: (i) a feature extraction module is introduced for learning preliminary features from the input images; (ii) a frequency-varying filtering module is designed to separate out both high- and low-frequency components from the preliminary features in the frequency domain via a frequency-adjustable filter; (iii) a complementary temporal aggregation module is designed for aggregating the high- and low-frequency components via two Long Short-Term Memory networks into discriminative features. Based on CFAN, we further propose an open-set fine-grained image recognition method, called CFAN-OSFGR, which learns image features via CFAN and classifies them via a linear classifier. Experimental results on 3 fine-grained datasets and 2 coarse-grained datasets demonstrate that CFAN-OSFGR performs significantly better than 9 state-of-the-art methods in most cases.
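The frequency-varying filtering module separates high- and low-frequency components in the frequency domain. A minimal sketch of that idea, using a fixed radial cutoff in the 2D Fourier domain, is shown below; the paper's frequency-adjustable filter is learned and more elaborate.

```python
import torch

def split_frequencies(feat: torch.Tensor, cutoff: float = 0.25):
    """Split a (B, C, H, W) feature map into low- and high-frequency parts.

    `cutoff` is the radius (as a fraction of the Nyquist frequency) of the
    centred low-pass mask applied in the 2D Fourier domain.
    """
    b, c, h, w = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    low_mask = ((yy**2 + xx**2).sqrt() <= cutoff).to(feat.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * low_mask, dim=(-2, -1))).real
    high = feat - low
    return low, high
```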

Multimodal Motion Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection

  • paper_url: http://arxiv.org/abs/2307.07205
  • repo_url: https://github.com/aleflabo/MoCoDAD
  • paper_authors: Alessandro Flaborea, Luca Collorone, Guido D’Amely, Stefano D’Arrigo, Bardh Prenkaj, Fabio Galasso
  • for: Video anomaly detection (VAD).
  • methods: A generative diffusion probabilistic model, conditioned on past motion, generates multimodal future human poses from skeletal representations; anomalies are detected by statistically aggregating the generated future modes.
  • results: Surpasses state-of-the-art results on four benchmarks.
    Abstract Anomalies are rare and anomaly detection is often therefore framed as One-Class Classification (OCC), i.e. trained solely on normalcy. Leading OCC techniques constrain the latent representations of normal motions to limited volumes and detect as abnormal anything outside, which accounts satisfactorily for the openset'ness of anomalies. But normalcy shares the same openset'ness property since humans can perform the same action in several ways, which the leading techniques neglect. We propose a novel generative model for video anomaly detection (VAD), which assumes that both normality and abnormality are multimodal. We consider skeletal representations and leverage state-of-the-art diffusion probabilistic models to generate multimodal future human poses. We contribute a novel conditioning on the past motion of people and exploit the improved mode coverage capabilities of diffusion processes to generate different-but-plausible future motions. Upon the statistical aggregation of future modes, an anomaly is detected when the generated set of motions is not pertinent to the actual future. We validate our model on 4 established benchmarks: UBnormal, HR-UBnormal, HR-STC, and HR-Avenue, with extensive experiments surpassing state-of-the-art results.
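A minimal sketch of the detection rule implied above: sample several plausible futures conditioned on the past motion and flag the sequence as anomalous when none of the generated modes is close to the observed future. The conditional diffusion sampler is abstracted as `sample_future`, an assumed interface rather than the released model.

```python
import torch

def anomaly_score(sample_future, past_poses: torch.Tensor,
                  true_future: torch.Tensor, n_modes: int = 10) -> torch.Tensor:
    """Distance from the observed future to the closest generated mode.

    sample_future(past_poses) -> (T, J, 2) tensor of future joint positions
    (assumed interface of the conditional generative model).
    """
    errors = []
    for _ in range(n_modes):
        generated = sample_future(past_poses)                 # one plausible future
        errors.append((generated - true_future).pow(2).mean())
    # Normal motion should be covered by at least one mode -> small minimum error.
    return torch.stack(errors).min()
```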

Volumetric Wireframe Parsing from Neural Attraction Fields

  • paper_url: http://arxiv.org/abs/2307.10206
  • repo_url: https://github.com/cherubicxn/neat
  • paper_authors: Nan Xue, Bin Tan, Yuxi Xiao, Liang Dong, Gui-Song Xia, Tianfu Wu
  • for: High-fidelity 3D wireframe parsing without explicit feature matching across views.
  • methods: NEural Attraction (NEAT) Fields parameterize 3D line segments with coordinate MLPs; a Global Junction Perceiving (GJP) module then perceives meaningful 3D junctions from these fields; finally, the primal sketch of the 3D wireframe is computed by attracting the queried 3D line segments to the junctions.
  • results: Experiments on the DTU and BlendedMVS datasets show promising performance.
    Abstract The primal sketch is a fundamental representation in Marr's vision theory, which allows for parsimonious image-level processing from 2D to 2.5D perception. This paper takes a further step by computing 3D primal sketch of wireframes from a set of images with known camera poses, in which we take the 2D wireframes in multi-view images as the basis to compute 3D wireframes in a volumetric rendering formulation. In our method, we first propose a NEural Attraction (NEAT) Fields that parameterizes the 3D line segments with coordinate Multi-Layer Perceptrons (MLPs), enabling us to learn the 3D line segments from 2D observation without incurring any explicit feature correspondences across views. We then present a novel Global Junction Perceiving (GJP) module to perceive meaningful 3D junctions from the NEAT Fields of 3D line segments by optimizing a randomly initialized high-dimensional latent array and a lightweight decoding MLP. Benefitting from our explicit modeling of 3D junctions, we finally compute the primal sketch of 3D wireframes by attracting the queried 3D line segments to the 3D junctions, significantly simplifying the computation paradigm of 3D wireframe parsing. In experiments, we evaluate our approach on the DTU and BlendedMVS datasets with promising performance obtained. As far as we know, our method is the first approach to achieve high-fidelity 3D wireframe parsing without requiring explicit matching.

Omnipotent Adversarial Training for Unknown Label-noisy and Imbalanced Datasets

  • paper_url: http://arxiv.org/abs/2307.08596
  • repo_url: https://github.com/guanlinlee/oat
  • paper_authors: Guanlin Li, Kangjie Chen, Yuan Xu, Han Qiu, Tianwei Zhang
  • for: Addressing a practical challenge: training a model on an imbalanced and label-noisy dataset while achieving both high clean accuracy and robustness.
  • methods: Two innovations: an oracle introduced into the adversarial training process that provides correct label annotations, helping the model learn the correct data-label conditional distribution; and logits-adjustment adversarial training that counters data imbalance and helps the model learn a Bayes-optimal distribution.
  • results: Comprehensive evaluation shows OAT outperforms other baselines by more than 20% in clean accuracy and more than 10% in robust accuracy under complex combinations of data imbalance and label noise.
    Abstract Adversarial training is an important topic in robust deep learning, but the community lacks attention to its practical usage. In this paper, we aim to resolve a real-world application challenge, i.e., training a model on an imbalanced and noisy dataset to achieve high clean accuracy and robustness, with our proposed Omnipotent Adversarial Training (OAT). Our strategy consists of two innovative methodologies to address the label noise and data imbalance in the training set. We first introduce an oracle into the adversarial training process to help the model learn a correct data-label conditional distribution. This carefully-designed oracle can provide correct label annotations for adversarial training. We further propose logits adjustment adversarial training to overcome the data imbalance challenge, which can help the model learn a Bayes-optimal distribution. Our comprehensive evaluation results show that OAT outperforms other baselines by more than 20% clean accuracy improvement and 10% robust accuracy improvement under the complex combinations of data imbalance and label noise scenarios. The code can be found in https://github.com/GuanlinLee/OAT.
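The logits-adjustment component can be illustrated with the standard logit-adjusted cross-entropy for long-tailed data, where each logit is shifted by the log class prior; OAT's exact formulation inside adversarial training may differ, so treat this as a sketch of the general idea.

```python
import torch
import torch.nn.functional as F

def logit_adjusted_ce(logits: torch.Tensor, targets: torch.Tensor,
                      class_counts: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Cross-entropy on prior-adjusted logits for long-tailed data.

    logits: (B, C), targets: (B,), class_counts: (C,) training-set frequencies.
    """
    prior = class_counts.float() / class_counts.sum()
    adjusted = logits + tau * prior.log().to(logits.device)   # penalise head classes
    return F.cross_entropy(adjusted, targets)
```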

LightFormer: An End-to-End Model for Intersection Right-of-Way Recognition Using Traffic Light Signals and an Attention Mechanism

  • paper_url: http://arxiv.org/abs/2307.07196
  • repo_url: https://github.com/danielming123/lightformer
  • paper_authors: Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Eduardo Nebot, Stewart Worrall
  • for: Determining whether a smart vehicle has right of way at signalized intersections, given the state of the traffic lights.
  • methods: A camera-based end-to-end model, LightFormer, generates right-of-way status for the available driving directions. It uses a spatial-temporal structure with an attention mechanism that incorporates features from past images into the classification of the current frame, together with a modified multi-weight arcface loss that improves classification performance.
  • results: Trained and tested on two public traffic light datasets with manually augmented labels, LightFormer accurately recognizes intersection right-of-way status.
    Abstract For smart vehicles driving through signalised intersections, it is crucial to determine whether the vehicle has right of way given the state of the traffic lights. To address this issue, camera based sensors can be used to determine whether the vehicle has permission to proceed straight, turn left or turn right. This paper proposes a novel end to end intersection right of way recognition model called LightFormer to generate right of way status for available driving directions in complex urban intersections. The model includes a spatial temporal inner structure with an attention mechanism, which incorporates features from past image to contribute to the classification of the current frame right of way status. In addition, a modified, multi weight arcface loss is introduced to enhance the model classification performance. Finally, the proposed LightFormer is trained and tested on two public traffic light datasets with manually augmented labels to demonstrate its effectiveness.

Adversarial Training Over Long-Tailed Distribution

  • paper_url: http://arxiv.org/abs/2307.10205
  • repo_url: https://github.com/guanlinlee/reat
  • paper_authors: Guanlin Li, Guowen Xu, Tianwei Zhang
  • for: Adversarial training on long-tailed datasets, a practical but rarely explored setting. Compared with conventional adversarial training on balanced data, this process generates uneven adversarial examples (AEs) and an unbalanced feature embedding space, leaving the model with low robustness and accuracy on tail classes.
  • methods: A new framework, Re-balancing Adversarial Training (REAT), with two components: (1) a training strategy inspired by the effective number of samples that guides the model to generate more balanced and informative AEs; (2) a carefully constructed penalty function that enforces a satisfactory feature space.
  • results: Evaluation on different datasets and model structures shows REAT effectively enhances robustness while preserving clean accuracy. Code: https://github.com/GuanlinLee/REAT.
    Abstract In this paper, we study adversarial training on datasets that obey the long-tailed distribution, which is practical but rarely explored in previous works. Compared with conventional adversarial training on balanced datasets, this process falls into the dilemma of generating uneven adversarial examples (AEs) and an unbalanced feature embedding space, causing the resulting model to exhibit low robustness and accuracy on tail data. To combat that, we propose a new adversarial training framework -- Re-balancing Adversarial Training (REAT). This framework consists of two components: (1) a new training strategy inspired by the term effective number to guide the model to generate more balanced and informative AEs; (2) a carefully constructed penalty function to force a satisfactory feature space. Evaluation results on different datasets and model structures prove that REAT can effectively enhance the model's robustness and preserve the model's clean accuracy. The code can be found in https://github.com/GuanlinLee/REAT.
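The "effective number" that the balanced AE-generation strategy is inspired by comes from class-balanced re-weighting, where each class is weighted by the inverse of its effective number of samples. A minimal sketch of those weights is below; REAT applies the idea to balancing adversarial example generation rather than to a plain re-weighted loss.

```python
import numpy as np

def effective_number_weights(class_counts, beta: float = 0.999):
    """Per-class weights proportional to 1 / E_n, with E_n = (1 - beta^n) / (1 - beta)."""
    counts = np.asarray(class_counts, dtype=np.float64)
    effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights * len(counts) / weights.sum()   # normalise to mean 1

# e.g. a long-tailed distribution: head classes get weights < 1, tail classes > 1
print(effective_number_weights([5000, 1000, 200, 40, 8]))
```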

Erasing, Transforming, and Noising Defense Network for Occluded Person Re-Identification

  • paper_url: http://arxiv.org/abs/2307.07187
  • repo_url: None
  • paper_authors: Neng Dong, Liyan Zhang, Shuanglin Yan, Hao Tang, Jinhui Tang
  • for: occluded person re-identification
  • methods: adversarial defense, random erasure, random transformations, noise perturbation
  • results: effective handling of occlusion issues, no need for external modules, superior performance on five public datasets
    Abstract Occlusion perturbation presents a significant challenge in person re-identification (re-ID), and existing methods that rely on external visual cues require additional computational resources and only consider the issue of missing information caused by occlusion. In this paper, we propose a simple yet effective framework, termed Erasing, Transforming, and Noising Defense Network (ETNDNet), which treats occlusion as a noise disturbance and solves occluded person re-ID from the perspective of adversarial defense. In the proposed ETNDNet, we introduce three strategies: Firstly, we randomly erase the feature map to create an adversarial representation with incomplete information, enabling adversarial learning of identity loss to protect the re-ID system from the disturbance of missing information. Secondly, we introduce random transformations to simulate the position misalignment caused by occlusion, training the extractor and classifier adversarially to learn robust representations immune to misaligned information. Thirdly, we perturb the feature map with random values to address noisy information introduced by obstacles and non-target pedestrians, and employ adversarial gaming in the re-ID system to enhance its resistance to occlusion noise. Without bells and whistles, ETNDNet has three key highlights: (i) it does not require any external modules with parameters, (ii) it effectively handles various issues caused by occlusion from obstacles and non-target pedestrians, and (iii) it designs the first GAN-based adversarial defense paradigm for occluded person re-ID. Extensive experiments on five public datasets fully demonstrate the effectiveness, superiority, and practicality of the proposed ETNDNet. The code will be released at \url{https://github.com/nengdong96/ETNDNet}.
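A minimal sketch of the erasing branch: zero out a random rectangular region of the feature map to create an incomplete, adversarial representation on which the identity loss is then trained. Shapes and the erasing policy here are illustrative assumptions, not the ETNDNet code.

```python
import torch

def erase_feature_map(feat: torch.Tensor, max_ratio: float = 0.5) -> torch.Tensor:
    """Zero a random rectangular region of a (B, C, H, W) feature map."""
    out = feat.clone()
    _, _, h, w = out.shape
    eh = torch.randint(1, max(2, int(h * max_ratio)), (1,)).item()
    ew = torch.randint(1, max(2, int(w * max_ratio)), (1,)).item()
    y = torch.randint(0, h - eh + 1, (1,)).item()
    x = torch.randint(0, w - ew + 1, (1,)).item()
    out[:, :, y:y + eh, x:x + ew] = 0.0   # simulated missing information
    return out
```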

TVPR: Text-to-Video Person Retrieval and a New Benchmark

  • paper_url: http://arxiv.org/abs/2307.07184
  • repo_url: None
  • paper_authors: Fan Ni, Xu Zhang, Jianhui Wu, Guan-Nan Dong, Aichun Zhu, Hui Liu, Yue Zhang
  • for: Improving text-based person retrieval, which suffers when the person is obscured in isolated frames or when variable motion details appear in the textual description.
  • methods: A new Text-to-Video Person Retrieval (TVPR) task; a large-scale cross-modal person video dataset (TVPReid) with detailed natural language annotations of appearance, actions, and interactions with the environment; and a Text-to-Video Person Retrieval Network (TVPRN) that fuses visual and motion representations of person videos and matches them against pre-trained BERT caption representations to find the most relevant videos.
  • results: Extensive experiments show TVPRN achieves state-of-the-art performance on the TVPReid dataset.
    Abstract Most existing methods for text-based person retrieval focus on text-to-image person retrieval. Nevertheless, due to the lack of dynamic information provided by isolated frames, the performance is hampered when the person is obscured in isolated frames or variable motion details are given in the textual description. In this paper, we propose a new task called Text-to-Video Person Retrieval(TVPR) which aims to effectively overcome the limitations of isolated frames. Since there is no dataset or benchmark that describes person videos with natural language, we construct a large-scale cross-modal person video dataset containing detailed natural language annotations, such as person's appearance, actions and interactions with environment, etc., termed as Text-to-Video Person Re-identification (TVPReid) dataset, which will be publicly available. To this end, a Text-to-Video Person Retrieval Network (TVPRN) is proposed. Specifically, TVPRN acquires video representations by fusing visual and motion representations of person videos, which can deal with temporal occlusion and the absence of variable motion details in isolated frames. Meanwhile, we employ the pre-trained BERT to obtain caption representations and the relationship between caption and video representations to reveal the most relevant person videos. To evaluate the effectiveness of the proposed TVPRN, extensive experiments have been conducted on TVPReid dataset. To the best of our knowledge, TVPRN is the first successful attempt to use video for text-based person retrieval task and has achieved state-of-the-art performance on TVPReid dataset. The TVPReid dataset will be publicly available to benefit future research.

DISPEL: Domain Generalization via Domain-Specific Liberating

  • paper_url: http://arxiv.org/abs/2307.07181
  • repo_url: None
  • paper_authors: Chia-Yuan Chang, Yu-Neng Chuang, Guanchu Wang, Mengnan Du, Na Zou
  • for: Domain generalization: learning a model on limited source domains that performs well on unseen test domains.
  • methods: Underlying feature groups are categorized into domain-shared and domain-specific features; since domain-specific features are hard to identify and isolate from the input data, DISPEL applies a post-processing fine-grained masking approach, with a mask generator producing a unique mask for each input to filter out undefined and indistinguishable domain-specific features in the embedding space.
  • results: A derived generalization error bound guarantees performance, and experiments on five benchmarks show DISPEL outperforms existing methods and can further generalize various algorithms.
    Abstract Domain generalization aims to learn a generalization model that can perform well on unseen test domains by only training on limited source domains. However, existing domain generalization approaches often bring in prediction-irrelevant noise or require the collection of domain labels. To address these challenges, we consider the domain generalization problem from a different perspective by categorizing underlying feature groups into domain-shared and domain-specific features. Nevertheless, the domain-specific features are difficult to be identified and distinguished from the input data. In this work, we propose DomaIn-SPEcific Liberating (DISPEL), a post-processing fine-grained masking approach that can filter out undefined and indistinguishable domain-specific features in the embedding space. Specifically, DISPEL utilizes a mask generator that produces a unique mask for each input data to filter domain-specific features. The DISPEL framework is highly flexible to be applied to any fine-tuned models. We derive a generalization error bound to guarantee the generalization performance by optimizing a designed objective loss. The experimental results on five benchmarks demonstrate DISPEL outperforms existing methods and can further generalize various algorithms.
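A minimal sketch of the post-processing masking idea: a small generator predicts a per-sample soft mask that gates the frozen embedding, suppressing dimensions that behave as domain-specific. The module name, gating form, and usage line are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class EmbeddingMasker(nn.Module):
    """Per-input soft mask applied on top of a frozen feature extractor."""

    def __init__(self, dim: int):
        super().__init__()
        self.mask_generator = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        mask = torch.sigmoid(self.mask_generator(embedding))   # values in (0, 1)
        return embedding * mask                                 # gate out some dimensions

# Illustrative usage: logits = classifier(EmbeddingMasker(512)(frozen_backbone(x)))
```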

Adaptive Region Selection for Active Learning in Whole Slide Image Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.07168
  • repo_url: https://github.com/deepmicroscopy/adaptiveregionselection
  • paper_authors: Jingna Qiu, Frauke Wilm, Mathias Öttl, Maja Schlereth, Chang Liu, Tobias Heimann, Marc Aubreville, Katharina Breininger
  • for: Reducing the annotation effort needed to train segmentation models on histological gigapixel-sized whole slide images (WSIs) through region-based active learning (AL), in which only a small set of annotated image regions is selected per image.
  • methods: A novel technique for selecting annotation regions adaptively, removing the dependence on the AL step size (the combination of region size and number of selected regions per WSI): an informative area is identified first and its optimal bounding box is then detected, instead of selecting rectangular regions of a fixed shape and size as in the standard method.
  • results: On breast cancer metastases segmentation with the public CAMELYON16 dataset, the method consistently achieves higher sampling efficiency than the standard method across AL step sizes, reaching full-annotation performance with only 2.6% of the tissue area annotated and thus substantially reducing annotation costs. Code: https://github.com/DeepMicroscopy/AdaptiveRegionSelection.
    Abstract The process of annotating histological gigapixel-sized whole slide images (WSIs) at the pixel level for the purpose of training a supervised segmentation model is time-consuming. Region-based active learning (AL) involves training the model on a limited number of annotated image regions instead of requesting annotations of the entire images. These annotation regions are iteratively selected, with the goal of optimizing model performance while minimizing the annotated area. The standard method for region selection evaluates the informativeness of all square regions of a specified size and then selects a specific quantity of the most informative regions. We find that the efficiency of this method highly depends on the choice of AL step size (i.e., the combination of region size and the number of selected regions per WSI), and a suboptimal AL step size can result in redundant annotation requests or inflated computation costs. This paper introduces a novel technique for selecting annotation regions adaptively, mitigating the reliance on this AL hyperparameter. Specifically, we dynamically determine each region by first identifying an informative area and then detecting its optimal bounding box, as opposed to selecting regions of a uniform predefined shape and size as in the standard method. We evaluate our method using the task of breast cancer metastases segmentation on the public CAMELYON16 dataset and show that it consistently achieves higher sampling efficiency than the standard method across various AL step sizes. With only 2.6\% of tissue area annotated, we achieve full annotation performance and thereby substantially reduce the costs of annotating a WSI dataset. The source code is available at https://github.com/DeepMicroscopy/AdaptiveRegionSelection.

Linking vision and motion for self-supervised object-centric perception

  • paper_url: http://arxiv.org/abs/2307.07147
  • repo_url: https://github.com/wayveai/socs
  • paper_authors: Kaylene C. Stocking, Zak Murez, Vijay Badrinarayanan, Jamie Shotton, Alex Kendall, Claire Tomlin, Christopher P. Burgess
  • for: Object-centric representations for autonomous driving, enabling reasoning about interactions between many independent agents and scene features.
  • methods: A self-supervised object-centric vision model that performs object decomposition using only RGB video and the vehicle pose as inputs.
  • results: Promising results on the Waymo Open perception dataset: although object mask quality lags behind supervised methods or alternatives with more privileged information, the model learns a representation that fuses multiple camera viewpoints over time and successfully tracks many vehicles and pedestrians.
    Abstract Object-centric representations enable autonomous driving algorithms to reason about interactions between many independent agents and scene features. Traditionally these representations have been obtained via supervised learning, but this decouples perception from the downstream driving task and could harm generalization. In this work we adapt a self-supervised object-centric vision model to perform object decomposition using only RGB video and the pose of the vehicle as inputs. We demonstrate that our method obtains promising results on the Waymo Open perception dataset. While object mask quality lags behind supervised methods or alternatives that use more privileged information, we find that our model is capable of learning a representation that fuses multiple camera viewpoints over time and successfully tracks many vehicles and pedestrians in the dataset. Code for our model is available at https://github.com/wayveai/SOCS.

Deteksi Sampah di Permukaan dan Dalam Perairan pada Objek Video dengan Metode Robust and Efficient Post-Processing dan Tubelet-Level Bounding Box Linking

  • paper_url: http://arxiv.org/abs/2307.10039
  • repo_url: None
  • paper_authors: Bryan Tjandra, Made S. N. Negara, Nyoo S. C. Handoko
  • for: Developing an automated trash-collecting robot to address the trash problem in Indonesian waters.
  • methods: The YOLOv5 model combined with the Robust & Efficient Post-Processing (REPP) method and tubelet-level bounding box linking on the FloW and Roboflow datasets; these methods improve naive object detection by taking detection results in adjacent frames into account.
  • results: The post-processing stage and tubelet-level bounding box linking improve detection quality by approximately 3% over YOLOv5 alone; the approach can detect surface and underwater trash and can be applied to a real-time image-based trash-collecting robot.
    Abstract Indonesia, as a maritime country, has a significant portion of its territory covered by water. Ineffective waste management has resulted in a considerable amount of trash in Indonesian waters, leading to various issues. The development of an automated trash-collecting robot can be a solution to address this problem. The robot requires a system capable of detecting objects in motion, such as in videos. However, using naive object detection methods in videos has limitations, particularly when image focus is reduced and the target object is obstructed by other objects. This paper's contribution provides an explanation of the methods that can be applied to perform video object detection in an automated trash-collecting robot. The study utilizes the YOLOv5 model and the Robust & Efficient Post Processing (REPP) method, along with tubelet-level bounding box linking on the FloW and Roboflow datasets. The combination of these methods enhances the performance of naive object detection from YOLOv5 by considering the detection results in adjacent frames. The results show that the post-processing stage and tubelet-level bounding box linking can improve the quality of detection, achieving approximately 3% better performance compared to YOLOv5 alone. The use of these methods has the potential to detect surface and underwater trash and can be applied to a real-time image-based trash-collecting robot. Implementing this system is expected to mitigate the damage caused by trash in the past and improve Indonesia's waste management system in the future.
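A minimal sketch of tubelet-level linking: per-frame detections are chained into tubelets by greedy IoU matching between consecutive frames, after which scores can be smoothed along each tubelet. REPP's actual linking and re-scoring are more involved; the threshold and data layout here are assumptions.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def link_tubelets(frames, thr: float = 0.4):
    """frames: list of per-frame box lists -> list of tubelets (lists of (frame_idx, box))."""
    tubelets = []
    for t, boxes in enumerate(frames):
        for box in boxes:
            best = None
            for tube in tubelets:
                last_t, last_box = tube[-1]
                if last_t == t - 1 and iou(last_box, box) > thr:
                    if best is None or iou(last_box, box) > iou(best[-1][1], box):
                        best = tube
            best.append((t, box)) if best else tubelets.append([(t, box)])
    return tubelets
```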

CFI2P: Coarse-to-Fine Cross-Modal Correspondence Learning for Image-to-Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2307.07142
  • repo_url: None
  • paper_authors: Gongxin Yao, Yixin Xuan, Yiwei Chen, Yu Pan
  • for: Image-to-point-cloud registration, i.e., establishing correspondences between point clouds and images.
  • methods: A coarse-to-fine framework working from the local perspective: correspondences are first established between local point sets and pixel patches (via point and pixel proxies, attention, and a projected point proportion loss), then refined at the point and pixel level through sampling, attentional learning, and fine matching; registration is finally solved with the EPnP algorithm within RANSAC.
  • results: Experiments on large-scale outdoor benchmarks demonstrate superiority over existing methods.
    Abstract In the context of image-to-point cloud registration, acquiring point-to-pixel correspondences presents a challenging task since the similarity between individual points and pixels is ambiguous due to the visual differences in data modalities. Nevertheless, the same object present in the two data formats can be readily identified from the local perspective of point sets and pixel patches. Motivated by this intuition, we propose a coarse-to-fine framework that emphasizes the establishment of correspondences between local point sets and pixel patches, followed by the refinement of results at both the point and pixel levels. On a coarse scale, we mimic the classic Visual Transformer to translate both image and point cloud into two sequences of local representations, namely point and pixel proxies, and employ attention to capture global and cross-modal contexts. To supervise the coarse matching, we propose a novel projected point proportion loss, which guides to match point sets with pixel patches where more points can be projected into. On a finer scale, point-to-pixel correspondences are then refined from a smaller search space (i.e., the coarsely matched sets and patches) via well-designed sampling, attentional learning and fine matching, where sampling masks are embedded in the last two steps to mitigate the negative effect of sampling. With the high-quality correspondences, the registration problem is then resolved by EPnP algorithm within RANSAC. Experimental results on large-scale outdoor benchmarks demonstrate our superiority over existing methods.

Fine-grained Text-Video Retrieval with Frozen Image Encoders

  • paper_url: http://arxiv.org/abs/2307.09972
  • repo_url: None
  • paper_authors: Zuozhuo Dai, Fangtao Shao, Qingkun Su, Zilong Dong, Siyu Zhu
  • for: Text-to-video retrieval (TVR): effectively relating textual queries to videos.
  • methods: A two-stage architecture, CrossTVR: the first stage uses existing TVR methods with a cosine similarity network for efficient text/video candidate selection; the second stage applies a novel decoupled video-text cross attention module to capture fine-grained multimodal information in the spatial and temporal dimensions, with a frozen CLIP strategy that scales to larger pre-trained vision models such as ViT-G.
  • results: Experiments on text-video retrieval datasets show CrossTVR outperforms state-of-the-art approaches.
    Abstract State-of-the-art text-video retrieval (TVR) methods typically utilize CLIP and cosine similarity for efficient retrieval. Meanwhile, cross attention methods, which employ a transformer decoder to compute attention between each text query and all frames in a video, offer a more comprehensive interaction between text and videos. However, these methods lack important fine-grained spatial information as they directly compute attention between text and video-level tokens. To address this issue, we propose CrossTVR, a two-stage text-video retrieval architecture. In the first stage, we leverage existing TVR methods with cosine similarity network for efficient text/video candidate selection. In the second stage, we propose a novel decoupled video text cross attention module to capture fine-grained multimodal information in spatial and temporal dimensions. Additionally, we employ the frozen CLIP model strategy in fine-grained retrieval, enabling scalability to larger pre-trained vision models like ViT-G, resulting in improved retrieval performance. Experiments on text video retrieval datasets demonstrate the effectiveness and scalability of our proposed CrossTVR compared to state-of-the-art approaches.

CeRF: Convolutional Neural Radiance Fields for New View Synthesis with Derivatives of Ray Modeling

  • paper_url: http://arxiv.org/abs/2307.07125
  • repo_url: None
  • paper_authors: Xiaoyan Yang, Dingbo Lu, Yang Li, Chenhui Li, Changbo Wang
  • for: novel view synthesis of high-fidelity images
  • methods: Convolutional Neural Radiance Fields with 1D convolutional operations and structured neural network architecture, and a proposed recurrent module to solve geometric ambiguity
  • results: promising results compared with existing state-of-the-art methods
    Abstract In recent years, novel view synthesis has gained popularity in generating high-fidelity images. While demonstrating superior performance in the task of synthesizing novel views, the majority of these methods are still based on the conventional multi-layer perceptron for scene embedding. Furthermore, light field models suffer from geometric blurring during pixel rendering, while radiance field-based volume rendering methods have multiple solutions for a certain target of density distribution integration. To address these issues, we introduce the Convolutional Neural Radiance Fields to model the derivatives of radiance along rays. Based on 1D convolutional operations, our proposed method effectively extracts potential ray representations through a structured neural network architecture. Besides, with the proposed ray modeling, a proposed recurrent module is employed to solve geometric ambiguity in the fully neural rendering process. Extensive experiments demonstrate the promising results of our proposed model compared with existing state-of-the-art methods.

Improved Flood Insights: Diffusion-Based SAR to EO Image Translation

  • paper_url: http://arxiv.org/abs/2307.07123
  • repo_url: None
  • paper_authors: Minseok Seo, Youngtack Oh, Doyi Kim, Dongmin Kang, Yeji Choi
  • for: Improving the interpretability of flood damage assessment by translating Synthetic Aperture Radar (SAR) imagery into Electro-Optical (EO) imagery for human analysts.
  • methods: A Diffusion-Based SAR to EO Image Translation (DSE) framework that converts SAR images into EO images.
  • results: On the Sen1Floods11 and SEN12-FLOOD datasets, DSE not only delivers enhanced visual information but also improves performance across all tested flood segmentation baselines.
    Abstract Driven by rapid climate change, the frequency and intensity of flood events are increasing. Electro-Optical (EO) satellite imagery is commonly utilized for rapid response. However, its utilities in flood situations are hampered by issues such as cloud cover and limitations during nighttime, making accurate assessment of damage challenging. Several alternative flood detection techniques utilizing Synthetic Aperture Radar (SAR) data have been proposed. Despite the advantages of SAR over EO in the aforementioned situations, SAR presents a distinct drawback: human analysts often struggle with data interpretation. To tackle this issue, this paper introduces a novel framework, Diffusion-Based SAR to EO Image Translation (DSE). The DSE framework converts SAR images into EO images, thereby enhancing the interpretability of flood insights for humans. Experimental results on the Sen1Floods11 and SEN12-FLOOD datasets confirm that the DSE framework not only delivers enhanced visual information but also improves performance across all tested flood segmentation baselines.

Achelous: A Fast Unified Water-surface Panoptic Perception Framework based on Fusion of Monocular Camera and 4D mmWave Radar

  • paper_url: http://arxiv.org/abs/2307.07102
  • repo_url: https://github.com/GuanRunwei/Achelous
  • paper_authors: Runwei Guan, Shanliang Yao, Xiaohui Zhu, Ka Lok Man, Eng Gee Lim, Jeremy Smith, Yong Yue, Yutao Yue
  • for: A low-cost, fast unified panoptic perception framework that raises the level of intelligence of water-surface autonomous driving.
  • methods: Fusion of a monocular camera and a 4D mmWave radar to simultaneously perform five tasks: detection and segmentation of visual targets, drivable-area segmentation, waterline segmentation, and radar point cloud segmentation.
  • results: Models in the Achelous family, with fewer than about 5 million parameters, reach about 18 FPS on an NVIDIA Jetson AGX Xavier (11 FPS faster than HybridNets) and exceed YOLOX-Tiny and Segformer-B0 on the collected dataset by about 5 mAP$_{\text{50-95}}$ and 0.7 mIoU, especially under adverse weather, dark environments, and camera failure.
    Abstract Current perception models for different tasks usually exist in modular forms on Unmanned Surface Vehicles (USVs), which infer extremely slowly in parallel on edge devices, causing the asynchrony between perception results and USV position, and leading to error decisions of autonomous navigation. Compared with Unmanned Ground Vehicles (UGVs), the robust perception of USVs develops relatively slowly. Moreover, most current multi-task perception models are huge in parameters, slow in inference and not scalable. Oriented on this, we propose Achelous, a low-cost and fast unified panoptic perception framework for water-surface perception based on the fusion of a monocular camera and 4D mmWave radar. Achelous can simultaneously perform five tasks, detection and segmentation of visual targets, drivable-area segmentation, waterline segmentation and radar point cloud segmentation. Besides, models in Achelous family, with less than around 5 million parameters, achieve about 18 FPS on an NVIDIA Jetson AGX Xavier, 11 FPS faster than HybridNets, and exceed YOLOX-Tiny and Segformer-B0 on our collected dataset about 5 mAP$_{\text{50-95}$ and 0.7 mIoU, especially under situations of adverse weather, dark environments and camera failure. To our knowledge, Achelous is the first comprehensive panoptic perception framework combining vision-level and point-cloud-level tasks for water-surface perception. To promote the development of the intelligent transportation community, we release our codes in \url{https://github.com/GuanRunwei/Achelous}.

Bootstrapping Vision-Language Learning with Decoupled Language Pre-training

  • paper_url: http://arxiv.org/abs/2307.07063
  • repo_url: https://github.com/yiren-jian/blitext
  • paper_authors: Yiren Jian, Chongyang Gao, Soroush Vosoughi
  • for: Optimizing the use of frozen large language models (LLMs) for resource-intensive vision-language (VL) pre-training.
  • methods: Instead of selecting visual features to prompt the language model, the approach concentrates on the language component: a Prompt-Transformer (P-Former) predicts the ideal prompts to align with visual features and is trained exclusively on linguistic data, bypassing image-text pairings; this bifurcates the end-to-end VL training process into an additional, separate stage.
  • results: The framework significantly improves a strong image-to-text baseline (BLIP-2), effectively narrows the performance gap between models trained with 4M or 129M image-text pairs, and, being modality-agnostic and flexible in architectural design, also transfers to a video learning task with varied base modules. Code: https://github.com/yiren-jian/BLIText.
    Abstract We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language (VL) pre-training. The current paradigm uses visual features as prompts to guide language models, with a focus on determining the most relevant visual features for corresponding text. Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features. We introduce the Prompt-Transformer (P-Former), a model that predicts these ideal prompts, which is trained exclusively on linguistic data, bypassing the need for image-text pairings. This strategy subtly bifurcates the end-to-end VL training process into an additional, separate stage. Our experiments reveal that our framework significantly enhances the performance of a robust image-to-text baseline (BLIP-2), and effectively narrows the performance gap between models trained with either 4M or 129M image-text pairs. Importantly, our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task using varied base modules. The code is available at https://github.com/yiren-jian/BLIText

AnyStar: Domain randomized universal star-convex 3D instance segmentation

  • paper_url: http://arxiv.org/abs/2307.07044
  • repo_url: https://github.com/neel-dey/anystar
  • paper_authors: Neel Dey, S. Mazdak Abulnaga, Benjamin Billot, Esra Abaci Turk, P. Ellen Grant, Adrian V. Dalca, Polina Golland
  • for: Instance segmentation of star-convex structures (nuclei, nodules, metastases) in bio-microscopy and radiology without large-scale manual annotation.
  • methods: A domain-randomized generative model that synthesizes training data of blob-like objects with randomized appearance, environments, and imaging physics, used to train a general-purpose star-convex instance segmentation network.
  • results: A single network trained on the synthesized data accurately 3D-segments C. elegans and P. dumerilii nuclei in fluorescence microscopy, mouse cortical nuclei in micro-CT, zebrafish brain nuclei in EM, and placental cotyledons in human fetal MRI, all without retraining, fine-tuning, transfer learning, or domain adaptation.
    Abstract Star-convex shapes arise across bio-microscopy and radiology in the form of nuclei, nodules, metastases, and other units. Existing instance segmentation networks for such structures train on densely labeled instances for each dataset, which requires substantial and often impractical manual annotation effort. Further, significant reengineering or finetuning is needed when presented with new datasets and imaging modalities due to changes in contrast, shape, orientation, resolution, and density. We present AnyStar, a domain-randomized generative model that simulates synthetic training data of blob-like objects with randomized appearance, environments, and imaging physics to train general-purpose star-convex instance segmentation networks. As a result, networks trained using our generative model do not require annotated images from unseen datasets. A single network trained on our synthesized data accurately 3D segments C. elegans and P. dumerilii nuclei in fluorescence microscopy, mouse cortical nuclei in micro-CT, zebrafish brain nuclei in EM, and placental cotyledons in human fetal MRI, all without any retraining, finetuning, transfer learning, or domain adaptation. Code is available at https://github.com/neel-dey/AnyStar.

Deepfake Video Detection Using Generative Convolutional Vision Transformer

  • paper_url: http://arxiv.org/abs/2307.07036
  • repo_url: https://github.com/erprogs/genconvit
  • paper_authors: Deressa Wodajo, Solomon Atnafu, Zahid Akhtar
  • for: Detecting deepfake videos, which have become a significant concern due to their potential to spread false information and compromise digital media integrity.
  • methods: The proposed Generative Convolutional Vision Transformer (GenConViT) combines ConvNeXt and Swin Transformer models for feature extraction and uses an Autoencoder and a Variational Autoencoder to learn from the latent data distribution.
  • results: The model detects a wide range of deepfake videos, achieving an average accuracy of 95.8% and an AUC of 99.3% across the tested datasets.
    Abstract Deepfakes have raised significant concerns due to their potential to spread false information and compromise digital media integrity. In this work, we propose a Generative Convolutional Vision Transformer (GenConViT) for deepfake video detection. Our model combines ConvNeXt and Swin Transformer models for feature extraction, and it utilizes Autoencoder and Variational Autoencoder to learn from the latent data distribution. By learning from the visual artifacts and latent data distribution, GenConViT achieves improved performance in detecting a wide range of deepfake videos. The model is trained and evaluated on DFDC, FF++, DeepfakeTIMIT, and Celeb-DF v2 datasets, achieving high classification accuracy, F1 scores, and AUC values. The proposed GenConViT model demonstrates robust performance in deepfake video detection, with an average accuracy of 95.8% and an AUC value of 99.3% across the tested datasets. Our proposed model addresses the challenge of generalizability in deepfake detection by leveraging visual and latent features and providing an effective solution for identifying a wide range of fake videos while preserving media integrity. The code for GenConViT is available at https://github.com/erprogs/GenConViT.

Tapestry of Time and Actions: Modeling Human Activity Sequences using Temporal Point Process Flows

  • paper_url: http://arxiv.org/abs/2307.10305
  • repo_url: None
  • paper_authors: Vinayak Gupta, Srikanta Bedathur
  • for: Modeling the continuous-time distribution of actions in human activity sequences while simultaneously addressing three high-impact problems: next action prediction, sequence-goal prediction, and end-to-end sequence generation.
  • methods: A neural marked temporal point process (MTPP) framework, ProActive, that uses a self-attention module with temporal normalizing flows to model the influence and inter-arrival times between actions, plus a novel variant that handles variations in the order of actions, i.e., different ways of achieving a given goal.
  • results: Extensive experiments on sequences derived from three activity recognition datasets show significant accuracy gains over the state of the art for action and goal prediction, and the first-ever application of end-to-end action sequence generation.
    Abstract Human beings always engage in a vast range of activities and tasks that demonstrate their ability to adapt to different scenarios. Any human activity can be represented as a temporal sequence of actions performed to achieve a certain goal. Unlike the time series datasets extracted from electronics or machines, these action sequences are highly disparate in their nature -- the time to finish a sequence of actions can vary between different persons. Therefore, understanding the dynamics of these sequences is essential for many downstream tasks such as activity length prediction, goal prediction, next action recommendation, etc. Existing neural network-based approaches that learn a continuous-time activity sequence (or CTAS) are limited to the presence of only visual data or are designed specifically for a particular task, i.e., limited to next action or goal prediction. In this paper, we present ProActive, a neural marked temporal point process (MTPP) framework for modeling the continuous-time distribution of actions in an activity sequence while simultaneously addressing three high-impact problems -- next action prediction, sequence-goal prediction, and end-to-end sequence generation. Specifically, we utilize a self-attention module with temporal normalizing flows to model the influence and the inter-arrival times between actions in a sequence. In addition, we propose a novel addition over the ProActive model that can handle variations in the order of actions, i.e., different methods of achieving a given goal. We demonstrate that this variant can learn the order in which the person or actor prefers to do their actions. Extensive experiments on sequences derived from three activity recognition datasets show the significant accuracy boost of ProActive over the state-of-the-art in terms of action and goal prediction, and the first-ever application of end-to-end action sequence generation.

Bridging the Gap: Heterogeneous Face Recognition with Conditional Adaptive Instance Modulation

  • paper_url: http://arxiv.org/abs/2307.07032
  • repo_url: None
  • paper_authors: Anjith George, Sebastien Marcel
  • for: Extending the applicability of face recognition (FR) systems by matching face images across different spectra, such as visible and thermal imagery (heterogeneous face recognition, HFR).
  • methods: Different modalities are treated as distinct styles; a Conditional Adaptive Instance Modulation (CAIM) module, integrated into pre-trained FR networks, modulates intermediate feature maps to adapt the style of the target modality, bridging the domain gap with end-to-end training on a minimal number of paired samples.
  • results: Extensive evaluation on multiple challenging benchmarks shows superior performance compared to state-of-the-art methods; the source code and protocols will be made publicly available.
    Abstract Heterogeneous Face Recognition (HFR) aims to match face images across different domains, such as thermal and visible spectra, expanding the applicability of Face Recognition (FR) systems to challenging scenarios. However, the domain gap and limited availability of large-scale datasets in the target domain make training robust and invariant HFR models from scratch difficult. In this work, we treat different modalities as distinct styles and propose a framework to adapt feature maps, bridging the domain gap. We introduce a novel Conditional Adaptive Instance Modulation (CAIM) module that can be integrated into pre-trained FR networks, transforming them into HFR networks. The CAIM block modulates intermediate feature maps, to adapt the style of the target modality effectively bridging the domain gap. Our proposed method allows for end-to-end training with a minimal number of paired samples. We extensively evaluate our approach on multiple challenging benchmarks, demonstrating superior performance compared to state-of-the-art methods. The source code and protocols for reproducing the findings will be made publicly available.
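The modulation of intermediate feature maps can be sketched in the spirit of adaptive instance normalization, where a conditioning vector (here standing in for the target modality) predicts a per-channel scale and shift. The exact CAIM design, conditioning signal, and placement in the network differ; this is only an illustrative sketch.

```python
import torch
import torch.nn as nn

class ConditionalModulation(nn.Module):
    """Modulate an intermediate (B, C, H, W) feature map given a modality embedding."""

    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * channels)

    def forward(self, feat: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)      # (B, C) each
        feat = self.norm(feat)
        return feat * (1 + scale[..., None, None]) + shift[..., None, None]
```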
    摘要 异构人脸识别(Heterogeneous Face Recognition, HFR)旨在匹配不同域(如热成像与可见光谱)的人脸图像,将人脸识别(FR)系统的适用范围扩展到具有挑战性的场景。然而,域间差距以及目标域大规模数据的缺乏,使得从头训练鲁棒且不变的HFR模型十分困难。在本工作中,我们将不同模态视为不同的风格,并提出一个适配特征图的框架来弥合域间差距。我们引入了一种新的条件自适应实例调制(Conditional Adaptive Instance Modulation, CAIM)模块,可以集成到预训练的FR网络中,将其转化为HFR网络。CAIM模块对中间特征图进行调制,以适应目标模态的风格,有效地弥合域间差距。我们提出的方法只需少量配对样本即可进行端到端训练。我们在多个具有挑战性的基准上进行了广泛评估,结果表明其性能优于当前最先进方法。用于复现结果的源代码和实验协议将公开发布。
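
A minimal sketch of what a conditional adaptive instance modulation block could look like, assuming an AdaIN-style design: the feature map is instance-normalized, re-styled with learned affine parameters, and blended with the original features through a learned gate. The class name, the gating scheme, and the channel size are assumptions for illustration, not the paper's CAIM implementation.

```python
import torch
import torch.nn as nn

class ConditionalAdaIN(nn.Module):
    """Instance-normalize a feature map and re-style it with learned affine
    parameters, blended with the original features through a learned gate."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.gate = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5: start as an even blend

    def forward(self, x):
        restyled = self.gamma * self.norm(x) + self.beta
        g = torch.sigmoid(self.gate)
        return g * restyled + (1.0 - g) * x

# inserted after an intermediate stage of a frozen face-recognition backbone
block = ConditionalAdaIN(channels=256)
feat = torch.randn(4, 256, 14, 14)
print(block(feat).shape)  # torch.Size([4, 256, 14, 14])
```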

Self-regulating Prompts: Foundational Model Adaptation without Forgetting

  • paper_url: http://arxiv.org/abs/2307.06948
  • repo_url: https://github.com/muzairkhattak/promptsrc
  • paper_authors: Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan
  • for: PromptSRC is designed to improve the performance of prompt learning on downstream tasks while maintaining the generalization ability of the pre-trained CLIP model.
  • methods: PromptSRC uses a self-regularization framework that combines mutual agreement maximization with the frozen model, a self-ensemble of prompts over the training trajectory, and textual diversity, guiding the prompts to optimize for both task-specific and task-agnostic general representations.
  • results: PromptSRC outperforms existing methods on 4 benchmarks, improving downstream performance while preserving the generalization ability of the pre-trained CLIP model.
    Abstract Prompt learning has emerged as an efficient alternative for fine-tuning foundational models, such as CLIP, for various downstream tasks. Conventionally trained using the task-specific objective, i.e., cross-entropy loss, prompts tend to overfit downstream data distributions and find it challenging to capture task-agnostic general features from the frozen CLIP. This leads to the loss of the model's original generalization capability. To address this issue, our work introduces a self-regularization framework for prompting called PromptSRC (Prompting with Self-regulating Constraints). PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations using a three-pronged approach by: (a) regulating prompted representations via mutual agreement maximization with the frozen model, (b) regulating with self-ensemble of prompts over the training trajectory to encode their complementary strengths, and (c) regulating with textual diversity to mitigate sample diversity imbalance with the visual branch. To the best of our knowledge, this is the first regularization framework for prompt learning that avoids overfitting by jointly attending to pre-trained model features, the training trajectory during prompting, and the textual diversity. PromptSRC explicitly steers the prompts to learn a representation space that maximizes performance on downstream tasks without compromising CLIP generalization. We perform extensive experiments on 4 benchmarks where PromptSRC overall performs favorably well compared to the existing methods. Our code and pre-trained models are publicly available at: https://github.com/muzairkhattak/PromptSRC.
    摘要 提示学习(prompt learning)已成为微调CLIP等基础模型以适配各种下游任务的高效替代方案。通常使用任务特定目标(即交叉熵损失)训练的提示容易过拟合下游数据分布,难以从冻结的CLIP中捕捉任务无关的通用特征,从而导致模型原有的泛化能力丢失。为解决这一问题,我们的工作提出了一种提示自正则化框架,称为PromptSRC(Prompting with Self-regulating Constraints)。PromptSRC通过三方面的方法引导提示同时优化任务特定表示和任务无关的通用表示:(a)通过与冻结模型的互相一致最大化来规范提示表示;(b)通过对训练轨迹上的提示进行自集成来编码它们互补的优势;(c)通过文本多样性来缓解与视觉分支之间的样本多样性不平衡。据我们所知,这是第一个通过同时关注预训练模型特征、提示训练轨迹和文本多样性来避免过拟合的提示学习正则化框架。PromptSRC显式地引导提示学习一个在不损害CLIP泛化能力的前提下最大化下游任务性能的表示空间。我们在4个基准上进行了广泛实验,PromptSRC总体上优于现有方法。我们的代码和预训练模型公开于:https://github.com/muzairkhattak/PromptSRC。
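
The sketch below illustrates the general shape of such a self-regularization objective: the usual cross-entropy on the prompted model is combined with agreement terms against the frozen CLIP's logits and features. The function name and loss weights are assumptions, and the paper's prompt self-ensembling and textual-diversity components are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def self_regularized_prompt_loss(prompted_logits, frozen_logits,
                                 prompted_feat, frozen_feat, labels,
                                 w_kl=1.0, w_feat=10.0):
    # supervised objective on the prompted model
    ce = F.cross_entropy(prompted_logits, labels)
    # agreement with the frozen CLIP predictions (KL divergence)
    kl = F.kl_div(F.log_softmax(prompted_logits, dim=-1),
                  F.softmax(frozen_logits, dim=-1), reduction="batchmean")
    # agreement with the frozen CLIP features (L1 on normalized embeddings)
    l1 = F.l1_loss(F.normalize(prompted_feat, dim=-1),
                   F.normalize(frozen_feat, dim=-1))
    return ce + w_kl * kl + w_feat * l1

B, C, D = 8, 100, 512
loss = self_regularized_prompt_loss(torch.randn(B, C), torch.randn(B, C),
                                    torch.randn(B, D), torch.randn(B, D),
                                    torch.randint(0, C, (B,)))
```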

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

  • paper_url: http://arxiv.org/abs/2307.06942
  • repo_url: https://github.com/opengvlab/internvideo
  • paper_authors: Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, Yu Qiao
  • for: 本研究旨在开发一个大规模的视频-文本多模态数据集,以便学习强大且可迁移的视频-文本表示,用于多模态理解和生成。
  • methods: 我们采用了一种可扩展的方法,利用大型自然语言模型(LLM)来自动建立高质量的视频-文本数据集,并在这个数据集上进行对比学习,以学习视频-文本表示。我们还提出了一种多尺度方法,用于生成视频相关的文本描述。
  • results: 我们的模型在视频识别和检索任务上表现出色,并且在视频-文本生成和对话系统等多模态应用中也具有广泛的应用前景。
    Abstract This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions of total 4.1B words. Our core contribution is to develop a scalable approach to autonomously build a high-quality video-text dataset with large language models (LLM), thereby showcasing its efficacy in learning video-language representation at scale. Specifically, we utilize a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. Learned on InternVid via contrastive learning, this model demonstrates leading zero-shot action recognition and competitive video retrieval performance. Beyond basic video understanding tasks like recognition and retrieval, our dataset and model have broad applications. They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system, advancing video-to-text and text-to-video generation research. These proposed resources provide a tool for researchers and practitioners interested in multimodal video understanding and generation.
    摘要 这篇论文介绍了InternVid,一个大规模的以视频为中心的多模态数据集,可用于学习强大且可迁移的视频-文本表示,以实现多模态理解与生成。InternVid数据集包含超过700万个视频,总时长接近76万小时,产生2.34亿个视频片段,并配有总计41亿词的详细描述。我们的核心贡献是开发了一种可扩展的方法,利用大语言模型(LLM)自动构建高质量的视频-文本数据集,从而展示其在大规模学习视频-语言表示方面的有效性。具体来说,我们采用多尺度方法生成与视频相关的描述。此外,我们介绍了ViCLIP,一种基于ViT-L的视频-文本表示学习模型。该模型通过对比学习在InternVid上训练,在零样本动作识别上表现领先,在视频检索上也具有竞争力。除识别和检索等基础视频理解任务外,我们的数据集和模型还有广泛的应用,尤其有利于生成交错的视频-文本数据以学习以视频为中心的对话系统,并推动视频到文本和文本到视频生成的研究。这些资源为关注多模态视频理解与生成的研究人员和实践者提供了工具。
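
For reference, the contrastive learning mentioned above is typically implemented as the symmetric InfoNCE objective used by CLIP-style models; a minimal sketch is shown below, with the video and text encoders omitted and the temperature chosen as a common default rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature           # (B, B) pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = video_text_contrastive_loss(torch.randn(32, 768), torch.randn(32, 768))
```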

Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation

  • paper_url: http://arxiv.org/abs/2307.06940
  • repo_url: https://github.com/videocrafter/animate-a-story
  • paper_authors: Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, Qifeng Chen
  • for: 用于生成视觉故事,减少实际拍摄或动画渲染的复杂性。
  • methods: 检索并整合现有视频片段以提供可控的场景与运动上下文,并由文本提示引导的可控视频生成模型进行合成。
  • results: 相比现有基线具有显著优势,能够生成场景与动作可控、由文本提示引导的视觉叙事视频。
    Abstract Generating videos for visual storytelling can be a tedious and complex process that typically requires either live-action filming or graphics animation rendering. To bypass these challenges, our key idea is to utilize the abundance of existing video clips and synthesize a coherent storytelling video by customizing their appearances. We achieve this by developing a framework comprised of two functional modules: (i) Motion Structure Retrieval, which provides video candidates with desired scene or motion context described by query texts, and (ii) Structure-Guided Text-to-Video Synthesis, which generates plot-aligned videos under the guidance of motion structure and text prompts. For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure. For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters. The videos are synthesized by following the structural guidance and appearance instruction. To ensure visual consistency across clips, we propose an effective concept personalization approach, which allows the specification of the desired character identities through text prompts. Extensive experiments demonstrate that our approach exhibits significant advantages over various existing baselines.
    摘要 为视觉叙事生成视频往往是一个繁琐而复杂的过程,通常需要实景拍摄或图形动画渲染。为绕过这些挑战,我们的核心想法是利用大量现有的视频片段,通过定制其外观来合成连贯的叙事视频。我们的框架包括两个功能模块:1. 动作结构检索(Motion Structure Retrieval),根据查询文本提供具有所需场景或运动上下文的候选视频;2. 结构引导的文本到视频合成(Structure-Guided Text-to-Video Synthesis),在运动结构和文本提示的引导下生成与情节对齐的视频。对于第一个模块,我们利用现成的视频检索系统,并提取视频深度作为运动结构;对于第二个模块,我们提出一种可控的视频生成模型,对结构和人物提供灵活的控制,视频在结构引导和外观指令下合成。为保证各片段之间的视觉一致性,我们提出了一种有效的概念个性化方法,允许通过文本提示指定所需的人物身份。大量实验表明,我们的方法相对于多种现有基线具有显著优势。
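
The control flow of the two-module pipeline described above can be sketched as follows; every function here is an illustrative stub with invented names and signatures, standing in for the retrieval system, the depth estimator, and the structure-guided generator respectively, not the authors' API.

```python
from typing import List

def retrieve_clips(query: str, top_k: int = 3) -> List[str]:
    """Stand-in for an off-the-shelf text-to-video retrieval system."""
    return [f"clip_{i}.mp4" for i in range(top_k)]

def extract_motion_structure(clip_path: str, num_frames: int = 16) -> List[str]:
    """Stand-in for per-frame depth estimation used as the motion structure."""
    return [f"{clip_path}#depth_{t}" for t in range(num_frames)]

def synthesize_video(structure: List[str], prompt: str, character_token: str) -> str:
    """Stand-in for the structure-guided, character-personalized generator."""
    return (f"video({len(structure)} structure frames, prompt={prompt!r}, "
            f"character={character_token!r})")

def animate_story(shot_prompts: List[str], character_token: str) -> List[str]:
    videos = []
    for prompt in shot_prompts:
        clip = retrieve_clips(prompt, top_k=1)[0]   # 1) retrieve a clip with the desired motion
        structure = extract_motion_structure(clip)  # 2) keep only its depth/motion structure
        videos.append(synthesize_video(structure, prompt, character_token))  # 3) re-render appearance
    return videos

print(animate_story(["a knight walks into a castle", "the knight draws a sword"], "<hero>"))
```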

Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models

  • paper_url: http://arxiv.org/abs/2307.06925
  • repo_url: None
  • paper_authors: Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Daniel Cohen-Or, Ariel Shamir, Amit H. Bermano
  • for: This paper is written for personalizing text-to-image (T2I) generation, allowing users to guide the creative image generation process by combining their own visual concepts in natural language prompts.
  • methods: The paper proposes a domain-agnostic method for T2I personalization that does not require any specialized dataset or prior information about the personalized concepts. The method uses a novel contrastive-based regularization technique to maintain high fidelity to the target concept characteristics while keeping the predicted embeddings close to editable regions of the latent space.
  • results: The experimental results demonstrate the effectiveness of the proposed approach, showing that the learned tokens are more semantic than tokens predicted by unregularized models. This leads to a better representation that achieves state-of-the-art performance while being more flexible than previous methods.
    Abstract Text-to-image (T2I) personalization allows users to guide the creative image generation process by combining their own visual concepts in natural language prompts. Recently, encoder-based techniques have emerged as a new effective approach for T2I personalization, reducing the need for multiple images and long training times. However, most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts. In this work, we propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts. We introduce a novel contrastive-based regularization technique to maintain high fidelity to the target concept characteristics while keeping the predicted embeddings close to editable regions of the latent space, by pushing the predicted tokens toward their nearest existing CLIP tokens. Our experimental results demonstrate the effectiveness of our approach and show how the learned tokens are more semantic than tokens predicted by unregularized models. This leads to a better representation that achieves state-of-the-art performance while being more flexible than previous methods.
    摘要 文本到图像(T2I)个性化允许用户通过在自然语言提示中结合自己的视觉概念来引导创意图像生成过程。最近,基于编码器的技术成为T2I个性化的一种新的有效方法,减少了对多张图像和长时间训练的需求。然而,现有的大多数编码器仅限于单一类别领域,难以处理多样化的概念。在本工作中,我们提出了一种领域无关的方法,不需要任何专门的数据集或关于个性化概念的先验信息。我们引入了一种新的基于对比的正则化技术,通过将预测的token推向其最近的已有CLIP token,在保持对目标概念特征高保真度的同时,使预测的嵌入靠近潜在空间中的可编辑区域。实验结果证明了我们方法的有效性,并表明所学习的token比未正则化模型预测的token更具语义性,从而得到更好的表示,在比以往方法更灵活的同时实现了最先进的性能。
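
A rough sketch of the idea of keeping predicted embeddings near existing CLIP tokens: find each predicted embedding's nearest entries in the frozen token table and penalize the distance to them. The simple MSE pull below stands in for the paper's contrastive formulation, and the function name, neighbor count, and vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def nearest_token_regularizer(pred_emb, vocab_emb, k=3):
    # pred_emb: (B, D) predicted concept embeddings
    # vocab_emb: (V, D) frozen CLIP token-embedding table
    sims = F.normalize(pred_emb, dim=-1) @ F.normalize(vocab_emb, dim=-1).T  # (B, V)
    nn_idx = sims.topk(k, dim=-1).indices                                    # (B, k) nearest tokens
    anchors = vocab_emb[nn_idx].mean(dim=1)                                  # (B, D) average neighbor
    return F.mse_loss(pred_emb, anchors.detach())

reg = nearest_token_regularizer(torch.randn(4, 768), torch.randn(49408, 768))
```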

LVLane: Deep Learning for Lane Detection and Classification in Challenging Conditions

  • paper_url: http://arxiv.org/abs/2307.06853
  • repo_url: https://github.com/zillur-av/LVLane
  • paper_authors: Zillur Rahman, Brendan Tran Morris
  • for: 该论文的目的是提出一种基于深度学习的端到端车道检测和分类系统,以提高自动驾驶和高级驾驶辅助系统的性能。
  • methods: 该系统使用了深度学习方法,包括一个特有的数据集,以捕捉具有特殊挑战的车道检测场景。此外,该系统还提出了一种基于Convolutional Neural Networks (CNN)的分类分支,以便识别不同的车道类型。
  • results: 实验结果表明,该系统在TuSimple数据集、Caltech Lane数据集和LVLane数据集等多个数据集上具有优秀的车道检测和分类能力,特别是在面临特殊挑战的场景下表现出色。
    Abstract Lane detection plays a pivotal role in the field of autonomous vehicles and advanced driving assistant systems (ADAS). Despite advances from image processing to deep learning based models, algorithm performance is highly dependent on training data matching the local challenges such as extreme lighting conditions, partially visible lane markings, and sparse lane markings like Botts' dots. To address this, we present an end-to-end lane detection and classification system based on deep learning methodologies. In our study, we introduce a unique dataset meticulously curated to encompass scenarios that pose significant challenges for state-of-the-art (SOTA) lane localization models. Moreover, we propose a CNN-based classification branch, seamlessly integrated with the detector, facilitating the identification of distinct lane types. This architecture enables informed lane-changing decisions and empowers more resilient ADAS capabilities. We also investigate the effect of using mixed precision training and testing on different models and batch sizes. Experimental evaluations conducted on the widely-used TuSimple dataset, Caltech Lane dataset, and our LVLane dataset demonstrate the effectiveness of our model in accurately detecting and classifying lanes amidst challenging scenarios. Our method achieves state-of-the-art classification results on the TuSimple dataset. The code of the work can be found on www.github.com/zillur-av/LVLane.
    摘要 车道检测在自动驾驶和高级驾驶辅助系统(ADAS)领域中起着关键作用。尽管算法已从图像处理发展到基于深度学习的模型,其性能仍高度依赖于训练数据能否匹配本地挑战,例如极端光照条件、部分可见的车道标线以及Botts' dots等稀疏车道标线。为此,我们提出了一个基于深度学习方法的端到端车道检测与分类系统。在研究中,我们精心构建了一个独特的数据集,涵盖对当前最先进(SOTA)车道定位模型构成重大挑战的场景。此外,我们提出了一个基于CNN的分类分支,与检测器无缝集成,用于识别不同的车道类型。该架构支持更有依据的变道决策,并赋予ADAS更强的鲁棒性。我们还研究了在不同模型和批大小下使用混合精度训练与测试的影响。在广泛使用的TuSimple数据集、Caltech Lane数据集以及我们的LVLane数据集上进行的实验评估表明,我们的模型能够在具有挑战性的场景中准确地检测和分类车道。我们的方法在TuSimple数据集上取得了最先进的分类结果。代码见 www.github.com/zillur-av/LVLane。
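
The mixed-precision training mentioned in the abstract is commonly implemented with PyTorch's automatic mixed precision; a minimal sketch of such a training step is shown below, with the model, data loader, and loss function left as placeholders (this is a generic recipe, not the authors' training code).

```python
import torch

def train_one_epoch(model, loader, optimizer, criterion, device="cuda"):
    scaler = torch.cuda.amp.GradScaler()
    model.train()
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():   # run the forward pass in float16 where safe
            loss = criterion(model(images), targets)
        scaler.scale(loss).backward()     # scale the loss to avoid fp16 gradient underflow
        scaler.step(optimizer)            # unscale gradients and apply the optimizer step
        scaler.update()
```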