2023-07-07

cs.CV

Detecting the Sensing Area of A Laparoscopic Probe in Minimally Invasive Cancer Surgery

paper_url: http://arxiv.org/abs/2307.03662
repo_url: https://github.com/br0202/sensing_area_detection
paper_authors: Baoru Huang, Yicheng Hu, Anh Nguyen, Stamatia Giannarou, Daniel S. Elson
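for: Surgical guidance and tumor detection in minimally invasive cancer surgery.
methods: Uses a novel tethered laparoscopic gamma detector to localize a preoperatively injected radiotracer.
results: By leveraging high-dimensional image features and probe position information, the method resolves the gamma activity visualization problem and establishes a new performance benchmark.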

Abstract
In surgical oncology, it is challenging for surgeons to identify lymph nodes and completely resect cancer even with pre-operative imaging systems like PET and CT, because of the lack of reliable intraoperative visualization tools. Endoscopic radio-guided cancer detection and resection has recently been evaluated whereby a novel tethered laparoscopic gamma detector is used to localize a preoperatively injected radiotracer. This can both enhance the endoscopic imaging and complement preoperative nuclear imaging data. However, gamma activity visualization is challenging to present to the operator because the probe is non-imaging and it does not visibly indicate the activity origination on the tissue surface. Initial failed attempts used segmentation or geometric methods, but led to the discovery that it could be resolved by leveraging high-dimensional image features and probe position information. To demonstrate the effectiveness of this solution, we designed and implemented a simple regression network that successfully addressed the problem. To further validate the proposed solution, we acquired and publicly released two datasets captured using a custom-designed, portable stereo laparoscope system. Through intensive experimentation, we demonstrated that our method can successfully and effectively detect the sensing area, establishing a new performance benchmark. Code and data are available at https://github.com/br0202/Sensing_area_detection.git

Summary
In surgical oncology, surgeons struggle to identify lymph nodes and completely resect cancer even with preoperative imaging systems such as PET and CT, because reliable intraoperative visualization tools are lacking. Endoscopic radio-guided cancer detection and resection has recently been evaluated; it uses a novel tethered laparoscopic gamma detector to localize a preoperatively injected radiotracer, which can both enhance endoscopic imaging and complement preoperative nuclear imaging data. Gamma activity is hard to present to the operator, however, because the probe is non-imaging and does not visibly indicate where on the tissue surface the activity originates. Initial attempts based on segmentation or geometric methods were unsuccessful. Instead, we found that the problem could be resolved by leveraging high-dimensional image features and probe position information. To demonstrate this, we designed and implemented a simple regression network that successfully addressed the problem. To further validate the solution, we acquired and publicly released two datasets captured with a custom-designed, portable stereo laparoscope system. Intensive experiments show that our method detects the sensing area successfully and effectively, establishing a new performance benchmark. Code and data are available at https://github.com/br0202/Sensing_area_detection.git

Robust Human Detection under Visual Degradation via Thermal and mmWave Radar Fusion

paper_url: http://arxiv.org/abs/2307.03623
repo_url: https://github.com/ramdrop/utm
paper_authors: Kaiwen Cai, Qiyue Xia, Peize Li, John Stankovic, Chris Xiaoxuan Lu
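for: Proposes a multimodal human detection system for scenarios with degraded visual conditions.
methods: Combines portable thermal cameras and single-chip mmWave radars, with a Bayesian feature extractor and an uncertainty-guided fusion method that mitigate the noisy detection features of thermal cameras and the multi-path noise of radar point clouds.
results: Evaluation on real-world data shows that the method clearly outperforms a variety of competing single-modal and multi-modal methods.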

Abstract
The majority of human detection methods rely on sensors that use visible light (e.g., RGB cameras), but such sensors are limited in scenarios with degraded vision conditions. In this paper, we present a multimodal human detection system that combines portable thermal cameras and single-chip mmWave radars. To mitigate the noisy detection features caused by the low contrast of thermal cameras and the multi-path noise of radar point clouds, we propose a Bayesian feature extractor and a novel uncertainty-guided fusion method that surpasses a variety of competing methods, either single-modal or multi-modal. We evaluate the proposed method on a real-world data collection and demonstrate that our approach outperforms the state-of-the-art methods by a large margin.

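Summary
Most human detection methods rely on visible-light sensors (e.g., RGB cameras), which are of limited use under degraded vision conditions. In this paper, we present a multimodal human detection system that combines portable thermal cameras and single-chip mmWave radars. To mitigate the noisy detection features of thermal cameras and the multi-path noise of radar point clouds, we propose a Bayesian feature extractor and a novel uncertainty-guided fusion method. We evaluate the method on real-world data and show a clear advantage over the compared methods.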
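
The abstract does not spell out how the uncertainty-guided fusion works. As a rough illustration of the general idea only, the sketch below fuses two per-feature estimates by inverse-variance weighting, so the less certain modality contributes less. The function name, the per-feature mean and variance inputs, and the weighting rule itself are assumptions made for this example, not the authors' method.

```python
def fuse_by_uncertainty(thermal_mean, thermal_var, radar_mean, radar_var):
    """Fuse two per-feature estimates by inverse-variance weighting.

    Hypothetical helper: each modality supplies a mean and a positive
    variance per feature; a larger variance means a smaller weight.
    """
    fused = []
    for t_m, t_v, r_m, r_v in zip(thermal_mean, thermal_var, radar_mean, radar_var):
        w_t = 1.0 / t_v  # confidence of the thermal feature
        w_r = 1.0 / r_v  # confidence of the radar feature
        fused.append((w_t * t_m + w_r * r_m) / (w_t + w_r))
    return fused


# Equal uncertainty averages the two values; a noisier radar feature
# pulls the fused value toward the thermal estimate.
equal = fuse_by_uncertainty([1.0], [1.0], [3.0], [1.0])
noisy_radar = fuse_by_uncertainty([0.0], [1.0], [4.0], [3.0])
```

A learned fusion module would typically predict the variances (or weights) from the features themselves; here they are simply passed in.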

Depth Estimation Analysis of Orthogonally Divergent Fisheye Cameras with Distortion Removal

paper_url: http://arxiv.org/abs/2307.03602
repo_url: None
paper_authors: Matvei Panteleev, Houari Bettahar
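for: Improve distortion removal and depth estimation accuracy for fisheye cameras.
methods: Uses two virtual pinhole cameras (VPC); each VPC captures a small region of the view and presents it without lens distortion, emulating the behavior of a pinhole camera.
results: Comparisons in a virtual environment and in experiments with real cameras show that the proposed method reduces distortion and improves depth estimation accuracy.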

Abstract
Stereo vision systems have become popular in computer vision applications, such as 3D reconstruction, object tracking, and autonomous navigation. However, traditional stereo vision systems that use rectilinear lenses may not be suitable for certain scenarios due to their limited field of view. This has led to the popularity of vision systems based on one or multiple fisheye cameras in different orientations, which can provide a field of view of 180x180 degrees or more. However, fisheye cameras introduce significant distortion at the edges that affects the accuracy of stereo matching and depth estimation. To overcome these limitations, this paper proposes a method for distortion-removal and depth estimation analysis for stereovision system using orthogonally divergent fisheye cameras (ODFC). The proposed method uses two virtual pinhole cameras (VPC), each VPC captures a small portion of the original view and presents it without any lens distortions, emulating the behavior of a pinhole camera. By carefully selecting the captured regions, it is possible to create a stereo pair using two VPCs. The performance of the proposed method is evaluated in both simulation using virtual environment and experiments using real cameras and their results compared to stereo cameras with parallel optical axes. The results demonstrate the effectiveness of the proposed method in terms of distortion removal and depth estimation accuracy.

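Summary
Stereo vision systems have become popular in computer vision applications such as 3D reconstruction, object tracking, and autonomous navigation. Traditional stereo systems with rectilinear lenses, however, may not suit some scenarios because of their limited field of view. This has made vision systems based on one or more fisheye cameras in different orientations popular, since they can provide a field of view of 180x180 degrees or more. Fisheye cameras, however, introduce significant distortion at the edges, which affects the accuracy of stereo matching and depth estimation. To address these limitations, this paper proposes a distortion-removal and depth estimation analysis method for fisheye-based stereo vision systems. The method uses two virtual pinhole cameras (VPC); each VPC captures a small portion of the original view and presents it without lens distortion, emulating a pinhole camera. By carefully selecting the captured regions, a stereo pair can be created from the two VPCs. Experiments show that the proposed method reduces the effect of fisheye distortion and improves depth estimation accuracy.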
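
To make the virtual pinhole camera idea concrete, the sketch below maps one pixel of a virtual pinhole view back to the fisheye image it would be sampled from. It assumes an ideal equidistant fisheye model (r = f * theta) and a virtual camera rotated only about the vertical axis; the paper's actual lens model and calibration are not given in the abstract, and all parameter names here are hypothetical.

```python
import math


def pinhole_pixel_to_fisheye(u, v, f_pin, cx_pin, cy_pin, yaw,
                             f_fish, cx_fish, cy_fish):
    """Return the fisheye pixel that a virtual pinhole pixel (u, v) samples.

    Assumptions: equidistant fisheye projection and a virtual camera that
    is rotated by `yaw` radians about the fisheye camera's vertical axis.
    """
    # Back-project the pinhole pixel to a ray in the virtual camera frame.
    x = (u - cx_pin) / f_pin
    y = (v - cy_pin) / f_pin
    z = 1.0
    # Rotate the ray into the fisheye camera frame (rotation about y).
    xr = math.cos(yaw) * x + math.sin(yaw) * z
    yr = y
    zr = -math.sin(yaw) * x + math.cos(yaw) * z
    # Equidistant model: image radius grows linearly with the ray angle.
    theta = math.atan2(math.hypot(xr, yr), zr)  # angle from the optical axis
    phi = math.atan2(yr, xr)                    # direction in the image plane
    r = f_fish * theta
    return cx_fish + r * math.cos(phi), cy_fish + r * math.sin(phi)


# The centre of an unrotated virtual view lands on the fisheye centre.
centre = pinhole_pixel_to_fisheye(320, 240, 400.0, 320, 240, 0.0,
                                  300.0, 640.0, 480.0)
# Turning the virtual camera by 45 degrees shifts the centre sideways.
turned = pinhole_pixel_to_fisheye(320, 240, 400.0, 320, 240, math.pi / 4,
                                  300.0, 640.0, 480.0)
```

Evaluating this mapping for every pixel of the virtual image gives a lookup table for resampling the fisheye image into a distortion-free view; two such views with overlapping content can then serve as a stereo pair.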

GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

paper_url: http://arxiv.org/abs/2307.03601
repo_url: https://github.com/jshilong/gpt4roi
paper_authors: Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, Ping Luo
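for: Improve the fine-grained multimodal understanding of large language models (LLMs) trained on image-text pairs by performing instruction tuning at the region level.
methods: Reformulates the bounding box as a spatial instruction; the visual features it extracts are interleaved with the language embeddings and fed to the LLM for training.
results: The resulting region-level vision-language model, GPT4RoI, offers additional region-level multimodal capabilities such as detailed region captioning and complex region reasoning. Users can interact through both language and spatial instructions and can control the level of detail with different region instructions.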

Abstract
Instruction tuning large language model (LLM) on image-text pairs has achieved unprecedented vision-language multimodal abilities. However, their vision-language alignments are only built on image-level, the lack of region-level alignment limits their advancements to fine-grained multimodal understanding. In this paper, we propose instruction tuning on region-of-interest. The key design is to reformulate the bounding box as the format of spatial instruction. The interleaved sequences of visual features extracted by the spatial instruction and the language embedding are input to LLM, and trained on the transformed region-text data in instruction tuning format. Our region-level vision-language model, termed as GPT4RoI, brings brand new conversational and interactive experience beyond image-level understanding. (1) Controllability: Users can interact with our model by both language and spatial instructions to flexibly adjust the detail level of the question. (2) Capacities: Our model supports not only single-region spatial instruction but also multi-region. This unlocks more region-level multimodal capacities such as detailed region caption and complex region reasoning. (3) Composition: Any off-the-shelf object detector can be a spatial instruction provider so as to mine informative object attributes from our model, like color, shape, material, action, relation to other objects, etc. The code, data, and demo can be found at https://github.com/jshilong/GPT4RoI.

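Summary
Instruction tuning of large language models (LLMs) on image-text pairs has achieved unprecedented vision-language multimodal abilities. However, the vision-language alignment is built only at the image level, and the lack of region-level alignment limits progress toward fine-grained multimodal understanding. In this paper, we propose instruction tuning at the region level. The key design is to reformulate the bounding box as a spatial instruction. The interleaved visual features and language embeddings are fed to the LLM, which is instruction-tuned on the transformed region-text data. Our region-level vision-language model, GPT4RoI, offers a new conversational and interactive experience beyond image-level understanding. (1) Controllability: users can flexibly adjust the level of detail of a question through both language and spatial instructions. (2) Capacities: the model supports both single-region and multi-region spatial instructions, which unlocks more region-level multimodal capacities such as detailed region captioning and complex region reasoning. (3) Composition: any off-the-shelf object detector can act as a spatial instruction provider, mining informative object attributes from the model such as color, shape, material, action, and relation to other objects. The code, data, and demo can be found at https://github.com/jshilong/GPT4RoI.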

Unsupervised Segmentation of Fetal Brain MRI using Deep Learning Cascaded Registration

paper_url: http://arxiv.org/abs/2307.03579
repo_url: https://github.com/valbcn/casreg
paper_authors: Valentin Comte, Mireia Alenya, Andrea Urru, Judith Recober, Ayako Nakaki, Francesca Crovetto, Oscar Camara, Eduard Gratacós, Elisenda Eixarch, Fàtima Crispi, Gemma Piella, Mario Ceresa, Miguel A. González Ballester
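for: Improve the accuracy of automatic fetal brain magnetic resonance imaging (MRI) segmentation, in order to analyze fetal brain development and detect potential neurodevelopmental abnormalities.
methods: Proposes a new unsupervised segmentation method based on multi-atlas segmentation. A cascaded deep learning network performs 3D image registration by computing small, incremental deformations that align the moving image precisely with the fixed image. The network registers multiple annotated images to the target image, and the propagated labels are combined into a refined segmentation.
results: The cascaded registration network outperforms the registration methods that were tested, and the derived multi-atlas segmentation performs comparably to nnU-Net, which is trained on large amounts of annotated data. The method uses only a small subset of annotated data for the multi-atlas segmentation task and none for training the network.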

Abstract
Accurate segmentation of fetal brain magnetic resonance images is crucial for analyzing fetal brain development and detecting potential neurodevelopmental abnormalities. Traditional deep learning-based automatic segmentation, although effective, requires extensive training data with ground-truth labels, typically produced by clinicians through a time-consuming annotation process. To overcome this challenge, we propose a novel unsupervised segmentation method based on multi-atlas segmentation, that accurately segments multiple tissues without relying on labeled data for training. Our method employs a cascaded deep learning network for 3D image registration, which computes small, incremental deformations to the moving image to align it precisely with the fixed image. This cascaded network can then be used to register multiple annotated images with the image to be segmented, and combine the propagated labels to form a refined segmentation. Our experiments demonstrate that the proposed cascaded architecture outperforms the state-of-the-art registration methods that were tested. Furthermore, the derived segmentation method achieves similar performance and inference time to nnU-Net while only using a small subset of annotated data for the multi-atlas segmentation task and none for training the network. Our pipeline for registration and multi-atlas segmentation is publicly available at https://github.com/ValBcn/CasReg.

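Summary
Accurate segmentation of fetal brain magnetic resonance images is key to analyzing fetal brain development and detecting potential neurodevelopmental abnormalities. Traditional deep learning-based automatic segmentation, although effective, requires large amounts of labeled training data, typically produced by clinicians through a time-consuming annotation process. To overcome this challenge, we propose a new unsupervised segmentation method based on multi-atlas segmentation that accurately segments multiple tissues without labeled training data. Our method uses a cascaded deep learning network for 3D image registration, which computes small, incremental deformations to align the moving image precisely with the fixed image. The cascaded network can register multiple annotated images to the image to be segmented and combine the propagated labels into a refined segmentation. Our experiments show that the proposed cascaded architecture outperforms the state-of-the-art registration methods that were tested. Moreover, the segmentation method performs similarly to nnU-Net while using only a small amount of annotated data for the multi-atlas segmentation task and none for training the network. Our registration and multi-atlas segmentation pipeline is available at https://github.com/ValBcn/CasReg.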
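
The label-combination step of multi-atlas segmentation can be sketched with the simplest fusion rule, a per-voxel majority vote over the labels propagated from each registered atlas. The abstract does not say which fusion rule CasReg uses, so majority voting is an assumption chosen for illustration, and the flat-list label maps and function name are hypothetical.

```python
from collections import Counter


def majority_vote(propagated_labels):
    """Fuse label maps from several registered atlases by majority vote.

    `propagated_labels` is a list of label maps, one per atlas, each given
    as a flat list of integer tissue labels with one entry per voxel.
    """
    fused = []
    for votes in zip(*propagated_labels):
        # Keep the label proposed by the largest number of atlases.
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused


# Three atlases vote on three voxels.
atlas_labels = [
    [0, 1, 2],
    [0, 1, 1],
    [0, 2, 1],
]
segmentation = majority_vote(atlas_labels)
```

With an odd number of atlases and two candidate labels there are no ties; in general a tie-breaking rule, or a weighted vote based on local registration quality, would be needed.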

SpawnNet: Learning Generalizable Visuomotor Skills from Pre-trained Networks

paper_url: http://arxiv.org/abs/2307.03567
repo_url: https://github.com/johnrso/spawnnet
paper_authors: Xingyu Lin, John So, Sashwat Mahalingam, Fangchen Liu, Pieter Abbeel
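for: Investigates how pre-trained visual representations can improve the generalization of learned policies.
methods: Proposes SpawnNet, a novel two-stream architecture that fuses pre-trained multi-layer representations into a separate network to learn a robust policy.
results: In simulated and real experiments, SpawnNet shows significantly better categorical generalization than prior approaches.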

Abstract
The existing internet-scale image and video datasets cover a wide range of everyday objects and tasks, bringing the potential of learning policies that have broad generalization. Prior works have explored visual pre-training with different self-supervised objectives, but the generalization capabilities of the learned policies remain relatively unknown. In this work, we take the first step towards this challenge, focusing on how pre-trained representations can help the generalization of the learned policies. We first identify the key bottleneck in using a frozen pre-trained visual backbone for policy learning. We then propose SpawnNet, a novel two-stream architecture that learns to fuse pre-trained multi-layer representations into a separate network to learn a robust policy. Through extensive simulated and real experiments, we demonstrate significantly better categorical generalization compared to prior approaches in imitation learning settings.

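Summary
Existing internet-scale image and video datasets cover a wide range of everyday objects and tasks, which offers great potential for learning policies that generalize broadly. Prior work has explored visual pre-training with different self-supervised objectives, but the generalization capabilities of the learned policies remain poorly understood. In this work, we take a first step toward this challenge, focusing on how pre-trained representations can help policies generalize. We first identify the key bottleneck in using a frozen pre-trained visual backbone for policy learning. We then propose SpawnNet, a novel two-stream architecture that learns to fuse pre-trained multi-layer representations into a separate network to learn a robust policy. Through extensive simulated and real experiments, we show significantly better categorical generalization than prior approaches in imitation learning settings.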

VariGrad: A Novel Feature Vector Architecture for Geometric Deep Learning on Unregistered Data

paper_url: http://arxiv.org/abs/2307.03553
repo_url: https://github.com/emmanuel-hartman/pytorch_varigrad
paper_authors: Emmanuel Hartman, Emery Pierson
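for: Proposes a new geometric deep learning layer that uses the varifold gradient (VariGrad) to compute feature vector representations of 3D geometric data. These feature vectors can be used in a variety of downstream learning tasks such as classification, registration, and shape reconstruction.
methods: Uses parameterization-independent varifold representations so that the model can be trained and tested on data independent of the given sampling or parameterization.
results: The proposed VariGrad layer is shown to be efficient, generalizable, and robust to resampling.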

Abstract
We present a novel geometric deep learning layer that leverages the varifold gradient (VariGrad) to compute feature vector representations of 3D geometric data. These feature vectors can be used in a variety of downstream learning tasks such as classification, registration, and shape reconstruction. Our model's use of parameterization-independent varifold representations of geometric data allows it to be both trained and tested on data independent of the given sampling or parameterization. We demonstrate the efficiency, generalizability, and robustness to resampling of the proposed VariGrad layer.

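Summary
We present a new geometric deep learning layer that uses the varifold gradient (VariGrad) to compute feature vector representations of 3D geometric data. These feature vectors can be used in a variety of downstream learning tasks such as classification, registration, and shape reconstruction. Because the model uses parameterization-independent varifold representations, it can be trained and tested on data regardless of sampling or parameterization. We demonstrate the efficiency, generalizability, and robustness to resampling of the proposed VariGrad layer.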

Language-free Compositional Action Generation via Decoupling Refinement

paper_url: http://arxiv.org/abs/2307.03538
repo_url: https://github.com/XLiu443/Language-free-Compositional-Action-Generation-via-Decoupling-Refinement
paper_authors: Xiao Liu, Guangyi Chen, Yansong Tang, Guangrun Wang, Ser-Nam Lim
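for: Generate compositional 3D actions without relying on extensive neural language annotations.
methods: Proposes a new framework with three components: Action Coupling, Conditional Action Generation, and Decoupling Refinement. Action Coupling uses an energy model to extract the attention mask of each sub-action and then combines two actions with these attentions to generate pseudo-training examples. A conditional generative model, CVAE, then learns a latent space that enables diverse generation. Finally, Decoupling Refinement uses a self-supervised pre-trained model, MAE, to ensure semantic consistency between sub-actions and compositional actions. This process renders the generated 3D actions into 2D space, decouples the images into two sub-segments, uses MAE to restore the complete image from the sub-segments, and constrains the restored images to match the images rendered from the raw sub-actions.
results: Two new datasets, HumanAct-C and UESTC-C, are created together with a corresponding evaluation metric. Both qualitative and quantitative evaluations are conducted to show the effectiveness of the method.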

Abstract
Composing simple elements into complex concepts is crucial yet challenging, especially for 3D action generation. Existing methods largely rely on extensive neural language annotations to discern composable latent semantics, a process that is often costly and labor-intensive. In this study, we introduce a novel framework to generate compositional actions without reliance on language auxiliaries. Our approach consists of three main components: Action Coupling, Conditional Action Generation, and Decoupling Refinement. Action Coupling utilizes an energy model to extract the attention masks of each sub-action, subsequently integrating two actions using these attentions to generate pseudo-training examples. Then, we employ a conditional generative model, CVAE, to learn a latent space, facilitating the diverse generation. Finally, we propose Decoupling Refinement, which leverages a self-supervised pre-trained model MAE to ensure semantic consistency between the sub-actions and compositional actions. This refinement process involves rendering generated 3D actions into 2D space, decoupling these images into two sub-segments, using the MAE model to restore the complete image from sub-segments, and constraining the recovered images to match images rendered from raw sub-actions. Due to the lack of existing datasets containing both sub-actions and compositional actions, we created two new datasets, named HumanAct-C and UESTC-C, and present a corresponding evaluation metric. Both qualitative and quantitative assessments are conducted to show our efficacy.

Summary
Composing simple elements into complex concepts is crucial yet challenging, especially for 3D action generation. Existing methods largely rely on extensive neural language annotations to discern composable latent semantics, a process that is often costly and labor-intensive. In this study, we introduce a new framework that generates compositional actions without language auxiliaries. Our approach has three main components: Action Coupling, Conditional Action Generation, and Decoupling Refinement. Action Coupling uses an energy model to extract the attention mask of each sub-action and then combines two actions with these attentions to generate pseudo-training examples. We then use a conditional generative model, CVAE, to learn a latent space that facilitates diverse generation. Finally, we propose Decoupling Refinement, which uses the pre-trained MAE model to ensure semantic consistency between sub-actions and compositional actions. This refinement renders the generated 3D actions into 2D space, decouples the images into two sub-segments, uses MAE to restore the complete image, and constrains it to match the images rendered from the raw sub-actions. Because no existing dataset contains both sub-actions and compositional actions, we created two new datasets, HumanAct-C and UESTC-C, and present a corresponding evaluation metric. We conduct both qualitative and quantitative evaluations to show the effectiveness of our method.

Joint Perceptual Learning for Enhancement and Object Detection in Underwater Scenarios

paper_url: http://arxiv.org/abs/2307.03536
repo_url: None
paper_authors: Chenping Fu, Wanqi Yuan, Jiewen Xiao, Risheng Liu, Xin Fan
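for: Jointly learn underwater object detection and image enhancement.
methods: Formulates the joint learning as a bilevel optimization problem and unrolls it into a dual perception network (DPNet) with one shared module and two task subnets, trained with a cooperative training strategy.
results: Achieves better image enhancement and higher object detection accuracy.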

Abstract
Underwater degraded images greatly challenge existing algorithms to detect objects of interest. Recently, researchers have attempted to adopt attention mechanisms or composite connections to improve the feature representation of detectors. However, this solution does not eliminate the impact of degradation on image content such as color and texture, and achieves minimal improvements. Another feasible solution for underwater object detection is to develop sophisticated deep architectures that enhance image quality or features. Nevertheless, the visually appealing output of these enhancement modules does not necessarily yield high accuracy for deep detectors. More recently, some multi-task learning methods jointly learn underwater detection and image enhancement, achieving promising improvements. Typically, these methods involve huge architectures and expensive computations, which makes inference inefficient. Underwater object detection and image enhancement are two interrelated tasks, and leveraging information from both can benefit each of them. Based on these observations, we propose a bilevel optimization formulation for jointly learning underwater object detection and image enhancement, and then unroll it into a dual perception network (DPNet) for the two tasks. DPNet, with one shared module and two task subnets, learns from the two different tasks and seeks a shared representation. The shared representation provides more structural details for image enhancement and rich content information for object detection. Finally, we derive a cooperative training strategy to optimize the parameters of DPNet. Extensive experiments on real-world and synthetic underwater datasets demonstrate that our method outputs visually favorable images and achieves higher detection accuracy.

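Summary
Degraded underwater images challenge existing object detection algorithms. Researchers have tried attention mechanisms or composite connections to improve the feature representation of detectors, but this does not eliminate the impact of degradation on image content such as color and texture and yields only limited improvements. Another feasible solution is to develop sophisticated deep architectures that enhance image quality or features, yet the output of these enhancement modules does not necessarily improve the accuracy of deep detectors. More recently, some multi-task learning methods jointly learn underwater detection and image enhancement and achieve promising improvements, but they usually require huge architectures and expensive computation, making them inefficient. Based on these observations, we propose a bilevel optimization formulation for jointly learning underwater object detection and image enhancement, and then unroll it into a dual perception network (DPNet) for the two tasks. DPNet, with one shared module and two task subnets, learns from the two different tasks and seeks a shared representation. The shared representation provides more structural details for image enhancement and rich content information for object detection. Finally, we derive a cooperative training strategy to optimize the parameters of DPNet. Extensive experiments on real-world and synthetic underwater datasets demonstrate that our method outputs visually pleasing images and achieves higher detection accuracy.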

Matching in the Wild: Learning Anatomical Embeddings for Multi-Modality Images

paper_url: http://arxiv.org/abs/2307.03535
repo_url: None
paper_authors: Xiaoyu Bai, Fan Bai, Xiaofei Huo, Jia Ge, Tony C. W. Mok, Zi Li, Minfeng Xu, Jingren Zhou, Le Lu, Dakai Jin, Xianghua Ye, Jingjing Lu, Ke Yan
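for: Improve the accuracy of cross-modality alignment so that information from both CT and MRI can be used more effectively.
methods: Proposes Cross-SAM, a new approach built on a novel iterative process that alternates between embedding learning and CT-MRI registration to improve alignment accuracy.
results: Evaluated on two CT-MRI affine registration datasets, Cross-SAM achieves robust affine registration between CT and MRI and reaches state-of-the-art performance compared with other methods.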

Abstract
Radiotherapists require accurate registration of MR/CT images to effectively use information from both modalities. In a typical registration pipeline, rigid or affine transformations are applied to roughly align the fixed and moving images before proceeding with the deformation step. While recent learning-based methods have shown promising results in the rigid/affine step, these methods often require images with similar field-of-view (FOV) for successful alignment. As a result, aligning images with different FOVs remains a challenging task. Self-supervised landmark detection methods like self-supervised Anatomical eMbedding (SAM) have emerged as a useful tool for mapping and cropping images to similar FOVs. However, these methods are currently limited to intra-modality use only. To address this limitation and enable cross-modality matching, we propose a new approach called Cross-SAM. Our approach utilizes a novel iterative process that alternates between embedding learning and CT-MRI registration. We start by applying aggressive contrast augmentation on both CT and MRI images to train a SAM model. We then use this SAM to identify corresponding regions on paired images using robust grid-points matching, followed by a point-set based affine/rigid registration, and a deformable fine-tuning step to produce registered paired images. We use these registered pairs to enhance the matching ability of SAM, which is then processed iteratively. We use the final model for cross-modality matching tasks. We evaluated our approach on two CT-MRI affine registration datasets and found that Cross-SAM achieved robust affine registration on both datasets, significantly outperforming other methods and achieving state-of-the-art performance.

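Summary
Radiotherapists need accurate registration of MR/CT images to use the information from both modalities. In a typical registration pipeline, rigid or affine transformations roughly align the fixed and moving images before the deformation step. Recent learning-based methods have shown promising results in the rigid/affine step, but they often require images with a similar field of view (FOV) for successful alignment, so aligning images with different FOVs remains a challenge. Self-supervised landmark detection methods such as self-supervised Anatomical eMbedding (SAM) have emerged as a useful tool for mapping and cropping images to similar FOVs, but they are currently limited to intra-modality use. To address this limitation and enable cross-modality matching, we propose a new approach called Cross-SAM. It uses a novel iterative process that alternates between embedding learning and CT-MRI registration. We first apply aggressive contrast augmentation to both CT and MRI images to train a SAM model. We then use this SAM to identify corresponding regions on paired images with robust grid-point matching, followed by point-set based affine/rigid registration and a deformable fine-tuning step to produce registered image pairs. We use these registered pairs to enhance the matching ability of SAM, repeat the process iteratively, and use the final model for cross-modality matching tasks. We evaluated our approach on two CT-MRI registration datasets and found that Cross-SAM achieved robust affine registration on both, significantly outperforming other methods and reaching state-of-the-art performance.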
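
The point-set based rigid registration step can be illustrated with the closed-form least-squares fit of a rotation and translation to matched points. The sketch below works in 2D for brevity and assumes the correspondences are already known and free of outliers; the 3D case and the robust grid-point matching described in the abstract are outside its scope, and the function name is hypothetical.

```python
import math


def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst.

    `src` and `dst` are equally long lists of matched (x, y) points.
    """
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    dot = 0.0
    cross = 0.0
    for (x, y), (xq, yq) in zip(src, dst):
        # Work with centred coordinates so only the rotation remains.
        x, y, xq, yq = x - sx, y - sy, xq - dx, yq - dy
        dot += x * xq + y * yq
        cross += x * yq - y * xq
    angle = math.atan2(cross, dot)
    c, s = math.cos(angle), math.sin(angle)
    # The translation moves the rotated source centroid onto the target one.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return angle, (tx, ty)


# Points rotated by 90 degrees and shifted by (2, 3).
moving = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
fixed = [(2.0, 3.0), (2.0, 4.0), (1.0, 3.0)]
angle, (tx, ty) = fit_rigid_2d(moving, fixed)
```

An affine fit replaces the single angle with a full linear map solved by ordinary least squares, and in practice the fit is usually wrapped in an outlier-rejection loop.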

HoughLaneNet: Lane Detection with Deep Hough Transform and Dynamic Convolution

paper_url: http://arxiv.org/abs/2307.03494
repo_url: None
paper_authors: Jia-Qi Zhang, Hao-Bin Duan, Jun-Long Chen, Ariel Shamir, Miao Wang
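for: Improve the accuracy and reliability of lane detection in autonomous driving, where lanes are hard to detect.
methods: Proposes a hierarchical Deep Hough Transform approach that aggregates all lane features in an image into the Hough parameter space, and adopts a Dynamic Convolution Module to differentiate between the features of each lane.
results: Experiments show that the method detects heavily occluded or worn lanes better and matches or exceeds the performance of current state-of-the-art methods.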

Abstract
The task of lane detection has garnered considerable attention in the field of autonomous driving due to its complexity. Lanes can present difficulties for detection, as they can be narrow, fragmented, and often obscured by heavy traffic. However, it has been observed that the lanes have a geometrical structure that resembles a straight line, leading to improved lane detection results when utilizing this characteristic. To address this challenge, we propose a hierarchical Deep Hough Transform (DHT) approach that combines all lane features in an image into the Hough parameter space. Additionally, we refine the point selection method and incorporate a Dynamic Convolution Module to effectively differentiate between lanes in the original image. Our network architecture comprises a backbone network, either a ResNet or Pyramid Vision Transformer, a Feature Pyramid Network as the neck to extract multi-scale features, and a hierarchical DHT-based feature aggregation head to accurately segment each lane. By utilizing the lane features in the Hough parameter space, the network learns dynamic convolution kernel parameters corresponding to each lane, allowing the Dynamic Convolution Module to effectively differentiate between lane features. Subsequently, the lane features are fed into the feature decoder, which predicts the final position of the lane. Our proposed network structure demonstrates improved performance in detecting heavily occluded or worn lane images, as evidenced by our extensive experimental results, which show that our method outperforms or is on par with state-of-the-art techniques.

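Summary
Lane detection has attracted considerable attention in autonomous driving because of its complexity. Lanes can be narrow, fragmented, or obscured by heavy traffic, but they have a geometric structure resembling a straight line, and exploiting this characteristic improves detection results. To address this challenge, we propose a hierarchical Deep Hough Transform (DHT) approach that combines all lane features in an image into the Hough parameter space. We also refine the point selection method and incorporate a Dynamic Convolution Module to differentiate between lanes in the original image. Our network architecture comprises a backbone network (ResNet or Pyramid Vision Transformer), a Feature Pyramid Network as the neck to extract multi-scale features, and a hierarchical DHT-based feature aggregation head to segment each lane accurately. Using the lane features in the Hough parameter space, the network learns dynamic convolution kernel parameters for each lane, which lets the Dynamic Convolution Module differentiate between lane features. Finally, the lane features are passed to the feature decoder, which predicts the final lane positions. The proposed network shows improved performance on heavily occluded or worn lanes, as supported by our extensive experiments, in which our method is on par with or outperforms existing techniques.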
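
For readers unfamiliar with the Hough parameter space, the sketch below shows the classical (non-deep) Hough transform on a handful of 2D points: every point votes for all lines rho = x*cos(theta) + y*sin(theta) that pass through it, and collinear points pile their votes into one (theta, rho) bin. This is the textbook transform, not the hierarchical Deep Hough Transform of the paper, which aggregates learned feature maps rather than binary points; the function names and bin sizes are choices made for this example.

```python
import math


def hough_vote(points, n_theta=180, rho_res=1.0):
    """Accumulate votes in (theta index, rho index) bins for 2D points."""
    acc = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            key = (t, round(rho / rho_res))
            acc[key] = acc.get(key, 0) + 1
    return acc


def strongest_line(points, n_theta=180, rho_res=1.0):
    """Return (theta, rho, votes) of the bin that collected the most votes."""
    acc = hough_vote(points, n_theta, rho_res)
    (t, r), votes = max(acc.items(), key=lambda item: item[1])
    return math.pi * t / n_theta, r * rho_res, votes


# Four points on the vertical line x = 5.
lane_points = [(5, 0), (5, 10), (5, 20), (5, 30)]
theta, rho, votes = strongest_line(lane_points)
```

Here the winning bin is theta = 0 and rho = 5, the normal form of the line x = 5, which is why a lane that is fragmented or partly occluded in the image can still produce a single strong peak in parameter space.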

Unpaired Multi-View Graph Clustering with Cross-View Structure Matching

paper_url: http://arxiv.org/abs/2307.03476
repo_url: https://github.com/wy1019/upmgc-sm
paper_authors: Yi Wen, Siwei Wang, Qing Liao, Weixuan Liang, Ke Liang, Xinhang Wan, Xinwang Liu
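for: Improve clustering of multi-view data by building a parameter-free graph clustering framework that can handle incomplete data correspondence.
methods: Proposes an Unpaired Multi-view Graph Clustering framework with Cross-View Structure Matching (UPMGC-SM), which uses the structural information of multi-view data to refine cross-view correspondences.
results: Experiments show that the proposed framework handles unpaired data effectively and can be combined with existing graph clustering methods to enhance their performance.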

Abstract
Multi-view clustering (MVC), which effectively fuses information from multiple views for better performance, has received increasing attention. Most existing MVC methods assume that multi-view data are fully paired, which means that the mappings of all corresponding samples between views are pre-defined or given in advance. However, the data correspondence is often incomplete in real-world applications due to data corruption or sensor differences, referred as the data-unpaired problem (DUP) in multi-view literature. Although several attempts have been made to address the DUP issue, they suffer from the following drawbacks: 1) Most methods focus on the feature representation while ignoring the structural information of multi-view data, which is essential for clustering tasks; 2) Existing methods for partially unpaired problems rely on pre-given cross-view alignment information, resulting in their inability to handle fully unpaired problems; 3) Their inevitable parameters degrade the efficiency and applicability of the models. To tackle these issues, we propose a novel parameter-free graph clustering framework termed Unpaired Multi-view Graph Clustering framework with Cross-View Structure Matching (UPMGC-SM). Specifically, unlike the existing methods, UPMGC-SM effectively utilizes the structural information from each view to refine cross-view correspondences. Besides, our UPMGC-SM is a unified framework for both the fully and partially unpaired multi-view graph clustering. Moreover, existing graph clustering methods can adopt our UPMGC-SM to enhance their ability for unpaired scenarios. Extensive experiments demonstrate the effectiveness and generalization of our proposed framework for both paired and unpaired datasets.

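Summary
Multi-view clustering (MVC), which fuses information from multiple views for better performance, has received increasing attention. Most existing MVC methods assume that multi-view data are fully paired, meaning that the mappings of all corresponding samples between views are pre-defined or given in advance. In real-world applications, however, the data correspondence is often incomplete because of data corruption or sensor differences, which the multi-view literature calls the data-unpaired problem (DUP). Several attempts have been made to address DUP, but they suffer from the following drawbacks: 1) most methods focus on feature representation and ignore the structural information of multi-view data, which is essential for clustering tasks; 2) existing methods for partially unpaired problems rely on pre-given cross-view alignment information and therefore cannot handle fully unpaired problems; 3) their parameters degrade the efficiency and applicability of the models. To tackle these issues, we propose a novel parameter-free graph clustering framework termed Unpaired Multi-view Graph Clustering framework with Cross-View Structure Matching (UPMGC-SM). Unlike existing methods, UPMGC-SM makes effective use of the structural information in each view to refine cross-view correspondences. UPMGC-SM is also a unified framework for both fully and partially unpaired multi-view graph clustering, and existing graph clustering methods can adopt it to enhance their ability in unpaired scenarios. Extensive experiments demonstrate the effectiveness and generalization of the proposed framework on both paired and unpaired datasets.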

Freezing of Gait Prediction From Accelerometer Data Using a Simple 1D-Convolutional Neural Network – 8th Place Solution for Kaggle’s Parkinson’s Freezing of Gait Prediction Competition

paper_url: http://arxiv.org/abs/2307.03475
repo_url: https://github.com/janbrederecke/fog
paper_authors: Jan Brederecke
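for: Detect Freezing of Gait (FOG) events in patients with Parkinson's disease, in support of better interventions and management strategies.
methods: Uses data from patient-worn accelerometers and a simple 1-D convolutional neural network to detect FOG events.
results: The model reached a mean average precision of 0.356 on the private Kaggle leaderboard and ranked 8th out of 1379 teams, indicating the potential of the approach for real-time FOG detection.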

Abstract
Freezing of Gait (FOG) is a common motor symptom in patients with Parkinson's disease (PD). During episodes of FOG, patients suddenly lose their ability to stride as intended. Patient-worn accelerometers can capture information on the patient's movement during these episodes and machine learning algorithms can potentially classify this data. The combination therefore holds the potential to detect FOG in real-time. In this work I present a simple 1-D convolutional neural network that was trained to detect FOG events in accelerometer data. Model performance was assessed by measuring the success of the model to discriminate normal movement from FOG episodes and resulted in a mean average precision of 0.356 on the private leaderboard on Kaggle. Ultimately, the model ranked 8th out of 1379 teams in the Parkinson's Freezing of Gait Prediction competition. The results underscore the potential of Deep Learning-based solutions in advancing the field of FOG detection, contributing to improved interventions and management strategies for PD patients.

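Summary
Freezing of Gait (FOG) is a common motor symptom in patients with Parkinson's disease. During FOG episodes, patients may suddenly lose the ability to walk. Patient-worn accelerometers can record movement information, and machine learning algorithms can potentially classify these data, so the combination has the potential to detect FOG. In this work, I present a simple 1-D convolutional neural network for detecting FOG events in accelerometer data. Model performance was measured by how well the model discriminates normal movement from FOG episodes, reaching a mean average precision of 0.356 on the private Kaggle leaderboard. Ultimately, the model ranked 8th out of 1379 teams. The results indicate the potential of deep-learning-based solutions for FOG detection, which may improve intervention and management strategies for patients with Parkinson's disease.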

A Deep Active Contour Model for Delineating Glacier Calving Fronts

paper_url: http://arxiv.org/abs/2307.03461
repo_url: None
paper_authors: Konrad Heidler, Lichao Mou, Erik Loebel, Mirko Scheinert, Sébastien Lefèvre, Xiao Xiang Zhu
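for: The paper addresses how to encode a real-world problem, glacier calving front delineation, as a machine learning task.
methods: It recasts calving front modeling as a contour tracing problem and combines Convolutional Neural Networks (CNNs) for feature extraction with active contour models for delineation.
results: Trained and evaluated on several large-scale datasets of Greenland's outlet glaciers, the approach outperforms segmentation- and edge-detection-based methods and shows advantages when quantifying prediction uncertainty.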

Abstract
Choosing how to encode a real-world problem as a machine learning task is an important design decision in machine learning. The task of glacier calving front modeling has often been approached as a semantic segmentation task. Recent studies have shown that combining segmentation with edge detection can improve the accuracy of calving front detectors. Building on this observation, we completely rephrase the task as a contour tracing problem and propose a model for explicit contour detection that does not incorporate any dense predictions as intermediate steps. The proposed approach, called "Charting Outlines by Recurrent Adaptation" (COBRA), combines Convolutional Neural Networks (CNNs) for feature extraction and active contour models for the delineation. By training and evaluating on several large-scale datasets of Greenland's outlet glaciers, we show that this approach indeed outperforms the aforementioned methods based on segmentation and edge-detection. Finally, we demonstrate that explicit contour detection has benefits over pixel-wise methods when quantifying the models' prediction uncertainties. The project page containing the code and animated model predictions can be found at https://khdlr.github.io/COBRA/.

Summary
Choosing how to encode a real-world problem as a machine learning task is an important design decision. Glacier calving front modeling has often been treated as a semantic segmentation task, and recent studies have shown that combining segmentation with edge detection can improve the accuracy of calving front detectors. Building on this observation, we rephrase the task as a contour tracing problem and propose a model for explicit contour detection that does not use any dense predictions as intermediate steps. The proposed approach, "Charting Outlines by Recurrent Adaptation" (COBRA), combines Convolutional Neural Networks (CNNs) for feature extraction with active contour models for delineation. Training and evaluating on several large-scale datasets of Greenland's outlet glaciers, we show that this approach outperforms the segmentation- and edge-detection-based methods. Finally, we show that explicit contour detection has benefits over pixel-wise methods when quantifying the model's prediction uncertainties. The project page with code and animated model predictions is at https://khdlr.github.io/COBRA/.

Universal Semi-supervised Model Adaptation via Collaborative Consistency Training

paper_url: http://arxiv.org/abs/2307.03449
repo_url: None
paper_authors: Zizheng Yan, Yushuang Wu, Yipeng Qin, Xiaoguang Han, Shuguang Cui, Guanbin Li
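for: This work introduces a realistic and challenging domain adaptation problem, Universal Semi-supervised Model Adaptation (USMA), which requires only a pre-trained source model and allows the source and target domains to have different label sets.
methods: A collaborative consistency training framework that regularizes the prediction consistency between two models (the pre-trained source model and a variant pre-trained with target data only) and combines their complementary strengths to learn a more powerful model.
results: Experiments demonstrate the method's effectiveness on several benchmark datasets.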

Abstract
In this paper, we introduce a realistic and challenging domain adaptation problem called Universal Semi-supervised Model Adaptation (USMA), which i) requires only a pre-trained source model, ii) allows the source and target domain to have different label sets, i.e., they share a common label set and hold their own private label set, and iii) requires only a few labeled samples in each class of the target domain. To address USMA, we propose a collaborative consistency training framework that regularizes the prediction consistency between two models, i.e., a pre-trained source model and its variant pre-trained with target data only, and combines their complementary strengths to learn a more powerful model. The rationale of our framework stems from the observation that the source model performs better on common categories than the target-only model, while on target-private categories, the target-only model performs better. We also propose a two-perspective, i.e., sample-wise and class-wise, consistency regularization to improve the training. Experimental results demonstrate the effectiveness of our method on several benchmark datasets.

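Summary
In this paper, we introduce a realistic and challenging domain adaptation problem, Universal Semi-supervised Model Adaptation (USMA). It has three requirements: 1. only a pre-trained source model is available; 2. the source and target domains may have different label sets, that is, they share a common label set and each holds its own private label set; 3. only a few labeled samples are needed for each target-domain class. To address USMA, we propose a collaborative consistency training framework that regularizes the prediction consistency between the source model and a variant pre-trained with target data only, and combines their strengths to learn a more powerful model. The underlying observation is that the source model performs better on common categories, while the target-only model performs better on target-private categories. We also propose consistency regularization from two perspectives, sample-wise and class-wise, to improve training. Experimental results demonstrate the effectiveness of our method on several benchmark datasets.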

NOFA: NeRF-based One-shot Facial Avatar Reconstruction

paper_url: http://arxiv.org/abs/2307.03441
repo_url: None
paper_authors: Wangbo Yu, Yanbo Fan, Yong Zhang, Xuan Wang, Fei Yin, Yunpeng Bai, Yan-Pei Cao, Ying Shan, Yang Wu, Zhongqian Sun, Baoyuan Wu
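for: One-shot 3D facial avatar reconstruction that requires only a single source image for high-fidelity reconstruction.
methods: The method leverages the generative prior of a 3D GAN and an efficient encoder-decoder network to reconstruct the canonical neural volume of the source image, adds a compensation network to complement facial details, and uses a deformation field to warp the canonical volume into driven expressions.
results: Extensive experimental comparisons show superior synthesis results compared with several state-of-the-art methods.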

Abstract
3D facial avatar reconstruction has been a significant research topic in computer graphics and computer vision, where photo-realistic rendering and flexible controls over poses and expressions are necessary for many related applications. Recently, its performance has been greatly improved with the development of neural radiance fields (NeRF). However, most existing NeRF-based facial avatars focus on subject-specific reconstruction and reenactment, requiring multi-shot images containing different views of the specific subject for training, and the learned model cannot generalize to new identities, limiting its further applications. In this work, we propose a one-shot 3D facial avatar reconstruction framework that only requires a single source image to reconstruct a high-fidelity 3D facial avatar. For the challenges of lacking generalization ability and missing multi-view information, we leverage the generative prior of 3D GAN and develop an efficient encoder-decoder network to reconstruct the canonical neural volume of the source image, and further propose a compensation network to complement facial details. To enable fine-grained control over facial dynamics, we propose a deformation field to warp the canonical volume into driven expressions. Through extensive experimental comparisons, we achieve superior synthesis results compared to several state-of-the-art methods.

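Summary
3D facial avatar reconstruction is an important research topic in computer graphics and computer vision, where photo-realistic rendering and flexible control over poses and expressions are needed for many related applications. Its performance has recently improved considerably with the development of neural radiance fields (NeRF). However, most existing NeRF-based facial avatars focus on subject-specific reconstruction and reenactment: they require multiple images of the subject from different views for training, and the learned model cannot generalize to new identities, which limits further applications. In this work, we propose a framework that needs only a single source image to reconstruct a high-fidelity 3D facial avatar. To handle the lack of generalization ability and the missing multi-view information, we leverage the generative prior of a 3D GAN and develop an efficient encoder-decoder network to reconstruct the canonical neural volume of the source image, and we further propose a compensation network to complement facial details. For fine-grained control over facial dynamics, we propose a deformation field that warps the canonical volume into driven expressions. In extensive experimental comparisons, we achieve superior synthesis results compared with several state-of-the-art methods.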

Merging-Diverging Hybrid Transformer Networks for Survival Prediction in Head and Neck Cancer

paper_url: http://arxiv.org/abs/2307.03427
repo_url: https://github.com/mungomeng/survival-xsurv
paper_authors: Mingyuan Meng, Lei Bi, Michael Fulham, Dagan Feng, Jinman Kim
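for: Predicting survival of patients with head and neck cancer, providing early prognostic information for treatment planning.
methods: A deep survival model based on deep learning and medical images that fuses multi-modality images (e.g., PET-CT) and extracts region-specific prognostic information (e.g., from Primary Tumor and Metastatic Lymph Node regions).
results: On the HECKTOR 2022 head and neck tumor dataset, XSurv outperforms state-of-the-art survival prediction methods by combining the complementary information in PET and CT images and extracting region-specific prognostic information.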

Abstract
Survival prediction is crucial for cancer patients as it provides early prognostic information for treatment planning. Recently, deep survival models based on deep learning and medical images have shown promising performance for survival prediction. However, existing deep survival models are not well developed in utilizing multi-modality images (e.g., PET-CT) and in extracting region-specific information (e.g., the prognostic information in Primary Tumor (PT) and Metastatic Lymph Node (MLN) regions). In view of this, we propose a merging-diverging learning framework for survival prediction from multi-modality images. This framework has a merging encoder to fuse multi-modality information and a diverging decoder to extract region-specific information. In the merging encoder, we propose a Hybrid Parallel Cross-Attention (HPCA) block to effectively fuse multi-modality features via parallel convolutional layers and cross-attention transformers. In the diverging decoder, we propose a Region-specific Attention Gate (RAG) block to screen out the features related to lesion regions. Our framework is demonstrated on survival prediction from PET-CT images in Head and Neck (H&N) cancer, by designing an X-shape merging-diverging hybrid transformer network (named XSurv). Our XSurv combines the complementary information in PET and CT images and extracts the region-specific prognostic information in PT and MLN regions. Extensive experiments on the public dataset of HEad and neCK TumOR segmentation and outcome prediction challenge (HECKTOR 2022) demonstrate that our XSurv outperforms state-of-the-art survival prediction methods.

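Summary
Survival prediction is crucial for cancer patients because it provides early prognostic information for treatment planning. Recently, deep survival models based on deep learning and medical images have shown promising performance. However, existing deep survival models do not make full use of multi-modality images (e.g., PET-CT) or fully extract region-specific information (e.g., the prognostic information in Primary Tumor (PT) and Metastatic Lymph Node (MLN) regions). To address this, we propose a merging-diverging learning framework for survival prediction from multi-modality images, with a merging encoder that fuses multi-modality information and a diverging decoder that extracts region-specific information. In the merging encoder, a Hybrid Parallel Cross-Attention (HPCA) block fuses multi-modality features via parallel convolutional layers and cross-attention transformers. In the diverging decoder, a Region-specific Attention Gate (RAG) block screens out the features related to lesion regions. We demonstrate the framework on survival prediction from PET-CT images in Head and Neck cancer with an X-shape merging-diverging hybrid transformer network named XSurv, which combines the complementary information in PET and CT images and extracts region-specific prognostic information in PT and MLN regions. Extensive experiments on the public dataset of the HEad and neCK TumOR segmentation and outcome prediction challenge (HECKTOR 2022) show that XSurv outperforms state-of-the-art survival prediction methods.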

Registration-Free Hybrid Learning Empowers Simple Multimodal Imaging System for High-quality Fusion Detection

paper_url: http://arxiv.org/abs/2307.03425
repo_url: None
paper_authors: Yinghan Guan, Haoran Dai, Zekuan Yu, Shouyu Wang, Yuanjie Gu
for: Smoke and wildfire detection.
methods: IA-VFDnet, a CNN-Transformer hybrid learning framework with a unified high-quality multimodal feature matching module (AKM) and a fusion module (WDAF) that work together to perform high-quality infrared-aware visible fusion detection.
results: Experiments on the M3FD dataset show better detection performance than other state-of-the-art methods under conventional registered conditions; the first unregistered multimodal smoke and wildfire detection benchmark is made openly available.

Abstract
Multimodal fusion detection always places high demands on the imaging system and image pre-processing, while either a high-quality pre-registration system or image registration processing is costly. Unfortunately, the existing fusion methods are designed for registered source images, and the fusion of inhomogeneous features, which denotes a pair of features at the same spatial location that expresses different semantic information, cannot achieve satisfactory performance via these methods. As a result, we propose IA-VFDnet, a CNN-Transformer hybrid learning framework with a unified high-quality multimodal feature matching module (AKM) and a fusion module (WDAF), in which AKM and WDAF work in synergy to perform high-quality infrared-aware visible fusion detection, which can be applied to smoke and wildfire detection. Furthermore, experiments on the M3FD dataset validate the superiority of the proposed method, with IA-VFDnet achieving better detection performance than other state-of-the-art methods under conventional registered conditions. In addition, the first unregistered multimodal smoke and wildfire detection benchmark is openly available in this letter.

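Summary
Multimodal fusion detection always places high demands on the imaging system and image pre-processing, and either a high-quality pre-registration system or image registration processing is costly. Unfortunately, existing fusion methods are designed for registered source images, so the fusion of inhomogeneous features (pairs of features at the same spatial location that express different semantic information) cannot achieve satisfactory performance with them. We therefore propose IA-VFDnet, a CNN-Transformer hybrid learning framework with a unified high-quality multimodal feature matching module (AKM) and a fusion module (WDAF), in which AKM and WDAF work in synergy to perform high-quality infrared-aware visible fusion detection, applicable to smoke and wildfire detection. Experiments on the M3FD dataset validate the superiority of the proposed method, with IA-VFDnet achieving better detection performance than other state-of-the-art methods under conventional registered conditions. In addition, we openly release the first unregistered multimodal smoke and wildfire detection benchmark.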

Hyperspectral and Multispectral Image Fusion Using the Conditional Denoising Diffusion Probabilistic Model

paper_url: http://arxiv.org/abs/2307.03423
repo_url: https://github.com/shuaikaishi/ddpmfus
paper_authors: Shuaikai Shi, Lijun Zhang, Jie Chen
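for: The paper proposes a deep-learning-based fusion method for hyperspectral and multispectral images that yields images with both high spatial and high spectral resolution.
methods: Based on the conditional denoising diffusion probabilistic model (DDPM), the method comprises a forward diffusion process that gradually adds Gaussian noise to the high spatial resolution hyperspectral image (HrHSI) and a reverse denoising process that learns to predict the desired HrHSI from its noisy version, conditioned on the corresponding high spatial resolution multispectral image (HrMSI) and low spatial resolution hyperspectral image (LrHSI).
results: Experiments on one indoor and two remote sensing datasets show the proposed method is superior to other advanced deep-learning-based fusion methods. The code is to be open-sourced at https://github.com/shuaikaishi/DDPMFus for reproducibility.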

Abstract
Hyperspectral images (HSI) have a large amount of spectral information reflecting the characteristics of matter, while their spatial resolution is low due to the limitations of imaging technology. Complementary to this are multispectral images (MSI), e.g., RGB images, with high spatial resolution but insufficient spectral bands. Hyperspectral and multispectral image fusion is a technique for acquiring ideal images that have both high spatial and high spectral resolution cost-effectively. Many existing HSI and MSI fusion algorithms rely on known imaging degradation models, which are often not available in practice. In this paper, we propose a deep fusion method based on the conditional denoising diffusion probabilistic model, called DDPM-Fus. Specifically, DDPM-Fus contains a forward diffusion process, which gradually adds Gaussian noise to the high spatial resolution HSI (HrHSI), and a reverse denoising process, which learns to predict the desired HrHSI from its noisy version conditioned on the corresponding high spatial resolution MSI (HrMSI) and low spatial resolution HSI (LrHSI). Once the training is complete, the proposed DDPM-Fus runs the reverse process on the test HrMSI and LrHSI to generate the fused HrHSI. Experiments conducted on one indoor and two remote sensing datasets show the superiority of the proposed model when compared with other advanced deep learning-based fusion methods. The code of this work will be open-sourced at this address: https://github.com/shuaikaishi/DDPMFus for reproducibility.
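The forward diffusion process mentioned in the abstract has a well-known closed form in the standard DDPM formulation: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps drawn from a standard Gaussian. The sketch below illustrates that step on a flat list of pixel values. The linear schedule, its endpoints, and all function names are assumptions for illustration; the abstract does not state which schedule DDPM-Fus uses, and the conditional reverse network is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule (not confirmed by the abstract)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * t for t in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) in one step; also return the noise used."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(x0, noise)]
    return xt, noise


# Usage: noise a toy "HrHSI" vector at an early and a late timestep.
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
rng = random.Random(0)
early, _ = forward_diffuse([0.2, 0.5, 0.8], 10, alpha_bars, rng)
late, _ = forward_diffuse([0.2, 0.5, 0.8], 999, alpha_bars, rng)
```

At small t the sample stays close to the clean image; at the final step alpha_bar is nearly zero and the sample is almost pure noise, which is what the reverse process starts from at test time.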

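Summary
Hyperspectral images (HSI) carry a large amount of spectral information that reflects the characteristics of matter, but their spatial resolution is low because of the limitations of imaging technology. Complementary to these are multispectral images (MSI), such as RGB images, which have high spatial resolution but lack spectral bands. HSI and MSI fusion is a technique for obtaining ideal images with both high spatial and high spectral resolution. Many existing HSI and MSI fusion algorithms rely on known imaging degradation models, which are often unavailable in practice. In this paper, we propose a deep fusion method based on the conditional denoising diffusion probabilistic model, called DDPM-Fus. Specifically, DDPM-Fus consists of a forward diffusion process that gradually adds Gaussian noise to the high spatial resolution HSI (HrHSI) and a reverse denoising process that learns to predict the desired HrHSI, conditioned on the corresponding high spatial resolution MSI (HrMSI) and low spatial resolution HSI (LrHSI). Once training is complete, DDPM-Fus runs the reverse process on the test HrMSI and LrHSI to generate the fused HrHSI. Experiments on one indoor and two remote sensing datasets show the superiority of our method over other advanced deep-learning-based fusion methods. The code is to be open-sourced at https://github.com/shuaikaishi/DDPMFus for reproducibility.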

Learning Adversarial Semantic Embeddings for Zero-Shot Recognition in Open Worlds

paper_url: http://arxiv.org/abs/2307.03416
repo_url: https://github.com/lhrst/ase
paper_authors: Tianqi Li, Guansong Pang, Xiao Bai, Jin Zheng, Lei Zhou, Xin Ning
for: The study addresses Zero-Shot Open-Set Recognition (ZS-OSR): a model trained under the Zero-Shot Learning (ZSL) setting must accurately classify samples from unseen classes while rejecting samples from unknown classes.
methods: The authors first combine existing state-of-the-art ZSL and OSR models as baselines, then introduce a new approach that generates adversarial semantic embeddings of the unknown classes to train an unknowns-informed ZS-OSR classifier.
results: The method substantially outperforms the combined solutions in detecting unknown classes while retaining classification accuracy on unseen classes, and achieves similar superiority under generalized ZS-OSR settings.

Abstract
Zero-Shot Learning (ZSL) focuses on classifying samples of unseen classes with only their side semantic information presented during training. It cannot handle real-life, open-world scenarios where there are test samples of unknown classes for which neither samples (e.g., images) nor their side semantic information is known during training. Open-Set Recognition (OSR) is dedicated to addressing the unknown class issue, but existing OSR methods are not designed to model the semantic information of the unseen classes. To tackle this combined ZSL and OSR problem, we consider the case of "Zero-Shot Open-Set Recognition" (ZS-OSR), where a model is trained under the ZSL setting but it is required to accurately classify samples from the unseen classes while being able to reject samples from the unknown classes during inference. We perform large experiments on combining existing state-of-the-art ZSL and OSR models for the ZS-OSR task on four widely used datasets adapted from the ZSL task, and reveal that ZS-OSR is a non-trivial task as the simply combined solutions perform badly in distinguishing the unseen-class and unknown-class samples. We further introduce a novel approach specifically designed for ZS-OSR, in which our model learns to generate adversarial semantic embeddings of the unknown classes to train an unknowns-informed ZS-OSR classifier. Extensive empirical results show that our method 1) substantially outperforms the combined solutions in detecting the unknown classes while retaining the classification accuracy on the unseen classes and 2) achieves similar superiority under generalized ZS-OSR settings.
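To make the ZS-OSR decision rule concrete, here is a minimal sketch of "classify among unseen classes, or reject as unknown". It is not the paper's adversarial-embedding method: it assumes a sample has already been mapped into the semantic embedding space, and uses a plain cosine-similarity threshold as the rejection rule. The function names, class names, and threshold are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def classify_or_reject(feature, class_embeddings, threshold):
    """Return the best-matching unseen class name, or None to reject as unknown.

    feature: sample representation, assumed to live in the semantic space.
    class_embeddings: dict mapping unseen-class name -> semantic embedding.
    threshold: hypothetical similarity cut-off below which the sample is rejected.
    """
    best_name, best_sim = None, float("-inf")
    for name, emb in class_embeddings.items():
        sim = cosine(feature, emb)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else None


# Usage with toy 2-D "semantic" embeddings.
unseen = {"zebra": [1.0, 0.0], "whale": [0.0, 1.0]}
print(classify_or_reject([0.9, 0.1], unseen, threshold=0.8))  # close to "zebra"
print(classify_or_reject([0.7, 0.7], unseen, threshold=0.8))  # ambiguous, rejected
```

A fixed threshold like this is the kind of simple combination the abstract reports as performing badly at separating unseen-class from unknown-class samples, which is the motivation for training an unknowns-informed classifier instead.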

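Summary
Zero-Shot Learning (ZSL) focuses on classifying samples of unseen classes using only their side semantic information during training. It cannot handle real-life open-world scenarios in which there are test samples of unknown classes, for which neither samples (e.g., images) nor side semantic information is known during training. Open-Set Recognition (OSR) is dedicated to the unknown-class issue, but existing OSR methods are not designed to model the semantic information of unseen classes. To tackle this combined problem, we consider "Zero-Shot Open-Set Recognition" (ZS-OSR), in which a model is trained under the ZSL setting but must accurately classify samples from unseen classes while rejecting samples from unknown classes at inference. Large-scale experiments on four widely used datasets show that ZS-OSR is a non-trivial task: simply combining ZSL and OSR models performs poorly. We further introduce a new approach designed specifically for ZS-OSR, in which the model learns to generate adversarial semantic embeddings of the unknown classes to train an unknowns-informed ZS-OSR classifier. Our method detects unknown classes while retaining classification accuracy on unseen classes, and achieves similar superiority under generalized ZS-OSR settings.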

Unsupervised Hyperspectral and Multispectral Images Fusion Based on the Cycle Consistency

paper_url: http://arxiv.org/abs/2307.03413
repo_url: https://github.com/shuaikaishi/CycFusion
paper_authors: Shuaikai Shi, Lijun Zhang, Yoann Altmann, Jie Chen
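for: The study proposes an unsupervised hyperspectral and multispectral image fusion method that does not require known spatial degradation parameters, to obtain images with both high spatial and high spectral resolution.
methods: Based on cycle consistency, the method learns the domain transformation between the low spatial resolution hyperspectral image (LrHSI) and the high spatial resolution multispectral image (HrMSI), treating the desired high spatial resolution hyperspectral image (HrHSI) as intermediate feature maps.
results: Experiments on several datasets show that the method outperforms all compared unsupervised fusion methods.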

Abstract
Hyperspectral images (HSI), whose abundant spectral information reflects material properties, usually have low spatial resolution due to hardware limits. Meanwhile, multispectral images (MSI), e.g., RGB images, have a high spatial resolution but deficient spectral signatures. Hyperspectral and multispectral image fusion can be cost-effective and efficient for acquiring images with both high spatial resolution and high spectral resolution. Many of the conventional HSI and MSI fusion algorithms rely on known spatial degradation parameters (i.e., the point spread function), spectral degradation parameters (i.e., the spectral response function), or both. Another class of deep learning-based models relies on the ground truth of high spatial resolution HSI and needs large amounts of paired training images when working in a supervised manner. Both of these models are limited in practical fusion scenarios. In this paper, we propose an unsupervised HSI and MSI fusion model based on the cycle consistency, called CycFusion. CycFusion learns the domain transformation between low spatial resolution HSI (LrHSI) and high spatial resolution MSI (HrMSI), and the desired high spatial resolution HSI (HrHSI) is considered to be intermediate feature maps in the transformation networks. CycFusion can be trained with the objective functions of marginal matching in single transforms and cycle consistency in double transforms. Moreover, the estimated PSF and SRF are embedded in the model as the pre-training weights, which further enhances the practicality of our proposed model. Experiments conducted on several datasets show that our proposed model outperforms all compared unsupervised fusion methods. The code of this paper will be available at this address: https://github.com/shuaikaishi/CycFusion for reproducibility.
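The two degradations named in the abstract can be written down directly: the spectral response function (SRF) maps the many bands of an HrHSI to the few bands of an HrMSI, and the point spread function (PSF) followed by downsampling maps an HrHSI to an LrHSI. The sketch below illustrates both on nested lists. The box-shaped PSF and the function names are assumptions chosen for brevity; CycFusion estimates the PSF and SRF rather than fixing them.

```python
def spectral_degrade(hr_hsi, srf):
    """Apply a spectral response function to every pixel.

    hr_hsi: H x W x B nested lists (B hyperspectral bands).
    srf: b x B weights; each output band is a weighted sum of input bands.
    """
    return [[[sum(w * v for w, v in zip(row, pixel)) for row in srf]
             for pixel in line] for line in hr_hsi]


def spatial_degrade(hr_hsi, factor):
    """Blur with a box PSF of size factor x factor, then downsample by factor.

    The box PSF is an illustrative assumption, not the paper's estimated PSF.
    """
    height, width = len(hr_hsi), len(hr_hsi[0])
    bands = len(hr_hsi[0][0])
    out = []
    for i in range(0, height - factor + 1, factor):
        line = []
        for j in range(0, width - factor + 1, factor):
            block = [hr_hsi[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            line.append([sum(p[k] for p in block) / len(block)
                         for k in range(bands)])
        out.append(line)
    return out


# Usage: a toy 2 x 2 image with 3 bands, reduced to 2 bands or to 1 x 1 pixels.
hr_hsi = [[[1.0, 3.0, 5.0], [1.0, 3.0, 5.0]],
          [[3.0, 5.0, 7.0], [3.0, 5.0, 7.0]]]
srf = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
hr_msi = spectral_degrade(hr_hsi, srf)      # high spatial, low spectral resolution
lr_hsi = spatial_degrade(hr_hsi, factor=2)  # low spatial, high spectral resolution
```

Both operations are linear, so degrading spectrally then spatially gives the same low-resolution multispectral image as the reverse order; fusion methods exploit this shared structure when relating LrHSI and HrMSI.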

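Summary
Hyperspectral images (HSI) carry rich spectral information but usually have low spatial resolution because of hardware limits. Multispectral images (MSI), such as RGB images, have high spatial resolution but lack spectral signatures. HSI and MSI fusion can be a cost-effective and efficient way to obtain images with both high spatial and high spectral resolution. Many conventional HSI and MSI fusion algorithms rely on known degradation parameters, such as the point spread function, the spectral response function, or both. Another class of deep-learning-based models requires large amounts of paired training images. In this paper, we propose an unsupervised HSI and MSI fusion model called CycFusion. CycFusion learns the domain transformation between low spatial resolution HSI and high spatial resolution MSI, and treats the desired high spatial resolution HSI as intermediate feature maps in the transformation networks. CycFusion can be trained with objective functions of marginal matching in single transforms and cycle consistency in double transforms. Experiments on several datasets show that the proposed model outperforms all compared unsupervised fusion methods. The code for this paper is to be made available at https://github.com/shuaikaishi/CycFusion for reproducibility.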

Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation

paper_url: http://arxiv.org/abs/2307.03407
repo_url: None
paper_authors: Dahyun Kang, Piotr Koniusz, Minsu Cho, Naila Murray
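for: The paper tackles weakly-supervised few-shot image classification and segmentation by leveraging a Vision Transformer (ViT) pretrained with self-supervision.
methods: The method takes token representations from the self-supervised ViT and uses their correlations, via self-attention, to produce classification and segmentation predictions through two separate task heads.
results: Experiments show significant performance gains across supervision settings, particularly when few or no pixel-level labels are available.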

Abstract
We address the task of weakly-supervised few-shot image classification and segmentation, by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our proposed method takes token representations from the self-supervised ViT and leverages their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads. Our model is able to effectively learn to perform classification and segmentation in the absence of pixel-level labels during training, using only image-level labels. To do this it uses attention maps, created from tokens generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. We also explore a practical setup with "mixed" supervision, where a small number of training images contains ground-truth pixel-level labels and the remaining images have only image-level labels. For this mixed setup, we propose to improve the pseudo-labels using a pseudo-label enhancer that was trained using the available ground-truth pixel-level labels. Experiments on Pascal-5i and COCO-20i demonstrate significant performance gains in a variety of supervision settings, and in particular when little-to-no pixel-level labels are available.

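Summary
We address weakly-supervised few-shot image classification and segmentation by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our method takes token representations from the self-supervised ViT and, via self-attention, produces classification and segmentation predictions. The model can effectively learn both tasks without pixel-level labels during training, using only image-level labels. To do so, it uses attention maps, created from tokens generated by the self-supervised ViT, as pixel-level pseudo-labels. We also explore a practical "mixed" supervision setup in which some training images have ground-truth pixel-level labels and the rest have only image-level labels. For this setup, we propose a pseudo-label enhancer trained on the available ground-truth pixel-level labels. Experiments on Pascal-5i and COCO-20i show significant performance gains across supervision settings, particularly when few or no pixel-level labels are available.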

RGB-D Mapping and Tracking in a Plenoxel Radiance Field

paper_url: http://arxiv.org/abs/2307.03404
repo_url: None
paper_authors: Andreas L. Teigen, Yeonsoo Park, Annette Stahl, Rudolf Mester
for: This technical report targets computer vision and robotics.
methods: It extends the Plenoxel radiance field model with an analytical differential approach for dense mapping and tracking from RGB-D data, without a neural network.
results: The method achieves state-of-the-art results in both mapping and tracking while being faster than competing neural network-based approaches.

Abstract
Building on the success of Neural Radiance Fields (NeRFs), recent years have seen significant advances in the domain of novel view synthesis. These models capture the scene's volumetric radiance field, creating highly convincing dense photorealistic models through the use of simple, differentiable rendering equations. Despite their popularity, these algorithms suffer from severe ambiguities in visual data inherent to the RGB sensor, which means that although images generated with view synthesis can visually appear very believable, the underlying 3D model will often be wrong. This considerably limits the usefulness of these models in practical applications like Robotics and Extended Reality (XR), where an accurate dense 3D reconstruction otherwise would be of significant value. In this technical report, we present the vital differences between view synthesis models and 3D reconstruction models. We also comment on why a depth sensor is essential for modeling accurate geometry in general outward-facing scenes using the current paradigm of novel view synthesis methods. Focusing on the structure-from-motion task, we practically demonstrate this need by extending the Plenoxel radiance field model: Presenting an analytical differential approach for dense mapping and tracking with radiance fields based on RGB-D data without a neural network. Our method achieves state-of-the-art results in both the mapping and tracking tasks while also being faster than competing neural network-based approaches.

Summary
Building on the success of Neural Radiance Fields (NeRFs), novel view synthesis has advanced significantly in recent years. These models capture the scene's volumetric radiance field and create highly convincing photorealistic models with simple, differentiable rendering equations. Despite their popularity, these algorithms suffer from severe ambiguities in visual data inherent to the RGB sensor: images generated with view synthesis can look very believable while the underlying 3D model is often wrong. This considerably limits their usefulness in practical applications such as Robotics and Extended Reality (XR), where an accurate dense 3D reconstruction would be of significant value. In this technical report, we present the important differences between view synthesis models and 3D reconstruction models. We also comment on why a depth sensor is essential for modeling accurate geometry with the current paradigm of novel view synthesis methods. We demonstrate this need in practice on the structure-from-motion task by extending the Plenoxel radiance field model with an analytical differential approach for dense mapping and tracking based on RGB-D data. Our method achieves state-of-the-art results in both mapping and tracking while being faster than competing neural network-based approaches.

Beyond Geo-localization: Fine-grained Orientation of Street-view Images by Cross-view Matching with Satellite Imagery with Supplementary Materials

paper_url: http://arxiv.org/abs/2307.03398
repo_url: None
paper_authors: Wenmiao Hu, Yichen Zhang, Yuxuan Liang, Yifang Yin, Andrei Georgescu, An Tran, Hannes Kruppa, See-Kiong Ng, Roger Zimmermann
for: This paper focuses on improving the accuracy of fine-grained orientation estimation for street-view images.
methods: Two methods are proposed that refine orientation estimation through cross-view matching between street-view images and satellite imagery.
results: The methods reach 82.4% and 72.3% accuracy for images with estimated angle errors below 2 degrees on CVUSA and CVACT, an absolute improvement of 34.9% and 28.2% over previous works. Integrating fine-grained orientation estimation in training also improves the performance on geo-localization.

Abstract
Street-view imagery provides us with novel experiences to explore different places remotely. Carefully calibrated street-view images (e.g. Google Street View) can be used for different downstream tasks, e.g. navigation, map features extraction. As personal high-quality cameras have become much more affordable and portable, an enormous amount of crowdsourced street-view images are uploaded to the internet, but commonly with missing or noisy sensor information. To prepare this hidden treasure for "ready-to-use" status, determining missing location information and camera orientation angles are two equally important tasks. Recent methods have achieved high performance on geo-localization of street-view images by cross-view matching with a pool of geo-referenced satellite imagery. However, most of the existing works focus more on geo-localization than estimating the image orientation. In this work, we re-state the importance of finding fine-grained orientation for street-view images, formally define the problem and provide a set of evaluation metrics to assess the quality of the orientation estimation. We propose two methods to improve the granularity of the orientation estimation, achieving 82.4% and 72.3% accuracy for images with estimated angle errors below 2 degrees for CVUSA and CVACT datasets, corresponding to 34.9% and 28.2% absolute improvement compared to previous works. Integrating fine-grained orientation estimation in training also improves the performance on geo-localization, giving top 1 recall 95.5%/85.5% and 86.8%/80.4% for orientation known/unknown tests on the two datasets.
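The headline metric above is the share of images whose estimated orientation is within 2 degrees of the ground truth. A minimal sketch of that computation follows; the one subtlety is that headings wrap around at 360 degrees. The function names are hypothetical, and the paper's full set of evaluation metrics is not specified in the abstract, so this covers only the thresholded-accuracy idea.

```python
def angular_error(pred_deg, true_deg):
    """Smallest absolute difference between two headings, in degrees (0 to 180)."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def accuracy_within(preds, truths, threshold_deg=2.0):
    """Fraction of images whose orientation error is below threshold_deg."""
    errors = [angular_error(p, t) for p, t in zip(preds, truths)]
    return sum(e < threshold_deg for e in errors) / len(errors)


# Usage: 359 degrees and 1 degree are only 2 degrees apart, not 358.
print(angular_error(359.0, 1.0))  # 2.0
print(accuracy_within([0.5, 359.5, 90.0, 181.0], [0.0, 0.0, 95.0, 180.0]))  # 0.75
```

Without the wrap-around step, predictions near north would be scored as almost maximally wrong, which would distort any fine-grained accuracy figure.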

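Summary
Street-view imagery lets us explore many different places remotely. Carefully calibrated street-view images (e.g., Google Street View) can be used for downstream tasks such as navigation and map feature extraction. As personal high-quality cameras have become more affordable and portable, a large number of crowdsourced street-view images have been uploaded to the internet, often with missing or noisy sensor information. To make this hidden treasure usable, determining the missing location information and the camera orientation angles are two equally important tasks. Existing methods achieve high performance on geo-localization of street-view images, but most existing works focus more on geo-localization than on estimating image orientation. In this work, we restate the importance of finding fine-grained orientation, formally define the problem, and provide evaluation metrics for assessing the quality of orientation estimation. We propose two methods to improve the granularity of orientation estimation, achieving 82.4% and 72.3% accuracy for images with estimated angle errors below 2 degrees on the CVUSA and CVACT datasets, an absolute improvement of 34.9% and 28.2% over previous works. Integrating fine-grained orientation estimation into training also improves geo-localization performance, giving top-1 recall of 95.5%/85.5% and 86.8%/80.4% for orientation known/unknown tests on the two datasets.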

General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation

paper_url: http://arxiv.org/abs/2307.03388
repo_url: https://github.com/nhikieu/spatialvolumetricmultimodal
paper_authors: Nhi Kieu, Kien Nguyen, Sridha Sridharan, Clinton Fookes
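for: The study investigates the performance of PerceiverIO, a general-purpose multimodal network, in remote sensing semantic segmentation.
methods: It adds a UNet-inspired module that uses 3D convolution to encode local information and learn cross-modal features simultaneously.
results: The proposed method achieves competitive results with specialized architectures such as UNetFormer and SwinUNet, showing its potential to minimize network architecture engineering.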

Abstract
The advent of high-resolution multispectral/hyperspectral sensors, LiDAR DSM (Digital Surface Model) information and many others has provided us with an unprecedented wealth of data for Earth Observation. Multimodal AI seeks to exploit those complementary data sources, particularly for complex tasks like semantic segmentation. While specialized architectures have been developed, they are highly complicated, demand significant effort in model design, and require considerable re-engineering whenever a new modality emerges. Recent trends in general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance across multiple multimodal tasks with one unified architecture. In this work, we investigate the performance of PerceiverIO, one in the general-purpose multimodal family, in the remote sensing semantic segmentation domain. Our experiments reveal that this ostensibly universal network struggles with object scale variation in remote sensing images and fails to detect the presence of cars from a top-down view. To address these issues, even with extreme class imbalance issues, we propose a spatial and volumetric learning component. Specifically, we design a UNet-inspired module that employs 3D convolution to encode vital local information and learn cross-modal features simultaneously, while reducing network computational burden via the cross-attention mechanism of PerceiverIO. The effectiveness of the proposed component is validated through extensive experiments comparing it with other methods such as 2D convolution, and dual local module (i.e., the combination of Conv2D 1x1 and Conv2D 3x3 inspired by UNetFormer). The proposed method achieves competitive results with specialized architectures like UNetFormer and SwinUNet, showing its potential to minimize network architecture engineering with a minimal compromise on the performance.

Summary
High-resolution multispectral/hyperspectral sensors, LiDAR DSM (Digital Surface Model) data, and other sources have brought an unprecedented wealth of data to Earth observation. Multimodal AI exploits these complementary sources, particularly for complex tasks such as semantic segmentation. Specialized architectures exist, but they are complex, costly to design, and need re-engineering whenever a new modality appears. General-purpose multimodal networks have recently shown great potential to reach state-of-the-art performance across many multimodal tasks with one unified architecture. In this work, we investigate PerceiverIO, a member of that family, for remote sensing semantic segmentation. Our experiments show that this ostensibly universal network struggles with object scale variation in remote sensing images and fails to detect cars from a top-down view. To address these issues, we propose a spatial and volumetric learning component: a UNet-inspired module that uses 3D convolution to encode vital local information and learn cross-modal features simultaneously, while the cross-attention mechanism of PerceiverIO reduces the computational burden. We validate the component in extensive experiments against alternatives such as 2D convolution and a dual local module (i.e., the combination of Conv2D 1x1 and Conv2D 3x3 inspired by UNetFormer). The method achieves results competitive with specialized architectures such as UNetFormer and SwinUNet, showing its potential to minimize architecture engineering with minimal compromise on performance.
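To make concrete how a single 3D convolution can mix local spatial context and several modalities at once, here is a minimal pure-Python sketch. It assumes the modalities are stacked along a depth axis; the function name `conv3d_valid`, the toy 3x3 inputs, and the averaging kernel are illustrative assumptions and not the authors' module.

```python
def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation (what deep learning calls convolution).

    Both arguments are nested lists indexed [depth][row][col].
    """
    D, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kd, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for d in range(D - kd + 1):
        plane = []
        for i in range(H - kh + 1):
            row = []
            for j in range(W - kw + 1):
                acc = 0.0
                # Sum over the kernel window: depth (modalities) and space together.
                for a in range(kd):
                    for b in range(kh):
                        for c in range(kw):
                            acc += volume[d + a][i + b][j + c] * kernel[a][b][c]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Two hypothetical modalities (say, one optical band and a DSM height map), 3x3 pixels each.
optical = [[1.0] * 3 for _ in range(3)]
dsm = [[3.0] * 3 for _ in range(3)]
volume = [optical, dsm]  # depth axis = modality axis (an assumption of this sketch)

# A 2x3x3 averaging kernel: depth 2 spans both modalities, 3x3 spans the neighbourhood.
kernel = [[[1.0 / 18.0] * 3 for _ in range(3)] for _ in range(2)]
fused = conv3d_valid(volume, kernel)
```

Because the kernel has depth 2, each output value depends on both modalities and on a 3x3 spatial window, which is the property the abstract attributes to its 3D convolution component. A learned kernel would replace the fixed average used here.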

Weakly-supervised Contrastive Learning for Unsupervised Object Discovery

paper_url: http://arxiv.org/abs/2307.03376
repo_url: https://github.com/npucvr/wscuod
paper_authors: Yunqiu Lv, Jing Zhang, Nick Barnes, Yuchao Dai
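for: This work proposes a new unsupervised object discovery method to improve object detection and segmentation accuracy.
methods: The approach builds on a self-supervised model and uses weakly-supervised contrastive learning (WCL) to enhance semantic information exploration. Principal Component Analysis (PCA) is then used to localize object regions.
results: Extensive experiments on unsupervised object discovery datasets demonstrate the effectiveness of the proposal. Source code and experimental results are available via the project page: https://github.com/npucvr/WSCUOD.git.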

Abstract
Unsupervised object discovery (UOD) refers to the task of discriminating the whole region of objects from the background within a scene without relying on labeled datasets, which benefits the task of bounding-box-level localization and pixel-level segmentation. This task is promising due to its ability to discover objects in a generic manner. We roughly categorise existing techniques into two main directions, namely the generative solutions based on image resynthesis, and the clustering methods based on self-supervised models. We have observed that the former heavily relies on the quality of image reconstruction, while the latter shows limitations in effectively modeling semantic correlations. To directly target at object discovery, we focus on the latter approach and propose a novel solution by incorporating weakly-supervised contrastive learning (WCL) to enhance semantic information exploration. We design a semantic-guided self-supervised learning model to extract high-level semantic features from images, which is achieved by fine-tuning the feature encoder of a self-supervised model, namely DINO, via WCL. Subsequently, we introduce Principal Component Analysis (PCA) to localize object regions. The principal projection direction, corresponding to the maximal eigenvalue, serves as an indicator of the object region(s). Extensive experiments on benchmark unsupervised object discovery datasets demonstrate the effectiveness of our proposed solution. The source code and experimental results are publicly available via our project page at https://github.com/npucvr/WSCUOD.git.

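Summary
Unsupervised object discovery (UOD) is the task of separating whole object regions from the background in a scene without labeled data, which benefits bounding-box-level localization and pixel-level segmentation. The task is promising because it can discover objects in a generic manner. We roughly divide existing techniques into two directions: generative solutions based on image resynthesis, and clustering methods based on self-supervised models. The former relies heavily on image reconstruction quality, while the latter is limited in modeling semantic correlations. To target object discovery directly, we take the latter approach and propose a new solution that uses weakly-supervised contrastive learning (WCL) to enhance semantic information exploration. We design a semantic-guided self-supervised learning model that extracts high-level semantic features by fine-tuning the feature encoder of DINO via WCL. We then introduce Principal Component Analysis (PCA) to localize object regions. Extensive experiments on unsupervised object discovery datasets demonstrate the effectiveness of the proposal. The project code and experimental results are available at https://github.com/npucvr/WSCUOD.git.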
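The PCA localization step described in the abstract (project features onto the direction with the largest eigenvalue and read the object region off the projection) can be sketched as follows. This is a toy illustration under stated assumptions: the 2-D `patch_features` stand in for real DINO patch embeddings, power iteration replaces a library eigensolver, and the rule "the object covers the minority of patches" is a hypothetical way to resolve the sign ambiguity of the eigenvector. None of these details come from the paper.

```python
import random


def principal_direction(features, iters=100, seed=0):
    """Return centered features and the leading eigenvector of their covariance."""
    n, d = len(features), len(features[0])
    mean = [sum(f[k] for f in features) / n for k in range(d)]
    centered = [[f[k] - mean[k] for k in range(d)] for f in features]
    cov = [[sum(x[a] * x[b] for x in centered) / n for b in range(d)] for a in range(d)]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(iters):  # power iteration converges to the top eigenvector
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return centered, v


def localize(features):
    """Boolean mask over patches: True where the patch is assigned to the object."""
    centered, v = principal_direction(features)
    proj = [sum(x[k] * v[k] for k in range(len(v))) for x in centered]
    mask = [p > 0 for p in proj]
    # The eigenvector's sign is arbitrary; ASSUMPTION: the object is the minority side.
    if sum(mask) > len(mask) / 2:
        mask = [not m for m in mask]
    return mask


# Toy data: six background patches near the origin, two object patches far away.
patch_features = [
    [0.0, 0.1], [0.1, 0.0], [-0.1, 0.0], [0.0, -0.1],
    [0.1, 0.1], [-0.1, -0.1], [5.0, 5.1], [5.1, 5.0],
]
object_mask = localize(patch_features)
```

In a real pipeline the mask would be reshaped to the patch grid of the image and upsampled; that step is omitted here.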

A Survey of Deep Learning in Sports Applications: Perception, Comprehension, and Decision

paper_url: http://arxiv.org/abs/2307.03353
repo_url: None
paper_authors: Zhonghan Zhao, Wenhao Chai, Shengyu Hao, Wenhao Hu, Guanhong Wang, Shidong Cao, Mingli Song, Jenq-Neng Hwang, Gaoang Wang
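for: This paper surveys applications of deep learning to sports performance, covering perception, comprehension, and decision.
methods: The paper lays out the hierarchical structure of deep learning algorithms, reviews existing datasets, and describes current challenges and future trends.
results: By analyzing existing datasets and surveying deep learning in sports applications, the paper provides reference material for research on deep learning in sports performance.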

Abstract
Deep learning has the potential to revolutionize sports performance, with applications ranging from perception and comprehension to decision. This paper presents a comprehensive survey of deep learning in sports performance, focusing on three main aspects: algorithms, datasets and virtual environments, and challenges. Firstly, we discuss the hierarchical structure of deep learning algorithms in sports performance which includes perception, comprehension and decision while comparing their strengths and weaknesses. Secondly, we list widely used existing datasets in sports and highlight their characteristics and limitations. Finally, we summarize current challenges and point out future trends of deep learning in sports. Our survey provides valuable reference material for researchers interested in deep learning in sports applications.

Summary
Deep learning has the potential to revolutionize sports performance, with applications ranging from perception and comprehension to decision. This paper presents a comprehensive survey of deep learning in sports performance covering three main aspects: algorithms, datasets and virtual environments, and challenges. First, we describe the hierarchical structure of deep learning algorithms in sports performance and compare their strengths and weaknesses. Second, we list commonly used sports datasets and describe their characteristics and limitations. Finally, we summarize current challenges and highlight future trends of deep learning in sports. The survey offers valuable reference material for researchers working on deep learning in sports applications.

Dividing and Conquering a BlackBox to a Mixture of Interpretable Models: Route, Interpret, Repeat

paper_url: http://arxiv.org/abs/2307.05350
repo_url: https://github.com/batmanlab/ICML-2023-Route-interpret-repeat
paper_authors: Shantanu Ghosh, Ke Yu, Forough Arabshahi, Kayhan Batmanghelich
for: This paper aims to blur the distinction between post hoc explanation of a Blackbox and constructing interpretable models, by iteratively carving out a mixture of interpretable experts (MoIE) and a residual network.
methods: The paper uses a route, interpret, and repeat approach, starting with a Blackbox and iteratively carving out MoIE and a residual network. Each interpretable model specializes in a subset of samples and explains them using First Order Logic (FOL), while the residual network handles the remaining samples.
results: The extensive experiments show that the approach (1) identifies a diverse set of instance-specific concepts with high concept completeness via MoIE without compromising performance, (2) identifies the relatively “harder” samples to explain via residuals, (3) outperforms interpretable by-design models by significant margins during test-time interventions, and (4) fixes the shortcut learned by the original Blackbox.

Abstract
ML model design either starts with an interpretable model or a Blackbox and explains it post hoc. Blackbox models are flexible but difficult to explain, while interpretable models are inherently explainable. Yet, interpretable models require extensive ML knowledge and tend to be less flexible and underperforming than their Blackbox variants. This paper aims to blur the distinction between a post hoc explanation of a Blackbox and constructing interpretable models. Beginning with a Blackbox, we iteratively carve out a mixture of interpretable experts (MoIE) and a residual network. Each interpretable model specializes in a subset of samples and explains them using First Order Logic (FOL), providing basic reasoning on concepts from the Blackbox. We route the remaining samples through a flexible residual. We repeat the method on the residual network until all the interpretable models explain the desired proportion of data. Our extensive experiments show that our route, interpret, and repeat approach (1) identifies a diverse set of instance-specific concepts with high concept completeness via MoIE without compromising in performance, (2) identifies the relatively "harder" samples to explain via residuals, (3) outperforms the interpretable by-design models by significant margins during test-time interventions, and (4) fixes the shortcut learned by the original Blackbox. The code for MoIE is publicly available at: https://github.com/batmanlab/ICML-2023-Route-interpret-repeat

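Summary
ML model design either starts with an interpretable model or starts with a Blackbox and explains it post hoc. Blackbox models are flexible but hard to explain, while interpretable models are inherently explainable. However, interpretable models require extensive ML knowledge and usually underperform their Blackbox variants. This paper aims to blur the line between post hoc explanation and building interpretable models. Starting from a Blackbox, we iteratively carve out a mixture of interpretable experts (MoIE) and a residual network. Each interpretable model specializes in a subset of samples and explains them using First Order Logic (FOL), providing basic reasoning on concepts from the Blackbox. The remaining samples are routed through a flexible residual network. We repeat the procedure on the residual until the interpretable models together explain the desired proportion of the data. Our extensive experiments show that the route, interpret, and repeat approach (1) obtains a diverse set of instance-specific concepts via MoIE without sacrificing performance, (2) identifies the samples that are harder to explain via the residuals, (3) substantially outperforms interpretable-by-design models during test-time interventions, and (4) fixes the shortcut learned by the Blackbox. The MoIE code is available at https://github.com/batmanlab/ICML-2023-Route-interpret-repeat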
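The control flow of "route, interpret, repeat" can be sketched as a loop that peels off the samples each expert covers and passes the rest on, stopping once a target share of the data is explained. In the sketch below everything is a hypothetical stand-in: the hand-written `covers` predicates replace the learned selectors, the integer samples replace real inputs, and no FOL explanation is produced. It shows the loop structure only and is not the authors' implementation.

```python
def route_interpret_repeat(samples, experts, target_coverage):
    """Iteratively assign samples to experts; return (assignments, residual).

    experts: list of (name, covers) pairs, where covers(sample) -> bool is a
    stand-in for a learned selector deciding whether the expert explains a sample.
    """
    remaining = list(samples)
    assignments = {}
    total = len(samples)
    for name, covers in experts:
        # Route: samples this expert covers leave the pool, the rest go to the residual.
        assignments[name] = [s for s in remaining if covers(s)]
        remaining = [s for s in remaining if not covers(s)]
        # Repeat on the residual until enough of the data is explained.
        if (total - len(remaining)) / total >= target_coverage:
            break
    return assignments, remaining


samples = list(range(10))
experts = [
    ("expert_1", lambda s: s < 4),       # hypothetical rule
    ("expert_2", lambda s: s % 2 == 0),  # hypothetical rule
    ("expert_3", lambda s: s == 9),      # never reached with the target below
]
assignments, residual = route_interpret_repeat(samples, experts, target_coverage=0.6)
```

With a 60% target, the loop stops after the second expert: seven of ten samples are explained and three remain with the residual.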

Open-Vocabulary Object Detection via Scene Graph Discovery

paper_url: http://arxiv.org/abs/2307.03339
repo_url: None
paper_authors: Hengcan Shi, Munawar Hayat, Jianfei Cai
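for: This paper addresses open-vocabulary object detection: unlike traditional detection, which recognizes only fixed-category objects, it aims to detect objects in an open category set.
methods: The paper proposes a Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues to detect open-vocabulary objects. It comprises a scene-graph-based decoder (SGDecoder) with sparse scene-graph-guided attention (SSGA) and a scene-graph-based prediction (SGPred) mechanism.
results: Experiments show that the method handles open-vocabulary object detection effectively and can also perform open-vocabulary scene graph detection. The summary also reports improved object localization accuracy.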

Abstract
In recent years, open-vocabulary (OV) object detection has attracted increasing research attention. Unlike traditional detection, which only recognizes fixed-category objects, OV detection aims to detect objects in an open category set. Previous works often leverage vision-language (VL) training data (e.g., referring grounding data) to recognize OV objects. However, they only use pairs of nouns and individual objects in VL data, while these data usually contain much more information, such as scene graphs, which are also crucial for OV detection. In this paper, we propose a novel Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues for OV detection. Firstly, a scene-graph-based decoder (SGDecoder) including sparse scene-graph-guided attention (SSGA) is presented. It captures scene graphs and leverages them to discover OV objects. Secondly, we propose scene-graph-based prediction (SGPred), where we build a scene-graph-based offset regression (SGOR) mechanism to enable mutual enhancement between scene graph extraction and object localization. Thirdly, we design a cross-modal learning mechanism in SGPred. It takes scene graphs as bridges to improve the consistency between cross-modal embeddings for OV object classification. Experiments on COCO and LVIS demonstrate the effectiveness of our approach. Moreover, we show the ability of our model for OV scene graph detection, while previous OV scene graph generation methods cannot tackle this task.

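Summary
In recent years, open-vocabulary (OV) object detection has attracted growing research attention. Unlike traditional detection, OV detection targets objects in an open category set. Previous work often uses vision-language (VL) training data (e.g., referring grounding data) to recognize OV objects. However, such data usually contain much more information, such as scene graphs, which is also crucial for OV detection. In this paper, we propose a novel Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues for OV detection. First, we present a scene-graph-based decoder (SGDecoder) that includes sparse scene-graph-guided attention (SSGA); it captures scene graphs and uses them to discover OV objects. Second, we propose scene-graph-based prediction (SGPred), in which a scene-graph-based offset regression (SGOR) mechanism enables mutual enhancement between scene graph extraction and object localization. Finally, we design a cross-modal learning mechanism that uses scene graphs as bridges to improve the consistency between cross-modal embeddings for OV object classification. Experiments on COCO and LVIS demonstrate the effectiveness of our approach. We also show that our model can perform OV scene graph detection, a task that previous OV scene graph generation methods cannot handle.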

Facial Landmark Detection Evaluation on MOBIO Database

paper_url: http://arxiv.org/abs/2307.03329
repo_url: None
paper_authors: Na Zhang
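for: This paper aims to advance research on deploying biometric techniques, such as face and speaker recognition, on mobile devices.
methods: Several existing state-of-the-art facial landmark detection methods are evaluated on mobile data.
results: Facial landmark detection on mobile data proves challenging, and the MOBIO database can serve as a new challenging benchmark.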

Abstract
MOBIO is a bi-modal database that was captured almost exclusively on mobile phones. It aims to improve research into deploying biometric techniques to mobile devices. Research has shown that face and speaker recognition can be performed in a mobile environment. Facial landmark localization aims at finding the coordinates of a set of pre-defined key points for 2D face images. A facial landmark usually has specific semantic meaning, e.g. nose tip or eye centre, which provides rich geometric information for other face analysis tasks such as face recognition, emotion estimation and 3D face reconstruction. Most facial landmark detection methods adopt still face databases, such as 300W, AFW, AFLW, or COFW, for evaluation, but seldom use mobile data. Our work is the first to perform facial landmark detection evaluation on mobile still data, i.e., face images from the MOBIO database. About 20,600 face images have been extracted from this audio-visual database and manually labeled with 22 landmarks as the ground truth. Several state-of-the-art facial landmark detection methods are adopted to evaluate their performance on these data. The results show that the data from the MOBIO database is quite challenging. This database can serve as a new challenging benchmark for facial landmark detection evaluation.
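The abstract does not say which error metric was used. A common choice for landmark evaluation in general is the normalized mean error (NME): the mean Euclidean distance between predicted and ground-truth landmarks, divided by a reference length such as the inter-ocular distance. The sketch below illustrates that generic metric under those assumptions; the function name, the three toy landmarks, and the normalizing distance of 10 are hypothetical and do not describe the paper's protocol.

```python
def normalized_mean_error(pred, truth, norm_distance):
    """Mean point-to-point error over landmarks, divided by a reference length."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and the same length")
    total = 0.0
    for (px, py), (tx, ty) in zip(pred, truth):
        total += ((px - tx) ** 2 + (py - ty) ** 2) ** 0.5
    return total / (len(pred) * norm_distance)


# Toy face with three landmarks; only the first prediction is off (by 5 pixels).
truth = [(0.0, 0.0), (10.0, 0.0), (5.0, 5.0)]
pred = [(3.0, 4.0), (10.0, 0.0), (5.0, 5.0)]
nme = normalized_mean_error(pred, truth, norm_distance=10.0)
```

For a dataset with 22 landmarks per face, the same function would be applied per image and the results averaged.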

CheXmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images

paper_url: http://arxiv.org/abs/2307.03293
repo_url: https://github.com/ngaggion/chexmask-database
paper_authors: Nicolás Gaggion, Candelaria Mosquera, Lucas Mansilla, Martina Aineseder, Diego H. Milone, Enzo Ferrante
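for: This paper provides a large-scale, multi-center chest X-ray segmentation dataset to support the development of chest X-ray analysis methods.
methods: The HybridGNet model is used to ensure consistent, high-quality segmentations across all datasets.
results: The paper provides 676,803 segmentation masks, validated through expert physician evaluation and automatic quality control. It also provides individualized quality indices per mask and an overall quality estimation per dataset.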

Abstract
The development of successful artificial intelligence models for chest X-ray analysis relies on large, diverse datasets with high-quality annotations. While several databases of chest X-ray images have been released, most include disease diagnosis labels but lack detailed pixel-level anatomical segmentation labels. To address this gap, we introduce an extensive chest X-ray multi-center segmentation dataset with uniform and fine-grain anatomical annotations for images coming from six well-known publicly available databases: CANDID-PTX, ChestX-ray8, Chexpert, MIMIC-CXR-JPG, Padchest, and VinDr-CXR, resulting in 676,803 segmentation masks. Our methodology utilizes the HybridGNet model to ensure consistent and high-quality segmentations across all datasets. Rigorous validation, including expert physician evaluation and automatic quality control, was conducted to validate the resulting masks. Additionally, we provide individualized quality indices per mask and an overall quality estimation per dataset. This dataset serves as a valuable resource for the broader scientific community, streamlining the development and assessment of innovative methodologies in chest X-ray analysis. The CheXmask dataset is publicly available at: https://physionet.org/content/chexmask-cxr-segmentation-data/.

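Summary
Developing successful artificial intelligence models for chest X-ray analysis requires large, diverse datasets with high-quality annotations. Although several chest X-ray databases have been released, most include only disease diagnosis labels and lack detailed pixel-level anatomical segmentation labels. To fill this gap, we introduce an extensive multi-center chest X-ray segmentation dataset covering images from six publicly available databases: CANDID-PTX, ChestX-ray8, Chexpert, MIMIC-CXR-JPG, Padchest, and VinDr-CXR, for a total of 676,803 segmentation masks. Our method uses the HybridGNet model to ensure consistent, high-quality segmentations. We performed rigorous validation, including expert physician evaluation and automatic quality control. We also provide an individualized quality index per mask and an overall quality estimate per dataset. The dataset is a resource for the scientific community that can streamline the development and assessment of new methods in chest X-ray analysis. The CheXmask dataset is publicly available at https://physionet.org/content/chexmask-cxr-segmentation-data/.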

To pretrain or not to pretrain? A case study of domain-specific pretraining for semantic segmentation in histopathology

paper_url: http://arxiv.org/abs/2307.03275
repo_url: None
paper_authors: Tushar Kataria, Beatrice Knudsen, Shireen Elhabian
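for: This study tests whether histopathology-specific pretrained models provide better initializations for pathology vision applications.
methods: The study compares histopathology domain-specific pretrained weights with weights pretrained on real-world images.
results: The gain from domain-specific pretrained weights depends on both the task and the size of the training dataset. With limited data, gland segmentation improves significantly, whereas cell segmentation shows no improvement.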

Abstract
Annotating medical imaging datasets is costly, so fine-tuning (or transfer learning) is the most effective method for digital pathology vision applications such as disease classification and semantic segmentation. However, due to texture bias in models trained on real-world images, transfer learning for histopathology applications might result in underperforming models, which necessitates using unlabeled histopathology data and self-supervised methods to discover domain-specific characteristics. Here, we tested the premise that histopathology-specific pretrained models provide better initializations for pathology vision tasks, i.e., gland and cell segmentation. In this study, we compare the performance of gland and cell segmentation tasks with histopathology domain-specific and non-domain-specific (real-world images) pretrained weights. Moreover, we investigate the dataset size at which domain-specific pretraining produces significant gains in performance. In addition, we investigated whether domain-specific initialization improves the effectiveness of out-of-distribution testing on distinct datasets but the same task. The results indicate that performance gain using domain-specific pretrained weights depends on both the task and the size of the training dataset. In instances with limited dataset sizes, a significant improvement in gland segmentation performance was also observed, whereas models trained on cell segmentation datasets exhibit no improvement.

Summary
Annotating medical imaging datasets is costly, so fine-tuning (or transfer learning) is the most effective method for digital pathology vision applications such as disease classification and semantic segmentation. However, because models trained on real-world images carry a texture bias, transfer learning for histopathology applications may yield underperforming models, which calls for unlabeled histopathology data and self-supervised methods to discover domain-specific characteristics. This study tests the premise that histopathology-specific pretrained models provide better initializations for pathology vision tasks, namely gland and cell segmentation. We compare gland and cell segmentation performance with histopathology domain-specific and non-domain-specific (real-world images) pretrained weights. We also investigate the dataset size at which domain-specific pretraining yields performance gains. We find that the gain from domain-specific pretraining depends on both the task and the size of the training dataset. With limited data, gland segmentation improves significantly, whereas models trained on cell segmentation datasets show no improvement.

ADASSM: Adversarial Data Augmentation in Statistical Shape Models From Images

paper_url: http://arxiv.org/abs/2307.03273
repo_url: None
paper_authors: Mokshagna Sai Teja Karanam, Tushar Kataria, Krithika Iyer, Shireen Elhabian
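for: This paper proposes a new data augmentation strategy to address data scarcity in the Image-to-SSM (statistical shape model) framework.
methods: The strategy relies on data-dependent noise generation or texture augmentation; the augmentation framework is trained as an adversary to the Image-to-SSM network and generates diverse, challenging noisy samples.
results: The strategy improves the accuracy of Image-to-SSM networks by encouraging the model to focus on the underlying geometry rather than relying solely on pixel values.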

Abstract
Statistical shape models (SSM) have been well-established as an excellent tool for identifying variations in the morphology of anatomy across the underlying population. Shape models use consistent shape representation across all the samples in a given cohort, which helps to compare shapes and identify the variations that can detect pathologies and help in formulating treatment plans. In medical imaging, computing these shape representations from CT/MRI scans requires time-intensive preprocessing operations, including but not limited to anatomy segmentation annotations, registration, and texture denoising. Deep learning models have demonstrated exceptional capabilities in learning shape representations directly from volumetric images, giving rise to highly effective and efficient Image-to-SSM networks. Nevertheless, these models are data-hungry and due to the limited availability of medical data, deep learning models tend to overfit. Offline data augmentation techniques, that use kernel density estimation based (KDE) methods for generating shape-augmented samples, have successfully aided Image-to-SSM networks in achieving comparable accuracy to traditional SSM methods. However, these augmentation methods focus on shape augmentation, whereas deep learning models exhibit image-based texture bias resulting in sub-optimal models. This paper introduces a novel strategy for on-the-fly data augmentation for the Image-to-SSM framework by leveraging data-dependent noise generation or texture augmentation. The proposed framework is trained as an adversary to the Image-to-SSM network, augmenting diverse and challenging noisy samples. Our approach achieves improved accuracy by encouraging the model to focus on the underlying geometry rather than relying solely on pixel values.

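Summary
Statistical shape models (SSM) are widely used to identify variations in anatomical morphology. They use a consistent shape representation to compare shapes, which helps detect pathologies and formulate treatment plans. In medical imaging, obtaining shape representations from CT/MRI scans requires time-consuming preprocessing, including but not limited to anatomy segmentation annotations, registration, and texture denoising. Deep learning models can learn shape representations directly from volumetric images, giving rise to highly effective and efficient Image-to-SSM networks. However, these models are data-hungry, and because medical data are limited they tend to overfit. Offline data augmentation techniques that generate shape-augmented samples with kernel density estimation (KDE) based methods have helped Image-to-SSM networks reach accuracy comparable to traditional SSM methods. These techniques focus on shape augmentation, however, whereas deep learning models exhibit an image-based texture bias that leads to sub-optimal models. This paper proposes a new on-the-fly data augmentation strategy for the Image-to-SSM framework that leverages data-dependent noise generation or texture augmentation. The augmentation framework is trained as an adversary to the Image-to-SSM network and generates diverse, challenging noisy samples. Our approach improves accuracy by encouraging the model to focus on the underlying geometry rather than relying solely on pixel values.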

Empirical Analysis of a Segmentation Foundation Model in Prostate Imaging

paper_url: http://arxiv.org/abs/2307.03266
repo_url: None
paper_authors: Heejong Kim, Victor Ion Butoi, Adrian V. Dalca, Daniel J. A. Margolis, Mert R. Sabuncu
for: This paper is written for the purpose of evaluating the effectiveness of a foundation model for medical image segmentation, specifically in the context of prostate imaging.
methods: The paper uses a recently developed foundation model called UniverSeg, which is trained on a large dataset of images and can be customized for various downstream tasks with little to no labeled data.
results: The paper compares the performance of UniverSeg against conventional task-specific segmentation models and highlights several important factors that will likely be important in the development and adoption of foundation models for medical image segmentation. The results show that UniverSeg achieves competitive performance against task-specific models while requiring significantly less labeled data.

Abstract
Most state-of-the-art techniques for medical image segmentation rely on deep-learning models. These models, however, are often trained on narrowly-defined tasks in a supervised fashion, which requires expensive labeled datasets. Recent advances in several machine learning domains, such as natural language generation, have demonstrated the feasibility and utility of building foundation models that can be customized for various downstream tasks with little to no labeled data. This likely represents a paradigm shift for medical imaging, where we expect that foundation models may shape the future of the field. In this paper, we consider a recently developed foundation model for medical image segmentation, UniverSeg. We conduct an empirical evaluation study in the context of prostate imaging and compare it against the conventional approach of training a task-specific segmentation model. Our results and discussion highlight several factors that will likely be important in the development and adoption of foundation models for medical image segmentation.

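Summary
Most state-of-the-art medical image segmentation techniques rely on deep learning models. These models, however, are usually trained on narrowly defined tasks in a supervised fashion, which requires expensive labeled datasets. Recent advances in other machine learning domains, such as natural language generation, show that foundation models can be built and then customized for various downstream tasks with little to no labeled data. This may represent a paradigm shift for medical imaging. In this paper, we consider a recently developed foundation model for medical image segmentation, UniverSeg. We evaluate it empirically in the context of prostate imaging and compare it against the conventional approach of training a task-specific segmentation model. Our results and discussion highlight several factors that will likely matter for the development and adoption of foundation models in medical image segmentation.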

A Fully Automated and Explainable Algorithm for the Prediction of Malignant Transformation in Oral Epithelial Dysplasia

paper_url: http://arxiv.org/abs/2307.03757
repo_url: None
paper_authors: Adam J Shephard, Raja Muhammad Saad Bashir, Hanya Mahmood, Mostafa Jahanifar, Fayyaz Minhas, Shan E Ahmed Raza, Kris D McCombe, Stephanie G Craig, Jacqueline James, Jill Brooks, Paul Nankivell, Hisham Mehanna, Syed Ali Khurram, Nasir M Rajpoot
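for: Predicting malignant transformation in oral epithelial dysplasia (OED).
methods: An artificial intelligence algorithm assigns an Oral Malignant Transformation (OMT) risk score, based on histological patterns of nuclei in Haematoxylin and Eosin stained whole slide images, to quantify the risk of OED progression.
results: With internal cross-validation on the Sheffield cohort and independent validation on two external cohorts (Birmingham and Belfast), the OMT score achieves an AUROC of 0.74 in predicting whether OED progresses to malignancy. Survival analyses show the prognostic value of the OMT score, and peri-epithelial and epithelium-infiltrating lymphocytes were found in the most predictive patches of cases that transformed.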

Abstract
Oral epithelial dysplasia (OED) is a premalignant histopathological diagnosis given to lesions of the oral cavity. Its grading suffers from significant inter-/intra- observer variability, and does not reliably predict malignancy progression, potentially leading to suboptimal treatment decisions. To address this, we developed a novel artificial intelligence algorithm that can assign an Oral Malignant Transformation (OMT) risk score, based on histological patterns in Haematoxylin and Eosin stained whole slide images, to quantify the risk of OED progression. The algorithm is based on the detection and segmentation of nuclei within (and around) the epithelium using an in-house segmentation model. We then employed a shallow neural network fed with interpretable morphological/spatial features, emulating histological markers. We conducted internal cross-validation on our development cohort (Sheffield; n = 193 cases) followed by independent validation on two external cohorts (Birmingham and Belfast; n = 92 cases). The proposed OMTscore yields an AUROC = 0.74 in predicting whether an OED progresses to malignancy or not. Survival analyses showed the prognostic value of our OMTscore for predicting malignancy transformation, when compared to the manually-assigned WHO and binary grades. Analysis of the correctly predicted cases elucidated the presence of peri-epithelial and epithelium-infiltrating lymphocytes in the most predictive patches of cases that transformed (p < 0.0001). This is the first study to propose a completely automated algorithm for predicting OED transformation based on interpretable nuclear features, whilst being validated on external datasets. The algorithm shows better-than-human-level performance for prediction of OED malignant transformation and offers a promising solution to the challenges of grading OED in routine clinical practice.

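Summary
Oral epithelial dysplasia (OED) is a premalignant histopathological diagnosis given to lesions of the oral cavity. Its grading suffers from considerable inter- and intra-observer variability and does not reliably predict malignant progression, which may lead to suboptimal treatment decisions. To address this, we developed a new artificial intelligence algorithm that assigns an Oral Malignant Transformation (OMT) risk score based on histological patterns in Haematoxylin and Eosin stained whole slide images. The algorithm detects and segments nuclei with an in-house segmentation model, then feeds interpretable features that emulate histological markers to a shallow neural network. We performed internal cross-validation on the Sheffield development cohort (n = 193 cases), followed by independent validation on two external cohorts, Birmingham and Belfast (n = 92 cases). The proposed OMT score reaches an AUROC of 0.74 in predicting whether OED progresses to malignancy. Survival analyses show the prognostic value of the OMT score for predicting malignant transformation compared with the manually assigned WHO and binary grades. Analysis of the correctly predicted cases revealed peri-epithelial and epithelium-infiltrating lymphocytes in the most predictive patches of cases that transformed (p < 0.0001). This is the first fully automated algorithm for predicting OED transformation based on interpretable nuclear features that has been validated on external datasets. It shows better-than-human-level performance in predicting OED malignant transformation and offers a promising way to improve OED grading in routine clinical practice.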
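For readers less familiar with the AUROC figure quoted above, the sketch below computes it from its rank-based definition: the probability that a randomly chosen positive case (here, one that transformed) receives a higher score than a randomly chosen negative case, with ties counted as one half. The labels and scores are made-up toy values, not data from the study, and the function name is an assumption of this example.

```python
def auroc(labels, scores):
    """AUROC as P(score of a positive > score of a negative), ties counted as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = transformed, 0 = did not transform; scores are hypothetical risk scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.4]
toy_auroc = auroc(labels, scores)
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, so the reported 0.74 means a transforming case outranks a non-transforming one about 74% of the time.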

PSDR-Room: Single Photo to Scene using Differentiable Rendering

paper_url: http://arxiv.org/abs/2307.03244
repo_url: None
paper_authors: Kai Yan, Fujun Luan, MiloŠ HaŠAn, Thibault Groueix, Valentin Deschaintre, Shuang Zhao
for: Matching the appearance of a target photo of an indoor scene with minimal user input, a staging task that otherwise requires both artistic and technical skills.
methods: A recent path-space differentiable rendering approach provides unbiased gradients of the rendering with respect to geometry, lighting, and procedural materials, so all of these components can be optimized by gradient descent to visually match the input photo.
results: Recent single-image scene understanding methods initialize the optimization and search for appropriate 3D models and materials. Evaluation on real photographs of indoor scenes demonstrates the editability of the resulting scene components.

Abstract
A 3D digital scene contains many components: lights, materials and geometries, interacting to reach the desired appearance. Staging such a scene is time-consuming and requires both artistic and technical skills. In this work, we propose PSDR-Room, a system that optimizes lighting as well as the pose and materials of individual objects to match a target image of a room scene, with minimal user input. To this end, we leverage a recent path-space differentiable rendering approach that provides unbiased gradients of the rendering with respect to geometry, lighting, and procedural materials, allowing us to optimize all of these components using gradient descent to visually match the input photo appearance. We use recent single-image scene understanding methods to initialize the optimization and search for appropriate 3D models and materials. We evaluate our method on real photographs of indoor scenes and demonstrate the editability of the resulting scene components.

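Summary
A 3D digital scene contains many components (lights, materials, and geometry) that interact to produce the desired appearance. Staging such a scene is time-consuming and requires both artistic and technical skills. In this work, we propose PSDR-Room, a system that optimizes lighting as well as the pose and materials of individual objects to match a target image of a room scene with minimal user input. We leverage a recent path-space differentiable rendering method that provides unbiased gradients of the rendering with respect to geometry, lighting, and materials, so that these components can be optimized by gradient descent. We use recent single-image scene understanding methods to initialize the optimization and to search for suitable 3D models and materials. We evaluate the method on real photographs of indoor scenes and show that the resulting scene components can be edited.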
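The core idea of optimizing scene parameters by gradient descent until a rendering matches a photo can be shown with a deliberately tiny stand-in. The sketch assumes a hypothetical "renderer" where each pixel is `albedo * light`, so the gradient of the mean squared error with respect to the single light intensity is available in closed form. A real system such as the one described obtains gradients from a path-space differentiable renderer instead; nothing below is the authors' code.

```python
def render(albedos, light):
    """Toy renderer: pixel value = surface albedo times a global light intensity."""
    return [a * light for a in albedos]


def loss_and_grad(albedos, light, target):
    """Mean squared error between rendering and target, and its derivative w.r.t. light."""
    n = len(albedos)
    residuals = [a * light - t for a, t in zip(albedos, target)]
    loss = sum(r * r for r in residuals) / n
    grad = sum(2.0 * a * r for a, r in zip(albedos, residuals)) / n
    return loss, grad


def fit_light(albedos, target, light=0.0, lr=0.5, steps=200):
    """Plain gradient descent on the light intensity."""
    for _ in range(steps):
        _, grad = loss_and_grad(albedos, light, target)
        light -= lr * grad
    return light


albedos = [0.2, 0.5, 0.8]        # hypothetical, assumed known
target = render(albedos, 3.0)    # the "photo": rendered with an unknown light of 3.0
estimated = fit_light(albedos, target)
```

The same loop structure extends to many parameters (poses, material parameters) once a renderer supplies the gradients; the learning rate and step count here are arbitrary choices for the toy problem.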

paper_url: http://arxiv.org/abs/2307.03243
repo_url: None
paper_authors: Jie Zhang, Masanori Suganuma, Takayuki Okatani
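for: This paper studies unsupervised anomaly detection (AD) for industrial objects/textures and introduces a more challenging setting: detecting anomalies within a given set of images that may contain both normal and anomalous samples. Unlike prior work, the setting requires no human annotation.
methods: The authors propose PatchCluster, a novel method that converts the problem into local outlier detection and detects anomalies at both the image and pixel level.
results: Experiments show that PatchCluster achieves promising anomaly detection performance without knowledge of normal data, even comparable to SOTA methods in the one-class setting that require it.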

Abstract
Recent studies on visual anomaly detection (AD) of industrial objects/textures have achieved quite good performance. They consider an unsupervised setting, specifically the one-class setting, in which we assume the availability of a set of normal (i.e., anomaly-free) images for training. In this paper, we consider a more challenging scenario of unsupervised AD, in which we detect anomalies in a given set of images that might contain both normal and anomalous samples. The setting does not assume the availability of known normal data and thus is completely free from human annotation, which differs from the standard AD considered in recent studies. For clarity, we call the setting blind anomaly detection (BAD). We show that BAD can be converted into a local outlier detection problem and propose a novel method named PatchCluster that can accurately detect image- and pixel-level anomalies. Experimental results show that PatchCluster shows a promising performance without the knowledge of normal data, even comparable to the SOTA methods applied in the one-class setting needing it.

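Summary
Recent studies on visual anomaly detection (AD) have achieved very good performance. They assume an unsupervised setting, specifically the one-class setting, in which a set of normal (i.e., anomaly-free) images is available for training. In this paper, we consider a more challenging unsupervised AD scenario, in which we detect anomalies in a set of images that may contain both normal and anomalous samples. This setting requires no human annotation, unlike standard AD. For clarity, we call it blind anomaly detection (BAD). We show that BAD can be converted into a local outlier detection problem and propose a new method, PatchCluster, that accurately detects image- and pixel-level anomalies. Experiments show that PatchCluster performs well without knowledge of normal data, even comparably to SOTA methods that require it.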
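To illustrate what "local outlier detection" on patch features can look like when no normal reference set exists, the sketch below scores each feature by its mean distance to its k nearest neighbours within the same unlabeled pool. This is a generic k-NN distance score chosen for illustration, not PatchCluster itself; the 2-D toy features, the function name, and k = 2 are all assumptions.

```python
def knn_outlier_scores(features, k=2):
    """Score each feature by the mean Euclidean distance to its k nearest other features."""
    scores = []
    for i, f in enumerate(features):
        dists = sorted(
            sum((a - b) ** 2 for a, b in zip(f, g)) ** 0.5
            for j, g in enumerate(features)
            if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores


# Toy pool of patch features: four similar "normal" patches and one isolated patch.
patches = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]]
scores = knn_outlier_scores(patches)
most_anomalous = max(range(len(scores)), key=scores.__getitem__)
```

Patches in dense regions of feature space get low scores, while isolated patches get high ones. A per-image score could then be taken as the maximum patch score, which is again an illustrative choice.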

Adaptive Generation of Privileged Intermediate Information for Visible-Infrared Person Re-Identification

paper_url: http://arxiv.org/abs/2307.03240
repo_url: None
paper_authors: Mahdi Alehdaghi, Arthur Josi, Pourya Shamsolmoali, Rafael M. O. Cruz, Eric Granger
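for: This work aims to improve the accuracy of visible-infrared person re-identification (V-I ReID), which retrieves images of the same person captured across a network of RGB and IR sensors, by learning a representation space shared between the two modalities.
methods: The paper proposes a training approach called Adaptive Generation of Privileged Intermediate Information (AGPI^2) that generates a virtual intermediate domain to bridge the distribution gap between the V and I modalities. AGPI^2 combines a non-linear generative module, which translates RGB images into intermediate images with a smaller domain shift relative to the IR domain, with an embedding module.
results: Experiments show that AGPI^2 improves V-I ReID matching accuracy without extra computational cost at inference.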

Abstract
Visible-infrared person re-identification seeks to retrieve images of the same individual captured over a distributed network of RGB and IR sensors. Several V-I ReID approaches directly integrate both V and I modalities to discriminate persons within a shared representation space. However, given the significant gap in data distributions between V and I modalities, cross-modal V-I ReID remains challenging. Some recent approaches improve generalization by leveraging intermediate spaces that can bridge V and I modalities, yet effective methods are required to select or generate data for such informative domains. In this paper, the Adaptive Generation of Privileged Intermediate Information training approach is introduced to adapt and generate a virtual domain that bridges discriminant information between the V and I modalities. The key motivation behind AGPI^2 is to enhance the training of a deep V-I ReID backbone by generating privileged images that provide additional information. These privileged images capture shared discriminative features that are not easily accessible within the original V or I modalities alone. Towards this goal, a non-linear generative module is trained with an adversarial objective, translating V images into intermediate spaces with a smaller domain shift w.r.t. the I domain. Meanwhile, the embedding module within AGPI^2 aims to produce similar features for both V and generated images, encouraging the extraction of features that are common to all modalities. In addition to these contributions, AGPI^2 employs adversarial objectives for adapting the intermediate images, which play a crucial role in creating a non-modality-specific space to address the large domain shifts between V and I domains. Experimental results conducted on challenging V-I ReID datasets indicate that AGPI^2 increases matching accuracy without extra computational resources during inference.

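Summary
Visible-infrared person re-identification aims to retrieve images of the same person captured by RGB and IR sensors. Some V-I ReID methods integrate the V and I modalities directly into a shared representation space, but because the data distributions of the two modalities differ substantially, cross-modal V-I ReID remains challenging. Some recent methods use intermediate spaces to bridge the V and I modalities, but they need effective ways to select or generate data. In this paper, we propose the Adaptive Generation of Privileged Intermediate Information (AGPI^2) training approach, which adapts and generates a virtual domain that bridges the V and I modalities. The key idea is to strengthen the training of a deep V-I ReID backbone by generating privileged images that carry shared discriminative information that is hard to access in the original V or I modality alone. To this end, a non-linear generative module in AGPI^2 is trained to map V images into an intermediate space with a smaller domain shift. Meanwhile, the embedding module in AGPI^2 tries to produce similar features for V and generated images, so that the extracted features are common to all modalities. In addition, AGPI^2 applies adversarial objectives to the intermediate images, which play a key role in creating a modality-agnostic space that addresses the large domain shift between the V and I domains. Experiments show that AGPI^2 improves matching accuracy without extra computational resources at inference.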

Synthesizing Artistic Cinemagraphs from Text

paper_url: http://arxiv.org/abs/2307.03190
repo_url: https://github.com/text2cinemagraph/text2cinemagraph
paper_authors: Aniruddha Mahapatra, Aliaksandr Siarohin, Hsin-Ying Lee, Sergey Tulyakov, Jun-Yan Zhu
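for: This paper presents a fully automated method for creating cinemagraphs from text descriptions.
methods: The method synthesizes image twins from a single text prompt: an artistic image and a pixel-aligned natural-looking counterpart. The artistic image depicts the style and appearance described in the prompt, while the natural-looking image simplifies layout and motion analysis. Using existing natural image and video datasets, the method segments the natural-looking image, predicts plausible motion, and transfers that motion to the artistic image to create the final cinemagraph.
results: The method outperforms existing approaches at creating cinemagraphs for natural landscapes as well as artistic and other-worldly scenes, as validated by automated metrics and user studies. It can also animate existing paintings and control motion direction through text.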

Abstract
We introduce Text2Cinemagraph, a fully automated method for creating cinemagraphs from text descriptions - an especially challenging task when prompts feature imaginary elements and artistic styles, given the complexity of interpreting the semantics and motions of these images. Existing single-image animation methods fall short on artistic inputs, and recent text-based video methods frequently introduce temporal inconsistencies, struggling to keep certain regions static. To address these challenges, we propose an idea of synthesizing image twins from a single text prompt - a pair of an artistic image and its pixel-aligned corresponding natural-looking twin. While the artistic image depicts the style and appearance detailed in our text prompt, the realistic counterpart greatly simplifies layout and motion analysis. Leveraging existing natural image and video datasets, we can accurately segment the realistic image and predict plausible motion given the semantic information. The predicted motion can then be transferred to the artistic image to create the final cinemagraph. Our method outperforms existing approaches in creating cinemagraphs for natural landscapes as well as artistic and other-worldly scenes, as validated by automated metrics and user studies. Finally, we demonstrate two extensions: animating existing paintings and controlling motion directions using text.

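Summary
We introduce Text2Cinemagraph, a fully automated method for turning text descriptions into cinemagraphs. The task is especially challenging when prompts contain imaginary elements and artistic styles, because the semantics and motion of such images are hard to interpret. Existing single-image animation methods perform poorly on artistic inputs, and recent text-based video methods often introduce temporal inconsistencies and struggle to keep certain regions static. To address these challenges, we propose synthesizing image twins from a single text prompt: a pair consisting of an artistic image and a corresponding natural-looking image. The artistic image renders the style and appearance described in the text, while the natural image greatly simplifies layout and motion analysis. Using existing natural image and video datasets, we can accurately segment the natural image and predict plausible motion from the semantic information. The predicted motion is then transferred to the artistic image to create the final cinemagraph. Our method performs well on cinemagraphs of natural landscapes as well as artistic and other-worldly scenes, as validated by automated metrics and user studies. Finally, we show two extensions: animating existing paintings and controlling motion direction with text.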

IPO-LDM: Depth-aided 360-degree Indoor RGB Panorama Outpainting via Latent Diffusion Model

paper_url: http://arxiv.org/abs/2307.03177
repo_url: None
paper_authors: Tianhao Wu, Chuanxia Zheng, Tat-Jen Cham
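for: This paper aims to generate high-quality 360-degree indoor RGB panoramas by outpainting with latent diffusion models (LDM).
methods: The paper introduces a new bi-modal latent diffusion structure that uses both RGB and depth panoramic data during training but can outpaint ordinary depth-free RGB images at inference. It also proposes applying progressive camera rotations during each denoising step to improve panorama wraparound consistency.
results: IPO-LDM significantly outperforms state-of-the-art methods on RGB panorama outpainting and can produce multiple, diverse, well-structured results for different types of masks.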

Abstract
Generating complete 360-degree panoramas from narrow field of view images is ongoing research as omnidirectional RGB data is not readily available. Existing GAN-based approaches face some barriers to achieving higher quality output, and have poor generalization performance over different mask types. In this paper, we present our 360-degree indoor RGB panorama outpainting model using latent diffusion models (LDM), called IPO-LDM. We introduce a new bi-modal latent diffusion structure that utilizes both RGB and depth panoramic data during training, but works surprisingly well to outpaint normal depth-free RGB images during inference. We further propose a novel technique of introducing progressive camera rotations during each diffusion denoising step, which leads to substantial improvement in achieving panorama wraparound consistency. Results show that our IPO-LDM not only significantly outperforms state-of-the-art methods on RGB panorama outpainting, but can also produce multiple and diverse well-structured results for different types of masks.

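Summary
Generating complete 360-degree panoramas from narrow field-of-view images is an active research topic, because omnidirectional RGB data is not readily available. Existing GAN-based methods have limited output quality and generalize poorly across mask types. In this paper, we present a 360-degree indoor RGB panorama outpainting model based on latent diffusion models (LDM), called IPO-LDM. We use a bi-modal latent diffusion structure that takes RGB and depth panoramic data during training but can outpaint depth-free RGB images at inference. We also propose progressively adding camera rotations at each diffusion denoising step, which improves panorama wraparound consistency. Results show that IPO-LDM not only clearly outperforms state-of-the-art RGB panorama outpainting methods but also generates multiple high-quality, well-structured results for different types of masks.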
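
Assuming the panorama is stored in equirectangular form, rotating the camera about the vertical axis is equivalent to circularly shifting the image columns, so rotating between denoising steps moves the left/right seam through the image. The sketch below shows only that shift-and-restore bookkeeping on a toy grid. `denoise_step` is a hypothetical placeholder, and the fixed per-step shift is an assumption rather than the schedule used by IPO-LDM.

```python
def roll_columns(grid, shift):
    """Circularly shift every row of a 2-D grid by `shift` columns (a yaw rotation)."""
    width = len(grid[0])
    shift %= width
    return [row[-shift:] + row[:-shift] if shift else list(row) for row in grid]

def denoise_step(grid):
    """Hypothetical placeholder for one diffusion denoising step (identity here)."""
    return [list(row) for row in grid]

def denoise_with_rotation(grid, num_steps, shift_per_step):
    """Rotate the panorama a little before each step, then undo the total rotation."""
    total = 0
    for _ in range(num_steps):
        grid = roll_columns(grid, shift_per_step)
        total += shift_per_step
        grid = denoise_step(grid)
    # Undo the accumulated rotation so the output is aligned with the input.
    return roll_columns(grid, -total)

panorama = [[0, 1, 2, 3], [4, 5, 6, 7]]
rotated = roll_columns(panorama, 1)
restored = denoise_with_rotation(panorama, num_steps=3, shift_per_step=1)
```

Because the seam sits at a different column on every step, no single column is always treated as an image border, which is one intuition for why rotation could help wraparound consistency.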

VideoGLUE: Video General Understanding Evaluation of Foundation Models

paper_url: http://arxiv.org/abs/2307.03166
repo_url: None
paper_authors: Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong
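for: This study evaluates the video understanding capabilities of existing foundation models (FMs) and proposes a scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when adapting to general video understanding tasks.
methods: The study uses three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four methods for adapting a foundation model to a downstream task.
results: The main findings are: (1) task-specialized models significantly outperform the six FMs studied, in sharp contrast to what FMs have achieved in natural language and image understanding; (2) video-native FMs are generally better at classifying motion-rich videos, localizing actions in time, and understanding videos with more than one action; (3) video-native FMs perform well under light adaptation (e.g., freezing the FM backbone), while image-native FMs win under full end-to-end finetuning.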

Abstract
We evaluate existing foundation models' video understanding capabilities using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring a foundation model (FM) for a downstream task. Moreover, we propose a scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when adapting to general video understanding tasks. Our main findings are as follows. First, task-specialized models significantly outperform the six FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data contains the video modality, are generally better than image-native FMs in classifying motion-rich videos, localizing actions in time, and understanding a video of more than one action. Third, the video-native FMs can perform well on video tasks under light adaptations to downstream tasks (e.g., freezing the FM backbones), while image-native FMs win in full end-to-end finetuning. The first two observations reveal the need and tremendous opportunities to conduct research on video-focused FMs, and the last confirms that both tasks and adaptation methods matter when it comes to the evaluation of FMs.

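Summary
We evaluate the video understanding capabilities of existing foundation models (FMs) using a carefully designed experimental protocol comprising three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods that tailor a foundation model to a downstream task. In addition, we propose a scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency on general video understanding tasks. Our main findings are as follows. First, task-specialized models significantly outperform the six FMs we study, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data includes the video modality, are better at classifying motion-rich videos, localizing actions in time, and understanding videos with multiple actions. Finally, video-native FMs achieve good performance on video tasks under light adaptation to downstream tasks (e.g., freezing the FM backbone), while image-native FMs perform better under full finetuning. These findings point to the need and opportunity for research on video-focused FMs and to the importance of both tasks and adaptation methods when evaluating FMs.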

Can Domain Adaptation Improve Accuracy and Fairness of Skin Lesion Classification?

paper_url: http://arxiv.org/abs/2307.03157
repo_url: None
paper_authors: Janet Wang, Yunbei Zhang, Zhengming Ding, Jihun Hamm
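for: This study investigates the feasibility of unsupervised domain adaptation (UDA) methods for binary and multi-class skin lesion classification using multiple skin lesion datasets.
methods: The authors assess three UDA training schemes (single-source, combined-source, and multi-source) to address the scarcity of labeled data in skin lesion classification.
results: UDA is effective in binary classification, with further improvement when class imbalance is mitigated. In multi-class classification its benefit is less pronounced, and imbalance must be addressed to achieve above-baseline accuracy. Test error is strongly correlated with label shift, and feature-level UDA methods have limitations on imbalanced datasets. UDA also reduces bias against minority groups without explicit fairness-focused techniques.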

Abstract
Deep learning-based diagnostic systems have demonstrated potential in classifying skin cancer conditions when labeled training examples are abundant. However, skin lesion analysis often suffers from a scarcity of labeled data, hindering the development of an accurate and reliable diagnostic system. In this work, we leverage multiple skin lesion datasets and investigate the feasibility of various unsupervised domain adaptation (UDA) methods in binary and multi-class skin lesion classification. In particular, we assess three UDA training schemes: single-, combined-, and multi-source. Our experimental results show that UDA is effective in binary classification, with further improvement being observed when imbalance is mitigated. In the multi-class task, its performance is less prominent, and the imbalance problem again needs to be addressed to achieve above-baseline accuracy. Through our quantitative analysis, we find that the test error of multi-class tasks is strongly correlated with label shift, and feature-level UDA methods have limitations when handling imbalanced datasets. Finally, our study reveals that UDA can effectively reduce bias against minority groups and promote fairness, even without the explicit use of fairness-focused techniques.
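
One simple way to quantify label shift between a source and a target dataset is the total variation distance between their class-frequency distributions. The abstract does not say which measure the authors use, so the metric, the function names, and the toy labels below are all assumptions.

```python
from collections import Counter

def class_distribution(labels, classes):
    """Return the relative frequency of each class, in the order given by `classes`."""
    counts = Counter(labels)
    n = len(labels)
    return [counts[c] / n for c in classes]

def label_shift_tv(source_labels, target_labels):
    """Total variation distance between two label distributions (0 = identical, 1 = disjoint)."""
    classes = sorted(set(source_labels) | set(target_labels))
    p = class_distribution(source_labels, classes)
    q = class_distribution(target_labels, classes)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Toy example: the source set is imbalanced, the target set is balanced.
source = ["nevus"] * 8 + ["melanoma"] * 2
target = ["nevus"] * 5 + ["melanoma"] * 5
shift = label_shift_tv(source, target)
```

A value near 0 means the two datasets have similar class proportions; larger values indicate a stronger label shift.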

MultiVENT: Multilingual Videos of Events with Aligned Natural Text

paper_url: http://arxiv.org/abs/2307.03153
repo_url: None
paper_authors: Kate Sanders, David Etter, Reno Kriz, Benjamin Van Durme
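for: This paper constructs MultiVENT, a dataset of multilingual, event-centric videos, so that models can be taught to benefit from the diverse presentation formats of modern news coverage.
methods: The authors build the MultiVENT dataset, analyze the state of online news videos, and examine how such videos can be used to build robust, factually accurate multilingual models.
results: The paper provides a model for complex, multilingual video retrieval to serve as a baseline for information retrieval with MultiVENT.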

Abstract
Everyday news coverage has shifted from traditional broadcasts towards a wide range of presentation formats such as first-hand, unedited video footage. Datasets that reflect the diverse array of multimodal, multilingual news sources available online could be used to teach models to benefit from this shift, but existing news video datasets focus on traditional news broadcasts produced for English-speaking audiences. We address this limitation by constructing MultiVENT, a dataset of multilingual, event-centric videos grounded in text documents across five target languages. MultiVENT includes both news broadcast videos and non-professional event footage, which we use to analyze the state of online news videos and how they can be leveraged to build robust, factually accurate models. Finally, we provide a model for complex, multilingual video retrieval to serve as a baseline for information retrieval using MultiVENT.

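Summary
Everyday news coverage has become highly varied, shifting from traditional broadcasts toward many formats of first-hand video footage. Existing news video datasets consist of traditional news broadcasts produced for English-speaking audiences. We address this limitation by constructing MultiVENT, a dataset of multilingual, event-centric videos and text documents. MultiVENT includes both news broadcast videos and non-professional event footage, which we use to analyze the state of online news videos and to explore how MultiVENT can be used to build robust, accurate models. Finally, we provide a model for complex, multilingual video retrieval as a baseline for MultiVENT.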

Topology-Aware Loss for Aorta and Great Vessel Segmentation in Computed Tomography Images

paper_url: http://arxiv.org/abs/2307.03137
repo_url: None
paper_authors: Seher Ozcelik, Sinan Unver, Ilke Ali Gurses, Rustu Turkay, Cigdem Gunduz-Demir
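for: This paper addresses the segmentation of the aorta and great vessels in computed tomography (CT) images.
methods: The paper introduces a new topology-aware loss function that uses persistent homology to penalize topological dissimilarity between the prediction and the ground truth.
results: Experiments show that the proposed loss function yields better results than its counterparts, indicating the effectiveness of the approach.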

Abstract
Segmentation networks are not explicitly encouraged to learn global invariants of an image, such as the shape of an object and the geometry between multiple objects, when they are trained with a standard loss function. On the other hand, incorporating such invariants into network training may help improve performance for various segmentation tasks when they are the intrinsic characteristics of the objects to be segmented. One example is segmentation of the aorta and great vessels in computed tomography (CT) images, where vessels are found in a particular geometry in the body due to human anatomy and mostly appear as round objects on a 2D CT image. This paper addresses this issue by introducing a new topology-aware loss function that penalizes topology dissimilarities between the ground truth and prediction through persistent homology. Different from the previously suggested segmentation network designs, which apply the threshold filtration on a likelihood function of the prediction map and the Betti numbers of the ground truth, this paper proposes to apply the Vietoris-Rips filtration to obtain persistence diagrams of both ground truth and prediction maps and calculate the dissimilarity with the Wasserstein distance between the corresponding persistence diagrams. The use of this filtration has the advantage of modeling shape and geometry at the same time, which may not happen when the threshold filtration is applied. Our experiments on 4327 CT images of 24 subjects reveal that the proposed topology-aware loss function leads to better results than its counterparts, indicating the effectiveness of this use.

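Summary
Segmentation networks trained with a standard loss function do not explicitly learn global invariants of an image, such as the shape of an object and the geometry between multiple objects. However, incorporating these invariants into network training may improve performance on various segmentation tasks when they are intrinsic characteristics of the objects to be segmented. One example is segmenting the aorta and great vessels in computed tomography (CT) images, where the vessels follow a particular geometry determined by human anatomy and usually appear as round objects on a 2D CT image. This paper addresses the issue by introducing a new topology-aware loss function that uses persistent homology to penalize topological dissimilarity between the ground truth and the prediction. Unlike previous segmentation network designs, this paper proposes using the Vietoris-Rips filtration to obtain persistence diagrams of both the ground truth and prediction maps and computing the Wasserstein distance between them. The advantage of this filtration is that it models shape and geometry at the same time, which may not happen when the threshold filtration is applied. Our experiments on 4327 CT images of 24 subjects show that the proposed topology-aware loss function is more effective than other methods, indicating the effectiveness of this use.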
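
As a small illustration of what a Vietoris-Rips filtration records, the sketch below computes the 0-dimensional persistence (connected components) of a 2-D point set. Every component is born at scale 0, and a component dies at the length of the edge that merges it into another; these death scales are exactly the minimum-spanning-tree edge lengths. The sketch covers dimension 0 only, assumes the convention that an edge enters the filtration at a scale equal to its length, and omits the Wasserstein comparison, so it is not the authors' loss. All names are illustrative.

```python
import math

def h0_death_scales(points):
    """Return ascending death scales of connected components under a Vietoris-Rips filtration.

    Uses Kruskal's algorithm with union-find: each merge of two components
    at edge length d ends one component's life at scale d.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All pairwise edges, shortest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)
    return deaths

# Three collinear points: gaps of length 1 and 2.
pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
deaths = h0_death_scales(pts)
```

For n points there are n - 1 finite death scales; the remaining component never dies. Comparing such diagrams for a prediction and a ground truth, for example with a Wasserstein distance, is the step the abstract describes and this sketch leaves out.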

Benchmarking Test-Time Adaptation against Distribution Shifts in Image Classification

paper_url: http://arxiv.org/abs/2307.03133
repo_url: https://github.com/yuyongcan/benchmark-tta
paper_authors: Yongcan Yu, Lijun Sheng, Ran He, Jian Liang
for: This study provides a reliable benchmark for evaluating test-time adaptation (TTA) methods, so that researchers and practitioners can accurately assess and compare how well different TTA methods improve model robustness and generalization.
methods: The study systematically evaluates 13 prominent TTA methods and their variants on five widely used image classification datasets (CIFAR-10-C, CIFAR-100-C, ImageNet-C, DomainNet, and Office-Home). The methods cover a range of adaptation scenarios (e.g., online vs. offline adaptation; instance vs. batch vs. domain adaptation). The study also explores the compatibility of different TTA methods with diverse network backbones.
results: The study finds that the effectiveness of TTA methods varies across prediction scenarios: some methods perform better in certain scenarios than others, and some are more compatible with particular network backbones. These findings shed light on the strengths and limitations of different TTA methods and can help guide future research in this area.

Abstract
Test-time adaptation (TTA) is a technique aimed at enhancing the generalization performance of models by leveraging unlabeled samples solely during prediction. Given the need for robustness in neural network systems when faced with distribution shifts, numerous TTA methods have recently been proposed. However, evaluating these methods is often done under different settings, such as varying distribution shifts, backbones, and scenario designs, leading to a lack of consistent and fair benchmarks to validate their effectiveness. To address this issue, we present a benchmark that systematically evaluates 13 prominent TTA methods and their variants on five widely used image classification datasets: CIFAR-10-C, CIFAR-100-C, ImageNet-C, DomainNet, and Office-Home. These methods encompass a wide range of adaptation scenarios (e.g., online adaptation vs. offline adaptation, instance adaptation vs. batch adaptation vs. domain adaptation). Furthermore, we explore the compatibility of different TTA methods with diverse network backbones. To implement this benchmark, we have developed a unified framework in PyTorch, which allows for consistent evaluation and comparison of the TTA methods across the different datasets and network architectures. By establishing this benchmark, we aim to provide researchers and practitioners with a reliable means of assessing and comparing the effectiveness of TTA methods in improving model robustness and generalization performance. Our code is available at https://github.com/yuyongcan/Benchmark-TTA.

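Summary
Test-time adaptation (TTA) is a technique that aims to improve a model's generalization performance by using unlabeled samples at prediction time. Because neural network systems need to be robust to distribution shifts, many TTA methods have been proposed. However, these methods are usually evaluated under different settings, such as different distribution shifts, backbones, and scenario designs, which makes evaluations inconsistent and unfair. To address this, we present a benchmark that systematically evaluates 13 prominent TTA methods and their variants on five widely used image classification datasets: CIFAR-10-C, CIFAR-100-C, ImageNet-C, DomainNet, and Office-Home. These methods cover a range of adaptation scenarios (e.g., online vs. offline adaptation; instance vs. batch vs. domain adaptation). We also explore the compatibility of different TTA methods with different network backbones. To implement the benchmark, we developed a unified framework in PyTorch that allows consistent evaluation and comparison of TTA methods across datasets and network architectures. By establishing this benchmark, we hope to give researchers and practitioners a reliable way to assess and compare how well TTA methods improve model robustness and generalization. Our code is available at https://github.com/yuyongcan/Benchmark-TTA.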

Principal subbundles for dimension reduction

paper_url: http://arxiv.org/abs/2307.03128
repo_url: None
paper_authors: Morten Akhøj, James Benn, Erlend Grong, Stefan Sommer, Xavier Pennec
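for: Manifold learning and surface reconstruction.
methods: Combines local linear approximations of a point cloud, obtained by local PCA, into lower-dimensional bundles.
results: The approach can be applied to a number of important problems, such as constructing approximating submanifolds and computing distances between observations.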

Abstract
In this paper we demonstrate how sub-Riemannian geometry can be used for manifold learning and surface reconstruction by combining local linear approximations of a point cloud to obtain lower dimensional bundles. Local approximations obtained by local PCAs are collected into a rank $k$ tangent subbundle on $\mathbb{R}^d$, $k

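Summary
In this paper, we show how sub-Riemannian geometry can be used for manifold learning and surface reconstruction by combining local linear approximations into a lower-dimensional bundle. This bundle is defined on $\mathbb{R}^d$ as a rank $k$ tangent subbundle, where $k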

LISSNAS: Locality-based Iterative Search Space Shrinkage for Neural Architecture Search

paper_url: http://arxiv.org/abs/2307.03110
repo_url: None
paper_authors: Bhavna Gopal, Arjun Sridhar, Tunhou Zhang, Yiran Chen
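for: This paper proposes an automated search space shrinkage algorithm that improves search performance while retaining architectural diversity.
methods: The approach exploits locality, the relationship between structural similarity and performance similarity, to shrink the search space efficiently while preserving diversity.
results: Experiments across search spaces of various sizes and datasets show that LISSNAS achieves a SOTA Top-1 accuracy of 77.6% on ImageNet under mobile constraints, along with best-in-class Kendall-Tau, architectural diversity, and search space size.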

Abstract
Search spaces hallmark the advancement of Neural Architecture Search (NAS). Large and complex search spaces with versatile building operators and structures provide more opportunities to brew promising architectures, yet pose severe challenges on efficient exploration and exploitation. Subsequently, several search space shrinkage methods optimize by selecting a single sub-region that contains some well-performing networks. Small performance and efficiency gains are observed with these methods but such techniques leave room for significantly improved search performance and are ineffective at retaining architectural diversity. We propose LISSNAS, an automated algorithm that shrinks a large space into a diverse, small search space with SOTA search performance. Our approach leverages locality, the relationship between structural and performance similarity, to efficiently extract many pockets of well-performing networks. We showcase our method on an array of search spaces spanning various sizes and datasets. We accentuate the effectiveness of our shrunk spaces when used in one-shot search by achieving the best Top-1 accuracy in two different search spaces. Our method achieves a SOTA Top-1 accuracy of 77.6% in ImageNet under mobile constraints, best-in-class Kendall-Tau, architectural diversity, and search space size.

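Summary
Search spaces hallmark the advancement of Neural Architecture Search (NAS). Large, complex search spaces with versatile building operators and structures offer more opportunities to produce promising architectures, but they make efficient exploration and exploitation severely challenging. As a result, many search space shrinkage methods select a single sub-region containing some well-performing networks. These methods yield small gains in performance and efficiency, but they leave considerable room for improving search performance and fail to retain architectural diversity. We propose LISSNAS, an automated algorithm that shrinks a large space into a diverse, small search space with strong search performance. Our method uses locality, the relationship between structural and performance similarity, to efficiently extract many pockets of well-performing networks. We demonstrate it on multiple search spaces and achieve a SOTA Top-1 accuracy of 77.6% on ImageNet under mobile constraints, together with best-in-class Kendall-Tau, architectural diversity, and search space size.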
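
Kendall-Tau, one of the metrics reported for LISSNAS, measures how well a predicted ranking of architectures agrees with their true ranking. A minimal sketch of the tau-a variant (no tie correction) follows; the function name and the toy accuracy values are assumptions, and the paper may use a different variant.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Toy example: predicted vs. actual accuracy for four hypothetical architectures.
predicted = [0.71, 0.74, 0.69, 0.77]
actual = [0.70, 0.75, 0.72, 0.76]
tau = kendall_tau(predicted, actual)
```

A value of 1 means the two rankings agree on every pair, -1 means they disagree on every pair, and values near 0 mean the predicted ranking carries little information about the true one.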

How to Detect Unauthorized Data Usages in Text-to-image Diffusion Models

paper_url: http://arxiv.org/abs/2307.03108
repo_url: None
paper_authors: Zhenting Wang, Chen Chen, Yuchen Liu, Lingjuan Lyu, Dimitris Metaxas, Shiqing Ma
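for: Detecting unauthorized data usage in text-to-image diffusion models.
methods: Plants injected memorization into models trained on the protected dataset, then detects unauthorized data usage by analyzing whether a model has memorized the injected content.
results: Experiments on Stable Diffusion and LoRA models demonstrate that the method effectively detects unauthorized data usage.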

Abstract
Recent text-to-image diffusion models have shown surprising performance in generating high-quality images. However, concerns have arisen regarding the unauthorized usage of data during the training process. One example is when a model trainer collects a set of images created by a particular artist and attempts to train a model capable of generating similar images without obtaining permission from the artist. To address this issue, it becomes crucial to detect unauthorized data usage. In this paper, we propose a method for detecting such unauthorized data usage by planting injected memorization into the text-to-image diffusion models trained on the protected dataset. Specifically, we modify the protected image dataset by adding unique contents on the images such as stealthy image wrapping functions that are imperceptible to human vision but can be captured and memorized by diffusion models. By analyzing whether the model has memorization for the injected content (i.e., whether the generated images are processed by the chosen post-processing function), we can detect models that had illegally utilized the unauthorized data. Our experiments conducted on Stable Diffusion and LoRA model demonstrate the effectiveness of the proposed method in detecting unauthorized data usages.

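Summary
Recent text-to-image diffusion models have shown outstanding performance in generating high-quality images. However, concerns have been raised about unauthorized data usage. One example is a model trainer who collects a set of images created by a particular artist and tries to train a model that generates similar images without obtaining the artist's permission. To address this problem, detecting unauthorized data usage becomes very important. In this paper, we propose a method that adds unique content to the protected image set, such as stealthy image wrapping functions, so that diffusion models capture and memorize it. Then, by checking whether a model has memorized this content (i.e., whether its generated images are processed by the chosen post-processing function), we can detect whether the model used unauthorized data. We ran experiments on Stable Diffusion and LoRA models and demonstrated the effectiveness of our method.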

Contextual Affinity Distillation for Image Anomaly Detection

paper_url: http://arxiv.org/abs/2307.03101
repo_url: None
paper_authors: Jie Zhang, Masanori Suganuma, Takayuki Okatani
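for: This study aims to improve unsupervised industrial anomaly detection, particularly the detection of logical anomalies, without requiring cumbersome training techniques.
methods: Building on previous knowledge distillation work, the method uses two students (local and global) to better mimic the teacher's behavior. The local student mainly detects structural anomalies, while the global student attends to logical anomalies. To further encourage the global student to capture long-range dependencies, the authors design the global context condensing block (GCCB) and propose a contextual affinity loss.
results: Experiments show that the proposed method does not need cumbersome training techniques and achieves a new state-of-the-art performance on the MVTec LOCO AD dataset.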

Abstract
Previous works on unsupervised industrial anomaly detection mainly focus on local structural anomalies such as cracks and color contamination. While achieving significantly high detection performance on this kind of anomaly, they struggle with logical anomalies that violate long-range dependencies, such as a normal object placed in the wrong position. In this paper, based on previous knowledge distillation works, we propose to use two students (local and global) to better mimic the teacher's behavior. The local student, which is used in previous studies, mainly focuses on structural anomaly detection, while the global student pays attention to logical anomalies. To further encourage the global student's learning to capture long-range dependencies, we design the global context condensing block (GCCB) and propose a contextual affinity loss for the student training and anomaly scoring. Experimental results show the proposed method does not need cumbersome training techniques and achieves a new state-of-the-art performance on the MVTec LOCO AD dataset.

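Summary
Previous work on industrial anomaly detection mainly focuses on local structural anomalies such as cracks and color contamination. Although it achieves very high detection performance on such anomalies, it struggles with logical anomalies that violate long-range dependencies, such as a normal object placed in the wrong position. In this paper, building on previous knowledge distillation work, we propose using two students (local and global) to better mimic the teacher's behavior. The local student, used in previous studies, mainly handles structural anomaly detection, while the global student attends to logical anomalies. To further encourage the global student to capture long-range dependencies, we design the global context condensing block (GCCB) and propose a contextual affinity loss. Experiments show that the proposed method does not need complex training techniques and achieves a new state of the art on the MVTec LOCO AD dataset.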
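
One plausible reading of a contextual affinity loss is to compare pairwise feature similarities rather than the features themselves: build a cosine-similarity matrix over spatial locations for the teacher and for the student, then penalise the difference between the two matrices. The sketch below illustrates that idea under those assumptions; it is not the loss defined in the paper, and the function names and the mean-squared-error penalty are illustrative choices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def affinity_matrix(features):
    """Pairwise cosine similarities between all feature vectors (one per location)."""
    return [[cosine(u, v) for v in features] for u in features]

def affinity_loss(teacher_feats, student_feats):
    """Mean squared difference between teacher and student affinity matrices."""
    t = affinity_matrix(teacher_feats)
    s = affinity_matrix(student_feats)
    n = len(t)
    return sum((t[i][j] - s[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

teacher = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
student_good = [(2.0, 0.0), (0.0, 3.0), (1.0, 1.0)]  # same directions, different scale
student_bad = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]   # collapsed features
loss_good = affinity_loss(teacher, student_good)
loss_bad = affinity_loss(teacher, student_bad)
```

Because each matrix entry relates two locations that may be far apart in the image, a loss of this kind depends on long-range relationships, which is the property the abstract associates with logical anomalies.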
