cs.CV - 2023-08-31

Fine-Grained Cross-View Geo-Localization Using a Correlation-Aware Homography Estimator

  • paper_url: http://arxiv.org/abs/2308.16906
  • repo_url: https://github.com/xlwangdev/hc-net
  • paper_authors: Xiaolong Wang, Runsen Xu, Zuofan Cui, Zeyu Wan, Yu Zhang
  • for: This work addresses the problem of fine-grained cross-view geo-localization.
  • methods: A ground image is aligned with the GPS-tagged satellite image of the same area via homography estimation. A differentiable spherical transform, following geometric principles, first brings the ground-image perspective into the satellite view; a robust correlation-aware homography estimator then handles challenges such as occlusion, small overlapping range, and seasonal variation.
  • results: Running at 30 FPS, the method significantly reduces the mean metric localization error, by 21.3% and 32.4% on the same-area and cross-area generalization tasks of the VIGOR benchmark, respectively, and by 34.4% on the KITTI benchmark.
    Abstract In this paper, we introduce a novel approach to fine-grained cross-view geo-localization. Our method aligns a warped ground image with a corresponding GPS-tagged satellite image covering the same area using homography estimation. We first employ a differentiable spherical transform, adhering to geometric principles, to accurately align the perspective of the ground image with the satellite map. This transformation effectively places ground and aerial images in the same view and on the same plane, reducing the task to an image alignment problem. To address challenges such as occlusion, small overlapping range, and seasonal variations, we propose a robust correlation-aware homography estimator to align similar parts of the transformed ground image with the satellite image. Our method achieves sub-pixel resolution and meter-level GPS accuracy by mapping the center point of the transformed ground image to the satellite image using a homography matrix and determining the orientation of the ground camera using a point above the central axis. Operating at a speed of 30 FPS, our method outperforms state-of-the-art techniques, reducing the mean metric localization error by 21.3% and 32.4% in same-area and cross-area generalization tasks on the VIGOR benchmark, respectively, and by 34.4% on the KITTI benchmark in same-area evaluation.
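
Code sketch (not from the paper): the abstract reduces localization to estimating a homography between the transformed ground image and the satellite image, then mapping the image center (position) and a point above the central axis (orientation). The sketch below illustrates only that final mapping step with an illustrative, made-up homography matrix.

```python
# Hedged sketch: project the ground-image center through a 3x3 homography H to get the
# satellite-pixel location, and use a second point above the center to derive heading.
import numpy as np

def project(H, pt):
    """Apply homography H (3x3) to a 2D point (x, y)."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return np.array([x / w, y / w])

H = np.array([[1.02, 0.05, 120.0],    # illustrative homography values, not real outputs
              [-0.04, 0.98, 340.0],
              [1e-5, 2e-5, 1.0]])

center = (256.0, 256.0)               # center of the transformed ground image
above = (256.0, 200.0)                # a point above the central axis

loc = project(H, center)              # pixel location on the satellite map
heading_vec = project(H, above) - loc
heading_deg = np.degrees(np.arctan2(heading_vec[0], -heading_vec[1]))  # 0 deg points "up" in image coords

print("satellite-pixel location:", loc, "heading (deg):", heading_deg)
```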

EMDB: The Electromagnetic Database of Global 3D Human Pose and Shape in the Wild

  • paper_url: http://arxiv.org/abs/2308.16894
  • repo_url: https://github.com/eth-ait/emdb
  • paper_authors: Manuel Kaufmann, Jie Song, Chen Guo, Kaiyue Shen, Tianjian Jiang, Chengcheng Tang, Juan Zarate, Otmar Hilliges
  • For: This paper presents a novel dataset called EMDB, which contains high-quality 3D SMPL pose and shape parameters with global body and camera trajectories for in-the-wild videos.
  • Methods: The authors use body-worn, wireless electromagnetic (EM) sensors and a hand-held iPhone to record motion data, and propose a multi-stage optimization procedure to construct EMDB. They also leverage a neural implicit avatar model to reconstruct detailed human surface geometry and appearance.
  • Results: Evaluated against a multi-view volumetric capture system, EMDB has an expected accuracy of 2.3 cm positional and 10.6 degrees angular error, surpassing the accuracy of previous in-the-wild datasets. The authors also evaluate existing state-of-the-art monocular RGB methods for camera-relative and global pose estimation on EMDB. EMDB is publicly available at https://ait.ethz.ch/emdb.
    Abstract We present EMDB, the Electromagnetic Database of Global 3D Human Pose and Shape in the Wild. EMDB is a novel dataset that contains high-quality 3D SMPL pose and shape parameters with global body and camera trajectories for in-the-wild videos. We use body-worn, wireless electromagnetic (EM) sensors and a hand-held iPhone to record a total of 58 minutes of motion data, distributed over 81 indoor and outdoor sequences and 10 participants. Together with accurate body poses and shapes, we also provide global camera poses and body root trajectories. To construct EMDB, we propose a multi-stage optimization procedure, which first fits SMPL to the 6-DoF EM measurements and then refines the poses via image observations. To achieve high-quality results, we leverage a neural implicit avatar model to reconstruct detailed human surface geometry and appearance, which allows for improved alignment and smoothness via a dense pixel-level objective. Our evaluations, conducted with a multi-view volumetric capture system, indicate that EMDB has an expected accuracy of 2.3 cm positional and 10.6 degrees angular error, surpassing the accuracy of previous in-the-wild datasets. We evaluate existing state-of-the-art monocular RGB methods for camera-relative and global pose estimation on EMDB. EMDB is publicly available under https://ait.ethz.ch/emdb
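
Code sketch (not the EMDB evaluation code): the positional (cm) and angular (degrees) error figures quoted above are typically computed from predicted vs. reference joint positions and joint rotations as below.

```python
# Hedged sketch of mean positional and angular error metrics over a motion sequence.
import numpy as np

def mean_positional_error_cm(pred_joints, gt_joints):
    """pred_joints, gt_joints: (N, J, 3) arrays of joint positions in meters."""
    return 100.0 * np.linalg.norm(pred_joints - gt_joints, axis=-1).mean()

def mean_angular_error_deg(pred_R, gt_R):
    """pred_R, gt_R: (N, J, 3, 3) rotation matrices; geodesic distance on SO(3)."""
    rel = np.einsum('...ij,...kj->...ik', pred_R, gt_R)                 # pred @ gt^T
    cos = np.clip((np.trace(rel, axis1=-2, axis2=-1) - 1) / 2, -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()
```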

GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields

  • paper_url: http://arxiv.org/abs/2308.16891
  • repo_url: https://github.com/YanjieZe/GNFactor
  • paper_authors: Yanjie Ze, Ge Yan, Yueh-Hua Wu, Annabella Macaluso, Yuying Ge, Jianglong Ye, Nicklas Hansen, Li Erran Li, Xiaolong Wang
  • for: This paper aims to develop a visual behavior cloning agent for multi-task robotic manipulation that can execute diverse tasks from visual observations in unstructured real-world environments.
  • methods: The proposed method, called GNFactor, uses a combination of a generalizable neural field (GNF) and a Perceiver Transformer to jointly optimize a shared deep 3D voxel representation, leveraging a vision-language foundation model to incorporate semantics in 3D.
  • results: The authors evaluate GNFactor on 3 real robot tasks and perform detailed ablations on 10 RLBench tasks with a limited number of demonstrations, showing a substantial improvement over current state-of-the-art methods in seen and unseen tasks, demonstrating the strong generalization ability of GNFactor.
    Abstract It is a long-standing problem in robotics to develop agents capable of executing diverse manipulation tasks from visual observations in unstructured real-world environments. To achieve this goal, the robot needs to have a comprehensive understanding of the 3D structure and semantics of the scene. In this work, we present $\textbf{GNFactor}$, a visual behavior cloning agent for multi-task robotic manipulation with $\textbf{G}$eneralizable $\textbf{N}$eural feature $\textbf{F}$ields. GNFactor jointly optimizes a generalizable neural field (GNF) as a reconstruction module and a Perceiver Transformer as a decision-making module, leveraging a shared deep 3D voxel representation. To incorporate semantics in 3D, the reconstruction module utilizes a vision-language foundation model ($\textit{e.g.}$, Stable Diffusion) to distill rich semantic information into the deep 3D voxel. We evaluate GNFactor on 3 real robot tasks and perform detailed ablations on 10 RLBench tasks with a limited number of demonstrations. We observe a substantial improvement of GNFactor over current state-of-the-art methods in seen and unseen tasks, demonstrating the strong generalization ability of GNFactor. Our project website is https://yanjieze.com/GNFactor/ .

Text2Scene: Text-driven Indoor Scene Stylization with Part-aware Details

  • paper_url: http://arxiv.org/abs/2308.16880
  • repo_url: None
  • paper_authors: Inwoo Hwang, Hyeonwoo Kim, Young Min Kim
  • for: The goal is to create detailed textures for the objects in a virtual scene.
  • methods: Guided by a reference image and text descriptions, the method adds texture to labeled 3D geometries so that the generated colors respect the hierarchical structure or semantic parts of the scene.
  • results: The method creates detailed textures for scenes with multiple objects while maintaining structural context and, to the authors' knowledge, is the first practical and scalable approach to do so without requiring a dedicated dataset of artist-designed high-quality textures.
    Abstract We propose Text2Scene, a method to automatically create realistic textures for virtual scenes composed of multiple objects. Guided by a reference image and text descriptions, our pipeline adds detailed texture on labeled 3D geometries in the room such that the generated colors respect the hierarchical structure or semantic parts that are often composed of similar materials. Instead of applying flat stylization on the entire scene at a single step, we obtain weak semantic cues from geometric segmentation, which are further clarified by assigning initial colors to segmented parts. Then we add texture details for individual objects such that their projections on image space exhibit feature embedding aligned with the embedding of the input. The decomposition makes the entire pipeline tractable to a moderate amount of computation resources and memory. As our framework utilizes the existing resources of image and text embedding, it does not require dedicated datasets with high-quality textures designed by skillful artists. To the best of our knowledge, it is the first practical and scalable approach that can create detailed and realistic textures of the desired style that maintain structural context for scenes with multiple objects.

SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation

  • paper_url: http://arxiv.org/abs/2308.16876
  • repo_url: https://github.com/JiabenChen/SportsSloMo
  • paper_authors: Jiaben Chen, Huaizu Jiang
  • For: The paper aims to improve video frame interpolation in human-centric scenarios, particularly for the sports analysis industry, by introducing a new benchmark dataset and two human-aware loss terms.
  • Methods: Several state-of-the-art video frame interpolation methods are re-trained on SportsSloMo, a new benchmark of high-resolution slow-motion sports videos crawled from YouTube. Two human-aware loss terms are introduced to improve interpolation accuracy.
  • Results: The proposed loss terms lead to consistent performance improvements over five existing models, establishing strong baselines on SportsSloMo. The results also highlight the difficulty of the benchmark and the importance of human-aware priors in video frame interpolation.
    Abstract Human-centric video frame interpolation has great potential for improving people's entertainment experiences and finding commercial applications in the sports analysis industry, e.g., synthesizing slow-motion videos. Although there are multiple benchmark datasets available in the community, none of them is dedicated for human-centric scenarios. To bridge this gap, we introduce SportsSloMo, a benchmark consisting of more than 130K video clips and 1M video frames of high-resolution ($\geq$720p) slow-motion sports videos crawled from YouTube. We re-train several state-of-the-art methods on our benchmark, and the results show a decrease in their accuracy compared to other datasets. It highlights the difficulty of our benchmark and suggests that it poses significant challenges even for the best-performing methods, as human bodies are highly deformable and occlusions are frequent in sports videos. To improve the accuracy, we introduce two loss terms considering the human-aware priors, where we add auxiliary supervision to panoptic segmentation and human keypoints detection, respectively. The loss terms are model agnostic and can be easily plugged into any video frame interpolation approaches. Experimental results validate the effectiveness of our proposed loss terms, leading to consistent performance improvement over 5 existing models, which establish strong baseline models on our benchmark. The dataset and code can be found at: https://neu-vi.github.io/SportsSlomo/.
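
Code sketch (not the paper's implementation): the abstract describes two model-agnostic human-aware loss terms that add auxiliary supervision from panoptic segmentation and human keypoints. A minimal way to realize that idea is to require the interpolated frame to agree with the ground-truth frame under frozen segmentation and keypoint networks; `seg_net` and `kpt_net` below are placeholders, not the paper's models.

```python
import torch
import torch.nn.functional as F

def human_aware_loss(pred_frame, gt_frame, seg_net, kpt_net,
                     lambda_seg=0.1, lambda_kpt=0.1):
    l_pix = F.l1_loss(pred_frame, gt_frame)                      # standard reconstruction term
    with torch.no_grad():                                        # pseudo-labels from the GT frame
        seg_target = seg_net(gt_frame).argmax(dim=1)             # (B, H, W) class indices
        kpt_target = kpt_net(gt_frame)                           # (B, K, H, W) keypoint heatmaps
    l_seg = F.cross_entropy(seg_net(pred_frame), seg_target)     # segmentation consistency
    l_kpt = F.mse_loss(kpt_net(pred_frame), kpt_target)          # keypoint-heatmap consistency
    return l_pix + lambda_seg * l_seg + lambda_kpt * l_kpt
```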

Holistic Processing of Colour Images Using Novel Quaternion-Valued Wavelets on the Plane

  • paper_url: http://arxiv.org/abs/2308.16875
  • repo_url: None
  • paper_authors: Neil D. Dizon, Jeffrey A. Hogan
  • for: color image processing
  • methods: quaternionic wavelet filters, quaternion-valued wavelets on the plane
  • results: demonstration of quaternion-valued wavelets as a promising tool for holistic color image processing, including compression, enhancement, segmentation, and denoising techniques.
    Abstract We investigate the applicability of quaternion-valued wavelets on the plane to holistic colour image processing. We present a methodology for decomposing and reconstructing colour images using quaternionic wavelet filters associated to recently developed quaternion-valued wavelets on the plane. We consider compression, enhancement, segmentation, and denoising techniques to demonstrate quaternion-valued wavelets as a promising tool for holistic colour image processing.
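
Code sketch (basic algebra only, not the authors' wavelet construction): "holistic" colour processing with quaternions rests on embedding an RGB pixel as a pure quaternion (0, r, g, b), so that a quaternion-valued filter coefficient acts on all three channels jointly through the Hamilton product.

```python
import numpy as np

def hamilton(q, p):
    """Hamilton product of quaternions q = (w, x, y, z) and p = (w, x, y, z)."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = p
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

pixel = np.array([0.0, 0.8, 0.4, 0.2])    # pure quaternion encoding an RGB value
coeff = np.array([0.5, 0.1, -0.2, 0.3])   # an example quaternion filter coefficient (illustrative)
print(hamilton(coeff, pixel))             # filtering mixes all colour channels at once
```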

Self-pruning Graph Neural Network for Predicting Inflammatory Disease Activity in Multiple Sclerosis from Brain MR Images

  • paper_url: http://arxiv.org/abs/2308.16863
  • repo_url: https://github.com/chinmay5/ms_ida
  • paper_authors: Chinmay Prabhakar, Hongwei Bran Li, Johannes C. Paetzold, Timo Loehr, Chen Niu, Mark Mühlau, Daniel Rueckert, Benedikt Wiestler, Bjoern Menze
  • For: Predicting inflammatory disease activity in Multiple Sclerosis (MS) is crucial for assessing the disease and its treatment, but MS lesions vary widely in number, size, and location across patients, making it difficult for machine learning methods to learn an effective whole-brain MRI representation.
  • Methods: A graph neural network (GNN) aggregates key biomarkers such as lesion load and spatial proximity into a global representation. The two-stage approach first detects lesions with a 3D segmentation network and extracts their image features with a self-supervised algorithm; it then builds a patient graph in which the detected lesions are nodes connected by spatial proximity, casting inflammatory disease activity prediction as graph classification.
  • Results: The method outperforms the existing baseline by a large margin (AUCs of 0.67 vs. 0.61 and 0.66 vs. 0.60 for one-year and two-year inflammatory disease activity, respectively), automatically selects the most critical lesions via a self-pruning strategy, and is inherently explainable by assigning an importance score to each lesion.
    Abstract Multiple Sclerosis (MS) is a severe neurological disease characterized by inflammatory lesions in the central nervous system. Hence, predicting inflammatory disease activity is crucial for disease assessment and treatment. However, MS lesions can occur throughout the brain and vary in shape, size and total count among patients. The high variance in lesion load and locations makes it challenging for machine learning methods to learn a globally effective representation of whole-brain MRI scans to assess and predict disease. Technically it is non-trivial to incorporate essential biomarkers such as lesion load or spatial proximity. Our work represents the first attempt to utilize graph neural networks (GNN) to aggregate these biomarkers for a novel global representation. We propose a two-stage MS inflammatory disease activity prediction approach. First, a 3D segmentation network detects lesions, and a self-supervised algorithm extracts their image features. Second, the detected lesions are used to build a patient graph. The lesions act as nodes in the graph and are initialized with image features extracted in the first stage. Finally, the lesions are connected based on their spatial proximity and the inflammatory disease activity prediction is formulated as a graph classification task. Furthermore, we propose a self-pruning strategy to auto-select the most critical lesions for prediction. Our proposed method outperforms the existing baseline by a large margin (AUCs of 0.67 vs. 0.61 and 0.66 vs. 0.60 for one-year and two-year inflammatory disease activity, respectively). Finally, our proposed method enjoys inherent explainability by assigning an importance score to each lesion for the overall prediction. Code is available at https://github.com/chinmay5/ms_ida.git
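
Code sketch (illustrative only; thresholds and feature extraction follow the paper's pipeline, not this snippet): the abstract's patient graph uses detected lesions as nodes, initialized with image features and connected by spatial proximity, so that disease-activity prediction becomes graph classification.

```python
import numpy as np

def build_lesion_graph(centroids_mm, features, radius_mm=30.0):
    """centroids_mm: (L, 3) lesion centroids; features: (L, D) per-lesion embeddings."""
    dists = np.linalg.norm(centroids_mm[:, None, :] - centroids_mm[None, :, :], axis=-1)
    adj = (dists < radius_mm) & ~np.eye(len(centroids_mm), dtype=bool)   # proximity edges, no self-loops
    edge_index = np.stack(np.nonzero(adj))                               # (2, E) COO edge list
    return edge_index, features                                          # graph connectivity + node features

edge_index, node_feats = build_lesion_graph(np.random.rand(8, 3) * 100,
                                            np.random.rand(8, 64))
print(edge_index.shape, node_feats.shape)
```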

Diffusion Models for Interferometric Satellite Aperture Radar

  • paper_url: http://arxiv.org/abs/2308.16847
  • repo_url: None
  • paper_authors: Alexandre Tuel, Thomas Kerdreux, Claudia Hulbert, Bertrand Rouet-Leduc
  • for: Generating synthetic radar-based satellite datasets to support deep-learning approaches for processing and analyzing (interferometric) SAR data.
  • methods: Probabilistic Diffusion Models (PDMs) are leveraged to generate several radar-based satellite image datasets; accelerated sampling strategies are also examined.
  • results: PDMs generate images with complex and realistic structures, but sampling time remains an issue: accelerated sampling strategies that work well on simple image datasets such as MNIST fail on the radar datasets. A simple and versatile open-source library, https://github.com/thomaskerdreux/PDM_SAR_InSAR_generation, is provided to train, sample, and evaluate PDMs on a single GPU.
    Abstract Probabilistic Diffusion Models (PDMs) have recently emerged as a very promising class of generative models, achieving high performance in natural image generation. However, their performance relative to non-natural images, like radar-based satellite data, remains largely unknown. Generating large amounts of synthetic (and especially labelled) satellite data is crucial to implement deep-learning approaches for the processing and analysis of (interferometric) satellite aperture radar data. Here, we leverage PDMs to generate several radar-based satellite image datasets. We show that PDMs succeed in generating images with complex and realistic structures, but that sampling time remains an issue. Indeed, accelerated sampling strategies, which work well on simple image datasets like MNIST, fail on our radar datasets. We provide a simple and versatile open-source https://github.com/thomaskerdreux/PDM_SAR_InSAR_generation to train, sample and evaluate PDMs using any dataset on a single GPU.
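
Code sketch (standard DDPM math, not the authors' code; see their repository for training and sampling): the sampling-time issue discussed above comes from the many iterative reverse steps of a diffusion model, one of which looks like the following. `eps_model` is a placeholder noise-prediction network.

```python
import torch

def ddpm_step(x_t, t, eps_model, betas):
    """Draw x_{t-1} ~ p(x_{t-1} | x_t) under the learned reverse process (Ho et al., 2020)."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    eps = eps_model(x_t, t)                                                  # predicted noise
    mean = (x_t - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
    if t == 0:
        return mean                                                          # final, noiseless step
    noise = torch.randn_like(x_t)
    return mean + torch.sqrt(betas[t]) * noise                               # sigma_t^2 = beta_t variant
```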

Coarse-to-Fine Amodal Segmentation with Shape Prior

  • paper_url: http://arxiv.org/abs/2308.16825
  • repo_url: None
  • paper_authors: Jianxiong Gao, Xuelin Qian, Yikai Wang, Tianjun Xiao, Tong He, Zheng Zhang, Yanwei Fu
  • for: This work addresses amodal object segmentation and proposes a new method, Coarse-to-Fine Segmentation (C2F-Seg).
  • methods: C2F-Seg first reduces the learning space from the pixel-level image space to a vector-quantized latent space, which better handles long-range dependencies and yields a coarse amodal segment from visual features and visible segments. Because this latent space lacks detailed object information, a convolution refinement module injects fine-grained cues based on the visual features and the coarse prediction to produce a more precise amodal segmentation.
  • results: Extensive experiments on the KINS and COCO-A benchmarks demonstrate the superiority of C2F-Seg, and the approach also shows promise for video amodal object segmentation. Project page: http://jianxgao.github.io/C2F-Seg.
    Abstract Amodal object segmentation is a challenging task that involves segmenting both visible and occluded parts of an object. In this paper, we propose a novel approach, called Coarse-to-Fine Segmentation (C2F-Seg), that addresses this problem by progressively modeling the amodal segmentation. C2F-Seg initially reduces the learning space from the pixel-level image space to the vector-quantized latent space. This enables us to better handle long-range dependencies and learn a coarse-grained amodal segment from visual features and visible segments. However, this latent space lacks detailed information about the object, which makes it difficult to provide a precise segmentation directly. To address this issue, we propose a convolution refine module to inject fine-grained information and provide a more precise amodal object segmentation based on visual features and coarse-predicted segmentation. To help the studies of amodal object segmentation, we create a synthetic amodal dataset, named as MOViD-Amodal (MOViD-A), which can be used for both image and video amodal object segmentation. We extensively evaluate our model on two benchmark datasets: KINS and COCO-A. Our empirical results demonstrate the superiority of C2F-Seg. Moreover, we exhibit the potential of our approach for video amodal object segmentation tasks on FISHBOWL and our proposed MOViD-A. Project page at: http://jianxgao.github.io/C2F-Seg.
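
Code sketch (generic, not the paper's model): the "vector-quantized latent space" mentioned above implies a codebook lookup in which each continuous latent vector is replaced by its nearest learned code. Codebook size and training (e.g. VQ-VAE commitment losses) follow the paper, not this snippet.

```python
import torch

def vector_quantize(latents, codebook):
    """latents: (N, D) continuous latents; codebook: (K, D) learned code vectors."""
    dists = torch.cdist(latents, codebook)          # (N, K) pairwise distances
    idx = dists.argmin(dim=1)                       # nearest code per latent
    return codebook[idx], idx                       # quantized latents + discrete token ids

codebook = torch.randn(512, 64)                     # K = 512 codes of dimension 64 (illustrative)
quantized, tokens = vector_quantize(torch.randn(10, 64), codebook)
print(quantized.shape, tokens.shape)
```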

BTSeg: Barlow Twins Regularization for Domain Adaptation in Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.16819
  • repo_url: None
  • paper_authors: Johannes Künzel, Anna Hilsmann, Peter Eisert
  • for: The paper proposes a weakly supervised semantic image segmentation method that is robust to adverse conditions such as heavy rain, night time, snow, and extreme lighting.
  • methods: Images taken at the same location under different adverse conditions are treated as "augmentations" of each other, and the segmentation model is trained with the Barlow Twins loss using these image-level correspondences.
  • results: Evaluated on ACDC and the new challenging ACG benchmark, the method compares favorably with current state-of-the-art approaches while being simpler to implement and train.
    Abstract Semantic image segmentation is a critical component in many computer vision systems, such as autonomous driving. In such applications, adverse conditions (heavy rain, night time, snow, extreme lighting) on the one hand pose specific challenges, yet are typically underrepresented in the available datasets. Generating more training data is cumbersome and expensive, and the process itself is error-prone due to the inherent aleatoric uncertainty. To address this challenging problem, we propose BTSeg, which exploits image-level correspondences as weak supervision signal to learn a segmentation model that is agnostic to adverse conditions. To this end, our approach uses the Barlow twins loss from the field of unsupervised learning and treats images taken at the same location but under different adverse conditions as "augmentations" of the same unknown underlying base image. This allows the training of a segmentation model that is robust to appearance changes introduced by different adverse conditions. We evaluate our approach on ACDC and the new challenging ACG benchmark to demonstrate its robustness and generalization capabilities. Our approach performs favorably when compared to the current state-of-the-art methods, while also being simpler to implement and train. The code will be released upon acceptance.
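
Code sketch of the Barlow Twins objective referenced above (adapted from the original Barlow Twins formulation; BTSeg's exact integration into segmentation training may differ): embeddings of the same location under two adverse conditions are treated as two views, and their cross-correlation matrix is pushed towards the identity.

```python
import torch

def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3, eps=1e-6):
    """z_a, z_b: (N, D) embeddings of the two views of the same scene."""
    n, d = z_a.shape
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + eps)               # per-dimension standardization
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + eps)
    c = (z_a.T @ z_b) / n                                        # (D, D) cross-correlation matrix
    on_diag = ((torch.diagonal(c) - 1) ** 2).sum()               # invariance term
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()   # redundancy-reduction term
    return on_diag + lambda_offdiag * off_diag
```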

Multiscale Residual Learning of Graph Convolutional Sequence Chunks for Human Motion Prediction

  • paper_url: http://arxiv.org/abs/2308.16801
  • repo_url: None
  • paper_authors: Mohsen Zand, Ali Etemad, Michael Greenspan
  • for: human motion prediction
  • methods: ResChunk, an end-to-end network that explores dynamically correlated body components based on pairwise relationships between all joints in individual sequences, and learns the residuals between target sequence chunks in an autoregressive manner to enforce temporal connectivities.
  • results: outperforms other techniques and sets a new state-of-the-art on two challenging benchmark datasets, CMU Mocap and Human3.6M.
    Abstract A new method is proposed for human motion prediction by learning temporal and spatial dependencies. Recently, multiscale graphs have been developed to model the human body at higher abstraction levels, resulting in more stable motion prediction. Current methods however predetermine scale levels and combine spatially proximal joints to generate coarser scales based on human priors, even though movement patterns in different motion sequences vary and do not fully comply with a fixed graph of spatially connected joints. Another problem with graph convolutional methods is mode collapse, in which predicted poses converge around a mean pose with no discernible movements, particularly in long-term predictions. To tackle these issues, we propose ResChunk, an end-to-end network which explores dynamically correlated body components based on the pairwise relationships between all joints in individual sequences. ResChunk is trained to learn the residuals between target sequence chunks in an autoregressive manner to enforce the temporal connectivities between consecutive chunks. It is hence a sequence-to-sequence prediction network which considers dynamic spatio-temporal features of sequences at multiple levels. Our experiments on two challenging benchmark datasets, CMU Mocap and Human3.6M, demonstrate that our proposed method is able to effectively model the sequence information for motion prediction and outperform other techniques to set a new state-of-the-art. Our code is available at https://github.com/MohsenZand/ResChunk.
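
Code sketch (high-level training scheme only): the abstract describes learning residuals between consecutive sequence chunks in an autoregressive manner. Below, `model` is a placeholder network mapping a chunk of poses to the residual of the next chunk; the real ResChunk architecture (graph convolutions over joints, multiscale features) is not reproduced here.

```python
import torch

def chunked_autoregressive_loss(model, sequence, chunk_len):
    """sequence: (B, T, J*3) pose sequence, split into consecutive chunks of length chunk_len."""
    chunks = sequence.split(chunk_len, dim=1)
    loss, prev = 0.0, chunks[0]
    for target in chunks[1:]:
        pred = prev + model(prev)                 # next chunk = previous chunk + predicted residual
        loss = loss + torch.nn.functional.l1_loss(pred, target)
        prev = pred                               # feed predictions back in (autoregressive rollout)
    return loss / (len(chunks) - 1)
```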

Ref-Diff: Zero-shot Referring Image Segmentation with Generative Models

  • paper_url: http://arxiv.org/abs/2308.16777
  • repo_url: None
  • paper_authors: Minheng Ni, Yabo Zhang, Kailai Feng, Xiaoming Li, Yiwen Guo, Wangmeng Zuo
  • for: This paper is written for the task of zero-shot referring image segmentation, which involves finding an instance segmentation mask based on a given referring description without using paired training data.
  • methods: The paper proposes a novel method called Referring Diffusional segmentor (Ref-Diff), which leverages fine-grained multi-modal information from generative models to improve performance.
  • results: The paper demonstrates that Ref-Diff achieves performance comparable to existing state-of-the-art weakly-supervised models without a proposal generator, and outperforms these competing methods by a significant margin when combining both generative and discriminative models.
    Abstract Zero-shot referring image segmentation is a challenging task because it aims to find an instance segmentation mask based on the given referring descriptions, without training on this type of paired data. Current zero-shot methods mainly focus on using pre-trained discriminative models (e.g., CLIP). However, we have observed that generative models (e.g., Stable Diffusion) have potentially understood the relationships between various visual elements and text descriptions, which are rarely investigated in this task. In this work, we introduce a novel Referring Diffusional segmentor (Ref-Diff) for this task, which leverages the fine-grained multi-modal information from generative models. We demonstrate that without a proposal generator, a generative model alone can achieve comparable performance to existing SOTA weakly-supervised models. When we combine both generative and discriminative models, our Ref-Diff outperforms these competing methods by a significant margin. This indicates that generative models are also beneficial for this task and can complement discriminative models for better referring segmentation. Our code is publicly available at https://github.com/kodenii/Ref-Diff.

Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation Using only Images

  • paper_url: http://arxiv.org/abs/2308.16758
  • repo_url: None
  • paper_authors: Cuican Yu, Guansong Lu, Yihan Zeng, Jian Sun, Xiaodan Liang, Huibin Li, Zongben Xu, Songcen Xu, Wei Zhang, Hang Xu
  • for: Generating 3D faces from textual descriptions, for applications such as gaming, movies, and robotics.
  • methods: The proposed text-guided 3D face generation method (TG-3DFace) equips an unconditional 3D face generation framework with text conditions, learning text-guided generation from text-2D face data only. Two text-to-face cross-modal alignment techniques, global contrastive learning and a fine-grained alignment module, enforce high semantic consistency between the generated 3D faces and the input texts.
  • results: TG-3DFace improves multi-view consistency (MVIC) by 9% over existing methods, and its rendered face images achieve higher FID and CLIP scores than text-to-2D face/image generation models, demonstrating superiority in generating realistic and semantically consistent textures.
    Abstract Generating 3D faces from textual descriptions has a multitude of applications, such as gaming, movie, and robotics. Recent progresses have demonstrated the success of unconditional 3D face generation and text-to-3D shape generation. However, due to the limited text-3D face data pairs, text-driven 3D face generation remains an open problem. In this paper, we propose a text-guided 3D faces generation method, refer as TG-3DFace, for generating realistic 3D faces using text guidance. Specifically, we adopt an unconditional 3D face generation framework and equip it with text conditions, which learns the text-guided 3D face generation with only text-2D face data. On top of that, we propose two text-to-face cross-modal alignment techniques, including the global contrastive learning and the fine-grained alignment module, to facilitate high semantic consistency between generated 3D faces and input texts. Besides, we present directional classifier guidance during the inference process, which encourages creativity for out-of-domain generations. Compared to the existing methods, TG-3DFace creates more realistic and aesthetically pleasing 3D faces, boosting 9% multi-view consistency (MVIC) over Latent3D. The rendered face images generated by TG-3DFace achieve higher FID and CLIP score than text-to-2D face/image generation models, demonstrating our superiority in generating realistic and semantic-consistent textures.
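
Code sketch (not the authors' implementation; the fine-grained alignment module and classifier guidance are not shown): the "global contrastive learning" mentioned above can be illustrated with a symmetric, CLIP-style contrastive loss between rendered-face embeddings and text embeddings.

```python
import torch
import torch.nn.functional as F

def global_contrastive_loss(face_emb, text_emb, temperature=0.07):
    """face_emb, text_emb: (N, D) embeddings of matched face/text pairs."""
    face_emb = F.normalize(face_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = face_emb @ text_emb.T / temperature                 # (N, N) similarity matrix
    targets = torch.arange(len(face_emb), device=face_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```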

Unsupervised CT Metal Artifact Reduction by Plugging Diffusion Priors in Dual Domains

  • paper_url: http://arxiv.org/abs/2308.16742
  • repo_url: https://github.com/deepxuan/dudodp-mar
  • paper_authors: Xuan Liu, Yaoqin Xie, Songhui Diao, Shan Tan, Xiaokun Liang
  • For: The paper aims to improve the quality of computed tomography (CT) images by reducing metal artifacts, which otherwise make accurate diagnosis difficult.
  • Methods: An unsupervised approach restores the portions of CT images degraded by metal artifacts by plugging the priors of a pre-trained diffusion model into both the sinogram and image domains.
  • Results: The method outperforms existing unsupervised metal artifact reduction methods, including another diffusion-model-based method, in both quantitative and qualitative terms, and shows superior visual results to supervised and unsupervised methods on clinical datasets.
    Abstract During the process of computed tomography (CT), metallic implants often cause disruptive artifacts in the reconstructed images, impeding accurate diagnosis. Several supervised deep learning-based approaches have been proposed for reducing metal artifacts (MAR). However, these methods heavily rely on training with simulated data, as obtaining paired metal artifact CT and clean CT data in clinical settings is challenging. This limitation can lead to decreased performance when applying these methods in clinical practice. Existing unsupervised MAR methods, whether based on learning or not, typically operate within a single domain, either in the image domain or the sinogram domain. In this paper, we propose an unsupervised MAR method based on the diffusion model, a generative model with a high capacity to represent data distributions. Specifically, we first train a diffusion model using CT images without metal artifacts. Subsequently, we iteratively utilize the priors embedded within the pre-trained diffusion model in both the sinogram and image domains to restore the degraded portions caused by metal artifacts. This dual-domain processing empowers our approach to outperform existing unsupervised MAR methods, including another MAR method based on the diffusion model, which we have qualitatively and quantitatively validated using synthetic datasets. Moreover, our method demonstrates superior visual results compared to both supervised and unsupervised methods on clinical datasets.

Parsing is All You Need for Accurate Gait Recognition in the Wild

  • paper_url: http://arxiv.org/abs/2308.16739
  • repo_url: https://github.com/Gait3D/Gait3D-Benchmark
  • paper_authors: Jinkai Zheng, Xinchen Liu, Shuai Wang, Lihao Wang, Chenggang Yan, Wu Liu
  • for: The goal is to propose a new gait representation, the Gait Parsing Sequence (GPS), to improve the accuracy of human gait recognition.
  • methods: A human parsing-based gait recognition framework, ParsingGait, is proposed, consisting of a Convolutional Neural Network (CNN) backbone and two lightweight heads.
  • results: A comprehensive evaluation on the Gait3D-Parsing dataset, comparing against existing gait recognition methods, shows a significant improvement in accuracy.
    Abstract Binary silhouettes and keypoint-based skeletons have dominated human gait recognition studies for decades since they are easy to extract from video frames. Despite their success in gait recognition for in-the-lab environments, they usually fail in real-world scenarios due to their low information entropy for gait representations. To achieve accurate gait recognition in the wild, this paper presents a novel gait representation, named Gait Parsing Sequence (GPS). GPSs are sequences of fine-grained human segmentation, i.e., human parsing, extracted from video frames, so they have much higher information entropy to encode the shapes and dynamics of fine-grained human parts during walking. Moreover, to effectively explore the capability of the GPS representation, we propose a novel human parsing-based gait recognition framework, named ParsingGait. ParsingGait contains a Convolutional Neural Network (CNN)-based backbone and two light-weighted heads. The first head extracts global semantic features from GPSs, while the other one learns mutual information of part-level features through Graph Convolutional Networks to model the detailed dynamics of human walking. Furthermore, due to the lack of suitable datasets, we build the first parsing-based dataset for gait recognition in the wild, named Gait3D-Parsing, by extending the large-scale and challenging Gait3D dataset. Based on Gait3D-Parsing, we comprehensively evaluate our method and existing gait recognition methods. The experimental results show a significant improvement in accuracy brought by the GPS representation and the superiority of ParsingGait. The code and dataset are available at https://gait3d.github.io/gait3d-parsing-hp .

US-SFNet: A Spatial-Frequency Domain-based Multi-branch Network for Cervical Lymph Node Lesions Diagnoses in Ultrasound Images

  • paper_url: http://arxiv.org/abs/2308.16738
  • repo_url: None
  • paper_authors: Yubiao Yue, Jun Xue, Haihua Liang, Bingchun Luo, Zhenzhang Li
  • for: Diagnosing cervical lymph node lesions in ultrasound images.
  • methods: A deep learning model built around the proposed Conv-FFT Block and the US-SFNet architecture.
  • results: Achieves 92.89% accuracy, 90.46% precision, 89.95% sensitivity, and 97.49% specificity.
    Abstract Ultrasound imaging serves as a pivotal tool for diagnosing cervical lymph node lesions. However, the diagnoses of these images largely hinge on the expertise of medical practitioners, rendering the process susceptible to misdiagnoses. Although rapidly developing deep learning has substantially improved the diagnoses of diverse ultrasound images, there remains a conspicuous research gap concerning cervical lymph nodes. The objective of our work is to accurately diagnose cervical lymph node lesions by leveraging a deep learning model. To this end, we first collected 3392 images containing normal lymph nodes, benign lymph node lesions, malignant primary lymph node lesions, and malignant metastatic lymph node lesions. Given that ultrasound images are generated by the reflection and scattering of sound waves across varied bodily tissues, we proposed the Conv-FFT Block. It integrates convolutional operations with the fast Fourier transform to more astutely model the images. Building upon this foundation, we designed a novel architecture, named US-SFNet. This architecture not only discerns variances in ultrasound images from the spatial domain but also adeptly captures microstructural alterations across various lesions in the frequency domain. To ascertain the potential of US-SFNet, we benchmarked it against 12 popular architectures through five-fold cross-validation. The results show that US-SFNet is SOTA and can achieve 92.89% accuracy, 90.46% precision, 89.95% sensitivity and 97.49% specificity, respectively.
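
Code sketch (an assumption, not taken from the paper): the Conv-FFT Block "integrates convolutional operations with the fast Fourier transform", which can be illustrated as a spatial-convolution branch fused with a branch that filters the feature map in the frequency domain. Channel widths, normalization, and how the branches are fused are guesses here.

```python
import torch
import torch.nn as nn

class ConvFFTBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # the complex spectrum is stored as (real, imag) -> 2x channels for a 1x1 "spectral" conv
        self.spectral = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x):                                  # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")            # (B, C, H, W//2+1) complex spectrum
        spec = torch.cat([spec.real, spec.imag], dim=1)    # to real-valued channels
        spec = self.spectral(spec)
        real, imag = spec.chunk(2, dim=1)
        freq_branch = torch.fft.irfft2(torch.complex(real, imag),
                                       s=x.shape[-2:], norm="ortho")
        return torch.relu(self.spatial(x) + freq_branch)   # fuse spatial + frequency branches
```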

Towards Vehicle-to-everything Autonomous Driving: A Survey on Collaborative Perception

  • paper_url: http://arxiv.org/abs/2308.16714
  • repo_url: None
  • paper_authors: Si Liu, Chen Gao, Yuan Chen, Xingyu Peng, Xianghao Kong, Kun Wang, Runsheng Xu, Wentao Jiang, Hao Xiang, Jiaqi Ma, Miao Wang
  • for: This survey covers vehicle-to-everything (V2X) autonomous driving, a promising direction for developing a new generation of intelligent transportation systems.
  • methods: It reviews collaborative perception (CP) methods that address the inherent limitations of individual perception in V2X systems, such as occlusion and long-range perception.
  • results: The paper summarizes and analyzes CP methods along several axes, including collaboration stages, roadside sensor placement, latency compensation, and the performance-bandwidth trade-off, together with extensive experimental analyses.
    Abstract Vehicle-to-everything (V2X) autonomous driving opens up a promising direction for developing a new generation of intelligent transportation systems. Collaborative perception (CP) as an essential component to achieve V2X can overcome the inherent limitations of individual perception, including occlusion and long-range perception. In this survey, we provide a comprehensive review of CP methods for V2X scenarios, bringing a profound and in-depth understanding to the community. Specifically, we first introduce the architecture and workflow of typical V2X systems, which affords a broader perspective to understand the entire V2X system and the role of CP within it. Then, we thoroughly summarize and analyze existing V2X perception datasets and CP methods. Particularly, we introduce numerous CP methods from various crucial perspectives, including collaboration stages, roadside sensors placement, latency compensation, performance-bandwidth trade-off, attack/defense, pose alignment, etc. Moreover, we conduct extensive experimental analyses to compare and examine current CP methods, revealing some essential and unexplored insights. Specifically, we analyze the performance changes of different methods under different bandwidths, providing a deep insight into the performance-bandwidth trade-off issue. Also, we examine methods under different LiDAR ranges. To study the model robustness, we further investigate the effects of various simulated real-world noises on the performance of different CP methods, covering communication latency, lossy communication, localization errors, and mixed noises. In addition, we look into the sim-to-real generalization ability of existing CP methods. At last, we thoroughly discuss issues and challenges, highlighting promising directions for future efforts. Our codes for experimental analysis will be public at https://github.com/memberRE/Collaborative-Perception.

ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation

  • paper_url: http://arxiv.org/abs/2308.16689
  • repo_url: None
  • paper_authors: Weihan Wang, Zhen Yang, Bin Xu, Juanzi Li, Yankui Sun
  • for: The goal is to improve the accuracy and convergence speed of vision-language pre-training (VLP) methods.
  • methods: Two components help the model learn fine-grained image-text alignment: a cross-distillation method that generates soft labels for Masked Language Modeling (MLM) to improve robustness, and hard negatives synthesized by the current language encoder for Image-Text Matching (ITM) to encourage high-quality representations by increasing the task's difficulty.
  • results: Extensive experiments show that the proposal achieves better performance on a variety of vision-language tasks, demonstrating its potential for vision-language pre-training.
    Abstract Vision-language pre-training (VLP) methods are blossoming recently, and its crucial goal is to jointly learn visual and textual features via a transformer-based architecture, demonstrating promising improvements on a variety of vision-language tasks. Prior arts usually focus on how to align visual and textual features, but strategies for improving the robustness of model and speeding up model convergence are left insufficiently explored. In this paper, we propose a novel method ViLTA, comprising of two components to further facilitate the model to learn fine-grained representations among image-text pairs. For Masked Language Modeling (MLM), we propose a cross-distillation method to generate soft labels to enhance the robustness of model, which alleviates the problem of treating synonyms of masked words as negative samples in one-hot labels. For Image-Text Matching (ITM), we leverage the current language encoder to synthesize hard negatives based on the context of language input, encouraging the model to learn high-quality representations by increasing the difficulty of the ITM task. By leveraging the above techniques, our ViLTA can achieve better performance on various vision-language tasks. Extensive experiments on benchmark datasets demonstrate that the effectiveness of ViLTA and its promising potential for vision-language pre-training.
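
Code sketch (generic, not ViLTA's implementation): replacing one-hot MLM targets with soft labels from a teacher distribution means synonyms of the masked word are no longer penalized as pure negatives. How ViLTA actually produces its cross-distilled soft labels is not reproduced here; `teacher_logits` is a placeholder for that signal.

```python
import torch
import torch.nn.functional as F

def soft_label_mlm_loss(student_logits, teacher_logits, hard_targets, alpha=0.5, tau=2.0):
    """logits: (N_masked, V); hard_targets: (N_masked,) token ids of the masked words."""
    soft = F.softmax(teacher_logits / tau, dim=-1)
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1), soft,
                  reduction="batchmean") * tau * tau      # distillation against soft labels
    ce = F.cross_entropy(student_logits, hard_targets)    # keep the original MLM target
    return alpha * kd + (1 - alpha) * ce
```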

Diffusion Inertial Poser: Human Motion Reconstruction from Arbitrary Sparse IMU Configurations

  • paper_url: http://arxiv.org/abs/2308.16682
  • repo_url: None
  • paper_authors: Tom Van Wouwe, Seunghwan Lee, Antoine Falisse, Scott Delp, C. Karen Liu
  • for: Real-time reconstruction of human motion from arbitrary sparse IMU configurations.
  • methods: A single diffusion generative model, Diffusion Inertial Poser (DiffIP), reconstructs human motion from arbitrary IMU configurations, matching state-of-the-art accuracy for the common six-IMU setup while allowing the optimal configuration to be selected for each application without retraining.
  • results: With only four IMUs available, the configuration that minimizes joint-kinematics error instruments the thighs and forearms, whereas global translation is reconstructed better when instrumenting the feet instead of the thighs.
    Abstract Motion capture from a limited number of inertial measurement units (IMUs) has important applications in health, human performance, and virtual reality. Real-world limitations and application-specific goals dictate different IMU configurations (i.e., number of IMUs and chosen attachment body segments), trading off accuracy and practicality. Although recent works were successful in accurately reconstructing whole-body motion from six IMUs, these systems only work with a specific IMU configuration. Here we propose a single diffusion generative model, Diffusion Inertial Poser (DiffIP), which reconstructs human motion in real-time from arbitrary IMU configurations. We show that DiffIP has the benefit of flexibility with respect to the IMU configuration while being as accurate as the state-of-the-art for the commonly used six IMU configuration. Our system enables selecting an optimal configuration for different applications without retraining the model. For example, when only four IMUs are available, DiffIP found that the configuration that minimizes errors in joint kinematics instruments the thighs and forearms. However, global translation reconstruction is better when instrumenting the feet instead of the thighs. Although our approach is agnostic to the underlying model, we built DiffIP based on physiologically realistic musculoskeletal models to enable use in biomedical research and health applications.

SoccerNet 2023 Tracking Challenge – 3rd place MOT4MOT Team Technical Report

  • paper_url: http://arxiv.org/abs/2308.16651
  • repo_url: None
  • paper_authors: Gal Shitrit, Ishay Be’ery, Ido Yerhushalmy
  • for: Detecting and tracking the players and the ball for the SoccerNet 2023 tracking challenge.
  • methods: Player tracking combines a state-of-the-art online multi-object tracker with a contemporary object detector. To overcome the limitations of the online approach, a post-processing stage adds interpolation and appearance-free track merging, plus an appearance-based track-merging technique for tracks terminated or created far from the image boundaries. Ball tracking is formulated as single-object detection, using a fine-tuned YOLOv8l detector with proprietary filtering.
  • results: The method achieves 3rd place in the SoccerNet 2023 tracking challenge with a HOTA score of 66.27.
    Abstract The SoccerNet 2023 tracking challenge requires the detection and tracking of soccer players and the ball. In this work, we present our approach to tackle these tasks separately. We employ a state-of-the-art online multi-object tracker and a contemporary object detector for player tracking. To overcome the limitations of our online approach, we incorporate a post-processing stage using interpolation and appearance-free track merging. Additionally, an appearance-based track merging technique is used to handle the termination and creation of tracks far from the image boundaries. Ball tracking is formulated as single object detection, and a fine-tuned YOLOv8l detector with proprietary filtering improves the detection precision. Our method achieves 3rd place on the SoccerNet 2023 tracking challenge with a HOTA score of 66.27.
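
Code sketch (illustrative post-processing only; the track-merging heuristics are not shown): the interpolation step mentioned above fills frames where the online tracker temporarily lost a player by linearly interpolating its bounding boxes between the last and next observed frames.

```python
import numpy as np

def interpolate_track(frames, boxes):
    """frames: sorted frame ids where the track was observed; boxes: (N, 4) xyxy boxes."""
    frames = np.asarray(frames)
    boxes = np.asarray(boxes, dtype=float)
    full_frames = np.arange(frames[0], frames[-1] + 1)
    full_boxes = np.stack([np.interp(full_frames, frames, boxes[:, k]) for k in range(4)], axis=1)
    return full_frames, full_boxes

f, b = interpolate_track([10, 11, 15], [[0, 0, 10, 10], [1, 1, 11, 11], [5, 5, 15, 15]])
print(f)    # frames 10..15, with boxes for 12-14 filled by linear interpolation
```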

Learning with Multi-modal Gradient Attention for Explainable Composed Image Retrieval

  • paper_url: http://arxiv.org/abs/2308.16649
  • repo_url: None
  • paper_authors: Prateksha Udhayanan, Srikrishna Karanam, Balaji Vasan Srinivasan
  • for: solves the problem of composed image retrieval, which takes an input query consisting of an image and a modification text and retrieves images that match the desired changes.
  • methods: uses a new gradient-attention-based learning objective that explicitly forces the model to focus on the local regions of interest being modified in each retrieval step, using a new visual image attention computation technique called multi-modal gradient attention (MMGrad).
  • results: demonstrates improved grounding and better explainability of the models, as well as competitive quantitative retrieval performance on standard benchmark datasets.
    Abstract We consider the problem of composed image retrieval that takes an input query consisting of an image and a modification text indicating the desired changes to be made on the image and retrieves images that match these changes. Current state-of-the-art techniques that address this problem use global features for the retrieval, resulting in incorrect localization of the regions of interest to be modified because of the global nature of the features, more so in cases of real-world, in-the-wild images. Since modifier texts usually correspond to specific local changes in an image, it is critical that models learn local features to be able to both localize and retrieve better. To this end, our key novelty is a new gradient-attention-based learning objective that explicitly forces the model to focus on the local regions of interest being modified in each retrieval step. We achieve this by first proposing a new visual image attention computation technique, which we call multi-modal gradient attention (MMGrad) that is explicitly conditioned on the modifier text. We next demonstrate how MMGrad can be incorporated into an end-to-end model training strategy with a new learning objective that explicitly forces these MMGrad attention maps to highlight the correct local regions corresponding to the modifier text. By training retrieval models with this new loss function, we show improved grounding by means of better visual attention maps, leading to better explainability of the models as well as competitive quantitative retrieval performance on standard benchmark datasets.
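MMGrad conditions visual attention on the modifier text by back-propagating a text-conditioned score onto the spatial feature maps. The Grad-CAM-style sketch below illustrates that general idea under assumed encoders and shapes; it is not the paper's exact formulation or training loss:

```python
import torch
import torch.nn.functional as F

def gradient_attention(feat_map: torch.Tensor, text_emb: torch.Tensor, proj: torch.nn.Module):
    """Gradient-based attention conditioned on a modifier-text embedding (illustrative only).

    feat_map: (C, H, W) spatial features of the query image
    text_emb: (D,) embedding of the modification text
    proj:     module mapping pooled (C,) image features to (D,)
    """
    feat_map = feat_map.clone().requires_grad_(True)
    pooled = feat_map.mean(dim=(1, 2))                      # (C,)
    score = F.cosine_similarity(proj(pooled), text_emb, dim=0)
    grads, = torch.autograd.grad(score, feat_map)           # (C, H, W)
    weights = grads.mean(dim=(1, 2))                        # per-channel importance
    attn = torch.relu((weights[:, None, None] * feat_map).sum(0))
    return attn / (attn.max() + 1e-8)                       # (H, W) map in [0, 1]

attn = gradient_attention(torch.randn(64, 7, 7), torch.randn(32), torch.nn.Linear(64, 32))
print(attn.shape)
```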

Generate Your Own Scotland: Satellite Image Generation Conditioned on Maps

  • paper_url: http://arxiv.org/abs/2308.16648
  • repo_url: https://github.com/miquel-espinosa/map-sat
  • paper_authors: Miguel Espinosa, Elliot J. Crowley
  • for: Exploring diffusion models for Earth observation and showing how they can generate realistic satellite images.
  • methods: A pretrained diffusion model is conditioned on cartographic (OpenStreetMap) data via a ControlNet to synthesize satellite views.
  • results: Training on two large paired datasets over Mainland Scotland and the Central Belt, a qualitative evaluation of the ControlNet model shows that both image quality and map fidelity are achievable.
    Abstract Despite recent advancements in image generation, diffusion models still remain largely underexplored in Earth Observation. In this paper we show that state-of-the-art pretrained diffusion models can be conditioned on cartographic data to generate realistic satellite images. We provide two large datasets of paired OpenStreetMap images and satellite views over the region of Mainland Scotland and the Central Belt. We train a ControlNet model and qualitatively evaluate the results, demonstrating that both image quality and map fidelity are possible. Finally, we provide some insights on the opportunities and challenges of applying these models for remote sensing. Our model weights and code for creating the dataset are publicly available at https://github.com/miquel-espinosa/map-sat.
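Generation follows the standard ControlNet recipe: a pretrained Stable Diffusion backbone is conditioned on a rendered map tile. A minimal inference sketch with Hugging Face diffusers is shown below; the checkpoint identifiers are placeholders, and the actual map-to-satellite weights are distributed through the authors' repository, not these hub IDs:

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Placeholder checkpoint names for illustration only.
controlnet = ControlNetModel.from_pretrained("path/to/map-sat-controlnet", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

map_tile = Image.open("osm_tile.png").convert("RGB")   # rendered OpenStreetMap tile as condition
image = pipe(
    prompt="aerial satellite photograph, Scotland, high resolution",
    image=map_tile,
    num_inference_steps=30,
).images[0]
image.save("generated_satellite.png")
```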

Learning Channel Importance for High Content Imaging with Interpretable Deep Input Channel Mixing

  • paper_url: http://arxiv.org/abs/2308.16637
  • repo_url: None
  • paper_authors: Daniel Siegismund, Mario Wieser, Stephan Heyse, Stephan Steigele
  • for: Supporting the search for novel drug candidates against complex diseases by making high content imaging analysis interpretable.
  • methods: Deep learning analyses of such experiments lack information about channel importance, so the paper proposes DCMIX, a lightweight, scalable, end-to-end trainable mixing layer based on image blending with alpha compositing, to interpret cellular phenotypes in high content images.
  • results: Experiments on MNIST and RXRX1 show that DCMIX learns the biologically relevant channel importance without sacrificing prediction performance.
    Abstract Uncovering novel drug candidates for treating complex diseases remain one of the most challenging tasks in early discovery research. To tackle this challenge, biopharma research established a standardized high content imaging protocol that tags different cellular compartments per image channel. In order to judge the experimental outcome, the scientist requires knowledge about the channel importance with respect to a certain phenotype for decoding the underlying biology. In contrast to traditional image analysis approaches, such experiments are nowadays preferably analyzed by deep learning based approaches which, however, lack crucial information about the channel importance. To overcome this limitation, we present a novel approach which utilizes multi-spectral information of high content images to interpret a certain aspect of cellular biology. To this end, we base our method on image blending concepts with alpha compositing for an arbitrary number of channels. More specifically, we introduce DCMIX, a lightweight, scaleable and end-to-end trainable mixing layer which enables interpretable predictions in high content imaging while retaining the benefits of deep learning based methods. We employ an extensive set of experiments on both MNIST and RXRX1 datasets, demonstrating that DCMIX learns the biologically relevant channel importance without scarifying prediction performance.
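At the core of DCMIX is an alpha-compositing mix of an arbitrary number of input channels with learnable weights. A minimal PyTorch sketch of such a mixing layer follows; the one-logit-per-channel parameterization and softmax normalization are assumptions for illustration rather than the paper's exact layer. The blended single-channel image can then feed any standard backbone, and the normalized weights are read off after training as channel importance.

```python
import torch
import torch.nn as nn

class ChannelMix(nn.Module):
    """Learnable alpha-blending of C input channels into a single image (illustrative sketch)."""

    def __init__(self, num_channels: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(num_channels))   # one logit per channel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) multi-channel high content image
        w = torch.softmax(self.alpha, dim=0)                    # normalized channel weights
        return (w[None, :, None, None] * x).sum(dim=1, keepdim=True)   # (B, 1, H, W)

    def channel_importance(self) -> torch.Tensor:
        return torch.softmax(self.alpha, dim=0).detach()

mix = ChannelMix(num_channels=5)
blended = mix(torch.randn(2, 5, 64, 64))
print(blended.shape, mix.channel_importance())
```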

MFR-Net: Multi-faceted Responsive Listening Head Generation via Denoising Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.16635
  • repo_url: None
  • paper_authors: Jin Liu, Xi Wang, Xiaomeng Fu, Yesheng Chai, Cai Yu, Jiao Dai, Jizhong Han
  • for: Modeling face-to-face communication, which involves both speaker and listener roles; existing work focuses on generating speaker videos and largely overlooks generating listener heads.
  • methods: The Multi-Faceted Responsive listening head generation Network (MFR-Net) uses a probabilistic denoising diffusion model to predict diverse head pose and expression features, plus a Feature Aggregation Module that strengthens listener identity features while fusing them with speaker-related features.
  • results: Extensive experiments show that MFR-Net produces multi-faceted responses with diverse interaction patterns and accurate listener identity preservation while expressing attitude and viewpoint toward the speaker video.
    Abstract Face-to-face communication is a common scenario including roles of speakers and listeners. Most existing research methods focus on producing speaker videos, while the generation of listener heads remains largely overlooked. Responsive listening head generation is an important task that aims to model face-to-face communication scenarios by generating a listener head video given a speaker video and a listener head image. An ideal generated responsive listening video should respond to the speaker with attitude or viewpoint expressing while maintaining diversity in interaction patterns and accuracy in listener identity information. To achieve this goal, we propose the \textbf{M}ulti-\textbf{F}aceted \textbf{R}esponsive Listening Head Generation Network (MFR-Net). Specifically, MFR-Net employs the probabilistic denoising diffusion model to predict diverse head pose and expression features. In order to perform multi-faceted response to the speaker video, while maintaining accurate listener identity preservation, we design the Feature Aggregation Module to boost listener identity features and fuse them with other speaker-related features. Finally, a renderer finetuned with identity consistency loss produces the final listening head videos. Our extensive experiments demonstrate that MFR-Net not only achieves multi-faceted responses in diversity and speaker identity information but also in attitude and viewpoint expression.

Semi-Supervised SAR ATR Framework with Transductive Auxiliary Segmentation

  • paper_url: http://arxiv.org/abs/2308.16633
  • repo_url: None
  • paper_authors: Chenwei Wang, Xiaoyu Liu, Yulin Huang, Siyi Luo, Jifang Pei, Jianyu Yang, Deqing Mao
  • for: Improving SAR automatic target recognition (ATR) performance when only a small amount of labeled data is available.
  • methods: A semi-supervised SAR ATR framework (SFAS) that exploits transductive generalization over unlabeled samples through an auxiliary segmentation task and an information residue loss (IRL), gradually combining recognition and segmentation information during training to build a helpful inductive bias.
  • results: On the MSTAR dataset, the framework reaches 94.18% recognition accuracy with 20 training samples per class together with accurate segmentation, and maintains recognition ratios above 88.00% under EOC variations with 10 training samples per class.
    Abstract Convolutional neural networks (CNNs) have achieved high performance in synthetic aperture radar (SAR) automatic target recognition (ATR). However, the performance of CNNs depends heavily on a large amount of training data. The insufficiency of labeled training SAR images limits the recognition performance and even invalidates some ATR methods. Furthermore, under few labeled training data, many existing CNNs are even ineffective. To address these challenges, we propose a Semi-supervised SAR ATR Framework with transductive Auxiliary Segmentation (SFAS). The proposed framework focuses on exploiting the transductive generalization on available unlabeled samples with an auxiliary loss serving as a regularizer. Through auxiliary segmentation of unlabeled SAR samples and information residue loss (IRL) in training, the framework can employ the proposed training loop process and gradually exploit the information compilation of recognition and segmentation to construct a helpful inductive bias and achieve high performance. Experiments conducted on the MSTAR dataset have shown the effectiveness of our proposed SFAS for few-shot learning. The recognition performance of 94.18\% can be achieved under 20 training samples in each class with simultaneous accurate segmentation results. Facing variances of EOCs, the recognition ratios are higher than 88.00\% when 10 training samples each class.

3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation

  • paper_url: http://arxiv.org/abs/2308.16632
  • repo_url: https://github.com/sosppxo/3d-stmn
  • paper_authors: Changli Wu, Yiwei Ma, Qi Chen, Haowei Wang, Gen Luo, Jiayi Ji, Xiaoshuai Sun
  • for: Addressing the weaknesses of two-stage 3D Referring Expression Segmentation (3D-RES) pipelines, namely lackluster initial proposals and slow inference.
  • methods: An end-to-end Superpoint-Text Matching Network (3D-STMN) that directly correlates linguistic cues with superpoints instead of navigating instance proposals, enriched with a Dependency-Driven Interaction (DDI) module that uses dependency trees to deepen the semantic understanding of referring expressions.
  • results: On the ScanRefer benchmark the model improves mIoU by 11.7 points and accelerates inference by a factor of 95.7 over traditional methods.
    Abstract In 3D Referring Expression Segmentation (3D-RES), the earlier approach adopts a two-stage paradigm, extracting segmentation proposals and then matching them with referring expressions. However, this conventional paradigm encounters significant challenges, most notably in terms of the generation of lackluster initial proposals and a pronounced deceleration in inference speed. Recognizing these limitations, we introduce an innovative end-to-end Superpoint-Text Matching Network (3D-STMN) that is enriched by dependency-driven insights. One of the keystones of our model is the Superpoint-Text Matching (STM) mechanism. Unlike traditional methods that navigate through instance proposals, STM directly correlates linguistic indications with their respective superpoints, clusters of semantically related points. This architectural decision empowers our model to efficiently harness cross-modal semantic relationships, primarily leveraging densely annotated superpoint-text pairs, as opposed to the more sparse instance-text pairs. In pursuit of enhancing the role of text in guiding the segmentation process, we further incorporate the Dependency-Driven Interaction (DDI) module to deepen the network's semantic comprehension of referring expressions. Using the dependency trees as a beacon, this module discerns the intricate relationships between primary terms and their associated descriptors in expressions, thereby elevating both the localization and segmentation capacities of our model. Comprehensive experiments on the ScanRefer benchmark reveal that our model not only set new performance standards, registering an mIoU gain of 11.7 points but also achieve a staggering enhancement in inference speed, surpassing traditional methods by 95.7 times. The code and models are available at https://github.com/sosppxo/3D-STMN.

Neural Gradient Regularizer

  • paper_url: http://arxiv.org/abs/2308.16612
  • repo_url: https://github.com/yyfz/neural-gradient-regularizer
  • paper_authors: Shuang Xu, Yifan Wang, Zixiang Zhao, Jiangjun Peng, Xiangyong Cao, Deyu Meng
  • for: Providing a better gradient-domain prior for image processing, avoiding the weakening of edges and details caused by sparsity-based regularizers such as total variation.
  • methods: A neural gradient regularizer (NGR) that expresses the gradient map as the output of a neural network; it does not rely on a sparsity assumption and therefore avoids underestimating gradient maps.
  • results: Extensive experiments on various image types and processing tasks show that NGR works in a zero-shot, plug-and-play fashion and outperforms state-of-the-art counterparts.
    Abstract Owing to its significant success, the prior imposed on gradient maps has consistently been a subject of great interest in the field of image processing. Total variation (TV), one of the most representative regularizers, is known for its ability to capture the sparsity of gradient maps. Nonetheless, TV and its variants often underestimate the gradient maps, leading to the weakening of edges and details whose gradients should not be zero in the original image. Recently, total deep variation (TDV) has been introduced, assuming the sparsity of feature maps, which provides a flexible regularization learned from large-scale datasets for a specific task. However, TDV requires retraining when the image or task changes, limiting its versatility. In this paper, we propose a neural gradient regularizer (NGR) that expresses the gradient map as the output of a neural network. Unlike existing methods, NGR does not rely on the sparsity assumption, thereby avoiding the underestimation of gradient maps. NGR is applicable to various image types and different image processing tasks, functioning in a zero-shot learning fashion, making it a versatile and plug-and-play regularizer. Extensive experimental results demonstrate the superior performance of NGR over state-of-the-art counterparts for a range of different tasks, further validating its effectiveness and versatility.

Detecting Out-of-Context Image-Caption Pairs in News: A Counter-Intuitive Method

  • paper_url: http://arxiv.org/abs/2308.16611
  • repo_url: None
  • paper_authors: Eivind Moholdt, Sohail Ahmed Khan, Duc-Tien Dang-Nguyen
  • for: Detecting out-of-context (OOC) use of image-caption pairs in news by turning generative image models to the defender's advantage.
  • methods: Two new datasets totalling 6800 images are created with two generative models, DALL-E 2 and Stable Diffusion.
  • results: A preliminary qualitative and quantitative analysis evaluates how well each image generation model supports cheapfake detection, together with several methods for computing image similarity.
    Abstract The growth of misinformation and re-contextualized media in social media and news leads to an increasing need for fact-checking methods. Concurrently, the advancement in generative models makes cheapfakes and deepfakes both easier to make and harder to detect. In this paper, we present a novel approach using generative image models to our advantage for detecting Out-of-Context (OOC) use of images-caption pairs in news. We present two new datasets with a total of $6800$ images generated using two different generative models including (1) DALL-E 2, and (2) Stable-Diffusion. We are confident that the method proposed in this paper can further research on generative models in the field of cheapfake detection, and that the resulting datasets can be used to train and evaluate new models aimed at detecting cheapfakes. We run a preliminary qualitative and quantitative analysis to evaluate the performance of each image generation model for this task, and evaluate a handful of methods for computing image similarity.
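The paper evaluates several ways of computing image similarity between a news photo and its generated counterpart. One common choice, used here purely as an illustrative stand-in rather than the paper's chosen measure, is the cosine similarity of CLIP image embeddings:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(path_a: str, path_b: str) -> float:
    images = [Image.open(path_a).convert("RGB"), Image.open(path_b).convert("RGB")]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)             # (2, D)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float(emb[0] @ emb[1])                            # cosine similarity in [-1, 1]

# Example (hypothetical file names):
# print(clip_similarity("news_photo.jpg", "dalle2_generation.png"))
```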

Towards Optimal Patch Size in Vision Transformers for Tumor Segmentation

  • paper_url: http://arxiv.org/abs/2308.16598
  • repo_url: https://github.com/ramtin-mojtahedi/ovtps
  • paper_authors: Ramtin Mojtahedi, Mohammad Hamghalam, Richard K. G. Do, Amber L. Simpson
  • for: Improving the accuracy and efficiency of tumor segmentation in metastatic colorectal cancer, supporting earlier diagnosis and treatment of liver metastases.
  • methods: Vision transformer-based deep learning models are applied to 3D computed tomography (CT) scans, with the transformer's optimal input patch size selected from the average volume of the metastasis lesions.
  • results: The proposed patch-size selection, combined with transfer learning from data with larger tumor volumes, improves the segmentation performance of vision transformers across tumor sizes.
    Abstract Detection of tumors in metastatic colorectal cancer (mCRC) plays an essential role in the early diagnosis and treatment of liver cancer. Deep learning models backboned by fully convolutional neural networks (FCNNs) have become the dominant model for segmenting 3D computerized tomography (CT) scans. However, since their convolution layers suffer from limited kernel size, they are not able to capture long-range dependencies and global context. To tackle this restriction, vision transformers have been introduced to solve FCNN's locality of receptive fields. Although transformers can capture long-range features, their segmentation performance decreases with various tumor sizes due to the model sensitivity to the input patch size. While finding an optimal patch size improves the performance of vision transformer-based models on segmentation tasks, it is a time-consuming and challenging procedure. This paper proposes a technique to select the vision transformer's optimal input multi-resolution image patch size based on the average volume size of metastasis lesions. We further validated our suggested framework using a transfer-learning technique, demonstrating that the highest Dice similarity coefficient (DSC) performance was obtained by pre-training on training data with a larger tumour volume using the suggested ideal patch size and then training with a smaller one. We experimentally evaluate this idea through pre-training our model on a multi-resolution public dataset. Our model showed consistent and improved results when applied to our private multi-resolution mCRC dataset with a smaller average tumor volume. This study lays the groundwork for optimizing semantic segmentation of small objects using vision transformers. The implementation source code is available at:https://github.com/Ramtin-Mojtahedi/OVTPS.
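The selection rule ties the transformer's input patch size to the average metastasis volume of the training set. The toy function below sketches one such rule; the candidate sizes, the equivalent-diameter heuristic, and the margin factor are assumptions for illustration, not the mapping derived in the paper:

```python
import numpy as np

CANDIDATE_PATCH_SIZES = [64, 96, 128, 160]   # assumed candidate edge lengths in voxels

def select_patch_size(lesion_volumes_mm3, voxel_mm=1.0, margin=2.0):
    """Pick the smallest candidate patch whose edge spans the mean lesion diameter with a margin."""
    mean_vol = float(np.mean(lesion_volumes_mm3))
    # equivalent-sphere diameter of the average lesion, converted to voxels
    diameter_vox = 2.0 * (3.0 * mean_vol / (4.0 * np.pi)) ** (1.0 / 3.0) / voxel_mm
    for size in CANDIDATE_PATCH_SIZES:
        if size >= margin * diameter_vox:
            return size
    return CANDIDATE_PATCH_SIZES[-1]

print(select_patch_size([4200.0, 8800.0, 1500.0]))   # -> one of the candidate sizes
```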

Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images

  • paper_url: http://arxiv.org/abs/2308.16582
  • repo_url: None
  • paper_authors: Qingping Zheng, Yuanfan Guo, Jiankang Deng, Jianhua Han, Ying Li, Songcen Xu, Hang Xu
  • for: Addressing the resolution-induced composition problems that arise when text-to-image diffusion models generate images of varying sizes.
  • methods: A two-stage pipeline named Any-Size-Diffusion (ASD): Any Ratio Adaptability Diffusion (ARAD) fine-tunes the text-conditional diffusion model on images within a restricted range of aspect ratios, and Fast Seamless Tiled Diffusion (FSTD) then enlarges the output to any high-resolution size without seaming artifacts or memory overloads.
  • results: Experiments on LAION-COCO and MM-CelebA-HQ show that ASD produces well-structured images of arbitrary sizes and cuts inference time by 2x compared with the traditional tiled algorithm.
    Abstract Stable diffusion, a generative model used in text-to-image synthesis, frequently encounters resolution-induced composition problems when generating images of varying sizes. This issue primarily stems from the model being trained on pairs of single-scale images and their corresponding text descriptions. Moreover, direct training on images of unlimited sizes is unfeasible, as it would require an immense number of text-image pairs and entail substantial computational expenses. To overcome these challenges, we propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size, while minimizing the need for high-memory GPU resources. Specifically, the initial stage, dubbed Any Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a restricted range of ratios to optimize the text-conditional diffusion model, thereby improving its ability to adjust composition to accommodate diverse image sizes. To support the creation of images at any desired size, we further introduce a technique called Fast Seamless Tiled Diffusion (FSTD) at the subsequent stage. This method allows for the rapid enlargement of the ASD output to any high-resolution size, avoiding seaming artifacts or memory overloads. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks demonstrate that ASD can produce well-structured images of arbitrary sizes, cutting down the inference time by 2x compared to the traditional tiled algorithm.

GHuNeRF: Generalizable Human NeRF from a Monocular Video

  • paper_url: http://arxiv.org/abs/2308.16576
  • repo_url: None
  • paper_authors: Chen Li, Jihao Lin, Gim Hee Lee
  • for: Learning a generalizable human NeRF model from a monocular video of a performer.
  • methods: A visibility-aware aggregation scheme computes vertex-wise features for a 3D feature volume, which is further enhanced with temporally aligned point-wise features via an attention mechanism; a surface-guided sampling strategy improves training and inference efficiency.
  • results: On the ZJU-MoCap dataset the method is comparable to multi-view video based approaches, and on the monocular People-Snapshot dataset it outperforms existing works when only monocular video is used.
    Abstract In this paper, we tackle the challenging task of learning a generalizable human NeRF model from a monocular video. Although existing generalizable human NeRFs have achieved impressive results, they require muti-view images or videos which might not be always available. On the other hand, some works on free-viewpoint rendering of human from monocular videos cannot be generalized to unseen identities. In view of these limitations, we propose GHuNeRF to learn a generalizable human NeRF model from a monocular video of the human performer. We first introduce a visibility-aware aggregation scheme to compute vertex-wise features, which is used to construct a 3D feature volume. The feature volume can only represent the overall geometry of the human performer with insufficient accuracy due to the limited resolution. To solve this, we further enhance the volume feature with temporally aligned point-wise features using an attention mechanism. Finally, the enhanced feature is used for predicting density and color for each sampled point. A surface-guided sampling strategy is also introduced to improve the efficiency for both training and inference. We validate our approach on the widely-used ZJU-MoCap dataset, where we achieve comparable performance with existing multi-view video based approaches. We also test on the monocular People-Snapshot dataset and achieve better performance than existing works when only monocular video is used.

Dual-Decoder Consistency via Pseudo-Labels Guided Data Augmentation for Semi-Supervised Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.16573
  • repo_url: None
  • paper_authors: Yuanbin Chen, Tao Wang, Hui Tang, Longxuan Zhao, Ruige Zong, Tao Tan, Xinlin Zhang, Tong Tong
  • for: Improving semi-supervised medical image segmentation with a mean-teacher based method, Dual-Decoder Consistency via Pseudo-Labels Guided Data Augmentation (DCPA).
  • methods: A shared encoder feeds two decoders that use different up-sampling strategies, combined with consistency regularization, teacher-generated pseudo-labels, and mixup-based data augmentation.
  • results: On three public medical datasets, DCPA outperforms six state-of-the-art semi-supervised methods, notably when only 5% of the data is annotated.
    Abstract Medical image segmentation methods often rely on fully supervised approaches to achieve excellent performance, which is contingent upon having an extensive set of labeled images for training. However, annotating medical images is both expensive and time-consuming. Semi-supervised learning offers a solution by leveraging numerous unlabeled images alongside a limited set of annotated ones. In this paper, we introduce a semi-supervised medical image segmentation method based on the mean-teacher model, referred to as Dual-Decoder Consistency via Pseudo-Labels Guided Data Augmentation (DCPA). This method combines consistency regularization, pseudo-labels, and data augmentation to enhance the efficacy of semi-supervised segmentation. Firstly, the proposed model comprises both student and teacher models with a shared encoder and two distinct decoders employing different up-sampling strategies. Minimizing the output discrepancy between decoders enforces the generation of consistent representations, serving as regularization during student model training. Secondly, we introduce mixup operations to blend unlabeled data with labeled data, creating mixed data and thereby achieving data augmentation. Lastly, pseudo-labels are generated by the teacher model and utilized as labels for mixed data to compute unsupervised loss. We compare the segmentation results of the DCPA model with six state-of-the-art semi-supervised methods on three publicly available medical datasets. Beyond classical 10\% and 20\% semi-supervised settings, we investigate performance with less supervision (5\% labeled data). Experimental outcomes demonstrate that our approach consistently outperforms existing semi-supervised medical image segmentation methods across the three semi-supervised settings.
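A condensed sketch of one DCPA-style training step is shown below, combining an EMA teacher, teacher pseudo-labels, and mixup of labeled and unlabeled batches. Shapes, the Beta parameter, and the omission of the dual-decoder consistency term are simplifications for illustration, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, m=0.99):
    """Exponential moving average of student weights into the teacher."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)

def semi_supervised_step(student, teacher, x_l, y_l, x_u, alpha=0.5):
    """One illustrative step: supervised loss + pseudo-labelled mixup loss.

    Assumes x_l and x_u share the same shape, logits are (B, C, H, W), targets (B, H, W).
    """
    loss_sup = F.cross_entropy(student(x_l), y_l)            # supervised branch

    with torch.no_grad():                                     # pseudo-labels from the EMA teacher
        pseudo = teacher(x_u).argmax(dim=1)

    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x_l + (1 - lam) * x_u                       # mixup of labeled and unlabeled images
    logits = student(x_mix)
    loss_unsup = lam * F.cross_entropy(logits, y_l) + (1 - lam) * F.cross_entropy(logits, pseudo)
    return loss_sup + loss_unsup
```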

Document Layout Analysis on BaDLAD Dataset: A Comprehensive MViTv2 Based Approach

  • paper_url: http://arxiv.org/abs/2308.16571
  • repo_url: None
  • paper_authors: Ashrafur Rahman Khan, Asif Azad
  • for: Automatically extracting text boxes, paragraphs, images, and tables from document images.
  • methods: An MViTv2 transformer architecture with cascaded mask R-CNN is trained on the BaDLAD dataset (20365 document images, 36 epochs over a 3-phase cycle), reaching a training loss of 0.2125 and a mask loss of 0.19.
  • results: Beyond training, the study explores rotation and flip augmentation, slicing input images before inference, varying the transformer backbone resolution, and dual-pass inference to recover missed text boxes; some modifications yield tangible gains while others offer insights for future work.
    Abstract In the rapidly evolving digital era, the analysis of document layouts plays a pivotal role in automated information extraction and interpretation. In our work, we have trained MViTv2 transformer model architecture with cascaded mask R-CNN on BaDLAD dataset to extract text box, paragraphs, images and tables from a document. After training on 20365 document images for 36 epochs in a 3 phase cycle, we achieved a training loss of 0.2125 and a mask loss of 0.19. Our work extends beyond training, delving into the exploration of potential enhancement avenues. We investigate the impact of rotation and flip augmentation, the effectiveness of slicing input images pre-inference, the implications of varying the resolution of the transformer backbone, and the potential of employing a dual-pass inference to uncover missed text-boxes. Through these explorations, we observe a spectrum of outcomes, where some modifications result in tangible performance improvements, while others offer unique insights for future endeavors.
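One of the explored enhancements is slicing pages into overlapping tiles before inference to limit the information lost when large scans are downscaled. The small tiler below is a sketch of that idea; the tile size and overlap are arbitrary choices here, not the report's settings, and detections would later be mapped back using the stored (y, x) offsets.

```python
import numpy as np

def tile_image(img, tile=1024, overlap=128):
    """Slice an (H, W, C) image into overlapping tiles; returns (y, x, tile) triples."""
    h, w = img.shape[:2]
    step = tile - overlap
    tiles = []
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            tiles.append((y, x, img[y:y + tile, x:x + tile]))   # edge tiles may be smaller
    return tiles

page = np.zeros((2200, 1700, 3), dtype=np.uint8)
print(len(tile_image(page)))
```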

Shape of my heart: Cardiac models through learned signed distance functions

  • paper_url: http://arxiv.org/abs/2308.16568
  • repo_url: None
  • paper_authors: Jan Verhülsdonk, Thomas Grandits, Francisco Sahli Costabal, Rolf Krause, Angelo Auricchio, Gundolf Haase, Simone Pezzuto, Alexander Effland
  • for: Building patient-specific anatomical models of the human heart.
  • methods: Cardiac shapes are reconstructed with three-dimensional deep signed distance functions with Lipschitz regularity, learned from public cardiac MRI databases to model the spatial relation of multiple chambers in Cartesian space.
  • results: The approach reconstructs accurate anatomical models from partial data (e.g., point clouds of a single ventricle) or from modalities different from the training MRI, such as electroanatomical mapping, and can generate new shapes by sampling latent vectors.
    Abstract The efficient construction of an anatomical model is one of the major challenges of patient-specific in-silico models of the human heart. Current methods frequently rely on linear statistical models, allowing no advanced topological changes, or requiring medical image segmentation followed by a meshing pipeline, which strongly depends on image resolution, quality, and modality. These approaches are therefore limited in their transferability to other imaging domains. In this work, the cardiac shape is reconstructed by means of three-dimensional deep signed distance functions with Lipschitz regularity. For this purpose, the shapes of cardiac MRI reconstructions are learned from public databases to model the spatial relation of multiple chambers in Cartesian space. We demonstrate that this approach is also capable of reconstructing anatomical models from partial data, such as point clouds from a single ventricle, or modalities different from the trained MRI, such as electroanatomical mapping, and in addition, allows us to generate new anatomical shapes by randomly sampling latent vectors.

ScrollNet: Dynamic Weight Importance for Continual Learning

  • paper_url: http://arxiv.org/abs/2308.16567
  • repo_url: https://github.com/firefyf/scrollnet
  • paper_authors: Fei Yang, Kai Wang, Joost van de Weijer
  • for: Improving the stability-plasticity trade-off in continual learning (CL), where a model must keep learning new tasks sequentially without forgetting old ones.
  • methods: ScrollNet, a scrolling neural network that assigns a ranking of weight importance for each task before data exposure and reassigns it across tasks; it can be combined with regularization-based and replay-based CL methods.
  • results: Experiments on CIFAR100 and TinyImagenet demonstrate the effectiveness of the approach; code is available at https://github.com/FireFYF/ScrollNet.git.
    Abstract The principle underlying most existing continual learning (CL) methods is to prioritize stability by penalizing changes in parameters crucial to old tasks, while allowing for plasticity in other parameters. The importance of weights for each task can be determined either explicitly through learning a task-specific mask during training (e.g., parameter isolation-based approaches) or implicitly by introducing a regularization term (e.g., regularization-based approaches). However, all these methods assume that the importance of weights for each task is unknown prior to data exposure. In this paper, we propose ScrollNet as a scrolling neural network for continual learning. ScrollNet can be seen as a dynamic network that assigns the ranking of weight importance for each task before data exposure, thus achieving a more favorable stability-plasticity tradeoff during sequential task learning by reassigning this ranking for different tasks. Additionally, we demonstrate that ScrollNet can be combined with various CL methods, including regularization-based and replay-based approaches. Experimental results on CIFAR100 and TinyImagenet datasets show the effectiveness of our proposed method. We release our code at https://github.com/FireFYF/ScrollNet.git.

MoMA: Momentum Contrastive Learning with Multi-head Attention-based Knowledge Distillation for Histopathology Image Analysis

  • paper_url: http://arxiv.org/abs/2308.16561
  • repo_url: https://github.com/trinhvg/moma
  • paper_authors: Trinh Thi Le Vuong, Jin Tae Kwak
  • for: This paper aims to address the issue of lack of quality data in computational pathology by proposing a method to transfer knowledge from an existing model to a new model without direct access to the source data.
  • methods: The proposed method utilizes a student-teacher framework with momentum contrastive learning and multi-head attention mechanism to distill relevant knowledge from the teacher model and adapt to the unique nuances of the target data.
  • results: The proposed method outperforms other related methods in transferring knowledge to different domains and tasks, and provides a guideline on the learning strategy for different types of tasks and scenarios in computational pathology.
    Abstract There is no doubt that advanced artificial intelligence models and high quality data are the keys to success in developing computational pathology tools. Although the overall volume of pathology data keeps increasing, a lack of quality data is a common issue when it comes to a specific task due to several reasons including privacy and ethical issues with patient data. In this work, we propose to exploit knowledge distillation, i.e., utilize the existing model to learn a new, target model, to overcome such issues in computational pathology. Specifically, we employ a student-teacher framework to learn a target model from a pre-trained, teacher model without direct access to source data and distill relevant knowledge via momentum contrastive learning with multi-head attention mechanism, which provides consistent and context-aware feature representations. This enables the target model to assimilate informative representations of the teacher model while seamlessly adapting to the unique nuances of the target data. The proposed method is rigorously evaluated across different scenarios where the teacher model was trained on the same, relevant, and irrelevant classification tasks with the target model. Experimental results demonstrate the accuracy and robustness of our approach in transferring knowledge to different domains and tasks, outperforming other related methods. Moreover, the results provide a guideline on the learning strategy for different types of tasks and scenarios in computational pathology. Code is available at: \url{https://github.com/trinhvg/MoMA}.
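The distillation couples a momentum (EMA) teacher with a contrastive objective between student and teacher embeddings. The sketch below shows a plain InfoNCE form of that loss under assumed projection heads; the paper's multi-head attention mechanism and exact formulation are not reproduced, and the teacher would be updated by EMA of the student as in a standard mean-teacher setup.

```python
import torch
import torch.nn.functional as F

def momentum_contrastive_distillation(student_feat, teacher_feat, tau=0.07):
    """InfoNCE between student and EMA-teacher embeddings of the same batch (illustrative).

    student_feat, teacher_feat: (B, D) projected features; matching rows are positives.
    """
    s = F.normalize(student_feat, dim=1)
    t = F.normalize(teacher_feat, dim=1).detach()    # teacher provides targets only
    logits = s @ t.T / tau                            # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)

loss = momentum_contrastive_distillation(torch.randn(8, 128), torch.randn(8, 128))
print(float(loss))
```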

E3CM: Epipolar-Constrained Cascade Correspondence Matching

  • paper_url: http://arxiv.org/abs/2308.16555
  • repo_url: None
  • paper_authors: Chenbo Zhou, Shuai Su, Qijun Chen, Rui Fan
  • for: Addressing the challenges of accurate and robust correspondence matching in 3D computer vision tasks with a novel approach, Epipolar-Constrained Cascade Correspondence (E3CM), that improves both accuracy and efficiency.
  • methods: Pre-trained convolutional neural networks match correspondences without requiring annotated data for any network training or fine-tuning; epipolar constraints guide the matching process and a cascade structure progressively refines the matches.
  • results: Comprehensive experiments demonstrate the superiority of E3CM over existing methods, and the source code is publicly available for reproducibility.
    Abstract Accurate and robust correspondence matching is of utmost importance for various 3D computer vision tasks. However, traditional explicit programming-based methods often struggle to handle challenging scenarios, and deep learning-based methods require large well-labeled datasets for network training. In this article, we introduce Epipolar-Constrained Cascade Correspondence (E3CM), a novel approach that addresses these limitations. Unlike traditional methods, E3CM leverages pre-trained convolutional neural networks to match correspondence, without requiring annotated data for any network training or fine-tuning. Our method utilizes epipolar constraints to guide the matching process and incorporates a cascade structure for progressive refinement of matches. We extensively evaluate the performance of E3CM through comprehensive experiments and demonstrate its superiority over existing methods. To promote further research and facilitate reproducibility, we make our source code publicly available at https://mias.group/E3CM.
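Epipolar constraints enter E3CM as a geometric check on putative correspondences. A minimal sketch of scoring matches by symmetric epipolar distance, given a fundamental matrix, is shown below; the threshold is arbitrary, and the cascade refinement itself is not reproduced:

```python
import numpy as np

def symmetric_epipolar_distance(F, pts1, pts2):
    """Per-match symmetric epipolar distance; pts1, pts2 are (N, 2) pixel coordinates."""
    n = pts1.shape[0]
    x1 = np.hstack([pts1, np.ones((n, 1))])       # homogeneous coordinates
    x2 = np.hstack([pts2, np.ones((n, 1))])
    l2 = x1 @ F.T                                  # epipolar lines F x1 in image 2
    l1 = x2 @ F                                    # epipolar lines F^T x2 in image 1
    num = np.abs(np.sum(x2 * l2, axis=1))          # |x2^T F x1|
    d2 = num / np.linalg.norm(l2[:, :2], axis=1)
    d1 = num / np.linalg.norm(l1[:, :2], axis=1)
    return d1 + d2

def filter_matches(F, pts1, pts2, thresh=3.0):
    return symmetric_epipolar_distance(F, pts1, pts2) < thresh

F_mat = np.random.randn(3, 3)                      # stand-in fundamental matrix
print(filter_matches(F_mat, np.random.rand(5, 2) * 640, np.random.rand(5, 2) * 480))
```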

Prompt-enhanced Hierarchical Transformer Elevating Cardiopulmonary Resuscitation Instruction via Temporal Action Segmentation

  • paper_url: http://arxiv.org/abs/2308.16552
  • repo_url: None
  • paper_authors: Yang Liu, Xiaoyun Zhong, Shiyao Zhai, Zhicheng Du, Zhenyuan Gao, Qiming Huang, Canyang Zhang, Bin Jiang, Vijay Kumar Pandey, Sanyang Han, Runming Wang, Yuxing Han, Peiwu Qin
  • for: Elevating the quality of cardiopulmonary resuscitation (CPR) instruction, and thereby the success rate of resuscitation by trained responders.
  • methods: A custom CPR video dataset is collected and the problem is cast as temporal action segmentation (TAS); the Prompt-enhanced hierarchical Transformer (PhiTrans) integrates a textual prompt-based Video Features Extractor (VFE), a transformer-based Action Segmentation Executor (ASE), and a regression-based Prediction Refinement Calibrator (PRC).
  • results: With the segmentation pipeline grounded on three public TAS datasets (GTEA, 50Salads, Breakfast), experiments on the CPR dataset surpass 91.0% on multiple metrics.
    Abstract The vast majority of people who suffer unexpected cardiac arrest are performed cardiopulmonary resuscitation (CPR) by passersby in a desperate attempt to restore life, but endeavors turn out to be fruitless on account of disqualification. Fortunately, many pieces of research manifest that disciplined training will help to elevate the success rate of resuscitation, which constantly desires a seamless combination of novel techniques to yield further advancement. To this end, we collect a custom CPR video dataset in which trainees make efforts to behave resuscitation on mannequins independently in adherence to approved guidelines, thereby devising an auxiliary toolbox to assist supervision and rectification of intermediate potential issues via modern deep learning methodologies. Our research empirically views this problem as a temporal action segmentation (TAS) task in computer vision, which aims to segment an untrimmed video at a frame-wise level. Here, we propose a Prompt-enhanced hierarchical Transformer (PhiTrans) that integrates three indispensable modules, including a textual prompt-based Video Features Extractor (VFE), a transformer-based Action Segmentation Executor (ASE), and a regression-based Prediction Refinement Calibrator (PRC). The backbone of the model preferentially derives from applications in three approved public datasets (GTEA, 50Salads, and Breakfast) collected for TAS tasks, which accounts for the excavation of the segmentation pipeline on the CPR dataset. In general, we unprecedentedly probe into a feasible pipeline that genuinely elevates the CPR instruction qualification via action segmentation in conjunction with cutting-edge deep learning techniques. Associated experiments advocate our implementation with multiple metrics surpassing 91.0%.

Object Detection for Caries or Pit and Fissure Sealing Requirement in Children’s First Permanent Molars

  • paper_url: http://arxiv.org/abs/2308.16551
  • repo_url: None
  • paper_authors: Chenyao Jiang, Shiyao Zhai, Hengrui Song, Yuqing Ma, Yachen Fan, Yancheng Fang, Dongmei Yu, Canyang Zhang, Sanyang Han, Runming Wang, Yong Liu, Jianbo Li, Peiwu Qin
  • for: Automatic detection of caries and of pit and fissure sealing requirements in children's first permanent molars, so that parents or guardians can screen at home.
  • methods: Oral photos taken with smartphones are analyzed with YOLOv5 and YOLOX models, using a tiling strategy to reduce information loss during image pre-processing.
  • results: The YOLOXs model with tiling achieves the best result of 72.3 mAP.5, versus 71.2 without tiling; YOLOv5s6 attains 70.9/67.9 mAP.5 with/without tiling.
    Abstract Dental caries is one of the most common oral diseases that, if left untreated, can lead to a variety of oral problems. It mainly occurs inside the pits and fissures on the occlusal/buccal/palatal surfaces of molars and children are a high-risk group for pit and fissure caries in permanent molars. Pit and fissure sealing is one of the most effective methods that is widely used in prevention of pit and fissure caries. However, current detection of pits and fissures or caries depends primarily on the experienced dentists, which ordinary parents do not have, and children may miss the remedial treatment without timely detection. To address this issue, we present a method to autodetect caries and pit and fissure sealing requirements using oral photos taken by smartphones. We use the YOLOv5 and YOLOX models and adopt a tiling strategy to reduce information loss during image pre-processing. The best result for YOLOXs model with tiling strategy is 72.3 mAP.5, while the best result without tiling strategy is 71.2. YOLOv5s6 model with/without tiling attains 70.9/67.9 mAP.5, respectively. We deploy the pre-trained network to mobile devices as a WeChat applet, allowing in-home detection by parents or children guardian.

Decoupled Local Aggregation for Point Cloud Learning

  • paper_url: http://arxiv.org/abs/2308.16532
  • repo_url: None
  • paper_authors: Binjie Chen, Yunzhou Xia, Yu Zang, Cheng Wang, Jonathan Li
  • for: Making local aggregation in point cloud networks more efficient despite the unstructured nature of point clouds.
  • methods: The explicit modelling of spatial relations is decoupled from local aggregation: in DeLA, relative spatial encodings are formed first at each learning stage, and local aggregation then uses only pointwise convolutions plus edge max-pooling, with a regularization term predicting relative coordinates to reduce ambiguity.
  • results: DeLA reaches state-of-the-art performance with reduced or comparable latency on five classic benchmarks, including over 90% overall accuracy on ScanObjectNN and 74% mIoU on S3DIS Area 5.
    Abstract The unstructured nature of point clouds demands that local aggregation be adaptive to different local structures. Previous methods meet this by explicitly embedding spatial relations into each aggregation process. Although this coupled approach has been shown effective in generating clear semantics, aggregation can be greatly slowed down due to repeated relation learning and redundant computation to mix directional and point features. In this work, we propose to decouple the explicit modelling of spatial relations from local aggregation. We theoretically prove that basic neighbor pooling operations can too function without loss of clarity in feature fusion, so long as essential spatial information has been encoded in point features. As an instantiation of decoupled local aggregation, we present DeLA, a lightweight point network, where in each learning stage relative spatial encodings are first formed, and only pointwise convolutions plus edge max-pooling are used for local aggregation then. Further, a regularization term is employed to reduce potential ambiguity through the prediction of relative coordinates. Conceptually simple though, experimental results on five classic benchmarks demonstrate that DeLA achieves state-of-the-art performance with reduced or comparable latency. Specifically, DeLA achieves over 90\% overall accuracy on ScanObjectNN and 74\% mIoU on S3DIS Area 5. Our code is available at https://github.com/Matrix-ASC/DeLA .

Privacy-Preserving Medical Image Classification through Deep Learning and Matrix Decomposition

  • paper_url: http://arxiv.org/abs/2308.16530
  • repo_url: None
  • paper_authors: Andreea Bianca Popescu, Cosmin Ioan Nita, Ioana Antonia Taca, Anamaria Vizitiu, Lucian Mihai Itu
  • for: Enabling deep learning analysis of medical images while protecting the privacy and security of patient data.
  • methods: Medical images are obfuscated with singular value decomposition (SVD) and principal component analysis (PCA) before being used in the deep learning pipeline.
  • results: An angiographic view classifier trained exclusively on the secured images retains satisfactory performance, and simulated AI-based reconstruction attacks show that the original image content is difficult to recover.
    Abstract Deep learning (DL)-based solutions have been extensively researched in the medical domain in recent years, enhancing the efficacy of diagnosis, planning, and treatment. Since the usage of health-related data is strictly regulated, processing medical records outside the hospital environment for developing and using DL models demands robust data protection measures. At the same time, it can be challenging to guarantee that a DL solution delivers a minimum level of performance when being trained on secured data, without being specifically designed for the given task. Our approach uses singular value decomposition (SVD) and principal component analysis (PCA) to obfuscate the medical images before employing them in the DL analysis. The capability of DL algorithms to extract relevant information from secured data is assessed on a task of angiographic view classification based on obfuscated frames. The security level is probed by simulated artificial intelligence (AI)-based reconstruction attacks, considering two threat actors with different prior knowledge of the targeted data. The degree of privacy is quantitatively measured using similarity indices. Although a trade-off between privacy and accuracy should be considered, the proposed technique allows for training the angiographic view classifier exclusively on secured data with satisfactory performance and with no computational overhead, model adaptation, or hyperparameter tuning. While the obfuscated medical image content is well protected against human perception, the hypothetical reconstruction attack proved that it is also difficult to recover the complete information of the original frames.
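Obfuscation happens before any deep learning model sees the data. As a toy illustration of an SVD-based surrogate representation — not necessarily the exact protocol used in the paper, whose rank choice and composition with PCA differ — one can keep only a low-rank reconstruction of each frame:

```python
import numpy as np

def svd_obfuscate(image: np.ndarray, rank: int = 16) -> np.ndarray:
    """Replace a grayscale frame by its rank-`rank` SVD reconstruction (illustrative only).

    The truncated factors discard fine detail while retaining coarse structure that a
    downstream classifier can still learn from.
    """
    u, s, vt = np.linalg.svd(image.astype(np.float64), full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

frame = np.random.rand(256, 256)          # stand-in for an angiographic frame
obfuscated = svd_obfuscate(frame, rank=16)
print(obfuscated.shape)
```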

SA6D: Self-Adaptive Few-Shot 6D Pose Estimator for Novel and Occluded Objects

  • paper_url: http://arxiv.org/abs/2308.16528
  • repo_url: None
  • paper_authors: Ning Gao, Ngo Anh Vien, Hanna Ziesche, Gerhard Neumann
  • for: Enabling meaningful robotic manipulation of real-world objects, for which 6D pose estimation of novel and occluded objects is a critical ingredient.
  • methods: A few-shot pose estimator (SA6D) with a self-adaptive segmentation module that identifies the novel target object and builds a point cloud model of it from only a small number of cluttered reference images, without object-centric reference images or additional object information.
  • results: On real-world tabletop object datasets, SA6D outperforms existing few-shot pose estimation methods, particularly in cluttered scenes with occlusions, while requiring fewer reference images.
    Abstract To enable meaningful robotic manipulation of objects in the real-world, 6D pose estimation is one of the critical aspects. Most existing approaches have difficulties to extend predictions to scenarios where novel object instances are continuously introduced, especially with heavy occlusions. In this work, we propose a few-shot pose estimation (FSPE) approach called SA6D, which uses a self-adaptive segmentation module to identify the novel target object and construct a point cloud model of the target object using only a small number of cluttered reference images. Unlike existing methods, SA6D does not require object-centric reference images or any additional object information, making it a more generalizable and scalable solution across categories. We evaluate SA6D on real-world tabletop object datasets and demonstrate that SA6D outperforms existing FSPE methods, particularly in cluttered scenes with occlusions, while requiring fewer reference images.

Unsupervised Recognition of Unknown Objects for Open-World Object Detection

  • paper_url: http://arxiv.org/abs/2308.16527
  • repo_url: https://github.com/frh23333/mepu-owod
  • paper_authors: Ruohuan Fang, Guansong Pang, Lei Zhou, Xiao Bai, Jin Zheng
  • for: Open-World Object Detection (OWOD), where a detector must recognize both known and unknown objects and incrementally learn newly introduced knowledge.
  • methods: An unsupervised discriminative model is learned to recognize true unknown objects from raw pseudo labels generated by unsupervised region proposal methods, and is further refined by a classification-free self-training scheme that iteratively extends pseudo unknown objects to unlabeled regions.
  • results: The method significantly outperforms the prior state of the art in detecting unknown objects on MS COCO while maintaining competitive performance on known classes, and generalizes better on the LVIS and Objects365 datasets.
    Abstract Open-World Object Detection (OWOD) extends object detection problem to a realistic and dynamic scenario, where a detection model is required to be capable of detecting both known and unknown objects and incrementally learning newly introduced knowledge. Current OWOD models, such as ORE and OW-DETR, focus on pseudo-labeling regions with high objectness scores as unknowns, whose performance relies heavily on the supervision of known objects. While they can detect the unknowns that exhibit similar features to the known objects, they suffer from a severe label bias problem that they tend to detect all regions (including unknown object regions) that are dissimilar to the known objects as part of the background. To eliminate the label bias, this paper proposes a novel approach that learns an unsupervised discriminative model to recognize true unknown objects from raw pseudo labels generated by unsupervised region proposal methods. The resulting model can be further refined by a classification-free self-training method which iteratively extends pseudo unknown objects to the unlabeled regions. Experimental results show that our method 1) significantly outperforms the prior SOTA in detecting unknown objects while maintaining competitive performance of detecting known object classes on the MS COCO dataset, and 2) achieves better generalization ability on the LVIS and Objects365 datasets.

MS23D: A 3D Object Detection Method Using Multi-Scale Semantic Feature Points to Construct 3D Feature Layers

  • paper_url: http://arxiv.org/abs/2308.16518
  • repo_url: None
  • paper_authors: Yongxin Shao, Aihong Tan, Tianhong Yan, Zhetao Sun
  • for: Proposes a two-stage 3D detection framework that uses small- and large-sized voxels to improve the efficiency and accuracy of 3D feature layers
  • methods: Small-sized voxels extract fine-grained local features while large-sized voxels capture long-range local features; 3D feature layers are built from multi-scale semantic feature points, turning sparse 3D feature layers into more compact representations
  • results: Evaluations on the KITTI and ONCE datasets show the method markedly improves the efficiency and accuracy of 3D detection
    Abstract Lidar point clouds, as a type of data with accurate distance perception, can effectively represent the motion and posture of objects in three-dimensional space. However, the sparsity and disorderliness of point clouds make it challenging to extract features directly from them. Many studies have addressed this issue by transforming point clouds into regular voxel representations. However, these methods often lead to the loss of fine-grained local feature information due to downsampling. Moreover, the sparsity of point clouds poses difficulties in efficiently aggregating features in 3D feature layers using voxel-based two-stage methods. To address these issues, this paper proposes a two-stage 3D detection framework called MS$^{2}$3D. In MS$^{2}$3D, we utilize small-sized voxels to extract fine-grained local features and large-sized voxels to capture long-range local features. Additionally, we propose a method for constructing 3D feature layers using multi-scale semantic feature points, enabling the transformation of sparse 3D feature layers into more compact representations. Furthermore, we compute the offset between feature points in the 3D feature layers and the centroid of objects, aiming to bring them as close as possible to the object's center. It significantly enhances the efficiency of feature aggregation. To validate the effectiveness of our method, we evaluated our method on the KITTI dataset and ONCE dataset together.
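The centroid-offset idea (pulling semantic feature points toward the object center before aggregation) reduces to a couple of lines; a schematic sketch, not the paper's code:

```python
import numpy as np

def offset_targets(feature_pts, centroid):
    """Training target: offset from each semantic feature point (N, 3) to the
    ground-truth object centroid (3,)."""
    return centroid[None, :] - feature_pts

def shift_to_center(feature_pts, pred_offsets):
    """Apply predicted offsets so feature points move toward the object center,
    which makes subsequent feature aggregation more efficient."""
    return feature_pts + pred_offsets
```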

MVDream: Multi-view Diffusion for 3D Generation

  • paper_url: http://arxiv.org/abs/2308.16512
  • repo_url: None
  • paper_authors: Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, Xiao Yang
  • for: Generating geometrically consistent multi-view images from a text prompt
  • methods: Leverages image diffusion models pre-trained on large-scale web data and a multi-view dataset rendered from 3D assets to achieve consistent multi-view generation
  • results: Combines the generalizability of 2D diffusion with the consistency of 3D data, serves as a multi-view prior for 3D generation via Score Distillation Sampling, and supports few-shot personalized 3D generation
    Abstract We propose MVDream, a multi-view diffusion model that is able to generate geometrically consistent multi-view images from a given text prompt. By leveraging image diffusion models pre-trained on large-scale web datasets and a multi-view dataset rendered from 3D assets, the resulting multi-view diffusion model can achieve both the generalizability of 2D diffusion and the consistency of 3D data. Such a model can thus be applied as a multi-view prior for 3D generation via Score Distillation Sampling, where it greatly improves the stability of existing 2D-lifting methods by solving the 3D consistency problem. Finally, we show that the multi-view diffusion model can also be fine-tuned under a few shot setting for personalized 3D generation, i.e. DreamBooth3D application, where the consistency can be maintained after learning the subject identity.
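For readers unfamiliar with Score Distillation Sampling, the update into which the multi-view prior is plugged looks roughly like this; the `denoiser` call, timestep range, and weighting follow common SDS conventions and are not MVDream's exact implementation:

```python
import torch

def sds_grad(latents, text_emb, denoiser, alphas_cumprod):
    """One Score Distillation Sampling step using a (multi-view) diffusion prior.

    `denoiser(x_t, t, text_emb)` stands in for a hypothetical noise-prediction model.
    """
    t = torch.randint(20, 980, (1,), device=latents.device)
    noise = torch.randn_like(latents)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_t.sqrt() * latents + (1.0 - a_t).sqrt() * noise
    with torch.no_grad():
        eps_pred = denoiser(x_t, t, text_emb)
    w = 1.0 - a_t                         # a common weighting choice
    return w * (eps_pred - noise)         # gradient pushed back to the 3D renderer
```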

Robust GAN inversion

  • paper_url: http://arxiv.org/abs/2308.16510
  • repo_url: None
  • paper_authors: Egor Sevriugov, Ivan Oseledets
  • for: Improving the fidelity and editability of image editing in the latent space of generative adversarial networks (GANs)
  • methods: Works in the native latent space $W$ and tunes the generator network to restore missing image details, using a regularization strategy with learnable coefficients obtained by training a randomized StyleGAN 2 model (WRanGAN)
  • results: Outperforms traditional approaches in reconstruction quality and computational efficiency, achieving the lowest distortion with 4 times fewer parameters, and slightly improves the quality of hyperplanes for binary image attributes; experiments cover Flickr-Faces-HQ and LSUN Church
    Abstract Recent advancements in real image editing have been attributed to the exploration of Generative Adversarial Networks (GANs) latent space. However, the main challenge of this procedure is GAN inversion, which aims to map the image to the latent space accurately. Existing methods that work on extended latent space $W+$ are unable to achieve low distortion and high editability simultaneously. To address this issue, we propose an approach which works in native latent space $W$ and tunes the generator network to restore missing image details. We introduce a novel regularization strategy with learnable coefficients obtained by training randomized StyleGAN 2 model - WRanGAN. This method outperforms traditional approaches in terms of reconstruction quality and computational efficiency, achieving the lowest distortion with 4 times fewer parameters. Furthermore, we observe a slight improvement in the quality of constructing hyperplanes corresponding to binary image attributes. We demonstrate the effectiveness of our approach on two complex datasets: Flickr-Faces-HQ and LSUN Church.
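A minimal sketch of the second stage (tuning the generator in native $W$ space with a weight-deviation penalty); `G` is a hypothetical StyleGAN-like generator and the fixed `lam` stands in for WRanGAN's learnable regularization coefficients:

```python
import torch
import torch.nn.functional as F

def tune_generator(G, w, target, steps=300, lam=0.1, lr=1e-4):
    """Keep the W-space code fixed and finetune the generator to restore details,
    while a penalty keeps its weights near the pretrained ones."""
    pretrained = {n: p.detach().clone() for n, p in G.named_parameters()}
    opt = torch.optim.Adam(G.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = G(w)
        rec = F.l1_loss(recon, target)
        reg = sum(((p - pretrained[n]) ** 2).sum() for n, p in G.named_parameters())
        (rec + lam * reg).backward()
        opt.step()
    return G
```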

Illumination Distillation Framework for Nighttime Person Re-Identification and A New Benchmark

  • paper_url: http://arxiv.org/abs/2308.16486
  • repo_url: https://github.com/alexadlu/idf
  • paper_authors: Andong Lu, Zhang Zhang, Yan Huang, Yifan Zhang, Chenglong Li, Jin Tang, Liang Wang
  • for: Nighttime person re-identification, an important and challenging task for visual surveillance
  • methods: Proposes an Illumination Distillation Framework (IDF) to address the low-illumination challenge; IDF consists of a master branch, an illumination enhancement branch, and an illumination distillation module
  • results: IDF achieves state-of-the-art performance on two nighttime person Re-ID datasets (Night600 and Knight); the code and the new Night600 dataset will be released on GitHub
    Abstract Nighttime person Re-ID (person re-identification in the nighttime) is a very important and challenging task for visual surveillance but it has not been thoroughly investigated. Under the low illumination condition, the performance of person Re-ID methods usually sharply deteriorates. To address the low illumination challenge in nighttime person Re-ID, this paper proposes an Illumination Distillation Framework (IDF), which utilizes illumination enhancement and illumination distillation schemes to promote the learning of Re-ID models. Specifically, IDF consists of a master branch, an illumination enhancement branch, and an illumination distillation module. The master branch is used to extract the features from a nighttime image. The illumination enhancement branch first estimates an enhanced image from the nighttime image using a nonlinear curve mapping method and then extracts the enhanced features. However, nighttime and enhanced features usually contain data noise due to unstable lighting conditions and enhancement failures. To fully exploit the complementary benefits of nighttime and enhanced features while suppressing data noise, we propose an illumination distillation module. In particular, the illumination distillation module fuses the features from two branches through a bottleneck fusion model and then uses the fused features to guide the learning of both branches in a distillation manner. In addition, we build a real-world nighttime person Re-ID dataset, named Night600, which contains 600 identities captured from different viewpoints and nighttime illumination conditions under complex outdoor environments. Experimental results demonstrate that our IDF can achieve state-of-the-art performance on two nighttime person Re-ID datasets (i.e., Night600 and Knight ). We will release our code and dataset at https://github.com/Alexadlu/IDF.
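A small sketch of the illumination distillation idea (bottleneck fusion of the two branches, then pulling both toward the fused teacher); the layer sizes and MSE objective are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IlluminationDistill(nn.Module):
    """Fuse nighttime and enhanced features through a bottleneck and distill the
    fused features back into both branches."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, dim))

    def forward(self, f_night, f_enh):
        f_fused = self.fuse(torch.cat([f_night, f_enh], dim=-1))
        # both branches are pulled toward the fused teacher features
        distill = F.mse_loss(f_night, f_fused.detach()) + \
                  F.mse_loss(f_enh, f_fused.detach())
        return f_fused, distill
```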

PivotNet: Vectorized Pivot Learning for End-to-end HD Map Construction

  • paper_url: http://arxiv.org/abs/2308.16477
  • repo_url: None
  • paper_authors: Wenjie Ding, Limeng Qiao, Xi Qiu, Chi Zhang
  • for: Online construction of vectorized high-definition maps for autonomous driving, aiming to improve map precision and completeness
  • methods: Proposes a simple yet effective architecture named PivotNet, which adopts a unified pivot-based map representation and is formulated as a direct set prediction paradigm
  • results: Experiments and ablations show PivotNet outperforms other SOTAs by at least 5.9 mAP
    Abstract Vectorized high-definition map online construction has garnered considerable attention in the field of autonomous driving research. Most existing approaches model changeable map elements using a fixed number of points, or predict local maps in a two-stage autoregressive manner, which may miss essential details and lead to error accumulation. Towards precise map element learning, we propose a simple yet effective architecture named PivotNet, which adopts unified pivot-based map representations and is formulated as a direct set prediction paradigm. Concretely, we first propose a novel Point-to-Line Mask module to encode both the subordinate and geometrical point-line priors in the network. Then, a well-designed Pivot Dynamic Matching module is proposed to model the topology in dynamic point sequences by introducing the concept of sequence matching. Furthermore, to supervise the position and topology of the vectorized point predictions, we propose a Dynamic Vectorized Sequence loss. Extensive experiments and ablations show that PivotNet is remarkably superior to other SOTAs by 5.9 mAP at least. The code will be available soon.
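Direct set prediction losses of this kind generally rest on a bipartite matching between predictions and targets; the generic sketch below omits the sequence-order constraint that PivotNet's Pivot Dynamic Matching adds:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_points(pred_pts, gt_pivots):
    """Bipartite matching between predicted points (P, 2) and ground-truth
    pivots (G, 2) under an L2 cost; returns (pred_index, gt_index) pairs."""
    cost = cdist(pred_pts, gt_pivots)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))
```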

Self-Sampling Meta SAM: Enhancing Few-shot Medical Image Segmentation with Meta-Learning

  • paper_url: http://arxiv.org/abs/2308.16466
  • repo_url: None
  • paper_authors: Yiming Zhang, Tianang Leng, Kun Han, Xiaohui Xie
  • for: Proposes a Self-Sampling Meta SAM (SSM-SAM) framework for few-shot medical image segmentation with rapid online adaptation
  • methods: Combines an online fast gradient descent optimizer, further optimized by a meta-learner, for swift and robust adaptation to new tasks, with a self-sampling module that provides well-aligned visual prompts and an attention-based decoder designed for medical few-shot learning
  • results: Improves on state-of-the-art few-shot segmentation by an average of 10.21% and 1.80% DSC on a popular abdominal CT dataset and an MRI dataset, respectively, and adapts to a new organ in 0.83 minutes; code will be released on GitHub
    Abstract While the Segment Anything Model (SAM) excels in semantic segmentation for general-purpose images, its performance significantly deteriorates when applied to medical images, primarily attributable to insufficient representation of medical images in its training dataset. Nonetheless, gathering comprehensive datasets and training models that are universally applicable is particularly challenging due to the long-tail problem common in medical images. To address this gap, here we present a Self-Sampling Meta SAM (SSM-SAM) framework for few-shot medical image segmentation. Our innovation lies in the design of three key modules: 1) An online fast gradient descent optimizer, further optimized by a meta-learner, which ensures swift and robust adaptation to new tasks. 2) A Self-Sampling module designed to provide well-aligned visual prompts for improved attention allocation; and 3) A robust attention-based decoder specifically designed for medical few-shot learning to capture relationship between different slices. Extensive experiments on a popular abdominal CT dataset and an MRI dataset demonstrate that the proposed method achieves significant improvements over state-of-the-art methods in few-shot segmentation, with an average improvements of 10.21% and 1.80% in terms of DSC, respectively. In conclusion, we present a novel approach for rapid online adaptation in interactive image segmentation, adapting to a new organ in just 0.83 minutes. Code is publicly available on GitHub upon acceptance.
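A minimal sketch of the online fast adaptation loop on a few support slices; in the paper a meta-learner tunes this optimizer, whereas here the step sizes are plain constants and the update is vanilla SGD, purely for illustration:

```python
import torch

def fast_adapt(model, loss_fn, support_x, support_y, step_sizes=(1e-2, 1e-2, 1e-2)):
    """Run a few gradient-descent steps on the support set to adapt the model."""
    params = [p for p in model.parameters() if p.requires_grad]
    for lr in step_sizes:
        loss = loss_fn(model(support_x), support_y)
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p.sub_(lr * g)          # in-place SGD step
    return model
```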

Domain Adaptive Synapse Detection with Weak Point Annotations

  • paper_url: http://arxiv.org/abs/2308.16461
  • repo_url: None
  • paper_authors: Qi Chen, Wei Huang, Yueyi Zhang, Zhiwei Xiong
  • for: This paper is written for detecting synapses from electron microscopy (EM) images using a two-stage segmentation-based framework with weak point annotations.
  • methods: The paper uses a segmentation-based pipeline to obtain synaptic instance masks in the first stage, and regenerates square masks to get high-quality pseudo labels in the second stage. The method also utilizes the distance nearest principle to match paired pre-synapses and post-synapses.
  • results: The method ranks 1st place in the WASPSYN challenge at ISBI 2023 with high-accuracy detection results.
    Abstract The development of learning-based methods has greatly improved the detection of synapses from electron microscopy (EM) images. However, training a model for each dataset is time-consuming and requires extensive annotations. Additionally, it is difficult to apply a learned model to data from different brain regions due to variations in data distributions. In this paper, we present AdaSyn, a two-stage segmentation-based framework for domain adaptive synapse detection with weak point annotations. In the first stage, we address the detection problem by utilizing a segmentation-based pipeline to obtain synaptic instance masks. In the second stage, we improve model generalizability on target data by regenerating square masks to get high-quality pseudo labels. Benefiting from our high-accuracy detection results, we introduce the distance nearest principle to match paired pre-synapses and post-synapses. In the WASPSYN challenge at ISBI 2023, our method ranks the 1st place.
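The distance-nearest pairing of pre- and post-synapses can be sketched with a KD-tree; the distance cutoff is an illustrative assumption and depends on the EM volume's voxel size:

```python
import numpy as np
from scipy.spatial import cKDTree

def match_synapses(pre_xyz, post_xyz, max_dist=200.0):
    """Pair each detected post-synapse with its nearest pre-synapse and drop
    pairs farther apart than `max_dist`. Returns (pre_index, post_index) pairs."""
    tree = cKDTree(pre_xyz)
    dist, idx = tree.query(post_xyz, k=1)
    return [(int(i), j) for j, (d, i) in enumerate(zip(dist, idx)) if d <= max_dist]
```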

Improving Lens Flare Removal with General Purpose Pipeline and Multiple Light Sources Recovery

  • paper_url: http://arxiv.org/abs/2308.16460
  • repo_url: https://github.com/yuyanzhou1/improving-lens-flare-removal
  • paper_authors: Yuyan Zhou, Dong Liang, Songcan Chen, Sheng-Jun Huang, Shuo Yang, Chongyi Li
  • for: Improving lens flare removal so it generalizes to a wider range of scenarios
  • methods: Revisits the image signal processing pipeline, remodels the principle of automatic exposure in the flare synthesis pipeline, and designs a more reliable strategy for recovering multiple light sources
  • results: Experiments show the solution effectively improves lens flare removal and pushes the frontier toward more general situations; a new flare removal test dataset captured with ten types of consumer electronics is also contributed
    Abstract When taking images against strong light sources, the resulting images often contain heterogeneous flare artifacts. These artifacts can importantly affect image visual quality and downstream computer vision tasks. While collecting real data pairs of flare-corrupted/flare-free images for training flare removal models is challenging, current methods utilize the direct-add approach to synthesize data. However, these methods do not consider automatic exposure and tone mapping in image signal processing pipeline (ISP), leading to the limited generalization capability of deep models training using such data. Besides, existing methods struggle to handle multiple light sources due to the different sizes, shapes and illuminance of various light sources. In this paper, we propose a solution to improve the performance of lens flare removal by revisiting the ISP and remodeling the principle of automatic exposure in the synthesis pipeline and design a more reliable light sources recovery strategy. The new pipeline approaches realistic imaging by discriminating the local and global illumination through convex combination, avoiding global illumination shifting and local over-saturation. Our strategy for recovering multiple light sources convexly averages the input and output of the neural network based on illuminance levels, thereby avoiding the need for a hard threshold in identifying light sources. We also contribute a new flare removal testing dataset containing the flare-corrupted images captured by ten types of consumer electronics. The dataset facilitates the verification of the generalization capability of flare removal methods. Extensive experiments show that our solution can effectively improve the performance of lens flare removal and push the frontier toward more general situations.
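A toy version of the illuminance-weighted convex averaging used to recover light sources without a hard threshold; the linear ramp and cutoff below are illustrative, not the paper's calibrated values:

```python
import numpy as np

def recover_sources(net_in, net_out, luminance, thresh=0.95):
    """Convexly average the network input and output per pixel: bright (likely
    light-source) pixels keep the input, the rest keep the flare-removed output.

    net_in, net_out: (H, W, 3) images; luminance: (H, W) per-pixel brightness.
    """
    w = np.clip((luminance - thresh) / (1.0 - thresh), 0.0, 1.0)[..., None]
    return w * net_in + (1.0 - w) * net_out
```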

Adversarial Finetuning with Latent Representation Constraint to Mitigate Accuracy-Robustness Tradeoff

  • paper_url: http://arxiv.org/abs/2308.16454
  • repo_url: None
  • paper_authors: Satoshi Suzuki, Shin’ya Yamaguchi, Shoichiro Takeda, Sekitoshi Kanai, Naoki Makishima, Atsushi Ando, Ryo Masumura
  • for: Addresses the tradeoff in deep neural networks (DNNs) between standard accuracy on clean examples and robustness against adversarial examples: adversarial training (AT) improves robustness but degrades standard accuracy
  • methods: Proposes a new AT method called ARREST with three components: (i) adversarial finetuning (AFT), (ii) representation-guided knowledge distillation (RGKD), and (iii) noisy replay (NR). AFT trains a DNN on adversarial examples after initializing its parameters from a DNN standardly pretrained on clean examples; RGKD and NR respectively provide a regularization term and an algorithm that preserve latent representations of clean examples during AFT
  • results: Experiments show ARREST mitigates the accuracy-robustness tradeoff more effectively than previous AT-based methods
    Abstract This paper addresses the tradeoff between standard accuracy on clean examples and robustness against adversarial examples in deep neural networks (DNNs). Although adversarial training (AT) improves robustness, it degrades the standard accuracy, thus yielding the tradeoff. To mitigate this tradeoff, we propose a novel AT method called ARREST, which comprises three components: (i) adversarial finetuning (AFT), (ii) representation-guided knowledge distillation (RGKD), and (iii) noisy replay (NR). AFT trains a DNN on adversarial examples by initializing its parameters with a DNN that is standardly pretrained on clean examples. RGKD and NR respectively entail a regularization term and an algorithm to preserve latent representations of clean examples during AFT. RGKD penalizes the distance between the representations of the standardly pretrained and AFT DNNs. NR switches input adversarial examples to nonadversarial ones when the representation changes significantly during AFT. By combining these components, ARREST achieves both high standard accuracy and robustness. Experimental results demonstrate that ARREST mitigates the tradeoff more effectively than previous AT-based methods do.
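A compact sketch of the AFT objective with an RGKD-style representation constraint (noisy replay omitted); it assumes both networks return (logits, features), which is an illustrative simplification:

```python
import torch
import torch.nn.functional as F

def arrest_step(model, pretrained, x_adv, y, lam=1.0):
    """Adversarial finetuning loss plus a penalty keeping the finetuned latent
    representation close to the standardly pretrained one; `lam` is illustrative."""
    logits, feats = model(x_adv)
    with torch.no_grad():
        _, feats_pre = pretrained(x_adv)
    return F.cross_entropy(logits, y) + lam * F.mse_loss(feats, feats_pre)
```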

Njobvu-AI: An open-source tool for collaborative image labeling and implementation of computer vision models

  • paper_url: http://arxiv.org/abs/2308.16435
  • repo_url: https://github.com/sullichrosu/njobvu-ai
  • paper_authors: Jonathan S. Koning, Ashwin Subramanian, Mazen Alotaibi, Cara L. Appel, Christopher M. Sullivan, Thon Chao, Lisa Truong, Robyn L. Tanguay, Pankaj Jaiswal, Taal Levi, Damon B. Lesmeister
  • for: Provides a user-friendly, open-source computer vision tool that helps researchers in many fields label data and apply models
  • methods: Implemented with Node.js and runs on both desktop and server hardware; features include data labeling, multi-user collaboration, training custom algorithms, and implementing new models
  • results: Lets researchers quickly create and apply their own computer vision models, supporting collaborative labeling and review across projects and a range of application needs
    Abstract Practitioners interested in using computer vision models lack user-friendly and open-source software that combines features to label training data, allow multiple users, train new algorithms, review output, and implement new models. Labeling training data, such as images, is a key step to developing accurate object detection algorithms using computer vision. This step is often not compatible with many cloud-based services for marking or labeling image and video data due to limited internet bandwidth in many regions of the world. Desktop tools are useful for groups working in remote locations, but users often do not have the capability to combine projects developed locally by multiple collaborators. Furthermore, many tools offer features for labeling data or using pre-trained models for classification, but few allow researchers to combine these steps to create and apply custom models. Free, open-source, and user-friendly software that offers a full suite of features (e.g., ability to work locally and online, and train custom models) is desirable to field researchers and conservationists that may have limited coding skills. We developed Njobvu-AI, a free, open-source tool that can be run on both desktop and server hardware using Node.js, allowing users to label data, combine projects for collaboration and review, train custom algorithms, and implement new computer vision models. The name Njobvu-AI (pronounced N-joh-voo AI), incorporating the Chichewa word for elephant, is inspired by a wildlife monitoring program in Malawi that was a primary impetus for the development of this tool and references similarities between the powerful memory of elephants and properties of computer vision models.

Deformation Robust Text Spotting with Geometric Prior

  • paper_url: http://arxiv.org/abs/2308.16404
  • repo_url: None
  • paper_authors: Xixuan Hao, Aozhong Zhang, Xianze Meng, Bin Fu
  • for: End-to-end text spotting (detection and recognition), with a focus on characters with strong shape deformation and font diversity
  • methods: Introduces the ARText dataset and proposes a deformation-robust text spotter (DR TextSpotter) with a geometric prior module based on unsupervised landmark detection and a graph convolution network for semantic reasoning
  • results: Experiments on the ARText and IC19-ReCTS datasets demonstrate the effectiveness of the method
    Abstract The goal of text spotting is to perform text detection and recognition in an end-to-end manner. Although the diversity of luminosity and orientation in scene texts has been widely studied, the font diversity and shape variance of the same character are ignored in recent works, since most characters in natural images are rendered in standard fonts. To solve this problem, we present a Chinese Artistic Dataset, termed as ARText, which contains 33,000 artistic images with rich shape deformation and font diversity. Based on this database, we develop a deformation robust text spotting method (DR TextSpotter) to solve the recognition problem of complex deformation of characters in different fonts. Specifically, we propose a geometric prior module to highlight the important features based on the unsupervised landmark detection sub-network. A graph convolution network is further constructed to fuse the character features and landmark features, and then performs semantic reasoning to enhance the discrimination for different characters. The experiments are conducted on ARText and IC19-ReCTS datasets. Our results demonstrate the effectiveness of our proposed method.

RGB-T Tracking via Multi-Modal Mutual Prompt Learning

  • paper_url: http://arxiv.org/abs/2308.16386
  • repo_url: https://github.com/husteryoung/mplt
  • paper_authors: Yang Luo, Xiqing Guo, Hui Feng, Lei Ao
  • for: Achieving a more comprehensive fusion of visible and thermal information in RGB-T tracking at lower computational cost
  • methods: A multi-modal tracking architecture based on mutual prompt learning between the two modalities, with a lightweight prompter that uses attention mechanisms in two dimensions to transfer information from one modality to the other at low computational cost
  • results: Experiments show the proposed tracker is effective and efficient, achieving state-of-the-art performance while maintaining high running speed
    Abstract Object tracking based on the fusion of visible and thermal im-ages, known as RGB-T tracking, has gained increasing atten-tion from researchers in recent years. How to achieve a more comprehensive fusion of information from the two modalities with fewer computational costs has been a problem that re-searchers have been exploring. Recently, with the rise of prompt learning in computer vision, we can better transfer knowledge from visual large models to downstream tasks. Considering the strong complementarity between visible and thermal modalities, we propose a tracking architecture based on mutual prompt learning between the two modalities. We also design a lightweight prompter that incorporates attention mechanisms in two dimensions to transfer information from one modality to the other with lower computational costs, embedding it into each layer of the backbone. Extensive ex-periments have demonstrated that our proposed tracking ar-chitecture is effective and efficient, achieving state-of-the-art performance while maintaining high running speeds.
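A sketch of what a lightweight prompter with attention in two dimensions (channel and spatial) could look like; layer shapes are illustrative assumptions rather than the released MPLT code:

```python
import torch
import torch.nn as nn

class LightweightPrompter(nn.Module):
    """Transfer information from a source modality into a destination modality
    using channel attention followed by spatial attention."""
    def __init__(self, c, r=8):
        super().__init__()
        self.channel = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                     nn.Conv2d(c, c // r, 1), nn.ReLU(),
                                     nn.Conv2d(c // r, c, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(c, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, src, dst):
        prompt = src * self.channel(src)          # channel-wise reweighting
        prompt = prompt * self.spatial(prompt)    # spatial reweighting
        return dst + prompt                       # inject prompt into the other modality
```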

Separate and Locate: Rethink the Text in Text-based Visual Question Answering

  • paper_url: http://arxiv.org/abs/2308.16383
  • repo_url: None
  • paper_authors: Chengyang Fang, Jiangnan Li, Liang Li, Can Ma, Dayong Hu
  • for: Text-based visual question answering (TextVQA), i.e. answering questions about the text in images
  • methods: The proposed Separate and Locate (SaL) method explores text contextual cues and designs spatial position embeddings to model spatial relations between OCR texts: a Text Semantic Separate (TSS) module recognizes whether words have semantic contextual relations, and a Spatial Circle Position (SCP) module helps construct and reason about spatial position relationships
  • results: SaL outperforms the baseline by 4.44% and 3.96% accuracy on TextVQA and ST-VQA, respectively, and improves on the pre-training state of the art by 2.68% and 2.52% without any pre-training tasks
    Abstract Text-based Visual Question Answering (TextVQA) aims at answering questions about the text in images. Most works in this field focus on designing network structures or pre-training tasks. All these methods list the OCR texts in reading order (from left to right and top to bottom) to form a sequence, which is treated as a natural language ``sentence''. However, they ignore the fact that most OCR words in the TextVQA task do not have a semantical contextual relationship. In addition, these approaches use 1-D position embedding to construct the spatial relation between OCR tokens sequentially, which is not reasonable. The 1-D position embedding can only represent the left-right sequence relationship between words in a sentence, but not the complex spatial position relationship. To tackle these problems, we propose a novel method named Separate and Locate (SaL) that explores text contextual cues and designs spatial position embedding to construct spatial relations between OCR texts. Specifically, we propose a Text Semantic Separate (TSS) module that helps the model recognize whether words have semantic contextual relations. Then, we introduce a Spatial Circle Position (SCP) module that helps the model better construct and reason the spatial position relationships between OCR texts. Our SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on TextVQA and ST-VQA datasets. Compared with the pre-training state-of-the-art method pre-trained on 64 million pre-training samples, our method, without any pre-training tasks, still achieves 2.68% and 2.52% accuracy improvement on TextVQA and ST-VQA. Our code and models will be released at https://github.com/fangbufang/SaL.
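As a loose illustration of why 2-D spatial cues carry more than a 1-D reading order, pairwise distances and angles between OCR token centers can be computed as below; the actual SCP module encodes these relations differently:

```python
import numpy as np

def ocr_spatial_relations(centers):
    """Pairwise distances and angles between OCR token centers (N, 2)."""
    diff = centers[:, None, :] - centers[None, :, :]    # (N, N, 2)
    dist = np.linalg.norm(diff, axis=-1)
    angle = np.arctan2(diff[..., 1], diff[..., 0])
    return dist, angle
```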

3D vision-based structural masonry damage detection

  • paper_url: http://arxiv.org/abs/2308.16380
  • repo_url: None
  • paper_authors: Elmira Faraji Zonouz, Xiao Pan, Yu-Cheng Hsu, Tony Yang
  • for: Monitoring damage in masonry structures is essential to avoid potentially disastrous outcomes, but manual inspection is time-consuming and can be hazardous to inspectors
  • methods: Automates inspection with a 3D vision-based pipeline: images of masonry specimens are collected to generate a 3D point cloud, and point cloud processing methods are developed to evaluate the damage, going beyond 2D methods limited to qualitative classification, 2D localization, and in-plane quantification
  • results: Experiments on structural masonry components show the method effectively classifies damage states and localizes and quantifies critical damage features, improving the level of autonomy during masonry inspection
    Abstract The detection of masonry damage is essential for preventing potentially disastrous outcomes. Manual inspection can, however, take a long time and be hazardous to human inspectors. Automation of the inspection process using novel computer vision and machine learning algorithms can be a more efficient and safe solution to prevent further deterioration of the masonry structures. Most existing 2D vision-based methods are limited to qualitative damage classification, 2D localization, and in-plane quantification. In this study, we present a 3D vision-based methodology for accurate masonry damage detection, which offers a more robust solution with a greater field of view, depth of vision, and the ability to detect failures in complex environments. First, images of the masonry specimens are collected to generate a 3D point cloud. Second, 3D point clouds processing methods are developed to evaluate the masonry damage. We demonstrate the effectiveness of our approach through experiments on structural masonry components. Our experiments showed the proposed system can effectively classify damage states and localize and quantify critical damage features. The result showed the proposed method can improve the level of autonomy during the inspection of masonry structures.

Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites: A Federated Learning Approach with Noise-Resilient Training

  • paper_url: http://arxiv.org/abs/2308.16376
  • repo_url: None
  • paper_authors: Lei Bai, Dongang Wang, Michael Barnett, Mariano Cabezas, Weidong Cai, Fernando Calamante, Kain Kyle, Dongnan Liu, Linda Ly, Aria Nguyen, Chun-Chien Shieh, Ryan Sullivan, Hengrui Wang, Geng Zhan, Wanli Ouyang, Chenyu Wang
  • for: Accurately measuring the evolution of Multiple Sclerosis (MS) lesions with MRI across multiple clinical sites, to inform understanding of disease progression and guide therapeutic strategy
  • methods: A federated learning framework that accounts for label noise: a Decoupled Hard Label Correction (DHLC) strategy considers the imbalanced distribution and fuzzy boundaries of MS lesions to correct false annotations based on prediction confidence, and a Centrally Enhanced Label Correction (CELC) strategy uses the aggregated central model as a correction teacher for all sites
  • results: Extensive experiments on two multi-site datasets demonstrate the effectiveness and robustness of the proposed methods, indicating their potential for multi-site clinical collaborations
    Abstract Accurately measuring the evolution of Multiple Sclerosis (MS) with magnetic resonance imaging (MRI) critically informs understanding of disease progression and helps to direct therapeutic strategy. Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area. Obtaining sufficient data from a single clinical site is challenging and does not address the heterogeneous need for model robustness. Conversely, the collection of data from multiple sites introduces data privacy concerns and potential label noise due to varying annotation standards. To address this dilemma, we explore the use of the federated learning framework while considering label noise. Our approach enables collaboration among multiple clinical sites without compromising data privacy under a federated learning paradigm that incorporates a noise-robust training strategy based on label correction. Specifically, we introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions, enabling the correction of false annotations based on prediction confidence. We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites, enhancing the reliability of the correction process. Extensive experiments conducted on two multi-site datasets demonstrate the effectiveness and robustness of our proposed methods, indicating their potential for clinical applications in multi-site collaborations.
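A minimal sketch of confidence-based hard label correction on voxel maps; the thresholds are illustrative, and the actual DHLC additionally accounts for class imbalance and fuzzy lesion boundaries while CELC uses the aggregated central model as the teacher:

```python
import numpy as np

def hard_label_correction(prob, label, t_pos=0.9, t_neg=0.1):
    """Flip voxel annotations that the current model contradicts with high confidence.

    prob: predicted lesion probabilities; label: noisy binary annotations.
    """
    corrected = label.copy()
    corrected[(prob > t_pos) & (label == 0)] = 1     # recover missed lesion voxels
    corrected[(prob < t_neg) & (label == 1)] = 0     # drop likely false annotations
    return corrected
```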