cs.CV - 2023-11-16

CV-Attention UNet: Attention-based UNet for 3D Cerebrovascular Segmentation of Enhanced TOF-MRA Images

  • paper_url: http://arxiv.org/abs/2311.10224
  • repo_url: None
  • paper_authors: Syed Farhan Abbas, Nguyen Thanh Duc, Yoonguu Song, Kyungwon Kim, Boreom Lee
  • for: The goal of this study is to segment cerebrovascular images accurately in order to aid the diagnosis of cerebrovascular disease.
  • methods: The study proposes a 3D cerebrovascular attention UNet, named CV-AttentionUNet, for precise extraction of brain vessel images. The method combines a sequence of preprocessing techniques with a deeply supervised UNet to improve segmentation accuracy, and applies an attention mechanism that focuses on relevant associations while ignoring irrelevant anatomical information.
  • results: Experiments show that CV-AttentionUNet segments cerebrovascular structures with high accuracy and performs better than existing state-of-the-art methods on the TubeTK dataset.
    Abstract Due to the lack of automated methods, to diagnose cerebrovascular disease, time-of-flight magnetic resonance angiography (TOF-MRA) is assessed visually, making it time-consuming. The commonly used encoder-decoder architectures for cerebrovascular segmentation utilize redundant features, eventually leading to the extraction of low-level features multiple times. Additionally, convolutional neural networks (CNNs) suffer from performance degradation when the batch size is small, and deeper networks experience the vanishing gradient problem. Methods: In this paper, we attempt to solve these limitations and propose the 3D cerebrovascular attention UNet method, named CV-AttentionUNet, for precise extraction of brain vessel images. We proposed a sequence of preprocessing techniques followed by deeply supervised UNet to improve the accuracy of segmentation of the brain vessels leading to a stroke. To combine the low and high semantics, we applied the attention mechanism. This mechanism focuses on relevant associations and neglects irrelevant anatomical information. Furthermore, the inclusion of deep supervision incorporates different levels of features that prove to be beneficial for network convergence. Results: We demonstrate the efficiency of the proposed method by cross-validating with an unlabeled dataset, which was further labeled by us. We believe that the novelty of this algorithm lies in its ability to perform well on both labeled and unlabeled data with image processing-based enhancement. The results indicate that our method performed better than the existing state-of-the-art methods on the TubeTK dataset. Conclusion: The proposed method will help in accurate segmentation of cerebrovascular structure leading to stroke
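To make the attention idea above concrete, here is a minimal PyTorch sketch of an additive attention gate on a 3D UNet skip connection, the general mechanism the abstract describes; it is not the authors' CV-AttentionUNet code, and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate3D(nn.Module):
    """Additive attention gate: the coarse decoder feature gates the skip."""
    def __init__(self, skip_channels, gating_channels, inter_channels):
        super().__init__()
        self.theta = nn.Conv3d(skip_channels, inter_channels, kernel_size=1)
        self.phi = nn.Conv3d(gating_channels, inter_channels, kernel_size=1)
        self.psi = nn.Conv3d(inter_channels, 1, kernel_size=1)

    def forward(self, skip, gate):
        # Upsample the gating signal to the spatial size of the skip features.
        g = F.interpolate(self.phi(gate), size=skip.shape[2:],
                          mode="trilinear", align_corners=False)
        attn = torch.sigmoid(self.psi(F.relu(self.theta(skip) + g)))
        # Suppress irrelevant anatomy, keep vessel-related activations.
        return skip * attn

if __name__ == "__main__":
    skip = torch.randn(1, 32, 16, 32, 32)   # encoder feature map
    gate = torch.randn(1, 64, 8, 16, 16)    # coarser decoder feature map
    out = AttentionGate3D(32, 64, 16)(skip, gate)
    print(out.shape)                         # torch.Size([1, 32, 16, 32, 32])
```

In a deeply supervised UNet, gates of this kind would sit on each skip connection, with auxiliary losses attached to intermediate decoder outputs.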

Stella Nera: Achieving 161 TOp/s/W with Multiplier-free DNN Acceleration based on Approximate Matrix Multiplication

  • paper_url: http://arxiv.org/abs/2311.10207
  • repo_url: None
  • paper_authors: Jannis Schönleber, Lukas Cavigelli, Renzo Andri, Matteo Perotti, Luca Benini
  • for: This paper addresses accurate, energy-efficient MatMul computation and proposes an alternative to conventional MatMul accelerators.
  • methods: The work builds on the "Maddness" method, which replaces explicit multiplications with hash functions and look-up tables (LUTs) to approximate MatMul.
  • results: Implemented in a commercial 14nm technology and scaled to 3nm, the accelerator reaches an energy efficiency of 161 TOp/s/W at 0.55V while maintaining a CIFAR-10 Top-1 accuracy above 92.5% with ResNet9.
    Abstract From classical HPC to deep learning, MatMul is at the heart of today's computing. The recent Maddness method approximates MatMul without the need for multiplication by using a hash-based version of product quantization (PQ) indexing into a look-up table (LUT). Stella Nera is the first Maddness accelerator and it achieves 15x higher area efficiency (GMAC/s/mm^2) and more than 25x higher energy efficiency (TMAC/s/W) than direct MatMul accelerators implemented in the same technology. The hash function is a decision tree, which allows for an efficient hardware implementation as the multiply-accumulate operations are replaced by decision tree passes and LUT lookups. The entire Maddness MatMul can be broken down into parts that allow an effective implementation with small computing units and memories, allowing it to reach extreme efficiency while remaining generically applicable for MatMul tasks. In a commercial 14nm technology and scaled to 3nm, we achieve an energy efficiency of 161 TOp/s/W@0.55V with a Top-1 accuracy on CIFAR-10 of more than 92.5% using ResNet9.
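As a rough illustration of the LUT-based approximate MatMul that Maddness-style accelerators build on, the NumPy toy below encodes each row of A per subspace by a prototype index and replaces multiplications with table look-ups and additions. The real method hashes with learned decision trees; the nearest-prototype encoder here is a simplification, and all sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M, C, K = 256, 64, 32, 8, 16          # C subspaces, K prototypes each
A, B = rng.normal(size=(N, D)), rng.normal(size=(D, M))
sub = D // C

# "Train" prototypes per subspace by sampling rows of A (stand-in for k-means).
protos = np.stack([A[rng.choice(N, K, replace=False), c * sub:(c + 1) * sub]
                   for c in range(C)])                              # (C, K, sub)

# Encode: index of the nearest prototype in each subspace (the hashing step).
codes = np.stack([np.argmin(((A[:, c * sub:(c + 1) * sub, None] -
                              protos[c].T[None]) ** 2).sum(1), axis=1)
                  for c in range(C)], axis=1)                       # (N, C)

# Precompute LUTs: prototype @ B for every subspace and prototype.
luts = np.stack([protos[c] @ B[c * sub:(c + 1) * sub] for c in range(C)])

# "Multiplier-free" MatMul: gather LUT rows and accumulate.
approx = sum(luts[c][codes[:, c]] for c in range(C))
exact = A @ B
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```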

K-space Cold Diffusion: Learning to Reconstruct Accelerated MRI without Noise

  • paper_url: http://arxiv.org/abs/2311.10162
  • repo_url: None
  • paper_authors: Guoyao Shen, Mengyu Li, Chad W. Farris, Stephan Anderson, Xin Zhang
  • for: This paper proposes a cold-diffusion-based MRI reconstruction model for accelerated MRI imaging.
  • methods: The model builds on cold diffusion, a generalized diffusion framework around arbitrary image transformations such as blurring and down-sampling, and performs degradation and restoration directly in k-space without Gaussian noise.
  • results: Tests on a large open-source MRI dataset show that the model produces high-quality reconstructions and outperforms other deep learning-based MRI reconstruction models.
    Abstract Deep learning-based MRI reconstruction models have achieved superior performance these days. Most recently, diffusion models have shown remarkable performance in image generation, in-painting, super-resolution, image editing and more. As a generalized diffusion model, cold diffusion further broadens the scope and considers models built around arbitrary image transformations such as blurring, down-sampling, etc. In this paper, we propose a k-space cold diffusion model that performs image degradation and restoration in k-space without the need for Gaussian noise. We provide comparisons with multiple deep learning-based MRI reconstruction models and perform tests on a well-known large open-source MRI dataset. Our results show that this novel way of performing degradation can generate high-quality reconstruction images for accelerated MRI.
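A minimal sketch of the kind of noise-free, k-space degradation such a cold diffusion model can be built around: the forward process progressively drops phase-encoding lines instead of adding Gaussian noise. The mask schedule and test image are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def kspace_degrade(image, keep_fraction):
    """Keep only the central `keep_fraction` of phase-encoding lines."""
    k = np.fft.fftshift(np.fft.fft2(image))
    h = image.shape[0]
    keep = max(1, int(h * keep_fraction))
    mask = np.zeros_like(k, dtype=bool)
    mask[h // 2 - keep // 2: h // 2 + (keep + 1) // 2, :] = True
    return np.abs(np.fft.ifft2(np.fft.ifftshift(k * mask)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.random((128, 128))                  # stand-in for an MR slice
    # Degradation severity grows along the diffusion schedule.
    for t, frac in enumerate([1.0, 0.5, 0.25, 0.125]):
        xt = kspace_degrade(x0, frac)
        print(f"step {t}: kept {frac:.3f} of k-space lines, "
              f"MSE to x0 = {np.mean((xt - x0) ** 2):.4f}")
```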

The Chosen One: Consistent Characters in Text-to-Image Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.10093
  • repo_url: https://github.com/johndpope/TheChosenOne
  • paper_authors: Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, Dani Lischinski
  • for: The paper presents a fully automated method for consistent character generation, aimed at real-world applications such as story visualization, game development asset design, and advertising.
  • methods: The proposed method is an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity among those generated for the text prompt and extracts a more consistent identity from this set.
  • results: Quantitative analysis shows a better balance between prompt alignment and identity consistency than the baseline methods, which is confirmed by a user study; the feasibility of the approach is also demonstrated on several practical applications.
    Abstract Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, these models struggle with generation of consistent characters, a crucial aspect for numerous real-world applications such as story visualization, game development asset design, advertising, and more. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods, and these findings are reinforced by a user study. To conclude, we showcase several practical applications of our approach. Project page is available at https://omriavrahami.com/the-chosen-one

Traffic Video Object Detection using Motion Prior

  • paper_url: http://arxiv.org/abs/2311.10092
  • repo_url: None
  • paper_authors: Lihao Liu, Yanqi Cheng, Dongdong Chen, Jing He, Pietro Liò, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero
  • for: Improving object detection accuracy in traffic videos.
  • methods: A new self-attention module and a pseudo-labelling mechanism that exploit the motion prior to strengthen temporal information integration and to eliminate noisy pseudo labels.
  • results: Outperforms existing state-of-the-art methods by a margin of 2% in mAP.
    Abstract Traffic videos inherently differ from generic videos in their stationary camera setup, thus providing a strong motion prior where objects often move in a specific direction over a short time interval. Existing works predominantly employ generic video object detection framework for traffic video object detection, which yield certain advantages such as broad applicability and robustness to diverse scenarios. However, they fail to harness the strength of motion prior to enhance detection accuracy. In this work, we propose two innovative methods to exploit the motion prior and boost the performance of both fully-supervised and semi-supervised traffic video object detection. Firstly, we introduce a new self-attention module that leverages the motion prior to guide temporal information integration in the fully-supervised setting. Secondly, we utilise the motion prior to develop a pseudo-labelling mechanism to eliminate noisy pseudo labels for the semi-supervised setting. Both of our motion-prior-centred methods consistently demonstrates superior performance, outperforming existing state-of-the-art approaches by a margin of 2% in terms of mAP.

Adaptive Shells for Efficient Neural Radiance Field Rendering

  • paper_url: http://arxiv.org/abs/2311.10091
  • repo_url: None
  • paper_authors: Zian Wang, Tianchang Shen, Merlin Nimier-David, Nicholas Sharp, Jun Gao, Alexander Keller, Sanja Fidler, Thomas Müller, Zan Gojcic
  • for: The goal of this paper is to improve the rendering speed and visual quality of neural radiance fields by using volumetric and surface-based rendering where each is appropriate within a scene.
  • methods: The method generalizes the NeuS formulation with a learned, spatially-varying kernel size, fitting a wide kernel to volume-like regions and a tight kernel to surface-like regions, and bounds the neural volumetric representation with an explicit mesh envelope.
  • results: Experiments show that the method greatly accelerates rendering while maintaining or improving visual quality, and adapts across diverse scenes; the extracted envelope mesh also supports downstream applications such as animation and simulation.
    Abstract Neural radiance fields achieve unprecedented quality for novel view synthesis, but their volumetric formulation remains expensive, requiring a huge number of samples to render high-resolution images. Volumetric encodings are essential to represent fuzzy geometry such as foliage and hair, and they are well-suited for stochastic optimization. Yet, many scenes ultimately consist largely of solid surfaces which can be accurately rendered by a single sample per pixel. Based on this insight, we propose a neural radiance formulation that smoothly transitions between volumetric- and surface-based rendering, greatly accelerating rendering speed and even improving visual fidelity. Our method constructs an explicit mesh envelope which spatially bounds a neural volumetric representation. In solid regions, the envelope nearly converges to a surface and can often be rendered with a single sample. To this end, we generalize the NeuS formulation with a learned spatially-varying kernel size which encodes the spread of the density, fitting a wide kernel to volume-like regions and a tight kernel to surface-like regions. We then extract an explicit mesh of a narrow band around the surface, with width determined by the kernel size, and fine-tune the radiance field within this band. At inference time, we cast rays against the mesh and evaluate the radiance field only within the enclosed region, greatly reducing the number of samples required. Experiments show that our approach enables efficient rendering at very high fidelity. We also demonstrate that the extracted envelope enables downstream applications such as animation and simulation.
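The numeric sketch below illustrates the effect of a spatially-varying kernel size in a NeuS-style opacity model: a tight kernel concentrates the rendering weight on a single sample at the surface crossing, while a wide kernel spreads it out over many samples. It is a toy with assumed values, not the paper's implementation.

```python
import numpy as np

def logistic_cdf(x, s):
    return 1.0 / (1.0 + np.exp(-s * x))

def neus_alpha(sdf_vals, kernel_sizes):
    """NeuS-style per-sample alpha from consecutive SDF values along a ray."""
    prev, nxt, s = sdf_vals[:-1], sdf_vals[1:], kernel_sizes[:-1]
    return np.clip((logistic_cdf(prev, s) - logistic_cdf(nxt, s))
                   / np.maximum(logistic_cdf(prev, s), 1e-6), 0.0, 1.0)

def render_weights(alpha):
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    return alpha * trans

def effective_samples(weights, frac=0.95):
    order = np.sort(weights)[::-1]
    return int(np.searchsorted(np.cumsum(order), frac * weights.sum()) + 1)

t = np.linspace(0.0, 2.0, 64)
sdf = 1.0 - t                                    # ray hits a flat surface at t = 1
tight = render_weights(neus_alpha(sdf, np.full_like(t, 200.0)))  # surface-like
wide = render_weights(neus_alpha(sdf, np.full_like(t, 5.0)))     # volume-like
print("samples carrying 95% of the weight:",
      effective_samples(tight), "(tight kernel) vs",
      effective_samples(wide), "(wide kernel)")
```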

Visual Environment Assessment for Safe Autonomous Quadrotor Landing

  • paper_url: http://arxiv.org/abs/2311.10065
  • repo_url: None
  • paper_authors: Mattia Secchiero, Nishanth Bobbili, Yang Zhou, Giuseppe Loianno
  • for: This work provides a method for automatically detecting and assessing safe landing zones, ensuring that an aerial robot can land safely in the event of system failures, low battery, or after completing a specific task.
  • methods: The method uses a neural network to extract environmental features and combines them with geometric data from a disparity map to obtain attributes such as slope, flatness, and roughness; several cost metrics defined on these attributes evaluate the safety, stability, and suitability of regions and identify the most suitable landing area.
  • results: Experimental results show that the method effectively assesses landing areas in diverse environments and helps the quadrotor land safely.
    Abstract Autonomous identification and evaluation of safe landing zones are of paramount importance for ensuring the safety and effectiveness of aerial robots in the event of system failures, low battery, or the successful completion of specific tasks. In this paper, we present a novel approach for detection and assessment of potential landing sites for safe quadrotor landing. Our solution efficiently integrates 2D and 3D environmental information, eliminating the need for external aids such as GPS and computationally intensive elevation maps. The proposed pipeline combines semantic data derived from a Neural Network (NN), to extract environmental features, with geometric data obtained from a disparity map, to extract critical geometric attributes such as slope, flatness, and roughness. We define several cost metrics based on these attributes to evaluate safety, stability, and suitability of regions in the environments and identify the most suitable landing area. Our approach runs in real-time on quadrotors equipped with limited computational capabilities. Experimental results conducted in diverse environments demonstrate that the proposed method can effectively assess and identify suitable landing areas, enabling the safe and autonomous landing of a quadrotor.
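To ground the cost-metric idea, here is an illustrative computation of slope, roughness, and relief for a small local elevation patch, combined into a single landing cost; the metric definitions and weights are assumptions for illustration, not the paper's.

```python
import numpy as np

def landing_cost(patch, w_slope=1.0, w_rough=1.0, w_relief=1.0):
    """patch: (H, W) local elevation map in meters."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    coef, *_ = np.linalg.lstsq(A, patch.ravel(), rcond=None)   # fit a plane
    slope = np.degrees(np.arctan(np.hypot(coef[0], coef[1])))  # plane tilt
    roughness = (patch.ravel() - A @ coef).std()               # deviation from plane
    relief = patch.max() - patch.min()                         # total height range
    return w_slope * slope + w_rough * roughness + w_relief * relief

rng = np.random.default_rng(0)
flat_ground = 0.01 * rng.normal(size=(16, 16))
rocky_slope = (np.linspace(0.0, 2.0, 16)[None, :] * np.ones((16, 1))
               + 0.2 * rng.normal(size=(16, 16)))
print("flat ground cost:", round(landing_cost(flat_ground), 3))
print("rocky slope cost:", round(landing_cost(rocky_slope), 3))
```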

Analyzing Deviations of Dyadic Lines in Fast Hough Transform

  • paper_url: http://arxiv.org/abs/2311.10064
  • repo_url: None
  • paper_authors: Gleb Smirnov, Simon Karpenko
  • for: This paper studies the accuracy of the dyadic line model used in the Fast Hough Transform for pattern recognition.
  • methods: The paper performs a statistical analysis of the deviation of a dyadic line from its ideal counterpart.
  • results: The mean deviation is zero and the variance grows as O(log(n)); as n increases, the distribution of the (suitably normalized) deviations converges to a normal distribution with zero mean and small variance, a limiting result that makes essential use of ergodic theory.
    Abstract Fast Hough transform is a widely used algorithm in pattern recognition. The algorithm relies on approximating lines using a specific discrete line model called dyadic lines. The worst-case deviation of a dyadic line from the ideal line it used to construct grows as $O(log(n))$, where $n$ is the linear size of the image. But few lines actually reach the worst-case bound. The present paper addresses a statistical analysis of the deviation of a dyadic line from its ideal counterpart. Specifically, our findings show that the mean deviation is zero, and the variance grows as $O(log(n))$. As $n$ increases, the distribution of these (suitably normalized) deviations converges towards a normal distribution with zero mean and a small variance. This limiting result makes an essential use of ergodic theory.
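The Monte-Carlo sketch below builds standard FHT dyadic patterns, measures their deviation from the straight line through their endpoints, and checks that the empirical mean stays near zero while the variance grows roughly like log(n). It illustrates the statistic being analyzed, not the paper's exact construction.

```python
import numpy as np

def dyadic_pattern(k, t):
    """Vertical offsets of the dyadic line with shift t on n = 2**k columns."""
    if k == 0:
        return np.array([0])
    half = dyadic_pattern(k - 1, t // 2)
    return np.concatenate([half, half + (t + 1) // 2])

rng = np.random.default_rng(0)
for k in range(4, 13, 2):
    n = 2 ** k
    devs = []
    for t in rng.integers(0, n, size=200):        # random shifts
        pattern = dyadic_pattern(k, int(t))
        ideal = np.arange(n) * t / (n - 1)        # straight line through endpoints
        devs.append(pattern - ideal)
    devs = np.concatenate(devs)
    print(f"n={n:5d}  mean={devs.mean():+.3f}  var={devs.var():.3f}  "
          f"var/log(n)={devs.var() / np.log(n):.3f}")
```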

Depth Insight – Contribution of Different Features to Indoor Single-image Depth Estimation

  • paper_url: http://arxiv.org/abs/2311.10042
  • repo_url: None
  • paper_authors: Yihong Wu, Yuwen Heng, Mahesan Niranjan, Hansung Kim
  • for: This paper studies single-image depth estimation, seeking a better understanding of how depth estimation models exploit the various cues in an image to predict depth.
  • methods: The paper uses feature extraction techniques to relate individual cues (shape, texture, colour, and saturation), taken in isolation, to predicted depth.
  • results: In the indoor scenes considered, object shape extracted by edge detection contributes substantially more than the other cues, which contribute to varying degrees; these insights can help optimize depth estimation models, improving their accuracy and robustness.
    Abstract Depth estimation from a single image is a challenging problem in computer vision because binocular disparity or motion information is absent. Whereas impressive performances have been reported in this area recently using end-to-end trained deep neural architectures, as to what cues in the images that are being exploited by these black box systems is hard to know. To this end, in this work, we quantify the relative contributions of the known cues of depth in a monocular depth estimation setting using an indoor scene data set. Our work uses feature extraction techniques to relate the single features of shape, texture, colour and saturation, taken in isolation, to predict depth. We find that the shape of objects extracted by edge detection substantially contributes more than others in the indoor setting considered, while the other features also have contributions in varying degrees. These insights will help optimise depth estimation models, boosting their accuracy and robustness. They promise to broaden the practical applications of vision-based depth estimation. The project code is attached to the supplementary material and will be published on GitHub.

Match and Locate: low-frequency monocular odometry based on deep feature matching

  • paper_url: http://arxiv.org/abs/2311.10034
  • repo_url: None
  • paper_authors: Stepan Konev, Yuriy Biktairov
  • for: This paper proposes a robotic odometry method that requires only a single camera, making the system more affordable and simpler.
  • methods: The method matches image features between consecutive frames with deep feature matching models, then refines the coarse estimate with a convolutional neural network that also estimates the scale of the transition.
  • results: Evaluated in the AISG-SLA Visual Localisation Challenge, the method achieves roughly 3 degrees of orientation error and 2 m of translation error, placing third while remaining computationally efficient and easy to implement.
    Abstract Accurate and robust pose estimation plays a crucial role in many robotic systems. Popular algorithms for pose estimation typically rely on high-fidelity and high-frequency signals from various sensors. Inclusion of these sensors makes the system less affordable and much more complicated. In this work we introduce a novel approach for the robotic odometry which only requires a single camera and, importantly, can produce reliable estimates given even extremely low-frequency signal of around one frame per second. The approach is based on matching image features between the consecutive frames of the video stream using deep feature matching models. The resulting coarse estimate is then adjusted by a convolutional neural network, which is also responsible for estimating the scale of the transition, otherwise irretrievable using only the feature matching information. We evaluate the performance of the approach in the AISG-SLA Visual Localisation Challenge and find that while being computationally efficient and easy to implement our method shows competitive results with only around $3^{\circ}$ of orientation estimation error and $2m$ of translation estimation error taking the third place in the challenge.
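For context, the sketch below shows the classical geometric back-end such a pipeline can rest on: given 2D-2D correspondences between two frames (synthetic here; the paper obtains them with deep feature matching), OpenCV's essential-matrix routines recover the relative rotation and a translation direction. The translation scale is not observable from matches alone, which is why the paper adds a network to estimate it; the intrinsics and poses below are assumptions.

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])

# Synthetic 3D points observed from two camera poses.
pts3d = rng.uniform([-2, -2, 4], [2, 2, 8], size=(200, 3))
R_true, _ = cv2.Rodrigues(np.array([0.0, 0.05, 0.0]))   # small yaw between frames
t_true = np.array([0.3, 0.0, 0.1])

def project(points, R, t):
    cam = points @ R.T + t
    uv = cam[:, :2] / cam[:, 2:3]
    return uv @ K[:2, :2].T + K[:2, 2]

pts1 = project(pts3d, np.eye(3), np.zeros(3))
pts2 = project(pts3d, R_true, t_true)

E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R_est, t_est, _ = cv2.recoverPose(E, pts1, pts2, K)
cos_angle = np.clip((np.trace(R_est @ R_true.T) - 1.0) / 2.0, -1.0, 1.0)
print("rotation error (deg):", round(float(np.degrees(np.arccos(cos_angle))), 3))
print("estimated translation direction:", t_est.ravel().round(3))  # scale unknown
```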

On the Overconfidence Problem in Semantic 3D Mapping

  • paper_url: http://arxiv.org/abs/2311.10018
  • repo_url: None
  • paper_authors: Joao Marcos Correia Marques, Albert Zhai, Shenlong Wang, Kris Hauser
  • for: This paper addresses the fusion overconfidence problem in semantic 3D mapping: when depth and image segmentation information from multiple views are fused, conventional mapping methods assign high confidence to the entire map even when it is incorrect, leading to miscalibrated outputs.
  • methods: Several methods to improve uncertainty calibration at different stages of the fusion pipeline are presented and compared on the ScanNet dataset; the widely used Bayesian fusion strategy is shown to be among the worst calibrated, and a learned pipeline, GLFS, is proposed that combines fusion and calibration, achieving higher accuracy and better 3D map calibration while retaining real-time capability.
  • results: Incorporating properly calibrated semantic fusion into a modular ObjectNav agent improves its success rate, demonstrating the importance of map calibration for downstream tasks.
    Abstract Semantic 3D mapping, the process of fusing depth and image segmentation information between multiple views to build 3D maps annotated with object classes in real-time, is a recent topic of interest. This paper highlights the fusion overconfidence problem, in which conventional mapping methods assign high confidence to the entire map even when they are incorrect, leading to miscalibrated outputs. Several methods to improve uncertainty calibration at different stages in the fusion pipeline are presented and compared on the ScanNet dataset. We show that the most widely used Bayesian fusion strategy is among the worst calibrated, and propose a learned pipeline that combines fusion and calibration, GLFS, which achieves simultaneously higher accuracy and 3D map calibration while retaining real-time capability. We further illustrate the importance of map calibration on a downstream task by showing that incorporating proper semantic fusion on a modular ObjectNav agent improves its success rates. Our code will be provided on Github for reproducibility upon acceptance.
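A tiny numeric illustration of the overconfidence problem: naive Bayesian (independence-assuming) fusion of per-frame class posteriors drives the fused confidence of a voxel toward 1, even though each individual observation is only mildly confident and sometimes wrong. All numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
classes = ["wall", "chair", "table"]
true_class = 1                                   # the voxel is actually a chair

def observe():
    """A noisy per-frame posterior: the right class is favored ~70% of the time."""
    winner = true_class if rng.random() < 0.7 else int(rng.integers(0, 3))
    p = np.full(3, 0.15)
    p[winner] = 0.70
    return p

log_post = np.log(np.full(3, 1.0 / 3.0))         # uniform prior
for n in range(1, 31):
    log_post += np.log(observe())                # Bayesian fusion step
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    if n in (1, 5, 10, 30):
        print(f"after {n:2d} views: max confidence = {post.max():.4f} "
              f"(predicted: {classes[int(post.argmax())]})")
```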

SQLNet: Scale-Modulated Query and Localization Network for Few-Shot Class-Agnostic Counting

  • paper_url: http://arxiv.org/abs/2311.10011
  • repo_url: https://github.com/hcplab-sysu/sqlnet
  • paper_authors: Hefeng Wu, Yandong Chen, Lingbo Liu, Tianshui Chen, Keze Wang, Liang Lin
  • for: Solving class-agnostic counting, i.e., counting all objects of an arbitrary class given a few exemplars, while also providing object locations useful for downstream tasks.
  • methods: A novel localization-based approach, the Scale-modulated Query and Localization Network (SQLNet), which fully exploits the scale information of the exemplars in both the query and localization stages and counts effectively by locating each object and predicting its approximate size.
  • results: On popular CAC benchmarks, SQLNet outperforms current leading methods, achieving excellent results not only in counting accuracy but also in localization and bounding box generation.
    Abstract The class-agnostic counting (CAC) task has recently been proposed to solve the problem of counting all objects of an arbitrary class with several exemplars given in the input image. To address this challenging task, existing leading methods all resort to density map regression, which renders them impractical for downstream tasks that require object locations and restricts their ability to well explore the scale information of exemplars for supervision. To address the limitations, we propose a novel localization-based CAC approach, termed Scale-modulated Query and Localization Network (SQLNet). It fully explores the scales of exemplars in both the query and localization stages and achieves effective counting by accurately locating each object and predicting its approximate size. Specifically, during the query stage, rich discriminative representations of the target class are acquired by the Hierarchical Exemplars Collaborative Enhancement (HECE) module from the few exemplars through multi-scale exemplar cooperation with equifrequent size prompt embedding. These representations are then fed into the Exemplars-Unified Query Correlation (EUQC) module to interact with the query features in a unified manner and produce the correlated query tensor. In the localization stage, the Scale-aware Multi-head Localization (SAML) module utilizes the query tensor to predict the confidence, location, and size of each potential object. Moreover, a scale-aware localization loss is introduced, which exploits flexible location associations and exemplar scales for supervision to optimize the model performance. Extensive experiments demonstrate that SQLNet outperforms state-of-the-art methods on popular CAC benchmarks, achieving excellent performance not only in counting accuracy but also in localization and bounding box generation. Our codes will be available at https://github.com/HCPLab-SYSU/SQLNet

TransFusion – A Transparency-Based Diffusion Model for Anomaly Detection

  • paper_url: http://arxiv.org/abs/2311.09999
  • repo_url: None
  • paper_authors: Matic Fučka, Vitjan Zavrtanik, Danijel Skočaj
  • for: This work aims to improve the accuracy of surface anomaly detection, particularly for industrial inspection.
  • methods: The work introduces a novel transparency-based diffusion process in which the transparency of anomalous regions is progressively increased, restoring their normal appearance accurately while preserving detail in anomaly-free regions.
  • results: The method achieves state-of-the-art performance on two standard anomaly detection datasets, VisA and MVTec AD, with image-level AUROC of 98.5% and 99.2%, respectively.
    Abstract Surface anomaly detection is a vital component in manufacturing inspection. Reconstructive anomaly detection methods restore the normal appearance of an object, ideally modifying only the anomalous regions. Due to the limitations of commonly used reconstruction architectures, the produced reconstructions are often poor and either still contain anomalies or lack details in anomaly-free regions. Recent reconstructive methods adopt diffusion models, however with the standard diffusion process the problems are not adequately addressed. We propose a novel transparency-based diffusion process, where the transparency of anomalous regions is progressively increased, restoring their normal appearance accurately and maintaining the appearance of anomaly-free regions without loss of detail. We propose TRANSparency DifFUSION (TransFusion), a discriminative anomaly detection method that implements the proposed diffusion process, enabling accurate downstream anomaly detection. TransFusion achieves state-of-the-art performance on both the VisA and the MVTec AD datasets, with an image-level AUROC of 98.5% and 99.2%, respectively.
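As a rough illustration of the transparency idea, the sketch below alpha-composites a synthetic anomaly onto a normal image with step-by-step decreasing opacity, so the sequence moves from a fully visible defect back to the clean appearance. It is a toy interpretation for intuition, not the paper's diffusion process.

```python
import numpy as np

def blend(normal, anomaly, mask, opacity):
    """Alpha-composite the anomaly onto the normal image inside the mask."""
    out = normal.copy()
    out[mask] = opacity * anomaly[mask] + (1.0 - opacity) * normal[mask]
    return out

rng = np.random.default_rng(0)
normal = np.full((64, 64), 0.5)                  # clean, anomaly-free appearance
anomaly = rng.random((64, 64))                   # stand-in defect texture
mask = np.zeros((64, 64), dtype=bool)
mask[20:36, 20:36] = True                        # where the defect sits

for step, opacity in enumerate([1.0, 0.66, 0.33, 0.0]):
    x_t = blend(normal, anomaly, mask, opacity)
    print(f"step {step}: opacity={opacity:.2f}, "
          f"MSE to normal image = {np.mean((x_t - normal) ** 2):.4f}")
```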

DeepEMD: A Transformer-based Fast Estimation of the Earth Mover’s Distance

  • paper_url: http://arxiv.org/abs/2311.09998
  • repo_url: https://github.com/atulkumarin/deepemd
  • paper_authors: Atul Kumar Sinha, Francois Fleuret
  • for: This paper proposes an attention-based model that accurately approximates the Earth Mover's Distance (EMD) between point clouds so that it can be used as a training loss for generative models.
  • methods: The model is trained to explicitly compute the matching between point clouds, cast as the estimation of an attention matrix that approximates the ground-truth matching matrix, which yields accurate estimates of the EMD and its gradient.
  • results: Experiments show accurate EMD and gradient estimates with a wall-clock speed-up of more than two orders of magnitude over exact Hungarian matching; the model also behaves well out-of-distribution and generalizes to point clouds several times larger at inference than during training.
    Abstract The Earth Mover's Distance (EMD) is the measure of choice between point clouds. However the computational cost to compute it makes it prohibitive as a training loss, and the standard approach is to use a surrogate such as the Chamfer distance. We propose an attention-based model to compute an accurate approximation of the EMD that can be used as a training loss for generative models. To get the necessary accurate estimation of the gradients we train our model to explicitly compute the matching between point clouds instead of EMD itself. We cast this new objective as the estimation of an attention matrix that approximates the ground truth matching matrix. Experiments show that this model provides an accurate estimate of the EMD and its gradient with a wall clock speed-up of more than two orders of magnitude with respect to the exact Hungarian matching algorithm and one order of magnitude with respect to the standard approximate Sinkhorn algorithm, allowing in particular to train a point cloud VAE with the EMD itself. Extensive evaluation show the remarkable behaviour of this model when operating out-of-distribution, a key requirement for a distance surrogate. Finally, the model generalizes very well to point clouds during inference several times larger than during training.
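For reference, these are the quantities the paper's network learns to approximate and the surrogate it is compared against: the exact EMD via an optimal one-to-one assignment (Hungarian algorithm) and the cheaper Chamfer distance, computed here on toy point clouds with SciPy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def emd(a, b):
    """Exact EMD between equal-size point sets via optimal assignment."""
    cost = cdist(a, b)                            # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)      # Hungarian matching
    return cost[rows, cols].mean()

def chamfer(a, b):
    """Common surrogate: average nearest-neighbour distance in both directions."""
    cost = cdist(a, b)
    return cost.min(axis=1).mean() + cost.min(axis=0).mean()

rng = np.random.default_rng(0)
a = rng.normal(size=(512, 3))
b = a + 0.05 * rng.normal(size=(512, 3))          # slightly perturbed copy
print("EMD     :", round(emd(a, b), 4))
print("Chamfer :", round(chamfer(a, b), 4))
```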

From Pretext to Purpose: Batch-Adaptive Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2311.09974
  • repo_url: None
  • paper_authors: Jiansong Zhang, Peizhong Liu
  • for: This paper examines two factors affecting self-supervised contrastive learning, batch size and pretext tasks, and proposes an adaptive batch-fusion technique from a data-processing standpoint to improve self-supervised feature learning.
  • methods: Dimensionality reduction and reconstruction of batch data allow formerly isolated samples to communicate within a batch through the embedding layer, adaptively amplifying the self-supervised feature encoding capability as training progresses.
  • results: A linear classification test on ImageNet-1k, based on the classic contrastive learning framework, shows state-of-the-art performance under equitable comparisons; on ImageNet-100, the top-1 accuracy improves by up to 1.25% over the original performance.
    Abstract In recent years, self-supervised contrastive learning has emerged as a distinguished paradigm in the artificial intelligence landscape. It facilitates unsupervised feature learning through contrastive delineations at the instance level. However, crafting an effective self-supervised paradigm remains a pivotal challenge within this field. This paper delves into two crucial factors impacting self-supervised contrastive learning-bach size and pretext tasks, and from a data processing standpoint, proposes an adaptive technique of batch fusion. The proposed method, via dimensionality reduction and reconstruction of batch data, enables formerly isolated individual data to partake in intra-batch communication through the Embedding Layer. Moreover, it adaptively amplifies the self-supervised feature encoding capability as the training progresses. We conducted a linear classification test of this method based on the classic contrastive learning framework on ImageNet-1k. The empirical findings illustrate that our approach achieves state-of-the-art performance under equitable comparisons. Benefiting from its "plug-and-play" characteristics, we further explored other contrastive learning methods. On the ImageNet-100, compared to the original performance, the top1 has seen a maximum increase of 1.25%. We suggest that the proposed method may contribute to the advancement of data-driven self-supervised learning research, bringing a fresh perspective to this community.

SurgPLAN: Surgical Phase Localization Network for Phase Recognition

  • paper_url: http://arxiv.org/abs/2311.09965
  • repo_url: None
  • paper_authors: Xingjian Luo, You Pang, Zhen Chen, Jinlin Wu, Zongmin Zhang, Zhen Lei, Hongbin Liu
  • for: Improving surgical phase recognition for surgery understanding in smart operating rooms, addressing two problems of automatic phase recognition: the inability to capture discriminative per-frame and motion features, and unstable predictions within each phase.
  • methods: A Surgical Phase Localization Network (SurgPLAN) that captures multi-scale spatial and temporal features and predicts phases from temporal region proposals, improving both the accuracy and the stability of recognition.
  • results: Compared with existing methods, SurgPLAN shows significant advantages in both accuracy and stability.
    Abstract Surgical phase recognition is crucial to providing surgery understanding in smart operating rooms. Despite great progress in automatic surgical phase recognition, most existing methods are still restricted by two problems. First, these methods cannot capture discriminative visual features for each frame and motion information with simple 2D networks. Second, the frame-by-frame recognition paradigm degrades the performance due to unstable predictions within each phase, termed as phase shaking. To address these two challenges, we propose a Surgical Phase LocAlization Network, named SurgPLAN, to facilitate a more accurate and stable surgical phase recognition with the principle of temporal detection. Specifically, we first devise a Pyramid SlowFast (PSF) architecture to serve as the visual backbone to capture multi-scale spatial and temporal features by two branches with different frame sampling rates. Moreover, we propose a Temporal Phase Localization (TPL) module to generate the phase prediction based on temporal region proposals, which ensures accurate and consistent predictions within each surgical phase. Extensive experiments confirm the significant advantages of our SurgPLAN over frame-by-frame approaches in terms of both accuracy and stability.

VertDetect: Fully End-to-End 3D Vertebral Instance Segmentation Model

  • paper_url: http://arxiv.org/abs/2311.09958
  • repo_url: None
  • paper_authors: Geoff Klein, Michael Hardisty, Cari Whyne, Anne L. Martel
  • for: This paper proposes a fully automated 3D vertebral detection and instance segmentation model to support treatment planning in spine surgery and radiation therapy.
  • methods: The model uses a shared CNN backbone that supplies feature maps containing both spinal and vertebral-level information to the detection and segmentation branches, and a graph convolutional network layer that exploits the known structure of the spine to improve vertebral labelling.
  • results: The model achieves state-of-the-art performance on the VerSe 2019 and VerSe 2020 public and hidden test sets, with Dice Similarity Coefficients (DSC) of 0.883 (95% CI, 0.843-0.906) and 0.882 (95% CI, 0.835-0.909) on VerSe 2019, and 0.868 (95% CI, 0.834-0.890) and 0.869 (95% CI, 0.832-0.891) on VerSe 2020.
    Abstract Vertebral detection and segmentation are critical steps for treatment planning in spine surgery and radiation therapy. Accurate identification and segmentation are complicated in imaging that does not include the full spine, in cases with variations in anatomy (T13 and/or L6 vertebrae), and in the presence of fracture or hardware. This paper proposes VertDetect, a fully automated end-to-end 3D vertebral instance segmentation Convolutional Neural Network (CNN) model to predict vertebral level labels and segmentations for all vertebrae present in a CT scan. The utilization of a shared CNN backbone provides the detection and segmentation branches of the network with feature maps containing both spinal and vertebral level information. A Graph Convolutional Network (GCN) layer is used to improve vertebral labelling by using the known structure of the spine. This model achieved a Dice Similarity Coefficient (DSC) of 0.883 (95% CI, 0.843-0.906) and 0.882 (95% CI, 0.835-0.909) in the VerSe 2019 and 0.868 (95\% CI, 0.834-0.890) and 0.869 (95\% CI, 0.832-0.891) in the VerSe 2020 public and hidden test sets, respectively. This model achieved state-of-the-art performance for an end-to-end architecture, whose design facilitates the extraction of features that can be subsequently used for downstream tasks.

Score-based generative models learn manifold-like structures with constrained mixing

  • paper_url: http://arxiv.org/abs/2311.09952
  • repo_url: None
  • paper_authors: Li Kevin Wenliang, Ben Moran
  • for: score-based generative models (SBMs) learn the data distribution supported on a low-dimensional manifold
  • methods: linear approximations and subspaces spanned by local feature vectors
  • results: the learned vector field mixes samples by a non-conservative field within the manifold, and the subspace spanned by the local features overlaps with an effective density function.
    Abstract How do score-based generative models (SBMs) learn the data distribution supported on a low-dimensional manifold? We investigate the score model of a trained SBM through its linear approximations and subspaces spanned by local feature vectors. During diffusion as the noise decreases, the local dimensionality increases and becomes more varied between different sample sequences. Importantly, we find that the learned vector field mixes samples by a non-conservative field within the manifold, although it denoises with normal projections as if there is an energy function in off-manifold directions. At each noise level, the subspace spanned by the local features overlap with an effective density function. These observations suggest that SBMs can flexibly mix samples with the learned score field while carefully maintaining a manifold-like structure of the data distribution.

Harnessing Transformers: A Leap Forward in Lung Cancer Image Detection

  • paper_url: http://arxiv.org/abs/2311.09942
  • repo_url: None
  • paper_authors: Amine Bechar, Youssef Elmir, Rafik Medjoudj, Yassine Himeur, Abbes Amira
  • for: This paper examines the role of Transfer Learning (TL) and transformers in cancer detection based on image analysis.
  • methods: The paper analyzes and compares several TL-based approaches, including transformers and Convolutional Neural Network (CNN) models trained on the ImageNet dataset.
  • results: Transformers achieve the best results, with an accuracy of 97.41% for colon cancer detection and 94.71% for histopathological lung cancer; future directions for image-based cancer detection are also discussed.
    Abstract This paper discusses the role of Transfer Learning (TL) and transformers in cancer detection based on image analysis. With the enormous evolution of cancer patients, the identification of cancer cells in a patient's body has emerged as a trend in the field of Artificial Intelligence (AI). This process involves analyzing medical images, such as Computed Tomography (CT) scans and Magnetic Resonance Imaging (MRIs), to identify abnormal growths that may help in cancer detection. Many techniques and methods have been realized to improve the quality and performance of cancer classification and detection, such as TL, which allows the transfer of knowledge from one task to another with the same task or domain. TL englobes many methods, particularly those used in image analysis, such as transformers and Convolutional Neural Network (CNN) models trained on the ImageNet dataset. This paper analyzes and criticizes each method of TL based on image analysis and compares the results of each method, showing that transformers have achieved the best results with an accuracy of 97.41% for colon cancer detection and 94.71% for Histopathological Lung cancer. Future directions for cancer detection based on image analysis are also discussed.

RED-DOT: Multimodal Fact-checking via Relevant Evidence Detection

  • paper_url: http://arxiv.org/abs/2311.09939
  • repo_url: https://github.com/stevejpapad/relevant-evidence-detection
  • paper_authors: Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos, Panagiotis C. Petrantonakis
  • for: This work targets automated multimodal fact-checking, gathering and analyzing external evidence to support or refute the veracity of image-text claims.
  • methods: A novel "Relevant Evidence Detection" (RED) module discerns whether each piece of evidence is relevant to support or refute the claim; the resulting RED-DOT model explores multiple architectural variants (e.g., single or dual-stage) and mechanisms such as "guided attention" to improve performance and interpretability.
  • results: RED-DOT improves over the state of the art on the VERITE benchmark by up to 28.5% and achieves competitive or improved performance on NewsCLIPings+ without requiring numerous evidence items or multiple backbone encoders; a qualitative analysis indicates that the "guided attention" module enhances interpretability.
    Abstract Online misinformation is often multimodal in nature, i.e., it is caused by misleading associations between texts and accompanying images. To support the fact-checking process, researchers have been recently developing automatic multimodal methods that gather and analyze external information, evidence, related to the image-text pairs under examination. However, prior works assumed all collected evidence to be relevant. In this study, we introduce a "Relevant Evidence Detection" (RED) module to discern whether each piece of evidence is relevant, to support or refute the claim. Specifically, we develop the "Relevant Evidence Detection Directed Transformer" (RED-DOT) and explore multiple architectural variants (e.g., single or dual-stage) and mechanisms (e.g., "guided attention"). Extensive ablation and comparative experiments demonstrate that RED-DOT achieves significant improvements over the state-of-the-art on the VERITE benchmark by up to 28.5%. Furthermore, our evidence re-ranking and element-wise modality fusion led to RED-DOT achieving competitive and even improved performance on NewsCLIPings+, without the need for numerous evidence or multiple backbone encoders. Finally, our qualitative analysis demonstrates that the proposed "guided attention" module has the potential to enhance the architecture's interpretability. We release our code at: https://github.com/stevejpapad/relevant-evidence-detection

Selection of Distinct Morphologies to Divide & Conquer Gigapixel Pathology Images

  • paper_url: http://arxiv.org/abs/2311.09902
  • repo_url: None
  • paper_authors: Abubakr Shafique, Saghir Alfasly, Areej Alsaafin, Peyman Nejat, Jibran A. Khan, H. R. Tizhoosh
  • for: This work proposes a method for selecting a small, representative subset of patches from whole slide images (WSIs) to facilitate WSI classification and matching in computational pathology.
  • methods: Following a "Divide & Conquer" approach, the Selection of Distinct Morphologies (SDM) method automatically curates a compact patch set, termed a montage, that encompasses all inherent morphological variations within a given WSI while minimizing the number of selected patches.
  • results: SDM shows remarkable efficacy across several public and private histopathology datasets and requires no empirical parameterization, since it inherently optimizes the selection process to capture the distinct morphological features of each WSI.
    Abstract Whole slide images (WSIs) are massive digital pathology files illustrating intricate tissue structures. Selecting a small, representative subset of patches from each WSI is essential yet challenging. Therefore, following the "Divide & Conquer" approach becomes essential to facilitate WSI analysis including the classification and the WSI matching in computational pathology. To this end, we propose a novel method termed "Selection of Distinct Morphologies" (SDM) to choose a subset of WSI patches. The aim is to encompass all inherent morphological variations within a given WSI while simultaneously minimizing the number of selected patches to represent these variations, ensuring a compact yet comprehensive set of patches. This systematically curated patch set forms what we term a "montage". We assess the representativeness of the SDM montage across various public and private histopathology datasets. This is conducted by using the leave-one-out WSI search and matching evaluation method, comparing it with the state-of-the-art Yottixel's mosaic. SDM demonstrates remarkable efficacy across all datasets during its evaluation. Furthermore, SDM eliminates the necessity for empirical parameterization, a crucial aspect of Yottixel's mosaic, by inherently optimizing the selection process to capture the distinct morphological features within the WSI.

I&S-ViT: An Inclusive & Stable Method for Pushing the Limit of Post-Training ViTs Quantization

  • paper_url: http://arxiv.org/abs/2311.10126
  • repo_url: None
  • paper_authors: Yunshan Zhong, Jiawei Hu, Mingbao Lin, Mengzhao Chen, Rongrong Ji
  • for: This paper addresses the computational cost of vision transformers (ViTs) in practical applications, in particular the dense compute required for training and inference.
  • methods: The paper uses post-training quantization (PTQ), which converts the weights of a trained transformer to a low-bit format for computational efficiency, and introduces a novel method, I&S-ViT, that regulates the PTQ of ViTs in an inclusive and stable fashion to improve performance.
  • results: I&S-ViT improves ViT performance across diverse vision tasks, particularly in low-bit scenarios; for example, it raises the performance of 3-bit ViT-B by 50.68%.
    Abstract Albeit the scalable performance of vision transformers (ViTs), the dense computational costs (training & inference) undermine their position in industrial applications. Post-training quantization (PTQ), tuning ViTs with a tiny dataset and running in a low-bit format, well addresses the cost issue but unluckily bears more performance drops in lower-bit cases. In this paper, we introduce I&S-ViT, a novel method that regulates the PTQ of ViTs in an inclusive and stable fashion. I&S-ViT first identifies two issues in the PTQ of ViTs: (1) Quantization inefficiency in the prevalent log2 quantizer for post-Softmax activations; (2) Rugged and magnified loss landscape in coarse-grained quantization granularity for post-LayerNorm activations. Then, I&S-ViT addresses these issues by introducing: (1) A novel shift-uniform-log2 quantizer (SULQ) that incorporates a shift mechanism followed by uniform quantization to achieve both an inclusive domain representation and accurate distribution approximation; (2) A three-stage smooth optimization strategy (SOS) that amalgamates the strengths of channel-wise and layer-wise quantization to enable stable learning. Comprehensive evaluations across diverse vision tasks validate I&S-ViT' superiority over existing PTQ of ViTs methods, particularly in low-bit scenarios. For instance, I&S-ViT elevates the performance of 3-bit ViT-B by an impressive 50.68%.
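The sketch below contrasts a plain log2 quantizer for post-Softmax activations with one possible reading of a "shift-uniform-log2" scheme: shift the input, quantize uniformly in the log2 domain, then undo the shift. This is only an interpretation of the name for illustration, not the paper's SULQ definition; the bit-width and shift value are assumptions.

```python
import numpy as np

def log2_quant(x, bits=3, eps=1e-8):
    """Round -log2(x) to the nearest integer level (power-of-two grid)."""
    levels = np.clip(np.round(-np.log2(np.maximum(x, eps))), 0, 2 ** bits - 1)
    return 2.0 ** (-levels)

def shift_uniform_log2_quant(x, bits=3, shift=0.03, eps=1e-8):
    """Shift, quantize uniformly in the log2 domain, then undo the shift."""
    logs = -np.log2(np.maximum(x + shift, eps))
    lo, hi = logs.min(), logs.max()
    step = max((hi - lo) / (2 ** bits - 1), eps)
    levels = np.round((logs - lo) / step)
    return 2.0 ** (-(lo + levels * step)) - shift

rng = np.random.default_rng(0)
attn = rng.dirichlet(alpha=np.full(64, 0.3), size=256)   # post-Softmax rows
for name, quant in [("log2", log2_quant),
                    ("shift-uniform-log2", shift_uniform_log2_quant)]:
    mse = np.mean((quant(attn) - attn) ** 2)
    print(f"{name:20s} 3-bit reconstruction MSE: {mse:.6f}")
```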

UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework

  • paper_url: http://arxiv.org/abs/2311.10125
  • repo_url: https://github.com/lhbuilder/sa-segment-anything
  • paper_authors: Chris Kelly, Luhui Hu, Cindy Yang, Yu Tian, Deshun Yang, Bang Yang, Zaoshan Huang, Zihao Li, Yuexian Zou
  • for: To develop a framework that consolidates multiple existing state-of-the-art (SOTA) computer vision (CV) models, accelerating progress and efficiency in the CV field.
  • methods: A framework named UnifiedVisionGPT that integrates multiple SOTA CV models and automatically selects suitable models based on multimodal inputs such as text prompts and images.
  • results: The paper presents the architecture and capabilities of UnifiedVisionGPT and shows that applying it in the CV domain improves efficiency, versatility, generalization, and performance.
    Abstract In the current landscape of artificial intelligence, foundation models serve as the bedrock for advancements in both language and vision domains. OpenAI GPT-4 has emerged as the pinnacle in large language models (LLMs), while the computer vision (CV) domain boasts a plethora of state-of-the-art (SOTA) models such as Meta's SAM and DINO, and YOLOS. However, the financial and computational burdens of training new models from scratch remain a significant barrier to progress. In response to this challenge, we introduce UnifiedVisionGPT, a novel framework designed to consolidate and automate the integration of SOTA vision models, thereby facilitating the development of vision-oriented AI. UnifiedVisionGPT distinguishes itself through four key features: (1) provides a versatile multimodal framework adaptable to a wide range of applications, building upon the strengths of multimodal foundation models; (2) seamlessly integrates various SOTA vision models to create a comprehensive multimodal platform, capitalizing on the best components of each model; (3) prioritizes vision-oriented AI, ensuring a more rapid progression in the CV domain compared to the current trajectory of LLMs; and (4) introduces automation in the selection of SOTA vision models, generating optimal results based on diverse multimodal inputs such as text prompts and images. This paper outlines the architecture and capabilities of UnifiedVisionGPT, demonstrating its potential to revolutionize the field of computer vision through enhanced efficiency, versatility, generalization, and performance. Our implementation, along with the unified multimodal framework and comprehensive dataset, is made publicly available at https://github.com/LHBuilder/SA-Segment-Anything.

Rusty Detection Using Image Processing For Maintenance Of Stations

  • paper_url: http://arxiv.org/abs/2311.09849
  • repo_url: None
  • paper_authors: Dao Duy Tung, Ho Xuan Hung
  • for: To accurately segment rusted regions on painted construction surfaces.
  • methods: A digital image processing approach: segmentation is based on the HSV color model, a single-scale Retinex model is applied to equalize the influence of illumination, manual color filtering then refines the identification of rusted regions, and finally the DBSCAN algorithm is used for precise rust-region segmentation.
  • results: The method accurately segments rusted areas on painted surfaces, providing a valuable approach for rust detection and analysis.
    Abstract This study addresses the challenge of accurately segmenting rusted areas on painted construction surfaces. A method leveraging digital image processing is explored to calculate the percentage of rust present on painted coatings. The proposed segmentation approach is based on the HSV color model. To equalize luminosity and mitigate the influence of illumination, a fundamental model of single-scale Retinex is applied specifically to the saturation component. Subsequently, the image undergoes further processing, involving manual color filtering. This step is crucial for refining the identification of rusted regions. To enhance precision and filter out noise, the pixel areas selected through color filtering are subjected to the DBScan algorithm. This multi-step process aims to achieve a robust segmentation of rusted areas on painted construction surfaces, providing a valuable contribution to the field of corrosion detection and analysis.
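The abstract describes a concrete classical pipeline (HSV conversion, single-scale Retinex on the saturation channel, manual color filtering, DBSCAN denoising). A minimal OpenCV/scikit-learn sketch of such a pipeline is shown below; the Retinex scale, HSV thresholds, and DBSCAN parameters are placeholder values of our own, not the paper's.

```python
import cv2
import numpy as np
from sklearn.cluster import DBSCAN

def segment_rust(bgr: np.ndarray) -> np.ndarray:
    """Hypothetical sketch of the described rust-segmentation pipeline."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    s = hsv[..., 1] + 1.0
    # Single-scale Retinex on saturation: log(S) - log(blurred S) to equalize illumination
    retinex = np.log(s) - np.log(cv2.GaussianBlur(s, (0, 0), sigmaX=80) + 1.0)
    hsv[..., 1] = cv2.normalize(retinex, None, 0, 255, cv2.NORM_MINMAX)
    hsv_u8 = hsv.astype(np.uint8)
    # Manual color filter for rust-like hues (thresholds are illustrative assumptions)
    mask = cv2.inRange(hsv_u8, (0, 60, 40), (25, 255, 220))
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return np.zeros(mask.shape, np.uint8)
    # DBSCAN over selected pixel coordinates to drop isolated noise (downsample for large masks)
    labels = DBSCAN(eps=3, min_samples=20).fit_predict(np.stack([xs, ys], axis=1))
    clean = np.zeros(mask.shape, np.uint8)
    clean[ys[labels >= 0], xs[labels >= 0]] = 255
    return clean
```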

Overcoming Data Scarcity in Biomedical Imaging with a Foundational Multi-Task Model

  • paper_url: http://arxiv.org/abs/2311.09847
  • repo_url: None
  • paper_authors: Raphael Schäfer, Till Nicke, Henning Höfener, Annkristin Lange, Dorit Merhof, Friedrich Feuerhake, Volkmar Schulz, Johannes Lotz, Fabian Kiessling
  • For: To propose a multi-task pretraining strategy for foundational models in biomedical imaging.
  • Methods: A multi-task learning strategy that combines multiple classification, segmentation, and detection tasks while decoupling the number of training tasks from memory requirements.
  • Results: The resulting foundational model maintains high performance across tasks with only 1% of the original training data and without fine-tuning, and the extracted features transfer well across centers.
    Abstract Foundational models, pretrained on a large scale, have demonstrated substantial success across non-medical domains. However, training these models typically requires large, comprehensive datasets, which contrasts with the smaller and more heterogeneous datasets common in biomedical imaging. Here, we propose a multi-task learning strategy that decouples the number of training tasks from memory requirements. We trained a Universal bioMedical PreTrained model (UMedPT) on a multi-task database including tomographic, microscopic, and X-ray images, with various labelling strategies such as classification, segmentation, and object detection. The UMedPT foundational model outperformed ImageNet pretraining and the previous state-of-the-art models. For tasks related to the pretraining database, it maintained its performance with only 1% of the original training data and without fine-tuning. For out-of-domain tasks it required not more than 50% of the original training data. In an external independent validation imaging features extracted using UMedPT proved to be a new standard for cross-center transferability.

GroupMixer: Patch-based Group Convolutional Neural Network for Breast Cancer Detection from Histopathological Images

  • paper_url: http://arxiv.org/abs/2311.09846
  • repo_url: None
  • paper_authors: Ardavan Modarres, Erfan Ebrahim Esfahani, Mahsa Bahrami
  • for: Detecting breast cancer malignancy at an early stage is a key step in controlling its effects. Histopathological analysis provides a unique opportunity for malignancy detection, but the task is tedious and time-consuming for histopathologists.
  • methods: Deep neural networks learn informative features directly from raw histopathological images, without manual feature extraction.
  • results: A CNN architecture combined with a patch embedding operation achieves highly accurate breast cancer detection with far fewer trainable parameters than competing methods; Transformer architectures, despite impressive performance in medical image analysis, carry many more trainable parameters and require large datasets for training.
    Abstract Diagnosis of breast cancer malignancy at the early stages is a crucial step for controlling its side effects. Histopathological analysis provides a unique opportunity for malignant breast cancer detection. However, such a task would be tedious and time-consuming for the histopathologists. Deep Neural Networks enable us to learn informative features directly from raw histopathological images without manual feature extraction. Although Convolutional Neural Networks (CNNs) have been the dominant architectures in the computer vision realm, Transformer-based architectures have shown promising results in different computer vision tasks. Although harnessing the capability of Transformer-based architectures for medical image analysis seems interesting, these architectures are large, have a significant number of trainable parameters, and require large datasets to be trained on, which are usually rare in the medical domain. It has been claimed and empirically proved that at least part of the superior performance of Transformer-based architectures in Computer Vision domain originates from patch embedding operation. In this paper, we borrowed the previously introduced idea of integrating a fully Convolutional Neural Network architecture with Patch Embedding operation and presented an efficient CNN architecture for breast cancer malignancy detection from histopathological images. Despite the number of parameters that is significantly smaller than other methods, the accuracy performance metrics achieved 97.65%, 98.92%, 99.21%, and 98.01% for 40x, 100x, 200x, and 400x magnifications respectively. We took a step forward and modified the architecture using Group Convolution and Channel Shuffling ideas and reduced the number of trainable parameters even more with a negligible decline in performance and achieved 95.42%, 98.16%, 96.05%, and 97.92% accuracy for the mentioned magnifications respectively.
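The parameter-reduction step pairs group convolution with channel shuffling, a well-established combination. A generic PyTorch block in that spirit is sketched below; it illustrates the idea only and is not the exact GroupMixer layer, whose channel counts and layout are not given in the abstract.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so information can flow between them."""
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class GroupedBlock(nn.Module):
    """Illustrative grouped-convolution block with channel shuffling (our sketch)."""
    def __init__(self, channels: int = 64, groups: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)
        self.groups = groups

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return channel_shuffle(self.act(self.bn(self.conv(x))), self.groups)
```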

MAM-E: Mammographic synthetic image generation with diffusion models

  • paper_url: http://arxiv.org/abs/2311.09822
  • repo_url: https://github.com/Likalto4/diffusion-models_master
  • paper_authors: Ricardo Montoya-del-Angel, Karla Sam-Millan, Joan C Vilanova, Robert Martí
  • for: To use diffusion models as a data augmentation technique for medical imaging, addressing the data scarcity problem in the field.
  • methods: A state-of-the-art conditional diffusion pipeline generates high-quality full-field digital mammograms, and stable diffusion models are additionally used to inpaint synthetic lesions onto healthy mammograms.
  • results: The proposed generative pipeline, MAM-E, produces high-quality mammograms controlled by a text prompt and can inpaint synthetic lesions in specific regions of the breast; quantitative and qualitative assessments of the generated images are provided, along with easy-to-use graphical user interfaces for mammography synthesis.
    Abstract Generative models are used as an alternative data augmentation technique to alleviate the data scarcity problem faced in the medical imaging field. Diffusion models have gathered special attention due to their innovative generation approach, the high quality of the generated images and their relatively less complex training process compared with Generative Adversarial Networks. Still, the implementation of such models in the medical domain remains at early stages. In this work, we propose exploring the use of diffusion models for the generation of high quality full-field digital mammograms using state-of-the-art conditional diffusion pipelines. Additionally, we propose using stable diffusion models for the inpainting of synthetic lesions on healthy mammograms. We introduce MAM-E, a pipeline of generative models for high quality mammography synthesis controlled by a text prompt and capable of generating synthetic lesions on specific regions of the breast. Finally, we provide quantitative and qualitative assessment of the generated images and easy-to-use graphical user interfaces for mammography synthesis.

Neural-Logic Human-Object Interaction Detection

  • paper_url: http://arxiv.org/abs/2311.09817
  • repo_url: https://github.com/Aryia-Behroziuan/Other-sources
  • paper_authors: Liulei Li, Jianan Wei, Wenguan Wang, Yi Yang
  • for: To improve the performance and zero-shot generalization ability of HOI detectors.
  • methods: The self-attention mechanism of the vanilla Transformer is modified so that it can perform neural-logic reasoning over <human, action, object> triplets, guided by two key properties for understanding HOI: affordances and proxemics.
  • results: Significant improvements over existing methods on V-COCO and HICO-DET under both regular and zero-shot setups.
    Abstract The interaction decoder utilized in prevalent Transformer-based HOI detectors typically accepts pre-composed human-object pairs as inputs. Though achieving remarkable performance, such paradigm lacks feasibility and cannot explore novel combinations over entities during decoding. We present LogicHOI, a new HOI detector that leverages neural-logic reasoning and Transformer to infer feasible interactions between entities. Specifically, we modify the self-attention mechanism in vanilla Transformer, enabling it to reason over the <human, action, object> triplet and constitute novel interactions. Meanwhile, such reasoning process is guided by two crucial properties for understanding HOI: affordances (the potential actions an object can facilitate) and proxemics (the spatial relations between humans and objects). We formulate these two properties in first-order logic and ground them into continuous space to constrain the learning process of our approach, leading to improved performance and zero-shot generalization capabilities. We evaluate LogicHOI on V-COCO and HICO-DET under both normal and zero-shot setups, achieving significant improvements over existing methods.

MetaDreamer: Efficient Text-to-3D Creation With Disentangling Geometry and Texture

  • paper_url: http://arxiv.org/abs/2311.10123
  • repo_url: None
  • paper_authors: Lincong Feng, Muyu Wang, Maoyu Wang, Kuo Xu, Xiaoli Liu
  • for: To improve the efficiency and quality of 3D object generation and to address multi-view geometric inconsistency.
  • methods: A two-stage optimization approach: the first stage optimizes the geometric representation of the 3D object, and the second stage fine-tunes the geometry and optimizes the texture.
  • results: The method generates high-quality 3D objects from text prompts within 20 minutes and additionally supports image control, improving the controllability of 3D generation.
    Abstract Generative models for 3D object synthesis have seen significant advancements with the incorporation of prior knowledge distilled from 2D diffusion models. Nevertheless, challenges persist in the form of multi-view geometric inconsistencies and slow generation speeds within the existing 3D synthesis frameworks. This can be attributed to two factors: firstly, the deficiency of abundant geometric a priori knowledge in optimization, and secondly, the entanglement issue between geometry and texture in conventional 3D generation methods. In response, we introduce MetaDreamer, a two-stage optimization approach that leverages rich 2D and 3D prior knowledge. In the first stage, our emphasis is on optimizing the geometric representation to ensure multi-view consistency and accuracy of 3D objects. In the second stage, we concentrate on fine-tuning the geometry and optimizing the texture, thereby achieving a more refined 3D object. Through leveraging 2D and 3D prior knowledge in two stages, respectively, we effectively mitigate the interdependence between geometry and texture. MetaDreamer establishes clear optimization objectives for each stage, resulting in significant time savings in the 3D generation process. Ultimately, MetaDreamer can generate high-quality 3D objects based on textual prompts within 20 minutes, and to the best of our knowledge, it is the most efficient text-to-3D generation method. Furthermore, we introduce image control into the process, enhancing the controllability of 3D generation. Extensive empirical evidence confirms that our method is not only highly efficient but also achieves a quality level that is at the forefront of current state-of-the-art 3D generation techniques.

EvaSurf: Efficient View-Aware Implicit Textured Surface Reconstruction on Mobile Devices

  • paper_url: http://arxiv.org/abs/2311.09806
  • repo_url: None
  • paper_authors: Jingnan Gao, Zhuo Chen, Yichao Yan, Bowen Pan, Zhe Wang, Jiangjing Lyu, Xiaokang Yang
  • For: Efficient, view-aware reconstruction of textured 3D objects on mobile devices.
  • Methods: An efficient surface-based model with a multi-view supervision module, an implicit texture embedded with a set of Gaussian lobes to capture view-dependent information, and a lightweight neural shader.
  • Results: High-quality appearance and accurate mesh reconstruction on both synthetic and real-world data; training takes 1-2 hours on a single GPU, and rendering runs at over 40 FPS on common mobile devices.
    Abstract Reconstructing real-world 3D objects has numerous applications in computer vision, such as virtual reality, video games, and animations. Ideally, 3D reconstruction methods should generate high-fidelity results with 3D consistency in real-time. Traditional methods match pixels between images using photo-consistency constraints or learned features, while differentiable rendering methods like Neural Radiance Fields (NeRF) use surface-based representations or differentiable volume rendering to generate high-fidelity scenes. However, these methods require excessive runtime for rendering, making them impractical for daily applications. To address these challenges, we present $\textbf{EvaSurf}$, an $\textbf{E}$fficient $\textbf{V}$iew-$\textbf{A}$ware Implicit Textured $\textbf{Surf}$ace Reconstruction method on Mobile Devices. In our method, we first employ an efficient surface-based model with a multi-view supervision module to ensure accurate mesh creation. To enable high-fidelity rendering, we learn an implicit texture embedded with a set of Gaussian lobes to capture view-dependent information. Furthermore, With the explicit geometry and the implicit texture, we can employ a lightweight neural shader to reduce the expense of computation and further support real-time rendering on common mobile devices. Extensive experiments demonstrate that our method can reconstruct high-quality appearance and accurate mesh on both synthetic and real-world datasets. Moreover, our method can be trained in just 1-2 hours using a single GPU and run on mobile devices at over 40FPS (Frames Per Second), with a final package required for rendering taking up only 40-50 MB.

Certified Control for Train Sign Classification

  • paper_url: http://arxiv.org/abs/2311.09778
  • repo_url: None
  • paper_authors: Jan Roßbach, Michael Leuschel
  • for: To study a certified-control framework for verifying AI-based autonomous train systems, preventing the perception system from falsely detecting traffic signs.
  • methods: Classical computer vision algorithms check whether the traffic signs detected by an AI object detection model conform to predefined specifications.
  • results: Initial results are promising, achieving considerable precision gains with only a minor reduction in recall; further investigation into generalization is still needed.
    Abstract There is considerable industrial interest in integrating AI techniques into railway systems, notably for fully autonomous train systems. The KI-LOK research project is involved in developing new methods for certifying such AI-based systems. Here we explore the utility of a certified control architecture for a runtime monitor that prevents false positive detection of traffic signs in an AI-based perception system. The monitor uses classical computer vision algorithms to check if the signs -- detected by an AI object detection model -- fit predefined specifications. We provide such specifications for some critical signs and integrate a Python prototype of the monitor with a popular object detection model to measure relevant performance metrics on generated data. Our initial results are promising, achieving considerable precision gains with only minor recall reduction; however, further investigation into generalization possibilities will be necessary.
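The monitor's idea is to re-check each AI detection with classical computer vision against a predefined specification. The sketch below shows one toy specification (enough sign-colored pixels plus rough circularity of the largest blob). The paper's actual specifications for railway signs are not given here, so the color/shape choice and all thresholds are assumptions for illustration only.

```python
import cv2
import numpy as np

def passes_sign_spec(crop_bgr: np.ndarray,
                     min_red_ratio: float = 0.25,
                     min_circularity: float = 0.6) -> bool:
    """Toy runtime-monitor check: accept a detection only if the crop contains
    enough red pixels and its largest red contour is roughly circular."""
    hsv = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2HSV)
    red = cv2.inRange(hsv, (0, 80, 60), (10, 255, 255)) | cv2.inRange(hsv, (170, 80, 60), (180, 255, 255))
    if red.mean() / 255.0 < min_red_ratio:
        return False                                   # not enough sign-colored pixels
    contours, _ = cv2.findContours(red, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return False
    c = max(contours, key=cv2.contourArea)
    area, perim = cv2.contourArea(c), cv2.arcLength(c, True)
    circularity = 4.0 * np.pi * area / (perim ** 2 + 1e-6)
    return circularity >= min_circularity              # reject implausible, non-circular blobs
```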

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

  • paper_url: http://arxiv.org/abs/2311.10122
  • repo_url: https://github.com/PKU-YuanGroup/Video-LLaVA
  • paper_authors: Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, Li Yuan
  • for: To improve performance on downstream visual-language understanding tasks.
  • methods: Images and videos are encoded into a shared feature space before projection, and the unified representation is fed to a large language model.
  • results: The resulting simple yet robust LVLM baseline, Video-LLaVA, learns from a mixed dataset of images and videos that mutually enhance each other; it performs strongly across 9 image benchmarks spanning 5 image question-answering datasets and 4 benchmark toolkits, and outperforms Video-ChatGPT on MSRVTT, MSVD, TGIF, and ActivityNet.
    Abstract The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, our Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos.

Slide-SAM: Medical SAM Meets Sliding Window

  • paper_url: http://arxiv.org/abs/2311.10121
  • repo_url: None
  • paper_authors: Quan Quan, Fenghe Tang, Zikang Xu, Heqin Zhu, S. Kevin Zhou
  • For: This paper proposes a new method called Slide-SAM for 3D medical image segmentation, which extends the Segment Anything Model (SAM) to 3D medical images.
  • Methods: The proposed method uses a single slice prompt to segment the entire volume, reducing the prompt workload for professionals. It also uses high resolution (H$ \times $W = 1024$ \times $1024) for training in 3D images to achieve optimal learning for small targets.
  • Results: The proposed method was evaluated on multiple datasets and achieved the most advanced 3D segmentation performance while maintaining the minimum prompt. The code will be open source soon.
    Abstract Segment Anything Model (SAM) achieves remarkable results in 2D image segmentation of natural images. However, the huge gap between medical images and natural images prevents it from being directly applied to medical image segmentation tasks. Especially in 3D medical images, SAM cannot learn the contextual relationship between slices, which limits its application in real scenarios. In addition, recent research shows that applying 2D SAM to 3D images requires prompting the entire volume, which is time- and label-consuming. In order to solve the above problems, we introduce Slide-SAM, which extends SAM to 3D medical images. Specifically, only a single slice prompt is needed to segment the entire volume, which greatly reduces the prompt workload for professionals. Secondly, unlike traditional 3D medical image segmentation, we are free from the influence of computing resources and can still use high resolution (H$ \times $W = 1024$ \times $1024) for training on 3D images to achieve optimal learning for small targets, since training on the entire 3D volume at once is beyond reach. We collected a large number of 3D images from large-scale 3D public and private datasets, and extended SAM to 3D medical image segmentation involving bounding box and point prompts. Finally, we perform a comprehensive evaluation and analysis investigating the performance of Slide-SAM in medical image segmentation of different modalities, anatomy, and organs. We have verified Slide-SAM's segmentation capabilities on multiple datasets, achieving the most advanced 3D segmentation performance while requiring minimal prompting. Code will be open-sourced soon.

Utilizing dataset affinity prediction in object detection to assess training data

  • paper_url: http://arxiv.org/abs/2311.09768
  • repo_url: None
  • paper_authors: Stefan Becker, Jens Bayer, Ronny Hug, Wolfgang Hübner, Michael Arens
  • for: To improve the effectiveness and consistency of object detector training and to exploit heterogeneous vehicle datasets for better model generalization.
  • methods: A data-source (dataset affinity) prediction module is added to standard object detection pipelines so that the informational value of each dataset can be assessed during training, with minimal overhead at inference time.
  • results: By automatically selecting samples from a heterogeneous pool of vehicle datasets, object detectors can be trained on a significantly sparser set of training samples without losing detection accuracy.
    Abstract Data pooling offers various advantages, such as increasing the sample size, improving generalization, reducing sampling bias, and addressing data sparsity and quality, but it is not straightforward and may even be counterproductive. Assessing the effectiveness of pooling datasets in a principled manner is challenging due to the difficulty in estimating the overall information content of individual datasets. Towards this end, we propose incorporating a data source prediction module into standard object detection pipelines. The module runs with minimal overhead during inference time, providing additional information about the data source assigned to individual detections. We show the benefits of the so-called dataset affinity score by automatically selecting samples from a heterogeneous pool of vehicle datasets. The results show that object detectors can be trained on a significantly sparser set of training samples without losing detection accuracy.
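Conceptually, the approach bolts a small data-source prediction branch onto a standard detector so that every detection also receives a dataset-affinity score. A minimal PyTorch sketch of such a branch (our own layout, not the paper's code) follows.

```python
import torch
import torch.nn as nn

class AffinityHead(nn.Module):
    """Illustrative dataset-affinity branch: alongside the usual box/class heads,
    predict which source dataset a detection's features most resemble."""
    def __init__(self, feat_dim: int = 256, num_datasets: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, num_datasets)
        )

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        # roi_feats: (num_detections, feat_dim) pooled features from the detector
        return self.fc(roi_feats).softmax(dim=-1)   # per-detection dataset affinity scores

# During training, the affinity logits can be supervised with the known source
# dataset of each image; at inference the scores come essentially for free.
```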

Scene Text Image Super-resolution based on Text-conditional Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.09759
  • repo_url: None
  • paper_authors: Chihiro Noguchi, Shun Fukuda, Masao Yamanaka
  • for: To propose a scene text image super-resolution (STISR) method based on text-conditional diffusion models (DMs), improving scene text recognition (STR) performance.
  • methods: Text-conditional DMs synthesize high-resolution (HR) text images from low-resolution (LR) inputs; three specialized text-conditional DMs (for text image synthesis, super-resolution, and image degradation) are further combined into a framework for synthesizing LR-HR paired text image datasets.
  • results: Experiments show that text-conditional DMs notably surpass existing STISR methods, especially when the text of the LR image is given as input, and the synthesized LR-HR pairs significantly enhance the performance of STISR methods in the TextZoom evaluation.
    Abstract Scene Text Image Super-resolution (STISR) has recently achieved great success as a preprocessing method for scene text recognition. STISR aims to transform blurred and noisy low-resolution (LR) text images in real-world settings into clear high-resolution (HR) text images suitable for scene text recognition. In this study, we leverage text-conditional diffusion models (DMs), known for their impressive text-to-image synthesis capabilities, for STISR tasks. Our experimental results revealed that text-conditional DMs notably surpass existing STISR methods. Especially when texts from LR text images are given as input, the text-conditional DMs are able to produce superior quality super-resolution text images. Utilizing this capability, we propose a novel framework for synthesizing LR-HR paired text image datasets. This framework consists of three specialized text-conditional DMs, each dedicated to text image synthesis, super-resolution, and image degradation. These three modules are vital for synthesizing distinct LR and HR paired images, which are more suitable for training STISR methods. Our experiments confirmed that these synthesized image pairs significantly enhance the performance of STISR methods in the TextZoom evaluation.

DIFFNAT: Improving Diffusion Image Quality Using Natural Image Statistics

  • paper_url: http://arxiv.org/abs/2311.09753
  • repo_url: None
  • paper_authors: Aniket Roy, Maiterya Suin, Anshul Shah, Ketul Shah, Jiang Liu, Rama Chellappa
  • for: To improve the quality of generated images.
  • methods: A kurtosis concentration (KC) loss function that can be added to any standard diffusion model pipeline.
  • results: Improved perceptual quality on three tasks (personalized few-shot finetuning with text guidance, unconditional image generation, and image super-resolution), measured by FID, MUSIQ score, and user evaluation.
    Abstract Diffusion models have advanced generative AI significantly in terms of editing and creating naturalistic images. However, efficiently improving generated image quality is still of paramount interest. In this context, we propose a generic "naturalness" preserving loss function, viz., kurtosis concentration (KC) loss, which can be readily applied to any standard diffusion model pipeline to elevate the image quality. Our motivation stems from the projected kurtosis concentration property of natural images, which states that natural images have nearly constant kurtosis values across different band-pass versions of the image. To retain the "naturalness" of the generated images, we enforce reducing the gap between the highest and lowest kurtosis values across the band-pass versions (e.g., Discrete Wavelet Transform (DWT)) of images. Note that our approach does not require any additional guidance like classifier or classifier-free guidance to improve the image quality. We validate the proposed approach for three diverse tasks, viz., (1) personalized few-shot finetuning using text guidance, (2) unconditional image generation, and (3) image super-resolution. Integrating the proposed KC loss has improved the perceptual quality across all these tasks in terms of both FID, MUSIQ score, and user evaluation.
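The KC loss penalizes how far apart the kurtosis values of different band-pass versions of an image lie. A non-differentiable NumPy/PyWavelets sketch of that quantity for a single grayscale image is shown below; the paper's loss is presumably a differentiable variant integrated into diffusion training, and the wavelet choice and level count here are our own assumptions.

```python
import numpy as np
import pywt
from scipy.stats import kurtosis

def kc_penalty(image: np.ndarray, wavelet: str = "haar", levels: int = 3) -> float:
    """Sketch of a kurtosis-concentration penalty: natural images keep near-constant
    kurtosis across band-pass versions, so penalize the spread of kurtosis values
    over the detail subbands of a 2D discrete wavelet decomposition."""
    coeffs = pywt.wavedec2(image, wavelet, level=levels)
    k_values = []
    for detail in coeffs[1:]:              # each level yields (cH, cV, cD) subbands
        for band in detail:
            k_values.append(kurtosis(band.ravel(), fisher=False))
    k_values = np.asarray(k_values)
    return float(k_values.max() - k_values.min())
```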

Gradient-Map-Guided Adaptive Domain Generalization for Cross Modality MRI Segmentation

  • paper_url: http://arxiv.org/abs/2311.09737
  • repo_url: https://github.com/cuttle-fish-my/gm-guided-dg
  • paper_authors: Bingnan Li, Zhitong Gao, Xuming He
  • for: To improve cross-modality MRI segmentation for computer-aided medical diagnosis, enabling flexible data acquisition and model generalization.
  • methods: A novel adaptive domain generalization framework that combines a learning-free cross-domain representation based on image gradient maps with a class-prior-informed test-time adaptation strategy to mitigate local domain shift.
  • results: Validated on two multi-modal MRI datasets across six cross-modal segmentation tasks; the method consistently outperforms competing approaches in all task settings and remains stable even with limited training data.
    Abstract Cross-modal MRI segmentation is of great value for computer-aided medical diagnosis, enabling flexible data acquisition and model generalization. However, most existing methods have difficulty in handling local variations in domain shift and typically require a significant amount of data for training, which hinders their usage in practice. To address these problems, we propose a novel adaptive domain generalization framework, which integrates a learning-free cross-domain representation based on image gradient maps and a class prior-informed test-time adaptation strategy for mitigating local domain shift. We validate our approach on two multi-modal MRI datasets with six cross-modal segmentation tasks. Across all the task settings, our method consistently outperforms competing approaches and shows a stable performance even with limited training data.
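The learning-free cross-domain representation is an image gradient map. One simple way to realize that idea, offered here only as a plausible sketch rather than the paper's exact recipe, is a Sobel gradient magnitude normalized per slice, which emphasizes anatomical structure over modality-specific intensity.

```python
import numpy as np
from scipy import ndimage

def gradient_map(slice_2d: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Learning-free gradient-map representation of a 2D MRI slice (illustrative)."""
    gx = ndimage.sobel(slice_2d.astype(np.float32), axis=0)
    gy = ndimage.sobel(slice_2d.astype(np.float32), axis=1)
    mag = np.hypot(gx, gy)                                   # gradient magnitude
    return (mag - mag.min()) / (mag.max() - mag.min() + eps) # normalize to [0, 1]
```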

MS-Former: Memory-Supported Transformer for Weakly Supervised Change Detection with Patch-Level Annotations

  • paper_url: http://arxiv.org/abs/2311.09726
  • repo_url: https://github.com/guanyuezhen/ms-former
  • paper_authors: Zhenglai Li, Chang Tang, Xinwang Liu, Changdong Li, Xianju Li, Wei Zhang
  • for: To propose a change detection method that is weakly supervised with patch-level annotations.
  • methods: A transformer-based framework with a bidirectional attention block (BAB) and a patch-level supervision scheme (PSS). The BAB extracts contextual information about changed and unchanged regions from temporal difference features and stores it as prototypes in a memory bank; the PSS guides the network to learn valuable knowledge from the patch-level annotations, further improving its representational power.
  • results: Experimental results on three benchmark datasets demonstrate strong change detection performance.
    Abstract Fully supervised change detection methods have achieved significant advancements in performance, yet they depend severely on acquiring costly pixel-level labels. Considering that the patch-level annotations also contain abundant information corresponding to both changed and unchanged objects in bi-temporal images, an intuitive solution is to segment the changes with patch-level annotations. How to capture the semantic variations associated with the changed and unchanged regions from the patch-level annotations to obtain promising change results is the critical challenge for the weakly supervised change detection task. In this paper, we propose a memory-supported transformer (MS-Former), a novel framework consisting of a bi-directional attention block (BAB) and a patch-level supervision scheme (PSS) tailored for weakly supervised change detection with patch-level annotations. More specifically, the BAM captures contexts associated with the changed and unchanged regions from the temporal difference features to construct informative prototypes stored in the memory bank. On the other hand, the BAM extracts useful information from the prototypes as supplementary contexts to enhance the temporal difference features, thereby better distinguishing changed and unchanged regions. After that, the PSS guides the network learning valuable knowledge from the patch-level annotations, thus further elevating the performance. Experimental results on three benchmark datasets demonstrate the effectiveness of our proposed method in the change detection task. The demo code for our work will be publicly available at \url{https://github.com/guanyuezhen/MS-Former}.

Now and Future of Artificial Intelligence-based Signet Ring Cell Diagnosis: A Survey

  • paper_url: http://arxiv.org/abs/2311.10118
  • repo_url: None
  • paper_authors: Zhu Meng, Junhao Dong, Limei Guo, Fei Su, Guangxi Wang, Zhicheng Zhao
  • for: To provide a survey of deep-learning-driven signet ring cell (SRC) analysis from 2008 to August 2023, helping researchers without a medical background understand the biological characteristics of SRCs, the challenges of automatic identification, and the performance and future trends of existing algorithms.
  • methods: Representative algorithms used in SRC analysis are analyzed and compared, grouped into classification, detection, and segmentation.
  • results: Open issues in SRC analysis are identified, and future research directions and trends are discussed to guide researchers in the field.
    Abstract Since signet ring cells (SRCs) are associated with high peripheral metastasis rate and dismal survival, they play an important role in determining surgical approaches and prognosis, while they are easily missed by even experienced pathologists. Although automatic diagnosis SRCs based on deep learning has received increasing attention to assist pathologists in improving the diagnostic efficiency and accuracy, the existing works have not been systematically overviewed, which hindered the evaluation of the gap between algorithms and clinical applications. In this paper, we provide a survey on SRC analysis driven by deep learning from 2008 to August 2023. Specifically, the biological characteristics of SRCs and the challenges of automatic identification are systemically summarized. Then, the representative algorithms are analyzed and compared via dividing them into classification, detection, and segmentation. Finally, for comprehensive consideration to the performance of existing methods and the requirements for clinical assistance, we discuss the open issues and future trends of SRC analysis. The retrospect research will help researchers in the related fields, particularly for who without medical science background not only to clearly find the outline of SRC analysis, but also gain the prospect of intelligent diagnosis, resulting in accelerating the practice and application of intelligent algorithms.

Robust Contrastive Learning With Theory Guarantee

  • paper_url: http://arxiv.org/abs/2311.09671
  • repo_url: None
  • paper_authors: Ngoc N. Tran, Lam Tran, Hoang Phan, Anh Bui, Tung Pham, Toan Tran, Dinh Phung, Trung Le
  • for: To examine how the unsupervised pretraining phase of contrastive learning (CL) supports the supervised phase, in particular the robust supervised loss.
  • methods: A two-phase CL framework: features are first learned from unlabeled data, and a linear classifier is then trained on those features with labeled data; rigorous theory connects the unsupervised loss to the robust supervised loss.
  • results: The analysis identifies which components of the unsupervised loss help improve the robust supervised loss, and experiments verify these findings, shedding light on how to construct an effective unsupervised loss for the first phase of CL.
    Abstract Contrastive learning (CL) is a self-supervised training paradigm that allows us to extract meaningful features without any label information. A typical CL framework is divided into two phases, where it first tries to learn the features from unlabelled data, and then uses those features to train a linear classifier with the labeled data. While a fair amount of existing theoretical works have analyzed how the unsupervised loss in the first phase can support the supervised loss in the second phase, none has examined the connection between the unsupervised loss and the robust supervised loss, which can shed light on how to construct an effective unsupervised loss for the first phase of CL. To fill this gap, our work develops rigorous theories to dissect and identify which components in the unsupervised loss can help improve the robust supervised loss and conduct proper experiments to verify our findings.

Multi-View Spectrogram Transformer for Respiratory Sound Classification

  • paper_url: http://arxiv.org/abs/2311.09655
  • repo_url: None
  • paper_authors: Wentao He, Yuchen Yan, Jianfeng Ren, Ruibin Bai, Xudong Jiang
  • for: To apply deep neural networks to classify respiratory sounds.
  • methods: A Multi-View Spectrogram Transformer (MVST) splits the mel-spectrogram into patches of different sizes, representing multi-view acoustic elements, and transformer encoders convert these patches into attention-based features; a gated fusion scheme weighs the multi-view features.
  • results: Experimental results show that the proposed MVST outperforms previous methods on respiratory sound classification on the ICBHI dataset.
    Abstract Deep neural networks have been applied to audio spectrograms for respiratory sound classification. Existing models often treat the spectrogram as a synthetic image while overlooking its physical characteristics. In this paper, a Multi-View Spectrogram Transformer (MVST) is proposed to embed different views of time-frequency characteristics into the vision transformer. Specifically, the proposed MVST splits the mel-spectrogram into different sized patches, representing the multi-view acoustic elements of a respiratory sound. These patches and positional embeddings are then fed into transformer encoders to extract the attentional information among patches through a self-attention mechanism. Finally, a gated fusion scheme is designed to automatically weigh the multi-view features to highlight the best one in a specific scenario. Experimental results on the ICBHI dataset demonstrate that the proposed MVST significantly outperforms state-of-the-art methods for classifying respiratory sounds.
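The core preprocessing step splits one mel-spectrogram into several token sequences using different patch sizes, one "view" per size. A minimal PyTorch sketch of that multi-view patching follows; the patch sizes are illustrative, and each view would subsequently be projected, given positional embeddings, fed to a transformer encoder, and combined by the gated fusion described in the abstract.

```python
import torch

def multi_view_patches(mel: torch.Tensor, patch_sizes=((16, 16), (8, 32), (32, 8))):
    """Sketch of multi-view patch extraction from a mel-spectrogram of shape (B, 1, F, T):
    one token sequence per patch size, capturing different time-frequency granularities."""
    views = []
    for ph, pw in patch_sizes:
        tokens = torch.nn.functional.unfold(mel, kernel_size=(ph, pw), stride=(ph, pw))
        views.append(tokens.transpose(1, 2))     # (B, num_patches, ph * pw) per view
    return views
```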

Improved TokenPose with Sparsity

  • paper_url: http://arxiv.org/abs/2311.09653
  • repo_url: None
  • paper_authors: Anning Li
  • for: Human pose estimation.
  • methods: Sparsity is introduced into both keypoint-token attention and visual-token attention to reduce computational complexity, while the interaction between keypoint tokens and visual features is refined to improve accuracy.
  • results: A new state-of-the-art result on the MPII dataset, demonstrating the feasibility of the method.
    Abstract Over the past few years, the vision transformer and its various forms have gained significance in human pose estimation. By treating image patches as tokens, transformers can capture global relationships wisely, estimate the keypoint tokens by leveraging the visual tokens, and recognize the posture of the human body. Nevertheless, global attention is computationally demanding, which poses a challenge for scaling up transformer-based methods to high-resolution features. In this paper, we introduce sparsity in both keypoint token attention and visual token attention to improve human pose estimation. Experimental results on the MPII dataset demonstrate that our model has a higher level of accuracy and proved the feasibility of the method, achieving new state-of-the-art results. The idea can also provide references for other transformer-based models.

Event-based Motion-Robust Accurate Shape Estimation for Mixed Reflectance Scenes

  • paper_url: http://arxiv.org/abs/2311.09652
  • repo_url: None
  • paper_authors: Aniket Dashpute, Jiazhang Wang, James Taylor, Oliver Cossairt, Ashok Veeraraghavan, Florian Willomitzer
  • for: fast and motion-robust 3D imaging of mixed reflectance scenes
  • methods: event-based structured light system, epipolar constraints, triangulation, deflectometry
  • results: high accuracy (<500μm) and fast capture speed (14Hz or 250Hz) for mixed reflectance scenes
    Abstract Event-based structured light systems have recently been introduced as an exciting alternative to conventional frame-based triangulation systems for the 3D measurements of diffuse surfaces. Important benefits include the fast capture speed and the high dynamic range provided by the event camera - albeit at the cost of lower data quality. So far, both low-accuracy event-based as well as high-accuracy frame-based 3D imaging systems are tailored to a specific surface type, such as diffuse or specular, and can not be used for a broader class of object surfaces ("mixed reflectance scenes"). In this paper, we present a novel event-based structured light system that enables fast 3D imaging of mixed reflectance scenes with high accuracy. On the captured events, we use epipolar constraints that intrinsically enable decomposing the measured reflections into diffuse, two-bounce specular, and other multi-bounce reflections. The diffuse objects in the scene are reconstructed using triangulation. Eventually, the reconstructed diffuse scene parts are used as a "display" to evaluate the specular scene parts via deflectometry. This novel procedure allows us to use the entire scene as a virtual screen, using only a scanning laser and an event camera. The resulting system achieves fast and motion-robust (14Hz) reconstructions of mixed reflectance scenes with < 500 $\mu$m accuracy. Moreover, we introduce a "superfast" capture mode (250Hz) for the 3D measurement of diffuse scenes.

Reconstructing Continuous Light Field From Single Coded Image

  • paper_url: http://arxiv.org/abs/2311.09646
  • repo_url: None
  • paper_authors: Yuya Ishikawa, Keita Takahashi, Chihiro Tsutake, Toshiaki Fujii
  • for: reconstruction of continuous light fields of a target scene from a single observed image
  • methods: joint aperture-exposure coding and neural radiance field (NeRF)
  • results: accurate and efficient reconstruction of continuous light fields without test-time optimization, bridging the gap between camera design and neural rendering.
    Abstract We propose a method for reconstructing a continuous light field of a target scene from a single observed image. Our method takes the best of two worlds: joint aperture-exposure coding for compressive light-field acquisition, and a neural radiance field (NeRF) for view synthesis. Joint aperture-exposure coding implemented in a camera enables effective embedding of 3-D scene information into an observed image, but in previous works, it was used only for reconstructing discretized light-field views. NeRF-based neural rendering enables high quality view synthesis of a 3-D scene from continuous viewpoints, but when only a single image is given as the input, it struggles to achieve satisfactory quality. Our method integrates these two techniques into an efficient and end-to-end trainable pipeline. Trained on a wide variety of scenes, our method can reconstruct continuous light fields accurately and efficiently without any test time optimization. To our knowledge, this is the first work to bridge two worlds: camera design for efficiently acquiring 3-D information and neural rendering.

Weakly Supervised Anomaly Detection for Chest X-Ray Image

  • paper_url: http://arxiv.org/abs/2311.09642
  • repo_url: https://github.com/iamcuriosity/wscxr
  • paper_authors: Haoqi Ni, Ximiao Zhang, Min Xu, Ning Lang, Xiuzhuang Zhou
  • for: To propose a weakly supervised anomaly detection method for chest X-ray images, enabling better detection of thoracic diseases in clinical applications.
  • methods: Sets of normal and anomaly image features are first constructed; anomaly feature mining then removes normal-region features so that the scarce yet crucial features of diseased areas are fully exploited, and a linear mixing strategy augments the anomaly features to support training of the anomaly detector.
  • results: Experiments on two chest X-ray datasets demonstrate the effectiveness of the method.
    Abstract Chest X-Ray (CXR) examination is a common method for assessing thoracic diseases in clinical applications. While recent advances in deep learning have enhanced the significance of visual analysis for CXR anomaly detection, current methods often miss key cues in anomaly images crucial for identifying disease regions, as they predominantly rely on unsupervised training with normal images. This letter focuses on a more practical setup in which few-shot anomaly images with only image-level labels are available during training. For this purpose, we propose WSCXR, a weakly supervised anomaly detection framework for CXR. WSCXR firstly constructs sets of normal and anomaly image features respectively. It then refines the anomaly image features by eliminating normal region features through anomaly feature mining, thus fully leveraging the scarce yet crucial features of diseased areas. Additionally, WSCXR employs a linear mixing strategy to augment the anomaly features, facilitating the training of anomaly detector with few-shot anomaly images. Experiments on two CXR datasets demonstrate the effectiveness of our approach.
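The linear mixing strategy augments the few available anomaly features. A plausible minimal sketch, assuming convex combinations of random pairs of anomaly feature vectors with a mixing range of our own choosing, is given below; the paper's exact mixing scheme may differ.

```python
import torch

def mix_anomaly_features(anomaly_feats: torch.Tensor, num_aug: int = 256,
                         alpha_range=(0.2, 0.8)) -> torch.Tensor:
    """Hypothetical linear-mixing augmentation: synthesize extra anomaly feature
    vectors as convex combinations of random pairs from the few-shot anomaly set.
    anomaly_feats: (N, D) feature vectors mined from the few anomaly images."""
    n = anomaly_feats.shape[0]
    i = torch.randint(0, n, (num_aug,))
    j = torch.randint(0, n, (num_aug,))
    alpha = torch.empty(num_aug, 1).uniform_(*alpha_range)
    return alpha * anomaly_feats[i] + (1.0 - alpha) * anomaly_feats[j]
```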

On the Quantification of Image Reconstruction Uncertainty without Training Data

  • paper_url: http://arxiv.org/abs/2311.09639
  • repo_url: None
  • paper_authors: Sirui Bi, Victor Fung, Jiaxin Zhang
  • for: This paper focuses on developing a deep variational framework for image reconstruction and uncertainty estimation in computational imaging.
  • methods: The proposed method leverages a deep generative model to learn an approximate posterior distribution for image reconstruction uncertainty, using a flow-based model and gradient boosting for robustness and expressiveness.
  • results: The method is validated on several benchmark tasks and two real-world applications, demonstrating reliable and high-quality image reconstruction with robust uncertainty estimation.
    Abstract Computational imaging plays a pivotal role in determining hidden information from sparse measurements. A robust inverse solver is crucial to fully characterize the uncertainty induced by these measurements, as it allows for the estimation of the complete posterior of unrecoverable targets. This, in turn, facilitates a probabilistic interpretation of observational data for decision-making. In this study, we propose a deep variational framework that leverages a deep generative model to learn an approximate posterior distribution to effectively quantify image reconstruction uncertainty without the need for training data. We parameterize the target posterior using a flow-based model and minimize their Kullback-Leibler (KL) divergence to achieve accurate uncertainty estimation. To bolster stability, we introduce a robust flow-based model with bi-directional regularization and enhance expressivity through gradient boosting. Additionally, we incorporate a space-filling design to achieve substantial variance reduction on both latent prior space and target posterior space. We validate our method on several benchmark tasks and two real-world applications, namely fastMRI and black hole image reconstruction. Our results indicate that our method provides reliable and high-quality image reconstruction with robust uncertainty estimation.

DECDM: Document Enhancement using Cycle-Consistent Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.09625
  • repo_url: None
  • paper_authors: Jiaxin Zhang, Joy Rimchala, Lalla Mouatadid, Kamalika Das, Sricharan Kumar
  • for: To improve document image quality and thereby automatic document processing and document intelligence.
  • methods: An end-to-end, document-level image translation method based on diffusion models; the source and target models are trained independently rather than jointly, so domain-specific diffusion models can be applied to other domain pairs.
  • results: Compared with state-of-the-art methods on several synthetic and benchmark datasets, DECDM delivers superior document image quality both quantitatively and qualitatively.
    Abstract The performance of optical character recognition (OCR) heavily relies on document image quality, which is crucial for automatic document processing and document intelligence. However, most existing document enhancement methods require supervised data pairs, which raises concerns about data separation and privacy protection, and makes it challenging to adapt these methods to new domain pairs. To address these issues, we propose DECDM, an end-to-end document-level image translation method inspired by recent advances in diffusion models. Our method overcomes the limitations of paired training by independently training the source (noisy input) and target (clean output) models, making it possible to apply domain-specific diffusion models to other pairs. DECDM trains on one dataset at a time, eliminating the need to scan both datasets concurrently, and effectively preserving data privacy from the source or target domain. We also introduce simple data augmentation strategies to improve character-glyph conservation during translation. We compare DECDM with state-of-the-art methods on multiple synthetic data and benchmark datasets, such as document denoising and shadow removal, and demonstrate the superiority of performance quantitatively and qualitatively.

Apoptosis classification using attention based spatio temporal graph convolution neural network

  • paper_url: http://arxiv.org/abs/2311.09623
  • repo_url: None
  • paper_authors: Akash Awasthi
  • for: Proposing an attention-based spatio-temporal graph convolutional network for accurate classification of cell death (apoptosis).
  • methods: The method models the video sequence as a set of graphs and uses attention to account for the interactions among multiple target cells at each time stamp.
  • results: The method classifies target cells as dead or alive while capturing both spatial and temporal relationships.
    Abstract Accurate classification of apoptosis plays an important role in cell biology research. Many state-of-the-art approaches use deep CNNs to perform apoptosis classification, but these approaches do not account for cell interactions. Our paper proposes an attention-based spatio-temporal graph convolutional network to classify cell death from the target cells in a video. This method considers the interaction of multiple target cells at each time stamp: we model the whole video sequence as a set of graphs and classify each target cell in the video as dead or alive. Our method captures both spatial and temporal relationships.
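
A hedged sketch of one attention-weighted spatio-temporal graph convolution over per-frame cell graphs, in the spirit of the description above. The layer sizes, adjacency, and attention parameterization are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch (assumed, not the paper's code): one attention-weighted spatio-temporal
# graph convolution. X has shape (T, N, F): T frames, N target cells, F features per cell;
# A is the (N, N) cell-interaction adjacency.
import torch
import torch.nn as nn

class AttnSTGraphConv(nn.Module):
    def __init__(self, in_feats, out_feats, num_cells):
        super().__init__()
        self.spatial = nn.Linear(in_feats, out_feats)
        self.attn = nn.Parameter(torch.ones(num_cells, num_cells))   # learnable edge attention
        self.temporal = nn.Conv1d(out_feats, out_feats, kernel_size=3, padding=1)

    def forward(self, X, A):
        A_hat = torch.softmax(self.attn.masked_fill(A == 0, float("-inf")), dim=-1)
        H = torch.relu(A_hat @ self.spatial(X))          # spatial aggregation per frame
        H = H.permute(1, 2, 0)                           # (N, F_out, T) for the temporal conv
        H = self.temporal(H).permute(2, 0, 1)            # back to (T, N, F_out)
        return H

T_frames, N_cells, F_in = 8, 5, 16
A = (torch.rand(N_cells, N_cells) > 0.5).float()
A.fill_diagonal_(1.0)                                    # keep self-loops
X = torch.randn(T_frames, N_cells, F_in)
out = AttnSTGraphConv(F_in, 32, N_cells)(X, A)           # (8, 5, 32)
```

A per-cell dead/alive classifier would then pool `out` over time and apply a small MLP per node.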

Wildfire Smoke Detection with Cross Contrast Patch Embedding

  • paper_url: http://arxiv.org/abs/2311.10116
  • repo_url: None
  • paper_authors: Chong Wang, Cheng Xu, Adeel Akram, Zhilin Shan, Qixing Zhang
  • for: Improving the performance of Transformer-based deep networks for wildfire recognition, in particular their ability to extract smoke features.
  • methods: A Cross Contrast Patch Embedding (CCPE) module and a Separable Negative Sampling Mechanism (SNSM) that strengthen the network's extraction and discrimination of smoke features.
  • results: Extensive testing and evaluation on the RealFire Test dataset shows a significant performance improvement over baseline detection models.
    Abstract Transformer-based deep networks have increasingly shown significant advantages over CNNs, and some existing work has applied them to wildfire recognition and detection. However, we observed that the vanilla Transformer is not well suited to extracting smoke features: low-level information such as color, transparency, and texture is very important for smoke recognition, whereas the Transformer attends more to the semantic relevance between middle- or high-level features and is not sensitive to subtle spatial changes in low-level features. To solve this problem, we propose the Cross Contrast Patch Embedding (CCPE) module based on the Swin Transformer, which uses multi-scale spatial frequency contrast information in both vertical and horizontal directions to improve the network's discrimination of low-level details. The fuzzy boundary of smoke also makes positive and negative label assignment for instances a dilemma, which is another challenge for wildfire detection. To solve this problem, a Separable Negative Sampling Mechanism (SNSM) is proposed. By using two different negative-instance sampling strategies on positive and negative images respectively, the confusion of supervision signals caused by label diversity during network training is alleviated. This paper also releases the RealFire Test, the largest real wildfire test set so far, to evaluate the proposed method and promote future research. It contains 50,535 images from 3,649 video clips. The proposed method has been extensively tested and evaluated on the RealFire Test dataset and achieves a significant performance improvement over the baseline detection models.
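
As a rough illustration of the cross-contrast idea, the sketch below concatenates horizontal and vertical local-contrast maps at two scales with the input image before the patch-embedding convolution. The kernel choices, scales, and embedding size are assumptions, not the exact CCPE design.

```python
# Hedged sketch of a cross-contrast patch embedding: horizontal and vertical difference
# (contrast) maps at a couple of scales are concatenated with the raw image before the
# usual strided patch-embedding convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrast_maps(x, scales=(1, 2)):
    maps = []
    for s in scales:
        xs = F.avg_pool2d(x, s) if s > 1 else x
        dh = xs[..., :, 1:] - xs[..., :, :-1]            # horizontal contrast
        dv = xs[..., 1:, :] - xs[..., :-1, :]            # vertical contrast
        maps.append(F.interpolate(F.pad(dh, (0, 1)), size=x.shape[-2:]))
        maps.append(F.interpolate(F.pad(dv, (0, 0, 0, 1)), size=x.shape[-2:]))
    return torch.cat(maps, dim=1)

class CrossContrastPatchEmbed(nn.Module):
    def __init__(self, in_ch=3, embed_dim=96, patch=4, scales=(1, 2)):
        super().__init__()
        extra = 2 * len(scales) * in_ch                  # H + V contrast per scale
        self.proj = nn.Conv2d(in_ch + extra, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        x = torch.cat([x, contrast_maps(x)], dim=1)
        return self.proj(x)                              # (B, embed_dim, H/patch, W/patch)

tokens = CrossContrastPatchEmbed()(torch.rand(1, 3, 224, 224))   # (1, 96, 56, 56)
```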

Multi-Task Learning Approach for Unified Biometric Estimation from Fetal Ultrasound Anomaly Scans

  • paper_url: http://arxiv.org/abs/2311.09607
  • repo_url: https://github.com/BioMedIA-MBZUAI/Multi-Task-Learning-Approach-for-Unified-Biometric-Estimation-from-Fetal-Ultrasound-Anomaly-Scans
  • paper_authors: Mohammad Areeb Qazi, Mohammed Talha Alam, Ibrahim Almakky, Werner Gerhard Diehl, Leanne Bricker, Mohammad Yaqub
  • for: The paper is written for estimating fetal biometry parameters from ultrasound images, which is crucial for evaluating fetal growth, monitoring health, and identifying potential complications.
  • methods: The paper proposes a multi-task learning approach that combines classification and segmentation to estimate fetal biometrics. The approach uses a U-Net architecture with an added classification head, and leverages a weighted joint classification and segmentation loss function to train the model.
  • results: The paper achieves a mean absolute error (MAE) of 1.08 mm on head circumference, 1.44 mm on abdomen circumference, and 1.10 mm on femur length, with a classification accuracy of 99.91% on a dataset of fetal ultrasound images.
    Abstract Precise estimation of fetal biometry parameters from ultrasound images is vital for evaluating fetal growth, monitoring health, and identifying potential complications reliably. However, the automated computerized segmentation of the fetal head, abdomen, and femur from ultrasound images, along with the subsequent measurement of fetal biometrics, remains challenging. In this work, we propose a multi-task learning approach to classify the region into head, abdomen and femur as well as estimate the associated parameters. We were able to achieve a mean absolute error (MAE) of 1.08 mm on head circumference, 1.44 mm on abdomen circumference and 1.10 mm on femur length with a classification accuracy of 99.91% on a dataset of fetal ultrasound images. To achieve this, we leverage a weighted joint classification and segmentation loss function to train a U-Net architecture with an added classification head. The code can be accessed at https://github.com/BioMedIA-MBZUAI/Multi-Task-Learning-Approach-for-Unified-Biometric-Estimation-from-Fetal-Ultrasound-Anomaly-Scans.git
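
A minimal sketch of the weighted joint classification and segmentation objective the abstract describes, using a toy encoder-decoder with an added classification head. The tiny network, the loss weight, and the synthetic batch are illustrative assumptions, not the paper's U-Net.

```python
# Minimal sketch: a segmentation decoder plus a classification head trained with a
# weighted joint loss, as described in the abstract (sizes and weight are assumptions).
import torch
import torch.nn as nn

class SegWithClsHead(nn.Module):
    def __init__(self, num_classes=3):                    # head / abdomen / femur
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                     nn.MaxPool2d(2))
        self.decoder = nn.Sequential(nn.Upsample(scale_factor=2),
                                     nn.Conv2d(16, 1, 3, padding=1))
        self.cls_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(16, num_classes))

    def forward(self, x):
        feats = self.encoder(x)
        return self.decoder(feats), self.cls_head(feats)  # mask logits, class logits

model = SegWithClsHead()
seg_loss, cls_loss = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss()
lambda_cls = 0.5                                          # weighting between the two tasks

imgs = torch.rand(4, 1, 128, 128)
masks = torch.randint(0, 2, (4, 1, 128, 128)).float()
labels = torch.randint(0, 3, (4,))

mask_logits, class_logits = model(imgs)
loss = seg_loss(mask_logits, masks) + lambda_cls * cls_loss(class_logits, labels)
loss.backward()
```

Biometric measurements (circumference, length) would then be computed from the predicted mask of the region identified by the classification head.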

Gradual Source Domain Expansion for Unsupervised Domain Adaptation

  • paper_url: http://arxiv.org/abs/2311.09599
  • repo_url: https://github.com/ThomasWestfechtel/GSDE
  • paper_authors: Thomas Westfechtel, Hao-Wei Yeh, Dexuan Zhang, Tatsuya Harada
  • for: overcome the need for a large labeled dataset in unsupervised domain adaptation
  • methods: gradual source domain expansion (GSDE) algorithm, training the UDA task several times from scratch with target data expansion
  • results: outperform state-of-the-art methods on three benchmarks (Office-31, OfficeHome, and DomainNet) and improve the accuracy of a variety of different state-of-the-art UDA approaches
    Abstract Unsupervised domain adaptation (UDA) tries to overcome the need for a large labeled dataset by transferring knowledge from a source dataset, with lots of labeled data, to a target dataset, that has no labeled data. Since there are no labels in the target domain, early misalignment might propagate into the later stages and lead to an error build-up. In order to overcome this problem, we propose a gradual source domain expansion (GSDE) algorithm. GSDE trains the UDA task several times from scratch, each time reinitializing the network weights, but each time expands the source dataset with target data. In particular, the highest-scoring target data of the previous run are employed as pseudo-source samples with their respective pseudo-label. Using this strategy, the pseudo-source samples induce knowledge extracted from the previous run directly from the start of the new training. This helps align the two domains better, especially in the early training epochs. In this study, we first introduce a strong baseline network and apply our GSDE strategy to it. We conduct experiments and ablation studies on three benchmarks (Office-31, OfficeHome, and DomainNet) and outperform state-of-the-art methods. We further show that the proposed GSDE strategy can improve the accuracy of a variety of different state-of-the-art UDA approaches.
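
The expansion logic itself can be sketched independently of the underlying UDA method. In the skeleton below, `train_uda` and its dummy scores stand in for any UDA backbone; only the re-initialize, retrain, and pseudo-source expansion loop follows the description above, and the expansion fraction is an assumption.

```python
# Skeleton of a gradual source-domain expansion (GSDE) loop: each run retrains from
# scratch on source data plus the highest-scoring target samples from the previous run,
# used as pseudo-source samples with their pseudo-labels.
import torch

def train_uda(source_x, source_y, target_x):
    """Train one UDA run from scratch; return per-target-sample (confidence, pseudo_label)."""
    # ... any UDA backbone goes here; dummy scores for illustration only:
    conf = torch.rand(len(target_x))
    pseudo = torch.randint(0, 10, (len(target_x),))
    return conf, pseudo

def gsde(source_x, source_y, target_x, runs=3, frac=0.3):
    pseudo_x, pseudo_y = source_x[:0], source_y[:0]       # start with no pseudo-source data
    for r in range(runs):
        # each run re-initializes the network and trains on source + current pseudo-source
        conf, pseudo = train_uda(torch.cat([source_x, pseudo_x]),
                                 torch.cat([source_y, pseudo_y]),
                                 target_x)
        k = int(frac * (r + 1) * len(target_x))           # gradually expand the pseudo-source set
        top = conf.topk(min(k, len(target_x))).indices    # highest-scoring target samples
        pseudo_x, pseudo_y = target_x[top], pseudo[top]
    return pseudo_y                                       # final pseudo-labels for the target set

gsde(torch.randn(100, 8), torch.randint(0, 10, (100,)), torch.randn(60, 8))
```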

MARformer: An Efficient Metal Artifact Reduction Transformer for Dental CBCT Images

  • paper_url: http://arxiv.org/abs/2311.09590
  • repo_url: None
  • paper_authors: Yuxuan Shi, Jun Xu, Dinggang Shen
  • for: An efficient Transformer for metal artifact reduction (MAR) in dental CBCT images degraded by metal implants.
  • methods: A Dimension-Reduced Self-Attention (DRSA) module that exploits the globally similar structure of CBCT images, and a Patch-wise Perceptive Feed Forward Network (P2FFN) for fine-grained restoration.
  • results: On dental CBCT images with synthetic and real-world metal artifacts, MARformer is efficient and outperforms previous MAR methods and two restoration Transformers.
    Abstract Cone Beam Computed Tomography (CBCT) plays a key role in dental diagnosis and surgery. However, metal dental implants can introduce severe metal artifacts during the CBCT imaging process, interfering with diagnosis and downstream processing such as tooth segmentation. In this paper, we develop an efficient Transformer to perform metal artifact reduction (MAR) on dental CBCT images. The proposed MAR Transformer (MARformer) reduces the computational complexity of multi-head self-attention with a new Dimension-Reduced Self-Attention (DRSA) module, based on the observation that CBCT images have a globally similar structure. A Patch-wise Perceptive Feed Forward Network (P2FFN) is also proposed to perceive local image information for fine-grained restoration. Experimental results on CBCT images with synthetic and real-world metal artifacts show that our MARformer is efficient and outperforms previous MAR methods and two restoration Transformers.
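
One common way to obtain the kind of reduced-cost self-attention the abstract refers to is to compute keys and values on a spatially reduced token set. The sketch below shows that pattern; the reduction ratio and layout are assumptions and may differ from the exact DRSA design.

```python
# Hedged sketch: self-attention whose keys/values come from a spatially reduced token set,
# one common way to cut the quadratic attention cost on image tokens.
import torch
import torch.nn as nn

class ReducedSelfAttention(nn.Module):
    def __init__(self, dim=64, heads=4, reduce=4):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.reduce = nn.Conv2d(dim, dim, kernel_size=reduce, stride=reduce)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, h, w):                           # x: (B, h*w, dim) image tokens
        B, N, C = x.shape
        small = self.reduce(x.transpose(1, 2).reshape(B, C, h, w))   # (B, C, h/r, w/r)
        small = small.flatten(2).transpose(1, 2)                     # (B, N/r^2, C)
        k, v = self.kv(small).chunk(2, dim=-1)
        out, _ = self.attn(self.q(x), k, v)                          # attend to far fewer tokens
        return out

x = torch.randn(2, 32 * 32, 64)
y = ReducedSelfAttention()(x, 32, 32)                     # (2, 1024, 64)
```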

3D Paintbrush: Local Stylization of 3D Shapes with Cascaded Score Distillation

  • paper_url: http://arxiv.org/abs/2311.09571
  • repo_url: None
  • paper_authors: Dale Decatur, Itai Lang, Kfir Aberman, Rana Hanocka
  • for: Developing a technique that automatically textures local semantic regions on meshes from text descriptions.
  • methods: The method operates directly on meshes and produces texture maps that integrate into standard graphics pipelines; it simultaneously generates a localization map and a conforming texture map, and supervises the local editing with multiple stages of a cascaded diffusion model to enhance detail and resolution.
  • results: The approach locally textures a variety of shapes within different semantic regions, with control over both the granularity and the global understanding of the supervision. Project page: https://threedle.github.io/3d-paintbrush
    Abstract In this work we develop 3D Paintbrush, a technique for automatically texturing local semantic regions on meshes via text descriptions. Our method is designed to operate directly on meshes, producing texture maps which seamlessly integrate into standard graphics pipelines. We opt to simultaneously produce a localization map (to specify the edit region) and a texture map which conforms to it. This synergistic approach improves the quality of both the localization and the stylization. To enhance the details and resolution of the textured area, we leverage multiple stages of a cascaded diffusion model to supervise our local editing technique with generative priors learned from images at different resolutions. Our technique, referred to as Cascaded Score Distillation (CSD), simultaneously distills scores at multiple resolutions in a cascaded fashion, enabling control over both the granularity and global understanding of the supervision. We demonstrate the effectiveness of 3D Paintbrush to locally texture a variety of shapes within different semantic regions. Project page: https://threedle.github.io/3d-paintbrush
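
A heavily simplified sketch of a score-distillation-style update applied at two resolutions ("cascaded"), in the spirit of the abstract. The frozen denoisers are tiny stand-ins, the noising rule is a toy approximation, and the weighting is illustrative; none of this is the authors' implementation.

```python
# Hedged sketch: optimize a texture with a score-distillation-style loss at two resolutions.
# The frozen "denoisers" are placeholders for pretrained diffusion priors.
import torch
import torch.nn as nn
import torch.nn.functional as F

denoisers = {64: nn.Conv2d(3, 3, 3, padding=1), 256: nn.Conv2d(3, 3, 3, padding=1)}
for d in denoisers.values():
    d.requires_grad_(False)                               # diffusion priors stay frozen

texture = torch.rand(1, 3, 256, 256, requires_grad=True)  # the local texture being optimized
opt = torch.optim.Adam([texture], lr=1e-2)

for step in range(100):
    loss = 0.0
    for res, denoiser in denoisers.items():
        img = F.interpolate(texture, size=res, mode="bilinear", align_corners=False)
        t = torch.rand(1)                                  # random diffusion time in (0, 1)
        noise = torch.randn_like(img)
        noisy = (1 - t) * img + t.sqrt() * noise           # toy forward-noising (assumption)
        eps_hat = denoiser(noisy)
        # standard SDS trick: an MSE whose gradient w.r.t. the image is (eps_hat - noise),
        # with no gradient flowing back through the denoiser (the target is detached)
        loss = loss + 0.5 * ((img - (img - (eps_hat - noise)).detach()) ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()
```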

Temporal-Aware Refinement for Video-based Human Pose and Shape Recovery

  • paper_url: http://arxiv.org/abs/2311.09543
  • repo_url: None
  • paper_authors: Ming Chen, Yan Zhou, Weihua Jian, Pengfei Wan, Zhongyuan Wang
  • for: Accurate and temporally consistent recovery of human pose and shape from videos.
  • methods: A global transformer encoder extracts temporal-aware global features, a bidirectional ConvGRU network produces high-resolution temporal local feature maps, and a recurrent refinement module iteratively updates the estimated SMPL parameters to obtain accurate and smooth results.
  • results: The method achieves more accurate results than previous state-of-the-art methods on the 3DPW, MPI-INF-3DHP, and Human3.6M benchmarks.
    Abstract Though significant progress in human pose and shape recovery from monocular RGB images has been made in recent years, obtaining 3D human motion with high accuracy and temporal consistency from videos remains challenging. Existing video-based methods tend to reconstruct human motion from global image features, which lack detailed representation capability and limit the reconstruction accuracy. In this paper, we propose a Temporal-Aware Refining Network (TAR), to synchronously explore temporal-aware global and local image features for accurate pose and shape recovery. First, a global transformer encoder is introduced to obtain temporal global features from static feature sequences. Second, a bidirectional ConvGRU network takes the sequence of high-resolution feature maps as input, and outputs temporal local feature maps that maintain high resolution and capture the local motion of the human body. Finally, a recurrent refinement module iteratively updates estimated SMPL parameters by leveraging both global and local temporal information to achieve accurate and smooth results. Extensive experiments demonstrate that our TAR obtains more accurate results than previous state-of-the-art methods on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
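
A minimal sketch of the recurrent refinement step described above: a GRU cell fuses global and local temporal features and emits an additive update to the current SMPL parameter estimate at each iteration. The feature dimensions and the 85-dimensional SMPL parameter vector are assumptions.

```python
# Minimal sketch (assumed shapes, not the paper's code) of recurrent SMPL refinement.
import torch
import torch.nn as nn

class RecurrentSMPLRefiner(nn.Module):
    def __init__(self, feat_dim=256, smpl_dim=85, iters=3):   # 85 ~ pose + shape + camera
        super().__init__()
        self.iters = iters
        self.gru = nn.GRUCell(feat_dim * 2 + smpl_dim, feat_dim)
        self.to_delta = nn.Linear(feat_dim, smpl_dim)

    def forward(self, global_feat, local_feat, theta_init):
        h = torch.zeros(global_feat.size(0), self.gru.hidden_size)
        theta = theta_init
        for _ in range(self.iters):
            inp = torch.cat([global_feat, local_feat, theta], dim=-1)
            h = self.gru(inp, h)
            theta = theta + self.to_delta(h)                  # iterative additive update
        return theta

refiner = RecurrentSMPLRefiner()
theta = refiner(torch.randn(4, 256), torch.randn(4, 256), torch.zeros(4, 85))
```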

FedFusion: Manifold Driven Federated Learning for Multi-satellite and Multi-modality Fusion

  • paper_url: http://arxiv.org/abs/2311.09540
  • repo_url: https://github.com/ldxdu/fedfusion
  • paper_authors: DaiXun Li, Weiying Xie, Yunsong Li, Leyuan Fang
  • for: Fusing multi-satellite, multi-modality remote sensing data, which is difficult because of differing sensor imaging characteristics and non-IID data distributions.
  • methods: A manifold-driven multi-modality fusion framework, FedFusion, that randomly samples local data on each client to jointly estimate the prominent manifold structure of each client's shallow features and compresses the feature matrices into a low-rank subspace that serves as input to the subsequent classifier.
  • results: FedFusion achieves an average classification accuracy of 94.35% on three multimodal datasets while compressing communication costs by a factor of 4; on an orbiting edge-computing architecture based on Jetson TX2 industrial modules, it reduces training time by 48.4 minutes (15.18%) while optimizing accuracy.
    Abstract Multi-satellite, multi-modality in-orbit fusion is a challenging task as it explores the fusion representation of complex high-dimensional data under limited computational resources. Deep neural networks can reveal the underlying distribution of multi-modal remote sensing data, but the in-orbit fusion of multimodal data is more difficult because of the limitations of different sensor imaging characteristics, especially when the multimodal data follows non-independent identically distribution (Non-IID) distributions. To address this problem while maintaining classification performance, this paper proposes a manifold-driven multi-modality fusion framework, FedFusion, which randomly samples local data on each client to jointly estimate the prominent manifold structure of shallow features of each client and explicitly compresses the feature matrices into a low-rank subspace through cascading and additive approaches, which is used as the feature input of the subsequent classifier. Considering the physical space limitations of the satellite constellation, we developed a multimodal federated learning module designed specifically for manifold data in a deep latent space. This module achieves iterative updating of the sub-network parameters of each client through global weighted averaging, constructing a framework that can represent compact representations of each client. The proposed framework surpasses existing methods in terms of performance on three multimodal datasets, achieving a classification average accuracy of 94.35% while compressing communication costs by a factor of 4. Furthermore, extensive numerical evaluations of real-world satellite images were conducted on the orbiting edge computing architecture based on Jetson TX2 industrial modules, which demonstrated that FedFusion significantly reduced training time by 48.4 minutes (15.18%) while optimizing accuracy.
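
The global weighted averaging used to update client sub-networks can be illustrated with generic FedAvg-style aggregation, weighting each client's parameters by its local data size. This sketch omits FedFusion's manifold estimation and low-rank feature compression.

```python
# Minimal sketch of global weighted averaging across clients (FedAvg-style aggregation).
import copy
import torch
import torch.nn as nn

def weighted_average(client_models, client_sizes):
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_models[0].state_dict())
    for key in global_state:
        global_state[key] = sum(
            m.state_dict()[key].float() * (n / total)
            for m, n in zip(client_models, client_sizes)
        )
    return global_state

clients = [nn.Linear(16, 4) for _ in range(3)]       # stand-ins for per-satellite sub-networks
sizes = [120, 80, 200]                               # local sample counts per client
global_state = weighted_average(clients, sizes)
for c in clients:                                    # broadcast the aggregated weights back
    c.load_state_dict(global_state)
```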

Pseudo-keypoints RKHS Learning for Self-supervised 6DoF Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.09500
  • repo_url: None
  • paper_authors: Yangzheng Wu, Michael Greenspan
  • for: bridging the simulation-to-real domain gap in 6DoF PE
  • methods: a learnable kernel in an RKHS combined with a self-supervised keypoint radial-voting-based 6DoF PE framework
  • results: state-of-the-art performance on three commonly used 6DoF PE datasets, LINEMOD (+4.2%), Occlusion LINEMOD (+2%), and YCB-Video (+3%), and performance comparable to fully supervised methods on all six applicable BOP core datasets (within -10.8% to -0.3% of the top fully supervised results)
    Abstract This paper addresses the simulation-to-real domain gap in 6DoF PE, and proposes a novel self-supervised keypoint radial voting-based 6DoF PE framework, effectively narrowing this gap using a learnable kernel in RKHS. We formulate this domain gap as a distance in high-dimensional feature space, distinct from previous iterative matching methods. We propose an adapter network, which evolves the network parameters from the source domain, which has been massively trained on synthetic data with synthetic poses, to the target domain, which is trained on real data. Importantly, the real data training only uses pseudo-poses estimated by pseudo-keypoints, and thereby requires no real groundtruth data annotations. RKHSPose achieves state-of-the-art performance on three commonly used 6DoF PE datasets including LINEMOD (+4.2%), Occlusion LINEMOD (+2%), and YCB-Video (+3%). It also compares favorably to fully supervised methods on all six applicable BOP core datasets, achieving within -10.8% to -0.3% of the top fully supervised results.
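
One standard way to measure a domain gap as a distance in an RKHS is the maximum mean discrepancy (MMD) under a kernel with a learnable bandwidth. The sketch below illustrates only that notion of "a learnable kernel in RKHS"; it is not the paper's exact formulation or training setup.

```python
# Hedged sketch: MMD between synthetic-domain and real-domain features under a Gaussian
# kernel with a learnable bandwidth, as one concrete instance of an RKHS distance.
import torch
import torch.nn as nn

class LearnableRBFMMD(nn.Module):
    def __init__(self):
        super().__init__()
        self.log_bw = nn.Parameter(torch.zeros(1))        # learnable kernel bandwidth

    def kernel(self, a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * torch.exp(self.log_bw)))

    def forward(self, src, tgt):
        return (self.kernel(src, src).mean()
                + self.kernel(tgt, tgt).mean()
                - 2 * self.kernel(src, tgt).mean())

mmd = LearnableRBFMMD()
gap = mmd(torch.randn(32, 128), torch.randn(32, 128))     # synthetic vs. real features
gap.backward()                                            # the gap can be minimized during adaptation
```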

Center Focusing Network for Real-Time LiDAR Panoptic Segmentation

  • paper_url: http://arxiv.org/abs/2311.09499
  • repo_url: https://github.com/gangzhang842/cfnet
  • paper_authors: Xiaoyan Li, Gang Zhang, Boyue Wang, Yongli Hu, Baocai Yin
  • for: Real-time LiDAR panoptic segmentation, enabling an autonomous vehicle to comprehensively understand surrounding objects and scenes.
  • methods: A new center focusing network (CFNet) with a center focusing feature encoding (CFFE) and a fast center deduplication module (CDM) for accurate, real-time LiDAR panoptic segmentation.
  • results: CFNet outperforms all other methods by a large margin on the SemanticKITTI and nuScenes panoptic segmentation benchmarks and runs 1.6 times faster than the most efficient existing method.
    Abstract LiDAR panoptic segmentation facilitates an autonomous vehicle to comprehensively understand the surrounding objects and scenes and is required to run in real time. The recent proposal-free methods accelerate the algorithm, but their effectiveness and efficiency are still limited owing to the difficulty of modeling non-existent instance centers and the costly center-based clustering modules. To achieve accurate and real-time LiDAR panoptic segmentation, a novel center focusing network (CFNet) is introduced. Specifically, the center focusing feature encoding (CFFE) is proposed to explicitly understand the relationships between the original LiDAR points and virtual instance centers by shifting the LiDAR points and filling in the center points. Moreover, to leverage the redundantly detected centers, a fast center deduplication module (CDM) is proposed to select only one center for each instance. Experiments on the SemanticKITTI and nuScenes panoptic segmentation benchmarks demonstrate that our CFNet outperforms all existing methods by a large margin and is 1.6 times faster than the most efficient method. The code is available at https://github.com/GangZhang842/CFNet.
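
A minimal sketch of what a fast center-deduplication step can look like: among redundantly detected instance centers, greedily keep the highest-scoring center within a given radius. The radius and the greedy strategy are assumptions about how such a module could work, not CFNet's exact CDM.

```python
# Minimal sketch of greedy center deduplication: keep one center per instance by suppressing
# lower-scoring centers that fall within `radius` of an already kept center.
import torch

def dedup_centers(centers, scores, radius=0.5):
    """centers: (N, 3) predicted instance centers; scores: (N,) confidences."""
    order = scores.argsort(descending=True)
    kept = []
    for idx in order.tolist():
        c = centers[idx]
        if all(torch.norm(c - centers[k]) > radius for k in kept):
            kept.append(idx)                              # highest-scoring center wins
    return centers[kept], scores[kept]

centers = torch.tensor([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [3.0, 1.0, 0.0]])
scores = torch.tensor([0.9, 0.7, 0.8])
kept_centers, kept_scores = dedup_centers(centers, scores)   # two centers survive
```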