cs.CV - 2023-11-20

HandSight: DeCAF & Improved Fisher Vectors to Classify Clothing Color and Texture with a Finger-Mounted Camera

  • paper_url: http://arxiv.org/abs/2311.12225
  • repo_url: None
  • paper_authors: Alexander J. Medeiros, Lee Stearns, Jon E. Froehlich
  • for: Addressing the everyday problem blind people face when choosing clothes, using a finger-mounted camera and state-of-the-art classification algorithms.
  • methods: Uses DeCAF and Improved Fisher Vector image features to classify clothing texture (a minimal sketch follows this entry).
  • results: Achieves >95% accuracy and contributes the HCTD, a dataset of close-up clothing images, together with evaluations of state-of-the-art classification algorithms.
    Abstract We demonstrate the use of DeCAF and Improved Fisher Vector image features to classify clothing texture. The issue of choosing clothes is a problem for the blind every day. This work attempts to solve the issue with a finger-mounted camera and state-of-the-art classification algorithms. To evaluate our solution, we collected 520 close-up images across 29 pieces of clothing. We contribute (1) the HCTD, an image dataset taken with a NanEyeGS camera, a camera small enough to be mounted on the finger, and (2) evaluations of state-of-the-art recognition algorithms applied to our dataset - achieving an accuracy >95%. Throughout the paper, we will discuss previous work, evaluate the current work, and finally, suggest the project's future direction.
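The classification step described above is, at its core, deep features plus a linear classifier. Below is a hypothetical Python sketch of that setup, not the authors' pipeline: random arrays stand in for DeCAF or Improved Fisher Vector descriptors of HCTD images, and the dataset sizes and the LinearSVC choice are assumptions.

```python
# Hypothetical sketch: texture classification from pre-extracted deep features.
# The random arrays below stand in for real DeCAF / Improved Fisher Vector descriptors.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_images, feat_dim, n_classes = 520, 4096, 29        # sizes loosely mirror the HCTD
X = rng.normal(size=(n_images, feat_dim))            # placeholder image descriptors
y = rng.integers(0, n_classes, size=n_images)        # placeholder texture labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=5000))
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```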

DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation

  • paper_url: http://arxiv.org/abs/2311.12194
  • repo_url: None
  • paper_authors: Yifei Li, Hsiao-yu Chen, Egor Larionov, Nikolaos Sarafianos, Wojciech Matusik, Tuur Stuyck
  • for: The purpose of this paper is to improve the realism of digital avatars, enabling telepresence applications with self-expression and customization.
  • methods: The paper uses a novel approach called differentiable simulation to perform body and garment co-optimization.
  • results: The experimental results show that the proposed approach can generate realistic clothing and body shapes that can be easily used in downstream applications.
    Abstract The realism of digital avatars is crucial in enabling telepresence applications with self-expression and customization. A key aspect of this realism originates from the physical accuracy of both a true-to-life body shape and clothing. While physical simulations can produce high-quality, realistic motions for clothed humans, they require precise estimation of body shape and high-quality garment assets with associated physical parameters for cloth simulations. However, manually creating these assets and calibrating their parameters is labor-intensive and requires specialized expertise. To address this gap, we propose DiffAvatar, a novel approach that performs body and garment co-optimization using differentiable simulation. By integrating physical simulation into the optimization loop and accounting for the complex nonlinear behavior of cloth and its intricate interaction with the body, our framework recovers body and garment geometry and extracts important material parameters in a physically plausible way. Our experiments demonstrate that our approach generates realistic clothing and body shape that can be easily used in downstream applications.
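To make "integrating physical simulation into the optimization loop" concrete, here is a heavily simplified PyTorch toy, not DiffAvatar's cloth solver: a one-dimensional damped spring is simulated with differentiable update steps, and its stiffness is recovered by gradient descent on a trajectory-matching loss. All constants are made up.

```python
# Toy differentiable simulation: recover a material parameter by backpropagating
# through the simulator itself (the core idea, at miniature scale).
import torch

def simulate(stiffness, steps=200, dt=0.01, damping=0.2):
    x, v, traj = torch.tensor(1.0), torch.tensor(0.0), []
    for _ in range(steps):                    # semi-implicit Euler integration
        a = -stiffness * x - damping * v
        v = v + dt * a
        x = x + dt * v
        traj.append(x)
    return torch.stack(traj)

target = simulate(torch.tensor(5.0)).detach() # "observed" trajectory, true stiffness 5.0
k = torch.tensor(1.0, requires_grad=True)     # initial guess for the parameter
opt = torch.optim.Adam([k], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    loss = ((simulate(k) - target) ** 2).mean()
    loss.backward()                           # gradients flow through every simulation step
    opt.step()
print("recovered stiffness:", round(k.item(), 2))   # typically close to 5.0
```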

Disentangling Structure and Appearance in ViT Feature Space

  • paper_url: http://arxiv.org/abs/2311.12193
  • repo_url: None
  • paper_authors: Narek Tumanyan, Omer Bar-Tal, Shir Amir, Shai Bagon, Tali Dekel
  • for: The paper is written for transferring the visual appearance of one natural image to another, specifically for generating an image where objects in a source structure image are “painted” with the visual appearance of their semantically related objects in a target appearance image.
  • methods: The paper uses a pre-trained and fixed Vision Transformer (ViT) model to leverage semantic information and derive novel disentangled representations of structure and appearance. The objective function splices the desired structure and appearance representations together in the space of ViT features.
  • results: The paper demonstrates high-resolution results on a variety of in-the-wild image pairs, under significant variations in the number of objects, pose, and appearance, without requiring adversarial training or additional input information such as semantic segmentation or correspondences.
    Abstract We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image. To integrate semantic information into our framework, our key idea is to leverage a pre-trained and fixed Vision Transformer (ViT) model. Specifically, we derive novel disentangled representations of structure and appearance extracted from deep ViT features. We then establish an objective function that splices the desired structure and appearance representations, interweaving them together in the space of ViT features. Based on our objective function, we propose two frameworks of semantic appearance transfer -- "Splice", which works by training a generator on a single and arbitrary pair of structure-appearance images, and "SpliceNet", a feed-forward real-time appearance transfer model trained on a dataset of images from a specific domain. Our frameworks do not involve adversarial training, nor do they require any additional input information such as semantic segmentation or correspondences. We demonstrate high-resolution results on a variety of in-the-wild image pairs, under significant variations in the number of objects, pose, and appearance. Code and supplementary material are available in our project page: splice-vit.github.io.

LABELMAKER: Automatic Semantic Label Generation from RGB-D Trajectories

  • paper_url: http://arxiv.org/abs/2311.12174
  • repo_url: None
  • paper_authors: Silvan Weder, Hermann Blum, Francis Engelmann, Marc Pollefeys
  • for: Proposes a fully automated framework for generating 2D/3D semantic labels used to train or evaluate perception models.
  • methods: The framework builds on an ensemble of state-of-the-art segmentation models and 3D lifting through neural rendering, producing high-quality 2D/3D labels without any human intervention (a minimal fusion sketch follows this entry).
  • results: It generates labels at equal or better accuracy than the manually annotated ScanNet dataset and automatically labels the previously unlabeled ARKitScenes dataset.
    Abstract Semantic annotations are indispensable to train or evaluate perception models, yet very costly to acquire. This work introduces a fully automated 2D/3D labeling framework that, without any human intervention, can generate labels for RGB-D scans at equal (or better) level of accuracy than comparable manually annotated datasets such as ScanNet. Our approach is based on an ensemble of state-of-the-art segmentation models and 3D lifting through neural rendering. We demonstrate the effectiveness of our LabelMaker pipeline by generating significantly better labels for the ScanNet datasets and automatically labelling the previously unlabeled ARKitScenes dataset. Code and models are available at https://labelmaker.org
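A hypothetical sketch of the ensemble step only: per-pixel class probabilities from several 2D segmentation models are averaged into a consensus label and a confidence map. The model outputs here are random placeholders, and LabelMaker's actual fusion and 3D lifting via neural rendering are more involved.

```python
# Placeholder ensemble fusion: average softmax maps from several segmentation models,
# then take the per-pixel consensus label and keep its confidence for later filtering.
import numpy as np

n_models, h, w, n_classes = 3, 240, 320, 21
rng = np.random.default_rng(0)

logits = rng.normal(size=(n_models, h, w, n_classes))          # stand-in model outputs
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

fused = probs.mean(axis=0)               # simple average over the ensemble
labels = fused.argmax(axis=-1)           # consensus semantic label per pixel
confidence = fused.max(axis=-1)          # could be thresholded before 3D lifting
print(labels.shape, float(confidence.mean()))
```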

ChemScraper: Graphics Extraction, Molecular Diagram Parsing, and Annotated Data Generation for PDF Images

  • paper_url: http://arxiv.org/abs/2311.12161
  • repo_url: https://gitlab.com/dprl/graphics-extraction
  • paper_authors: Ayush Kumar Shah, Bryan Manrique Amador, Abhisek Dey, Ming Creekmore, Blake Ocampo, Scott Denmark, Richard Zanibbi
  • for: Proposes a method for translating born-digital PDF molecule images into a chemical structure representation (CDXML).
  • methods: Extracts symbols from born-digital PDF images using the explicit locations and shapes of characters, lines, and polygons, then applies simple graph transformations to capture both visual and chemical structure (see the extraction sketch after this entry).
  • results: The fast (PDF $\rightarrow$ visual graph $\rightarrow$ chemical graph) pipeline requires no GPUs, Optical Character Recognition (OCR), or vectorization, and parses born-digital PDF images with high accuracy when compared against SMILES strings on standard benchmarks.
    Abstract Existing visual parsers for molecule diagrams translate pixel-based raster images such as PNGs to chemical structure representations (e.g., SMILES). However, PDFs created by word processors including \LaTeX{} and Word provide explicit locations and shapes for characters, lines, and polygons. We introduce a method to extract symbols from born-digital PDF molecule images and then apply simple graph transformations to capture both visual and chemical structure in editable ChemDraw files (CDXML). Our fast (PDF $\rightarrow$ visual graph $\rightarrow$ chemical graph) pipeline does not require GPUs, Optical Character Recognition (OCR) or vectorization. We evaluate on standard benchmarks using SMILES strings, along with a novel evaluation that provides graph-based metrics and error compilation using LgEval. The geometric information in born-digital PDFs produces a highly accurate parser, motivating generating training data for visual parsers that recognize from raster images, with extracted graphics, visual structure, and chemical structure as annotations. To do this we render SMILES strings in Indigo, parse molecule structure, and then validate recognized structure to select correct files.
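The pipeline's key input is the explicit geometry stored in born-digital PDFs. The sketch below uses pdfminer.six to pull out characters and line segments with their bounding boxes; it covers only this extraction stage, "molecule.pdf" is a placeholder path, and the graph transformations to CDXML are not reproduced.

```python
# Extract character glyphs and line segments (with exact coordinates) from a
# born-digital PDF using pdfminer.six; a later stage would turn these into a graph.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTLine

def walk(obj, chars, lines):
    if isinstance(obj, LTChar):                    # atom labels, charges, digits
        chars.append((obj.get_text(), obj.bbox))
    elif isinstance(obj, LTLine):                  # bond strokes
        lines.append(obj.bbox)
    elif hasattr(obj, "__iter__"):                 # pages, text boxes, figures, ...
        for child in obj:
            walk(child, chars, lines)

chars, lines = [], []
for page in extract_pages("molecule.pdf"):         # placeholder input file
    walk(page, chars, lines)
print(f"{len(chars)} characters and {len(lines)} line segments with exact coordinates")
```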

Model-aware 3D Eye Gaze from Weak and Few-shot Supervisions

  • paper_url: http://arxiv.org/abs/2311.12157
  • repo_url: https://github.com/dimitris-christodoulou57/model-aware_3d_eye_gaze
  • paper_authors: Nikola Popovic, Dimitrios Christodoulou, Danda Pani Paudel, Xi Wang, Luc Van Gool
  • for: Proposes a method for predicting 3D eye gaze from weak supervision in the form of eye semantic segmentation masks.
  • methods: Combines large amounts of weak annotations (eye semantic segmentation masks) with direct supervision from only a few 3D gaze vectors to fit the 3D eye model, using a transformer-based network architecture as the baseline (a loss sketch follows this entry).
  • results: Experiments in diverse settings show significant benefits, with roughly 5 degrees lower angular gaze error than the baseline when only 0.05% of the training images carry 3D annotations.
    Abstract The task of predicting 3D eye gaze from eye images can be performed either by (a) end-to-end learning for image-to-gaze mapping or by (b) fitting a 3D eye model onto images. The former case requires 3D gaze labels, while the latter requires eye semantics or landmarks to facilitate the model fitting. Although obtaining eye semantics and landmarks is relatively easy, fitting an accurate 3D eye model on them remains to be very challenging due to its ill-posed nature in general. On the other hand, obtaining large-scale 3D gaze data is cumbersome due to the required hardware setups and computational demands. In this work, we propose to predict 3D eye gaze from weak supervision of eye semantic segmentation masks and direct supervision of a few 3D gaze vectors. The proposed method combines the best of both worlds by leveraging large amounts of weak annotations--which are easy to obtain, and only a few 3D gaze vectors--which alleviate the difficulty of fitting 3D eye models on the semantic segmentation of eye images. Thus, the eye gaze vectors, used in the model fitting, are directly supervised using the few-shot gaze labels. Additionally, we propose a transformer-based network architecture, that serves as a solid baseline for our improvements. Our experiments in diverse settings illustrate the significant benefits of the proposed method, achieving about 5 degrees lower angular gaze error over the baseline, when only 0.05% 3D annotations of the training images are used. The source code is available at https://github.com/dimitris-christodoulou57/Model-aware_3D_Eye_Gaze.
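A hedged sketch of the supervision mix: every sample contributes a segmentation loss, while only the few samples carrying a 3D gaze label add an angular term. The tensor shapes, cosine-based gaze loss, and 0.1 weight are assumptions, not the paper's exact formulation.

```python
# Weak (segmentation) supervision on every sample plus direct gaze supervision
# on the few samples that have a 3D label, combined into one training loss.
import torch
import torch.nn.functional as F

def combined_loss(pred_seg, gt_seg, pred_gaze, gt_gaze, has_gaze, gaze_weight=0.1):
    # pred_seg: (B, C, H, W) logits; gt_seg: (B, H, W) class ids
    # pred_gaze, gt_gaze: (B, 3) unit vectors; has_gaze: (B,) bool mask
    seg_loss = F.cross_entropy(pred_seg, gt_seg)
    cos = F.cosine_similarity(pred_gaze, gt_gaze, dim=-1)          # 1 = perfect alignment
    gaze_loss = ((1.0 - cos) * has_gaze.float()).sum() / has_gaze.float().sum().clamp(min=1.0)
    return seg_loss + gaze_weight * gaze_loss

B, C, H, W = 4, 4, 64, 64
loss = combined_loss(torch.randn(B, C, H, W), torch.randint(0, C, (B, H, W)),
                     F.normalize(torch.randn(B, 3), dim=-1),
                     F.normalize(torch.randn(B, 3), dim=-1),
                     torch.tensor([True, False, False, False]))
print(float(loss))
```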

Uncertainty Estimation in Contrast-Enhanced MR Image Translation with Multi-Axis Fusion

  • paper_url: http://arxiv.org/abs/2311.12153
  • repo_url: None
  • paper_authors: Ivo M. Baltruschat, Parvaneh Janbakhshi, Melanie Dohmen, Matthias Lenga
  • for: Targets the estimation of epistemic uncertainty in 3D medical image-to-image translation tasks.
  • methods: Proposes Multi-Axis Fusion (MAF), a model uncertainty quantification method that integrates complementary information derived from multiple views of volumetric image data (a toy sketch follows this entry).
  • results: On the task of synthesizing contrast-enhanced T1-weighted images from native T1, T2, and T2-FLAIR scans, the method shows a strong correlation ($\rho_{\text{healthy}} = 0.89$) between the mean absolute image synthetization error and the mean uncertainty score.
    Abstract In recent years, deep learning has been applied to a wide range of medical imaging and image processing tasks. In this work, we focus on the estimation of epistemic uncertainty for 3D medical image-to-image translation. We propose a novel model uncertainty quantification method, Multi-Axis Fusion (MAF), which relies on the integration of complementary information derived from multiple views on volumetric image data. The proposed approach is applied to the task of synthesizing contrast enhanced T1-weighted images based on native T1, T2 and T2-FLAIR scans. The quantitative findings indicate a strong correlation ($\rho_{\text{healthy}} = 0.89$) between the mean absolute image synthetization error and the mean uncertainty score for our MAF method. Hence, we consider MAF as a promising approach to solve the highly relevant task of detecting synthetization failures at inference time.
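A minimal numpy illustration of the multi-axis idea, assuming one synthesis per viewing axis has already been resampled to a common grid: the per-voxel mean serves as the fused prediction and the per-voxel standard deviation as an uncertainty map. The real MAF model is a trained network rather than this simple statistic.

```python
# Per-voxel agreement across axial / coronal / sagittal predictions as a crude
# proxy for model uncertainty; the arrays below are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
volume_shape = (64, 64, 64)
predictions = np.stack(
    [rng.normal(loc=1.0, scale=0.05, size=volume_shape) for _ in range(3)]
)   # one synthesized volume per axis, already on a common grid

fused = predictions.mean(axis=0)          # fused synthetic image
uncertainty = predictions.std(axis=0)     # large where the axes disagree
print("mean voxel uncertainty:", float(uncertainty.mean()))
```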

Applications of Large Scale Foundation Models for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2311.12144
  • repo_url: None
  • paper_authors: Yu Huang, Yue Chen, Zhu Li
  • for: Surveys how foundation models and large language models (LLMs) can be applied to autonomous driving systems to address the current long-tailed AI dilemma.
  • methods: Reviews techniques of foundation models and LLMs applied to autonomous driving, categorized as simulation, world models, data annotation, and planning or end-to-end (E2E) solutions.
  • results: Argues that combining LLMs with foundation models makes it possible to bring human knowledge, commonsense, and reasoning into rebuilding autonomous driving systems.
    Abstract Since DARPA Grand Challenges (rural) in 2004/05 and Urban Challenges in 2007, autonomous driving has been the most active field of AI applications. Recently powered by large language models (LLMs), chat systems, such as chatGPT and PaLM, emerge and rapidly become a promising direction to achieve artificial general intelligence (AGI) in natural language processing (NLP). There comes a natural thinking that we could employ these abilities to reformulate autonomous driving. By combining LLM with foundation models, it is possible to utilize the human knowledge, commonsense and reasoning to rebuild autonomous driving systems from the current long-tailed AI dilemma. In this paper, we investigate the techniques of foundation models and LLMs applied for autonomous driving, categorized as simulation, world model, data annotation and planning or E2E solutions etc.

Fingerspelling PoseNet: Enhancing Fingerspelling Translation with Pose-Based Transformer Models

  • paper_url: http://arxiv.org/abs/2311.12128
  • repo_url: None
  • paper_authors: Pooya Fayyazsanavi, Negar Nejatishahidin, Jana Kosecka
  • for: Improving the accuracy of American Sign Language fingerspelling translation from videos in the wild.
  • methods: Exploits advances in more accurate hand pose estimation and proposes a transformer-based encoder-decoder architecture enabling seamless contextual word translation; a novel loss term predicts the length of the fingerspelled word, benefiting both training and inference, and a two-stage inference approach re-ranks hypotheses using the decoder's language-model capabilities (a sketch of the length head follows this entry).
  • results: Extensive experiments show more than 10% relative improvement over state-of-the-art models on ChicagoFSWild and ChicagoFSWild+, highlighting the method's potential to advance fingerspelling recognition in sign language translation. Code is available at https://github.com/pooyafayyaz/Fingerspelling-PoseNet.
    Abstract We address the task of American Sign Language fingerspelling translation using videos in the wild. We exploit advances in more accurate hand pose estimation and propose a novel architecture that leverages the transformer based encoder-decoder model enabling seamless contextual word translation. The translation model is augmented by a novel loss term that accurately predicts the length of the finger-spelled word, benefiting both training and inference. We also propose a novel two-stage inference approach that re-ranks the hypotheses using the language model capabilities of the decoder. Through extensive experiments, we demonstrate that our proposed method outperforms the state-of-the-art models on ChicagoFSWild and ChicagoFSWild+ achieving more than 10% relative improvement in performance. Our findings highlight the effectiveness of our approach and its potential to advance fingerspelling recognition in sign language translation. Code is also available at https://github.com/pooyafayyaz/Fingerspelling-PoseNet.
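The sketch below illustrates only the auxiliary word-length objective on top of a pose transformer, not the full encoder-decoder translator; the pose dimensionality, layer sizes, and MSE length loss are assumptions.

```python
# Pose-sequence encoder with two heads: per-frame character logits and a scalar
# word-length prediction whose loss is added to the main translation objective.
import torch
import torch.nn as nn

class PoseEncoderWithLengthHead(nn.Module):
    def __init__(self, pose_dim=42, d_model=128, n_chars=27):
        super().__init__()
        self.proj = nn.Linear(pose_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.char_head = nn.Linear(d_model, n_chars)   # per-frame character logits
        self.len_head = nn.Linear(d_model, 1)          # word length from pooled features

    def forward(self, poses):                          # poses: (B, T, pose_dim)
        h = self.encoder(self.proj(poses))
        return self.char_head(h), self.len_head(h.mean(dim=1)).squeeze(-1)

model = PoseEncoderWithLengthHead()
poses = torch.randn(2, 60, 42)                         # 60 frames of 21 keypoints (x, y)
char_logits, pred_len = model(poses)
length_loss = nn.functional.mse_loss(pred_len, torch.tensor([5.0, 8.0]))
print(char_logits.shape, float(length_loss))
```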

Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.12092
  • repo_url: https://github.com/rohitgandikota/sliders
  • paper_authors: Rohit Gandikota, Joanna Materzynska, Tingrui Zhou, Antonio Torralba, David Bau
  • for: Creating interpretable concept sliders that enable precise control over attributes in images generated by diffusion models.
  • methods: Identifies a low-rank parameter direction corresponding to one concept while minimizing interference with other attributes; a slider is created from a small set of prompts or sample images, so slider directions can be built for either textual or visual concepts (a minimal adapter sketch follows this entry).
  • results: Compared with previous editing techniques, the sliders exhibit stronger targeted edits with lower interference; the paper demonstrates slider composition and continuous modulation, intuitive editing by transferring StyleGAN latents, and fixes for persistent quality issues in Stable Diffusion XL such as object deformations and distorted hands. Code, data, and trained sliders are available at https://sliders.baulab.info/.
    Abstract We present a method to create interpretable concept sliders that enable precise control over attributes in image generations from diffusion models. Our approach identifies a low-rank parameter direction corresponding to one concept while minimizing interference with other attributes. A slider is created using a small set of prompts or sample images; thus slider directions can be created for either textual or visual concepts. Concept Sliders are plug-and-play: they can be composed efficiently and continuously modulated, enabling precise control over image generation. In quantitative experiments comparing to previous editing techniques, our sliders exhibit stronger targeted edits with lower interference. We showcase sliders for weather, age, styles, and expressions, as well as slider compositions. We show how sliders can transfer latents from StyleGAN for intuitive editing of visual concepts for which textual description is difficult. We also find that our method can help address persistent quality issues in Stable Diffusion XL including repair of object deformations and fixing distorted hands. Our code, data, and trained sliders are available at https://sliders.baulab.info/
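A minimal sketch of the low-rank "slider" idea: a LoRA-style adapter on a single linear layer whose contribution is scaled by a user-controlled value. The real method learns such directions inside a diffusion model's layers, which is not reproduced here; the rank and sizes are arbitrary.

```python
# A rank-r adapter added to a frozen linear layer; the `scale` argument is the slider.
import torch
import torch.nn as nn

class SliderLinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base                                 # frozen pretrained layer
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)   # low-rank direction
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                   # slider starts as a no-op

    def forward(self, x, scale: float = 0.0):
        # scale = 0 leaves the base model untouched; +/- values push the concept
        return self.base(x) + scale * self.up(self.down(x))

layer = SliderLinear(nn.Linear(64, 64))
x = torch.randn(1, 64)
print(torch.allclose(layer(x, scale=0.0), layer.base(x)))   # True: slider at rest
```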

PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction

  • paper_url: http://arxiv.org/abs/2311.12024
  • repo_url: None
  • paper_authors: Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, Kai Zhang
  • for: Reconstructing a 3D object from a few unposed images, even with little visual overlap, while simultaneously estimating the relative camera poses in about 1.3 seconds on a single A100 GPU.
  • methods: Uses self-attention blocks to exchange information between 3D object tokens and 2D image tokens, predicts a coarse point cloud for each view, and then obtains camera poses with a differentiable Perspective-n-Point (PnP) solver (a PnP sketch follows this entry).
  • results: When trained on multi-view posed data of roughly 1M objects, PF-LRM shows strong cross-dataset generalization and outperforms baseline methods by a large margin in pose prediction accuracy and 3D reconstruction quality; it also applies to downstream text/image-to-3D tasks with fast feed-forward inference. Project website: https://totoro97.github.io/pf-lrm.
    Abstract We propose a Pose-Free Large Reconstruction Model (PF-LRM) for reconstructing a 3D object from a few unposed images even with little visual overlap, while simultaneously estimating the relative camera poses in ~1.3 seconds on a single A100 GPU. PF-LRM is a highly scalable method utilizing the self-attention blocks to exchange information between 3D object tokens and 2D image tokens; we predict a coarse point cloud for each view, and then use a differentiable Perspective-n-Point (PnP) solver to obtain camera poses. When trained on a huge amount of multi-view posed data of ~1M objects, PF-LRM shows strong cross-dataset generalization ability, and outperforms baseline methods by a large margin in terms of pose prediction accuracy and 3D reconstruction quality on various unseen evaluation datasets. We also demonstrate our model's applicability in downstream text/image-to-3D task with fast feed-forward inference. Our project website is at: https://totoro97.github.io/pf-lrm .
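For intuition on the pose-from-points step, the sketch below runs a classical OpenCV PnP solver on synthetic 3D-2D correspondences; PF-LRM uses a differentiable PnP solver inside training, and the point count, intrinsics, and pose here are arbitrary.

```python
# Recover a camera pose from predicted 3D points and their 2D projections with PnP.
import cv2
import numpy as np

rng = np.random.default_rng(0)
pts_3d = rng.uniform(-1, 1, size=(20, 3)).astype(np.float64)    # coarse object points

rvec_gt = np.array([0.1, -0.2, 0.05]).reshape(3, 1)             # pose used to fake observations
tvec_gt = np.array([0.0, 0.0, 4.0]).reshape(3, 1)
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
pts_2d, _ = cv2.projectPoints(pts_3d, rvec_gt, tvec_gt, K, None)

ok, rvec, tvec = cv2.solvePnP(pts_3d, pts_2d, K, None)
print(ok, np.round(rvec.ravel(), 3), np.round(tvec.ravel(), 3))  # close to the ground truth
```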

DAS: A Deformable Attention to Capture Salient Information in CNNs

  • paper_url: http://arxiv.org/abs/2311.12091
  • repo_url: None
  • paper_authors: Farzad Salajegheh, Nader Asadi, Soroush Saryazdi, Sudhir Mudur
  • for: Improving image classification and object detection performance of CNNs.
  • methods: Proposes DAS, a fast and simple fully convolutional method that uses deformable convolutions to locate pertinent image regions and separable convolutions for efficiency, helping the model focus on salient information (a toy gating sketch follows this entry).
  • results: Adding DAS to popular CNNs improves image classification and object detection (e.g., gains of 4.47% on Stanford Dogs, 1.91% on ImageNet, and 3.3% COCO AP with a ResNet50 backbone), outperforming other CNN attention mechanisms while using similar or fewer FLOPs.
    Abstract Convolutional Neural Networks (CNNs) excel in local spatial pattern recognition. For many vision tasks, such as object recognition and segmentation, salient information is also present outside CNN's kernel boundaries. However, CNNs struggle in capturing such relevant information due to their confined receptive fields. Self-attention can improve a model's access to global information but increases computational overhead. We present a fast and simple fully convolutional method called DAS that helps focus attention on relevant information. It uses deformable convolutions for the location of pertinent image regions and separable convolutions for efficiency. DAS plugs into existing CNNs and propagates relevant information using a gating mechanism. Compared to the O(n^2) computational complexity of transformer-style attention, DAS is O(n). Our claim is that DAS's ability to pay increased attention to relevant features results in performance improvements when added to popular CNNs for Image Classification and Object Detection. For example, DAS yields an improvement on Stanford Dogs (4.47%), ImageNet (1.91%), and COCO AP (3.3%) with base ResNet50 backbone. This outperforms other CNN attention mechanisms while using similar or less FLOPs. Our code will be publicly available.
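A hedged reading of the abstract as code, not the authors' implementation: a small convolution predicts sampling offsets, a deformable convolution gathers features at those locations, and a depthwise-separable convolution produces a sigmoid gate that re-weights the input feature map.

```python
# DAS-style gating block: deformable sampling followed by a separable-conv gate.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DASGate(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.offset = nn.Conv2d(channels, 2 * kernel_size * kernel_size, kernel_size, padding=pad)
        self.deform = DeformConv2d(channels, channels, kernel_size, padding=pad)
        self.depthwise = nn.Conv2d(channels, channels, kernel_size, padding=pad, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        feats = self.deform(x, self.offset(x))             # sample at learned offsets
        gate = torch.sigmoid(self.pointwise(self.depthwise(feats)))
        return x * gate                                    # pass salient information through

x = torch.randn(1, 64, 32, 32)
print(DASGate(64)(x).shape)                                # torch.Size([1, 64, 32, 32])
```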

FrePolad: Frequency-Rectified Point Latent Diffusion for Point Cloud Generation

  • paper_url: http://arxiv.org/abs/2311.12090
  • repo_url: None
  • paper_authors: Chenliang Zhou, Fangcheng Zhong, Param Hanji, Zhilin Guo, Kyle Fogarty, Alejandro Sztrajman, Hongyun Gao, Cengiz Oztireli
  • for: Proposes a point cloud generation pipeline that achieves high quality, diversity, and flexible point cloud cardinality for generation tasks.
  • methods: Integrates a variational autoencoder (VAE) with a denoising diffusion probabilistic model (DDPM) over the latent distribution, and introduces a novel frequency rectification module via spherical harmonics to retain high-frequency content while learning the point cloud distribution.
  • results: Compared with existing methods, the pipeline delivers state-of-the-art quality, diversity, and flexibility in point cloud cardinality while remaining computationally efficient, as shown by quantitative and qualitative results.
    Abstract We propose FrePolad: frequency-rectified point latent diffusion, a point cloud generation pipeline integrating a variational autoencoder (VAE) with a denoising diffusion probabilistic model (DDPM) for the latent distribution. FrePolad simultaneously achieves high quality, diversity, and flexibility in point cloud cardinality for generation tasks while maintaining high computational efficiency. The improvement in generation quality and diversity is achieved through (1) a novel frequency rectification module via spherical harmonics designed to retain high-frequency content while learning the point cloud distribution; and (2) a latent DDPM to learn the regularized yet complex latent distribution. In addition, FrePolad supports variable point cloud cardinality by formulating the sampling of points as conditional distributions over a latent shape distribution. Finally, the low-dimensional latent space encoded by the VAE contributes to FrePolad's fast and scalable sampling. Our quantitative and qualitative results demonstrate the state-of-the-art performance of FrePolad in terms of quality, diversity, and computational efficiency.

LiDAR-HMR: 3D Human Mesh Recovery from LiDAR

  • paper_url: http://arxiv.org/abs/2311.11971
  • repo_url: https://github.com/soullessrobot/lidar-hmr
  • paper_authors: Bohao Fan, Wenzhao Zheng, Jianjiang Feng, Jie Zhou
  • for: This paper aims to estimate the 3D human body mesh from sparse LiDAR point clouds.
  • methods: The proposed method uses an effective sparse-to-dense reconstruction scheme to reconstruct the 3D human mesh, by estimating a sparse representation of the human (3D human pose) and gradually reconstructing the body mesh. To better leverage the 3D structural information of point clouds, the method employs a cascaded graph transformer (graphormer) to introduce point cloud features during sparse-to-dense reconstruction.
  • results: Experimental results on three publicly available databases demonstrate the effectiveness of the proposed approach.
    Abstract In recent years, point cloud perception tasks have been garnering increasing attention. This paper presents the first attempt to estimate 3D human body mesh from sparse LiDAR point clouds. We found that the major challenge in estimating human pose and mesh from point clouds lies in the sparsity, noise, and incompletion of LiDAR point clouds. Facing these challenges, we propose an effective sparse-to-dense reconstruction scheme to reconstruct 3D human mesh. This involves estimating a sparse representation of a human (3D human pose) and gradually reconstructing the body mesh. To better leverage the 3D structural information of point clouds, we employ a cascaded graph transformer (graphormer) to introduce point cloud features during sparse-to-dense reconstruction. Experimental results on three publicly available databases demonstrate the effectiveness of the proposed approach. Code: https://github.com/soullessrobot/LiDAR-HMR/

SA-Med2D-20M Dataset: Segment Anything in 2D Medical Imaging with 20 Million masks

  • paper_url: http://arxiv.org/abs/2311.11969
  • repo_url: https://github.com/OpenGVLab/SAM-Med2D
  • paper_authors: Jin Ye, Junlong Cheng, Jianpin Chen, Zhongying Deng, Tianbin Li, Haoyu Wang, Yanzhou Su, Ziyan Huang, Jilong Chen, Lei Jiang, Hui Sun, Min Zhu, Shaoting Zhang, Junjun He, Yu Qiao
  • for: Developing medical artificial intelligence for enhancing diagnosis, medical image analysis, knowledge sharing, and education.
  • methods: The paper introduces SA-Med2D-20M, a large-scale segmentation dataset of 2D medical images built upon numerous public and private datasets, consisting of 4.6 million 2D medical images and 19.7 million corresponding masks that cover almost the whole body and show significant diversity.
  • results: The paper presents comprehensive statistics of SA-Med2D-20M to facilitate better use of the dataset, helping researchers build medical vision foundation models or apply their models to downstream medical applications.
    Abstract Segment Anything Model (SAM) has achieved impressive results for natural image segmentation with input prompts such as points and bounding boxes. Its success largely owes to massive labeled training data. However, directly applying SAM to medical image segmentation cannot perform well because SAM lacks medical knowledge -- it does not use medical images for training. To incorporate medical knowledge into SAM, we introduce SA-Med2D-20M, a large-scale segmentation dataset of 2D medical images built upon numerous public and private datasets. It consists of 4.6 million 2D medical images and 19.7 million corresponding masks, covering almost the whole body and showing significant diversity. This paper describes all the datasets collected in SA-Med2D-20M and details how to process these datasets. Furthermore, comprehensive statistics of SA-Med2D-20M are presented to facilitate the better use of our dataset, which can help the researchers build medical vision foundation models or apply their models to downstream medical applications. We hope that the large scale and diversity of SA-Med2D-20M can be leveraged to develop medical artificial intelligence for enhancing diagnosis, medical image analysis, knowledge sharing, and education. The data with the redistribution license is publicly available at https://github.com/OpenGVLab/SAM-Med2D.

What Can AutoML Do For Continual Learning?

  • paper_url: http://arxiv.org/abs/2311.11963
  • repo_url: None
  • paper_authors: Mert Kilickaya, Joaquin Vanschoren
  • for: A position paper exploring the potential of AutoML for incremental (continual) learning, aiming to encourage more research in this direction.
  • methods: Rather than proposing a new method, the paper poses the question "What can AutoML do for incremental learning?" and outlines three key research directions for making incremental learners more dynamic, highlighting both novel applications of AutoML methods and entirely new challenges for AutoML research.
  • results: No new method or experiments are reported; the contribution is the identification of three key research areas where AutoML could help incremental learning adapt to more diverse real-world tasks.
    Abstract This position paper outlines the potential of AutoML for incremental (continual) learning to encourage more research in this direction. Incremental learning involves incorporating new data from a stream of tasks and distributions to learn enhanced deep representations and adapt better to new tasks. However, a significant limitation of incremental learners is that most current techniques freeze the backbone architecture, hyperparameters, and the order & structure of the learning tasks throughout the learning and adaptation process. We strongly believe that AutoML offers promising solutions to address these limitations, enabling incremental learning to adapt to more diverse real-world tasks. Therefore, instead of directly proposing a new method, this paper takes a step back by posing the question: "What can AutoML do for incremental learning?" We outline three key areas of research that can contribute to making incremental learners more dynamic, highlighting concrete opportunities to apply AutoML methods in novel ways as well as entirely new challenges for AutoML research.

An Image is Worth Multiple Words: Multi-attribute Inversion for Constrained Text-to-Image Synthesis

  • paper_url: http://arxiv.org/abs/2311.11919
  • repo_url: None
  • paper_authors: Aishwarya Agarwal, Srikrishna Karanam, Tripti Shukla, Balaji Vasan Srinivasan
  • for: Extracting multiple attributes (e.g., color, object, layout, style) from a single reference image and generating new samples conditioned on them.
  • methods: Proposes MATTE, a multi-attribute inversion algorithm with disentanglement-enhancing regularization losses that learns embeddings jointly across the DDPM model's layer dimension and timestep dimension, improving attribute disentanglement.
  • results: Extensive analysis and experiments show that MATTE disentangles the attributes better, yielding four disentangled tokens (color, style, layout, and object) and higher-quality constrained generations.
    Abstract We consider the problem of constraining diffusion model outputs with a user-supplied reference image. Our key objective is to extract multiple attributes (e.g., color, object, layout, style) from this single reference image, and then generate new samples with them. One line of existing work proposes to invert the reference images into a single textual conditioning vector, enabling generation of new samples with this learned token. These methods, however, do not learn multiple tokens that are necessary to condition model outputs on the multiple attributes noted above. Another line of techniques expand the inversion space to learn multiple embeddings but they do this only along the layer dimension (e.g., one per layer of the DDPM model) or the timestep dimension (one for a set of timesteps in the denoising process), leading to suboptimal attribute disentanglement. To address the aforementioned gaps, the first contribution of this paper is an extensive analysis to determine which attributes are captured in which dimension of the denoising process. As noted above, we consider both the time-step dimension (in reverse denoising) as well as the DDPM model layer dimension. We observe that often a subset of these attributes are captured in the same set of model layers and/or across same denoising timesteps. For instance, color and style are captured across same U-Net layers, whereas layout and color are captured across same timestep stages. Consequently, an inversion process that is designed only for the time-step dimension or the layer dimension is insufficient to disentangle all attributes. This leads to our second contribution where we design a new multi-attribute inversion algorithm, MATTE, with associated disentanglement-enhancing regularization losses, that operates across both dimensions and explicitly leads to four disentangled tokens (color, style, layout, and object).

Identifying the Defective: Detecting Damaged Grains for Cereal Appearance Inspection

  • paper_url: http://arxiv.org/abs/2311.11901
  • repo_url: https://github.com/hellodfan/ai4graininsp
  • paper_authors: Lei Fan, Yiwen Ding, Dongdong Fan, Yong Wu, Maurice Pagnucco, Yang Song
  • for: This paper aims to develop an automated Grain Appearance Inspection (GAI) system to improve the efficiency and accuracy of grain quality evaluation.
  • methods: The proposed system uses anomaly detection (AD) to identify damaged grains or unknown objects in grain kernels. The AD model, called AD-GAI, is trained using only normal samples and shows high performance in comparison with advanced AD methods.
  • results: The proposed system achieves a speedup of over 20x compared to human experts and shows highly consistent performance with human evaluation. A large-scale dataset of 220K high-quality images of wheat and maize kernels is created and released for future research.
    Abstract Cereal grain plays a crucial role in the human diet as a major source of essential nutrients. Grain Appearance Inspection (GAI) serves as an essential process to determine grain quality and facilitate grain circulation and processing. However, GAI is routinely performed manually by inspectors with cumbersome procedures, which poses a significant bottleneck in smart agriculture. In this paper, we endeavor to develop an automated GAI system:AI4GrainInsp. By analyzing the distinctive characteristics of grain kernels, we formulate GAI as a ubiquitous problem: Anomaly Detection (AD), in which healthy and edible kernels are considered normal samples while damaged grains or unknown objects are regarded as anomalies. We further propose an AD model, called AD-GAI, which is trained using only normal samples yet can identify anomalies during inference. Moreover, we customize a prototype device for data acquisition and create a large-scale dataset including 220K high-quality images of wheat and maize kernels. Through extensive experiments, AD-GAI achieves considerable performance in comparison with advanced AD methods, and AI4GrainInsp has highly consistent performance compared to human experts and excels at inspection efficiency over 20x speedup. The dataset, code and models will be released at https://github.com/hellodfan/AI4GrainInsp.
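The anomaly-detection framing (train on healthy kernels only, flag everything else) can be illustrated with a classical one-class model; AD-GAI itself is a learned deep model, and the features below are random placeholders.

```python
# One-class anomaly detection: fit on normal-kernel features, flag outliers at test time.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal_train = rng.normal(0.0, 1.0, size=(500, 32))    # features of healthy kernels
normal_test = rng.normal(0.0, 1.0, size=(50, 32))
damaged_test = rng.normal(4.0, 1.0, size=(50, 32))     # shifted: damaged or unknown objects

detector = OneClassSVM(kernel="rbf", nu=0.05).fit(normal_train)
print("normal flagged anomalous :", (detector.predict(normal_test) == -1).mean())
print("damaged flagged anomalous:", (detector.predict(damaged_test) == -1).mean())
```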

SniffyArt: The Dataset of Smelling Persons

  • paper_url: http://arxiv.org/abs/2311.11888
  • repo_url: None
  • paper_authors: Mathias Zinnen, Azhar Hussian, Hang Tran, Prathmesh Madhu, Andreas Maier, Vincent Christlein
  • for: Enabling automated recognition of smell gestures in historical artworks using person bounding boxes and pose keypoints.
  • methods: Introduces the SniffyArt dataset of 1941 individuals depicted in 441 historical artworks, each annotated with a tightly fitting bounding box, 17 pose keypoints, and a gesture label; the combined annotations enable the development of hybrid classification approaches for smell gesture recognition.
  • results: A baseline analysis evaluates representative algorithms for detection, keypoint estimation, and classification, showing the potential of combining keypoint estimation with smell gesture classification; SniffyArt lays a solid foundation for future research on human gesture and olfactory dimension analysis in historical artworks.
    Abstract Smell gestures play a crucial role in the investigation of past smells in the visual arts yet their automated recognition poses significant challenges. This paper introduces the SniffyArt dataset, consisting of 1941 individuals represented in 441 historical artworks. Each person is annotated with a tightly fitting bounding box, 17 pose keypoints, and a gesture label. By integrating these annotations, the dataset enables the development of hybrid classification approaches for smell gesture recognition. The datasets high-quality human pose estimation keypoints are achieved through the merging of five separate sets of keypoint annotations per person. The paper also presents a baseline analysis, evaluating the performance of representative algorithms for detection, keypoint estimation, and classification tasks, showcasing the potential of combining keypoint estimation with smell gesture classification. The SniffyArt dataset lays a solid foundation for future research and the exploration of multi-task approaches leveraging pose keypoints and person boxes to advance human gesture and olfactory dimension analysis in historical artworks.

Multi-Task Faces (MTF) Data Set: A Legally and Ethically Compliant Collection of Face Images for Various Classification Tasks

  • paper_url: http://arxiv.org/abs/2311.11882
  • repo_url: https://github.com/ramihaf/mtf_data_set
  • paper_authors: Rami Haffar, David Sánchez, Josep Domingo-Ferrer
  • for: Providing the Multi-Task Faces (MTF) image dataset for multiple classification tasks, including face recognition as well as race, gender, and age classification.
  • methods: The images were ethically gathered from publicly available pictures of celebrities in strict adherence to copyright regulations, then carefully curated and processed for the different classification tasks.
  • results: The paper evaluates five deep learning models on the MTF dataset across the aforementioned tasks and compares their performance on the processed MTF data versus raw data crawled from the internet, providing a baseline for further research.
    Abstract Human facial data hold tremendous potential to address a variety of classification problems, including face recognition, age estimation, gender identification, emotion analysis, and race classification. However, recent privacy regulations, such as the EU General Data Protection Regulation and others, have restricted the ways in which human images may be collected and used for research. As a result, several previously published data sets containing human faces have been removed from the internet due to inadequate data collection methods that failed to meet privacy regulations. Data sets consisting of synthetic data have been proposed as an alternative, but they fall short of accurately representing the real data distribution. On the other hand, most available data sets are labeled for just a single task, which limits their applicability. To address these issues, we present the Multi-Task Faces (MTF) image data set, a meticulously curated collection of face images designed for various classification tasks, including face recognition, as well as race, gender, and age classification. The MTF data set has been ethically gathered by leveraging publicly available images of celebrities and strictly adhering to copyright regulations. In this paper, we present this data set and provide detailed descriptions of the followed data collection and processing procedures. Furthermore, we evaluate the performance of five deep learning (DL) models on the MTF data set across the aforementioned classification tasks. Additionally, we compare the performance of DL models over the processed MTF data and over raw data crawled from the internet. The reported results constitute a baseline for further research employing these data. The MTF data set can be accessed through the following link (please cite the present paper if you use the data set): https://github.com/RamiHaf/MTF_data_set

VLM-Eval: A General Evaluation on Video Large Language Models

  • paper_url: http://arxiv.org/abs/2311.11865
  • repo_url: None
  • paper_authors: Shuailin Li, Yuang Zhang, Yucheng Zhao, Qiuyue Wang, Fan Jia, Yingfei Liu, Tiancai Wang
  • for: Providing a unified, comprehensive evaluation for video large language models (LLMs).
  • methods: Covers multiple video tasks, including captioning, question answering, retrieval, and action recognition, assessing response quality with both conventional metrics and GPT-based evaluation that matches human-like judgment; it also proposes a simple baseline, Video-LLaVA, which uses a single linear projection and outperforms existing video LLMs.
  • results: Evaluations beyond academic datasets show encouraging recognition and reasoning capabilities in driving scenarios after fine-tuning with only hundreds of video-instruction pairs, demonstrating that video LLM evaluation can extend to practical scenarios.
    Abstract Despite the rapid development of video Large Language Models (LLMs), a comprehensive evaluation is still absent. In this paper, we introduce a unified evaluation that encompasses multiple video tasks, including captioning, question and answering, retrieval, and action recognition. In addition to conventional metrics, we showcase how GPT-based evaluation can match human-like performance in assessing response quality across multiple aspects. We propose a simple baseline: Video-LLaVA, which uses a single linear projection and outperforms existing video LLMs. Finally, we evaluate video LLMs beyond academic datasets, which show encouraging recognition and reasoning capabilities in driving scenarios with only hundreds of video-instruction pairs for fine-tuning. We hope our work can serve as a unified evaluation for video LLMs, and help expand more practical scenarios. The evaluation code will be available soon.

GP-NeRF: Generalized Perception NeRF for Context-Aware 3D Scene Understanding

  • paper_url: http://arxiv.org/abs/2311.11863
  • repo_url: None
  • paper_authors: Hao Li, Dingwen Zhang, Yalun Dai, Nian Liu, Lechao Cheng, Jingfeng Li, Jingdong Wang, Junwei Han
  • for: Improving context-aware 3D scene understanding and representation for downstream perception tasks, namely semantic and instance segmentation.
  • methods: Proposes Generalized Perception NeRF (GP-NeRF), a pipeline that makes widely used segmentation models and NeRF work compatibly under a unified framework; transformers jointly aggregate the radiance and semantic embedding fields for novel views, and two self-distillation mechanisms, the Semantic Distill Loss and the Depth-Guided Semantic Distill Loss, enhance the discrimination of the semantic field and its geometric consistency (a distillation sketch follows this entry).
  • results: Experiments on semantic and instance segmentation with both synthetic and real-world datasets show improvements of 6.94%, 11.76%, and 8.47% over SOTA methods on generalized semantic segmentation, fine-tuned semantic segmentation, and instance segmentation, respectively.
    Abstract Applying NeRF to downstream perception tasks for scene understanding and representation is becoming increasingly popular. Most existing methods treat semantic prediction as an additional rendering task, \textit{i.e.}, the "label rendering" task, to build semantic NeRFs. However, by rendering semantic/instance labels per pixel without considering the contextual information of the rendered image, these methods usually suffer from unclear boundary segmentation and abnormal segmentation of pixels within an object. To solve this problem, we propose Generalized Perception NeRF (GP-NeRF), a novel pipeline that makes the widely used segmentation model and NeRF work compatibly under a unified framework, for facilitating context-aware 3D scene perception. To accomplish this goal, we introduce transformers to aggregate radiance as well as semantic embedding fields jointly for novel views and facilitate the joint volumetric rendering of both fields. In addition, we propose two self-distillation mechanisms, i.e., the Semantic Distill Loss and the Depth-Guided Semantic Distill Loss, to enhance the discrimination and quality of the semantic field and the maintenance of geometric consistency. In evaluation, we conduct experimental comparisons under two perception tasks (\textit{i.e.} semantic and instance segmentation) using both synthetic and real-world datasets. Notably, our method outperforms SOTA approaches by 6.94\%, 11.76\%, and 8.47\% on generalized semantic segmentation, finetuning semantic segmentation, and instance segmentation, respectively.
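A hedged guess at what a semantic distillation term could look like (the paper's exact Semantic Distill Loss may differ): rendered per-ray semantic logits are pushed toward the 2D segmentation teacher's soft predictions with a KL term.

```python
# KL distillation between volumetrically rendered semantics and a 2D teacher's predictions.
import torch
import torch.nn.functional as F

n_rays, n_classes = 1024, 20
rendered_logits = torch.randn(n_rays, n_classes, requires_grad=True)  # from the NeRF branch
teacher_logits = torch.randn(n_rays, n_classes)                       # from the 2D segmenter

loss = F.kl_div(F.log_softmax(rendered_logits, dim=-1),
                F.softmax(teacher_logits, dim=-1), reduction="batchmean")
loss.backward()
print(float(loss))
```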

LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

  • paper_url: http://arxiv.org/abs/2311.11860
  • repo_url: https://github.com/rshaojimmy/jiutian
  • paper_authors: Gongwei Chen, Leyang Shen, Rui Shao, Xiang Deng, Liqiang Nie
  • for: 这个论文主要是为了提高多模态大语言模型(MLLM)的性能,使其能够更好地理解和使用多种多样的信息。
  • methods: 这篇论文提出了一种两级视觉知识增强策略,包括进程式地 integrate 细化的空间意识视觉知识,以及软提示高级别 semantic 视觉证据。
  • results: 在多个多模态benchmark上进行了广泛的实验,结果表明该模型取得了显著提升(例如,相比InstructBLIP,在VSR上准确率提高5%、在TextCaps上CIDEr提高3%;相比Kosmos-2,在RefCOCOg上准确率提高5%)。
    Abstract Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However, most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge. To address this issue, we devise a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge in two levels. 1) Progressive incorporation of fine-grained spatial-aware visual knowledge. We design a vision aggregator cooperated with region-level vision-language (VL) tasks to incorporate fine-grained spatial-aware visual knowledge into the MLLM. To alleviate the conflict between image-level and region-level VL tasks during incorporation, we devise a dedicated stage-wise instruction-tuning strategy with mixture-of-adapters. This progressive incorporation scheme contributes to the mutual promotion between these two kinds of VL tasks. 2) Soft prompting of high-level semantic visual evidence. We facilitate the MLLM with high-level semantic visual evidence by leveraging diverse image tags. To mitigate the potential influence caused by imperfect predicted tags, we propose a soft prompting method by embedding a learnable token into the tailored text instruction. Comprehensive experiments on several multi-modal benchmarks demonstrate the superiority of our model (e.g., improvement of 5% accuracy on VSR and 3% CIDEr on TextCaps over InstructBLIP, 5% accuracy on RefCOCOg over Kosmos-2).
    摘要 多模态大型语言模型(MLLMs)赋予了大型语言模型(LLMs)感知和理解多模态信号的能力。然而,现有的MLLMs大多采用在粗略对齐的图像-文本对上预训练的视觉编码器,导致视觉知识的提取和推理不够充分。为解决这个问题,我们设计了双层视觉知识增强的多模态大型语言模型(LION),从两个层面为MLLM注入视觉知识。1. 渐进式整合细粒度空间感知视觉知识。我们设计了一个与区域级视觉-语言(VL)任务协作的视觉聚合器,将细粒度空间感知视觉知识整合到MLLM中。为缓解整合过程中图像级与区域级VL任务之间的冲突,我们设计了一种配合适配器混合(mixture-of-adapters)的分阶段指令微调策略。这种渐进式整合方案促进了两类VL任务之间的相互增益。2. 高层语义视觉证据的软提示。我们利用多种图像标签为MLLM提供高层语义视觉证据。为减轻预测标签不完美所带来的影响,我们提出了一种软提示方法,在定制的文本指令中嵌入一个可学习的token。在多个多模态benchmark上的全面实验表明了我们模型的优越性(例如,相比InstructBLIP,在VSR上准确率提高5%、在TextCaps上CIDEr提高3%;相比Kosmos-2,在RefCOCOg上准确率提高5%)。
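
The soft-prompting idea (a learnable token embedded alongside the tag text to buffer imperfect predicted tags) can be sketched as below. This is an assumption-laden toy: the embedding table, dimensions, and where exactly the token is inserted are placeholders, not LION's actual implementation.

```python
import torch
import torch.nn as nn

class SoftTagPrompt(nn.Module):
    """Minimal sketch: prepend a learnable token to the embeddings of predicted
    image tags so the LLM can down-weight imperfect tags (names are illustrative)."""

    def __init__(self, vocab_size=32000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)               # stand-in for the LLM's embedding table
        self.soft_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable soft prompt

    def forward(self, instruction_ids, tag_ids):
        inst = self.embed(instruction_ids)                       # (B, L1, dim) instruction embeddings
        tags = self.embed(tag_ids)                               # (B, L2, dim) embeddings of predicted tags
        soft = self.soft_token.expand(tags.size(0), -1, -1)
        # instruction + [soft token] + tag embeddings are fed to the language model
        return torch.cat([inst, soft, tags], dim=1)

module = SoftTagPrompt()
out = module(torch.randint(0, 32000, (2, 8)), torch.randint(0, 32000, (2, 4)))
print(out.shape)  # torch.Size([2, 13, 512])
```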

FATURA: A Multi-Layout Invoice Image Dataset for Document Analysis and Understanding

  • paper_url: http://arxiv.org/abs/2311.11856
  • repo_url: None
  • paper_authors: Mahmoud Limam, Marwa Dhiaf, Yousri Kessentini
  • for: 这篇论文是为了探讨文档分析和理解领域的研究人员而写的。
  • methods: 这篇论文使用了一个名为FATURA的数据集,该数据集是一个高度多样化、多版面且带注释的发票文档图像集。
  • results: 这篇论文为多种文档分析和理解任务提供了全面的基准,并在不同的训练和评估场景下进行了实验。
    Abstract Document analysis and understanding models often require extensive annotated data to be trained. However, various document-related tasks extend beyond mere text transcription, requiring both textual content and precise bounding-box annotations to identify different document elements. Collecting such data becomes particularly challenging, especially in the context of invoices, where privacy concerns add an additional layer of complexity. In this paper, we introduce FATURA, a pivotal resource for researchers in the field of document analysis and understanding. FATURA is a highly diverse dataset featuring multi-layout, annotated invoice document images. Comprising $10,000$ invoices with $50$ distinct layouts, it represents the largest openly accessible image dataset of invoice documents known to date. We also provide comprehensive benchmarks for various document analysis and understanding tasks and conduct experiments under diverse training and evaluation scenarios. The dataset is freely accessible at https://zenodo.org/record/8261508, empowering researchers to advance the field of document analysis and understanding.
    摘要 In this paper, we introduce FATURA, a groundbreaking resource for researchers in the field of document analysis and understanding. FATURA is a highly diverse dataset featuring multi-layout, annotated invoice document images. Comprising 10,000 invoices with 50 distinct layouts, it represents the largest openly accessible image dataset of invoice documents known to date. We also provide comprehensive benchmarks for various document analysis and understanding tasks and conduct experiments under diverse training and evaluation scenarios. The dataset is freely accessible at https://zenodo.org/record/8261508, empowering researchers to advance the field of document analysis and understanding.

Asynchronous Bioplausible Neuron for Spiking Neural Networks for Event-Based Vision

  • paper_url: http://arxiv.org/abs/2311.11853
  • repo_url: None
  • paper_authors: Sanket Kachole, Hussain Sajwani, Fariborz Baghaei Naeini, Dimitrios Makris, Yahya Zweiri
  • for: 这个研究旨在提出一种具有生物灵感的神经网络,以提高视觉数据处理效率,降低能源消耗。
  • methods: 研究使用了 asynchronous bioplausible neuron(ABN),一种动态脉搏机制,以自动调整输入信号的变化。
  • results: 实验结果显示,ABN能够优化图像分类和 segmentation 的性能,维护神经平衡,并提高能效性。
    Abstract Spiking Neural Networks (SNNs) offer a biologically inspired approach to computer vision that can lead to more efficient processing of visual data with reduced energy consumption. However, maintaining homeostasis within these networks is challenging, as it requires continuous adjustment of neural responses to preserve equilibrium and optimal processing efficiency amidst diverse and often unpredictable input signals. In response to these challenges, we propose the Asynchronous Bioplausible Neuron (ABN), a dynamic spike firing mechanism to auto-adjust the variations in the input signal. Comprehensive evaluation across various datasets demonstrates ABN's enhanced performance in image classification and segmentation, maintenance of neural equilibrium, and energy efficiency.
    摘要 神经网络(SNN)提供一种生物体发展的视觉处理方法,可以实现更高效的数据处理和降低能源消耗。然而,保持神经网络的HOMEOSTASIS是挑战,因为需要不断调整神经响应以维持平衡和最佳处理效率面对多样化和随机的输入信号。为解决这些挑战,我们提议使用异步生物可能性神经元(ABN),一种动态脉冲发射机制来自动调整输入信号的变化。经过了不同的数据集的全面评估,ABN在图像分类和分割方面表现出了更好的表现,同时保持神经网络的平衡和能效性。
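
As a rough intuition for a spiking neuron whose firing threshold self-adjusts to keep activity stable under varying input, here is a toy leaky integrate-and-fire loop with an adaptive threshold. The update rule and constants are illustrative guesses, not the ABN formulation from the paper.

```python
import numpy as np

def adaptive_lif(inputs, tau_mem=20.0, tau_thr=60.0, v_thr0=1.0, dt=1.0):
    """Toy leaky integrate-and-fire neuron with a threshold that adapts to the
    input statistics, loosely in the spirit of homeostatic/bioplausible neurons.
    All constants are illustrative, not the paper's."""
    v, thr = 0.0, v_thr0
    spikes = []
    for x in inputs:
        v += dt * (-v / tau_mem + x)          # leaky membrane integration
        s = 1.0 if v >= thr else 0.0          # spike when the membrane crosses the threshold
        spikes.append(s)
        v = 0.0 if s else v                   # reset after a spike
        # the threshold drifts up after spiking and decays back otherwise,
        # keeping the firing rate roughly stable under varying input
        thr += dt * ((v_thr0 - thr) / tau_thr + 0.5 * s)
    return np.array(spikes)

rng = np.random.default_rng(0)
weak = adaptive_lif(rng.random(200) * 0.1)
strong = adaptive_lif(rng.random(200) * 0.5)
print(weak.mean(), strong.mean())   # firing rates stay in a similar range
```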

Entangled View-Epipolar Information Aggregation for Generalizable Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2311.11845
  • repo_url: https://github.com/tatakai1/evenerf
  • paper_authors: Zhiyuan Min, Yawei Luo, Wei Yang, Yuesong Wang, Yi Yang
  • for: 本研究旨在提高NeRF模型的常规化能力,使其直接从新场景中生成新视角图像,不需要重新训练场景特定的NeRF模型。
  • methods: 我们提出了一种名为EVE-NeRF的Entangled View-Epipolar Information Aggregation方法,它在拼接多视图特征时注入场景不变的外观连续性和几何一致性约束,以提高3D表示的普适性。
  • results: 我们的方法在多种评估场景中达到了状态的表现,比起单一维度的拼接,杂合网络更好地保持了3D场景的几何和外观重建精度。
    Abstract Generalizable NeRF can directly synthesize novel views across new scenes, eliminating the need for scene-specific retraining in vanilla NeRF. A critical enabling factor in these approaches is the extraction of a generalizable 3D representation by aggregating source-view features. In this paper, we propose an Entangled View-Epipolar Information Aggregation method dubbed EVE-NeRF. Different from existing methods that consider cross-view and along-epipolar information independently, EVE-NeRF conducts the view-epipolar feature aggregation in an entangled manner by injecting the scene-invariant appearance continuity and geometry consistency priors to the aggregation process. Our approach effectively mitigates the potential lack of inherent geometric and appearance constraint resulting from one-dimensional interactions, thus further boosting the 3D representation generalizablity. EVE-NeRF attains state-of-the-art performance across various evaluation scenarios. Extensive experiments demonstate that, compared to prevailing single-dimensional aggregation, the entangled network excels in the accuracy of 3D scene geometry and appearance reconstruction.Our project page is https://github.com/tatakai1/EVENeRF.
    摘要 可泛化的NeRF能够直接在新场景中合成新视角图像,从而无需像原始NeRF那样针对每个场景重新训练。这类方法的一个关键因素,是通过聚合源视图特征来提取可泛化的3D表示。在这篇论文中,我们提出了一种名为EVE-NeRF的纠缠式视图-极线信息聚合方法。与现有方法将跨视图信息和沿极线信息分别独立处理不同,EVE-NeRF以纠缠的方式进行视图-极线特征聚合,在聚合过程中注入场景不变的外观连续性先验和几何一致性先验。该方法有效缓解了一维交互可能导致的内在几何与外观约束的缺失,从而进一步提升3D表示的泛化能力。EVE-NeRF在多种评估场景下均取得了最先进的性能。大量实验表明,与主流的单一维度聚合相比,纠缠网络在3D场景几何与外观重建精度上更具优势。项目页面:https://github.com/tatakai1/EVENeRF。

Few-shot Multispectral Segmentation with Representations Generated by Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.11827
  • repo_url: None
  • paper_authors: Dilith Jayakody, Thanuja Ambegoda
  • for: 提高几个示例数据集上的多spectral图像分割性能
  • methods: 使用 reinforcement learning 生成表达来生成特定类分割的表达,并将其用于更新数据集和进行分割
  • results: 在多个多spectral数据集上证明了提高分割性能的效果
    Abstract The task of multispectral image segmentation (segmentation of images with numerous channels/bands, each capturing a specific range of wavelengths of electromagnetic radiation) has been previously explored in contexts with large amounts of labeled data. However, these models tend not to generalize well to datasets of smaller size. In this paper, we propose a novel approach for improving few-shot segmentation performance on multispectral images using reinforcement learning to generate representations. These representations are generated in the form of mathematical expressions between channels and are tailored to the specific class being segmented. Our methodology involves training an agent to identify the most informative expressions, updating the dataset using these expressions, and then using the updated dataset to perform segmentation. Due to the limited length of the expressions, the model receives useful representations without any added risk of overfitting. We evaluate the effectiveness of our approach on several multispectral datasets and demonstrate its effectiveness in boosting the performance of segmentation algorithms.
    摘要 多光谱图像分割(即对包含多个波段的图像进行分割,每个波段捕获特定波长范围的电磁辐射)此前多在拥有大量标注数据的场景下进行研究。然而,这些模型通常难以泛化到规模较小的数据集。在这篇论文中,我们提出了一种新方法,利用强化学习生成表示,以提升多光谱图像上的少样本分割性能。这些表示以波段(通道)之间数学表达式的形式生成,并针对待分割的具体类别定制。我们的方法包括训练一个智能体来识别最有信息量的表达式,利用这些表达式更新数据集,再使用更新后的数据集进行分割。由于表达式长度有限,模型能够获得有用的表示,而不会带来额外的过拟合风险。我们在多个多光谱数据集上评估了该方法,并证明了它在提升分割算法性能方面的有效性。
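
The core idea of augmenting a multispectral cube with an agent-selected cross-channel expression can be illustrated as follows; the reinforcement-learning search that picks the expression is not shown, and the example expression is hypothetical.

```python
import numpy as np

def apply_band_expression(cube, expr):
    """Evaluate a short cross-channel expression (as an RL agent might emit)
    and append the result as an extra channel. `cube` is (H, W, C)."""
    bands = {f"b{i}": cube[..., i] for i in range(cube.shape[-1])}
    new_channel = eval(expr, {"__builtins__": {}}, bands)     # e.g. a normalized-difference index
    return np.concatenate([cube, new_channel[..., None]], axis=-1)

rng = np.random.default_rng(0)
cube = rng.random((64, 64, 6)).astype(np.float32)
# a hypothetical expression selected by the agent for the target class
augmented = apply_band_expression(cube, "(b3 - b2) / (b3 + b2 + 1e-6)")
print(augmented.shape)  # (64, 64, 7) -- the segmenter is then trained on the augmented cube
```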

Holistic Inverse Rendering of Complex Facade via Aerial 3D Scanning

  • paper_url: http://arxiv.org/abs/2311.11825
  • repo_url: None
  • paper_authors: Zixuan Xie, Rengan Xie, Rong Li, Kai Huang, Pengju Qiao, Jingsen Zhu, Xu Yin, Qi Ye, Wei Hua, Yuchi Huo, Hujun Bao
  • for: 用于 facade 的 geometry, lighting, and material 重建
  • methods: 使用 neural signed distance fields (SDFs) 和三种适应策略:semantic regularization、frequency-aware geometry regularization 和 visibility probe-based scheme
  • results: 实现 physically based 和 photorealistic 的 novel-view rendering、relighting 和 editing,并在实际环境中超越之前的方法。
    Abstract In this work, we use multi-view aerial images to reconstruct the geometry, lighting, and material of facades using neural signed distance fields (SDFs). Without the requirement of complex equipment, our method only takes simple RGB images captured by a drone as inputs to enable physically based and photorealistic novel-view rendering, relighting, and editing. However, a real-world facade usually has complex appearances ranging from diffuse rocks with subtle details to large-area glass windows with specular reflections, making it hard to attend to everything. As a result, previous methods can preserve the geometry details but fail to reconstruct smooth glass windows or verse vise. In order to address this challenge, we introduce three spatial- and semantic-adaptive optimization strategies, including a semantic regularization approach based on zero-shot segmentation techniques to improve material consistency, a frequency-aware geometry regularization to balance surface smoothness and details in different surfaces, and a visibility probe-based scheme to enable efficient modeling of the local lighting in large-scale outdoor environments. In addition, we capture a real-world facade aerial 3D scanning image set and corresponding point clouds for training and benchmarking. The experiment demonstrates the superior quality of our method on facade holistic inverse rendering, novel view synthesis, and scene editing compared to state-of-the-art baselines.
    摘要 在这项工作中,我们使用多视角航拍图像,结合神经符号距离场(SDF)重建建筑立面的几何、光照和材质。我们的方法无需复杂设备,只需无人机拍摄的简单RGB图像作为输入,即可实现基于物理的、照片级真实感的新视角渲染、重光照和编辑。然而,真实世界的建筑立面外观通常十分复杂,从带有细微纹理的漫反射岩石,到带有镜面反射的大面积玻璃窗,难以兼顾所有细节。因此,以往的方法往往能保留几何细节,却无法重建出平滑的玻璃窗,反之亦然。为应对这一挑战,我们引入了三种空间与语义自适应的优化策略:基于零样本分割技术的语义正则化以提升材质一致性;兼顾不同表面平滑度与细节的频率感知几何正则化;以及基于可见性探针的方案,用于高效建模大规模室外环境中的局部光照。此外,我们还采集了一个真实建筑立面的航拍3D扫描图像集及相应点云,用于训练和基准测试。实验表明,与最先进的基线相比,我们的方法在立面整体逆渲染、新视角合成和场景编辑方面都具有更高的质量。

Cross-View Graph Consistency Learning for Invariant Graph Representations

  • paper_url: http://arxiv.org/abs/2311.11821
  • repo_url: None
  • paper_authors: Jie Chen, Zhiming Li, Hua Mao, Wai Lok Woo, Xi Peng
  • for: This paper is written for analyzing graph-structured data and learning invariant graph representations for link prediction.
  • methods: The paper proposes a cross-view graph consistency learning (CGCL) method that uses two complementary augmented views to derive an incomplete graph structure, and then learns invariant graph representations through a cross-view training scheme.
  • results: The paper achieves competitive results on graph datasets in comparisons with several state-of-the-art algorithms, demonstrating the effectiveness of the proposed CGCL method.Here’s the Chinese version of the three information points:
  • for: 这篇论文是为了分析图structured数据而写的,并且学习图结构中的不变性表示以便预测链接。
  • methods: 该篇论文提出了跨视图图一致学习(CGCL)方法,通过两个补充的增强视图来 derivation incomplete图结构,然后通过跨视图训练方案来学习不变性表示。
  • results: 纸上提供了一些比较竞争力强的结果,证明了提案的CGCL方法的有效性。
    Abstract Graph representation learning is fundamental for analyzing graph-structured data. Exploring invariant graph representations remains a challenge for most existing graph representation learning methods. In this paper, we propose a cross-view graph consistency learning (CGCL) method that learns invariant graph representations for link prediction. First, two complementary augmented views are derived from an incomplete graph structure through a bidirectional graph structure augmentation scheme. This augmentation scheme mitigates the potential information loss that is commonly associated with various data augmentation techniques involving raw graph data, such as edge perturbation, node removal, and attribute masking. Second, we propose a CGCL model that can learn invariant graph representations. A cross-view training scheme is proposed to train the proposed CGCL model. This scheme attempts to maximize the consistency information between one augmented view and the graph structure reconstructed from the other augmented view. Furthermore, we offer a comprehensive theoretical CGCL analysis. This paper empirically and experimentally demonstrates the effectiveness of the proposed CGCL method, achieving competitive results on graph datasets in comparisons with several state-of-the-art algorithms.
    摘要 图表示学习是分析图结构数据的基础。对大多数现有的图表示学习方法而言,探索不变的图表示仍是一项挑战。在这篇论文中,我们提出了跨视图图一致性学习(CGCL)方法,学习用于链接预测的不变图表示。首先,通过双向图结构增强方案,从不完整的图结构中派生出两个互补的增强视图。这种增强方案可以缓解边扰动、节点移除和属性掩码等直接作用于原始图数据的增强技术常带来的信息损失。其次,我们提出了能够学习不变图表示的CGCL模型,并设计了一种跨视图训练方案:最大化一个增强视图与由另一个增强视图重建出的图结构之间的一致性信息。此外,我们还给出了完整的CGCL理论分析。实验表明,与多种最先进算法相比,所提出的CGCL方法在多个图数据集上取得了具有竞争力的结果。
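
A minimal sketch of the cross-view consistency idea: encode two complementary augmented views and ask the graph reconstructed from one view's embeddings to agree with the other view's structure. The one-layer GCN encoder, inner-product decoder, and random edge split are stand-ins, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

def gcn_layer(adj, x, w):
    """One propagation step: row-normalised adjacency times features times weights."""
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
    return torch.relu((adj / deg) @ x @ w)

def cross_view_consistency(adj_v1, adj_v2, x, w):
    """Encode two complementary augmented views; the graph reconstructed from one
    view's embeddings should agree with the other view's structure."""
    z1 = gcn_layer(adj_v1, x, w)
    z2 = gcn_layer(adj_v2, x, w)
    rec1 = torch.sigmoid(z1 @ z1.t())     # edge probabilities reconstructed from view 1
    rec2 = torch.sigmoid(z2 @ z2.t())
    return (F.binary_cross_entropy(rec1, adj_v2) +
            F.binary_cross_entropy(rec2, adj_v1))

n, d, h = 32, 16, 8
torch.manual_seed(0)
x = torch.rand(n, d)
w = torch.rand(d, h, requires_grad=True)
adj = ((torch.rand(n, n) < 0.2).float() + (torch.rand(n, n) < 0.2).float().t() > 0).float()
mask = (torch.rand(n, n) < 0.5).float()               # split edges into two complementary views
loss = cross_view_consistency(adj * mask, adj * (1 - mask), x, w)
loss.backward()
print(loss.item())
```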

CrackCLF: Automatic Pavement Crack Detection based on Closed-Loop Feedback

  • paper_url: http://arxiv.org/abs/2311.11815
  • repo_url: None
  • paper_authors: Chong Li, Zhun Fan, Ying Chen, Huibiao Lin, Laura Moretti, Giuseppe Loprencipe, Weihua Sheng, Kelvin C. P. Wang
  • for: automatic pavement crack detection
  • methods: 使用encoder-decoder框架和循环反馈,以及生成对抗网络
  • results: 在三个公共数据集上取得了更高的性能,并能自动 correect错误和适应环境变化
    Abstract Automatic pavement crack detection is an important task to ensure the functional performances of pavements during their service life. Inspired by deep learning (DL), the encoder-decoder framework is a powerful tool for crack detection. However, these models are usually open-loop (OL) systems that tend to treat thin cracks as the background. Meanwhile, these models can not automatically correct errors in the prediction, nor can it adapt to the changes of the environment to automatically extract and detect thin cracks. To tackle this problem, we embed closed-loop feedback (CLF) into the neural network so that the model could learn to correct errors on its own, based on generative adversarial networks (GAN). The resulting model is called CrackCLF and includes the front and back ends, i.e. segmentation and adversarial network. The front end with U-shape framework is employed to generate crack maps, and the back end with a multi-scale loss function is used to correct higher-order inconsistencies between labels and crack maps (generated by the front end) to address open-loop system issues. Empirical results show that the proposed CrackCLF outperforms others methods on three public datasets. Moreover, the proposed CLF can be defined as a plug and play module, which can be embedded into different neural network models to improve their performances.
    摘要 自动路面裂隙检测是一项重要任务,以确保路面在服务寿命中的功能性能。受深度学习(DL)的激发,编码-解码框架成为了裂隙检测的powerful工具。然而,这些模型通常是开 loop(OL)系统,它们会将薄裂隙视为背景。同时,这些模型无法自动更正预测错误,也无法适应环境变化自动提取和检测薄裂隙。为解决这个问题,我们将closed-loop反馈(CLF) embedding到神经网络中,使模型可以通过生成 adversarial networks(GAN)学习自动更正错误。得到的模型被称为CrackCLF,它包括前端和后端,即分 segmentation和对抗网络。前端采用U型框架生成裂隙地图,而后端使用多尺度损失函数来更正高阶不一致性 между标签和裂隙地图(生成前端),以解决开 loop系统问题。实验结果表明,提议的CrackCLF在三个公共数据集上表现出色,并且可以作为插件模块,嵌入不同神经网络模型以提高其性能。
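
The closed-loop feedback can be read as adversarial training between a segmentation front end and a discriminator that scores (image, crack map) pairs. The sketch below shows one such training step with toy networks; the U-shaped front end and the multi-scale back-end loss of CrackCLF are simplified away.

```python
import torch
import torch.nn as nn

# Stand-ins: a U-Net-like front end would normally produce the crack map.
seg_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(8, 1, 3, padding=1), nn.Sigmoid())
# Back-end discriminator scores (image, crack-map) pairs to penalise
# higher-order inconsistencies between predictions and labels.
disc = nn.Sequential(nn.Conv2d(4, 8, 3, stride=2, padding=1), nn.ReLU(),
                     nn.Conv2d(8, 1, 3, stride=2, padding=1))

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(seg_net.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

img = torch.rand(2, 3, 64, 64)
label = (torch.rand(2, 1, 64, 64) > 0.9).float()

# discriminator step: real pairs (image, label) vs fake pairs (image, prediction)
pred = seg_net(img).detach()
d_real = disc(torch.cat([img, label], dim=1))
d_fake = disc(torch.cat([img, pred], dim=1))
loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# generator step: pixel loss plus the adversarial feedback signal
pred = seg_net(img)
d_fake = disc(torch.cat([img, pred], dim=1))
loss_g = nn.functional.binary_cross_entropy(pred, label) + 0.1 * bce(d_fake, torch.ones_like(d_fake))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
print(loss_d.item(), loss_g.item())
```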

Robot Hand-Eye Calibration using Structure-from-Motion

  • paper_url: http://arxiv.org/abs/2311.11808
  • repo_url: None
  • paper_authors: Nicolas Andreff, Radu Horaud, Bernard Espiau
  • for: 这个论文提出了一种新的 flexible 手眼协调方法,而大多数现有的手眼协调技术都需要使用协调器,并且与摄像头定位方法相结合。
  • methods: 我们将运动恢复结构(structure-from-motion)与已知的机器人运动相结合,并证明该解可以写成线性形式,同时求出手眼参数和结构恢复方法中固有的未知尺度因子。该解不仅涵盖常见的一般螺旋运动,还能处理纯平移、纯旋转和平面运动等奇异运动。
  • results: 我们进行了大量的实验,并比较了该方法与现有的方法。结果表明,该方法的质量具有较高的精度和稳定性。
    Abstract In this paper we propose a new flexible method for hand-eye calibration. The vast majority of existing hand-eye calibration techniques requires a calibration rig which is used in conjunction with camera pose estimation methods. Instead, we combine structure-from-motion with known robot motions and we show that the solution can be obtained in linear form. The latter solves for both the hand-eye parameters and for the unknown scale factor inherent with structure-from-motion methods. The algebraic analysis that is made possible with such a linear formulation allows to investigate not only the well known case of general screw motions but also such singular motions as pure translations, pure rotations, and planar motions. In essence, the robot-mounted camera looks to an unknown rigid layout, tracks points over an image sequence and estimates the camera-to-robot relationship. Such a self calibration process is relevant for unmanned vehicles, robots working in remote places, and so forth. We conduct a large number of experiments which validate the quality of the method by comparing it with existing ones.
    摘要 在这篇论文中,我们提出了一种新的灵活手眼标定方法。绝大多数现有的手眼标定技术都需要标定装置,并与相机位姿估计方法配合使用。与此不同,我们将运动恢复结构(structure-from-motion)与已知的机器人运动相结合,并证明该解可以写成线性形式,同时求解手眼参数以及运动恢复结构方法固有的未知尺度因子。这种线性形式所支持的代数分析,不仅涵盖众所周知的一般螺旋运动情形,也能处理纯平移、纯旋转和平面运动等奇异运动。本质上,安装在机器人上的相机对着一个未知的刚性布局,在图像序列中跟踪特征点并估计相机与机器人之间的关系。这种自标定过程适用于无人载具、在偏远环境工作的机器人等场景。我们进行了大量实验,并与现有方法比较,验证了该方法的质量。
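
For background, the rotation part of hand-eye calibration AX = XB is classically linear: the rotation axes of corresponding robot and camera motions are related by R_X, so R_X follows from an SVD fit. The sketch below shows only this standard step; the paper's structure-from-motion formulation additionally recovers the translation and the unknown SfM scale factor in the same linear system, which is not reproduced here.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def solve_hand_eye_rotation(robot_rots, camera_rots):
    """Rotation part of AX = XB: the rotation axes of corresponding robot and
    camera motions satisfy alpha_i = R_X beta_i, so R_X is recovered with a
    Kabsch/SVD fit (Park & Martin style)."""
    A = np.stack([R.from_matrix(Ra).as_rotvec() for Ra in robot_rots])   # axes of robot motions
    B = np.stack([R.from_matrix(Rb).as_rotvec() for Rb in camera_rots])  # axes of camera motions
    H = B.T @ A
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    return Vt.T @ D @ U.T                                                # R_X with det = +1

# synthetic check: build a ground-truth R_X and consistent motion pairs
rng = np.random.default_rng(0)
Rx = R.from_rotvec(rng.normal(size=3)).as_matrix()
robot_rots, camera_rots = [], []
for _ in range(5):
    Rb = R.from_rotvec(rng.normal(size=3)).as_matrix()   # camera motion
    robot_rots.append(Rx @ Rb @ Rx.T)                    # induced robot motion (Ra Rx = Rx Rb)
    camera_rots.append(Rb)
print(np.allclose(solve_hand_eye_rotation(robot_rots, camera_rots), Rx, atol=1e-6))
```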

Robust Tumor Segmentation with Hyperspectral Imaging and Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2311.11782
  • repo_url: None
  • paper_authors: Mayar Lotfy, Anna Alperovich, Tommaso Giannantonio, Bjorn Barz, Xiaohan Zhang, Felix Holm, Nassir Navab, Felix Boehm, Carolin Schwamborn, Thomas K. Hoffmann, Patrick J. Schuler
  • for: 本文旨在提高高光谱成像(HSI)中肿瘤分割的准确性,以辅助肿瘤切除手术。
  • methods: 提出的方法将高光谱成像与图神经网络(GNN)结合,利用图块(tile)之间的空间上下文实现更稳健、更平滑的分割;每个图块的特征由卷积神经网络(CNN)提取,并与后续的GNN同时训练;同时在损失函数中引入局部图像质量指标,以增强训练过程对低质量区域的鲁棒性。
  • results: 与不考虑上下文的方法相比,该方法能显著更准确地区分健康组织与肿瘤组织,即使是来自此前未见过患者的图像;而考虑局部图像质量的精心设计的损失函数带来了进一步提升。
    Abstract Segmenting the boundary between tumor and healthy tissue during surgical cancer resection poses a significant challenge. In recent years, Hyperspectral Imaging (HSI) combined with Machine Learning (ML) has emerged as a promising solution. However, due to the extensive information contained within the spectral domain, most ML approaches primarily classify individual HSI (super-)pixels, or tiles, without taking into account their spatial context. In this paper, we propose an improved methodology that leverages the spatial context of tiles for more robust and smoother segmentation. To address the irregular shapes of tiles, we utilize Graph Neural Networks (GNNs) to propagate context information across neighboring regions. The features for each tile within the graph are extracted using a Convolutional Neural Network (CNN), which is trained simultaneously with the subsequent GNN. Moreover, we incorporate local image quality metrics into the loss function to enhance the training procedure's robustness against low-quality regions in the training images. We demonstrate the superiority of our proposed method using a clinical ex vivo dataset consisting of 51 HSI images from 30 patients. Despite the limited dataset, the GNN-based model significantly outperforms context-agnostic approaches, accurately distinguishing between healthy and tumor tissues, even in images from previously unseen patients. Furthermore, we show that our carefully designed loss function, accounting for local image quality, results in additional improvements. Our findings demonstrate that context-aware GNN algorithms can robustly find tumor demarcations on HSI images, ultimately contributing to better surgery success and patient outcome.
    摘要 在肿瘤切除手术中,准确划分肿瘤与健康组织的边界是一大挑战。近年来,高光谱成像(HSI)与机器学习(ML)的结合已成为一个很有前景的解决方案。然而,由于光谱域包含的信息量巨大,大多数ML方法只对单个HSI(超)像素或图块进行分类,而不考虑其空间上下文。在这篇论文中,我们提出了一种改进方法,利用图块之间的空间上下文实现更稳健、更平滑的分割。为了处理图块的不规则形状,我们使用图神经网络(GNN)在相邻区域之间传播上下文信息。图中每个图块的特征由卷积神经网络(CNN)提取,并与后续的GNN同时训练。此外,我们将局部图像质量指标引入损失函数,以增强训练过程对训练图像中低质量区域的鲁棒性。我们在由30名患者的51张HSI图像组成的临床离体(ex vivo)数据集上进行了实验。尽管数据集规模有限,基于GNN的模型仍显著优于不考虑上下文的方法,即使在来自此前未见过患者的图像上也能准确区分健康组织与肿瘤组织。此外,我们还发现,考虑局部图像质量的精心设计的损失函数带来了额外提升。这些结果表明,具备上下文感知能力的GNN算法能够在HSI图像上稳健地找到肿瘤边界,最终有助于提高手术成功率和患者预后。
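
A compact sketch of the tile-graph idea: a small CNN embeds each hyperspectral tile and one message-passing step mixes each tile with its spatial neighbours before per-tile classification. Layer sizes, the adjacency construction, and the omitted quality-weighted loss are all illustrative simplifications.

```python
import torch
import torch.nn as nn

class TileGNN(nn.Module):
    """Sketch: a small CNN embeds each hyperspectral tile, then one message-passing
    step mixes a tile's feature with its spatial neighbours before classification."""

    def __init__(self, bands=30, dim=32, classes=2):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(bands, dim, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.msg = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, classes)

    def forward(self, tiles, adj):
        # tiles: (N, bands, h, w) image tiles; adj: (N, N) 0/1 spatial adjacency
        feat = self.cnn(tiles)                                   # (N, dim) per-tile features
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        neighbour_mean = (adj @ feat) / deg                      # aggregate neighbour context
        feat = torch.relu(feat + self.msg(neighbour_mean))       # message-passing update
        return self.head(feat)                                   # per-tile class logits

model = TileGNN()
tiles = torch.rand(9, 30, 16, 16)                                # e.g. a 3x3 grid of tiles
adj = (torch.rand(9, 9) < 0.3).float().fill_diagonal_(0)
print(model(tiles, adj).shape)                                   # torch.Size([9, 2])
```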

Multimodal deep learning for mapping forest dominant height by fusing GEDI with earth observation data

  • paper_url: http://arxiv.org/abs/2311.11777
  • repo_url: None
  • paper_authors: Man Chen, Wenquan Dong, Hao Yu, Iain Woodhouse, Casey M. Ryan, Haoyu Liu, Selena Georgiou, Edward T. A. Mitchard
  • for: 这个研究旨在使用多源Remote sensing数据和深度学习模型精准地映射高分辨率森林高度。
  • methods: 我们提出了一种基于多模态注意力遥感网络(MARSNet)的新深度学习框架,使用GEDI数据、Sentinel-1数据、ALOS-2 PALSAR-2数据、Sentinel-2光学数据和其他辅助数据。MARSNet为每种遥感数据模态设置单独的编码器来提取多尺度特征,并用共享解码器融合特征、估计高度。
  • results: 我们的研究表明,MARSNet可以高效地估计森林主要高度,R2为0.62,RMSE为2.82米,超过了广泛使用的随机森林方法(R2=0.55,RMSE=3.05米)。此外,我们使用训练好的MARSNet模型生成了10米分辨率的墙壁到墙壁地域高度图像,并通过独立验证使用场地测量结果,MARSNet得到了R2=0.58和RMSE=3.76米的结果,与随机森林基准值(R2=0.41,RMSE=4.37米)相比,表明MARSNet的精度更高。
    Abstract The integration of multisource remote sensing data and deep learning models offers new possibilities for accurately mapping high spatial resolution forest height. We found that GEDI relative heights (RH) metrics exhibited strong correlation with the mean of the top 10 highest trees (dominant height) measured in situ at the corresponding footprint locations. Consequently, we proposed a novel deep learning framework termed the multi-modal attention remote sensing network (MARSNet) to estimate forest dominant height by extrapolating dominant height derived from GEDI, using Setinel-1 data, ALOS-2 PALSAR-2 data, Sentinel-2 optical data and ancillary data. MARSNet comprises separate encoders for each remote sensing data modality to extract multi-scale features, and a shared decoder to fuse the features and estimate height. Using individual encoders for each remote sensing imagery avoids interference across modalities and extracts distinct representations. To focus on the efficacious information from each dataset, we reduced the prevalent spatial and band redundancies in each remote sensing data by incorporating the extended spatial and band reconstruction convolution modules in the encoders. MARSNet achieved commendable performance in estimating dominant height, with an R2 of 0.62 and RMSE of 2.82 m, outperforming the widely used random forest approach which attained an R2 of 0.55 and RMSE of 3.05 m. Finally, we applied the trained MARSNet model to generate wall-to-wall maps at 10 m resolution for Jilin, China. Through independent validation using field measurements, MARSNet demonstrated an R2 of 0.58 and RMSE of 3.76 m, compared to 0.41 and 4.37 m for the random forest baseline. Our research demonstrates the effectiveness of a multimodal deep learning approach fusing GEDI with SAR and passive optical imagery for enhancing the accuracy of high resolution dominant height estimation.
    摘要 “多源Remote数据和深度学习模型的整合可以提供高分辨率森林高度的新可能性。我们发现GEDI相对高度(RH)指标与场景中最高10棵树(主对高)的平均值 exhibited strong correlation。因此,我们提出了一个名为多模式注意深度测量网络(MARSNet)的新深度学习框架,用于估计森林主对高。MARSNet包括每个遥感数据模式的分别Encoder来提取多尺度特征,以及共同的解码器来融合特征和估计高度。这些Encoder对每个遥感数据模式进行分别对应,以避免模式之间的干扰和提取特有的表现。为了避免每个遥感数据模式的空间和频率统计重复,我们在Encoder中添加了扩展空间和频率重建卷积模组。MARSNet在主对高估计中表现了优异的成绩,R2为0.62,RMSE为2.82米,比较 Random Forest方法的R2为0.55,RMSE为3.05米。最后,我们将训练好的 MARSNet 模型应用到了 Jilin 地区的壁垒壁垒地图上,以10米resolution进行估计。通过独立验证使用场地测量,MARSNet exhibited R2为0.58,RMSE为3.76米,与Random Forest基eline的R2为0.41,RMSE为4.37米相比。我们的研究显示了多模式深度学习方法,融合 GEDI 、SAR 和过程式光学图像,可以提高高分辨率主对高估计的精度。”
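
The separate-encoder / shared-decoder layout can be sketched as below; channel counts and the tiny convolutional blocks are placeholders, and the attention and spatial/band reconstruction modules of MARSNet are omitted.

```python
import torch
import torch.nn as nn

class MultiModalHeightNet(nn.Module):
    """Sketch of the separate-encoder / shared-decoder idea: each remote-sensing
    modality gets its own encoder, features are concatenated, and a shared decoder
    regresses dominant height per pixel (channel counts are illustrative)."""

    def __init__(self, bands=(2, 2, 10), dim=16):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Conv2d(b, dim, 3, padding=1), nn.ReLU()) for b in bands)
        self.decoder = nn.Sequential(
            nn.Conv2d(dim * len(bands), dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, 1, 1))                      # height map in metres

    def forward(self, *modalities):
        feats = [enc(x) for enc, x in zip(self.encoders, modalities)]
        return self.decoder(torch.cat(feats, dim=1))

net = MultiModalHeightNet()
s1 = torch.rand(1, 2, 64, 64)    # e.g. Sentinel-1 VV/VH
p2 = torch.rand(1, 2, 64, 64)    # e.g. PALSAR-2 HH/HV
s2 = torch.rand(1, 10, 64, 64)   # e.g. Sentinel-2 bands
height = net(s1, p2, s2)
print(height.shape)              # torch.Size([1, 1, 64, 64]); supervised at GEDI footprints
```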

Practical cross-sensor color constancy using a dual-mapping strategy

  • paper_url: http://arxiv.org/abs/2311.11773
  • repo_url: None
  • paper_authors: Shuwei Yue, Minchen Wei
  • for: 提出了一种基于双映射策略的图像照明估计方法,该方法只需要一个简单的白点测试传感器,并且可以在照明估计和图像重建之间进行快速转换。
  • methods: 该方法使用了图像重建和照明估计两个映射,然后使用轻量级多层感知神经网络(MLP)模型进行优化。
  • results: 该方法可以快速实现图像照明估计,并且可以减少传感器差异和提高性能,仅需要一小段的训练时间(约0.003 MB的内存和1小时的训练时间)和快速执行(约0.3 ms和1 ms在GPU和CPU上),并且不敏感于输入图像分辨率。
    Abstract Deep Neural Networks (DNNs) have been widely used for illumination estimation, which is time-consuming and requires sensor-specific data collection. Our proposed method uses a dual-mapping strategy and only requires a simple white point from a test sensor under a D65 condition. This allows us to derive a mapping matrix, enabling the reconstructions of image data and illuminants. In the second mapping phase, we transform the re-constructed image data into sparse features, which are then optimized with a lightweight multi-layer perceptron (MLP) model using the re-constructed illuminants as ground truths. This approach effectively reduces sensor discrepancies and delivers performance on par with leading cross-sensor methods. It only requires a small amount of memory (~0.003 MB), and takes ~1 hour training on an RTX3070Ti GPU. More importantly, the method can be implemented very fast, with ~0.3 ms and ~1 ms on a GPU or CPU respectively, and is not sensitive to the input image resolution. Therefore, it offers a practical solution to the great challenges of data recollection that is faced by the industry.
    摘要 深度神经网络(DNNs)广泛应用于照明估计,这是时间费时且需要特定传感器数据采集。我们的提议方法采用双映射策略,只需要一个简单的白点数据集来自测传感器,并在D65条件下进行映射。这使得我们可以 derivate一个映射矩阵,启用图像数据和照明的重建。在第二个映射阶段,我们将重建的图像数据转换为稀疏特征,然后使用轻量级多层感知器(MLP)模型进行优化,使用重建的照明作为真实值。这种方法可以有效减少传感器差异,并提供与前列横跨传感器方法相当的性能。它只需要一小Amount of memory(约0.003 MB),并在RTX3070Ti GPU上训练约1小时。此外,该方法具有快速实现的特点,在GPU和CPU上分别需要0.3毫秒和1毫秒的时间,并不敏感于输入图像分辨率。因此,它提供了实际的解决方案,避免了业界面估计数据收集的大问题。
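
As we read it, the first mapping uses only the test sensor's response to a white point under D65 to build a von-Kries-style diagonal matrix into the reference sensor's space; a lightweight MLP on sparse features then refines the illuminant (second stage, not shown). The numbers below are made up for illustration.

```python
import numpy as np

def white_point_mapping(white_test, white_ref):
    """First mapping of the dual-mapping idea (as we understand it): a diagonal
    von-Kries-style matrix that takes the test sensor's D65 white point to the
    reference sensor's, so images/illuminants can be re-expressed in the
    reference sensor's space."""
    return np.diag(np.asarray(white_ref, float) / np.asarray(white_test, float))

white_test = [0.9, 1.0, 0.7]          # test sensor response to a D65 white patch (illustrative)
white_ref  = [0.95, 1.0, 0.85]        # reference (training) sensor response (illustrative)
M = white_point_mapping(white_test, white_ref)

rgb_test = np.array([0.4, 0.5, 0.3])  # a pixel / illuminant estimate from the test sensor
print(M @ rgb_test)                   # mapped into the reference sensor's space
# A lightweight MLP trained on sparse features of the mapped image then refines
# the illuminant estimate (second mapping stage, not shown here).
```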

A Good Feature Extractor Is All You Need for Weakly Supervised Learning in Histopathology

  • paper_url: http://arxiv.org/abs/2311.11772
  • repo_url: None
  • paper_authors: Georg Wölflein, Dyke Ferber, Asier Rabasco Meneghetti, Omar S. M. El Nahhas, Daniel Truhn, Zunamys I. Carrero, David J. Harrison, Ognjen Arandjelović, Jakob N. Kather
  • for: 本文旨在评估公开的病理学自监督(SSL)特征提取器的鲁棒性,并为临床应用确定最合适的特征提取器。
  • methods: 本文在弱监督设置下开展切片级预测任务并使用外部验证队列,对公开可用的特征提取器进行了实证评估,涵盖九个任务、五个数据集、三种下游架构和多种预处理设置。
  • results: 结果表明,省略染色归一化和图像增强不会损害下游性能,同时大幅节省内存和计算;表现最好的特征提取器在潜在空间中对染色变化和旋转等增强表现出显著的鲁棒性。
    Abstract Deep learning is revolutionising pathology, offering novel opportunities in disease prognosis and personalised treatment. Historically, stain normalisation has been a crucial preprocessing step in computational pathology pipelines, and persists into the deep learning era. Yet, with the emergence of feature extractors trained using self-supervised learning (SSL) on diverse pathology datasets, we call this practice into question. In an empirical evaluation of publicly available feature extractors, we find that omitting stain normalisation and image augmentations does not compromise downstream performance, while incurring substantial savings in memory and compute. Further, we show that the top-performing feature extractors are remarkably robust to variations in stain and augmentations like rotation in their latent space. Contrary to previous patch-level benchmarking studies, our approach emphasises clinical relevance by focusing on slide-level prediction tasks in a weakly supervised setting with external validation cohorts. This work represents the most comprehensive robustness evaluation of public pathology SSL feature extractors to date, involving more than 6,000 training runs across nine tasks, five datasets, three downstream architectures, and various preprocessing setups. Our findings stand to streamline digital pathology workflows by minimising preprocessing needs and informing the selection of feature extractors.
    摘要 深度学习正在变革病理学,为疾病预后和个性化治疗带来新的机遇。染色归一化历来是计算病理学流程中关键的预处理步骤,并一直延续到深度学习时代。然而,随着在多样化病理数据集上以自监督学习(SSL)训练的特征提取器的出现,我们对这一做法提出质疑。通过对公开可用特征提取器的实证评估,我们发现省略染色归一化和图像增强不会损害下游性能,同时显著节省内存和计算。此外,表现最好的特征提取器在潜在空间中对染色变化和旋转等增强具有显著的鲁棒性。与以往的图块级基准研究不同,我们的方法关注弱监督设置下的切片级预测任务并使用外部验证队列,从而强调临床相关性。这项工作是迄今为止对公开病理SSL特征提取器最全面的鲁棒性评估,涉及九个任务、五个数据集、三种下游架构和多种预处理设置下的6,000余次训练。我们的发现有望通过减少预处理需求和指导特征提取器的选择来简化数字病理工作流程。

Non-Contact NIR PPG Sensing through Large Sequence Signal Regression

  • paper_url: http://arxiv.org/abs/2311.11757
  • repo_url: None
  • paper_authors: Timothy Hanley, Dara Golden, Robyn Maxwell, Ashkan Parsi, Joseph Lemley
  • for: 这个论文是为了演示一种新的非接触感知技术,用于从 Near Infra-Red (NIR) 视频中提取心率信号。
  • methods: 这个论文使用了一种 alternating Convolution Attention Network (CAN) 架构,通过对 NIR 视频序列进行卷积和注意力重叠来进行感知。
  • results: 这个论文使用了两个公共可用的数据集,通过对这些数据集进行训练,实现了对 NIR 视频中心率信号的高精度预测。训练结果表明,使用这种 CAN 架构可以在 NIR 视频中提取高精度的心率信号,MAE 为 0.99 bpm。
    Abstract Non-Contact sensing is an emerging technology with applications across many industries from driver monitoring in vehicles to patient monitoring in healthcare. Current state-of-the-art implementations focus on RGB video, but this struggles in varying/noisy light conditions and is almost completely unfeasible in the dark. Near Infra-Red (NIR) video, however, does not suffer from these constraints. This paper aims to demonstrate the effectiveness of an alternative Convolution Attention Network (CAN) architecture, to regress photoplethysmography (PPG) signal from a sequence of NIR frames. A combination of two publicly available datasets, which is split into train and test sets, is used for training the CAN. This combined dataset is augmented to reduce overfitting to the 'normal' 60 - 80 bpm heart rate range by providing the full range of heart rates along with corresponding videos for each subject. This CAN, when implemented over video cropped to the subject's head, achieved a Mean Average Error (MAE) of just 0.99 bpm, proving its effectiveness on NIR video and the architecture's feasibility to regress an accurate signal output.
    摘要 非接触感测是一项新兴技术,应用于从车辆驾驶员监测到医疗保健中患者监测等多个行业。当前最先进的实现主要基于RGB视频,但其在多变或有噪声的光照条件下表现不佳,在黑暗环境中几乎完全不可行;而近红外(NIR)视频则不受这些限制。本文旨在展示一种交替的卷积注意力网络(CAN)架构从NIR帧序列中回归光电容积脉搏波(PPG)信号的有效性。我们将两个公开数据集合并后划分为训练集和测试集来训练CAN,并通过为每个受试者提供完整心率范围及对应视频来增强该合并数据集,以减轻模型对“常见”的60-80 bpm心率范围的过拟合。将该CAN应用于裁剪到受试者头部的视频后,平均绝对误差(MAE)仅为0.99 bpm,证明了其在NIR视频上的有效性以及该架构回归准确信号的可行性。

AdvGen: Physical Adversarial Attack on Face Presentation Attack Detection Systems

  • paper_url: http://arxiv.org/abs/2311.11753
  • repo_url: None
  • paper_authors: Sai Amrit Patnaik, Shivali Chansoriya, Anil K. Jain, Anoop M. Namboodiri
  • for: 防止面部识别系统在真实世界中受到攻击,因为攻击者可以通过修改捕捉到的图像来诱导系统进行误认。
  • methods: 我们提出了一种基于生成对抗网络的自动化攻击策略,可以在物理世界场景下生成攻击图像,并在四个数据集和十个国家级人脸识别系统上进行了广泛的测试。
  • results: 我们的攻击策略可以在物理世界场景下达到82.01%的攻击成功率,并在实际Physical环境中进行了实验验证。
    Abstract Evaluating the risk level of adversarial images is essential for safely deploying face authentication models in the real world. Popular approaches for physical-world attacks, such as print or replay attacks, suffer from some limitations, like including physical and geometrical artifacts. Recently, adversarial attacks have gained attraction, which try to digitally deceive the learning strategy of a recognition system using slight modifications to the captured image. While most previous research assumes that the adversarial image could be digitally fed into the authentication systems, this is not always the case for systems deployed in the real world. This paper demonstrates the vulnerability of face authentication systems to adversarial images in physical world scenarios. We propose AdvGen, an automated Generative Adversarial Network, to simulate print and replay attacks and generate adversarial images that can fool state-of-the-art PADs in a physical domain attack setting. Using this attack strategy, the attack success rate goes up to 82.01%. We test AdvGen extensively on four datasets and ten state-of-the-art PADs. We also demonstrate the effectiveness of our attack by conducting experiments in a realistic, physical environment.
    摘要 评估对于攻击性图像的风险水平是在实际应用中部署人脸识别系统的必要条件。传统的物理攻击方法,如印刷或重播攻击,受到一些限制,例如包括物理和几何学性错误。过去的研究多数假设可以将攻击图像直接传入识别系统,但这不一定适用于实际应用中的系统。本文展示了面部识别系统对于攻击图像在实际世界场景中的脆弱性。我们提出了AdvGen,一个自动生成的对抗网络,来模拟印刷和重播攻击,生成可以诱导面部识别系统的攻击图像。使用这种攻击策略,攻击成功率可以达到82.01%。我们对四个数据集和十个现代PADS进行了广泛的测试。我们还证明了我们的攻击的有效性,通过在实际、物理环境中进行实验。

Fuzzy Information Seeded Region Growing for Automated Lesions After Stroke Segmentation in MR Brain Images

  • paper_url: http://arxiv.org/abs/2311.11742
  • repo_url: https://github.com/mawio02/fisrg-for-automated-lesion-after-stroke-segmentation-in-mri
  • paper_authors: Mario Pascual González
  • for: stroke lesion segmentation from brain MRI images
  • methods: Fuzzy Information Seeded Region Growing (FISRG) algorithm
  • results: highest Dice score of 94.2%, with an average Dice score of 88.1% in the third experiment, indicating effective segmentation of stroke lesions.
    Abstract In the realm of medical imaging, precise segmentation of stroke lesions from brain MRI images stands as a critical challenge with significant implications for patient diagnosis and treatment. Addressing this, our study introduces an innovative approach using a Fuzzy Information Seeded Region Growing (FISRG) algorithm. Designed to effectively delineate the complex and irregular boundaries of stroke lesions, the FISRG algorithm combines fuzzy logic with Seeded Region Growing (SRG) techniques, aiming to enhance segmentation accuracy. The research involved three experiments to optimize the FISRG algorithm's performance, each focusing on different parameters to improve the accuracy of stroke lesion segmentation. The highest Dice score achieved in these experiments was 94.2\%, indicating a high degree of similarity between the algorithm's output and the expert-validated ground truth. Notably, the best average Dice score, amounting to 88.1\%, was recorded in the third experiment, highlighting the efficacy of the algorithm in consistently segmenting stroke lesions across various slices. Our findings reveal the FISRG algorithm's strengths in handling the heterogeneity of stroke lesions. However, challenges remain in areas of abrupt lesion topology changes and in distinguishing lesions from similar intensity brain regions. The results underscore the potential of the FISRG algorithm in contributing significantly to advancements in medical imaging analysis for stroke diagnosis and treatment.
    摘要 在医学成像领域,精准地从脑MRI图像中分割中风损害的 segmentation 作为一项关键挑战,对患者诊断和治疗具有重要意义。我们的研究报告了一种新的方法,即基于模糊逻辑和种子区域生长(FISRG)算法。这种算法旨在准确地界定中风损害的复杂和不规则边界,并通过结合模糊逻辑和种子区域生长(SRG)技术,提高分割精度。我们的研究进行了三个实验来优化FISRG算法的性能,每个实验都关注不同的参数来提高中风损害分割的准确率。实验中最高的 dice 分数为 94.2%,表明算法的输出与专家验证的真实值之间存在高度的相似性。而第三个实验的平均 dice 分数为 88.1%,表明算法在不同的slice中具有高度的稳定性,并能够一致地分割中风损害。我们的发现表明FISRG算法在处理中风损害的多样性方面具有优异的能力。然而,在突然的损害 topology 变化和类似intensity脑区域的区分方面仍然存在挑战。结果表明FISRG算法在医学成像分析中具有广泛的应用前景,对stroke诊断和治疗具有重要意义。
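
The flavour of fuzzy-information seeded region growing can be shown with a toy 4-connected grower that accepts a neighbour when a Gaussian membership of its intensity (relative to the evolving region mean) is high enough. The membership function and thresholds are illustrative, not the tuned parameters from the paper's experiments.

```python
import numpy as np
from collections import deque

def fuzzy_region_grow(img, seed, sigma=0.15, min_membership=0.5):
    """Toy seeded region growing with a fuzzy (Gaussian) membership on intensity
    similarity to the evolving region mean."""
    h, w = img.shape
    mask = np.zeros((h, w), bool)
    mask[seed] = True
    region_sum, region_n = float(img[seed]), 1
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                mean = region_sum / region_n
                membership = np.exp(-((img[ny, nx] - mean) ** 2) / (2 * sigma ** 2))
                if membership >= min_membership:      # fuzzy acceptance rule
                    mask[ny, nx] = True
                    region_sum += float(img[ny, nx]); region_n += 1
                    queue.append((ny, nx))
    return mask

img = np.zeros((64, 64)); img[20:40, 20:45] = 0.8             # bright "lesion" on dark background
img += np.random.default_rng(0).normal(0, 0.05, img.shape)    # noise
print(fuzzy_region_grow(img, (30, 30)).sum())                  # roughly the 20x25 lesion area
```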

On the Importance of Large Objects in CNN Based Object Detection Algorithms

  • paper_url: http://arxiv.org/abs/2311.11714
  • repo_url: None
  • paper_authors: Ahmed Ben Saad, Gabriele Facciolo, Axel Davy
  • for: 提高对象检测器的性能,特别是小对象的检测 scores。
  • methods: 引入一个基于对象面积的权重项到训练损失函数中,以便更加强调大对象的学习特征。
  • results: 在COCO val 2017上,与 InternImage-T 模型结合使用我们的方法可以提高对象检测器的总性能 (+2 p.p. on small objects, +2 p.p. on medium objects, +4 p.p. on large objects)。 Additionally, we conduct additional experiments and ablation studies to confirm the robustness of our findings.
    Abstract Object detection models, a prominent class of machine learning algorithms, aim to identify and precisely locate objects in images or videos. However, this task might yield uneven performances sometimes caused by the objects sizes and the quality of the images and labels used for training. In this paper, we highlight the importance of large objects in learning features that are critical for all sizes. Given these findings, we propose to introduce a weighting term into the training loss. This term is a function of the object area size. We show that giving more weight to large objects leads to improved detection scores across all object sizes and so an overall improvement in Object Detectors performances (+2 p.p. of mAP on small objects, +2 p.p. on medium and +4 p.p. on large on COCO val 2017 with InternImage-T). Additional experiments and ablation studies with different models and on a different dataset further confirm the robustness of our findings.
    摘要 对象检测模型,一种常见的机器学习算法,目的是在图像或视频中准确地识别和定位对象。然而,这个任务可能会导致不均匀的性能,这可能是因为对象的大小以及用于训练的图像和标签的质量。在这篇文章中,我们强调大对象在学习特征上的重要性,这些特征是所有大小对象都需要的。基于这些发现,我们提议在训练损失函数中添加一个面积大小相关的权重项。我们显示,对大对象的权重更重,会导致所有对象大小上的检测分数提高 (+2 p.p. of mAP on small objects, +2 p.p. on medium and +4 p.p. on large on COCO val 2017 with InternImage-T)。附加的实验和消融研究表明我们的发现是可靠的。
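
A minimal sketch of adding an area-dependent weight to the per-object training loss so that larger objects contribute more; the exact weighting function used in the paper is not specified here, so the power-law form and exponent below are assumptions.

```python
import torch

def area_weighted_loss(per_box_loss, box_areas, image_area, gamma=0.5):
    """Add an area-dependent weight to the per-object detection loss, giving
    larger objects more influence during training (illustrative weighting)."""
    weights = (box_areas / image_area).clamp(min=1e-6) ** gamma   # grows with relative area
    weights = weights / weights.mean()                            # keep the overall loss scale
    return (weights * per_box_loss).mean()

per_box_loss = torch.tensor([1.0, 1.0, 1.0])          # e.g. classification + box regression per GT box
box_areas    = torch.tensor([32.0**2, 96.0**2, 300.0**2])
print(area_weighted_loss(per_box_loss, box_areas, image_area=640.0 * 640.0))
```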

GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting

  • paper_url: http://arxiv.org/abs/2311.11700
  • repo_url: None
  • paper_authors: Chi Yan, Delin Qu, Dong Wang, Dan Xu, Zhigang Wang, Bin Zhao, Xuelong Li
  • for: 这个论文是为了提出一种基于3D Gaussian表示的同时定位和地图生成(SLAM)系统。
  • methods: 这个方法使用了实时可导渲染 Rendering 管道,以提高地图优化和RGB-D重渲染的速度。它还提出了一种适应扩展策略,以有效地重建新观察到的场景几何结构,并改进了已经观察到的区域的地图。
  • results: 该方法在Replica和TUM-RGBD数据集上实现了与现有实时方法相当的竞争性表现,并且在运行时间和稳定性方面具有明显的优势。
    Abstract In this paper, we introduce $\textbf{GS-SLAM}$ that first utilizes 3D Gaussian representation in the Simultaneous Localization and Mapping (SLAM) system. It facilitates a better balance between efficiency and accuracy. Compared to recent SLAM methods employing neural implicit representations, our method utilizes a real-time differentiable splatting rendering pipeline that offers significant speedup to map optimization and RGB-D re-rendering. Specifically, we propose an adaptive expansion strategy that adds new or deletes noisy 3D Gaussian in order to efficiently reconstruct new observed scene geometry and improve the mapping of previously observed areas. This strategy is essential to extend 3D Gaussian representation to reconstruct the whole scene rather than synthesize a static object in existing methods. Moreover, in the pose tracking process, an effective coarse-to-fine technique is designed to select reliable 3D Gaussian representations to optimize camera pose, resulting in runtime reduction and robust estimation. Our method achieves competitive performance compared with existing state-of-the-art real-time methods on the Replica, TUM-RGBD datasets. The source code will be released soon.
    摘要 在这篇论文中,我们提出了$\textbf{GS-SLAM}$,首次将3D高斯表示引入同步定位与建图(SLAM)系统,从而在效率与精度之间取得更好的平衡。与近期采用神经隐式表示的SLAM方法相比,我们的方法使用实时可微的泼溅(splatting)渲染管线,为地图优化和RGB-D重渲染带来显著加速。具体而言,我们提出了一种自适应扩展策略,通过添加新的或删除有噪声的3D高斯,来高效重建新观测到的场景几何,并改进此前已观测区域的建图。这一策略对于将3D高斯表示从现有方法中合成静态物体,扩展到重建整个场景至关重要。此外,在位姿跟踪过程中,我们设计了一种有效的由粗到精技术,选取可靠的3D高斯表示来优化相机位姿,从而缩短运行时间并获得稳健的估计。我们的方法在Replica和TUM-RGBD数据集上取得了与现有最先进实时方法相当的竞争性表现。源代码即将发布。

Cut-and-Paste: Subject-Driven Video Editing with Attention Control

  • paper_url: http://arxiv.org/abs/2311.11697
  • repo_url: None
  • paper_authors: Zhichao Zuo, Zhao Zhang, Yan Luo, Yang Zhao, Haijun Zhang, Yi Yang, Meng Wang
  • for: 这个 paper 是为了提出一种基于文本指导的Semantic Video editing方法,以便在视频编辑中具有更高精度的控制,并且能够保留视频背景。
  • methods: 该 paper 使用了一种称为 Cut-and-Paste 的新框架,它利用文本指导和补充图像来进行 Semantic Video editing。具体来说,该方法使用了 cross attention 控制方法来限制编辑区域,以保持视频背景和空间时间一致性。
  • results: 该 paper 的实验结果表明,相比于现有的方法,Cut-and-Paste 方法能够更好地控制视频编辑,并且能够保留视频背景。这些结果是基于量化和主观评价的。
    Abstract This paper presents a novel framework termed Cut-and-Paste for real-word semantic video editing under the guidance of text prompt and additional reference image. While the text-driven video editing has demonstrated remarkable ability to generate highly diverse videos following given text prompts, the fine-grained semantic edits are hard to control by plain textual prompt only in terms of object details and edited region, and cumbersome long text descriptions are usually needed for the task. We therefore investigate subject-driven video editing for more precise control of both edited regions and background preservation, and fine-grained semantic generation. We achieve this goal by introducing an reference image as supplementary input to the text-driven video editing, which avoids racking your brain to come up with a cumbersome text prompt describing the detailed appearance of the object. To limit the editing area, we refer to a method of cross attention control in image editing and successfully extend it to video editing by fusing the attention map of adjacent frames, which strikes a balance between maintaining video background and spatio-temporal consistency. Compared with current methods, the whole process of our method is like ``cut" the source object to be edited and then ``paste" the target object provided by reference image. We demonstrate that our method performs favorably over prior arts for video editing under the guidance of text prompt and extra reference image, as measured by both quantitative and subjective evaluations.
    摘要 To achieve this, we introduce a reference image as supplementary input to the text-driven video editing process. This helps avoid the need for detailed text prompts describing object appearance, allowing for more intuitive and efficient editing. Additionally, we extend a cross-attention control method from image editing to video editing, fusing attention maps of adjacent frames to maintain video background and spatio-temporal consistency.Our Cut-and-Paste method is like "cutting" the source object to be edited and "pasting" the target object from the reference image. We demonstrate that our method outperforms prior arts in terms of both quantitative and subjective evaluations.

Clarity ChatGPT: An Interactive and Adaptive Processing System for Image Restoration and Enhancement

  • paper_url: http://arxiv.org/abs/2311.11695
  • repo_url: None
  • paper_authors: Yanyan Wei, Zhao Zhang, Jiahuan Ren, Xiaogang Xu, Richang Hong, Yi Yang, Shuicheng Yan, Meng Wang
  • for: 提高图像修复和优化(IRE)方法的通用能力和交互功能,解决现有IRE方法的限制性和不足。
  • methods: 提出了一种基于对话智能的 transformative 系统 Clarity ChatGPT,结合了多种IRE方法,可自动探测图像异常类型并选择合适的修复方法,或者基于用户反馈进行迭代生成满意结果。
  • results: 在实验studies中,Clarity ChatGPT 能够有效提高IRE方法的通用能力和交互功能,并填补现有视力语言模型的低级域 gap。
    Abstract The generalization capability of existing image restoration and enhancement (IRE) methods is constrained by the limited pre-trained datasets, making it difficult to handle agnostic inputs such as different degradation levels and scenarios beyond their design scopes. Moreover, they are not equipped with interactive mechanisms to consider user preferences or feedback, and their end-to-end settings cannot provide users with more choices. Faced with the above-mentioned IRE method's limited performance and insufficient interactivity, we try to solve it from the engineering and system framework levels. Specifically, we propose Clarity ChatGPT-a transformative system that combines the conversational intelligence of ChatGPT with multiple IRE methods. Clarity ChatGPT can automatically detect image degradation types and select appropriate IRE methods to restore images, or iteratively generate satisfactory results based on user feedback. Its innovative features include a CLIP-powered detector for accurate degradation classification, no-reference image quality evaluation for performance evaluation, region-specific processing for precise enhancements, and advanced fusion techniques for optimal restoration results. Clarity ChatGPT marks a significant advancement in integrating language and vision, enhancing image-text interactions, and providing a robust, high-performance IRE solution. Our case studies demonstrate that Clarity ChatGPT effectively improves the generalization and interaction capabilities in the IRE, and also fills the gap in the low-level domain of the existing vision-language model.
    摘要 现有图像修复与增强(IRE)方法的泛化能力受限于有限的预训练数据,难以处理不同退化程度以及设计范围之外的未知输入。此外,它们缺乏考虑用户偏好或反馈的交互机制,端到端的设置也无法为用户提供更多选择。面对上述IRE方法性能受限、交互性不足的问题,我们尝试从工程和系统框架层面加以解决。具体而言,我们提出了Clarity ChatGPT——一个将ChatGPT的对话智能与多种IRE方法相结合的变革性系统。Clarity ChatGPT能够自动检测图像退化类型并选择合适的IRE方法修复图像,或基于用户反馈迭代生成令人满意的结果。其创新特性包括:用于准确退化分类的CLIP驱动检测器、用于性能评估的无参考图像质量评价、用于精细增强的区域特定处理,以及用于获得最佳修复结果的先进融合技术。Clarity ChatGPT标志着在整合语言与视觉、增强图文交互方面的重要进展,并提供了一个稳健、高性能的IRE解决方案。我们的案例研究表明,Clarity ChatGPT能够有效提升IRE的泛化与交互能力,同时填补了现有视觉-语言模型在低层视觉领域的空白。

Segment Together: A Versatile Paradigm for Semi-Supervised Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.11686
  • repo_url: None
  • paper_authors: Qingjie Zeng, Yutong Xie, Zilin Lu, Mengkang Lu, Yicheng Wu, Yong Xia
  • for: 这篇论文旨在提出一个通用的多任务半监督框架,以缓解医疗影像分割任务中标注数据稀缺的问题,并将多个数据集整合到一个统一模型中,从而更充分地利用无标注数据。
  • methods: 论文采用动态任务提示设计,可在不同数据集上灵活地分割不同目标;并构造一个基于cutmix策略的合成任务,在扩展的标签空间内增广前景目标。此外,还引入一致性约束,将多个任务的聚合预测与合成任务的预测对齐,以便更好地利用无标注数据训练模型。
  • results: 实验结果显示,VerSemi 模型在四个公开基准上较次优方法有大幅提升(例如平均提升 2.69% Dice),在半监督医疗影像分割中取得新的 SOTA 性能。
    Abstract Annotation scarcity has become a major obstacle for training powerful deep-learning models for medical image segmentation, restricting their deployment in clinical scenarios. To address it, semi-supervised learning by exploiting abundant unlabeled data is highly desirable to boost the model training. However, most existing works still focus on limited medical tasks and underestimate the potential of learning across diverse tasks and multiple datasets. Therefore, in this paper, we introduce a \textbf{Ver}satile \textbf{Semi}-supervised framework (VerSemi) to point out a new perspective that integrates various tasks into a unified model with a broad label space, to exploit more unlabeled data for semi-supervised medical image segmentation. Specifically, we introduce a dynamic task-prompted design to segment various targets from different datasets. Next, this unified model is used to identify the foreground regions from all labeled data, to capture cross-dataset semantics. Particularly, we create a synthetic task with a cutmix strategy to augment foreground targets within the expanded label space. To effectively utilize unlabeled data, we introduce a consistency constraint. This involves aligning aggregated predictions from various tasks with those from the synthetic task, further guiding the model in accurately segmenting foreground regions during training. We evaluated our VerSemi model on four public benchmarking datasets. Extensive experiments demonstrated that VerSemi can consistently outperform the second-best method by a large margin (e.g., an average 2.69\% Dice gain on four datasets), setting new SOTA performance for semi-supervised medical image segmentation. The code will be released.
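
One way to picture the cutmix synthetic task with a consistency constraint: predictions on a pasted composite should agree with the composite of the individual predictions. The toy below mixes two images of the same task; the real framework mixes foregrounds across datasets under task prompts and a broader label space.

```python
import torch
import torch.nn.functional as F

def cutmix_consistency(model, img_a, img_b, rng=torch.Generator().manual_seed(0)):
    """Sketch of a cutmix-style synthetic task with a consistency constraint:
    the prediction on a pasted composite should agree with the composite of the
    individual predictions (illustrative, not the paper's exact losses)."""
    _, _, h, w = img_a.shape
    ch, cw = h // 2, w // 2
    y = int(torch.randint(0, h - ch, (1,), generator=rng))
    x = int(torch.randint(0, w - cw, (1,), generator=rng))

    mixed = img_a.clone()
    mixed[:, :, y:y+ch, x:x+cw] = img_b[:, :, y:y+ch, x:x+cw]     # paste a patch of b into a

    with torch.no_grad():                                         # "teacher" predictions per image
        pa, pb = model(img_a).softmax(1), model(img_b).softmax(1)
    target = pa.clone()
    target[:, :, y:y+ch, x:x+cw] = pb[:, :, y:y+ch, x:x+cw]       # composite of the predictions

    pred = model(mixed).log_softmax(1)                            # prediction on the composite
    return F.kl_div(pred, target, reduction="batchmean")          # consistency loss

model = torch.nn.Conv2d(1, 3, 3, padding=1)                       # stand-in segmentation network
loss = cutmix_consistency(model, torch.rand(2, 1, 32, 32), torch.rand(2, 1, 32, 32))
loss.backward()
print(loss.item())
```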

Pyramid Diffusion for Fine 3D Large Scene Generation

  • paper_url: http://arxiv.org/abs/2311.12085
  • repo_url: https://github.com/Yuheng-SWJTU/pyramid-discrete-diffusion
  • paper_authors: Yuheng Liu, Xinke Li, Xueting Li, Lu Qi, Chongshou Li, Ming-Hsuan Yang
  • for: 本研究旨在Addressing the challenges of directly transferring 2D techniques to 3D scene generation, and proposing a novel approach for high-quality 3D scene generation.
  • methods: 本研究提出了一种多尺度模型(Pyramid Discrete Diffusion,PDD),可以逐步生成高质量的3D场景,从粗略到细节。
  • results: 实验覆盖了无条件和条件生成两种情况,并得到了吸引人的结果,证明模型在生成真实和细腻的3D场景方面具有效果和稳定性。
    Abstract Directly transferring the 2D techniques to 3D scene generation is challenging due to significant resolution reduction and the scarcity of comprehensive real-world 3D scene datasets. To address these issues, our work introduces the Pyramid Discrete Diffusion model (PDD) for 3D scene generation. This novel approach employs a multi-scale model capable of progressively generating high-quality 3D scenes from coarse to fine. In this way, the PDD can generate high-quality scenes within limited resource constraints and does not require additional data sources. To the best of our knowledge, we are the first to adopt the simple but effective coarse-to-fine strategy for 3D large scene generation. Our experiments, covering both unconditional and conditional generation, have yielded impressive results, showcasing the model's effectiveness and robustness in generating realistic and detailed 3D scenes. Our code will be available to the public.
    摘要 由于分辨率大幅降低以及缺乏全面的真实世界三维场景数据,将二维技术直接迁移到三维场景生成面临很大挑战。为解决这些问题,我们提出了用于三维场景生成的金字塔离散扩散模型(PDD)。这一新方法采用多尺度模型,能够由粗到细逐步生成高质量的三维场景,从而在有限资源约束下生成高质量场景,且无需额外数据源。据我们所知,我们是首个将简单而有效的由粗到细策略用于大规模三维场景生成的工作。我们的实验涵盖无条件生成和条件生成两种情形,取得了令人印象深刻的结果,证明了该模型在生成逼真且细致的三维场景方面的有效性和稳健性。我们的代码将向公众开放。

PMP-Swin: Multi-Scale Patch Message Passing Swin Transformer for Retinal Disease Classification

  • paper_url: http://arxiv.org/abs/2311.11669
  • repo_url: None
  • paper_authors: Zhihan Yang, Zhiming Cheng, Tengjin Weng, Shucheng He, Yaqi Wang, Xin Ye, Shuai Wang
  • for: 预测眼病诊断,提高诊断精度。
  • methods: 基于Message Passing机制的Patch Message Passing模块,以global交互强制特异性特征,并采用多种缩放的PatchSize进行多级划分。
  • results: 与现有方法比较,实现了remarkable的性能。
    Abstract Retinal disease is one of the primary causes of visual impairment, and early diagnosis is essential for preventing further deterioration. Nowadays, many works have explored Transformers for diagnosing diseases due to their strong visual representation capabilities. However, retinal diseases exhibit milder forms and often present with overlapping signs, which pose great difficulties for accurate multi-class classification. Therefore, we propose a new framework named Multi-Scale Patch Message Passing Swin Transformer for multi-class retinal disease classification. Specifically, we design a Patch Message Passing (PMP) module based on the Message Passing mechanism to establish global interaction for pathological semantic features and to exploit the subtle differences further between different diseases. Moreover, considering the various scale of pathological features we integrate multiple PMP modules for different patch sizes. For evaluation, we have constructed a new dataset, named OPTOS dataset, consisting of 1,033 high-resolution fundus images photographed by Optos camera and conducted comprehensive experiments to validate the efficacy of our proposed method. And the results on both the public dataset and our dataset demonstrate that our method achieves remarkable performance compared to state-of-the-art methods.
    摘要 视网膜疾病是导致视力障碍的主要原因之一,早期诊断对于防止病情进一步恶化至关重要。目前,许多工作利用 Transformer 强大的视觉表示能力进行疾病诊断。然而,视网膜疾病往往症状较轻微,且不同疾病的体征相互重叠,给精确的多类分类带来很大困难。因此,我们提出了一种新的框架,即多尺度图像块消息传递 Swin Transformer(PMP-Swin),用于多类视网膜疾病分类。具体而言,我们设计了一个基于消息传递机制的图像块消息传递(PMP)模块,以建立病理语义特征之间的全局交互,并进一步挖掘不同疾病之间的细微差异。此外,考虑到病理特征的多种尺度,我们针对不同的图像块大小集成了多个 PMP 模块。为了评估所提方法,我们构建了一个新的数据集 OPTOS,包含 1,033 张由 Optos 相机拍摄的高分辨率眼底图像,并进行了全面的实验验证。在公共数据集和我们自建数据集上的结果均表明,我们的方法相较于当前最先进的方法取得了出色的性能。
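A minimal sketch of patch-level message passing using multi-head self-attention over patch tokens at two scales, in the spirit of the PMP module described above. The module name, patch counts, and feature dimensions are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PatchMessagePassing(nn.Module):
    """Toy patch message-passing block: patches exchange information globally
    via multi-head self-attention. Dimensions are illustrative only."""
    def __init__(self, dim=96, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):          # tokens: (B, N_patches, dim)
        msg, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + msg)  # residual update after message passing

# two scales: coarse (49 patches) and fine (196 patches), fused by pooling
coarse, fine = torch.randn(2, 49, 96), torch.randn(2, 196, 96)
pmp_c, pmp_f = PatchMessagePassing(), PatchMessagePassing()
fused = torch.cat([pmp_c(coarse).mean(1), pmp_f(fine).mean(1)], dim=-1)  # (2, 192)
```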

ODDR: Outlier Detection & Dimension Reduction Based Defense Against Adversarial Patches

  • paper_url: http://arxiv.org/abs/2311.12084
  • repo_url: None
  • paper_authors: Nandish Chattopadhyay, Amira Guesmi, Muhammad Abdullah Hanif, Bassem Ouni, Muhammad Shafique
  • for: This paper aims to mitigate the effects of patch-based adversarial attacks on machine learning models.
  • methods: The proposed method, ODDR, uses a three-stage pipeline consisting of Fragmentation, Segregation, and Neutralization. Outlier detection techniques are used to identify and segregate anomalous features associated with adversarial perturbations, and dimension reduction methods are applied to mitigate the impact of these perturbations.
  • results: ODDR effectively mitigates patch-based adversarial attacks, with robust accuracies matching or lying within a small range of clean accuracies, and only a marginal compromise of 1-2% in performance on clean samples. The method outperforms other defenses, demonstrating its effectiveness.
    Abstract Adversarial attacks are a major deterrent towards the reliable use of machine learning models. A powerful type of adversarial attacks is the patch-based attack, wherein the adversarial perturbations modify localized patches or specific areas within the images to deceive the trained machine learning model. In this paper, we introduce Outlier Detection and Dimension Reduction (ODDR), a holistic defense mechanism designed to effectively mitigate patch-based adversarial attacks. In our approach, we posit that input features corresponding to adversarial patches, whether naturalistic or otherwise, deviate from the inherent distribution of the remaining image sample and can be identified as outliers or anomalies. ODDR employs a three-stage pipeline: Fragmentation, Segregation, and Neutralization, providing a model-agnostic solution applicable to both image classification and object detection tasks. The Fragmentation stage parses the samples into chunks for the subsequent Segregation process. Here, outlier detection techniques identify and segregate the anomalous features associated with adversarial perturbations. The Neutralization stage utilizes dimension reduction methods on the outliers to mitigate the impact of adversarial perturbations without sacrificing pertinent information necessary for the machine learning task. Extensive testing on benchmark datasets and state-of-the-art adversarial patches demonstrates the effectiveness of ODDR. Results indicate robust accuracies matching and lying within a small range of clean accuracies (1%-3% for classification and 3%-5% for object detection), with only a marginal compromise of 1%-2% in performance on clean samples, thereby significantly outperforming other defenses.
    摘要 对抗攻击是机器学习模型可靠应用的主要障碍之一。其中极具威胁的一类是基于补丁(patch)的攻击,即通过修改图像中的局部区域或特定位置来欺骗已训练的机器学习模型。在本文中,我们提出了一种名为异常检测与降维(ODDR)的整体防御机制,用于有效缓解基于补丁的对抗攻击。我们认为,无论对抗补丁是否具有自然外观,其对应的输入特征都会偏离图像其余部分的固有分布,因而可以被识别为离群点或异常值。ODDR 采用三阶段流水线:分块(Fragmentation)、分离(Segregation)和中和(Neutralization),提供了与模型无关的解决方案,适用于图像分类和目标检测任务。分块阶段将样本切分成若干小块,供后续分离阶段使用;在分离阶段,离群点检测技术识别并分离与对抗扰动相关的异常特征;中和阶段则对离群特征应用降维方法,在不损失机器学习任务所需关键信息的前提下削弱对抗扰动的影响。在基准数据集和最新对抗补丁上的大量实验证明了 ODDR 的有效性:其鲁棒精度与干净精度相当(分类任务相差 1%-3%,目标检测任务相差 3%-5%),而在干净样本上仅有 1%-2% 的轻微性能损失,显著优于其他防御方法。
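A rough, model-agnostic sketch of the Fragmentation / Segregation / Neutralization idea described above, using an isolation forest as the outlier detector and PCA as the dimension-reduction step. The fragment size, detector choice, and low-rank reconstruction are assumptions standing in for whatever ODDR actually uses.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA

def oddr_sketch(image, frag=16, n_components=8):
    """Three-stage sketch: fragment the image, flag outlier fragments,
    then project flagged fragments onto a few principal components."""
    h, w, c = image.shape
    # Fragmentation: non-overlapping frag x frag chunks, flattened to vectors
    chunks = (image[:h // frag * frag, :w // frag * frag]
              .reshape(h // frag, frag, w // frag, frag, c)
              .swapaxes(1, 2).reshape(-1, frag * frag * c))
    # Segregation: isolation forest marks anomalous chunks (-1) vs. inliers (+1)
    flags = IsolationForest(contamination=0.1, random_state=0).fit_predict(chunks)
    # Neutralization: low-rank reconstruction of the outlier chunks only
    pca = PCA(n_components=n_components).fit(chunks[flags == 1])
    chunks[flags == -1] = pca.inverse_transform(pca.transform(chunks[flags == -1]))
    return chunks, flags

chunks, flags = oddr_sketch(np.random.rand(224, 224, 3).astype(np.float32))
print("outlier fragments:", int((flags == -1).sum()))
```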

OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning

  • paper_url: http://arxiv.org/abs/2311.11666
  • repo_url: None
  • paper_authors: Haiyang Ying, Yixuan Yin, Jinzhi Zhang, Fan Wang, Tao Yu, Ruqi Huang, Lu Fang
  • for: 实现全面理解3D场景,需要一种通用的3D分割方法,可以同时分割多种对象,无论对象数量多少或类别多少,而且能够反映内在的层次结构。
  • methods: 我们提出了OmniSeg3D方法,它是一种通过层次对比学习框架将多视图不一致的2D分割映射到一致的3D特征场,以实现全面的3D分割和层次结构理解。
  • results: 我们的方法能够在高质量3D分割和准确的层次结构理解方面达到杰出的效果,并且提供了一个便捷的图形用户界面,以便用户在多种场景中进行自由交互。
    Abstract Towards holistic understanding of 3D scenes, a general 3D segmentation method is needed that can segment diverse objects without restrictions on object quantity or categories, while also reflecting the inherent hierarchical structure. To achieve this, we propose OmniSeg3D, an omniversal segmentation method aims for segmenting anything in 3D all at once. The key insight is to lift multi-view inconsistent 2D segmentations into a consistent 3D feature field through a hierarchical contrastive learning framework, which is accomplished by two steps. Firstly, we design a novel hierarchical representation based on category-agnostic 2D segmentations to model the multi-level relationship among pixels. Secondly, image features rendered from the 3D feature field are clustered at different levels, which can be further drawn closer or pushed apart according to the hierarchical relationship between different levels. In tackling the challenges posed by inconsistent 2D segmentations, this framework yields a global consistent 3D feature field, which further enables hierarchical segmentation, multi-object selection, and global discretization. Extensive experiments demonstrate the effectiveness of our method on high-quality 3D segmentation and accurate hierarchical structure understanding. A graphical user interface further facilitates flexible interaction for omniversal 3D segmentation.
    摘要 为了实现对3D场景的整体理解,需要一种通用的3D分割方法,能够在不限制对象数量或类别的情况下分割多样的对象,同时反映场景内在的层次结构。为此,我们提出了OmniSeg3D,一种旨在一次性分割3D场景中任意对象的全方位分割方法。其关键思想是通过层次对比学习框架,将多视图下不一致的2D分割提升为一致的3D特征场,具体分两步完成:首先,我们基于类别无关的2D分割设计了一种新的层次表示,用于建模像素之间的多级关系;其次,对从3D特征场渲染得到的图像特征在不同层级上进行聚类,并根据层级间的层次关系将其拉近或推远。该框架克服了2D分割不一致带来的挑战,得到全局一致的3D特征场,进而支持层次分割、多对象选择和全局离散化。大量实验证明了我们的方法在高质量3D分割和准确层次结构理解方面的有效性。此外,我们还提供了图形用户界面,方便用户灵活地进行全方位3D分割交互。

PanBench: Towards High-Resolution and High-Performance Pansharpening

  • paper_url: http://arxiv.org/abs/2311.12083
  • repo_url: None
  • paper_authors: Shiying Wang, Xuechao Zou, Kai Li, Junliang Xing, Pin Tao
  • for: 这篇论文研究遥感领域中的全色锐化(pansharpening)问题,即将低分辨率多光谱图像与高分辨率全色图像融合为一幅高分辨率图像,以提升遥感数据分析的精度。
  • methods: 论文提出了一种新的全色锐化网络 CMFNet,采用级联多尺度融合技术实现高保真的图像融合。
  • results: 大量实验验证了 CMFNet 的有效性,能够提升遥感全色锐化的质量。
    Abstract Pansharpening, a pivotal task in remote sensing, involves integrating low-resolution multispectral images with high-resolution panchromatic images to synthesize an image that is both high-resolution and retains multispectral information. These pansharpened images enhance precision in land cover classification, change detection, and environmental monitoring within remote sensing data analysis. While deep learning techniques have shown significant success in pansharpening, existing methods often face limitations in their evaluation, focusing on restricted satellite data sources, single scene types, and low-resolution images. This paper addresses this gap by introducing PanBench, a high-resolution multi-scene dataset containing all mainstream satellites and comprising 5,898 pairs of samples. Each pair includes a four-channel (RGB + near-infrared) multispectral image of 256x256 pixels and a mono-channel panchromatic image of 1,024x1,024 pixels. To achieve high-fidelity synthesis, we propose a Cascaded Multiscale Fusion Network (CMFNet) for Pansharpening. Extensive experiments validate the effectiveness of CMFNet. We have released the dataset, source code, and pre-trained models in the supplementary, fostering further research in remote sensing.
    摘要 全色锐化(pansharpening)是遥感中的一项关键任务,旨在将低分辨率多光谱图像与高分辨率全色图像融合,合成既具有高分辨率又保留多光谱信息的图像。这类锐化后的图像可提升遥感数据分析中地物分类、变化检测和环境监测的精度。深度学习技术在全色锐化中已取得显著成功,但现有方法在评估上往往受限,通常只关注有限的卫星数据源、单一场景类型和低分辨率图像。为填补这一空白,本文提出了 PanBench,一个覆盖所有主流卫星的高分辨率多场景数据集,共包含 5,898 对样本。每对样本包括一张 256x256 像素的四通道(RGB + 近红外)多光谱图像和一张 1,024x1,024 像素的单通道全色图像。为实现高保真融合,我们提出了一种级联多尺度融合网络(CMFNet)。大量实验验证了 CMFNet 的有效性。我们已在补充材料中发布数据集、源代码和预训练模型,以促进遥感领域的进一步研究。

Enhanced Spatio-Temporal Context for Temporally Consistent Robust 3D Human Motion Recovery from Monocular Videos

  • paper_url: http://arxiv.org/abs/2311.11662
  • repo_url: None
  • paper_authors: Sushovan Chanda, Amogh Tiwari, Lokender Tiwari, Brojeshwar Bhowmick, Avinash Sharma, Hrishav Barua
  • for: Temporally consistent 3D human body pose, shape, and motion estimation from monocular videos.
  • methods: Body-aware feature representation, per-frame pose and camera initialization, spatio-temporal feature aggregation using self-similarity and self-attention, and LSTM refinement.
  • results: Significantly lower acceleration error and outperformance over existing state-of-the-art methods in complex scenarios like partial occlusion, complex poses, and low illumination.
    Abstract Recovering temporally consistent 3D human body pose, shape and motion from a monocular video is a challenging task due to (self-)occlusions, poor lighting conditions, complex articulated body poses, depth ambiguity, and limited availability of annotated data. Further, doing a simple perframe estimation is insufficient as it leads to jittery and implausible results. In this paper, we propose a novel method for temporally consistent motion estimation from a monocular video. Instead of using generic ResNet-like features, our method uses a body-aware feature representation and an independent per-frame pose and camera initialization over a temporal window followed by a novel spatio-temporal feature aggregation by using a combination of self-similarity and self-attention over the body-aware features and the perframe initialization. Together, they yield enhanced spatiotemporal context for every frame by considering remaining past and future frames. These features are used to predict the pose and shape parameters of the human body model, which are further refined using an LSTM. Experimental results on the publicly available benchmark data show that our method attains significantly lower acceleration error and outperforms the existing state-of-the-art methods over all key quantitative evaluation metrics, including complex scenarios like partial occlusion, complex poses and even relatively low illumination.
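A minimal sketch of the spatio-temporal aggregation idea in the abstract above: per-frame features attend to the other frames of a temporal window and an LSTM then refines the sequence before predicting per-frame parameters. The feature sizes, the single-attention-plus-LSTM arrangement, and the output head are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class TemporalAggregator(nn.Module):
    """Toy spatio-temporal context module: self-attention mixes information
    across frames of a window, an LSTM then refines the sequence.
    Feature sizes and the output head are illustrative assumptions."""
    def __init__(self, feat_dim=256, hidden=128, n_params=85):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_params)  # e.g. pose/shape parameters per frame

    def forward(self, feats):                    # feats: (B, T, feat_dim)
        ctx, _ = self.attn(feats, feats, feats)  # every frame attends to past/future frames
        refined, _ = self.lstm(feats + ctx)
        return self.head(refined)                # (B, T, n_params)

window = torch.randn(2, 9, 256)                  # batch of 2 windows, 9 frames each
params = TemporalAggregator()(window)
print(params.shape)                              # torch.Size([2, 9, 85])
```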

Double-Condensing Attention Condenser: Leveraging Attention in Deep Learning to Detect Skin Cancer from Skin Lesion Images

  • paper_url: http://arxiv.org/abs/2311.11656
  • repo_url: None
  • paper_authors: Chi-en Amy Tai, Elizabeth Janes, Chris Czarnecki, Alexander Wong
  • for: 这paper是为了检测皮肤癌症的皮肤变性图像而写的。
  • methods: 这paper使用了一种叫做Double-Condensing Attention Condensers(DC-AC)的自注意网络结构,以便更快速地计算。
  • results: 这paper介绍了一种特化于皮肤癌症检测的深度学习模型,其中使用了DC-AC自注意网络结构,并且公开发布了这个模型以便让科学家在抗癌战中获得更多的助手。
    Abstract Skin cancer is the most common type of cancer in the United States and is estimated to affect one in five Americans. Recent advances have demonstrated strong performance on skin cancer detection, as exemplified by state of the art performance in the SIIM-ISIC Melanoma Classification Challenge; however these solutions leverage ensembles of complex deep neural architectures requiring immense storage and compute costs, and therefore may not be tractable. A recent movement for TinyML applications is integrating Double-Condensing Attention Condensers (DC-AC) into a self-attention neural network backbone architecture to allow for faster and more efficient computation. This paper explores leveraging an efficient self-attention structure to detect skin cancer in skin lesion images and introduces a deep neural network design with DC-AC customized for skin cancer detection from skin lesion images. The final model is publicly available as a part of a global open-source initiative dedicated to accelerating advancement in machine learning to aid clinicians in the fight against cancer.
    摘要 皮肤癌是美国最常见的癌症类型,估计每五个美国人中就有一人受其影响。近期研究在皮肤癌检测上表现强劲,例如在 SIIM-ISIC 黑色素瘤分类挑战赛中取得了最先进的成绩;然而,这些方案依赖由复杂深度神经网络组成的集成模型,需要巨大的存储和计算开销,因此可能难以实际部署。近来,TinyML 应用的一个趋势是将双重凝聚注意力压缩器(DC-AC)集成到自注意力神经网络骨干架构中,以实现更快、更高效的计算。本文探讨了利用高效的自注意力结构从皮肤病变图像中检测皮肤癌,并提出了一种针对该任务定制的、采用 DC-AC 的深度神经网络设计。最终模型已公开发布,作为全球开源计划的一部分,旨在加速机器学习的发展,帮助临床医生对抗癌症。

Cancer-Net PCa-Data: An Open-Source Benchmark Dataset for Prostate Cancer Clinical Decision Support using Synthetic Correlated Diffusion Imaging Data

  • paper_url: http://arxiv.org/abs/2311.11647
  • repo_url: None
  • paper_authors: Hayden Gunraj, Chi-en Amy Tai, Alexander Wong
  • for: This paper is written for the purpose of introducing an open-source benchmark dataset of volumetric correlated diffusion imaging (CDI$^s$) data for prostate cancer (PCa) patients, with the goal of advancing research efforts in machine learning and imaging for PCa diagnosis and treatment.
  • methods: The paper uses CDI$^s$ imaging data from a patient cohort of 200 cases, along with full annotations (gland masks, tumor masks, and PCa diagnosis for each tumor). The authors analyze the demographic and label region diversity of the dataset for potential biases.
  • results: The paper introduces Cancer-Net PCa-Data, the first-ever public dataset of CDI$^s$ imaging data for PCa, which is an open-source benchmark dataset for researchers to use in developing and evaluating machine learning models for PCa diagnosis and treatment. The dataset is diverse and comprehensive, with 200 patient cases and full annotations, and has the potential to aid clinicians in the global fight against cancer.
    Abstract The recent introduction of synthetic correlated diffusion (CDI$^s$) imaging has demonstrated significant potential in the realm of clinical decision support for prostate cancer (PCa). CDI$^s$ is a new form of magnetic resonance imaging (MRI) designed to characterize tissue characteristics through the joint correlation of diffusion signal attenuation across different Brownian motion sensitivities. Despite the performance improvement, the CDI$^s$ data for PCa has not been previously made publicly available. In our commitment to advance research efforts for PCa, we introduce Cancer-Net PCa-Data, an open-source benchmark dataset of volumetric CDI$^s$ imaging data of PCa patients. Cancer-Net PCa-Data consists of CDI$^s$ volumetric images from a patient cohort of 200 patient cases, along with full annotations (gland masks, tumor masks, and PCa diagnosis for each tumor). We also analyze the demographic and label region diversity of Cancer-Net PCa-Data for potential biases. Cancer-Net PCa-Data is the first-ever public dataset of CDI$^s$ imaging data for PCa, and is a part of the global open-source initiative dedicated to advancement in machine learning and imaging research to aid clinicians in the global fight against cancer.

CastDet: Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning

  • paper_url: http://arxiv.org/abs/2311.11646
  • repo_url: None
  • paper_authors: Yan Li, Weiwei Guo, Dunyun He, Jiaqi Zhou, Yuze Gao, Wenxian Yu
  • for: This paper focuses on open-vocabulary object detection (OVD) in aerial images, which enables the characterization of new objects beyond training categories on the earth surface without annotating training images for these new categories.
  • methods: The proposed CastDet framework is an end-to-end student-teacher open-vocabulary object detection framework that leverages the CLIP model as an extra omniscient teacher of rich knowledge into the student-teacher self-learning process. The framework also employs a dynamic label queue technique to maintain high-quality pseudo labels during batch training and mitigate label imbalance.
  • results: The proposed CastDet achieves superior open-vocabulary detection performance, with an HM (Harmonic Mean) of 40.0, outperforming previous methods Detic/ViLD by 26.9/21.1 on the VisDroneZSD dataset.
    Abstract Object detection in aerial images is a pivotal task for various earth observation applications, whereas current algorithms learn to detect only a pre-defined set of object categories demanding sufficient bounding-box annotated training samples and fail to detect novel object categories. In this paper, we consider open-vocabulary object detection (OVD) in aerial images that enables the characterization of new objects beyond training categories on the earth surface without annotating training images for these new categories. The performance of OVD depends on the quality of class-agnostic region proposals and pseudo-labels that can generalize well to novel object categories. To simultaneously generate high-quality proposals and pseudo-labels, we propose CastDet, a CLIP-activated student-teacher open-vocabulary object Detection framework. Our end-to-end framework within the student-teacher mechanism employs the CLIP model as an extra omniscient teacher of rich knowledge into the student-teacher self-learning process. By doing so, our approach boosts novel object proposals and classification. Furthermore, we design a dynamic label queue technique to maintain high-quality pseudo labels during batch training and mitigate label imbalance. We conduct extensive experiments on multiple existing aerial object detection datasets, which are set up for the OVD task. Experimental results demonstrate our CastDet achieving superior open-vocabulary detection performance, e.g., reaching 40.0 HM (Harmonic Mean), which outperforms previous methods Detic/ViLD by 26.9/21.1 on the VisDroneZSD dataset.
    摘要 航空影像中的目标检测是多种对地观测应用的关键任务,但现有算法只能检测预先定义的目标类别,需要充足的边界框标注训练样本,且无法检测新的目标类别。本文研究航空影像中的开放词汇目标检测(OVD),使模型无需为新类别标注训练图像即可刻画地表上超出训练类别范围的新目标。OVD 的性能取决于能够良好泛化到新类别的高质量类别无关区域提议和伪标签。为同时生成高质量的提议和伪标签,我们提出了 CastDet,一个由 CLIP 激活的学生-教师开放词汇目标检测框架。该端到端框架在学生-教师机制中引入 CLIP 模型作为额外的全知教师,将丰富知识注入学生-教师自学习过程,从而提升新目标的提议与分类。此外,我们设计了动态标签队列技术,在批量训练中保持高质量伪标签并缓解标签不均衡。我们在多个现有航空目标检测数据集上进行了大量实验,结果表明 CastDet 取得了优越的开放词汇检测性能,例如在 VisDroneZSD 数据集上达到 40.0 HM(调和平均值),比此前的方法 Detic/ViLD 分别高出 26.9/21.1。
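A rough sketch of how class-agnostic region crops could be pseudo-labeled against open-vocabulary text prompts with a CLIP-style model, as in the student-teacher loop described above. The `encode_image` / `encode_text` interface, the temperature, and the confidence threshold are placeholders standing in for whatever vision-language model and settings CastDet actually uses.

```python
import torch

@torch.no_grad()
def pseudo_label_proposals(model, crops, class_prompts, thresh=0.6):
    """Assign open-vocabulary pseudo-labels to region crops by similarity to
    text embeddings. `model` is assumed to expose CLIP-style `encode_image` /
    `encode_text`; threshold and prompt format are illustrative choices."""
    img_emb = model.encode_image(crops)                       # (N, D) proposal crops
    txt_emb = model.encode_text(class_prompts)                # (C, D) tokenized class prompts
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)     # (N, C)
    conf, labels = probs.max(dim=-1)
    keep = conf > thresh                                      # only confident boxes enter the label queue
    return labels[keep], conf[keep], keep
```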

Video Face Re-Aging: Toward Temporally Consistent Face Re-Aging

  • paper_url: http://arxiv.org/abs/2311.11642
  • repo_url: https://github.com/kyugorithm/VFRAN
  • paper_authors: Abdul Muqeet, Kyuchul Lee, Bumsoo Kim, Yohan Hong, Hyungrae Lee, Woonggon Kim, Kwang Hee Lee
  • for: 修改视频中人脸年龄,以达到目标年龄
  • methods: 提出了一种新的人脸视频重新年龄化方法,包括一种新的人脸数据集和一种基eline架构,以及三种专门为视频重新年龄化测试的度量
  • results: 对公共数据集进行了广泛的实验,并 показа出方法在年龄变化和时间一致性方面的优于现有方法
    Abstract Video face re-aging deals with altering the apparent age of a person to the target age in videos. This problem is challenging due to the lack of paired video datasets maintaining temporal consistency in identity and age. Most re-aging methods process each image individually without considering the temporal consistency of videos. While some existing works address the issue of temporal coherence through video facial attribute manipulation in latent space, they often fail to deliver satisfactory performance in age transformation. To tackle the issues, we propose (1) a novel synthetic video dataset that features subjects across a diverse range of age groups; (2) a baseline architecture designed to validate the effectiveness of our proposed dataset, and (3) the development of three novel metrics tailored explicitly for evaluating the temporal consistency of video re-aging techniques. Our comprehensive experiments on public datasets, such as VFHQ and CelebV-HQ, show that our method outperforms the existing approaches in terms of both age transformation and temporal consistency.
    摘要 视频人脸重龄化旨在将视频中人物的外观年龄改变为目标年龄。由于缺乏在身份和年龄上保持时间一致性的成对视频数据集,这一问题颇具挑战性。大多数重龄化方法逐帧独立处理图像,而不考虑视频的时间一致性;部分现有工作虽然通过在潜在空间中操纵视频人脸属性来处理时间连贯性问题,但往往难以取得令人满意的年龄变换效果。为解决这些问题,我们提出:(1)一个涵盖多个年龄段受试者的新型合成视频数据集;(2)一个用于验证所提数据集有效性的基线架构;(3)三个专门用于评估视频重龄化时间一致性的新指标。我们在 VFHQ 和 CelebV-HQ 等公共数据集上进行了全面实验,结果表明我们的方法在年龄变换和时间一致性两方面均优于现有方法。

Reti-Diff: Illumination Degradation Image Restoration with Retinex-based Latent Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.11638
  • repo_url: https://github.com/chunminghe/reti-diff
  • paper_authors: Chunming He, Chengyu Fang, Yulun Zhang, Kai Li, Longxiang Tang, Chenyu You, Fengyang Xiao, Zhenhua Guo, Xiu Li
  • for: 提高光照退化图像的可见度,并减轻光照恶化带来的不利影响。
  • methods: 利用扩散模型(DM)在紧凑的潜在空间中生成简洁的引导先验,并提出名为 Reti-Diff 的新方案,包含两个关键组件:基于 Retinex 的潜在扩散模型(RLDM)和 Retinex 引导的 Transformer(RGformer)。
  • results: 在三个 IDIR 任务以及下游应用上均优于现有方法。
    Abstract Illumination degradation image restoration (IDIR) techniques aim to improve the visibility of degraded images and mitigate the adverse effects of deteriorated illumination. Among these algorithms, diffusion model (DM)-based methods have shown promising performance but are often burdened by heavy computational demands and pixel misalignment issues when predicting the image-level distribution. To tackle these problems, we propose to leverage DM within a compact latent space to generate concise guidance priors and introduce a novel solution called Reti-Diff for the IDIR task. Reti-Diff comprises two key components: the Retinex-based latent DM (RLDM) and the Retinex-guided transformer (RGformer). To ensure detailed reconstruction and illumination correction, RLDM is empowered to acquire Retinex knowledge and extract reflectance and illumination priors. These priors are subsequently utilized by RGformer to guide the decomposition of image features into their respective reflectance and illumination components. Following this, RGformer further enhances and consolidates the decomposed features, resulting in the production of refined images with consistent content and robustness to handle complex degradation scenarios. Extensive experiments show that Reti-Diff outperforms existing methods on three IDIR tasks, as well as downstream applications. Code will be available at \url{https://github.com/ChunmingHe/Reti-Diff}.
    摘要 ILLUMINATION DEGRADATION IMAGE RESTORATION(IDIR)技术目的是提高受损图像的可见度和减轻照明衰减的不良影响。其中的扩散模型(DM)基本方法具有良好的表现,但它们常受到重复计算和像素不对齐问题的困扰,尤其是在预测图像级别分布时。为解决这些问题,我们提议利用DM在紧凑的尺度空间内运行,生成简洁的指导假设。这种方法被称为Reti-Diff。Reti-Diff包括两个关键组件:Retinex基于的秘密DM(RLDM)和Retinex引导的变换器(RGformer)。为确保细节重建和照明更正,RLDM被赋予Retinex知识,从而提取反射和照明假设。这些假设后来被RGformer使用,以导引图像特征的分解为其各自的反射和照明组件。接下来,RGformer进一步加强和卷积这些分解的特征,从而生成了更加细腻和稳定的图像,可以满足复杂的受损enario。实验表明,Reti-Diff在三个IDIR任务上的表现都高于现有方法,同时在下游应用中也达到了更好的效果。代码将在 \url{https://github.com/ChunmingHe/Reti-Diff} 上提供。
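For intuition on the Retinex decomposition that RLDM builds on (image = reflectance x illumination), here is a generic single-image sketch: a smoothed max-channel map serves as illumination and reflectance is recovered by division. The Gaussian smoothing and max-channel prior are common textbook choices for illustration, not the RLDM itself.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_decompose(img, sigma=15.0, eps=1e-4):
    """Classic Retinex-style split I = R * L: take a smoothed max-channel map
    as illumination L and recover reflectance R by division."""
    illum = gaussian_filter(img.max(axis=-1), sigma=sigma)        # (H, W) illumination prior
    reflect = img / (illum[..., None] + eps)                      # (H, W, 3) reflectance
    return np.clip(reflect, 0.0, 1.0), illum

low_light = np.random.rand(256, 256, 3).astype(np.float32) * 0.3  # toy dark image
R, L = retinex_decompose(low_light)
```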

Generating Realistic Counterfactuals for Retinal Fundus and OCT Images using Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.11629
  • repo_url: None
  • paper_authors: Indu Ilanchezian, Valentyn Boreiko, Laura Kühlewein, Ziwei Huang, Murat Seçkin Ayhan, Matthias Hein, Lisa Koch, Philipp Berens
  • for: 用于解释临床决策或评估alternatives
  • methods: 使用一种扩散模型和一种对抗性鲁棒分类器,生成高度真实的对比图像和OCT B-scan
  • results: 用户研究发现,使用我们的方法生成的对比图像比之前的方法生成的更真实,甚至与真实图像难以分辨
    Abstract Counterfactual reasoning is often used in a clinical setting to explain decisions or weigh alternatives. Therefore, for imaging based modalities such as ophthalmology, it would be beneficial to be able to create counterfactual images, illustrating the answer to the question: "If the subject had had diabetic retinopathy, how would the fundus image have looked?" Here, we demonstrate that using a diffusion model in combination with an adversarially robust classifier trained on retinal disease classification tasks enables generation of highly realistic counterfactuals of retinal fundus images and optical coherence tomorgraphy (OCT) B-scans. Ideally, these classifiers encode the salient features indicative for each disease class and can steer the diffusion model to show realistic disease signs or remove disease-related lesions in a realistic way. Importantly, in a user study, domain experts found the counterfactuals generated using our method significantly more realistic than counterfactuals generated from a previous method, and even indistiguishable from realistic images.
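One common way to combine a diffusion model with a robust classifier for counterfactuals is classifier guidance: the gradient of the target-class log-probability nudges each denoising update toward the counterfactual class. The sketch below is schematic; `denoise_fn` standing for one reverse-diffusion update and the guidance scale are assumptions, not the paper's exact procedure.

```python
import torch

def guided_step(x, denoise_fn, classifier, target_class, scale=5.0):
    """One schematic guidance step for counterfactual generation: push the
    current sample toward `target_class` using the classifier's gradient."""
    x = x.detach().requires_grad_(True)
    log_prob = torch.log_softmax(classifier(x), dim=-1)[:, target_class].sum()
    grad = torch.autograd.grad(log_prob, x)[0]          # d log p(y|x) / dx
    with torch.no_grad():
        x_next = denoise_fn(x) + scale * grad           # denoise, then steer toward the class
    return x_next
```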

Semantic-Preserved Point-based Human Avatar

  • paper_url: http://arxiv.org/abs/2311.11614
  • repo_url: None
  • paper_authors: Lixiang Lin, Jianke Zhu
  • for: 提高 AR/VR 和数字娱乐领域中人类模拟体验的现实感,该论文首次提出了一种点基式人类模板,涵盖了数字人类表达范围的全部。
  • methods: 该模型使用两个多层感知机(MLP)分别建模依赖姿态的形变和线性混合蒙皮(LBS)权重;人体外观表示采用解码器与附着在每个点上的特征。与其他隐式方法不同,带朝向的点表示不仅提供了更直观的人体化身动画建模方式,还能显著降低训练和推理时间。
  • results: 该方法的实验结果表明其有效性。
    Abstract To enable realistic experience in AR/VR and digital entertainment, we present the first point-based human avatar model that embodies the entirety expressive range of digital humans. We employ two MLPs to model pose-dependent deformation and linear skinning (LBS) weights. The representation of appearance relies on a decoder and the features that attached to each point. In contrast to alternative implicit approaches, the oriented points representation not only provides a more intuitive way to model human avatar animation but also significantly reduces both training and inference time. Moreover, we propose a novel method to transfer semantic information from the SMPL-X model to the points, which enables to better understand human body movements. By leveraging the semantic information of points, we can facilitate virtual try-on and human avatar composition through exchanging the points of same category across different subjects. Experimental results demonstrate the efficacy of our presented method.
    摘要 为了在 AR/VR 和数字娱乐中提供逼真的体验,我们提出了首个能够覆盖数字人全部表达范围的基于点的人体化身模型。我们使用两个多层感知机(MLP)分别建模依赖姿态的形变和线性混合蒙皮(LBS)权重;外观表示则依赖于一个解码器以及附着在每个点上的特征。与其他隐式方法相比,带朝向的点表示不仅为人体化身动画提供了更直观的建模方式,还显著减少了训练和推理时间。此外,我们提出了一种将 SMPL-X 模型中的语义信息迁移到点上的新方法,有助于更好地理解人体运动。借助点的语义信息,我们可以通过在不同主体之间交换同类别的点来实现虚拟试衣和人体化身组合。实验结果验证了所提方法的有效性。
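For reference, the standard linear blend skinning (LBS) formula that the avatar's deformation relies on: each point is transformed by a weighted sum of bone transforms, p' = sum_j w_j (T_j p). The joint count and the random weights/identity transforms below are placeholders for illustration.

```python
import numpy as np

def linear_blend_skinning(points, weights, transforms):
    """Standard LBS with per-point blend weights.
    points: (N, 3), weights: (N, J), transforms: (J, 4, 4) homogeneous matrices."""
    p_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)   # (N, 4)
    per_bone = np.einsum('jab,nb->nja', transforms, p_h)                # (N, J, 4)
    blended = np.einsum('nj,nja->na', weights, per_bone)                # (N, 4)
    return blended[:, :3]

pts = np.random.rand(1000, 3)
w = np.random.dirichlet(np.ones(24), size=1000)     # 24 joints, weights sum to 1 per point
T = np.tile(np.eye(4), (24, 1, 1))                  # identity bone transforms as a placeholder
deformed = linear_blend_skinning(pts, w, T)         # equals pts with identity transforms
```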

CurriculumLoc: Enhancing Cross-Domain Geolocalization through Multi-Stage Refinement

  • paper_url: http://arxiv.org/abs/2311.11604
  • repo_url: https://github.com/npupilab/curriculumloc
  • paper_authors: Boni Hu, Lin Chen, Runjian Chen, Shuhui Bu, Pengcheng Han, Haowei Li
  • for: 这篇论文旨在提出一个可靠且可扩展的可视地对照对应方法,以实现实际的可视地对照 зада务。
  • methods: 这篇论文使用了一个精心设计的多阶段精度提高管线,以及一种全球 semantic 意识和本地几何验证的关键点检测和描述方法。
  • results: 实验结果显示,这篇论文的方法可以实现高精度的可视地对照,并且在两个不同的距离度量上创下新的高 recall@1 纪录值。
    Abstract Visual geolocalization is a cost-effective and scalable task that involves matching one or more query images, taken at some unknown location, to a set of geo-tagged reference images. Existing methods, devoted to semantic features representation, evolving towards robustness to a wide variety between query and reference, including illumination and viewpoint changes, as well as scale and seasonal variations. However, practical visual geolocalization approaches need to be robust in appearance changing and extreme viewpoint variation conditions, while providing accurate global location estimates. Therefore, inspired by curriculum design, human learn general knowledge first and then delve into professional expertise. We first recognize semantic scene and then measure geometric structure. Our approach, termed CurriculumLoc, involves a delicate design of multi-stage refinement pipeline and a novel keypoint detection and description with global semantic awareness and local geometric verification. We rerank candidates and solve a particular cross-domain perspective-n-point (PnP) problem based on these keypoints and corresponding descriptors, position refinement occurs incrementally. The extensive experimental results on our collected dataset, TerraTrack and a benchmark dataset, ALTO, demonstrate that our approach results in the aforementioned desirable characteristics of a practical visual geolocalization solution. Additionally, we achieve new high recall@1 scores of 62.6% and 94.5% on ALTO, with two different distances metrics, respectively. Dataset, code and trained models are publicly available on https://github.com/npupilab/CurriculumLoc.
    摘要 Visual地理位置定位是一项经济可行和可扩展的任务,即将一组或多组查询图像,取自未知位置,与一组准备了地理标记的参考图像进行匹配。现有方法主要关注semantic特征表示,逐渐向多样化 между查询和参考图像的鲁棒性进化,包括照明和视角变化、比例和季节变化。然而,实际 visual地理位置定位应用需要对应变化和极端视角变化的鲁棒性,同时提供准确的全球位置估计。因此,我们受到curriculum设计的 inspirited,人类在学习通用知识后,才能专注于专业专长。我们首先识别semantic场景,然后测量几何结构。我们的方法,称之为CurriculumLoc,包括细致的多Stage刷新管道和一种新型的关键点检测和描述,具有全球semantic认知和本地几何验证。我们在这些关键点和对应描述符基础上进行重新排名,并解决一个特定的 across-domain perspective-n-point(PnP)问题,在这个过程中,位置精度进行逐渐调整。我们的实验结果表明,我们的方法具有上述实际 visual地理位置定位应用中需要的愉悦特性。此外,我们在ALTO标准 dataset上取得了新的高 recall@1 分数为62.6%和94.5%,分别使用两种距离度量。数据集、代码和训练模型都可以在https://github.com/npupilab/CurriculumLoc上获得。
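A small sketch of the perspective-n-point (PnP) step on matched 2D-3D keypoints using OpenCV's RANSAC solver, as one way to realize the pose estimation named in the abstract. The intrinsics and correspondences below are synthetic placeholders; in practice they would come from the retrieved keypoints, descriptors, and reference geometry.

```python
import numpy as np
import cv2

# Synthetic setup: a known pose is used only to fabricate consistent 2D-3D matches.
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
rvec_gt = np.array([[0.1], [-0.2], [0.05]])
tvec_gt = np.array([[0.3], [0.1], [4.0]])

obj_pts = np.random.uniform(-1, 1, (50, 3))                 # 3D points in the scene frame
img_pts, _ = cv2.projectPoints(obj_pts, rvec_gt, tvec_gt, K, None)

# PnP with RANSAC rejects outlier matches and recovers the query camera pose
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    obj_pts.astype(np.float32), img_pts.astype(np.float32), K, None,
    reprojectionError=3.0)
print(ok, tvec.ravel())                                     # tvec should be close to tvec_gt
```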

Deep Equilibrium Diffusion Restoration with Parallel Sampling

  • paper_url: http://arxiv.org/abs/2311.11600
  • repo_url: https://github.com/caojiezhang/deqir
  • paper_authors: Jiezhang Cao, Yue Shi, Kai Zhang, Yulun Zhang, Radu Timofte, Luc Van Gool
  • for: 这 paper 的目的是重新思考基于扩散模型的图像修复方法,通过改进 sampling 链来减少计算成本并且提高图像修复质量。
  • methods: 这 paper 使用了 deep equilibrium 固定点系统来解决 diffusion-based IR 模型中的问题,并提出了一种基于 analytical solution 的单个图像 sampling 方法,以便在平行化的方式下进行图像修复。
  • results: 实验结果表明,这 paper 提出的方法可以在典型的 IR 任务和实际应用中达到高质量的图像修复,并且可以快速计算梯度和初始化优化。
    Abstract Diffusion-based image restoration (IR) methods aim to use diffusion models to recover high-quality (HQ) images from degraded images and achieve promising performance. Due to the inherent property of diffusion models, most of these methods need long serial sampling chains to restore HQ images step-by-step. As a result, it leads to expensive sampling time and high computation costs. Moreover, such long sampling chains hinder understanding the relationship between the restoration results and the inputs since it is hard to compute the gradients in the whole chains. In this work, we aim to rethink the diffusion-based IR models through a different perspective, i.e., a deep equilibrium (DEQ) fixed point system. Specifically, we derive an analytical solution by modeling the entire sampling chain in diffusion-based IR models as a joint multivariate fixed point system. With the help of the analytical solution, we are able to conduct single-image sampling in a parallel way and restore HQ images without training. Furthermore, we compute fast gradients in DEQ and found that initialization optimization can boost performance and control the generation direction. Extensive experiments on benchmarks demonstrate the effectiveness of our proposed method on typical IR tasks and real-world settings. The code and models will be made publicly available.
    摘要 Diffusion-based image restoration(IR)方法目标是使用扩散模型来恢复高质量(HQ)图像从降低图像中,并达到了可以的表现。由于扩散模型的内在性质,大多数这些方法需要长串行样本链来恢复HQ图像步骤。这会导致样本时间昂贵和计算成本高。此外,这些长串行样本链使得理解恢复结果和输入之间的关系困难,因为计算整个链上的梯度很难。在这种工作中,我们想要重新思考扩散基于IR模型的方法,即深度平衡(DEQ)固定点系统。我们特别是使用扩散模型整个样本链的联合多变量固定点系统来 derivate一个分析解。通过分析解,我们可以在平行样本中进行单图像恢复,并不需要训练。此外,我们在DEQ中计算了快速的梯度,并发现初始化优化可以提高性能并控制生成方向。我们对典型的IR任务和实际场景进行了广泛的实验,并证明了我们的提出的方法的有效性。代码和模型将公开发布。
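To make the deep-equilibrium view concrete, below is a plain fixed-point iteration x <- f(x); in a DEQ-style formulation, x would stack every state of the sampling chain so all steps are updated in parallel. The toy contraction mapping, tolerance, and iteration budget are placeholders, not the paper's solver.

```python
import torch

def fixed_point_solve(update_fn, x0, max_iter=50, tol=1e-4):
    """Plain fixed-point iteration until the relative change falls below tol.
    `update_fn` is a stand-in for the joint update over the whole chain."""
    x = x0
    for it in range(max_iter):
        x_new = update_fn(x)
        if torch.norm(x_new - x) / (torch.norm(x) + 1e-8) < tol:
            return x_new, it
        x = x_new
    return x, max_iter

# toy contraction mapping: converges to the solution of x = 0.5 * x + 1 (i.e. x = 2)
x_star, iters = fixed_point_solve(lambda x: 0.5 * x + 1.0, torch.zeros(8, 16))
print(iters, x_star.mean().item())   # ~2.0
```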

Predicting urban tree cover from incomplete point labels and limited background information

  • paper_url: http://arxiv.org/abs/2311.11592
  • repo_url: None
  • paper_authors: Hui Zhang, Ankit Kariryaa, Venkanna Babu Guthula, Christian Igel, Stefan Oehmcke
  • for: 这 paper 是为了提高城市树木的识别和映射,以便更好地了解城市微气候和城市居民的物理和心理健康。
  • methods: 该 paper 使用深度学习方法来映射城市树木在高分辨率飞行图像中,使用有限的数据集和深度学习来实现。
  • results: 该 paper 在 Hamburg, Germany 进行了实验,显示系统可以生成城市树木覆盖率图像,不需要提供树木分割。系统的性能会逐渐下降,如果不使用开源地理数据库。
    Abstract Trees inside cities are important for the urban microclimate, contributing positively to the physical and mental health of the urban dwellers. Despite their importance, often only limited information about city trees is available. Therefore in this paper, we propose a method for mapping urban trees in high-resolution aerial imagery using limited datasets and deep learning. Deep learning has become best-practice for this task, however, existing approaches rely on large and accurately labelled training datasets, which can be difficult and expensive to obtain. However, often noisy and incomplete data may be available that can be combined and utilized to solve more difficult tasks than those datasets were intended for. This paper studies how to combine accurate point labels of urban trees along streets with crowd-sourced annotations from an open geographic database to delineate city trees in remote sensing images, a task which is challenging even for humans. To that end, we perform semantic segmentation of very high resolution aerial imagery using a fully convolutional neural network. The main challenge is that our segmentation maps are sparsely annotated and incomplete. Small areas around the point labels of the street trees coming from official and crowd-sourced data are marked as foreground class. Crowd-sourced annotations of streets, buildings, etc. define the background class. Since the tree data is incomplete, we introduce a masking to avoid class confusion. Our experiments in Hamburg, Germany, showed that the system is able to produce tree cover maps, not limited to trees along streets, without providing tree delineations. We evaluated the method on manually labelled trees and show that performance drastically deteriorates if the open geographic database is not used.
    摘要 urban 内部的树木对城市微气候有积极影响,为城市居民的物理和心理健康做出正面贡献。然而,有限的城市树木信息 frequently 不足,因此在这篇论文中,我们提出了一种使用有限数据集和深度学习方法来映射城市树木在高分辨率飞行图像中的方法。深度学习已成为最佳实践,但现有的方法通常需要大量、准确地标注数据,这可能是 expensive 和困难的。然而,可能存在噪声和不完整的数据,这些数据可以组合并利用来解决更加复杂的任务。本文研究了如何将精确的城市树木点标签与开源地理数据库中的人工标注结合使用,以在遥感图像中划分城市树木,这是人类也难以完成的任务。为此,我们使用了全 convolutional neural network 进行semantic segmentation 的高分辨率飞行图像。主要挑战在于我们的分类图像 sparse 和不完整。小区域附近街道树木的点标签来自官方和开源数据库,被定义为前景类。人工标注的街道、建筑等定义背景类。由于树木数据不完整,我们引入了masking来避免分类混淆。我们在 Hamburg, Germany 进行了实验,发现系统可以生成不限于街道上的树木覆盖率图像,而不需提供树木划分。我们对手动标注的树木进行了评估,并发现如果不使用开源地理数据库,系统性能会下降很快。
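A minimal sketch of how sparse point labels can drive segmentation training while unannotated pixels are masked out via an ignore index, in the spirit of the masking described above. The ignore value, class layout, and the toy targets are assumptions.

```python
import torch
import torch.nn as nn

IGNORE = 255  # pixels with neither a tree point label nor a background annotation are skipped

criterion = nn.CrossEntropyLoss(ignore_index=IGNORE)

logits = torch.randn(2, 2, 64, 64, requires_grad=True)   # (B, classes={background, tree}, H, W)
target = torch.full((2, 64, 64), IGNORE, dtype=torch.long)
target[:, 10:14, 10:14] = 1                               # small regions around tree points -> foreground
target[:, 30:50, 30:50] = 0                               # crowd-sourced streets/buildings -> background

loss = criterion(logits, target)                          # only annotated pixels contribute
loss.backward()
```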

FreeKD: Knowledge Distillation via Semantic Frequency Prompt

  • paper_url: http://arxiv.org/abs/2311.12079
  • repo_url: None
  • paper_authors: Yuan Zhang, Tao Huang, Jiaming Liu, Tao Jiang, Kuan Cheng, Shanghang Zhang
  • for: 该论文主要针对卷积批处理任务( dense prediction tasks)中的知识填充(knowledge distillation)问题,旨在提高学生模型的性能。
  • methods: 该论文提出了一种基于频谱频谱(frequency domain)的知识填充方法,称为“FreeKD”,它通过在教师模型中插入频谱推荐(Frequency Prompt)来吸收频谱上的 semantic context,并通过Pixel-wise频谱面(pixel-wise frequency mask)来定位关键的像素点。
  • results: 该论文的实验结果表明,FreeKD方法可以在 dense prediction tasks 中提高学生模型的性能,并且比传统的空间基于的知识填充方法(spatial-based distillation methods)更加稳定和robust。
    Abstract Knowledge distillation (KD) has been applied to various tasks successfully, and mainstream methods typically boost the student model via spatial imitation losses. However, the consecutive downsamplings induced in the spatial domain of teacher model is a type of corruption, hindering the student from analyzing what specific information needs to be imitated, which results in accuracy degradation. To better understand the underlying pattern of corrupted feature maps, we shift our attention to the frequency domain. During frequency distillation, we encounter a new challenge: the low-frequency bands convey general but minimal context, while the high are more informative but also introduce noise. Not each pixel within the frequency bands contributes equally to the performance. To address the above problem: (1) We propose the Frequency Prompt plugged into the teacher model, absorbing the semantic frequency context during finetuning. (2) During the distillation period, a pixel-wise frequency mask is generated via Frequency Prompt, to localize those pixel of interests (PoIs) in various frequency bands. Additionally, we employ a position-aware relational frequency loss for dense prediction tasks, delivering a high-order spatial enhancement to the student model. We dub our Frequency Knowledge Distillation method as FreeKD, which determines the optimal localization and extent for the frequency distillation. Extensive experiments demonstrate that FreeKD not only outperforms spatial-based distillation methods consistently on dense prediction tasks (e.g., FreeKD brings 3.8 AP gains for RepPoints-R50 on COCO2017 and 4.55 mIoU gains for PSPNet-R18 on Cityscapes), but also conveys more robustness to the student. Notably, we also validate the generalization of our approach on large-scale vision models (e.g., DINO and SAM).
    摘要 知识塑化(KD)已经成功应用于多种任务,主流方法通常通过空间模仿损失提高学生模型。然而,在教师模型中的连续下采样induced的空间频谱损害是一种损害,使学生无法分析需要被模仿的具体信息,从而导致精度下降。为了更好地理解下频谱损害的下游特征,我们将注意力集中在频谱频率上。在频谱塑化过程中,我们遇到了一个新的挑战:低频带 convey通用但是有限的信息,而高频带更加有用但也会引入噪音。不是每个像素在频谱带中都有相同的贡献。为了解决以上问题,我们提出了频谱提醒(Frequency Prompt),在教师模型中捕捉频谱语义上下文的 semantic frequency context durante el finetuning。在塑化期间,我们通过频谱提醒生成了一个像素级别的频谱面积掩码,以确定在不同频谱带中的关键像素(PoIs)。此外,我们采用了一种位置感知的相关频谱损失,为精密预测任务提供高阶空间提高。我们称之为FreeKD,它确定了塑化的优化本地化和范围。广泛的实验表明,FreeKD不仅在精密预测任务上(例如,FreeKD在COCO2017上提高了RepPoints-R50的AP值3.8,在Cityscapes上提高了PSPNet-R18的mIoU值4.55),而且传递了更加Robustness到学生。另外,我们还验证了我们的方法在大规模视觉模型(例如,DINO和SAM)上的普适性。
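A rough sketch of a frequency-space distillation loss: both feature maps are transformed with a 2D FFT and matched only at masked frequency locations. The random mask below stands in for the pixel-wise frequency mask that the Frequency Prompt would generate; shapes and weighting are illustrative assumptions.

```python
import torch

def frequency_distill_loss(student_feat, teacher_feat, mask):
    """Match student and teacher features in the frequency domain, weighted by
    a pixel-wise frequency mask. Features: (B, C, H, W); mask: (B, 1, H, W)."""
    fs = torch.fft.fft2(student_feat, norm="ortho")
    ft = torch.fft.fft2(teacher_feat, norm="ortho")
    return ((fs - ft).abs() * mask).mean()

student = torch.randn(2, 64, 32, 32, requires_grad=True)
teacher = torch.randn(2, 64, 32, 32)
mask = (torch.rand(2, 1, 32, 32) > 0.5).float()    # keep roughly half of the frequency bins
frequency_distill_loss(student, teacher, mask).backward()
```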

AKConv: Convolutional Kernel with Arbitrary Sampled Shapes and Arbitrary Number of Parameters

  • paper_url: http://arxiv.org/abs/2311.11587
  • repo_url: https://github.com/cv-zhangxin/akconv
  • paper_authors: Xin Zhang, Yingze Song, Tingting Song, Degang Yang, Yichen Ye, Jie Zhou, Liming Zhang
  • for: 这 paper 是为了解决标准 convolutional operation 中的两个缺陷而提出的,即 local window 的限制和固定的 convolutional kernel 大小。
  • methods: 这 paper 提出了 Alterable Kernel Convolution (AKConv),一种可变参数和样式的 convolutional operation,通过新的坐标生成算法定义初始位置,并通过偏移来适应目标变化。
  • results: 对于 COCO2017、VOC 7+12 和 VisDrone-DET2021 等 dataset,AKConv 能够提高 объек 检测性能,并且可以作为替换 convolutional operation 来提高网络性能。
    Abstract Neural networks based on convolutional operations have achieved remarkable results in the field of deep learning, but there are two inherent flaws in standard convolutional operations. On the one hand, the convolution operation be confined to a local window and cannot capture information from other locations, and its sampled shapes is fixed. On the other hand, the size of the convolutional kernel is fixed to k $\times$ k, which is a fixed square shape, and the number of parameters tends to grow squarely with size. It is obvious that the shape and size of targets are various in different datasets and at different locations. Convolutional kernels with fixed sample shapes and squares do not adapt well to changing targets. In response to the above questions, the Alterable Kernel Convolution (AKConv) is explored in this work, which gives the convolution kernel an arbitrary number of parameters and arbitrary sampled shapes to provide richer options for the trade-off between network overhead and performance. In AKConv, we define initial positions for convolutional kernels of arbitrary size by means of a new coordinate generation algorithm. To adapt to changes for targets, we introduce offsets to adjust the shape of the samples at each position. Moreover, we explore the effect of the neural network by using the AKConv with the same size and different initial sampled shapes. AKConv completes the process of efficient feature extraction by irregular convolutional operations and brings more exploration options for convolutional sampling shapes. Object detection experiments on representative datasets COCO2017, VOC 7+12 and VisDrone-DET2021 fully demonstrate the advantages of AKConv. AKConv can be used as a plug-and-play convolutional operation to replace convolutional operations to improve network performance. The code for the relevant tasks can be found at https://github.com/CV-ZhangXin/AKConv.
    摘要 神经网络基于卷积操作已经在深度学习中取得了惊人的成果,但标准卷积操作存在两个内在的缺陷。一方面,卷积操作只能在本地窗口中进行,无法捕捉其他位置的信息,而且采样形状是固定的。另一方面,卷积核心的大小是固定的,即k x k,这是一个固定的方正形状,而参数的数量呈平方增长。这是不合理的,因为目标的形状和位置在不同的数据集和位置上是多样的。标准卷积核心的固定采样形状和大小不能适应变化的目标。为了解决这些问题,本文提出了可变卷积(AKConv),它允许卷积核心有可变的参数数量和采样形状,以提供更多的质量和性能之间的质量。在AKConv中,我们使用新的坐标生成算法来定义卷积核心的初始位置。为了适应目标的变化,我们引入偏移量来调整采样形状。此外,我们还研究了使用AKConv的效果,包括使用相同大小的AKConv和不同初始采样形状。AKConv完tes了不规则卷积操作的效果,并提供了更多的卷积采样形状的exploration option。在COCO2017、VOC 7+12和VisDrone-DET2021等代表性数据集上,对象检测实验全面展示了AKConv的优势。AKConv可以作为替换标准卷积操作的卷积操作来提高网络性能。相关任务的代码可以在https://github.com/CV-ZhangXin/AKConv中找到。

SeaDSC: A video-based unsupervised method for dynamic scene change detection in unmanned surface vehicles

  • paper_url: http://arxiv.org/abs/2311.11580
  • repo_url: None
  • paper_authors: Linh Trinh, Ali Anwar, Siegfried Mercelis
  • For: This paper is focused on detecting dynamic scene changes in Unmanned Surface Vehicles (USVs) using video data.* Methods: The proposed method utilizes a modified VQ-VAE-2 generative picture model for feature extraction and a novel similarity scoring technique for comparing consecutive frames.* Results: The authors demonstrate the efficiency of their technique on a nautical video dataset called RoboWhaler, showing the effectiveness of their approach in detecting dynamic scene changes.
    Abstract Recently, there has been an upsurge in the research on maritime vision, where a lot of works are influenced by the application of computer vision for Unmanned Surface Vehicles (USVs). Various sensor modalities such as camera, radar, and lidar have been used to perform tasks such as object detection, segmentation, object tracking, and motion planning. A large subset of this research is focused on the video analysis, since most of the current vessel fleets contain the camera's onboard for various surveillance tasks. Due to the vast abundance of the video data, video scene change detection is an initial and crucial stage for scene understanding of USVs. This paper outlines our approach to detect dynamic scene changes in USVs. To the best of our understanding, this work represents the first investigation of scene change detection in the maritime vision application. Our objective is to identify significant changes in the dynamic scenes of maritime video data, particularly those scenes that exhibit a high degree of resemblance. In our system for dynamic scene change detection, we propose completely unsupervised learning method. In contrast to earlier studies, we utilize a modified cutting-edge generative picture model called VQ-VAE-2 to train on multiple marine datasets, aiming to enhance the feature extraction. Next, we introduce our innovative similarity scoring technique for directly calculating the level of similarity in a sequence of consecutive frames by utilizing grid calculation on retrieved features. The experiments were conducted using a nautical video dataset called RoboWhaler to showcase the efficient performance of our technique.
    摘要 近些年来,marine vision领域内有一场势在浮现的研究活动,其中许多研究受到计算机视觉在无人水面车(USV)上的应用的影响。不同的感知modalities,如摄像头、雷达和激光雷达,都被用于实现对象检测、分割、跟踪和运动规划等任务。大多数当前的船舶舰队都装备了船舶上的摄像头,因此视频分析在这些研究中占据了一个重要的位置。由于视频数据的庞大量,视频场景变化检测是USV视频分析中的初始和关键阶段。本文介绍了我们对USV动态场景变化检测的方法。到目前为止,这是marine vision应用中首次对场景变化检测的研究。我们的目标是在USV动态场景中检测出显著变化,特别是那些场景具有高度的相似性。在我们的系统中,我们提出了一种完全无监督学习方法。与先前的研究不同,我们使用了修改后的VQ-VAE-2模型来在多个海洋数据集上训练,以提高特征提取。然后,我们介绍了我们的创新的相似度评分技术,通过在检索到的特征上进行格子计算来直接计算连续帧之间的相似度水平。实验使用了名为RoboWhaler的海洋视频数据集,以展示我们的技术的高效性。

A 3D Multi-Style Cross-Modality Segmentation Framework for Segmenting Vestibular Schwannoma and Cochlea

  • paper_url: http://arxiv.org/abs/2311.11578
  • repo_url: None
  • paper_authors: Yuzhou Zhuang
  • for: 本研究旨在用Multi-style Cross-modality Segmentation方法精确地分类 vestibular schwannoma和cochlea区域在无标注hrT2扫描中,以便提高肿瘤诊断和治疗的精度。
  • methods: 本研究使用了3D多式 Cross-modality Segmentation框架,包括多式转换和自学习分类阶段。首先,使用min-max normalization、voxel size resampling和center cropping来调整ceT1和hrT2扫描的像素大小和中心位置,以获得固定大小的子体积 для训练。接着,使用多种转换网络来超越intensity distribution差异 between多modal扫描。最后,使用nnU-Net框架和iterative自学习方法使用pseudo-labels来在目标领域进行自学习分类。
  • results: 在crossMoDA2023验证集上,本研究的方法获得了Promising results,mean DSC值为72.78%和80.64%,ASSD值为5.85 mm和0.25 mm дляVS肿瘤和cochlea区域,分别。此外,for intra-和extra-meatal区域,本研究的方法获得了DSC值为59.77%和77.14%。
    Abstract The crossMoDA2023 challenge aims to segment the vestibular schwannoma (sub-divided into intra- and extra-meatal components) and cochlea regions of unlabeled hrT2 scans by leveraging labeled ceT1 scans. In this work, we proposed a 3D multi-style cross-modality segmentation framework for the crossMoDA2023 challenge, including the multi-style translation and self-training segmentation phases. Considering heterogeneous distributions and various image sizes in multi-institutional scans, we first utilize the min-max normalization, voxel size resampling, and center cropping to obtain fixed-size sub-volumes from ceT1 and hrT2 scans for training. Then, we perform the multi-style image translation phase to overcome the intensity distribution discrepancy between unpaired multi-modal scans. Specifically, we design three different translation networks with 2D or 2.5D inputs to generate multi-style and realistic target-like volumes from labeled ceT1 volumes. Finally, we perform the self-training volumetric segmentation phase in the target domain, which employs the nnU-Net framework and iterative self-training method using pseudo-labels for training accurate segmentation models in the unlabeled target domain. On the crossMoDA2023 validation dataset, our method produces promising results and achieves the mean DSC values of 72.78% and 80.64% and ASSD values of 5.85 mm and 0.25 mm for VS tumor and cochlea regions, respectively. Moreover, for intra- and extra-meatal regions, our method achieves the DSC values of 59.77% and 77.14%, respectively.
    摘要 crossMoDA2023 挑战的目标是利用带标注的 ceT1 扫描,对无标注的 hrT2 扫描中的前庭神经鞘瘤(细分为内听道内与内听道外两部分)和耳蜗区域进行分割。在本工作中,我们为该挑战提出了一个 3D 多风格跨模态分割框架,包含多风格转换与自训练分割两个阶段。考虑到多中心扫描数据分布不均且图像尺寸各异,我们首先通过最小-最大归一化、体素尺寸重采样和中心裁剪,从 ceT1 和 hrT2 扫描中获取固定尺寸的子体积用于训练。随后,我们执行多风格图像转换阶段,以克服非配对多模态扫描之间的强度分布差异;具体地,我们设计了三个以 2D 或 2.5D 为输入的转换网络,从带标注的 ceT1 体数据生成多风格且逼真的目标域体数据。最后,我们在目标域中执行自训练的体分割阶段,采用 nnU-Net 框架并利用伪标签进行迭代自训练,从而在无标注的目标域中训练出精确的分割模型。在 crossMoDA2023 验证集上,我们的方法取得了可观的结果:前庭神经鞘瘤和耳蜗区域的平均 DSC 分别为 72.78% 和 80.64%,ASSD 分别为 5.85 mm 和 0.25 mm;此外,内听道内与内听道外区域的 DSC 分别为 59.77% 和 77.14%。
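The preprocessing steps named in the abstract (min-max normalization, voxel-size resampling, center cropping) on a toy volume; the target spacing, crop size, and interpolation order are assumptions, not the challenge settings.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_volume(vol, spacing, target_spacing=(1.0, 1.0, 1.0), crop=(128, 128, 128)):
    """Min-max normalize, resample to a common voxel size, and center-crop."""
    vol = (vol - vol.min()) / (vol.max() - vol.min() + 1e-8)          # min-max normalization
    factors = [s / t for s, t in zip(spacing, target_spacing)]
    vol = zoom(vol, factors, order=1)                                  # voxel size resampling
    pads = [max(c - d, 0) for d, c in zip(vol.shape, crop)]
    vol = np.pad(vol, [(p // 2, p - p // 2) for p in pads])            # pad if smaller than the crop
    starts = [(d - c) // 2 for d, c in zip(vol.shape, crop)]
    return vol[tuple(slice(s, s + c) for s, c in zip(starts, crop))]   # center cropping

sub = preprocess_volume(np.random.rand(96, 200, 200), spacing=(1.5, 0.8, 0.8))
print(sub.shape)   # (128, 128, 128)
```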

CORE-MM: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models

  • paper_url: http://arxiv.org/abs/2311.11567
  • repo_url: None
  • paper_authors: Xiaotian Han, Quanzeng You, Yongfei Liu, Wentao Chen, Huangjie Zheng, Khalil Mrini, Xudong Lin, Yiqi Wang, Bohan Zhai, Jianbo Yuan, Heng Wang, Hongxia Yang
  • For: The paper aims to evaluate the reasoning capabilities of multi-modal large language models (MLLMs) by creating a new benchmark dataset that focuses on complex reasoning tasks, such as deductive, abductive, and analogical reasoning.
  • Methods: The authors manually curate a dataset of queries that engage the reasoning capabilities of MLLMs, and incorporate intermediate reasoning steps into their evaluation criteria to assess the models' ability to generate answers.
  • Results: The authors evaluate a selection of representative MLLMs using this new benchmark dataset and find that their reasoning capabilities are more accurately measured using this open-ended multi-step elaborate reasoning benchmark, which challenges the models to demonstrate their ability to perform complex reasoning tasks.
    Abstract Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. These models not only excel in traditional vision-language tasks but also demonstrate impressive performance in contemporary multi-modal benchmarks. Although many of these benchmarks attempt to holistically evaluate MLLMs, they typically concentrate on basic reasoning tasks, often yielding only simple yes/no or multi-choice responses. These methods naturally lead to confusion and difficulties in conclusively determining the reasoning capabilities of MLLMs. To mitigate this issue, we manually curate a benchmark dataset specifically designed for MLLMs, with a focus on complex reasoning tasks. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. The queries in our dataset are intentionally constructed to engage the reasoning capabilities of MLLMs in the process of generating answers. For a fair comparison across various MLLMs, we incorporate intermediate reasoning steps into our evaluation criteria. In instances where an MLLM is unable to produce a definitive answer, its reasoning ability is evaluated by requesting intermediate reasoning steps. If these steps align with our manual annotations, appropriate scores are assigned. This evaluation scheme resembles methods commonly used in human assessments, such as exams or assignments, and represents what we consider a more effective assessment technique compared with existing benchmarks. We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark, designed to challenge and accurately measure their reasoning capabilities. The code and data will be released at https://core-mm.github.io/
    摘要 多模态大型语言模型(MLLM)在人工智能领域日益突出。这些模型不仅在传统的视觉-语言任务中表现出色,在当今的多模态基准测试中也展现了令人印象深刻的性能。尽管许多基准试图对 MLLM 进行整体评估,它们通常只关注基础推理任务,往往只产生简单的是/否或多项选择答案,这自然导致混乱,难以确切判定 MLLM 的推理能力。为缓解这一问题,我们手工构建了一个专为 MLLM 设计、聚焦复杂推理任务的基准数据集。该基准包含三类关键推理:演绎推理、溯因推理和类比推理。数据集中的问题经过有意构造,使 MLLM 在生成答案的过程中必须调动其推理能力。为了保证不同 MLLM 之间的公平比较,我们将中间推理步骤纳入评估标准:当 MLLM 无法给出确定答案时,通过要求其给出中间推理步骤来评估其推理能力,若这些步骤与我们的人工标注一致,则给予相应分数。这种评估方式与考试、作业等人类评估方法类似,我们认为它比现有基准更为有效。我们使用这一精心构建的开放式多步推理基准,评估了一组有代表性的 MLLM,以挑战并准确衡量它们的推理能力。代码和数据将在 https://core-mm.github.io/ 发布。

Does complimentary information from multispectral imaging improve face presentation attack detection?

  • paper_url: http://arxiv.org/abs/2311.11566
  • repo_url: None
  • paper_authors: Narayan Vetrekar, Raghavendra Ramachandra, Sushma Venkatesh, Jyoti D. Pawar, R. S. Gad
  • For: The paper is written to study the use of multispectral imaging for detecting presentation attacks in face recognition systems.
  • Methods: The paper uses a dataset called Face Presentation Attack Multispectral (FPAMS) to evaluate the performance of two fusion methods (image fusion and score fusion) in detecting presentation artifacts.
  • Results: The paper presents superior performance of the PAD based on the score fusion and image fusion methods, demonstrating the significance of employing multispectral imaging to detect presentation artifacts.
    Abstract Presentation Attack Detection (PAD) has been extensively studied, particularly in the visible spectrum. With the advancement of sensing technology beyond the visible range, multispectral imaging has gained significant attention in this direction. We present PAD based on multispectral images constructed for eight different presentation artifacts resulting from three different artifact species. In this work, we introduce the Face Presentation Attack Multispectral (FPAMS) database to demonstrate the significance of employing multispectral imaging. The goal of this work is to study how complementary information from multispectral imaging can be combined in two different ways (image fusion and score fusion) to improve face PAD. The experimental evaluation presents an extensive qualitative analysis of 61,650 multispectral sample images collected for bona fide presentations and artifacts. PAD based on the score fusion and image fusion methods presents superior performance, demonstrating the significance of employing multispectral imaging to detect presentation artifacts.
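To make the two fusion strategies concrete, here is a minimal, hypothetical sketch contrasting score-level and image-level fusion over per-band captures; the band count, the averaging rules, and the placeholder classifier are assumptions, not details from the FPAMS paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BANDS = 7            # hypothetical number of spectral bands per capture
H, W = 64, 64

def pad_score(image: np.ndarray) -> float:
    """Placeholder single-band PAD classifier returning a bona fide probability.
    In practice this would be a trained CNN or texture-feature classifier."""
    return float(1.0 / (1.0 + np.exp(-image.mean())))

# One multispectral capture: a stack of per-band grayscale images.
capture = rng.normal(size=(N_BANDS, H, W))

# Score-level fusion: classify each band independently, then combine the scores.
score_fusion = float(np.mean([pad_score(band) for band in capture]))

# Image-level fusion: combine the bands into one image first, then classify once.
fused_image = capture.mean(axis=0)           # simple average fusion as an example
image_fusion = pad_score(fused_image)

print(f"score-level fusion: {score_fusion:.3f}, image-level fusion: {image_fusion:.3f}")
```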

NePF: Neural Photon Field for Single-Stage Inverse Rendering

  • paper_url: http://arxiv.org/abs/2311.11555
  • repo_url: None
  • paper_authors: Tuen-Yue Tsui, Qin Zou
  • for: addressed the ill-posed inverse rendering problem from multi-view images
  • methods: introduced a novel single-stage framework called Neural Photon Field (NePF), which fully utilizes the physical implication behind the weight function of neural implicit surfaces and the view-dependent radiance
  • results: demonstrated superiority in recovering high-fidelity geometry and visually plausible material attributes, evaluated on both real and synthetic datasets.
    Abstract We present a novel single-stage framework, Neural Photon Field (NePF), to address the ill-posed inverse rendering from multi-view images. Contrary to previous methods that recover the geometry, material, and illumination in multiple stages and extract the properties from various multi-layer perceptrons across different neural fields, we question such complexities and introduce our method - a single-stage framework that uniformly recovers all properties. NePF achieves this unification by fully utilizing the physical implication behind the weight function of neural implicit surfaces and the view-dependent radiance. Moreover, we introduce an innovative coordinate-based illumination model for rapid volume physically-based rendering. To regularize this illumination, we implement the subsurface scattering model for diffuse estimation. We evaluate our method on both real and synthetic datasets. The results demonstrate the superiority of our approach in recovering high-fidelity geometry and visual-plausible material attributes.
    摘要 我们提出了一种新的单阶段框架——神经光子场(NePF),用于解决多视图图像的不适定逆渲染问题。与以往在多个阶段分别恢复几何、材质和光照、并从不同神经场的多层感知机中提取属性的方法不同,我们质疑这种复杂性,提出在单一阶段统一恢复所有属性的方法。NePF 通过充分利用神经隐式表面权重函数背后的物理含义以及视角相关的辐射来实现这一统一。此外,我们提出了一种基于坐标的光照模型,用于快速的基于物理的体渲染;为约束该光照,我们引入次表面散射模型进行漫反射估计。我们在真实和合成数据集上评估了该方法,结果表明其在恢复高保真几何和视觉上合理的材质属性方面具有优势。
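The "weight function" the abstract alludes to is, in generic neural implicit rendering, the volume-rendering weight that concentrates around the surface. The sketch below shows only that standard formulation on a toy ray; it is not NePF's actual photon-field formulation, and the density profile and colors are made up for illustration.

```python
import numpy as np

def render_weights(sigma: np.ndarray, deltas: np.ndarray) -> np.ndarray:
    """Standard volume-rendering weights w_i = T_i * (1 - exp(-sigma_i * delta_i)),
    where T_i is the accumulated transmittance along the ray."""
    alpha = 1.0 - np.exp(-sigma * deltas)                      # per-sample opacity
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + 1e-10]))
    return transmittance * alpha

# Toy ray: density peaks near the surface, so the weights concentrate there.
t = np.linspace(0.0, 1.0, 64)
sigma = np.exp(-((t - 0.6) ** 2) / 0.002) * 50.0               # sharp density bump at t = 0.6
deltas = np.full_like(t, t[1] - t[0])
w = render_weights(sigma, deltas)

radiance = np.stack([t, 1 - t, np.ones_like(t)], axis=-1)      # dummy per-sample colors
pixel_color = (w[:, None] * radiance).sum(axis=0)              # weighted compositing
print("weight mass:", w.sum().round(3), "pixel:", pixel_color.round(3))
```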

Unearthing Common Inconsistency for Generalisable Deepfake Detection

  • paper_url: http://arxiv.org/abs/2311.11549
  • repo_url: None
  • paper_authors: Beilin Chu, Xuan Xu, Weike You, Linna Zhou
  • for: 提出一种能够泛化到不同伪造方法的深度伪造检测方法,解决现有图像级检测方法因域偏移而难以泛化到未见伪造域的问题。
  • methods: 提出"挖掘共同不一致性"(UCI)方法:基于自监督对比学习捕捉各种伪造技术普遍破坏的帧间时序一致性;通过时序保持模块引入空间噪声扰动以引导模型关注时序信息,并利用多视角互相关学习模块学习真假样本时序表示的差异。
  • results: 大量实验表明该方法在未见的深度伪造域上具有良好的泛化能力。
    Abstract Deepfakes have been around for several years, yet efficient detection techniques that generalize across different manipulation methods still require further research. While current image-level detection methods fail to generalize to unseen domains, owing to the domain-shift phenomenon brought by CNN's strong inductive bias towards Deepfake texture, video-level detection shows its potential to achieve both generalization across multiple domains and robustness to compression. We argue that although distinct face manipulation tools have different inherent biases, they all disrupt the consistency between frames, which is a natural characteristic shared by authentic videos. Inspired by this, we propose a detection approach that captures the frame inconsistency broadly present in different forgery techniques, termed unearthing-common-inconsistency (UCI). Concretely, the UCI network, based on self-supervised contrastive learning, can better distinguish temporal consistency between real and fake videos from multiple domains. We introduce a temporally-preserved module to inject spatial noise perturbations, directing the model's attention towards temporal information. Subsequently, leveraging a multi-view cross-correlation learning module, we extensively learn the disparities in temporal representations between genuine and fake samples. Extensive experiments demonstrate the generalization ability of our method on unseen Deepfake domains.
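As a rough illustration of the frame-inconsistency cue the abstract builds on, the toy sketch below scores a clip by the average dissimilarity between consecutive frame features; the flatten-and-normalise "feature extractor" and the synthetic clips are stand-ins for the UCI network, not its actual design.

```python
import numpy as np

def frame_embeddings(video: np.ndarray) -> np.ndarray:
    """Placeholder per-frame feature extractor (a CNN/ViT backbone in practice):
    here we just flatten and L2-normalise each frame."""
    feats = video.reshape(video.shape[0], -1).astype(np.float64)
    return feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)

def temporal_inconsistency(video: np.ndarray) -> float:
    """Average (1 - cosine similarity) between consecutive frame features;
    higher values indicate the frame-to-frame inconsistency that forged
    videos tend to exhibit."""
    f = frame_embeddings(video)
    cos = np.sum(f[:-1] * f[1:], axis=1)
    return float(np.mean(1.0 - cos))

rng = np.random.default_rng(0)
real = np.cumsum(rng.normal(scale=0.01, size=(16, 32, 32)), axis=0)  # smoothly varying frames
fake = real + rng.normal(scale=0.5, size=real.shape)                 # independent per-frame perturbations
print("real:", temporal_inconsistency(real), "fake:", temporal_inconsistency(fake))
```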

Efficient Model Agnostic Approach for Implicit Neural Representation Based Arbitrary-Scale Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2311.12077
  • repo_url: None
  • paper_authors: Young Jae Oh, Jihun Kim, Tae Hyun Kim
  • for: 提高单图超解像的计算效率,无需牺牲重建质量。
  • methods: 使用混合专家模型,动态分配专家对每个像素进行重建。
  • results: 与传统方法相比最多减少 73% 的浮点运算量(FLOPs),同时提供相当或更高的峰值信噪比(PSNR)。
    Abstract Single image super-resolution (SISR) has experienced significant advancements, primarily driven by deep convolutional networks. Traditional networks, however, are limited to upscaling images to a fixed scale, leading to the utilization of implicit neural functions for generating arbitrarily scaled images. Nevertheless, these methodologies have imposed substantial computational demands as they involve querying every target pixel to a single resource-intensive decoder. In this paper, we introduce a novel and efficient framework, the Mixture of Experts Implicit Super-Resolution (MoEISR), which enables super-resolution at arbitrary scales with significantly increased computational efficiency without sacrificing reconstruction quality. MoEISR dynamically allocates the most suitable decoding expert to each pixel using a lightweight mapper module, allowing experts with varying capacities to reconstruct pixels across regions with diverse complexities. Our experiments demonstrate that MoEISR successfully reduces up to 73% in floating point operations (FLOPs) while delivering comparable or superior peak signal-to-noise ratio (PSNR).
    摘要 单图像超分辨率(SISR)近年取得了显著进展,主要得益于深度卷积网络。然而,传统网络只能将图像放大到固定倍数,因此需要借助隐式神经函数来生成任意倍数的图像。不过,这类方法计算开销很大,因为每个目标像素都要查询同一个资源密集的解码器。本文提出一种新颖且高效的框架——混合专家隐式超分辨率(MoEISR),在不牺牲重建质量的前提下,以显著更高的计算效率实现任意倍数的超分辨率。MoEISR 通过轻量级映射模块为每个像素动态分配最合适的解码专家,使不同容量的专家分别重建复杂度各异区域中的像素。实验表明,MoEISR 最多可减少 73% 的浮点运算量(FLOPs),同时保持相当或更优的峰值信噪比(PSNR)。
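The per-pixel expert routing can be sketched as follows; the mapper, the expert widths, and the hard argmax routing are illustrative assumptions rather than MoEISR's actual modules.

```python
import torch
import torch.nn as nn

class MoEImplicitSR(nn.Module):
    """Minimal mixture-of-experts implicit decoder: a light mapper picks one
    expert MLP per query coordinate; experts differ in width (capacity)."""
    def __init__(self, feat_dim=64, expert_widths=(32, 64, 128)):
        super().__init__()
        self.mapper = nn.Linear(feat_dim + 2, len(expert_widths))  # logits over experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim + 2, w), nn.ReLU(), nn.Linear(w, 3))
            for w in expert_widths
        )

    def forward(self, feats, coords):
        x = torch.cat([feats, coords], dim=-1)          # (N, feat_dim + 2)
        expert_idx = self.mapper(x).argmax(dim=-1)      # hard routing, one expert per pixel
        out = torch.zeros(x.shape[0], 3)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():                               # only run the expert on its pixels
                out[mask] = expert(x[mask])
        return out

# 256 query pixels with hypothetical encoder features and normalised (x, y) coordinates.
feats, coords = torch.randn(256, 64), torch.rand(256, 2) * 2 - 1
rgb = MoEImplicitSR()(feats, coords)
print(rgb.shape)  # torch.Size([256, 3])
```

Because each pixel visits only one expert, cheap experts can handle flat regions while wider experts handle textured ones, which is where the FLOP savings come from.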

Event Camera Data Dense Pre-training

  • paper_url: http://arxiv.org/abs/2311.11533
  • repo_url: None
  • paper_authors: Yan Yang, Liyuan Pan, Liu Liu
  • for: 本研究旨在为 dense prediction 任务预训练神经网络,使用事件摄像头数据。
  • methods: 方法仅使用事件数据进行训练:将事件图像编码为事件图块特征,自动挖掘图块之间的上下文相似关系,将图块特征分组为不同的上下文,并通过上下文间相似性约束学习判别性的事件特征。
  • results: 在下游 dense prediction 任务上的迁移学习性能优于现有方法;特别地,单个模型在具有挑战性的 DSEC-Flow 基准上取得第一名。
    Abstract This paper introduces a self-supervised learning framework designed for pre-training neural networks tailored to dense prediction tasks using event camera data. Our approach utilizes solely event data for training. Transferring achievements from dense RGB pre-training directly to event camera data yields subpar performance. This is attributed to the spatial sparsity inherent in an event image (converted from event data), where many pixels do not contain information. To mitigate this sparsity issue, we encode an event image into event patch features, automatically mine contextual similarity relationships among patches, group the patch features into distinctive contexts, and enforce context-to-context similarities to learn discriminative event features. For training our framework, we curate a synthetic event camera dataset featuring diverse scene and motion patterns. Transfer learning performance on downstream dense prediction tasks illustrates the superiority of our method over state-of-the-art approaches. Notably, our single model secured the top position in the challenging DSEC-Flow benchmark.
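A toy rendition of the patch-grouping idea: patches of a sparse event image are clustered into "contexts" and a context-to-context similarity matrix is computed, which a contrastive objective could then act on. The k-means grouping, the synthetic event frame, and the two augmented views are stand-ins for the paper's learned context mining, not its actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_patches(event_image: np.ndarray, patch: int = 8) -> np.ndarray:
    """Split an event image (H, W) into flattened non-overlapping patches."""
    H, W = event_image.shape
    p = event_image.reshape(H // patch, patch, W // patch, patch)
    return p.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def group_into_contexts(patch_feats: np.ndarray, k: int = 4, iters: int = 10):
    """Toy k-means grouping of patch features into k contexts, returning the
    mean feature of each context (a stand-in for the paper's context mining)."""
    centers = patch_feats[rng.choice(len(patch_feats), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((patch_feats[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.stack([patch_feats[assign == c].mean(0) if (assign == c).any()
                            else centers[c] for c in range(k)])
    return centers

# Two augmented views of the same (synthetic, mostly-empty) event frame.
events = (rng.random((64, 64)) > 0.95).astype(np.float32)
view_a = events + rng.normal(0, 0.01, events.shape)
view_b = np.flip(events, axis=1).copy()

ctx_a = group_into_contexts(to_patches(view_a))
ctx_b = group_into_contexts(to_patches(view_b))
# Context-to-context cosine similarity matrix that a contrastive loss could act on.
norm = lambda m: m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-8)
print(norm(ctx_a) @ norm(ctx_b).T)
```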

Generalized Category Discovery in Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2311.11525
  • repo_url: https://github.com/jethropeng/gcdss
  • paper_authors: Zhengyuan Peng, Qijian Tian, Jianqing Xu, Yizhang Jin, Xuequan Lu, Xin Tan, Yuan Xie, Lizhuang Ma
  • for: 这篇论文探索了一个新的设定——语义分割中的广义类别发现(GCDSS),旨在借助已标注基础类别的先验知识,对未标注图像进行分割;未标注图像中既可能包含基础类别也可能包含新类别的像素。不同于先前的语义分割新类别发现(NCDSS),该设定不要求每张未标注图像中至少存在一个新类别;此外,分割范围也从前景物体扩展到整幅图像。
  • methods: 提出一个简单而有效的框架,将 GCDSS 问题重新表述为掩码分类任务;并构建基线方法——邻域关系引导的掩码聚类算法(NeRG-MaskCA),用于掩码类别划分,以缓解语义表示的碎片化。
  • results: 在基于 Cityscapes 构建的 Cityscapes-GCD 基准上,该方法验证了 GCDSS 问题的可行性,能够在未标注图像中发现并分割新的物体类别;将方法生成的伪标签作为监督信号训练其他模型,可使其具备分割新类别的能力,为广义类别发现的进一步研究和语义分割的应用拓展奠定基础。
    Abstract This paper explores a novel setting called Generalized Category Discovery in Semantic Segmentation (GCDSS), aiming to segment unlabeled images given prior knowledge from a labeled set of base classes. The unlabeled images contain pixels of the base class or novel class. In contrast to Novel Category Discovery in Semantic Segmentation (NCDSS), there is no prerequisite for prior knowledge mandating the existence of at least one novel class in each unlabeled image. Besides, we broaden the segmentation scope beyond foreground objects to include the entire image. Existing NCDSS methods rely on the aforementioned priors, making them challenging to truly apply in real-world situations. We propose a straightforward yet effective framework that reinterprets the GCDSS challenge as a task of mask classification. Additionally, we construct a baseline method and introduce the Neighborhood Relations-Guided Mask Clustering Algorithm (NeRG-MaskCA) for mask categorization to address the fragmentation in semantic representation. A benchmark dataset, Cityscapes-GCD, derived from the Cityscapes dataset, is established to evaluate the GCDSS framework. Our method demonstrates the feasibility of the GCDSS problem and the potential for discovering and segmenting novel object classes in unlabeled images. We employ the generated pseudo-labels from our approach as ground truth to supervise the training of other models, thereby enabling them with the ability to segment novel classes. It paves the way for further research in generalized category discovery, broadening the horizons of semantic segmentation and its applications. For details, please visit https://github.com/JethroPeng/GCDSS
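One way to picture mask categorization guided by neighborhood relations is to cluster mask embeddings over a mutual k-nearest-neighbour graph. The sketch below does exactly that on synthetic embeddings; it is only loosely inspired by NeRG-MaskCA, whose actual algorithm is not specified in the abstract, and every name and number here is an assumption.

```python
import numpy as np
from collections import deque

def knn_graph(emb: np.ndarray, k: int = 3) -> list:
    """Mutual k-nearest-neighbour adjacency over mask embeddings (cosine similarity)."""
    norms = np.linalg.norm(emb, axis=1)
    sim = emb @ emb.T / (norms[:, None] * norms[None] + 1e-8)
    np.fill_diagonal(sim, -np.inf)
    nbrs = [set(np.argsort(-sim[i])[:k]) for i in range(len(emb))]
    return [{j for j in nbrs[i] if i in nbrs[j]} for i in range(len(emb))]  # mutual edges only

def cluster_masks(emb: np.ndarray, k: int = 3) -> np.ndarray:
    """Connected components of the mutual-kNN graph act as pseudo-categories."""
    adj, labels, current = knn_graph(emb, k), -np.ones(len(emb), dtype=int), 0
    for start in range(len(emb)):
        if labels[start] >= 0:
            continue
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if labels[node] >= 0:
                continue
            labels[node] = current
            queue.extend(adj[node])
        current += 1
    return labels

rng = np.random.default_rng(0)
# Hypothetical mask embeddings: two tight groups standing in for two unseen categories.
emb = np.concatenate([rng.normal(0, 0.1, (5, 16)) + 1, rng.normal(0, 0.1, (5, 16)) - 1])
print(cluster_masks(emb))   # e.g. [0 0 0 0 0 1 1 1 1 1]
```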

Towards Few-shot Out-of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2311.12076
  • repo_url: None
  • paper_authors: Jiuqing Dong, Yongbin Gao, Heng Zhou, Jun Cen, Yifan Yao, Sook Yoon, Park Dong Sun
  • for: 提高 open-world 智能系统 的可靠性,即 out-of-distribution (OOD) 检测。
  • methods: 构建了一个新的 few-shot OOD 检测 benchmark;实验分析表明,参数高效微调(PEFT)策略(如视觉提示微调和视觉适配器微调)在 few-shot OOD 检测任务上优于完全微调和线性探测等传统方法。
  • results: 考虑到预训练模型中对 OOD 检测至关重要的信息可能在微调过程中丢失,提出领域特定与通用知识融合方法(Domain-Specific and General Knowledge Fusion,DSGF);实验表明,DSGF 能在多种微调方式下显著提升 few-shot OOD 检测能力。
    Abstract Out-of-distribution (OOD) detection is critical for ensuring the reliability of open-world intelligent systems. Despite the notable advancements in existing OOD detection methodologies, our study identifies a significant performance drop under the scarcity of training samples. In this context, we introduce a novel few-shot OOD detection benchmark, carefully constructed to address this gap. Our empirical analysis reveals the superiority of Parameter-Efficient Fine-Tuning (PEFT) strategies, such as visual prompt tuning and visual adapter tuning, over conventional techniques, including fully fine-tuning and linear probing tuning, in the few-shot OOD detection task. Recognizing that some crucial information from the pre-trained model, which is pivotal for OOD detection, may be lost during the fine-tuning process, we propose a method termed Domain-Specific and General Knowledge Fusion (DSGF). This approach is designed to be compatible with diverse fine-tuning frameworks. Our experiments show that the integration of DSGF significantly enhances the few-shot OOD detection capabilities across various methods and fine-tuning methodologies, including fully fine-tuning, visual adapter tuning, and visual prompt tuning. The code will be released.
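The general idea of fusing pre-trained ("general") and fine-tuned ("domain-specific") features before OOD scoring can be sketched as below; the blending rule, the prototype-based score, and all dimensions are assumptions, not DSGF's actual design.

```python
import numpy as np

def fuse(general_feat: np.ndarray, specific_feat: np.ndarray, alpha: float = 0.5):
    """Blend the frozen pre-trained ('general') feature with the fine-tuned
    ('domain-specific') feature, so knowledge lost during fine-tuning is retained."""
    return alpha * general_feat + (1 - alpha) * specific_feat

def ood_score(fused_feat: np.ndarray, class_prototypes: np.ndarray) -> float:
    """Negative max cosine similarity to any class prototype: larger means more OOD."""
    sims = class_prototypes @ fused_feat / (
        np.linalg.norm(class_prototypes, axis=1) * np.linalg.norm(fused_feat) + 1e-8)
    return float(-sims.max())

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(10, 128))               # few-shot class prototypes (hypothetical)
general = rng.normal(size=128)
specific = prototypes[3] + rng.normal(scale=0.1, size=128)

print("ID-like sample :", ood_score(fuse(prototypes[3], specific), prototypes))
print("OOD-like sample:", ood_score(fuse(general, rng.normal(size=128)), prototypes))
```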

Liver Tumor Prediction with Advanced Attention Mechanisms Integrated into a Depth-Based Variant Search Algorithm

  • paper_url: http://arxiv.org/abs/2311.11520
  • repo_url: None
  • paper_authors: P. Kalaiselvi, S. Anusuya
  • for: 预测肝肿瘤,辅助肝病的诊断与治疗。
  • methods: 将先进注意力机制与基于深度的变体搜索算法集成到卷积神经网络中(CNN-DS-AM)。
  • results: 在 CT 数据集上取得 95.5% 的预测准确率与更好的鲁棒性,优于其他最先进方法。
    Abstract In recent years, Deep Learning (DL) techniques have become an emerging transformation in the fields of machine learning, artificial intelligence, and computer vision. Subsequently, researchers and industries have widely adopted them in the medical field, predicting and controlling diverse diseases at specific intervals. Liver tumor prediction is a vital task in analyzing and treating liver diseases. This paper proposes a novel approach for predicting liver tumors using Convolutional Neural Networks (CNN) and a depth-based variant search algorithm with advanced attention mechanisms (CNN-DS-AM). The proposed work aims to improve accuracy and robustness in diagnosing and treating liver diseases. The proposed model is assessed on a Computed Tomography (CT) scan dataset containing both benign and malignant liver tumors. The proposed approach achieved high accuracy in predicting liver tumors, outperforming other state-of-the-art methods. Additionally, advanced attention mechanisms were incorporated into the CNN model to enable the identification and highlighting of the regions of the CT scans most relevant to predicting liver tumors. The results suggest that incorporating attention mechanisms and a depth-based variant search algorithm into the CNN model is a promising approach for improving the accuracy and robustness of liver tumor prediction, and it can assist radiologists in their diagnosis and treatment planning. The proposed system achieved a high accuracy of 95.5% in predicting liver tumors.
    摘要 近年来,深度学习(DL)技术已成为机器学习、人工智能和计算机视觉等领域的一股变革力量,并被广泛应用于医疗领域的疾病预测与控制。肝肿瘤预测是肝病分析与治疗中的一项重要任务。本文提出一种结合卷积神经网络(CNN)、基于深度的变体搜索算法与先进注意力机制的新方法(CNN-DS-AM),以提高肝病诊疗中预测的准确性与鲁棒性。该方法在包含良性与恶性肝肿瘤的 CT 扫描数据集上进行评估,取得了 95.5% 的高预测准确率,优于其他先进方法;引入的注意力机制还能识别并突出 CT 图像中与肝肿瘤预测最相关的区域,有助于放射科医生的诊断与治疗规划。
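The abstract does not specify the attention design, so the sketch below uses a standard CBAM-style spatial-attention block purely to illustrate how attention can highlight CT regions relevant to the prediction; the backbone, head, and input sizes are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: highlights image regions the classifier
    should focus on, e.g. candidate tumour areas in a CT slice."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg_map = x.mean(dim=1, keepdim=True)            # (B, 1, H, W)
        max_map, _ = x.max(dim=1, keepdim=True)          # (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn, attn                            # re-weighted features + attention map

backbone = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
attn_block = SpatialAttention()
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2))

ct_slice = torch.randn(1, 1, 128, 128)                   # dummy single-channel CT slice
features, attention_map = attn_block(backbone(ct_slice))
logits = head(features)                                  # benign vs malignant logits
print(logits.shape, attention_map.shape)
```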

Seeing through the Mask: Multi-task Generative Mask Decoupling Face Recognition

  • paper_url: http://arxiv.org/abs/2311.11512
  • repo_url: None
  • paper_authors: Zhaohui Wang, Sufang Zhang, Jianteng Peng, Xinyi Wang, Yandong Guo
  • for: 本研究旨在提升人脸识别系统在遮挡场景下的性能,解决现有系统在遮挡条件下表现不佳的问题。
  • methods: 提出多任务生成式掩码解耦人脸识别网络(MEER),通过掩码解耦模块分离遮挡信息与身份信息,从可见面部区域提取更纯净的身份特征,并借助联合训练策略合成无遮挡人脸,再以身份保持损失进一步优化识别网络。
  • results: 在真实遮挡与合成遮挡的人脸识别基准上的实验表明,MEER 的性能超过现有最先进方法。
    Abstract The outbreak of the COVID-19 pandemic has made people wear masks more frequently than ever. Current general face recognition systems suffer from serious performance degradation when encountering occluded scenes. The potential reason is that face features are corrupted by occlusions on key facial regions. To tackle this problem, previous works either extract identity-related embeddings at the feature level through additional mask prediction, or restore the occluded facial part with generative models. However, the former lacks visual results for model interpretation, while the latter suffers from artifacts which may affect downstream recognition. Therefore, this paper proposes a Multi-task gEnerative mask dEcoupling face Recognition (MEER) network to jointly handle these two tasks, which can learn occlusion-irrelevant and identity-related representations while achieving unmasked face synthesis. We first present a novel mask decoupling module to disentangle mask and identity information, which makes the network obtain purer identity features from visible facial components. Then, an unmasked face is restored by a joint-training strategy, which will be further used to refine the recognition network with an id-preserving loss. Experiments on masked face recognition under realistic and synthetic occlusion benchmarks demonstrate that MEER can outperform the state-of-the-art methods.
    摘要 COVID-19 疫情的爆发使人们比以往更频繁地佩戴口罩,而现有的通用人脸识别系统在遇到遮挡场景时性能严重下降,其原因可能在于关键面部区域的特征被遮挡破坏。为解决这一问题,已有工作或在特征层面借助额外的掩码预测提取与身份相关的嵌入,或利用生成模型恢复被遮挡的面部;然而前者缺乏可用于模型解释的可视化结果,后者生成的伪影可能影响下游识别。为此,本文提出多任务生成式掩码解耦人脸识别网络(MEER),联合处理这两项任务,在实现无遮挡人脸合成的同时学习与遮挡无关、与身份相关的表示。我们首先提出一种新的掩码解耦模块,用于分离掩码信息与身份信息,使网络能够从可见的面部组件中获得更纯净的身份特征;随后通过联合训练策略恢复无遮挡人脸,并利用身份保持损失进一步优化识别网络。在真实与合成遮挡的人脸识别基准上的实验表明,MEER 优于当前最先进的方法。
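A minimal sketch of the mask-decoupling idea: one head predicts an identity embedding, another an occlusion map, and a decoder synthesises an unmasked face that is re-encoded for an identity-preserving loss. All layer sizes and the exact loss form are assumptions for illustration, not MEER's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskDecouplingNet(nn.Module):
    """Toy encoder with two heads (identity embedding, per-pixel occlusion map)
    plus a decoder that synthesises an unmasked face."""
    def __init__(self, emb_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                                     nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.id_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, emb_dim))
        self.mask_head = nn.Conv2d(64, 1, 1)                      # occlusion probability map
        self.decoder = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                                     nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1))

    def forward(self, x):
        feats = self.encoder(x)
        identity = F.normalize(self.id_head(feats), dim=-1)
        occlusion = torch.sigmoid(self.mask_head(feats))
        restored = torch.tanh(self.decoder(feats * (1 - occlusion)))  # suppress occluded features
        return identity, occlusion, restored

net = MaskDecouplingNet()
masked_face = torch.randn(2, 3, 112, 112)
identity, occlusion, restored = net(masked_face)

# Identity-preserving loss: the restored face, re-encoded, should match the identity embedding.
restored_id, _, _ = net(restored)
id_preserving_loss = 1 - F.cosine_similarity(identity, restored_id).mean()
print(identity.shape, occlusion.shape, restored.shape, float(id_preserving_loss))
```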

BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning

  • paper_url: http://arxiv.org/abs/2311.12075
  • repo_url: None
  • paper_authors: Siyuan Liang, Mingli Zhu, Aishan Liu, Baoyuan Wu, Xiaochun Cao, Ee-Chien Chang
  • for: 本研究旨在提高模型版权保护和防御力,通过研究后门攻击。
  • methods: 本文使用了 dual-embedding 指导架构和 Bayesian 规则,实现了不易被检测的后门攻击。
  • results: 在存在最先进后门防御的情况下,本文的攻击显著优于现有基线(ASR 提升 45.3%),使这些检测与缓解策略基本失效;该攻击对下游任务等更严格的场景同样有效。
    Abstract Studying backdoor attacks is valuable for model copyright protection and enhancing defenses. While existing backdoor attacks have successfully infected multimodal contrastive learning models such as CLIP, they can be easily countered by specialized backdoor defenses for MCL models. This paper reveals the threat that, in this practical scenario, backdoor attacks can remain effective even after defenses, and introduces the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses. To achieve this, we draw motivations from the perspective of the Bayesian rule and propose a dual-embedding guided framework for backdoor attacks. Specifically, we ensure that visual trigger patterns approximate the textual target semantics in the embedding space, making it challenging to detect the subtle parameter variations induced by backdoor learning on such natural trigger patterns. Additionally, we optimize the visual trigger patterns to align the poisoned samples with target vision features in order to hinder the backdoor unlearning through clean fine-tuning. Extensive experiments demonstrate that our attack significantly outperforms state-of-the-art baselines (+45.3% ASR) in the presence of SoTA backdoor defenses, rendering these mitigation and detection strategies virtually ineffective. Furthermore, our approach effectively attacks some more rigorous scenarios like downstream tasks. We believe that this paper raises awareness regarding the potential threats associated with the practical application of multimodal contrastive learning and encourages the development of more robust defense mechanisms.