cs.CV - 2023-11-18

Diverse Shape Completion via Style Modulated Generative Adversarial Networks

  • paper_url: http://arxiv.org/abs/2311.11184
  • repo_url: None
  • paper_authors: Wesley Khademi, Li Fuxin
  • for: This paper proposes a novel conditional generative adversarial network (cGAN) that completes the shape of a partially observed 3D object.
  • methods: The network introduces stochasticity via style modulation to produce multiple plausible completions; style codes extracted from complete shapes during training explicitly carry shape-category information, leading to better completions. Multi-scale diversity penalties and discriminators prevent conditional mode collapse and allow training without multiple ground-truth completions per partial input.
  • results: Evaluations across several synthetic and real datasets show that the method respects the partial observations while achieving greater diversity in its completions.
    Abstract Shape completion aims to recover the full 3D geometry of an object from a partial observation. This problem is inherently multi-modal since there can be many ways to plausibly complete the missing regions of a shape. Such diversity would be indicative of the underlying uncertainty of the shape and could be preferable for downstream tasks such as planning. In this paper, we propose a novel conditional generative adversarial network that can produce many diverse plausible completions of a partially observed point cloud. To enable our network to produce multiple completions for the same partial input, we introduce stochasticity into our network via style modulation. By extracting style codes from complete shapes during training, and learning a distribution over them, our style codes can explicitly carry shape category information leading to better completions. We further introduce diversity penalties and discriminators at multiple scales to prevent conditional mode collapse and to train without the need for multiple ground truth completions for each partial input. Evaluations across several synthetic and real datasets demonstrate that our method achieves significant improvements in respecting the partial observations while obtaining greater diversity in completions.
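    A minimal sketch of the style-modulation idea, assuming a StyleGAN-like channel-wise modulation of point features; the layer shapes and modulation rule below are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn

class StyleModulatedConv(nn.Module):
    """Scale per-channel point features by an affine function of a style code."""
    def __init__(self, in_ch, out_ch, style_dim):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=1)
        self.affine = nn.Linear(style_dim, in_ch)  # style code -> channel-wise scales

    def forward(self, x, style):
        # x: (B, C, N) point features; style: (B, style_dim) sampled style code
        scale = self.affine(style).unsqueeze(-1)   # (B, C, 1)
        return self.conv(x * scale)               # modulate, then convolve

# Same partial-shape features, two sampled style codes -> two distinct outputs.
feats = torch.randn(2, 64, 1024)
layer = StyleModulatedConv(64, 64, style_dim=128)
out_a = layer(feats, torch.randn(2, 128))
out_b = layer(feats, torch.randn(2, 128))
```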

Active Prompt Learning in Vision Language Models

  • paper_url: http://arxiv.org/abs/2311.11178
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Jihwan Bang, Sumyeong Ahn, Jae-Gil Lee
  • for: This study investigates how pre-trained vision-language models (VLMs) can be adapted under the active learning framework.
  • methods: We propose a novel active learning framework for pre-trained VLMs, denoted PCB, which exploits the knowledge of the VLM to rebalance the pool of labeling candidates and thus address class imbalance in sample selection.
  • results: Experiments on seven real-world datasets show that PCB outperforms conventional active learning and random sampling methods.
    Abstract Pre-trained Vision Language Models (VLMs) have demonstrated notable progress in various zero-shot tasks, such as classification and retrieval. Despite their performance, adapting them remains essential because improving performance on new tasks requires task-specific knowledge. While labels are needed for the adaptation, acquiring them is typically expensive. To overcome this challenge, active learning, a method of achieving high performance by obtaining labels for a small number of samples from experts, has been studied. Active learning primarily focuses on selecting unlabeled samples for labeling and leveraging them to train models. In this study, we pose the question, "how can the pre-trained VLMs be adapted under the active learning framework?" In response to this inquiry, we observe that (1) simply applying a conventional active learning framework to pre-trained VLMs may even degrade performance compared to random selection because of the class imbalance in labeling candidates, and (2) the knowledge of VLMs can provide hints for achieving the balance before labeling. Based on these observations, we devise a novel active learning framework for VLMs, denoted as PCB. To assess the effectiveness of our approach, we conduct experiments on seven different real-world datasets, and the results demonstrate that PCB surpasses conventional active learning and random sampling methods.
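    A minimal sketch of the core idea: use the VLM's zero-shot predictions as pseudo-labels to balance the query set before labeling. The balancing rule and the `uncertainty` scores are illustrative assumptions, not the exact PCB algorithm:

```python
import numpy as np

def balanced_query(pseudo_labels, uncertainty, budget, num_classes):
    """Select `budget` samples, spread evenly over the VLM's pseudo-classes."""
    per_class = budget // num_classes
    chosen = []
    for c in range(num_classes):
        idx = np.where(pseudo_labels == c)[0]     # candidates the VLM assigns to class c
        idx = idx[np.argsort(-uncertainty[idx])]  # most uncertain first within the class
        chosen.extend(idx[:per_class].tolist())
    return chosen

pseudo = np.random.randint(0, 5, size=1000)  # zero-shot class predictions from the VLM
unc = np.random.rand(1000)                   # e.g., entropy of zero-shot probabilities
print(len(balanced_query(pseudo, unc, budget=50, num_classes=5)))
```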

LOSTU: Fast, Scalable, and Uncertainty-Aware Triangulation

  • paper_url: http://arxiv.org/abs/2311.11171
  • repo_url: None
  • paper_authors: Sébastien Henry, John A. Christian
  • for: Provides a fast, scalable, and statistically optimal triangulation method, LOSTU, for point triangulation in structure-from-motion (SfM) pipelines.
  • methods: The method builds on recent findings and, unlike conventional $L_2$ triangulation, takes 3D point uncertainties into account.
  • results: LOSTU consistently yields lower 3D reconstruction errors and can be substantially faster than Levenberg-Marquardt (or similar) optimization schemes.
    Abstract Triangulation algorithms often aim to minimize the reprojection ($L_2$) error, but this only provides the maximum likelihood estimate when there are no errors in the camera parameters or camera poses. Although recent advancements have yielded techniques to estimate camera parameters accounting for 3D point uncertainties, most structure from motion (SfM) pipelines still use older triangulation algorithms. This work leverages recent discoveries to provide a fast, scalable, and statistically optimal way to triangulate called LOSTU. Results show that LOSTU consistently produces lower 3D reconstruction errors than conventional $L_2$ triangulation methods -- often allowing LOSTU to successfully triangulate more points. Moreover, in addition to providing a better 3D reconstruction, LOSTU can be substantially faster than Levenberg-Marquardt (or similar) optimization schemes.
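    The distinction LOSTU exploits can be written compactly. The notation below is assumed for illustration ($x_i$: observed pixel, $\pi$: projection through camera $P_i$, $\Sigma_i$: residual covariance propagated from camera-parameter and pose uncertainty) and is not taken verbatim from the paper:

```latex
% Classical L2 triangulation: maximum-likelihood only when the cameras are exact.
\hat{X}_{L_2} \;=\; \arg\min_{X}\ \sum_{i}\bigl\lVert x_i - \pi(P_i, X)\bigr\rVert_2^{2}

% Uncertainty-aware objective: weight each residual by its propagated covariance.
\hat{X} \;=\; \arg\min_{X}\ \sum_{i}\bigl(x_i - \pi(P_i, X)\bigr)^{\top}\,\Sigma_i^{-1}\,\bigl(x_i - \pi(P_i, X)\bigr)
```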

Benchmarking Feature Extractors for Reinforcement Learning-Based Semiconductor Defect Localization

  • paper_url: http://arxiv.org/abs/2311.11145
  • repo_url: None
  • paper_authors: Enrique Dehaerne, Bappaditya Dey, Sandip Halder, Stefan De Gendt
  • for: Defect inspection and localization in semiconductor samples from SEM images.
  • methods: A deep reinforcement learning (RL) approach to defect localization that iteratively extracts features from increasingly smaller regions of the input image, evaluated with different feature extractors.
  • results: 18 agents trained with different feature extractors are compared, and the advantages and disadvantages of the feature extractors, as well as of the RL-based framework for semiconductor defect localization, are discussed.
    Abstract As semiconductor patterning dimensions shrink, more advanced Scanning Electron Microscopy (SEM) image-based defect inspection techniques are needed. Recently, many Machine Learning (ML)-based approaches have been proposed for defect localization and have shown impressive results. These methods often rely on feature extraction from a full SEM image and possibly a number of regions of interest. In this study, we propose a deep Reinforcement Learning (RL)-based approach to defect localization which iteratively extracts features from increasingly smaller regions of the input image. We compare the results of 18 agents trained with different feature extractors. We discuss the advantages and disadvantages of different feature extractors as well as the RL-based framework in general for semiconductor defect localization.
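    The iterative region-shrinking loop can be sketched as follows. The quadrant action space, step count, and `agent` interface (`.extract`, `.act`) are assumptions for illustration, not the paper's design:

```python
import numpy as np

def localize(image, agent, steps=5):
    """Iteratively halve the search window toward the defect."""
    y0, x0 = 0, 0
    h, w = image.shape[:2]
    for _ in range(steps):
        crop = image[y0:y0 + h, x0:x0 + w]
        features = agent.extract(crop)   # the feature extractor under study
        action = agent.act(features)     # 0..3: which quadrant to keep
        h, w = h // 2, w // 2
        y0 += (action // 2) * h
        x0 += (action % 2) * w
    return y0, x0, h, w  # final region believed to contain the defect

img = np.zeros((512, 512), dtype=np.float32)
# localize(img, trained_agent)  # trained_agent supplies .extract and .act
```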

Estimating Uncertainty in Landslide Segmentation Models

  • paper_url: http://arxiv.org/abs/2311.11138
  • repo_url: None
  • paper_authors: Savinay Nagendra, Chaopeng Shen, Daniel Kifer
  • for: Motivated by the need for a high-quality, large-scale dataset covering global landslide-risk areas to support preparation and mitigation efforts.
  • methods: Deep learning models for landslide segmentation (pixel labeling) from satellite imagery are used, and several methods that require no architectural changes are evaluated for estimating pixel-level uncertainty.
  • results: Experiments show that Test-Time Augmentation consistently provides the highest-quality uncertainty estimates across a variety of models and metrics.
    Abstract Landslides are a recurring, widespread hazard. Preparation and mitigation efforts can be aided by a high-quality, large-scale dataset that covers global at-risk areas. Such a dataset currently does not exist and is impossible to construct manually. Recent automated efforts focus on deep learning models for landslide segmentation (pixel labeling) from satellite imagery. However, it is also important to characterize the uncertainty or confidence levels of such segmentations. Accurate and robust uncertainty estimates can enable low-cost (in terms of manual labor) oversight of auto-generated landslide databases to resolve errors, identify hard negative examples, and increase the size of labeled training data. In this paper, we evaluate several methods for assessing pixel-level uncertainty of the segmentation. Three methods that do not require architectural changes were compared, including Pre-Threshold activations, Monte-Carlo Dropout and Test-Time Augmentation -- a method that measures the robustness of predictions in the face of data augmentation. Experimentally, the quality of the latter method was consistently higher than the others across a variety of models and metrics in our dataset.
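    A minimal sketch of Test-Time Augmentation uncertainty for segmentation: predict on augmented copies, invert each augmentation, and measure per-pixel disagreement. The flip-based augmentation set and the `model` are illustrative; the paper's exact choices may differ:

```python
import torch

def tta_uncertainty(model, image):
    """image: (1, C, H, W). Returns per-pixel std of predicted landslide probability."""
    augs = [lambda x: x,
            lambda x: torch.flip(x, dims=[-1]),   # horizontal flip
            lambda x: torch.flip(x, dims=[-2])]   # vertical flip
    inverses = augs  # flips are self-inverse
    probs = []
    with torch.no_grad():
        for aug, inv in zip(augs, inverses):
            p = torch.sigmoid(model(aug(image)))  # (1, 1, H, W) probability map
            probs.append(inv(p))                  # map predictions back to input frame
    probs = torch.stack(probs)                    # (T, 1, 1, H, W)
    return probs.std(dim=0)                       # high std = low-confidence pixels
```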

Invariant-based Mapping of Space During General Motion of an Observer

  • paper_url: http://arxiv.org/abs/2311.11130
  • repo_url: None
  • paper_authors: Juan D. Yepes, Daniel Raviv
  • for: This paper explores visual motion-based invariants that yield a new instantaneous domain in which the stationary environment is perceived as unchanged even as the 2D images continuously change under camera motion, obstacles can be detected and potentially avoided in specific subspaces, and moving objects can potentially be detected.
  • methods: Nonlinear functions derived from measurable optical flow, linked to geometric 3D invariants.
  • results: Simulations and experiments on real data (KITTI) show that the stationary environment appears unchanged in the new domain over time, and that moving objects can be detected and segmented.
    Abstract This paper explores visual motion-based invariants, resulting in a new instantaneous domain where: a) the stationary environment is perceived as unchanged, even as the 2D images undergo continuous changes due to camera motion, b) obstacles can be detected and potentially avoided in specific subspaces, and c) moving objects can potentially be detected. To achieve this, we make use of nonlinear functions derived from measurable optical flow, which are linked to geometric 3D invariants. We present simulations involving a camera that translates and rotates relative to a 3D object, capturing snapshots of the camera projected images. We show that the object appears unchanged in the new domain over time. We process real data from the KITTI dataset and demonstrate how to segment space to identify free navigational regions and detect obstacles within a predetermined subspace. Additionally, we present preliminary results, based on the KITTI dataset, on the identification and segmentation of moving objects, as well as the visualization of shape constancy. This representation is straightforward, relying on functions for the simple de-rotation of optical flow. This representation only requires a single camera, it is pixel-based, making it suitable for parallel processing, and it eliminates the necessity for 3D reconstruction techniques.
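    The de-rotation step the representation relies on can be summarized with the standard flow decomposition below; the notation ($\mathbf{p}$: pixel position, $\boldsymbol{\omega}$: angular velocity) is assumed for illustration:

```latex
% Optical flow splits into translational and rotational components. The
% rotational part depends only on pixel position and angular velocity (not on
% depth), so it can be computed and subtracted, leaving a purely translational
% field from which the 3D invariants are built.
\dot{\mathbf{p}} \;=\; \dot{\mathbf{p}}_{\mathrm{trans}} + \dot{\mathbf{p}}_{\mathrm{rot}}(\mathbf{p},\boldsymbol{\omega}),
\qquad
\dot{\mathbf{p}}_{\mathrm{trans}} \;=\; \dot{\mathbf{p}} - \dot{\mathbf{p}}_{\mathrm{rot}}(\mathbf{p},\boldsymbol{\omega})
```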

SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.11125
  • repo_url: None
  • paper_authors: Yamei Chen, Yan Di, Guangyao Zhai, Fabian Manhardt, Chenyangguang Zhang, Ruida Zhang, Federico Tombari, Nassir Navab, Benjamin Busam
  • for: Category-level estimation of the 6D pose and 3D size of objects, particularly under large intra-class shape variation.
  • methods: SecondPose fuses object-specific, SE(3)-invariant geometric features with semantic category priors from DINOv2; the hierarchically extracted geometric features are point-aligned with DINOv2 features to obtain an object representation that is consistent under SE(3) transformations.
  • results: SecondPose improves on the previous state of the art by 12.4% on NOCS-REAL275 and still surpasses competitors by a large margin on the more challenging HouseCat6D.
    Abstract Category-level object pose estimation, aiming to predict the 6D pose and 3D size of objects from known categories, typically struggles with large intra-class shape variation. Existing works utilizing mean shapes often fall short of capturing this variation. To address this issue, we present SecondPose, a novel approach integrating object-specific geometric features with semantic category priors from DINOv2. Leveraging the advantage of DINOv2 in providing SE(3)-consistent semantic features, we hierarchically extract two types of SE(3)-invariant geometric features to further encapsulate local-to-global object-specific information. These geometric features are then point-aligned with DINOv2 features to establish a consistent object representation under SE(3) transformations, facilitating the mapping from camera space to the pre-defined canonical space, thus further enhancing pose estimation. Extensive experiments on NOCS-REAL275 demonstrate that SecondPose achieves a 12.4% leap forward over the state-of-the-art. Moreover, on a more complex dataset HouseCat6D which provides photometrically challenging objects, SecondPose still surpasses other competitors by a large margin. The code will be released soon.

ShapeMaker: Self-Supervised Joint Shape Canonicalization, Segmentation, Retrieval and Deformation

  • paper_url: http://arxiv.org/abs/2311.11106
  • repo_url: None
  • paper_authors: Yan Di, Chenyangguang Zhang, Chaowei Wang, Ruida Zhang, Guangyao Zhai, Yanyan Li, Bowen Fu, Xiangyang Ji, Shan Gao
  • for: This paper presents ShapeMaker, a unified self-supervised learning framework for joint shape canonicalization, segmentation, retrieval, and deformation.
  • methods: Point-wise affine-invariant features are first extracted from a partially observed object, disentangling its inherent structure from pose and size; these features are used to predict semantically consistent part segmentation and the corresponding part centers. A lightweight retrieval module then aggregates the features within each part into a retrieval token and compares the tokens against source shapes in a pre-established database to identify the most geometrically similar shape. Finally, a part-center-guided neural cage deformation module deforms the retrieved shape to tightly fit the input object.
  • results: Experiments on the synthetic datasets PartNet and ComplementMe and the real-world dataset Scan2CAD show that ShapeMaker surpasses competitors by a large margin.
    Abstract In this paper, we present ShapeMaker, a unified self-supervised learning framework for joint shape canonicalization, segmentation, retrieval and deformation. Given a partially-observed object in an arbitrary pose, we first canonicalize the object by extracting point-wise affine-invariant features, disentangling inherent structure of the object with its pose and size. These learned features are then leveraged to predict semantically consistent part segmentation and corresponding part centers. Next, our lightweight retrieval module aggregates the features within each part as its retrieval token and compare all the tokens with source shapes from a pre-established database to identify the most geometrically similar shape. Finally, we deform the retrieved shape in the deformation module to tightly fit the input object by harnessing part center guided neural cage deformation. The key insight of ShapeMaker is the simultaneous training of the four highly-associated processes: canonicalization, segmentation, retrieval, and deformation, leveraging cross-task consistency losses for mutual supervision. Extensive experiments on synthetic datasets PartNet, ComplementMe, and real-world dataset Scan2CAD demonstrate that ShapeMaker surpasses competitors by a large margin. Codes will be released soon.

On the Out of Distribution Robustness of Foundation Models in Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.11096
  • repo_url: None
  • paper_authors: Duy Minh Ho Nguyen, Tan Ngoc Pham, Nghiem Tuong Diep, Nghi Quoc Phan, Quang Pham, Vinh Tong, Binh T. Nguyen, Ngan Hoang Le, Nhat Ho, Pengtao Xie, Daniel Sonntag, Mathias Niepert
  • for: This study examines the robustness of foundation models to distribution shifts in the medical image segmentation task.
  • methods: Various pre-trained models, including ViT- and DeiT-based architectures, are fine-tuned on the same in-distribution dataset, and their generalization to unseen domains is compared.
  • results: Foundation-based models prove more robust to domain shifts than other architectures, and a new Bayesian uncertainty estimation for frozen models is developed as an indicator of performance on out-of-distribution data.
    Abstract Constructing a robust model that can effectively generalize to test samples under distribution shifts remains a significant challenge in the field of medical imaging. Foundation models for vision and language, pre-trained on extensive sets of natural image and text data, have emerged as a promising approach. They showcase impressive learning abilities across different tasks while needing only a limited amount of annotated samples. While numerous techniques have focused on developing better fine-tuning strategies to adapt these models for specific domains, we instead examine their robustness to domain shifts in the medical image segmentation task. To this end, we compare the generalization performance to unseen domains of various pre-trained models after being fine-tuned on the same in-distribution dataset and show that foundation-based models enjoy better robustness than other architectures. From here, we further developed a new Bayesian uncertainty estimation for frozen models and used it as an indicator to characterize the model's performance on out-of-distribution (OOD) data, proving particularly beneficial for real-world applications. Our experiments not only reveal the limitations of current indicators like accuracy on the line or agreement on the line commonly used in natural image applications but also emphasize the promise of the introduced Bayesian uncertainty. Specifically, lower uncertainty predictions usually tend to correspond to higher out-of-distribution (OOD) performance.

LightBTSeg: A lightweight breast tumor segmentation model using ultrasound images via dual-path joint knowledge distillation

  • paper_url: http://arxiv.org/abs/2311.11086
  • repo_url: None
  • paper_authors: Hongjiang Guo, Shengwen Wang, Hao Dang, Kangle Xiao, Yaru Yang, Wenpei Liu, Tongtong Liu, Yiying Wan
  • for: Improving the accuracy of breast tumor segmentation in ultrasound images, a prerequisite for lesion detection and for early diagnosis and treatment of breast cancer.
  • methods: LightBTSeg, a dual-path joint knowledge distillation framework that uses a double-teacher model to represent the fine-grained features of benign and malignant breast tumors and distills their knowledge into a lightweight Simplified U-Net student.
  • results: Experiments show that LightBTSeg outperforms its counterparts in breast tumor segmentation.
    Abstract The accurate segmentation of breast tumors is an important prerequisite for lesion detection, which has significant clinical value for breast tumor research. Mainstream deep learning-based methods have achieved a breakthrough. However, these high-performance segmentation methods are difficult to deploy in clinical scenarios because they carry high computational complexity, massive parameter counts, slow inference speed, and huge memory consumption. To tackle this problem, we propose LightBTSeg, a dual-path joint knowledge distillation framework, for lightweight breast tumor segmentation. Concretely, we design a double-teacher model to represent the fine-grained features of breast ultrasound according to different semantic feature realignments of benign and malignant breast tumors. Specifically, we leverage the bottleneck architecture to reconstruct the original Attention U-Net into a lightweight student model named Simplified U-Net. Then, the prior knowledge of benign and malignant categories is utilized to design the teacher network combined with dual-path joint knowledge distillation, which distills the knowledge from the cumbersome benign and malignant teachers to the lightweight student model. Extensive experiments conducted on the breast ultrasound images (Dataset BUSI) and Breast Ultrasound Dataset B (Dataset B) datasets demonstrate that LightBTSeg outperforms various counterparts.
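    A minimal sketch of dual-teacher distillation for segmentation, assuming a standard KL-based distillation loss; the equal teacher weighting and temperature are illustrative, not LightBTSeg's exact formulation:

```python
import torch
import torch.nn.functional as F

def dual_teacher_kd_loss(student_logits, benign_logits, malignant_logits,
                         target, alpha=0.5, T=2.0):
    """Combine supervised loss with distillation from two specialized teachers.

    Logits: (B, C, H, W); target: (B, H, W) integer mask.
    """
    sup = F.cross_entropy(student_logits, target)
    s = F.log_softmax(student_logits / T, dim=1)
    kd = 0.5 * (F.kl_div(s, F.softmax(benign_logits / T, dim=1), reduction="batchmean")
                + F.kl_div(s, F.softmax(malignant_logits / T, dim=1), reduction="batchmean"))
    return (1 - alpha) * sup + alpha * (T * T) * kd
```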

Enhancing Transformer-Based Segmentation for Breast Cancer Diagnosis using Auto-Augmentation and Search Optimisation Techniques

  • paper_url: http://arxiv.org/abs/2311.11065
  • repo_url: None
  • paper_authors: Leon Hamnett, Mary Adewunmi, Modinat Abayomi, Kayode Raheem, Fahad Ahmed
  • for: This paper aims to improve the accuracy and robustness of breast cancer cell segmentation in histology slides using automated image augmentation selection and search optimisation strategies.
  • methods: The proposed methodology combines RandAugment with the Tree-based Parzen Estimator to identify optimal values for the image augmentations and their associated parameters, leading to enhanced segmentation performance.
  • results: The methodology yields segmentation models that are more resilient to variations in histology slides while maintaining high segmentation performance, with improved segmentation of the tumour class compared to previous research; the best result after applying the augmentations is a Dice score of 84.08 and an IoU score of 72.54 when segmenting the tumour class.
    Abstract Breast cancer remains a critical global health challenge, necessitating early and accurate detection for effective treatment. This paper introduces a methodology that combines automated image augmentation selection (RandAugment) with search optimisation strategies (Tree-based Parzen Estimator) to identify optimal values for the number of image augmentations and the magnitude of their associated augmentation parameters, leading to enhanced segmentation performance. We empirically validate our approach on breast cancer histology slides, focusing on the segmentation of cancer cells. A comparative analysis of state-of-the-art transformer-based segmentation models is conducted, including SegFormer, PoolFormer, and MaskFormer models, to establish a comprehensive baseline, before applying the augmentation methodology. Our results show that the proposed methodology leads to segmentation models that are more resilient to variations in histology slides whilst maintaining high levels of segmentation performance, and show improved segmentation of the tumour class when compared to previous research. Our best result after applying the augmentations is a Dice Score of 84.08 and an IoU score of 72.54 when segmenting the tumour class. The primary contribution of this paper is the development of a methodology that enhances segmentation performance while ensuring model robustness to data variances. This has significant implications for medical practitioners, enabling the development of more effective machine learning models for clinical applications to identify breast cancer cells from histology slides. Furthermore, the codebase accompanying this research will be released upon publication. This will facilitate further research and application development based on our methodology, thereby amplifying its impact.
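    A minimal sketch of tuning RandAugment's operation count and magnitude with a Tree-structured Parzen Estimator via Optuna; `train_and_eval` is a hypothetical placeholder for the paper's training-and-validation pipeline:

```python
import optuna
from torchvision.transforms import RandAugment

def train_and_eval(augment):
    # Placeholder (hypothetical): train the segmentation model with `augment`
    # applied to the training images and return the validation Dice score.
    return 0.0

def objective(trial):
    num_ops = trial.suggest_int("num_ops", 1, 4)       # augmentations applied per image
    magnitude = trial.suggest_int("magnitude", 1, 15)  # strength of each augmentation
    augment = RandAugment(num_ops=num_ops, magnitude=magnitude)
    return train_and_eval(augment)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=50)
print(study.best_params)
```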

HIDRO-VQA: High Dynamic Range Oracle for Video Quality Assessment

  • paper_url: http://arxiv.org/abs/2311.11059
  • repo_url: None
  • paper_authors: Shreshth Saini, Avinab Saha, Alan C. Bovik
  • for: This paper presents HIDRO-VQA, a no-reference video quality assessment (VQA) model designed to provide precise quality evaluations of High Dynamic Range (HDR) videos.
  • methods: A self-supervised contrastive fine-tuning strategy transfers quality-aware features from the SDR domain to the HDR domain using only unlabeled HDR videos.
  • results: The fine-tuned model achieves state-of-the-art performance on the LIVE-HDR VQA database, the only publicly available VQA database for HDR content, and the approach extends to the full-reference VQA setting with state-of-the-art results as well.
    Abstract We introduce HIDRO-VQA, a no-reference (NR) video quality assessment model designed to provide precise quality evaluations of High Dynamic Range (HDR) videos. HDR videos exhibit a broader spectrum of luminance, detail, and color than Standard Dynamic Range (SDR) videos. As HDR content becomes increasingly popular, there is a growing demand for video quality assessment (VQA) algorithms that effectively address distortions unique to HDR content. To address this challenge, we propose a self-supervised contrastive fine-tuning approach to transfer quality-aware features from the SDR to the HDR domain, utilizing unlabeled HDR videos. Our findings demonstrate that self-supervised pre-trained neural networks on SDR content can be further fine-tuned in a self-supervised setting using limited unlabeled HDR videos to achieve state-of-the-art performance on the only publicly available VQA database for HDR content, the LIVE-HDR VQA database. Moreover, our algorithm can be extended to the Full Reference VQA setting, also achieving state-of-the-art performance. Our code is available publicly at https://github.com/avinabsaha/HIDRO-VQA.

Hyperbolic Space with Hierarchical Margin Boosts Fine-Grained Learning from Coarse Labels

  • paper_url: http://arxiv.org/abs/2311.11019
  • repo_url: None
  • paper_authors: Shu-Lin Xu, Yifan Sun, Faen Zhang, Anqi Xu, Xiu-Shen Wei, Yi Yang
  • for: This paper proposes a new method for learning fine-grained embeddings from coarse labels.
  • methods: Visual embeddings are mapped into a hyperbolic space, whose ability to capture hierarchical relationships favors modeling fine-grained objects; a hierarchical cosine margin scheme then enforces relatively large similarity margins between coarse classes and smaller margins between fine classes to enhance discriminative ability.
  • results: Extensive experiments on five benchmark datasets show state-of-the-art results that surpass competing methods.
    Abstract Learning fine-grained embeddings from coarse labels is a challenging task due to limited label granularity supervision, i.e., lacking the detailed distinctions required for fine-grained tasks. The task becomes even more demanding when attempting few-shot fine-grained recognition, which holds practical significance in various applications. To address these challenges, we propose a novel method that embeds visual embeddings into a hyperbolic space and enhances their discriminative ability with a hierarchical cosine margins manner. Specifically, the hyperbolic space offers distinct advantages, including the ability to capture hierarchical relationships and increased expressive power, which favors modeling fine-grained objects. Based on the hyperbolic space, we further enforce relatively large/small similarity margins between coarse/fine classes, respectively, yielding the so-called hierarchical cosine margins manner. While enforcing similarity margins in the regular Euclidean space has become popular for deep embedding learning, applying it to the hyperbolic space is non-trivial and validating the benefit for coarse-to-fine generalization is valuable. Extensive experiments conducted on five benchmark datasets showcase the effectiveness of our proposed method, yielding state-of-the-art results surpassing competing methods.
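    One way to realize such hierarchical margins is a CosFace-style objective where the margin depends on the label hierarchy; the formulation below is an assumed sketch, not the paper's exact loss ($z$: embedding, $w_j$: class prototype, $s$: scale, with $m_{\mathrm{coarse}} > m_{\mathrm{fine}}$):

```latex
\mathcal{L} \;=\; -\log
\frac{e^{\,s\,(\cos\theta_{y} - m_{y})}}
     {e^{\,s\,(\cos\theta_{y} - m_{y})} + \sum_{j \neq y} e^{\,s\,\cos\theta_{j}}},
\qquad
\cos\theta_{j} \;=\; \frac{\langle z, w_{j}\rangle}{\lVert z\rVert\,\lVert w_{j}\rVert}
```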

Improving Adversarial Transferability by Stable Diffusion

  • paper_url: http://arxiv.org/abs/2311.11017
  • repo_url: None
  • paper_authors: Jiayang Liu, Siyu Zhu, Siyuan Liang, Jie Zhang, Han Fang, Weiming Zhang, Ee-Chien Chang
  • for: The paper explores the potential of leveraging data generated by Stable Diffusion to boost adversarial transferability in the black-box scenario.
  • methods: It introduces a novel attack method, the Stable Diffusion Attack Method (SDAM), which incorporates samples generated by Stable Diffusion to augment input images, along with a fast variant of SDAM that reduces computational overhead while preserving high adversarial transferability.
  • results: The proposed method outperforms state-of-the-art baselines by a substantial margin and is compatible with existing transfer-based attacks to further enhance adversarial transferability.
    Abstract Deep neural networks (DNNs) are susceptible to adversarial examples, which introduce imperceptible perturbations to benign samples, deceiving DNN predictions. While some attack methods excel in the white-box setting, they often struggle in the black-box scenario, particularly against models fortified with defense mechanisms. Various techniques have emerged to enhance the transferability of adversarial attacks for the black-box scenario. Among these, input transformation-based attacks have demonstrated their effectiveness. In this paper, we explore the potential of leveraging data generated by Stable Diffusion to boost adversarial transferability. This approach draws inspiration from recent research that harnessed synthetic data generated by Stable Diffusion to enhance model generalization. In particular, previous work has highlighted the correlation between the presence of both real and synthetic data and improved model generalization. Building upon this insight, we introduce a novel attack method called Stable Diffusion Attack Method (SDAM), which incorporates samples generated by Stable Diffusion to augment input images. Furthermore, we propose a fast variant of SDAM to reduce computational overhead while preserving high adversarial transferability. Our extensive experimental results demonstrate that our method outperforms state-of-the-art baselines by a substantial margin. Moreover, our approach is compatible with existing transfer-based attacks to further enhance adversarial transferability.
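    A minimal sketch of blending Stable-Diffusion-generated images into the inputs of an iterative momentum-based transfer attack; the linear mixing rule and hyperparameters are assumptions for illustration, not the exact SDAM update:

```python
import torch
import torch.nn.functional as F

def sdam_attack(model, x, y, synth_bank, eps=8/255, alpha=2/255,
                steps=10, mu=1.0, lam=0.3):
    """x: (B,C,H,W) clean images; synth_bank: (K,C,H,W) Stable Diffusion samples."""
    adv, g = x.clone(), torch.zeros_like(x)
    for _ in range(steps):
        adv.requires_grad_(True)
        k = torch.randint(len(synth_bank), (1,)).item()
        mixed = (1 - lam) * adv + lam * synth_bank[k:k + 1]  # blend in a synthetic sample
        loss = F.cross_entropy(model(mixed), y)
        grad, = torch.autograd.grad(loss, adv)
        g = mu * g + grad / grad.abs().mean()                # momentum accumulation
        step = adv.detach() + alpha * g.sign()
        adv = torch.max(torch.min(step, x + eps), x - eps).clamp(0, 1)
    return adv
```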

Implicit Event-RGBD Neural SLAM

  • paper_url: http://arxiv.org/abs/2311.11013
  • repo_url: None
  • paper_authors: Delin Qu, Chi Yan, Dong Wang, Jie Yin, Dan Xu, Bin Zhao, Xuelong Li
  • for: This paper proposes an event-RGBD implicit neural SLAM framework that addresses non-ideal scenarios such as motion blur and lighting variation, improving the accuracy and robustness of tracking and mapping.
  • methods: A differentiable camera response function (CRF) rendering technique generates distinct RGB and event camera data from a shared radiance field, optimized by learning a unified implicit representation under the captured event and RGBD supervision. Based on the temporal-difference property of events, a temporal aggregating optimization strategy exploits consecutive-difference constraints of events for joint tracking and global bundle adjustment, improving tracking accuracy and robustness.
  • results: Experiments on 6 scenes and 17 sequences with practical motion blur and lighting changes show that the method outperforms state-of-the-art methods in both tracking (ATE) and mapping (ACC) while running in real time at 17 FPS in various challenging environments.
    Abstract Implicit neural SLAM has achieved remarkable progress recently. Nevertheless, existing methods face significant challenges in non-ideal scenarios, such as motion blur or lighting variation, which often leads to issues like convergence failures, localization drifts, and distorted mapping. To address these challenges, we propose $\textbf{EN-SLAM}$, the first event-RGBD implicit neural SLAM framework, which effectively leverages the high rate and high dynamic range advantages of event data for tracking and mapping. Specifically, EN-SLAM proposes a differentiable CRF (Camera Response Function) rendering technique to generate distinct RGB and event camera data via a shared radiance field, which is optimized by learning a unified implicit representation with the captured event and RGBD supervision. Moreover, based on the temporal difference property of events, we propose a temporal aggregating optimization strategy for the event joint tracking and global bundle adjustment, capitalizing on the consecutive difference constraints of events, significantly enhancing tracking accuracy and robustness. Finally, we construct the simulated dataset $\textbf{DEV-Indoors}$ and real captured dataset $\textbf{DEV-Reals}$ containing 6 scenes, 17 sequences with practical motion blur and lighting changes for evaluations. Experimental results show that our method outperforms the SOTA methods in both tracking ATE and mapping ACC with a real-time $17$ FPS in various challenging environments. The code and dataset will be released upon the paper publication.

Learning Scene Context Without Images

  • paper_url: http://arxiv.org/abs/2311.10998
  • repo_url: None
  • paper_authors: Amirreza Rouhi, David Han
  • for: Teaching machines scene contextual knowledge so they can interact more effectively with the environment and anticipate or predict objects that may not be immediately apparent in their perceptual field.
  • methods: A novel transformer-based approach, $LMOD$ (Label-based Missing Object Detection), teaches scene contextual knowledge through an attention mechanism; distinctively, it relies solely on labels from image datasets, entirely eliminating the need for the images themselves.
  • results: Scene-wide relationships among objects can be learned from labels alone via self-attention, and the contextual knowledge gained from label-based learning enhances the performance of other vision-based object detection algorithms.
    Abstract Teaching machines scene contextual knowledge would enable them to interact more effectively with the environment and to anticipate or predict objects that may not be immediately apparent in their perceptual field. In this paper, we introduce a novel transformer-based approach called $LMOD$ (Label-based Missing Object Detection) to teach scene contextual knowledge to machines using an attention mechanism. A distinctive aspect of the proposed approach is its reliance solely on labels from image datasets to teach scene context, entirely eliminating the need for the actual image itself. We show how scene-wide relationships among different objects can be learned using a self-attention mechanism. We further show that the contextual knowledge gained from label-based learning can enhance the performance of other vision-based object detection algorithms.
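    A minimal sketch of learning scene context from label sets alone: embed the labels present in a scene, run self-attention over them, and predict a held-out label. The architecture sizes and pooling are illustrative, not LMOD's configuration:

```python
import torch
import torch.nn as nn

class LabelContextModel(nn.Module):
    def __init__(self, vocab_size, dim=128, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, vocab_size)     # scores for the missing object

    def forward(self, label_ids):                  # (B, L) labels observed in the scene
        h = self.encoder(self.embed(label_ids))    # self-attention over co-occurring labels
        return self.head(h.mean(dim=1))            # predict which object is missing

model = LabelContextModel(vocab_size=80)           # e.g., a COCO-sized label vocabulary
logits = model(torch.randint(0, 80, (2, 5)))       # 5 visible objects -> missing-object scores
```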

Towards Robust and Accurate Visual Prompting

  • paper_url: http://arxiv.org/abs/2311.10992
  • repo_url: None
  • paper_authors: Qi Li, Liangzhi Li, Zhouqiang Jiang, Bowen Wang
  • for: This paper studies visual prompting (VP) when the source model is adversarially robust, asking whether a visual prompt derived from a robust model inherits its robustness while suffering a decline in generalization performance on downstream datasets.
  • methods: A novel technique, Prompt Boundary Loose (PBL), is introduced to improve the standard accuracy of visual prompts derived from robust models without losing their adversarial robustness.
  • results: Extensive experiments show that the method improves both standard accuracy and adversarial robustness across various datasets.
    Abstract Visual prompting, an efficient method for transfer learning, has shown its potential in vision tasks. However, previous works focus exclusively on VP from standard source models; it is still unknown how VP performs with a robust source model: can a visual prompt derived from a robust model inherit the robustness while suffering a decline in generalization performance, albeit on a downstream dataset that differs from the source dataset? In this work, we give an affirmative answer to the above question and provide an explanation at the level of visual representations. Moreover, we introduce a novel technique named Prompt Boundary Loose (PBL) that effectively mitigates the suboptimal standard accuracy of visual prompts without losing (or even while significantly improving) their adversarial robustness when using a robust model as the source model. Extensive experiments across various datasets show that our findings are universal and demonstrate the significant benefits of our proposed method.

Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention

  • paper_url: http://arxiv.org/abs/2311.10988
  • repo_url: None
  • paper_authors: Zuyao Chen, Jinlin Wu, Zhen Lei, Zhaoxiang Zhang, Changwen Chen
  • for: Scene Graph Generation (SGG) offers a structured representation critical to many computer vision applications.
  • methods: A transformer-based framework, OvSGTR, learns visual-concept alignment for both nodes and edges so that the model can recognize object and relation categories beyond a predefined set; for the relation-involved open-vocabulary settings, relation-aware pre-training on image-caption data and knowledge distillation retain the visual-concept alignment.
  • results: Experiments on the Visual Genome benchmark demonstrate that the proposed fully open-vocabulary SGG method effectively recognizes unseen object and relation categories.
    Abstract Scene Graph Generation (SGG) offers a structured representation critical in many computer vision applications. Traditional SGG approaches, however, are limited by a closed-set assumption, restricting their ability to recognize only predefined object and relation categories. To overcome this, we categorize SGG scenarios into four distinct settings based on the node and edge: Closed-set SGG, Open Vocabulary (object) Detection-based SGG (OvD-SGG), Open Vocabulary Relation-based SGG (OvR-SGG), and Open Vocabulary Detection + Relation-based SGG (OvD+R-SGG). While object-centric open vocabulary SGG has been studied recently, the more challenging problem of relation-involved open-vocabulary SGG remains relatively unexplored. To fill this gap, we propose a unified framework named OvSGTR towards fully open vocabulary SGG from a holistic view. The proposed framework is an end-to-end transformer architecture, which learns a visual-concept alignment for both nodes and edges, enabling the model to recognize unseen categories. For the more challenging settings of relation-involved open vocabulary SGG, the proposed approach integrates relation-aware pre-training utilizing image-caption data and retains visual-concept alignment through knowledge distillation. Comprehensive experimental results on the Visual Genome benchmark demonstrate the effectiveness and superiority of the proposed framework.

Make Pixels Dance: High-Dynamic Video Generation

  • paper_url: http://arxiv.org/abs/2311.10982
  • repo_url: https://github.com/makepixelsdance/makepixelsdance.github.io
  • paper_authors: Yan Zeng, Guoqiang Wei, Jiani Zheng, Jiaxin Zou, Yang Wei, Yuchen Zhang, Hang Li
  • for: This work aims to advance video generation, in particular the synthesis of videos with complex scenes and intricate motions.
  • methods: PixelDance, a novel diffusion-model-based approach that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation.
  • results: Trained with public data, PixelDance exhibits significantly better proficiency in synthesizing videos with complex scenes and intricate motions, setting a new standard for video generation.
    Abstract Creating high-dynamic videos such as motion-rich actions and sophisticated visual effects poses a significant challenge in the field of artificial intelligence. Unfortunately, current state-of-the-art video generation methods, primarily focusing on text-to-video generation, tend to produce video clips with minimal motions despite maintaining high fidelity. We argue that relying solely on text instructions is insufficient and suboptimal for video generation. In this paper, we introduce PixelDance, a novel approach based on diffusion models that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation. Comprehensive experimental results demonstrate that PixelDance trained with public data exhibits significantly better proficiency in synthesizing videos with complex scenes and intricate motions, setting a new standard for video generation.

Structure-Aware Sparse-View X-ray 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2311.10959
  • repo_url: None
  • paper_authors: Yuanhao Cai, Jiahao Wang, Alan Yuille, Zongwei Zhou, Angtian Wang
  • for: Improving the accuracy and efficiency of sparse-view X-ray 3D reconstruction.
  • methods: A Line Segment-based Transformer (Lineformer) backbone captures internal structures by modeling dependencies within each line segment of an X-ray, combined with a Masked Local-Global (MLG) ray sampling strategy that extracts contextual and geometric information in the 2D projection.
  • results: On the X3D dataset, SAX-NeRF surpasses previous NeRF-based methods by 12.56 dB on novel view synthesis and 2.49 dB on CT reconstruction.
    Abstract X-ray, known for its ability to reveal internal structures of objects, is expected to provide richer information for 3D reconstruction than visible light. Yet, existing neural radiance fields (NeRF) algorithms overlook this important nature of X-ray, leading to their limitations in capturing structural contents of imaged objects. In this paper, we propose a framework, Structure-Aware X-ray Neural Radiodensity Fields (SAX-NeRF), for sparse-view X-ray 3D reconstruction. Firstly, we design a Line Segment-based Transformer (Lineformer) as the backbone of SAX-NeRF. Lineformer captures internal structures of objects in 3D space by modeling the dependencies within each line segment of an X-ray. Secondly, we present a Masked Local-Global (MLG) ray sampling strategy to extract contextual and geometric information in 2D projection. Plus, we collect a larger-scale dataset X3D covering wider X-ray applications. Experiments on X3D show that SAX-NeRF surpasses previous NeRF-based methods by 12.56 and 2.49 dB on novel view synthesis and CT reconstruction. Code, models, and data will be released at https://github.com/caiyuanhao1998/SAX-NeRF

NAS-ASDet: An Adaptive Design Method for Surface Defect Detection Network using Neural Architecture Search

  • paper_url: http://arxiv.org/abs/2311.10952
  • repo_url: None
  • paper_authors: Zhenrong Wang, Bin Li, Weifeng Li, Shuanlong Niu, Wang Miao, Tongzhi Niu
  • for: This work aims to automatically generate neural network architectures suited to surface defect detection, improving detection accuracy and efficiency in industrial scenarios.
  • methods: Neural architecture search (NAS) over a refined, industry-appropriate search space that can adaptively adjust the feature distribution, consisting of repeatedly stacked basic novel cells with searchable attention operations; a progressive search strategy with a deep supervision mechanism explores the search space faster and better.
  • results: Experiments show that the proposed method designs high-performance, lightweight defect detection networks that outperform competing manual and NAS-based approaches with a relatively smaller model size.
    Abstract Deep convolutional neural networks (CNNs) have been widely used in surface defect detection. However, no CNN architecture is suitable for all detection tasks, and designing effective task-specific architectures requires considerable effort. The neural architecture search (NAS) technology makes it possible to automatically generate adaptive data-driven networks. Here, we propose a new method called NAS-ASDet to adaptively design networks for surface defect detection. First, a refined and industry-appropriate search space that can adaptively adjust the feature distribution is designed, which consists of repeatedly stacked basic novel cells with searchable attention operations. Then, a progressive search strategy with a deep supervision mechanism is used to explore the search space faster and better. This method can design high-performance and lightweight defect detection networks with data scarcity in industrial scenarios. The experimental results on four datasets demonstrate that the proposed method achieves superior performance and a relatively lighter model size compared to other competitive methods, including both manual and NAS-based approaches.

Single-shot Phase Retrieval from a Fractional Fourier Transform Perspective

  • paper_url: http://arxiv.org/abs/2311.10950
  • repo_url: None
  • paper_authors: Yixiao Yang, Ran Tao, Kaixuan Wei, Jun Shi
  • for: Classical phase retrieval, i.e., recovering a signal from intensity-only Fourier-magnitude measurements.
  • methods: A fractional Fourier transform (FrFT)-based physical measurement model is integrated within a self-supervised reconstruction scheme.
  • results: Enables single-shot phase retrieval and recovers high-quality amplitude and phase images from a single intensity measurement.
    Abstract The realm of classical phase retrieval concerns itself with the arduous task of recovering a signal from its Fourier magnitude measurements, which are fraught with inherent ambiguities. A single-exposure intensity measurement is commonly deemed insufficient for the reconstruction of the primal signal, given that the absent phase component is imperative for the inverse transformation. In this work, we present a novel single-shot phase retrieval paradigm from a fractional Fourier transform (FrFT) perspective, which involves integrating the FrFT-based physical measurement model within a self-supervised reconstruction scheme. Specifically, the proposed FrFT-based measurement model addresses the aliasing artifacts problem in the numerical calculation of Fresnel diffraction, featuring adaptability to both short-distance and long-distance propagation scenarios. Moreover, the intensity measurement in the FrFT domain proves highly effective in alleviating the ambiguities of phase retrieval and relaxing the previous conditions on oversampled or multiple measurements in the Fourier domain. Furthermore, the proposed self-supervised reconstruction approach harnesses the fast discrete algorithm of FrFT alongside untrained neural network priors, thereby attaining preeminent results. Through numerical simulations, we demonstrate that both amplitude and phase objects can be effectively retrieved from a single-shot intensity measurement using the proposed approach and provide a promising technique for support-free coherent diffraction imaging.
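    The FrFT-based measurement model can be sketched as follows; the notation is assumed for illustration ($u$: complex object field, $a$: fractional order, $K_a$: the standard FrFT kernel) and is not taken verbatim from the paper:

```latex
% Single-shot measurement: the sensor records only the intensity of the
% order-a fractional Fourier transform of the field; reconstruction seeks u.
I(\xi) \;=\; \bigl\lvert \mathcal{F}^{a}\{u\}(\xi) \bigr\rvert^{2},
\qquad
\mathcal{F}^{a}\{u\}(\xi) \;=\; \int_{-\infty}^{\infty} K_{a}(\xi, x)\, u(x)\, dx
```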

Jenga Stacking Based on 6D Pose Estimation for Architectural Form Finding Process

  • paper_url: http://arxiv.org/abs/2311.10918
  • repo_url: None
  • paper_authors: Zixun Huang
  • for: This paper reviews the current state of the art in 6D pose estimation and discusses which pose estimation method should be used in two types of architectural design scenarios.
  • methods: Taking the latest pose estimation research Gen6D as an example, it makes a qualitative assessment of current open-set methods in terms of application level, prediction speed, resistance to occlusion, accuracy, and resistance to environmental interference.
  • results: The assessment identifies remaining challenges in handling occlusion and environmental interference; by combining 6D pose estimation with building wind-environment assessment, a tangible architectural design approach is proposed, along with directions in which 6D pose estimation should progress in this scenario.
    Abstract This paper includes a review of current state-of-the-art 6D pose estimation methods, as well as a discussion of which pose estimation method should be used in two types of architectural design scenarios. Taking the latest pose estimation research Gen6D as an example, we make a qualitative assessment of the current open-set methods in terms of application level, prediction speed, resistance to occlusion, accuracy, resistance to environmental interference, etc. In addition, we try to combine 6D pose estimation and building wind-environment assessment to create a tangible architectural design approach; we discuss the limitations of the method and point out the direction in which 6D pose estimation needs to progress in this scenario.