cs.CV - 2023-12-01

Consistent Mesh Diffusion

  • paper_url: http://arxiv.org/abs/2312.00971
  • repo_url: None
  • paper_authors: Julian Knodt, Xifeng Gao
  • for: Generating texture images for 3D meshes from text prompts.
  • methods: Uses a single Depth-to-Image diffusion network, unifying the diffusion paths of multiple 2D views and lifting them to 3D with MultiDiffusion so that rendering on the 3D surface yields a single consistent texture.
  • results: Takes roughly 5 minutes per mesh on a 30-mesh dataset; evaluation with CLIP-score and Frechet Inception Distance (FID) shows an improvement over prior work.
    Abstract Given a 3D mesh with a UV parameterization, we introduce a novel approach to generating textures from text prompts. While prior work uses optimization from Text-to-Image Diffusion models to generate textures and geometry, this is slow and requires significant compute resources. Alternatively, there are projection-based approaches that use the same Text-to-Image models to paint images onto a mesh, but lack consistency at different viewing angles. We propose a method that uses a single Depth-to-Image diffusion network and generates a single consistent texture when rendered on the 3D surface, by first unifying multiple 2D images' diffusion paths and hoisting that to 3D with MultiDiffusion~\cite{multidiffusion}. We demonstrate our approach on a dataset containing 30 meshes, taking approximately 5 minutes per mesh. To evaluate our approach, we use CLIP-score~\cite{clipscore} and Frechet Inception Distance (FID)~\cite{frechet} on the renderings, and show our improvement over prior work.
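To make the evaluation protocol above concrete, here is a minimal sketch of CLIP-score computed over rendered views of a textured mesh, using the Hugging Face transformers CLIP implementation. The model choice, the 2.5 scaling factor from the original CLIP-score paper, and the rendering loop are assumptions, not details taken from this work.

```python
# Hypothetical evaluation sketch: score rendered views against the generating text prompt.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(images, prompt):
    """Mean CLIP-score of a list of PIL images against the prompt."""
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    cos = (img_emb @ txt_emb.T).squeeze(-1)        # one score per rendered view
    return (2.5 * cos.clamp(min=0)).mean().item()  # w * max(cos, 0), w = 2.5

# renders = [Image.open(p) for p in sorted(Path("renders").glob("*.png"))]
# print(clip_score(renders, "a wooden viking shield"))
```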

Improve Supervised Representation Learning with Masked Image Modeling

  • paper_url: http://arxiv.org/abs/2312.00950
  • repo_url: None
  • paper_authors: Kaifeng Chen, Daniel Salz, Huiwen Chang, Kihyuk Sohn, Dilip Krishnan, Mojtaba Seyedhosseini
  • for: Improving the quality of learned visual representations in computer vision.
  • methods: Integrates masked image modeling (MIM), a self-supervised objective, into existing supervised training by adding a shallow transformer-based decoder on top of the vision transformer image encoder.
  • results: Improves downstream classification, image retrieval, and semantic segmentation with minimal architecture change and no inference overhead; on ImageNet-1k, the ViT-B/14 model gains 2.01% validation accuracy over the baseline and 1.32% on K-Nearest-Neighbor image retrieval.
    Abstract Training visual embeddings with labeled data supervision has been the de facto setup for representation learning in computer vision. Inspired by recent success of adopting masked image modeling (MIM) in self-supervised representation learning, we propose a simple yet effective setup that can easily integrate MIM into existing supervised training paradigms. In our design, in addition to the original classification task applied to a vision transformer image encoder, we add a shallow transformer-based decoder on top of the encoder and introduce an MIM task which tries to reconstruct image tokens based on masked image inputs. We show with minimal change in architecture and no overhead in inference that this setup is able to improve the quality of the learned representations for downstream tasks such as classification, image retrieval, and semantic segmentation. We conduct a comprehensive study and evaluation of our setup on public benchmarks. On ImageNet-1k, our ViT-B/14 model achieves 81.72% validation accuracy, 2.01% higher than the baseline model. On K-Nearest-Neighbor image retrieval evaluation with ImageNet-1k, the same model outperforms the baseline by 1.32%. We also show that this setup can be easily scaled to larger models and datasets. Code and checkpoints will be released.
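A minimal sketch of the joint objective described above: the original classification loss on the encoder plus an MIM loss from a shallow decoder that reconstructs masked patches. The tiny layer sizes, pixel-level reconstruction target, masking ratio, and loss weight are illustrative assumptions; the paper's exact architecture and token targets may differ.

```python
# Illustrative joint supervised + MIM objective (toy sizes; not the paper's exact setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

B, patch, dim, n_cls = 8, 16, 192, 1000
img = torch.randn(B, 3, 224, 224)                     # dummy image batch
labels = torch.randint(0, n_cls, (B,))

# Patchify to (B, N, 3*patch*patch).
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, 3 * patch * patch)
N = patches.shape[1]

embed = nn.Linear(3 * patch * patch, dim)
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 4, 4 * dim, batch_first=True), 4)
decoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 4, 4 * dim, batch_first=True), 1)  # shallow
cls_head = nn.Linear(dim, n_cls)
recon_head = nn.Linear(dim, 3 * patch * patch)
mask_token = nn.Parameter(torch.zeros(1, 1, dim))

tokens = embed(patches)

# 1) Original supervised classification branch (unmasked input).
loss_cls = F.cross_entropy(cls_head(encoder(tokens).mean(dim=1)), labels)

# 2) Added MIM branch: reconstruct masked patches from a masked view of the same image.
mask = torch.rand(B, N) < 0.4                          # masking ratio is an assumption
masked = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, dim), tokens)
recon = recon_head(decoder(encoder(masked)))
loss_mim = F.mse_loss(recon[mask], patches[mask])      # pixel targets here; token targets also possible

loss = loss_cls + 1.0 * loss_mim                       # loss weight is an assumption
loss.backward()
```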

Object 6D pose estimation meets zero-shot learning

  • paper_url: http://arxiv.org/abs/2312.00947
  • repo_url: None
  • paper_authors: Andrea Caraffa, Davide Boscaini, Amir Hamza, Fabio Poiesi
  • for: Improving the accuracy of zero-shot object 6D pose estimation.
  • methods: combines geometric descriptors learned from point cloud data with visual features learned from large-scale web images to produce distinctive 3D point-level descriptors
  • results: outperforms all state-of-the-art zero-shot object 6D pose estimation approaches and ranks first in the BOP Benchmark under the category Task 4: 6D localization of unseen objects
    Abstract Object 6D pose estimation methods can achieve high accuracy when trained and tested on the same objects. However, estimating the pose of objects that are absent at training time is still a challenge. In this work, we advance the state-of-the-art in zero-shot object 6D pose estimation by proposing the first method that fuses the contribution of pre-trained geometric and vision foundation models. Unlike state-of-the-art approaches that train their pipeline on data specifically crafted for the 6D pose estimation task, our method does not require task-specific finetuning. Instead, our method, which we name PoMZ, combines geometric descriptors learned from point cloud data with visual features learned from large-scale web images to produce distinctive 3D point-level descriptors. By applying an off-the-shelf registration algorithm, like RANSAC, PoMZ outperforms all state-of-the-art zero-shot object 6D pose estimation approaches. We extensively evaluate PoMZ across the seven core datasets of the BOP Benchmark, encompassing over a hundred objects and 20 thousand images captured in diverse scenarios. PoMZ ranks first in the BOP Benchmark under the category Task 4: 6D localization of unseen objects. We will release the source code publicly.
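A rough sketch of the pipeline described above: per-point geometric and visual descriptors are L2-normalized and fused, matched by mutual nearest neighbor, and a rigid 6D pose is recovered with a small RANSAC loop around a Kabsch fit. The array layouts, thresholds, and the plain NumPy registration are illustrative stand-ins for the off-the-shelf components the paper uses.

```python
# Hypothetical zero-shot pose sketch: fuse descriptors, match, register with RANSAC + Kabsch.
import numpy as np

def l2norm(x):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

def kabsch(P, Q):
    """Rigid transform (R, t) aligning P -> Q (both N x 3)."""
    cP, cQ = P.mean(0), Q.mean(0)
    H = (P - cP).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

def estimate_pose(pts_obj, geo_obj, vis_obj, pts_scene, geo_scene, vis_scene,
                  iters=500, thresh=0.01, rng=np.random.default_rng(0)):
    # 1) Fuse geometric and visual descriptors per 3D point.
    d_obj = np.hstack([l2norm(geo_obj), l2norm(vis_obj)])
    d_scene = np.hstack([l2norm(geo_scene), l2norm(vis_scene)])
    # 2) Mutual nearest-neighbor matching in descriptor space.
    sim = d_obj @ d_scene.T
    fwd, bwd = sim.argmax(1), sim.argmax(0)
    idx_obj = np.where(bwd[fwd] == np.arange(len(d_obj)))[0]
    P, Q = pts_obj[idx_obj], pts_scene[fwd[idx_obj]]
    # 3) RANSAC over 3-point samples, refit on the best inlier set.
    best, best_inl = (np.eye(3), np.zeros(3)), np.zeros(len(P), bool)
    for _ in range(iters):
        s = rng.choice(len(P), 3, replace=False)
        R, t = kabsch(P[s], Q[s])
        inl = np.linalg.norm((P @ R.T + t) - Q, axis=1) < thresh
        if inl.sum() > best_inl.sum():
            best, best_inl = (R, t), inl
    if best_inl.sum() >= 3:
        best = kabsch(P[best_inl], Q[best_inl])
    return best  # (R, t): the object's pose relative to the scene points
```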

Enhancing Diffusion Models with 3D Perspective Geometry Constraints

  • paper_url: http://arxiv.org/abs/2312.00944
  • repo_url: None
  • paper_authors: Rishi Upadhyay, Howard Zhang, Yunhao Ba, Ethan Yang, Blake Gella, Sicheng Jiang, Alex Wong, Achuta Kadambi
  • for: Improving perspective accuracy in image synthesis methods.
  • methods: Introduces a geometric constraint into the training of generative models to enforce perspective accuracy.
  • results: Models trained with the constraint produce more realistic images (preferred over Stable Diffusion V2 70% of the time) and improve downstream depth estimation models fine-tuned on the generated images.
    Abstract While perspective is a well-studied topic in art, it is generally taken for granted in images. However, for the recent wave of high-quality image synthesis methods such as latent diffusion models, perspective accuracy is not an explicit requirement. Since these methods are capable of outputting a wide gamut of possible images, it is difficult for these synthesized images to adhere to the principles of linear perspective. We introduce a novel geometric constraint in the training process of generative models to enforce perspective accuracy. We show that outputs of models trained with this constraint both appear more realistic and improve performance of downstream models trained on generated images. Subjective human trials show that images generated with latent diffusion models trained with our constraint are preferred over images from the Stable Diffusion V2 model 70% of the time. SOTA monocular depth estimation models such as DPT and PixelFormer, fine-tuned on our images, outperform the original models trained on real images by up to 7.03% in RMSE and 19.3% in SqRel on the KITTI test set for zero-shot transfer.

Zero-Shot Video Question Answering with Procedural Programs

  • paper_url: http://arxiv.org/abs/2312.00937
  • repo_url: None
  • paper_authors: Rohan Choudhury, Koichiro Niinuma, Kris M. Kitani, László A. Jeni
  • for: answers video questions without requiring pre-trained models or task-specific fine-tuning.
  • methods: uses a large language model to generate short procedural programs that solve a sequence of visual subtasks, and executes them to obtain the output.
  • results: achieves state-of-the-art results on a diverse range of benchmarks, with improvements of up to 25% on short, long, open-ended, and multimodal video question-answering datasets.
    Abstract We propose to answer zero-shot questions about videos by generating short procedural programs that derive a final answer from solving a sequence of visual subtasks. We present Procedural Video Querying (ProViQ), which uses a large language model to generate such programs from an input question and an API of visual modules in the prompt, then executes them to obtain the output. Recent similar procedural approaches have proven successful for image question answering, but videos remain challenging: we provide ProViQ with modules intended for video understanding, allowing it to generalize to a wide variety of videos. This code generation framework additionally enables ProViQ to perform other video tasks in addition to question answering, such as multi-object tracking or basic video editing. ProViQ achieves state-of-the-art results on a diverse range of benchmarks, with improvements of up to 25% on short, long, open-ended, and multimodal video question-answering datasets. Our project page is at https://rccchoudhury.github.io/proviq2023.
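The generate-then-execute loop described above can be illustrated roughly as follows: a prompt exposes an API of visual modules to a language model, and the returned program is run against the video. The module names, the prompt, and the hard-coded "generated" program below are hypothetical placeholders; only the overall pattern follows the paper.

```python
# Hypothetical sketch of procedural video querying: generate a program from an API prompt, then run it.
def get_frames(video, stride=30):   # stub: uniformly sampled frames
    return video[::stride]

def detect(frame, query):           # stub: would return boxes for objects matching `query`
    return []

def caption(frame):                 # stub: would return a description of the frame
    return "a frame"

API_DOC = """
get_frames(video, stride) -> list of frames
detect(frame, query) -> list of boxes for objects matching `query`
caption(frame) -> natural-language description of the frame
"""
PROMPT = "Write a Python function answer(video) that answers: '{question}'.\nYou may only call:\n" + API_DOC

# In a real system: generated_program = llm(PROMPT.format(question="Does a dog appear in the video?"))
generated_program = """
def answer(video):
    for frame in get_frames(video, stride=30):
        if len(detect(frame, "dog")) > 0:
            return "yes"
    return "no"
"""

namespace = {"get_frames": get_frames, "detect": detect, "caption": caption}
exec(generated_program, namespace)              # execute the generated procedural program
print(namespace["answer"](list(range(300))))    # dummy "video" of 300 frame indices
```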

Label Delay in Continual Learning

  • paper_url: http://arxiv.org/abs/2312.00923
  • repo_url: None
  • paper_authors: Botos Csaba, Wenxuan Zhang, Matthias Müller, Ser-Nam Lim, Mohamed Elhoseiny, Philip Torr, Adel Bibi
  • for: Addressing label delay in online continual learning, where new data may arrive unlabeled because annotation is slow and costly.
  • methods: Introduces a continual learning framework that, at each time step, reveals unlabeled data from the current step $t$ together with labels delayed by $d$ steps (from step $t-d$); a simple, efficient baseline rehearses the labeled memory samples most similar to the new unlabeled samples.
  • results: Experiments totaling 1060 GPU days show that adding compute alone does not solve the problem and that relying solely on the delayed labeled stream degrades markedly as the delay grows; the proposed rehearsal baseline is the least affected by label delay and in some cases recovers the accuracy of the non-delayed counterpart.
    Abstract Online continual learning, the process of training models on streaming data, has gained increasing attention in recent years. However, a critical aspect often overlooked is the label delay, where new data may not be labeled due to slow and costly annotation processes. We introduce a new continual learning framework with explicit modeling of the label delay between data and label streams over time steps. In each step, the framework reveals both unlabeled data from the current time step $t$ and labels delayed with $d$ steps, from the time step $t-d$. In our extensive experiments amounting to 1060 GPU days, we show that merely augmenting the computational resources is insufficient to tackle this challenge. Our findings underline a notable performance decline when solely relying on labeled data when the label delay becomes significant. More surprisingly, when using state-of-the-art SSL and TTA techniques to utilize the newer, unlabeled data, they fail to surpass the performance of a na\"ive method that simply trains on the delayed supervised stream. To this end, we introduce a simple, efficient baseline that rehearses from the labeled memory samples that are most similar to the new unlabeled samples. This method bridges the accuracy gap caused by label delay without significantly increasing computational complexity. We show experimentally that our method is the least affected by the label delay factor and in some cases successfully recovers the accuracy of the non-delayed counterpart. We conduct various ablations and sensitivity experiments, demonstrating the effectiveness of our approach.
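The rehearsal baseline described above (training on the labeled memory samples most similar to the new unlabeled batch) might look roughly like this. The feature extractor, buffer layout, and cosine-similarity selection rule are assumptions for illustration; the paper's exact criterion may differ.

```python
# Illustrative rehearsal step: pick labeled memory samples closest to the new unlabeled batch.
import torch
import torch.nn.functional as F

def rehearsal_batch(model, memory_x, memory_y, unlabeled_x, k=32):
    """Return the k labeled memory samples most similar to the current unlabeled data."""
    with torch.no_grad():
        f_mem = F.normalize(model(memory_x), dim=1)      # features of labeled memory
        f_new = F.normalize(model(unlabeled_x), dim=1)   # features of new unlabeled data
    sim = f_new @ f_mem.T                                # (B_new, B_mem) cosine similarities
    top = sim.max(dim=0).values.topk(min(k, len(memory_x))).indices
    return memory_x[top], memory_y[top]

# Toy usage with a linear "feature extractor" and classifier on random data.
feat = torch.nn.Linear(16, 8)
clf = torch.nn.Linear(16, 10)
mem_x, mem_y = torch.randn(500, 16), torch.randint(0, 10, (500,))
x_r, y_r = rehearsal_batch(feat, mem_x, mem_y, torch.randn(64, 16))
loss = F.cross_entropy(clf(x_r), y_r)   # supervised update uses only the rehearsed labeled samples
loss.backward()
```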

Segment and Caption Anything

  • paper_url: http://arxiv.org/abs/2312.00869
  • repo_url: https://github.com/IDEA-Research/Grounded-Segment-Anything
  • paper_authors: Xiaoke Huang, Jianfeng Wang, Yansong Tang, Zheng Zhang, Han Hu, Jiwen Lu, Lijuan Wang, Zicheng Liu
  • for: Efficiently and scalably equipping the Segment Anything Model (SAM) with the ability to generate regional captions.
  • methods: A lightweight query-based feature mixer aligns region-specific features with the embedding space of a language model for subsequent caption generation.
  • results: Weak-supervision pretraining on object detection and segmentation tasks (category names only), followed by extensive experiments validating each design choice, shows the method outperforms existing approaches and offers a path to scaling up regional captioning data.
    Abstract We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability to generate regional captions. SAM presents strong generalizability to segment anything while falling short on semantic understanding. By introducing a lightweight query-based feature mixer, we align the region-specific features with the embedding space of language models for later caption generation. As the number of trainable parameters is small (typically in the order of tens of millions), it costs less computation, less memory usage, and less communication bandwidth, resulting in both fast and scalable training. To address the scarcity problem of regional caption data, we propose to first pre-train our model on objection detection and segmentation tasks. We call this step weak supervision pretraining since the pre-training data only contains category names instead of full-sentence descriptions. The weak supervision pretraining allows us to leverage many publicly available object detection and segmentation datasets. We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice. This work serves as a stepping stone towards scaling up regional captioning data and sheds light on exploring efficient ways to augment SAM with regional semantics. The project page, along with the associated code, can be accessed via the following https://xk-huang.github.io/segment-caption-anything/.
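A minimal sketch of a query-based feature mixer of the kind described above: a small set of learnable queries cross-attends to region features (e.g., SAM encoder tokens pooled under a mask) and is projected into the language model's embedding dimension. The layer sizes and single-layer design are illustrative assumptions, not the paper's exact module.

```python
# Illustrative query-based region feature mixer (assumed sizes).
import torch
import torch.nn as nn

class RegionFeatureMixer(nn.Module):
    def __init__(self, region_dim=256, lm_dim=768, n_queries=8, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, region_dim) * 0.02)
        self.attn = nn.MultiheadAttention(region_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(region_dim, lm_dim)   # align with the language model's embedding space

    def forward(self, region_feats):
        """region_feats: (B, N, region_dim) features gathered inside a predicted mask."""
        q = self.queries.expand(region_feats.size(0), -1, -1)
        mixed, _ = self.attn(q, region_feats, region_feats)   # queries attend to region tokens
        return self.proj(mixed)                               # (B, n_queries, lm_dim) soft prompt tokens

mixer = RegionFeatureMixer()
region_tokens = torch.randn(2, 196, 256)   # e.g., image-encoder tokens under a region mask
lm_prompt = mixer(region_tokens)           # would be fed to a language model for caption generation
print(lm_prompt.shape)                     # torch.Size([2, 8, 768])
```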

Dense Optical Tracking: Connecting the Dots

  • paper_url: http://arxiv.org/abs/2312.00786
  • repo_url: None
  • paper_authors: Guillaume Le Moing, Jean Ponce, Cordelia Schmid
  • for: A simple, efficient point-tracking method that recovers the trajectory of every point seen in a frame of a video, even through occlusions.
  • methods: Extracts a small set of tracks from key regions at motion boundaries with an off-the-shelf point tracker; given source and target frames, computes rough initial estimates of a dense flow field and visibility mask via nearest-neighbor interpolation, then refines them with a learnable optical flow estimator that explicitly handles occlusions and can be trained on synthetic data with ground-truth correspondences.
  • results: More accurate than current optical flow techniques, outperforms "universal" trackers such as OmniMotion, and is on par with or better than the best point trackers such as CoTracker while being at least two orders of magnitude faster; quantitative and qualitative experiments on synthetic and real videos confirm the approach. Code, data, and videos are available on the project page: https://16lemoing.github.io/dot
    Abstract Recent approaches to point tracking are able to recover the trajectory of any scene point through a large portion of a video despite the presence of occlusions. They are, however, too slow in practice to track every point observed in a single frame in a reasonable amount of time. This paper introduces DOT, a novel, simple and efficient method for solving this problem. It first extracts a small set of tracks from key regions at motion boundaries using an off-the-shelf point tracking algorithm. Given source and target frames, DOT then computes rough initial estimates of a dense flow field and visibility mask through nearest-neighbor interpolation, before refining them using a learnable optical flow estimator that explicitly handles occlusions and can be trained on synthetic data with ground-truth correspondences. We show that DOT is significantly more accurate than current optical flow techniques, outperforms sophisticated "universal" trackers like OmniMotion, and is on par with, or better than, the best point tracking algorithms like CoTracker while being at least two orders of magnitude faster. Quantitative and qualitative experiments with synthetic and real videos validate the promise of the proposed approach. Code, data, and videos showcasing the capabilities of our approach are available in the project webpage: https://16lemoing.github.io/dot .
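The coarse initialization step described above (nearest-neighbor interpolation of a few sparse tracks into a dense flow field and visibility mask, before the learned refinement) can be sketched as below. The KD-tree lookup and array conventions are assumptions for illustration.

```python
# Illustrative dense-flow initialization from sparse tracks via nearest-neighbor interpolation.
import numpy as np
from scipy.spatial import cKDTree

def init_dense_flow(track_xy, track_flow, track_vis, height, width):
    """
    track_xy:   (K, 2) track positions (x, y) in the source frame
    track_flow: (K, 2) displacement of each track from source to target frame
    track_vis:  (K,)   1 if the track is visible in the target frame, else 0
    Returns a dense (H, W, 2) flow field and an (H, W) visibility mask.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([xs.ravel(), ys.ravel()], axis=1)   # every pixel of the source frame
    _, nearest = cKDTree(track_xy).query(pixels)          # index of the closest sparse track
    flow = track_flow[nearest].reshape(height, width, 2)
    vis = track_vis[nearest].reshape(height, width)
    return flow, vis                                      # rough estimates, refined by a learned model

# Toy usage: 200 random tracks on a 120 x 160 frame.
rng = np.random.default_rng(0)
xy = rng.uniform([0, 0], [160, 120], size=(200, 2))
flow, vis = init_dense_flow(xy, rng.normal(size=(200, 2)), rng.integers(0, 2, 200), 120, 160)
print(flow.shape, vis.shape)   # (120, 160, 2) (120, 160)
```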

Sequential Modeling Enables Scalable Learning for Large Vision Models

  • paper_url: http://arxiv.org/abs/2312.00785
  • repo_url: https://github.com/ytongbai/LVM
  • paper_authors: Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A Efros
  • for: A sequential modeling approach for learning a Large Vision Model (LVM) without using any linguistic data.
  • methods: Defines a common format, "visual sentences", that represents raw images and videos as well as annotated sources such as semantic segmentations and depth reconstructions without any meta-knowledge beyond the pixels; the model is trained to minimize a cross-entropy loss for next-token prediction.
  • results: Training across various model scales and data diversity (420 billion tokens) provides empirical evidence of effective scaling; many vision tasks can be solved by designing suitable visual prompts at test time.
    Abstract We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time.
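A minimal sketch of the training objective described above: a visual sentence is a sequence of discrete visual tokens (e.g., from a VQ tokenizer, stubbed here with random ids), and a causal transformer is trained with cross-entropy on next-token prediction. Model and vocabulary sizes are toy assumptions.

```python
# Illustrative next-visual-token training step (toy sizes; the visual tokenizer is stubbed).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim, seq_len = 8192, 256, 64             # e.g., VQ codebook size, model width, tokens per sentence
tokens = torch.randint(0, vocab, (4, seq_len))  # stub for tokenized "visual sentences"

embed = nn.Embedding(vocab, dim)
layer = nn.TransformerEncoderLayer(dim, 8, 4 * dim, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=4)
head = nn.Linear(dim, vocab)

causal = nn.Transformer.generate_square_subsequent_mask(seq_len)   # no peeking at future tokens
hidden = backbone(embed(tokens), mask=causal)
logits = head(hidden)

# Predict token t+1 from tokens <= t, exactly as in language modeling.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab), tokens[:, 1:].reshape(-1))
loss.backward()
print(float(loss))
```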

MorpheuS: Neural Dynamic 360° Surface Reconstruction from Monocular RGB-D Video

  • paper_url: http://arxiv.org/abs/2312.00778
  • repo_url: None
  • paper_authors: Hengyi Wang, Jingwen Wang, Lourdes Agapito
  • for: Dynamic 360° surface reconstruction of a deformable scene from a casually captured monocular RGB-D video.
  • methods: Models the target scene as a canonical field encoding geometry and appearance, together with a deformation field that warps points from the current frame into the canonical space; a view-dependent diffusion prior is distilled to realistically complete unobserved regions.
  • results: Achieves high-fidelity 360° surface reconstruction of deformable objects from monocular RGB-D video on various real-world and synthetic datasets, despite large unobserved regions.
    Abstract Neural rendering has demonstrated remarkable success in dynamic scene reconstruction. Thanks to the expressiveness of neural representations, prior works can accurately capture the motion and achieve high-fidelity reconstruction of the target object. Despite this, real-world video scenarios often feature large unobserved regions where neural representations struggle to achieve realistic completion. To tackle this challenge, we introduce MorpheuS, a framework for dynamic 360{\deg} surface reconstruction from a casually captured RGB-D video. Our approach models the target scene as a canonical field that encodes its geometry and appearance, in conjunction with a deformation field that warps points from the current frame to the canonical space. We leverage a view-dependent diffusion prior and distill knowledge from it to achieve realistic completion of unobserved regions. Experimental results on various real-world and synthetic datasets show that our method can achieve high-fidelity 360{\deg} surface reconstruction of a deformable object from a monocular RGB-D video.

VideoBooth: Diffusion-based Video Generation with Image Prompts

  • paper_url: http://arxiv.org/abs/2312.00777
  • repo_url: None
  • paper_authors: Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, Ziwei Liu
  • for: Video generation with image prompts, which control the appearance of the generated subject more accurately and directly than text prompts alone.
  • methods: A feed-forward framework, VideoBooth, with two dedicated designs: 1) image prompts are embedded in a coarse-to-fine manner, and 2) at the fine level, multi-scale image prompts are fed into cross-frame attention layers as additional keys and values to refine details and maintain temporal consistency.
  • results: Experiments show that VideoBooth achieves state-of-the-art performance in generating customized high-quality videos with subjects specified by image prompts, with a single model handling a wide range of image prompts in a feed-forward pass.
    Abstract Text-driven video generation witnesses rapid progress. However, merely using text prompts is not enough to depict the desired subject appearance that accurately aligns with users' intents, especially for customized content creation. In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control beyond the text prompts. Specifically, we propose a feed-forward framework VideoBooth, with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from image encoder provide high-level encodings of image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encoding of image prompts. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial information refines the details in the first frame and then it is propagated to the remaining frames, which maintains temporal consistency. Extensive experiments demonstrate that VideoBooth achieves state-of-the-art performance in generating customized high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework where a single model works for a wide range of image prompts with feed-forward pass.

Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans

  • paper_url: http://arxiv.org/abs/2312.00775
  • repo_url: None
  • paper_authors: Homanga Bharadhwaj, Abhinav Gupta, Vikash Kumar, Shubham Tulsiani
  • for: Developing robots that can interact zero-shot with generic unseen objects via a diverse repertoire of manipulation skills.
  • methods: A factorized approach that leverages large-scale passive human videos to learn how a human would accomplish a task (a human plan), then translates that plan to the robot embodiment with a plan-conditioned manipulation policy.
  • results: The learned system performs over 16 manipulation skills that generalize to 40 objects, encompassing 100 real-world tasks for table-top and in-the-wild manipulation, with no deployment-time training.
    Abstract We pursue the goal of developing robots that can interact zero-shot with generic unseen objects via a diverse repertoire of manipulation skills and show how passive human videos can serve as a rich source of data for learning such generalist robots. Unlike typical robot learning approaches which directly learn how a robot should act from interaction data, we adopt a factorized approach that can leverage large-scale human videos to learn how a human would accomplish a desired task (a human plan), followed by translating this plan to the robots embodiment. Specifically, we learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations. We combine this with a translation module that learns a plan-conditioned robot manipulation policy, and allows following humans plans for generic manipulation tasks in a zero-shot manner with no deployment-time training. Importantly, while the plan predictor can leverage large-scale human videos for learning, the translation module only requires a small amount of in-domain data, and can generalize to tasks not seen during training. We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects, encompassing 100 real-world tasks for table-top manipulation and diverse in-the-wild manipulation. https://homangab.github.io/hopman/

EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything

  • paper_url: http://arxiv.org/abs/2312.00863
  • repo_url: https://github.com/yformer/EfficientSAM
  • paper_authors: Yunyang Xiong, Bala Varadarajan, Lemeng Wu, Xiaoyu Xiang, Fanyi Xiao, Chenchen Zhu, Xiaoliang Dai, Dilin Wang, Fei Sun, Forrest Iandola, Raghuraman Krishnamoorthi, Vikas Chandra
  • for: Making the Segment Anything Model (SAM) efficient enough for a much wider range of real-world applications.
  • methods: SAM-leveraged masked image pretraining (SAMI) trains lightweight image encoders to reconstruct features from the SAM image encoder; the SAMI-pretrained encoders are combined with a mask decoder to build EfficientSAMs, which are finetuned on SA-1B for the segment-anything task.
  • results: SAMI consistently outperforms other masked image pretraining methods across image classification, object detection, instance segmentation, and semantic object detection; on zero-shot instance segmentation, EfficientSAMs gain ~4 AP on COCO/LVIS over other fast SAM models.
    Abstract Segment Anything Model (SAM) has emerged as a powerful tool for numerous vision applications. A key component that drives the impressive performance for zero-shot transfer and high versatility is a super large Transformer model trained on the extensive high-quality SA-1B dataset. While beneficial, the huge computation cost of SAM model has limited its applications to wider real-world applications. To address this limitation, we propose EfficientSAMs, light-weight SAM models that exhibits decent performance with largely reduced complexity. Our idea is based on leveraging masked image pretraining, SAMI, which learns to reconstruct features from SAM image encoder for effective visual representation learning. Further, we take SAMI-pretrained light-weight image encoders and mask decoder to build EfficientSAMs, and finetune the models on SA-1B for segment anything task. We perform evaluations on multiple vision tasks including image classification, object detection, instance segmentation, and semantic object detection, and find that our proposed pretraining method, SAMI, consistently outperforms other masked image pretraining methods. On segment anything task such as zero-shot instance segmentation, our EfficientSAMs with SAMI-pretrained lightweight image encoders perform favorably with a significant gain (e.g., ~4 AP on COCO/LVIS) over other fast SAM models.
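A rough sketch of the SAMI pretraining idea described above: a lightweight encoder sees a masked input and, through a small decoder, is trained to reconstruct the features produced by the frozen SAM image encoder on the full input. Both encoders are stubbed with tiny modules; the masking ratio and MSE target are assumptions.

```python
# Illustrative SAMI-style pretraining step: reconstruct frozen teacher features from masked inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_tokens, dim = 196, 256                       # toy token grid and width

teacher = nn.Linear(dim, dim)                  # stand-in for the frozen SAM image encoder
for p in teacher.parameters():
    p.requires_grad_(False)

student = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 8, 4 * dim, batch_first=True), 2)
decoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 8, 4 * dim, batch_first=True), 1)
mask_token = nn.Parameter(torch.zeros(1, 1, dim))

x = torch.randn(4, n_tokens, dim)              # stub patch embeddings of an image batch
with torch.no_grad():
    target = teacher(x)                        # teacher features of the *unmasked* input

mask = torch.rand(4, n_tokens) < 0.75          # high masking ratio, as in MAE-style pretraining
x_masked = torch.where(mask.unsqueeze(-1), mask_token.expand_as(x), x)

recon = decoder(student(x_masked))             # lightweight encoder + small decoder
loss = F.mse_loss(recon[mask], target[mask])   # match teacher features at masked positions
loss.backward()
```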

Adversarial Score Distillation: When score distillation meets GAN

  • paper_url: http://arxiv.org/abs/2312.00739
  • repo_url: https://github.com/2y7c3/asd
  • paper_authors: Min Wei, Jingkai Zhou, Junyao Sun, Xuesong Zhang
  • for: Explaining and analyzing the deficiencies of existing score distillation methods and proposing a new score distillation approach that resolves them.
  • methods: Revisits the derivation of Score Distillation Sampling (SDS) under the Wasserstein GAN (WGAN) paradigm, finding that existing score distillation either uses a fixed sub-optimal discriminator or performs incomplete discriminator optimization; the proposed Adversarial Score Distillation (ASD) maintains an optimizable discriminator and updates it with the complete optimization objective.
  • results: ASD performs favorably against existing methods on 2D distillation and text-to-3D tasks; extended to image editing to probe the generality of the WGAN paradigm, it achieves competitive results.
    Abstract Existing score distillation methods are sensitive to classifier-free guidance (CFG) scale: manifested as over-smoothness or instability at small CFG scales, while over-saturation at large ones. To explain and analyze these issues, we revisit the derivation of Score Distillation Sampling (SDS) and decipher existing score distillation with the Wasserstein Generative Adversarial Network (WGAN) paradigm. With the WGAN paradigm, we find that existing score distillation either employs a fixed sub-optimal discriminator or conducts incomplete discriminator optimization, resulting in the scale-sensitive issue. We propose the Adversarial Score Distillation (ASD), which maintains an optimizable discriminator and updates it using the complete optimization objective. Experiments show that the proposed ASD performs favorably in 2D distillation and text-to-3D tasks against existing methods. Furthermore, to explore the generalization ability of our WGAN paradigm, we extend ASD to the image editing task, which achieves competitive results. The project page and code are at https://github.com/2y7c3/ASD.

Segment Any 3D Gaussians

  • paper_url: http://arxiv.org/abs/2312.00860
  • repo_url: https://github.com/Jumpat/SegAnyGAussians
  • paper_authors: Jiazhong Cen, Jiemin Fang, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian
  • for: Fine-grained, fast interactive 3D segmentation for 3D scene understanding and manipulation.
  • methods: Blends a 2D segmentation foundation model with 3D Gaussian Splatting (3DGS), embedding multi-granularity 2D segmentation results into 3D Gaussian point features through well-designed contrastive training.
  • results: Competitive with state-of-the-art methods, supports multi-granularity segmentation and various prompts (points, scribbles, and 2D masks), and completes 3D segmentation within milliseconds, a nearly 1000x speedup over the previous state of the art. See https://jumpat.github.io/SAGA.
    Abstract Interactive 3D segmentation in radiance fields is an appealing task since its importance in 3D scene understanding and manipulation. However, existing methods face challenges in either achieving fine-grained, multi-granularity segmentation or contending with substantial computational overhead, inhibiting real-time interaction. In this paper, we introduce Segment Any 3D GAussians (SAGA), a novel 3D interactive segmentation approach that seamlessly blends a 2D segmentation foundation model with 3D Gaussian Splatting (3DGS), a recent breakthrough of radiance fields. SAGA efficiently embeds multi-granularity 2D segmentation results generated by the segmentation foundation model into 3D Gaussian point features through well-designed contrastive training. Evaluation on existing benchmarks demonstrates that SAGA can achieve competitive performance with state-of-the-art methods. Moreover, SAGA achieves multi-granularity segmentation and accommodates various prompts, including points, scribbles, and 2D masks. Notably, SAGA can finish the 3D segmentation within milliseconds, achieving nearly 1000x acceleration compared to previous SOTA. The project page is at https://jumpat.github.io/SAGA.

PointBeV: A Sparse Approach to BeV Predictions

  • paper_url: http://arxiv.org/abs/2312.00703
  • repo_url: https://github.com/valeoai/pointbev
  • paper_authors: Loick Chambon, Eloi Zablocki, Mickael Chen, Florent Bartoccioni, Patrick Perez, Matthieu Cord
  • for: A new Bird's-eye View (BeV) segmentation model for sensor-data fusion and downstream driving tasks.
  • methods: Operates on sparse BeV cells instead of dense grids, giving precise control over memory usage and enabling long temporal contexts and memory-constrained platforms; an efficient two-pass training strategy focuses computation on regions of interest, and at inference the model can trade memory for performance and adapt to new use cases.
  • results: State-of-the-art results on nuScenes for vehicle, pedestrian, and lane segmentation, in both static and temporal settings, despite being trained solely with sparse signals.
    Abstract Bird's-eye View (BeV) representations have emerged as the de-facto shared space in driving applications, offering a unified space for sensor data fusion and supporting various downstream tasks. However, conventional models use grids with fixed resolution and range and face computational inefficiencies due to the uniform allocation of resources across all cells. To address this, we propose PointBeV, a novel sparse BeV segmentation model operating on sparse BeV cells instead of dense grids. This approach offers precise control over memory usage, enabling the use of long temporal contexts and accommodating memory-constrained platforms. PointBeV employs an efficient two-pass strategy for training, enabling focused computation on regions of interest. At inference time, it can be used with various memory/performance trade-offs and flexibly adjusts to new specific use cases. PointBeV achieves state-of-the-art results on the nuScenes dataset for vehicle, pedestrian, and lane segmentation, showcasing superior performance in static and temporal settings despite being trained solely with sparse signals. We will release our code along with two new efficient modules used in the architecture: Sparse Feature Pulling, designed for the effective extraction of features from images to BeV, and Submanifold Attention, which enables efficient temporal modeling. Our code is available at https://github.com/valeoai/PointBeV.

GIFT: Generative Interpretable Fine-Tuning Transformers

  • paper_url: http://arxiv.org/abs/2312.00700
  • repo_url: https://github.com/savadikarc/gift
  • paper_authors: Chinmay Savadikar, Xi Song, Tianfu Wu
  • for: Parameter-efficient fine-tuning of pretrained (often large) Transformer models on downstream tasks, with built-in interpretability.
  • methods: A deep parameter-residual learning method that applies parameter-efficient fine-tuning to the final projection (linear) layer of the multi-head self-attention, and learns to generate the fine-tuning parameters with a hyper-Transformer using the proposed Parameter-to-Cluster Attention (PaCa).
  • results: Significantly better performance than prior art on the VTAB benchmark and the fine-grained visual classification (FGVC) benchmark.
    Abstract We present GIFT (Generative Interpretable Fine-tuning Transformers) for fine-tuning pretrained (often large) Transformer models at downstream tasks in a parameter-efficient way with built-in interpretability. Our GIFT is a deep parameter-residual learning method, which addresses two problems in fine-tuning a pretrained Transformer model: Where to apply the parameter-efficient fine-tuning (PEFT) to be extremely lightweight yet sufficiently expressive, and How to learn the PEFT to better exploit the knowledge of the pretrained model in a direct way? For the former, we select the final projection (linear) layer in the multi-head self-attention of a Transformer model, and verify its effectiveness. For the latter, in contrast to the prior art that directly introduce new model parameters (often in low-rank approximation form) to be learned in fine-tuning with downstream data, we propose a method for learning to generate the fine-tuning parameters. Our GIFT is a hyper-Transformer which take as input the pretrained parameters of the projection layer to generate its fine-tuning parameters using a proposed Parameter-to-Cluster Attention (PaCa). The PaCa results in a simple clustering-based forward explainer that plays the role of semantic segmentation in testing. In experiments, our proposed GIFT is tested on the VTAB benchmark and the fine-grained visual classification (FGVC) benchmark. It obtains significantly better performance than the prior art. Our code is available at https://github.com/savadikarc/gift

Rethinking Detection Based Table Structure Recognition for Visually Rich Documents

  • paper_url: http://arxiv.org/abs/2312.00699
  • repo_url: None
  • paper_authors: Bin Xiao, Murat Simsek, Burak Kantarci, Ala Abu Alkheir
  • for: Table Structure Recognition (TSR): converting unstructured table images into structured formats such as HTML sequences using detection models plus rule-based post-processing.
  • methods: Revisits the limitations of detection-based TSR, compares two-stage and transformer-based detectors, and identifies the key design aspects of a successful two-stage detector: the multi-class (multi-label) problem definition, the aspect ratios used for anchor-box generation, and the feature generation of the backbone network; these aspects are improved with simple modifications to Cascade R-CNN.
  • results: Achieves state-of-the-art performance, improving the baseline Cascade R-CNN by 19.32%, 11.56%, and 14.77% in structure-only TEDS on the SciTSR, FinTabNet, and PubTables1M datasets, respectively.
    Abstract Table Structure Recognition (TSR) aims at transforming unstructured table images into structured formats, such as HTML sequences. One type of popular solution is using detection models to detect components of a table, such as columns and rows, then applying a rule-based post-processing method to convert detection results into HTML sequences. However, existing detection-based studies often have the following limitations. First, these studies usually pay more attention to improving the detection performance, which does not necessarily lead to better performance regarding cell-level metrics, such as TEDS. Second, some solutions over-simplify the problem and can miss some critical information. Lastly, even though some studies defined the problem to detect more components to provide as much information as other types of solutions, these studies ignore the fact this problem definition is a multi-label detection because row, projected row header and column header can share identical bounding boxes. Besides, there is often a performance gap between two-stage and transformer-based detection models regarding the structure-only TEDS, even though they have similar performance regarding the COCO metrics. Therefore, we revisit the limitations of existing detection-based solutions, compare two-stage and transformer-based detection models, and identify the key design aspects for the success of a two-stage detection model for the TSR task, including the multi-class problem definition, the aspect ratio for anchor box generation, and the feature generation of the backbone network. We applied simple methods to improve these aspects of the Cascade R-CNN model, achieved state-of-the-art performance, and improved the baseline Cascade R-CNN model by 19.32%, 11.56% and 14.77% regarding the structure-only TEDS on SciTSR, FinTabNet, and PubTables1M datasets.

Object Detector Differences when using Synthetic and Real Training Data

  • paper_url: http://arxiv.org/abs/2312.00694
  • repo_url: https://github.com/ljungqvistmartin/datasplits
  • paper_authors: Martin Georg Ljungqvist, Otto Nordander, Markus Skans, Arvid Mildner, Tony Liu, Pierre Nugues
  • for: Studying how training on synthetic data affects the layers of a neural-network object detector.
  • methods: Trains YOLOv3 on real and synthetic images of city environments and performs a layer-wise similarity analysis with Centered Kernel Alignment (CKA).
  • results: The largest similarity between detectors trained on real and synthetic data is in the early layers and the largest difference is in the head; freezing or not freezing the backbone makes no major difference in performance or similarity.
    Abstract To train well-performing generalizing neural networks, sufficiently large and diverse datasets are needed. Collecting data while adhering to privacy legislation becomes increasingly difficult and annotating these large datasets is both a resource-heavy and time-consuming task. An approach to overcome these difficulties is to use synthetic data since it is inherently scalable and can be automatically annotated. However, how training on synthetic data affects the layers of a neural network is still unclear. In this paper, we train the YOLOv3 object detector on real and synthetic images from city environments. We perform a similarity analysis using Centered Kernel Alignment (CKA) to explore the effects of training on synthetic data on a layer-wise basis. The analysis captures the architecture of the detector while showing both different and similar patterns between different models. With this similarity analysis we want to give insights on how training synthetic data affects each layer and to give a better understanding of the inner workings of complex neural networks. The results show that the largest similarity between a detector trained on real data and a detector trained on synthetic data was in the early layers, and the largest difference was in the head part. The results also show that no major difference in performance or similarity could be seen between frozen and unfrozen backbone.
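For reference, the linear Centered Kernel Alignment similarity used in the layer-wise analysis can be computed as below; X and Y are activation matrices (examples x features) collected from the same inputs passed through the two detectors. The toy data at the end is only for illustration.

```python
# Linear CKA between two layers' activations on the same n examples.
import numpy as np

def linear_cka(X, Y):
    """X: (n, d1), Y: (n, d2) activation matrices; returns similarity in [0, 1]."""
    X = X - X.mean(axis=0, keepdims=True)               # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2       # ||Y^T X||_F^2
    return hsic / (np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro"))

rng = np.random.default_rng(0)
A = rng.normal(size=(512, 128))                          # activations of one layer
Q, _ = np.linalg.qr(rng.normal(size=(128, 128)))
print(round(linear_cka(A, 2.0 * A @ Q), 3))              # same representation up to rotation/scale -> 1.0
print(round(linear_cka(A, rng.normal(size=(512, 64))), 3))  # unrelated activations -> much lower
```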

VisionaryVR: An Optical Simulation Tool for Evaluating and Optimizing Vision Correction Solutions in Virtual Reality

  • paper_url: http://arxiv.org/abs/2312.00692
  • repo_url: None
  • paper_authors: Benedikt W. Hosp, Martin Dechant, Yannick Sauer, Rajat Agarwala, Siegfried Wahl
  • for: A research tool for developing and evaluating vision science methods, which requires realistic simulation of real-world optical methods under high experimental control.
  • methods: A virtual reality (VR) simulation tool comprising an experiment controller, a generic eye-tracking controller that works with most common VR eye trackers, a configurable defocus simulator, and a generic VR questionnaire loader for assessing participant behavior.
  • results: Bridges the gap between theoretical and applied research on new optical methods, corrections, and therapies, giving vision scientists a robust, realistic, and fast research environment.
    Abstract Developing and evaluating vision science methods require robust and efficient tools for assessing their performance in various real-world scenarios. This study presents a novel virtual reality (VR) simulation tool that simulates real-world optical methods while giving high experimental control to the experiment. The tool incorporates an experiment controller, to smoothly and easily handle multiple conditions, a generic eye-tracking controller, that works with most common VR eye-trackers, a configurable defocus simulator, and a generic VR questionnaire loader to assess participants' behavior in virtual reality. This VR-based simulation tool bridges the gap between theoretical and applied research on new optical methods, corrections, and therapies. It enables vision scientists to increase their research tools with a robust, realistic, and fast research environment.
    摘要 开发和评估视觉科学方法需要强大且高效的工具来评估它们在各种真实世界场景中的性能。本研究提出了一种新的虚拟现实(VR)模拟工具,可以在模拟真实世界光学方法的同时为实验提供高度的实验控制。该工具包括一个可以平滑、便捷地处理多种实验条件的实验控制器,一个可与大多数常见 VR 眼动仪兼容的通用眼动跟踪控制器,一个可配置的离焦模拟器,以及一个用于评估参与者在虚拟现实中行为的可配置 VR 问卷加载器。这个基于 VR 的模拟工具弥合了关于新光学方法、矫正和治疗的理论研究与应用研究之间的差距,帮助视觉科学家扩充研究工具,提供了一个可靠、真实且快速的研究环境。

Open-vocabulary object 6D pose estimation

  • paper_url: http://arxiv.org/abs/2312.00690
  • repo_url: None
  • paper_authors: Jaime Corsetti, Davide Boscaini, Changjae Oh, Andrea Cavallaro, Fabio Poiesi
  • for: 这个论文提出了开放词汇物体6D姿态估计的新设定,在这个设定下,感兴趣的目标物体通过文本提示来指定,而推理时不需要 CAD 模型或视频序列。
  • methods: 我们提出了一种新的方法,利用视力语言模型来将文本提示中的对象信息与图像特征进行融合,从而估计对象的6D姿态。我们采用了一种特殊的融合策略,以便将文本提示中的对象级别信息与图像特征融合到一起,从而实现对新概念的泛化。
  • results: 我们在基于 REAL275 和 Toyota-Light 两个常用数据集上构建了一个新的 benchmark,并与两种方法进行比较:一种是成熟的手工方法,另一种是基于深度学习的基线方法。结果表明,我们的方法在估计不同场景中物体的相对6D姿态方面表现出色,超过了这两者。项目页面:https://jcorsetti.github.io/oryon/.
    Abstract We introduce the new setting of open-vocabulary object 6D pose estimation, in which a textual prompt is used to specify the object of interest. In contrast to existing approaches, in our setting (i) the object of interest is specified solely through the textual prompt, (ii) no object model (e.g. CAD or video sequence) is required at inference, (iii) the object is imaged from two different viewpoints of two different scenes, and (iv) the object was not observed during the training phase. To operate in this setting, we introduce a novel approach that leverages a Vision-Language Model to segment the object of interest from two distinct scenes and to estimate its relative 6D pose. The key of our approach is a carefully devised strategy to fuse object-level information provided by the prompt with local image features, resulting in a feature space that can generalize to novel concepts. We validate our approach on a new benchmark based on two popular datasets, REAL275 and Toyota-Light, which collectively encompass 39 object instances appearing in four thousand image pairs. The results demonstrate that our approach outperforms both a well-established hand-crafted method and a recent deep learning-based baseline in estimating the relative 6D pose of objects in different scenes. Project page: https://jcorsetti.github.io/oryon/.
    摘要 我们提出了开放词汇物体6D姿态估计这一新设定,其中通过文本提示来指定感兴趣的目标物体。与现有方法不同,在我们的设定中:(i)目标物体仅通过文本提示来指定;(ii)推理时不需要物体模型(如 CAD 模型或视频序列);(iii)物体是在两个不同场景的两个不同视角下成像的;(iv)该物体在训练阶段从未被观察过。为了在这一设定下工作,我们提出了一种新方法,利用视觉语言模型从两个不同场景中分割出目标物体,并估计其相对6D姿态。我们方法的关键是精心设计的融合策略,将文本提示提供的物体级信息与局部图像特征融合,从而得到能够泛化到新概念的特征空间。我们在基于 REAL275 和 Toyota-Light 两个常用数据集构建的新基准上验证了该方法,该基准共包含出现在四千个图像对中的39个物体实例。结果表明,我们的方法在估计不同场景中物体的相对6D姿态方面优于成熟的手工方法和最新的深度学习基线。项目页面:https://jcorsetti.github.io/oryon/.
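Once cross-scene 3D correspondences for the prompted object are available, the relative 6D pose can be recovered with the classical Kabsch/SVD alignment sketched below. This is only a generic building block under that assumption, not Oryon's actual fusion-based pipeline.

```python
import numpy as np

def rigid_transform_3d(src, dst):
    """Least-squares rigid transform (R, t) aligning src -> dst via the Kabsch/SVD method.

    src, dst: (N, 3) arrays of matched 3D points from the two scenes.
    """
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    H = src_c.T @ dst_c                      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # avoid reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t

# Toy check: recover a known rotation/translation from noiseless correspondences.
rng = np.random.default_rng(1)
pts_a = rng.uniform(-1, 1, size=(50, 3))
theta = np.pi / 6
R_gt = np.array([[np.cos(theta), -np.sin(theta), 0],
                 [np.sin(theta),  np.cos(theta), 0],
                 [0, 0, 1]])
t_gt = np.array([0.2, -0.1, 0.5])
pts_b = pts_a @ R_gt.T + t_gt
R_est, t_est = rigid_transform_3d(pts_a, pts_b)
print(np.allclose(R_est, R_gt), np.allclose(t_est, t_gt))
```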

Infrared Image Super-Resolution via GAN

  • paper_url: http://arxiv.org/abs/2312.00689
  • repo_url: None
  • paper_authors: Yongsong Huang, Shinichiro Omachi
  • for: IR image super-resolution
  • methods: generative models, adversarial training
  • results: potential areas for further investigation and advancement
    Abstract The ability of generative models to accurately fit data distributions has resulted in their widespread adoption and success in fields such as computer vision and natural language processing. In this chapter, we provide a brief overview of the application of generative models in the domain of infrared (IR) image super-resolution, including a discussion of the various challenges and adversarial training methods employed. We propose potential areas for further investigation and advancement in the application of generative models for IR image super-resolution.
    摘要 generative模型在数据分布上的准确适应,在计算机视觉和自然语言处理等领域得到广泛的应用和成功。在这一章中,我们提供了IR图像超解像的领域中generative模型的应用概述,包括不同挑战和对抗训练方法的讨论。我们还提出了IR图像超解像领域进一步研究和发展的潜在领域。

Unsupervised Adaptive Implicit Neural Representation Learning for Scan-Specific MRI Reconstruction

  • paper_url: http://arxiv.org/abs/2312.00677
  • repo_url: None
  • paper_authors: Junwei Yang, Pietro Liò
  • for: 提高MRI成像速度,适用于特定的扫描设置下的低样本率情况
  • methods: 提出了一种无监督、自适应、由粗到细的 scan-specific MRI 重建方法,采用隐式神经表示学习从多维坐标到对应信号强度的映射
  • results: 在公共数据集上进行了全面评估,结果表明该方法在高达8倍欠采样的情况下优于当前最先进的 scan-specific MRI 重建技术
    Abstract In recent studies on MRI reconstruction, advances have shown significant promise for further accelerating the MRI acquisition. Most state-of-the-art methods require a large amount of fully-sampled data to optimise reconstruction models, which is impractical and expensive under certain clinical settings. On the other hand, for unsupervised scan-specific reconstruction methods, overfitting is likely to happen due to insufficient supervision, while restrictions on acceleration rates and under-sampling patterns further limit their applicability. To this end, we propose an unsupervised, adaptive coarse-to-fine framework that enhances reconstruction quality without being constrained by the sparsity levels or patterns in under-sampling. The framework employs an implicit neural representation for scan-specific MRI reconstruction, learning a mapping from multi-dimensional coordinates to their corresponding signal intensities. Moreover, we integrate a novel learning strategy that progressively refines the use of acquired k-space signals for self-supervision. This approach effectively adjusts the proportion of supervising signals from unevenly distributed information across different frequency bands, thus mitigating the issue of overfitting while improving the overall reconstruction. Comprehensive evaluation on a public dataset, including both 2D and 3D data, has shown that our method outperforms current state-of-the-art scan-specific MRI reconstruction techniques, for up to 8-fold under-sampling.
    摘要 近期的MRI重建研究取得了显著进展,有望进一步加速MRI采集。大多数最先进方法需要大量全采样数据来优化重建模型,这在某些临床环境下既不现实又代价高昂。另一方面,无监督的 scan-specific 重建方法由于监督不足容易出现过拟合,而对加速倍率和欠采样模式的限制进一步降低了其适用性。为此,我们提出了一个无监督、自适应的由粗到细框架,能够在不受欠采样稀疏程度或模式限制的情况下提升重建质量。该框架采用隐式神经表示进行 scan-specific MRI 重建,学习从多维坐标到对应信号强度的映射。此外,我们引入了一种新的学习策略,逐步细化已采集 k-空间信号在自监督中的使用方式,调整来自不同频带的不均匀分布信息所提供的监督信号比例,在缓解过拟合的同时提升整体重建质量。在包含2D和3D数据的公共数据集上的全面评估表明,我们的方法在高达8倍欠采样的情况下优于当前最先进的 scan-specific MRI 重建技术。
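A minimal sketch of the core idea named in the abstract, an implicit neural representation that maps coordinates to signal intensities and is fitted only on the acquired samples, is given below; the Fourier encoding, layer sizes, and placeholder data are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class CoordinateMLP(nn.Module):
    """Toy implicit neural representation: maps normalized coordinates to signal
    intensity. The Fourier encoding and layer sizes are illustrative choices."""

    def __init__(self, in_dim=2, n_freqs=10, hidden=256, out_dim=1):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs) * torch.pi)
        enc_dim = in_dim * n_freqs * 2
        self.net = nn.Sequential(
            nn.Linear(enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, coords):                      # coords: (N, in_dim) in [-1, 1]
        proj = coords[..., None] * self.freqs       # (N, in_dim, n_freqs)
        enc = torch.cat([proj.sin(), proj.cos()], dim=-1).flatten(1)
        return self.net(enc)

# Scan-specific fitting: supervise only with the acquired (under-sampled) samples.
model = CoordinateMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
coords = torch.rand(1024, 2) * 2 - 1                # sampled locations
targets = torch.rand(1024, 1)                       # their measured intensities (placeholder)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(coords), targets)
    loss.backward()
    opt.step()
```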

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

  • paper_url: http://arxiv.org/abs/2312.00674
  • repo_url: None
  • paper_authors: Ying Nie, Wei He, Kai Han, Yehui Tang, Tianyu Guo, Fanyi Du, Yunhe Wang
  • for: 提高轻量级 CLIP 模型的表现,解决一些图像文本对不对应问题。
  • methods: 提出多级互动方法,包括:Global Instance-level Alignment 改进、Relaxed Bipartite Matching 对象、Masked Language Modeling 额外目标。
  • results: 在多个下游任务上实现更高的表现,而不增加执行时间成本。
    Abstract Vision-language pre-training like CLIP has shown promising performance on various downstream tasks such as zero-shot image classification and image-text retrieval. Most of the existing CLIP-alike works usually adopt relatively large image encoders like ResNet50 and ViT, while the lightweight counterparts are rarely discussed. In this paper, we propose a multi-level interaction paradigm for training lightweight CLIP models. Firstly, to mitigate the problem that some image-text pairs are not strictly one-to-one correspondence, we improve the conventional global instance-level alignment objective by softening the label of negative samples progressively. Secondly, a relaxed bipartite matching based token-level alignment objective is introduced for finer-grained alignment between image patches and textual words. Moreover, based on the observation that the accuracy of CLIP model does not increase correspondingly as the parameters of text encoder increase, an extra objective of masked language modeling (MLM) is leveraged for maximizing the potential of the shortened text encoder. In practice, an auxiliary fusion module injecting unmasked image embedding into masked text embedding at different network stages is proposed for enhancing the MLM. Extensive experiments show that without introducing additional computational cost during inference, the proposed method achieves a higher performance on multiple downstream tasks.
    摘要 CLIP这类视觉-语言预训练技术已经在零样本图像分类和图文检索等多种下游任务上展现出了良好的表现。现有的大多数类CLIP工作通常采用ResNet50和ViT等相对较大的图像编码器,而轻量级的模型则很少被讨论。在这篇论文中,我们提出了一种用于训练轻量级CLIP模型的多级交互范式。首先,为了缓解部分图像-文本对并非严格一一对应的问题,我们改进了传统的全局实例级对齐目标,逐步软化负样本的标签。其次,我们引入了一种基于宽松二分匹配的词元级对齐目标,以实现图像块与文本词之间更细粒度的对齐。此外,基于CLIP模型的准确率并不随文本编码器参数的增加而相应提升这一观察,我们引入了额外的掩码语言建模(MLM)目标,以充分挖掘缩短后的文本编码器的潜力。在实践中,我们提出了一个辅助融合模块,在网络的不同阶段将未掩码的图像嵌入注入到掩码文本嵌入中,以增强MLM。广泛的实验表明,在不增加推理计算成本的情况下,我们的方法可以在多个下游任务上取得更高的性能。
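To make the "softened negative labels" idea concrete, the sketch below replaces the one-hot targets of a standard CLIP-style contrastive loss with label-smoothed targets. The progressive softening schedule used by LightCLIP is not reproduced; the fixed smoothing value here is an assumption.

```python
import torch
import torch.nn.functional as F

def smoothed_clip_loss(img_emb, txt_emb, temperature=0.07, smoothing=0.1):
    """CLIP-style symmetric contrastive loss with softened negative labels.

    img_emb, txt_emb: (B, D) L2-normalized embeddings of paired images/texts.
    """
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarities
    B = logits.size(0)
    targets = torch.full((B, B), smoothing / (B - 1), device=logits.device)
    targets.fill_diagonal_(1.0 - smoothing)               # soft one-hot targets

    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Usage with random (normalized) embeddings.
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(smoothed_clip_loss(img, txt).item())
```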

CellMixer: Annotation-free Semantic Cell Segmentation of Heterogeneous Cell Populations

  • paper_url: http://arxiv.org/abs/2312.00671
  • repo_url: None
  • paper_authors: Mehdi Naouar, Gabriel Kalweit, Anusha Klett, Yannick Vogt, Paula Silvestrini, Diana Laura Infante Ramirez, Roland Mertelsmann, Joschka Boedecker, Maria Kalweit
  • for: 本研究旨在开发一种无监督的细胞分类方法,以便在医学影像中自动分类不同类型的细胞。
  • methods: 我们提出了一种基于扩展的注解自由方法,可以从图像级别的标签中训练细胞分类模型。
  • results: 我们的实验结果显示,CellMixer可以在多种细胞类型和成像方式下实现竞争力强的分类性能,这表明方法具有扩展性和广泛应用的潜力。
    Abstract In recent years, several unsupervised cell segmentation methods have been presented, trying to omit the requirement of laborious pixel-level annotations for the training of a cell segmentation model. Most if not all of these methods handle the instance segmentation task by focusing on the detection of different cell instances ignoring their type. While such models prove adequate for certain tasks, like cell counting, other applications require the identification of each cell's type. In this paper, we present CellMixer, an innovative annotation-free approach for the semantic segmentation of heterogeneous cell populations. Our augmentation-based method enables the training of a segmentation model from image-level labels of homogeneous cell populations. Our results show that CellMixer can achieve competitive segmentation performance across multiple cell types and imaging modalities, demonstrating the method's scalability and potential for broader applications in medical imaging, cellular biology, and diagnostics.
    摘要 近年来,研究者提出了多种无监督细胞分割方法,试图省去训练细胞分割模型所需的繁琐像素级标注。这些方法大多专注于检测不同的细胞实例而忽略其类型。虽然此类模型对细胞计数等任务已经足够,但其他应用需要识别每个细胞的类型。在这篇论文中,我们提出了CellMixer,一种创新的免标注方法,用于异质细胞群体的语义分割。我们基于数据增强的方法,能够仅利用同质细胞群体的图像级标签来训练分割模型。结果表明,CellMixer可以在多种细胞类型和成像方式下实现有竞争力的分割性能,展示了该方法的可扩展性及其在医学成像、细胞生物学和诊断等领域更广泛应用的潜力。

Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning

  • paper_url: http://arxiv.org/abs/2312.00663
  • repo_url: None
  • paper_authors: Kangcheng Liu, Yong-Jin Liu, Kai Tang, Ming Liu, Baoquan Chen
  • for: 提高3D场景理解的数据效率学习和开放世界几个shot学习
  • methods: 提出一种通用且简单的框架,通过层级特征对齐的预训练和知识蒸馏,从预训练的视觉-语言模型中提取新类别的知识,并引入基于能量、具备边界感知的损失以利用区域级边界预测。
  • results: 实验表明,该方法可以在室内和室外场景中提高3D场景理解的数据效率学习和开放世界几个shot学习性能,并且可以减少训练数据的量和质量要求。
    Abstract Deep neural network models have achieved remarkable progress in 3D scene understanding while trained in the closed-set setting and with full labels. However, the major bottleneck for current 3D recognition approaches is that they do not have the capacity to recognize any unseen novel classes beyond the training categories in diverse kinds of real-world applications. In the meantime, current state-of-the-art 3D scene understanding approaches primarily require high-quality labels to train neural networks, which merely perform well in a fully supervised manner. This work presents a generalized and simple framework for dealing with 3D scene understanding when the labeled scenes are quite limited. To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy to extract and distill meaningful information from large-scale vision-language models, which helps benefit the open-vocabulary scene understanding tasks. To leverage the boundary information, we propose a novel energy-based loss with boundary awareness benefiting from the region-level boundary predictions. To encourage latent instance discrimination and to guarantee efficiency, we propose the unsupervised region-level semantic contrastive learning scheme for point clouds, using confident predictions of the neural network to discriminate the intermediate feature embeddings at multiple stages. Extensive experiments with both indoor and outdoor scenes demonstrated the effectiveness of our approach in both data-efficient learning and open-world few-shot learning. All codes, models, and data are made publicly available at: https://drive.google.com/drive/folders/1M58V-PtR8DBEwD296zJkNg_m2qq-MTAP?usp=sharing.
    摘要 深度神经网络模型已经在封闭设定下和完整标签的情况下实现了惊人的进步。然而,当前的3D认知方法的主要瓶颈是它们无法识别任何未看过的新类型,尤其是在实际应用中遇到的多种多样的场景中。同时,当前state-of-the-art 3D场景理解方法主要需要高质量的标签来训练神经网络,它们只能在完全监督模式下表现出色。本工作提出了一种通用且简单的框架,用于在有限标签的情况下进行3D场景理解。为了从预训练视语模型中提取知识,我们提议一种层次对齐预训练和知识继承策略,以提取和继承大规模视语模型中的有用信息。此外,我们还提出了一种新的能量基的损失函数,以利用边界信息。为了促进秘密实例识别和保证效率,我们提出了无监督区域级别semantic contrastive learning方案,使用神经网络的信任预测来确定多个阶段的中间特征编码的差异。我们的方法在室内和室外场景进行了广泛的实验,并证明了其在数据效率学习和开放世界几shot学习中的效果。所有代码、模型和数据都公开发布在:https://drive.google.com/drive/folders/1M58V-PtR8DBEwD296zJkNg_m2qq-MTAP?usp=sharing。

Dual-Domain Multi-Contrast MRI Reconstruction with Synthesis-based Fusion Network

  • paper_url: http://arxiv.org/abs/2312.00661
  • repo_url: None
  • paper_authors: Junwei Yang, Pietro Liò
  • for: 提高多对比度MRI重建的效率和质量
  • methods: 基于深度学习的双域重建框架,利用采集更快的全采样参考对比度来辅助欠采样目标对比度的重建
  • results: 与现有方法相比,可支持高达8倍的加速率,并通过全面的分析和消融研究验证了各组件的有效性
    Abstract Purpose: To develop an efficient dual-domain reconstruction framework for multi-contrast MRI, with the focus on minimising cross-contrast misalignment in both the image and the frequency domains to enhance optimisation. Theory and Methods: Our proposed framework, based on deep learning, facilitates the optimisation for under-sampled target contrast using fully-sampled reference contrast that is quicker to acquire. The method consists of three key steps: 1) Learning to synthesise data resembling the target contrast from the reference contrast; 2) Registering the multi-contrast data to reduce inter-scan motion; and 3) Utilising the registered data for reconstructing the target contrast. These steps involve learning in both domains with regularisation applied to ensure their consistency. We also compare the reconstruction performance with existing deep learning-based methods using a dataset of brain MRI scans. Results: Extensive experiments demonstrate the superiority of our proposed framework, for up to an 8-fold acceleration rate, compared to state-of-the-art algorithms. Comprehensive analysis and ablation studies further present the effectiveness of the proposed components. Conclusion:Our dual-domain framework offers a promising approach to multi-contrast MRI reconstruction. It can also be integrated with existing methods to further enhance the reconstruction.
    摘要 目的:开发一个高效的双域重建框架用于多对比度MRI,重点在于同时减小图像域和频率域中的跨对比度错位,以增强优化。理论与方法:我们提出的框架基于深度学习,利用采集更快的全采样参考对比度来辅助欠采样目标对比度的优化。该方法包括三个关键步骤:1)学习从参考对比度合成与目标对比度相似的数据;2)对多对比度数据进行配准,以减少扫描间运动;3)利用配准后的数据重建目标对比度。这些步骤在两个域中进行学习,并施加正则化以保证二者的一致性。我们还在一个脑部MRI扫描数据集上与现有基于深度学习的方法比较了重建性能。结果:大量实验表明,在高达8倍的加速率下,我们提出的框架优于最先进的算法;全面的分析和消融研究进一步证明了所提各组件的有效性。结论:我们的双域框架为多对比度MRI重建提供了一种有前景的方案,并且可以与现有方法结合,进一步提升重建效果。
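A generic dual-domain building block that such frameworks typically rely on is k-space data consistency, sketched below; the synthesis and registration stages described in the abstract are not reproduced, and the sampling pattern is illustrative.

```python
import numpy as np

def data_consistency(image, kspace_measured, mask):
    """Enforce consistency with acquired k-space samples.

    image:           current complex image estimate, shape (H, W)
    kspace_measured: acquired k-space, zero elsewhere, shape (H, W)
    mask:            boolean sampling mask, True where a frequency was acquired
    """
    k_est = np.fft.fft2(image)
    k_est[mask] = kspace_measured[mask]   # keep measured frequencies verbatim
    return np.fft.ifft2(k_est)

# Toy example with an 8x under-sampling pattern.
H, W = 128, 128
image = np.random.randn(H, W) + 1j * np.random.randn(H, W)
mask = np.zeros((H, W), dtype=bool)
mask[:, ::8] = True                       # keep every 8th k-space column
kspace_measured = np.fft.fft2(image) * mask
print(np.abs(data_consistency(np.zeros((H, W)), kspace_measured, mask)).shape)
```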

SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers

  • paper_url: http://arxiv.org/abs/2312.00648
  • repo_url: https://github.com/gkakogeorgiou/spot
  • paper_authors: Ioannis Kakogeorgiou, Spyros Gidaris, Konstantinos Karantzalos, Nikos Komodakis
  • for: 本研究旨在提高无监督对象中心学习中场景的可解释性,通过将场景分解成可解释的对象实体(槽)。
  • methods: 本研究使用了两种新技术:首先,一种基于注意力的自动训练方法,通过将解码器中的槽特征精细地储存到编码器中,提高了对象 segmentation的精度;其次,一种创新的 patch-order 排序策略,使得 transformer 在重建过程中更加强调槽向量的角色。
  • results: 实验表明,这两种技术的组合可以大幅提高无监督对象中心学习中的对象分割精度,特别是使用了复杂的实际图像。 codes 的实现可以在 https://github.com/gkakogeorgiou/spot 中找到。
    Abstract Unsupervised object-centric learning aims to decompose scenes into interpretable object entities, termed slots. Slot-based auto-encoders stand out as a prominent method for this task. Within them, crucial aspects include guiding the encoder to generate object-specific slots and ensuring the decoder utilizes them during reconstruction. This work introduces two novel techniques, (i) an attention-based self-training approach, which distills superior slot-based attention masks from the decoder to the encoder, enhancing object segmentation, and (ii) an innovative patch-order permutation strategy for autoregressive transformers that strengthens the role of slot vectors in reconstruction. The effectiveness of these strategies is showcased experimentally. The combined approach significantly surpasses prior slot-based autoencoder methods in unsupervised object segmentation, especially with complex real-world images. We provide the implementation code at https://github.com/gkakogeorgiou/spot .
    摘要 不监督物体中心学习的目标是将场景拆分成可解释的物体实体,称为插槽。插槽基于自动编码器是这个任务中最引人注目的方法之一。在其中,重要的方面包括引导编码器生成对象特定的插槽,并确保解码器在重建过程中使用这些插槽。本研究提出了两种新技术:(i)基于注意力的自我训练方法,可以从解码器中提取出更高质量的插槽关注mask,从而提高对象分割效果,和(ii)一种创新的patch顺序Permutation策略,用于 autoregressive transformers,可以强化插槽向量在重建过程中的角色。我们通过实验证明了这些策略的有效性。这种结合方法在不监督物体分割方面比之前的插槽基于自动编码器方法更加出色,特别是在复杂的实际图像上。我们在github上提供了实现代码,请参考https://github.com/gkakogeorgiou/spot。

QAFE-Net: Quality Assessment of Facial Expressions with Landmark Heatmaps

  • paper_url: http://arxiv.org/abs/2312.00856
  • repo_url: https://github.com/shuchaoduan/qafe-net
  • paper_authors: Shuchao Duan, Amirhossein Dadashzadeh, Alan Whone, Majid Mirmehdi
  • for: 这篇论文旨在评估帕金森病患者的表情质量。
  • methods: 该方法使用了时间特征图表与RGB数据,通过捕捉小面部肌肉运动来编码并映射到严重度分数。
  • results: 对于PFED5数据集和UNBC-McMaster Shoulder Pain Expression Archive Database的对比实验,该方法表现出色,与State-of-the-Art动作质量评估方法在PFED5数据集上具有更高的准确率,并在UNBC-McMaster数据集上实现了较低的平均绝对误差。
    Abstract Facial expression recognition (FER) methods have made great inroads in categorising moods and feelings in humans. Beyond FER, pain estimation methods assess levels of intensity in pain expressions, however assessing the quality of all facial expressions is of critical value in health-related applications. In this work, we address the quality of five different facial expressions in patients affected by Parkinson's disease. We propose a novel landmark-guided approach, QAFE-Net, that combines temporal landmark heatmaps with RGB data to capture small facial muscle movements that are encoded and mapped to severity scores. The proposed approach is evaluated on a new Parkinson's Disease Facial Expression dataset (PFED5), as well as on the pain estimation benchmark, the UNBC-McMaster Shoulder Pain Expression Archive Database. Our comparative experiments demonstrate that the proposed method outperforms SOTA action quality assessment works on PFED5 and achieves lower mean absolute error than the SOTA pain estimation methods on UNBC-McMaster. Our code and the new PFED5 dataset are available at https://github.com/shuchaoduan/QAFE-Net.
    摘要 人类情绪表达分类(FER)技术已经在识别人类情绪的领域取得了很大的进步。然而,评估所有情绪表达的质量在医疗应用中是关键的。在这种情况下,我们研究了患有 Parkinson 病的患者的五种不同情绪表达质量。我们提出了一种新的关键点导向方法(QAFE-Net),它将将时间关键点热图与RGB数据结合,以捕捉小脸部肌肉运动,并将其映射到严重性分数中。我们的方法在新的 Parkinson 病脸部表达数据集(PFED5)和《UNBC-McMaster Shoulder Pain Expression Archive Database》上进行了比较实验,并证明了我们的方法可以超过现有的动作质量评估方法,并在《UNBC-McMaster》上实现了较低的平均绝对误差。我们的代码和新的 PFED5 数据集可以在 GitHub 上找到:https://github.com/shuchaoduan/QAFE-Net。
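The landmark-heatmap input described above can be produced per frame with a simple Gaussian rendering like the sketch below; the sigma, resolution, and example landmarks are assumptions, not QAFE-Net's exact settings.

```python
import numpy as np

def landmark_heatmaps(landmarks, height, width, sigma=2.0):
    """Render one Gaussian heatmap per facial landmark.

    landmarks: (K, 2) array of (x, y) pixel coordinates
    Returns a (K, height, width) array.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((len(landmarks), height, width), dtype=np.float32)
    for k, (x, y) in enumerate(landmarks):
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps

# Example: three hypothetical landmarks on a 64x64 face crop.
hm = landmark_heatmaps(np.array([[20, 30], [44, 30], [32, 48]]), 64, 64)
print(hm.shape, hm.max())
```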

EvE: Exploiting Generative Priors for Radiance Field Enrichment

  • paper_url: http://arxiv.org/abs/2312.00639
  • repo_url: None
  • paper_authors: Karim Kassab, Antoine Schnepf, Jean-Yves Franceschi, Laurent Caraffa, Jeremie Mary, Valérie Gouet-Brunet
  • for: 从无约束的真实图像集合中建模大规模场景是计算机视觉中的一项重要挑战。现有方法在封闭世界设定下解决野外神经渲染问题,即知识仅限于训练集中捕获的某一场景图像。我们提出了 EvE,据我们所知,这是第一个利用生成先验来改进野外场景建模的方法。
  • methods: 我们使用预训练的生成网络,以外部知识丰富 K-Planes 表示。为此,我们定义了交替训练过程,对在训练集上训练的 K-Planes 进行优化引导。
  • results: EvE 能在真实旅游照片集合上为渲染场景增添更丰富的细节,并在野外新视角合成任务上超越现有的最先进方法。
    Abstract Modeling large-scale scenes from unconstrained image collections in-the-wild has proven to be a major challenge in computer vision. Existing methods tackling in-the-wild neural rendering operate in a closed-world setting, where knowledge is limited to a scene's captured images within a training set. We propose EvE, which is, to the best of our knowledge, the first method leveraging generative priors to improve in-the-wild scene modeling. We employ pre-trained generative networks to enrich K-Planes representations with extrinsic knowledge. To this end, we define an alternating training procedure to conduct optimization guidance of K-Planes trained on the training set. We carry out extensive experiments and verify the merit of our method on synthetic data as well as real tourism photo collections. EvE enhances rendered scenes with richer details and outperforms the state of the art on the task of novel view synthesis in-the-wild. Our project page can be found at https://eve-nvs.github.io .
    摘要 从野外无约束图像集合中建模大规模场景一直是计算机视觉中的一项重大挑战。现有的野外神经渲染方法工作在封闭世界设定中,其知识仅限于训练集中某一场景的已拍摄图像。我们提出了EvE,据我们所知,这是第一个利用生成先验来改进野外场景建模的方法。我们使用预训练的生成网络,以外部知识丰富K-Planes表示。为此,我们定义了一个交替训练过程,对在训练集上训练的K-Planes进行优化引导。我们进行了大量实验,并在合成数据以及真实的旅游照片集合上验证了该方法的优势。EvE使渲染出的场景拥有更丰富的细节,并在野外新视角合成任务上超越了当前最先进水平。项目页面:https://eve-nvs.github.io 。

A Recent Survey of Vision Transformers for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2312.00634
  • repo_url: None
  • paper_authors: Asifullah Khan, Zunaira Rauf, Abdul Rehman Khan, Saima Rathore, Saddam Hussain Khan, Sahar Shah, Umair Farooq, Hifsa Asif, Aqsa Asif, Umme Zahoora, Rafi Ullah Khalil, Suleman Qamar, Umme Hani Asif, Faiza Babar Khan, Abdul Majid, Jeonghwan Gwak
  • for: 这个论文主要为了探讨最新的医学图像分割技术,帮助研究人员更好地理解和应用这些技术。
  • methods: 这个论文使用的方法包括视Transformers(ViT)和卷积神经网络(CNN)的组合,以及不同的模型和数据集。
  • results: 这个论文提出了一些最新的医学图像分割方法,包括使用ViT和CNN的组合来提高图像分割精度,以及在各种医学图像模式下实现实时应用。
    Abstract Medical image segmentation plays a crucial role in various healthcare applications, enabling accurate diagnosis, treatment planning, and disease monitoring. In recent years, Vision Transformers (ViTs) have emerged as a promising technique for addressing the challenges in medical image segmentation. In medical images, structures are usually highly interconnected and globally distributed. ViTs utilize their multi-scale attention mechanism to model the long-range relationships in the images. However, they do lack image-related inductive bias and translational invariance, potentially impacting their performance. Recently, researchers have come up with various ViT-based approaches that incorporate CNNs in their architectures, known as Hybrid Vision Transformers (HVTs) to capture local correlation in addition to the global information in the images. This survey paper provides a detailed review of the recent advancements in ViTs and HVTs for medical image segmentation. Along with the categorization of ViT and HVT-based medical image segmentation approaches we also present a detailed overview of their real-time applications in several medical image modalities. This survey may serve as a valuable resource for researchers, healthcare practitioners, and students in understanding the state-of-the-art approaches for ViT-based medical image segmentation.
    摘要 医疗图像分割扮演着重要的角色在各种医疗应用中,帮助精确诊断、治疗规划和病情监测。在医疗图像中,结构通常具有高度相关和全球分布的特点。Vision Transformers(ViTs)在医疗图像分割方面表现出了投入的潜力,它们利用其多级注意机制来模型图像中的长距离关系。然而,ViTs缺乏图像相关的适应性和翻译不变性,这可能影响其性能。研究人员最近开发了一些将CNNs结合到ViTs架构中的方法,称为混合视图转换器(HVTs),以捕捉图像中的本地相关性。本文提供了医疗图像分割领域最近的进展,包括ViTs和HVTs的应用。我们还对这些方法的实时应用在医疗图像模式中进行了详细的概述。本文可以作为研究人员、医疗专业人员和学生了解当前最佳方法的 valuable 资源。

Rethinking the Domain Gap in Near-infrared Face Recognition

  • paper_url: http://arxiv.org/abs/2312.00627
  • repo_url: None
  • paper_authors: Michail Tarasiou, Jiankang Deng, Stefanos Zafeiriou
  • for: 这篇论文主要是为了解决异类面部识别(HFR)问题,即将视觉频谱(VIS)和近红外频谱(NIR)视觉域的图像匹配。
  • methods: 该论文不同于现有的大多数文献,不直接强调bridging the domain gap,而是采用了大型神经网络的预训练和正则化细化方法,以优化HFR问题。
  • results: 该论文通过对四个公共数据集进行验证,得出了至少匹配或超越当前状态艺术的结果。
    Abstract Heterogeneous face recognition (HFR) involves the intricate task of matching face images across the visual domains of visible (VIS) and near-infrared (NIR). While much of the existing literature on HFR identifies the domain gap as a primary challenge and directs efforts towards bridging it at either the input or feature level, our work deviates from this trend. We observe that large neural networks, unlike their smaller counterparts, when pre-trained on large scale homogeneous VIS data, demonstrate exceptional zero-shot performance in HFR, suggesting that the domain gap might be less pronounced than previously believed. By approaching the HFR problem as one of low-data fine-tuning, we introduce a straightforward framework: comprehensive pre-training, succeeded by a regularized fine-tuning strategy, that matches or surpasses the current state-of-the-art on four publicly available benchmarks. Corresponding codes can be found at https://github.com/michaeltrs/RethinkNIRVIS.
    摘要 异质人脸识别(HFR)涉及在可见光(VIS)和近红外(NIR)两个视觉域之间匹配人脸图像这一复杂任务。现有的大多数HFR文献将域差距视为主要挑战,并致力于在输入层或特征层弥合这一差距,而我们的工作偏离了这一趋势。我们观察到,与较小的模型不同,大型神经网络在大规模同质VIS数据上预训练后,在HFR上表现出优异的零样本性能,这表明域差距可能没有以前认为的那么明显。我们将HFR问题视为一个低数据量微调问题,提出了一个简单的框架:先进行全面预训练,再采用带正则化的微调策略,在四个公开基准上达到或超越当前最先进水平。相关代码可以在 https://github.com/michaeltrs/RethinkNIRVIS 找到。
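The abstract only states that fine-tuning is regularized, without specifying how. One common realization is an L2-SP style penalty that anchors weights to their pre-trained values, sketched below purely as an illustration; the model, hyperparameters, and data are placeholders.

```python
import torch
import torch.nn as nn

def l2_sp_penalty(model, pretrained_state, weight=1e-3):
    """L2-SP style regularizer: penalize deviation from the pre-trained weights."""
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in pretrained_state:
            penalty = penalty + (param - pretrained_state[name].to(param.device)).pow(2).sum()
    return weight * penalty

# Usage inside a fine-tuning step (the face model and NIR batch are placeholders).
face_model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
pretrained = {k: v.clone().detach() for k, v in face_model.state_dict().items()}
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(face_model.parameters(), lr=1e-2)

features, labels = torch.randn(16, 512), torch.randint(0, 128, (16,))
optimizer.zero_grad()
loss = criterion(face_model(features), labels) + l2_sp_penalty(face_model, pretrained)
loss.backward()
optimizer.step()
```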

Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution

  • paper_url: http://arxiv.org/abs/2312.00853
  • repo_url: https://github.com/ianyeung/mgld-vsr
  • paper_authors: Xi Yang, Chenhang He, Jianqi Ma, Lei Zhang
  • for: 本研究旨在提出一种高质量的实际世界视频超分辨(VSR)算法,使得LR视频中的细节和材质得到了高质量的恢复。
  • methods: 我们提出了一种基于潜在扩散模型的VSR算法,使用了潜在扩散模型的强大能力来生成真实的细节,并通过优化潜在扩散路径来控制生成的图像内容。
  • results: 我们的方法在实际世界VSR benchmark数据集上实现了明显的性能提升,证明了我们的模型设计和训练策略的有效性。
    Abstract Real-world low-resolution (LR) videos have diverse and complex degradations, imposing great challenges on video super-resolution (VSR) algorithms to reproduce their high-resolution (HR) counterparts with high quality. Recently, the diffusion models have shown compelling performance in generating realistic details for image restoration tasks. However, the diffusion process has randomness, making it hard to control the contents of restored images. This issue becomes more serious when applying diffusion models to VSR tasks because temporal consistency is crucial to the perceptual quality of videos. In this paper, we propose an effective real-world VSR algorithm by leveraging the strength of pre-trained latent diffusion models. To ensure the content consistency among adjacent frames, we exploit the temporal dynamics in LR videos to guide the diffusion process by optimizing the latent sampling path with a motion-guided loss, ensuring that the generated HR video maintains a coherent and continuous visual flow. To further mitigate the discontinuity of generated details, we insert temporal module to the decoder and fine-tune it with an innovative sequence-oriented loss. The proposed motion-guided latent diffusion (MGLD) based VSR algorithm achieves significantly better perceptual quality than state-of-the-arts on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.
    摘要 真实世界的低分辨率(LR)视频存在多样且复杂的退化,这给视频超分辨率(VSR)算法高质量地重建其高分辨率(HR)版本带来了巨大挑战。近来,扩散模型在图像恢复任务中展现出生成逼真细节的出色能力。然而,扩散过程具有随机性,难以控制恢复图像的内容;将扩散模型应用于VSR任务时,这一问题更加突出,因为时间一致性对视频的感知质量至关重要。本文提出了一种有效的真实世界VSR算法,充分利用预训练潜在扩散模型的能力。为了保证相邻帧之间的内容一致性,我们利用LR视频中的时间动态信息,通过运动引导损失优化潜在采样路径来引导扩散过程,使生成的HR视频保持连贯、连续的视觉流。为了进一步缓解生成细节的不连续性,我们在解码器中插入时间模块,并使用新颖的面向序列的损失对其进行微调。所提出的基于运动引导潜在扩散(MGLD)的VSR算法在真实世界VSR基准数据集上取得了明显优于现有最先进方法的感知质量,验证了所提模型设计和训练策略的有效性。

Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion

  • paper_url: http://arxiv.org/abs/2312.00852
  • repo_url: None
  • paper_authors: Litu Rout, Yujia Chen, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, Wen-Sheng Chu
  • for: 解决 inverse problems 中的 sampling 问题,使用 latent diffusion 模型。
  • methods: 提出 Second-order Tweedie sampler from Surrogate Loss (STSL),使用 second-order approximation 实现高效的 posterior sampling。
  • results: 与 existing solvers PSLD 和 P2L 比较,STSL 可以在 FFHQ、ImageNet 和 COCO benchmarks 上实现4X 和 8X 的减少 neural function evaluations,同时提高 sampling quality。此外,STSL 还可以应用于 text-guided image editing,并能够处理 corrupted images 中的剩余扭曲。
    Abstract Sampling from the posterior distribution poses a major computational challenge in solving inverse problems using latent diffusion models. Common methods rely on Tweedie's first-order moments, which are known to induce a quality-limiting bias. Existing second-order approximations are impractical due to prohibitive computational costs, making standard reverse diffusion processes intractable for posterior sampling. This paper introduces Second-order Tweedie sampler from Surrogate Loss (STSL), a novel sampler that offers efficiency comparable to first-order Tweedie with a tractable reverse process using second-order approximation. Our theoretical results reveal that the second-order approximation is lower bounded by our surrogate loss that only requires $O(1)$ compute using the trace of the Hessian, and by the lower bound we derive a new drift term to make the reverse process tractable. Our method surpasses SoTA solvers PSLD and P2L, achieving 4X and 8X reduction in neural function evaluations, respectively, while notably enhancing sampling quality on FFHQ, ImageNet, and COCO benchmarks. In addition, we show STSL extends to text-guided image editing and addresses residual distortions present from corrupted images in leading text-guided image editing methods. To our best knowledge, this is the first work to offer an efficient second-order approximation in solving inverse problems using latent diffusion and editing real-world images with corruptions.
    摘要 在使用潜在扩散模型求解逆问题时,从后验分布中采样是主要的计算挑战。常用方法依赖于 Tweedie 的一阶矩,而这会引入限制质量的偏差。现有的二阶近似由于计算代价过高而不实用,使得标准的反向扩散过程难以用于后验采样。本文提出了基于代理损失的二阶 Tweedie 采样器(STSL),这种新型采样器在效率上可与一阶 Tweedie 相当,同时借助二阶近似获得可计算的反向过程。我们的理论结果表明,该二阶近似由我们的代理损失下界约束,后者只需利用 Hessian 的迹即可以 $O(1)$ 的计算量得到;基于这一下界,我们推导出一个新的漂移项,使反向过程变得可计算。我们的方法超越了最先进的求解器 PSLD 和 P2L,在 FFHQ、ImageNet 和 COCO 基准上分别将神经函数评估次数减少了4倍和8倍,同时显著提升了采样质量。此外,我们还证明 STSL 可以扩展到文本引导的图像编辑,并能够处理主流文本引导图像编辑方法中由受损图像引起的残余失真。据我们所知,这是首个在使用潜在扩散求解逆问题以及编辑带有损坏的真实图像时提供高效二阶近似的工作。
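For reference, the first-order Tweedie identity that the paper improves upon can be written as follows for a Gaussian perturbation; the second-order correction involves the Hessian of the log-density, of which STSL reportedly only needs the trace.

```latex
% First-order Tweedie identity for x_t = x_0 + \sigma_t \epsilon, \ \epsilon \sim \mathcal{N}(0, I):
\mathbb{E}[x_0 \mid x_t] \;=\; x_t + \sigma_t^{2}\,\nabla_{x_t}\log p_t(x_t)
% Second-order approaches additionally use \nabla^2_{x_t}\log p_t(x_t); per the abstract,
% STSL's surrogate loss only requires its trace, keeping the overhead at O(1).
```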

Event Recognition in Laparoscopic Gynecology Videos with Hybrid Transformers

  • paper_url: http://arxiv.org/abs/2312.00593
  • repo_url: None
  • paper_authors: Sahar Nasirihaghighi, Negin Ghamsarian, Heinrich Husslein, Klaus Schoeffmann
  • for: 本研究旨在开发一个高精度的laparoscopic surgery视频事件识别模型,用于医学教育、实时运动预测和后期手术评估等应用。
  • methods: 我们使用了一个全面的laparoscopic gynecology视频集,并对其进行了精度的标注和评估。我们还提出了一种hybrid transformer架构,利用Transformer网络来捕捉视频中的间隔dependencies,从而提高事件识别精度。
  • results: 我们通过了一系列的广泛实验,证明了我们的提posed方法在事件识别中的superiority,并且可以在不同的手术场景和医生技巧水平下进行高精度的事件识别。
    Abstract Analyzing laparoscopic surgery videos presents a complex and multifaceted challenge, with applications including surgical training, intra-operative surgical complication prediction, and post-operative surgical assessment. Identifying crucial events within these videos is a significant prerequisite in a majority of these applications. In this paper, we introduce a comprehensive dataset tailored for relevant event recognition in laparoscopic gynecology videos. Our dataset includes annotations for critical events associated with major intra-operative challenges and post-operative complications. To validate the precision of our annotations, we assess event recognition performance using several CNN-RNN architectures. Furthermore, we introduce and evaluate a hybrid transformer architecture coupled with a customized training-inference framework to recognize four specific events in laparoscopic surgery videos. Leveraging the Transformer networks, our proposed architecture harnesses inter-frame dependencies to counteract the adverse effects of relevant content occlusion, motion blur, and surgical scene variation, thus significantly enhancing event recognition accuracy. Moreover, we present a frame sampling strategy designed to manage variations in surgical scenes and the surgeons' skill level, resulting in event recognition with high temporal resolution. We empirically demonstrate the superiority of our proposed methodology in event recognition compared to conventional CNN-RNN architectures through a series of extensive experiments.
    摘要 分析腹腔镜手术视频是一项复杂而多方面的挑战,其应用包括手术培训、术中手术并发症预测和术后手术评估。在这些应用中,识别视频中的关键事件是一个重要前提。在本文中,我们介绍了一个专为腹腔镜妇科手术视频中相关事件识别而构建的完整数据集。我们的数据集包含与主要术中挑战和术后并发症相关的关键事件标注。为了验证标注的准确性,我们使用多个CNN-RNN架构评估事件识别性能。此外,我们引入并评估了一种混合Transformer架构以及定制的训练-推理框架,用于识别腹腔镜手术视频中的四类特定事件。借助Transformer网络,我们提出的架构利用帧间依赖关系来对抗相关内容遮挡、运动模糊和手术场景变化的不利影响,从而显著提高事件识别精度。此外,我们提出了一种帧采样策略,用于应对手术场景的变化和外科医生技术水平的差异,从而实现高时间分辨率的事件识别。通过一系列广泛的实验,我们实证证明了所提方法在事件识别方面优于传统的CNN-RNN架构。
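A minimal sketch of a hybrid CNN-plus-Transformer clip classifier of the kind described above is given below; the ResNet-18 backbone, encoder depth, and pooling are illustrative assumptions, not the authors' exact architecture or training-inference framework.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ClipEventClassifier(nn.Module):
    """Illustrative hybrid model: a CNN backbone extracts per-frame features and a
    Transformer encoder models inter-frame dependencies before classification."""

    def __init__(self, num_events=4, d_model=512, n_layers=2, n_heads=8):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()                      # 512-d per-frame features
        self.backbone = backbone
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, num_events)

    def forward(self, clip):                             # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1))        # (B*T, 512)
        feats = feats.view(b, t, -1)
        feats = self.temporal(feats)                     # inter-frame attention
        return self.head(feats.mean(dim=1))              # clip-level logits

model = ClipEventClassifier()
logits = model(torch.randn(2, 8, 3, 224, 224))
print(logits.shape)                                      # torch.Size([2, 4])
```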

Tracking Object Positions in Reinforcement Learning: A Metric for Keypoint Detection (extended version)

  • paper_url: http://arxiv.org/abs/2312.00592
  • repo_url: None
  • paper_authors: Emma Cramer, Jonas Reiher, Sebastian Trimpe
  • for: 本研究使用 spatial autoencoders (SAEs) 为 robot 控制 tasks 提供低维度表示。
  • methods: 本研究使用 SAEs 提取图像数据中的空间特征,并评估 SAEs 是否能够有效地跟踪图像中的物体。
  • results: 研究发现,通用 SAEs 在跟踪物体方面有很大差异,而且SAEs 的性能与 downstream RL Task 直接相关。此外,研究还提出了三种modification 来提高 SAEs 的跟踪性能。
    Abstract Reinforcement learning (RL) for robot control typically requires a detailed representation of the environment state, including information about task-relevant objects not directly measurable. Keypoint detectors, such as spatial autoencoders (SAEs), are a common approach to extracting a low-dimensional representation from high-dimensional image data. SAEs aim at spatial features such as object positions, which are often useful representations in robotic RL. However, whether an SAE is actually able to track objects in the scene and thus yields a spatial state representation well suited for RL tasks has rarely been examined due to a lack of established metrics. In this paper, we propose to assess the performance of an SAE instance by measuring how well keypoints track ground truth objects in images. We present a computationally lightweight metric and use it to evaluate common baseline SAE architectures on image data from a simulated robot task. We find that common SAEs differ substantially in their spatial extraction capability. Furthermore, we validate that SAEs that perform well in our metric achieve superior performance when used in downstream RL. Thus, our metric is an effective and lightweight indicator of RL performance before executing expensive RL training. Building on these insights, we identify three key modifications of SAE architectures to improve tracking performance. We make our code available at anonymous.4open.science/r/sae-rl.
    摘要 常用的强化学习(RL)控制器需要详细的环境状态表示,包括不直接测量的任务相关对象的信息。关键点检测器,如空间自动编码器(SAE),是RL控制器中常见的状态表示方法。SAE通常会在图像数据中检测到空间特征,如物体位置,这些特征通常是RL任务中有用的表示。然而,SAE是否实际上可以跟踪图像中的物体并将其转换为RL任务中有用的状态表示,很少有人研究过。在这篇论文中,我们提出了一种用于评估SAE实例的方法,即在图像中测量关键点是否能够正确地跟踪真实物体的地方。我们提出了一种计算轻量级的 metric,并使其用于评估常见的SAE架构在图像数据中的表示能力。我们发现了SAE的表示能力差异很大,并且我们验证了SAE可以在RL任务中表现出色,只要它们在我们的metric中表现良好。因此,我们的metric是RL任务前的表现评估中的有效和轻量级指标。基于这些发现,我们提出了三种SAE架构修改来提高跟踪性能。我们的代码可以在[anonymous.4open.science/r/sae-rl](http://anonymous.4open.science/r/sae-rl)上下载。
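One plausible, computationally light way to measure "how well keypoints track ground-truth objects" is sketched below: assign each ground-truth trajectory its best-matching keypoint and report the mean distance. This is an illustration consistent with the description above, not necessarily the paper's exact metric.

```python
import numpy as np

def keypoint_tracking_error(keypoints, gt_positions):
    """Lightweight tracking metric.

    keypoints:    (T, K, 2) predicted keypoint coordinates over T frames
    gt_positions: (T, M, 2) ground-truth object positions over the same frames
    """
    # Per-trajectory, per-keypoint mean distance over time: (M, K)
    dists = np.linalg.norm(keypoints[:, None, :, :] - gt_positions[:, :, None, :],
                           axis=-1).mean(axis=0)
    best_per_object = dists.min(axis=1)       # best keypoint for each object
    return best_per_object.mean()

# Toy example: 50 frames, 8 keypoints, 2 objects.
rng = np.random.default_rng(0)
gt = rng.uniform(0, 64, size=(50, 2, 2))
kp = np.concatenate([gt + rng.normal(0, 0.5, size=gt.shape),   # two keypoints track well
                     rng.uniform(0, 64, size=(50, 6, 2))], axis=1)
print(keypoint_tracking_error(kp, gt))
```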

Generative models for visualising abstract social processes: Guiding streetview image synthesis of StyleGAN2 with indices of deprivation

  • paper_url: http://arxiv.org/abs/2312.00570
  • repo_url: None
  • paper_authors: Aleksi Knuutila
  • for: 这个论文探讨了通过生成对抗网络(GANs)研究社会过程的视觉方面。
  • methods: 作者使用StyleGAN2模型在自定义的14564张伦敦Google街景图像集上训练,并使用三种倒转技术进行比较。
  • results: 研究发现了使用GANs生成新图像时,可以根据社会经济、健康和环境质量的metadata进行conditioning,从而生成反映社会过程的特征性图像。这些图像反映了之前未知和难以研究的社会过程的视觉特征。
    Abstract This paper presents a novel application of Generative Adverserial Networks (GANs) to study visual aspects of social processes. I train a a StyleGAN2-model on a custom dataset of 14,564 images of London, sourced from Google Streetview taken in London. After training, I invert the images in the training set, finding points in the model's latent space that correspond to them, and compare results from three inversion techniques. I connect each data point with metadata from the Indices of Multiple Deprivation, describing income, health and environmental quality in the area where the photographs were taken. It is then possible to map which parts of the model's latent space encode visual features that are distinctive for health, income and environmental quality, and condition the synthesis of new images based on these factors. The synthetic images created reflect visual features of social processes that were previously unknown and difficult to study, describing recurring visual differences between deprived and privileged areas in London. GANs are known for their capability to produce a continuous range of images that exhibit visual differences. The paper tests how to exploit this ability through visual comparisons in still images as well as through an interactive website where users can guide image synthesis with sliders. Though conditioned synthesis has its limitations and the results are difficult to validate, the paper points to the potential for generative models to be repurposed to be parts of social scientific methods.
    摘要 本文提出了一种将生成对抗网络(GANs)用于研究社会过程视觉层面的新应用。我们在一个自建的数据集上训练了StyleGAN2模型,该数据集包含14,564张来自Google街景的伦敦图像。训练完成后,我们对训练集中的图像进行反演,在模型的潜在空间中找到与之对应的点,并比较了三种反演技术的结果。我们将每个数据点与多重剥夺指数(Indices of Multiple Deprivation)中的元数据相关联,这些元数据描述了照片拍摄区域的收入、健康和环境质量。由此可以分析模型潜在空间中哪些部分编码了与健康、收入和环境质量相关的独特视觉特征,并基于这些因素对新图像的合成进行条件控制。合成的图像反映了此前未知且难以研究的社会过程的视觉特征,刻画了伦敦贫困地区与富裕地区之间反复出现的视觉差异。GANs以能够生成呈现连续视觉差异的图像而著称。本文通过静态图像的视觉比较以及一个允许用户用滑块引导图像合成的交互式网站,检验了如何利用这一能力。尽管条件合成有其局限性且结果难以验证,本文指出了生成模型被重新用作社会科学研究方法组成部分的潜力。

Physics Inspired Criterion for Pruning-Quantization Joint Learning

  • paper_url: http://arxiv.org/abs/2312.00851
  • repo_url: https://github.com/fanxxxxyi/pic-pq
  • paper_authors: Weiying Xie, Xiaoyi Fan, Xin Zhang, Yunsong Li, Jie Lei, Leyuan Fang
  • for: 这篇论文的目的是提出一种新的物理启发式剪枝量化学习(PIC-PQ)方法,以实现深度神经网络(DNNs)在具有限制的边缘设备上的部署。
  • methods: 这篇论文使用了一个新的物理启发式剪枝量化(PIC)方法,它基于了弹簧学中的赫克斯律(Hooke’s law),将模型内部的缓冲组件(FP)与对应的滤波器参数(FP)建立了线性关系。此外,PIC还加上了一个可调整的对称因数,以确保可行性和柔性。
  • results: 这篇论文的实验结果显示,PIC-PQ方法可以实现一个好的平衡between精度和位元运算(BOPs)压缩比例,例如在CIFAR10上的ResNet56模型可以实现54.96X的BOPs压缩比例,仅增加0.10%的精度损失,而在ImageNet上的ResNet18模型可以实现53.24X的BOPs压缩比例,仅增加0.61%的精度损失。
    Abstract Pruning-quantization joint learning always facilitates the deployment of deep neural networks (DNNs) on resource-constrained edge devices. However, most existing methods do not jointly learn a global criterion for pruning and quantization in an interpretable way. In this paper, we propose a novel physics inspired criterion for pruning-quantization joint learning (PIC-PQ), which is explored from an analogy we first draw between elasticity dynamics (ED) and model compression (MC). Specifically, derived from Hooke's law in ED, we establish a linear relationship between the filters' importance distribution and the filter property (FP) by a learnable deformation scale in the physics inspired criterion (PIC). Furthermore, we extend PIC with a relative shift variable for a global view. To ensure feasibility and flexibility, available maximum bitwidth and penalty factor are introduced in quantization bitwidth assignment. Experiments on benchmarks of image classification demonstrate that PIC-PQ yields a good trade-off between accuracy and bit-operations (BOPs) compression ratio e.g., 54.96X BOPs compression ratio in ResNet56 on CIFAR10 with 0.10% accuracy drop and 53.24X in ResNet18 on ImageNet with 0.61% accuracy drop). The code will be available at https://github.com/fanxxxxyi/PIC-PQ.
    摘要 这文章提出了一种新的物理灵感的剪枝数字化结合学习(PIC-PQ),它是通过将模型压缩和量化视为一个可解释的全局条件来实现深度神经网络(DNNs)在有限资源的边缘设备上部署。大多数现有的方法不会同时学习一个全局条件来剪枝和量化,而PIC-PQ则可以实现这一点。在这篇论文中,我们首先将模型压缩和量化视为一个可解释的全局条件,并从运动学(ED)中获得一个对应关系。具体来说,我们从韦伯-霍奇律(Hooke's law)中获得了一个可学习的塑形因子,以将重要性分布与模型属性(FP)之间建立线性关系。此外,我们还将PIC扩展为包括一个相对偏移变数,以获得全局的看法。为确保可行性和灵活性,我们将最大位元数和罚分因子引入量化位元数分配。实验结果显示,PIC-PQ可以实现一个好的平衡度和位元操作数(BOPs)压缩比例,例如在CIFAR10上的ResNet56中,PIC-PQ可以实现54.96倍的BOPs压缩比例,同时仅增加0.10%的精度损失。在ImageNet上的ResNet18中,PIC-PQ可以实现53.24倍的BOPs压缩比例,同时仅增加0.61%的精度损失。代码将会在https://github.com/fanxxxxyi/PIC-PQ中公开。
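A sketch of the Hooke's-law-style linear scoring described above is given below; taking the L1 norm of each filter as the filter property (FP) is an assumption made only for illustration.

```python
import torch

def physics_inspired_importance(conv_weight, scale, shift):
    """Sketch of a Hooke's-law-style linear importance score per output filter.

    conv_weight: (C_out, C_in, k, k) weights of a convolution layer
    scale, shift: learnable scalars (the deformation scale and relative shift)
    The filter property (FP) is taken to be the L1 norm of each filter purely
    for illustration; the paper's exact FP definition may differ.
    """
    fp = conv_weight.abs().sum(dim=(1, 2, 3))     # one value per output filter
    return scale * fp + shift                     # linear importance ~ Hooke's law

w = torch.randn(64, 32, 3, 3)
scores = physics_inspired_importance(w, scale=torch.tensor(1.5), shift=torch.tensor(-0.2))
print(scores.shape)   # torch.Size([64]); low-scoring filters are pruning candidates
```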

Domain Adaptive Imitation Learning with Visual Observation

  • paper_url: http://arxiv.org/abs/2312.00548
  • repo_url: https://github.com/sunghochoi122/D3IL
  • paper_authors: Sungho Choi, Seungyul Han, Woojun Kim, Jongseong Chae, Whiyoung Jung, Youngchul Sung
  • for: 本文考虑了域 adapted imitation learning,agent在目标域学习完成任务,通过观察来源域专家示例。
  • methods: 我们提出了一种新的框架,通过双Feature抽取和图像重建来提取域独立的行为特征,以便在域 shift 情况下进行训练学习。
  • results: 我们的方法在跨域imit learning with visual observation中表现出优于之前的算法,实验结果表明。
    Abstract In this paper, we consider domain-adaptive imitation learning with visual observation, where an agent in a target domain learns to perform a task by observing expert demonstrations in a source domain. Domain adaptive imitation learning arises in practical scenarios where a robot, receiving visual sensory data, needs to mimic movements by visually observing other robots from different angles or observing robots of different shapes. To overcome the domain shift in cross-domain imitation learning with visual observation, we propose a novel framework for extracting domain-independent behavioral features from input observations that can be used to train the learner, based on dual feature extraction and image reconstruction. Empirical results demonstrate that our approach outperforms previous algorithms for imitation learning from visual observation with domain shift.
    摘要 在这篇论文中,我们考虑了基于视觉观察的域适应模仿学习,其中目标域中的智能体通过观察源域中的专家示范来学习执行任务。当一个接收视觉感知数据的机器人需要通过从不同角度观察其他机器人、或观察形状不同的机器人来模仿其动作时,就会出现域适应模仿学习的实际场景。为了克服基于视觉观察的跨域模仿学习中的域偏移,我们提出了一种新的框架,基于双重特征提取和图像重建,从输入观察中提取可用于训练学习者的域无关行为特征。实验结果表明,在存在域偏移的基于视觉观察的模仿学习任务中,我们的方法优于以往的算法。

LiDAR-based curb detection for ground truth annotation in automated driving validation

  • paper_url: http://arxiv.org/abs/2312.00534
  • repo_url: None
  • paper_authors: Jose Luis Apellániz, Mikel García, Nerea Aranjuelo, Javier Barandiarán, Marcos Nieto
  • for: This paper is written for the purpose of developing a method for detecting 3D curbs in a sequence of point clouds captured from a LiDAR sensor, with the goal of creating pre-annotations for labelling pipelines to efficiently generate curb-related ground truth data.
  • methods: The method used in this paper consists of two main steps: (1) detecting curbs at each scan using a segmentation deep neural network, and (2) estimating the 3D curbs in the reconstructed point cloud using the odometry of the vehicle.
  • results: The results of this paper show that the manually annotation time is reduced by 50.99% thanks to the automatically generated pre-annotations, while maintaining the data quality level.
    Abstract Curb detection is essential for environmental awareness in Automated Driving (AD), as it typically limits drivable and non-drivable areas. Annotated data are necessary for developing and validating an AD function. However, the number of public datasets with annotated point cloud curbs is scarce. This paper presents a method for detecting 3D curbs in a sequence of point clouds captured from a LiDAR sensor, which consists of two main steps. First, our approach detects the curbs at each scan using a segmentation deep neural network. Then, a sequence-level processing step estimates the 3D curbs in the reconstructed point cloud using the odometry of the vehicle. From these 3D points of the curb, we obtain polylines structured following ASAM OpenLABEL standard. These detections can be used as pre-annotations in labelling pipelines to efficiently generate curb-related ground truth data. We validate our approach through an experiment in which different human annotators were required to annotate curbs in a group of LiDAR-based sequences with and without our automatically generated pre-annotations. The results show that the manual annotation time is reduced by 50.99% thanks to our detections, keeping the data quality level.
    摘要 路缘检测是自动驾驶(AD)环境感知中的关键组成部分,因为路缘通常划定了可行驶与不可行驶区域。标注数据是开发和验证自动驾驶功能的必要条件,但带有路缘标注的公开点云数据集非常稀少。本文提出了一种在 LiDAR 点云序列中检测3D路缘的方法,该方法包括两个主要步骤:首先,在每帧扫描中使用分割深度神经网络检测路缘;然后,利用车辆的里程计信息在重建的点云中估计3D路缘。从这些3D路缘点中,我们得到符合 ASAM OpenLABEL 标准的折线结构。这些检测结果可以作为标注流程中的预标注,高效地生成与路缘相关的真值数据。我们通过一项实验验证了该方法:不同的人工标注员分别在有和没有自动生成预标注的情况下对一组基于 LiDAR 的序列进行路缘标注。结果显示,借助我们的检测结果,人工标注时间减少了50.99%,同时保持了数据质量水平。
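The sequence-level step, accumulating per-scan curb points into a global frame using the vehicle odometry, can be sketched as below; segmentation and the ASAM OpenLABEL polyline export are not shown, and the poses are toy values.

```python
import numpy as np

def to_global_frame(curb_points_per_scan, odometry_poses):
    """Accumulate per-scan curb detections into a single global point cloud.

    curb_points_per_scan: list of (N_i, 3) curb points in each scan's sensor frame
    odometry_poses:       list of (4, 4) homogeneous poses (sensor -> global)
    """
    merged = []
    for pts, pose in zip(curb_points_per_scan, odometry_poses):
        homog = np.hstack([pts, np.ones((len(pts), 1))])   # (N_i, 4)
        merged.append((homog @ pose.T)[:, :3])             # transform into global frame
    return np.vstack(merged)

# Toy example: two scans, the second taken 1 m ahead of the first.
scan0 = np.array([[5.0, 2.0, -1.5], [6.0, 2.0, -1.5]])
scan1 = np.array([[4.0, 2.0, -1.5]])
pose0 = np.eye(4)
pose1 = np.eye(4); pose1[0, 3] = 1.0
print(to_global_frame([scan0, scan1], [pose0, pose1]))
```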

DeepDR: Deep Structure-Aware RGB-D Inpainting for Diminished Reality

  • paper_url: http://arxiv.org/abs/2312.00532
  • repo_url: None
  • paper_authors: Christina Gsaxner, Shohei Mori, Dieter Schmalstieg, Jan Egger, Gerhard Paar, Werner Bailer, Denis Kalkofen
  • for: 本研究旨在提出一种RGB-D填充框架,用于实现真实环境中物体的消失,并生成完整的3D场景。
  • methods: 该框架使用深度学习网络,以提高填充的精度和效率。具有场景 semantics 的生成网络,能够Explicitly 控制颜色和深度输出的准确性。
  • results: 实验结果表明,提议的框架可以胜过相关的工作,both qualitatively and quantitatively。
    Abstract Diminished reality (DR) refers to the removal of real objects from the environment by virtually replacing them with their background. Modern DR frameworks use inpainting to hallucinate unobserved regions. While recent deep learning-based inpainting is promising, the DR use case is complicated by the need to generate coherent structure and 3D geometry (i.e., depth), in particular for advanced applications, such as 3D scene editing. In this paper, we propose DeepDR, a first RGB-D inpainting framework fulfilling all requirements of DR: Plausible image and geometry inpainting with coherent structure, running at real-time frame rates, with minimal temporal artifacts. Our structure-aware generative network allows us to explicitly condition color and depth outputs on the scene semantics, overcoming the difficulty of reconstructing sharp and consistent boundaries in regions with complex backgrounds. Experimental results show that the proposed framework can outperform related work qualitatively and quantitatively.
    摘要 减少现实(DR)指的是将真实的物体从环境中移除,并将它们替换为背景。现代DR框架使用填充来描述未观察到的区域。虽然最近的深度学习基于的填充表现出了承诺,但DR使用情况受到生成一致性的结构和3D几何学(即深度)的限制,特别是 для进阶应用,如3D场景编辑。在这篇论文中,我们提出了深度DR,第一个可以满足DR所有要求的RGB-D填充框架。我们的结构意识的生成网络使我们能够直接将颜色和深度输出与场景 semantics相关联,从而超越复杂背景中的锐利和一致的边界重建的困难。实验结果表明,我们提出的框架可以与相关工作进行比较和量化比较。

Algorithm-based diagnostic application for diabetic retinopathy detection

  • paper_url: http://arxiv.org/abs/2312.00529
  • repo_url: None
  • paper_authors: Agnieszka Cisek, Karolina Korycinska, Leszek Pyziak, Marzena Malicka, Tomasz Wiecek, Grzegorz Gruzel, Kamil Szmuc, Jozef Cebulski, Mariusz Spyra
  • for: 这个研究是为了开发一个自动识别糖尿病视网膜病变的方法,以提高糖尿病视网膜病变的早期检测和预防。
  • methods: 这个方法使用了高级技术,例如人工神经网络、深度学习和影像分析算法,来分析糖尿病视网膜影像。它使用 morphological algorithms 来识别眼球中的 optic disc 和糖尿病视网膜病变的特征,例如 microaneurysms、hemorrhages 和 exudates。
  • results: 这个方法可以帮助提高糖尿病视网膜病变的早期检测和预防,并且可以帮助减少糖尿病导致的视力障碍 casos。
    Abstract Diabetic retinopathy (DR) is a growing health problem worldwide and is a leading cause of visual impairment and blindness, especially among working people aged 20-65. Its incidence is increasing along with the number of diabetes cases, and it is more common in developed countries than in developing countries. Recent research in the field of diabetic retinopathy diagnosis is using advanced technologies, such as analysis of images obtained by ophthalmoscopy. Automatic methods for analyzing eye images based on neural networks, deep learning and image analysis algorithms can improve the efficiency of diagnosis. This paper describes an automatic DR diagnosis method that includes processing and analysis of ophthalmoscopic images of the eye. It uses morphological algorithms to identify the optic disc and lesions characteristic of DR, such as microaneurysms, hemorrhages and exudates. Automated DR diagnosis has the potential to improve the efficiency of early detection of this disease and contribute to reducing the number of cases of diabetes-related visual impairment. The final step was to create an application with a graphical user interface that allowed retinal images taken at cooperating ophthalmology offices to be uploaded to the server. These images were then analyzed using a developed algorithm to make a diagnosis.
    摘要 糖尿病视网膜病变(DR)是全球范围内日益严重的健康问题,是视力损伤和失明的主要原因之一,特别是在20-65岁的劳动年龄人群中。其发病率随着糖尿病病例数的增加而上升,且在发达国家比发展中国家更为常见。近年来,糖尿病视网膜病变诊断领域的研究开始采用先进技术,例如对检眼镜获取的图像进行分析。基于神经网络、深度学习和图像分析算法的眼部图像自动分析方法可以提高诊断效率。本文介绍了一种自动化DR诊断方法,包括对眼底检眼镜图像的处理与分析。该方法使用形态学算法识别视盘以及DR的特征性病变,如微动脉瘤、出血和渗出物。自动化DR诊断有望提高这种疾病早期检测的效率,并有助于减少糖尿病相关视力损伤的病例数量。最后一步是开发一个带有图形用户界面的应用程序,允许将在合作眼科诊所拍摄的视网膜图像上传到服务器,随后使用所开发的算法对这些图像进行分析并给出诊断。
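One classical morphological building block for bright-lesion (e.g. exudate) candidates is a white top-hat followed by thresholding, sketched below with OpenCV; the kernel size and threshold are illustrative and this is only one plausible component of the described pipeline.

```python
import cv2
import numpy as np

def bright_lesion_candidates(fundus_gray, kernel_size=15, threshold=20):
    """Classical morphological step for bright-lesion candidates.

    A white top-hat keeps small bright structures that fit inside the
    structuring element; parameters would need tuning per dataset.
    """
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    tophat = cv2.morphologyEx(fundus_gray, cv2.MORPH_TOPHAT, kernel)
    _, mask = cv2.threshold(tophat, threshold, 255, cv2.THRESH_BINARY)
    return mask

# Usage on a synthetic grayscale image standing in for a fundus photograph.
img = np.zeros((256, 256), dtype=np.uint8)
cv2.circle(img, (128, 128), 4, 200, -1)           # a small bright blob
print(bright_lesion_candidates(img).max())         # 255 where a candidate was found
```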

Global Localization: Utilizing Relative Spatio-Temporal Geometric Constraints from Adjacent and Distant Cameras

  • paper_url: http://arxiv.org/abs/2312.00500
  • repo_url: None
  • paper_authors: Mohammad Altillawi, Zador Pataki, Shile Li, Ziyuan Liu
  • for: 本研究旨在使用单个图像在已知区域中重新定位摄像头,以便在机器视觉应用中实现更高精度的地理位置推断。
  • methods: 本研究提议利用一种新的网络来导引 Deep Network 的本地化。该网络利用相对的空间和时间几何约束来约束 Deep Network 的训练,并同时利用相邻帧和远离帧的相对pose约束来提高本地化的精度。
  • results: 我们在3个常见的视觉本地化数据集上进行了实验,并证明了我们的方法可以在具有少量或非常罕见的三维坐标坐标的情况下进行高精度的本地化。与直接pose估计方法相比,我们的方法在同等条件下表现更好。
    Abstract Re-localizing a camera from a single image in a previously mapped area is vital for many computer vision applications in robotics and augmented/virtual reality. In this work, we address the problem of estimating the 6 DoF camera pose relative to a global frame from a single image. We propose to leverage a novel network of relative spatial and temporal geometric constraints to guide the training of a Deep Network for localization. We employ simultaneously spatial and temporal relative pose constraints that are obtained not only from adjacent camera frames but also from camera frames that are distant in the spatio-temporal space of the scene. We show that our method, through these constraints, is capable of learning to localize when little or very sparse ground-truth 3D coordinates are available. In our experiments, this is less than 1% of available ground-truth data. We evaluate our method on 3 common visual localization datasets and show that it outperforms other direct pose estimation methods.
    摘要 在机器人和增强/虚拟现实等机器视觉应用中,从先前建图区域中的单张图像重新定位相机至关重要。在这项工作中,我们解决了从单张图像估计相机相对于全局坐标系的6自由度位姿的问题。我们提议利用一种新的相对空间与时间几何约束网络来引导用于定位的深度网络的训练。我们同时使用相对空间和时间位姿约束,这些约束不仅来自相邻的相机帧,也来自在场景时空空间中相距较远的相机帧。我们证明,借助这些约束,我们的方法能够在真值3D坐标很少或非常稀疏(在我们的实验中不到可用真值数据的1%)的情况下学会定位。我们在3个常用的视觉定位数据集上评估了该方法,结果表明其优于其他直接位姿估计方法。

Explainable AI in Diagnosing and Anticipating Leukemia Using Transfer Learning Method

  • paper_url: http://arxiv.org/abs/2312.00487
  • repo_url: None
  • paper_authors: Wahidul Hasan Abir, Md. Fahim Uddin, Faria Rahman Khanam, Mohammad Monirujjaman Khan
  • for: 这个研究报告关注了急性лимфо血癌(ALL),一种常见的血液癌,主要影响儿童和青少年,表现为快速增殖的幼细胞(WBCs)。这些异常细胞可以压倒正常细胞,导致严重的健康问题。早期检测ALL的精准性是诊断和治疗效果的关键。
  • methods: 该研究提出了一种自动检测方法,使用计算机助成诊断(CAD)模型,利用深度学习技术提高淋巴癌诊断的准确性和效率。研究使用了不同的传输学习模型,如ResNet101V2、VGG19、InceptionV3和InceptionResNetV2,对ALL进行分类。研究还使用了Local Interpretable Model-Agnostic Explanations(LIME)来确保AI系统的预测的有效性和可靠性。
  • results: 研究发现,使用InceptionV3模型可以达到98.38%的准确率,超过其他测试模型。结果由LIME算法验证,表明这种方法在准确地识别ALL,提供了一个有价值的工具 для医疗专业人员。研究还指出,这种方法的可靠性和可信worthiness,为医疗领域的XAI技术开辟了新的可能性。
    Abstract This research paper focuses on Acute Lymphoblastic Leukemia (ALL), a form of blood cancer prevalent in children and teenagers, characterized by the rapid proliferation of immature white blood cells (WBCs). These atypical cells can overwhelm healthy cells, leading to severe health consequences. Early and accurate detection of ALL is vital for effective treatment and improving survival rates. Traditional diagnostic methods are time-consuming, costly, and prone to errors. The paper proposes an automated detection approach using computer-aided diagnostic (CAD) models, leveraging deep learning techniques to enhance the accuracy and efficiency of leukemia diagnosis. The study utilizes various transfer learning models like ResNet101V2, VGG19, InceptionV3, and InceptionResNetV2 for classifying ALL. The methodology includes using the Local Interpretable Model-Agnostic Explanations (LIME) for ensuring the validity and reliability of the AI system's predictions. This approach is critical for overcoming the "black box" nature of AI, where decisions made by models are often opaque and unaccountable. The paper highlights that the proposed method using the InceptionV3 model achieved an impressive 98.38% accuracy, outperforming other tested models. The results, verified by the LIME algorithm, showcase the potential of this method in accurately identifying ALL, providing a valuable tool for medical practitioners. The research underscores the impact of explainable artificial intelligence (XAI) in medical diagnostics, paving the way for more transparent and trustworthy AI applications in healthcare.
    摘要 本研究论文聚焦于急性淋巴细胞白血病(ALL),这是一种在儿童和青少年中高发的血液癌症,其特征是未成熟白细胞(WBC)的快速增殖。这些异常细胞会压制健康细胞,导致严重的健康后果。对ALL的早期、准确检测对于有效治疗和提高生存率至关重要。传统诊断方法耗时、昂贵且容易出错。本文提出了一种基于计算机辅助诊断(CAD)模型的自动检测方法,利用深度学习技术提高白血病诊断的准确性和效率。研究使用了ResNet101V2、VGG19、InceptionV3和InceptionResNetV2等多种迁移学习模型对ALL进行分类。方法中还使用了局部可解释模型无关解释(LIME),以确保AI系统预测的有效性和可靠性。这一做法对于克服AI的"黑箱"特性至关重要,因为模型的决策往往不透明且难以追责。本文指出,所提出的基于InceptionV3模型的方法取得了98.38%的准确率,优于其他被测试的模型。经LIME算法验证的结果展示了该方法在准确识别ALL方面的潜力,为医疗从业者提供了有价值的工具。该研究强调了可解释人工智能(XAI)在医学诊断中的影响,为医疗领域更透明、更可信的AI应用铺平了道路。
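The LIME step described above can be reproduced with the lime package roughly as sketched below; the classifier_fn wrapper around the fine-tuned InceptionV3 model and the sampling parameters are assumptions.

```python
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

def explain_prediction(image_rgb, classifier_fn, top_k_features=5):
    """Produce a LIME explanation for a single blood-smear image.

    classifier_fn must map a batch of HxWx3 images to class probabilities
    (e.g. a wrapper around the fine-tuned InceptionV3 model); it is assumed
    to exist elsewhere in the pipeline.
    """
    explainer = lime_image.LimeImageExplainer()
    explanation = explainer.explain_instance(
        np.asarray(image_rgb).astype("double"),
        classifier_fn,
        top_labels=1,
        hide_color=0,
        num_samples=1000,          # number of perturbed samples LIME evaluates
    )
    temp, mask = explanation.get_image_and_mask(
        explanation.top_labels[0],
        positive_only=True,
        num_features=top_k_features,
        hide_rest=False,
    )
    return mark_boundaries(temp / 255.0, mask)   # image with highlighted super-pixels
```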

Unfolder: Fast localization and image rectification of a document with a crease from folding in half

  • paper_url: http://arxiv.org/abs/2312.00467
  • repo_url: None
  • paper_authors: A. M. Ershov, D. V. Tropin, E. E. Limonova, D. P. Nikolaev, V. V. Arlazarov
  • for: digitizing folded documents with a smartphone camera
  • methods: proposes a novel approach called Unfolder, which is robust to projective distortions and does not fragment the image in the vicinity of a crease
  • results: achieved a recognition error rate of 0.33, outperforming advanced neural network methods DocTr and DewarpNet, with an average runtime of 0.25 s/image on an iPhone XR.
    Abstract Presentation of folded documents is not an uncommon case in modern society. Digitizing such documents by capturing them with a smartphone camera can be tricky since a crease can divide the document contents into separate planes. To unfold the document, one could hold the edges potentially obscuring it in a captured image. While there are many geometrical rectification methods, they were usually developed for arbitrary bends and folds. We consider such algorithms and propose a novel approach Unfolder developed specifically for images of documents with a crease from folding in half. Unfolder is robust to projective distortions of the document image and does not fragment the image in the vicinity of a crease after rectification. A new Folded Document Images dataset was created to investigate the rectification accuracy of folded (2, 3, 4, and 8 folds) documents. The dataset includes 1600 images captured when document placed on a table and when held in hand. The Unfolder algorithm allowed for a recognition error rate of 0.33, which is better than the advanced neural network methods DocTr (0.44) and DewarpNet (0.57). The average runtime for Unfolder was only 0.25 s/image on an iPhone XR.

Learning Unorthogonalized Matrices for Rotation Estimation

  • paper_url: http://arxiv.org/abs/2312.00462
  • repo_url: None
  • paper_authors: Kerui Gu, Zhihao Li, Shiyong Liu, Jianzhuang Liu, Songcen Xu, Youliang Yan, Michael Bi Mi, Kenji Kawaguchi, Angela Yao
  • for: pose estimation tasks
  • methods: learning unorthogonalized 'Pseudo' Rotation Matrices (PRoM)
  • results: state-of-the-art results on large-scale benchmarks for human pose estimation
    Abstract Estimating 3D rotations is a common procedure for 3D computer vision. The accuracy depends heavily on the rotation representation. One form of representation -- rotation matrices -- is popular due to its continuity, especially for pose estimation tasks. The learning process usually incorporates orthogonalization to ensure orthonormal matrices. Our work reveals, through gradient analysis, that common orthogonalization procedures based on the Gram-Schmidt process and singular value decomposition will slow down training efficiency. To this end, we advocate removing orthogonalization from the learning process and learning unorthogonalized `Pseudo' Rotation Matrices (PRoM). An optimization analysis shows that PRoM converges faster and to a better solution. By replacing the orthogonalization incorporated representation with our proposed PRoM in various rotation-related tasks, we achieve state-of-the-art results on large-scale benchmarks for human pose estimation.
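
A minimal sketch of the core idea, assuming a PyTorch setup: the network regresses a raw 3x3 "pseudo" rotation matrix and is trained without any orthogonalization in the graph, with an SVD projection onto SO(3) applied only at inference. The head dimensions and loss are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PseudoRotationHead(nn.Module):
    def __init__(self, in_dim: int = 256):
        super().__init__()
        self.fc = nn.Linear(in_dim, 9)  # raw 3x3 matrix, no orthogonalization

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.fc(feat).view(-1, 3, 3)

def project_to_so3(m: torch.Tensor) -> torch.Tensor:
    # SVD projection used only at test time to obtain a valid rotation.
    u, _, vT = torch.linalg.svd(m)
    det = torch.det(u @ vT)
    d = torch.diag_embed(torch.stack(
        [torch.ones_like(det), torch.ones_like(det), det], dim=-1))
    return u @ d @ vT

# Training: plain regression against the ground-truth rotation matrix,
# so gradients never pass through Gram-Schmidt or SVD.
head = PseudoRotationHead()
feat = torch.randn(4, 256)
gt_rot = torch.eye(3).expand(4, 3, 3)
loss = nn.functional.mse_loss(head(feat), gt_rot)
loss.backward()

# Inference: orthogonalize once, outside the training graph.
with torch.no_grad():
    rot = project_to_so3(head(feat))
```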

An Encoding Framework for Binarized Images using HyperDimensional Computing

  • paper_url: http://arxiv.org/abs/2312.00454
  • repo_url: None
  • paper_authors: Laura Smets, Werner Van Leekwijck, Ing Jyh Tsang, Steven Latré
  • for: a lightweight machine-learning approach intended for the wearable internet of things, near-sensor AI applications, and on-device processing
  • methods: an encoding of binarized images that relies only on native hyperdimensional (HD) arithmetic vector operations and preserves the similarity of patterns at nearby locations, using point-of-interest selection and local linear mapping
  • results: reaches 97.35% test accuracy on MNIST and 84.12% on Fashion-MNIST, outperforming other baseline HDC encodings and showing higher robustness to noise and blur
    Abstract Hyperdimensional Computing (HDC) is a brain-inspired and light-weight machine learning method. It has received significant attention in the literature as a candidate to be applied in the wearable internet of things, near-sensor artificial intelligence applications and on-device processing. HDC is computationally less complex than traditional deep learning algorithms and typically achieves moderate to good classification performance. A key aspect that determines the performance of HDC is the encoding of the input data to the hyperdimensional (HD) space. This article proposes a novel light-weight approach relying only on native HD arithmetic vector operations to encode binarized images that preserves similarity of patterns at nearby locations by using point of interest selection and local linear mapping. The method reaches an accuracy of 97.35% on the test set for the MNIST data set and 84.12% for the Fashion-MNIST data set. These results outperform other studies using baseline HDC with different encoding approaches and are on par with more complex hybrid HDC models. The proposed encoding approach also demonstrates a higher robustness to noise and blur compared to the baseline encoding.
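
For context, the sketch below shows a generic baseline HDC image encoding with bipolar hypervectors (bind position and value, bundle, binarize). It is not the paper's point-of-interest/local-linear-mapping scheme, only an illustration of the native HD arithmetic the encoding builds on.

```python
import numpy as np

D = 10_000  # hypervector dimensionality
rng = np.random.default_rng(0)

def random_hv() -> np.ndarray:
    return rng.choice([-1, 1], size=D)

# Codebooks: one hypervector per pixel position and per binary pixel value.
H, W = 28, 28
pos_hvs = {(r, c): random_hv() for r in range(H) for c in range(W)}
val_hvs = {0: random_hv(), 1: random_hv()}

def encode(img_bin: np.ndarray) -> np.ndarray:
    # Bind each position with its pixel value (elementwise product),
    # then bundle (sum) across the image and binarize with the sign function.
    acc = np.zeros(D)
    for r in range(H):
        for c in range(W):
            acc += pos_hvs[(r, c)] * val_hvs[int(img_bin[r, c])]
    return np.sign(acc)

img = (rng.random((H, W)) > 0.5).astype(int)
hv = encode(img)
print(hv.shape)  # (10000,)
```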

Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence

  • paper_url: http://arxiv.org/abs/2312.00452
  • repo_url: None
  • paper_authors: Yajie Liu, Pu Ge, Haoxiang Ma, Shichao Fan, Qingjie Liu, Di Huang, Yunhong Wang
  • for: improving the generalization of referring image segmentation (RIS) to varied text expressions and unseen visual entities
  • methods: two components: (1) a multi-modal fusion aggregation module with visual guidance from a powerful pretrained model, which exploits spatial relations and pixel coherence to handle incomplete target masks and false-positive clumps; and (2) a target prompt that boosts a given expression with an explicit and crucial prompt, complementing it in a unified context so the target can be captured despite changes in linguistic style
  • results: consistent gains over the state of the art in zero-shot cross-dataset settings, with mIoU improvements of 4.15%, 5.45%, and 4.64% on RefCOCO, RefCOCO+, and ReferIt respectively, plus good generalization to new scenarios on GraspNet-RIS
    Abstract Referring image segmentation (RIS) aims to segment objects in an image conditioning on free-from text descriptions. Despite the overwhelming progress, it still remains challenging for current approaches to perform well on cases with various text expressions or with unseen visual entities, limiting its further application. In this paper, we present a novel RIS approach, which substantially improves the generalization ability by addressing the two dilemmas mentioned above. Specially, to deal with unconstrained texts, we propose to boost a given expression with an explicit and crucial prompt, which complements the expression in a unified context, facilitating target capturing in the presence of linguistic style changes. Furthermore, we introduce a multi-modal fusion aggregation module with visual guidance from a powerful pretrained model to leverage spatial relations and pixel coherences to handle the incomplete target masks and false positive irregular clumps which often appear on unseen visual entities. Extensive experiments are conducted in the zero-shot cross-dataset settings and the proposed approach achieves consistent gains compared to the state-of-the-art, e.g., 4.15\%, 5.45\%, and 4.64\% mIoU increase on RefCOCO, RefCOCO+ and ReferIt respectively, demonstrating its effectiveness. Additionally, the results on GraspNet-RIS show that our approach also generalizes well to new scenarios with large domain shifts.

FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting

  • paper_url: http://arxiv.org/abs/2312.00451
  • repo_url: https://github.com/VITA-Group/FSGS
  • paper_authors: Zehao Zhu, Zhiwen Fan, Yifan Jiang, Zhangyang Wang
  • for: novel view synthesis from limited observations, using 3D Gaussian Splatting to achieve real-time, photo-realistic results with as few as three training views
  • methods: a few-shot view synthesis framework based on 3D Gaussian Splatting that handles extremely sparse initialized SfM points with a carefully designed Gaussian Unpooling process and integrates a large-scale pre-trained monocular depth estimator to guide the geometric optimization
  • results: the proposed FSGS achieves state-of-the-art accuracy and rendering efficiency across diverse datasets, including LLFF, Mip-NeRF360, and Blender
    Abstract Novel view synthesis from limited observations remains an important and persistent task. However, high efficiency in existing NeRF-based few-shot view synthesis is often compromised to obtain an accurate 3D representation. To address this challenge, we propose a few-shot view synthesis framework based on 3D Gaussian Splatting that enables real-time and photo-realistic view synthesis with as few as three training views. The proposed method, dubbed FSGS, handles the extremely sparse initialized SfM points with a thoughtfully designed Gaussian Unpooling process. Our method iteratively distributes new Gaussians around the most representative locations, subsequently infilling local details in vacant areas. We also integrate a large-scale pre-trained monocular depth estimator within the Gaussians optimization process, leveraging online augmented views to guide the geometric optimization towards an optimal solution. Starting from sparse points observed from limited input viewpoints, our FSGS can accurately grow into unseen regions, comprehensively covering the scene and boosting the rendering quality of novel views. Overall, FSGS achieves state-of-the-art performance in both accuracy and rendering efficiency across diverse datasets, including LLFF, Mip-NeRF360, and Blender. Project website: https://zehaozhu.github.io/FSGS/.
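
A toy illustration of the densification idea behind Gaussian Unpooling, under the assumption that new Gaussians are seeded between each center and its nearest neighbours; the actual placement rule and attribute initialization in FSGS differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def gaussian_unpooling(centers: np.ndarray, k: int = 2) -> np.ndarray:
    """Toy densification step: for each Gaussian center, place a new
    Gaussian halfway to each of its k nearest neighbours."""
    tree = cKDTree(centers)
    # query k+1 neighbours because the nearest neighbour of a point is itself
    _, idx = tree.query(centers, k=k + 1)
    new_pts = []
    for i, nbrs in enumerate(idx):
        for j in nbrs[1:]:
            new_pts.append(0.5 * (centers[i] + centers[j]))
    return np.vstack([centers, np.array(new_pts)])

sparse_sfm_points = np.random.rand(100, 3)   # stand-in for sparse SfM points
dense = gaussian_unpooling(sparse_sfm_points)
print(dense.shape)  # (300, 3)
```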

Dolphins: Multimodal Language Model for Driving

  • paper_url: http://arxiv.org/abs/2312.00438
  • repo_url: None
  • paper_authors: Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, Chaowei Xiao
  • for: developing a conversational driving assistant with human-like understanding for autonomous vehicles navigating complex real-world scenarios
  • methods: Dolphins, a vision-language model that processes multimodal inputs (video data, text instructions, and historical control signals) and generates corresponding outputs; built on the open-source pretrained VLM OpenFlamingo, its reasoning is first enhanced with a Grounded Chain of Thought (GCoT) process and it is then tailored to the driving domain by constructing driving-specific instruction data and performing instruction tuning
  • results: using the BDD-X dataset, four distinct AV tasks are consolidated into Dolphins; the resulting model (1) provides a comprehensive understanding of complex, long-tailed open-world driving scenarios and solves a spectrum of AV tasks, and (2) exhibits human-like capabilities such as gradient-free instant adaptation via in-context learning and error recovery via reflection
    Abstract The quest for fully autonomous vehicles (AVs) capable of navigating complex real-world scenarios with human-like understanding and responsiveness. In this paper, we introduce Dolphins, a novel vision-language model architected to imbibe human-like abilities as a conversational driving assistant. Dolphins is adept at processing multimodal inputs comprising video (or image) data, text instructions, and historical control signals to generate informed outputs corresponding to the provided instructions. Building upon the open-sourced pretrained Vision-Language Model, OpenFlamingo, we first enhance Dolphins's reasoning capabilities through an innovative Grounded Chain of Thought (GCoT) process. Then we tailored Dolphins to the driving domain by constructing driving-specific instruction data and conducting instruction tuning. Through the utilization of the BDD-X dataset, we designed and consolidated four distinct AV tasks into Dolphins to foster a holistic understanding of intricate driving scenarios. As a result, the distinctive features of Dolphins are characterized into two dimensions: (1) the ability to provide a comprehensive understanding of complex and long-tailed open-world driving scenarios and solve a spectrum of AV tasks, and (2) the emergence of human-like capabilities including gradient-free instant adaptation via in-context learning and error recovery via reflection.

Enhancing Image Captioning with Neural Models

  • paper_url: http://arxiv.org/abs/2312.00435
  • repo_url: None
  • paper_authors: Pooja Bhatnagar, Sai Mrunaal, Sachin Kamnure
  • for: neural image captioning with deep learning models
  • methods: investigates different neural architecture configurations, focusing on the inject architecture, and proposes a novel quality metric for evaluating caption generation
  • results: although merge models exhibit a larger vocabulary and higher ROUGE scores, the inject architecture generates relevant and concise captions; refining the training data and optimizing hyperparameters further improves model performance
    Abstract This research explores the realm of neural image captioning using deep learning models. The study investigates the performance of different neural architecture configurations, focusing on the inject architecture, and proposes a novel quality metric for evaluating caption generation. Through extensive experimentation and analysis, this work sheds light on the challenges and opportunities in image captioning, providing insights into model behavior and overfitting. The results reveal that while the merge models exhibit a larger vocabulary and higher ROUGE scores, the inject architecture generates relevant and concise image captions. The study also highlights the importance of refining training data and optimizing hyperparameters for improved model performance. This research contributes to the growing body of knowledge in neural image captioning and encourages further exploration in the field, emphasizing the democratization of artificial intelligence.
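
The inject/merge distinction is easy to see in code. The sketch below is a minimal inject-style captioner (the image feature is fed to the language model as if it were the first token); the dimensions and the LSTM decoder are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class InjectCaptioner(nn.Module):
    """'Inject' architecture sketch: the image feature enters the language
    model as the first token, rather than being merged with the RNN output
    afterwards (the 'merge' architecture)."""
    def __init__(self, vocab_size=5000, img_dim=2048, emb_dim=256, hid=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, img_feat, tokens):
        img_tok = self.img_proj(img_feat).unsqueeze(1)   # (B, 1, E)
        word_toks = self.embed(tokens)                   # (B, T, E)
        seq = torch.cat([img_tok, word_toks], dim=1)     # inject image first
        hidden, _ = self.rnn(seq)
        return self.out(hidden[:, 1:, :])                # predict next words

model = InjectCaptioner()
logits = model(torch.randn(2, 2048), torch.randint(0, 5000, (2, 12)))
print(logits.shape)  # (2, 12, 5000)
```

A merge architecture would instead keep the RNN purely linguistic and combine the image feature with the RNN output just before the final softmax.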

A Low-Power Neuromorphic Approach for Efficient Eye-Tracking

  • paper_url: http://arxiv.org/abs/2312.00425
  • repo_url: https://github.com/pbonazzi/retina
  • paper_authors: Pietro Bonazzi, Sizhen Bian, Giovanni Lippolis, Yawei Li, Sadique Sheik, Michele Magno
  • for: a neuromorphic eye-tracking method that works on pure event data to improve the precision and efficiency of eye-tracking systems
  • methods: a directly trained spiking neural network (SNN) regression model deployed on Speck, a state-of-the-art low-power edge neuromorphic processor
  • results: higher precision and lower computational complexity than prior event-based eye tracking, with a pupil centroid error of 3.24 pixels on a 64x64 DVS input (1.24 pixels better than the latest event-based method), an end-to-end power consumption of 2.89-4.8 mW, and a latency of 5.57-8.01 ms depending on the time window
    Abstract This paper introduces a neuromorphic methodology for eye tracking, harnessing pure event data captured by a Dynamic Vision Sensor (DVS) camera. The framework integrates a directly trained Spiking Neural Network (SNN) regression model and leverages a state-of-the-art low power edge neuromorphic processor - Speck, collectively aiming to advance the precision and efficiency of eye-tracking systems. First, we introduce a representative event-based eye-tracking dataset, "Ini-30", which was collected with two glass-mounted DVS cameras from thirty volunteers. Then, an SNN model based on Integrate-and-Fire (IAF) neurons, named "Retina", is described, featuring only 64k parameters (6.63x fewer than the latest) and achieving a pupil tracking error of only 3.24 pixels in a 64x64 DVS input. The continuous regression output is obtained by means of convolution using a non-spiking temporal 1D filter slid across the output spiking layer. Finally, we evaluate Retina on the neuromorphic processor, showing an end-to-end power between 2.89-4.8 mW and a latency of 5.57-8.01 ms dependent on the time window. We also benchmark our model against the latest event-based eye-tracking method, "3ET", which was built upon event frames. Results show that Retina achieves superior precision with 1.24px less pupil centroid error and reduced computational complexity with 35 times fewer MAC operations. We hope this work will open avenues for further investigation of closed-loop neuromorphic solutions and true event-based training pursuing edge performance.

Towards Explaining Satellite Based Poverty Predictions with Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2312.00416
  • repo_url: None
  • paper_authors: Hamid Sarmadi, Thorsteinn Rögnvaldsson, Nils Roger Carlsson, Mattias Ohlsson, Ibrahim Wahab, Ola Hall
  • for: explaining the basis on which deep convolutional neural networks (CNNs) predict poverty indicators from satellite images
  • methods: detailed analysis of the CNN's responses, with multiple explainability experiments and visualizations that reveal what the predictions are based on
  • results: a CNN trained on relatively low-resolution day- and night-time satellite images outperforms human subjects, who view high-resolution images, at ranking Wealth Index categories
    Abstract Deep convolutional neural networks (CNNs) have been shown to predict poverty and development indicators from satellite images with surprising accuracy. This paper presents a first attempt at analyzing the CNNs responses in detail and explaining the basis for the predictions. The CNN model, while trained on relatively low resolution day- and night-time satellite images, is able to outperform human subjects who look at high-resolution images in ranking the Wealth Index categories. Multiple explainability experiments performed on the model indicate the importance of the sizes of the objects, pixel colors in the image, and provide a visualization of the importance of different structures in input images. A visualization is also provided of type images that maximize the network prediction of Wealth Index, which provides clues on what the CNN prediction is based on.

Large-scale Vision-Language Models Learn Super Images for Efficient and High-Performance Partially Relevant Video Retrieval

  • paper_url: http://arxiv.org/abs/2312.00414
  • repo_url: None
  • paper_authors: Taichi Nishimura, Shota Nakada, Masayoshi Kondo
  • for: partially relevant video retrieval (PRVR)
  • methods: super images (video frames rearranged into an N x N grid) encoded by large-scale vision-language models with a simple query-image attention trick, plus a fine-tuning approach with a few trainable modules to further boost performance
  • results: achieves the best performance on ActivityNet Captions and TVR, and shows promising zero-shot performance against state-of-the-art (SOTA) methods while remaining efficient
    Abstract In this paper, we propose an efficient and high-performance method for partially relevant video retrieval (PRVR), which aims to retrieve untrimmed long videos that contain at least one relevant moment to the input text query. In terms of both efficiency and performance, the overlooked bottleneck of previous studies is the visual encoding of dense frames. This guides researchers to choose lightweight visual backbones, yielding sub-optimal retrieval performance due to their limited capabilities of learned visual representations. However, it is undesirable to simply replace them with high-performance large-scale vision-and-language models (VLMs) due to their low efficiency. To address these issues, instead of dense frames, we focus on super images, which are created by rearranging the video frames in a $N \times N$ grid layout. This reduces the number of visual encodings to $\frac{1}{N^2}$ and compensates for the low efficiency of large-scale VLMs, allowing us to adopt them as powerful encoders. Surprisingly, we discover that with a simple query-image attention trick, VLMs generalize well to super images effectively and demonstrate promising zero-shot performance against SOTA methods efficiently. In addition, we propose a fine-tuning approach by incorporating a few trainable modules into the VLM backbones. The experimental results demonstrate that our approaches efficiently achieve the best performance on ActivityNet Captions and TVR.
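
The super-image construction itself is straightforward; a NumPy sketch under the assumption of uniform frame sampling is shown below (frame count, grid size, and resolution are illustrative).

```python
import numpy as np

def make_super_image(frames: np.ndarray, n: int) -> np.ndarray:
    """Arrange T >= n*n sampled video frames (T, H, W, C) into one
    n x n 'super image' of shape (n*H, n*W, C)."""
    t, h, w, c = frames.shape
    assert t >= n * n, "not enough frames for the grid"
    # uniformly sample n*n frames across the clip
    idx = np.linspace(0, t - 1, n * n).round().astype(int)
    grid = frames[idx].reshape(n, n, h, w, c)
    # (n, n, H, W, C) -> (n*H, n*W, C)
    return grid.transpose(0, 2, 1, 3, 4).reshape(n * h, n * w, c)

video = np.random.randint(0, 255, (40, 224, 224, 3), dtype=np.uint8)
super_img = make_super_image(video, n=3)
print(super_img.shape)  # (672, 672, 3)
```

With an N x N grid, the number of visual encodings drops by a factor of N^2, which is what makes the heavy VLM encoder affordable.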

SCHEME: Scalable Channel Mixer for Vision Transformers

  • paper_url: http://arxiv.org/abs/2312.00412
  • repo_url: None
  • paper_authors: Deepak Sridhar, Yunsheng Li, Nuno Vasconcelos
  • For: This paper aims to improve the performance of Vision Transformers (ViT) by studying the channel mixer or feature mixing block, specifically by using sparse feature mixing and a lightweight channel covariance attention mechanism.* Methods: The authors propose a block diagonal MLP structure to improve the accuracy of the feature clusters, and introduce a new family of SCHEMEformer models that can be plugged into any ViT architecture to obtain different trade-offs between complexity and performance.* Results: The proposed method achieves substantial accuracy gains over existing designs, especially under lower FLOPs regimes, on image classification, object detection, and semantic segmentation tasks with different ViT backbones. For example, the SCHEMEformer establishes a new SOTA of 79.7% accuracy for ViTs using pure attention mixers on ImageNet-1K at 1.77G FLOPs.
    Abstract Vision Transformers have received significant attention due to their impressive performance in many vision tasks. While the token mixer or attention block has been studied in great detail, the channel mixer or feature mixing block (FFN or MLP) has not been explored in depth albeit it accounts for a bulk of the parameters and computation in a model. In this work, we study whether sparse feature mixing can replace the dense connections and confirm this with a block diagonal MLP structure that improves the accuracy by supporting larger expansion ratios. To improve the feature clusters formed by this structure and thereby further improve the accuracy, a lightweight, parameter-free, channel covariance attention (CCA) mechanism is introduced as a parallel branch during training. This design of CCA enables gradual feature mixing across channel groups during training whose contribution decays to zero as the training progresses to convergence. This allows the CCA block to be discarded during inference, thus enabling enhanced performance with no additional computational cost. The resulting $\textit{Scalable CHannEl MixEr}$ (SCHEME) can be plugged into any ViT architecture to obtain a gamut of models with different trade-offs between complexity and performance by controlling the block diagonal structure size in the MLP. This is shown by the introduction of a new family of SCHEMEformer models. Experiments on image classification, object detection, and semantic segmentation, with different ViT backbones, consistently demonstrate substantial accuracy gains over existing designs, especially under lower FLOPs regimes. For example, the SCHEMEformer establishes a new SOTA of 79.7% accuracy for ViTs using pure attention mixers on ImageNet-1K at 1.77G FLOPs.
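
A hedged sketch of the block-diagonal channel mixer: grouped 1x1 convolutions give a block-diagonal weight matrix, so the expansion ratio can grow at the same parameter cost. The CCA branch used during training is omitted, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class BlockDiagonalMLP(nn.Module):
    """Channel mixer in which the dense FFN weight matrices are replaced by
    block-diagonal ones, implemented with grouped 1x1 convolutions.
    `groups` is the number of diagonal blocks."""
    def __init__(self, dim=384, expansion=8, groups=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Conv1d(dim, hidden, kernel_size=1, groups=groups)
        self.act = nn.GELU()
        self.fc2 = nn.Conv1d(hidden, dim, kernel_size=1, groups=groups)

    def forward(self, x):            # x: (B, N, C) token sequence
        x = x.transpose(1, 2)        # (B, C, N) for channel-wise convolution
        x = self.fc2(self.act(self.fc1(x)))
        return x.transpose(1, 2)

mixer = BlockDiagonalMLP()
tokens = torch.randn(2, 196, 384)
print(mixer(tokens).shape)  # (2, 196, 384)
```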

VIoTGPT: Learning to Schedule Vision Tools towards Intelligent Video Internet of Things

  • paper_url: http://arxiv.org/abs/2312.00401
  • repo_url: None
  • paper_authors: Yaoyao Zhong, Mengshi Qi, Rui Wang, Yuhan Qiu, Yang Zhang, Huadong Ma
  • for: addressing the challenges of fine-grained and interrelated vision tool usage in the Video Internet of Things (VIoT) by building an LLM-based framework that correctly interacts with humans, queries knowledge videos, and invokes vision models to accomplish complicated tasks
  • methods: semi-automatic annotations to craft a training dataset and establish benchmarks involving 11 representative vision models across three categories, together with ReAct instruction tuning to teach the LLM how to use the tools
  • results: quantitative and qualitative experiments demonstrate that VIoTGPT is effective at interacting with humans, querying knowledge videos, and invoking vision models to accomplish complicated tasks
    Abstract Video Internet of Things (VIoT) has shown full potential in collecting an unprecedented volume of video data. Learning to schedule perceiving models and analyzing the collected videos intelligently will be potential sparks for VIoT. In this paper, to address the challenges posed by the fine-grained and interrelated vision tool usage of VIoT, we build VIoTGPT, the framework based on LLMs to correctly interact with humans, query knowledge videos, and invoke vision models to accomplish complicated tasks. To support VIoTGPT and related future works, we meticulously crafted the training dataset and established benchmarks involving 11 representative vision models across three categories based on semi-automatic annotations. To guide LLM to act as the intelligent agent towards intelligent VIoT, we resort to ReAct instruction tuning based on the collected VIoT dataset to learn the tool capability. Quantitative and qualitative experimental results and analyses demonstrate the effectiveness of VIoTGPT.
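
A schematic ReAct-style scheduling loop in the spirit of VIoTGPT is sketched below. The tool registry, tool names, and `call_llm` are placeholders rather than the paper's interfaces; the point is the thought/action/observation cycle through which the LLM invokes vision tools.

```python
from typing import Callable, Dict

TOOLS: Dict[str, Callable[[str], str]] = {
    "person_reid": lambda arg: f"[person_reid output for {arg}]",
    "license_plate_ocr": lambda arg: f"[plate text for {arg}]",
}

def call_llm(prompt: str) -> str:
    # placeholder for the instruction-tuned LLM
    return "Thought: need to identify the person.\nAction: person_reid[camera_3.mp4]"

def react_agent(query: str, max_steps: int = 5) -> str:
    transcript = f"Question: {query}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)
        transcript += step + "\n"
        if "Action:" not in step:
            return step  # the model answered directly
        action = step.split("Action:")[1].strip()
        name, arg = action.split("[", 1)
        observation = TOOLS[name.strip()](arg.rstrip("]"))
        transcript += f"Observation: {observation}\n"
    return transcript

print(react_agent("Who entered the lobby at 9am?"))
```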

Learning to Estimate Critical Gait Parameters from Single-View RGB Videos with Transformer-Based Attention Network

  • paper_url: http://arxiv.org/abs/2312.00398
  • repo_url: None
  • paper_authors: Quoc Hung T. Le, Hieu H. Pham
  • for: supporting early diagnosis and treatment of movement difficulties in patients with musculoskeletal diseases and cognitive impairments
  • methods: a novel spatio-temporal Transformer network that estimates critical gait parameters from single-view RGB videos
  • results: on a public dataset of cerebral palsy patients, the framework surpasses state-of-the-art approaches in predicting general gait parameters (walking speed, Gait Deviation Index - GDI, and knee flexion angle at maximum extension) while using fewer parameters and removing the need for manual feature extraction
    Abstract Musculoskeletal diseases and cognitive impairments in patients lead to difficulties in movement as well as negative effects on their psychological health. Clinical gait analysis, a vital tool for early diagnosis and treatment, traditionally relies on expensive optical motion capture systems. Recent advances in computer vision and deep learning have opened the door to more accessible and cost-effective alternatives. This paper introduces a novel spatio-temporal Transformer network to estimate critical gait parameters from RGB videos captured by a single-view camera. Empirical evaluations on a public dataset of cerebral palsy patients indicate that the proposed framework surpasses current state-of-the-art approaches and show significant improvements in predicting general gait parameters (including Walking Speed, Gait Deviation Index - GDI, and Knee Flexion Angle at Maximum Extension), while utilizing fewer parameters and alleviating the need for manual feature extraction.

Study and Survey on Gesture Recognition Systems

  • paper_url: http://arxiv.org/abs/2312.00392
  • repo_url: None
  • paper_authors: Kshitij Deshpande, Varad Mashalkar, Kaustubh Mhaisekar, Amaan Naikwadi, Archana Ghotkar
  • for: a survey of gesture recognition systems and their applications in sectors such as gaming, healthcare, home appliances, industrial robots, and virtual reality
  • methods: compares and contrasts different methodologies for capturing gestures, including computer-vision and deep-learning approaches, and discusses data sources and acquisition techniques
  • results: highlights the broad application potential of gesture recognition across domains, reviews its role in sign language, and identifies common challenges faced when building such systems, such as recognition accuracy and data acquisition and processing
    Abstract In recent years, there has been a considerable amount of research in the Gesture Recognition domain, mainly owing to the technological advancements in Computer Vision. Various new applications have been conceptualised and developed in this field. This paper discusses the implementation of gesture recognition systems in multiple sectors such as gaming, healthcare, home appliances, industrial robots, and virtual reality. Different methodologies for capturing gestures are compared and contrasted throughout this survey. Various data sources and data acquisition techniques have been discussed. The role of gestures in sign language has been studied and existing approaches have been reviewed. Common challenges faced while building gesture recognition systems have also been explored.

Partition-based K-space Synthesis for Multi-contrast Parallel Imaging

  • paper_url: http://arxiv.org/abs/2312.00387
  • repo_url: None
  • paper_authors: Yuxia Huang, Zhonghui Wu, Xiaoling Xu, Minghui Zhang, Shanshan Wang, Qiegen Liu
  • for: improving the efficiency and image quality of multi-contrast magnetic resonance imaging
  • methods: partition-based k-space synthesis (PKS), which decomposes fully-sampled T1 and under-sampled T2 k-space data into sub-data and fuses their features to improve T2-weighted reconstruction quality
  • results: experiments show that combining T1 and T2 data achieves comparable or better image quality than conventional k-space parallel imaging (SAKE) that processes each contrast independently, while reducing the overall imaging time
    Abstract Multi-contrast magnetic resonance imaging is a significant and essential medical imaging technique. However, multi-contrast imaging has a longer acquisition time and is prone to motion artifacts. In particular, the acquisition time for a T2-weighted image is prolonged due to its longer repetition time (TR). On the contrary, a T1-weighted image has a shorter TR. Therefore, utilizing complementary information across T1- and T2-weighted images is a way to decrease the overall imaging time. Previous T1-assisted T2 reconstruction methods have mostly focused on the image domain using whole-image fusion approaches. Image-domain reconstruction methods suffer from high computational complexity and limited flexibility. To address this issue, we propose a novel multi-contrast imaging method called partition-based k-space synthesis (PKS) which achieves superior reconstruction quality of the T2-weighted image by feature fusion. Concretely, we first decompose the fully-sampled T1 k-space data and the under-sampled T2 k-space data into two sub-datasets each. Two new objects are then constructed by combining the sub-T1/T2 data, and these objects are used as the whole data to reconstruct the T2-weighted image. Finally, the target T2 is synthesized by extracting the sub-T2 data of each part. Experimental results showed that our combined technique can achieve comparable or better results than traditional k-space parallel imaging (SAKE) that processes each contrast independently.

Local monotone operator learning using non-monotone operators: MnM-MOL

  • paper_url: http://arxiv.org/abs/2312.00386
  • repo_url: None
  • paper_authors: Maneesh John, Jyothi Rikhab Chand, Mathews Jacob
  • for: improving the recovery of undersampled MR images and extending existing iterative reconstruction algorithms
  • methods: builds on deep equilibrium models and monotone operator learning (MOL), which reduce memory demand during training but typically lose performance because of the Lipschitz/monotone constraint; the paper relaxes this constraint in two ways, imposing the monotone condition on the sum of the data-term gradient and the CNN block (a convex-non-convex regularization strategy) and enforcing it only in a local neighborhood around the image manifold
  • results: theoretical analysis and experiments show improved performance together with robustness to input perturbations
    Abstract The recovery of magnetic resonance (MR) images from undersampled measurements is a key problem that has seen extensive research in recent years. Unrolled approaches, which rely on end-to-end training of convolutional neural network (CNN) blocks within iterative reconstruction algorithms, offer state-of-the-art performance. These algorithms require a large amount of memory during training, making them difficult to employ in high-dimensional applications. Deep equilibrium (DEQ) models and the recent monotone operator learning (MOL) approach were introduced to eliminate the need for unrolling, thus reducing the memory demand during training. Both approaches require a Lipschitz constraint on the network to ensure that the forward and backpropagation iterations converge. Unfortunately, the constraint often results in reduced performance compared to unrolled methods. The main focus of this work is to relax the constraint on the CNN block in two different ways. Inspired by convex-non-convex regularization strategies, we now impose the monotone constraint on the sum of the gradient of the data term and the CNN block, rather than constrain the CNN itself to be a monotone operator. This approach enables the CNN to learn possibly non-monotone score functions, which can translate to improved performance. In addition, we only restrict the operator to be monotone in a local neighborhood around the image manifold. Our theoretical results show that the proposed algorithm is guaranteed to converge to the fixed point and that the solution is robust to input perturbations, provided that it is initialized close to the true solution. Our empirical results show that the relaxed constraints translate to improved performance and that the approach enjoys robustness to input perturbations similar to MOL.
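
One way to read the relaxed constraint is as a penalty on local monotonicity violations of the combined operator F (data-term gradient plus CNN) around training images. The sketch below is an assumed formulation for illustration, not the authors' implementation; the data-term gradient is simplified to the identity.

```python
# Monotonicity of F requires <F(x) - F(y), x - y> >= m * ||x - y||^2.
# With y = x + e this becomes <F(x + e) - F(x), e> >= m * ||e||^2, and the
# penalty below charges any deficit for small random perturbations e.
import torch

def local_monotone_penalty(F, x, eps=1e-2, m=0.05):
    e = eps * torch.randn_like(x)
    diff = F(x + e) - F(x)
    inner = (diff * e).flatten(1).sum(dim=1)   # <F(x+e) - F(x), e>
    sq = (e * e).flatten(1).sum(dim=1)         # ||e||^2
    return torch.relu(m * sq - inner).mean()

cnn = torch.nn.Sequential(torch.nn.Conv2d(1, 8, 3, padding=1),
                          torch.nn.ReLU(),
                          torch.nn.Conv2d(8, 1, 3, padding=1))

def F(x):
    # data-term gradient assumed identity-like here for simplicity
    return x + cnn(x)

imgs = torch.randn(4, 1, 64, 64)
print(local_monotone_penalty(F, imgs))
```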

NeuSG: Neural Implicit Surface Reconstruction with 3D Gaussian Splatting Guidance

  • paper_url: http://arxiv.org/abs/2312.00846
  • repo_url: None
  • paper_authors: Hanlin Chen, Chen Li, Gim Hee Lee
  • for: improving the accuracy of multi-view 3D reconstruction with neural implicit surface reconstruction
  • methods: uses 3D Gaussian Splatting as guidance, with a scale regularizer that keeps the generated point cloud close to the surface by forcing the Gaussians to be extremely thin, and refines the point cloud using normal priors predicted by the neural implicit model
  • results: recovers complete surfaces with intricate details; experiments on Tanks and Temples verify the improvement in reconstruction accuracy
    Abstract Existing neural implicit surface reconstruction methods have achieved impressive performance in multi-view 3D reconstruction by leveraging explicit geometry priors such as depth maps or point clouds as regularization. However, the reconstruction results still lack fine details because of the over-smoothed depth map or sparse point cloud. In this work, we propose a neural implicit surface reconstruction pipeline with guidance from 3D Gaussian Splatting to recover highly detailed surfaces. The advantage of 3D Gaussian Splatting is that it can generate dense point clouds with detailed structure. Nonetheless, a naive adoption of 3D Gaussian Splatting can fail since the generated points are the centers of 3D Gaussians that do not necessarily lie on the surface. We thus introduce a scale regularizer to pull the centers close to the surface by enforcing the 3D Gaussians to be extremely thin. Moreover, we propose to refine the point cloud from 3D Gaussians Splatting with the normal priors from the surface predicted by neural implicit models instead of using a fixed set of points as guidance. Consequently, the quality of surface reconstruction improves from the guidance of the more accurate 3D Gaussian splatting. By jointly optimizing the 3D Gaussian Splatting and the neural implicit model, our approach benefits from both representations and generates complete surfaces with intricate details. Experiments on Tanks and Temples verify the effectiveness of our proposed method.
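
The two regularizers can be sketched as simple losses: one driving each Gaussian's smallest scale toward zero (extreme thinness) and one pulling centers onto the zero level set of the implicit surface. The exact forms below are assumptions; a sphere SDF stands in for the neural implicit model.

```python
import torch

def thinness_loss(scales: torch.Tensor) -> torch.Tensor:
    # Encourage each 3D Gaussian to be extremely thin by driving its
    # smallest scale component toward zero.
    return scales.min(dim=-1).values.abs().mean()

def surface_pull_loss(centers: torch.Tensor, sdf_fn) -> torch.Tensor:
    # Pull Gaussian centers onto the implicit surface by penalizing the
    # magnitude of the predicted signed distance at each center.
    return sdf_fn(centers).abs().mean()

# Toy usage with a unit-sphere SDF standing in for the neural implicit model.
sdf = lambda p: p.norm(dim=-1) - 1.0
centers = torch.randn(1000, 3, requires_grad=True)
scales = torch.rand(1000, 3, requires_grad=True)
loss = thinness_loss(scales) + 0.1 * surface_pull_loss(centers, sdf)
loss.backward()
```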

Text-Guided 3D Face Synthesis – From Generation to Editing

  • paper_url: http://arxiv.org/abs/2312.00375
  • repo_url: None
  • paper_authors: Yunjie Wu, Yapeng Meng, Zhipeng Hu, Lincheng Li, Haoqian Wu, Kun Zhou, Weiwei Xu, Xin Yu
  • for: text-guided 3D face synthesis that supports customized faces through iterative adjustments, unifying generation and editing
  • methods: in the generation stage, a geometry-texture decoupled generation scheme avoids the loss of geometric detail caused by coupling and a fine-tuned texture diffusion model enhances texture quality in both RGB and YUV space; in the editing stage, a pre-trained diffusion model updates geometry or texture from text, with a UV-domain consistency preservation regularization and a self-guided consistency weight strategy enabling sequential editing
  • results: comprehensive experiments demonstrate the method's superiority in face synthesis
    Abstract Text-guided 3D face synthesis has achieved remarkable results by leveraging text-to-image (T2I) diffusion models. However, most existing works focus solely on the direct generation, ignoring the editing, restricting them from synthesizing customized 3D faces through iterative adjustments. In this paper, we propose a unified text-guided framework from face generation to editing. In the generation stage, we propose a geometry-texture decoupled generation to mitigate the loss of geometric details caused by coupling. Besides, decoupling enables us to utilize the generated geometry as a condition for texture generation, yielding highly geometry-texture aligned results. We further employ a fine-tuned texture diffusion model to enhance texture quality in both RGB and YUV space. In the editing stage, we first employ a pre-trained diffusion model to update facial geometry or texture based on the texts. To enable sequential editing, we introduce a UV domain consistency preservation regularization, preventing unintentional changes to irrelevant facial attributes. Besides, we propose a self-guided consistency weight strategy to improve editing efficacy while preserving consistency. Through comprehensive experiments, we showcase our method's superiority in face synthesis. Project page: https://faceg2e.github.io/.

Benchmarking Multi-Domain Active Learning on Image Classification

  • paper_url: http://arxiv.org/abs/2312.00364
  • repo_url: None
  • paper_authors: Jiayi Li, Rohan Taori, Tatsunori B. Hashimoto
  • for: benchmarking active learning methods, which improve model performance by strategically labeling informative data points, on multi-domain data
  • methods: a multi-domain active learning benchmark, including CLIP-GeoYFCC, a new large-scale image dataset built around geographical domains
  • results: traditional single-domain active learning strategies are often less effective than random selection in multi-domain scenarios, and all multi-domain strategies exhibit significant trade-offs, motivating future research
    Abstract Active learning aims to enhance model performance by strategically labeling informative data points. While extensively studied, its effectiveness on large-scale, real-world datasets remains underexplored. Existing research primarily focuses on single-source data, ignoring the multi-domain nature of real-world data. We introduce a multi-domain active learning benchmark to bridge this gap. Our benchmark demonstrates that traditional single-domain active learning strategies are often less effective than random selection in multi-domain scenarios. We also introduce CLIP-GeoYFCC, a novel large-scale image dataset built around geographical domains, in contrast to existing genre-based domain datasets. Analysis on our benchmark shows that all multi-domain strategies exhibit significant tradeoffs, with no strategy outperforming across all datasets or all metrics, emphasizing the need for future research.
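
As a concrete reference point, the sketch below implements a generic uncertainty-sampling acquisition step with an optional even split of the labeling budget across domains. It is a simple multi-domain baseline for illustration, not one of the benchmark's strategies.

```python
import numpy as np

def entropy(probs: np.ndarray) -> np.ndarray:
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def select_batch(probs, domains, budget, per_domain=True):
    """Entropy-based uncertainty sampling; optionally spread the labeling
    budget evenly across domains."""
    scores = entropy(probs)
    if not per_domain:
        return np.argsort(-scores)[:budget]
    picked = []
    uniq = np.unique(domains)
    k = budget // len(uniq)
    for d in uniq:
        idx = np.where(domains == d)[0]
        picked.extend(idx[np.argsort(-scores[idx])[:k]])
    return np.array(picked)

probs = np.random.dirichlet(np.ones(10), size=1000)   # unlabeled pool predictions
domains = np.random.randint(0, 4, size=1000)          # domain id per example
print(select_batch(probs, domains, budget=100).shape)
```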

Dancing with Images: Video Distillation via Static-Dynamic Disentanglement

  • paper_url: http://arxiv.org/abs/2312.00362
  • repo_url: None
  • paper_authors: Ziyu Wang, Yue Xu, Cewu Lu, Yong-Lu Li
  • for: extending dataset distillation, which has so far focused on image datasets, to video datasets for efficient machine learning
  • methods: the first systematic study of video distillation, introducing a taxonomy of temporal compression and a unified framework that disentangles static and dynamic information: videos are distilled into still images as static memory, and a learnable dynamic memory block compensates for motion
  • results: temporal information is usually not well learned during distillation and the temporal dimension of synthetic data contributes little; the proposed method achieves state-of-the-art results on video datasets at different scales with notably smaller storage expenditure
    Abstract Recently, dataset distillation has paved the way towards efficient machine learning, especially for image datasets. However, the distillation for videos, characterized by an exclusive temporal dimension, remains an underexplored domain. In this work, we provide the first systematic study of video distillation and introduce a taxonomy to categorize temporal compression. Our investigation reveals that the temporal information is usually not well learned during distillation , and the temporal dimension of synthetic data contributes little. The observations motivate our unified framework of disentangling the dynamic and static information in the videos. It first distills the videos into still images as static memory and then compensates the dynamic and motion information with a learnable dynamic memory block. Our method achieves state-of-the-art on video datasets at different scales, with notably smaller storage expenditure. Our code will be publicly available.
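
A toy rendering of the static-dynamic disentanglement: a distilled still image acts as static memory, and a small learnable dynamic-memory block produces per-frame residuals to recover motion. The shapes and the residual design are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DynamicMemory(nn.Module):
    """Toy dynamic-memory block: given a distilled static image, generate T
    per-frame residuals so that static + residuals forms a synthetic video."""
    def __init__(self, t=8, c=3):
        super().__init__()
        self.motion = nn.Parameter(torch.zeros(t, c, 32, 32))  # learnable motion codes
        self.refine = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, static_img):                       # (C, H, W)
        frames = static_img.unsqueeze(0) + self.motion   # broadcast over T frames
        return self.refine(frames)                        # (T, C, H, W)

static = torch.randn(3, 32, 32, requires_grad=True)      # distilled static memory
video = DynamicMemory()(static)
print(video.shape)  # (8, 3, 32, 32)
```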

Efficient Multimodal Semantic Segmentation via Dual-Prompt Learning

  • paper_url: http://arxiv.org/abs/2312.00360
  • repo_url: https://github.com/shaohuadong2021/dplnet
  • paper_authors: Shaohua Dong, Yunhe Feng, Qing Yang, Yan Huang, Dongfang Liu, Heng Fan
  • for: improving multimodal semantic segmentation in complex scenes (e.g., indoor/low-light conditions)
  • methods: a dual-prompt learning network (DPLNet) that adapts a frozen pre-trained RGB model to multimodal segmentation through two lightweight prompt-learning modules, a multimodal prompt generator (MPG) and a multimodal feature adapter (MFA)
  • results: achieves new state-of-the-art performance or is on par with more complex methods on four RGB-D/T semantic segmentation datasets while remaining parameter-efficient, and also performs well on other multimodal tasks such as salient object detection and video semantic segmentation
    Abstract Multimodal (e.g., RGB-Depth/RGB-Thermal) fusion has shown great potential for improving semantic segmentation in complex scenes (e.g., indoor/low-light conditions). Existing approaches often fully fine-tune a dual-branch encoder-decoder framework with a complicated feature fusion strategy for achieving multimodal semantic segmentation, which is training-costly due to the massive parameter updates in feature extraction and fusion. To address this issue, we propose a surprisingly simple yet effective dual-prompt learning network (dubbed DPLNet) for training-efficient multimodal (e.g., RGB-D/T) semantic segmentation. The core of DPLNet is to directly adapt a frozen pre-trained RGB model to multimodal semantic segmentation, reducing parameter updates. For this purpose, we present two prompt learning modules, comprising multimodal prompt generator (MPG) and multimodal feature adapter (MFA). MPG works to fuse the features from different modalities in a compact manner and is inserted from shadow to deep stages to generate the multi-level multimodal prompts that are injected into the frozen backbone, while MPG adapts prompted multimodal features in the frozen backbone for better multimodal semantic segmentation. Since both the MPG and MFA are lightweight, only a few trainable parameters (3.88M, 4.4% of the pre-trained backbone parameters) are introduced for multimodal feature fusion and learning. Using a simple decoder (3.27M parameters), DPLNet achieves new state-of-the-art performance or is on a par with other complex approaches on four RGB-D/T semantic segmentation datasets while satisfying parameter efficiency. Moreover, we show that DPLNet is general and applicable to other multimodal tasks such as salient object detection and video semantic segmentation. Without special design, DPLNet outperforms many complicated models. Our code will be available at github.com/ShaohuaDong2021/DPLNet.
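
A minimal sketch of the prompt-generation idea, assuming the multimodal prompt is a compact fused feature added to the frozen backbone's stage output; the channel sizes and the fusion operator are illustrative, not the MPG/MFA designs from the paper.

```python
import torch
import torch.nn as nn

class MultimodalPromptGenerator(nn.Module):
    """Toy MPG-style module: fuse RGB and auxiliary-modality features into a
    compact prompt that is added to the frozen backbone's features."""
    def __init__(self, c_rgb=96, c_aux=96, c_prompt=96):
        super().__init__()
        self.fuse = nn.Conv2d(c_rgb + c_aux, c_prompt, kernel_size=1)

    def forward(self, feat_rgb, feat_aux):
        return self.fuse(torch.cat([feat_rgb, feat_aux], dim=1))

frozen_stage_out = torch.randn(2, 96, 56, 56)   # from the frozen RGB backbone
aux_feat = torch.randn(2, 96, 56, 56)           # e.g. depth or thermal features
mpg = MultimodalPromptGenerator()
prompted = frozen_stage_out + mpg(frozen_stage_out, aux_feat)
print(prompted.shape)  # (2, 96, 56, 56)
```

Only the prompt modules are trained; the backbone stays frozen, which is where the parameter efficiency comes from.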

Impact of Data Augmentation on QCNNs

  • paper_url: http://arxiv.org/abs/2312.00358
  • repo_url: None
  • paper_authors: Leting Zhouli, Peiyong Wang, Udaya Parampalli
  • for: studying whether quantum mechanisms can improve classical image-recognition models
  • methods: implements and compares classical convolutional neural networks (CNNs) and quantum convolutional neural networks (QCNNs), and applies data augmentation (DA) to both
  • results: on three commonly used datasets (MNIST, Fashion-MNIST, and cat/dog face images), QCNNs achieve favorable losses and prediction accuracy relative to CNNs, while, surprisingly, data augmentation does not improve QCNN performance
    Abstract In recent years, Classical Convolutional Neural Networks (CNNs) have been applied for image recognition successfully. Quantum Convolutional Neural Networks (QCNNs) are proposed as a novel generalization to CNNs by using quantum mechanisms. The quantum mechanisms lead to an efficient training process in QCNNs by reducing the size of input from $N$ to $log_2N$. This paper implements and compares both CNNs and QCNNs by testing losses and prediction accuracy on three commonly used datasets. The datasets include the MNIST hand-written digits, Fashion MNIST and cat/dog face images. Additionally, data augmentation (DA), a technique commonly used in CNNs to improve the performance of classification by generating similar images based on original inputs, is also implemented in QCNNs. Surprisingly, the results showed that data augmentation didn't improve QCNNs performance. The reasons and logic behind this result are discussed, hoping to expand our understanding of Quantum machine learning theory.

A Generalizable Deep Learning System for Cardiac MRI

  • paper_url: http://arxiv.org/abs/2312.00357
  • repo_url: None
  • paper_authors: Rohan Shad, Cyril Zakka, Dhamanpreet Kaur, Robyn Fong, Ross Warren Filice, John Mongan, Kimberly Kalianos, Nishith Khandwala, David Eng, Matthew Leipzig, Walter Witschey, Alejandro de Feria, Victor Ferrari, Euan Ashley, Michael A. Acker, Curtis Langlotz, William Hiesinger
  • for: cardiac MRI assessment of myocardial structure, function, and tissue characteristics
  • methods: deep learning model trained via self-supervised contrastive learning using raw text of radiology reports
  • results: remarkable performance across a range of tasks, including left ventricular ejection fraction regression and diagnosis of 35 different conditions such as cardiac amyloidosis and hypertrophic cardiomyopathy, with clinical grade diagnostic accuracy using a fraction of the typical training data.
    Abstract Cardiac MRI allows for a comprehensive assessment of myocardial structure, function, and tissue characteristics. Here we describe a foundational vision system for cardiac MRI, capable of representing the breadth of human cardiovascular disease and health. Our deep learning model is trained via self-supervised contrastive learning, by which visual concepts in cine-sequence cardiac MRI scans are learned from the raw text of the accompanying radiology reports. We train and evaluate our model on data from four large academic clinical institutions in the United States. We additionally showcase the performance of our models on the UK BioBank, and two additional publicly available external datasets. We explore emergent zero-shot capabilities of our system, and demonstrate remarkable performance across a range of tasks; including the problem of left ventricular ejection fraction regression, and the diagnosis of 35 different conditions such as cardiac amyloidosis and hypertrophic cardiomyopathy. We show that our deep learning system is capable of not only understanding the staggering complexity of human cardiovascular disease, but can be directed towards clinical problems of interest yielding impressive, clinical grade diagnostic accuracy with a fraction of the training data typically required for such tasks.
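
The self-supervised objective pairing cine-MRI clips with their radiology reports can be written as a symmetric CLIP-style InfoNCE loss. The sketch below assumes precomputed clip and text embeddings and an illustrative temperature; the encoders themselves are omitted.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss pairing cine-MRI clip embeddings with the
    embeddings of their radiology reports."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```

Downstream tasks such as ejection-fraction regression or condition diagnosis can then be addressed with lightweight heads on top of the pretrained clip embeddings.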

Manipulating the Label Space for In-Context Classification

  • paper_url: http://arxiv.org/abs/2312.00351
  • repo_url: None
  • paper_authors: Haokun Chen, Xu Yang, Yuhang Huang, Zihan Wu, Jing Wang, Xin Geng
  • for: improving the in-context learning (ICL) classification ability of vision-language models (VLMs) across diverse datasets
  • methods: manipulates the label space of each in-context example (ICE) to increase its knowledge density, via Label Distribution Enhancement and Visual Descriptions Enhancement, so that fewer ICEs convey as much information as a larger set would
  • results: on ImageNet, accuracy rises from 74.70% in a 4-shot setting to 76.21% with just 2 shots, surpassing CLIP by 0.67%; on CUB-200, 1-shot accuracy rises from 48.86% to 69.05%, 12.15% higher than CLIP
    Abstract After pre-training by generating the next word conditional on previous words, the Language Model (LM) acquires the ability of In-Context Learning (ICL) that can learn a new task conditional on the context of the given in-context examples (ICEs). Similarly, visually-conditioned Language Modelling is also used to train Vision-Language Models (VLMs) with ICL ability. However, such VLMs typically exhibit weaker classification abilities compared to contrastive learning-based models like CLIP, since the Language Modelling objective does not directly contrast whether an object is paired with a text. To improve the ICL of classification, using more ICEs to provide more knowledge is a straightforward way. However, this may largely increase the selection time, and more importantly, the inclusion of additional in-context images tends to extend the length of the in-context sequence beyond the processing capacity of a VLM. To alleviate these limitations, we propose to manipulate the label space of each ICE to increase its knowledge density, allowing for fewer ICEs to convey as much information as a larger set would. Specifically, we propose two strategies which are Label Distribution Enhancement and Visual Descriptions Enhancement to improve In-context classification performance on diverse datasets, including the classic ImageNet and more fine-grained datasets like CUB-200. Specifically, using our approach on ImageNet, we increase accuracy from 74.70\% in a 4-shot setting to 76.21\% with just 2 shots. surpassing CLIP by 0.67\%. On CUB-200, our method raises 1-shot accuracy from 48.86\% to 69.05\%, 12.15\% higher than CLIP. The code is given in https://anonymous.4open.science/r/MLS_ICC.

Student Activity Recognition in Classroom Environments using Transfer Learning

  • paper_url: http://arxiv.org/abs/2312.00348
  • repo_url: None
  • paper_authors: Anagha Deshpande, Vedant Deshpande
  • for: enhancing safety, efficiency, and overall educational quality in classroom environments
  • methods: deep learning with transfer learning for feature extraction and classification (VGG-16, ResNet-50, InceptionV3, and Xception) on a newly recorded classroom dataset
  • results: Xception achieves 93% accuracy on the novel classroom dataset, outperforming the other three models
    Abstract The recent advances in artificial intelligence and deep learning facilitate automation in various applications including home automation, smart surveillance systems, and healthcare among others. Human Activity Recognition is one of its emerging applications, which can be implemented in a classroom environment to enhance safety, efficiency, and overall educational quality. This paper proposes a system for detecting and recognizing the activities of students in a classroom environment. The dataset has been structured and recorded by the authors since a standard dataset for this task was not available at the time of this study. Transfer learning, a widely adopted method within the field of deep learning, has proven to be helpful in complex tasks like image and video processing. Pretrained models including VGG-16, ResNet-50, InceptionV3, and Xception are used for feature extraction and classification tasks. Xception achieved an accuracy of 93%, on the novel classroom dataset, outperforming the other three models in consideration. The system proposed in this study aims to introduce a safer and more productive learning environment for students and educators.
    摘要 近年来,人工智能和深度学习的进展推动了家居自动化、智能监控系统和医疗等多种应用的自动化。人类活动识别(HAR)是其中一项新兴应用,可用于教室环境中,以提升安全性、效率和整体教学质量。本文提出了一个检测并识别学生课堂活动的系统。由于当时没有适用于该任务的标准数据集,作者们自行收集并整理了数据集。迁移学习作为深度学习中广泛使用的方法,已在图像和视频处理等复杂任务中被证明有效。作者们使用了VGG-16、ResNet-50、InceptionV3和Xception等预训练模型进行特征提取和分类。Xception在新的课堂数据集上达到了93%的准确率,超过了其余三个模型。该系统旨在为学生和教师创造一个更安全、更高效的学习环境。(下文附示意代码。)
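
A minimal Keras sketch of the transfer-learning recipe the paper describes: a pretrained Xception backbone used as a frozen feature extractor with a new classification head. The directory layout `classroom_dataset/<class>/<image>.jpg`, the class count, and all hyperparameters are assumptions rather than the authors' exact setup.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 4          # must match the number of activity classes in the dataset (assumed)
IMG_SIZE = (299, 299)    # Xception's native input resolution

# Pretrained Xception as a frozen feature extractor (transfer learning).
base = tf.keras.applications.Xception(
    weights="imagenet", include_top=False, pooling="avg", input_shape=IMG_SIZE + (3,))
base.trainable = False

model = models.Sequential([
    layers.Input(shape=IMG_SIZE + (3,)),
    layers.Lambda(tf.keras.applications.xception.preprocess_input),  # scale pixels to [-1, 1]
    base,
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# hypothetical directory layout: classroom_dataset/<class_name>/<image>.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    "classroom_dataset", validation_split=0.2, subset="training",
    seed=0, image_size=IMG_SIZE, batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "classroom_dataset", validation_split=0.2, subset="validation",
    seed=0, image_size=IMG_SIZE, batch_size=32)

model.fit(train_ds, validation_data=val_ds, epochs=10)
```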

OpenStereo: A Comprehensive Benchmark for Stereo Matching and Strong Baseline

  • paper_url: http://arxiv.org/abs/2312.00343
  • repo_url: https://github.com/xiandaguo/openstereo
  • paper_authors: Xianda Guo, Juntao Lu, Chenming Zhang, Yiqi Wang, Yiqun Duan, Tian Yang, Zheng Zhu, Long Chen
  • for: 提供一个面向实际应用的立体匹配代码库,并通过实验评估不同网络模型的表现。
  • methods: 构建了包含超过12个网络模型训练与推理代码的代码库OpenStereo,并通过消融实验提出了一个简单却强大的基线模型StereoBase。
  • results: 基于OpenStereo在SceneFlow数据集上的实验达到或超过了原论文中报告的性能指标;StereoBase与众多当代立体匹配方法相比表现突出。
    Abstract Stereo matching, a pivotal technique in computer vision, plays a crucial role in robotics, autonomous navigation, and augmented reality. Despite the development of numerous impressive methods in recent years, replicating their results and determining the most suitable architecture for practical application remains challenging. Addressing this gap, our paper introduces a comprehensive benchmark focusing on practical applicability rather than solely on performance enhancement. Specifically, we develop a flexible and efficient stereo matching codebase, called OpenStereo. OpenStereo includes training and inference codes of more than 12 network models, making it, to our knowledge, the most complete stereo matching toolbox available. Based on OpenStereo, we conducted experiments on the SceneFlow dataset and have achieved or surpassed the performance metrics reported in the original paper. Additionally, we conduct an in-depth revisitation of recent developments in stereo matching through ablative experiments. These investigations inspired the creation of StereoBase, a simple yet strong baseline model. Our extensive comparative analyses of StereoBase against numerous contemporary stereo matching methods on the SceneFlow dataset demonstrate its remarkably strong performance. The source code is available at https://github.com/XiandaGuo/OpenStereo.
    摘要 立体匹配是计算机视觉中的一项关键技术,在机器人、自主导航和增强现实中扮演着重要角色。尽管近年来出现了许多优秀的方法,但复现这些结果并确定最适合实际应用的架构仍然是一个挑战。为了解决这个问题,我们提出了一个关注实际应用性而非单纯性能提升的完整基准。具体而言,我们开发了一个灵活且高效的立体匹配代码库OpenStereo,其中包含超过12个网络模型的训练和推理代码,据我们所知是目前最完整的立体匹配工具箱。基于OpenStereo,我们在SceneFlow数据集上进行了实验,达到或超过了原始文献中报告的性能指标。此外,我们还通过消融实验重新审视了立体匹配的最新进展,并由此提出了StereoBase——一个简单却强大的基线模型。我们将StereoBase与众多当代立体匹配方法在SceneFlow数据集上进行了广泛的比较分析,结果显示其性能非常出色。代码可以在 https://github.com/XiandaGuo/OpenStereo 中获取。(下文附示意代码。)
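
For readers who want a quick sanity check while reproducing results, the sketch below runs a classical semi-global block-matching baseline with OpenCV and scores it with the end-point error (EPE) commonly reported on SceneFlow. This is not StereoBase or the OpenStereo API; the file names are placeholders for a rectified stereo pair and its ground-truth disparity.

```python
import cv2
import numpy as np

# rectified stereo pair and ground-truth disparity (placeholder file names)
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)
gt = np.load("disparity_gt.npy")

matcher = cv2.StereoSGBM_create(
    minDisparity=0, numDisparities=128, blockSize=5,
    P1=8 * 5 * 5, P2=32 * 5 * 5, uniquenessRatio=10)

# SGBM returns fixed-point disparities scaled by 16
disp = matcher.compute(left, right).astype(np.float32) / 16.0

valid = gt > 0                                   # ignore pixels without ground truth
epe = np.abs(disp - gt)[valid].mean()            # end-point error in pixels
print(f"EPE: {epe:.3f} px")
```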

Learning Anatomically Consistent Embedding for Chest Radiography

  • paper_url: http://arxiv.org/abs/2312.00335
  • repo_url: https://github.com/jlianglab/peac
  • paper_authors: Ziyu Zhou, Haozhe Luo, Jiaxuan Pang, Xiaowei Ding, Michael Gotway, Jianming Liang
  • for: 本文旨在利用自监督学习(SSL)方法从无标注医疗图像中学习视觉表示。
  • methods: 本文提出了一种新的SSL方法——解剖一致性图像块嵌入(patch embedding of anatomical consistency,PEAC),用于医疗图像分析。PEAC通过稳定的基于网格的匹配来学习全局和局部一致性,并将预训练的PEAC模型迁移到多种下游任务。
  • results: 与现有的全监督/自监督方法相比,PEAC取得了显著更好的性能,并能够在同一患者的不同视图之间以及不同性别、体重和健康状况的患者之间捕捉一致的解剖结构,从而提高了方法在医疗图像分析中的可解释性。
    Abstract Self-supervised learning (SSL) approaches have recently shown substantial success in learning visual representations from unannotated images. Compared with photographic images, medical images acquired with the same imaging protocol exhibit high consistency in anatomy. To exploit this anatomical consistency, this paper introduces a novel SSL approach, called PEAC (patch embedding of anatomical consistency), for medical image analysis. Specifically, in this paper, we propose to learn global and local consistencies via stable grid-based matching, transfer pre-trained PEAC models to diverse downstream tasks, and extensively demonstrate that (1) PEAC achieves significantly better performance than the existing state-of-the-art fully/self-supervised methods, and (2) PEAC captures the anatomical structure consistency across views of the same patient and across patients of different genders, weights, and healthy statuses, which enhances the interpretability of our method for medical image analysis.
    摘要 自监督学习(SSL)方法近来在从无标注图像中学习视觉表示方面取得了很大成功。与摄影图像不同,采用相同成像协议采集的医疗图像在解剖结构上高度一致。为了利用这种解剖一致性,本文提出了一种新的SSL方法——解剖一致性图像块嵌入(PEAC),用于医疗图像分析。具体而言,我们提出通过稳定的基于网格的匹配来学习全局和局部一致性,并将预训练的PEAC模型迁移到多种下游任务。大量实验表明:(1)PEAC显著优于现有最先进的全监督/自监督方法;(2)PEAC能够在同一患者的不同视图之间以及不同性别、体重和健康状况的患者之间捕捉一致的解剖结构,从而提高了该方法在医疗图像分析中的可解释性。(下文附示意代码。)
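
A much-simplified sketch of the global/local consistency idea: if two views of the same image produce spatially aligned grids of patch embeddings, matching grid cells (local) and pooled image embeddings (global) should agree. PEAC's actual stable grid-based matching across shifted crops is more involved; this is only an illustrative loss under the assumption that the two views are already aligned.

```python
import torch
import torch.nn.functional as F

def grid_consistency_loss(emb_a, emb_b):
    """emb_a, emb_b: (B, C, H, W) patch-embedding grids from two augmented views
    of the same image that are assumed to be spatially aligned.

    Local term: matching grid cells should have similar embeddings.
    Global term: the pooled image-level embeddings should also agree.
    """
    local_term = 1 - F.cosine_similarity(emb_a, emb_b, dim=1).mean()
    g_a, g_b = emb_a.mean(dim=(2, 3)), emb_b.mean(dim=(2, 3))
    global_term = 1 - F.cosine_similarity(g_a, g_b, dim=1).mean()
    return local_term + global_term

# toy usage: 14x14 grids of 384-d patch embeddings from a ViT-like encoder
loss = grid_consistency_loss(torch.randn(2, 384, 14, 14), torch.randn(2, 384, 14, 14))
```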

Improving Efficiency of DNN-based Relocalization Module for Autonomous Driving with Server-side Computing

  • paper_url: http://arxiv.org/abs/2312.00316
  • repo_url: None
  • paper_authors: Dengbo Li, Jieren Cheng, Boyi Liu
  • for: 本研究提出了一种基于深度神经网络(DNN)的自动驾驶摄像头重定位框架,以解决现有方法推理阶段计算开销过大的问题。
  • methods: 该方法利用边缘-云协同,将神经网络的部分模块卸载到服务器端执行,并评估不同网络切分方案下数据帧的推理时间,以指导卸载决策。
  • results: 研究表明,服务器端卸载对基于DNN的自动驾驶摄像头重定位至关重要;此外还讨论了数据融合的结果,并通过实验评估验证了所提框架的有效性。
    Abstract In this work, we present a novel framework for camera relocation in autonomous vehicles, leveraging deep neural networks (DNN). While existing literature offers various DNN-based camera relocation methods, their deployment is hindered by their high computational demands during inference. In contrast, our approach addresses this challenge through edge cloud collaboration. Specifically, we strategically offload certain modules of the neural network to the server and evaluate the inference time of data frames under different network segmentation schemes to guide our offloading decisions. Our findings highlight the vital role of server-side offloading in DNN-based camera relocation for autonomous vehicles, and we also discuss the results of data fusion. Finally, we validate the effectiveness of our proposed framework through experimental evaluation.
    摘要 在这项工作中,我们提出了一种面向自动驾驶车辆、利用深度神经网络(DNN)的新型摄像头重定位框架。现有文献提供了多种基于DNN的摄像头重定位方法,但其推理阶段的计算开销很高,阻碍了实际部署。与此相反,我们的方法通过边缘-云协同来应对这一挑战。具体来说,我们将神经网络的某些模块策略性地卸载到服务器端执行,并评估不同网络切分方案下数据帧的推理时间,以指导卸载决策。结果表明,服务器端卸载在基于DNN的自动驾驶摄像头重定位中发挥了关键作用,同时我们也讨论了数据融合的结果。最后,我们通过实验评估验证了所提框架的有效性。(下文附示意代码。)
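
A rough way to profile candidate split points for this kind of edge-cloud offloading: time the on-vehicle prefix of the network and report the size of the intermediate activation that would have to be shipped to the server. `resnet18` stands in for the relocalization backbone, and a real offloading decision would also model uplink bandwidth and server-side latency.

```python
import time
import torch
import torchvision

model = torchvision.models.resnet18(weights=None).eval()
modules = list(model.children())        # conv1, bn1, relu, maxpool, layer1..4, avgpool, fc

def time_prefix(split, x, repeats=20):
    """Average wall-clock time of running modules[:split] on the vehicle side."""
    prefix = torch.nn.Sequential(*modules[:split])
    with torch.no_grad():
        prefix(x)                       # warm-up
        t0 = time.perf_counter()
        for _ in range(repeats):
            prefix(x)
    return (time.perf_counter() - t0) / repeats

x = torch.randn(1, 3, 224, 224)
for split in range(1, len(modules)):    # stop before the final fc (it needs a flatten)
    with torch.no_grad():
        feat = torch.nn.Sequential(*modules[:split])(x)   # tensor shipped to the server
    print(f"split={split:2d}  local={1e3 * time_prefix(split, x):6.2f} ms  "
          f"payload={feat.numel() * 4 / 1024:8.1f} KiB")
```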

Improving Normalization with the James-Stein Estimator

  • paper_url: http://arxiv.org/abs/2312.00313
  • repo_url: None
  • paper_authors: Seyedalireza Khoshsirat, Chandra Kambhamettu
  • for: 改进深度学习归一化层中对均值和方差的估计,利用高维统计中Stein悖论所揭示的样本均值的不足。
  • methods: 在归一化层中使用James-Stein估计器替代常规的均值和方差估计,并进一步研究Ridge和LASSO等其他收缩估计器。
  • results: 在图像分类、语义分割和3D物体分类等多种计算机视觉任务中,改进后的归一化层在不增加计算成本的情况下均取得了更高的准确率,且对批大小和正则化更不敏感。
    Abstract Stein's paradox holds considerable sway in high-dimensional statistics, highlighting that the sample mean, traditionally considered the de facto estimator, might not be the most efficacious in higher dimensions. To address this, the James-Stein estimator proposes an enhancement by steering the sample means toward a more centralized mean vector. In this paper, first, we establish that normalization layers in deep learning use inadmissible estimators for mean and variance. Next, we introduce a novel method to employ the James-Stein estimator to improve the estimation of mean and variance within normalization layers. We evaluate our method on different computer vision tasks: image classification, semantic segmentation, and 3D object classification. Through these evaluations, it is evident that our improved normalization layers consistently yield superior accuracy across all tasks without extra computational burden. Moreover, recognizing that a plethora of shrinkage estimators surpass the traditional estimator in performance, we study two other prominent shrinkage estimators: Ridge and LASSO. Additionally, we provide visual representations to intuitively demonstrate the impact of shrinkage on the estimated layer statistics. Finally, we study the effect of regularization and batch size on our modified batch normalization. The studies show that our method is less sensitive to batch size and regularization, improving accuracy under various setups.
    摘要 Stein悖论在高维统计中影响深远:传统上被视为默认估计器的样本均值,在高维情形下未必是最有效的。为此,James-Stein估计器通过将样本均值向一个更集中的均值向量收缩来加以改进。本文首先证明,深度学习中的归一化层使用的是不可接纳(inadmissible)的均值和方差估计器;随后提出了一种在归一化层中利用James-Stein估计器改进均值和方差估计的新方法。我们在图像分类、语义分割和3D物体分类等计算机视觉任务上进行了评估,结果表明改进后的归一化层在所有任务上都取得了更高的精度,且没有额外的计算负担。此外,鉴于许多收缩估计器的表现优于传统估计器,我们还研究了Ridge和LASSO这两种典型的收缩估计器,并提供了可视化结果,直观展示收缩对估计的层统计量的影响。最后,我们研究了正则化和批大小对改进后批归一化的影响:结果表明,该方法对批大小和正则化更不敏感,在多种设置下均能提升精度。(下文附示意代码。)
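
A hedged sketch of the core idea: shrink the per-channel batch statistics of a BatchNorm-style layer toward their grand mean with a positive-part James-Stein estimator before normalizing. The sampling-variance approximations and the shrinkage of the variance vector below are simplifying assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def james_stein_shrink(estimates, sampling_var, eps=1e-12):
    """Positive-part James-Stein shrinkage of a vector of per-channel estimates
    toward their grand mean, given approximate sampling variances."""
    grand_mean = estimates.mean()
    centered = estimates - grand_mean
    p = estimates.numel()           # assumed reasonably large (e.g. 64 channels)
    factor = 1.0 - (p - 3) * sampling_var.mean() / (centered.pow(2).sum() + eps)
    return grand_mean + factor.clamp(min=0.0) * centered

class JSBatchNorm2d(nn.Module):
    """BatchNorm2d variant whose batch mean/variance vectors are James-Stein shrunk."""
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.eps, self.momentum = eps, momentum
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):                                   # x: (N, C, H, W)
        if self.training:
            n = x.numel() // x.size(1)                      # samples per channel
            ch_mean = x.mean(dim=(0, 2, 3))
            ch_var = x.var(dim=(0, 2, 3), unbiased=False)
            mean = james_stein_shrink(ch_mean, ch_var / n)                      # Var[mean] ~ var/n
            var = james_stein_shrink(ch_var, 2 * ch_var.pow(2) / max(n - 1, 1)) # rough Var[var]
            var = var.clamp(min=self.eps)
            with torch.no_grad():
                self.running_mean.lerp_(mean, self.momentum)
                self.running_var.lerp_(var, self.momentum)
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + self.eps)
        return x_hat * self.weight[None, :, None, None] + self.bias[None, :, None, None]

# drop-in usage
layer = JSBatchNorm2d(64)
out = layer(torch.randn(8, 64, 16, 16))
```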

Segment Anything Model-guided Collaborative Learning Network for Scribble-supervised Polyp Segmentation

  • paper_url: http://arxiv.org/abs/2312.00312
  • repo_url: None
  • paper_authors: Yiming Zhao, Tao Zhou, Yunqi Gu, Yi Zhou, Yizhe Zhang, Ye Wu, Huazhu Fu
  • for: This paper proposes a novel method for scribble-supervised polyp segmentation in colonoscopy images, which is crucial for early detection and prevention of colorectal cancer.
  • methods: The proposed method, called SAM-CLNet, combines a Cross-level Enhancement and Aggregation Network (CEA-Net) with a Segment Anything Model (SAM) to leverage the strengths of both models. CEA-Net uses a Cross-level Enhancement Module (CEM) and a Feature Aggregation Module (FAM) to enhance and aggregate features from different resolutions, while SAM provides segmentation guidance.
  • results: Extensive experiments show that SAM-CLNet outperforms state-of-the-art weakly-supervised segmentation methods, demonstrating the effectiveness of the proposed method for polyp segmentation in colonoscopy images.
    Abstract Polyp segmentation plays a vital role in accurately locating polyps at an early stage, which holds significant clinical importance for the prevention of colorectal cancer. Various polyp segmentation methods have been developed using fully-supervised deep learning techniques. However, pixel-wise annotation for polyp images by physicians during the diagnosis is both time-consuming and expensive. Moreover, visual foundation models such as the Segment Anything Model (SAM) have shown remarkable performance. Nevertheless, directly applying SAM to medical segmentation may not produce satisfactory results due to the inherent absence of medical knowledge. In this paper, we propose a novel SAM-guided Collaborative Learning Network (SAM-CLNet) for scribble-supervised polyp segmentation, enabling a collaborative learning process between our segmentation network and SAM to boost the model performance. Specifically, we first propose a Cross-level Enhancement and Aggregation Network (CEA-Net) for weakly-supervised polyp segmentation. Within CEA-Net, we propose a Cross-level Enhancement Module (CEM) that integrates the adjacent features to enhance the representation capabilities of different resolution features. Additionally, a Feature Aggregation Module (FAM) is employed to capture richer features across multiple levels. Moreover, we present a box-augmentation strategy that combines the segmentation maps generated by CEA-Net with scribble annotations to create more precise prompts. These prompts are then fed into SAM, generating segmentation SAM-guided masks, which can provide additional supervision to train CEA-Net effectively. Furthermore, we present an Image-level Filtering Mechanism to filter out unreliable SAM-guided masks. Extensive experimental results show that our SAM-CLNet outperforms state-of-the-art weakly-supervised segmentation methods.
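
A sketch of the box-prompting step described in the abstract: derive a (padded) bounding box from the weak segmenter's coarse mask and feed it to SAM through the official `segment_anything` package to obtain a SAM-guided pseudo-label. The padding, threshold, and checkpoint choice are assumptions; the paper's box-augmentation strategy and image-level filtering mechanism are more elaborate.

```python
import numpy as np
import torch
from segment_anything import sam_model_registry, SamPredictor

def mask_to_box(mask, pad=10):
    """Tight, slightly padded bounding box (x0, y0, x1, y1) around a binary mask."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    h, w = mask.shape
    return np.array([max(xs.min() - pad, 0), max(ys.min() - pad, 0),
                     min(xs.max() + pad, w - 1), min(ys.max() + pad, h - 1)])

device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth").to(device)
predictor = SamPredictor(sam)

def sam_guided_mask(image_rgb, coarse_prob, thr=0.5):
    """image_rgb: HxWx3 uint8; coarse_prob: HxW probabilities from the weak segmenter.
    Returns a SAM-refined binary mask usable as an extra supervision signal."""
    box = mask_to_box(coarse_prob > thr)
    if box is None:
        return np.zeros_like(coarse_prob, dtype=bool)
    predictor.set_image(image_rgb)
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    return masks[0]
```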

3D Face Reconstruction with the Geometric Guidance of Facial Part Segmentation

  • paper_url: http://arxiv.org/abs/2312.00311
  • repo_url: None
  • paper_authors: Zidu Wang, Xiangyu Zhu, Tianshuo Zhang, Baiqin Wang, Zhen Lei
  • for: 3D可变形模型(3DMM)在多种应用中提供了3D面部重建,但现有方法难以重建表情夸张的面部。
  • methods: 利用面部部位分割所蕴含的几何信息,引入部位重投影距离损失(Part Re-projection Distance Loss,PRDL)来优化点集的分布,从而提升面部重建性能。
  • results: 与基于可微渲染器的方法相比,PRDL具有更明确的梯度,并在大量定量与定性实验中达到了最先进的重建性能。
    Abstract 3D Morphable Models (3DMMs) provide promising 3D face reconstructions in various applications. However, existing methods struggle to reconstruct faces with extreme expressions due to deficiencies in supervisory signals, such as sparse or inaccurate landmarks. Segmentation information contains effective geometric contexts for face reconstruction. Certain attempts intuitively depend on differentiable renderers to compare the rendered silhouettes of reconstruction with segmentation, which is prone to issues like local optima and gradient instability. In this paper, we fully utilize the facial part segmentation geometry by introducing Part Re-projection Distance Loss (PRDL). Specifically, PRDL transforms facial part segmentation into 2D points and re-projects the reconstruction onto the image plane. Subsequently, by introducing grid anchors and computing different statistical distances from these anchors to the point sets, PRDL establishes geometry descriptors to optimize the distribution of the point sets for face reconstruction. PRDL exhibits a clear gradient compared to the renderer-based methods and presents state-of-the-art reconstruction performance in extensive quantitative and qualitative experiments. The project will be publicly available.
    摘要 三维可变模型(3DMM)提供了许多应用场景中的可靠三维面部重建。然而,现有的方法在面部表达强烈时受到监督信号不充分或不准确的缺陷,导致重建面部表达困难。 segmentation信息包含面部重建中有效的 геометрические上下文。certain attempts rely on differentiable renderers to compare the rendered silhouettes of reconstruction with segmentation, which is prone to issues like local optima and gradient instability。在本文中,我们充分利用面部部分分 segmentation的几何结构,通过引入 Part Re-projection Distance Loss(PRDL)来优化面部重建。具体来说,PRDL将面部部分分 segmentation转换成2D点 cloud,然后将重建图像平面上重 проек化。接着,通过引入网格锚点和计算不同的统计距离,PRDL建立了面部重建中的geometry描述符,以便优化点云的分布。PRDL与基于 renderer 的方法相比,具有明确的梯度,并在广泛的量化和质量测试中达到了状态 искусственный智能水平。该项目将于公共可用。

A knowledge-based data-driven (KBDD) framework for all-day identification of cloud types using satellite remote sensing

  • paper_url: http://arxiv.org/abs/2312.00308
  • repo_url: None
  • paper_authors: Longfeng Nie, Yuntian Chen, Mengge Du, Changqi Sun, Dongxiao Zhang
  • for: The paper is written for evaluating changes in rainfall, heatwaves, water resources, floods and droughts, food security and vegetation cover, as well as land use, using high-resolution geostationary observations.
  • methods: The paper proposes a knowledge-based data-driven (KBDD) framework for all-day identification of cloud types based on spectral information from Himawari-8/9 satellite sensors; a novel, simple and efficient network, named CldNet, is proposed and compared with widely used semantic segmentation networks.
  • results: The proposed model CldNet achieves an accuracy of 80.89+-2.18% in identifying cloud types, which is state-of-the-art and has increased by 32%, 46%, 22%, 2%, and 39%, respectively, compared with widely used semantic segmentation networks; the accuracy of CldNet-W (using visible and near-infrared bands) and CldNet-O (not using them) on the test dataset is 82.23+-2.14% and 73.21+-2.02%, respectively; the trained CldNet without any fine-tuning can predict cloud types with higher spatial resolution using satellite spectral data with spatial resolution 0.02°*0.02°, indicating that CldNet possesses a strong generalization ability.
    Abstract Cloud types, as a type of meteorological data, are of particular significance for evaluating changes in rainfall, heatwaves, water resources, floods and droughts, food security and vegetation cover, as well as land use. In order to effectively utilize high-resolution geostationary observations, a knowledge-based data-driven (KBDD) framework for all-day identification of cloud types based on spectral information from Himawari-8/9 satellite sensors is designed. And a novel, simple and efficient network, named CldNet, is proposed. Compared with widely used semantic segmentation networks, including SegNet, PSPNet, DeepLabV3+, UNet, and ResUnet, our proposed model CldNet with an accuracy of 80.89+-2.18% is state-of-the-art in identifying cloud types and has increased by 32%, 46%, 22%, 2%, and 39%, respectively. With the assistance of auxiliary information (e.g., satellite zenith/azimuth angle, solar zenith/azimuth angle), the accuracy of CldNet-W using visible and near-infrared bands and CldNet-O not using visible and near-infrared bands on the test dataset is 82.23+-2.14% and 73.21+-2.02%, respectively. Meanwhile, the total parameters of CldNet are only 0.46M, making it easy for edge deployment. More importantly, the trained CldNet without any fine-tuning can predict cloud types with higher spatial resolution using satellite spectral data with spatial resolution 0.02{\deg}*0.02{\deg}, which indicates that CldNet possesses a strong generalization ability. In aggregate, the KBDD framework using CldNet is a highly effective cloud-type identification system capable of providing a high-fidelity, all-day, spatiotemporal cloud-type database for many climate assessment fields.
    摘要 云类型作为一种气象数据,对评估降水、热浪、水资源、洪涝与干旱、粮食安全与植被覆盖以及土地利用的变化具有特殊意义。为了有效利用高分辨率静止轨道卫星观测,我们设计了一个基于知识的数据驱动(KBDD)框架,利用Himawari-8/9卫星传感器的光谱信息进行全天候云类型识别,并提出了一种新颖、简单且高效的网络CldNet。与SegNet、PSPNet、DeepLabV3+、UNet和ResUnet等常用语义分割网络相比,CldNet以80.89+-2.18%的准确率达到了云类型识别的最先进水平,分别提升了32%、46%、22%、2%和39%。在辅助信息(如卫星天顶角/方位角、太阳天顶角/方位角)的帮助下,使用可见光和近红外波段的CldNet-W与不使用可见光和近红外波段的CldNet-O在测试集上的准确率分别为82.23+-2.14%和73.21+-2.02%。同时,CldNet的总参数量仅为0.46M,便于边缘部署。更重要的是,训练好的CldNet无需任何微调,即可利用空间分辨率为0.02°*0.02°的卫星光谱数据预测更高空间分辨率的云类型,表明其具有很强的泛化能力。总体而言,基于CldNet的KBDD框架是一个高效的云类型识别系统,能够为众多气候评估领域提供高保真、全天候、时空连续的云类型数据库。

RadioGalaxyNET: Dataset and Novel Computer Vision Algorithms for the Detection of Extended Radio Galaxies and Infrared Hosts

  • paper_url: http://arxiv.org/abs/2312.00306
  • repo_url: None
  • paper_authors: Nikhel Gupta, Zeeshan Hayder, Ray P. Norris, Minh Huynh, Lars Petersson
  • for: automatic identification of associated components of extended sources and their corresponding infrared hosts
  • methods: multimodal dataset and novel computer vision algorithms
  • results: detection and localization of multi-component extended radio galaxies and their corresponding infrared hosts
    Abstract Creating radio galaxy catalogues from next-generation deep surveys requires automated identification of associated components of extended sources and their corresponding infrared hosts. In this paper, we introduce RadioGalaxyNET, a multimodal dataset, and a suite of novel computer vision algorithms designed to automate the detection and localization of multi-component extended radio galaxies and their corresponding infrared hosts. The dataset comprises 4,155 instances of galaxies in 2,800 images with both radio and infrared channels. Each instance provides information about the extended radio galaxy class, its corresponding bounding box encompassing all components, the pixel-level segmentation mask, and the keypoint position of its corresponding infrared host galaxy. RadioGalaxyNET is the first dataset to include images from the highly sensitive Australian Square Kilometre Array Pathfinder (ASKAP) radio telescope, corresponding infrared images, and instance-level annotations for galaxy detection. We benchmark several object detection algorithms on the dataset and propose a novel multimodal approach to simultaneously detect radio galaxies and the positions of infrared hosts.
    摘要 从下一代深度巡天中构建射电星系目录,需要自动识别扩展源的各个关联组件及其对应的红外宿主星系。在本文中,我们介绍了多模态数据集RadioGalaxyNET,以及一套新的计算机视觉算法,用于自动检测和定位多组件扩展射电星系及其对应的红外宿主。该数据集包含2,800幅具有射电和红外通道的图像中的4,155个星系实例。每个实例都提供了扩展射电星系的类别、包含所有组件的边界框、像素级分割掩码以及对应红外宿主星系的关键点位置。RadioGalaxyNET是首个包含高灵敏度ASKAP(Australian Square Kilometre Array Pathfinder)射电望远镜图像、对应红外图像以及实例级星系检测标注的数据集。我们在该数据集上评测了多种目标检测算法,并提出了一种新的多模态方法,可同时检测射电星系及其红外宿主的位置。

Developmental Pretraining (DPT) for Image Classification Networks

  • paper_url: http://arxiv.org/abs/2312.00304
  • repo_url: None
  • paper_authors: Niranjan Rajesh, Debayan Gupta
  • for: 针对深度神经网络在物体识别中日益增长、难以为继的数据需求,以及传统预训练方法对大量数据的依赖,提出发展式预训练(Developmental PreTraining,DPT)方法。
  • methods: DPT以人类婴儿视觉发展为灵感,设计基于课程的分阶段预训练:先向网络教授精心挑选的原始且通用的特征(如边缘和形状),再逐步引入更复杂的特征。
  • results: 与随机初始化权重的模型相比,经过DPT训练的模型性能有所提升,表明DPT是一种可行的预训练方案。
    Abstract In the backdrop of increasing data requirements of Deep Neural Networks for object recognition that is growing more untenable by the day, we present Developmental PreTraining (DPT) as a possible solution. DPT is designed as a curriculum-based pre-training approach designed to rival traditional pre-training techniques that are data-hungry. These training approaches also introduce unnecessary features that could be misleading when the network is employed in a downstream classification task where the data is sufficiently different from the pre-training data and is scarce. We design the curriculum for DPT by drawing inspiration from human infant visual development. DPT employs a phased approach where carefully-selected primitive and universal features like edges and shapes are taught to the network participating in our pre-training regime. A model that underwent the DPT regime is tested against models with randomised weights to evaluate the viability of DPT.
    摘要 在深度神经网络对物体识别的数据需求日益难以为继的背景下,我们提出了发展式预训练(DPT)作为一种可能的解决方案。DPT是一种基于课程的预训练方法,旨在与对数据需求巨大的传统预训练技术相抗衡;这些传统技术还可能引入不必要的特征,当下游分类任务的数据与预训练数据差异较大且数量稀少时,这些特征可能产生误导。我们以人类婴儿视觉发展为灵感设计了DPT的课程:采用分阶段的方式,向参与预训练的网络教授精心挑选的原始且通用的特征,例如边缘和形状。我们将经过DPT训练的模型与随机初始化权重的模型进行比较,以评估DPT的可行性。

QIENet: Quantitative irradiance estimation network using recurrent neural network based on satellite remote sensing data

  • paper_url: http://arxiv.org/abs/2312.00299
  • repo_url: None
  • paper_authors: Longfeng Nie, Yuntian Chen, Dongxiao Zhang, Xinyue Liu, Wentian Yuan
  • for: 提供高空间分辨率的全球水平辐照度(GHI)估计,用于可再生能源发电评估。
  • methods: 提出定量辐照度估计网络(QIENet),分别用循环神经网络(RNN)和卷积操作提取并融合卫星遥感数据的时间与空间特征,同时将与GHI相关的时间信息(小时、日、月)和地理信息(海拔、经度、纬度)作为模型输入。
  • results: QIENet能够准确估计逐小时GHI且不会高估地面观测值;与ERA5/NSRDB相比,RMSE分别降低27.51%/18.00%,R2分别提高20.17%/9.42%,r分别提高8.69%/3.54%。(模型结构示意代码见下文。)
    Abstract Global horizontal irradiance (GHI) plays a vital role in estimating solar energy resources, which are used to generate sustainable green energy. In order to estimate GHI with high spatial resolution, a quantitative irradiance estimation network, named QIENet, is proposed. Specifically, the temporal and spatial characteristics of remote sensing data of the satellite Himawari-8 are extracted and fused by recurrent neural network (RNN) and convolution operation, respectively. Not only remote sensing data, but also GHI-related time information (hour, day, and month) and geographical information (altitude, longitude, and latitude), are used as the inputs of QIENet. The satellite spectral channels B07 and B11 - B15 and time are recommended as model inputs for QIENet according to the spatial distributions of annual solar energy. Meanwhile, QIENet is able to capture the impact of various clouds on hourly GHI estimates. More importantly, QIENet does not overestimate ground observations and can also reduce RMSE by 27.51%/18.00%, increase R2 by 20.17%/9.42%, and increase r by 8.69%/3.54% compared with ERA5/NSRDB. Furthermore, QIENet is capable of providing a high-fidelity hourly GHI database with spatial resolution 0.02{\deg} * 0.02{\deg}(approximately 2km * 2km) for many applied energy fields.
    摘要
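
An illustrative PyTorch sketch of the kind of architecture the abstract describes: convolutional encoding of multi-band satellite frames, recurrent temporal aggregation, and fusion with time/geography scalars to regress hourly GHI. Channel counts, layer sizes, and the six-band input (B07, B11-B15) pairing are assumptions; this is not the released QIENet.

```python
import torch
import torch.nn as nn

class GHIEstimator(nn.Module):
    """Illustrative fusion of satellite band sequences with time/geo scalars."""
    def __init__(self, n_bands=6, n_aux=6, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(                     # per-frame spatial features
            nn.Conv2d(n_bands, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rnn = nn.GRU(64, hidden, batch_first=True)   # temporal aggregation
        self.aux = nn.Sequential(nn.Linear(n_aux, 32), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(hidden + 32, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, frames, aux):
        # frames: (B, T, n_bands, H, W); aux: (B, n_aux) e.g. hour, day, month, alt, lon, lat
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        _, h = self.rnn(feats)
        return self.head(torch.cat([h[-1], self.aux(aux)], dim=-1)).squeeze(-1)

model = GHIEstimator()
ghi = model(torch.randn(2, 4, 6, 32, 32), torch.randn(2, 6))  # (B,) hourly GHI estimates
```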

Towards Redundancy-Free Sub-networks in Continual Learning

  • paper_url: http://arxiv.org/abs/2312.00840
  • repo_url: None
  • paper_authors: Cheng Chen, Jingkuan Song, LianLi Gao, Heng Tao Shen
  • for: This paper addresses the problem of catastrophic forgetting (CF) in continual learning, specifically by proposing a method called Information Bottleneck Masked sub-network (IBM) to mitigate CF and improve the efficiency of new tasks training.
  • methods: The IBM method uses information bottleneck to eliminate redundancy within sub-networks, accumulates valuable information into essential weights, and decomposes hidden representations to automate the construction process.
  • results: The paper shows that IBM consistently outperforms state-of-the-art methods, with a 70% reduction in the number of parameters within sub-networks and an 80% decrease in training time.
    Abstract Catastrophic Forgetting (CF) is a prominent issue in continual learning. Parameter isolation addresses this challenge by masking a sub-network for each task to mitigate interference with old tasks. However, these sub-networks are constructed relying on weight magnitude, which does not necessarily correspond to the importance of weights, resulting in maintaining unimportant weights and constructing redundant sub-networks. To overcome this limitation, inspired by information bottleneck, which removes redundancy between adjacent network layers, we propose \textbf{\underline{I}nformation \underline{B}ottleneck \underline{M}asked sub-network (IBM)} to eliminate redundancy within sub-networks. Specifically, IBM accumulates valuable information into essential weights to construct redundancy-free sub-networks, not only effectively mitigating CF by freezing the sub-networks but also facilitating new tasks training through the transfer of valuable knowledge. Additionally, IBM decomposes hidden representations to automate the construction process and make it flexible. Extensive experiments demonstrate that IBM consistently outperforms state-of-the-art methods. Notably, IBM surpasses the state-of-the-art parameter isolation method with a 70\% reduction in the number of parameters within sub-networks and an 80\% decrease in training time.

An Adaptive Correspondence Scoring Framework for Unsupervised Image Registration of Medical Images

  • paper_url: http://arxiv.org/abs/2312.00837
  • repo_url: None
  • paper_authors: Xiaoran Zhang, John C. Stendahl, Lawrence Staib, Albert J. Sinusas, Alex Wong, James S. Duncan
  • for: 提出一种用于无监督医学图像配准的自适应训练方案。
  • methods: 在训练过程中利用对应性评分图对误差残差进行重新加权,以缓解噪声和共视性变化等干扰因素造成的对应关系丢失,防止参数化位移估计器被噪声梯度带偏。
  • results: 在三种代表性配准架构和三个医学图像数据集上,该自适应框架在定量与定性指标上均明显优于其他基线方法,且改进具有统计显著性。
    Abstract We propose an adaptive training scheme for unsupervised medical image registration. Existing methods rely on image reconstruction as the primary supervision signal. However, nuisance variables (e.g. noise and covisibility) often cause the loss of correspondence between medical images, violating the Lambertian assumption in physical waves (e.g. ultrasound) and consistent imaging acquisition. As the unsupervised learning scheme relies on intensity constancy to establish correspondence between images for reconstruction, this introduces spurious error residuals that are not modeled by the typical training objective. To mitigate this, we propose an adaptive framework that re-weights the error residuals with a correspondence scoring map during training, preventing the parametric displacement estimator from drifting away due to noisy gradients, which leads to performance degradations. To illustrate the versatility and effectiveness of our method, we tested our framework on three representative registration architectures across three medical image datasets along with other baselines. Our proposed adaptive framework consistently outperforms other methods both quantitatively and qualitatively. Paired t-tests show that our improvements are statistically significant. The code will be publicly available at \url{https://voldemort108x.github.io/AdaCS/}.
    摘要 我们提出了一种用于无监督医学图像配准的自适应训练方案。现有方法通常以图像重建作为主要监督信号。然而,噪声和共视性变化等干扰因素往往导致医学图像之间的对应关系丢失,违背了物理波(如超声)的朗伯假设以及成像采集一致性的假设。由于无监督学习方案依赖亮度恒定假设来建立图像间的对应关系以进行重建,这会引入典型训练目标未建模的虚假误差残差。为缓解这一问题,我们提出了一个自适应框架,在训练过程中利用对应性评分图对误差残差重新加权,防止参数化位移估计器因噪声梯度而偏移,从而避免性能退化。为了说明方法的通用性和有效性,我们在三个医学图像数据集上对三种代表性配准架构进行了测试,并与其他基线方法比较。所提出的自适应框架在定量与定性指标上均稳定优于其他方法,配对t检验表明这些改进具有统计显著性。代码将公开于 https://voldemort108x.github.io/AdaCS/ 。(下文附示意代码。)
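
A minimal sketch of the re-weighting idea: warp the moving image with the estimated displacement field, score each pixel by how ordinary its residual is, and let that score down-weight outliers in the data term. The exponential scoring rule here is an assumption; the paper's correspondence scoring map is derived differently.

```python
import torch
import torch.nn.functional as F

def warp(moving, flow):
    """Bilinearly warp `moving` (N,1,H,W) with a dense displacement field `flow` (N,2,H,W),
    where flow[:,0] is the x-displacement and flow[:,1] the y-displacement in pixels."""
    n, _, h, w = moving.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=moving.device),
                            torch.arange(w, device=moving.device), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).float()                  # (H, W, 2) in xy order
    new = base[None] + flow.permute(0, 2, 3, 1)                   # displaced sample locations
    gx = 2 * new[..., 0] / (w - 1) - 1                            # normalise to [-1, 1]
    gy = 2 * new[..., 1] / (h - 1) - 1
    return F.grid_sample(moving, torch.stack([gx, gy], dim=-1), align_corners=True)

def weighted_registration_loss(fixed, moving, flow, smooth_w=0.01):
    warped = warp(moving, flow)
    residual = (fixed - warped) ** 2
    with torch.no_grad():                                         # scoring map, not optimised
        score = torch.exp(-residual / (residual.mean() + 1e-8))   # outliers get small weight
    data_term = (score * residual).sum() / (score.sum() + 1e-8)
    smooth = (flow[..., 1:, :] - flow[..., :-1, :]).abs().mean() + \
             (flow[..., :, 1:] - flow[..., :, :-1]).abs().mean()  # first-order smoothness
    return data_term + smooth_w * smooth

# toy usage
fixed, moving = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
flow = torch.zeros(1, 2, 64, 64, requires_grad=True)
loss = weighted_registration_loss(fixed, moving, flow)
loss.backward()
```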

Adaptability of Computer Vision at the Tactical Edge: Addressing Environmental Uncertainty

  • paper_url: http://arxiv.org/abs/2312.00269
  • repo_url: None
  • paper_authors: Hayden Moore
  • for: 提升战术前沿的指挥控制(C2)系统中计算机视觉(CV)系统的情报分析能力。
  • methods: 提出一种由不确定性量化(UQ)驱动的思路,在战术前沿同步进行稳健的数据治理与模型微调,使CV模型适应不断变化的环境和目标。
  • results: 论文认为,将UQ融入C2与CV系统的核心运行流程,可以提升CV系统在战场上的适应能力与精度。
    Abstract Computer Vision (CV) systems are increasingly being adopted into Command and Control (C2) systems to improve intelligence analysis on the battlefield, the tactical edge. CV systems leverage Artificial Intelligence (AI) algorithms to help visualize and interpret the environment, enhancing situational awareness. However, the adaptability of CV systems at the tactical edge remains challenging due to rapidly changing environments and objects which can confuse the deployed models. A CV model leveraged in this environment can become uncertain in its predictions, as the environment and the objects existing in the environment begin to change. Additionally, mission objectives can rapidly change leading to adjustments in technology, camera angles, and image resolutions. All of which can negatively affect the performance of and potentially introduce uncertainty into the system. When the training environment and/or technology differs from the deployment environment, CV models can perform unexpectedly. Unfortunately, most scenarios at the tactical edge do not incorporate Uncertainty Quantification (UQ) into their deployed C2 and CV systems. This concept paper explores the idea of synchronizing robust data operations and model fine-tuning driven by UQ all at the tactical edge. Specifically, curating datasets and training child models based on the residuals of predictions, using these child models to calculate prediction intervals (PI), and then using these PI to calibrate the deployed models. By incorporating UQ into the core operations surrounding C2 and CV systems at the tactical edge, we can help drive purposeful adaptability on the battlefield.
    摘要 这篇概念论文探讨了在战术前沿由不确定性量化(UQ)驱动、同步开展稳健数据治理与模型微调的思路。具体而言,这些做法包括:1. 整理数据集,并基于预测残差训练子模型;2. 利用这些子模型计算预测区间(PI);3. 利用预测区间对已部署模型进行校准。通过将UQ融入战术前沿C2与CV系统的核心运行流程,可以在战场上实现有目的的适应性调整。(下文附示意代码。)
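
A split-conformal flavour of the residual-based prediction-interval idea: calibrate an interval half-width from held-out residuals, then monitor it at the edge and trigger data curation or fine-tuning when it drifts. The quantile rule and the decision threshold are illustrative; the paper's child-model scheme is richer than this.

```python
import numpy as np

def prediction_interval_width(calib_pred, calib_true, alpha=0.1):
    """Half-width of a (1-alpha) prediction interval from calibration residuals
    (split-conformal style): an empirical quantile of the absolute residuals."""
    residuals = np.abs(calib_true - calib_pred)
    n = len(residuals)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(residuals, q_level)

# toy usage with a hypothetical regression-style output (e.g. a detection confidence)
rng = np.random.default_rng(0)
calib_pred = rng.normal(size=500)
calib_true = calib_pred + rng.normal(scale=0.3, size=500)
half_width = prediction_interval_width(calib_pred, calib_true, alpha=0.1)

new_pred = 0.7
print(f"90% PI for new prediction: [{new_pred - half_width:.2f}, {new_pred + half_width:.2f}]")
# a deployed system could trigger model fine-tuning or data curation
# whenever half_width drifts above an operational threshold
```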

Heteroscedastic Uncertainty Estimation for Probabilistic Unsupervised Registration of Noisy Medical Images

  • paper_url: http://arxiv.org/abs/2312.00836
  • repo_url: None
  • paper_authors: Xiaoran Zhang, Daniel H. Pak, Shawn S. Ahn, Xiaoxiao Li, Chenyu You, Lawrence Staib, Albert J. Sinusas, Alex Wong, James S. Duncan
  • for: 本文为无监督医学图像配准提出了一个异方差不确定性估计框架。现有方法假设图像各处噪声水平一致,忽略了真实医学图像中噪声分布异方差且依赖输入的特性,这会带来噪声梯度并造成性能下降。
  • methods: 使用单独的方差估计器对异方差噪声建模,并为位移估计器提出一种基于相对γ指数化信噪比(SNR)的自适应加权方案,以防止模型被误差残差产生的虚假梯度带偏,从而获得更准确的位移估计。
  • results: 在三个医学图像数据集上对两种代表性配准架构进行了测试,并与其他基线方法比较。所提出的框架在定量与定性指标上均更优,同时给出了准确且合理的不确定性度量;配对t检验表明配准精度的提升具有统计显著性。
    Abstract This paper proposes a heteroscedastic uncertainty estimation framework for unsupervised medical image registration. Existing methods rely on objectives (e.g. mean-squared error) that assume a uniform noise level across the image, disregarding the heteroscedastic and input-dependent characteristics of noise distribution in real-world medical images. This further introduces noisy gradients due to undesired penalization on outliers, causing unnatural deformation and performance degradation. To mitigate this, we propose an adaptive weighting scheme with a relative $\gamma$-exponentiated signal-to-noise ratio (SNR) for the displacement estimator after modeling the heteroscedastic noise using a separate variance estimator to prevent the model from being driven away by spurious gradients from error residuals, leading to more accurate displacement estimation. To illustrate the versatility and effectiveness of the proposed method, we tested our framework on two representative registration architectures across three medical image datasets. Our proposed framework consistently outperforms other baselines both quantitatively and qualitatively while also providing accurate and sensible uncertainty measures. Paired t-tests show that our improvements in registration accuracy are statistically significant. The code will be publicly available at \url{https://voldemort108x.github.io/hetero_uncertainty/}.
    摘要 本文为无监督医学图像配准提出了一个异方差不确定性估计框架。现有方法所采用的目标函数(如均方误差)假设图像各处噪声水平一致,忽略了真实医学图像中噪声分布异方差且依赖输入的特性,进而因对离群点的不当惩罚而引入噪声梯度,导致不自然的形变和性能退化。为缓解这一问题,我们先用单独的方差估计器对异方差噪声建模,再为位移估计器提出一种基于相对γ指数化信噪比(SNR)的自适应加权方案,防止模型被误差残差产生的虚假梯度带偏,从而获得更准确的位移估计。为了说明方法的通用性和有效性,我们在三个医学图像数据集上对两种代表性配准架构进行了测试。所提出的框架在定量与定性指标上均稳定优于其他基线方法,并能给出准确且合理的不确定性度量。配对t检验表明配准精度的提升具有统计显著性。代码将公开于 https://voldemort108x.github.io/hetero_uncertainty/ 。(下文附示意代码。)
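
A hedged sketch of a heteroscedastic data term: a separate variance head predicts per-pixel log-variance, a Gaussian negative log-likelihood trains it, and a relative SNR raised to gamma re-weights the displacement estimator's residuals. The exact normalization and the use of the warped image as the "signal" are assumptions, not the paper's formulation.

```python
import torch

def heteroscedastic_registration_loss(fixed, warped, log_var, gamma=1.0, eps=1e-8):
    """fixed, warped: (N,1,H,W) images; log_var: (N,1,H,W) predicted log noise variance
    from a separate variance head."""
    var = log_var.exp()
    residual = (fixed - warped) ** 2

    # Gaussian negative log-likelihood trains the variance estimator
    nll = 0.5 * (residual / (var + eps) + log_var)

    # relative gamma-exponentiated SNR re-weights the displacement estimator's
    # data term; detached so the weights themselves are not optimised away
    snr = warped.detach() ** 2 / (var.detach() + eps)
    weight = (snr / (snr.mean() + eps)) ** gamma

    return (weight * residual).mean() + nll.mean()

# toy usage with a zero-initialised log-variance map
fixed = torch.rand(1, 1, 64, 64)
warped = torch.rand(1, 1, 64, 64, requires_grad=True)
log_var = torch.zeros(1, 1, 64, 64, requires_grad=True)
loss = heteroscedastic_registration_loss(fixed, warped, log_var)
loss.backward()
```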