cs.CV - 2023-09-08

Open and reusable deep learning for pathology with WSInfer and QuPath

  • paper_url: http://arxiv.org/abs/2309.04631
  • repo_url: None
  • paper_authors: Jakub R. Kaczmarzyk, Alan O’Callaghan, Fiona Inglis, Tahsin Kurc, Rajarsi Gupta, Erich Bremer, Peter Bankhead, Joel H. Saltz
  • for: 这份论文的目的是提高肿瘤学中深度学习模型的应用,并使其更加流畅和可存取。
  • methods: 本研究使用了一个新的开源软件生态系统,名为WSInfer,以便更加方便地将深度学习模型应用到肿瘤影像中。WSInfer包括三个主要元素:1)一个Python套件和命令行工具,可以快速地将裁剪式深度学习应用到整个肿瘤影像中; 2)一个QuPath扩展,提供了一个易用且交互式的软件引擎,并3)一个模型zoo,可以让肿瘤学模型和metadata在标准化的形式下进行分享。
  • results: 本研究的结果显示,WSInfer可以让肿瘤学家和研究人员更加方便地存取和应用深度学习模型,并且不需要程式码经验。WSInfer的源代码被hosts在GitHub上,并且有相关的文档在https://wsinfer.readthedocs.io
    Abstract The field of digital pathology has seen a proliferation of deep learning models in recent years. Despite substantial progress, it remains rare for other researchers and pathologists to be able to access models published in the literature and apply them to their own images. This is due to difficulties in both sharing and running models. To address these concerns, we introduce WSInfer: a new, open-source software ecosystem designed to make deep learning for pathology more streamlined and accessible. WSInfer comprises three main elements: 1) a Python package and command line tool to efficiently apply patch-based deep learning inference to whole slide images; 2) a QuPath extension that provides an alternative inference engine through user-friendly and interactive software, and 3) a model zoo, which enables pathology models and metadata to be easily shared in a standardized form. Together, these contributions aim to encourage wider reuse, exploration, and interrogation of deep learning models for research purposes, by putting them into the hands of pathologists and eliminating a need for coding experience when accessed through QuPath. The WSInfer source code is hosted on GitHub and documentation is available at https://wsinfer.readthedocs.io.
    摘要 随着数字 PATHOLOGY 领域的发展,深度学习模型在过去几年内得到了广泛应用。尽管已经取得了显著进步,但是对于其他研究人员和病理学家来说,访问已经发表的模型并将其应用到自己的图像仍然是非常困难的。这是由于分享和运行模型的困难所致。为解决这些问题,我们介绍了 WSInfer:一个新的开源软件生态系统,旨在使得 PATHOLOGY 中的深度学习更加流畅和可访问。WSInfer 包括三个主要元素:1. Python 包和命令行工具,用于高效地应用 patch-based 深度学习推理到整个扫描图像上。2. QuPath 扩展,提供了一个用户友好的和交互式的推理引擎,并且可以让病理学家通过 QuPath 访问和运行深度学习模型,不需要编程经验。3. 模型 zoo,可以方便地将 PATHOLOGY 模型和元数据分享在标准化的形式下。综上所述,WSInfer 的贡献是希望通过将深度学习模型带到病理学家手上,并且不需要编程经验,以便更多的研究人员和病理学家可以轻松地 reuse、探索和调查 PATHOLOGY 中的深度学习模型,以便更好地推进 PATHOLOGY 领域的研究。WSInfer 的源代码位于 GitHub 上,文档可以在 中找到。

Style Generation: Image Synthesis based on Coarsely Matched Texts

  • paper_url: http://arxiv.org/abs/2309.04608
  • repo_url: None
  • paper_authors: Mengyao Cui, Zhe Zhu, Shao-Ping Lu, Yulu Yang
  • for: 文章主要目的是提出一种基于文本指导的图像风格生成方法,以便在具有粗糙匹配的文本指导下进行图像生成和修饰。
  • methods: 本文提出了一种基于文本指导的图像风格生成方法,包括两个阶段:第一阶段使用句子特征来生成图像的整体风格,第二阶段使用多模态风格合成模块来细化生成的风格。
  • results: 经过广泛的实验和简洁分析,本文提出的方法能够有效地生成基于文本指导的图像风格,并且可以应用于多个实际场景,如文本-图像对齐和故事视觉化等。
    Abstract Previous text-to-image synthesis algorithms typically use explicit textual instructions to generate/manipulate images accurately, but they have difficulty adapting to guidance in the form of coarsely matched texts. In this work, we attempt to stylize an input image using such coarsely matched text as guidance. To tackle this new problem, we introduce a novel task called text-based style generation and propose a two-stage generative adversarial network: the first stage generates the overall image style with a sentence feature, and the second stage refines the generated style with a synthetic feature, which is produced by a multi-modality style synthesis module. We re-filter one existing dataset and collect a new dataset for the task. Extensive experiments and ablation studies are conducted to validate our framework. The practical potential of our work is demonstrated by various applications such as text-image alignment and story visualization. Our datasets are published at https://www.kaggle.com/datasets/mengyaocui/style-generation.
    摘要 The first stage of our GAN uses a sentence feature to generate the overall image style, while the second stage refines the generated style with a synthetic feature produced by a multi-modality style synthesis module. We collect a new dataset and re-filter an existing dataset to support our framework.To validate our approach, we conduct extensive experiments and ablation studies. Our work has practical potential, as demonstrated by applications such as text-image alignment and story visualization. Our datasets are available at .

Dynamic Mesh-Aware Radiance Fields

  • paper_url: http://arxiv.org/abs/2309.04581
  • repo_url: https://github.com/YilingQiao/DMRF
  • paper_authors: Yi-Ling Qiao, Alexander Gao, Yiran Xu, Yue Feng, Jia-Bin Huang, Ming C. Lin
  • for: 这篇论文的目的是探讨在把几何体资产逻辑地嵌入到 fotorealistic Neural Radience Fields(NeRF)中,以便在物理上一致的方式渲染和模拟它们,从系统角度来看是一个未经探讨的领域。
  • methods: 这篇论文提出了一种两向相互关联的方法,即在渲染和模拟过程中,将几何体和NeRF之间进行相互 Coupling。首先,我们审视了几何体和NeRF之间的光传输方程,然后将它们转化为一种高效的算法,用于更新各个碰撞点的辐射和通过put。为了解决NeRF使用的标准颜色空间和几何体之间的差异,我们在NeRF中训练了高动态范围(HDR)图像。此外,我们还提出了一种策略来估算NeRF中的光源和投射阴影。
  • results: 我们的实验结果表明,在渲染和模拟过程中,将几何体和NeRF之间进行相互 Coupling,可以提高视觉真实性。这是因为它允许真实的光传输从NeRF媒体onto几何体,对折射/填充表面和 diffuse surface informed by dynamic scene产生影响。
    Abstract Embedding polygonal mesh assets within photorealistic Neural Radience Fields (NeRF) volumes, such that they can be rendered and their dynamics simulated in a physically consistent manner with the NeRF, is under-explored from the system perspective of integrating NeRF into the traditional graphics pipeline. This paper designs a two-way coupling between mesh and NeRF during rendering and simulation. We first review the light transport equations for both mesh and NeRF, then distill them into an efficient algorithm for updating radiance and throughput along a cast ray with an arbitrary number of bounces. To resolve the discrepancy between the linear color space that the path tracer assumes and the sRGB color space that standard NeRF uses, we train NeRF with High Dynamic Range (HDR) images. We also present a strategy to estimate light sources and cast shadows on the NeRF. Finally, we consider how the hybrid surface-volumetric formulation can be efficiently integrated with a high-performance physics simulator that supports cloth, rigid and soft bodies. The full rendering and simulation system can be run on a GPU at interactive rates. We show that a hybrid system approach outperforms alternatives in visual realism for mesh insertion, because it allows realistic light transport from volumetric NeRF media onto surfaces, which affects the appearance of reflective/refractive surfaces and illumination of diffuse surfaces informed by the dynamic scene.
    摘要 <> transtable mesh assets within photorealistic Neural Radience Fields(NeRF)volumes, such that they can be rendered and their dynamics simulated in a physically consistent manner with the NeRF, is under-explored from the system perspective of integrating NeRF into the traditional graphics pipeline. This paper designs a two-way coupling between mesh and NeRF during rendering and simulation. We first review the light transport equations for both mesh and NeRF, then distill them into an efficient algorithm for updating radiance and throughput along a cast ray with an arbitrary number of bounces. To resolve the discrepancy between the linear color space that the path tracer assumes and the sRGB color space that standard NeRF uses, we train NeRF with High Dynamic Range(HDR)images. We also present a strategy to estimate light sources and cast shadows on the NeRF. Finally, we consider how the hybrid surface-volumetric formulation can be efficiently integrated with a high-performance physics simulator that supports cloth, rigid and soft bodies. The full rendering and simulation system can be run on a GPU at interactive rates. We show that a hybrid system approach outperforms alternatives in visual realism for mesh insertion, because it allows realistic light transport from volumetric NeRF media onto surfaces, which affects the appearance of reflective/refractive surfaces and illumination of diffuse surfaces informed by the dynamic scene.Note that Simplified Chinese is a more casual and informal version of Chinese, and the word order and grammar may be different from Traditional Chinese.

Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation

  • paper_url: http://arxiv.org/abs/2309.04573
  • repo_url: None
  • paper_authors: Shyam Nandan Rai, Fabio Cermelli, Barbara Caputo, Carlo Masone
  • for: 本文旨在提出一种基于面Mask的异常检测方法,以解决自动驾驶应用中异常对象实例分割的问题。
  • methods: 本文提出了一种新的面Mask分类架构,包括全球面Mask注意模块、面对比学习、面修正解决方案和面架构特性采集方法等技术创新,以提高异常检测的精度。
  • results: 经过全面的质量评估,本文的Mask2异常方法在异常分割、开放集Semantic分割和开放集精度分割三个任务上达到了新的国际纪录。
    Abstract Segmenting unknown or anomalous object instances is a critical task in autonomous driving applications, and it is approached traditionally as a per-pixel classification problem. However, reasoning individually about each pixel without considering their contextual semantics results in high uncertainty around the objects' boundaries and numerous false positives. We propose a paradigm change by shifting from a per-pixel classification to a mask classification. Our mask-based method, Mask2Anomaly, demonstrates the feasibility of integrating a mask-classification architecture to jointly address anomaly segmentation, open-set semantic segmentation, and open-set panoptic segmentation. Mask2Anomaly includes several technical novelties that are designed to improve the detection of anomalies/unknown objects: i) a global masked attention module to focus individually on the foreground and background regions; ii) a mask contrastive learning that maximizes the margin between an anomaly and known classes; iii) a mask refinement solution to reduce false positives; and iv) a novel approach to mine unknown instances based on the mask-architecture properties. By comprehensive qualitative and qualitative evaluation, we show Mask2Anomaly achieves new state-of-the-art results across the benchmarks of anomaly segmentation, open-set semantic segmentation, and open-set panoptic segmentation.
    摘要 segmenting unknown or anomalous object instances是自动驾驶应用中的一个关键任务,它通常是以每个像素为单位进行分类的传统方法。然而,不考虑每个像素的语义上下文会导致对象边界的高度不确定性和多个假阳性。我们提议一种思路变革,即从每个像素分类转换到Mask分类。我们的Mask2Anomaly方法表明了将Mask分类建立在 JOINT 中的可能性,以同时解决异常分 segmentation、开放集Semantic segmentation和开放集Panoptic segmentation的问题。Mask2Anomaly方法包括多个技术创新,用于提高异常检测:一、全局遮盲注意力模块,以各自针对前景和背景区域进行遮盲注意力;二、遮盲对比学习,以最大化异常与已知类之间的距离;三、遮盲修正解决方案,以降低假阳性;四、基于Mask-architecture属性挖掘未知实例的新方法。通过全面的Qualitative和Quantitative评估,我们证明Mask2Anomaly方法在 benchmark 上实现了新的状态可识别结果,包括异常分 segmentation、开放集Semantic segmentation和开放集Panoptic segmentation。

Poster: Making Edge-assisted LiDAR Perceptions Robust to Lossy Point Cloud Compression

  • paper_url: http://arxiv.org/abs/2309.04549
  • repo_url: None
  • paper_authors: Jin Heo, Gregorie Phillips, Per-Erik Brodin, Ada Gavrilovska
  • for: 提高LiDAR点云的质量,以减少因压缩而导致的感知性能下降。
  • methods: 使用基于深度梯度的插值算法来提高LiDAR点云的质量。
  • results: 与现有的图像插值算法相比,该算法可以提供更好的质量结果,当点云从插值后重建时。
    Abstract Real-time light detection and ranging (LiDAR) perceptions, e.g., 3D object detection and simultaneous localization and mapping are computationally intensive to mobile devices of limited resources and often offloaded on the edge. Offloading LiDAR perceptions requires compressing the raw sensor data, and lossy compression is used for efficiently reducing the data volume. Lossy compression degrades the quality of LiDAR point clouds, and the perception performance is decreased consequently. In this work, we present an interpolation algorithm improving the quality of a LiDAR point cloud to mitigate the perception performance loss due to lossy compression. The algorithm targets the range image (RI) representation of a point cloud and interpolates points at the RI based on depth gradients. Compared to existing image interpolation algorithms, our algorithm shows a better qualitative result when the point cloud is reconstructed from the interpolated RI. With the preliminary results, we also describe the next steps of the current work.
    摘要 现实时光 detection和跟踪(LiDAR)感知需要大量计算能力,例如3D对象检测和同时地图定位。由于移动设备的限制资源,LiDAR感知通常会在边缘上下载。压缩 Raw sensor data 需要lossy compression,这会降低LiDAR点云的质量。在这种情况下,我们提出了一种 interpolating algorithm,用于改善LiDAR点云的质量,以避免因压缩而导致的感知性能下降。我们的算法targets the range image(RI)表示法,并在RI基于深度梯度进行点云点的 interpolating。与现有的图像 interpolating algorithm相比,我们的算法在重建点云时表现更好。在下一步工作中,我们还将描述我们的current work的进展。

Examining Autoexposure for Challenging Scenes

  • paper_url: http://arxiv.org/abs/2309.04542
  • repo_url: None
  • paper_authors: SaiKiran Tedla, Beixuan Yang, Michael S. Brown
  • for: 提供一个大量的曝光数据集,以便发展适用于具有变化照明的环境中的曝光算法。
  • methods: 使用一个软件平台,让不同的曝光算法可以在一个插件的方式下使用数据集进行重复评估。
  • results: 透过评估一些现有的曝光策略,发现大多数使用者偏好使用简单的焦点方法来应对具有变化照明的情况。
    Abstract Autoexposure (AE) is a critical step applied by camera systems to ensure properly exposed images. While current AE algorithms are effective in well-lit environments with constant illumination, these algorithms still struggle in environments with bright light sources or scenes with abrupt changes in lighting. A significant hurdle in developing new AE algorithms for challenging environments, especially those with time-varying lighting, is the lack of suitable image datasets. To address this issue, we have captured a new 4D exposure dataset that provides a large solution space (i.e., shutter speed range from (1/500 to 15 seconds) over a temporal sequence with moving objects, bright lights, and varying lighting. In addition, we have designed a software platform to allow AE algorithms to be used in a plug-and-play manner with the dataset. Our dataset and associate platform enable repeatable evaluation of different AE algorithms and provide a much-needed starting point to develop better AE methods. We examine several existing AE strategies using our dataset and show that most users prefer a simple saliency method for challenging lighting conditions.
    摘要 自动曝光(AE)是摄像系统中一个关键的步骤,以确保得到正确曝光的图像。目前的AE算法在充足照明环境下效果良好,但是这些算法在灯光强度变化或场景中有突然变化的照明情况下仍然受到挑战。开发新的AE算法需要一个适当的图像数据集,但现有的问题在开发新算法方面带来了很大的障碍。为解决这个问题,我们已经捕捉了一个新的4D曝光数据集,该数据集提供了广泛的解决空间(即闭合速度范围从(1/500到15秒)),并且包含了在时间序列中移动的 объек、灯光和不同的照明情况。此外,我们还设计了一个软件平台,以便AE算法可以在插件化的方式使用该数据集。我们的数据集和相关平台为不同的AE算法提供了重复可评估的开始点,并且我们通过使用我们的数据集对一些现有的AE策略进行了评估,发现大多数用户在困难的照明条件下偏好使用简单的注意力方法。

Generalized Cross-domain Multi-label Few-shot Learning for Chest X-rays

  • paper_url: http://arxiv.org/abs/2309.04462
  • repo_url: None
  • paper_authors: Aroof Aimen, Arsh Verma, Makarand Tapaswi, Narayanan C. Krishnan
  • for: 这篇论文是用于测试胸部X射镜像的异常性分类方法。
  • methods: 这篇论文使用了一个称为Generalized Cross-Domain Multi-Label Few-Shot Learning(GenCDML-FSL)的整合框架,这个框架可以处理多个挑战,包括训练和评估集来自不同Domain的资料,以及训练和评估过程中的类别 overlap。
  • results: 比较了以上Method与已知方法,如trasnfer learning、Hybrid transfer learning和Multi-label meta-learning,在多个数据集上的比较结果显示了我们的方法的超越性。
    Abstract Real-world application of chest X-ray abnormality classification requires dealing with several challenges: (i) limited training data; (ii) training and evaluation sets that are derived from different domains; and (iii) classes that appear during training may have partial overlap with classes of interest during evaluation. To address these challenges, we present an integrated framework called Generalized Cross-Domain Multi-Label Few-Shot Learning (GenCDML-FSL). The framework supports overlap in classes during training and evaluation, cross-domain transfer, adopts meta-learning to learn using few training samples, and assumes each chest X-ray image is either normal or associated with one or more abnormalities. Furthermore, we propose Generalized Episodic Training (GenET), a training strategy that equips models to operate with multiple challenges observed in the GenCDML-FSL scenario. Comparisons with well-established methods such as transfer learning, hybrid transfer learning, and multi-label meta-learning on multiple datasets show the superiority of our approach.
    摘要 实际应用中的胸部X射线异常分类问题需要面临多个挑战:(i)有限的训练数据;(ii)训练和评估集来自不同领域;(iii)训练中的类可能与评估中的类有部分重叠。为解决这些挑战,我们提出了一个总结性的框架,即通用跨领域多标签少量学习(GenCDML-FSL)框架。该框架支持训练和评估阶段类的重叠,进行跨领域传输,采用元学习来学习使用少量训练样本,并假设每个胸部X射线图像是正常的或与一个或多个异常相关。此外,我们提出了一种通用 episodic 训练策略(GenET),该策略可以让模型在 GenCDML-FSL 场景中处理多个挑战。与已有方法 such as 传输学习、混合传输学习和多标签元学习在多个数据集上进行比较,我们的方法表现出了superiority。

WiSARD: A Labeled Visual and Thermal Image Dataset for Wilderness Search and Rescue

  • paper_url: http://arxiv.org/abs/2309.04453
  • repo_url: None
  • paper_authors: Daniel Broyles, Christopher R. Hayner, Karen Leung
  • for: 这些研究是为了帮助reducesearch times和alleviate safety risks for first responders carrying out Wilderness Search and Rescue (WiSAR) operations.
  • methods: 这些研究使用了多模式感知器,specifically visual-thermal cameras,以使wiSAR UAVs可以在多种操作条件下工作。
  • results: 这些研究提供了roughly 56,000 labeled visual and thermal images,用于开发vision-based algorithms for autonomous WiSAR UAVs。这些图像来自UAV飞行中的多种地形、季节、天气和照明条件。
    Abstract Sensor-equipped unoccupied aerial vehicles (UAVs) have the potential to help reduce search times and alleviate safety risks for first responders carrying out Wilderness Search and Rescue (WiSAR) operations, the process of finding and rescuing person(s) lost in wilderness areas. Unfortunately, visual sensors alone do not address the need for robustness across all the possible terrains, weather, and lighting conditions that WiSAR operations can be conducted in. The use of multi-modal sensors, specifically visual-thermal cameras, is critical in enabling WiSAR UAVs to perform in diverse operating conditions. However, due to the unique challenges posed by the wilderness context, existing dataset benchmarks are inadequate for developing vision-based algorithms for autonomous WiSAR UAVs. To this end, we present WiSARD, a dataset with roughly 56,000 labeled visual and thermal images collected from UAV flights in various terrains, seasons, weather, and lighting conditions. To the best of our knowledge, WiSARD is the first large-scale dataset collected with multi-modal sensors for autonomous WiSAR operations. We envision that our dataset will provide researchers with a diverse and challenging benchmark that can test the robustness of their algorithms when applied to real-world (life-saving) applications.
    摘要 游戏式无人航空车(UAV)可以帮助紧急救援人员在遥远地区进行搜索和拯救操作,即当人失踪在郊状地区时。然而,视觉感应器alone无法涵盖所有可能的地形、天气和照明情况,因此需要使用多模式感应器,尤其是视觉热成像摄像头,以实现 WiSAR UAVs 在多元运行环境中的运作。然而,由于郊状地区的特殊挑战,现有的数据集标准是不充分的 для开发基于视觉的数据分析算法。为此,我们提出了 WiSARD 数据集,收集了来自 UAV 飞行的约 56,000 个视觉和热成像摄像头标签图像,包括不同的地形、季节、天气和照明情况。我们知道 WiSARD 是首个基于多模式感应器的大规模数据集,我们预期这个数据集将提供研究人员一个多样化和挑战性的 benchmarck,以测试对真实应用中的数据分析算法的Robustness。

Demographic Disparities in 1-to-Many Facial Identification

  • paper_url: http://arxiv.org/abs/2309.04447
  • repo_url: None
  • paper_authors: Aman Bhatta, Gabriella Pangelinan, Micheal C. King, Kevin W. Bowyer
  • for: 这个研究旨在检验不同民族和性别对多个人识别率的影响,以及低分辨率和噪音影响识别率的变化。
  • methods: 这个研究使用了一个新的评价指标,以检验多个人识别率的差异。这些指标包括d’指标、相对分数差和多个人识别分数的分布。
  • results: 研究发现,不同民族和性别对多个人识别率的影响不同,而且在低分辨率和噪音情况下,男女之间的差异更大。此外,研究还发现,使用”surveillance camera quality”图像库对”government ID quality”图像库进行比较可能会导致识别率下降。
    Abstract Most studies to date that have examined demographic variations in face recognition accuracy have analyzed 1-to-1 matching accuracy, using images that could be described as "government ID quality". This paper analyzes the accuracy of 1-to-many facial identification across demographic groups, and in the presence of blur and reduced resolution in the probe image as might occur in "surveillance camera quality" images. Cumulative match characteristic curves(CMC) are not appropriate for comparing propensity for rank-one recognition errors across demographics, and so we introduce three metrics for this: (1) d' metric between mated and non-mated score distributions, (2) absolute score difference between thresholds in the high-similarity tail of the non-mated and the low-similarity tail of the mated distribution, and (3) distribution of (mated - non-mated rank one scores) across the set of probe images. We find that demographic variation in 1-to-many accuracy does not entirely follow what has been observed in 1-to-1 matching accuracy. Also, different from 1-to-1 accuracy, demographic comparison of 1-to-many accuracy can be affected by different numbers of identities and images across demographics. Finally, we show that increased blur in the probe image, or reduced resolution of the face in the probe image, can significantly increase the false positive identification rate. And we show that the demographic variation in these high blur or low resolution conditions is much larger for male/ female than for African-American / Caucasian. The point that 1-to-many accuracy can potentially collapse in the context of processing "surveillance camera quality" probe images against a "government ID quality" gallery is an important one.
    摘要 大多数研究到目前为止对人群差异对面部识别精度进行了分析,使用“政府身份证图像”的样本。这篇论文研究了面部识别的1-to-多匹配精度,以及在不同人群中的差异。我们还引入了三个指标来比较不同人群的潜在一级识别错误风险:1. between mated and non-mated score distributions的d'指标;2. 非硬件tail的非硬件分布下的硬件分布附近的硬件分布差异;3. 探索图像集中的(硬件-非硬件一级识别分布)的分布。我们发现,在1-to-多匹配精度方面,人群差异并不完全与1-to-1匹配精度相同。此外,对于不同人群来说,1-to-多匹配精度的比较可能受到不同人群中的人数和图像数的影响。最后,我们发现,在低锐化或低分辨率情况下,增加了挤压效应可以导致False Positive Identification率的增加。此外,对于男女和非裔美国人来说,在高锐化或低分辨率情况下的人群差异较大。这一点显示,在处理“surveillance camera quality”的探索图像时,1-to-多匹配精度可能会受到“government ID quality”画库的影响。

Comparative Study of Visual SLAM-Based Mobile Robot Localization Using Fiducial Markers

  • paper_url: http://arxiv.org/abs/2309.04441
  • repo_url: None
  • paper_authors: Jongwon Lee, Su Yeon Choi, David Hanley, Timothy Bretl
  • for: 本研究比较了基于视觉SLAM的移动机器人地理位置的三种方法,包括SLAM、SLAM与先前地图和地理位置与先前地图。这些方法都使用了 fiducial marker(即正方形的人工标记,具有黑白棕点纹),以提高地理位置准确性和计算效率。
  • methods: 本研究使用了视觉SLAM技术,并且在 fiducial marker 的支持下进行了地理位置估算。在这些方法中,SLAM 方法使用了所有可用的特征和标记来估算地理位置,而 SLAM 与先前地图方法则使用了先前知道的地图来帮助估算地理位置。
  • results: 实验结果表明,三种方法具有相似的绝对轨迹错误水平,但是地理位置估算过程中的运行时间中最短。在地图噪音的影响下,SLAM 与先前地图方法能够维持性能,而地理位置方法却在两个方面下降。
    Abstract This paper presents a comparative study of three modes for mobile robot localization based on visual SLAM using fiducial markers (i.e., square-shaped artificial landmarks with a black-and-white grid pattern): SLAM, SLAM with a prior map, and localization with a prior map. The reason for comparing the SLAM-based approaches leveraging fiducial markers is because previous work has shown their superior performance over feature-only methods, with less computational burden compared to methods that use both feature and marker detection without compromising the localization performance. The evaluation is conducted using indoor image sequences captured with a hand-held camera containing multiple fiducial markers in the environment. The performance metrics include absolute trajectory error and runtime for the optimization process per frame. In particular, for the last two modes (SLAM and localization with a prior map), we evaluate their performances by perturbing the quality of prior map to study the extent to which each mode is tolerant to such perturbations. Hardware experiments show consistent trajectory error levels across the three modes, with the localization mode exhibiting the shortest runtime among them. Yet, with map perturbations, SLAM with a prior map maintains performance, while localization mode degrades in both aspects.
    摘要 Here is the text in Simplified Chinese:这篇论文比较了三种移动机器人本地化方法,基于视觉SLAM和 fiducial marker(即方正方形人工标记,黑白扫描纹理):SLAM、SLAM WITH prior map 和本地化 WITH prior map。这种比较是因为之前的研究表明,使用 fiducial marker 的方法在功能特征和计算成本方面都有着优势,而不需要同时检测特征和标记。这些方法的评估是通过使用indoor镜头拍摄的图像序列来进行,这些序列包含多个 fiducial marker。评估 metric 包括每帧的绝对轨迹错误和优化过程的运行时间。结果表明,三种方法在绝对轨迹错误方面具有相同的水平,但本地化模式具有最短的运行时间。然而,当 prior map 的质量受到扰动时,SLAM WITH prior map 能够维持性能,而本地化模式在两个方面都会下降。

Single View Refractive Index Tomography with Neural Fields

  • paper_url: http://arxiv.org/abs/2309.04437
  • repo_url: None
  • paper_authors: Brandon Zhao, Aviad Levis, Liam Connor, Pratul P. Srinivasan, Katherine L. Bouman
  • for: 这篇论文的目的是重建场景中的3D干涉场,从2D投射图像测量得到。
  • methods: 这篇论文使用一种坐标基于的神经网络来模型场景中的连续干涉场,并使用光束的3D空间弯曲来优化网络参数,从而重建干涉场。
  • results: 在模拟中,这种方法可以成功地重建干涉场,并分析了不同光源分布对重建的影响。在一个模拟的黑洞映射问题中,还成功地重建了真实的模拟黑洞分布。
    Abstract Refractive Index Tomography is an inverse problem in which we seek to reconstruct a scene's 3D refractive field from 2D projected image measurements. The refractive field is not visible itself, but instead affects how the path of a light ray is continuously curved as it travels through space. Refractive fields appear across a wide variety of scientific applications, from translucent cell samples in microscopy to fields of dark matter bending light from faraway galaxies. This problem poses a unique challenge because the refractive field directly affects the path that light takes, making its recovery a non-linear problem. In addition, in contrast with traditional tomography, we seek to recover the refractive field using a projected image from only a single viewpoint by leveraging knowledge of light sources scattered throughout the medium. In this work, we introduce a method that uses a coordinate-based neural network to model the underlying continuous refractive field in a scene. We then use explicit modeling of rays' 3D spatial curvature to optimize the parameters of this network, reconstructing refractive fields with an analysis-by-synthesis approach. The efficacy of our approach is demonstrated by recovering refractive fields in simulation, and analyzing how recovery is affected by the light source distribution. We then test our method on a simulated dark matter mapping problem, where we recover the refractive field underlying a realistic simulated dark matter distribution.
    摘要 《干涉度图像》是一种逆 проблеme 在干涉度图像中,我们希望从2D投影图像的测量中重construct 场景中的3D干涉场。干涉场不可见自身,但它会影响光束在空间中的曲线运动。干涉场在多种科学应用中出现,从微scopic 的透明细胞样本到远方 галакси的场景中的暗物质弯光。这个问题 pose 一种独特挑战,因为干涉场直接影响光束的路径,使其回归变为非线性问题。此外,在传统tomography 中,我们通过多个视点测量来重construct 干涉场,而我们在这里是通过单个视点测量来实现。在这个工作中,我们提出了一种基于坐标的神经网络来模型场景中的连续干涉场。然后,我们通过明确的3D空间曲线的计算来优化神经网络的参数,通过分析synthesis 的方法来重construct 干涉场。我们的方法的效果在仿真中进行了测试,并分析了灯源分布对回归的影响。最后,我们在一个模拟的黑 matter 映射问题中测试了我们的方法,并成功地重construct 黑 matter 的干涉场。

Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.04422
  • repo_url: None
  • paper_authors: Thomas E. Huang, Yifan Liu, Luc Van Gool, Fisher Yu
  • for: 本研究旨在探讨自动驾驶场景中多个多样化视觉任务的整合。
  • methods: 该研究使用了一种单一结构和单一参数的网络(VTDNet),通过任务间交互阶段来交换信息,实现多个任务的同时解决。
  • results: 与单任务网络相比,VTDNet在大多数任务上表现出色,仅使用20%的计算资源。
    Abstract Performing multiple heterogeneous visual tasks in dynamic scenes is a hallmark of human perception capability. Despite remarkable progress in image and video recognition via representation learning, current research still focuses on designing specialized networks for singular, homogeneous, or simple combination of tasks. We instead explore the construction of a unified model for major image and video recognition tasks in autonomous driving with diverse input and output structures. To enable such an investigation, we design a new challenge, Video Task Decathlon (VTD), which includes ten representative image and video tasks spanning classification, segmentation, localization, and association of objects and pixels. On VTD, we develop our unified network, VTDNet, that uses a single structure and a single set of weights for all ten tasks. VTDNet groups similar tasks and employs task interaction stages to exchange information within and between task groups. Given the impracticality of labeling all tasks on all frames, and the performance degradation associated with joint training of many tasks, we design a Curriculum training, Pseudo-labeling, and Fine-tuning (CPF) scheme to successfully train VTDNet on all tasks and mitigate performance loss. Armed with CPF, VTDNet significantly outperforms its single-task counterparts on most tasks with only 20% overall computations. VTD is a promising new direction for exploring the unification of perception tasks in autonomous driving.
    摘要 人类视觉能力的一个特征是同时完成多种不同类型的视觉任务在动态场景中。虽然图像和视频认知技术已经做出了很大的进步,但现在的研究仍然强调设计专门的网络来解决单一或同类型的任务。我们则是探索构建一个统一的模型来涵盖主要的图像和视频认知任务在自动驾驶中,并且输入和输出结构多样化。为了实现这种研究,我们设计了一个新的挑战——视频任务十项赛(VTD),这个挑战包括十种代表性的图像和视频任务,涵盖分类、 segmentation、 localization 和对象和像素的关系。在 VTD 中,我们开发了一个统一的网络——VTDNet,它使用单一的结构和单一的参数来实现所有十个任务。VTDNet 将相似任务分组,并在任务组之间进行交互来交换信息。由于实际上标注所有任务的所有帧是不实际的,以及多任务合并训练会导致性能下降,我们设计了一种学习环境、 Pseudo-labeling 和精度调整(CPF)的办法,以成功训练 VTDNet 在所有任务上,并将性能下降降到最低。与单任务网络相比,VTDNet 在大多数任务上表现出色,只需要20%的总计算资源。VTD 是自动驾驶视觉任务统一探索的一个有前途的新方向。

DeformToon3D: Deformable 3D Toonification from Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2309.04410
  • repo_url: https://github.com/junzhezhang/DeformToon3D
  • paper_authors: Junzhe Zhang, Yushi Lan, Shuai Yang, Fangzhou Hong, Quan Wang, Chai Kiat Yeo, Ziwei Liu, Chen Change Loy
  • for: 本研究旨在解决3D漫画化问题,即将艺术领域的样式应用到目标3D面部上,并保持原始GAN幂等空间的良好性。
  • methods: 我们提出了DeformToon3D方法,它是针对堆叠3D GAN的有效漫画化框架。我们将3D漫画化分解为geometry和texture材质化的子问题,以更好地保持原始GAN幂等空间。我们还提出了一种新的StyleField,它预测 conditional 3D变形以将真实空间NeRF调整到样式空间中。
  • results: 我们的方法可以实现高质量的3D漫画化,并且支持灵活的样式度控制和形状-文本ure-特有的样式交换。此外,我们可以高效地训练我们的模型,不需要任何实际的2D-3D训练对。
    Abstract In this paper, we address the challenging problem of 3D toonification, which involves transferring the style of an artistic domain onto a target 3D face with stylized geometry and texture. Although fine-tuning a pre-trained 3D GAN on the artistic domain can produce reasonable performance, this strategy has limitations in the 3D domain. In particular, fine-tuning can deteriorate the original GAN latent space, which affects subsequent semantic editing, and requires independent optimization and storage for each new style, limiting flexibility and efficient deployment. To overcome these challenges, we propose DeformToon3D, an effective toonification framework tailored for hierarchical 3D GAN. Our approach decomposes 3D toonification into subproblems of geometry and texture stylization to better preserve the original latent space. Specifically, we devise a novel StyleField that predicts conditional 3D deformation to align a real-space NeRF to the style space for geometry stylization. Thanks to the StyleField formulation, which already handles geometry stylization well, texture stylization can be achieved conveniently via adaptive style mixing that injects information of the artistic domain into the decoder of the pre-trained 3D GAN. Due to the unique design, our method enables flexible style degree control and shape-texture-specific style swap. Furthermore, we achieve efficient training without any real-world 2D-3D training pairs but proxy samples synthesized from off-the-shelf 2D toonification models.
    摘要 在这篇论文中,我们讨论了三维渐化(3D toonification)问题,即将艺术领域的风格应用到目标三维face上,并保持具有渐化的geometry和Texture。虽然可以通过练化预训练的3D GAN来实现可理解的性能,但这种策略有一些限制。具体来说,练化可能会损害原始GAN latent space,影响后续的semantic editing,并需要独立的优化和存储每个新风格,限制了灵活性和高效的部署。为了解决这些挑战,我们提出了DeformToon3D,一种适合层次3D GAN的有效渐化框架。我们的方法将三维渐化分解为geometry和Texture渐化的子问题,以更好地保持原始latent space。具体来说,我们开发了一种名为StyleField的新型预测器,可以在Real-Space NeRF上预测conditional 3D deformation,以使geometry渐化适应风格空间。由于StyleField的形式,Texture渐化可以通过适应风格混合来实现,injects风格空间信息到预训练的3D GAN decoder中。由于独特的设计,我们的方法可以实现自适应风格度控制和形状特征特定的风格交换。此外,我们可以不使用任何真实世界2D-3D训练对,而是使用市面上的2D渐化模型生成的代理样本进行训练。

MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask

  • paper_url: http://arxiv.org/abs/2309.04399
  • repo_url: None
  • paper_authors: Yupeng Zhou, Daquan Zhou, Zuo-Liang Zhu, Yaxing Wang, Qibin Hou, Jiashi Feng
  • for: 提高 diffusion 模型中文案与图像的匹配率
  • methods: 使用 adaptive mask 来改进 cross-modality 关系学习,从而更好地匹配文本 embedding 和图像特征
  • results: 与原始 diffusion 模型相比,MaskDiffusion 可以大幅提高文本-图像匹配率,而且计算负担几乎不变。
    Abstract Recent advancements in diffusion models have showcased their impressive capacity to generate visually striking images. Nevertheless, ensuring a close match between the generated image and the given prompt remains a persistent challenge. In this work, we identify that a crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning between the prompt and the output image. To better align the prompt and image content, we advance the cross-attention with an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features. This mechanism explicitly diminishes the ambiguity in semantic information embedding from the text encoder, leading to a boost of text-to-image consistency in the synthesized images. Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models. When applied to the latent diffusion models, our MaskDiffusion can significantly improve the text-to-image consistency with negligible computation overhead compared to the original diffusion models.
    摘要 最近的扩散模型进步有力地生成了视觉吸引人的图像。然而,确保生成图像与给定的提示保持close match仍然是一项棘手的挑战。在这项工作中,我们发现了一个关键因素导致文本-图像匹配问题的原因:在提取图像特征时,扩散模型缺乏文本和图像之间的跨Modal关系学习。为了更好地对准提示和图像内容,我们提出了一种基于适应面罩的跨注意力机制,该机制通过根据注意力地图和提示嵌入来动态调整每个文本符号对图像特征的贡献。这种机制明确地减少了文本编码器中嵌入的Semantic信息抖动,从而导致了文本-图像一致性的明显提高。我们称之为MaskDiffusion,它是训练 свобо和热插的,可以应用于流行的预训练扩散模型。当应用于凉 diffusion模型时,我们的MaskDiffusion可以显著提高文本-图像一致性,而且与原始扩散模型的计算负担几乎不变。

Language Prompt for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.04379
  • repo_url: https://github.com/wudongming97/prompt4driving
  • paper_authors: Dongming Wu, Wencheng Han, Tiancai Wang, Yingfei Liu, Xiangyu Zhang, Jianbing Shen
  • for: 这篇论文是为了解决自动驾驶领域中使用自然语言提示驱动场景中的挑战,即缺乏配对的提示-实例数据。
  • methods: 该论文提出了第一个用于驾驶场景的对象中心语言提示集,名为NuPrompt,它扩展了Nuscenes数据集,并构建了35,367个语言描述,每个描述都对应5.3个 объек跟踪。
  • results: 该论文提出了一种基于Transformer的简单朴素模型,名为PromptTrack,并在NuPrompt上进行了实验,实验结果表明,PromptTrack在NuPrompt上表现出色。
    Abstract A new trend in the computer vision community is to capture objects of interest following flexible human command represented by a natural language prompt. However, the progress of using language prompts in driving scenarios is stuck in a bottleneck due to the scarcity of paired prompt-instance data. To address this challenge, we propose the first object-centric language prompt set for driving scenes within 3D, multi-view, and multi-frame space, named NuPrompt. It expands Nuscenes dataset by constructing a total of 35,367 language descriptions, each referring to an average of 5.3 object tracks. Based on the object-text pairs from the new benchmark, we formulate a new prompt-based driving task, \ie, employing a language prompt to predict the described object trajectory across views and frames. Furthermore, we provide a simple end-to-end baseline model based on Transformer, named PromptTrack. Experiments show that our PromptTrack achieves impressive performance on NuPrompt. We hope this work can provide more new insights for the autonomous driving community. Dataset and Code will be made public at \href{https://github.com/wudongming97/Prompt4Driving}{https://github.com/wudongming97/Prompt4Driving}.
    摘要 新趋势在计算机视觉社区是通过自然语言提示来捕捉对象 Interest的 flexible 人工命令。然而,使用语言提示在驾驶场景中的进展却被困在数据缺乏的瓶颈中。为解决这个挑战,我们提出了首个适用于驾驶场景的三维、多视图、多帧空间的对象-中心语言提示集,名为NuPrompt。它将Nuscenes数据集扩展到构建总共35,367个语言描述,每个描述都关联着5.3个对象跟踪。基于对象-文本对的新标准,我们提出了一个新的提示驱动任务,即使用语言提示来预测视图和帧中描述的对象轨迹。此外,我们还提供了一个简单的端到端基eline模型,基于Transformer,名为PromptTrack。实验表明,我们的PromptTrack在NuPrompt上表现出了很好的表现。我们希望这项工作能够为自动驾驶社区提供更多的新想法。数据集和代码将在\href{https://github.com/wudongming97/Prompt4Driving}{https://github.com/wudongming97/Prompt4Driving}上公开。

CNN Injected Transformer for Image Exposure Correction

  • paper_url: http://arxiv.org/abs/2309.04366
  • repo_url: https://github.com/rebeccaeexu/cit-ec
  • paper_authors: Shuning Xu, Xiangyu Chen, Binbin Song, Jiantao Zhou
  • for: corrected image exposure
  • methods: CNN Injected Transformer (CIT) and carefully formulated loss functions
  • results: outperforms state-of-the-art approaches in terms of both quantitative and qualitative metrics
    Abstract Capturing images with incorrect exposure settings fails to deliver a satisfactory visual experience. Only when the exposure is properly set, can the color and details of the images be appropriately preserved. Previous exposure correction methods based on convolutions often produce exposure deviation in images as a consequence of the restricted receptive field of convolutional kernels. This issue arises because convolutions are not capable of capturing long-range dependencies in images accurately. To overcome this challenge, we can apply the Transformer to address the exposure correction problem, leveraging its capability in modeling long-range dependencies to capture global representation. However, solely relying on the window-based Transformer leads to visually disturbing blocking artifacts due to the application of self-attention in small patches. In this paper, we propose a CNN Injected Transformer (CIT) to harness the individual strengths of CNN and Transformer simultaneously. Specifically, we construct the CIT by utilizing a window-based Transformer to exploit the long-range interactions among different regions in the entire image. Within each CIT block, we incorporate a channel attention block (CAB) and a half-instance normalization block (HINB) to assist the window-based self-attention to acquire the global statistics and refine local features. In addition to the hybrid architecture design for exposure correction, we apply a set of carefully formulated loss functions to improve the spatial coherence and rectify potential color deviations. Extensive experiments demonstrate that our image exposure correction method outperforms state-of-the-art approaches in terms of both quantitative and qualitative metrics.
    摘要 捕捉图像with incorrect exposure settings会导致视觉经验不满意。只有当曝光正确设置时,图像的颜色和细节才能正确保存。过去的曝光修正方法基于 convolution often produce exposure deviation in images as a consequence of the restricted receptive field of convolutional kernels. This issue arises because convolutions are not capable of capturing long-range dependencies in images accurately. To overcome this challenge, we can apply the Transformer to address the exposure correction problem, leveraging its capability in modeling long-range dependencies to capture global representation. However, solely relying on the window-based Transformer leads to visually disturbing blocking artifacts due to the application of self-attention in small patches. In this paper, we propose a CNN Injected Transformer (CIT) to harness the individual strengths of CNN and Transformer simultaneously. Specifically, we construct the CIT by utilizing a window-based Transformer to exploit the long-range interactions among different regions in the entire image. Within each CIT block, we incorporate a channel attention block (CAB) and a half-instance normalization block (HINB) to assist the window-based self-attention to acquire the global statistics and refine local features. In addition to the hybrid architecture design for exposure correction, we apply a set of carefully formulated loss functions to improve the spatial coherence and rectify potential color deviations. Extensive experiments demonstrate that our image exposure correction method outperforms state-of-the-art approaches in terms of both quantitative and qualitative metrics.

SSIG: A Visually-Guided Graph Edit Distance for Floor Plan Similarity

  • paper_url: http://arxiv.org/abs/2309.04357
  • repo_url: None
  • paper_authors: Casper van Engelenburg, Seyran Khademi, Jan van Gemert
    for: 这 paper 是为了提出一种简单 yet effective 的 metric,用于衡量建筑底层平面图像之间的结构相似性,而不需要学习。methods: 这 paper 使用了 image 和 graph 距离来计算 structural similarity,并提出了一种基于 IoU 和 GED 的评价指标,称为 SSIG。results: 实验结果表明,使用 SSIG 可以获得类似于深度学习方法的结构相似性 Retrieval 结果,而且更加有效地比较建筑底层平面图像的结构相似性。
    Abstract We propose a simple yet effective metric that measures structural similarity between visual instances of architectural floor plans, without the need for learning. Qualitatively, our experiments show that the retrieval results are similar to deeply learned methods. Effectively comparing instances of floor plan data is paramount to the success of machine understanding of floor plan data, including the assessment of floor plan generative models and floor plan recommendation systems. Comparing visual floor plan images goes beyond a sole pixel-wise visual examination and is crucially about similarities and differences in the shapes and relations between subdivisions that compose the layout. Currently, deep metric learning approaches are used to learn a pair-wise vector representation space that closely mimics the structural similarity, in which the models are trained on similarity labels that are obtained by Intersection-over-Union (IoU). To compensate for the lack of structural awareness in IoU, graph-based approaches such as Graph Matching Networks (GMNs) are used, which require pairwise inference for comparing data instances, making GMNs less practical for retrieval applications. In this paper, an effective evaluation metric for judging the structural similarity of floor plans, coined SSIG (Structural Similarity by IoU and GED), is proposed based on both image and graph distances. In addition, an efficient algorithm is developed that uses SSIG to rank a large-scale floor plan database. Code will be openly available.
    摘要 我们提出一种简单 yet有效的度量,用于衡量建筑floor plan的结构相似性,无需学习。我们的实验表明,检索结果与深度学习方法相似。对于机器理解floor plan数据的成功,包括floor plan生成模型和floor plan推荐系统,都是重要的。 Comparing visual floor plan图像不仅是solely based on pixel-wise visual examination,更是关注 shapes和relations between subdivisions that compose the layout的相似性和差异。目前,深度度量学习方法是用于学习一个pair-wise vector representation space,以便closely mimic structural similarity,其中模型是通过Intersection-over-Union(IoU)获得对应的similarity labels。为了补做IoU中的结构不足,Graph-based approaches such as Graph Matching Networks (GMNs) 是使用的,但这些方法需要对数据实例进行对比,使得GMNs 在检索应用中不实用。在这篇论文中,一种有效的floor plan结构相似度度量,称为SSIG(Structural Similarity by IoU and GED),是基于图像和图distance的。此外,一种高效的算法是开发出来,用于排序大规模的floor plan数据库。代码将公开。

Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts

  • paper_url: http://arxiv.org/abs/2309.04354
  • repo_url: None
  • paper_authors: Erik Daxberger, Floris Weers, Bowen Zhang, Tom Gunter, Ruoming Pang, Marcin Eichner, Michael Emmersberger, Yinfei Yang, Alexander Toshev, Xianzhi Du
  • for: 这个研究旨在使用罕发 Mixture-of-Experts 模型(MoE)来缩小 Computer Vision Transformers(ViT),以提高资源受限的视觉应用程序中的表现。
  • methods: 提议了一个简化的 Mobile Vision MoE 设计,将整个图像Routing 到专家中,以及一个稳定的 MoE 训练方法,使用超级类信息来导引路由器。
  • results: 经验表明,我们的罕发 Mobile Vision MoE 可以在 ImageNet-1k 上比 dense ViT 表现更好,例如 ViT-Tiny 模型的 Mobile V-MoE 比它的 dense 对应者高出3.39%。另外,对于仅有54M FLOPs 的视觉运算成本的 ViT Variant,我们的 MoE 可以提高4.66%。
    Abstract Sparse Mixture-of-Experts models (MoEs) have recently gained popularity due to their ability to decouple model size from inference efficiency by only activating a small subset of the model parameters for any given input token. As such, sparse MoEs have enabled unprecedented scalability, resulting in tremendous successes across domains such as natural language processing and computer vision. In this work, we instead explore the use of sparse MoEs to scale-down Vision Transformers (ViTs) to make them more attractive for resource-constrained vision applications. To this end, we propose a simplified and mobile-friendly MoE design where entire images rather than individual patches are routed to the experts. We also propose a stable MoE training procedure that uses super-class information to guide the router. We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs. For example, for the ViT-Tiny model, our Mobile V-MoE outperforms its dense counterpart by 3.39% on ImageNet-1k. For an even smaller ViT variant with only 54M FLOPs inference cost, our MoE achieves an improvement of 4.66%.
    摘要 低粒度混合专家模型(MoE)在最近几年内得到了广泛的关注,因为它可以将模型大小与输入Token的执行效率解耦开来,只有一小部分模型参数对于任何输入Token进行激活。这使得低粒度MoE在不同领域,如自然语言处理和计算机视觉等领域取得了无 precedent的缩放。在这种工作中,我们则是使用低粒度MoE来缩小视Transformers(ViTs),以使其更适合具有限制的视觉应用。为此,我们提出了简单化了的手持版MoE设计,其中整个图像而不是具体的补丁被 routed 到专家。我们还提出了稳定的MoE训练过程,该过程使用超类信息来引导路由。我们实验表明,我们的粒度 мобиLE Vision MoEs(V-MoEs)可以在性能和效率之间取得更好的平衡,比如对于 ViT-Tiny 模型,我们的手持 V-MoE 在 ImageNet-1k 上比其拥有相同执行成本的 dense ViT 提高3.39%。而对于具有仅 54M FLOPs 执行成本的 ViT 变体,我们的 MoE 提高4.66%。

Revealing the preference for correcting separated aberrations in joint optic-image design

  • paper_url: http://arxiv.org/abs/2309.04342
  • repo_url: None
  • paper_authors: Jingwen Zhou, Shiqi Chen, Zheng Ren, Wenguan Zhang, Jiapu Yan, Huajun Feng, Qi Li, Yueting Chen
  • for: 本文旨在jointly设计光学系统和下游算法,以实现高效的复杂系统设计 such as smartphones和 дроны。
  • methods: 本文首先从光学设计的角度,描述了光学系统中的各种荷 aberrations。然后,提出了一种图像模拟系统,用于重现真实的拍摄过程。最后,提出了一种基于神经网络的 aberration correction 方法,并证明其超过了现有方法。
  • results: 实验表明,在jointly设计光学系统和下游算法时,应该优先 corrected longitudinal chromatic aberration、lateral chromatic aberration、spherical aberration、field curvature 和 coma,而 astigmatism 则应该排在最后。基于这些 preference,可以实现10%的总轨道减少,并且具有更高的计算摄影质量。本文的优化思路为jointly设计复杂光学系统和下游算法提供了新的思路。
    Abstract The joint design of the optical system and the downstream algorithm is a challenging and promising task. Due to the demand for balancing the global optimal of imaging systems and the computational cost of physical simulation, existing methods cannot achieve efficient joint design of complex systems such as smartphones and drones. In this work, starting from the perspective of the optical design, we characterize the optics with separated aberrations. Additionally, to bridge the hardware and software without gradients, an image simulation system is presented to reproduce the genuine imaging procedure of lenses with large field-of-views. As for aberration correction, we propose a network to perceive and correct the spatially varying aberrations and validate its superiority over state-of-the-art methods. Comprehensive experiments reveal that the preference for correcting separated aberrations in joint design is as follows: longitudinal chromatic aberration, lateral chromatic aberration, spherical aberration, field curvature, and coma, with astigmatism coming last. Drawing from the preference, a 10% reduction in the total track length of the consumer-level mobile phone lens module is accomplished. Moreover, this procedure spares more space for manufacturing deviations, realizing extreme-quality enhancement of computational photography. The optimization paradigm provides innovative insight into the practical joint design of sophisticated optical systems and post-processing algorithms.
    摘要 合作设计光学系统和下游算法是一项挑战性较高且投资极大的任务。由于需要平衡全球优化图像系统和物理模拟计算成本,现有方法无法实现复杂系统 such as 智能手机和无人机的有效集成设计。在这种工作中,从光学设计的视角出发,我们 caracterize 光学器件为分离的荷量。此外,为了bridging 硬件和软件而无需梯度,我们提出了一种图像仿真系统,可以复制实际摄影过程中的镜头大 FOV 的真实摄影。在荷量修正方面,我们提议一种神经网络,可以感知并修正场景中的空间变化荷量,并证明其超过了当前方法的优势。经过广泛的实验,我们发现在修正分离荷量时的偏好顺序如下:Longitudinal Chromatic Aberration、Lateral Chromatic Aberration、Spherical Aberration、Field Curvature、Coma、Astigmatism,其中 Astigmatism 为最后一个。基于这种偏好,我们实现了Consumer-level 移动 phone 镜头模块的10% 总轨道减少。此外,这种过程还剩余了更多的生产偏移,实现了极高质量的计算摄影增强。我们的优化思路为实际复杂光学系统和后处理算法的集成设计带来了创新的视角。

Leveraging Model Fusion for Improved License Plate Recognition

  • paper_url: http://arxiv.org/abs/2309.04331
  • repo_url: None
  • paper_authors: Rayson Laroca, Luiz A. Zanlorensi, Valter Estevam, Rodrigo Minetto, David Menotti
  • for: 本研究旨在填补多模型识别结果的缺失,探讨多个识别模型的结果结合可以提高识别精度。
  • methods: 本研究使用多种直观的方法进行结合,包括选择最有信心的预测和多数投票策略。
  • results: 实验结果表明,结合多个模型可以减少对特定数据集/场景的表现下降的可能性。此外,结合基于速度的模型也是一个有效的策略,能够在满足一定的时间延迟的情况下提高识别精度。
    Abstract License Plate Recognition (LPR) plays a critical role in various applications, such as toll collection, parking management, and traffic law enforcement. Although LPR has witnessed significant advancements through the development of deep learning, there has been a noticeable lack of studies exploring the potential improvements in results by fusing the outputs from multiple recognition models. This research aims to fill this gap by investigating the combination of up to 12 different models using straightforward approaches, such as selecting the most confident prediction or employing majority vote-based strategies. Our experiments encompass a wide range of datasets, revealing substantial benefits of fusion approaches in both intra- and cross-dataset setups. Essentially, fusing multiple models reduces considerably the likelihood of obtaining subpar performance on a particular dataset/scenario. We also found that combining models based on their speed is an appealing approach. Specifically, for applications where the recognition task can tolerate some additional time, though not excessively, an effective strategy is to combine 4-6 models. These models may not be the most accurate individually, but their fusion strikes an optimal balance between accuracy and speed.
    摘要

AMLP:Adaptive Masking Lesion Patches for Self-supervised Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.04312
  • repo_url: None
  • paper_authors: Xiangtao Wang, Ruizhi Wang, Jie Zhou, Thomas Lukasiewicz, Zhenghua Xu
  • for: 这个论文是为了解决自主指定的医学图像分割问题,即使用自主掩码模型在医学图像上进行学习。
  • methods: 该论文提出了一种新的自主掩码医学图像分割框架,称为自适应掩码病变块(AMLP)。该框架包括一种掩码选择策略(MPS),用于确定和学习含病变块的块。此外,该论文还引入了一种注意力重构损失(ARL)和一种类别一致损失(CCL),以提高病变块的准确性和分类精度。
  • results: 根据两个医学图像分割数据集的实验结果,AMLP在自主掩码模型中的性能明显高于现有的自主方法。这些策略有效地解决了在医学图像上应用自主掩码模型的限制,并且能够捕捉病变块的细节,这些细节是分割任务中非常重要的。
    Abstract Self-supervised masked image modeling has shown promising results on natural images. However, directly applying such methods to medical images remains challenging. This difficulty stems from the complexity and distinct characteristics of lesions compared to natural images, which impedes effective representation learning. Additionally, conventional high fixed masking ratios restrict reconstructing fine lesion details, limiting the scope of learnable information. To tackle these limitations, we propose a novel self-supervised medical image segmentation framework, Adaptive Masking Lesion Patches (AMLP). Specifically, we design a Masked Patch Selection (MPS) strategy to identify and focus learning on patches containing lesions. Lesion regions are scarce yet critical, making their precise reconstruction vital. To reduce misclassification of lesion and background patches caused by unsupervised clustering in MPS, we introduce an Attention Reconstruction Loss (ARL) to focus on hard-to-reconstruct patches likely depicting lesions. We further propose a Category Consistency Loss (CCL) to refine patch categorization based on reconstruction difficulty, strengthening distinction between lesions and background. Moreover, we develop an Adaptive Masking Ratio (AMR) strategy that gradually increases the masking ratio to expand reconstructible information and improve learning. Extensive experiments on two medical segmentation datasets demonstrate AMLP's superior performance compared to existing self-supervised approaches. The proposed strategies effectively address limitations in applying masked modeling to medical images, tailored to capturing fine lesion details vital for segmentation tasks.
    摘要 自我监督遮盲图像模型在自然图像上显示了扎实的成果。然而,直接将这些方法应用到医学图像仍然是一项挑战。这种挑战的原因在于医学图像中的病变特征更加复杂和特殊,使得学习有效的表征变得困难。另外,传统的高固定遮盲率限制了修剪细小病变细节,导致学习的范围受限。为解决这些限制,我们提出了一种新的自我监督医学图像分割框架,即适应遮盲病变裂片(AMLP)。特别是,我们设计了一种遮盲裂片选择策略(MPS),以确定和专注于包含病变的裂片进行学习。病变区域scarce yet critical,需要精准重建。为了避免由自动归类所引起的病变和背景裂片的混淆,我们引入了一种注意力重建损失(ARL),以注意精准重建病变裂片。此外,我们还提出了一种类别一致损失(CCL),以根据重建难度进一步划分病变和背景裂片,强化病变和背景之间的分别。此外,我们还开发了一种适应遮盲率策略(AMR),以逐渐增加遮盲率,扩大可重建信息,提高学习。我们对医学图像分割任务中的两个数据集进行了广泛的实验,并证明AMLP在自我监督方法中表现出色,与现有的自我监督方法相比。我们的提案有效地解决了应用遮盲模型到医学图像的限制,适应捕捉病变细节,这些细节对分割任务至关重要。

Have We Ever Encountered This Before? Retrieving Out-of-Distribution Road Obstacles from Driving Scenes

  • paper_url: http://arxiv.org/abs/2309.04302
  • repo_url: None
  • paper_authors: Youssef Shoeb, Robin Chan, Gesina Schwalbe, Azarm Nowzard, Fatma Güney, Hanno Gottschalk
  • for: 本研究旨在提供一种基于文本查询的外部数据采集方法,以满足自动驾驶系统中的协同Debugging需求。
  • methods: 该方法基于最新的OoD分割和多Modal基础模型,可以快速从无标注视频中提取安全关键场景,并通过文本查询来检索相似的场景。
  • results: 该方法可以快速和高效地提取与OoD道路障碍相关的场景,并提供一种基于文本查询的novel Approach来检索这些场景。
    Abstract In the life cycle of highly automated systems operating in an open and dynamic environment, the ability to adjust to emerging challenges is crucial. For systems integrating data-driven AI-based components, rapid responses to deployment issues require fast access to related data for testing and reconfiguration. In the context of automated driving, this especially applies to road obstacles that were not included in the training data, commonly referred to as out-of-distribution (OoD) road obstacles. Given the availability of large uncurated recordings of driving scenes, a pragmatic approach is to query a database to retrieve similar scenarios featuring the same safety concerns due to OoD road obstacles. In this work, we extend beyond identifying OoD road obstacles in video streams and offer a comprehensive approach to extract sequences of OoD road obstacles using text queries, thereby proposing a way of curating a collection of OoD data for subsequent analysis. Our proposed method leverages the recent advances in OoD segmentation and multi-modal foundation models to identify and efficiently extract safety-relevant scenes from unlabeled videos. We present a first approach for the novel task of text-based OoD object retrieval, which addresses the question ''Have we ever encountered this before?''.
    摘要 生命周期中高度自动化系统在开放动态环境中的适应能力是关键。具有数据驱动AI组件的系统在部署问题上需要快速访问相关数据进行测试和重新配置。在自动驾驶上特别是,对于没有包含在训练数据中的外部道路障碍(OoD),快速响应是非常重要。由于有大量未经整理的驾驶场景录像,我们可以通过查询数据库来检索类似的场景,并且可以使用文本查询来提取OoD道路障碍序列。在这种情况下,我们不仅可以识别OoD道路障碍在视频流中,还可以提供一种抽象CURATE OoD数据集,以便进行后续分析。我们的提议方法基于最近的OoD分割和多Modal基础模型,可以快速和有效地从未标注的视频中提取安全相关的场景。我们还提出了一种新的任务:文本基本对象重 Retrieval,可以回答问题“我们之前有否遇到过这个?”。

How Can We Tame the Long-Tail of Chest X-ray Datasets?

  • paper_url: http://arxiv.org/abs/2309.04293
  • repo_url: None
  • paper_authors: Arsh Verma
  • for: 用于自动推断胸部X射线图像中的各种畸形。
  • methods: 使用深度学习模型来学习独立的特征,解决多标签和少数畸形问题。
  • results: 提出一种使用初始化更加近似于目标数据集的方法,可以帮助提高模型性能,并且可以轻松扩展到新的标签。
    Abstract Chest X-rays (CXRs) are a medical imaging modality that is used to infer a large number of abnormalities. While it is hard to define an exhaustive list of these abnormalities, which may co-occur on a chest X-ray, few of them are quite commonly observed and are abundantly represented in CXR datasets used to train deep learning models for automated inference. However, it is challenging for current models to learn independent discriminatory features for labels that are rare but may be of high significance. Prior works focus on the combination of multi-label and long tail problems by introducing novel loss functions or some mechanism of re-sampling or re-weighting the data. Instead, we propose that it is possible to achieve significant performance gains merely by choosing an initialization for a model that is closer to the domain of the target dataset. This method can complement the techniques proposed in existing literature, and can easily be scaled to new labels. Finally, we also examine the veracity of synthetically generated data to augment the tail labels and analyse its contribution to improving model performance.
    摘要 胸部X光图(CXR)是医学影像模式,用于推断许多不正常情况。尽管难以列举完整的不正常情况列表,这些情况可能在胸部X光图上同时出现,但一些非常常见,并且在使用深度学习模型自动推断时广泛存在于CXR数据集中。然而,当前的模型很难学习独立的特征来标识罕见的标签,它们可能具有高度的重要性。先前的工作将焦点放在多标签和长尾问题的组合上,通过引入新的损失函数或数据重新排序机制来解决。而我们则提议,可以通过选择更加适应目标数据集的初始化方法来实现显著的性能提升。这种方法可以补充现有文献中的技术,并可以轻松扩展到新的标签。此外,我们还研究了增强尾标签的合成数据的真实性,并分析其对模型性能的贡献。

The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion

  • paper_url: http://arxiv.org/abs/2309.04509
  • repo_url: None
  • paper_authors: Yujin Jeong, Wonjeong Ryoo, Seunghyun Lee, Dabin Seo, Wonmin Byeon, Sangpil Kim, Jinkyu Kim
  • for: 这篇论文主要针对的是音频到视频生成技术,具体来说是使用音频输入将 temporal semantics 和 magnitude 纳入视频生成中,以生成响应音频的视频内容。
  • methods: 该模型使用了稳定扩散模型,将文本语义信息与音频编码器的顺序编码器结合,以生成视频帧。
  • results: 该方法在多个任务上表现出色,与当前领域的状态Of-the-art技术进行比较,并提供了更多的示例,可以在 https://ku-vai.github.io/TPoS/ 中找到。
    Abstract In recent years, video generation has become a prominent generative tool and has drawn significant attention. However, there is little consideration in audio-to-video generation, though audio contains unique qualities like temporal semantics and magnitude. Hence, we propose The Power of Sound (TPoS) model to incorporate audio input that includes both changeable temporal semantics and magnitude. To generate video frames, TPoS utilizes a latent stable diffusion model with textual semantic information, which is then guided by the sequential audio embedding from our pretrained Audio Encoder. As a result, this method produces audio reactive video contents. We demonstrate the effectiveness of TPoS across various tasks and compare its results with current state-of-the-art techniques in the field of audio-to-video generation. More examples are available at https://ku-vai.github.io/TPoS/
    摘要 近年来,视频生成已成为重要的生成工具并受到广泛关注。然而,音频到视频生成方面的研究仍然很少,尽管音频具有时序语义和强度等独特性质。为此,我们提出 The Power of Sound(TPoS)模型,以纳入同时包含可变时序语义和强度的音频输入。在生成视频帧时,TPoS 使用带有文本语义信息的潜在稳定扩散模型,并由我们预训练的音频编码器输出的时序音频嵌入进行引导,从而生成对音频作出响应的视频内容。我们在多个任务上验证了 TPoS 的有效性,并与音频到视频生成领域当前最先进的技术进行了比较。更多示例请见 https://ku-vai.github.io/TPoS/。

Towards Practical Capture of High-Fidelity Relightable Avatars

  • paper_url: http://arxiv.org/abs/2309.04247
  • repo_url: None
  • paper_authors: Haotian Yang, Mingwu Zheng, Wanquan Feng, Haibin Huang, Yu-Kun Lai, Pengfei Wan, Zhongyuan Wang, Chongyang Ma
  • for: 高精度3D人物捕捉和重建
  • methods: 使用动态图像序列和变化灯光条件进行训练,实现真实的照明和实时动画
  • results: 提供了一种高质量的捕捉和重建方法,可以在多种场景中实现真实的照明和动画效果
    Abstract In this paper, we propose a novel framework, Tracking-free Relightable Avatar (TRAvatar), for capturing and reconstructing high-fidelity 3D avatars. Compared to previous methods, TRAvatar works in a more practical and efficient setting. Specifically, TRAvatar is trained with dynamic image sequences captured in a Light Stage under varying lighting conditions, enabling realistic relighting and real-time animation for avatars in diverse scenes. Additionally, TRAvatar allows for tracking-free avatar capture and obviates the need for accurate surface tracking under varying illumination conditions. Our contributions are two-fold: First, we propose a novel network architecture that explicitly builds on and ensures the satisfaction of the linear nature of lighting. Trained on simple group light captures, TRAvatar can predict the appearance in real-time with a single forward pass, achieving high-quality relighting effects under illuminations of arbitrary environment maps. Second, we jointly optimize the facial geometry and relightable appearance from scratch based on image sequences, where the tracking is implicitly learned. This tracking-free approach brings robustness for establishing temporal correspondences between frames under different lighting conditions. Extensive qualitative and quantitative experiments demonstrate that our framework achieves superior performance for photorealistic avatar animation and relighting.
    摘要 在这篇论文中,我们提出了一种新的框架 Tracking-free Relightable Avatar(TRAvatar),用于捕捉和重建高保真3D人物形象。与以往方法相比,TRAvatar工作在更实用、更高效的设置下。具体来说,TRAvatar使用在Light Stage中、不同光照条件下采集的动态图像序列进行训练,使人物形象能够在多样化场景中实现逼真的重光照和实时动画。此外,TRAvatar支持无需跟踪的人物捕捉,免去了在变化光照下进行精确表面跟踪的需要。我们的贡献有两方面:第一,我们提出了一种新的网络架构,显式地基于并保证光照的线性性质。仅在简单的分组灯光采集数据上训练后,TRAvatar即可通过单次前向推理实时预测外观,在任意环境贴图的光照下实现高质量的重光照效果。第二,我们基于图像序列从零开始联合优化面部几何与可重光照的外观,其中跟踪是隐式学习得到的。这种无跟踪的方式使得在不同光照条件下建立帧间时序对应更加稳健。大量定性与定量实验表明,我们的框架在逼真人物动画与重光照方面取得了更优的性能。
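The "linear nature of lighting" that TRAvatar builds on can be illustrated with a toy relighting step: if a network predicts the avatar's appearance under each light of a basis (e.g. the Light Stage group lights), appearance under an arbitrary environment is a weighted sum of those basis images, with weights obtained by projecting the environment map onto the same basis. The shapes and weights below are illustrative assumptions, not the paper's implementation.

```python
import torch

def relight_from_basis(basis_images: torch.Tensor, env_weights: torch.Tensor) -> torch.Tensor:
    """Linear image formation: I(env) = sum_k w_k * I(basis light k).

    basis_images: (K, 3, H, W) predicted appearance under K basis lights
    env_weights:  (K,) energy of the target environment map in each basis light
    """
    return torch.einsum("k,kchw->chw", env_weights, basis_images)

# Toy usage with random tensors standing in for network predictions.
K, H, W = 8, 256, 256
basis_images = torch.rand(K, 3, H, W)
env_weights = torch.rand(K)
env_weights = env_weights / env_weights.sum()   # normalize the illumination energy
relit = relight_from_basis(basis_images, env_weights)
print(relit.shape)  # torch.Size([3, 256, 256])
```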

Unsupervised Gaze-aware Contrastive Learning with Subject-specific Condition

  • paper_url: http://arxiv.org/abs/2309.04506
  • repo_url: None
  • paper_authors: Lingyu Du, Xucong Zhang, Guohao Lan
  • for: 提高出现在多个 gaze 数据集上的 gaze 估计性能,使用一个通用的摄像头作为输入设备。
  • methods: 提出 ConGaze 框架,利用无标注的脸部图像学习无关Subject的 gaze-aware 表示,通过对 gaze-specific 数据增强和subject-conditional projection module来保持 gaze-semantic 特征和眼神一致性。
  • results: ConGaze 在三个公共 gaze 估计数据集上比现有的无监督学习解决方案提高了6.7%到22.5%,并在跨数据集评估中提高了15.1%到24.6%。
    Abstract Appearance-based gaze estimation has shown great promise in many applications by using a single general-purpose camera as the input device. However, its success is highly depending on the availability of large-scale well-annotated gaze datasets, which are sparse and expensive to collect. To alleviate this challenge we propose ConGaze, a contrastive learning-based framework that leverages unlabeled facial images to learn generic gaze-aware representations across subjects in an unsupervised way. Specifically, we introduce the gaze-specific data augmentation to preserve the gaze-semantic features and maintain the gaze consistency, which are proven to be crucial for effective contrastive gaze representation learning. Moreover, we devise a novel subject-conditional projection module that encourages a share feature extractor to learn gaze-aware and generic representations. Our experiments on three public gaze estimation datasets show that ConGaze outperforms existing unsupervised learning solutions by 6.7% to 22.5%; and achieves 15.1% to 24.6% improvement over its supervised learning-based counterpart in cross-dataset evaluations.
    摘要 基于外观的注视估计只需一个通用摄像头作为输入设备,已在许多应用中展现出巨大潜力。然而,其成功高度依赖于大规模、精确标注的注视数据集,而这类数据稀缺且采集成本高昂。为缓解这一挑战,我们提出ConGaze,一个基于对比学习的框架,利用无标注的人脸图像,以无监督方式跨受试者学习通用的注视感知表示。具体而言,我们引入了注视特定的数据增强,以保留注视语义特征并维持注视一致性,这对有效的对比注视表示学习至关重要。此外,我们设计了一个新的受试者条件投影模块,促使共享特征提取器学习既感知注视又具有通用性的表示。我们在三个公开注视估计数据集上的实验表明,ConGaze比现有的无监督学习方案提升6.7%到22.5%;在跨数据集评估中,相比其有监督学习对应方法提升15.1%到24.6%。

FIVA: Facial Image and Video Anonymization and Anonymization Defense

  • paper_url: http://arxiv.org/abs/2309.04228
  • repo_url: None
  • paper_authors: Felix Rosberg, Eren Erdal Aksoy, Cristofer Englund, Fernando Alonso-Fernandez
  • for: 这个论文旨在提出一种新的面部匿名化方法,以保护个人隐私。
  • methods: 这个方法使用了建议的身份追踪和强大的匿名化技术,以确保面部匿名化能够一致性地运行在帧中,并且可以抵挡重建攻击。
  • results: 这个方法可以确保0个真阳性,false acceptance rate为0.001,并且可以实现面部匿名化和脸部替换。
    Abstract In this paper, we present a new approach for facial anonymization in images and videos, abbreviated as FIVA. Our proposed method is able to maintain the same face anonymization consistently over frames with our suggested identity-tracking and guarantees a strong difference from the original face. FIVA allows for 0 true positives for a false acceptance rate of 0.001. Our work considers the important security issue of reconstruction attacks and investigates adversarial noise, uniform noise, and parameter noise to disrupt reconstruction attacks. In this regard, we apply different defense and protection methods against these privacy threats to demonstrate the scalability of FIVA. On top of this, we also show that reconstruction attack models can be used for detection of deep fakes. Last but not least, we provide experimental results showing how FIVA can even enable face swapping, which is purely trained on a single target image.
    摘要 在这篇论文中,我们提出了一种新的图像与视频人脸匿名化方法,简称FIVA。借助我们提出的身份跟踪机制,该方法能够在连续帧之间保持一致的人脸匿名化,并保证与原始人脸有显著差异。FIVA可以在0.001的误接受率下做到0个真阳性。我们的工作还考虑了重建攻击这一重要的安全问题,并研究了利用对抗噪声、均匀噪声和参数噪声来干扰重建攻击。在此基础上,我们应用了多种防御与保护手段来应对这些隐私威胁,以展示FIVA的可扩展性。此外,我们还表明重建攻击模型可以用于深度伪造检测。最后,我们给出实验结果,说明FIVA甚至可以实现仅基于单张目标图像训练的人脸替换。

Long-Range Correlation Supervision for Land-Cover Classification from Remote Sensing Images

  • paper_url: http://arxiv.org/abs/2309.04225
  • repo_url: None
  • paper_authors: Dawen Yu, Shunping Ji
  • for: 这篇论文旨在提出一种基于深度学习的土地覆盖分类方法,改进大幅面遥感图像中远距离相关性的建模。
  • methods: 该方法提出了监督式长距离相关网络(SLCNet),利用真值分割图中的类别一致性信息直接监督长距离依赖关系的建模;此外,还引入了一个辅助的自适应感受野特征提取模块,以捕获多尺度遥感图像中不同尺寸目标的精细特征表示。
  • results: 在三个遥感数据集上,SLCNet相比来自计算机视觉、医学和遥感领域的先进分割方法取得了最先进的性能。
    Abstract Long-range dependency modeling has been widely considered in modern deep learning based semantic segmentation methods, especially those designed for large-size remote sensing images, to compensate the intrinsic locality of standard convolutions. However, in previous studies, the long-range dependency, modeled with an attention mechanism or transformer model, has been based on unsupervised learning, instead of explicit supervision from the objective ground truth. In this paper, we propose a novel supervised long-range correlation method for land-cover classification, called the supervised long-range correlation network (SLCNet), which is shown to be superior to the currently used unsupervised strategies. In SLCNet, pixels sharing the same category are considered highly correlated and those having different categories are less relevant, which can be easily supervised by the category consistency information available in the ground truth semantic segmentation map. Under such supervision, the recalibrated features are more consistent for pixels of the same category and more discriminative for pixels of other categories, regardless of their proximity. To complement the detailed information lacking in the global long-range correlation, we introduce an auxiliary adaptive receptive field feature extraction module, parallel to the long-range correlation module in the encoder, to capture finely detailed feature representations for multi-size objects in multi-scale remote sensing images. In addition, we apply multi-scale side-output supervision and a hybrid loss function as local and global constraints to further boost the segmentation accuracy. Experiments were conducted on three remote sensing datasets. Compared with the advanced segmentation methods from the computer vision, medicine, and remote sensing communities, the SLCNet achieved a state-of-the-art performance on all the datasets.
    摘要 在现代基于深度学习的语义分割方法中,特别是针对大幅面遥感图像的方法,远距离依赖关系的建模被广泛用于弥补标准卷积固有的局部性。然而,在以往的研究中,借助注意力机制或Transformer模型建模的远距离依赖关系是基于无监督学习的,而非来自真值的显式监督。本文提出了一种新的监督式长距离相关方法用于土地覆盖分类,称为监督式长距离相关网络(SLCNet),其优于目前使用的无监督策略。在SLCNet中,属于同一类别的像素被视为高度相关,属于不同类别的像素则相关性较低,这可以直接利用真值语义分割图中的类别一致性信息进行监督。在这种监督下,经过重新校准的特征对同类像素更加一致,对其他类别像素更具判别性,而与它们之间的距离无关。为补充全局远距离相关所缺乏的细节信息,我们在编码器中与远距离相关模块并行地引入了一个辅助的自适应感受野特征提取模块,以捕获多尺度遥感图像中不同尺寸目标的精细特征表示。此外,我们采用多尺度侧输出监督和混合损失函数作为局部与全局约束,进一步提升分割精度。我们在三个遥感数据集上进行了实验。与来自计算机视觉、医学和遥感领域的先进分割方法相比,SLCNet在所有数据集上都取得了最先进的性能。
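The supervision signal described above, that pixels of the same class should be highly correlated while pixels of different classes should not, can be written generically as a pairwise affinity loss on sub-sampled feature vectors. The sketch below follows that reading and is not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def long_range_correlation_loss(features, labels, num_samples=512):
    """features: (B, C, H, W) decoder features; labels: (B, H, W) class indices."""
    B, C, H, W = features.shape
    feats = features.permute(0, 2, 3, 1).reshape(B, H * W, C)
    labs = labels.reshape(B, H * W)
    loss = 0.0
    for b in range(B):
        idx = torch.randperm(H * W, device=features.device)[:num_samples]
        f = F.normalize(feats[b, idx], dim=-1)          # (N, C) unit feature vectors
        y = labs[b, idx]                                 # (N,)
        sim = (f @ f.t()).clamp(-1.0, 1.0) * 0.5 + 0.5   # map cosine similarity to [0, 1]
        target = (y[:, None] == y[None, :]).float()      # 1 if same class, else 0
        loss = loss + F.binary_cross_entropy(sim, target)
    return loss / B

# Example:
feats = torch.randn(2, 64, 32, 32, requires_grad=True)
labels = torch.randint(0, 6, (2, 32, 32))
print(long_range_correlation_loss(feats, labels).item())
```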

Score-PA: Score-based 3D Part Assembly

  • paper_url: http://arxiv.org/abs/2309.04220
  • repo_url: https://github.com/j-f-cheng/score-pa_score-based-3d-part-assembly
  • paper_authors: Junfeng Cheng, Mingdong Wu, Ruiyuan Zhang, Guanqi Zhan, Chao Wu, Hao Dong
  • for: 本研究旨在提出一种基于生成模型的3D部件组装方法,以解决自主3D部件组装问题在机器人和3D计算机视觉领域中的挑战。
  • methods: 本文提出了一种名为Score-based 3D Part Assembly(Score-PA)的框架用于3D部件组装,并提出了一种快速预测器-校正器采样器(FPC)算法,以加速框架中的采样过程。
  • results: 我们采用多种评价指标来评估组装质量和多样性,结果表明我们的算法优于现有的最先进方法。
    Abstract Autonomous 3D part assembly is a challenging task in the areas of robotics and 3D computer vision. This task aims to assemble individual components into a complete shape without relying on predefined instructions. In this paper, we formulate this task from a novel generative perspective, introducing the Score-based 3D Part Assembly framework (Score-PA) for 3D part assembly. Because score-based methods are typically time-consuming during the inference stage, we introduce a novel algorithm called the Fast Predictor-Corrector Sampler (FPC) that accelerates the sampling process within the framework. We employ various metrics to assess assembly quality and diversity, and our evaluation results demonstrate that our algorithm outperforms existing state-of-the-art approaches. We release our code at https://github.com/J-F-Cheng/Score-PA_Score-based-3D-Part-Assembly.
    摘要 自主三维部件组装是机器人与三维计算机视觉领域中一项具有挑战性的任务,其目标是在不依赖预定义指令的情况下将各个部件组装成完整的形状。在这篇论文中,我们从一种新的生成式视角出发,提出了用于三维部件组装的Score-based 3D Part Assembly框架(Score-PA)。由于基于分数的方法在推理阶段通常比较耗时,我们提出了一种名为快速预测器-校正器采样器(FPC)的新算法,用于加速框架中的采样过程。我们采用多种指标来评估组装质量和多样性,评估结果表明我们的算法优于现有的最先进方法。我们的代码已发布在 https://github.com/J-F-Cheng/Score-PA_Score-based-3D-Part-Assembly。
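For context, a standard score-based predictor-corrector step (reverse-diffusion predictor followed by a Langevin corrector) looks like the generic sketch below. The paper's FPC sampler accelerates this kind of loop; the code here is the textbook version, not FPC itself, and the toy score function is a placeholder.

```python
import torch

def pc_sampling(score_fn, x, sigmas, snr=0.16, n_corrector=1):
    """Generic predictor-corrector sampling for a VE-SDE style score model.

    score_fn(x, sigma) should estimate grad_x log p_sigma(x).
    sigmas is a decreasing noise schedule, e.g. torch.linspace(10.0, 0.01, 100).
    """
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        # Corrector: a few Langevin MCMC steps at the current noise level.
        for _ in range(n_corrector):
            grad = score_fn(x, sigma)
            noise = torch.randn_like(x)
            step = 2 * (snr * noise.norm() / grad.norm()) ** 2
            x = x + step * grad + torch.sqrt(2 * step) * noise
        # Predictor: reverse-diffusion (Euler-Maruyama) step to the next level.
        grad = score_fn(x, sigma)
        diff = sigma ** 2 - sigma_next ** 2
        x = x + diff * grad + torch.sqrt(diff) * torch.randn_like(x)
    return x

# Toy usage: a Gaussian score pulls samples toward the origin.
score_fn = lambda x, sigma: -x / (1.0 + sigma ** 2)
x0 = pc_sampling(score_fn, torch.randn(16, 6), torch.linspace(10.0, 0.01, 100))
print(x0.shape)
```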

SegmentAnything helps microscopy images based automatic and quantitative organoid detection and analysis

  • paper_url: http://arxiv.org/abs/2309.04190
  • repo_url: https://github.com/xiaodanxing/sam4organoid
  • paper_authors: Xiaodan Xing, Chunling Tang, Yunzhe Guo, Nicholas Kurniawan, Guang Yang
  • for: studying organ development, drug discovery, and toxicity assessment
  • methods: leveraging SegmentAnything for precise demarcation of individual organoids, and introducing a set of morphological properties for quantitative analysis
  • results: close alignment with manual organoid detection and measurement, demonstrating the effectiveness of the proposed method in accelerating organoid morphology analysis
    Abstract Organoids are self-organized 3D cell clusters that closely mimic the architecture and function of in vivo tissues and organs. Quantification of organoid morphology helps in studying organ development, drug discovery, and toxicity assessment. Recent microscopy techniques provide a potent tool to acquire organoid morphology features, but manual image analysis remains a labor and time-intensive process. Thus, this paper proposes a comprehensive pipeline for microscopy analysis that leverages the SegmentAnything to precisely demarcate individual organoids. Additionally, we introduce a set of morphological properties, including perimeter, area, radius, non-smoothness, and non-circularity, allowing researchers to analyze the organoid structures quantitatively and automatically. To validate the effectiveness of our approach, we conducted tests on bright-field images of human induced pluripotent stem cells (iPSCs) derived neural-epithelial (NE) organoids. The results obtained from our automatic pipeline closely align with manual organoid detection and measurement, showcasing the capability of our proposed method in accelerating organoids morphology analysis.
    摘要 organoids 是自组织的3D细胞群,具有在 vivo 组织中的结构和功能的高度相似性。量化 organoid 形态可以帮助研究器官发展、药物探索和毒性评估。现有的微镜技术为 organoid 形态特征的获取提供了强大的工具,但是手动图像分析仍然是一项劳动和时间耗费的过程。因此,这篇论文提出了一个完整的微镜分析管线,利用 SegmentAnything 精准地界定个体 organoid。此外,我们还引入了一组形态特征,包括周长、面积、半径、不整形和不圆形,使研究人员可以对 organoid 结构进行量化和自动化的分析。为验证我们的方法的有效性,我们对人类干细胞 derived neural-epithelial(NE) organoids 的明亮场图进行了测试。结果表明,我们的自动化管线与手动图像分析结果高度相似,这表明了我们提出的方法在加速 organoid 形态分析方面的能力。
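Given per-organoid masks (e.g. produced by SegmentAnything), the quantitative descriptors listed above are straightforward to compute with scikit-image. The circularity and non-smoothness definitions below are common textbook choices standing in for whatever the paper uses exactly.

```python
import numpy as np
from skimage import measure

def organoid_morphology(binary_mask: np.ndarray):
    """binary_mask: (H, W) 0/1 mask containing one or more organoids."""
    stats = []
    for p in measure.regionprops(measure.label(binary_mask.astype(np.uint8))):
        area = p.area
        perimeter = p.perimeter
        radius = np.sqrt(area / np.pi)                       # equivalent-circle radius
        circularity = 4.0 * np.pi * area / (perimeter ** 2)  # ~1.0 for a perfect circle
        # Non-smoothness proxy: boundary length relative to its convex hull's boundary.
        hull_perimeter = measure.regionprops(p.convex_image.astype(np.uint8))[0].perimeter
        non_smoothness = perimeter / max(hull_perimeter, 1e-6)
        stats.append({
            "area": area, "perimeter": perimeter, "radius": radius,
            "non_circularity": 1.0 - circularity, "non_smoothness": non_smoothness,
        })
    return stats

# Toy example: a filled disk should be close to perfectly circular and smooth.
yy, xx = np.mgrid[:128, :128]
disk = (yy - 64) ** 2 + (xx - 64) ** 2 < 40 ** 2
print(organoid_morphology(disk)[0])
```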

Stereo Matching in Time: 100+ FPS Video Stereo Matching for Extended Reality

  • paper_url: http://arxiv.org/abs/2309.04183
  • repo_url: None
  • paper_authors: Ziang Cheng, Jiayu Yang, Hongdong Li
  • for: 这篇论文旨在解决头戴式设备上的实时深度推断问题,以支撑扩展现实(XR)应用。
  • methods: 论文构建了一个新的视频立体匹配合成数据集,并提出了一种面向XR应用的基于视频的立体匹配方法。该方法利用连续帧之间的时序相关性并复用聚合特征,在不损失精度的前提下提升效率。
  • results: 测试结果显示,该方法在标准桌面计算机上可达到每秒134帧的实时推断速度,在电池供电的VR/AR头戴式设备上可达到每秒30帧,性能优于现有技术。
    Abstract Real-time Stereo Matching is a cornerstone algorithm for many Extended Reality (XR) applications, such as indoor 3D understanding, video pass-through, and mixed-reality games. Despite significant advancements in deep stereo methods, achieving real-time depth inference with high accuracy on a low-power device remains a major challenge. One of the major difficulties is the lack of high-quality indoor video stereo training datasets captured by head-mounted VR/AR glasses. To address this issue, we introduce a novel video stereo synthetic dataset that comprises photorealistic renderings of various indoor scenes and realistic camera motion captured by a 6-DoF moving VR/AR head-mounted display (HMD). This facilitates the evaluation of existing approaches and promotes further research on indoor augmented reality scenarios. Our newly proposed dataset enables us to develop a novel framework for continuous video-rate stereo matching. As another contribution, our dataset enables us to proposed a new video-based stereo matching approach tailored for XR applications, which achieves real-time inference at an impressive 134fps on a standard desktop computer, or 30fps on a battery-powered HMD. Our key insight is that disparity and contextual information are highly correlated and redundant between consecutive stereo frames. By unrolling an iterative cost aggregation in time (i.e. in the temporal dimension), we are able to distribute and reuse the aggregated features over time. This approach leads to a substantial reduction in computation without sacrificing accuracy. We conducted extensive evaluations and comparisons and demonstrated that our method achieves superior performance compared to the current state-of-the-art, making it a strong contender for real-time stereo matching in VR/AR applications.
    摘要 实时立体匹配是许多扩展现实(XR)应用的基石算法,例如室内三维理解、视频透视和混合现实游戏。尽管深度立体匹配方法取得了显著进展,但在低功耗设备上实现高精度的实时深度推断仍然是一大挑战。其中一个主要困难是缺乏由头戴式VR/AR设备采集的高质量室内视频立体匹配训练数据集。为了解决这一问题,我们提出了一个新的视频立体匹配合成数据集,包含多种室内场景的逼真渲染,以及由六自由度运动的VR/AR头戴式显示器(HMD)采集的真实相机运动。这有助于评估现有方法,并推动室内增强现实场景的进一步研究。基于这个新数据集,我们开发了一个连续视频帧率的立体匹配框架。作为另一项贡献,该数据集还使我们能够提出一种面向XR应用的基于视频的立体匹配方法,在标准桌面计算机上实现了134帧每秒的实时推断,在电池供电的HMD上实现了30帧每秒。我们的核心洞察是,相邻立体帧之间的视差和上下文信息高度相关且冗余。通过在时间维度上展开迭代式代价聚合,我们可以在时间上分摊并复用聚合后的特征。这种方式在不牺牲精度的情况下大幅降低了计算量。我们进行了广泛的评估与比较,结果表明我们的方法优于当前最先进技术,是VR/AR应用中实时立体匹配的有力候选方案。

Unsupervised Object Localization with Representer Point Selection

  • paper_url: http://arxiv.org/abs/2309.04172
  • repo_url: https://github.com/yeonghwansong/uolwrps
  • paper_authors: Yeonghwan Song, Seokwoo Jang, Dina Katabi, Jeany Son
  • for: 本研究旨在提出一种新的无监督对象定位方法,可以让我们理解模型的预测结果。
  • methods: 本方法基于代表点选择,通过选择模型预测结果中最重要的示例,提供了如何理解模型预测的示例和其重要性。
  • results: 我们的方法在多个数据集上显著优于当前最先进的无监督和自监督对象定位方法,甚至超过了近期的弱监督和少样本方法。
    Abstract We propose a novel unsupervised object localization method that allows us to explain the predictions of the model by utilizing self-supervised pre-trained models without additional finetuning. Existing unsupervised and self-supervised object localization methods often utilize class-agnostic activation maps or self-similarity maps of a pre-trained model. Although these maps can offer valuable information for localization, their limited ability to explain how the model makes predictions remains challenging. In this paper, we propose a simple yet effective unsupervised object localization method based on representer point selection, where the predictions of the model can be represented as a linear combination of representer values of training points. By selecting representer points, which are the most important examples for the model predictions, our model can provide insights into how the model predicts the foreground object by providing relevant examples as well as their importance. Our method outperforms the state-of-the-art unsupervised and self-supervised object localization methods on various datasets with significant margins and even outperforms recent weakly supervised and few-shot methods.
    摘要 我们提出了一种新的无监督物体定位方法,可以使用无监督预训练模型来解释模型预测的结果。现有的无监督和自我监督物体定位方法经常使用类型不具有激活图或模型自身的相似图来提供有价值的信息。虽然这些图可以提供有用的信息,但它们的解释能力对模型预测的限制性尚未得到解决。在这篇论文中,我们提出了一种简单 yet 有效的无监督物体定位方法,基于表达点选择,其中模型预测可以表示为一个线性组合的表达点值。通过选择表达点,这些是模型预测中最重要的示例,我们的模型可以提供如何模型预测了前景对象的信息,并提供相关示例以及其重要性。我们的方法在多个数据集上与状态之前的无监督和自我监督物体定位方法之间具有显著的差异,甚至超过最近的弱监督和几个shot方法。
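The idea of expressing a prediction as a linear combination of training-point contributions can be sketched via the representer-theorem view of the final linear layer, as below. The alpha computation is a simplified illustration of that general formulation, not the paper's exact procedure.

```python
import torch

def representer_contributions(train_feats, train_logit_grads, test_feat, lam, cls):
    """Contribution of each training point to a test prediction for class `cls`.

    train_feats:       (N, D) penultimate-layer features of training images
    train_logit_grads: (N, K) d loss / d logits evaluated at the trained model
    test_feat:         (D,)   penultimate feature of the test image
    Under the representer-theorem view of the final linear layer, the class logit
    decomposes as sum_i alpha_i[cls] * <f_i, f_test>.
    """
    n = train_feats.shape[0]
    alpha = -train_logit_grads / (2.0 * lam * n)          # (N, K) representer values
    kernel = train_feats @ test_feat                      # (N,)  feature similarity
    return alpha[:, cls] * kernel                         # (N,)  per-sample contribution

# Toy usage: rank the most influential training points for one test prediction.
N, D, K = 1000, 128, 10
contrib = representer_contributions(torch.randn(N, D), torch.randn(N, K),
                                    torch.randn(D), lam=1e-3, cls=3)
print(contrib.topk(5).indices.tolist())
```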

PRISTA-Net: Deep Iterative Shrinkage Thresholding Network for Coded Diffraction Patterns Phase Retrieval

  • paper_url: http://arxiv.org/abs/2309.04171
  • repo_url: https://github.com/liuaxou/prista-net
  • paper_authors: Aoxu Liu, Xiaohong Fan, Yin Yang, Jianping Zhang
  • for: PRISTA-Net is designed to solve the problem of phase retrieval (PR) in computational imaging and image processing, which is a challenging nonlinear inverse problem.
  • methods: PRISTA-Net uses a deep unfolding network (DUN) based on the first-order iterative shrinkage thresholding algorithm (ISTA) to address the proximal-point mapping sub-problem associated with sparse priors. It also utilizes an attention mechanism to focus on phase information containing image edges, textures, and structures, and the fast Fourier transform (FFT) to learn global features that enhance local information.
  • results: Experiments on Coded Diffraction Patterns (CDPs) measurements demonstrate that PRISTA-Net outperforms the existing state-of-the-art methods in terms of qualitative and quantitative evaluations.
    Abstract The problem of phase retrieval (PR) involves recovering an unknown image from limited amplitude measurement data and is a challenge nonlinear inverse problem in computational imaging and image processing. However, many of the PR methods are based on black-box network models that lack interpretability and plug-and-play (PnP) frameworks that are computationally complex and require careful parameter tuning. To address this, we have developed PRISTA-Net, a deep unfolding network (DUN) based on the first-order iterative shrinkage thresholding algorithm (ISTA). This network utilizes a learnable nonlinear transformation to address the proximal-point mapping sub-problem associated with the sparse priors, and an attention mechanism to focus on phase information containing image edges, textures, and structures. Additionally, the fast Fourier transform (FFT) is used to learn global features to enhance local information, and the designed logarithmic-based loss function leads to significant improvements when the noise level is low. All parameters in the proposed PRISTA-Net framework, including the nonlinear transformation, threshold parameters, and step size, are learned end-to-end instead of being manually set. This method combines the interpretability of traditional methods with the fast inference ability of deep learning and is able to handle noise at each iteration during the unfolding stage, thus improving recovery quality. Experiments on Coded Diffraction Patterns (CDPs) measurements demonstrate that our approach outperforms the existing state-of-the-art methods in terms of qualitative and quantitative evaluations. Our source codes are available at \emph{https://github.com/liuaxou/PRISTA-Net}.
    摘要 相位恢复(PR)问题旨在从有限的幅值测量数据中恢复未知图像,是计算成像与图像处理中一个具有挑战性的非线性逆问题。然而,许多相位恢复方法基于缺乏可解释性的黑盒网络模型,或基于计算复杂、需要精细调参的即插即用(PnP)框架。为了解决这些问题,我们开发了PRISTA-Net,一种基于一阶迭代收缩阈值算法(ISTA)的深度展开网络(DUN)。该网络利用可学习的非线性变换来处理与稀疏先验相关的邻近点映射子问题,并利用注意力机制聚焦于包含图像边缘、纹理和结构的相位信息。此外,我们利用快速傅里叶变换(FFT)学习全局特征以增强局部信息,并设计了基于对数的损失函数,在噪声水平较低时带来显著提升。PRISTA-Net框架中的所有参数,包括非线性变换、阈值参数和步长,都是端到端学习得到的,而非人工设定。该方法兼具传统方法的可解释性与深度学习的快速推理能力,并能在展开阶段的每次迭代中处理噪声,从而提升恢复质量。在编码衍射图样(CDP)测量上的实验表明,我们的方法在定性与定量评估上均优于现有的最先进方法。源代码见 https://github.com/liuaxou/PRISTA-Net。
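PRISTA-Net unrolls ISTA-style iterations; for reference, one classical (non-learned) ISTA iteration with a soft-thresholding proximal step is sketched below on a plain linear measurement model. The paper instead works with coded diffraction patterns and learns the transform, thresholds, and step size end to end.

```python
import torch

def soft_threshold(x, tau):
    return torch.sign(x) * torch.clamp(torch.abs(x) - tau, min=0.0)

def ista(A, y, lam=0.1, step=None, n_iter=200):
    """Minimize 0.5*||A x - y||^2 + lam*||x||_1 with classical ISTA."""
    if step is None:
        # 1 / Lipschitz constant of the gradient, i.e. 1 / largest eigenvalue of A^T A.
        step = 1.0 / torch.linalg.matrix_norm(A, ord=2) ** 2
    x = torch.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.t() @ (A @ x - y)                        # gradient of the data term
        x = soft_threshold(x - step * grad, lam * step)   # proximal (shrinkage) step
    return x

# Toy sparse-recovery example.
torch.manual_seed(0)
A = torch.randn(64, 256) / 8.0
x_true = torch.zeros(256)
x_true[torch.randperm(256)[:10]] = 1.0
x_hat = ista(A, A @ x_true, lam=0.01)
print((x_hat - x_true).abs().max().item())
```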

Grouping Boundary Proposals for Fast Interactive Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.04169
  • repo_url: None
  • paper_authors: Li Liu, Da Chen, Minglei Shu, Laurent D. Cohen
  • for: This paper proposes a new image segmentation model that leverages the minimal geodesic framework and adaptive cut-based circular optimal path computation scheme to improve the accuracy and efficiency of image segmentation.
  • methods: The proposed model combines the minimal geodesic framework with an adaptive cut-based circular optimal path computation scheme and a graph-based boundary proposals grouping scheme to segment images.
  • results: Experimental results show that the proposed model outperforms state-of-the-art minimal paths-based image segmentation approaches.
    Abstract Geodesic models are known as an efficient tool for solving various image segmentation problems. Most of existing approaches only exploit local pointwise image features to track geodesic paths for delineating the objective boundaries. However, such a segmentation strategy cannot take into account the connectivity of the image edge features, increasing the risk of shortcut problem, especially in the case of complicated scenario. In this work, we introduce a new image segmentation model based on the minimal geodesic framework in conjunction with an adaptive cut-based circular optimal path computation scheme and a graph-based boundary proposals grouping scheme. Specifically, the adaptive cut can disconnect the image domain such that the target contours are imposed to pass through this cut only once. The boundary proposals are comprised of precomputed image edge segments, providing the connectivity information for our segmentation model. These boundary proposals are then incorporated into the proposed image segmentation model, such that the target segmentation contours are made up of a set of selected boundary proposals and the corresponding geodesic paths linking them. Experimental results show that the proposed model indeed outperforms state-of-the-art minimal paths-based image segmentation approaches.
    摘要 测地线模型被认为是求解各种图像分割问题的高效工具。现有方法大多只利用局部逐点的图像特征来跟踪测地路径以勾勒目标边界。然而,这种分割策略无法考虑图像边缘特征之间的连通性,从而增加了捷径问题的风险,在复杂场景下尤为明显。在本工作中,我们提出了一种新的图像分割模型,将最小测地框架与基于自适应切割的环形最优路径计算方案以及基于图的边界提议分组方案相结合。具体而言,自适应切割可以将图像域断开,使目标轮廓被约束为只穿过该切割一次。边界提议由预先计算的图像边缘片段组成,为分割模型提供连通性信息。这些边界提议随后被纳入所提出的图像分割模型,使目标分割轮廓由一组被选中的边界提议以及连接它们的测地路径构成。实验结果表明,所提出的模型确实优于当前最先进的基于最小路径的图像分割方法。

Context-Aware Prompt Tuning for Vision-Language Model with Dual-Alignment

  • paper_url: http://arxiv.org/abs/2309.04158
  • repo_url: None
  • paper_authors: Hongyu Hu, Tiancheng Lin, Jie Wang, Zhenbang Sun, Yi Xu
  • for: 提高视语模型(VLM)的适应能力,使其更好地适应下游任务。
  • methods: combining pre-trained large language models(LLMs)和learnable prompts,通过对Prompt的学习进行对接,从而提高视语模型的适应能力。
  • results: 在11个下游数据集上,DuAl-PT实现了superior的表现,并且在base-to-new泛化上也显示出了优秀的结果。
    Abstract Large-scale vision-language models (VLMs), e.g., CLIP, learn broad visual concepts from tedious training data, showing superb generalization ability. Amount of prompt learning methods have been proposed to efficiently adapt the VLMs to downstream tasks with only a few training samples. We introduce a novel method to improve the prompt learning of vision-language models by incorporating pre-trained large language models (LLMs), called Dual-Aligned Prompt Tuning (DuAl-PT). Learnable prompts, like CoOp, implicitly model the context through end-to-end training, which are difficult to control and interpret. While explicit context descriptions generated by LLMs, like GPT-3, can be directly used for zero-shot classification, such prompts are overly relying on LLMs and still underexplored in few-shot domains. With DuAl-PT, we propose to learn more context-aware prompts, benefiting from both explicit and implicit context modeling. To achieve this, we introduce a pre-trained LLM to generate context descriptions, and we encourage the prompts to learn from the LLM's knowledge by alignment, as well as the alignment between prompts and local image features. Empirically, DuAl-PT achieves superior performance on 11 downstream datasets on few-shot recognition and base-to-new generalization. Hopefully, DuAl-PT can serve as a strong baseline. Code will be available.
    摘要 大规模视觉语言模型(VLM),例如CLIP,通过繁重的训练数据学习到广泛的视觉概念,展现出卓越的泛化能力。现已提出了大量提示学习方法,使VLM仅凭少量训练样本即可高效适配下游任务。我们提出了一种新方法,通过引入预训练的大型语言模型(LLM)来改进视觉语言模型的提示学习,称为双重对齐提示微调(DuAl-PT)。像CoOp这样的可学习提示通过端到端训练隐式地建模上下文,难以控制和解释;而由GPT-3等LLM生成的显式上下文描述虽然可以直接用于零样本分类,但这类提示过度依赖LLM,在少样本领域仍未得到充分探索。借助DuAl-PT,我们提出学习更具上下文感知能力的提示,同时受益于显式与隐式的上下文建模。为此,我们引入预训练LLM生成上下文描述,并通过对齐促使提示学习LLM中的知识,同时对齐提示与局部图像特征。实验表明,DuAl-PT在11个下游数据集上的少样本识别和基类到新类泛化任务中均取得了更优的性能。希望DuAl-PT能够成为一个强有力的基线。代码即将发布。

Mapping EEG Signals to Visual Stimuli: A Deep Learning Approach to Match vs. Mismatch Classification

  • paper_url: http://arxiv.org/abs/2309.04153
  • repo_url: None
  • paper_authors: Yiqian Yang, Zhengqiao Zhao, Qian Wang, Yan Yang, Jingdong Chen
  • for: 该研究旨在开发一种基于深度学习的“匹配vs不匹配”模型,用于类ifizying视频片段是否引起记录的EEG信号响应,以及学习视觉内容和相应的神经记录之间的关系。
  • methods: 该模型使用了一种新的“匹配vs不匹配”机制,通过对视频片段和EEG信号进行匹配和不匹配的比较,以捕捉视频内容和神经记录之间的关系。
  • results: 研究发现,该模型在未见过的受试者上达到了最高的准确率,并且能够减少受试者间噪声。此外,研究还发现,对模型预测贡献最大的脑区主要与语言处理相关,其次是与视觉处理相关。这些结果有助于开发基于神经记录的视频重建技术及其相关应用。
    Abstract Existing approaches to modeling associations between visual stimuli and brain responses are facing difficulties in handling between-subject variance and model generalization. Inspired by the recent progress in modeling speech-brain response, we propose in this work a ``match-vs-mismatch'' deep learning model to classify whether a video clip induces excitatory responses in recorded EEG signals and learn associations between the visual content and corresponding neural recordings. Using an exclusive experimental dataset, we demonstrate that the proposed model is able to achieve the highest accuracy on unseen subjects as compared to other baseline models. Furthermore, we analyze the inter-subject noise using a subject-level silhouette score in the embedding space and show that the developed model is able to mitigate inter-subject noise and significantly reduce the silhouette score. Moreover, we examine the Grad-CAM activation score and show that the brain regions associated with language processing contribute most to the model predictions, followed by regions associated with visual processing. These results have the potential to facilitate the development of neural recording-based video reconstruction and its related applications.
    摘要 现有的视觉刺激和大脑响应模型面临着处理 между人差异和模型泛化的挑战。启发于最近的语音大脑响应模型的进步,我们在本工作中提出了一种“匹配vs不匹配”深度学习模型,用于判断视频片断是否产生记录的EEG信号中的刺激响应。使用专属实验数据集,我们示出了该模型能够在未见过的人群中达到最高的准确率,并且分析了在 embedding 空间中的人体遮盾分数,显示该模型能够减少人体噪音,并且通过Grad-CAM活化分数显示,大脑语言处理相关区域对模型预测做出了主要贡献,其次是视觉处理相关区域。这些结果有potential用于发展基于 neural recording 的视频重建和相关应用。

Representation Synthesis by Probabilistic Many-Valued Logic Operation in Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2309.04148
  • repo_url: None
  • paper_authors: Hiroki Nakamura, Masashi Okada, Tadahiro Taniguchi
  • for: 本研究探讨了一种基于多值逻辑的自助学习(SSL)方法,用于学习混合图像表示。
  • methods: 该方法使用混合图像synthesize表示,并使用多值逻辑运算实现表示合并。该方法可以保持原始表示的remarkable特征。
  • results: 对于图像分类任务,该方法与前期表示合并方法竞争性。此外,我们还研究了图像检索应用,并发现了与图像类别数量之间的关系。
    Abstract Self-supervised learning (SSL) using mixed images has been studied to learn various image representations. Existing methods using mixed images learn a representation by maximizing the similarity between the representation of the mixed image and the synthesized representation of the original images. However, few methods consider the synthesis of representations from the perspective of mathematical logic. In this study, we focused on a synthesis method of representations. We proposed a new SSL with mixed images and a new representation format based on many-valued logic. This format can indicate the feature-possession degree, that is, how much of each image feature is possessed by a representation. This representation format and representation synthesis by logic operation realize that the synthesized representation preserves the remarkable characteristics of the original representations. Our method performed competitively with previous representation synthesis methods for image classification tasks. We also examined the relationship between the feature-possession degree and the number of classes of images in the multilabel image classification dataset to verify that the intended learning was achieved. In addition, we discussed image retrieval, which is an application of our proposed representation format using many-valued logic.
    摘要 使用混合图像的自监督学习(SSL)已被用于学习多种图像表示。现有的混合图像方法通过最大化混合图像的表示与由原始图像表示合成的表示之间的相似度来学习表示。然而,很少有方法从数理逻辑的角度考虑表示的合成。在本研究中,我们关注表示的合成方法,提出了一种新的基于混合图像的SSL方法,以及一种基于多值逻辑的新表示格式。该格式可以表示特征拥有度,即一个表示在多大程度上拥有每个图像特征。这种表示格式以及通过逻辑运算进行的表示合成,使得合成后的表示能够保留原始表示的显著特征。在图像分类任务上,我们的方法与已有的表示合成方法具有可比的性能。我们还考察了多标签图像分类数据集中特征拥有度与图像类别数之间的关系,以验证是否实现了预期的学习效果。此外,我们还讨论了图像检索,这是所提出的基于多值逻辑的表示格式的一个应用。

Robot Localization and Mapping Final Report – Sequential Adversarial Learning for Self-Supervised Deep Visual Odometry

  • paper_url: http://arxiv.org/abs/2309.04147
  • repo_url: None
  • paper_authors: Akankshya Kar, Sajal Maheshwari, Shamit Lal, Vinay Sameer Raja Kad
  • For: The paper aims to improve the accuracy of visual odometry (VO) and Simultaneous Localization and Mapping (SLAM) in challenging scenarios by using deep neural networks to extract high-level features and generate more accurate depth and pose estimates.* Methods: The paper explores two approaches to improve the accuracy of VO and SLAM: 1) modeling using optical flow and recurrent neural networks (RNN) to exploit spatio-temporal correlations, and 2) using a generative adversarial network (GAN) to improve the depth estimation and reduce artifacts.* Results: The paper achieves better depth and pose estimates compared to previous works, and demonstrates the effectiveness of the proposed methods in challenging scenarios such as low-texture images and dynamic scenarios.
    Abstract Visual odometry (VO) and SLAM have been using multi-view geometry via local structure from motion for decades. These methods have a slight disadvantage in challenging scenarios such as low-texture images, dynamic scenarios, etc. Meanwhile, use of deep neural networks to extract high level features is ubiquitous in computer vision. For VO, we can use these deep networks to extract depth and pose estimates using these high level features. The visual odometry task then can be modeled as an image generation task where the pose estimation is the by-product. This can also be achieved in a self-supervised manner, thereby eliminating the data (supervised) intensive nature of training deep neural networks. Although some works tried the similar approach [1], the depth and pose estimation in the previous works are vague sometimes resulting in accumulation of error (drift) along the trajectory. The goal of this work is to tackle these limitations of past approaches and to develop a method that can provide better depths and pose estimates. To address this, a couple of approaches are explored: 1) Modeling: Using optical flow and recurrent neural networks (RNN) in order to exploit spatio-temporal correlations which can provide more information to estimate depth. 2) Loss function: Generative adversarial network (GAN) [2] is deployed to improve the depth estimation (and thereby pose too), as shown in Figure 1. This additional loss term improves the realism in generated images and reduces artifacts.
    摘要 视觉里程计(VO)和SLAM数十年来一直依赖基于局部运动恢复结构的多视图几何方法。这些方法在低纹理图像、动态场景等具有挑战性的情形下存在一定不足。与此同时,利用深度神经网络提取高层特征在计算机视觉中已十分普遍。对于VO,我们可以利用这些深度网络从高层特征中估计深度和位姿,从而将视觉里程计任务建模为一个图像生成任务,位姿估计则作为其副产物。这还可以以自监督的方式实现,从而摆脱训练深度神经网络对大量标注数据的依赖。尽管已有一些工作尝试过类似的思路 [1],但先前工作中的深度与位姿估计有时并不精确,导致误差沿轨迹累积(漂移)。本工作的目标是解决以往方法的这些局限,提出一种能够给出更好深度与位姿估计的方法。为此,我们探索了两种途径:1)建模:利用光流和循环神经网络(RNN)挖掘时空相关性,为深度估计提供更多信息;2)损失函数:引入生成对抗网络(GAN)[2] 来改进深度估计(进而改进位姿估计),如图1所示。这一额外的损失项提升了生成图像的真实感并减少了伪影。

Depth Completion with Multiple Balanced Bases and Confidence for Dense Monocular SLAM

  • paper_url: http://arxiv.org/abs/2309.04145
  • repo_url: None
  • paper_authors: Weijian Xie, Guanyi Chu, Quanhao Qian, Yihao Yu, Hai Li, Danpeng Chen, Shangjin Zhai, Nan Wang, Hujun Bao, Guofeng Zhang
  • For: This paper proposes a novel method for dense SLAM based on monocular cameras, which can achieve online dense mapping on a mobile device.* Methods: The proposed method integrates a light-weight depth completion network (BBC-Net) into a sparse SLAM system using a multi-basis depth representation. The method predicts multiple balanced bases and a confidence map from a monocular image with sparse points, and the final depth is a linear combination of predicted depth bases optimized by tuning the corresponding weights.* Results: The proposed method achieves better performance in monocular dense mapping than state-of-the-art methods, and provides an online demo running on a mobile phone.
    Abstract Dense SLAM based on monocular cameras does indeed have immense application value in the field of AR/VR, especially when it is performed on a mobile device. In this paper, we propose a novel method that integrates a light-weight depth completion network into a sparse SLAM system using a multi-basis depth representation, so that dense mapping can be performed online even on a mobile phone. Specifically, we present a specifically optimized multi-basis depth completion network, called BBC-Net, tailored to the characteristics of traditional sparse SLAM systems. BBC-Net can predict multiple balanced bases and a confidence map from a monocular image with sparse points generated by off-the-shelf keypoint-based SLAM systems. The final depth is a linear combination of predicted depth bases that can be optimized by tuning the corresponding weights. To seamlessly incorporate the weights into traditional SLAM optimization and ensure efficiency and robustness, we design a set of depth weight factors, which makes our network a versatile plug-in module, facilitating easy integration into various existing sparse SLAM systems and significantly enhancing global depth consistency through bundle adjustment. To verify the portability of our method, we integrate BBC-Net into two representative SLAM systems. The experimental results on various datasets show that the proposed method achieves better performance in monocular dense mapping than the state-of-the-art methods. We provide an online demo running on a mobile phone, which verifies the efficiency and mapping quality of the proposed method in real-world scenarios.
    摘要 基于单目相机的稠密SLAM在AR/VR领域具有巨大的应用价值,尤其是在移动设备上运行时。在这篇论文中,我们提出了一种新方法,利用多基底深度表示将一个轻量级深度补全网络(BBC-Net)整合到稀疏SLAM系统中,使得稠密建图甚至可以在手机上在线进行。具体而言,我们提出了一个针对传统稀疏SLAM系统特点优化的多基底深度补全网络,可以从带有现成关键点SLAM系统所生成稀疏点的单目图像中,预测多个均衡的深度基底和一张置信度图。最终深度是这些预测深度基底的线性组合,可以通过调整相应的权重进行优化。为了将这些权重无缝融入传统SLAM优化并保证效率与鲁棒性,我们设计了一组深度权重因子,使我们的网络成为一个通用的插件模块,能够方便地集成到各种现有的稀疏SLAM系统中,并通过光束法平差(bundle adjustment)显著提升全局深度一致性。为验证方法的可移植性,我们将BBC-Net集成到两个具有代表性的SLAM系统中。在多个数据集上的实验结果表明,所提出的方法在单目稠密建图上优于当前最先进方法。我们还提供了一个运行在手机上的在线演示,验证了该方法在真实场景中的效率和建图质量。
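The step in which "the final depth is a linear combination of predicted depth bases" can be illustrated by fitting the basis weights to the sparse SLAM depths with least squares. In the paper the weights are handled inside bundle adjustment via the depth weight factors, so the closed-form fit below is only a simplified stand-in.

```python
import torch

def fuse_depth_bases(bases, sparse_depth, sparse_mask):
    """bases: (K, H, W) predicted balanced depth bases
    sparse_depth: (H, W) depths at SLAM keypoints (arbitrary elsewhere)
    sparse_mask:  (H, W) bool mask of valid keypoints
    Returns the dense depth  D = sum_k w_k * basis_k  with w fit to the sparse points.
    """
    A = bases[:, sparse_mask].t()          # (N_sparse, K) basis values at keypoints
    b = sparse_depth[sparse_mask]          # (N_sparse,)
    w = torch.linalg.lstsq(A, b.unsqueeze(-1)).solution.squeeze(-1)  # (K,)
    dense = torch.einsum("k,khw->hw", w, bases)
    return dense, w

# Toy usage with random bases and 200 sparse points.
K, H, W = 4, 120, 160
bases = torch.rand(K, H, W) + 0.1
mask = torch.zeros(H, W, dtype=torch.bool)
mask[torch.randint(0, H, (200,)), torch.randint(0, W, (200,))] = True
gt_w = torch.tensor([0.5, 1.0, 0.2, 0.8])
sparse = torch.einsum("k,khw->hw", gt_w, bases)
dense, w = fuse_depth_bases(bases, sparse, mask)
print(w)  # should be close to gt_w
```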

On the Efficacy of Multi-scale Data Samplers for Vision Applications

  • paper_url: http://arxiv.org/abs/2309.04502
  • repo_url: None
  • paper_authors: Elvis Nunez, Thomas Merth, Anish Prabhu, Mehrdad Farajtabar, Mohammad Rastegari, Sachin Mehta, Maxwell Horton
  • For: 本研究探讨了多尺度解析训练的性质,以帮助提高视觉任务的性能。* Methods: 本研究使用了可变批处理多尺度数据采样器,该采样器在每个训练迭代中随机选择输入分辨率,并在批处理大小的同时进行调整。* Results: 研究发现,多尺度采样器 behave as 隐式数据正则化,可以加速训练速度,同时保持或提高模型的准确率,并且更好地适应数据分布和缩放变化。此外,研究还扩展了一个多尺度变换批处理器,通过逐渐增加分辨率来减少计算量,并在检测和实例分割任务中获得了37%的训练计算量减少和3-4%的mAP提高。
    Abstract Multi-scale resolution training has seen an increased adoption across multiple vision tasks, including classification and detection. Training with smaller resolutions enables faster training at the expense of a drop in accuracy. Conversely, training with larger resolutions has been shown to improve performance, but memory constraints often make this infeasible. In this paper, we empirically study the properties of multi-scale training procedures. We focus on variable batch size multi-scale data samplers that randomly sample an input resolution at each training iteration and dynamically adjust their batch size according to the resolution. Such samplers have been shown to improve model accuracy beyond standard training with a fixed batch size and resolution, though it is not clear why this is the case. We explore the properties of these data samplers by performing extensive experiments on ResNet-101 and validate our conclusions across multiple architectures, tasks, and datasets. We show that multi-scale samplers behave as implicit data regularizers and accelerate training speed. Compared to models trained with single-scale samplers, we show that models trained with multi-scale samplers retain or improve accuracy, while being better-calibrated and more robust to scaling and data distribution shifts. We additionally extend a multi-scale variable batch sampler with a simple curriculum that progressively grows resolutions throughout training, allowing for a compute reduction of more than 30%. We show that the benefits of multi-scale training extend to detection and instance segmentation tasks, where we observe a 37% reduction in training FLOPs along with a 3-4% mAP increase on MS-COCO using a Mask R-CNN model.
    摘要 多尺度分辨率训练已在分类和检测等多种视觉任务中得到越来越多的应用。使用较小的分辨率训练可以加快训练速度,但会带来精度下降;使用较大的分辨率训练则能够提升性能,但往往受限于显存而难以实现。在本文中,我们对多尺度训练过程的性质进行了实证研究,重点关注可变批大小的多尺度数据采样器:它在每次训练迭代中随机采样一个输入分辨率,并根据该分辨率动态调整批大小。此类采样器已被证明能够比固定批大小和分辨率的标准训练带来更高的模型精度,但其原因尚不清楚。我们在ResNet-101上进行了大量实验来探究这类数据采样器的性质,并在多种架构、任务和数据集上验证了结论。结果表明,多尺度采样器起到了隐式数据正则化的作用,并能加快训练速度。与使用单尺度采样器训练的模型相比,使用多尺度采样器训练的模型在保持或提升精度的同时,具有更好的校准性,并对尺度和数据分布变化更加鲁棒。我们还为多尺度可变批采样器扩展了一个简单的课程策略,在训练过程中逐步增大分辨率,使计算量降低超过30%。我们进一步表明,多尺度训练的优势同样适用于检测和实例分割任务:在MS-COCO上使用Mask R-CNN模型时,训练FLOPs减少了37%,同时mAP提升了3-4%。
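A minimal version of the variable-batch-size multi-scale sampler described in the abstract: each iteration draws a resolution at random and scales the batch size inversely with the pixel count so that per-iteration compute stays roughly constant. The resolutions and base settings below are placeholders, not the authors' configuration.

```python
import random
import torch

class MultiScaleBatchIterator:
    """Yields (resolution, list_of_indices) with batch size scaled by 1/resolution^2."""

    def __init__(self, dataset_size, base_batch=128, base_res=224,
                 resolutions=(160, 192, 224, 256, 288), seed=0):
        self.n = dataset_size
        self.base_batch, self.base_res = base_batch, base_res
        self.resolutions = resolutions
        self.rng = random.Random(seed)

    def __iter__(self):
        order = list(range(self.n))
        self.rng.shuffle(order)
        i = 0
        while i < self.n:
            res = self.rng.choice(self.resolutions)
            # Keep batch_size * res^2 roughly constant across iterations.
            bs = max(1, int(self.base_batch * (self.base_res / res) ** 2))
            yield res, order[i:i + bs]
            i += bs

# Usage sketch: resize each sampled image to `res` before stacking the batch.
for res, idx in MultiScaleBatchIterator(dataset_size=1000):
    images = torch.rand(len(idx), 3, res, res)   # stand-in for resized samples
    break
print(res, len(idx))
```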

From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.04109
  • repo_url: None
  • paper_authors: Changming Xiao, Qi Yang, Feng Zhou, Changshui Zhang
  • for: 这研究旨在利用文本扩散模型中的注意机制进行Semantic Grounding,不需要再训练也不需要执行时间优化。
  • methods: 提议使用文本扩散模型的denoising网络中的注意机制来实现Semantic Grounding。
  • results: 在 Pascal VOC 2012 和 Microsoft COCO 2014 上进行了weakly-supervised Semantic Segmentation的评估,并得到了较高的性能。此外,我们还发现了自定义生成方法中学习的文本嵌入的word-pixel相关性可以通过一些修改来掌握。
    Abstract Diffusion models have revolted the field of text-to-image generation recently. The unique way of fusing text and image information contributes to their remarkable capability of generating highly text-related images. From another perspective, these generative models imply clues about the precise correlation between words and pixels. In this work, a simple but effective method is proposed to utilize the attention mechanism in the denoising network of text-to-image diffusion models. Without re-training nor inference-time optimization, the semantic grounding of phrases can be attained directly. We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under weakly-supervised semantic segmentation setting and our method achieves superior performance to prior methods. In addition, the acquired word-pixel correlation is found to be generalizable for the learned text embedding of customized generation methods, requiring only a few modifications. To validate our discovery, we introduce a new practical task called "personalized referring image segmentation" with a new dataset. Experiments in various situations demonstrate the advantages of our method compared to strong baselines on this task. In summary, our work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation.
    摘要 扩散模型近来为文本到图像生成领域带来了变革。其融合文本与图像信息的独特方式,使其具备了生成与文本高度相关图像的出色能力。从另一个角度看,这类生成模型也蕴含着关于词语与像素之间精确对应关系的线索。在本工作中,我们提出了一种简单而有效的方法,利用文本到图像扩散模型中去噪网络的注意力机制,在无需重新训练、也无需推理时优化的情况下,直接实现短语的语义定位。我们在弱监督语义分割设定下,于Pascal VOC 2012和Microsoft COCO 2014上评估了该方法,其性能优于已有方法。此外,我们发现所获得的词-像素对应关系可以推广到定制化生成方法所学习的文本嵌入,仅需少量修改。为验证这一发现,我们引入了一个新的实用任务"个性化指代图像分割",并构建了一个新数据集。在多种情形下的实验表明,我们的方法相比强基线具有优势。总之,我们的工作揭示了一种挖掘扩散模型中蕴含的丰富多模态知识并用于分割的新途径。
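The grounding signal comes from the cross-attention maps of the denoising U-Net. Assuming the maps have already been hooked out of the model (the hooking itself is framework-specific and omitted), turning them into a coarse mask for one prompt token can be as simple as the sketch below; the shapes and threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def token_attention_mask(cross_attn_maps, token_idx, out_size=(512, 512), thresh=0.5):
    """cross_attn_maps: list of tensors shaped (heads, H_l*W_l, num_tokens),
    one per U-Net cross-attention layer (collected during a denoising pass).
    Returns a binary mask for the prompt token at `token_idx`.
    """
    acc = 0.0
    for attn in cross_attn_maps:
        heads, hw, _ = attn.shape
        side = int(hw ** 0.5)                              # assume a square latent grid
        m = attn.mean(dim=0)[:, token_idx].reshape(1, 1, side, side)
        acc = acc + F.interpolate(m, size=out_size, mode="bilinear", align_corners=False)
    acc = acc / len(cross_attn_maps)
    acc = (acc - acc.min()) / (acc.max() - acc.min() + 1e-8)  # normalize to [0, 1]
    return (acc.squeeze() > thresh).float()

# Toy usage with two fake attention layers and 77 text tokens.
maps = [torch.rand(8, 64 * 64, 77), torch.rand(8, 32 * 32, 77)]
mask = token_attention_mask(maps, token_idx=5)
print(mask.shape, mask.mean().item())
```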

Toward Sufficient Spatial-Frequency Interaction for Gradient-aware Underwater Image Enhancement

  • paper_url: http://arxiv.org/abs/2309.04089
  • repo_url: https://github.com/zhihefang/SFGNet
  • paper_authors: Chen Zhao, Weiling Cai, Chenyu Dong, Ziqi Zeng
  • for: 提高水下图像质量
  • methods: 基于空间频率相互作用和梯度地图的SFGNet框架
  • results: 实验结果表明,我们的方法可以成功提高水下图像质量,并与其他方法匹配或超越其视觉质量改进。
    Abstract Underwater images suffer from complex and diverse degradation, which inevitably affects the performance of underwater visual tasks. However, most existing learning-based Underwater image enhancement (UIE) methods mainly restore such degradations in the spatial domain, and rarely pay attention to the fourier frequency information. In this paper, we develop a novel UIE framework based on spatial-frequency interaction and gradient maps, namely SFGNet, which consists of two stages. Specifically, in the first stage, we propose a dense spatial-frequency fusion network (DSFFNet), mainly including our designed dense fourier fusion block and dense spatial fusion block, achieving sufficient spatial-frequency interaction by cross connections between these two blocks. In the second stage, we propose a gradient-aware corrector (GAC) to further enhance perceptual details and geometric structures of images by gradient map. Experimental results on two real-world underwater image datasets show that our approach can successfully enhance underwater images, and achieves competitive performance in visual quality improvement.
    摘要 水下图像受到复杂和多样化的干扰,这会不可避免地影响水下视觉任务的性能。然而,大多数现有的学习基于水下图像改善(UIE)方法主要是在空间频谱领域进行修复,rarely 充分利用了干扰的频率信息。在这篇论文中,我们开发了一种新的UIE框架,即SFGNet,它包括两个阶段。具体来说,在第一阶段,我们提出了一个密集的空间频谱融合网络(DSFFNet),包括我们设计的密集傅立叶融合块和密集空间融合块,通过跨连接这两个块实现了足够的空间频谱交互。在第二阶段,我们提出了一个梯度感知corrector(GAC),用于进一步增强图像的感知细节和几何结构,通过梯度地图。实验结果表明,我们的方法可以成功地改善水下图像,并在视觉质量改进方面实现了竞争性能。
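The spatial-frequency interaction idea, processing a feature map both with ordinary convolutions and in the Fourier domain before mixing the two paths, can be sketched as a small PyTorch block. This is a generic illustration of such a block, not the exact DSFFNet design.

```python
import torch
import torch.nn as nn

class SpatialFrequencyBlock(nn.Module):
    """Parallel spatial and Fourier-amplitude branches fused with a 1x1 conv."""

    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.freq = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        s = self.spatial(x)
        # Frequency branch: filter the amplitude spectrum, keep the phase.
        spec = torch.fft.rfft2(x, norm="ortho")
        amp, phase = spec.abs(), spec.angle()
        amp = self.freq(amp)
        f = torch.fft.irfft2(torch.polar(amp, phase), s=x.shape[-2:], norm="ortho")
        # Fuse both branches with a residual connection back to the input.
        return x + self.fuse(torch.cat([s, f], dim=1))

block = SpatialFrequencyBlock(32)
print(block(torch.randn(2, 32, 64, 64)).shape)  # torch.Size([2, 32, 64, 64])
```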

Towards Efficient SDRTV-to-HDRTV by Learning from Image Formation

  • paper_url: http://arxiv.org/abs/2309.04084
  • repo_url: https://github.com/xiaom233/hdrtvnet-plus
  • paper_authors: Xiangyu Chen, Zheyuan Li, Zhengwen Zhang, Jimmy S. Ren, Yihao Liu, Jingwen He, Yu Qiao, Jiantao Zhou, Chao Dong
  • for: 本研究目的是将SDRTV内容转换为HDRTV标准,以提高视觉效果。
  • methods: 本文提出了一种三步解决方案,包括自适应全色映射、本地增强和高点级别。全色映射阶段使用全图统计作为引导,进行图像适应色映射。本地增强网络用于增强本地细节。最后,我们将两个子网络组合成一个生成器,通过GAN共同训练来保证高点级别。
  • results: 我们的方法可以准确地将SDRTV内容转换为HDRTV标准,并且可以保持高品质和精细的视觉效果。我们的方法主要针对4K分辨率图像,是轻量级的和高效的。我们还构建了一个名为HDRTV1K的数据集,包含1235个和117个训练图像和测试图像,均为4K分辨率。此外,我们选择了五个度量来评估SDRTV-to-HDRTV算法的结果。最终结果表明我们的方法在量化和视觉上具有国际前沿水平。代码、模型和数据集可以在https://github.com/xiaom233/HDRTVNet-plus上获取。
    Abstract Modern displays are capable of rendering video content with high dynamic range (HDR) and wide color gamut (WCG). However, the majority of available resources are still in standard dynamic range (SDR). As a result, there is significant value in transforming existing SDR content into the HDRTV standard. In this paper, we define and analyze the SDRTV-to-HDRTV task by modeling the formation of SDRTV/HDRTV content. Our analysis and observations indicate that a naive end-to-end supervised training pipeline suffers from severe gamut transition errors. To address this issue, we propose a novel three-step solution pipeline called HDRTVNet++, which includes adaptive global color mapping, local enhancement, and highlight refinement. The adaptive global color mapping step uses global statistics as guidance to perform image-adaptive color mapping. A local enhancement network is then deployed to enhance local details. Finally, we combine the two sub-networks above as a generator and achieve highlight consistency through GAN-based joint training. Our method is primarily designed for ultra-high-definition TV content and is therefore effective and lightweight for processing 4K resolution images. We also construct a dataset using HDR videos in the HDR10 standard, named HDRTV1K that contains 1235 and 117 training images and 117 testing images, all in 4K resolution. Besides, we select five metrics to evaluate the results of SDRTV-to-HDRTV algorithms. Our final results demonstrate state-of-the-art performance both quantitatively and visually. The code, model and dataset are available at https://github.com/xiaom233/HDRTVNet-plus.
    摘要 现代显示设备能够渲染具有高动态范围(HDR)和宽色域(WCG)的视频内容,然而现有的大部分资源仍是标准动态范围(SDR)。因此,将已有的SDR内容转换到HDRTV标准具有重要价值。在这篇论文中,我们通过建模SDRTV/HDRTV内容的成像过程,对SDRTV到HDRTV的转换任务进行了定义与分析。我们的分析和观察表明,朴素的端到端有监督训练流程会产生严重的色域过渡错误。为解决这一问题,我们提出了一个新的三步解决方案HDRTVNet++,包括自适应全局颜色映射、局部增强和高光细化。自适应全局颜色映射步骤以全局统计量为引导,执行图像自适应的颜色映射;随后部署局部增强网络来增强局部细节;最后,我们将上述两个子网络组合为一个生成器,通过基于GAN的联合训练实现高光一致性。我们的方法主要面向超高清电视内容,因此在处理4K分辨率图像时高效且轻量。我们还利用HDR10标准的HDR视频构建了一个名为HDRTV1K的数据集,包含1235张训练图像和117张测试图像,均为4K分辨率。此外,我们选取了五个指标来评估SDRTV到HDRTV算法的结果。最终结果表明,我们的方法在定量和视觉效果上都达到了最先进水平。代码、模型和数据集见 https://github.com/xiaom233/HDRTVNet-plus。

UER: A Heuristic Bias Addressing Approach for Online Continual Learning

  • paper_url: http://arxiv.org/abs/2309.04081
  • repo_url: https://github.com/FelixHuiweiLin/UER
  • paper_authors: Huiwei Lin, Shanshan Feng, Baoquan Zhang, Hongliang Qiao, Xutao Li, Yunming Ye
  • for: 这篇论文主要针对在线连续学习中的偏见问题,即在继续训练神经网络时,由于数据流动性的限制,导致神经网络偏爱当前数据中的类别,从而导致忘记前期数据的问题。
  • methods: 这篇论文提出了一种简单而高效的方法来解决偏置问题:将点积logits分解为角度因子和范数因子,发现偏置主要出现在角度因子上,因而可以用余弦logits学习新知识;同时利用范数因子帮助保留历史知识。
  • results: 对于三个数据集,论文提出的 UER 方法可以在不同的情况下具有最高的性能,超过了多种现有方法的性能。
    Abstract Online continual learning aims to continuously train neural networks from a continuous data stream with a single pass-through data. As the most effective approach, the rehearsal-based methods replay part of previous data. Commonly used predictors in existing methods tend to generate biased dot-product logits that prefer to the classes of current data, which is known as a bias issue and a phenomenon of forgetting. Many approaches have been proposed to overcome the forgetting problem by correcting the bias; however, they still need to be improved in online fashion. In this paper, we try to address the bias issue by a more straightforward and more efficient method. By decomposing the dot-product logits into an angle factor and a norm factor, we empirically find that the bias problem mainly occurs in the angle factor, which can be used to learn novel knowledge as cosine logits. On the contrary, the norm factor abandoned by existing methods helps remember historical knowledge. Based on this observation, we intuitively propose to leverage the norm factor to balance the new and old knowledge for addressing the bias. To this end, we develop a heuristic approach called unbias experience replay (UER). UER learns current samples only by the angle factor and further replays previous samples by both the norm and angle factors. Extensive experiments on three datasets show that UER achieves superior performance over various state-of-the-art methods. The code is in https://github.com/FelixHuiweiLin/UER.
    摘要 在线持续学习旨在通过对连续数据流的单次遍历来持续训练神经网络。作为最有效的方法之一,基于重放(rehearsal)的方法会回放部分历史数据。现有方法中常用的分类器往往生成偏向当前数据类别的点积logits,这被称为偏置问题,也是遗忘现象的一种体现。许多方法通过纠正偏置来克服遗忘问题,但在线场景下仍有改进空间。在本文中,我们尝试用一种更直接、更高效的方法来解决偏置问题。通过将点积logits分解为角度因子和范数因子,我们通过实验发现偏置问题主要出现在角度因子上,而角度因子可以以余弦logits的形式用于学习新知识;相反,被现有方法舍弃的范数因子则有助于记住历史知识。基于这一观察,我们提出利用范数因子来平衡新旧知识,从而解决偏置问题。为此,我们设计了一种启发式方法,称为无偏经验回放(UER)。UER仅通过角度因子学习当前样本,并同时利用范数因子和角度因子来回放历史样本。在三个数据集上的大量实验表明,UER的性能优于多种最先进方法。代码见 https://github.com/FelixHuiweiLin/UER。
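The decomposition the abstract refers to, writing a dot-product logit as an angle term times a norm term, is easy to make explicit. The sketch below shows a classifier head exposing both factors so that, in the spirit of UER, current samples can be learned from the cosine part while replayed samples also use the norm part; the temperature and training-loop details are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecomposedClassifier(nn.Module):
    """Dot-product logit = ||w_c|| * ||f|| * cos(theta): exposes both factors."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.01)

    def forward(self, feats):
        cos = F.normalize(feats, dim=1) @ F.normalize(self.weight, dim=1).t()  # angle factor
        norm = feats.norm(dim=1, keepdim=True) * self.weight.norm(dim=1)       # norm factor
        return cos, norm, cos * norm   # cosine logits, norm factor, standard logits

# Sketch of how the two factors could be used in a continual-learning step.
head = DecomposedClassifier(feat_dim=512, num_classes=100)
feats, labels = torch.randn(8, 512), torch.randint(0, 100, (8,))
cos, norm, logits = head(feats)
loss_new = F.cross_entropy(cos / 0.1, labels)     # current samples: angle factor only
loss_replay = F.cross_entropy(logits, labels)     # replayed samples: angle and norm
print(loss_new.item(), loss_replay.item())
```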

Enhancing Hierarchical Transformers for Whole Brain Segmentation with Intracranial Measurements Integration

  • paper_url: http://arxiv.org/abs/2309.04071
  • repo_url: https://github.com/masilab/unest
  • paper_authors: Xin Yu, Yucheng Tang, Qi Yang, Ho Hin Lee, Shunxing Bao, Yuankai Huo, Bennett A. Landman
  • for: 本研究旨在提高现有的全脑分割方法,以包含内侧量测量,并提供更全面的脑结构分析。
  • methods: 本研究使用改进的层次变换器UNesT进行全脑分割,并同时分割脑部133个区域和内侧量/后腔量。为了解决数据短缺问题,模型首先在8个不同站点的4859个T1-weighted(T1w)3D图像上进行预训练,然后在Open Access Series Imaging Studies(OASIS)上进行微调。
  • results: 我们使用Dice相似度(DSC)评估方法,并显示我们的模型能够准确地估计内侧量/后腔量,同时保持132个脑区的性能在相同水平。
    Abstract Whole brain segmentation with magnetic resonance imaging (MRI) enables the non-invasive measurement of brain regions, including total intracranial volume (TICV) and posterior fossa volume (PFV). Enhancing the existing whole brain segmentation methodology to incorporate intracranial measurements offers a heightened level of comprehensiveness in the analysis of brain structures. Despite its potential, the task of generalizing deep learning techniques for intracranial measurements faces data availability constraints due to limited manually annotated atlases encompassing whole brain and TICV/PFV labels. In this paper, we enhancing the hierarchical transformer UNesT for whole brain segmentation to achieve segmenting whole brain with 133 classes and TICV/PFV simultaneously. To address the problem of data scarcity, the model is first pretrained on 4859 T1-weighted (T1w) 3D volumes sourced from 8 different sites. These volumes are processed through a multi-atlas segmentation pipeline for label generation, while TICV/PFV labels are unavailable. Subsequently, the model is finetuned with 45 T1w 3D volumes from Open Access Series Imaging Studies (OASIS) where both 133 whole brain classes and TICV/PFV labels are available. We evaluate our method with Dice similarity coefficients(DSC). We show that our model is able to conduct precise TICV/PFV estimation while maintaining the 132 brain regions performance at a comparable level. Code and trained model are available at: https://github.com/MASILab/UNesT/wholebrainSeg.
    摘要 基于磁共振成像(MRI)的全脑分割可以无创地测量脑部区域,包括颅内总体积(TICV)和后颅窝体积(PFV)。在现有全脑分割方法的基础上加入颅内测量,可以为脑结构分析提供更高的全面性。尽管前景可观,但将深度学习技术推广到颅内测量面临数据可用性的限制,因为同时包含全脑与TICV/PFV标注的人工标注图谱十分有限。在本文中,我们改进了用于全脑分割的层次Transformer UNesT,使其能够同时分割133个全脑类别以及TICV/PFV。为了解决数据稀缺问题,模型首先在来自8个不同站点的4859例T1加权(T1w)三维图像上进行预训练;这些图像通过多图谱分割流程生成标签,此时尚无TICV/PFV标签。随后,模型在Open Access Series of Imaging Studies(OASIS)的45例T1w三维图像上进行微调,这些图像同时具有133个全脑类别和TICV/PFV标签。我们使用Dice相似系数(DSC)进行评估,结果表明我们的模型能够准确估计TICV/PFV,同时将132个脑区的分割性能保持在相当的水平。代码和训练好的模型见 https://github.com/MASILab/UNesT/wholebrainSeg。

INSURE: An Information Theory Inspired Disentanglement and Purification Model for Domain Generalization

  • paper_url: http://arxiv.org/abs/2309.04063
  • repo_url: None
  • paper_authors: Xi Yu, Huan-Hsin Tseng, Shinjae Yoo, Haibin Ling, Yuewei Lin
  • for: 本文旨在提出一种基于信息理论的分解和纯化模型(INSURE),以便在未见目标领域中学习泛化模型。
  • methods: 本文使用了一种信息理论启发的损失函数,以确保分解的特征包含足够的类标签信息和另一个分解的卫星特征包含足够的领域信息。此外,本文还使用了一种对照纯化损失函数,使卫星特征抛弃所有类相关信息,使得类相关特征包含足够和必要的类标签信息。而不是使用多个Encoder,本文使用了一个学习的二进制masque作为分解器,以便更加有效地进行分解。
  • results: 对四个广泛使用的预测数据集(PACS、OfficeHome、TerraIncognita和DomainNet)进行了广泛的实验,并证明了提出的INSURE方法可以超越当前的状态艺。此外,本文还证明了领域特定的类相关特征对预测数据集的泛化有益。
    Abstract Domain Generalization (DG) aims to learn a generalizable model on the unseen target domain by only training on the multiple observed source domains. Although a variety of DG methods have focused on extracting domain-invariant features, the domain-specific class-relevant features have attracted attention and been argued to benefit generalization to the unseen target domain. To take into account the class-relevant domain-specific information, in this paper we propose an Information theory iNspired diSentanglement and pURification modEl (INSURE) to explicitly disentangle the latent features to obtain sufficient and compact (necessary) class-relevant feature for generalization to the unseen domain. Specifically, we first propose an information theory inspired loss function to ensure the disentangled class-relevant features contain sufficient class label information and the other disentangled auxiliary feature has sufficient domain information. We further propose a paired purification loss function to let the auxiliary feature discard all the class-relevant information and thus the class-relevant feature will contain sufficient and compact (necessary) class-relevant information. Moreover, instead of using multiple encoders, we propose to use a learnable binary mask as our disentangler to make the disentanglement more efficient and make the disentangled features complementary to each other. We conduct extensive experiments on four widely used DG benchmark datasets including PACS, OfficeHome, TerraIncognita, and DomainNet. The proposed INSURE outperforms the state-of-art methods. We also empirically show that domain-specific class-relevant features are beneficial for domain generalization.
    摘要 领域泛化(DG)旨在仅利用多个已观测的源领域进行训练,学习能够在未见目标领域上泛化的模型。尽管许多DG方法专注于提取领域不变特征,但领域特定的类相关特征也受到了关注,并被认为有利于向未见目标领域泛化。为了兼顾类相关的领域特定信息,本文提出了一种信息论启发的解耦与纯化模型(INSURE),显式地对潜在特征进行解耦,以获得充分且紧凑(必要)的类相关特征,用于向未见领域泛化。具体而言,我们首先提出一种信息论启发的损失函数,确保解耦出的类相关特征包含足够的类标签信息,而另一个解耦出的辅助特征包含足够的领域信息。我们进一步提出一种成对纯化损失函数,使辅助特征舍弃所有类相关信息,从而使类相关特征包含充分且紧凑(必要)的类相关信息。此外,我们不使用多个编码器,而是采用一个可学习的二值掩码作为解耦器,使解耦更加高效,并使解耦出的特征彼此互补。我们在PACS、OfficeHome、TerraIncognita和DomainNet四个广泛使用的DG基准数据集上进行了大量实验,所提出的INSURE超越了当前最先进方法。我们还通过实验表明,领域特定的类相关特征有利于领域泛化。