cs.CV - 2023-10-20

A Dual-Stream Neural Network Explains the Functional Segregation of Dorsal and Ventral Visual Pathways in Human Brains

  • paper_url: http://arxiv.org/abs/2310.13849
  • repo_url: https://github.com/minkyu-choi04/dualstreambrains
  • paper_authors: Minkyu Choi, Kuan Han, Xiaokai Wang, Yizhen Zhang, Zhongming Liu
  • for: Mimicking the two parallel pathways of the human visual system used for spatial processing and object recognition.
  • methods: A dual-stream model with two branches of convolutional neural networks (CNN) that mimic the dorsal and ventral cortical pathways of the human brain.
  • results: Compared with human brains processing the same movie, the two branches (WhereCNN and WhatCNN) differentially match the dorsal and ventral pathways, driven primarily by their different learning objectives in visual attention and object recognition rather than by specific input bias or selectivity.
    Abstract The human visual system uses two parallel pathways for spatial processing and object recognition. In contrast, computer vision systems tend to use a single feedforward pathway, rendering them less robust, adaptive, or efficient than human vision. To bridge this gap, we developed a dual-stream vision model inspired by the human eyes and brain. At the input level, the model samples two complementary visual patterns to mimic how the human eyes use magnocellular and parvocellular retinal ganglion cells to separate retinal inputs to the brain. At the backend, the model processes the separate input patterns through two branches of convolutional neural networks (CNN) to mimic how the human brain uses the dorsal and ventral cortical pathways for parallel visual processing. The first branch (WhereCNN) samples a global view to learn spatial attention and control eye movements. The second branch (WhatCNN) samples a local view to represent the object around the fixation. Over time, the two branches interact recurrently to build a scene representation from moving fixations. We compared this model with the human brains processing the same movie and evaluated their functional alignment by linear transformation. The WhereCNN and WhatCNN branches were found to differentially match the dorsal and ventral pathways of the visual cortex, respectively, primarily due to their different learning objectives. These model-based results lead us to speculate that the distinct responses and representations of the ventral and dorsal streams are more influenced by their distinct goals in visual attention and object recognition than by their specific bias or selectivity in retinal inputs. This dual-stream model takes a further step in brain-inspired computer vision, enabling parallel neural networks to actively explore and understand the visual surroundings.
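To make the two-branch design concrete, the sketch below wires a global "where" stream that predicts a fixation and a local "what" stream that classifies a crop around it. This is a minimal PyTorch illustration under our own assumptions (layer sizes, soft-argmax fixation, crop size); it is not the authors' released model, and the recurrent interaction between branches is omitted.

```python
# Minimal, illustrative sketch of a dual-stream model (not the authors' implementation).
# Assumptions: input is a 3x128x128 frame; the "where" branch sees the global view and
# predicts a fixation, the "what" branch classifies a crop sampled around that fixation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WhereCNN(nn.Module):  # global view -> spatial attention map and fixation
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.saliency = nn.Conv2d(32, 1, 1)  # 1-channel attention map

    def forward(self, frame):
        a = self.saliency(self.features(frame))            # B x 1 x h x w
        attn = F.softmax(a.flatten(1), dim=1).view_as(a)   # normalized attention
        B, _, h, w = attn.shape
        ys = torch.linspace(-1, 1, h, device=frame.device).view(1, 1, h, 1)
        xs = torch.linspace(-1, 1, w, device=frame.device).view(1, 1, 1, w)
        # soft-argmax fixation in [-1, 1] image coordinates
        fix = torch.stack([(attn * xs).sum(dim=(1, 2, 3)),
                           (attn * ys).sum(dim=(1, 2, 3))], dim=-1)  # B x 2
        return attn, fix

class WhatCNN(nn.Module):  # local crop around the fixation -> object representation
    def __init__(self, n_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(32 * 16, n_classes),
        )

    def forward(self, frame, fix, crop=32):
        # build a crop-sized sampling grid centred on the fixation (affine grid trick)
        B = frame.shape[0]
        scale = crop / frame.shape[-1]
        theta = torch.zeros(B, 2, 3, device=frame.device)
        theta[:, 0, 0] = scale
        theta[:, 1, 1] = scale
        theta[:, :, 2] = fix
        grid = F.affine_grid(theta, (B, 3, crop, crop), align_corners=False)
        patch = F.grid_sample(frame, grid, align_corners=False)
        return self.backbone(patch)

frame = torch.randn(2, 3, 128, 128)
where, what = WhereCNN(), WhatCNN()
attn, fix = where(frame)
logits = what(frame, fix)
print(attn.shape, fix.shape, logits.shape)
```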

Normalizing flow-based deep variational Bayesian network for seismic multi-hazards and impacts estimation from InSAR imagery

  • paper_url: http://arxiv.org/abs/2310.13805
  • repo_url: None
  • paper_authors: Xuechun Li, Paula M. Burgi, Wei Ma, Hae Young Noh, David J. Wald, Susu Xu
  • for: Providing accurate onsite estimates of seismic multi-hazards and impacts for rapid and effective post-disaster response.
  • methods: Uses Interferometric Synthetic Aperture Radar (InSAR) data and proposes a novel stochastic variational inference with normalizing flows to jointly estimate multiple unobserved hazards and impacts.
  • results: The method reduces estimation errors caused by noisy and complex signals and provides accurate onsite hazard estimates.
    Abstract Onsite disasters like earthquakes can trigger cascading hazards and impacts, such as landslides and infrastructure damage, leading to catastrophic losses; thus, rapid and accurate estimates are crucial for timely and effective post-disaster responses. Interferometric Synthetic aperture radar (InSAR) data is important in providing high-resolution onsite information for rapid hazard estimation. Most recent methods using InSAR imagery signals predict a single type of hazard and thus often suffer low accuracy due to noisy and complex signals induced by co-located hazards, impacts, and irrelevant environmental changes (e.g., vegetation changes, human activities). We introduce a novel stochastic variational inference with normalizing flows derived to jointly approximate posteriors of multiple unobserved hazards and impacts from noisy InSAR imagery.
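As a toy illustration of the core idea (a normalizing-flow variational posterior trained by maximizing the ELBO), the sketch below pushes a diagonal-Gaussian base through affine coupling layers against a stand-in target density. The dimensions, the coupling design, and the target are all assumptions; the paper's actual likelihood over InSAR imagery and hazard variables is much richer.

```python
# Toy sketch of stochastic variational inference with a normalizing-flow posterior
# (illustrative only, not the paper's model): a diagonal-Gaussian base distribution is
# pushed through affine coupling layers and the ELBO is maximized by reparameterization.
import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, z):
        z1, z2 = z.chunk(2, dim=-1)
        s, t = self.net(z1).chunk(2, dim=-1)
        z2 = z2 * torch.exp(s) + t                      # transform half conditioned on the other
        return torch.cat([z2, z1], dim=-1), s.sum(-1)   # swap halves; log|det J| = sum(s)

dim = 4                                                 # stand-in for hazard/impact latents
flows = nn.ModuleList(AffineCoupling(dim) for _ in range(4))
mu = nn.Parameter(torch.zeros(dim))
log_sigma = nn.Parameter(torch.zeros(dim))

def log_target(z):                                      # toy unnormalized log-posterior
    return -0.5 * ((z - 1.5) ** 2).sum(-1)

opt = torch.optim.Adam([mu, log_sigma, *flows.parameters()], lr=1e-2)
for step in range(500):
    eps = torch.randn(256, dim)
    z = mu + eps * log_sigma.exp()                      # reparameterized base sample
    log_q = (-0.5 * eps.pow(2) - log_sigma).sum(-1) - 0.5 * dim * math.log(2 * math.pi)
    for f in flows:
        z, logdet = f(z)
        log_q = log_q - logdet                          # change of variables
    loss = (log_q - log_target(z)).mean()               # negative ELBO (up to a constant)
    opt.zero_grad(); loss.backward(); opt.step()
print("final negative ELBO estimate:", loss.item())
```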

Data-Free Knowledge Distillation Using Adversarially Perturbed OpenGL Shader Images

  • paper_url: http://arxiv.org/abs/2310.13782
  • repo_url: None
  • paper_authors: Logan Frank, Jim Davis
  • for: Knowledge distillation (KD) in the absence of the original training data, known as "data-free" KD.
  • methods: Trains a student network on unnatural OpenGL shader images combined with large amounts of data augmentation and adversarial attacks.
  • results: Achieves state-of-the-art results on a variety of datasets/networks and is more stable than existing generator-based data-free KD methods.
    Abstract Knowledge distillation (KD) has been a popular and effective method for model compression. One important assumption of KD is that the original training dataset is always available. However, this is not always the case due to privacy concerns and more. In recent years, "data-free" KD has emerged as a growing research topic which focuses on the scenario of performing KD when no data is provided. Many methods rely on a generator network to synthesize examples for distillation (which can be difficult to train) and can frequently produce images that are visually similar to the original dataset, which raises questions surrounding whether privacy is completely preserved. In this work, we propose a new approach to data-free KD that utilizes unnatural OpenGL images, combined with large amounts of data augmentation and adversarial attacks, to train a student network. We demonstrate that our approach achieves state-of-the-art results for a variety of datasets/networks and is more stable than existing generator-based data-free KD methods. Source code will be available in the future.
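The following sketch shows the general shape of distillation without the original data: heavily augmented synthetic inputs plus an adversarial perturbation step, with a temperature-scaled KL distillation loss. The placeholder tensors stand in for OpenGL-shader renders, and the models, step size, and FGSM-style attack are our assumptions rather than the paper's exact recipe.

```python
# Illustrative sketch of distillation without the original data (not the authors' code).
import torch
import torch.nn.functional as F
import torchvision.transforms as T
from torchvision.models import resnet34, resnet18

teacher = resnet34(num_classes=10).eval()          # a pretrained teacher would be loaded here
student = resnet18(num_classes=10)
augment = T.Compose([T.RandomResizedCrop(224, scale=(0.3, 1.0)),
                     T.RandomHorizontalFlip(),
                     T.ColorJitter(0.4, 0.4, 0.4)])
opt = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)

def kd_loss(s_logits, t_logits, tau=4.0):
    return F.kl_div(F.log_softmax(s_logits / tau, dim=1),
                    F.softmax(t_logits / tau, dim=1),
                    reduction="batchmean") * tau * tau

synthetic_images = torch.rand(32, 3, 256, 256)     # placeholder for shader-generated images
for _ in range(10):
    x = augment(synthetic_images)
    # adversarial step: perturb inputs so the student diverges from the teacher's prediction
    x.requires_grad_(True)
    loss_adv = kd_loss(student(x), teacher(x).detach())
    grad, = torch.autograd.grad(loss_adv, x)
    x_adv = (x + 2 / 255 * grad.sign()).clamp(0, 1).detach()
    # distillation step on both clean-augmented and perturbed images
    loss = kd_loss(student(x), teacher(x).detach()) + \
           kd_loss(student(x_adv), teacher(x_adv).detach())
    opt.zero_grad(); loss.backward(); opt.step()
```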

TexFusion: Synthesizing 3D Textures with Text-Guided Image Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.13772
  • repo_url: None
  • paper_authors: Tianshi Cao, Karsten Kreis, Sanja Fidler, Nicholas Sharp, Kangxue Yin
  • for: Proposing a new method for synthesizing textures for a given 3D geometry.
  • methods: Uses large-scale text-guided image diffusion models; unlike recent works that distill 3D objects through a slow and fragile optimization with 2D text-to-image diffusion models, it applies regular diffusion sampling on different 2D rendered views and aggregates the denoising predictions on a shared latent texture map.
  • results: TexFusion efficiently generates diverse, high-quality, and globally coherent textures.
    Abstract We present TexFusion (Texture Diffusion), a new method to synthesize textures for given 3D geometries, using large-scale text-guided image diffusion models. In contrast to recent works that leverage 2D text-to-image diffusion models to distill 3D objects using a slow and fragile optimization process, TexFusion introduces a new 3D-consistent generation technique specifically designed for texture synthesis that employs regular diffusion model sampling on different 2D rendered views. Specifically, we leverage latent diffusion models, apply the diffusion model's denoiser on a set of 2D renders of the 3D object, and aggregate the different denoising predictions on a shared latent texture map. Final output RGB textures are produced by optimizing an intermediate neural color field on the decodings of 2D renders of the latent texture. We thoroughly validate TexFusion and show that we can efficiently generate diverse, high quality and globally coherent textures. We achieve state-of-the-art text-guided texture synthesis performance using only image diffusion models, while avoiding the pitfalls of previous distillation-based methods. The text-conditioning offers detailed control and we also do not rely on any ground truth 3D textures for training. This makes our method versatile and applicable to a broad range of geometry and texture types. We hope that TexFusion will advance AI-based texturing of 3D assets for applications in virtual reality, game design, simulation, and more.

PACE: Human and Camera Motion Estimation from in-the-wild Videos

  • paper_url: http://arxiv.org/abs/2310.13768
  • repo_url: None
  • paper_authors: Muhammed Kocabas, Ye Yuan, Pavlo Molchanov, Yunrong Guo, Michael J. Black, Otmar Hilliges, Jan Kautz, Umar Iqbal
  • for: Estimating global human and camera motion from in-the-wild videos captured by moving cameras.
  • methods: A joint optimization framework that disentangles human and camera motions using foreground human motion priors and background scene features; unlike existing methods that use SLAM only as initialization, SLAM and human motion priors are tightly integrated in a bundle-adjustment-inspired optimization.
  • results: Substantially outperforms prior art in recovering both human and camera motions; a motion prior suitable for batch optimization makes the approach significantly more efficient, and a new synthetic dataset enables evaluating camera motion in addition to human motion.
    Abstract We present a method to estimate human motion in a global scene from moving cameras. This is a highly challenging task due to the coupling of human and camera motions in the video. To address this problem, we propose a joint optimization framework that disentangles human and camera motions using both foreground human motion priors and background scene features. Unlike existing methods that use SLAM as initialization, we propose to tightly integrate SLAM and human motion priors in an optimization that is inspired by bundle adjustment. Specifically, we optimize human and camera motions to match both the observed human pose and scene features. This design combines the strengths of SLAM and motion priors, which leads to significant improvements in human and camera motion estimation. We additionally introduce a motion prior that is suitable for batch optimization, making our approach significantly more efficient than existing approaches. Finally, we propose a novel synthetic dataset that enables evaluating camera motion in addition to human motion from dynamic videos. Experiments on the synthetic and real-world RICH datasets demonstrate that our approach substantially outperforms prior art in recovering both human and camera motions.
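A toy version of the joint, bundle-adjustment-inspired objective is sketched below: camera translations and a human root trajectory are optimized together against 2D pose observations, background scene landmarks, and a smoothness term standing in for the motion prior. Rotations, the real motion prior, and SLAM features are omitted; the variables, residuals, and weights are illustrative assumptions, not the PACE formulation.

```python
# Toy sketch of jointly optimizing human and camera motion against two residuals,
# in the spirit of bundle adjustment (camera rotation omitted for brevity).
import torch

T, f = 50, 500.0                                    # frames, focal length (pixels)
obs_2d = torch.randn(T, 2) * 20                     # placeholder 2D human observations
scene_pts = torch.randn(8, 3) + torch.tensor([0., 0., 5.])  # static background landmarks
scene_obs = torch.randn(T, 8, 2) * 20               # their observed 2D projections

cam_t = torch.zeros(T, 3, requires_grad=True)       # camera translation per frame
human = torch.zeros(T, 3, requires_grad=True)       # human root position per frame
human.data[:, 2] = 5.0                              # start in front of the camera

def project(p_cam):                                 # pinhole projection
    return f * p_cam[..., :2] / p_cam[..., 2:3].clamp(min=1e-3)

opt = torch.optim.Adam([cam_t, human], lr=0.05)
for it in range(300):
    human_cam = human - cam_t                       # human in camera coordinates
    scene_cam = scene_pts[None] - cam_t[:, None]    # landmarks in camera coordinates
    loss_pose = (project(human_cam) - obs_2d).pow(2).mean()
    loss_scene = (project(scene_cam) - scene_obs).pow(2).mean()
    loss_smooth = (human[1:] - human[:-1]).pow(2).mean()   # stand-in for a motion prior
    loss = loss_pose + loss_scene + 10.0 * loss_smooth
    opt.zero_grad(); loss.backward(); opt.step()
print("final loss:", loss.item())
```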

U-BEV: Height-aware Bird’s-Eye-View Segmentation and Neural Map-based Relocalization

  • paper_url: http://arxiv.org/abs/2310.13766
  • repo_url: None
  • paper_authors: Andrea Boscolo Camiletto, Alfredo Bochicchio, Alexander Liniger, Dengxin Dai, Abel Gawel
  • for: Improving relocalization of intelligent vehicles when GPS reception is insufficient or sensor-based localization fails.
  • methods: U-BEV, a U-Net inspired Bird's-Eye-View (BEV) segmentation architecture that lets the BEV reason about the scene on multiple height layers before flattening the BEV features, boosting performance by up to 4.11 IoU.
  • results: Combining the encoded neural BEV with a differentiable template matcher yields accurate relocalization on neural SD-map data, outperforming transformer-based BEV methods of similar computational complexity by 1.7 to 2.8 mIoU and BEV-based relocalization by over 26% Recall Accuracy on the nuScenes dataset.
    Abstract Efficient relocalization is essential for intelligent vehicles when GPS reception is insufficient or sensor-based localization fails. Recent advances in Bird's-Eye-View (BEV) segmentation allow for accurate estimation of local scene appearance and in turn, can benefit the relocalization of the vehicle. However, one downside of BEV methods is the heavy computation required to leverage the geometric constraints. This paper presents U-BEV, a U-Net inspired architecture that extends the current state-of-the-art by allowing the BEV to reason about the scene on multiple height layers before flattening the BEV features. We show that this extension boosts the performance of the U-BEV by up to 4.11 IoU. Additionally, we combine the encoded neural BEV with a differentiable template matcher to perform relocalization on neural SD-map data. The model is fully end-to-end trainable and outperforms transformer-based BEV methods of similar computational complexity by 1.7 to 2.8 mIoU and BEV-based relocalization by over 26% Recall Accuracy on the nuScenes dataset.
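The sketch below illustrates the idea of reasoning on multiple height layers before flattening to a single BEV plane: one lightweight head per height slab, then a learned collapse over the height axis. The channel counts, number of layers, and collapse scheme are assumptions, not the U-BEV architecture.

```python
# Minimal sketch of per-height-layer reasoning followed by flattening to one BEV plane.
import torch
import torch.nn as nn

class MultiHeightBEV(nn.Module):
    def __init__(self, in_ch=64, n_heights=4, out_ch=32):
        super().__init__()
        # one lightweight head per height slab (e.g. road level, curb level, ...)
        self.per_height = nn.ModuleList(
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())
            for _ in range(n_heights))
        # learned weights to collapse the height axis into a single BEV plane
        self.collapse = nn.Conv2d(n_heights * out_ch, out_ch, 1)

    def forward(self, voxel_feats):
        # voxel_feats: B x H_layers x C x X x Y (features already lifted into 3D)
        per_layer = [head(voxel_feats[:, i]) for i, head in enumerate(self.per_height)]
        stacked = torch.cat(per_layer, dim=1)        # B x (H_layers * out_ch) x X x Y
        return self.collapse(stacked)                # flattened BEV features

bev = MultiHeightBEV()
out = bev(torch.randn(2, 4, 64, 100, 100))
print(out.shape)  # torch.Size([2, 32, 100, 100])
```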

Evaluating sleep-stage classification: how age and early-late sleep affects classification performance

  • paper_url: http://arxiv.org/abs/2310.13754
  • repo_url: None
  • paper_authors: Eugenia Moris, Ignacio Larrabide
  • for: Automatic sleep-stage classification.
  • methods: Wavelets for feature extraction, Random Forest for classification.
  • results: Classifier performance is affected by the age of the subjects and the moment of sleep (early-night vs. late-night), improving the classification of some sleep stages and worsening others.
    Abstract Sleep stage classification is a common method used by experts to monitor the quantity and quality of sleep in humans, but it is a time-consuming and labour-intensive task with high inter- and intra-observer variability. Using Wavelets for feature extraction and Random Forest for classification, an automatic sleep-stage classification method was sought and assessed. The age of the subjects, as well as the moment of sleep (early-night and late-night), were confronted to the performance of the classifier. From this study, we observed that these variables do affect the automatic model performance, improving the classification of some sleep stages and worsening others.
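A minimal version of the wavelet-feature plus Random Forest pipeline might look like the following; the wavelet family, decomposition level, epoch length, and the synthetic EEG data are assumptions for illustration only.

```python
# Illustrative sketch of a wavelet-feature + Random Forest sleep-staging pipeline.
import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier

def wavelet_features(epoch, wavelet="db4", level=5):
    """Summary statistics of the wavelet decomposition of one 30-s EEG epoch."""
    coeffs = pywt.wavedec(epoch, wavelet, level=level)
    feats = []
    for c in coeffs:                        # one approximation + `level` detail bands
        feats += [np.mean(np.abs(c)), np.std(c), np.sum(c ** 2)]  # energy-like features
    return np.array(feats)

# Placeholder data: 200 epochs of 30 s at 100 Hz, with 5 sleep-stage labels (W, N1-N3, REM)
rng = np.random.default_rng(0)
epochs = rng.standard_normal((200, 30 * 100))
labels = rng.integers(0, 5, size=200)

X = np.stack([wavelet_features(e) for e in epochs])
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X[:150], labels[:150])
print("held-out accuracy:", clf.score(X[150:], labels[150:]))
```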

Reference-based Restoration of Digitized Analog Videotapes

  • paper_url: http://arxiv.org/abs/2310.14926
  • repo_url: https://github.com/miccunifi/tape
  • paper_authors: Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini, Alberto Del Bimbo
  • for: Restoration of digitized analog videotapes.
  • methods: CLIP-based zero-shot artifact detection to select clean reference frames, and a Swin-UNet network with Multi-Reference Spatial Feature Fusion (MRSFF) blocks.
  • results: Effectively restores degraded analog videotapes and outperforms other state-of-the-art methods.
    Abstract Analog magnetic tapes have been the main video data storage device for several decades. Videos stored on analog videotapes exhibit unique degradation patterns caused by tape aging and reader device malfunctioning that are different from those observed in film and digital video restoration tasks. In this work, we present a reference-based approach for the resToration of digitized Analog videotaPEs (TAPE). We leverage CLIP for zero-shot artifact detection to identify the cleanest frames of each video through textual prompts describing different artifacts. Then, we select the clean frames most similar to the input ones and employ them as references. We design a transformer-based Swin-UNet network that exploits both neighboring and reference frames via our Multi-Reference Spatial Feature Fusion (MRSFF) blocks. MRSFF blocks rely on cross-attention and attention pooling to take advantage of the most useful parts of each reference frame. To address the absence of ground truth in real-world videos, we create a synthetic dataset of videos exhibiting artifacts that closely resemble those commonly found in analog videotapes. Both quantitative and qualitative experiments show the effectiveness of our approach compared to other state-of-the-art methods. The code, the model, and the synthetic dataset are publicly available at https://github.com/miccunifi/TAPE.
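The CLIP-based zero-shot selection of clean reference frames can be sketched as ranking frames against textual prompts that describe a clean frame versus typical tape artifacts; the prompts, model checkpoint, and placeholder frames below are assumptions, not the TAPE implementation.

```python
# Sketch of zero-shot ranking of frames by CLIP, keeping the "cleanest" ones as references.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a clean frame from a video",
           "a frame with horizontal tape dropout lines",
           "a frame with color bleeding and chroma noise"]
frames = [Image.new("RGB", (224, 224), c) for c in ["gray", "red", "blue"]]  # placeholders

with torch.no_grad():
    inputs = processor(text=prompts, images=frames, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image           # (n_frames, n_prompts)
    probs = logits.softmax(dim=-1)

clean_score = probs[:, 0]                                # probability of the "clean" prompt
reference_ids = clean_score.argsort(descending=True)[:2]
print("frames chosen as references:", reference_ids.tolist())
```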

Localizing and Editing Knowledge in Text-to-Image Generative Models

  • paper_url: http://arxiv.org/abs/2310.13730
  • repo_url: None
  • paper_authors: Samyadeep Basu, Nanxuan Zhao, Vlad Morariu, Soheil Feizi, Varun Manjunatha
  • for: Studying where knowledge about distinct visual attributes is stored in text-to-image generative models.
  • methods: Adapts Causal Mediation Analysis to text-to-image diffusion models, tracing knowledge about distinct visual attributes to causal components in the UNet and the text-encoder; knowledge is found to be distributed across a set of components of the conditional UNet rather than localized in isolated ones.
  • results: The CLIP text-encoder in public text-to-image models such as Stable Diffusion contains only one causal state across visual attributes: the first self-attention layer corresponding to the last subject token of the attribute in the caption, unlike the mid-MLP causal states typical of other language models. Building on this, the paper introduces Diff-QuickFix, a fast, data-free model-editing method that can edit (ablate) concepts in under a second with a closed-form update, giving a roughly 1000x speedup over fine-tuning-based editing with comparable performance.
    Abstract Text-to-Image Diffusion Models such as Stable-Diffusion and Imagen have achieved unprecedented quality of photorealism with state-of-the-art FID scores on MS-COCO and other generation benchmarks. Given a caption, image generation requires fine-grained knowledge about attributes such as object structure, style, and viewpoint amongst others. Where does this information reside in text-to-image generative models? In our paper, we tackle this question and understand how knowledge corresponding to distinct visual attributes is stored in large-scale text-to-image diffusion models. We adapt Causal Mediation Analysis for text-to-image models and trace knowledge about distinct visual attributes to various (causal) components in the (i) UNet and (ii) text-encoder of the diffusion model. In particular, we show that unlike generative large-language models, knowledge about different attributes is not localized in isolated components, but is instead distributed amongst a set of components in the conditional UNet. These sets of components are often distinct for different visual attributes. Remarkably, we find that the CLIP text-encoder in public text-to-image models such as Stable-Diffusion contains only one causal state across different visual attributes, and this is the first self-attention layer corresponding to the last subject token of the attribute in the caption. This is in stark contrast to the causal states in other language models which are often the mid-MLP layers. Based on this observation of only one causal state in the text-encoder, we introduce a fast, data-free model editing method Diff-QuickFix which can effectively edit concepts in text-to-image models. DiffQuickFix can edit (ablate) concepts in under a second with a closed-form update, providing a significant 1000x speedup and comparable editing performance to existing fine-tuning based editing methods.
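To give a feel for what a closed-form, data-free edit of a single projection layer can look like, the sketch below remaps the key vector of a concept to a neutral target with a least-squares solve while approximately preserving the layer's behaviour on other keys. This is our own generic simplification of closed-form model editing, not the actual Diff-QuickFix update.

```python
# Hedged sketch of a closed-form edit of one linear layer y = W x (illustrative only).
import torch

torch.manual_seed(0)
d_in, d_out = 64, 64
W = torch.randn(d_out, d_in) / d_in ** 0.5           # layer to edit

k_concept = torch.randn(d_in)                         # key activation of the concept to ablate
v_target = torch.zeros(d_out)                         # desired output (e.g. a neutral value)
K_keep = torch.randn(256, d_in)                       # keys whose behaviour should be preserved
V_keep = K_keep @ W.T                                 # their current outputs

# Solve min_W' ||K W'^T - V||_F^2 over the stacked constraints (closed-form least squares).
K = torch.cat([K_keep, k_concept[None]], dim=0)       # (257, d_in)
V = torch.cat([V_keep, v_target[None]], dim=0)        # (257, d_out)
W_new = torch.linalg.lstsq(K, V).solution.T           # (d_out, d_in)

print("edited output norm:", (W_new @ k_concept).norm().item())      # driven toward 0
print("preserved drift:", (K_keep @ (W_new - W).T).norm().item())    # small residual change
```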

Using Human-like Mechanism to Weaken Effect of Pre-training Weight Bias in Face-Recognition Convolutional Neural Network

  • paper_url: http://arxiv.org/abs/2310.13674
  • repo_url: None
  • paper_authors: Haojiang Ying, Yi-Fan Li, Yiyang Chen
  • for: Explaining how face-recognition convolutional neural networks (CNNs) mirror human cognitive mechanisms and weakening the effect of pre-training weight bias.
  • methods: Four widely used CNNs (AlexNet, VGG11, VGG13, and VGG16) are trained on an emotion-valence classification task by transfer learning and compared with human data; based on neuroscience and behavioral evidence, AlexNet is then updated with a self-attention mechanism (FE-AlexNet).
  • results: FE-AlexNet outperforms all the other tested CNNs and more closely resembles human perception, further revealing the computational mechanisms of these CNNs and offering a new paradigm for understanding and improving CNN performance via human data.
    Abstract Convolutional neural network (CNN), as an important model in artificial intelligence, has been widely used and studied in different disciplines. The computational mechanisms of CNNs are still not fully revealed due to the their complex nature. In this study, we focused on 4 extensively studied CNNs (AlexNet, VGG11, VGG13, and VGG16) which has been analyzed as human-like models by neuroscientists with ample evidence. We trained these CNNs to emotion valence classification task by transfer learning. Comparing their performance with human data, the data unveiled that these CNNs would partly perform as human does. We then update the object-based AlexNet using self-attention mechanism based on neuroscience and behavioral data. The updated FE-AlexNet outperformed all the other tested CNNs and closely resembles human perception. The results further unveil the computational mechanisms of these CNNs. Moreover, this study offers a new paradigm to better understand and improve CNN performance via human data.
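One simple way to add a self-attention stage on top of AlexNet's convolutional features is sketched below; where the block is inserted, its width, and the two-class valence head are assumptions rather than the FE-AlexNet design.

```python
# Sketch of augmenting AlexNet features with a self-attention block before classification.
import torch
import torch.nn as nn
from torchvision.models import alexnet

class SelfAttnHead(nn.Module):
    def __init__(self, channels=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                       # x: B x C x H x W feature map
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # B x (H*W) x C spatial tokens
        attn_out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attn_out)   # residual + norm
        return tokens.transpose(1, 2).view(B, C, H, W)

backbone = alexnet(num_classes=2)               # e.g. binary emotion valence
model = nn.Sequential(backbone.features,        # conv features: B x 256 x 6 x 6
                      SelfAttnHead(256),
                      backbone.avgpool,
                      nn.Flatten(),
                      backbone.classifier)

logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)                             # torch.Size([2, 2])
```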

ARNIQA: Learning Distortion Manifold for Image Quality Assessment

  • paper_url: http://arxiv.org/abs/2310.14918
  • repo_url: https://github.com/miccunifi/arniqa
  • paper_authors: Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini, Alberto Del Bimbo
  • for: No-Reference Image Quality Assessment (NR-IQA): measuring image quality in alignment with human perception without a high-quality reference image.
  • methods: ARNIQA, a self-supervised approach that models the image distortion manifold: images are synthetically degraded with randomly composed sequences of distortions, and the model maximizes the similarity between representations of equally distorted patches from different images; quality scores are then obtained with a simple linear regressor.
  • results: Achieves state-of-the-art performance on several datasets and shows improved data efficiency, generalization, and robustness compared to competing methods; code and model are available on GitHub.
    Abstract No-Reference Image Quality Assessment (NR-IQA) aims to develop methods to measure image quality in alignment with human perception without the need for a high-quality reference image. In this work, we propose a self-supervised approach named ARNIQA (leArning distoRtion maNifold for Image Quality Assessment) for modeling the image distortion manifold to obtain quality representations in an intrinsic manner. First, we introduce an image degradation model that randomly composes ordered sequences of consecutively applied distortions. In this way, we can synthetically degrade images with a large variety of degradation patterns. Second, we propose to train our model by maximizing the similarity between the representations of patches of different images distorted equally, despite varying content. Therefore, images degraded in the same manner correspond to neighboring positions within the distortion manifold. Finally, we map the image representations to the quality scores with a simple linear regressor, thus without fine-tuning the encoder weights. The experiments show that our approach achieves state-of-the-art performance on several datasets. In addition, ARNIQA demonstrates improved data efficiency, generalization capabilities, and robustness compared to competing methods. The code and the model are publicly available at https://github.com/miccunifi/ARNIQA.
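The training signal (representations of different images degraded in the same way should be similar) can be sketched with a small contrastive loop as below; the encoder, the particular distortions, the temperature, and the loss form are illustrative assumptions, not the ARNIQA implementation.

```python
# Minimal sketch: pull together patches from different images degraded identically,
# push apart other pairs in the batch.
import random
import torch
import torch.nn.functional as F
import torchvision.transforms as T
from torchvision.models import resnet18

def random_degradation():
    """Randomly compose an ordered sequence of distortions."""
    ops = [T.GaussianBlur(5, sigma=(0.5, 2.0)),
           T.ColorJitter(brightness=0.5),
           T.Lambda(lambda x: (x + 0.05 * torch.randn_like(x)).clamp(0, 1))]  # noise
    random.shuffle(ops)
    return T.Compose(ops[:random.randint(1, len(ops))])

encoder = resnet18(num_classes=128)                # projection to a 128-d embedding
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

img_a = torch.rand(8, 3, 224, 224)                 # patches from different images
img_b = torch.rand(8, 3, 224, 224)
for _ in range(5):
    degrade = random_degradation()                 # same degradation applied to both sets
    za = F.normalize(encoder(degrade(img_a)), dim=1)
    zb = F.normalize(encoder(degrade(img_b)), dim=1)
    logits = za @ zb.T / 0.1                       # cosine similarities / temperature
    loss = F.cross_entropy(logits, torch.arange(len(za)))
    opt.zero_grad(); loss.backward(); opt.step()
```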

Deep-Learning-based Change Detection with Spaceborne Hyperspectral PRISMA data

  • paper_url: http://arxiv.org/abs/2310.13627
  • repo_url: None
  • paper_authors: J. F. Amieva, A. Austoni, M. A. Brovelli, L. Ansalone, P. Naylor, F. Serva, B. Le Saux
    for:This paper is written for environmental monitoring and disaster management, as well as other sectors where change detection (CD) is applied.methods:The paper uses both standard and deep-learning (DL) CD methods, as well as a pipeline starting from coregistration, followed by CD with a full-spectrum algorithm, and a DL network developed for optical data.results:The paper finds that changes in vegetation and built environments are well captured using the proposed methods, and that the spectral information is valuable for identifying subtle changes. However, the paper also notes that atmospheric effects and the lack of reliable ground truth present a major challenge to hyperspectral CD.
    Abstract Change detection (CD) methods have been applied to optical data for decades, while the use of hyperspectral data with a fine spectral resolution has been rarely explored. CD is applied in several sectors, such as environmental monitoring and disaster management. Thanks to the PRecursore IperSpettrale della Missione operativA (PRISMA), hyperspectral-from-space CD is now possible. In this work, we apply standard and deep-learning (DL) CD methods to different targets, from natural to urban areas. We propose a pipeline starting from coregistration, followed by CD with a full-spectrum algorithm and by a DL network developed for optical data. We find that changes in vegetation and built environments are well captured. The spectral information is valuable to identify subtle changes and the DL methods are less affected by noise compared to the statistical method, but atmospheric effects and the lack of reliable ground truth represent a major challenge to hyperspectral CD.

Inter-Scale Dependency Modeling for Skin Lesion Segmentation with Transformer-based Networks

  • paper_url: http://arxiv.org/abs/2310.13727
  • repo_url: None
  • paper_authors: Sania Eskandari, Janet Lumpp
  • for: Aiding skin cancer diagnosis by automatically segmenting skin lesions.
  • methods: A U-shaped hierarchical Transformer-based structure with an Inter-scale Context Fusion (ISCF) module that uses the attention correlations in each encoder stage to adaptively combine contexts across stages and reduce the semantic gaps.
  • results: Preliminary results on a skin lesion segmentation benchmark support the applicability and effectiveness of the ISCF module.
    Abstract Melanoma is a dangerous form of skin cancer caused by the abnormal growth of skin cells. Fully Convolutional Network (FCN) approaches, including the U-Net architecture, can automatically segment skin lesions to aid diagnosis. The symmetrical U-Net model has shown outstanding results, but its use of a convolutional operation limits its ability to capture long-range dependencies, which are essential for accurate medical image segmentation. In addition, the U-shaped structure suffers from the semantic gaps between the encoder and decoder. In this study, we developed and evaluated a U-shaped hierarchical Transformer-based structure for skin lesion segmentation while we proposed an Inter-scale Context Fusion (ISCF) to utilize the attention correlations in each stage of the encoder to adaptively combine the contexts coming from each stage to hinder the semantic gaps. The preliminary results of the skin lesion segmentation benchmark endorse the applicability and efficacy of the ISCF module.

What you see is what you get: Experience ranking with deep neural dataset-to-dataset similarity for topological localisation

  • paper_url: http://arxiv.org/abs/2310.13622
  • repo_url: https://github.com/mttgdd/vdna-experience-selection
  • paper_authors: Matthew Gadd, Benjamin Ramtoula, Daniele De Martini, Paul Newman
  • for: Improving the efficiency and robustness of visual localization by recalling the most relevant visual memories (past experiences) for a given live experience.
  • methods: Applies Visual DNA, a highly scalable tool for comparing image datasets, to sequences of map and live experiences in order to account for appearance changes such as weather, lighting, and season.
  • results: For deep architectures used for place recognition, distribution measures comparing neuron-wise activation statistics between live images and previously recorded experiences correlate with localization performance across seasonal (winter/summer) and time-of-day (day/night) shifts; the approach ranks actual localization performance well on the Nordland cross-season dataset and on Oxford University Parks data with lighting and mild seasonal change.
    Abstract Recalling the most relevant visual memories for localisation or understanding a priori the likely outcome of localisation effort against a particular visual memory is useful for efficient and robust visual navigation. Solutions to this problem should be divorced from performance appraisal against ground truth - as this is not available at run-time - and should ideally be based on generalisable environmental observations. For this, we propose applying the recently developed Visual DNA as a highly scalable tool for comparing datasets of images - in this work, sequences of map and live experiences. In the case of localisation, important dataset differences impacting performance are modes of appearance change, including weather, lighting, and season. Specifically, for any deep architecture which is used for place recognition by matching feature volumes at a particular layer, we use distribution measures to compare neuron-wise activation statistics between live images and multiple previously recorded past experiences, with a potentially large seasonal (winter/summer) or time of day (day/night) shift. We find that differences in these statistics correlate to performance when localising using a past experience with the same appearance gap. We validate our approach over the Nordland cross-season dataset as well as data from Oxford's University Parks with lighting and mild seasonal change, showing excellent ability of our system to rank actual localisation performance across candidate experiences.
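Comparing neuron-wise activation statistics between a live image set and candidate past experiences can be sketched as below, using per-neuron 1-D Wasserstein distances over a backbone's feature channels; the backbone, layer, and distance are our assumptions, not the Visual DNA implementation.

```python
# Sketch of dataset-to-dataset comparison via per-neuron activation distributions.
import torch
import numpy as np
from scipy.stats import wasserstein_distance
from torchvision.models import resnet18

backbone = resnet18().eval()
layer = torch.nn.Sequential(*list(backbone.children())[:-2])   # conv features: B x 512 x h x w

@torch.no_grad()
def neuron_activations(images):
    feats = layer(images)                        # B x C x h x w
    return feats.mean(dim=(2, 3)).numpy()        # per-image, per-neuron mean activation

live = torch.rand(16, 3, 224, 224)               # placeholders for live / past experiences
past_a = torch.rand(16, 3, 224, 224)
past_b = torch.rand(16, 3, 224, 224) * 0.3       # e.g. a darker (night-time) experience

def experience_distance(x, y):
    """Average 1-D Wasserstein distance between matching neuron activation distributions."""
    ax, ay = neuron_activations(x), neuron_activations(y)
    return float(np.mean([wasserstein_distance(ax[:, i], ay[:, i])
                          for i in range(ax.shape[1])]))

# rank candidate past experiences by similarity to the live one (smaller = better match)
print("live vs. past_a:", experience_distance(live, past_a))
print("live vs. past_b:", experience_distance(live, past_b))
```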

FMRT: Learning Accurate Feature Matching with Reconciliatory Transformer

  • paper_url: http://arxiv.org/abs/2310.13605
  • repo_url: None
  • paper_authors: Xinyu Zhang, Li Wang, Zhiqiang Jiang, Kun Dai, Tao Xie, Lei Yang, Wenhao Yu, Yang Shen, Jun Li
  • for: Improving local feature matching, an essential component of computer vision tasks such as structure from motion and visual localization.
  • methods: FMRT, a detector-free Transformer-based method with a dedicated Reconciliatory Transformer (RecFormer) consisting of a Global Perception Attention Layer (GPAL) to extract visual descriptors with different receptive fields, a Perception Weight Layer (PWL) to adaptively measure the importance of the different receptive fields, and a Local Perception Feed-forward Network (LPFFN) to extract deep aggregated multi-scale local features.
  • results: Extensive experiments show that FMRT yields excellent performance on multiple benchmarks, including pose estimation, visual localization, homography estimation, and image matching.
    Abstract Local Feature Matching, an essential component of several computer vision tasks (e.g., structure from motion and visual localization), has been effectively settled by Transformer-based methods. However, these methods only integrate long-range context information among keypoints with a fixed receptive field, which constrains the network from reconciling the importance of features with different receptive fields to realize complete image perception, hence limiting the matching accuracy. In addition, these methods utilize a conventional handcrafted encoding approach to integrate the positional information of keypoints into the visual descriptors, which limits the capability of the network to extract reliable positional encoding message. In this study, we propose Feature Matching with Reconciliatory Transformer (FMRT), a novel Transformer-based detector-free method that reconciles different features with multiple receptive fields adaptively and utilizes parallel networks to realize reliable positional encoding. Specifically, FMRT proposes a dedicated Reconciliatory Transformer (RecFormer) that consists of a Global Perception Attention Layer (GPAL) to extract visual descriptors with different receptive fields and integrate global context information under various scales, Perception Weight Layer (PWL) to measure the importance of various receptive fields adaptively, and Local Perception Feed-forward Network (LPFFN) to extract deep aggregated multi-scale local feature representation. Extensive experiments demonstrate that FMRT yields extraordinary performance on multiple benchmarks, including pose estimation, visual localization, homography estimation, and image matching.

Longer-range Contextualized Masked Autoencoder

  • paper_url: http://arxiv.org/abs/2310.13593
  • repo_url: None
  • paper_authors: Taekyung Kim, Sanghyuk Chun, Byeongho Heo, Dongyoon Han
  • for: Strengthening the ability of self-supervised masked image modeling to capture long-range and multi-range dependencies and thereby improve image recognition performance.
  • methods: Longer-range Contextualized Masked Autoencoder (LC-MAE), a self-supervised framework that steers the encoder to learn from entire pixels across multiple views while also learning local representations from sparse pixels, reducing the spatial redundancy of the input.
  • results: Achieves 84.2% top-1 accuracy on ImageNet-1K with ViT-B, a 0.6%p gain over the baseline, and shows significant gains on semantic segmentation, fine-grained visual classification, and diverse robustness metrics.
    Abstract Masked image modeling (MIM) has emerged as a promising self-supervised learning (SSL) strategy. The MIM pre-training facilitates learning powerful representations using an encoder-decoder framework by randomly masking some input pixels and reconstructing the masked pixels from the remaining ones. However, as the encoder is trained with partial pixels, the MIM pre-training can suffer from a low capability of understanding long-range dependency. This limitation may hinder its capability to fully understand multiple-range dependencies, resulting in narrow highlighted regions in the attention map that may incur accuracy drops. To mitigate the limitation, We propose a self-supervised learning framework, named Longer-range Contextualized Masked Autoencoder (LC-MAE). LC-MAE effectively leverages a global context understanding of visual representations while simultaneously reducing the spatial redundancy of input at the same time. Our method steers the encoder to learn from entire pixels in multiple views while also learning local representation from sparse pixels. As a result, LC-MAE learns more discriminative representations, leading to a performance improvement of achieving 84.2% top-1 accuracy with ViT-B on ImageNet-1K with 0.6%p gain. We attribute the success to the enhanced pre-training method, as evidenced by the singular value spectrum and attention analyses. Finally, LC-MAE achieves significant performance gains at the downstream semantic segmentation and fine-grained visual classification tasks; and on diverse robust evaluation metrics. Our code will be publicly available.

POTLoc: Pseudo-Label Oriented Transformer for Point-Supervised Temporal Action Localization

  • paper_url: http://arxiv.org/abs/2310.13585
  • repo_url: None
  • paper_authors: Elahe Vahdani, Yingli Tian
  • for: Point-supervised temporal action localization, where only a single frame is annotated for each action instance in the training set.
  • methods: POTLoc, a Pseudo-label Oriented Transformer that uses only point-level annotation: action proposals generated under point supervision are refined and regressed into pseudo-labels, which then guide a transformer with a temporal feature pyramid through a self-training strategy.
  • results: Outperforms state-of-the-art point-supervised methods on THUMOS'14 and ActivityNet-v1.2, with a 5% average mAP improvement on the former.
    Abstract This paper tackles the challenge of point-supervised temporal action detection, wherein only a single frame is annotated for each action instance in the training set. Most of the current methods, hindered by the sparse nature of annotated points, struggle to effectively represent the continuous structure of actions or the inherent temporal and semantic dependencies within action instances. Consequently, these methods frequently learn merely the most distinctive segments of actions, leading to the creation of incomplete action proposals. This paper proposes POTLoc, a Pseudo-label Oriented Transformer for weakly-supervised Action Localization utilizing only point-level annotation. POTLoc is designed to identify and track continuous action structures via a self-training strategy. The base model begins by generating action proposals solely with point-level supervision. These proposals undergo refinement and regression to enhance the precision of the estimated action boundaries, which subsequently results in the production of `pseudo-labels' to serve as supplementary supervisory signals. The architecture of the model integrates a transformer with a temporal feature pyramid to capture video snippet dependencies and model actions of varying duration. The pseudo-labels, providing information about the coarse locations and boundaries of actions, assist in guiding the transformer for enhanced learning of action dynamics. POTLoc outperforms the state-of-the-art point-supervised methods on THUMOS'14 and ActivityNet-v1.2 datasets, showing a significant improvement of 5% average mAP on the former.

Progressive Dual Priori Network for Generalized Breast Tumor Segmentation

  • paper_url: http://arxiv.org/abs/2310.13574
  • repo_url: None
  • paper_authors: Li Wang, Lihui Wang, Zixiang Kuai, Lei Tang, Yingfeng Ou, Chen Ye, Yuemin Zhu
  • for: Improving the generalization ability of breast tumor segmentation models and the segmentation of small, low-contrast, and irregularly shaped tumors.
  • methods: A progressive dual priori network (PDPNet) that first crops tumor regions with a coarse-segmentation-based localization module and then progressively refines the tumor mask using weak semantic priors and cross-scale correlation prior knowledge.
  • results: On multi-center DCE-MRI datasets, DSC, SEN, KAPPA, and HD95 improve by 3.63%, 8.19%, 5.52%, and 3.66% over the suboptimal method; ablations show that the localization module reduces the influence of normal tissue and improves generalization.
    Abstract To promote the generalization ability of breast tumor segmentation models, as well as to improve the segmentation performance for breast tumors with smaller size, low-contrast amd irregular shape, we propose a progressive dual priori network (PDPNet) to segment breast tumors from dynamic enhanced magnetic resonance images (DCE-MRI) acquired at different sites. The PDPNet first cropped tumor regions with a coarse-segmentation based localization module, then the breast tumor mask was progressively refined by using the weak semantic priori and cross-scale correlation prior knowledge. To validate the effectiveness of PDPNet, we compared it with several state-of-the-art methods on multi-center datasets. The results showed that, comparing against the suboptimal method, the DSC, SEN, KAPPA and HD95 of PDPNet were improved 3.63\%, 8.19\%, 5.52\%, and 3.66\% respectively. In addition, through ablations, we demonstrated that the proposed localization module can decrease the influence of normal tissues and therefore improve the generalization ability of the model. The weak semantic priors allow focusing on tumor regions to avoid missing small tumors and low-contrast tumors. The cross-scale correlation priors are beneficial for promoting the shape-aware ability for irregual tumors. Thus integrating them in a unified framework improved the multi-center breast tumor segmentation performance.

A Simple Baseline for Knowledge-Based Visual Question Answering

  • paper_url: http://arxiv.org/abs/2310.13570
  • repo_url: None
  • paper_authors: Alexandros Xenos, Themos Stafylakis, Ioannis Patras, Georgios Tzimiropoulos
  • for: Knowledge-Based Visual Question Answering (KB-VQA): answering visual questions that require external knowledge.
  • methods: A simple, readily reproducible pipeline based on efficient in-context learning: LLaMA (1 and 2) is prompted with question-informative captions as contextual information.
  • results: Training-free and without access to external databases or APIs, the method achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets.
    Abstract This paper is on the problem of Knowledge-Based Visual Question Answering (KB-VQA). Recent works have emphasized the significance of incorporating both explicit (through external databases) and implicit (through LLMs) knowledge to answer questions requiring external knowledge effectively. A common limitation of such approaches is that they consist of relatively complicated pipelines and often heavily rely on accessing GPT-3 API. Our main contribution in this paper is to propose a much simpler and readily reproducible pipeline which, in a nutshell, is based on efficient in-context learning by prompting LLaMA (1 and 2) using question-informative captions as contextual information. Contrary to recent approaches, our method is training-free, does not require access to external databases or APIs, and yet achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets. Finally, we perform several ablation studies to understand important aspects of our method. Our code is publicly available at https://github.com/alexandrosXe/ASimple-Baseline-For-Knowledge-Based-VQA
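The in-context learning step amounts to assembling a prompt from few-shot exemplars and question-informative captions and letting the language model complete the answer; the template and example below are illustrative assumptions (the resulting string would be fed to LLaMA 1/2 for completion).

```python
# Sketch of building an in-context KB-VQA prompt from captions (illustrative template).
def build_prompt(captions, question, in_context_examples):
    lines = ["Answer the question using the image descriptions.\n"]
    for ex in in_context_examples:                  # few-shot exemplars
        lines.append(f"Context: {ex['caption']}")
        lines.append(f"Question: {ex['question']}")
        lines.append(f"Answer: {ex['answer']}\n")
    lines.append("Context: " + " ".join(captions))  # captions generated for the test image
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)

examples = [{"caption": "a red double-decker bus on a city street",
             "question": "In which country are these buses common?",
             "answer": "england"}]
prompt = build_prompt(["a bowl of sushi with chopsticks on a table"],
                      "Which country does this food come from?", examples)
print(prompt)
```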

ROSS: Radar Off-road Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.13551
  • repo_url: None
  • paper_authors: Peng Jiang, Srikanth Saripalli
  • for: Tackling the inherent complexities of semantic segmentation in RADAR data for off-road scenarios, to enhance the efficiency of autonomous navigation in such environments.
  • methods: A novel pipeline that utilizes LIDAR data and an existing annotated off-road LIDAR dataset to generate RADAR labels, where the RADAR data are represented as images.
  • results: Validated on real-world datasets, the practical approach performs well and demonstrates the potential of RADAR technology for navigation applications in off-road environments.
    Abstract As the demand for autonomous navigation in off-road environments increases, the need for effective solutions to understand these surroundings becomes essential. In this study, we confront the inherent complexities of semantic segmentation in RADAR data for off-road scenarios. We present a novel pipeline that utilizes LIDAR data and an existing annotated off-road LIDAR dataset for generating RADAR labels, in which the RADAR data are represented as images. Validated with real-world datasets, our pragmatic approach underscores the potential of RADAR technology for navigation applications in off-road environments.

A review of individual tree crown detection and delineation from optical remote sensing images

  • paper_url: http://arxiv.org/abs/2310.13481
  • repo_url: None
  • paper_authors: Juepeng Zheng, Shuai Yuan, Weijia Li, Haohuan Fu, Le Yu
    for: This paper provides a comprehensive review of Individual Tree Crown Detection (ITCD) methods for detecting and delineating individual tree crowns in optical remote sensing images.methods: The review covers a wide range of ITCD methods, including traditional image processing methods, traditional machine learning methods, and deep learning-based methods.results: The review discusses the strengths and limitations of each method and provides a clear knowledge map of existing ITCD efforts. It also proposes some ITCD-related applications and potential hot topics in future ITCD research.
    Abstract Powered by the advances of optical remote sensing sensors, the production of very high spatial resolution multispectral images provides great potential for achieving cost-efficient and high-accuracy forest inventory and analysis in an automated way. Lots of studies that aim at providing an inventory to the level of each individual tree have generated a variety of methods for Individual Tree Crown Detection and Delineation (ITCD). This review covers ITCD methods for detecting and delineating individual tree crowns, and systematically reviews the past and present of ITCD-related researches applied to the optical remote sensing images. With the goal to provide a clear knowledge map of existing ITCD efforts, we conduct a comprehensive review of recent ITCD papers to build a meta-data analysis, including the algorithm, the study site, the tree species, the sensor type, the evaluation method, etc. We categorize the reviewed methods into three classes: (1) traditional image processing methods (such as local maximum filtering, image segmentation, etc.); (2) traditional machine learning methods (such as random forest, decision tree, etc.); and (3) deep learning based methods. With the deep learning-oriented approaches contributing a majority of the papers, we further discuss the deep learning-based methods as semantic segmentation and object detection methods. In addition, we discuss four ITCD-related issues to further comprehend the ITCD domain using optical remote sensing data, such as comparisons between multi-sensor based data and optical data in ITCD domain, comparisons among different algorithms and different ITCD tasks, etc. Finally, this review proposes some ITCD-related applications and a few exciting prospects and potential hot topics in future ITCD research.

Segment, Select, Correct: A Framework for Weakly-Supervised Referring Segmentation

  • paper_url: http://arxiv.org/abs/2310.13479
  • repo_url: https://github.com/fgirbal/segment-select-correct
  • paper_authors: Francisco Eiras, Kemal Oksuz, Adel Bibi, Philip H. S. Torr, Puneet K. Dokania
  • for: Closing the performance gap between weakly-supervised and fully-supervised Referring Image Segmentation (RIS) without mask annotations.
  • methods: A weakly-supervised framework that decomposes RIS into three steps: obtaining instance masks for the object mentioned in the referring instruction (segment), using zero-shot learning to select a potentially correct mask for the instruction (select), and bootstrapping a model that fixes the mistakes of the zero-shot selection (correct).
  • results: Using only the first two steps (zero-shot segment and select) outperforms other zero-shot baselines by as much as 19%, and the full method sets a new state of the art for weakly-supervised RIS, shrinking the gap to fully-supervised methods in some cases from around 33% to as little as 14%.
    Abstract Referring Image Segmentation (RIS) - the problem of identifying objects in images through natural language sentences - is a challenging task currently mostly solved through supervised learning. However, while collecting referred annotation masks is a time-consuming process, the few existing weakly-supervised and zero-shot approaches fall significantly short in performance compared to fully-supervised learning ones. To bridge the performance gap without mask annotations, we propose a novel weakly-supervised framework that tackles RIS by decomposing it into three steps: obtaining instance masks for the object mentioned in the referencing instruction (segment), using zero-shot learning to select a potentially correct mask for the given instruction (select), and bootstrapping a model which allows for fixing the mistakes of zero-shot selection (correct). In our experiments, using only the first two steps (zero-shot segment and select) outperforms other zero-shot baselines by as much as 19%, while our full method improves upon this much stronger baseline and sets the new state-of-the-art for weakly-supervised RIS, reducing the gap between the weakly-supervised and fully-supervised methods in some cases from around 33% to as little as 14%. Code is available at https://github.com/fgirbal/segment-select-correct.

Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models

  • paper_url: http://arxiv.org/abs/2310.13473
  • repo_url: https://github.com/coderj-one/giraffe-bench
  • paper_authors: Mingwei Zhu, Leigang Sha, Yu Shu, Kangjia Zhao, Tiancheng Zhao, Jianwei Yin
  • for: Assessing the predictive reasoning capabilities of multimodal large language models (MLLMs) over sequential visual input.
  • methods: A new benchmark covering three domains (abstract pattern reasoning, human activity prediction, and physical interaction prediction), together with three large-language-model-powered evaluation methods for quantifying prediction and reasoning over multi-visual context.
  • results: Experiments confirm the soundness of the benchmark and evaluation methods and reveal the strengths and weaknesses of current popular MLLMs at predictive reasoning; the benchmark provides a standardized evaluation framework to drive more advanced models.
    Abstract Multimodal large language models (MLLMs) have shown great potential in perception and interpretation tasks, but their capabilities in predictive reasoning remain under-explored. To address this gap, we introduce a novel benchmark that assesses the predictive reasoning capabilities of MLLMs across diverse scenarios. Our benchmark targets three important domains: abstract pattern reasoning, human activity prediction, and physical interaction prediction. We further develop three evaluation methods powered by large language model to robustly quantify a model's performance in predicting and reasoning the future based on multi-visual context. Empirical experiments confirm the soundness of the proposed benchmark and evaluation methods via rigorous testing and reveal pros and cons of current popular MLLMs in the task of predictive reasoning. Lastly, our proposed benchmark provides a standardized evaluation framework for MLLMs and can facilitate the development of more advanced models that can reason and predict over complex long sequence of multimodal input.

Dance Your Latents: Consistent Dance Generation through Spatial-temporal Subspace Attention Guided by Motion Flow

  • paper_url: http://arxiv.org/abs/2310.14780
  • repo_url: None
  • paper_authors: Haipeng Fang, Zhihao Sun, Ziyao Huang, Fan Tang, Juan Cao, Sheng Tang
  • for: Improve generative human dance video synthesis, which still suffers from spatiotemporal inconsistencies such as ghosting, flickering, and incoherent motion.
  • methods: A latent-space framework with spatial-temporal subspace-attention blocks, which decompose the global space into regular subspaces and model spatiotemporal consistency within them, plus motion-flow-guided subspace align & restore, which computes attention along irregular subspaces following the motion flow.
  • results: Experiments on the TikTok dataset show that the approach significantly improves the spatiotemporal consistency, and hence the quality, of the generated dance videos.
    Abstract The advancement of generative AI has extended to the realm of Human Dance Generation, demonstrating superior generative capacities. However, current methods still exhibit deficiencies in achieving spatiotemporal consistency, resulting in artifacts like ghosting, flickering, and incoherent motions. In this paper, we present Dance-Your-Latents, a framework that makes latents dance coherently following motion flow to generate consistent dance videos. Firstly, considering that each constituent element moves within a confined space, we introduce spatial-temporal subspace-attention blocks that decompose the global space into a combination of regular subspaces and efficiently model the spatiotemporal consistency within these subspaces. This module enables each patch pay attention to adjacent areas, mitigating the excessive dispersion of long-range attention. Furthermore, observing that body part's movement is guided by pose control, we design motion flow guided subspace align & restore. This method enables the attention to be computed on the irregular subspace along the motion flow. Experimental results in TikTok dataset demonstrate that our approach significantly enhances spatiotemporal consistency of the generated videos.

Two-Stage Triplet Loss Training with Curriculum Augmentation for Audio-Visual Retrieval

  • paper_url: http://arxiv.org/abs/2310.13451
  • repo_url: None
  • paper_authors: Donghuo Zeng, Kazushi Ikeda
  • for: Cross-modal (audio-visual) retrieval, where ignoring the distinction between semi-hard and hard triplets during optimization leads to suboptimal models.
  • methods: A two-stage training paradigm rooted in curriculum learning: the model is first trained on semi-hard triplets starting from a low-loss base, then the embeddings are augmented by interpolation to expose potential hard negatives, and hard triplet mining is applied in the augmented embedding space to further optimize the model.
  • results: On two audio-visual datasets, the method improves average Mean Average Precision (MAP) by about 9.8% over the state-of-the-art MSNSCA model on the AV-CMR task on the AVE dataset, indicating its effectiveness.
    Abstract The cross-modal retrieval model leverages the potential of triple loss optimization to learn robust embedding spaces. However, existing methods often train these models in a singular pass, overlooking the distinction between semi-hard and hard triples in the optimization process. The oversight of not distinguishing between semi-hard and hard triples leads to suboptimal model performance. In this paper, we introduce a novel approach rooted in curriculum learning to address this problem. We propose a two-stage training paradigm that guides the model's learning process from semi-hard to hard triplets. In the first stage, the model is trained with a set of semi-hard triplets, starting from a low-loss base. Subsequently, in the second stage, we augment the embeddings using an interpolation technique. This process identifies potential hard negatives, alleviating issues arising from high-loss functions due to a scarcity of hard triples. Our approach then applies hard triplet mining in the augmented embedding space to further optimize the model. Extensive experimental results conducted on two audio-visual datasets show a significant improvement of approximately 9.8% in terms of average Mean Average Precision (MAP) over the current state-of-the-art method, MSNSCA, for the Audio-Visual Cross-Modal Retrieval (AV-CMR) task on the AVE dataset, indicating the effectiveness of our proposed method.
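The semi-hard/hard distinction that drives the two-stage curriculum can be made concrete with standard online triplet mining. The PyTorch sketch below only illustrates the usual margin-based selection rule and triplet loss; it is not the paper's training code.

```python
import torch

def mine_negatives(anchor, positive, negatives, margin=0.2, stage="semi_hard"):
    """Select negatives for one (anchor, positive) pair by difficulty.

    anchor, positive: (D,) embeddings; negatives: (N, D) embeddings.
    semi-hard:  d(a,p) < d(a,n) < d(a,p) + margin
    hard:       d(a,n) < d(a,p)
    """
    d_ap = torch.norm(anchor - positive)
    d_an = torch.norm(negatives - anchor, dim=1)
    if stage == "semi_hard":
        keep = (d_an > d_ap) & (d_an < d_ap + margin)
    else:  # "hard"
        keep = d_an < d_ap
    return negatives[keep]

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_ap = torch.norm(anchor - positive)
    d_an = torch.norm(anchor - negative)
    return torch.clamp(d_ap - d_an + margin, min=0.0)
```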

Definition-independent Formalization of Soundscapes: Towards a Formal Methodology

  • paper_url: http://arxiv.org/abs/2310.13404
  • repo_url: None
  • paper_authors: Mikel D. Jedrusiak, Thomas Harweg, Timo Haselhoff, Bryce T. Lawrence, Susanne Moebus, Frank Weichert
  • for: Provide a formalization of soundscapes that is independent of any particular soundscape definition, so that the heterogeneous structure of the data and the differing perspectives of different disciplines can be captured in a single model.
  • methods: An exemplary analysis uses frequency correlation matrices for land-use type detection as an alternative to features such as MFCCs.
  • results: The exemplary analysis shows that frequency correlation matrices can be used to detect land-use types, illustrating a practical application of the proposed formalization.
    Abstract Soundscapes have been studied by researchers from various disciplines, each with different perspectives, goals, approaches, and terminologies. Accordingly, depending on the field, the concept of a soundscape's components changes, consequently changing the basic definition. This results in complicating interdisciplinary communication and comparison of results. Especially when soundscape-unrelated research areas are involved. For this reason, we present a potential formalization that is independent of the underlying soundscape definition, with the goal of being able to capture the heterogeneous structure of the data as well as the different ideologies in one model. In an exemplary analysis of frequency correlation matrices for land use type detection as an alternative to features like MFCCs, we show a practical application of our presented formalization.
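A frequency correlation matrix of the kind used in the exemplary land-use analysis can be built by correlating the frequency bins of a spectrogram over time. The sketch below (NumPy/SciPy assumed available) shows one plausible construction; the paper's exact preprocessing may differ.

```python
import numpy as np
from scipy.signal import spectrogram

def frequency_correlation_matrix(audio: np.ndarray, sample_rate: int, nperseg: int = 1024):
    """Correlate frequency bins of a magnitude spectrogram over time.

    Returns an (F, F) matrix whose entry (i, j) is the Pearson correlation
    between the time series of frequency bins i and j.
    """
    _freqs, _times, sxx = spectrogram(audio, fs=sample_rate, nperseg=nperseg)
    log_power = np.log1p(sxx)          # (F, T), compress dynamic range
    return np.corrcoef(log_power)      # rows = frequency bins

# Usage sketch: flatten the upper triangle of the matrix and feed it to any
# classifier (e.g., a random forest) to predict the land-use type of the site.
```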

OpenAnnotate3D: Open-Vocabulary Auto-Labeling System for Multi-modal 3D Data

  • paper_url: http://arxiv.org/abs/2310.13398
  • repo_url: None
  • paper_authors: Yijie Zhou, Likun Cai, Xianhui Cheng, Zhongxue Gan, Xiangyang Xue, Wenchao Ding
  • for: automatic annotation of multi-modal 3D data
  • methods: an open-source, open-vocabulary auto-labeling system that integrates Large Language Models (LLMs) and vision-language models (VLMs)
  • results: significantly improves annotation efficiency compared to manual annotation while providing accurate open-vocabulary auto-annotation results
    Abstract In the era of big data and large models, automatic annotating functions for multi-modal data are of great significance for real-world AI-driven applications, such as autonomous driving and embodied AI. Unlike traditional closed-set annotation, open-vocabulary annotation is essential to achieve human-level cognition capability. However, there are few open-vocabulary auto-labeling systems for multi-modal 3D data. In this paper, we introduce OpenAnnotate3D, an open-source open-vocabulary auto-labeling system that can automatically generate 2D masks, 3D masks, and 3D bounding box annotations for vision and point cloud data. Our system integrates the chain-of-thought capabilities of Large Language Models (LLMs) and the cross-modality capabilities of vision-language models (VLMs). To the best of our knowledge, OpenAnnotate3D is one of the pioneering works for open-vocabulary multi-modal 3D auto-labeling. We conduct comprehensive evaluations on both public and in-house real-world datasets, which demonstrate that the system significantly improves annotation efficiency compared to manual annotation while providing accurate open-vocabulary auto-annotating results.

ScalableMap: Scalable Map Learning for Online Long-Range Vectorized HD Map Construction

  • paper_url: http://arxiv.org/abs/2310.13378
  • repo_url: https://github.com/jingy1yu/scalablemap
  • paper_authors: Jingyi Yu, Zizhao Zhang, Shengfu Xia, Jizhang Sang
  • for: An online pipeline for long-range vectorized high-definition (HD) map construction from on-board camera sensors.
  • methods: Uses a vectorized representation (polylines and polygons) of map elements, extracts more accurate bird's eye view (BEV) features guided by their linear structure, and proposes a hierarchical sparse map representation with a progressive decoding mechanism and a matching supervision strategy to exploit the scalability of vectorized map elements.
  • results: Surpasses the previous state-of-the-art model on the nuScenes dataset by 6.5 mAP, with particularly strong gains in long-range scenarios, while running at 18.3 FPS.
    Abstract We propose a novel end-to-end pipeline for online long-range vectorized high-definition (HD) map construction using on-board camera sensors. The vectorized representation of HD maps, employing polylines and polygons to represent map elements, is widely used by downstream tasks. However, previous schemes designed with reference to dynamic object detection overlook the structural constraints within linear map elements, resulting in performance degradation in long-range scenarios. In this paper, we exploit the properties of map elements to improve the performance of map construction. We extract more accurate bird's eye view (BEV) features guided by their linear structure, and then propose a hierarchical sparse map representation to further leverage the scalability of vectorized map elements and design a progressive decoding mechanism and a supervision strategy based on this representation. Our approach, ScalableMap, demonstrates superior performance on the nuScenes dataset, especially in long-range scenarios, surpassing previous state-of-the-art model by 6.5 mAP while achieving 18.3 FPS. Code is available at https://github.com/jingy1yu/ScalableMap.

Single-view 3D reconstruction via inverse procedural modeling

  • paper_url: http://arxiv.org/abs/2310.13373
  • repo_url: None
  • paper_authors: Albert Garifullin, Nikolay Maiorov, Vladimir Frolov
  • for: 3D reconstruction via inverse procedural modeling, with results demonstrated on tree models and other complex objects
  • methods: fitting a set of input parameters with a genetic algorithm, and a higher-precision variant combining a memetic algorithm with gradients from differentiable rendering and differentiable procedural generators
  • results: significantly improved precision, accurate 3D reconstruction from a small number of input images (even a single image), and applicability to fairly complex generators, both differentiable and non-differentiable
    Abstract We propose an approach to 3D reconstruction via inverse procedural modeling and investigate two variants of this approach. The first option consists in the fitting set of input parameters using a genetic algorithm. We demonstrate the results of our work on tree models, complex objects, with the reconstruction of which most existing methods cannot handle. The second option allows us to significantly improve the precision by using gradients within memetic algorithm, differentiable rendering and also differentiable procedural generators. In our work we see 2 main contributions. First, we propose a method to join differentiable rendering and inverse procedural modeling. This gives us an opportunity to reconstruct 3D model more accurately than existing approaches when a small number of input images are available (even for single image). Second, we join both differentiable and non-differentiable procedural generators in a single framework which allow us to apply inverse procedural modeling to fairly complex generators: when gradient is available, reconstructions is precise, when gradient is not available, reconstruction is approximate, but always high quality without visual artifacts.

PSGText: Stroke-Guided Scene Text Editing with PSP Module

  • paper_url: http://arxiv.org/abs/2310.13366
  • repo_url: None
  • paper_authors: Felix Liawi, Yun-Da Tsai, Guan-Lun Lu, Shou-De Lin
  • for: Replace text in scene images with new desired text while preserving the original background and text style, and improving the clarity and legibility of the edited text.
  • methods: A three-stage framework: a text-swapping network substitutes the original text with the desired text; a background inpainting network reconstructs the background regions left vacant after text removal, preserving visual harmony and coherence; and a fusion network merges the outputs of the two networks into the final edited image.
  • results: The method produces high-quality edited images that preserve the background and style of the original text with improved clarity and legibility; a demo video is provided in the supplementary material.
    Abstract Scene Text Editing (STE) aims to substitute text in an image with new desired text while preserving the background and styles of the original text. However, present techniques present a notable challenge in the generation of edited text images that exhibit a high degree of clarity and legibility. This challenge primarily stems from the inherent diversity found within various text types and the intricate textures of complex backgrounds. To address this challenge, this paper introduces a three-stage framework for transferring texts across text images. Initially, we introduce a text-swapping network that seamlessly substitutes the original text with the desired replacement. Subsequently, we incorporate a background inpainting network into our framework. This specialized network is designed to skillfully reconstruct background images, effectively addressing the voids left after the removal of the original text. This process meticulously preserves visual harmony and coherence in the background. Ultimately, the synthesis of outcomes from the text-swapping network and the background inpainting network is achieved through a fusion network, culminating in the creation of the meticulously edited final image. A demo video is included in the supplementary material.

Sync-NeRF: Generalizing Dynamic NeRFs to Unsynchronized Videos

  • paper_url: http://arxiv.org/abs/2310.13356
  • repo_url: https://github.com/seoha-kim/Sync-NeRF
  • paper_authors: Seoha Kim, Jeongmin Bae, Youngsik Yun, Hahyun Lee, Gun Bang, Youngjung Uh
  • for: Address the limitation that dynamic NeRFs for 4D scene reconstruction fail on multi-view videos whose frames were captured at unsynchronized moments.
  • methods: Introduce per-video time offsets for the unsynchronized videos and jointly optimize the offsets with the NeRF; by design the approach is applicable to various baselines and improves them by large margins.
  • results: Experiments show that the method effectively synchronizes multi-view videos without manual effort, validated on the Plenoptic Video Dataset and a newly built Unsynchronized Dynamic Blender Dataset. Project page: https://seoha-kim.github.io/sync-nerf
    Abstract Recent advancements in 4D scene reconstruction using neural radiance fields (NeRF) have demonstrated the ability to represent dynamic scenes from multi-view videos. However, they fail to reconstruct the dynamic scenes and struggle to fit even the training views in unsynchronized settings. It happens because they employ a single latent embedding for a frame while the multi-view images at the frame were actually captured at different moments. To address this limitation, we introduce time offsets for individual unsynchronized videos and jointly optimize the offsets with NeRF. By design, our method is applicable for various baselines and improves them with large margins. Furthermore, finding the offsets naturally works as synchronizing the videos without manual effort. Experiments are conducted on the common Plenoptic Video Dataset and a newly built Unsynchronized Dynamic Blender Dataset to verify the performance of our method. Project page: https://seoha-kim.github.io/sync-nerf
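The core mechanism, a learnable time offset per unsynchronized camera optimized jointly with the radiance field, fits in a few lines of PyTorch. This is a simplified sketch with the dynamic NeRF abstracted away as an assumed `dynamic_nerf` model, not the released code.

```python
import torch
import torch.nn as nn

class TimeOffsets(nn.Module):
    """One learnable temporal offset per (unsynchronized) camera."""

    def __init__(self, num_cameras: int):
        super().__init__()
        self.offsets = nn.Parameter(torch.zeros(num_cameras))

    def forward(self, camera_ids: torch.Tensor, frame_times: torch.Tensor) -> torch.Tensor:
        # Shift each camera's nominal timestamps by its learned offset.
        return frame_times + self.offsets[camera_ids]

# Joint optimization sketch: the offsets receive gradients through the
# photometric loss of the dynamic NeRF, so no manual synchronization is needed.
# corrected_t = time_offsets(cam_ids, t)
# rgb_pred = dynamic_nerf(rays, corrected_t)   # assumed model
# loss = ((rgb_pred - rgb_gt) ** 2).mean()
```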

SILC: Improving Vision Language Pretraining with Self-Distillation

  • paper_url: http://arxiv.org/abs/2310.13355
  • repo_url: None
  • paper_authors: Muhammad Ferjad Naeem, Yongqin Xian, Xiaohua Zhai, Lukas Hoyer, Luc Van Gool, Federico Tombari
  • for: Improving the performance of open-vocabulary classification and retrieval models by adding local-to-global correspondence learning through self-distillation during contrastive pre-training.
  • methods: The proposed method, SILC, uses a contrastive objective to learn image-text alignment and adds local-to-global correspondence learning through self-distillation to improve image feature learning for dense prediction tasks.
  • results: SILC achieves state-of-the-art performance on several computer vision tasks, including zero-shot classification, few-shot classification, image and text retrieval, zero-shot segmentation, and open-vocabulary segmentation, with better scaling than the baselines.
    Abstract Image-Text pretraining on web-scale image caption dataset has become the default recipe for open vocabulary classification and retrieval models thanks to the success of CLIP and its variants. Several works have also used CLIP features for dense prediction tasks and have shown the emergence of open-set abilities. However, the contrastive objective only focuses on image-text alignment and does not incentivise image feature learning for dense prediction tasks. In this work, we propose the simple addition of local-to-global correspondence learning by self-distillation as an additional objective for contrastive pre-training to propose SILC. We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on several computer vision tasks including classification, retrieval, and especially segmentation. We further show that SILC scales better with the same training duration compared to the baselines. Our model SILC sets a new state of the art for zero-shot classification, few shot classification, image and text retrieval, zero-shot segmentation, and open vocabulary segmentation.
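The self-distillation teacher in this kind of setup is typically an exponential moving average (EMA) of the student that receives no gradients. The PyTorch sketch below shows only that standard update rule; the rest of the SILC pipeline is assumed.

```python
import torch

@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, momentum: float = 0.996):
    """teacher <- momentum * teacher + (1 - momentum) * student."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)

# Setup sketch: the teacher starts as a frozen copy of the student,
# e.g. teacher = copy.deepcopy(student) with requires_grad_(False) on all
# of its parameters, and ema_update() is called after every optimizer step.
```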

EarlyBird: Early-Fusion for Multi-View Tracking in the Bird’s Eye View

  • paper_url: http://arxiv.org/abs/2310.13350
  • repo_url: https://github.com/tteepe/EarlyBird
  • paper_authors: Torben Teepe, Philipp Wolters, Johannes Gilg, Fabian Herzog, Gerhard Rigoll
  • for: Investigate whether tracking in the Bird's Eye View (BEV) can bring the next performance breakthrough in Multi-Target Multi-Camera (MTMC) tracking.
  • methods: An early-fusion approach that detects each pedestrian once in the BEV, solving the spatial association across views, and learns strong Re-Identification (re-ID) features for each detection to handle the remaining temporal association.
  • results: Early fusion in the BEV achieves high accuracy for both detection and tracking; EarlyBird outperforms the state of the art on Wildtrack, improving MOTA by +4.6 and IDF1 by +5.6.
    Abstract Multi-view aggregation promises to overcome the occlusion and missed detection challenge in multi-object detection and tracking. Recent approaches in multi-view detection and 3D object detection made a huge performance leap by projecting all views to the ground plane and performing the detection in the Bird's Eye View (BEV). In this paper, we investigate if tracking in the BEV can also bring the next performance breakthrough in Multi-Target Multi-Camera (MTMC) tracking. Most current approaches in multi-view tracking perform the detection and tracking task in each view and use graph-based approaches to perform the association of the pedestrian across each view. This spatial association is already solved by detecting each pedestrian once in the BEV, leaving only the problem of temporal association. For the temporal association, we show how to learn strong Re-Identification (re-ID) features for each detection. The results show that early-fusion in the BEV achieves high accuracy for both detection and tracking. EarlyBird outperforms the state-of-the-art methods and improves the current state-of-the-art on Wildtrack by +4.6 MOTA and +5.6 IDF1.
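The geometric step behind BEV early fusion is projecting ground-plane (z = 0) locations into each camera image so that per-view features can be sampled onto a common BEV grid. The NumPy sketch below shows that generic pinhole projection; it is an illustration, not the EarlyBird code.

```python
import numpy as np

def project_ground_points(P: np.ndarray, xy_world: np.ndarray) -> np.ndarray:
    """Project ground-plane points (z = 0) into image pixel coordinates.

    P        : (3, 4) camera projection matrix K [R | t]
    xy_world : (N, 2) ground-plane coordinates
    returns  : (N, 2) pixel coordinates
    """
    n = xy_world.shape[0]
    homog = np.hstack([xy_world, np.zeros((n, 1)), np.ones((n, 1))])  # (N, 4)
    uvw = homog @ P.T                                                 # (N, 3)
    return uvw[:, :2] / uvw[:, 2:3]

# BEV grid sketch: sample image features at these pixels for every camera and
# accumulate them on the (x, y) grid to obtain the fused BEV feature map.
# gx, gy = np.meshgrid(np.arange(0, 30, 0.1), np.arange(0, 30, 0.1))
# bev_points = np.stack([gx.ravel(), gy.ravel()], axis=1)
```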

DeepFDR: A Deep Learning-based False Discovery Rate Control Method for Neuroimaging Data

  • paper_url: http://arxiv.org/abs/2310.13349
  • repo_url: None
  • paper_authors: Taehyo Kim, Hai Shu, Qiran Jia, Mony de Leon
  • for: Voxel-based multiple testing in neuroimaging, where traditional false discovery rate (FDR) control ignores the complex spatial dependence among voxel-wise tests and loses testing power.
  • methods: A spatial FDR control method that leverages unsupervised deep-learning-based image segmentation to address the voxel-based multiple testing problem.
  • results: Numerical studies, including comprehensive simulations and Alzheimer's disease FDG-PET image analysis, show that DeepFDR outperforms existing methods in FDR control, reduces the false non-discovery rate, and is computationally efficient enough for large-scale neuroimaging data.
    Abstract Voxel-based multiple testing is widely used in neuroimaging data analysis. Traditional false discovery rate (FDR) control methods often ignore the spatial dependence among the voxel-based tests and thus suffer from substantial loss of testing power. While recent spatial FDR control methods have emerged, their validity and optimality remain questionable when handling the complex spatial dependencies of the brain. Concurrently, deep learning methods have revolutionized image segmentation, a task closely related to voxel-based multiple testing. In this paper, we propose DeepFDR, a novel spatial FDR control method that leverages unsupervised deep learning-based image segmentation to address the voxel-based multiple testing problem. Numerical studies, including comprehensive simulations and Alzheimer's disease FDG-PET image analysis, demonstrate DeepFDR's superiority over existing methods. DeepFDR not only excels in FDR control and effectively diminishes the false nondiscovery rate, but also boasts exceptional computational efficiency highly suited for tackling large-scale neuroimaging data.
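For reference, the classical, spatially unaware FDR baseline that methods like DeepFDR are compared against is the Benjamini-Hochberg procedure over all voxel-wise p-values. The sketch below implements that standard baseline, not DeepFDR itself.

```python
import numpy as np

def benjamini_hochberg(p_values: np.ndarray, q: float = 0.05) -> np.ndarray:
    """Classical BH procedure: returns a boolean array of rejected voxels."""
    p = np.asarray(p_values).ravel()
    m = p.size
    order = np.argsort(p)
    thresholds = q * (np.arange(1, m + 1) / m)
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest index meeting the bound
        rejected[order[: k + 1]] = True
    return rejected.reshape(np.shape(p_values))

# Usage sketch: p_values is the 3D volume of voxel-wise test p-values; the
# returned mask marks voxels declared significant at FDR level q.
```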

DeepFracture: A Generative Approach for Predicting Brittle Fractures

  • paper_url: http://arxiv.org/abs/2310.13344
  • repo_url: None
  • paper_authors: Yuhang Huang, Takashi Kanai
  • for: Generating realistic brittle fracture animations: physics-based simulation is computationally expensive, while real-time methods based on Voronoi diagrams or pre-fractured patterns often lack realism.
  • methods: A learning-based approach that integrates realistic brittle fracture animations with rigid-body simulations, using BEM brittle fracture simulations to generate fracture patterns and collision conditions that serve as training data; a latent impulse representation and a geometrically-segmented signed distance function (GS-SDF) connect impact forces to fractured shapes.
  • results: Experiments show the approach generates significantly more detailed brittle fractures than existing techniques while maintaining commendable runtime efficiency.
    Abstract In the realm of brittle fracture animation, generating realistic destruction animations with physics simulation techniques can be computationally expensive. Although methods using Voronoi diagrams or pre-fractured patterns work for real-time applications, they often lack realism in portraying brittle fractures. This paper introduces a novel learning-based approach for seamlessly merging realistic brittle fracture animations with rigid-body simulations. Our method utilizes BEM brittle fracture simulations to create fractured patterns and collision conditions for a given shape, which serve as training data for the learning process. To effectively integrate collision conditions and fractured shapes into a deep learning framework, we introduce the concept of latent impulse representation and geometrically-segmented signed distance function (GS-SDF). The latent impulse representation serves as input, capturing information about impact forces on the shape's surface. Simultaneously, a GS-SDF is used as the output representation of the fractured shape. To address the challenge of optimizing multiple fractured pattern targets with a single latent code, we propose an eight-dimensional latent space based on a normal distribution code within our latent impulse representation design. This adaptation effectively transforms our neural network into a generative one. Our experimental results demonstrate that our approach can generate significantly more detailed brittle fractures compared to existing techniques, all while maintaining commendable computational efficiency during run-time.

CylinderTag: An Accurate and Flexible Marker for Cylinder-Shape Objects Pose Estimation Based on Projective Invariants

  • paper_url: http://arxiv.org/abs/2310.13320
  • repo_url: https://github.com/wsakobe/cylindertag
  • paper_authors: Shaoan Wang, Mingzhu Zhu, Yaoqing Hu, Dongyue Li, Fusong Yuan, Junzhi Yu
  • for: High-precision, marker-based pose estimation for objects with cylindrical surfaces, where traditional flat markers are poorly suited.
  • methods: A new visual marker, CylinderTag, designed for developable curved surfaces such as cylinders; it exploits the cross-ratio, a projective invariant, to encode information along the direction of zero curvature, and is accompanied by a heuristic-search-based marker generator and a high-performance recognizer.
  • results: Extensive experiments covering detection rate, detection speed, dictionary size, localization jitter, and pose estimation accuracy show superior detection performance across viewing angles, higher localization accuracy, real-time detection, and a large marker dictionary, making CylinderTag practical for a wide range of applications.
    Abstract High-precision pose estimation based on visual markers has been a thriving research topic in the field of computer vision. However, the suitability of traditional flat markers on curved objects is limited due to the diverse shapes of curved surfaces, which hinders the development of high-precision pose estimation for curved objects. Therefore, this paper proposes a novel visual marker called CylinderTag, which is designed for developable curved surfaces such as cylindrical surfaces. CylinderTag is a cyclic marker that can be firmly attached to objects with a cylindrical shape. Leveraging the manifold assumption, the cross-ratio in projective invariance is utilized for encoding in the direction of zero curvature on the surface. Additionally, to facilitate the usage of CylinderTag, we propose a heuristic search-based marker generator and a high-performance recognizer as well. Moreover, an all-encompassing evaluation of CylinderTag properties is conducted by means of extensive experimentation, covering detection rate, detection speed, dictionary size, localization jitter, and pose estimation accuracy. CylinderTag showcases superior detection performance from varying view angles in comparison to traditional visual markers, accompanied by higher localization accuracy. Furthermore, CylinderTag boasts real-time detection capability and an extensive marker dictionary, offering enhanced versatility and practicality in a wide range of applications. Experimental results demonstrate that the CylinderTag is a highly promising visual marker for use on cylindrical-like surfaces, thus offering important guidance for future research on high-precision visual localization of cylinder-shaped objects. The code is available at: https://github.com/wsakobe/CylinderTag.
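The projective invariant that the marker encodes along the zero-curvature direction is the cross-ratio of four collinear points, which survives perspective projection and can therefore be measured directly in the image. The snippet below computes it in its standard form; the marker's concrete parameterization may differ.

```python
import numpy as np

def cross_ratio(a, b, c, d) -> float:
    """Cross-ratio (A, B; C, D) of four collinear 2D points.

    Defined as (|AC| * |BD|) / (|BC| * |AD|); invariant under projective
    transformations, so it can be measured directly in the image plane.
    """
    a, b, c, d = (np.asarray(p, dtype=float) for p in (a, b, c, d))
    ac = np.linalg.norm(c - a)
    bd = np.linalg.norm(d - b)
    bc = np.linalg.norm(c - b)
    ad = np.linalg.norm(d - a)
    return (ac * bd) / (bc * ad)

# Example: four equally spaced collinear points give cross-ratio 4/3.
print(cross_ratio((0, 0), (1, 0), (2, 0), (3, 0)))  # 1.333...
```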

Non-Negative Spherical Relaxations for Universe-Free Multi-Matching and Clustering

  • paper_url: http://arxiv.org/abs/2310.13311
  • repo_url: None
  • paper_authors: Johan Thunberg, Florian Bernard
  • for: Optimization problems over binary matrices with injectivity constraints, in particular multi-matching and clustering.
  • methods: Relax the binary matrix constraints to the (high-dimensional) non-negative sphere and optimize the relaxed problem with a conditional power iteration, while sweeping a continuous scalar parameter that is (indirectly) related to the universe size (number of clusters).
  • results: Unlike spectral multi-matching and spectral clustering, the method needs no additional post-processing to obtain binary results and does not require fixing the integer universe size before optimization; it shows compelling results in various multi-matching and clustering settings, even compared to methods that use the ground-truth universe size.
    Abstract We propose a novel non-negative spherical relaxation for optimization problems over binary matrices with injectivity constraints, which in particular has applications in multi-matching and clustering. We relax respective binary matrix constraints to the (high-dimensional) non-negative sphere. To optimize our relaxed problem, we use a conditional power iteration method to iteratively improve the objective function, while at same time sweeping over a continuous scalar parameter that is (indirectly) related to the universe size (or number of clusters). Opposed to existing procedures that require to fix the integer universe size before optimization, our method automatically adjusts the analogous continuous parameter. Furthermore, while our approach shares similarities with spectral multi-matching and spectral clustering, our formulation has the strong advantage that we do not rely on additional post-processing procedures to obtain binary results. Our method shows compelling results in various multi-matching and clustering settings, even when compared to methods that use the ground truth universe size (or number of clusters).
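A non-negative spherical relaxation can be optimized with a projected (conditional) power iteration: multiply by the affinity matrix, clip negative entries, and renormalize onto the unit sphere. The sketch below shows that generic scheme for a symmetric non-negative affinity matrix; the paper's method additionally sweeps a continuous universe-size parameter, which is omitted here.

```python
import numpy as np

def nonneg_power_iteration(W: np.ndarray, num_iters: int = 100, seed: int = 0) -> np.ndarray:
    """Maximize x^T W x over the non-negative unit sphere (a generic sketch).

    W : (n, n) symmetric affinity matrix with non-negative entries.
    Returns x >= 0 with ||x|| = 1.
    """
    rng = np.random.default_rng(seed)
    x = rng.random(W.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(num_iters):
        x = W @ x
        x = np.maximum(x, 0.0)           # project onto the non-negative orthant
        x /= np.linalg.norm(x) + 1e-12   # project back onto the unit sphere
    return x
```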

CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training

  • paper_url: http://arxiv.org/abs/2310.13292
  • repo_url: None
  • paper_authors: Kihyun You, Jawook Gu, Jiyeon Ham, Beomhee Park, Jiho Kim, Eun Kyoung Hong, Woonhyunk Baek, Byungseok Roh
  • for: Advance vision-language pre-training (VLP) for chest X-rays, enabling zero-shot or few-shot classification without costly annotation in the medical domain, where image-text data are scarce.
  • methods: Expand image-label pairs into image-text pairs via general prompts, exploit multiple images and multiple sections of a radiology report, and design two contrastive losses (ICL and TCL) to learn study-level characteristics of medical images and reports, respectively.
  • results: The model outperforms state-of-the-art models trained under the same conditions; the enlarged dataset improves the discriminative power of the pre-trained model for classification at the cost of marginal retrieval performance.
    Abstract A large-scale image-text pair dataset has greatly contributed to the development of vision-language pre-training (VLP) models, which enable zero-shot or few-shot classification without costly annotation. However, in the medical domain, the scarcity of data remains a significant challenge for developing a powerful VLP model. In this paper, we tackle the lack of image-text data in chest X-ray by expanding image-label pair as image-text pair via general prompt and utilizing multiple images and multiple sections in a radiologic report. We also design two contrastive losses, named ICL and TCL, for learning study-level characteristics of medical images and reports, respectively. Our model outperforms the state-of-the-art models trained under the same conditions. Also, enlarged dataset improve the discriminative power of our pre-trained model for classification, while sacrificing marginal retrieval performance. Code is available at https://github.com/kakaobrain/cxr-clip.
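Image-text contrastive objectives of the ICL/TCL kind build on the standard symmetric InfoNCE loss over a batch of paired embeddings. The PyTorch sketch below shows that generic CLIP-style loss; the paper's ICL and TCL terms, which operate on multiple images and report sections, may differ in detail.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE loss for a batch of paired image/text embeddings (B, D)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)       # match each image to its text
    loss_t2i = F.cross_entropy(logits.t(), targets)   # and each text to its image
    return 0.5 * (loss_i2t + loss_t2i)
```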

Pathologist-Like Explanations Unveiled: an Explainable Deep Learning System for White Blood Cell Classification

  • paper_url: http://arxiv.org/abs/2310.13279
  • repo_url: None
  • paper_authors: Aditya Shankar Pal, Debojyoti Biswas, Joy Mahapatra, Debasis Banerjee, Prantar Chakrabarti, Utpal Garain
  • For: Develop an explainable deep learning model that improves the accuracy and interpretability of white blood cell (WBC) classification.
  • Methods: A deep neural network that performs WBC classification, localization, and segmentation while producing pathologist-like explanations based on five attributes: granularity, cytoplasm color, nucleus shape, size relative to red blood cells, and nucleus-to-cytoplasm (N:C) ratio.
  • Results: Trained and evaluated on a new dataset (LeukoX, 467 blood smear images covering 10 WBC types), the model reaches 81.08% average classification accuracy and an 89.16% Jaccard index for cell localization, predicts the five attributes accurately (normalized mean square error of 0.0317 for the N:C ratio and over 80% accuracy for the other four), and its classification accuracy is unaffected by its ability to provide explanations.
    Abstract White blood cells (WBCs) play a crucial role in safeguarding the human body against pathogens and foreign substances. Leveraging the abundance of WBC imaging data and the power of deep learning algorithms, automated WBC analysis has the potential for remarkable accuracy. However, the capability of deep learning models to explain their WBC classification remains largely unexplored. In this study, we introduce HemaX, an explainable deep neural network-based model that produces pathologist-like explanations using five attributes: granularity, cytoplasm color, nucleus shape, size relative to red blood cells, and nucleus to cytoplasm ratio (N:C), along with cell classification, localization, and segmentation. HemaX is trained and evaluated on a novel dataset, LeukoX, comprising 467 blood smear images encompassing ten (10) WBC types. The proposed model achieves impressive results, with an average classification accuracy of 81.08% and a Jaccard index of 89.16% for cell localization. Additionally, HemaX performs well in generating the five explanations with a normalized mean square error of 0.0317 for N:C ratio and over 80% accuracy for the other four attributes. Comprehensive experiments comparing against multiple state-of-the-art models demonstrate that HemaX's classification accuracy remains unaffected by its ability to provide explanations. Moreover, empirical analyses and validation by expert hematologists confirm the faithfulness of explanations predicted by our proposed model.

DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model Statistics

  • paper_url: http://arxiv.org/abs/2310.13268
  • repo_url: https://github.com/thu-ml/dpm-solver-v3
  • paper_authors: Kaiwen Zheng, Cheng Lu, Jianfei Chen, Jun Zhu
  • for: DPM-Solver-v3 is designed to improve the efficiency of high-fidelity image generation with diffusion probabilistic models (DPMs).
  • methods: A new fast ODE solver for DPMs that minimizes the first-order discretization error of the ODE solution, using empirical model statistics computed on the pretrained model together with multistep methods and a predictor-corrector framework.
  • results: DPM-Solver-v3 achieves consistently better or comparable performance in both unconditional and conditional sampling with pixel-space and latent-space DPMs, especially at 5-10 NFEs: FIDs of 12.21 (5 NFE) and 2.51 (10 NFE) on unconditional CIFAR10, MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable Diffusion, and a 15%-30% speed-up over previous state-of-the-art training-free methods.
    Abstract Diffusion probabilistic models (DPMs) have exhibited excellent performance for high-fidelity image generation while suffering from inefficient sampling. Recent works accelerate the sampling procedure by proposing fast ODE solvers that leverage the specific ODE form of DPMs. However, they highly rely on specific parameterization during inference (such as noise/data prediction), which might not be the optimal choice. In this work, we propose a novel formulation towards the optimal parameterization during sampling that minimizes the first-order discretization error of the ODE solution. Based on such formulation, we propose DPM-Solver-v3, a new fast ODE solver for DPMs by introducing several coefficients efficiently computed on the pretrained model, which we call empirical model statistics. We further incorporate multistep methods and a predictor-corrector framework, and propose some techniques for improving sample quality at small numbers of function evaluations (NFE) or large guidance scales. Experiments show that DPM-Solver-v3 achieves consistently better or comparable performance in both unconditional and conditional sampling with both pixel-space and latent-space DPMs, especially in 5$\sim$10 NFEs. We achieve FIDs of 12.21 (5 NFE), 2.51 (10 NFE) on unconditional CIFAR10, and MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable Diffusion, bringing a speed-up of 15%$\sim$30% compared to previous state-of-the-art training-free methods. Code is available at https://github.com/thu-ml/DPM-Solver-v3.
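For context, a first-order step of the diffusion ODE in the noise-prediction parameterization (the deterministic DDIM update, one of the baselines such solvers improve on) looks as follows. This is explicitly not DPM-Solver-v3, which additionally uses empirical model statistics, multistep updates, and a predictor-corrector scheme.

```python
import torch

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic first-order (DDIM) step of the diffusion ODE.

    x_t            : current noisy sample
    eps_pred       : noise predicted by the pretrained model at step t
    alpha_bar_t    : cumulative noise-schedule value at t
    alpha_bar_prev : cumulative noise-schedule value at the next (earlier) step
    """
    x0_pred = (x_t - (1 - alpha_bar_t) ** 0.5 * eps_pred) / alpha_bar_t ** 0.5
    return alpha_bar_prev ** 0.5 * x0_pred + (1 - alpha_bar_prev) ** 0.5 * eps_pred

# Sampling sketch (eps_model and the schedule alpha_bar are assumed):
# for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
#     eps = eps_model(x, t)
#     x = ddim_step(x, eps, alpha_bar[t], alpha_bar[t_prev])
```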

UE4-NeRF:Neural Radiance Field for Real-Time Rendering of Large-Scale Scene

  • paper_url: http://arxiv.org/abs/2310.13263
  • repo_url: None
  • paper_authors: Jiaming Gu, Minchao Jiang, Hongsheng Li, Xiaoyuan Lu, Guangming Zhu, Syed Afaq Ali Shah, Liang Zhang, Mohammed Bennamoun
  • For: This paper proposes a novel neural rendering system called UE4-NeRF for real-time rendering of large-scale scenes using NeRF technology.
  • Methods: The proposed method partitions each large scene into multiple sub-NeRFs, uses polygonal meshes to represent the scene, and trains meshes of varying levels of detail for different observation levels.
  • Results: The proposed method achieves real-time rendering of large-scale scenes at 4K resolution with a frame rate of up to 43 FPS, with rendering quality comparable to state-of-the-art approaches.
    Abstract Neural Radiance Fields (NeRF) is a novel implicit 3D reconstruction method that shows immense potential and has been gaining increasing attention. It enables the reconstruction of 3D scenes solely from a set of photographs. However, its real-time rendering capability, especially for interactive real-time rendering of large-scale scenes, still has significant limitations. To address these challenges, in this paper, we propose a novel neural rendering system called UE4-NeRF, specifically designed for real-time rendering of large-scale scenes. We partitioned each large scene into different sub-NeRFs. In order to represent the partitioned independent scene, we initialize polygonal meshes by constructing multiple regular octahedra within the scene and the vertices of the polygonal faces are continuously optimized during the training process. Drawing inspiration from Level of Detail (LOD) techniques, we trained meshes of varying levels of detail for different observation levels. Our approach combines with the rasterization pipeline in Unreal Engine 4 (UE4), achieving real-time rendering of large-scale scenes at 4K resolution with a frame rate of up to 43 FPS. Rendering within UE4 also facilitates scene editing in subsequent stages. Furthermore, through experiments, we have demonstrated that our method achieves rendering quality comparable to state-of-the-art approaches. Project page: https://jamchaos.github.io/UE4-NeRF/.

Domain-specific optimization and diverse evaluation of self-supervised models for histopathology

  • paper_url: http://arxiv.org/abs/2310.13259
  • repo_url: None
  • paper_authors: Jeremy Lai, Faruk Ahmed, Supriya Vijay, Tiam Jaroensri, Jessica Loo, Saurabh Vyawahare, Saloni Agarwal, Fayaz Jamil, Yossi Matias, Greg S. Corrado, Dale R. Webster, Jonathan Krause, Yun Liu, Po-Hsuan Cameron Chen, Ellery Wulczyn, David F. Steiner
  • for: Foundation models for histopathology that reduce the data, compute, and expertise needed to build task-specific models for diagnosis, clinical research, and precision medicine.
  • methods: Self-supervised learning (SSL), evaluated on a diverse benchmark of tasks spanning 17 tissue types and 12 cancer types with different optimal magnifications and task types, followed by further evaluation on held-out patch-level and weakly supervised tasks.
  • results: Standard SSL methods, thoughtfully applied to histopathology images, perform well across the benchmark tasks, and domain-specific methodological improvements further increase performance; the work establishes a set of high-quality foundation models to enable further research across diverse applications.
    Abstract Task-specific deep learning models in histopathology offer promising opportunities for improving diagnosis, clinical research, and precision medicine. However, development of such models is often limited by availability of high-quality data. Foundation models in histopathology that learn general representations across a wide range of tissue types, diagnoses, and magnifications offer the potential to reduce the data, compute, and technical expertise necessary to develop task-specific deep learning models with the required level of model performance. In this work, we describe the development and evaluation of foundation models for histopathology via self-supervised learning (SSL). We first establish a diverse set of benchmark tasks involving 17 unique tissue types and 12 unique cancer types and spanning different optimal magnifications and task types. Next, we use this benchmark to explore and evaluate histopathology-specific SSL methods followed by further evaluation on held out patch-level and weakly supervised tasks. We found that standard SSL methods thoughtfully applied to histopathology images are performant across our benchmark tasks and that domain-specific methodological improvements can further increase performance. Our findings reinforce the value of using domain-specific SSL methods in pathology, and establish a set of high quality foundation models to enable further research across diverse applications.

Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds

  • paper_url: http://arxiv.org/abs/2310.13255
  • repo_url: None
  • paper_authors: Sipeng Zheng, Jiazheng Liu, Yicheng Feng, Zongqing Lu
  • for: Equip LLM-based embodied agents with visual perception: prior efforts overlook the visual richness of open worlds, making the interaction akin to a blindfolded text-based game and hindering intuitive understanding of the environment and generation of easy-to-understand responses.
  • methods: Steve-Eye, an end-to-end trained large multimodal model that couples an LLM with a visual encoder to process visual-text inputs and generate multimodal feedback, trained on 850K open-world instruction pairs collected with a semi-automatic strategy to cover three essential functions: multimodal perception, a foundational knowledge base, and skill prediction and planning.
  • results: Three open-world evaluation benchmarks are developed, and extensive experiments from a wide range of perspectives validate the model's ability to act and plan strategically.
    Abstract Recent studies have presented compelling evidence that large language models (LLMs) can equip embodied agents with the self-driven capability to interact with the world, which marks an initial step toward versatile robotics. However, these efforts tend to overlook the visual richness of open worlds, rendering the entire interactive process akin to "a blindfolded text-based game." Consequently, LLM-based agents frequently encounter challenges in intuitively comprehending their surroundings and producing responses that are easy to understand. In this paper, we propose Steve-Eye, an end-to-end trained large multimodal model designed to address this limitation. Steve-Eye integrates the LLM with a visual encoder which enables it to process visual-text inputs and generate multimodal feedback. In addition, we use a semi-automatic strategy to collect an extensive dataset comprising 850K open-world instruction pairs, empowering our model to encompass three essential functions for an agent: multimodal perception, foundational knowledge base, and skill prediction and planning. Lastly, we develop three open-world evaluation benchmarks, then carry out extensive experiments from a wide range of perspectives to validate our model's capability to strategically act and plan. Codes and datasets will be released.

Diagnosis-oriented Medical Image Compression with Efficient Transfer Learning

  • paper_url: http://arxiv.org/abs/2310.13250
  • repo_url: None
  • paper_authors: Guangqi Xie, Xin Li, Xiaohan Pan, Zhibo Chen
  • for: Remote medical diagnosis, where medical images must be compressed and transmitted efficiently without compromising diagnostic accuracy.
  • methods: Diagnosis-oriented medical image compression via efficient transfer learning (DMIC): only part of the policy network for bit allocation within an existing reinforcement-learning-based task-driven semantic coding framework (HRLVSC) is tuned, so the codec can be optimized with only a few annotated medical examples.
  • results: On coronary artery segmentation, DMIC achieves 47.594% BD-Rate savings over the HEVC anchor by tuning only the A2C module (2.7% of the policy network's parameters) with a single medical sample.
    Abstract Remote medical diagnosis has emerged as a critical and indispensable technique in practical medical systems, where medical data are required to be efficiently compressed and transmitted for diagnosis by either professional doctors or intelligent diagnosis devices. In this process, a large amount of redundant content irrelevant to the diagnosis is subjected to high-fidelity coding, leading to unnecessary transmission costs. To mitigate this, we propose diagnosis-oriented medical image compression, a special semantic compression task designed for medical scenarios, targeting to reduce the compression cost without compromising the diagnosis accuracy. However, collecting sufficient medical data to optimize such a compression system is significantly expensive and challenging due to privacy issues and the lack of professional annotation. In this study, we propose DMIC, the first efficient transfer learning-based codec, for diagnosis-oriented medical image compression, which can be effectively optimized with only few-shot annotated medical examples, by reusing the knowledge in the existing reinforcement learning-based task-driven semantic coding framework, i.e., HRLVSC [1]. Concretely, we focus on tuning only the partial parameters of the policy network for bit allocation within HRLVSC, which enables it to adapt to the medical images. In this work, we validate our DMIC with the typical medical task, Coronary Artery Segmentation. Extensive experiments have demonstrated that our DMIC can achieve 47.594%BD-Rate savings compared to the HEVC anchor, by tuning only the A2C module (2.7% parameters) of the policy network with only 1 medical sample.
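Tuning only a small sub-module of a pretrained policy network, as done here for the bit-allocation module, amounts to freezing everything else before building the optimizer. The PyTorch sketch below shows that generic pattern; the attribute name `a2c` is a placeholder, not the actual name used in the HRLVSC codebase.

```python
import torch

def tune_submodule_only(model: torch.nn.Module, submodule_name: str = "a2c", lr: float = 1e-4):
    """Freeze all parameters except those of one named sub-module.

    `submodule_name` is a hypothetical attribute name; adapt it to the
    actual policy-network implementation.
    """
    for param in model.parameters():
        param.requires_grad_(False)
    submodule = getattr(model, submodule_name)
    for param in submodule.parameters():
        param.requires_grad_(True)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)

# With only a few percent of the policy network trainable, a handful of
# annotated medical examples can suffice to adapt bit allocation to the domain.
```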

Auxiliary Features-Guided Super Resolution for Monte Carlo Rendering

  • paper_url: http://arxiv.org/abs/2310.13235
  • repo_url: None
  • paper_authors: Qiqi Hou, Feng Liu
  • for: Speed up Monte Carlo rendering by rendering at a lower resolution and super-resolving, thereby reducing the number of pixels that must be rendered.
  • methods: High-resolution auxiliary features, which can be rendered quickly by the rendering engine, guide the super resolution of low-resolution renderings; a cross-modality Transformer network with an auxiliary feature branch and a low-resolution rendering branch fuses them, and residual densely-connected Swin Transformer groups extract representative features.
  • results: The auxiliary-features-guided super-resolution method produces higher-quality renderings than both super-resolution methods and Monte Carlo denoising methods.
    Abstract This paper investigates super resolution to reduce the number of pixels to render and thus speed up Monte Carlo rendering algorithms. While great progress has been made to super resolution technologies, it is essentially an ill-posed problem and cannot recover high-frequency details in renderings. To address this problem, we exploit high-resolution auxiliary features to guide super resolution of low-resolution renderings. These high-resolution auxiliary features can be quickly rendered by a rendering engine and at the same time provide valuable high-frequency details to assist super resolution. To this end, we develop a cross-modality Transformer network that consists of an auxiliary feature branch and a low-resolution rendering branch. These two branches are designed to fuse high-resolution auxiliary features with the corresponding low-resolution rendering. Furthermore, we design residual densely-connected Swin Transformer groups to learn to extract representative features to enable high-quality super-resolution. Our experiments show that our auxiliary features-guided super-resolution method outperforms both super-resolution methods and Monte Carlo denoising methods in producing high-quality renderings.

PTSR: Patch Translator for Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2310.13216
  • repo_url: None
  • paper_authors: Neeraj Baghel, Shiv Ram Dubey, Satish Kumar Singh
  • for: Image super-resolution with reduced computational and memory cost.
  • methods: A transformer-based GAN with no convolution operations; a novel patch translator module regenerates improved patches using multi-head attention, which the generator then uses to produce 2x and 4x super-resolution images.
  • results: On DIV2K, Set5, Set14, and BSD100, the model improves over the best competing models for 4x super-resolution by 21.66% in PSNR and 11.59% in SSIM on average; analyses of the proposed loss and saliency maps further demonstrate the method's effectiveness.
    Abstract Image super-resolution generation aims to generate a high-resolution image from its low-resolution image. However, more complex neural networks bring high computational costs and memory storage. It is still an active area for offering the promise of overcoming resolution limitations in many applications. In recent years, transformers have made significant progress in computer vision tasks as their robust self-attention mechanism. However, recent works on the transformer for image super-resolution also contain convolution operations. We propose a patch translator for image super-resolution (PTSR) to address this problem. The proposed PTSR is a transformer-based GAN network with no convolution operation. We introduce a novel patch translator module for regenerating the improved patches utilising multi-head attention, which is further utilised by the generator to generate the 2x and 4x super-resolution images. The experiments are performed using benchmark datasets, including DIV2K, Set5, Set14, and BSD100. The results of the proposed model is improved on an average for $4\times$ super-resolution by 21.66% in PNSR score and 11.59% in SSIM score, as compared to the best competitive models. We also analyse the proposed loss and saliency map to show the effectiveness of the proposed method.

Zone Evaluation: Revealing Spatial Bias in Object Detection

  • paper_url: http://arxiv.org/abs/2310.13215
  • repo_url: https://github.com/zzh-tju/zoneeval
  • paper_authors: Zhaohui Zheng, Yuming Chen, Qibin Hou, Xiang Li, Ping Wang, Ming-Ming Cheng
  • for: Study the spatial bias of object detectors and propose a new zone evaluation protocol for measuring detection performance.
  • methods: Extend traditional evaluation to a more generalized, zone-wise protocol that measures detection performance over image zones, yielding a series of Zone Precisions (ZPs); 10 popular object detectors and 5 detection datasets are evaluated, and heuristic experiments probe where the spatial bias comes from.
  • results: Detectors perform quite unevenly across zones: performance in the 96% border zone of the image does not reach the overall AP value; the bias is traced not to object scale or absolute position but to human-imperceptible divergences in data patterns between zones, motivating the spatial disequilibrium problem of pursuing balanced detection over the entire image.
    Abstract A fundamental limitation of object detectors is that they suffer from "spatial bias", and in particular perform less satisfactorily when detecting objects near image borders. For a long time, there has been a lack of effective ways to measure and identify spatial bias, and little is known about where it comes from and what degree it is. To this end, we present a new zone evaluation protocol, extending from the traditional evaluation to a more generalized one, which measures the detection performance over zones, yielding a series of Zone Precisions (ZPs). For the first time, we provide numerical results, showing that the object detectors perform quite unevenly across the zones. Surprisingly, the detector's performance in the 96\% border zone of the image does not reach the AP value (Average Precision, commonly regarded as the average detection performance in the entire image zone). To better understand spatial bias, a series of heuristic experiments are conducted. Our investigation excludes two intuitive conjectures about spatial bias that the object scale and the absolute positions of objects barely influence the spatial bias. We find that the key lies in the human-imperceptible divergence in data patterns between objects in different zones, thus eventually forming a visible performance gap between the zones. With these findings, we finally discuss a future direction for object detection, namely, spatial disequilibrium problem, aiming at pursuing a balanced detection ability over the entire image zone. By broadly evaluating 10 popular object detectors and 5 detection datasets, we shed light on the spatial bias of object detectors. We hope this work could raise a focus on detection robustness. The source codes, evaluation protocols, and tutorials are publicly available at \url{https://github.com/Zzh-tju/ZoneEval}.
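Zone evaluation only changes which ground truths and detections are counted: each box is assigned to a zone according to how close its center lies to the image border, and AP is computed per zone. The sketch below shows one plausible zone-assignment rule; the exact zone definition in the paper may differ.

```python
def zone_index(box, image_w, image_h, num_zones=5):
    """Assign a detection/ground-truth box to an annular zone.

    Zone 0 hugs the image border, the last zone is the central region.
    The rule here (normalized distance of the box center to the nearest
    border) is a plausible stand-in for the paper's exact definition.
    """
    x1, y1, x2, y2 = box
    cx, cy = 0.5 * (x1 + x2), 0.5 * (y1 + y2)
    # distance to the nearest border, normalized so the image center maps to 1.0
    d = min(cx, image_w - cx, cy, image_h - cy)
    d_norm = d / (0.5 * min(image_w, image_h))
    return min(int(d_norm * num_zones), num_zones - 1)

# Per-zone AP sketch: bucket ground truths and detections by zone_index, then
# run the usual COCO/VOC AP computation inside each bucket to obtain each ZP.
```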

Identification of Abnormality in Maize Plants From UAV Images Using Deep Learning Approaches

  • paper_url: http://arxiv.org/abs/2310.13201
  • repo_url: None
  • paper_authors: Aminul Huq, Dimitris Zermas, George Bebis
  • for: Early identification of abnormalities in maize plants, which is important for healthy growth and high yields; since farming lands are typically very large, manual inspection of vast areas is impractical, and precision agriculture benefits from modern computer vision tools.
  • methods: Deep learning techniques that detect different levels of abnormality (low, medium, high, or none) in UAV images of maize plants, independently of growth stage; the system can also help human annotators focus ground-truth collection on a much smaller set of images. Two complementary formulations are explored, treating abnormality detection as a classification problem and as a regression problem.
  • results: Promising preliminary results on a publicly available dataset exhibiting mostly nitrogen deficiency: 88.89% detection accuracy for low abnormality and 100% detection accuracy for no abnormality.
    Abstract Early identification of abnormalities in plants is an important task for ensuring proper growth and achieving high yields from crops. Precision agriculture can significantly benefit from modern computer vision tools to make farming strategies addressing these issues efficient and effective. As farming lands are typically quite large, farmers have to manually check vast areas to determine the status of the plants and apply proper treatments. In this work, we consider the problem of automatically identifying abnormal regions in maize plants from images captured by a UAV. Using deep learning techniques, we have developed a methodology which can detect different levels of abnormality (i.e., low, medium, high or no abnormality) in maize plants independently of their growth stage. The primary goal is to identify anomalies at the earliest possible stage in order to maximize the effectiveness of potential treatments. At the same time, the proposed system can provide valuable information to human annotators for ground truth data collection by helping them to focus their attention on a much smaller set of images only. We have experimented with two different but complimentary approaches, the first considering abnormality detection as a classification problem and the second considering it as a regression problem. Both approaches can be generalized to different types of abnormalities and do not make any assumption about the abnormality occurring at an early plant growth stage which might be easier to detect due to the plants being smaller and easier to separate. As a case study, we have considered a publicly available data set which exhibits mostly Nitrogen deficiency in maize plants of various growth stages. We are reporting promising preliminary results with an 88.89\% detection accuracy of low abnormality and 100\% detection accuracy of no abnormality.