cs.CV - 2023-10-14

What Do Deep Saliency Models Learn about Visual Attention?

  • paper_url: http://arxiv.org/abs/2310.09679
  • repo_url: https://github.com/szzexpoi/saliency_analysis
  • paper_authors: Shi Chen, Ming Jiang, Qi Zhao
  • for: This paper investigates how deep saliency models predict human visual attention and what mechanisms lie behind their success.
  • methods: The paper presents a new analytic framework that sheds light on the implicit features learned by deep saliency models and provides a principled interpretation and quantification of their contributions. The framework decomposes the implicit features into interpretable bases and reformulates saliency prediction as a weighted combination of probability maps connecting the bases and saliency.
  • results: Applying the framework, the authors conduct extensive analyses covering the positive and negative weights of semantics in saliency prediction, the impact of training data and architectural design, the progressive influence of fine-tuning, and common failure patterns of deep saliency models. They further analyze visual attention characteristics in several application scenarios, such as the atypical attention of people with autism spectrum disorder, attention to emotion-eliciting stimuli, and attention evolution over time.
    Abstract In recent years, deep saliency models have made significant progress in predicting human visual attention. However, the mechanisms behind their success remain largely unexplained due to the opaque nature of deep neural networks. In this paper, we present a novel analytic framework that sheds light on the implicit features learned by saliency models and provides principled interpretation and quantification of their contributions to saliency prediction. Our approach decomposes these implicit features into interpretable bases that are explicitly aligned with semantic attributes and reformulates saliency prediction as a weighted combination of probability maps connecting the bases and saliency. By applying our framework, we conduct extensive analyses from various perspectives, including the positive and negative weights of semantics, the impact of training data and architectural designs, the progressive influences of fine-tuning, and common failure patterns of state-of-the-art deep saliency models. Additionally, we demonstrate the effectiveness of our framework by exploring visual attention characteristics in various application scenarios, such as the atypical attention of people with autism spectrum disorder, attention to emotion-eliciting stimuli, and attention evolution over time. Our code is publicly available at \url{https://github.com/szzexpoi/saliency_analysis}.
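The reformulation above lends itself to a compact sketch. The code below is an illustrative guess at the shape of such a decomposition (tensor names, the sigmoid mapping, and all dimensions are assumptions, not the authors' implementation): features are compared against a bank of semantic bases to form per-basis probability maps, and saliency is their learned weighted combination.

```python
import torch

def saliency_from_bases(features, bases, weights):
    """Hypothetical sketch of basis-decomposed saliency prediction.

    features: (B, C, H, W) implicit feature maps from a saliency backbone.
    bases:    (K, C) interpretable basis vectors aligned with semantic attributes.
    weights:  (K,) learned contribution of each basis (can be negative).
    """
    B, C, H, W = features.shape
    flat = features.permute(0, 2, 3, 1).reshape(-1, C)        # (B*H*W, C)
    logits = flat @ bases.t()                                  # similarity to every basis
    prob = torch.sigmoid(logits).reshape(B, H, W, -1)          # per-basis probability maps
    prob = prob.permute(0, 3, 1, 2)                            # (B, K, H, W)
    # Saliency as a weighted combination of the probability maps.
    saliency = (weights.view(1, -1, 1, 1) * prob).sum(dim=1)   # (B, H, W)
    return prob, saliency

# Toy usage with random tensors.
feats = torch.randn(2, 64, 32, 32)
bases = torch.randn(20, 64)   # e.g. 20 semantic attributes
w = torch.randn(20)
prob_maps, sal = saliency_from_bases(feats, bases, w)
```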

Point-DynRF: Point-based Dynamic Radiance Fields from a Monocular Video

  • paper_url: http://arxiv.org/abs/2310.09647
  • repo_url: None
  • paper_authors: Byeongjun Park, Changick Kim
  • for: Generating novel views from a monocular video.
  • methods: Neural point clouds and dynamic radiance fields are used to learn the global scene geometry and the volume rendering process, respectively.
  • results: The method's effectiveness is validated on the NVIDIA Dynamic Scenes Dataset and several casually captured monocular video clips.
    Abstract Dynamic radiance fields have emerged as a promising approach for generating novel views from a monocular video. However, previous methods enforce the geometric consistency of dynamic radiance fields only between adjacent input frames, making it difficult to represent the global scene geometry and causing degradation at viewpoints that are spatio-temporally distant from the input camera trajectory. To solve this problem, we introduce point-based dynamic radiance fields (\textbf{Point-DynRF}), a novel framework where the global geometric information and the volume rendering process are trained by neural point clouds and dynamic radiance fields, respectively. Specifically, we reconstruct neural point clouds directly from geometric proxies and optimize both radiance fields and the geometric proxies using our proposed losses, allowing them to complement each other. We validate the effectiveness of our method with experiments on the NVIDIA Dynamic Scenes Dataset and several casually captured monocular video clips.

Dimma: Semi-supervised Low Light Image Enhancement with Adaptive Dimming

  • paper_url: http://arxiv.org/abs/2310.09633
  • repo_url: https://github.com/wojciechkoz/dimma
  • paper_authors: Wojciech Kozłowski, Michał Szachniewicz, Michał Stypułkowski, Maciej Zięba
  • for: Enhance low-light images while preserving their natural colors.
  • methods: A semi-supervised approach that uses a small set of image pairs to replicate scenes captured under extreme lighting conditions with a specific camera. A convolutional mixture density network generates distorted colors of the scene based on illumination differences, and the dimming factor is graded accurately, giving flexible control over brightness levels. A conditional UNet architecture generates images at the lightness level supplied by the user.
  • results: With only a few image pairs, the approach achieves results competitive with fully supervised methods; when trained on the full dataset, it surpasses state-of-the-art methods on some metrics and closely approaches them on others.
    Abstract Enhancing low-light images while maintaining natural colors is a challenging problem due to camera processing variations and limited access to photos with ground-truth lighting conditions. The latter is a crucial factor for supervised methods that achieve good results on paired datasets but do not handle out-of-domain data well. On the other hand, unsupervised methods, while able to generalize, often yield lower-quality enhancements. To fill this gap, we propose Dimma, a semi-supervised approach that aligns with any camera by utilizing a small set of image pairs to replicate scenes captured under extreme lighting conditions taken by that specific camera. We achieve that by introducing a convolutional mixture density network that generates distorted colors of the scene based on the illumination differences. Additionally, our approach enables accurate grading of the dimming factor, which provides a wide range of control and flexibility in adjusting the brightness levels during the low-light image enhancement process. To further improve the quality of our results, we introduce an architecture based on a conditional UNet. The lightness value provided by the user serves as the conditional input to generate images with the desired lightness. Our approach using only few image pairs achieves competitive results compared to fully supervised methods. Moreover, when trained on the full dataset, our model surpasses state-of-the-art methods in some metrics and closely approaches them in others.
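As a rough illustration of conditioning an enhancement network on a user-chosen lightness value, a minimal FiLM-style sketch is shown below; the real model is a conditional UNet, and the layer sizes, conditioning mechanism, and class name here are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class LightnessConditionedEnhancer(nn.Module):
    """Toy sketch: enhancement network conditioned on a target lightness value."""

    def __init__(self, ch=32):
        super().__init__()
        self.encode = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        # Map the scalar lightness to per-channel scale and shift (FiLM-style).
        self.film = nn.Linear(1, 2 * ch)
        self.decode = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                    nn.Conv2d(ch, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, low_light_img, lightness):
        # low_light_img: (B, 3, H, W), lightness: (B, 1) in [0, 1]
        h = self.encode(low_light_img)
        scale, shift = self.film(lightness).chunk(2, dim=1)
        h = h * scale[..., None, None] + shift[..., None, None]
        return self.decode(h)

model = LightnessConditionedEnhancer()
out = model(torch.rand(2, 3, 64, 64), torch.tensor([[0.7], [0.4]]))
```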

Time-based Mapping of Space Using Visual Motion Invariants

  • paper_url: http://arxiv.org/abs/2310.09632
  • repo_url: None
  • paper_authors: Juan D. Yepes, Daniel Raviv
  • for: This paper proposes a visual-motion-based representation of 3D points in which the stationary environment remains invariant, ensuring shape constancy.
  • methods: Nonlinear functions of measurable optical flow, related to geometric 3D invariants, are used to build a novel representation; the resulting invariants are called 'Time-Clearance' and the well-known 'Time-to-Contact' (TTC). Because these invariants remain constant over time, moving points that violate the expected constancy are straightforward to detect.
  • results: Simulations, including Unity-based simulations of a rectilinearly moving camera, demonstrate the effectiveness of the representation and show that moving points that do not adhere to the expected constancy are readily detected. The representation requires only one camera, does not need the magnitude of the camera's velocity vector, and is pixel-based, making it suitable for parallel processing.
    Abstract This paper focuses on visual motion-based invariants that result in a representation of 3D points in which the stationary environment remains invariant, ensuring shape constancy. This is achieved even as the images undergo constant change due to camera motion. Nonlinear functions of measurable optical flow, which are related to geometric 3D invariants, are utilized to create a novel representation. We refer to the resulting optical flow-based invariants as 'Time-Clearance' and the well-known 'Time-to-Contact' (TTC). Since these invariants remain constant over time, it becomes straightforward to detect moving points that do not adhere to the expected constancy. We present simulations of a camera moving relative to a 3D object, snapshots of its projected images captured by a rectilinearly moving camera, and the object as it appears unchanged in the new domain over time. In addition, Unity-based simulations demonstrate color-coded transformations of a projected 3D scene, illustrating how moving objects can be readily identified. This representation is straightforward, relying on simple optical flow functions. It requires only one camera, and there is no need to determine the magnitude of the camera's velocity vector. Furthermore, the representation is pixel-based, making it suitable for parallel processing.
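A worked example of the Time-to-Contact computation, assuming rectilinear camera translation so that the flow field expands radially from a known focus of expansion (taken near the image center); the function and its numerical safeguards are illustrative, not the authors' exact formulation.

```python
import numpy as np

def time_to_contact(u, v, foe=(0.0, 0.0)):
    """Per-pixel Time-to-Contact from optical flow (u, v), assuming the flow
    field expands radially from a focus of expansion `foe` (offset from the
    image center, in pixels), as for rectilinear translation toward the scene.

    u, v: (H, W) optical flow components in pixels/frame.
    Returns TTC in frames (large where the radial flow is small).
    """
    H, W = u.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    dx, dy = xs - (W / 2 + foe[0]), ys - (H / 2 + foe[1])
    radial_flow = dx * u + dy * v                 # projection of flow on the radial direction (times r)
    r_sq = dx * dx + dy * dy
    return r_sq / np.maximum(radial_flow, 1e-6)   # TTC = r / (dr/dt)

# Points whose TTC deviates from the locally expected constant value can be
# flagged as independently moving, in the spirit of the invariants above.
```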

Real-Time Traffic Sign Detection: A Case Study in a Santa Clara Suburban Neighborhood

  • paper_url: http://arxiv.org/abs/2310.09630
  • repo_url: None
  • paper_authors: Harish Loghashankar, Hieu Nguyen
  • for: This project aims to develop a real-time traffic sign detection system based on the YOLOv5 architecture and to recognize traffic signs in real time while driving through a suburban neighborhood.
  • methods: A YOLOv5 model is trained on a diverse dataset of traffic sign images and deployed on a hardware platform suitable for real-time inference.
  • results: In a case study, the system detected and recognized traffic signs from a real-time dashboard camera with 96% accuracy, indicating that it can provide timely and accurate traffic sign information, helping to improve road safety and traffic management and opening possibilities for further autonomous driving research.
    Abstract This research project aims to develop a real-time traffic sign detection system using the YOLOv5 architecture and deploy it for efficient traffic sign recognition during a drive in a suburban neighborhood. The project's primary objectives are to train the YOLOv5 model on a diverse dataset of traffic sign images and deploy the model on a suitable hardware platform capable of real-time inference. The project will involve collecting a comprehensive dataset of traffic sign images. By leveraging the trained YOLOv5 model, the system will detect and classify traffic signs from a real-time camera on a dashboard inside a vehicle. The performance of the deployed system will be evaluated based on its accuracy in detecting traffic signs, real-time processing speed, and overall reliability. During a case study in a suburban neighborhood, the system demonstrated a notable 96% accuracy in detecting traffic signs. This research's findings have the potential to improve road safety and traffic management by providing timely and accurate real-time information about traffic signs and can pave the way for further research into autonomous driving.
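A minimal sketch of the deployment loop described above, assuming custom YOLOv5 weights (the path best.pt is a placeholder) loaded through the official ultralytics/yolov5 torch.hub entry point and a dashboard camera exposed as a standard video device.

```python
import cv2
import torch

# Load YOLOv5 with custom traffic-sign weights (path is a placeholder).
model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')
model.conf = 0.4  # confidence threshold

cap = cv2.VideoCapture(0)  # dashboard camera
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    for *xyxy, conf, cls in results.xyxy[0].tolist():
        x1, y1, x2, y2 = map(int, xyxy)
        label = f"{model.names[int(cls)]} {conf:.2f}"
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, label, (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    cv2.imshow('traffic signs', frame)
    if cv2.waitKey(1) == 27:  # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```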

Detecting Moving Objects Using a Novel Optical-Flow-Based Range-Independent Invariant

  • paper_url: http://arxiv.org/abs/2310.09627
  • repo_url: None
  • paper_authors: Daniel Raviv, Juan D. Yepes, Ayush Gowda
  • for: This paper focuses on a novel method for detecting moving objects while the camera itself is in motion.
  • methods: An optical-flow-based transformation produces a consistent 2D invariant (lookup) image that is independent of the time instant, the range of points in 3D, and the camera speed. In this new domain, projections of 3D points whose values do not match the predefined lookup image are readily identified as moving relative to the stationary 3D environment. The method requires no knowledge of the camera's direction of motion or speed and no 3D range information, and it is well suited to real-time parallel processing, making it highly practical.
  • results: Simulations and experiments validate the effectiveness of the new domain and demonstrate its robustness under rectilinear camera motion, both in simulation and with real-world data. The approach opens a new direction for moving object detection during camera motion and lays the foundation for future work on detection under six-degrees-of-freedom camera motion.
    Abstract This paper focuses on a novel approach for detecting moving objects during camera motion. We present an optical-flow-based transformation that yields a consistent 2D invariant image output regardless of time instants, range of points in 3D, and the speed of the camera. In other words, this transformation generates a lookup image that remains invariant despite the changing projection of the 3D scene and camera motion. In the new domain, projections of 3D points that deviate from the values of the predefined lookup image can be clearly identified as moving relative to the stationary 3D environment, making them seamlessly detectable. The method does not require prior knowledge of the direction of motion or speed of the camera, nor does it necessitate 3D point range information. It is well-suited for real-time parallel processing, rendering it highly practical for implementation. We have validated the effectiveness of the new domain through simulations and experiments, demonstrating its robustness in scenarios involving rectilinear camera motion, both in simulations and with real-world data. This approach introduces new ways for moving objects detection during camera motion, and also lays the foundation for future research in the context of moving object detection during six-degrees-of-freedom camera motion.
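Given the precomputed lookup image the abstract describes, flagging moving pixels reduces to a thresholded deviation test; the invariant itself is paper-specific, so it is left as an abstract callable in this sketch, and the tolerance is an assumption.

```python
import numpy as np

def moving_object_mask(flow_u, flow_v, lookup, invariant_fn, tol=0.05):
    """Flag pixels whose optical-flow-based invariant deviates from the
    range-independent lookup image (the stationary-scene prediction).

    invariant_fn: callable mapping (flow_u, flow_v) -> (H, W) invariant values;
    its exact form is defined in the paper and is left abstract here.
    """
    measured = invariant_fn(flow_u, flow_v)
    return np.abs(measured - lookup) > tol * (np.abs(lookup) + 1e-6)
```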

JSMoCo: Joint Coil Sensitivity and Motion Correction in Parallel MRI with a Self-Calibrating Score-Based Diffusion Model

  • paper_url: http://arxiv.org/abs/2310.09625
  • repo_url: None
  • paper_authors: Lixuan Chen, Xuanyu Tian, Jiangjie Wu, Ruimin Feng, Guoyan Lao, Yuyao Zhang, Hongjiang Wei
  • for: correction of motion artifacts in MRI reconstruction
  • methods: joint estimation of motion parameters and coil sensitivity maps using score-based diffusion models and Gibbs sampler
  • results: high-quality MRI image reconstruction from sparsely-sampled k-space data, even in the presence of motion
    Abstract Magnetic Resonance Imaging (MRI) stands as a powerful modality in clinical diagnosis. However, it is known that MRI faces challenges such as long acquisition time and vulnerability to motion-induced artifacts. Despite the success of many existing motion correction algorithms, there has been limited research focused on correcting motion artifacts on the estimated coil sensitivity maps for fast MRI reconstruction. Existing methods might suffer from severe performance degradation due to error propagation resulting from the inaccurate coil sensitivity maps estimation. In this work, we propose to jointly estimate the motion parameters and coil sensitivity maps for under-sampled MRI reconstruction, referred to as JSMoCo. However, joint estimation of motion parameters and coil sensitivities results in a highly ill-posed inverse problem due to an increased number of unknowns. To address this, we introduce score-based diffusion models as powerful priors and leverage the MRI physical principles to efficiently constrain the solution space for this optimization problem. Specifically, we parameterize the rigid motion as three trainable variables and model coil sensitivity maps as polynomial functions. Leveraging the physical knowledge, we then employ Gibbs sampler for joint estimation, ensuring system consistency between sensitivity maps and desired images, avoiding error propagation from pre-estimated sensitivity maps to the reconstructed images. We conduct comprehensive experiments to evaluate the performance of JSMoCo on the fastMRI dataset. The results show that our method is capable of reconstructing high-quality MRI images from sparsely-sampled k-space data, even affected by motion. It achieves this by accurately estimating both motion parameters and coil sensitivities, effectively mitigating motion-related challenges during MRI reconstruction.

Learning Hierarchical Features with Joint Latent Space Energy-Based Prior

  • paper_url: http://arxiv.org/abs/2310.09604
  • repo_url: None
  • paper_authors: Jiali Cui, Ying Nian Wu, Tian Han
  • for: The fundamental problem of learning hierarchical representations with multi-layer generator models.
  • methods: A joint latent-space energy-based (EBM) prior model with multi-layer latent variables is proposed, trained with a variational joint learning scheme that integrates an inference model for efficient inference.
  • results: Experiments show that the model effectively captures hierarchical representations and models the data distribution.
    Abstract This paper studies the fundamental problem of multi-layer generator models in learning hierarchical representations. The multi-layer generator model that consists of multiple layers of latent variables organized in a top-down architecture tends to learn multiple levels of data abstraction. However, such multi-layer latent variables are typically parameterized to be Gaussian, which can be less informative in capturing complex abstractions, resulting in limited success in hierarchical representation learning. On the other hand, the energy-based (EBM) prior is known to be expressive in capturing the data regularities, but it often lacks the hierarchical structure to capture different levels of hierarchical representations. In this paper, we propose a joint latent space EBM prior model with multi-layer latent variables for effective hierarchical representation learning. We develop a variational joint learning scheme that seamlessly integrates an inference model for efficient inference. Our experiments demonstrate that the proposed joint EBM prior is effective and expressive in capturing hierarchical representations and modelling data distribution.
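Latent-space EBM priors are commonly sampled with short-run Langevin dynamics; the sketch below shows that generic step for a single latent layer (the paper's joint multi-layer formulation and network architectures are not reproduced, and the energy network here is a toy stand-in).

```python
import torch

def langevin_sample_prior(energy_net, z_init, steps=60, step_size=0.1):
    """Short-run Langevin sampling from a latent EBM prior
    p(z) proportional to exp(-E(z)) * N(z; 0, I).
    energy_net maps z -> per-sample scalar energy."""
    z = z_init.clone().detach().requires_grad_(True)
    for _ in range(steps):
        energy = energy_net(z).sum() + 0.5 * (z ** 2).sum()   # EBM energy + Gaussian reference
        grad, = torch.autograd.grad(energy, z)
        z = (z - 0.5 * step_size ** 2 * grad
             + step_size * torch.randn_like(z)).detach().requires_grad_(True)
    return z.detach()

# Toy usage: a tiny MLP as the energy function over 16-dim latents.
energy_net = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(),
                                 torch.nn.Linear(64, 1))
z = langevin_sample_prior(energy_net, torch.randn(8, 16))
```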

B-Spine: Learning B-Spline Curve Representation for Robust and Interpretable Spinal Curvature Estimation

  • paper_url: http://arxiv.org/abs/2310.09603
  • repo_url: https://github.com/whao22/B-Spine
  • paper_authors: Hao Wang, Qiang Song, Ruofeng Yin, Rui Ma, Yizhou Yu, Yi Chang
  • for: Propose a robust and interpretable method for spinal curvature estimation.
  • methods: A deep learning pipeline that learns a B-spline curve representation of the spine from low-quality X-ray images. A SegRefine network refines the initial spine segmentation, a mask-based B-spline prediction model predicts the centerline curve, and the Cobb angles are estimated by a hybrid of curve-slope analysis and a curve-based regression model.
  • results: Quantitative and qualitative comparisons with representative and state-of-the-art learning-based methods on the public AASCE2019 dataset and the newly proposed CJUH-JLU dataset, which contains more challenging low-quality images, show superior performance on both datasets, demonstrating the method's robustness and interpretability for spinal curvature estimation.
    Abstract Spinal curvature estimation is important to the diagnosis and treatment of the scoliosis. Existing methods face several issues such as the need of expensive annotations on the vertebral landmarks and being sensitive to the image quality. It is challenging to achieve robust estimation and obtain interpretable results, especially for low-quality images which are blurry and hazy. In this paper, we propose B-Spine, a novel deep learning pipeline to learn B-spline curve representation of the spine and estimate the Cobb angles for spinal curvature estimation from low-quality X-ray images. Given a low-quality input, a novel SegRefine network which employs the unpaired image-to-image translation is proposed to generate a high quality spine mask from the initial segmentation result. Next, a novel mask-based B-spline prediction model is proposed to predict the B-spline curve for the spine centerline. Finally, the Cobb angles are estimated by a hybrid approach which combines the curve slope analysis and a curve-based regression model. We conduct quantitative and qualitative comparisons with the representative and SOTA learning-based methods on the public AASCE2019 dataset and our new proposed CJUH-JLU dataset which contains more challenging low-quality images. The superior performance on both datasets shows our method can achieve both robustness and interpretability for spinal curvature estimation.
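To illustrate the curve-slope idea, a B-spline can be fit to sampled centerline points and a Cobb-style angle read off from the spread of tangent directions; scipy's spline routines stand in for the paper's learned B-spline predictor, and this proxy angle is not the authors' exact hybrid estimator.

```python
import numpy as np
from scipy.interpolate import splprep, splev

def cobb_angle_from_centerline(points, smooth=5.0):
    """points: (N, 2) spine centerline samples (x, y) from top to bottom.
    Fits a B-spline and returns a curve-slope proxy of the Cobb angle in degrees."""
    tck, _ = splprep([points[:, 0], points[:, 1]], s=smooth)
    u = np.linspace(0.0, 1.0, 200)
    dx, dy = splev(u, tck, der=1)               # tangent components along the curve
    angles = np.degrees(np.arctan2(dx, dy))     # tangent direction w.r.t. the vertical axis
    return angles.max() - angles.min()          # maximum tilt difference along the curve

# Toy usage: a gently S-shaped synthetic centerline.
y = np.linspace(0, 400, 40)
x = 20 * np.sin(y / 120.0)
print(cobb_angle_from_centerline(np.stack([x, y], axis=1)))
```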

Hawkeye: A PyTorch-based Library for Fine-Grained Image Recognition with Deep Learning

  • paper_url: http://arxiv.org/abs/2310.09600
  • repo_url: https://github.com/hawkeye-finegrained/hawkeye
  • paper_authors: Jiabei He, Yang Shen, Xiu-Shen Wei, Ye Wu
  • for: This work provides an open-source PyTorch-based library for Fine-Grained Image Recognition (FGIR) tasks.
  • methods: The library is built on deep learning with a modular architecture, emphasizing high-quality code and human-readable configuration. It implements 16 state-of-the-art fine-grained methods covering 6 different paradigms, allowing users to explore various approaches to FGIR.
  • results: According to the authors, Hawkeye is the first open-source PyTorch-based library dedicated to FGIR and offers a comprehensive solution for FGIR tasks.
    Abstract Fine-Grained Image Recognition (FGIR) is a fundamental and challenging task in computer vision and multimedia that plays a crucial role in Intellectual Economy and Industrial Internet applications. However, the absence of a unified open-source software library covering various paradigms in FGIR poses a significant challenge for researchers and practitioners in the field. To address this gap, we present Hawkeye, a PyTorch-based library for FGIR with deep learning. Hawkeye is designed with a modular architecture, emphasizing high-quality code and human-readable configuration, providing a comprehensive solution for FGIR tasks. In Hawkeye, we have implemented 16 state-of-the-art fine-grained methods, covering 6 different paradigms, enabling users to explore various approaches for FGIR. To the best of our knowledge, Hawkeye represents the first open-source PyTorch-based library dedicated to FGIR. It is publicly available at https://github.com/Hawkeye-FineGrained/Hawkeye/, providing researchers and practitioners with a powerful tool to advance their research and development in the field of FGIR.

Learning Unified Representations for Multi-Resolution Face Recognition

  • paper_url: http://arxiv.org/abs/2310.09563
  • repo_url: https://github.com/stevensmith2000/btnet
  • paper_authors: Hulingxiao He, Wu Yuan, Yidian Huang, Shilong Zhao, Wen Yuan, Hanqing Li
  • for: A representation learning method for multi-resolution face recognition.
  • methods: A Branch-to-Trunk network (BTNet) consisting of a unified encoder (TNet) and multiple resolution adapters (BNets). Depending on the input, a resolution-specific BNet is used and its output is implanted as feature maps at the layer of TNet's feature pyramid with the matching resolution, which significantly improves the discriminability of tiny faces.
  • results: Experiments show strong performance on face recognition benchmarks for both multi-resolution identity matching and feature aggregation, with much less computation and parameter storage, and a new state of the art on the challenging QMUL-SurvFace 1:N face identification task. Code is available at https://github.com/StevenSmith2000/BTNet.
    Abstract In this work, we propose Branch-to-Trunk network (BTNet), a representation learning method for multi-resolution face recognition. It consists of a trunk network (TNet), namely a unified encoder, and multiple branch networks (BNets), namely resolution adapters. As per the input, a resolution-specific BNet is used and the output are implanted as feature maps in the feature pyramid of TNet, at a layer with the same resolution. The discriminability of tiny faces is significantly improved, as the interpolation error introduced by rescaling, especially up-sampling, is mitigated on the inputs. With branch distillation and backward-compatible training, BTNet transfers discriminative high-resolution information to multiple branches while guaranteeing representation compatibility. Our experiments demonstrate strong performance on face recognition benchmarks, both for multi-resolution identity matching and feature aggregation, with much less computation amount and parameter storage. We establish new state-of-the-art on the challenging QMUL-SurvFace 1: N face identification task. Our code is available at https://github.com/StevenSmith2000/BTNet.

Scene Text Recognition Models Explainability Using Local Features

  • paper_url: http://arxiv.org/abs/2310.09549
  • repo_url: https://github.com/markytools/strexp
  • paper_authors: Mark Vincent Ty, Rowel Atienza
  • for: This paper studies explainability (XAI) for Scene Text Recognition (STR): understanding what causes an STR model's predictions.
  • methods: Attribution-based data explainability frameworks explain the important parts of the input to deep models, but integrating them into STR yields inconsistent and ineffective explanations because they only explain the model in a global context. To address this, the paper proposes STRExp, which takes local explanations into account, i.e., explanations of individual character predictions.
  • results: STRExp is benchmarked across different attribution-based methods on different STR datasets and evaluated across different STR models, showing that it provides more precise and useful explanations.
    Abstract Explainable AI (XAI) is the study on how humans can be able to understand the cause of a model's prediction. In this work, the problem of interest is Scene Text Recognition (STR) Explainability, using XAI to understand the cause of an STR model's prediction. Recent XAI literatures on STR only provide a simple analysis and do not fully explore other XAI methods. In this study, we specifically work on data explainability frameworks, called attribution-based methods, that explain the important parts of an input data in deep learning models. However, integrating them into STR produces inconsistent and ineffective explanations, because they only explain the model in the global context. To solve this problem, we propose a new method, STRExp, to take into consideration the local explanations, i.e. the individual character prediction explanations. This is then benchmarked across different attribution-based methods on different STR datasets and evaluated across different STR models.

Benchmarking the Sim-to-Real Gap in Cloth Manipulation

  • paper_url: http://arxiv.org/abs/2310.09543
  • repo_url: None
  • paper_authors: David Blanco-Mulero, Oriol Barbany, Gokhan Alcan, Adrià Colomé, Carme Torras, Ville Kyrki
  • for: This paper evaluates how well realistic physics engines support learning to manipulate deformable objects such as garments.
  • methods: Four popular deformable object simulators are evaluated: MuJoCo, Bullet, Flex, and SOFA, using a benchmark dataset collected from a dynamic cloth manipulation task involving contact with a rigid table.
  • results: The paper provides an open-source benchmark dataset for assessing the sim-to-real gap, computational time, and simulation stability of these simulators, together with a discussion of the benefits and drawbacks of each.
    Abstract Realistic physics engines play a crucial role for learning to manipulate deformable objects such as garments in simulation. By doing so, researchers can circumvent challenges such as sensing the deformation of the object in the real-world. In spite of the extensive use of simulations for this task, few works have evaluated the reality gap between deformable object simulators and real-world data. We present a benchmark dataset to evaluate the sim-to-real gap in cloth manipulation. The dataset is collected by performing a dynamic cloth manipulation task involving contact with a rigid table. We use the dataset to evaluate the reality gap, computational time, and simulation stability of four popular deformable object simulators: MuJoCo, Bullet, Flex, and SOFA. Additionally, we discuss the benefits and drawbacks of each simulator. The benchmark dataset is open-source. Supplementary material, videos, and code, can be found at https://sites.google.com/view/cloth-sim2real-benchmark.

Towards End-to-End Unsupervised Saliency Detection with Self-Supervised Top-Down Context

  • paper_url: http://arxiv.org/abs/2310.09533
  • repo_url: None
  • paper_authors: Yicheng Song, Shuyong Gao, Haozhe Xing, Yiting Cheng, Yan Wang, Wenqiang Zhang
  • for: This work aims to improve the training efficiency of unsupervised salient object detection and to mine the rich semantic information in deep features.
  • methods: A self-supervised, end-to-end salient object detection framework that exploits top-down context: self-localization from the deepest features builds location maps that are leveraged to learn the most instructive segmentation guidance, a detail-boosting refiner enriches the location labels with details, and an Unsupervised Non-Salient Suppression (UNSS) method learns to ignore non-salient objects.
  • results: Extensive experiments on benchmark datasets show leading performance among recent end-to-end methods and most multi-stage solutions.
    Abstract Unsupervised salient object detection aims to detect salient objects without using supervision signals eliminating the tedious task of manually labeling salient objects. To improve training efficiency, end-to-end methods for USOD have been proposed as a promising alternative. However, current solutions rely heavily on noisy handcraft labels and fail to mine rich semantic information from deep features. In this paper, we propose a self-supervised end-to-end salient object detection framework via top-down context. Specifically, motivated by contrastive learning, we exploit the self-localization from the deepest feature to construct the location maps which are then leveraged to learn the most instructive segmentation guidance. Further considering the lack of detailed information in deepest features, we exploit the detail-boosting refiner module to enrich the location labels with details. Moreover, we observe that due to lack of supervision, current unsupervised saliency models tend to detect non-salient objects that are salient in some other samples of corresponding scenarios. To address this widespread issue, we design a novel Unsupervised Non-Salient Suppression (UNSS) method developing the ability to ignore non-salient objects. Extensive experiments on benchmark datasets demonstrate that our method achieves leading performance among the recent end-to-end methods and most of the multi-stage solutions. The code is available.
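A minimal sketch of the self-localization step, assuming the "deepest feature" is a standard (B, C, H, W) backbone activation: its channel-averaged, normalized response is thresholded into a coarse location pseudo-label (the refiner and UNSS modules are not sketched, and the threshold is an assumption).

```python
import torch

def location_map(deep_feat, thresh=0.5):
    """deep_feat: (B, C, H, W) deepest backbone feature.
    Returns a soft activation map and a binary location pseudo-label."""
    act = deep_feat.clamp(min=0).mean(dim=1, keepdim=True)   # (B, 1, H, W)
    mn = act.amin(dim=(2, 3), keepdim=True)
    mx = act.amax(dim=(2, 3), keepdim=True)
    soft = (act - mn) / (mx - mn + 1e-6)                     # normalize to [0, 1]
    return soft, (soft > thresh).float()
```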

TS-ENAS: Two-Stage Evolution for Cell-based Network Architecture Search

  • paper_url: http://arxiv.org/abs/2310.09525
  • repo_url: None
  • paper_authors: Juan Zou, Shenghong Wu, Yizhang Xia, Weiwei Jiang, Zeping Wu, Jinhua Zheng
  • for: This work proposes a Two-Stage Evolution for cell-based Network Architecture Search (TS-ENAS) algorithm for automatically designing neural network structures.
  • methods: The algorithm searches in two stages: a first stage searches architectures built from stacked cells to reduce search complexity, and a second stage adjusts these cells. A new cell-based search space, an effective two-stage encoding method, and a cell-based weight inheritance strategy for initializing network weights (which significantly reduces running time) are designed.
  • results: Extensive tests on four image classification datasets (Fashion-MNIST, CIFAR10, CIFAR100, and ImageNet) against 22 state-of-the-art algorithms, including hand-designed networks and NAS networks, show that TS-ENAS finds neural network architectures with comparable performance more effectively.
    Abstract Neural network architecture search provides a solution to the automatic design of network structures. However, it is difficult to search the whole network architecture directly. Although using stacked cells to search neural network architectures is an effective way to reduce the complexity of searching, these methods are not able to find the globally optimal neural network structure since the number of layers, cells and connection methods is fixed. In this paper, we propose a Two-Stage Evolution for cell-based Network Architecture Search (TS-ENAS), consisting of first-stage searching based on stacked cells and second-stage adjustment of these cells. In our algorithm, a new cell-based search space and an effective two-stage encoding method are designed to represent cells and neural network structures. In addition, a cell-based weight inheritance strategy is designed to initialize the weights of the network, which significantly reduces the running time of the algorithm. The proposed methods are extensively tested on four image classification datasets, Fashion-MNIST, CIFAR10, CIFAR100 and ImageNet, and compared with 22 state-of-the-art algorithms including hand-designed networks and NAS networks. The experimental results show that TS-ENAS can more effectively find neural network architectures with comparable performance.

OBSUM: An object-based spatial unmixing model for spatiotemporal fusion of remote sensing images

  • paper_url: http://arxiv.org/abs/2310.09517
  • repo_url: https://github.com/houcaiguo/obsum-code
  • paper_authors: Houcai Guo, Dingqi Ye, Lorenzo Bruzzone
  • for: This paper aims to improve both the spatial and temporal resolution of remote sensing images to enable time-series analysis at a fine spatial scale.
  • methods: An Object-Based Spatial Unmixing Model (OBSUM) combines object-based image analysis with spatial unmixing to address two issues of current spatiotemporal fusion methods: pixel-level computation that neglects object-level information, and inaccurate retrieval of strong temporal changes. OBSUM consists of one preprocessing step and three fusion steps (object-level unmixing, object-level residual compensation, and pixel-level residual compensation), and it needs only one fine image at the base date and one coarse image at the prediction date.
  • results: Compared with five representative spatiotemporal fusion methods, OBSUM performs best in both accuracy indices and visual quality over time-series, and it also achieves satisfactory results in two typical remote sensing applications, showing strong potential for generating accurate, high-resolution time-series observations.
    Abstract Spatiotemporal fusion aims to improve both the spatial and temporal resolution of remote sensing images, thus facilitating time-series analysis at a fine spatial scale. However, there are several important issues that limit the application of current spatiotemporal fusion methods. First, most spatiotemporal fusion methods are based on pixel-level computation, which neglects the valuable object-level information of the land surface. Moreover, many existing methods cannot accurately retrieve strong temporal changes between the available high-resolution image at base date and the predicted one. This study proposes an Object-Based Spatial Unmixing Model (OBSUM), which incorporates object-based image analysis and spatial unmixing, to overcome the two abovementioned problems. OBSUM consists of one preprocessing step and three fusion steps, i.e., object-level unmixing, object-level residual compensation, and pixel-level residual compensation. OBSUM can be applied using only one fine image at the base date and one coarse image at the prediction date, without the need of a coarse image at the base date. The performance of OBSUM was compared with five representative spatiotemporal fusion methods. The experimental results demonstrated that OBSUM outperformed other methods in terms of both accuracy indices and visual effects over time-series. Furthermore, OBSUM also achieved satisfactory results in two typical remote sensing applications. Therefore, it has great potential to generate accurate and high-resolution time-series observations for supporting various remote sensing applications.

Foundation Ark: Accruing and Reusing Knowledge for Superior and Robust Performance

  • paper_url: http://arxiv.org/abs/2310.09507
  • repo_url: https://github.com/jlianglab/ark
  • paper_authors: DongAo Ma, Jiaxuan Pang, Michael B. Gotway, Jianming Liang
  • for: This work aims to build a powerful and robust foundation model by aggregating numerous small public datasets.
  • methods: The Ark framework accrues and reuses knowledge from heterogeneous expert annotations across datasets; two Ark models were trained on 335,484 and 704,363 chest X-rays by merging several datasets, including ChestX-ray14, CheXpert, MIMIC-II, and VinDr-CXR.
  • results: Evaluated via fine-tuning, linear probing, and gender-bias analysis on a wide range of imaging tasks covering classification and segmentation, the Ark models show superior and robust performance over state-of-the-art fully/self-supervised baselines and Google's proprietary CXR-FM.
    Abstract Deep learning nowadays offers expert-level and sometimes even super-expert-level performance, but achieving such performance demands massive annotated data for training (e.g., Google's proprietary CXR Foundation Model (CXR-FM) was trained on 821,544 labeled and mostly private chest X-rays (CXRs)). Numerous datasets are publicly available in medical imaging but individually small and heterogeneous in expert labels. We envision a powerful and robust foundation model that can be trained by aggregating numerous small public datasets. To realize this vision, we have developed Ark, a framework that accrues and reuses knowledge from heterogeneous expert annotations in various datasets. As a proof of concept, we have trained two Ark models on 335,484 and 704,363 CXRs, respectively, by merging several datasets including ChestX-ray14, CheXpert, MIMIC-II, and VinDr-CXR, evaluated them on a wide range of imaging tasks covering both classification and segmentation via fine-tuning, linear-probing, and gender-bias analysis, and demonstrated our Ark's superior and robust performance over the SOTA fully/self-supervised baselines and Google's proprietary CXR-FM. This enhanced performance is attributed to our simple yet powerful observation that aggregating numerous public datasets diversifies patient populations and accrues knowledge from diverse experts, yielding unprecedented performance yet saving annotation cost. With all codes and pretrained models released at GitHub.com/JLiangLab/Ark, we hope that Ark exerts an important impact on open science, as accruing and reusing knowledge from expert annotations in public datasets can potentially surpass the performance of proprietary models trained on unusually large data, inspiring many more researchers worldwide to share codes and datasets to build open foundation models, accelerate open science, and democratize deep learning for medical imaging.
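The aggregation idea can be pictured as one shared backbone with a separate classification head per source dataset, so each image is supervised only within its own label space; the backbone choice, head sizes, and names below are illustrative, not Ark's actual implementation.

```python
import torch
import torch.nn as nn
import torchvision

class MultiDatasetModel(nn.Module):
    """Shared backbone + per-dataset heads for heterogeneous expert labels."""

    def __init__(self, num_classes_per_dataset):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.heads = nn.ModuleDict({
            name: nn.Linear(feat_dim, n) for name, n in num_classes_per_dataset.items()
        })

    def forward(self, x, dataset_name):
        # Only the head belonging to the sample's source dataset is used.
        return self.heads[dataset_name](self.backbone(x))

# Toy usage; class counts are placeholders.
model = MultiDatasetModel({'chestxray14': 14, 'chexpert': 13, 'vindr': 28})
logits = model(torch.randn(2, 3, 224, 224), 'chexpert')   # (2, 13)
```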

JM3D & JM3D-LLM: Elevating 3D Representation with Joint Multi-modal Cues

  • paper_url: http://arxiv.org/abs/2310.09503
  • repo_url: https://github.com/mr-neko/jm3d
  • paper_authors: Jiayi Ji, Haowei Wang, Changli Wu, Yiwei Ma, Xiaoshuai Sun, Rongrong Ji
  • for: This work addresses three challenges in 3D representation learning: information degradation, insufficient synergy, and underutilization of fine-grained information.
  • methods: The comprehensive JM3D approach integrates point cloud, text, and image. It includes a Structured Multimodal Organizer (SMO), which enriches the vision-language representation with multiple views and hierarchical text, and a Joint Multi-modal Alignment (JMA), which combines language understanding with visual representation.
  • results: Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority, and the strong performance of JM3D-LLM, which marries the 3D representation with large language models via efficient fine-tuning, further underscores the effectiveness of the representation transfer approach.
    Abstract The rising importance of 3D representation learning, pivotal in computer vision, autonomous driving, and robotics, is evident. However, a prevailing trend, which straightforwardly resorted to transferring 2D alignment strategies to the 3D domain, encounters three distinct challenges: (1) Information Degradation: This arises from the alignment of 3D data with mere single-view 2D images and generic texts, neglecting the need for multi-view images and detailed subcategory texts. (2) Insufficient Synergy: These strategies align 3D representations to image and text features individually, hampering the overall optimization for 3D models. (3) Underutilization: The fine-grained information inherent in the learned representations is often not fully exploited, indicating a potential loss in detail. To address these issues, we introduce JM3D, a comprehensive approach integrating point cloud, text, and image. Key contributions include the Structured Multimodal Organizer (SMO), enriching vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), combining language understanding with visual representation. Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority. The superior performance of JM3D-LLM further underscores the effectiveness of our representation transfer approach. Our code and models are available at https://github.com/Mr-Neko/JM3D.

Learning In-between Imagery Dynamics via Physical Latent Spaces

  • paper_url: http://arxiv.org/abs/2310.09495
  • repo_url: None
  • paper_authors: Jihun Han, Yoonsang Lee, Anne Gelb
  • for: Learn the underlying dynamics between two images observed at consecutive time steps.
  • methods: The intermediary stages of image evolution are estimated with a latent variable that follows a physical model expressed as partial differential equations (PDEs), providing interpretability through the latent dynamics while preserving spatial correlations with the image.
  • results: Numerical tests on geoscientific imagery data demonstrate the robustness and effectiveness of the learning framework.
    Abstract We present a framework designed to learn the underlying dynamics between two images observed at consecutive time steps. The complex nature of image data and the lack of temporal information pose significant challenges in capturing the unique evolving patterns. Our proposed method focuses on estimating the intermediary stages of image evolution, allowing for interpretability through latent dynamics while preserving spatial correlations with the image. By incorporating a latent variable that follows a physical model expressed in partial differential equations (PDEs), our approach ensures the interpretability of the learned model and provides insight into corresponding image dynamics. We demonstrate the robustness and effectiveness of our learning framework through a series of numerical tests using geoscientific imagery data.
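A worked example of a latent variable that follows a PDE: one explicit advection-diffusion step applied to a 2D latent field with finite differences. The specific PDE, coefficients, and grid are assumptions for illustration; the paper learns the latent dynamics rather than fixing them.

```python
import numpy as np

def advect_diffuse_step(z, vx, vy, nu=0.05, dt=0.1):
    """One explicit finite-difference step of dz/dt = -v . grad(z) + nu * laplacian(z)
    on a periodic unit-spaced grid.  z: (H, W) latent field; vx, vy: velocity components."""
    dzdx = (np.roll(z, -1, axis=1) - np.roll(z, 1, axis=1)) / 2.0
    dzdy = (np.roll(z, -1, axis=0) - np.roll(z, 1, axis=0)) / 2.0
    lap = (np.roll(z, 1, 0) + np.roll(z, -1, 0)
           + np.roll(z, 1, 1) + np.roll(z, -1, 1) - 4 * z)
    return z + dt * (-vx * dzdx - vy * dzdy + nu * lap)

# Evolve a random latent field a few steps, e.g. to interpolate between two observed times.
z = np.random.rand(64, 64)
for _ in range(10):
    z = advect_diffuse_step(z, vx=0.5, vy=0.0)
```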

Perception Reinforcement Using Auxiliary Learning Feature Fusion: A Modified Yolov8 for Head Detection

  • paper_url: http://arxiv.org/abs/2310.09492
  • repo_url: None
  • paper_authors: Jiezhou Chen, Guankun Wang, Weixiang Liu, Xiaopin Zhong, Yibin Tian, ZongZe Wu
  • for: Improve the accuracy and robustness of head detection.
  • methods: A modified YOLOv8 that reinforces target perception using an Auxiliary Learning Feature Fusion (ALFF) module, composed of LSTM and convolutional blocks, as an auxiliary task, together with Noise Calibration introduced into the Distribution Focal Loss to facilitate model fitting and improve detection accuracy. The method is adapted to two backbones, YOLOv8n and YOLOv8m.
  • results: Experiments demonstrate improved head detection accuracy and robustness.
    Abstract Head detection provides distribution information of pedestrian, which is crucial for scene statistical analysis, traffic management, and risk assessment and early warning. However, scene complexity and large-scale variation in the real world make accurate detection more difficult. Therefore, we present a modified Yolov8 which improves head detection performance through reinforcing target perception. An Auxiliary Learning Feature Fusion (ALFF) module comprised of LSTM and convolutional blocks is used as the auxiliary task to help the model perceive targets. In addition, we introduce Noise Calibration into Distribution Focal Loss to facilitate model fitting and improve the accuracy of detection. Considering the requirements of high accuracy and speed for the head detection task, our method is adapted with two kinds of backbone, namely Yolov8n and Yolov8m. The results demonstrate the superior performance of our approach in improving detection accuracy and robustness.

Exploring the Design Space of Diffusion Autoencoders for Face Morphing

  • paper_url: http://arxiv.org/abs/2310.09484
  • repo_url: None
  • paper_authors: Zander Blasingame, Chen Liu
  • for: This work explores the design space of face morphs created with diffusion autoencoders, focusing on three axes: sampling algorithms, the reverse DDIM solver, and partial sampling through small amounts of added noise.
  • methods: Face morphs are generated while varying the sampling algorithm, the reverse DDIM solver, and the amount of added noise used for partial sampling.
  • results: The study characterizes how these design choices lead to different face morphs.
    Abstract Face morphs created by Diffusion Autoencoders are a recent innovation and the design space of such an approach has not been well explored. We explore three axes of the design space, i.e., 1) sampling algorithms, 2) the reverse DDIM solver, and 3) partial sampling through small amounts of added noise.
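One common way to build a morph with a diffusion autoencoder is to blend the semantic latent linearly and the stochastic noise code spherically before decoding; the sketch below covers only that interpolation step and treats the encoder/decoder as given, so it should be read as an assumed recipe rather than the paper's procedure.

```python
import torch

def slerp(a, b, t):
    """Spherical interpolation between two noise tensors a and b."""
    a_flat, b_flat = a.flatten(), b.flatten()
    omega = torch.arccos(torch.clamp(
        torch.dot(a_flat, b_flat) / (a_flat.norm() * b_flat.norm()), -1.0, 1.0))
    if omega.abs() < 1e-6:
        return (1 - t) * a + t * b
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

def morph_latents(sem_a, sem_b, noise_a, noise_b, t=0.5):
    """Blend semantic codes linearly and stochastic codes spherically;
    the results would then be decoded by the diffusion autoencoder."""
    sem_mid = (1 - t) * sem_a + t * sem_b
    noise_mid = slerp(noise_a, noise_b, t)
    return sem_mid, noise_mid
```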

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

  • paper_url: http://arxiv.org/abs/2310.09478
  • repo_url: None
  • paper_authors: Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny
  • for: This paper aims to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding.
  • methods: MiniGPT-v2 serves as a unified interface for diverse vision-language tasks. Unique identifiers for different tasks are used in the training instructions, enabling the model to distinguish each task instruction effortlessly and improving learning efficiency for each task.
  • results: After three-stage training, experiments show strong performance on many visual question-answering and visual grounding benchmarks compared with other vision-language generalist models.
    Abstract Large language models have shown their remarkable capabilities as a general interface for various language-related applications. Motivated by this, we target to build a unified interface for completing many vision-language tasks including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model for performing diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We propose using unique identifiers for different tasks when training the model. These identifiers enable our model to better distinguish each task instruction effortlessly and also improve the model learning efficiency for each task. After the three-stage training, the experimental results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks compared to other vision-language generalist models. Our model and codes are available at https://minigpt-v2.github.io/
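The "unique identifiers for different tasks" amount to prefixing each instruction with a task token; the tags, image placeholder, and template below are illustrative guesses at that scheme, not the model's exact prompt format.

```python
# Hypothetical task-identifier prompt construction in the spirit of MiniGPT-v2.
TASK_TAGS = {
    "caption": "[caption]",
    "vqa": "[vqa]",
    "grounding": "[grounding]",
    "refer": "[refer]",
}

def build_instruction(task, text):
    """Prefix the user text with its task identifier so the model can tell tasks apart."""
    tag = TASK_TAGS[task]
    return f"<Img><ImageHere></Img> {tag} {text}"

print(build_instruction("vqa", "What color is the traffic light?"))
print(build_instruction("grounding", "the dog on the left"))
```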

Plug-and-Play Feature Generation for Few-Shot Medical Image Classification

  • paper_url: http://arxiv.org/abs/2310.09471
  • repo_url: None
  • paper_authors: Qianyu Guo, Huifang Du, Xing Jia, Shuyong Gao, Yan Teng, Haofen Wang, Wenqiang Zhang
  • for: Improve the generalization and practicality of medical image classification models trained with limited data.
  • methods: MedMFG, a flexible and lightweight plug-and-play method, generates sufficient class-distinctive features from limited samples: it re-represents the limited prototypes to assign higher weights to more important information features and then variationally generates abundant effective features for training a more generalized classifier.
  • results: MedMFG outperforms previous state-of-the-art methods on cross-domain benchmarks covering the transition from natural to medical images and medical images with different lesions, with over 10% improvement over several baselines; fusion experiments show it integrates seamlessly into various backbones and baselines with consistent gains of over 2.9%.
    Abstract Few-shot learning (FSL) presents immense potential in enhancing model generalization and practicality for medical image classification with limited training data; however, it still faces the challenge of severe overfitting in classifier training due to distribution bias caused by the scarce training samples. To address the issue, we propose MedMFG, a flexible and lightweight plug-and-play method designed to generate sufficient class-distinctive features from limited samples. Specifically, MedMFG first re-represents the limited prototypes to assign higher weights for more important information features. Then, the prototypes are variationally generated into abundant effective features. Finally, the generated features and prototypes are together to train a more generalized classifier. Experiments demonstrate that MedMFG outperforms the previous state-of-the-art methods on cross-domain benchmarks involving the transition from natural images to medical images, as well as medical images with different lesions. Notably, our method achieves over 10% performance improvement compared to several baselines. Fusion experiments further validate the adaptability of MedMFG, as it seamlessly integrates into various backbones and baselines, consistently yielding improvements of over 2.9% across all results.

Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner

  • paper_url: http://arxiv.org/abs/2310.09469
  • repo_url: None
  • paper_authors: Mengfei Xia, Yujun Shen, Changsong Lei, Yu Zhou, Ran Yi, Deli Zhao, Wenping Wang, Yong-jin Liu
  • for: Speed up diffusion model inference while avoiding the performance degradation of existing acceleration methods.
  • methods: Viewing generation as a discretized integration process, the quality drop is attributed in part to applying an inaccurate integral direction to a timestep interval. A timestep aligner finds a more accurate integral direction for a given interval at minimal cost: at each denoising step, the original parameterization is replaced by conditioning the network on a new timestep obtained by aligning the sampling distribution to the real distribution.
  • results: Extensive experiments show the plug-in design can be trained efficiently and boosts the inference performance of various state-of-the-art acceleration methods, especially when there are few denoising steps; for example, with 10 denoising steps on the LSUN Bedroom dataset, the FID of DDIM improves from 9.65 to 6.07. Code will be released.
    Abstract A diffusion model, which is formulated to produce an image using thousands of denoising steps, usually suffers from a slow inference speed. Existing acceleration algorithms simplify the sampling by skipping most steps yet exhibit considerable performance degradation. By viewing the generation of diffusion models as a discretized integrating process, we argue that the quality drop is partly caused by applying an inaccurate integral direction to a timestep interval. To rectify this issue, we propose a timestep aligner that helps find a more accurate integral direction for a particular interval at the minimum cost. Specifically, at each denoising step, we replace the original parameterization by conditioning the network on a new timestep, which is obtained by aligning the sampling distribution to the real distribution. Extensive experiments show that our plug-in design can be trained efficiently and boost the inference performance of various state-of-the-art acceleration methods, especially when there are few denoising steps. For example, when using 10 denoising steps on the popular LSUN Bedroom dataset, we improve the FID of DDIM from 9.65 to 6.07, simply by adopting our method for a more appropriate set of timesteps. Code will be made publicly available.
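The idea of conditioning the network on a new timestep can be sketched on top of a standard deterministic DDIM update: the noise network is queried at an aligned timestep while the update coefficients still use the scheduled one. The schedule handling and names below are assumptions for illustration, not the paper's implementation.

```python
import torch

@torch.no_grad()
def ddim_step(eps_model, x_t, t, t_prev, alphas_cumprod, aligned_t=None):
    """One deterministic DDIM step (eta = 0).

    alphas_cumprod: 1-D tensor of cumulative alphas for the full schedule.
    aligned_t: the (possibly learned) timestep actually fed to the noise
    prediction network; if None, the scheduled timestep t is used.
    """
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)
    cond_t = t if aligned_t is None else aligned_t
    eps = eps_model(x_t, torch.full((x_t.shape[0],), cond_t, device=x_t.device))
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
```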

MAC: ModAlity Calibration for Object Detection

  • paper_url: http://arxiv.org/abs/2310.09461
  • repo_url: None
  • paper_authors: Yutian Lei, Jun Liu, Dong Huang
  • for: This work aims to efficiently develop object detection models for non-RGB input modalities by adapting models developed on the RGB modality.
  • methods: ModAlity Calibration (MAC), an efficient pipeline that bridges the gap between a target input modality and the RGB source modality: a small calibrator module is added in front of a source-modality model, and MAC training techniques impose dense supervision on the calibrator, leveraging prior knowledge synthesized from the source-modality model and paired {target, source} data with zero manual annotations.
  • results: WiFi-input, lidar-input, and thermal-infrared-input models composed on pre-trained RGB-input models reach comparable or better metrics than baseline models that require 100% manual annotations, demonstrating MAC's effectiveness.
    Abstract The flourishing success of Deep Neural Networks(DNNs) on RGB-input perception tasks has opened unbounded possibilities for non-RGB-input perception tasks, such as object detection from wireless signals, lidar scans, and infrared images. Compared to the matured development pipeline of RGB-input (source modality) models, developing non-RGB-input (target-modality) models from scratch poses excessive challenges in the modality-specific network design/training tricks and labor in the target-modality annotation. In this paper, we propose ModAlity Calibration (MAC), an efficient pipeline for calibrating target-modality inputs to the DNN object detection models developed on the RGB (source) modality. We compose a target-modality-input model by adding a small calibrator module ahead of a source-modality model and introduce MAC training techniques to impose dense supervision on the calibrator. By leveraging (1) prior knowledge synthesized from the source-modality model and (2) paired {target, source} data with zero manual annotations, our target-modality models reach comparable or better metrics than baseline models that require 100% manual annotations. We demonstrate the effectiveness of MAC by composing the WiFi-input, Lidar-input, and Thermal-Infrared-input models upon the pre-trained RGB-input models respectively.
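The composition described above, a small calibrator in front of a frozen RGB-input detector, can be sketched as follows; the calibrator architecture and the choice of a torchvision Faster R-CNN as the source-modality model are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn
import torchvision

class CalibratedDetector(nn.Module):
    """Target-modality input -> small calibrator -> frozen RGB detector."""

    def __init__(self, in_channels, rgb_detector):
        super().__init__()
        # Maps the target modality to 3 "RGB-like" channels in [0, 1].
        self.calibrator = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid())
        self.detector = rgb_detector
        for p in self.detector.parameters():   # keep the source-modality model frozen
            p.requires_grad = False

    def forward(self, x, targets=None):
        pseudo_rgb = self.calibrator(x)
        return self.detector(list(pseudo_rgb), targets)

rgb_model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model = CalibratedDetector(in_channels=1, rgb_detector=rgb_model)   # e.g. thermal input
model.eval()
with torch.no_grad():
    dets = model(torch.rand(2, 1, 256, 256))
```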

PaintHuman: Towards High-fidelity Text-to-3D Human Texturing via Denoised Score Distillation

  • paper_url: http://arxiv.org/abs/2310.09458
  • repo_url: None
  • paper_authors: Jianhui Yu, Hao Zhu, Liming Jiang, Chen Change Loy, Weidong Cai, Wayne Wu
  • for: This work addresses the inaccurate gradient directions that Score Distillation Sampling (SDS) can provide under weak diffusion guidance in zero-shot text-to-3D human generation, and the challenge of high-fidelity text-to-3D human texturing.
  • methods: The proposed PaintHuman model tackles the problem from two directions: a novel score function, Denoised Score Distillation (DSD), modifies SDS by introducing negative gradient components to iteratively correct the gradient direction and generate high-quality textures; and the depth map is used as geometric guidance so that textures are semantically aligned with the human mesh surface, with geometry-aware networks predicting surface materials for realistic rendering.
  • results: Extensive experiments benchmarked against state-of-the-art methods validate the efficacy of the approach.
    Abstract Recent advances in zero-shot text-to-3D human generation, which employ the human model prior (eg, SMPL) or Score Distillation Sampling (SDS) with pre-trained text-to-image diffusion models, have been groundbreaking. However, SDS may provide inaccurate gradient directions under the weak diffusion guidance, as it tends to produce over-smoothed results and generate body textures that are inconsistent with the detailed mesh geometry. Therefore, directly leverage existing strategies for high-fidelity text-to-3D human texturing is challenging. In this work, we propose a model called PaintHuman to addresses the challenges from two aspects. We first propose a novel score function, Denoised Score Distillation (DSD), which directly modifies the SDS by introducing negative gradient components to iteratively correct the gradient direction and generate high-quality textures. In addition, we use the depth map as a geometric guidance to ensure the texture is semantically aligned to human mesh surfaces. To guarantee the quality of rendered results, we employ geometry-aware networks to predict surface materials and render realistic human textures. Extensive experiments, benchmarked against state-of-the-art methods, validate the efficacy of our approach.

UCM-Net: A Lightweight and Efficient Solution for Skin Lesion Segmentation using MLP and CNN

  • paper_url: http://arxiv.org/abs/2310.09457
  • repo_url: None
  • paper_authors: Chunyu Yuan, Dongfang Zhao, Sos S. Agaian
  • for: An efficient and lightweight skin lesion segmentation method suitable for mobile health applications.
  • methods: UCM-Net integrates Multi-Layer Perceptrons (MLP) and Convolutional Neural Networks (CNN); a novel UCMNet-Block reduces parameter overhead while enhancing the network's learning capability.
  • results: Extensive experiments on the ISIC 2017 and ISIC 2018 datasets demonstrate competitive segmentation performance; UCM-Net has fewer than 50KB of parameters and under 0.05 giga-operations, setting a new possible standard for efficiency in skin lesion segmentation.
    Abstract Skin cancer is a significant public health problem, and computer-aided diagnosis can help to prevent and treat it. A crucial step for computer-aided diagnosis is accurately segmenting skin lesions in images, which allows for lesion detection, classification, and analysis. However, this task is challenging due to the diverse characteristics of lesions, such as appearance, shape, size, color, texture, and location, as well as image quality issues like noise, artifacts, and occlusions. Deep learning models have recently been applied to skin lesion segmentation, but they have high parameter counts and computational demands, making them unsuitable for mobile health applications. To address this challenge, we propose UCM-Net, a novel, efficient, and lightweight solution that integrates Multi-Layer Perceptions (MLP) and Convolutional Neural Networks (CNN). Unlike conventional UNet architectures, our UCMNet-Block reduces parameter overhead and enhances UCM-Net's learning capabilities, leading to robust segmentation performance. We validate UCM-Net's competitiveness through extensive experiments on isic2017 and isic2018 datasets. Remarkably, UCM-Net has less than 50KB parameters and less than 0.05 Giga-Operations Per Second (GLOPs), setting a new possible standard for efficiency in skin lesion segmentation. The source code will be publicly available.