cs.CV - 2023-09-22

ClusterFormer: Clustering As A Universal Visual Learner

  • paper_url: http://arxiv.org/abs/2309.13196
  • repo_url: https://github.com/clusterformer/clusterformer
  • paper_authors: James C. Liang, Yiming Cui, Qifan Wang, Tong Geng, Wenguan Wang, Dongfang Liu
  • for: This work proposes CLUSTERFORMER, a universal vision model built on the clustering paradigm, and applies it to a range of vision tasks including image classification, object detection, and image segmentation.
  • methods: The model uses two novel designs: 1. recurrent cross-attention clustering, which reformulates the cross-attention mechanism in Transformer and allows cluster centers to be updated recursively layer by layer to strengthen representation learning; 2. feature dispatching, which uses the updated cluster centers to redistribute image features through similarity-based metrics, yielding a transparent pipeline.
  • results: Experiments show that CLUSTERFORMER outperforms various well-known specialized architectures on image classification, object detection, and image segmentation, and remains effective across different clustering granularities (image-, box-, and pixel-level).
    Abstract This paper presents CLUSTERFORMER, a universal vision model that is based on the CLUSTERing paradigm with TransFORMER. It comprises two novel designs: 1. recurrent cross-attention clustering, which reformulates the cross-attention mechanism in Transformer and enables recursive updates of cluster centers to facilitate strong representation learning; and 2. feature dispatching, which uses the updated cluster centers to redistribute image features through similarity-based metrics, resulting in a transparent pipeline. This elegant design streamlines an explainable and transferable workflow, capable of tackling heterogeneous vision tasks (i.e., image classification, object detection, and image segmentation) with varying levels of clustering granularity (i.e., image-, box-, and pixel-level). Empirical results demonstrate that CLUSTERFORMER outperforms various well-known specialized architectures, achieving 83.41% top-1 acc. over ImageNet-1K for image classification, 54.2% and 47.0% mAP over MS COCO for object detection and instance segmentation, 52.4% mIoU over ADE20K for semantic segmentation, and 55.8% PQ over COCO Panoptic for panoptic segmentation. For its efficacy, we hope our work can catalyze a paradigm shift in universal models in computer vision.
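The recurrent cross-attention clustering described above can be pictured as cluster centers attending to image features and being updated in place over several iterations, followed by a similarity-based dispatch of features to centers. The sketch below is a minimal, hypothetical PyTorch illustration of that idea; the single-head attention, dimensions, and function names are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def recurrent_cross_attention_clustering(features, centers, iters=3):
    """features: (N, D) image tokens; centers: (K, D) cluster centers.
    Repeatedly update the centers by cross-attending to the features."""
    d = features.shape[-1]
    for _ in range(iters):
        # Centers act as queries, features as keys/values (single head for clarity).
        attn = torch.softmax(centers @ features.t() / d ** 0.5, dim=-1)  # (K, N)
        centers = centers + attn @ features                               # residual update
    return centers

def dispatch_features(features, centers):
    """Reassign each feature to its most similar (cosine) cluster center."""
    sim = F.normalize(features, dim=-1) @ F.normalize(centers, dim=-1).t()  # (N, K)
    assignment = sim.argmax(dim=-1)                                          # hard assignment
    dispatched = centers[assignment]                                         # (N, D)
    return assignment, dispatched

if __name__ == "__main__":
    feats = torch.randn(196, 64)        # e.g. 14x14 patch tokens
    init_centers = torch.randn(8, 64)   # 8 cluster centers
    centers = recurrent_cross_attention_clustering(feats, init_centers)
    assignment, dispatched = dispatch_features(feats, centers)
    print(assignment.shape, dispatched.shape)  # torch.Size([196]) torch.Size([196, 64])
```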

Spatial-frequency channels, shape bias, and adversarial robustness

  • paper_url: http://arxiv.org/abs/2309.13190
  • repo_url: https://github.com/ajaysub110/critical-band-masking
  • paper_authors: Ajay Subramanian, Elena Sizikova, Najib J. Majaj, Denis G. Pelli
  • for: This study investigates what spatial-frequency information humans and neural networks use to recognize objects.
  • methods: The authors use critical band masking, a technique that reveals the bandwidth of the spatial-frequency filters (or "channels") that humans and neural networks rely on for object recognition.
  • results: Humans recognize objects in natural images with the same one-octave-wide channel they use for letters and gratings. Across architectures and training strategies, the neural network channel is 2-4 times as wide as the human channel, meaning networks are vulnerable to high- and low-frequency noise that does not affect humans. Adversarial and augmented-image training are commonly used to improve robustness and shape bias, yet adversarial training expands the channel bandwidth even further from the human bandwidth.
    Abstract What spatial frequency information do humans and neural networks use to recognize objects? In neuroscience, critical band masking is an established tool that can reveal the frequency-selective filters used for object recognition. Critical band masking measures the sensitivity of recognition performance to noise added at each spatial frequency. Existing critical band masking studies show that humans recognize periodic patterns (gratings) and letters by means of a spatial-frequency filter (or "channel'') that has a frequency bandwidth of one octave (doubling of frequency). Here, we introduce critical band masking as a task for network-human comparison and test 14 humans and 76 neural networks on 16-way ImageNet categorization in the presence of narrowband noise. We find that humans recognize objects in natural images using the same one-octave-wide channel that they use for letters and gratings, making it a canonical feature of human object recognition. On the other hand, the neural network channel, across various architectures and training strategies, is 2-4 times as wide as the human channel. In other words, networks are vulnerable to high and low frequency noise that does not affect human performance. Adversarial and augmented-image training are commonly used to increase network robustness and shape bias. Does this training align network and human object recognition channels? Three network channel properties (bandwidth, center frequency, peak noise sensitivity) correlate strongly with shape bias (53% variance explained) and with robustness of adversarially-trained networks (74% variance explained). Adversarial training increases robustness but expands the channel bandwidth even further away from the human bandwidth. Thus, critical band masking reveals that the network channel is more than twice as wide as the human channel, and that adversarial training only increases this difference.
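Critical band masking adds noise restricted to a narrow band of spatial frequencies and measures how much recognition degrades at each band. The numpy sketch below generates one-octave band-pass noise with an FFT mask, the kind of stimulus this protocol relies on; the band edges, noise level, and image size are illustrative assumptions rather than the paper's exact stimulus parameters.

```python
import numpy as np

def octave_band_noise(size, low_cpi, rms=0.1, seed=0):
    """White noise band-pass filtered to [low_cpi, 2*low_cpi] cycles/image
    (a one-octave band), returned at the requested RMS contrast."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((size, size))
    fx = np.fft.fftfreq(size) * size              # frequency in cycles/image
    fy = np.fft.fftfreq(size) * size
    radius = np.sqrt(fx[None, :] ** 2 + fy[:, None] ** 2)
    band = (radius >= low_cpi) & (radius < 2 * low_cpi)
    filtered = np.fft.ifft2(np.fft.fft2(noise) * band).real
    filtered *= rms / (filtered.std() + 1e-12)    # normalize to the target RMS
    return filtered

if __name__ == "__main__":
    img = np.full((224, 224), 0.5)                # stand-in grayscale image in [0, 1]
    for low in [2, 4, 8, 16, 32]:                 # band centers spaced one octave apart
        noisy = np.clip(img + octave_band_noise(224, low), 0.0, 1.0)
        # noisy images are shown to humans / networks; accuracy vs. band traces the channel
        print(low, round(float(noisy.std()), 4))
```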

Flow Factorized Representation Learning

  • paper_url: http://arxiv.org/abs/2309.13167
  • repo_url: https://github.com/kingjamessong/latent-flow
  • paper_authors: Yue Song, T. Anderson Keller, Nicu Sebe, Max Welling
  • for: The main goal of this work is to learn representations that are usefully factorized with respect to the ground-truth factors of variation.
  • methods: The authors propose a new viewpoint on structured representation learning, called Flow Factorized Representation Learning, under which more efficient and more usefully structured representations are learned.
  • results: The model achieves higher likelihoods on standard representation learning benchmarks while being closer to approximately equivariant models; the learned transformations are flexibly composable and extrapolate to new data, indicating robustness and generalizability.
    Abstract A prominent goal of representation learning research is to achieve representations which are factorized in a useful manner with respect to the ground truth factors of variation. The fields of disentangled and equivariant representation learning have approached this ideal from a range of complimentary perspectives; however, to date, most approaches have proven to either be ill-specified or insufficiently flexible to effectively separate all realistic factors of interest in a learned latent space. In this work, we propose an alternative viewpoint on such structured representation learning which we call Flow Factorized Representation Learning, and demonstrate it to learn both more efficient and more usefully structured representations than existing frameworks. Specifically, we introduce a generative model which specifies a distinct set of latent probability paths that define different input transformations. Each latent flow is generated by the gradient field of a learned potential following dynamic optimal transport. Our novel setup brings new understandings to both \textit{disentanglement} and \textit{equivariance}. We show that our model achieves higher likelihoods on standard representation learning benchmarks while simultaneously being closer to approximately equivariant models. Furthermore, we demonstrate that the transformations learned by our model are flexibly composable and can also extrapolate to new data, implying a degree of robustness and generalizability approaching the ultimate goal of usefully factorized representation learning.
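Each latent flow in the abstract is generated by the gradient field of a learned potential, i.e. the latent code moves along z_{t+1} = z_t + Δt ∇_z u(z_t). The snippet below is a schematic PyTorch sketch of such a gradient-field flow with a toy potential network and assumed dimensions; it is not the authors' model and omits the dynamic-optimal-transport training objective.

```python
import torch
import torch.nn as nn

class Potential(nn.Module):
    """A scalar potential u(z); its gradient field defines one latent flow."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, z):
        return self.net(z).squeeze(-1)

def flow(z, potential, steps=10, dt=0.1):
    """Integrate z along the gradient of the potential (explicit Euler steps)."""
    for _ in range(steps):
        z = z.detach().requires_grad_(True)
        u = potential(z).sum()
        grad = torch.autograd.grad(u, z)[0]
        z = z + dt * grad
    return z.detach()

if __name__ == "__main__":
    pot = Potential(dim=16)            # one potential per transformation type
    z0 = torch.randn(4, 16)            # batch of latent codes
    zT = flow(z0, pot)
    print(zT.shape)                    # torch.Size([4, 16])
```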

Pixel-wise Smoothing for Certified Robustness against Camera Motion Perturbations

  • paper_url: http://arxiv.org/abs/2309.13150
  • repo_url: None
  • paper_authors: Hanjiang Hu, Zuxin Liu, Linyi Li, Jiacheng Zhu, Ding Zhao
  • for: Certifying the robustness of deep learning vision models against camera motion perturbations.
  • methods: A novel, efficient, and practical framework that places a smoothing distribution over the 2D pixel space, eliminating the costly camera motion sampling and significantly improving the efficiency of robustness certification.
  • results: Extensive experiments validate the trade-off between effectiveness and efficiency; the method achieves approximately 80% certified accuracy while using only 30% of the projected image frames.
    Abstract In recent years, computer vision has made remarkable advancements in autonomous driving and robotics. However, it has been observed that deep learning-based visual perception models lack robustness when faced with camera motion perturbations. The current certification process for assessing robustness is costly and time-consuming due to the extensive number of image projections required for Monte Carlo sampling in the 3D camera motion space. To address these challenges, we present a novel, efficient, and practical framework for certifying the robustness of 3D-2D projective transformations against camera motion perturbations. Our approach leverages a smoothing distribution over the 2D pixel space instead of in the 3D physical space, eliminating the need for costly camera motion sampling and significantly enhancing the efficiency of robustness certifications. With the pixel-wise smoothed classifier, we are able to fully upper bound the projection errors using a technique of uniform partitioning in camera motion space. Additionally, we extend our certification framework to a more general scenario where only a single-frame point cloud is required in the projection oracle. This is achieved by deriving Lipschitz-based approximated partition intervals. Through extensive experimentation, we validate the trade-off between effectiveness and efficiency enabled by our proposed method. Remarkably, our approach achieves approximately 80% certified accuracy while utilizing only 30% of the projected image frames.
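The pixel-space smoothing in the abstract follows the randomized-smoothing recipe: classify many noisy copies of the projected image and take the majority vote, so only 2D pixel noise is needed instead of samples in the 3D camera-motion space. Below is a generic, hedged PyTorch sketch of a pixel-wise smoothed classifier; the noise scale, sample count, and toy classifier are placeholders, and the paper's actual certification bound (uniform partitioning of camera-motion space) is not reproduced.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def smoothed_predict(classifier, image, sigma=0.25, n_samples=100):
    """Majority vote of the base classifier over Gaussian pixel noise.
    image: (C, H, W) tensor; returns the smoothed class index and vote counts."""
    noisy = image.unsqueeze(0) + sigma * torch.randn(n_samples, *image.shape)
    logits = classifier(noisy)                  # (n_samples, num_classes)
    votes = torch.bincount(logits.argmax(dim=1), minlength=logits.shape[1])
    return votes.argmax().item(), votes

if __name__ == "__main__":
    base = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy base classifier
    img = torch.rand(3, 32, 32)
    pred, votes = smoothed_predict(base, img)
    print(pred, votes.tolist())
```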

Trading-off Mutual Information on Feature Aggregation for Face Recognition

  • paper_url: http://arxiv.org/abs/2309.13137
  • repo_url: None
  • paper_authors: Mohammad Akyash, Ali Zafari, Nasser M. Nasrabadi
  • for: Improving face recognition accuracy.
  • methods: Aggregating the outputs of two state-of-the-art deep face recognition models, ArcFace and AdaFace, and using a transformer attention mechanism to exploit the relationship between the two feature maps, thereby enhancing the overall discriminative power of the face recognition system.
  • results: Comparisons on several standard benchmarks show a consistent improvement of the proposed method on face recognition tasks.
    Abstract Despite the advances in the field of Face Recognition (FR), the precision of these methods is not yet sufficient. To improve the FR performance, this paper proposes a technique to aggregate the outputs of two state-of-the-art (SOTA) deep FR models, namely ArcFace and AdaFace. In our approach, we leverage the transformer attention mechanism to exploit the relationship between different parts of two feature maps. By doing so, we aim to enhance the overall discriminative power of the FR system. One of the challenges in feature aggregation is the effective modeling of both local and global dependencies. Conventional transformers are known for their ability to capture long-range dependencies, but they often struggle with modeling local dependencies accurately. To address this limitation, we augment the self-attention mechanism to capture both local and global dependencies effectively. This allows our model to take advantage of the overlapping receptive fields present in corresponding locations of the feature maps. However, fusing two feature maps from different FR models might introduce redundancies to the face embedding. Since these models often share identical backbone architectures, the resulting feature maps may contain overlapping information, which can mislead the training process. To overcome this problem, we leverage the principle of Information Bottleneck to obtain a maximally informative facial representation. This ensures that the aggregated features retain the most relevant and discriminative information while minimizing redundant or misleading details. To evaluate the effectiveness of our proposed method, we conducted experiments on popular benchmarks and compared our results with state-of-the-art algorithms. The consistent improvement we observed in these benchmarks demonstrates the efficacy of our approach in enhancing FR performance.

Understanding Calibration of Deep Neural Networks for Medical Image Classification

  • paper_url: http://arxiv.org/abs/2309.13132
  • repo_url: None
  • paper_authors: Abhishek Singh Sambyal, Usma Niyaz, Narayanan C. Krishnan, Deepti R. Bathula
  • for: This paper studies calibration of deep neural networks for medical image classification, where accurate and reliable (well-calibrated) predictions are essential.
  • methods: The study compares several training regimes, including fully supervised training and rotation-based self-supervised learning (with and without transfer learning), across various datasets and architecture sizes, to understand how training affects both performance and calibration.
  • results: Models trained with the rotation-based self-supervised pretraining regime exhibit significantly better calibration while achieving comparable or even superior performance compared to fully supervised models across medical imaging datasets.
    Abstract In the field of medical image analysis, achieving high accuracy is not enough; ensuring well-calibrated predictions is also crucial. Confidence scores of a deep neural network play a pivotal role in explainability by providing insights into the model's certainty, identifying cases that require attention, and establishing trust in its predictions. Consequently, the significance of a well-calibrated model becomes paramount in the medical imaging domain, where accurate and reliable predictions are of utmost importance. While there has been a significant effort towards training modern deep neural networks to achieve high accuracy on medical imaging tasks, model calibration and factors that affect it remain under-explored. To address this, we conducted a comprehensive empirical study that explores model performance and calibration under different training regimes. We considered fully supervised training, which is the prevailing approach in the community, as well as rotation-based self-supervised method with and without transfer learning, across various datasets and architecture sizes. Multiple calibration metrics were employed to gain a holistic understanding of model calibration. Our study reveals that factors such as weight distributions and the similarity of learned representations correlate with the calibration trends observed in the models. Notably, models trained using rotation-based self-supervised pretrained regime exhibit significantly better calibration while achieving comparable or even superior performance compared to fully supervised models across different medical imaging datasets. These findings shed light on the importance of model calibration in medical image analysis and highlight the benefits of incorporating self-supervised learning approach to improve both performance and calibration.
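Calibration in studies like this one is commonly summarized with the expected calibration error (ECE): bin predictions by confidence and compare average confidence with accuracy in each bin. The numpy sketch below shows that standard computation; the bin count is an assumption, and the paper employs multiple calibration metrics beyond this one.

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """confidences: max softmax probability per sample; predictions/labels: class ids."""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap     # weight by the fraction of samples in the bin
    return ece

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    conf = rng.uniform(0.5, 1.0, size=1000)
    pred = rng.integers(0, 2, size=1000)
    label = np.where(rng.uniform(size=1000) < conf, pred, 1 - pred)  # roughly calibrated toy data
    print(round(expected_calibration_error(conf, pred, label), 4))
```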

Robotic Offline RL from Internet Videos via Value-Function Pre-Training

  • paper_url: http://arxiv.org/abs/2309.13041
  • repo_url: None
  • paper_authors: Chethan Bhateja, Derek Guo, Dibya Ghosh, Anikait Singh, Manan Tomar, Quan Vuong, Yevgen Chebotar, Sergey Levine, Aviral Kumar
  • for: This paper aims to help robots learn manipulation skills by leveraging video data that lacks action and reward annotations.
  • methods: Value functions are learned from large-scale human video datasets via temporal-difference learning and then incorporated into robotic offline RL that trains on diverse robot data for manipulation tasks.
  • results: The approach achieves strong results on several manipulation tasks on a real WidowX robot; the resulting policies improve over prior methods, act more robustly, and generalize broadly.
    Abstract Pre-training on Internet data has proven to be a key ingredient for broad generalization in many modern ML systems. What would it take to enable such capabilities in robotic reinforcement learning (RL)? Offline RL methods, which learn from datasets of robot experience, offer one way to leverage prior data into the robotic learning pipeline. However, these methods have a "type mismatch" with video data (such as Ego4D), the largest prior datasets available for robotics, since video offers observation-only experience without the action or reward annotations needed for RL methods. In this paper, we develop a system for leveraging large-scale human video datasets in robotic offline RL, based entirely on learning value functions via temporal-difference learning. We show that value learning on video datasets learns representations that are more conducive to downstream robotic offline RL than other approaches for learning from video data. Our system, called V-PTR, combines the benefits of pre-training on video data with robotic offline RL approaches that train on diverse robot data, resulting in value functions and policies for manipulation tasks that perform better, act robustly, and generalize broadly. On several manipulation tasks on a real WidowX robot, our framework produces policies that greatly improve over prior methods. Our video and additional details can be found at https://dibyaghosh.com/vptr/

NeRRF: 3D Reconstruction and View Synthesis for Transparent and Specular Objects with Neural Refractive-Reflective Fields

  • paper_url: http://arxiv.org/abs/2309.13039
  • repo_url: https://github.com/dawning77/nerrf
  • paper_authors: Xiaoxue Chen, Junchen Liu, Hao Zhao, Guyue Zhou, Ya-Qin Zhang
  • for: This paper studies image-based view synthesis and addresses NeRF's inability to handle the complicated light-path changes caused by refraction and reflection, which prevents it from synthesizing transparent or specular objects.
  • methods: The authors propose the refractive-reflective field: taking the object silhouette as input, marching tetrahedra with a progressive encoding reconstruct the geometry of non-Lambertian objects, and the object's refraction and reflection effects are modeled in a unified framework using Fresnel terms; a virtual cone supersampling technique provides efficient and effective anti-aliasing.
  • results: The method is benchmarked on different shapes, backgrounds, and Fresnel terms on both real-world and synthetic datasets, with qualitative and quantitative comparisons of rendering results for editing applications including material editing, object replacement/insertion, and environment illumination estimation.
    Abstract Neural radiance fields (NeRF) have revolutionized the field of image-based view synthesis. However, NeRF uses straight rays and fails to deal with complicated light path changes caused by refraction and reflection. This prevents NeRF from successfully synthesizing transparent or specular objects, which are ubiquitous in real-world robotics and A/VR applications. In this paper, we introduce the refractive-reflective field. Taking the object silhouette as input, we first utilize marching tetrahedra with a progressive encoding to reconstruct the geometry of non-Lambertian objects and then model refraction and reflection effects of the object in a unified framework using Fresnel terms. Meanwhile, to achieve efficient and effective anti-aliasing, we propose a virtual cone supersampling technique. We benchmark our method on different shapes, backgrounds and Fresnel terms on both real-world and synthetic datasets. We also qualitatively and quantitatively benchmark the rendering results of various editing applications, including material editing, object replacement/insertion, and environment illumination estimation. Codes and data are publicly available at https://github.com/dawning77/NeRRF.
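The refraction/reflection modeling relies on Snell's law for the bent ray direction and the Fresnel equations for how much light is reflected versus transmitted at an interface. The numpy sketch below gives those two textbook formulas (refraction of a unit direction and unpolarized Fresnel reflectance); it is standard optics rather than the NeRRF code, and the index of refraction is an assumed value.

```python
import numpy as np

def refract(d, n, eta):
    """Refract unit direction d through a surface with unit normal n,
    eta = n1 / n2 (ratio of indices). Returns None on total internal reflection."""
    cos_i = -np.dot(d, n)
    sin2_t = eta ** 2 * (1.0 - cos_i ** 2)
    if sin2_t > 1.0:
        return None                                # total internal reflection
    cos_t = np.sqrt(1.0 - sin2_t)
    return eta * d + (eta * cos_i - cos_t) * n

def fresnel_reflectance(cos_i, n1=1.0, n2=1.5):
    """Unpolarized Fresnel reflectance at an interface (average of s and p terms)."""
    sin2_t = (n1 / n2) ** 2 * (1.0 - cos_i ** 2)
    if sin2_t > 1.0:
        return 1.0                                 # total internal reflection
    cos_t = np.sqrt(1.0 - sin2_t)
    rs = ((n1 * cos_i - n2 * cos_t) / (n1 * cos_i + n2 * cos_t)) ** 2
    rp = ((n1 * cos_t - n2 * cos_i) / (n1 * cos_t + n2 * cos_i)) ** 2
    return 0.5 * (rs + rp)

if __name__ == "__main__":
    d = np.array([0.0, -np.sqrt(0.5), -np.sqrt(0.5)])   # incoming ray at 45 degrees
    n = np.array([0.0, 0.0, 1.0])                        # surface normal
    t = refract(d, n, eta=1.0 / 1.5)                     # air -> glass-like medium
    R = fresnel_reflectance(cos_i=-np.dot(d, n))
    print(t, round(R, 4))                                # transmitted direction, reflectance
```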

Privacy Assessment on Reconstructed Images: Are Existing Evaluation Metrics Faithful to Human Perception?

  • paper_url: http://arxiv.org/abs/2309.13038
  • repo_url: None
  • paper_authors: Xiaoxiao Sun, Nidham Gazagnadou, Vivek Sharma, Lingjuan Lyu, Hongdong Li, Liang Zheng
  • for: This paper investigates whether existing hand-crafted image quality metrics faithfully reflect human perception of privacy information in reconstructed images.
  • methods: Images are reconstructed from many classification models using 4 existing attack methods, and multiple human annotators are asked to judge whether each reconstructed image is recognizable.
  • results: The hand-crafted metrics correlate only weakly with human judgments of privacy leakage and often contradict each other; a learning-based measure called SemSim is proposed to evaluate the semantic similarity between original and reconstructed images and shows a significantly higher correlation with human judgment.
    Abstract Hand-crafted image quality metrics, such as PSNR and SSIM, are commonly used to evaluate model privacy risk under reconstruction attacks. Under these metrics, reconstructed images that are determined to resemble the original one generally indicate more privacy leakage. Images determined as overall dissimilar, on the other hand, indicate higher robustness against attack. However, there is no guarantee that these metrics well reflect human opinions, which, as a judgement for model privacy leakage, are more trustworthy. In this paper, we comprehensively study the faithfulness of these hand-crafted metrics to human perception of privacy information from the reconstructed images. On 5 datasets ranging from natural images, faces, to fine-grained classes, we use 4 existing attack methods to reconstruct images from many different classification models and, for each reconstructed image, we ask multiple human annotators to assess whether this image is recognizable. Our studies reveal that the hand-crafted metrics only have a weak correlation with the human evaluation of privacy leakage and that even these metrics themselves often contradict each other. These observations suggest risks of current metrics in the community. To address this potential risk, we propose a learning-based measure called SemSim to evaluate the Semantic Similarity between the original and reconstructed images. SemSim is trained with a standard triplet loss, using an original image as an anchor, one of its recognizable reconstructed images as a positive sample, and an unrecognizable one as a negative. By training on human annotations, SemSim exhibits a greater reflection of privacy leakage on the semantic level. We show that SemSim has a significantly higher correlation with human judgment compared with existing metrics. Moreover, this strong correlation generalizes to unseen datasets, models and attack methods.
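SemSim is trained with a standard triplet loss: the original image is the anchor, a human-recognizable reconstruction is the positive, and an unrecognizable one is the negative. The PyTorch sketch below shows that training signal on a placeholder embedding network; the backbone, margin, and data loading are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Embedder(nn.Module):
    """Placeholder embedding network standing in for the SemSim feature extractor."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

embedder = Embedder()
criterion = nn.TripletMarginLoss(margin=0.2)
optimizer = torch.optim.Adam(embedder.parameters(), lr=1e-4)

# One toy training step: anchor = original, positive = recognizable reconstruction,
# negative = unrecognizable reconstruction (recognizability labels come from annotators).
anchor = torch.rand(8, 3, 64, 64)
positive = anchor + 0.05 * torch.randn_like(anchor)
negative = torch.rand(8, 3, 64, 64)

loss = criterion(embedder(anchor), embedder(positive), embedder(negative))
loss.backward()
optimizer.step()

# At assessment time, leakage is scored by anchor-reconstruction embedding similarity.
with torch.no_grad():
    semsim_score = (embedder(anchor) * embedder(positive)).sum(dim=-1)
print(round(loss.item(), 4), semsim_score.shape)
```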

Performance Analysis of UNet and Variants for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.13013
  • repo_url: None
  • paper_authors: Walid Ehab, Yongmin Li
  • for: This study explores the application of deep learning models, particularly the UNet architecture and its variants, to medical image segmentation.
  • methods: Three architectures, the standard UNet, Res-UNet, and Attention Res-UNet, are evaluated across several challenging medical image segmentation tasks, addressing image normalization, resizing, architecture choices, loss function design, and hyperparameter tuning.
  • results: The standard UNet, when extended with a deep network layer, is a proficient medical image segmentation model, while Res-UNet and Attention Res-UNet converge more smoothly and perform better, particularly when handling fine image details.
    Abstract Medical imaging plays a crucial role in modern healthcare by providing non-invasive visualisation of internal structures and abnormalities, enabling early disease detection, accurate diagnosis, and treatment planning. This study aims to explore the application of deep learning models, particularly focusing on the UNet architecture and its variants, in medical image segmentation. We seek to evaluate the performance of these models across various challenging medical image segmentation tasks, addressing issues such as image normalization, resizing, architecture choices, loss function design, and hyperparameter tuning. The findings reveal that the standard UNet, when extended with a deep network layer, is a proficient medical image segmentation model, while the Res-UNet and Attention Res-UNet architectures demonstrate smoother convergence and superior performance, particularly when handling fine image details. The study also addresses the challenge of high class imbalance through careful preprocessing and loss function definitions. We anticipate that the results of this study will provide useful insights for researchers seeking to apply these models to new medical imaging problems and offer guidance and best practices for their implementation.
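The class-imbalance issue mentioned in the abstract is typically handled through the loss function, for example by pairing weighted cross-entropy with a soft Dice term. Below is a hedged PyTorch sketch of such a combined segmentation loss; the class weights and mixing coefficient are illustrative and may not match the paper's exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_dice_loss(logits, targets, eps=1e-6):
    """logits: (B, C, H, W); targets: (B, H, W) integer class labels."""
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)
    intersection = (probs * one_hot).sum(dims)
    cardinality = probs.sum(dims) + one_hot.sum(dims)
    dice = (2.0 * intersection + eps) / (cardinality + eps)
    return 1.0 - dice.mean()

class CombinedSegLoss(nn.Module):
    """Weighted cross-entropy (rarer classes up-weighted) plus soft Dice."""
    def __init__(self, class_weights, dice_weight=0.5):
        super().__init__()
        self.ce = nn.CrossEntropyLoss(weight=class_weights)
        self.dice_weight = dice_weight

    def forward(self, logits, targets):
        return self.ce(logits, targets) + self.dice_weight * soft_dice_loss(logits, targets)

if __name__ == "__main__":
    loss_fn = CombinedSegLoss(class_weights=torch.tensor([0.2, 1.0, 1.0]))
    logits = torch.randn(2, 3, 64, 64, requires_grad=True)
    targets = torch.randint(0, 3, (2, 64, 64))
    loss = loss_fn(logits, targets)
    loss.backward()
    print(round(loss.item(), 4))
```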

Deep3DSketch+: Rapid 3D Modeling from Single Free-hand Sketches

  • paper_url: http://arxiv.org/abs/2309.13006
  • repo_url: None
  • paper_authors: Tianrun Chen, Chenglong Fu, Ying Zang, Lanyun Zhu, Jia Zhang, Papa Mao, Lingyun Sun
  • for: This paper aims to provide an end-to-end approach for 3D modeling using only a single free-hand sketch, without requiring multiple sketches or view information.
  • methods: The proposed approach, called Deep3DSketch+, uses a lightweight generation network for efficient inference in real-time, and a structural-aware adversarial training approach with a Stroke Enhancement Module (SEM) to capture the structural information and facilitate learning of realistic and fine-detailed shape structures.
  • results: The proposed approach achieved state-of-the-art (SOTA) performance on both synthetic and real datasets, demonstrating its effectiveness in generating high-fidelity 3D models from a single free-hand sketch.
    Abstract The rapid development of AR/VR brings tremendous demands for 3D content. While the widely-used Computer-Aided Design (CAD) method requires a time-consuming and labor-intensive modeling process, sketch-based 3D modeling offers a potential solution as a natural form of computer-human interaction. However, the sparsity and ambiguity of sketches make it challenging to generate high-fidelity content reflecting creators' ideas. Precise drawing from multiple views or strategic step-by-step drawings is often required to tackle the challenge but is not friendly to novice users. In this work, we introduce a novel end-to-end approach, Deep3DSketch+, which performs 3D modeling using only a single free-hand sketch without inputting multiple sketches or view information. Specifically, we introduce a lightweight generation network for efficient inference in real-time and a structural-aware adversarial training approach with a Stroke Enhancement Module (SEM) to capture the structural information to facilitate learning of the realistic and fine-detailed shape structures for high-fidelity performance. Extensive experiments demonstrated the effectiveness of our approach with the state-of-the-art (SOTA) performance on both synthetic and real datasets.

Point Cloud Network: An Order of Magnitude Improvement in Linear Layer Parameter Count

  • paper_url: http://arxiv.org/abs/2309.12996
  • repo_url: https://gitlab.com/chetterich/pcn-paper-and-materials
  • paper_authors: Charles Hetterich
  • for: This paper introduces the Point Cloud Network (PCN) architecture, a novel implementation of linear layers in deep learning networks, and provides empirical evidence for preferring it over the Multilayer Perceptron (MLP).
  • methods: Several models, including the original AlexNet, are trained with both MLP and PCN architectures for a direct comparison of linear layers.
  • results: AlexNet-PCN16, the PCN equivalent of AlexNet, achieves comparable test accuracy to the original architecture with a 99.5% reduction of parameters in its linear layers. All training was done on cloud RTX 4090 GPUs, using PyTorch for model construction and training.
    Abstract This paper introduces the Point Cloud Network (PCN) architecture, a novel implementation of linear layers in deep learning networks, and provides empirical evidence to advocate for its preference over the Multilayer Perceptron (MLP) in linear layers. We train several models, including the original AlexNet, using both MLP and PCN architectures for direct comparison of linear layers (Krizhevsky et al., 2012). The key results collected are model parameter count and top-1 test accuracy over the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009). AlexNet-PCN16, our PCN equivalent to AlexNet, achieves comparable efficacy (test accuracy) to the original architecture with a 99.5% reduction of parameters in its linear layers. All training is done on cloud RTX 4090 GPUs, leveraging pytorch for model construction and training. Code is provided for anyone to reproduce the trials from this paper.
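The headline comparison (a 99.5% reduction in linear-layer parameters at comparable accuracy) comes down to counting the weights and biases of the fully connected layers. The PyTorch sketch below counts linear-layer parameters for an AlexNet-style classifier head as a reference point; the "compact head" is only a hypothetical low-parameter stand-in and is not the paper's PCN layer, which is not reproduced here.

```python
import torch.nn as nn

def linear_layer_params(model):
    """Total parameter count of all nn.Linear layers in a model."""
    return sum(p.numel() for m in model.modules()
               if isinstance(m, nn.Linear) for p in m.parameters())

# AlexNet's classifier head: 9216 -> 4096 -> 4096 -> 1000 fully connected layers.
alexnet_head = nn.Sequential(
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),
)

# Hypothetical compact head with far fewer weights (NOT the paper's PCN layer).
compact_head = nn.Sequential(
    nn.Linear(256 * 6 * 6, 16), nn.ReLU(),
    nn.Linear(16, 1000),
)

full = linear_layer_params(alexnet_head)
small = linear_layer_params(compact_head)
print(full, small, f"reduction: {100 * (1 - small / full):.1f}%")
```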

License Plate Recognition Based On Multi-Angle View Model

  • paper_url: http://arxiv.org/abs/2309.12972
  • repo_url: https://github.com/zeniSoida/pl1
  • paper_authors: Dat Tran-Anh, Khanh Linh Tran, Hoai-Nam Vu
  • for: This work addresses text detection/recognition in camera images and videos, specifically recognizing the text on license plates.
  • methods: The method combines multiple views of a license plate to improve text detection accuracy: three viewpoints (view-1, view-2, view-3) are used to identify the text components of the plate, and similarity and distance metrics determine the nearest neighboring components so that text from the same license plate line can be restored; CnOCR is then used for text recognition.
  • results: Experiments on the self-collected PTITPlates dataset and the publicly available Stanford Cars Dataset show higher recognition accuracy than existing methods.
    Abstract In the realm of research, the detection/recognition of text within images/videos captured by cameras constitutes a highly challenging problem for researchers. Despite certain advancements achieving high accuracy, current methods still require substantial improvements to be applicable in practical scenarios. Diverging from text detection in images/videos, this paper addresses the issue of text detection within license plates by amalgamating multiple frames of distinct perspectives. For each viewpoint, the proposed method extracts descriptive features characterizing the text components of the license plate, specifically corner points and area. Concretely, we present three viewpoints: view-1, view-2, and view-3, to identify the nearest neighboring components facilitating the restoration of text components from the same license plate line based on estimations of similarity levels and distance metrics. Subsequently, we employ the CnOCR method for text recognition within license plates. Experimental results on the self-collected dataset (PTITPlates), comprising pairs of images in various scenarios, and the publicly available Stanford Cars Dataset, demonstrate the superiority of the proposed method over existing approaches.

PI-RADS v2 Compliant Automated Segmentation of Prostate Zones Using co-training Motivated Multi-task Dual-Path CNN

  • paper_url: http://arxiv.org/abs/2309.12970
  • repo_url: None
  • paper_authors: Arnab Das, Suhita Ghosh, Sebastian Stober
  • for: This paper provides an automated, PI-RADS v2 compliant segmentation of prostate zones to support consistent and precise lesion detection, staging, and treatment.
  • methods: A dual-branch convolutional neural network (CNN) in which each branch separately captures the representations of its connected zones (PZ, TZ, DPU, and AFS); in a second training stage, the representations from the two branches are fine-tuned complementarily through an unsupervised loss, and multi-task learning is integrated to further improve segmentation accuracy.
  • results: The approach improves the segmentation accuracy of the baseline (mean absolute symmetric distance) by 7.56%, 11.00%, 58.43%, and 19.67% for the PZ, TZ, DPU, and AFS zones, respectively.
    Abstract The detailed images produced by Magnetic Resonance Imaging (MRI) provide life-critical information for the diagnosis and treatment of prostate cancer. To provide standardized acquisition, interpretation and usage of the complex MRI images, the PI-RADS v2 guideline was proposed. An automated segmentation following the guideline facilitates consistent and precise lesion detection, staging and treatment. The guideline recommends a division of the prostate into four zones, PZ (peripheral zone), TZ (transition zone), DPU (distal prostatic urethra) and AFS (anterior fibromuscular stroma). Not every zone shares a boundary with the others and is present in every slice. Further, the representations captured by a single model might not suffice for all zones. This motivated us to design a dual-branch convolutional neural network (CNN), where each branch captures the representations of the connected zones separately. Further, the representations from different branches act complementary to each other at the second stage of training, where they are fine-tuned through an unsupervised loss. The loss penalises the difference in predictions from the two branches for the same class. We also incorporate multi-task learning in our framework to further improve the segmentation accuracy. The proposed approach improves the segmentation accuracy of the baseline (mean absolute symmetric distance) by 7.56%, 11.00%, 58.43% and 19.67% for PZ, TZ, DPU and AFS zones respectively.
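The second-stage unsupervised loss that "penalises the difference in predictions from the two branches for the same class" can be sketched as a simple consistency term between the two branches' probability maps. The PyTorch fragment below is a hedged illustration of that idea; the distance used (mean squared error over shared classes) and the tensor shapes are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_branch_consistency(logits_a, logits_b, shared_classes):
    """Penalize disagreement between the two branches on the classes they share.
    logits_*: (B, C, H, W); shared_classes: list of class indices compared across branches."""
    probs_a = torch.softmax(logits_a, dim=1)[:, shared_classes]
    probs_b = torch.softmax(logits_b, dim=1)[:, shared_classes]
    return F.mse_loss(probs_a, probs_b)

if __name__ == "__main__":
    # Branch A might predict e.g. {background, PZ, TZ} and branch B {background, DPU, AFS};
    # here class 0 (background) is taken as the class compared by the consistency term.
    logits_a = torch.randn(2, 3, 64, 64, requires_grad=True)
    logits_b = torch.randn(2, 3, 64, 64, requires_grad=True)
    loss = cross_branch_consistency(logits_a, logits_b, shared_classes=[0])
    loss.backward()
    print(round(loss.item(), 4))
```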

Detect Every Thing with Few Examples

  • paper_url: http://arxiv.org/abs/2309.12969
  • repo_url: https://github.com/mlzxy/devit
  • paper_authors: Xinyu Zhang, Yuting Wang, Abdeslam Boularias
  • for: This paper develops an open-set object detector that can detect categories not seen during training.
  • methods: DE-ViT uses vision-only DINOv2 backbones and learns new categories from example images instead of language; it transforms multi-classification into binary classification while bypassing per-class inference, and proposes a novel region propagation technique for localization.
  • results: On COCO, DE-ViT outperforms the open-vocabulary SoTA by 6.9 AP50 and reaches 50 AP50 on novel classes; it surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot, and the one-shot SoTA by 2.8 AP50. On LVIS, DE-ViT outperforms the open-vocabulary SoTA by 2.2 mask AP and reaches 34.3 mask APr.
    Abstract Open-set object detection aims at detecting arbitrary categories beyond those seen during training. Most recent advancements have adopted the open-vocabulary paradigm, utilizing vision-language backbones to represent categories with language. In this paper, we introduce DE-ViT, an open-set object detector that employs vision-only DINOv2 backbones and learns new categories through example images instead of language. To improve general detection ability, we transform multi-classification tasks into binary classification tasks while bypassing per-class inference, and propose a novel region propagation technique for localization. We evaluate DE-ViT on open-vocabulary, few-shot, and one-shot object detection benchmark with COCO and LVIS. For COCO, DE-ViT outperforms the open-vocabulary SoTA by 6.9 AP50 and achieves 50 AP50 in novel classes. DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot and one-shot SoTA by 2.8 AP50. For LVIS, DE-ViT outperforms the open-vocabulary SoTA by 2.2 mask AP and reaches 34.3 mask APr. Code is available at https://github.com/mlzxy/devit.
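Learning new categories from example images instead of language boils down to building a class prototype from the examples' backbone features and scoring candidate region features against it with cosine similarity as a per-class binary decision. The sketch below is a schematic PyTorch version of that prototype step under assumed feature shapes; it does not reproduce DE-ViT's region propagation or its detection head.

```python
import torch
import torch.nn.functional as F

def build_prototype(example_features):
    """example_features: (M, D) backbone features of example crops for one class."""
    return F.normalize(example_features.mean(dim=0), dim=-1)      # (D,)

def binary_scores(region_features, prototype, threshold=0.5):
    """Cosine similarity of each candidate region to the prototype, thresholded
    as a binary 'is this class present?' decision (no per-class softmax)."""
    sims = F.normalize(region_features, dim=-1) @ prototype       # (R,)
    return sims, sims > threshold

if __name__ == "__main__":
    torch.manual_seed(0)
    examples = torch.randn(5, 768)       # e.g. DINOv2-like features of 5 example crops
    regions = torch.randn(100, 768)      # features of 100 region proposals
    proto = build_prototype(examples)
    sims, keep = binary_scores(regions, proto)
    print(sims.shape, int(keep.sum()))
```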

Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction

  • paper_url: http://arxiv.org/abs/2309.13101
  • repo_url: https://github.com/ingra14m/Deformable-3D-Gaussians
  • paper_authors: Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, Xiaogang Jin
  • for: This work addresses the shortcomings of existing dynamic scene reconstruction and rendering methods, aiming for both higher quality and real-time speed.
  • methods: A deformable 3D Gaussian splatting method that reconstructs scenes with explicit 3D Gaussians and learns the Gaussians in canonical space with a deformation field to model monocular dynamic scenes; a smoothing training mechanism with no extra overhead mitigates the impact of inaccurate poses in real datasets on the smoothness of time-interpolation tasks.
  • results: The method significantly outperforms existing approaches in both rendering quality and rendering speed, making it well suited for tasks such as novel-view synthesis, time synthesis, and real-time rendering.
    Abstract Implicit neural representation has opened up new avenues for dynamic scene reconstruction and rendering. Nonetheless, state-of-the-art methods of dynamic neural rendering rely heavily on these implicit representations, which frequently struggle with accurately capturing the intricate details of objects in the scene. Furthermore, implicit methods struggle to achieve real-time rendering in general dynamic scenes, limiting their use in a wide range of tasks. To address the issues, we propose a deformable 3D Gaussians Splatting method that reconstructs scenes using explicit 3D Gaussians and learns Gaussians in canonical space with a deformation field to model monocular dynamic scenes. We also introduced a smoothing training mechanism with no extra overhead to mitigate the impact of inaccurate poses in real datasets on the smoothness of time interpolation tasks. Through differential gaussian rasterization, the deformable 3D Gaussians not only achieve higher rendering quality but also real-time rendering speed. Experiments show that our method outperforms existing methods significantly in terms of both rendering quality and speed, making it well-suited for tasks such as novel-view synthesis, time synthesis, and real-time rendering.
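The deformation field maps a canonical Gaussian's position plus a time stamp to offsets of its position, rotation, and scale, so that the canonical 3D Gaussians can follow the monocular dynamic scene. Below is a minimal, hypothetical PyTorch MLP with sinusoidal positional encoding illustrating that interface; layer sizes, encoding frequencies, and the output parameterization are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=6):
    """Standard sin/cos encoding applied to each input coordinate."""
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device) * torch.pi
    angles = x[..., None] * freqs                     # (..., dims, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(start_dim=-2)                  # (..., dims * 2 * num_freqs)

class DeformationField(nn.Module):
    """MLP: (canonical position, time) -> (delta position, delta rotation, delta scale)."""
    def __init__(self, num_freqs=6, hidden=128):
        super().__init__()
        in_dim = 4 * 2 * num_freqs                    # encode (x, y, z, t)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4 + 3),             # d_xyz, d_quaternion, d_scale
        )

    def forward(self, xyz, t):
        inp = positional_encoding(torch.cat([xyz, t], dim=-1))
        out = self.mlp(inp)
        return out[..., :3], out[..., 3:7], out[..., 7:]

if __name__ == "__main__":
    field = DeformationField()
    xyz = torch.randn(1024, 3)                        # canonical Gaussian centers
    t = torch.full((1024, 1), 0.25)                   # normalized timestamp
    d_xyz, d_rot, d_scale = field(xyz, t)
    print(d_xyz.shape, d_rot.shape, d_scale.shape)
```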

On Data Fabrication in Collaborative Vehicular Perception: Attacks and Countermeasures

  • paper_url: http://arxiv.org/abs/2309.12955
  • repo_url: https://github.com/zqzqz/advcollaborativeperception
  • paper_authors: Qingzhao Zhang, Shuowei Jin, Ruiyang Zhu, Jiachen Sun, Xumiao Zhang, Qi Alfred Chen, Z. Morley Mao
  • for: This paper examines the security risks that collaborative perception introduces for Connected and Autonomous Vehicles (CAVs), whose driving decisions rely on remote, untrusted data.
  • methods: Real-world experiments and high-fidelity simulation are used to study real-time data fabrication attacks on collaborative perception and corresponding defenses.
  • results: An attacker can deliver crafted malicious data that perturbs victims' perception results, causing hard brakes or increased collision risk, with a success rate of over 86% in simulation; the proposed anomaly detection approach detects 91.5% of attacks with a 3% false positive rate and significantly mitigates attack impacts in real-world scenarios.
    Abstract Collaborative perception, which greatly enhances the sensing capability of connected and autonomous vehicles (CAVs) by incorporating data from external resources, also brings forth potential security risks. CAVs' driving decisions rely on remote untrusted data, making them susceptible to attacks carried out by malicious participants in the collaborative perception system. However, security analysis and countermeasures for such threats are absent. To understand the impact of the vulnerability, we break the ground by proposing various real-time data fabrication attacks in which the attacker delivers crafted malicious data to victims in order to perturb their perception results, leading to hard brakes or increased collision risks. Our attacks demonstrate a high success rate of over 86% on high-fidelity simulated scenarios and are realizable in real-world experiments. To mitigate the vulnerability, we present a systematic anomaly detection approach that enables benign vehicles to jointly reveal malicious fabrication. It detects 91.5% of attacks with a false positive rate of 3% in simulated scenarios and significantly mitigates attack impacts in real-world scenarios.

Inter-vendor harmonization of Computed Tomography (CT) reconstruction kernels using unpaired image translation

  • paper_url: http://arxiv.org/abs/2309.12953
  • repo_url: None
  • paper_authors: Aravind R. Krishnan, Kaiwen Xu, Thomas Li, Chenyu Gao, Lucas W. Remedios, Praitayini Kanakaraj, Ho Hin Lee, Shunxing Bao, Kim L. Sandler, Fabien Maldonado, Ivana Isgum, Bennett A. Landman
  • for: This paper aims to investigate the harmonization of computed tomography (CT) scans from different manufacturers using an unpaired image translation approach.
  • methods: The authors use a multipath cycle generative adversarial network (GAN) to harmonize the CT scans and evaluate the effect of harmonization on the reconstruction kernels.
  • results: The authors find that their approach minimizes differences in emphysema measurement and highlights the impact of age, sex, smoking status, and vendor on emphysema quantification.
    Abstract The reconstruction kernel in computed tomography (CT) generation determines the texture of the image. Consistency in reconstruction kernels is important as the underlying CT texture can impact measurements during quantitative image analysis. Harmonization (i.e., kernel conversion) minimizes differences in measurements due to inconsistent reconstruction kernels. Existing methods investigate harmonization of CT scans in single or multiple manufacturers. However, these methods require paired scans of hard and soft reconstruction kernels that are spatially and anatomically aligned. Additionally, a large number of models need to be trained across different kernel pairs within manufacturers. In this study, we adopt an unpaired image translation approach to investigate harmonization between and across reconstruction kernels from different manufacturers by constructing a multipath cycle generative adversarial network (GAN). We use hard and soft reconstruction kernels from the Siemens and GE vendors from the National Lung Screening Trial dataset. We use 50 scans from each reconstruction kernel and train a multipath cycle GAN. To evaluate the effect of harmonization on the reconstruction kernels, we harmonize 50 scans each from Siemens hard kernel, GE soft kernel and GE hard kernel to a reference Siemens soft kernel (B30f) and evaluate percent emphysema. We fit a linear model by considering the age, smoking status, sex and vendor and perform an analysis of variance (ANOVA) on the emphysema scores. Our approach minimizes differences in emphysema measurement and highlights the impact of age, sex, smoking status and vendor on emphysema quantification.
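Unpaired kernel conversion with a cycle GAN hinges on the cycle-consistency constraint: translating a hard-kernel scan to the soft-kernel domain and back should reproduce the original scan. The PyTorch fragment below shows that loss term in isolation with toy generator stand-ins; the paper's multipath architecture, discriminators, and adversarial terms are not reproduced, and the weighting is an assumption.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the two generators (hard->soft and soft->hard kernel translation).
G_hard_to_soft = nn.Conv2d(1, 1, kernel_size=3, padding=1)
G_soft_to_hard = nn.Conv2d(1, 1, kernel_size=3, padding=1)
l1 = nn.L1Loss()

def cycle_consistency_loss(hard_scan, soft_scan, lam=10.0):
    """L1 reconstruction error after a full forward-backward translation cycle."""
    rec_hard = G_soft_to_hard(G_hard_to_soft(hard_scan))
    rec_soft = G_hard_to_soft(G_soft_to_hard(soft_scan))
    return lam * (l1(rec_hard, hard_scan) + l1(rec_soft, soft_scan))

if __name__ == "__main__":
    hard = torch.randn(2, 1, 64, 64)   # unpaired CT slices, hard reconstruction kernel
    soft = torch.randn(2, 1, 64, 64)   # unpaired CT slices, soft reconstruction kernel
    loss = cycle_consistency_loss(hard, soft)
    loss.backward()
    print(round(loss.item(), 4))
```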

Background Activation Suppression for Weakly Supervised Object Localization and Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.12943
  • repo_url: https://github.com/wpy1999/bas-extension
  • paper_authors: Wei Zhai, Pingyu Wu, Kai Zhu, Yang Cao, Feng Wu, Zheng-Jun Zha
  • for: This work aims to improve weakly supervised object localization and semantic segmentation, which achieve pixel-level localization by generating a foreground prediction map (FPM).
  • methods: Two key experimental observations motivate the approach: 1) for a trained network, the cross-entropy converges to zero while the foreground mask still covers only part of the object region; 2) the activation value keeps increasing until the foreground mask expands to the object boundary. Based on these observations, a Background Activation Suppression (BAS) method is proposed, in which an Activation Map Constraint (AMC) module facilitates learning of the generator by suppressing the background activation value, while foreground region guidance and an area constraint help learn the whole object region.
  • results: Extensive experiments show significant and consistent improvements over baseline methods on CUB-200-2011 and ILSVRC, and the method also achieves state-of-the-art weakly supervised semantic segmentation performance on PASCAL VOC 2012 and MS COCO 2014.
    Abstract Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels. Recently, a new paradigm has emerged by generating a foreground prediction map (FPM) to achieve pixel-level localization. While existing FPM-based methods use cross-entropy to evaluate the foreground prediction map and to guide the learning of the generator, this paper presents two astonishing experimental observations on the object localization learning process: For a trained network, as the foreground mask expands, 1) the cross-entropy converges to zero when the foreground mask covers only part of the object region. 2) The activation value continuously increases until the foreground mask expands to the object boundary. Therefore, to achieve a more effective localization performance, we argue for the usage of activation value to learn more object regions. In this paper, we propose a Background Activation Suppression (BAS) method. Specifically, an Activation Map Constraint (AMC) module is designed to facilitate the learning of generator by suppressing the background activation value. Meanwhile, by using foreground region guidance and area constraint, BAS can learn the whole region of the object. In the inference phase, we consider the prediction maps of different categories together to obtain the final localization results. Extensive experiments show that BAS achieves significant and consistent improvement over the baseline methods on the CUB-200-2011 and ILSVRC datasets. In addition, our method also achieves state-of-the-art weakly supervised semantic segmentation performance on the PASCAL VOC 2012 and MS COCO 2014 datasets. Code and models are available at https://github.com/wpy1999/BAS-Extension.
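The core of BAS is to train the mask generator with the activation value rather than cross-entropy: the activation left outside the predicted foreground mask is suppressed, while an area constraint keeps the mask from collapsing or covering everything. The snippet below is a hedged PyTorch sketch of such a background-activation term; the exact form of the paper's AMC module and its guidance terms is not reproduced.

```python
import torch

def background_activation_loss(activation_map, fg_mask, area_weight=1.0):
    """activation_map: (B, 1, H, W) class-specific activation from the classifier;
    fg_mask: (B, 1, H, W) foreground prediction map in [0, 1] from the generator.
    Suppress activation left in the background while constraining the mask area."""
    bg_activation = (activation_map * (1.0 - fg_mask)).mean()   # background activation term
    area = fg_mask.mean()                                       # area constraint term
    return bg_activation + area_weight * area

if __name__ == "__main__":
    act = torch.rand(2, 1, 28, 28)                    # e.g. a CAM-like activation map
    mask = torch.rand(2, 1, 28, 28, requires_grad=True)
    loss = background_activation_loss(act, mask)
    loss.backward()
    print(round(loss.item(), 4))
```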

Zero-Shot Object Counting with Language-Vision Models

  • paper_url: http://arxiv.org/abs/2309.13097
  • repo_url: None
  • paper_authors: Jingyi Xu, Hieu Le, Dimitris Samaras
  • for: This work aims to count object instances of an arbitrary class at test time without human-annotated exemplars.
  • methods: A new setting, zero-shot object counting (ZSC), is proposed in which only the class name is available at test time, removing the need for human annotators and enabling automated operation. A few object crops are retrieved from the input image to serve as counting exemplars: class prototypes built with large language-vision models, including CLIP and Stable Diffusion, select the patches containing the target objects, and a ranking model estimates the counting error of each patch to pick the most suitable exemplars.
  • results: Experiments on the recent class-agnostic counting dataset FSC-147 validate the effectiveness of the method.
    Abstract Class-agnostic object counting aims to count object instances of an arbitrary class at test time. It is challenging but also enables many potential applications. Current methods require human-annotated exemplars as inputs which are often unavailable for novel categories, especially for autonomous systems. Thus, we propose zero-shot object counting (ZSC), a new setting where only the class name is available during test time. This obviates the need for human annotators and enables automated operation. To perform ZSC, we propose finding a few object crops from the input image and use them as counting exemplars. The goal is to identify patches containing the objects of interest while also being visually representative for all instances in the image. To do this, we first construct class prototypes using large language-vision models, including CLIP and Stable Diffusion, to select the patches containing the target objects. Furthermore, we propose a ranking model that estimates the counting error of each patch to select the most suitable exemplars for counting. Experimental results on a recent class-agnostic counting dataset, FSC-147, validate the effectiveness of our method.

Bridging Sensor Gaps via Single-Direction Tuning for Hyperspectral Image Classification

  • paper_url: http://arxiv.org/abs/2309.12865
  • repo_url: https://github.com/cecilia-xue/hyt-nas
  • paper_authors: Xizhe Xue, Haokui Zhang, Ying Li, Liuwei Wan, Zongwen Bai, Mike Zheng Shou
  • for: This paper aims to address the challenge of training ViT models on hyperspectral images (HSIs) with limited training samples.
  • methods: The proposed method is called single-direction tuning (SDT) and it leverages existing labeled HSI datasets and RGB datasets to enhance the performance on new HSI datasets. SDT uses a parallel architecture, asynchronous cold-hot gradient update strategy, and unidirectional interaction.
  • results: The proposed Triplet-structured transformer (Tri-Former) achieves better performance compared to several state-of-the-art methods on three representative HSI datasets. Homologous, heterologous, and cross-modal tuning experiments verify the effectiveness of the proposed SDT.
    Abstract Recently, some researchers started exploring the use of ViTs in tackling HSI classification and achieved remarkable results. However, the training of ViT models requires a considerable number of training samples, while hyperspectral data, due to its high annotation costs, typically has a relatively small number of training samples. This contradiction has not been effectively addressed. In this paper, aiming to solve this problem, we propose the single-direction tuning (SDT) strategy, which serves as a bridge, allowing us to leverage existing labeled HSI datasets even RGB datasets to enhance the performance on new HSI datasets with limited samples. The proposed SDT inherits the idea of prompt tuning, aiming to reuse pre-trained models with minimal modifications for adaptation to new tasks. But unlike prompt tuning, SDT is custom-designed to accommodate the characteristics of HSIs. The proposed SDT utilizes a parallel architecture, an asynchronous cold-hot gradient update strategy, and unidirectional interaction. It aims to fully harness the potent representation learning capabilities derived from training on heterologous, even cross-modal datasets. In addition, we also introduce a novel Triplet-structured transformer (Tri-Former), where spectral attention and spatial attention modules are merged in parallel to construct the token mixing component for reducing computation cost and a 3D convolution-based channel mixer module is integrated to enhance stability and keep structure information. Comparison experiments conducted on three representative HSI datasets captured by different sensors demonstrate the proposed Tri-Former achieves better performance compared to several state-of-the-art methods. Homologous, heterologous and cross-modal tuning experiments verified the effectiveness of the proposed SDT.

Associative Transformer Is A Sparse Representation Learner

  • paper_url: http://arxiv.org/abs/2309.12862
  • repo_url: None
  • paper_authors: Yuwei Sun, Hideya Ochiai, Zhirong Wu, Stephen Lin, Ryota Kanai
  • for: 这篇论文旨在探讨如何使用弹性交互来更好地模拟生物学原理,并提出了一种基于全球工作空间理论和相关记忆的Associative Transformer(AiT)模型。
  • methods: AiT模型使用了跨层聚合的核心空间,并通过结合缓存的方式实现瓶颈式注意力。这些瓶颈式注意力会限制注意力的容量,从而模拟生物学中的弹性交互。
  • results: 对于多种视觉任务,AiT模型表现出了superiority,可以学习不同的特征弹性,并且可以在不同的输入量和维度上保持复杂度的不变性。
    Abstract Emerging from the monolithic pairwise attention mechanism in conventional Transformer models, there is a growing interest in leveraging sparse interactions that align more closely with biological principles. Approaches including the Set Transformer and the Perceiver employ cross-attention consolidated with a latent space that forms an attention bottleneck with limited capacity. Building upon recent neuroscience studies of Global Workspace Theory and associative memory, we propose the Associative Transformer (AiT). AiT induces low-rank explicit memory that serves as both priors to guide bottleneck attention in the shared workspace and attractors within associative memory of a Hopfield network. Through joint end-to-end training, these priors naturally develop module specialization, each contributing a distinct inductive bias to form attention bottlenecks. A bottleneck can foster competition among inputs for writing information into the memory. We show that AiT is a sparse representation learner, learning distinct priors through the bottlenecks that are complexity-invariant to input quantities and dimensions. AiT demonstrates its superiority over methods such as the Set Transformer, Vision Transformer, and Coordination in various vision tasks.
    摘要 (Simplified Chinese translation)由传统的对称Transformer模型中的单一对对注意机制而出发,有一种增长的兴趣是利用稀疏的交互来更加准确地遵循生物学原理。包括Set Transformer和Perceiver在内的方法都使用了混合注意力,并通过限制容量的瓶颈注意力来实现稀疏的交互。基于最近的 neuroscience研究的全球工作区理论和相关记忆,我们提出了相关转换器(AiT)。AiT通过强制实现低级别的显式记忆,使得瓶颈注意力在共享工作区中服务为导向注意力的先验知识,并在相关记忆中形成吸引器。通过联合的终端训练,这些先验知识自然发展出模块特化,每个模块增加了不同的抽象偏好,以形成注意瓶颈。这个瓶颈可以促进输入竞争对写入记忆。我们显示AiT是一种稀疏表示学习器,通过瓶颈学习出不同的先验知识,这些先验知识是输入量和维度的复杂性不变的。AiT在不同的视觉任务中表现出优势。
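AiT's full design (Hopfield-style associative memory, jointly trained priors) is beyond a short snippet, but the attention-bottleneck idea it builds on can be sketched: a small set of learned prior tokens reads from the full token sequence through cross-attention, and all tokens then read the consolidated workspace back. This is a minimal Perceiver/Set-Transformer-style sketch under assumed dimensions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class BottleneckAttention(nn.Module):
    """A few learned 'prior' tokens cross-attend to the input tokens, forming an attention
    bottleneck of limited capacity; inputs compete to write into this shared workspace."""
    def __init__(self, dim=128, num_priors=8, num_heads=4):
        super().__init__()
        self.priors = nn.Parameter(torch.randn(num_priors, dim) * 0.02)  # explicit low-rank memory
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.write = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                        # tokens: (B, N, dim)
        b = tokens.size(0)
        mem = self.priors.unsqueeze(0).expand(b, -1, -1)
        mem, _ = self.read(mem, tokens, tokens)       # write step: tokens -> bottleneck memory
        out, _ = self.write(tokens, mem, mem)         # read step: memory broadcast back to tokens
        return tokens + out

x = torch.randn(2, 196, 128)                          # e.g. 14x14 patch tokens
print(BottleneckAttention()(x).shape)                 # torch.Size([2, 196, 128])
```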

Cross-Modal Translation and Alignment for Survival Analysis

  • paper_url: http://arxiv.org/abs/2309.12855
  • repo_url: https://github.com/ft-zhou-zzz/cmta
  • paper_authors: Fengtao Zhou, Hao Chen
  • for: 这篇论文的目的是提出一个 Cross-Modal Translation and Alignment (CMTA) 框架,以探索不同模式之间的自然联系,并将不同模式之间的资讯转换为彼此对应的形式,以提高统计分析的精度和准确性。
  • methods: 这篇论文使用了两个平行的encoder-decoder结构,将多modal资料融合为单一的数据表现,并通过将生成的跨模式表现与原始模式表现进行对应,以提高模式之间的联系和转换资讯。此外,这篇论文还提出了一个跨模式注意力模组,作为不同模式之间的资讯桥梁,以实现跨模式的互动和资讯转换。
  • results: 这篇论文的实验结果显示,跨模式转换和对应的CMTA框架能够在五个公共TCGA数据集上实现更高的统计分析精度和准确性,比起现有的方法。
    Abstract With the rapid advances in high-throughput sequencing technologies, the focus of survival analysis has shifted from examining clinical indicators to incorporating genomic profiles with pathological images. However, existing methods either directly adopt a straightforward fusion of pathological features and genomic profiles for survival prediction, or take genomic profiles as guidance to integrate the features of pathological images. The former would overlook intrinsic cross-modal correlations. The latter would discard pathological information irrelevant to gene expression. To address these issues, we present a Cross-Modal Translation and Alignment (CMTA) framework to explore the intrinsic cross-modal correlations and transfer potential complementary information. Specifically, we construct two parallel encoder-decoder structures for multi-modal data to integrate intra-modal information and generate cross-modal representation. Taking the generated cross-modal representation to enhance and recalibrate intra-modal representation can significantly improve its discrimination for comprehensive survival analysis. To explore the intrinsic crossmodal correlations, we further design a cross-modal attention module as the information bridge between different modalities to perform cross-modal interactions and transfer complementary information. Our extensive experiments on five public TCGA datasets demonstrate that our proposed framework outperforms the state-of-the-art methods.
    摘要 随着高通量测序技术的快速发展,生存分析的注意点从临床指标转移到了将 genomic profil 与生理图像 incorporate 到生存预测中。现有方法可以分为两类:直接将生理特征和 genomic profil 简单地拼接起来进行生存预测,或者将 genomic profil 作为引导,将生理图像的特征 integrate 到生存预测中。前者可能会忽略不同Modal 之间的自然相关性。后者可能会丢弃不相关于蛋白表达的生理信息。为解决这些问题,我们提出了一种 Cross-Modal Translation and Alignment (CMTA) 框架,用于探索不同Modal 之间的自然相关性,并将 complementary 信息传递。 Specifically, we construct two parallel encoder-decoder structures for multi-modal data to integrate intra-modal information and generate cross-modal representation. Taking the generated cross-modal representation to enhance and recalibrate intra-modal representation can significantly improve its discrimination for comprehensive survival analysis. To explore the intrinsic crossmodal correlations, we further design a cross-modal attention module as the information bridge between different modalities to perform cross-modal interactions and transfer complementary information. Our extensive experiments on five public TCGA datasets demonstrate that our proposed framework outperforms the state-of-the-art methods.
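A minimal sketch of the cross-modal bridge described above, assuming tokenized pathology and genomic features of equal width: each modality queries the other through cross-attention, and the resulting cross-modal representation recalibrates the intra-modal features. The sigmoid gating is an illustrative choice, not necessarily the paper's exact recalibration.

```python
import torch
import torch.nn as nn

class CrossModalBridge(nn.Module):
    """Each modality queries the other via cross-attention; the cross-modal representation
    then gates (recalibrates) and enhances the original intra-modal features."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.p2g = nn.MultiheadAttention(dim, heads, batch_first=True)  # pathology queries genomics
        self.g2p = nn.MultiheadAttention(dim, heads, batch_first=True)  # genomics queries pathology
        self.gate_p = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.gate_g = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, path_tok, gene_tok):            # (B, Np, dim), (B, Ng, dim)
        cross_p, _ = self.p2g(path_tok, gene_tok, gene_tok)   # complementary info for pathology
        cross_g, _ = self.g2p(gene_tok, path_tok, path_tok)   # complementary info for genomics
        path_out = path_tok * self.gate_p(cross_p) + cross_p  # recalibrate + enhance
        gene_out = gene_tok * self.gate_g(cross_g) + cross_g
        return path_out, gene_out

p, g = torch.randn(2, 50, 256), torch.randn(2, 30, 256)
po, go = CrossModalBridge()(p, g)
print(po.shape, go.shape)                             # (2, 50, 256) (2, 30, 256)
```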

SRFNet: Monocular Depth Estimation with Fine-grained Structure via Spatial Reliability-oriented Fusion of Frames and Events

  • paper_url: http://arxiv.org/abs/2309.12842
  • repo_url: https://github.com/Tianbo-Pan/SRFNet
  • paper_authors: Tianbo Pan, Zidong Cao, Lin Wang
  • for: 本研究旨在提高单目视频中的深度估计精度,以便应用于机器人导航和自动驾驶等场景。
  • methods: 本研究提出了一种名为SRFNet的新网络模型,包括两个关键技术组件:一是基于注意力的互动式融合模块(AIF),二是可靠性 oriented 深度修正模块(RDR)。AIF模块使用事件和帧的空间偏好作为初始 máscara 来引导多模态特征融合,并通过反馈增强帧和事件特征学习。RDR模块使用融合的特征和 máscara 来估计精度高的深度结构。
  • results: 本研究在 synthetic 和实际世界数据集上评估了SRFNet的效果,结果显示,无需预训练,SRFNet可以在夜景中比 Priors 等方法更高的性能。
    Abstract Monocular depth estimation is a crucial task to measure distance relative to a camera, which is important for applications, such as robot navigation and self-driving. Traditional frame-based methods suffer from performance drops due to the limited dynamic range and motion blur. Therefore, recent works leverage novel event cameras to complement or guide the frame modality via frame-event feature fusion. However, event streams exhibit spatial sparsity, leaving some areas unperceived, especially in regions with marginal light changes. Therefore, direct fusion methods, e.g., RAMNet, often ignore the contribution of the most confident regions of each modality. This leads to structural ambiguity in the modality fusion process, thus degrading the depth estimation performance. In this paper, we propose a novel Spatial Reliability-oriented Fusion Network (SRFNet), that can estimate depth with fine-grained structure at both daytime and nighttime. Our method consists of two key technical components. Firstly, we propose an attention-based interactive fusion (AIF) module that applies spatial priors of events and frames as the initial masks and learns the consensus regions to guide the inter-modal feature fusion. The fused feature are then fed back to enhance the frame and event feature learning. Meanwhile, it utilizes an output head to generate a fused mask, which is iteratively updated for learning consensual spatial priors. Secondly, we propose the Reliability-oriented Depth Refinement (RDR) module to estimate dense depth with the fine-grained structure based on the fused features and masks. We evaluate the effectiveness of our method on the synthetic and real-world datasets, which shows that, even without pretraining, our method outperforms the prior methods, e.g., RAMNet, especially in night scenes. Our project homepage: https://vlislab22.github.io/SRFNet.
    摘要 单目深度估计是一个重要的任务,用于测量相机附近的距离,这对于自动驾驶和机器人定位等应用非常重要。传统的帧基本方法受限于对应数范围和运动模糊的问题,因此latest works将event camera整合或导引帧模式的特性。然而,event流拥有空间罕见性,特别是在光度变化较小的区域,导致direct fusion方法,例如RAMNet,忽略了每个模式的最有信心区域的贡献。这会导致多模式融合过程中的结构混乱,进而下降深度估计性能。在这篇论文中,我们提出了一个名为Spatial Reliability-oriented Fusion Network(SRFNet)的新方法,可以在日间和夜间都 estimate fine-grained的深度结构。我们的方法包括两个关键技术部分。首先,我们提出了一个注意力基于的互动式融合(AIF)模组,它根据事件和帧的空间假设作为初始mask,并学习导引多modal feature融合的共识区域。融合后的特征被反馈以提高帧和事件特征学习。同时,它还使用一个output head生成融合mask,并轮询更新以学习共识的空间假设。其次,我们提出了可靠性对适定深度修正(RDR)模组,用于根据融合特征和mask估计精确的深度结构。我们将这个方法评估在实验和真实世界数据上,结果显示,不需要预训,我们的方法在夜间场景中表现更好,比如RAMNet等先前的方法。更多详细信息可以通过我们的项目首页:
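The AIF module is described as using the spatial priors of events and frames as initial masks to guide fusion and as producing an updated fused mask. A one-iteration sketch of that idea (mask-weighted fusion plus a small mask head) could look like the following; the layer sizes and the per-pixel softmax weighting are assumptions rather than SRFNet's actual design.

```python
import torch
import torch.nn as nn

class MaskGuidedFusion(nn.Module):
    """Frame and event features are fused per pixel, weighted by confidence masks that act as
    spatial priors; a small head re-estimates the consensus mask for the next iteration."""
    def __init__(self, ch=32):
        super().__init__()
        self.mix = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)
        self.mask_head = nn.Conv2d(ch, 1, kernel_size=1)

    def forward(self, f_frame, f_event, m_frame, m_event):
        w = torch.softmax(torch.cat([m_frame, m_event], dim=1), dim=1)   # per-pixel modality weights
        fused = self.mix(torch.cat([f_frame * w[:, :1], f_event * w[:, 1:]], dim=1))
        new_mask = torch.sigmoid(self.mask_head(fused))                  # updated spatial prior
        return fused, new_mask

f1, f2 = torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64)
m1, m2 = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
fused, mask = MaskGuidedFusion()(f1, f2, m1, m2)
print(fused.shape, mask.shape)        # (1, 32, 64, 64) (1, 1, 64, 64)
```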

Domain Adaptive Few-Shot Open-Set Learning

  • paper_url: http://arxiv.org/abs/2309.12814
  • repo_url: https://github.com/debabratapal7/dafosnet
  • paper_authors: Debabrata Pal, Deeptej More, Sai Bhargav, Dipesh Tamboli, Vaneet Aggarwal, Biplab Banerjee
  • for: 解决目标查询集中未知样本和Visual shift问题,同时可以快速适应新场景。
  • methods: 提出了一种新的方法called Domain Adaptive Few-Shot Open Set Recognition (DA-FSOS),并使用了一种名为DAFOSNET的meta-learning-based架构。在训练过程中,模型学习了共享和特异 embedding space,并创建了一个pseudo open-space决策边界。
  • results: 通过使用一对 conditional adversarial networks和domain-specific batch-normalized class prototypes alignment strategy,模型能够快速适应新场景并提高数据密度。
    Abstract Few-shot learning has made impressive strides in addressing the crucial challenges of recognizing unknown samples from novel classes in target query sets and managing visual shifts between domains. However, existing techniques fall short when it comes to identifying target outliers under domain shifts by learning to reject pseudo-outliers from the source domain, resulting in an incomplete solution to both problems. To address these challenges comprehensively, we propose a novel approach called Domain Adaptive Few-Shot Open Set Recognition (DA-FSOS) and introduce a meta-learning-based architecture named DAFOSNET. During training, our model learns a shared and discriminative embedding space while creating a pseudo open-space decision boundary, given a fully-supervised source domain and a label-disjoint few-shot target domain. To enhance data density, we use a pair of conditional adversarial networks with tunable noise variances to augment both domains closed and pseudo-open spaces. Furthermore, we propose a domain-specific batch-normalized class prototypes alignment strategy to align both domains globally while ensuring class-discriminativeness through novel metric objectives. Our training approach ensures that DAFOS-NET can generalize well to new scenarios in the target domain. We present three benchmarks for DA-FSOS based on the Office-Home, mini-ImageNet/CUB, and DomainNet datasets and demonstrate the efficacy of DAFOS-NET through extensive experimentation
    摘要 少样本学习已经在识别目标查询集中来自新类的未知样本、以及处理跨领域视觉变化这两个关键挑战上取得了很好的进展。然而，现有技术在领域偏移下识别目标异常点方面仍有缺陷，即仅通过学习拒绝源领域中的 pseudo-outlier，导致对这两个问题的解决并不完整。为了全面解决这些挑战，我们提出了一种新的方法 Domain Adaptive Few-Shot Open Set Recognition (DA-FSOS)，并介绍了一种基于 meta-learning 的架构 DAFOSNET。在训练过程中，我们的模型学习了共享且具有判别性的嵌入空间，同时利用完全监督的源领域和一个标签分离的少样本目标领域创建了一个伪开放空间决策边界。为了增强数据密度，我们使用了一对带可调噪声方差的 conditional adversarial networks 来扩增两个领域的封闭空间和伪开放空间。此外，我们提出了一种领域特定的 batch-normalized 类原型对齐策略，用于在全局上对齐两个领域，并通过新的度量目标确保类别判别性。我们的训练方法确保了 DAFOS-NET 能够很好地泛化到目标领域中的新场景。我们基于 Office-Home、mini-ImageNet/CUB 和 DomainNet 数据集提出了三个 DA-FSOS 基准，并通过大量实验证明了 DAFOS-NET 的有效性。

Automatic view plane prescription for cardiac magnetic resonance imaging via supervision by spatial relationship between views

  • paper_url: http://arxiv.org/abs/2309.12805
  • repo_url: https://github.com/wd111624/cmr_plan
  • paper_authors: Dong Wei, Yawen Huang, Donghuan Lu, Yuexiang Li, Yefeng Zheng
  • for: 这种系统的目的是自动化卡路里MR图像的规划，以帮助临床实践中的医生和技术人员更加快速和准确地完成图像规划。
  • methods: 该系统使用了深度学习网络，通过挖掘数据中的空间关系来自动地确定目标平面和源视图之间的交叉线，并通过堆栈锥体网络来逐步提高回归。此外，该系统还使用了多视图规划策略，将所有源视图中的预测热图聚合以获得全球最优的规划。
  • results: 实验结果显示，该系统可以准确地预测四个标准的卡路里MR图像平面，并且比现有的方法更加精准，包括传统的Atlas-based和 newer deep-learning-based方法。此外，该系统还可以预测第一个Cardiac-anatomy-oriented平面（或多个平面），从body-oriented扫描中获得。
    Abstract Background: View planning for the acquisition of cardiac magnetic resonance (CMR) imaging remains a demanding task in clinical practice. Purpose: Existing approaches to its automation relied either on an additional volumetric image not typically acquired in clinic routine, or on laborious manual annotations of cardiac structural landmarks. This work presents a clinic-compatible, annotation-free system for automatic CMR view planning. Methods: The system mines the spatial relationship, more specifically, locates the intersecting lines, between the target planes and source views, and trains deep networks to regress heatmaps defined by distances from the intersecting lines. The intersection lines are the prescription lines prescribed by the technologists at the time of image acquisition using cardiac landmarks, and retrospectively identified from the spatial relationship. As the spatial relationship is self-contained in properly stored data, the need for additional manual annotation is eliminated. In addition, the interplay of multiple target planes predicted in a source view is utilized in a stacked hourglass architecture to gradually improve the regression. Then, a multi-view planning strategy is proposed to aggregate information from the predicted heatmaps for all the source views of a target plane, for a globally optimal prescription, mimicking the similar strategy practiced by skilled human prescribers. Results: The experiments include 181 CMR exams. Our system yields the mean angular difference and point-to-plane distance of 5.68 degrees and 3.12 mm, respectively. It not only achieves superior accuracy to existing approaches including conventional atlas-based and newer deep-learning-based in prescribing the four standard CMR planes but also demonstrates prescription of the first cardiac-anatomy-oriented plane(s) from the body-oriented scout.
    摘要 背景:卡路里变 imagine(CMR)成像取得的规划仍然是艰辛的任务在临床实践中。目的:现有的自动化方法都是基于不常见的三维图像或劳累的手动标注卡ди亚Structural landmarks。这个工作提出了一个可以在临床实践中使用的无需标注的自动CMR规划系统。方法:系统利用目标平面和源视图之间的空间关系,具体来说是找出目标平面和源视图之间的交叉点,并使用深度网络来回归定距离 définition heatmaps。交叉点是由技术人员在图像取得时使用卡ди亚Structural landmarks预scribed的规则,并在后续从空间关系中回拟。由于空间关系自身含有所需的信息,因此无需额外的手动标注。此外,系统还利用多个目标平面预测的多个源视图之间的互动,在堆栈ourglass架构中进行渐进改进。然后,提议一种多视图规划策略,将所有源视图中的预测热图集成,以实现全局优化的规划,类似于人类资深决策者的做法。结果:实验包括181个CMR试验。我们的系统的平均角度差和点到平面距离为5.68度和3.12mm。不仅达到了现有方法的精度,还可以成功地预scribed四个标准CMR平面,以及首次预scribedBody-oriented scout中的cardiac-anatomy-oriented平面。
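The regression targets here are heatmaps defined by distances from the intersection (prescription) lines. A small NumPy sketch of how such a target could be rendered from a line given by two points in a source view; the Gaussian form and the sigma value are assumptions, not the paper's exact definition.

```python
import numpy as np

def line_distance_heatmap(h, w, p0, p1, sigma=3.0):
    """Heatmap defined by each pixel's perpendicular distance to the prescription line
    through p0 and p1 (both given as (x, y) pixel coordinates)."""
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    d = np.array(p1, float) - np.array(p0, float)
    d /= np.linalg.norm(d)
    dist = np.abs((xs - p0[0]) * d[1] - (ys - p0[1]) * d[0])   # point-to-line distance
    return np.exp(-(dist ** 2) / (2 * sigma ** 2))

hm = line_distance_heatmap(128, 128, p0=(20, 30), p1=(100, 90))
print(hm.shape, hm.max())   # (128, 128) 1.0 -- pixels on the line get the peak value
```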

Scalable Semantic 3D Mapping of Coral Reefs with Deep Learning

  • paper_url: http://arxiv.org/abs/2309.12804
  • repo_url: None
  • paper_authors: Jonathan Sauder, Guilhem Banc-Prandi, Anders Meibom, Devis Tuia
  • for: This paper aims to develop a new method for mapping underwater environments from ego-motion video, with a focus on coral reef monitoring.
  • methods: The method uses machine learning to adapt to challenging underwater conditions and combines 3D mapping with semantic segmentation of images.
  • results: The method achieves high-precision 3D semantic mapping at unprecedented scale with significantly reduced labor costs, making it possible to monitor coral reefs more efficiently and effectively.
  • for: 这篇论文目的是开发一种基于ego-motion视频的海洋环境地图方法,主要关注珊瑚礁监测。
  • methods: 该方法使用机器学习适应海洋下难以控制的环境,并将3D地图与图像Semantic分割相结合。
  • results: 该方法实现了高精度3DSemantic地图,并在减少劳动成本方面取得了显著进步,使得珊瑚礁监测更加高效和可靠。
    Abstract Coral reefs are among the most diverse ecosystems on our planet, and are depended on by hundreds of millions of people. Unfortunately, most coral reefs are existentially threatened by global climate change and local anthropogenic pressures. To better understand the dynamics underlying deterioration of reefs, monitoring at high spatial and temporal resolution is key. However, conventional monitoring methods for quantifying coral cover and species abundance are limited in scale due to the extensive manual labor required. Although computer vision tools have been employed to aid in this process, in particular SfM photogrammetry for 3D mapping and deep neural networks for image segmentation, analysis of the data products creates a bottleneck, effectively limiting their scalability. This paper presents a new paradigm for mapping underwater environments from ego-motion video, unifying 3D mapping systems that use machine learning to adapt to challenging conditions under water, combined with a modern approach for semantic segmentation of images. The method is exemplified on coral reefs in the northern Gulf of Aqaba, Red Sea, demonstrating high-precision 3D semantic mapping at unprecedented scale with significantly reduced required labor costs: a 100 m video transect acquired within 5 minutes of diving with a cheap consumer-grade camera can be fully automatically analyzed within 5 minutes. Our approach significantly scales up coral reef monitoring by taking a leap towards fully automatic analysis of video transects. The method democratizes coral reef transects by reducing the labor, equipment, logistics, and computing cost. This can help to inform conservation policies more efficiently. The underlying computational method of learning-based Structure-from-Motion has broad implications for fast low-cost mapping of underwater environments other than coral reefs.
    摘要 珊瑚礁是地球上最多样化的生态系统之一,并且有百万人的生存受其影响。然而,大多数珊瑚礁面临全球气候变化和地方人类活动的威胁。为了更好地理解珊瑚礁的衰退机制,高精度空间和时间分辨率的监测是关键。although computer vision工具已经被应用于这一过程,特别是使用SfM摄ogrammetry для3D地图和深度神经网络 для图像分割,但是分析数据产品创造了瓶颈,从而限制了其扩展性。这篇文章介绍了一种新的珊瑚礁监测方法,基于自己的运动来自视频,结合机器学习来适应水下挑战的3D地图系统,并与现代图像分割方法相结合。这种方法在北红海的珊瑚礁中进行了高精度3Dsemantic地图,覆盖100米视频 transect,只需5分钟投入和分析时间。我们的方法可以快速扩大珊瑚礁监测,减少劳动、设备、运输和计算成本,从而更有效地 Inform conservation policies。我们的方法可以把珊瑚礁 transect democratized,减少劳动和设备成本,以便更多的人可以参与监测和保护。这种方法的计算方法,基于学习的Structure-from-Motion,对于快速低成本地图的水下环境的应用有广泛的应用前景。

NOC: High-Quality Neural Object Cloning with 3D Lifting of Segment Anything

  • paper_url: http://arxiv.org/abs/2309.12790
  • repo_url: None
  • paper_authors: Xiaobao Wei, Renrui Zhang, Jiarui Wu, Jiaming Liu, Ming Lu, Yandong Guo, Shanghang Zhang
  • for: 本研究旨在提出一种基于神经场的高品质3D对象重建方法,以便在用户指定的实时下重建目标对象。
  • methods: 本方法基于神经场和Segment Anything Model (SAM)的优点,首先将多视图2D分割Masks lifted到3D变化场中,然后将2D特征 lifted到3D SAM场中以提高重建质量。
  • results: 在多个 benchmark 数据集上进行了详细的实验,表明本方法能够提供高品质的目标对象重建结果。
    Abstract With the development of the neural field, reconstructing the 3D model of a target object from multi-view inputs has recently attracted increasing attention from the community. Existing methods normally learn a neural field for the whole scene, while it is still under-explored how to reconstruct a certain object indicated by users on-the-fly. Considering the Segment Anything Model (SAM) has shown effectiveness in segmenting any 2D images, in this paper, we propose Neural Object Cloning (NOC), a novel high-quality 3D object reconstruction method, which leverages the benefits of both neural field and SAM from two aspects. Firstly, to separate the target object from the scene, we propose a novel strategy to lift the multi-view 2D segmentation masks of SAM into a unified 3D variation field. The 3D variation field is then projected into 2D space and generates the new prompts for SAM. This process is iterative until convergence to separate the target object from the scene. Then, apart from 2D masks, we further lift the 2D features of the SAM encoder into a 3D SAM field in order to improve the reconstruction quality of the target object. NOC lifts the 2D masks and features of SAM into the 3D neural field for high-quality target object reconstruction. We conduct detailed experiments on several benchmark datasets to demonstrate the advantages of our method. The code will be released.
    摘要 随着神经场的发展,从多视图输入中重建目标对象的3D模型已经吸引了社区的越来越多的关注。现有方法通常学习一个整个场景的神经场,而即时根据用户指定的对象进行重建仍然是一个未探索的领域。尝试 Segment Anything Model (SAM) 可以效果地 segment any 2D 图像,在这篇论文中,我们提出了一种新的高质量3D对象重建方法——神经对象复制(NOC),该方法利用了神经场和 SAM 的优点,从两个方面进行了利用。首先,为了将目标对象从场景中分离,我们提出了一种新的策略,即将多视图2D 分割面的 SAM 编码器输出 lift 到一个统一的3D 变化场,然后将这个3D 变化场 проек 到2D 空间,生成新的提示,这个过程是迭代的,直到达到对象分离的 converges。然后,除了2D 面,我们还 lift 了 SAM 编码器的2D 特征到3D SAM 场,以提高目标对象的重建质量。NOC 将 SAM 的2D 面和特征 lift 到神经场中,以实现高质量的目标对象重建。我们对多个标准数据集进行了详细的实验,以示出我们的方法的优势。代码将会发布。
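Lifting multi-view 2D masks into a 3D field can be illustrated with a simple voting scheme: project 3D grid points into every view, average the sampled mask values, and keep points above a threshold. This is only a toy stand-in for NOC's iterative variation-field optimization; the orthographic projectors and the threshold are assumptions made for the demo.

```python
import numpy as np

def lift_masks_to_grid(masks_and_projectors, grid_pts, thresh=0.5):
    """Average multi-view 2D mask values at the projection of each 3D grid point; points whose
    mean mask value exceeds `thresh` form a rough 3D region of the target object."""
    votes = np.zeros(len(grid_pts))
    for mask, project in masks_and_projectors:
        uv = project(grid_pts)                                    # (N, 2) pixel coordinates
        u = np.clip(uv[:, 0].round().astype(int), 0, mask.shape[1] - 1)
        v = np.clip(uv[:, 1].round().astype(int), 0, mask.shape[0] - 1)
        votes += mask[v, u]
    votes /= len(masks_and_projectors)
    return grid_pts[votes > thresh]

# toy demo: a cube of grid points and two orthographic "views" whose masks cover the centre
xs = np.linspace(0, 63, 32)
grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), -1).reshape(-1, 3)
mask_xy = np.zeros((64, 64)); mask_xy[16:48, 16:48] = 1.0        # view looking along z
mask_xz = np.zeros((64, 64)); mask_xz[16:48, 16:48] = 1.0        # view looking along y
views = [(mask_xy, lambda p: p[:, [0, 1]]), (mask_xz, lambda p: p[:, [0, 2]])]
print(lift_masks_to_grid(views, grid).shape)                      # points inside both silhouettes
```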

EMS: 3D Eyebrow Modeling from Single-view Images

  • paper_url: http://arxiv.org/abs/2309.12787
  • repo_url: None
  • paper_authors: Chenghong Li, Leyang Jin, Yujian Zheng, Yizhou Yu, Xiaoguang Han
  • for: 这个论文的目的是提出一种基于学习的方法来实现单视图3D眉毛重建。
  • methods: 这个方法使用了三个模块:RootFinder、OriPredictor和FiberEnder。RootFinder用于Localizing fiber root positions,OriPredictor用于预测3D空间中的方向场,FiberEnder用于确定每个纤维的长度。
  • results: 该方法在不同的眉毛样式和长度上表现了效果,并且可以处理部分受阻的根位置问题。
    Abstract Eyebrows play a critical role in facial expression and appearance. Although the 3D digitization of faces is well explored, less attention has been drawn to 3D eyebrow modeling. In this work, we propose EMS, the first learning-based framework for single-view 3D eyebrow reconstruction. Following the methods of scalp hair reconstruction, we also represent the eyebrow as a set of fiber curves and convert the reconstruction to fibers growing problem. Three modules are then carefully designed: RootFinder firstly localizes the fiber root positions which indicates where to grow; OriPredictor predicts an orientation field in the 3D space to guide the growing of fibers; FiberEnder is designed to determine when to stop the growth of each fiber. Our OriPredictor is directly borrowing the method used in hair reconstruction. Considering the differences between hair and eyebrows, both RootFinder and FiberEnder are newly proposed. Specifically, to cope with the challenge that the root location is severely occluded, we formulate root localization as a density map estimation task. Given the predicted density map, a density-based clustering method is further used for finding the roots. For each fiber, the growth starts from the root point and moves step by step until the ending, where each step is defined as an oriented line with a constant length according to the predicted orientation field. To determine when to end, a pixel-aligned RNN architecture is designed to form a binary classifier, which outputs stop or not for each growing step. To support the training of all proposed networks, we build the first 3D synthetic eyebrow dataset that contains 400 high-quality eyebrow models manually created by artists. Extensive experiments have demonstrated the effectiveness of the proposed EMS pipeline on a variety of different eyebrow styles and lengths, ranging from short and sparse to long bushy eyebrows.
    摘要 眉毛在面部表情和外貌中扮演了关键角色,尽管3D人脸数字化已得到了广泛的研究,但3D眉毛模型化却受到了较少的关注。在这项工作中,我们提出了EMS框架,是首个基于学习的单视角3D眉毛重建框架。我们将眉毛表示为一组纤维曲线,并将重建转化为纤维增长问题。为确定眉毛的长度和形状,我们设计了三个模块:RootFinder、OriPredictor和FiberEnder。RootFinder先地本地化眉毛根部位置,以便于纤维增长;OriPredictor预测了3D空间中纤维的方向场,以帮助纤维增长;FiberEnder用于确定每个纤维的增长结束点。我们的OriPredictor直接借鉴了毛发重建中使用的方法。由于眉毛和毛发之间存在差异,因此RootFinder和FiberEnder均需要新的设计。具体来说,为了处理眉毛根部位置严重受遮挡的挑战,我们将根部地图估计任务转化为一个density map估计任务。给出预测的density map,然后使用密度基于的划分方法来找到根部。对于每个纤维,增长从根部开始,每步长度为constanteorientation field的oriented line。直到结束,每个步骤都需要通过一个像素对齐的RNN架构来判断是否需要停止增长。为支持所提出的网络的训练,我们建立了首个3D人工眉毛数据集,该数据集包含400个高质量眉毛模型,由艺术家手工创建。广泛的实验证明了我们提出的EMS管道的效果,可以处理不同的眉毛风格和长度,从短毛到长毛。
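Fiber growing from a root along a predicted orientation field, with a callback standing in for the learned ending classifier, can be sketched as follows. The constant step length follows the description above; the grid-shaped orientation field, nearest-neighbor lookup, and the stop callback are illustrative assumptions.

```python
import numpy as np

def grow_fiber(root, orientation_field, step_len=1.0, max_steps=50, stop_fn=None):
    """Grow one fiber from `root`: each step moves a constant length along the local direction
    sampled from a (H, W, D, 3) orientation field; `stop_fn` plays the role of the learned
    ending classifier (here a simple callback on the polyline so far)."""
    pts = [np.asarray(root, float)]
    for _ in range(max_steps):
        idx = np.clip(pts[-1].round().astype(int), 0, np.array(orientation_field.shape[:3]) - 1)
        direction = orientation_field[idx[0], idx[1], idx[2]]
        direction = direction / (np.linalg.norm(direction) + 1e-8)
        pts.append(pts[-1] + step_len * direction)
        if stop_fn is not None and stop_fn(pts):
            break
    return np.stack(pts)

# toy field: every fiber grows along +x; stop once the polyline has more than 10 points
field = np.zeros((32, 32, 32, 3)); field[..., 0] = 1.0
fiber = grow_fiber(root=(5, 16, 16), orientation_field=field, stop_fn=lambda p: len(p) > 10)
print(fiber.shape)    # (11, 3): the root plus 10 growth steps
```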

LMC: Large Model Collaboration with Cross-assessment for Training-Free Open-Set Object Recognition

  • paper_url: http://arxiv.org/abs/2309.12780
  • repo_url: https://github.com/harryqu123/lmc
  • paper_authors: Haoxuan Qu, Xiaofei Hui, Yujun Cai, Jun Liu
  • for: 这 paper 的目的是如何精确地进行开放集object recognition,以减少依赖伪阳特征。
  • methods: 本 paper 提出了一个名为 Large Model Collaboration (LMC) 的新框架,通过多个 off-the-shelf 大型模型的协力来解决这个问题。此外,paper 还提出了多个新的设计来有效地从大型模型中提取隐藏知识。
  • results: 实验结果显示了我们提出的框架的有效性。可以在 https://github.com/Harryqu123/LMC 获取代码。
    Abstract Open-set object recognition aims to identify if an object is from a class that has been encountered during training or not. To perform open-set object recognition accurately, a key challenge is how to reduce the reliance on spurious-discriminative features. In this paper, motivated by that different large models pre-trained through different paradigms can possess very rich while distinct implicit knowledge, we propose a novel framework named Large Model Collaboration (LMC) to tackle the above challenge via collaborating different off-the-shelf large models in a training-free manner. Moreover, we also incorporate the proposed framework with several novel designs to effectively extract implicit knowledge from large models. Extensive experiments demonstrate the efficacy of our proposed framework. Code is available https://github.com/Harryqu123/LMC
    摘要

WiCV@CVPR2023: The Eleventh Women In Computer Vision Workshop at the Annual CVPR Conference

  • paper_url: http://arxiv.org/abs/2309.12768
  • repo_url: None
  • paper_authors: Doris Antensteiner, Marah Halawa, Asra Aslam, Ivaxi Sheth, Sachini Herath, Ziqi Huang, Sunnie S. Y. Kim, Aparna Akula, Xin Wang
  • for: The paper is written to present the details of the Women in Computer Vision Workshop - WiCV 2023, which aims to amplify the voices of underrepresented women in the computer vision community.
  • methods: The paper uses a comprehensive report on the workshop program, historical trends from past WiCV@CVPR events, and a summary of statistics related to presenters, attendees, and sponsorship for the WiCV 2023 workshop.
  • results: The paper presents a detailed report on the WiCV 2023 workshop, including the program, historical trends, and statistics related to presenters, attendees, and sponsorship. The paper also highlights the importance of such events in addressing gender imbalances within the field.
  • for: 这篇论文是为了介绍女性计算机视觉工作坊(WiCV 2023)的详细信息。WiCV 的目标是促进计算机视觉领域中少数女性的声音,并且推动该领域的多样性和平等。
  • methods: 这篇论文使用了 WiCV 2023 工作坊的全面报告,以及过去 WiCV@CVPR 事件的历史趋势和统计数据,以描述 WiCV 2023 工作坊的进程和成果。
  • results: 这篇论文提供了 WiCV 2023 工作坊的详细报告,包括工作坊的程序、历史趋势和统计数据,以及参与者和赞助商的相关信息。论文还强调了计算机视觉领域内的性别不平衡问题,并认为这类活动对于解决这一问题具有重要意义。
    Abstract In this paper, we present the details of Women in Computer Vision Workshop - WiCV 2023, organized alongside the hybrid CVPR 2023 in Vancouver, Canada. WiCV aims to amplify the voices of underrepresented women in the computer vision community, fostering increased visibility in both academia and industry. We believe that such events play a vital role in addressing gender imbalances within the field. The annual WiCV@CVPR workshop offers a) opportunity for collaboration between researchers from minority groups, b) mentorship for female junior researchers, c) financial support to presenters to alleviate finanacial burdens and d) a diverse array of role models who can inspire younger researchers at the outset of their careers. In this paper, we present a comprehensive report on the workshop program, historical trends from the past WiCV@CVPR events, and a summary of statistics related to presenters, attendees, and sponsorship for the WiCV 2023 workshop.
    摘要 在本文中,我们介绍了2023年度的女性计算机视觉工作坊(WiCV 2023),该活动与CVPR 2023合办于加拿大温尼伯市。WiCV 的目标是扩大计算机视觉领域下的弱调女性人群的声音,提高学术和产业领域中的女性 visibility。我们认为这些活动对于解决计算机视觉领域中的性别偏见非常重要。每年的 WiCV@CVPR 工作坊提供了以下机会:(a)少数民族研究者之间的合作,(b)为女性新手研究者提供导师,(c)为参会者提供论文发表支持,以及(d)多样化的角色模范,以激励年轻研究者在职业开始时。在本文中,我们提供了2023年 WiCV@CVPR 工作坊的工作计划、过去事件的历史趋势以及2023年工作坊的统计数据。

S3TC: Spiking Separated Spatial and Temporal Convolutions with Unsupervised STDP-based Learning for Action Recognition

  • paper_url: http://arxiv.org/abs/2309.12761
  • repo_url: None
  • paper_authors: Mireille El-Assal, Pierre Tirilly, Ioan Marius Bilasco
  • for: This paper focuses on developing a more efficient video analysis method using Spiking Neural Networks (SNNs) and Spiking Separated Spatial and Temporal Convolutions (S3TCs).
  • methods: The authors use unsupervised learning with the Spike Timing-Dependent Plasticity (STDP) rule and introduce S3TCs to reduce the number of parameters required for video analysis.
  • results: The proposed method successfully extracts spatio-temporal information from videos, increases the output spiking activity, and outperforms spiking 3D convolutions on the KTH, Weizmann, and IXMAS datasets.
  • for: 这篇论文关注开发更高效的视频分析方法，使用神经网络和分离空间和时间卷积（S3TCs）。
  • methods: 作者使用无监督学习和脉冲时间依赖性变化（STDP）规则，并引入S3TCs来降低视频分析所需的参数数量。
  • results: 提议的方法在KTH、Weizmann和IXMAS数据集上成功提取空间-时间信息，提高输出脉冲活动，并超越了脉冲3D卷积。
    Abstract Video analysis is a major computer vision task that has received a lot of attention in recent years. The current state-of-the-art performance for video analysis is achieved with Deep Neural Networks (DNNs) that have high computational costs and need large amounts of labeled data for training. Spiking Neural Networks (SNNs) have significantly lower computational costs (thousands of times) than regular non-spiking networks when implemented on neuromorphic hardware. They have been used for video analysis with methods like 3D Convolutional Spiking Neural Networks (3D CSNNs). However, these networks have a significantly larger number of parameters compared with spiking 2D CSNN. This, not only increases the computational costs, but also makes these networks more difficult to implement with neuromorphic hardware. In this work, we use CSNNs trained in an unsupervised manner with the Spike Timing-Dependent Plasticity (STDP) rule, and we introduce, for the first time, Spiking Separated Spatial and Temporal Convolutions (S3TCs) for the sake of reducing the number of parameters required for video analysis. This unsupervised learning has the advantage of not needing large amounts of labeled data for training. Factorizing a single spatio-temporal spiking convolution into a spatial and a temporal spiking convolution decreases the number of parameters of the network. We test our network with the KTH, Weizmann, and IXMAS datasets, and we show that S3TCs successfully extract spatio-temporal information from videos, while increasing the output spiking activity, and outperforming spiking 3D convolutions.
    摘要 视频分析是计算机视觉中的一项重要任务,在过去几年内受到了很多关注。当前的状态艺术性表现在视频分析方面是通过深度神经网络(DNNs)实现的,其计算成本较高,需要大量标注数据进行训练。神经元网络(SNNs)在神经模拟硬件上实现时有 thousands 万次更低的计算成本,但它们的参数数量相对较多,使得它们更难于实现。在这项工作中,我们使用 CSNNs 在无监督的方式进行训练,并 introduce 一种新的 Spiking Separated Spatial and Temporal Convolutions(S3TCs),以降低视频分析所需的参数数量。这种无监督学习具有不需要大量标注数据进行训练的优点。将一个综合空间时间射阻挡分解成空间射阻挡和时间射阻挡,可以降低网络的参数数量。我们在 KTH、Weizmann 和 IXMAS 数据集上测试了我们的网络,并显示了 S3TCs 成功地从视频中提取空间时间信息,提高输出脉冲活动,并超过了射阻挡三维 convolution。
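The parameter saving from separating one spatio-temporal convolution into a spatial and a temporal convolution is easy to verify with standard (non-spiking) PyTorch layers; spiking neuron dynamics and STDP learning are deliberately omitted from this sketch, and the channel counts are arbitrary.

```python
import torch
import torch.nn as nn

class SeparatedST(nn.Module):
    """Factorize a spatio-temporal 3D convolution into a spatial (1 x k x k) convolution
    followed by a temporal (k x 1 x 1) convolution, reducing the parameter count."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k), padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(k, 1, 1), padding=(k // 2, 0, 0))

    def forward(self, x):                                # x: (B, C, T, H, W)
        return self.temporal(self.spatial(x))

x = (torch.rand(1, 64, 16, 32, 32) > 0.9).float()       # toy binary "spike" tensor
full3d = nn.Conv3d(64, 64, kernel_size=3, padding=1)
sep = SeparatedST(64, 64)
n = lambda m: sum(p.numel() for p in m.parameters())
print(sep(x).shape, n(full3d), n(sep))                   # 110656 vs 49280 parameters here
```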

Transformer-based Image Compression with Variable Image Quality Objectives

  • paper_url: http://arxiv.org/abs/2309.12717
  • repo_url: None
  • paper_authors: Chia-Hao Kao, Yi-Hsin Chen, Cheng Chien, Wei-Chen Chiu, Wen-Hsiao Peng
  • for: 该 paper 是为了提供一种可变图像质量目标的 transformer 基于压缩系统,以满足用户的偏好。
  • methods: 该方法使用 learned codec 进行优化,以实现不同质量目标下的图像重建。用户可以通过单一的模型来选择一个质量目标的交易off。
  • results: 该方法可以通过使用 prompt tokens 来condition transformer 基于 autoencoder,并通过学习 prompt generation network 来生成适应用户偏好和输入图像的 prompt tokens。对于常见的质量指标,广泛的实验表明该方法可以适应不同的质量目标,并且与单一质量目标方法相比,其表现相对较好。
    Abstract This paper presents a Transformer-based image compression system that allows for a variable image quality objective according to the user's preference. Optimizing a learned codec for different quality objectives leads to reconstructed images with varying visual characteristics. Our method provides the user with the flexibility to choose a trade-off between two image quality objectives using a single, shared model. Motivated by the success of prompt-tuning techniques, we introduce prompt tokens to condition our Transformer-based autoencoder. These prompt tokens are generated adaptively based on the user's preference and input image through learning a prompt generation network. Extensive experiments on commonly used quality metrics demonstrate the effectiveness of our method in adapting the encoding and/or decoding processes to a variable quality objective. While offering the additional flexibility, our proposed method performs comparably to the single-objective methods in terms of rate-distortion performance.
    摘要 Inspired by the success of prompt-tuning techniques, the system introduces prompt tokens to condition the Transformer-based autoencoder. These prompt tokens are generated adaptively based on the user's preference and input image through learning a prompt generation network. Extensive experiments on commonly used quality metrics demonstrate the effectiveness of the method in adapting the encoding and/or decoding processes to a variable quality objective. Notably, the proposed method offers the additional flexibility while performing comparably to single-objective methods in terms of rate-distortion performance.
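A hedged sketch of prompt-token conditioning: a small network maps the user's quality-objective weight (here combined with a global image descriptor) to a few prompt tokens, which are prepended to the patch tokens of a single shared transformer encoder. The prompt-generation inputs, token count, and layer sizes are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """Prompt tokens generated from the requested trade-off condition a shared encoder,
    so one model can adapt its encoding to different quality objectives."""
    def __init__(self, dim=128, n_prompts=4):
        super().__init__()
        self.n_prompts = n_prompts
        self.prompt_gen = nn.Sequential(nn.Linear(dim + 1, dim), nn.GELU(),
                                        nn.Linear(dim, dim * n_prompts))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patch_tokens, lam):                  # lam in [0, 1]: objective trade-off
        b, _, d = patch_tokens.shape
        ctx = torch.cat([patch_tokens.mean(dim=1), lam.view(b, 1)], dim=-1)
        prompts = self.prompt_gen(ctx).view(b, self.n_prompts, d)
        y = self.encoder(torch.cat([prompts, patch_tokens], dim=1))
        return y[:, self.n_prompts:]                       # latent to be quantized / entropy coded

tokens = torch.randn(2, 64, 128)
out = PromptedEncoder()(tokens, lam=torch.tensor([0.2, 0.9]))
print(out.shape)                                           # (2, 64, 128)
```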

mixed attention auto encoder for multi-class industrial anomaly detection

  • paper_url: http://arxiv.org/abs/2309.12700
  • repo_url: None
  • paper_authors: Jiangqi Liu, Feng Wang
  • for: 本研究旨在提出一种可以实现多类异常检测的单一模型,以解决现有方法的高存储成本和训练效率低下问题。
  • methods: 该方法使用混合注意力自适应Encoder(MAAE),并采用空间注意力和通道注意力来有效地捕捉多类特征分布的全球category信息,以及模型多个类别特征分布的模型。此外,该方法还提出了适应噪声生成器和多尺度融合模块,以适应实际噪声和保持不同类别物体表面 semantics。
  • results: MAAE在 benchmark 数据集上达到了比state-of-the-art 方法更高的性能。
    Abstract Most existing methods for unsupervised industrial anomaly detection train a separate model for each object category. This kind of approach can easily capture the category-specific feature distributions, but results in high storage cost and low training efficiency. In this paper, we propose a unified mixed-attention auto encoder (MAAE) to implement multi-class anomaly detection with a single model. To alleviate the performance degradation due to the diverse distribution patterns of different categories, we employ spatial attentions and channel attentions to effectively capture the global category information and model the feature distributions of multiple classes. Furthermore, to simulate the realistic noises on features and preserve the surface semantics of objects from different categories which are essential for detecting the subtle anomalies, we propose an adaptive noise generator and a multi-scale fusion module for the pre-trained features. MAAE delivers remarkable performances on the benchmark dataset compared with the state-of-the-art methods.
    摘要 现有的方法 для无监督工业异常检测通常将每个物件类别 trains一个分开的模型。这种方法可以轻松地捕捉每个类别的特征分布,但会导致存储成本高且训练效率低。在这篇论文中,我们提出了一个整合式混合注意力自动编码器(MAAE),以实现多类别异常检测的单一模型。为了解决不同类别的特征分布多样性导致性能下降的问题,我们使用空间注意力和通道注意力来有效地捕捉全类别信息和多类别特征分布。其次,为了模拟实际上的噪声和保持不同类别物件表面 semantics,我们提出了适应式噪声生成器和多尺度融合模组。MAAE在比较 dataset 上 delivert 了非常出色的性能,与当前方法相比。
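The mixed-attention idea (channel attention to capture global category information, spatial attention over the feature map) can be sketched with standard building blocks. This is a generic squeeze-and-excitation-plus-spatial-attention stand-in under assumed channel sizes, not the exact MAAE module, and the adaptive noise generator and multi-scale fusion are omitted.

```python
import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    """Channel attention re-weights which feature channels matter for a given category;
    spatial attention highlights where to focus. Their product re-weights the features of
    a single multi-class autoencoder."""
    def __init__(self, ch=64, r=8):
        super().__init__()
        self.channel = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch // r, 1),
                                     nn.ReLU(), nn.Conv2d(ch // r, ch, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):                                   # x: (B, C, H, W)
        ca = self.channel(x)                                # (B, C, 1, 1)
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        sa = self.spatial(s)                                # (B, 1, H, W)
        return x * ca * sa

print(MixedAttention()(torch.randn(2, 64, 32, 32)).shape)   # (2, 64, 32, 32)
```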

eWand: A calibration framework for wide baseline frame-based and event-based camera systems

  • paper_url: http://arxiv.org/abs/2309.12685
  • repo_url: None
  • paper_authors: Thomas Gossard, Andreas Ziegler, Levin Kolmar, Jonas Tebbe, Andreas Zell
  • for: 用于高精度对象位置三角推算的精准准确calibration
  • methods: 使用闪烁LED闪烁在透明球体内,代替传统的印刷或显示 Pattern
  • results: 提供高精度、易于使用的多摄像头外部坐标calibration方法,适用于frame-和事件基camera
    Abstract Accurate calibration is crucial for using multiple cameras to triangulate the position of objects precisely. However, it is also a time-consuming process that needs to be repeated for every displacement of the cameras. The standard approach is to use a printed pattern with known geometry to estimate the intrinsic and extrinsic parameters of the cameras. The same idea can be applied to event-based cameras, though it requires extra work. By using frame reconstruction from events, a printed pattern can be detected. A blinking pattern can also be displayed on a screen. Then, the pattern can be directly detected from the events. Such calibration methods can provide accurate intrinsic calibration for both frame- and event-based cameras. However, using 2D patterns has several limitations for multi-camera extrinsic calibration, with cameras possessing highly different points of view and a wide baseline. The 2D pattern can only be detected from one direction and needs to be of significant size to compensate for its distance to the camera. This makes the extrinsic calibration time-consuming and cumbersome. To overcome these limitations, we propose eWand, a new method that uses blinking LEDs inside opaque spheres instead of a printed or displayed pattern. Our method provides a faster, easier-to-use extrinsic calibration approach that maintains high accuracy for both event- and frame-based cameras.
    摘要 准确的均衡是多摄像头三角测量物体位置精准的关键。然而,这也是一项时间消耗的过程,需要每次移动摄像头时重新进行。标准方法是使用印刷的模式来估算摄像头的内参和外参参数。这种方法可以应用于事件基图像,尽管需要额外的工作。通过从事件中重建幻象,可以直接检测印刷的模式。然后,可以使用幻象的检测来提供内参均衡。但是,使用2D模式有多个相机的外参均衡的限制,因为相机具有不同的视角和广泛的基线。2D模式只能从一个方向检测,需要很大的尺寸来补偿其与摄像头的距离。这使得外参均衡变得时间消耗和困难。为了缓解这些限制,我们提出了ewand,一种新的方法,使用LED灯光在透明球体中闪烁。我们的方法提供了一种更快、更容易使用的外参均衡方法,可以保持高精度 для both事件基图像和帧基图像。

Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding

  • paper_url: http://arxiv.org/abs/2309.12657
  • repo_url: None
  • paper_authors: Jiazhen Wang, Bin Liu, Changtao Miao, Zhiwei Zhao, Wanyi Zhuang, Qi Chu, Nenghai Yu
  • for: 本研究旨在提出一种简单且有效的 transformer 基本框架,用于检测和稳定多模态束缚 manipulation。
  • methods: 我们首先构造了视 language 预训练Encoder,并使用 dual-branch cross-attention (DCA) 技术来抽取和融合模态特有的特征。此外,我们还设计了分离精细类ifier (DFC),以提高模态特有的特征挖掘和避免模态竞争。
  • results: 我们在 $\rm DGM^4$ 数据集上进行了广泛的实验,并证明了我们提出的模型在对 state-of-the-art 方法的比较中表现出色。
    Abstract AI-synthesized text and images have gained significant attention, particularly due to the widespread dissemination of multi-modal manipulations on the internet, which has resulted in numerous negative impacts on society. Existing methods for multi-modal manipulation detection and grounding primarily focus on fusing vision-language features to make predictions, while overlooking the importance of modality-specific features, leading to sub-optimal results. In this paper, we construct a simple and novel transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. To achieve this, we introduce visual/language pre-trained encoders and dual-branch cross-attention (DCA) to extract and fuse modality-unique features. Furthermore, we design decoupled fine-grained classifiers (DFC) to enhance modality-specific feature mining and mitigate modality competition. Moreover, we propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality using learnable queries, thereby improving the discovery of forged details. Extensive experiments on the $\rm DGM^4$ dataset demonstrate the superior performance of our proposed model compared to state-of-the-art approaches.
    摘要 人工智能生成的文本和图像在互联网上广泛传播,尤其是多模态杂化的 manipulate 问题,导致社会受到了负面影响。现有的多模态杂化检测和定位方法主要是通过视觉语言特征的融合来进行预测,而忽略了特定模式特征的重要性,从而导致低效的结果。在这篇论文中,我们提出了一种简单而新的 transformer 基于的多模态杂化检测和定位框架。我们的框架同时探索特定模式的特征,保留多模态对齐的能力。为此,我们引入视觉语言预训练encoder和双支分支交叉注意力(DCA),以EXTRACT和融合特定模式的特征。此外,我们设计了独立细致分类器(DFC),以提高特定模式的特征挖掘和避免模式竞争。此外,我们还提出了适应性 manipulate 查询(IMQ),可以在每个模式中适应学习查询,以提高对伪造细节的发现。我们对 $\rm DGM^4$ 数据集进行了广泛的实验,并证明了我们的提出的模型在对state-of-the-art方法的比较中表现出色。

FP-PET: Large Model, Multiple Loss And Focused Practice

  • paper_url: http://arxiv.org/abs/2309.12650
  • repo_url: None
  • paper_authors: Yixin Chen, Ourui Fu, Wenrui Shao, Zhaoheng Xie
  • for: 这项研究旨在提出FP-PET方法,用于医学图像分割,尤其是CT和PET图像。
  • methods: 该研究使用了多种机器学习模型,包括STUNet-large、SwinUNETR和VNet,以实现最新的分割性能。
  • results: 研究提出了一个综合评价指标,将多个评价指标(如 dice分数、false positive volume 和 false negative volume)加权平均,以提供全面的模型效果评价。
    Abstract This study presents FP-PET, a comprehensive approach to medical image segmentation with a focus on CT and PET images. Utilizing a dataset from the AutoPet2023 Challenge, the research employs a variety of machine learning models, including STUNet-large, SwinUNETR, and VNet, to achieve state-of-the-art segmentation performance. The paper introduces an aggregated score that combines multiple evaluation metrics such as Dice score, false positive volume (FPV), and false negative volume (FNV) to provide a holistic measure of model effectiveness. The study also discusses the computational challenges and solutions related to model training, which was conducted on high-performance GPUs. Preprocessing and postprocessing techniques, including gaussian weighting schemes and morphological operations, are explored to further refine the segmentation output. The research offers valuable insights into the challenges and solutions for advanced medical image segmentation.
    摘要 Translated into Simplified Chinese:这项研究提出了FP-PET,一种涵盖CT和PET图像分割的全面方法。利用AutoPet2023 Challenge数据集,研究使用了多种机器学习模型,包括STUNet-large、SwinUNETR和VNet,以实现最新的分割性能。文章引入了一个汇总分数,将多个评估指标,如 dice分数、false positive volume (FPV) 和 false negative volume (FNV) 汇总而成一个整体评价指标,以提供更全面的模型效果评估。研究还讨论了模型训练中的计算挑战和解决方案,并在高性能GPU上进行训练。研究还探讨了预处理和后处理技术,包括高斯权重方案和形态运算,以进一步细化分割输出。研究提供了进一步了解高级医学图像分割的挑战和解决方案。
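The aggregated score combines the Dice score with false positive and false negative volumes. A small example of computing the three ingredients on binary masks follows; the weighting in the final line is purely illustrative, since the exact aggregation formula used in FP-PET is not given here.

```python
import numpy as np

def segmentation_scores(pred, gt, voxel_volume_ml=1.0):
    """Dice score plus false positive / false negative volumes, and an example aggregate."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    dice = 2 * tp / (pred.sum() + gt.sum() + 1e-8)
    fpv = np.logical_and(pred, ~gt).sum() * voxel_volume_ml   # predicted lesion where there is none
    fnv = np.logical_and(~pred, gt).sum() * voxel_volume_ml   # missed lesion volume
    aggregated = dice - 0.01 * (fpv + fnv)                    # illustrative weighting only
    return dice, fpv, fnv, aggregated

gt = np.zeros((32, 32, 32)); gt[10:20, 10:20, 10:20] = 1
pred = np.zeros_like(gt); pred[12:22, 10:20, 10:20] = 1
print(segmentation_scores(pred, gt))
```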

RHINO: Regularizing the Hash-based Implicit Neural Representation

  • paper_url: http://arxiv.org/abs/2309.12642
  • repo_url: None
  • paper_authors: Hao Zhu, Fengyi Liu, Qi Zhang, Xun Cao, Zhan Ma
  • for: 提高Hash表示法中的Regularization,以提高 interpolate 的可靠性和稳定性。
  • methods: 引入一个连续分析函数,以便在Hash表示法中增强Regularization,不需要修改当前的Hash表示法架构。
  • results: RHINO在多种任务上表现出色,如图像适应、签名距离函数表示、5D静止/6D动态神经辐射场优化等,并且在质量和速度两个方面超过当前状态态技术。
    Abstract The use of Implicit Neural Representation (INR) through a hash-table has demonstrated impressive effectiveness and efficiency in characterizing intricate signals. However, current state-of-the-art methods exhibit insufficient regularization, often yielding unreliable and noisy results during interpolations. We find that this issue stems from broken gradient flow between input coordinates and indexed hash-keys, where the chain rule attempts to model discrete hash-keys, rather than the continuous coordinates. To tackle this concern, we introduce RHINO, in which a continuous analytical function is incorporated to facilitate regularization by connecting the input coordinate and the network additionally without modifying the architecture of current hash-based INRs. This connection ensures a seamless backpropagation of gradients from the network's output back to the input coordinates, thereby enhancing regularization. Our experimental results not only showcase the broadened regularization capability across different hash-based INRs like DINER and Instant NGP, but also across a variety of tasks such as image fitting, representation of signed distance functions, and optimization of 5D static / 6D dynamic neural radiance fields. Notably, RHINO outperforms current state-of-the-art techniques in both quality and speed, affirming its superiority.
    摘要 使用含义神经表示(INR)通过哈希表实现了非常出色的效果和效率,可是现有的状态 искусственный интеллект技术表现不够稳定,经常产生不可靠和噪音的结果 durante interpolaciones. 我们认为这个问题的根本原因在于哈希键和输入坐标之间的梯度流不畅,链式规则尝试模型离散的哈希键,而不是连续的坐标。为解决这个问题,我们介绍了犀牛(RHINO),它是一种连续的分析函数,可以在不修改现有哈希基于INR的网络架构的情况下,提供更好的规范。这种连接确保了输入坐标和网络的输出之间的精准的梯度传递,从而提高了规范。我们的实验结果不仅表明了不同的哈希基于INR如DINER和快速NP的规范能力的扩展,还在图像适应、签证距离函数表示和5D静态/6D动态神经辐射场的优化中达到了更高的质量和速度,并且超过了当前状态 искусственный интеллект技术的性能。

Global Context Aggregation Network for Lightweight Saliency Detection of Surface Defects

  • paper_url: http://arxiv.org/abs/2309.12641
  • repo_url: None
  • paper_authors: Feng Yan, Xiaoheng Jiang, Yang Lu, Lisha Cui, Shupan Li, Jiale Cao, Mingliang Xu, Dacheng Tao
  • for: 这个论文主要目的是提出一种轻量级的抗余损检测方法,以提高检测效率和精度。
  • methods: 本文提出了一种基于encoder-decoder结构的Global Context Aggregation Network (GCANet),包括一种新的transformerEncoder和Channel Reference Attention (CRA)模块,以提高多层特征表示的综合性。
  • results: 对三个公共的损害数据集进行实验表明,GCANet可以与17种state-of-the-art方法进行比较,并且在精度和运行效率之间做出了一个更好的平衡。具体来说,GCANet在SD-saliency-900上 achieve了91.79% $F_{\beta}^{w}$, 93.55% $S_\alpha$,和97.35% $E_\phi$的精度,并且在单个GPU上运行272fps。
    Abstract Surface defect inspection is a very challenging task in which surface defects usually show weak appearances or exist under complex backgrounds. Most high-accuracy defect detection methods require expensive computation and storage overhead, making them less practical in some resource-constrained defect detection applications. Although some lightweight methods have achieved real-time inference speed with fewer parameters, they show poor detection accuracy in complex defect scenarios. To this end, we develop a Global Context Aggregation Network (GCANet) for lightweight saliency detection of surface defects on the encoder-decoder structure. First, we introduce a novel transformer encoder on the top layer of the lightweight backbone, which captures global context information through a novel Depth-wise Self-Attention (DSA) module. The proposed DSA performs element-wise similarity in channel dimension while maintaining linear complexity. In addition, we introduce a novel Channel Reference Attention (CRA) module before each decoder block to strengthen the representation of multi-level features in the bottom-up path. The proposed CRA exploits the channel correlation between features at different layers to adaptively enhance feature representation. The experimental results on three public defect datasets demonstrate that the proposed network achieves a better trade-off between accuracy and running efficiency compared with other 17 state-of-the-art methods. Specifically, GCANet achieves competitive accuracy (91.79% $F_{\beta}^{w}$, 93.55% $S_\alpha$, and 97.35% $E_\phi$) on SD-saliency-900 while running 272fps on a single gpu.
    摘要 surface defect inspection 是一项非常具有挑战性的任务, surface defects 通常会出现弱化的外观或者在复杂的背景下出现。大多数高精度的缺陷检测方法需要昂贵的计算和存储开销,使其在一些资源受限的缺陷检测应用中不实用。 although some lightweight methods have achieved real-time inference speed with fewer parameters, they show poor detection accuracy in complex defect scenarios. 为了解决这个问题,我们开发了一个全球上下文聚合网络(GCANet),用于轻量级的缺陷检测。我们在轻量级的后ION上加入了一个新的 transformer Encoder,以便在全球上下文信息中捕捉全球上下文信息。我们引入了一种新的 Depth-wise Self-Attention(DSA)模块,用于在通道维度进行元素对元素的相似性检测,同时保持线性复杂度。此外,我们在每个解码块前加入了一个 Channel Reference Attention(CRA)模块,以强化底层特征表示。CRA模块利用不同层次特征之间的通道相关性来适应性地增强特征表示。我们在三个公共缺陷数据集上进行了实验,结果显示,我们的网络在缺陷检测精度和运行效率之间做出了更好的平衡,相比于其他 17 种国际前沿方法。具体来说,GCANet 在 SD-saliency-900 上达到了同等精度(91.79% $F_{\beta}^{w}$, 93.55% $S_\alpha$, 和 97.35% $E_\phi$),而且在单个 GPU 上运行速度为 272 fps。

CINFormer: Transformer network with multi-stage CNN feature injection for surface defect segmentation

  • paper_url: http://arxiv.org/abs/2309.12639
  • repo_url: None
  • paper_authors: Xiaoheng Jiang, Kaiyi Guo, Yang Lu, Feng Yan, Hao Liu, Jiale Cao, Mingliang Xu, Dacheng Tao
  • for: 这个研究旨在提高工业生产中的表面缺陷检测精度,并解决深度学习方法中的一些挑战,如微弱缺陷和背景中的干扰。
  • methods: 本研究提出了一个基于 transformer 网络的 CINFormer,具有多Stage CNN 特征插入和 Top-K 自我注意模组。这个架构可以维持 CNN 捕捉细部特征和 transformer 抑制背景干扰的优点,以提高缺陷检测精度。
  • results: 实验结果显示,提出的 CINFormer 在 DAGM 2007、Magnetic tile 和 NEU 等表面缺陷数据集上实现了顶尖性能,并且在不同的缺陷类型和背景干扰下都能够获得高度的精度。
    Abstract Surface defect inspection is of great importance for industrial manufacture and production. Though defect inspection methods based on deep learning have made significant progress, there are still some challenges for these methods, such as indistinguishable weak defects and defect-like interference in the background. To address these issues, we propose a transformer network with multi-stage CNN (Convolutional Neural Network) feature injection for surface defect segmentation, which is a UNet-like structure named CINFormer. CINFormer presents a simple yet effective feature integration mechanism that injects the multi-level CNN features of the input image into different stages of the transformer network in the encoder. This can maintain the merit of CNN capturing detailed features and that of transformer depressing noises in the background, which facilitates accurate defect detection. In addition, CINFormer presents a Top-K self-attention module to focus on tokens with more important information about the defects, so as to further reduce the impact of the redundant background. Extensive experiments conducted on the surface defect datasets DAGM 2007, Magnetic tile, and NEU show that the proposed CINFormer achieves state-of-the-art performance in defect detection.
    摘要 surface defect inspection 是现代工业生产中非常重要的一环。尽管基于深度学习的抗损检测方法已经取得了 significiant progress,但还有一些挑战,如微弱损害难以辨别和背景中的干扰。为解决这些问题,我们提出了一种基于 transformer 网络的多stage CNN 特征注入的表面抗损分割方法,即 CINFormer。 CINFormer 提供了一种简单 yet 有效的特征集成机制,通过在 transformer 网络的编码器中注入不同级别的 CNN 特征,以维持 CNN 捕捉细节特征的优点,同时使用 transformer 网络压缩背景干扰的优点。此外,CINFormer 还提供了 Top-K 自注意模块,以便更好地强调损害的关键信息,从而进一步减少背景干扰的影响。经过对 DAGM 2007、Magnetic 块和 NEU 等表面抗损数据集的广泛实验,我们发现,提议的 CINFormer 可以达到现代表面抗损检测的州标性性能。
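Top-K self-attention keeps, for each query, only the k largest attention logits and masks out the rest before the softmax, which suppresses contributions from redundant background tokens. The following single-head sketch shows the mechanism; the head count, projection layout, and value of k are assumptions rather than CINFormer's exact configuration.

```python
import torch
import torch.nn as nn

class TopKSelfAttention(nn.Module):
    """Standard single-head self-attention with per-query Top-K sparsification of the logits."""
    def __init__(self, dim, k=8):
        super().__init__()
        self.k = k
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                                    # x: (B, N, dim)
        q, kk, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ kk.transpose(-2, -1)) * self.scale       # (B, N, N) logits
        k = min(self.k, attn.size(-1))
        kth = attn.topk(k, dim=-1).values[..., -1:]          # k-th largest logit per query
        attn = attn.masked_fill(attn < kth, float("-inf"))   # drop everything below it
        return self.proj(torch.softmax(attn, dim=-1) @ v)

print(TopKSelfAttention(dim=64, k=8)(torch.randn(2, 100, 64)).shape)   # (2, 100, 64)
```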

Auto-Lesion Segmentation with a Novel Intensity Dark Channel Prior for COVID-19 Detection

  • paper_url: http://arxiv.org/abs/2309.12638
  • repo_url: None
  • paper_authors: Basma Jumaa Saleh, Zaid Omar, Vikrant Bhateja, Lila Iznita Izhar
  • for: 本研究旨在开发一种基于 computed tomography (CT) 图像的 COVID-19 诊断方法，以帮助诊断可疑 COVID-19 患者。
  • methods: 本研究使用了 radiomic 特征，并采用了强化自动分割原理（IDCP）和深度神经网络（ALS-IDCP-DNN），在定义的分析阈值范围内进行图像分类。
  • results: 验证性 dataset 上，提议的模型实现了98.8%的平均准确率、99%的准确率、98%的 recall 和98%的 F1 score。这些结果表明，我们的模型可以准确地分类 COVID-19 图像，可以帮助 radiologists 诊断可疑 COVID-19 患者。此外，我们的模型表现得更好于现有的10多个国际研究。
    Abstract During the COVID-19 pandemic, medical imaging techniques like computed tomography (CT) scans have demonstrated effectiveness in combating the rapid spread of the virus. Therefore, it is crucial to conduct research on computerized models for the detection of COVID-19 using CT imaging. A novel processing method has been developed, utilizing radiomic features, to assist in the CT-based diagnosis of COVID-19. Given the lower specificity of traditional features in distinguishing between different causes of pulmonary diseases, the objective of this study is to develop a CT-based radiomics framework for the differentiation of COVID-19 from other lung diseases. The model is designed to focus on outlining COVID-19 lesions, as traditional features often lack specificity in this aspect. The model categorizes images into three classes: COVID-19, non-COVID-19, or normal. It employs enhancement auto-segmentation principles using intensity dark channel prior (IDCP) and deep neural networks (ALS-IDCP-DNN) within a defined range of analysis thresholds. A publicly available dataset comprising COVID-19, normal, and non-COVID-19 classes was utilized to validate the proposed model's effectiveness. The best performing classification model, Residual Neural Network with 50 layers (Resnet-50), attained an average accuracy, precision, recall, and F1-score of 98.8%, 99%, 98%, and 98% respectively. These results demonstrate the capability of our model to accurately classify COVID-19 images, which could aid radiologists in diagnosing suspected COVID-19 patients. Furthermore, our model's performance surpasses that of more than 10 current state-of-the-art studies conducted on the same dataset.
    摘要 Translated into Simplified Chinese:在COVID-19疫情期间,计算机成像技术如计算机tomography(CT)扫描已经表现出效iveness在防止病毒的迅速传播。因此,需要进行计算机模型的研究,以便使用CT成像来诊断COVID-19。我们开发了一种新的处理方法,利用 радиомics特征,以帮助CT成像诊断COVID-19。传统的特征 oft lack specificity in distinguishing between different causes of pulmonary diseases,因此我们的目标是开发一个基于CT成像的radiomics框架,以区分COVID-19和其他肺病。模型设计用于强调COVID-19斑点,以便更好地识别COVID-19。模型将图像分类为三类:COVID-19、非COVID-19或正常。它使用了增强自动分割原理,使用暗色通道优先预测(IDCP)和深度神经网络(ALS-IDCP-DNN),在定义的分析阈值范围内。一个公共可用的数据集,包括COVID-19、正常和非COVID-19类别,用于验证我们的模型效果。使用最佳表现的分类模型,即Residual Neural Network with 50 layers(Resnet-50),在COVID-19、非COVID-19和正常类别之间达到了平均准确率、精度、回归率和F1分数的98.8%、99%、98%和98%。这些结果表明我们的模型可以准确地分类COVID-19图像,这将助力放医生诊断可能的COVID-19患者。此外,我们的模型性能超过了现有的10个以上state-of-the-art研究,同一个数据集。
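The intensity dark channel prior builds on the classic dark-channel computation: a per-pixel minimum over the color channels followed by a local minimum filter. The sketch below shows that generic computation on an RGB-like array; how the paper adapts it to CT intensities and to lesion enhancement is not reproduced here.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(image, patch=15):
    """Classic dark channel: minimum over channels, then a local minimum filter over a patch."""
    per_pixel_min = image.min(axis=2)                      # min over the channel axis
    return minimum_filter(per_pixel_min, size=patch)       # min over the local neighbourhood

img = np.random.rand(128, 128, 3)
dc = dark_channel(img)
print(dc.shape, dc.min(), dc.max())                        # (128, 128) plus the value range
```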

Learning Actions and Control of Focus of Attention with a Log-Polar-like Sensor

  • paper_url: http://arxiv.org/abs/2309.12634
  • repo_url: None
  • paper_authors: Robin Göransson, Volker Krueger
  • for: 提高自动移动机器人图像处理速度
  • methods: 使用径向尺度图像数据和 gaze 控制
  • results: 成功降低图像像素数量，不影响游戏性能。
  • for: The paper aims to improve the image processing speed of an autonomous mobile robot.
  • methods: The paper explores the use of log-polar like image data with gaze control, and extends an A3C deep RL approach with an LSTM network to learn the policy for playing Atari games and gaze control.
  • results: The paper successfully reduces the amount of image pixels by a factor of 5 without losing any gaming performance.
    Abstract With the long-term goal of reducing the image processing time on an autonomous mobile robot in mind we explore in this paper the use of log-polar like image data with gaze control. The gaze control is not done on the Cartesian image but on the log-polar like image data. For this we start out from the classic deep reinforcement learning approach for Atari games. We extend an A3C deep RL approach with an LSTM network, and we learn the policy for playing three Atari games and a policy for gaze control. While the Atari games already use low-resolution images of 80 by 80 pixels, we are able to further reduce the amount of image pixels by a factor of 5 without losing any gaming performance.
    摘要 为了实现自主移动机器人图像处理时间缩短这一长期目标，本文探讨了使用类对数极坐标（log-polar）图像数据并在其上进行注视（gaze）控制。注视控制不是在笛卡尔图像上进行，而是在类对数极坐标图像数据上进行。我们从经典的 Atari 游戏深度强化学习方法出发，将 A3C 深度 RL 方法扩展为带 LSTM 网络的版本，并学习了三个 Atari 游戏的策略和一个注视控制策略。尽管 Atari 游戏本身已使用 80×80 像素的低分辨率图像，我们仍能将图像像素数量再减少 5 倍而不损失任何游戏性能。
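A log-polar-like resampling around a fixation point samples the center densely and the periphery coarsely, which is how the pixel count can drop by roughly a factor of five. The sketch below uses nearest-neighbor sampling and a fixed image center; the paper's exact sensor layout and its learned gaze control (which moves this center) are not reproduced, and the output resolution is an assumption.

```python
import numpy as np

def log_polar_sample(image, out_shape=(36, 36), r_min=1.0):
    """Resample a square image onto a log-polar grid centred on the image centre:
    angular bins x logarithmically spaced radial bins."""
    h, w = image.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    n_theta, n_r = out_shape
    r_max = min(cy, cx)
    thetas = np.linspace(0, 2 * np.pi, n_theta, endpoint=False)
    radii = np.geomspace(r_min, r_max, n_r)                 # dense near the fovea, sparse far out
    tt, rr = np.meshgrid(thetas, radii, indexing="ij")
    ys = np.clip((cy + rr * np.sin(tt)).round().astype(int), 0, h - 1)
    xs = np.clip((cx + rr * np.cos(tt)).round().astype(int), 0, w - 1)
    return image[ys, xs]

frame = np.random.rand(80, 80)
print(log_polar_sample(frame).shape)     # (36, 36): 1296 samples vs 6400 pixels, roughly 5x fewer
```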

Decision Fusion Network with Perception Fine-tuning for Defect Classification

  • paper_url: http://arxiv.org/abs/2309.12630
  • repo_url: None
  • paper_authors: Xiaoheng Jiang, Shilong Tian, Zhiwen Zhu, Yang Lu, Hao Liu, Li Chen, Shupan Li, Mingliang Xu
  • for: 卷积 neural network 用于Surface defect inspection 中的检测和分类任务
  • methods: 提出了一种决策融合网络(DFNet),通过将semantic decision和feature decision融合来强化网络的决策能力,同时提出了一种感知细化模块(PFM)来优化前景和背景的分割结果
  • results: 对于公开的数据集KolektorSDD2和Magnetic-tile-defect-datasets进行了实验,实现了96.1% AP和94.6% mAP的效果
    Abstract Surface defect inspection is an important task in industrial inspection. Deep learning-based methods have demonstrated promising performance in this domain. Nevertheless, these methods still suffer from misjudgment when encountering challenges such as low-contrast defects and complex backgrounds. To overcome these issues, we present a decision fusion network (DFNet) that incorporates the semantic decision with the feature decision to strengthen the decision ability of the network. In particular, we introduce a decision fusion module (DFM) that extracts a semantic vector from the semantic decision branch and a feature vector for the feature decision branch and fuses them to make the final classification decision. In addition, we propose a perception fine-tuning module (PFM) that fine-tunes the foreground and background during the segmentation stage. PFM generates the semantic and feature outputs that are sent to the classification decision stage. Furthermore, we present an inner-outer separation weight matrix to address the impact of label edge uncertainty during segmentation supervision. Our experimental results on the publicly available datasets including KolektorSDD2 (96.1% AP) and Magnetic-tile-defect-datasets (94.6% mAP) demonstrate the effectiveness of the proposed method.

DeFormer: Integrating Transformers with Deformable Models for 3D Shape Abstraction from a Single Image

  • paper_url: http://arxiv.org/abs/2309.12594
  • repo_url: None
  • paper_authors: Di Liu, Xiang Yu, Meng Ye, Qilong Zhangli, Zhuowei Li, Zhixing Zhang, Dimitris N. Metaxas
  • for: 3D shape abstraction from a single 2D image using a small set of primitives.
  • methods: Proposes DeFormer, a bi-channel Transformer architecture integrated with parameterized deformable models that simultaneously estimates the global and local deformations of primitives (a toy deformation sketch follows the abstract).
  • results: Extensive experiments on ShapeNet show better reconstruction accuracy than the previous state of the art, with visualizations exhibiting consistent semantic correspondences.
    Abstract Accurate 3D shape abstraction from a single 2D image is a long-standing problem in computer vision and graphics. By leveraging a set of primitives to represent the target shape, recent methods have achieved promising results. However, these methods either use a relatively large number of primitives or lack geometric flexibility due to the limited expressibility of the primitives. In this paper, we propose a novel bi-channel Transformer architecture, integrated with parameterized deformable models, termed DeFormer, to simultaneously estimate the global and local deformations of primitives. In this way, DeFormer can abstract complex object shapes while using a small number of primitives which offer a broader geometry coverage and finer details. Then, we introduce a force-driven dynamic fitting and a cycle-consistent re-projection loss to optimize the primitive parameters. Extensive experiments on ShapeNet across various settings show that DeFormer achieves better reconstruction accuracy over the state-of-the-art, and visualizes with consistent semantic correspondences for improved interpretability.
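A toy NumPy sketch (an assumption, not the authors' formulation) of combining a global deformation (here, an affine transform) with local per-vertex offsets on a primitive's point set, the kind of global/local split that DeFormer estimates with its bi-channel Transformer.

```python
import numpy as np

def deform_primitive(points, affine, translation, local_offsets):
    """points: (N, 3); affine: (3, 3); translation: (3,); local_offsets: (N, 3)."""
    return points @ affine.T + translation + local_offsets

# Unit-sphere samples as a stand-in primitive.
rng = np.random.default_rng(0)
pts = rng.normal(size=(256, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)

deformed = deform_primitive(
    pts,
    affine=np.diag([1.5, 1.0, 0.5]),                 # global anisotropic scaling
    translation=np.array([0.0, 0.2, 0.0]),           # global shift
    local_offsets=0.02 * rng.normal(size=pts.shape),  # fine local detail
)
print(deformed.shape)  # (256, 3)
```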

Improving Machine Learning Robustness via Adversarial Training

  • paper_url: http://arxiv.org/abs/2309.12593
  • repo_url: None
  • paper_authors: Long Dang, Thushari Hapuarachchi, Kaiqi Xiong, Jing Lin
  • for: Investigates machine learning (ML) robustness in centralized and decentralized environments to help design more robust ML algorithms.
  • methods: Applies adversarial training in both settings, using the Fast Gradient Sign Method (FGSM) and DeepFool to generate adversarial examples (a minimal FGSM sketch follows the abstract).
  • results: In the centralized setting, test accuracies of 65.41% and 83.0% are achieved on FGSM and DeepFool adversarial examples, improving on existing studies by 18.41% and 47%, respectively. In the decentralized setting, federated learning (FL) robustness is studied under adversarial training with IID and non-IID data: with IID data, robust accuracy comparable to the centralized case is achieved; with non-IID data, natural accuracy drops from 66.23% to 57.82%, and robust accuracy drops by 25% and 23.4% under C&W and PGD attacks, respectively. A proposed IID data-sharing approach raises natural accuracy to 85.04% and robust accuracy from 57% to 72% (C&W) and from 59% to 67% (PGD).
    Abstract As Machine Learning (ML) is increasingly used in solving various tasks in real-world applications, it is crucial to ensure that ML algorithms are robust to any potential worst-case noises, adversarial attacks, and highly unusual situations when they are designed. Studying ML robustness will significantly help in the design of ML algorithms. In this paper, we investigate ML robustness using adversarial training in centralized and decentralized environments, where ML training and testing are conducted in one or multiple computers. In the centralized environment, we achieve a test accuracy of 65.41% and 83.0% when classifying adversarial examples generated by Fast Gradient Sign Method and DeepFool, respectively. Comparing to existing studies, these results demonstrate an improvement of 18.41% for FGSM and 47% for DeepFool. In the decentralized environment, we study Federated learning (FL) robustness by using adversarial training with independent and identically distributed (IID) and non-IID data, respectively, where CIFAR-10 is used in this research. In the IID data case, our experimental results demonstrate that we can achieve such a robust accuracy that it is comparable to the one obtained in the centralized environment. Moreover, in the non-IID data case, the natural accuracy drops from 66.23% to 57.82%, and the robust accuracy decreases by 25% and 23.4% in C&W and Projected Gradient Descent (PGD) attacks, compared to the IID data case, respectively. We further propose an IID data-sharing approach, which allows for increasing the natural accuracy to 85.04% and the robust accuracy from 57% to 72% in C&W attacks and from 59% to 67% in PGD attacks.
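A minimal FGSM sketch in PyTorch, shown only to illustrate how the adversarial examples used for adversarial training are generated; the epsilon value and the tiny model are placeholders, not the paper's exact setup.

```python
import torch
import torch.nn as nn

def fgsm_example(model, x, y, epsilon=8 / 255):
    """Return x perturbed by epsilon * sign(grad_x loss)."""
    x = x.clone().detach().requires_grad_(True)
    loss = nn.CrossEntropyLoss()(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()

# Tiny CIFAR-10-shaped demo model and batch.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
x_adv = fgsm_example(model, x, y)
print((x_adv - x).abs().max())  # at most epsilon
```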

BGF-YOLO: Enhanced YOLOv8 with Multiscale Attentional Feature Fusion for Brain Tumor Detection

  • paper_url: http://arxiv.org/abs/2309.12585
  • repo_url: https://github.com/mkang315/bgf-yolo
  • paper_authors: Ming Kang, Chee-Ming Ting, Fung Fung Ting, Raphaël C. -W. Phan
  • for: Automated brain tumor detection.
  • methods: Integrates Bi-level Routing Attention (BRA), Generalized feature pyramid networks (GFPN), and a fourth detecting head into YOLOv8 (a generic fusion sketch follows the abstract).
  • results: A 4.7% absolute increase in mAP$_{50}$ over YOLOv8x, achieving state-of-the-art results on the brain tumor detection dataset Br35H.
    Abstract You Only Look Once (YOLO)-based object detectors have shown remarkable accuracy for automated brain tumor detection. In this paper, we develop a novel BGF-YOLO architecture by incorporating Bi-level Routing Attention (BRA), Generalized feature pyramid networks (GFPN), and Fourth detecting head into YOLOv8. BGF-YOLO contains an attention mechanism to focus more on important features, and feature pyramid networks to enrich feature representation by merging high-level semantic features with spatial details. Furthermore, we investigate the effect of different attention mechanisms and feature fusions, detection head architectures on brain tumor detection accuracy. Experimental results show that BGF-YOLO gives a 4.7% absolute increase of mAP$_{50}$ compared to YOLOv8x, and achieves state-of-the-art on the brain tumor detection dataset Br35H. The code is available at https://github.com/mkang315/BGF-YOLO.
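A generic PyTorch sketch of attention-weighted multiscale feature fusion. This is not BGF-YOLO's BRA or GFPN; it only illustrates, under simplifying assumptions, the broad idea of reweighting fused multiscale features before a detection head.

```python
import torch
import torch.nn as nn

class AttentionalFusion(nn.Module):
    def __init__(self, c_low=256, c_high=512, c_out=256):
        super().__init__()
        self.proj = nn.Conv2d(c_low + c_high, c_out, kernel_size=1)
        # Squeeze-and-excitation style channel attention over the fused features.
        self.att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c_out, c_out // 8, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out // 8, c_out, 1), nn.Sigmoid(),
        )

    def forward(self, low, high):
        # Upsample the coarser map, concatenate, project, then reweight channels.
        high_up = nn.functional.interpolate(high, size=low.shape[-2:], mode="nearest")
        fused = self.proj(torch.cat([low, high_up], dim=1))
        return fused * self.att(fused)

fuse = AttentionalFusion()
out = fuse(torch.randn(1, 256, 80, 80), torch.randn(1, 512, 40, 40))
print(out.shape)  # torch.Size([1, 256, 80, 80])
```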

Classification of Alzheimers Disease with Deep Learning on Eye-tracking Data

  • paper_url: http://arxiv.org/abs/2309.12574
  • repo_url: None
  • paper_authors: Harshinee Sriram, Cristina Conati, Thalia Field
  • for: Classify Alzheimer's Disease (AD) from eye-tracking (ET) data using a deep-learning classifier trained end-to-end on raw ET data.
  • methods: The proposed method, VTNet, uses a GRU and a CNN in parallel to leverage both visual (V) and temporal (T) representations of ET data (a minimal parallel-branch sketch follows the abstract).
  • results: VTNet outperforms state-of-the-art approaches in AD classification, providing encouraging evidence of the generality of this model for making predictions from ET data.
    Abstract Existing research has shown the potential of classifying Alzheimers Disease (AD) from eye-tracking (ET) data with classifiers that rely on task-specific engineered features. In this paper, we investigate whether we can improve on existing results by using a Deep-Learning classifier trained end-to-end on raw ET data. This classifier (VTNet) uses a GRU and a CNN in parallel to leverage both visual (V) and temporal (T) representations of ET data and was previously used to detect user confusion while processing visual displays. A main challenge in applying VTNet to our target AD classification task is that the available ET data sequences are much longer than those used in the previous confusion detection task, pushing the limits of what is manageable by LSTM-based models. We discuss how we address this challenge and show that VTNet outperforms the state-of-the-art approaches in AD classification, providing encouraging evidence on the generality of this model to make predictions from ET data.
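A hedged PyTorch sketch of the parallel GRU + CNN idea: one branch reads the raw ET sequence temporally, the other reads a 2D (visual) rendering of the same data, and the two embeddings are concatenated for classification. All dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ParallelGruCnn(nn.Module):
    def __init__(self, seq_feat=4, hidden=64, num_classes=2):
        super().__init__()
        self.gru = nn.GRU(seq_feat, hidden, batch_first=True)   # temporal branch
        self.cnn = nn.Sequential(                                # visual branch
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(hidden + 32, num_classes)

    def forward(self, seq, img):
        _, h = self.gru(seq)            # h: (1, B, hidden)
        return self.head(torch.cat([h[-1], self.cnn(img)], dim=1))

model = ParallelGruCnn()
logits = model(torch.randn(2, 500, 4),    # 500-step gaze sequence, 4 features
               torch.randn(2, 1, 64, 64)) # scanpath rendered as an image
print(logits.shape)  # torch.Size([2, 2])
```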

Interpretable 3D Multi-Modal Residual Convolutional Neural Network for Mild Traumatic Brain Injury Diagnosis

  • paper_url: http://arxiv.org/abs/2309.12572
  • repo_url: None
  • paper_authors: Hanem Ellethy, Viktor Vegh, Shekhar S. Chandra
  • for: Improving the diagnostic accuracy of mild traumatic brain injury (mTBI) while enhancing model interpretability with a multi-modal residual network (MRCNN) and Occlusion Sensitivity Maps (OSM).
  • methods: Introduces an interpretable 3D Multi-Modal Residual Convolutional Neural Network (MRCNN) diagnostic model enhanced with Occlusion Sensitivity Maps (OSM) (a minimal OSM sketch follows the abstract).
  • results: The MRCNN model achieves an average accuracy of 82.4%, sensitivity of 82.6%, and specificity of 81.6% under five-fold cross-validation, improving specificity by 4.4% and accuracy by 9.0% over the CT-based Residual Convolutional Neural Network (RCNN) model.
    Abstract Mild Traumatic Brain Injury (mTBI) is a significant public health challenge due to its high prevalence and potential for long-term health effects. Despite Computed Tomography (CT) being the standard diagnostic tool for mTBI, it often yields normal results in mTBI patients despite symptomatic evidence. This fact underscores the complexity of accurate diagnosis. In this study, we introduce an interpretable 3D Multi-Modal Residual Convolutional Neural Network (MRCNN) for mTBI diagnostic model enhanced with Occlusion Sensitivity Maps (OSM). Our MRCNN model exhibits promising performance in mTBI diagnosis, demonstrating an average accuracy of 82.4%, sensitivity of 82.6%, and specificity of 81.6%, as validated by a five-fold cross-validation process. Notably, in comparison to the CT-based Residual Convolutional Neural Network (RCNN) model, the MRCNN shows an improvement of 4.4% in specificity and 9.0% in accuracy. We show that the OSM offers superior data-driven insights into CT images compared to the Grad-CAM approach. These results highlight the efficacy of the proposed multi-modal model in enhancing the diagnostic precision of mTBI.
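A minimal occlusion-sensitivity sketch in PyTorch: slide a constant patch over the input and record how the target-class probability drops. This illustrates the general OSM idea; the paper's 3D multi-modal variant operates on volumetric CT and clinical inputs.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def occlusion_map(model, x, target, patch=8, stride=8, fill=0.0):
    """x: (1, C, H, W); returns an (H//stride, W//stride) sensitivity map."""
    base = torch.softmax(model(x), dim=1)[0, target].item()
    _, _, h, w = x.shape
    heat = torch.zeros(h // stride, w // stride)
    for i, top in enumerate(range(0, h - patch + 1, stride)):
        for j, left in enumerate(range(0, w - patch + 1, stride)):
            occluded = x.clone()
            occluded[..., top:top + patch, left:left + patch] = fill
            prob = torch.softmax(model(occluded), dim=1)[0, target].item()
            heat[i, j] = base - prob   # larger value = region mattered more
    return heat

# Demo with a toy classifier on a 32x32 single-channel input.
model = nn.Sequential(nn.Flatten(), nn.Linear(1 * 32 * 32, 2))
heat = occlusion_map(model, torch.rand(1, 1, 32, 32), target=1)
print(heat.shape)  # torch.Size([4, 4])
```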

Wave-informed dictionary learning for high-resolution imaging in complex media

  • paper_url: http://arxiv.org/abs/2310.12990
  • repo_url: None
  • paper_authors: Miguel Moscoso, Alexei Novikov, George Papanicolaou, Chrysoula Tsogka
  • for: Imaging in scattering media when large and diverse data sets are available.
  • methods: A two-step approach: the first step uses a dictionary learning algorithm to estimate the true Green's function vectors as columns of an unordered sensing matrix; the second step orders those columns for imaging with Multi-Dimensional Scaling, using connectivity information derived from cross-correlations of the columns, as in time reversal (a minimal ordering sketch follows the abstract).
  • results: Simulation experiments show that the method provides images in complex media whose resolution is that of a homogeneous medium.
    Abstract We propose an approach for imaging in scattering media when large and diverse data sets are available. It has two steps. Using a dictionary learning algorithm the first step estimates the true Green's function vectors as columns in an unordered sensing matrix. The array data comes from many sparse sets of sources whose location and strength are not known to us. In the second step, the columns of the estimated sensing matrix are ordered for imaging using Multi-Dimensional Scaling with connectivity information derived from cross-correlations of its columns, as in time reversal. For these two steps to work together we need data from large arrays of receivers so the columns of the sensing matrix are incoherent for the first step, as well as from sub-arrays so that they are coherent enough to obtain the connectivity needed in the second step. Through simulation experiments, we show that the proposed approach is able to provide images in complex media whose resolution is that of a homogeneous medium.
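A schematic sketch (assumptions throughout) of the second step: embed the columns of an estimated sensing matrix with Multi-Dimensional Scaling, using a dissimilarity built from cross-correlations of the columns. The synthetic data lets the recovered ordering be checked against the true one.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(1)
n_samples, n_points = 200, 30
t = np.linspace(0, 1, n_samples)

# Columns mimic Green's function vectors: pulses whose arrival time shifts
# smoothly with an (unknown) 1D source position, then shuffled to mimic the
# unordered output of the dictionary-learning step.
delays = np.linspace(0.2, 0.8, n_points)
A = np.exp(-((t[:, None] - delays[None, :]) ** 2) / (2 * 0.2 ** 2))
perm = rng.permutation(n_points)
A_unordered = A[:, perm]

# Connectivity from cross-correlations of columns -> dissimilarity for MDS.
corr = np.abs(np.corrcoef(A_unordered.T))
dissim = 1.0 - corr
coords = MDS(n_components=1, dissimilarity="precomputed",
             random_state=0).fit_transform(dissim)
order = np.argsort(coords[:, 0])

print(perm[order])  # recovered ordering; ideally monotone up to reversal
```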

Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.12557
  • repo_url: https://github.com/hieu9955/ggggg
  • paper_authors: Ping Li, Junjie Chen, Li Yuan, Xianghua Xu, Mingli Song
  • for: Improving semi-supervised semantic segmentation, which predicts pixel-level label maps from a few labeled images and an abundance of unlabeled images.
  • methods: Uses tri-training with a triple-view encoder to capture diverse features and knowledge distillation to learn complementary semantics among the encoders; a dual-frequency decoder selects important features by projecting them from the spatial to the frequency domain with a dual-frequency channel attention mechanism (a minimal distillation-loss sketch follows the abstract).
  • results: Extensive experiments on Pascal VOC 2012 and Cityscapes show that the proposed method achieves a good tradeoff between precision and inference speed and outperforms competing approaches.
    Abstract To alleviate the expensive human labeling, semi-supervised semantic segmentation employs a few labeled images and an abundant of unlabeled images to predict the pixel-level label map with the same size. Previous methods often adopt co-training using two convolutional networks with the same architecture but different initialization, which fails to capture the sufficiently diverse features. This motivates us to use tri-training and develop the triple-view encoder to utilize the encoders with different architectures to derive diverse features, and exploit the knowledge distillation skill to learn the complementary semantics among these encoders. Moreover, existing methods simply concatenate the features from both encoder and decoder, resulting in redundant features that require large memory cost. This inspires us to devise a dual-frequency decoder that selects those important features by projecting the features from the spatial domain to the frequency domain, where the dual-frequency channel attention mechanism is introduced to model the feature importance. Therefore, we propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation, including the triple-view encoder and the dual-frequency decoder. Extensive experiments were conducted on two benchmarks, \ie, Pascal VOC 2012 and Cityscapes, whose results verify the superiority of the proposed method with a good tradeoff between precision and inference speed.
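A minimal PyTorch sketch of the knowledge-distillation ingredient used among the encoders: one view's predictions are temperature-softened and used as targets for another view. The temperature and the pairing scheme are assumptions; TriKD's full objective also includes supervised and consistency terms.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened distributions, scaled by T^2."""
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

# Per-pixel logits for a pair of views on a tiny feature map (B, C, H, W).
s = torch.randn(2, 21, 8, 8)   # e.g., 21 Pascal VOC classes
t = torch.randn(2, 21, 8, 8)
# Flatten spatial positions into the batch dimension for the KL term.
loss = distillation_loss(s.permute(0, 2, 3, 1).reshape(-1, 21),
                         t.permute(0, 2, 3, 1).reshape(-1, 21))
print(loss.item())
```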