cs.CV - 2023-09-06

Distribution-Aware Prompt Tuning for Vision-Language Models

  • paper_url: http://arxiv.org/abs/2309.03406
  • repo_url: https://github.com/mlvlab/dapt
  • paper_authors: Eulrang Cho, Jooyeon Kim, Hyunwoo J. Kim
  • for: This work aims to improve the performance of pre-trained vision-language models (VLMs) on target tasks through prompt tuning.
  • methods: Distribution-aware prompt tuning (DAPT): with the model parameters fixed, learnable prompt vectors align the feature spaces of the two modalities; the prompts are learned by maximizing inter-dispersion (the distance between classes) while minimizing intra-dispersion (the distance between embeddings of the same class).
  • results: Extensive experiments on 11 benchmark datasets show that DAPT significantly improves generalizability.
    Abstract Pre-trained vision-language models (VLMs) have shown impressive performance on various downstream tasks by utilizing knowledge learned from large data. In general, the performance of VLMs on target tasks can be further improved by prompt tuning, which adds context to the input image or text. By leveraging data from target tasks, various prompt-tuning methods have been studied in the literature. A key to prompt tuning is the feature space alignment between two modalities via learnable vectors with model parameters fixed. We observed that the alignment becomes more effective when embeddings of each modality are `well-arranged' in the latent space. Inspired by this observation, we proposed distribution-aware prompt tuning (DAPT) for vision-language models, which is simple yet effective. Specifically, the prompts are learned by maximizing inter-dispersion, the distance between classes, as well as minimizing the intra-dispersion measured by the distance between embeddings from the same class. Our extensive experiments on 11 benchmark datasets demonstrate that our method significantly improves generalizability. The code is available at https://github.com/mlvlab/DAPT.
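    To make the dispersion objectives above concrete, the following is a minimal PyTorch sketch (not the authors' code) of how inter- and intra-dispersion can be computed from a batch of embeddings; the use of class-mean prototypes and Euclidean distances is an assumption for illustration.

```python
# Minimal sketch of the dispersion objectives: prompts would be tuned so that
# same-class embeddings cluster around their class prototype (intra-dispersion,
# minimized) while prototypes of different classes are pushed apart
# (inter-dispersion, maximized). Prototypes and distance choices are assumptions.
import torch

def dispersion_losses(embeddings: torch.Tensor, labels: torch.Tensor):
    """embeddings: (N, D) image or text features; labels: (N,) class ids."""
    classes = labels.unique()
    prototypes = torch.stack([embeddings[labels == c].mean(dim=0) for c in classes])

    # intra-dispersion: mean distance of each embedding to its class prototype
    intra = torch.stack([
        (embeddings[labels == c] - prototypes[i]).norm(dim=1).mean()
        for i, c in enumerate(classes)
    ]).mean()

    # inter-dispersion: mean pairwise distance between class prototypes
    inter = torch.pdist(prototypes).mean()
    return intra, inter

# learnable prompt vectors (not shown) would then minimize, e.g.:
# loss = task_loss + w_intra * intra - w_inter * inter
```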

Reasonable Anomaly Detection in Long Sequences

  • paper_url: http://arxiv.org/abs/2309.03401
  • repo_url: https://github.com/allenyljiang/anomaly-detection-in-sequences
  • paper_authors: Yalong Jiang, Changkang Li
  • for: This paper proposes a new video anomaly detection method that addresses the limited representational power of the short-term observations used by existing approaches.
  • methods: A Stacked State Machine (SSM) model learns object motion patterns from long-term sequences and predicts future states from past ones; anomalies are detected from the divergence between the predictions under normal motion patterns and the observed states.
  • results: Extensive experiments on the proposed and existing datasets show clear improvements over state-of-the-art methods.
    Abstract Video anomaly detection is a challenging task due to the lack of approaches for representing samples. The visual representations of most existing approaches are limited by short-term sequences of observations which cannot provide enough clues for achieving reasonable detections. In this paper, we propose to completely represent the motion patterns of objects by learning from long-term sequences. Firstly, a Stacked State Machine (SSM) model is proposed to represent the temporal dependencies which are consistent across long-range observations. The SSM model then predicts future states based on past ones; the divergence between the predictions under the inherent normal patterns and the observed states determines anomalies which violate normal motion patterns. Extensive experiments are carried out to evaluate the proposed approach on the dataset and existing ones. Improvements over state-of-the-art methods can be observed. Our code is available at https://github.com/AllenYLJiang/Anomaly-Detection-in-Sequences.
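    As a schematic illustration of the prediction-divergence idea above, the sketch below uses a small recurrent predictor as a stand-in for the Stacked State Machine; the GRU architecture and the scoring rule are assumptions, not the paper's model.

```python
# Schematic sketch: a sequence model trained on normal motion predicts the next
# state from past states; a large gap between prediction and observation flags an
# anomaly. The GRU predictor stands in for the paper's Stacked State Machine (SSM).
import torch
import torch.nn as nn

class StatePredictor(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(state_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, state_dim)

    def forward(self, past_states: torch.Tensor) -> torch.Tensor:
        # past_states: (B, T, state_dim) -> predicted next state (B, state_dim)
        out, _ = self.rnn(past_states)
        return self.head(out[:, -1])

def anomaly_score(model: StatePredictor, past: torch.Tensor, observed: torch.Tensor):
    with torch.no_grad():
        predicted = model(past)
    return (predicted - observed).norm(dim=-1)  # larger divergence => more anomalous
```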

A novel method for iris recognition using BP neural network and parallel computing by the aid of GPUs (Graphics Processing Units)

  • paper_url: http://arxiv.org/abs/2309.03390
  • repo_url: None
  • paper_authors: Farahnaz Hosseini, Hossein Ebrahimpour, Samaneh Askari
  • for: This paper seeks a new method for designing an iris recognition system.
  • methods: Haar wavelet features are first extracted from the iris images; they are fast to extract and unique to each iris. A back-propagation neural network (BPNN) is then used as the classifier, with parallel BPNN algorithms implemented on GPUs via CUDA to speed up learning.
  • results: The paper reports the system performance and the speed-up obtained by the GPU/CUDA parallel implementation of the BPNN learning process compared to a serial one.
    Abstract In this paper, we seek a new method for designing an iris recognition system. In this method, the Haar wavelet features are first extracted from iris images. The advantage of using these features is their high-speed extraction, as well as being unique to each iris. Then the back propagation neural network (BPNN) is used as a classifier. In this system, parallel BPNN algorithms implemented on GPUs with the aid of CUDA are used in order to speed up the learning process. Finally, the system performance and the speed-up obtained over a serial implementation of the algorithm are presented.
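    A small sketch of the described pipeline is given below, using pywt for the Haar wavelet features and scikit-learn's MLPClassifier as a stand-in back-propagation network; the single-level decomposition is an assumption and the paper's CUDA-parallel BPNN is not reproduced.

```python
# Sketch of the pipeline above: Haar wavelet features extracted from iris images,
# classified by a back-propagation neural network. pywt and scikit-learn are
# convenient stand-ins; the GPU/CUDA parallelization is not shown.
import numpy as np
import pywt
from sklearn.neural_network import MLPClassifier

def haar_features(iris_image: np.ndarray) -> np.ndarray:
    """Single-level 2D Haar decomposition, concatenated into one feature vector."""
    cA, (cH, cV, cD) = pywt.dwt2(iris_image, "haar")
    return np.concatenate([c.ravel() for c in (cA, cH, cV, cD)])

def train_iris_classifier(images, labels) -> MLPClassifier:
    X = np.stack([haar_features(img) for img in images])
    clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500)  # the "BPNN"
    clf.fit(X, labels)
    return clf
```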

Kidney abnormality segmentation in thorax-abdomen CT scans

  • paper_url: http://arxiv.org/abs/2309.03383
  • repo_url: None
  • paper_authors: Gabriel Efrain Humpire Mamani, Nikolas Lessmann, Ernst Th. Scholten, Mathias Prokop, Colin Jacobs, Bram van Ginneken
  • for: This paper aims to support clinicians in identifying and quantifying renal abnormalities such as cysts, lesions, masses, metastases, and primary tumors through the use of deep learning for segmenting kidney parenchyma and kidney abnormalities.
  • methods: The paper introduces an end-to-end segmentation method that utilizes a modified 3D U-Net network with four additional components: end-to-end multi-resolution approach, task-specific data augmentations, modified loss function using top-$k$, and spatial dropout. The method was trained on 215 contrast-enhanced thoracic-abdominal CT scans.
  • results: The paper reports that the best-performing model achieved Dice scores of 0.965 and 0.947 for segmenting kidney parenchyma in two test sets, outperforming an independent human observer. The method also achieved a Dice score of 0.585 for segmenting kidney abnormalities within the 30 test scans containing them, suggesting potential for further improvement in computerized methods.
    Abstract In this study, we introduce a deep learning approach for segmenting kidney parenchyma and kidney abnormalities to support clinicians in identifying and quantifying renal abnormalities such as cysts, lesions, masses, metastases, and primary tumors. Our end-to-end segmentation method was trained on 215 contrast-enhanced thoracic-abdominal CT scans, with half of these scans containing one or more abnormalities. We began by implementing our own version of the original 3D U-Net network and incorporated four additional components: an end-to-end multi-resolution approach, a set of task-specific data augmentations, a modified loss function using top-$k$, and spatial dropout. Furthermore, we devised a tailored post-processing strategy. Ablation studies demonstrated that each of the four modifications enhanced kidney abnormality segmentation performance, while three out of four improved kidney parenchyma segmentation. Subsequently, we trained the nnUNet framework on our dataset. By ensembling the optimized 3D U-Net and the nnUNet with our specialized post-processing, we achieved marginally superior results. Our best-performing model attained Dice scores of 0.965 and 0.947 for segmenting kidney parenchyma in two test sets (20 scans without abnormalities and 30 with abnormalities), outperforming an independent human observer who scored 0.944 and 0.925, respectively. In segmenting kidney abnormalities within the 30 test scans containing them, the top-performing method achieved a Dice score of 0.585, while an independent second human observer reached a score of 0.664, suggesting potential for further improvement in computerized methods. All training data is available to the research community under a CC-BY 4.0 license on https://doi.org/10.5281/zenodo.8014289
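    The abstract mentions a modified loss function using top-$k$; a common way to realize this is to average the voxel-wise cross-entropy only over the hardest k% of voxels, as in the hedged sketch below (the value of k and this exact formulation are assumptions, not the authors' code).

```python
# Sketch of a top-k segmentation loss: standard voxel-wise cross-entropy, averaged
# only over the k% hardest voxels so that easy background voxels do not dominate.
import torch
import torch.nn.functional as F

def topk_cross_entropy(logits: torch.Tensor, target: torch.Tensor, k: float = 0.1):
    """logits: (B, C, D, H, W); target: (B, D, H, W) integer labels."""
    voxel_loss = F.cross_entropy(logits, target, reduction="none").flatten(1)  # (B, N)
    num_keep = max(1, int(k * voxel_loss.shape[1]))
    hardest, _ = voxel_loss.topk(num_keep, dim=1)  # keep the largest per-voxel losses
    return hardest.mean()
```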

Active shooter detection and robust tracking utilizing supplemental synthetic data

  • paper_url: http://arxiv.org/abs/2309.03381
  • repo_url: None
  • paper_authors: Joshua R. Waite, Jiale Feng, Riley Tavassoli, Laura Harris, Sin Yong Tan, Subhadeep Chakraborty, Soumik Sarkar
  • for: Motivated by growing concern over gun violence in the United States, this paper works toward a public-safety system that detects and tracks shooters in order to prevent or mitigate the impact of violent incidents.
  • methods: Shooters are detected as a whole rather than just guns, which improves tracking robustness because obscuring the gun no longer causes the system to lose sight of the threat. Since public data on shooters is limited and hard to create, domain randomization and transfer learning are used to train effectively on synthetic data rendered in Unreal Engine, improving generalization to different situations.
  • results: Using YOLOv8 and Deep OC-SORT, an initial shooter tracking system was implemented that runs on edge hardware, including a Raspberry Pi and a Jetson Nano.
    Abstract The increasing concern surrounding gun violence in the United States has led to a focus on developing systems to improve public safety. One approach to developing such a system is to detect and track shooters, which would help prevent or mitigate the impact of violent incidents. In this paper, we proposed detecting shooters as a whole, rather than just guns, which would allow for improved tracking robustness, as obscuring the gun would no longer cause the system to lose sight of the threat. However, publicly available data on shooters is much more limited and challenging to create than a gun dataset alone. Therefore, we explore the use of domain randomization and transfer learning to improve the effectiveness of training with synthetic data obtained from Unreal Engine environments. This enables the model to be trained on a wider range of data, increasing its ability to generalize to different situations. Using these techniques with YOLOv8 and Deep OC-SORT, we implemented an initial version of a shooter tracking system capable of running on edge hardware, including both a Raspberry Pi and a Jetson Nano.

ViewMix: Augmentation for Robust Representation in Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2309.03360
  • repo_url: None
  • paper_authors: Arjon Das, Xin Zhong
  • for: This paper proposes ViewMix, an augmentation policy designed for self-supervised learning, to obtain more robust representations.
  • methods: Upon generating different views of the same image, patches are cut from one view and pasted into another; the policy is combined with several joint embedding-based self-supervised methods for comparison.
  • results: Incorporating ViewMix improves the localization capability and robustness of the learned representations, consistently outperforming the corresponding baselines without introducing additional compute overhead.
    Abstract Joint Embedding Architecture-based self-supervised learning methods have attributed the composition of data augmentations as a crucial factor for their strong representation learning capabilities. While regional dropout strategies have proven to guide models to focus on lesser indicative parts of the objects in supervised methods, it hasn't been adopted by self-supervised methods for generating positive pairs. This is because the regional dropout methods are not suitable for the input sampling process of the self-supervised methodology. Whereas dropping informative pixels from the positive pairs can result in inefficient training, replacing patches of a specific object with a different one can steer the model from maximizing the agreement between different positive pairs. Moreover, joint embedding representation learning methods have not made robustness their primary training outcome. To this end, we propose the ViewMix augmentation policy, specially designed for self-supervised learning, upon generating different views of the same image, patches are cut and pasted from one view to another. By leveraging the different views created by this augmentation strategy, multiple joint embedding-based self-supervised methodologies obtained better localization capability and consistently outperformed their corresponding baseline methods. It is also demonstrated that incorporating ViewMix augmentation policy promotes robustness of the representations in the state-of-the-art methods. Furthermore, our experimentation and analysis of compute times suggest that ViewMix augmentation doesn't introduce any additional overhead compared to other counterparts.
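    The cut-and-paste operation described above can be sketched as follows; the CutMix-style rectangular box sampling is an assumption made for illustration and is not necessarily how the authors sample patches.

```python
# Minimal sketch of the ViewMix idea: given two augmented views of the same images,
# cut a rectangular patch from one view and paste it into the other before feeding
# both views to a joint-embedding self-supervised method.
import torch

def viewmix(view_a: torch.Tensor, view_b: torch.Tensor, alpha: float = 1.0):
    """view_a, view_b: (B, C, H, W) two views of the same batch of images."""
    _, _, H, W = view_a.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    cut_h, cut_w = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
    mixed = view_a.clone()
    mixed[:, :, y1:y2, x1:x2] = view_b[:, :, y1:y2, x1:x2]  # paste patch from other view
    return mixed, view_b
```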

Source Camera Identification and Detection in Digital Videos through Blind Forensics

  • paper_url: http://arxiv.org/abs/2309.03353
  • repo_url: None
  • paper_authors: Venkata Udaya Sameer, Shilpa Mukhopadhyay, Ruchira Naskar, Ishaan Dali
  • for: This work aims to verify the authenticity and origin of digital videos, i.e., to determine whether a video actually comes from its claimed source device.
  • methods: A blind forensic technique based on feature extraction, feature selection, and subsequent source classification with machine learning.
  • results: Experimental results show the proposed method is more efficient than the traditional fingerprint-based (PRNU/SPN correlation) technique.
    Abstract Source camera identification in digital videos is the problem of associating an unknown digital video with its source device, within a closed set of possible devices. The existing techniques in source detection of digital videos try to find a fingerprint of the actual source in the video in form of PRNU (Photo Response Non--Uniformity), and match it against the SPN (Sensor Pattern Noise) of each possible device. The highest correlation indicates the correct source. We investigate the problem of identifying a video source through a feature based approach using machine learning. In this paper, we present a blind forensic technique of video source authentication and identification, based on feature extraction, feature selection and subsequent source classification. The main aim is to determine whether a claimed source for a video is actually its original source. If not, we identify its original source. Our experimental results prove the efficiency of the proposed method compared to traditional fingerprint based technique.

Using Neural Networks for Fast SAR Roughness Estimation of High Resolution Images

  • paper_url: http://arxiv.org/abs/2309.03351
  • repo_url: https://github.com/jeovafarias/sar-roughness-estimation-neural-nets
  • paper_authors: Li Fan, Jeova Farias Sales Rocha Neto
  • for: This paper proposes a neural network-based analysis method for high-resolution Synthetic Aperture Radar (SAR) images, addressing their inherent speckle noise.
  • methods: The method first learns to predict the underlying parameters of data modeled by the $G_I^0$ distribution; the extracted roughness information can then be used in downstream tasks such as segmentation, classification, and interpretation.
  • results: The estimator is quicker, yields less estimation error, and is less prone to failures than traditional estimation procedures, especially for high-resolution images. The same methodology generalizes to image inputs and, even when trained on purely synthetic data for a few seconds, performs real-time pixel-wise roughness estimation on real SAR imagery.
    Abstract The analysis of Synthetic Aperture Radar (SAR) imagery is an important step in remote sensing applications, and it is a challenging problem due to its inherent speckle noise. One typical solution is to model the data using the $G_I^0$ distribution and extract its roughness information, which in turn can be used in posterior imaging tasks, such as segmentation, classification and interpretation. This leads to the need of quick and reliable estimation of the roughness parameter from SAR data, especially with high resolution images. Unfortunately, traditional parameter estimation procedures are slow and prone to estimation failures. In this work, we proposed a neural network-based estimation framework that first learns how to predict underlying parameters of $G_I^0$ samples and then can be used to estimate the roughness of unseen data. We show that this approach leads to an estimator that is quicker, yields less estimation error and is less prone to failures than the traditional estimation procedures for this problem, even when we use a simple network. More importantly, we show that this same methodology can be generalized to handle image inputs and, even if trained on purely synthetic data for a few seconds, is able to perform real time pixel-wise roughness estimation for high resolution real SAR imagery.
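    The regression idea can be sketched as below: a small network maps summary statistics of $G_I^0$-distributed samples to the roughness parameter and is trained on synthetic draws (sample generation is not shown); the choice of log-moment statistics and the architecture are illustrative assumptions.

```python
# Sketch of the estimation framework: a small regressor trained on synthetic
# G_I^0 samples predicts the roughness parameter of unseen SAR patches.
import torch
import torch.nn as nn

class RoughnessRegressor(nn.Module):
    def __init__(self, in_features: int = 3, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted roughness parameter
        )

    def forward(self, stats: torch.Tensor) -> torch.Tensor:
        return self.net(stats)

def patch_statistics(patch: torch.Tensor) -> torch.Tensor:
    """Simple log-moment statistics of one SAR patch (assumed regressor inputs)."""
    logx = torch.log(patch.clamp_min(1e-8)).flatten()
    return torch.stack([logx.mean(), logx.var(), logx.pow(3).mean()])
```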

SADIR: Shape-Aware Diffusion Models for 3D Image Reconstruction

  • paper_url: http://arxiv.org/abs/2309.03335
  • repo_url: None
  • paper_authors: Nivetha Jayakumar, Tonmoy Hossain, Miaomiao Zhang
  • for: This work aims to improve the accuracy and shape preservation of 3D image reconstruction from a limited number of 2D images using deep learning.
  • methods: SADIR, a shape-aware network based on diffusion models, jointly learns a mean shape under deformation models and uses the learned shape priors to guide 3D reconstruction, treating each reconstructed image as a deformed variant of the mean shape.
  • results: On brain and cardiac MRIs, SADIR achieves lower reconstruction error than the baselines and better preserves the shape structure of objects within the images.
    Abstract 3D image reconstruction from a limited number of 2D images has been a long-standing challenge in computer vision and image analysis. While deep learning-based approaches have achieved impressive performance in this area, existing deep networks often fail to effectively utilize the shape structures of objects presented in images. As a result, the topology of reconstructed objects may not be well preserved, leading to the presence of artifacts such as discontinuities, holes, or mismatched connections between different parts. In this paper, we propose a shape-aware network based on diffusion models for 3D image reconstruction, named SADIR, to address these issues. In contrast to previous methods that primarily rely on spatial correlations of image intensities for 3D reconstruction, our model leverages shape priors learned from the training data to guide the reconstruction process. To achieve this, we develop a joint learning network that simultaneously learns a mean shape under deformation models. Each reconstructed image is then considered as a deformed variant of the mean shape. We validate our model, SADIR, on both brain and cardiac magnetic resonance images (MRIs). Experimental results show that our method outperforms the baselines with lower reconstruction error and better preservation of the shape structure of objects within the images.

Expert Uncertainty and Severity Aware Chest X-Ray Classification by Multi-Relationship Graph Learning

  • paper_url: http://arxiv.org/abs/2309.03331
  • repo_url: None
  • paper_authors: Mengliang Zhang, Xinyue Hu, Lin Gu, Liangchen Liu, Kazuma Kobayashi, Tatsuya Harada, Ronald M. Summers, Yingying Zhu
  • for: This paper aims to make the disease labels extracted from chest X-ray (CXR) reports more realistic. Because multiple lung diseases often co-occur, lesion texture changes are subtle, and patient conditions differ, radiologists' assessments carry uncertainty that introduces noise into label extraction.
  • methods: Disease labels with severity and uncertainty are re-extracted from CXR reports using a rule-based keyword approach developed with clinical experts; a multi-relationship graph learning method with an expert uncertainty-aware loss function further improves the explainability of chest X-ray diagnosis and can interpret the classification results.
  • results: Models that account for disease severity and uncertainty outperform previous state-of-the-art methods.
    Abstract Patients undergoing chest X-rays (CXR) often endure multiple lung diseases. When evaluating a patient's condition, due to the complex pathologies, subtle texture changes of different lung lesions in images, and patient condition differences, radiologists may remain uncertain even after long-term clinical training and professional guidance, which introduces considerable noise into extracting disease labels based on CXR reports. In this paper, we re-extract disease labels from CXR reports to make them more realistic by considering disease severity and uncertainty in classification. Our contributions are as follows: 1. We re-extracted the disease labels with severity and uncertainty by a rule-based approach with keywords discussed with clinical experts. 2. To further improve the explainability of chest X-ray diagnosis, we designed a multi-relationship graph learning method with an expert uncertainty-aware loss function. 3. Our multi-relationship graph learning method can also interpret the disease classification results. Our experimental results show that models considering disease severity and uncertainty outperform previous state-of-the-art methods.
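    A toy sketch of rule-based label re-extraction with severity and uncertainty cues is shown below; the keyword lists are invented placeholders, whereas the paper uses keyword rules curated with clinical experts.

```python
# Toy sketch of keyword-based label extraction with severity and uncertainty.
import re

SEVERITY = {"mild": 1, "moderate": 2, "severe": 3}
UNCERTAIN = ["may", "possible", "cannot exclude", "suggestive of"]

def extract_label(report: str, finding: str) -> dict:
    sentence = next((s for s in report.split(".") if finding in s.lower()), "")
    severity = next((v for k, v in SEVERITY.items() if k in sentence.lower()), 0)
    uncertain = any(re.search(rf"\b{kw}\b", sentence.lower()) for kw in UNCERTAIN)
    return {"finding": finding, "present": bool(sentence),
            "severity": severity, "uncertain": uncertain}

print(extract_label("There is a possible mild right pleural effusion.", "effusion"))
# {'finding': 'effusion', 'present': True, 'severity': 1, 'uncertain': True}
```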

MEGANet: Multi-Scale Edge-Guided Attention Network for Weak Boundary Polyp Segmentation

  • paper_url: http://arxiv.org/abs/2309.03329
  • repo_url: https://github.com/dinhhieuhoang/meganet
  • paper_authors: Nhat-Tan Bui, Dinh-Hieu Hoang, Quang-Thuc Nguyen, Minh-Triet Tran, Ngan Le
  • for: This work aims to improve the accuracy of polyp segmentation in colonoscopy images, supporting early diagnosis of colorectal cancer.
  • methods: A Multi-Scale Edge-Guided Attention Network (MEGANet) combines a classical edge detection technique (the Laplacian operator) with an attention mechanism to preserve weak polyp boundaries.
  • results: Extensive experiments on five benchmark datasets show that MEGANet outperforms existing SOTA methods under six evaluation metrics.
    Abstract Efficient polyp segmentation in healthcare plays a critical role in enabling early diagnosis of colorectal cancer. However, the segmentation of polyps presents numerous challenges, including the intricate distribution of backgrounds, variations in polyp sizes and shapes, and indistinct boundaries. Defining the boundary between the foreground (i.e. polyp itself) and the background (surrounding tissue) is difficult. To mitigate these challenges, we propose Multi-Scale Edge-Guided Attention Network (MEGANet) tailored specifically for polyp segmentation within colonoscopy images. This network draws inspiration from the fusion of a classical edge detection technique with an attention mechanism. By combining these techniques, MEGANet effectively preserves high-frequency information, notably edges and boundaries, which tend to erode as neural networks deepen. MEGANet is designed as an end-to-end framework, encompassing three key modules: an encoder, which is responsible for capturing and abstracting the features from the input image, a decoder, which focuses on salient features, and the Edge-Guided Attention module (EGA) that employs the Laplacian Operator to accentuate polyp boundaries. Extensive experiments, both qualitative and quantitative, on five benchmark datasets, demonstrate that our EGANet outperforms other existing SOTA methods under six evaluation metrics. Our code is available at \url{https://github.com/DinhHieuHoang/MEGANet}
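    The edge-guided attention idea can be illustrated with the simplified module below, where a fixed Laplacian kernel highlights boundaries in a feature map and the resulting edge response gates the features; this is an illustration of the mechanism only, not the authors' EGA module.

```python
# Simplified sketch of Laplacian-based edge-guided attention over a feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

class EdgeGuidedAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(1, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Laplacian response of the channel-averaged feature map
        edge = F.conv2d(feats.mean(dim=1, keepdim=True), LAPLACIAN.to(feats), padding=1)
        attn = self.gate(edge.abs())   # boundary-aware attention weights
        return feats + feats * attn    # residual edge-guided re-weighting
```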

C-CLIP: Contrastive Image-Text Encoders to Close the Descriptive-Commentative Gap

  • paper_url: http://arxiv.org/abs/2309.03921
  • repo_url: None
  • paper_authors: William Theisen, Walter Scheirer
  • for: This paper aims to improve multimodal image-text models for social media, where captions are commentative rather than descriptive, so that image-text retrieval works across languages and platforms.
  • methods: Contrastive image-text encoders are trained on explicitly commentative image-caption pairs instead of the descriptive pairs used for standard CLIP training.
  • results: Training on commentative pairs yields large improvements in retrieval performance, and the gains extend across a variety of non-English languages.
    Abstract The interplay between the image and comment on a social media post is one of high importance for understanding its overall message. Recent strides in multimodal embedding models, namely CLIP, have provided an avenue forward in relating image and text. However the current training regime for CLIP models is insufficient for matching content found on social media, regardless of site or language. Current CLIP training data is based on what we call ``descriptive'' text: text in which an image is merely described. This is something rarely seen on social media, where the vast majority of text content is ``commentative'' in nature. The captions provide commentary and broader context related to the image, rather than describing what is in it. Current CLIP models perform poorly on retrieval tasks where image-caption pairs display a commentative relationship. Closing this gap would be beneficial for several important application areas related to social media. For instance, it would allow groups focused on Open-Source Intelligence Operations (OSINT) to further aid efforts during disaster events, such as the ongoing Russian invasion of Ukraine, by easily exposing data to non-technical users for discovery and analysis. In order to close this gap we demonstrate that training contrastive image-text encoders on explicitly commentative pairs results in large improvements in retrieval results, with the results extending across a variety of non-English languages.

CoNeS: Conditional neural fields with shift modulation for multi-sequence MRI translation

  • paper_url: http://arxiv.org/abs/2309.03320
  • repo_url: https://github.com/cyjdswx/cones
  • paper_authors: Yunjie Chen, Marius Staring, Olaf M. Neve, Stephan R. Romeijn, Erik F. Hensen, Berit M. Verbist, Jelmer M. Wolterink, Qian Tao
  • for: This work synthesizes missing MRI sequences so that deep learning models trained on multi-sequence data remain usable in clinical practice when some sequences are unavailable.
  • methods: CoNeS (Conditional Neural fields with Shift modulation) takes voxel coordinates as input and uses a multi-layer perceptron (MLP) instead of a CNN as the decoder for pixel-to-pixel mapping; each target image is represented as a neural field conditioned on the source image via shift modulation with a learned latent code.
  • results: On BraTS 2018 and an in-house clinical dataset of vestibular schwannoma patients, the method outperforms state-of-the-art multi-sequence MRI translation approaches both visually and quantitatively, better preserving high-frequency details; the synthesized images also prove useful for a downstream segmentation task.
    Abstract Multi-sequence magnetic resonance imaging (MRI) has found wide applications in both modern clinical studies and deep learning research. However, in clinical practice, it frequently occurs that one or more of the MRI sequences are missing due to different image acquisition protocols or contrast agent contraindications of patients, limiting the utilization of deep learning models trained on multi-sequence data. One promising approach is to leverage generative models to synthesize the missing sequences, which can serve as a surrogate acquisition. State-of-the-art methods tackling this problem are based on convolutional neural networks (CNN) which usually suffer from spectral biases, resulting in poor reconstruction of high-frequency fine details. In this paper, we propose Conditional Neural fields with Shift modulation (CoNeS), a model that takes voxel coordinates as input and learns a representation of the target images for multi-sequence MRI translation. The proposed model uses a multi-layer perceptron (MLP) instead of a CNN as the decoder for pixel-to-pixel mapping. Hence, each target image is represented as a neural field that is conditioned on the source image via shift modulation with a learned latent code. Experiments on BraTS 2018 and an in-house clinical dataset of vestibular schwannoma patients showed that the proposed method outperformed state-of-the-art methods for multi-sequence MRI translation both visually and quantitatively. Moreover, we conducted spectral analysis, showing that CoNeS was able to overcome the spectral bias issue common in conventional CNN models. To further evaluate the usage of synthesized images in clinical downstream tasks, we tested a segmentation network using the synthesized images at inference.
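    The shift-modulation mechanism can be sketched as follows: an MLP maps voxel coordinates to target intensities, and each hidden layer's activations are shifted by a vector predicted from a latent code of the source image. Layer sizes and how the latent code is produced are assumptions, not the authors' implementation.

```python
# Minimal sketch of a conditional neural field with shift modulation.
import torch
import torch.nn as nn

class ShiftModulatedMLP(nn.Module):
    def __init__(self, coord_dim=3, latent_dim=128, hidden=256, depth=4, out_dim=1):
        super().__init__()
        dims = [coord_dim] + [hidden] * depth
        self.layers = nn.ModuleList(nn.Linear(dims[i], dims[i + 1]) for i in range(depth))
        self.shifts = nn.ModuleList(nn.Linear(latent_dim, hidden) for _ in range(depth))
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, coords: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        # coords: (N, coord_dim) voxel coordinates; latent: (1, latent_dim) source code
        h = coords
        for layer, shift in zip(self.layers, self.shifts):
            h = torch.relu(layer(h) + shift(latent))  # shift modulation of activations
        return self.head(h)
```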

Bayes’ Rays: Uncertainty Quantification for Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2309.03185
  • repo_url: https://github.com/BayesRays/BayesRays
  • paper_authors: Lily Goli, Cody Reading, Silvia Sellán, Alec Jacobson, Andrea Tagliasacchi
  • for: This paper quantifies the uncertainty of pre-trained Neural Radiance Fields (NeRFs).
  • methods: BayesRays is a post-hoc framework that requires no modification of the training process; it establishes a volumetric uncertainty field using spatial perturbations and a Bayesian Laplace approximation.
  • results: The algorithm is derived statistically and shows superior performance in key metrics and applications; additional results are available at https://bayesrays.github.io.
    Abstract Neural Radiance Fields (NeRFs) have shown promise in applications like view synthesis and depth estimation, but learning from multiview images faces inherent uncertainties. Current methods to quantify them are either heuristic or computationally demanding. We introduce BayesRays, a post-hoc framework to evaluate uncertainty in any pre-trained NeRF without modifying the training process. Our method establishes a volumetric uncertainty field using spatial perturbations and a Bayesian Laplace approximation. We derive our algorithm statistically and show its superior performance in key metrics and applications. Additional results available at: https://bayesrays.github.io.

3D Transformer based on deformable patch location for differential diagnosis between Alzheimer’s disease and Frontotemporal dementia

  • paper_url: http://arxiv.org/abs/2309.03183
  • repo_url: None
  • paper_authors: Huy-Dung Nguyen, Michaël Clément, Boris Mansencal, Pierrick Coupé
  • for: This work proposes a transformer-based method for 3D medical data to improve the multi-class differential diagnosis of Alzheimer's disease and frontotemporal dementia.
  • methods: A 3D transformer architecture with a deformable patch location module; an efficient combination of data augmentation techniques adapted to training transformers on 3D structural MRI addresses data scarcity, and the transformer is combined with a traditional machine learning model based on brain structure volumes to better exploit the available data.
  • results: Experiments show competitive results compared to state-of-the-art methods, and the deformable patch locations can be visualized, revealing the brain regions most relevant to the diagnosis of each disease.
    Abstract Alzheimer's disease and Frontotemporal dementia are common types of neurodegenerative disorders that present overlapping clinical symptoms, making their differential diagnosis very challenging. Numerous efforts have been done for the diagnosis of each disease but the problem of multi-class differential diagnosis has not been actively explored. In recent years, transformer-based models have demonstrated remarkable success in various computer vision tasks. However, their use in disease diagnostic is uncommon due to the limited amount of 3D medical data given the large size of such models. In this paper, we present a novel 3D transformer-based architecture using a deformable patch location module to improve the differential diagnosis of Alzheimer's disease and Frontotemporal dementia. Moreover, to overcome the problem of data scarcity, we propose an efficient combination of various data augmentation techniques, adapted for training transformer-based models on 3D structural magnetic resonance imaging data. Finally, we propose to combine our transformer-based model with a traditional machine learning model using brain structure volumes to better exploit the available data. Our experiments demonstrate the effectiveness of the proposed approach, showing competitive results compared to state-of-the-art methods. Moreover, the deformable patch locations can be visualized, revealing the most relevant brain regions used to establish the diagnosis of each disease.

SLiMe: Segment Like Me

  • paper_url: http://arxiv.org/abs/2309.03179
  • repo_url: None
  • paper_authors: Aliasghar Khani, Saeid Asgari Taghanaki, Aditya Sanghi, Ali Mahdavi Amiri, Ghassan Hamarneh
  • for: This paper proposes a one-shot segmentation method: given a single image and its segmentation mask at test time, it can segment any real-world image at the same granularity.
  • methods: SLiMe frames the problem as an optimization task over large vision-language models such as Stable Diffusion (SD): attention maps, including a novel weighted accumulated self-attention map, are extracted from the SD prior, and the text embeddings of Stable Diffusion are optimized so that each embedding learns a single segmented region of the training image.
  • results: SLiMe segments unseen images with the granularity of the annotated region using just one example, outperforms other one-shot and few-shot segmentation methods, and improves further when additional (few-shot) training data is available.
    Abstract Significant strides have been made using large vision-language models, like Stable Diffusion (SD), for a variety of downstream tasks, including image editing, image correspondence, and 3D shape generation. Inspired by these advancements, we explore leveraging these extensive vision-language models for segmenting images at any desired granularity using as few as one annotated sample by proposing SLiMe. SLiMe frames this problem as an optimization task. Specifically, given a single training image and its segmentation mask, we first extract attention maps, including our novel "weighted accumulated self-attention map" from the SD prior. Then, using the extracted attention maps, the text embeddings of Stable Diffusion are optimized such that, each of them, learn about a single segmented region from the training image. These learned embeddings then highlight the segmented region in the attention maps, which in turn can then be used to derive the segmentation map. This enables SLiMe to segment any real-world image during inference with the granularity of the segmented region in the training image, using just one example. Moreover, leveraging additional training data when available, i.e. few-shot, improves the performance of SLiMe. We carried out a knowledge-rich set of experiments examining various design factors and showed that SLiMe outperforms other existing one-shot and few-shot segmentation methods.

3D Object Positioning Using Differentiable Multimodal Learning

  • paper_url: http://arxiv.org/abs/2309.03177
  • repo_url: None
  • paper_authors: Sean Zanyk-McLean, Krishna Kumar, Paul Navratil
  • for: Optimizing an object's position in a computer graphics scene with respect to an observer or some reference objects.
  • methods: Simulated Lidar data obtained via ray tracing is combined with an image pixel loss from differentiable rendering; the object position is optimized with gradient descent on a loss influenced by both modalities.
  • results: Using the two modalities (image and Lidar) leads to faster convergence of the object position than the image pixel loss alone; the approach is potentially useful for autonomous vehicles, e.g., for establishing the locations of multiple actors in a scene, and a method for simulating multiple data types for training is also presented.
    Abstract This article describes a multi-modal method using simulated Lidar data via ray tracing and image pixel loss with differentiable rendering to optimize an object's position with respect to an observer or some referential objects in a computer graphics scene. Object position optimization is completed using gradient descent with the loss function being influenced by both modalities. Typical object placement optimization is done using image pixel loss with differentiable rendering only, this work shows the use of a second modality (Lidar) leads to faster convergence. This method of fusing sensor input presents a potential usefulness for autonomous vehicles, as these methods can be used to establish the locations of multiple actors in a scene. This article also presents a method for the simulation of multiple types of data to be used in the training of autonomous vehicles.
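    The optimization loop described above can be sketched schematically: the object translation is updated by gradient descent on a weighted sum of an image pixel loss and a simulated-Lidar loss, both assumed to come from differentiable renderers. The renderer calls are placeholders and are not defined here.

```python
# Schematic sketch of multimodal position optimization with gradient descent.
import torch

def optimize_position(render_image, render_lidar, target_img, target_lidar,
                      steps=200, lr=1e-2, w_img=1.0, w_lidar=1.0):
    translation = torch.zeros(3, requires_grad=True)  # object position to optimize
    opt = torch.optim.Adam([translation], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (w_img * (render_image(translation) - target_img).pow(2).mean()
                + w_lidar * (render_lidar(translation) - target_lidar).pow(2).mean())
        loss.backward()
        opt.step()
    return translation.detach()
```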

PDiscoNet: Semantically consistent part discovery for fine-grained recognition

  • paper_url: http://arxiv.org/abs/2309.03173
  • repo_url: https://github.com/robertdvdk/part_detection
  • paper_authors: Robert van der Klis, Stephan Alaniz, Massimiliano Mancini, Cassio F. Dantas, Dino Ienco, Zeynep Akata, Diego Marcos
  • for: This work aims to improve fine-grained classification by encouraging the model to first detect specific object parts and then use them to infer the class.
  • methods: PDiscoNet discovers object parts using only image-level class labels together with priors encouraging the parts to be discriminative, compact, distinct from each other, equivariant to rigid transforms, and active in at least some images; part-dropout and part feature vector modulation prevent a single part from dominating and keep the information from each part distinct.
  • results: On CUB, CelebA, and PartImageNet, PDiscoNet provides substantially better part discovery than previous methods without additional hyper-parameter tuning and without penalizing classification performance.
    Abstract Fine-grained classification often requires recognizing specific object parts, such as beak shape and wing patterns for birds. Encouraging a fine-grained classification model to first detect such parts and then using them to infer the class could help us gauge whether the model is indeed looking at the right details better than with interpretability methods that provide a single attribution map. We propose PDiscoNet to discover object parts by using only image-level class labels along with priors encouraging the parts to be: discriminative, compact, distinct from each other, equivariant to rigid transforms, and active in at least some of the images. In addition to using the appropriate losses to encode these priors, we propose to use part-dropout, where full part feature vectors are dropped at once to prevent a single part from dominating in the classification, and part feature vector modulation, which makes the information coming from each part distinct from the perspective of the classifier. Our results on CUB, CelebA, and PartImageNet show that the proposed method provides substantially better part discovery performance than previous methods while not requiring any additional hyper-parameter tuning and without penalizing the classification performance. The code is available at https://github.com/robertdvdk/part_detection.
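    The part-dropout idea can be sketched as below: during training, entire part feature vectors are zeroed at once so that no single discovered part can dominate the classifier; the drop probability is an assumption.

```python
# Small sketch of part-dropout: drop whole (B, K, D) part vectors, not single units.
import torch

def part_dropout(part_feats: torch.Tensor, p: float = 0.3, training: bool = True):
    """part_feats: (B, K, D) with K discovered parts, each a D-dim feature vector."""
    if not training or p == 0:
        return part_feats
    keep = (torch.rand(part_feats.shape[:2], device=part_feats.device) > p).float()
    # zero out entire part vectors; rescale so the expected magnitude is unchanged
    return part_feats * keep.unsqueeze(-1) / (1 - p)
```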

ResFields: Residual Neural Fields for Spatiotemporal Signals

  • paper_url: http://arxiv.org/abs/2309.03160
  • repo_url: https://github.com/markomih/ResFields
  • paper_authors: Marko Mihajlovic, Sergey Prokudin, Marc Pollefeys, Siyu Tang
  • for: Modeling complex spatiotemporal 3D data, in particular large neural signed distance fields (SDFs) or radiance fields (NeRFs), via a single multi-layer perceptron (MLP).
  • methods: Temporal residual layers are incorporated into neural fields, yielding ResFields, a novel class of networks specifically designed to effectively represent complex temporal signals; a matrix factorization technique reduces the number of trainable parameters and enhances generalization.
  • results: The approach addresses the limited capacity of MLPs and consistently improves results across challenging tasks: 2D video approximation, dynamic shape modeling via temporal SDFs, and dynamic NeRF reconstruction, including capturing dynamic 3D scenes from the sparse sensory inputs of a lightweight capture system.
    Abstract Neural fields, a category of neural networks trained to represent high-frequency signals, have gained significant attention in recent years due to their impressive performance in modeling complex 3D data, especially large neural signed distance (SDFs) or radiance fields (NeRFs) via a single multi-layer perceptron (MLP). However, despite the power and simplicity of representing signals with an MLP, these methods still face challenges when modeling large and complex temporal signals due to the limited capacity of MLPs. In this paper, we propose an effective approach to address this limitation by incorporating temporal residual layers into neural fields, dubbed ResFields, a novel class of networks specifically designed to effectively represent complex temporal signals. We conduct a comprehensive analysis of the properties of ResFields and propose a matrix factorization technique to reduce the number of trainable parameters and enhance generalization capabilities. Importantly, our formulation seamlessly integrates with existing techniques and consistently improves results across various challenging tasks: 2D video approximation, dynamic shape modeling via temporal SDFs, and dynamic NeRF reconstruction. Lastly, we demonstrate the practical utility of ResFields by showcasing its effectiveness in capturing dynamic 3D scenes from sparse sensory inputs of a lightweight capture system.
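    The temporal residual idea can be sketched with a linear layer whose weight is a shared base matrix plus a time-dependent residual, here factorized into a small set of basis matrices combined by per-frame coefficients; the rank, the discrete per-frame coefficients, and layer sizes are assumptions for illustration.

```python
# Sketch of a temporal residual (ResField-style) linear layer:
# W(t) = W_base + sum_r c_r(t) * B_r, a low-rank time-dependent residual.
import torch
import torch.nn as nn

class ResFieldLinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, num_frames: int, rank: int = 10):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.coeffs = nn.Parameter(torch.zeros(num_frames, rank))             # c_r(t)
        self.basis = nn.Parameter(torch.randn(rank, out_dim, in_dim) * 1e-3)  # B_r

    def forward(self, x: torch.Tensor, frame_id: int) -> torch.Tensor:
        delta_w = torch.einsum("r,roi->oi", self.coeffs[frame_id], self.basis)
        return x @ (self.base.weight + delta_w).T + self.base.bias
```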

Do We Still Need Non-Maximum Suppression? Accurate Confidence Estimates and Implicit Duplication Modeling with IoU-Aware Calibration

  • paper_url: http://arxiv.org/abs/2309.03110
  • repo_url: None
  • paper_authors: Johannes Gilg, Torben Teepe, Fabian Herzog, Philipp Wolters, Gerhard Rigoll
  • for: Improving the reliability and interpretability of confidence estimates in object detection systems.
  • methods: Classic NMS post-processing is replaced by IoU-aware calibration, a conditional Beta calibration that is parallelizable, has no hyper-parameters, and implicitly accounts for the likelihood of each detection being a duplicate, adjusting the confidence score accordingly.
  • results: The approach models duplicate detections, improves calibration, and delivers performance gains over the best NMS-based alternative while producing consistently better-calibrated confidence predictions with less complexity.
    Abstract Object detectors are at the heart of many semi- and fully autonomous decision systems and are poised to become even more indispensable. They are, however, still lacking in accessibility and can sometimes produce unreliable predictions. Especially concerning in this regard are the -- essentially hand-crafted -- non-maximum suppression algorithms that lead to an obfuscated prediction process and biased confidence estimates. We show that we can eliminate classic NMS-style post-processing by using IoU-aware calibration. IoU-aware calibration is a conditional Beta calibration; this makes it parallelizable with no hyper-parameters. Instead of arbitrary cutoffs or discounts, it implicitly accounts for the likelihood of each detection being a duplicate and adjusts the confidence score accordingly, resulting in empirically based precision estimates for each detection. Our extensive experiments on diverse detection architectures show that the proposed IoU-aware calibration can successfully model duplicate detections and improve calibration. Compared to the standard sequential NMS and calibration approach, our joint modeling can deliver performance gains over the best NMS-based alternative while producing consistently better-calibrated confidence predictions with less complexity. The code for all our experiments is publicly available at https://github.com/Blueblue4/IoU-AwareCalibration.
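    A toy sketch of the underlying idea follows: each detection's confidence is re-calibrated with a Beta-calibration style logistic model whose inputs include the maximum IoU with higher-scoring boxes of the same class, so likely duplicates receive lower calibrated scores. The feature set and the use of scikit-learn's LogisticRegression are assumptions; the paper's conditional Beta calibration is fitted differently.

```python
# Toy sketch: Beta-calibration style features plus an IoU-to-higher-scoring-boxes
# feature, fitted as a logistic model on validation detections.
import numpy as np
from sklearn.linear_model import LogisticRegression

def calib_features(scores: np.ndarray, max_iou_to_higher: np.ndarray) -> np.ndarray:
    p = np.clip(scores, 1e-6, 1 - 1e-6)
    return np.stack([np.log(p), -np.log(1 - p), max_iou_to_higher], axis=1)

def fit_calibrator(val_scores, val_max_iou, val_is_true_positive):
    # y = 1 iff the validation detection is a true, non-duplicate match
    X = calib_features(val_scores, val_max_iou)
    return LogisticRegression().fit(X, val_is_true_positive)

def calibrated_scores(calibrator, scores, max_iou_to_higher) -> np.ndarray:
    return calibrator.predict_proba(calib_features(scores, max_iou_to_higher))[:, 1]
```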

FArMARe: a Furniture-Aware Multi-task methodology for Recommending Apartments based on the user interests

  • paper_url: http://arxiv.org/abs/2309.03100
  • repo_url: https://github.com/aliabdari/farmare
  • paper_authors: Ali Abdari, Alex Falcon, Giuseppe Serra
  • for: Recommending apartments based on a textual query expressing the user's interests, reducing the time-consuming process of searching for new accommodation.
  • methods: FArMARe, a multi-task approach that supports cross-modal contrastive training with a furniture-aware objective; because public indoor-scene datasets lack detailed furniture descriptions, a dataset of more than 6000 apartments was collected and annotated.
  • results: Thorough experimentation with three different methods and two raw feature extraction procedures demonstrates the effectiveness of FArMARe on the proposed text-to-apartment recommendation problem.
    Abstract Nowadays, many people frequently have to search for new accommodation options. Searching for a suitable apartment is a time-consuming process, especially because visiting them is often mandatory to assess the truthfulness of the advertisements found on the Web. While this process could be alleviated by visiting the apartments in the metaverse, the Web-based recommendation platforms are not suitable for the task. To address this shortcoming, in this paper, we define a new problem called text-to-apartment recommendation, which requires ranking the apartments based on their relevance to a textual query expressing the user's interests. To tackle this problem, we introduce FArMARe, a multi-task approach that supports cross-modal contrastive training with a furniture-aware objective. Since public datasets related to indoor scenes do not contain detailed descriptions of the furniture, we collect and annotate a dataset comprising more than 6000 apartments. A thorough experimentation with three different methods and two raw feature extraction procedures reveals the effectiveness of FArMARe in dealing with the problem at hand.

Character Queries: A Transformer-based Approach to On-Line Handwritten Character Segmentation

  • paper_url: http://arxiv.org/abs/2309.03072
  • repo_url: https://github.com/jungomi/character-queries
  • paper_authors: Michael Jungo, Beat Wolf, Andrii Maksai, Claudiu Musat, Andreas Fischer
  • for: Improving on-line handwritten character segmentation when the transcription is known beforehand, in which case segmentation becomes an assignment problem between sampling points of the stylus trajectory and characters in the text.
  • methods: Inspired by the k-means clustering algorithm, the problem is viewed as cluster assignment and solved with a Transformer-based architecture in which each cluster is formed based on a learned character query in the Transformer decoder block.
  • results: Character segmentation ground truths are created for two popular on-line handwriting datasets, IAM-OnDB and HANDS-VNOnDB; evaluating multiple methods on them shows that the proposed approach achieves the overall best results.
    Abstract On-line handwritten character segmentation is often associated with handwriting recognition and even though recognition models include mechanisms to locate relevant positions during the recognition process, it is typically insufficient to produce a precise segmentation. Decoupling the segmentation from the recognition unlocks the potential to further utilize the result of the recognition. We specifically focus on the scenario where the transcription is known beforehand, in which case the character segmentation becomes an assignment problem between sampling points of the stylus trajectory and characters in the text. Inspired by the $k$-means clustering algorithm, we view it from the perspective of cluster assignment and present a Transformer-based architecture where each cluster is formed based on a learned character query in the Transformer decoder block. In order to assess the quality of our approach, we create character segmentation ground truths for two popular on-line handwriting datasets, IAM-OnDB and HANDS-VNOnDB, and evaluate multiple methods on them, demonstrating that our approach achieves the overall best results.

Prompt-based All-in-One Image Restoration using CNNs and Transformer

  • paper_url: http://arxiv.org/abs/2309.03063
  • repo_url: None
  • paper_authors: Hu Gao, Jing Yang, Ning Wang, Jingfan Yang, Ying Zhang, Depeng Dang
  • for: Recovering high-quality images from degraded observations; most existing methods target a single type of degradation and therefore fall short in real-world scenarios.
  • methods: A prompt-based approach in which an encoder captures features and prompts carrying degradation-specific information guide the decoder to adaptively restore images affected by various degradations; CNN operations and Transformers are combined to model local invariant properties and non-local information, with multi-head rearranged attention with prompts, a simple-gate feed-forward network, and a feature fusion mechanism that exploits multi-scale information.
  • results: The resulting architecture, CAPTNet, handles multiple degradation tasks with a single model and performs competitively with task-specific algorithms.
    Abstract Image restoration aims to recover high-quality images from their degraded observations. Since most existing methods have been dedicated to single degradation removal, they may not yield optimal results on other types of degradations, which limits their use in real-world scenarios. In this paper, we propose a novel data ingredient-oriented approach that leverages prompt-based learning to enable a single model to efficiently tackle multiple image degradation tasks. Specifically, we utilize an encoder to capture features and introduce prompts with degradation-specific information to guide the decoder in adaptively recovering images affected by various degradations. In order to model the local invariant properties and non-local information for high-quality image restoration, we combined CNN operations and Transformers. Simultaneously, we made several key designs in the Transformer blocks (multi-head rearranged attention with prompts and a simple-gate feed-forward network) to reduce computational requirements and selectively determine what information should be preserved to facilitate efficient recovery of potentially sharp images. Furthermore, we incorporate a feature fusion mechanism that further explores the multi-scale information to improve the aggregated features. Despite being designed to handle different types of degradations, the resulting tightly interlinked hierarchical architecture, named CAPTNet, performs competitively with task-specific algorithms, as extensive experiments demonstrate.

Adaptive Growth: Real-time CNN Layer Expansion

  • paper_url: http://arxiv.org/abs/2309.03049
  • repo_url: https://github.com/yunjiezhu/extensible-convolutional-layer-git-version
  • paper_authors: Yunjie Zhu, Yunhao Chen
  • for: Improving the adaptability and efficiency of deep learning models in ever-changing environments.
  • methods: The convolutional layer of a CNN dynamically evolves based on the data input while remaining seamlessly integrated into existing DNNs: kernels are iteratively added to the layer, with growth guided by evaluating the layer's real-time capacity to discern image features.
  • results: The unsupervised method outperforms its supervised counterparts on MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 and shows enhanced adaptability in transfer learning scenarios.
    Abstract Deep Neural Networks (DNNs) have shown unparalleled achievements in numerous applications, reflecting their proficiency in managing vast data sets. Yet, their static structure limits their adaptability in ever-changing environments. This research presents a new algorithm that allows the convolutional layer of a Convolutional Neural Network (CNN) to dynamically evolve based on data input, while still being seamlessly integrated into existing DNNs. Instead of a rigid architecture, our approach iteratively introduces kernels to the convolutional layer, gauging its real-time response to varying data. This process is refined by evaluating the layer's capacity to discern image features, guiding its growth. Remarkably, our unsupervised method has outstripped its supervised counterparts across diverse datasets like MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100. It also showcases enhanced adaptability in transfer learning scenarios. By introducing a data-driven model scalability strategy, we are filling a void in deep learning, leading to more flexible and efficient DNNs suited for dynamic settings. Code:(https://github.com/YunjieZhu/Extensible-Convolutional-Layer-git-version).
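    The expansion mechanics can be sketched as below: new kernels (output channels) are appended to an existing Conv2d while preserving the already-learned ones; the criterion that decides when to grow (the layer's capacity to discern features in the paper) is left abstract here.

```python
# Minimal sketch of run-time convolutional-layer growth: append new output-channel
# kernels while keeping the existing, already-trained ones.
import torch
import torch.nn as nn

def grow_conv_layer(conv: nn.Conv2d, extra_out_channels: int) -> nn.Conv2d:
    new = nn.Conv2d(conv.in_channels, conv.out_channels + extra_out_channels,
                    conv.kernel_size, conv.stride, conv.padding,
                    bias=conv.bias is not None)
    with torch.no_grad():
        new.weight[: conv.out_channels] = conv.weight  # keep existing kernels
        if conv.bias is not None:
            new.bias[: conv.out_channels] = conv.bias
    return new

conv = nn.Conv2d(3, 16, 3, padding=1)
conv = grow_conv_layer(conv, extra_out_channels=4)  # layer now has 20 kernels
# note: the next layer's input channels must grow correspondingly (not shown)
```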

Exploring Semantic Consistency in Unpaired Image Translation to Generate Data for Surgical Applications

  • paper_url: http://arxiv.org/abs/2309.03048
  • repo_url: None
  • paper_authors: Danush Kumar Venkatesh, Dominik Rivoir, Micha Pfeiffer, Fiona Kolbinger, Marius Distler, Jürgen Weitz, Stefanie Speidel
  • for: Investigating unpaired image-to-image translation techniques for generating large annotated datasets with high semantic consistency for surgical computer vision applications.
  • methods: Several state-of-the-art image translation models are evaluated empirically, with an explicit focus on semantic consistency; a simple combination of a structural-similarity loss and contrastive learning is found to work best.
  • results: On two challenging surgical datasets and downstream semantic segmentation tasks, the data generated with this approach shows higher semantic consistency and can be used more effectively as training data.
    Abstract In surgical computer vision applications, obtaining labeled training data is challenging due to data-privacy concerns and the need for expert annotation. Unpaired image-to-image translation techniques have been explored to automatically generate large annotated datasets by translating synthetic images to the realistic domain. However, preserving the structure and semantic consistency between the input and translated images presents significant challenges, mainly when there is a distributional mismatch in the semantic characteristics of the domains. This study empirically investigates unpaired image translation methods for generating suitable data in surgical applications, explicitly focusing on semantic consistency. We extensively evaluate various state-of-the-art image translation models on two challenging surgical datasets and downstream semantic segmentation tasks. We find that a simple combination of structural-similarity loss and contrastive learning yields the most promising results. Quantitatively, we show that the data generated with this approach yields higher semantic consistency and can be used more effectively as training data.
    摘要 在手术计算机视觉应用中,由于数据隐私问题和对专家标注的依赖,获得标注训练数据十分困难。无配对图像到图像翻译技术已被用于将合成图像翻译到真实域,以自动生成大量标注数据。然而,保持输入图像与翻译图像之间的结构和语义一致性存在显著挑战,尤其是当两个域的语义特征分布不匹配时。本研究以实验方式探究了无配对图像翻译方法在手术应用中生成合适数据的能力,并专门关注语义一致性。我们在两个具有挑战性的手术数据集和下游语义分割任务上广泛评估了多种最先进的图像翻译模型,发现结构相似性损失与对比学习的简单组合可以获得最理想的结果。量化结果表明,用这种方法生成的数据具有更高的语义一致性,能够更有效地用作训练数据。
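
The finding that a simple combination of a structural-similarity loss and contrastive learning best preserves semantics can be illustrated with a sketch of such a combined objective. The simplified single-scale SSIM term, the CUT-style PatchNCE stand-in, and the loss weights are illustrative assumptions rather than the paper's exact configuration; the adversarial term of the underlying translation GAN is omitted.

```python
import torch
import torch.nn.functional as F

def ssim_loss(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified single-scale SSIM loss (1 - SSIM) over 8x8 local windows."""
    mu_x, mu_y = F.avg_pool2d(x, 8), F.avg_pool2d(y, 8)
    var_x = F.avg_pool2d(x * x, 8) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 8) - mu_y ** 2
    cov = F.avg_pool2d(x * y, 8) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim.mean()

def patch_nce_loss(feat_src, feat_fake, temperature=0.07):
    """CUT-style patchwise contrastive loss: matching patch locations are positives."""
    feat_src = F.normalize(feat_src, dim=1)    # (N_patches, C)
    feat_fake = F.normalize(feat_fake, dim=1)
    logits = feat_fake @ feat_src.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

def translation_loss(real_src, fake_tgt, feat_src, feat_fake,
                     lambda_ssim=1.0, lambda_nce=1.0):
    """Combined objective: structure preservation + patchwise contrastive alignment."""
    return (lambda_ssim * ssim_loss(real_src, fake_tgt)
            + lambda_nce * patch_nce_loss(feat_src, feat_fake))
```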

MCM: Multi-condition Motion Synthesis Framework for Multi-scenario

  • paper_url: http://arxiv.org/abs/2309.03031
  • repo_url: None
  • paper_authors: Zeyu Ling, Bo Han, Yongkang Wong, Mohan Kankanhalli, Weidong Geng
  • for: 本研究的目的是解决多个条件人体动作生成任务中的多个条件输入问题,包括文本、音乐、语音等多种形式的输入。
  • methods: 本研究提出了一种新的多Condition Motion Synthesis(MCM)模型,该模型可以同时处理多个条件输入,并且可以与DDPM-like扩散模型结合使用,以保持生成能力。MCM模型包括两个分支结构,主分支和控制分支,两者具有相同的结构,并且初始化控制分支的参数与主分支的参数相同,以确保生成能力的维持。
  • results: 研究表明,MCM模型在文本到动作和音乐到舞蹈等多个任务中均达到了顶峰性能,与专门为这些任务设计的方法相当。此外,MCM模型还能够有效地实现多个条件Modal控制,实现“一次训练,动作需要”的目标。
    Abstract The objective of the multi-condition human motion synthesis task is to incorporate diverse conditional inputs, encompassing various forms like text, music, speech, and more. This endows the task with the capability to adapt across multiple scenarios, ranging from text-to-motion and music-to-dance, among others. While existing research has primarily focused on single conditions, the multi-condition human motion generation remains underexplored. In this paper, we address these challenges by introducing MCM, a novel paradigm for motion synthesis that spans multiple scenarios under diverse conditions. The MCM framework is able to integrate with any DDPM-like diffusion model to accommodate multi-conditional information input while preserving its generative capabilities. Specifically, MCM employs two-branch architecture consisting of a main branch and a control branch. The control branch shares the same structure as the main branch and is initialized with the parameters of the main branch, effectively maintaining the generation ability of the main branch and supporting multi-condition input. We also introduce a Transformer-based diffusion model MWNet (DDPM-like) as our main branch that can capture the spatial complexity and inter-joint correlations in motion sequences through a channel-dimension self-attention module. Quantitative comparisons demonstrate that our approach achieves SoTA results in both text-to-motion and competitive results in music-to-dance tasks, comparable to task-specific methods. Furthermore, the qualitative evaluation shows that MCM not only streamlines the adaptation of methodologies originally designed for text-to-motion tasks to domains like music-to-dance and speech-to-gesture, eliminating the need for extensive network re-configurations but also enables effective multi-condition modal control, realizing "once trained is motion need".
    摘要 多条件人体动作生成任务的目标是融合多种形式的条件输入,如文本、音乐、语音等,使任务能够适应从文本到动作、音乐到舞蹈等多种场景。现有研究主要集中在单一条件下,多条件人体动作生成仍未得到充分探索。在这篇论文中,我们针对这些挑战提出了MCM,一种可在多种条件下跨多个场景进行动作生成的新范式。MCM框架可以与任何DDPM-like扩散模型结合,在接收多条件输入的同时保持其生成能力。具体来说,MCM采用双分支架构,即主分支和控制分支;控制分支与主分支结构相同,并以主分支的参数进行初始化,从而在保持生成能力的同时支持多条件输入。我们还提出了一种基于Transformer的扩散模型MWNet作为主分支,它通过通道维度的自注意力模块捕捉动作序列中的空间复杂性和关节间相关性。量化比较表明,我们的方法在文本到动作任务上达到了最先进水平,在音乐到舞蹈任务上也取得了与专用方法相当的结果。此外,定性评估表明,MCM不仅可以将原本为文本到动作任务设计的方法便捷地迁移到音乐到舞蹈、语音到手势等领域,无需大规模的网络重构,还能实现有效的多条件模态控制,达到"一次训练,即可满足多种动作生成需求"的目标。
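
The two-branch design, where a control branch copies the structure and initial parameters of the main denoising branch so that generative ability is preserved while extra conditions are injected, can be sketched roughly as follows. All class names, the zero-initialized projection, and the dummy stand-in for MWNet are assumptions made for illustration, not the authors' code.

```python
import copy
import torch
import torch.nn as nn

class DummyBranch(nn.Module):
    """Stand-in for a denoising branch (e.g. MWNet): maps a noisy motion sequence
    plus a condition embedding to a denoised estimate."""
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.net = nn.Linear(hidden_dim, hidden_dim)
    def forward(self, x_t, t, cond):
        return self.net(x_t + cond)

class MCMDenoiser(nn.Module):
    """Sketch of a two-branch diffusion denoiser: main branch + control branch."""
    def __init__(self, main_branch: nn.Module):
        super().__init__()
        self.main = main_branch
        # The control branch shares the main branch's architecture and starts from
        # the same parameters, so generation quality is preserved at initialization.
        self.control = copy.deepcopy(main_branch)
        # Zero-initialized projection so the control signal has no effect at step 0.
        self.proj = nn.Linear(main_branch.hidden_dim, main_branch.hidden_dim)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x_t, t, text_cond, extra_cond=None):
        h = self.main(x_t, t, text_cond)
        if extra_cond is not None:            # e.g. music or speech features
            h_ctrl = self.control(x_t, t, extra_cond)
            h = h + self.proj(h_ctrl)         # inject multi-condition control
        return h

model = MCMDenoiser(DummyBranch())
x = torch.randn(2, 16, 64)                    # (batch, frames, hidden_dim)
cond_text, cond_music = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
out = model(x, t=None, text_cond=cond_text, extra_cond=cond_music)
```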

SEAL: A Framework for Systematic Evaluation of Real-World Super-Resolution

  • paper_url: http://arxiv.org/abs/2309.03020
  • repo_url: https://github.com/xpixelgroup/seal
  • paper_authors: Wenlong Zhang, Xiaohui Li, Xiangyu Chen, Yu Qiao, Xiao-Ming Wu, Chao Dong
  • for: 这种研究旨在提供一个系统性的评估平台,以便对实际世界图像的超分辨率方法进行全面的评估。
  • methods: 该研究提出了一种新的评估框架,即 SEAL,它可以快速和系统地评估实际世界图像的超分辨率方法。
  • results: 该研究对现有的真实世界超分辨率方法进行了系统评估,提出了一个新的强基线(baseline),以及两个新的评估指标,即接受率(Acceptance Rate, AR)和相对性能比(Relative Performance Ratio, RPR),以便更全面地评估真实世界图像超分辨率方法(两个指标的简化示例见下文)。
    Abstract Real-world Super-Resolution (real-SR) methods focus on dealing with diverse real-world images and have attracted increasing attention in recent years. The key idea is to use a complex and high-order degradation model to mimic real-world degradations. Although they have achieved impressive results in various scenarios, they are faced with the obstacle of evaluation. Currently, these methods are only assessed by their average performance on a small set of degradation cases randomly selected from a large space, which fails to provide a comprehensive understanding of their overall performance and often yields biased results. To overcome the limitation in evaluation, we propose SEAL, a framework for systematic evaluation of real-SR. In particular, we cluster the extensive degradation space to create a set of representative degradation cases, which serves as a comprehensive test set. Next, we propose a coarse-to-fine evaluation protocol to measure the distributed and relative performance of real-SR methods on the test set. The protocol incorporates two new metrics: acceptance rate (AR) and relative performance ratio (RPR), derived from an acceptance line and an excellence line. Under SEAL, we benchmark existing real-SR methods, obtain new observations and insights into their performance, and develop a new strong baseline. We consider SEAL as the first step towards creating an unbiased and comprehensive evaluation platform, which can promote the development of real-SR.
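
One plausible reading of the two SEAL metrics, with acceptance rate (AR) as the fraction of representative degradation cases that clear an acceptance line and relative performance ratio (RPR) as performance rescaled between the acceptance and excellence lines, is sketched below. The exact definitions in the paper may differ; the function names and toy numbers are assumptions.

```python
import numpy as np

def acceptance_rate(scores, acceptance_line):
    """Fraction of representative degradation cases on which a real-SR model
    scores at least as well as the acceptance line (a per-case threshold)."""
    scores = np.asarray(scores, dtype=float)
    acceptance_line = np.asarray(acceptance_line, dtype=float)
    return float(np.mean(scores >= acceptance_line))

def relative_performance_ratio(scores, acceptance_line, excellence_line):
    """Per-case score rescaled between the acceptance line (0) and the
    excellence line (1), averaged over the clustered test cases."""
    scores = np.asarray(scores, dtype=float)
    lo = np.asarray(acceptance_line, dtype=float)
    hi = np.asarray(excellence_line, dtype=float)
    rpr = (scores - lo) / np.maximum(hi - lo, 1e-8)
    return float(np.mean(rpr))

# Toy example: IQA scores of one method on 5 representative degradation cases.
scores = [0.62, 0.55, 0.71, 0.48, 0.66]
acc_line = [0.50, 0.50, 0.60, 0.55, 0.60]   # baseline "acceptable" scores
exc_line = [0.75, 0.70, 0.80, 0.72, 0.78]   # strong "excellent" scores
print(acceptance_rate(scores, acc_line))                        # 0.8
print(relative_performance_ratio(scores, acc_line, exc_line))
```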

Sparse 3D Reconstruction via Object-Centric Ray Sampling

  • paper_url: http://arxiv.org/abs/2309.03008
  • repo_url: None
  • paper_authors: Llukman Cerkezi, Paolo Favaro
  • for: 3D object reconstruction from a sparse set of views captured from a 360-degree calibrated camera rig
  • methods: hybrid model using both MLP-based neural representation and triangle mesh, object-centric sampling scheme of the neural representation, and differentiable renderer
  • results: state of the art 3D reconstructions, does not require additional supervision of segmentation masks, works with sparse views on several datasets (Google’s Scanned Objects, Tank and Temples, and MVMC Car)
    Abstract We propose a novel method for 3D object reconstruction from a sparse set of views captured from a 360-degree calibrated camera rig. We represent the object surface through a hybrid model that uses both an MLP-based neural representation and a triangle mesh. A key contribution in our work is a novel object-centric sampling scheme of the neural representation, where rays are shared among all views. This efficiently concentrates and reduces the number of samples used to update the neural model at each iteration. This sampling scheme relies on the mesh representation to ensure also that samples are well-distributed along its normals. The rendering is then performed efficiently by a differentiable renderer. We demonstrate that this sampling scheme results in a more effective training of the neural representation, does not require the additional supervision of segmentation masks, yields state of the art 3D reconstructions, and works with sparse views on the Google's Scanned Objects, Tank and Temples and MVMC Car datasets.
    摘要 我们提出了一种新的3D物体重建方法,输入为360度标定相机环拍摄得到的稀疏视图集。我们用混合模型表示物体表面:基于MLP的神经表示与三角形网格相结合。我们的一个关键贡献是一种新的以物体为中心的神经表示采样方案,其中光线由所有视图共享,从而在每次迭代中高效地集中并减少用于更新神经模型的采样数量。该采样方案依赖网格表示,以确保采样点沿其法线方向分布良好。渲染则由可微渲染器高效完成。实验表明,这种采样方案能够更有效地训练神经表示,无需额外的分割掩码监督,实现了最先进的3D重建效果,并在Google Scanned Objects、Tank and Temples以及MVMC Car数据集的稀疏视图设置下均可工作。

Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning

  • paper_url: http://arxiv.org/abs/2309.02999
  • repo_url: https://github.com/ch3cook-fdu/vote2cap-detr
  • paper_authors: Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, Tao Chen
  • for: 本研究旨在提出一种简单 yet effective的 transformer框架,以便在3D scene中进行详细描述。
  • methods: 本研究采用解耦的并行解码与迭代空间精炼(iterative spatial refinement)策略,以提高 caption 生成和对象定位的准确性。
  • results: 对 ScanRefer 和 Nr3D 两个常用数据集进行了广泛的实验,结果表明 Vote2Cap-DETR 和 Vote2Cap-DETR++ 超过了传统的 “detect-then-describe” 方法,并且可以快速地生成详细的 caption。
    Abstract 3D dense captioning requires a model to translate its understanding of an input 3D scene into several captions associated with different object regions. Existing methods adopt a sophisticated "detect-then-describe" pipeline, which builds explicit relation modules upon a 3D detector with numerous hand-crafted components. While these methods have achieved initial success, the cascade pipeline tends to accumulate errors because of duplicated and inaccurate box estimations and messy 3D scenes. In this paper, we first propose Vote2Cap-DETR, a simple-yet-effective transformer framework that decouples the decoding process of caption generation and object localization through parallel decoding. Moreover, we argue that object localization and description generation require different levels of scene understanding, which could be challenging for a shared set of queries to capture. To this end, we propose an advanced version, Vote2Cap-DETR++, which decouples the queries into localization and caption queries to capture task-specific features. Additionally, we introduce the iterative spatial refinement strategy to vote queries for faster convergence and better localization performance. We also insert additional spatial information to the caption head for more accurate descriptions. Without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate Vote2Cap-DETR and Vote2Cap-DETR++ surpass conventional "detect-then-describe" methods by a large margin. Codes will be made available at https://github.com/ch3cook-fdu/Vote2Cap-DETR.
    摘要 3D dense captioning需要模型将对输入3D场景的理解转化为与不同对象区域相关联的多条caption。现有方法采用复杂的"检测然后描述"管道,在3D检测器之上构建显式的关系模块,并依赖大量手工设计的组件。虽然这些方法取得了初步成功,但级联管道容易因重复且不准确的检测框估计以及杂乱的3D场景而累积误差。在这篇论文中,我们首先提出了Vote2Cap-DETR,一个简单而有效的transformer框架,通过并行解码将描述生成与对象定位的解码过程解耦。此外,我们认为对象定位和描述生成需要不同层次的场景理解,这对一组共享查询而言可能难以同时捕捉。为此,我们提出了进阶版本Vote2Cap-DETR++,将查询分解为定位查询和描述查询,以捕捉任务特有的特征。我们还引入了迭代空间精炼策略来优化查询,使其收敛更快、定位性能更好,并在描述头中加入额外的空间信息以获得更准确的描述。无需任何花哨的技巧,我们在两个常用数据集ScanRefer和Nr3D上进行了广泛实验,结果表明Vote2Cap-DETR和Vote2Cap-DETR++大幅超越了传统的"检测然后描述"方法。代码将在https://github.com/ch3cook-fdu/Vote2Cap-DETR上提供。

Continual Evidential Deep Learning for Out-of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2309.02995
  • repo_url: None
  • paper_authors: Eduardo Aguilar, Bogdan Raducanu, Petia Radeva, Joost Van de Weijer
  • for: 这个研究旨在同时实现增量对象分类与分布外(OOD)数据检测,并利用证据深度学习方法完成OOD检测。
  • methods: 本研究提出了一个称为CEDL的方法,它将证据深度学习方法整合进持续学习框架,以便同时进行增量分类和OOD检测(证据不确定度的计算示意见下文)。
  • results: 实验结果显示,CEDL方法在CIFAR-100数据集的5任务与10任务设置下,对象分类结果与基线相当,而在OOD检测方面则在AUROC、AUPR与FPR95三项指标上大幅超越多种事后(post-hoc)方法。
    Abstract Uncertainty-based deep learning models have attracted a great deal of interest for their ability to provide accurate and reliable predictions. Evidential deep learning stands out achieving remarkable performance in detecting out-of-distribution (OOD) data with a single deterministic neural network. Motivated by this fact, in this paper we propose the integration of an evidential deep learning method into a continual learning framework in order to perform simultaneously incremental object classification and OOD detection. Moreover, we analyze the ability of vacuity and dissonance to differentiate between in-distribution data belonging to old classes and OOD data. The proposed method, called CEDL, is evaluated on CIFAR-100 considering two settings consisting of 5 and 10 tasks, respectively. From the obtained results, we could appreciate that the proposed method, in addition to provide comparable results in object classification with respect to the baseline, largely outperforms OOD detection compared to several posthoc methods on three evaluation metrics: AUROC, AUPR and FPR95.
    摘要 基于不确定性的深度学习模型因能够提供准确且可靠的预测而受到广泛关注,其中证据深度学习仅用单个确定性神经网络即可在分布外(OOD)数据检测上取得出色表现。受此启发,本文提出将证据深度学习方法整合进持续学习框架,以同时进行增量对象分类和OOD检测。此外,我们还分析了真空度(vacuity)和矛盾度(dissonance)区分属于旧类的分布内数据与OOD数据的能力。所提出的方法称为CEDL,在CIFAR-100上以5任务和10任务两种设置进行了评估。结果显示,我们的方法在对象分类上与基线结果相当,同时在OOD检测的AUROC、AUPR和FPR95三项指标上大幅超越多种事后方法。
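
Vacuity and dissonance, the two uncertainty measures analyzed above, are commonly computed from the Dirichlet evidence in evidential deep learning as sketched below. The softplus evidence activation and the subjective-logic dissonance formula are standard choices from that literature and may differ from the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def vacuity_and_dissonance(logits):
    """Uncertainty measures for evidential deep learning predictions.

    Evidence comes from a non-negative activation; alpha = evidence + 1
    parameterises a Dirichlet. Vacuity measures lack of evidence (high for OOD),
    dissonance measures conflicting evidence spread across classes.
    """
    evidence = F.softplus(logits)                 # (B, K), non-negative evidence
    alpha = evidence + 1.0
    strength = alpha.sum(dim=1, keepdim=True)     # Dirichlet strength S
    K = logits.size(1)
    belief = evidence / strength                  # b_k = e_k / S
    vacuity = K / strength.squeeze(1)             # u = K / S

    # Dissonance (subjective logic): belief-weighted balance of pairwise beliefs.
    b_i = belief.unsqueeze(2)                     # (B, K, 1)
    b_j = belief.unsqueeze(1)                     # (B, 1, K)
    bal = 1.0 - (b_i - b_j).abs() / (b_i + b_j).clamp_min(1e-8)
    bal = bal.masked_fill(torch.eye(K, device=logits.device).bool(), 0.0)
    other = belief.sum(dim=1, keepdim=True) - belief          # sum_{j != k} b_j
    diss = (belief * (bal * b_j).sum(dim=2) / other.clamp_min(1e-8)).sum(dim=1)
    return vacuity, diss

logits = torch.randn(4, 100)    # e.g. CIFAR-100 classifier outputs
u, d = vacuity_and_dissonance(logits)
```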

FishMOT: A Simple and Effective Method for Fish Tracking Based on IoU Matching

  • paper_url: http://arxiv.org/abs/2309.02975
  • repo_url: https://github.com/gakkistar/fishmot
  • paper_authors: Shuo Liu, Lulu Han, Xiaoyang Liu, Junli Ren, Fang Wang, Yuanshan Lin
  • for: 这篇论文旨在提供一个高精度、可靠的鱼类追踪方法,以应对鱼类行为和生态学研究中的追踪挑战。
  • methods: 本论文提出了一个称为FishMOT(鱼类多对象追踪)的新型鱼类追踪方法,它结合了物体检测和IoU匹配,包括基本模组、互动模组和重寻模组。其中,基本模组依据相邻帧检测框之间的IoU进行目标关联,以应对鱼类的形态变化(关联步骤的简化示意见下文);互动模组结合检测框IoU与鱼体IoU,以应对鱼类之间的遮挡;重寻模组则利用时空信息,克服检测器在复杂环境中漏检所导致的追踪失败。
  • results: 实验结果显示,FishMOT比前一代多对象追踪器和特化的鱼类追踪工具在MOTA、精度、计算时间、内存consumption等方面表现更好,并且具有优秀的一致性和通用性。
    Abstract Fish tracking plays a vital role in understanding fish behavior and ecology. However, existing tracking methods face challenges in accuracy and robustness dues to morphological change of fish, occlusion and complex environment. This paper proposes FishMOT(Multiple Object Tracking for Fish), a novel fish tracking approach combining object detection and IoU matching, including basic module, interaction module and refind module. Wherein, a basic module performs target association based on IoU of detection boxes between successive frames to deal with morphological change of fish; an interaction module combines IoU of detection boxes and IoU of fish entity to handle occlusions; a refind module use spatio-temporal information uses spatio-temporal information to overcome the tracking failure resulting from the missed detection by the detector under complex environment. FishMOT reduces the computational complexity and memory consumption since it does not require complex feature extraction or identity assignment per fish, and does not need Kalman filter to predict the detection boxes of successive frame. Experimental results demonstrate FishMOT outperforms state-of-the-art multi-object trackers and specialized fish tracking tools in terms of MOTA, accuracy, computation time, memory consumption, etc.. Furthermore, the method exhibits excellent robustness and generalizability for varying environments and fish numbers. The simplified workflow and strong performance make FishMOT as a highly effective fish tracking approach. The source codes and pre-trained models are available at: https://github.com/gakkistar/FishMOT
    摘要 鱼类跟踪对理解鱼类行为和生态具有重要作用。然而,由于鱼类形态变化、遮挡和复杂环境,现有的跟踪方法在准确性和鲁棒性方面面临挑战。本文提出了FishMOT(面向鱼类的多目标跟踪),一种结合对象检测与IoU匹配的新型鱼类跟踪方法,包括基本模块、交互模块和重寻模块。其中,基本模块通过相邻帧检测框之间的IoU进行目标关联,以处理鱼类的形态变化;交互模块将检测框IoU与鱼体IoU相结合以处理遮挡;重寻模块利用时空信息,克服检测器在复杂环境中漏检导致的跟踪失败。FishMOT不需要复杂的特征提取或逐条鱼的身份分配,也不需要卡尔曼滤波器来预测下一帧的检测框,因此降低了计算复杂度和内存占用。实验结果表明,FishMOT在MOTA、准确率、计算时间、内存占用等方面均优于最先进的多目标跟踪器和专用鱼类跟踪工具,并且在不同环境和鱼群数量下表现出出色的鲁棒性和泛化能力。简化的工作流程和强大的性能使FishMOT成为一种非常有效的鱼类跟踪方法。源代码和预训练模型可在以下链接获取:https://github.com/gakkistar/FishMOT
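
The basic module's association step, which matches detections between successive frames purely by the IoU of their boxes, can be sketched with a simple greedy matcher. Only the basic module is illustrated; the interaction and refind modules are not reproduced, and the threshold and boxes below are assumptions.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def associate(prev_boxes, curr_boxes, iou_thresh=0.3):
    """Greedy association of detections between successive frames by IoU only.
    Returns a list of (prev_idx, curr_idx) matches."""
    pairs = [(iou(p, c), i, j)
             for i, p in enumerate(prev_boxes)
             for j, c in enumerate(curr_boxes)]
    pairs.sort(reverse=True)                      # best overlaps first
    used_p, used_c, matches = set(), set(), []
    for score, i, j in pairs:
        if score < iou_thresh:
            break
        if i not in used_p and j not in used_c:
            matches.append((i, j))
            used_p.add(i); used_c.add(j)
    return matches

prev = [(10, 10, 50, 50), (100, 100, 140, 150)]
curr = [(12, 11, 52, 49), (98, 105, 138, 152)]
print(associate(prev, curr))   # [(0, 0), (1, 1)]
```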

Dynamic Hyperbolic Attention Network for Fine Hand-object Reconstruction

  • paper_url: http://arxiv.org/abs/2309.02965
  • repo_url: None
  • paper_authors: Zhiying Leng, Shun-Cheng Wu, Mahdi Saleh, Antonio Montanaro, Hao Yu, Yin Wang, Nassir Navab, Xiaohui Liang, Federico Tombari
  • for: 本研究旨在提出一种基于双曲空间(hyperbolic space)的精确手-物体重建方法,以便更好地学习手与物体的特征。
  • methods: 该方法利用双曲空间的内在几何特性,将网格特征与图像特征投影到统一的双曲空间中学习多模态表征,并提出了名为动态双曲注意力网络(DHANet)的模型,包含动态双曲图卷积和图像注意力双曲图卷积两个模块(双曲距离的计算示意见下文)。
  • results: 对于三个公共数据集,该方法比大多数现有方法表现更好,提供了一个可行的手object reconstruction方案。
    Abstract Reconstructing both objects and hands in 3D from a single RGB image is complex. Existing methods rely on manually defined hand-object constraints in Euclidean space, leading to suboptimal feature learning. Compared with Euclidean space, hyperbolic space better preserves the geometric properties of meshes thanks to its exponentially-growing space distance, which amplifies the differences between the features based on similarity. In this work, we propose the first precise hand-object reconstruction method in hyperbolic space, namely Dynamic Hyperbolic Attention Network (DHANet), which leverages intrinsic properties of hyperbolic space to learn representative features. Our method that projects mesh and image features into a unified hyperbolic space includes two modules, ie. dynamic hyperbolic graph convolution and image-attention hyperbolic graph convolution. With these two modules, our method learns mesh features with rich geometry-image multi-modal information and models better hand-object interaction. Our method provides a promising alternative for fine hand-object reconstruction in hyperbolic space. Extensive experiments on three public datasets demonstrate that our method outperforms most state-of-the-art methods.
    摘要 从单张RGB图像同时重建3D物体和手是一个复杂的问题。现有方法依赖在欧氏空间中人工定义的手-物体约束,导致特征学习不够优化。与欧氏空间相比,双曲空间的空间距离呈指数增长,能更好地保持网格的几何性质,并放大基于相似度的特征差异。在这项工作中,我们提出了首个在双曲空间中进行精确手-物体重建的方法,即动态双曲注意力网络(DHANet),它利用双曲空间的内在属性来学习具有代表性的特征。我们的方法将网格和图像特征投影到统一的双曲空间中,包含动态双曲图卷积和图像注意力双曲图卷积两个模块。通过这两个模块,我们的方法学习到富含几何-图像多模态信息的网格特征,并更好地建模手-物体交互。该方法为双曲空间中的精细手-物体重建提供了有前景的替代方案。我们在三个公共数据集上进行了广泛实验,结果表明我们的方法优于大多数最先进方法。
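
The property DHANet relies on, namely that distances in hyperbolic space grow rapidly toward the boundary and thus amplify differences between embeddings, can be illustrated with the standard Poincare-ball operations below. This is generic hyperbolic machinery rather than the paper's modules; the curvature and feature shapes are placeholder assumptions.

```python
import torch

def exp_map_zero(v, c=1.0, eps=1e-8):
    """Exponential map at the origin of the Poincare ball with curvature -c:
    projects Euclidean features v into hyperbolic space."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_distance(x, y, c=1.0, eps=1e-8):
    """Geodesic distance on the Poincare ball; it grows quickly as points move
    toward the boundary, which amplifies differences between embeddings."""
    sqrt_c = c ** 0.5
    diff2 = (x - y).pow(2).sum(dim=-1)
    denom = (1 - c * x.pow(2).sum(dim=-1)) * (1 - c * y.pow(2).sum(dim=-1))
    arg = 1 + 2 * c * diff2 / denom.clamp_min(eps)
    return torch.acosh(arg.clamp_min(1.0 + eps)) / sqrt_c

# Project image and mesh features into the ball, then compare them hyperbolically.
img_feat = exp_map_zero(torch.randn(4, 128) * 0.1)
mesh_feat = exp_map_zero(torch.randn(4, 128) * 0.1)
d = poincare_distance(img_feat, mesh_feat)
```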

Hierarchical-level rain image generative model based on GAN

  • paper_url: http://arxiv.org/abs/2309.02964
  • repo_url: None
  • paper_authors: Zhenyuan Liu, Tong Jia, Xingyu Xing, Jianfeng Wu, Junyi Chen
  • for: 提高自动驾驶车Visual perception系统的性能下限问题(SOTIF),通过生成不同雨强度的图像数据来测试雨天下Visual perception算法的性能。
  • methods: 构建了基于生成对抗网络(GAN)的层次化雨图生成模型RCCycleGAN,可以生成小雨、中雨和大雨等不同强度的雨图;借鉴Conditional GAN(CGAN)的方式,将不同雨强度作为条件标签引入。同时,对模型结构进行优化,并调整训练策略以缓解模式坍塌(mode collapse)问题。
  • results: 与基线模型CycleGAN和DerainCycleGAN相比,RCCycleGAN在测试集上的峰值信噪比(PSNR)分别提高2.58 dB和0.74 dB,结构相似性(SSIM)分别提高18%和8%。此外还进行了消融实验,以验证模型调整的有效性。
    Abstract Autonomous vehicles are exposed to various weather during operation, which is likely to trigger the performance limitations of the perception system, leading to the safety of the intended functionality (SOTIF) problems. To efficiently generate data for testing the performance of visual perception algorithms under various weather conditions, a hierarchical-level rain image generative model, rain conditional CycleGAN (RCCycleGAN), is constructed. RCCycleGAN is based on the generative adversarial network (GAN) and can generate images of light, medium, and heavy rain. Different rain intensities are introduced as labels in conditional GAN (CGAN). Meanwhile, the model structure is optimized and the training strategy is adjusted to alleviate the problem of mode collapse. In addition, natural rain images of different intensities are collected and processed for model training and validation. Compared with the two baseline models, CycleGAN and DerainCycleGAN, the peak signal-to-noise ratio (PSNR) of RCCycleGAN on the test dataset is improved by 2.58 dB and 0.74 dB, and the structural similarity (SSIM) is improved by 18% and 8%, respectively. The ablation experiments are also carried out to validate the effectiveness of the model tuning.
    摘要 自动驾驶车辆在运行过程中会遭遇各种天气条件,这可能触发感知系统的性能限制,进而引发预期功能安全(SOTIF)问题。为了高效地生成用于测试视觉感知算法在不同天气条件下性能的数据,我们构建了一个层次化的雨图生成模型,即雨条件CycleGAN(RCCycleGAN)。RCCycleGAN基于生成对抗网络(GAN),可以生成小雨、中雨和大雨等不同强度的雨图,雨强度作为条件GAN(CGAN)中的标签引入。同时,我们优化了模型结构并调整了训练策略,以缓解模式坍塌问题。此外,我们还收集并处理了不同强度的真实雨图,用于模型训练和验证。与基线模型CycleGAN和DerainCycleGAN相比,RCCycleGAN在测试集上的PSNR分别提高2.58 dB和0.74 dB,SSIM分别提高18%和8%。我们还进行了消融实验,以验证模型调整的有效性。

Indoor Localization Using Radio, Vision and Audio Sensors: Real-Life Data Validation and Discussion

  • paper_url: http://arxiv.org/abs/2309.02961
  • repo_url: None
  • paper_authors: Ilayda Yaman, Guoda Tian, Erik Tegler, Patrik Persson, Nikhil Challa, Fredrik Tufvesson, Ove Edfors, Kalle Astrom, Steffen Malkowsky, Liang Liu
  • for: 本研究探讨了使用Radio、视觉和声音感知器在同一环境中进行indoor定位方法。
  • methods: 本研究使用了现代算法和实际数据进行评估,包括使用巨量MIMO技术的机器学习算法 дляRadio定位、使用RGB-D摄像头的ORB-SLAM3算法 для视觉定位、以及使用麦克风数组的SFS2算法 для声音定位。
  • results: 本研究发现了不同感知器的定位精度、可靠性、准备需求和可能的系统复杂性等方面的优劣点,并提供了一个基础和引导 для进一步发展高精度多感知定位系统,如感知融合和上下文和环境意识适应。
    Abstract This paper investigates indoor localization methods using radio, vision, and audio sensors, respectively, in the same environment. The evaluation is based on state-of-the-art algorithms and uses a real-life dataset. More specifically, we evaluate a machine learning algorithm for radio-based localization with massive MIMO technology, an ORB-SLAM3 algorithm for vision-based localization with an RGB-D camera, and an SFS2 algorithm for audio-based localization with microphone arrays. Aspects including localization accuracy, reliability, calibration requirements, and potential system complexity are discussed to analyze the advantages and limitations of using different sensors for indoor localization tasks. The results can serve as a guideline and basis for further development of robust and high-precision multi-sensory localization systems, e.g., through sensor fusion and context and environment-aware adaptation.

A Non-Invasive Interpretable NAFLD Diagnostic Method Combining TCM Tongue Features

  • paper_url: http://arxiv.org/abs/2309.02959
  • repo_url: https://github.com/cshan-github/selectornet
  • paper_authors: Shan Cao, Qunsheng Ruan, Qingfeng Wu
  • for: 这个研究旨在提出一种非侵入性、可解释的非酒精性脂肪肝(NAFLD)诊断方法,用户仅需提供性别、年龄、身高、体重、腰围和臀围等指标以及舌头图像。
  • methods: 该方法将病人的生理指标与舌象特征相融合,然后输入名为SelectorNet的融合网络;SelectorNet结合了注意力机制与特征选择机制,能够自主学习选择重要特征。
  • results: 实验结果显示,提出的方法可以使用非侵入性数据 achieve an accuracy of 77.22%,并且提供了吸引人的解释矩阵。
    Abstract Non-alcoholic fatty liver disease (NAFLD) is a clinicopathological syndrome characterized by hepatic steatosis resulting from the exclusion of alcohol and other identifiable liver-damaging factors. It has emerged as a leading cause of chronic liver disease worldwide. Currently, the conventional methods for NAFLD detection are expensive and not suitable for users to perform daily diagnostics. To address this issue, this study proposes a non-invasive and interpretable NAFLD diagnostic method, the required user-provided indicators are only Gender, Age, Height, Weight, Waist Circumference, Hip Circumference, and tongue image. This method involves merging patients' physiological indicators with tongue features, which are then input into a fusion network named SelectorNet. SelectorNet combines attention mechanisms with feature selection mechanisms, enabling it to autonomously learn the ability to select important features. The experimental results show that the proposed method achieves an accuracy of 77.22\% using only non-invasive data, and it also provides compelling interpretability matrices. This study contributes to the early diagnosis of NAFLD and the intelligent advancement of TCM tongue diagnosis. The project in this paper is available at: https://github.com/cshan-github/SelectorNet.

Robust Visual Tracking by Motion Analyzing

  • paper_url: http://arxiv.org/abs/2309.03247
  • repo_url: https://github.com/XJLeoYu/Robust-Visual-Tracking-by-Motion-Analyzing
  • paper_authors: Mohammed Leo, Kurban Ubul, ShengJie Cheng, Michael Ma
  • for: 这篇论文旨在提出一种新的视频对象分割算法,以提高视频对象跟踪(VOT)的精度和效率。
  • methods: 该算法利用Tucker2张量分解得到的张量结构来描述目标的运动模式,并将该信息整合进分割模块(分解步骤的简化示意见下文)。
  • results: 该算法在四个 benchmark(LaSOT\cite{fan2019lasot}, AVisT\cite{noman2022avist}, OTB100\cite{7001050}, GOT-10k\cite{huang2019got}) 上达到了SOTA的结果,并且具有实时运行的能力。
    Abstract In recent years, Video Object Segmentation (VOS) has emerged as a complementary method to Video Object Tracking (VOT). VOS focuses on classifying all the pixels around the target, allowing for precise shape labeling, while VOT primarily focuses on the approximate region where the target might be. However, traditional segmentation modules usually classify pixels frame by frame, disregarding information between adjacent frames. In this paper, we propose a new algorithm that addresses this limitation by analyzing the motion pattern using the inherent tensor structure. The tensor structure, obtained through Tucker2 tensor decomposition, proves to be effective in describing the target's motion. By incorporating this information, we achieved competitive results on four benchmarks, LaSOT\cite{fan2019lasot}, AVisT\cite{noman2022avist}, OTB100\cite{7001050}, and GOT-10k\cite{huang2019got}, with SOTA performance. Furthermore, the proposed tracker is capable of real-time operation, adding value to its practical application.
    摘要 近年来,视频对象分割(VOS)作为视频对象跟踪(VOT)的补充方法而出现。VOS专注于对目标周围的所有像素进行分类,以获得精确的形状标注,而VOT主要关注目标可能所在的大致区域。然而,传统的分割模块通常逐帧对像素进行分类,忽略了相邻帧之间的信息。在这篇论文中,我们提出了一种新的算法,通过利用内在的张量结构分析运动模式来解决这一限制。该张量结构通过Tucker2张量分解获得,被证明能有效地描述目标的运动。通过引入这一信息,我们在四个基准LaSOT\cite{fan2019lasot}、AVisT\cite{noman2022avist}、OTB100\cite{7001050}和GOT-10k\cite{huang2019got}上取得了与SOTA相当的竞争性成绩。此外,我们的跟踪器具备实时运行的能力,进一步提升了其实际应用价值。
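
A Tucker2 decomposition keeps one mode of a 3-way tensor intact (here, time) and learns factor matrices for the other two, giving a compact description of a motion pattern. The HOSVD-style sketch below is generic; the toy response-map tensor and ranks are assumptions, and how the actual motion tensor is built from tracker outputs in the paper is not reproduced.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding of a 3-way tensor."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def tucker2(X, rank1, rank2):
    """Tucker2 decomposition of a 3-way tensor X (I x J x T): factor matrices are
    estimated for the first two modes only, the temporal mode is left intact.
    Returns (core, U1, U2) with core of shape (rank1, rank2, T)."""
    U1 = np.linalg.svd(unfold(X, 0), full_matrices=False)[0][:, :rank1]
    U2 = np.linalg.svd(unfold(X, 1), full_matrices=False)[0][:, :rank2]
    core = np.einsum('ijt,ir,js->rst', X, U1, U2)
    return core, U1, U2

# Toy motion tensor: 32x32 response maps stacked over 20 frames.
X = np.random.randn(32, 32, 20)
core, U1, U2 = tucker2(X, rank1=4, rank2=4)
X_hat = np.einsum('rst,ir,js->ijt', core, U1, U2)   # low-rank motion pattern
print(core.shape, np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```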

M3D-NCA: Robust 3D Segmentation with Built-in Quality Control

  • paper_url: http://arxiv.org/abs/2309.02954
  • repo_url: None
  • paper_authors: John Kalkhof, Anirban Mukhopadhyay
  • for: 这篇论文的目的是提出一种基于神经细胞自动机(NCA)的三维医疗影像分割方法,以提升资源受限的医疗机构和冲突地区中的医疗影像分割效能。
  • methods: 这篇论文提出的M3D-NCA方法将NCA分割应用于三维医疗影像,通过n级分块(patchification)来实现;此外,还利用M3D-NCA的方差构建了一个质量指标,可以自动检测NCA分割过程中的错误。
  • results: 结果显示,M3D-NCA在海马体和前列腺分割任务中,以2%的Dice值优势超过了规模大两个数量级的UNet模型,并且可以在Raspberry Pi 4 Model B(2GB RAM)上运行。这表明M3D-NCA有望成为资源受限环境下有效且高效的医疗影像分割替代方案。
    Abstract Medical image segmentation relies heavily on large-scale deep learning models, such as UNet-based architectures. However, the real-world utility of such models is limited by their high computational requirements, which makes them impractical for resource-constrained environments such as primary care facilities and conflict zones. Furthermore, shifts in the imaging domain can render these models ineffective and even compromise patient safety if such errors go undetected. To address these challenges, we propose M3D-NCA, a novel methodology that leverages Neural Cellular Automata (NCA) segmentation for 3D medical images using n-level patchification. Moreover, we exploit the variance in M3D-NCA to develop a novel quality metric which can automatically detect errors in the segmentation process of NCAs. M3D-NCA outperforms the two magnitudes larger UNet models in hippocampus and prostate segmentation by 2% Dice and can be run on a Raspberry Pi 4 Model B (2GB RAM). This highlights the potential of M3D-NCA as an effective and efficient alternative for medical image segmentation in resource-constrained environments.
    摘要 医疗图像分割严重依赖大规模深度学习模型,如基于UNet的架构。然而,这些模型在实际应用中受限于高计算需求,难以部署在初级医疗机构和冲突地区等资源有限的环境中。此外,成像域的偏移可能使这些模型失效,若此类错误未被发现,甚至会危及患者安全。为应对这些挑战,我们提出了M3D-NCA,一种利用神经细胞自动机(NCA)并采用n级分块对三维医疗图像进行分割的新方法。此外,我们利用M3D-NCA的方差开发了一种新的质量指标,可以自动检测NCA分割过程中的错误。M3D-NCA在海马体和前列腺分割上以2%的Dice优势超过了规模大两个数量级的UNet模型,并且可以在Raspberry Pi 4 Model B(2GB RAM)上运行。这突显了M3D-NCA作为资源受限环境下医疗图像分割的有效且高效替代方案的潜力。

Patched Line Segment Learning for Vector Road Mapping

  • paper_url: http://arxiv.org/abs/2309.02923
  • repo_url: None
  • paper_authors: Jiakun Xu, Bowen Xu, Gui-Song Xia, Liang Dong, Nan Xue
  • for: 本研究提出了一种新的方法,用于从卫星遥感图像中计算向量道路地图。
  • methods: 该方法使用线段来表示路径,不仅捕捉了路径的位置,还捕捉了路径的方向,使其成为一种强大的表示方式。
  • results: 在我们的实验中,我们发现使用有效的路径表示可以大幅提高向量道路映射的性能,而不需要对神经网络架构进行大量修改。此外,我们的方法只需6个GPU小时的训练,相比之下,已有的方法需要32倍的训练时间。
    Abstract This paper presents a novel approach to computing vector road maps from satellite remotely sensed images, building upon a well-defined Patched Line Segment (PaLiS) representation for road graphs that holds geometric significance. Unlike prevailing methods that derive road vector representations from satellite images using binary masks or keypoints, our method employs line segments. These segments not only convey road locations but also capture their orientations, making them a robust choice for representation. More precisely, given an input image, we divide it into non-overlapping patches and predict a suitable line segment within each patch. This strategy enables us to capture spatial and structural cues from these patch-based line segments, simplifying the process of constructing the road network graph without the necessity of additional neural networks for connectivity. In our experiments, we demonstrate how an effective representation of a road graph significantly enhances the performance of vector road mapping on established benchmarks, without requiring extensive modifications to the neural network architecture. Furthermore, our method achieves state-of-the-art performance with just 6 GPU hours of training, leading to a substantial 32-fold reduction in training costs in terms of GPU hours.
    摘要 Specifically, we divide the input image into non-overlapping patches and predict a suitable line segment within each patch. This allows us to capture spatial and structural cues from these patch-based line segments, simplifying the process of constructing the road network graph without the need for additional neural networks for connectivity.In our experiments, we show how our effective representation of the road graph significantly enhances the performance of vector road mapping on established benchmarks, without requiring extensive modifications to the neural network architecture. Additionally, our method achieves state-of-the-art performance with just 6 GPU hours of training, leading to a substantial 32-fold reduction in training costs in terms of GPU hours.

Towards Efficient Training with Negative Samples in Visual Tracking

  • paper_url: http://arxiv.org/abs/2309.02903
  • repo_url: None
  • paper_authors: Qingmao Wei, Bi Zeng, Guotian Zeng
  • for: 降低现代视觉对象跟踪方法中的计算资源和训练数据量,以避免过拟合。
  • methods: 该研究提出了一种更加高效的训练策略,即负样本联合学习(Joint learning with Negative samples, JN):从训练伊始就将负样本与正样本混合,防止模型简单地记忆目标,并迫使其利用模板来定位目标。此外,我们采用分布型预测头,用距离分布来表达负样本存在时目标位置的不确定性,从而有效地处理混合样本训练。
  • results: 我们的模型JN-256在具有挑战性的基准上分别达到了GOT-10k 75.8% AO和TrackingNet 84.1% AUC,超过了使用更大模型和更高输入分辨率的既有SOTA跟踪器;而JN-256训练所用的数据采样量仅为这些工作的一半。
    Abstract Current state-of-the-art (SOTA) methods in visual object tracking often require extensive computational resources and vast amounts of training data, leading to a risk of overfitting. This study introduces a more efficient training strategy to mitigate overfitting and reduce computational requirements. We balance the training process with a mix of negative and positive samples from the outset, named as Joint learning with Negative samples (JN). Negative samples refer to scenarios where the object from the template is not present in the search region, which helps to prevent the model from simply memorizing the target, and instead encourages it to use the template for object location. To handle the negative samples effectively, we adopt a distribution-based head, which modeling the bounding box as distribution of distances to express uncertainty about the target's location in the presence of negative samples, offering an efficient way to manage the mixed sample training. Furthermore, our approach introduces a target-indicating token. It encapsulates the target's precise location within the template image. This method provides exact boundary details with negligible computational cost but improving performance. Our model, JN-256, exhibits superior performance on challenging benchmarks, achieving 75.8% AO on GOT-10k and 84.1% AUC on TrackingNet. Notably, JN-256 outperforms previous SOTA trackers that utilize larger models and higher input resolutions, even though it is trained with only half the number of data sampled used in those works.
    摘要 现代视觉对象跟踪方法通常需要大量的计算资源和庞大的训练数据,因而存在过拟合风险。本研究提出了一种更加高效的训练策略,以缓解过拟合并降低计算需求。我们从训练伊始就混合使用负样本和正样本,称之为负样本联合学习(JN)。负样本指模板中的目标并未出现在搜索区域中的场景,这有助于避免模型简单地记忆目标,促使其利用模板来定位对象。为了有效处理负样本,我们采用分布型预测头,将边界框建模为距离分布,以表达负样本存在时目标位置的不确定性,从而高效地支持混合样本训练。此外,我们引入了目标指示标记(target-indicating token),它将目标在模板图像中的精确位置编码进来,以几乎可以忽略的计算成本提供精确的边界信息并提升性能。我们的模型JN-256在GOT-10k和TrackingNet等具有挑战性的评测中表现出色,分别达到75.8% AO和84.1% AUC。值得一提的是,尽管JN-256的训练数据采样量仅为先前工作的一半,它仍然超越了使用更大模型和更高输入分辨率的先前SOTA跟踪器。

A Unified Framework for Discovering Discrete Symmetries

  • paper_url: http://arxiv.org/abs/2309.02898
  • repo_url: None
  • paper_authors: Pavan Karjol, Rohan Kashyap, Aditya Gopalan, Prathosh A. P
  • for: 学习一个尊重Symmetry的函数,从多种子群中选择最佳函数。
  • methods: 提出了一种统一的框架,能够在包括局部对称、二面体和循环子群在内的多种子群中发现对称性。其核心是一种由线性函数和张量值函数组成的新架构,能够以符合原则的方式表达对这些子群不变的函数。
  • results: 在图像数字和多项式回归任务中,该方法得到了极好的效果。
    Abstract We consider the problem of learning a function respecting a symmetry from among a class of symmetries. We develop a unified framework that enables symmetry discovery across a broad range of subgroups including locally symmetric, dihedral and cyclic subgroups. At the core of the framework is a novel architecture composed of linear and tensor-valued functions that expresses functions invariant to these subgroups in a principled manner. The structure of the architecture enables us to leverage multi-armed bandit algorithms and gradient descent to efficiently optimize over the linear and the tensor-valued functions, respectively, and to infer the symmetry that is ultimately learnt. We also discuss the necessity of the tensor-valued functions in the architecture. Experiments on image-digit sum and polynomial regression tasks demonstrate the effectiveness of our approach.
    摘要 我们考虑从一类对称性中学习一个遵循某种对称性的函数的问题。我们提出了一个统一的框架,能够在包括局部对称、二面体和循环子群在内的广泛子群中发现对称性。框架的核心是一种由线性函数和张量值函数组成的新架构,用于以符合原则的方式表达对这些子群不变的函数。这种架构使我们能够利用多臂老虎机算法和梯度下降分别高效地优化线性函数和张量值函数,并推断最终学到的对称性。我们还讨论了张量值函数在该架构中的必要性。在图像数字求和与多项式回归任务上的实验证明了我们方法的有效性。

Image Aesthetics Assessment via Learnable Queries

  • paper_url: http://arxiv.org/abs/2309.02861
  • repo_url: None
  • paper_authors: Zhiwei Xiong, Yunfan Zhang, Zhiqi Shen, Peiran Ren, Han Yu
  • for: 这篇论文的目的是提出一种基于可学习查询的图像美学评估方法(IAA-LQ),以提高图像美学评估的效果。
  • methods: 该方法使用可学习查询从冻结的图像编码器所得到的预训练图像特征中提取美学特征。
  • results: 实验结果表明,IAA-LQ方法超越了现有的最先进方法,在SRCC和PLCC指标上分别提升2.2%和2.1%。
    Abstract Image aesthetics assessment (IAA) aims to estimate the aesthetics of images. Depending on the content of an image, diverse criteria need to be selected to assess its aesthetics. Existing works utilize pre-trained vision backbones based on content knowledge to learn image aesthetics. However, training those backbones is time-consuming and suffers from attention dispersion. Inspired by learnable queries in vision-language alignment, we propose the Image Aesthetics Assessment via Learnable Queries (IAA-LQ) approach. It adapts learnable queries to extract aesthetic features from pre-trained image features obtained from a frozen image encoder. Extensive experiments on real-world data demonstrate the advantages of IAA-LQ, beating the best state-of-the-art method by 2.2% and 2.1% in terms of SRCC and PLCC, respectively.
    摘要 图像美学评估(IAA)旨在估计图像的美学水平。根据图像内容的不同,需要选择不同的标准来评估其美学性。现有工作基于内容知识使用预训练的视觉骨干网络来学习图像美学。然而,训练这些骨干网络十分耗时,且存在注意力分散的问题。受视觉-语言对齐中可学习查询的启发,我们提出了基于可学习查询的图像美学评估方法(IAA-LQ),利用可学习查询从冻结图像编码器得到的预训练图像特征中提取美学特征。在真实数据上的大量实验证明了IAA-LQ的优势,其在SRCC和PLCC指标上分别比最佳现有方法高出2.2%和2.1%。

Bandwidth-efficient Inference for Neural Image Compression

  • paper_url: http://arxiv.org/abs/2309.02855
  • repo_url: None
  • paper_authors: Shanzhi Yin, Tongda Xu, Yongsheng Liang, Yuanyuan Wang, Yanghao Li, Yan Wang, Jingjing Liu
  • for: 提升移动端和边缘设备上的神经网络推理效率,缓解由更深的网络和更大的特征图带来的外部存储带宽瓶颈与功耗限制。
  • methods: 提出了一种端到端可微的带宽高效神经推理方法,利用神经数据压缩方法对激活进行压缩。具体而言,我们提出了一个"变换-量化-熵编码"的激活压缩流水线,采用对称指数哥伦布编码以及数据相关的高斯熵模型进行算术编码(指数哥伦布编码的简化示意见下文)。
  • results: 与现有模型量化方法联合优化后,在图像压缩这一底层任务上可实现最高19倍的带宽降低和6.21倍的能耗节省。
    Abstract With neural networks growing deeper and feature maps growing larger, limited communication bandwidth with external memory (or DRAM) and power constraints become a bottleneck in implementing network inference on mobile and edge devices. In this paper, we propose an end-to-end differentiable bandwidth efficient neural inference method with the activation compressed by neural data compression method. Specifically, we propose a transform-quantization-entropy coding pipeline for activation compression with symmetric exponential Golomb coding and a data-dependent Gaussian entropy model for arithmetic coding. Optimized with existing model quantization methods, low-level task of image compression can achieve up to 19x bandwidth reduction with 6.21x energy saving.
    摘要 随着神经网络越来越深、特征图越来越大,与外部存储(DRAM)之间有限的通信带宽和功耗约束成为在移动和边缘设备上实现网络推理的瓶颈。在这篇论文中,我们提出了一种端到端可微的带宽高效神经推理方法,其中激活通过神经数据压缩方法进行压缩。具体来说,我们提出了一个"变换-量化-熵编码"的激活压缩管道,采用对称指数哥伦布编码,并使用数据相关的高斯熵模型进行算术编码。与现有模型量化方法结合优化后,在图像压缩这一底层任务上可实现最高19倍的带宽降低和6.21倍的能耗节省。
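
Order-0 exponential-Golomb coding assigns short codewords to small values, which is why it suits quantized activation residuals. The sketch below uses a plain order-0 code with a zigzag sign mapping as a stand-in for the paper's symmetric variant; the Gaussian entropy model and the arithmetic coder are not reproduced.

```python
def exp_golomb_encode(n: int) -> str:
    """Order-0 exponential-Golomb code for a non-negative integer, as a bit string."""
    assert n >= 0
    binary = bin(n + 1)[2:]              # binary representation of n + 1
    return "0" * (len(binary) - 1) + binary

def exp_golomb_decode(bits: str, pos: int = 0):
    """Decode one order-0 exp-Golomb codeword starting at `pos`; returns (value, new_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    value = int(bits[pos + zeros: pos + 2 * zeros + 1], 2) - 1
    return value, pos + 2 * zeros + 1

def to_unsigned(x: int) -> int:
    """Zigzag map so that small-magnitude (positive or negative) values get short codes."""
    return 2 * x if x >= 0 else -2 * x - 1

# Encode a few quantized activation values.
residuals = [0, -1, 3, 2, -4]
bitstream = "".join(exp_golomb_encode(to_unsigned(r)) for r in residuals)
print(bitstream)   # '1' + '010' + '00111' + '00101' + '0001000'
```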

Knowledge Distillation Layer that Lets the Student Decide

  • paper_url: http://arxiv.org/abs/2309.02843
  • repo_url: https://github.com/adagorgun/letkd-framework
  • paper_authors: Ada Gorgun, Yeti Z. Gurbuz, A. Aydin Alatan
  • for: 本研究旨在改进知识蒸馏(KD)技术,使有限容量的学生模型(student)能够更有效地向强大的教师模型(teacher)学习。
  • methods: 我们为学生模型提出了一种可学习的KD层,将教师的知识显式地嵌入学生的特征变换中。该层具备两种能力:i) 学会如何利用教师的知识,从而抛弃无关的干扰信息;ii) 将迁移来的知识继续向更深层传递。因此,学生模型不仅在训练时,在推理时也能受益于教师的知识(该层所复用的1x1-BN-ReLU-1x1模块的示意见下文)。
  • results: 我们通过对3种常见的分类 benchmark进行严格的实验,证明了我们的方法的效果。
    Abstract Typical technique in knowledge distillation (KD) is regularizing the learning of a limited capacity model (student) by pushing its responses to match a powerful model's (teacher). Albeit useful especially in the penultimate layer and beyond, its action on student's feature transform is rather implicit, limiting its practice in the intermediate layers. To explicitly embed the teacher's knowledge in feature transform, we propose a learnable KD layer for the student which improves KD with two distinct abilities: i) learning how to leverage the teacher's knowledge, enabling to discard nuisance information, and ii) feeding forward the transferred knowledge deeper. Thus, the student enjoys the teacher's knowledge during the inference besides training. Formally, we repurpose 1x1-BN-ReLU-1x1 convolution block to assign a semantic vector to each local region according to the template (supervised by the teacher) that the corresponding region of the student matches. To facilitate template learning in the intermediate layers, we propose a novel form of supervision based on the teacher's decisions. Through rigorous experimentation, we demonstrate the effectiveness of our approach on 3 popular classification benchmarks. Code is available at: https://github.com/adagorgun/letKD-framework
    摘要 知识蒸馏(KD)的典型做法是通过让有限容量模型(学生)的输出与强大模型(教师)的输出相匹配来正则化学生的学习。这种做法在倒数第二层及之后尤其有用,但它对学生特征变换的作用较为隐式,限制了KD在中间层的实践。为了将教师的知识显式地嵌入特征变换,我们为学生提出了一种可学习的KD层,它具有两种能力:i) 学习如何利用教师的知识,从而抛弃干扰信息;ii) 将迁移来的知识继续向更深层传递。因此,学生不仅在训练时,在推理时也能享有教师的知识。具体来说,我们复用1x1-BN-ReLU-1x1卷积模块,依据学生对应区域所匹配的模板(由教师监督)为每个局部区域分配一个语义向量。为了便于在中间层进行模板学习,我们提出了一种基于教师决策的新型监督形式。严格的实验证明了我们的方法在3个常用分类基准上的有效性。代码可在:https://github.com/adagorgun/letKD-framework 获取。
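
The repurposed 1x1-BN-ReLU-1x1 block maps each spatial location of the student feature map to a vector of template scores, which the teacher then supervises. A minimal PyTorch sketch of such a layer is given below; the hidden width, the number of templates, and the softmax readout are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class LearnableKDLayer(nn.Module):
    """Sketch of the repurposed 1x1-BN-ReLU-1x1 block: each local region of the
    student feature map is mapped to a score vector over `num_templates` templates
    (templates being supervised by the teacher during training)."""

    def __init__(self, channels, num_templates, hidden=None):
        super().__init__()
        hidden = hidden or channels
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, num_templates, kernel_size=1),
        )

    def forward(self, student_feat):
        # (B, C, H, W) -> (B, num_templates, H, W): per-location template scores.
        return self.block(student_feat)

layer = LearnableKDLayer(channels=256, num_templates=64)
assign = layer(torch.randn(2, 256, 14, 14))
# During training, `assign` would be supervised with the teacher's decisions for
# the corresponding regions (e.g. a cross-entropy over template indices).
soft_assign = assign.softmax(dim=1)
```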

Adjacency-hopping de Bruijn Sequences for Non-repetitive Coding

  • paper_url: http://arxiv.org/abs/2309.02841
  • repo_url: None
  • paper_authors: Bin Chen, Zhenglin Liang, Shiqian Wu
  • for: 这种论文主要用于描述一种特殊的循环序列,即邻接跳跃de Bruijn序列,以及这种序列在结构化光编码中的应用。
  • methods: 这种序列使用了一种新的邻接跳跃方法,这种方法可以保证邻接代码具有不同性,同时保持 subsequences 的唯一性。
  • results: 该论文 theoretically 证明了邻接跳跃de Bruijn序列的存在,并计算了这种序列的数量。此外,该论文还应用了这种序列在结构化光编码中,并提供了一个具有唯一性和邻接不同性的色带图像编码方案。
    Abstract A special type of cyclic sequences named adjacency-hopping de Bruijn sequences is introduced in this paper. It is theoretically proved the existence of such sequences, and the number of such sequences is derived. These sequences guarantee that all neighboring codes are different while retaining the uniqueness of subsequences, which is a significant characteristic of original de Bruijn sequences in coding and matching. At last, the adjacency-hopping de Bruijn sequences are applied to structured light coding, and a color fringe pattern coded by such a sequence is presented. In summary, the proposed sequences demonstrate significant advantages in structured light coding by virtue of the uniqueness of subsequences and the adjacency-hopping characteristic, and show potential for extension to other fields with similar requirements of non-repetitive coding and efficient matching.
    摘要 本文提出了一类特殊的循环序列,名为邻接跳跃de Bruijn序列。文中从理论上证明了这类序列的存在性,并推导出了其数量。这些序列保证所有相邻的码元互不相同,同时保留了原始de Bruijn序列中子序列唯一性这一在编码与匹配中至关重要的特性。最后,邻接跳跃de Bruijn序列被应用于结构光编码,并给出了由该序列编码的彩色条纹图案。总的来说,所提出的序列凭借子序列唯一性与邻接跳跃特性,在结构光编码中表现出显著优势,并有望推广到其他具有类似非重复编码与高效匹配需求的领域。
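
The two defining properties, that adjacent symbols always differ while every fixed-length window remains unique as in a de Bruijn sequence, can be checked mechanically. The sketch below encodes one plausible reading of these properties on a cyclic sequence; the paper's formal definition may differ in details such as window handling.

```python
def is_adjacency_hopping_debruijn(seq, n):
    """Check the two defining properties on a cyclic sequence:
    (1) every pair of adjacent symbols differs (adjacency-hopping), and
    (2) every length-n window (taken cyclically) is unique, as in a de Bruijn sequence."""
    L = len(seq)
    # Property 1: neighbouring codes differ (cyclically).
    if any(seq[i] == seq[(i + 1) % L] for i in range(L)):
        return False
    # Property 2: uniqueness of all cyclic windows of length n.
    windows = {tuple(seq[(i + j) % L] for j in range(n)) for i in range(L)}
    return len(windows) == L

# A small hand-made example over a 3-letter alphabet with window length 2:
# it visits every ordered pair of *distinct* symbols exactly once.
candidate = [0, 1, 0, 2, 1, 2]
print(is_adjacency_hopping_debruijn(candidate, n=2))   # True
```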

EGIC: Enhanced Low-Bit-Rate Generative Image Compression Guided by Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.03244
  • repo_url: https://github.com/nikolai10/egic
  • paper_authors: Nikolai Körber, Eduard Kromer, Andreas Siebert, Sascha Hauke, Daniel Mueller-Gritschneder
  • for: This paper is written for improving image compression using a generative model.
  • methods: The paper proposes a novel method called EGIC, which uses an implicitly encoded variant of image interpolation to predict the residual between a MSE-optimized and GAN-optimized decoder output.
  • results: The paper shows that EGIC outperforms several baseline methods, including HiFiC, MRIC, and DIRAC, while performing almost on par with VTM-20.0 on the distortion end. Additionally, EGIC is simple to implement, lightweight, and provides excellent interpolation characteristics, making it a promising candidate for practical applications targeting the low bit range.
    Abstract We introduce EGIC, a novel generative image compression method that allows traversing the distortion-perception curve efficiently from a single model. Specifically, we propose an implicitly encoded variant of image interpolation that predicts the residual between a MSE-optimized and GAN-optimized decoder output. On the receiver side, the user can then control the impact of the residual on the GAN-based reconstruction. Together with improved GAN-based building blocks, EGIC outperforms a wide-variety of perception-oriented and distortion-oriented baselines, including HiFiC, MRIC and DIRAC, while performing almost on par with VTM-20.0 on the distortion end. EGIC is simple to implement, very lightweight (e.g. 0.18x model parameters compared to HiFiC) and provides excellent interpolation characteristics, which makes it a promising candidate for practical applications targeting the low bit range.
    摘要 我们介绍EGIC,一种新的生成式图像压缩方法,能够用单一模型高效地遍历失真-感知曲线。具体来说,我们提出了一种隐式编码的图像插值变体,用于预测MSE优化解码器输出与GAN优化解码器输出之间的残差。在接收端,用户可以控制该残差对基于GAN的重建结果的影响。结合改进的GAN基础模块,EGIC超越了包括HiFiC、MRIC和DIRAC在内的多种面向感知和面向失真的基线方法,并且在失真端与VTM-20.0几乎持平。EGIC实现简单、非常轻量(例如模型参数仅为HiFiC的0.18倍),且具有出色的插值特性,使其成为面向低码率实际应用的有力候选方案。

Image-Object-Specific Prompt Learning for Few-Shot Class-Incremental Learning

  • paper_url: http://arxiv.org/abs/2309.02833
  • repo_url: None
  • paper_authors: In-Ug Yoon, Tae-Min Choi, Sun-Kyung Lee, Young-Min Kim, Jong-Hwan Kim
  • for: This paper aims to improve the performance of Few-Shot Class-Incremental Learning (FSCIL) in incremental sessions, where the encoder, trained on the base session, often underperforms.
  • methods: The authors propose a novel training framework that leverages the generalizability of the Contrastive Language-Image Pre-training (CLIP) model to unseen classes, and formulates image-object-specific (IOS) classifiers for input images.
  • results: The proposed framework consistently demonstrates superior performance compared to state-of-the-art methods across three datasets (miniImageNet, CIFAR100, and CUB200), and the authors provide additional experiments to validate the learned model's ability to achieve IOS classifiers.
    Abstract While many FSCIL studies have been undertaken, achieving satisfactory performance, especially during incremental sessions, has remained challenging. One prominent challenge is that the encoder, trained with an ample base session training set, often underperforms in incremental sessions. In this study, we introduce a novel training framework for FSCIL, capitalizing on the generalizability of the Contrastive Language-Image Pre-training (CLIP) model to unseen classes. We achieve this by formulating image-object-specific (IOS) classifiers for the input images. Here, an IOS classifier refers to one that targets specific attributes (like wings or wheels) of class objects rather than the image's background. To create these IOS classifiers, we encode a bias prompt into the classifiers using our specially designed module, which harnesses key-prompt pairs to pinpoint the IOS features of classes in each session. From an FSCIL standpoint, our framework is structured to retain previous knowledge and swiftly adapt to new sessions without forgetting or overfitting. This considers the updatability of modules in each session and some tricks empirically found for fast convergence. Our approach consistently demonstrates superior performance compared to state-of-the-art methods across the miniImageNet, CIFAR100, and CUB200 datasets. Further, we provide additional experiments to validate our learned model's ability to achieve IOS classifiers. We also conduct ablation studies to analyze the impact of each module within the architecture.
    摘要 虽然已有许多FSCIL研究,但要在增量会话中取得令人满意的表现仍然十分困难。一个突出的挑战是,使用充足的基础会话数据训练的编码器,往往在增量会话中表现不佳。在这项研究中,我们提出了一种新的FSCIL训练框架,利用CLIP模型对未见类别的泛化能力,为输入图像构建面向图像对象的(IOS)分类器。IOS分类器指的是针对类对象的特定属性(如翅膀或车轮)而非图像背景的分类器。为构建这些IOS分类器,我们通过特制的模块将偏置提示编码进分类器中,该模块利用键-提示对来定位每个会话中各类别的IOS特征。从FSCIL的角度看,我们的框架旨在保留先前知识,并在不遗忘、不过拟合的前提下快速适应新会话,这涉及各会话中模块的可更新性以及一些经验上有助于快速收敛的技巧。我们的方法在miniImageNet、CIFAR100和CUB200数据集上持续展现出优于最先进方法的表现。此外,我们还提供了额外实验以验证所学模型实现IOS分类器的能力,并进行了消融研究以分析架构中各模块的影响。

3D Trajectory Reconstruction of Drones using a Single Camera

  • paper_url: http://arxiv.org/abs/2309.02801
  • repo_url: None
  • paper_authors: Seobin Hwang, Hanyoung Kim, Chaeyeon Heo, Youkyoung Na, Cheongeun Lee, Yeongjun Cho
  • for: 为防止无人机被非法使用,这项研究提出了一种基于单个摄像头的无人机三维轨迹重建框架。
  • methods: 该方法利用标定的摄像头对无人机进行自动跟踪,并结合无人机的实际长度信息与摄像头参数,通过2D与3D空间的几何关系推断无人机的三维轨迹(深度推断的简化示意见下文)。
  • results: 实验结果表明,所提出的方法能够准确地重建无人机的三维轨迹,并展示了该框架在基于单摄像头的监控系统中的应用潜力。
    Abstract Drones have been widely utilized in various fields, but the number of drones being used illegally and for hazardous purposes has increased recently. To prevent those illegal drones, in this work, we propose a novel framework for reconstructing 3D trajectories of drones using a single camera. By leveraging calibrated cameras, we exploit the relationship between 2D and 3D spaces. We automatically track the drones in 2D images using the drone tracker and estimate their 2D rotations. By combining the estimated 2D drone positions with their actual length information and camera parameters, we geometrically infer the 3D trajectories of the drones. To address the lack of public drone datasets, we also create synthetic 2D and 3D drone datasets. The experimental results show that the proposed methods accurately reconstruct drone trajectories in 3D space, and demonstrate the potential of our framework for single camera-based surveillance systems.
    摘要 无人机已在各个领域得到广泛应用,但近年来被非法及危险使用的无人机数量不断增加。为防范此类非法无人机,本文提出了一种利用单个摄像头重建无人机三维轨迹的新框架。借助标定的摄像头,我们利用2D与3D空间之间的几何关系:先用无人机跟踪器在2D图像中自动跟踪无人机并估计其2D旋转,再结合估计得到的2D位置、无人机的实际长度信息和摄像头参数,以几何方式推断无人机的三维轨迹。针对公开无人机数据集缺乏的问题,我们还构建了合成的2D与3D无人机数据集。实验结果表明,所提方法能够准确重建无人机在三维空间中的轨迹,展示了该框架在基于单摄像头的监控系统中的潜力。
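
The geometric core of the method, recovering depth from the drone's known physical length and its apparent length in pixels under a calibrated pinhole camera, can be sketched as follows. This simplified version ignores the estimated 2D rotation and any foreshortening correction, and the intrinsics and measurements are made-up values.

```python
import numpy as np

def drone_position_3d(bbox_length_px, drone_length_m, center_px, K):
    """Back-project a tracked drone into 3D with a single calibrated camera.

    Under a pinhole model, depth follows from the drone's known physical length
    and its apparent length in pixels: Z ~= f * L_real / l_pixels.
    `center_px` is the tracked 2D centre (u, v); `K` is the 3x3 intrinsic matrix.
    """
    f = K[0, 0]                                   # focal length in pixels
    depth = f * drone_length_m / bbox_length_px
    uv1 = np.array([center_px[0], center_px[1], 1.0])
    ray = np.linalg.inv(K) @ uv1                  # normalized viewing ray
    return depth * ray                            # 3D point in camera coordinates

K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
p = drone_position_3d(bbox_length_px=50.0, drone_length_m=0.35,
                      center_px=(700.0, 300.0), K=K)
print(p)   # roughly 7 m in front of the camera
```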

LightNeuS: Neural Surface Reconstruction in Endoscopy using Illumination Decline

  • paper_url: http://arxiv.org/abs/2309.02777
  • repo_url: None
  • paper_authors: Víctor M. Batlle, José M. M. Montiel, Pascal Fua, Juan D. Tardós
  • for: 该论文旨在提出一种利用照明衰减,从单目内窥镜图像序列进行三维表面重建的方法。
  • methods: 该方法基于两个关键观察:其一,内腔空腔是水密的,这一性质可以通过符号距离函数(SDF)建模来自然地强制满足;其二,场景照明是可变的,光照来自内窥镜自身的光源,且强度随到表面距离的平方反比衰减。为利用这两点,该方法在NeuS(一种能够从多视图学习表面形状与外观的神经隐式表面重建技术)的基础上进行扩展;由于NeuS目前仅适用于静态照明场景,我们修改了其架构以显式建模像素亮度与深度之间的关系,并引入了内窥镜相机与光源的标定光度模型(照明衰减关系的简化示意见下文)。
  • results: 该方法首次实现了对整段结肠的水密重建,并在体模图像上达到了出色的精度。此外,得益于水密先验与照明衰减的结合,该方法能够以可接受的精度补全未被观察到的表面部分,为自动评估癌症筛查检查的质量、衡量被观察黏膜的整体比例奠定了基础。
    Abstract We propose a new approach to 3D reconstruction from sequences of images acquired by monocular endoscopes. It is based on two key insights. First, endoluminal cavities are watertight, a property naturally enforced by modeling them in terms of a signed distance function. Second, the scene illumination is variable. It comes from the endoscope's light sources and decays with the inverse of the squared distance to the surface. To exploit these insights, we build on NeuS, a neural implicit surface reconstruction technique with an outstanding capability to learn appearance and a SDF surface model from multiple views, but currently limited to scenes with static illumination. To remove this limitation and exploit the relation between pixel brightness and depth, we modify the NeuS architecture to explicitly account for it and introduce a calibrated photometric model of the endoscope's camera and light source. Our method is the first one to produce watertight reconstructions of whole colon sections. We demonstrate excellent accuracy on phantom imagery. Remarkably, the watertight prior combined with illumination decline, allows to complete the reconstruction of unseen portions of the surface with acceptable accuracy, paving the way to automatic quality assessment of cancer screening explorations, measuring the global percentage of observed mucosa.
    摘要 The method rests on two key insights: (1) endoluminal cavities are watertight, which can be naturally enforced by modeling them as signed distance functions; and (2) the scene illumination is variable and decays with the inverse of the squared distance to the surface. To exploit these insights, we build on NeuS, a neural implicit surface reconstruction technique that can learn appearance and a SDF surface model from multiple views. However, NeuS is limited to scenes with static illumination. To remove this limitation, we modify the NeuS architecture to explicitly account for the relation between pixel brightness and depth, and introduce a calibrated photometric model of the endoscope's camera and light source. Our method is the first to produce watertight reconstructions of whole colon sections. We demonstrate excellent accuracy on phantom imagery, and remarkably, the watertight prior combined with illumination decline allows us to complete the reconstruction of unseen portions of the surface with acceptable accuracy. This paves the way to automatic quality assessment of cancer screening explorations, and measuring the global percentage of observed mucosa.
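
The illumination-decline insight, a co-located light source whose contribution falls off with the inverse square of the distance to the surface and thus ties pixel brightness to depth, can be written down as a tiny photometric model. This is not the calibrated model of the paper; the Lambertian term, gain, and gamma are placeholder assumptions.

```python
import torch

def endoscope_pixel_intensity(albedo, normal, light_dir, distance, gain=1.0, gamma=2.2):
    """Minimal photometric model relating pixel brightness to surface depth:
    the co-located endoscope light falls off with the inverse square of the
    distance to the surface, modulated by a Lambertian term and a camera response."""
    cos_term = (normal * light_dir).sum(dim=-1).clamp_min(0.0)
    radiance = gain * albedo * cos_term / distance.pow(2)
    return radiance.clamp(0.0, 1.0).pow(1.0 / gamma)    # simple gamma response

# The same relation can be inverted for a coarse depth cue from brightness:
# distance ~ sqrt(gain * albedo * cos / intensity**gamma).
normal = torch.tensor([[0.0, 0.0, -1.0]])
light_dir = torch.tensor([[0.0, 0.0, -1.0]])
I_near = endoscope_pixel_intensity(0.8, normal, light_dir, torch.tensor([20.0]))
I_far = endoscope_pixel_intensity(0.8, normal, light_dir, torch.tensor([60.0]))
print(I_near, I_far)   # the farther point is roughly 9x darker before gamma
```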

Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter

  • paper_url: http://arxiv.org/abs/2309.02773
  • repo_url: None
  • paper_authors: Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, Dong Xu
  • for: 该 paper 主要研究开放 vocabulary Semantic Segmentation 的问题,尝试使用 Conditional Latent Diffusion Model 来解决这个问题。
  • methods: 该 paper 指出,现有方法使用诸如 CLIP 的 pre-trained text-image discriminative models,但其基于 contrastive learning 的对齐过程可能会导致重要的定位信息和物体完整性的丢失,而这些信息对 Semantic Segmentation 的准确性至关重要。
  • results: 该 paper 提出了一种基于 diffusion models 的open-vocabulary Semantic Segmentation 方法,并通过实验证明了这种方法可以 дости得高效的结果。Specifically, the proposed method uses a training-free approach named DiffSegmenter, which utilizes cross-attention maps produced by the denoising U-Net to generate segmentation scores, and further refines and completes the segmentation results with self-attention maps. The proposed method also designs effective textual prompts and a category filtering mechanism to enhance the segmentation results. Extensive experiments on three benchmark datasets show that the proposed DiffSegmenter achieves impressive results for open-vocabulary semantic segmentation.
    Abstract Recent research has explored the utilization of pre-trained text-image discriminative models, such as CLIP, to tackle the challenges associated with open-vocabulary semantic segmentation. However, it is worth noting that the alignment process based on contrastive learning employed by these models may unintentionally result in the loss of crucial localization information and object completeness, which are essential for achieving accurate semantic segmentation. More recently, there has been an emerging interest in extending the application of diffusion models beyond text-to-image generation tasks, particularly in the domain of semantic segmentation. These approaches utilize diffusion models either for generating annotated data or for extracting features to facilitate semantic segmentation. This typically involves training segmentation models by generating a considerable amount of synthetic data or incorporating additional mask annotations. To this end, we uncover the potential of generative text-to-image conditional diffusion models as highly efficient open-vocabulary semantic segmenters, and introduce a novel training-free approach named DiffSegmenter. Specifically, by feeding an input image and candidate classes into an off-the-shelf pre-trained conditional latent diffusion model, the cross-attention maps produced by the denoising U-Net are directly used as segmentation scores, which are further refined and completed by the followed self-attention maps. Additionally, we carefully design effective textual prompts and a category filtering mechanism to further enhance the segmentation results. Extensive experiments on three benchmark datasets show that the proposed DiffSegmenter achieves impressive results for open-vocabulary semantic segmentation.
    摘要 近期研究探讨了利用预训练的文本-图像判别模型(如CLIP)来解决开放词汇语义分割中的挑战。然而,需要注意的是,这些模型所采用的基于对比学习的对齐过程,可能会在无意中导致关键的定位信息和物体完整性的丢失,而这些信息对语义分割的准确性至关重要。近来,扩散模型在语义分割领域的应用也在扩展:这些方法利用扩散模型生成标注数据或提取特征以辅助语义分割,通常需要生成大量合成数据或引入额外的掩码标注来训练分割模型。为此,我们发掘了生成式文本到图像条件扩散模型作为高效开放词汇语义分割器的潜力,并提出了一种名为DiffSegmenter的全新免训练方法。具体来说,我们将输入图像和候选类别输入到一个现成的预训练条件隐空间扩散模型中,直接将去噪U-Net产生的交叉注意力图作为分割分数,再利用后续的自注意力图对其进行细化和补全。此外,我们还精心设计了有效的文本提示和类别过滤机制,以进一步提升分割结果。在三个基准数据集上的大量实验表明,所提出的DiffSegmenter在开放词汇语义分割上取得了令人印象深刻的成绩。

RepSGG: Novel Representations of Entities and Relationships for Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2309.03240
  • repo_url: None
  • paper_authors: Hengyue Liu, Bir Bhanu
  • for: 提高场景图生成(Scene Graph Generation, SGG)的精度和效率,尤其是应对固定尺寸实体表示与长尾分布带来的挑战。
  • methods: 提出了一种名为RepSGG的新架构,将主语(subject)建模为查询、宾语(object)建模为键,并以成对查询与键之间的最大注意力权重表示二者的关系,从而获得更细粒度、更灵活的实体与关系表示。此外,还提出了一种运行时性能引导的logit调整(PGLA)策略,在训练过程中依据运行时性能对关系logits进行仿射变换修正,以促进常见类与稀有类之间更均衡的表现。
  • results: 实验结果显示,RepSGG可以在Visual Genome和Open Images V6 datasets上 achieve the state-of-the-art or comparable performance,同时具有快速的推理速度,证明了提出的方法的有效性和效率。
    Abstract Scene Graph Generation (SGG) has achieved significant progress recently. However, most previous works rely heavily on fixed-size entity representations based on bounding box proposals, anchors, or learnable queries. As each representation's cardinality has different trade-offs between performance and computation overhead, extracting highly representative features efficiently and dynamically is both challenging and crucial for SGG. In this work, a novel architecture called RepSGG is proposed to address the aforementioned challenges, formulating a subject as queries, an object as keys, and their relationship as the maximum attention weight between pairwise queries and keys. With more fine-grained and flexible representation power for entities and relationships, RepSGG learns to sample semantically discriminative and representative points for relationship inference. Moreover, the long-tailed distribution also poses a significant challenge for generalization of SGG. A run-time performance-guided logit adjustment (PGLA) strategy is proposed such that the relationship logits are modified via affine transformations based on run-time performance during training. This strategy encourages a more balanced performance between dominant and rare classes. Experimental results show that RepSGG achieves the state-of-the-art or comparable performance on the Visual Genome and Open Images V6 datasets with fast inference speed, demonstrating the efficacy and efficiency of the proposed methods.
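A toy sketch of the query/key formulation described in the abstract: each entity carries a set of query points (when acting as subject) and key points (when acting as object), and a subject-object relationship is scored by the maximum attention weight over all point pairs. The point count, feature dimension, and the simple linear predicate head are assumptions for illustration, not the RepSGG architecture itself.

```python
import torch
import torch.nn as nn

class PairwiseRelationHead(nn.Module):
    """Each entity is represented by query points (as subject) and key points
    (as object); a subject->object relationship is carried by the strongest
    query/key point pair, i.e. the maximum attention weight."""

    def __init__(self, dim, num_predicates):
        super().__init__()
        self.predicate_head = nn.Linear(2 * dim, num_predicates)

    def forward(self, queries, keys):
        # queries, keys: (N, P, D) -- P representative points per entity.
        n, p, d = queries.shape
        attn = torch.einsum("ipd,jqd->ijpq", queries, keys) / d ** 0.5
        strength, flat_idx = attn.flatten(2).max(dim=-1)          # (N, N)
        qi, kj = flat_idx // p, flat_idx % p                      # winning point indices

        # Gather the winning subject query point and object key point per pair.
        rows = torch.arange(n)
        q_sel = queries[rows.unsqueeze(1).expand(n, n), qi]       # (N, N, D)
        k_sel = keys[rows.unsqueeze(0).expand(n, n), kj]          # (N, N, D)

        pair_feat = torch.cat([q_sel, k_sel], dim=-1)             # (N, N, 2D)
        return self.predicate_head(pair_feat), strength           # logits, strengths


if __name__ == "__main__":
    head = PairwiseRelationHead(dim=32, num_predicates=51)
    queries, keys = torch.randn(4, 8, 32), torch.randn(4, 8, 32)
    logits, strength = head(queries, keys)
    print(logits.shape, strength.shape)   # (4, 4, 51) and (4, 4)
```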

DMKD: Improving Feature-based Knowledge Distillation for Object Detection Via Dual Masking Augmentation

  • paper_url: http://arxiv.org/abs/2309.02719
  • repo_url: None
  • paper_authors: Guang Yang, Yin Tang, Zhijian Wu, Jun Li, Jianhua Xu, Xili Wan
  • For: Improving the performance of object detection via masked knowledge distillation.
  • Methods: A Dual Masked Knowledge Distillation (DMKD) framework that employs a dual attention mechanism and a self-adjustable weighting strategy to capture both spatially important and channel-wise informative clues for comprehensive masked feature reconstruction.
  • Results: The student networks achieve performance gains of 4.1% and 4.3% with the proposed method when RetinaNet and Cascade Mask R-CNN are respectively used as the teacher networks, outperforming other state-of-the-art distillation methods.
    Abstract Recent mainstream masked distillation methods function by reconstructing selectively masked areas of a student network from the feature map of its teacher counterpart. In these methods, the masked regions need to be properly selected, such that reconstructed features encode sufficient discrimination and representation capability like the teacher feature. However, previous masked distillation methods only focus on spatial masking, making the resulting masked areas biased towards spatial importance without encoding informative channel clues. In this study, we devise a Dual Masked Knowledge Distillation (DMKD) framework which can capture both spatially important and channel-wise informative clues for comprehensive masked feature reconstruction. More specifically, we employ a dual attention mechanism for guiding the respective masking branches, leading to reconstructed features that encode dual significance. Furthermore, fusing the reconstructed features is achieved by a self-adjustable weighting strategy for effective feature distillation. Our experiments on the object detection task demonstrate that the student networks achieve performance gains of 4.1% and 4.3% with the help of our method when RetinaNet and Cascade Mask R-CNN are respectively used as the teacher networks, while outperforming other state-of-the-art distillation methods.
    摘要 现有主流的掩蔽蒸馏方法通过从教师网络的特征图中重建学生网络被选择性掩蔽的区域来实现知识蒸馏。在这些方法中,掩蔽区域需要被恰当选择,使重建的特征像教师特征一样具有足够的判别和表示能力。然而,先前的掩蔽蒸馏方法仅关注空间掩蔽,导致掩蔽区域偏向空间重要性,而无法编码有用的通道线索。在本研究中,我们提出了双重掩蔽知识蒸馏(DMKD)框架,能够同时捕捉空间上重要和通道上有信息量的线索,用于全面的掩蔽特征重建。具体来说,我们使用双注意力机制来引导各自的掩蔽分支,使重建特征编码双重重要性。此外,我们采用自适应加权策略来融合重建特征,以实现有效的特征蒸馏。在目标检测任务上的实验表明,当分别以 RetinaNet 和 Cascade Mask R-CNN 作为教师网络时,学生网络借助我们的方法分别获得 4.1% 和 4.3% 的性能提升,并超越其他最新的蒸馏方法。
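
The loss sketch below illustrates the dual-masking idea: a spatial-attention-guided branch and a channel-attention-guided branch each mask the student feature, reconstruct the teacher feature, and the two reconstruction losses are fused with a learnable weighting. The attention definitions, masking ratio, reconstruction convolutions, and the assumption that student and teacher channels are already aligned are all simplifications, not the DMKD implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualMaskedDistillLoss(nn.Module):
    """One branch masks out spatial positions the teacher attends to most, the
    other masks out the most informative channels; the student reconstructs the
    teacher feature from each masked view and the two losses are fused with a
    learnable weighting. Assumes student/teacher channels are already aligned."""

    def __init__(self, channels, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Simple conv blocks standing in for the reconstruction/generation modules.
        self.spatial_rec = nn.Conv2d(channels, channels, 3, padding=1)
        self.channel_rec = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch_logits = nn.Parameter(torch.zeros(2))   # self-adjustable fusion

    def forward(self, student_feat, teacher_feat):
        b, c, h, w = teacher_feat.shape

        # Spatial attention: which positions the teacher finds important.
        spatial_att = teacher_feat.abs().mean(dim=1).flatten(1)          # (B, HW)
        k = int(self.mask_ratio * h * w)
        drop = torch.zeros_like(spatial_att)
        drop.scatter_(1, spatial_att.topk(k, dim=1).indices, 1.0)
        spatial_keep = 1.0 - drop.view(b, 1, h, w)

        # Channel attention: which channels carry informative clues.
        channel_att = teacher_feat.abs().mean(dim=(2, 3))                # (B, C)
        kc = int(self.mask_ratio * c)
        drop_c = torch.zeros_like(channel_att)
        drop_c.scatter_(1, channel_att.topk(kc, dim=1).indices, 1.0)
        channel_keep = 1.0 - drop_c.view(b, c, 1, 1)

        # Reconstruct the full teacher feature from each masked student view.
        loss_s = F.mse_loss(self.spatial_rec(student_feat * spatial_keep), teacher_feat)
        loss_c = F.mse_loss(self.channel_rec(student_feat * channel_keep), teacher_feat)

        # Self-adjustable fusion of the two branch losses.
        w_s, w_c = torch.softmax(self.branch_logits, dim=0)
        return w_s * loss_s + w_c * loss_c


if __name__ == "__main__":
    loss_fn = DualMaskedDistillLoss(channels=64)
    student = torch.randn(2, 64, 32, 32)
    teacher = torch.randn(2, 64, 32, 32)
    print(loss_fn(student, teacher).item())
```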

Gene-induced Multimodal Pre-training for Image-omic Classification

  • paper_url: http://arxiv.org/abs/2309.02702
  • repo_url: None
  • paper_authors: Ting Jin, Xingran Xie, Renjie Wan, Qingli Li, Yan Wang
  • for: Proposes a Gene-induced Multimodal Pre-training (GiMP) framework that jointly incorporates genomics and Whole Slide Images (WSIs) for classification tasks.
  • methods: A group multi-head self-attention gene encoder captures global structured features of gene expression, a masked patch modeling (MPM) paradigm captures the latent pathological characteristics of different tissues, and a triplet learning module learns high-order relevance between the paired modalities.
  • results: Experiments on the TCGA dataset show that GiMP achieves 99.47% accuracy for image-omic classification, surpassing previous methods. Code is available at https://github.com/huangwudiduan/GIMP.
    Abstract Histology analysis of the tumor micro-environment integrated with genomic assays is the gold standard for most cancers in modern medicine. This paper proposes a Gene-induced Multimodal Pre-training (GiMP) framework, which jointly incorporates genomics and Whole Slide Images (WSIs) for classification tasks. Our work aims at dealing with the main challenges of multi-modality image-omic classification w.r.t. (1) the patient-level feature extraction difficulties from gigapixel WSIs and tens of thousands of genes, and (2) effective fusion considering high-order relevance modeling. Concretely, we first propose a group multi-head self-attention gene encoder to capture global structured features in gene expression cohorts. We design a masked patch modeling paradigm (MPM) to capture the latent pathological characteristics of different tissues. The masking strategy randomly masks a fixed-length contiguous subsequence of a WSI's patch embeddings. Finally, we combine the classification tokens of paired modalities and propose a triplet learning module to learn high-order relevance and discriminative patient-level information. After pre-training, simple fine-tuning can be adopted to obtain the classification results. Experimental results on the TCGA dataset show the superiority of our network architectures and our pre-training framework, achieving 99.47% accuracy for image-omic classification. The code is publicly available at https://github.com/huangwudiduan/GIMP.
    摘要 将肿瘤微环境的组织学分析与基因组检测相结合,是现代医学中大多数癌症诊断的金标准。本文提出了一种基因诱导的多模态预训练(GiMP)框架,将基因组学与全切片图像(WSI)结合用于分类任务。我们的工作旨在解决多模态图像-组学分类中的两个主要挑战:(1)从十亿像素级 WSI 和数以万计的基因中提取患者级特征的困难;(2)考虑高阶相关性建模的有效融合。具体来说,我们首先提出了一种分组多头自注意力基因编码器,以捕捉基因表达队列中的全局结构化特征。我们设计了掩蔽图块建模(MPM)范式,以捕捉不同组织的潜在病理特征;掩蔽策略是随机遮盖 WSI 图块嵌入序列中一段固定长度的连续子序列。最后,我们融合成对模态的分类标记,并提出三元组学习模块,以学习高阶相关性和有判别力的患者级信息。预训练后,只需简单微调即可获得分类结果。在 TCGA 数据集上的实验结果表明了我们的网络结构和预训练框架的优越性,图像-组学分类准确率达到 99.47%。代码公开于 https://github.com/huangwudiduan/GIMP 。
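
The masked patch modeling step is simple enough to sketch directly: a fixed-length contiguous span of a slide's patch embeddings is randomly selected and replaced with a mask token. The span length and mask token used here are placeholders; only the contiguous-masking behavior follows the paper's description.

```python
import torch

def mask_contiguous_patches(patch_embeddings, span_length, mask_token):
    """Randomly mask a fixed-length contiguous subsequence of a slide's patch
    embeddings, replacing the masked patches with a (learnable) mask token.

    patch_embeddings: (B, N, D) patch embedding sequences, one per WSI.
    span_length:      number of consecutive patches to mask.
    mask_token:       (D,) embedding substituted at masked positions.
    """
    b, n, d = patch_embeddings.shape
    starts = torch.randint(0, n - span_length + 1, (b,))
    masked = patch_embeddings.clone()
    bool_mask = torch.zeros(b, n, dtype=torch.bool)
    for i, s in enumerate(starts):
        bool_mask[i, s : s + span_length] = True
    masked[bool_mask] = mask_token            # broadcast the token over masked slots
    return masked, bool_mask


if __name__ == "__main__":
    patches = torch.randn(2, 100, 256)        # 2 slides, 100 patches, 256-dim embeddings
    mask_token = torch.zeros(256)             # placeholder for a learnable token
    masked, mask = mask_contiguous_patches(patches, span_length=16, mask_token=mask_token)
    print(mask.sum(dim=1))                    # tensor([16, 16])
```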

Improving Image Classification of Knee Radiographs: An Automated Image Labeling Approach

  • paper_url: http://arxiv.org/abs/2309.02681
  • repo_url: None
  • paper_authors: Jikai Zhang, Carlos Santos, Christine Park, Maciej Mazurowski, Roy Colglazier
  • for: Developing an automated image labeling approach to improve classification models that distinguish normal knee radiographs from those with abnormalities or prior arthroplasty.
  • methods: An automated labeler trained on a small set of manually labeled data is used to pseudo-label a much larger set of unlabeled knee radiographs; the final classifier is trained on both manually labeled and pseudo-labeled data (developed on 7,382 patients and validated on a separate set of 637 patients).
  • results: The final model achieves a higher weighted average AUC (WAUC: 0.903) and higher per-class AUC-ROC values (normal: 0.894; abnormal: 0.896; arthroplasty: 0.990) than the baseline trained only on manually labeled data, with DeLong tests showing statistically significant improvements on normal (p<0.002) and abnormal (p<0.001) images.
    Abstract Large numbers of radiographic images are available in knee radiology practices, which could be used to train deep learning models for the diagnosis of knee abnormalities. However, those images do not typically contain readily available labels due to the limitations of human annotation. The purpose of our study was to develop an automated labeling approach that improves the image classification model to distinguish normal knee images from those with abnormalities or prior arthroplasty. The automated labeler was trained on a small set of labeled data to automatically label a much larger set of unlabeled data, further improving the image classification performance for knee radiographic diagnosis. We developed our approach using 7,382 patients and validated it on a separate set of 637 patients. The final image classification model, trained using both manually labeled and pseudo-labeled data, had a higher weighted average AUC (WAUC: 0.903) and higher AUC-ROC values across all classes (normal AUC-ROC: 0.894; abnormal AUC-ROC: 0.896; arthroplasty AUC-ROC: 0.990) compared to the baseline model (WAUC=0.857; normal AUC-ROC: 0.842; abnormal AUC-ROC: 0.848; arthroplasty AUC-ROC: 0.987), trained using only manually labeled data. DeLong tests show that the improvement is significant on normal (p-value<0.002) and abnormal (p-value<0.001) images. Our findings demonstrated that the proposed automated labeling approach significantly improves the performance of image classification for radiographic knee diagnosis, facilitating patient care and the curation of large knee datasets.
    摘要 膝关节放射科实践中存在大量的X光影像,可用于训练深度学习模型以诊断膝关节异常。然而,由于人工标注的限制,这些影像通常没有现成的标签。本研究的目的是开发一种自动标注方法,以改进区分正常膝关节影像与存在异常或既往关节置换影像的图像分类模型。自动标注器在一小部分已标注数据上训练后,用于自动标注更大规模的未标注数据,从而进一步提升膝关节X光诊断的图像分类性能。我们使用 7,382 名患者的数据开发该方法,并在另外 637 名患者的数据上进行验证。最终的图像分类模型同时使用人工标注和伪标注数据训练,其加权平均 AUC(WAUC:0.903)以及各类别的 AUC-ROC(正常:0.894;异常:0.896;关节置换:0.990)均高于仅使用人工标注数据训练的基线模型(WAUC:0.857;正常:0.842;异常:0.848;关节置换:0.987)。DeLong 检验表明,在正常(p<0.002)和异常(p<0.001)影像上的提升具有统计学显著性。我们的研究结果表明,所提出的自动标注方法能显著提升膝关节X光诊断的图像分类性能,有助于患者诊疗和大规模膝关节数据集的构建。
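
The labeling pipeline follows the familiar pseudo-labeling recipe: fit a labeler on the small manually labeled set, label the large unlabeled set, and retrain the final classifier on the union. The sketch below shows that recipe on toy features with a simple linear model and a confidence threshold; the paper uses deep image classifiers on radiographs, and the threshold-based filtering is an assumption rather than a detail taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label_training(x_labeled, y_labeled, x_unlabeled, threshold=0.9):
    """Fit a labeler on the small labeled set, pseudo-label the large unlabeled
    set, keep confident predictions, and retrain the final classifier on the
    union of manual and pseudo labels."""
    labeler = LogisticRegression(max_iter=1000).fit(x_labeled, y_labeled)

    probs = labeler.predict_proba(x_unlabeled)
    confident = probs.max(axis=1) >= threshold          # keep only confident pseudo-labels
    pseudo_y = labeler.predict(x_unlabeled)[confident]

    x_all = np.concatenate([x_labeled, x_unlabeled[confident]])
    y_all = np.concatenate([y_labeled, pseudo_y])
    return LogisticRegression(max_iter=1000).fit(x_all, y_all)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_l = rng.normal(size=(100, 16))                    # small manually labeled set
    y_l = rng.integers(0, 3, size=100)                  # normal / abnormal / arthroplasty
    x_u = rng.normal(size=(1000, 16))                   # large unlabeled set
    model = pseudo_label_training(x_l, y_l, x_u)
    print(model.predict(x_u[:5]))
```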

Efficient Training for Visual Tracking with Deformable Transformer

  • paper_url: http://arxiv.org/abs/2309.02676
  • repo_url: None
  • paper_authors: Qingmao Wei, Guotian Zeng, Bi Zeng
  • for: Proposes an efficient visual object tracking method suitable for real-world applications.
  • methods: An efficient encoder-decoder structure in which a deformable transformer decoder serves as the target head, reducing GFLOPs; during training, a novel one-to-many label assignment and an auxiliary denoising technique are introduced to make the model converge faster.
  • results: Achieves 72.9% AO on the challenging GOT-10k benchmark using only 20% of the training epochs required by the baseline, while running with lower GFLOPs than all transformer-based trackers.
    Abstract Recent Transformer-based visual tracking models have showcased superior performance. Nevertheless, prior works have been resource-intensive, requiring prolonged GPU training hours and incurring high GFLOPs during inference due to inefficient training methods and convolution-based target heads. This intensive resource use renders them unsuitable for real-world applications. In this paper, we present DETRack, a streamlined end-to-end visual object tracking framework. Our framework utilizes an efficient encoder-decoder structure where the deformable transformer decoder acting as a target head, achieves higher sparsity than traditional convolution heads, resulting in decreased GFLOPs. For training, we introduce a novel one-to-many label assignment and an auxiliary denoising technique, significantly accelerating model's convergence. Comprehensive experiments affirm the effectiveness and efficiency of our proposed method. For instance, DETRack achieves 72.9% AO on challenging GOT-10k benchmarks using only 20% of the training epochs required by the baseline, and runs with lower GFLOPs than all the transformer-based trackers.
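Of the two training accelerators, the one-to-many label assignment is the easier to illustrate: each ground-truth target is matched to its k lowest-cost queries rather than a single one, densifying the supervision signal. The cost function and k below are stand-ins; the paper's exact matching costs are not specified in the abstract.

```python
import torch

def one_to_many_assignment(pred_boxes, pred_scores, gt_boxes, k=4):
    """Assign each ground-truth box to its k lowest-cost queries (all treated as
    positives), instead of a single one-to-one match.

    pred_boxes:  (Q, 4) predicted boxes, (cx, cy, w, h) normalized.
    pred_scores: (Q,) predicted target confidence.
    gt_boxes:    (G, 4) ground-truth boxes in the same format.
    Returns a (Q,) tensor holding the assigned ground-truth index, -1 for negatives.
    """
    # A simple cost: L1 box distance minus a confidence bonus (a stand-in for the
    # classification + box costs used by detection/tracking transformers).
    cost = torch.cdist(pred_boxes, gt_boxes, p=1) - pred_scores.unsqueeze(1)   # (Q, G)

    assignment = torch.full((pred_boxes.shape[0],), -1, dtype=torch.long)
    topk = cost.topk(k, dim=0, largest=False).indices                          # (k, G)
    for g in range(gt_boxes.shape[0]):
        assignment[topk[:, g]] = g
    return assignment


if __name__ == "__main__":
    q, g = 100, 3
    assignment = one_to_many_assignment(torch.rand(q, 4), torch.rand(q), torch.rand(g, 4))
    print((assignment >= 0).sum())   # at most k * g positive queries
```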

Progressive Attention Guidance for Whole Slide Vulvovaginal Candidiasis Screening

  • paper_url: http://arxiv.org/abs/2309.02670
  • repo_url: https://github.com/cjdbehumble/miccai2023-vvc-screening
  • paper_authors: Jiangdong Cai, Honglin Xiong, Maosong Cao, Luyan Liu, Lichi Zhang, Qian Wang
  • for: Whole slide image (WSI) classification for automatic vulvovaginal candidiasis (VVC) screening, addressing the scarcity of labeled data and the distinctive properties of candida.
  • methods: A pre-trained detection model first provides prior guidance to initialize the classification model; a Skip Self-Attention module then refines attention onto fine-grained candida features; finally, contrastive learning alleviates the overfitting caused by the style gap of WSIs and suppresses attention to false-positive regions.
  • results: Experiments show that the framework achieves state-of-the-art performance. Code and example data are available at https://github.com/cjdbehumble/MICCAI2023-VVC-Screening.
    Abstract Vulvovaginal candidiasis (VVC) is the most prevalent human candidal infection, estimated to afflict approximately 75% of all women at least once in their lifetime. It leads to several symptoms including pruritus and vaginal soreness. Automatic whole slide image (WSI) classification is in high demand given the huge burden of disease control and prevention. However, WSI-based computer-aided VVC screening methods are still lacking due to the scarcity of labeled data and the unique properties of candida. Candida in WSIs is challenging for conventional classification models to capture due to its distinctive elongated shape, the small proportion of its spatial distribution, and the style gap across WSIs. To make the model focus on candida more easily, we propose an attention-guided method, which yields a robust diagnostic classification model. Specifically, we first use a pre-trained detection model as prior instruction to initialize the classification model. Then we design a Skip Self-Attention module to refine the attention onto the fine-grained features of candida. Finally, we use a contrastive learning method to alleviate the overfitting caused by the style gap of WSIs and suppress the attention to false positive regions. Our experimental results demonstrate that our framework achieves state-of-the-art performance. Code and example data are available at https://github.com/cjdbehumble/MICCAI2023-VVC-Screening.
    摘要 外阴阴道念珠菌病(VVC)是最常见的人类念珠菌感染,估计约 75% 的女性一生中至少会感染一次,并会引起瘙痒、阴道疼痛等症状。鉴于疾病防控的巨大负担,自动化的全切片图像(WSI)分类需求很高。然而,由于标注数据稀缺以及念珠菌的特殊性质,基于 WSI 的计算机辅助 VVC 筛查方法仍然空缺。念珠菌具有独特的细长形状、在空间分布中所占比例很小,且 WSI 之间存在风格差异,因此传统分类模型难以捕捉。为了让模型更容易聚焦于念珠菌,我们提出了一种注意力引导的方法,以获得稳健的诊断分类模型。具体来说,我们首先使用预训练的检测模型作为先验指导来初始化分类模型,然后设计跳跃自注意力模块,将注意力细化到念珠菌的细粒度特征上。最后,我们使用对比学习方法来缓解 WSI 风格差异导致的过拟合,并抑制对假阳性区域的注意力。实验结果表明,我们的框架达到了最先进的性能。代码和示例数据可在 https://github.com/cjdbehumble/MICCAI2023-VVC-Screening 获取。
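
As a rough illustration of attention-guided refinement, the block below re-weights patch embeddings with a detection-derived prior and passes them through a residual self-attention layer. This is an illustrative stand-in only; the structure of the paper's actual Skip Self-Attention module and its progressive guidance are not reproduced here.

```python
import torch
import torch.nn as nn

class SkipSelfAttentionBlock(nn.Module):
    """Illustrative stand-in: a detection-derived prior re-weights patch tokens,
    and a residual (skip-connected) self-attention layer refines the attention
    onto fine-grained features. Not the paper's actual module definition."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens, detection_prior):
        # patch_tokens:    (B, N, D) patch embeddings from a WSI.
        # detection_prior: (B, N) candida scores from a pre-trained detector,
        #                  acting as prior instruction on where to look.
        weighted = patch_tokens * detection_prior.unsqueeze(-1)
        refined, _ = self.attn(weighted, weighted, weighted)
        # The skip connection keeps the original signal while adding refinement.
        return self.norm(patch_tokens + refined)


if __name__ == "__main__":
    block = SkipSelfAttentionBlock(dim=128)
    tokens, prior = torch.randn(2, 50, 128), torch.rand(2, 50)
    print(block(tokens, prior).shape)   # torch.Size([2, 50, 128])
```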

Fast and Resource-Efficient Object Tracking on Edge Devices: A Measurement Study

  • paper_url: http://arxiv.org/abs/2309.02666
  • repo_url: https://github.com/git-disl/emo
  • paper_authors: Sanjana Vijay Ganesh, Yanzhao Wu, Gaowen Liu, Ramana Kompella, Ling Liu
  • For: This paper focuses on the performance issues and optimization opportunities for multi-object tracking (MOT) on edge devices with heterogeneous computing resources.
  • Methods: The paper proposes several edge-specific performance optimization strategies, called EMO, to speed up real-time object tracking, including window-based optimization and similarity-based optimization.
  • Results: The proposed EMO approach is competitive with respect to representative on-device object tracking techniques in terms of run-time performance and tracking accuracy, as demonstrated through extensive experiments on popular MOT benchmarks.
    Abstract Object tracking is an important functionality of edge video analytic systems and services. Multi-object tracking (MOT) detects the moving objects and tracks their locations frame by frame as real scenes are being captured into a video. However, it is well known that real time object tracking on the edge poses critical technical challenges, especially with edge devices of heterogeneous computing resources. This paper examines the performance issues and edge-specific optimization opportunities for object tracking. We will show that even the well trained and optimized MOT model may still suffer from random frame dropping problems when edge devices have insufficient computation resources. We present several edge specific performance optimization strategies, collectively coined as EMO, to speed up the real time object tracking, ranging from window-based optimization to similarity based optimization. Extensive experiments on popular MOT benchmarks demonstrate that our EMO approach is competitive with respect to the representative methods for on-device object tracking techniques in terms of run-time performance and tracking accuracy. EMO is released on Github at https://github.com/git-disl/EMO.
    摘要 目标跟踪是边缘视频分析系统和服务的重要功能。多目标跟踪(MOT)在视频采集过程中逐帧检测运动目标并跟踪其位置。然而,众所周知,在边缘设备上进行实时目标跟踪面临严峻的技术挑战,尤其是在计算资源异构的边缘设备上。本文考察了目标跟踪的性能问题以及面向边缘的优化机会。我们将展示,即使是训练良好且经过优化的 MOT 模型,在边缘设备计算资源不足时仍可能出现随机丢帧问题。我们提出了多种面向边缘的性能优化策略,统称为 EMO,用于加速实时目标跟踪,涵盖从基于窗口的优化到基于相似度的优化。在流行的 MOT 基准上的大量实验表明,EMO 方法在运行时性能和跟踪精度方面与具有代表性的设备端目标跟踪技术相比具有竞争力。EMO 已在 GitHub 上发布:https://github.com/git-disl/EMO 。
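
The similarity-based optimization can be illustrated with a simple frame-skipping loop: when consecutive frames are nearly identical, the previous tracking result is reused instead of re-running the full detector and tracker. The similarity measure, threshold, and the detect_and_track callable below are illustrative assumptions, not the released EMO code.

```python
import numpy as np

def track_with_frame_skipping(frames, detect_and_track, similarity_threshold=0.95):
    """Reuse the previous tracking result whenever consecutive frames are nearly
    identical, avoiding the expensive detector/tracker on redundant frames.

    frames:            iterable of HxWx3 uint8 frames.
    detect_and_track:  callable running the full MOT pipeline on one frame
                       (a hypothetical stand-in for the real tracker).
    """
    results, prev_frame, prev_result = [], None, None
    for frame in frames:
        if prev_frame is not None and prev_result is not None:
            # Cheap similarity proxy: mean absolute difference of a downsampled
            # grayscale version of the two frames, mapped to [0, 1].
            a = frame[::8, ::8].mean(axis=-1).astype(np.float32)
            b = prev_frame[::8, ::8].mean(axis=-1).astype(np.float32)
            similarity = 1.0 - np.abs(a - b).mean() / 255.0
            if similarity >= similarity_threshold:
                results.append(prev_result)        # reuse, skip the heavy pipeline
                prev_frame = frame
                continue
        prev_result = detect_and_track(frame)      # run the full pipeline
        results.append(prev_result)
        prev_frame = frame
    return results


if __name__ == "__main__":
    frames = [np.full((480, 640, 3), v, dtype=np.uint8) for v in (0, 0, 0, 200, 200, 200)]
    out = track_with_frame_skipping(frames, detect_and_track=lambda f: {"boxes": []})
    print(len(out))   # 6 results, but the pipeline only ran on frames 0 and 3
```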

Multiclass Alignment of Confidence and Certainty for Network Calibration

  • paper_url: http://arxiv.org/abs/2309.02636
  • repo_url: None
  • paper_authors: Vinith Kugathasan, Muhammad Haris Khan
  • For: Improving the accuracy and trustworthiness of model predictions, which is especially important in safety-critical applications.
  • Methods: Proposes a new train-time calibration method built on the gap between a model's predictive mean confidence and its predictive certainty (MACC), so that confident predictions also come with low spread.
  • Results: Extensive experiments on ten challenging datasets show state-of-the-art calibration performance for both in-domain and out-of-domain predictions.
    Abstract Deep neural networks (DNNs) have made great strides in pushing the state-of-the-art in several challenging domains. Recent studies reveal that they are prone to making overconfident predictions. This greatly reduces the overall trust in model predictions, especially in safety-critical applications. Early work in improving model calibration employs post-processing techniques which rely on limited parameters and require a hold-out set. Some recent train-time calibration methods, which involve all model parameters, can outperform the postprocessing methods. To this end, we propose a new train-time calibration method, which features a simple, plug-and-play auxiliary loss known as multi-class alignment of predictive mean confidence and predictive certainty (MACC). It is based on the observation that a model miscalibration is directly related to its predictive certainty, so a higher gap between the mean confidence and certainty amounts to a poor calibration both for in-distribution and out-of-distribution predictions. Armed with this insight, our proposed loss explicitly encourages a confident (or underconfident) model to also provide a low (or high) spread in the presoftmax distribution. Extensive experiments on ten challenging datasets, covering in-domain, out-domain, non-visual recognition and medical image classification scenarios, show that our method achieves state-of-the-art calibration performance for both in-domain and out-domain predictions. Our code and models will be publicly released.
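A rough sketch of the confidence/certainty alignment penalty: per predicted class, the mean confidence should agree with the mean certainty, and the class-averaged absolute gap is added to the usual cross-entropy. The certainty measure used here (one minus normalized entropy) and the class-wise grouping are assumptions; the paper defines certainty via the spread of the presoftmax distribution.

```python
import torch
import torch.nn.functional as F

def macc_auxiliary_loss(logits):
    """Penalize the gap between mean confidence and mean certainty per predicted
    class. Certainty is approximated here as one minus normalized entropy; the
    paper's measure based on the presoftmax spread may differ."""
    probs = logits.softmax(dim=1)
    confidence, pred = probs.max(dim=1)                                   # (B,)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    certainty = 1.0 - entropy / torch.log(torch.tensor(float(logits.shape[1])))

    gaps = []
    for c in pred.unique():
        sel = pred == c
        gaps.append((confidence[sel].mean() - certainty[sel].mean()).abs())
    return torch.stack(gaps).mean()


if __name__ == "__main__":
    logits = torch.randn(32, 10, requires_grad=True)
    targets = torch.randint(0, 10, (32,))
    loss = F.cross_entropy(logits, targets) + 0.1 * macc_auxiliary_loss(logits)
    loss.backward()
    print(round(loss.item(), 4))
```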