cs.CV - 2023-08-10

AD-CLIP: Adapting Domains in Prompt Space Using CLIP

  • paper_url: http://arxiv.org/abs/2308.05659
  • repo_url: None
  • paper_authors: Mainak Singha, Harsh Pal, Ankit Jha, Biplab Banerjee
  • for: Address the generalization problem of prediction models when the test (target) domain differs from the training (source) domain.
  • methods: Leverage CLIP's frozen vision backbone to extract image style (domain) and content information and use them to learn prompt tokens; the prompts are designed to be domain-invariant and class-generalizable by conditioning prompt learning on image style and content features simultaneously.
  • results: AD-CLIP outperforms the existing literature on three benchmark domain adaptation datasets.
    Abstract Although deep learning models have shown impressive performance on supervised learning tasks, they often struggle to generalize well when the training (source) and test (target) domains differ. Unsupervised domain adaptation (DA) has emerged as a popular solution to this problem. However, current DA techniques rely on visual backbones, which may lack semantic richness. Despite the potential of large-scale vision-language foundation models like CLIP, their effectiveness for DA has yet to be fully explored. To address this gap, we introduce AD-CLIP, a domain-agnostic prompt learning strategy for CLIP that aims to solve the DA problem in the prompt space. We leverage the frozen vision backbone of CLIP to extract both image style (domain) and content information, which we apply to learn prompt tokens. Our prompts are designed to be domain-invariant and class-generalizable, by conditioning prompt learning on image style and content features simultaneously. We use standard supervised contrastive learning in the source domain, while proposing an entropy minimization strategy to align domains in the embedding space given the target domain data. We also consider a scenario where only target domain samples are available during testing, without any source domain data, and propose a cross-domain style mapping network to hallucinate domain-agnostic tokens. Our extensive experiments on three benchmark DA datasets demonstrate the effectiveness of AD-CLIP compared to existing literature.
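    A rough illustration of the entropy-minimization objective mentioned in the abstract, written as a minimal PyTorch sketch under assumptions (tensor names and the temperature value are illustrative, not the authors' implementation):

    ```python
    # Minimal sketch: push CLIP-style class posteriors of unlabeled target images toward
    # confident (low-entropy) predictions. `image_features`/`text_features` are assumed names.
    import torch
    import torch.nn.functional as F

    def entropy_minimization_loss(image_features: torch.Tensor,
                                  text_features: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
        """Mean Shannon entropy of class posteriors for a batch of target images."""
        image_features = F.normalize(image_features, dim=-1)   # (B, D)
        text_features = F.normalize(text_features, dim=-1)     # (C, D), one prompt per class
        logits = image_features @ text_features.t() / temperature
        probs = logits.softmax(dim=-1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)
        return entropy.mean()

    # Usage on dummy tensors: 8 target images, 10 classes, 512-d embeddings.
    loss = entropy_minimization_loss(torch.randn(8, 512), torch.randn(10, 512))
    ```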

Attention-based 3D CNN with Multi-layer Features for Alzheimer’s Disease Diagnosis using Brain Images

  • paper_url: http://arxiv.org/abs/2308.05655
  • repo_url: None
  • paper_authors: Yanteng Zhang, Qizhi Teng, Xiaohai He, Tong Niu, Lipei Zhang, Yan Liu, Chao Ren
  • for: Improve the accuracy of Alzheimer's disease (AD) diagnosis by combining multi-layer features with an attention mechanism to better capture disease-related characteristics in brain images.
  • methods: An end-to-end 3D CNN framework based on ResNet that fuses multi-layer features obtained under an attention mechanism to better capture subtle differences in brain images.
  • results: Ablation experiments on two imaging modalities from 792 subjects in the ADNI database achieve AD diagnostic accuracies of 89.71% (sMRI) and 91.18% (PET), outperforming several state-of-the-art methods.
    Abstract Structural MRI and PET imaging play an important role in the diagnosis of Alzheimer's disease (AD), showing the morphological changes and glucose metabolism changes in the brain respectively. The manifestations in the brain image of some cognitive impairment patients are relatively inconspicuous, for example, it still has difficulties in achieving accurate diagnosis through sMRI in clinical practice. With the emergence of deep learning, convolutional neural network (CNN) has become a valuable method in AD-aided diagnosis, but some CNN methods cannot effectively learn the features of brain image, making the diagnosis of AD still presents some challenges. In this work, we propose an end-to-end 3D CNN framework for AD diagnosis based on ResNet, which integrates multi-layer features obtained under the effect of the attention mechanism to better capture subtle differences in brain images. The attention maps showed our model can focus on key brain regions related to the disease diagnosis. Our method was verified in ablation experiments with two modality images on 792 subjects from the ADNI database, where AD diagnostic accuracies of 89.71% and 91.18% were achieved based on sMRI and PET respectively, and also outperformed some state-of-the-art methods.
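    A hedged sketch of attention-weighted fusion of multi-layer 3D features (a squeeze-and-excitation style gate is used here as a stand-in; it is not the paper's exact module, and all names are illustrative):

    ```python
    # Each feature level is re-weighted channel-wise by a learned gate before fusion.
    import torch
    import torch.nn as nn

    class ChannelAttention3D(nn.Module):
        def __init__(self, channels: int, reduction: int = 8):
            super().__init__()
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool3d(1),                           # squeeze: global context per channel
                nn.Conv3d(channels, channels // reduction, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv3d(channels // reduction, channels, kernel_size=1),
                nn.Sigmoid(),                                      # excitation: per-channel weights in (0, 1)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x * self.gate(x)

    # Fuse two feature levels already projected to the same shape.
    att_low, att_high = ChannelAttention3D(32), ChannelAttention3D(32)
    low, high = torch.randn(2, 32, 8, 8, 8), torch.randn(2, 32, 8, 8, 8)
    fused = att_low(low) + att_high(high)
    ```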

Counterfactual Cross-modality Reasoning for Weakly Supervised Video Moment Localization

  • paper_url: http://arxiv.org/abs/2308.05648
  • repo_url: https://github.com/sldz0306/ccr
  • paper_authors: Zezhong Lv, Bing Su, Ji-Rong Wen
  • for: Improve weakly supervised video moment localization by refining the alignment between video and natural language.
  • methods: Cross-modality similarity contrast combined with counterfactual cross-modality reasoning.
  • results: Experiments show the proposed method mitigates the spurious influence on the reconstruction of video proposals and improves localization accuracy.
    Abstract Video moment localization aims to retrieve the target segment of an untrimmed video according to the natural language query. Weakly supervised methods gains attention recently, as the precise temporal location of the target segment is not always available. However, one of the greatest challenges encountered by the weakly supervised method is implied in the mismatch between the video and language induced by the coarse temporal annotations. To refine the vision-language alignment, recent works contrast the cross-modality similarities driven by reconstructing masked queries between positive and negative video proposals. However, the reconstruction may be influenced by the latent spurious correlation between the unmasked and the masked parts, which distorts the restoring process and further degrades the efficacy of contrastive learning since the masked words are not completely reconstructed from the cross-modality knowledge. In this paper, we discover and mitigate this spurious correlation through a novel proposed counterfactual cross-modality reasoning method. Specifically, we first formulate query reconstruction as an aggregated causal effect of cross-modality and query knowledge. Then by introducing counterfactual cross-modality knowledge into this aggregation, the spurious impact of the unmasked part contributing to the reconstruction is explicitly modeled. Finally, by suppressing the unimodal effect of masked query, we can rectify the reconstructions of video proposals to perform reasonable contrastive learning. Extensive experimental evaluations demonstrate the effectiveness of our proposed method. The code is available at \href{https://github.com/sLdZ0306/CCR}{https://github.com/sLdZ0306/CCR}.
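    The "suppressing the unimodal effect" step can be pictured as a standard counterfactual debiasing pattern: the query-only prediction is subtracted from the full cross-modal prediction. The sketch below is an assumption-laden illustration of that pattern, not the released CCR code:

    ```python
    # Total effect minus the unimodal (language-prior) effect on masked-word reconstruction.
    import torch

    def debiased_reconstruction_logits(logits_full: torch.Tensor,
                                       logits_query_only: torch.Tensor) -> torch.Tensor:
        return logits_full - logits_query_only

    # Dummy example: vocabulary of 1000 words, 5 masked positions.
    full = torch.randn(5, 1000)          # reconstruction using video proposal + unmasked query
    query_only = torch.randn(5, 1000)    # reconstruction with the visual input "imagined away"
    debiased = debiased_reconstruction_logits(full, query_only)
    ```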

Self-Supervised Monocular Depth Estimation by Direction-aware Cumulative Convolution Network

  • paper_url: http://arxiv.org/abs/2308.05605
  • repo_url: https://github.com/wencheng256/daccn
  • paper_authors: Wencheng Han, Junbo Yin, Jianbing Shen
  • for: Improve the accuracy of self-supervised monocular depth estimation.
  • methods: A new direction-aware module and an improved cumulative convolution.
  • results: Experiments show the method sets new state-of-the-art performance on three widely used benchmarks (KITTI, Cityscapes, and Make3D), with significant improvements under all three types of self-supervision.
    Abstract Monocular depth estimation is known as an ill-posed task in which objects in a 2D image usually do not contain sufficient information to predict their depth. Thus, it acts differently from other tasks (e.g., classification and segmentation) in many ways. In this paper, we find that self-supervised monocular depth estimation shows a direction sensitivity and environmental dependency in the feature representation. But the current backbones borrowed from other tasks pay less attention to handling different types of environmental information, limiting the overall depth accuracy. To bridge this gap, we propose a new Direction-aware Cumulative Convolution Network (DaCCN), which improves the depth feature representation in two aspects. First, we propose a direction-aware module, which can learn to adjust the feature extraction in each direction, facilitating the encoding of different types of information. Secondly, we design a new cumulative convolution to improve the efficiency for aggregating important environmental information. Experiments show that our method achieves significant improvements on three widely used benchmarks, KITTI, Cityscapes, and Make3D, setting a new state-of-the-art performance on the popular benchmarks with all three types of self-supervision.
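    One way to picture the cumulative aggregation idea is a running accumulation of features along a single spatial direction before a lightweight convolution mixes them. The following is an illustrative approximation under assumptions (direction choice, normalization), not the DaCCN implementation:

    ```python
    import torch
    import torch.nn as nn

    class CumulativeConv(nn.Module):
        def __init__(self, channels: int):
            super().__init__()
            self.mix = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, C, H, W)
            h = x.size(2)
            # Running mean from the top of the image down to each row.
            counts = torch.arange(1, h + 1, device=x.device, dtype=x.dtype).view(1, 1, h, 1)
            cumulative = x.cumsum(dim=2) / counts
            return self.mix(cumulative)

    out = CumulativeConv(16)(torch.randn(2, 16, 32, 32))
    ```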

Object Goal Navigation with Recursive Implicit Maps

  • paper_url: http://arxiv.org/abs/2308.05602
  • repo_url: None
  • paper_authors: Shizhe Chen, Thomas Chabal, Ivan Laptev, Cordelia Schmid
  • for: Help an agent find objects of a given category in unseen environments, improving navigation capability.
  • methods: An implicit map that is recursively updated with new observations at each step, with auxiliary tasks introduced to encourage spatial reasoning.
  • results: Achieves state-of-the-art performance on the MP3D dataset and attains encouraging object goal navigation results in real scenes using only a few real-world demonstrations.
    Abstract Object goal navigation aims to navigate an agent to locations of a given object category in unseen environments. Classical methods explicitly build maps of environments and require extensive engineering while lacking semantic information for object-oriented exploration. On the other hand, end-to-end learning methods alleviate manual map design and predict actions using implicit representations. Such methods, however, lack an explicit notion of geometry and may have limited ability to encode navigation history. In this work, we propose an implicit spatial map for object goal navigation. Our implicit map is recursively updated with new observations at each step using a transformer. To encourage spatial reasoning, we introduce auxiliary tasks and train our model to reconstruct explicit maps as well as to predict visual features, semantic labels and actions. Our method significantly outperforms the state of the art on the challenging MP3D dataset and generalizes well to the HM3D dataset. We successfully deploy our model on a real robot and achieve encouraging object goal navigation results in real scenes using only a few real-world demonstrations. Code, trained models and videos are available at \url{https://www.di.ens.fr/willow/research/onav_rim/}.

NUPES : Non-Uniform Post-Training Quantization via Power Exponent Search

  • paper_url: http://arxiv.org/abs/2308.05600
  • repo_url: None
  • paper_authors: Edouard Yvinec, Arnaud Dapogny, Kevin Bailly
  • for: Reduce the computational cost and latency of deep neural networks (DNNs) so they can be deployed on constrained hardware.
  • methods: Non-uniform post-training quantization that converts floating-point representations to low bit-width representations using power functions, learning the power exponent and the quantized weights over the entire quantized space.
  • results: Achieves state-of-the-art compression rates in both data-free and data-driven configurations.
    Abstract Deep neural network (DNN) deployment has been confined to larger hardware devices due to their expensive computational requirements. This challenge has recently reached another scale with the emergence of large language models (LLMs). In order to reduce both their memory footprint and latency, a promising technique is quantization. It consists in converting floating point representations to low bit-width fixed point representations, usually by assuming a uniform mapping onto a regular grid. This process, referred to in the literature as uniform quantization, may however be ill-suited as most DNN weights and activations follow a bell-shaped distribution. This is even worse on LLMs whose weight distributions are known to exhibit large, high impact, outlier values. In this work, we propose an improvement over the most commonly adopted way to tackle this limitation in deep learning models quantization, namely, non-uniform quantization. NUPES leverages automorphisms to preserve the scalar multiplications. Such transformations are derived from power functions. However, the optimization of the exponent parameter and weight values remains a challenging and novel problem which could not be solved with previous post training optimization techniques which only learn to round up or down weight values in order to preserve the predictive function. We circumvent this limitation with a new paradigm: learning new quantized weights over the entire quantized space. Similarly, we enable the optimization of the power exponent, i.e. the optimization of the quantization operator itself during training by alleviating all the numerical instabilities. The resulting predictive function is compatible with integer-only low-bit inference. We show the ability of the method to achieve state-of-the-art compression rates in both, data-free and data-driven configurations.
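    A hedged sketch of quantization through a power exponent: weights are warped by |w|^a (preserving sign), quantized uniformly in the warped space, and mapped back with the inverse power. The fixed exponent and bit-width below are illustrative assumptions; the paper optimizes the exponent jointly with the quantized weights.

    ```python
    import numpy as np

    def power_quantize(w: np.ndarray, a: float = 0.5, n_bits: int = 4) -> np.ndarray:
        scale = np.abs(w).max() + 1e-12
        warped = np.sign(w) * (np.abs(w) / scale) ** a          # non-uniform warp to [-1, 1]
        levels = 2 ** (n_bits - 1) - 1
        q = np.round(warped * levels) / levels                  # uniform grid in warped space
        return np.sign(q) * (np.abs(q) ** (1.0 / a)) * scale    # back to the original space

    w = np.random.randn(1024) * 0.05
    w_q = power_quantize(w)
    print(np.abs(w - w_q).mean())                               # mean quantization error
    ```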

Test-Time Selection for Robust Skin Lesion Analysis

  • paper_url: http://arxiv.org/abs/2308.05595
  • repo_url: https://github.com/alceubissoto/skin-tts
  • paper_authors: Alceu Bissoto, Catarina Barata, Eduardo Valle, Sandra Avila
  • for: Mitigate bias in skin lesion classification models so that predictions are not driven by acquisition artifacts.
  • methods: A human-in-the-loop method (TTS, Test-Time Selection) that leverages positive (e.g., lesion area) and negative (e.g., artifacts) keypoints in test samples to steer models away from spurious features.
  • results: Mitigates bias without retraining and remains robust to varying annotation availability and different levels of bias.
    Abstract Skin lesion analysis models are biased by artifacts placed during image acquisition, which influence model predictions despite carrying no clinical information. Solutions that address this problem by regularizing models to prevent learning those spurious features achieve only partial success, and existing test-time debiasing techniques are inappropriate for skin lesion analysis due to either making unrealistic assumptions on the distribution of test data or requiring laborious annotation from medical practitioners. We propose TTS (Test-Time Selection), a human-in-the-loop method that leverages positive (e.g., lesion area) and negative (e.g., artifacts) keypoints in test samples. TTS effectively steers models away from exploiting spurious artifact-related correlations without retraining, and with less annotation requirements. Our solution is robust to a varying availability of annotations, and different levels of bias. We showcase on the ISIC2019 dataset (for which we release a subset of annotated images) how our model could be deployed in the real-world for mitigating bias.

Category Feature Transformer for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.05581
  • repo_url: None
  • paper_authors: Quan Tang, Chuanjian Liu, Fagui Liu, Yifan Liu, Jun Jiang, Bowen Zhang, Kai Han, Yunhe Wang
  • for: Improve semantic segmentation performance through better feature aggregation.
  • methods: The Category Feature Transformer (CFT), which uses multi-head attention to explore the flow of category embedding and transformation among multi-stage features, learning a unified embedding for each semantic category from high-level features.
  • results: Compared with point-wise summation or concatenation, CFT shows superior performance across a broad range of backbone networks, achieving a compelling 55.1% mIoU on ADE20K with greatly reduced model parameters and computation.
    Abstract Aggregation of multi-stage features has been revealed to play a significant role in semantic segmentation. Unlike previous methods employing point-wise summation or concatenation for feature aggregation, this study proposes the Category Feature Transformer (CFT) that explores the flow of category embedding and transformation among multi-stage features through the prevalent multi-head attention mechanism. CFT learns unified feature embeddings for individual semantic categories from high-level features during each aggregation process and dynamically broadcasts them to high-resolution features. Integrating the proposed CFT into a typical feature pyramid structure exhibits superior performance over a broad range of backbone networks. We conduct extensive experiments on popular semantic segmentation benchmarks. Specifically, the proposed CFT obtains a compelling 55.1% mIoU with greatly reduced model parameters and computations on the challenging ADE20K dataset.

Cross-Domain Product Representation Learning for Rich-Content E-Commerce

  • paper_url: http://arxiv.org/abs/2308.05550
  • repo_url: https://github.com/adxcreative/cope
  • paper_authors: Xuehan Bai, Yan Li, Yanhua Cheng, Wenjie Yang, Quan Chen, Han Li
  • for: 这篇论文旨在解决Rich-content电商中产品在不同媒体频道上的描述不一致问题,实现跨媒体产品识别,以提高用户搜索体验和产品推荐效果。
  • methods: 本论文提出了一种 Cross-dOmain Product rEpresentation(COPE)框架,通过多modal学习(包括文本和视觉学习)将产品表示在不同媒体频道上的特征空间统一。
  • results: experiments表明,COPE可以学习一个共同特征空间,以便在不同媒体频道上进行产品识别和推荐。
    Abstract The proliferation of short video and live-streaming platforms has revolutionized how consumers engage in online shopping. Instead of browsing product pages, consumers are now turning to rich-content e-commerce, where they can purchase products through dynamic and interactive media like short videos and live streams. This emerging form of online shopping has introduced technical challenges, as products may be presented differently across various media domains. Therefore, a unified product representation is essential for achieving cross-domain product recognition to ensure an optimal user search experience and effective product recommendations. Despite the urgent industrial need for a unified cross-domain product representation, previous studies have predominantly focused only on product pages without taking into account short videos and live streams. To fill the gap in the rich-content e-commerce area, in this paper, we introduce a large-scale cRoss-dOmain Product Ecognition dataset, called ROPE. ROPE covers a wide range of product categories and contains over 180,000 products, corresponding to millions of short videos and live streams. It is the first dataset to cover product pages, short videos, and live streams simultaneously, providing the basis for establishing a unified product representation across different media domains. Furthermore, we propose a Cross-dOmain Product rEpresentation framework, namely COPE, which unifies product representations in different domains through multimodal learning including text and vision. Extensive experiments on downstream tasks demonstrate the effectiveness of COPE in learning a joint feature space for all product domains.
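    A minimal sketch of the cross-domain contrastive idea: embeddings of the same product from two media domains are pulled together with a symmetric InfoNCE loss. Names and the temperature are assumptions, not the COPE implementation:

    ```python
    import torch
    import torch.nn.functional as F

    def cross_domain_info_nce(page_emb: torch.Tensor, video_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
        page_emb = F.normalize(page_emb, dim=-1)
        video_emb = F.normalize(video_emb, dim=-1)
        logits = page_emb @ video_emb.t() / temperature         # (N, N) similarity matrix
        targets = torch.arange(page_emb.size(0))                # matching pairs sit on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    loss = cross_domain_info_nce(torch.randn(16, 256), torch.randn(16, 256))
    ```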

Deep Richardson-Lucy Deconvolution for Low-Light Image Deblurring

  • paper_url: http://arxiv.org/abs/2308.05543
  • repo_url: None
  • paper_authors: Liang Chen, Jiawei Zhang, Zhenhua Li, Yunxuan Wei, Faming Fang, Jimmy Ren, Jinshan Pan
  • for: Deblur images captured under low-light conditions, which often contain both blur and saturated pixels.
  • methods: A data-driven approach that models saturated pixels with a learned latent map, formulating non-blind deblurring as a maximum a posteriori (MAP) problem solved by iteratively computing the latent map and the latent image with a Richardson-Lucy (RL)-based updating scheme, aided by a prior estimation network.
  • results: Experiments show the method performs favorably against state-of-the-art algorithms on synthetic and real-world images, without amplified artifacts.
    Abstract Images taken under the low-light condition often contain blur and saturated pixels at the same time. Deblurring images with saturated pixels is quite challenging. Because of the limited dynamic range, the saturated pixels are usually clipped in the imaging process and thus cannot be modeled by the linear blur model. Previous methods use manually designed smooth functions to approximate the clipping procedure. Their deblurring processes often require empirically defined parameters, which may not be the optimal choices for different images. In this paper, we develop a data-driven approach to model the saturated pixels by a learned latent map. Based on the new model, the non-blind deblurring task can be formulated into a maximum a posterior (MAP) problem, which can be effectively solved by iteratively computing the latent map and the latent image. Specifically, the latent map is computed by learning from a map estimation network (MEN), and the latent image estimation process is implemented by a Richardson-Lucy (RL)-based updating scheme. To estimate high-quality deblurred images without amplified artifacts, we develop a prior estimation network (PEN) to obtain prior information, which is further integrated into the RL scheme. Experimental results demonstrate that the proposed method performs favorably against state-of-the-art algorithms both quantitatively and qualitatively on synthetic and real-world images.
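    For reference, the classical Richardson-Lucy update that the proposed scheme builds on, as a plain NumPy/SciPy sketch (without the learned latent map or prior network described in the abstract):

    ```python
    import numpy as np
    from scipy.signal import fftconvolve

    def richardson_lucy(blurred: np.ndarray, psf: np.ndarray, n_iter: int = 30) -> np.ndarray:
        estimate = np.full_like(blurred, 0.5)
        psf_flipped = psf[::-1, ::-1]
        for _ in range(n_iter):
            reblurred = fftconvolve(estimate, psf, mode="same")
            ratio = blurred / (reblurred + 1e-12)
            estimate *= fftconvolve(ratio, psf_flipped, mode="same")   # multiplicative update
        return estimate

    blurred = np.clip(np.random.rand(64, 64), 0, 1)          # dummy observation
    psf = np.ones((5, 5)) / 25.0                             # box blur kernel
    restored = richardson_lucy(blurred, psf)
    ```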

Robust Asymmetric Loss for Multi-Label Long-Tailed Learning

  • paper_url: http://arxiv.org/abs/2308.05542
  • repo_url: https://github.com/kalelpark/RAL
  • paper_authors: Wongi Park, Inhyuk Park, Sungeun Kim, Jongbin Ryu
  • for: Address long-tailed distributions and multi-label classification simultaneously in medical image classification.
  • methods: A Robust Asymmetric Loss (RAL) on a polynomial function, regularized with the Hill loss approach to reduce sensitivity to its many hyper-parameters.
  • results: Performs favorably on multi-label long-tailed medical data as well as various long-tailed single-label datasets, achieving a Top-5 result on the CXR-LT dataset of the ICCV CVAMD 2023 competition.
    Abstract In real medical data, training samples typically show long-tailed distributions with multiple labels. Class distribution of the medical data has a long-tailed shape, in which the incidence of different diseases is quite varied, and at the same time, it is not unusual for images taken from symptomatic patients to be multi-label diseases. Therefore, in this paper, we concurrently address these two issues by putting forth a robust asymmetric loss on the polynomial function. Since our loss tackles both long-tailed and multi-label classification problems simultaneously, it leads to a complex design of the loss function with a large number of hyper-parameters. Although a model can be highly fine-tuned due to a large number of hyper-parameters, it is difficult to optimize all hyper-parameters at the same time, and there might be a risk of overfitting a model. Therefore, we regularize the loss function using the Hill loss approach, which is beneficial to be less sensitive against the numerous hyper-parameters so that it reduces the risk of overfitting the model. For this reason, the proposed loss is a generic method that can be applied to most medical image classification tasks and does not make the training process more time-consuming. We demonstrate that the proposed robust asymmetric loss performs favorably against the long-tailed with multi-label medical image classification in addition to the various long-tailed single-label datasets. Notably, our method achieves Top-5 results on the CXR-LT dataset of the ICCV CVAMD 2023 competition. We opensource our implementation of the robust asymmetric loss in the public repository: https://github.com/kalelpark/RAL.
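    A hedged sketch of an asymmetric multi-label loss of the kind the paper builds on: positive and negative terms use different focusing exponents, and easy negatives are further down-weighted via a probability margin. Hyper-parameter values are illustrative assumptions, not the tuned RAL settings:

    ```python
    import torch

    def asymmetric_loss(logits: torch.Tensor, targets: torch.Tensor,
                        gamma_pos: float = 0.0, gamma_neg: float = 4.0,
                        margin: float = 0.05) -> torch.Tensor:
        p = torch.sigmoid(logits)
        p_neg = (p - margin).clamp(min=0)                      # shifted probability for negatives
        loss_pos = targets * (1 - p).pow(gamma_pos) * torch.log(p.clamp(min=1e-8))
        loss_neg = (1 - targets) * p_neg.pow(gamma_neg) * torch.log((1 - p_neg).clamp(min=1e-8))
        return -(loss_pos + loss_neg).mean()

    # 8 chest X-rays, 14 binary labels.
    loss = asymmetric_loss(torch.randn(8, 14), torch.randint(0, 2, (8, 14)).float())
    ```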

Is there progress in activity progress prediction?

  • paper_url: http://arxiv.org/abs/2308.05533
  • repo_url: https://github.com/frans-db/progress-prediction
  • paper_authors: Frans de Boer, Jan C. van Gemert, Jouke Dijkstra, Silvia L. Pintea
  • for: Estimate what percentage of an activity has been completed.
  • methods: Evaluate existing machine-learning progress prediction methods, trained and tested on complicated and realistic video datasets, and compare them on a precisely controlled synthetic dataset.
  • results: Current progress prediction methods fail to extract useful visual information on the real-world datasets and cannot exceed simple frame-counting baselines; on the controlled synthetic dataset they do exploit visual information when it directly relates to progress. The authors conclude that progress prediction is ill-posed on the currently used real-world datasets and recommend a simple but effective frame-counting baseline for fair comparison.
    Abstract Activity progress prediction aims to estimate what percentage of an activity has been completed. Currently this is done with machine learning approaches, trained and evaluated on complicated and realistic video datasets. The videos in these datasets vary drastically in length and appearance. And some of the activities have unanticipated developments, making activity progression difficult to estimate. In this work, we examine the results obtained by existing progress prediction methods on these datasets. We find that current progress prediction methods seem not to extract useful visual information for the progress prediction task. Therefore, these methods fail to exceed simple frame-counting baselines. We design a precisely controlled dataset for activity progress prediction and on this synthetic dataset we show that the considered methods can make use of the visual information, when this directly relates to the progress prediction. We conclude that the progress prediction task is ill-posed on the currently used real-world datasets. Moreover, to fairly measure activity progression we advise to consider a, simple but effective, frame-counting baseline.
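    The frame-counting baseline recommended in the abstract is trivially small: predicted progress is simply the fraction of frames seen so far, independent of any visual content.

    ```python
    def frame_count_progress(frame_index: int, total_frames: int) -> float:
        return (frame_index + 1) / total_frames

    print([round(frame_count_progress(t, 10), 1) for t in range(10)])  # 0.1 ... 1.0
    ```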

Critical Points ++: An Agile Point Cloud Importance Measure for Robust Classification, Adversarial Defense and Explainable AI

  • paper_url: http://arxiv.org/abs/2308.05525
  • repo_url: https://github.com/yossilevii100/critical_points2
  • paper_authors: Meir Yossef Levi, Guy Gilboa
  • for: Study the interplay between critical points of 3D point clouds and out-of-distribution (OOD) samples.
  • methods: First examine whether common corruptions and outliers are interpreted as critical points, then generalize critical points into an importance measure; training a classification network only on less important points dramatically improves robustness at a minor cost on the clean set. Normalized entropy proves highly informative for corruption analysis, and an adaptive threshold based on it selects the set of uncritical points.
  • results: The importance measure is extremely fast to compute and serves a variety of applications, including Explainable AI (XAI), outlier removal, uncertainty estimation, robust classification, and adversarial defense, reaching state-of-the-art results on the latter two tasks.
    Abstract The ability to cope accurately and fast with Out-Of-Distribution (OOD) samples is crucial in real-world safety demanding applications. In this work we first study the interplay between critical points of 3D point clouds and OOD samples. Our findings are that common corruptions and outliers are often interpreted as critical points. We generalize the notion of critical points into importance measures. We show that training a classification network based only on less important points dramatically improves robustness, at a cost of minor performance loss on the clean set. We observe that normalized entropy is highly informative for corruption analysis. An adaptive threshold based on normalized entropy is suggested for selecting the set of uncritical points. Our proposed importance measure is extremely fast to compute. We show it can be used for a variety of applications, such as Explainable AI (XAI), Outlier Removal, Uncertainty Estimation, Robust Classification and Adversarial Defense. We reach SOTA results on the two latter tasks. Code is available at: https://github.com/yossilevii100/critical_points2

Look at the Neighbor: Distortion-aware Unsupervised Domain Adaptation for Panoramic Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.05493
  • repo_url: None
  • paper_authors: Xu Zheng, Tianbo Pan, Yunhao Luo, Lin Wang
  • for: Improve panoramic semantic segmentation and bridge the domain gap between pinhole and panoramic images.
  • methods: A new unsupervised domain adaptation framework with two modules, Distortion-Aware Attention (DA) and Class-Wise Feature Aggregation (CFA); it captures the neighboring pixel distribution without any geometric constraints, handling the non-uniform pixel distribution of equirectangular projection more effectively.
  • results: Achieves new state-of-the-art performance while reducing parameters by 80%.
    Abstract Endeavors have been recently made to transfer knowledge from the labeled pinhole image domain to the unlabeled panoramic image domain via Unsupervised Domain Adaptation (UDA). The aim is to tackle the domain gaps caused by the style disparities and distortion problem from the non-uniformly distributed pixels of equirectangular projection (ERP). Previous works typically focus on transferring knowledge based on geometric priors with specially designed multi-branch network architectures. As a result, considerable computational costs are induced, and meanwhile, their generalization abilities are profoundly hindered by the variation of distortion among pixels. In this paper, we find that the pixels' neighborhood regions of the ERP indeed introduce less distortion. Intuitively, we propose a novel UDA framework that can effectively address the distortion problems for panoramic semantic segmentation. In comparison, our method is simpler, easier to implement, and more computationally efficient. Specifically, we propose distortion-aware attention (DA) capturing the neighboring pixel distribution without using any geometric constraints. Moreover, we propose a class-wise feature aggregation (CFA) module to iteratively update the feature representations with a memory bank. As such, the feature similarity between two domains can be consistently optimized. Extensive experiments show that our method achieves new state-of-the-art performance while remarkably reducing 80% parameters.

YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-time Object Detection

  • paper_url: http://arxiv.org/abs/2308.05480
  • repo_url: https://github.com/fishandwasabi/yolo-ms
  • paper_authors: Yuming Chen, Xinbin Yuan, Ruiqi Wu, Jiabao Wang, Qibin Hou, Ming-Ming Cheng
  • for: Provide the object detection community with an efficient and performant object detector, the YOLO-MS model.
  • methods: The core design of YOLO-MS is based on investigations into how convolutions with different kernel sizes affect object detection performance at different scales. The strategy is to enhance multi-scale feature representations for real-time object detectors.
  • results: YOLO-MS outperforms recent state-of-the-art real-time object detectors, including YOLO-v7 and RTMDet, with a comparable number of parameters and FLOPs. The XS version achieves an AP score of 43%+ on MS COCO, about 2%+ higher than RTMDet with the same model size. The method can also be used as a plug-and-play module for other YOLO models, improving their AP scores with fewer parameters and FLOPs.
    Abstract We aim at providing the object detection community with an efficient and performant object detector, termed YOLO-MS. The core design is based on a series of investigations on how convolutions with different kernel sizes affect the detection performance of objects at different scales. The outcome is a new strategy that can strongly enhance multi-scale feature representations of real-time object detectors. To verify the effectiveness of our strategy, we build a network architecture, termed YOLO-MS. We train our YOLO-MS on the MS COCO dataset from scratch without relying on any other large-scale datasets, like ImageNet, or pre-trained weights. Without bells and whistles, our YOLO-MS outperforms the recent state-of-the-art real-time object detectors, including YOLO-v7 and RTMDet, when using a comparable number of parameters and FLOPs. Taking the XS version of YOLO-MS as an example, with only 4.5M learnable parameters and 8.7G FLOPs, it can achieve an AP score of 43%+ on MS COCO, which is about 2%+ higher than RTMDet with the same model size. Moreover, our work can also be used as a plug-and-play module for other YOLO models. Typically, our method significantly improves the AP of YOLOv8 from 37%+ to 40%+ with even fewer parameters and FLOPs. Code is available at https://github.com/FishAndWasabi/YOLO-MS.
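    A minimal sketch of the multi-kernel-size idea: parallel branches with different kernel sizes capture objects at different scales and are concatenated. This is an illustrative block under assumptions, not the released YOLO-MS architecture:

    ```python
    import torch
    import torch.nn as nn

    class MultiScaleBlock(nn.Module):
        def __init__(self, in_ch: int, out_ch: int, kernel_sizes=(3, 5, 7)):
            super().__init__()
            branch_ch = out_ch // len(kernel_sizes)
            self.branches = nn.ModuleList([
                nn.Sequential(
                    nn.Conv2d(in_ch, branch_ch, k, padding=k // 2, bias=False),
                    nn.BatchNorm2d(branch_ch),
                    nn.SiLU(inplace=True),
                )
                for k in kernel_sizes
            ])

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.cat([branch(x) for branch in self.branches], dim=1)

    feat = MultiScaleBlock(64, 96)(torch.randn(1, 64, 80, 80))   # -> (1, 96, 80, 80)
    ```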

Surface Masked AutoEncoder: Self-Supervision for Cortical Imaging Data

  • paper_url: http://arxiv.org/abs/2308.05474
  • repo_url: https://github.com/metrics-lab/surface-vision-transformers
  • paper_authors: Simon Dahan, Mariana da Silva, Daniel Rueckert, Emma C Robinson
  • for: Explore self-supervision as a way to address the lack of inductive biases in vision transformer architectures and improve generalization on small datasets such as cortical imaging data.
  • methods: Build on recent work translating vision transformers to surface meshes and investigate Masked AutoEncoder (MAE) self-supervision for cortical surface learning.
  • results: Reconstructing masked surface data yields strong representations that improve downstream performance; on cortical phenotype regression with the developing Human Connectome Project (dHCP), pre-training improves performance by 26% and converges 80% faster than training from scratch. Pre-training on large datasets such as the UK Biobank (UKB) also yields robust representations for fine-tuning in low-data scenarios.
    Abstract Self-supervision has been widely explored as a means of addressing the lack of inductive biases in vision transformer architectures, which limits generalisation when networks are trained on small datasets. This is crucial in the context of cortical imaging, where phenotypes are complex and heterogeneous, but the available datasets are limited in size. This paper builds upon recent advancements in translating vision transformers to surface meshes and investigates the potential of Masked AutoEncoder (MAE) self-supervision for cortical surface learning. By reconstructing surface data from a masked version of the input, the proposed method effectively models cortical structure to learn strong representations that translate to improved performance in downstream tasks. We evaluate our approach on cortical phenotype regression using the developing Human Connectome Project (dHCP) and demonstrate that pre-training leads to a 26\% improvement in performance, with an 80\% faster convergence, compared to models trained from scratch. Furthermore, we establish that pre-training vision transformer models on large datasets, such as the UK Biobank (UKB), enables the acquisition of robust representations for finetuning in low-data scenarios. Our code and pre-trained models are publicly available at \url{https://github.com/metrics-lab/surface-vision-transformers}.
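    A minimal sketch of the masked-autoencoding setup: a random subset of surface patches is hidden from the encoder, and the decoder is trained to reconstruct the masked patches. Patch extraction from the cortical mesh is abstracted away here; shapes and the masking ratio are illustrative assumptions:

    ```python
    import torch

    def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
        """patches: (B, N, D) -> visible patches and indices of the masked patches."""
        b, n, d = patches.shape
        n_keep = int(n * (1 - mask_ratio))
        order = torch.rand(b, n).argsort(dim=1)                 # random permutation per sample
        keep_idx, mask_idx = order[:, :n_keep], order[:, n_keep:]
        visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
        return visible, mask_idx

    visible, masked = random_masking(torch.randn(2, 320, 192))
    # The encoder sees only `visible`; reconstruction loss is computed on the patches in `masked`.
    ```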

Comprehensive Analysis of Network Robustness Evaluation Based on Convolutional Neural Networks with Spatial Pyramid Pooling

  • paper_url: http://arxiv.org/abs/2308.08012
  • repo_url: None
  • paper_authors: Wenjun Jiang, Tianlong Fan, Changhao Li, Chuanfu Zhang, Tao Zhang, Zong-fu Luo
  • for: Improve the efficiency and practicality of connectivity robustness evaluation for complex networks.
  • methods: Apply machine learning, specifically a convolutional neural network (CNN) with spatial pyramid pooling (SPP-net), together with adapted evaluation metrics, redesigned attack modes, filtering rules, and robustness values as training data.
  • results: The proposed CNN framework with SPP-net effectively addresses the high computational cost of connectivity robustness evaluation across various network types, failure component types, and failure scenarios.
    Abstract Connectivity robustness, a crucial aspect for understanding, optimizing, and repairing complex networks, has traditionally been evaluated through time-consuming and often impractical simulations. Fortunately, machine learning provides a new avenue for addressing this challenge. However, several key issues remain unresolved, including the performance in more general edge removal scenarios, capturing robustness through attack curves instead of directly training for robustness, scalability of predictive tasks, and transferability of predictive capabilities. In this paper, we address these challenges by designing a convolutional neural networks (CNN) model with spatial pyramid pooling networks (SPP-net), adapting existing evaluation metrics, redesigning the attack modes, introducing appropriate filtering rules, and incorporating the value of robustness as training data. The results demonstrate the thoroughness of the proposed CNN framework in addressing the challenges of high computational time across various network types, failure component types and failure scenarios. However, the performance of the proposed CNN model varies: for evaluation tasks that are consistent with the trained network type, the proposed CNN model consistently achieves accurate evaluations of both attack curves and robustness values across all removal scenarios. When the predicted network type differs from the trained network, the CNN model still demonstrates favorable performance in the scenario of random node failure, showcasing its scalability and performance transferability. Nevertheless, the performance falls short of expectations in other removal scenarios. This observed scenario-sensitivity in the evaluation of network features has been overlooked in previous studies and necessitates further attention and optimization. Lastly, we discuss important unresolved questions and further investigation.
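    For reference, a sketch of a spatial pyramid pooling (SPP) layer, which produces a fixed-length representation from feature maps of arbitrary spatial size (here, e.g., from differently sized networks represented as matrices). Bin sizes are illustrative:

    ```python
    import torch
    import torch.nn.functional as F

    def spatial_pyramid_pool(x: torch.Tensor, bins=(1, 2, 4)) -> torch.Tensor:
        """x: (B, C, H, W) -> (B, C * sum(b*b for b in bins)) fixed-length vector."""
        pooled = [F.adaptive_max_pool2d(x, output_size=b).flatten(start_dim=1) for b in bins]
        return torch.cat(pooled, dim=1)

    for size in (24, 37):                                       # works for varying input sizes
        v = spatial_pyramid_pool(torch.randn(2, 64, size, size))
        print(v.shape)                                          # always (2, 64 * 21)
    ```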

KS-APR: Keyframe Selection for Robust Absolute Pose Regression

  • paper_url: http://arxiv.org/abs/2308.05459
  • repo_url: None
  • paper_authors: Changkun Liu, Yukun Zhao, Tristan Braud
  • for: Improve the accuracy of absolute pose regression for mobile augmented reality (AR).
  • methods: Assess the reliability of each estimated pose with minimal overhead and discard unreliable poses, filtering input images using their pose estimates.
  • results: Median position and orientation errors are reduced and the proportion of large errors is minimized for all tested APR models.
    Abstract Markerless Mobile Augmented Reality (AR) aims to anchor digital content in the physical world without using specific 2D or 3D objects. Absolute Pose Regressors (APR) are end-to-end machine learning solutions that infer the device's pose from a single monocular image. Thanks to their low computation cost, they can be directly executed on the constrained hardware of mobile AR devices. However, APR methods tend to yield significant inaccuracies for input images that are too distant from the training set. This paper introduces KS-APR, a pipeline that assesses the reliability of an estimated pose with minimal overhead by combining the inference results of the APR and the prior images in the training set. Mobile AR systems tend to rely upon visual-inertial odometry to track the relative pose of the device during the experience. As such, KS-APR favours reliability over frequency, discarding unreliable poses. This pipeline can integrate most existing APR methods to improve accuracy by filtering unreliable images with their pose estimates. We implement the pipeline on three types of APR models on indoor and outdoor datasets. The median error on position and orientation is reduced for all models, and the proportion of large errors is minimized across datasets. Our method enables state-of-the-art APRs such as DFNetdm to outperform single-image and sequential APR methods. These results demonstrate the scalability and effectiveness of KS-APR for visual localization tasks that do not require one-shot decisions.

Transforming Breast Cancer Diagnosis: Towards Real-Time Ultrasound to Mammogram Conversion for Cost-Effective Diagnosis

  • paper_url: http://arxiv.org/abs/2308.05449
  • repo_url: None
  • paper_authors: Sahar Almahfouz Nasser, Ashutosh Sharma, Anmol Saraf, Amruta Mahendra Parulekar, Purvi Haria, Amit Sethi
  • for: Provide surgeons with mammogram-like image quality in real time from noisy ultrasound (US) images.
  • methods: Use the Stride software to numerically solve the forward model and generate ultrasound images from mammogram images by solving wave equations; leverage domain adaptation to enhance the realism of the simulated ultrasound images; and apply generative adversarial networks (GANs) to the inverse problem of generating mammogram-quality images from ultrasound images.
  • results: The resulting images have considerably more discernible details than the original US images.
    Abstract Ultrasound (US) imaging is better suited for intraoperative settings because it is real-time and more portable than other imaging techniques, such as mammography. However, US images are characterized by lower spatial resolution noise-like artifacts. This research aims to address these limitations by providing surgeons with mammogram-like image quality in real-time from noisy US images. Unlike previous approaches for improving US image quality that aim to reduce artifacts by treating them as (speckle noise), we recognize their value as informative wave interference pattern (WIP). To achieve this, we utilize the Stride software to numerically solve the forward model, generating ultrasound images from mammograms images by solving wave-equations. Additionally, we leverage the power of domain adaptation to enhance the realism of the simulated ultrasound images. Then, we utilize generative adversarial networks (GANs) to tackle the inverse problem of generating mammogram-quality images from ultrasound images. The resultant images have considerably more discernible details than the original US images.

A Generalized Physical-knowledge-guided Dynamic Model for Underwater Image Enhancement

  • paper_url: http://arxiv.org/abs/2308.05447
  • repo_url: None
  • paper_authors: Pan Mu, Hanning Xu, Zheyuan Liu, Zheng Wang, Sixian Chan, Cong Bai
  • for: Correct the color distortion and low contrast of underwater images and handle the variety of underwater image types.
  • methods: A Generalized physical-knowledge-guided dynamic model (GUPDM) with three parts: an Atmosphere-based Dynamic Structure (ADS), a Transmission-guided Dynamic Structure (TDS), and a Prior-based Multi-scale Structure (PMS). The formation model is used to simulate various underwater image types, and dynamic convolutions adaptively extract prior information from underwater images.
  • results: Experiments show that GUPDM effectively enhances underwater images and adapts to different water types.
    Abstract Underwater images often suffer from color distortion and low contrast resulting in various image types, due to the scattering and absorption of light by water. While it is difficult to obtain high-quality paired training samples with a generalized model. To tackle these challenges, we design a Generalized Underwater image enhancement method via a Physical-knowledge-guided Dynamic Model (short for GUPDM), consisting of three parts: Atmosphere-based Dynamic Structure (ADS), Transmission-guided Dynamic Structure (TDS), and Prior-based Multi-scale Structure (PMS). In particular, to cover complex underwater scenes, this study changes the global atmosphere light and the transmission to simulate various underwater image types (e.g., the underwater image color ranging from yellow to blue) through the formation model. We then design ADS and TDS that use dynamic convolutions to adaptively extract prior information from underwater images and generate parameters for PMS. These two modules enable the network to select appropriate parameters for various water types adaptively. Besides, the multi-scale feature extraction module in PMS uses convolution blocks with different kernel sizes and obtains weights for each feature map via channel attention block and fuses them to boost the receptive field of the network. The source code will be available at \href{https://github.com/shiningZZ/GUPDM}{https://github.com/shiningZZ/GUPDM}.
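    The simplified scattering-style formation model referred to in the abstract, I(x) = J(x) t(x) + A (1 - t(x)), can be used to synthesize different underwater appearances from a clean image by varying the global ambient light A and transmission t. Parameter values below are illustrative assumptions:

    ```python
    import numpy as np

    def synthesize_underwater(clean: np.ndarray, ambient=(0.1, 0.5, 0.4), transmission=0.6):
        """clean: (H, W, 3) image in [0, 1]; per-channel ambient light and a global transmission."""
        a = np.asarray(ambient, dtype=clean.dtype).reshape(1, 1, 3)
        return clean * transmission + a * (1.0 - transmission)

    clean = np.random.rand(120, 160, 3)
    bluish = synthesize_underwater(clean, ambient=(0.05, 0.35, 0.55), transmission=0.5)
    yellowish = synthesize_underwater(clean, ambient=(0.55, 0.45, 0.10), transmission=0.7)
    ```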

Benchmarking Algorithmic Bias in Face Recognition: An Experimental Approach Using Synthetic Faces and Human Evaluation

  • paper_url: http://arxiv.org/abs/2308.05441
  • repo_url: None
  • paper_authors: Hao Liang, Pietro Perona, Guha Balakrishnan
  • for: Measure bias in face recognition systems experimentally; existing methods rely on observational datasets collected in the wild and annotated for protected (e.g., race, gender) and non-protected (e.g., pose, lighting) attributes, which only permit correlational conclusions, whereas manipulating attributes individually permits causal conclusions (e.g., "Algorithm A's accuracy is affected by gender and skin color").
  • methods: Generate synthetic faces with a neural face generator, modifying each attribute of interest independently while keeping the others constant; human observers provide ground truth on perceptual identity similarity between synthetic image pairs. The method is validated quantitatively by evaluating the race and gender biases of three research-grade face recognition models.
  • results: For these algorithms, accuracy is lower for the Black and East Asian population subgroups; the method also quantifies how perceptual changes in attributes affect the face identity distances reported by the models. The synthetic dataset of 48,000 face image pairs (10,200 unique synthetic faces) and 555,000 human annotations (individual attributes and pairwise identity comparisons) is made available to researchers.
    Abstract We propose an experimental method for measuring bias in face recognition systems. Existing methods to measure bias depend on benchmark datasets that are collected in the wild and annotated for protected (e.g., race, gender) and non-protected (e.g., pose, lighting) attributes. Such observational datasets only permit correlational conclusions, e.g., "Algorithm A's accuracy is different on female and male faces in dataset X.". By contrast, experimental methods manipulate attributes individually and thus permit causal conclusions, e.g., "Algorithm A's accuracy is affected by gender and skin color." Our method is based on generating synthetic faces using a neural face generator, where each attribute of interest is modified independently while leaving all other attributes constant. Human observers crucially provide the ground truth on perceptual identity similarity between synthetic image pairs. We validate our method quantitatively by evaluating race and gender biases of three research-grade face recognition models. Our synthetic pipeline reveals that for these algorithms, accuracy is lower for Black and East Asian population subgroups. Our method can also quantify how perceptual changes in attributes affect face identity distances reported by these models. Our large synthetic dataset, consisting of 48,000 synthetic face image pairs (10,200 unique synthetic faces) and 555,000 human annotations (individual attributes and pairwise identity comparisons) is available to researchers in this important area.

Deep Fusion Transformer Network with Weighted Vector-Wise Keypoints Voting for Robust 6D Object Pose Estimation

  • paper_url: http://arxiv.org/abs/2308.05438
  • repo_url: https://github.com/junzastar/dftr_voting
  • paper_authors: Jun Zhou, Kai Chen, Linlin Xu, Qi Dou, Jing Qin
  • for: Improve the efficiency and accuracy of 6D object pose estimation from a single RGBD image, in particular the effective fusion of the two modalities.
  • methods: A Deep Fusion Transformer (DFTr) block that aggregates cross-modality features for pose estimation, plus a weighted vector-wise voting algorithm that uses a non-iterative global optimization strategy to localize 3D keypoints precisely with near real-time inference.
  • results: Experiments show the strong generalization and efficiency of the proposed 3D keypoint voting algorithm; results on four widely used benchmarks demonstrate that the method outperforms state-of-the-art methods by large margins.
    Abstract One critical challenge in 6D object pose estimation from a single RGBD image is efficient integration of two different modalities, i.e., color and depth. In this work, we tackle this problem by a novel Deep Fusion Transformer~(DFTr) block that can aggregate cross-modality features for improving pose estimation. Unlike existing fusion methods, the proposed DFTr can better model cross-modality semantic correlation by leveraging their semantic similarity, such that globally enhanced features from different modalities can be better integrated for improved information extraction. Moreover, to further improve robustness and efficiency, we introduce a novel weighted vector-wise voting algorithm that employs a non-iterative global optimization strategy for precise 3D keypoint localization while achieving near real-time inference. Extensive experiments show the effectiveness and strong generalization capability of our proposed 3D keypoint voting algorithm. Results on four widely used benchmarks also demonstrate that our method outperforms the state-of-the-art methods by large margins.
    摘要 基于单幅RGBD图像的6D对象姿态估计中的一个重要挑战是有效融合颜色和深度这两种不同模态。在这项工作中，我们通过一种新的深度融合变换（DFTr）块来解决这一问题，该模块可以聚合跨模态特征以提升姿态估计。与现有融合方法不同，我们提出的DFTr利用语义相似性更好地建模跨模态语义相关性，从而更好地整合来自不同模态的全局增强特征，提高信息提取能力。此外，为了进一步提高鲁棒性和效率，我们引入了一种新的加权向量投票算法，采用非迭代的全局优化策略实现精确的3D关键点定位，同时达到近实时推理。大量实验表明了所提3D关键点投票算法的有效性和强大的泛化能力；在四个广泛使用的基准上的结果也表明，我们的方法大幅超越了现有最先进方法。
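
The weighted vector-wise voting described above is a non-iterative global optimization for 3D keypoint localization. The sketch below shows one standard closed-form formulation (weighted least-squares intersection of voting lines); the function name, array shapes, and toy data are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def weighted_vector_voting(points, directions, weights):
    """Closed-form weighted least-squares estimate of a 3D keypoint.
    Each seed point votes with a unit direction toward the keypoint; the keypoint
    is the point minimizing the weighted squared distance to all voting lines
    (a non-iterative, global solution).
    points: (N, 3), directions: (N, 3) unit vectors, weights: (N,)."""
    I = np.eye(3)
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for p, d, w in zip(points, directions, weights):
        P = I - np.outer(d, d)      # projector onto the plane orthogonal to the vote direction
        A += w * P
        b += w * P @ p
    return np.linalg.solve(A, b)    # keypoint location

# toy usage: recover a keypoint at (1, 2, 3) from noisy votes
rng = np.random.default_rng(0)
gt = np.array([1.0, 2.0, 3.0])
pts = rng.normal(size=(50, 3))
dirs = gt - pts + 0.01 * rng.normal(size=(50, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
print(weighted_vector_voting(pts, dirs, np.ones(50)))  # close to (1, 2, 3)
```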

Ensemble Modeling for Multimodal Visual Action Recognition

  • paper_url: http://arxiv.org/abs/2308.05430
  • repo_url: None
  • paper_authors: Jyoti Kini, Sarah Fleischer, Ishan Dave, Mubarak Shah
  • for: 这篇论文的目的是提出一种多 modal 动作识别的ensemble模型方法。
  • methods: 这篇论文使用了一种基于 focal loss的个别模ality模型训练方法,并提出了一种将 focal loss 调整为适应 MECCANO dataset 的长尾分布的方法。
  • results: 实验结果显示了这种方法的效果。
    Abstract In this work, we propose an ensemble modeling approach for multimodal action recognition. We independently train individual modality models using a variant of focal loss tailored to handle the long-tailed distribution of the MECCANO [21] dataset. Based on the underlying principle of focal loss, which captures the relationship between tail (scarce) classes and their prediction difficulties, we propose an exponentially decaying variant of focal loss for our current task. It initially emphasizes learning from the hard misclassified examples and gradually adapts to the entire range of examples in the dataset. This annealing process encourages the model to strike a balance between focusing on the sparse set of hard samples, while still leveraging the information provided by the easier ones. Additionally, we opt for the late fusion strategy to combine the resultant probability distributions from RGB and Depth modalities for final action prediction. Experimental evaluations on the MECCANO dataset demonstrate the effectiveness of our approach.
    摘要 在这项工作中，我们提出了一种用于多模态动作识别的集成建模方法。我们使用针对MECCANO数据集长尾分布定制的焦点损失变体，独立训练各个模态模型。基于焦点损失刻画尾部（稀有）类别与其预测难度之间关系的基本原理，我们为当前任务提出了一种指数衰减的焦点损失变体：它在初始阶段强调从难以分类的样本中学习，并逐渐适应数据集中的全部样本。这种退火过程鼓励模型在关注稀疏的困难样本与利用容易样本所提供信息之间取得平衡。此外，我们采用后期融合策略，将RGB和深度模态的概率分布组合为最终的动作预测。在MECCANO数据集上的实验评估证明了我们方法的有效性。
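
To make the annealing idea concrete, here is a minimal PyTorch sketch of a focal loss whose focusing parameter decays exponentially with the epoch index, plus softmax-probability late fusion of two modalities. The schedule constants (gamma0, decay) and the class count are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def decaying_focal_loss(logits, targets, epoch, gamma0=2.0, decay=0.1):
    """Focal loss with an exponentially decaying focusing parameter: early epochs
    emphasize hard (tail) samples, later epochs re-include the easy ones."""
    gamma = gamma0 * torch.exp(torch.tensor(-decay * epoch))
    log_probs = F.log_softmax(logits, dim=1)
    probs = log_probs.exp()
    pt = probs.gather(1, targets.unsqueeze(1)).squeeze(1)        # prob of the true class
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return (-(1.0 - pt) ** gamma * log_pt).mean()

def late_fusion(rgb_logits, depth_logits):
    """Combine per-modality predictions by averaging their class distributions."""
    return 0.5 * (F.softmax(rgb_logits, dim=1) + F.softmax(depth_logits, dim=1))

# toy usage (placeholder number of action classes)
logits = torch.randn(8, 61)
targets = torch.randint(0, 61, (8,))
print(decaying_focal_loss(logits, targets, epoch=5))
```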

Speech-Driven 3D Face Animation with Composite and Regional Facial Movements

  • paper_url: http://arxiv.org/abs/2308.05428
  • repo_url: https://github.com/wuhaozhe/audio2face_mm2023
  • paper_authors: Haozhe Wu, Songtao Zhou, Jia Jia, Junliang Xing, Qi Wen, Xiang Wen
  • for: 这篇论文主要针对的是如何使用语音驱动的3D人脸动画,以具有生动的表现和高效的计算。
  • methods: 该方法首先引入了适应性调整模块,通过使用自然语音驱动的非自然表达来动态调整语音驱动的面部表现。其次,该方法保证每帧的面部特征集中注重当前3D人脸的本地空间运动。最后,该方法提出了一种非autoregressive的听音抽象核心,以维护高频人脸运动的细节和高效地进行推理。
  • results: 经过广泛的实验和用户研究,该方法被证明可以胜过当前领先的方法, both qualitatively和quantitatively。
    Abstract Speech-driven 3D face animation poses significant challenges due to the intricacy and variability inherent in human facial movements. This paper emphasizes the importance of considering both the composite and regional natures of facial movements in speech-driven 3D face animation. The composite nature pertains to how speech-independent factors globally modulate speech-driven facial movements along the temporal dimension. Meanwhile, the regional nature alludes to the notion that facial movements are not globally correlated but are actuated by local musculature along the spatial dimension. It is thus indispensable to incorporate both natures for engendering vivid animation. To address the composite nature, we introduce an adaptive modulation module that employs arbitrary facial movements to dynamically adjust speech-driven facial movements across frames on a global scale. To accommodate the regional nature, our approach ensures that each constituent of the facial features for every frame focuses on the local spatial movements of 3D faces. Moreover, we present a non-autoregressive backbone for translating audio to 3D facial movements, which maintains high-frequency nuances of facial movements and facilitates efficient inference. Comprehensive experiments and user studies demonstrate that our method surpasses contemporary state-of-the-art approaches both qualitatively and quantitatively.
    摘要 由于人脸运动固有的复杂性和多样性，语音驱动的3D人脸动画面临诸多挑战。这篇论文强调在语音驱动的3D人脸动画中需要同时考虑人脸运动的复合特性和区域特性。复合特性指语音无关因素在时间维度上对语音驱动的面部运动进行全局调制；区域特性则指面部运动并非全局相关，而是在空间维度上由局部肌肉驱动。因此，必须同时兼顾这两种特性才能生成生动的动画。为了处理复合特性，我们提出了一个自适应调制模块，利用任意面部运动在全局层面跨帧动态调整语音驱动的面部运动。为了适应区域特性，我们的方法保证每帧面部特征的每个组成部分都关注3D人脸的局部空间运动。此外，我们还提出了一种非自回归的主干网络，用于将语音转化为3D面部运动，既保留面部运动的高频细节，又实现高效推理。全面的实验和用户研究表明，我们的方法在定性和定量上均超过了当今最先进方法。
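
As a rough illustration of what an "adaptive modulation" of speech-driven features by a speech-independent motion code could look like, here is a FiLM-style sketch; the module name, dimensions, and the sigmoid-bounded scale are assumptions for illustration, not the authors' actual design.

```python
import torch
import torch.nn as nn

class AdaptiveModulation(nn.Module):
    """FiLM-style global modulation: a speech-independent motion code produces
    per-channel scale and shift that rescale speech-driven features frame-wise.
    Dimensions are placeholders, not the paper's configuration."""
    def __init__(self, motion_dim=64, feat_dim=128):
        super().__init__()
        self.to_scale = nn.Linear(motion_dim, feat_dim)
        self.to_shift = nn.Linear(motion_dim, feat_dim)

    def forward(self, speech_feat, motion_code):
        # speech_feat: (B, T, feat_dim), motion_code: (B, T, motion_dim)
        scale = torch.sigmoid(self.to_scale(motion_code))   # bounded per-channel gain
        shift = self.to_shift(motion_code)
        return scale * speech_feat + shift

mod = AdaptiveModulation()
out = mod(torch.randn(2, 100, 128), torch.randn(2, 100, 64))
print(out.shape)  # torch.Size([2, 100, 128])
```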

Adaptive Low Rank Adaptation of Segment Anything to Salient Object Detection

  • paper_url: http://arxiv.org/abs/2308.05426
  • repo_url: None
  • paper_authors: Ruikai Cui, Siyuan He, Shi Qiu
  • for: 提高Salient Object Detection(SOD)性能
  • methods: 采用适应性训练Segment Anything Model(SAM),利用深度学习中的低级结构,实现对出色对象检测的适应精度
  • results: 对五个RGB数据集进行了全面的质量和量测试,证明了我们的方法在Salient Object Detection领域的性能明显超过了现有方法
    Abstract Foundation models, such as OpenAI's GPT-3 and GPT-4, Meta's LLaMA, and Google's PaLM2, have revolutionized the field of artificial intelligence. A notable paradigm shift has been the advent of the Segment Anything Model (SAM), which has exhibited a remarkable capability to segment real-world objects, trained on 1 billion masks and 11 million images. Although SAM excels in general object segmentation, it lacks the intrinsic ability to detect salient objects, resulting in suboptimal performance in this domain. To address this challenge, we present the Segment Salient Object Model (SSOM), an innovative approach that adaptively fine-tunes SAM for salient object detection by harnessing the low-rank structure inherent in deep learning. Comprehensive qualitative and quantitative evaluations across five challenging RGB benchmark datasets demonstrate the superior performance of our approach, surpassing state-of-the-art methods.
    摘要 基于OpenAI的GPT-3和GPT-4、Meta的LLaMA以及Google的PaLM2等基础模型,我们已经进行了一系列的研究和开发。在人工智能领域,我们发现了一种新的思维方式,即Segment Anything Model(SAM)。SAM在实际世界中 segmentation 任务上表现了非常出色,经过10亿个mask和1100万张图像训练,但它在焦点 объек detection 领域表现不佳,这是因为它缺乏内在的焦点检测能力。为解决这个挑战,我们提出了Segment Salient Object Model(SSOM),一种新的方法,通过利用深度学习中的低级结构来适应性地细化SAM,以提高焦点 объек detection 的性能。我们在五个RGB标准测试数据集上进行了全面的质量和量化评估,结果表明,我们的方法在焦点 объек detection 领域表现出色,超越了当前的状态艺。
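
The low-rank adaptation recipe this paper builds on can be sketched as a frozen linear layer (for example, an attention projection inside SAM) augmented with a trainable low-rank update. The rank, scaling, and initialization below are generic LoRA choices, not necessarily those used for SSOM.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update
    W x + (alpha / r) * B (A x). Only A and B are trained."""
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # keep the foundation model frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: starts as identity adaptation
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(256, 256), r=4)
print(layer(torch.randn(2, 196, 256)).shape)              # torch.Size([2, 196, 256])
```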

Progressive Spatio-temporal Perception for Audio-Visual Question Answering

  • paper_url: http://arxiv.org/abs/2308.05421
  • repo_url: https://github.com/gewu-lab/pstp-net
  • paper_authors: Guangyao Li, Wenxuan Hou, Di Hu
  • for: 这个 paper 的目的是提出一种 Progressive Spatio-Temporal Perception Network (PSTP-Net),用于 answers 视频中的问题。
  • methods: 这个模型包括三个模块:首先,一个时间段选择模块用于选择与问题相关的音频视频段。然后,一个空间区域选择模块用于从选择的时间段中选择与问题相关的区域。最后,一个听力导向的视觉注意力模块用于捕捉音频和选择的空间区域之间的关系。
  • results: 在公开的 MUSIC-AVQA 和 AVQA 数据集上的大量实验结果表明，PSTP-Net 具有高效性和高精度，能够快速、准确地回答视频中的问题。
    Abstract Audio-Visual Question Answering (AVQA) task aims to answer questions about different visual objects, sounds, and their associations in videos. Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, where most of which could be unrelated to the given questions, or even play as interference in answering the content of interest. Oppositely, only focusing on the question-aware audio-visual content could get rid of influence, meanwhile enabling the model to answer more efficiently. In this paper, we propose a Progressive Spatio-Temporal Perception Network (PSTP-Net), which contains three modules that progressively identify key spatio-temporal regions w.r.t. questions. Specifically, a temporal segment selection module is first introduced to select the most relevant audio-visual segments related to the given question. Then, a spatial region selection module is utilized to choose the most relevant regions associated with the question from the selected temporal segments. To further refine the selection of features, an audio-guided visual attention module is employed to perceive the association between auido and selected spatial regions. Finally, the spatio-temporal features from these modules are integrated for answering the question. Extensive experimental results on the public MUSIC-AVQA and AVQA datasets provide compelling evidence of the effectiveness and efficiency of PSTP-Net. Code is available at: \href{https://github.com/GeWu-Lab/PSTP-Net}{https://github.com/GeWu-Lab/PSTP-Net}
    摘要 Audio-Visual问答任务（AVQA）的目的是回答视频中不同的视觉对象、声音及其关联的问题。这些自然多模态视频具有丰富和复杂的音视频组件，其中大多数可能与问题无关，甚至作为干扰因素影响对目标内容的回答。相反，只关注与问题相关的音视频内容，可以排除干扰并更高效地回答问题。在本文中，我们提出了一种渐进式空间时间感知网络（PSTP-Net），它包括三个模块，逐步地标识与问题相关的空间时间区域。具体来说，首先引入时间段选择模块，用于选择与问题相关的音视频段；然后，使用空间区域选择模块从所选时间段中选择与问题相关的区域；为了进一步细化特征选择，我们采用了音频引导视觉注意力模块，以感知音频与所选空间区域之间的关联。最后，这些模块产生的空间时间特征被集成，用于回答问题。我们在公开的MUSIC-AVQA和AVQA数据集上进行了广泛的实验，提供了证明PSTP-Net有效性和效率的有力证据。代码可以在：\href{https://github.com/GeWu-Lab/PSTP-Net}{https://github.com/GeWu-Lab/PSTP-Net}
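
A hedged sketch of the audio-guided visual attention step: the audio feature acts as the query over the selected spatial-region features via standard cross-attention. The dimensions and the single-layer design are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AudioGuidedAttention(nn.Module):
    """Cross-attention where the audio feature queries the selected spatial
    regions, so visual features most associated with the sound are emphasized."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_feat, region_feats):
        # audio_feat: (B, 1, dim); region_feats: (B, R, dim) selected regions
        attended, weights = self.attn(query=audio_feat, key=region_feats, value=region_feats)
        return attended, weights            # (B, 1, dim), (B, 1, R) attention over regions

m = AudioGuidedAttention()
out, w = m(torch.randn(2, 1, 512), torch.randn(2, 16, 512))
print(out.shape, w.shape)
```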

SC3K: Self-supervised and Coherent 3D Keypoints Estimation from Rotated, Noisy, and Decimated Point Cloud Data

  • paper_url: http://arxiv.org/abs/2308.05410
  • repo_url: https://github.com/iit-pavis/sc3k
  • paper_authors: Mohammad Zohaib, Alessio Del Bue
  • for: 这种论文是为了推算ObjectCategory中的关键点的新方法,在实际应用中处理受到噪声、下采样和自由旋转的点云数据(PCD)时。
  • methods: 该论文提出了一种新的自动学习方法,不需要任何注释,可以在没有对象类型的先验知识的情况下推算关键点。这种方法使用了一种新的自我监督训练策略,并且使用了一种协同的辅助损失函数来促进所需的关键点特性。
  • results: 实验结果显示,提出的方法可以更好地 estimating keypoints, coverage提高了+9.41%,同时保持了semantic consistency (+4.66%),这些关键点可以最佳地表示object的3D形状,并且与现有的无监督方法进行比较。代码和数据可以在https://github.com/IITPAVIS/SC3K上下载。
    Abstract This paper proposes a new method to infer keypoints from arbitrary object categories in practical scenarios where point cloud data (PCD) are noisy, down-sampled and arbitrarily rotated. Our proposed model adheres to the following principles: i) keypoints inference is fully unsupervised (no annotation given), ii) keypoints position error should be low and resilient to PCD perturbations (robustness), iii) keypoints should not change their indexes for the intra-class objects (semantic coherence), iv) keypoints should be close to or proximal to PCD surface (compactness). We achieve these desiderata by proposing a new self-supervised training strategy for keypoints estimation that does not assume any a priori knowledge of the object class, and a model architecture with coupled auxiliary losses that promotes the desired keypoints properties. We compare the keypoints estimated by the proposed approach with those of the state-of-the-art unsupervised approaches. The experiments show that our approach outperforms by estimating keypoints with improved coverage (+9.41%) while being semantically consistent (+4.66%) that best characterizes the object's 3D shape for downstream tasks. Code and data are available at: https://github.com/IITPAVIS/SC3K
    摘要 本文提出了一种在点云数据（PCD）存在噪声、降采样和任意旋转的实际场景下，从任意物体类别中推断关键点的新方法。所提模型遵循以下原则：（i）关键点推断完全无监督（不使用任何标注）；（ii）关键点位置误差应当较低，并对PCD扰动保持鲁棒；（iii）同类物体的关键点索引应保持一致（语义一致性）；（iv）关键点应贴近PCD表面（紧凑性）。为此，我们提出了一种不依赖物体类别先验知识的自监督训练策略，以及带有耦合辅助损失的模型结构，以促成上述关键点特性。我们将所提方法估计的关键点与最先进的无监督方法进行了比较：实验表明，我们的方法在覆盖率上提升了9.41%，同时语义一致性提升了4.66%，能够更好地刻画物体的3D形状以服务下游任务。代码和数据见：https://github.com/IITPAVIS/SC3K

Enhancing Low-light Light Field Images with A Deep Compensation Unfolding Network

  • paper_url: http://arxiv.org/abs/2308.05404
  • repo_url: https://github.com/lyuxianqiang/lfll-dcu
  • paper_authors: Xianqiang Lyu, Junhui Hou
  • for: 这篇论文旨在提出一种新的、可解释的端到端学习框架,称为深度补偿解放网络(DCUNet),用于修复低光照条件下捕捉的光场图像。
  • methods: DCUNet采用多Stage结构,模仿解析难题的优化过程,并包括内容相关的深度补偿模块,以抑制噪声和推算光照图和推算结果错误。
  • results: 在仿真和真实数据集上，DCUNet 均优于现有最先进方法，并能更好地保持光场图像的重要几何结构。代码将在 https://github.com/lyuxianqiang/LFLL-DCU 公开。
    Abstract This paper presents a novel and interpretable end-to-end learning framework, called the deep compensation unfolding network (DCUNet), for restoring light field (LF) images captured under low-light conditions. DCUNet is designed with a multi-stage architecture that mimics the optimization process of solving an inverse imaging problem in a data-driven fashion. The framework uses the intermediate enhanced result to estimate the illumination map, which is then employed in the unfolding process to produce a new enhanced result. Additionally, DCUNet includes a content-associated deep compensation module at each optimization stage to suppress noise and illumination map estimation errors. To properly mine and leverage the unique characteristics of LF images, this paper proposes a pseudo-explicit feature interaction module that comprehensively exploits redundant information in LF images. The experimental results on both simulated and real datasets demonstrate the superiority of our DCUNet over state-of-the-art methods, both qualitatively and quantitatively. Moreover, DCUNet preserves the essential geometric structure of enhanced LF images much better. The code will be publicly available at https://github.com/lyuxianqiang/LFLL-DCU.
    摘要 本文提出了一种新颖且可解释的端到端学习框架——深度补偿展开网络（DCUNet），用于恢复低光照条件下采集的光场（LF）图像。DCUNet采用多阶段结构，以数据驱动的方式模拟求解逆成像问题的优化过程。该框架利用中间增强结果估计光照图，并将其用于展开过程以产生新的增强结果。此外，DCUNet在每个优化阶段都包含与内容相关的深度补偿模块，用以抑制噪声和光照图估计误差。为了充分挖掘和利用光场图像的独特特性，本文还提出了一种伪显式特征交互模块，全面利用光场图像中的冗余信息。在仿真和真实数据集上的实验结果表明，DCUNet在定性和定量上均优于现有最先进方法，并且能更好地保持增强后光场图像的关键几何结构。代码将公开于 https://github.com/lyuxianqiang/LFLL-DCU。

Learning Gabor Texture Features for Fine-Grained Recognition

  • paper_url: http://arxiv.org/abs/2308.05396
  • repo_url: None
  • paper_authors: Lanyun Zhu, Tianrun Chen, Jianxiong Yin, Simon See, Jun Liu
  • for: 提高细化识别性能
  • methods: 使用Gabor filters和CNN branches,并提出多种优化策略
  • results: 在多个 dataset 上达到顶尖性能水平
    Abstract Extracting and using class-discriminative features is critical for fine-grained recognition. Existing works have demonstrated the possibility of applying deep CNNs to exploit features that distinguish similar classes. However, CNNs suffer from problems including frequency bias and loss of detailed local information, which restricts the performance of recognizing fine-grained categories. To address the challenge, we propose a novel texture branch as complimentary to the CNN branch for feature extraction. We innovatively utilize Gabor filters as a powerful extractor to exploit texture features, motivated by the capability of Gabor filters in effectively capturing multi-frequency features and detailed local information. We implement several designs to enhance the effectiveness of Gabor filters, including imposing constraints on parameter values and developing a learning method to determine the optimal parameters. Moreover, we introduce a statistical feature extractor to utilize informative statistical information from the signals captured by Gabor filters, and a gate selection mechanism to enable efficient computation by only considering qualified regions as input for texture extraction. Through the integration of features from the Gabor-filter-based texture branch and CNN-based semantic branch, we achieve comprehensive information extraction. We demonstrate the efficacy of our method on multiple datasets, including CUB-200-2011, NA-bird, Stanford Dogs, and GTOS-mobile. State-of-the-art performance is achieved using our approach.
    摘要 抽取和使用类别特征是细化识别的关键。现有工作已经证明了深度Convolutional Neural Networks (CNNs) 可以激活类似类之间的特征。然而，CNNs 受到频率偏见和细化地方信息的失去的限制，这限制了细化类别的识别性。为解决这个挑战，我们提议一种新的纹理分支作为 CNN 分支的补充。我们创新地利用 Gabor filters 作为强大的提取器，以利用纹理特征。我们实施了多种设计来提高 Gabor filters 的效果，包括参数约束和学习方法来确定最佳参数。此外，我们引入了一种统计特征提取器，以利用 Gabor filters 捕捉到的信号中的有用统计信息。还有一种阀选机制，以便只考虑有利的区域作为纹理提取的输入。通过将 Gabor filters 和 CNN 的 semantic 分支结合起来，我们实现了全面的信息提取。我们在多个 dataset 上示范了我们的方法，包括 CUB-200-2011、NA-bird、Stanford Dogs 和 GTOS-mobile。我们的方法实现了现状的最佳性。
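
For readers unfamiliar with Gabor filtering, the sketch below builds a small fixed multi-orientation, multi-frequency Gabor bank and applies it as a convolution over a grayscale input. Note that the paper instead learns the filter parameters under constraints, so this is only a starting-point illustration with assumed kernel sizes and frequencies.

```python
import numpy as np
import torch
import torch.nn.functional as F

def gabor_kernel(ksize, sigma, theta, lambd, gamma=0.5, psi=0.0):
    """Real part of a 2D Gabor filter; wavelength lambd and orientation theta
    control which texture frequencies/directions it responds to."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / lambd + psi)

# 4 orientations x 2 wavelengths, applied as a fixed convolutional filter bank
bank = np.stack([gabor_kernel(11, sigma=3.0, theta=t, lambd=l)
                 for t in np.linspace(0, np.pi, 4, endpoint=False)
                 for l in (4.0, 8.0)])
weight = torch.tensor(bank, dtype=torch.float32).unsqueeze(1)  # (8, 1, 11, 11)
gray = torch.randn(1, 1, 64, 64)                               # grayscale input
responses = F.conv2d(gray, weight, padding=5)
print(responses.shape)                                          # torch.Size([1, 8, 64, 64])
```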

Robust Localization with Visual-Inertial Odometry Constraints for Markerless Mobile AR

  • paper_url: http://arxiv.org/abs/2308.05394
  • repo_url: None
  • paper_authors: Changkun Liu, Yukun Zhao, Tristan Braud
  • for: 这篇论文主要是为了提高无标志 markerless 手持设备的 Augmented Reality (AR) 应用程序的定位精度。
  • methods: 这篇论文提出了一种新的框架,即 VIO-APR,它将 absolute pose regressor (APR) 与地方的 VIO 跟踪系统结合在一起,以便更好地评估 VIO 的可靠性,并通过 APR 来识别和补做 VIO 的漂移。
  • results: 在使用 VIO-APR 后, median 准确性提高了36%(位置)和29%(orientation),高精度(0.25 m, 2°)的帧数增加了112%,而低精度(5 m, 10°)的帧数减少了大量。 VIO-APR 在一个基于 Unity 的手持 AR 应用程序中被实现,并显示了更高的本地化和更稳定的整体体验。
    Abstract Visual Inertial Odometry (VIO) is an essential component of modern Augmented Reality (AR) applications. However, VIO only tracks the relative pose of the device, leading to drift over time. Absolute pose estimation methods infer the device's absolute pose, but their accuracy depends on the input quality. This paper introduces VIO-APR, a new framework for markerless mobile AR that combines an absolute pose regressor (APR) with a local VIO tracking system. VIO-APR uses VIO to assess the reliability of the APR and the APR to identify and compensate for VIO drift. This feedback loop results in more accurate positioning and more stable AR experiences. To evaluate VIO-APR, we created a dataset that combines camera images with ARKit's VIO system output for six indoor and outdoor scenes of various scales. Over this dataset, VIO-APR improves the median accuracy of popular APR by up to 36\% in position and 29\% in orientation, increases the percentage of frames in the high ($0.25 m, 2^{\circ}$) accuracy level by up to 112\% and reduces the percentage of frames predicted below the low ($5 m, 10^\circ$) accuracy greatly. We implement VIO-APR into a mobile AR application using Unity to demonstrate its capabilities. VIO-APR results in noticeably more accurate localization and a more stable overall experience.
    摘要 “几何态对应”(VIO)是现代增强现实(AR)应用程序中的重要组件。然而,VIO只追踪设备的相对位姿,因此会随时间偏移。绝对位姿估推方法可以将设备的绝对位姿估推,但它们的准确度取决于输入质量。本文提出了一个新的框架,即VIO-APR,它结合了绝对位姿估推器(APR)和地方VIO追踪系统。VIO-APR使用VIO评估估推器的可靠性,并使用APR来识别和补偿VIO偏移。这个关键Loop的结果是更加精确的定位和更加稳定的AR体验。为了评估VIO-APR,我们创建了一个具有相机图像和ARKit的VIO系统输出的六个室内和外部景象的数据集。在这个数据集上,VIO-APR提高了流行的APR的中位均值精度(position)和方位精度(orientation)的比例,增加了高精度水平(0.25 m,2°)的frames的百分比,并大幅降低了低精度水平(5 m,10°)的frames的百分比。我们将VIO-APR集成到Unity中的 mobilAR应用程序中,以示其能力。VIO-APR实现了更加精确的定位和更加稳定的AR体验。”

Product Review Image Ranking for Fashion E-commerce

  • paper_url: http://arxiv.org/abs/2308.05390
  • repo_url: None
  • paper_authors: Sangeet Jaiswal, Dhruv Patel, Sreekanth Vempati, Konduru Saiswaroop
  • for: The paper aims to improve the ranking of customer images on a fashion e-commerce platform, as the reliance on User Generated Content (UGC) has increased and the number of customer images has grown.
  • methods: The proposed method uses a training procedure to rank customer images, leveraging distortion techniques to enhance the quality of the images and a network to distinguish between high-quality and bad-quality images.
  • results: The proposed method outperforms baseline models on two metrics, correlation coefficient and accuracy, by substantial margins.
    Abstract In a fashion e-commerce platform where customers can't physically examine the products on their own, being able to see other customers' text and image reviews of the product is critical while making purchase decisions. Given the high reliance on these reviews, over the years we have observed customers proactively sharing their reviews. With an increase in the coverage of User Generated Content (UGC), there has been a corresponding increase in the number of customer images. It is thus imperative to display the most relevant images on top as it may influence users' online shopping choices and behavior. In this paper, we propose a simple yet effective training procedure for ranking customer images. We created a dataset consisting of Myntra (A Major Indian Fashion e-commerce company) studio posts and highly engaged (upvotes/downvotes) UGC images as our starting point and used selected distortion techniques on the images of the above dataset to bring their quality at par with those of bad UGC images. We train our network to rank bad-quality images lower than high-quality ones. Our proposed method outperforms the baseline models on two metrics, namely correlation coefficient, and accuracy, by substantial margins.
    摘要 在一个无法质感产品的电商平台上,能够看到其他顾客的文本和图像评论是购买决策中非常重要的。随着用户生成内容的涵盖率的增加,我们在年月之间观察到顾客积极分享他们的评论。随着用户图像的增加,显示最相关的图像变得非常重要,因为它们可能影响用户在线购物选择和行为。在这篇论文中,我们提出了一种简单 yet 有效的训练方法,用于排序顾客图像。我们使用 Myntra(印度主要的时尚电商公司)的Studio帖子和高度参与度(Upvotes/Downvotes)的用户生成内容图像作为我们的起点,并使用选择的扭曲技术来将这些图像的质量与Bad UGC图像相匹配。我们训练我们的网络,以便将差质图像排名在低于高质图像之前。我们的提议方法在两个纪录中,即相关性系数和准确率,与基准模型相比均显示出了显著的优势。
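
One common way to realize "rank distorted images below their clean counterparts" is a pairwise margin ranking loss on predicted quality scores. The tiny quality network and the noise stand-in for the distortion pipeline below are assumptions, not the paper's actual architecture or distortions.

```python
import torch
import torch.nn as nn

# Placeholder scoring network: maps an image to a single quality score.
quality_net = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))
rank_loss = nn.MarginRankingLoss(margin=0.5)

clean = torch.randn(4, 3, 224, 224)
distorted = clean + 0.3 * torch.randn_like(clean)   # stand-in for the distortion pipeline
s_clean = quality_net(clean).squeeze(1)
s_dist = quality_net(distorted).squeeze(1)
target = torch.ones_like(s_clean)                   # clean images should rank higher
loss = rank_loss(s_clean, s_dist, target)           # max(0, -(s_clean - s_dist) + margin)
loss.backward()
print(float(loss))
```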

HGDNet: A Height-Hierarchy Guided Dual-Decoder Network for Single View Building Extraction and Height Estimation

  • paper_url: http://arxiv.org/abs/2308.05387
  • repo_url: None
  • paper_authors: Chaoran Lu, Ningning Cao, Pan Zhang, Ting Liu, Baochai Peng, Guozhang Liu, Mengke Yuan, Sen Zhang, Simin Huang, Tao Wang
  • for: 提高大规模城市3D重建 task 的性能,即建筑高度估计和建筑检测两个相关任务的统一。
  • methods: 提出了 Height-hierarchy Guided Dual-decoder Network (HGDNet),通过guide synthesized discrete height-hierarchy nDSM来帮助height estimation branch,提高了建筑高度估计的准确性。同时,采用了两个阶段堆叠结构来实现更加准确的建筑EXTRACTION。
  • results: 在 DFC 2023 Track 2 数据集上进行了实验,得到了建筑高度估计(δ1:0.8012)、实例EXTRACTION(AP50:0.7730)和最终的平均分数(0.7871),在测试阶段以第一名的成绩。
    Abstract Unifying the correlative single-view satellite image building extraction and height estimation tasks indicates a promising way to share representations and acquire generalist model for large-scale urban 3D reconstruction. However, the common spatial misalignment between building footprints and stereo-reconstructed nDSM height labels incurs degraded performance on both tasks. To address this issue, we propose a Height-hierarchy Guided Dual-decoder Network (HGDNet) to estimate building height. Under the guidance of synthesized discrete height-hierarchy nDSM, auxiliary height-hierarchical building extraction branch enhance the height estimation branch with implicit constraints, yielding an accuracy improvement of more than 6% on the DFC 2023 track2 dataset. Additional two-stage cascade architecture is adopted to achieve more accurate building extraction. Experiments on the DFC 2023 Track 2 dataset shows the superiority of the proposed method in building height estimation ({\delta}1:0.8012), instance extraction (AP50:0.7730), and the final average score 0.7871 ranks in the first place in test phase.
    摘要 合并相关的单视图卫星图像建筑EXTRACTION和高度估计任务表明了实现大规模城市3D重建的有望途径。然而,通常的空间不同步 между建筑基面和斯特瑞重建的nDSM高度标签会导致两个任务的性能下降。为解决这个问题,我们提议一种高度层导航分布式双解调网络(HGDNet)来估算建筑高度。在带有生成的精制Height-层次nDSM的指导下,增强高度估计分支可以提供隐式约束,从而提高高度估计精度。实验表明,我们提议的方法在DFC 2023 Track 2数据集上的高度估计(δ1:0.8012)、实例EXTRACTION(AP50:0.7730)和最终平均分数(0.7871)均达到了领先地位。

Interaction-aware Joint Attention Estimation Using People Attributes

  • paper_url: http://arxiv.org/abs/2308.05382
  • repo_url: https://github.com/chihina/pjae
  • paper_authors: Chihiro Nakatani, Hiroaki Kawashima, Norimichi Ukita
  • for: 这种论文旨在提出一种基于单个图像的共享注意力估计方法,与之前相关的工作不同,它不仅使用人具有关注相关的特征,还使用人的位置和行为作为共享注意力的上下文指标进行权重。
  • methods: 我们提出了一种新的Transformer基于注意力网络来编码共享注意力为低维特征,并引入了特殊的MLP头并 позицион嵌入,以便预测每个像素的共享注意力信任度,从而生成信任度热图。
  • results: 我们的方法在比较实验中与领先方法进行比较,并表明我们的方法在量化方面得到了提高。代码:https://anonymous.4open.science/r/anonymized_codes-ECA4.
    Abstract This paper proposes joint attention estimation in a single image. Different from related work in which only the gaze-related attributes of people are independently employed, (I) their locations and actions are also employed as contextual cues for weighting their attributes, and (ii) interactions among all of these attributes are explicitly modeled in our method. For the interaction modeling, we propose a novel Transformer-based attention network to encode joint attention as low-dimensional features. We introduce a specialized MLP head with positional embedding to the Transformer so that it predicts pixelwise confidence of joint attention for generating the confidence heatmap. This pixelwise prediction improves the heatmap accuracy by avoiding the ill-posed problem in which the high-dimensional heatmap is predicted from the low-dimensional features. The estimated joint attention is further improved by being integrated with general image-based attention estimation. Our method outperforms SOTA methods quantitatively in comparative experiments. Code: https://anonymous.4open.science/r/anonymized_codes-ECA4.
    摘要 这篇论文提出了基于单幅图像的联合注意力估计方法。与相关工作不同，我们不仅独立使用人物的注视相关属性，还利用人物的位置和动作作为加权这些属性的上下文线索，并显式建模所有这些属性之间的交互。为了建模交互，我们提出了一种基于Transformer的注意力网络，将联合注意力编码为低维特征。我们还为Transformer引入了带位置嵌入的专用MLP头，用于预测每个像素的联合注意力置信度，从而生成置信度热图。这种像素级预测避免了从低维特征预测高维热图的不适定问题。估计出的联合注意力进一步与通用的基于图像的注意力估计相结合而得到改进。我们的方法在对比实验中定量地优于最先进方法。代码：https://anonymous.4open.science/r/anonymized_codes-ECA4。

Flexible Isosurface Extraction for Gradient-Based Mesh Optimization

  • paper_url: http://arxiv.org/abs/2308.05371
  • repo_url: None
  • paper_authors: Tianchang Shen, Jacob Munkberg, Jon Hasselgren, Kangxue Yin, Zian Wang, Wenzheng Chen, Zan Gojcic, Sanja Fidler, Nicholas Sharp, Jun Gao
  • for: 本研究探讨了基于梯度的网格优化,即通过代表3D表面网格为某个拟合函数的iso面来进行优化,这种方法在摄影、生成模型、反向物理等应用中越来越普遍。现有的实现都是基于经典iso面提取算法如追溯立方体和双重缘界,这些技术是为了从固定的known场景中提取网格,而在优化设置下缺乏可以表示高质量特征保持网格的自由度,或者受到数值不稳定的问题。
  • methods: 我们引入了特定的参数,以便在提取网格时进行本地灵活的调整,以保持网格的准确性和可见性。这些参数通过自动微分准确地与背景场景的拟合函数一起更新。我们基于双重追溯立方体来提取网格,并提供了可选的生成四面体和层次适应网格的扩展。
  • results: 我们的实验表明,FlexiCubes可以在synthetic benchmark和实际应用中提供显著的改进,提高网格质量和准确性。
    Abstract This work considers gradient-based mesh optimization, where we iteratively optimize for a 3D surface mesh by representing it as the isosurface of a scalar field, an increasingly common paradigm in applications including photogrammetry, generative modeling, and inverse physics. Existing implementations adapt classic isosurface extraction algorithms like Marching Cubes or Dual Contouring; these techniques were designed to extract meshes from fixed, known fields, and in the optimization setting they lack the degrees of freedom to represent high-quality feature-preserving meshes, or suffer from numerical instabilities. We introduce FlexiCubes, an isosurface representation specifically designed for optimizing an unknown mesh with respect to geometric, visual, or even physical objectives. Our main insight is to introduce additional carefully-chosen parameters into the representation, which allow local flexible adjustments to the extracted mesh geometry and connectivity. These parameters are updated along with the underlying scalar field via automatic differentiation when optimizing for a downstream task. We base our extraction scheme on Dual Marching Cubes for improved topological properties, and present extensions to optionally generate tetrahedral and hierarchically-adaptive meshes. Extensive experiments validate FlexiCubes on both synthetic benchmarks and real-world applications, showing that it offers significant improvements in mesh quality and geometric fidelity.
    摘要 本文研究基于梯度的网格优化，即将3D表面网格表示为标量场的等值面并进行迭代优化，这一范式在摄影测量、生成建模和逆向物理等应用中日益普遍。现有实现多沿用经典等值面提取算法（如Marching Cubes或Dual Contouring）；这些技术为从固定已知场中提取网格而设计，在优化场景下要么缺乏表示高质量、保持特征的网格所需的自由度，要么存在数值不稳定问题。为此，我们提出FlexiCubes，一种专为针对几何、视觉甚至物理目标优化未知网格而设计的等值面表示。其核心思想是在表示中引入额外的精心选择的参数，使提取出的网格几何与连接关系可以进行局部灵活调整；在针对下游任务优化时，这些参数与底层标量场一起通过自动微分更新。我们以Dual Marching Cubes为基础以获得更好的拓扑性质，并给出可选生成四面体网格和层次自适应网格的扩展。在合成基准和真实应用上的大量实验验证了FlexiCubes在网格质量和几何保真度方面的显著提升。

TriDo-Former: A Triple-Domain Transformer for Direct PET Reconstruction from Low-Dose Sinograms

  • paper_url: http://arxiv.org/abs/2308.05365
  • repo_url: https://github.com/gluucose/TriDoFormer
  • paper_authors: Jiaqi Cui, Pinxian Zeng, Xinyi Zeng, Peng Wang, Xi Wu, Jiliu Zhou, Yan Wang, Dinggang Shen
  • for: 提高标准剂量Positron发射Tomography(PET)图像质量,最小化辐射暴露
  • methods: 使用 transformer 模型,联合三个频域(sinogram、图像、频率)进行直接 PET 重建
  • results: 比对 estado-of-the-art 方法,TriDo-Former 能够提供更高质量的 PET 图像,同时减少辐射暴露
    Abstract To obtain high-quality positron emission tomography (PET) images while minimizing radiation exposure, various methods have been proposed for reconstructing standard-dose PET (SPET) images from low-dose PET (LPET) sinograms directly. However, current methods often neglect boundaries during sinogram-to-image reconstruction, resulting in high-frequency distortion in the frequency domain and diminished or fuzzy edges in the reconstructed images. Furthermore, the convolutional architectures, which are commonly used, lack the ability to model long-range non-local interactions, potentially leading to inaccurate representations of global structures. To alleviate these problems, we propose a transformer-based model that unites triple domains of sinogram, image, and frequency for direct PET reconstruction, namely TriDo-Former. Specifically, the TriDo-Former consists of two cascaded networks, i.e., a sinogram enhancement transformer (SE-Former) for denoising the input LPET sinograms and a spatial-spectral reconstruction transformer (SSR-Former) for reconstructing SPET images from the denoised sinograms. Different from the vanilla transformer that splits an image into 2D patches, based specifically on the PET imaging mechanism, our SE-Former divides the sinogram into 1D projection view angles to maintain its inner-structure while denoising, preventing the noise in the sinogram from prorogating into the image domain. Moreover, to mitigate high-frequency distortion and improve reconstruction details, we integrate global frequency parsers (GFPs) into SSR-Former. The GFP serves as a learnable frequency filter that globally adjusts the frequency components in the frequency domain, enforcing the network to restore high-frequency details resembling real SPET images. Validations on a clinical dataset demonstrate that our TriDo-Former outperforms the state-of-the-art methods qualitatively and quantitatively.
    摘要 为了在尽量减少辐射暴露的同时获得高质量的正电子发射断层扫描（PET）图像，已有多种方法被提出，用于直接从低剂量PET（LPET）正弦图重建标准剂量PET（SPET）图像。然而，现有方法在正弦图到图像的重建过程中常常忽略边界信息，导致频域中的高频失真以及重建图像边缘模糊。此外，常用的卷积结构难以建模长距离非局部交互，可能导致全局结构表达不准确。为了解决这些问题，我们提出了一种基于Transformer、联合正弦图、图像和频率三个域的直接PET重建模型，称为TriDo-Former。具体而言，TriDo-Former由两个级联网络组成：用于对输入LPET正弦图去噪的正弦图增强Transformer（SE-Former），以及从去噪后的正弦图重建SPET图像的空间-频谱重建Transformer（SSR-Former）。与将图像划分为2D图块的普通Transformer不同，SE-Former依据PET成像机制将正弦图按1D投影视角划分，在去噪的同时保持其内部结构，防止正弦图中的噪声传播到图像域。此外，为了减轻高频失真并改善重建细节，我们在SSR-Former中引入了全局频率解析器（GFP）。GFP作为可学习的频率滤波器，在频域中全局调整各频率成分，促使网络恢复接近真实SPET图像的高频细节。在临床数据集上的验证表明，TriDo-Former在定性和定量上均优于现有最先进方法。
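
A minimal reading of the global frequency parser (GFP) idea: reweight feature-map frequency components with a learnable filter via an FFT round trip. The parameter shape and all-ones initialization are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class GlobalFrequencyParser(nn.Module):
    """Learnable filter applied in the 2D frequency domain: FFT the features,
    reweight frequency components with a learned mask, then inverse FFT."""
    def __init__(self, channels, h, w):
        super().__init__()
        self.filter = nn.Parameter(torch.ones(channels, h, w // 2 + 1))  # rfft2 half-spectrum width

    def forward(self, x):                      # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")
        spec = spec * self.filter              # per-frequency, per-channel reweighting
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

gfp = GlobalFrequencyParser(channels=32, h=64, w=64)
print(gfp(torch.randn(2, 32, 64, 64)).shape)   # torch.Size([2, 32, 64, 64])
```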

Pseudo-label Alignment for Semi-supervised Instance Segmentation

  • paper_url: http://arxiv.org/abs/2308.05359
  • repo_url: https://github.com/hujiecpp/pais
  • paper_authors: Jie Hu, Chen Chen, Liujuan Cao, Shengchuan Zhang, Annan Shu, Guannan Jiang, Rongrong Ji
  • for: 提高 semi-supervised instance segmentation 的性能,特别是在有限 Label 的情况下。
  • methods: 提出一个名为 pseudo-label aligning instance segmentation (PAIS) 的新框架,通过动态调整 semi-supervised loss 的权重,以适应不同的 class 和 mask 质量。
  • results: 在 COCO 和 Cityscapes datasets 上进行了广泛的实验,显示 PAIS 是一种有 promise 的 semi-supervised instance segmentation 框架,特别是在 Label 数据受限的情况下。 在 COCO dataset 上,只使用 1% 的 Label 数据,PAIS 可以达到 21.2 mAP(基于 Mask-RCNN)和 19.9 mAP(基于 K-Net)的性能,比 current state-of-the-art 模型 NoisyBoundary 高出 12 点之多。
    Abstract Pseudo-labeling is significant for semi-supervised instance segmentation, which generates instance masks and classes from unannotated images for subsequent training. However, in existing pipelines, pseudo-labels that contain valuable information may be directly filtered out due to mismatches in class and mask quality. To address this issue, we propose a novel framework, called pseudo-label aligning instance segmentation (PAIS), in this paper. In PAIS, we devise a dynamic aligning loss (DALoss) that adjusts the weights of semi-supervised loss terms with varying class and mask score pairs. Through extensive experiments conducted on the COCO and Cityscapes datasets, we demonstrate that PAIS is a promising framework for semi-supervised instance segmentation, particularly in cases where labeled data is severely limited. Notably, with just 1\% labeled data, PAIS achieves 21.2 mAP (based on Mask-RCNN) and 19.9 mAP (based on K-Net) on the COCO dataset, outperforming the current state-of-the-art model, \ie, NoisyBoundary with 7.7 mAP, by a margin of over 12 points. Code is available at: \url{https://github.com/hujiecpp/PAIS}.
    摘要 伪标签对半监督实例分割非常重要，它可以从无标注图像中生成实例掩码和类别，用于后续训练。然而，在现有流程中，由于类别与掩码质量不匹配，包含有价值信息的伪标签可能会被直接过滤掉。为解决这一问题，我们在这篇论文中提出了一种新的框架，即伪标签对齐实例分割（PAIS）。在 PAIS 中，我们设计了一种动态对齐损失（DALoss），根据不同的类别分数与掩码分数组合来调整半监督损失项的权重。通过在 COCO 和 Cityscapes 数据集上的大量实验，我们证明了 PAIS 是半监督实例分割的一个有前途的框架，尤其是在标注数据极为有限的情况下。值得注意的是，仅使用 1% 的标注数据，PAIS 在 COCO 数据集上达到 21.2 mAP（基于 Mask-RCNN）和 19.9 mAP（基于 K-Net），比当前最佳模型 NoisyBoundary（7.7 mAP）高出 12 点以上。代码可见：\url{https://github.com/hujiecpp/PAIS}。
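
As an illustration of weighting (rather than hard-filtering) pseudo-labels by their joint class and mask quality, the toy function below assigns soft weights to semi-supervised loss terms. The actual DALoss formulation in the paper is more involved; the temperature and threshold here are arbitrary.

```python
import torch

def pseudo_label_weights(cls_scores, mask_scores, tau=0.5):
    """Soft weights for unlabeled loss terms: pseudo-labels whose class and mask
    quality disagree are down-weighted in proportion to their joint score instead
    of being discarded. tau is an illustrative threshold."""
    joint = cls_scores * mask_scores                 # (N,) per pseudo-instance
    return torch.sigmoid((joint - tau) / 0.1)        # smooth weighting in (0, 1)

cls = torch.tensor([0.95, 0.40, 0.80])
msk = torch.tensor([0.90, 0.85, 0.30])
w = pseudo_label_weights(cls, msk)
# unlabeled_loss = (w * per_instance_loss).sum() / w.sum()   # applied to the detection/segmentation heads
print(w)
```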

Fine-grained building roof instance segmentation based on domain adapted pretraining and composite dual-backbone

  • paper_url: http://arxiv.org/abs/2308.05358
  • repo_url: None
  • paper_authors: Guozhang Liu, Baochai Peng, Ting Liu, Pan Zhang, Mengke Yuan, Chaoran Lu, Ningning Cao, Sen Zhang, Simin Huang, Tao Wang
  • for: 这篇论文是为了提出一个能够实现单独建筑物的Semantic Interpretation,并且能够扩展到高分辨率光偏振仪影像中的建筑物实例分割器。
  • methods: 这篇论文使用了域 adaptation pre-training策略和composite dual-backbone来推广特征学习,以及一个新的数据增强管线、Stochastic Weight Averaging(SWA)训练和实例分割器模型的整合。
  • results: 实验结果显示,我们的方法在2023 IEEE GRSS Data Fusion Contest(DFC)Track 1测试阶段中获得第一名($mAP_{50}$:50.6%),并且我们还探索了使用光偏振仪影像和SAR数据的多modal资料融合的潜力。
    Abstract The diversity of building architecture styles of global cities situated on various landforms, the degraded optical imagery affected by clouds and shadows, and the significant inter-class imbalance of roof types pose challenges for designing a robust and accurate building roof instance segmentor. To address these issues, we propose an effective framework to fulfill semantic interpretation of individual buildings with high-resolution optical satellite imagery. Specifically, the leveraged domain adapted pretraining strategy and composite dual-backbone greatly facilitates the discriminative feature learning. Moreover, new data augmentation pipeline, stochastic weight averaging (SWA) training and instance segmentation based model ensemble in testing are utilized to acquire additional performance boost. Experiment results show that our approach ranks in the first place of the 2023 IEEE GRSS Data Fusion Contest (DFC) Track 1 test phase ($mAP_{50}$:50.6\%). Note-worthily, we have also explored the potential of multimodal data fusion with both optical satellite imagery and SAR data.
    摘要 global市区建筑风格多样性、云影和阴影对抗减弱了建筑瓦片实例分割的精度和稳定性,为设计一个稳定和准确的建筑瓦片实例分割器带来挑战。为解决这些问题,我们提出了一种有效的框架,以实现高分辨率光学卫星图像中建筑物的 semantic解释。具体来说,我们利用了领域适应性预训练策略和复合双轴核心,以便更好地学习抽象特征。此外,我们还使用了新的数据增强管道、随机权重平均(SWA)训练和测试阶段的实例分割器模型ensemble,以获得更高的性能提升。实验结果表明,我们的方法在2023年IEEE GRSS数据融合大赛(DFC)赛道1测试阶段中得分第一($mAP_{50}$:50.6%)。另外,我们还探索了多Modal数据融合,使用光学卫星图像和SAR数据。
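
Stochastic Weight Averaging, mentioned as one of the training tricks for the extra performance boost, can be applied with PyTorch's built-in utilities as sketched below; the model, data, and epoch thresholds are placeholders, not the authors' setup.

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
swa_model = AveragedModel(model)                 # running average of collected weights
swa_scheduler = SWALR(optimizer, swa_lr=0.005)   # constant/annealed LR for the SWA phase

loader = [(torch.randn(4, 16), torch.randint(0, 2, (4,))) for _ in range(10)]
for epoch in range(20):
    for x, y in loader:
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if epoch >= 15:                              # start averaging late in training
        swa_model.update_parameters(model)
        swa_scheduler.step()

update_bn(loader, swa_model)                     # recompute BN statistics for the averaged weights
```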

TCSloT: Text Guided 3D Context and Slope Aware Triple Network for Dental Implant Position Prediction

  • paper_url: http://arxiv.org/abs/2308.05355
  • repo_url: None
  • paper_authors: Xinquan Yang, Jinheng Xie, Xuechen Li, Xuguang Li, Linlin Shen, Yongqiang Deng
  • for: 这个论文的目的是为了提高骨嵌入式仪器的适应率和精度。
  • methods: 这个论文使用了一种名为TCSloT的三维文本引导和倾斜识别网络,以及一个名为TVP的文本变化识别模组,以及一个名为SAL的倾斜敏感损失函数。
  • results: 根据五次十字验证,这个TCSloT方法在骨嵌入式仪器数据集上表现出色,较之前方法有更高的适应率和精度。
    Abstract In implant prosthesis treatment, the surgical guide of implant is used to ensure accurate implantation. However, such design heavily relies on the manual location of the implant position. When deep neural network has been proposed to assist the dentist in locating the implant position, most of them take a single slice as input, which do not fully explore 3D contextual information and ignoring the influence of implant slope. In this paper, we design a Text Guided 3D Context and Slope Aware Triple Network (TCSloT) which enables the perception of contextual information from multiple adjacent slices and awareness of variation of implant slopes. A Texture Variation Perception (TVP) module is correspondingly elaborated to process the multiple slices and capture the texture variation among slices and a Slope-Aware Loss (SAL) is proposed to dynamically assign varying weights for the regression head. Additionally, we design a conditional text guidance (CTG) module to integrate the text condition (i.e., left, middle and right) from the CLIP for assisting the implant position prediction. Extensive experiments on a dental implant dataset through five-fold cross-validation demonstrated that the proposed TCSloT achieves superior performance than existing methods.
    摘要 在附加 prósthesis 治疗中,针对implant的外科引导被用来确保精准的植入。然而,这种设计依赖于手动定位implant的位置。当deep neural network被提议以帮助 dentist 定位implant的位置时,大多数其中的输入是单个slice,这不完全探索3DContextual信息和忽略了implant的 Slope的影响。在这篇论文中,我们设计了Text Guided 3D Context and Slope Aware Triple Network (TCSloT),它能够从多个相邻slice中捕捉Contextual信息和implant Slope的变化。同时,我们还设计了Texture Variation Perception (TVP)模块,用于处理多个slice和捕捉slice之间的Texture variation。此外,我们还提出了Slope-Aware Loss (SAL),以动态分配不同权重 для regression head。此外,我们还设计了 conditional text guidance (CTG)模块,用于integrating text condition (例如,左、中、右) from CLIP,以帮助implant的位置预测。经过对一个 dental implant 数据集的五次横断验证,我们的TCSloT方法已经达到了现有方法的超越性表现。

Towards General and Fast Video Derain via Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2308.05346
  • repo_url: None
  • paper_authors: Defang Cai, Pan Mu, Sixian Chan, Zhanpeng Shao, Cong Bai
  • for: 这篇论文是关于视觉系统中降水的影响和去雨的方法。
  • methods: 该论文提出了一种基于知识储存的普适 видео去雨网络(名为RRGNet),可以处理不同类型的雨束。特别是,我们设计了一个帧分组化的Encoder-Decoder网络,利用视频的时间信息。此外,我们还使用了老任务模型来引导当前模型学习新的雨束类型,而不是忘记之前学习的知识。
  • results: 我们的开发的通用方法在运行速度和去雨效果方面达到了最佳效果。
    Abstract As a common natural weather condition, rain can obscure video frames and thus affect the performance of the visual system, so video derain receives a lot of attention. In natural environments, rain has a wide variety of streak types, which increases the difficulty of the rain removal task. In this paper, we propose a Rain Review-based General video derain Network via knowledge distillation (named RRGNet) that handles different rain streak types with one pre-training weight. Specifically, we design a frame grouping-based encoder-decoder network that makes full use of the temporal information of the video. Further, we use the old task model to guide the current model in learning new rain streak types while avoiding forgetting. To consolidate the network's ability to derain, we design a rain review module to play back data from old tasks for the current model. The experimental results show that our developed general method achieves the best results in terms of running speed and derain effect.
    摘要 雨是常见的自然天气条件，会遮挡视频帧，影响视觉系统的性能，因此视频去雨受到了很多关注。在自然环境中，雨具有多种不同的雨纹类型，使得去雨任务更加困难。在这篇论文中，我们提出了一种基于雨回顾的通用视频去雨网络（RRGNet），通过知识蒸馏，仅用一套预训练权重即可处理不同的雨纹类型。具体来说，我们设计了基于帧分组的编码器-解码器网络，充分利用视频的时间信息。此外，我们利用旧任务模型引导当前模型学习新的雨纹类型，同时避免遗忘。为了巩固网络的去雨能力，我们设计了一个雨回顾模块，将旧任务的数据回放给当前模型。实验结果表明，我们提出的通用方法在运行速度和去雨效果上均取得了最佳结果。
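
The "rain review" idea is a continual-learning pattern: replay old-task data and have the frozen old-task model supervise the current model. The sketch below is a generic distillation-style loss, not the exact RRGNet objective; the models, tensors, and the alpha weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def rain_review_loss(student, teacher, new_frames, replay_frames, alpha=0.5):
    """Learn the new rain-streak type while an old-task (teacher) model supervises
    the student's output on replayed old-task data, reducing forgetting."""
    # derain loss on the new rain type (paired clean targets assumed available)
    new_out = student(new_frames["rainy"])
    derain = F.l1_loss(new_out, new_frames["clean"])
    # review loss: stay consistent with the frozen old-task model on replayed data
    with torch.no_grad():
        old_out = teacher(replay_frames)
    review = F.l1_loss(student(replay_frames), old_out)
    return derain + alpha * review

# toy usage with placeholder single-layer "networks"
student = torch.nn.Conv2d(3, 3, 3, padding=1)
teacher = torch.nn.Conv2d(3, 3, 3, padding=1).eval()
new = {"rainy": torch.randn(2, 3, 64, 64), "clean": torch.randn(2, 3, 64, 64)}
print(float(rain_review_loss(student, teacher, new, torch.randn(2, 3, 64, 64))))
```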

Prostate Age Gap (PAG): An MRI surrogate marker of aging for prostate cancer detection

  • paper_url: http://arxiv.org/abs/2308.05344
  • repo_url: None
  • paper_authors: Alvaro Fernandez-Quilez, Tobias Nordström, Fredrik Jäderling, Svein Reidar Kjosavik, Martin Eklund
  • for: 这个研究的目的是确定Prostate Age Gap(PAG)作为抑肾癌(PC)风险的估计工具。
  • methods: 这个研究使用了深度学习模型,通过对7243个肾镜片的训练和测试,来评估PAG的估计能力。
  • results: 研究发现,PAG在识别高度重要的PC风险上显示出了更好的预测能力,并且与PSA水平和PI-RADS>=3进行比较,得到了更高的AUC值(0.981)。
    Abstract Background: Prostate cancer (PC) MRI-based risk calculators are commonly based on biological (e.g. PSA), MRI markers (e.g. volume), and patient age. Whilst patient age measures the amount of years an individual has existed, biological age (BA) might better reflect the physiology of an individual. However, surrogates from prostate MRI and linkage with clinically significant PC (csPC) remain to be explored. Purpose: To obtain and evaluate Prostate Age Gap (PAG) as an MRI marker tool for csPC risk. Study type: Retrospective. Population: A total of 7243 prostate MRI slices from 468 participants who had undergone prostate biopsies. A deep learning model was trained on 3223 MRI slices cropped around the gland from 81 low-grade PC (ncsPC, Gleason score <=6) and 131 negative cases and tested on the remaining 256 participants. Assessment: Chronological age was defined as the age of the participant at the time of the visit and used to train the deep learning model to predict the age of the patient. Following, we obtained PAG, defined as the model predicted age minus the patient's chronological age. Multivariate logistic regression models were used to estimate the association through odds ratio (OR) and predictive value of PAG and compared against PSA levels and PI-RADS>=3. Statistical tests: T-test, Mann-Whitney U test, Permutation test and ROC curve analysis. Results: The multivariate adjusted model showed a significant difference in the odds of clinically significant PC (csPC, Gleason score >=7) (OR =3.78, 95% confidence interval (CI):2.32-6.16, P <.001). PAG showed a better predictive ability when compared to PI-RADS>=3 and adjusted by other risk factors, including PSA levels: AUC =0.981 vs AUC =0.704, p<.001. Conclusion: PAG was significantly associated with the risk of clinically significant PC and outperformed other well-established PC risk factors.
    摘要 背景：基于MRI的前列腺癌（PC）风险计算器通常依赖生物学指标（如PSA）、MRI指标（如体积）和患者年龄。患者的实际年龄衡量的是其存活的年数，而生物学年龄（BA）可能更好地反映个体的生理状态；但基于前列腺MRI的替代指标及其与临床显著前列腺癌（csPC）的关联仍有待探索。目的：获得并评估前列腺年龄差（PAG）作为csPC风险的MRI标志物。研究类型：回顾性。研究人群：来自468名接受前列腺活检受试者的共7243张前列腺MRI切片。深度学习模型在81例低级别PC（ncsPC，Gleason评分<=6）和131例阴性病例的3223张围绕腺体裁剪的MRI切片上训练，并在其余256名受试者上测试。评估方法：以就诊时的实际年龄训练深度学习模型预测患者年龄；随后计算PAG，即模型预测年龄减去实际年龄。采用多变量logistic回归模型，通过比值比（OR）估计PAG的关联及预测价值，并与PSA水平和PI-RADS>=3进行比较。统计检验：T检验、Mann-Whitney U检验、置换检验和ROC曲线分析。结果：多变量调整模型显示临床显著PC（csPC，Gleason评分>=7）的发生几率存在显著差异（OR=3.78，95%置信区间（CI）：2.32-6.16，P<.001）。在纳入其他风险因素（包括PSA水平）调整后，PAG的预测能力优于PI-RADS>=3：AUC=0.981对比AUC=0.704，p<.001。结论：PAG与临床显著PC风险显著相关，并优于其他公认的PC风险因素。
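
To make the PAG analysis concrete, the snippet below computes odds ratios from a multivariate logistic regression on synthetic data, where PAG is defined as predicted age minus chronological age. The data are fabricated stand-ins for illustration only, not the study's cohort, and the covariates are a simplified subset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400
chron_age = rng.uniform(50, 75, n)
pag = rng.normal(0, 3, n)                       # stand-in for predicted_age - chronological_age
psa = rng.lognormal(1.0, 0.5, n)
risk = 1 / (1 + np.exp(-(0.4 * pag + 0.1 * (psa - 3))))
cs_pc = rng.binomial(1, risk)                   # synthetic csPC labels

X = np.column_stack([pag, psa, chron_age])
clf = LogisticRegression(max_iter=1000).fit(X, cs_pc)
odds_ratios = np.exp(clf.coef_[0])              # per-unit odds ratio for each covariate
print(dict(zip(["PAG", "PSA", "age"], odds_ratios.round(2))))
```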

RLSAC: Reinforcement Learning enhanced Sample Consensus for End-to-End Robust Estimation

  • paper_url: http://arxiv.org/abs/2308.05318
  • repo_url: https://github.com/irmvlab/rlsac
  • paper_authors: Chang Nie, Guangming Wang, Zhe Liu, Luca Cavalli, Marc Pollefeys, Hesheng Wang
  • for: 本文提出了一种基于强化学习的样本共识框架(RLSAC),用于终端 robust 估计。RLSAC 利用图 нейрон网络来利用数据和历史信息,以指导下一步样本的选择。
  • methods: 本文提出了一种基于强化学习的样本共识框架(RLSAC),其中使用了图 нейрон网络,并且通过下游任务的反馈来进行无监督训练。
  • results: 实验结果表明,RLSAC 可以从特征学习到慢慢探索更好的假设。此外,RLSAC 还可以轻松地转移到其他样本共识基于稳健估计的任务上。
    Abstract Robust estimation is a crucial and still challenging task, which involves estimating model parameters in noisy environments. Although conventional sampling consensus-based algorithms sample several times to achieve robustness, these algorithms cannot use data features and historical information effectively. In this paper, we propose RLSAC, a novel Reinforcement Learning enhanced SAmple Consensus framework for end-to-end robust estimation. RLSAC employs a graph neural network to utilize both data and memory features to guide exploring directions for sampling the next minimum set. The feedback of downstream tasks serves as the reward for unsupervised training. Therefore, RLSAC can avoid differentiating to learn the features and the feedback of downstream tasks for end-to-end robust estimation. In addition, RLSAC integrates a state transition module that encodes both data and memory features. Our experimental results demonstrate that RLSAC can learn from features to gradually explore a better hypothesis. Through analysis, it is apparent that RLSAC can be easily transferred to other sampling consensus-based robust estimation tasks. To the best of our knowledge, RLSAC is also the first method that uses reinforcement learning to sample consensus for end-to-end robust estimation. We release our codes at https://github.com/IRMVLab/RLSAC.
    摘要 强健估算是一项关键但又具有挑战性的任务,它涉及到在噪声环境中估算模型参数。 conventional sampling consensus-based algorithms 通常需要多次抽样来实现强健性,但这些算法无法有效地利用数据特征和历史信息。 在这篇论文中,我们提出了RLSAC,一种基于强化学习的SAmple Consensus框架,用于终端强健估算。 RLSAC 利用图 neural network 来利用数据和记忆特征来引导抽样下一个最小集的探索方向。 下游任务的反馈作为无监督训练的奖励,因此RLSAC可以避免学习特征和下游任务的反馈来实现终端强健估算。 此外,RLSAC 还 integrates 一个状态转移模块,该模块编码了数据和记忆特征。我们的实验结果表明,RLSAC 可以从特征中逐步探索更好的假设。 通过分析,可以看出,RLSAC 可以轻松地转移到其他抽样consensus-based 强健估算任务。 到目前为止,RLSAC 是我们知道的第一种使用强化学习来抽样consensus的方法,我们在 GitHub 上发布了代码,请参考 https://github.com/IRMVLab/RLSAC。

Deep Semantic Graph Matching for Large-scale Outdoor Point Clouds Registration

  • paper_url: http://arxiv.org/abs/2308.05314
  • repo_url: None
  • paper_authors: Shaocong Liu, Tao Wang, Yan Zhang, Ruqin Zhou, Li Li, Chenguang Dai, Yongsheng Zhang, Hanyun Wang
  • for: 本研究针对大规模室外点云的配准问题。
  • methods: 首先，利用大规模点云语义分割网络获取点云的语义类别标签，并通过欧氏聚类算法将具有相同类别标签的相邻点聚合为语义实例；其次，基于语义实例之间的空间邻接关系构建语义邻接图，通过图卷积网络学习几何形状特征、语义类别特征和空间分布特征三类高维特征，并利用注意力机制加以增强；最后，将语义实例匹配问题建模为最优传输问题，并通过最优匹配层求解。
  • results: 实验结果表明，所提方法在KITTI Odometry数据集上的平均相对平移误差和平均相对旋转误差分别为6.6cm和0.229°。
    Abstract The current point cloud registration methods are mainly based on geometric information and usually ignore the semantic information in the point clouds. In this paper, we treat the point cloud registration problem as a semantic instance matching and registration task, and propose a deep semantic graph matching method for large-scale outdoor point cloud registration. Firstly, the semantic category labels of 3D point clouds are obtained by utilizing large-scale point cloud semantic segmentation network. The adjacent points with the same category labels are then clustered together by using Euclidean clustering algorithm to obtain the semantic instances. Secondly, the semantic adjacency graph is constructed based on the spatial adjacency relation of semantic instances. Three kinds of high-dimensional features including geometric shape features, semantic categorical features and spatial distribution features are learned through graph convolutional network, and enhanced based on attention mechanism. Thirdly, the semantic instance matching problem is modeled as an optimal transport problem, and solved through an optimal matching layer. Finally, according to the matched semantic instances, the geometric transformation matrix between two point clouds is first obtained by SVD algorithm and then refined by ICP algorithm. The experiments are conducted on the KITTI Odometry dataset, and the average relative translation error and average relative rotation error of the proposed method are 6.6cm and 0.229{\deg} respectively.
    摘要 当前点云注册方法主要基于几何信息,通常忽略点云中的 semantics信息。在这篇论文中,我们将点云注册问题看作semantic实例匹配和注册任务,并提出了大规模室外点云注册的深度 semantic graph匹配方法。首先,通过利用大规模点云semantic分割网络获取3D点云的semantic分类标签。然后,通过Euclidean clustering算法将相邻的点云分类标签相同的点云集成为semantic实例。其次,基于semantic实例之间的空间相互关系,构建semantic adjacency图。然后,通过图 convolutional neural network学习三种高维特征,包括几何形态特征、semantic分类特征和空间分布特征,并通过注意机制进行增强。第三次,将semantic实例匹配问题转化为一个最优运输问题,并通过最优匹配层解决。最后,根据匹配的semantic实例,首先使用SVD算法获取两个点云之间的几何变换矩阵,然后通过ICP算法进行精细调整。我们在KITTI Odometry数据集上进行了实验,并得到了6.6cm和0.229{\deg}的平均相对平移错误和平均相对旋转错误。
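
The final alignment step (SVD, then ICP refinement) starts from the classic Kabsch solution. Below is a self-contained sketch of the SVD-based rigid transform from matched correspondences (for example, matched instance centroids), with a synthetic sanity check; it is a generic formulation, not code from the paper.

```python
import numpy as np

def rigid_transform_svd(src, dst):
    """Least-squares SE(3) alignment (Kabsch): rotation R and translation t
    mapping matched source points onto destination points. src, dst: (N, 3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)                            # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

# sanity check: recover a known rotation/translation from matched instance centers
rng = np.random.default_rng(1)
src = rng.normal(size=(20, 3))
angle = np.pi / 6
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
dst = src @ R_true.T + np.array([1.0, -2.0, 0.5])
R, t = rigid_transform_svd(src, dst)
print(np.allclose(R, R_true), np.round(t, 3))   # True [ 1.  -2.   0.5]
```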

DAOT: Domain-Agnostically Aligned Optimal Transport for Domain-Adaptive Crowd Counting

  • paper_url: http://arxiv.org/abs/2308.05311
  • repo_url: https://github.com/HopooLinZ/DAOT
  • paper_authors: Huilin Zhu, Jingling Yuan, Xian Zhong, Zhengwei Yang, Zheng Wang, Shengfeng He
  • for: 这篇论文是为了解决对于人群调查中的域别问题,特别是跨域调查时的域别差异。
  • methods: 本文使用了一种名为Domain-agnostically Aligned Optimal Transport (DAOT)的策略,它利用对域别因素的调整,将域别因素与其他域别因素进行适当的调整,以实现跨域调查的桥接。
  • results: 根据该策略,作者实现了跨域调查中的强大普遍性,并且在五个标准的人群调查benchmark上进行了广泛的实验,结果显示了该方法的优异性。
    Abstract Domain adaptation is commonly employed in crowd counting to bridge the domain gaps between different datasets. However, existing domain adaptation methods tend to focus on inter-dataset differences while overlooking the intra-differences within the same dataset, leading to additional learning ambiguities. These domain-agnostic factors, e.g., density, surveillance perspective, and scale, can cause significant in-domain variations, and the misalignment of these factors across domains can lead to a drop in performance in cross-domain crowd counting. To address this issue, we propose a Domain-agnostically Aligned Optimal Transport (DAOT) strategy that aligns domain-agnostic factors between domains. The DAOT consists of three steps. First, individual-level differences in domain-agnostic factors are measured using structural similarity (SSIM). Second, the optimal transfer (OT) strategy is employed to smooth out these differences and find the optimal domain-to-domain misalignment, with outlier individuals removed via a virtual "dustbin" column. Third, knowledge is transferred based on the aligned domain-agnostic factors, and the model is retrained for domain adaptation to bridge the gap across domains. We conduct extensive experiments on five standard crowd-counting benchmarks and demonstrate that the proposed method has strong generalizability across diverse datasets. Our code will be available at: https://github.com/HopooLinZ/DAOT/.
    摘要 域际适应是常见的人群计数技术,用于跨域数据集之间的域际差异 bridging。然而,现有的域际适应方法通常会忽略内部数据集之间的差异,导致额外学习混淆。这些域际无关因素,例如密度、观察角度和比例,可以在同一个数据集中导致重要的内部差异,并且在不同域际之间的差异可能会导致性能下降。为解决这个问题,我们提议一种基于域际无关因素的域际对齐策略(DAOT)。DAOT包括以下三步:1. 使用结构相似度(SSIM)测量各个个体在不同域际中的域际无关因素差异。2. 使用最佳传输(OT)策略平滑出这些差异,并在不同域际之间找到最佳域际偏移,并将异常个体从虚拟"废弃"列中除除。3. 基于对齐的域际无关因素,传输知识并重新训练模型,以bridging跨域数据集之间的差异。我们对五个标准人群计数标准 benchmark进行了广泛的实验,并证明了我们的方法具有跨多个数据集的强大普适性。我们的代码将在 GitHub 上公开:https://github.com/HopooLinZ/DAOT/.
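
Entropy-regularized optimal transport with a "dustbin" for outliers can be sketched with a few Sinkhorn iterations. The cost matrix (which in DAOT could come from something like 1 - SSIM between domain-agnostic factor descriptors), the uniform marginals, and all constants below are simplifying assumptions rather than the paper's exact formulation.

```python
import torch

def sinkhorn_with_dustbin(cost, n_iters=50, eps=0.1, dustbin_cost=1.0):
    """Entropy-regularized OT between two sets of descriptors, with an extra
    dustbin row/column so outlier individuals can be left unmatched."""
    n, m = cost.shape
    padded = torch.full((n + 1, m + 1), dustbin_cost)
    padded[:n, :m] = cost
    K = torch.exp(-padded / eps)                     # Gibbs kernel
    a = torch.ones(n + 1) / (n + 1)                  # marginals kept uniform for simplicity
    b = torch.ones(m + 1) / (m + 1)
    u, v = torch.ones(n + 1), torch.ones(m + 1)
    for _ in range(n_iters):                         # Sinkhorn-Knopp iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    return torch.diag(u) @ K @ torch.diag(v)         # transport plan incl. dustbin

plan = sinkhorn_with_dustbin(torch.rand(5, 7))
print(plan.shape, float(plan.sum()))                 # torch.Size([6, 8]) ~1.0
```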

From CNN to Transformer: A Review of Medical Image Segmentation Models

  • paper_url: http://arxiv.org/abs/2308.05305
  • repo_url: None
  • paper_authors: Wenjian Yao, Jiajun Bai, Wei Liao, Yuheng Chen, Mengjuan Liu, Yao Xie
  • for: The paper is written for researchers in the field of medical image segmentation, particularly those interested in using deep learning models for this task.
  • methods: The paper surveys four recent medical image segmentation models, including U-Net and its variants, as well as transformer-based models like TransUNet.
  • results: The paper evaluates the performance of these models on two benchmark datasets (Tuberculosis Chest X-rays and ovarian tumors) and discusses the main challenges and future trends in medical image segmentation.
    Abstract Medical image segmentation is an important step in medical image analysis, especially as a crucial prerequisite for efficient disease diagnosis and treatment. The use of deep learning for image segmentation has become a prevalent trend. The widely adopted approach currently is U-Net and its variants. Additionally, with the remarkable success of pre-trained models in natural language processing tasks, transformer-based models like TransUNet have achieved desirable performance on multiple medical image segmentation datasets. In this paper, we conduct a survey of the most representative four medical image segmentation models in recent years. We theoretically analyze the characteristics of these models and quantitatively evaluate their performance on two benchmark datasets (i.e., Tuberculosis Chest X-rays and ovarian tumors). Finally, we discuss the main challenges and future trends in medical image segmentation. Our work can assist researchers in the related field to quickly establish medical segmentation models tailored to specific regions.
    摘要 医学图像分割是医学图像分析的重要步骤,特别是疾病诊断和治疗的关键前提。深度学习在图像分割方面的应用已成为现代医学图像分析的主流方法。目前广泛采用的方法包括U-Net和其变种。此外,由于自然语言处理任务中 pré-训练模型的显著成功,如Transformer-based模型TransUNet,在医学图像分割数据集上达到了满意的性能。在这篇论文中,我们对最近几年内最有代表性的四种医学图像分割模型进行了一个抽查。我们 theoretically 分析了这些模型的特点,并对两个标准数据集(即肺炎X光图像和卵巢肿瘤)进行了量化评估。最后,我们讨论了医学图像分割领域的主要挑战和未来趋势。我们的工作可以帮助相关领域的研究人员快速建立适应特定地区的医学分割模型。

Multi-Visual-Inertial System: Analysis, Calibration and Estimation

  • paper_url: http://arxiv.org/abs/2308.05303
  • repo_url: None
  • paper_authors: Yulin Yang, Patrick Geneva, Guoquan Huang
  • for: 这个论文主要针对多视觉陀螺系统(MVIS)的状态估算和感知融合算法的研究。
  • methods: 论文使用的方法包括新的分析合并IMU集成(ACI3),用于预处理IMU测量,以及模型多种噪声和扩展姿态约束。
  • results: 论文通过实验和 simulations validate 了提出的方法,并证明了与现有方法相比,该方法可以实现竞争性的准确性和重复性。
    Abstract In this paper, we study state estimation of multi-visual-inertial systems (MVIS) and develop sensor fusion algorithms to optimally fuse an arbitrary number of asynchronous inertial measurement units (IMUs) or gyroscopes and global and(or) rolling shutter cameras. We are especially interested in the full calibration of the associated visual-inertial sensors, including the IMU or camera intrinsics and the IMU-IMU(or camera) spatiotemporal extrinsics as well as the image readout time of rolling-shutter cameras (if used). To this end, we develop a new analytic combined IMU integration with intrinsics-termed ACI3-to preintegrate IMU measurements, which is leveraged to fuse auxiliary IMUs and(or) gyroscopes alongside a base IMU. We model the multi-inertial measurements to include all the necessary inertial intrinsic and IMU-IMU spatiotemporal extrinsic parameters, while leveraging IMU-IMU rigid-body constraints to eliminate the necessity of auxiliary inertial poses and thus reducing computational complexity. By performing observability analysis of MVIS, we prove that the standard four unobservable directions remain - no matter how many inertial sensors are used, and also identify, for the first time, degenerate motions for IMU-IMU spatiotemporal extrinsics and auxiliary inertial intrinsics. In addition to the extensive simulations that validate our analysis and algorithms, we have built our own MVIS sensor rig and collected over 25 real-world datasets to experimentally verify the proposed calibration against the state-of-the-art calibration method such as Kalibr. We show that the proposed MVIS calibration is able to achieve competing accuracy with improved convergence and repeatability, which is open sourced to better benefit the community.
    摘要 在这篇论文中,我们研究多视觉惯性系统(MVIS)的状态估计,并开发传感器融合算法,以最优地融合任意数量的异步惯性测量单元(IMU)或陀螺仪,以及全局快门和(或)卷帘快门相机。我们尤其关注相关视觉惯性传感器的完整标定,包括IMU或相机的内参、IMU-IMU(或IMU-相机)的时空外参,以及卷帘快门相机的图像读出时间(如使用)。为此,我们提出了一种新的带内参的解析组合IMU积分方法(ACI3),用于对IMU测量进行预积分,并借此将辅助IMU和(或)陀螺仪与基准IMU融合。我们对多惯性测量进行建模,涵盖所有必要的惯性内参和IMU-IMU时空外参,同时利用IMU-IMU刚体约束来消除对辅助惯性位姿的需求,从而降低计算复杂度。通过对MVIS进行可观测性分析,我们证明无论使用多少个惯性传感器,标准的四个不可观测方向依然存在,并首次识别出IMU-IMU时空外参和辅助惯性内参的退化运动。除了大量仿真验证我们的分析与算法之外,我们还搭建了自己的MVIS传感器平台,采集了超过25个真实数据集,以实验验证所提标定方法,并与Kalibr等最先进的标定方法进行对比。结果表明,所提出的MVIS标定方法能够达到相当的精度,并具有更好的收敛性和可重复性,相关实现已开源,以更好地回馈社区。
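
As a rough intuition for what IMU preintegration accumulates between two camera frames, the sketch below performs naive discrete preintegration of bias-corrected gyroscope and accelerometer samples. It is only a simplified illustration of the concept; the paper's ACI3 is an analytic formulation that additionally folds in the inertial intrinsics, which are omitted here.

```python
import numpy as np

def skew(w):
    return np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])

def exp_so3(phi):
    # Rodrigues formula: rotation matrix for a small rotation vector phi.
    theta = np.linalg.norm(phi)
    if theta < 1e-9:
        return np.eye(3) + skew(phi)
    a = phi / theta
    K = skew(a)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def preintegrate(gyro, accel, dt):
    """Accumulate relative rotation, velocity and position (dR, dv, dp)."""
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a in zip(gyro, accel):
        dp += dv * dt + 0.5 * (dR @ a) * dt**2
        dv += (dR @ a) * dt
        dR = dR @ exp_so3(w * dt)
    return dR, dv, dp

# Example: 200 samples at 200 Hz of pure yaw rotation with zero acceleration.
gyro = np.tile([0.0, 0.0, 0.1], (200, 1))
accel = np.zeros((200, 3))
dR, dv, dp = preintegrate(gyro, accel, dt=1.0 / 200)
```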

Fine-Grained Self-Supervised Learning with Jigsaw Puzzles for Medical Image Classification

  • paper_url: http://arxiv.org/abs/2308.05770
  • repo_url: https://github.com/kalelpark/FG-SSL
  • paper_authors: Wongi Park, Jongbin Ryu
    for: 这个研究旨在提高医疗影像中细微病变的分类精度,因为这些病变之间的差异非常细微。methods: 本研究提出了一种细粒度自监督学习(FG-SSL)方法,通过层次块进行逐步学习,使细粒度拼图(Jigsaw puzzle)图像与正则化原始图像之间的互相关矩阵接近单位矩阵;层次块还用于渐进式细粒度学习,在每一步提取不同的信息。results: 实验表明,所提出的细粒度自监督学习方法在广泛使用的ISIC2018、APTOS2019和ISIC2017数据集上的表现优于现有的最先进方法。
    Abstract Classifying fine-grained lesions is challenging due to minor and subtle differences in medical images. This is because learning features of fine-grained lesions with highly minor differences is very difficult in training deep neural networks. Therefore, in this paper, we introduce Fine-Grained Self-Supervised Learning(FG-SSL) method for classifying subtle lesions in medical images. The proposed method progressively learns the model through hierarchical block such that the cross-correlation between the fine-grained Jigsaw puzzle and regularized original images is close to the identity matrix. We also apply hierarchical block for progressive fine-grained learning, which extracts different information in each step, to supervised learning for discovering subtle differences. Our method does not require an asymmetric model, nor does a negative sampling strategy, and is not sensitive to batch size. We evaluate the proposed fine-grained self-supervised learning method on comprehensive experiments using various medical image recognition datasets. In our experiments, the proposed method performs favorably compared to existing state-of-the-art approaches on the widely-used ISIC2018, APTOS2019, and ISIC2017 datasets.
    摘要 由于医学图像中病变之间的差异细微,细粒度病变分类具有挑战性。这是因为在训练深度神经网络时,学习差异极小的细粒度病变特征非常困难。因此,本文提出了细粒度自监督学习(FG-SSL)方法,用于医学图像中细微病变的分类。所提方法通过层次块逐步学习模型,使细粒度拼图(Jigsaw puzzle)图像与正则化原始图像之间的互相关矩阵接近单位矩阵。我们还将层次块用于渐进式细粒度学习,在每一步提取不同的信息,并将其应用于监督学习以发现细微差异。我们的方法不需要非对称模型,也不需要负采样策略,并且对批大小不敏感。我们在多种医学图像识别数据集上进行了全面实验,结果表明,所提方法在广泛使用的ISIC2018、APTOS2019和ISIC2017数据集上优于现有的最先进方法。
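
The core objective described in the abstract, pushing the cross-correlation between embeddings of a jigsaw-shuffled view and the regularized original image toward the identity matrix, can be sketched as a Barlow Twins-style loss. The hierarchical blocks of FG-SSL are omitted, and `lambda_off` is an illustrative hyperparameter, not a value from the paper.

```python
import torch

def cross_correlation_loss(z_jigsaw, z_orig, lambda_off=5e-3):
    # z_*: (batch, dim) embeddings of the two views of the same images.
    n, d = z_jigsaw.shape
    z1 = (z_jigsaw - z_jigsaw.mean(0)) / (z_jigsaw.std(0) + 1e-6)
    z2 = (z_orig - z_orig.mean(0)) / (z_orig.std(0) + 1e-6)
    c = (z1.T @ z2) / n                              # (dim, dim) cross-correlation
    on_diag = ((torch.diagonal(c) - 1) ** 2).sum()   # pull diagonal toward 1
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()  # push rest toward 0
    return on_diag + lambda_off * off_diag

loss = cross_correlation_loss(torch.randn(32, 128), torch.randn(32, 128))
```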

Informative Scene Graph Generation via Debiasing

  • paper_url: http://arxiv.org/abs/2308.05286
  • repo_url: None
  • paper_authors: Lianli Gao, Xinyu Lyu, Yuyu Guo, Yuxuan Hu, Yuan-Fang Li, Lu Xu, Heng Tao Shen, Jingkuan Song
  • For: The paper aims to address the issue of biases in scene graph generation models, which tend to predict common predicates instead of informative ones, leading to the loss of precise information and overall performance.* Methods: The proposed method, DB-SGG, integrates two components: Semantic Debiasing (SD) and Balanced Predicate Learning (BPL) to address the imbalances in semantic space and training samples. SD utilizes a confusion matrix and a bipartite graph to construct predicate relationships, while BPL adopts a random undersampling strategy and an ambiguity removing strategy to focus on informative predicates.* Results: The proposed method outperforms Transformer by 136.3%, 119.5%, and 122.6% on mR@20 at three SGG sub-tasks on the SGG-VG dataset, and is further verified on another complex SGG dataset (SGG-GQA) and two downstream tasks (sentence-to-graph retrieval and image captioning).
  • for: 本文旨在解决场景图生成(SGG)模型的偏倚现象,即模型倾向于预测常见谓词而非信息量更高的谓词,导致精确信息丢失和整体性能下降。
  • methods: 所提出的方法DB-SGG包括两个组件:语义去偏(SD)和平衡谓词学习(BPL),分别解决语义空间和训练样本中的不平衡问题。SD利用混淆矩阵和二分图构建谓词关系,BPL采用随机欠采样策略和歧义消除策略来关注信息量高的谓词。
  • results: 所提方法在SGG-VG数据集的三个SGG子任务上,mR@20比Transformer分别高出136.3%、119.5%和122.6%,并在另一个复杂的SGG数据集(SGG-GQA)和两个下游任务(句子到图检索和图像描述)上得到了进一步验证。
    Abstract Scene graph generation aims to detect visual relationship triplets, (subject, predicate, object). Due to biases in data, current models tend to predict common predicates, e.g. "on" and "at", instead of informative ones, e.g. "standing on" and "looking at". This tendency results in the loss of precise information and overall performance. If a model only uses "stone on road" rather than "stone blocking road" to describe an image, it may be a grave misunderstanding. We argue that this phenomenon is caused by two imbalances: semantic space level imbalance and training sample level imbalance. For this problem, we propose DB-SGG, an effective framework based on debiasing but not the conventional distribution fitting. It integrates two components: Semantic Debiasing (SD) and Balanced Predicate Learning (BPL), for these imbalances. SD utilizes a confusion matrix and a bipartite graph to construct predicate relationships. BPL adopts a random undersampling strategy and an ambiguity removing strategy to focus on informative predicates. Benefiting from the model-agnostic process, our method can be easily applied to SGG models and outperforms Transformer by 136.3%, 119.5%, and 122.6% on mR@20 at three SGG sub-tasks on the SGG-VG dataset. Our method is further verified on another complex SGG dataset (SGG-GQA) and two downstream tasks (sentence-to-graph retrieval and image captioning).
    摘要 场景图生成的目标是检测视觉关系三元组(主语、谓语、宾语)。由于数据存在偏差,当前模型往往预测常见谓词,如"on"和"at",而不是信息量更高的谓词,如"standing on"和"looking at"。这种倾向导致精确信息丢失和整体性能下降。如果模型只用"石头在路上"而不是"石头堵路"来描述一幅图像,可能会造成严重误解。我们认为这种现象由两种不平衡引起:语义空间层面的不平衡和训练样本层面的不平衡。针对这一问题,我们提出了DB-SGG,一种基于去偏而非传统分布拟合的有效框架。它包含两个组件:语义去偏(SD)和平衡谓词学习(BPL)。SD利用混淆矩阵和二分图构建谓词关系;BPL采用随机欠采样策略和歧义消除策略来关注信息量高的谓词。得益于与模型无关的处理流程,我们的方法可以方便地应用于各种SGG模型,并在SGG-VG数据集的三个SGG子任务上,mR@20比Transformer分别高出136.3%、119.5%和122.6%。我们的方法还在另一个复杂的SGG数据集(SGG-GQA)和两个下游任务(句子到图检索和图像描述)上得到了验证。
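
A minimal sketch of the random undersampling idea behind Balanced Predicate Learning is shown below: the number of training triplets kept per predicate class is capped so that frequent but uninformative predicates no longer dominate. The cap value, data layout, and helper name are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

def undersample_triplets(triplets, cap=2000, seed=0):
    """triplets: list of dicts like {'subject': ..., 'predicate': ..., 'object': ...}."""
    random.seed(seed)
    by_predicate = defaultdict(list)
    for t in triplets:
        by_predicate[t["predicate"]].append(t)
    balanced = []
    for pred, items in by_predicate.items():
        random.shuffle(items)
        balanced.extend(items[:cap])   # keep at most `cap` samples per predicate
    random.shuffle(balanced)
    return balanced

# Toy usage: a dominant "on" class gets capped, rare predicates are kept whole.
data = [{"subject": "cup", "predicate": "on", "object": "table"}] * 5000
data += [{"subject": "man", "predicate": "standing on", "object": "road"}] * 50
print(len(undersample_triplets(data)))   # 2050
```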

Local-Global Information Interaction Debiasing for Dynamic Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2308.05274
  • repo_url: None
  • paper_authors: Xinyu Lyu, Jingwei Liu, Yuyu Guo, Lianli Gao
  • for: 提高动态场景图生成(DynSGG)模型的性能,解决尾部谓词的预测问题。
  • methods: 基于多任务学习(MTL)的新DynSGG模型,引入本地交互信息和全局人体动作交互信息,使模型更全面地理解单个图像的视觉上下文。
  • results: 在Action Genome dataset上进行了广泛的实验,证明我们提出的框架有效地解决了长尾问题,同时也提高了动态场景图生成的性能。
    Abstract The task of dynamic scene graph generation (DynSGG) aims to generate scene graphs for given videos, which involves modeling the spatial-temporal information in the video. However, due to the long-tailed distribution of samples in the dataset, previous DynSGG models fail to predict the tail predicates. We argue that this phenomenon is due to previous methods that only pay attention to the local spatial-temporal information and neglect the consistency of multiple frames. To solve this problem, we propose a novel DynSGG model based on multi-task learning, DynSGG-MTL, which introduces the local interaction information and global human-action interaction information. The interaction between objects and frame features makes the model more fully understand the visual context of the single image. Long-temporal human actions supervise the model to generate multiple scene graphs that conform to the global constraints and avoid the model being unable to learn the tail predicates. Extensive experiments on Action Genome dataset demonstrate the efficacy of our proposed framework, which not only improves the dynamic scene graph generation but also alleviates the long-tail problem.
    摘要 动态场景图生成(DynSGG)的目标是为给定视频生成场景图,这需要对视频中的时空信息进行建模。然而,由于数据集中样本呈长尾分布,以往的DynSGG模型无法预测尾部谓词。我们认为这是因为以往方法只关注局部时空信息,而忽略了多帧之间的一致性。为解决这一问题,我们提出了基于多任务学习的新DynSGG模型DynSGG-MTL,引入局部交互信息和全局人体动作交互信息。物体与帧特征之间的交互使模型能更充分地理解单帧图像的视觉上下文;长时序的人体动作则监督模型生成符合全局约束的多个场景图,避免模型无法学习尾部谓词。在Action Genome数据集上的大量实验证明了所提框架的有效性,不仅改善了动态场景图生成,还缓解了长尾问题。

TrainFors: A Large Benchmark Training Dataset for Image Manipulation Detection and Localization

  • paper_url: http://arxiv.org/abs/2308.05264
  • repo_url: None
  • paper_authors: Soumyaroop Nandi, Prem Natarajan, Wael Abd-Almageed
    for: 这项研究的目的是提供一个标准的图像修改检测和定位 benchmark 数据集,以便对现有的图像修改检测方法进行公正的评估。methods: 这项研究使用了一个标准的图像修改数据集,并对现有的图像修改检测方法进行训练和测试。results: 研究发现现有的图像修改检测方法在标准数据集上的性能不一,并且存在一些问题,如训练数据集不固定以及模型架构的差异。
    Abstract The evaluation datasets and metrics for image manipulation detection and localization (IMDL) research have been standardized. But the training dataset for such a task is still nonstandard. Previous researchers have used unconventional and deviating datasets to train neural networks for detecting image forgeries and localizing pixel maps of manipulated regions. For a fair comparison, the training set, test set, and evaluation metrics should be persistent. Hence, comparing the existing methods may not seem fair as the results depend heavily on the training datasets as well as the model architecture. Moreover, none of the previous works release the synthetic training dataset used for the IMDL task. We propose a standardized benchmark training dataset for image splicing, copy-move forgery, removal forgery, and image enhancement forgery. Furthermore, we identify the problems with the existing IMDL datasets and propose the required modifications. We also train the state-of-the-art IMDL methods on our proposed TrainFors1 dataset for a fair evaluation and report the actual performance of these methods under similar conditions.
    摘要 图像修改检测与定位(IMDL)研究的评估数据集和评估指标已经实现了标准化,但该任务的训练数据集仍不统一。以往的研究者使用各不相同的数据集来训练神经网络,以检测图像伪造并定位被篡改区域的像素级掩码。为了公平比较,训练集、测试集和评估指标应当保持一致。因此,直接比较现有方法可能并不公平,因为结果在很大程度上取决于训练数据集和模型架构。此外,以往的工作均未公开用于IMDL任务的合成训练数据集。我们提出了一个标准化的基准训练数据集,涵盖图像拼接、复制移动伪造、移除伪造和图像增强伪造。此外,我们还指出了现有IMDL数据集存在的问题,并提出了所需的修改。我们在所提出的TrainFors1数据集上训练了当前最先进的IMDL方法,以进行公平评估,并报告了这些方法在相同条件下的实际性能。

Advancing Early Detection of Virus Yellows: Developing a Hybrid Convolutional Neural Network for Automatic Aphid Counting in Sugar Beet Fields

  • paper_url: http://arxiv.org/abs/2308.05257
  • repo_url: https://github.com/junfenggaolab/counting-aphids
  • paper_authors: Xumin Gao, Wenxin Xue, Callum Lennox, Mark Stevens, Junfeng Gao
  • for: 这篇论文旨在提供一种有效的自动蚜虫计数方法,为甜菜田中由蚜虫引起的黄化病毒风险提供早期预警。
  • methods: 该方法采用混合式自动蚜虫计数网络架构,将检测网络与密度图估计网络相结合:蚜虫密度较低时使用改进的Yolov5进行计数,密度较高时则切换到CSRNet进行计数。
  • results: 对比实验表明,该方法的计数精度高于所有其他方法,其MAE和RMSE分别为2.93和4.01(标准数据集)以及34.19和38.66(高密度数据集)。此外,改进后的Yolov5的AP比原始Yolov5高出5%;特别是对极小和密集分布的蚜虫,改进后Yolov5的检测性能明显更优。
    Abstract Aphids are efficient vectors to transmit virus yellows in sugar beet fields. Timely monitoring and control of their populations are thus critical to prevent the large-scale outbreak of virus yellows. However, the manual counting of aphids, which is the most common practice, is labor-intensive and time-consuming. Additionally, two of the biggest challenges in aphid counting are that aphids are small objects and their density distributions are varied in different areas of the field. To address these challenges, we proposed a hybrid automatic aphid counting network architecture which integrates the detection network and the density map estimation network. When the distribution density of aphids is low, it utilizes an improved Yolov5 to count aphids. Conversely, when the distribution density of aphids is high, it switches to CSRNet to count aphids. To the best of our knowledge, this is the first framework integrating the detection network and the density map estimation network for counting tasks. Comparison experiments on aphid counting verified that our proposed approach outperforms all other methods in counting aphids. It achieved the lowest MAE and RMSE values for both the standard and high-density aphid datasets: 2.93 and 4.01 (standard), and 34.19 and 38.66 (high-density), respectively. Moreover, the AP of the improved Yolov5 is 5% higher than that of the original Yolov5. Especially for extremely small aphids and densely distributed aphids, the detection performance of the improved Yolov5 is significantly better than the original Yolov5. This work provides an effective early warning for the virus yellows risk caused by aphids in sugar beet fields, offering protection for sugar beet growth and ensuring sugar beet yield. The datasets and project code are released at: https://github.com/JunfengGaolab/Counting-Aphids.
    摘要 蚜虫是甜菜田中传播黄化病毒的高效媒介。及时监测并控制其种群数量,对于防止黄化病毒大规模爆发至关重要。然而,目前最常用的人工计数蚜虫的方法费时费力。此外,蚜虫计数的两大难点在于:蚜虫目标很小,且其密度分布在田间不同区域差异很大。为了应对这些挑战,我们提出了一种混合式自动蚜虫计数网络架构,将检测网络与密度图估计网络相结合。当蚜虫分布密度较低时,使用改进的Yolov5进行计数;当分布密度较高时,则切换到CSRNet进行计数。据我们所知,这是首个将检测网络与密度图估计网络相结合用于计数任务的框架。蚜虫计数对比实验表明,我们的方法优于所有其他计数方法,在标准和高密度蚜虫数据集上分别取得了最低的MAE和RMSE:2.93和4.01(标准),34.19和38.66(高密度)。此外,改进后的Yolov5的AP比原始Yolov5高出5%,特别是对极小和密集分布的蚜虫,其检测性能明显优于原始Yolov5。这项工作为甜菜田中由蚜虫引发的黄化病毒风险提供了有效的早期预警,为甜菜生长提供保护并保障产量。数据集和项目代码发布于:https://github.com/JunfengGaolab/Counting-Aphids.
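
The hybrid counting logic, detection-based counting at low densities and density-map integration at high densities, can be sketched as follows. The switch threshold and the `detector` / `density_net` callables are placeholders rather than the released models from the repository above.

```python
import torch

DENSITY_THRESHOLD = 30  # illustrative switch point (detected aphids per image)

def count_aphids(image, detector, density_net):
    boxes = detector(image)                 # list of detected aphid boxes
    if len(boxes) < DENSITY_THRESHOLD:
        return len(boxes)                   # sparse case: count detections
    density_map = density_net(image)        # (1, 1, H, W) predicted density
    return float(density_map.sum())         # dense case: integrate the map

# Toy demo with stand-in models: three detections, so the detector branch is used.
demo_count = count_aphids(
    image=None,
    detector=lambda img: [1, 2, 3],
    density_net=lambda img: torch.full((1, 1, 4, 4), 2.0),
)
print(demo_count)  # 3
```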

Spatial Gated Multi-Layer Perceptron for Land Use and Land Cover Mapping

  • paper_url: http://arxiv.org/abs/2308.05235
  • repo_url: https://github.com/aj1365/sgumlp
  • paper_authors: Ali Jamali, Swalpa Kumar Roy, Danfeng Hong, Peter M Atkinson, Pedram Ghamisi
  • for: 这项研究旨在开发一个能够精确进行土地利用与土地覆盖(LULC)分类的模型。
  • methods: 研究将多层感知机(MLP)与空间门控单元(SGU)相结合,以实现精确的LULC制图。
  • results: 研究发现,所提出的SGU-MLP分类算法的精度高于其他CNN及CNN-ViT模型(包括HybridSN、ResNet、iFormer、EfficientFormer和CoAtNet),并在三组实验(Houston、Berlin和Augsburg)中均有体现。例如,在Houston实验中,SGU-MLP的平均准确率分别比HybridSN、CoAtNet、EfficientFormer、iFormer和ResNet高出约15%、19%、20%、21%和25%。
    Abstract Convolutional Neural Networks (CNNs) are models that are utilized extensively for the hierarchical extraction of features. Vision transformers (ViTs), through the use of a self-attention mechanism, have recently achieved superior modeling of global contextual information compared to CNNs. However, to realize their image classification strength, ViTs require substantial training datasets. Where the available training data are limited, current advanced multi-layer perceptrons (MLPs) can provide viable alternatives to both deep CNNs and ViTs. In this paper, we developed the SGU-MLP, a learning algorithm that effectively uses both MLPs and spatial gating units (SGUs) for precise land use land cover (LULC) mapping. Results illustrated the superiority of the developed SGU-MLP classification algorithm over several CNN and CNN-ViT-based models, including HybridSN, ResNet, iFormer, EfficientFormer and CoAtNet. The proposed SGU-MLP algorithm was tested through three experiments in Houston, USA, Berlin, Germany and Augsburg, Germany. The SGU-MLP classification model was found to consistently outperform the benchmark CNN and CNN-ViT-based algorithms. For example, for the Houston experiment, SGU-MLP significantly outperformed HybridSN, CoAtNet, Efficientformer, iFormer and ResNet by approximately 15%, 19%, 20%, 21%, and 25%, respectively, in terms of average accuracy. The code will be made publicly available at https://github.com/aj1365/SGUMLP
    摘要 卷积神经网络(CNN)被广泛用于层次化特征提取。视觉Transformer(ViT)借助自注意力机制,近来在全局上下文信息建模方面取得了优于CNN的效果。然而,要发挥其图像分类优势,ViT需要大量训练数据。在可用训练数据有限时,先进的多层感知机(MLP)可以成为深度CNN和ViT的可行替代方案。在本文中,我们开发了SGU-MLP,一种同时利用MLP与空间门控单元(SGU)的学习算法,用于精确的土地利用与土地覆盖(LULC)制图。结果表明,所提出的SGU-MLP分类算法优于多种CNN和CNN-ViT模型,包括HybridSN、ResNet、iFormer、EfficientFormer和CoAtNet。所提算法在美国休斯顿、德国柏林和德国奥格斯堡的三组实验中进行了测试,SGU-MLP分类模型始终优于基准的CNN和CNN-ViT算法。例如,在休斯顿实验中,SGU-MLP的平均准确率分别比HybridSN、CoAtNet、EfficientFormer、iFormer和ResNet高出约15%、19%、20%、21%和25%。代码将公开于 https://github.com/aj1365/SGUMLP
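
For intuition about the spatial gating the SGU-MLP classifier builds on, here is a gMLP-style Spatial Gating Unit sketch in PyTorch. Token count and channel sizes are arbitrary, and this is not the released SGU-MLP code.

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    def __init__(self, dim, num_tokens):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        self.spatial_proj = nn.Linear(num_tokens, num_tokens)  # mixes tokens

    def forward(self, x):                 # x: (batch, tokens, dim)
        u, v = x.chunk(2, dim=-1)         # split channels into two halves
        v = self.norm(v)
        v = self.spatial_proj(v.transpose(1, 2)).transpose(1, 2)
        return u * v                      # element-wise spatial gate

x = torch.randn(2, 64, 128)                             # 64 patches, 128 channels
out = SpatialGatingUnit(dim=128, num_tokens=64)(x)      # -> (2, 64, 64)
```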

SegMatch: A semi-supervised learning method for surgical instrument segmentation

  • paper_url: http://arxiv.org/abs/2308.05232
  • repo_url: None
  • paper_authors: Meng Wei, Charlie Budd, Luis C. Garcia-Peraza-Herrera, Reuben Dorent, Miaojing Shi, Tom Vercauteren
    for: 这篇论文旨在提高腹腔镜和机器人手术图像中手术器械分割的精度,并降低昂贵的标注成本。methods: 我们提出了SegMatch,一种基于FixMatch的半监督学习方法。SegMatch利用弱增强图像生成伪标签,对经过强(对抗)增强的图像在高置信度像素上施加无监督损失,并针对分割任务仔细考虑了增强函数的等变性和不变性;此外还引入了可学习的对抗增强策略,以提高增强的相关性。results: 我们在MICCAI Instrument Segmentation Challenge数据集 Robust-MIS 2019 和 EndoVis 2017 上进行了评估,结果表明,在训练中加入无标注数据后,我们的方法可以超越受训练数据限制的全监督方法;SegMatch 还在不同的有标注/无标注数据比例下优于多种最先进的半监督语义分割模型。
    Abstract Surgical instrument segmentation is recognised as a key enabler to provide advanced surgical assistance and improve computer assisted interventions. In this work, we propose SegMatch, a semi-supervised learning method to reduce the need for expensive annotation for laparoscopic and robotic surgical images. SegMatch builds on FixMatch, a widespread semi-supervised classification pipeline combining consistency regularization and pseudo-labelling, and adapts it for the purpose of segmentation. In our proposed SegMatch, the unlabelled images are weakly augmented and fed into the segmentation model to generate a pseudo-label to enforce the unsupervised loss against the output of the model for the adversarially augmented image on the pixels with a high confidence score. Our adaptation for segmentation tasks includes carefully considering the equivariance and invariance properties of the augmentation functions we rely on. To increase the relevance of our augmentations, we depart from using only handcrafted augmentations and introduce a trainable adversarial augmentation strategy. Our algorithm was evaluated on the MICCAI Instrument Segmentation Challenge datasets Robust-MIS 2019 and EndoVis 2017. Our results demonstrate that adding unlabelled data for training purposes allows us to surpass the performance of fully supervised approaches which are limited by the availability of training data in these challenges. SegMatch also outperforms a range of state-of-the-art semi-supervised learning semantic segmentation models in different labelled to unlabelled data ratios.
    摘要 手术器械分割被认为是提供先进手术辅助和改进计算机辅助介入的关键环节。在这项工作中,我们提出了SegMatch,一种半监督学习方法,用于降低腹腔镜和机器人手术图像所需的昂贵标注成本。SegMatch基于广泛使用的半监督分类流程FixMatch(结合一致性正则化与伪标签),并将其适配到分割任务。在我们提出的SegMatch中,无标注图像经弱增强后输入分割模型以生成伪标签,并在高置信度像素上,对经对抗增强图像的模型输出施加无监督损失。我们针对分割任务的适配包括仔细考虑所用增强函数的等变性与不变性。为了提高增强的相关性,我们不再仅使用手工设计的增强,而是引入了可训练的对抗增强策略。我们的算法在MICCAI器械分割挑战数据集Robust-MIS 2019和EndoVis 2017上进行了评估。结果表明,在训练中加入无标注数据后,我们的方法能够超越受训练数据限制的全监督方法;SegMatch在不同的有标注/无标注数据比例下也优于多种最先进的半监督语义分割模型。
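
The FixMatch-style unsupervised term adapted to segmentation can be sketched as below: pseudo-labels from a weakly augmented view supervise the prediction on a strongly (adversarially) augmented view, restricted to pixels whose confidence exceeds a threshold. SegMatch's learned adversarial augmentation and its handling of geometric equivariance are not shown; the threshold is an illustrative value.

```python
import torch
import torch.nn.functional as F

def unsupervised_seg_loss(logits_weak, logits_strong, conf_thresh=0.95):
    # logits_*: (batch, classes, H, W) predictions for the same unlabelled images.
    probs = torch.softmax(logits_weak.detach(), dim=1)
    conf, pseudo = probs.max(dim=1)                    # (batch, H, W)
    mask = (conf >= conf_thresh).float()               # keep only confident pixels
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)

loss = unsupervised_seg_loss(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
```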

A Unified Interactive Model Evaluation for Classification, Object Detection, and Instance Segmentation in Computer Vision

  • paper_url: http://arxiv.org/abs/2308.05168
  • repo_url: None
  • paper_authors: Changjian Chen, Yukai Guo, Fengyuan Tian, Shilong Liu, Weikai Yang, Zhaowei Wang, Jing Wu, Hang Su, Hanspeter Pfister, Shixia Liu
  • for: 现有的模型评估工具主要针对分类模型,而忽略了更复杂的模型(如目标检测);本文旨在填补这一空白。
  • methods: 本文提出了一种开源的可视化分析工具 Uni-Evaluator,用于支持计算机视觉中分类、目标检测和实例分割的统一模型评估,其核心思想是将不同任务中的离散和连续预测统一表示为概率分布。
  • results: 在两个案例研究中,Uni-Evaluator被证明能够有效地评估模型性能,并且可以帮助进行有知识的改进。
    Abstract Existing model evaluation tools mainly focus on evaluating classification models, leaving a gap in evaluating more complex models, such as object detection. In this paper, we develop an open-source visual analysis tool, Uni-Evaluator, to support a unified model evaluation for classification, object detection, and instance segmentation in computer vision. The key idea behind our method is to formulate both discrete and continuous predictions in different tasks as unified probability distributions. Based on these distributions, we develop 1) a matrix-based visualization to provide an overview of model performance; 2) a table visualization to identify the problematic data subsets where the model performs poorly; 3) a grid visualization to display the samples of interest. These visualizations work together to facilitate the model evaluation from a global overview to individual samples. Two case studies demonstrate the effectiveness of Uni-Evaluator in evaluating model performance and making informed improvements.
    摘要 现有的模型评估工具主要专注于评估分类模型,留下了评估更复杂模型(如目标检测)的空白。在这篇论文中,我们开发了一款开源的可视化分析工具 Uni-Evaluator,用于支持计算机视觉中分类、目标检测和实例分割的统一模型评估。我们方法的关键思想是将不同任务中的离散和连续预测统一表示为概率分布。基于这些分布,我们开发了:1)矩阵视图,用于提供模型性能的概览;2)表格视图,用于识别模型表现不佳的数据子集;3)网格视图,用于展示关注的样本。这些视图协同工作,支持从全局概览到单个样本的模型评估。两个案例研究证明了Uni-Evaluator在评估模型性能和指导改进方面的有效性。
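
A hedged sketch of the unifying idea, reducing classification, detection, and segmentation outputs to class probability distributions that a single set of visualizations can consume, is given below. How Uni-Evaluator actually normalizes detection and segmentation predictions is more involved; these helpers are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def classification_dist(logits):          # (num_classes,) -> one distribution
    return softmax(logits)

def detection_dist(box_logits):           # (num_boxes, num_classes) -> one per box
    return softmax(box_logits)

def segmentation_dist(pixel_logits, instance_mask):
    # Average the per-pixel class distributions inside one predicted instance.
    probs = softmax(pixel_logits)                      # (H, W, num_classes)
    return probs[instance_mask].mean(axis=0)           # (num_classes,)

dist = classification_dist(np.array([2.0, 0.5, -1.0]))  # e.g. [0.79, 0.18, 0.04]
```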

Deep Learning for Morphological Identification of Extended Radio Galaxies using Weak Labels

  • paper_url: http://arxiv.org/abs/2308.05166
  • repo_url: https://github.com/nikhel1/gal-cam
  • paper_authors: Nikhel Gupta, Zeeshan Hayder, Ray P. Norris, Minh Huynh, Lars Petersson, X. Rosalind Wang, Heinz Andernach, Bärbel S. Koribalski, Miranda Yew, Evan J. Crawford
  • for: 这项研究旨在开发一种基于弱监督的深度学习算法,以降低对具有多个成分的复杂射电星系进行像素级标注的成本。
  • methods: 该算法利用类级别标签训练深度学习模型得到类激活图(CAM),再使用像素间关系网络(IRNet)对CAM进行细化,从而获得射电星系的实例分割掩码及其红外宿主星系的位置。
  • results: 我们使用来自澳大利亚 Square Kilometre Array Pathfinder (ASKAP) 望远镜的数据,特别是 Evolutionary Map of the Universe (EMU) Pilot Survey,其覆盖了270平方度的天区,RMS灵敏度为25-35 $\mu$Jy/beam。结果表明,弱监督深度学习算法可以高精度地预测像素级信息,包括涵盖星系所有成分的射电辐射掩码以及红外宿主星系的位置。我们使用 mAP 评估模型性能,模型在射电掩码和红外宿主位置上的 mAP$_{50}$ 分别为 67.5% 和 76.8%。模型架构见:https://github.com/Nikhel1/Gal-CAM
    Abstract The present work discusses the use of a weakly-supervised deep learning algorithm that reduces the cost of labelling pixel-level masks for complex radio galaxies with multiple components. The algorithm is trained on weak class-level labels of radio galaxies to get class activation maps (CAMs). The CAMs are further refined using an inter-pixel relations network (IRNet) to get instance segmentation masks over radio galaxies and the positions of their infrared hosts. We use data from the Australian Square Kilometre Array Pathfinder (ASKAP) telescope, specifically the Evolutionary Map of the Universe (EMU) Pilot Survey, which covered a sky area of 270 square degrees with an RMS sensitivity of 25-35 $\mu$Jy/beam. We demonstrate that weakly-supervised deep learning algorithms can achieve high accuracy in predicting pixel-level information, including masks for the extended radio emission encapsulating all galaxy components and the positions of the infrared host galaxies. We evaluate the performance of our method using mean Average Precision (mAP) across multiple classes at a standard intersection over union (IoU) threshold of 0.5. We show that the model achieves a mAP$_{50}$ of 67.5\% and 76.8\% for radio masks and infrared host positions, respectively. The network architecture can be found at the following link: https://github.com/Nikhel1/Gal-CAM
    摘要 本工作讨论了一种弱监督深度学习算法,用于降低为具有多个成分的复杂射电星系标注像素级掩码的成本。该算法以射电星系的弱类级别标签进行训练,得到类激活图(CAM),再通过像素间关系网络(IRNet)对CAM进行细化,从而得到射电星系的实例分割掩码及其红外宿主星系的位置。我们使用来自澳大利亚 Square Kilometre Array Pathfinder (ASKAP) 望远镜的数据,特别是 Evolutionary Map of the Universe (EMU) Pilot Survey,其覆盖270平方度的天区,RMS灵敏度为25-35 μJy/beam。我们证明弱监督深度学习算法能够高精度地预测像素级信息,包括涵盖星系所有成分的扩展射电辐射掩码以及红外宿主星系的位置。我们在标准交并比(IoU)阈值0.5下,使用多类别的平均精度均值(mAP)评估方法性能,模型在射电掩码和红外宿主位置上的 mAP$_{50}$ 分别为67.5%和76.8%。网络架构见:https://github.com/Nikhel1/Gal-CAM。
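
The class activation map step that the pipeline starts from can be sketched as follows: a classifier trained with only class-level labels is turned into a coarse localization map by weighting its last convolutional feature map with the classifier weights of the predicted class. The IRNet refinement that converts CAMs into instance masks is not shown, and the tiny classifier here is a stand-in, not the paper's backbone.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(32, num_classes)   # applied after global pooling

    def forward(self, x):
        fmap = self.features(x)                          # (B, 32, H, W)
        logits = self.fc(fmap.mean(dim=(2, 3)))          # GAP + linear head
        return logits, fmap

model = TinyClassifier()
logits, fmap = model(torch.randn(1, 1, 64, 64))
cls = logits.argmax(dim=1)                               # predicted class index
cam = torch.einsum("c,bchw->bhw", model.fc.weight[cls[0]], fmap)  # (1, 64, 64)
cam = torch.relu(cam)
cam = cam / (cam.max() + 1e-6)                           # normalize to [0, 1]
```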

Scene-Generalizable Interactive Segmentation of Radiance Fields

  • paper_url: http://arxiv.org/abs/2308.05104
  • repo_url: None
  • paper_authors: Songlin Tang, Wenjie Pei, Xin Tao, Tanghui Jia, Guangming Lu, Yu-Wing Tai
    for: 面向辐射场表示的各种场景的可泛化交互式三维物体分割。methods: 跨维度引导传播,将稀少的2D用户点击编码为有信息量的3D引导表示;消除不确定性的3D分割模块,实现高效且准确的3D分割;以及在仅有2D掩码标注的监督下,揭示并纠正隐藏的3D分割错误的学习方案。results: 在两个具有挑战性的真实基准数据集(涵盖多种场景)上的大量实验证明了所提方法的有效性和场景可泛化性,其性能优于需要场景特定优化的经典方法。
    Abstract Existing methods for interactive segmentation in radiance fields entail scene-specific optimization and thus cannot generalize across different scenes, which greatly limits their applicability. In this work we make the first attempt at Scene-Generalizable Interactive Segmentation in Radiance Fields (SGISRF) and propose a novel SGISRF method, which can perform 3D object segmentation for novel (unseen) scenes represented by radiance fields, guided by only a few interactive user clicks in a given set of multi-view 2D images. In particular, the proposed SGISRF focuses on addressing three crucial challenges with three specially designed techniques. First, we devise the Cross-Dimension Guidance Propagation to encode the scarce 2D user clicks into informative 3D guidance representations. Second, the Uncertainty-Eliminated 3D Segmentation module is designed to achieve efficient yet effective 3D segmentation. Third, Concealment-Revealed Supervised Learning scheme is proposed to reveal and correct the concealed 3D segmentation errors resulted from the supervision in 2D space with only 2D mask annotations. Extensive experiments on two real-world challenging benchmarks covering diverse scenes demonstrate 1) effectiveness and scene-generalizability of the proposed method, 2) favorable performance compared to classical method requiring scene-specific optimization.
    摘要 现有的辐射场交互式分割方法需要针对特定场景进行优化,因而无法在不同场景之间泛化,这极大地限制了其适用性。在这项工作中,我们首次尝试了辐射场中的场景可泛化交互式分割(SGISRF),并提出了一种新的SGISRF方法:仅凭一组多视角2D图像中的少量交互式用户点击,即可对由辐射场表示的新(未见)场景进行3D物体分割。具体而言,所提出的SGISRF通过三项专门设计的技术应对三个关键挑战。首先,我们设计了跨维度引导传播,将稀少的2D用户点击编码为有信息量的3D引导表示。其次,设计了消除不确定性的3D分割模块,以实现高效而有效的3D分割。第三,提出了"隐藏-揭示"监督学习方案,用于揭示并纠正仅以2D掩码标注在2D空间进行监督所导致的隐藏3D分割错误。在覆盖多种场景的两个具有挑战性的真实基准上进行的大量实验表明:1)所提方法有效且具有场景泛化能力;2)与需要场景特定优化的经典方法相比性能更优。

A degree of image identification at sub-human scales could be possible with more advanced clusters

  • paper_url: http://arxiv.org/abs/2308.05092
  • repo_url: https://github.com/prateekjannu/imagescale2
  • paper_authors: Prateek Y J
  • for: 本研究旨在确定当前可用的自动学习技术是否可以达到人类水平的视觉图像理解,使用同样的感知输入量和质量。
  • methods: 本研究使用了涉及数据量和图像质量的缩放实验,无需外部资金支持。
  • results: 我们发现,同时扩大数据量和图像分辨率,可以在低于人类规模的数据条件下达到人类水平的物体检测性能。我们使用视觉Transformer进行实验,训练数据最多达20万张图像,分辨率最高为256 ppi。
    Abstract The purpose of the research is to determine whether currently available self-supervised learning techniques can accomplish human-level comprehension of visual images using the same degree and amount of sensory input that people acquire. Initial research on this topic solely considered data volume scaling. Here, we scale both the volume of data and the quality of the images. This scaling experiment is a self-supervised learning method that may be done without any outside financing. We find that scaling up data volume and picture resolution at the same time enables human-level object detection performance at sub-human scales. We run a scaling experiment with vision transformers trained on up to 200,000 images at resolutions of up to 256 ppi.
    摘要 本研究的目的是判断现有的自监督学习技术,能否在与人类所获得的感知输入数量和质量相同的条件下,实现人类水平的视觉图像理解。此前针对该问题的研究仅考虑了数据量的缩放;在这里,我们同时缩放数据量和图像质量。这一缩放实验是一种无需任何外部资金即可完成的自监督学习方法。我们发现,同时扩大数据量和图像分辨率,可以在低于人类规模的数据条件下实现人类水平的物体检测性能。我们使用视觉Transformer进行缩放实验,训练数据最多达20万张图像,分辨率最高为256 ppi。

Volumetric Fast Fourier Convolution for Detecting Ink on the Carbonized Herculaneum Papyri

  • paper_url: http://arxiv.org/abs/2308.05070
  • repo_url: https://github.com/aimagelab/vffc
  • paper_authors: Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara
  • for: 这项研究旨在提出一种基于快速傅里叶卷积(FFC)算子的深度学习方法,用于检测赫库兰尼姆纸草(Herculaneum papyri)文献中的墨迹。
  • methods: 该方法对快速傅里叶卷积算子进行了面向体数据的改进,并将其应用于深度学习分割架构中,以在赫库兰尼姆纸草文献中自动检测墨迹。
  • results: 实验表明,该方法在赫库兰尼姆纸草这一高度受损的文献上取得了良好的墨迹检测效果。
    Abstract Recent advancements in Digital Document Restoration (DDR) have led to significant breakthroughs in analyzing highly damaged written artifacts. Among those, there has been an increasing interest in applying Artificial Intelligence techniques for virtually unwrapping and automatically detecting ink on the Herculaneum papyri collection. This collection consists of carbonized scrolls and fragments of documents, which have been digitized via X-ray tomography to allow the development of ad-hoc deep learning-based DDR solutions. In this work, we propose a modification of the Fast Fourier Convolution operator for volumetric data and apply it in a segmentation architecture for ink detection on the challenging Herculaneum papyri, demonstrating its suitability via deep experimental analysis. To encourage the research on this task and the application of the proposed operator to other tasks involving volumetric data, we will release our implementation (https://github.com/aimagelab/vffc)
    摘要 数字文献修复(DDR)技术的最新进展,使得对高度受损手写文物的分析取得了重大突破。其中,将人工智能技术应用于赫库兰尼姆纸草(Herculaneum papyri)藏品的虚拟展开与墨迹自动检测,受到了越来越多的关注。该藏品由碳化的卷轴和文献残片组成,已通过X射线断层扫描完成数字化,从而支持开发针对性的基于深度学习的DDR方案。在这项工作中,我们提出了一种面向体数据的快速傅里叶卷积算子的改进,并将其应用于分割架构中,用于在极具挑战性的赫库兰尼姆纸草上进行墨迹检测,并通过深入的实验分析验证了其适用性。为了促进这一任务的研究以及将所提算子应用于其他涉及体数据的任务,我们将开源我们的实现(https://github.com/aimagelab/vffc)。
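
In the spirit of extending Fast Fourier Convolution to volumes, the sketch below applies a real 3D FFT to the feature volume, mixes the stacked real and imaginary parts with a 1x1x1 convolution, and transforms back. The actual VFFC operator (see the linked repository) additionally splits channels into local and global branches, which is omitted here.

```python
import torch
import torch.nn as nn

class SpectralBlock3d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Conv3d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x):                       # x: (B, C, D, H, W)
        d, h, w = x.shape[-3:]
        freq = torch.fft.rfftn(x, dim=(-3, -2, -1))          # complex spectrum
        stacked = torch.cat([freq.real, freq.imag], dim=1)   # (B, 2C, D, H, W//2+1)
        stacked = torch.relu(self.mix(stacked))              # pointwise mixing
        re, im = stacked.chunk(2, dim=1)
        freq = torch.complex(re, im)
        return torch.fft.irfftn(freq, s=(d, h, w), dim=(-3, -2, -1))

out = SpectralBlock3d(8)(torch.randn(1, 8, 16, 32, 32))       # -> (1, 8, 16, 32, 32)
```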

Geometric Learning-Based Transformer Network for Estimation of Segmentation Errors

  • paper_url: http://arxiv.org/abs/2308.05068
  • repo_url: None
  • paper_authors: Sneha Sree C, Mohammad Al Fahim, Keerthi Ram, Mohanasankar Sivaprakasam
  • for: 这项研究旨在减轻并加速医疗机构中专家在图像分割上的工作量,并为临床医生提供评估和修正分割错误的工具。
  • methods: 我们提出了一种在三维体数据分割图中识别并量化错误区域的方法。该方法基于Transformer的图神经网络,可在由分割图生成的三维网格的任意节点上估计并分类分割错误。
  • results: 我们在人体内耳骨迷路结构的高分辨率显微CT数据集上,通过模拟带有错误的三维分割图对方法进行了评估。我们的方法能够识别并量化这些分割图中的错误区域,并在精度上优于其他图神经网络。
    Abstract Many segmentation networks have been proposed for 3D volumetric segmentation of tumors and organs at risk. Hospitals and clinical institutions seek to accelerate and minimize the efforts of specialists in image segmentation. Still, in case of errors generated by these networks, clinicians would have to manually edit the generated segmentation maps. Given a 3D volume and its putative segmentation map, we propose an approach to identify and measure erroneous regions in the segmentation map. Our method can estimate error at any point or node in a 3D mesh generated from a possibly erroneous volumetric segmentation map, serving as a Quality Assurance tool. We propose a graph neural network-based transformer based on the Nodeformer architecture to measure and classify the segmentation errors at any point. We have evaluated our network on a high-resolution micro-CT dataset of the human inner-ear bony labyrinth structure by simulating erroneous 3D segmentation maps. Our network incorporates a convolutional encoder to compute node-centric features from the input micro-CT data, the Nodeformer to learn the latent graph embeddings, and a Multi-Layer Perceptron (MLP) to compute and classify the node-wise errors. Our network achieves a mean absolute error of ~0.042 and an accuracy of 79.53% in estimating and classifying the node-wise errors, respectively, outperforming other Graph Neural Networks (GNNs). We also put forth vertex-normal prediction as a custom pretext task for pre-training the CNN encoder to improve the network's overall performance. Qualitative analysis shows the efficiency of our network in correctly classifying errors and reducing misclassifications.
    摘要 目前已提出许多用于肿瘤和危及器官三维体分割的分割网络。医院和临床机构希望加速并减少专家在图像分割上的工作量。然而,当这些网络产生错误时,临床医生仍需手动修改生成的分割图。给定一个三维体数据及其可能存在错误的分割图,我们提出了一种识别并度量分割图中错误区域的方法。我们的方法可以在由(可能有误的)体分割图生成的三维网格的任意点或节点上估计误差,从而作为质量保证工具。我们提出了一种基于Nodeformer架构的图神经网络Transformer,用于度量并分类任意点上的分割误差。我们通过模拟带有错误的三维分割图,在人体内耳骨迷路结构的高分辨率显微CT数据集上评估了该网络。我们的网络包含一个卷积编码器,用于从输入显微CT数据中计算以节点为中心的特征;Nodeformer用于学习潜在的图嵌入;以及一个多层感知机(MLP)用于计算并分类节点级误差。在估计和分类节点级误差方面,我们的网络相比其他图神经网络(GNN)取得了约0.042的平均绝对误差和79.53%的准确率。我们还提出将顶点法向量预测作为自定义前置任务,对CNN编码器进行预训练,以提升网络的整体性能。定性分析表明我们的网络能够正确地分类错误并减少误分类。
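
The vertex-normal pretext target mentioned in the abstract can be computed directly from the mesh, as in the sketch below: per-vertex normals are obtained by area-weighted averaging of incident face normals, giving the encoder a self-supervised regression target. The mesh layout assumed here (vertex array plus triangle index array) is an illustrative choice, not the paper's data format.

```python
import numpy as np

def vertex_normals(vertices, faces):
    """vertices: (V, 3) float array, faces: (F, 3) int array of vertex indices."""
    normals = np.zeros_like(vertices)
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    face_n = np.cross(v1 - v0, v2 - v0)        # area-weighted face normals
    for i in range(3):                         # scatter-add onto each face's vertices
        np.add.at(normals, faces[:, i], face_n)
    lengths = np.linalg.norm(normals, axis=1, keepdims=True)
    return normals / np.clip(lengths, 1e-9, None)

# Tetrahedron as a toy mesh: four vertices, four triangular faces.
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
tris = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])
targets = vertex_normals(verts, tris)          # (4, 3) unit normals
```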

A Novel Method for improving accuracy in neural network by reinstating traditional back propagation technique

  • paper_url: http://arxiv.org/abs/2308.05059
  • repo_url: None
  • paper_authors: Gokulprasath R
  • for: 这项研究旨在提出一种快速有效地训练深度神经网络的方法,以解决传统训练方法面临的计算开销和梯度消失问题。
  • methods: 该方法采用一种即时参数更新技术,无需在每一层计算梯度,从而加快学习速度并避免梯度消失问题。
  • results: 对于标准数据集,该方法在比较之下超过了当前最佳方法的性能,并且显示了更快的学习速度和更好的稳定性。
    Abstract Deep learning has revolutionized industries like computer vision, natural language processing, and speech recognition. However, back propagation, the main method for training deep neural networks, faces challenges like computational overhead and vanishing gradients. In this paper, we propose a novel instant parameter update methodology that eliminates the need for computing gradients at each layer. Our approach accelerates learning, avoids the vanishing gradient problem, and outperforms state-of-the-art methods on benchmark data sets. This research presents a promising direction for efficient and effective deep neural network training.
    摘要 深度学习已经为计算机视觉、自然语言处理和语音识别等行业带来了变革。然而,训练深度神经网络的主要方法(反向传播)面临着计算开销大和梯度消失等挑战。在本文中,我们提出了一种新颖的即时参数更新方法,无需在每一层计算梯度。我们的方法加快了学习速度,避免了梯度消失问题,并在基准数据集上超越了当前最先进的方法。这项研究为高效且有效的深度神经网络训练提供了一个有前景的方向。

PAT: Position-Aware Transformer for Dense Multi-Label Action Detection

  • paper_url: http://arxiv.org/abs/2308.05051
  • repo_url: None
  • paper_authors: Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton
  • for: 本研究旨在利用多尺度时间特征增强时间自注意力机制的表达能力,从而提高视频中复杂的时间共现动作依赖关系的检测精度。
  • methods: 我们提出了一种基于 transformer 的网络 PAT,在自注意力机制中嵌入相对位置编码,并通过一种新颖的非层次网络利用多尺度时间关系,以增强时间自注意力机制的表达能力。
  • results: 我们在两个具有挑战性的密集多标签数据集上进行了评估,结果表明 PAT 将当前最优结果分别提升了 1.1%(Charades)和 0.6%(MultiTHUMOS)的 mAP,在两个数据集上分别取得了 26.5% 和 44.6% 的新的最优 mAP。
    Abstract We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses the temporal positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-attention mechanism and (ii) exploit multi-scale temporal relationships by designing a novel non hierarchical network, in contrast to the recent transformer-based approaches that use a hierarchical structure. We argue that joining the self-attention mechanism with multiple sub-sampling processes in the hierarchical approaches results in increased loss of positional information. We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets, and show that PAT improves the current state-of-the-art result by 1.1% and 0.6% mAP on the Charades and MultiTHUMOS datasets, respectively, thereby achieving the new state-of-the-art mAP at 26.5% and 44.6%, respectively. We also perform extensive ablation studies to examine the impact of the different components of our proposed network.
    摘要 我们提出了PAT,一种基于Transformer的网络,通过利用多尺度时间特征来学习视频中复杂的时间共现动作依赖关系。在现有方法中,Transformer的自注意力机制会丢失时间位置信息,而这对鲁棒的动作检测至关重要。为解决这一问题,我们(i)在自注意力机制中嵌入相对位置编码,(ii)设计了一种新颖的非层次网络来利用多尺度时间关系,这与近来采用层次结构的基于Transformer的方法不同。我们认为,在层次方法中将自注意力机制与多个子采样过程结合,会导致更多的位置信息损失。我们在两个具有挑战性的密集多标签基准数据集上评估了所提方法的性能,结果表明PAT在Charades和MultiTHUMOS数据集上分别将当前最优结果提高了1.1%和0.6%的mAP,从而分别取得26.5%和44.6%的新的最优mAP。我们还进行了大量消融研究,以考察所提网络中不同组件的影响。
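
The relative positional encoding idea, adding a learned bias indexed by the temporal offset between frames to the attention logits, can be sketched as follows. Head count and the bias parameterization are illustrative assumptions; this is not the PAT implementation.

```python
import torch
import torch.nn as nn

class RelPosSelfAttention(nn.Module):
    def __init__(self, dim, num_frames, heads=4):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # one learnable bias per (head, relative temporal offset) pair
        self.rel_bias = nn.Parameter(torch.zeros(heads, 2 * num_frames - 1))
        idx = torch.arange(num_frames)
        self.register_buffer("rel_idx", idx[None, :] - idx[:, None] + num_frames - 1)

    def forward(self, x):                     # x: (B, T, dim)
        B, T, D = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, self.heads, D // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each (B, heads, T, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale      # (B, heads, T, T)
        attn = attn + self.rel_bias[:, self.rel_idx]       # add relative position bias
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, D)
        return self.proj(out)

y = RelPosSelfAttention(dim=64, num_frames=16)(torch.randn(2, 16, 64))
```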